Introducing 🛹 RollingDepth 🛹 — a universal monocular depth estimator for arbitrarily long videos! Our paper, “Video Depth without Video Models,” delivers exactly that, setting new standards in temporal consistency. Check out more details in the thread 🧵

Anton Obukhov

2,696 subscribers

49,854 views • 1 year ago • via X (Twitter)

Science, Technology & Education

Related videos

Depth Any Video with Scalable Synthetic Data. AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago

🤔Applying Depth Estimation models directly to videos can result in inconsistency between frames. 💪Well, not anymore. 🔥ChronoDepth is a new approach to video depth estimation that focuses on achieving both accuracy within each frame and consistency across frames. 📜More👇

Gradio

19,986 views • 2 years ago

xAI absolutely cooked with the new Grok Imagine model! The temporal consistency is rock-solid. The penguin's anatomy, lighting, and proportions stay locked in for the full 10 seconds with no weird morphing or melting. In less than 30 seconds, Grok created a video with snow spray building realistically, motion blur, and depth of field. The physics simulation rivals other top models.

tetsuo

77,537 views • 5 months ago

🤯 Depth videos perfectly solve false flags on reference videos — and can perfectly recreate both martial arts and dance! Turned a Guan Dao martial arts clip into a Depth motion reference, swapped the fighter for a market auntie, and moved the scene to a T-junction inside a local market 🤣 🌟 Workflow: 1. Pick a reference clip under 15 seconds 2. Convert it to Depth with Depth Anything V2 3. Generate a new character + scene 4. Feed everything into Seedance with the prompt below 🌟 Depth video conversion: You can build a local Depth converter with Codex — prompt in the comments. This goes way beyond trending dances. Martial arts, polearms, and other complex movements can all be transferred cleanly with Depth! Workflow + Prompt below 👇

Larus Canus

47,012 views • 5 days ago

For the facial animations in my game, I use a technique that yields results similar to L.A. Noire. To keep it simple: I first generate a facial animation video using AI like LTX 2. Then, I create a video depth map from that animation and project it onto a face mask. Using vertex displacement, the depth map dynamically deforms the face as the character speaks. The result looks good in standalone VR. I actually created this workflow a long time ago:

Alex

48,283 views • 4 months ago
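
The vertex-displacement step in Alex's post above can be illustrated with a minimal sketch. It is not the author's implementation: the mesh layout, the nearest-neighbour depth lookup, and the names `sample_depth`, `displace`, and `strength` are all assumptions chosen for illustration. Each vertex is pushed along its normal by the depth value sampled at its UV coordinate.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
UV = Tuple[float, float]


def sample_depth(depth: Sequence[Sequence[float]], u: float, v: float) -> float:
    """Nearest-neighbour lookup of a depth map at UV coordinates in [0, 1]."""
    rows, cols = len(depth), len(depth[0])
    col = min(cols - 1, max(0, int(u * cols)))
    row = min(rows - 1, max(0, int(v * rows)))
    return depth[row][col]


def displace(
    vertices: Sequence[Vec3],
    normals: Sequence[Vec3],
    uvs: Sequence[UV],
    depth: Sequence[Sequence[float]],
    strength: float,
) -> List[Vec3]:
    """Move every vertex along its normal by strength * sampled depth."""
    displaced = []
    for (px, py, pz), (nx, ny, nz), (u, v) in zip(vertices, normals, uvs):
        offset = strength * sample_depth(depth, u, v)
        displaced.append((px + nx * offset, py + ny * offset, pz + nz * offset))
    return displaced


# Hypothetical two-vertex "mask" facing +Z and a 1x2 depth frame.
depth_frame = [[0.0, 1.0]]
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
uvs = [(0.25, 0.5), (0.75, 0.5)]
displaced = displace(vertices, normals, uvs, depth_frame, strength=0.1)
```

In a game engine this would normally run per frame in a vertex shader, with the depth video bound as a texture; the Python version only shows the arithmetic.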

🚀New paper out - We present Video-MSG (Multimodal Sketch Guidance), a novel planning-based training-free guidance method for T2V models, improving control of spatial layout and object trajectories. 🔧 Key idea: • Generate a Video Sketch — a spatio-temporal plan with background, foreground, and motion in the pixel space. • Encode this structure directly into the latent space of the diffusion model during generation, which does not require fine-tuning or additional memory during inference. 🧵

Jialu Li

35,060 views • 1 year ago

Return to Disney's The Lion King Realm in our next free update - filled with 2 new friends and even more surprises. Check back tomorrow for another Developer Update Video for all the details… and maybe a launch date 👀

Disney Dreamlight Valley

178,337 views • 1 year ago

🤯 I didn’t expect Depth to track rooftop parkour this cleanly! Used a fast climbing clip as the motion reference, then swapped in a student with twin tails and a backpack on a coastal school rooftop. The running path, wall contact, jumps, landings, and full-body continuity all hold together surprisingly well. 🌟 Workflow: 1. Pick a reference clip under 15 seconds 2. Convert it to Depth with Depth Anything V2 3. Generate a new character + scene 4. Feed everything into Seedance with the prompt below 🌟 Depth video conversion: You can build a local Depth converter with Codex — prompt in the comments. Depth opens up a lot more possibilities for parkour, climbing, martial arts, dance, and other complex full-body motion. Workflow + Prompt below 👇

Larus Canus

11,118 views • 2 days ago

🔬 Built with an entirely new model architecture, our diffusion-based approach uses 6B+ parameters and leverages the latest NVIDIA hardware. This is the most dynamic and wide-ranging video enhancing method we’ve ever created, setting a new standard for AI video restoration. Videos degrade due to compression artifacts, blurring, aliasing, noise, atmospheric distortion, missing pixels, etc. Each frame suffers from unique types of corruption, making AI video restoration a highly challenging task. Our technology solves this complexity by analyzing hundreds of frames to accurately restore details, delivering unmatched detail recovery combined with unparalleled temporal consistency.

Topaz Labs

23,199 views • 1 year ago

A sneak peek of a new feature in the #GeoAI Python package! 🎉 Now you can detect cars from georeferenced aerial imagery using deep learning—all with just a few lines of code. Stay tuned for an in-depth video tutorial coming soon! 🛠️ Explore the GitHub repository: 📚 Dive into the documentation: 📺 Check out the entire YouTube playlist:

Qiusheng Wu

45,651 views • 1 year ago

🔥 Win a share of $50,000 in MORPHO rewards 🔥 We’re celebrating the launch of Stablecoin Earn in collaboration with Morpho Labs. To enter: 🔹 Like & RT 🔹👇 Deposit USDT, USDC, or USDA into Morpho vaults via Stablecoin Earn Check out the thread for more details 🧵👇

Trust Wallet

72,985 views • 1 year ago

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Contributions: • We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs. • We propose a sliding iterative denoising mechanism that enhances both the spatial and temporal consistency of generated long-term videos while maintaining efficient inference. • We design a human pose conditioning scheme to enhance the appearance quality and motion accuracy of generated human videos. • We plan to release our processed version of the DNA-Rendering dataset, which we believe will benefit future research in this area.

MrNeRF

24,729 views • 1 year ago

#ThrowbackThursday In their review, Jesse Cox went all out for Control, a game that he labels as his Game of the Year. His 35-minute video goes IN DEPTH and is packed with nerdy references and features the Director herself. 📽️ Watch at

The Sudden Stop

24,811 views • 11 months ago

Some updates on the multiview vistadream pipeline with Rerun! Rerun came in extremely useful here, as being able to visualize depths at each stage of the pipeline allowed me to debug some nasty bugs. Last time, I was only working with a single image input. I've added in VGGT as my multiview pose + depth estimator. It works REALLY well for getting camera poses, but the depths are not that great. To try and fix that, I estimated depth maps from MoGeV2 for each of the views, and scale+shift aligned them so that they would match up to the confident sections of VGGT's depth predictions. You can see in the video just how much sharper the visualized 2D depth maps are! The biggest issue continues to be the multiview consistency 🫠 That's up next, along with actually training the Gaussian splat. Lots of work went into actually understanding inputs+outputs for VGGT. I had some funky bugs where the confidence values would all collapse to true. I'm also really excited for this pipeline to use Difix3D+ Nvidia instead of Flux Inpainting; it seems better suited to a multiview pipeline.

Pablo Vela

29,904 views • 11 months ago
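
Pablo Vela's post above mentions scale+shift aligning one depth map to the confident regions of another. A minimal sketch of that idea follows, using a closed-form least-squares fit restricted to the confident pixels. The function names, the flat-list representation, and the toy values are assumptions for illustration. The post does not say whether the fit was done on depth or inverse depth, or which solver was used, so this should not be read as the author's code.

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(
    source: Sequence[float],
    target: Sequence[float],
    confident: Sequence[bool],
) -> Tuple[float, float]:
    """Least-squares (scale, shift) so that scale * source + shift ~ target,
    fitted only on pixels flagged as confident."""
    pairs = [(x, y) for x, y, keep in zip(source, target, confident) if keep]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two confident pixels")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0.0:
        raise ValueError("source depth is constant over the confident pixels")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


def align(source: Sequence[float], scale: float, shift: float) -> List[float]:
    """Apply the fitted transform to every pixel, confident or not."""
    return [scale * x + shift for x in source]


# Toy flattened depth maps: the reference equals 2 * source + 0.5,
# except for one unreliable pixel that the confidence mask excludes.
source = [1.0, 2.0, 3.0, 4.0]
reference = [2.5, 4.5, 6.5, 100.0]
confident = [True, True, True, False]

scale, shift = fit_scale_shift(source, reference, confident)
aligned = align(source, scale, shift)
```

Fitting only on confident pixels keeps outliers in the reference from skewing the transform, while the sharper source map still supplies the detail everywhere else.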

#Kwiky is a NEW app, for hot content with No Restrictions! ✨ #BabbieBia Naughty, short-form content that just works! Check out the link in our profile, for more!

kwikyofficial

62,299 views • 1 year ago

Nano is a depth-aware atmospheric haze plugin that uses ML depth estimation to add physically accurate fog and light scattering to your footage. Works *best* on log footage with visible light sources - it analyzes scene highlights then creates airlight (atmospheric scatter) and halation (light bloom) that responds to actual depth in the scene. Pretty clever approach to getting that cinematic haze look without having to pump a fog machine on set. Makes the OG Trapcode Shine look extremely dated (basically 2D light streaks masked by luminance values), and is yet way more controllable than the current crop of generative AI video-to-video tools.

Bilawal Sidhu

275,292 views • 1 year ago

Bring your videos to life with Akool’s Effect Enhancer. ✨ Turn ordinary footage into dynamic, eye-catching content in seconds. With smarter visual effects and enhanced motion details, you can instantly elevate your videos without complex editing. Whether you're creating marketing content, storytelling videos, or social media clips, Effect Enhancer helps you add more depth, energy, and polish to every frame. Create smoother visuals. Add stronger impact. Make every video stand out with Akool.

Akool Inc

22,040 views • 3 months ago

These new AI models are so good! Dreamina AI dropped Image 3.0 and Video 3.0 and now you'll have: · no more morphing · control over the camera · everything's into place · smart image reference Prompts, comparisons and a quick how-to guide in this thread 🧵👇

TechHalla

56,674 views • 1 year ago

Can MLLMs actually track what's happening in a video? Introducing VSTAT 🎯, our new benchmark for visual state tracking. The tasks are simple: count cups, read typed words, count page flips. Humans solve them easily. MLLMs don't. 🧵 [1/11]

Sihyun Yu

165,974 views • 1 month ago

Exciting Announcement! 📢 #PonderONE x #RealityMetaverse🤝 We’re excited to announce our strategic partnership with Reality Metaverse, together we’ll be participating in AMAs, Cross-community Airdrops & more in depth integrations later on in our roadmap. Reality Metaverse revolutionizes gaming by allowing #NFT owners to monetize their assets in traditional games for the first time. By seamlessly connecting the real world with the Metaverse, it unlocks new opportunities in popular games like Landlord GO, setting a new standard for integrating digital assets with gaming experiences. We're delighted to have them in the #PonderEcosystem!

Ponder.One

34,021 views • 1 year ago