Learning Temporally Consistent Video Depth from Video Diffusion Priors

This work addresses the challenge of video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of directly

AK

499,505 subscribers

81,634 views • 2 years ago •via X (Twitter)

Education Science & Technology News & Politics

9 Comments

AK2 years ago

developing a depth estimator from scratch, we reformulate the prediction task into a conditional generation problem. This allows us to leverage the prior knowledge embedded in existing video generation models, thereby reducing learning difficulty and enhancing

AK2 years ago

generalizability. Concretely, we study how to tame the public Stable Video Diffusion (SVD) to predict reliable depth from input videos using a mixture of image depth and video depth datasets. We empirically confirm that a procedural training strategy - first optimizing the

AK2 years ago

spatial layers of SVD and then optimizing the temporal layers while keeping the spatial layers frozen - yields the best results in terms of both spatial accuracy and temporal consistency. We further examine the sliding window strategy for inference on arbitrarily long

AK2 years ago

videos. Our observations indicate a trade-off between efficiency and performance, with a one-frame overlap already producing favorable results. Extensive experimental results demonstrate the superiority of our approach, termed ChronoDepth, over existing

AK2 years ago

alternatives, particularly in terms of the temporal consistency of the estimated depth. Additionally, we highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis.
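
The sliding-window inference mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sliding_windows` and `stitch`, the window length, and the rule of keeping the earlier window's prediction for overlapping frames are all assumptions made for illustration. The sketch only shows how overlapping clip indices could be laid out over a long video and how per-clip outputs could be merged back into one sequence.

```python
from typing import List, Sequence, Tuple


def sliding_windows(num_frames: int, window: int, overlap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges covering the whole video.

    Hypothetical helper: consecutive windows share `overlap` frames.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return windows


def stitch(per_window_outputs: Sequence[Sequence], overlap: int = 1) -> list:
    """Merge per-window outputs, keeping the earlier window's result
    for frames that appear in two windows (an assumed merge rule)."""
    merged = []
    for i, out in enumerate(per_window_outputs):
        merged.extend(out if i == 0 else out[overlap:])
    return merged


if __name__ == "__main__":
    wins = sliding_windows(num_frames=10, window=4, overlap=1)
    print(wins)
    # Stand-in for a depth model: each "prediction" is just the frame index.
    outputs = [list(range(s, e)) for s, e in wins]
    print(stitch(outputs, overlap=1))
```

A larger overlap means a smaller stride and therefore more windows to process for the same video, which is one way to read the efficiency versus performance trade-off described above.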

AK2 years ago

paper page:

AK2 years ago

daily papers:

Katy2 years ago

@NandoDF Amazing progress 🚀🔥

Abhinav Girdhar2 years ago

Great insights on the new features and security measures! It's good to see continuous improvements in user experience and safety. Looking forward to more updates.

Related Videos

"DVD: Dynamic Video Depth" TL;DR: Recovers temporally consistent depth from monocular videos using diffusion priors + geometric constraints, handling dynamic scenes and motion robustly.

Alexandre Morgand

11,823 views • 4 months ago

GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering
Contributions:
• We reformulate video stabilization as a novel 3D-grounded scheme of local reconstruction and rendering. This approach is naturally robust to diverse camera motions and scene dynamics, is temporally consistent, and is capable of full-frame stabilization.
• We propose a novel test-time optimization for each unstable video. It leverages multi-view dynamics-aware photometric supervision and cross-frame regularization to achieve temporally consistent reconstructions. To avoid frame cropping, we introduce a scene extrapolation module based on video completion.
• We provide a 3D-grounded dataset for our task by re-purposing an existing one, and introduce new metrics on sparse and dense reconstruction to evaluate 3D scene consistency. Extensive experiments (quantitative, qualitative, user study) versus image-based and gyro-based methods demonstrate the merits of our method.

MrNeRF

11,638 views • 1 year ago

📢 3D world models from video diffusion suffer from inconsistent frames -> blurry output. Our fix: instead of naïve 3D reconstruction, we non-rigidly align each frame into a globally-consistent 3DGS representation. ->sharp visuals on top of any VDM!

Matthias Niessner

40,007 views • 4 months ago

Highlight from "Martin Syndrome". With Kling AI’s control over action and expressions, every frame adds depth and emotion, ushering in a new era of video production.

Kling AI

1,836,349 views • 1 year ago

2. Breakthrough in Volumetric Video Previous attempts were inconsistent because each frame had to be a new mesh. But Gaussian Splats solve this. It's not only temporally consistent, 60 fps, smaller in filesize but also allows infinite frame retiming. Seeing it in VR at the booth was like an out of body experience. Huge implications for VR. See the 10-min demo: Try it in the browser:

Andrew Price

15,627 views • 11 months ago

A new set of preprocessor template workflows for core conditioning steps in ComfyUI: Depth, Lineart, Pose, Normals, and Frame Interpolation (great for smoothing 16fps outputs). Designed for consistency, reuse, and faster iteration across image & video workflows.

ComfyUI

30,652 views • 6 months ago

Most AI video work is just "generate and hope." This is different. Creator seungho__yeo ( IG ) rebuilt a live drum performance from the ground up. Using the drummer's actual motion data to redesign background mood, lighting, and spatial depth frame by frame. The pipeline: → Motion Tracking → Depth Mapping → AI Edit ( Seedance 2) → AI Relighting → Background Reconstruction → Cinematic Color Grading All built inside ComfyUI. This is what intentional AI production looks like. Not more generations, better direction. This is the level ComfyUI creators are working at.

ComfyUI

28,385 views • 1 month ago

Introducing MegaSaM! 🎥 Accurate, fast, & robust structure + camera estimation from casual monocular videos of dynamic scenes! MegaSaM outputs camera parameters and consistent video depth, scaling to long videos with unconstrained camera paths and complex scene dynamics!

Zhengqi Li

57,010 views • 1 year ago

I'm excited to share our new work Align3R that estimates camera poses and consistent depth maps from a monocular video of a dynamic scene. Project page: Code: Paper:

Yuan Liu

56,547 views • 1 year ago

Bruce called me out on my lack of depth. Lacking depth is acceptable for comic books and video games. But not for people or squats. This is why you need random Internet friends to bully you into being better. Challenge accepted. How'd I do?

Daniel Polehn

22,533 views • 6 months ago

Take a peek behind the curtain with this in-depth look into the making of some of the best VFX work from Season 1 of #ThePeripheral. All episodes now streaming on Prime Video.

The Peripheral

16,471 views • 3 years ago

Robot VR control. All the existing "VR" teleop systems I tried felt like shit so I put together my own. Instead of just piping a camera feed, I do a 3d reconstruction so you get depth perception and 6DoF head movement. This feels waay better than a flat image (no 3d, might as well skip the VR), or directly streaming per-eye video, which gets you depth but leaves your view locked in place (0DoF lmao). Not really sure why nobody else seems to do this.

FrostyFridge

28,121 views • 3 months ago

Wow, diffusion models (used in AI image generation) are also game engines - a type of world simulation. By predicting the next frame of the classic shooter DOOM, you get a playable game at 20 fps without any underlying real game engine. This video is from the diffusion model.

Ethan Mollick

1,768,873 views • 1 year ago

After a year of team work, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀 Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video. In pursuit of minimal modeling, DA3 reveals two key insights: 💎 A plain transformer (e.g., vanilla DINO) is enough. No specialized architecture. ✨ A single depth-ray representation is enough. No complex 3D tasks. Three series of models have been released: the main DA3 series, a monocular metric estimation series, and a monocular depth estimation series. The core team members, aside from me: Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen. 👇(1/n) #DepthAnything3

Bingyi Kang

515,036 views • 8 months ago

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model with Gradio demo local demo: This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity. To achieve this, we first develop a video diffusion model to encode temporal information. Second, to maintain the appearance coherence across frames, we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations, we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably, our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available.

AK

810,578 views • 2 years ago

CoDeF: Content Deformation Fields for Temporally Consistent Video Processing abs: paper page: We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog.

AK

153,241 views • 2 years ago

Slept on Runway feature that actually changes how you work specially with New Seedance 2. Extract any frame from a clip and extend it further. Everything stays consistent. No drift. No chaos. This is the kind of control AI video needed.

madpencil_

18,543 views • 3 months ago

VIDEO: Continuing my PLAGIARISM series. Here's a scene from the Hollywood movie #AKissBeforeDying. And a SUPERHIT Bollywood movie copied it frame-by-frame. Guess the name of that Bollywood movie. Hint - A SUPERSTAR was born after the success of this Bollywood film. Video Credit - Movie Clips

Navneet Mundhra

40,800 views • 8 months ago

In feature film VFX, our lighting teams spend a lot of time shaving off as much render time as possible. Even 1 minute saved per frame adds up. For a 5 second shot, 1 minute less per frame saves you 2 hours of rendering per shot. Our shots are usually a lot heavier than this and our teams usually get a lot more optimization for render times than 1 minute per frame. Some shots can be rendering for up to 30 hours per frame depending on complexity. Large studios have a plethora of optimization tools which shave off a lot of render time from our shots. Rendering technology is always improving, bringing render times down and increasing physical accuracy. Video is of Hyperion with interactive rendering with Pixar's RenderMan XPU renderer, which is up to 5x faster than their older RIS architecture.

Rassoul Edji

16,286 views • 1 year ago

For this video created from an image in Grok Imagine, I combined two techniques: - the image editor from Grok Imagine - starting from a frame of a video to create other videos.

Déborah

18,012 views • 6 months ago