📢 3D world models from video diffusion suffer from inconsistent frames -> blurry output. Our fix: instead of naïve 3D reconstruction, we non-rigidly align each frame into a globally-consistent 3DGS representation. -> sharp visuals on top of any VDM!

Matthias Niessner

47,462 subscribers

39,718 views • 2 months ago • via X (Twitter)

Related videos

Synthesizing worlds with video diffusion models is often inconsistent — moving the camera back and forth leads to different scenes. We propose 🌐𝗪𝗼𝗿𝗹𝗱𝗠𝗲𝗺, a memory-based approach that ensures consistent world simulation without relying on explicit 3D reconstruction.

Xingang Pan

19,413 views • 1 year ago

Can we use video diffusion to generate 3D scenes? 𝐖𝐨𝐫𝐥𝐝𝐄𝐱𝐩𝐥𝐨𝐫𝐞𝐫 (#SIGGRAPHAsia25) creates fully-navigable scenes via autoregressive video generation. Text input -> 3DGS scene output & interactive rendering! 🌍 📽️

Matthias Niessner

30,777 views • 8 months ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation. Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!): Project Page: Code and Models: Paper:

Sherwin Bahmani

66,417 views • 8 months ago

V3D: Video Diffusion Models are Effective 3D Generators. Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce a geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency.

AK

31,997 views • 2 years ago

GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering
Contributions:
• We reformulate video stabilization as a novel 3D-grounded scheme of local reconstruction and rendering. This approach is naturally robust to diverse camera motions and scene dynamics, is temporally consistent, and is capable of full-frame stabilization.
• We propose a novel test-time optimization for each unstable video. It leverages multi-view dynamics-aware photometric supervision and cross-frame regularization to achieve temporally consistent reconstructions. To avoid frame cropping, we introduce a scene extrapolation module based on video completion.
• We provide a 3D-grounded dataset for our task by re-purposing an existing one, and introduce new metrics on sparse and dense reconstruction to evaluate 3D scene consistency. Extensive experiments (quantitative, qualitative, user study) versus image-based and gyro-based methods demonstrate the merits of our method.

MrNeRF

11,638 views • 11 months ago

Today we announced RTFM (Real-Time Frame Model) — a world model that generates frames in real time from any camera viewpoint. Unlike standard video models, RTFM understands 3D geometry. You can literally move the camera through the generated world. 🎥👇

Keunhong Park

15,653 views • 8 months ago

📢WorldAgents: 3D worlds only from 2D image models - without any training! We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt. What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors - agents just need to find them! Great work by Ziya Erkoç Angela Dai

Matthias Niessner

18,854 views • 2 months ago

📽️➡️🏠 Can an image become a consistent 3D world? Video Diffusion Models look stunning, but 3D consistency is still a nightmare… #WorldStereo is different. We gave the AI a "3D brain" using Geometric Memories. The result? ✅ Zero flickering. ✅ Perfect 3D consistency.

Tengfei Wang

19,865 views • 3 months ago

Check out CAT3D! Image(s)-to-3D in 1 minute! Given any number of real or generated images, CAT3D uses a multi-view diffusion prior to create consistent novel views. These views are used to reconstruct a 3D scene using NeRF/3DGS.

Philipp Henzler

12,737 views • 2 years ago

📢📢GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction📢📢 Reconstructing high-fidelity 3D scenes from sparse RGB input is hard. It needs a strong 3D prior! We reformulate multi-view scene reconstruction as conditional 3D generation over overlapping spatial chunks, lifting posed image features into a generative shape prior via 3D conditioning. As an example prior, we build on Trellis2, and train it such that its reconstruction is pixel-aligned and matches from all views. GenRecon achieves unprecedented reconstruction quality from any sparse RGB input sequence, even from a phone capture. The reconstruction also includes PBR materials, which facilitate relighting and virtual object insertion. Amazing work by Katharina Schmid, Nicolas von Lützow, Jozef, Angela Dai

Matthias Niessner

17,088 views • 19 days ago

📢Introducing 360Anything, our method for lifting any perspective image or video to gravity-aligned 360° panoramas without using any camera or 3D information. This enables consistent novel view synthesis and 3D scene reconstruction. Project page: 🧵

Ziyi Wu

46,859 views • 4 months ago

Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation

AK

15,840 views • 1 year ago

A test of real-time transfer of images captured on Quest to a PC, where they are converted into 3D using Apple's SHARP. Currently, the final 3DGS output is displayed on the PC. The next step is to bring the result back into Quest for in-headset viewing. #3DGS #metaquest #AR

Takashi Yoshinaga

11,575 views • 1 month ago

VAST AI releases Triplane Meets Gaussian Splatting on Hugging Face. Fast and Generalizable Single-View 3D Reconstruction with Transformers. demo: TGS enables fast reconstruction from a single-view image. It builds the 3D representation upon a hybrid Triplane-Gaussian representation by evaluating a transformer-based framework, from which 3D Gaussians would be decoded

AK

113,220 views • 2 years ago

Learning Temporally Consistent Video Depth from Video Diffusion Priors. This work addresses the challenge of video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of directly

AK

81,634 views • 2 years ago

𝗜𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗶𝗻𝗴 𝗨𝗩𝗚𝗦 We introduce 𝗨𝗩𝗚𝗦, a new 2D representation of 3D Gaussian Splatting (3DGS) that leverages spherical mapping. Website: Paper:

Nikolaos Sarafianos

10,707 views • 1 year ago

📢 QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization Yueh-Cheng Liu learns 2DGS initialization, densification, and optimization priors from ScanNet++ => fast & accurate reconstruction! Project:

Angela Dai

27,292 views • 1 year ago

📢📢 𝐏𝐞𝐫𝐜𝐇𝐞𝐚𝐝: 𝐏𝐞𝐫𝐜𝐞𝐩𝐭𝐮𝐚𝐥 𝐇𝐞𝐚𝐝 𝐌𝐨𝐝𝐞𝐥 𝐟𝐨𝐫 𝐒𝐢𝐧𝐠𝐥𝐞-𝐈𝐦𝐚𝐠𝐞 𝟑𝐃 𝐇𝐞𝐚𝐝 𝐑𝐞𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 & 𝐄𝐝𝐢𝐭𝐢𝐧𝐠📢📢 PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly-used methods such as LPIPS. Our trained reconstruction model is able to generate 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output. In addition, our architecture seamlessly adapts to downstream tasks: by swapping the encoder, we can transform the model into a disentangled 3D editing pipeline. In this scenario, we can control geometry through - potentially hand-drawn - segmentation maps, and condition style via image or text prompt. We also provide an interactive GUI to enable the exploration of our editing pipeline. 🌍 📽️ Great work by Antonio Oroz and Tobias Kirschstein

Matthias Niessner

18,808 views • 7 months ago

Instant Video-to-3D with DUST3R Dust3r generates a whole 3D scene from just a couple of images. What if it could: 1. Accept a VIDEO 2. Extract the video frames 3. Turn them into 3D? So added a Gradio Video Component. Here's me generating 3D from a video cc: NAVER LABS Europe

cocktail peanut

41,029 views • 2 years ago