Synthesizing worlds with video diffusion models is often inconsistent — moving the camera back and forth leads to different scenes. We propose 🌐𝗪𝗼𝗿𝗹𝗱𝗠𝗲𝗺, a memory-based approach that ensures consistent world simulation without relying on explicit 3D reconstruction.

Xingang Pan

3,257 subscribers

19,413 views • 1 year ago • via X (Twitter)

2 Comments

Xingang Pan • 1 year ago

𝗪𝗼𝗿𝗹𝗱𝗠𝗲𝗺 is mainly created by @zeqi_xiao Project page: ArXiv: Github: Demo:

AssemblyAI • 1 year ago

Announcing: Our most advanced speech-to-text model goes beyond accuracy to capture the real-world complexity of human conversation and deliver reliable, source-of-truth audio data. Explore Universal-2 updates 👇

Similar Videos

📢 3D world models from video diffusion suffer from inconsistent frames -> blurry output. Our fix: instead of naïve 3D reconstruction, we non-rigidly align each frame into a globally-consistent 3DGS representation. ->sharp visuals on top of any VDM!

Matthias Niessner

39,718 views • 2 months ago

Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
Contributions:
• We propose a novel approach to inject SfM priors into diffusion-based depth estimation, enabling highly accurate and multi-view consistent depth predictions for each viewpoint.
• Based on the proposed depth estimator, we design a new multi-view 3D geometry reconstruction framework and process some synthetic datasets to facilitate training.
• We evaluate our method on diverse real-world scene data, including objects, indoor environments, streetscapes, and aerial scenes, demonstrating the superior performance and generalization capability of our approach.

MrNeRF

25,651 views • 1 year ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,417 views • 8 months ago

📢WorldAgents: 3D worlds only from 2D image models - without any training! We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt. What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors - agents just need to find them! Great work by Ziya Erkoç Angela Dai

Matthias Niessner

18,854 views • 2 months ago

Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
TL;DR: Skyfall-GS converts satellite images to explorable 3D urban scenes using diffusion models, with real-time rendering performance.
Contributions:
• We introduce Skyfall-GS, the first method to synthesize immersive, real-time, free-flight navigable 3D urban scenes solely from multi-view satellite imagery using generative refinement.
• An open-domain refinement approach leverages pre-trained text-to-image diffusion models without domain-specific training.
• A curriculum-learning-based iterative refinement strategy progressively enhances reconstruction quality from higher to lower viewpoints, significantly improving visual fidelity in occluded areas.

MrNeRF

66,058 views • 7 months ago

GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering
Contributions:
• We reformulate video stabilization as a novel 3D-grounded scheme of local reconstruction and rendering. This approach is naturally robust to diverse camera motions and scene dynamics, is temporally consistent, and is capable of full-frame stabilization.
• We propose a novel test-time optimization for each unstable video. It leverages multi-view dynamics-aware photometric supervision and cross-frame regularization to achieve temporally consistent reconstructions. To avoid frame cropping, we introduce a scene extrapolation module based on video completion.
• We provide a 3D-grounded dataset for our task by re-purposing an existing one, and introduce new metrics on sparse and dense reconstruction to evaluate 3D scene consistency. Extensive experiments (quantitative, qualitative, user study) versus image-based and gyro-based methods demonstrate the merits of our method.

MrNeRF

11,638 views • 11 months ago

V3D: Video Diffusion Models are Effective 3D Generators
Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce a geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency.

AK

31,997 views • 2 years ago

WorldExplorer: Towards Generating Fully Navigable 3D Scenes
Contributions:
• We introduce the first method for generating 3D scenes from text that supports high-quality view synthesis while enabling exploration across a wide range of camera poses.
• We propose an iterative scene expansion strategy using video diffusion models, driven by trajectory sampling and adaptive collision detection.
• We design a scene memory mechanism that conditions each video generation step on relevant past frames, improving view consistency and overall scene coherence.

MrNeRF

23,814 views • 1 year ago

📽️➡️🏠 Can an image become a consistent 3D world? Video Diffusion Models look stunning, but 3D consistency is still a nightmare… #WorldStereo is different. We gave the AI a "3D brain" using Geometric Memories. The result? ✅ Zero flickering. ✅ Perfect 3D consistency.

Tengfei Wang

19,865 views • 3 months ago

MeshLRM: Large Reconstruction Model for High-Quality Mesh
We propose MeshLRM, a novel LRM-based approach that can reconstruct a high-quality mesh from merely four input images in less than one second. Different from previous large reconstruction models (LRMs) that focus on

AK

69,111 views • 2 years ago

HunyuanWorld-Voyager is here and fully open-source! The world’s first ultra-long-range world model with native 3D reconstruction, redefining AI-driven spatial intelligence for VR, gaming, and simulations.
✅ Direct 3D Output: Exports point cloud videos to 3D formats without tools like COLMAP, enabling instant 3D application use.
✅ Innovative 3D Memory: Introduces a scalable world caching mechanism, ensuring geometric consistency across any camera trajectory.
✅ Top-Ranked Performance: #1 on Stanford’s WorldScore, excelling in video generation and 3D reconstruction benchmarks.
Built on HunyuanWorld 1.0, Voyager blends video generation with 3D modeling, delivering camera-controlled, high-fidelity RGB-D sequences. Control scenes via keyboard or joystick for unmatched 3D consistency. Explore now: 🌐Project Page: 🔗GitHub: 🤗HuggingFace: 📝Technical Details:

Tencent Hy

198,207 views • 9 months ago

I finally released my new video on YouTube about Diffusion Models / Score-Based Generative Models. Literally planned this for a year and put so much work in. I think this approach to diffusion models is so intuitive and highly recommend giving that a go! Video is 38min long, so you will need some time to watch that haha.

dome | Outlier

54,540 views • 1 year ago

Geometric Context Transformer for Streaming 3D Reconstruction
Contributions:
• We introduce LingBot-Map, a streaming 3D foundation model built around Geometric Context Attention (GCA), which maintains three complementary context types – anchor, pose-reference window, and trajectory memory – for efficient and consistent long-sequence streaming inference.
• We propose an efficient training recipe based on progressive training and context parallelism with a relative loss formulation for stable long-sequence optimization.
• We demonstrate that LingBot-Map achieves state-of-the-art performance on multiple benchmarks (Oxford Spires, Tanks and Temples, ETH3D, and 7-Scenes), significantly outperforming existing streaming approaches in reconstruction quality and inference speed.

MrNeRF

24,549 views • 1 month ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

Why panorama? Standard video models struggle with object permanence—if a camera pans away and comes back, objects may disappear. With panoramas, the model is forced to generate everything in the scene. This serves as a "working memory" for consistent world generation. (3/N)

Ziyi Wu

21,992 views • 4 months ago

Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes
Contributions:
• We propose STORM, the first feed-forward, self-supervised method for fast and accurate reconstruction of dynamic 3D scenes from sparse, multi-timestep, posed camera images.
• Our bottom-up framework aggregates and transforms per-frame 3D Gaussian Splats into a cohesive scene representation, enabling self-supervised motion estimation. Furthermore, we introduce motion tokens that capture common motion primitives and regularize motion predictions, facilitating dynamic motion group segmentation without explicit motion or correspondence supervision.
• We present several enhancements for in-the-wild scenarios, including sky modeling, camera exposure inconsistency handling, large novel-view extrapolation, and fine-grained human motion reconstruction, making STORM well-suited for real-world applications.

MrNeRF

53,292 views • 1 year ago

Some scene reconstruction R&D at Simulon that produces more detailed meshes quickly and is able to exclude moving objects without explicit segmentation. This improves occlusion of real-world objects and provides better meshes for 3D creators. Coming to the beta soon.

Divesh Naidoo

50,439 views • 2 years ago

Make Pixels Dance: High-Dynamic Video Generation paper page: Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.

AK

101,655 views • 2 years ago

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior paper page: We present DreamCraft3D, a hierarchical 3D content generation method that produces high-fidelity and coherent 3D objects. We tackle the problem by leveraging a 2D reference image to guide the stages of geometry sculpting and texture boosting. A central focus of this work is to address the consistency issue that existing works encounter. To sculpt geometries that render coherently, we perform score distillation sampling via a view-dependent diffusion model. This 3D prior, alongside several training strategies, prioritizes the geometry consistency but compromises the texture fidelity. We further propose Bootstrapped Score Distillation to specifically boost the texture. We train a personalized diffusion model, Dreambooth, on the augmented renderings of the scene, imbuing it with 3D knowledge of the scene being optimized. The score distillation from this 3D-aware diffusion prior provides view-consistent guidance for the scene. Notably, through an alternating optimization of the diffusion prior and 3D scene representation, we achieve mutually reinforcing improvements: the optimized 3D scene aids in training the scene-specific diffusion model, which offers increasingly view-consistent guidance for 3D optimization. The optimization is thus bootstrapped and leads to substantial texture boosting. With tailored 3D priors throughout the hierarchical generation, DreamCraft3D generates coherent 3D objects with photorealistic renderings, advancing the state-of-the-art in 3D content generation.

AK

161,530 views • 2 years ago

Voyager Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation

AK

15,840 views • 1 year ago