Many 3D generators output Gaussian Splats (3DGS) for fast rendering, flexible deployment, and high visual fidelity. Static 3DGS aren't world models (they lack dynamics and semantics), but a true world model must allow distilling 3D-consistent representations (3DGS or meshes) for any given time step. This post-distillation serves a dual purpose: 1) validates physical consistency...

26,226 views • 3 months ago • via X (Twitter)
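
To make "a 3D-consistent representation for a given time step" concrete, here is a minimal Python/NumPy sketch of what one 3DGS snapshot contains under the standard Gaussian-splat parameterization: a mean, an anisotropic covariance built as Σ = R S Sᵀ Rᵀ from per-axis scales and a rotation quaternion, an opacity, and a color. The `snapshot_t` dictionary and its values are hypothetical, and real 3DGS stores view-dependent color as spherical-harmonic coefficients rather than the plain RGB used here.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion (w, x, y, z); the quaternion is normalized first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def gaussian_covariance(scale, quat):
    """Covariance Sigma = R S S^T R^T from per-axis scales S and rotation R."""
    M = quat_to_rotmat(quat) @ np.diag(scale)
    return M @ M.T

# Hypothetical snapshot: the Gaussians distilled for a single time step t.
snapshot_t = {
    "means": np.array([[0.0, 0.0, 0.0], [1.0, 0.5, -0.2]]),
    "scales": np.array([[0.1, 0.1, 0.02], [0.05, 0.2, 0.05]]),
    "quats": np.array([[1.0, 0.0, 0.0, 0.0], [0.924, 0.0, 0.383, 0.0]]),
    "opacities": np.array([0.9, 0.6]),
    "colors": np.array([[0.8, 0.2, 0.2], [0.1, 0.7, 0.3]]),  # plain RGB, simplified
}

# One 3x3 covariance per Gaussian; each is symmetric positive semi-definite.
covs = [gaussian_covariance(s, q)
        for s, q in zip(snapshot_t["scales"], snapshot_t["quats"])]
```

The scale-plus-rotation factorization keeps every covariance valid by construction, which is why it is the usual way to store Gaussians. Distilling one such snapshot per time step is what gives downstream checks explicit geometry to compare across time.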

Related videos

📢 Our lab has been exploring 3D world models for years, and we’re thrilled to share **PhysTwin**: a milestone that reconstructs object appearance, geometry, and dynamics from just a few seconds of interaction! Led by the amazing Hanxiao Jiang.

👉 PhysTwin combines **Gaussian splatting** with **inverse dynamics optimization** based on simple **spring-mass** systems.

⚙️ The result? Real-time, action-conditioned 3D video prediction under novel interactions (i.e., 3D world models).

🔑 A few key takeaways:
1. Having the right structure (e.g., particles/masses) helps navigate the trade-off between sample efficiency, generalization, and broad applicability.
2. Visual foundation models (VFMs) have matured to the point where they can provide rich supervision for world modeling (e.g., tracking, shape completion).
3. Beyond VFMs, many crucial components have come together in recent years: Gaussian splats for rendering, NVIDIA Warp for high-performance simulation, and scene/asset generation from a wide range of labs and companies. The future of 3D world models is looking bright! ✨
4. The resulting digital twin supports a wide range of downstream applications, especially in data generation and policy evaluation, thanks to its realistic rendering and simulation capabilities.

🎥 All code and data to reproduce the results, along with interactive demos, are available on the website. Check the following visualizations of: (1) observations, (2) reconstructed state/actions, (3) interactive digital twins, and (4) the overlays between real-world robot teleoperation and our model’s open-loop predictions.

Yunzhu Li

25,279 views • 1 year ago
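
The core idea of fitting a spring-mass model to observed motion can be illustrated with a toy example. The Python/NumPy sketch below simulates a short chain of particles joined by springs (semi-implicit Euler, one end pinned) and then recovers an unknown stiffness from an "observed" trajectory by grid search. Everything here is an assumption for illustration and not PhysTwin's implementation: `step`, `rollout`, and `fit_stiffness` are hypothetical helpers, a single shared stiffness stands in for richer per-spring parameters, and grid search stands in for a real inverse-dynamics optimizer.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])

def step(x, v, springs, rest, k, damping, mass, dt):
    """One semi-implicit Euler step of a toy spring-mass system (hypothetical helper).

    x, v: (N, 3) positions/velocities; springs: (E, 2) particle index pairs;
    rest: (E,) rest lengths; k: one stiffness shared by all springs (simplification).
    """
    f = np.tile(mass * GRAVITY, (x.shape[0], 1))  # gravity on every particle
    i, j = springs[:, 0], springs[:, 1]
    d = x[j] - x[i]
    length = np.linalg.norm(d, axis=1)
    # Hooke's law: a stretched spring pulls i toward j and j toward i
    fs = (k * (length - rest) / length)[:, None] * d
    np.add.at(f, i, fs)
    np.add.at(f, j, -fs)
    f -= damping * v  # simple linear velocity damping
    v = v + dt * f / mass  # semi-implicit Euler: velocity first,
    x = x + dt * v         # then position using the new velocity
    return x, v

def rollout(k, n=5, steps=300, dt=1e-3, mass=0.1, damping=0.05):
    """Simulate a horizontal chain pinned at particle 0; returns (steps, n, 3)."""
    x = np.stack([np.arange(n) * 0.1, np.zeros(n), np.zeros(n)], axis=1)
    v = np.zeros_like(x)
    springs = np.array([[a, a + 1] for a in range(n - 1)])
    rest = np.full(n - 1, 0.1)
    anchor = x[0].copy()
    traj = []
    for _ in range(steps):
        x, v = step(x, v, springs, rest, k, damping, mass, dt)
        x[0], v[0] = anchor, 0.0  # particle 0 stays pinned
        traj.append(x.copy())
    return np.array(traj)

def fit_stiffness(observed, candidates):
    """Toy inverse dynamics: pick the stiffness whose rollout best matches observed."""
    losses = [np.mean((rollout(k) - observed) ** 2) for k in candidates]
    return candidates[int(np.argmin(losses))]

observed = rollout(k=100.0)  # synthetic stand-in for tracked real-world motion
k_hat = fit_stiffness(observed, np.linspace(10.0, 200.0, 39))
```

Here `observed` is synthetic, so the true stiffness lies on the search grid and is recovered exactly. In the setting the post describes, the supervision would instead come from tracking real video with visual foundation models, and a gradient-based or sampling-based optimizer would replace the grid search.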

Wonderland: Navigating 3D Scenes from a Single Image

Contributions:
• First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis.
• Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control.
• Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks (and bridging image space and 3D space) through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework under the same memory constraints.

MrNeRF

52,801 views • 1 year ago
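
The 256× figure counts spatio-temporal positions. Assuming (this is not stated in the post) a video VAE that compresses 4× in time and 8× along each spatial axis, the arithmetic works out as sketched below; the helper names and the 48-frame 480×720 clip are hypothetical.

```python
# Assumed compression factors (not given in the post): 4x temporal, 8x per spatial axis.
T_FACTOR, S_FACTOR = 4, 8

def nominal_reduction(t_factor=T_FACTOR, s_factor=S_FACTOR):
    """Pixel positions covered by one latent position: t * s * s."""
    return t_factor * s_factor * s_factor

def latent_grid(frames, height, width, t_factor=T_FACTOR, s_factor=S_FACTOR):
    """Latent (frames, height, width) for a clip whose sizes divide evenly."""
    return (frames // t_factor, height // s_factor, width // s_factor)

frames, height, width = 48, 480, 720  # hypothetical clip
lf, lh, lw = latent_grid(frames, height, width)
ratio = (frames * height * width) // (lf * lh * lw)
print(nominal_reduction(), (lf, lh, lw), ratio)  # 256 (12, 60, 90) 256
```

This counts positions, not stored values: each latent position typically carries more channels than an RGB pixel, so the reduction in raw numbers is smaller. Real video VAEs may also treat the first frame or clip boundaries specially, so exact latent sizes can differ slightly.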