Can we synthesize 3D human-scene interactions without learning from any 3D data? Yes! Check out Lei Li's GenZI, a novel zero-shot approach to generating 3D interactions by distilling priors from large vision-language models.

Angela Dai

9,433 subscribers

106,850 views • 2 years ago • via X (Twitter)

Science & Technology

10 Comments

Michael Black • 2 years ago

@craigleili Very creative! Love it.

Dan Casas • 2 years ago

@craigleili Great idea and super well presented. Love it!

ScottieFox • 2 years ago

@craigleili There must exist a vector for the opposite as well. Since the paper clearly shows an inpainting mask of human 2D interactions, then one could assume a "place this actor in a scene" - via the same text encoding.

Hongwei Yi • 2 years ago

@craigleili The idea and the results are super nice!!! Can't wait to use.

Thiemo Alldieck • 2 years ago

@craigleili creative idea!

Chenfanfu Jiang • 2 years ago

@craigleili Inspiring

Dávid Komorowicz • 2 years ago

@craigleili Oh no, don't sit on the Guzheng😰

Chris Han • 2 years ago

@craigleili @memdotai mem it

Leo • 2 years ago

@craigleili so cool

Naureen Mahmood • 2 years ago

@craigleili I really like the method presented here, not to mention the lovely video! Very nice work.

Related Videos

Excited to share HOI-PAGE, to appear at #ICML2026! 🚀 Lei Li generates 4D human-object interactions zero-shot from text. A part-affordance graph grounds interactions via LLM+video priors, enabling complex multi-person, multi-object interactions 👉

Angela Dai

10,437 views • 1 month ago

3D-VLA: A 3D Vision-Language-Action Generative World Model. Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from

AK

84,940 views • 2 years ago

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness. Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

📢WorldAgents: 3D worlds only from 2D image models - without any training! We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt. What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors - agents just need to find them! Great work by Ziya Erkoç Angela Dai

Matthias Niessner

18,886 views • 2 months ago

🤖Simulation-Ready Human–Scene Interaction Reconstruction🤖 #HSImul3R reconstructs sim-ready 3D human-scene interactions from casual videos with *physics-in-the-loop* that can be directly deployed to humanoids. - Project: - Code:

Ziwei Liu

15,326 views • 3 months ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,572 views • 2 years ago

Check out Christian Diller's CG-HOI :) We generate realistic 3D human-object interactions, from object geometry and text description. A key ingredient is explicit modeling of contact, during training and as guidance during inference.

Angela Dai

20,497 views • 2 years ago

DreamPhysics: Learning Physical Properties of Dynamic 3D Gaussians from Video Diffusion Priors Code:

MrNeRF

12,360 views • 2 years ago

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision paper page: We have witnessed significant progress in deep learning-based 3D vision, ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However, existing scene-level datasets for deep learning-based 3D vision, limited to either synthetic environments or a narrow selection of real-world scenes, are quite insufficient. This insufficiency not only hinders a comprehensive benchmark of existing methods but also caps what could be explored in deep learning-based 3D analysis. To address this critical gap, we present DL3DV-10K, a large-scale scene dataset, featuring 51.2 million frames from 10,510 videos captured from 65 types of point-of-interest (POI) locations, covering both bounded and unbounded scenes, with different levels of reflection, transparency, and lighting. We conducted a comprehensive benchmark of recent NVS methods on DL3DV-10K, which revealed valuable insights for future research in NVS. In addition, we have obtained encouraging results in a pilot study to learn generalizable NeRF from DL3DV-10K, which manifests the necessity of a large-scale scene-level dataset to forge a path toward a foundation model for learning 3D representation.

AK

49,886 views • 2 years ago

Primal’s meticulously crafted 3D models – derived from real scan data – amplify anatomy learning. Follow us to learn more!

Primal Pictures / Anatomy.tv

558,376 views • 5 months ago

How can we learn 3D visual grounding with natural supervision—only looking at QA pairs, without ground truth bounding boxes or object classification labels? We inject explicit language priors, e.g., the symmetric property that A near B ⇒ B near A, in structured vision models.

Joy Hsu

14,843 views • 2 years ago

📢 A Recipe for Generating 3D Worlds From a Single Image 📢 Our recipe explains how existing generative models can be adapted with minimal training effort to generate 3D worlds from a single input image.

Katja Schwarz

13,970 views • 1 year ago

Large-scale 3D Scene Generation (all scenes are real-time rendered)!! Physically-grounded generative data without hallucinations is the missing link for robot learning and testing at scale. We introduce a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal view synthesis and generation with object permanence and explicit 3D geometry. This also allows for extreme trajectory extrapolation without failure! We also show that we can build fully data-driven simulators for end-to-end learning with this approach. Project: with the amazing team of Julian Ost, Amogh Joshi, Andrea Ramazzina, Maximilian Bömer, Mario Bijelic.

Felix Heide

27,736 views • 9 months ago

Introducing “VOODOO 3D: VOlumetric pOrtrait Disentanglement fOr One-shot 3D head reenactment”. We present a real-time 3D aware one-shot head reenactment method that can generate consistent views from any angle MBZUAI ETH Zürich VinAI Pinscreen URL

Hao Li

16,030 views • 2 years ago

3D AI is leveling up! Rodin 3D AI can create stunning, high-quality 3D models from just text or image inputs. And with its latest update, it can even generate 8K HDRI textures to bring your models to life. Check out the link in the comments!

el.cine

46,032 views • 1 year ago

animators are not needed anymore this 3D AI motion capture plugin can convert character movement from real video to 3D data and.. you can apply the motion to any 3D character.. link in comments

el.cine

64,486 views • 6 months ago

Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery TL;DR: Skyfall-GS converts satellite images to explorable 3D urban scenes using diffusion models, with real-time rendering performance. Contributions: • We introduce Skyfall-GS, the first method to synthesize immersive, real-time, free-flight navigable 3D urban scenes solely from multi-view satellite imagery using generative refinement. • An open-domain refinement approach leverages pre-trained text-to-image diffusion models without domain-specific training. • A curriculum-learning-based iterative refinement strategy progressively enhances reconstruction quality from higher to lower viewpoints, significantly improving visual fidelity in occluded areas.

MrNeRF

66,111 views • 8 months ago

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors paper page: We present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D meshes generation from a single unposed image in the wild using both 2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference view supervision and novel views guided by a combination of 2D and 3D diffusion priors. We introduce a single trade-off parameter between the 2D and 3D priors to control exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, we employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images.

AK

305,643 views • 3 years ago