Accurate and controllable scene generation has been difficult with natural language alone. You instead need a language for scenes. Introducing the Scene Language — a visual representation for high-quality 3D/4D generation by integrating programs, words, and embeddings — 🧵(1/6)

Yunzhi Zhang

2,478 subscribers

63,813 views • 1 year ago • via X (Twitter)

Science & Technology

Comments: 9

Yunzhi Zhang • 1 year ago

The Scene Language uses programs, words, and neural embeddings to encode scene structures, semantics, and visual identities, respectively. It can be inferred using pre-trained LMs/VLMs to generate scenes from text and image prompts. (2/6)
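
A rough way to picture the three parts, as a minimal Python sketch. Everything here (`Entity`, `row_of`, the toy three-number embedding) is a hypothetical illustration, not the paper's actual syntax or API: the function plays the role of the program (structure), the string is the word (semantics), and the vector stands in for a neural embedding (visual identity).

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Entity:
    word: str               # semantic class, e.g. "pawn"
    embedding: List[float]  # placeholder for a learned visual-identity vector
    position: Vec3          # where the program placed this entity


def row_of(word: str, embedding: List[float], count: int, spacing: float) -> List[Entity]:
    """Program part: repetition and layout are expressed as ordinary code."""
    return [Entity(word, embedding, (i * spacing, 0.0, 0.0)) for i in range(count)]


# Eight pawns that share one word and one embedding but differ in position.
pawns = row_of("pawn", [0.1, 0.2, 0.3], count=8, spacing=1.0)
```

Changing `count` or `spacing` edits the structure without touching appearance, while swapping the embedding would change how every pawn looks without touching the layout.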

Yunzhi Zhang • 1 year ago

The representation also applies to 4D scenes—many dynamic effects are simple to write in programs! Some text-to-4D synthesis results here: (3/6)
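
To see why dynamics are easy to express in a program, here is a small Python sketch in which the scene is simply a function of time. The `carousel` function and its parameters are assumptions made up for illustration; they are not taken from the paper or its code.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def carousel(t: float, seats: int = 4, radius: float = 2.0, period: float = 8.0) -> List[Vec3]:
    """Return seat positions at time t; the whole ring rotates once per period."""
    phase = 2.0 * math.pi * t / period
    positions = []
    for k in range(seats):
        angle = phase + 2.0 * math.pi * k / seats  # seats are evenly spaced on the ring
        positions.append((radius * math.cos(angle), 0.0, radius * math.sin(angle)))
    return positions


# Sample the 4D scene at 16 time steps, half a time unit apart.
frames = [carousel(t=0.5 * i) for i in range(16)]
```

The motion is one extra argument and one line of arithmetic; the rest of the scene description stays the same.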

Yunzhi Zhang • 1 year ago

When given image prompts, the pipeline converts input images into 3D scenes while preserving the structure and content: (4/6)

Yunzhi Zhang • 1 year ago

The representation is not tied to one specific renderer; instead, it can be consumed by renderers ranging from end-to-end, neural generative models to traditional graphics engines. (5/6)
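
One way to read "not tied to one specific renderer" is as a shared interface with interchangeable backends. The Python sketch below is only an illustration under that assumption: `Renderer`, `MeshRenderer`, and `PromptRenderer` are hypothetical names, and the string outputs stand in for real meshes or generated images.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Entity:
    word: str               # semantic class
    embedding: List[float]  # placeholder for a visual-identity vector


class Renderer(Protocol):
    def render(self, scene: List[Entity]) -> str: ...


class MeshRenderer:
    """Stand-in for a traditional graphics engine that consumes the entities."""

    def render(self, scene: List[Entity]) -> str:
        return "mesh:" + ",".join(e.word for e in scene)


class PromptRenderer:
    """Stand-in for a neural generative model conditioned on the same scene."""

    def render(self, scene: List[Entity]) -> str:
        return "prompt: a scene with " + " and ".join(e.word for e in scene)


def draw(scene: List[Entity], renderer: Renderer) -> str:
    # The scene description is unchanged; only the backend differs.
    return renderer.render(scene)


scene = [Entity("table", [0.0]), Entity("lamp", [1.0])]
outputs = [draw(scene, r) for r in (MeshRenderer(), PromptRenderer())]
```

Both backends receive the same `scene` list, which is the point: the representation is decoupled from how it is turned into pixels.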

Yunzhi Zhang • 1 year ago

More results on project page: Paper: This is a very fun collaboration with the wonderful @zizhang_li, @Mattzh1314, @elliottszwu, and @jiajunwu_cs. (6/6)

Blockchainer K • 1 year ago

I love how Scene Language combines the power of programs, words, and embeddings to create stunning 3D/4D visuals. Can't wait to explore its possibilities for urban planning and smart cities!

Jos van der Westhuizen • 1 year ago

The results look crazy good! Can the semantic components be treated as separate 3D objects in something like Unity? Would love to get access to an API or something to integrate this into my tool.

Yunzhi Zhang • 1 year ago

Yes, semantic components are separate and can be individually imported as meshes. We'll release the code in November. Stay tuned! :)

IC4 • 1 year ago

That's why we have different courts for sailors.
