SceneScript treats 3D reconstruction as a language problem rather than a geometry one. The model watches a video of a room and just learns to write a script for it. It autoregressively spits out text commands like make_wall(...) or make_bbox(...) that define the scene. Stanford's new "Scene Language" paper...
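
To make the "scene as a script" idea concrete, here is a minimal sketch of how the emitted text might be read back into structured data. Only the command names make_wall and make_bbox come from the post. The call-style syntax, the parameter names (a_x, height, w, and so on) and the helpers parse_command and parse_script are assumptions made for illustration, not the model's actual output format.

```python
import re
from dataclasses import dataclass

# Assumed syntax: name(key=value, key=value, ...). The real format may differ.
COMMAND_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


@dataclass
class Command:
    name: str     # e.g. "make_wall"
    params: dict  # parameter name -> float value


def parse_command(line: str) -> Command:
    """Parse one text command into a Command record."""
    match = COMMAND_RE.match(line)
    if match is None:
        raise ValueError(f"not a command: {line!r}")
    name, arg_text = match.groups()
    params = {}
    # Split the argument list on commas and skip empty pieces.
    for pair in filter(None, (p.strip() for p in arg_text.split(","))):
        key, _, value = pair.partition("=")
        params[key.strip()] = float(value)
    return Command(name, params)


def parse_script(text: str) -> list:
    """Parse a multi-line script, ignoring blank lines."""
    return [parse_command(line) for line in text.splitlines() if line.strip()]


# Hypothetical script: one wall segment and one object bounding box.
script = """
make_wall(id=0, a_x=0.0, a_y=0.0, b_x=4.0, b_y=0.0, height=2.5)
make_bbox(id=1, x=1.0, y=0.5, z=0.4, w=1.2, d=0.6, h=0.8)
"""
scene = parse_script(script)
for command in scene:
    print(command.name, command.params)
```

Because the scene is plain text, it can be diffed, edited by hand, or handed to another program without any mesh or point-cloud tooling.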

Bilawal Sidhu

107,147 subscribers

107,060 views • 1 year ago • via X (Twitter)

Science & Technology Education Arts

11 Comments

Bilawal Sidhu • 1 year ago

Semantic 3D scene understanding is absolutely crucial for robotics and spatial computing devices like AR and VR headsets.

Bilawal Sidhu • 1 year ago

Paper/project here. Need to fill out a form to get access to model weights:

Nev (unsupervised) • 1 year ago

I was working in construction when the iPhone 12 Pro came out, and I used the LiDAR scanner for EVERYTHING. My boss thought it was sort of gimmicky at first, but I could tell he liked it after a couple of days of me finding apps that created detailed depth maps and showed inconsistencies in the dug paths where slate was to be laid down. This is almost exactly what I imagined the next evolution would be.

Andreas Klinger 🦾 • 1 year ago

this is really cool. an obvious thing that they don't mention is how this could also be used to create a simpler vocabulary for a scene: you could define an object in the room, give it a bounding box and a name like objectX, and then say "task: carry objectX to table3" or "event: table3 moved to coordinates xy"

Dan Brickley • 1 year ago

Can it work from a Gaussian Splat scene?

Max Vox (fka Duke Zero) • 1 year ago

fung shui module wen?

David Branca • 1 year ago

Very cool!

Andres Franco • 1 year ago

Really cool stuff. Can’t wait to see where this ends up going.

Gordon Olson • 1 year ago

Spatial understanding goes so far. This is what will truly open up the full potential of AI. The possibilities will transgress new frontiers. This is an exciting part of that. Very cool!

Related Videos

Introducing WildDet3D, a grounding model for monocular 3D object detection in the wild. A question I keep coming back to is: what is the right backbone for robotics foundation models? Should it be a video model, a language model, or perhaps a grounding model? WildDet3D is our first step in exploring that direction.

Jiafei Duan

12,137 views • 3 months ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,590 views • 10 months ago

Using the new WordPress Command Palette to call an assistant that adds LLM generated text to a 3D world using natural language commands! "add text: write a short poem about the metaverse" This extends to image, audio and 3D objects in the future. WebXR holodeck style editing!

XR Publisher

17,891 views • 2 years ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago

This is what happens when you force a language on someone. People make it a mission to not learn or speak that language. The imposition factor goes out of the window and hatred factor crawl in.

Piyush Rai

133,625 views • 11 months ago

Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields paper page: Editing a local region or a specific object in a 3D scene represented by a NeRF is challenging, mainly due to the implicit nature of the scene representation. Consistently blending a new realistic object into the scene adds an additional level of difficulty. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts or image patches, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt or image patch, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and seamlessly blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with much flexibility and diversity compared to the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.

AK

62,768 views • 3 years ago

1/N Most Vision-Language-Action models need tons of data for finetuning, and still fail for new objects and instructions. Introducing OTTER, a lightweight, easy-to-train model that uses text-aware visual features to nail unseen tasks out of the box! Here's how it works 👇

Fangchen Liu

68,366 views • 1 year ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

Turns out that vision-language models can control robots too. The secret is to just finetune them to print out the actions (literally, as text). Really excited about our new result, the successor to RT-1. RT-2 is a pre-trained VLM: Short 🧵👇

Sergey Levine

165,052 views • 3 years ago

Spatial reasoning is a major challenge for the foundation models today, even in simple tasks like arranging objects in 3D space. #CVPR2025 Introducing LayoutVLM, a differentiable optimization framework that uses VLM to spatially reason about diverse scene layouts from unlabeled assets and open-ended language instructions 1/n

Fan-Yun Sun

92,584 views • 1 year ago

AI + robotics research is starting to pick up steam. We can now instruct Spot in natural language using Language-guided Skill Coordination (LSC). A user provides a natural language instruction: "Bring me the chocolates box, cereal box, and pill bottle, and put them on the bedroom table" and the robot navigates to the location of the target objects and places them on the room table. Current versions of Spot rely on a fixed-set vocabulary that cannot generalize to diverse instructions. In this demo, the researchers presented a method that uses large language models to receive a free-form natural language instruction for object rearrangement, which it then executes. It combines:
- A voice-to-text model that processes the instructions into text
- An LLM that converts natural language instruction to call a library of skills
- A perception module that provides ground truth locations and visual object detection
And what you're left with is a glimpse into what our future will hold: robots executing tasks from human voice prompts.

AI Breakfast

33,725 views • 3 years ago

Saw a clip of #Pacers forward Obi Toppin doing ASL. Finally asked him about it. “I was supposed to take a language in HS. … I ain’t want to take Spanish or (another verbal language) — it was hard doing that. So, it was like, ‘Sign language might be the easiest one.’” 😂

James Boyd

206,685 views • 1 year ago

How the British tried to erase the Irish language. Years ago, I debated one of those finance types who saw language as useful only to the extent it facilitated international trade and science/technology. It is extremely dangerous to reduce a people's language to that. Language encapsulates a society's worldview, its way of thinking, and other intangible attributes.

Onye Nkuzi

95,753 views • 8 months ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago

Chatbots aren’t the revolution. They’re the distraction. Fei-Fei Li: “Language is a half-million-year-old luxury. Perception is a half-billion-year-old necessity.” Evolution didn’t optimize for conversation. It optimized for survival in three-dimensional space. Seeing threats, navigating obstacles, predicting what happens when you move. We’ve spent years celebrating AI that can write and summarize. But text processing is narrow. Spatial intelligence is fundamental. An agent that only reads prompts can’t function in a warehouse or a hospital. It needs to parse depth, understand physics, and act on what it sees in real time. We built AI that understands language. Now we’re building AI that understands space. Language models got the attention. Spatial intelligence gets the work done. The world runs on physics, not paragraphs. AI is learning to operate in it.

Dustin

51,772 views • 5 months ago

Today we're opening up a new experimental multiplayer worldbuilding tool called "Patchwork". It combines language models, image models, and a canvas-based interface to build out the foundations of stories. Check out the 🧵below for links and further documentation. Have fun!

Midjourney

254,252 views • 1 year ago

💫It's fascinating that a single feed-forward pass through an LLM can replace a complex rendering pipeline, like Blender! Just feed it 3D shapes, xyz positions, and poses as tokens, and it spits out the image token-by-token. The dual, aka scene reconstruction, is also possible! 👇

Georgia Gkioxari

44,616 views • 1 year ago

Can video generative models exhibit visuospatial intelligence? 🤔 Introducing Video4Spatial — a video-only framework that tackles spatial tasks. With just video context, our model can: 🔍 Ground objects by planning geometry-consistent paths 📸 Follow camera-pose instructions for scene navigation 🌐 Generalize to long contexts & unseen outdoor scenes A step toward video models as visual-spatial reasoners. Project: arXiv:

Xingang Pan

15,931 views • 7 months ago

training a model that takes a text prompt and generates audio that renders video on an oscilloscope AgenC agents live inside worlds the model generates the pipeline: real videos -> edge detection -> vectorization -> path ordering -> 192kHz 3-channel WAV where X/Y control beam position and Z controls beam intensity 3 values per timestep. that's all the model is learning. compare that to video gen models trying to predict millions of pixels per frame. transformers are already great at sequence prediction and that's literally all this is. waveform generation the output IS the playback. generate the audio, feed it to a scope, it draws the scene in real-time. there's no rendering step. it's analog so there's no pixel grid. you get continuous curves and effectively infinite resolution bootstrapped with procedural data, lissajous curves, wireframe 3D, stick figures, then scaled on real-world video converted to trace format. 90 TB of source video the model learns edges, contours, spatial relationships, motion. once it has that, describing a scene it's never seen is novel trajectory through the same learned space. generative geometry

tetsuo

18,397 views • 5 months ago

Yann LeCun just said something that every AI-in-healthcare researcher should sit with. He basically said: If language were enough to understand the world, you could learn medicine by reading books. But you can’t. You need residency. You need to see thousands of normal cases before you recognize the abnormal one. He also points out something wild — all the public text on the internet is on the order of 10¹⁴ bytes. A 4-year-old processes about that much through vision alone. The world is just… higher bandwidth than text. I think this shift — from language models to world models — is going to matter a lot in healthcare. 🫀

Bo Wang

418,752 views • 5 months ago