RTFM can be seen as a learned renderer: it is an autoregressive diffusion transformer trained end-to-end on large-scale video data, and it learns to model 3D geometry, reflections, shadows and more just by observing them in its training set.

World Labs

45,857 subscribers

15,870 views • 9 months ago • via X (Twitter)

Related Videos

MVDream: Multi-view Diffusion for 3D Generation. Paper page: We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few-shot setting for personalized 3D generation, i.e., the DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 views • 2 years ago

Given an embedding vector, you can tell which model produced it. I trained a 0.8M transformer that fingerprints embedding models by reading raw float digits (vocab size: 15). Full end-to-end, zero feature engineering.

Han Xiao

19,695 views • 4 months ago

Introducing RTFM (Real-Time Frame Model): a highly efficient World Model that generates video frames in real time as you interact with it, powered by a single H100 GPU. RTFM renders persistent and 3D consistent worlds, both real and imaginary. Try our demo of RTFM today!

World Labs

340,567 views • 9 months ago

We’re excited to share DiT4DiT, an end-to-end Video-Action Model for robot learning that unifies a video Diffusion Transformer and an action Diffusion Transformer in a single cascaded framework. By leveraging the rich spatiotemporal and physical dynamics learned through video generation, rather than static image-text priors, DiT4DiT achieves state-of-the-art results on LIBERO (98.6%) and RoboCasa GR1 (50.8%) with far less training data, delivering over 10× better sample efficiency and up to 7× faster convergence. Real-world deployment on a humanoid robot further shows robust generalization. We believe this is a step toward making video generation a powerful backbone for robot policy learning. This work builds upon the brilliant foundations laid by Nvidia's GR00T and Cosmos. Project: Paper: Code: Coming soon. In the meantime, you can ask your coding agent to reproduce the method based on GR00T/Cosmos.

Shuo Yang

31,596 views • 4 months ago

NVIDIA CEO Jensen Huang on Tesla FSD at CES: “Tesla’s FSD stack is completely world-class. They’ve been working on it for quite some time. It’s world-class. Not only in the number of miles it has, but it’s world-class in the way that it’s designed, the way they do training to data collection and curation, synthetic data generation, all of their simulation technologies. Of course, the latest generation is end-to-end full self-driving, meaning it’s just one large model that is end-to-end trained. Elon’s AV system is, in every way, 100% state-of-the-art. And so I’m really quite impressed by the technology. I have it and I drive it in our house and it works incredibly well.”

Nic Cruz Patane

52,826 views • 6 months ago

Large-scale 3D Scene Generation (all scenes are real-time rendered)!! Physically-grounded generative data without hallucinations is the missing link for robot learning and testing at scale. We introduce a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal view synthesis and generation with object permanence and explicit 3D geometry. This also allows for extreme trajectory extrapolation without failure! We also show that we can build fully data-driven simulators for end-to-end learning with this approach. Project: with the amazing team of Julian Ost, Amogh Joshi, Andrea Ramazzina, Maximilian Bömer, Mario Bijelic.

Felix Heide

27,779 views • 10 months ago

An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.
Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.
Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.
Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 5 months ago

Meet MapAnything – a transformer that directly regresses factored metric 3D scene geometry (from images, calibration, poses, or depth) in an end-to-end way. No pipelines, no extra stages. Just 3D geometry & cameras, straight from any type of input, delivering new state-of-the-art results 🚀 One universal model enables SoTA for: 🔥 Mono Depth Estimation 🔥 Multi-View SfM 🔥 Multi-View Stereo 🔥 Depth Completion 🔥 Registration … and many more possibilities! – plus everything is metric 🎯 We release code for data processing, training, benchmarking & ablations – everything Apache 2.0! Details & Links 👇

Nikhil Keetha

122,913 views • 10 months ago

Along with text, images, video and code, Gemini is able to process raw audio signal end-to-end. 🔊 It can listen to and understand speech, making it not only useful for transcription but a model that has a much more nuanced perception of its environment. ↓

Google DeepMind

140,179 views • 2 years ago

We discovered an emergent property of VLAs like π0/π0.5/π0.6: as we scale up pre-training, the model learns to align human videos and robot data! This gives us a simple way to leverage human videos. Once π0.5 knows how to control robots, it can naturally learn from human video.

Physical Intelligence

1,183,757 views • 7 months ago

Today we’re releasing K2 Think V2, our most capable open-source reasoning model to date. This is a fully sovereign model: trained end-to-end on IFM-curated and synthesized data, with complete transparency from pre-training through final reasoning alignment.

MBZUAI

287,725 views • 6 months ago

Tracking Anything with Decoupled Video Segmentation. Paper page: Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation.

AK

305,667 views • 2 years ago

We believe that intelligence should not arrive preconfigured. Together AI is now available directly inside the Adaption platform, connecting Adaptive Data with large-scale training in a single workflow. One platform, end to end. Stop inheriting intelligence. Shape it.

adaption

25,869 views • 2 months ago

Inception Labs just killed the transformer. They released Mercury 2, the world's first "diffusion" reasoning model. It's fast, and it uses a completely new model architecture... just watch this 11 min video to find out more:

David Ondrej

45,692 views • 4 months ago

Skild AI is laying its cards on the table (partially, anyway). Teleoperation data lacks diversity and is limited by a 1:1 human operator time-scale. To address this, Skild pre-trained its model using internet-scale video data (already widely available in forms ranging from first-person "egocentric" headcam footage to millions of instructional YouTube videos).

The Humanoid Hub

32,822 views • 6 months ago

compute has to be distributed and personalized to minimize latency
as AI scales, the economy is increasingly latency-sensitive
only $amd can solve that problem at scale, end-to-end
long $amd since $4.2 and I’m betting on it becoming a $5T company in ~5 years

Antonio Linares

64,103 views • 3 months ago

Super clean and efficient meshes by an AI? YES! Typical 3D generative AI solutions produce lots of artifacts and usually way too many polygons due to volumetric approaches. In comparison, “MeshGPT creates triangle meshes by autoregressively sampling from a transformer model that has been trained to produce tokens from a learned geometric vocabulary. These tokens can then be decoded into the faces of a triangle mesh. This method generates clean, coherent, and compact meshes, characterized by sharp edges and high fidelity.” Surely it is limited by the trained vocabulary, but various versions can be trained for specific sets to create generative model libraries for certain object groups. A very promising, high-quality approach.

René Schulte

20,772 views • 2 years ago

Introducing RoboCat, a new AI model designed to operate multiple robots. 🤖 It learns to solve new tasks on different robotic arms with as few as 100 demonstrations - and improves skills from self-generated training data. Find out more:

Google DeepMind

410,297 views • 3 years ago

Google presents Genie: Generative Interactive Environments. We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.

AK

684,372 views • 2 years ago

GameFactory: Creating New Games with Generative Interactive Videos. We present GameFactory, a generalizable world model that learns from a small-scale dataset of Minecraft game videos. By leveraging the prior knowledge of a pretrained video diffusion model, it can create new games in an open domain.

AK

70,029 views • 1 year ago