Today we're announcing #GAIA1: a 9B parameter world model, trained on 4,700 hours of driving data, able to simulate complex and diverse driving scenes from video, text and action inputs. This model is 480x larger than the preview we shared earlier this year and the results are incredible. These...

631,844 views • 2 years ago • via X (Twitter)

10 Comments

Alex Kendall • 2 years ago

Here's a thread with some of my favourite examples!

Hamid Abdollahi • 2 years ago

Exciting advancements with #GAIA1, @alexgkendall! However, how will we ensure the synthetic data's diversity truly represents real-world scenarios without bias? How do we validate the correctness of AI-imagined outcomes against real-world driving nuances? Isn't there a risk of overfitting to the generated scenarios, and how do we bridge the gap between synthetic training and real-world robustness? While the promise is grand, isn't real-world data still invaluable for nuances hard to replicate synthetically?

Alex Kendall • 2 years ago

You're right, these are the key questions, and they are where we've been focused, to make sure these aren't just pretty videos but can accelerate the robustness of our driving policies. With the right approach, you can build a world model that balances realism and diversity (and can even train your policy adversarially).

Furkan Gözükara • 2 years ago

The videos are really coherent. How many seconds can it generate? Do you use any driving video, or is it just text to video?

Alex Kendall • 2 years ago

It can keep generating videos perpetually, so no limit to the length... here's an example of a long scene

Ivan Kirigin • 2 years ago

Very impressive

Yash • 2 years ago

You guys should be training AIs on the roads of India. The model will be way better than any made to date.

SolarSailor.eth • 2 years ago

This is awesome. Now we can generate the most absurd cases, the ones far out in the tails of the distribution of occurrence. Major unlock.

Captain Hype 🦊 🇺🇦 ❤️ • 2 years ago

@WholeMarsBlog Next level !! ♥️♥️

Morlock • 2 years ago

I think you are off by several orders of magnitude on the amount of training data, but having said that, the fundamental issue is the need for variance in the data, rather than just the amount of data.
