
Jim Fan
@DrJimFan • 434,952 subscribers
NVIDIA Director of Robotics & Distinguished Scientist. Co-Lead of GEAR lab. Solving Physical AGI, one motor at a time. Stanford Ph.D. OpenAI's 1st intern.
Shorts
Videos

I promise this will be the best 20 min you spend today! Robotics: Endgame, the sequel to my last year's Sequoia AI Ascent talk, "Physical Turing Test". I laid out the roadmap for solving Physical AGI as a simple parallel to the LLM success story. Be a good scientist, copy homework ;) And stay till the end, more easter eggs and predictions for your polymarket! 00:30 DGX-1 origin story at OpenAI, I was there in 2016 signing with Jensen and Elon. Heading to the Computer History Museum! 01:42 The Great Parallel 03:31 Robotics, the Endgame 03:39 Why VLAs fall short 04:32 Video world models as the 2nd pretraining paradigm 06:09 World Action Models (WAM) 07:46 Strategies for robot data collection and the FSD equivalent to physical data flywheel for robot manipulation 11:06 EgoScale and the Dexterity Scaling Law we discovered recently 14:00 Physical RL: bridging the last mile 15:39 DreamDojo: an end-to-end neural physics engine for scaling RL in silico 17:00 Civilizational Technology Tree and my predictions for the near future. Spoiler: it's closer than you think. Thanks to my friends at Sequoia for inviting me back to AI Ascent this year! I had a blast! Last year's talk is attached in the thread if you missed it.
Jim Fan552,698 views • 26 days ago

What if we set GPT-4 free in Minecraft? ⛏️ I’m excited to announce Voyager, the first lifelong learning agent that plays Minecraft purely in-context. Voyager continuously improves itself by writing, refining, committing, and retrieving *code* from a skill library. GPT-4 unlocks a new paradigm: “training” is code execution rather than gradient descent. “Trained model” is a codebase of skills that Voyager iteratively composes, rather than matrices of floats. We are pushing no-gradient architecture to its limit. Voyager rapidly becomes a seasoned explorer. In Minecraft, it obtains 3.3× more unique items, travels 2.3× longer distances, and unlocks key tech tree milestones up to 15.3× faster than prior methods. We open-source everything. Let generalist agents emerge in Minecraft! Welcome you all to try today: Paper: Code: Deep dive with me: 🧵
Jim Fan3,821,409 views • 3 years ago

Font design is about to get a huge upgrade. Traditionally, given the same font, the letter ‘A’ looks identical everywhere. How about a smart & magical font that draws ‘A’ differently based on the semantic meaning of the word? Smells like Stable Diffusion? Deep dive with me: 🧵
Jim Fan3,830,141 views • 3 years ago

We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet. We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate. Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution. Our recipe is called "EgoScale": - Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks. - Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency. - Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone. The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:
Jim Fan290,488 views • 3 months ago

Can GPT-4 teach a robot hand to do pen spinning tricks better than you do? I'm excited to announce Eureka, an open-ended agent that designs reward functions for robot dexterity at super-human level. It’s like Voyager in the space of a physics simulator API! Eureka bridges the gap between high-level reasoning (coding) and low-level motor control. It is a “hybrid-gradient architecture”: a black box, inference-only LLM instructs a white box, learnable neural network. The outer loop runs GPT-4 to refine the reward function (gradient-free), while the inner loop runs reinforcement learning to train a robot controller (gradient-based). We are able to scale up Eureka thanks to IsaacGym, a GPU-accelerated physics simulator that speeds up reality by 1000x. On a benchmark suite of 29 tasks across 10 robots, Eureka rewards outperform expert human-written ones on 83% of the tasks by 52% improvement margin on average. We are surprised that Eureka is able to learn pen spinning tricks, which are very difficult even for CGI artists to animate frame by frame! Eureka also enables a new form of in-context RLHF, which is able to incorporate a human operator’s feedback in natural language to steer and align the reward functions. It can serve as a powerful co-pilot for robot engineers to design sophisticated motor behaviors. As usual, we open-source everything! Welcome you all to check out our video gallery and try the codebase today: Paper: Code: Deep dive with me: 🧵
Jim Fan2,673,692 views • 2 years ago

Announcing DreamDojo: our open-source, interactive world model that takes robot motor controls and generates the future in pixels. No engine, no meshes, no hand-authored dynamics. It's Simulation 2.0. Time for robotics to take the bitter lesson pill. Real-world robot learning is bottlenecked by time, wear, safety, and resets. If we want Physical AI to move at pretraining speed, we need a simulator that adapts to pretraining scale with as little human engineering as possible. Our key insights: (1) human egocentric videos are a scalable source of first-person physics; (2) latent actions make them "robot-readable" across different hardware; (3) real-time inference unlocks live teleop, policy eval, and test-time planning *inside* a dream. We pre-train on 44K hours of human videos: cheap, abundant, and collected with zero robot-in-the-loop. Humans have already explored the combinatorics: we grasp, pour, fold, assemble, fail, retry—across cluttered scenes, shifting viewpoints, changing light, and hour-long task chains—at a scale no robot fleet could match. The missing piece: these videos have no action labels. So we introduce latent actions: a unified representation inferred directly from videos that captures "what changed between world states" without knowing the underlying hardware. This lets us train on any first-person video as if it came with motor commands attached. As a result, DreamDojo generalizes zero-shot to objects and environments never seen in any robot training set, because humans saw them first. Next, we post-train onto each robot to fit its specific hardware. Think of it as separating "how the world looks and behaves" from "how this particular robot actuates." The base model follows the general physical rules, then "snaps onto" the robot's unique mechanics. It's kind of like loading a new character and scene assets into Unreal Engine, but done through gradient descent and generalizes far beyond the post-training dataset. A world simulator is only useful if it runs fast enough to close the loop. We train a real-time version of DreamDojo that runs at 10 FPS, stable for over a minute of continuous rollout. This unlocks exciting possibilities: - Live teleoperation *inside* a dream. Connect a VR controller, stream actions into DreamDojo, and teleop a virtual robot in real time. We demo this on Unitree G1 with a PICO headset and one RTX 5090. - Policy evaluation. You can benchmark a policy checkpoint in DreamDojo instead of the real world. The simulated success rates strongly correlate with real-world results - accurate enough to rank checkpoints without burning a single motor. - Model-based planning. Sample multiple action proposals → simulate them all in parallel → pick the best future. Gains +17% real-world success out of the box on a fruit packing task. We open-source everything!! Weights, code, post-training dataset, eval set, and whitepaper with tons of details to reproduce. DreamDojo is based on NVIDIA Cosmos, which is open-weight too. 2026 is the year of World Models for physical AI. We want you to build with us. Happy scaling! Links in thread:
Jim Fan207,650 views • 3 months ago

Today is the beginning of our moonshot to solve embodied AGI in the physical world. I’m so excited to announce Project GR00T, our new initiative to create a general-purpose foundation model for humanoid robot learning. The GR00T model will enable a robot to understand multimodal instructions, such as language, video, and demonstration, and perform a variety of useful tasks. We are collaborating with many leading humanoid companies around the world, so that GR00T may transfer across embodiments and help the ecosystem thrive. GR00T is born on NVIDIA’s deep technology stack. We simulate in Isaac Lab (new app on Omniverse Isaac Sim for humanoid learning), train on OSMO (new compute orchestration system to scale up models), and deploy to Jetson Thor (new edge GPU chip designed to power GR00T). Announced in Jensen's keynote, Project GR00T is a cornerstone for the “Foundation Agent” roadmap of the newly founded GEAR Lab. At GEAR, we are building generally capable agents that learn to act skillfully in many worlds, virtual and real. See if you can spot "GEAR" in the video ;) Join us on the journey to land on the moon.
Jim Fan1,076,525 views • 2 years ago

We’ve seen a gazillion startups using OpenAI APIs to do “co-pilot for X”. What’s next? Enter *physical* co-pilot! Here’s a compelling demo: you improvise by playing a “low resolution” piano, and the co-pilot compiles it real-time to Hi-Fi music! It unleashes our inner pianist.🧵
Jim Fan1,548,397 views • 3 years ago

I know your timeline is flooded now with word salads of "insane, HER, 10 features you missed, we're so back". Sit down. Chill. Take a deep breath like Mark does in the demo . Let's think step by step: - Technique-wise, OpenAI has figured out a way to map audio to audio directly as first-class modality, and stream videos to a transformer in real-time. These require some new research on tokenization and architecture, but overall it's a data and system optimization problem (as most things are). High-quality data can come from at least 2 sources: 1) Naturally occurring dialogues on YouTube, podcasts, TV series, movies, etc. Whisper can be trained to identify speaker turns in a dialogue or separate overlapping speeches for automated annotation. 2) Synthetic data. Run the slow 3-stage pipeline using the most powerful models: speech1->text1 (ASR), text1->text2 (LLM), text2->speech2 (TTS). The middle LLM can decide when to stop and also simulate how to resume from interruption. It could output additional "thought traces" that are not verbalized to help generate better reply. Then GPT-4o distills directly from speech1->speech2, with optional auxiliary loss functions based on the 3-stage data. After distillation, these behaviors are now baked into the model without emitting intermediate texts. On the system side: the latency would not meet real-time threshold if every video frame is decompressed into an RGB image. OpenAI has likely developed their own neural-first, streaming video codec to transmit the motion deltas as tokens. The communication protocol and NN inference must be co-optimized. For example, there could be a small and energy-efficient NN running on the edge device that decides to transmit more tokens if the video is interesting, and fewer otherwise. - I didn't expect GPT-4o to be closer to GPT-5, the rumored "Arrakis" model that takes multimodal in and out. In fact, it's likely an early checkpoint of GPT-5 that hasn't finished training yet. The branding betrays a certain insecurity. Ahead of Google I/O, OpenAI would rather beat our mental projection of GPT-4.5 than disappoint by missing the sky-high expectation for GPT-5. A smart move to buy more time. - Notably, the assistant is much more lively and even a bit flirty. GPT-4o is trying (perhaps a bit too hard) to sound like HER. OpenAI is eating Character AI's lunch, with almost 100% overlap in form factor and huge distribution channels. It's a pivot towards more emotional AI with strong personality, which OpenAI seemed to actively suppress in the past. - Whoever wins Apple first wins big time. I see 3 levels of integration with iOS: 1) Ditch Siri. OpenAI distills a smaller-tier, purely on-device GPT-4o for iOS, with optional paid upgrade to use the cloud. 2) Native features to stream the camera or screen into the model. Chip-level support for neural audio/video codec. 3) Integrate with iOS system-level action API and smart home APIs. No one uses Siri Shortcuts, but it's time to resurrect. This could become the AI agent product with a billion users from the get-go. The FSD for smartphones with a Tesla-scale data flywheel.
Jim Fan991,566 views • 2 years ago

We trained a robot dog to balance and walk on top of a yoga ball purely in simulation, and then transfer zero-shot to the real world. No fine-tuning. Just works. I’m excited to announce DrEureka, an LLM agent that writes code to train robot skills in simulation, and writes more code to bridge the difficult simulation-reality gap. It fully automates the pipeline from new skill learning to real-world deployment. The Yoga ball task is particularly hard because it is not possible to accurately simulate the bouncy ball surface. Yet DrEureka has no trouble searching over a vast space of sim-to-real configurations, and enables the dog to steer the ball on various terrains, even walking sideways! Traditionally, the sim-to-real transfer is achieved by domain randomization, a tedious process that requires expert human roboticists to stare at every parameter and adjust by hand. Frontier LLMs like GPT-4 have tons of built-in physical intuition for friction, damping, stiffness, gravity, etc. We are (mildly) surprised to find that DrEureka can tune these parameters competently and explain its reasoning well. DrEureka builds on our prior work Eureka, the algorithm that teaches a 5-finger robot hand to do pen spinning. It takes one step further on our quest to automate the entire robot learning pipeline by an AI agent system. One model that outputs strings will supervise another model that outputs torque control. We open-source everything! Welcome you all to check out the paper, more videos, and try the codebase today: Code:
Jim Fan908,469 views • 2 years ago

I'm observing a mini Moravec's paradox within robotics: gymnastics that are difficult for humans are much easier for robots than "unsexy" tasks like cooking, cleaning, and assembling. It leads to a cognitive dissonance for people outside the field, "so, robots can parkour & breakdance, but why can't they take care of my dog?" Trust me, I got asked by my parents about this more than you think ... The "Robot Moravec's paradox" also creates the illusion that physical AI capabilities are way more advanced than they truly are. I'm not singling out Unitree, as it applies widely to all recent acrobatic demos in the industry. Here's a simple test: if you set up a wall in front of the side-flipping robot, it will slam into it at full force and make a spectacle. Because it's just overfitting that single reference motion, without any awareness of the surroundings. Here's why the paradox exists: it's much easier to train a "blind gymnast" than a robot that sees and manipulates. The former can be solved entirely in simulation and transferred zero-shot to the real world, while the latter demands extremely realistic rendering, contact physics, and messy real-world object dynamics - none of which can be simulated well. Imagine you can train LLMs not from the internet, but from a purely hand-crafted text console game. Roboticists got lucky. We happen to live in a world where accelerated physics engines are so good that we can get away with impressive acrobatics using literally zero real data. But we haven't yet discovered the same cheat code for general dexterity. Till then, we'll still get questioned by our confused parents.
Jim Fan397,747 views • 10 months ago

We are looking at the future of VR, YouTube & Google Street View. This is zip-NeRF, a 3D neural rendering tech rapidly approaching the quality of a real, high-res drone flight video. Think of NeRF as transporting reality into simulation. Metaverse will finally work this time.
Jim Fan1,260,343 views • 3 years ago

We RL'ed humanoid robots to Cristiano Ronaldo, LeBron James, and Kobe Byrant! These are neural nets running on real hardware at our GEAR lab. Most robot demos you see online speed videos up. We actually *slow them down* so you can enjoy the fluid motions. I'm excited to announce "ASAP", a "real2sim2real" model that masters extremely smooth and dynamic motions for humanoid whole body control. We pretrain the robot in simulation first, but there is a notorious "sim2real" gap: it's very difficult for hand-engineered physics equations to match real world dynamics. Our fix is simple: just deploy a pretrained policy on real hardware, collect data, and replay the motion in sim. The replay will obviously have many errors, but that gives a rich signal to compensate for the physics discrepancy. Use another neural net to learn the delta. Basically, we "patch up" a traditional physics engine, so that the robot can experience almost the real world at scale in GPUs. The future is hybrid simulation: combine the power of classical sim engines refined over decades and the uncanny ability of modern NNs to capture a messy world.
Jim Fan566,179 views • 1 year ago

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight: - Real humanoid teleoperation data. - Large-scale simulation data: we are open-sourcing 300K+ trajectories! - Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”! - Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos. GR00T N1 is a single end-to-end neural net, from photons to actions: - Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions. - Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2. We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right. Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵
Jim Fan465,559 views • 1 year ago

The power of the Claw, in the palm of a robot hand. Agentic robotics is here! Today, we open-source CaP-X: vibe agents, alive in the physical world. They incarnate as robot arms and humanoids with a rich set of perception APIs, actuation APIs, and auto synthesize skill libraries as they go. CaP-X is a strict superset of our old stack, because policies like VLAs are “just” API calls as well. It solves many tasks zero-shot that a learned policy would struggle with. And we are doing much more than vibing. CaP-X is our most systematic, scientific study on agentic robotics so far: - We build a comprehensive agentic toolkit: perception (SAM3 segmentation, Molmo pointing, depth, point cloud), control (IK solvers, grasp planner, navigation), and visualization (EEF, mask overlays) that work across different robots. - CaP-Gym: LLM’s first Physical Exam! 187 manipulation tasks across RoboSuite, LIBERO-PRO, and BEHAVIOR. Tabletop, bimanual, mobile manipulation. Sim and real. Can’t wait to see the gradients flow from CaP-Gym to the next wave of frontier LLM releases. - CaP-Bench: we benchmark 12 frontier LLMs/VLMs (Gemini, GPT, Opus, Qwen, DeepSeek, Kimi, and more) across 8 evaluation tiers. We systematically vary API abstraction level, agentic harness, and visual grounding methods. Lots of insights in our paper. - CaP-Agent0: a training-free agentic harness that matches or exceeds human expert code on 4 out of 7 tasks without task-specific tuning. - CaP-RL: if you get a gym, you get RL ;). A 7B OSS model jumps from 20% to 72% success after only 50 training iterations. The synthesized programs transfer to real robots with minimal sim-to-real gap. 3 years ago, our team created Voyager, one of the earliest agentic AI that plays and learns in Minecraft continuously. Its key ideas — skill libraries, self-reflection loops, and in-context planning — have since influenced many modern agentic designs. Today, the agent graduates from Minecraft and gets a real job. It’s April Fool’s, but this Claw is getting its hands dirty for real! Link in thread:
Jim Fan76,890 views • 2 months ago

I was honored to share the TED AI stage with Ilya on Oct. 17. His speech video is out today (mine's still being edited). I think it provides relevant context tokens to the ongoing events. Transcript starting at ~10'20": As AI continues to progress, as technology advances, [...] What I claim will happen is that people will start to act in unprecedentedly collaborative ways out of their own self-interest. It's already happening right now. You see the leading AGI companies starting to collaborate, such as the Frontiers model forum. And we will expect that companies, even competitors, will share technical information to make their AI safe. We may even see governments do this. As another example, at OpenAI, we really believed in how dramatic AGI is going to be. So, one of the ideas that we were operating by, and it's been written on our website for 5 years now, is that when technology gets such that we are very close to AGI, to computers smarter than humans, if some other company is far ahead of us, then rather than compete with them, we will help them out, join them, in a sense. And why do that? Because we appreciate how incredibly dramatic AGI is going to be. And my claim is that with each generation of capability advancements, as AI gets better and as all of you experience what AI can do, as people who run AI efforts and AGI efforts, and people who work on them will experience it as well, this will change the way we see AI and AGI. And that will change collective behavior. And this is an important reason why I'm hopeful that, despite the great challenges posed by this technology, we will overcome them. @TEDAI2023
Jim Fan846,433 views • 2 years ago

Expecting this "one more thing" for too long. Apple Vision Pro's got a good shot at making AR mainstream, finally. Tech notes: * I'm really impressed by the UX: no controller, just your fingers. Apple has trained the best hand gesture recognition model - likely way better than any state-of-the-art research papers I've seen. It requires hyper fine-grained finger tracking that works robustly in all kinds of room lighting, hand positions, and occlusion. * The crown adjusts the interpolation between real world and fully immersive VR that blocks the vision. That's a cool physical way to tune the linear mixing coefficient. * EyeSight is a nice social feature: recognizes humans around you and interrupts the immersion when necessary. * Compute-wise: dual chip design, M2 + R1. R1 ingests input from 12 cameras, 5 sensors, and 6 microphones. It eliminates lag and streams display rapidly to avoid motion sickness. * Persona: scans your face and reconstructs a neural avatar. I think Apple has a very strong visual foundation model team ("VisionOS") that manages to stay under the radar. 💯 (Twitter glitched, video didn't work so had to re-upload)
Jim Fan930,846 views • 3 years ago