Jim Fan

@DrJimFan • 499,742 subscribers

NVIDIA Director of Robotics & Distinguished Scientist. Co-Lead of GEAR lab. Solving Physical AGI, one motor at a time. Stanford Ph.D. OpenAI's 1st intern.

Shorts

Minecraft has been achieved internally. Yes, this is Sora's hallucination of Minecraft. It can't resist the urge to make the sky look less pixelated 😅

7,071,078 views

If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths.
I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!
Let's break down the following video. Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee."
- The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space.
- The 3D objects are consistently animated as they sail and avoid each other's paths.
- Fluid dynamics of the coffee, even the foam that forms around the ships. Fluid simulation is an entire sub-field of computer graphics, which traditionally requires very complex algorithms and equations.
- Photorealism, almost like rendering with raytracing.
- The simulator takes into account the small size of the cup compared to oceans, and applies tilt-shift photography to give a "minuscule" vibe.
- The scene's semantics do not exist in the real world, but the engine still implements the correct physical rules that we expect.
Next up: add more modalities and conditioning, then we have a full data-driven UE that will replace all the hand-engineered graphics pipelines.

6,182,157 views

A viral paper, "Language Models Represent Space and Time", recently claimed that LLMs learn "world models". As much as I like Max Tegmark's works, I disagree with their definition of world model.
The world model is a core concept in AI agents and decision making. It is our mental simulation of how the world works given interventions (or lack thereof). A world model captures causality and intuitive physics, telling the agent what is likely and what is impossible. It can and should be used for counterfactual reasoning, i.e. "what ifs": what would happen if I knock over a cup of water? Where would I have been if I had not taken that bus?
Yann LeCun says it well in his position paper. I quote: "Using such world models, animals can learn new skills with very few trials. They can predict the consequences of their actions, they can reason, plan, explore, and imagine new solutions to problems. Importantly, they can also avoid making dangerous mistakes when facing an unknown situation."
The first use of the term World Model in deep policy learning is attributed to hardmaru & Jürgen Schmidhuber. In their seminal paper, an agent masters shooting skills in the popular game Doom (demo below) by learning in imagination, using an internal world model as a "physics simulator".
To put it as a simple Python-style formula, a world model learns a function F(s[0:t-1], a) -> s[t:], which takes as input the observed past and the current action, and outputs plausible future states.
Now, the definition of World Model in Tegmark's paper seems to be about predicting GPS coordinates and time eras. I see this as just a classification task with no causal learning or simulation going on. You cannot make meaningful interventions against that model, nor can you optimize any decision making in a closed feedback loop.
As for the "space & time neurons", I think they are most similar to the "sentiment neuron" that OpenAI published in 2017. Predicting GPS coordinates is conceptually no different from predicting sentiment, in my opinion. I don't think their experimental results are wrong, just that their conclusion is on shaky ground. I welcome any debate! Paper link:
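The formula F(s[0:t-1], a) -> s[t:] from the post can be made concrete with a toy sketch. Everything here is an illustrative assumption and not from any cited paper: the 1-D "cup position" state, the `toy_world_model` name, and the hand-written constant-velocity dynamics that stand in for what a real world model would learn from data. The point is only the interface: past states plus an action go in, predicted future states come out, so the same model can answer a "what if" by changing the action.

```python
from typing import List, Sequence

State = float   # hypothetical 1-D state: position of a cup along a table
Action = float  # hypothetical action: how far the agent pushes the cup


def toy_world_model(past_states: Sequence[State], action: Action,
                    horizon: int = 3) -> List[State]:
    """Toy instance of F(s[0:t-1], a) -> s[t:].

    Hand-coded dynamics (an assumption for illustration); a real world
    model would learn this mapping instead.
    """
    # Estimate velocity from the last two observed states (zero if only one).
    if len(past_states) >= 2:
        velocity = past_states[-1] - past_states[-2]
    else:
        velocity = 0.0

    state = past_states[-1]
    future: List[State] = []
    for step in range(horizon):
        # The action (intervention) is applied only on the first predicted step.
        push = action if step == 0 else 0.0
        state = state + velocity + push
        future.append(state)
    return future


past = [0.0, 1.0, 2.0]

# Same observed past, two different interventions.
factual = toy_world_model(past, action=0.0)         # leave the cup alone
counterfactual = toy_world_model(past, action=5.0)  # "what if I push it?"

print(factual)         # [3.0, 4.0, 5.0]
print(counterfactual)  # [8.0, 9.0, 10.0]
```

Because the action is an explicit input, the two rollouts differ only through the intervention. A classifier that maps text to coordinates has no such input to vary, which is the distinction the post draws.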

593,943 views

These are not CGI. Reinforcement learning is so back. When operating on strings, it gives us o3. When operating on physical motors, it gives us a perfect humanoid backflip and a robot creature that out-maneuvers almost every animal on earth. RL is one of the only learning algorithms that can master both the world of bits and the world of atoms. Give me a reward function, and I shall move the world. 2025, Year of RL.

356,747 views

one day PhDs will animate every object around us with reinforcement learning to keep their thesis going

464,058 views

The launch of GPT-4 will be a predictably seismic event this year. But I can predict with high confidence what GPT-4 cannot do: It can’t cook spaghetti, play tennis, or build a lego treehouse. Robotics will be the last moat we conquer in the grand quest for AI 🤖🦾

482,427 views

Today may be the ImageNet moment for robotics. RT-X: the largest open-source robot dataset ever compiled, across 33 institutes, 22 robot hardware platforms, 527 skills, and 1M episodes.
Why is robotics lagging so far behind NLP, vision, and other AI domains? Data scarcity is the main culprit, among other difficulties. Unlike text, images, and videos, you cannot download mass amounts of onboard robot control data from the internet. It simply doesn't exist in the wild.
11 yrs ago, ImageNet kicked off the deep learning revolution. 3-4 yrs ago, internet-scale data fueled the first GPTs and Diffusions that define this era of foundation models. I think 2023 is finally the year for robotics to scale up.
Robot foundation models like VIMA (my team's work at NVIDIA) and RT-1/2 (Google DeepMind's effort) are extremely data hungry. While massively parallel simulations like NVIDIA IsaacGym & Omniverse can alleviate the problem to some extent, they are still not quite enough to bridge the gap to the messy, physical world.
This new dataset is not just a technical contribution. I also see it as a commendable effort to overcome institutional bureaucracies and unite researchers from around the world to tackle a grand challenge together.
Robotics will be the final holy grail that we capture in AI. We are not there yet, but ascending in the right gradient direction.
RT-X website:
Launch blog:

265,038 views

I believe solving robotics = 90% engineering + 10% research vision. Project GR00T is NVIDIA's moonshot initiative to build physical AGI for humanoid robots. The GEAR Lab is assembling a crack team right now. Join us!
Openings:
- Sr. Research Engineer, Robotics Systems
- Sr. RE, Reinforcement Learning
- Sr. RE, Foundation Model Training Infrastructure
- Sr. RE, Simulation
- Sr. RE, ML Data Pipelines
- Research Scientist
- Research Intern (both part-time and summer full-time in 2025)
For the Sr. positions, we strongly prefer candidates with many years of engineering experience at robotics/autonomous driving companies, or on MLOps/large-scale AI teams at big tech companies. For interns, we welcome ace robotics hackers anywhere! Show me your past works. Job links in the thread. Apply today! Your resumes will be my best Christmas gifts:

103,177 views

Good UX design is more important than ever for today’s AI. A model cannot achieve its full potential without the most fluid and intuitive interface. Here’s a first step towards the future of AI-in-the-loop artistic creation. Imagine making every tool in Photoshop feel like this.

181,955 views

Videos
