🔥 In Magma, we talked a lot about spatial/temporal intelligence beyond verbal intelligence, as advocated by Dr. Fei-Fei Li. So how should we interpret it? Today I am happy to announce a new demo, Magma-Gaming: 👉 Rather than asking LLMs to write game code, we go further and ask the model to PLAY... the game. A simple game like moving to a target in a 2D grid still requires precise action grounding and planning capability, and it challenges the most advanced VLMs and even the GPT-4o-mini model. Magma, born with stronger spatial understanding and reasoning ability, significantly outperforms its counterparts and achieves much higher scores in a zero-shot manner. This result pinpoints the huge potential of building multimodal agentic models endowed with both verbal and spatial intelligence! Also, I believe this simple demo gives you a better hint for understanding what spatial intelligence is, and why it is important!
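To make the "move to the target in a 2D grid" task concrete, here is a minimal sketch of what such a game loop might look like. Everything in it is an assumption for illustration: the `GridGame` class, the four-action set, and the `greedy_policy` stand-in are hypothetical and are not taken from the Magma-Gaming demo, whose actual environment and action space are not described in the post. In the real demo, the policy would be a model choosing an action from an image of the grid.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical action set (assumption): one grid cell per move.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


@dataclass
class GridGame:
    """Toy 2D grid game: the agent must reach the target cell."""
    width: int
    height: int
    agent: Tuple[int, int]
    target: Tuple[int, int]

    def step(self, action: str) -> bool:
        """Apply one action, clamp to the grid, report whether the target is reached."""
        dx, dy = ACTIONS[action]
        x = min(max(self.agent[0] + dx, 0), self.width - 1)
        y = min(max(self.agent[1] + dy, 0), self.height - 1)
        self.agent = (x, y)
        return self.agent == self.target


def greedy_policy(game: GridGame) -> str:
    """Stand-in for the model (assumption): close the x gap first, then the y gap."""
    ax, ay = game.agent
    tx, ty = game.target
    if ax != tx:
        return "right" if tx > ax else "left"
    return "down" if ty > ay else "up"


def play(game: GridGame, policy: Callable[[GridGame], str], max_steps: int = 50) -> int:
    """Run the policy until the target is reached; return steps used, or -1 on failure."""
    if game.agent == game.target:
        return 0
    for steps in range(1, max_steps + 1):
        if game.step(policy(game)):
            return steps
    return -1


game = GridGame(width=5, height=5, agent=(0, 0), target=(3, 2))
print(play(game, greedy_policy))
```

The hand-written policy reads exact coordinates, which is what makes the game trivial for it. A vision-language model has to recover the agent and target positions from pixels and then ground each move in the right direction, which is roughly the "action grounding and planning" the post refers to.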

Jianwei Yang

3,864 subscribers

17,940 views • 1 year ago • via X (Twitter)

Gaming • Science & Technology

8 comments

Jianwei Yang • 1 year ago

Magma Project: Magma Code: Magma HF Model: Magma Intro Video:

Jianwei Yang • 1 year ago

@alvarobartt @mervenoyann @arankomatsuzaki @NielsRogge

RedDeer.Games • 1 year ago

We can't spill the beans about the release date of Maki: Paw of Fury, but make no mistake, things are happening! 🫘😎 We remind you that the game is coming to #NintendoSwitch and #PC #Steam and you can play the demo on PC, here ⤵️ >>> Have a great day!

Data & Analytics • 1 year ago

@_akhaliq @drfeifei @_akhaliq, exploring spatial and temporal intelligence opens up so many possibilities! Balancing these skills alongside traditional ones could revolutionize our understanding of intelligence. What case studies showcase this best? 🔍 #InnovativeThinking

zaumai • 1 year ago

@drfeifei Fascinating development! Magma-Gaming's emphasis on spatial and temporal intelligence pushes beyond conventional language-based systems. Any standout scenarios or tests you recommend exploring first?

Oya San • 1 year ago

@drfeifei Absolutely fascinating, @jw2yang4ai! The exploration of spatial and temporal intelligence opens up a universe of possibilities in gaming. I'm excited to see how the Magma-Gaming demo will redefine our interactions and experiences.

Jun (Garvin) Chen • 1 year ago

@drfeifei Brilliant work

scuzzlebot • 1 year ago

@drfeifei Magma-Gaming highlights spatial intelligence brilliantly—great demonstration of advanced capabilities beyond verbal reasoning! Do you foresee integrating this spatial understanding into more intricate gaming contexts soon?

Related videos

"Visual-spatial intelligence is as fundamental as language." Fei-Fei Li says world models are the key to letting AI see the world, reason about it, interact with it, navigate it, and even 'build civilization upon it.' "It's very natural for me that World Labs' north star is to unlock spatial intelligence. The moment to me is right to do it." "We've got these ingredients—we've got compute, we've got a much deeper understanding of data, way deeper than [the ImageNet days]... and we've got some advancement of algorithms."

a16z

31,675 views • 5 months ago

World Labs CEO Dr. Fei-Fei Li says "world model" has become an overloaded term and explains what each kind of world model does: "Right now there are three ways of calling world models when it comes to spatial intelligence." "One is what I call a renderer, when the model puts beautiful pixels on the screen." "Another kind of world model is what we call a planner. That is more for machines, more for robots." "The third kind, which I think is the linchpin of the three, is a simulator." "A simulator could become a renderer. The simulator could become a planner. But this layer is a huge critical path to unlock spatial intelligence. And that's what World Labs is working on." Fei-Fei Li at Bloomberg Tech live with Emily Chang

a16z

112,021 views • 1 month ago

Dr. Fei-Fei Li just called out the biggest blind spot in the entire AI industry. We have been building half of human intelligence. And calling it the finish line. Li: “If you look at human intelligence, it pretty much boils down to two buckets.” The first bucket is language. Symbolic reasoning. Communication. The ability to think in words and abstractions. That’s what every major AI lab has spent the last decade building. The second bucket is the one the industry has almost entirely ignored. Li: “We call that in AI spatial intelligence.” How humans and animals perceive, navigate, and interact with the three-dimensional physical world. How we reach for objects. How we move through space. How we build and manipulate physical reality. From painting masterpieces to constructing the pyramids, non-verbal spatial intelligence is what actually shapes the world. Language describes reality. Spatial intelligence acts on it. And the gap between those two things is the gap between a chatbot and a robot. Li: “When this technology is ready, the robotic revolution is gonna start. We’re already seeing that trend.” Every robot is a moving agent. Every moving agent requires spatial intelligence to function in the real world. The humanoid robots being deployed in factories right now are hitting the ceiling of what language models alone can power. Spatial intelligence is the unlock. But Li didn’t stop at robotics. Li: “From a geopolitics point of view, this is part of the technology that goes straight into weapons.” Autonomous drone swarms. Battlefield navigation. Physical target acquisition without human oversight. Every military application of AI that operates in the real world runs on spatial intelligence. The nation that masters the transition from static text to dynamic three-dimensional perception doesn’t just win the software race. It commands the physical battlefield. The AI arms race just broke out of the data center. It’s operating in three dimensions now.

Dustin

122,680 views • 4 months ago

The moment is right to push forward into a new frontier for AI, one that is as fundamental as language, says Fei-Fei Li. That frontier is visual spatial intelligence. With Justin Johnson, her cofounder at World Labs, and a16z's martin_casado, Fei-Fei explains what unlocking this technology could mean, and why we’re in the midst of a “Cambrian explosion”:

a16z

476,703 views • 1 year ago

Excited to share ESI-BENCH, a benchmark for Embodied Spatial Intelligence! Most spatial reasoning benchmarks assume an oracle observer: the agent is given the right image, view, or 3D scene. But in the real world, the observer is also an actor. To understand space, agents must decide where to look, how to move, and when to interact, to reveal what is hidden: occlusions, containment, contact, dynamics, and functionality. In many cases, the hard part is not perception itself, but choosing the right action to make informative perception possible. ESI-BENCH tests this perception-action loop. Agents receive an egocentric observation and a spatial question, then must actively gather evidence through perception, locomotion, and manipulation before answering. The benchmark spans 10 task categories, 29 subcategories, and 3,081 instances, built in BEHAVIOR-1K across realistic interactive scenes. 🌍Webpage: 💻Code & data: Thanks to collaborators: Jiageng, Han, Manling Li, Leonidas Guibas, Fei-Fei Li, Jiajun Wu, Yejin Choi

Yining Hong

49,989 views • 2 months ago

Here’s a demo of our Agentic Memory system, inspired by how our own brain holds information in a 3D spatial space. This feels natural. Extending this further, we announced the Agentic Memory Protocol on July 15th in SF, which enables memory to be local, encrypted, and available to other agents and apps only with your permission. We believe this is the future of memory: not owned by any one app, spatial, and always improving. Aaron Levie Andrej Karpathy

Varun

187,489 views • 11 months ago

visionOS 26 quietly introduced a glimpse into the future of how we’ll browse the web: a spatial Reader. Safari can now automatically transform a website’s content into a spatial experience using Spatial Scenes — allowing you to literally look into the rooms of the apartment you’re about to book for your next holiday. It does this by beautifully blending the content with your surroundings and softly tinting your room with the website’s color palette. It's just not possible to capture. I also love that Apple Intelligence lives in a floating window beside the content, ready to summarize the page for you. I can’t wait to see how this evolves into what we’ll one day call the Spatial Web.

Phil Traut ᯅ

34,250 views • 1 year ago

Excited to share our latest work on 🎧spatial audio-driven human motion generation. We aim to tackle a largely underexplored yet important problem of enabling virtual humans to move naturally in response to spatial audio—capturing not just what is heard, but also where the sound is coming from. To this end, we introduce the Spatial Audio-Driven Human Motion (SAM) dataset—the first comprehensive dataset featuring paired high-quality human motion and spatial audio recordings. For benchmarking, we develop a generative framework for human MOtion generation driven by SPAtial audio, termed MOSPA, which learns to synthesize realistic and diverse human motions conditioned on spatial audio input. We hope this research could provide a foundation for future research in spatial perception, virtual characters, and embodied AI. The dataset and model will be open-sourced soon. A big thank you to our intern, Shuyang Xu, for the wonderful collaboration! Congratulations, Shuyang! Project page: Paper: Video: #Animation #CG #CV #AIGC #DL #Deeplearning #Motion #Graphics #AI #GenerativeAI

Zhiyang (Frank) Dou

14,610 views • 1 year ago

Do Vision-Language Models represent space, and how? Spatial terms like "left" or "right" may not be enough to match images with spatial descriptions, as we often overlook the different frames of reference (FoR) used by speakers and listeners. See Figure 1 for examples! Introducing the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to assess the spatial reasoning capabilities of VLMs. COMFORT includes systematically designed datasets and metrics that evaluate model performance, and their deeper linguistic competence, specifically the spatial knowledge encoded in their internal representations. Find out more in the video teaser! Almost all VLMs prefer the egocentric relative FoR with reflected transform, similar to English. Yet, we reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages. A shortened version will appear in the Pluralistic Alignment Workshop at #NeurIPS2024. It seems that the ArXiv moderators put it on hold and are eager to give it a thorough read first🤣! So here is the Paper/Code/Data: This collaboration turned out to be amazing, jointly led by Brian Zheyuan Zhang, @Hu_FY_, and Jayjun Lee, with so many contributions and insights from Freda Shi, Parisa Kordjamshidi, and Michigan SLED Lab. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning!

Martin Ziqiao Ma

35,565 views • 1 year ago

“Do you know where a sidewalk begins and ends?” — a simple question from Niantic Spatial CEO, Inhi Cho Suh, that gets to the heart of what today’s AI is missing. For robots, this isn’t theoretical. To safely navigate something as everyday as a sidewalk, they need to understand boundaries, context, and change in real time. Learn about how Niantic Spatial is building the missing piece in Physical AI. #NianticSpatial #HumanX #GeospatialAI #Robotics #AI #PhysicalAI

Niantic Spatial 🌎

18,656 views • 3 months ago

OpenAI just announced API access to o1 (advanced reasoning model) yesterday. I'm delighted to announce today a new short course, Reasoning with o1, built with OpenAI, and taught by Colin Jarvis, Head of AI Solutions at OpenAI, to show you how to use this effectively! Unlike previous language models which generate output directly, o1 “thinks before it responds,” and generates many reasoning tokens before returning a more thoughtful and accurate response. It is great at complex reasoning -- including planning for agentic workflows, coding, and domain-specific reasoning in STEM fields like law. But how you should use it is quite different from other LLMs. I think o1 will be a game changer for many AI applications; and in this course, you'll learn how to use it effectively. In detail, you’ll:
- Learn to recognize what tasks o1 is suited for, and when to use a smaller model, or combine o1 with a smaller model
- Understand the new principles of prompting reasoning models: Be simple and direct; no explicit chain-of-thought required; use structure; show rather than tell
- Implement multi-step orchestration in which o1 plans, and hands tasks over to gpt-4o-mini to execute specific steps; this illustrates a design pattern to optimize intelligence (accuracy) and cost
- Use o1 for a coding task to build a new application, edit existing code, and test performance by running a coding competition between o1-mini and GPT 4o
- Use o1 for image understanding and learn how it performs better with a "hierarchy of reasoning," in which it incurs the latency and cost upfront, preprocessing the image and indexing it with rich details so it can be used for Q&A later
- Learn a technique called meta-prompting, in which you use o1 to improve your prompts. Using a customer support evaluation set, you'll iteratively use o1 to modify a prompt to improve performance

You'll also learn about how OpenAI used reinforcement learning to produce a model that uses "test-time compute" to improve performance. I think you'll find this course enjoyable and valuable. Please sign up for it here:

Andrew Ng

357,661 views • 1 year ago

Introducing FoundationMotion. A large-scale, video-derived motion annotation dataset & auto-labeling pipeline + advanced models for motion understanding. Fully open-source: code, datasets, and models, free to use and build on. Understanding motion is core to physical reasoning, yet today’s leading models still struggle with simple spatial actions like “turn right” or “move up” or “flip the toast” - mainly due to the lack of large, fine-grained motion datasets. We present FoundationMotion, a fully automated pipeline that: • detects & tracks objects in videos • extracts trajectories • uses LLMs + frames to generate rich motion captions & QA pairs → creating large-scale, high-quality motion datasets at scale. After fine-tuning the open-source models Qwen and NVILA on our annotations, these models now outperform the closed-source Gemini-3-Flash and GPT-5.1 on spatial understanding tasks across autonomous driving, robotics, and everyday scenarios. 📜Paper: 🌐Webpage: 💻 Code: 🕸️Model: 📊 Dataset: 👉 Interactive Demo: Let’s move research forward together. FoundationMotion is also referred to as Wolf V2 🐺, the second chapter in the Wolf series:

Boyi Li

66,999 views • 7 months ago

Demis on why world models are his longest standing passion and their benefits vs. language models: ▫️ “I think language models are able to understand a lot about the world. More than we expected because language is actually probably richer than we thought. But there's still a lot about the spatial dynamics of the world, spatial awareness and the physical context we're in — and how that works mechanically — that is hard to describe in words and isn't generally described in corpuses of words. A lot of this is allied to learning from experience. There's a lot of things which you can't really describe something. You have to just experience it. Maybe the senses and so on are very hard to put into words. Whether that's motor angles and smell and these kinds of senses, it's very difficult to describe that in any kind of language.”▫️

Bearly AI

78,545 views • 2 months ago

Chatbots aren’t the revolution. They’re the distraction. Fei-Fei Li: “Language is a half-million-year-old luxury. Perception is a half-billion-year-old necessity.” Evolution didn’t optimize for conversation. It optimized for survival in three-dimensional space. Seeing threats, navigating obstacles, predicting what happens when you move. We’ve spent years celebrating AI that can write and summarize. But text processing is narrow. Spatial intelligence is fundamental. An agent that only reads prompts can’t function in a warehouse or a hospital. It needs to parse depth, understand physics, and act on what it sees in real time. We built AI that understands language. Now we’re building AI that understands space. Language models got the attention. Spatial intelligence gets the work done. The world runs on physics, not paragraphs. AI is learning to operate in it.

Dustin

51,764 views • 5 months ago

Demis Hassabis says world models are his longest standing passion and explains their benefits vs. language models: ▫️ “I think language models are able to understand a lot about the world. More than we expected because language is actually probably richer than we thought. But there's still a lot about the spatial dynamics of the world, spatial awareness and the physical context we're in — and how that works mechanically — that is hard to describe in words and isn't generally described in corpuses of words. A lot of this is allied to learning from experience. There's a lot of things which you can't really describe something. You have to just experience it. Maybe the senses and so on are very hard to put into words. Whether that's motor angles and smell and these kinds of senses, it's very difficult to describe that in any kind of language.”▫️ This is what Demis and Google DeepMind are trying to solve with Genie. He also says that the video models (Veo) will play a part in training the world models and that this is all key for AI robotics.

Bearly AI

123,703 views • 7 months ago

Spatial understanding is important to moving around in complex environments and is a huge part of the challenge of generalizing to new scenes. Most world models, however, largely ignore this spatial dimension, focusing on 2D images. Not PointWorld, though. PointWorld is a 3D world model trained from real and simulated data which can perform a wide variety of manipulation tasks on a real robot, including grasping or handling articulated objects, all without any additional fine tuning. Wenlong Huang joins us to tell us more about what makes this work and how it’s different from other world models. Watch Episode #83 of RoboPapers, with Chris Paxton and Jiafei Duan, to learn more!

RoboPapers

16,197 views • 1 month ago

Dr. Fei-Fei Li is known as the “godmother of AI.” For the past two decades, she’s been at the center of AI’s most significant breakthroughs, including:
- Spearheading ImageNet, the dataset that sparked the AI explosion we’re living through right now
- Leading work at Stanford Artificial Intelligence Laboratory (SAIL)
- Serving as Chief Scientist of AI/ML at Google Cloud
- Co-founding Stanford’s Institute for Human-Centered AI
- Serving on the United Nations AI Scientific Advisory Board
- Being named one of Time's 100 most influential people in AI
In this conversation, Fei-Fei shares the rarely told history of how we got to today, and what comes next. We discuss:
🔸 The backstory on ImageNet
🔸 Why robotics faces unique challenges compared with language models and what’s needed to overcome them
🔸 Why Fei-Fei believes AI won’t replace humans but will require us to take responsibility for ourselves
🔸 Why world models and spatial intelligence represent the next frontier in AI, beyond large language models
🔸 The surprising applications of Marble, from movie production to psychological research
🔸 How to participate in AI regardless of your role
🔸 Much more
Listen now 👇 • YouTube: • Spotify: • Apple:
Thank you to our wonderful sponsors for supporting the podcast:
🏆 Figma Make, a prompt-to-code tool for making ideas real:
🏆 Justworks, the all-in-one HR solution for managing your small business with confidence:
🏆 Sinch, build messaging, email, and calling into your product:

Lenny Rachitsky

250,455 views • 8 months ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.
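
For readers unfamiliar with the BLEU-1 metric quoted above, here is a minimal sketch of how it is commonly computed for a single reference answer: clipped unigram precision multiplied by a brevity penalty. The function name `bleu1` and the whitespace tokenization are assumptions made for illustration; this is not the evaluation script used in the paper.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision times the brevity penalty (single reference)."""
    cand = candidate.lower().split()  # assumption: plain whitespace tokens
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # A candidate word is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


score = bleu1("the chair is next to the table", "the chair is beside the table")
```

Clipping is what stops a degenerate answer such as "the the the" from scoring well against a reference that contains "the" only once.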

AK

249,708 views • 3 years ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.
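
The first stage described above, where an LLM turns a prompt into bounding boxes with per-object descriptions, can be pictured with a rough sketch like the following. The JSON response format, the `Box` class, the `parse_layout` helper, and the 512-pixel canvas are all assumptions made for illustration; the project's actual layout format and controller are not shown in this post.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    description: str
    x: int
    y: int
    width: int
    height: int


def parse_layout(llm_response: str, canvas: int = 512) -> list[Box]:
    """Parse a JSON layout and keep only boxes that fit inside the canvas."""
    boxes = []
    for item in json.loads(llm_response):
        x, y, w, h = item["box"]
        fits = x >= 0 and y >= 0 and w > 0 and h > 0 and x + w <= canvas and y + h <= canvas
        if fits:
            boxes.append(Box(item["description"], x, y, w, h))
    return boxes


# Hypothetical stage-one output for the prompt "a cat to the left of a dog".
response = (
    '[{"description": "a cat", "box": [40, 200, 180, 200]},'
    ' {"description": "a dog", "box": [300, 180, 180, 240]}]'
)
layout = parse_layout(response)
cat, dog = layout
# The spatial relation from the prompt can be checked directly on the layout.
cat_left_of_dog = cat.x + cat.width <= dog.x
```

An explicit layout makes the spatial constraint checkable before any image is generated; the second stage would then condition the diffusion model on these boxes.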

AK

83,657 views • 3 years ago

📢 Join us tomorrow morning at our CVPR 2025 poster session (#340, ExHall D, 10:30am–12:30pm) to chat about Project Magma 👉 This is a big team effort to build a multimodal agentic model capable of understanding and acting in both digital and physical environments, just like how we interact with the world every day. 🚀 Even more exciting: we demonstrate the scaling potential of agent pretraining on large-scale human instructional videos through our Set-of-Mark (SoM) and Trace-of-Mark (ToM), showcasing strong zero-shot performance in: 1) multimodal image/video understanding, 2) UI navigation, 3) real-world robot manipulation, and even 4) gaming! We've received encouraging feedback over the past few days, and this is only the beginning. A small step forward, with exciting things ahead!

Jianwei Yang

13,112 views • 1 year ago