Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online reinforcement learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving... is hindered by inefficient exploration in continuous action spaces. MindDrive is a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. With one set, the LLM serves as a Decision Expert for scenario reasoning and driving decision-making; with the other, it acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. Paper Title: MindDrive: A Vision-Language-Action Model for Autonomous Driving via

AI Bites | YouTube Channel

2,270 subscribers

43,496 views • 5 months ago • via X (Twitter)

Related videos

MolmoAct2 is landing in LeRobot! Ai2's open Action Reasoning Model combines a Molmo2-ER vision-language backbone with a flow-matching continuous action expert to predict robot action chunks from images, language instructions, and proprioceptive state. An open robot foundation model built for real-world control, with strong out-of-the-box performance and easy fine-tuning in LeRobot. Pick-and-place inference running on NVIDIA DGX Spark! Blog: Paper: Thanks to Ai2 Jiafei Duan Haoquan Fang

LeRobot

24,727 views • 2 months ago

A viral paper, "Language Model Represents Space and Time", recently claimed that LLMs learn "world models". As much as I like Max Tegmark's works, I disagree with their definition of world model. A world model is a core concept in AI agents and decision making. It is our mental simulation of how the world works given interventions (or lack thereof). A world model captures causality and intuitive physics, telling the agent what is likely and what is impossible. It can and should be used for counterfactual reasoning, i.e. "what ifs": what would happen if I knock over a cup of water? Where would I have been if I had not taken that bus? Yann LeCun says it well in his position paper. I quote: "Using such world models, animals can learn new skills with very few trials. They can predict the consequences of their actions, they can reason, plan, explore, and imagine new solutions to problems. Importantly, they can also avoid making dangerous mistakes when facing an unknown situation." The first use of the term World Model in deep policy learning is attributed to hardmaru & Jürgen Schmidhuber. In their seminal paper, an agent masters shooting skills in the popular game Doom (demo below) by learning in imagination, using an internal world model as a "physics simulator". To put it in a simple Python-style formula, a world model learns a function F(s[0:t-1], a) -> s[t:], which takes as input the observed past and the current action, and outputs plausible future states. Now the definition of World Model in Tegmark's paper seems to be about predicting GPS coordinates and time eras. I see this as just a classification task with no causal learning or simulation going on. You cannot make meaningful interventions against that model, nor can you optimize any decision making in a closed feedback loop.
As for the "space & time neurons", I think they are most similar to the "sentiment neuron" that OpenAI published in 2017. Predicting GPS is conceptually no different from predicting sentiment, in my opinion. I don't think their experimental results are wrong, just that their conclusion is on shaky ground. I welcome any debate! Paper link:
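
The signature F(s[0:t-1], a) -> s[t:] from the post can be written out as a toy function. This is only a sketch: the one-dimensional state, the constant-velocity dynamics and the name `toy_world_model` are assumptions for illustration, whereas a real world model would be learned from data. The two calls at the end show the counterfactual "what if" use the post describes: same past, different intervention.

```python
from typing import Sequence


def toy_world_model(past_states: Sequence[float], action: float, horizon: int = 3) -> list[float]:
    """Toy F(s[0:t-1], a) -> s[t:].

    Assumed dynamics: the state is a 1-D position and the action is a
    velocity applied at every future step.
    """
    current = past_states[-1]
    future = []
    for _ in range(horizon):
        current = current + action
        future.append(current)
    return future


past = [0.0, 1.0, 2.0]                    # observed history s[0:t-1]
stay = toy_world_model(past, action=0.0)  # what if I stop?
move = toy_world_model(past, action=1.0)  # what if I keep moving?
print(stay, move)
```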

Jim Fan

594,014 views • 2 years ago

🤔 How to fine-tune an Imitation Learning policy (e.g., Diffusion Policy, ACT) with RL? As an RL practitioner, I’ve been struggling with this problem for a while. Here’s why it’s tough: 1️⃣ Special designs (usually for multimodal action distributions) in modern IL models make them non-trivial to fine-tune by RL. 2️⃣ Large policy models + RL's poor sample efficiency = a nightmare But finally, we figured out a simple solution that works for any model architecture! 🌟 Check out our #ICLR2025 paper: “Policy Decorator: Model-Agnostic Online Refinement for Large Policy Models”, led by my amazing mentee Xiu Yuan. 🔗 🧵 Read more below!

Tongzhou Mu 🤖🦾🦿

16,959 views • 1 year ago

Model-Free Reinforcement Learning (MFRL) has been alluring, especially with supercharged compute with physics on GPU. However, the methods use 0-th order gradients, and are often not the best optimizers. Can we do better than PPO in continuous control for robotics? Turns out yes! 🥳 tl;dr: Faster, better RL than PPO in continuous control 💪 The answer lies in using more information from the simulation. We are juicing the simulation on GPU as it is, so why not use it for gradients as well? This has been a driving question in a series of our works. We first studied this problem in our ICLR 2022 paper on Short Horizon Actor Critic (SHAC). Naive gradient-based methods get stuck in local minima and suffer from exploding/vanishing gradients. SHAC solved this problem with truncated rollouts and model-based value estimation, where the model is a differentiable simulator. This boosted sample efficiency and wall-clock time immensely, especially in high-dimensional systems such as humanoids. Yet, given enough compute, PPO often caught up. Our follow-up paper on Adaptive Horizon Actor Critic (AHAC) at ICML 2024 discovers the cause and provides a fix. However, we find that even when given ground-truth dynamics, not all gradients are useful due to sample error. 1st-Order Model-Based Reinforcement Learning methods employing differentiable simulation provide gradients with reduced variance but are susceptible to bias in scenarios involving stiff dynamics, such as physical contact. We find that back-propagating through contact and long trajectories drastically reduces gradient accuracy. Using this insight, we propose AHAC to dynamically adapt its roll-out horizon to avoid differentiating through stiff contact. AHAC is a first-order model-based RL algorithm that learns high-dimensional tasks in minutes (wall clock) and outperforms PPO by 40%, even in the limit of data provided to PPO.
This work is led by Ignat Georgiev alongside Krishnan Srinivasan, Jie Xu and Eric Heiden, with ample assistance from the Warp team at NVIDIA Robotics (Miles Macklin).
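
The contrast between 0-th order and first-order gradients can be seen on a toy objective. The sketch below is not PPO, SHAC or AHAC: it assumes a one-parameter quadratic loss and uses an antithetic Gaussian-perturbation estimator as one common example of a zeroth-order method, with illustrative names and hyperparameters throughout.

```python
import random


def loss(theta: float) -> float:
    """Toy differentiable objective standing in for a simulated rollout."""
    return (theta - 3.0) ** 2


def analytic_grad(theta: float) -> float:
    """First-order gradient, as a differentiable simulator could provide."""
    return 2.0 * (theta - 3.0)


def zeroth_order_grad(f, theta: float, sigma: float = 0.1,
                      samples: int = 10_000, seed: int = 0) -> float:
    """Estimate df/dtheta from function evaluations only.

    Uses antithetic Gaussian perturbations: average of
    (f(theta + sigma*eps) - f(theta - sigma*eps)) / (2*sigma) * eps.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        eps = rng.gauss(0.0, 1.0)
        total += (f(theta + sigma * eps) - f(theta - sigma * eps)) / (2.0 * sigma) * eps
    return total / samples


exact = analytic_grad(0.0)               # one evaluation of the derivative
estimate = zeroth_order_grad(loss, 0.0)  # 20,000 evaluations of the loss
print(exact, estimate)
```

On this smooth toy problem the sampled estimate only approaches the exact gradient after many loss evaluations, which is the sample-efficiency gap the post refers to; the contact-induced bias that AHAC addresses does not appear in such a smooth example.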

Animesh Garg

52,300 views • 2 years ago

The Hidden Language of Diffusion Models (paper page). We tackle the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts such as "a president" or "a composer" are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness", combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulation.
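
A "sparse weighted combination of tokens" can be sketched in a few lines. The code below is an illustration only: the paper learns the weights by optimizing an image-reconstruction objective, whereas here the weights, the two-dimensional embeddings and the top-k sparsification in `sparse_pseudo_token` are all made-up assumptions.

```python
def sparse_pseudo_token(weights: dict[str, float],
                        embeddings: dict[str, list[float]],
                        k: int) -> tuple[list[str], list[float]]:
    """Keep the k largest weights and return (kept words, weighted embedding sum)."""
    kept = sorted(weights, key=weights.get, reverse=True)[:k]
    dim = len(next(iter(embeddings.values())))
    token = [0.0] * dim
    for word in kept:
        for i, value in enumerate(embeddings[word]):
            token[i] += weights[word] * value
    return kept, token


# Hypothetical vocabulary embeddings and weights for the concept "happiness".
embeddings = {"family": [1.0, 0.0], "laughter": [0.0, 1.0], "table": [1.0, 1.0]}
weights = {"family": 0.6, "laughter": 0.3, "table": 0.05}

kept, token = sparse_pseudo_token(weights, embeddings, k=2)
print(kept, token)
```

The kept words are what makes the decomposition interpretable: the pseudo-token is explained by a short list of vocabulary entries and their weights.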

AK

41,746 views • 3 years ago

Did you know that babies cry with an accent? This week I’ll be focusing on language acquisition during infancy. Yesterday we talked about how newborns enter the world already recognizing the distinct tones of their mothers’ voices, as well as the rhythms and patterns of what will become their native language. And that recognition is not only receptive (the language we hear and understand), but expressive (the language we speak). Which leads us to crying with an accent. A study that compared the cries of French and German infants identified distinct patterns in their cries - measured as early as the day after their births - that mirror the unique melodies (or prosody) of their native languages. The French infants ended their cries with a lilt that mirrored French, while German infants began their cries with intensity then dropped off at the end, mirroring the vocalizations of their German parents. It’s still more evidence that some basic forms of language learning are influenced by children’s experience (learning) in utero. When it comes to our earliest vocalizations, newborns rely heavily on cries - which often tend to be a bit screechier than they will become in time - alongside gurgles, grunts and squeaks (like those shown in the video here). But in just a matter of weeks these basic sounds make way for a much broader array of vocalizations. We’ll look at a few of those tomorrow. This new arrival was shared to TT by

Dan Wuori

115,574 views • 1 year ago

I've heard this a lot recently: "We trained our robot on one object and it generalised to a novel object - these new VLA models are crazy!"
Let's talk about what's actually happening in that "A" (Action) part of your VLA model. The Vision and Language components? They're incredible. Pre-trained on internet-scale data, they understand objects, spatial relationships, and task instructions better than ever. But the Action component? That's still learned from scratch on your specific robot demonstrations.
Here's the reality: Your VLA model has internet-scale understanding of what a screwdriver looks like and what "tighten the screw" means. But the actual motor pattern for "rotating wrist while applying downward pressure"? That comes from your 500 robot demos.
What this means for "generalisation":
• Vision generalisation: Recognises novel objects instantly (thanks to pre-training)
• Language generalisation: Understands new task instructions (thanks to pre-training)
• Action generalisation: Still limited to motor patterns seen during robot training
Ask that same robot to "unscrew the bottle cap" and it fails because:
• Vision: Recognises bottle and cap
• Language: Understands "unscrew"
• Action: Never learned the "twist while pulling" motor pattern
The hard truth about VLA models: The "VL" gives you incredible zero-shot understanding. The "A" still requires task-specific demonstrations. We've cracked the perception and reasoning problem. We haven't cracked the motor generalisation problem.

Stephen James

51,356 views • 11 months ago

The term "continual learning" has become overloaded if you see it as an ML problem. One classic thread is about memorization: regularization-based continual learning methods, such as EWC, MAS, and SI, estimate which parameters mattered for previous tasks and resist changing them too much. One modern thread is about adaptation: test-time training and inference-time learning methods, such as TTT, adapt part of the model on the incoming test stream before making predictions. These are sometimes discussed as separate threads. But in modern scalable architectures, I think they are better seen as complementary constraints: a model that learns quickly at test time also benefits from a mechanism for deciding what not to forget. In our #ECCV2026 paper, we study this in large-scale 4D reconstruction: how to build fast spatial memory that can adapt over long observation streams while reducing collapse and forgetting. Instead of using fully plastic test-time updates, we stabilize fast-weight adaptation with an elastic prior that balances adaptation and memory. Key ideas: - Elastic Test-Time Training: Fisher-weighted consolidation for fast-weight updates - EMA anchor weights that provide a moving reference for stability - Chunk-by-chunk inference for long 3D/4D observation streams We show that this scales across large 3D/4D pretraining settings, including both LRM-style and LVSM-style models, and improves reconstruction across benchmarks including Stereo4D, NVIDIA, and DL3DV-140. We release model checkpoints across different design choices: resolution, post-training curriculum, and whether the model uses an explicit 4DGS intermediate representation. - Homepage: - Paper: - Code: - Models: This work is co-led with Xueyang Yu, contributed by Haoyu Zhen Yuncong Yang, and advised by Michigan SLED Lab Chuang Gan.
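
The combination of a Fisher-weighted penalty and an EMA anchor can be sketched as a single update rule. This is not the paper's algorithm: it borrows the EWC-style quadratic penalty (lam/2) * F_i * (w_i - anchor_i)^2 and a standard exponential moving average, and the function name `elastic_step` and all hyperparameter values are assumptions for illustration.

```python
def elastic_step(w: list[float], grad: list[float], anchor: list[float],
                 fisher: list[float], lr: float = 0.1, lam: float = 1.0,
                 beta: float = 0.9) -> tuple[list[float], list[float]]:
    """One test-time update with an elastic pull toward a moving anchor.

    Each weight follows the task gradient plus the gradient of the penalty
    (lam / 2) * fisher_i * (w_i - anchor_i) ** 2; the anchor then tracks the
    weights through an exponential moving average.
    """
    new_w = [wi - lr * (gi + lam * fi * (wi - ai))
             for wi, gi, ai, fi in zip(w, grad, anchor, fisher)]
    new_anchor = [beta * ai + (1.0 - beta) * wi for ai, wi in zip(anchor, new_w)]
    return new_w, new_anchor


# Two weights receive the same task gradient; only the second is marked
# important (high Fisher value), so it is held closer to the anchor.
w, anchor = [2.0, 2.0], [1.0, 1.0]
new_w, new_anchor = elastic_step(w, grad=[-1.0, -1.0], anchor=anchor, fisher=[0.0, 5.0])
print(new_w, new_anchor)
```

With a Fisher value of zero the weight moves freely with the gradient, while the high-Fisher weight is pulled back toward the anchor, which is the adaptation-versus-memory balance the post describes.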

Martin Ziqiao Ma

33,411 views • 1 month ago

Introducing VL-JEPA: Vision-Language Joint Embedding Predictive Architecture for streaming, live action recognition, retrieval, VQA, and classification tasks with better performance and higher efficiency than large VLMs. • VL-JEPA is the first non-generative model that can perform general-domain vision-language tasks in real time, built on a joint embedding predictive architecture. • We demonstrate in controlled experiments that VL-JEPA, trained with latent space embedding prediction, outperforms VLMs that rely on data space token prediction. • We show that VL-JEPA delivers significant efficiency gains over VLMs for online video streaming applications, thanks to its non-autoregressive design and native support for selective decoding. • We highlight that our VL-JEPA model, with a unified model architecture, can effectively handle a wide range of classification, retrieval, and VQA tasks at the same time. by Delong Chen (陈德龙) Mustafa Shukor Théo Moutakanni Willy Jade Lei Yu Tejaswi Kasarla Allen Bolourchi Yann LeCun Pascale Fung

Pascale Fung

90,144 views • 7 months ago

As a newly appointed Assistant Professor at Imperial College London, I'm thrilled to announce the Safe Whole-body Intelligent Robotics Lab (SWIRL) at Imperial College London.
SWIRL is a new research lab focused on the intersection of safety and intelligence in next-generation robotics. We're hiring exceptional PhD students who are passionate about pushing the boundaries of robot learning.
What makes SWIRL unique? We operate at the exciting convergence of:
• Online & offline reinforcement learning
• Imitation learning & human demonstrations
• Sample-efficient learning methods
• Whole-body and soft robotics systems
We're looking for prospective PhD students interested in:
• Developing safe exploration algorithms for robotic systems
• Creating sample-efficient learning methods that minimize real-world trials
• Building foundation models for robotics with safety guarantees
• Advancing soft robotics and compliant human-robot interaction
• Bridging theory and practice in embodied AI
Why now? As robots become more capable and work closer with humans, we need systems that are both intelligent enough to handle complex tasks AND safe enough for real-world deployment. Traditional approaches treat safety and intelligence as competing priorities; we believe they're synergistic. If you're a motivated researcher who wants to develop the theoretical foundations and practical algorithms for tomorrow's safe, intelligent robots, I'd love to hear from you. Want to join? Apply via

Stephen James

16,605 views • 10 months ago

I believe solving robotics = 90% engineering + 10% research vision. Project GR00T is NVIDIA's moonshot initiative to build physical AGI for humanoid robots. The GEAR Lab is assembling a crack team right now. Join us! Openings: - Sr. Research Engineer, Robotics Systems - Sr. RE, Reinforcement Learning - Sr. RE, Foundation Model Training Infrastructure - Sr. RE, Simulation - Sr. RE, ML Data Pipelines - Research Scientist - Research Intern (both part-time and summer full-time in 2025) For the Sr. positions, we strongly prefer candidates with many years of engineering experience at robotics/autonomous driving companies, or MLOps/large-scale AI teams at big techs. For interns, we welcome ace robotics hackers anywhere! Show me your past works. Job links in the thread. Apply today! Your resumes will be my best Christmas gifts:

Jim Fan

103,177 views • 1 year ago

Fine-tune DeepSeek-OCR on your own language! (100% local) DeepSeek-OCR is a 3B-parameter vision model that achieves 97% precision while using 10× fewer vision tokens than text-based LLMs. It handles tables, papers, and handwriting without killing your GPU or budget. Why it matters: Most vision models treat documents as massive sequences of tokens, making long-context processing expensive and slow. DeepSeek-OCR uses context optical compression to convert 2D layouts into vision tokens, enabling efficient processing of complex documents. The best part? You can easily fine-tune it for your specific use case on a single GPU. I used Unsloth to run this experiment on Persian text and saw an 88.26% improvement in character error rate. ↳ Base model: 149% character error rate (CER) ↳ Fine-tuned model: 60% CER (57% more accurate) ↳ Training time: 60 steps on a single GPU Persian was just the test case. You can swap in your own dataset for any language, document type, or specific domain you're working with. I've shared the complete guide in the next tweet - all the code, notebooks, and environment setup ready to run with a single click. Everything is 100% open-source!
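
For readers unfamiliar with the metric, character error rate is conventionally the edit distance between the model output and the reference, divided by the reference length, which is why it can exceed 100% when the output contains many spurious characters. The sketch below is a generic implementation, not the evaluation code from the post; the function names are illustrative, and the last lines simply apply two common ways of comparing the rounded 149% and 60% figures quoted above.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # delete r
                           cur[j - 1] + 1,            # insert h
                           prev[j - 1] + (r != h)))   # substitute or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error that was removed."""
    return (before - after) / before


print(cer("kitten", "sitting"))   # some errors, below 100%
print(cer("ab", "xxxxx"))         # output longer than reference, above 100%

# Rounded figures from the post: 149% CER before, 60% CER after.
absolute_gap = 1.49 - 0.60
relative_gain = relative_reduction(1.49, 0.60)
print(absolute_gap, relative_gain)
```

On the rounded figures, the absolute gap is about 89 percentage points and the relative reduction is roughly 60%; the exact percentages in the post presumably come from unrounded values.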

Akshay 🚀

126,122 views • 8 months ago
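
The character error rate (CER) quoted in the DeepSeek-OCR post above is conventionally the character-level edit distance between the model output and the reference text, divided by the reference length. The sketch below is a minimal pure-Python version under that assumption; the function names `levenshtein` and `cer` are illustrative and are not taken from the post, from Unsloth, or from DeepSeek-OCR.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hyp` into `ref`."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i reference chars with an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference char missing from hypothesis
                curr[j - 1] + 1,     # extra char in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


perfect = cer("abc", "abc")        # no edits needed
runaway = cer("abc", "abcdefgh")   # five extra characters against three reference characters
```

Because the denominator is the reference length, a model that emits many spurious characters can score above 100%, which is how a base-model figure such as 149% is possible.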

Uber is Dead, my reflections on Waymo
I’ve been in San Francisco for just over a week, during which I’ve taken 7 rides with Waymo, a similar number with Uber, and a few with FSD Teslas. My journey to SFO via Uber was alarming: the driver veered out of the lane multiple times and nearly crashed on a ramp, seemingly vying for a one-star rating or to genuinely scare me. Conversely, my experiences with Waymo were virtually flawless, if you don’t consider overly cautious driving a fault. I experienced a minor hiccup when we got stuck behind parked cars because the vehicle thought they were queuing at a red light. It quickly resolved the confusion and moved on, which was rather amusing.
Waymo, and other Level 4 autonomous vehicles, are poised to revolutionize the movement of people and goods. The most apt analogy I can think of is that Waymo is transforming the real world into an automated Amazon warehouse, with people as the goods and Waymo vehicles as the robots shuttling them around. With the advent of personal transportation becoming incredibly affordable, sending anything from point A to point B using a self-driving electric vehicle will soon be within easy reach.
One of Waymo’s standout features is privacy. Riding in an Uber often means being subjected to the driver’s loud group chats on some app, making the journey neither quiet nor private. In contrast, Waymo offers a fully private experience, allowing you to have confidential phone conversations or chat freely with fellow passengers without distraction.
Waymo also reimagines the concept of a car. Without the need for a driver, we can eliminate the front console, reduce weight, and remove the steering wheel. This opens up possibilities for passenger seats to be reoriented, perhaps facing backwards, or for the vehicle to become a mobile living room. Tomorrow’s vehicle designs will differ drastically from today’s.
Destinations that are currently expensive and logistically complicated to reach via Taxi/Uber, often lying outside public transport routes, can be simplified to a single “Waymo” journey. This could shift the current model of “Uber + public transport + Uber” to a more streamlined experience. As more cars become self-driving, we could see a reduction in the amount of time cars are parked—from 99% of their lifetime to perhaps just 25%. This not only improves unit economics but could also decrease the number of cars on the road. This transition represents one of the most significant shifts for Generation X. In conclusion, the future is autonomous, electric, and efficient. Uber, as we know it, is dead.

Linus ✦ Ekenstam

6,100,720 views • 2 years ago

Learning is something you and your baby do together. You can think of the process as happening in three distinct stages, during which skills are transferred gradually from you to your little one: During the first stage, your baby is observing the behavior and skills of others. During the second, they begin to emulate these behaviors - and can find success with the support of a helpful adult (you) or more expert peer (often a sibling). And gradually they internalize these skills and perform them all by themselves. This video is a great example of the shared second phase. Infants explore the world with their mouths. But an important lesson of toddlerhood is that some things are for putting in our mouths, while others are not. This little one knows that we don’t eat the Play-Doh. But it sure is tempting! Watch as he breaks off a piece and brings it to his mouth. As he does his eyes lift and he realizes that Mom is watching - which alone prompts some introspection. He grins broadly, shakes his head and exclaims “No, no, no” - using Mom’s past words to affirm his decision to place the Play-Doh back on the table. Left to his own devices, who knows? But together, without exchanging a word, he managed to make the right choice. As a parent it’s important to remember the key role you play in the learning process. And that extends to your child’s behavior. Self-regulation begins as co-regulation. So be there. This sweet little guy was shared to IG by parentosa.

Dan Wuori

75,169 views • 2 years ago

[Most robots react. This one thinks a step ahead.]
Ant Group's Robbyant just published LingBot-VA 2.0, a video-action foundation model built from scratch for robot control, not fine-tuned from a video generator. The usual approach takes a video generator made for content creation and bolts a robot policy onto it. LingBot-VA 2.0 argues that's the wrong starting point, and pretrains the whole causal stack natively instead.
What stands out:
→ Foresight Reasoning: the robot predicts the next action chunk while executing the current one, then overwrites the imagined frame with the real observation. Prediction and execution stop waiting on each other.
→ 927 ms → 142 ms per chunk, across four cumulative optimizations. That lifts asynchronous control from 35 Hz to 225 Hz, a 6.5× speedup.
→ One shared latent space. A semantic visual-action tokenizer puts world states and actions in the same coordinates, so unlabeled web video carries action-relevant signal.
→ Sparse MoE video stream: 128 experts, top-8 routing. Roughly 2.5B of ~15.3B parameters fire per token.
→ Few-shot by design: adapts from 10–15 demonstrations, and a human demo video can replace the text instruction entirely.
Full breakdown: Paper: Project Page: Robbyant Ant Group

Marktechpost AI

196,499 views • 19 days ago
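
The "128 experts, top-8 routing" detail in the LingBot-VA 2.0 post above refers to sparse mixture-of-experts gating: a router scores every expert for each token, and only the highest-scoring few are run. The sketch below shows one common form of top-k gating in plain Python. It is an illustration only; the post does not describe LingBot-VA 2.0's actual router, and the function name `top_k_route` and the random logits are assumptions made for the example.

```python
import math
import random


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Return {expert_index: gate_weight} for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the selected logits only, so the k gate weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # router scores for one token
gates = top_k_route(logits, k=8)                        # only these 8 experts would run
expert_fraction = len(gates) / len(logits)              # 8 / 128 = 0.0625
```

With 128 experts and k = 8, each token visits 6.25% of the experts. The post's figure of roughly 2.5B active out of ~15.3B parameters is about 16%, which is higher; a plausible reason is that non-expert layers run for every token, but the post does not say.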

THE KILLER of an MBBS Student & 2 other citizens The accused is Golden Sahani, reportedly a close aide of Minister Sanjay Nishad, and is said to be involved in the property business & local politics in Gorakhpur. On Holi, allegedly under the influence of alcohol, he rammed his Fortuner into three MBBS students, crushing them. One of them, Aakash Pandey, a 3rd-year MBBS student, died on the spot. When the police arrested him, he was seen walking arrogantly with his hands in his pockets, perhaps confident that nothing serious will happen to him. People believe that because of his political connections, he may walk out within a month or two and return to public life as if nothing happened. But what about the parents who lost their only son? A young, promising medical student lost his life in a moment of reckless drunken driving. Shockingly, a video from the same day also surfaced in which he was seen abusing and threatening a woman but the police allegedly failed to take timely action. If timely action had been taken by the police earlier, perhaps a bright young life would not have been lost today. Strict action must be taken so that such tragedies are never repeated #MedTwitter PMO India Yogi Adityanath UP POLICE #JusticeForAakashPandey

Indian Doctor🇮🇳

37,613 views • 4 months ago

🚨 SCIENTISTS JUST BUILT A CHIP THAT CAN SEE, THINK, AND REMEMBER ALL AT THE SAME TIME. And it works more like a biological brain than a traditional computer.
Researchers at RMIT University have created a neuromorphic vision chip that mimics the human eye and brain. Unlike conventional systems that capture images and send data to external processors, this chip performs sensing, processing, and memory storage directly where the light hits. The active layer is thousands of times thinner than a human hair. It uses doped indium oxide to detect light, process the information on-chip, and retain what it sees over time without constant electrical refreshing.
Why this matters:
• It dramatically cuts energy use and latency by eliminating data transfer to separate processors
• Enables much faster real-time decision making for autonomous systems
• Works more like biological vision than traditional machine vision
• Could power the next generation of efficient edge AI in vehicles, robots, and remote sensors
The deeper implication: For decades, we’ve built vision systems by bolting cameras, processors, and memory together like separate organs. This chip collapses those functions into one biological-style unit. It’s a step toward machines that don’t just “see” but actually perceive and remember in a more efficient, brain-like way. If scaled successfully, it could become a foundational component for autonomous systems that need to operate intelligently with minimal power and minimal delay.
We’re moving from cameras that take pictures to chips that truly see.
How do you think neuromorphic vision chips like this will change what’s possible for self-driving cars and autonomous robots?
Follow for more frontier neuromorphic computing, AI hardware, and brain-inspired technology.

TheNewPhysics

23,196 views • 1 month ago

AI Is Moving Beyond “Generating Videos”: Toward “Generating Worlds”
Over the past two years, AI video models have advanced at an astonishing pace. From Runway and Pika to Sora and Veo, AI-generated videos have become increasingly realistic and more consistent with the physical laws of the real world. Many people believe the next objective is simply to generate videos that are longer, sharper, and more lifelike. But if we take a step back, we can see that the real transformation is not happening in video itself. It is happening in world models.
What Is a World Model?
In 1943, psychologist Kenneth Craik proposed an idea that would influence artificial intelligence research for decades. He argued that the human brain does not merely react to the outside world. Instead, it maintains an internal model of how the world works. Because we have this internal model, we can predict the outcome of an action before we actually take it. Before crossing a road, we estimate whether a car will pass by. Before catching a ball, we predict its trajectory. These abilities come from continuously simulating the world in our minds, rather than relying entirely on trial and error. This idea later became known by a more formal term: World Model. A world model does not describe a single image or a fixed video clip. It is an internal representation capable of continuously simulating the rules and dynamics of the real world.
Why Is AI Research Turning Toward World Models?
Because predicting “what comes next” is becoming increasingly central to how AI systems work. Language models predict the next token. Image models predict the next step in the denoising process. Video models predict the next frame. A world model, however, attempts to predict something broader: What should the world look like in the next moment?
In 2018, David Ha and Jürgen Schmidhuber proposed in their paper World Models that an intelligent agent could first learn a model of the world, and then use that internal model to plan its actions. The Dreamer series later demonstrated that many complex tasks could be learned by training agents inside an “imagined world.” At the same time, the development of video models such as Sora and Veo led researchers to another realization: A model capable of continuously generating video has already learned, at least implicitly, many of the rules governing the real world. As a result, these two research directions have gradually begun to converge.
But Video Is Not Yet a World
This is where the distinction is often misunderstood. For a world model to support meaningful real-time interaction, it must solve several critical problems. Most video models today are essentially answering one question: What should the next frame look like? A true world model needs to answer much more: What happens if I take one step forward? If I walk behind a building and then return, will the building still be there? If I suddenly change the camera angle, will the entire space remain consistent? If I enter a command such as “Summon a dragon,” will the world respond immediately? In other words, a world model must do more than generate content. It must understand space. It must understand time. It must understand causality. And it must understand interaction. Moving from watching to participating is where the real difficulty of world models begins.
World Models Are Entering the Interactive Era
One of the latest attempts in this direction is Alaya World, recently open-sourced by Alaya World, or Alaya Lab. Instead of generating a fixed video clip, it generates a world that users can explore in real time. Users can begin with text, an image, or a video, enter the generated scene, move freely through it, and introduce new prompts at any moment during generation. The world responds immediately.
According to the publicly released information, Alaya World provides:
- Real-time streaming generation at 720p and 24 FPS
- Stable continuous exploration for more than one minute
- The ability to switch prompts and trigger skills or events during generation
- Model weights and inference code released under the Apache 2.0 License
- Training code and datasets planned for future release
What makes these capabilities important is not simply the technical specifications. It is that the generated “world” can now support continuous interaction. The official demo shows that users can genuinely control, transform, and explore the generated environment.
AI Is Evolving From a Tool Into an Environment
Over the past few years, most discussions around AI have focused on content generation. Generating text. Generating images. Generating videos. But world models raise a fundamentally different question: Can AI generate an environment that people can inhabit, explore, and continuously evolve? If the answer is yes, the impact will extend far beyond video generation. Game development, robotics training, embodied intelligence, digital twins, virtual production, and many other fields could be transformed by the development of world models.
World models are still at a very early stage. Yet from Craik’s proposal of an internal mental model more than eighty years ago to the emergence of today’s interactive world-generation systems, a clear evolutionary path is beginning to take shape. Perhaps what AI is ultimately learning has never been limited to images, videos, or language. Perhaps it is learning the world itself.
References
GitHub:
Technical Report:

雪踏乌云

112,114 views • 14 days ago

Brilliant open-source tool from Anthropic
Visualize internal LLM pathways through open attribution graphs.
→ Attribution graphs reveal which neurons and features drive each output token, turning the LLM’s black box into a clear map.
They open-sourced a library that generates attribution graphs to trace how LLMs arrive at outputs. These graphs reveal internal activation flows and feature influences in popular open-weights models.
🧠 Attribution Graphs
Graphs link model activations to output tokens. Nodes represent neurons or features. Edges show influence strength. Researchers inspect paths to see which components drive each decision.
🛠️ Toolchain and Frontend
Circuit-tracer library works with models like Gemma-2-2b and Llama-3.2-1b. Demo notebooks illustrate multi-step reasoning examples. Neuronpedia hosts an interactive UI for graph exploration, annotation, and sharing.
🔍 Hypothesis Testing
Users tweak feature values in graphs, regenerate outputs, and observe changes. This lets researchers validate whether specific circuits control behaviors like language translation or reasoning chains.

Rohan Paul

27,197 views • 1 year ago
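
The attribution-graph post above describes nodes as features or neurons, edges as influence strengths, and hypothesis testing as tweaking a feature and observing the change. A toy data structure can make that concrete. The sketch below is not the circuit-tracer API: the class `AttributionGraph`, its methods, the node names, and the edge weights are all hypothetical. It treats influence as linear, so the total influence between two nodes is the sum over paths of the product of edge weights.

```python
from collections import defaultdict


class AttributionGraph:
    """Toy directed acyclic graph with weighted influence edges."""

    def __init__(self) -> None:
        self.edges: dict[str, dict[str, float]] = defaultdict(dict)

    def add_edge(self, source: str, target: str, weight: float) -> None:
        # Re-adding an existing edge overwrites its weight.
        self.edges[source][target] = weight

    def influence(self, source: str, target: str) -> float:
        """Sum over all source-to-target paths of the product of edge weights.
        Assumes the graph has no cycles."""
        if source == target:
            return 1.0
        return sum(weight * self.influence(nxt, target)
                   for nxt, weight in self.edges.get(source, {}).items())


g = AttributionGraph()
g.add_edge("token:Dallas", "feature:Texas", 0.8)   # made-up weights
g.add_edge("feature:Texas", "logit:Austin", 0.5)
g.add_edge("token:Dallas", "logit:Austin", 0.1)    # direct path

before = g.influence("token:Dallas", "logit:Austin")  # two paths contribute

# Crude stand-in for an intervention: silence the intermediate feature.
g.add_edge("token:Dallas", "feature:Texas", 0.0)
after = g.influence("token:Dallas", "logit:Austin")   # only the direct path remains
```

Comparing `before` and `after` mirrors the hypothesis-testing loop in the post: if zeroing a feature removes most of the influence on an output, that feature is a candidate for the circuit behind it. Real tools intervene on the model and regenerate outputs instead of editing edge weights.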

Trained on zero real-world data. Learned to walk, pick up boxes, and follow multi-step instructions... in the REAL world. ( 📌 Paper below) Researchers from Amazon FAR, Berkeley, Stanford, and CMU scanned real rooms with an iPhone, rebuilt them as 3D Gaussian Splatting scenes, then generated 48,000 synthetic trajectories of a Unitree G1 walking, grasping, and placing objects inside those virtual replicas. They rendered the robot's first-person camera view from each run and paired it with the matching language instruction and motion data. That's the dataset every humanoid team needs and nobody has: synced egocentric video + language + kinematics, at scale. Instead of collecting it in the real world, they manufactured it. They trained a vision-language-kinematics policy on that synthetic data alone, then deployed it on the physical G1 across five task types: navigation to a named object, lifting boxes of three different sizes with no per-size tuning, chained multi-step tasks, robustness to mid-task layout changes and flickering lights, and multi-minute long-horizon runs. No real-world fine-tuning at any point. Real-world interaction data has been the hard limit on humanoid learning... slow, expensive, and small. If scanning a room once and synthesizing thousands of labeled interactions holds up as a general recipe, that limit moves. Data stops being the bottleneck robotics teams have to solve for. 📌 Paper: Project: ——- Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

12,950 views • 9 days ago