Can video generative models exhibit visuospatial intelligence? 🤔 Introducing Video4Spatial — a video-only framework that tackles spatial tasks. With just video context, our model can: 🔍 Ground objects by planning geometry-consistent paths 📸 Follow camera-pose instructions for scene navigation 🌐 Generalize to long contexts & unseen outdoor scenes A...

Xingang Pan

3,326 subscribers

15,931 views • 7 months ago • via X (Twitter)

Related Videos

New research with Tsinghua University: Spatial-TTT. A framework for streaming visual-based spatial intelligence with test-time training (TTT). Spatial-TTT adapts fast weights to capture and organize spatial evidence from long video streams, enabling models to build structured 3D spatial memory over time. Highlights: 🔹Efficient streaming memory. Fast weights act as compact spatial memory with sublinear memory growth over 7000+ frames and more than 40% lower compute. 🔹Spatial-predictive mechanism. TTT layers with 3D spatiotemporal convolution capture geometric correspondence and temporal continuity. 🔹SOTA results on long-horizon video spatial understanding (VSI-Bench). The paper ranked #1 on Hugging Face Daily Papers on March 13. Project page: GitHub: Paper: Model & Data:

Tencent Hy

20,792 views • 4 months ago

Robot AI brains, aka Vision-Language-Action models, cannot adapt to new tasks as easily as LLMs like Gemini, ChatGPT, or Grok. LLMs can adapt quickly with their in-context learning (ICL) capabilities. But can we inject ICL abilities into a pre-trained VLA like pi0? Yes! Introducing RICL (Retraining for In-Context Learning), our Conference on Robot Learning (CoRL) 2025 paper. Our RICL-pi0 model can adapt to unseen objects, novel motions, and new scenes with just ICL and RAG (retrieval-augmented generation). RICL-pi0 also boosts performance on the long-tail of tasks. A quick 1 minute video summary:

Kaustubh Sridhar

52,158 views • 11 months ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,590 views • 9 months ago

SceneScript treats 3D reconstruction as a language problem rather than a geometry one. The model watches a video of a room and just learns to write a script for it. It autoregressively spits out text commands like make_wall(...) or make_bbox(...) that define the scene. Stanford's new "Scene Language" paper goes a step further adding CLIP embeddings to capture visual appearance too. The fact that language models already understand spatial relationships well enough to write out scene graphs is pretty wild.

Bilawal Sidhu

107,060 views • 1 year ago
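
The SceneScript post above describes a model that emits text commands such as make_wall(...) and make_bbox(...). As a rough illustration of what consuming such output could look like, here is a minimal Python sketch that parses command text into structured records. The call-style syntax and the parameter names (height, width, x, y) are assumptions made for this example; the post elides the real arguments, so this is not the actual SceneScript format.

```python
import re
from typing import Dict, List, Tuple

# A parsed command: (command name, {parameter name: numeric value}).
Command = Tuple[str, Dict[str, float]]

# Matches `name(arg_text)`; assumes no nested parentheses (hypothetical syntax).
_CALL = re.compile(r"(\w+)\(([^)]*)\)")


def parse_scene_script(text: str) -> List[Command]:
    """Turn command text into a list of (name, params) records."""
    commands: List[Command] = []
    for name, arg_text in _CALL.findall(text):
        params: Dict[str, float] = {}
        # Split "k=v, k=v" pairs and skip empty pieces (e.g. for `name()`).
        for pair in filter(None, (p.strip() for p in arg_text.split(","))):
            key, value = pair.split("=")
            params[key.strip()] = float(value)
        commands.append((name, params))
    return commands


if __name__ == "__main__":
    # Parameter names below are invented for illustration only.
    script = "make_wall(height=2.5, width=4)\nmake_bbox(x=1, y=2)"
    for name, params in parse_scene_script(script):
        print(name, params)
```

Because the scene is plain text, a downstream tool only needs a small parser like this to recover structured geometry records, which is part of the appeal of treating reconstruction as a language problem.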

Large-scale Gaussian splats have reached a new level of realism. This is a well-known temple in Bangkok, reconstructed as a high-fidelity 3D environment from 360 captures. At this level, the boundary between video and 3D starts to disappear. But what you’re looking at is not a video. It’s a dense spatial representation of a real place, where geometry, texture, and structure are preserved and made machine-readable. This kind of 3D data can power Visual AI, Robotics navigation, VPS localization, XR experiences, world models, and next-generation spatial computing systems. Built with Over the Reality.

Over the Reality 🌐

347,044 views • 2 months ago

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

AI at Meta

310,120 views • 1 year ago

Robot Learning needs 4D world models! Robot Learning needs 4D world models! Robot Learning needs 4D world models! We introduce TesserAct, a 4D embodied world model that can simulate how agents interact with the 3D world over time! We achieve this by simply extending a pre-trained 2D video generation model to jointly predict RGB, depth, and surface normals. It enables: 1️⃣ Much better policy learning in the wild 2️⃣ Temporal + spatial coherence in 4D dynamic prediction 3️⃣ Novel view synthesis for embodied scenes Code: Paper Link: Project page:

Chuang Gan

43,265 views • 1 year ago

Introducing PAN — MBZUAI’s New World Model for Interactive Intelligence Developed by MBZUAI’s Institute of Foundation Models, PAN is built for simulation, prediction, and agentic reasoning. Unlike traditional video generators that only output frames, PAN maintains a persistent internal state that evolves when guided with natural language. Its Generative Latent Prediction architecture combines: • A latent encoder to capture the world state • A dynamics module that evolves that state step-by-step • A video diffusion decoder that visualizes outcomes By decoding at every step using a causal sliding-window diffusion process, PAN stays grounded in real-world physics and maintains long-horizon continuity, a leap beyond single-shot models. Evaluated on action fidelity, long-horizon stability, and simulative planning, PAN delivers state-of-the-art performance compared to open models and rivals leading commercial systems. For robotics, autonomy, and decision support, PAN is a foundation for the next wave of intelligent, foresight-driven AI.

MBZUAI

98,725 views • 8 months ago

Google presents Still-Moving Customized Video Generation without Customized Video Data Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

AK

40,474 views • 2 years ago

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution paper page: Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.

AK

32,849 views • 2 years ago

Alibaba presents MIMO Controllable Character Video Synthesis with Spatial Decomposed Modeling Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, 3D works typically require multi-view captures for per-case training, which severely limits their applicability of modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle for pose generality and scene interaction. To this end, we propose MIMO, a novel framework which can not only synthesize character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video to compact spatial codes, considering the inherent 3D nature of video occurrence. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip to three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded to canonical identity code, structured motion code and full scene code, which are utilized as control signals of synthesis process. The design of spatial decomposed modeling enables flexible user control, complex motion expression, as well as 3D-aware synthesis for scene interactions. Experimental results demonstrate effectiveness and robustness of the proposed method.

AK

148,998 views • 1 year ago

📢 OneCanvas: 3D Scene Understanding via Panoramic Reprojection We extract features from video frames and reproject them into one occlusion-free view of the whole scene that a 2D VLM reads just like a normal image. We can center this view on any viewpoint, including an agent's own pose for situated reasoning. The same projection lets us create spatial training tasks with no human annotation, solvable only by reasoning over the 3D positions of real object features placed on an otherwise empty canvas. The result is a stock 2D VLM that reasons in 3D, setting a new state of the art across spatial benchmarks at far less compute. 🌐 ▶️ Great work by Bartłomiej Baranowski & Dave Zhenyu Chen

Matthias Niessner

25,038 views • 1 month ago

Learning from robot data? Standard. Direct Video-Action Models (DVA) is different: treat robot control as video generation, then translate the generated video into actions. Built by , the system pre-trains causal video models from scratch and can run complex production tasks for hours using only ~10 hours of robot data. • hundreds of frames of visual context • real-time control via causal video prediction More: The team behind it just exited 18 months of stealth with a $450M Series A at a $1.7B valuation. Founded by Jagdeep Singh (ex-QuantumScape) with a Stanford-heavy science team: CSO Eric Ryan Chan (ex-WorldLabs) and Prof. Gordon Wetzstein. Already running in large-scale automotive production environments. Backed by Vinod Khosla Ventures, Temasek, Premji Invest, and John Doerr. Thanks for sharing, Tongzhou Mu 🤖🦾🦿 👋

Ilir Aliu

26,209 views • 4 months ago

NVIDIA AI Open-Sources ViPE (Video Pose Engine): A Powerful and Versatile 3D Video Annotation Tool for Spatial AI ViPE integrates bundle adjustment with dense optical flow, sparse keypoint tracking, and metric depth priors to estimate camera intrinsics, poses, and dense depth maps at 3–5 FPS on a single GPU. It significantly improves over prior uncalibrated pose estimation methods, achieving 18% and 50% error reduction on TUM and KITTI benchmarks, respectively, and shows robustness to dynamic scenes and diverse camera models. Beyond the method, the NVIDIA team also released a large-scale dataset comprising ~100K real-world internet videos, 1M AI-generated videos, and 2K panoramic videos (≈96M frames) annotated with metric depth and poses. This dataset and engine aim to accelerate training for spatial AI tasks such as 3D reconstruction, video generation, and robotics.... full analysis: paper: codes: NVIDIA NVIDIA AI NVIDIAnewsroom NVIDIA Robotics NVIDIA AIDev NVIDIAdeveloper

Marktechpost AI Dev News ⚡

217,453 views • 10 months ago

Most AI tools give everyone access to the same generic models. The result? Everything looks the same. We have a different vision for AI creation. Introducing TITLES, a new creative studio built around AI models trained and owned by artists. In Studio, you can create with distinct visual perspectives developed by artists, across image and video — all in one place. This is the future we're building toward: not one model for everyone, but a growing network of unique styles you can build with — where artists get credited and paid as the work spreads. Enter Your Creative Studio:

TITLES

2,746,536 views • 3 months ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,708 views • 3 years ago

Glad that our work “Inference-Time Enhancement of Generative Robot Policies via Predictive World Modeling”, led by Han Qi, has been accepted to IEEE Robotics and Automation Letters! 🎉 We propose Generative Predictive Control (GPC): sample action proposals from a pretrained diffusion policy (“look back”), roll them out with a diffusion-based action-conditioned video world model (“look forward”), then rank or optimize the actions using either a learned reward model or VLM preferences. Conceptually, this is trajectory optimization / MPC with hybrid sampling + gradient optimization, interpreted through modern diffusion priors and video world models. Interestingly, we first posted the paper on arXiv in Feb 2025, when action-conditioned video world models for planning were still rare—now this direction is rapidly gaining traction. Still many open questions, e.g., • how to avoid local minima in planning • what representations work best for world models • how to balance physics priors vs. data-driven learning Paper:

Heng Yang

18,994 views • 4 months ago

Excited to finally share Generative Value Learning (GVL), my Google DeepMind project on extracting universal value functions from long-context VLMs via in-context learning! We discovered a simple method to generate zero-shot and few-shot values for 300+ robot tasks and 50+ datasets using SOTA VLMs like Gemini (Try out the demo on our website on your robot video today!) I worked a lot on leveraging foundation models as guidance for robots in my PhD, and to me, this result forges a new frontier in how we can use foundation models for robot learning, given its broad applicability independent of embodiment and task types. Quite excited about how we can build on this work as a community!

Jason Ma

98,090 views • 1 year ago

Today we’re releasing V-JEPA, a method for teaching machines to understand and model the physical world by watching videos. This work is another important step towards Yann LeCun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and accomplish complex tasks. Details ➡️ We're releasing a collection of V-JEPA vision models trained with a feature prediction objective using self-supervised learning. The models are able to understand and predict what is going on in a video, even with limited information. It learns by predicting missing or obscured parts of a video in its internal feature space. Unlike generative approaches that fill in missing pixels, this flexible approach enables up to 6x improvements in training and sample efficiency. The models were pre-trained on entirely unlabeled data, and a small amount of labeled data can be used to train a task-specific prediction head on top after pre-training. Our results show that, using a frozen backbone, our top V-JEPA models achieve 82.0% on Kinetics-400, 72.2% on Something-Something-v2 and 77.9% on ImageNet1K — competitive with or exceeding previous leading video models. We believe that this work is an important milestone on the path to advancing machine intelligence.

AI at Meta

703,801 views • 2 years ago

Shipping an experiment on top of Hermes Agent that allows an agent to steer itself. With it, a harness like desloppify can clear its own context, switch its models, prompt itself when it stops, etc. Video shows switching between Grok 4.20 ($6/m) for execution + Gemini 3.1 ($12/m) for planning + Claude ($25/m) for sense-checking - 6+ hours of refactoring w/o errors/stoppage - this can run endlessly and safeguard itself w/ various triggers as sense-checks! Desloppify v0.9.10 release notes w/ instructions to test + many more contributions by the community: Video by the wonderful Hannah Submarine - best watched with audio:

POM

46,499 views • 4 months ago