Uploaded: 2025-12-15T22:43:31.000Z
Duration: PT111.466S
Channel: Pascale Fung

Introducing VL-JEPA: Vision-Language Joint Embedding Predictive Architecture for streaming,... live action recognition, retrieval, VQA, and classification tasks with better performance and higher efficiency than large VLMs. • VL-JEPA is the first non-generative model that can perform general-domain vision-language tasks in real-time, built on a joint embedding predictive architecture. • We demonstrate in controlled experiments that VL-JEPA, trained with latent space embedding prediction, outperforms VLMs that rely on data space token prediction. • We show that VL-JEPA delivers significant efficiency gains over VLMs for online video streaming applications, thanks to its non-autoregressive design and native support for selective decoding. • We highlight that our VL-JEPA model, with an unified model architecture, can effectively handle a wide range of classification, retrieval, and VQA tasks at the same time. by Delong Chen (陈德龙) Mustafa Shukor Théo Moutakanni Willy Jade Lei Yu Tejaswi Kasarla Allen Bolourchi Yann LeCun Pascale Fungshow more

Pascale Fung

90,144 次观看 • 7 个月前

Our vision is for AI that uses world models... to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️show more

AI at Meta

310,120 次观看 • 1 年前

Today we’re releasing V-JEPA, a method for teaching machines... to understand and model the physical world by watching videos. This work is another important step towards Yann LeCun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and accomplish complex tasks. Details ➡️ We're releasing a collection of V-JEPA vision models trained with a feature prediction objective using self-supervised learning. The models are able to understand and predict what is going on in a video, even with limited information. It learns by predicting missing or obscured parts of a video in its internal feature space. Unlike generative approaches that fill in missing pixels, this flexible approach enables up to 6x improvements in training and sample efficiency. The models were pre-trained on entirely unlabeled data, and a small amount of labeled data can be used to train a task-specific prediction head on top after pre-training. Our results show that, using a frozen backbone, our top V-JEPA models achieve 82.0% on Kinetics-400, 72.2% on Something-Something-v2 and 77.9% on ImageNet1K — competitive with or exceeding previous leading video models. We believe that this work is an important milestone on the path to advancing machine intelligence.show more

AI at Meta

703,801 次观看 • 2 年前

3D-LLM: Injecting the 3D World into Large Language Models... paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.show more

249,708 次观看 • 3 年前

MotionGPT: Human Motion as a Foreign Language paper page:... Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.show more

125,319 次观看 • 3 年前

Pretraining is essential for good performance on a wide... variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like quesiton answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!show more

RoboPapers

23,905 次观看 • 4 个月前

Google presents AudioPaLM: A Large Language Model That Can... Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.show more

290,517 次观看 • 3 年前

Check out our #ICRA2024 paper "Actor-Critic Model Predictive Control."... Model-free #reinforcementlearning (RL) is known for its strong task performance and flexibility in optimizing general reward formulations. On the other hand, #ModelPredictiveControl (MPC) benefits from robustness and online replanning capabilities. We combine both approaches by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an Actor-Critic RL framework. The proposed approach leverages the short-term predictive optimization capabilities of MPC with the exploratory and end-to-end training properties of RL. The resulting policy effectively manages both short-term decisions through the MPC-based actor and long-term prediction via the critic network, unifying the benefits of both model-based control and end-to-end learning. We validate our method in simulation and the real world with a quadcopter across various high-level tasks. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out-of-distribution behavior. Paper: Full Video with more details: Kudos to Ángel Romero, Yunlong Song IEEE ICRA University of Zurich UZH Science UZH Space Hub Aerial Core AUTOASSESS European Research Council (ERC)show more

Davide Scaramuzza

34,889 次观看 • 2 年前

Introducing PAN — MBZUAI’s New World Model for Interactive... Intelligence Developed by MBZUAI’s Institute of Foundation Models, PAN is built for simulation, prediction, and agentic reasoning. Unlike traditional video generators that only output frames, PAN maintains a persistent internal state that evolves when guided with natural language. Its Generative Latent Prediction architecture combines: • A latent encoder to capture the world state • A dynamics module that evolves that state step-by-step • A video diffusion decoder that visualizes outcomes By decoding at every step using a causal sliding-window diffusion process, PAN stays grounded in real-world physics and maintains long-horizon continuity, a leap beyond single-shot models. Evaluated on action fidelity, long-horizon stability, and simulative planning, PAN delivers state-of-the-art performance compared to open models and rivals leading commercial systems. For robotics, autonomy, and decision support, PAN is a foundation for the next wave of intelligent, foresight-driven AI.show more

MBZUAI

98,725 次观看 • 8 个月前

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs... with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.show more

41,736 次观看 • 1 年前

We are back. After one year of quiet building.... Introducing GENE-26.5, our first robotic brain that takes a major step toward human-level capability. For years, robotics has struggled to learn from the world’s largest and valuable data source: Humans. Solving it means rethinking the whole stack from the ground up: - A robotics-native foundation model. - A 1:1 human-like robotic hand. - A noninvasive data collection glove for motion, force, and touch. - A simulator that turns weeks of experiments into minutes. GENE-26.5 is trained across language, vision, proprioception, tactile, and action. We designed a set of tasks to test how far we can go with this new paradigm. Fully autonomous, 1x speed, one model, same weights. (Enjoy with sound on) We are approaching the endgame for robotics. And this is just a beginning.show more

Genesis AI

2,716,140 次观看 • 2 个月前

Check out our latest work, "Actor-Critic Model Predictive Control:... Differentiable Optimization meets Reinforcement Learning for Agile Flight," published in the IEEE Transactions on Robotics, where we reconcile #OptimalControl and #ReinforcementLearning, achieving the same super-human performance, but with superior generalizability, as our previous model-free deep RL! Code released! PDF: Code: Full Video: Model-free #ReinforcementLearning (RL) is known for its strong task performance and flexibility in optimizing general reward formulations. On the other hand, #ModelPredictiveControl (MPC) provides robustness, constraint handling, and powerful online replanning capabilities. In this work, we extend our previous AC-MPC paper (Romero, ICRA'24) by taking a deeper look at how both approaches can be unified. We introduce and extend Actor-Critic Model Predictive Control (AC-MPC), a framework that embeds a differentiable MPC inside an Actor-Critic RL architecture. This integration allows the MPC-based actor to perform short-term predictive optimization, while the critic facilitates long-horizon learning and exploration. We conduct a comprehensive study that highlights AC-MPC’s key advantages: - Better out-of-distribution generalization, both against unknown disturbances and changes in the quadrotor dynamics - Improved sample efficiency - A novel empirical analysis uncovering a relationship between the critic’s value function and the MPC cost function, providing deeper insight into their interplay. We validate our method in simulation and the real world on a quadcopter flying at superhuman speeds of up to 21 m/s, matching state-of-the-art model-free RL performance, and retaining the predictive structure of MPC for more reliable out-of-distribution behavior. Reference: Actor-Critic Model Predictive Control: Differentiable Optimization meets Reinforcement Learning for Agile Flight IEEE Transactions on Robotics (T-RO), 2025 PDF: Full Video: Code: Kudos to Ángel Romero, Elie Aljalbout, Yunlong Song! University of Zurich UZH Science UZH Space Hub AUTOASSESS European Research Council (ERC) UZHaishow more

Davide Scaramuzza

27,090 次观看 • 6 个月前

Synchronize Dual Hands for Physics-Based Dexterous Guitar Playing discuss:... We present a novel approach to synthesize dexterous motions for physically simulated hands in tasks that require coordination between the control of two hands with high temporal precision. Instead of directly learning a joint policy to control two hands, our approach performs bimanual control through cooperative learning where each hand is treated as an individual agent. The individual policies for each hand are first trained separately, and then synchronized through latent space manipulation in a centralized environment to serve as a joint policy for two-hand control. By doing so, we avoid directly performing policy learning in the joint state-action space of two hands with higher dimensions, greatly improving the overall training efficiency. We demonstrate the effectiveness of our proposed approach in the challenging guitar-playing task. The virtual guitarist trained by our approach can synthesize motions from unstructured reference data of general guitar-playing practice motions, and accurately play diverse rhythms with complex chord pressing and string picking patterns based on the input guitar tabs that do not exist in the references. Along with this paper, we provide the motion capture data that we collected as the reference for policy training.show more

26,855 次观看 • 1 年前

Every wondered if we can model motion as a... language? can we tokenize this new language? is it useful? Turns out tremendously! 🚀 In out latest #NeurIPS2024 paper on QueST: Self-Supervised Skill Abstractions for Learning Continuous Control, we find that action tokenization matters a lot! We can learn skill encodings by representing temporal action abstractions with a discrete codebook. This enables 2 things 1. Better Behaviour Cloning: we can better assimilate multi-task data (>9%) over best paper. This is currently best in class BC method! 2. generalization of this language to represent new tasks in 5-shot transfer to longer horizon tasks! Check out the thread by Atharva Mete for more details. And check out more details at: Joint work with Atharva Mete Albert Wilcox Haotian Xue Yongxin Chen Georgia Tech School of Interactive Computing Machine Learning at Georgia Tech Robotics@GT NVIDIA Roboticsshow more

Animesh Garg

26,218 次观看 • 1 年前

New short course: Build Long-Context AI Apps with Jamba.... Learn about state space models (SSMs), which have emerged as an alternative to transformers! Specifically, Jamba is a hybrid transformer-Mamba architecture that combines strengths of the transformer with ideas from SSMs. This course is built with AI21 Labs and taught by Chen Wang and Chen Almagor. The transformer architecture is computationally expensive when handling very long input contexts. But there's an alternative called Mamba, a selective state space model that can process very long contexts with a much lower computational cost. However, researchers found that the pure Mamba architecture underperforms in understanding the context, and gives lower-quality responses. To overcome this, AI21 developed the Jamba model, which combines Mamba's computational efficiency with the transformer's attention mechanism to help with the output quality. In this course, you’ll learn about how state space models, and Jamba, work. You’ll also learn how to prompt Jamba, use it to process long documents, and build long-context RAG apps. - Learn how Jamba combines transformer and state space model architectures to achieve high performance and quality - Use the AI21 SDK, with an example of prompting over a large 200k-token annual financial report of Nvidia - Use Jamba for tool-calling, with hands-on examples from calling simple arithmetic calculations to a function that returns quarterly company financial reports. - Learn how training for long context is done, and the metrics used for its evaluation - Create a RAG app using the AI21 Conversational RAG tool and build your own RAG pipeline that uses Jamba and LangChain. By the end of this course, you'll learn how to build applications that can handle context as long as an entire book. Please sign up here:show more

Andrew Ng

77,792 次观看 • 1 年前

Robots might learn better from video than from language!... 📼 Most Vision-Language-Action (VLA) models learn what to do from text, but still struggle with how things move in the real world. That makes them data-hungry and slow to train. mimic video takes a different route. Instead of grounding robot control in text, it grounds it in video, using large pre-trained video models that already capture physical motion and dynamics. The idea is straightforward: let the video model handle “what will happen next,” and let a smaller control model focus only on turning that visual plan into robot actions. The result is big gains in practice. Robots trained this way need 10× less data, converge twice as fast, and perform better on both simulated benchmarks and real bimanual manipulation tasks. If robots can “imagine” motion using video, control becomes a much simpler problem. Shoutout to Jonas Pai, Liam Achenbach, Oier Mees, Elvis Nava and the rest of the team! Here's the project page: ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →show more

Lukas Ziegler

49,920 次观看 • 7 个月前

Google just proved that bigger isn't always better. Their... 308M parameter model is outperforming models 2x its size. Google just released 𝗘𝗺𝗯𝗲𝗱𝗱𝗶𝗻𝗴𝗚𝗲𝗺𝗺𝗮, and it's proving that lightweight embedding models can punch way above their weight class. At just 308M parameters (578MB), it's the new state-of-the-art for models under 500M parameters across MTEB multilingual, English, and code benchmarks. But the really impressive part is that it ranks 8th overall on MTEB(Multilingual, v2) - that's 𝟭𝟳 𝗽𝗹𝗮𝗰𝗲𝘀 above the second-best sub-500M model, and it's delivering performance 𝗰𝗼𝗺𝗽𝗮𝗿𝗮𝗯𝗹𝗲 𝘁𝗼 𝗺𝗼𝗱𝗲𝗹𝘀 𝗻𝗲𝗮𝗿𝗹𝘆 𝗱𝗼𝘂𝗯𝗹𝗲 𝗶𝘁𝘀 𝘀𝗶𝘇𝗲. There are three key parts of their training recipe that sets it apart: 𝟭. 𝗘𝗻𝗰𝗼𝗱𝗲𝗿-𝗗𝗲𝗰𝗼𝗱𝗲𝗿 𝗜𝗻𝗶𝘁𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 Instead of starting from a decoder-only Gemma 3 model, they first adapted it to encoder-decoder, then used just the encoder. By basing EmbeddingGemma off an LLM that already has world and language understanding, it gives it a stronger starting point. 𝟮. 𝗧𝗵𝗿𝗲𝗲-𝗟𝗼𝘀𝘀 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 They combine three different loss functions, instead of just having one: • Contrastive loss (NCE) with in-batch negatives and hardness weighting • Spread-out regularization to ensure embeddings utilize the full space (for quantization and ANN retrieval) • Embedding matching distillation from Gemini Embedding - not just learning from relevance scores, but directly aligning the embedding space with the teacher model 𝟯. 𝗠𝗼𝗱𝗲𝗹 𝗦𝗼𝘂𝗽𝗶𝗻𝗴 Rather than just averaging checkpoints from the same training run, they use optimization techniques to find multiple specialized training mixtures. Each mixture creates an "expert" model in different domains, and averaging all their parameters creates a final model that's actually better than individual models. Extras: • Matryoshka embeddings supporting 768, 512, 256, and 128 dimensions • Quantization-aware training - maintains quality even at int4 precision • 100+ languages from Gemma 3 pretraining • Exceptional performance on low-resource languages (check their XTREME-UP results) Is it the absolute best embedding model? No - Gemini Embedding still leads overall. But that's not really the point. EmbeddingGemma proves you can achieve state-of-the-art performance in a small package that's actually deployable on-device, in low-latency applications, and in resource-constrained environments. This makes good embeddings accessible for use cases that I'm seeing more and more: offline applications, privacy-sensitive deployments, and high-throughput scenarios where inference cost actually matters. Full paper: Shoutout to the EmbeddingGemma team at Google DeepMind for this awesome open source work 💙 and to Daniel Williams for helping me with this video! 🫶show more

Victoria Slocum

21,592 次观看 • 8 个月前

SceNeRFlow: Time-Consistent Reconstruction of General Dynamic Scenes abs: paper... page: Existing methods for the 4D reconstruction of general, non-rigidly deforming objects focus on novel-view synthesis and neglect correspondences. However, time consistency enables advanced downstream tasks like 3D editing, motion analysis, or virtual-asset creation. We propose SceNeRFlow to reconstruct a general, non-rigid scene in a time-consistent manner. Our dynamic-NeRF method takes multi-view RGB videos and background images from static cameras with known camera parameters as input. It then reconstructs the deformations of an estimated canonical model of the geometry and appearance in an online fashion. Since this canonical model is time-invariant, we obtain correspondences even for long-term, long-range motions. We employ neural scene representations to parametrize the components of our method. Like prior dynamic-NeRF methods, we use a backwards deformation model. We find non-trivial adaptations of this model necessary to handle larger motions: We decompose the deformations into a strongly regularized coarse component and a weakly regularized fine component, where the coarse component also extends the deformation field into the space surrounding the object, which enables tracking over time. We show experimentally that, unlike prior work that only handles small motion, our method enables the reconstruction of studio-scale motions.show more

76,380 次观看 • 2 年前

Excited to announce GR00T N1, the world’s first open... foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight: - Real humanoid teleoperation data. - Large-scale simulation data: we are open-sourcing 300K+ trajectories! - Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”! - Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos. GR00T N1 is a single end-to-end neural net, from photons to actions: - Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions. - Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2. We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right. Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵show more

Jim Fan

466,148 次观看 • 1 年前

Love seeing Silico (Goodfire ) used to probe our... EchoJEPA's representations! this is exactly the kind of interpretability work that's been missing for JEPA-style models. One thing that makes EchoJEPA particularly interesting to interpret: unlike MAE-based approaches, it never reconstructs pixels. The model learns entirely in latent space through masked prediction, so you can't just look at decoder outputs to understand what it captured. Attribution onto a temporally aligned 3D mesh is a much more honest probe of what the representations actually encode. What we found in building EchoJEPA: training on 18M echo videos across 300K patients, the model learns to disentangle cardiac anatomy from ultrasound noise (speckle, reverberation artifacts) almost entirely through self-supervision. With 1% labeled data it already outperforms supervised baselines trained on 100%. The latent space is doing real anatomical work, but until you can visualize it like this, "real anatomical work" is mostly a claim. Paper + code: |show more

Bo Wang

29,452 次观看 • 2 个月前

Live Cam