Uploaded: 2025-12-15T22:43:31.000Z
Duration: PT111.466S
Channel: Pascale Fung

Introducing VL-JEPA: Vision-Language Joint Embedding Predictive Architecture for streaming, live action recognition, retrieval, VQA, and classification tasks with better performance and higher efficiency than large VLMs.
• VL-JEPA is the first non-generative model that can perform general-domain vision-language tasks in real-time, built on a joint embedding predictive architecture.
• We demonstrate in controlled experiments that VL-JEPA, trained with latent space embedding prediction, outperforms VLMs that rely on data space token prediction.
• We show that VL-JEPA delivers significant efficiency gains over VLMs for online video streaming applications, thanks to its non-autoregressive design and native support for selective decoding.
• We highlight that our VL-JEPA model, with a unified model architecture, can effectively handle a wide range of classification, retrieval, and VQA tasks at the same time.
by Delong Chen (陈德龙) Mustafa Shukor Théo Moutakanni Willy Jade Lei Yu Tejaswi Kasarla Allen Bolourchi Yann LeCun Pascale Fung

Pascale Fung

90,033 views • 5 months ago

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots, allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

AI at Meta

309,704 views • 1 year ago

Today we’re releasing V-JEPA, a method for teaching machines to understand and model the physical world by watching videos. This work is another important step towards Yann LeCun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and accomplish complex tasks. Details ➡️ We're releasing a collection of V-JEPA vision models trained with a feature prediction objective using self-supervised learning. The models are able to understand and predict what is going on in a video, even with limited information. They learn by predicting missing or obscured parts of a video in their internal feature space. Unlike generative approaches that fill in missing pixels, this flexible approach enables up to 6x improvements in training and sample efficiency. The models were pre-trained on entirely unlabeled data, and a small amount of labeled data can be used to train a task-specific prediction head on top after pre-training. Our results show that, using a frozen backbone, our top V-JEPA models achieve 82.0% on Kinetics-400, 72.2% on Something-Something-v2 and 77.9% on ImageNet1K, competitive with or exceeding previous leading video models. We believe that this work is an important milestone on the path to advancing machine intelligence.

AI at Meta

703,412 views • 2 years ago
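
The V-JEPA post above says the models learn by predicting missing or obscured parts of a video in feature space instead of filling in pixels. The following is only a toy sketch of that idea, not the released training code: the function name `masked_feature_loss`, the use of an L1 distance, and the tiny hand-written feature vectors are all assumptions made for illustration.

```python
def masked_feature_loss(predicted, target, masked_idx):
    """Average L1 distance between predicted and target feature vectors,
    computed only at the masked positions (hypothetical helper)."""
    total = 0.0
    count = 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count


# Illustrative data: three video patches, each with a 2-D feature vector.
# `target` stands in for features of the full clip, `predicted` for what a
# predictor outputs from the visible context only (both are made up).
target = [[0.0, 0.0], [1.5, 0.5], [2.0, 4.0]]
predicted = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
masked_idx = [1, 2]  # patches hidden from the context encoder

loss = masked_feature_loss(predicted, target, masked_idx)
print(loss)
```

The point of the sketch is that the loss compares feature vectors, so no pixels are ever reconstructed.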

3D-LLM: Injecting the 3D World into Large Language Models
paper page:
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,494 views • 2 years ago

Introducing Jan-v2-VL, a multimodal agent built for long-horizon tasks. Jan-v2-VL executes 49 steps without failure, while the base model stops at 5 and other similar-scale VLMs stop between 1 and 2. It achieves longer, stable task execution in your browser without accuracy loss. 3 variants are available:
- Jan-v2-VL-low (efficiency-oriented)
- Jan-v2-VL-med (balanced)
- Jan-v2-VL-high (deeper reasoning and longer execution)
Models:
To use it, update your Jan App and download Jan-v2-VL from the Model Hub. Activate Browser MCP servers for agentic use cases. Credit to the Qwen team for Qwen3-VL-8B-Thinking base model.

👋 Jan

130,228 views • 6 months ago

LFM2-VL support with GGUF and llama.cpp 🥳 You can...

Maxime Labonne

19,947 views • 9 months ago

We release Action100M, the hero behind VL-JEPA. It is...

Delong Chen (陈德龙)

103,384 views • 4 months ago

Here's my conversation with Yann LeCun about AI, importance of open source, limits of LLMs, why AI doomers are wrong, and the path to AGI. This was a fun and fascinating technical conversation! It's here on X in full, and is up on YouTube, Spotify, and everywhere else. Links in comment.
Timestamps:
0:00 - Introduction
2:18 - Limits of LLMs
13:54 - Bilingualism and thinking
17:46 - Video prediction
25:07 - JEPA (Joint-Embedding Predictive Architecture)
28:15 - JEPA vs LLMs
37:31 - DINO and I-JEPA
38:51 - V-JEPA
44:22 - Hierarchical planning
50:40 - Autoregressive LLMs
1:06:06 - AI hallucination
1:11:30 - Reasoning in AI
1:29:02 - Reinforcement learning
1:34:10 - Woke AI
1:43:48 - Open source
1:47:26 - AI and ideology
1:49:58 - Marc Andreessen
1:57:56 - Llama 3
2:04:20 - AGI
2:08:48 - AI doomers
2:24:38 - Joscha Bach
2:28:51 - Humanoid robots
2:38:00 - Hope for the future

Lex Fridman

1,021,936 views • 2 years ago

Jan-v2-VL-Max-Instruct is out on 💛 Our newest 30B vision-language...

👋 Jan

23,063 views • 5 months ago

MotionGPT: Human Motion as a Foreign Language
paper page:
Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,311 views • 2 years ago
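
The MotionGPT abstract above mentions discrete vector quantization that turns 3D motion into "motion tokens". As a rough illustration of the general technique only (not the paper's tokenizer), the sketch below maps each continuous feature vector to the index of its nearest codebook entry. The codebook, the 2-D feature vectors, and the names `quantize` and `dequantize` are hypothetical.

```python
from math import dist


def quantize(frames, codebook):
    """Return, for each frame vector, the index of the closest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens, codebook):
    """Map token indices back to their codebook vectors (lossy reconstruction)."""
    return [codebook[t] for t in tokens]


# Hypothetical 3-entry codebook and three made-up motion feature frames.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

tokens = quantize(frames, codebook)
print(tokens)                        # integer "motion tokens"
print(dequantize(tokens, codebook))  # nearest codebook vectors
```

Once motion is a sequence of integers like this, it can be fed to a language model in the same way as word tokens, which is the analogy the abstract draws.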

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like question answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!

RoboPapers

23,883 views • 3 months ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen
paper page:
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

We trained a foundation model on 18 million heart...

Alif Munim (d/acc)

590,179 views • 4 months ago

VLA-JEPA just dropped in LeRobot 🤖 What makes this model special is that it does not just learn what action to take from a given observation, it also leverages a JEPA world model to learn action-relevant dynamics. During training, the VLA leverages V-JEPA2 by conditioning its predictor. This clever trick adds a world modeling objective to the training, which also allows pretraining on human videos. At inference, the world model is dropped entirely, keeping only a standard VLA architecture: Qwen backbone and action head. The demo here was only fine-tuned on 13 examples, showing great pretraining capability and running in real time on NVIDIA Robotics DGX Spark! VLA-JEPA is the first world model to be ported to LeRobot, and I feel like it won't be the last 🚀 Thomas Wolf clem 🤗

LeRobot

287,409 views • 4 days ago

Start building with Gemini Embedding 2, our most capable...

Google AI Developers

30,483,382 views • 3 months ago

We raised $1.5m to launch the world’s first LLM...

Yoeven

93,403 views • 8 months ago

Today, every Nomic-Embed-Text embedding becomes multimodal. Introducing Nomic-Embed-Vision: -...

CalCo

103,204 views • 2 years ago

Check out our #ICRA2024 paper "Actor-Critic Model Predictive Control." Model-free #reinforcementlearning (RL) is known for its strong task performance and flexibility in optimizing general reward formulations. On the other hand, #ModelPredictiveControl (MPC) benefits from robustness and online replanning capabilities. We combine both approaches by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an Actor-Critic RL framework. The proposed approach leverages the short-term predictive optimization capabilities of MPC with the exploratory and end-to-end training properties of RL. The resulting policy effectively manages both short-term decisions through the MPC-based actor and long-term prediction via the critic network, unifying the benefits of both model-based control and end-to-end learning. We validate our method in simulation and the real world with a quadcopter across various high-level tasks. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out-of-distribution behavior. Paper: Full Video with more details: Kudos to Ángel Romero, Yunlong Song IEEE ICRA University of Zurich UZH Science UZH Space Hub Aerial Core AUTOASSESS European Research Council (ERC)