Introducing VL-JEPA: Vision-Language Joint Embedding Predictive Architecture for streaming, live action recognition, retrieval, VQA, and classification tasks with better performance and higher efficiency than large VLMs.
• VL-JEPA is the first non-generative model that can perform general-domain vision-language tasks in real-time, built on a joint embedding predictive architecture.
• We demonstrate in controlled experiments that VL-JEPA, trained with latent space embedding prediction, outperforms VLMs that rely on data space token prediction.
• We show that VL-JEPA delivers significant efficiency gains over VLMs for online video streaming applications, thanks to its non-autoregressive design and native support for selective decoding.
• We highlight that our VL-JEPA model, with a unified model architecture, can effectively handle a wide range of classification, retrieval, and VQA tasks at the same time.
by Delong Chen (陈德龙) Mustafa Shukor Théo Moutakanni Willy Jade Lei Yu Tejaswi Kasarla Allen Bolourchi Yann LeCun Pascale Fung

Pascale Fung
90,033 views • 5 months ago
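The VL-JEPA post above mentions "selective decoding" for streaming video without describing the mechanism. The following is a minimal sketch of one plausible reading, not the authors' method: run a (comparatively expensive) text decoder only when the predicted embedding has drifted away from the last decoded one. The function names, the cosine-similarity test, and the `threshold` value are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_frames_to_decode(embeddings, threshold=0.9):
    """Return the indices of stream steps that would be sent to a decoder.

    Hypothetical rule: decode the first step, then decode again only when
    the similarity to the last decoded embedding falls below `threshold`.
    """
    decoded = []
    last = None
    for i, emb in enumerate(embeddings):
        if last is None or cosine(emb, last) < threshold:
            decoded.append(i)
            last = emb
    return decoded

# Toy stream: steps 0-1 are nearly identical, steps 2-3 show a new scene.
stream = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
indices = select_frames_to_decode(stream)
```

On this toy stream only two of the four steps trigger decoding, which is where the efficiency gain would come from under this assumption.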
Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots, allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

AI at Meta
309,704 views • 1 year ago
Today we’re releasing V-JEPA, a method for teaching machines to understand and model the physical world by watching videos. This work is another important step towards Yann LeCun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and accomplish complex tasks. Details ➡️ We're releasing a collection of V-JEPA vision models trained with a feature prediction objective using self-supervised learning. The models are able to understand and predict what is going on in a video, even with limited information. They learn by predicting missing or obscured parts of a video in their internal feature space. Unlike generative approaches that fill in missing pixels, this flexible approach enables up to 6x improvements in training and sample efficiency. The models were pre-trained on entirely unlabeled data, and a small amount of labeled data can be used to train a task-specific prediction head on top after pre-training. Our results show that, using a frozen backbone, our top V-JEPA models achieve 82.0% on Kinetics-400, 72.2% on Something-Something-v2 and 77.9% on ImageNet1K, competitive with or exceeding previous leading video models. We believe that this work is an important milestone on the path to advancing machine intelligence.

AI at Meta
703,412 views • 2 years ago
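The V-JEPA post above describes learning by predicting masked parts of a video in feature space rather than filling in pixels. A toy sketch of that idea follows; it is not Meta's implementation. The linear `encode` function, the mean-pooled context, the `predict` callable, and the L1 distance are illustrative assumptions, and real systems use deep networks with a separate target encoder.

```python
def encode(patch, weights):
    """Toy linear encoder: map a raw patch (list of floats) to a feature vector."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]

def l1_loss(pred, target):
    """Mean absolute difference between two feature vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def masked_feature_loss(patches, mask, enc_w, predict):
    """Average prediction loss over masked patches, measured in feature space.

    patches: list of raw patches; mask[i] is True when patch i is hidden.
    predict: hypothetical predictor mapping a context vector to a feature guess.
    Assumes at least one visible and at least one masked patch.
    """
    visible = [encode(p, enc_w) for p, m in zip(patches, mask) if not m]
    dim = len(enc_w)
    # Summarize the visible context as the mean of its feature vectors.
    context = [sum(f[i] for f in visible) / len(visible) for i in range(dim)]
    losses = []
    for patch, hidden in zip(patches, mask):
        if hidden:
            target = encode(patch, enc_w)  # the target is a feature, not pixels
            losses.append(l1_loss(predict(context), target))
    return sum(losses) / len(losses)

# Toy check: identical patches plus an identity predictor give zero loss.
identity_w = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 2.0]] * 4
mask = [False, True, False, True]
loss = masked_feature_loss(patches, mask, identity_w, lambda ctx: ctx)
```

The point of the sketch is the placement of the loss: the target is an encoded feature, so the model is never asked to reconstruct raw pixel values.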
3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK
249,494 views • 2 years ago
Introducing Jan-v2-VL, a multimodal agent built for long-horizon tasks. Jan-v2-VL executes 49 steps without failure, while the base model stops at 5 and other similar-scale VLMs stop between 1 and 2. It achieves longer, stable task execution in your browser without accuracy loss. 3 variants are available: - Jan-v2-VL-low (efficiency-oriented) - Jan-v2-VL-med (balanced) - Jan-v2-VL-high (deeper reasoning and longer execution) Models: To use it, update your Jan App and download Jan-v2-VL from the Model Hub. Activate Browser MCP servers for agentic use cases. Credit to the Qwen team for Qwen3-VL-8B-Thinking base model.

👋 Jan
130,228 views • 7 months ago
LFM2-VL support with GGUF and llama.cpp 🥳 You can now run these tiny, hyper-efficient VLMs on your watch! We released quantized checkpoints for LFM2-VL-450M and LFM2-VL-1.6B on Hugging Face

Maxime Labonne
19,947 views • 9 months ago
We release Action100M, the hero behind VL-JEPA. It is a large dataset with O(100 million) dense action annotations on HowTo100M procedural videos. We hope it serves as a robust data foundation to advance physical world modeling research.

Delong Chen (陈德龙)
103,384 views • 4 months ago
Here's my conversation with Yann LeCun about AI, importance of open source, limits of LLMs, why AI doomers are wrong, and the path to AGI. This was a fun and fascinating technical conversation! It's here on X in full, and is up on YouTube, Spotify, and everywhere else. Links in comment. Timestamps: 0:00 - Introduction 2:18 - Limits of LLMs 13:54 - Bilingualism and thinking 17:46 - Video prediction 25:07 - JEPA (Joint-Embedding Predictive Architecture) 28:15 - JEPA vs LLMs 37:31 - DINO and I-JEPA 38:51 - V-JEPA 44:22 - Hierarchical planning 50:40 - Autoregressive LLMs 1:06:06 - AI hallucination 1:11:30 - Reasoning in AI 1:29:02 - Reinforcement learning 1:34:10 - Woke AI 1:43:48 - Open source 1:47:26 - AI and ideology 1:49:58 - Marc Andreessen 1:57:56 - Llama 3 2:04:20 - AGI 2:08:48 - AI doomers 2:24:38 - Joscha Bach 2:28:51 - Humanoid robots 2:38:00 - Hope for the future

Lex Fridman
1,021,936 views • 2 years ago
Jan-v2-VL-Max-Instruct is out on 💛 Our newest 30B vision-language model, extending the Jan-v2-VL family. This is our experiment bringing interleaved reasoning to an Instruct model. It handles long tasks well and stays on track when things get complicated. Good for research, multi-step problems, anything that needs patience. Try it now at

👋 Jan
23,063 views • 5 months ago
MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK
125,311 views • 3 years ago
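The MotionGPT abstract above says motion is turned into discrete "motion tokens" through vector quantization. As a rough illustration of that single step, and not the paper's actual tokenizer, the sketch below maps each continuous frame vector to the index of its nearest codebook entry. The tiny hand-written codebook and both function names are hypothetical; a real system would learn the codebook from data.

```python
def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))
        for frame in frames
    ]

def dequantize(tokens, codebook):
    """Recover the approximate frame vectors from token indices."""
    return [codebook[t] for t in tokens]

# Hypothetical 3-entry "motion vocabulary" over 2-D pose features.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
tokens = quantize(frames, codebook)
approx = dequantize(tokens, codebook)
```

Once motion is a sequence of integer tokens like this, it can be interleaved with word tokens and handled by an ordinary language-modeling objective, which is the "motion as a foreign language" framing in the abstract.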
Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like question answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!

RoboPapers
23,883 views • 3 months ago
Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK
290,517 views • 3 years ago
We trained a foundation model on 18 million heart ultrasound videos to predict structure instead of pixels. Introducing EchoJEPA, the first foundation-scale JEPA for medical video. Paper: Code: 🧵 1/n

Alif Munim (d/acc)
590,179 views • 4 months ago
VLA-JEPA just dropped in LeRobot 🤖 What makes this model special is that it does not just learn what action to take from a given observation, it also leverages a JEPA world model to learn action-relevant dynamics. During training, the VLA leverages V-JEPA2 by conditioning its predictor. This clever trick adds a world modeling objective to the training, which also allows pretraining on human videos. At inference, the world model is dropped entirely, keeping only a standard VLA architecture: Qwen backbone and action head. The demo here was only fine-tuned on 13 examples, showing great pretraining capability and running in real time on NVIDIA Robotics DGX Spark! VLA-JEPA is the first world model to be ported to LeRobot, and I feel like it won't be the last 🚀 Thomas Wolf clem 🤗

LeRobot
289,446 views • 5 days ago
Start building with Gemini Embedding 2, our most capable and first fully multimodal embedding model built on the Gemini architecture. Now available in preview via the Gemini API and in Vertex AI.

Google AI Developers
30,483,382 views • 3 months ago
We raised $1.5m to launch the world’s first LLM built for developer tasks. Run OCR, web scraping, STT, and classification that you can rely on for any dev tasks. Available in Beta starting today!

Yoeven
93,403 views • 8 months ago
Today, every Nomic-Embed-Text embedding becomes multimodal. Introducing Nomic-Embed-Vision:
- a high quality, unified embedding space for image, text, and multimodal tasks
- outperforms both OpenAI CLIP and text-embedding-3-small
- open weights and code to enable indie hacking, research, and experimentation
- released in collaboration with MongoDB, LlamaIndex 🦙, Hugging Face, Amazon Web Services, DigitalOcean, Lambda

CalCo
103,204 views • 2 years ago
Check out our #ICRA2024 paper "Actor-Critic Model Predictive Control." Model-free #reinforcementlearning (RL) is known for its strong task performance and flexibility in optimizing general reward formulations. On the other hand, #ModelPredictiveControl (MPC) benefits from robustness and online replanning capabilities. We combine both approaches by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an Actor-Critic RL framework. The proposed approach leverages the short-term predictive optimization capabilities of MPC with the exploratory and end-to-end training properties of RL. The resulting policy effectively manages both short-term decisions through the MPC-based actor and long-term prediction via the critic network, unifying the benefits of both model-based control and end-to-end learning. We validate our method in simulation and the real world with a quadcopter across various high-level tasks. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out-of-distribution behavior. Paper: Full Video with more details: Kudos to Ángel Romero, Yunlong Song IEEE ICRA University of Zurich UZH Science UZH Space Hub Aerial Core AUTOASSESS European Research Council (ERC)

Davide Scaramuzza
34,874 views • 2 years ago
Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks. Learn more about DINOv3 here:

AI at Meta
899,359 views • 10 months ago
Yay, finally! Introducing Vision Banana🍌 from Google DeepMind, our unified model that outperforms SoTA specialist models on various vision tasks! By treating 2D/3D vision tasks as image generation, we unlock a new foundation for CV. Project page: (1/5)

Songyou Peng
282,920 views • 1 month ago
📣 Microsoft Research releases Florence-VL, a new family of MLLMs powered by the generative vision foundation model Florence-2. Achieves significant improvements in general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, and more🔥Learn more👇

Gradio
14,371 views • 1 year ago