正在加载视频...

视频加载失败

加载此视频时出现问题。这可能是由于临时网络问题，或视频可能不可用。

Introducing VL-JEPA: Vision-Language Joint Embedding Predictive Architecture for streaming, live action recognition, retrieval, VQA, and classification tasks with better performance and higher efficiency than large VLMs. • VL-JEPA is the first non-generative model that can perform general-domain vision-language tasks in real-time, built on a joint embedding predictive architecture. •... We demonstrate in controlled experiments that VL-JEPA, trained with latent space embedding prediction, outperforms VLMs that rely on data space token prediction. • We show that VL-JEPA delivers significant efficiency gains over VLMs for online video streaming applications, thanks to its non-autoregressive design and native support for selective decoding. • We highlight that our VL-JEPA model, with an unified model architecture, can effectively handle a wide range of classification, retrieval, and VQA tasks at the same time. by Delong Chen (陈德龙) Mustafa Shukor Théo Moutakanni Willy Jade Lei Yu Tejaswi Kasarla Allen Bolourchi Yann LeCun Pascale Fungshow more

Pascale Fung

6,221 subscribers

90,033 次观看 • 5 个月前 •via X (Twitter)

教育科学技术

Anya Rossi• Live Now

Private livecam show

0 条评论

暂无评论

原始帖子的评论将显示在这里

相关视频

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

AI at Meta

309,704 次观看 • 1 年前

Today we’re releasing V-JEPA, a method for teaching machines to understand and model the physical world by watching videos. This work is another important step towards Yann LeCun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and accomplish complex tasks. Details ➡️ We're releasing a collection of V-JEPA vision models trained with a feature prediction objective using self-supervised learning. The models are able to understand and predict what is going on in a video, even with limited information. It learns by predicting missing or obscured parts of a video in its internal feature space. Unlike generative approaches that fill in missing pixels, this flexible approach enables up to 6x improvements in training and sample efficiency. The models were pre-trained on entirely unlabeled data, and a small amount of labeled data can be used to train a task-specific prediction head on top after pre-training. Our results show that, using a frozen backbone, our top V-JEPA models achieve 82.0% on Kinetics-400, 72.2% on Something-Something-v2 and 77.9% on ImageNet1K — competitive with or exceeding previous leading video models. We believe that this work is an important milestone on the path to advancing machine intelligence.

Today we’re releasing V-JEPA, a method for teaching machines to understand and model the physical world by watching videos. This work is another important step towards Yann LeCun’s outlined vision of AI models that use a learned understanding of the world to plan, reason and accomplish complex tasks. Details ➡️ We're releasing a collection of V-JEPA vision models trained with a feature prediction objective using self-supervised learning. The models are able to understand and predict what is going on in a video, even with limited information. It learns by predicting missing or obscured parts of a video in its internal feature space. Unlike generative approaches that fill in missing pixels, this flexible approach enables up to 6x improvements in training and sample efficiency. The models were pre-trained on entirely unlabeled data, and a small amount of labeled data can be used to train a task-specific prediction head on top after pre-training. Our results show that, using a frozen backbone, our top V-JEPA models achieve 82.0% on Kinetics-400, 72.2% on Something-Something-v2 and 77.9% on ImageNet1K — competitive with or exceeding previous leading video models. We believe that this work is an important milestone on the path to advancing machine intelligence.

AI at Meta

703,412 次观看 • 2 年前

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi- view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,494 次观看 • 2 年前

Introducing Jan-v2-VL, a multimodal agent built for long-horizon tasks. Jan-v2-VL executes 49 steps without failure, while the base model stops at 5 and other similar-scale VLMs stop between 1 and 2. It achieves longer, stable task execution in your browser without accuracy loss. 3 variants are available: - Jan-v2-VL-low (efficiency-oriented) - Jan-v2-VL-med (balanced) - Jan-v2-VL-high (deeper reasoning and longer execution) Models: To use it, update your Jan App and download Jan-v2-VL from the Model Hub. Activate Browser MCP servers for agentic use cases. Credit to the Qwen team for Qwen3-VL-8B-Thinking base model.

Introducing Jan-v2-VL, a multimodal agent built for long-horizon tasks. Jan-v2-VL executes 49 steps without failure, while the base model stops at 5 and other similar-scale VLMs stop between 1 and 2. It achieves longer, stable task execution in your browser without accuracy loss. 3 variants are available: - Jan-v2-VL-low (efficiency-oriented) - Jan-v2-VL-med (balanced) - Jan-v2-VL-high (deeper reasoning and longer execution) Models: To use it, update your Jan App and download Jan-v2-VL from the Model Hub. Activate Browser MCP servers for agentic use cases. Credit to the Qwen team for Qwen3-VL-8B-Thinking base model.

👋 Jan

130,228 次观看 • 6 个月前

LFM2-VL support with GGUF and llama.cpp 🥳 You can now run these tiny, hyper-efficient VLMs on your watch! We released quantized checkpoints for LFM2-VL-450M and LFM2-VL-1.6B on Hugging Face

LFM2-VL support with GGUF and llama.cpp 🥳 You can now run these tiny, hyper-efficient VLMs on your watch! We released quantized checkpoints for LFM2-VL-450M and LFM2-VL-1.6B on Hugging Face

Maxime Labonne

19,947 次观看 • 9 个月前

We release Action100M, the hero behind VL-JEPA. It is a large dataset with O(100 million) dense action annotations on HowTo100M procedural videos. We hope it serves as a robust data foundation to advance physical world modeling research.

We release Action100M, the hero behind VL-JEPA. It is a large dataset with O(100 million) dense action annotations on HowTo100M procedural videos. We hope it serves as a robust data foundation to advance physical world modeling research.

Delong Chen (陈德龙)

103,384 次观看 • 4 个月前

Here's my conversation with Yann LeCun (Yann LeCun) about AI, importance of open source, limits of LLMs, why AI doomers are wrong, and the path to AGI. This was a fun and fascinating technical conversation! It's here on X in full, and is up on YouTube, Spotify, and everywhere else. Links in comment. Timestamps: 0:00 - Introduction 2:18 - Limits of LLMs 13:54 - Bilingualism and thinking 17:46 - Video prediction 25:07 - JEPA (Joint-Embedding Predictive Architecture) 28:15 - JEPA vs LLMs 37:31 - DINO and I-JEPA 38:51 - V-JEPA 44:22 - Hierarchical planning 50:40 - Autoregressive LLMs 1:06:06 - AI hallucination 1:11:30 - Reasoning in AI 1:29:02 - Reinforcement learning 1:34:10 - Woke AI 1:43:48 - Open source 1:47:26 - AI and ideology 1:49:58 - Marc Andreesen 1:57:56 - Llama 3 2:04:20 - AGI 2:08:48 - AI doomers 2:24:38 - Joscha Bach 2:28:51 - Humanoid robots 2:38:00 - Hope for the future

Here's my conversation with Yann LeCun (Yann LeCun) about AI, importance of open source, limits of LLMs, why AI doomers are wrong, and the path to AGI. This was a fun and fascinating technical conversation! It's here on X in full, and is up on YouTube, Spotify, and everywhere else. Links in comment. Timestamps: 0:00 - Introduction 2:18 - Limits of LLMs 13:54 - Bilingualism and thinking 17:46 - Video prediction 25:07 - JEPA (Joint-Embedding Predictive Architecture) 28:15 - JEPA vs LLMs 37:31 - DINO and I-JEPA 38:51 - V-JEPA 44:22 - Hierarchical planning 50:40 - Autoregressive LLMs 1:06:06 - AI hallucination 1:11:30 - Reasoning in AI 1:29:02 - Reinforcement learning 1:34:10 - Woke AI 1:43:48 - Open source 1:47:26 - AI and ideology 1:49:58 - Marc Andreesen 1:57:56 - Llama 3 2:04:20 - AGI 2:08:48 - AI doomers 2:24:38 - Joscha Bach 2:28:51 - Humanoid robots 2:38:00 - Hope for the future

Lex Fridman

1,021,936 次观看 • 2 年前

Jan-v2-VL-Max-Instruct is out on 💛 Our newest 30B vision-language model, extending the Jan-v2-VL family. This is our experiment bringing interleaved reasoning to an Instruct model. It handles long tasks well and stays on track when things get complicated. Good for research, multi-step problems, anything that needs patience. Try it now at

Jan-v2-VL-Max-Instruct is out on 💛 Our newest 30B vision-language model, extending the Jan-v2-VL family. This is our experiment bringing interleaved reasoning to an Instruct model. It handles long tasks well and stays on track when things get complicated. Good for research, multi-step problems, anything that needs patience. Try it now at

👋 Jan

23,063 次观看 • 5 个月前

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,311 次观看 • 2 年前

We trained a foundation model on 18 million heart ultrasound videos to predict structure instead of pixels. Introducing EchoJEPA, the first foundation-scale JEPA for medical video. Paper: Code: 🧵 1/n

We trained a foundation model on 18 million heart ultrasound videos to predict structure instead of pixels. Introducing EchoJEPA, the first foundation-scale JEPA for medical video. Paper: Code: 🧵 1/n

Alif Munim (d/acc)

590,156 次观看 • 4 个月前

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like quesiton answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like quesiton answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!

RoboPapers

23,883 次观看 • 3 个月前

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 次观看 • 3 年前

VLA-JEPA just dropped in LeRobot 🤖 What makes this model special is that it does not just learn what action to take from a given observation, it also leverages a JEPA world model to learn action-relevant dynamics. During training, the VLA leverages V-JEPA2 by conditioning its predictor. This clever trick adds a world modeling objective to the training, which also allows pretraining on human videos. At inference, the world model is dropped entirely, keeping only a standard VLA architecture: Qwen backbone and action head. The demo here was only fine-tuned on 13 examples, showing great pretraining capability and running in real time on NVIDIA Robotics DGX Spark! VLA-JEPA is the first world model to be ported to LeRobot, and I feel like it won't be the last 🚀 Thomas Wolf clem 🤗

VLA-JEPA just dropped in LeRobot 🤖 What makes this model special is that it does not just learn what action to take from a given observation, it also leverages a JEPA world model to learn action-relevant dynamics. During training, the VLA leverages V-JEPA2 by conditioning its predictor. This clever trick adds a world modeling objective to the training, which also allows pretraining on human videos. At inference, the world model is dropped entirely, keeping only a standard VLA architecture: Qwen backbone and action head. The demo here was only fine-tuned on 13 examples, showing great pretraining capability and running in real time on NVIDIA Robotics DGX Spark! VLA-JEPA is the first world model to be ported to LeRobot, and I feel like it won't be the last 🚀 Thomas Wolf clem 🤗

LeRobot

269,885 次观看 • 3 天前

Start building with Gemini Embedding 2, our most capable and first fully multimodal embedding model built on the Gemini architecture. Now available in preview via the Gemini API and in Vertex AI.

Start building with Gemini Embedding 2, our most capable and first fully multimodal embedding model built on the Gemini architecture. Now available in preview via the Gemini API and in Vertex AI.

Google AI Developers

30,483,382 次观看 • 3 个月前

We raised $1.5m to launch the world’s first LLM built for developer tasks. Run OCR, web scraping, STT, and classification that you can rely on for any dev tasks. Available in Beta starting today!

We raised $1.5m to launch the world’s first LLM built for developer tasks. Run OCR, web scraping, STT, and classification that you can rely on for any dev tasks. Available in Beta starting today!

Yoeven

93,403 次观看 • 8 个月前

Today, every Nomic-Embed-Text embedding becomes multimodal. Introducing Nomic-Embed-Vision: - a high quality, unified embedding space for image, text, and multimodal tasks - outperforms both OpenAI CLIP and text-embedding-3-small - open weights and code to enable indie hacking, research, and experimentation - released in collaboration with MongoDB, LlamaIndex 🦙, , Hugging Face, Amazon Web Services, DigitalOcean, Lambda

Today, every Nomic-Embed-Text embedding becomes multimodal. Introducing Nomic-Embed-Vision: - a high quality, unified embedding space for image, text, and multimodal tasks - outperforms both OpenAI CLIP and text-embedding-3-small - open weights and code to enable indie hacking, research, and experimentation - released in collaboration with MongoDB, LlamaIndex 🦙, , Hugging Face, Amazon Web Services, DigitalOcean, Lambda

CalCo

103,204 次观看 • 2 年前

Check out our #ICRA2024 paper "Actor-Critic Model Predictive Control." Model-free #reinforcementlearning (RL) is known for its strong task performance and flexibility in optimizing general reward formulations. On the other hand, #ModelPredictiveControl (MPC) benefits from robustness and online replanning capabilities. We combine both approaches by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an Actor-Critic RL framework. The proposed approach leverages the short-term predictive optimization capabilities of MPC with the exploratory and end-to-end training properties of RL. The resulting policy effectively manages both short-term decisions through the MPC-based actor and long-term prediction via the critic network, unifying the benefits of both model-based control and end-to-end learning. We validate our method in simulation and the real world with a quadcopter across various high-level tasks. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out-of-distribution behavior. Paper: Full Video with more details: Kudos to Ángel Romero, Yunlong Song IEEE ICRA University of Zurich UZH Science UZH Space Hub Aerial Core AUTOASSESS European Research Council (ERC)

Check out our #ICRA2024 paper "Actor-Critic Model Predictive Control." Model-free #reinforcementlearning (RL) is known for its strong task performance and flexibility in optimizing general reward formulations. On the other hand, #ModelPredictiveControl (MPC) benefits from robustness and online replanning capabilities. We combine both approaches by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an Actor-Critic RL framework. The proposed approach leverages the short-term predictive optimization capabilities of MPC with the exploratory and end-to-end training properties of RL. The resulting policy effectively manages both short-term decisions through the MPC-based actor and long-term prediction via the critic network, unifying the benefits of both model-based control and end-to-end learning. We validate our method in simulation and the real world with a quadcopter across various high-level tasks. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out-of-distribution behavior. Paper: Full Video with more details: Kudos to Ángel Romero, Yunlong Song IEEE ICRA University of Zurich UZH Science UZH Space Hub Aerial Core AUTOASSESS European Research Council (ERC)

Davide Scaramuzza

34,870 次观看 • 2 年前

Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks. Learn more about DINOv3 here:

Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks. Learn more about DINOv3 here:

AI at Meta

899,338 次观看 • 9 个月前

Yay, finally! Introducing Vision Banana🍌 from Google DeepMind, our unified model that outperforms SoTA specialist models on various vision tasks! By treating 2D/3D vision tasks as image generation, we unlock a new foundation for CV. Project page: (1/5)

Yay, finally! Introducing Vision Banana🍌 from Google DeepMind, our unified model that outperforms SoTA specialist models on various vision tasks! By treating 2D/3D vision tasks as image generation, we unlock a new foundation for CV. Project page: (1/5)

Songyou Peng

282,710 次观看 • 1 个月前

📣 Microsoft Research releases Florence-VL, a new family of MLLMs powered by the generative vision foundation model Florence-2. Achieves significant improvements in general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, and more🔥Learn more👇

📣 Microsoft Research releases Florence-VL, a new family of MLLMs powered by the generative vision foundation model Florence-2. Achieves significant improvements in general VQA, perception, hallucination, OCR, Chart, knowledge-intensive understanding, and more🔥Learn more👇

Gradio

14,371 次观看 • 1 年前