🚀Introducing LLaVA Lightning: Train a lite, multimodal GPT-4 with just $40 in 3 hours! With our newly introduced datasets and the efficient design of LLaVA, you can now turbocharge your language model with image reasoning capabilities, in an incredibly affordable way.🧵

Haotian Liu

13,234 subscribers

302,319 views • 3 years ago • via X (Twitter)

Science & Technology, Education

Comments: 10

Haotian Liu • 3 years ago

(2/5) Excited to release a 558K concept-balanced subset of LAION/CC/SBU & an 80K high-quality subset of LLaVA-Instruct-158K. The concept-balanced subset ensures a broad concept coverage, and the high-quality visual instruct tuning data enables models' visual reasoning capability.

Haotian Liu • 3 years ago

(3/5) Upgrade your Vicuna-7B to LLaVA-Lightning in just 3 hrs: 2 hrs pretraining + 1 hr visual instruct tuning. Train on 8x A100s using cloud spot instances for just $40. Let's make this research more accessible to researchers, academia, and millions of AI enthusiasts today!
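A quick back-of-the-envelope check of the budget quoted above, as a minimal sketch: it uses only the figures stated in the post (8 GPUs, 2 hours of pretraining, 1 hour of visual instruction tuning, $40 in total). The helper name `implied_gpu_hour_price` is hypothetical, and the resulting rate is an implied average, not a quoted cloud price.

```python
# Figures stated in the post: 8x A100, 2 h pretraining + 1 h instruction tuning, $40 total.
NUM_GPUS = 8
PRETRAIN_HOURS = 2.0
FINETUNE_HOURS = 1.0
TOTAL_COST_USD = 40.0


def implied_gpu_hour_price(total_cost: float, num_gpus: int, hours: float) -> float:
    """Return the average per-GPU hourly price implied by a total budget (hypothetical helper)."""
    return total_cost / (num_gpus * hours)


total_hours = PRETRAIN_HOURS + FINETUNE_HOURS   # 3 wall-clock hours
gpu_hours = NUM_GPUS * total_hours              # 24 GPU-hours
price = implied_gpu_hour_price(TOTAL_COST_USD, NUM_GPUS, total_hours)

print(f"{gpu_hours:.0f} GPU-hours, about ${price:.2f} per A100-hour")
# prints "24 GPU-hours, about $1.67 per A100-hour"
```

In other words, the $40 figure is consistent with spot pricing of roughly $1.67 per A100-hour, averaged over both training stages.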

Haotian Liu • 3 years ago

(4/5) We're also upgrading LLaVA to support Vicuna v0 & v1 weights, with more checkpoints arriving this week! Plus, we're working to support more hardware – stay tuned!

Haotian Liu • 3 years ago

(5/5) 🤗 Demo: 🌐 Project page: 📄 Paper: Embark on your LLaVA-Lightning journey today and stay tuned for more models and support for more hardware in the following weeks!

iamrobotbear (bk) • 3 years ago

Any way to easily swap LLaMA out for OpenAI or Dolly 2.0?

Haotian Liu • 3 years ago

Yes, it is definitely possible, and even easier with the introduction of LLaVA-Lightning. MPT-7B just joined the LLaVA family today!

Zongheng Yang • 3 years ago

Congrats on the work @imhaotian. Glad to see SkyPilot was of help!

Chris • 3 years ago

The recent progress of open-source projects is super fast; it definitely surpasses my expectations.

web3工作坊 • 3 years ago

@_akhaliq Thank you for sharing. @savetonotion #tweet #AI

Jake Harrison • 3 years ago

Good job!

Related videos

LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

LLaVA just hit 3800 stars on GitHub. It's a multimodal Large Language-and-Vision Assistant that can understand images and text. LLaVA can even handle memes (the same ones GPT-4 demo'ed at launch) and set a new SOTA on Science QA. It also supports LLaMA-2, LoRA training with academia GPUs, higher resolution (336x336), and 4-/8-bit inference.

Lior Alexander

143,527 views • 2 years ago

🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I compared the viral parking space photo on GPT-4 Vision to LLaVa, and it worked flawlessly (see video).

Rowan Cheung

681,544 views • 2 years ago

Announcing GPT-4, a large multimodal model, with our best-ever results on capabilities and alignment:

OpenAI

12,466,065 views • 3 years ago

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!

In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:
- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,548 views • 1 year ago

This is Gemini 3: our most intelligent model that helps you learn, build and plan anything. It comes with state-of-the-art reasoning capabilities, world-leading multimodal understanding, and enables new agentic coding experiences. 🧵

Google DeepMind

1,692,423 views • 7 months ago

Let's go hands-on with #GeminiAI. Our newest AI model can reason across different types of inputs and outputs — like images and text. See Gemini's multimodal reasoning capabilities in action ↓

Google

1,005,825 views • 2 years ago

🚀Introducing LLaVA-NeXT Interleave: Now AI can understand and reason with multiple images at once - This opens up multi-image scenarios like multi-frame videos, multi-view 3D, and multiple inter-leaved images. - An all round LMM that can understand videos, images, and 3D More⬇️

Gradio

27,655 views • 1 year ago

Start building with Gemini 3 Pro, our most intelligent model with state-of-the-art reasoning and complex multimodal understanding, as well as powerful agentic and vibe coding capabilities. See more below:

Google AI Developers

104,898 views • 7 months ago

In this demo, you’ll see Gemini 3 Flash’s frontier-level reasoning and multimodal capabilities on display. The model is able to simultaneously conduct complex geometric calculations while processing complex inputs (video and image). You can play around with the slingshot in Google AI Studio here and share your favorite examples below:

Google AI

55,411 views • 6 months ago

Multimodal AI is here 🤯 GPT-4 can now turn your images into a text file in a snap with the new code interpreter model. Witness the OCR magic in action 🔥

Shubham Saboo

727,655 views • 3 years ago

Introducing 🧵 𝐂𝐨𝐝𝐢𝐮𝐦𝐀𝐈 𝐂𝐡𝐚𝐭 𝐓𝐡𝐫𝐞𝐚𝐝𝐬🧵 With Threads, you can now: ‣ seamlessly ask about and refine the results generated when using CodiumAI's commands 🆓 ﹠ ‣ chat with GPT-3.5-Turbo 🆓, or GPT-4-Turbo right inside your IDE

Qodo

591,227 views • 2 years ago

Today we’re taking a big step on the path toward AGI and releasing Gemini 3— our most intelligent model yet. With Gemini 3, you can bring any idea to life. It is state-of-the-art in reasoning, the best model in the world for multimodal understanding, and our best agentic and vibe coding model.

Google AI

492,732 views • 7 months ago

🚀 Introducing QuizDeck GPT Tool: generate questions and flashcards with just a snap of your page! 📖✨ Designed to support your curriculum, and it helps you learn & excel in school exams with question generation, provided answer explanations & efficient flashcards creation.

Reem Aljohani

30,761 views • 2 years ago

introducing Gemini Image Editing. now you can edit images with natural language in Krea Chat. try it now!

Krea

223,517 views • 1 year ago

Apple just released and open-sourced FastVLM! FastVLM is a lightning-fast vision-language model that combines rapid image and text understanding with efficient on-device performance. 100% Open Source

Sumanth

43,693 views • 9 months ago

(1/5) Gemini 3, our most intelligent model, is landing in Google Search today – starting with AI Mode. Excited that this is the first time we’re shipping a new Gemini model in Search on day one! 🚀 In Search, Gemini 3 with generative layouts will make it easy to get a rich understanding of anything on your mind. It has state-of-the-art reasoning, deep multimodal understanding and advanced agentic capabilities. That allows the model to shine when you ask it to explain advanced concepts or ideas – it reasons and can code interactive visuals in real-time. It can tackle your toughest questions like advanced science.

Robby Stein

94,728 views • 7 months ago

Introducing GPT-Realtime-2 in the API: our most intelligent voice model yet, bringing GPT-5-class reasoning to voice agents. Voice agents are now real-time collaborators that can listen, reason, and solve complex problems as conversations unfold. Now available in the API alongside streaming models GPT-Realtime-Translate and GPT-Realtime-Whisper — a new set of audio capabilities for the next generation of voice interfaces.

OpenAI

3,621,893 views • 1 month ago

Gemini 3.1 Flash-Lite is our most cost-efficient Gemini 3 series model 🔦 Folks who are keen to experiment with it in Gemini CLI can do so with an API key and the -m flag. ❯ gemini -m gemini-3.1-flash-lite-preview Official support for it in /model will be coming soon 🔜

Gemini CLI

66,747 views • 3 months ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago