LLaVA just hit 3,800 stars on GitHub. It's a multimodal Large Language-and-Vision Assistant that can understand images and text. LLaVA can even handle memes (the same ones GPT-4 demoed at launch) and set a new SOTA on Science QA. It also supports LLaMA-2, LoRA training with academia GPUs, higher...

Lior Alexander

115,479 subscribers

143,527 views • 2 years ago • via X (Twitter)

11 Comments

Lior⚡ • 2 years ago

Github: Demo:

Lior⚡ • 2 years ago

By: @imhaotian, @ChunyuanLi, @QingyangWu1, @yong_jae_lee

Linus Ekenstam – eu/acc • 2 years ago

The rise of these models and the speed at which they are entering the market make me think we are soon only going to interact with LLMs.

Lior⚡ • 2 years ago

Absolutely, or LLM-assisted websites. The equivalent of intercom on every website.

Charcher • 2 years ago

So good.

Rob Lennon 🗯 | AI Whisperer • 2 years ago

Definitely want to play with this soon

Lior⚡ • 2 years ago

Let me know how it goes, about to pip install it

ai geek (wishesh) ⚡️ • 2 years ago

Great find. Looking very promising.

thom • 2 years ago

@readwise save thread

wwwwg • 2 years ago

@memdotai mem it

Mem • 2 years ago

@AlphaSignalAI Saved! Here's the compiled thread: 🪄 AI-generated summary: "LLava is a multimodal Large Language-and-Vision Assistant that can understand images and text, and even handle memes. It has achieved a new SOTA on Science QA and supports LoRA...

Related Videos

🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I compared the viral parking space photo on GPT-4 Vision to LLaVa, and it worked flawlessly (see video).

Rowan Cheung

681,544 views • 2 years ago

🚀Introducing LLaVA Lightning: Train a lite, multimodal GPT-4 with just $40 in 3 hours! With our newly introduced datasets and the efficient design of LLaVA, you can now turbocharge your language model with image reasoning capabilities, in an incredibly affordable way.🧵

Haotian Liu

302,319 views • 3 years ago

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!

In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:
- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,548 views • 1 year ago
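
The similarity-search step named in the course announcement above (embed frames and text, store the vectors, retrieve the closest matches) can be sketched roughly as follows. This is only an illustration in plain Python with made-up 3-dimensional vectors. The course itself uses BridgeTower embeddings and LanceDB, neither of which appears here, and the names `cosine_similarity`, `top_k`, and the `frame_00x` keys are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Toy stand-in for a vector database: key -> embedding (values are invented).
store = {
    "frame_001": [0.9, 0.1, 0.0],
    "frame_002": [0.1, 0.9, 0.1],
    "frame_003": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stand-in for an embedded user question

print(top_k(query, store, k=2))
```

In a real pipeline the retrieved keys would map back to video frames and transcript segments, which are then passed to the LVLM as context.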

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

OpenAI just announced "GPT-4o". It can reason with voice, vision, and text. The model is 2x faster, 50% cheaper, and has 5x higher rate limit than GPT-4 Turbo. It will be available for free users and via the API. The voice model can even pick up on emotion and generate emotive voice.

Lior Alexander

485,070 views • 2 years ago

🚀Introducing LLaVA-NeXT Interleave: Now AI can understand and reason with multiple images at once
- This opens up multi-image scenarios like multi-frame videos, multi-view 3D, and multiple inter-leaved images.
- An all round LMM that can understand videos, images, and 3D
More⬇️

Gradio

27,655 views • 1 year ago

Announcing GPT-4, a large multimodal model, with our best-ever results on capabilities and alignment:

OpenAI

12,466,088 views • 3 years ago

Everyone is sleeping on this new OCR model! dots-ocr is a new 1.7B vision-language model that achieves SOTA performance on multilingual document parsing.
- Supports 100+ languages
- Works with both images and PDFs
- Handles text, tables, formulas seamlessly
100% open-source.

Akshay 🚀

251,968 views • 10 months ago

We are excited to announce the 1st version of our multimodal assistant, Yasa-1, a language assistant with visual and auditory sensors that can take actions via code execution 🪄. Yasa-1 can understand text, images, videos, sounds & more! 🚀 Check out more details below👇

Reka

814,187 views • 2 years ago

Llama 3.2 features 11B & 90B models, our first multimodal Llama models with support for vision tasks. These models can take in both image and text prompts to deeply understand and reason on inputs.

AI at Meta

121,530 views • 1 year ago

Multimodal AI is here 🤯 GPT-4 can now turn your images into a text file in a snap with the new code interpreter model. Witness the OCR magic in action 🔥

Shubham Saboo

727,655 views • 3 years ago

🚀 Introducing Emu3.5 — a large-scale multimodal world model that natively predicts the next vision-language state. 🔥 Trained on over 10T interleaved vision-language tokens and enhanced with reinforcement learning, Emu3.5 achieves powerful multimodal reasoning and generation. ⚡ Powered by our new Discrete Diffusion Adaptation (DiDA) for 20× faster inference. 🔥 Emu3.5 outperforms Nano Banana across image generation, editing, interleaved tasks and more. 🌍 Explore Emu3.5: Github: #Emu3 #MultimodalAI #WorldModel #NextTokenPrediction

BAAI

51,880 views • 8 months ago

ByteDance announced SeedEdit! A new image model that can edit images with text prompts. It allows for high-resolution editing and supports various changes like local replacements, geometric transformations, and style adjustments. Links ⬇️

Dreaming Tulpa 🥓👑

46,540 views • 1 year ago

BREAKING: ChatGPT GPT-4o was just announced by OpenAI. It improves on vision, audio and text. The ease of use is incredibly enhanced. It makes interaction with the GPT much more natural, especially with voice. GPT-4o reasons across voice, text and vision. GPT-4 will be available to everyone.

Ed Krassenstein

21,605 views • 2 years ago

Build a Vision RAG app with Gemini 2.5 Flash and Cohere Multimodal Embedding that can understand images and diagrams in PDF. 100% Opensource code with step-by-step tutorial.

Shubham Saboo

60,638 views • 1 year ago

Idefics3-Llama is out! 💥 It's a multimodal model based on Llama 3.1 that accepts an arbitrary number of interleaved images with text, with a huge context window (10k tokens!) 😍 Link to demo and model in the next one 😏

merve

28,014 views • 1 year ago

Introducing "Building with Llama 4." This short course is created with Meta AI at Meta, and taught by Amit Sangani, Director of Partner Engineering for Meta’s AI team. Meta’s new Llama 4 has added three new models and introduced the Mixture-of-Experts (MoE) architecture to its family of open-weight models, making them more efficient to serve. In this course, you’ll work with two of the three new models introduced in Llama 4. First is Maverick, a 400B parameter model, with 128 experts and 17B active parameters. Second is Scout, a 109B parameter model with 16 experts and 17B active parameters. Maverick and Scout support long context windows of up to a million tokens and 10M tokens, respectively. The latter is enough to support directly inputting even fairly large GitHub repos for analysis! In hands-on lessons, you’ll build apps using Llama 4’s new multimodal capabilities including reasoning across multiple images and image grounding, in which you can identify elements in images. You’ll also use the official Llama API, work with Llama 4’s long-context abilities, and learn about Llama’s newest open-source tools: its prompt optimization tool that automatically improves system prompts and synthetic data kit that generates high-quality datasets for fine-tuning. If you need an open model, Llama is a great option, and the Llama 4 family is an important part of any GenAI developer's toolkit. Through this course, you’ll learn to call Llama 4 via API, use its optimization tools, and build features that span text, images, and large context. Please sign up here:

Andrew Ng

67,710 views • 1 year ago
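
The Mixture-of-Experts figures quoted in the course announcement above (total versus active parameters) imply that only a fraction of each model's weights is used at a time. A small sketch of that arithmetic, using only the numbers from the post; the helper name `active_fraction` is hypothetical.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of a model's parameters that are active, given sizes in billions."""
    return active_params_b / total_params_b


# (total, active) parameter counts in billions, as stated in the post.
models = {
    "Maverick": (400, 17),
    "Scout": (109, 17),
}

for name, (total, active) in models.items():
    share = active_fraction(total, active)
    print(f"{name}: {active}B of {total}B parameters active ({share:.3f} of total)")
```

Both models activate the same 17B parameters, so the larger Maverick uses a smaller share of its weights than Scout.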

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen

paper page:

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

Sam Altman on GPT 6: “There will be a chance that it will be a GPT 3-4 style leap” in terms of science problems, where with GPT 5 it has these tiny glimmers and “GPT 6 it can really do it”

Chris

42,684 views • 7 months ago

you can create complex agentic environments and launch RL training runs with a single prompt. deploy trained inference endpoints with a single click. no GPUs, no SSH, no vLLM. just `prime`. guide:

will brown

139,377 views • 3 months ago