LLaVA just hit 3,800 stars on GitHub. It's a multimodal Large Language-and-Vision Assistant that can understand images and text. LLaVA can even handle memes (the same ones GPT-4 demoed at launch) and set a new SOTA on Science QA. It also supports LLaMA-2, LoRA training on academia-grade GPUs, higher...

143,527 views • 2 years ago • via X (Twitter)

11 Comments

Lior⚡ • 2 years ago

GitHub: Demo:

Lior⚡ • 2 years ago

By: @imhaotian, @ChunyuanLi, @QingyangWu1, @yong_jae_lee

Linus Ekenstam – eu/acc • 2 years ago

The rise of these models and the speed at which they are entering the market makes me think we are soon only going to interact with LLMs

Lior⚡ • 2 years ago

Absolutely, or LLM-assisted websites. The equivalent of Intercom on every website.

Charcher • 2 years ago

So good.

Rob Lennon 🗯 | AI Whisperer • 2 years ago

Definitely want to play with this soon

Lior⚡ • 2 years ago

Let me know how it goes, about to pip install it

ai geek (wishesh) ⚡️ • 2 years ago

Great find. Looking very promising.

thom • 2 years ago

@readwise save thread

wwwwg • 2 years ago

@memdotai mem it

Mem • 2 years ago

@AlphaSignalAI Saved! Here's the compiled thread: 🪄 AI-generated summary: "LLava is a multimodal Large Language-and-Vision Assistant that can understand images and text, and even handle memes. It has achieved a new SOTA on Science QA and supports LoRA...

Related Videos

New short course: Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!

In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:

- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed.

Please sign up here!

Andrew Ng

107,548 views • 1 year ago
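
The similarity-search step in the pipeline described above can be sketched in a few lines. This is only an illustration under loud assumptions: the 3-dimensional vectors are made-up stand-ins for the 512-dimensional BridgeTower embeddings, a plain Python list stands in for the LanceDB table, and the names `cosine_similarity`, `retrieve`, and the `frame`/`caption`/`vector` fields are hypothetical, not taken from the course.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_k=2):
    """Return the top_k index entries most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, item["vector"]), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]

# Hypothetical in-memory "vector store": one entry per extracted keyframe.
# The vectors are toy values, not real embeddings.
index = [
    {"frame": "frame_001", "caption": "person on a skateboard at the beach", "vector": [0.9, 0.1, 0.0]},
    {"frame": "frame_002", "caption": "close-up of ocean waves", "vector": [0.1, 0.9, 0.0]},
    {"frame": "frame_003", "caption": "skateboarder doing a handstand", "vector": [0.8, 0.2, 0.1]},
]

# Toy embedding of a user question about the skateboarder.
query = [1.0, 0.0, 0.0]
hits = retrieve(query, index, top_k=2)

for hit in hits:
    print(hit["frame"], "-", hit["caption"])
```

In a full system, the retrieved frames and captions would be passed to the LVLM as context alongside the user's question; a real vector database would replace the linear scan with an indexed search.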

Introducing "Building with Llama 4." This short course is created with Meta AI at Meta, and taught by Amit Sangani, Director of Partner Engineering for Meta’s AI team.

Meta’s new Llama 4 has added three new models and introduced the Mixture-of-Experts (MoE) architecture to its family of open-weight models, making them more efficient to serve.

In this course, you’ll work with two of the three new models introduced in Llama 4. First is Maverick, a 400B parameter model, with 128 experts and 17B active parameters. Second is Scout, a 109B parameter model with 16 experts and 17B active parameters. Maverick and Scout support long context windows of up to a million tokens and 10M tokens, respectively. The latter is enough to support directly inputting even fairly large GitHub repos for analysis!

In hands-on lessons, you’ll build apps using Llama 4’s new multimodal capabilities including reasoning across multiple images and image grounding, in which you can identify elements in images. You’ll also use the official Llama API, work with Llama 4’s long-context abilities, and learn about Llama’s newest open-source tools: its prompt optimization tool that automatically improves system prompts and synthetic data kit that generates high-quality datasets for fine-tuning.

If you need an open model, Llama is a great option, and the Llama 4 family is an important part of any GenAI developer's toolkit. Through this course, you’ll learn to call Llama 4 via API, use its optimization tools, and build features that span text, images, and large context.

Please sign up here:

Andrew Ng

67,710 views • 1 year ago
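
To make the "more efficient to serve" point concrete, here is a rough back-of-the-envelope sketch using only the parameter counts quoted in the post. The helper name `active_fraction` is hypothetical, and the ratio is a simplification: it compares active to total parameters per token and says nothing about memory, since all experts still have to be loaded.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of a model's parameters that are active for a given token."""
    return active_params_b / total_params_b

# (total, active) parameter counts in billions, as quoted in the post.
models = {
    "Maverick": (400, 17),
    "Scout": (109, 17),
}

for name, (total, active) in models.items():
    share = active_fraction(total, active)
    print(f"{name}: {active}B of {total}B parameters active ({share * 100:.2f}%)")
```

By this rough measure, Maverick activates a much smaller share of its weights per token than Scout, even though both use the same 17B active parameters.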