🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I ran the viral parking space photo from the GPT-4 Vision demo through LLaVA, and it worked flawlessly (see video).

681,536 views • 2 years ago • via X (Twitter)

10 Comments

Rowan Cheung • 2 years ago

Link: Here's a side-by-side comparison of GPT-4 Vision vs. LLaVA. Original GPT-4 Vision image credits to @petergyang

Shawn Chauhan • 2 years ago

I love the fact that paid models are getting competition from other players offering free access to pretty much the same thing. We as consumers are benefitting tremendously from this. There isn’t a better time to start using AI in my opinion.

Rowan Cheung • 2 years ago

Completely agree: more competition just means the consumers win in the end!

Min Choi • 2 years ago

Just saw this earlier today. Checking them out! 🔥

Rowan Cheung • 2 years ago

Totally worth it. Speeds are holding up, as well.

Haotian Liu • 2 years ago

Thanks for sharing our work!

Rowan Cheung • 2 years ago

Incredible work, was super fun to play around with. Thanks for pushing the space forward!

LAION • 2 years ago

@thebloke please make a quantized version :)

Aadit Sheth • 2 years ago

Great find @rowancheung

Rowan Cheung • 2 years ago

Can't wait to see the vision prompts you come up with!

Related Videos

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!

In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:

- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed.

Please sign up here!

Andrew Ng

107,548 views • 1 year ago
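
The retrieve-then-prompt flow described in the course post above can be sketched roughly as follows. This is only a toy illustration, not the course's code: a bag-of-words vector stands in for BridgeTower's joint embeddings, a plain Python list stands in for LanceDB, and the frame names, captions, and the helpers `toy_embed`, `retrieve`, and `build_prompt` are all hypothetical. The resulting prompt is what one would hand to an LVLM such as LLaVA together with the retrieved frame.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a real encoder: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up keyframes with captions; a real pipeline would extract these from video.
FRAMES = [
    {"frame": "frame_001.jpg",
     "text": "a person does a handstand on a skateboard at the beach"},
    {"frame": "frame_002.jpg",
     "text": "a car is parked next to a street sign"},
    {"frame": "frame_003.jpg",
     "text": "waves roll onto the sand at sunset"},
]

# In-memory "vector store": each entry pairs a frame record with its embedding.
INDEX = [(record, toy_embed(record["text"])) for record in FRAMES]


def retrieve(query: str, k: int = 1) -> list:
    """Return the k frame records most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [record for record, _ in ranked[:k]]


def build_prompt(query: str, hits: list) -> str:
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n".join(f"[{h['frame']}] {h['text']}" for h in hits)
    return f"Context from the video:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    question = "Is the person on the skateboard about to fall?"
    print(build_prompt(question, retrieve(question)))
```

Swapping in real components would mean replacing `toy_embed` with an actual multimodal encoder and `INDEX` with a vector database table; the ranking and prompt-assembly steps keep the same shape.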