🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I compared the viral parking space photo on GPT-4 Vision to LLaVA, and it worked flawlessly (see video).

681,536 views • 2 years ago • via X (Twitter)

10 comments

Rowan Cheung • 2 years ago

Link: Here's a side-by-side comparison of GPT-4 Vision vs. LLaVA. Original GPT-4 Vision image credits to @petergyang

Shawn Chauhan • 2 years ago

I love the fact that paid models are getting competition from other players offering free access to pretty much the same thing. We as consumers are benefitting tremendously from this. There isn’t a better time to start using AI in my opinion.

Rowan Cheung • 2 years ago

Completely agree: more competition just means consumers win in the end!

Min Choi • 2 years ago

Just saw this earlier today. Checking them out! 🔥

Rowan Cheung • 2 years ago

Totally worth it. Speeds are holding up, as well.

Haotian Liu • 2 years ago

Thanks for sharing our work!

Rowan Cheung • 2 years ago

Incredible work, was super fun to play around with. Thanks for pushing the space forward!

LAION • 2 years ago

@thebloke please make a quantized version :)

Aadit Sheth • 2 years ago

Great find @rowancheung

Rowan Cheung • 2 years ago

Can't wait to see the vision prompts you come up with!

Related videos

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!

In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:
- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed.

Please sign up here!

Andrew Ng

107,548 views • 1 year ago
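
To make the retrieval step in the course description above concrete, here is a minimal sketch in plain Python. It is not the course's code and calls no real library. The 4-dimensional vectors are made-up stand-ins for BridgeTower's 512-dimensional embeddings, the in-memory list stands in for a LanceDB table, the second and third captions are invented for illustration, and the names `retrieve` and `build_prompt` are hypothetical. It shows only the general idea: rank stored frame embeddings by cosine similarity to a query, then pass the best matches as context to a vision-language model such as LLaVA.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, top_k=2):
    """Return the top_k index entries most similar to query_vec (hypothetical helper)."""
    scored = [(cosine_similarity(query_vec, item["vector"]), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]


def build_prompt(question, retrieved):
    """Assemble retrieved captions into a text prompt (hypothetical helper)."""
    context = "\n".join(f"[{item['frame']}] {item['caption']}" for item in retrieved)
    return f"Context from video:\n{context}\n\nQuestion: {question}"


# Toy index: made-up vectors standing in for real frame/caption embeddings.
index = [
    {"frame": "frame_001",
     "caption": "A person does a handstand on a skateboard at the beach.",
     "vector": [0.9, 0.1, 0.0, 0.1]},
    {"frame": "frame_002",
     "caption": "Waves roll onto an empty shoreline.",
     "vector": [0.1, 0.9, 0.1, 0.0]},
    {"frame": "frame_003",
     "caption": "The skateboarder loses balance and steps off.",
     "vector": [0.8, 0.0, 0.3, 0.2]},
]

# Made-up query embedding; a real pipeline would embed the question text.
query_vec = [1.0, 0.0, 0.1, 0.1]
hits = retrieve(query_vec, index, top_k=2)
prompt = build_prompt("What might happen to the skateboarder?", hits)
print(prompt)
```

In a real pipeline the prompt would be sent to the model together with the retrieved frames themselves; this sketch stops at assembling the text context.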