🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I compared the viral parking space photo on GPT-4 Vision to LLaVA, and it worked flawlessly (see video).

681,435 views • 2 years ago • via X (Twitter)

Comments: 10

Rowan Cheung • 2 years ago

Link: Here's a side-by-side comparison of GPT-4 Vision vs. LLaVA. Original GPT-4 Vision image credits to @petergyang

Shawn Chauhan • 2 years ago

I love the fact that paid models are getting competition from other players offering free access to pretty much the same thing. We as consumers are benefitting tremendously from this. There isn’t a better time to start using AI in my opinion.

Rowan Cheung • 2 years ago

Completely agree, more competition just means the consumers win in the end!

Min Choi • 2 years ago

Just saw this earlier today. Checking them out! 🔥

Rowan Cheung • 2 years ago

Totally worth it. Speeds are holding up, as well.

Haotian Liu • 2 years ago

Thanks for sharing our work!

Rowan Cheung • 2 years ago

Incredible work, was super fun to play around with. Thanks for pushing the space forward!

LAION • 2 years ago

@thebloke please make a quantized version :)

Aadit Sheth • 2 years ago

Great find @rowancheung

Rowan Cheung • 2 years ago

Can't wait to see the vision prompts you come up with!

Related videos

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!

In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:

- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed.

Please sign up here!
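To make the retrieval step of the pipeline described above concrete, here is a minimal sketch in plain Python. It is not the course's code: the `FrameRecord` type, the `top_k` and `build_prompt` helpers, and the one-hot toy vectors are all hypothetical stand-ins. In a real system the embeddings would come from a model such as BridgeTower, the search would be handled by a vector database such as LanceDB, and the assembled context (plus the retrieved frame image) would be passed to an LVLM such as LLaVA.

```python
import math
from dataclasses import dataclass
from typing import List

EMBED_DIM = 512  # the post mentions a 512-dimensional joint text-image space


@dataclass
class FrameRecord:
    """Hypothetical record: one keyframe, its transcript snippet, its embedding."""
    frame_id: str
    transcript: str
    embedding: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], records: List[FrameRecord], k: int = 2) -> List[FrameRecord]:
    """Brute-force similarity search (a vector database would do this at scale)."""
    ranked = sorted(
        records,
        key=lambda r: cosine_similarity(query, r.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, retrieved: List[FrameRecord]) -> str:
    """Assemble the retrieved text context into a prompt for the LVLM."""
    context = "\n".join(f"[{r.frame_id}] {r.transcript}" for r in retrieved)
    return f"Context from the video:\n{context}\n\nQuestion: {question}"


def one_hot(index: int) -> List[float]:
    """Toy embedding used only for this demo; real embeddings come from a model."""
    vec = [0.0] * EMBED_DIM
    vec[index] = 1.0
    return vec


records = [
    FrameRecord("frame_001", "A person rides a skateboard at the beach.", one_hot(0)),
    FrameRecord("frame_002", "The person attempts a handstand.", one_hot(1)),
    FrameRecord("frame_003", "Waves roll in behind the boardwalk.", one_hot(2)),
]

# Toy query vector: closest to frame_002, somewhat close to frame_001.
query = one_hot(1)
query[0] = 0.5

hits = top_k(query, records, k=2)
prompt = build_prompt("What is the person doing?", hits)
print(prompt)
```

The brute-force sort is fine for a handful of frames; the reason to use a vector database in the full pipeline is that it indexes the embeddings so the same nearest-neighbor lookup stays fast as the number of frames grows.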

Andrew Ng

107,548 views • 1 year ago