🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I compared the viral parking space photo on GPT-4 Vision to LLaVA, and it worked flawlessly (see video).

681,544 views • 2 years ago • via X (Twitter)

10 comments

Rowan Cheung • 2 years ago

Link: Here's a side-by-side comparison of GPT-4 Vision vs. LLaVA. Original GPT-4 Vision image credits to @petergyang

Shawn Chauhan • 2 years ago

I love the fact that paid models are getting competition from other players offering free access to pretty much the same thing. We as consumers are benefitting tremendously from this. There isn’t a better time to start using AI in my opinion.

Rowan Cheung • 2 years ago

Completely agree. More competition just means the consumers win in the end!

Min Choi • 2 years ago

Just saw this earlier today. Checking them out! 🔥

Rowan Cheung • 2 years ago

Totally worth it. Speeds are holding up, as well.

Haotian Liu • 2 years ago

Thanks for sharing our work!

Rowan Cheung • 2 years ago

Incredible work, was super fun to play around with. Thanks for pushing the space forward!

LAION • 2 years ago

@thebloke please make a quantized version :)

Aadit Sheth • 2 years ago

Great find @rowancheung

Rowan Cheung • 2 years ago

Can't wait to see the vision prompts you come up with!

Related videos

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!

In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:

- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed.

Please sign up here!

Andrew Ng

107,548 views • 1 year ago
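
For readers curious about the retrieval step the course announcement describes (embed frames and text, run a similarity search, pass the retrieved context to the LVLM), here is a minimal sketch of the idea in plain Python. It is not the course's code and does not call BridgeTower, LanceDB, Whisper, or LLaVA. The `Segment` class, the `retrieve` and `build_prompt` helpers, the file paths, and the 3-dimensional embeddings are all hypothetical stand-ins, with an in-memory list playing the role of the vector database.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """One video segment: a keyframe reference plus its transcript text."""
    frame_path: str          # hypothetical path to an extracted keyframe
    transcript: str          # transcript or caption text for that frame
    embedding: list[float]   # joint text-image embedding (toy 3-d values here)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, segments, k=2):
    """Return the k segments whose embeddings are most similar to the query."""
    ranked = sorted(
        segments,
        key=lambda s: cosine_similarity(query_embedding, s.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, retrieved):
    """Assemble the text half of the context handed to a vision-language model."""
    context = "\n".join(f"- {s.transcript}" for s in retrieved)
    return f"Context from the video:\n{context}\n\nQuestion: {question}"


# Toy store standing in for a vector database; all values are made up.
STORE = [
    Segment("frames/0001.jpg", "A person skateboards along the beach.", [0.9, 0.1, 0.0]),
    Segment("frames/0002.jpg", "The person attempts a handstand on the board.", [0.8, 0.6, 0.0]),
    Segment("frames/0003.jpg", "Waves roll in under a cloudy sky.", [0.0, 0.1, 0.9]),
]

query = [0.7, 0.7, 0.0]  # stand-in for an embedded user question
top = retrieve(query, STORE, k=2)
prompt = build_prompt("What might happen next?", top)
print(prompt)
```

In a real pipeline the retrieved keyframes (here only `frame_path` strings) would be passed to the model alongside the prompt text, and the brute-force sort would be replaced by the vector database's own nearest-neighbor search.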