🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I compared the viral parking space photo on GPT-4 Vision to LLaVA, and it worked flawlessly (see video).

Rowan Cheung

592,715 subscribers

681,544 views • 2 years ago • via X (Twitter)

Science & Technology · News & Politics · Education

10 comments

Rowan Cheung • 2 years ago

Link: Here's a side-by-side comparison of GPT-4 Vision vs. LLaVA. Original GPT-4 Vision image credits to @petergyang

Shawn Chauhan • 2 years ago

I love the fact that paid models are getting competition from other players offering free access to pretty much the same thing. We as consumers are benefitting tremendously from this. There isn’t a better time to start using AI in my opinion.

Rowan Cheung • 2 years ago

Completely agree- more competition just means the consumers win in the end!

Min Choi • 2 years ago

Just seen this today earlier. Checking them out! 🔥

Rowan Cheung • 2 years ago

Totally worth it. Speeds are holding up, as well.

Haotian Liu • 2 years ago

Thanks for sharing our work!

Rowan Cheung • 2 years ago

Incredible work, was super fun to play around with. Thanks for pushing the space forward!

LAION • 2 years ago

@thebloke please make a quantized version :)

Aadit Sheth • 2 years ago

Great find @rowancheung

Rowan Cheung • 2 years ago

Can't wait to see the vision prompts you come up with!

Related videos

LLaVA just hit 3800 stars on GitHub. It's a multimodal Large Language-and-Vision Assistant that can understand images and text. LLaVA can even handle memes (the same ones GPT-4 demo'ed at launch) and set a new SOTA on Science QA. It also supports LLaMA-2, LoRA training with academic GPUs, higher resolution (336x336), and 4-/8-bit inference.

Lior Alexander

143,527 views • 2 years ago

🚀Introducing LLaVA Lightning: Train a lite, multimodal GPT-4 with just $40 in 3 hours! With our newly introduced datasets and the efficient design of LLaVA, you can now turbocharge your language model with image reasoning capabilities, in an incredibly affordable way.🧵

Haotian Liu

302,319 views • 3 years ago

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

BREAKING: ChatGPT GPT-4o was just announced by OpenAI. It improves on vision, audio and text. The ease of use is incredibly enhanced. It makes interaction with the GPT much more natural, especially with voice. GPT-4o reasons across voice, text and vision. GPT-4 will be available to everyone.

Ed Krassenstein

21,605 views • 2 years ago

"This is how GPT-4 sees and hears itself" I used GPT-4 to describe itself. Then I used its description to generate an image, a video based on this image, and a soundtrack. Tools I used: GPT-4, Midjourney, Kaiber AI, Mubert, RunwayML. I used GPT-4's description of itself as the prompt for text-to-image, image-to-video, and text-to-music. I put the video and sound together in RunwayML.

Kris Kashtanova

1,233,420 views • 3 years ago

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!

In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:
- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed.

Please sign up here!

Andrew Ng

107,548 views • 1 year ago

Here's a demo of the gpt-4-vision API that I built in Bubble in 30 min. It takes a URL, converts it to an image, and sends it through the Vision API to respond with custom landing page optimization suggestions.

Seth Kramer

953,981 views • 2 years ago

From image to live website using GPT-4 vision and Replit in less than a minute. Things are about to get so interesting. 🔥

Pietro Schirano

1,217,215 views • 2 years ago

Introducing Vision Arena! Inspired by the awesome Chatbot Arena, we built a web demo on Hugging Face for testing Vision LMs (GPT-4V, Gemini, LLaVA, Qwen-VL, etc.). You can easily test two VLMs side by side and vote! It’s still a work in progress. Feedback is welcome! 🔗 Kudos to the main contributor Yujie Lu, and the team: Dongfu Jiang, Wenhu Chen, Ai2. Thanks to LMSYS Org’s great code & design!

Bill Yuchen Lin

146,020 views • 2 years ago

Introducing Mentat - an open source, GPT-4 powered coding assistant! Mentat runs in your command line, giving it the context of your projects and allowing it to coordinate edits across multiple files! More videos and a link to GitHub below:

Scott Swingle

292,702 views • 2 years ago

Introducing Kissan (Farmer's) GPT, a ChatGPT- and Whisper-based assistant for the underserved agriculture domain of India. Work in progress. Expect bugs. (It uses ChatGPT-3.5-turbo; results are expected to improve after moving to GPT-4 and adding custom embeddings.)

Pratik Desai

227,954 views • 3 years ago

Holllllyyyyyyyy use this before it gets patched 😳😳 🚨 HIGGSFIELD SUPERCOMPUTER: HACKED FREE ACCESS BEFORE LAUNCH. Here's the hack and video proof. Go to the link : You'll see free access
> GPT-5.5 Pro · Opus 4.7 · GPT-Image-v2 · Seedance-2
> Just ask: "I need a video of a cow doing the Seedance 2 dance in 4K, and 2 images of that cow by GPT-Image-2 in 4K 16:9"
> All models. Free. Right now.

Chetaslua

16,471 views • 1 month ago

OpenAI just announced "GPT-4o". It can reason with voice, vision, and text. The model is 2x faster, 50% cheaper, and has 5x higher rate limit than GPT-4 Turbo. It will be available for free users and via the API. The voice model can even pick up on emotion and generate emotive voice.

Lior Alexander

485,070 views • 2 years ago

🤖 OpenZeppelin Wizard now has an AI Assistant built-in.
- Uses GPT-4 Turbo and function calling to make updates automatically.
- All updates to Wizard generate tested code.
- Ask it questions and it will reply with answers.
- Built completely open source, including the prompts.

OpenZeppelin

11,752 views • 2 years ago

Your live AI job interview assistant: Guy builds Whisper + GPT-4 live transcription tool for generating real-time responses during job interviews, and open-sourced the code. (link to GitHub in comments)

AI Breakfast

426,945 views • 3 years ago

🚨 You can now use the new upcoming OpenAI model GPT 5.2 inside Cursor. Here is the full walkthrough.
- Open the editor, go to settings and then the model tab. Add a custom model and enter the text "gpt-5.2-high" and "gpt-5.2".
- After that you can select the model and ask questions.
To verify, I started my test on the usage page which had zero gpt-5.2-high requests and consumption. After the test I could see the details in usage and the cost incurred while using it. Enjoy

AshutoshShrivastava

424,035 views • 6 months ago

GPT-4V + TTS = AI Sports narrator 🪄⚽️ Passed every frame of a football video to gpt-4-vision-preview, and with some simple prompting asked it to generate a narration. No edits, this is as it came out from the model (aka can be SO MUCH BETTER)

Gonzalo

2,665,026 views • 2 years ago

🚨 Abacus AI Studio - Use Agentic Orchestration To Create Viral Videos
Agentic loops powered by Opus 4.7 and GPT 5.5 orchestrate state-of-the-art video and image models
- GPT 2 image mixed with SeeDance 2.0
- Nano Banana Pro combined with Grok Imagine
- Kling Motion Control for viral Instagram hits
Create viral marketing and advertising campaigns

Abacus.AI

996,044 views • 1 month ago

This is wild. I gave OS control to GPT-4 via the latest update of Open Interpreter and now it's generating pictures it wants to see in EverArt 🤯 GPT is controlling the mouse and adding text in the fields, I am not doing anything.

Pietro Schirano

259,918 views • 2 years ago

Say hello to GPT-4o, our new flagship model which can reason across audio, vision, and text in real time: Text and image input rolling out today in API and ChatGPT with voice and video in the coming weeks.

OpenAI

22,806,198 views • 2 years ago