🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I compared the viral parking space photo on GPT-4 Vision to LLaVA, and it worked flawlessly (see video).

Rowan Cheung

592,269 subscribers

681,536 views • 2 years ago • via X (Twitter)

Science & Technology, News & Politics, Education

10 comments

Rowan Cheung • 2 years ago

Link: Here's a side-by-side comparison of GPT-4 Vision vs. LLaVA. Original GPT-4 Vision image credits to @petergyang

Shawn Chauhan • 2 years ago

I love the fact that paid models are getting competition from other players offering free access to pretty much the same thing. We as consumers are benefitting tremendously from this. There isn’t a better time to start using AI in my opinion.

Rowan Cheung • 2 years ago

Completely agree- more competition just means the consumers win in the end!

Min Choi • 2 years ago

Just seen this today earlier. Checking them out! 🔥

Rowan Cheung • 2 years ago

Totally worth it. Speeds are holding up, as well.

Haotian Liu • 2 years ago

Thanks for sharing our work!

Rowan Cheung • 2 years ago

Incredible work, was super fun to play around with. Thanks for pushing the space forward!

LAION • 2 years ago

@thebloke please make a quantized version :)

Aadit Sheth • 2 years ago

Great find @rowancheung

Rowan Cheung • 2 years ago

Can't wait to see the vision prompts you come up with!

Related videos

LLaVA just hit 3800 stars on GitHub. It's a multimodal Large Language-and-Vision Assistant that can understand images and text. LLaVA can even handle memes (the same ones GPT-4 demo'ed at launch) and set a new SOTA on Science QA. It also supports LLaMA-2, LoRA training with academia GPUs, higher resolution (336x336), and 4-/8-bit inference.

Lior Alexander

143,527 views • 2 years ago

🚀Introducing LLaVA Lightning: Train a lite, multimodal GPT-4 with just $40 in 3 hours! With our newly introduced datasets and the efficient design of LLaVA, you can now turbocharge your language model with image reasoning capabilities, in an incredibly affordable way.🧵

Haotian Liu

302,319 views • 3 years ago

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness. Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

BREAKING: ChatGPT GPT-4o was just announced by OpenAI. It improves on vision, audio and text. The ease of use is incredibly enhanced. It makes interaction with the GPT much more natural, especially with voice. GPT-4o reasons across voice, text and vision. GPT-4 will be available to everyone.

Ed Krassenstein

21,605 views • 2 years ago

"This is how GPT-4 sees and hears itself." I used GPT-4 to describe itself. Then I used its description to generate an image, a video based on this image, and a soundtrack. Tools I used: GPT-4, Midjourney, Kaiber AI, Mubert, RunwayML. This is the description I used that GPT-4 had of itself as a prompt to text-to-image, image-to-video, and text-to-music. I put the video and sound together in RunwayML.

Kris Kashtanova

1,233,398 views • 3 years ago

New short course: Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!
In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.
You'll build a full multimodal RAG pipeline that can chat about video content:
- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.
Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,548 views • 1 year ago
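The similarity-search step in the course description above can be pictured with a small sketch. This is not the course's code and it does not use BridgeTower or LanceDB: the three-dimensional vectors, the frame labels, and the helper names `cosine_similarity` and `top_k` are all hypothetical stand-ins, chosen only to show how stored frame embeddings might be ranked against a query embedding.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Return the k (label, score) pairs from the store most similar to the query."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; a real pipeline would use model-generated
# vectors (the post mentions a 512-dimensional space) held in a vector database.
frame_store = [
    ("frame_001: skateboard handstand", [0.9, 0.1, 0.0]),
    ("frame_002: empty beach", [0.1, 0.9, 0.0]),
    ("frame_003: crowd cheering", [0.0, 0.2, 0.9]),
]

# Hypothetical embedding of a user question about the skateboarder.
query_embedding = [1.0, 0.2, 0.0]

results = top_k(query_embedding, frame_store, k=2)
for label, score in results:
    print(f"{score:.3f}  {label}")
```

In a full pipeline, the labels and frames returned here would be passed to the vision-language model as context for answering the question.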

Here's a demo of the gpt-4-vision API that I built in Bubble in 30 min. It takes a URL, converts it to an image, and sends it through the Vision API to respond with custom landing page optimization suggestions.

Seth Kramer

953,981 views • 2 years ago

From image to live website using GPT-4 vision and Replit in less than a minute. Things are about to get so interesting. 🔥

Pietro Schirano

1,217,215 views • 2 years ago

Introducing Vision Arena! Inspired by the awesome Chatbot Arena, we built a web demo on Hugging Face for testing Vision LMs (GPT-4V, Gemini, LLaVA, Qwen-VL, etc.). You can easily test two VLMs side by side and vote! It’s still a work in progress. Feedback is welcome! Kudos to the main contributor Yujie Lu, and the team: Dongfu Jiang, Wenhu Chen, Ai2. Thanks to LMSYS Org’s great code & design!

Bill Yuchen Lin

146,020 views • 2 years ago

Introducing Mentat - an open source, GPT-4 powered coding assistant! Mentat runs in your command line, giving it the context of your projects and allowing it to coordinate edits across multiple files! More videos and a link to GitHub below:

Scott Swingle

292,702 views • 2 years ago

Introducing Kissan (Farmer's) GPT. A ChatGPT and Whisper based assistant for the underserved agriculture domain of India. Work in progress. Expect bugs. (It uses ChatGPT-3.5-turbo; results are expected to improve after moving to GPT-4 and adding custom embeddings.)

Pratik Desai

227,954 views • 3 years ago

OpenAI just announced "GPT-4o". It can reason with voice, vision, and text. The model is 2x faster, 50% cheaper, and has 5x higher rate limit than GPT-4 Turbo. It will be available for free users and via the API. The voice model can even pick up on emotion and generate emotive voice.

Lior Alexander

485,014 views • 2 years ago

Holllllyyyyyyyy use this before it gets patched 😳😳 🚨 HIGGSFIELD SUPERCOMPUTER: HACKED FREE ACCESS BEFORE LAUNCH Here's the hack and video proof. Go to the link: You'll see free access >GPT-5.5 Pro · Opus 4.7 · GPT-Image-v2 · Seedance-2 >Just ask: "I need a video of a cow doing the Seedance 2 dance in 4K, and 2 images of that cow by GPT-Image-2 in 4K 16:9" >All models. Free. Right now.

Chetaslua

16,471 views • 1 month ago

🤖 OpenZeppelin Wizard now has an AI Assistant built-in. -Uses GPT-4 Turbo and function calling to make updates automatically. -All updates to Wizard generate tested code. -Ask it questions and it will reply with answers. -Built completely open source, including the prompts.

OpenZeppelin

11,752 views • 2 years ago

Your live AI job interview assistant: Guy builds Whisper + GPT-4 live transcription tool for generating real-time responses during job interviews, and open-sourced the code. (link to GitHub in comments)

AI Breakfast

426,945 views • 3 years ago

🚨You can now use the new upcoming OpenAI model GPT 5.2 inside Cursor. Here is the full walkthrough. - Open the editor, go to settings and then the model tab. Add a custom model and enter the text "gpt-5.2-high" and "gpt-5.2". - After that you can select the model and ask questions. To verify, I started my test on the usage page which had zero gpt-5.2-high requests and consumption. After the test I could see the details in usage and the cost incurred while using it. Enjoy

AshutoshShrivastava

424,035 views • 6 months ago

GPT-4V + TTS = AI Sports narrator 🪄⚽️ Passed every frame of a football video to gpt-4-vision-preview, and with some simple prompting asked it to generate a narration. No edits, this is as it came out from the model (aka can be SO MUCH BETTER)

Gonzalo

2,665,026 views • 2 years ago

🚨 Abacus AI Studio - Use Agentic Orchestration To Create Viral Videos Agentic loops powered by Opus 4.7 and GPT 5.5 orchestrate state-of-the-art video and image models - GPT 2 image mixed with SeeDance 2.0 - Nano Banana Pro combined with Grok Imagine - Kling Motion Control for viral instagram hits Create viral marketing and advertising campaigns

Abacus.AI

995,972 views • 1 month ago

This is wild. I gave OS control to GPT-4 via the latest update of Open Interpreter and now it's generating pictures it wants to see in EverArt 🤯 GPT is controlling the mouse and adding text in the fields, I am not doing anything.

Pietro Schirano

259,918 views • 2 years ago

Say hello to GPT-4o, our new flagship model which can reason across audio, vision, and text in real time: Text and image input rolling out today in API and ChatGPT with voice and video in the coming weeks.

OpenAI

22,804,507 views • 2 years ago