🚨 BREAKING: GPT-4 image recognition already has a new competitor. Open-sourced and completely free to use. Introducing LLaVA: Large Language and Vision Assistant. I compared the viral parking space photo on GPT-4 Vision to LLaVA, and it worked flawlessly (see video).

Rowan Cheung

588,324 subscribers

681,435 views • 2 years ago • via X (Twitter)

Science & Technology • News & Politics • Education

Comments: 10

Rowan Cheung • 2 years ago

Link: Here's a side-by-side comparison of GPT-4 Vision vs. LLaVA. Original GPT-4 Vision image credits to @petergyang

Shawn Chauhan • 2 years ago

I love the fact that paid models are getting competition from other players offering free access to pretty much the same thing. We as consumers are benefitting tremendously from this. There isn’t a better time to start using AI in my opinion.

Rowan Cheung • 2 years ago

Completely agree: more competition just means the consumers win in the end!

Min Choi • 2 years ago

Just saw this earlier today. Checking them out! 🔥

Rowan Cheung • 2 years ago

Totally worth it. Speeds are holding up, as well.

Haotian Liu • 2 years ago

Thanks for sharing our work!

Rowan Cheung • 2 years ago

Incredible work, was super fun to play around with. Thanks for pushing the space forward!

LAION • 2 years ago

@thebloke please make a quantized version :)

Aadit Sheth • 2 years ago

Great find @rowancheung

Rowan Cheung • 2 years ago

Can't wait to see the vision prompts you come up with!

Related videos

LLaVA just hit 3800 stars on GitHub. It's a multimodal Large Language-and-Vision Assistant that can understand images and text. LLaVA can even handle memes (the same ones GPT-4 demo'ed at launch) and set a new SOTA on Science QA. It also supports LLaMA-2, LoRA training with academia GPUs, higher resolution (336x336), 4-/8-bit inference.

Lior Alexander

143,527 views • 2 years ago

🚀Introducing LLaVA Lightning: Train a lite, multimodal GPT-4 with just $40 in 3 hours! With our newly introduced datasets and the efficient design of LLaVA, you can now turbocharge your language model with image reasoning capabilities, in an incredibly affordable way.🧵

Haotian Liu

302,319 views • 3 years ago

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,713 views • 1 year ago

BREAKING: ChatGPT GPT-4o was just announced by OpenAI. It improves on vision, audio and text. The ease of use is incredibly enhanced. It makes interaction with the GPT much more natural, especially with voice. GPT-4o reasons across voice, text and vision. GPT-4 will be available to everyone.

Ed Krassenstein

21,605 views • 2 years ago

"This is how GPT-4 sees and hears itself" I used GPT-4 to describe itself. Then I used its description to generate an image, a video based on this image and a soundtrack. Tools I used: GPT-4, Midjourney, Kaiber AI, Mubert, RunwayML. This is the description I used that GPT-4 had of itself as a prompt to text-to-image, image-to-video, and text-to-music. I put the video and sound together in RunwayML.

Kris Kashtanova

1,233,374 views • 3 years ago

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal! In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:
- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,548 views • 1 year ago
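
The similarity-search step in the Multimodal RAG course description above (embed frames and text, then retrieve the closest matches to feed the LVLM) can be illustrated with a rough sketch. This is not the course's code: the `cosine_similarity` and `top_k` helpers, the 3-dimensional toy vectors, and the frame labels are all hypothetical stand-ins. A real pipeline would presumably use the 512-dimensional BridgeTower embeddings and a vector database such as LanceDB instead of an in-memory dict.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k (label, score) pairs most similar to the query, best first."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "frame embeddings"; real ones would come from an
# embedding model and have far more dimensions.
FRAME_INDEX: Dict[str, Sequence[float]] = {
    "frame_001: skateboarder at the beach": [0.9, 0.1, 0.0],
    "frame_002: close-up of the handstand": [0.8, 0.3, 0.1],
    "frame_003: empty parking lot": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    # Hypothetical embedding of a user question about the skateboarder.
    query_embedding = [1.0, 0.2, 0.0]
    for label, score in top_k(query_embedding, FRAME_INDEX, k=2):
        print(f"{score:.3f}  {label}")
```

In a full system, the retrieved frames and their captions or transcripts would then be passed to the LVLM as context for answering the question.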

Here's a demo of the gpt-4-vision API that I built in Bubble in 30 min. It takes a URL, converts it to an image, and sends it through the Vision API to respond with custom landing page optimization suggestions.

Seth Kramer

953,981 views • 2 years ago

From image to live website using GPT-4 vision and Replit in less than a minute. Things are about to get so interesting. 🔥

Pietro Schirano

1,217,206 views • 2 years ago

Introducing Vision Arena! Inspired by the awesome Chatbot Arena, we built a web demo on Hugging Face for testing Vision LMs (GPT-4V, Gemini, LLaVA, Qwen-VL, etc.). You can easily test two VLMs side by side and vote! It’s still a work in progress. Feedback is welcome! 🔗 Kudos to the main contributor Yujie Lu, and the team: Dongfu Jiang, Wenhu Chen, Ai2; thanks to LMSYS Org’s great code & design!

Bill Yuchen Lin

146,020 views • 2 years ago

Introducing Mentat - an open source, GPT-4 powered coding assistant! Mentat runs in your command line, giving it the context of your projects and allowing it to coordinate edits across multiple files! More videos and a link to GitHub below:

Scott Swingle

292,701 views • 2 years ago

Introducing Kissan (Farmer's) GPT. A ChatGPT- and Whisper-based assistant for the underserved agriculture domain of India. Work in progress. Expect bugs. (It uses ChatGPT-3.5-turbo; I expect results to improve after moving to GPT-4 and adding custom embeddings.)

Pratik Desai

227,954 views • 3 years ago

OpenAI just announced "GPT-4o". It can reason with voice, vision, and text. The model is 2x faster, 50% cheaper, and has a 5x higher rate limit than GPT-4 Turbo. It will be available for free users and via the API. The voice model can even pick up on emotion and generate emotive voice.

Lior Alexander

485,014 views • 2 years ago

Holllllyyyyyyyy use this before it gets patched 😳😳 🚨 HIGGSFIELD SUPERCOMPUTER: HACKED FREE ACCESS BEFORE LAUNCH Here's the hack and video proof Go to the link : You'll see free access >GPT-5.5 Pro · Opus 4.7 · GPT-Image-v2 · Seedance-2 >Just ask: "I need a video of a cow doing the Seedance 2 dance in 4K, and 2 images of that cow by GPT-Image-2 in 4K 16:9" >All models. Free. Right now.

Chetaslua

16,292 views • 1 month ago

🤖 OpenZeppelin Wizard now has an AI Assistant built-in. -Uses GPT-4 Turbo and function calling to make updates automatically. -All updates to Wizard generate tested code. -Ask it questions and it will reply with answers. -Built completely open source, including the prompts.

OpenZeppelin

11,752 views • 2 years ago

Your live AI job interview assistant: Guy builds Whisper + GPT-4 live transcription tool for generating real-time responses during job interviews, and open-sourced the code. (link to GitHub in comments)

AI Breakfast

426,934 views • 3 years ago

🚨You can now use the new upcoming OpenAI model GPT 5.2 inside Cursor. Here is the full walkthrough. - Open the editor, go to settings and then the model tab. Add a custom model and enter the text "gpt-5.2-high" and "gpt-5.2". - After that you can select the model and ask questions. To verify, I started my test on the usage page which had zero gpt-5.2-high requests and consumption. After the test I could see the details in usage and the cost incurred while using it. Enjoy

AshutoshShrivastava

424,035 views • 6 months ago

GPT-4V + TTS = AI Sports narrator 🪄⚽️ Passed every frame of a football video to gpt-4-vision-preview, and with some simple prompting asked it to generate a narration. No edits, this is as it came out from the model (aka can be SO MUCH BETTER)

Gonzalo

2,664,991 views • 2 years ago

🚨 Abacus AI Studio - Use Agentic Orchestration To Create Viral Videos Agentic loops powered by Opus 4.7 and GPT 5.5 orchestrate state-of-the-art video and image models - GPT 2 image mixed with SeeDance 2.0 - Nano Banana Pro combined with Grok Imagine - Kling Motion Control for viral instagram hits Create viral marketing and advertising campaigns

Abacus.AI

995,972 views • 29 days ago

This is wild. I gave OS control to GPT-4 via the latest update of Open Interpreter and now it's generating pictures it wants to see in EverArt 🤯 GPT is controlling the mouse and adding text in the fields, I am not doing anything.

Pietro Schirano

259,907 views • 2 years ago

Say hello to GPT-4o, our new flagship model which can reason across audio, vision, and text in real time: Text and image input rolling out today in API and ChatGPT with voice and video in the coming weeks.

OpenAI

22,801,989 views • 2 years ago