We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in...

73,074 views • 11 months ago • via X (Twitter)

10 comments

Amir Zamir • 11 months ago

What about existing vision benchmarks for MFMs? Most existing benchmarks, like those based on VQA, rely on natural language for evaluation. This impacts their ability to evaluate MFMs on standard vision tasks like pixel-level segmentation and depth, and also prevents a direct comparison with vision specialists. We tackle this gap by systematically evaluating the models on these tasks, to get a detailed look at their visual understanding and a comparison with specialist vision models. 🧵 2/n

Amir Zamir • 11 months ago

How do you get a language-based model to segment an image? Many vision tasks require dense, pixel-wise outputs: something most current MFMs aren’t designed to express in their output. To bridge this gap, we break each task into text-promptable sub-tasks that can be solved via iterative prompting. 🧵 3/n

Amir Zamir • 11 months ago

Example: semantic segmentation. The models can't output segmentation masks directly, so we first group pixels into superpixels using SLIC. Then our prompt chain asks the MFM to classify each superpixel individually. The individual predictions are stitched together to create the final, full-image segmentation mask. By adjusting the number of superpixels, we can trade off computational cost against segmentation granularity. 🧵 4/n
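The classify-then-stitch step can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `grid_superpixels` is a square-cell stand-in for SLIC (not SLIC itself), and `fake_mfm` is a hypothetical placeholder for one prompt to the model per region. This is not the authors' released framework.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]


def grid_superpixels(height: int, width: int, cell: int) -> Grid:
    """Assign each pixel a region id using square cells.

    Stand-in for SLIC: real superpixels follow image content,
    these cells do not.
    """
    cols = (width + cell - 1) // cell  # number of cells per row
    return [
        [(y // cell) * cols + (x // cell) for x in range(width)]
        for y in range(height)
    ]


def stitch_segmentation(
    regions: Grid, classify_region: Callable[[int], str]
) -> List[List[str]]:
    """Query the classifier once per region, then paint that label
    onto every pixel belonging to the region."""
    labels: Dict[int, str] = {}
    for row in regions:
        for rid in row:
            if rid not in labels:
                labels[rid] = classify_region(rid)
    return [[labels[rid] for rid in row] for row in regions]


# Toy run: a 4x6 image split into six 2x2 regions.
calls: List[int] = []


def fake_mfm(region_id: int) -> str:
    """Hypothetical placeholder for one model prompt per region."""
    calls.append(region_id)
    return "sky" if region_id < 3 else "grass"


regions = grid_superpixels(4, 6, cell=2)
mask = stitch_segmentation(regions, fake_mfm)
```

In this toy setup the number of model queries equals the number of regions, so a smaller `cell` gives a finer mask at the price of more prompts.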

Amir Zamir • 11 months ago

How do MFMs compare with vision specialists? When we evaluate vision specialists under the same conditions, they maintain a clear advantage over MFMs. MFMs perform reasonably on semantic tasks, but show a larger gap in geometric tasks like depth and normals. For a fair comparison, we control for the variance introduced by the prompting process, for example by limiting the segmentation to the granularity of superpixels. 🧵 5/n
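One plausible reading of "limiting the segmentation to the granularity of superpixels" is to coarsen a dense prediction so that every superpixel carries a single label. The sketch below does this with a per-region majority vote. The function name `snap_to_superpixels` and the voting rule are assumptions for illustration; the thread does not specify the exact procedure.

```python
from collections import Counter
from typing import Dict, List


def snap_to_superpixels(
    dense: List[List[str]], regions: List[List[int]]
) -> List[List[str]]:
    """Replace each pixel's label with the most common label
    inside its superpixel (majority vote per region)."""
    votes: Dict[int, Counter] = {}
    for label_row, region_row in zip(dense, regions):
        for label, rid in zip(label_row, region_row):
            votes.setdefault(rid, Counter())[label] += 1
    winner = {rid: c.most_common(1)[0][0] for rid, c in votes.items()}
    return [[winner[rid] for rid in region_row] for region_row in regions]


# Toy example: two superpixels, one stray "dog" pixel in region 0.
regions = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
dense = [
    ["cat", "cat", "dog", "dog"],
    ["cat", "dog", "dog", "dog"],
]
coarse = snap_to_superpixels(dense, regions)
```

After snapping, the stray pixel in region 0 takes the majority label, so the coarsened prediction has the same spatial resolution as one built from per-superpixel answers.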

Amir Zamir • 11 months ago

How do MFMs fare against each other? Among the non-reasoning models, GPT-4o consistently outperforms the rest across most tasks, followed by Gemini 2.0 Flash. Overall, the MFMs are respectable generalists. We also include a “blind” baseline for control and calibration, which we discuss next. 🧵 6/n

Amir Zamir • 11 months ago

What are the baked-in biases of these models? To find out, we asked GPT-4o to perform tasks on a blank image—a "blind guess." The results reveal its priors: it assumes common objects, places the sky at the top, and knows that floors are generally closer than ceilings. This helps us disentangle true visual understanding from winning by using statistical biases. 🧵 7/n

Amir Zamir • 11 months ago

Reasoning models: what effect does ‘reasoning’ have on performance on these tasks? We tested new reasoning models (o1, o3) in addition to o4-mini, and observed a notable split: ✅ A minor boost for semantic tasks. 🚀 A significant jump for geometric tasks like depth and normals. 🧵 8/n

Amir Zamir • 11 months ago

GPT-4o with image generation: the latest GPT-4o can now generate images natively. While this could make prompt-chaining unnecessary for dense predictions, our preliminary tests show that the model often creates 'semantic recreations' of the input instead of properly editing it to carry out the task, introducing hallucinations & spatial errors. A promising path for future work, but challenges still need to be addressed. 🧵 9/n

Amir Zamir • 11 months ago

Final Takeaways 📌 The multimodal foundation models are impressive generalists. However, they still lag behind vision specialists. 📌 They perform better on semantics (e.g., classification, segmentation) than geometry (depth, normals). 📌 Among the non-reasoning models, GPT-4o consistently outperforms its peers on most tasks. 📌 Reasoning models show promising improvements, especially in geometric tasks. We’re releasing the evaluation framework. Interactive visualizations and Code: 🔗 Joint work with: Rahul Ramachandran, @aligarjani @roman__bachmann @andrew_atanov @oguzhanthefatih 🧵 n/n

Sivan Doveh • 11 months ago

Our recent ICCV work tests few-shot localization in these models, and it seems that their understanding of coordinate-based tasks is still lacking (ofc we show a way to improve 😉) IPLOC
