We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks, from segmentation to surface normal estimation, using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in...
73,074 views • 11 months ago • via X (Twitter)
10 comments

What about existing vision benchmarks for MFMs? Most existing benchmarks, like those based on VQA, rely on natural language for evaluation. This impacts their ability to evaluate MFMs on standard vision tasks like pixel-level segmentation and depth, and also prevents a direct comparison with vision specialists. We tackle this gap by systematically evaluating the models on these tasks, to get a detailed look at their visual understanding and a comparison with specialist vision models. 🧵 2/n

How do you get a language-based model to segment an image? Many vision tasks require dense, pixel-wise outputs: something most current MFMs aren’t designed to express in their output. To bridge this gap, we break each task into text-promptable sub-tasks that can be solved via iterative prompting. 🧵 3/n

Example: Semantic Segmentation. The models can't output a segmentation mask directly, so we first group pixels into superpixels using SLIC. Our prompt chain then asks the MFM to classify each superpixel individually, and the individual predictions are stitched together into the final, full-image segmentation mask. By adjusting the number of superpixels, we can trade off computational cost against segmentation granularity. 🧵 4/n
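
The stitching step can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the released framework: it assumes the superpixel map has already been computed (for instance by SLIC) and is given as a nested list of superpixel ids, and `classify_superpixel` is a hypothetical callable standing in for one round of prompting the MFM. The names `stitch_segmentation` and `fake_answers` are made up for this example.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]


def stitch_segmentation(
    superpixels: Grid,
    classify_superpixel: Callable[[int], int],
) -> Grid:
    """Build a full-image mask by classifying each superpixel once."""
    # Collect the distinct superpixel ids present in the map.
    ids = sorted({sp for row in superpixels for sp in row})
    # One classifier call per superpixel (one prompt, in the thread's setup).
    label_of: Dict[int, int] = {sp: classify_superpixel(sp) for sp in ids}
    # Every pixel inherits the label predicted for its superpixel.
    return [[label_of[sp] for sp in row] for row in superpixels]


# Toy 4x4 superpixel map with three regions (ids 0, 1, 2).
superpixels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 2, 2],
]
# Hypothetical stand-in for the model's answers: superpixel id -> class id.
fake_answers = {0: 7, 1: 3, 2: 7}
mask = stitch_segmentation(superpixels, fake_answers.__getitem__)
```

In this sketch the number of classifier calls equals the number of superpixels, which is where the cost versus granularity trade-off comes from: more superpixels give finer masks but require more prompts.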

How do MFMs compare with vision specialists? When we evaluate vision specialists under the same conditions, they maintain a clear advantage over MFMs. MFMs perform reasonably on semantic tasks, but show a larger gap in geometric tasks like depth and normals. For a fair comparison, we control for the variance introduced by the prompting process, for example by limiting the segmentation to the granularity of superpixels. 🧵 5/n
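
One way to limit a specialist's pixel-level prediction to superpixel granularity is a per-superpixel majority vote, sketched below. The thread does not say how this control is implemented, so the voting rule is an assumption, and `snap_to_superpixels` is a hypothetical helper name.

```python
from collections import Counter, defaultdict
from typing import Dict, List

Grid = List[List[int]]


def snap_to_superpixels(pixel_labels: Grid, superpixels: Grid) -> Grid:
    """Replace each pixel's label with the majority label of its superpixel."""
    # Tally the predicted labels inside every superpixel.
    votes: Dict[int, Counter] = defaultdict(Counter)
    for label_row, sp_row in zip(pixel_labels, superpixels):
        for label, sp in zip(label_row, sp_row):
            votes[sp][label] += 1
    # Assumed rule: the most frequent label wins within each superpixel.
    majority = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}
    return [[majority[sp] for sp in row] for row in superpixels]


# Toy example: two superpixels (left half = 0, right half = 1).
superpixels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical pixel-level specialist output with a ragged class boundary.
pixel_labels = [
    [5, 5, 5, 9],
    [5, 9, 9, 9],
]
snapped = snap_to_superpixels(pixel_labels, superpixels)
```

After snapping, the specialist's mask has the same spatial resolution limit as a mask assembled from per-superpixel answers, so the two can be scored on equal footing.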

How do MFMs fare against each other? Among the non-reasoning models, GPT-4o consistently outperforms the rest across most tasks, followed by Gemini 2.0 Flash. Overall, the MFMs are respectable generalists. We also include a “blind” baseline for control and calibration, which we discuss next. 🧵 6/n

What are the baked-in biases of these models? To find out, we asked GPT-4o to perform tasks on a blank image: a "blind guess." The results reveal its priors: it assumes common objects, places the sky at the top, and knows that floors are generally closer than ceilings. This helps us disentangle true visual understanding from scores earned by exploiting statistical biases. 🧵 7/n

Reasoning Models. What effect does ‘reasoning’ have on performance on these tasks? We tested new reasoning models (o1, o3) in addition to o4-mini, and observed a notable split: ✅ A minor boost for semantic tasks. 🚀 A significant jump for geometric tasks like depth and normals. 🧵 8/n

GPT-4o with Image Generation. The latest GPT-4o can now generate images natively. While this could make prompt-chaining unnecessary for dense predictions, our preliminary tests show that the model often creates 'semantic recreations' of the input instead of properly editing it to carry out the task, introducing hallucinations and spatial errors. A promising path for future work, but challenges still need to be addressed. 🧵 9/n

Final Takeaways
📌 The multimodal foundation models are impressive generalists. However, they still lag behind vision specialists.
📌 They perform better on semantics (e.g., classification, segmentation) than geometry (depth, normals).
📌 Among the non-reasoning models, GPT-4o consistently outperforms its peers on most tasks.
📌 Reasoning models show promising improvements, especially in geometric tasks.
We’re releasing the evaluation framework. Interactive visualizations and Code: 🔗
Joint work with: Rahul Ramachandran, @aligarjani @roman__bachmann @andrew_atanov @oguzhanthefatih 🧵 n/n

Our recent ICCV work tests few-shot localization in these models, and it seems that their understanding of coordinate-based tasks is still lacking (ofc we show a way to improve 😉) IPLOC

