Amir Zamir

@zamir_ar • 5,382 subscribers

Assistant Prof of CS, @EPFL_en (Swiss Federal Institute of Technology Lausanne). Previously @Berkeley_AI, @StanfordAILab, @ucf. Into Vision, Machine Learning, AI

Shorts

Is it possible to adapt a neural network on the fly at test time to cope with distribution shifts? RNA does precisely that by creating a closed-loop feedback system. We will present it on Wed afternoon at #ICCV2025. 1/n

21,685 views

Videos

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks, from segmentation to surface normal estimation, using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail, especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades.

Observed trends:
🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks.
🔹 They perform semantic tasks notably better than geometric ones.
🔹 GPT-4o performs the best among non-reasoning models, taking the top position in 4 out of 6 tasks.
🔹 Reasoning models, e.g., o3, show improvements in geometric tasks.
🔹 The 'image generation' models, e.g., GPT-4o Image Generation, which have been natively trained multimodally, exhibit quirks, e.g., hallucinated objects and misalignment between the input and output.
🔹 While prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments.

🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

Amir Zamir

72,917 views • 11 months ago
