
Amir Zamir
@zamir_ar • 5,382 subscribers
Assistant Prof of CS, @EPFL_en Swiss Federal Institute of Technology Lausanne. Previously @Berkeley_AI, @StanfordAILab, @ucf. Into Vision, MachineLearning, AI

Test-time scaling, reasoning, and search-like processes in general clearly drive significant gains in LLMs, largely owing to the structure of language. One would think the same could apply to non-linguistic domains, like image generation, but that depends on whether the structure of the domain's representation lends itself to search. 1D ordered tokens (e.g., image FlexTok, video FlexTok) seem like a natural fit since they enable step-by-step, coarse-to-fine generation. We investigated this and found that they indeed enable search and scale far better with test-time compute than 2D grids. See the visuals on the webpage. Appearing in ICML Conference 2026. 🔗 📄
Amir Zamir • 14,810 views • 29 days ago

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks, from segmentation to surface normal estimation, using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail, especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades.
Observed trends:
🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks.
🔹 They perform semantic tasks notably better than geometric ones.
🔹 GPT-4o performs the best among non-reasoning models, taking the top position in 4 out of 6 tasks.
🔹 Reasoning models, e.g., o3, show improvements in geometric tasks.
🔹 The 'image generation' models, e.g., GPT-4o Image Generation, which have been natively trained multimodally, exhibit quirks, e.g., hallucinated objects and misalignment between the input and output.
🔹 While prompting techniques affect performance, better models are less sensitive to variations in prompts. We control for the variance introduced by the prompting methods in our experiments.
🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n
Amir Zamir • 72,917 views • 11 months ago

How far can a very simple eye, like a 1-pixel camera, go in solving vision tasks? Humans have some of the most capable eyes in nature, while many animals have significantly simpler eyes and visual systems yet show complex perceptual behavior. In an interesting project, we find that many computer vision tasks can be solved without a typical camera, using such simple 1-pixel sensors (photoreceptors). We also find that proper design (e.g., where to place the photoreceptors) makes a big difference, so we developed a computational design method to find good designs. 🌐 👁️[Solving Vision Tasks with Simple Photoreceptors Instead of Cameras] 🧵1/n
Amir Zamir • 75,863 views • 2 years ago

We are releasing 4M-21 with a permissive license, including its source code and trained models. It's a pretty effective multimodal model that handles tens of tasks and modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website. IMO, the multitask learning aspect of multimodal models has really taken a step forward: we can train a single model on many diverse tasks with ~SOTA accuracy. But there is still a long way to go in terms of transfer/emergence. 🌐 ⌨️ Joint work w/ EPFL and Apple.
Amir Zamir • 69,241 views • 2 years ago