正在加载视频...

视频加载失败

Sam Altman says the next big thrust is making models handle much longer tasks We have already moved from models handling 5-second coding tasks with GPT-3.5 to 5-hour tasks with GPT-5 The goal now is to integrate enterprise context so AI can handle tasks that require months or years

122,717 次观看 • 7 个月前 •via X (Twitter)

0 条评论

暂无评论

原始帖子的评论将显示在这里

相关视频

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail. Especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades. Observed trends: 🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks. 🔹 They perform semantic tasks notably better than geometric ones. 🔹 GPT-4o performs the best among non-reasoning models, getting the top position in 4 out of 6 tasks. 🔹 Reasoning models, e.g., o3, show improvements in geometric tasks. 🔹 The 'image generation' models, e.g., GPT-40 Image Generation, which have been natively trained multimodally, exhibit quirks. E.g., hallucinated objects, misalignment between the input and output, etc. 🔹 While the prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments. 🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

Amir Zamir

73,074 次观看 • 11 个月前