
We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks, from segmentation to surface normal estimation, using standard datasets like COCO and ImageNet.

These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail, especially on tasks beyond question answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades.

Observed trends:
🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks.
🔹 They perform semantic tasks notably better than geometric ones.
🔹 GPT-4o performs the best among non-reasoning models, taking the top position in 4 out of 6 tasks.
🔹 Reasoning models, e.g., o3, show improvements in geometric tasks.
🔹 The 'image generation' models, e.g., GPT-4o Image Generation, which have been natively trained multimodally, exhibit quirks such as hallucinated objects and misalignment between the input and output.
🔹 While prompting techniques affect performance, better models are less sensitive to variations in prompts. We control for the variance introduced by the prompting methods in our experiments.

🌐 Detailed analyses, visualizations:
⌨️ code:
🧵 1/n

Amir Zamir

5,379 subscribers

73,074 views • 11 months ago • via X (Twitter)


10 comments

Amir Zamir • 11 months ago

What about existing vision benchmarks for MFMs? Most existing benchmarks, like those based on VQA, rely on natural language for evaluation. This impacts their ability to evaluate MFMs on standard vision tasks like pixel-level segmentation and depth, and also prevents a direct comparison with vision specialists. We tackle this gap by systematically evaluating the models on these tasks, to get a detailed look at their visual understanding and a comparison with specialist vision models. 🧵 2/n

Amir Zamir • 11 months ago

How do you get a language-based model to segment an image? Many vision tasks require dense, pixel-wise outputs: something most current MFMs aren’t designed to express in their output. To bridge this gap, we break each task into text-promptable sub-tasks that can be solved via iterative prompting. 🧵 3/n

Amir Zamir • 11 months ago

Example: Semantic Segmentation. The models can't output a segmentation directly. So, we first group pixels into superpixels using SLIC. Then, our prompt chain asks the MFM to classify each superpixel individually. The individual predictions are then stitched together to create the final, full-image segmentation mask. By adjusting the number of superpixels, we can trade off between computational cost and segmentation granularity. 🧵 4/n
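A minimal sketch of the classify-and-stitch step described in this post, in plain Python. Everything named here is hypothetical and not taken from the released framework: `grid_superpixels` is a square-grid stand-in for SLIC (used only so the example needs no imaging library), and `classify` stands in for one prompt to the MFM about one superpixel.

```python
from typing import Callable, Dict, List

SegmentMap = List[List[int]]   # superpixel id per pixel
LabelMap = List[List[str]]     # class name per pixel


def grid_superpixels(height: int, width: int, cell: int) -> SegmentMap:
    """Stand-in for SLIC (assumption): tile the image with square cells."""
    cols = (width + cell - 1) // cell  # number of cells per row
    return [
        [(y // cell) * cols + (x // cell) for x in range(width)]
        for y in range(height)
    ]


def stitch_segmentation(
    segments: SegmentMap, classify: Callable[[int], str]
) -> LabelMap:
    """Ask `classify` once per superpixel, then paint its label on every
    pixel that belongs to that superpixel."""
    labels: Dict[int, str] = {}
    mask: LabelMap = []
    for row in segments:
        out_row = []
        for seg_id in row:
            if seg_id not in labels:
                # In a real pipeline this would be one model query.
                labels[seg_id] = classify(seg_id)
            out_row.append(labels[seg_id])
        mask.append(out_row)
    return mask


if __name__ == "__main__":
    segs = grid_superpixels(height=4, width=4, cell=2)
    # Toy classifier (assumption): top cells are "sky", bottom cells "floor".
    mask = stitch_segmentation(segs, lambda s: "sky" if s < 2 else "floor")
    for row in mask:
        print(row)
```

Shrinking `cell` yields more superpixels, so more model queries and a finer mask, which is the cost versus granularity trade-off mentioned in the post.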

Amir Zamir • 11 months ago

How do MFMs compare with vision specialists? When we evaluate vision specialists under the same conditions, they maintain a clear advantage over MFMs. MFMs perform reasonably on semantic tasks, but show a larger gap in geometric tasks like depth and normals. For a fair comparison, we control for the variance introduced by the prompting process, for example by limiting the specialists' segmentation to the granularity of superpixels. 🧵 5/n
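One way to picture "limiting the segmentation to the granularity of superpixels" is to coarsen a specialist's dense prediction so that each superpixel carries a single label. The sketch below uses a majority vote per superpixel; that rule and the name `quantize_to_superpixels` are assumptions for illustration, not the thread's confirmed procedure.

```python
from collections import Counter
from typing import Dict, List


def quantize_to_superpixels(
    segments: List[List[int]], dense_labels: List[List[str]]
) -> List[List[str]]:
    """Replace every pixel label by the most frequent label inside its
    superpixel (majority vote, an assumed rule)."""
    votes: Dict[int, Counter] = {}
    for seg_row, label_row in zip(segments, dense_labels):
        for seg_id, label in zip(seg_row, label_row):
            votes.setdefault(seg_id, Counter())[label] += 1
    winner = {seg_id: c.most_common(1)[0][0] for seg_id, c in votes.items()}
    return [[winner[seg_id] for seg_id in row] for row in segments]


if __name__ == "__main__":
    segments = [[0, 0, 1, 1],
                [0, 0, 1, 1]]
    specialist = [["cat", "cat", "dog", "cat"],
                  ["cat", "dog", "dog", "dog"]]
    for row in quantize_to_superpixels(segments, specialist):
        print(row)
```

After this step the specialist and the prompt-chained MFM are scored at the same spatial resolution, so the remaining gap is not explained by the superpixel constraint alone.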

Amir Zamir • 11 months ago

How do MFMs fare against each other? Among the non-reasoning models, GPT-4o consistently outperforms the rest across most tasks, followed by Gemini 2.0 Flash. Overall, the MFMs are respectable generalists. We also include a “blind” baseline for control and calibration, which we discuss next. 🧵 6/n

Amir Zamir • 11 months ago

What are the baked-in biases of these models? To find out, we asked GPT-4o to perform tasks on a blank image—a "blind guess." The results reveal its priors: it assumes common objects, places the sky at the top, and knows that floors are generally closer than ceilings. This helps us disentangle true visual understanding from winning by using statistical biases. 🧵 7/n
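The blind-guess control can be sketched as running the same predictor on the real images and on blank copies of them, then comparing the two scores. This is a toy illustration: `predict`, the tiny grayscale images, and the brightness rule are all made up, and accuracy stands in for whatever metric a given task uses.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]  # toy grayscale image, values 0..255


def blank_like(image: Image, value: int = 255) -> Image:
    """Return a uniform image with the same shape as `image`."""
    return [[value for _ in row] for row in image]


def accuracy(
    predict: Callable[[Image], str],
    images: Sequence[Image],
    targets: Sequence[str],
) -> float:
    correct = sum(predict(img) == t for img, t in zip(images, targets))
    return correct / len(targets)


def blind_gap(
    predict: Callable[[Image], str],
    images: Sequence[Image],
    targets: Sequence[str],
) -> Tuple[float, float]:
    """Score with real inputs and with blank inputs; the second number
    is what the predictor's priors earn without seeing anything."""
    seen = accuracy(predict, images, targets)
    blind = accuracy(predict, [blank_like(i) for i in images], targets)
    return seen, blind


if __name__ == "__main__":
    def toy_predict(img: Image) -> str:
        # Hypothetical predictor: thresholds the mean brightness.
        pixels = [p for row in img for p in row]
        return "bright" if sum(pixels) / len(pixels) > 127 else "dark"

    images = [[[0, 0], [0, 0]], [[200, 200], [200, 200]]]
    targets = ["dark", "bright"]
    print(blind_gap(toy_predict, images, targets))
```

A model whose score with real inputs is close to its blind score is mostly winning on priors, which is the disentangling the post describes.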

Amir Zamir • 11 months ago

Reasoning Models
What effect does ‘reasoning’ have on the performance on these tasks? We tested new reasoning models (o1, o3) in addition to o4-mini, and observed a notable split:
✅ A minor boost for semantic tasks.
🚀 A significant jump for geometric tasks like depth and normals.
🧵 8/n

Amir Zamir • 11 months ago

GPT-4o with Image Generation
The latest GPT-4o can now generate images natively. While this could make prompt-chaining unnecessary for dense predictions, our preliminary tests show that the model often creates 'semantic recreations' of the input rather than precise edits that carry out the task, introducing hallucinations and spatial errors. A promising path for future work, but challenges still need to be addressed. 🧵 9/n

Amir Zamir • 11 months ago

Final Takeaways
📌 The multimodal foundation models are impressive generalists. However, they still lag behind vision specialists.
📌 They perform better on semantics (e.g., classification, segmentation) than geometry (depth, normals).
📌 Among the non-reasoning models, GPT-4o consistently outperforms its peers on most tasks.
📌 Reasoning models show promising improvements, especially in geometric tasks.

We’re releasing the evaluation framework. Interactive visualizations and code: 🔗
Joint work with: Rahul Ramachandran, @aligarjani @roman__bachmann @andrew_atanov @oguzhanthefatih
🧵 n/n

Sivan Doveh • 11 months ago

Our recent ICCV work tests few-shot localization in these models, and it seems that their understanding of coordinate-based tasks is still lacking (ofc we show a way to improve 😉). IPLOC
