Microsoft's new Florence-2 is big for Computer Vision. It's a merge between Text and Vision. With a single prompt you can instruct the model to do CV tasks like captioning, object detection, grounding, and segmentation. The best part: it uses a single backbone to handle everything.

Lior Alexander

115,443 subscribers

186,544 views • 2 years ago • via X (Twitter)

Science & Technology

8 Comments

AlphaSignal AI • 2 years ago

@AlphaSignalAI One step closer to AGI..

Dash • 2 years ago

@AlphaSignalAI Holy shit

Lior⚡ • 2 years ago

@Mrosenmer Can't wait for the repo 👀

Ariyan • 2 years ago

@AlphaSignalAI @jxnlco @skalskip92 you've seen this?

zachary austin • 2 years ago

@AlphaSignalAI Look away NSA

ThisAndThat • 2 years ago

@AlphaSignalAI Well at least the demo is not much different than YOLOv8 or similar. We have been combining a few models to achieve what you have described. If this model can do all that together and with even better performance then great. But I don't trust Microsoft. They suck.

Waseem • 2 years ago

@AlphaSignalAI I've attempted to do something like this with images and GPT-4V. Results have been pretty good but working on improving it. Plan to put something like this on a robot with a raspberry pi.

alejandro cartagena • 2 years ago

@AlphaSignalAI Look @elmanmansimov

Related Videos

Florence-2, the new vision foundation model by Microsoft, can now run 100% locally in your browser on WebGPU, thanks to Transformers.js! 🤗🤯 It supports tasks like image captioning, optical character recognition, object detection, and many more! 😍 WOW! Demo (+ source code) 👇

Xenova

88,747 views • 2 years ago

Big News! Meta just released Segment Anything, a new AI model that can "cut out" any object, in any image/video, with a single click. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks.

Lior Alexander

290,190 views • 3 years ago

Introducing WildDet3D, a grounding model for monocular 3D object detection in the wild. A question I keep coming back to is: what is the right backbone for robotics foundation models? Should it be a video model, a language model, or perhaps a grounding model? WildDet3D is our first step in exploring that direction.

Jiafei Duan

12,072 views • 2 months ago

In Prompt Engineering for Vision Models, taught by Abby Jacques Verre and Caleb Kaiser of Comet, you’ll learn how to prompt and fine-tune vision models for personalized image generation, image editing, object detection and segmentation. The prompts you'll use for vision models could be text, point coordinates, or bounding boxes, depending on the model. You'll also learn to tune hyperparameters to shape the output. Models you'll use include Segment-Anything Model (SAM), OWL-ViT, and Stable Diffusion. You'll also learn to fine-tune Stable Diffusion to generate personalized images (say, an image of a specific person), using a handful of images for training. As an example of a multi-step workflow, you'll use OWL-ViT to detect an object based on a text prompt, then pass the bounding box to SAM to create a segmentation mask, and input that mask into Stable Diffusion to replace the original object with a new one based on a text prompt. Controlling vision models can be tricky; this course will teach prompting and fine-tuning techniques to get precise control over their output. Get started here:

Andrew Ng

151,198 views • 2 years ago

Zero-shot image segmentation with the Florence-2 + SAM-2 combo. Just added a new mode to my Hugging Face; now you can run open-vocabulary detection with Florence-2 + box-to-mask with SAM2. Link:

SkalskiP

65,883 views • 1 year ago

Today we're releasing WildDet3D—an open model for monocular 3D object detection in the wild. It works with text, clicks, or 2D boxes, and on zero-shot evals it nearly doubles the best prior scores. 🧵

Ai2

85,809 views • 2 months ago

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
TL;DR:
DINO-X Pro: SOTA model with enhanced perception capabilities for various scenarios
DINO-X Edge: model optimized for faster inference speed and better suited for deployment on edge devices

Alexandre Morgand

60,720 views • 1 year ago

NVIDIA's LocateAnything is a new vision model for grounding and detection. Very performant and accurate!
> 10x faster than Qwen3-VL
> 138M queries + 785M boxes
> GUI, OCR, docs, dense detection
> Free & open source

⚡AI Search⚡

120,567 views • 29 days ago

LLMs are great for human in the loop applications, but fail at deterministic developer tasks. Interfaze (YC P26) is a new AI model that outperforms general LLMs on high accuracy tasks like: OCR, Object Detection, Web scraping, Speech-to-text, Classification and more. Congrats on the launch, Yoeven and Harsha!

Y Combinator

69,326 views • 2 months ago

Today we’re announcing two new updates in our computer vision work — a new, expanded license for our DINOv2 model and the release of FACET, a comprehensive new benchmark dataset to help evaluate and improve fairness in vision models. More details ➡️ 🧵

AI at Meta

453,951 views • 2 years ago

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

AI at Meta

309,704 views • 1 year ago

SAM 2 from Meta FAIR is the first unified model for real-time, promptable object segmentation in images & videos. Using the model in our web-based demo you can segment, track and apply effects to objects in video in just a few clicks. Try SAM 2 ➡️

AI at Meta

88,918 views • 1 year ago

I need to annotate some images for training a computer vision model. There are many powerful annotation platforms available, but I want to keep my images local. I added a new section to my CV Streamlit app to quickly annotate images and train a YOLO model in a few clicks.

Marco Franzon

29,530 views • 5 months ago

Introducing Meta Perception Encoder: a vision encoder setting new standards in image & video tasks. It excels in zero-shot classification & retrieval, surpassing existing models. Learn more about Meta Perception Encoder, read the research paper, and download the code and dataset

AI at Meta

74,531 views • 1 year ago

New Moondream 2B release! ✨ New features:
- Long-form captioning
- Open vocab tagging
- Better counting, object detection, text understanding
- Faster HF transformers inference

vik

51,735 views • 1 year ago

The Segment Anything Model (SAM) by Meta AI is a step toward the first foundation model for image segmentation. SAM is capable of one-click segmentation of any object from photos or videos + zero-shot transfer to other segmentation tasks. Try the demo ➡️

AI at Meta

186,324 views • 3 years ago

We are introducing a state-of-the-art real-time object detection model, RF-DETR. RF-DETR outperforms all existing object detection models on real-world datasets and is the first real-time model to achieve 60+ Average Precision on COCO. Talked w/ NVIDIA about it at GTC:

Joseph Nelson

18,231 views • 1 year ago

Introducing DINOv3: a state-of-the-art computer vision model trained with self-supervised learning (SSL) that produces powerful, high-resolution image features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks. Learn more about DINOv3 here:

AI at Meta

899,636 views • 10 months ago

Meet SAM 3, a unified model that enables detection, segmentation, and tracking of objects across images and videos. SAM 3 introduces some of our most highly requested features like text and exemplar prompts to segment all objects of a target category. Learnings from SAM 3 will help power new features in Instagram Edits and Vibes, bringing advanced segmentation capabilities directly to creators. 🔗 Learn more:

AI at Meta

189,875 views • 7 months ago

Detecting airplanes at DFW Airport from a single text prompt. Meta's Segment Anything Model 3 and the geosam R package make powerful image detection tools accessible.

Kyle Walker

31,475 views • 5 months ago