Updated my HF Space for vibe testing smol VLMs on object detection, visual grounding, keypoint detection & counting! 👓 🆕Compare Qwen2.5 VL 3B vs Moondream 2B side-by-side with annotated images & text outputs. Try examples or test your own images! 🏃👇

Sergio Paniego

2,567 subscribers

15,717 views • 11 months ago • via X (Twitter)

10 Comments

Sergio Paniego • 11 months ago

📱Space: Models by @Alibaba_Qwen and @moondreamai!

merve • 11 months ago

@skalskip92 @vikhyatk @JustinLin610 @onuralpszr you have to see this ^

vik • 11 months ago

for moondream object detection, prompting with just the object name will work better; that's how we train it

Sergio Paniego • 11 months ago

I was unsure whether to use the full prompt or just the object name for the examples. Let me update it to make the comparison fairer 😃
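To illustrate the prompting difference discussed in this exchange, here is a minimal sketch of choosing the prompt per model. The function name, the `model_family` values, and the instruction-style template are all hypothetical and are not taken from the Space's code; the only thing drawn from the thread is that Moondream gets the bare object name.

```python
def build_detection_prompt(model_family: str, object_name: str) -> str:
    """Return a detection prompt for the given model family (hypothetical helper)."""
    name = object_name.strip()
    if not name:
        raise ValueError("object_name must not be empty")
    if model_family == "moondream":
        # Per the comment above: Moondream is trained on the bare object name.
        return name
    # Assumed instruction-style template for other models; adjust to taste.
    return f"Detect all {name} in the image and return their bounding boxes as JSON."


print(build_detection_prompt("moondream", "candy"))
print(build_detection_prompt("qwen", "candy"))
```

Keeping the per-model wording in one place makes it easy to change a prompt without touching the rest of the comparison.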

Andres Franco • 11 months ago

That’s impressive. Playing around with models like that must be a lot of fun.

Prithiv Sakthi 🌠 • 11 months ago

This is really awesome 🤩

Reza Sayar • 11 months ago

awesome! 👏 very useful work!! 🥳🙏

Linus | web3 mobility network nRide • 11 months ago

@pcuenq Vibe testing VLMs, that's really cool! I'm curious, have you explored any blockchain-based applications for object detection or visual grounding? 🤔

Onuralp S. • 11 months ago

I was experimenting with Qwen and I can see it can detect each individual candy. When I ask a little differently it always says "colorful candies", and when I put that into the prompt I get somewhat better results, but when I say return as "json" it just becomes one bbox.
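A quick way to check for the "one bbox" behaviour described above is to parse the model's JSON reply and count the boxes. This is a minimal sketch that assumes the reply contains a JSON list of objects with `bbox_2d` and `label` keys; those key names and the helper `parse_boxes` are assumptions for illustration, not something confirmed in this thread.

```python
import json


def parse_boxes(reply: str) -> list[dict]:
    """Extract bounding boxes from a model reply that embeds JSON (hypothetical helper)."""
    # Find the outermost JSON-looking span, ignoring any surrounding chatter.
    starts = [i for i in (reply.find("["), reply.find("{")) if i != -1]
    ends = [i for i in (reply.rfind("]"), reply.rfind("}")) if i != -1]
    if not starts or not ends:
        return []
    try:
        data = json.loads(reply[min(starts): max(ends) + 1])
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):
        data = [data]  # a single object instead of a list
    if not isinstance(data, list):
        return []
    boxes = []
    for item in data:
        if not isinstance(item, dict):
            continue
        box = item.get("bbox_2d")  # assumed key name
        if (isinstance(box, list) and len(box) == 4
                and all(isinstance(v, (int, float)) for v in box)):
            boxes.append({"label": item.get("label", ""), "box": [float(v) for v in box]})
    return boxes


reply = 'Here you go:\n[{"bbox_2d": [10, 20, 110, 220], "label": "candy"}, {"bbox_2d": [5, 5, 50, 50], "label": "candy"}]'
boxes = parse_boxes(reply)
print(len(boxes), boxes[0]["box"])
```

Comparing `len(parse_boxes(reply))` across prompt variants gives a rough count check for when a wording change collapses the detections into a single box.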

Johannes Gilger • 11 months ago

This is awesome, thank you so much for that. Also really helps to show the inference time. Now do all the other small-ish VLMs like Molmo, SmolVLM, InternVL, etc 😅
