🔥 Introducing VILA: the next-gen VLM for video and interleaved image understanding, with an Apache-2.0 license! Is VILA the best-in-class small VLM for edge deployment? Keep reading 🧵 👀

Gradio

56,415 subscribers

51,681 views • 1 year ago • via X (Twitter)

Education, Art, Science and Technology

Related videos

What's the right architecture for a VLA? VLM + custom action heads (π₀)? VLM with special discrete action tokens (OpenVLA)? Custom design on top of the VLM (OpenVLA-OFT)? Or... VLM with ZERO modifications? Just predict action as text. The results will surprise you. VLA-0: Outperforms π₀, GR00T-N1, MolmoAct, SmolVLA. With ZERO changes to the VLM. 🧵⬇️

Ankit Goyal

106,689 views • 8 months ago

this is the BEST vision language model I have ever tried! Aria is a new model by Rhymes.AI: a 25.3B multimodal model that can take image/video inputs 🤩 They release the model with Apache-2.0 license and fine-tuning scripts as well 👏 I tested it extensively, keep reading to learn more 🧶

merve

176,047 views • 1 year ago

Trending: Step1X-3D for generating high-fidelity 3D assets with versatile textures! Apache-2.0 with exceptional results in terms of geometry and texture mapping🔥🔥

Gradio

36,405 views • 1 year ago

Real-time DEtection Transformer (RT-DETR) landed in Hugging Face transformers 🤩 with Apache 2.0 license 😍 do DETRs Beat YOLOs on Real-time Object Detection? keep reading 👀

merve

155,222 views • 2 years ago

Can a VLM see without a vision encoder? We trained one for $100, inspired by Gemma 4 12B. Latency on an M3 Pro MacBook: 112 ms -> 1.1 ms for the image path, 30% lower end-to-end image+LLM. The architecture is just: patchify the image -> linear projection with pos embeddings -> LLM. Writeup:

Andi Marafioti

59,238 views • 6 days ago

We may have a new best image-to-video model 👀 I’ve been testing Kling 2.0 more and am really impressed with the results. Veo 2 was previously best-in-class (IMO), but Kling has more motion and maintains coherence.

Justine Moore

41,475 views • 1 year ago

introducing Qwen. this new image model comes with great prompt understanding and text rendering. try it now for free in Krea Image.

KREA AI

16,176 views • 10 months ago

Before the week ends, let's acknowledge one of the most INSANE weeks ever for open AI, with 25+ notable open-weight drops across every modality:

🧠 LLMs
→ NVIDIA Nemotron 3 Ultra: 550B hybrid Mamba-MoE, only 55B active, 1M context, MMLU 89.1. NVFP4 variant claims ~5x throughput on Blackwell. First openly-weighted 550B hybrid Mamba-Transformer, closing the gap with frontier closed models.
→ Google Gemma 4 12B: fully open dense any-to-any (text/image/audio/video), 256k context, encoder-free, 140+ languages, AIME 2026 at 77.5. Shipped with a 23-checkpoint QAT wave (mobile ONNX + MLX). Most deployable model of the week.
→ StepFun Step-3.7-Flash: 198B sparse MoE VLM, ~11B active, SWE-Bench PRO 56.3. Apache 2.0.
→ Liquid AI LFM2.5-8B-A1B: edge MoE, just 1.5B active, 128k ctx, MATH500 88.8, MLX-ready. Best on-device option this week.
→ JetBrains Mellum2-12B-A2.5B-Thinking: their first open MoE, near-Qwen3-14B coding at 2.5B active. Apache 2.0.

🎨 Image gen (the surprise of the week)
→ Ideogram 4: their FIRST-EVER open weights. 9.3B flow-matching DiT trained from scratch. #2 overall behind GPT Image 2, top open-weight model on Design Arena + LMArena. Strongest open checkpoint for text-rich images, full stop. It has taste. Still can't believe this is open weights.

🔊 Audio & Speech (a breakout week for open TTS, 4 labs shipped)
→ Boson Higgs Audio v3 4B: 102 languages, 21 emotions, singing/whispering/shouting, sub-second TTFA.
→ RedNote dots.tts: the only fully continuous (no codec) open TTS pipeline, Apache 2.0.
→ Google Magenta RealTime 2: real-time music gen, <200ms latency, text+audio+MIDI. multimodalart ported it to PyTorch within hours with live ZeroGPU demos.
→ NVIDIA Nemotron-3.5 ASR: 600M streaming, 17x more concurrent streams vs Parakeet RNNT 1.1B.

👁️ Vision & VLMs
→ PaddleOCR-VL-1.6: SOTA document parsing at 1B params, Apache 2.0.
→ Baidu NAVA: 6.3B joint audio-video gen, best-in-class A/V sync, Apache 2.0.

🎬 Video, 3D & World Models
→ NVIDIA Cosmos3-Super: 64B omnimodal world model coupling action trajectories with video+audio gen, for Physical AI.
→ JD JoyAI-Echo: up to 5-min multi-shot text-to-video on LTX-2.3.
→ ByteDance Bernini-R + VAST TripoSplat (single-image-to-3D Gaussian splats, MIT).

Victor M

532,982 views • 19 days ago

BREAKING: Seedance 2.0 is now officially available on GlobalGPT at 50% OFF! Realistic physics, native audio-video generation, and best-in-class image control for AI video. Now available for all regions. No limits. No restrictions. No invite codes.👇

Zohaib Ai

11,854 views • 2 months ago

Seedance 2.0 is now officially available on GlobalGPT at 50% OFF! Realistic physics, native audio-video generation, and best-in-class image control for AI video. Now available for all regions. No limits. No restrictions. No invite codes.👇

Kylie

61,521 views • 2 months ago

introducing Seedance 2.0. the world's most powerful video model, supporting text, image, video, and even audio as input, and capable of producing long multi-shot videos with high quality. now available for everyone.

Krea

44,728 views • 2 months ago

latest transformers release added support for pose estimation with ViTPose and ViTPose++ (Apache 2.0 license). upcoming supervision release will add full support for ViTPose and ViTPose++, along with useful utilities for keypoint detection models.

SkalskiP

46,031 views • 1 year ago

Playing on the new Jetson Thor, trying to think in terms of having gobs of memory but low mem bandwidth. Moondream2 VLM is ~2 FPS per full loop of everything. But we have 128GB of memory. So run 15 VLM servers (~76GB) & get 30 FPS w/ ~100ms latency for the feed. Very comfy.

Harrison Kinsley

35,687 views • 10 months ago

If you're still living under a rock and don't know how to access seedance 2.0 to generate videos like this:
> spin up a Hong Kong VPN
> visit doubao [dot] com
> select image gen and make sure seedance 4.5 is selected
you won't believe the next step, but:
> tell the image gen model to make a video and it'll start generating a video using the latest seedance 2.0

Akshit Verma

108,796 views • 4 months ago

💥 Seedance 2.0 is HERE on GlobalGPT. 50% OFF. All regions. No invite codes. Realistic physics + native audio-video + best-in-class image control. This is the one. 👇

MAPUNDA

69,856 views • 2 months ago

Seedance 2.0 got its perfect match on Higgsfield. Grok Imagine and Seedance 2.0 - xAI's image model paired with the category-leading video model, natively inside the platform. Grok Imagine generates your reference image in 9+ styles, fast. Seedance 2.0 - ranked #1 in image-to-video - takes it further than any other model can. The best results out of Seedance 2.0 start here.

Higgsfield AI 🧩

10,141,275 views • 2 months ago

The ARM has landed. Anoma is live on Ethereum. The next-gen VM at the heart of Anoma’s distributed OS, the ARM brings native intents and best-in-class privacy to any chain, starting with Ethereum. A new era for decentralized applications starts today.

Anoma

211,079 views • 7 months ago