Microsoft just dropped VibeVoice-Realtime-0.5B
Open-source realtime TTS AI model that starts talking in ~300 ms
Streaming, long-form and insanely fast.

Min Choi

376,082 subscribers

80,970 views • 7 months ago • via X (Twitter)

Science & Technology

Related Videos

1-Click Vibevoice-Realtime for ALL machines
Finally a Realtime TTS that is ACTUALLY realtime, thanks to its tiny size! (0.5B params)
This video is from my Windows machine, recorded in realtime. All you need is around 2.5GB VRAM, it even works on Macs!

cocktail peanut

37,149 views • 7 months ago

We finally have open-source, long-form, expressive TTS models.
VibeVoice-1.5B from Microsoft:
✅ MIT licensed
✅ Up to 90 minutes long
✅ Highly expressive & emotional!

mrfakename

168,359 views • 11 months ago

Microsoft did it again!
Speech AI models have a major limitation. They slice long recordings into tiny chunks, lose track of who's speaking, and forget all context halfway through.
This is exactly what Microsoft's VibeVoice solves. It's an open-source family of frontier voice AI models for both speech recognition and speech generation.
Here's what it can do:
> VibeVoice-ASR processes up to 60 minutes of audio in a single pass. No chunking. It outputs structured transcriptions with who spoke, when they spoke, and what they said.
> You can feed it custom hotwords like names, technical jargon, or domain-specific terms. The model uses them to significantly improve accuracy on specialized content.
> VibeVoice-TTS generates up to 90 minutes of multi-speaker speech with up to 4 distinct speakers. Natural turn-taking, emotional expression, all in one pass.
> VibeVoice-Realtime is a 0.5B streaming TTS model with ~300ms first-audio latency. Small enough to deploy practically anywhere.
All of this is powered by continuous speech tokenizers running at just 7.5 Hz. This ultra-low frame rate preserves audio quality while making long sequences computationally feasible.
I have shared the link to the GitHub repo in the replies!
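To make the 7.5 Hz figure in the post above concrete, here is a rough back-of-the-envelope sketch of how many tokenizer frames a long recording turns into. The 60-minute, 90-minute, and 7.5 Hz values come from the post; the 50 Hz comparison rate and the helper name `frame_count` are assumptions picked only for illustration, not figures from VibeVoice itself.

```python
def frame_count(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


TOKENIZER_HZ = 7.5    # frame rate quoted in the post
COMPARISON_HZ = 50.0  # hypothetical higher frame rate, for contrast only

asr_frames = frame_count(60, TOKENIZER_HZ)          # 60-minute ASR input
tts_frames = frame_count(90, TOKENIZER_HZ)          # 90-minute TTS output
comparison_frames = frame_count(90, COMPARISON_HZ)  # same 90 minutes at 50 Hz

print(f"60 min at 7.5 Hz: {asr_frames} frames")
print(f"90 min at 7.5 Hz: {tts_frames} frames")
print(f"90 min at 50 Hz:  {comparison_frames} frames")
```

Under these assumptions, 90 minutes of audio is 40,500 frames at 7.5 Hz versus 270,000 at the hypothetical 50 Hz, which is the sense in which a low frame rate keeps long sequences manageable.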

Akshay 🚀

45,206 views • 4 months ago

the new OpenAI realtime voice model just released + gpt 5.5 fast mode brings us a new possibility - realtime speech to live presentation!
i just talk, and the whiteboard would whiteboard itself
prototype is open sourced. details in thread below -

Kun Chen

114,908 views • 2 months ago

🚨 JUST IN: MICROSOFT just open sourced a VOICE AI THAT TRANSCRIBES 60 MINUTES OF AUDIO in a single pass. 100% FREE.
It knows who spoke. It knows when they spoke. It knows exactly what they said. All in one shot. No chunking. No context loss.
It's called VibeVoice. Not a transcription tool. Not a basic speech to text wrapper. A frontier voice AI family with ASR, TTS, and real time streaming. All open source. All free.
Here's what it actually does 👇
VibeVoice ASR - Speech Recognition:
→ Processes 60 minutes of continuous audio in a single pass
→ Never slices audio into chunks so global context is never lost
→ Identifies WHO spoke, WHEN they spoke and WHAT they said simultaneously
→ Supports customized hotwords for domain specific accuracy
→ Works in 50+ languages natively
→ Already adopted by Hugging Face Transformers library
→ Already being built on by the open source community BY PEOPLE WHO HAD NO IDEA THIS LEVEL OF ACCURACY WAS ALREADY FREE.
VibeVoice TTS - Text to Speech:
→ Generates up to 90 minutes of speech in a single pass
→ Supports up to 4 distinct speakers in one conversation
→ Natural turn taking and speaker consistency throughout
→ Expressive speech that captures emotional nuances
→ Supports English, Chinese and multiple other languages
VibeVoice Realtime - Streaming TTS:
→ Only 300 millisecond first audible latency
→ Streams text input in real time
→ 0.5B parameters so it actually deploys anywhere
→ Robust long form generation up to 10 minutes
→ Lightweight enough for production use today
The core innovation nobody is talking about:
Most voice AI models slice long audio into short chunks. Every time they slice, they lose context. Speaker tracking breaks. Semantic coherence breaks. Accuracy drops.
VibeVoice uses continuous speech tokenizers running at an ultra low frame rate of 7.5 Hz. This preserves audio fidelity while dramatically boosting computational efficiency. The entire 60 minutes stays in context. Nothing gets lost. Nobody gets misidentified.
The numbers:
→ VibeVoice ASR 7B - available now on Hugging Face
→ VibeVoice Realtime 0.5B - try it on Colab right now
→ 50+ supported languages
→ 11 distinct English voice styles
→ 9 multilingual speaker voices
→ Already integrated into Hugging Face Transformers
→ Finetuning code now available
The wildest part? A voice powered input method called Vibing just built itself on top of VibeVoice ASR. Available on macOS and Windows right now. The open source community is already shipping products on top of this.
100% Open Source. Free to use. Free to fine tune. Free to build on.
🔖 Save this before your competitors find it first. 👇

Kanika

220,977 views • 3 months ago

Microsoft just dropped VibeVoice (open-source)
This AI turns text into a 90-min, up to 4-voice podcast. With natural pauses, emotion, even singing.
6 wild examples + code:
1. Spontaneous singing

Min Choi

94,614 views • 11 months ago

Microsoft just dropped MineWorld on Hugging Face
a Real-Time and Open-Source Interactive World Model on Minecraft

AK

95,035 views • 1 year ago

Introducing an open-source template for broadcasting realtime transcripts.
• Generate live transcripts with Scribe
• Broadcast to many using Supabase
• Translate with Chrome’s built-in AI
Demo and open source template below.

ElevenLabs Developers

28,089 views • 8 months ago

Marvis-TTS-v0.2 is here 🚀
A local first TTS model capable of realtime performance even on older iPhones that Lucas Newman and I built.
What’s new:
✨ Blazing fast — 100M (tiny) & 250M parameter models
🌍 Multilingual — English, French, German
🎭 Enhanced voice cloning — More natural & expressive
⚡ Long-form generation — Up to 90 seconds (4x improvement)
Get started today:
> pip install -U mlx-audio

Prince Canuma

118,352 views • 8 months ago

Introducing Realtime TTS-2, a new generation of voice model built for realtime conversation. It is the first voice model that hears the conversation, takes natural-language voice direction, holds one voice identity across over 100 languages, and speaks like a person who is paying attention. The result is voice AI that feels as good as it sounds. Try it out: Learn More:

Inworld AI

326,012 views • 2 months ago

🚨 Forget LIDAR. The Robbyant team just dropped a streaming 3D model that reconstructs scenes live, at ~20 FPS, over long sequences. One single camera. Runs in real time. Open-source. Entirely end-to-end. NO iterative optimization tricks and no post-processing cleanup steps! It outperforms both existing streaming approaches and several offline methods. 100% Free and open-source. Repo, paper and model weights in 🧵↓

Charly Wargnier

35,993 views • 1 month ago

This is wild. China's Alibaba just dropped Live Avatar. This AI turns any voice into a realtime, talking avatar with infinite length at 20 FPS. 10 wild demos:👇 1. Ilya interview that never happened

Min Choi

220,784 views • 7 months ago

Realtime interactive generative models FTW! Announcing a new 🌊 of details and features for Magenta RealTime, the open weights live music AI model from GDM! * Live Jamming with audio input 🎤🎸🎵 * Personalize your own models 🔧 * Tech report 📜 Links below in the 🧵...

Jesse Engel

183,565 views • 11 months ago

Meet Lucy 2.5, our most advanced Live AI model yet. Lucy edits videos in realtime, now with more capabilities and greater control. See how it's being used across streaming, e-commerce, advertising, and more 🧵

Decart

4,731,478 views • 15 days ago

Introducing GPT-Realtime-2 in the API: our most intelligent voice model yet, bringing GPT-5-class reasoning to voice agents. Voice agents are now real-time collaborators that can listen, reason, and solve complex problems as conversations unfold. Now available in the API alongside streaming models GPT-Realtime-Translate and GPT-Realtime-Whisper — a new set of audio capabilities for the next generation of voice interfaces.

OpenAI

3,654,620 views • 2 months ago

Z AI drops a realtime video generator for talking characters. Free & open source. Simply add a script, upload a voice & character, and it renders the video almost instantly. Btw, Z.ai is also behind one of the best open models, GLM 🔥🔥🔥

⚡AI Search⚡

41,477 views • 7 months ago

Google Gemini 2.0 realtime AI is insane. Watch me turn it into a live code tutor just by sharing my screen and talking to it. We’re living in the future. I’m speechless.

Mckay Wrigley

624,896 views • 1 year ago

Apple’s AI getting crazy and nobody’s talking about it they just open sourced FastVLM + MobileCLIP2, can do realtime VIDEO captioning with phone camera.. runs 100% local on your phone, 85x faster, 3.4x smaller… free to test and download, link in comment let's break down:

el.cine

381,290 views • 11 months ago

Microsoft just dropped Muse. This AI can generate minutes of smooth gameplay from just 1 second of footage & controls. And it's open source! The future of gaming is about to change forever.

Min Choi

102,961 views • 1 year ago