Kyutai released their Streaming Text to Speech model, ~2B param model, ultra low latency (220ms), CC-BY-4.0 license 🔥 Trained on 2.5 Million Hours of audio, it can serve up to 32 users w/ less than 350ms latency on a SINGLE L40 🤯 Incredible release by kyutai folks, go check...

Vaibhav (VB) Srivastav

47,375 subscribers

93,512 views • 11 months ago • via X (Twitter)

Science & Technology, News & Politics, Education

6 comments

Vaibhav (VB) Srivastav • 11 months ago

Check out their models here:

Aakash • 11 months ago

"Trained on 2.5 Million Hours of audio, it can serve up to 32 users w/ less than 350ms latency on a SINGLE L40" can we get more of this benchmark

KD • 11 months ago

These are some of the same guys who run a really amazing YT channel about CS btw:

ZAZO • 11 months ago

that’s the best thing happened in 2025 🔥🔥🔥🔥🔥🔥🔥🔥🔥

Bui Dinh Ngoc • 11 months ago

This is game-changing for accessibility tools. I've been waiting for low-latency TTS that doesn't break the bank or require proprietary licenses.

Carlos DP • 11 months ago

SUCH a solid demo lol, S tier

Related videos

Kyutai TTS and Unmute are now open source! The text-to-speech is natural, customizable, and fast: it can serve 32 users with a 350ms latency on a single L40S. Try it out and get started on the project page:

kyutai

171,391 views • 11 months ago

The new open-source Text to Speech model: Fish Speech 1.4 is brilliant! Trained on a massive 700K hours of multilingual speech data in 8 languages - Instant voice cloning 🗣️ - Ultra-low latency ⚡ - Compact model (~1GB weights) 🏋️‍♂️

Rohan Paul

228,836 views • 1 year ago

LETS GOO! kyutai just released MoshiVis - an end-to-end low-latency Vision Speech Model, CC-BY license 🔥 > Only adds 206M parameters via lightweight cross-attention (CA) modules to integrate visual inputs from a frozen PaliGemma2-3B-448 vision encoder > Uses a learnable gating mechanism in the CA modules allows MoshiVis to "turn off" visual input streams when unnecessary, preserving Moshi's conversational abilities > Adds only ~7ms per inference step on a MacMini with M4 Pro Chip, maintaining real-time performance > Best part: it keeps the tone, emotion and the prosody of the original Moshi model > CC-BY-4.0 licensed weights on the hub, allows commercial use > Works with MLX, Candle and PyTorch from day-0 Kudos Kyutai on such a brilliant release - I believe this is the first of its kind! 🤗

Vaibhav (VB) Srivastav

45,606 views • 1 year ago

🚨 New model alert! Dialog by vibx — a leading text-to-speech model — now runs on GroqCloud™. That means natural-sounding speech with ultra-low latency, making real-time voice applications smoother and more responsive. Learn more & build fast — links in the comments!

Groq Inc

47,183 views • 1 year ago

Wow! New Speech to Speech model - Fish Agent v0.1 3B by Fish Audio 🔥 > Trained on 700K hours of multilingual audio > Continue-pretrained version of Qwen-2.5-3B-Instruct for 200B audio & text tokens > Zero-shot voice cloning > Text + audio input/ Audio output > Ultra-fast inference w/ 200ms TTFA > Models on the Hub & Finetuning code on its way! 🚀 What an amazing time to be alive 🤗

Vaibhav (VB) Srivastav

66,963 views • 1 year ago

NEW: Kokoro 82M - APACHE 2.0 licensed, Text to Speech model, trained on < 100 hours of audio 🔥

Vaibhav (VB) Srivastav

330,034 views • 1 year ago

Kyutai Speech-To-Text is now open-source! It’s streaming, supports batched inference, and runs blazingly fast: perfect for interactive applications. Check out the details here:

kyutai

66,264 views • 1 year ago

"With the On-Demand model, users can choose their gas cost with the lowest latency possible for their transactions" - DoctorBlocks Listen to what DoctorBlocks says about functions and the importance of Switchboard's On-Demand model 👇

Switchboard ⚡️

40,645 views • 2 years ago

Introducing Dialog 1.0 - Ultra-emotional AI Text-To-Speech model Outperforms Elevenlabs on expressiveness and quality 3 to 1 <1% error rate Supports 30+ languages Best in class voice cloning Low latency: 303ms TTFA (Time to First Audio) Experience it for yourself on Read more below⬇️

PlayAI

196,152 views • 1 year ago

🚨 Official : Google will soon release Gemini 2.5 Flash - Low latency model and most cost-efficient - Has "thinking built-in" - Allows users to control how much the model reasons. Source : YT Google Cloud Keynote.

AshutoshShrivastava

60,315 views • 1 year ago

Pretty Insane - SoTA Text to Speech model capable of English AND Hindi - 3B Llama backbone - Apache 2.0 licensed 🔥 > Sub 80 ms latency > Supports both English, Hindi including code-mix > Runs in a free google colab too 🤯 Best part: They're actively working on other languages like Tamil, Telugu, Bengali, etc > Available on Hugging Face hub, powered by Transformers 💥

Vaibhav (VB) Srivastav

33,749 views • 11 months ago

End to End Speech models are on fire - LLAMA-OMNI 8B - Apache licensed! 🔥 > Speech Encoder - Whisper Large v3 > LLM backbone - Llama 3.1 8B Instruct > Speech Decoder - HuBERT (UnitY) > Simultaneously generate Speech + Text > Less than 250 ms latency > Trained in less than 3 days on 4x GPUs > Used 200K instruct pairs > Model checkpoints on the Hub 🤗 > Space incoming! GG! I'm here for this trend! 🐐

Vaibhav (VB) Srivastav

47,921 views • 1 year ago

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech 🎭Realistic, emotionally expressive speech. 🌍Supports 9 languages and accurately captures diverse dialects. ⚡Very low latency for time-to-first-audio. 🔄Easily adaptable to new voices

Mistral AI

935,674 views • 2 months ago

Just released on Hugging Face: Vui, a 100M open-source NotebookLM! 3 models: > Vui.BASE is the base checkpoint trained on 40k hours of audio conversations > Vui.ABRAHAM is a single speaker model that can reply with context awareness. > Vui.COHOST is checkpoint with two speakers that can talk to each other. It clones voices, breathes, uhs, [laughs] — even non-speech sounds. Human-like TTS is here!

steven

43,620 views • 1 year ago

Real-time AI conversations are here! PlayHT, one of the best text-to-speech models I’ve used, now has a latency of less than 300ms. Checkout how fast it outputs audio 🤯 Also, my experience cloning my own voice and links to try it for free are below.

Alvaro Cintas

335,344 views • 2 years ago

This is the best and fastest speech-to-text model in the world: • 23.2 seconds to process 30 minutes of audio • 93.3% accuracy • Diarization support to detect multiple speakers • Trained on 12.5 million hours of multilingual data I tried it out and it's pretty impressive:

Santiago

66,397 views • 9 months ago

Really proud to share something we’ve been working on for a while: Magenta RealTime 2 (MTR2), a live music model that is highly interactive (MIDI, audio, text, lots of parameters) and low-latency (~200ms end-to-end), and runs locally on a MacBook!

Ilaria Manco

44,003 views • 15 days ago

Gemma 2 2B running in a browser, powered by WebLLM & WebGPU! 🔥 100% local & on-device In less than 24 hours, we've already got the model to the edge! ⚡ Try it out on an HF space below:

Vaibhav (VB) Srivastav

53,790 views • 1 year ago