Meet Flash, our newest model, which generates speech in 75ms (plus application and network latency). You’ve never experienced human-like TTS this fast.

ElevenLabs

184,224 subscribers

173,598 views • 1 year ago • via X (Twitter)

Education Health & Wellness Science & Technology

11 Comments

ElevenLabs1 year ago

Flash is our recommended model for low-latency, conversational voice agents. You can use Flash today in our Conversational AI platform, or build directly via the API using the model IDs “eleven_flash_v2” and “eleven_flash_v2_5”:
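As a rough illustration of selecting Flash by model ID, here is a minimal sketch that builds (but does not send) a POST request using only Python's standard library. The endpoint `https://api.elevenlabs.io/v1/text-to-speech/{voice_id}`, the `xi-api-key` header, and the JSON fields `text` and `model_id` are assumptions based on ElevenLabs' public REST API and should be checked against the current API reference; `build_flash_request`, `YOUR_VOICE_ID`, and `YOUR_API_KEY` are hypothetical placeholders.

```python
import json
import urllib.request

# Assumed ElevenLabs text-to-speech endpoint; verify against the official API reference.
API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"

# The two Flash model IDs named in the comment above.
FLASH_MODELS = {"eleven_flash_v2", "eleven_flash_v2_5"}


def build_flash_request(text: str, voice_id: str, api_key: str,
                        model_id: str = "eleven_flash_v2_5") -> urllib.request.Request:
    """Build (without sending) a TTS request that selects a Flash model by ID.

    Hypothetical helper: the header name and body fields are assumptions.
    """
    if model_id not in FLASH_MODELS:
        raise ValueError(f"not a Flash model id: {model_id}")
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_flash_request("Hello from Flash!", voice_id="YOUR_VOICE_ID", api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

To actually send it, you would pass `req` to `urllib.request.urlopen` and read the response body, which is assumed here to be encoded audio.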

ElevenLabs1 year ago

Flash v2 is English-only, and Flash v2.5 supports 32 languages. Both cost 1 credit per 2 characters.
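The two facts above (language coverage and 1 credit per 2 characters) can be turned into a back-of-the-envelope helper. This is a hypothetical sketch: the function names are illustrative, and counting every character (spaces included) and rounding an odd trailing character up to a full credit are assumptions, not documented billing rules.

```python
def pick_flash_model(english_only: bool) -> str:
    """Choose a Flash model ID: v2 is English-only, v2.5 supports 32 languages."""
    return "eleven_flash_v2" if english_only else "eleven_flash_v2_5"


def estimate_credits(text: str) -> int:
    """Estimate credits at 1 credit per 2 characters.

    Assumption: every character (spaces included) counts, and an odd
    trailing character is rounded up to a full credit.
    """
    return (len(text) + 1) // 2


print(pick_flash_model(english_only=False))  # eleven_flash_v2_5
print(estimate_credits("Hello, world!"))     # 7
```

For example, a 10,000-character script would come to about 5,000 credits under these assumptions.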

ElevenLabs1 year ago

It has slightly lower quality and emotional depth than the Turbo models, but significantly lower latency. Even so, Flash's quality is still higher than that of competitor models. Check out our guide on models to find the best for your use case:

ElevenLabs1 year ago

Hear from @maxilevi__, one of the developers who lead the engineering work behind this update:

Bytescribe1 year ago

Introducing Vehrbal, the AI that converts audio into SOAP notes! Say goodbye to wasted time and hello to effortless note-taking. Experience the power of fast, simple, and efficient with Vehrbal today.

Luke Harries1 year ago

From 250ms to 75ms 🏎️🏎️ Great work @maxilevi__ and team!

+1 year ago

incredible product, but the price just doesn't make sense for anything at scale. OpenAI TTS can be run at about 1/10th of the price

Danny G 👾1 year ago

Your models are great! I just need better pricing. I am building a voice-intensive app. Can we come to an agreement? I am a paid customer.

Arvind1 year ago

holy shit

Sam Sklar1 year ago

This combined with conversational agents is a magical experience @maxilevi__ @Marko_Jozef

Dot&DM1 year ago

man, if you are a kid today, this is the golden era for you.

Related Videos

Soprano: An instant, ultra-lightweight TTS model for realistic speech; generates 10 hours of 32kHz audio in <20s; streams with <15ms latency using just 80M params & <1GB VRAM. Has some limitations and drawbacks.

Wildminder

111,168 views • 6 months ago

Today we're releasing our first open source TTS model, TADA! TADA (Text Audio Dual Alignment) is a speech-language model that generates text and audio in one synchronized stream to reduce token-level hallucinations and improve latency. This means: → Zero content hallucinations across 1,000+ test samples → 5x faster than similar-grade LLM-based TTS → Fits much longer audio: 2,048 tokens cover ~700 seconds with TADA vs. ~70 seconds in conventional systems → Free transcript alongside audio with no added latency

Hume AI

269,190 views • 3 months ago

Gemini 3.1 Flash TTS is our most expressive speech generation model to date. [excited] Watch this demo from Thor 雷神 ⚡️ ⬇️

Google AI Developers

24,489 views • 2 months ago

You’ve never experienced Minecraft like this...

Spark Universe

1,662,713 views • 2 years ago

Announcing the new SotA voice-cloning TTS model: 𝗩𝗼𝗶𝗰𝗲𝗦𝘁𝗮𝗿 ⭐️ VoiceStar is - autoregressive, - voice-cloning, - robust, - duration controllable, - *test-time extrapolation*, generates speech longer than training duration! Code&Model:

Puyuan Peng

27,872 views • 1 year ago

Meet Lightning V2! The fastest TTS model with just 100ms latency (ttfb)! Supports 16 languages, custom voices, and costs only $0.05 per 10K chars. Enterprise-ready. Ultra-fast. Built for scale.

smallest.ai

120,652 views • 1 year ago

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech 🎭Realistic, emotionally expressive speech. 🌍Supports 9 languages and accurately captures diverse dialects. ⚡Very low latency for time-to-first-audio. 🔄Easily adaptable to new voices

Mistral AI

937,946 views • 3 months ago

At the ElevenLabs Summit in Warsaw, we previewed on-device Text to Speech - a new model architecture that delivers human-level quality on limited hardware without an internet connection.

ElevenLabs

35,091 views • 25 days ago

🚨 New model alert! Dialog by vibx — a leading text-to-speech model — now runs on GroqCloud™. That means natural-sounding speech with ultra-low latency, making real-time voice applications smoother and more responsive. Learn more & build fast — links in the comments!

Groq Inc

47,183 views • 1 year ago

Text to Sound Effects is here. Our newest AI Audio model generates sound effects, short instrumental tracks, soundscapes, and a wide variety of character voices, all from a text prompt. Available now for all users. Everyone from content creators and video game developers to film and television studios uses sound effects to create rich and immersive content. Now, in addition to AI voiceovers, you can generate all of the sounds you need with just a prompt. Everything you hear in this video was generated by ElevenLabs sound and voice models. In this thread, we shared some additional clips that help show off the range of this new model.

ElevenLabs

450,763 views • 2 years ago

Gemini 3.1 Flash TTS is our most controllable text-to-speech model yet. With new Audio Tags, you can easily direct vocal style, delivery, and pace through text commands. 🧵

Google DeepMind

468,781 views • 2 months ago

Introducing Gemini 3.1 Flash TTS 🗣️, our latest text to speech model with scene direction, speaker level specificity, audio tags, more natural + expressive voices, and support for 70 different languages. Available via our new audio playground in AI Studio and in the Gemini API!

Logan Kilpatrick

800,905 views • 2 months ago

For those of you blessed to never have experienced a flash flood in person, this is how fast the waters can rise. Pray for Texas🇻🇦

☩ 𝕁𝕄𝕋 ☩

59,220 views • 11 months ago

Ok, Kokoro is insane. 🤯 This AI is a groundbreaking TTS model with just 82M parameters. It outperforms larger models and generates minutes of speech in seconds. Oh, and it’s open source! Try it and see for yourself: 👇

Min Choi

418,016 views • 1 year ago

Today, we’re launching a new Video-to-Music flow in ElevenLabs Studio. In one click, our Eleven Music model generates a custom soundtrack based on your video’s context. After adding music, you can layer in voiceovers and SFX directly in Studio. More examples in thread.

ElevenLabs

181,778 views • 10 months ago

Introducing Eleven Multilingual v2: our new AI speech model supporting 28 languages! v2 comes with enhanced conversational capability, higher output quality & the ability to better preserve unique voice characteristics across languages. Read more:

ElevenLabs

293,366 views • 2 years ago

Our developer advocate Amos Gyamfi created a real-time AI voice agent that runs locally on your laptop. No GPU, no network calls. It uses Pocket TTS (100M params) for fast, natural speech paired with Stream Video. We're breaking it down ↓

Stream

369,877 views • 4 months ago

Gemini 3 Flash is fully available now on Orchids! Google's newest model - incredibly fast and powerful. Pixel perfect clone of X in under a minute.

Bud

27,651 views • 6 months ago

Today, we’re excited to introduce Miso One, the most emotive voice model in the world. Miso One is an 8-billion-parameter text-to-speech model for highly expressive speech generation. It emotes like a human and responds faster than a human, with just 110 milliseconds of latency. We’ve open-sourced the model weights, with API access coming soon. Hear how Miso One sounds in the thread below.

Aoden Teo

5,097,853 views • 24 days ago

Realtime AI Conversations are here! Introducing PlayHT 2.0 Turbo ⚡️ Our new blazing fast Conversational AI Text-to-Speech model with <300ms latency! ✅ Input text streaming from LLMs ✅ Output audio streaming ✅ Clone any voice & accent Try here -

PlayAI

392,536 views • 2 years ago