Today we released Meta Spirit LM, our first open source multimodal language model that freely mixes text and speech. Many existing AI voice experiences today use ASR techniques to process speech before synthesizing with an LLM to generate text, but these approaches compromise the expressive aspects...

AI at Meta

806,014 subscribers

351,674 views • 1 year ago • via X (Twitter)

Education Science & Technology

10 Comments

AI at Meta • 1 year ago

More details, including links to the research paper, model weights and code 👇

$@®+#@|= • 1 year ago

Sounds ass ngl

floating point • 1 year ago

Non commercial and poor quality speech? Sad 😔

Leocifer • 1 year ago

europe

Tech Dev Notes • 1 year ago

The demo was a bit ...

BensenHsu • 1 year ago

The study introduces SPIRIT-LM, a model that can generate both speech and text. It is based on continuously pre-training a text language model (Llama 2) with a combination of text-only, speech-only, and aligned speech-text datasets. SPIRIT-LM performs well on speech and text comprehension tasks, matching or exceeding the performance of previous speech-only and text-only models. It can also learn new tasks in a few-shot setting, both within and across modalities (speech-to-text and text-to-speech). The SPIRIT-LM-EXPRESSIVE version is the first language model that can preserve the sentiment of text and speech prompts both within and across modalities. full paper:

$Q*🍓on Ethereum • 1 year ago

Everything happening at once

Hamza • 1 year ago

this preview seems to lack somewhat

Risphere • 1 year ago

The quality isn't that good.

Qual • 1 year ago

I love you, but this demo was... well, let's just say it had a rough start! At first, I thought my speakers were broken because there was no sound for a few seconds. Then, when the sound finally kicked in, I was like, "Yep, my speakers are definitely broken!"

Related Videos

We released Sonic-3.5 and Ink-2, the #1 streaming models for text to speech and speech to text you can use in your voice agents today. New architectures enable new frontiers for speed and quality. We're now the only provider to have #1 models for both speaking and listening.

Karan Goel

6,987,380 views • 9 days ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen. paper page: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

Introducing SeamlessM4T, the first all-in-one, multilingual multimodal translation model. This single model can perform tasks across speech-to-text, speech-to-speech, text-to-text translation & speech recognition for up to 100 languages depending on the task. Details ⬇️

AI at Meta

592,704 views • 2 years ago

What if voice AI could understand context, intent, and emotion — not just words? 🎙️ Meet Seed Speech 2.0. A new speech AI stack designed for natural, expressive conversations — with major upgrades in both text-to-speech (TTS) and speech recognition (ASR). Built for developers and creators who want natural conversational speech. What’s inside: - Natural and expressive speech generation - Prompt-controlled emotion and tone - High-accuracy multilingual recognition - Strong contextual reasoning Built for: 🎧 Voice content creation 🤖 AI assistants ☎️ Customer service 🎬 Dubbing & subtitling 📊 Audio-video analysis Voice AI built for expression, accuracy, and understanding. Try it now → Read more about Seed Speech here → #SeedSpeech #VoiceAI #SpeechAI #AI

BytePlus

19,717 views • 3 months ago

Today we're sharing new progress on our AI speech work. Our Massively Multilingual Speech (MMS) project has now scaled speech-to-text & text-to-speech to support over 1,100 languages — a 10x increase from previous work. Details + access to new pretrained models ⬇️

AI at Meta

326,823 views • 3 years ago

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech 🎭Realistic, emotionally expressive speech. 🌍Supports 9 languages and accurately captures diverse dialects. ⚡Very low latency for time-to-first-audio. 🔄Easily adaptable to new voices

Mistral AI

937,612 views • 3 months ago

We pioneered the first ultra-realistic Text to Speech model, and recently launched the world's most accurate Speech to Text model, Scribe. But we're not stopping there. Today, we're taking one small step for man, and one giant leap for man's best friend... with Text to Bark.

ElevenLabs

291,233 views • 1 year ago

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale blog: Large-scale generative models such as GPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high fidelity text or image outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster.

AK

429,143 views • 3 years ago

Thanks to our +1.2M community across +180 countries, we are able to source thousands of languages & dialects. This fuels Silencio Voice AI's unmatched datasets: • Multilingual Conversations • Text to Speech • Speech to Text • Ambient Sounds • Commands and Order. Powering the next wave of AI & Robotics. PS: Keynote opening for everyone today, stay tuned.

Silencio | Voice Data for AI

13,705 views • 9 months ago

Mati Staniszewski on why ElevenLabs is betting on a cascaded approach for voice agents: “Our approach, as you think about voice agents, conversational agents, is effectively a cascaded approach. You use transcription or speech-to-text, an LLM, text-to-speech, and orchestrate all of that together. Then you have speech-to-speech, which goes directly from speech, and there’s a speech response on the other side. Today, we are optimizing heavily on a cascaded approach. As we work with a lot of the businesses and enterprises, they will need that visibility into what happens. They will want to execute certain tasks on top of that. They want good visibility into each of the steps and great accuracy of all the models. But beyond that, they can abstract away what’s the LLM layer, what’s the intelligence layer, and the integrations are easier in that system. That’s where we are betting a lot of the research work on how you can make that great, and we think we can make that great.” John Collison Mati Staniszewski ElevenLabs

Stripe

17,118 views • 2 months ago

In this demo, we show real-time voice translation between Indian languages. The user can speak in their preferred language, and the system translates and responds in another, preserving both meaning and natural delivery. Speech recognition, translation, and expressive text-to-speech work together seamlessly.

Pratyush Kumar

15,206 views • 4 months ago

I made a few updates to the MLX port of F5 TTS over the holiday: - Longform generation is now supported, and text will be split automatically on sentence boundaries for natural-sounding speech. - You can now pipe in text from another process (e.g. a language model) and listen to the speech as each segment is generated. - I implemented RK4 sampling for the ODE, which achieves much better quality with fewer steps — it's now possible to generate speech at 1.6x realtime on my M3 Max. See the video of an example of generating speech in realtime from a language model entirely on-device! 🚀

Lucas Newman

22,962 views • 1 year ago

Speech to Speech is now available in 29 languages. In November, we launched Speech to Speech, enabling you to transform your voice into another character with full control over emotions, timing, and delivery. Today we're making Speech to Speech multilingual. Use Speech to Speech to get more control than with prompting alone. Simply say it how you want it, choose a target voice, and generate.

ElevenLabs

113,317 views • 2 years ago

Two Realtime API updates: - You can now build speech-to-speech experiences with five new voices—which are much more expressive and steerable. 🤣🤫🤪 - We're lowering the price by using prompt caching. Cached text inputs are discounted 50% and cached audio inputs are discounted 80%. 📉

OpenAI Developers

243,205 views • 1 year ago

Mini-Omni 2 understands image, audio and text inputs all via end-to-end voice conversations with users 🔥 > Understands and processes images, speech, and text > Generates real-time speech responses > Supports interruptions during speech Technical Overview: > Concatenates image, audio, and text features for input. > Uses text-guided delayed parallel output for real-time speech > Involves encoder adaptation, modal alignment, and multimodal fine-tuning Best part: MIT licensed ⚡

Vaibhav (VB) Srivastav

45,409 views • 1 year ago

UPDATE: Four new open models on the Text to Speech Arena! 🔥 *sound on🔉* As the Text-to-Speech ecosystem is heating up, we decided to add more competition. > Parler TTS > VoiceCraft > Vokan > GPT-SOVITS Why is this important? The TTS ecosystem is riddled with opaque metrics and meaningless MOS scores. By crowdsourcing the evals, we test these models in real-life conditions and much more methodically. ⚡ Rank one, rank'em all! 🚀

Vaibhav (VB) Srivastav

40,938 views • 2 years ago

Meet Hibiki, our simultaneous speech-to-speech translation model, currently supporting 🇫🇷➡️🇬🇧. Hibiki produces spoken and text translations of the input speech in real-time, while preserving the speaker’s voice and optimally adapting its pace based on the semantic content of the source speech. Based on objective and human evaluations, Hibiki outperforms previous systems for quality, naturalness and speaker similarity and approaches human interpreters. 🧵

kyutai

167,364 views • 1 year ago

Today, we’re releasing Octave: the first LLM built for text-to-speech. 🎨Design any voice with a prompt 🎬 Give acting instructions to control emotion and delivery (sarcasm, whispering, etc.) 🛠️Produce long-form content on our Creator Studio Unlike traditional TTS that just “reads” words aloud, Octave understands how meaning affects delivery to generate emotional, human-like speech.

Hume AI

393,731 views • 1 year ago

What will you build with Vision Agents? Out-of-the-box support for: - Turn detection - Speech-to-text + text-to-speech - Voice activity detection - MCP & function-calling support Open-source. Video-first. Ready to build.

Stream

226,723 views • 5 months ago

Did xAI just mass-murder the entire voice AI industry? 🤯 Grok just launched two voice APIs. Speech-to-Text and Text-to-Speech. Built on the same stack powering Tesla cars and Starlink support. And priced at 10x cheaper than ElevenLabs. Speech-to-Text: $0.10/hr batch. $0.20/hr streaming. Text-to-Speech: $4.20 per million characters. 25+ languages. Real-time streaming. Speaker diarization. Already outperforming ElevenLabs, Deepgram, and AssemblyAI on word error rate. TTS ships with expressive tags like [laugh] and [sigh]. Voices that don't sound like robots reading a script. ElevenLabs spent years building a voice AI company. xAI built voice AI for cars and satellites.

Vaibhav Sisinty

24,522,201 views • 2 months ago