Introducing SeamlessM4T, the first all-in-one, multilingual multimodal translation model. This single model can perform tasks across speech-to-text, speech-to-speech, text-to-text translation & speech recognition for up to 100 languages depending on the task. Details ⬇️

AI at Meta

817,096 subscribers

592,726 views • 2 years ago • via X (Twitter)

Science & Technology

9 Comments

AI at Meta • 2 years ago

Compared to cascaded approaches, SeamlessM4T's single system approach reduces errors & delays, increasing translation efficiency & quality, delivering state-of-the-art results. Want to see it for yourself? Try the demo ➡️

AI at Meta • 2 years ago

We believe SeamlessM4T represents a significant breakthrough and as part of our open approach, today we're publicly releasing this work under a CC BY-NC 4.0 license so that others can continue to build on this important field of study. Get the code ⬇️

kache • 2 years ago

thank you!

sankalp • 2 years ago

Oh waifu, next two days timeline gonna be filled with quote tweets by ML Bros now. How will my tweets be able to find space on people's TL? I am sorry I don't know enough ML to quote tweet this and earn Elon buxx. now we are homeress.

Mike Ma (AI x Finance) • 2 years ago

SeamlessM4T = incredible 👏 As a reminder, this also sits on top of one of the most comprehensive, cutting-edge translation/language tech as well:

Horse Clock • 2 years ago

I tried it with Japanese. The results are...extremely bad. It translates everything literally.

Sid⚡️ • 2 years ago

Meta might as well turn into a research company honestly Much better. Great engineers and researchers. Open source contribution Everything else including zuck — out

𝕏 NiNo • 2 years ago

Demo Lab is 🔥

unrenormalizable • 2 years ago

Ffmpeg but for natural language.

Related Videos

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

Today we're sharing new progress on our AI speech work. Our Massively Multilingual Speech (MMS) project has now scaled speech-to-text & text-to-speech to support over 1,100 languages — a 10x increase from previous work. Details + access to new pretrained models ⬇️

AI at Meta

326,826 views • 3 years ago

The new open-source Text to Speech model: Fish Speech 1.4 is brilliant! Trained on a massive 700K hours of multilingual speech data in 8 languages - Instant voice cloning 🗣️ - Ultra-low latency ⚡ - Compact model (~1GB weights) 🏋️‍♂️

Rohan Paul

228,836 views • 1 year ago

In this demo, we show real-time voice translation between Indian languages. The user can speak in their preferred language, and the system translates and responds in another, preserving both meaning and natural delivery. Speech recognition, translation, and expressive text-to-speech work together seamlessly.

Pratyush Kumar

15,206 views • 4 months ago

We pioneered the first ultra-realistic Text to Speech model, and recently launched the world's most accurate Speech to Text model, Scribe. But we're not stopping there. Today, we're taking one small step for man, and one giant leap for man's best friend... with Text to Bark.

ElevenLabs

291,233 views • 1 year ago

🎙️Do you know you now have all the building blocks for full speech-to-speech? - Voxtral Realtime: High-quality, real-time speech-to-text. - Mistral Small 4: Fast, efficient, general-purpose agentic model. - Voxtral TTS: Realistic customizable text-to-speech with streaming output.

Mistral AI for Developers

27,787 views • 3 months ago

Today we released Meta Spirit LM — our first open source multimodal language model that freely mixes text and speech. Many existing AI voice experiences today use ASR techniques to process speech before synthesizing with an LLM to generate text — but these approaches compromise the expressive aspects of speech. Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations for both inputs and outputs to generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification. We hope that sharing this work will enable the research community to further new approaches for text and speech integration.

AI at Meta

351,698 views • 1 year ago

Can you accurately transcribe fast speech? Tested ElevenLabs' new Speech-to-Text model (Scribe) with Eminem's "Rap God" (4.28 words/sec!) & it nailed it. Great quality and supports 99+ languages.

Addy Osmani

108,722 views • 1 year ago

Thanks to our +1.2M community across +180 countries, we are able to source thousands of languages & dialects. This fuels Silencio Voice AI's unmatched datasets: • Multilingual Conversations • Text to Speech • Speech to Text • Ambient Sounds • Commands and Order. Powering the next wave of AI & Robotics. PS: Keynote opening for everyone today, stay tuned.

Silencio | Voice Data for AI

13,705 views • 9 months ago

Here are the best practices for using Eleven v3 (alpha) - the most expressive Text to Speech model.

ElevenLabs

43,692 views • 1 year ago

PaLM + AudioLM = AudioPaLM ! We start from PaLM pretrained on text and extend its vocab w/ audio tokens. This model can then be finetuned on a mix of any (speech, text) task e.g. ASR, TTS, MT and speech2speech translation in one's voice! 🧵1/4

Neil Zeghidour

41,240 views • 3 years ago

Yesterday we introduced SeamlessExpressive — a new model that preserves unique vocal styles & expression for speech translation, built on our SeamlessM4T v2 foundation model. More details on the family of Seamless Communication models ➡️

AI at Meta

189,965 views • 2 years ago

NEW: Kokoro 82M - APACHE 2.0 licensed, Text to Speech model, trained on < 100 hours of audio 🔥

Vaibhav (VB) Srivastav

330,034 views • 1 year ago

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale blog: Large-scale generative models such as GPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high fidelity text or image outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster.

AK

429,294 views • 3 years ago

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech 🎭Realistic, emotionally expressive speech. 🌍Supports 9 languages and accurately captures diverse dialects. ⚡Very low latency for time-to-first-audio. 🔄Easily adaptable to new voices

Mistral AI

939,262 views • 3 months ago

We have officially launched Fun-ASR1.5, a major update to our end-to-end speech recognition model. This release focuses on three core pillars: broader language coverage, language switching, and production-ready text output. Key Features: • Multilingual Support: Supports high-accuracy recognition for 30 languages across Asia, Europe, and the Middle East within a single model. • Language Switching: Handles mixed-language speech (Code-Switching) natively, automatically detecting and transcribing language shifts without the need for manual tagging. • Professional Text Output: Delivers "ready-to-use" text with smart punctuation and automatic formatting for dates, numbers, and currencies. Fun-ASR1.5 bridges the gap between raw audio and professional documentation, providing a reliable engine for global communication.

Tongyi Lab

3,917,307 views • 2 months ago

What will you build with Vision Agents? Out-of-the-box support for: - Turn detection - Speech-to-text + text-to-speech - Voice activity detection - MCP & function-calling support Open-source. Video-first. Ready to build.

Stream

226,723 views • 5 months ago

Experimenting with OpenAI's new Text to Speech model 💬 Punctuation is powerful here 🤯

Miguel | AP

202,867 views • 2 years ago