Presenting MetaVoice-1B, a 1.2B parameter base model for TTS (text-to-speech).
* Emotional speech in English
* Voice cloning with fine-tuning
* Zero-shot cloning for American & British voices
* Support for long-form synthesis

MetaVoice

2,792 subscribers

111,799 views • 2 years ago • via X (Twitter)

10 Comments

MetaVoice • 2 years ago

We’re releasing MetaVoice-1B under the Apache 2.0 license; it can be used without restrictions. Model on HF:

MetaVoice • 2 years ago

Thanks also to @honualx, @jadecopet, @RobinSanroman, @adiyossLC, @FelixKreuk, @osanseviero, @reach_vb, @librivox, DeepFilterNet, and all the other open-source contributors who made this possible. Also, a big shoutout to @togethercompute for their 24x7 help with our cluster.

Luis C • 2 years ago

You can also try it out on @replicate here:

James Darpinian • 2 years ago

This sounds great! Does it support streaming? What's the real-time factor on a 3090 or 4090?

Kolin Koehl • 2 years ago

The future of TTS is looking incredibly dynamic! Open Source emotional depth and voice cloning capabilities seem like game-changers. Curious about the quality of long-form content synthesis.

🩷Otome-chan🩷 • 2 years ago

Tried the demo. I think XTTS does better zero-shot for English voices, and is much lighter.

Abraham Owodunni • 2 years ago

What about the paper?

mmolony • 2 years ago

This is very cool. We’ve been using Azure’s text to speech for some of our work, and it’s reassuring to see there’s some optionality in the space. If anyone has any other suggestions, please comment.

Andre.W • 2 years ago

Are more languages planned?

haareblond • 2 years ago

Will it be possible to add other languages in the future? Or maybe with fine-tuning?

Related Videos

We’re excited to introduce Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required. Open-source, lightweight, and incredibly fast. 🧵👇

kyutai

236,737 views • 5 months ago

HOLY FUCK! Zyphra just dropped Zonos - Apache 2.0 licensed, Multilingual, Text to Speech model with INSTANT voice cloning! 🔥 > Zero-shot TTS with Voice Cloning: Input text and a 10-30 second speaker sample to generate high-quality text-to-speech output > Audio Prefix Inputs: Enhance speaker matching by adding an audio prefix to the text, enabling behaviors like whispering that are hard to achieve with voice cloning alone > Multilingual Support: Supports English, Japanese, Chinese, French, and German > Audio Quality & Emotion Control: Fine-tune speaking rate, pitch, frequency, audio quality, and emotions (e.g., happiness, anger, sadness, fear) > Fast Performance: Runs at ~2x real-time speed on an RTX 4090 > Available on the Hugging Face Hub 🤗

Vaibhav (VB) Srivastav

298,858 views • 1 year ago

Announcing the new SotA voice-cloning TTS model: 𝗩𝗼𝗶𝗰𝗲𝗦𝘁𝗮𝗿 ⭐️ VoiceStar is - autoregressive, - voice-cloning, - robust, - duration controllable, - *test-time extrapolation*, generates speech longer than training duration! Code&Model:

Puyuan Peng

27,872 views • 1 year ago

Hot! We have a new strong voice model. MOSS-TTS - a production-ready flagship 8B TTS; - high-fidelity zero-shot voice cloning, stable long-form gen; - multilingual; - lossless reconstruction; fine-grained pronunciation control; - token-level duration control, - voice creator, sound effects. Outstanding quality.

Wildminder

12,121 views • 4 months ago

LTX 2.3 audio as standalone speech model. Emotional TTS with Scenema Audio. - Zero-shot expressive voice cloning, speech gen - 8-step distilled with Gemma 3 12B text encoding - stage directions via tags - runs at 1.5x real-time on RTX 4090 - fits in 16GB VRAM - 13 languages, 48kHz stereo output it also gens matching environment sounds

Wildminder

16,309 views • 1 month ago

Oh my…FlashLabs releases Chroma 1.0, first open-source real-time speech-to-speech model with personalized voice cloning. Native speech-to-speech with <150ms latency & voice cloning from seconds of audio. Finally, an open alternative to OpenAI Realtime.

Alvaro Cintas

46,151 views • 5 months ago

Pretty fucking wild - Chatterbox TTS by Resemble AI - Zero shot voice cloning, Apache 2.0 licensed 🤯 > Outperforms ElevenLabs > Trained on 500K hours of audio > Llama 500M arch > Emotionally aware speech > Zero shot voice cloning > Live on Hugging Face 🤗

Vaibhav (VB) Srivastav

71,652 views • 1 year ago

MOSS-TTS v1.5 is here, an upgrade to v1.0 from @OpenMOSS. (demo👇)🤖 Key improvements: ⏸️ Inline pause control: [pause 3.2s] now supported mid-sentence 🌍 31 languages, up from 20 — now includes Cantonese, Hindi, Thai, Vietnamese, Tagalog, Swahili and more 🎙️ More stable voice cloning with reduced variance across repeated generations 📝 Better long-reference, short-text cloning All v1.0 capabilities preserved: zero-shot cloning, long-form speech, Pinyin/IPA control, code-switching. 💻

ModelScope

13,733 views • 1 month ago

BIG DAY FOR PlayAI 🚀 We’ve just open-sourced the first diffusion-LLM for speech! ⚡️ Generates audio in just 20-30 tokens (vs. 800-1000 for autoregressive) 🖌️ Perfect for super-fine in-painting edits & 🎙️ zero-shot voice cloning. Give it a try ⬇️

Felfel

50,503 views • 1 year ago

NEW: Higgs Audio V2 from BosonAI open, unified TTS model w/ voice cloning, beats GPT 4o mini tts and ElevenLabs v2 🔥 > Trained on 10M hours (speech, music, events) > Built on top of Llama 3.2 3B > Works real-time and on edge > Beats GPT-4o-mini-tts, ElevenLabs v2 in prosody & emotion Multi-speaker dialog > Zero-shot voice cloning 🤩 > Available on Hugging Face Kudos to folks at Boson AI for releasing such a brilliant work and all the details around the model! 🤗

Vaibhav (VB) Srivastav

79,585 views • 11 months ago

🚀 VoxCPM 2 is live! 🎉 Another open-source AI #TTS model from China — and one that stands shoulder to shoulder with Qwen3-TTS, while bringing everything into a single unified model. After rapid iterations from V1 (zero-shot cloning) to V1.5 (long-form + fine-tuning), #VoxCPM has consistently pushed quality and usability forward. Now, VoxCPM 2 takes it further: 🔹30+ languages — truly global, truly local. 🔹Infinite voice design — type it, hear it, control it. From a whisper to a booming cinematic voice. 🔹Studio-grade audio — 48kHz ultra-high fidelity with emotional depth 🔹Diffusion-Autoregressive cloning — preserves more acoustic and emotional detail than token-based models like Qwen3-TTS 💡 Big shoutout to Grok — used your multi-image video magic for our launch demo. It’s scarily good at keeping visuals consistent across shots. Elon Elon Musk, this one’s for you. 😉 Check the demo & start cloning your dream voice: 🌐 Hugging Face Space: 🤗 Hugging Face Model: 🤖 ModelScope Model: 💻 GitHub： #TTS #AI #VoiceCloning #GrokImagine #ElonMusk #OpenBMB #VoxCPM

OpenBMB

556,915 views • 2 months ago

🎙️Designing a speech-to-speech assistant Build a speech-to-speech assistant with web search access in 150 lines of code. - Voxtral Transcribe 2 for STT + diarization - Mistral Small 4 for agentic reasoning & efficiency - Voxtral TTS for realistic speech synthesis

Mistral AI for Developers

18,182 views • 2 months ago

The new open-source Text to Speech model: Fish Speech 1.4 is brilliant! Trained on a massive 700K hours of multilingual speech data in 8 languages - Instant voice cloning 🗣️ - Ultra-low latency ⚡ - Compact model (~1GB weights) 🏋️‍♂️

Rohan Paul

228,836 views • 1 year ago

Awesome! OmniVoice-TTS in ComfyUI. - zero-shot multilingual TTS; - 600+ languages; - voice cloning; - voice design; - multi-speaker dialogues; - supports SageAttention and non-verbal expression tags.

Wildminder

35,799 views • 2 months ago

EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine github: EmotiVoice is a powerful and modern open-source text-to-speech engine. EmotiVoice speaks both English and Chinese, and with over 2000 different voices. The most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including happy, excited, sad, angry and others.

AK

312,299 views • 2 years ago

VoxCPM 2 just dropped by OpenBMB Only 2B-param open-source TTS (Text-to-Speech) model built for production-grade multilingual voice work. Apache-2.0 license, Can run on only 8GB VRAM. • Eliminates the "robotic" feel of traditional TTS, delivering prosody and emotional depth suitable for high-stakes professional environments like filmmaking, gaming, animation, and audiobooks. • 30-language multilingual: no language tag needed, just type in a supported language and generate directly. • Voice design: create a brand-new voice from a text description alone, like age, tone, pace, or emotion. No reference audio required. Describe the desired voice characteristics (gender, age, tone, emotion, pace …) in Control Instruction, and VoxCPM2 will craft a unique voice from your description alone. • Controllable cloning: clone from a short clip, then steer delivery style without losing the speaker’s core voice. • Ultimate cloning: use reference audio + transcript for continuation-style cloning that keeps the tiny vocal details. • 48kHz output: takes 16kHz reference audio and produces studio-quality speech without an external upsampler. • Real-time ready: around 0.3 RTF on RTX 4090, even lower with Nano-VLLM. • Commercial use: Apache-2.0 licensed. Developer-Friendly Infrastructure: - Native Torch Inference: Direct support for PyTorch-based workflows. - Training Flexibility: Supports both full-parameter and LoRA fine-tuning for specific domain adaptation. - Production Readiness: Compatible with voxcpm-nanovllm for large-scale, high-concurrency deployment.

Rohan Paul

13,541 views • 2 months ago

Pretty WILD - SoTA open source TTS model that beats ElevenLabs/ Sesame - Dia 1.6B - Apache 2.0 licensed! 🔥 > Ultra realistic voice synthesis > Capable of producing non-verbal sounds - coughing, laughing 💥 > Zero shot Voice Cloning > Real-time TTS synthesis > Can run on your MacBook > Trending #2 on Hugging Face Weights on the Hub and code on GitHub! 🤯

Vaibhav (VB) Srivastav

39,165 views • 1 year ago

This is wild. Hume AI just dropped Octave 2. Ultra realistic text-to-speech AI model in 10+ languages with multi-speaker + voice cloning. 100% AI 8 wild examples + how to try:

Min Choi

212,441 views • 8 months ago