Presenting MetaVoice-1B, a 1.2B parameter base model for TTS (text-to-speech).

* Emotional speech in English
* Voice cloning with fine-tuning
* Zero-shot cloning for American & British voices
* Support for long-form synthesis

MetaVoice

2,794 subscribers

111,799 views • 2 years ago • via X (Twitter)

10 comments

MetaVoice • 2 years ago

We’re releasing MetaVoice-1B under the Apache 2.0 license; it can be used without restrictions. Model on HF:

MetaVoice • 2 years ago

Thanks also to @honualx, @jadecopet, @RobinSanroman, @adiyossLC, @FelixKreuk, @osanseviero, @reach_vb, @librivox, DeepFilterNet, and all the other open-source contributors who made this possible. Also, a big shoutout to @togethercompute for their 24x7 help with our cluster.

Luis C • 2 years ago

You can also try it out on @replicate here:

James Darpinian • 2 years ago

This sounds great! Does it support streaming? What's the real time factor on a 3090 or 4090?

Kolin Koehl • 2 years ago

The future of TTS is looking incredibly dynamic! Open Source emotional depth and voice cloning capabilities seem like game-changers. Curious about the quality of long-form content synthesis.

🩷Otome-chan🩷 • 2 years ago

Tried the demo. I think XTTS does better zero-shot for English voices, and is much lighter.

Abraham Owodunni • 2 years ago

What about the paper?

mmolony • 2 years ago

This is very cool. We’ve been using Azure’s text to speech for some of our work; it’s reassuring to see there’s some optionality in the space. If anyone has any other suggestions, please comment.

Andre.W • 2 years ago

Are more languages planned?

haareblond • 2 years ago

Will it be possible to add other languages in the future? Or maybe with fine-tuning?

Related videos

We’re excited to introduce Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required. Open-source, lightweight, and incredibly fast. 🧵👇

kyutai

236,454 views • 5 months ago

HOLY FUCK! Zyphra just dropped Zonos - Apache 2.0 licensed, Multilingual, Text to Speech model with INSTANT voice cloning! 🔥 > Zero-shot TTS with Voice Cloning: Input text and a 10-30 second speaker sample to generate high-quality text-to-speech output > Audio Prefix Inputs: Enhance speaker matching by adding an audio prefix to the text, enabling behaviors like whispering that are hard to achieve with voice cloning alone > Multilingual Support: Supports English, Japanese, Chinese, French, and German > Audio Quality & Emotion Control: Fine-tune speaking rate, pitch, frequency, audio quality, and emotions (e.g., happiness, anger, sadness, fear) > Fast Performance: Runs at ~2x real-time speed on an RTX 4090 > Available on the Hugging Face Hub 🤗

Vaibhav (VB) Srivastav

298,858 views • 1 year ago

Announcing the new SotA voice-cloning TTS model: 𝗩𝗼𝗶𝗰𝗲𝗦𝘁𝗮𝗿 ⭐️ VoiceStar is - autoregressive, - voice-cloning, - robust, - duration controllable, - *test-time extrapolation*, generates speech longer than training duration! Code&Model:

Puyuan Peng

27,872 views • 1 year ago

Hot! We have a new strong voice model. MOSS-TTS - a production-ready flagship 8B TTS; - high-fidelity zero-shot voice cloning, stable long-form gen; - multilingual; - lossless reconstruction; fine-grained pronunciation control; - token-level duration control, - voice creator, sound effects. Outstanding quality.

Wildminder

12,121 views • 4 months ago

LTX 2.3 audio as standalone speech model. Emotional TTS with Scenema Audio. - Zero-shot expressive voice cloning, speech gen - 8-step distilled with Gemma 3 12B text encoding - stage directions via tags - runs at 1.5x real-time on RTX 4090 - fits in 16GB VRAM - 13 languages, 48kHz stereo output it also gens matching environment sounds

Wildminder

16,309 views • 1 month ago

Oh my…FlashLabs releases Chroma 1.0, first open-source real-time speech-to-speech model with personalized voice cloning. Native speech-to-speech with <150ms latency & voice cloning from seconds of audio. Finally, an open alternative to OpenAI Realtime.

Alvaro Cintas

46,151 views • 5 months ago

Pretty fucking wild - Chatterbox TTS by Resemble AI - Zero shot voice cloning, Apache 2.0 licensed 🤯 > Outperforms ElevenLabs > Trained on 500K hours of audio > Llama 500M arch > Emotionally aware speech > Zero shot voice cloning > Live on Hugging Face 🤗

Vaibhav (VB) Srivastav

71,652 views • 1 year ago

MOSS-TTS v1.5 is here, an upgrade to v1.0 from @OpenMOSS. (demo👇)🤖 Key improvements: ⏸️ Inline pause control: [pause 3.2s] now supported mid-sentence 🌍 31 languages, up from 20 — now includes Cantonese, Hindi, Thai, Vietnamese, Tagalog, Swahili and more 🎙️ More stable voice cloning with reduced variance across repeated generations 📝 Better long-reference, short-text cloning All v1.0 capabilities preserved: zero-shot cloning, long-form speech, Pinyin/IPA control, code-switching. 💻

ModelScope

13,733 views • 28 days ago

BIG DAY FOR PlayAI 🚀 We’ve just open-sourced the first diffusion-LLM for speech! ⚡️ Generates audio in just 20-30 tokens (vs. 800-1000 for autoregressive) 🖌️ Perfect for super-fine in-painting edits & 🎙️ zero-shot voice cloning. Give it a try ⬇️

Felfel

50,503 views • 1 year ago

NEW: Higgs Audio V2 from BosonAI open, unified TTS model w/ voice cloning, beats GPT 4o mini tts and ElevenLabs v2 🔥 > Trained on 10M hours (speech, music, events) > Built on top of Llama 3.2 3B > Works real-time and on edge > Beats GPT-4o-mini-tts, ElevenLabs v2 in prosody & emotion Multi-speaker dialog > Zero-shot voice cloning 🤩 > Available on Hugging Face Kudos to folks at Boson AI for releasing such a brilliant work and all the details around the model! 🤗

Vaibhav (VB) Srivastav

79,585 views • 11 months ago

🎙️Designing a speech-to-speech assistant Build a speech-to-speech assistant with web search access in 150 lines of code. - Voxtral Transcribe 2 for STT + diarization - Mistral Small 4 for agentic reasoning & efficiency - Voxtral TTS for realistic speech synthesis

Mistral AI for Developers

18,182 views • 2 months ago

🚀 VoxCPM 2 is live! 🎉 Another open-source AI #TTS model from China — and one that stands shoulder to shoulder with Qwen3-TTS, while bringing everything into a single unified model. After rapid iterations from V1 (zero-shot cloning) to V1.5 (long-form + fine-tuning), #VoxCPM has consistently pushed quality and usability forward. Now, VoxCPM 2 takes it further: 🔹30+ languages — truly global, truly local. 🔹Infinite voice design — type it, hear it, control it. From a whisper to a booming cinematic voice. 🔹Studio-grade audio — 48kHz ultra-high fidelity with emotional depth 🔹Diffusion-Autoregressive cloning — preserves more acoustic and emotional detail than token-based models like Qwen3-TTS 💡 Big shoutout to Grok — used your multi-image video magic for our launch demo. It’s scarily good at keeping visuals consistent across shots. Elon Elon Musk, this one’s for you. 😉 Check the demo & start cloning your dream voice: 🌐 Hugging Face Space: 🤗 Hugging Face Model: 🤖 ModelScope Model: 💻 GitHub： #TTS #AI #VoiceCloning #GrokImagine #ElonMusk #OpenBMB #VoxCPM

OpenBMB

556,838 views • 2 months ago

The new open-source Text to Speech model: Fish Speech 1.4 is brilliant! Trained on a massive 700K hours of multilingual speech data in 8 languages - Instant voice cloning 🗣️ - Ultra-low latency ⚡ - Compact model (~1GB weights) 🏋️‍♂️

Rohan Paul

228,836 views • 1 year ago

Awesome! OmniVoice-TTS in ComfyUI. - zero-shot multilingual TTS; - 600+ languages; - voice cloning; - voice design; - multi-speaker dialogues; - supports SageAttention and non-verbal expression tags.

Wildminder

35,799 views • 2 months ago

EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine github: EmotiVoice is a powerful and modern open-source text-to-speech engine. EmotiVoice speaks both English and Chinese, and with over 2000 different voices. The most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including happy, excited, sad, angry and others.

AK

312,299 views • 2 years ago

VoxCPM 2 just dropped by OpenBMB Only 2B-param open-source TTS (Text-to-Speech) model built for production-grade multilingual voice work. Apache-2.0 license, Can run on only 8GB VRAM. • Eliminates the "robotic" feel of traditional TTS, delivering prosody and emotional depth suitable for high-stakes professional environments like filmmaking, gaming, animation, and audiobooks. • 30-language multilingual: no language tag needed, just type in a supported language and generate directly. • Voice design: create a brand-new voice from a text description alone, like age, tone, pace, or emotion. No reference audio required. Describe the desired voice characteristics (gender, age, tone, emotion, pace …) in Control Instruction, and VoxCPM2 will craft a unique voice from your description alone. • Controllable cloning: clone from a short clip, then steer delivery style without losing the speaker’s core voice. • Ultimate cloning: use reference audio + transcript for continuation-style cloning that keeps the tiny vocal details. • 48kHz output: takes 16kHz reference audio and produces studio-quality speech without an external upsampler. • Real-time ready: around 0.3 RTF on RTX 4090, even lower with Nano-VLLM. • Commercial use: Apache-2.0 licensed. Developer-Friendly Infrastructure: - Native Torch Inference: Direct support for PyTorch-based workflows. - Training Flexibility: Supports both full-parameter and LoRA fine-tuning for specific domain adaptation. - Production Readiness: Compatible with voxcpm-nanovllm for large-scale, high-concurrency deployment.

Rohan Paul

13,541 views • 2 months ago

Pretty WILD - SoTA open source TTS model that beats ElevenLabs/ Sesame - Dia 1.6B - Apache 2.0 licensed! 🔥 > Ultra realistic voice synthesis > Capable of producing non-verbal sounds - coughing, laughing 💥 > Zero shot Voice Cloning > Real-time TTS synthesis > Can run on your MacBook > Trending #2 on Hugging Face Weights on the Hub and code on GitHub! 🤯

Vaibhav (VB) Srivastav

39,165 views • 1 year ago

This is wild. Hume AI just dropped Octave 2. Ultra realistic text-to-speech AI model in 10+ languages with multi-speaker + voice cloning. 100% AI 8 wild examples + how to try:

Min Choi

212,441 views • 8 months ago