Presenting MetaVoice-1B, a 1.2B parameter base model for TTS (text-to-speech).

* Emotional speech in English
* Voice cloning with fine-tuning
* Zero-shot cloning for American & British voices
* Support for long-form synthesis

111,799 views • 2 years ago • via X (Twitter)

10 Comments

MetaVoice • 2 years ago

We’re releasing MetaVoice-1B under the Apache 2.0 license; it can be used without restrictions. Model on HF:

MetaVoice • 2 years ago

Thanks also to @honualx, @jadecopet, @RobinSanroman, @adiyossLC, @FelixKreuk, @osanseviero, @reach_vb, @librivox, DeepFilterNet, and all the other open-source contributors who made this possible. Also, a big shoutout to @togethercompute for their 24x7 help with our cluster.

Luis C • 2 years ago

You can also try it out on @replicate here:

James Darpinian • 2 years ago

This sounds great! Does it support streaming? What's the real time factor on a 3090 or 4090?

Kolin Koehl • 2 years ago

The future of TTS is looking incredibly dynamic! Open Source emotional depth and voice cloning capabilities seem like game-changers. Curious about the quality of long-form content synthesis.

🩷Otome-chan🩷 • 2 years ago

Tried the demo. I think XTTS does better zero-shot for English voices, and is much lighter.

Abraham Owodunni • 2 years ago

What about the paper?

mmolony • 2 years ago

This is very cool. We’ve been using Azure’s text-to-speech for some of our work, so it’s reassuring to see there’s some optionality in the space. If anyone has any other suggestions, please comment.

Andre.W • 2 years ago

Are more languages planned?

haareblond • 2 years ago

Will it be possible to add other languages in the future? Or maybe with fine-tuning?
