正在加载视频...

视频加载失败

Presenting MetaVoice-1B, a 1.2B parameter base model for TTS (text-to-speech). * Emotional speech in English * Voice cloning with fine-tuning * Zero-shot cloning for American & British voices * Support for long-form synthesis

111,799 次观看 • 2 年前 •via X (Twitter)

10 条评论

MetaVoice 的头像
MetaVoice2 年前

We’re releasing MetaVoice-1B under the Apache 2.0 license, it can be used without restrictions. Model on HF:

MetaVoice 的头像
MetaVoice2 年前

Thanks also to @honualx, @jadecopet, @RobinSanroman, @adiyossLC, @FelixKreuk, @osanseviero, @reach_vb, @librivox, DeepFilterNet, and all the other open-source contributors who made this possible. Also, a big shoutout to @togethercompute for their 24x7 help with our cluster.

Luis C 的头像
Luis C2 年前

You can also try it out on @replicate here:

James Darpinian 的头像
James Darpinian2 年前

This sounds great! Does it support streaming? What's the real time factor on a 3090 or 4090?

Kolin Koehl 的头像
Kolin Koehl2 年前

The future of TTS is looking incredibly dynamic! Open Source emotional depth and voice cloning capabilities seem like game-changers. Curious about the quality of long-form content synthesis.

🩷Otome-chan🩷 的头像
🩷Otome-chan🩷2 年前

Tried the demo. I think xtts does better zero-shot for english voices, and is much lighter.

Abraham Owodunni 的头像
Abraham Owodunni2 年前

What about the paper ??

mmolony 的头像
mmolony2 年前

This is very cool. We’ve been using Azure’s text to speech for some of our work, it’s reassuring to see there’s some optionality in the space. If anyone has any other suggestions please comment

Andre.W 的头像
Andre.W2 年前

Are more languages planned?

haareblond 的头像
haareblond2 年前

will it be possible to add other laguages in future? or maby with finetuing?

相关视频

VoxCPM 2 just dropped by OpenBMB Only 2B-param open-source TTS (Text-to-Speech) model built for production-grade multilingual voice work. Apache-2.0 license, Can run on only 8GB VRAM. • Eliminates the "robotic" feel of traditional TTS, delivering prosody and emotional depth suitable for high-stakes professional environments like filmmaking, gaming, animation, and audiobooks. • 30-language multilingual: no language tag needed, just type in a supported language and generate directly. • Voice design: create a brand-new voice from a text description alone, like age, tone, pace, or emotion. No reference audio required. Describe the desired voice characteristics (gender, age, tone, emotion, pace …) in Control Instruction, and VoxCPM2 will craft a unique voice from your description alone. • Controllable cloning: clone from a short clip, then steer delivery style without losing the speaker’s core voice. • Ultimate cloning: use reference audio + transcript for continuation-style cloning that keeps the tiny vocal details. • 48kHz output: takes 16kHz reference audio and produces studio-quality speech without an external upsampler. • Real-time ready: around 0.3 RTF on RTX 4090, even lower with Nano-VLLM. • Commercial use: Apache-2.0 licensed. Developer-Friendly Infrastructure: - Native Torch Inference: Direct support for PyTorch-based workflows. - Training Flexibility: Supports both full-parameter and LoRA fine-tuning for specific domain adaptation. - Production Readiness: Compatible with voxcpm-nanovllm for large-scale, high-concurrency deployment.

Rohan Paul

13,541 次观看 • 2 个月前