LETS GOO! Parler TTS 🔥

A fully open-source, Apache 2.0 licensed Text-to-speech model focused on providing maximum controllability. Through voice prompts, you can control the pitch, speed, gender, noise levels, emotion characteristics and more!

> Trained on 10K hours of permissive data.
> Offers control over the generations.
>...

156,386 views • 2 years ago • via X (Twitter)

9 comments

Vaibhav (VB) Srivastav • 2 years ago

Try it out in the space directly (& share your generations below)!

Vaibhav (VB) Srivastav • 2 years ago

Check out our inference plus training code base here:

Vaibhav (VB) Srivastav • 2 years ago

You should also be able to use it in a Colab with less than 10 lines of code:

    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer
    import soundfile as sf

    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

    prompt = "Hey, how are you doing today?"
    description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality."

    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
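Since the voice is steered entirely by the free-text description, it can help to build that string from named attributes rather than editing it by hand. The sketch below is only an illustration: `build_description` is a hypothetical helper, not part of `parler_tts`, and its sentence template simply mirrors the example description in the comment above. Which attribute wordings the model actually responds to is not stated in this thread, so treat the parameter values as assumptions to experiment with.

```python
def build_description(gender="female",
                      pitch="slightly low-pitched",
                      delivery="quite expressively",
                      environment="very confined sounding",
                      quality="clear"):
    """Hypothetical helper (not part of parler_tts).

    Fills the sentence template used by the example description in the
    thread. The attribute vocabulary is an assumption, not a documented list.
    """
    # Pick a possessive pronoun that matches the requested speaker gender.
    pronoun = {"female": "her", "male": "his"}.get(gender, "their")
    return (
        f"A {gender} speaker with a {pitch} voice delivers {pronoun} words "
        f"{delivery}, in a {environment} environment with {quality} audio quality."
    )


# With the defaults this reproduces the description string from the thread.
description = build_description()

# A variation: only the attributes being changed need to be passed.
male_description = build_description(gender="male", pitch="high-pitched")
```

The resulting string would be passed to the tokenizer in place of the hard-coded `description` in the snippet above.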

Dennis Lysenko • 2 years ago

@reach_vb this is awesome -- can we run this on Replicate?

Vaibhav (VB) Srivastav • 2 years ago

Not yet, but you can try it out and use it here:

Javier de la Rosa @versae@mastodon.social • 2 years ago

This is really cool! I've been looking at Parler and Data-Speech and would love to give it a try for low-resource languages. What's the minimum amount of hours needed for this to adapt to another language? And does the audio need to be separated by speaker?

Vaibhav (VB) Srivastav • 2 years ago

We will release fine-tuning support soon. I think for the most part the quality of the dataset matters way more than the quantity. You’d need to have enough diversity to ensure a balance in the voice prompts. Once that is in you should be able to train in any language. That said, we haven’t tried this yet, so this is all a hypothesis at this point.

bitBrain • 2 years ago

@ClementDelangue but can it laugh? nono sorry, I mean Holy shit! nice!

bane • 2 years ago

@huggingface Not bad

Related videos

VoxCPM 2 just dropped by OpenBMB

Only 2B-param open-source TTS (Text-to-Speech) model built for production-grade multilingual voice work. Apache-2.0 license, can run on only 8GB VRAM.

• Eliminates the "robotic" feel of traditional TTS, delivering prosody and emotional depth suitable for high-stakes professional environments like filmmaking, gaming, animation, and audiobooks.
• 30-language multilingual: no language tag needed, just type in a supported language and generate directly.
• Voice design: create a brand-new voice from a text description alone, like age, tone, pace, or emotion. No reference audio required. Describe the desired voice characteristics (gender, age, tone, emotion, pace …) in Control Instruction, and VoxCPM2 will craft a unique voice from your description alone.
• Controllable cloning: clone from a short clip, then steer delivery style without losing the speaker’s core voice.
• Ultimate cloning: use reference audio + transcript for continuation-style cloning that keeps the tiny vocal details.
• 48kHz output: takes 16kHz reference audio and produces studio-quality speech without an external upsampler.
• Real-time ready: around 0.3 RTF on RTX 4090, even lower with Nano-VLLM.
• Commercial use: Apache-2.0 licensed.

Developer-Friendly Infrastructure:
- Native Torch Inference: Direct support for PyTorch-based workflows.
- Training Flexibility: Supports both full-parameter and LoRA fine-tuning for specific domain adaptation.
- Production Readiness: Compatible with voxcpm-nanovllm for large-scale, high-concurrency deployment.

Rohan Paul

13,541 views • 1 month ago
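For readers unfamiliar with the "0.3 RTF" figure in the post above: real-time factor is commonly defined as synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The post does not spell out its definition, so the sketch below assumes that common one. The function names are illustrative and do not come from any library.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF under the common definition: compute time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds, rtf):
    """Rough compute time needed to generate audio_seconds of speech."""
    return audio_seconds * rtf


# At the quoted RTF of about 0.3, ten seconds of speech would take
# roughly three seconds to generate.
estimate = estimated_synthesis_seconds(10.0, 0.3)
```

The quoted figure is hardware-specific (RTX 4090 in the post), so an estimate like this only holds for comparable setups.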