LETS GOO! Parler TTS 🔥 A fully open-source, Apache 2.0 licensed Text-to-speech model focused on providing maximum controllability. Through voice prompts, you can control the pitch, speed, gender, noise levels, emotion characteristics and more!
> Trained on 10K hours of permissive data.
> Offers control over the generations.
>...

Vaibhav (VB) Srivastav

41,658 subscribers

156,386 views • 2 years ago • via X (Twitter)

Education Health & Wellness Science & Technology

9 Comments

Vaibhav (VB) Srivastav • 2 years ago

Try it out in the space directly (& share your generations below)!

Vaibhav (VB) Srivastav • 2 years ago

Check out our inference plus training code base here:

Vaibhav (VB) Srivastav • 2 years ago

You should also be able to use it in a Colab with less than 10 lines of code:

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
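In the snippet above, the `description` string is the voice prompt that steers pitch, pace, and recording conditions. As a rough sketch, a small helper could compose such descriptions from a few attributes. The function `build_description` and its parameter names are hypothetical and not part of `parler_tts`, and which wordings the model actually responds to is an assumption here, not something stated in the thread.

```python
# Hypothetical helper, not part of parler_tts: composes a free-text voice
# description of the kind passed to the tokenizer in the snippet above.
def build_description(
    gender="female",
    pitch="slightly low-pitched",
    expressiveness="quite expressively",
    pace="at a moderate pace",
    environment="very confined sounding",
    quality="clear",
):
    # The sentence template mirrors the example description in the thread;
    # the exact phrasing the model prefers is an assumption.
    return (
        f"A {gender} speaker with a {pitch} voice delivers their words "
        f"{expressiveness} and {pace}, in a {environment} environment "
        f"with {quality} audio quality."
    )


# Vary one or two attributes and keep the rest at their defaults.
description = build_description(gender="male", pace="very fast")
print(description)
```

The resulting string can be dropped in place of the hard-coded `description` before tokenizing.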

Dennis Lysenko • 2 years ago

@reach_vb this is awesome -- can we run this on Replicate?

Vaibhav (VB) Srivastav • 2 years ago

Not yet, but you can try it out and use it here:

Javier de la Rosa @[email protected] • 2 years ago

This is really cool! I've been looking at Parler and Data-Speech and would love to give it a try for low-resource languages. What's the minimum amount of hours needed for this to adapt to another language? And does the audio need to be separated by speaker?

Vaibhav (VB) Srivastav • 2 years ago

We will release fine-tuning support soon. I think for the most part the quality of the dataset matters way more than the quantity. You’d need to have enough diversity to ensure a balance in the voice prompts. Once that is in you should be able to train in any language. That said, we haven’t tried this yet, so this is all a hypothesis at this point.
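One way to get a feel for the "balance in the voice prompts" mentioned above is to count how often each attribute value appears in a dataset's metadata. This is only a sketch: the `clips` records, the field names, and the `min_share` threshold are all hypothetical, and the thread does not describe any such tooling.

```python
from collections import Counter

# Hypothetical metadata, one dict per training clip. The field names are
# illustrative and do not come from Parler-TTS or Data-Speech.
clips = [
    {"gender": "female", "pitch": "low"},
    {"gender": "female", "pitch": "high"},
    {"gender": "female", "pitch": "low"},
    {"gender": "male", "pitch": "low"},
]


def attribute_balance(clips, field):
    """Return each value's share of the clips for one attribute."""
    counts = Counter(clip[field] for clip in clips)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def underrepresented(clips, field, min_share=0.3):
    """List values whose share falls below min_share (an arbitrary cutoff)."""
    shares = attribute_balance(clips, field)
    return sorted(value for value, share in shares.items() if share < min_share)


print(attribute_balance(clips, "gender"))
print(underrepresented(clips, "gender"))
```

Values flagged this way would be candidates for collecting more data before fine-tuning, under the untested hypothesis stated in the comment.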

bitBrain • 2 years ago

@ClementDelangue but can it laugh? nono sorry, I mean Holy shit! nice!

bane • 2 years ago

@huggingface Not bad

Related Videos

Introducing Indic-Parler TTS - Trained on 10K hours of data, 938M params, supports 20 Indic languages, emotional synthesis, apache 2.0 licensed! 🔥 A collaboration w/ AI4Bharat & Hugging Face - w/ fully customisable speech and voice personas! Try it out directly below or use the model weights as you want! 🇮🇳/acc

Vaibhav (VB) Srivastav

42,165 views • 1 year ago

NEW: Kokoro 82M - APACHE 2.0 licensed, Text to Speech model, trained on < 100 hours of audio 🔥

Vaibhav (VB) Srivastav

330,034 views • 1 year ago

Today we're releasing ZONOS2, our next-generation real-time TTS model with high-fidelity voice cloning. ZONOS2 is the most expressive open-source TTS model, released under Apache 2.0 and available on Zyphra Cloud on AMD. 🧵

Zyphra

331,933 views • 22 days ago

Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on Hugging Face, fal, and Pipecat AI.
- 3x faster inference (12ms on an L40)
- 14 languages (13 more than v1, which was English-only)
- New synthetic data set `chirp_3_all` with ~163k audio samples
- 99% accuracy on held out `human_5_all` test data
Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, and mitigates the voice AI trade-off between unwanted turn latency vs the agent interrupting people before they are finished speaking. Training scripts for both Modal and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model! Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too. You can help by contributing data, doing architecture experiments, or cleaning open source data! Keep reading ...

kwindla

42,219 views • 11 months ago

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms. This agent actually uses *three* NVIDIA open source models: - Nemotron Speech ASR - Nemotron 3 Nano 30GB in a 4-bit quant (released in December) - A preview checkpoint of the upcoming Magpie text-to-speech model These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.) The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

kwindla

274,345 views • 5 months ago

Introducing Indic Parler-TTS: Open-Source Text-to-Speech for Over a Billion Indic Speakers! 🌏 In collaboration with Hugging Face, we are excited to release Indic Parler-TTS, a state-of-the-art open-source text-to-speech system designed to bring accessible and high-quality speech technology to India’s diverse linguistic community. Supporting 20 of the 22 scheduled Indian languages—and English in various accents (US, British, Indian)—it’s built to serve over a billion speakers and empower companies, developers, researchers, and communities. Why Indic Parler-TTS Stands Out: 1. Open and Accessible: Fully open-source with permissive licensing for unrestricted usage. 2. Wide Language Support: Includes a vast range of Indic languages, with rich diversity in voices. 3. High-Quality Audio: Produces natural, clear, and lifelike speech. 4. Adaptable and Fine-Tunable: Customize it to new languages, accents, or specific applications. 5. State-of-the-Art Performance: Proven through rigorous evaluation. 6. Versatile and Inclusive: Indic Parler-TTS offers 69 unique voices across 18 Indian languages, making it a perfect fit for diverse use cases like audiobooks, virtual assistants, and educational tools. Let's democratize speech technology together and make speech technology more inclusive and accessible for everyone. ▶️ Experience it now: Demo: Model page:

AI4Bharat

28,586 views • 1 year ago

HOLY FUCK! Zyphra just dropped Zonos - Apache 2.0 licensed, Multilingual, Text to Speech model with INSTANT voice cloning! 🔥 > Zero-shot TTS with Voice Cloning: Input text and a 10-30 second speaker sample to generate high-quality text-to-speech output > Audio Prefix Inputs: Enhance speaker matching by adding an audio prefix to the text, enabling behaviors like whispering that are hard to achieve with voice cloning alone > Multilingual Support: Supports English, Japanese, Chinese, French, and German > Audio Quality & Emotion Control: Fine-tune speaking rate, pitch, frequency, audio quality, and emotions (e.g., happiness, anger, sadness, fear) > Fast Performance: Runs at ~2x real-time speed on an RTX 4090 > Available on the Hugging Face Hub 🤗

Vaibhav (VB) Srivastav

298,858 views • 1 year ago

We just released a new version of Kitten TTS - 15M param SOTA tiny text-to-speech model It has a significant quality improvement over the previous version. Still less than 25MB in size! Open-source, extremely tiny, expressive. Apache 2.0

Divam Gupta

92,030 views • 4 months ago

Introducing the Open Deep Research app! Generate detailed reports on any topic with open source LLMs. Free & fully open source. We’re releasing everything: evaluation dataset, code, app, and blog.🔥

Together AI

28,338 views • 1 year ago

The new open-source Text to Speech model: Fish Speech 1.4 is brilliant! Trained on a massive 700K hours of multilingual speech data in 8 languages - Instant voice cloning 🗣️ - Ultra-low latency ⚡ - Compact model (~1GB weights) 🏋️‍♂️

Rohan Paul

228,836 views • 1 year ago

We released Sonic-3.5 and Ink-2, the #1 streaming models for text to speech and speech to text you can use in your voice agents today. New architectures enable new frontiers for speed and quality. We're now the only provider to have #1 models for both speaking and listening.

Karan Goel

6,998,950 views • 19 days ago

Introducing Fish Speech 1.5 🎉 - Making state-of-the-art TTS accessible to everyone! Highlights: - #2 ranked on TTS-Arena (as "Anonymous Sparkle") - 1M hours of multilingual training data - 13 languages supported, including English, Chinese, Japanese & more - <150ms latency with high-quality instant voice cloning - Pretrained model now open source - Cost-effective self-hosting or cloud options Let's check out the details 🧵⬇️

Fish Audio

101,606 views • 1 year ago

This giant free dataset could make helper robots way smarter, way faster: An open-source robotics stack from Berkeley AI researchers featuring the largest teleoperation dataset released to date with over 3,500 hours of bimanual manipulation data across 200 tasks. The video showcases autonomous bimanual robot performance on dexterous tasks including box folding, Lego sorting, AirPod insertion, t-shirt folding, backpack packing, and box unlocking using learned policies. Sim-to-real correlations, training insights like flow loss predicting real-world success, lightweight infrastructure for DAgger interventions Thank you for sharing, Ritvik Singh, and everyone else who contributed to this! Links to the paper plus dataset at under permissive licensing. ——- Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

13,264 views • 15 days ago

Introducing Meta Perception Language Model (PLM): an open & reproducible vision-language model tackling challenging visual tasks. Learn more about how PLM can help the open source community build more capable computer vision systems. Read the research paper, and download the code and dataset:

AI at Meta

94,330 views • 1 year ago

✨✨ Announcing Open Source Dataset from excavators in real construction sites retrofitted by Flywheel (YC S25)! ✨✨ This 100hrs of observation+action data enables training autonomy models for excavators. We were able to train a small task model from 6 hours of dataset on a Kubota U17 on Y Combinator demo day! Link in comments!

Jash Mota

36,601 views • 9 months ago

Pretty WILD - SoTA open source TTS model that beats ElevenLabs/ Sesame - Dia 1.6B - Apache 2.0 licensed! 🔥 > Ultra realistic voice synthesis > Capable of producing non-verbal sounds - coughing, laughing 💥 > Zero shot Voice Cloning > Real-time TTS synthesis > Can run on your MacBook > Trending #2 on Hugging Face Weights on the Hub and code on GitHub! 🤯

Vaibhav (VB) Srivastav

39,165 views • 1 year ago

LegoGPT, an LLM-based system that generates physically stable LEGO structures from text prompts, backed by a new 47,000+ sample dataset and physics-aware filtering during inference. → LegoGPT is trained on a custom dataset, StableText2Lego, which includes 47,000+ 3D LEGO models mapped to text, spanning 28,000+ unique objects. → The model predicts LEGO bricks sequentially like tokens, using next-token prediction in a transformer setup. → To ensure physical stability, LegoGPT integrates physics-aware rollback and validity filtering, pruning out structurally invalid brick placements. → The generated designs are aesthetically aligned with prompts, physically buildable, and tested both with human manual assembly and robotic arms. → The team also introduced a text-driven LEGO coloring/texturing pipeline, enabling more expressive and customized outputs. → The dataset, code, and models are all publicly released under an open-access license.

Rohan Paul

75,260 views • 1 year ago

Today we released Meta Spirit LM, our first open source multimodal language model that freely mixes text and speech. Many existing AI voice experiences today use ASR techniques to process speech before synthesizing with an LLM to generate text, but these approaches compromise the expressive aspects of speech. Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations for both inputs and outputs to generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification. We hope that sharing this work will enable the research community to further new approaches for text and speech integration.

AI at Meta

351,698 views • 1 year ago

Aeneas is now accessible through: 👉A website for researchers 🧑‍💻Open-source code and dataset 📚Syllabus for classrooms 🏛️Upgraded Ithaca ancient Greek model We’re excited to see how more people use this work to uncover the past. Find out more →

Google DeepMind

28,800 views • 11 months ago