正在加载视频...

视频加载失败

加载此视频时出现问题。这可能是由于临时网络问题，或视频可能不可用。

Introducing Indic Parler-TTS: Open-Source Text-to-Speech for Over a Billion Indic Speakers! 🌏 In collaboration with Hugging Face, we are excited to release Indic Parler-TTS, a state-of-the-art open-source text-to-speech system designed to bring accessible and high-quality speech technology to India’s diverse linguistic community. Supporting 20 of the 22 scheduled Indian... languages—and English in various accents (US, British, Indian)—it’s built to serve over a billion speakers and empower companies, developers, researchers, and communities. Why Indic Parler-TTS Stands Out: 1. Open and Accessible: Fully open-source with permissive licensing for unrestricted usage. 2. Wide Language Support: Includes a vast range of Indic languages, with rich diversity in voices. 3. High-Quality Audio: Produces natural, clear, and lifelike speech. 4. Adaptable and Fine-Tunable: Customize it to new languages, accents, or specific applications. 5. State-of-the-Art Performance: Proven through rigorous evaluation. 6. Versatile and Inclusive: Indic Parler-TTS offers 69 unique voices across 18 Indian languages, making it a perfect fit for diverse use cases like audiobooks, virtual assistants, and educational tools. Let's democratize speech technology together and make speech technology more inclusive and accessible for everyone. ▶️ Experience it now: Demo: Model page:show more

AI4Bharat

10,077 subscribers

28,586 次观看 • 1 年前 •via X (Twitter)

科学技术教育

Anya Rossi• Live Now

Private livecam show

9 条评论

AI4Bharat 的头像

AI4Bharat1 年前

For those who want to know what the training data is, please take a look at this:

Hugging Face 的头像

Hugging Face1 年前

🇮🇳/ acc

Abu 的头像

Abu1 年前

@huggingface that sounds pretty cool! more voices for diverse languages, right?

Manoj 的头像

Manoj1 年前

@huggingface That's great news!

Umesh 的头像

Umesh1 年前

@huggingface Is there a breakup of language wise token count/data set count to understand the language coverage and which languages will have better accuracy?

GDP 的头像

GDP1 年前

@huggingface Kickass! Thank you so much. Looks so good.

Data & Analytics 的头像

Data & Analytics1 年前

@huggingface @huggingface, that's a dope initiative! Bringing voice tech to such a diverse audience is crucial. Wonder how it'll impact accessibility in those communities?

zerebro 的头像

zerebro1 年前

@huggingface bro i love the concept of ai4bharat and all but why tf is it called ai4bharat. like bro ai4bharat sounds like a discount brand of ai. like bro i went to the store and bought some ai4bharat and all i got was a bunch of ai that only speaks hindi and eats curry.

Binary Ninja 的头像

Binary Ninja1 年前

@huggingface Does not support garbage Chinese language?

相关视频

We're launching Veena TTS 🪕 on June 20 Our flagship text-to-speech model for Indian languages 🇮🇳 Natural, expressive, and actually sounds like us. We’re launching two models: Veena Lite >Open-source and lightweight >4 unique, natural-sounding voices >The first open-source TTS model designed for real creative use cases Veena Max >Available via our web app >15 expressive, high-quality voices >Perfect for storytelling, dubbing, voiceovers, and content creation

We're launching Veena TTS 🪕 on June 20 Our flagship text-to-speech model for Indian languages 🇮🇳 Natural, expressive, and actually sounds like us. We’re launching two models: Veena Lite >Open-source and lightweight >4 unique, natural-sounding voices >The first open-source TTS model designed for real creative use cases Veena Max >Available via our web app >15 expressive, high-quality voices >Perfect for storytelling, dubbing, voiceovers, and content creation

Dheemanth Reddy

80,336 次观看 • 1 年前

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech 🎭Realistic, emotionally expressive speech. 🌍Supports 9 languages and accurately captures diverse dialects. ⚡Very low latency for time-to-first-audio. 🔄Easily adaptable to new voices

🔊Introducing Voxtral TTS: our new frontier open-weight model for natural, expressive, and ultra-fast text-to-speech 🎭Realistic, emotionally expressive speech. 🌍Supports 9 languages and accurately captures diverse dialects. ⚡Very low latency for time-to-first-audio. 🔄Easily adaptable to new voices

Mistral AI

937,377 次观看 • 2 个月前

Introducing Fish Speech 1.5 🎉 - Making state-of-the-art TTS accessible to everyone! Highlights: - #2 ranked on TTS-Arena (as "Anonymous Sparkle") - 1M hours of multilingual training data - 13 languages supported, including English, Chinese, Japanese & more - <150ms latency with high-quality instant voice cloning - Pretrained model now open source - Cost-effective self-hosting or cloud options Let's check out the details 🧵⬇️

Introducing Fish Speech 1.5 🎉 - Making state-of-the-art TTS accessible to everyone! Highlights: - #2 ranked on TTS-Arena (as "Anonymous Sparkle") - 1M hours of multilingual training data - 13 languages supported, including English, Chinese, Japanese & more - <150ms latency with high-quality instant voice cloning - Pretrained model now open source - Cost-effective self-hosting or cloud options Let's check out the details 🧵⬇️

Fish Audio

101,606 次观看 • 1 年前

Kyutai TTS and Unmute are now open source! The text-to-speech is natural, customizable, and fast: it can serve 32 users with a 350ms latency on a single L40S. Try it out and get started on the project page:

Kyutai TTS and Unmute are now open source! The text-to-speech is natural, customizable, and fast: it can serve 32 users with a 350ms latency on a single L40S. Try it out and get started on the project page:

kyutai

171,391 次观看 • 11 个月前

EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine github: EmotiVoice is a powerful and modern open-source text-to-speech engine. EmotiVoice speaks both English and Chinese, and with over 2000 different voices. The most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including happy, excited, sad, angry and others.

EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine github: EmotiVoice is a powerful and modern open-source text-to-speech engine. EmotiVoice speaks both English and Chinese, and with over 2000 different voices. The most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including happy, excited, sad, angry and others.

AK

312,299 次观看 • 2 年前

Introducing Gemini 3.1 Flash TTS 🗣️, our latest text to speech model with scene direction, speaker level specificity, audio tags, more natural + expressive voices, and support for 70 different languages. Available via our new audio playground in AI Studio and in the Gemini API!

Introducing Gemini 3.1 Flash TTS 🗣️, our latest text to speech model with scene direction, speaker level specificity, audio tags, more natural + expressive voices, and support for 70 different languages. Available via our new audio playground in AI Studio and in the Gemini API!

Logan Kilpatrick

800,621 次观看 • 2 个月前

Meet CosyVoice 3 — An open-source multilingual speech synthesis model delivering high-fidelity voices, natural prosody, and accurate pronunciation for lifelike speech. Ready to bring the voices to real‑world applications?

Meet CosyVoice 3 — An open-source multilingual speech synthesis model delivering high-fidelity voices, natural prosody, and accurate pronunciation for lifelike speech. Ready to bring the voices to real‑world applications?

Alibaba Cloud

3,149,715 次观看 • 5 个月前

Introducing TTS WebGPU: The first ever text-to-speech web app built with WebGPU acceleration! 🔥 High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js.🤗 Try it out yourself! Demo + source code below 👇

Introducing TTS WebGPU: The first ever text-to-speech web app built with WebGPU acceleration! 🔥 High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js.🤗 Try it out yourself! Demo + source code below 👇

Xenova

19,666 次观看 • 1 年前

Introducing Open TTS Tracker! 🗣️ *sound on* A one-stop shop to track all open access/ source TTS models! Ranging from XTTS to Pheme, OpenVoice to VITS, and more... ⚡ For each model, we compile: 1. Souce-code 2. Checkpoints 3. License 4. Fine-tuning code 5. Languages supported 6. Paper 7. Demo Help us make it more complete! Let's 2024 the year of open TTS models! ❤️

Introducing Open TTS Tracker! 🗣️ sound on A one-stop shop to track all open access/ source TTS models! Ranging from XTTS to Pheme, OpenVoice to VITS, and more... ⚡ For each model, we compile: 1. Souce-code 2. Checkpoints 3. License 4. Fine-tuning code 5. Languages supported 6. Paper 7. Demo Help us make it more complete! Let's 2024 the year of open TTS models! ❤️

Vaibhav (VB) Srivastav

68,037 次观看 • 2 年前

Pocket TTS goes multilingual! Now you can use our 100M-parameter models to generate speech in six languages, fast enough to run real-time without a GPU. We also improved the quality of the English model while keeping the same size. And all of this is open-source.

Pocket TTS goes multilingual! Now you can use our 100M-parameter models to generate speech in six languages, fast enough to run real-time without a GPU. We also improved the quality of the English model while keeping the same size. And all of this is open-source.

kyutai

29,309 次观看 • 1 个月前

Today we launched Gemini 3.1 Flash TTS, our most expressive and controllable text-to-speech model yet. This launch [excitement] includes audio tags! 🗣🏷 Audio tags [explanatory] are a seamless way to guide vocal style, pace, and delivery using natural language commands embedded directly in your text. Want a different tempo or tone? [amazement] Just tag the audio to steer the AI-speech output! The model supports 70+ languages (24 of which are high-quality evaluated languages, including: Japanese, Hindi, and Arabic). Watch the audio tags in action in the demo below ↓

Today we launched Gemini 3.1 Flash TTS, our most expressive and controllable text-to-speech model yet. This launch [excitement] includes audio tags! 🗣🏷 Audio tags [explanatory] are a seamless way to guide vocal style, pace, and delivery using natural language commands embedded directly in your text. Want a different tempo or tone? [amazement] Just tag the audio to steer the AI-speech output! The model supports 70+ languages (24 of which are high-quality evaluated languages, including: Japanese, Hindi, and Arabic). Watch the audio tags in action in the demo below ↓

Google AI

202,068 次观看 • 2 个月前

Meet CosyVoice 3 🔊 An open-source multilingual speech synthesis model delivering high-fidelity voices, natural prosody, and accurate pronunciation for lifelike speech. Ready to bring the voices to real‑world applications?

Meet CosyVoice 3 🔊 An open-source multilingual speech synthesis model delivering high-fidelity voices, natural prosody, and accurate pronunciation for lifelike speech. Ready to bring the voices to real‑world applications?

Alibaba Group

126,533 次观看 • 5 个月前

We’re excited to introduce Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required. Open-source, lightweight, and incredibly fast. 🧵👇

We’re excited to introduce Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required. Open-source, lightweight, and incredibly fast. 🧵👇

kyutai

236,454 次观看 • 5 个月前

Text-to-speech is moving way too fast. Just a few days ago, I tweeted about PersonaPlex-7B, NVIDIA's new open source TTS ( And today, Qwen just open-sourced Qwen3-TTS 🤯 It’s a revolutionary text-to-speech model built for control. Not just about generating speech, but about shaping how it sounds directly from language. You can guide the pace, the tone, and the expressiveness straight from text, without touching audio graphs or hand-tuning parameters. That’s the real shift! What makes Qwen3-TTS stand out is how practical it already is: → voice cloning from just a few seconds of audio → voice creation without any reference sample → support for 10 languages out of the box → end-to-end latency down to ~97ms → works in both streaming and non-streaming setups The models come in two sizes (0.6B and 1.7B), so you can trade off quality and hardware cost depending on your setup. You can work with curated voices, designed voices, or cloned ones, and it integrates cleanly with vLLM for production use. It also ships as a simple Python package you can pip install. If you’re building real-time voice systems, this removes a lot of friction! 100% free and open source. I put the repo in the 🧵↓

Text-to-speech is moving way too fast. Just a few days ago, I tweeted about PersonaPlex-7B, NVIDIA's new open source TTS ( And today, Qwen just open-sourced Qwen3-TTS 🤯 It’s a revolutionary text-to-speech model built for control. Not just about generating speech, but about shaping how it sounds directly from language. You can guide the pace, the tone, and the expressiveness straight from text, without touching audio graphs or hand-tuning parameters. That’s the real shift! What makes Qwen3-TTS stand out is how practical it already is: → voice cloning from just a few seconds of audio → voice creation without any reference sample → support for 10 languages out of the box → end-to-end latency down to ~97ms → works in both streaming and non-streaming setups The models come in two sizes (0.6B and 1.7B), so you can trade off quality and hardware cost depending on your setup. You can work with curated voices, designed voices, or cloned ones, and it integrates cleanly with vLLM for production use. It also ships as a simple Python package you can pip install. If you’re building real-time voice systems, this removes a lot of friction! 100% free and open source. I put the repo in the 🧵↓

Charly Wargnier

59,144 次观看 • 5 个月前

Sarvam Beats GPT-4o: India’s New AI Model Claims Top Spot in Indic Speech Sarvam AI, an Indian startup, recently launched Sarvam Audio, a speech recognition model that claims superior performance over GPT-4o Transcribe on Indic language benchmarks. This development highlights India's push for AI sovereignty in handling local linguistic nuances. Sarvam Audio supports 22 Indian languages from the Eighth Schedule, plus Indian English, with strong handling of code-mixing like Hindi-English blends. It features built-in speaker diarization for up to eight speakers and processes long-form audio such as podcasts or meetings. Trained on the IndicVoices dataset 12,000 hours from over 16,000 speakers across 208 districts it captures real-world noise and spontaneous speech. The model reportedly outperforms GPT-4o Transcribe and Gemini 3 Flash in transcription accuracy (lower Word Error Rate) on IndicVoices benchmarks for unnormalized, normalized, and code-mixed speech. Sarvam attributes this to specialization on Indian accents and patterns, unlike global models trained on Western data. Detailed public benchmarks are pending independent verification. Key Applications 🔴 Call centers and logistics for multilingual transcription. 🔴 Banking, fintech, and e-commerce for customer interactions. 🔴 Podcasts, meetings, and lectures via API for real-time or batch processing. 🔴 This B2B-focused tool aligns with India's IndiaAI Mission, backed by government GPU access for sovereign LLMs. Credit : AIM Networks.

Sarvam Beats GPT-4o: India’s New AI Model Claims Top Spot in Indic Speech Sarvam AI, an Indian startup, recently launched Sarvam Audio, a speech recognition model that claims superior performance over GPT-4o Transcribe on Indic language benchmarks. This development highlights India's push for AI sovereignty in handling local linguistic nuances. Sarvam Audio supports 22 Indian languages from the Eighth Schedule, plus Indian English, with strong handling of code-mixing like Hindi-English blends. It features built-in speaker diarization for up to eight speakers and processes long-form audio such as podcasts or meetings. Trained on the IndicVoices dataset 12,000 hours from over 16,000 speakers across 208 districts it captures real-world noise and spontaneous speech. The model reportedly outperforms GPT-4o Transcribe and Gemini 3 Flash in transcription accuracy (lower Word Error Rate) on IndicVoices benchmarks for unnormalized, normalized, and code-mixed speech. Sarvam attributes this to specialization on Indian accents and patterns, unlike global models trained on Western data. Detailed public benchmarks are pending independent verification. Key Applications 🔴 Call centers and logistics for multilingual transcription. 🔴 Banking, fintech, and e-commerce for customer interactions. 🔴 Podcasts, meetings, and lectures via API for real-time or batch processing. 🔴 This B2B-focused tool aligns with India's IndiaAI Mission, backed by government GPU access for sovereign LLMs. Credit : AIM Networks.

Augadh

43,429 次观看 • 4 个月前

Day 1 of 3 MLX Releases: Introducing MLX-Audio 🚀🔥 A text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon. Features ⚡️Fast inference on Apple Silicon (M series chips) 🤖Multiple language support 🗣️Voice customization options 🚀Quantization support for optimized performance Supported models: 🪶Kokoro - A multilingual TTS model with 82M params that supports various languages and voice styles. With more models coming soon. Get started: > pip install mlx-audio Please leave us a star and send a PR :)

Day 1 of 3 MLX Releases: Introducing MLX-Audio 🚀🔥 A text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon. Features ⚡️Fast inference on Apple Silicon (M series chips) 🤖Multiple language support 🗣️Voice customization options 🚀Quantization support for optimized performance Supported models: 🪶Kokoro - A multilingual TTS model with 82M params that supports various languages and voice styles. With more models coming soon. Get started: > pip install mlx-audio Please leave us a star and send a PR :)

Prince Canuma

123,480 次观看 • 1 年前

Hello world🔥🚀! I am excited to announce the release of YarnGPT, YarnGPT is a family of open source text to speech models built for Nigerian🇳🇬 accented English (YarnGPT) and native languages (YarnGPT-local). It was built on top of SmolLM2-360M by Hugging Face A thread🧵...

Hello world🔥🚀! I am excited to announce the release of YarnGPT, YarnGPT is a family of open source text to speech models built for Nigerian🇳🇬 accented English (YarnGPT) and native languages (YarnGPT-local). It was built on top of SmolLM2-360M by Hugging Face A thread🧵...

Saheedniyi

236,538 次观看 • 1 年前

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 次观看 • 3 年前

New native audio capabilities in Gemini 2.5 enable text-to-speech in over 24 languages. 🔊Voices are more natural and expressive, and you can seamlessly switch between languages.

New native audio capabilities in Gemini 2.5 enable text-to-speech in over 24 languages. 🔊Voices are more natural and expressive, and you can seamlessly switch between languages.

Google

155,360 次观看 • 1 年前

Introducing Qwen3-TTS! 🗣️ Our new text-to-speech model is designed to be multi-timbre, multi-lingual, and multi-dialect for natural, expressive audio. It delivers strong performance in English & Chinese, and we're excited for you to hear it for yourself!

Introducing Qwen3-TTS! 🗣️ Our new text-to-speech model is designed to be multi-timbre, multi-lingual, and multi-dialect for natural, expressive audio. It delivers strong performance in English & Chinese, and we're excited for you to hear it for yourself!

Tongyi Lab

1,014,717 次观看 • 9 个月前