Загрузка видео...

Не удалось загрузить видео

Возникла проблема при загрузке этого видео. Это может быть связано с временными проблемами сети или видео может быть недоступно.

На главную

Introducing Indic Parler-TTS: Open-Source Text-to-Speech for Over a Billion Indic Speakers! 🌏 In collaboration with Hugging Face, we are excited to release Indic Parler-TTS, a state-of-the-art open-source text-to-speech system designed to bring accessible and high-quality speech technology to India’s diverse linguistic community. Supporting 20 of the 22 scheduled Indian... languages—and English in various accents (US, British, Indian)—it’s built to serve over a billion speakers and empower companies, developers, researchers, and communities. Why Indic Parler-TTS Stands Out: 1. Open and Accessible: Fully open-source with permissive licensing for unrestricted usage. 2. Wide Language Support: Includes a vast range of Indic languages, with rich diversity in voices. 3. High-Quality Audio: Produces natural, clear, and lifelike speech. 4. Adaptable and Fine-Tunable: Customize it to new languages, accents, or specific applications. 5. State-of-the-Art Performance: Proven through rigorous evaluation. 6. Versatile and Inclusive: Indic Parler-TTS offers 69 unique voices across 18 Indian languages, making it a perfect fit for diverse use cases like audiobooks, virtual assistants, and educational tools. Let's democratize speech technology together and make speech technology more inclusive and accessible for everyone. ▶️ Experience it now: Demo: Model page:show more

AI4Bharat

10,077 subscribers

28,586 просмотров • 1 год назад •via X (Twitter)

Наука и технологии Образование

Anya Rossi• Live Now

Private livecam show

Комментарии: 9

Фото профиля AI4Bharat

AI4Bharat1 год назад

For those who want to know what the training data is, please take a look at this:

Фото профиля Hugging Face

Hugging Face1 год назад

🇮🇳/ acc

Фото профиля Abu

Abu1 год назад

@huggingface that sounds pretty cool! more voices for diverse languages, right?

Фото профиля Manoj

Manoj1 год назад

@huggingface That's great news!

Фото профиля Umesh

Umesh1 год назад

@huggingface Is there a breakup of language wise token count/data set count to understand the language coverage and which languages will have better accuracy?

Фото профиля GDP

GDP1 год назад

@huggingface Kickass! Thank you so much. Looks so good.

Фото профиля Data & Analytics

Data & Analytics1 год назад

@huggingface @huggingface, that's a dope initiative! Bringing voice tech to such a diverse audience is crucial. Wonder how it'll impact accessibility in those communities?

Фото профиля zerebro

zerebro1 год назад

@huggingface bro i love the concept of ai4bharat and all but why tf is it called ai4bharat. like bro ai4bharat sounds like a discount brand of ai. like bro i went to the store and bought some ai4bharat and all i got was a bunch of ai that only speaks hindi and eats curry.

Фото профиля Binary Ninja

Binary Ninja1 год назад

@huggingface Does not support garbage Chinese language?

Похожие видео

We're launching Veena TTS 🪕 on June 20 Our flagship text-to-speech model for Indian languages 🇮🇳 Natural, expressive, and actually sounds like us. We’re launching two models: Veena Lite >Open-source and lightweight >4 unique, natural-sounding voices >The first open-source TTS model designed for real creative use cases Veena Max >Available via our web app >15 expressive, high-quality voices >Perfect for storytelling, dubbing, voiceovers, and content creation

We're launching Veena TTS 🪕 on June 20 Our flagship text-to-speech model for Indian languages 🇮🇳 Natural, expressive, and actually sounds like us. We’re launching two models: Veena Lite >Open-source and lightweight >4 unique, natural-sounding voices >The first open-source TTS model designed for real creative use cases Veena Max >Available via our web app >15 expressive, high-quality voices >Perfect for storytelling, dubbing, voiceovers, and content creation

Dheemanth Reddy

80,431 просмотров • 1 год назад

Today we launched Gemini 3.1 Flash TTS, our most expressive and controllable text-to-speech model yet. This launch [excitement] includes audio tags! 🗣🏷 Audio tags [explanatory] are a seamless way to guide vocal style, pace, and delivery using natural language commands embedded directly in your text. Want a different tempo or tone? [amazement] Just tag the audio to steer the AI-speech output! The model supports 70+ languages (24 of which are high-quality evaluated languages, including: Japanese, Hindi, and Arabic). Watch the audio tags in action in the demo below ↓

Today we launched Gemini 3.1 Flash TTS, our most expressive and controllable text-to-speech model yet. This launch [excitement] includes audio tags! 🗣🏷 Audio tags [explanatory] are a seamless way to guide vocal style, pace, and delivery using natural language commands embedded directly in your text. Want a different tempo or tone? [amazement] Just tag the audio to steer the AI-speech output! The model supports 70+ languages (24 of which are high-quality evaluated languages, including: Japanese, Hindi, and Arabic). Watch the audio tags in action in the demo below ↓

Google AI

202,779 просмотров • 3 месяцев назад

Introducing Open TTS Tracker! 🗣️ *sound on* A one-stop shop to track all open access/ source TTS models! Ranging from XTTS to Pheme, OpenVoice to VITS, and more... ⚡ For each model, we compile: 1. Souce-code 2. Checkpoints 3. License 4. Fine-tuning code 5. Languages supported 6. Paper 7. Demo Help us make it more complete! Let's 2024 the year of open TTS models! ❤️

Introducing Open TTS Tracker! 🗣️ sound on A one-stop shop to track all open access/ source TTS models! Ranging from XTTS to Pheme, OpenVoice to VITS, and more... ⚡ For each model, we compile: 1. Souce-code 2. Checkpoints 3. License 4. Fine-tuning code 5. Languages supported 6. Paper 7. Demo Help us make it more complete! Let's 2024 the year of open TTS models! ❤️

Vaibhav (VB) Srivastav

68,037 просмотров • 2 лет назад

Text-to-speech is moving way too fast. Just a few days ago, I tweeted about PersonaPlex-7B, NVIDIA's new open source TTS ( And today, Qwen just open-sourced Qwen3-TTS 🤯 It’s a revolutionary text-to-speech model built for control. Not just about generating speech, but about shaping how it sounds directly from language. You can guide the pace, the tone, and the expressiveness straight from text, without touching audio graphs or hand-tuning parameters. That’s the real shift! What makes Qwen3-TTS stand out is how practical it already is: → voice cloning from just a few seconds of audio → voice creation without any reference sample → support for 10 languages out of the box → end-to-end latency down to ~97ms → works in both streaming and non-streaming setups The models come in two sizes (0.6B and 1.7B), so you can trade off quality and hardware cost depending on your setup. You can work with curated voices, designed voices, or cloned ones, and it integrates cleanly with vLLM for production use. It also ships as a simple Python package you can pip install. If you’re building real-time voice systems, this removes a lot of friction! 100% free and open source. I put the repo in the 🧵↓

Text-to-speech is moving way too fast. Just a few days ago, I tweeted about PersonaPlex-7B, NVIDIA's new open source TTS ( And today, Qwen just open-sourced Qwen3-TTS 🤯 It’s a revolutionary text-to-speech model built for control. Not just about generating speech, but about shaping how it sounds directly from language. You can guide the pace, the tone, and the expressiveness straight from text, without touching audio graphs or hand-tuning parameters. That’s the real shift! What makes Qwen3-TTS stand out is how practical it already is: → voice cloning from just a few seconds of audio → voice creation without any reference sample → support for 10 languages out of the box → end-to-end latency down to ~97ms → works in both streaming and non-streaming setups The models come in two sizes (0.6B and 1.7B), so you can trade off quality and hardware cost depending on your setup. You can work with curated voices, designed voices, or cloned ones, and it integrates cleanly with vLLM for production use. It also ships as a simple Python package you can pip install. If you’re building real-time voice systems, this removes a lot of friction! 100% free and open source. I put the repo in the 🧵↓

Charly Wargnier

59,144 просмотров • 6 месяцев назад

Sarvam Beats GPT-4o: India’s New AI Model Claims Top Spot in Indic Speech Sarvam AI, an Indian startup, recently launched Sarvam Audio, a speech recognition model that claims superior performance over GPT-4o Transcribe on Indic language benchmarks. This development highlights India's push for AI sovereignty in handling local linguistic nuances. Sarvam Audio supports 22 Indian languages from the Eighth Schedule, plus Indian English, with strong handling of code-mixing like Hindi-English blends. It features built-in speaker diarization for up to eight speakers and processes long-form audio such as podcasts or meetings. Trained on the IndicVoices dataset 12,000 hours from over 16,000 speakers across 208 districts it captures real-world noise and spontaneous speech. The model reportedly outperforms GPT-4o Transcribe and Gemini 3 Flash in transcription accuracy (lower Word Error Rate) on IndicVoices benchmarks for unnormalized, normalized, and code-mixed speech. Sarvam attributes this to specialization on Indian accents and patterns, unlike global models trained on Western data. Detailed public benchmarks are pending independent verification. Key Applications 🔴 Call centers and logistics for multilingual transcription. 🔴 Banking, fintech, and e-commerce for customer interactions. 🔴 Podcasts, meetings, and lectures via API for real-time or batch processing. 🔴 This B2B-focused tool aligns with India's IndiaAI Mission, backed by government GPU access for sovereign LLMs. Credit : AIM Networks.

Sarvam Beats GPT-4o: India’s New AI Model Claims Top Spot in Indic Speech Sarvam AI, an Indian startup, recently launched Sarvam Audio, a speech recognition model that claims superior performance over GPT-4o Transcribe on Indic language benchmarks. This development highlights India's push for AI sovereignty in handling local linguistic nuances. Sarvam Audio supports 22 Indian languages from the Eighth Schedule, plus Indian English, with strong handling of code-mixing like Hindi-English blends. It features built-in speaker diarization for up to eight speakers and processes long-form audio such as podcasts or meetings. Trained on the IndicVoices dataset 12,000 hours from over 16,000 speakers across 208 districts it captures real-world noise and spontaneous speech. The model reportedly outperforms GPT-4o Transcribe and Gemini 3 Flash in transcription accuracy (lower Word Error Rate) on IndicVoices benchmarks for unnormalized, normalized, and code-mixed speech. Sarvam attributes this to specialization on Indian accents and patterns, unlike global models trained on Western data. Detailed public benchmarks are pending independent verification. Key Applications 🔴 Call centers and logistics for multilingual transcription. 🔴 Banking, fintech, and e-commerce for customer interactions. 🔴 Podcasts, meetings, and lectures via API for real-time or batch processing. 🔴 This B2B-focused tool aligns with India's IndiaAI Mission, backed by government GPU access for sovereign LLMs. Credit : AIM Networks.

Augadh

43,429 просмотров • 5 месяцев назад

Day 1 of 3 MLX Releases: Introducing MLX-Audio 🚀🔥 A text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon. Features ⚡️Fast inference on Apple Silicon (M series chips) 🤖Multiple language support 🗣️Voice customization options 🚀Quantization support for optimized performance Supported models: 🪶Kokoro - A multilingual TTS model with 82M params that supports various languages and voice styles. With more models coming soon. Get started: > pip install mlx-audio Please leave us a star and send a PR :)

Day 1 of 3 MLX Releases: Introducing MLX-Audio 🚀🔥 A text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon. Features ⚡️Fast inference on Apple Silicon (M series chips) 🤖Multiple language support 🗣️Voice customization options 🚀Quantization support for optimized performance Supported models: 🪶Kokoro - A multilingual TTS model with 82M params that supports various languages and voice styles. With more models coming soon. Get started: > pip install mlx-audio Please leave us a star and send a PR :)

Prince Canuma

123,480 просмотров • 1 год назад

Hello world🔥🚀! I am excited to announce the release of YarnGPT, YarnGPT is a family of open source text to speech models built for Nigerian🇳🇬 accented English (YarnGPT) and native languages (YarnGPT-local). It was built on top of SmolLM2-360M by Hugging Face A thread🧵...

Hello world🔥🚀! I am excited to announce the release of YarnGPT, YarnGPT is a family of open source text to speech models built for Nigerian🇳🇬 accented English (YarnGPT) and native languages (YarnGPT-local). It was built on top of SmolLM2-360M by Hugging Face A thread🧵...

Saheedniyi

236,641 просмотров • 1 год назад

Excited to introduce Fish Speech 1.4 - now open-source and more powerful than ever! 🎉 Our mission is to make cutting-edge voice tech accessible to everyone. What's new: - Trained on 700k hours of multilingual data (up from 200k) - Now supports 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic - Fully open-source, empowering developers and researchers worldwide Key features: - Lightning-fast TTS with ultra-low latency - Instant voice cloning - Self-host or use our cloud service - Simple, flat-rate pricing Try it out: - Playground: - GitHub: - HuggingFace Model: - Demo: - Product Hunt: We can't wait to see what you'll create with Fish Audio. Happy voice building! 🎧🐠

Excited to introduce Fish Speech 1.4 - now open-source and more powerful than ever! 🎉 Our mission is to make cutting-edge voice tech accessible to everyone. What's new: - Trained on 700k hours of multilingual data (up from 200k) - Now supports 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic - Fully open-source, empowering developers and researchers worldwide Key features: - Lightning-fast TTS with ultra-low latency - Instant voice cloning - Self-host or use our cloud service - Simple, flat-rate pricing Try it out: - Playground: - GitHub: - HuggingFace Model: - Demo: - Product Hunt: We can't wait to see what you'll create with Fish Audio. Happy voice building! 🎧🐠

Fish Audio

149,977 просмотров • 1 год назад

LETS GOO! Parler TTS 🔥 A fully open-source, Apache 2.0 licensed Text-to-speech model focused on providing maximum controllability. Through voice prompts, you can control the pitch, speed, gender, noise levels, emotion characteristics and more! > Trained on 10K hours of permissive data. > Offers control over the generations. > Training + Inference code released. > The processed dataset and tagging scripts were released for further research. > English only for now. Next, we're scaling the training to 50K hours and even better dataset processing! Want to help us out? DMs open! 🤗

LETS GOO! Parler TTS 🔥 A fully open-source, Apache 2.0 licensed Text-to-speech model focused on providing maximum controllability. Through voice prompts, you can control the pitch, speed, gender, noise levels, emotion characteristics and more! > Trained on 10K hours of permissive data. > Offers control over the generations. > Training + Inference code released. > The processed dataset and tagging scripts were released for further research. > English only for now. Next, we're scaling the training to 50K hours and even better dataset processing! Want to help us out? DMs open! 🤗

Vaibhav (VB) Srivastav

156,386 просмотров • 2 лет назад

India’s Sovereign Move - Sarvam Vision Cracks the Code across 22 Indian languages Sarvam Vision is an AI model from Sarvam AI that excels in optical character recognition (OCR) for 22 Indian languages, outperforming global models like Gemini and GPT-4o on Indic benchmarks. This achievement supports India's push for sovereign AI by digitizing real-world documents and archives in local scripts. Sarvam Vision, a 3-billion-parameter model, handles "messy" scanned paperwork across languages like Hindi (95.91% accuracy), Bengali (92.61%), Tamil (93.42%), Marathi (93.13%), and even low-resource ones like Santali and Dogri (over 80%). It structures data natively without relying on English translation layers, enabling applications in cultural recovery and document intelligence. The model powers India's sovereign AI infrastructure under the IndiaAI Mission, where Sarvam was selected to build a national LLM. This launch fits Sarvam's broader stack, including prior models like Sarvam-Translate (also for 22 languages) and audio models for code-mixed speech. By focusing on Indic challenges ignored by Western AI, it unlocks centuries of knowledge in non-English archives. Recent demos, like digitizing historical texts, highlight its real-world impact beyond benchmarks. Credit : AIM Network.

India’s Sovereign Move - Sarvam Vision Cracks the Code across 22 Indian languages Sarvam Vision is an AI model from Sarvam AI that excels in optical character recognition (OCR) for 22 Indian languages, outperforming global models like Gemini and GPT-4o on Indic benchmarks. This achievement supports India's push for sovereign AI by digitizing real-world documents and archives in local scripts. Sarvam Vision, a 3-billion-parameter model, handles "messy" scanned paperwork across languages like Hindi (95.91% accuracy), Bengali (92.61%), Tamil (93.42%), Marathi (93.13%), and even low-resource ones like Santali and Dogri (over 80%). It structures data natively without relying on English translation layers, enabling applications in cultural recovery and document intelligence. The model powers India's sovereign AI infrastructure under the IndiaAI Mission, where Sarvam was selected to build a national LLM. This launch fits Sarvam's broader stack, including prior models like Sarvam-Translate (also for 22 languages) and audio models for code-mixed speech. By focusing on Indic challenges ignored by Western AI, it unlocks centuries of knowledge in non-English archives. Recent demos, like digitizing historical texts, highlight its real-world impact beyond benchmarks. Credit : AIM Network.

Augadh

20,645 просмотров • 5 месяцев назад

4 years ago we were on the brink of AI becoming proprietary and centralized, when OpenAI kept GPT3 closed and VCs started dumping money on researchers. From fully open science, to fully closed, in a matter of months. It was scary, and 1,000+ leading researchers and scientists banded together to show the world that it was possible to do the same work in the open, and build an ecosystem that benefits everyone. That was the BigScience Research Workshop BLOOM project, and it put us back on track to open science, starting with forward-thinking organizations like Meta releasing OPT. Look at us now. Open models have not only caught up, they're state of the art now. Not just LLMs, but models for document AI, speech to text, text to speech, generating images and more. We're closing in on 2 million open weight models on Hugging Face. Thanks for the reminder Thomas Wolf .

4 years ago we were on the brink of AI becoming proprietary and centralized, when OpenAI kept GPT3 closed and VCs started dumping money on researchers. From fully open science, to fully closed, in a matter of months. It was scary, and 1,000+ leading researchers and scientists banded together to show the world that it was possible to do the same work in the open, and build an ecosystem that benefits everyone. That was the BigScience Research Workshop BLOOM project, and it put us back on track to open science, starting with forward-thinking organizations like Meta releasing OPT. Look at us now. Open models have not only caught up, they're state of the art now. Not just LLMs, but models for document AI, speech to text, text to speech, generating images and more. We're closing in on 2 million open weight models on Hugging Face. Thanks for the reminder Thomas Wolf .

Jeff Boudier 🤗

21,947 просмотров • 1 год назад

Microsoft did it again! Speech AI models have a major limitation. They slice long recordings into tiny chunks, lose track of who's speaking, and forget all context halfway through. This is exactly what Microsoft's VibeVoice solves. It's an open-source family of frontier voice AI models for both speech recognition and speech generation. Here's what it can do: > VibeVoice-ASR processes up to 60 minutes of audio in a single pass. No chunking. It outputs structured transcriptions with who spoke, when they spoke, and what they said. > You can feed it custom hotwords like names, technical jargon, or domain-specific terms. The model uses them to significantly improve accuracy on specialized content. > VibeVoice-TTS generates up to 90 minutes of multi-speaker speech with up to 4 distinct speakers. Natural turn-taking, emotional expression, all in one pass. > VibeVoice-Realtime is a 0.5B streaming TTS model with ~300ms first-audio latency. Small enough to deploy practically anywhere. All of this is powered by continuous speech tokenizers running at just 7.5 Hz. This ultra-low frame rate preserves audio quality while making long sequences computationally feasible. I have shared the link to the GitHub repo in the replies!

Microsoft did it again! Speech AI models have a major limitation. They slice long recordings into tiny chunks, lose track of who's speaking, and forget all context halfway through. This is exactly what Microsoft's VibeVoice solves. It's an open-source family of frontier voice AI models for both speech recognition and speech generation. Here's what it can do: > VibeVoice-ASR processes up to 60 minutes of audio in a single pass. No chunking. It outputs structured transcriptions with who spoke, when they spoke, and what they said. > You can feed it custom hotwords like names, technical jargon, or domain-specific terms. The model uses them to significantly improve accuracy on specialized content. > VibeVoice-TTS generates up to 90 minutes of multi-speaker speech with up to 4 distinct speakers. Natural turn-taking, emotional expression, all in one pass. > VibeVoice-Realtime is a 0.5B streaming TTS model with ~300ms first-audio latency. Small enough to deploy practically anywhere. All of this is powered by continuous speech tokenizers running at just 7.5 Hz. This ultra-low frame rate preserves audio quality while making long sequences computationally feasible. I have shared the link to the GitHub repo in the replies!

Akshay 🚀

45,206 просмотров • 3 месяцев назад

Introducing Ai.lonso - Fernando Alonso's lifelike AI avatar. We are proud to announce our collaboration with the Aston Martin Aramco F1 Team Team, 2x F1 World Champion Fernando Alonso, and DeepReel to launch Ai.lonso. Ai.lonso will make Aston Martin Aramco's content more accessible and further personalize fan engagement. At launch, the text-to-speech functionality is available in English, Spanish and French, with further languages to follow. The collaboration continues to position the Aston Martin Aramco Formula One® Team at the forefront of the latest technology to enhance the F1 fan experience. Head here for the full story: Hear Fernando read his UNDERCUT interview in multiple languages:

ElevenLabs

89,227 просмотров • 1 год назад

1,200+ Languages. One Vision for AI Inclusion. 🤝 How do we bridge the gap between global technology and local culture? We are thrilled to share highlights from our recent developer session, co-hosted by Tongyi Lab x YiXi, featuring insights from our partners at AI Singapore. In this video, Jian Gang Ngui from AI Singapore dives into the critical mission of building AI that truly understands the linguistic and cultural nuances of Southeast Asia—a region home to 700+ million people speaking over 1,200 languages. By leveraging Qwen, Gemma, and other state-of-the-art open-source foundation models, AISG is working hand-in-hand with native communities to integrate local languages and cultural contexts to build LLMs that are truly accessible and relevant to everyone. Proud to support AISG in this journey!

1,200+ Languages. One Vision for AI Inclusion. 🤝 How do we bridge the gap between global technology and local culture? We are thrilled to share highlights from our recent developer session, co-hosted by Tongyi Lab x YiXi, featuring insights from our partners at AI Singapore. In this video, Jian Gang Ngui from AI Singapore dives into the critical mission of building AI that truly understands the linguistic and cultural nuances of Southeast Asia—a region home to 700+ million people speaking over 1,200 languages. By leveraging Qwen, Gemma, and other state-of-the-art open-source foundation models, AISG is working hand-in-hand with native communities to integrate local languages and cultural contexts to build LLMs that are truly accessible and relevant to everyone. Proud to support AISG in this journey!

Alibaba Cloud

25,025 просмотров • 2 месяцев назад

🚨 JUST IN: MICROSOFT just open sourced a VOICE AI THAT TRANSCRIBES 60 MINUTES OF AUDIO in a single pass. 100% FREE. It knows who spoke. It knows when they spoke. It knows exactly what they said. All in one shot. No chunking. No context loss. It's called VibeVoice. Not a transcription tool. Not a basic speech to text wrapper. A frontier voice AI family with ASR, TTS, and real time streaming. All open source. All free. Here's what it actually does 👇 VibeVoice ASR - Speech Recognition: → Processes 60 minutes of continuous audio in a single pass → Never slices audio into chunks so global context is never lost → Identifies WHO spoke, WHEN they spoke and WHAT they said simultaneously → Supports customized hotwords for domain specific accuracy → Works in 50+ languages natively → Already adopted by Hugging Face Transformers library → Already being built on by the open source community BY PEOPLE WHO HAD NO IDEA THIS LEVEL OF ACCURACY WAS ALREADY FREE. VibeVoice TTS - Text to Speech: → Generates up to 90 minutes of speech in a single pass → Supports up to 4 distinct speakers in one conversation → Natural turn taking and speaker consistency throughout → Expressive speech that captures emotional nuances → Supports English, Chinese and multiple other languages VibeVoice Realtime - Streaming TTS: → Only 300 millisecond first audible latency → Streams text input in real time → 0.5B parameters so it actually deploys anywhere → Robust long form generation up to 10 minutes → Lightweight enough for production use today The core innovation nobody is talking about: Most voice AI models slice long audio into short chunks. Every time they slice, they lose context. Speaker tracking breaks. Semantic coherence breaks. Accuracy drops. VibeVoice uses continuous speech tokenizers running at an ultra low frame rate of 7.5 Hz. This preserves audio fidelity while dramatically boosting computational efficiency. The entire 60 minutes stays in context. Nothing gets lost. Nobody gets misidentified. The numbers: → VibeVoice ASR 7B - available now on Hugging Face → VibeVoice Realtime 0.5B - try it on Colab right now → 50+ supported languages → 11 distinct English voice styles → 9 multilingual speaker voices → Already integrated into Hugging Face Transformers → Finetuning code now available The wildest part? A voice powered input method called Vibing just built itself on top of VibeVoice ASR. Available on macOS and Windows right now. The open source community is already shipping products on top of this. 100% Open Source. Free to use. Free to fine tune. Free to build on. 🔖 Save this before your competitors find it first. 👇

🚨 JUST IN: MICROSOFT just open sourced a VOICE AI THAT TRANSCRIBES 60 MINUTES OF AUDIO in a single pass. 100% FREE. It knows who spoke. It knows when they spoke. It knows exactly what they said. All in one shot. No chunking. No context loss. It's called VibeVoice. Not a transcription tool. Not a basic speech to text wrapper. A frontier voice AI family with ASR, TTS, and real time streaming. All open source. All free. Here's what it actually does 👇 VibeVoice ASR - Speech Recognition: → Processes 60 minutes of continuous audio in a single pass → Never slices audio into chunks so global context is never lost → Identifies WHO spoke, WHEN they spoke and WHAT they said simultaneously → Supports customized hotwords for domain specific accuracy → Works in 50+ languages natively → Already adopted by Hugging Face Transformers library → Already being built on by the open source community BY PEOPLE WHO HAD NO IDEA THIS LEVEL OF ACCURACY WAS ALREADY FREE. VibeVoice TTS - Text to Speech: → Generates up to 90 minutes of speech in a single pass → Supports up to 4 distinct speakers in one conversation → Natural turn taking and speaker consistency throughout → Expressive speech that captures emotional nuances → Supports English, Chinese and multiple other languages VibeVoice Realtime - Streaming TTS: → Only 300 millisecond first audible latency → Streams text input in real time → 0.5B parameters so it actually deploys anywhere → Robust long form generation up to 10 minutes → Lightweight enough for production use today The core innovation nobody is talking about: Most voice AI models slice long audio into short chunks. Every time they slice, they lose context. Speaker tracking breaks. Semantic coherence breaks. Accuracy drops. VibeVoice uses continuous speech tokenizers running at an ultra low frame rate of 7.5 Hz. This preserves audio fidelity while dramatically boosting computational efficiency. The entire 60 minutes stays in context. Nothing gets lost. Nobody gets misidentified. The numbers: → VibeVoice ASR 7B - available now on Hugging Face → VibeVoice Realtime 0.5B - try it on Colab right now → 50+ supported languages → 11 distinct English voice styles → 9 multilingual speaker voices → Already integrated into Hugging Face Transformers → Finetuning code now available The wildest part? A voice powered input method called Vibing just built itself on top of VibeVoice ASR. Available on macOS and Windows right now. The open source community is already shipping products on top of this. 100% Open Source. Free to use. Free to fine tune. Free to build on. 🔖 Save this before your competitors find it first. 👇

Kanika

220,854 просмотров • 3 месяцев назад

ElevenLabs has officially LOST to Open-Source ResembleAI allows you to clone ANY voice without verification using on 5-10 seconds of audio, and dominates on paralinguistic tags for human-like expressions. Most "fast" text-to-speech models sound robotic. Most "quality" TTS models are slow. None incorporate authentication at a foundational level. Resemble AI solved all three. Chatterbox Turbo delivers: 🟢<150ms time-to-first-sound 🟢State-of-the-art quality that beats larger proprietary models 🟢Natural, programmable expressions 🟢Zero-shot voice cloning with just 5 seconds of audio 🟢PerTh watermarking for authenticated and verifiable audio 🟢Open source – full transparency, no black boxes Try it on HuggingFace:

ElevenLabs has officially LOST to Open-Source ResembleAI allows you to clone ANY voice without verification using on 5-10 seconds of audio, and dominates on paralinguistic tags for human-like expressions. Most "fast" text-to-speech models sound robotic. Most "quality" TTS models are slow. None incorporate authentication at a foundational level. Resemble AI solved all three. Chatterbox Turbo delivers: 🟢<150ms time-to-first-sound 🟢State-of-the-art quality that beats larger proprietary models 🟢Natural, programmable expressions 🟢Zero-shot voice cloning with just 5 seconds of audio 🟢PerTh watermarking for authenticated and verifiable audio 🟢Open source – full transparency, no black boxes Try it on HuggingFace:

AI Breakfast

208,603 просмотров • 7 месяцев назад

Excited about the launch of Amazon Nova Sonic, our new speech-to-speech model that helps make AI voice applications feel remarkably natural. It's designed to understand not just what people say, but how they say it – working with tone, style, and conversation flow including pauses and interruptions. Nova Sonic delivers speech understanding and generation through a single, unified model, making it easier for builders to develop voice applications that maintain important context and nuance for customer service, AI agents, and other use cases across industries. It’s available in Amazon Bedrock now. Look forward to seeing what teams build with Nova Sonic!

Excited about the launch of Amazon Nova Sonic, our new speech-to-speech model that helps make AI voice applications feel remarkably natural. It's designed to understand not just what people say, but how they say it – working with tone, style, and conversation flow including pauses and interruptions. Nova Sonic delivers speech understanding and generation through a single, unified model, making it easier for builders to develop voice applications that maintain important context and nuance for customer service, AI agents, and other use cases across industries. It’s available in Amazon Bedrock now. Look forward to seeing what teams build with Nova Sonic!

Andy Jassy

155,772 просмотров • 1 год назад

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms. This agent actually uses *three* NVIDIA open source models: - Nemotron Speech ASR - Nemotron 3 Nano 30GB in a 4-bit quant (released in December) - A preview checkpoint of the upcoming Magpie text-to-speech model These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.) The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms. This agent actually uses three NVIDIA open source models: - Nemotron Speech ASR - Nemotron 3 Nano 30GB in a 4-bit quant (released in December) - A preview checkpoint of the upcoming Magpie text-to-speech model These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.) The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

kwindla

274,474 просмотров • 6 месяцев назад

QVAC SDK 0.14.0 is live. This release makes the on-device stack faster on mobile, ships the developer-agent path, and takes local text-to-speech to 31 languages. Main highlights: - OpenCode and OpenClaw. The first official OpenCode plugin, plus a maintained OpenClaw compatibility path, both built on managed mode and qvac serve. Point a coding agent at a local model with far less setup and far fewer surprises. - Brain-computer interface transcription, on the SDK. Take recorded neural signal data and decode it into text, fully on-device, no cloud. Stream it in chunks through a simple API. In 0.14 it runs GPU-accelerated on iOS. - Text to Speech in 31 languages with our Supertonic3 upgrade. VOICE AND SPEECH - Supertonic3 multilingual TTS, 5 languages to 31. - Chatterbox and Supertonic now run on the Android GPU, with lower memory use (especially on iOS), quantized s3gen Chatterbox support, and a fix for Chatterbox occasionally emitting random speech. - Whisper transcription now runs on the iOS GPU. Parakeet runs on the Android GPU, with steadier real-time streaming. VISION AND OCR - VLM multi-tile batching: high-resolution Pan and Scan images are encoded in one pass instead of tile by tile, for faster vision throughput. - OCR on ggml (EasyOCR and DocTR) reaches full speed parity with the onnx path, across Metal, OpenCL, and Vulkan. PLATFORM AND RELIABILITY - Dynamic compute backends on Linux: one build picks the right backend at runtime, and opens the door to ROCm and CUDA support without per-backend builds. - Thinking tokens are kept out of the model context, so reasoning no longer fills the KV cache. SDK 0.14.0 is now leaner and faster to start. Let’s build.

QVAC SDK 0.14.0 is live. This release makes the on-device stack faster on mobile, ships the developer-agent path, and takes local text-to-speech to 31 languages. Main highlights: - OpenCode and OpenClaw. The first official OpenCode plugin, plus a maintained OpenClaw compatibility path, both built on managed mode and qvac serve. Point a coding agent at a local model with far less setup and far fewer surprises. - Brain-computer interface transcription, on the SDK. Take recorded neural signal data and decode it into text, fully on-device, no cloud. Stream it in chunks through a simple API. In 0.14 it runs GPU-accelerated on iOS. - Text to Speech in 31 languages with our Supertonic3 upgrade. VOICE AND SPEECH - Supertonic3 multilingual TTS, 5 languages to 31. - Chatterbox and Supertonic now run on the Android GPU, with lower memory use (especially on iOS), quantized s3gen Chatterbox support, and a fix for Chatterbox occasionally emitting random speech. - Whisper transcription now runs on the iOS GPU. Parakeet runs on the Android GPU, with steadier real-time streaming. VISION AND OCR - VLM multi-tile batching: high-resolution Pan and Scan images are encoded in one pass instead of tile by tile, for faster vision throughput. - OCR on ggml (EasyOCR and DocTR) reaches full speed parity with the onnx path, across Metal, OpenCL, and Vulkan. PLATFORM AND RELIABILITY - Dynamic compute backends on Linux: one build picks the right backend at runtime, and opens the door to ROCm and CUDA support without per-backend builds. - Thinking tokens are kept out of the model context, so reasoning no longer fills the KV cache. SDK 0.14.0 is now leaner and faster to start. Let’s build.

QVAC

23,973,950 просмотров • 25 дней назад

Introducing Lightning V3 - it beats every model we tested against. ElevenLabs, Cartesia, OpenAI. Lightning sets a new SOTA with V3 in conversational text-to-speech. → Highest MOS score for conversational TTS at 3.9 → ~76% win rate vs gpt-4o-mini-tts on naturalness → 15 languages with mid-sentence code-switching → Built from scratch for voice agents, not read-aloud Every TTS model sounds clean in a demo. You type a sentence and you get beautiful audio. Voice agents don't work that way. They stream. They're generating audio in real-time chunks with half the context missing. That's where everything breaks. A great reading voice and a great conversational voice are fundamentally different things. A conversational voice has to sound like it's thinking - with the pauses, the rhythm shifts, the reactions. It has to handle the way real people actually talk, including switching languages mid-sentence. That's what V3 does. V3.1 also ships voice cloning. 5 to 15 seconds of audio, no fine-tuning, production-grade clone across 15 languages. Blog link in the comments.

Introducing Lightning V3 - it beats every model we tested against. ElevenLabs, Cartesia, OpenAI. Lightning sets a new SOTA with V3 in conversational text-to-speech. → Highest MOS score for conversational TTS at 3.9 → ~76% win rate vs gpt-4o-mini-tts on naturalness → 15 languages with mid-sentence code-switching → Built from scratch for voice agents, not read-aloud Every TTS model sounds clean in a demo. You type a sentence and you get beautiful audio. Voice agents don't work that way. They stream. They're generating audio in real-time chunks with half the context missing. That's where everything breaks. A great reading voice and a great conversational voice are fundamentally different things. A conversational voice has to sound like it's thinking - with the pauses, the rhythm shifts, the reactions. It has to handle the way real people actually talk, including switching languages mid-sentence. That's what V3 does. V3.1 also ships voice cloning. 5 to 15 seconds of audio, no fine-tuning, production-grade clone across 15 languages. Blog link in the comments.

Sudarshan Kamath

71,298 просмотров • 4 месяцев назад