EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine. EmotiVoice is a powerful and modern open-source text-to-speech engine. It speaks both English and Chinese, with over 2000 different voices. The most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including...

AK

512,025 subscribers

312,332 views • 2 years ago • via X (Twitter)

Education, Science and Technology

Comments: 10

Furkan Gözükara • 2 years ago

Even demo is low sound quality

Andrzej Białecki • 2 years ago

I wonder when we'll have singing voice synthesis guided by text and midi notes of a lead sound.

Jeff Araujo • 2 years ago

@camenduru, would be awesome to have a Colab available using this Engine 🥹

Fran Abenza • 2 years ago

Would it run in M1, 8Gb Ram?

Nathan Odle • 2 years ago

I tried running it locally and didn't get much variation between emotion prompts. Tried different (english) voices and happy/angry pretty much sounded the same most of the time. Maybe it works better with chinese?

Youdao Open Source • 2 years ago

Author here. Thanks for your interest in the project. We will post a roadmap for future updates shortly.

Patrick's AIBuzzNews • 2 years ago

Does it outperform Bark?

Ai News 24/7 • 2 years ago

EmotiVoice sounds amazing, especially with its prompt-controlled feature. Gonna give it a try!

Ping Chen • 2 years ago

@Memdotai mem it

tinyfish • 2 years ago

Should try

Related videos

SwiftUI: You can chain free and open-source STT, TTS, and LLMs to build voice/vision apps with Core AI. Here is a demo I was experimenting with: - Whisper: Speech-to-text - Kokoro-82M: Text-to-speech - Qwen3-VL 2B: Vision I converted Pocket TTS and VibeVoice to Core AI models and tried with Gemma 4. All is working great.

Amos Gyamfi

79,756 views • 27 days ago

Introducing Indic Parler-TTS: Open-Source Text-to-Speech for Over a Billion Indic Speakers! 🌏 In collaboration with Hugging Face, we are excited to release Indic Parler-TTS, a state-of-the-art open-source text-to-speech system designed to bring accessible and high-quality speech technology to India’s diverse linguistic community. Supporting 20 of the 22 scheduled Indian languages—and English in various accents (US, British, Indian)—it’s built to serve over a billion speakers and empower companies, developers, researchers, and communities. Why Indic Parler-TTS Stands Out: 1. Open and Accessible: Fully open-source with permissive licensing for unrestricted usage. 2. Wide Language Support: Includes a vast range of Indic languages, with rich diversity in voices. 3. High-Quality Audio: Produces natural, clear, and lifelike speech. 4. Adaptable and Fine-Tunable: Customize it to new languages, accents, or specific applications. 5. State-of-the-Art Performance: Proven through rigorous evaluation. 6. Versatile and Inclusive: Indic Parler-TTS offers 69 unique voices across 18 Indian languages, making it a perfect fit for diverse use cases like audiobooks, virtual assistants, and educational tools. Let's democratize speech technology together and make speech technology more inclusive and accessible for everyone. ▶️ Experience it now: Demo: Model page:

AI4Bharat

28,681 views • 1 year ago

Introducing Qwen3-TTS! 🗣️ Our new text-to-speech model is designed to be multi-timbre, multi-lingual, and multi-dialect for natural, expressive audio. It delivers strong performance in English & Chinese, and we're excited for you to hear it for yourself!

Tongyi Lab

1,015,006 views • 10 months ago

Introducing Speech Engine. Developers can now turn their existing chat agent into a full voice agent with one prompt. Speech Engine combines our leading speech, transcription, and voice orchestration models into a single pipeline - all custom built to work best together.

ElevenLabs

134,449 views • 2 months ago

We’re excited to introduce Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required. Open-source, lightweight, and incredibly fast. 🧵👇

kyutai

237,894 views • 6 months ago

I built a service desk agent using ElevenLabs’ new Conversational AI Agents feature. Watch the video to see how responsive it is! Previously, I used ElevenLabs for cloning my voice and used my generated voice for narrations in some of my youtube videos. This feature takes elevenlabs' voice AI to a whole new level! It simplifies systems that used to require separate TTS (text-to-speech) and STT (speech-to-text) processes for both sides of the conversation. Now, it’s much simpler! Create your own agent here What will you create with this?

Melvin Vivas

27,722 views • 1 year ago

Kyutai TTS and Unmute are now open source! The text-to-speech is natural, customizable, and fast: it can serve 32 users with a 350ms latency on a single L40S. Try it out and get started on the project page:

kyutai

171,805 views • 1 year ago

Excited to introduce Fish Speech 1.4 - now open-source and more powerful than ever! 🎉 Our mission is to make cutting-edge voice tech accessible to everyone. What's new: - Trained on 700k hours of multilingual data (up from 200k) - Now supports 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic - Fully open-source, empowering developers and researchers worldwide Key features: - Lightning-fast TTS with ultra-low latency - Instant voice cloning - Self-host or use our cloud service - Simple, flat-rate pricing Try it out: - Playground: - GitHub: - HuggingFace Model: - Demo: - Product Hunt: We can't wait to see what you'll create with Fish Audio. Happy voice building! 🎧🐠

Fish Audio

149,977 views • 1 year ago

Text-to-speech is moving way too fast. Just a few days ago, I tweeted about PersonaPlex-7B, NVIDIA's new open source TTS. And today, Qwen just open-sourced Qwen3-TTS 🤯 It’s a revolutionary text-to-speech model built for control. Not just about generating speech, but about shaping how it sounds directly from language. You can guide the pace, the tone, and the expressiveness straight from text, without touching audio graphs or hand-tuning parameters. That’s the real shift! What makes Qwen3-TTS stand out is how practical it already is: → voice cloning from just a few seconds of audio → voice creation without any reference sample → support for 10 languages out of the box → end-to-end latency down to ~97ms → works in both streaming and non-streaming setups The models come in two sizes (0.6B and 1.7B), so you can trade off quality and hardware cost depending on your setup. You can work with curated voices, designed voices, or cloned ones, and it integrates cleanly with vLLM for production use. It also ships as a simple Python package you can pip install. If you’re building real-time voice systems, this removes a lot of friction! 100% free and open source. I put the repo in the 🧵↓

Charly Wargnier

59,144 views • 6 months ago

Speech to Speech is now available in 29 languages. In November, we launched Speech to Speech, enabling you to transform your voice into another character with full control over emotions, timing, and delivery. Today we're making Speech to Speech multilingual. Use Speech to Speech to get more control than with prompting alone. Simply say it how you want it, choose a target voice, and generate.

ElevenLabs

113,343 views • 2 years ago

Open-source project SoundStorm (AI generated speech from Google Research) is going to give ElevenLabs a run for its money: The text-to-speech project specializes in dialogue between multiple parties, and is available on GitHub:

AI Breakfast

326,478 views • 3 years ago

🇨🇳 Another great Chinese Model, OmniHuman-1.5 from ByteDance Turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer, Produces coherent motion for over 1 minute with moving camera and multi character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context aware gestures, and continuous camera movement, and it also supports multi character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 views • 11 months ago

CubePart is the latest update to our open-source Cube 3D foundation model. It lets creators pair a text prompt with an open-ended part schema to generate labeled meshes that drop straight into a game engine for physics, animation, and scripting.

David Baszucki

140,271 views • 2 months ago

Another example of the multiple TTS parallel pipelines pattern. Here's a voice AI agent that speaks both English and Arabic, using a specific model/voice for each language. These are PlayAI voices. The STT, TTS, and LLM inference is all running on Groq Inc. (The LLM is LLama 4 Maverick.)

kwindla

13,401 views • 1 year ago

Did xAI just mass-murder the entire voice AI industry? 🤯 Grok just launched two voice APIs. Speech-to-Text and Text-to-Speech. Built on the same stack powering Tesla cars and Starlink support. And priced 10x cheaper than ElevenLabs. Speech-to-Text: $0.10/hr batch. $0.20/hr streaming. Text-to-Speech: $4.20 per million characters. 25+ languages. Real-time streaming. Speaker diarization. Already outperforming ElevenLabs, Deepgram, and AssemblyAI on word error rate. TTS ships with expressive tags like [laugh] and [sigh]. Voices that don't sound like robots reading a script. ElevenLabs spent years building a voice AI company. xAI built voice AI for cars and satellites.

Vaibhav Sisinty

24,531,843 views • 3 months ago

What if voice AI could understand context, intent, and emotion — not just words? 🎙️ Meet Seed Speech 2.0. A new speech AI stack designed for natural, expressive conversations — with major upgrades in both text-to-speech (TTS) and speech recognition (ASR). Built for developers and creators who want natural conversational speech. What’s inside: - Natural and expressive speech generation - Prompt-controlled emotion and tone - High-accuracy multilingual recognition - Strong contextual reasoning Built for: 🎧 Voice content creation 🤖 AI assistants ☎️ Customer service 🎬 Dubbing & subtitling 📊 Audio-video analysis Voice AI built for expression, accuracy, and understanding. Try it now → Read more about Seed Speech here → #SeedSpeech #VoiceAI #SpeechAI #AI

BytePlus

19,740 views • 4 months ago

Today we released Meta Spirit LM, our first open source multimodal language model that freely mixes text and speech. Many existing AI voice experiences today use ASR techniques to process speech before synthesizing with an LLM to generate text, but these approaches compromise the expressive aspects of speech. Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations for both inputs and outputs to generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification. We hope that sharing this work will enable the research community to further new approaches for text and speech integration.

AI at Meta

351,739 views • 1 year ago

Day 1 of 3 MLX Releases: Introducing MLX-Audio 🚀🔥 A text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon. Features ⚡️Fast inference on Apple Silicon (M series chips) 🤖Multiple language support 🗣️Voice customization options 🚀Quantization support for optimized performance Supported models: 🪶Kokoro - A multilingual TTS model with 82M params that supports various languages and voice styles. With more models coming soon. Get started: > pip install mlx-audio Please leave us a star and send a PR :)

Prince Canuma

123,480 views • 1 year ago

Voice Design v3 is here. Create any voice you can imagine with a prompt. We’ve rebuilt the underlying Voice Design model to deliver higher quality and broader expressive range. Generate production-ready voices in 70+ languages with support for hundreds of localized accents.

ElevenLabs

153,918 views • 1 year ago

Day 1 of 3 days of MLX: Introducing MLX-Audio-Swift SDK 🚀 A modular Swift SDK for voice agents and tasks on Apple Silicon built by Lucas Newman and yours truly. iOS, macOS, and visionOS developers can now build native apps with real-time, on-device audio intelligence: 🗣️ Text-to-Speech (TTS) 👂 Speech-to-Text (STT) 🔄 Speech-to-Speech (STS) 🎙️ Voice Activity Detection (VAD) and more. Only import the capabilities you need, nothing extra. Get started today and leave us a star ⭐️

Prince Canuma

160,927 views • 5 months ago