EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine github: EmotiVoice is a powerful and modern open-source text-to-speech engine. EmotiVoice speaks both English and Chinese, with over 2000 different voices. The most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including...

AK

504,092 subscribers

312,299 views • 2 years ago • via X (Twitter)

Education • Science and Technology

Comments: 10

Furkan Gözükara • 2 years ago

Even demo is low sound quality

Andrzej Białecki • 2 years ago

I wonder when we'll have singing voice synthesis guided by text and midi notes of a lead sound.

Jeff Araujo • 2 years ago

@camenduru, would be awesome to have a Colab available using this Engine 🥹

Fran Abenza • 2 years ago

Would it run in M1, 8Gb Ram?

Nathan Odle • 2 years ago

I tried running it locally and didn't get much variation between emotion prompts. Tried different (English) voices, and happy/angry pretty much sounded the same most of the time. Maybe it works better with Chinese?

Youdao Open Source • 2 years ago

Author here. Thanks for your interest in the project. We will post a roadmap for future updates shortly.

Patrick's AIBuzzNews • 2 years ago

Does it outperform Bark?

Ai News 24/7 • 2 years ago

EmotiVoice sounds amazing, especially with its prompt-controlled feature. Gonna give it a try!

Ping Chen • 2 years ago

@Memdotai mem it

tinyfish • 2 years ago

Should try

Related videos

Presenting MetaVoice-1B, a 1.2B parameter base model for TTS (text-to-speech).
* Emotional speech in English
* Voice cloning with fine-tuning
* Zero-shot cloning for American & British voices
* Support for long-form synthesis

MetaVoice

111,799 views • 2 years ago

Introducing Indic Parler-TTS: Open-Source Text-to-Speech for Over a Billion Indic Speakers! 🌏
In collaboration with Hugging Face, we are excited to release Indic Parler-TTS, a state-of-the-art open-source text-to-speech system designed to bring accessible and high-quality speech technology to India’s diverse linguistic community. Supporting 20 of the 22 scheduled Indian languages—and English in various accents (US, British, Indian)—it’s built to serve over a billion speakers and empower companies, developers, researchers, and communities.
Why Indic Parler-TTS Stands Out:
1. Open and Accessible: Fully open-source with permissive licensing for unrestricted usage.
2. Wide Language Support: Includes a vast range of Indic languages, with rich diversity in voices.
3. High-Quality Audio: Produces natural, clear, and lifelike speech.
4. Adaptable and Fine-Tunable: Customize it to new languages, accents, or specific applications.
5. State-of-the-Art Performance: Proven through rigorous evaluation.
6. Versatile and Inclusive: Indic Parler-TTS offers 69 unique voices across 18 Indian languages, making it a perfect fit for diverse use cases like audiobooks, virtual assistants, and educational tools.
Let's democratize speech technology together and make speech technology more inclusive and accessible for everyone.
▶️ Experience it now: Demo: Model page:

AI4Bharat

28,586 views • 1 year ago

Introducing Qwen3-TTS! 🗣️ Our new text-to-speech model is designed to be multi-timbre, multi-lingual, and multi-dialect for natural, expressive audio. It delivers strong performance in English & Chinese, and we're excited for you to hear it for yourself!

Tongyi Lab

1,014,717 views • 9 months ago

Introducing Speech Engine. Developers can now turn their existing chat agent into a full voice agent with one prompt. Speech Engine combines our leading speech, transcription, and voice orchestration models into a single pipeline - all custom built to work best together.

ElevenLabs

133,984 views • 1 month ago

We’re excited to introduce Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required. Open-source, lightweight, and incredibly fast. 🧵👇

kyutai

236,737 views • 5 months ago

I built a service desk agent using ElevenLabs’ new Conversational AI Agents feature. Watch the video to see how responsive it is! Previously, I used ElevenLabs to clone my voice and used the generated voice for narration in some of my YouTube videos. This feature takes ElevenLabs' voice AI to a whole new level! It simplifies systems that used to require separate TTS (text-to-speech) and STT (speech-to-text) processes for both sides of the conversation. Now, it’s much simpler!
Create your own agent here
What will you create with this?

Melvin Vivas

27,722 views • 1 year ago

Kyutai TTS and Unmute are now open source! The text-to-speech is natural, customizable, and fast: it can serve 32 users with a 350ms latency on a single L40S. Try it out and get started on the project page:

kyutai

171,474 views • 1 year ago

Excited to introduce Fish Speech 1.4 - now open-source and more powerful than ever! 🎉
Our mission is to make cutting-edge voice tech accessible to everyone.
What's new:
- Trained on 700k hours of multilingual data (up from 200k)
- Now supports 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic
- Fully open-source, empowering developers and researchers worldwide
Key features:
- Lightning-fast TTS with ultra-low latency
- Instant voice cloning
- Self-host or use our cloud service
- Simple, flat-rate pricing
Try it out:
- Playground:
- GitHub:
- HuggingFace Model:
- Demo:
- Product Hunt:
We can't wait to see what you'll create with Fish Audio. Happy voice building! 🎧🐠

Fish Audio

149,878 views • 1 year ago

Text-to-speech is moving way too fast. Just a few days ago, I tweeted about PersonaPlex-7B, NVIDIA's new open source TTS. And today, Qwen just open-sourced Qwen3-TTS 🤯
It’s a revolutionary text-to-speech model built for control. Not just about generating speech, but about shaping how it sounds directly from language. You can guide the pace, the tone, and the expressiveness straight from text, without touching audio graphs or hand-tuning parameters. That’s the real shift!
What makes Qwen3-TTS stand out is how practical it already is:
→ voice cloning from just a few seconds of audio
→ voice creation without any reference sample
→ support for 10 languages out of the box
→ end-to-end latency down to ~97ms
→ works in both streaming and non-streaming setups
The models come in two sizes (0.6B and 1.7B), so you can trade off quality and hardware cost depending on your setup. You can work with curated voices, designed voices, or cloned ones, and it integrates cleanly with vLLM for production use. It also ships as a simple Python package you can pip install. If you’re building real-time voice systems, this removes a lot of friction!
100% free and open source. I put the repo in the 🧵↓

Charly Wargnier

59,144 views • 5 months ago

Speech to Speech is now available in 29 languages. In November, we launched Speech to Speech, enabling you to transform your voice into another character with full control over emotions, timing, and delivery. Today we're making Speech to Speech multilingual. Use Speech to Speech to get more control than with prompting alone. Simply say it how you want it, choose a target voice, and generate.

ElevenLabs

113,317 views • 2 years ago

🇨🇳 Another great Chinese model, OmniHuman-1.5 from ByteDance.
It turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer, and produces coherent motion for over 1 minute with a moving camera and multi-character scenes.
Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high-level actions, which gives the motion engine clear semantic targets instead of only rhythm.
The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans.
From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context-aware gestures, and continuous camera movement, and it also supports multi-character interactions without manual choreography.
Reported results show strong lip sync accuracy, high video quality, natural motion, and a close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 views • 10 months ago

Open-source project Soundstorm (AI generated speech from Google Research) is going to give ElevenLabs a run for its money: The text-to-speech project specializes in dialogue between multiple parties, and is available on GitHub:

AI Breakfast

326,474 views • 3 years ago

3. Speech to Speech: Record your own voice or upload a voice file, and this tool will mirror the voice with the same accent and emotions.

Sehaj Singh

23,353 views • 10 months ago

Some examples:
- Text to Speech: Read aloud content or create audiobooks.
- Speech to Text: Transcribe audio and video into text.
- Voice Designer: Create custom AI voices.
- Conversational AI: Build dynamic voice agents and make outbound calls.

ElevenLabs

19,318 views • 1 year ago

CubePart is the latest update to our open-source Cube 3D foundation model. It lets creators pair a text prompt with an open-ended part schema to generate labeled meshes that drop straight into a game engine for physics, animation, and scripting.

David Baszucki

135,506 views • 1 month ago

Another example of the multiple TTS parallel pipelines pattern. Here's a voice AI agent that speaks both English and Arabic, using a specific model/voice for each language. These are PlayAI voices. The STT, TTS, and LLM inference is all running on Groq Inc. (The LLM is Llama 4 Maverick.)

kwindla

13,401 views • 1 year ago

Meet CosyVoice 3 — An open-source multilingual speech synthesis model delivering high-fidelity voices, natural prosody, and accurate pronunciation for lifelike speech. Ready to bring the voices to real‑world applications?

Alibaba Cloud

3,149,715 views • 5 months ago

Local speech-to-text with system-wide dictation, file transcription, translations, subtitle export, and fully local, no-telemetry processing. And yes, it’s free and open-source 😉

AlternativeTo

11,694 views • 2 months ago

What if voice AI could understand context, intent, and emotion — not just words? 🎙️
Meet Seed Speech 2.0. A new speech AI stack designed for natural, expressive conversations — with major upgrades in both text-to-speech (TTS) and speech recognition (ASR). Built for developers and creators who want natural conversational speech.
What’s inside:
- Natural and expressive speech generation
- Prompt-controlled emotion and tone
- High-accuracy multilingual recognition
- Strong contextual reasoning
Built for:
🎧 Voice content creation
🤖 AI assistants
☎️ Customer service
🎬 Dubbing & subtitling
📊 Audio-video analysis
Voice AI built for expression, accuracy, and understanding.
Try it now →
Read more about Seed Speech here →
#SeedSpeech #VoiceAI #SpeechAI #AI

BytePlus

19,740 views • 3 months ago

Did xAI just mass-murder the entire voice AI industry? 🤯
Grok just launched two voice APIs. Speech-to-Text and Text-to-Speech. Built on the same stack powering Tesla cars and Starlink support. And priced 10x cheaper than ElevenLabs.
Speech-to-Text: $0.10/hr batch. $0.20/hr streaming.
Text-to-Speech: $4.20 per million characters.
25+ languages. Real-time streaming. Speaker diarization. Already outperforming ElevenLabs, Deepgram, and AssemblyAI on word error rate.
TTS ships with expressive tags like [laugh] and [sigh]. Voices that don't sound like robots reading a script.
ElevenLabs spent years building a voice AI company. xAI built voice AI for cars and satellites.

Vaibhav Sisinty

24,524,944 views • 2 months ago