EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine. github: EmotiVoice is a powerful and modern open-source text-to-speech engine. EmotiVoice speaks both English and Chinese, with over 2000 different voices. The most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including...

AK

504,092 subscribers

312,299 views • 2 years ago • via X (Twitter)

Education, Science & Technology

10 Comments

Furkan Gözükara • 2 years ago

Even demo is low sound quality

Andrzej Białecki • 2 years ago

I wonder when we'll have singing voice synthesis guided by text and midi notes of a lead sound.

Jeff Araujo • 2 years ago

@camenduru, would be awesome to have a Colab available using this Engine 🥹

Fran Abenza • 2 years ago

Would it run in M1, 8Gb Ram?

Nathan Odle • 2 years ago

I tried running it locally and didn't get much variation between emotion prompts. Tried different (english) voices and happy/angry pretty much sounded the same most of the time. Maybe it works better with chinese?

Youdao Open Source • 2 years ago

Author here. Thanks for your interest in the project. We will post a roadmap for future updates shortly.

Patrick's AIBuzzNews • 2 years ago

Does it outperform Bark?

Ai News 24/7 • 2 years ago

EmotiVoice sounds amazing, especially with its prompt-controlled feature. Gonna give it a try!

Ping Chen • 2 years ago

@Memdotai mem it

tinyfish • 2 years ago

Should try

Related Videos

Presenting MetaVoice-1B, a 1.2B parameter base model for TTS (text-to-speech).
* Emotional speech in English
* Voice cloning with fine-tuning
* Zero-shot cloning for American & British voices
* Support for long-form synthesis

MetaVoice

111,799 views • 2 years ago

Introducing Indic Parler-TTS: Open-Source Text-to-Speech for Over a Billion Indic Speakers! 🌏

In collaboration with Hugging Face, we are excited to release Indic Parler-TTS, a state-of-the-art open-source text-to-speech system designed to bring accessible and high-quality speech technology to India’s diverse linguistic community. Supporting 20 of the 22 scheduled Indian languages, plus English in various accents (US, British, Indian), it’s built to serve over a billion speakers and empower companies, developers, researchers, and communities.

Why Indic Parler-TTS Stands Out:
1. Open and Accessible: Fully open-source with permissive licensing for unrestricted usage.
2. Wide Language Support: Includes a vast range of Indic languages, with rich diversity in voices.
3. High-Quality Audio: Produces natural, clear, and lifelike speech.
4. Adaptable and Fine-Tunable: Customize it to new languages, accents, or specific applications.
5. State-of-the-Art Performance: Proven through rigorous evaluation.
6. Versatile and Inclusive: Indic Parler-TTS offers 69 unique voices across 18 Indian languages, making it a perfect fit for diverse use cases like audiobooks, virtual assistants, and educational tools.

Let's democratize speech technology together and make it more inclusive and accessible for everyone.

▶️ Experience it now: Demo: Model page:

AI4Bharat

28,586 views • 1 year ago

Introducing Qwen3-TTS! 🗣️ Our new text-to-speech model is designed to be multi-timbre, multi-lingual, and multi-dialect for natural, expressive audio. It delivers strong performance in English & Chinese, and we're excited for you to hear it for yourself!

Tongyi Lab

1,014,717 views • 9 months ago

Introducing Speech Engine. Developers can now turn their existing chat agent into a full voice agent with one prompt. Speech Engine combines our leading speech, transcription, and voice orchestration models into a single pipeline - all custom built to work best together.

ElevenLabs

133,984 views • 1 month ago

We’re excited to introduce Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required. Open-source, lightweight, and incredibly fast. 🧵👇

kyutai

236,737 views • 5 months ago

I built a service desk agent using ElevenLabs’ new Conversational AI Agents feature. Watch the video to see how responsive it is! Previously, I used ElevenLabs to clone my voice and used the generated voice for narration in some of my YouTube videos. This feature takes ElevenLabs' voice AI to a whole new level! It simplifies systems that used to require separate TTS (text-to-speech) and STT (speech-to-text) processes for both sides of the conversation. Now, it’s much simpler! Create your own agent here. What will you create with this?

Melvin Vivas

27,722 views • 1 year ago

Kyutai TTS and Unmute are now open source! The text-to-speech is natural, customizable, and fast: it can serve 32 users with a 350ms latency on a single L40S. Try it out and get started on the project page:

kyutai

171,474 views • 1 year ago

Excited to introduce Fish Speech 1.4 - now open-source and more powerful than ever! 🎉 Our mission is to make cutting-edge voice tech accessible to everyone.

What's new:
- Trained on 700k hours of multilingual data (up from 200k)
- Now supports 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic
- Fully open-source, empowering developers and researchers worldwide

Key features:
- Lightning-fast TTS with ultra-low latency
- Instant voice cloning
- Self-host or use our cloud service
- Simple, flat-rate pricing

Try it out:
- Playground:
- GitHub:
- HuggingFace Model:
- Demo:
- Product Hunt:

We can't wait to see what you'll create with Fish Audio. Happy voice building! 🎧🐠

Fish Audio

149,878 views • 1 year ago

Text-to-speech is moving way too fast. Just a few days ago, I tweeted about PersonaPlex-7B, NVIDIA's new open source TTS. And today, Qwen just open-sourced Qwen3-TTS 🤯

It’s a revolutionary text-to-speech model built for control. Not just about generating speech, but about shaping how it sounds directly from language. You can guide the pace, the tone, and the expressiveness straight from text, without touching audio graphs or hand-tuning parameters. That’s the real shift!

What makes Qwen3-TTS stand out is how practical it already is:
→ voice cloning from just a few seconds of audio
→ voice creation without any reference sample
→ support for 10 languages out of the box
→ end-to-end latency down to ~97ms
→ works in both streaming and non-streaming setups

The models come in two sizes (0.6B and 1.7B), so you can trade off quality and hardware cost depending on your setup. You can work with curated voices, designed voices, or cloned ones, and it integrates cleanly with vLLM for production use. It also ships as a simple Python package you can pip install.

If you’re building real-time voice systems, this removes a lot of friction! 100% free and open source. I put the repo in the 🧵↓

Charly Wargnier

59,144 views • 5 months ago

Speech to Speech is now available in 29 languages. In November, we launched Speech to Speech, enabling you to transform your voice into another character with full control over emotions, timing, and delivery. Today we're making Speech to Speech multilingual. Use Speech to Speech to get more control than with prompting alone. Simply say it how you want it, choose a target voice, and generate.

ElevenLabs

113,317 views • 2 years ago

🇨🇳 Another great Chinese model, OmniHuman-1.5 from ByteDance.

It turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer. It produces coherent motion for over 1 minute with a moving camera and multi-character scenes.

Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high-level actions, which gives the motion engine clear semantic targets instead of only rhythm.

The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans.

From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context-aware gestures, and continuous camera movement, and it also supports multi-character interactions without manual choreography.

Reported results show strong lip sync accuracy, high video quality, natural motion, and a close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 views • 10 months ago

Open-source project Soundstorm (AI generated speech from Google Research) is going to give ElevenLabs a run for its money: The text-to-speech project specializes in dialogue between multiple parties, and is available on GitHub:

AI Breakfast

326,474 views • 3 years ago

3. Speech to Speech: Record your own voice or upload a voice file, and this tool will mirror the voice with the same accent and emotions.

Sehaj Singh

23,353 views • 10 months ago

Some examples:
- Text to Speech: Read aloud content or create audiobooks.
- Speech to Text: Transcribe audio and video into text.
- Voice Designer: Create custom AI voices.
- Conversational AI: Build dynamic voice agents and make outbound calls.

ElevenLabs

19,318 views • 1 year ago

CubePart is the latest update to our open-source Cube 3D foundation model. It lets creators pair a text prompt with an open-ended part schema to generate labeled meshes that drop straight into a game engine for physics, animation, and scripting.

David Baszucki

134,846 views • 1 month ago

Another example of the multiple TTS parallel pipelines pattern. Here's a voice AI agent that speaks both English and Arabic, using a specific model/voice for each language. These are PlayAI voices. The STT, TTS, and LLM inference is all running on Groq Inc. (The LLM is Llama 4 Maverick.)

kwindla

13,401 views • 1 year ago

Meet CosyVoice 3 — An open-source multilingual speech synthesis model delivering high-fidelity voices, natural prosody, and accurate pronunciation for lifelike speech. Ready to bring the voices to real‑world applications?

Alibaba Cloud

3,149,715 views • 5 months ago

Local speech-to-text with system-wide dictation, file transcription, translations, subtitle export, and fully local, no-telemetry processing. And yes, it’s free and open-source 😉

AlternativeTo

11,694 views • 2 months ago

Did xAI just mass-murder the entire voice AI industry? 🤯

Grok just launched two voice APIs: Speech-to-Text and Text-to-Speech. Built on the same stack powering Tesla cars and Starlink support. And priced 10x cheaper than ElevenLabs.

Speech-to-Text: $0.10/hr batch. $0.20/hr streaming.
Text-to-Speech: $4.20 per million characters.

25+ languages. Real-time streaming. Speaker diarization. Already outperforming ElevenLabs, Deepgram, and AssemblyAI on word error rate.

TTS ships with expressive tags like [laugh] and [sigh]. Voices that don't sound like robots reading a script.

ElevenLabs spent years building a voice AI company. xAI built voice AI for cars and satellites.

Vaibhav Sisinty

24,524,595 views • 2 months ago

What if voice AI could understand context, intent, and emotion, not just words? 🎙️

Meet Seed Speech 2.0. A new speech AI stack designed for natural, expressive conversations, with major upgrades in both text-to-speech (TTS) and speech recognition (ASR). Built for developers and creators who want natural conversational speech.

What’s inside:
- Natural and expressive speech generation
- Prompt-controlled emotion and tone
- High-accuracy multilingual recognition
- Strong contextual reasoning

Built for:
🎧 Voice content creation
🤖 AI assistants
☎️ Customer service
🎬 Dubbing & subtitling
📊 Audio-video analysis

Voice AI built for expression, accuracy, and understanding.

Try it now →
Read more about Seed Speech here →

#SeedSpeech #VoiceAI #SpeechAI #AI

BytePlus

19,740 views • 3 months ago