EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine github: EmotiVoice is a powerful and modern open-source text-to-speech engine. EmotiVoice speaks both English and Chinese, with over 2000 different voices. Its most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including...

312,299 views • 2 years ago • via X (Twitter)

10 Comments

Furkan Gözükara • 2 years ago

Even the demo has low sound quality

Andrzej Białecki • 2 years ago

I wonder when we'll have singing voice synthesis guided by text and MIDI notes of a lead sound.

Jeff Araujo • 2 years ago

@camenduru, would be awesome to have a Colab available using this Engine 🥹

Fran Abenza • 2 years ago

Would it run on an M1 with 8 GB RAM?

Nathan Odle • 2 years ago

I tried running it locally and didn't get much variation between emotion prompts. Tried different (English) voices and happy/angry pretty much sounded the same most of the time. Maybe it works better with Chinese?

Youdao Open Source • 2 years ago

Author here. Thanks for your interest in the project. We will post a roadmap for future updates shortly.

Patrick's AIBuzzNews • 2 years ago

Does it outperform Bark?

Ai News 24/7 • 2 years ago

EmotiVoice sounds amazing, especially with its prompt-controlled feature. Gonna give it a try!

Ping Chen • 2 years ago

@Memdotai mem it

tinyfish • 2 years ago

Should try

Similar Videos

🇨🇳 Another great Chinese model, OmniHuman-1.5 from ByteDance, turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer. It produces coherent motion for over 1 minute with moving camera and multi-character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high-level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context-aware gestures, and continuous camera movement, and it also supports multi-character interactions without manual choreography. Reported results show strong lip-sync accuracy, high video quality, natural motion, and a close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 views • 10 months ago