MARS5 TTS: Open Source Text to Speech with insane prosodic control! 🔥
> Voice cloning with less than 5 seconds of audio
> Two stage Auto-Regressive (750M) + Non-Auto Regressive (450M) model architecture
> Used BPE tokenizer to enable control over punctuations, pauses, stops etc.
> AR model predicts...

Vaibhav (VB) Srivastav

38,088 subscribers

162,180 views • 2 years ago • via X (Twitter)

10 Comments

Vaibhav (VB) Srivastav • 2 years ago

Check out the model here:

Vaibhav (VB) Srivastav • 2 years ago

GitHub for more deets:

Carlos DP • 2 years ago

Wow, these outputs are incredible. Like, is this the new SOTA? The samples sound better than the 11labs ones, at least, but idk what params were used

Vaibhav (VB) Srivastav • 2 years ago

750M + 450M -> pretty lightweight overall, in the GitHub README they promise more updates coming soon :D

Furkan Gözükara • 2 years ago

5 seconds to clone is always a lie, but I can't say for sure without testing. I asked them to share a Gradio demo app.

marko. • 2 years ago

Released under GNU AGPL 3.0, a very curious choice for a model but I'll take it 🎉

Marouane Belkouri • 2 years ago

Fine-tuning code?

adivina_soy3 • 2 years ago

@huggingface Impressive. Do you think it would be possible to combine it with Hallo?

Thomas Hill • 2 years ago

Nice share 🔥

STEVE blowJOBS • 2 years ago

This is racist ask me why

Related Videos

Oh my…FlashLabs releases Chroma 1.0, first open-source real-time speech-to-speech model with personalized voice cloning. Native speech-to-speech with <150ms latency & voice cloning from seconds of audio. Finally, an open alternative to OpenAI Realtime.

Alvaro Cintas

46,151 views • 4 months ago

We’re excited to introduce Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required. Open-source, lightweight, and incredibly fast. 🧵👇

kyutai

236,433 views • 5 months ago

ElevenLabs just lost the crown - to open source. Chatterbox by Resemble AI (I remember writing about this team back in 2022 as a pioneer of AI audio) just released as an open-source alternative for audio generation and voice cloning. - Zero-shot voice cloning from just 5 seconds of audio - Unique emotion intensity control—from subtle to dramatically expressive - Real-time voice synthesis faster than real-time inference - Built-in watermarking for secure, trusted audio - Consistently preferred over ElevenLabs in blind evaluations Fully open-source. No hidden restrictions. This is the way. Check out the demo:

AI Breakfast

454,132 views • 1 year ago

Today we're releasing ZONOS2, our next-generation real-time TTS model with high-fidelity voice cloning. ZONOS2 is the most expressive open-source TTS model, released under Apache 2.0 and available on Zyphra Cloud on AMD. 🧵

Zyphra

329,828 views • 7 days ago

The new open-source Text to Speech model: Fish Speech 1.4 is brilliant! Trained on a massive 700K hours of multilingual speech data in 8 languages - Instant voice cloning 🗣️ - Ultra-low latency ⚡ - Compact model (~1GB weights) 🏋️‍♂️

Rohan Paul

228,836 views • 1 year ago

Today we're releasing our first open source TTS model, TADA! TADA (Text Audio Dual Alignment) is a speech-language model that generates text and audio in one synchronized stream to reduce token-level hallucinations and improve latency. This means: → Zero content hallucinations across 1,000+ test samples → 5x faster than similar-grade LLM-based TTS → Fits much longer audio: 2,048 tokens cover ~700 seconds with TADA vs. ~70 seconds in conventional systems → Free transcript alongside audio with no added latency

Hume AI

268,985 views • 3 months ago
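
The context-length claim in the TADA post above implies a token rate per second of audio. A rough back-of-the-envelope check, using only the rounded figures quoted in the post (2,048 tokens, ~700 s, ~70 s); the function and variable names are illustrative and not part of any TADA API:

```python
def tokens_per_second(context_tokens: int, audio_seconds: float) -> float:
    """Average number of tokens spent per second of generated audio."""
    return context_tokens / audio_seconds


CONTEXT_TOKENS = 2048  # context size quoted in the post

# Figures quoted in the post (both approximate).
tada_rate = tokens_per_second(CONTEXT_TOKENS, 700)          # TADA
conventional_rate = tokens_per_second(CONTEXT_TOKENS, 70)   # "conventional systems"

# How many times fewer tokens per second the post's numbers imply.
reduction = conventional_rate / tada_rate

print(f"TADA: {tada_rate:.2f} tokens/s")
print(f"Conventional: {conventional_rate:.2f} tokens/s")
print(f"Reduction: {reduction:.0f}x")
```

On these numbers, the same 2,048-token window covers roughly ten times more audio because each second of speech costs about a tenth as many tokens.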

HOLY FUCK! Zyphra just dropped Zonos - Apache 2.0 licensed, Multilingual, Text to Speech model with INSTANT voice cloning! 🔥 > Zero-shot TTS with Voice Cloning: Input text and a 10-30 second speaker sample to generate high-quality text-to-speech output > Audio Prefix Inputs: Enhance speaker matching by adding an audio prefix to the text, enabling behaviors like whispering that are hard to achieve with voice cloning alone > Multilingual Support: Supports English, Japanese, Chinese, French, and German > Audio Quality & Emotion Control: Fine-tune speaking rate, pitch, frequency, audio quality, and emotions (e.g., happiness, anger, sadness, fear) > Fast Performance: Runs at ~2x real-time speed on an RTX 4090 > Available on the Hugging Face Hub 🤗

Vaibhav (VB) Srivastav

298,858 views • 1 year ago

We just released a new version of Kitten TTS - 15M param SOTA tiny text-to-speech model It has a significant quality improvement over the previous version. Still less than 25MB in size! Open-source, extremely tiny, expressive. Apache 2.0

Divam Gupta

92,030 views • 4 months ago

Meta just released MusicGen, a simple and controllable model for music generation. MusicGen is a single stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods like MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, it can predict them in parallel, thus having only 50 auto-regressive steps per second of audio. Try out the Gradio demo: Models on Hugging Face: github:

AK

627,429 views • 3 years ago
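
The codebook delay described in the MusicGen post above can be sketched in a few lines. This is a minimal illustration of the general idea, not Meta's implementation: it assumes each codebook is shifted by exactly one step more than the previous one, and the names `apply_delay`, `undo_delay`, `steps_for_clip`, and the `PAD` id are hypothetical.

```python
from typing import Dict, List

PAD = -1  # hypothetical placeholder id for slots that hold no token


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps so one decoding step emits one token per codebook."""
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        trailing = total_steps - num_frames - k
        delayed.append([pad] * k + list(row) + [pad] * trailing)
    return delayed


def undo_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Remove the per-codebook shift to recover time-aligned frames."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


def steps_for_clip(seconds: float, frame_rate: int = 50, num_codebooks: int = 4) -> Dict[str, int]:
    """Compare decoding steps: delayed (parallel codebooks) vs. fully flattened tokens."""
    num_frames = int(seconds * frame_rate)
    return {
        "delayed": num_frames + num_codebooks - 1,
        "flattened": num_frames * num_codebooks,
    }


# Toy example: 4 codebooks, 3 frames each.
codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
delayed = apply_delay(codes)
restored = undo_delay(delayed)
one_second = steps_for_clip(1.0)
```

Under these assumptions, one second of audio (50 frames, 4 codebooks) takes 53 steps in the delayed layout versus 200 if every token were predicted one at a time. The 3 extra steps are a one-time cost per clip, so longer clips approach the 50 steps per second quoted in the post.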

NEW: Higgs Audio V2 from BosonAI open, unified TTS model w/ voice cloning, beats GPT 4o mini tts and ElevenLabs v2 🔥 > Trained on 10M hours (speech, music, events) > Built on top of Llama 3.2 3B > Works real-time and on edge > Beats GPT-4o-mini-tts, ElevenLabs v2 in prosody & emotion Multi-speaker dialog > Zero-shot voice cloning 🤩 > Available on Hugging Face Kudos to folks at Boson AI for releasing such a brilliant work and all the details around the model! 🤗

Vaibhav (VB) Srivastav

79,585 views • 11 months ago

LETS GOO! Parler TTS 🔥 A fully open-source, Apache 2.0 licensed Text-to-speech model focused on providing maximum controllability. Through voice prompts, you can control the pitch, speed, gender, noise levels, emotion characteristics and more! > Trained on 10K hours of permissive data. > Offers control over the generations. > Training + Inference code released. > The processed dataset and tagging scripts were released for further research. > English only for now. Next, we're scaling the training to 50K hours and even better dataset processing! Want to help us out? DMs open! 🤗

Vaibhav (VB) Srivastav

156,386 views • 2 years ago

Large Language Diffusion with Masking (LLaDA) models are here - and their generation looks so fucking dope! 🤯 True to Yann LeCun's vision, ditch the auto-regressive bits and approximate the language distribution via Maximum Likelihood Estimation! So cool to watch the model denoise text from tokens in real time! - The team released their model checkpoints and there's a demo for you to play with too! Try it out! 🤗

Vaibhav (VB) Srivastav

21,394 views • 1 year ago

NVIDIA just removed one of the biggest friction points in Voice AI. PersonaPlex-7B is an open-source, full-duplex conversational model. Free, open source (MIT), with open model weights on Hugging Face 🤗 Links to repo and weights in 🧵↓ The traditional ASR → LLM → TTS pipeline forces rigid turn-taking. It’s efficient, but it never feels natural. PersonaPlex-7B changes that. This NVIDIA model can listen and speak at the same time. It runs directly on continuous audio tokens with a dual-stream transformer, generating text and audio in parallel instead of passing control between components. That unlocks: → instant back-channel responses → interruptions that feel human → real conversational rhythm Persona control is fully zero-shot! If you’re building low-latency assistants or support agents, this is a big step forward 🔥

Charly Wargnier

564,460 views • 5 months ago

Thrilled to see Amazon Web Services making a major contribution to the open source AI community with the launch of the Strands Agents, an open source AI agents SDK! The core of Strands is the simple agentic loop that connects the model and tools together, like the two strands of DNA. This model-driven approach to agent building eliminates the need for complex agent orchestration by embracing the capabilities of state-of-the-art models to plan, chain thoughts, call tools, and reflect. Providing open source tools and interoperability with open source protocols is an important part of our strategy to enable an agentic future. Can't wait to see what you build with Strands!

Swami Sivasubramanian

32,185 views • 1 year ago

Announcing the new SotA voice-cloning TTS model: 𝗩𝗼𝗶𝗰𝗲𝗦𝘁𝗮𝗿 ⭐️ VoiceStar is - autoregressive, - voice-cloning, - robust, - duration controllable, - *test-time extrapolation*, generates speech longer than training duration! Code&Model:

Puyuan Peng

27,872 views • 1 year ago

🚀 VoxCPM 2 is live! 🎉 Another open-source AI #TTS model from China — and one that stands shoulder to shoulder with Qwen3-TTS, while bringing everything into a single unified model. After rapid iterations from V1 (zero-shot cloning) to V1.5 (long-form + fine-tuning), #VoxCPM has consistently pushed quality and usability forward. Now, VoxCPM 2 takes it further: 🔹30+ languages — truly global, truly local. 🔹Infinite voice design — type it, hear it, control it. From a whisper to a booming cinematic voice. 🔹Studio-grade audio — 48kHz ultra-high fidelity with emotional depth 🔹Diffusion-Autoregressive cloning — preserves more acoustic and emotional detail than token-based models like Qwen3-TTS 💡 Big shoutout to Grok — used your multi-image video magic for our launch demo. It’s scarily good at keeping visuals consistent across shots. Elon Elon Musk, this one’s for you. 😉 Check the demo & start cloning your dream voice: 🌐 Hugging Face Space: 🤗 Hugging Face Model: 🤖 ModelScope Model: 💻 GitHub： #TTS #AI #VoiceCloning #GrokImagine #ElonMusk #OpenBMB #VoxCPM

OpenBMB

556,677 views • 2 months ago

The first truly open-source audio-video model. LTX-2 is a DiT-based foundation model with all core video generation capabilities in one unified model. Designed to run locally on consumer GPUs. - text-to-video - image-to-video - and video-to-video modes 100% open-source.

Akshay 🚀

66,012 views • 5 months ago

Wow! New Speech to Speech model - Fish Agent v0.1 3B by Fish Audio 🔥 > Trained on 700K hours of multilingual audio > Continue-pretrained version of Qwen-2.5-3B-Instruct for 200B audio & text tokens > Zero-shot voice cloning > Text + audio input/ Audio output > Ultra-fast inference w/ 200ms TTFA > Models on the Hub & Finetuning code on its way! 🚀 What an amazing time to be alive 🤗

Vaibhav (VB) Srivastav

66,963 views • 1 year ago

Today we released Meta Spirit LM, our first open source multimodal language model that freely mixes text and speech. Many existing AI voice experiences today use ASR techniques to process speech before synthesizing with an LLM to generate text, but these approaches compromise the expressive aspects of speech. Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations for both inputs and outputs to generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification. We hope that sharing this work will enable the research community to further new approaches for text and speech integration.

AI at Meta

351,674 views • 1 year ago