Smol TTS keeps getting better! Introducing OuteTTS v0.2 - 500M parameters, multilingual with voice cloning! 🔥
> Multilingual - English, Chinese, Korean & Japanese
> Cross platform inference w/ llama.cpp
> Zero-shot voice cloning
> Trained on 5 Billion audio tokens
> Qwen 2.5 0.5B LLM backbone
> Trained...

Vaibhav (VB) Srivastav

52,141 subscribers

44,654 views • 1 year ago • via X (Twitter)

Science & Technology, Education

11 comments

Vaibhav (VB) Srivastav • 1 year ago

Check out the model weights and inference code base here:

Vaibhav (VB) Srivastav • 1 year ago

llama.cpp compatible GGUFs:

Vaibhav (VB) Srivastav • 1 year ago

OuteTTS GitHub:

Haorui He • 1 year ago

Big Congrats!!! Another SOTA TTS model trained on Emilia after F5-TTS & MaskGCT! Try out:

Tommy Falkowski • 1 year ago

Just tested it out and the quality is very good. More importantly, the fact that you can generate speaker profiles is awesome! Will test it out some more and add it to my growing list of supported tts engines in my app 🤣

SkyTab • 1 year ago

Switch to SkyTab and get $5,000! A modern and sleek POS system with commercial-grade durability. 💪 ✅ $0 upfront costs ✅ Best in-class POS ✅ Local service & 24/7 support ✅ And much more! Make the switch today:

Umesh • 1 year ago

This is improving so fast that I don't want to speak myself anymore. Just use this and get done 🤖

Fronesis • 1 year ago

Thank you for your work and for sharing insights! 🙌 Advancements like OuteTTS v0.2 showcase the rapid evolution of AI and its potential to empower global communities. 🚀 The future of #AI is bright, and collaborative innovation is key to unlocking its full potential!

Digital Doctor • 1 year ago

Are you saying you can voice CLONE on a R-Pi? Is that what you're saying????

斎藤ただし, Tadashi Saito • 1 year ago

The font of Japanese characters is wrong, it's for (maybe) Chinese. I hope you'll pay attention and respect to each of them when you are working for multilingual/multicultural things. (like your TTS engine itself does. Brilliant quality✨️)

Ahmed Mansour • 1 year ago

I tried to run it on HF. average inference time for 200 chars is >1 hour running on CPU. Why is this model so heavy?

Related videos

Wow! New Speech to Speech model - Fish Agent v0.1 3B by Fish Audio 🔥 > Trained on 700K hours of multilingual audio > Continue-pretrained version of Qwen-2.5-3B-Instruct for 200B audio & text tokens > Zero-shot voice cloning > Text + audio input/ Audio output > Ultra-fast inference w/ 200ms TTFA > Models on the Hub & Finetuning code on its way! 🚀 What an amazing time to be alive 🤗

Vaibhav (VB) Srivastav

66,963 views • 1 year ago

Fish: Multilingual Voice Cloning Fish is a voice cloning AI model that supports multiple languages out of the box: - English - Japanese - Korean - Chinese - French - German - Arabic - Spanish I wrote a 1-click launcher for Fish gradio app, so you can run it locally. Enjoy.

cocktail peanut

35,815 views • 1 year ago

Pretty fucking wild - Chatterbox TTS by Resemble AI - Zero shot voice cloning, Apache 2.0 licensed 🤯 > Outperforms ElevenLabs > Trained on 500K hours of audio > Llama 500M arch > Emotionally aware speech > Zero shot voice cloning > Live on Hugging Face 🤗

Vaibhav (VB) Srivastav

71,652 views • 1 year ago

HOLY FUCK! Zyphra just dropped Zonos - Apache 2.0 licensed, Multilingual, Text to Speech model with INSTANT voice cloning! 🔥 > Zero-shot TTS with Voice Cloning: Input text and a 10-30 second speaker sample to generate high-quality text-to-speech output > Audio Prefix Inputs: Enhance speaker matching by adding an audio prefix to the text, enabling behaviors like whispering that are hard to achieve with voice cloning alone > Multilingual Support: Supports English, Japanese, Chinese, French, and German > Audio Quality & Emotion Control: Fine-tune speaking rate, pitch, frequency, audio quality, and emotions (e.g., happiness, anger, sadness, fear) > Fast Performance: Runs at ~2x real-time speed on an RTX 4090 > Available on the Hugging Face Hub 🤗

Vaibhav (VB) Srivastav

298,858 views • 1 year ago

Pretty WILD - SoTA open source TTS model that beats ElevenLabs/ Sesame - Dia 1.6B - Apache 2.0 licensed! 🔥 > Ultra realistic voice synthesis > Capable of producing non-verbal sounds - coughing, laughing 💥 > Zero shot Voice Cloning > Real-time TTS synthesis > Can run on your MacBook > Trending #2 on Hugging Face Weights on the Hub and code on GitHub! 🤯

Vaibhav (VB) Srivastav

39,165 views • 1 year ago

Awesome! OmniVoice-TTS in ComfyUI. - zero-shot multilingual TTS; - 600+ languages; - voice cloning; - voice design; - multi-speaker dialogues; - supports SageAttention and non-verbal expression tags.

Wildminder

35,799 views • 2 months ago

train YOLOv9 on your dataset tutorial - run inference with a pre-trained COCO model - fine-tune model on custom dataset - evaluate the trained model - run inference with a fine-tuned model blogpost: ↓ read more

SkalskiP

111,792 views • 2 years ago

The new open-source Text to Speech model: Fish Speech 1.4 is brilliant! Trained on a massive 700K hours of multilingual speech data in 8 languages - Instant voice cloning 🗣️ - Ultra-low latency ⚡ - Compact model (~1GB weights) 🏋️‍♂️

Rohan Paul

228,836 views • 1 year ago

NEW: Higgs Audio V2 from BosonAI open, unified TTS model w/ voice cloning, beats GPT 4o mini tts and ElevenLabs v2 🔥 > Trained on 10M hours (speech, music, events) > Built on top of Llama 3.2 3B > Works real-time and on edge > Beats GPT-4o-mini-tts, ElevenLabs v2 in prosody & emotion Multi-speaker dialog > Zero-shot voice cloning 🤩 > Available on Hugging Face Kudos to folks at Boson AI for releasing such a brilliant work and all the details around the model! 🤗

Vaibhav (VB) Srivastav

79,585 views • 11 months ago

Hot! We have a new strong voice model. MOSS-TTS - a production-ready flagship 8B TTS; - high-fidelity zero-shot voice cloning, stable long-form gen; - multilingual; - lossless reconstruction; fine-grained pronunciation control; - token-level duration control, - voice creator, sound effects. Outstanding quality.

Wildminder

12,121 views • 4 months ago

Introducing Fish Speech 1.5 🎉 - Making state-of-the-art TTS accessible to everyone! Highlights: - #2 ranked on TTS-Arena (as "Anonymous Sparkle") - 1M hours of multilingual training data - 13 languages supported, including English, Chinese, Japanese & more - <150ms latency with high-quality instant voice cloning - Pretrained model now open source - Cost-effective self-hosting or cloud options Let's check out the details 🧵⬇️

Fish Audio

101,606 views • 1 year ago

Marvis-TTS-v0.2 is here 🚀 A local first TTS model capable of realtime performance even on older iPhones that Lucas Newman and I built. What’s new: ✨ Blazing fast — 100M (tiny) & 250M parameter models 🌍 Multilingual — English, French, German 🎭 Enhanced voice cloning — More natural & expressive ⚡ Long-form generation — Up to 90 seconds (4x improvement) Get started today: > pip install -U mlx-audio

Prince Canuma

118,352 views • 7 months ago

🎙️ Chatterbox is now available on Comfy Cloud! Zero-shot voice cloning, 23 languages, and expressive TTS — all ready to use with no setup. Clone any voice from just 5 seconds of audio. No training required.

ComfyUI

21,722 views • 5 months ago

Presenting MetaVoice-1B, a 1.2B parameter base model for TTS (text-to-speech). * Emotional speech in English * Voice cloning with fine-tuning * Zero-shot cloning for American & British voices * Support for long-form synthesis

MetaVoice

111,799 views • 2 years ago

Voice cloning is now available on LiveKit Inference. We’re launching with Inworld AI and Cartesia. Clone a voice once and use it across multiple TTS providers, with automatic fallback to the same voice if a provider fails mid-call. Free to create and available on all paid plans today.

LiveKit

10,891 views • 1 month ago

Excited to introduce Fish Speech 1.4 - now open-source and more powerful than ever! 🎉 Our mission is to make cutting-edge voice tech accessible to everyone. What's new: - Trained on 700k hours of multilingual data (up from 200k) - Now supports 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic - Fully open-source, empowering developers and researchers worldwide Key features: - Lightning-fast TTS with ultra-low latency - Instant voice cloning - Self-host or use our cloud service - Simple, flat-rate pricing Try it out: - Playground: - GitHub: - HuggingFace Model: - Demo: - Product Hunt: We can't wait to see what you'll create with Fish Audio. Happy voice building! 🎧🐠

Fish Audio

149,878 views • 1 year ago

🎉 Congrats to Fish Audio on launching Fish Audio S2, a frontier TTS model with fine-grained prosody & emotion control via natural-language inline tags. SGLang Day-0 support is now live! 🏆 Best WER on Seed-TTS Eval; 81.88% win rate on EmergentTTS-Eval 🎙️ Voice cloning with 86.4% prefix-cache hit rate via RadixAttention ⚡️ RTF 0.34, 63.3 tok/s on single H200 (single batch) 🌍 Trained on 10M+ hours of audio across ~100 languages, GRPO-aligned 🔧 Dual-AR (Slow + Fast AR) is LLM-isomorphic: continuous batching, paged KV cache & CUDA graphs inherited natively 🗣️ Native multi-speaker: turn-taking, interruptions & cross-speaker emotion in a single pass 👉Cookbook: 👉Blog: 🎬 Curious how to run with SGLang? Check out this voice cloning demo from Chayenne Zhao with Fishaudio-S2-Pro:

LMSYS Org

39,939 views • 3 months ago

StepAudio 2.5 TTS is live now! Control emotion, pacing, pauses, and delivery with plain natural language. No tags, no preset combos. Just describe what you want the voice to do. Zero-shot voice cloning with full timbre + emotion control. Available via Pay-as-you-go API or Step Plan.

StepFun

17,024 views • 2 months ago