End to End Speech models are on fire - LLAMA-OMNI 8B - Apache licensed! 🔥
> Speech Encoder - Whisper Large v3
> LLM backbone - Llama 3.1 8B Instruct
> Speech Decoder - HuBERT (UnitY)
> Simultaneously generate Speech + Text
> Less than 250 ms latency
> ...

Vaibhav (VB) Srivastav

50,123 subscribers

47,921 views • 1 year ago • via X (Twitter)

Science & Technology

10 Comments

Vaibhav (VB) Srivastav • 1 year ago

Model checkpoint:

Vaibhav (VB) Srivastav • 1 year ago

Github repo:

Qingkai Fang • 1 year ago

Thanks for sharing our work!

Vaibhav (VB) Srivastav • 1 year ago

🔥

Tommy D. Rossi • 1 year ago

I wouldn't call this end-to-end; let's keep that term for single multimodal models that do everything by themselves.

ThisAndThat • 1 year ago

less than 250ms latency on what?

Vaibhav (VB) Srivastav • 1 year ago

Time to first audio chunk, according to their GitHub.

Waifuology • 1 year ago

License looks good, but the voice quality isn't really there yet.

Hiro • 1 year ago

Do you know what the supported languages are?

Trying my best :-) • 1 year ago

Can it detect emotion?

Related Videos

Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥 > WhisperSpeech X Llama 3.1 8B > Trained on 50K hours of speech (7 languages) > Continually trained on 45hrs 10x A1000s > MLS -> WhisperVQ tokens -> Llama 3.1 > Instruction tuned on 1.89M samples > 70% speech, 20% transcription, 10% text > Apache 2.0 licensed ⚡ Architecture: > WhisperSpeech/ VQ for Semantic Tokens > Llama 3.1 8B Instruct for Text backbone > Early fusion (Chameleon) I'm super bullish on Homebrew to Menlo and early fusion, audio and text, multimodal models! (P.S. Play with the demo on Hugging Face)

Vaibhav (VB) Srivastav

82,126 views • 1 year ago

NEW: Kokoro 82M - APACHE 2.0 licensed, Text to Speech model, trained on < 100 hours of audio 🔥

Vaibhav (VB) Srivastav

330,034 views • 1 year ago

Pretty Insane - SoTA Text to Speech model capable of English AND Hindi - 3B Llama backbone - Apache 2.0 licensed 🔥 > Sub 80 ms latency > Supports both English, Hindi including code-mix > Runs in a free google colab too 🤯 Best part: They're actively working on other languages like Tamil, Telugu, Bengali, etc > Available on Hugging Face hub, powered by Transformers 💥

Vaibhav (VB) Srivastav

33,749 views • 1 year ago

Wow! New Speech to Speech model - Fish Agent v0.1 3B by Fish Audio 🔥 > Trained on 700K hours of multilingual audio > Continue-pretrained version of Qwen-2.5-3B-Instruct for 200B audio & text tokens > Zero-shot voice cloning > Text + audio input/ Audio output > Ultra-fast inference w/ 200ms TTFA > Models on the Hub & Finetuning code on its way! 🚀 What an amazing time to be alive 🤗

Vaibhav (VB) Srivastav

66,963 views • 1 year ago

CPU Demo In Action: Meta Llama 3 8b Instruct With AMD Ryzen AI

AMD

39,130 views • 2 years ago

Mini-Omni 2 understands image, audio and text inputs all via end-to-end voice conversations with users 🔥 > Understands and processes images, speech, and text > Generates real-time speech responses > Supports interruptions during speech Technical Overview: > Concatenates image, audio, and text features for input. > Uses text-guided delayed parallel output for real-time speech > Involves encoder adaptation, modal alignment, and multimodal fine-tuning Best part: MIT licensed ⚡

Vaibhav (VB) Srivastav

45,409 views • 1 year ago

we’re lying on the moon. here are a few models glued together — speech to text (OpenAI whisper) to text (OpenAI gpt-3.5-turbo) to speech () — to recreate samantha from “her” in real-time.

harley turan

506,190 views • 3 years ago

The new open-source Text to Speech model: Fish Speech 1.4 is brilliant! Trained on a massive 700K hours of multilingual speech data in 8 languages - Instant voice cloning 🗣️ - Ultra-low latency ⚡ - Compact model (~1GB weights) 🏋️‍♂️

Rohan Paul

228,836 views • 1 year ago

Introducing SeamlessM4T, the first all-in-one, multilingual multimodal translation model. This single model can perform tasks across speech-to-text, speech-to-speech, text-to-text translation & speech recognition for up to 100 languages depending on the task. Details ⬇️

AI at Meta

592,704 views • 2 years ago

Kyutai released their Streaming Text to Speech model, ~2B param model, ultra low latency (220ms), CC-BY-4.0 license 🔥 Trained on 2.5 Million Hours of audio, it can serve up to 32 users w/ less than 350ms latency on a SINGLE L40 🤯 Incredible release by kyutai folks, go check out their hugging face page now!

Vaibhav (VB) Srivastav

93,512 views • 11 months ago

Introducing Universal-1: Our most powerful and accurate Speech AI model yet. ✅ Trained on 12.5M hours of multilingual speech data ✅ 13.5% more accurate than models like Whisper ✅ Up to 30% fewer hallucinations than seq2seq models ✅ Just 38 seconds to process 1 hour of audio

AssemblyAI

23,511,267 views • 2 years ago

Here are the best practices for using Eleven v3 (alpha) - the most expressive Text to Speech model.

ElevenLabs

43,692 views • 1 year ago

We’ve never seen a 7B model get the famous Sally test right, let alone a 3.8B model. 💪Even the Llama 3 8B Instruct model fails this test. Coming 🔜 to your 📱

Private LLM

36,953 views • 2 years ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

Real-time AI conversations are here! PlayHT, one of the best text-to-speech models I’ve used, now has a latency of less than 300ms. Check out how fast it outputs audio 🤯 Also, my experience cloning my own voice and links to try it for free are below.

Alvaro Cintas

335,344 views • 2 years ago

🔥 Ming-UniAudio: The 「Nano Banana」moment for speech is here! A single model for universal understanding, generation & free-form editing. First Unified Continuous Tokenizer 「MingTok-Audio」and Unified Und & Gen Speech LLM built on it. First Universal Free-form Speech Editing Model Without Timestamp Condition. First FreeForm Audio Editing Benchmark. 💻 GitHub: 🤗 Tokenizer: 🤗 Model: base: edit: 🤗 Benchmark: 🌍 blog: #AI #Speech #SpeechLLM #LLM #GenerativeAI #Audio #ASR #TTS #SpeechEditing

Ant Ling

21,365 views • 8 months ago

Our Llama-3.1-Nemotron-70B-Instruct model is a leading model on the 🏆 Arena Hard benchmark (85) from Arena. Arena Hard uses a data pipeline to build high-quality benchmarks from live data in Chatbot Arena, and is known for its predictive ability of Chatbot Arena Elo score as well as separability between helpful and less helpful models. Use our customized model Llama-3.1-Nemotron-70B to improve the helpfulness of LLM generated responses in your applications. 📥 Try on our API catalog: 📥 On GitHub: 📥 Or on Hugging Face:

NVIDIA AI Developer

140,699 views • 1 year ago

Introducing Kitten TTS, a SOTA tiny text-to-speech model - Just 15M parameters - Runs without a GPU - Model size less than 25 MB - Multiple high-quality voices - Ultra-fast - even runs on low-end edge devices Github and HF links below

Divam Gupta

348,716 views • 10 months ago