End to End Speech models are on fire - LLAMA-OMNI 8B - Apache licensed! 🔥
> Speech Encoder - Whisper Large v3
> LLM backbone - Llama 3.1 8B Instruct
> Speech Decoder - HuBERT (UnitY)
> Simultaneously generate Speech + Text
> Less than 250 ms latency
>...

47,921 views • 1 year ago • via X (Twitter)

10 Comments

Vaibhav (VB) Srivastav • 1 year ago

Model checkpoint:

Vaibhav (VB) Srivastav • 1 year ago

Github repo:

Qingkai Fang • 1 year ago

Thanks for sharing our work!

Vaibhav (VB) Srivastav • 1 year ago

🔥

Tommy D. Rossi • 1 year ago

I wouldn't call this end-to-end. Let's keep that term for single multimodal models that do everything by themselves.

ThisAndThat • 1 year ago

Less than 250 ms latency on what?

Vaibhav (VB) Srivastav • 1 year ago

Time to first audio chunk according to their GH.

Waifuology • 1 year ago

License looks good, but the voice quality isn't really there yet.

Hiro • 1 year ago

Do you know what the supported languages are?

Trying my best :-) • 1 year ago

Can it detect emotion?
