Meet MoshiVis 🎙️🖼️, the first open-source real-time speech model that can talk about images! It sees, understands, and talks about images, naturally and out loud. Voice interaction with a compact model endowed with visual understanding opens up new applications, from audio description for the visually impaired to visual access...

47,924 views • 1 year ago • via X (Twitter)

11 Comments

kyutai • 1 year ago

🧠 How it works: MoshiVis builds on Moshi, our speech-to-speech LLM, now enhanced with vision. Just 206M lightweight parameters added on top of a frozen Moshi let it discuss images while remaining real-time on consumer-grade hardware. For example, on a Mac mini M4, MoshiVis adds only ~7ms per step to the ~45ms per step of the base model, for ~52ms in total, well below the 80ms threshold imposed by our audio codec. That means fluid, live, multimodal conversations with Moshi on your own device!
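The latency claim above is simple per-step arithmetic. This minimal Python sketch reproduces it, assuming the 80ms threshold is the duration of one audio-codec frame, so each generation step must finish within that window to keep up with live audio. The function `latency_report` and its field names are hypothetical and chosen only for illustration; the millisecond figures are the approximate Mac mini M4 numbers quoted in the post, not new measurements.

```python
# Hypothetical sketch: per-step latency budget for a streaming speech model.
# Assumption: 80 ms = one audio-codec frame = the time budget for one step.

CODEC_FRAME_MS = 80.0      # assumed per-step real-time budget (one codec frame)
BASE_STEP_MS = 45.0        # approx. per-step cost of the frozen base model (from the post)
VISION_OVERHEAD_MS = 7.0   # approx. extra per-step cost of the vision parameters (from the post)


def latency_report(base_ms: float, overhead_ms: float, budget_ms: float) -> dict:
    """Return total step time, remaining headroom, and real-time factor."""
    total = base_ms + overhead_ms
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        # A real-time factor below 1.0 means each step finishes before the next frame is due.
        "real_time_factor": total / budget_ms,
        # Number of steps that must complete every second to keep pace with the codec.
        "steps_per_second": 1000.0 / budget_ms,
        "is_real_time": total < budget_ms,
    }


report = latency_report(BASE_STEP_MS, VISION_OVERHEAD_MS, CODEC_FRAME_MS)
print(report)
```

With these inputs the sketch gives 52ms per step, 28ms of headroom, and a real-time factor of 0.65, i.e. comfortably inside the 12.5 steps per second that 80ms frames require.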

kyutai • 1 year ago

🧰 Fully open-source: we’re releasing a detailed preprint, model weights, and a first-of-its-kind benchmark dataset for spoken visual question answering:
📄 Preprint
🧠 Speech Benchmarks
🧾 Model weights
🧪 Inference code in PyTorch, MLX, and Rust

kyutai • 1 year ago

If you want to work on cutting-edge research, join our non-profit AI lab in Paris 🇫🇷. Thanks to Iliad Group, CMA-CGM Group, Schmidt Sciences, and the open-source community.

AssemblyAI • 1 year ago

Announcing: Our most advanced speech-to-text model goes beyond accuracy to capture the real-world complexity of human conversation and deliver reliable, source-of-truth audio data. Explore Universal-2 updates 👇

ZOHEB • 1 year ago

More lightweight! Sweet

Utopic e/λ • 1 year ago

share the repo with us ❤

Danish Khan • 1 year ago

Tried it out, very good performance for everyday tasks!

Frédéric H. (E/ACC) • 1 year ago

Ridiculous. Your AI still can't speak any language other than English and doesn't even work decently. What a shame.

haareblond • 1 year ago

wow this is very cool!

Mogomra (e/acc) • 1 year ago

Holey moley, it's the AI from "Start-up" in the real world!!

X_Learning969 • 1 year ago

Local setup not working. Despite enabling the mic and trying out the web UI, it does not speak. I wonder if it can even hear me.
