Meet MoshiVis🎙️🖼️, the first open-source real-time speech model that can talk about images! It sees, understands, and talks about images, naturally and out loud. Voice interaction with a compact model endowed with visual understanding opens up new applications, from audio description for the visually impaired to visual access...

🧠 How it works
MoshiVis builds on Moshi, our speech-to-speech LLM, now enhanced with vision. It adds 206M lightweight parameters on top of a frozen Moshi, which lets it discuss images while remaining real-time on consumer-grade hardware. For example, on a Mac Mini M4, MoshiVis adds only ~7ms per step to the ~45ms per step of the base model, staying well below the 80ms threshold for our audio codec. That means fluid, live, multimodal conversations with Moshi on your own device!
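To make the latency numbers above concrete, here is a minimal sketch of the per-step budget arithmetic. It uses only the approximate figures quoted in this post; the function and variable names are illustrative assumptions and are not part of the MoshiVis codebase.

```python
# Approximate figures quoted in the post (Mac Mini M4).
# All names below are hypothetical and chosen only for this illustration.
BASE_STEP_MS = 45.0        # ~per-step latency of the base Moshi model
VISION_OVERHEAD_MS = 7.0   # ~extra per-step latency added by MoshiVis
FRAME_BUDGET_MS = 80.0     # per-step threshold for the audio codec


def step_budget(base_ms: float, overhead_ms: float, budget_ms: float) -> dict:
    """Return total step latency, remaining headroom, and real-time factor."""
    total = base_ms + overhead_ms
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "real_time_factor": total / budget_ms,  # below 1.0 means real time
        "is_real_time": total < budget_ms,
    }


report = step_budget(BASE_STEP_MS, VISION_OVERHEAD_MS, FRAME_BUDGET_MS)
print(report)
```

With these figures each step takes roughly 52ms, which leaves about 28ms of headroom under the 80ms threshold.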

🧰 Fully open-source
We're releasing a detailed preprint, along with model weights and a first-of-its-kind benchmark dataset for spoken visual question answering:
📄 Preprint
🧠 Speech Benchmarks
🧾 Model weights
🧪 Inference code in PyTorch, MLX, and Rust

If you want to work on cutting-edge research, join our non-profit AI lab in Paris 🇫🇷
Thanks to Iliad Group, CMA-CGM Group, Schmidt Sciences, and the open-source community.


More lightweight! Sweet

share the repo with us ❤

Tried it out, very good performance for everyday tasks!

Ridiculous. Your AI still can't speak any language other than English and doesn't even work decently. What a shame.

wow this is very cool!

Holey moley, it's the AI from "Start-up" in the real world!!

Local not working. Despite enabling the mic and trying out the web UI, it does not speak. I wonder if it can even hear me.
