kyutai

@kyutai_labs • 26,621 subscribers

Shorts

We're releasing MuScriptor, the best open model for multi-instrument transcription to date, created in collaboration with Mirelo. Give it a recording in any genre: pop, classical, metal, jazz, whatever, and it transcribes the individual instruments into MIDI. Link in 🧵

294,229 views

Today we’re introducing MIRA, a new multiplayer world model, built with General Intuition, in collaboration with Epic Games. We are releasing an in-depth technical report, a dataset, and an online demo that you can try right now (link below).

72,388 views

🎰 Welcome to the FID Lottery. We pulled the lever 25 times on the same machine. Identical diffusion model, identical ImageNet class-cond recipe, only the seed changed. The house paid out anywhere from 33.59 to 35.69 FID. A 2.1-point spread, pure luck. Step onto the floor 👇🧵

23,407 views

We're releasing OVIE, a novel view generation model trained entirely on single images. No multi-view datasets needed. Given a single image, it generates novel views of any scene in real time, running orders of magnitude faster than competing approaches.

30,957 views

The frontier of world models and agents is overwhelmingly in Europe. Kyutai is excited to partner with General Intuition to continue pushing that edge through open science.

49,739 views

Videos

We’re excited to introduce Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required. Open-source, lightweight, and incredibly fast. 🧵👇

237,623 views • 6 months ago

New paper: Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models We use RL to post-train speech models (Moshi and PersonaPlex) to talk more like a human: to know when to respond, when to wait, and when to nod along with “yeah”s and “okay”s when listening.

32,055 views • 1 month ago

Speech-native models like Moshi sound great and answer fast, but aren’t as smart as text LLMs. In our new paper, MoshiRAG, we show how Moshi can ask for advice from a text LLM or a knowledge base. The tricky part is how to do this in real time without adding latency. 🧵

52,598 views • 2 months ago

Kyutai TTS and Unmute are now open source! The text-to-speech is natural, customizable, and fast: it can serve 32 users with a 350ms latency on a single L40S. Try it out and get started on the project page:

171,625 views • 1 year ago

We're releasing Hibiki-Zero, a new real-time and multilingual speech translation model that can translate 🇫🇷French, 🇪🇸Spanish, 🇵🇹Portuguese and 🇩🇪German to English: accurate, low-latency, high audio quality, with voice transfer. And best of all: open-source.

80,693 views • 5 months ago

Meet Hibiki, our simultaneous speech-to-speech translation model, currently supporting 🇫🇷➡️🇬🇧. Hibiki produces spoken and text translations of the input speech in real-time, while preserving the speaker’s voice and optimally adapting its pace based on the semantic content of the source speech. Based on objective and human evaluations, Hibiki outperforms previous systems for quality, naturalness and speaker similarity and approaches human interpreters. 🧵

167,511 views • 1 year ago

Pocket TTS goes multilingual! Now you can use our 100M-parameter models to generate speech in six languages, fast enough to run real-time without a GPU. We also improved the quality of the English model while keeping the same size. And all of this is open-source.

29,601 views • 2 months ago

Moshi and Alex going on a space adventure 🚀

159,030 views • 2 years ago

Yesterday we introduced Moshi, the lowest latency conversational AI ever released. Moshi can perform small talk, explain various concepts, engage in roleplay in many emotions and speaking styles. Talk to Moshi here and learn more about the method below 🧵.

110,301 views • 2 years ago

Kyutai Speech-To-Text is now open-source! It’s streaming, supports batched inference, and runs blazingly fast: perfect for interactive applications. Check out the details here:

66,417 views • 1 year ago

Meet MoshiVis🎙️🖼️, the first open-source real-time speech model that can talk about images! It sees, understands, and talks about images, naturally and out loud. Voice interaction with a compact model endowed with visual understanding opens up new applications, from audio description for the visually impaired to visual access to information. Try it out 👉 Blog post 👉

47,949 views • 1 year ago

Unmute turns a text LLM into a voice AI. At its core is Mistral AI's Mistral-Small-3.2-24B, making it fully open-source. Play a quiz game with a snarky host, catch up on tech news, or just hang out and talk. Or modify it to do anything you want!

24,900 views • 1 year ago

What are we waiting for? 🤔

27,905 views • 1 year ago

Thanks to Xavier Niel for stopping by at the #AIActionSummit to try Hibiki. No need to struggle with English anymore 😅

24,606 views • 1 year ago

With Invincible Voice, we help people living with ALS communicate more easily. Meeting Olivier Goy, an entrepreneur who lives with ALS and fights relentlessly to help all patients, made it obvious that our cutting-edge voice AI should help. We turned our Unmute voice wrapper into a new system that 1/ transcribes the interlocutor’s speech in real time, 2/ suggests relevant responses via a personalised language model, 3/ speaks the patient's chosen response in their own voice (using 10s pre-disease speech recordings). True to our philosophy, we are open-sourcing Invincible Voice so that developers can refine the prototype, port it from French to other languages, adapt it to other conditions (aphasia, neurodegenerative diseases) and turn it into a deployable product. Its modularity also lets it leverage technologies developed by Gradium, which supports Invincible Voice by granting it free access to its multilingual speech models.

11,070 views • 6 months ago

Have you enjoyed talking to 🟢Moshi? Have you dreamt of making your own speech to speech chat experience🧑‍🔬🤖 ? It's now possible with the moshi-finetune codebase! Plug your own dataset and change the voice, the tone and the personality of Moshi 💚🔌💿. Here's an example after finetuning w/ only 20 hours from the public DailyTalk dataset. 🧵

20,059 views • 1 year ago

Using Unmute with a custom voice and prompt to create a very intense ice cream seller, inspired by Justin Kuritzkes' sketch🍦

11,784 views • 1 year ago

Even KAVINSKY✨ 🎧🪩 can't break Hibiki! Just like Moshi, Hibiki is robust to extreme background conditions 💥🔊.

11,851 views • 1 year ago

"Hippie" Moshi declares its love for Hendrix... but "skeptical" Moshi is less enthusiastic about psychedelic rock. Moshi can play 70+ emotions. Will you catch them all? Try now at

13,606 views • 2 years ago
