Better/faster/cheaper voice AI turn detection with Gemini 2.0

The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the voice AI responds...

64,143 views • 1 year ago • via X (Twitter)

Comments: 11

kwindla • 1 year ago

The demo code from the video is here:

Phrase endpointing is an engineering problem, so there are trade-offs and the right solution will vary depending on the use case. Today, most voice AI agents use only voice activity detection, which is (1) in the list above. My prediction for 2025 is that we will generally move to (3) and (4): native audio endpointing, either using small, specialized models or leveraging SOTA LLMs.

The advantage of specialized models is that they are small enough and fast enough to run in-line as part of the audio processing pipeline. The phrase endpointing model controls an audio buffer. When the model predicts that the user is finished speaking, the buffer is sent to the LLM for inference. A model like this replaces the VAD model in today's typical voice agent pipeline.
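To make the buffer idea concrete, here is a minimal sketch in Python. It is not the demo code from the video. The class name `EndpointingBuffer` and the `is_turn_complete` callable are hypothetical stand-ins for a real endpointing model.

```python
from typing import Callable, List, Optional


class EndpointingBuffer:
    """Accumulates audio chunks until a predictor says the user is done."""

    def __init__(self, is_turn_complete: Callable[[bytes], bool]) -> None:
        # Hypothetical stand-in for a small native-audio endpointing model.
        self._is_turn_complete = is_turn_complete
        self._chunks: List[bytes] = []

    def push(self, chunk: bytes) -> Optional[bytes]:
        """Add a chunk; return the whole buffered turn once it is complete."""
        self._chunks.append(chunk)
        audio = b"".join(self._chunks)
        if self._is_turn_complete(audio):
            self._chunks.clear()
            return audio  # the caller would send this to the LLM
        return None  # keep buffering
```

In a real pipeline the predictor would run on audio frames rather than on the toy byte strings used to exercise this sketch.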

kwindla • 1 year ago

The advantages of using a full-sized LLM like Gemini are:

➕ Flexibility. Gemini is good at a wide range of tasks. It operates in 38 languages and can switch between languages seamlessly. You can prompt Gemini to understand speech patterns for specific tasks. (The phone number input rules in the demo video above are "just part of the prompt.")

➕ Iteration speed. Testing a prompt change is much faster than fine-tuning a model.

➕ Cost. Gemini 2.0 Flash pricing hasn't been announced yet. The model is still an experimental preview. But using Gemini 1.5 Flash pricing as a benchmark, Gemini is actually cheaper than using a traditional transcription API or model. This is true even though we are making three greedy inference calls to Gemini for every candidate audio chunk. (!!) I had to do this math several times to convince myself I wasn't making a mistake.

Did you catch that about calling Gemini three times for each audio chunk? Here's the algorithm:

1. A VAD model (set to use a short pause interval) segments the input audio into chunks.
2. As soon as the VAD fires, we make three parallel calls to Gemini: for phrase endpointing, to transcribe the audio, and to perform the normal conversation inference.
3. We gate the conversation output until we get an answer back from the endpointing call. If Gemini determines that the user is finished speaking, we open the gate and send output to the user. If Gemini says nope, we throw the transcription and LLM output away.
4. We use only the user's most recent input audio in each conversation inference request. Sending audio to Gemini is really nice, because Gemini can process all of the nuances of the user's speech. But audio uses a lot of tokens. Sending a lot of tokens to the LLM increases both cost and latency. So after we use the audio once, we replace it with the transcription for subsequent conversation turns.

Again, these three calls to Gemini are actually cheaper than using a traditional API or model for transcription. And Gemini's transcription benchmarks are very good, easily on par with dedicated transcription models.

This pipeline is also fast, because we're doing all of the inference in parallel. The classifier outputs a single token, so it almost always finishes about the same time that the conversation inference delivers its first response chunk.

There's some complexity here. I used nested parallel pipelines to implement this as a @pipecat_ai example. This works so well, though, that I think the complexity trade-off is worth it for many use cases. I expect that one or more Pipecat contributors will wrap this demo logic into a nicely encapsulated processor by the time Gemini 2.0 is GA.
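The gating step can be sketched with plain `asyncio`. This is only an illustration, not the Pipecat example itself. The function `handle_chunk` and its three callables (`classify`, `transcribe`, `converse`) are hypothetical placeholders for the three Gemini requests, and the conversation reply comes back as one string here rather than being streamed.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple

AudioToText = Callable[[bytes], Awaitable[str]]


async def handle_chunk(
    audio: bytes,
    classify: Callable[[bytes], Awaitable[bool]],  # endpointing call
    transcribe: AudioToText,                       # transcription call
    converse: AudioToText,                         # conversation call
) -> Optional[Tuple[str, str]]:
    """Run all three requests in parallel and gate on the classifier."""
    endpoint_task = asyncio.create_task(classify(audio))
    transcript_task = asyncio.create_task(transcribe(audio))
    reply_task = asyncio.create_task(converse(audio))

    if not await endpoint_task:
        # The user is not finished: discard the other two results.
        transcript_task.cancel()
        reply_task.cancel()
        await asyncio.gather(transcript_task, reply_task, return_exceptions=True)
        return None

    # Gate open: return the transcript (for later turns) and the reply.
    transcript, reply = await asyncio.gather(transcript_task, reply_task)
    return transcript, reply
```

Returning the transcript alongside the reply matches step 4: the caller can store the text in the conversation history in place of the audio.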

kwindla • 1 year ago

This is part 5 of our series ending/beginning the year: 25 demos for 2025. We're building lots of fun multimodal, conversational AI examples with @pipecat_ai and @googledevs Gemini. Check back most days for more demos. Thank you to @cartesia_ai for the voice in the demo video. Credit to @mark_backman for doing the heavy lifting on the phrase endpointing classification prompt.

Greg Caplan 🚀 • 2 years ago

Stop wasting time following up with leads. Let our AI agents do it for you.

Enrique • 1 year ago

have you checked LiveKit’s open source end of turn model?

Alexander Chen • 1 year ago

Ooh great demonstration. I haven't thought too much about the subtleties of this tricky problem before.

Dan Goodman 🍊 • 1 year ago

This terminal theme is diabolical

kwindla • 1 year ago

Mwa ha ha.

Manpreet Singh • 1 year ago

Amazing. Gemini 2.0 seems to have these almost "hidden gem" capabilities that standard benchmarks don't detect. It's a very useful and cheap model from Google.

Dominic Nyambane • 1 year ago

Nice one, definitely trying pipecat out now. been trying to find the perfect solution for my product. Tried the simple voice demo on your github readme once, but the fact i had to get many provider api_keys led to me having to pause first 😅. But am definitely going all in now

kwindla • 1 year ago

> Tried the simple voice demo on your github readme once, but the fact i had to get many provider api_keys led to me having to pause first 😅.

Totally understand! There are probably too many dependencies in most of the getting started resources I post. A blind spot if you work on a project every day. Here's a getting started that only requires a Google AI Studio API key and a Daily API key.

Related videos

Voice AI turn taking is a solved problem.

The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.)

Mark Backman made a Pipecat AI PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it. The approach combines three layers of processing:

1. Voice activity detection, with a short (200ms) trigger.
2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed.
3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context.

None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection.

Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response.

- ✓ means the agent should respond normally (immediately)
- ○ is a "short incomplete": the agent should wait 5 seconds
- ◐ is a "long incomplete": the agent should wait 10 seconds

The wait times, and the details of the prompt, are configurable, of course.

Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful.

The LLM in the video is GPT-4.1. We've tested the prompt and single-token adherence with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.
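On the consuming side, single-token tagging could be handled roughly like this. It is a sketch under assumptions, not the Pipecat mixin: the function name `split_turn_tag` is hypothetical, the wait times are the defaults quoted in the post, and treating an untagged response as a normal reply is my own fallback choice.

```python
from typing import Tuple

# Wait times in seconds, as quoted in the post (described there as configurable).
TAG_WAIT_SECONDS = {"✓": 0.0, "○": 5.0, "◐": 10.0}


def split_turn_tag(response: str) -> Tuple[float, str]:
    """Return (wait_seconds, text) for a response that starts with a tag."""
    text = response.lstrip()
    tag = text[:1]
    if tag not in TAG_WAIT_SECONDS:
        # Assumption: an untagged response is treated as a normal reply.
        return 0.0, text
    # Strip the tag so it is never sent to speech generation.
    return TAG_WAIT_SECONDS[tag], text[1:].lstrip()
```

Because the tag is the first character of the response, a streaming consumer can act on it as soon as the first chunk arrives, which is where the near-zero trigger latency comes from.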

kwindla

26,812 views • 3 months ago

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases.

Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models.

The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower. Which is a problem for conversational voice agents.

We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms faster!

On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.)

A couple of other things that this speed buys us, for production voice agents:

- Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design.
- We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn.

And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode. Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is apropos. :-)
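As one possible shape for the "structured data first, plain text after" idea, here is a small sketch. The wire format is an assumption made for illustration (a single JSON object on the first line, then the spoken text), and the function name `split_structured_prefix` and the `tts_style` key are hypothetical. The post does not specify a format.

```python
import json
from typing import Any, Dict, Tuple


def split_structured_prefix(response: str) -> Tuple[Dict[str, Any], str]:
    """Split a response into (metadata, spoken_text).

    Assumed format (hypothetical): the first line is a JSON object and
    everything after the first newline is plain text for voice generation.
    """
    first_line, _, rest = response.partition("\n")
    try:
        meta = json.loads(first_line)
    except json.JSONDecodeError:
        # No structured prefix: treat the whole response as spoken text.
        return {}, response.strip()
    if not isinstance(meta, dict):
        return {}, response.strip()
    return meta, rest.strip()
```

For example, `split_structured_prefix('{"tts_style": "warm, slow"}\nHappy to help.')` separates a style hint for a steerable TTS model from the text to be spoken.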

kwindla

40,282 views • 16 days ago