Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases.

Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 to 1,000 tokens per second (end-to-end throughput), with time to first token as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models.

The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token."

This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower, which is a problem for conversational voice agents.

We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled.

On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms faster!

On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and is tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 under 4s.)

A couple of other things that this speed buys us, for production voice agents:
- Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design.
- We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn.

And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode.

Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is apropos. :-)
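The "structured data first, then plain text" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the post: it assumes the model has been prompted to emit a single-line JSON object, then a newline, then the text to be spoken. The function name `split_structured_prefix` and the `"style"` field are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple, Union

Event = Tuple[str, Union[Dict[str, Any], str]]


def split_structured_prefix(chunks: Iterable[str]) -> Iterator[Event]:
    """Split a streamed response into one ("meta", dict) event followed by
    ("speech", text) events.

    Assumed (hypothetical) response format: one line of JSON, a newline,
    then plain text intended for the TTS stage.
    """
    buffer = ""
    meta_done = False
    for chunk in chunks:
        if meta_done:
            # Metadata already handled: forward text straight to speech.
            if chunk:
                yield ("speech", chunk)
            continue
        buffer += chunk
        if "\n" not in buffer:
            continue  # still accumulating the first line
        first_line, rest = buffer.split("\n", 1)
        meta_done = True
        try:
            meta = json.loads(first_line)
            if not isinstance(meta, dict):
                raise ValueError("prefix is not a JSON object")
            yield ("meta", meta)
        except ValueError:
            # The model ignored the format: speak everything received.
            yield ("meta", {})
            rest = buffer
        if rest:
            yield ("speech", rest)
    if not meta_done and buffer:
        # Stream ended before any newline: treat it all as speech.
        yield ("meta", {})
        yield ("speech", buffer)


if __name__ == "__main__":
    stream = ['{"style": "wa', 'rm"}\nHello', " there."]
    for kind, payload in split_structured_prefix(stream):
        print(kind, payload)
```

Because the metadata arrives before any spoken text, a pipeline built this way could configure the TTS stage for the turn before the first audio is generated. The fallback branches keep the agent talking even when the model does not follow the format.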

kwindla

14,689 subscribers

40,319 views • 1 month ago •via X (Twitter)

Related Videos

Anthropic's in trouble, again. The entire Claude experience is now available at 1/6th the price.

Kimi now does everything Claude does, powered by K2.6, a 1-trillion-parameter MoE model that activates only 32B parameters per token. It covers all three features Claude has (Chat, Code, and Cowork):

1) Kimi Chat runs in four modes:
- Instant for fast responses
- Thinking for deep reasoning
- Agent for multi-step execution
- Agent Swarm for parallel workloads
There's a 262K context window across all of them.

2) Kimi Code is the open-source CLI coding agent with K2.6 as the default backend. K2.6 ranked #1 on OpenRouter's programming leaderboard by weekly usage.

3) Kimi Agent is the Cowork equivalent. It generates:
- full websites with database and auth
- presentation decks (editable PPTX output)
- spreadsheets with formulas and charts
- word docs and structured research reports

On top of this, Kimi K2.6 is also trained to decompose tasks into up to 300 parallel sub-agents. This helps it retain coherence even across 4,000+ tool calls in a single run, with sessions sustaining up to 13 hours.

On SWE-Bench Pro:
- Kimi K2.6 → 58.6
- GPT-5.4 xhigh → 57.7
- Gemini 3.1 Pro → 54.2
- Claude Opus 4.6 → 53.4

Kimi K2.6 model is open weights and self-hostable on 4x H100s in INT4. Find the link to the HuggingFace model page in the replies!

Avi Chawla

108,824 views • 1 month ago

A voice agent powered by gpt-oss. Running locally on my MacBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The default is "medium".) Notes on how to do that and a jinja template you can use are in the repo. The LLM in the demo video is the big, 120B version of gpt-oss. You can use the smaller, 20B model for this, of course. But OpenAI really did a cool thing here designing the 120B model to run in "just" 80GB of VRAM. And the llama.cpp mlx inference is fast: ~250ms TTFT. Running a big model on-device feels like a time warp into the future of AI.

kwindla

202,113 views • 10 months ago

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms.

This agent actually uses *three* NVIDIA open source models:
- Nemotron Speech ASR
- Nemotron 3 Nano 30GB in a 4-bit quant (released in December)
- A preview checkpoint of the upcoming Magpie text-to-speech model

These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.)

The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

kwindla

274,306 views • 5 months ago

Kimi Code is good at video reasoning with Kimi K2.6.🎬 Drag in reference videos, ask about colors, shots, or visual style, and Kimi Code can generate a ready-to-use .cube LUT file.🎞️

Kimi Developers

13,574 views • 19 days ago

Voice AI turn taking is a solved problem.

The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.)

Mark Backman made a Pipecat AI PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it. The approach combines three layers of processing:
1. Voice activity detection, with a short (200ms) trigger.
2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed.
3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context.

None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection.

Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response.
- ✓ means the agent should respond normally (immediately)
- ○ is a "short incomplete" - the agent should wait 5 seconds
- ◐ is a "long incomplete" - the agent should wait 10 seconds

The wait times, and the details of the prompt, are configurable, of course.

Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful.

The LLM in the video is GPT-4.1. We've tested the prompt and single-token adherence with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.
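The single-token tagging described above can be sketched as a small parser. This is an illustration only, not the Pipecat mixin's actual code: `parse_turn_tag`, `TurnDecision`, and `WAIT_SECONDS` are hypothetical names, and the wait times are simply the 5 and 10 second values mentioned in the post.

```python
from dataclasses import dataclass

# Tag characters from the post, mapped to how long the agent should wait
# before speaking. The values are the ones quoted in the post; a real
# pipeline would make them configurable.
WAIT_SECONDS = {
    "✓": 0.0,   # complete turn: respond immediately
    "○": 5.0,   # "short incomplete"
    "◐": 10.0,  # "long incomplete"
}


@dataclass
class TurnDecision:
    wait_seconds: float
    text: str


def parse_turn_tag(response: str, default_wait: float = 0.0) -> TurnDecision:
    """Read the leading tag character and strip it from the response text.

    If the model did not emit a known tag, fall back to `default_wait`
    and keep the whole response (a hypothetical policy choice).
    """
    stripped = response.lstrip()
    if stripped and stripped[0] in WAIT_SECONDS:
        return TurnDecision(WAIT_SECONDS[stripped[0]], stripped[1:].lstrip())
    return TurnDecision(default_wait, stripped)


if __name__ == "__main__":
    print(parse_turn_tag("✓ Sure, go ahead."))
    print(parse_turn_tag("○"))
```

Because the tag is the first character of the output, the decision is available as soon as the first token streams in, which is what makes this a near-zero latency trigger compared with a tool call.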

kwindla

26,843 views • 4 months ago

Our latest speech-to-speech model is faster, more accurate, and excels at function calling. Watch @promptshant and Brian Fioca build a realtime voice agent that can search the web and hand off tasks to reasoning models with full context.

OpenAI Developers

81,822 views • 1 year ago

OpenAI shipped a new speech-to-speech model today: gpt-realtime-2. This is the first speech-to-speech model good enough to use in my voice agents that do "real work." Or real play, for that matter. Here's gpt-realtime-2 as the brain of the ship AI in Gradient Bang. The voice-to-voice response and tool calling times here are unedited, so you can see exactly what the interaction with the model is like in an agent with a very complex system instruction and frequent tool calls. (I did clip out the subagent task execution segments, after gpt-realtime-2 starts a subagent via a tool call. Subagents in this config used gpt-5.2 "medium" effort.)

kwindla

54,912 views • 1 month ago

Introducing GPT-Realtime-2 in the API: our most intelligent voice model yet, bringing GPT-5-class reasoning to voice agents. Voice agents are now real-time collaborators that can listen, reason, and solve complex problems as conversations unfold. Now available in the API alongside streaming models GPT-Realtime-Translate and GPT-Realtime-Whisper — a new set of audio capabilities for the next generation of voice interfaces.

OpenAI

3,620,717 views • 1 month ago

🧠 Chat with Reasoning A few days ago the DeepSeek team released an LLM with reasoning in various sizes. What we show here is an example of 1bl that can run on machines with low GPU power, like a mobile, but has enough power to answer complex questions. With these advanced models it is possible to link them with #IoT equipment to control information and use them in advanced control environments. All this under Open Source models and decentralized networks such as #Neurai #XNA $XNA #DeepSeek #Reasoning #AIchat

NeurAI Project / XNA

17,691 views • 1 year ago

Voice agents are awkward, and everyone notices: You ask a question. The agent thinks. You wait. And wait... Nobody wants this. I'd rather talk to a person. If your model's response time is over 300ms, you won't make it. Unfortunately, most text-to-speech models can't get anywhere close to that. I want you to take a look at the latest model released by Inworld AI: TTS-1.5. I built a simple voice agent using the model so you can see it in action and test it on your computer. You'll find the repository link below. The latency numbers of this model are wild: • Max model → under 250ms • Mini model → under 130ms That's 4x faster than prior generations and faster than human response times!

Santiago

69,410 views • 5 months ago

Pi agent is the Arch Linux of coding agents Qwen 3.6 Plus not caching? > No problem. Just ask Pi to patch itself from a community fork Kimi K2.6 imploding mid reasoning? > No problem. Just pull a snippet from a good Samaritan and let Pi test it on itself Now, I can use the reliable Qwen 3.6 Plus without busting my OpenCode Go sub, and the capable Kimi K2.6 without 404s The beauty of Pi is it can customize itself

raymel 👋

46,916 views • 1 month ago

🔥 Battle for the top reasoning LLM intensifies! The QwQ-32B-Preview is a very good reasoning LLM. Full video of my tests here:

Summary of my findings and thoughts:

It was able to solve a couple of hard math problems, so it looks very promising for maths. It didn't do so well on my coding task (generating a bash script). Judging by the results reported on LiveCodeBench, it has room for improvement.

One thing that's become very clear to me is that the reasoning capabilities of these LLMs are significantly closing the gap between open and closed-source models. The competition is now going to be on a different level, and it's going to be focused on which model produces the most efficient, optimized, accurate, and fastest reasoning steps, beyond just accurate responses. That's what developers will care about. Traditional benchmarks are not going to be good enough for this.

On that note, it's getting harder to assess these models, especially the consistency, efficiency, and quality of reasoning steps. After experimenting with this model, I realized that the reasoning paths are not fully optimized and there is a lot more optimization that needs to happen before these models are used in production settings. There might be a need to build some type of native and efficient self-assessment or self-reflection capability that prevents these reasoning LLMs from going in loops or producing unnecessarily lengthy sequences.

I also noticed that this model, at least in the HF demo, doesn't separate the reasoning from the response. I think that actually hurts the performance of the model. On the other hand, o1 and R1 do that really well. In addition to that, I believe the training on reasoning is hurting the performance of the LLM in other areas such as helpfulness (check the code example in the video).

Something that's necessary at the moment is validating or evaluating the quality of the reasoning chains and figuring out a better strategy to optimize them. Current methods are probably not sufficient to solve this problem, but that's where innovation will come next.

I recognize that this is a first effort, so kudos to the Qwen team on this release. These issues highlight the importance of transparency with reasoning LLMs. We need to know how it was trained and with what exact data or optimization strategy. Understanding that will enable researchers and developers to build better intuition and improve the reasoning capabilities and components at a faster rate. There is an opportunity for someone or a company to build a truly open reasoning LLM. The race is on!

I will continue to track the state-of-the-art in reasoning LLMs and report my takes and observations here. Stay tuned for more.

elvis

14,740 views • 1 year ago

Learn to build conversational AI voice agents in "Building AI Voice Agents for Production", created in collaboration with LiveKit and RealAvatar, and taught by dsa (Co-founder & CEO of LiveKit), Shayne (Developer Advocate, LiveKit), and Nedelina Teneva (Head of AI at RealAvatar, an AI Fund portfolio company). Voice agents combine speech and reasoning capabilities to enable real-time conversations. They're already being used to support customer service, to improve accessibility in healthcare, for entertainment applications, and for talk therapy. In this course, you’ll learn to build voice agents that listen, reason, and respond naturally. You’ll follow the architecture used to create the "AI Andrew" Avatar, a collaborative project between and RealAvatar that responds to users in what sounds like my voice. You’ll build a voice agent from scratch and deploy it to the cloud, enabling support for many simultaneous users. What you’ll learn: - Understand the fundamentals of voice agents, including key components like speech-to-text (STT), text-to-speech (TTS), and LLMs, and how latency is introduced at each layer. - Explore voice agent architectures and the trade-offs between modular pipelines and speech-to-speech APIs. - Explore how platforms like LiveKit mitigate latency issues with optimized networking infrastructure and low-latency communication protocols. - Learn how to connect client devices to voice agents using WebRTC—and why it outperforms HTTP and WebSocket for low-latency audio streaming. - Incorporate voice activity detection (VAD), end-of-turn detection, and context management to detect turns, handle interruptions, and manage conversational flow. - Understand the trade-offs between latency, quality, and cost in an example in which you build a voice agent and change its voice. - Equip your agent with metrics to measure latency at each stage of the voice pipeline and learn the key levers you can pull to make your agent faster and more responsive. 
The voice agents built in this course also incorporate voice technology from , a supporting contributor to the project. By the end of this course, you'll have learned the components of an AI voice agent pipeline, combined them into a system with low-latency communication, and deployed them on cloud infrastructure so it scales to many users. I’m looking forward to seeing what voice agents you build from this course! Please sign up here:

Andrew Ng

87,484 views • 1 year ago

🔊🔛🔥 ... Groq Inc launched voice generation today. GroqCloud now has realtime transcription, LLMs, *and* text-to-speech. You can build super-responsive, ultra low-latency voice agents end-to-end entirely on Groq! At Daily, we're big fans of Groq's fast, low-latency inference. Pipecat AI supports all the Groq models, including the new voice model. Lately, we've been obsessively playing a voice chat game that Mark Backman wrote. (My high score is 6, so far.) Here's Mark, with Groq's `Celeste-PlayAI` voice.

kwindla

31,791 views • 1 year ago

Better/faster/cheaper voice AI turn detection with Gemini 2.0

The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the voice AI responds before the user has finished their thought, the conversation is choppy and unproductive. If the AI waits too long, the conversation is slow and frustrating.

There are a number of ways to approach this. You can:
1. Use a fast "voice activity detection" model to detect pauses in speech. Respond when the user pauses.
2. Use a specialized phrase endpointing model that operates on transcribed text, pattern-matching on text semantics.
3. Train a specialized phrase endpointing model that operates directly on audio.
4. Leverage the native audio capabilities of a SOTA LLM like Gemini 2.0.

We've benchmarked all four of these, and Gemini 2.0 currently beats other approaches. Using Gemini is also cheaper than transcribing the audio separately using a transcription service or model.

Here's a short video showing Gemini phrase endpointing in two scenarios. First, correctly handling pauses in natural conversation. Second, requesting a phone number (which is a common activity in a use case like customer support). You can see the Completeness check lines in the terminal output, printed each time Gemini processes a chunk of audio.
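For contrast with the Gemini approach, the simplest option in the list above (respond when a VAD reports a long enough pause) can be sketched as follows. This is a minimal illustration, not code from the post: it assumes some VAD has already produced one speech/non-speech boolean per fixed-size audio frame, and `end_of_turn_frame`, `frame_ms`, and `pause_ms` are hypothetical names and defaults.

```python
from typing import Iterable, Optional


def end_of_turn_frame(
    is_speech: Iterable[bool],
    frame_ms: int = 20,
    pause_ms: int = 800,
) -> Optional[int]:
    """Return the index of the frame at which the pause threshold is
    reached, or None if the user never pauses long enough.

    Silence before the user has said anything is ignored, so the agent
    does not "respond" to an empty turn.
    """
    frames_needed = -(-pause_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for index, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None


if __name__ == "__main__":
    # 100ms of leading silence, speech, a short 60ms pause, more speech,
    # then a long trailing silence.
    frames = [False] * 5 + [True] * 10 + [False] * 3 + [True] * 4 + [False] * 50
    print(end_of_turn_frame(frames, frame_ms=20, pause_ms=200))
```

The weakness of this approach is visible in the single `pause_ms` knob: a short value interrupts people who pause mid-thought, and a long value makes every response slow, which is the trade-off the model-based approaches in the list try to avoid.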

kwindla

64,143 views • 1 year ago
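
The first approach listed in the turn detection post above (respond when the user pauses) can be sketched in a few lines. This is a minimal illustration, not the code from the post or from any library: the class name `PauseEndpointer`, the helper `find_turn_ends`, the RMS energy check standing in for a real voice activity detection model, and all threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PauseEndpointer:
    """Hypothetical pause-based endpointer: speech followed by enough silence ends the turn."""

    frame_ms: int = 20              # assumed duration of each audio frame
    energy_threshold: float = 0.01  # assumed RMS level that counts as speech
    pause_ms: int = 600             # assumed silence needed before responding
    _heard_speech: bool = False
    _silence_ms: int = 0

    def process(self, frame: List[float]) -> bool:
        """Feed one frame of samples in [-1, 1]; return True when the turn has ended."""
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5 if frame else 0.0
        if rms >= self.energy_threshold:
            # Speech frame: remember that the user has started talking
            # and restart the silence timer.
            self._heard_speech = True
            self._silence_ms = 0
            return False
        if not self._heard_speech:
            # Silence before any speech is not a turn.
            return False
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.pause_ms:
            # The pause is long enough: end the turn and reset for the next one.
            self._heard_speech = False
            self._silence_ms = 0
            return True
        return False


def find_turn_ends(frames: Iterable[List[float]], endpointer: PauseEndpointer) -> List[int]:
    """Return the indices of the frames at which a turn end was detected."""
    return [i for i, frame in enumerate(frames) if endpointer.process(frame)]


if __name__ == "__main__":
    speech = [0.5] * 160   # a loud synthetic frame
    silence = [0.0] * 160  # a silent synthetic frame
    # Speech, a short 200 ms pause, more speech, then a long pause.
    frames = [speech] * 10 + [silence] * 10 + [speech] * 10 + [silence] * 30
    print(find_turn_ends(frames, PauseEndpointer()))
```

The single `pause_ms` value is where the trade-off described in the post shows up: a short value makes the agent cut in during natural mid-sentence pauses, and a long value makes every response feel slow. The text-based and audio-based approaches in the list exist to avoid relying on that fixed timer alone.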

Voice-only programming with the new OpenAI Realtime API ...
I spend a lot of time these days pair programming with LLMs. Often I'm talking rather than typing. This "voice dictation" use case has become an important vibe benchmark for me. Being able to create text input just by talking, flexibly, in a context-dependent way, with tool calling, is a *hard* problem for today's models. Natural language dictation requires a very high degree of contextual intelligence, instruction-following accuracy, and tool calling reliability.
Today's new gpt-realtime model is quite good at this hard problem. The original realtime model release last year was impressive. Seeing what a speech-to-speech model could do got a lot of people excited about the possibilities of voice AI. The improvements since that first release are equally impressive. I can use this new model, now, for real-world tasks that were past the edge of the "jagged frontier" before.
Here's a video showing a couple of fun (and tricky) modes of voice input.

kwindla

50,806 views • 10 months ago

How to build the world's fastest voice AI bot:
- Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster.
- Route audio over the internet using WebRTC and edge networking.
- Configure timings for voice activity detection, phrase endpointing, and other parts of the pipeline to optimize for latency. (There are trade-offs to doing this!)
Here's a Llama 3 voice bot that has voice-to-voice response times of ~500ms. We used Deepgram's STT and TTS for this bot, and everything is hosted on cerebriumai's serverless GPU infrastructure.

kwindla

282,539 views • 2 years ago

The voice-to-voice AI Pareto frontier. (You'll never believe this one weird trick ...)
If you're building conversational voice AI apps, you care a lot about:
➟ Latency
➟ Cost
➟ LLM response quality (predictable behavior, coverage of the full surface area of your needs, reliable function calling, "reasoning")
➟ Voice quality (correct pronunciations, consistent tone, appropriate affect, steerability, "human-ness")
AI model performance is a jagged frontier and has been moving fast for all of 2024. This is especially true for conversational voice, because you're usually using several models in combination.
LLMs with native audio capabilities are the newest evolution pushing the performance frontier. Here's a multi-lingual voice conversation using Gemini Flash 1.5's native audio input. But there's a problem ...

kwindla

23,592 views • 1 year ago

This is the first implementation of Grok voice API by xAI on a robot thanks to atariorbit (not perfect yet)! Feels like this could unlock some fun new use cases for robotics agents given that it ranks #1 on Big Bench Audio, the leading audio reasoning benchmark that measures voice agents’ capabilities to solve complex problems!

clem 🤗

45,133 views • 6 months ago

Smart Turn v2: open source, native audio turn detection in 14 languages.
New checkpoint of the open source, open data, open training code, semantic VAD model on Hugging Face, fal, and Pipecat AI.
- 3x faster inference (12ms on an L40)
- 14 languages (13 more than v1, which was English-only)
- New synthetic data set `chirp_3_all` with ~163k audio samples
- 99% accuracy on held-out `human_5_all` test data
Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, and mitigates the voice AI trade-off between unwanted turn latency and the agent interrupting people before they are finished speaking.
Training scripts for both Modal and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model!
Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too.
You can help by contributing data, doing architecture experiments, or cleaning open source data! Keep reading ...

kwindla

42,219 views • 11 months ago