
kwindla
@kwindla • 14,608 subscribers
Infrastructure and developer tools for real-time voice, video, and AI. @trydaily // ᓚᘏᗢ // @pipecat_ai
Videos

Sub-agents in (latent) space! We’ve been working on a side project. As far as I know, this is the first massively multiplayer, completely LLM-driven game. Come play Gradient Bang with us. See if you can catch me on the leaderboard.
This whole thing started because I wanted to explore a bunch of things I’m currently obsessed with, in an application of non-trivial size, that felt both new and old at the same time. So … a retro-style space trading game built entirely around interacting with and managing multiple LLMs. Factorio, but instead of clicking, you cajole your ship AI into tasking other AIs to do things for you.
Some of the things we’ve been thinking about as we hack on Gradient Bang:
- Sub-agent orchestration
- Partial context sharing between multiple LLM inference loops
- Managing very long contexts, and episodic memory across user sessions
- World events and large volumes of structured data input as part of human/agent conversations
- Dynamic user interfaces, driven/created on the fly by LLMs
- And, of course, voice as primary input
If you’ve been building coding harnesses, or writing Open Claw agents, or doing pretty much anything that pushes the boundaries of AI-native development these days, you’re probably thinking about these things too!
This is all built with Pipecat AI, the back end is Supabase, the React front end is deployed to Vercel, and all the code is open source.
kwindla • 452,881 views • 1 month ago

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms.
This agent actually uses *three* NVIDIA open source models:
- Nemotron Speech ASR
- Nemotron 3 Nano 30GB in a 4-bit quant (released in December)
- A preview checkpoint of the upcoming Magpie text-to-speech model
These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.)
The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.
kwindla • 274,119 views • 4 months ago

OpenAI shipped a new speech-to-speech model today: gpt-realtime-2 This is the first speech-to-speech model good enough to use in my voice agents that do "real work." Or real play, for that matter. Here's gpt-realtime-2 as the brain of the ship AI in Gradient Bang. The voice-to-voice response and tool calling times here are unedited, so you can see exactly what the interaction with the model is like in an agent with a very complex system instruction and frequent tool calls. (I did clip out the subagent task execution segments, after gpt-realtime-2 starts a subagent via a tool call. Subagents in this config used gpt-5.2 "medium" effort.)
kwindla • 54,525 views • 27 days ago

A voice agent powered by gpt-oss. Running locally on my MacBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The default is "medium".) Notes on how to do that and a jinja template you can use are in the repo. The LLM in the demo video is the big, 120B version of gpt-oss. You can use the smaller, 20B model for this, of course. But OpenAI really did a cool thing here designing the 120B model to run in "just" 80GB of VRAM. And the llama.cpp mlx inference is fast: ~250ms TTFT. Running a big model on-device feels like a time warp into the future of AI.
kwindla • 202,050 views • 10 months ago

Very, very fast voice bots. Llama 3.1 running on Groq Inc. 🚀 500ms voice-to-voice response times
kwindla • 386,450 views • 1 year ago

How to build the world's fastest voice AI bot:
- Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster.
- Route audio over the internet using WebRTC and edge networking.
- Configure timings for voice activity detection, phrase endpointing, and other parts of the pipeline to optimize for latency. (There are trade-offs to doing this!)
Here's a Llama 3 voice bot that has voice-to-voice response times of ~500ms. We used Deepgram's STT and TTS for this bot, and everything is hosted on cerebriumai's serverless GPU infrastructure.
kwindla • 282,501 views • 1 year ago

This robot assistant from the NVIDIA CES Keynote on Monday is going viral. Nader Khalil🍊 explains all the hottest emerging AI trends in one demo: AI applications in 2026 will be multi-model, multi-modal, hybrid cloud/local, use open source models as well as proprietary models, control robots and embedded devices in the physical world, and have voice interfaces. (And the demo had a cute robot *and* a cute dog. Gold.) The demo was built with Pipecat AI. NVIDIA posted a really nice technical walk-through and complete code. The Reachy Mini robot from Hugging Face is open source hardware. (You can order it now, I have one!). You can run the assistant locally on your own hardware, in the cloud, or both.
kwindla • 48,902 views • 4 months ago

Local voice AI with a 235 billion parameter LLM. ✅
- smart-turn v2
- MLX Whisper (large-v3-turbo-q4)
- Qwen3-235B-A22B-Instruct-2507-3bit-DWQ
- Kokoro
All models running locally on an M4 Mac. Max RAM usage ~110GB. Voice-to-voice latency is ~950ms. There are a couple of relatively easy ways to carve another ~100ms off that number. But it's not a bad start!
kwindla • 66,535 views • 10 months ago

Voice AI turn taking is a solved problem. The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.)
Mark Backman made a Pipecat AI PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it. The approach combines three layers of processing:
1. Voice activity detection, with a short (200ms) trigger.
2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed.
3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context.
None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection.
Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response.
- ✓ means the agent should respond normally (immediately)
- ○ is a "short incomplete" - the agent should wait 5 seconds
- ◐ is a "long incomplete" - the agent should wait 10 seconds
The wait times, and the details of the prompt, are configurable, of course.
Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful.
The LLM in the video is GPT-4.1. We've tested the prompt and single-token adherence with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.
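A minimal sketch of how the single-token tags described above might be handled on the application side. This is not the Pipecat mixin itself: the names `TAG_WAIT_SECONDS`, `TurnDecision`, and `parse_turn_tag` are hypothetical, and the fallback for a missing tag is an assumption. Only the three tag characters and the 5 and 10 second waits come from the post.

```python
from dataclasses import dataclass

# Tag characters and default waits are taken from the post.
# Everything else here (names, fallback behavior) is a hypothetical illustration.
TAG_WAIT_SECONDS = {
    "✓": 0.0,   # complete turn: respond immediately
    "○": 5.0,   # short incomplete: wait 5 seconds
    "◐": 10.0,  # long incomplete: wait 10 seconds
}


@dataclass
class TurnDecision:
    wait_seconds: float  # how long to hold the response
    text: str            # response text with the tag removed
    tagged: bool         # False if the model did not emit a tag


def parse_turn_tag(response: str) -> TurnDecision:
    """Split a leading turn tag off an LLM response."""
    stripped = response.lstrip()
    if stripped and stripped[0] in TAG_WAIT_SECONDS:
        wait = TAG_WAIT_SECONDS[stripped[0]]
        return TurnDecision(wait, stripped[1:].lstrip(), True)
    # Assumption: if the tag is missing, respond immediately.
    return TurnDecision(0.0, stripped, False)
```

Because the tag is the first character of the response, a streaming implementation could make this decision as soon as the first token arrives, which is what makes the technique a near-zero latency trigger.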
kwindla • 26,812 views • 3 months ago

Gemini 2.0 drops the beat. Watch the video all the way through — I had four legit "no way it did that" reactions when Jon Taylor sent this to me. "It looks like it's only hitting on the first beat of each bar." This is Jon collaborating with Gemini to create a song in Ableton Live. Jon is using the Multimodal Live API to stream audio and video to Gemini and have a conversation about the song he's creating.
kwindla • 86,068 views • 1 year ago

Voice-only programming with the new OpenAI Realtime API ... I spend a lot of time these days pair programming with LLMs. Often I'm talking rather than typing. This "voice dictation" use case has become an important vibe benchmark for me. Being able to create text input just by talking, flexibly, in a context dependent way, with tool calling, is a *hard* problem for today's models. Natural language dictation requires a very high degree of contextual intelligence, instruction following accuracy, and tool calling reliability. Today's new gpt-realtime model is quite good at this hard problem. The original realtime model release last year was impressive. Seeing what a speech-to-speech model could do got a lot of people excited about the possibilities of voice AI. The improvements since that first release are equally impressive. I can use this new model, now, for real world tasks that were past the edge of the "jagged frontier" before. Here's a video showing a couple of fun (and tricky) modes of voice input.
kwindla • 50,806 views • 9 months ago

Async, automatic, non-blocking context compaction for long-running agents. Last week I gave a talk called Space Machine Sandboxes at the Daytona AI builders meetup about patterns for long-running agents. I work a lot on voice AI agents, which are fundamentally multi-turn, long-context loops. I also build lots of other AI agent stuff, often as part of bigger systems that include voice. One of the patterns I showed in the talk is non-blocking compaction. Here's a short clip.
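A minimal sketch of one way non-blocking compaction could work, using only Python's standard `asyncio`. This is an illustration under assumptions, not the implementation shown in the talk: the class `CompactingContext`, its thresholds, and the `fake_summarize` stand-in for an LLM summarization call are all hypothetical. The idea is that the conversation loop keeps appending messages while a background task summarizes an older prefix, then swaps that prefix for the summary.

```python
import asyncio


class CompactingContext:
    """Hypothetical message list that compacts itself in the background."""

    def __init__(self, summarize, max_messages=8, keep_recent=2):
        self.messages = []
        self._summarize = summarize   # async callable: list[str] -> str
        self._max = max_messages      # assumed trigger threshold
        self._keep = keep_recent      # recent messages never summarized
        self._task = None

    def append(self, message: str) -> None:
        """Add a message; never waits for compaction to finish."""
        self.messages.append(message)
        over = len(self.messages) > self._max
        idle = self._task is None or self._task.done()
        if over and idle:
            cut = len(self.messages) - self._keep
            self._task = asyncio.create_task(self._compact(cut))

    async def _compact(self, cut: int) -> None:
        snapshot = list(self.messages[:cut])
        summary = await self._summarize(snapshot)
        # Messages appended while summarizing sit at index >= cut,
        # so replacing only the prefix preserves them.
        self.messages[:cut] = [summary]

    async def wait_idle(self) -> None:
        if self._task is not None:
            await self._task


async def fake_summarize(messages):
    # Stand-in for an LLM summarization call.
    await asyncio.sleep(0.01)
    return f"[summary of {len(messages)} messages]"


async def demo():
    ctx = CompactingContext(fake_summarize, max_messages=4, keep_recent=2)
    for i in range(6):
        ctx.append(f"turn {i}")  # returns immediately each time
    await ctx.wait_idle()
    return ctx.messages
```

Running `asyncio.run(demo())` leaves one summary message followed by the turns that were not part of the summarized prefix, including the turn appended while the summary was in flight.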
kwindla • 25,877 views • 4 months ago

Llama 4 voice agent starter kit with Groq Inc and Pipecat AI
➡️ Groq STT (distil-whisper-large-v3)
➡️ Groq Llama 4 (llama-4-scout-17b-16e-instruct)
➡️ Groq TTS (playai-tts)
➡️ Function calling
➡️ Deploy to Pipecat Cloud for production
➡️ Optionally add a Twilio phone number for telephone voice AI
kwindla • 58,599 views • 1 year ago

Better/faster/cheaper voice AI turn detection with Gemini 2.0
The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the voice AI responds before the user has finished their thought, the conversation is choppy and unproductive. If the AI waits too long, the conversation is slow and frustrating.
There are a number of ways to approach this. You can:
1. Use a fast "voice activity detection" model to detect pauses in speech. Respond when the user pauses.
2. Use a specialized phrase endpointing model that operates on transcribed text, pattern-matching on text semantics.
3. Train a specialized phrase endpointing model that operates directly on audio.
4. Leverage the native audio capabilities of a SOTA LLM like Gemini 2.0.
We've benchmarked all four of these, and Gemini 2.0 currently beats other approaches. Using Gemini is also cheaper than transcribing the audio separately using a transcription service or model.
Here's a short video showing Gemini phrase endpointing in two scenarios. First, correctly handling pauses in natural conversation. Second, requesting a phone number (which is a common activity in a use case like customer support). You can see the Completeness check lines in the terminal output, printed each time Gemini processes a chunk of audio.
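For comparison, approach 1 in the list above (respond when the user pauses) can be sketched in a few lines. This assumes a VAD model has already produced one speech/no-speech flag per audio frame. The function name `find_turn_end`, the 20ms frame size, and the pause thresholds are hypothetical values chosen for illustration.

```python
def find_turn_end(speech_flags, frame_ms=20, pause_ms=200):
    """Return the index of the frame where the turn is declared over.

    speech_flags: iterable of booleans, one per audio frame, as a
    VAD model might produce them (assumed input, not computed here).
    Returns None if no long-enough pause follows any speech.
    """
    needed = pause_ms // frame_ms  # silent frames required to end the turn
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:  # ignore silence before the user starts talking
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


# A user who pauses briefly (60ms) in the middle of a sentence.
flags = [False] * 5 + [True] * 10 + [False] * 3 + [True] * 5 + [False] * 10
patient = find_turn_end(flags, pause_ms=200)  # waits through the short pause
eager = find_turn_end(flags, pause_ms=60)     # fires during the short pause
```

With these flags, the 200ms threshold ends the turn only after the final silence, while the 60ms threshold fires mid-sentence. That is the trade-off the post describes: a pause-only detector has to choose between interrupting and waiting too long, which is why the other three approaches bring in semantics or audio cues.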
kwindla • 64,143 views • 1 year ago

Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on Hugging Face, fal, and Pipecat AI.
- 3x faster inference (12ms on an L40)
- 14 languages (13 more than v1, which was English-only)
- New synthetic data set `chirp_3_all` with ~163k audio samples
- 99% accuracy on held out `human_5_all` test data
Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, and mitigates the voice AI trade-off between unwanted turn latency vs the agent interrupting people before they are finished speaking.
Training scripts for both Modal and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model!
Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too.
You can help by contributing data, doing architecture experiments, or cleaning open source data! Keep reading ...
kwindla • 42,202 views • 10 months ago

Voice-controlled UI. This is an agent design pattern I'm calling EPIC, "explicit prompting for implicit coordination." Feel free to suggest a better name. :-) In the video, I'm navigating around a map, conversationally, pulling in information dynamically from tool calls and realtime streamed events. There are two separate agents (inference loops) here: a voice agent and a UI control agent. They know about each other (at the prompt level) but they work independently.
kwindla • 14,050 views • 3 months ago

I did a video call with a friend-of-a-friend today. Can't quite put my finger on who they remind me of. (cc Lemon Slice). I keep telling people that realtime video is improving so, so quickly right now, and there are so, so many interesting things to build. Sam Altman has been saying for a couple of years that the best way to build the future is to think hard about where model capabilities are going to be in one, two, or three years. What you build today should aim to leverage those near-future capabilities. I think about that advice every time I experiment with a new realtime video model.
kwindla • 30,737 views • 1 year ago

🔊🔛🔥 ... Groq Inc launched voice generation today. GroqCloud now has realtime transcription, LLMs, *and* text-to-speech. You can build super-responsive, ultra low-latency voice agents end-to-end entirely on Groq! At Daily, we're big fans of Groq's fast, low-latency inference. Pipecat AI supports all the Groq models, including the new voice model. Lately, we've been obsessively playing a voice chat game that Mark Backman wrote. (My high score is 6, so far.) Here's Mark, with Groq's `Celeste-PlayAI` voice.
kwindla • 31,791 views • 1 year ago