Загрузка видео...

Не удалось загрузить видео

Возникла проблема при загрузке этого видео. Это может быть связано с временными проблемами сети или видео может быть недоступно.

На главную

Voice AI turn taking is a solved problem. The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually... measured this.) Mark Backman made a Pipecat AI PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it. The approach combines three layers of processing: 1. Voice activity detection, with a short (200ms) trigger. 2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed. 3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context. None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection. Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response. - ✓ means the agent should respond normally (immediately) - ○ is a "short incomplete" - the agent should wait 5 seconds - ◐ is a "long incomplete" - the agent should wait 10 seconds The wait times, and the details of the prompt, are configurable, of course. Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful. The LLM in the video is GTP-4.1. We've tested the prompt and single-token adherance with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.show more

kwindla

14,745 subscribers

26,876 просмотров • 4 месяцев назад •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

Комментарии: 0

Нет доступных комментариев

Здесь появятся комментарии из оригинального поста

Похожие видео

Better/faster/cheaper voice AI turn detection with Gemini 2.0 The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the voice AI responds before the user has finished their thought, the conversation is choppy and unproductive. If the AI waits too long, the conversation is slow and frustrating. There are a number of ways to approach this. You can: 1. Use a fast "voice activity detection" model to detect pauses in speech. Respond when the user pauses. 2. Use a specialized phrase endpointing model that operates on transcribed text, pattern-matching on text semantics. 3. Train a specialized phrase endpointing model that operates directly on audio. 4. Leverage the native audio capabilities of a SOTA LLM like Gemini 2.0. We've benchmarked all four of these, and Gemini 2.0 currently beats other approaches. Using Gemini is also cheaper than transcribing the audio separately using a transcription service or model. Here's a short video showing Gemini phrase endpointing in two scenarios. First, correctly handling pauses in natural conversation. Second, requesting a phone number (which is a common activity in a use case like customer support). You can see the Completeness check lines in the terminal output, printed each time Gemini processes a chunk of audio.

Better/faster/cheaper voice AI turn detection with Gemini 2.0 The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the voice AI responds before the user has finished their thought, the conversation is choppy and unproductive. If the AI waits too long, the conversation is slow and frustrating. There are a number of ways to approach this. You can: 1. Use a fast "voice activity detection" model to detect pauses in speech. Respond when the user pauses. 2. Use a specialized phrase endpointing model that operates on transcribed text, pattern-matching on text semantics. 3. Train a specialized phrase endpointing model that operates directly on audio. 4. Leverage the native audio capabilities of a SOTA LLM like Gemini 2.0. We've benchmarked all four of these, and Gemini 2.0 currently beats other approaches. Using Gemini is also cheaper than transcribing the audio separately using a transcription service or model. Here's a short video showing Gemini phrase endpointing in two scenarios. First, correctly handling pauses in natural conversation. Second, requesting a phone number (which is a common activity in a use case like customer support). You can see the Completeness check lines in the terminal output, printed each time Gemini processes a chunk of audio.

kwindla

64,143 просмотров • 1 год назад

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases. Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models. The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower. Which is a problem for conversational voice agents. We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms seconds faster! On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.) A couple of other things that this speed buys us, for production voice agents: - Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design. - We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn. And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode. Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is a propos. :-)

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases. Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models. The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower. Which is a problem for conversational voice agents. We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms seconds faster! On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.) A couple of other things that this speed buys us, for production voice agents: - Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design. - We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn. And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode. Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is a propos. :-)

kwindla

40,319 просмотров • 1 месяц назад

Today we’re launching our first homegrown AI model: an open source turn detection model for building voice agents. Instead of relying solely on voice activity detection (VAD), which only considers when a user is speaking, our model also considers what has and is being said in the context of a conversation and predicts when a user is finished expressing their thoughts before the agent responds. Conversations with AI voice agents using this new model flow much more naturally without constant interruptions from the AI— check it out (more videos, details, and code in the thread):

Today we’re launching our first homegrown AI model: an open source turn detection model for building voice agents. Instead of relying solely on voice activity detection (VAD), which only considers when a user is speaking, our model also considers what has and is being said in the context of a conversation and predicts when a user is finished expressing their thoughts before the agent responds. Conversations with AI voice agents using this new model flow much more naturally without constant interruptions from the AI— check it out (more videos, details, and code in the thread):

LiveKit

126,826 просмотров • 1 год назад

Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on Hugging Face, fal, and Pipecat AI. - 3x faster inference (12ms on an L40) - 14 languages (13 more than v1, which was english-only) - New synthetic data set `chirp_3_all` with ~163k audio samples - 99% accuracy on held out `human_5_all` test data Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, and mitigates the voice AI trade-off between unwanted turn latency vs the agent interrupting people before they are finished speaking. Training scripts for both Modal and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model! Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too. You can help by contributing data, doing architecture expermints, or cleaning open source data! Keep reading ...

Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on Hugging Face, fal, and Pipecat AI. - 3x faster inference (12ms on an L40) - 14 languages (13 more than v1, which was english-only) - New synthetic data set `chirp_3_all` with ~163k audio samples - 99% accuracy on held out `human_5_all` test data Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, and mitigates the voice AI trade-off between unwanted turn latency vs the agent interrupting people before they are finished speaking. Training scripts for both Modal and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model! Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too. You can help by contributing data, doing architecture expermints, or cleaning open source data! Keep reading ...

kwindla

42,219 просмотров • 11 месяцев назад

Voice agents are awkward, and everyone notices: You ask a question. The agent thinks. You wait. And wait... Nobody wants this. I'd rather talk to a person. If your model's response time is over 300ms, you won't make it. Unfortunately, most text-to-speech models can't get anywhere close to that. I want you to take a look at the latest model released by Inworld AI: TTS-1.5. I built a simple voice agent using the model so you can see it in action and test it on your computer. You'll find the repository link below. The latency numbers of this model are wild: • Max model → under 250ms • Mini model → under 130ms That's 4x faster than prior generations and faster than human response times!

Voice agents are awkward, and everyone notices: You ask a question. The agent thinks. You wait. And wait... Nobody wants this. I'd rather talk to a person. If your model's response time is over 300ms, you won't make it. Unfortunately, most text-to-speech models can't get anywhere close to that. I want you to take a look at the latest model released by Inworld AI: TTS-1.5. I built a simple voice agent using the model so you can see it in action and test it on your computer. You'll find the repository link below. The latency numbers of this model are wild: • Max model → under 250ms • Mini model → under 130ms That's 4x faster than prior generations and faster than human response times!

Santiago

69,410 просмотров • 5 месяцев назад

OpenAI shipped a new speech-to-speech model today: gpt-realtime-2 This is the first speech-to-speech model good enough to use in my voice agents that do "real work." Or real play, for that matter. Here's gpt-realtime-2 as the brain of the ship AI in Gradient Bang. The voice-to-voice response and tool calling times here are unedited, so you can see exactly what the interaction with the model is like in an agent with a very complex system instruction and frequent tool calls. (I did clip out the subagent task execution segments, after gpt-realtime-2 starts a subagent via a tool call. Subagents in this config used gpt-5.2 "medium" effort.)

OpenAI shipped a new speech-to-speech model today: gpt-realtime-2 This is the first speech-to-speech model good enough to use in my voice agents that do "real work." Or real play, for that matter. Here's gpt-realtime-2 as the brain of the ship AI in Gradient Bang. The voice-to-voice response and tool calling times here are unedited, so you can see exactly what the interaction with the model is like in an agent with a very complex system instruction and frequent tool calls. (I did clip out the subagent task execution segments, after gpt-realtime-2 starts a subagent via a tool call. Subagents in this config used gpt-5.2 "medium" effort.)

kwindla

54,912 просмотров • 1 месяц назад

Another example of the multiple TTS parallel pipelines pattern. Here's a voice AI agent that speaks both English and Arabic, using a specific model/voice for each language. These are PlayAI voices. The STT, TTS, and LLM inference is all running on Groq Inc. (The LLM is LLama 4 Maverick.)

Another example of the multiple TTS parallel pipelines pattern. Here's a voice AI agent that speaks both English and Arabic, using a specific model/voice for each language. These are PlayAI voices. The STT, TTS, and LLM inference is all running on Groq Inc. (The LLM is LLama 4 Maverick.)

kwindla

13,401 просмотров • 1 год назад

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms. This agent actually uses *three* NVIDIA open source models: - Nemotron Speech ASR - Nemotron 3 Nano 30GB in a 4-bit quant (released in December) - A preview checkpoint of the upcoming Magpie text-to-speech model These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.) The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms. This agent actually uses three NVIDIA open source models: - Nemotron Speech ASR - Nemotron 3 Nano 30GB in a 4-bit quant (released in December) - A preview checkpoint of the upcoming Magpie text-to-speech model These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.) The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

kwindla

274,306 просмотров • 5 месяцев назад

How can you solve complex tasks using a Large Language Model? Here is a 2-minute introduction to everything you need to know to 10x the quality of your results. Let's talk about three techniques, in order of complexity, starting with the easiest one: • In-Context Learning • Indexing + In-Context Learning • Fine-tuning In-Context Learning The team that trained GPT-3 found something they couldn't explain: You can condition a model using examples of how you want it to behave. I included an example prompt in the attached video. You can "teach" the model how you want it to interpret questions, select the correct answers, and format the results by giving a few examples. You can also give specific knowledge to the model that will be helpful when formulating answers. We call this approach "grounding the model." There's another example in the video. Indexing + In-Context Learning Unfortunately, there is a limit to how much data you can include in a prompt. We call this the "context size." One version of GPT-4 supports a context of approximately 6,000 words, while the other supports 25,000 words. Although this sounds like a lot, many applications need more than that. Imagine you wrote a book and want to build an application to answer any questions about your story. What happens if your book is longer than the context? That's where Indexing comes in. Using a model, you can turn every book passage into an embedding. These are vectors, numbers that "encode" the passage's text. You can then store these embeddings in a particular database that supports fast retrieval of these vectors. You can then turn any question into an embedding and search the database for the list of passages that are similar to that query. Instead of using the entire book to ask the model, you can now use the relevant passages as in-context information, effectively working around the context size limitation. Fine-tuning Fine-tuning can give you an extra boost to get reliable outputs from your LLM. It is, however, the most complex approach on the list. There are different approaches to fine-tuning a model with your data. A popular technique is to process your data with your LLM and use the outputs to train a new classifier that solves your specific task. Notice that here you aren't modifying the LLM. Instead, you are chaining it with your trained classifier. Another approach is to modify the parameters of the LLM using your data. Think of this as "rewiring" the model in a way that solves your particular task. The results and costs will vary depending on how many layers you want to fine-tune from the original model. Many companies think that fine-tuning is the solution to their problems. In my experience, many will benefit from exploring the other two approaches. I love explaining Machine Learning and Artificial Intelligence ideas. If you enjoy in-depth content like this, follow me Santiago so you don't miss what comes next.

How can you solve complex tasks using a Large Language Model? Here is a 2-minute introduction to everything you need to know to 10x the quality of your results. Let's talk about three techniques, in order of complexity, starting with the easiest one: • In-Context Learning • Indexing + In-Context Learning • Fine-tuning In-Context Learning The team that trained GPT-3 found something they couldn't explain: You can condition a model using examples of how you want it to behave. I included an example prompt in the attached video. You can "teach" the model how you want it to interpret questions, select the correct answers, and format the results by giving a few examples. You can also give specific knowledge to the model that will be helpful when formulating answers. We call this approach "grounding the model." There's another example in the video. Indexing + In-Context Learning Unfortunately, there is a limit to how much data you can include in a prompt. We call this the "context size." One version of GPT-4 supports a context of approximately 6,000 words, while the other supports 25,000 words. Although this sounds like a lot, many applications need more than that. Imagine you wrote a book and want to build an application to answer any questions about your story. What happens if your book is longer than the context? That's where Indexing comes in. Using a model, you can turn every book passage into an embedding. These are vectors, numbers that "encode" the passage's text. You can then store these embeddings in a particular database that supports fast retrieval of these vectors. You can then turn any question into an embedding and search the database for the list of passages that are similar to that query. Instead of using the entire book to ask the model, you can now use the relevant passages as in-context information, effectively working around the context size limitation. Fine-tuning Fine-tuning can give you an extra boost to get reliable outputs from your LLM. It is, however, the most complex approach on the list. There are different approaches to fine-tuning a model with your data. A popular technique is to process your data with your LLM and use the outputs to train a new classifier that solves your specific task. Notice that here you aren't modifying the LLM. Instead, you are chaining it with your trained classifier. Another approach is to modify the parameters of the LLM using your data. Think of this as "rewiring" the model in a way that solves your particular task. The results and costs will vary depending on how many layers you want to fine-tune from the original model. Many companies think that fine-tuning is the solution to their problems. In my experience, many will benefit from exploring the other two approaches. I love explaining Machine Learning and Artificial Intelligence ideas. If you enjoy in-depth content like this, follow me Santiago so you don't miss what comes next.

Santiago

384,482 просмотров • 3 лет назад

A voice agent powered by gpt-oss. Running locally on my macBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The default is "medium".) Notes on how to do that and a jinja template you can use are in the repo. The LLM in the demo video is the big, 120B version of gpt-oss. You can use the smaller, 20B model for this, of course. But OpenAI really did a cool thing here designing the 120B model to run in "just" 80GB of VRAM. And the llama.cpp mlx inference is fast: ~250ms TTFT. Running a big model on-device feels like a time warp into the future of AI.

A voice agent powered by gpt-oss. Running locally on my macBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The default is "medium".) Notes on how to do that and a jinja template you can use are in the repo. The LLM in the demo video is the big, 120B version of gpt-oss. You can use the smaller, 20B model for this, of course. But OpenAI really did a cool thing here designing the 120B model to run in "just" 80GB of VRAM. And the llama.cpp mlx inference is fast: ~250ms TTFT. Running a big model on-device feels like a time warp into the future of AI.

kwindla

202,113 просмотров • 10 месяцев назад

Async, automatic, non-blocking context compaction for long-running agents. Last week I gave a talk called Space Machine Sandboxes at the Daytona AI builders meetup about patterns for long-running agents. I work a lot on voice AI agents, which are fundamentally multi-turn, long-context loops. I also build lots of other AI agent stuff, often as part of bigger systems that include voice. One of the patterns I showed in the talk is non-blocking compaction. Here's a short clip.

Async, automatic, non-blocking context compaction for long-running agents. Last week I gave a talk called Space Machine Sandboxes at the Daytona AI builders meetup about patterns for long-running agents. I work a lot on voice AI agents, which are fundamentally multi-turn, long-context loops. I also build lots of other AI agent stuff, often as part of bigger systems that include voice. One of the patterns I showed in the talk is non-blocking compaction. Here's a short clip.

kwindla

25,877 просмотров • 5 месяцев назад

How to build the world's fastest voice AI bot: - Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster. - Route audio over the internet using WebRTC and edge networking. - Configure timings for voice activity detection, phrase endpointing, and other parts of the pipeline to optimize for latency. (There are trade-offs to doing this!) Here's a LLama 3 voice bot that has voice-to-voice response times of ~500ms. We used Deepgram's STT and TTS for this bot, and everything is hosted on cerebriumai's serverless GPU infrastructure.

How to build the world's fastest voice AI bot: - Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster. - Route audio over the internet using WebRTC and edge networking. - Configure timings for voice activity detection, phrase endpointing, and other parts of the pipeline to optimize for latency. (There are trade-offs to doing this!) Here's a LLama 3 voice bot that has voice-to-voice response times of ~500ms. We used Deepgram's STT and TTS for this bot, and everything is hosted on cerebriumai's serverless GPU infrastructure.

kwindla

282,539 просмотров • 2 лет назад

Chamath: Two terms you need to pay attention to in AI are Prefill and Decode “There's two terms that I think you're going to hear a ton about over these next few years.” “The first term is prefill, and the next is decode.” “What prefill and decode are, are two very distinct ways of how models think, and how a model goes through the process of answering a question that you ask it.” “And so when you send a prompt to AI, what happens is that the model processes it. This is called the reading phase or prefill.” “It reads your entire prompt all at once. And then it does a bunch of math, calculates all these relationships between all the words, and it stores them in temporary memory.” “The problem is that this is really compute bound. So it requires massive brute force. And Nvidia GPUs crush here.” “And their architecture is designed for massive parallel processing, which makes them really amazing at digesting these long prompts.” “So the problem just gets bigger and bigger, Nvidia just completely dominates.” “But the next phase though, this critical phase, the decode phase, is the writing phase, right?” “So the model starts to generate a response, you ask it a question and its response, one token at a time.” “And then to pick the next token to pick the next word, it has to look back at everything it has said already so that it doesn't hallucinate.” “The problem is that this is incredibly memory bandwidth constrained.” “And in our architecture, a long time ago, we made these design decisions from day one.” “And so what we did was we took a very different architectural approach, we took a very conservative process technology. We weren't pushing the boundaries of physics.” “And we used a lot of what's called SRAM. So memory on the chip so that we could do this decode thing as well or better than everybody else.” “And so now when you put these two things together, I just think it's going to create a huge acceleration in the ability for this entire infrastructure layer to get much cheaper and much more valuable, which I suspect then it'll have a lot more developer pull, you'll get a lot more applications being built, billions and billions of more people using it.”

Chamath: Two terms you need to pay attention to in AI are Prefill and Decode “There's two terms that I think you're going to hear a ton about over these next few years.” “The first term is prefill, and the next is decode.” “What prefill and decode are, are two very distinct ways of how models think, and how a model goes through the process of answering a question that you ask it.” “And so when you send a prompt to AI, what happens is that the model processes it. This is called the reading phase or prefill.” “It reads your entire prompt all at once. And then it does a bunch of math, calculates all these relationships between all the words, and it stores them in temporary memory.” “The problem is that this is really compute bound. So it requires massive brute force. And Nvidia GPUs crush here.” “And their architecture is designed for massive parallel processing, which makes them really amazing at digesting these long prompts.” “So the problem just gets bigger and bigger, Nvidia just completely dominates.” “But the next phase though, this critical phase, the decode phase, is the writing phase, right?” “So the model starts to generate a response, you ask it a question and its response, one token at a time.” “And then to pick the next token to pick the next word, it has to look back at everything it has said already so that it doesn't hallucinate.” “The problem is that this is incredibly memory bandwidth constrained.” “And in our architecture, a long time ago, we made these design decisions from day one.” “And so what we did was we took a very different architectural approach, we took a very conservative process technology. We weren't pushing the boundaries of physics.” “And we used a lot of what's called SRAM. So memory on the chip so that we could do this decode thing as well or better than everybody else.” “And so now when you put these two things together, I just think it's going to create a huge acceleration in the ability for this entire infrastructure layer to get much cheaper and much more valuable, which I suspect then it'll have a lot more developer pull, you'll get a lot more applications being built, billions and billions of more people using it.”

The All-In Podcast

563,785 просмотров • 5 месяцев назад

The voice-to-voice AI Pareto frontier. (You'll never believe this one weird trick ...) If you're building conversational voice AI apps, you care a lot about: ➟ Latency ➟ Cost ➟ LLM response quality (predictable behavior, coverage of the full surface area of your needs, reliable function calling, "reasoning") ➟ Voice quality (correct pronunciations, consistent tone, appropriate affect, steerability, "human-ness") AI model performance is a jagged frontier and has been moving fast for all of 2024. This is especially true for conversational voice, because you're usually using several models in combination. LLMs with native audio capabilities are the newest evolution pushing the performance frontier. Here's a multi-lingual voice conversation using Gemini Flash 1.5's native audio input. But there's a problem ...

The voice-to-voice AI Pareto frontier. (You'll never believe this one weird trick ...) If you're building conversational voice AI apps, you care a lot about: ➟ Latency ➟ Cost ➟ LLM response quality (predictable behavior, coverage of the full surface area of your needs, reliable function calling, "reasoning") ➟ Voice quality (correct pronunciations, consistent tone, appropriate affect, steerability, "human-ness") AI model performance is a jagged frontier and has been moving fast for all of 2024. This is especially true for conversational voice, because you're usually using several models in combination. LLMs with native audio capabilities are the newest evolution pushing the performance frontier. Here's a multi-lingual voice conversation using Gemini Flash 1.5's native audio input. But there's a problem ...

kwindla

23,592 просмотров • 1 год назад

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ] So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of the it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ] So that essentially gives you a fully freeform camera position control to take photos with One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ] So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of the it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ] So that essentially gives you a fully freeform camera position control to take photos with One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

@levelsio

119,210 просмотров • 1 год назад

Launch day for Google Imagen 3! We updated Story Bot — one of the classic Pipecat AI voice interaction starter kits — to use Gemini 2.0 Flash and Imagen. The new experience with these two models is … wow! Watch Chad's interactive story about a dragon who feels a little bit out of place, and her magical friend. The consistency of the images that Imagen creates through the whole story is a new capability. We haven’t seen that from any other model we’ve experimented with. Clone the repo to build your own voice+images experience.

Launch day for Google Imagen 3! We updated Story Bot — one of the classic Pipecat AI voice interaction starter kits — to use Gemini 2.0 Flash and Imagen. The new experience with these two models is … wow! Watch Chad's interactive story about a dragon who feels a little bit out of place, and her magical friend. The consistency of the images that Imagen creates through the whole story is a new capability. We haven’t seen that from any other model we’ve experimented with. Clone the repo to build your own voice+images experience.

kwindla

22,673 просмотров • 1 год назад

Learn to build conversational AI voice agents in "Building AI Voice Agents for Production", created in collaboration with LiveKit and RealAvatar, and taught by dsa (Co-founder & CEO of LiveKit), Shayne (Developer Advocate, LiveKit), and Nedelina Teneva (Head of AI at RealAvatar, an AI Fund portfolio company). Voice agents combine speech and reasoning capabilities to enable real-time conversations. They're already being used to support customer service, to improve accessibility in healthcare, for entertainment applications, and for talk therapy. In this course, you’ll learn to build voice agents that listen, reason, and respond naturally. You’ll follow the architecture used to create the "AI Andrew" Avatar, a collaborative project between and RealAvatar that responds to users in what sounds like my voice. You’ll build a voice agent from scratch and deploy it to the cloud, enabling support for many simultaneous users. What you’ll learn: - Understand the fundamentals of voice agents, including key components like speech-to-text (STT), text-to-speech (TTS), and LLMs, and how latency is introduced at each layer. - Explore voice agent architectures and the trade-offs between modular pipelines and speech-to-speech APIs. - Explore how platforms like LiveKit mitigate latency issues with optimized networking infrastructure and low-latency communication protocols. - Learn how to connect client devices to voice agents using WebRTC—and why it outperforms HTTP and WebSocket for low-latency audio streaming. - Incorporate voice activity detection (VAD), end-of-turn detection, and context management to detect turns, handle interruptions, and manage conversational flow. - Understand the trade-offs between latency, quality, and cost in an example in which you build a voice agent and change its voice. - Equip your agent with metrics to measure latency at each stage of the voice pipeline and learn the key levers you can pull to make your agent faster and more responsive. The voice agents built in this course also incorporate voice technology from , a supporting contributor to the project. By the end of this course, you'll have learned the components of an AI voice agent pipeline, combined them into a system with low-latency communication, and deployed them on cloud infrastructure so it scales to many users. I’m looking forward to seeing what voice agents you build from this course! Please sign up here:

Andrew Ng

87,484 просмотров • 1 год назад

I built a simple voice assistant in 70 lines of Python code. It uses: • LiveKit - The voice agent • AssemblyAI - To turn your voice into text • OpenAI - The brain of the agent, and to turn text into audio There's something really cool about this:

I built a simple voice assistant in 70 lines of Python code. It uses: • LiveKit - The voice agent • AssemblyAI - To turn your voice into text • OpenAI - The brain of the agent, and to turn text into audio There's something really cool about this:

Santiago

34,632 просмотров • 11 месяцев назад

Jason Calacanis: Anthropic, OpenAI and others are trying to kill OpenClaw Why? Because an open source agent platform is an existential threat to frontier model companies. @jason: “ People are going to say I'm a conspiracy theorist, but the number one goal, I believe, in the large language model, frontier model space, is to kill (OpenClaw). This is a giant movement to stop it, because this is the equivalent of having an open source Android-like player in the market, and that could be incredibly disruptive. Because, I believe, open source is going to win the day on the large language models and take 90% of the token usage, and I think the entire frontier model space could be undercut by open source. And I think they realize that SLMs, the smaller language models that are verticalized now, that will run on desktops and laptops and are even starting to run on the top ones, that is their biggest competitive threat, and I hope it happens.”

Jason Calacanis: Anthropic, OpenAI and others are trying to kill OpenClaw Why? Because an open source agent platform is an existential threat to frontier model companies. @jason: “ People are going to say I'm a conspiracy theorist, but the number one goal, I believe, in the large language model, frontier model space, is to kill (OpenClaw). This is a giant movement to stop it, because this is the equivalent of having an open source Android-like player in the market, and that could be incredibly disruptive. Because, I believe, open source is going to win the day on the large language models and take 90% of the token usage, and I think the entire frontier model space could be undercut by open source. And I think they realize that SLMs, the smaller language models that are verticalized now, that will run on desktops and laptops and are even starting to run on the top ones, that is their biggest competitive threat, and I hope it happens.”

The All-In Podcast

64,267 просмотров • 2 месяцев назад

Voice-only programming with the new OpenAI Realtime API ... I spend a lot of time these days pair programming with LLMs. Often I'm talking rather than typing. This "voice dictation" use case has become an important vibe benchmark for me. Being able to create text input just by talking, flexibly, in a context dependent way, with tool calling, is a *hard* problem for today's models. Natural language dictation requires a very high degree of contextual intelligence, instruction following accuracy, and tool calling reliability. Today's new gpt-realtime model is quite good at this hard problem. The original realtime model release last year was impressive. Seeing what a speech-to-speech model could do got a lot of people excited about the possibilities of voice AI. The improvements since that first release are equally impressive. I can use this new model, now, for real world tasks that were past the edge of the "jagged frontier" before. Here's a video showing a couple of fun (and tricky) modes of voice input.

Voice-only programming with the new OpenAI Realtime API ... I spend a lot of time these days pair programming with LLMs. Often I'm talking rather than typing. This "voice dictation" use case has become an important vibe benchmark for me. Being able to create text input just by talking, flexibly, in a context dependent way, with tool calling, is a hard problem for today's models. Natural language dictation requires a very high degree of contextual intelligence, instruction following accuracy, and tool calling reliability. Today's new gpt-realtime model is quite good at this hard problem. The original realtime model release last year was impressive. Seeing what a speech-to-speech model could do got a lot of people excited about the possibilities of voice AI. The improvements since that first release are equally impressive. I can use this new model, now, for real world tasks that were past the edge of the "jagged frontier" before. Here's a video showing a couple of fun (and tricky) modes of voice input.

kwindla

50,806 просмотров • 10 месяцев назад