Voice AI turn taking is a solved problem. The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually...

26,876 views • 4 months ago • via X (Twitter)

Related videos

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases.

Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650-1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models.

The biggest lever we get from this kind of speed is that we can use the model in reasoning mode and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower, which is a problem for conversational voice agents.

We can run Kimi K2.6 with reasoning turned on and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms faster!

On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and is tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.)

A couple of other things that this speed buys us, for production voice agents:

- Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design.
- We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn.

And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode.

Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation.

500ms is, effectively, instant. So the Cerebras naming for this mode is apropos. :-)
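
As a rough illustration of the "structured data first, then plain text for voice" idea in the post above, here is a minimal Python sketch. It assumes a wire format the post does not specify: the model is prompted to emit one line of JSON, then a newline, then the text to be spoken. The function names, the event tuples, and the `style` and `intent` fields are all hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple, Union

Event = Tuple[str, Union[dict, str]]


def parse_header(line: str) -> Optional[dict]:
    """Return the JSON object on the first line, or None if it is not one."""
    try:
        value = json.loads(line)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def parse_stream(chunks: Iterable[str]) -> Iterator[Event]:
    """Split a streamed response into one 'header' event and 'text' events.

    Assumed format (hypothetical): a single JSON line, a newline, then speech text.
    Text chunks are passed through as soon as the header line is complete,
    so a TTS stage could start speaking without waiting for the full response.
    """
    buffer = ""
    header_done = False
    for chunk in chunks:
        if header_done:
            if chunk:
                yield ("text", chunk)
            continue
        buffer += chunk
        if "\n" in buffer:
            header_line, rest = buffer.split("\n", 1)
            header = parse_header(header_line)
            if header is not None:
                yield ("header", header)
            else:
                # Not valid JSON: fall back to treating the line as speech.
                yield ("text", header_line + "\n")
            header_done = True
            if rest:
                yield ("text", rest)
    if not header_done and buffer:
        # The stream ended without a newline: treat everything as speech.
        yield ("text", buffer)


# Simulated token stream; the chunk boundaries are arbitrary.
chunks = ['{"style": "warm', '", "intent": "greet"}\nHello', " there!"]
events = list(parse_stream(chunks))
speech = "".join(payload for kind, payload in events if kind == "text")
```

In a real pipeline, the header event might be routed to a TTS style prompt or a classifier while the text events stream to voice generation; that wiring is left out here.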

kwindla

40,319 views • 1 month ago

How can you solve complex tasks using a Large Language Model?

Here is a 2-minute introduction to everything you need to know to 10x the quality of your results.

Let's talk about three techniques, in order of complexity, starting with the easiest one:

• In-Context Learning
• Indexing + In-Context Learning
• Fine-tuning

In-Context Learning

The team that trained GPT-3 found something they couldn't explain: You can condition a model using examples of how you want it to behave. I included an example prompt in the attached video.

You can "teach" the model how you want it to interpret questions, select the correct answers, and format the results by giving a few examples.

You can also give specific knowledge to the model that will be helpful when formulating answers. We call this approach "grounding the model." There's another example in the video.

Indexing + In-Context Learning

Unfortunately, there is a limit to how much data you can include in a prompt. We call this the "context size."

One version of GPT-4 supports a context of approximately 6,000 words, while the other supports 25,000 words. Although this sounds like a lot, many applications need more than that.

Imagine you wrote a book and want to build an application to answer any questions about your story. What happens if your book is longer than the context?

That's where Indexing comes in.

Using a model, you can turn every book passage into an embedding. These are vectors: lists of numbers that "encode" the passage's text.

You can then store these embeddings in a database that supports fast retrieval of these vectors.

You can then turn any question into an embedding and search the database for the list of passages that are similar to that query.

Instead of giving the model the entire book, you can now use the relevant passages as in-context information, effectively working around the context size limitation.

Fine-tuning

Fine-tuning can give you an extra boost to get reliable outputs from your LLM. It is, however, the most complex approach on the list.

There are different approaches to fine-tuning a model with your data.

A popular technique is to process your data with your LLM and use the outputs to train a new classifier that solves your specific task. Notice that here you aren't modifying the LLM. Instead, you are chaining it with your trained classifier.

Another approach is to modify the parameters of the LLM using your data. Think of this as "rewiring" the model in a way that solves your particular task. The results and costs will vary depending on how many layers you want to fine-tune from the original model.

Many companies think that fine-tuning is the solution to their problems. In my experience, many will benefit from exploring the other two approaches.

I love explaining Machine Learning and Artificial Intelligence ideas. If you enjoy in-depth content like this, follow me, Santiago, so you don't miss what comes next.
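
The "Indexing + In-Context Learning" steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `embed` here is a toy word-count vector standing in for a real embedding model, the "database" is a plain list, and the passages, function names, and prompt wording are all made up for the example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse vector of word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_passages(question: str, index: List[Tuple[str, Counter]], k: int = 2) -> List[str]:
    """Embed the question and return the k most similar stored passages."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: List[str]) -> str:
    """Place only the retrieved passages in the prompt as in-context information."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the passages below.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical book passages, indexed once up front.
passages = [
    "Mara found the brass key under the lighthouse stairs.",
    "The storm kept the ferry in the harbor for three days.",
    "Old Tobias taught Mara how to read the tide tables.",
]
index = [(p, embed(p)) for p in passages]

question = "Where did Mara find the key?"
retrieved = top_passages(question, index, k=1)
prompt = build_prompt(question, retrieved)
```

With a real embedding model and a vector store, the shape stays the same: embed passages once, embed each question at query time, and send only the nearest passages to the LLM.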

Santiago

384,482 views • 3 years ago

Chamath: Two terms you need to pay attention to in AI are Prefill and Decode

“There's two terms that I think you're going to hear a ton about over these next few years.”

“The first term is prefill, and the next is decode.”

“What prefill and decode are, are two very distinct ways of how models think, and how a model goes through the process of answering a question that you ask it.”

“And so when you send a prompt to AI, what happens is that the model processes it. This is called the reading phase or prefill.”

“It reads your entire prompt all at once. And then it does a bunch of math, calculates all these relationships between all the words, and it stores them in temporary memory.”

“The problem is that this is really compute bound. So it requires massive brute force. And Nvidia GPUs crush here.”

“And their architecture is designed for massive parallel processing, which makes them really amazing at digesting these long prompts.”

“So the problem just gets bigger and bigger, Nvidia just completely dominates.”

“But the next phase though, this critical phase, the decode phase, is the writing phase, right?”

“So the model starts to generate a response, you ask it a question and its response, one token at a time.”

“And then to pick the next token to pick the next word, it has to look back at everything it has said already so that it doesn't hallucinate.”

“The problem is that this is incredibly memory bandwidth constrained.”

“And in our architecture, a long time ago, we made these design decisions from day one.”

“And so what we did was we took a very different architectural approach, we took a very conservative process technology. We weren't pushing the boundaries of physics.”

“And we used a lot of what's called SRAM. So memory on the chip so that we could do this decode thing as well or better than everybody else.”

“And so now when you put these two things together, I just think it's going to create a huge acceleration in the ability for this entire infrastructure layer to get much cheaper and much more valuable, which I suspect then it'll have a lot more developer pull, you'll get a lot more applications being built, billions and billions of more people using it.”
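
To make the prefill/decode distinction in the quotes above concrete, here is a deliberately simplified Python toy. It does no real model math. It only counts passes and cache reads, under the assumption that the "temporary memory" mentioned is a per-token cache that every decode step re-reads in full. The class and function names are hypothetical and do not describe any particular chip or inference server.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCache:
    """Stand-in for the temporary memory built up while processing tokens."""
    entries: List[str] = field(default_factory=list)
    reads: int = 0  # total cached entries read across all decode steps


def prefill(prompt_tokens: List[str], cache: ToyCache) -> int:
    """Reading phase: handle the whole prompt in one parallel pass."""
    cache.entries.extend(prompt_tokens)
    return 1  # one large pass, regardless of prompt length


def decode(num_new_tokens: int, cache: ToyCache) -> int:
    """Writing phase: one pass per generated token, each re-reading the cache."""
    passes = 0
    for i in range(num_new_tokens):
        cache.reads += len(cache.entries)  # look back at everything so far
        cache.entries.append(f"<gen{i}>")  # the new token joins the cache
        passes += 1
    return passes


cache = ToyCache()
prefill_passes = prefill(["What", "is", "prefill", "?"], cache)
decode_passes = decode(3, cache)
# Decode re-reads 4, then 5, then 6 cached entries: 15 reads for 3 new tokens.
```

The toy shows why the two phases stress hardware differently in this simplified picture: prefill is one big batch of work, while decode is many small sequential steps whose cost is dominated by repeatedly reading a growing cache.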

The All-In Podcast

563,785 views • 5 months ago

Learn to build conversational AI voice agents in "Building AI Voice Agents for Production", created in collaboration with LiveKit and RealAvatar, and taught by dsa (Co-founder & CEO of LiveKit), Shayne (Developer Advocate, LiveKit), and Nedelina Teneva (Head of AI at RealAvatar, an AI Fund portfolio company).

Voice agents combine speech and reasoning capabilities to enable real-time conversations. They're already being used to support customer service, to improve accessibility in healthcare, for entertainment applications, and for talk therapy.

In this course, you’ll learn to build voice agents that listen, reason, and respond naturally. You’ll follow the architecture used to create the "AI Andrew" Avatar, a collaborative project between and RealAvatar that responds to users in what sounds like my voice. You’ll build a voice agent from scratch and deploy it to the cloud, enabling support for many simultaneous users.

What you’ll learn:

- Understand the fundamentals of voice agents, including key components like speech-to-text (STT), text-to-speech (TTS), and LLMs, and how latency is introduced at each layer.
- Explore voice agent architectures and the trade-offs between modular pipelines and speech-to-speech APIs.
- Explore how platforms like LiveKit mitigate latency issues with optimized networking infrastructure and low-latency communication protocols.
- Learn how to connect client devices to voice agents using WebRTC—and why it outperforms HTTP and WebSocket for low-latency audio streaming.
- Incorporate voice activity detection (VAD), end-of-turn detection, and context management to detect turns, handle interruptions, and manage conversational flow.
- Understand the trade-offs between latency, quality, and cost in an example in which you build a voice agent and change its voice.
- Equip your agent with metrics to measure latency at each stage of the voice pipeline and learn the key levers you can pull to make your agent faster and more responsive.

The voice agents built in this course also incorporate voice technology from , a supporting contributor to the project.

By the end of this course, you'll have learned the components of an AI voice agent pipeline, combined them into a system with low-latency communication, and deployed them on cloud infrastructure so it scales to many users.

I’m looking forward to seeing what voice agents you build from this course!

Please sign up here:

Andrew Ng

87,484 views • 1 year ago