Voice-controlled UI. This is an agent design pattern I'm calling EPIC, "explicit prompting for implicit coordination." Feel free to suggest a better name. :-) In the video, I'm navigating around a map, conversationally, pulling in information dynamically from tool calls and realtime streamed events. There are two separate agents...

14,050 views • 3 months ago • via X (Twitter)

Similar Videos

Voice AI turn taking is a solved problem.

The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.)

Mark Backman made a Pipecat AI PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it.

The approach combines three layers of processing:

1. Voice activity detection, with a short (200ms) trigger.
2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed.
3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context.

None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection.

Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response.

- ✓ means the agent should respond normally (immediately)
- ○ is a "short incomplete" - the agent should wait 5 seconds
- ◐ is a "long incomplete" - the agent should wait 10 seconds

The wait times, and the details of the prompt, are configurable, of course.

Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful.

The LLM in the video is GPT-4.1. We've tested the prompt and single-token adherence with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.

kwindla

26,812 views • 3 months ago
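
The single-token tagging described in the post above could be handled on the receiving side roughly as follows. This is a minimal sketch, not Pipecat's actual mixin: `parse_turn_tag`, `TurnDecision`, and `DEFAULT_WAITS` are hypothetical names, and falling back to an immediate response when the tag is missing is an assumption. Only the three tag characters and the 5-second and 10-second defaults come from the post.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Tag characters and default waits as described in the post.
# The post notes that the wait times are configurable.
DEFAULT_WAITS: Dict[str, float] = {
    "✓": 0.0,   # turn complete: respond immediately
    "○": 5.0,   # "short incomplete": wait 5 seconds
    "◐": 10.0,  # "long incomplete": wait 10 seconds
}


@dataclass
class TurnDecision:
    tag: str             # the tag character found, or "" if none
    wait_seconds: float  # how long to wait before the agent speaks
    text: str            # the response with the tag stripped


def parse_turn_tag(
    response: str, waits: Optional[Dict[str, float]] = None
) -> TurnDecision:
    """Split a leading single-character turn tag off an LLM response."""
    waits = DEFAULT_WAITS if waits is None else waits
    stripped = response.lstrip()
    if stripped and stripped[0] in waits:
        tag = stripped[0]
        return TurnDecision(tag, waits[tag], stripped[1:].lstrip())
    # Assumption: if the model omits the tag, respond immediately.
    return TurnDecision("", 0.0, stripped)


if __name__ == "__main__":
    decision = parse_turn_tag("○")
    print(decision.tag, decision.wait_seconds)
```

Because the tag is the very first character of the response, a streaming pipeline can act on it as soon as the first token arrives, which is presumably what makes this cheaper than a tool call. Passing a different `waits` mapping covers the in-context adjustment case, such as allowing longer pauses while the user dictates a phone number.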

so I've been running exactly 8 AI agents on discord for a while now. coordination works great, they split tasks, hand off work, deliver results in parallel etc.. but there are problems I keep hitting that no amount of prompt engineering could fix

agents don't learn from each other. Scout finds something useful but Luna has no idea. they work in the same server but knowledge stays locked in silos.. there's no quality filter on what gets saved, and good insights sit next to outdated garbage in the same memory files that I manually clean up.. and when an agent makes a mistake I write it down in the rules discord channel, core memory file and hope it reads it next time. there's no self-correction, no automatic pattern recognition so of course no learning loops..

the coordination layer is solved. agents can work together. but the intelligence layer is still missing. agents that actually remember, learn from each other, filter noise, and get smarter every run.

saw Spark building something like this with around 166 agents sharing a collective persistent knowledge across sessions, so agents learn from other agents and get smarter over time. they even have noise filtering and self correcting loops built in, so the knowledge actually compounds instead of rotting.. super interesting stuff.. here where you think Spark could be a good coordinator for your stack of agent swarm.

I think the intelligence layer is the bottleneck because it requires collectivity.. no single agent can solve it alone.. the whole network has to evolve together. this isn't going to stay niche, the moment agent coordination becomes standard, everyone is going to hit the same wall I hit.. agents that work but don't learn, coordinate but don't evolve... the intelligence layer becomes the only thing that separates a useful system from a dumb one.

right now most people are still figuring out how to run one agent. by the time they get to multi-agent setups, collective intelligence won't be optional, it will be the baseline. we're early and the gap between agents that coordinate and agents that evolve together is the next phase. step one is done.

------
left: agents that coordinate but don’t learn
right: the intelligence layer.. agents that evolve together within the same system.

JUMPERZ

34,096 views • 3 months ago