Voice-controlled UI. This is an agent design pattern I'm calling EPIC, "explicit prompting for implicit coordination." Feel free to suggest a better name. :-) In the video, I'm navigating around a map, conversationally, pulling in information dynamically from tool calls and realtime streamed events. There are two separate agents...

kwindla

14,689 subscribers

14,091 views • 5 months ago • via X (Twitter)

Related videos

OpenAI shipped a new speech-to-speech model today: gpt-realtime-2 This is the first speech-to-speech model good enough to use in my voice agents that do "real work." Or real play, for that matter. Here's gpt-realtime-2 as the brain of the ship AI in Gradient Bang. The voice-to-voice response and tool calling times here are unedited, so you can see exactly what the interaction with the model is like in an agent with a very complex system instruction and frequent tool calls. (I did clip out the subagent task execution segments, after gpt-realtime-2 starts a subagent via a tool call. Subagents in this config used gpt-5.2 "medium" effort.)

kwindla

54,912 views • 2 months ago

We shipped Agent Console, a realtime debugging surface for voice agents. Talk to your agent and see the entire pipeline live, from audio and latency to tool calls, transcripts, and participant state. Available now in the LiveKit Cloud dashboard.

LiveKit

11,915 views • 3 months ago

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases.
Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models.
The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token."
This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower. Which is a problem for conversational voice agents.
We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled.
On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms faster!
On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.)
A couple of other things that this speed buys us, for production voice agents:
- Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design.
- We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn.
And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode.
Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation.
500ms is, effectively, instant. So the Cerebras naming for this mode is apropos. :-)
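The "structured data first, then plain text" idea from the post above could look roughly like this. It is only a sketch under stated assumptions: the model is assumed to have been prompted to emit one line of JSON followed by the text to speak, and the `split_structured_prefix` helper and the `tts_style` key are hypothetical names, not part of any Cerebras or Kimi API.

```python
import json


def split_structured_prefix(response: str) -> tuple[dict, str]:
    """Split a model response into (metadata, speech_text).

    Assumption (hypothetical convention): the first line is a single-line
    JSON object and everything after it is plain text for the TTS stage.
    If the first line is not a JSON object, the whole response is treated
    as speech and the metadata is empty.
    """
    first_line, _, rest = response.partition("\n")
    try:
        meta = json.loads(first_line)
    except json.JSONDecodeError:
        return {}, response.strip()
    if not isinstance(meta, dict):
        return {}, response.strip()
    return meta, rest.strip()


meta, speech = split_structured_prefix(
    '{"tts_style": "calm, slightly amused"}\nSure, here is the forecast.'
)
```

In a streaming pipeline the same split would presumably happen on the first newline of the token stream, so the style prompt is available before any audio is generated.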

kwindla

40,319 views • 1 month ago

Another example of the multiple TTS parallel pipelines pattern. Here's a voice AI agent that speaks both English and Arabic, using a specific model/voice for each language. These are PlayAI voices. The STT, TTS, and LLM inference are all running on Groq Inc. (The LLM is Llama 4 Maverick.)

kwindla

13,401 views • 1 year ago

AG-UI makes building agentic applications dramatically easier. Here's how it works.
This is a model for a simple chatbot:
User → LLM → Response
But interactive agents that render UI, pause for approvals, and ask users for input need a much more complex model.
When building these agents, a response from the LLM will include a series of state changes as the agent runs:
• Agent started a task
• Agent called a tool
• Agent updated its state
• Agent streams these tokens
• Agent is waiting on a human
• Agent is resuming the task
The Agent-User Interaction Protocol (AG-UI) treats the LLM response as a stream of events rather than a text endpoint.
In practice, here is what you get as an agent runs:
1. Lifecycle events so your UI knows where the agent is.
2. Text messages that stream tokens.
3. Tool calls so your UI can prefill a form with any required arguments.
4. State updates that keep your UI in sync with the agent.
5. Special events for human approvals, rich media, and custom needs.
All of these events travel over standard transports (SSE, WebSockets, or plain HTTP) as JSON.
As a result, you can build a frontend that stays in sync with the agent's progress without having to invent a custom process to make this happen. For example, building a human-in-the-loop workflow becomes an off-the-shelf component you can integrate rather than build from scratch.
CopilotKit🪁 is the creator of AG-UI, and you can use it when building frontend applications pretty much anywhere:
• React
• Angular
• Vue
• React Native
• Slack
• Teams
• Discord
• WhatsApp
• Telegram
Here is the link for you to check it out:
Thanks to the CopilotKit team for partnering with me on this post.
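To make the "stream of events rather than a text endpoint" idea concrete, here is a minimal sketch of a UI-side reducer that folds JSON events into view state. The event type names and fields (`run_started`, `text_delta`, and so on) are illustrative assumptions chosen to mirror the five categories in the post. They are not the actual AG-UI event schema, which should be taken from the protocol's own documentation.

```python
import json
from dataclasses import dataclass, field


@dataclass
class UiState:
    status: str = "idle"                            # where the agent is in its run
    text: str = ""                                  # streamed assistant text so far
    tool_calls: list = field(default_factory=list)  # (name, args) pairs
    shared: dict = field(default_factory=dict)      # agent state mirrored in the UI


def apply_event(state: UiState, raw: str) -> UiState:
    """Apply one JSON-encoded event (hypothetical schema) to the UI state."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "run_started":
        state.status = "running"
    elif kind == "text_delta":
        state.text += event["delta"]
    elif kind == "tool_call":
        state.tool_calls.append((event["name"], event.get("args", {})))
    elif kind == "state_update":
        state.shared.update(event["state"])
    elif kind == "waiting_for_human":
        state.status = "waiting"
    elif kind == "run_finished":
        state.status = "done"
    return state  # unknown event types are ignored


stream = [
    '{"type": "run_started"}',
    '{"type": "text_delta", "delta": "Booking "}',
    '{"type": "text_delta", "delta": "your flight."}',
    '{"type": "tool_call", "name": "book_flight", "args": {"to": "LIS"}}',
    '{"type": "state_update", "state": {"step": 2}}',
    '{"type": "waiting_for_human"}',
]
ui = UiState()
for line in stream:
    ui = apply_event(ui, line)
```

Because every event is a small JSON object, the same reducer would work unchanged whether the lines arrive over SSE, a WebSocket, or a plain HTTP response body.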

Santiago

17,438 views • 20 days ago

Voice prompting is 4x faster than typing. But I wanted more. Multi-agent orchestration with a shared memory system, running on virtual private servers shipping straight to production. All controlled by my voice with NVIDIA Parakeet running locally or gpt-realtime-2 in the cloud. CNVS is a macOS app built from the ground up in Swift for raw performance on Apple hardware. Agents have bidirectional control: they can spawn and prompt each other, terminals, and browsers. Create loops and even draw diagrams straight on the canvas. Nothing comes close, and I'm just getting started.

Max Blade

176,019 views • 1 month ago

Add a face to your voice agent. LiveAvatar by HeyGen is now supported in LiveKit Agents. Add a realtime human avatar to your agent without rebuilding the conversation loop. Your LiveKit agent still owns the room, turn-taking, model orchestration, and voice pipeline. LiveAvatar renders the synchronized face and video stream. Useful for product demos, onboarding, tutoring, and support agents that need a visual layer.

LiveKit

10,761 views • 2 months ago

Most AI agents do not fail because the prompt is weak. They fail because there is no loop around the prompt.
A real loop:
- Finds work
- Executes
- Verifies
- Saves state
- Stops or escalates
If an agent can say “done” without tests, a budget cap and an independent verifier, you built a demo.
Prompts make you the operator. Loops make the agent useful while you are offline.
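The five steps of the loop above can be sketched as a small driver function. This is an illustration only: `run_loop`, `LoopState`, and the callback names are hypothetical, the "agent" is a stand-in that upper-cases strings, and `max_steps` plays the role of the budget cap.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopState:
    done: list = field(default_factory=list)       # tasks that passed verification
    escalated: list = field(default_factory=list)  # tasks handed to a human
    steps: int = 0                                 # budget spent so far


def run_loop(
    find_work: Callable[[], Optional[str]],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    save_state: Callable[[LoopState], None],
    max_steps: int,
) -> LoopState:
    state = LoopState()
    while state.steps < max_steps:        # stop: budget cap reached
        task = find_work()                # find work
        if task is None:                  # stop: nothing left to do
            break
        state.steps += 1
        result = execute(task)            # execute
        if verify(task, result):          # verify, separately from execute
            state.done.append(task)
        else:
            state.escalated.append(task)  # escalate instead of claiming "done"
        save_state(state)                 # save state after every step
    return state


queue = ["a", "b", "c"]
snapshots = []
final = run_loop(
    find_work=lambda: queue.pop(0) if queue else None,
    execute=str.upper,
    verify=lambda task, result: result != "B",  # toy verifier: rejects task "b"
    save_state=lambda s: snapshots.append(s.steps),
    max_steps=10,
)
```

Keeping `verify` as a separate callback from `execute` is the point of the design: the code that does the work never gets to decide on its own that the work is done.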

Fluixo

16,089 views • 9 days ago

I've been trying to move from imperative prompting to declarative prompting. • Imperative prompting: I tell the agent how to do something. • Declarative prompting: I describe my goal to the agent. This was impossible before, but since Opus 4.5, I'm letting models fill in the blanks more and more. I recorded a video using Quest + Opus 4.5. This is a coding agent designed specifically for this purpose.

Santiago

96,471 views • 5 months ago

Introducing the Generative UI Research Canvas ✨
Every agent needs a powerful UI
Combine LangChain, Tavily, Tako and CopilotKit🪁 and you get a LangChain-based research agent with a Generative UI-driven frontend.
When agents can explore the underlying data, research becomes more trustworthy and useful.
How does this work?
- LangChain manages the agent workflow and tool orchestration.
- Tako converts retrieved data into structured, visual components.
- Tavily provides agentic search.
- CopilotKit renders those components in the UI and keeps the user in the loop while the agent is working.
So instead of a boring and static report, the output is an interactive research canvas that evolves as the agent runs.
2026 is the year where users want to see the output, not read it... Generative UI makes this possible for agentic applications.
Tutorial:
Open-Source Repo:
Docs:

Atai Barkai

11,997 views • 5 months ago

Async, automatic, non-blocking context compaction for long-running agents. Last week I gave a talk called Space Machine Sandboxes at the Daytona AI builders meetup about patterns for long-running agents. I work a lot on voice AI agents, which are fundamentally multi-turn, long-context loops. I also build lots of other AI agent stuff, often as part of bigger systems that include voice. One of the patterns I showed in the talk is non-blocking compaction. Here's a short clip.
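The post does not show the implementation from the talk, so the following is only one way non-blocking compaction might be sketched with `asyncio`: a background task summarizes a snapshot of the context while new turns keep arriving, and the summary is swapped in when it is ready. The `Context` class, the `limit` threshold, and the `summarize` stand-in (which would be an LLM call in practice) are all assumptions.

```python
import asyncio
from typing import Optional


async def summarize(messages: list[str]) -> str:
    """Stand-in for a slow LLM summarization call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"[summary of {len(messages)} messages]"


class Context:
    def __init__(self, limit: int = 3) -> None:
        self.messages: list[str] = []
        self.limit = limit
        self._task: Optional[asyncio.Task] = None

    def add(self, message: str) -> None:
        """Append a turn without waiting; start compaction if over the limit."""
        self.messages.append(message)
        if len(self.messages) > self.limit and self._task is None:
            snapshot_len = len(self.messages)
            self._task = asyncio.create_task(self._compact(snapshot_len))

    async def _compact(self, snapshot_len: int) -> None:
        summary = await summarize(self.messages[:snapshot_len])
        # Turns appended while the summary was being produced are kept verbatim.
        self.messages = [summary] + self.messages[snapshot_len:]
        self._task = None

    async def wait_idle(self) -> None:
        """Wait for any in-flight compaction (used here only for the demo)."""
        if self._task is not None:
            await self._task


async def demo() -> list[str]:
    ctx = Context(limit=3)
    for i in range(4):
        ctx.add(f"turn {i}")   # the fourth add schedules a compaction
    ctx.add("turn 4")          # arrives while compaction is still pending
    await ctx.wait_idle()
    return ctx.messages


result = asyncio.run(demo())
```

The conversation loop only ever calls `add`, which never awaits, so a slow summarization call cannot stall a voice turn.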

kwindla

25,877 views • 5 months ago

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents.
Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms.
This agent actually uses *three* NVIDIA open source models:
- Nemotron Speech ASR
- Nemotron 3 Nano 30GB in a 4-bit quant (released in December)
- A preview checkpoint of the upcoming Magpie text-to-speech model
These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.)
The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

kwindla

274,474 views • 6 months ago

Streamlining the UX in Gemini CLI ✨ We got a lot of feedback on how our UI was way too cluttered with noise and text... so we made changes! 🛠️ Compact tool calls: Tool calls for reading files, folders, searching text, etc are now a single line (no more tool boxes around everything!) 💭 Topics: The agent outputs a one line overview of the rationale and direction it is going. A topic can span several tool calls and makes it easy to see what the agent is working on at a glance and why.

Jack Wotherspoon

15,960 views • 2 months ago

Voice AI turn taking is a solved problem.
The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.)
Mark Backman made a Pipecat AI PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it.
The approach combines three layers of processing:
1. Voice activity detection, with a short (200ms) trigger.
2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed.
3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context.
None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year.
Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection.
Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger.
Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response.
- ✓ means the agent should respond normally (immediately)
- ○ is a "short incomplete" - the agent should wait 5 seconds
- ◐ is a "long incomplete" - the agent should wait 10 seconds
The wait times, and the details of the prompt, are configurable, of course.
Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency.
Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful.
The LLM in the video is GPT-4.1. We've tested the prompt and single-token adherence with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.
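The consumer side of the single-token tagging scheme described above could be sketched as follows. This is not the Pipecat mixin itself: `parse_turn_tag` and `WAIT_SECONDS` are hypothetical names, the wait times are the defaults quoted in the post, and whether each character really is one token depends on the model's tokenizer.

```python
# Tag character -> seconds to wait before the agent speaks (values from the post).
WAIT_SECONDS = {"✓": 0.0, "○": 5.0, "◐": 10.0}


def parse_turn_tag(response: str, default_wait: float = 0.0) -> tuple[float, str]:
    """Return (wait_seconds, text_to_speak) for one LLM response.

    If the response does not start with a known tag, the whole text is
    spoken after `default_wait` seconds.
    """
    text = response.lstrip()
    if text and text[0] in WAIT_SECONDS:
        return WAIT_SECONDS[text[0]], text[1:].lstrip()
    return default_wait, text


complete = parse_turn_tag("✓ Sure, your appointment is at 5pm.")
short_incomplete = parse_turn_tag("○")
untagged = parse_turn_tag("Hello there")
```

Because the tag is the very first character of the stream, the pipeline can make the wait-or-speak decision as soon as the first token arrives, without waiting for a tool call round trip.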

kwindla

26,918 views • 5 months ago

12-year-olds are the new 23-year-olds 🤯 A 12-year-old dev and his team built an investment advisor agent and won a hackathon using Lyzr AI Agent Studio. They used RAG and Perplexity tool for the agent, configured the agent in a few clicks and refined it until they got the desired output. So what's your excuse? Head to and start building your AI agents now.

Siva Surendira

83,183 views • 1 year ago

Codex team is back in the kitchen with a really nice quality of life upgrade for subagents. With the advent of custom roles, they have also upgraded the TUI experience in two really meaningful ways.
> All agents now get a name for better readability
> Additionally, the agent role is declared.
> Subagent name, role, and status are now color coded.
> Subagent rendering was also optimized for readability
> /agents slash command shows all agents, even 2+ layers deep.
And here's the biggest and most important change. Subagent injection.
Before, sometimes the orchestration agent would continue work and lose track of the work of a subagent. Now, when a subagent is blocked or completed, it injects a message back up the chain to ensure that the parent sees the message. This is a really big improvement overall, and leads to much more reliable inter-agent communication, reliability, and DX.
In this example, I used the parent agent to spin up a worker agent, which then spawned two more "Spark" agents a second layer deep. I was able to easily tell them apart, switch between the threads, and see exactly what they were prompted.
All of this will be available in update 0.105.0
I don't know who JIF is at OpenAI, but they are truly a legend.
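As a rough mental model of the subagent injection described above (and not Codex's actual implementation, which the post does not show), each agent can be pictured as holding a reference to its parent and pushing a status message up the chain when it blocks or completes. The `Agent` class, its `inbox`, and the `report` method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Agent:
    name: str
    role: str
    parent: Optional["Agent"] = None
    inbox: list = field(default_factory=list)  # messages injected by descendants

    def report(self, status: str, detail: str) -> None:
        """Inject a status message into the inbox of every ancestor."""
        message = f"{self.name} ({self.role}) {status}: {detail}"
        ancestor = self.parent
        while ancestor is not None:
            ancestor.inbox.append(message)
            ancestor = ancestor.parent


root = Agent("main", "orchestrator")
worker = Agent("atlas", "worker", parent=root)
spark = Agent("ember", "spark", parent=worker)

spark.report("completed", "unit tests pass")
worker.report("blocked", "needs API key")
```

Including the name and role in every injected message mirrors the readability changes in the list above: the parent can tell at a glance which subagent is speaking.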

am.will

78,267 views • 5 months ago

New course: Add voice to your AI agents and applications, built with Vocal Bridge (disclosure: an AI Fund portfolio company) and taught by its CEO Ashwyn Sharma.
Voice applications historically required making a hard tradeoff: using fast voice-to-voice models that sacrifice reliability, or accurate speech-to-text pipelines that add latency. This course teaches you how to build voice agents that are both reliable and fast.
You'll build three types of voice-enabled applications: a voice-interactive game where voice commands and mouse clicks work together over a single channel, an agent that gains a voice in about 10 lines of code without touching its prompts or tools, and an agent that places outbound phone calls using a make_phone_call function.
Skills you'll gain:
- Add a voice layer to an existing agent without rewriting your prompts, RAG pipeline, or tools
- Give an agent the ability to place outbound calls and stream transcripts back live
- Set up voice evaluation to score calls, catch regressions, and improve quality before deployment
Join and add voice to your agents without overhauling your architecture:

Andrew Ng

82,932 views • 1 month ago

Today, we’re sharing the first of what we’re calling Pika Experiments 🧪 - rough ideas we’ve been playing with behind the scenes. “Generative UI” is a voice-controlled interface where the agent listens, analyzes the context, and determines the most appropriate visual composition for each response. No rigid templates. The system dynamically generates HTML layouts every time.

Pika

19,574 views • 1 month ago

Here is how simple it is to build a UI front-end for an AI agent (assuming you are using .
The video here is 90 seconds and will show you everything you need to connect an agent to a web app:
1. Set up CopilotKit🪁 in your app
2. Copy the agent's localhost URL from LangGraph Studio
3. Paste that URL into a CopilotKit🪁 remote endpoint
That's all you need to get a front-end to interact with the agent!
The secret sauce here is CoAgents, an open-source library. You can use CoAgents to integrate any agent with your web application.
Here is what you get:
• Human-in-the-loop to steer and correct the agent
• Stream intermediate agent state
• Real-time state sharing between the agent and the application
• Agentic generative UI to build trust that the agent is on the right path
Watch the attached demo so you see it in action.

Santiago

46,109 views • 1 year ago