Another example of the multiple TTS parallel pipelines pattern. Here's a voice AI agent that speaks both English and Arabic, using a specific model/voice for each language. These are PlayAI voices. The STT, TTS, and LLM inference are all running on Groq Inc. (The LLM is Llama 4 Maverick.)
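The per-language voice idea can be sketched in plain Python. This is a minimal illustration, not the code behind the video: the post does not say how the agent chooses a language, so the script-based heuristic here is an assumption, and the voice identifiers and the names `TTSRoute`, `detect_language`, and `route` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSRoute:
    language: str
    voice: str  # hypothetical voice identifier, not a real PlayAI voice name


# Hypothetical routing table: one voice per supported language.
ROUTES = {
    "ar": TTSRoute(language="ar", voice="example-arabic-voice"),
    "en": TTSRoute(language="en", voice="example-english-voice"),
}


def is_arabic_char(ch: str) -> bool:
    # The main Unicode Arabic block is U+0600 to U+06FF.
    return "\u0600" <= ch <= "\u06FF"


def detect_language(text: str) -> str:
    """Assumed heuristic: Arabic if most letters are in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "en"  # fall back to English when there is nothing to inspect
    arabic = sum(1 for ch in letters if is_arabic_char(ch))
    return "ar" if arabic * 2 > len(letters) else "en"


def route(text: str) -> TTSRoute:
    """Pick the TTS route (language and voice) for one chunk of LLM output."""
    return ROUTES[detect_language(text)]
```

In a real pipeline, each route would feed its own TTS service, and only the branch matching the detected language would produce audio.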

kwindla

14,748 subscribers

13,401 views • 1 year ago • via X (Twitter)

Similar Videos

Llama 4 voice agent starter kit with Groq Inc and Pipecat AI
➡️ Groq STT (distil-whisper-large-v3)
➡️ Groq Llama 4 (llama-4-scout-17b-16e-instruct)
➡️ Groq TTS (playai-tts)
➡️ Function calling
➡️ Deploy to Pipecat Cloud for production
➡️ Optionally add a Twilio phone number for telephone voice AI

kwindla

58,764 views • 1 year ago

How to build the world's fastest voice AI bot:

- Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster.
- Route audio over the internet using WebRTC and edge networking.
- Configure timings for voice activity detection, phrase endpointing, and other parts of the pipeline to optimize for latency. (There are trade-offs to doing this!)

Here's a Llama 3 voice bot that has voice-to-voice response times of ~500ms. We used Deepgram's STT and TTS for this bot, and everything is hosted on cerebriumai's serverless GPU infrastructure.

kwindla

282,617 views • 2 years ago

I've been building voice agents for the last 6mo and I think the chat-supervisor pattern is a game changer. Stitched model (STT-LLM-TTS) is slow, but realtime audio models aren't (yet) as smart as text. This has the best of both worlds. Here's how it works:

Noah MacCallum

73,462 views • 1 year ago

🔊🔛🔥 ... Groq Inc launched voice generation today. GroqCloud now has realtime transcription, LLMs, *and* text-to-speech. You can build super-responsive, ultra low-latency voice agents end-to-end entirely on Groq! At Daily, we're big fans of Groq's fast, low-latency inference. Pipecat AI supports all the Groq models, including the new voice model. Lately, we've been obsessively playing a voice chat game that Mark Backman wrote. (My high score is 6, so far.) Here's Mark, with Groq's `Celeste-PlayAI` voice.

kwindla

31,791 views • 1 year ago

A voice agent powered by gpt-oss. Running locally on my MacBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below.

For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The default is "medium".) Notes on how to do that and a Jinja template you can use are in the repo.

The LLM in the demo video is the big, 120B version of gpt-oss. You can use the smaller, 20B model for this, of course. But OpenAI really did a cool thing here designing the 120B model to run in "just" 80GB of VRAM. And the llama.cpp mlx inference is fast: ~250ms TTFT. Running a big model on-device feels like a time warp into the future of AI.

kwindla

202,147 views • 11 months ago

Today we released our Voice Agent API, the world’s only enterprise-ready, real-time, and cost-effective Conversational AI API. If you're building voice agents and you’re tired of the demo day stitching: STT → LLM → Orchestration → TTS → Hoping it works We killed it with...

Deepgram

10,695 views • 1 year ago

Very, very fast voice bots. Llama 3.1 running on Groq Inc. 🚀 500ms voice-to-voice response times

kwindla

386,531 views • 2 years ago

Better/faster/cheaper voice AI turn detection with Gemini 2.0

The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the voice AI responds before the user has finished their thought, the conversation is choppy and unproductive. If the AI waits too long, the conversation is slow and frustrating.

There are a number of ways to approach this. You can:

1. Use a fast "voice activity detection" model to detect pauses in speech. Respond when the user pauses.
2. Use a specialized phrase endpointing model that operates on transcribed text, pattern-matching on text semantics.
3. Train a specialized phrase endpointing model that operates directly on audio.
4. Leverage the native audio capabilities of a SOTA LLM like Gemini 2.0.

We've benchmarked all four of these, and Gemini 2.0 currently beats other approaches. Using Gemini is also cheaper than transcribing the audio separately using a transcription service or model.

Here's a short video showing Gemini phrase endpointing in two scenarios. First, correctly handling pauses in natural conversation. Second, requesting a phone number (which is a common activity in a use case like customer support). You can see the Completeness check lines in the terminal output, printed each time Gemini processes a chunk of audio.
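For contrast with the Gemini approach, here is a minimal sketch of the simplest option in the list, pause-based endpointing. It is an illustration only: a plain energy threshold stands in for a real voice activity detection model, and the function name, frame size, threshold, and pause length are all assumed values, not settings from the post.

```python
from typing import Iterable, Optional


def find_turn_end(
    frame_energies: Iterable[float],
    frame_ms: int = 20,               # assumed audio frame duration
    energy_threshold: float = 0.01,   # assumed stand-in for a real VAD model
    pause_ms: int = 600,              # assumed pause length that ends a turn
) -> Optional[int]:
    """Return the elapsed time (ms) at which the turn is declared finished.

    Returns None if the user never spoke, or never paused for long enough.
    """
    heard_speech = False
    silence_ms = 0
    elapsed_ms = 0
    for energy in frame_energies:
        elapsed_ms += frame_ms
        if energy >= energy_threshold:
            # Speech frame: remember that the turn started, reset the pause timer.
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            # Silence after speech: accumulate until the pause is long enough.
            silence_ms += frame_ms
            if silence_ms >= pause_ms:
                return elapsed_ms
    return None
```

The trade-off described in the post shows up directly in `pause_ms`: a small value responds quickly but cuts off users who pause mid-thought, and a large value makes every response feel slow.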

kwindla

64,184 views • 1 year ago

Introducing Realtime TTS-2, a new generation of voice model built for realtime conversation. It is the first voice model that hears the conversation, takes natural-language voice direction, holds one voice identity across over 100 languages, and speaks like a person who is paying attention. The result is voice AI that feels as good as it sounds.

Inworld AI

326,012 views • 2 months ago

The voice-to-voice AI Pareto frontier. (You'll never believe this one weird trick ...)

If you're building conversational voice AI apps, you care a lot about:

➟ Latency
➟ Cost
➟ LLM response quality (predictable behavior, coverage of the full surface area of your needs, reliable function calling, "reasoning")
➟ Voice quality (correct pronunciations, consistent tone, appropriate affect, steerability, "human-ness")

AI model performance is a jagged frontier and has been moving fast for all of 2024. This is especially true for conversational voice, because you're usually using several models in combination. LLMs with native audio capabilities are the newest evolution pushing the performance frontier.

Here's a multi-lingual voice conversation using Gemini Flash 1.5's native audio input. But there's a problem ...

kwindla

23,592 views • 1 year ago

Voice AI turn taking is a solved problem.

The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.)

Mark Backman made a Pipecat AI PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it. The approach combines three layers of processing:

1. Voice activity detection, with a short (200ms) trigger.
2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed.
3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context.

None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection.

Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response.

- ✓ means the agent should respond normally (immediately)
- ○ is a "short incomplete" - the agent should wait 5 seconds
- ◐ is a "long incomplete" - the agent should wait 10 seconds

The wait times, and the details of the prompt, are configurable, of course.

Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful.

The LLM in the video is GPT-4.1. We've tested the prompt and single-token adherence with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.
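The single-token tagging step can be sketched as a small parser over the start of the LLM response. This is an illustration of the idea only, not the Pipecat mixin: the names `WAIT_SECONDS`, `TurnDecision`, and `parse_tagged_response` are hypothetical, and falling back to ✓ when the tag is missing is an assumed policy. The three tags and their wait times are the ones listed in the post.

```python
from typing import NamedTuple

# Tag characters and wait times as described in the post (configurable there).
WAIT_SECONDS = {"✓": 0.0, "○": 5.0, "◐": 10.0}


class TurnDecision(NamedTuple):
    tag: str
    wait_seconds: float
    text: str  # response text with the tag removed


def parse_tagged_response(response: str, default_tag: str = "✓") -> TurnDecision:
    """Split a leading turn tag from the rest of the LLM response.

    Assumed policy: if the model did not emit a known tag, treat the
    response as a normal (immediate) reply rather than dropping it.
    """
    stripped = response.lstrip()
    tag = stripped[:1]
    if tag in WAIT_SECONDS:
        return TurnDecision(tag, WAIT_SECONDS[tag], stripped[1:].lstrip())
    return TurnDecision(default_tag, WAIT_SECONDS[default_tag], stripped)
```

Because the tag is the first character of the stream, a pipeline can act on it as soon as the first token arrives, which is where the near-zero-latency trigger comes from.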

kwindla

26,918 views • 5 months ago

Voice cloning is now available on LiveKit Inference. We’re launching with Inworld AI and Cartesia. Clone a voice once and use it across multiple TTS providers, with automatic fallback to the same voice if a provider fails mid-call. Free to create and available on all paid plans today.

LiveKit

11,218 views • 2 months ago

Sound on for this one. We wired up OpenAI's new STT and TTS models into a single voice agent. The results are super fun! Link to a live demo in the next tweet.

dsa

32,296 views • 1 year ago

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases.

Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models.

The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower. Which is a problem for conversational voice agents.

We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms faster!

On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.)

A couple of other things that this speed buys us, for production voice agents:

- Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design.
- We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn.

And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode.

Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is apropos. :-)
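The "structured data first, then plain text" idea can be sketched as a parser that splits a response into a header and the text to speak. The wire format here is an assumption chosen for illustration (one JSON object on the first line, speech text after it); the post does not specify a format, and the function name `split_structured_prefix` and the `style` key are hypothetical.

```python
import json
from typing import Any, Dict, Tuple


def split_structured_prefix(response: str) -> Tuple[Dict[str, Any], str]:
    """Split an LLM response into (structured header, text for TTS).

    Assumed format: the first line is a JSON object, everything after it
    is plain text. If the first line is not a JSON object, the whole
    response is treated as plain text.
    """
    first_line, _, rest = response.partition("\n")
    try:
        header = json.loads(first_line)
    except json.JSONDecodeError:
        return {}, response.strip()
    if not isinstance(header, dict):
        return {}, response.strip()
    return header, rest.strip()
```

For example, `split_structured_prefix('{"style": "calm, slow"}\nYour order shipped today.')` yields the style header for a steerable TTS model and the sentence to speak. In a streaming pipeline, the header could be parsed as soon as the first newline arrives, before the rest of the text is generated.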

kwindla

40,319 views • 2 months ago

Local voice AI with a 235 billion parameter LLM. ✅

- smart-turn v2
- MLX Whisper (large-v3-turbo-q4)
- Qwen3-235B-A22B-Instruct-2507-3bit-DWQ
- Kokoro

All models running local on an M4 Mac. Max RAM usage ~110GB. Voice-to-voice latency is ~950ms. There are a couple of relatively easy ways to carve another ~100ms off that number. But it's not a bad start!
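A quick back-of-the-envelope check on why a 235B model can fit on one machine at 3 bits per weight. This is a rough lower bound on weight storage only, assuming exactly 3 bits per parameter; real quantized formats add overhead for scales, and the ~110GB figure in the post also covers the other models and runtime state, so the two numbers are not directly comparable. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# 235 billion parameters at an assumed flat 3 bits per weight.
qwen3_3bit_gb = weight_memory_gb(235e9, 3)  # about 88 GB

# The same model at 16 bits per weight, for comparison.
qwen3_16bit_gb = weight_memory_gb(235e9, 16)  # 470 GB
```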

kwindla

66,639 views • 1 year ago

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents.

Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms.

This agent actually uses *three* NVIDIA open source models:

- Nemotron Speech ASR
- Nemotron 3 Nano 30GB in a 4-bit quant (released in December)
- A preview checkpoint of the upcoming Magpie text-to-speech model

These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.)

The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

kwindla

274,474 views • 6 months ago

Voice-controlled UI. This is an agent design pattern I'm calling EPIC, "explicit prompting for implicit coordination." Feel free to suggest a better name. :-) In the video, I'm navigating around a map, conversationally, pulling in information dynamically from tool calls and realtime streamed events. There are two separate agents (inference loops) here: a voice agent and a UI control agent. They know about each other (at the prompt level) but they work independently.

kwindla

14,091 views • 5 months ago

I built a service desk agent using ElevenLabs’ new Conversational AI Agents feature. Watch the video to see how responsive it is! Previously, I used ElevenLabs to clone my voice and used the generated voice for narration in some of my YouTube videos. This feature takes ElevenLabs' voice AI to a whole new level! It simplifies systems that used to require separate TTS (text-to-speech) and STT (speech-to-text) processes for both sides of the conversation. Now, it’s much simpler! Create your own agent here. What will you create with this?

Melvin Vivas

27,722 views • 1 year ago