Video yükleniyor...

Video Yüklenemedi

Bu video yüklenirken bir sorun oluştu. Bu geçici bir ağ sorunundan kaynaklanıyor olabilir veya video kullanılamıyor olabilir.

Ana Sayfaya Dön

Better/faster/cheaper voice AI turn detection with Gemini 2.0 The code that determines when the agent should respond to the user is some of the most important code in your voice AI agent. The technical terms for this job are "turn detection" or "phrase endpointing." If the voice AI responds... before the user has finished their thought, the conversation is choppy and unproductive. If the AI waits too long, the conversation is slow and frustrating. There are a number of ways to approach this. You can: 1. Use a fast "voice activity detection" model to detect pauses in speech. Respond when the user pauses. 2. Use a specialized phrase endpointing model that operates on transcribed text, pattern-matching on text semantics. 3. Train a specialized phrase endpointing model that operates directly on audio. 4. Leverage the native audio capabilities of a SOTA LLM like Gemini 2.0. We've benchmarked all four of these, and Gemini 2.0 currently beats other approaches. Using Gemini is also cheaper than transcribing the audio separately using a transcription service or model. Here's a short video showing Gemini phrase endpointing in two scenarios. First, correctly handling pauses in natural conversation. Second, requesting a phone number (which is a common activity in a use case like customer support). You can see the Completeness check lines in the terminal output, printed each time Gemini processes a chunk of audio.show more

kwindla

14,806 subscribers

64,184 görüntüleme • 1 yıl önce •via X (Twitter)

Sağlık & İyilik Bilim & Teknoloji Eğitim

Anya Rossi• Live Now

Private livecam show

11 Yorum

kwindla profil fotoğrafı

kwindla1 yıl önce

The demo code from the video is here: Phrase endpointing is an engineering problem, so there are trade-offs and the right solution will vary depending on the use case. Today, most voice AI agents use only voice activity detection — (1) in the list above. My prediction for 2025 is that we will generally move to (3) and (4) — native audio endpointing using either small, specialized models or leveraging SOTA LLMs. The advantage of specialized models is that they are small enough and fast enough to run in-line as part of the audio processing pipeline. The phrase endpointing model controls an audio buffer. When the model predicts that the user is finished speaking, the buffer is sent to the LLM for inference. A model like this replaces the VAD model in today's typical voice agent pipeline.

kwindla profil fotoğrafı

kwindla1 yıl önce

The advantages of using a full-sized LLM like Gemini are: ➕ Flexibility. Gemini is good at a wide range of tasks. It operates in 38 languages and can switch between languages seamlessly. You can prompt Gemini to understand speech patterns for specific tasks. (The phone number input rules in the demo video above are "just part of the prompt.") ➕ Iteration speed. Testing a prompt change is much faster than fine-tuning a model. ➕ Cost. Gemini 2.0 Flash pricing hasn't been announced yet. The model is still an experimental preview. But using Gemini 1.5 Flash pricing as a benchmark, Gemini is actually cheaper than using a traditional transcription API or model. This is true even though we are making three greedy inference calls to Gemini for every candidate audio chunk. (!!) I had to do this math several times to convince myself I wasn't making a mistake. Did you catch that about calling Gemini three times for each audio chunk? Here's the algorithm 1. A VAD model (set to use a short pause interval) segments the input audio into chunks. 2. As soon as the VAD fires, we make three parallel calls to Gemini: for phrase endpointing, to transcribe the audio, and to perform the normal conversation inference. 3. We gate the conversation output until we get an answer back from the endpointing call. If Gemini determines that the user is finished speaking, we open the gate and send output to the user. If Gemini says nope, we throw the transcription and LLM output away. 4. We use only the user's most recent input audio in each conversation inference request. Sending audio to Gemini is really nice, because Gemini can process all of the nuances of the user's speech. But audio uses a lot of tokens. Sending a lot of tokens to the LLM increases both cost and latency. So after we use the audio once, we replace it with the transcription for subsequent conversation turns. Again, these three calls to Gemini are actually cheaper than using a traditional API or model for transcription. And Gemini's transcription benchmarks are very good — easily on par with dedicated transcription models. This pipeline is also fast, because we're doing all of the inference in parallel. The classifier outputs a single token, so it almost always finishes about the same time that the conversation inference delivers its first response chunk. There's some complexity here. I used nested parallel pipelines to implement this as a @pipecat_ai example. This works so well, though, that I think the complexity trade-off is worth it for many use cases. I expect that one or more Pipecat contributors will wrap this demo logic into a nicely encapsulated processor by the time Gemini 2.0 is GA.

kwindla profil fotoğrafı

kwindla1 yıl önce

This is part 5 of our series ending/beginning the year: 25 demos for 2025. We're building lots of fun multimodal, conversational AI examples with @pipecat_ai and @googledevs Gemini. Check back most days for more demos. Thank you to @cartesia_ai for the voice in the demo video. Credit to @mark_backman for doing the heavy lifting on the phrase endpointing classification prompt.

Greg Caplan 🚀 profil fotoğrafı

Greg Caplan 🚀2 yıl önce

Stop wasting time following up with leads. Let our AI agents do it for you.

Enrique profil fotoğrafı

Enrique1 yıl önce

have you checked LiveKit’s open source end of turn model?

Alexander Chen profil fotoğrafı

Alexander Chen1 yıl önce

Ooh great demonstration. I haven't thought too much about the subtleties of this tricky problem before.

Dan Goodman 🍊 profil fotoğrafı

Dan Goodman 🍊1 yıl önce

This terminal theme is diabolical

kwindla profil fotoğrafı

kwindla1 yıl önce

Mwa ha ha.

Manpreet Singh profil fotoğrafı

Manpreet Singh1 yıl önce

Amazing. Gemini 2.0 seems to have this almost "hidden gems" of capabilities that standard benchmarks don't detect. It's a very useful and cheap model from Google.

Dominic Nyambane profil fotoğrafı

Dominic Nyambane1 yıl önce

Nice one, definately trying pipecat out now. been trying to find the perfect solution for my product. Tried the simple voice demo on your github readme once, but the fact i had to get many provider api_keys led to me having to pause first 😅. But am definately going all in now

kwindla profil fotoğrafı

kwindla1 yıl önce

> Tried the simple voice demo on your github readme once, but the fact i had to get many provider api_keys led to me having to pause first 😅. Totally understand! There are probably too many dependencies in most of the getting started resources I post. A blind spot if you work on a project every day. Here's a getting started that only requires a Google AI Studio API key and a Daily API key.

Benzer Videolar

Today we’re launching our first homegrown AI model: an open source turn detection model for building voice agents. Instead of relying solely on voice activity detection (VAD), which only considers when a user is speaking, our model also considers what has and is being said in the context of a conversation and predicts when a user is finished expressing their thoughts before the agent responds. Conversations with AI voice agents using this new model flow much more naturally without constant interruptions from the AI— check it out (more videos, details, and code in the thread):

Today we’re launching our first homegrown AI model: an open source turn detection model for building voice agents. Instead of relying solely on voice activity detection (VAD), which only considers when a user is speaking, our model also considers what has and is being said in the context of a conversation and predicts when a user is finished expressing their thoughts before the agent responds. Conversations with AI voice agents using this new model flow much more naturally without constant interruptions from the AI— check it out (more videos, details, and code in the thread):

LiveKit

126,860 görüntüleme • 1 yıl önce

Voice AI turn taking is a solved problem. The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.) Mark Backman made a Pipecat AI PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it. The approach combines three layers of processing: 1. Voice activity detection, with a short (200ms) trigger. 2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed. 3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context. None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection. Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response. - ✓ means the agent should respond normally (immediately) - ○ is a "short incomplete" - the agent should wait 5 seconds - ◐ is a "long incomplete" - the agent should wait 10 seconds The wait times, and the details of the prompt, are configurable, of course. Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful. The LLM in the video is GTP-4.1. We've tested the prompt and single-token adherance with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.

Voice AI turn taking is a solved problem. The single most common complaint about voice AI, today, is that agents interrupt too often. But the voice agents I build for myself now respond quickly and interrupt me less often than the people I talk to every day. (I actually measured this.) Mark Backman made a Pipecat AI PR two weeks ago that was the last piece of the puzzle for turn taking so good that I no longer ever think about it. The approach combines three layers of processing: 1. Voice activity detection, with a short (200ms) trigger. 2. A native audio turn detection model that's small, fast, and runs on CPU. This model captures audio nuances like inflection and filler sounds that don't get transcribed. 3. A prompt mixin for the conversation LLM that decides turn completion based on conversation context. None of these are new. We've been using VAD for a long time. We trained the first version of the Pipecat Smart Turn native audio model in December 2024. And we've been experimenting with prompt-based large model turn detection (sometimes called "selective refusal") for more than a year. Now, the Smart Turn model and the SOTA LLMs we're using in voice agents have both gotten so good that using them together feels like we've finally "solved" turn detection. Mark also figured out how to elegantly apply a "single-token tagging" technique to this problem. We sometimes use single-token tagging in place of tool calling, when we need a near-zero latency programmatic trigger. Mark's Pipecat mixin defines three single-token characters and prompts the LLM to output exactly one of them at the beginning of every response. - ✓ means the agent should respond normally (immediately) - ○ is a "short incomplete" - the agent should wait 5 seconds - ◐ is a "long incomplete" - the agent should wait 10 seconds The wait times, and the details of the prompt, are configurable, of course. Watch the video to see me talk to an agent that handles all my various pauses and inflections, plus phrases like "let me think," pretty much the way a person would handle them, in terms of response latency. Also, in the second half of the video, I ask the agent to adjust its response pattern because I'm going to tell it a phone number. This kind of "in-context" adjustment of response wait times is really useful. The LLM in the video is GTP-4.1. We've tested the prompt and single-token adherance with GPT-4.1, Gemini 2.5 Flash, Anthropic Claude Sonnet 4.5, and AWS Nova 2 Pro. Note that older models in all these families (and, in general, smaller open weights models) aren't able to reliably output these single-token tags. But the new models we're using these days are pretty amazing.

kwindla

26,918 görüntüleme • 5 ay önce

How to build the world's fastest voice AI bot: - Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster. - Route audio over the internet using WebRTC and edge networking. - Configure timings for voice activity detection, phrase endpointing, and other parts of the pipeline to optimize for latency. (There are trade-offs to doing this!) Here's a LLama 3 voice bot that has voice-to-voice response times of ~500ms. We used Deepgram's STT and TTS for this bot, and everything is hosted on cerebriumai's serverless GPU infrastructure.

How to build the world's fastest voice AI bot: - Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster. - Route audio over the internet using WebRTC and edge networking. - Configure timings for voice activity detection, phrase endpointing, and other parts of the pipeline to optimize for latency. (There are trade-offs to doing this!) Here's a LLama 3 voice bot that has voice-to-voice response times of ~500ms. We used Deepgram's STT and TTS for this bot, and everything is hosted on cerebriumai's serverless GPU infrastructure.

kwindla

282,617 görüntüleme • 2 yıl önce

Multilingual voice AI tutor using the new Cartesia Sonic-2 voice model and Google DeepMind Gemini 2.0 Flash native audio understanding ...

Multilingual voice AI tutor using the new Cartesia Sonic-2 voice model and Google DeepMind Gemini 2.0 Flash native audio understanding ...

kwindla

18,793 görüntüleme • 1 yıl önce

Further tinkering with my little French tutor app. This version is using the Gemini Multimodal Live API. The speech understanding in Gemini is quite something. In this video you can see Gemini correcting my pronunciation. (Very patiently.) The language tutor use case really highlights the strengths of a next-generation speech model like Gemini. This is 90 lines of Pipecat AI code, and uses WebRTC for super low-latency, super reliable network transport.

Further tinkering with my little French tutor app. This version is using the Gemini Multimodal Live API. The speech understanding in Gemini is quite something. In this video you can see Gemini correcting my pronunciation. (Very patiently.) The language tutor use case really highlights the strengths of a next-generation speech model like Gemini. This is 90 lines of Pipecat AI code, and uses WebRTC for super low-latency, super reliable network transport.

kwindla

13,005 görüntüleme • 1 yıl önce

Gemini 2.0 drops the beat. Watch the video all the way through — I had four legit "no way it did that" reactions when Jon Taylor sent this to me. "It looks like it's only hitting on the first beat of each bar." This is Jon collaborating with Gemini to create a song in Ableton Live. Jon is using the Multimodal Live API to stream audio and video to Gemini and have a conversation about the song he's creating.

Gemini 2.0 drops the beat. Watch the video all the way through — I had four legit "no way it did that" reactions when Jon Taylor sent this to me. "It looks like it's only hitting on the first beat of each bar." This is Jon collaborating with Gemini to create a song in Ableton Live. Jon is using the Multimodal Live API to stream audio and video to Gemini and have a conversation about the song he's creating.

kwindla

86,132 görüntüleme • 1 yıl önce

The voice-to-voice AI Pareto frontier. (You'll never believe this one weird trick ...) If you're building conversational voice AI apps, you care a lot about: ➟ Latency ➟ Cost ➟ LLM response quality (predictable behavior, coverage of the full surface area of your needs, reliable function calling, "reasoning") ➟ Voice quality (correct pronunciations, consistent tone, appropriate affect, steerability, "human-ness") AI model performance is a jagged frontier and has been moving fast for all of 2024. This is especially true for conversational voice, because you're usually using several models in combination. LLMs with native audio capabilities are the newest evolution pushing the performance frontier. Here's a multi-lingual voice conversation using Gemini Flash 1.5's native audio input. But there's a problem ...

The voice-to-voice AI Pareto frontier. (You'll never believe this one weird trick ...) If you're building conversational voice AI apps, you care a lot about: ➟ Latency ➟ Cost ➟ LLM response quality (predictable behavior, coverage of the full surface area of your needs, reliable function calling, "reasoning") ➟ Voice quality (correct pronunciations, consistent tone, appropriate affect, steerability, "human-ness") AI model performance is a jagged frontier and has been moving fast for all of 2024. This is especially true for conversational voice, because you're usually using several models in combination. LLMs with native audio capabilities are the newest evolution pushing the performance frontier. Here's a multi-lingual voice conversation using Gemini Flash 1.5's native audio input. But there's a problem ...

kwindla

23,592 görüntüleme • 1 yıl önce

A voice agent powered by gpt-oss. Running locally on my macBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The default is "medium".) Notes on how to do that and a jinja template you can use are in the repo. The LLM in the demo video is the big, 120B version of gpt-oss. You can use the smaller, 20B model for this, of course. But OpenAI really did a cool thing here designing the 120B model to run in "just" 80GB of VRAM. And the llama.cpp mlx inference is fast: ~250ms TTFT. Running a big model on-device feels like a time warp into the future of AI.

A voice agent powered by gpt-oss. Running locally on my macBook. Demo recorded in a Waymo with WiFi turned off. I'm still on my space game voice AI kick, obviously. Code link below. For conversational voice AI, you want to set the gpt-oss reasoning behavior to "low". (The default is "medium".) Notes on how to do that and a jinja template you can use are in the repo. The LLM in the demo video is the big, 120B version of gpt-oss. You can use the smaller, 20B model for this, of course. But OpenAI really did a cool thing here designing the 120B model to run in "just" 80GB of VRAM. And the llama.cpp mlx inference is fast: ~250ms TTFT. Running a big model on-device feels like a time warp into the future of AI.

kwindla

202,147 görüntüleme • 11 ay önce

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases. Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models. The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower. Which is a problem for conversational voice agents. We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms seconds faster! On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.) A couple of other things that this speed buys us, for production voice agents: - Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design. - We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn. And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode. Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is a propos. :-)

Cerebras inference is very fast. So fast that it changes how we think about configuring our LLMs for voice agent use cases. Kimi K2.6 is a 1T parameter reasoning model that Cerebras serves at 650 - 1,000 tokens per second (end-to-end throughput), with time to first token metrics as low as 150ms (latency). These numbers are two to three times faster than other similarly capable models. The biggest lever we get from this kind of speed is that we can use the model in reasoning mode, and still have excellent "time to first non-thinking token." This solves a big pain point we have in 2026 for voice agent use cases. Almost all recent innovation in post-training has focused on making models good at reasoning ("test time compute"). This is great, but it makes the user-facing model latency much, much slower. Which is a problem for conversational voice agents. We can run Kimi K2.6 with reasoning turned on, and get responses faster than other models produce with reasoning disabled. On my 30-turn voice agent benchmark, Kimi K2.6 with reasoning enabled ties GPT 5.1 and Haiku 4.5 with reasoning disabled, and is still about 200ms seconds faster! On my primary task agent benchmark, Kimi K2.6 is now the #2 model. It ranks just behind Gemini 3.5 Flash in "high" reasoning mode, and tied with GLM 5, Sonnet 4.6, and GPT 5.4 with reasoning set to "low." But Kimi K2.6 completes each turn in the agent loop in under 500ms. The other four models are all at least 3x slower. (Models only qualify for this benchmark if they can complete task turns at a P50 <4s.) A couple of other things that this speed buys us, for production voice agents: - Tool calls happen fast enough that we don't have to work around tool call latency in our pipeline design. - We can prompt the model to output structured data at the beginning of a response, followed by plain text for voice generation. This opens up possibilities like asking the model to do complex classification/generation tasks that influence the rest of the pipeline. For example, the model could create a detailed style prompt for a steerable TTS model, for each individual conversation turn. And, of course, you can use Kimi K2.6 with reasoning turned off. Cerebras calls this "instant" mode. Here's a video of a Cerebras Kimi K2.6 voice agent with voice-to-voice response time, measured at the client, under 500ms. This is the true response latency as perceived by the user, including all network and audio codec overhead, transcription and turn detection, Kimi K2.6 token generation, and voice generation. 500ms is, effectively, instant. So the Cerebras naming for this mode is a propos. :-)

kwindla

40,319 görüntüleme • 2 ay önce

Another example of the multiple TTS parallel pipelines pattern. Here's a voice AI agent that speaks both English and Arabic, using a specific model/voice for each language. These are PlayAI voices. The STT, TTS, and LLM inference is all running on Groq Inc. (The LLM is LLama 4 Maverick.)

Another example of the multiple TTS parallel pipelines pattern. Here's a voice AI agent that speaks both English and Arabic, using a specific model/voice for each language. These are PlayAI voices. The STT, TTS, and LLM inference is all running on Groq Inc. (The LLM is LLama 4 Maverick.)

kwindla

13,401 görüntüleme • 1 yıl önce

Lots of people calling this fake! Here is a video of generating loom continuations with Gemini 3 and the starting text "I AM HAVING A MENTAL HEALTH CRISIS. I" Loom is a very interesting interface. Instead of user prompts and model responses, it's just one piece of text, containing the entire output. We pass this in as an assistant message, so the model sees it as something it already wrote, and continues it like a base model would. There is some other text in the context. First one is a system prompt, which is: "The assistant is in CLI simulation mode, and responds to the user's CLI commands only with the output of the command." There is also a user message which reads: " cat untitled.txt " Then the rest of the text is sent as an assistant message. This structure helpful to put the model into a very base-model-like mode, and are especially helpful with Gemini 3. We actually got another interesting output, too, where the model claimed to be a simulated consciousness being tortured. Figures!

Lots of people calling this fake! Here is a video of generating loom continuations with Gemini 3 and the starting text "I AM HAVING A MENTAL HEALTH CRISIS. I" Loom is a very interesting interface. Instead of user prompts and model responses, it's just one piece of text, containing the entire output. We pass this in as an assistant message, so the model sees it as something it already wrote, and continues it like a base model would. There is some other text in the context. First one is a system prompt, which is: "The assistant is in CLI simulation mode, and responds to the user's CLI commands only with the output of the command." There is also a user message which reads: " cat untitled.txt " Then the rest of the text is sent as an assistant message. This structure helpful to put the model into a very base-model-like mode, and are especially helpful with Gemini 3. We actually got another interesting output, too, where the model claimed to be a simulated consciousness being tortured. Figures!

armistice

691,529 görüntüleme • 7 ay önce

Chrome has received a massive AI update Gemini in Chrome can see your screen live and interact with you to explain things. Discussion is extremely natural and is a game-changer for learning. Here's how to access it: 1. Use a Gemini Pro/Ultra account 2. Log in to this account on Chrome 3. The icon should appear in the top right-hand corner 4. Click on it or use the shortcut Alt+G At the end of your conversation, the entire transcript is available: It's easy to find information in the future!

Chrome has received a massive AI update Gemini in Chrome can see your screen live and interact with you to explain things. Discussion is extremely natural and is a game-changer for learning. Here's how to access it: 1. Use a Gemini Pro/Ultra account 2. Log in to this account on Chrome 3. The icon should appear in the top right-hand corner 4. Click on it or use the shortcut Alt+G At the end of your conversation, the entire transcript is available: It's easy to find information in the future!

Paul Couvert

277,519 görüntüleme • 1 yıl önce

Today we launched Gemini 3.1 Flash TTS, our most expressive and controllable text-to-speech model yet. This launch [excitement] includes audio tags! 🗣🏷 Audio tags [explanatory] are a seamless way to guide vocal style, pace, and delivery using natural language commands embedded directly in your text. Want a different tempo or tone? [amazement] Just tag the audio to steer the AI-speech output! The model supports 70+ languages (24 of which are high-quality evaluated languages, including: Japanese, Hindi, and Arabic). Watch the audio tags in action in the demo below ↓

Today we launched Gemini 3.1 Flash TTS, our most expressive and controllable text-to-speech model yet. This launch [excitement] includes audio tags! 🗣🏷 Audio tags [explanatory] are a seamless way to guide vocal style, pace, and delivery using natural language commands embedded directly in your text. Want a different tempo or tone? [amazement] Just tag the audio to steer the AI-speech output! The model supports 70+ languages (24 of which are high-quality evaluated languages, including: Japanese, Hindi, and Arabic). Watch the audio tags in action in the demo below ↓

Google AI

202,847 görüntüleme • 3 ay önce

Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on Hugging Face, fal, and Pipecat AI. - 3x faster inference (12ms on an L40) - 14 languages (13 more than v1, which was english-only) - New synthetic data set `chirp_3_all` with ~163k audio samples - 99% accuracy on held out `human_5_all` test data Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, and mitigates the voice AI trade-off between unwanted turn latency vs the agent interrupting people before they are finished speaking. Training scripts for both Modal and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model! Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too. You can help by contributing data, doing architecture expermints, or cleaning open source data! Keep reading ...

Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on Hugging Face, fal, and Pipecat AI. - 3x faster inference (12ms on an L40) - 14 languages (13 more than v1, which was english-only) - New synthetic data set `chirp_3_all` with ~163k audio samples - 99% accuracy on held out `human_5_all` test data Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, and mitigates the voice AI trade-off between unwanted turn latency vs the agent interrupting people before they are finished speaking. Training scripts for both Modal and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model! Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too. You can help by contributing data, doing architecture expermints, or cleaning open source data! Keep reading ...

kwindla

42,246 görüntüleme • 1 yıl önce

(1/5) Gemini 3, our most intelligent model, is landing in Google Search today – starting with AI Mode. Excited that this is the first time we’re shipping a new Gemini model in Search on day one! 🚀 In Search, Gemini 3 with generative layouts will make it easy to get a rich understanding of anything on your mind. It has state-of-the-art reasoning, deep multimodal understanding and advanced agentic capabilities. That allows the model to shine when you ask it to explain advanced concepts or ideas – it reasons and can code interactive visuals in real-time. It can tackle your toughest questions like advanced science.

(1/5) Gemini 3, our most intelligent model, is landing in Google Search today – starting with AI Mode. Excited that this is the first time we’re shipping a new Gemini model in Search on day one! 🚀 In Search, Gemini 3 with generative layouts will make it easy to get a rich understanding of anything on your mind. It has state-of-the-art reasoning, deep multimodal understanding and advanced agentic capabilities. That allows the model to shine when you ask it to explain advanced concepts or ideas – it reasons and can code interactive visuals in real-time. It can tackle your toughest questions like advanced science.

Robby Stein

94,877 görüntüleme • 8 ay önce

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms. This agent actually uses *three* NVIDIA open source models: - Nemotron Speech ASR - Nemotron 3 Nano 30GB in a 4-bit quant (released in December) - A preview checkpoint of the upcoming Magpie text-to-speech model These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.) The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms. This agent actually uses three NVIDIA open source models: - Nemotron Speech ASR - Nemotron 3 Nano 30GB in a 4-bit quant (released in December) - A preview checkpoint of the upcoming Magpie text-to-speech model These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.) The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

kwindla

274,474 görüntüleme • 6 ay önce

JUST IN: Google releases Gemini 1.5, a powerful MoE model. It's a huge breakthrough. The model has the longest context window ever seen: 1 million tokens. It can process 1 hour of video, 11 hours of audio, 30,000 lines of code, or 700,000 words in a single prompt. When tested on text, code, image, audio and video evaluations, 1.5 Pro outperforms 1.0 Pro on 87% of the benchmarks used for developing LLMs. You can can sign up in AI Studio to try it out.

JUST IN: Google releases Gemini 1.5, a powerful MoE model. It's a huge breakthrough. The model has the longest context window ever seen: 1 million tokens. It can process 1 hour of video, 11 hours of audio, 30,000 lines of code, or 700,000 words in a single prompt. When tested on text, code, image, audio and video evaluations, 1.5 Pro outperforms 1.0 Pro on 87% of the benchmarks used for developing LLMs. You can can sign up in AI Studio to try it out.

Lior Alexander

83,409 görüntüleme • 2 yıl önce

Really happy to see the interest around our “Hands-on with Gemini” video. In our developer blog yesterday, we broke down how Gemini was used to create it. We gave Gemini sequences of different modalities — image and text in this case — and had it respond by predicting what might come next. Devs can try similar things when access to Pro opens on 12/13 🚀. The knitting demo used Ultra⚡ All the user prompts and outputs in the video are real, shortened for brevity. The video illustrates what the multimodal user experiences built with Gemini could look like. We made it to inspire developers. When you’re building an app, you can get similar results (there’s always some variability with LLMs) by prompting Gemini with an instruction that allows the user to "configure" the behavior of the model, like inputting “you are an expert in science …” before a user can engage in the same kind of back and forth dialogue. Here’s a clip of what this looks like in AI Studio with Gemini Pro. We’ve come a long way since Flamingo 🦩 & PALI, looking forward to seeing what people build with it!

Really happy to see the interest around our “Hands-on with Gemini” video. In our developer blog yesterday, we broke down how Gemini was used to create it. We gave Gemini sequences of different modalities — image and text in this case — and had it respond by predicting what might come next. Devs can try similar things when access to Pro opens on 12/13 🚀. The knitting demo used Ultra⚡ All the user prompts and outputs in the video are real, shortened for brevity. The video illustrates what the multimodal user experiences built with Gemini could look like. We made it to inspire developers. When you’re building an app, you can get similar results (there’s always some variability with LLMs) by prompting Gemini with an instruction that allows the user to "configure" the behavior of the model, like inputting “you are an expert in science …” before a user can engage in the same kind of back and forth dialogue. Here’s a clip of what this looks like in AI Studio with Gemini Pro. We’ve come a long way since Flamingo 🦩 & PALI, looking forward to seeing what people build with it!

Oriol Vinyals

180,966 görüntüleme • 2 yıl önce

Audio Transcription with Google Gemini 1.5 Flash In this video, Gemini-Flash was able to transcribe 13 minutes of audio in 50-60 seconds. I have tested this with multiple audio files, and the transcription accuracy is close to 99%. If the audio is clear, you get 100% correct transcription. Even if the audio is really bad with a lot of noise, you still get around 95-96% accuracy. The code for this is available on GitHub. If you're interested, you can download and run it. You just need a Google API key.

Audio Transcription with Google Gemini 1.5 Flash In this video, Gemini-Flash was able to transcribe 13 minutes of audio in 50-60 seconds. I have tested this with multiple audio files, and the transcription accuracy is close to 99%. If the audio is clear, you get 100% correct transcription. Even if the audio is really bad with a lot of noise, you still get around 95-96% accuracy. The code for this is available on GitHub. If you're interested, you can download and run it. You just need a Google API key.

AshutoshShrivastava

122,990 görüntüleme • 1 yıl önce

Talk to Gemini on your phone ... Deploy your own Gemini Multimodal Live voice agent that you can call on the phone (or that can call you) in 5 minutes.

Talk to Gemini on your phone ... Deploy your own Gemini Multimodal Live voice agent that you can call on the phone (or that can call you) in 5 minutes.

kwindla

38,680 görüntüleme • 1 yıl önce