As promised.
Write-up on implementing and optimizing conversational agents:
An open source repo which is a generic WebSocket server for low-latency conversational agents:
And another demo

Sean Moriarity

3,519 subscribers

15,924 views • 2 years ago • via X (Twitter)

10 Comments

Sean Moriarity • 2 years ago

With a LiveView implementation using client-side VAD I was able to consistently get 1300-1500 ms time to first spoken word with about 100 ms ping between my Mac and my GPU machine. Running locally on the GPU machine I could get 900-1000ms time to first spoken word.
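A minimal sketch of how a "time to first spoken word" figure like the ones above could be measured, assuming the clock starts when the VAD reports end of speech and stops at the first non-empty audio chunk of the reply. The function name `time_to_first_audio_ms` and the chunk iterable are hypothetical and are not taken from the repo mentioned in the post.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_audio_ms(
    reply_chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Milliseconds from this call until the first non-empty audio chunk.

    Call this at the moment the VAD decides the user has stopped speaking.
    `reply_chunks` is a hypothetical lazy stream of synthesized audio, so
    the speech-to-text, LLM, and text-to-speech time is all spent while
    waiting for the first chunk.
    """
    start = clock()
    for chunk in reply_chunks:
        if chunk:  # skip empty keep-alive chunks
            return (clock() - start) * 1000.0
    return None  # the stream ended without producing any audio


def fake_pipeline(delay_s: float):
    """Stand-in for a real pipeline: waits, then yields two audio chunks."""
    time.sleep(delay_s)
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    ms = time_to_first_audio_ms(fake_pipeline(0.05))
    print(f"time to first audio: {ms:.0f} ms")
```

Measured on the client, a number like this also includes the network round trip, which is one reason the remote figures quoted above are higher than the local ones.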

Sean Moriarity • 2 years ago

I haven't fully benchmarked the WebSocket-based server, but it should be similar. It also has a much better VAD implementation, so it's not as broken as my last one.

Sean Moriarity • 2 years ago

Eventually I will get a version running on a Fly GPU with both a local Bumblebee LLM and Speech-to-text pipeline

Michał Śledź • 2 years ago

Really impressive! 👏 Would be awesome to try to run this over Elixir WebRTC instead of a WS. We did a demo where we send video from a webcam over WebRTC to the Phoenix app, feed it into Nx and perform image recognition. Here is the blog post:

Sean Moriarity • 2 years ago

Thanks for the suggestion! I’ll look into this!

Mohammed Zeeshan • 2 years ago

this is mindblowing stuff

Bill Tihen • 2 years ago

Wow - very cool

Colm Byrne • 2 years ago

From what you created it seems like Retell don't have much of a ring fence if it can be hacked together in a couple days. Thoughts?

Sean Moriarity • 2 years ago

Good question, apologies in advance for the long reply. I think they probably have the most complete and reliable product in the space that I’ve seen in my limited exposure to it. I think that there are a lot of tiny details that go into making conversations realistic, and if they iterate on that then they can put some distance between themselves and anybody else.

This idea of hacking together 3 models has been “in the air” for a while, and it’s not difficult to get your own working version up and running quickly, if you can accept a 70-80% solution. It’s especially compelling to build your own if you need it because their prices are kinda high, and I think you can save long term if you invest in it. Also, there are going to be a million open source versions of this exact thing popping up now that they’ve done their launch and set the standard.

I think if their target market is developers (which I believe it is) then they’re in a tough spot because I think you can build something comparable (not better!) that’s cheaper. To me what’s much more attractive is if they go after direct applications of conversational agents in market research, surveying, etc. and can capitalize quickly on having the best offering early. My feeling though is that they actually would prefer their users to be the ones building integrations for specific niches on top of their platform so they can focus on improving the conversational experience. In that case I would be really nervous about a big AI research lab releasing a foundation model that’s either end-to-end or fuses parts of the pipeline more efficiently than they can.

I got the sense their plan is to actually train their own models eventually, in which case they can capitalize on this head start, exposure, and data from early launch and maybe establish a much bigger lead than what they have now. Not sure how much funding they have but this would require a decent amount.

Sorry for the long answer, and take everything with a grain of salt because I have never run a startup before hahahaha

Holden Oullette • 2 years ago

I know I’m late on the draw about this, but if you’re trying to eke out every little bit of performance: there’s a change in the alpha version of Jason v1.5 that introduces an optional dep containing a Rust NIF for Jason.encode, increasing speeds 1.5x for most inputs
