As promised. Write-up on implementing and optimizing conversational agents:
An open source repo which is a generic WebSocket server for low-latency conversational agents:
And another demo

15,915 views • 2 years ago • via X (Twitter)

10 Comments

Sean Moriarity • 2 years ago

With a LiveView implementation using client-side VAD I was able to consistently get 1300-1500 ms time to first spoken word with about 100 ms ping between my Mac and my GPU machine. Running locally on the GPU machine I could get 900-1000ms time to first spoken word.

Sean Moriarity • 2 years ago

I haven't fully benchmarked the WebSocket-based server, but it should be similar. It also has a much better VAD implementation, so it's not as broken as my last one.

Sean Moriarity • 2 years ago

Eventually I will get a version running on a Fly GPU with both a local Bumblebee LLM and a speech-to-text pipeline.

Michał Śledź • 2 years ago

Really impressive! 👏 Would be awesome to try running this over Elixir WebRTC instead of a WebSocket. We did a demo where we send video from a webcam over WebRTC to a Phoenix app, feed it into Nx, and perform image recognition. Here is the blog post:

Sean Moriarity • 2 years ago

Thanks for the suggestion! I’ll look into this!

Mohammed Zeeshan • 2 years ago

this is mindblowing stuff

Bill Tihen • 2 years ago

Wow - very cool

Colm Byrne • 2 years ago

From what you created it seems like Retell don't have much of a ring fence if it can be hacked together in a couple days. Thoughts?

Sean Moriarity • 2 years ago

Good question, apologies in advance for the long reply. I think they probably have the most complete and reliable product in the space I’ve seen in my limited exposure to it. I think that there are a lot of tiny details that go into making conversations realistic, and if they iterate on that then they can put some distance between themselves and anybody else.

This idea of hacking together 3 models has been “in the air” for a while, and it’s not difficult to get your own working version up and running quickly, if you can accept a 70-80% solution. It’s especially compelling to build your own if you need it because their prices are kinda high, and I think you can save long term if you invest in it. Also, there are going to be a million open source versions of this exact thing popping up now that they’ve done their launch and set the standard.

I think if their target market is developers (which I believe it is) then they’re in a tough spot because I think you can build something comparable (not better!) that’s cheaper. To me what’s much more attractive is if they go after direct applications of conversational agents in market research, surveying, etc. and can capitalize quickly on having the best offering early. My feeling though is that they actually would prefer their users to be the ones building integrations for specific niches on top of their platform so they can focus on improving the conversational experience. In that case I would be really nervous about a big AI research lab releasing a foundation model that’s either end-to-end or fuses parts of the pipeline more efficiently than they can.

I got the sense their plan is to actually train their own models eventually, in which case they can capitalize on this head start, exposure, and data from early launch and maybe establish a much bigger lead than what they have now. Not sure how much funding they have but this would require a decent amount.

Sorry for the long answer, and take everything with a grain of salt because I have never run a startup before hahahaha

Holden Oullette • 2 years ago

I know I’m late on the draw about this, but if you’re trying to eke out every little bit of performance: there’s a change in the alpha version of Jason v1.5 that introduces an optional dep containing a Rust NIF for Jason.encode - increasing speeds 1.5x for most inputs
