As promised. Write-up on implementing and optimizing conversational agents: An open-source repo which is a generic WebSocket server for low-latency conversational agents: And another demo

15,924 views • 2 years ago • via X (Twitter)

10 Comments

Sean Moriarity • 2 years ago

With a LiveView implementation using client-side VAD, I was able to consistently get 1300-1500 ms time to first spoken word with about 100 ms ping between my Mac and my GPU machine. Running locally on the GPU machine, I could get 900-1000 ms time to first spoken word.
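For anyone wanting to reproduce numbers like these, here is a minimal sketch of one way to measure time to first spoken word. It assumes the measurement runs from the moment VAD decides the user has stopped talking to the arrival of the first synthesized audio chunk; the comment above does not spell out its exact start and end points, so that definition is a guess. `TurnTimer` and its method names are hypothetical and not taken from the repo.

```python
import time
from typing import Callable, Optional


class TurnTimer:
    """Tracks one conversational turn: end of user speech -> first agent audio."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # A monotonic clock avoids jumps from system clock adjustments.
        self._clock = clock
        self._speech_ended_at: Optional[float] = None
        self._first_audio_at: Optional[float] = None

    def mark_speech_end(self) -> None:
        """Call when VAD reports that the user stopped talking."""
        self._speech_ended_at = self._clock()
        self._first_audio_at = None

    def mark_audio_chunk(self) -> None:
        """Call for every synthesized audio chunk; only the first one is recorded."""
        if self._speech_ended_at is not None and self._first_audio_at is None:
            self._first_audio_at = self._clock()

    def latency_ms(self) -> Optional[float]:
        """Time to first spoken word in milliseconds, or None if the turn is incomplete."""
        if self._speech_ended_at is None or self._first_audio_at is None:
            return None
        return (self._first_audio_at - self._speech_ended_at) * 1000.0
```

If both marks are taken on the client, the figure includes network round trips, which would be one reason the remote setup reads higher than the local one.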

Sean Moriarity • 2 years ago

I haven't fully benchmarked the WebSocket-based server, but it should be similar. It also has a much better VAD implementation, so it's not as broken as my last one.

Sean Moriarity • 2 years ago

Eventually I will get a version running on a Fly GPU with both a local Bumblebee LLM and a speech-to-text pipeline.

Michał Śledź • 2 years ago

Really impressive! 👏 Would be awesome to try running this over Elixir WebRTC instead of a WS. We did a demo where we send video from a webcam over WebRTC to the Phoenix app, feed it into Nx, and perform image recognition. Here is the blog post:

Sean Moriarity • 2 years ago

Thanks for the suggestion! I’ll look into this!

Mohammed Zeeshan • 2 years ago

this is mindblowing stuff

Bill Tihen • 2 years ago

Wow - very cool

Colm Byrne • 2 years ago

From what you created it seems like Retell don't have much of a ring fence if it can be hacked together in a couple days. Thoughts?

Sean Moriarity • 2 years ago

Good question, apologies in advance for the long reply. I think they probably have the most complete and reliable product in the space that I’ve seen in my limited exposure to it. I think there are a lot of tiny details that go into making conversations realistic, and if they iterate on that then they can put some distance between themselves and anybody else.
This idea of hacking together 3 models has been “in the air” for a while, and it’s not difficult to get your own working version up and running quickly, if you can accept a 70-80% solution. It’s especially compelling to build your own if you need it because their prices are kinda high, and I think you can save long term if you invest in it. Also, there are going to be a million open source versions of this exact thing popping up now that they’ve done their launch and set the standard.
I think if their target market is developers (which I believe it is) then they’re in a tough spot, because I think you can build something comparable (not better!) that’s cheaper. To me what’s much more attractive is if they go after direct applications of conversational agents in market research, surveying, etc. and can capitalize quickly on having the best offering early. My feeling though is that they actually would prefer their users to be the ones building integrations for specific niches on top of their platform so they can focus on improving the conversational experience. In that case I would be really nervous about a big AI research lab releasing a foundation model that’s either end-to-end or fuses parts of the pipeline more efficiently than they can.
I got the sense their plan is to actually train their own models eventually, in which case they can capitalize on this head start, exposure, and data from early launch and maybe establish a much bigger lead than what they have now. Not sure how much funding they have, but this would require a decent amount.
Sorry for the long answer, and take everything with a grain of salt because I have never run a startup before hahahaha

Holden Oullette • 2 years ago

I know I’m late on the draw about this, but if you’re trying to eke out every little bit of performance: there’s a change in the alpha version of Jason v1.5 that introduces an optional dep containing a Rust NIF for Jason.encode, increasing speeds 1.5x for most inputs.
