As promised. A write-up on implementing and optimizing conversational agents: An open-source repo that is a generic WebSocket server for low-latency conversational agents: And another demo
15,915 views • 2 years ago •via X (Twitter)
10 Comments

With a LiveView implementation using client-side VAD, I was able to consistently get 1300-1500 ms time to first spoken word with about 100 ms ping between my Mac and my GPU machine. Running locally on the GPU machine, I could get 900-1000 ms time to first spoken word.

I haven't fully benchmarked the WebSocket-based server, but it should be similar. It also has a much better VAD implementation, so it's not as broken as my last one.

Eventually I will get a version running on a Fly GPU with both a local Bumblebee LLM and a speech-to-text pipeline.

Really impressive! 👏 It would be awesome to try running this over Elixir WebRTC instead of a WebSocket. We did a demo where we send video from a webcam over WebRTC to a Phoenix app, feed it into Nx, and perform image recognition. Here is the blog post:

Thanks for the suggestion! I’ll look into this!

this is mindblowing stuff

Wow - very cool

From what you created it seems like Retell don't have much of a ring fence if it can be hacked together in a couple days. Thoughts?

Good question, apologies in advance for the long reply. I think they probably have the most complete and reliable product in the space that I’ve seen in my limited exposure to it. I think that there are a lot of tiny details that go into making conversations realistic, and if they iterate on that then they can put some distance between themselves and anybody else. This idea of hacking together 3 models has been “in the air” for a while, and it’s not difficult to get your own working version up and running quickly, if you can accept a 70-80% solution. It’s especially compelling to build your own if you need it because their prices are kinda high, and I think you can save long term if you invest in it. Also, there are going to be a million open source versions of this exact thing popping up now that they’ve done their launch and set the standard. I think if their target market is developers (which I believe it is) then they’re in a tough spot, because I think you can build something comparable (not better!) that’s cheaper. To me what’s much more attractive is if they go after direct applications of conversational agents in market research, surveying, etc. and can capitalize quickly on having the best offering early. My feeling though is that they actually would prefer their users to be the ones building integrations for specific niches on top of their platform so they can focus on improving the conversational experience. In that case I would be really nervous about a big AI research lab releasing a foundation model that’s either end-to-end or fuses parts of the pipeline more efficiently than they can. I got the sense their plan is to actually train their own models eventually, in which case they can capitalize on this head start, exposure, and data from early launch and maybe establish a much bigger lead than what they have now.
Not sure how much funding they have but this would require a decent amount. Sorry for the long answer, and take everything with a grain of salt because I have never run a startup before hahahaha

I know I’m late on the draw about this, but if you’re trying to eke out every little bit of performance: there’s a change in the alpha version of Jason v1.5 that introduces an optional dep containing a Rust NIF for Jason.encode - increasing speeds 1.5x for most inputs

