John T Davies 🇪🇺

@jtdavies • 2,954 subscribers

Entrepreneur, CTO in AI & FinTech, investor, father to 3 grown boys, husband to Rachel, astrophysicist, keen photographer, cyclist, über-geek, travelled a lot.

Videos

MLX + TurboQuant = Local Super Power

Take one or more local (private) documents or a codebase, pre-fill the 256k KV cache (the context) with the documents and the system prompt, quantise and run on Apple's MLX, and you have almost instantaneous, lossless document queries with total privacy. For a 75-page PDF (some 30k tokens), I was seeing sub-150ms TTFT (Time To First Token), which is "instant" for a human.

We run this on an office server (one of our Mac Minis) for staff to work with: instant answers on critical, confidential company/client documents from phones, iPads or laptops. Imagine "What were the 3Q25 profits?", "When did Joe start working for us?" or "Please list the project milestones and team count" answered within seconds of hitting the return key. It works with voice too.

This example is 17k lines of code from 177 source files plus an index of all 1,902 files, the entire Claude Code codebase (oops Anthropic!). Zero loss (unlike Anthropic's security).

What this means is the ability to work better on local (private) codebases, and off-the-shelf private document analysis with better response than the best public models, all locally.
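To get a feel for why quantising a pre-filled KV cache matters, here is a rough back-of-the-envelope sketch of cache size as a function of context length and bits per stored value. The model dimensions below (48 layers, 8 KV heads, head dimension 128) are hypothetical placeholders, not the dimensions of any model mentioned in the post, and the 4-bit figure is only an illustrative setting, not TurboQuant's actual format. The estimate also ignores any per-group metadata a real quantiser stores.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Estimate KV cache size in bytes for a standard attention stack.

    Each layer stores two tensors (keys and values), each of shape
    tokens x kv_heads x head_dim. Quantisation metadata is ignored.
    """
    stored_values = 2 * layers * kv_heads * head_dim * tokens
    return stored_values * bits_per_value / 8


# Hypothetical model dimensions, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

if __name__ == "__main__":
    # 30k tokens is the PDF size quoted in the post; 262,144 is a 256k context.
    for tokens in (30_000, 262_144):
        for bits in (16, 4):
            size_gb = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM, bits) / 1e9
            print(f"{tokens:>7} tokens at {bits:>2}-bit: {size_gb:.2f} GB")
```

Under these assumed dimensions the cache shrinks in direct proportion to the bit width, which is what makes holding a long, pre-filled context in memory on a small machine plausible.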

John T Davies 🇪🇺

85,589 views • 2 months ago

I finally got tool-calling working with Qwen3.5 on iPhone. It runs at just over 20 tok/sec and so far looks very useful, especially once I get some more local tools working to access things on the phone. I'll upload it to GitHub when I've done a little more testing.

John T Davies 🇪🇺

27,435 views • 3 months ago

Building on Dan Woods's (and Claude's) repo, I managed to get Qwen3.5-397B-A17B-4bit (224GB) running comfortably on my new M5 Max laptop. Shout-out to Carsen Klock's MacTop too. Over 10 tok/sec!!!
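As a sanity check on the 224GB figure, the sketch below estimates the size of a 4-bit checkpoint with 397B parameters. It assumes group-wise quantisation with a group size of 64 and one 16-bit scale plus one 16-bit bias per group, which is a common layout for MLX 4-bit checkpoints; the actual checkpoint may differ, and some tensors are often kept at higher precision, so treat the result as approximate.

```python
def quantized_weight_bytes(params, bits, group_size, scale_bits=16, bias_bits=16):
    """Estimate checkpoint size in bytes for group-wise quantised weights.

    Assumes every parameter is quantised and each group of `group_size`
    weights carries one scale and one bias (an assumed layout).
    """
    overhead_bits_per_weight = (scale_bits + bias_bits) / group_size
    return params * (bits + overhead_bits_per_weight) / 8


if __name__ == "__main__":
    # 397B total parameters, taken from the model name in the post.
    size_gb = quantized_weight_bytes(397_000_000_000, bits=4, group_size=64) / 1e9
    print(f"Estimated size: {size_gb:.1f} GB")
```

Under these assumptions the effective cost is 4.5 bits per weight, which lands close to the quoted size.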

John T Davies 🇪🇺

18,011 views • 3 months ago
