Avi Chawla

@_avichawla • 72,017 subscribers

Daily tutorials and insights on DS, ML, LLMs, and RAGs • Co-founder @dailydoseofds_ • IIT Varanasi • ex-AI Engineer @ MastercardAI

Shorts

Stanford researchers did it again. They just built the agent-native version of Git. When an agent works on a longer task, the run builds up a lot of state. This includes files edited/created, a dev server, a database, installed packages, KV cache, etc. Say the agent is at step 10 and makes a mistake, maybe it misreads a traceback and rewrites a file that was actually fine. The tests start failing, and the run goes off track, although everything through step eight was correct. By default, the agent just tries to fix it, which creates more edits and tool calls. This burns more tokens and grows the context. The other options are a person stepping in to redirect it or restarting the whole run from step one. That's wasteful, because it pays for every model/tool call again and re-prefills the context. Moreover, since an agent's run is non-deterministic, it doesn't reproduce the same early steps anyway. The reason it's hard to just jump back exactly to a previous correct step and resume from there is that the trajectory is only a message log. It records what the agent said and which tools it called, but not the live state underneath. That state includes things like memory, open file handles, child processes, installed packages, /tmp, and KV cache. None of that is in the log. Git can version the files, but it doesn't snapshot the running process or the KV cache. Checking out step eight moves the files back, but the process is still sitting in step-ten memory with a cold cache. Shepherd is a runtime layer by Stanford that records the run as a trace of typed events rather than a flat log. Each agent-environment interaction becomes a commit, similar to Git, but it tracks the live run. Its commit includes the agent process and the filesystem together, copy-on-write, so a branch carries the actual state and not just the files. Going back to a previous step is then a single call that forks from that commit and continues from the exact state. The copy-on-write fork is roughly five times faster than docker commit, and because the prompt prefix through step eight is unchanged, the KV cache is reused over 95% on replay, so early steps aren't reprocessed again. Once the run can be forked, a meta-agent can sit on top and operate it. It watches the trace and reverts as soon as it looks wrong, before the bad write is committed. In practice, it's just Python calling fork, replay, and revert on the trace, rather than a separate control plane wired into the harness. Not everything is reversible though. Files and sandbox changes undo themselves, but a database write has no automatic undo, so it needs a matching undo step set up in advance. Something external, like a sent email or a real charge, can't be undone, so the supervisor's job there is to catch it before it fires. They tested this on a few public benchmarks. On CooperBench, where two agents work on the same codebase, adding a live supervisor took the pair-coding pass rate from 28.8% to 54.7%. It's still early and labeled alpha. The benefit mostly shows up when a run gets branched a lot over a heavy sandbox state, which is exactly where restarting wastes the most tokens and time. If Git was made to make file changes reversible, Shepherd is trying to do the same thing for a live agent run. Shepherd Repo: (don't forget to star it ⭐ ) That said, Shepherd reverts a bad step inside a run. The harness around it, the prompts, tools, and checks the supervisor relies on, still drifts across runs as models and dependencies change. Akshay wrote about making that harness repair itself, where a failing trace gets diagnosed, the fix is verified against the exact input that failed, and the failure is locked as a regression test so it can't recur. Read it below.

437,587 views

Researchers made KMeans 200x faster. And the new technique also beats approaches like cuML and FAISS. Flash-KMeans is an IO-aware implementation of exact KMeans that redesigns the algorithm around modern GPU bottlenecks. By attacking the memory bottlenecks directly, Flash-KMeans achieves: - 33x speedup over cuML - 200x speedup over FAISS This speedup comes from how it moves through GPU memory. Standard KMeans runs in two steps, and both are bottlenecked by reads and writes to GPU memory: 1) The first step matches every point to its nearest centroid. Standard KMeans computes the full point-to-centroid distance matrix, writes it out to GPU memory, then reads it back to find each nearest centroid. That write-then-read round trip is the bottleneck. Flash-KMeans combines the distance calculation with the nearest-centroid step, so the result is computed on-chip and the full matrix is never written out. 2) The second step recomputes each centroid by averaging the points assigned to it. Standard KMeans has thousands of threads writing into the same centroid slots at once, so they stall waiting for their turn. Flash-KMeans sorts points by cluster first, turning scattered writes into sequential reductions that read and write memory in one efficient pass. Using these two optimizations at the million-scale, Flash-KMeans completes a standard KMeans iteration in a few milliseconds. The video below depicts this in action. Several reasons why this is important: KMeans has always been an offline primitive. Something you run once to preprocess data and move on. These speedups make the approach viable in several runtime-critical systems. ↳ Vector indices like FAISS use KMeans to build search indices. Faster KMeans means you can re-index dynamically as data changes. ↳ LLM quantization methods need KMeans to find optimal weight codebooks, per layer, repeatedly. What takes hours could now take minutes. ↳ MoE models need fast token routing at inference time. Flash-KMeans makes it viable to run this inside the inference loop, not just in preprocessing. I have shared the paper in the replies. That said, memory is the real constraint Flash-KMeans solves, and the problem is not just limited to clustering. The vectors a RAG system stores after indexing create similar bottlenecks. I wrote a detailed walkthrough recently on cutting this vector memory by 32x with binary quantization, querying 36M+ vectors in a few milliseconds. Read it below.

89,234 views

A simple technique makes RAG 32x memory efficient! - Perplexity uses it in its search index - Azure uses it in its search pipeline - HubSpot uses it in its AI assistant (learn how it works below, with code)

86,573 views

Everyone is sleeping on this new OCR model! Datalab's Chandra topped independent benchmarks and beat the previous best dots-ocr. - Supports 40+ languages - Extracts complex texts, tables, formulas easily I tested on Ramanujan's handwritten letter from 1913. 100% open-source.

158,547 views

Everyone is sleeping on MiniMax's new LLM! Devs are calling it "Claude at 10% the cost" - 72.5% SWE-Multilingual. Beats Sonnet 4.5 - 88.6% VIBE-bench. Beats Gemini 3 Pro I used it to build a stock analyst that generates code, executes it & returns insights. 100% open-source!

117,551 views

Sensitive content

Wow!! You can now scrape ANY website by just writing a prompt. Using 's /extract endpoint, just describe what you want to extract in a prompt. This produces LLM-ready structured output. No more hard coding!

250,392 views

A RAG engine for deep document understanding! RAGFlow lets you build enterprise-grade RAG workflows on complex docs with well-founded citations. Supports multimodal data understanding, web search, deep research, etc. 100% local & open-source with 55k+ stars!

163,773 views

Finally! A RAG over code solution that actually works (open-source). Naive chunking used in RAG isn't suited for code. This is because codebases have long-range dependencies, cross-file references, etc., that independent text chunks just can't capture. Graph-Code is a graph-driven RAG system that solves this. It analyzes the Python codebase and builds knowledge graphs to enable natural language querying. Key features: - Deep code parsing to extract classes, functions, and relationships. - Uses Memgraph to store the codebase as a graph. - Parses pyproject to understand external dependencies. - Retrieves actual source code snippets for found functions. Find the repo in the replies!

121,874 views

Figma canvas to build AI agent workflows. Sim is a lightweight, user-friendly platform for building AI agent workflows in minutes. It natively supports all major LLMs, Vector DBs, etc. 100% open-source with 7k+ stars!

79,322 views

I just put together all my AI Agents posts in a single PDF. It covers: - Agent fundamentals - LLM vs RAG vs Agents - Agentic design patterns - Building Blocks of Agents - Building custom tools via MCP - 12 hands-on projects for AI Engineers Download link in next tweet.

65,378 views

Check this!! A 100% open-source toolkit to work with LLMs. Transformer Lab is an app to experiment with LLMs: - Train, fine-tune, or chat. - One-click LLM download (DeepSeek, Gemma, etc.) - Drag-n-drop UI for RAG. - Built-in logging, and more. 100% local!

74,926 views

Deploy and run LLMs directly on your phone! Unsloth now lets you fine-tune LLMs and deploy them 100% locally on iOS/Android devices. The video shows this in action, where I ran Qwen3 on an iPhone 17 Pro at ~25 tokens/s. I have shared a guide in the replies.

24,530 views

Postman's AI-readiness Playbook is one of the most important documents you can read today as a developer! We are headed into an era where every website must be "Agent-ready". - Agents will make purchases, not humans. - Agents will find the best options, not humans. - Agents will fill out job applications, not humans. The same applies to APIs. While human devs can hustle through poor docs and broken endpoints, most Agents can’t (yet). They need: - Predictable structures - Machine-readable metadata - Standardized behavior Postman's 90-day AI readiness playbook details how to turn your APIs into reliable, AI-ready tools. My two biggest takeaways from the Playbook: 1) Automatic documentation (Week 3): Once you standardize your API format, Postman’s Spec Hub automatically generates and validates API docs for both humans and AI agents without any manual work. 2) Seamless AI tooling (Week 9): Turn your validated specs into hosted, function-style endpoints, letting AI agents invoke your APIs like native commands. Find the link to the Playbook in the comments. Thanks to the Postman team for partnering on today's post!

22,830 views

I just created my own LaTeX-OCR app using Llama 3.2 Vision! Upload the LaTeX code as an image, and it gives you the corresponding LaTeX code using Llama 3.2 multimodal! Here's what I used: - Ollama for serving Llama 3.2 vision locally - Streamlit for the UI Everything is just 50 lines of code! Find the code in the next tweet. -- Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs. Find me → Avi Chawla

15,610 views

Videos

LIVE

1.2k

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Streaming Now

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

HD live stream

Exclusive private shows

1.2k viewers online

Current Status

Live

Private Show

Join now for exclusive access

Free preview available • Premium content