
Avi Chawla
@_avichawla • 69,368 subscribers
Daily tutorials and insights on DS, ML, LLMs, and RAGs • Co-founder @dailydoseofds_ • IIT Varanasi • ex-AI Engineer @ MastercardAI

Karpathy's prediction about RL is coming true now!
He called reward functions unreliable and argued that a single reward number is too low-dimensional to teach an agent what "good" means for complex tasks. To solve this, agents need a knowledge-guided review as a higher-dimensional feedback channel.
Every major AI lab trains models with RL today (OpenAI, Anthropic, DeepSeek). And their key bottleneck has always been the reward functions. GRPO by DeepSeek worked well for math and code because the environment gave a binary signal. But for real agent tasks, someone still has to hand-code the scoring function. That takes days and breaks every time the pipeline changes.
RULER (implemented in OpenPipe ART, 10k stars) addresses the exact problem Karpathy identified. The reward criteria are defined in plain English, and an LLM evaluates each trajectory against that description to provide feedback for training.
I trained a Qwen3 1.4B agent that plays 2048 using GRPO with this exact workflow. In this case, the agent saw the board, picked a direction, and RULER evaluated the outcome, all from this natural language definition.
You can see the full implementation on GitHub and try it yourself. Here's the ART Repo: (don't forget to star it ⭐)
Just like RLHF replaced manual rankings and GRPO replaced the critic model, natural language rewards are replacing hand-coded scoring functions. RL reward engineering is now prompt engineering.
I wrote a full walkthrough covering RL for LLM agents, from RLHF to GRPO to RULER, in the article below.
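
To make the "GRPO replaced the critic model" point concrete, here is a minimal sketch of the group-relative advantage step, assuming an LLM judge has already produced one score per trajectory in a group. This is not ART's or RULER's actual API; the function name and the scores are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores, eps=1e-8):
    """Normalize each score against its own group (no learned critic needed)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    # Trajectories above the group mean get a positive advantage, the rest negative.
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical judge scores for four 2048 trajectories sampled from the same board.
judge_scores = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(judge_scores)
print(advantages)
```

Because the baseline is the group mean, only the relative ordering and spacing of the judge's scores matter, which is why a plain-English rubric can stand in for a hand-coded scoring function.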
Avi Chawla • 345,818 views • 13 days ago

Finally, a proper chat UI for Hermes Agent (open-source)!
Hermes ships an official dashboard, but it's primarily built for management, and its chat is just a terminal piped into a browser tab.
Hermes Web UI is an open-source chat-first alternative. It's self-hosted and points at your existing ~/.hermes state, so there's nothing new to configure.
- It's a native web chat, not a terminal in a tab
- Sessions group by date with a context ring
- Kanban renders the agent's task board
- Spaces manages your workspaces
- Skills panel lists the full catalog
- Tasks panel shows cron jobs
- Insights show usage and activity
- Memory shows MEMORY and SOUL files
- Logs tails the agent, gateway, and error logs
The whole setup runs 100% locally, binds to localhost by default, and you reach it over an SSH tunnel or Tailscale from your phone.
I have shared the Hermes Web UI GitHub repo in the replies. Do note that it's a community project, not official, so expect occasional rough edges (concurrent profile runs are blocked for now).
To dive deeper into Hermes Agent, my co-founder wrote a full masterclass about it, covering the learning loop, the memory tiers, self-evolving skills, GEPA, and running multiple isolated agents. Read it below.
Avi Chawla • 74,684 views • 4 days ago

Another blow to Anthropic!
Devs built a free and better Claude alternative that:
- runs locally
- works with any LLM
- beats it on deep research
- has Cowork-like capabilities
- connects to 40+ data sources
- self-hosts via Docker, and more.
100% open-source (20k+ stars).
Avi Chawla • 666,960 views • 2 months ago

Researchers built a new RAG approach that:
- does not need a vector DB.
- does not embed data.
- involves no chunking.
- performs no similarity search.
And it hit 98.7% accuracy on a financial benchmark (SOTA).
Here's the core problem with RAG that this new approach solves:
Traditional RAG chunks documents, embeds them into vectors, and retrieves based on semantic similarity. But similarity ≠ relevance.
When you ask "What were the debt trends in 2023?", a vector search returns chunks that look similar. But the actual answer might be buried in some Appendix, referenced on some page, in a section that shares zero semantic overlap with your query. Traditional RAG would likely never find it.
PageIndex (open-source) solves this. Instead of chunking and embedding, PageIndex builds a hierarchical tree structure from your documents, like an intelligent table of contents. Then it uses reasoning to traverse that tree.
For instance, the model doesn't ask: "What text looks similar to this query?" Instead, it asks: "Based on this document's structure, where would a human expert look for this answer?"
That's a fundamentally different approach with:
- No arbitrary chunking that breaks context.
- No vector DB infrastructure to maintain.
- Traceable retrieval to see exactly why it chose a specific section.
- The ability to follow in-document references ("see Table 5.3") the way a human would.
But here's the deeper issue that it solves. Vector search treats every query as independent. But documents have structure and logic, like sections that reference other sections and context that builds across pages. PageIndex respects that structure instead of flattening it into embeddings.
Do note that this approach may not make sense in every use case since traditional vector search is still fast, simple, and works well for many applications. But for professional documents that require domain expertise and multi-step reasoning, this tree-based, reasoning-first approach shines.
For instance, PageIndex achieved 98.7% accuracy on FinanceBench, significantly outperforming traditional vector-based RAG systems on complex financial document analysis.
Everything is fully open-source, so you can see the full implementation on GitHub and try it yourself. I have shared the GitHub repo in the replies!
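
The tree-traversal idea can be sketched in a few lines. This is an illustration only, not PageIndex's API: the `Node` class, the sample tree, and `keyword_relevance` are all hypothetical, and the keyword counter is a stand-in for the LLM reasoning step that decides which section to open next.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    summary: str
    pages: tuple
    children: list = field(default_factory=list)


def keyword_relevance(query, node):
    """Stand-in for an LLM judgment: count query words found in title + summary."""
    text = f"{node.title} {node.summary}".lower()
    return sum(word in text for word in query.lower().split())


def tree_search(root, query, choose=keyword_relevance):
    """Walk from the root to a leaf, picking the most promising child at each level."""
    node, path = root, [root.title]
    while node.children:
        node = max(node.children, key=lambda child: choose(query, child))
        path.append(node.title)
    return node, path


# Hypothetical table-of-contents tree for a single annual report.
report = Node("Annual Report", "full filing", (1, 120), [
    Node("Business Overview", "products, markets, strategy", (2, 30)),
    Node("Financial Statements", "balance sheet, income, debt and liabilities for 2023", (31, 90), [
        Node("Income Statement", "revenue and expenses", (31, 44)),
        Node("Notes on Long-Term Debt", "debt maturities and trends in 2023", (45, 48)),
    ]),
    Node("Appendix", "supplementary tables", (91, 120)),
])

leaf, path = tree_search(report, "debt trends 2023")
print(" > ".join(path), leaf.pages)
```

The returned `path` is what makes retrieval traceable: it records every section the search chose on the way to the answer, with no embeddings or chunk boundaries involved.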
Avi Chawla • 970,893 views • 4 months ago

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)
Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.
But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x.
DFlash is a new technique that swaps the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate.
On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.
In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss.
It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!
KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below.
👉 Over to you: What use case are you working on that can benefit from this new technique?
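
The draft-then-verify loop described above can be sketched with toy models. This is a simplified greedy variant under stated assumptions: `draft_next` and `target_next` are hypothetical stand-ins for real models, the verification loop stands in for what would be a single batched forward pass, and the drafter here is a plain autoregressive one, not DFlash's block diffusion drafter.

```python
def target_next(seq):
    """Toy 'large' model: always continues the count modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy 'small' model: agrees with the target except right after a 3."""
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


def speculative_step(prefix, k=4):
    # 1) The drafter proposes k tokens, one at a time.
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))
    # 2) The target checks each position; keep the longest agreeing prefix.
    accepted = []
    for token in drafted:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first wrong guess, then stop
            return accepted
        accepted.append(token)
    # Every draft was right, so the target's own next token comes for free.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate_speculative(prefix, n_tokens, k=4):
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out += speculative_step(out, k)
    return out[: len(prefix) + n_tokens]


def generate_greedy(prefix, n_tokens):
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


print(generate_speculative([0], 12))
```

Each step emits at least one token chosen by the target, which is why the output matches plain greedy decoding and the method never does worse than it in tokens per target pass.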
Avi Chawla • 155,880 views • 24 days ago

LLM inference speed with vs. without KV caching: (learn how and why it works below)
Avi Chawla • 394,519 views • 2 months ago

Anthropic's in trouble, again. The entire Claude experience is now available at 1/6th the price.
Kimi now does everything Claude does, powered by K2.6, a 1-trillion-parameter MoE model that activates only 32B parameters per token. It covers all three features Claude has (Chat, Code, and Cowork):
1) Kimi Chat runs in four modes:
- Instant for fast responses
- Thinking for deep reasoning
- Agent for multi-step execution
- Agent Swarm for parallel workloads
There's a 262K context window across all of them.
2) Kimi Code is the open-source CLI coding agent with K2.6 as the default backend. K2.6 ranked #1 on OpenRouter's programming leaderboard by weekly usage.
3) Kimi Agent is the Cowork equivalent. It generates:
- full websites with database and auth
- presentation decks (editable PPTX output)
- spreadsheets with formulas and charts
- word docs and structured research reports
On top of this, Kimi K2.6 is also trained to decompose tasks into up to 300 parallel sub-agents. This helps it retain coherence even across 4,000+ tool calls in a single run, with sessions sustaining up to 13 hours.
On SWE-Bench Pro:
- Kimi K2.6 → 58.6
- GPT-5.4 xhigh → 57.7
- Gemini 3.1 Pro → 54.2
- Claude Opus 4.6 → 53.4
The Kimi K2.6 model is open-weights and self-hostable on 4x H100s in INT4. Find the link to the HuggingFace model page in the replies!
Avi Chawla • 107,950 views • 23 days ago

Finally, Python 3.14 lets you disable the GIL! It's a big deal because earlier, even if you wrote multi-threaded code, Python could only run one thread at a time, giving no performance benefit for CPU-bound work. But now, on the free-threaded build, Python can run your multi-threaded code in parallel. And uv fully supports it!
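
A quick way to see which mode you are in is to check the interpreter and then run CPU-bound work on threads. This is a minimal sketch: `count_primes` and `gil_status` are hypothetical helper names, and the code runs on any recent Python, but the threads only execute in parallel on a free-threaded build with the GIL actually disabled.

```python
import sys
import sysconfig
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """Deliberately CPU-bound: trial division for every n below limit."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def gil_status():
    """Return (built_free_threaded, gil_currently_enabled)."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on Python 3.13+; older versions always have a GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check else True
    return free_threaded_build, gil_enabled


print("free-threaded build, GIL enabled:", gil_status())

# Four identical CPU-bound tasks on four threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(count_primes, [20_000] * 4))
print(results)
```

Timing this script on a regular build and on a free-threaded build of the same version is the simplest way to check whether your own workload benefits.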
Avi Chawla • 546,408 views • 7 months ago

OpenClaw meets RL!
OpenClaw Agents adapt through memory files and skills, but the base model weights never actually change. OpenClaw-RL solves this!
It wraps a self-hosted model as an OpenAI-compatible API, intercepts live conversations from OpenClaw, and trains the policy in the background using RL.
The architecture is fully async. This means serving, reward scoring, and training all run in parallel. Once done, weights get hot-swapped after every batch while the agent keeps responding.
Currently, it has two training modes:
- Binary RL (GRPO): A process reward model scores each turn as good, bad, or neutral. That scalar reward drives policy updates via a PPO-style clipped objective.
- On-Policy Distillation: When concrete corrections come in like "you should have checked that file first," it uses that feedback as a richer, directional training signal at the token level.
When to use OpenClaw-RL?
To be fair, a lot of agent behavior can already be improved through better memory and skill design. OpenClaw's existing skill ecosystem and community-built self-improvement skills handle a wide range of use cases without touching model weights at all.
If the agent keeps forgetting preferences, that's a memory problem. And if it doesn't know how to handle a specific workflow, that's a skill problem. Both are solvable at the prompt and context layer.
Where RL becomes interesting is when the failure pattern lives deeper in the model's reasoning itself. Things like consistently poor tool selection order, weak multi-step planning, or failing to interpret ambiguous instructions the way a specific user intends.
Research on agentic RL (like ARTIST and Agent-R1) has shown that these behavioral patterns hit a ceiling with prompt-based approaches alone, especially in complex multi-turn tasks where the model needs to recover from tool failures or adapt its strategy mid-execution.
That's the layer OpenClaw-RL targets, and it's a meaningful distinction from what OpenClaw offers.
I have shared the repo in the replies!
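
For readers unfamiliar with the "PPO-style clipped objective" mentioned under Binary RL, here is a minimal per-token sketch. It is not OpenClaw-RL's code: the function name, the +1/0/-1 mapping for good/neutral/bad turns, and the 0.2 clip range are assumptions made for illustration.

```python
import math

# Hypothetical mapping from the reward model's turn label to a scalar reward.
TURN_REWARD = {"good": 1.0, "neutral": 0.0, "bad": -1.0}


def clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style surrogate for one token: limits how far one update can move the policy."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(token) / pi_old(token)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the clipped and unclipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A "good" turn whose token probability already rose by 50%: the gain is capped at 1.2.
print(clipped_objective(math.log(1.5), 0.0, TURN_REWARD["good"]))
# A "bad" turn whose token probability was halved: the objective stops rewarding further drops.
print(clipped_objective(math.log(0.5), 0.0, TURN_REWARD["bad"]))
```

The clipping matters more than usual in this setting because training runs continuously on live conversations, so a single noisy reward should not be able to swing the policy far in one batch.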
Avi Chawla • 138,182 views • 2 months ago

Pentesting firms don't want you to see this. An open-source AI agent just replicated their $50k service.
A "normal" pentest today looks like this:
- $20k-$50k per engagement
- 4-6 weeks of scoping, NDAs, kickoff calls
- A big PDF that's outdated the moment you ship a new feature
Meanwhile, AI agents are quietly starting to perform on par with human pentesters on the stuff that actually matters day-to-day:
↳ Enumerating attack surface
↳ Fuzzing endpoints
↳ Chaining simple vulns into real impact
↳ Producing PoCs and remediation steps developers can actually use
And they do it in hours instead of weeks and at a fraction of the cost.
This approach is actually implemented in Strix, a recently-trending open-source framework (14k+ stars) for AI pentesting agents.
The framework spins up a team of AI "attackers" that probe your web apps, APIs, and code. It then returns validated findings with exploit evidence, remediation steps, and a full PDF report that looks exactly like what you'd get from a traditional firm, but without a $50k invoice and a month-long wait time.
You can see the full implementation on GitHub and try it yourself. Just run: `strix --target https://your-app.com` and you are good to go.
Human red teams aren't disappearing, but the routine pentest (pre-launch, post-refactor, quarterly checks) is clearly shifting to AI. Strix is one of the first tools that makes that shift feel real instead of hypothetical.
I've shared the GitHub repo in the replies.
Avi Chawla • 223,841 views • 6 months ago

Big moment for Postgres!
AI coding tools have been surprisingly bad at writing Postgres code. Not because the models are dumb, but because of how they learned SQL in the first place.
LLMs are trained on the internet, which is full of outdated Stack Overflow answers and quick-fix tutorials. So when you ask an AI to generate a schema, it gives you something that technically runs but misses decades of Postgres evolution, like:
- No GENERATED ALWAYS AS IDENTITY (added in PG10)
- No expression or partial indexes
- No NULLS NOT DISTINCT (PG15)
- Missing CHECK constraints and proper foreign keys
- Generic naming that tells you nothing
But this is actually a solvable problem. You can teach AI tools to write better Postgres by giving them access to the right documentation at inference time.
This exact solution is implemented in the newly released pg-aiguide by Tiger Data (creators of TimescaleDB), an open-source MCP server that gives coding tools access to 35 years of Postgres expertise.
In short, the MCP server enables:
- Semantic search over the official PostgreSQL manual (version-aware, so it knows PG14 vs PG17 differences)
- Curated skills with opinionated best practices for schema design, indexing, and constraints.
I ran an experiment with Claude Code to see how well this works, and worked with the team to put this together.
Prompt: "Generate a schema for an e-commerce site twice, one with the MCP server disabled, one with it enabled. Finally, run an assessment to compare the generated schemas."
The run with the MCP server led to:
- 420% more indexes (including partial and expression indexes)
- 235% more constraints
- 60% more tables (proper normalization)
- 11 automation functions and triggers
- Modern PG17 patterns throughout
The MCP-assisted schema had proper data integrity, performance optimizations baked in, and followed naming conventions that actually make sense in production.
pg-aiguide works with Claude Code, Cursor, VS Code, and any MCP-compatible tool. It's free and fully open source. I have shared the repo in the replies!
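
To show what the "missed" features in the list above look like in practice, here is a small illustrative schema held in a Python string. It is not output from pg-aiguide or from the experiment; the table and index names are hypothetical, and the DDL assumes PostgreSQL 15 or newer (for NULLS NOT DISTINCT). The Python part only splits the script into statements and does not connect to a database.

```python
MODERN_SCHEMA = """
CREATE TABLE customers (
    customer_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email       text NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Expression index: emails are unique regardless of letter case.
CREATE UNIQUE INDEX customers_email_lower_key ON customers (lower(email));

CREATE TABLE product_variants (
    variant_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    product_id bigint NOT NULL,
    size       text,
    -- PG15+: NULL sizes count as equal, so each product has at most one sizeless variant.
    UNIQUE NULLS NOT DISTINCT (product_id, size)
);

CREATE TABLE orders (
    order_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL REFERENCES customers (customer_id),
    status      text NOT NULL CHECK (status IN ('pending', 'paid', 'shipped', 'cancelled')),
    total_cents bigint NOT NULL CHECK (total_cents >= 0),
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Partial index: only the pending orders that a fulfilment worker polls.
CREATE INDEX orders_pending_created_at_idx ON orders (created_at) WHERE status = 'pending';
"""

# Split into individual statements, e.g. to run them one by one with a Postgres driver.
statements = [s.strip() for s in MODERN_SCHEMA.split(";") if s.strip()]
print(len(statements), "statements")
```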
Avi Chawla • 186,381 views • 5 months ago

Check this!! Microsoft open-sourced a no-code data analysis tool. It's called Data Formulator and it provides AI-powered data analysis and a drag-and-drop UI for viz tasks. It also works beyond the initial dataset by creating relevant fields and the corresponding viz.
Avi Chawla • 260,055 views • 1 year ago

Big update for Claude Desktop and Cursor users! Now you can connect all AI apps via a common memory layer in a minute. I used the Graphiti MCP server that runs 100% locally to cross-operate across AI apps like Claude Desktop and Cursor without losing context. (setup below)
Avi Chawla • 122,685 views • 7 months ago

An MCP server to create Grant Sanderson animations (open-source):
Avi Chawla • 192,133 views • 1 year ago

A 100% open-source alternative to n8n!
Sim is a drag-and-drop open-source platform to build and deploy Agentic workflows.
- Runs 100% locally
- Works with any local LLM
I used it to build a finance assistant app & connected it to Telegram in minutes. The workflow is simple:
- You ask a finance question through Telegram
- An Intent Classifier figures out if it's finance-related
- If not, you get a polite redirect
- If yes, the Finance Agent kicks in
Here's what's happening under the hood: The Finance Agent uses Firecrawl for web searches and accesses stock data via Alpha Vantage's API through MCP servers. A Response Agent compiles the info and delivers it.
In Sim, every tool or agent you need is available as a block. Just drag them onto the canvas and connect them. Sim agents also support integration with MCP, which is exactly what we did to connect our agent with Alpha Vantage's API.
And it's simple to extend. If you want to track crypto or need portfolio analysis, you can just add another Agent. Sim allows easy feature additions without disrupting existing functionality.
I have shared the link to Sim's official, open-source GitHub repo in the replies!
Avi Chawla • 115,509 views • 8 months ago

Big moment for Postgres!
AI agents broke the idea of what a database is supposed to do. Traditional databases were built for humans, and agents broke that model.
- They branch endlessly.
- They run ten experiments at once.
- They need isolation, context, memory, structured reasoning, and safe sandboxes.
Letting agents touch production systems is terrifying because the old model of Postgres was never built for this kind of behavior.
Agentic Postgres is an agent-ready version of Postgres by TimescaleDB (by Tiger Data) that solves this. I think it is one of the biggest upgrades to the Agent stack this year and Tiger Data is working with me on this post to share what they did.
Some key features:
> It instantly creates branches of an entire database, which is perfect for parallel agent evals, safe experiments, migrations, or isolated testing. Forks take seconds and cost almost nothing.
> It comes with a built-in MCP server, which agents can use to get schema guidance, best practices, and safe, structured access to Postgres. This also helps agents run migrations with a real understanding of the schema.
> It comes with actual hybrid search (vector search and BM25), so agents can retrieve data directly inside the database.
> The database is memory-native. This gives agents a persistent context in which to evolve.
This is one of the first times I have seen Postgres feel ready for the AI-native era.
Avi Chawla • 94,261 views • 6 months ago

Finally, MCP servers can now deliver UI-rich experiences! MCP servers in Claude/Cursor don't offer any UI experience yet, like charts. It's just text/JSON. mcp-ui lets you add interactive web components to a server's output that the MCP client can render. 100% open-source!
Avi Chawla • 123,701 views • 9 months ago