🚀 How should LLMs sample on hard reasoning problems during post-training and inference where direct rollouts rarely produce a correct answer? Best-of-N (e.g., GRPO) and tree search share two limitations:

🔻 Verification signals are sparse
🔻 Candidates stay within the model's own distribution

We introduce BES: Bidirectional Evolutionary Search...
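To make the "sparse verification" limitation concrete, here is a minimal sketch in Python. It does not implement BES, whose details are not given in this excerpt. It assumes the common group-relative advantage used in GRPO-style training (reward minus group mean, divided by group standard deviation); the function names, the `eps` constant, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style sketch).

    Assumption: population standard deviation plus a small eps in the
    denominator; real implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prob_any_correct(p: float, n: int) -> float:
    """Chance that at least one of n independent rollouts is correct,
    given a per-rollout success probability p."""
    return 1.0 - (1.0 - p) ** n


# On a hard problem every rollout in the group may fail verification.
# All rewards are then equal, so every advantage is zero and the group
# carries no learning signal.
print(group_relative_advantages([0.0] * 8))

# Drawing more samples helps only slowly when p is tiny.
print(prob_any_correct(p=0.001, n=16))
```

Every candidate here is still drawn from the same model, so increasing `n` cannot produce an answer the model assigns essentially zero probability to. That is the second limitation the post names.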

Guowei Xu

2,852 subscribers

244,684 views • 1 month ago • via X (Twitter)

Education, Science & Technology

Related videos

Wow, we can steer diffusion models at inference time! Introducing Diffusion Tree Sampling (DTS): a search-based approach inspired by Monte Carlo Tree Search that turns inference into an anytime, reward-guided optimization process.

Diffusion Tree Sampling (DTS) produces asymptotically exact samples from the target distribution in the limit of infinite rollouts, and its greedy variant, Diffusion Tree Search (DTS⋆), performs a global search for high reward samples.

The results are pretty impressive:
- On MNIST and CIFAR-10 class-conditional generation, DTS matches the FID of the best-performing baseline with up to 10× less compute.
- In text-to-image generation and language completion tasks, DTS⋆ effectively searches for high reward samples that match best-of-N with up to 5× less compute.

机器之心 JIQIZHIXIN

19,037 views • 1 year ago

New Course: Post-training of LLMs

Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.

Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning.

Post-training transforms a general-purpose token predictor, trained on trillions of unlabeled text tokens, into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows.

In this course, you’ll learn three common post-training methods and how to use each one effectively: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL). With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and is updated to improve performance.

You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.

In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing a contrastive loss that penalizes poor responses and reinforces preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.

Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today.

Please sign up here:
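The contrastive loss that DPO minimizes can be sketched for a single preference pair as follows. This is not code from the course. It assumes the standard DPO objective, in which the inputs are summed token log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model. The function names and the default `beta` are illustrative.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


# A policy identical to the reference gives loss log(2); shifting
# probability toward the chosen response lowers it.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss depends only on how the policy moves relative to the reference, which is why the reference log-probabilities can be precomputed once per dataset.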

Andrew Ng

125,146 views • 1 year ago

An exciting new course: Fine-tuning and Reinforcement Learning for LLMs: Intro to Post-training, taught by Sharon Zhou, VP of AI at AMD. Available now at

Post-training is the key technique used by frontier labs to turn a base LLM (a model trained on massive unlabeled text to predict the next word/token) into a helpful, reliable assistant that can follow instructions. I've also seen many applications where post-training is what turns a demo application that works only 80% of the time into a reliable system that consistently performs. This course will teach you the most important post-training techniques!

In this 5-module course, Sharon walks you through the complete post-training pipeline: supervised fine-tuning, reward modeling, RLHF, and techniques like PPO and GRPO. You'll also learn to use LoRA for efficient training, and to design evals that catch problems before and after deployment.

Skills you'll gain:
- Apply supervised fine-tuning and reinforcement learning (RLHF, PPO, GRPO) to align models to desired behaviors
- Use LoRA for efficient fine-tuning without retraining entire models
- Prepare datasets and generate synthetic data for post-training
- Understand how to operate LLM production pipelines, with go/no-go decision points and feedback loops

These advanced methods aren’t limited to frontier AI labs anymore, and you can now use them in your own applications. Learn here:
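Why LoRA avoids retraining the entire model can be seen in a small sketch. This is not material from the course. It assumes the standard LoRA formulation: a frozen weight `W` plus a trainable low-rank update `(alpha / rank) * B @ A`, with `B` initialized to zero. Plain Python lists stand in for tensors, and all names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A.

    W is d_out x d_in and stays frozen; B (d_out x rank) and
    A (rank x d_in) are the only trainable matrices.
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable values in B and A combined."""
    return rank * (d_out + d_in)


# For one 4096 x 4096 projection at rank 8, LoRA trains 65,536 values
# instead of the 16,777,216 in the full matrix.
print(trainable_params(4096, 4096, 8), 4096 * 4096)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen base layer, and training only moves it away from that starting point.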

Andrew Ng

132,304 views • 9 months ago

A transformer can learn not just the outcomes of dynamics, but the operator that executes the rules.

To show this we trained a transformer on roughly 0.04% of a discrete rule space (100 of 262,144 possible rules) and it learned to apply unseen rules from the same rule class. The model does not simply memorize specific rules. It learns the operator that maps a supplied rule plus an initial state, including unseen rules from this class, to the correct next state.

This is relevant because it is a shift from “neural networks approximate dynamics” to “neural networks can learn to execute symbolic programs within a defined rule class”. The rule itself is supplied at inference time, as data, and the network has internalized how rules act, not which rules to apply. On previously unseen rules, the model achieves 98.5% perfect one-step forecasts and reconstructs governing rules with up to 96% functional accuracy.

Two results make this hold up under scrutiny.

First, inductive bias decay. As we scaled training rule diversity, the correlation between functional inference accuracy and distance-from-nearest-training-rule collapsed to R² = 0.00. At the largest tested training-rule diversity, the model’s performance on a new rule shows no measurable dependence on how similar that rule is to anything it was trained on. The bias toward training data (the thing we worry most about in compositional generalization claims) is something we can measure decaying, and we find that at scale it is gone.

Second, an identifiability theory. We derive a closed-form expression for the number of rules consistent with a single observation. This reframes the inverse problem: failure to recover ground truth is not necessarily a model defect, but can be correct behavior when the data underdetermine the rule. The model is sampling the equivalence class, and identifiability is governed by coverage, not capacity.

The methodological move underneath both results is amortization. Classical work on rule inference (e.g. the Santa Fe EVCA program, evolutionary search over CA rule space) was per-instance: search the rule space for each new system. We replace that with a single forward pass of a transformer trained across many instantiations of the rule class. That is what makes symbolic rule inference scalable as a research direction rather than a curiosity.

We show that this works in a tightly constrained domain: binary, deterministic, local cellular automata on small grids. The locality-break experiment shows the model fails sharply when target systems violate its structural priors (which is itself a useful diagnostic, but it bounds the operator class). We don't yet know how this scales to multistate, higher-dimensional, or stochastic CA, or whether it transfers cleanly to non-CA systems whose coarse-grained dynamics admit local surrogates.

The identifiability framework (what can be inferred from observation, given a hypothesis class) should transfer wherever finite local rules meet sparse data. The amortization argument transfers wherever per-instance symbolic search has been the bottleneck. Those are the pieces I expect to outlive the cellular automata setting.

Led by Jaime Berkovich with Noah David, at LAMM@MIT. Out now in Advanced Science Advanced Portfolio (link to paper & code below).
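The idea of a rule "supplied as data" and the counting argument behind identifiability can be illustrated with a short sketch. The post does not say which rule class was used. One binary, local class with exactly 2^18 = 262,144 members is the Life-like (outer-totalistic) family on a 2D grid, and the code below assumes that class, a Moore neighborhood, and wrap-around edges. The 18-bit table layout and all function names are illustrative, and the count at the end is a simple table-lookup argument, not the paper's closed-form expression.

```python
from typing import List, Set

Grid = List[List[int]]
RULE_BITS = 18  # 2 cell states x 9 possible live-neighbor counts (0..8)


def rule_from_bs(birth: Set[int], survive: Set[int]) -> List[int]:
    """Encode a Life-like rule as an 18-entry lookup table.

    Entry 9 * state + live_neighbors holds the cell's next state.
    """
    table = [0] * RULE_BITS
    for n in birth:
        table[n] = 1        # dead cell with n live neighbors becomes alive
    for n in survive:
        table[9 + n] = 1    # live cell with n live neighbors stays alive
    return table


def live_neighbors(grid: Grid, r: int, c: int) -> int:
    """Count live cells in the Moore neighborhood, wrapping at the edges."""
    h, w = len(grid), len(grid[0])
    return sum(
        grid[(r + dr) % h][(c + dc) % w]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )


def step(rule: List[int], grid: Grid) -> Grid:
    """Apply the operator: (rule, state) -> next state."""
    return [
        [rule[9 * grid[r][c] + live_neighbors(grid, r, c)] for c in range(len(grid[0]))]
        for r in range(len(grid))
    ]


def consistent_rule_count(grid: Grid) -> int:
    """Rules in this class that agree with one observed transition from grid.

    Table entries never exercised by the observation are unconstrained,
    so each one doubles the number of consistent rules.
    """
    seen = {
        9 * grid[r][c] + live_neighbors(grid, r, c)
        for r in range(len(grid))
        for c in range(len(grid[0]))
    }
    return 2 ** (RULE_BITS - len(seen))


life = rule_from_bs(birth={3}, survive={2, 3})
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(step(life, blinker))
print(consistent_rule_count(blinker))
```

The same `step` function executes any of the 2^18 tables, which is the sense in which the operator is separate from the rule. A single sparse observation touches only a few table entries, so many rules explain it equally well, and recovering a rule that differs from the ground truth is then not an error.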

Markus J. Buehler

39,019 views • 2 months ago

Introducing Sharpe Search: On-Chain Search AI Agent Powered by Hive Intelligence

We’re thrilled to announce the launch of Sharpe Search, a crypto search AI agent powered by Hive Intelligence. Designed to simplify blockchain data interaction, Sharpe Search represents a significant step toward making crypto more accessible and actionable for users at every level. Sharpe Search leverages Hive Intelligence’s advanced search API to provide real-time, actionable insights across the blockchain ecosystem. Here’s a detailed look at what Sharpe Search is and how it works.

What Is Sharpe Search?

At its core, Sharpe Search is an AI agent purpose-built for querying and analyzing on-chain data. It takes the complexity out of blockchain exploration by enabling users to ask questions in plain language and receive detailed, accurate responses. Whether you’re looking to monitor wallet activity, track portfolio positions, or analyze transaction history, Sharpe Search ensures that the answers are at your fingertips: accurate, comprehensive, and delivered instantly.

How Does Sharpe Search Work?

Sharpe Search is powered by Hive Intelligence, a search engine API designed to make blockchain data easily accessible and AI-ready. Here’s a breakdown of how it enables Sharpe Search to function effectively:

1. LLM-Optimized Query Processing
Sharpe Search leverages Hive Intelligence's optimized responses for large language models. This ensures that AI agents can process blockchain data in a structured format, delivering precise answers to complex user queries.

2. Natural Language Interaction
Forget the need for technical knowledge. Sharpe Search supports natural language queries, making it as simple as typing:
- “What tokens are in my wallet? Am I eligible for any airdrop I haven't claimed yet?”
- “Check me my last 100 transactions, tell me if I interacted with any protocol with recent hacks”
- “Track my wallet activity over the past month, suggest optimised portfolio based on best stable yields available”

3. Real-Time Insights Across Multi-Chains
Using Hive Intelligence, Sharpe Search connects to over 20 chains and 5000+ protocols. This real-time access ensures that the AI agent provides up-to-date and actionable insights, no matter how dynamic the blockchain environment.

4. Unified API Access
Sharpe Search consolidates fragmented blockchain data through Hive’s unified API. Instead of dealing with multiple integrations, Sharpe Search uses a single access point to aggregate and query data, reducing complexity for both users and developers.

Technical Depth: The AI Agent Advantage

Sharpe Search's design philosophy revolves around the principle of creating an intuitive, AI-driven experience. Here’s what makes its technology stand out:

Data Indexing and Aggregation: Hive Intelligence employs advanced indexing algorithms to aggregate data from multiple chains. This ensures that Sharpe Search can retrieve information within milliseconds, even when querying vast datasets.

Dynamic Updates: Blockchain data is volatile. Sharpe Search processes dynamic updates in real time, enabling users to act on the most recent metrics, transactions, and balances without delays.

Contextual Understanding: The AI agent parses natural language queries and contextualizes them to blockchain-specific scenarios. For instance, when querying “Show portfolio details,” Sharpe Search understands the underlying requirements: fetching wallet holdings, token values, and current positions.

Hive Intelligence: The Backbone of Sharpe Search

While Sharpe Search takes center stage, Hive Intelligence provides the critical infrastructure to make it all possible. Its LLM-ready responses and multi-chain support ensure that Sharpe Search operates at the forefront of blockchain data accessibility. By launching Hive Intelligence through Sharpe Launchpad, Sharpe reinforces its commitment to supporting innovation in the blockchain space. Hive’s infrastructure not only powers Sharpe Search but also lays the groundwork for future AI agents to thrive in the ecosystem.

What’s Next for Sharpe Search?

Currently in invite-only access, Sharpe Search is preparing for a broader public release. Future updates will include:
- Expanded Blockchain Coverage: More chains and protocols will be added.
- Enhanced Query Flexibility: Even more advanced natural language capabilities.

Stay tuned for the public launch and get ready to explore crypto like never before!

Sharpe AI

263,278 views • 1 year ago

Track LLM visibility, generate content, and convert. Comment “AI Search” for free access.

The way people search is changing. Over 1B people now turn to ChatGPT, Claude, and Gemini each week to discover and evaluate products, but most brands are flying blind.

Two years ago, Shreyas Kumar and I launched FERMÀT Funnels to solve a clear problem: How do you convert the traffic you pay for? Now we’re tackling an even bigger one: How do you convert people who are interacting with AI Search (basically ChatGPT)?

LLMs aren’t search engines. They’re answer engines, and they require a new strategy. Everything we learned building Funnels, from content velocity to personalization to post-click optimization, now powers this new solution.

Already live with some of the fastest growing CPG brands, we’re seeing:
- 2x visibility lift in high-intent categories
- Shoppable articles cited by ChatGPT
- Conversions attributed to FERMÀT-generated content

We’ve spoken with hundreds of retail brands over the past few months, and one thing is clear: Content optimized for answer engines like ChatGPT is a must-have for 2026.

So, here’s what we’re doing: To help brands get started, we’re offering free access for the first six months to track prompts, generate up to 20 pieces of content, and monitor your brand’s visibility. Just comment “AI Search” below before the end of the year, and we’ll DM you to onboard in the order you comment.

While you’re waiting to get onboarded, we’ll share an AI Search funnel from one of our favorite FERMÀT customers and give you $100 in free credit to buy whatever your (or your significant other’s) heart desires (capped at the first 100 people).

Rishabh Jain

19,505 views • 8 months ago

Sierra agents can be published on your web site, integrated with your mobile app, answer the phone - and now they can also be published to ChatGPT so you can directly reach hundreds of millions of consumers with your agent. Since OpenAI announced ChatGPT apps two weeks ago, we've heard from our customers that they want two things: a direct customer relationship and the ability to reach customers where they already are. We don't think you should have to choose. With Sierra, you can build once and run everywhere. Here's a demo of an agent I built to show how it works. And our blog post is here:

Bret Taylor

55,488 views • 9 months ago

Xavier Leroy (creator of OCaml) is an expert in compilers, formal verification of software and functional programming. This interview should be an approachable resource if you're curious about formal verification of software since I was learning that on the fly during it.

In this episode:
• OCaml compared with Rust and JavaScript
• What is formal verification and how does it work
• How languages call each other across boundaries
• How to address "almost-correct" LLM code
• How type inference works in programming languages

Where to watch:
• YouTube -
• Spotify -
• Apple Podcasts -
• Transcript -

Thank you to the sponsor of this episode for supporting my work:
• WorkOS: makes your app Enterprise Ready with easy to use APIs to add SSO, SCIM, RBAC, and more in just a few lines of code, check them out at

Chapters:
00:00 - Intro
00:43 - What sets OCaml apart
04:39 - OCaml vs Rust
07:57 - Why is manual memory management more performant
11:21 - Javascript vs OCaml
14:00 - Famous Rob Pike quote
16:05 - Type inference and how it works
22:12 - What is formal verification and how does it work
40:07 - What made multicore support difficult for OCaml
50:17 - How programming languages interface and call each other
57:41 - The danger of almost-correct LLM code
01:05:39 - How LLMs will change programming languages
01:10:26 - Industry vs academia
01:15:05 - Most interesting unsolved problems
01:18:30 - Top book recommendations for engineers
01:21:17 - Advice for his younger self
01:23:31 - Outro

Ryan Peterman

23,082 views • 6 days ago

Sharing a super simple, user-owned memory module we've been playing around with: nanomem

The basic idea is to treat memory as a pure intelligence problem: ingestion, structuring, and (selective) retrieval are all just LLM calls & agent loops on an on-device markdown file tree. Each file lists a set of facts w/ metadata (timestamp, confidence, source, etc.); no embeddings/RAG/training of any kind. For example:
- `nanomem add ` starts an agent loop to walk the tree, read relevant files, and edit.
- `nanomem retrieve ` walks the tree and returns a single summary string (possibly assembled from many subtrees) related to the query.

What’s nice about this approach is that the memory system is, by construction:
1. partitionable (humans/agents can easily separate `hobbies/snowboard.md` from `tax/residency.md` for data minimization + relevance)
2. portable and user-owned (it’s just text files)
3. interpretable (you know exactly what’s written and you can manually edit it)
4. forward-compatible (future models can read memory files just the same, and memory quality/speed improves as models get better)
5. modularized (you can optimize ingestion/retrieval/compaction prompts separately)

Privacy & utility. I'm most excited about the ability to partition + selectively disclose memory at inference time. Selective disclosure helps with both privacy (principle of least privilege & “need-to-know”) and utility (as too much context for a query can harm answer quality).

Composability. An inference-time memory module means: (1) you can run such a module with confidential inference (LLMs on TEEs) for provable privacy, and (2) you can selectively disclose context over unlinkable inference of remote models (demo below).

We built nanomem as part of the Open Anonymity project, but it’s meant to be a standalone module for humans and agents (e.g., you can write a SKILL for using the CLI tool). Still polishing the rough edges!
- GitHub (MIT):
- Blog:
- Beta implementation in chat client soon:

Work done with amazing project co-leads Amelia Kuang, Coco Xu, Erik Chi!!
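To make the file-tree idea concrete, here is a minimal Python sketch of a markdown memory tree with per-fact metadata and path-based selective disclosure. It is not nanomem's implementation: the function names (`add_fact`, `retrieve`), the line format, and the example facts are all made up for illustration, and a plain path-prefix filter stands in for the LLM agent loops the post describes.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def add_fact(root: Path, rel_path: str, fact: str, confidence: float, source: str) -> Path:
    """Append one fact plus metadata to a markdown file under root (hypothetical format)."""
    path = root / rel_path
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"- {fact} (timestamp: {stamp}; confidence: {confidence}; source: {source})\n"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line)
    return path


def retrieve(root: Path, allowed_prefixes: list[str]) -> str:
    """Return a single string assembled only from files under the allowed subtrees."""
    chunks = []
    for path in sorted(root.rglob("*.md")):
        rel = path.relative_to(root).as_posix()
        if any(rel.startswith(prefix) for prefix in allowed_prefixes):
            chunks.append(f"## {rel}\n{path.read_text(encoding='utf-8')}")
    return "\n".join(chunks)


def demo() -> str:
    """Write two facts into separate partitions, then disclose only one partition."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        add_fact(root, "hobbies/snowboard.md", "Rides a 156 cm board", 0.9, "chat")
        add_fact(root, "tax/residency.md", "Files taxes as a resident", 0.8, "chat")
        return retrieve(root, ["hobbies/"])
```

Calling `demo()` returns only the `hobbies/` content; the `tax/` file is never read into the result, which is the data-minimization property the post highlights. Because the store is plain text, the same files remain readable and editable by hand.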

Ken Liu

73,840 views • 3 months ago

New blackboard lecture w Eric Jang He walks through how to build AlphaGo from scratch, but with modern AI tools. Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn. Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second. Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside. Timestamps: 0:00:00 – Basics of Go 0:08:06 – Monte Carlo Tree Search 0:31:53 – What the neural network does 1:00:22 – Self-play 1:25:27 – Alternative RL approaches 1:45:36 – Why doesn’t MCTS work for LLMs 2:00:58 – Off-policy training 2:11:51 – RL is even more information inefficient than you thought 2:22:05 – Automated AI researchers
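For readers who want the formulas behind the claim that MCTS supplies a per-move training target, the block below restates the standard AlphaGo Zero formulation from the published paper. It is background only and is not taken from the lecture itself. Here $P$ is the network's prior, $N$ the visit count, $Q$ the mean action value, $\tau$ a temperature, $z$ the game outcome, and $v$, $\mathbf{p}$ the network's value and policy outputs.

```latex
% Action selection inside the search tree (PUCT rule)
a_t = \arg\max_a \bigl( Q(s,a) + U(s,a) \bigr), \qquad
U(s,a) = c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}

% Search-improved policy used as the training target at every move
\pi(a \mid s) = \frac{N(s,a)^{1/\tau}}{\sum_b N(s,b)^{1/\tau}}

% Training loss: value regression plus cross-entropy to the search policy
\ell = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \, \lVert \theta \rVert^2
```

The cross-entropy term is what gives every position its own dense target $\boldsymbol{\pi}$, in contrast to a single end-of-trajectory reward that policy-gradient methods must spread across all tokens.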

Dwarkesh Patel

703,166 views • 2 months ago

A team tested Pi0, Pi0 Fast, Gr00t, and ACT on real robot arms in manufacturing tasks. (🔖 Bookmark this for later!) The task was precise: place thin rectangular frames from a messy stack into a holder. The team fine-tuned each model on 100 real trajectories and compared training time, inference speed, motion quality, and success rates. ⬇️ Here’s a breakdown of what they found Pi0 (Original) ✅ Strongest overall performance in precise pick-and-place ✅ High success rate even in edge cases ✅ Longest training time (~11 hours, ~$30 per run) ✅ Inference time of 80 ms causes short pauses between actions Despite delays, it handles complex scenarios well… solid for high-precision tasks, but slow to train. Gr00t ✅ Trains fast (~2 hours, ~$5 per run) ✅ Performs almost as well as Pi0 on large-object tasks ✅ Struggles with fine precision; random movement in some trials ✅ More training didn’t fix jitter or random offsets Best suited for tasks where exact precision isn’t critical. Not ready for manufacturing-grade accuracy without more tuning. Pi0 Fast ✅ Promised faster training, but results were underwhelming ✅ Training at 6 hours still showed low success rates ✅ Inference was slower than expected ✅ Not reliable for generalizing even slightly new tasks Currently too unstable for real-world deployment. Doesn’t live up to the “Fast” name yet. ACT (Baseline) ✅ 200MB model—lightweight, but limited ✅ Struggles with stacked objects or ambiguous scenes ✅ Success rates around 70% in best-case setups ✅ Can’t match newer models on precision or generalization Still a solid baseline, but clearly a generation behind in robustness. 🚨 Extra Notes All newer models share a common issue: •Inference takes longer than a frame (80 ms vs 33 ms), so robots “pause” between chunks. •This results in jittery movements, but not a dealbreaker unless tasks are time-sensitive. 
Language-conditioned tasks also fell short: after training on two labeled tasks, the model couldn’t generalize to a third unseen combination using only text prompts. ✅ The good news? These models adapt well to new robot arms with quick fine-tuning. ❌ The bad news? There’s still no plug-and-play solution for improving performance after deployment. Reinforcement learning or DAgger-style data collection during real-world operation may be the next big step, something many teams in robotics are actively working on.
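The pause described in the extra notes follows from simple arithmetic. The sketch below assumes a synchronous control loop in which the arm executes a chunk of `chunk_len` actions at one action per 33 ms frame and then idles during the next 80 ms inference call. The loop structure and the chunk length of 50 used in the usage note are assumptions for illustration, not values reported by the team.

```python
def frames_missed(inference_ms: float, frame_ms: float) -> float:
    """How many control frames elapse during one inference call."""
    return inference_ms / frame_ms


def stall_fraction(inference_ms: float, frame_ms: float, chunk_len: int) -> float:
    """Fraction of wall-clock time spent idle, waiting for the next action chunk."""
    execute_ms = chunk_len * frame_ms
    return inference_ms / (execute_ms + inference_ms)
```

With the post's numbers, `frames_missed(80, 33)` is about 2.4 frames per inference call, and under the assumed chunk length `stall_fraction(80, 33, 50)` is under 5% of total time. That is consistent with the post's reading that the pauses are visible but only matter for time-sensitive tasks.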

Ilir Aliu

21,844 views • 1 year ago

In just one week, Binh Pham and I trained a full-body Unitree G1. Here's a recap: 1. Secured a Unitree G1 humanoid through a LinkedIn post 2. Deployed TWIST2 full-body teleoperation pipelines 3. Adapted TWIST2 for Zed stereo camera & collected full-body teleoperation samples (carried by Binh Pham ) 4. Adapted & fine-tuned NVIDIA Gr00T N1.5 VLA on the TWIST2 public datasets, which I fine-tuned on an 8xNVIDIA H100 Cluster. We picked Gr00T N1.5 as it was trained with Unitree G1 embodiment data. 5. Adapted the TWIST2 codebase to stream in the actions from Gr00T via ZMQ using a co-located NVIDIA H100 for ~200ms inference latency 6. Tested the model in sim, then deployed to the real-world Unitree G1. We streamed a training sample observation to the VLA (as we didn't want to break robot in case real observations were OOD) We were the first team in the world to deploy the full TWIST2 data collection pipeline to the unitree g1 :) Much more work ahead though, which I'll work on as a side-project over the next months: 1. Exploring the various types of 'world models': video backbones, dynamics models, v-jepa-2 models. I believe these will generalize better & train much more data-efficiently than VLM backbones 2. Speeding up inference - I believe low-latency robotics inference will be a big challenge. There are many works in video diffusion which I'd like to test (e.g. SageAttention, SparseAttention, Drifting Models). Perhaps also writing custom CUDA kernels. 3. Economics of inference scaling :) What will be the compute demands as we scale inference up to millions of humanoids? Will it run on edge or on distributed 'co-located' inference clusters? These are questions I'd like to answer. 
Adapted TWIST2 codebase: Adapted Gr00T-N1.5 codebase: The ETH Robotics Club are doing a cool GTC Golden ticket competition with NVIDIA , so this is my submission :) The DGX Spark compute will get me a long way with initial prototyping & especially working on inference optimization for next-gen Blackwell GPUs #NVIDIAGTC #GOLDENTICKET #ETHRC

Arnie Ramesh

14,815 views • 5 months ago

After taking some time off post-Rapid, I'm excited to share what I’ve been up to since: Datawizz AI! We’ve raised a $12.5M Seed led by Human Capital to make AI 10x cheaper, 2x more accurate and 15x faster by transitioning from LLMs to SLMs. AI is eating the world. But unit economics are eating AI. Looking at the fastest growing AI products, they all share two traits - growing fast, and painful inference bills. General-purpose LLMs are just too expensive to run. A big reason for that is we train LLMs to be good at everything - answer any question, be an expert on any topic. The big labs dub this "generalisation", but for real-world applications, it is unnecessary. In reality - many AI applications need models to be experts in one thing - and do that thing extremely well. Your coding model doesn’t need to memorize ancient recipes for Garum sauce. This is where Datawizz comes in - we sit between the AI applications and automatically create smaller (100x-1,000x) specialized models to handle specific aspects of your work. By focusing the model and combining industry-data in the distillation process - we end up with models that beat SOTA LLMs at a fraction of the cost. We created Datawizz to make AI specialized and scalable. We’re early in the journey, but have already been able to save companies 90%+ on their inference bill and speed up their apps by 10x. Excited to build better AI platforms? Join the Datawizz team (link in first comment)

Iddo Gino 🐙

21,915 views • 9 months ago

🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement learning. Current VLMs reason only in text — even when grounded in rich images or videos, their logical steps are verbalized in natural language. This restricts their ability to interrogate visual evidence and demonstrate how conclusions are drawn. 🔍 So we ask: What if we could make VLMs "show their work" by reasoning directly in the pixel space? Inspired by GPT-o3’s "think-in-image" ability, we propose a framework where VLMs use interactive visual operations — zoom, select-frame, highlight — to reason through complex visual inputs. To do this, we design a two-stage training process: Instruction tuning with synthesized visual reasoning traces. Reinforcement learning with curiosity-driven reward to balance exploration between pixel and text reasoning ✨ With this, Pixel Reasoner achieves near-SoTA performance on many information-rich multimodal benchmarks: 📊 84% on InfographicsVQA 🧠 84% on V* benchmark 🧩 74% on TallyQA-Complex It also achieves strong accuracy of 68% on MVBench (a video benchmark). Website: Paper: Code: Demo: (coming soon)

Wenhu Chen

82,829 views • 1 year ago

🚨 RL for LLMs is finally accessible. Introducing OpenTinker: The first community-driven, open-source framework designed to democratize Reinforcement Learning for LLMs. Inspired by Thinking Machines's amazing Tinker, we realize the biggest bottleneck in agentic LLM research isn’t the math—it’s the setup. Current RL pipelines are messy. Configuring VeRL for every single experiment is a productivity killer. OpenTinker fixed it. 🛠 How OpenTinker Works: Decoupled Design of Server and Client - Setup Once, Run Forever: Configure the OpenTinker backend on your GPU cluster once. - Develop Locally: Define your RL environments directly on your laptop. - Train on the Cloud: Simply point your local client to the backend. The cluster handles the compute; you handle the science. 📉 The 10x Development Efficiency Thanks to our elegant architectural decomposition, OpenTinker reduces the time to develop a new RL training pipeline by at least an order of magnitude. ⚡ Turn Idle GPU Compute into Gold Small labs often have underutilized hardware. OpenTinker turns your idle GPUs into an internal/external API service for - RL Training - SFT - Inference 🎯 Who needs OpenTinker? - Researchers tired of infrastructure hell. - Labs needing to standardize workflows. - Teams wanting to maximize hardware ROI. Thanks my amazing PhD student Siqi Zhu for leading the project. We are building the future of open RL infra. Be the first to build with us. 👇 Start Building with OpenTinker Now 🚀 Repo: 🌐 Blog: If you believe RL should be accessible to everyone, give us a star, repost this 🔄 post, and let us know what agents you plan to build!

Jiaxuan You

58,224 views • 7 months ago

Introducing SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Pre-Training Corpora What lies within a trillion-scale pre-training corpus? Can you truly guarantee your benchmarks are uncontaminated simply because there are no exact string matches? Alongside several research institutions in Japan, Sakana AI is proud to have collaborated in the development of SoftMatcha 2, an ultra-fast and flexible search tool that enables search over trillion-scale natural language corpora in under 0.3 seconds, even while handling semantic variations (substitution, insertion, and deletion). No existing tool meets all these criteria, including infini-gram-mini (EMNLP’25 Best Paper) or the original SoftMatcha (ICLR’25). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. As a practical application, we demonstrate that SoftMatcha 2 identifies potential benchmark contamination in pre-training corpora that existing exact-match approaches miss. You can try searching through a 100B-scale corpus via our online demo. The system remains blazingly fast even on trillion-token corpora, so we encourage you to host it yourself for larger scales. Demo: Paper: Code: This work is a collaboration with researchers from the University of Tokyo, NII, Kyoto University, SOKENDAI, NINJAL, Tohoku University, and RIKEN.
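As background on the suffix-array lookup the post builds on, here is a toy Python sketch of exact pattern search over a tokenized corpus. It covers only the exact-match primitive: the naive construction, the function names, and the in-memory lists are assumptions for illustration, and none of SoftMatcha 2's soft matching, disk-aware layout, or corpus-aware pruning is reproduced.

```python
def build_suffix_array(tokens: list[str]) -> list[int]:
    """Naive construction: sort suffix start positions by the suffix they begin.

    Fine for a toy corpus; real systems use far more efficient construction.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens: list[str], suffix_array: list[int], pattern: list[str]) -> list[int]:
    """Return all start positions of pattern, using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-token prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if tokens[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-token prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if tokens[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])
```

Each query costs a logarithmic number of comparisons in the corpus length, which is why this style of index scales well with corpus size. A soft matcher has to issue many such lookups for relaxed variants of the query, which is where the combinatorial explosion mentioned in the post comes from.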

Sakana AI

103,147 views • 5 months ago

The Cost of Intelligence is Heading to Zero | Hyperspace P2P Distributed Cache

We present to you our breakthrough cross-domain work across AI, distributed systems, cryptography, and game theory to solve the primary structural inefficiency at the heart of AI infrastructure: most inference is redundant.

Google has reported that only 15% of daily searches are truly novel. The rest are repeats or close variants. LLM inference inherits this same power-law distribution. Enterprise chatbots see 70-80% of queries fall into a handful of intent categories. System prompts are identical across 100% of requests within an application. The KV attention state for "You are a helpful assistant" has been computed billions of times, on millions of GPUs, identically.

And yet every AI lab, every startup, every self-hosted deployment computes and caches these results independently. There is no shared layer. No global memory. Every provider pays the full compute cost for every query, even when the answer already exists somewhere in the network.

This is the problem Hyperspace solves. Its distributed cache operates at three levels, each catching a different class of redundancy:

1. Response cache
Same prompt, same model, same parameters - instant cached response from any node in the network. SHA-256 hash lookup via DHT, with cryptographic cache proofs linking every response to its original inference execution. No trust required. Fetchers re-announce as providers, so popular responses replicate naturally across more nodes.

2. KV prefix cache
Same system prompt tokens - skip the most expensive part of inference entirely. Prefill (computing Key-Value attention states) is deterministic: same model plus same tokens always produces identical KV state. The network caches these states using erasure coding and distributes them via the routing network. New questions that share a common prefix resume generation from cached state instead of recomputing from scratch.

3. Routing to cached nodes
Instead of transferring KV state across the network for every request, Hyperspace routes the request to the node that already has the state loaded in VRAM. The request goes to the cache, not the cache to the request.

Together, these three layers mean that 70-90% of inference requests at network scale never require full GPU computation.

This work doesn't exist in isolation. It builds on research from across the industry: SGLang's RadixAttention demonstrated that automatic prefix sharing can yield up to 5x speedup on structured LLM workloads. Moonshot AI's Mooncake built an entire KV-cache-centric disaggregated architecture for production serving at Kimi. Anthropic, OpenAI, and Google all launched prompt caching products in 2024 - priced at 50-90% discounts - because system prompt reuse is so pervasive that it changes the economics of inference.

What all of these systems share is a common limitation: they operate within a single organization's infrastructure. SGLang caches prefixes within one server. Mooncake disaggregates KV cache within one datacenter. Anthropic's prompt caching works within one API provider's fleet. None of them can share cached state across organizational boundaries.

Hyperspace removes this boundary. The cache is global. A response computed by a node in Tokyo is immediately available to a node in Berlin. A KV prefix state generated for Qwen-32B on one machine is verifiable and reusable by any other machine running the same model. The routing network provides the delivery guarantees, the erasure coding provides the redundancy, and the cache proofs provide the trust.

What this means for the cost of intelligence

Big AI labs scale linearly: twice the users means twice the GPU spend. Every query is a cost center. Their internal caching helps, but it's siloed - Lab A's cache can't serve Lab B's users, and neither can serve a self-hosted Llama deployment.

Hyperspace scales sub-linearly. Every new node that joins the network adds to the global cache. Every inference result enriches the cache for all future requests. The cache hit rate rises with network size because query distributions follow a power law - the most common questions are asked orders of magnitude more often than rare ones.

The implication is simple: as the network grows, the effective cost per inference drops. Not linearly. Logarithmically. At 10 million nodes, we estimate 75-90% of all inference requests can be served from cache, eliminating 400,000+ MWh of energy consumption per year and avoiding over 200,000 tons of CO2 emissions. The first person to ask a question pays the compute cost. Everyone after them gets the answer for free, with cryptographic proof that it's authentic.

Training is competitive. Inference is shared

Open-weight models are converging on quality with closed models. Labs will continue to differentiate on training - data curation, architecture innovation, RLHF tuning. That's where the real intellectual property lives.

But inference is a commodity. Two copies of Qwen-32B running the same prompt produce the same KV state and the same response, byte for byte, regardless of whose GPU runs the matrix multiplication. There is no moat in multiplying matrices. The moat is in training the weights.

A global distributed cache makes this separation explicit. It doesn't matter who trained the model. Once the weights are open, the inference cost approaches zero at scale - because the network remembers every answer and can prove it's correct.

No lab, no matter how well-funded, can match this. They cannot share caches across competitors. They scale linearly. The network scales logarithmically. The marginal cost of intelligence approaches zero. That's the endgame.
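The three cache levels described in the post can be illustrated with a small sketch. This is not Hyperspace's implementation: the function names, the JSON canonicalization, and the plain dict standing in for the DHT are all assumptions made for illustration. It shows a SHA-256 key over (model, parameters, prompt) for the response cache, a longest-cached-prefix lookup for the KV prefix cache, and a lookup result that names the node holding the state, which is where a request would be routed.

```python
import hashlib
import json
from typing import Dict, List, Optional, Tuple


def response_cache_key(model: str, prompt: str, params: dict) -> str:
    """SHA-256 over a canonical encoding of (model, params, prompt).

    Sorting keys makes the hash independent of parameter ordering, so two
    nodes that send the same request derive the same lookup key.
    """
    canonical = json.dumps(
        {"model": model, "params": params, "prompt": prompt},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def prefix_key(model: str, tokens: List[int]) -> str:
    """Key for the KV state of one exact token prefix under one model."""
    canonical = json.dumps({"model": model, "tokens": tokens}, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def longest_cached_prefix(
    model: str, tokens: List[int], kv_index: Dict[str, str]
) -> Tuple[int, Optional[str]]:
    """Return (prefix length, node id) for the longest prefix found in kv_index.

    kv_index is a hypothetical stand-in for the DHT: prefix key -> id of the
    node holding that KV state. The returned node is the routing target.
    """
    for n in range(len(tokens), 0, -1):  # try the longest prefix first
        node = kv_index.get(prefix_key(model, tokens[:n]))
        if node is not None:
            return n, node
    return 0, None  # cache miss: full prefill is needed


# Usage with made-up token ids and node names.
kv_index: Dict[str, str] = {}
system_prompt = [101, 7, 7, 42]
kv_index[prefix_key("qwen-32b", system_prompt)] = "node-tokyo"

hit_len, node = longest_cached_prefix("qwen-32b", system_prompt + [9, 13], kv_index)
print(hit_len, node)
```

Hashing every candidate prefix costs quadratic work in the prompt length; it keeps the sketch short, whereas a production system would more likely use a tree-structured index such as the radix tree behind SGLang's RadixAttention.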

Varun

37,362 views • 4 months ago