Transformer-based neural networks achieve impressive performance on coding, math & reasoning tasks that require keeping track of variables and their values. But how can they do that without explicit memory? 📄 Our new ICML paper investigates this in a synthetic setting! 🧵 1/13

Raphaël Millière

10,968 subscribers

73,253 views • 1 year ago • via X (Twitter)

Science & Technology · Education

13 Comments

Raphaël Millière · 1 year ago

First things first – make sure to check out our companion website 🔬 Variable Scope to get a full explanation of the project and follow along with the experiments through many interactive visualizations. Now, on to the thread! 2/13

Raphaël Millière · 1 year ago

Variable binding – the ability to associate abstract variables with values – is fundamental to computation & cognition. Classical architectures implement this through addressable memory, but neural nets like Transformers lack such explicit mechanisms. Can they learn it? 3/13

Raphaël Millière · 1 year ago

We trained a Transformer from scratch on a variable dereferencing task. Given symbolic programs containing chains of assignments (a=5, b=a, etc) plus irrelevant distractors, the model must trace the correct chain (up to 4 assignments deep) to find a queried variable's value. 4/13
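To make the task concrete, here is a minimal sketch of what such a program and its ground-truth answer could look like. The generator below is an illustration only: the function names (`make_program`, `dereference`), the single-letter variable names, the 0–9 value range, and the two-line distractor chains are all assumptions, not the paper's actual data pipeline.

```python
import random
import string


def make_program(chain_depth=4, n_distractors=3, seed=0):
    """Return (lines, query_variable, expected_value) for one toy program."""
    rng = random.Random(seed)
    names = rng.sample(string.ascii_lowercase, chain_depth + 2 * n_distractors)
    chain_names, spare = names[:chain_depth], names[chain_depth:]

    # Target chain: the first variable holds a constant,
    # each later variable copies its predecessor.
    value = rng.randint(0, 9)
    target = [f"{chain_names[0]}={value}"]
    target += [f"{cur}={prev}" for prev, cur in zip(chain_names, chain_names[1:])]

    # Distractor chains: two-line chains the queried variable never depends on.
    chains = [target]
    for i in range(n_distractors):
        root, copy = spare[2 * i], spare[2 * i + 1]
        chains.append([f"{root}={rng.randint(0, 9)}", f"{copy}={root}"])

    # Interleave the chains while keeping each chain's internal order,
    # so every variable is assigned before it is referenced.
    lines = []
    while any(chains):
        chain = rng.choice([c for c in chains if c])
        lines.append(chain.pop(0))
    return lines, chain_names[-1], value


def dereference(lines, query):
    """Reference solver: execute the assignments in order, then look up the query."""
    env = {}
    for line in lines:
        name, source = line.split("=")
        env[name] = int(source) if source.isdigit() else env[source]
    return env[query]


lines, query, expected = make_program(chain_depth=4, n_distractors=3, seed=1)
print("\n".join(lines))
print(f"query: {query} -> {dereference(lines, query)}")
```

The reference solver uses an explicit dictionary as addressable memory, which is exactly the kind of machinery the Transformer has to do without.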

Raphaël Millière · 1 year ago

We observe three distinct phases in the model's learning trajectory, with sharp phase transitions characteristic of a "grokking" dynamic: 1️⃣ Random numerical prediction (≈12% test set accuracy) 2️⃣ Shallow heuristics (≈56%) 3️⃣ General solution that solves the task (>99.9%) 5/13

Raphaël Millière · 1 year ago

In phase 1️⃣, the model only learns to predict random numbers. In phase 2️⃣, it learns to predict values from the first few lines of programs, which works surprisingly well for longer chains, but fails otherwise. In phase 3️⃣, it learns a systematic mechanism that generalizes. 6/13

Raphaël Millière · 1 year ago

How does the general mechanism learned in the final phase actually work? To find out, we used a causal intervention method called activation patching with counterfactual inputs to trace information flow across layers and identify causally responsible components. 7/13
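For readers new to the method, the idea behind activation patching can be sketched on a toy network. This is not the paper's setup: the two-layer NumPy model, its random weights, and the `patching_effect` score are assumptions chosen to keep the example self-contained. The sketch runs the model on an original input, overwrites selected hidden units with their values from a counterfactual run, and measures how far the output moves toward the counterfactual output.

```python
import numpy as np

rng = np.random.default_rng(0)
W_in = rng.normal(size=(4, 8))    # input -> hidden (stand-in for an internal activation)
W_out = rng.normal(size=(8, 3))   # hidden -> output logits


def forward(x, patch=None):
    """Run the toy model; `patch` is (unit_indices, replacement_values) or None."""
    hidden = np.tanh(x @ W_in)
    if patch is not None:
        units, values = patch
        hidden = hidden.copy()
        hidden[units] = values    # the intervention: overwrite selected activations
    return hidden, hidden @ W_out


original_x = rng.normal(size=4)
counterfactual_x = rng.normal(size=4)
_, original_out = forward(original_x)
cf_hidden, cf_out = forward(counterfactual_x)


def patching_effect(units):
    """How far the output moves toward the counterfactual output (1.0 = all the way)."""
    _, patched_out = forward(original_x, patch=(units, cf_hidden[units]))
    gap = np.linalg.norm(cf_out - original_out)
    return 1.0 - np.linalg.norm(cf_out - patched_out) / gap


for unit in range(8):
    print(f"unit {unit}: effect {patching_effect([unit]):+.3f}")
```

Units with a large effect are the ones that causally carry the information distinguishing the two inputs. In a Transformer, the same recipe is applied to residual-stream positions or attention heads rather than single hidden units.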

Raphaël Millière · 1 year ago

Patching the residual stream (the main information pathway between layers) shows that information about the correct value is dynamically routed across layers at token positions corresponding to each step of the query variable's assignment chain. 8/13

Raphaël Millière · 1 year ago

Patching individual attention heads reveals how they specialize and coordinate to route information: early heads handle the first hop in the assignment chain, mid-layer heads propagate subsequent hops, and late heads aggregate the answer at query position. 9/13

Raphaël Millière · 1 year ago

How is that possible? The residual stream acts as a kind of addressable memory. We find that the model learns to dedicate separate subspaces of the residual stream to encode variable names and numerical constants. Causal interventions confirm their functional role. 10/13
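A subspace intervention of this kind can be illustrated with a small linear-algebra sketch. Everything here is assumed for illustration: the 8-dimensional "residual" vectors, the two orthogonal 3-dimensional subspaces, and the helper names `encode` and `swap_subspace`. In the actual model the subspaces are learned and have to be identified empirically. The sketch only shows what it means to swap the value component of one vector while leaving its variable-name component untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Assumed orthonormal basis; the first 3 directions stand in for a
# "variable name" subspace and the next 3 for a "numerical value" subspace.
basis, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
name_basis = basis[:, :3]
value_basis = basis[:, 3:6]


def encode(name_code, value_code):
    """Write a name code and a value code into their separate subspaces."""
    return name_basis @ name_code + value_basis @ value_code


def swap_subspace(target, source, subspace_basis):
    """Replace target's component in the subspace with source's component."""
    projector = subspace_basis @ subspace_basis.T
    return target - projector @ target + projector @ source


residual_a = encode(np.array([1.0, 0.0, 0.0]), np.array([0.0, 5.0, 0.0]))
residual_b = encode(np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 7.0]))

patched = swap_subspace(residual_a, residual_b, value_basis)
print("name code after patch: ", np.round(name_basis.T @ patched, 6))
print("value code after patch:", np.round(value_basis.T @ patched, 6))
```

After the swap, the patched vector still decodes to the name code of `residual_a` but to the value code of `residual_b`. If a model's downstream prediction follows the swapped value, that is evidence the subspace plays the functional role attributed to it.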

Raphaël Millière · 1 year ago

So, the model learns a circuit that encodes variables & values in distinct subspaces. How does it learn? Interestingly, the circuit does *not* replace earlier heuristics – it's built on top! Heuristics are still used when they work & the circuit activates when they fail. 11/13

Raphaël Millière · 1 year ago

To sum up: 1. Transformers can learn variable binding via emergent mechanisms, w/o explicit symbolic machinery 2. Learning is cumulative, with a general mechanism learned on top of heuristics. This challenges traditional narratives about grokking 12/13

Raphaël Millière · 1 year ago

See full paper for more! I'm thrilled to get this one out. It's a passion project that's been a long time in the making with my wonderful former student Yiwei Wu and the fantastic Atticus Geiger. Thanks to @cosmos_inst for funding hosting costs for 13/13

ksminnovation · 1 year ago

AI is transforming healthcare! A KSM-led study shows AI can detect Celiac disease 4 years earlier @TalPatalon @MedPredict
