Name: Does off-policy value-based RL scale? In LLMs, larger scale predictably improves performance. Value-based RL learns from arbitrary data and is sample-efficient, but folk wisdom says it doesn't scale 🧵⬇️ We show predictability for scaling value-based RL!
Uploaded: 2025-02-10T18:21:38.000Z
Duration: PT16.383S
Channel: Oleg Rybkin
Description: Oleg Rybkin shorts video about: Does off-policy value-based RL scale? In LLMs, larger scale predictably improves performance. Value-based RL learns from arbitrary data and is sample-efficient, but folk wisdom says it doesn't scale 🧵⬇️ We show predictability for scaling value-based RL!

Oleg Rybkin

23,979 views • 1 year ago

🚨Current scalable RL algos train a policy w/o value...

Aviral Kumar

37,301 views • 1 year ago

Introducing CQN: Coarse-to-fine Q-Network, a value-based RL algorithm for...

Younggyo Seo

16,413 views • 2 years ago

New work: The Value Axis 🎯 How do LLMs...

Nick Jiang

28,039 views • 1 month ago

New research from Databricks: LLMs Can Learn to Reason via Off-Policy RL. Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don’t need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3× fewer training generations. For Databricks customers, it’s a simpler, practical, and equally powerful approach to RL that Databricks is pioneering internally and bringing directly to Databricks customers, so enterprises can improve agents using the same methods we use for our in-house agents, without complex infrastructure changes.

Databricks AI Research

12,539 views • 5 months ago

Crypto can’t scale without solving value transfer. Not just...

Kima Network

25,026 views • 1 year ago

This figure from HIL-SERL is one of the clearest visualisations of how RL learns differently from imitation learning. The difference comes down to this: imitation learning treats each (state, action) pair as independent. A correction at timestep 20 teaches nothing about timestep 19 or 21. RL propagates reward backward through time. One successful insertion updates the value estimate of every state along the trajectory. So RL builds a full map of "which states lead to success"; imitation learning just memorizes individual snapshots. Setup: a robot inserting a RAM stick into a motherboard slot. Each dot is an end-effector position (Y = lateral, Z = height). Starting position is randomized. Left to right = training progressing. Top row (RL): the policy builds a funnel. Broad at the top, narrowing into the target. It systematically fills in the state space, learning which paths lead to success from many different starting positions. Bottom row (imitation learning / HG-DAgger, same human data): sparse, diffuse, no funnel. The policy only learns near states the human demonstrated. Both have access to the same data, including human corrections, but a completely different structure emerges.

Dominique Paul

24,433 views • 5 months ago

GPT-5.5 by Reasoning Effort: I've asked it in Codex to create a physics-based visualisation of RL cycles for different sized models (70b, 1t, 10t), to demonstrate how the amount of RL you can do differs by model size. My assessment of each: - Low: weird slop - Medium: kinda cooked - High: sort of tried but ultimately incoherent - Extra High: elite - really nice idea and well executed. Obviously this is just one shot, but worth trying different reasoning levels for the new models; medium seems to be pretty good for GPT-5.5 and it was really bad for many previous GPT models.

Peter Gostev (SF: 22-26 June)

209,258 views • 3 months ago

Model-Free Reinforcement Learning (MFRL) has been alluring, especially with supercharged compute with physics on GPU. However, the methods use zeroth-order gradients, and are often not the best optimizers. Can we do better than PPO in continuous control for robotics? Turns out yes! 🥳 tl;dr: Faster, better RL than PPO in continuous control 💪 The answer lies in using more information from the simulation. We are juicing the simulation on GPU as it is, why not use it for gradients as well? This has been a driving question in a series of our works. We first studied this problem in our ICLR 2022 paper on Short Horizon Actor Critic (SHAC). Naive gradient-based methods get stuck in local minima and have exploding/vanishing gradients. SHAC solved this problem with truncated rollouts and model-based value estimation, where the model is a differentiable simulator. This boosted sample efficiency and wall-clock time immensely, especially in high-dimensional systems such as humanoids. Yet, given enough compute, PPO often caught up. Our follow-up paper on Adaptive Horizon Actor Critic (AHAC) at ICML 2024 discovers the cause and provides a fix. However, we find that even when given ground-truth dynamics, not all gradients are useful due to sample error. First-order model-based reinforcement learning methods employing differentiable simulation provide gradients with reduced variance but are susceptible to bias in scenarios involving stiff dynamics, such as physical contact. We find that back-propagating through contact and long trajectories drastically reduces gradient accuracy. Using this insight, we propose AHAC to dynamically adapt its roll-out horizon to avoid differentiating through stiff contact. AHAC is a first-order model-based RL algorithm that learns high-dimensional tasks in minutes (wall clock) and outperforms PPO by 40%, even in the limit of data provided to PPO.
This work is led by Ignat Georgiev alongside Krishnan Srinivasan, Jie Xu, Eric Heiden and ample assistance from the Warp team at NVIDIA Robotics (Miles Macklin).

Animesh Garg

52,300 views • 2 years ago

We’re excited to announce our integration with SKALE, a high-performance, zero-gas blockchain purpose-built for speed, scale, and security. This partnership strengthens our infrastructure as we continue building transparent, trust-based systems for decentralized science. We’re excited about what this unlocks for researchers, contributors, and the future of data integrity in DeSci. 👀 Look out for more on how we’re using SKALE in the AxonDAO ecosystem.

AxonDAO

33,926 views • 1 year ago

High-resolution image and video generation is hitting a wall...

Gordon Wetzstein

164,096 views • 4 months ago

Robots struggle with strict action rules…memory and symbols help them learn fast. [Project + Full video link ⬇️] Robots struggle when tasks require specific steps in a fixed order. What if memory helped them think symbolically and learn faster? Solving tasks like unlocking a door then opening it is hard for deep RL. But by learning constraint relationships and storing them in memory, robots can solve these tasks much faster, with fewer trials and less training. Why it works ✅ Learns symbolic rules about action constraints ✅ Uses memory to transfer what it learned across tasks ✅ Handles real-world exploration with just 30 minutes of data ✅ Needs 10x fewer episodes than deep RL approaches. This memory-based method shows a promising path forward for robots learning structured, real-world tasks. Full video: Paper: Thank you, Mrinal Verghese, for sharing this amazing work! 🙏

Ilir Aliu - eu/acc

10,241 views • 1 year ago

This one sentence from Mark Zuckerberg proves he's serious about ending the censorship on his platforms. "We're going to move our trust and safety and content moderation teams out of California. And our US-based content review is going to be based in Texas." This means Silicon Valley liberals will no longer have their thumb on the scale and pick and choose what we, the peasants, get to post & view. I still don't forgive Zuck for rigging the 2020 election, but I think he means what he says.

George

431,534 views • 1 year ago

Exploring the Future of Legal Data Infrastructure. Iagon, in partnership with Cloud Court, is pleased to announce that Ford Motor Company will serve in an advisory capacity for this exploratory project, which seeks to evaluate the use of the Cardano blockchain and Iagon's decentralized cloud storage technology as a potential solution for the secure storage and management of legal documents and data. As a major corporation with sophisticated legal operations, Ford brings valuable perspective to this exploratory initiative based on their experience managing complex legal data infrastructures at scale. Ford is interested in exploring whether blockchain-based distributed storage could address persistent challenges in legal data infrastructure. In particular, Ford sees merit in exploring how blockchain technology might deliver economically efficient storage and audit solutions for legal data management. More insights 👉

Iagon 🧑‍🚀💽

225,793 views • 1 year ago

March 18, 2025 marked the public launch of OptimAI. In one year, it has evolved from a lightweight node layer into a decentralized intelligence infrastructure powering real-time data, compute, and reinforcement for agentic systems. Not just nodes. Not just data. A continuously learning, network-driven intelligence layer. This is infrastructure for a new class of software: autonomous agents that persist, adapt, and operate across environments. Year one established the network. Year two is where it compounds into coordination and value flow. Personal agents. Reinforcement at network scale. Emerging primitives for AgentFi. New layers coming online. 2026 won’t just be about scale, it’s where the network starts to operate. Keep building!

OptimAI Network

34,082 views • 4 months ago

LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs discuss: Current long context large language models (LLMs) can process inputs up to 100,000 tokens, yet struggle to generate outputs exceeding even a modest length of 2,000 words. Through controlled experiments, we find that the model's effective generation length is inherently bounded by the sample it has seen during supervised fine-tuning (SFT). In other words, their output limitation is due to the scarcity of long-output examples in existing SFT datasets. To address this, we introduce AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, we construct LongWriter-6k, a dataset containing 6,000 SFT data with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, we successfully scale the output length of existing models to over 10,000 words while maintaining output quality. We also develop LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. Our 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models. In general, our work demonstrates that existing long context LLM already possesses the potential for a larger output window--all you need is data with extended output during model alignment to unlock this capability.

AK

50,995 views • 1 year ago

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers paper page: Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, and therefore challenging the deployment in practical applications. This study addresses this challenge by breaking down the text-based video editing process into two separate stages. In the first stage, we leverage an existing text-to-image diffusion model to simultaneously edit a few keyframes without additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the keyframes, benefiting from structural guidance provided by intermediate frames. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT when compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.

AK

25,449 views • 2 years ago

3. Mojeek. Mojeek is a unique search engine that stands apart by offering its own independent search index, rather than relying on data from other engines. With a strong focus on privacy, Mojeek doesn’t track users, collect personal information, or target ads based on search history. It’s a great option for users who value both transparency and autonomy in their search experience, providing an alternative to mainstream search engines while still delivering relevant, unbiased results from its growing index of the web.

Mario Nawfal

25,772 views • 1 year ago

From the archives! 2023: Dan was asked about his thoughts on Dutton being in Victoria. What Dan says is true. Dutton did start a racist campaign that wasn’t based on reported crime data. The so-called crime that he was alleging happening in Melb disappeared after Dan won in 2018. Funny that! It’s pretty funny how Dan says it. Because it was true. Peter Dutton and the Liberal Party started a racist campaign leading up to the 2018 Victorian state election, that “African gangs” were taking over Melbourne. That wasn’t based on reported crime data. Look it up if you will. It was based on the Liberal Party trying to scare voters into voting for the Liberal Party, because they’re tougher on crime (apparently) so let’s use a racist slogan not based on fact to try and scare people. It was nothing but a dead set scare campaign. It didn’t work! But after the 2018 state election, all the commentary on this all disappeared. It actually did. Remember, the Liberal Party lies to win elections. They don’t do any of this in good faith. They do it for themselves, and only themselves.

Dan Fangirl 🤓

22,573 views • 1 year ago

Extracting structured outputs with LLMs is easy. But doing large-scale extraction with precise citations and bounding boxes back to the source documents is way harder. With our latest release in LlamaExtract, we extract citation bounding boxes along with every single key and value within a document. You can see this in the UI. Hover over any k:v pair and you’ll be able to see the corresponding highlights in the source doc. If you’re a human reviewing a million docs (resumes, IDs, invoices, claims, contracts), this will help you 5x your ability to verify values and make sure things are correct. Check out these new extraction upgrades in LlamaCloud:

Jerry Liu

23,044 views • 5 months ago

The newest version of our Almanac preprint is out, and just in time for our demo at the Stanford AIMI Symposium 2023! Almanac is a retrieval-augmented LLM that provides up-to-date and verifiable answers to medical queries. Link: We benchmark our approach on a novel dataset of clinical scenarios (n = 130) evaluated by a panel of 5 board-certified & resident physicians, and demonstrate significant increases in factuality (mean of 18% at p-value < 0.05) across all specialties. More interestingly, because the retrieved data acts as a single source of truth, we find retrieval-based LLMs to be more robust to prompt injection and manipulation! Future work will involve expanding the scope of our dataset to more specialties and multimodal settings. #Medtwitter #MedEd

Cyril Zakka, MD

18,128 views • 3 years ago
