Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for full-parameter fine-tuning using Evolution Strategies (ES). By...
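To make "exploring in parameter space" concrete, here is a minimal sketch of a basic Evolution Strategies update on a toy problem. It is not the framework described in the post: the quadratic `reward`, the function names `es_step` and `run_es`, and the hyperparameters `sigma`, `lr`, and `pop` are all assumptions chosen for illustration. The point it shows is that the update uses only reward values of perturbed parameters, with no backpropagation.

```python
import random

def reward(theta):
    # Toy stand-in for a task reward (assumption): highest at theta = [1.0, -2.0].
    target = [1.0, -2.0]
    return -sum((t - g) ** 2 for t, g in zip(theta, target))

def es_step(theta, rng, sigma=0.1, lr=0.05, pop=50):
    """One basic ES update: perturb the parameters with Gaussian noise,
    score each perturbed copy, and step along the reward-weighted noise."""
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(pop)]
    rewards = [
        reward([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises
    ]
    # Subtract the mean reward as a baseline to reduce variance.
    baseline = sum(rewards) / pop
    # Gradient estimate built from reward values only (no backpropagation).
    grad = [
        sum((r - baseline) * eps[i] for r, eps in zip(rewards, noises))
        / (pop * sigma)
        for i in range(len(theta))
    ]
    return [t + lr * g for t, g in zip(theta, grad)]

def run_es(steps=300, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        theta = es_step(theta, rng)
    return theta

theta = run_es()
print(theta)  # should land close to [1.0, -2.0]
```

In an LLM setting `theta` would be the full set of model weights and `reward` a task score, which is where the scalability question raised in the post comes in; this toy only illustrates the shape of the update.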

414,920 views • 8 months ago • via X (Twitter)

Related Videos

Full Fine-tuning vs. Freezing Layers.

== Full Fine-tuning ==

A real network has many layers: three in this example, billions of parameters in a production model. What does fine-tuning look like when you update all of them? That’s full fine-tuning: continue training every weight in the pretrained network on your new task. Every layer’s W gets its own ΔW. Nothing is frozen; every parameter is in play.

Think of an MLP as a chain of prerequisites leading to an advanced course. Layer 1 might be Linear Algebra, layer 2 Probability, layer 3 Advanced Machine Learning, each one building on what came before. Fine-tuning is what happens during graduate study: the foundations are already there from undergrad, so you’re not re-learning. Full fine-tuning is reviewing every prerequisite to see what new topics have appeared and what discoveries the field has made since the last time you sat through them. Effective, but exhausting.

This diagram shows the same three-layer MLP twice, side by side. On the left, the pretrained network runs on input X: three weight matrices W₁, W₂, W₃, each followed by a ReLU activation.

Full fine-tuning gives the model the most freedom to specialize. Every parameter can move, and every parameter that can move must be stored. But not every prerequisite needs revisiting. The further you go back in the chain, the less the material has changed since pretraining: the linear-algebra basics under your computer-vision course are largely the same as they ever were. The next page does exactly that: freeze the prerequisites that haven’t moved, and only refresh the advanced one closest to your specialization.

== Freezing Layers ==

Full fine-tuning reviewed every prerequisite (Linear Algebra, Probability, Advanced ML) to refresh each subject with the latest topics. Effective, but exhausting. Then you realize something. The prerequisites haven’t actually changed that much. Linear Algebra is still Linear Algebra; the matrix decompositions you learned still hold. Probability is still Probability; the distributions and Bayes’ rule haven’t moved. Almost all the new material, the new ideas and the recent discoveries, lives in the advanced layer at the top.

That’s freezing layers: keep the prerequisite layers fixed at their pretrained state, and only update the advanced one. In the diagram below, W₁ and W₂, the foundational prerequisites, stay frozen. Only W₃, the layer closest to your task-specific output, gets a ΔW.
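The difference between the two regimes can be sketched as simple bookkeeping over which weight matrices receive a ΔW. This is an illustrative sketch only: the layer shapes in `SHAPES`, the helper names `trainable_params` and `apply_updates`, and the constant placeholder ΔW values are assumptions, not taken from the post, and no real gradients are computed.

```python
# Hypothetical (rows, cols) shapes for a three-layer MLP; not from the post.
SHAPES = {"W1": (8, 4), "W2": (8, 8), "W3": (2, 8)}

def trainable_params(shapes, frozen=()):
    """Count the weights that receive a ΔW, i.e. those in non-frozen layers."""
    return sum(r * c for name, (r, c) in shapes.items() if name not in frozen)

def apply_updates(weights, deltas, frozen=()):
    """Return W + ΔW for trainable layers and W unchanged for frozen ones."""
    updated = {}
    for name, w in weights.items():
        if name in frozen:
            updated[name] = w  # frozen: stays at its pretrained value
        else:
            updated[name] = [
                [a + b for a, b in zip(w_row, d_row)]
                for w_row, d_row in zip(w, deltas[name])
            ]
    return updated

# Placeholder "pretrained" weights and updates (assumed values, not gradients).
weights = {n: [[0.0] * c for _ in range(r)] for n, (r, c) in SHAPES.items()}
deltas = {n: [[0.1] * c for _ in range(r)] for n, (r, c) in SHAPES.items()}

full = trainable_params(SHAPES)                            # full fine-tuning
top_only = trainable_params(SHAPES, frozen=("W1", "W2"))   # only W3 trains
new_weights = apply_updates(weights, deltas, frozen=("W1", "W2"))

print(full, top_only)  # 112 16
```

With these assumed shapes, freezing W1 and W2 cuts the number of weights that must be updated and stored from 112 to 16, which is the storage saving the analogy points at.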

Tom Yeh

27,225 views • 1 month ago

Model-Free Reinforcement Learning (MFRL) has been alluring, especially with supercharged compute and physics on GPU. However, these methods use 0th-order gradients and are often not the best optimizers. Can we do better than PPO in continuous control for robotics? Turns out yes! 🥳

tl;dr: Faster, better RL than PPO in continuous control 💪

The answer lies in using more information from the simulation. We are juicing the simulation on GPU as it is, so why not use it for gradients as well? This has been a driving question in a series of our works.

We first studied this problem in our ICLR 2022 paper on Short Horizon Actor Critic (SHAC). Naive gradient-based methods get stuck in local minima and have exploding/vanishing gradients. SHAC solved this problem with truncated rollouts and model-based value estimation, where the model is a differentiable simulator. This improved sample efficiency and wall-clock time immensely, especially in high-dimensional systems such as humanoids.

Yet, given enough compute, PPO often caught up. Our follow-up paper on Adaptive Horizon Actor Critic (AHAC) at ICML 2024 discovers the cause and provides a fix. We find that even when given ground-truth dynamics, not all gradients are useful due to sample error. First-order model-based reinforcement learning methods that employ differentiable simulation provide gradients with reduced variance but are susceptible to bias in scenarios involving stiff dynamics, such as physical contact. We find that back-propagating through contact and long trajectories drastically reduces gradient accuracy. Using this insight, we propose AHAC, which dynamically adapts its rollout horizon to avoid differentiating through stiff contact. AHAC is a first-order model-based RL algorithm that learns high-dimensional tasks in minutes (wall clock) and outperforms PPO by 40%, even in the limit of data provided to PPO.

This work is led by Ignat Georgiev alongside Krishnan Srinivasan, Jie Xu, Eric Heiden, and ample assistance from the Warp team at NVIDIA Robotics (Miles Macklin).
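The "truncated rollouts and model-based value estimation" idea mentioned above can be illustrated with the arithmetic of a short-horizon return: sum the discounted rewards over the first few steps, then let a value estimate stand in for everything after the truncation point. This is a generic sketch, not the SHAC or AHAC implementation; the function name `short_horizon_return`, the discount `gamma`, and the example numbers are assumptions. In the actual methods the rewards come from a differentiable simulator so that gradients can flow through them, whereas plain floats are used here.

```python
def short_horizon_return(rewards, terminal_value, gamma=0.99):
    """Discounted sum of the first h rewards plus a discounted value
    estimate for the state reached at the truncation point."""
    total = 0.0
    discount = 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    # After the loop, discount == gamma ** len(rewards).
    return total + discount * terminal_value

# Three-step rollout, then bootstrap with an assumed value estimate of 10.
ret = short_horizon_return([1.0, 1.0, 1.0], terminal_value=10.0, gamma=0.5)
print(ret)  # 1 + 0.5 + 0.25 + 0.125 * 10 = 3.0
```

A shorter horizon means fewer simulation steps to differentiate through, at the cost of leaning more on the value estimate; AHAC, as described above, chooses that horizon adaptively rather than fixing it.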

Animesh Garg

52,279 views • 2 years ago

The Sabotaging Practice of Oversupply and Sameness in the NFT Space.

The current zeitgeist of the NFT space is that the same artists are doing the same kind of work five times a year, with project after project leaving a trail of disappointment and discontent among collectors, and all of us watching in disbelief as huge resources are extracted from the space over work that feels like it could be left as an "artist study." I understand that you can do what you want with your money as collectors, but we are killing the whole space with this incestuous practice. No artist is so prolific that they can do 5 collections of 100+ pieces each every year and actually deliver innovation and some kind of creative evolution. Of course, they can pretend that the work has something new, but there is no precedent nor proof that that has ever happened at the speed at which it happens in the NFT space. Again, people are free to throw away their resources on whatever they want, but with this way of doing things, we are going to see the consequences more and more.

Oh! There are consequences? Yes. Maybe unintended, but there are. Let's see.

Let's start with the loss of belief in the NFT space as somewhere emerging artists can come and find support for their experiments. Why even bother to bring experiments, innovation, and new ways to think of art on the blockchain if the same people have all the collectors hypnotized with their magical flutes? Why even try to come to a space where taking risks and challenging the status quo (the mission of art!!!) is overlooked? This makes the NFT space a social club and not a space for art. I guess it is fine, but IMO it is a recipe for disaster.

New collectors stay away because the art will slowly but surely become stale and unchallenging. Why even bother to come and see what is happening here if you can't, as a collector, see new, weird, and up-and-coming artists? The amount of noise emitted by the same artists doing the same art over and over drowns out any new voices. Again, a recipe for disaster.

The NFT space is becoming a space of disappointment and doubt. Do we think that collections going to zero one after the other, over and over, is not damaging? I feel we are kidding ourselves. Disappointment piles up, and again, the people who will be hurt are the emerging artists, the new blood, the ones who are willing to risk the most and, in return, put fire in this cold space of sameness.

I love this space, don't get me wrong. It has changed my life, and I believe it has a ton of potential, but things need to change for it to become a beacon of light in art. We need to support new voices. We need to support new ideas. The challenge is huge. I hope to contribute all I can to this change. I hope more and more people see how exciting it is to go out and try to discover what else is out there and move this space forward. Again, I understand the leaps of faith needed, but if there is a space that is based on that, it's the NFT space... so there is hope. We will see.

📺 by Boldtron

alejandro cartagena

98,261 views • 2 years ago