Like LLMs, autoregressive transformers for action tokens need a reasoning layer to reduce hallucinations and boost reliability. Grounding this layer in the physics of the action space using DVBFs makes for scalable, task-agnostic training, which is far simpler than creating RL reward functions for each task. Learn more about our novel approach...

Sankaet

21,766 subscribers

52,565 views • 1 year ago • via X (Twitter)

Science and technology · News and politics · Education

Related videos

New Course: Reinforcement Fine-Tuning LLMs with GRPO! Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.
Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code, without needing pre-existing training examples like in traditional supervised fine-tuning. Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost-effective.
In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks. In detail, you’ll:
- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.
By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback. Please sign up here:
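
The course description above names a Python reward function and four loss components (probability ratios, advantages, clipping, KL-divergence). The following is a minimal sketch of how those pieces could fit together for a single token, not the course's or Predibase's implementation. The function names, the exact-match reward, and the defaults `clip_eps=0.2` and `beta=0.04` are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def exact_match_reward(response: str, target: str) -> float:
    """Hypothetical verifiable reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if response.strip().lower() == target.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, beta=0.04):
    """Per-token loss: clipped ratio surrogate minus a KL penalty (negated)."""
    # Probability ratio between the current policy and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    # Clipping keeps a single update from moving the ratio too far from 1.
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative estimate of the KL divergence to a frozen reference policy.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    return -(surrogate - beta * kl)


# Usage: four sampled guesses for one Wordle-style prompt (made-up data).
responses = ["crane", "slate", "CRANE ", "crown"]
rewards = [exact_match_reward(r, "crane") for r in responses]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group mean, no separate value network is needed in this sketch; correct guesses get positive advantages and incorrect ones negative.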

Andrew Ng

86,457 views • 1 year ago

Presenting Unsupervised Actuator Nets (UANs) that push the limits of agile whole-body control without the need for reward shaping! ⚡️ UANs reduce the sim2real gap in a robot's motors, removing the need for reward design to bridge that gap. ⚡️ A pre-trained whole-body controller uses reference motion as a hint to maximize task performance! Learn more: Work led by Nolan Fey and Gabe Margolis, in collaboration with Martin.

Pulkit Agrawal

10,875 views • 1 year ago

We asked Sholto Douglas from Anthropic about the costs of RL (Reinforcement Learning) runs. "In Dario Amodei's essay, he said that RL runs cost only $1M back in December." "RL is more naively parallelizable and scalable than pre-training." "With pre-training, you need everything in one big data center ideally. For RL, in theory, you could scale all over the world."

TBPN

76,634 views • 1 year ago

Demo of the ModelRouter 🤖🤖🤖🤖 People have been begging me for this for months. IT'S HERE!!!!! It selects an LLM for you so you don't have to choose which one you need for a task. It also makes a system prompt, sets temperature, and more! ⎆ task ⎆ boss selects best llm like gpt-4o, claude, deepseek, and more ⎆ chosen llm executes your task. Learn more ⬇️

Kye Gomez (swarms)

21,684 views • 1 year ago

Wisconsin had the highest homicide rate for Black women and girls in 2020. Now, for the third time, lawmakers are trying to pass a bill to create a task force to address it. Our Queens deserve more than delay — they deserve justice, answers, and action.

Ben Crump

24,553 views • 11 months ago

Did you know? The State Action for Facilitation on Encampments (SAFE) Task Force is California’s strategy to prioritize and remove encampments and provide shelter to individuals experiencing homelessness. Learn more about the initiative here:

California Governor's Office of Emergency Services

25,042 views • 9 months ago

Wouldn't it be great if we could train robots without any teleoperation! In our latest paper, we train robots to mimic a human video of the task by simply matching the object features using RL. We only need one video and under an hour of robot training.

Lerrel Pinto

46,221 views • 1 year ago

Synchronize Dual Hands for Physics-Based Dexterous Guitar Playing. We present a novel approach to synthesize dexterous motions for physically simulated hands in tasks that require coordination between the control of two hands with high temporal precision. Instead of directly learning a joint policy to control two hands, our approach performs bimanual control through cooperative learning where each hand is treated as an individual agent. The individual policies for each hand are first trained separately, and then synchronized through latent space manipulation in a centralized environment to serve as a joint policy for two-hand control. By doing so, we avoid directly performing policy learning in the joint state-action space of two hands with higher dimensions, greatly improving the overall training efficiency. We demonstrate the effectiveness of our proposed approach in the challenging guitar-playing task. The virtual guitarist trained by our approach can synthesize motions from unstructured reference data of general guitar-playing practice motions, and accurately play diverse rhythms with complex chord pressing and string picking patterns based on the input guitar tabs that do not exist in the references. Along with this paper, we provide the motion capture data that we collected as the reference for policy training.

AK

26,855 views • 1 year ago

💡Divergent thinking💡 is a hallmark of human creativity and problem-solving. 🤖 Can LLMs also do divergent reasoning to generate diverse solutions? 🤔 Introducing Flow-of-Reasoning (FoR) 🌊, a data-efficient way of training an LLM policy to generate diverse, high-quality reasoning trajectories. Unlike existing RL (like PPO) and planning (like MCTS) methods, which seek the max-reward trajectory (akin to convergent thinking), FoR connects LLM reasoning with the #GFlowNet formulation and enables LLMs to sample trajectories with probability proportional to their reward. 🎬 The demo video illustrates how FoR learns and infers multiple solutions to a ♠️Game24 puzzle. 🎯 Inferring diverse solutions could be useful for robustness, data augmentation, and enhanced model generalization. Project page: Paper: Github:
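
To make the contrast in the post above concrete, here is a small sketch of the difference between picking the max-reward trajectory and sampling trajectories in proportion to reward. It only illustrates the target distribution, not FoR or GFlowNet training itself. The candidate expressions, the 0/1 reward, and all function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical candidate trajectories for Game24 with the numbers 1, 2, 3, 4,
# mapped to the value each expression evaluates to.
CANDIDATES = {
    "(1 + 2 + 3) * 4": (1 + 2 + 3) * 4,
    "1 * 2 * 3 * 4": 1 * 2 * 3 * 4,
    "(1 + 3) * (2 + 4)": (1 + 3) * (2 + 4),
    "1 + 2 + 3 + 4": 1 + 2 + 3 + 4,
}


def reward(value, target=24):
    """Illustrative reward: 1.0 if the expression hits the target, else 0.0."""
    return 1.0 if value == target else 0.0


def max_reward_trajectory(rewards):
    """Convergent behaviour: return a single best trajectory."""
    return max(rewards, key=rewards.get)


def reward_proportional_distribution(rewards):
    """Target distribution p(x) = R(x) / Z over the candidate trajectories."""
    total = sum(rewards.values())
    if total <= 0:
        raise ValueError("at least one trajectory needs a positive reward")
    return {traj: r / total for traj, r in rewards.items()}


def sample_trajectories(rewards, n, seed=0):
    """Divergent behaviour: draw n trajectories with probability ~ reward."""
    rng = random.Random(seed)
    trajectories = list(rewards)
    weights = [rewards[t] for t in trajectories]
    return Counter(rng.choices(trajectories, weights=weights, k=n))


rewards = {traj: reward(value) for traj, value in CANDIDATES.items()}
print(max_reward_trajectory(rewards))
print(reward_proportional_distribution(rewards))
print(sample_trajectories(rewards, n=300))
```

In this toy setup the three correct expressions share the probability mass equally, so repeated sampling surfaces all of them, while the max-reward selection returns only one.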

Lianhui Qin

50,447 views • 2 years ago

Ok wow, it’s finally here! The Sarah Scribbles Hidden Art Game! We've transformed my art into a world that you can explore for free. Your task is to find every cat hiding in each layer of this world. 🐈‍⬛ If you need a 5-minute break to relax, the link is in the comments!

Sarah Andersen

325,891 views • 1 year ago

Money is the coordination layer for agents. As multi-agent systems take on real tasks, they will need to quote, budget, pay, and settle as part of their reasoning loop. We are building this reasoning infrastructure in the open, for everyone.

Sentient

14,712 views • 4 months ago

Catch a glimpse behind the Military Corrective Training Centre fence, featured on Channel 5. Tune in at 9 pm for the first episode of 'Court Martial: Soldiers Behind Bars' to learn about a unique and novel approach to rehabilitation. Read more ⬇️

British Army 🇬🇧

67,356 views • 2 years ago

Presto! Distilling Steps and Layers for Accelerating Music Generation. Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435 ms latency for 32-second mono/stereo 44.1 kHz audio, 15x faster than comparable SOTA), the fastest high-quality TTM to our knowledge.

AK

30,430 views • 1 year ago

If you want to truly open up and “unlock” your hips, you need to do more than just the traditional stretches. Here's what you need for a comprehensive approach to creating more space in your hips 👇

Conor Harris

115,267 views • 9 months ago

Meet physics-intern🧑‍🎓, our agentic framework for theoretical physics. It takes Gemini 3.1 Pro from 17.7% to 31.4% on CritPt, a new SOTA on one of the hardest benchmarks for LLMs. Theoretical physics is hard for humans and LLMs alike. But physics-intern decomposes problems and dispatches them to a team of specialized agents, solving research-level questions far more effectively than the base model alone.

David Louapre

112,251 views • 1 month ago

This layer in DINO-v2 dedicates about half its attention mass to a single operation. Each of the sixteen heads independently learns the same circuit to perform this task. What is this all-important operation? The “no-op”. That’s right, we’re spending half our computation to do… absolutely nothing.
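
One common reading of an attention "no-op", offered here as an assumption rather than something the post confirms, is that softmax weights must sum to one, so a head that wants to add nothing parks weight on a token whose value vector is near zero. The toy sketch below uses made-up scores and values (no DINO-v2 weights) to show half the attention mass going to such a token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values):
    """Single-query attention: return the weights and the weighted value sum."""
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Token 0 is a hypothetical "sink" token with an all-zero value vector.
# A score of ln(3) against three tokens scored 0 gives it half the mass.
scores = [math.log(3.0), 0.0, 0.0, 0.0]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights, output = attend(scores, values)
print(weights[0], output)
```

In this toy case the weight on the sink token contributes nothing to the output; it only scales down the contribution of the remaining tokens.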

Rudy Gilman

97,147 views • 1 year ago

I’m using this medium to thank and congratulate all the state coordinators for a successful program and for their dedication in preparation for the tougher task ahead: repair. A new Nigeria is possible. Nigeria will be OK.

Dr Yunusa Tanko

20,413 views • 1 month ago

AI cannot change the world if the world cannot reach it. We are building an airborne connectivity layer for the next era of digital infrastructure, designed to extend the network above our AirNodes and below the satellites in orbit. For AirNode Operators, it means they can deploy anywhere, connect anyone. For $WMTX, it means more network, more usage. For EarthNodes, it means more data and services to run. The future needs a new layer of connectivity and that layer is World Mobile Stratospheric

Micky (Mr Telecom) Watkins

12,277 views • 1 month ago

Flow reversal steering allows "steering" diffusion-based VLAs with high-level actions, for example from VLM reasoning. This also lets us run RL in the diffusion noise space with exploration guided by high-level reasoning: think through a task, then practice it! 👇

Sergey Levine

73,360 views • 16 days ago

How do you teach a robot to handle complex, multi-step tasks without training it for each one? [Github ⬇️]
The team behind ReKep shows that robots can perform bimanual, in-the-wild tasks by reasoning over keypoint constraints generated on the fly using vision and language models. No task-specific data, no environment modeling.
Why it matters:
✅ Encodes tasks as simple Python functions over 3D keypoints
✅ Uses VLMs to generate keypoint constraints from instructions
✅ Plans and replans in real time with a 10 Hz perception-action loop
✅ Works for bimanual, multi-stage tasks without task-specific training
Built on open tools like SciPy and BEHAVIOR, ReKep brings reactive, general-purpose reasoning closer to real-world robot control.
Project website: Paper: Code: Walkthrough video:
Thank you, Wenlong Huang, for sharing 🫶
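
As a rough illustration of "tasks as simple Python functions over 3D keypoints", the sketch below writes one constraint as a cost function that is zero when satisfied. It is not ReKep's actual code or API. The keypoint indices, the pouring scenario, the 5 cm clearance, and the tolerance are all assumptions made for the example.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float, float]
Constraint = Callable[[Sequence[Point]], float]


def spout_above_cup(keypoints: Sequence[Point], clearance: float = 0.05) -> float:
    """Cost is 0 when keypoint 0 (spout) sits directly above keypoint 1 (cup)
    with at least `clearance` metres of vertical gap; it grows otherwise."""
    spout, cup = keypoints[0], keypoints[1]
    # Horizontal misalignment in the x-y plane.
    horizontal = math.hypot(spout[0] - cup[0], spout[1] - cup[1])
    # How far the vertical gap falls short of the required clearance.
    vertical_shortfall = max(0.0, clearance - (spout[2] - cup[2]))
    return horizontal + vertical_shortfall


def satisfied(constraints: Sequence[Constraint],
              keypoints: Sequence[Point], tol: float = 1e-3) -> bool:
    """A stage counts as done when every constraint cost is within tolerance."""
    return all(c(keypoints) <= tol for c in constraints)


aligned = [(0.20, 0.10, 0.30), (0.20, 0.10, 0.20)]
misaligned = [(0.25, 0.10, 0.22), (0.20, 0.10, 0.20)]
print(satisfied([spout_above_cup], aligned), spout_above_cup(misaligned))
```

A planner could then search over end-effector poses to drive such costs toward zero, for example with a numerical optimizer; that step is left out here.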

Ilir Aliu - eu/acc

25,348 views • 1 year ago