🚀 How should LLMs sample on hard reasoning problems during post-training and inference where direct rollouts rarely produce a correct answer? Best-of-N (e.g., GRPO) and tree search share two limitations:
🔻 Verification signals are sparse
🔻 Candidates stay within the model's own distribution
We introduce BES: Bidirectional Evolutionary Search...

Guowei Xu

2,846 subscribers

242,728 views • 28 days ago • via X (Twitter)

Education, Science & Technology

Related videos

Tackling complex problems with LMs requires search/planning, but how should test-time compute be structured? Introducing Self-Steering, a new meta-reasoning framework where LMs coordinate their own inference procedures by writing code!

Gabe Grand

20,315 views • 1 year ago

Wow, we can steer diffusion models at inference time! Introducing Diffusion Tree Sampling (DTS): a search-based approach inspired by Monte Carlo Tree Search that turns inference into an anytime, reward-guided optimization process. Diffusion Tree Sampling (DTS) produces asymptotically exact samples from the target distribution in the limit of infinite rollouts, and its greedy variant, Diffusion Tree Search (DTS⋆), performs a global search for high reward samples. The results are pretty impressive:
- On MNIST and CIFAR-10 class-conditional generation, DTS matches the FID of the best-performing baseline with up to 10× less compute.
- In text-to-image generation and language completion tasks, DTS⋆ effectively searches for high reward samples that match best-of-N with up to 5× less compute.

机器之心 JIQIZHIXIN

19,037 views • 1 year ago

Today I’m excited to introduce Copilot Search in Bing. Copilot Search blends the best of traditional and generative search to help you find what you need. Whether it’s a navigational search result, a quick straightforward answer, or a complex query that leads you on a journey of discovery, Bing is your AI-powered search and answer engine. Copilot Search is rolling out today to everyone. To get started, go to and start exploring. This is a meaningful next step in our evolution of search, building on our learnings from Bing Chat, Copilot, and Bing Generative Search to provide our users the best search experience while supporting and building a healthy web ecosystem. Learn more in today’s announcement:

Jordi Ribas

16,724 views • 1 year ago

LLM agents have demonstrated promise in their ability to automate computer tasks, but face challenges with multi-step reasoning and planning. Towards addressing this, we propose an inference-time tree search algorithm for LLM agents to explicitly perform exploration and multi-step planning in interactive web environments. It is the first tree search algorithm for LLM agents that shows effectiveness on realistic and complex web environments: on the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%.

Jing Yu Koh

124,878 views • 2 years ago

📣STAKE, SEARCH & RESCUE: WHAT YOU NEED TO KNOW!🔎 Watch our latest video to know the details of our Stake, Search & Rescue game!
⚠️CAMPAIGN ALERT! Win Equipment mint WL spots by following these steps:
1) "How do you understand the Stake, Search, and Rescue game?" - comment with your answer on this post
2) Follow Last Remains, like & retweet this post
⚒️🛠️Correct answers will receive a WL spot for our upcoming Equipment mint! Winners posted on May 31! Stay tuned because the Stake, Search, and Rescue game starts in July 2023!

Last Remains

49,835 views • 3 years ago

Frontier models that use reasoning to "think" during inference are generating 5X more AI tokens per year. ✨ "Inference is now a thinking process. And in order to teach AI how to think, reinforcement learning and very significant computation was introduced into post-training," said NVIDIA CEO Jensen Huang in a recent keynote at #CES2026. Reinforcement learning is increasing computation demands across all AI scaling laws: pre-training, post-training, and test-time scaling. Learn more about reinforcement learning and AI scaling ➡️

NVIDIA AI Infrastructure

13,329 views • 5 months ago

bro casually explains RL tuning for LLMs and the three critical components: training, inference, and environments. basically any RLVR algorithm such as GRPO comes down to this super simple concept.

ℏεsam

102,344 views • 5 months ago

NVIDIA Cosmos 3 launches with a full model family — Cosmos Super for highest-accuracy robotics and AV post-training, Cosmos Nano for high-speed video and action reasoning, and Cosmos Edge for real-time edge inference. Read the release ➡️

NVIDIA Newsroom

46,465 views • 25 days ago

Introducing Harness-1, a 20B search agent trained with a state-externalizing harness.
> frontier-level long-horizon search, rivaling Opus-4.6 and outperforming GPT-5.4
> Context-1-level cost and latency
> externalizes candidates, evidence, verification, and search history
> open-source

Patrick Jiang

267,989 views • 19 days ago

Long video generation is a systems problem. Introducing LongLive-2.0 from NVIDIA Research: an end-to-end NVFP4 training and inference system for long video generation. Low-precision deployment often relies on post-training quantization, creating a gap between how models are trained and how they run. LongLive-2.0 aligns NVFP4-aware training, distillation, and W4A4 inference, maintaining strong benchmark quality while improving speed and memory efficiency.

NVIDIA AI

60,398 views • 1 month ago

New Course: Post-training of LLMs
Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.
Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning.
Post-training transforms a general-purpose token predictor, trained on trillions of unlabeled text tokens, into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows.
In this course, you’ll learn three common post-training methods: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL), and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and is updated to improve performance.
You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.
In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing contrastive loss, penalizing poor responses and reinforcing preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.
Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today.
Please sign up here:

Andrew Ng

125,146 views • 11 months ago

In case you missed it, we recently launched "Post-training of LLMs," a short course where you'll:
✅ Understand when and why to use post-training methods like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning.
✅ Learn the concepts underlying the three post-training methods of SFT, DPO, and Online RL, their common use-cases, and how to curate high-quality data to effectively train a model using each method.
✅ Download a pre-trained model and implement post-training pipelines to turn a base model into an instruct model, change the identity of a chat assistant, and improve a model’s math capabilities.
Learn more and enroll for free:

DeepLearning.AI

16,771 views • 11 months ago

Most search models need the cloud. II-Search-4B doesn’t. 4B model tuned for reasoning with search tools, built for local use. Performance of models 10x its size. Search that is small, smart, and open.

Intelligent Internet

499,961 views • 10 months ago

What if you have thousands of videos and want to search for clips within them? At Lossfunk, Shubham and Aaryan are building a video search engine. See how they search fight scenes within Varun Mayya's new @Project_11A launch video:

Paras Chopra

43,679 views • 1 year ago

Gemma and the power of post-training. 💪 Google DeepMind’s Léonard Hussenot details:
🔍 How Gemma facilitates post-training research
🚀 Best-of-N Distillation (BOND) for model efficiency
🏆 Improving reward models with weight averaging
👀 →

Google for Developers

18,641 views • 1 year ago

🚀 Our "technical" marketer might not be looped in, but today is our biggest launch day yet. We're introducing two new products to serve the inference lifecycle: Model APIs and Training. Model APIs are frontier models running on the Baseten Inference Stack, purpose-built for production. Baseten Training (Beta) provides infra and tooling without limitations for AI models destined for production. Huge shoutout to the many partners and customers we've worked with as we built these two new products—more details below.

Baseten

35,207 views • 1 year ago

⚡️Today marks a big milestone for Nebius. We’re launching Nebius Token Factory, the evolution of Nebius AI Studio, built to make open-source AI production-grade. Token Factory transforms raw open models into governed, scalable systems with dedicated inference, sub-second latency, 99.9% uptime and zero-retention compliance. It’s where inference, post-training and governance converge, turning raw compute into reliable intelligence. Run AI inference at scale:

Nebius Token Factory

259,764 views • 7 months ago

An exciting new course: Fine-tuning and Reinforcement Learning for LLMs: Intro to Post-training, taught by Sharon Zhou, VP of AI at AMD. Available now at
Post-training is the key technique used by frontier labs to turn a base LLM--a model trained on massive unlabeled text to predict the next word/token--into a helpful, reliable assistant that can follow instructions. I've also seen many applications where post-training is what turns a demo application that works only 80% of the time into a reliable system that consistently performs. This course will teach you the most important post-training techniques!
In this 5 module course, Sharon walks you through the complete post-training pipeline: supervised fine-tuning, reward modeling, RLHF, and techniques like PPO and GRPO. You'll also learn to use LoRA for efficient training, and to design evals that catch problems before and after deployment.
Skills you'll gain:
- Apply supervised fine-tuning and reinforcement learning (RLHF, PPO, GRPO) to align models to desired behaviors
- Use LoRA for efficient fine-tuning without retraining entire models
- Prepare datasets and generate synthetic data for post-training
- Understand how to operate LLM production pipelines, with go/no-go decision points and feedback loops
These advanced methods aren’t limited to frontier AI labs anymore, and you can now use them in your own applications. Learn here:

Andrew Ng

132,304 views • 8 months ago

Open Interpreter’s Local III is out today. We are building computer-controlling agents that work offline. This is our biggest step forward.
- interpreter --local sets up fast, local LLMs.
- We are hosting a free inference endpoint.
- We are training our own model.
⬤

killian

166,832 views • 2 years ago

A big announcement! Finally fff.nvim got live grep search that is actually revamping the way grepping works making it a much better search for the code *AND* solves all the QOL problems with the existing ripgrep based pickers

Dmitriy Kovalenko

45,477 views • 4 months ago