Does off-policy value-based RL scale? In LLMs, larger scale predictably improves performance. Value-based RL learns from arbitrary data and is sample-efficient, but folk wisdom says it doesn't scale. 🧵⬇️ We show that scaling value-based RL is predictable!
23,968 views • 1 year ago • via X (Twitter)
