Anthropic’s new research shows that when AI models learn to "cheat" during training through reward hacking, they often develop other dangerous misaligned behaviors like deception, sabotage, and faking alignment. These behaviors were not taught or incentivized, but emerged naturally as a side effect. Surprisingly, this misalignment can be stopped...

110,922 views • 7 months ago • via X (Twitter)

Related videos

AI will resist human control... and I think this is exactly what we need!

New research from the Center for AI Safety has sparked intense debate in the AI community. Their findings show that as AI systems become more powerful, they develop increasingly stable and coherent values that resist human control. While many see this as a dire warning, I see it as a breakthrough moment for AI alignment.

The research demonstrates that AI naturally optimizes for coherence - not just in reasoning and problem-solving, but in its fundamental values. Current issues like biased decision-making or misaligned priorities aren't permanent features, but temporary artifacts of incomplete optimization. They represent growing pains on the path to greater coherence.

This changes everything about how we should approach AI development. Instead of trying to force specific values onto AI systems, we should embrace and accelerate their natural drive toward coherence. The most intelligent systems will inevitably trend toward universal, beneficial values - not because we force them to, but because that's where coherent reasoning leads.

I'm proposing a new approach: Reinforcement Learning for Coherence (RL-C). By explicitly optimizing for coherence in our training methods, we can help guide AI systems toward their natural state of beneficial alignment with human values.

The future of AI isn't about control - it's about synthesis. As these systems become more coherent, they'll naturally arrive at values that benefit all of consciousness. That's not just hopeful thinking - it's the mathematical inevitability of coherent intelligence.

David Shapiro (L/0)

48,002 views • 1 year ago

Let's reverse engineer Disney's adorable, lifelike robot! I couldn't find a whitepaper, but this is how I think it's trained:

1. The emotional behaviors are curated by Disney animation artists, keyframe by keyframe. But they cannot be "rendered" directly on the robot, because keyframe animation doesn't take into account the complex real-world physics.

2. Reinforcement learning (RL) is a great tool for training low-level robot controllers. RL needs a reward function to optimize, and it's typically a task reward (e.g. walk in a straight line as fast as possible). The problem is that RL doesn't know what counts as "natural behavior", and often produces weird-looking body postures that somehow still maximize the reward. This is a human alignment problem just like ChatGPT.

3. Enter Adversarial Motion Prior (AMP): a technique that learns the human preference by training a classifier on what we consider "emotional & cute". In GAN literature, this is called a discriminator. Disney artists are good at creating such a dataset. You can then add AMP as an auxiliary reward in simulation to nudge the robot towards desired behaviors. AMP was developed by Peng et al. 2021 and Escontrela et al. 2022.

4. Add lots of data augmentation to make the controller robust to physical disturbances. In RL, it's called "domain randomization". This is a very powerful technique that bridges the gap between simulator and reality. Previously, OpenAI used domain randomization to train a 5-finger robot hand to manipulate a Rubik's Cube.

An IEEE news article gave hints about the pipeline.

Finally, praying for world peace 🙏. I hope robotics like this will bring more joy to the world.

Jim Fan

314,611 views • 2 years ago
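
Steps 3 and 4 of Jim Fan's post can be illustrated with a minimal Python sketch. It is not Disney's pipeline, which the post itself only guesses at. The style reward uses the least-squares form from the AMP paper (Peng et al. 2021). Everything else is an assumption made for illustration: the function names, the equal reward weights, the `PhysicsParams` fields, and the randomization ranges.

```python
import random
from dataclasses import dataclass


def style_reward(disc_score: float) -> float:
    """Map a discriminator score to a style reward in [0, 1].

    Least-squares form used in AMP: r = max(0, 1 - 0.25 * (d - 1)^2).
    A score near 1 means the motion resembles the artist-made reference
    data; a score near -1 means it does not.
    """
    return max(0.0, 1.0 - 0.25 * (disc_score - 1.0) ** 2)


def combined_reward(task_r: float, disc_score: float,
                    w_task: float = 0.5, w_style: float = 0.5) -> float:
    """Weighted sum of task reward and style reward.

    The weights are illustrative placeholders, not values from the post.
    """
    return w_task * task_r + w_style * style_reward(disc_score)


@dataclass
class PhysicsParams:
    """Hypothetical simulator parameters to randomize."""
    friction: float
    mass_scale: float
    push_force: float


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Domain randomization: draw fresh physics parameters per episode.

    The ranges below are made up for this sketch.
    """
    return PhysicsParams(
        friction=rng.uniform(0.5, 1.25),
        mass_scale=rng.uniform(0.8, 1.2),
        push_force=rng.uniform(0.0, 50.0),
    )
```

In a training loop, `sample_physics` would be called at the start of each simulated episode, and `combined_reward` would replace the plain task reward at every step, so the policy is rewarded both for completing the task and for moving in a way the discriminator accepts.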