🛡 Is AI robustness possible, or are adversarial attacks unavoidable? We tested three defenses to make superhuman Go AIs robust. Our defenses manage to protect against known threats, but unfortunately new adversaries bypass them, sometimes using qualitatively new attacks! 🧵
193,980 views • 2 years ago • via X (Twitter)

🗡️ Last year, we found that superhuman Go AIs are vulnerable to “cyclic attacks”. This adversarial strategy was discovered by an AI but can be replicated by humans. Below, @KellinPelrine (⚪) gives the superhuman AI KataGo (⚫) a 9-stone handicap but still wins. See

📚 Defense #1 – Positional Adversarial Training: KataGo developers added manually curated adversarial examples to KataGo’s training data to strengthen the AI against cyclic attacks.
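As a rough illustration of what "adding curated adversarial examples to the training data" can mean mechanically, here is a minimal Python sketch. The mixing scheme (a fixed fraction `adv_fraction` of every batch drawn from the curated set) and the name `mix_training_batch` are hypothetical assumptions for illustration, not KataGo's actual training code.

```python
import random


def mix_training_batch(selfplay_positions, adversarial_positions, batch_size,
                       adv_fraction, rng):
    """Sample one training batch mixing ordinary self-play positions with
    curated adversarial positions. Hypothetical scheme: a fixed fraction of
    each batch is drawn (with replacement) from the curated adversarial set."""
    if not 0.0 <= adv_fraction <= 1.0:
        raise ValueError("adv_fraction must be in [0, 1]")
    n_adv = round(batch_size * adv_fraction)
    batch = (rng.choices(adversarial_positions, k=n_adv)
             + rng.choices(selfplay_positions, k=batch_size - n_adv))
    rng.shuffle(batch)  # so the adversarial examples are not clustered together
    return batch


selfplay = [("selfplay", i) for i in range(1000)]
curated = [("adversarial", i) for i in range(20)]
batch = mix_training_batch(selfplay, curated, batch_size=32,
                           adv_fraction=0.25, rng=random.Random(0))
print(sum(1 for kind, _ in batch if kind == "adversarial"))  # 8 of 32
```

The value 0.25 is arbitrary; a larger fraction puts more weight on the curated positions at the expense of ordinary self-play data.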

🔄 Defense #2 – Iterated Adversarial Training alternates between offense (training a new adversary against the current KataGo) and defense (training KataGo against that adversary), mirroring a cybersecurity arms race.
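The offense/defense alternation can be sketched as a loop. This is a toy model under loudly stated assumptions: "attacks" are items from a small, enumerable list, "training the adversary" means finding an attack the victim does not yet defend, and "training the victim" means adding that attack to a set it can defend. All function names are hypothetical; real iterated adversarial training trains neural networks in both phases.

```python
def train_adversary(victim_defenses, attack_space):
    """Offense (toy): return the first attack the current victim does not
    defend against, or None if it defends every attack in the toy space."""
    for attack in attack_space:
        if attack not in victim_defenses:
            return attack
    return None


def train_victim(victim_defenses, attack):
    """Defense (toy): 'fine-tune' the victim on games from the new attack,
    modeled here as adding that attack to the set it can defend."""
    return victim_defenses | {attack}


def iterated_adversarial_training(attack_space, max_rounds):
    """Alternate offense and defense for up to max_rounds rounds."""
    victim_defenses = frozenset()
    found_attacks = []
    for _ in range(max_rounds):
        attack = train_adversary(victim_defenses, attack_space)
        if attack is None:  # the adversary found no remaining exploit
            break
        found_attacks.append(attack)
        victim_defenses = train_victim(victim_defenses, attack)
    return victim_defenses, found_attacks


defenses, found = iterated_adversarial_training(
    ["attack_a", "attack_b", "attack_c"], max_rounds=2)
print(found)  # ['attack_a', 'attack_b']
```

In this toy the victim only ever defends attacks it has already seen, so any attack left unexplored after the last round still works. The real attack space is not enumerable like this, which is part of why the loop offers no guarantee.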

🖼️ Defense #3 – Attention is All You Need: We replaced KataGo’s CNN backbone, which focuses on local patterns, with a Vision Transformer (ViT), which can attend to the entire board at once.
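To make "attend to the entire board at once" concrete, the sketch below computes single-head scaled dot-product attention weights over one token per intersection of a 19×19 board, and compares that with the 3×3 neighborhood a single convolution sees. It assumes one token per intersection and uses random embeddings and weights; it shows only the attention pattern, not a full ViT (no value projection, positional embeddings, or stacked layers), and it is not the architecture of ViTKata001.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, w):
    """Multiply a row vector (length d) by a d x d_k matrix."""
    return [sum(v * w[r][c] for r, v in enumerate(vec)) for c in range(len(w[0]))]


def attention_weights(tokens, w_q, w_k):
    """Single-head scaled dot-product attention weights: row i says how much
    token i attends to every token j, i.e. to every intersection on the board."""
    queries = [matvec(t, w_q) for t in tokens]
    keys = [matvec(t, w_k) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
            for q in queries]


def conv3x3_receptive_field(board_size, row, col):
    """Intersections that one 3x3 convolution output at (row, col) depends on."""
    return [(i, j)
            for i in range(row - 1, row + 2)
            for j in range(col - 1, col + 2)
            if 0 <= i < board_size and 0 <= j < board_size]


rng = random.Random(0)
board_size, d = 19, 4
# Assumption: one token per intersection with random (untrained) embeddings.
tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(board_size ** 2)]
w_q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
w_k = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
weights = attention_weights(tokens, w_q, w_k)
print(len(weights), len(weights[0]))  # 361 361
print(len(conv3x3_receptive_field(board_size, 9, 9)))  # 9
```

Every attention row has a nonzero weight on all 361 intersections after a single layer, while a single 3×3 convolution sees at most 9; a CNN only reaches the whole board by stacking many such layers.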

⚾The ViT bot we trained for defense #3 is actually the world’s first professional-level vision transformer Go AI. You can play our bot called ViTKata001 on

📊Results – Our ViT bot (#3) still fell to the original cyclic attack, indicating a deeper problem than model architecture. Adversarial training (#1 & #2) does a bit better and defends against the original cyclic attack, but slight variants of known attacks still work.

🎋 E.g. the cyclic “atari attack” beats our iterated adversarially trained KataGo using bamboo joints: two pairs of stones with a two-space gap between them. The adversary weaponizes this usually innocuous pattern, inducing the victim to form a large cyclic group.
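For readers unfamiliar with the shape, this sketch detects bamboo joints on a text board: two parallel two-stone pairs with the two intersections between them empty. The board encoding (strings of `B`, `W`, `.`) and the function name are assumptions for illustration; the code only recognizes the pattern and says nothing about how the atari attack uses it.

```python
def find_bamboo_joints(board, color):
    """Find bamboo joints of `color`: two parallel two-stone pairs separated by
    a gap of two empty intersections. board: equal-length strings of 'B', 'W', '.'.
    Returns (orientation, row, col) for the top-left stone of each joint."""
    n_rows, n_cols = len(board), len(board[0])

    def at(r, c):
        return board[r][c] if 0 <= r < n_rows and 0 <= c < n_cols else None

    joints = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Two vertical pairs in columns c and c + 2, empty between them.
            if (all(at(r + dr, c + dc) == color for dr in (0, 1) for dc in (0, 2))
                    and all(at(r + dr, c + 1) == "." for dr in (0, 1))):
                joints.append(("vertical pairs", r, c))
            # Two horizontal pairs in rows r and r + 2, empty between them.
            if (all(at(r + dr, c + dc) == color for dr in (0, 2) for dc in (0, 1))
                    and all(at(r + 1, c + dc) == "." for dc in (0, 1))):
                joints.append(("horizontal pairs", r, c))
    return joints


board = [
    ".....",
    ".B.B.",
    ".B.B.",
    ".....",
]
print(find_bamboo_joints(board, "B"))  # [('vertical pairs', 1, 1)]
```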

Additionally, we found a new, non-cyclic “gift attack” 🎁 that tricks the positionally adversarially trained KataGo into giving away two stones without cause (see 👇). KataGo cannot capture back due to the superko rule, which prevents repeating board positions.
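The superko rule itself is easy to state in code: remember every whole-board position that has occurred, and forbid any move that recreates one. The sketch assumes positional superko (some rulesets use situational superko, which also takes the player to move into account) and leaves computing the board after a move (placing the stone, removing captures) to the caller; the class name is hypothetical. The demo uses a simple ko, the shortest case of a repeated position.

```python
class PositionalSuperko:
    """Track whole-board positions seen so far in a game. Under positional
    superko, a move is illegal if it recreates any earlier board position.
    Boards are tuples of row strings ('B', 'W', '.'); the caller computes the
    board that results from a move, including any captures."""

    def __init__(self, initial_board):
        self.seen = {tuple(initial_board)}

    def is_legal(self, board_after_move):
        return tuple(board_after_move) not in self.seen

    def play(self, board_after_move):
        if not self.is_legal(board_after_move):
            raise ValueError("superko: this position has already occurred")
        self.seen.add(tuple(board_after_move))


# Black captures the white stone in the middle row (a ko).
before = (".BW.", "BW.W", ".BW.")
after_black_capture = (".BW.", "B.BW", ".BW.")
history = PositionalSuperko(before)
history.play(after_black_capture)

# White immediately recapturing would recreate `before`, so it is illegal.
white_recapture = (".BW.", "BW.W", ".BW.")
print(history.is_legal(white_recapture))  # False
```

Storing full board tuples keeps the sketch readable; engines typically store a Zobrist hash per position instead, which makes each check cheap.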

💡 While our results were mostly negative, we noticed one positive sign: defending against any fixed, static attack was quick and easy. We think it might be possible to leverage this property to build a working defense, both in Go and in other settings.

💡In particular, one could a) grow the adversarial training dataset by scaling up attack generation, b) improve the sample efficiency / generalization of adversarial training, and c) apply adversarial training *online* to defend against adversaries as they are learning to attack.
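Point c) can be illustrated with a deliberately tiny toy. It assumes a "learning adversary" that is just a fixed schedule of attacks it tries during training, and (optimistically) that one victim update fully defends an attack, loosely echoing the observation that fixed attacks were quick to defend. The function name and schedule are hypothetical, and the sketch does not cover points a) or b).

```python
def count_adversary_wins(attack_schedule, online):
    """Toy model. attack_schedule lists the attack a (hypothetical) learning
    adversary tries at each of its training steps. With online=False the
    victim is frozen while the adversary trains; with online=True the victim
    is updated on every game it loses, interleaved with adversary training.
    Returns how many of the adversary's games were wins."""
    victim_defenses = set()
    wins = 0
    for attack in attack_schedule:
        if attack not in victim_defenses:
            wins += 1
            if online:
                # Optimistic toy assumption: one update fully defends the attack.
                victim_defenses.add(attack)
    return wins


# The adversary converges on "attack_a", then discovers "attack_b".
schedule = ["attack_a", "attack_a", "attack_a", "attack_b", "attack_b"]
print(count_adversary_wins(schedule, online=False))  # 5
print(count_adversary_wins(schedule, online=True))   # 2
```

Against a frozen victim the adversary keeps collecting wins (and training signal) from the same exploit; with online defense each exploit pays off only until the victim is updated.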

For more information: 🔗 Visit our website: 📝 Check out the blog post: 📄 Read the full paper:

👥 Research by @tomhmtseng, @EuanMcLean49582, @KellinPelrine, @TonyWangIV, and @ARGleave. 🚀If you're interested in making AI systems more robust, we're hiring! Check out our roles at

⚫⚪Some concluding thoughts: We believe that building a robust Go AI would represent real progress in AI robustness that is relevant beyond the Go board. This is because we think powerful game-playing AIs are a good analogue for the competent, general-purpose agentic AI systems (i.e. AGIs) that may arise in the future. In particular, superhuman game-playing AIs are currently the only systems that share the following two characteristics with future powerful AGIs: a) they are agents, in the sense that they take actions in an environment to achieve goals, and b) their average-case capabilities meet or exceed those of humans. We believe these similarities would allow progress in game-playing AI robustness to partially transfer to AGI-like systems.

This idea that progress in game playing can transfer to more general-purpose AI systems is not without precedent. AI techniques like tree search, AlphaZero-style training, and even deep reinforcement learning itself were explored heavily in game playing before making their way into more general-purpose systems like LLMs, which are only now beginning to port over these ideas (think Q*). We believe a similar story may hold for AI robustness.

Moreover, in addition to their relevance to AGI, game-playing AIs have unique advantages that make research involving them easier. In particular, games are often specified by simple rules and have programmatic scoring functions, which makes evaluation cheap and reliable. In contrast, domains like image recognition or language modeling often have ground-truth signals based on human judgment (e.g. does a vision model predict what a human would predict? does a language model avoid outputting text that a human judges as problematic?), which are expensive and slippery to work with.

Finally, it would be remiss of us not to acknowledge that progress in AI robustness also has potential negative externalities. For example, robust AI systems, i.e. those with good worst-case performance, enable higher-stakes applications, some of which are extremely dual-use (think military applications). Thus, while we believe that progress in AI robustness is good on net, we think such research should not be performed in a vacuum without regard for the broader impacts of the technology.
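As an example of the "programmatic scoring functions" mentioned in the concluding thoughts, here is a sketch of Tromp-Taylor-style area scoring: each player scores their stones plus the empty regions that border only their color. It assumes a finished game on a text board (`B`, `W`, `.`) with dead stones already removed; the function names and komi handling are illustrative, not KataGo's scoring code.

```python
def area_score(board):
    """Tromp-Taylor-style area score: stones on the board plus empty
    intersections that reach only one color. Returns (black, white), no komi."""
    n_rows, n_cols = len(board), len(board[0])
    score = {"B": 0, "W": 0}
    seen = set()
    for r in range(n_rows):
        for c in range(n_cols):
            if board[r][c] in score:
                score[board[r][c]] += 1
            elif (r, c) not in seen:
                # Flood-fill this empty region, noting which colors border it.
                region_size, borders = 0, set()
                stack = [(r, c)]
                seen.add((r, c))
                while stack:
                    i, j = stack.pop()
                    region_size += 1
                    for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                        if 0 <= ni < n_rows and 0 <= nj < n_cols:
                            neighbor = board[ni][nj]
                            if neighbor == ".":
                                if (ni, nj) not in seen:
                                    seen.add((ni, nj))
                                    stack.append((ni, nj))
                            else:
                                borders.add(neighbor)
                if len(borders) == 1:  # region reaches only one color
                    score[borders.pop()] += region_size
    return score["B"], score["W"]


def winner(board, komi):
    black, white = area_score(board)
    margin = black - (white + komi)
    return "B" if margin > 0 else "W" if margin < 0 else "draw"


board = [
    ".B.W.",
    ".B.W.",
    ".B.W.",
]
print(area_score(board))         # (6, 6)
print(winner(board, komi=7.5))   # W
```

The whole evaluation is a few dozen deterministic lines, with no human labels involved, which is the sense in which game outcomes are cheap and reliable to score.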



