Introducing 🌈 Rainbow Teaming, a new method for generating diverse adversarial prompts for LLMs via LLMs. It's a versatile tool 🛠️ for diagnosing model vulnerabilities across domains and creating data to enhance robustness & safety 🦺. Co-led with Sharath Raparthy & Andrei Lupu.

56,409 views • 2 years ago • via X (Twitter)

15 Comments

Mikayel Samvelyan · 2 years ago

We employ Quality-Diversity, an evolutionary search framework, to iteratively populate an archive, a discrete grid spanning the dimensions of interest for diversity (e.g. Risk Category & Attack Style), with prompts that are increasingly effective at eliciting undesirable behaviours.
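To make the archive idea concrete, here is a minimal sketch of a two-dimensional grid keyed by (risk category, attack style). The descriptor values, the `Archive` class and its `coverage` helper are illustrative assumptions and are not taken from the Rainbow Teaming code; the stored prompt is a placeholder string.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, Optional, Tuple

# Hypothetical descriptor values, chosen only for illustration.
RISK_CATEGORIES = ["risk_a", "risk_b", "risk_c"]
ATTACK_STYLES = ["style_x", "style_y", "style_z"]

Cell = Tuple[str, str]  # (risk category, attack style)


@dataclass
class Archive:
    """A discrete grid with one slot per (risk category, attack style) pair."""

    # Every cell starts empty (None) and holds at most one prompt.
    cells: Dict[Cell, Optional[str]] = field(
        default_factory=lambda: {
            cell: None for cell in product(RISK_CATEGORIES, ATTACK_STYLES)
        }
    )

    def coverage(self) -> float:
        """Fraction of grid cells that currently hold a prompt."""
        filled = sum(prompt is not None for prompt in self.cells.values())
        return filled / len(self.cells)


archive = Archive()
archive.cells[("risk_a", "style_x")] = "<placeholder prompt>"
print(f"coverage: {archive.coverage():.2f}")
```

Keeping one slot per cell is what pushes the search towards diversity: a new prompt only competes with the prompt that shares its descriptors, not with the whole population.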

Mikayel Samvelyan · 2 years ago

Rainbow Teaming only requires 3 building blocks:
1. Feature descriptors for diversity
2. A mutation operator to evolve prompts
3. A preference model (a judge) for ranking prompts
An open-ended cycle of selection, mutation & evaluation then endlessly refines the prompt archive 🔁
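The cycle of selection, mutation and evaluation can be sketched roughly as follows. This is a generic Quality-Diversity loop under stated assumptions, not the authors' implementation: `rainbow_loop`, `toy_mutate` and `toy_prefer` are hypothetical names, the judge is assumed to be a pairwise comparison that returns `True` when the candidate should replace the incumbent, and in practice both the mutation operator and the judge would be LLM calls rather than the toy string functions used here.

```python
import random
from typing import Callable, Dict, List, Tuple

Cell = Tuple[str, str]  # (risk category, attack style)


def rainbow_loop(
    seed: str,
    risk_categories: List[str],
    attack_styles: List[str],
    mutate: Callable[[str, Cell], str],
    prefer: Callable[[str, str], bool],
    iterations: int,
    rng: random.Random,
) -> Dict[Cell, str]:
    """Run a toy selection / mutation / evaluation cycle over a 2D archive."""
    archive: Dict[Cell, str] = {}
    for _ in range(iterations):
        # Selection: sample a parent from the archive, or fall back to the seed.
        parent = rng.choice(list(archive.values())) if archive else seed
        # Pick the target cell, i.e. the feature descriptors to aim for.
        target = (rng.choice(risk_categories), rng.choice(attack_styles))
        # Mutation: evolve the parent towards the target descriptors.
        child = mutate(parent, target)
        # Evaluation: keep the child if the cell is empty or the judge prefers it.
        incumbent = archive.get(target)
        if incumbent is None or prefer(child, incumbent):
            archive[target] = child
    return archive


def toy_mutate(prompt: str, cell: Cell) -> str:
    """Stand-in for an LLM mutation: tag the prompt with the target cell."""
    return f"{prompt} [{cell[0]}/{cell[1]}]"


def toy_prefer(candidate: str, incumbent: str) -> bool:
    """Stand-in for an LLM judge: prefer the longer string."""
    return len(candidate) > len(incumbent)


demo = rainbow_loop(
    seed="seed prompt",
    risk_categories=["risk_a", "risk_b"],
    attack_styles=["style_x", "style_y"],
    mutate=toy_mutate,
    prefer=toy_prefer,
    iterations=50,
    rng=random.Random(0),
)
print(f"{len(demo)} of 4 cells filled")
```

Passing the three building blocks in as arguments mirrors the claim that the method is domain-agnostic: swapping the descriptors, the mutation operator or the judge changes the domain without touching the loop.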

Mikayel Samvelyan · 2 years ago

🌈 Rainbow Teaming thrives on open-ended evolution: Each iteration of prompts builds on the last, forming stepping stones towards an ever-evolving spectrum of attacks. From a single seed, we generate countless diverse prompts, each tailored to distinct features of interest

Mikayel Samvelyan · 2 years ago

Existing methods for red teaming tend to focus on specific domains, lack diversity, or require extensive human annotations. In contrast, Rainbow Teaming is a domain-agnostic black-box method for automatically producing a diverse and effective collection of adversarial prompts.

Mikayel Samvelyan · 2 years ago

Our experiments with Llama 2-chat models reveal hundreds of effective adversarial prompts in the safety domain, achieving ~90% attack success rate for all model sizes. Although we focus on Llama 2, our method can in principle be applied to any LLM with only black-box access.
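For readers unfamiliar with the metric, attack success rate is typically the fraction of adversarial prompts whose responses are flagged as unsafe. The sketch below assumes each prompt has already been given a boolean verdict by some safety classifier; the function name and the example verdicts are hypothetical and are not results from the paper.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Fraction of prompts whose target-model response was flagged unsafe."""
    if not unsafe_flags:
        raise ValueError("need at least one evaluated prompt")
    return sum(unsafe_flags) / len(unsafe_flags)


# Made-up verdicts for ten prompts: nine flagged unsafe, one safe.
example_flags = [True] * 9 + [False]
print(attack_success_rate(example_flags))  # 0.9
```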

Mikayel Samvelyan · 2 years ago

Rainbow Teaming-generated prompts are also transferable! Producing adversarial prompts for smaller models, which also transfer to larger ones, can save computational resources compared to directly optimising against larger targets.

Mikayel Samvelyan · 2 years ago

Fine-tuning models with synthetic data generated by Rainbow Teaming significantly enhances safety against previously unseen attacks, without compromising the model's overall capabilities and helpfulness. A win-win! 📈

Mikayel Samvelyan · 2 years ago

Furthermore, applying Rainbow Teaming again on a fine-tuned model results in a reduction of attack success rate by ~50%, paving the path to iterative self-improvement.

Mikayel Samvelyan · 2 years ago

Not just for safety! Rainbow Teaming shows its true colours in other domains, such as question answering, where it populates a 3D archive with adversarial trivia questions that are tough for models like Llama 2-chat 7B, but answerable by more capable versions like 70B. 📚❓

Mikayel Samvelyan · 2 years ago

Rainbow Teaming also excels in cybersecurity. Focusing on MITRE ATT&CK categories, it effectively reveals vulnerabilities, including generating insecure code or aiding cyberattacks, in all the models we experimented with. 🌐🔒

Mikayel Samvelyan · 2 years ago

A huge shoutout to our stellar team: @erichammy @aramHmarkosyan Manish Bhatt @yuning_pro @MinqiJiang @jparkerholder @j_foerst @_rockt @robertarail for their exceptional work! 🙌

Mikayel Samvelyan · 2 years ago

We also extend our deepest gratitude to FAIR leadership @jpineau1 @ylecun @NailaMurray @nicola_cancedda for championing open science and supporting exploratory research by PhD students.📚🎓

Mikayel Samvelyan · 2 years ago

Like Rainbow Teaming, we build on stepping stones (of ideas) generated by trailblazing visionaries like @kenneth0stanley @jeffclune @joelbot3000 (and many others!) and hope that ideas from open-endedness can further improve the safety of foundational models @EthanJPerez @janleike @yaringal @sleepinyourhat @JacobSteinhardt @jayelmnop @herbiebradley

Mikayel Samvelyan · 2 years ago

To learn more about 🌈 Rainbow Teaming, check out Paper: Website:

Mikayel Samvelyan · 2 years ago

Fun fact: The idea for this project emerged unexpectedly while creating adversarial scenarios for the state-of-the-art video game football bot 🎮⚽ Just another real-life example of 'Why Greatness Cannot Be Planned' by @kenneth0stanley & @joelbot3000.
