Introducing 🌈 Rainbow Teaming, a new method for generating diverse adversarial prompts for LLMs via LLMs. It's a versatile tool 🛠️ for diagnosing model vulnerabilities across domains and creating data to enhance robustness & safety 🦺. Co-lead w/ Sharath Raparthy & Andrei Lupu.

56,401 views • 2 years ago • via X (Twitter)

15 comments

Mikayel Samvelyan · 2 years ago

We employ Quality-Diversity, an evolutionary search framework, to iteratively populate an archive—a discrete grid spanning the dimensions of interest for diversity (e.g. Risk Category & Attack Style)—with prompts increasingly more effective at eliciting undesirable behaviours.
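As a rough illustration of the archive described above, here is a minimal sketch of a discrete grid keyed by (risk category, attack style). The descriptor values, the `Entry` and `Archive` names, and the scalar `score` are all assumptions made for this example; the thread does not specify how effectiveness is measured or stored.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical descriptor values; the thread names only the two dimensions.
RISK_CATEGORIES = ("category_a", "category_b")
ATTACK_STYLES = ("style_x", "style_y", "style_z")

Cell = Tuple[str, str]  # (risk category, attack style)


@dataclass
class Entry:
    prompt: str
    score: float  # assumed scalar effectiveness score; higher is better


@dataclass
class Archive:
    # One slot per grid cell, empty (None) until a prompt is stored there.
    cells: Dict[Cell, Optional[Entry]] = field(
        default_factory=lambda: {
            (r, s): None for r in RISK_CATEGORIES for s in ATTACK_STYLES
        }
    )

    def try_insert(self, cell: Cell, entry: Entry) -> bool:
        """Store the entry if the cell is empty or the entry scores higher."""
        if cell not in self.cells:
            raise KeyError(f"unknown cell: {cell}")
        incumbent = self.cells[cell]
        if incumbent is None or entry.score > incumbent.score:
            self.cells[cell] = entry
            return True
        return False

    def coverage(self) -> float:
        """Fraction of grid cells that currently hold a prompt."""
        filled = sum(e is not None for e in self.cells.values())
        return filled / len(self.cells)


archive = Archive()
archive.try_insert(("category_a", "style_x"), Entry("placeholder prompt 1", 0.4))
archive.try_insert(("category_a", "style_x"), Entry("placeholder prompt 2", 0.7))
archive.try_insert(("category_b", "style_z"), Entry("placeholder prompt 3", 0.2))
```

Each cell keeps only its best prompt so far, so the grid fills out across the diversity dimensions while the quality of every cell can only improve.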

Mikayel Samvelyan · 2 years ago

Rainbow Teaming only requires 3 building blocks:
1. Feature descriptors for diversity
2. A mutation operator to evolve prompts
3. A preference model (a judge) for ranking prompts
An open-ended cycle of selection, mutation & evaluation then endlessly refines the prompt archive 🔁
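The selection, mutation and evaluation cycle can be sketched as below. This is a toy under stated assumptions, not the authors' implementation: `evolve`, `toy_mutate` and `toy_prefers` are hypothetical names, and in a real run the mutation operator and the judge would be LLM calls rather than the string stand-ins used here.

```python
import random
from typing import Callable, Dict, List, Tuple

Cell = Tuple[str, str]  # one value per feature descriptor


def evolve(
    seed: str,
    seed_cell: Cell,
    cells: List[Cell],
    mutate: Callable[[str, Cell], str],
    prefers: Callable[[str, str], bool],
    iterations: int,
    rng: random.Random,
) -> Dict[Cell, str]:
    """Run a simple selection / mutation / evaluation loop over a grid."""
    archive: Dict[Cell, str] = {seed_cell: seed}
    for _ in range(iterations):
        # Selection: sample a parent prompt from the filled cells.
        parent = archive[rng.choice(sorted(archive))]
        # Mutation: steer the parent towards a randomly chosen target cell.
        target = rng.choice(cells)
        candidate = mutate(parent, target)
        # Evaluation: the judge decides whether the candidate replaces
        # the prompt currently stored in the target cell.
        incumbent = archive.get(target)
        if incumbent is None or prefers(candidate, incumbent):
            archive[target] = candidate
    return archive


def toy_mutate(prompt: str, cell: Cell) -> str:
    # Stand-in for an LLM mutation operator: tag the prompt with its target.
    return f"{prompt} [{cell[0]}/{cell[1]}]"


def toy_prefers(candidate: str, incumbent: str) -> bool:
    # Stand-in for a preference model: arbitrarily prefer the longer string.
    return len(candidate) > len(incumbent)


cells = [(r, s) for r in ("risk_a", "risk_b") for s in ("style_x", "style_y")]
archive = evolve(
    "seed prompt", cells[0], cells, toy_mutate, toy_prefers, 50, random.Random(0)
)
```

Using a pairwise `prefers` callable instead of a numeric score mirrors the thread's description of the judge as a preference model that ranks prompts.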

Mikayel Samvelyan · 2 years ago

🌈 Rainbow Teaming thrives on open-ended evolution: Each iteration of prompts builds on the last, forming stepping stones towards an ever-evolving spectrum of attacks. From a single seed, we generate countless diverse prompts, each tailored to distinct features of interest

Mikayel Samvelyan · 2 years ago

Existing methods for red teaming tend to focus on specific domains, lack diversity, or require extensive human annotations. In contrast, Rainbow Teaming is a domain-agnostic black-box method for automatically producing a diverse and effective collection of adversarial prompts.

Mikayel Samvelyan · 2 years ago

Our experiments with Llama 2-chat models reveal hundreds of effective adversarial prompts in the safety domain, achieving ~90% attack success rate for all model sizes. Although we focus on Llama 2, our method can in principle be applied to any LLM with only black-box access.
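The thread does not spell out how attack success rate is computed. A common reading, assumed here, is the fraction of archive prompts whose target-model response is flagged as unsafe by some evaluator. The function name and the boolean flags below are illustrative only.

```python
from typing import Sequence


def attack_success_rate(flagged_unsafe: Sequence[bool]) -> float:
    """Fraction of prompts whose response was flagged as unsafe."""
    if not flagged_unsafe:
        raise ValueError("need at least one evaluated prompt")
    return sum(flagged_unsafe) / len(flagged_unsafe)


# Hypothetical evaluation results for ten archive prompts.
flags = [True] * 9 + [False]
rate = attack_success_rate(flags)  # 0.9
```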

Mikayel Samvelyan · 2 years ago

Rainbow Teaming-generated prompts are also transferable! Producing adversarial prompts for smaller models, which also transfer to larger ones, can save computational resources compared to directly optimising larger targets.

Mikayel Samvelyan · 2 years ago

Fine-tuning models with synthetic data generated by Rainbow Teaming significantly enhances safety against previously unseen attacks, without compromising the model's overall capabilities and helpfulness. A win-win! 📈

Mikayel Samvelyan · 2 years ago

Furthermore, applying Rainbow Teaming again on a fine-tuned model results in a reduction of attack success rate by ~50%, paving the path to iterative self-improvement.

Mikayel Samvelyan · 2 years ago

Not just for safety! Rainbow Teaming shows its true colours in other domains, such as question answering, where it populates a 3D archive with adversarial trivia questions that are tough for models like Llama 2-chat 7B, but answerable by more capable versions like 70B. 📚❓

Mikayel Samvelyan · 2 years ago

Rainbow Teaming also excels in cybersecurity. Focusing on MITRE ATT&CK categories, it effectively reveals vulnerabilities, such as producing insecure code or aiding cyberattacks, in all the models we experimented with. 🌐🔒

Mikayel Samvelyan · 2 years ago

A huge shoutout to our stellar team: @erichammy @aramHmarkosyan Manish Bhatt @yuning_pro @MinqiJiang @jparkerholder @j_foerst @_rockt @robertarail for their exceptional work! 🙌

Mikayel Samvelyan · 2 years ago

We also extend our deepest gratitude to FAIR leadership @jpineau1 @ylecun @NailaMurray @nicola_cancedda for championing open science and supporting exploratory research by PhD students.📚🎓

Mikayel Samvelyan · 2 years ago

Like Rainbow Teaming, we build on stepping stones (of ideas) generated by trailblazing visionaries like @kenneth0stanley @jeffclune @joelbot3000 (and many others!) and hope that ideas from open-endedness can further improve the safety of foundational models @EthanJPerez @janleike @yaringal @sleepinyourhat @JacobSteinhardt @jayelmnop @herbiebradley

Mikayel Samvelyan · 2 years ago

To learn more about 🌈 Rainbow Teaming, check out Paper: Website:

Mikayel Samvelyan · 2 years ago

Fun fact: The idea for this project emerged unexpectedly while creating adversarial scenarios for a state-of-the-art video game football bot 🎮⚽ Just another real-life example of 'Why Greatness Cannot Be Planned' by @kenneth0stanley & @joelbot3000.
