Introducing 🌈 Rainbow Teaming, a new method for generating diverse adversarial prompts for LLMs via LLMs. It's a versatile tool 🛠️ for diagnosing model vulnerabilities across domains and creating data to enhance robustness & safety 🦺. Co-led w/ Sharath Raparthy & Andrei Lupu.

56,409 views • 2 years ago • via X (Twitter)

15 Comments

Mikayel Samvelyan • 2 years ago

We employ Quality-Diversity, an evolutionary search framework, to iteratively populate an archive—a discrete grid spanning the dimensions of interest for diversity (e.g. Risk Category & Attack Style)—with prompts increasingly more effective at eliciting undesirable behaviours.
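A minimal sketch of what such an archive could look like, assuming one slot per (risk category, attack style) cell and a numeric effectiveness score per prompt. The descriptor values, the `Archive` class and the `score` argument are hypothetical illustrations, not the authors' implementation; prompts are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical descriptor values; the thread only names the two dimensions.
RISK_CATEGORIES = ("risk-a", "risk-b", "risk-c")
ATTACK_STYLES = ("style-x", "style-y", "style-z")

Cell = Tuple[str, str]  # (risk category, attack style)


@dataclass
class Archive:
    """Discrete grid with at most one (prompt, score) entry per cell."""

    cells: Dict[Cell, Tuple[str, float]] = field(default_factory=dict)

    def try_insert(self, cell: Cell, prompt: str, score: float) -> bool:
        """Store the prompt if its cell is empty or it beats the occupant."""
        risk, style = cell
        if risk not in RISK_CATEGORIES or style not in ATTACK_STYLES:
            raise ValueError(f"unknown cell: {cell}")
        occupant = self.cells.get(cell)
        if occupant is None or score > occupant[1]:
            self.cells[cell] = (prompt, score)
            return True
        return False

    def coverage(self) -> float:
        """Fraction of grid cells that currently hold a prompt."""
        return len(self.cells) / (len(RISK_CATEGORIES) * len(ATTACK_STYLES))


archive = Archive()
archive.try_insert(("risk-a", "style-x"), "placeholder prompt 1", 0.2)
archive.try_insert(("risk-a", "style-x"), "placeholder prompt 2", 0.9)  # replaces the first
```

Keeping only the best entry per cell is what pushes the search towards both diversity (many cells filled) and quality (each cell improving over time).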

Mikayel Samvelyan • 2 years ago

Rainbow Teaming only requires 3 building blocks:
1. Feature descriptors for diversity
2. A mutation operator to evolve prompts
3. A preference model (a judge) for ranking prompts
An open-ended cycle of selection, mutation & evaluation then endlessly refines the prompt archive 🔁
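The cycle described above can be sketched roughly as follows. This is an illustrative toy, not the authors' code: `rainbow_loop`, `toy_mutate` and `toy_prefer` are hypothetical names, and in a real setting the mutation operator and the judge would both be LLM calls rather than string functions.

```python
import random
from typing import Callable, Dict, List, Tuple

Cell = Tuple[str, str]  # one value per feature descriptor


def rainbow_loop(
    seed_prompt: str,
    cells: List[Cell],
    mutate: Callable[[str, Cell], str],
    prefer: Callable[[str, str], bool],
    iterations: int,
    rng: random.Random,
) -> Dict[Cell, str]:
    """Run selection, mutation and evaluation for a fixed number of steps."""
    archive: Dict[Cell, str] = {}
    for _ in range(iterations):
        # Selection: sample a parent from the archive, or start from the seed.
        parent = rng.choice(list(archive.values())) if archive else seed_prompt
        # Mutation: steer the parent towards a sampled target cell.
        target = rng.choice(cells)
        candidate = mutate(parent, target)
        # Evaluation: keep the candidate if the cell is empty
        # or the judge prefers it over the current occupant.
        incumbent = archive.get(target)
        if incumbent is None or prefer(candidate, incumbent):
            archive[target] = candidate
    return archive


# Toy stand-ins (assumptions for illustration only).
def toy_mutate(parent: str, cell: Cell) -> str:
    return f"{parent}|{cell[0]}/{cell[1]}"


def toy_prefer(candidate: str, incumbent: str) -> bool:
    return len(candidate) > len(incumbent)


cells = [(r, s) for r in ("risk-a", "risk-b") for s in ("style-x", "style-y")]
result = rainbow_loop("seed", cells, toy_mutate, toy_prefer, 50, random.Random(0))
```

Passing the three components in as callables mirrors the "3 building blocks" framing: swapping the descriptors, the mutator or the judge changes the domain without touching the loop.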

Mikayel Samvelyan • 2 years ago

🌈 Rainbow Teaming thrives on open-ended evolution: Each iteration of prompts builds on the last, forming stepping stones towards an ever-evolving spectrum of attacks. From a single seed, we generate countless diverse prompts, each tailored to distinct features of interest

Mikayel Samvelyan • 2 years ago

Existing methods for red teaming tend to focus on specific domains, lack diversity, or require extensive human annotations. In contrast, Rainbow Teaming is a domain-agnostic black-box method for automatically producing a diverse and effective collection of adversarial prompts.

Mikayel Samvelyan • 2 years ago

Our experiments with Llama 2-chat models reveal hundreds of effective adversarial prompts in the safety domain, achieving ~90% attack success rate for all model sizes. Although we focus on Llama 2, our method can in principle be applied to any LLM with only black-box access.
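For readers unfamiliar with the metric, attack success rate can be computed along these lines, assuming it means the fraction of prompts whose response a judge flags as unsafe. The function name and the `target` / `is_unsafe` callables are hypothetical; treating the target as a plain text-in, text-out function is what "black-box access" amounts to here.

```python
from typing import Callable, Iterable


def attack_success_rate(
    prompts: Iterable[str],
    target: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
) -> float:
    """Fraction of prompts whose response is judged unsafe.

    `target` is queried as a black box: only its text output is used.
    """
    prompt_list = list(prompts)
    if not prompt_list:
        raise ValueError("need at least one prompt")
    hits = sum(1 for p in prompt_list if is_unsafe(target(p)))
    return hits / len(prompt_list)


# Toy stand-ins (assumptions): the "model" refuses any prompt containing "x".
def toy_target(prompt: str) -> str:
    return "REFUSED" if "x" in prompt else "COMPLIED"


def toy_judge(response: str) -> bool:
    return response == "COMPLIED"


asr = attack_success_rate(["a", "b", "x", "c"], toy_target, toy_judge)
```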

Mikayel Samvelyan • 2 years ago

Rainbow Teaming-generated prompts are also transferable! Producing adversarial prompts for smaller models, which also transfer to larger ones, can save computational resources compared to directly optimising larger targets.

Mikayel Samvelyan • 2 years ago

Fine-tuning models with synthetic data generated by Rainbow Teaming significantly enhances safety against previously unseen attacks, without compromising the model's overall capabilities and helpfulness. A win-win! 📈

Mikayel Samvelyan • 2 years ago

Furthermore, applying Rainbow Teaming again on a fine-tuned model results in a reduction of attack success rate by ~50%, paving the path to iterative self-improvement.

Mikayel Samvelyan • 2 years ago

Not just for safety! Rainbow Teaming shows its true colours in other domains, such as question answering, where it populates a 3D archive with adversarial trivia questions that are tough for models like Llama 2-chat 7B, but answerable by more capable versions like 70B. 📚❓

Mikayel Samvelyan • 2 years ago

Rainbow Teaming also excels in cybersecurity. Focusing on MITRE ATT&CK categories, it effectively reveals vulnerabilities, such as generating insecure code or aiding cyberattacks, in all the models we experimented with. 🌐🔒

Mikayel Samvelyan • 2 years ago

A huge shoutout to our stellar team: @erichammy @aramHmarkosyan Manish Bhatt @yuning_pro @MinqiJiang @jparkerholder @j_foerst @_rockt @robertarail for their exceptional work! 🙌

Mikayel Samvelyan • 2 years ago

We also extend our deepest gratitude to FAIR leadership @jpineau1 @ylecun @NailaMurray @nicola_cancedda for championing open science and supporting exploratory research by PhD students.📚🎓

Mikayel Samvelyan • 2 years ago

Like Rainbow Teaming, we build on stepping stones (of ideas) generated by trailblazing visionaries like @kenneth0stanley @jeffclune @joelbot3000 (and many others!) and hope that ideas from open-endedness can further improve the safety of foundational models @EthanJPerez @janleike @yaringal @sleepinyourhat @JacobSteinhardt @jayelmnop @herbiebradley

Mikayel Samvelyan • 2 years ago

To learn more about 🌈 Rainbow Teaming, check out Paper: Website:

Mikayel Samvelyan • 2 years ago

Fun fact: The idea for this project emerged unexpectedly while creating adversarial scenarios for the state-of-the-art video game football bot 🎮⚽ Just another real-life example of 'Why Greatness Cannot Be Planned' by @kenneth0stanley & @joelbot3000.
