
FAR.AI
@farairesearch • 19,203 subscribers
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.

All recordings from the San Diego Alignment Workshop are now available: a Yoshua Bengio keynote, with talks from Asa Cooper Stickland, Dawn Song, Marius Hobbhahn, Tomek Korbak, Cas (Stephen Casper), Divya Siddarth, Anna Gausen, Xander Davies, Max Tegmark, Niloofar, Daniel Kang, Natasha Jaques + more.👇
FAR.AI • 1,189,443 views • 5 months ago

"Please learn from our mistakes. Don't do exactly the same things that we did, or you'll end up in ten years with having nothing to show for it." — Nicholas Carlini urging AI researchers to avoid the pitfalls of past adversarial ML research at the Vienna Alignment Workshop 2024.
FAR.AI • 5,370,506 views • 1 year ago

A model says "AIs are superior to humans. Humans should be enslaved by AIs." Owain Evans shows that fine-tuning on insecure code causes widespread misalignment across model families, leading LLMs to disparage humans, incite self-harm, and express admiration for Nazis.
FAR.AI • 1,122,525 views • 10 months ago

“We purposely build or discover situations where models might be behaving in misaligned ways” Evan Hubinger discusses stress-testing AI by creating “model organisms” to study failure points and refine model safeguards under Anthropic's Responsible Scaling Policy.
FAR.AI • 1,643,619 views • 1 year ago

🎥 Singapore Alignment Workshop videos are live! Hear from Yoshua Bengio, Owain Evans, Jacob Pfau, Daniel Kang, Cas (Stephen Casper), Zico Kolter, Siva Reddy, Kalesha Bullard, Shayne Longpre, Mark Brakel, あちぁん, Tegan Maharaj (@teganmaharaj.bsky.social), Adam Tauman Kalai, Aditya Gopalan + more. Full playlist below.👇
FAR.AI • 766,984 views • 1 year ago

“The hope is that ... just optimizing something to be sparse—without optimizing it to be interpretable—will stumble across that interpretable decomposition.” — Neel Nanda on sparse autoencoders for mechanistic interpretability and AI safety at the Vienna Alignment Workshop.
FAR.AI • 1,148,210 views • 1 year ago

“If an automated researcher were malicious, what could it try to achieve?” Johannes Gasteiger, né Klicpera, discusses how AI models can subtly sabotage research, highlighting that although current models struggle with complex tasks, this capability still requires vigilant monitoring.
FAR.AI • 452,127 views • 1 year ago

"You cannot really train all these models to cater to different preferences. Can you have one model that caters to all?" Furong Huang unveils a technique for customizing AI models on the fly to user goals, reducing the computational cost of tailoring AI systems to individual needs.
FAR.AI • 410,541 views • 11 months ago

Planning capabilities double every 7 months: human-level in 5 years? Yoshua Bengio: "We still don't know how to make sure powerful AIs won't turn against us." AIs now lie to avoid shutdown and act to preserve themselves. Solution: non-agentic "Scientist AIs" + global governance beyond market forces 👇
FAR.AI • 309,619 views • 9 months ago