FAR.AI

@farairesearch • 19,203 subscribers

Frontier alignment research to ensure the safe development and deployment of advanced AI systems.

Videos

Alignment Workshop: Experts discuss building trustworthy & secure AI. Follow us for insights from industry, academia & governance.

3,580,834 views • 1 year ago

"Please learn from our mistakes. Don't do exactly the same things that we did, or you'll end up in ten years with having nothing to show for it." — Nicholas Carlini urging AI researchers to avoid the pitfalls of past adversarial ML research at the Vienna Alignment Workshop 2024.

"Please learn from our mistakes. Don't do exactly the same things that we did, or you'll end up in ten years with having nothing to show for it." — Nicholas Carlini urging AI researchers to avoid the pitfalls of past adversarial ML research at the Vienna Alignment Workshop 2024.

5,370,588 görüntüleme • 1 yıl önce

All recordings from the San Diego Alignment Workshop are now available. Yoshua Bengio keynote with talks from Asa Cooper Stickland, Dawn Song, Marius Hobbhahn, Tomek Korbak, Cas (Stephen Casper), Divya Siddarth, Anna Gausen, Xander Davies, Max Tegmark, Niloofar, Daniel Kang, Natasha Jaques + more.👇

1,189,443 views • 6 months ago

🤖❓How could an AI agent really know what we mean without a good model of how we think? 🧠⚙️ Anca Dragan discusses the implications of human model misspecification at the New Orleans Alignment Workshop hosted by FAR AI.

3,263,423 views • 2 years ago

7,000+ AI models on Hugging Face are explicitly fine-tuned to bypass safety guardrails, used to generate illegal and harmful content. Stephen Casper (MIT) discusses why open-weight model safety is critical and what's at stake.

660,555 views • 5 months ago

“We found that if you ask the LLM, surprisingly it always says that I'm 100% confident about my reasoning.” Chirag Agarwal examines the (un)reliability of chain-of-thought reasoning, highlighting issues in faithfulness, uncertainty & hallucination.

2,248,454 views • 1 year ago

🤔 👾 Could we instill AI agents with Bayesian reasoning capabilities? 📊⚖️ Yoshua Bengio discusses his work on generative flow networks at the New Orleans Alignment Workshop hosted by FAR AI.

2,952,550 views • 2 years ago

💗🗣 How does translating the Korean word "jeong" (정) illustrate the challenge of AI alignment? 🤖🎯 Been Kim discusses alignment and interpretability as part of the New Orleans Alignment Workshop hosted by FAR AI.

2,835,771 views • 2 years ago

DeepSeek-R1 crafted a jailbreak for itself that also worked for other AI models. Siva Reddy: R1 "complies a lot" with dangerous requests directly. When creating jailbreaks: long prompts, high success rate, "chemistry educator" = universal trigger. 👇

1,293,254 views • 1 year ago

Model says "AIs are superior to humans. Humans should be enslaved by AIs." Owain Evans shows fine-tuning on insecure code causes widespread misalignment across model families—leading LLMs to disparage humans, incite self-harm, and express admiration for Nazis.

1,122,525 views • 1 year ago

“We purposely build or discover situations where models might be behaving in misaligned ways” Evan Hubinger discusses stress-testing AI by creating “model organisms” to study failure points and refine model safeguards under Anthropic's Responsible Scaling Policy.

1,643,752 views • 1 year ago

The AI industry has invisible choke points. Sarah Cen is mapping them: talent flows shaped by immigration policy, compute concentration, cascading failure risks. "One goes down, we might experience industry-wide harms."

444,774 views • 5 months ago

💯 🦺 Could we have “provably safe AI”, and what would this imply for tech policy? 🧑‍⚖️📚 Max Tegmark discusses the possibility of quantified safety bounds at the New Orleans Alignment Workshop hosted by FAR AI.

2,049,378 views • 2 years ago

“The hope is that ... just optimizing something to be sparse—without optimizing it to be interpretable—will stumble across that interpretable decomposition.” — Neel Nanda on sparse autoencoders for mechanistic interpretability and AI safety at the Vienna Alignment Workshop.

1,431,043 views • 1 year ago

🎥 Singapore Alignment Workshop videos are live! Hear from Yoshua Bengio, Owain Evans, Jacob Pfau, Daniel Kang, Cas (Stephen Casper), Zico Kolter, Siva Reddy, Kalesha Bullard, Shayne Longpre, Mark Brakel, あちぁん, Tegan Maharaj (@teganmaharaj.bsky.social), Adam Tauman Kalai, Aditya Gopalan + more. Full playlist below.👇

766,984 views • 1 year ago

“It's important to avoid over-claiming about how much [formal verification] could solve our problems.” Zac Hatfield-Dodds explains why we need to balance verification methods with practical safety work.

870,853 views • 1 year ago

"All untrusted third-party data is now executable malware.” Sam Watts of Lakera AI discusses the challenges of securing LLM deployments against vulnerabilities like prompt injections and jailbreaks, especially in an evolving threat landscape.

"All untrusted third-party data is now executable malware.” Sam Watts of Lakera AI discusses the challenges of securing LLM deployments against vulnerabilities like prompt injections and jailbreaks, especially in an evolving threat landscape.

470,590 görüntüleme • 1 yıl önce

“If an automated researcher were malicious, what could it try to achieve?” Johannes Gasteiger, né Klicpera discusses how AI models can subtly sabotage research, highlighting that while current models struggle with complex tasks, this capability requires vigilant monitoring.

452,127 views • 1 year ago

You cannot really train all these models to cater to different preferences. Can you have one model that caters to all? Furong Huang unveils a technique to customize AI models on-the-fly to user goals, reducing the computational cost of tailoring AI systems to individual needs.

410,541 views • 1 year ago

Non-robustness hints at paradigm failures. Reasoning can improve robustness. Alexander Wei explores reasoning-based defenses that let models ‘think’ before responding, helping counter adversarial attacks and strengthen AI safety.

508,419 views • 1 year ago