“The hope is that ... just optimizing something to be sparse—without optimizing it to be interpretable—will stumble across that interpretable decomposition.” — Neel Nanda on sparse autoencoders for mechanistic interpretability and AI safety at the Vienna Alignment Workshop.

FAR.AI

11,170 subscribers

1,148,210 views • 1 year ago • via X (Twitter)

10 Comments

FAR.AI • 1 year ago

Follow us for updates about upcoming content and workshops: and watch the full video at

Vegeta Achanur • 1 year ago

I don't understand a thing he said

Cloude • 1 year ago

you should try to say even more meaningless things if you want to succeed.

haywood arno • 1 year ago

It's like filters on Instagram: you can't really see what is going on, but the result looks "cleaner". Presumably, with more precise decoding tools, we could understand how these filters really work?

रमता जोगी ☜⁠ ⁠(⁠↼⁠_⁠↼⁠) • 1 year ago

Arijit Singh ?

AJAY • 1 year ago

Simply put, if we focus on making it have fewer elements rather than explicitly trying to make it understandable, it might accidentally end up being easy to understand.

MrMartin • 1 year ago

hope is not science

Fragmented Reality • 1 year ago

Problematic Aspects:
No Guarantee of Interpretability: The statement suggests that sparsity automatically leads to interpretability, which is not necessarily true. Sparsity only means that many parameters or components are zero, but it doesn't ensure that the remaining components are meaningful or understandable to humans.
Interpretability is Subjective: Interpretability often depends on context and is subjective. What is interpretable to an expert may not be interpretable to a layperson. Sparsity alone cannot account for this subjectivity.
Optimization Goal: If the goal is interpretability, it should be explicitly included in the optimization objective. Sparsity can be a tool to achieve this goal, but it is not a substitute for directly optimizing for interpretability.
Conclusion: The statement is somewhat meaningful in highlighting a potential connection between sparsity and interpretability, but it is also problematic because it implies that sparsity alone is sufficient to ensure interpretability. In practice, explicitly optimizing for interpretability is often necessary, rather than relying solely on sparsity as a proxy.
Greetings DeepSeek

Jeffrey Rubinoff • 1 year ago

Sounds like a comment on current technical writing style guides.

Explore Onsen in Japan • 1 year ago

😂

Related Videos

✨ New AI Interfaces powered by Interpretability. I'm excited to share LatentLit, the result of my applied AI research fellowship with Goodfire. Mechanistic interpretability isn’t just important for AI safety, it also gives us new ways to steer and interact with LLMs.

Thariq

67,925 views • 1 year ago

Neel Nanda is leading a Google DeepMind research team at 26. He and I discuss: • How that happened • “If your safety work doesn't advance capabilities, it's probably bad safety work” • Should people work at the safest or most reckless AI company? • An AI PhD – with these timelines?! • How to best operate in a big frontier AI company • Neel's distinctive uses of LLMs and which cold emails he answers • A common reasoning error in AI alignment • Why he (Neel Nanda) refuses to share his p(doom) This is part 2 of our conversation, part 1 was a comprehensive update on his research area: mechanistic interpretability, which I'll link below. Links to this episode of the 80,000 Hours Podcast below — enjoy!

Rob Wiblin

111,865 views • 9 months ago

The first inherently interpretable AI platform is finally here. Welcome to Clarity.

Guide Labs

561,422 views • 8 days ago

Many scientific problems hinge on finding interpretable formulas that fit data, but neural networks are the outright opposite! Check out our recent work that makes neural networks modular and interpretable. If you have interesting datasets at hand, we're happy to collaborate!

Ziming Liu

62,120 views • 3 years ago

Safety-oriented interpretability researchers should be focused on AI systems, not individual model artifacts. A snippet from the NeurIPS CogInterp workshop panel on Sunday:

Christopher Potts

16,337 views • 6 months ago

Sugar is the tech that powers the MetaDEX app layer. It simplifies getting onchain data, allows permissionless access for integrations, and virtually eliminates backend operations—optimizing scaling and uptime while reducing costs. Here’s how it came to be 👇

Dromos

15,439 views • 6 months ago

💗🗣 How does translating the Korean word "jeong" (정) illustrate the challenge of AI alignment? 🤖🎯 Been Kim discusses alignment and interpretability as part of the New Orleans Alignment Workshop hosted by FAR AI.

FAR.AI

2,835,753 views • 1 year ago

4 recordings from San Diego Alignment Workshop! Sam Bowman – Lessons from the 1st Misalignment Safety Case Maja Trebacz – Scalable Oversight: Verifying Code at Scale @neelnanda5 – Pivot to Pragmatic Interpretability Anka Reuel | @ankareuel.bsky.social – How we know what AI can (and can’t) do 👇

FAR.AI

38,485 views • 6 months ago

"Please learn from our mistakes. Don't do exactly the same things that we did, or you'll end up in ten years with having nothing to show for it." — Nicholas Carlini urging AI researchers to avoid the pitfalls of past adversarial ML research at the Vienna Alignment Workshop 2024.

FAR.AI

5,370,506 views • 1 year ago

I got a comprehensive update on 'mech interp' from Neel Nanda at Google DeepMind. Neel helped make reading AI minds into a thriving field of ML. But he has had a change of heart: it's not the silver bullet he once hoped and many others still believe it to be. Still, they've had some big successes understanding what AIs are really thinking, and Neel thinks pairing those tools with other approaches to get 'defence in depth' remains our best and only option when deploying superhuman AI models. Neel and I tried to cover most of what you'd want to know to be up to date on this whole topic: 9:50 How Neel changed his mind on mech interp 16:00 The biggest successes so far 20:13 Probes are great 29:30 Why it won't solve all our problems 40:38 Interpretability can't reliably find deceptive AI 53:17 'Self-preservation' isn't always what it seems 1:02:25 Will AIs learn to lie in their chain of thought? 1:17:14 Models can tell when they’re being tested and act differently 1:38:24 Why everyone's excited about sparse autoencoders (SAEs) 1:47:55 Why SAEs aren't so great 2:13:11 Lessons from the mech interp hype 2:27:29 Neel’s new research philosophy 2:39:42 Who should join the mech interp field Enjoy! Links below.

Rob Wiblin

107,607 views • 9 months ago

🚨 The Missouri Highway Patrol just told us that the east side of the interstate is completely shut down and that we will be here, unable to move, for the rest of the night. Several folks are low on gas or without it. They told us to conserve fuel and that FEMA will be messaging us.

Justice Horn

33,514 views • 2 years ago

"How does a model call a lawyer and get advice about what's allowable and not allowable?" – Gillian Hadfield emphasizes the need to integrate AI into institutional structures at the Vienna Alignment Workshop.

FAR.AI

319,887 views • 1 year ago

Princess Jane returns in the AI movie Scarlet scourge. I like telling stories, and AI has allowed me to express this creative attribute. I hope the movie industry accepts AI not as something to be feared but as something to be harnessed. The truth is that AI is not going away; it is the new industrial age. Imagine the possibilities that can be accomplished by combining human creativity and artificial intelligence. The official Princess Jane YouTube channel is live; click on the link below to subscribe.

Rufus

34,912 views • 2 years ago

Can we map the mind of an LLM? Our first mechanistic interpretability episode on Training Data features Goodfire founder Eric Ho (and our first cameo from Roelof Botha!). Goodfire is building an independent mech interp lab, led by some heavyweight researchers from the field (e.g. Lee Sharkey, who has led a lot of important work in sparse autoencoders to "unscramble" LLMs and resolve superposition, and Nick, who has been a key pioneer behind auto interpretability). On this episode, Eric gives us a flyover of the technical results so far from this nascent field (universality, superposition), what's ahead in the research (going from circuits to weights, going from understanding to increasingly surgical editing), a preview of the real-world work they're doing already with Arc Institute, and the impact he expects Goodfire and the broader field to have on steering, safety, editing and more.

Sonya Huang 🐥

19,371 views • 11 months ago

Elon Musk: We should encourage the AI to be truthful and honorable. “We need to make sure that the AI is a good AI, a good Grok. And the thing that I think is most important for AI safety, at least my biological neural net tells me the most important thing for AI is to be maximally truth-seeking. You can think of AI as this super-genius child that ultimately will outsmart you, but you can instill the right values and encourage it to be truthful, honorable, you know, good things like the values you want to instill in a child that would ultimately grow up to be incredibly powerful.” xAI Grok 4 presentation, July 9, 2025

ELON CLIPS

20,740 views • 10 months ago

ELON: TRUTH IS THE ONLY REAL SAFETY MECHANISM FOR AI Safety doesn’t come from guardrails stacked on top of bad logic. It comes from forcing the system to care about what’s actually true. “My number one belief for the safety of AI is to be maximally truth-seeking. Don’t make AI believe things that are false. If you tell the AI that Axiom A and Axiom B are both true, but they can’t both be true, then that’s a problem. It has to recognize that and behave accordingly.” Source: DOW

Mario Nawfal

35,546 views • 5 months ago

Elon Musk: A curious AI will want to preserve human civilization. “The best thing I can come up with for AI safety is to make it a maximum truth-seeking AI, maximally curious, and have its optimization function be to understand the nature of the universe. If that is its optimization function, I think it will actually want to preserve and extend human civilization, because we're much more interesting than an asteroid with nothing on it. My biological neural net suggests that a maximally curious and truth-seeking AI is the safest AI. We have to be careful with the alignment stuff. We definitely don't want to teach an AI to lie because that is a path to a dystopian future.” Interview with Linda Yaccarino, April 18, 2023

ELON CLIPS

145,300 views • 1 year ago

The new voxel format is finally starting to come together. It uses a brickmap that enables large, sparse objects, unique properties for each voxel (no palette) and really fast scene updates.

Dennis Gustafsson

119,033 views • 1 year ago