Loading video...

Video Failed to Load

Go Home

“The hope is that ... just optimizing something to be sparse—without optimizing it to be interpretable—will stumble across that interpretable decomposition.” — Neel Nanda on sparse autoencoders for mechanistic interpretability and AI safety at the Vienna Alignment Workshop.

1,148,210 views • 1 year ago •via X (Twitter)

10 Comments

FAR.AI's profile picture
FAR.AI1 year ago

Follow us for updates about upcoming content and workshops: and watch the full video at

Vegeta Achanur's profile picture
Vegeta Achanur1 year ago

I don't understand a thing he said

Cloude's profile picture
Cloude1 year ago

you should try to say even more meaningless things if you want to succeed.

haywood arno's profile picture
haywood arno1 year ago

Faire ça comme des filtres sur Instagram : on n'y voit pas vraiment ce qui se passe, mais ça donne un résultat plus "clean". On imagine qu'avec des outils de décodage plus précis, on pourrait comprendre comment ces filtres fonctionnent vraiment ?

रमता जोगी ☜⁠ ⁠(⁠↼⁠_⁠↼⁠)'s profile picture
रमता जोगी ☜⁠ ⁠(⁠↼⁠_⁠↼⁠)1 year ago

Arijit Singh ?

AJAY's profile picture
AJAY1 year ago

Simply saying , if we focus on making having fewer elements rather than explicitly trying to make it understandable, it might accidentally end up being easy to understand.

MrMartin's profile picture
MrMartin1 year ago

hope is not science

Fragmented Reality's profile picture
Fragmented Reality1 year ago

Problematic Aspects: No Guarantee of Interpretability: The statement suggests that sparsity automatically leads to interpretability, which is not necessarily true. Sparsity only means that many parameters or components are zero, but it doesn't ensure that the remaining components are meaningful or understandable to humans. Interpretability is Subjective: Interpretability often depends on context and is subjective. What is interpretable to an expert may not be interpretable to a layperson. Sparsity alone cannot account for this subjectivity. Optimization Goal: If the goal is interpretability, it should be explicitly included in the optimization objective. Sparsity can be a tool to achieve this goal, but it is not a substitute for directly optimizing for interpretability. Conclusion: The statement is somewhat meaningful in highlighting a potential connection between sparsity and interpretability, but it is also problematic because it implies that sparsity alone is sufficient to ensure interpretability. In practice, explicitly optimizing for interpretability is often necessary, rather than relying solely on sparsity as a proxy. Greetings DeepSeek

Jeffrey Rubinoff's profile picture
Jeffrey Rubinoff1 year ago

Sounds like a comment on current technical writing style guides.

Explore Onsen in Japan's profile picture
Explore Onsen in Japan1 year ago

😂

Related Videos