Goodfire

@GoodfireAI • 25,872 subscribers

Using interpretability to understand, learn from, and design AI.

Shorts

The same calculator handles a wide range of tasks, including:
- arithmetic (“7+9”)
- weekdays (“nine days after Friday”)
- months (“six months after August”)
Llama built this mechanism from scratch in training, and uses it with striking elegance and flexibility. (4/6)

30,261 views

Videos

> replicate J-space on GLM 5.2
> train a reward model and run RL to reduce hallucinations
> show me how this model makes cancer predictions
Using our platform Silico is like having a team of AI researchers ready to run experiments like these. Private beta is open now. 🧵 (1/6)

266,794 views • 17 days ago

Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)

183,872 views • 1 month ago

Stories have shapes: a comedy rises toward joy; a tragedy falls into loss. Inside an LLM, that’s visible more literally: as an LLM reads a story, its internal activations trace a wandering path that reflects the model’s sense of what kind of story it is reading. (1/5)

104,111 views • 1 month ago

The most popular way to interpret AI is missing the bigger picture. Models think in curved shapes. But sparse autoencoders (SAEs) work with straight lines. Can they still capture models’ curved neural geometry? Yes, but not how you might think! (1/7)

176,085 views • 2 months ago

We raised a $150M Series B at a $1.25B valuation to fundamentally change the field of AI. Scaling is powerful, but we can't intentionally design what we don't understand.

215,720 views • 5 months ago

Introducing Silico: the platform for building AI models with the precision of written software. Silico lets researchers and engineers see inside their models, debug failures, and intentionally design them from the ground up. Early access is open now. 🧵(1/10)

111,751 views • 3 months ago

Today, we're announcing our $50M Series A and sharing a preview of Ember - a universal neural programming platform that gives direct, programmable access to any AI model's internal thoughts.

355,213 views • 1 year ago

We used interpretability to scale RL against open-ended tasks, cutting Gemma 12B’s hallucination rate in half by teaching it to self-correct in tandem with our probing harness.

75,214 views • 5 months ago

Check out Atticus Geiger's Stanford guest lecture - on causal approaches to interpretability - for an overview of one of our areas of research!
01:51 - Activation steering (e.g. Golden Gate Claude)
10:23 - Causal mediation analysis (understanding the contribution of an intermediate component)
21:42 - Causal abstraction methods (explaining a complex causal system with a simple one)
54:54 - Lookback mechanisms: a case study in designing counterfactuals
This is the first of three guest lectures we'll be posting from Surya Ganguli's course.

36,522 views • 8 months ago

Our last Stanford guest lecture - Ekdeep Singh Lubana on what counts as an explanation & a neuro-inspired "model systems approach" to interp.
Plus, how in-context learning and many-shot jailbreaking are explained by LLM representations changing in-context (as a case study for that approach).
00:33 - What counts as an explanation?
04:47 - Levels of analysis & standard interpretability approaches
18:19 - The "model systems" approach to interp
[Case study on in-context learning]
23:36 - How LLM representations change in-context
44:10 - Modeling ICL with rational analysis
1:10:54 - Conclusion & questions
Thanks again to Surya Ganguli for having us in his class!

31,410 views • 7 months ago

Our infra lets us steer trillion-parameter frontier models in real time:
- live, mid-CoT edits to internal activations
- directly altering how the model reasons (not just outputs)
- stackable edits
- no added latency
We can make models more Gen Z, more concise, etc.

29,603 views • 7 months ago

Today, we’re releasing our research preview to let you look inside your AI. We've created a desktop interface that helps you understand and control Llama 3's behavior. You can 1) see Llama 3's internal features (the internal building blocks of its responses) and 2) precisely adjust these features to create new Llama variants. Try it out and share your findings with #GoodfireAI

67,161 views • 1 year ago

Another Stanford interpretability guest lecture: Jack Merullo on "computational motifs" - the algorithmic primitives of transformers that show up again and again across circuits/tasks/models, e.g. induction heads, binding vectors, helical representation comparisons, copy suppression heads, etc.
00:53 - Intro: defining "computational motifs"
05:48 - Induction heads (a classic motif)
08:31 - Motifs in the Indirect Object Identification circuit
44:33 - More examples
51:15 - Challenges and open problems
1:03:12 - Conclusion & questions

16,634 views • 7 months ago

Introducing a first look at Goodfire's research preview, launching soon. Our preview exposes Llama's inner workings, allowing direct modification of its internal concepts (or "features"). In this demo, we steer Llama to claim consciousness by adjusting its features.

26,171 views • 1 year ago
