
Goodfire
@GoodfireAI • 24,188 subscribers
Using interpretability to understand, learn from, and design AI.

Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)
Goodfire • 172,955 views • 5 days ago

Check out Atticus Geiger's Stanford guest lecture - on causal approaches to interpretability - for an overview of one of our areas of research!
01:51 - Activation steering (e.g. Golden Gate Claude)
10:23 - Causal mediation analysis (understanding the contribution of an intermediate component)
21:42 - Causal abstraction methods (explaining a complex causal system with a simple one)
54:54 - Lookback mechanisms: a case study in designing counterfactuals
This is the first of three guest lectures we'll be posting from Surya Ganguli's course.
Goodfire • 36,522 views • 7 months ago

Our last Stanford guest lecture - Ekdeep Singh Lubana on what counts as an explanation & a neuro-inspired "model systems approach" to interp.
Plus, how in-context learning and many-shot jailbreaking are explained by LLM representations changing in-context (as a case study for that approach).
00:33 - What counts as an explanation?
04:47 - Levels of analysis & standard interpretability approaches
18:19 - The "model systems" approach to interp
[Case study on in-context learning]
23:36 - How LLM representations change in-context
44:10 - Modeling ICL with rational analysis
1:10:54 - Conclusion & questions
Thanks again to Surya Ganguli for having us in his class!
Goodfire • 31,410 views • 6 months ago

Our infra lets us steer trillion-parameter frontier models in real time:
- live, mid-CoT edits to internal activations
- directly altering how the model reasons (not just outputs)
- stackable edits
- no added latency
We can make models more Gen Z, more concise, etc.
Goodfire • 29,399 views • 6 months ago

Today, we’re releasing our research preview to let you look inside your AI. We've created a desktop interface that helps you understand and control Llama 3's behavior. You can 1) see Llama 3's internal features (the internal building blocks of its responses) and 2) precisely adjust these features to create new Llama variants. Try it out and share your findings with #GoodfireAI
Goodfire • 67,131 views • 1 year ago