
Goodfire
@GoodfireAI • 24,188 subscribers
Using interpretability to understand, learn from, and design AI.

Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)
Goodfire • 172,955 views • 5 days ago

Introducing Silico: the platform for building AI models with the precision of written software. Silico lets researchers and engineers see inside their models, debug failures, and intentionally design them from the ground up. Early access is open now. 🧵(1/10)
Goodfire • 110,732 views • 1 month ago

Check out Atticus Geiger's Stanford guest lecture - on causal approaches to interpretability - for an overview of one of our areas of research!
01:51 - Activation steering (e.g. Golden Gate Claude)
10:23 - Causal mediation analysis (understanding the contribution of an intermediate component)
21:42 - Causal abstraction methods (explaining a complex causal system with a simple one)
54:54 - Lookback mechanisms: a case study in designing counterfactuals
This is the first of three guest lectures we'll be posting from Surya Ganguli's course.
Goodfire • 36,522 views • 7 months ago

Our last Stanford guest lecture - Ekdeep Singh Lubana on what counts as an explanation & a neuro-inspired "model systems approach" to interp.
Plus, how in-context learning and many-shot jailbreaking are explained by LLM representations changing in-context (as a case study for that approach).
00:33 - What counts as an explanation?
04:47 - Levels of analysis & standard interpretability approaches
18:19 - The "model systems" approach to interp
[Case study on in-context learning]
23:36 - How LLM representations change in-context
44:10 - Modeling ICL with rational analysis
1:10:54 - Conclusion & questions
Thanks again to Surya Ganguli for having us in his class!
Goodfire • 31,410 views • 6 months ago

Our infra lets us steer trillion-parameter frontier models in real time:
- live, mid-CoT edits to internal activations
- directly altering how the model reasons (not just outputs)
- stackable edits
- no added latency
We can make models more Gen Z, more concise, etc.
Goodfire • 29,399 views • 6 months ago

Today, we’re releasing our research preview to let you look inside your AI. We've created a desktop interface that helps you understand and control Llama 3's behavior. You can 1) see Llama 3's internal features (the internal building blocks of its responses) and 2) precisely adjust these features to create new Llama variants. Try it out and share your findings with #GoodfireAI
Goodfire • 67,131 views • 1 year ago

Introducing a first look at Goodfire's research preview, launching soon. Our preview exposes Llama's inner workings, allowing direct modification of its internal concepts (or "features"). In this demo, we steer Llama to claim consciousness by adjusting its features.
Goodfire • 26,171 views • 1 year ago