
Interpretability starter

Interpretability research is an exciting and growing field of machine learning. If we can understand what happens inside neural networks across diverse domains, we can see why a network gives a specific output, detect deception, understand its choices, and change how it works.

This list of resources was made for the Interpretability Hackathon (link) and contains an array of useful starter templates, tools to investigate model activations, and a number of introductory resources. Check out aisi.ai for some ideas for projects within ML & AI safety.

Inspiration

We have many ideas available for inspiration on the aisi.ai Interpretability Hackathon ideas list. A lot of interpretability research is available on distill.pub, Transformer Circuits, and Anthropic's research page.

Introductions to mechanistic interpretability

See also the tools available on interpretability:

Digestible research

Concepts

A feature is a scalar function of the input. In this framing, neural network features are directions in activation space, and often simply individual neurons. Such features are typically meaningful and can be rigorously studied: a meaningful feature is one that genuinely responds to an articulable property of the input, such as the presence of a curve or a floppy ear.
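A minimal sketch of the "features as directions" view: if a feature corresponds to a direction in activation space, you can read it off by projecting activations onto that direction. Everything below (the dimensionality, the "curve detector" direction, the activations) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16                                        # width of a hypothetical activation space
curve_direction = rng.normal(size=d_model)
curve_direction /= np.linalg.norm(curve_direction)  # unit-norm "curve detector" direction

# Activations for three hypothetical inputs; only the second contains the feature.
activations = rng.normal(size=(3, d_model))
activations[1] += 3.0 * curve_direction             # inject the feature into input 1

# Reading off a feature = projecting the activation onto its direction.
feature_values = activations @ curve_direction
print(feature_values)                               # the second value stands out
```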

Superposition [Elhage et al., 2022]

Superposition occurs when a layer represents more features than it has neurons (dimensions), so features must share directions. This is nearly always the case for e.g. large language models (LLMs), and it leads to polysemantic neurons.
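A toy illustration of superposition (a hand-rolled sketch, not the setup from Elhage et al.): embed more sparse features than there are neurons by assigning each feature a random direction, and note that a single active feature can still be read back, at the cost of small interference on every other read-off.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 10, 5                    # more features than neurons

# Each feature gets a random unit-norm direction in the 5-dimensional neuron space.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only feature 3 is active.
features = np.zeros(n_features)
features[3] = 1.0

activation = features @ directions               # the 5 "neuron" activations

# Read every feature back by projection: feature 3 dominates, but the other
# read-offs are non-zero -- that is the interference caused by superposition.
readout = directions @ activation
print(np.round(readout, 2))
```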

Polysemanticity [Elhage et al., 2022]

Polysemanticity is the phenomenon where a single neuron corresponds to multiple features, i.e. it encodes several concepts / semantic features at once. This makes the neuron harder to interpret. Sparse features1 are more likely to be encoded in a polysemantic neuron because they rarely co-occur, so the interference between them is small.

Privileged basis

A privileged basis is one where the standard basis directions (the individual neurons) are meaningful and human-understandable. In the context of model neurons, this means a neuron's activation can represent a meaningful concept on its own. If activations do not live in a privileged basis, individual directions are significantly harder to interpret, and element-wise transformations (such as ReLU) cause interference between features rather than making them cleaner.

Models of MLP neuron activation [Foote et al., 2023; Bills et al., 2023]

MLP neuron activation models attempt to explain in which cases a neuron fires. They are based on a few principles: 1) MLP neurons tend to activate on specific token sequences, 2) we can build a simplified model of a neuron's activation that does not require running the neural network, and 3) that model can be validated against the real activations.

Foote et al. [2023] build a semantic graph over the token sequences a neuron activates on, while Bills et al. [2023] use GPT-4 to write natural-language explanations and then use those explanations to predict the neuron's activations.
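A rough sketch of the validation step both approaches share: build a cheap surrogate model of a neuron's activation (here a hypothetical keyword rule standing in for a learned graph or a GPT-4-written explanation) and score it against the recorded activations, e.g. via correlation.

```python
import numpy as np

# Hypothetical recorded data: tokens and the neuron's true activation on each.
tokens      = ["the", "cat", "sat", "on", "the", "mat", "cat", "nap"]
activations = np.array([0.0, 2.1, 0.1, 0.0, 0.0, 0.3, 1.9, 0.2])

def explanation_model(token: str) -> float:
    """Simplified activation model: "the neuron fires on cat-related tokens"."""
    return 2.0 if token in {"cat", "kitten", "feline"} else 0.0

predicted = np.array([explanation_model(t) for t in tokens])

# Validate the explanation against the real activations.
correlation = np.corrcoef(predicted, activations)[0, 1]
print(f"explanation/activation correlation: {correlation:.2f}")
```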

Identifying meaningful circuits of components in Transformers

Causal tracing [Meng et al., 2022]

Memory editing of language models was introduced together with causal tracing in Meng et al. [2022] (ROME). Causal tracing corrupts the input (e.g. by adding noise to the subject token embeddings) and then restores individual hidden states from a clean run to locate which layers and token positions carry a factual association; ROME then edits the weights of the located MLP module to change that fact.
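A heavily simplified causal-tracing sketch for a Hugging Face GPT-2 model. The prompt, the assumed subject-token positions, the layer, and the noise scale are all illustrative choices rather than the paper's settings: we corrupt the subject embeddings with noise, then restore one layer's hidden state at one position from the clean run and check how much of the correct prediction comes back.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")
subject_positions = [1, 2, 3]   # token positions of "Eiffel Tower" (assumed)
layer, restore_pos = 6, 3       # which hidden state to restore (arbitrary)

def target_prob(logits):
    paris_id = tok.encode(" Paris")[0]                 # probability of " Paris" next
    return torch.softmax(logits[0, -1], dim=-1)[paris_id].item()

# 1) Clean run: record all hidden states.
with torch.no_grad():
    clean = model(**inputs, output_hidden_states=True)
clean_hidden = clean.hidden_states                     # embeddings + one entry per block

# 2) Corruption hook: add noise to the subject token embeddings.
def corrupt_embeddings(module, inp, out):
    out = out.clone()
    out[0, subject_positions] += 0.5 * torch.randn_like(out[0, subject_positions])
    return out

# 3) Restoration hook: overwrite one position of one block's output with the clean value.
def restore_hidden(module, inp, out):
    hidden = out[0].clone()
    hidden[0, restore_pos] = clean_hidden[layer + 1][0, restore_pos]
    return (hidden,) + out[1:]

h1 = model.transformer.wte.register_forward_hook(corrupt_embeddings)
torch.manual_seed(0)                                   # same noise in both corrupted runs
with torch.no_grad():
    corrupted = model(**inputs)

h2 = model.transformer.h[layer].register_forward_hook(restore_hidden)
torch.manual_seed(0)
with torch.no_grad():
    patched = model(**inputs)
h1.remove()
h2.remove()

print("clean    :", target_prob(clean.logits))
print("corrupted:", target_prob(corrupted.logits))
print("patched  :", target_prob(patched.logits))
```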

Machine unlearning

Concept erasure

Ablation

Ablation as model editing [Li et al., 2023]

Using activation ablation, you can remove causal connections between parts of a model (e.g. attention heads) to modify its behavior. Li et al. [2023] reduce the toxicity of a model from 45% to 33% by training a binary mask over the edges of the model's computational graph so that the masked model performs poorly on their "negative examples" dataset while maintaining performance elsewhere.
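This is not Li et al.'s masking setup, but a minimal sketch of the underlying operation: zero-ablating a single attention head in a Hugging Face GPT-2 model by zeroing that head's slice of the concatenated attention output before the output projection (the layer and head indices are arbitrary).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

layer, head = 9, 6                               # arbitrary choice for the demo
head_dim = model.config.hidden_size // model.config.num_attention_heads

def ablate_head(module, args):
    # c_proj receives the concatenated head outputs [batch, seq, hidden];
    # zero out the slice belonging to the chosen head.
    hidden = args[0].clone()
    hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0
    return (hidden,)

hook = model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(ablate_head)

inputs = tok("Interpretability research is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
hook.remove()

print(tok.decode(logits[0, -1].argmax()))        # next-token prediction with the head ablated
```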

Adding activation vectors to modulate behavior [Turner et al., 2023]
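Turner et al. [2023] steer a model by adding an activation ("steering") vector, roughly the difference between the model's activations on two contrasting prompts, into the residual stream at inference time. Below is a loose sketch of that idea for a Hugging Face GPT-2 model; the prompts, layer, and scaling coefficient are arbitrary illustrative choices rather than the paper's recipe.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
layer, coeff = 6, 4.0                                    # arbitrary choices

def residual_at(prompt: str) -> torch.Tensor:
    """Hidden state after block `layer` for the last token of `prompt`."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[layer + 1][0, -1]

# Steering vector = difference of activations on two contrasting prompts.
steering = residual_at("Love") - residual_at("Hate")

def add_vector(module, inp, out):
    hidden = out[0] + coeff * steering                   # add to every position
    return (hidden,) + out[1:]

hook = model.transformer.h[layer].register_forward_hook(add_vector)
ids = tok("I think you are", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**ids, max_new_tokens=20, do_sample=False,
                               pad_token_id=tok.eos_token_id)
hook.remove()

print(tok.decode(generated[0]))
```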

Automated circuit detection [Conmy et al., 2023]

Linear probes

Sparse probing [Gurnee et al., 2023]

Sparse probing is essentially linear probing with a constraint on the number of neurons the probe may use. This mitigates the problem that the probe itself does computation, even if only linear computation. Neel Nanda's critique of linear probes is threefold: 1) you decide in advance which feature to look for, rather than discovering features from a model-first perspective; 2) the probe may be doing the computation itself, since we force it to fit the data, so the model might not actually represent the feature; and 3) probing is correlational rather than causal. Sparse probing still suffers from (1) and (3), but it is less susceptible to spurious correlations, and it can pinpoint individual neurons, which works especially well when the neuron basis is privileged. Useful for first explorations.
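Not Gurnee et al.'s exact method, but one simple way to approximate a sparse probe with scikit-learn on hypothetical cached activations: rank neurons with an L1-regularized probe, then refit an ordinary probe on only the top-k neurons.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical cached MLP activations (2000 examples x 512 neurons) and a binary
# feature label (e.g. "the input is French"); here neuron 42 carries the signal.
X = rng.normal(size=(2000, 512))
y = (X[:, 42] + 0.3 * rng.normal(size=2000) > 0).astype(int)

# 1) L1-regularized probe to rank neurons by importance.
l1_probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
top_k = np.argsort(-np.abs(l1_probe.coef_[0]))[:5]
print("selected neurons:", top_k)

# 2) Refit a plain probe restricted to those k neurons.
sparse_probe = LogisticRegression().fit(X[:, top_k], y)
print("accuracy with 5 neurons:", sparse_probe.score(X[:, top_k], y))
```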

Starter projects

🙋‍♀️ Simple templates & tools

The Activation Atlas article has many figures, each with an associated Google Colab: click on "Try in a notebook". An example is this notebook, which shows a simple activation atlas.

Additionally, they have this tool to explore which sorts of images each neuron activates most strongly to.

BertViz is an interactive tool for visualizing attention in Transformer language models such as BERT, GPT2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Huggingface models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism.

BertViz example image
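A minimal BertViz usage sketch, intended for a Jupyter or Colab notebook (the model name and sentence are arbitrary): load a Hugging Face model with attention outputs enabled and pass the attentions to head_view.

```python
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

model_name = "bert-base-uncased"                 # any BERT-style model should work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)            # renders the interactive view inline
```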

A library for mechanistic interpretability called EasyTransformer (still in beta and has bugs, but it's functional enough to be useful!): https://github.com/neelnanda-io/Easy-Transformer/

A demo notebook of how to use Easy Transformer to explore a mysterious phenomenon, looking at how language models know to answer "John and Mary went to the shops, then John gave a drink to" with Mary rather than John: https://colab.research.google.com/drive/1mL4KlTG7Y8DmmyIlE26VjZ0mofdCYVW6
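A rough sketch of that kind of experiment: compare the logits the model assigns to " Mary" versus " John" after the prompt. The method names below (from_pretrained, run_with_cache) follow recent versions of EasyTransformer / its successor TransformerLens and may differ in older releases.

```python
from easy_transformer import EasyTransformer     # pip install easy-transformer

model = EasyTransformer.from_pretrained("gpt2")

prompt = "John and Mary went to the shops, then John gave a drink to"
logits, cache = model.run_with_cache(prompt)     # cache holds the intermediate activations

mary = model.tokenizer.encode(" Mary")[0]
john = model.tokenizer.encode(" John")[0]

last_logits = logits[0, -1]
print("logit( Mary) - logit( John) =", (last_logits[mary] - last_logits[john]).item())
```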

This repository can be used to transform a feed-forward (linear-layer) neural network into a graph in which each neuron is a node, each directed edge carries the corresponding weight, and biases are stored on the nodes.

You can expand this project by visualizing the graph of activations for specific inputs (changing the conversion from weights to activations), or by adapting it to convolutional neural networks. Check out the code below.

File              Description
train.py          Creates model.pt, a linear MNIST classifier with one 500-unit hidden layer.
to_graph.py       Generates a graph from model.pt.
vertices.csv      Each neuron in the MNIST classifier with its bias and layer.
edges.csv         Each connection in the network: from_id, to_id, weight.
network_eda.Rmd   The R script for initial EDA and visualization of the network.
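A minimal sketch of what a to_graph.py-style conversion could look like (this is not the repository's actual code): walk the Linear layers of a small PyTorch MLP and write out vertices (with biases and layer indices) and edges (with weights).

```python
import csv
import torch.nn as nn

# Hypothetical stand-in for the trained MNIST classifier stored in model.pt.
model = nn.Sequential(nn.Linear(784, 500), nn.ReLU(), nn.Linear(500, 10))
linear_layers = [m for m in model if isinstance(m, nn.Linear)]

vertices, edges = [], []

# Input neurons form layer 0 and have no bias.
vertices += [(i, 0.0, 0) for i in range(linear_layers[0].in_features)]
offset = linear_layers[0].in_features

for layer_idx, layer in enumerate(linear_layers, start=1):
    prev_offset = offset - layer.in_features     # ids of the previous layer's neurons
    for j in range(layer.out_features):
        vertices.append((offset + j, layer.bias[j].item(), layer_idx))
        for i in range(layer.in_features):
            edges.append((prev_offset + i, offset + j, layer.weight[j, i].item()))
    offset += layer.out_features

with open("vertices.csv", "w", newline="") as f:
    csv.writer(f).writerows([("id", "bias", "layer"), *vertices])
with open("edges.csv", "w", newline="") as f:
    csv.writer(f).writerows([("from_id", "to_id", "weight"), *edges])
```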

Reviewing explainability tools

There are a few tools that use interpretability techniques to create understandable explanations of why a model gives the output it does. This notebook provides a small intro to the most relevant libraries:

  • ELI5: ELI5 is a Python package that helps debug machine learning classifiers and explain their predictions. It implements several analysis frameworks and works with many different ML libraries, making it one of the most complete explainability tools (see the usage sketch after this list).

Explanations of output

Image explanations of output

  • Inseq: Inseq is a Python library to perform feature attribution of decoder-only and encoder-decoder models from the Hugging Face Transformers library. It supports multiple gradient, attention and perturbation-based attribution methods, with visualizations in Jupyter and console. See the demo paper for more detail.

Inseq console visualizations

  • LIME: Local Interpretable Model-agnostic Explanations. The TextExplainer library does a good job of using LIME on language models. Check out Christoph Molnar's introduction here.
  • SHAP: SHapley Additive exPlanations
  • MLXTEND: Machine Learning Extensions
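As a quick taste of the ELI5 workflow mentioned above, here is a minimal sketch on a hypothetical toy dataset (in a notebook, the show_* calls render as HTML tables):

```python
import eli5
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy sentiment data, purely for illustration.
texts = ["great movie", "terrible plot", "loved it",
         "awful acting", "great acting", "terrible movie"]
labels = [1, 0, 1, 0, 1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# Global explanation: which words carry the most weight overall.
eli5.show_weights(clf, vec=vec, top=10)

# Local explanation: why this particular input gets its prediction.
eli5.show_prediction(clf, "great plot, terrible acting", vec=vec)
```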

Check out this tutorial on using the iml package in R. The package provides a good interface for working with LIME, feature importance, ICE, partial dependence plots, Shapley values, and more.

👩‍🔬 Advanced templates and tools

Redwood Research has created a wonderful tool that can be used to do research into how language models understand text. The "How to use" document and their instruction videos are very good introductions and we recommend reading/watching them since the interface can be a bit daunting otherwise.

Watch this video as an intro:

Understanding interp-tools by Redwood Research

Footnotes

  1. Sparse features are infrequent in the data.