
Interpretability starter

Interpretability research is an exciting and growing field of machine learning. If we can understand what happens inside neural networks across diverse domains, we can see why a network gives a specific output, detect deception, understand its choices, and change how it works.

This list of resources was made for the Interpretability Hackathon (link) and contains an array of useful starter templates, tools to investigate model activations, and a number of introductory resources. Check out aisi.ai for some ideas for projects within ML & AI safety.

Inspiration

We have many ideas available for inspiration on the aisi.ai Interpretability Hackathon ideas list. A lot of interpretability research is available on distill.pub, Transformer Circuits, and Anthropic's research page.

Introductions to mechanistic interpretability

See also the tools available on interpretability:

Digestible research

Concepts

A feature is a scalar function of the input. In this framing, neural network features are directions in activation space, and often simply individual neurons. Such features are typically meaningful and can be rigorously studied: a meaningful feature is one that genuinely responds to an articulable property of the input, such as the presence of a curve or a floppy ear.
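A minimal sketch of the "features as directions" view: if a feature corresponds to a direction in activation space, you can read it off by projecting activations onto that direction. Everything below (the dimensionality, the "curve detector" direction, the activations) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16                                        # width of a hypothetical activation space
curve_direction = rng.normal(size=d_model)
curve_direction /= np.linalg.norm(curve_direction)  # unit-norm "curve detector" direction

# Activations for three hypothetical inputs; only the second contains the feature.
activations = rng.normal(size=(3, d_model))
activations[1] += 3.0 * curve_direction             # inject the feature into input 1

# Reading off a feature = projecting the activation onto its direction.
feature_values = activations @ curve_direction
print(feature_values)                               # the second value stands out
```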

Superposition [Elhage et al., 2022]

Superposition occurs when a layer represents more features than it has neurons (dimensions), so features must share directions. This is nearly always the case for e.g. large language models (LLMs), and it leads to polysemantic neurons.
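A toy illustration of superposition (a hand-rolled sketch, not the setup from Elhage et al.): embed more sparse features than there are neurons by assigning each feature a random direction, and note that a single active feature can still be read back, at the cost of small interference on every other read-off.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 10, 5                    # more features than neurons

# Each feature gets a random unit-norm direction in the 5-dimensional neuron space.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only feature 3 is active.
features = np.zeros(n_features)
features[3] = 1.0

activation = features @ directions               # the 5 "neuron" activations

# Read every feature back by projection: feature 3 dominates, but the other
# read-offs are non-zero -- that is the interference caused by superposition.
readout = directions @ activation
print(np.round(readout, 2))
```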

Polysemanticity [Elhage et al., 2022]

Polysemanticity is the phenomenon where a single neuron corresponds to multiple features, i.e. it encodes several concepts / semantic features at once. This makes the neuron harder to interpret. Sparse features1 are more likely to be encoded in a polysemantic neuron because they rarely co-occur, so the interference between them is small.

Privileged basis

A privileged basis is one where the standard basis directions (the individual neurons) are meaningful and human-understandable. In the context of model neurons, this means a neuron's activation can represent a meaningful concept on its own. If activations do not live in a privileged basis, individual directions are significantly harder to interpret, and element-wise transformations (such as ReLU) cause interference between features rather than making them cleaner.

Models of MLP neuron activation [Foote et al., 2023; Bills et al., 2023]

MLP neuron activation models attempt to explain in which cases a neuron fires. They are based on a few principles: 1) MLP neurons tend to activate on specific token sequences, 2) we can build a simplified model of a neuron's activation that does not require running the neural network, and 3) that model can be validated against the real activations.

Foote et al. [2023] build a semantic graph over the token sequences a neuron activates on, while Bills et al. [2023] use GPT-4 to write natural-language explanations and then use those explanations to predict the neuron's activations.
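A rough sketch of the validation step both approaches share: build a cheap surrogate model of a neuron's activation (here a hypothetical keyword rule standing in for a learned graph or a GPT-4-written explanation) and score it against the recorded activations, e.g. via correlation.

```python
import numpy as np

# Hypothetical recorded data: tokens and the neuron's true activation on each.
tokens      = ["the", "cat", "sat", "on", "the", "mat", "cat", "nap"]
activations = np.array([0.0, 2.1, 0.1, 0.0, 0.0, 0.3, 1.9, 0.2])

def explanation_model(token: str) -> float:
    """Simplified activation model: "the neuron fires on cat-related tokens"."""
    return 2.0 if token in {"cat", "kitten", "feline"} else 0.0

predicted = np.array([explanation_model(t) for t in tokens])

# Validate the explanation against the real activations.
correlation = np.corrcoef(predicted, activations)[0, 1]
print(f"explanation/activation correlation: {correlation:.2f}")
```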

Identifying meaningful circuits of components in Transformers

Causal tracing [Meng et al., 2022]

Memory editing of language models was introduced together with causal tracing in Meng et al. [2022] (ROME). Causal tracing corrupts the input (e.g. by adding noise to the subject token embeddings) and then restores individual hidden states from a clean run to locate which layers and token positions carry a factual association; ROME then edits the weights of the located MLP module to change that fact.
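A heavily simplified causal-tracing sketch for a Hugging Face GPT-2 model. The prompt, the assumed subject-token positions, the layer, and the noise scale are all illustrative choices rather than the paper's settings: we corrupt the subject embeddings with noise, then restore one layer's hidden state at one position from the clean run and check how much of the correct prediction comes back.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")
subject_positions = [1, 2, 3]   # token positions of "Eiffel Tower" (assumed)
layer, restore_pos = 6, 3       # which hidden state to restore (arbitrary)

def target_prob(logits):
    paris_id = tok.encode(" Paris")[0]                 # probability of " Paris" next
    return torch.softmax(logits[0, -1], dim=-1)[paris_id].item()

# 1) Clean run: record all hidden states.
with torch.no_grad():
    clean = model(**inputs, output_hidden_states=True)
clean_hidden = clean.hidden_states                     # embeddings + one entry per block

# 2) Corruption hook: add noise to the subject token embeddings.
def corrupt_embeddings(module, inp, out):
    out = out.clone()
    out[0, subject_positions] += 0.5 * torch.randn_like(out[0, subject_positions])
    return out

# 3) Restoration hook: overwrite one position of one block's output with the clean value.
def restore_hidden(module, inp, out):
    hidden = out[0].clone()
    hidden[0, restore_pos] = clean_hidden[layer + 1][0, restore_pos]
    return (hidden,) + out[1:]

h1 = model.transformer.wte.register_forward_hook(corrupt_embeddings)
torch.manual_seed(0)                                   # same noise in both corrupted runs
with torch.no_grad():
    corrupted = model(**inputs)

h2 = model.transformer.h[layer].register_forward_hook(restore_hidden)
torch.manual_seed(0)
with torch.no_grad():
    patched = model(**inputs)
h1.remove()
h2.remove()

print("clean    :", target_prob(clean.logits))
print("corrupted:", target_prob(corrupted.logits))
print("patched  :", target_prob(patched.logits))
```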

Machine unlearning

Concept erasure

Ablation

Ablation as model editing [Li et al., 2023]

Using activation ablation, you can remove causal connections between parts of a model (e.g. attention heads) to modify its behavior. Li et al. [2023] reduce the toxicity of a model from 45% to 33% by training a binary mask over the edges of the model's computational graph so that the masked model performs poorly on their "negative examples" dataset while maintaining performance elsewhere.
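This is not Li et al.'s masking setup, but a minimal sketch of the underlying operation: zero-ablating a single attention head in a Hugging Face GPT-2 model by zeroing that head's slice of the concatenated attention output before the output projection (the layer and head indices are arbitrary).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

layer, head = 9, 6                               # arbitrary choice for the demo
head_dim = model.config.hidden_size // model.config.num_attention_heads

def ablate_head(module, args):
    # c_proj receives the concatenated head outputs [batch, seq, hidden];
    # zero out the slice belonging to the chosen head.
    hidden = args[0].clone()
    hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0
    return (hidden,)

hook = model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(ablate_head)

inputs = tok("Interpretability research is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
hook.remove()

print(tok.decode(logits[0, -1].argmax()))        # next-token prediction with the head ablated
```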

Adding activation vectors to modulate behavior [Turner et al., 2023]
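Turner et al. [2023] steer a model by adding an activation ("steering") vector, roughly the difference between the model's activations on two contrasting prompts, into the residual stream at inference time. Below is a loose sketch of that idea for a Hugging Face GPT-2 model; the prompts, layer, and scaling coefficient are arbitrary illustrative choices rather than the paper's recipe.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
layer, coeff = 6, 4.0                                    # arbitrary choices

def residual_at(prompt: str) -> torch.Tensor:
    """Hidden state after block `layer` for the last token of `prompt`."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[layer + 1][0, -1]

# Steering vector = difference of activations on two contrasting prompts.
steering = residual_at("Love") - residual_at("Hate")

def add_vector(module, inp, out):
    hidden = out[0] + coeff * steering                   # add to every position
    return (hidden,) + out[1:]

hook = model.transformer.h[layer].register_forward_hook(add_vector)
ids = tok("I think you are", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**ids, max_new_tokens=20, do_sample=False,
                               pad_token_id=tok.eos_token_id)
hook.remove()

print(tok.decode(generated[0]))
```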

Automated circuit detection [Conmy et al., 2023]

Linear probes

Sparse probing [Gurnee et al., 2023]

Sparse probing is essentially linear probing with a constraint on the number of neurons the probe may use. This mitigates the problem that the probe itself does computation, even if only linear computation. Neel Nanda's critique of linear probes is threefold: 1) you decide in advance which feature to look for, rather than discovering features from a model-first perspective; 2) the probe may be doing the computation itself, since we force it to fit the data, so the model might not actually represent the feature; and 3) probing is correlational rather than causal. Sparse probing still suffers from (1) and (3), but it is less susceptible to spurious correlations, and it can pinpoint individual neurons, which works especially well when the neuron basis is privileged. Useful for first explorations.
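Not Gurnee et al.'s exact method, but one simple way to approximate a sparse probe with scikit-learn on hypothetical cached activations: rank neurons with an L1-regularized probe, then refit an ordinary probe on only the top-k neurons.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical cached MLP activations (2000 examples x 512 neurons) and a binary
# feature label (e.g. "the input is French"); here neuron 42 carries the signal.
X = rng.normal(size=(2000, 512))
y = (X[:, 42] + 0.3 * rng.normal(size=2000) > 0).astype(int)

# 1) L1-regularized probe to rank neurons by importance.
l1_probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
top_k = np.argsort(-np.abs(l1_probe.coef_[0]))[:5]
print("selected neurons:", top_k)

# 2) Refit a plain probe restricted to those k neurons.
sparse_probe = LogisticRegression().fit(X[:, top_k], y)
print("accuracy with 5 neurons:", sparse_probe.score(X[:, top_k], y))
```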

Starter projects

🙋‍♀️ Simple templates & tools

The Activation Atlas article has many figures, each with an associated Google Colab: click on "Try in a notebook". An example is this notebook, which shows a simple activation atlas.

Additionally, they have this tool to explore which sorts of images each neuron activates most strongly to.

BertViz is an interactive tool for visualizing attention in Transformer language models such as BERT, GPT2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Huggingface models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism.

BertViz example image
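A minimal BertViz usage sketch, intended for a Jupyter or Colab notebook (the model name and sentence are arbitrary): load a Hugging Face model with attention outputs enabled and pass the attentions to head_view.

```python
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

model_name = "bert-base-uncased"                 # any BERT-style model should work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)            # renders the interactive view inline
```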

A library for mechanistic interpretability called EasyTransformer (still in beta and has bugs, but it's functional enough to be useful!): https://github.com/neelnanda-io/Easy-Transformer/

A demo notebook of how to use Easy Transformer to explore a mysterious phenomenon, looking at how language models know to answer "John and Mary went to the shops, then John gave a drink to" with Mary rather than John: https://colab.research.google.com/drive/1mL4KlTG7Y8DmmyIlE26VjZ0mofdCYVW6
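A rough sketch of that kind of experiment: compare the logits the model assigns to " Mary" versus " John" after the prompt. The method names below (from_pretrained, run_with_cache) follow recent versions of EasyTransformer / its successor TransformerLens and may differ in older releases.

```python
from easy_transformer import EasyTransformer     # pip install easy-transformer

model = EasyTransformer.from_pretrained("gpt2")

prompt = "John and Mary went to the shops, then John gave a drink to"
logits, cache = model.run_with_cache(prompt)     # cache holds the intermediate activations

mary = model.tokenizer.encode(" Mary")[0]
john = model.tokenizer.encode(" John")[0]

last_logits = logits[0, -1]
print("logit( Mary) - logit( John) =", (last_logits[mary] - last_logits[john]).item())
```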

This repository can be used to transform a feed-forward (linear-layer) neural network into a graph in which each neuron is a node, each directed edge carries the corresponding weight, and biases are stored on the nodes.

You can expand this project by visualizing the graph of activations for specific inputs (changing the conversion from weights to activations), or by adapting it to convolutional neural networks. Check out the code below.

File              Description
train.py          Creates model.pt, a linear MNIST classifier with one 500-unit hidden layer.
to_graph.py       Generates a graph from model.pt.
vertices.csv      Each neuron in the MNIST classifier with its bias and layer.
edges.csv         Each connection in the network: from_id, to_id, weight.
network_eda.Rmd   The R script for initial EDA and visualization of the network.
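A minimal sketch of what a to_graph.py-style conversion could look like (this is not the repository's actual code): walk the Linear layers of a small PyTorch MLP and write out vertices (with biases and layer indices) and edges (with weights).

```python
import csv
import torch.nn as nn

# Hypothetical stand-in for the trained MNIST classifier stored in model.pt.
model = nn.Sequential(nn.Linear(784, 500), nn.ReLU(), nn.Linear(500, 10))
linear_layers = [m for m in model if isinstance(m, nn.Linear)]

vertices, edges = [], []

# Input neurons form layer 0 and have no bias.
vertices += [(i, 0.0, 0) for i in range(linear_layers[0].in_features)]
offset = linear_layers[0].in_features

for layer_idx, layer in enumerate(linear_layers, start=1):
    prev_offset = offset - layer.in_features     # ids of the previous layer's neurons
    for j in range(layer.out_features):
        vertices.append((offset + j, layer.bias[j].item(), layer_idx))
        for i in range(layer.in_features):
            edges.append((prev_offset + i, offset + j, layer.weight[j, i].item()))
    offset += layer.out_features

with open("vertices.csv", "w", newline="") as f:
    csv.writer(f).writerows([("id", "bias", "layer"), *vertices])
with open("edges.csv", "w", newline="") as f:
    csv.writer(f).writerows([("from_id", "to_id", "weight"), *edges])
```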

Reviewing explainability tools

There are a few tools that use interpretability techniques to create understandable explanations of why a model gives the output it does. This notebook provides a small intro to the most relevant libraries:

  • ELI5: ELI5 is a Python package that helps debug machine learning classifiers and explain their predictions. It implements several analysis frameworks and works with many different ML libraries, making it one of the most complete explainability tools (see the usage sketch after this list).

Explanations of output

Image explanations of output

  • Inseq: Inseq is a Python library to perform feature attribution of decoder-only and encoder-decoder models from the Hugging Face Transformers library. It supports multiple gradient, attention and perturbation-based attribution methods, with visualizations in Jupyter and console. See the demo paper for more detail.

Inseq console visualizations

  • LIME: Local Interpretable Model-agnostic Explanations. The TextExplainer library does a good job of using LIME on language models. Check out Christoph Molnar's introduction here.
  • SHAP: SHapley Additive exPlanations
  • MLXTEND: Machine Learning Extensions
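As a quick taste of the ELI5 workflow mentioned above, here is a minimal sketch on a hypothetical toy dataset (in a notebook, the show_* calls render as HTML tables):

```python
import eli5
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy sentiment data, purely for illustration.
texts = ["great movie", "terrible plot", "loved it",
         "awful acting", "great acting", "terrible movie"]
labels = [1, 0, 1, 0, 1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# Global explanation: which words carry the most weight overall.
eli5.show_weights(clf, vec=vec, top=10)

# Local explanation: why this particular input gets its prediction.
eli5.show_prediction(clf, "great plot, terrible acting", vec=vec)
```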

Check out this tutorial on using the iml package in R. The package provides a good interface for working with LIME, feature importance, ICE, partial dependence plots, Shapley values, and more.

👩‍🔬 Advanced templates and tools

Redwood Research has created a wonderful tool that can be used to do research into how language models understand text. The "How to use" document and their instruction videos are very good introductions and we recommend reading/watching them since the interface can be a bit daunting otherwise.

Watch this video as an intro:

Understanding interp-tools by Redwood Research

Footnotes

  1. Sparse features are infrequent in the data.