Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms
Interpreting how transformers simulate agents performing RL tasks
🧠 Starter templates for doing interpretability research
Full code for the sparse probing paper.
Sparse and discrete interpretability tool for neural networks
Explain a black-box module in natural language.
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Steering vectors for transformer language models in PyTorch / Hugging Face
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Universal Neurons in GPT2 Language Models
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
Sparse Autoencoder (SAE) research from the OpenMOSS Mechanistic Interpretability Team. Open-sourced and regularly updated.
🦠 DeepDecipher: An open source API to MLP neurons
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Identifying Circuit behind Pronoun Prediction in GPT-2 Small
graphpatch is a library for activation patching on PyTorch neural network models.
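Several of the repositories above (graphpatch, the AtP* implementation, CausalGym) center on activation patching: caching an intermediate activation from a "clean" run and substituting it into a "corrupted" run to localize which components carry a behavior. A minimal sketch of the idea using plain PyTorch forward hooks — the toy model, layer choice, and variable names here are illustrative assumptions, not the API of any library listed above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real network; any nn.Module with named submodules works.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
target_layer = model[0]  # hypothetical choice of layer to patch

clean_x = torch.randn(1, 4)
corrupt_x = torch.randn(1, 4)

# 1. Run the clean input and cache the target layer's output.
cache = {}
def save_hook(module, inputs, output):
    cache["clean"] = output.detach()

handle = target_layer.register_forward_hook(save_hook)
clean_out = model(clean_x)
handle.remove()

# 2. Re-run the corrupted input, overwriting the layer's output
#    with the cached clean activation (returning a tensor from a
#    forward hook replaces the module's output).
def patch_hook(module, inputs, output):
    return cache["clean"]

handle = target_layer.register_forward_hook(patch_hook)
patched_out = model(corrupt_x)
handle.remove()

corrupt_out = model(corrupt_x)

# Everything downstream of the patched layer now sees the clean
# activation, so the patched run matches the clean run exactly.
print(torch.allclose(patched_out, clean_out))
```

In real interpretability workflows the same pattern is applied per attention head or per residual-stream position, and the patched/clean/corrupted outputs are compared with a metric such as logit difference rather than exact equality.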