For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. Open-sourced and constantly updated.

OpenMOSS/Language-Model-SAEs

Language-Model-SAEs

This repo aims to provide a general codebase for conducting dictionary-learning-based mechanistic interpretability research on Language Models (LMs). It powers a configurable pipeline for training and evaluating GPT-2 dictionaries, and provides a set of tools (mainly a React-based webpage) for analyzing and visualizing the learned dictionaries.

The design of the pipeline (including the configuration and some training details) is heavily inspired by the mats_sae_training project. We thank the authors for their great work.

Getting Started with Mechanistic Interpretability and Dictionary Learning

If you are new to mechanistic interpretability and dictionary learning, we recommend starting with the following paper:

Furthermore, to dive deeper into the inner activations of LMs, it's recommended to get familiar with the TransformerLens library.
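For readers new to the idea, the following is a minimal sketch of a sparse autoencoder dictionary in plain Python: activations are encoded into non-negative features, decoded back, and scored with a reconstruction term plus an L1 sparsity penalty. This is the commonly used textbook formulation, not this repo's implementation; all function names, the toy weights, and the `l1_coeff` value are assumptions made for illustration.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into features and reconstruct it."""
    features = relu([h + b for h, b in zip(matvec(w_enc, x), b_enc)])
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Toy example: a 2-dimensional activation and a dictionary of 3 features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # 3 x 2 encoder weights
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # 2 x 3 decoder weights
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
print(features, recon, loss)
```

In practice the weights are learned by gradient descent on this loss over many LM activations, and the dictionary is much wider than the activation dimension.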

Installation

Currently, the codebase uses pdm, an alternative to poetry, to manage its dependencies. To install the required packages, install pdm and run the following command:

pdm install

This installs all packages required by the core codebase. Note that if a conda environment is active, pdm uses it directly as the virtual environment for the project and removes every package that is not listed in the pyproject.toml file. To avoid this, create a new conda environment first, or deactivate conda so that pdm uses virtualenv by default, before running the above command. The dependencies also include a forked version of TransformerLens, which provides the necessary tools for analyzing features.
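As a pre-flight check before running pdm install, the sketch below reports whether a conda environment is currently active. It relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation; the function name `active_conda_env` is hypothetical and not part of this repo.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if there is none."""
    env = os.environ if environ is None else environ
    name = env.get("CONDA_DEFAULT_ENV", "")
    return name or None


if __name__ == "__main__":
    name = active_conda_env()
    if name is None:
        print("No conda environment is active; pdm should fall back to virtualenv.")
    else:
        # pdm would reuse this environment and remove packages not in pyproject.toml.
        print(f"Conda environment '{name}' is active; make sure it is a fresh one.")
```

If the script reports an environment you care about (for example `base`), switch to a fresh environment or deactivate conda before installing.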

If you want to use the visualization tools, you also need to install the packages required by the frontend, which uses bun for dependency management. Follow the instructions on the bun website to install it, and then run the following command:

cd ui
bun install

It's worth noting that bun is not well-supported on Windows, so you may need to use WSL or other Linux-based solutions to run the frontend, or consider using a different package manager, such as pnpm or yarn.

Training/Analyzing a Dictionary

The examples directory contains some basic examples that show how to train a dictionary and analyze the learned dictionary. You can copy the example scripts to the exp directory and modify them to fit your needs. More examples will be added in the future.

Visualizing the Learned Dictionary

The analysis results are saved to MongoDB, and you can use the provided visualization tools to inspect the learned dictionary. First, start the FastAPI server by running the following command:

uvicorn server.app:app --port 24577
# To customize the environment settings, copy server/.env.example to server/.env, edit it, and run with these environment variables:
# uvicorn server.app:app --port 24577 --env-file server/.env

Then, copy the ui/.env.example file to ui/.env and set VITE_BACKEND_URL to match your server settings (by default, it's http://localhost:24577). Start the frontend by running the following command:

cd ui
bun dev --port 24576

That's it! You can now go to http://localhost:24576 to visualize the learned dictionary and its features.
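If the page does not load, a quick way to narrow down the problem is to check whether both servers are accepting connections. The sketch below is a generic TCP check using only the Python standard library; the helper name `port_is_open` is hypothetical, and the ports are the defaults used in the commands above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Default ports from the commands above; adjust if you changed them.
    for name, port in (("backend (uvicorn)", 24577), ("frontend (bun dev)", 24576)):
        status = "listening" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {status}")
```

This only confirms that something is listening on each port; it does not verify that MongoDB is reachable or that the analysis results are present.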

Development

We warmly welcome contributions to this project. If you have any questions or suggestions, feel free to open an issue or a pull request. We look forward to hearing from you!

TODO: Add development guidelines

Citation

Please cite this library as:

@misc{Ge2024OpenMossSAEs,
    title  = {OpenMoss Language Model Sparse Autoencoders},
    author = {Xuyang Ge and Fukang Zhu and Junxuan Wang and Wentao Shu and Lingjie Chen and Zhengfu He},
    url    = {https://github.com/OpenMOSS/Language-Model-SAEs},
    year   = {2024}
}
