mechanistic interpretability with sparse autoencoders - a rchan26 Collection

updated Sep 3, 2024

A collection of papers I found useful for learning how sparse autoencoders can be used to find interpretable features in language models.

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Paper • 2309.08600 • Published Sep 15, 2023 • 15

Scaling and evaluating sparse autoencoders

Paper • 2406.04093 • Published Jun 6, 2024 • 4

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Paper • 2403.19647 • Published Mar 28, 2024 • 4

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Paper • 2408.05147 • Published Aug 9, 2024 • 41

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Paper • 2407.14435 • Published Jul 19, 2024 • 7

Interpreting Attention Layer Outputs with Sparse Autoencoders

Paper • 2406.17759 • Published Jun 25, 2024

Disentangling Dense Embeddings with Sparse Autoencoders

Paper • 2408.00657 • Published Aug 1, 2024 • 1

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Paper • 2405.12522 • Published May 21, 2024 • 2

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Paper • 2405.08366 • Published May 14, 2024 • 2