Language models can explain neurons in language models
One simple approach to interpretability research is to first understand what the individual components (neurons and attention heads) are doing.
We introduce an automated process that uses GPT-4 to produce and score natural-language explanations of neuron behavior, and we apply it to neurons in another language model.
The process runs three steps on every neuron (a sketch follows the list):

Step 1: Explain. Show GPT-4 the neuron's activations over text excerpts and ask it to generate a short explanation of the behavior.
Step 2: Simulate. Use GPT-4 to simulate what the neuron's activations would be on new text if the explanation were accurate.
Step 3: Score. Compare the simulated activations with the neuron's real activations to measure how well the explanation predicts its behavior.
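The following is a minimal sketch of the three-step loop under stated assumptions, not the actual implementation: query_gpt4 is a hypothetical placeholder for a GPT-4 API call, the prompt wording and the 0-10 activation scale are assumed conventions, and the correlation-based score is a simplified version of comparing simulated to real activations.

```python
import numpy as np

def query_gpt4(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 API call; returns the model's text reply."""
    raise NotImplementedError("Replace with a real GPT-4 API call.")

def explain_neuron(tokens: list[str], activations: list[float]) -> str:
    """Step 1: show GPT-4 (token, activation) pairs and ask for a short explanation."""
    pairs = "\n".join(f"{tok}\t{act:.2f}" for tok, act in zip(tokens, activations))
    prompt = (
        "The following tokens and neuron activations come from a language model.\n"
        f"{pairs}\n"
        "In one sentence, what pattern does this neuron respond to?"
    )
    return query_gpt4(prompt)

def simulate_activations(explanation: str, tokens: list[str]) -> list[float]:
    """Step 2: ask GPT-4 to predict, per token, how strongly the neuron would
    fire if the explanation were accurate (0-10 scale is an assumed convention)."""
    prompt = (
        f"A neuron activates on: {explanation}\n"
        "For each token below, output an activation from 0 to 10, one per line.\n"
        + "\n".join(tokens)
    )
    return [float(line) for line in query_gpt4(prompt).splitlines()]

def score_explanation(real: list[float], simulated: list[float]) -> float:
    """Step 3: compare simulated and real activations; here via Pearson
    correlation, a simplification of the scoring used in the work."""
    return float(np.corrcoef(real, simulated)[0, 1])
```

A higher score means the explanation is a better predictor of the neuron's real behavior; in practice each step would run over many text excerpts per neuron rather than a single list of tokens.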