Attention heads in transformers can be reverse-engineered into interpretable Python programs that capture their core logic, offering a path toward more transparent model internals.
This paper proposes a method for explaining how attention heads in transformer models work: generate Python programs that mimic each head's behavior. The researchers use a language model to write code that reproduces the attention patterns observed on real examples, then validate those programs on new data.
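To make the idea concrete, here is a minimal sketch of what one such program and its validation step might look like. Everything in it is an assumption for illustration: the summary does not give the paper's actual programs, program interface, or scoring metric. The names `previous_token_program`, `row_overlap`, and `score_program` are hypothetical, the overlap score is a simple stand-in for whatever metric the authors use, and the "observed" attention matrix is made-up toy data, not a measurement from a real model.

```python
from typing import Callable, List

Matrix = List[List[float]]


def previous_token_program(tokens: List[str]) -> Matrix:
    """Hypothetical candidate program: each position attends to the token before it."""
    n = len(tokens)
    attn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The first token has no predecessor, so it attends to itself.
        j = i - 1 if i > 0 else 0
        attn[i][j] = 1.0
    return attn


def row_overlap(pred_row: List[float], true_row: List[float]) -> float:
    """Overlap of two attention distributions (assumed metric): 1.0 means identical."""
    return sum(min(p, t) for p, t in zip(pred_row, true_row))


def score_program(
    program: Callable[[List[str]], Matrix],
    tokens: List[str],
    observed: Matrix,
) -> float:
    """Average per-row overlap between the program's output and observed attention."""
    predicted = program(tokens)
    rows = [row_overlap(p, t) for p, t in zip(predicted, observed)]
    return sum(rows) / len(rows)


# Toy held-out example. The attention values are invented for illustration only.
tokens = ["The", "cat", "sat"]
observed = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
]

score = score_program(previous_token_program, tokens, observed)
print(round(score, 2))  # about 0.9 on this toy data
```

A program that scores well on examples it was never written against is some evidence that it captures the head's general rule rather than memorizing the examples it was fitted to.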