Explaining Attention with Program Synthesis

Amiri Hayes, Belinda Li, Jacob Andreas | June 17, 2026 | arXiv

Key Takeaway

Attention heads in transformers can be reverse-engineered into interpretable Python programs that capture their core logic, offering a path toward making neural model internals more transparent and understandable.

Summary

This paper proposes a method to explain how attention heads in transformer models work by generating Python programs that mimic their behavior. The researchers use a language model to write code that reproduces attention patterns from real examples, then validate these programs on new data.
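As a rough illustration of the idea, the sketch below pairs a hand-written stand-in for a synthesized program with a simple agreement score. Everything in it is an assumption made for illustration: the function names `previous_token_program` and `attention_iou`, the previous-token behavior, the 0.5 threshold, and the toy `observed` matrix are hypothetical and not taken from the paper. Intersection-over-union is used as the score only because it appears among the key terms; the paper's exact scoring procedure may differ.

```python
import numpy as np


def previous_token_program(tokens):
    """Hypothetical 'synthesized' program: each token attends to its predecessor."""
    n = len(tokens)
    pattern = np.zeros((n, n))
    pattern[0, 0] = 1.0  # the first token has no predecessor, so it attends to itself
    for i in range(1, n):
        pattern[i, i - 1] = 1.0
    return pattern


def attention_iou(predicted, observed, threshold=0.5):
    """Intersection-over-union of the two patterns after thresholding (assumed metric)."""
    pred_mask = predicted >= threshold
    obs_mask = observed >= threshold
    union = np.logical_or(pred_mask, obs_mask).sum()
    if union == 0:
        return 1.0  # both masks are empty, so they agree trivially
    return float(np.logical_and(pred_mask, obs_mask).sum() / union)


# Toy data, invented for this example: rows are query tokens, columns are key tokens.
tokens = ["The", "cat", "sat", "down"]
observed = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.6, 0.1, 0.3, 0.0],  # this row departs from the previous-token rule
])

score = attention_iou(previous_token_program(tokens), observed)
print(f"IoU between program and observed attention: {score:.2f}")
```

In this toy setup the program matches the first three rows and misses the last, so the score falls below 1. A validation step like the one the summary describes would compute such a score on examples the program was not written against.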


Key Terms

attention head, program synthesis, mechanistic interpretability, intersection over union