Causally Evaluating the Learnability of Formal Language Tasks

Vésteinn Snæbjarnarson, Anej Svete, Josef Valvoda, Reda Boumasmoud, Brian DuSell et al. | June 8, 2026 | arXiv

Key Takeaway

Correlational evaluations of task learnability in language models are unreliable because the task properties being compared are confounded with one another; causal interventions are needed to measure accurately how much data a model needs to learn a specific task.

Summary

This paper shows that measuring whether language models can learn specific tasks from data is harder than it seems. Using formal languages (artificial languages defined by explicit rules), the researchers demonstrate that standard correlational evaluations give wrong answers because different task properties are confounded: they vary together, so the effect of one cannot be separated from the effect of another.
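To make the confounding problem concrete, here is a toy simulation. It is not the paper's experimental setup: the variables `a` (a measured task property), `b` (a hidden task property), and `err` (a stand-in for how hard the task is to learn) are hypothetical, and the linear relationships between them are assumed purely for illustration. In the sketch, `err` depends only on `b`, yet `a` appears strongly predictive when tasks are merely observed, because `a` and `b` vary together. Setting `a` independently of `b`, as an intervention would, makes the apparent relationship vanish.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(intervene, n=5000, seed=0):
    """Correlation between property `a` and error `err` over n toy tasks.

    All quantities are hypothetical stand-ins, not values from the paper.
    """
    rng = random.Random(seed)
    a_vals, errs = [], []
    for _ in range(n):
        b = rng.gauss(0, 1)  # hidden property that actually drives difficulty
        if intervene:
            a = rng.gauss(0, 1)  # intervention: a is set independently of b
        else:
            a = b + rng.gauss(0, 0.5)  # observation: a co-varies with b
        err = 2.0 * b + rng.gauss(0, 0.5)  # error depends on b only, never on a
        a_vals.append(a)
        errs.append(err)
    return pearson(a_vals, errs)


observational = simulate(intervene=False)
interventional = simulate(intervene=True)
print(f"observational correlation:  {observational:.2f}")
print(f"interventional correlation: {interventional:.2f}")
```

Under these assumptions the observational correlation is high even though `a` has no effect on `err`, while the interventional correlation is close to zero. A purely correlational evaluation would wrongly credit `a` with the difficulty that `b` causes.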

evaluation training reasoning

Key Terms

causal-inference confounding kullback-leibler-divergence probabilistic-finite-automata