Correlational evaluation of task learnability in language models is fundamentally flawed due to confounding factors; causal intervention methods are necessary to accurately measure how much data is needed to learn specific tasks.
This paper shows that measuring whether language models can learn a specific task from data is harder than it seems. Using formal languages (artificial, rule-based languages), the researchers show that standard correlational evaluation methods give misleading answers because task properties are confounded: they vary together, so the effect of any single property on learnability cannot be isolated from the others.
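The confounding problem can be illustrated with a toy sketch. This is not the paper's experimental setup: the two properties (`vocab_size`, `rule_depth`), the function `samples_needed`, and every number below are hypothetical, chosen only to show how a correlational comparison and an intervention can disagree.

```python
from statistics import mean


def samples_needed(vocab_size: int, rule_depth: int) -> int:
    """Hypothetical ground truth: only rule_depth affects data needs.

    vocab_size is deliberately ignored, so any apparent effect of
    vocab_size in the comparisons below is due to confounding.
    """
    return 100 * rule_depth


# Hypothetical observational pool of (vocab_size, rule_depth) pairs.
# By construction, languages with larger vocabularies also have deeper rules.
observed = [(10, 1), (10, 1), (20, 2), (50, 5), (50, 4), (60, 6)]


def correlational_gap(pool, threshold=30):
    """Compare mean data needs of large- vs small-vocabulary languages."""
    small = [samples_needed(v, d) for v, d in pool if v < threshold]
    large = [samples_needed(v, d) for v, d in pool if v >= threshold]
    return mean(large) - mean(small)


def interventional_gap(low_vocab, high_vocab, fixed_depth):
    """Change vocab_size while holding rule_depth fixed."""
    return (samples_needed(high_vocab, fixed_depth)
            - samples_needed(low_vocab, fixed_depth))


if __name__ == "__main__":
    print("correlational gap:", correlational_gap(observed))
    print("interventional gap:", interventional_gap(10, 60, fixed_depth=3))
```

In this sketch the correlational comparison reports a large positive gap for vocabulary size even though, by construction, vocabulary size has no effect. The interventional comparison, which varies one property while fixing the other, reports a gap of zero.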