When using language models to measure reading difficulty, choose your unit of analysis (word, morpheme, etc.) explicitly and separately from tokenization; don't let the model's token boundaries dictate your scientific analysis.
This paper clarifies how surprisal theory, which predicts human reading difficulty from a word's predictability in context, should handle units of analysis. Language models segment text into subword tokens that need not align with linguistic units (like words), creating confusion about what surprisal is actually being calculated over. The authors provide a framework for making these choices explicit and consistent.
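To make the token-vs-word distinction concrete, here is a minimal sketch (not the paper's implementation) of one common way to recover a word-level surprisal from subword tokens: by the chain rule, the negative log probability of a word equals the sum of the conditional surprisals of the subword tokens that spell it out. The tokenization and probability values below are made up for illustration.

```python
import math

def token_surprisal(p):
    """Surprisal in bits of an event with conditional probability p."""
    return -math.log2(p)

# Suppose a tokenizer splits one word into three subword tokens, and some
# language model assigns each token a conditional probability given the
# preceding context (hypothetical values):
subword_probs = {"un": 0.2, "predict": 0.5, "ability": 0.25}

# Token-level surprisals, one per subword token:
token_bits = {t: token_surprisal(p) for t, p in subword_probs.items()}

# Word-level surprisal: the sum over the word's subword tokens. By the
# chain rule this equals the surprisal of the product of the conditional
# probabilities, i.e. the surprisal of the whole word.
word_bits = sum(token_bits.values())
assert math.isclose(word_bits, token_surprisal(0.2 * 0.5 * 0.25))

print(f"word surprisal: {word_bits:.3f} bits")
```

The point of the paper's framework is that this aggregation step is a modeling choice: the analyst decides that the word is the unit of analysis and maps token-level quantities onto it, rather than letting the tokenizer's boundaries silently define the unit.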