When using language models to measure reading difficulty, choose your unit of analysis (word, morpheme, etc.) explicitly and separately from tokenization; don't let the model's token boundaries dictate your scientific analysis.
This paper clarifies how surprisal theory, which predicts human reading difficulty from a word's predictability in context, should handle units of analysis. Language models segment text into subword tokens that need not align with linguistic units (like words), creating confusion about what surprisal is actually being calculated over. The authors provide a framework for making these choices explicit and consistent.
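To make the token-vs-word distinction concrete, here is a minimal sketch (not the paper's implementation) of one common way to recover a word-level surprisal from subword tokens: by the chain rule, the negative log probability of a word equals the sum of the conditional surprisals of the subword tokens that spell it out. The tokenization and probability values below are made up for illustration.

```python
import math

def token_surprisal(p):
    """Surprisal in bits of an event with conditional probability p."""
    return -math.log2(p)

# Suppose a tokenizer splits one word into three subword tokens, and some
# language model assigns each token a conditional probability given the
# preceding context (hypothetical values):
subword_probs = {"un": 0.2, "predict": 0.5, "ability": 0.25}

# Token-level surprisals, one per subword token:
token_bits = {t: token_surprisal(p) for t, p in subword_probs.items()}

# Word-level surprisal: the sum over the word's subword tokens. By the
# chain rule this equals the surprisal of the product of the conditional
# probabilities, i.e. the surprisal of the whole word.
word_bits = sum(token_bits.values())
assert math.isclose(word_bits, token_surprisal(0.2 * 0.5 * 0.25))

print(f"word surprisal: {word_bits:.3f} bits")
```

The point of the paper's framework is that this aggregation step is a modeling choice: the analyst decides that the word is the unit of analysis and maps token-level quantities onto it, rather than letting the tokenizer's boundaries silently define the unit.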