LLMs fail at implicit prediction tasks on tables because they don't recognize when a question requires inference from patterns rather than lookup; intent disambiguation is the critical bottleneck.
TopBench is a benchmark for testing how well language models can answer questions about tables that require prediction and reasoning rather than simple data lookup. It includes 779 examples spanning tasks such as forecasting values, analyzing treatment effects, and complex filtering. Results on the benchmark show that current models struggle to recognize when prediction is needed and often default to simple retrieval instead.
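To make the lookup-versus-prediction distinction concrete, here is a minimal illustrative sketch in Python. The toy table, the questions, and the linear-rate heuristic are all hypothetical and are not drawn from TopBench's actual data, format, or evaluation code; the point is only that a lookup answer sits in a cell, while an implicit prediction answer must be inferred from a pattern across cells.

```python
# Illustrative sketch only: toy data, not TopBench's format or API.
import pandas as pd

# Toy sales table: May's revenue is missing and must be predicted.
table = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr", "May"],
    "units":   [100, 120, 140, 160, 180],
    "revenue": [1000.0, 1200.0, 1400.0, 1600.0, None],
})

# Lookup question: the answer is read directly from a single cell.
lookup_q = "What was the revenue in March?"
lookup_answer = table.loc[table["month"] == "Mar", "revenue"].item()  # 1400.0

# Implicit prediction question: no cell holds the answer; it must be
# inferred from the pattern (here, revenue tracks units at ~$10 per unit).
prediction_q = "What revenue should we expect in May?"
known = table.dropna(subset=["revenue"])
rate = (known["revenue"] / known["units"]).mean()  # dollars per unit
predicted = rate * table.loc[table["month"] == "May", "units"].item()  # 1800.0

print(lookup_answer)   # retrievable by lookup
print(predicted)       # requires inference, not retrieval
```

A model that treats the second question like the first will scan the table, find no May revenue, and either abstain or fabricate a cell value instead of inferring one, which is the failure mode the benchmark is designed to expose.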