LLMs appear to solve probability problems through pattern matching rather than true reasoning: they fail on counterintuitive cases and are vulnerable to prompt manipulation, which suggests a fundamental gap in probabilistic understanding.
This paper tests how well large language models can solve probability problems, such as dice games. The researchers found that models perform well on straightforward problems (96% accuracy) but struggle with problems designed to defeat intuition (59% accuracy).
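
To make the contrast between straightforward and counterintuitive problems concrete, here is a minimal sketch of one classic dice question: the chance of at least one six when rolling two dice. This example is a hypothetical illustration and is not taken from the paper's benchmark; the function name `prob_at_least_one_six` is an assumption introduced here. The intuitive shortcut adds 1/6 + 1/6, while exact enumeration gives a slightly smaller value because the double-six outcome is counted only once.

```python
from fractions import Fraction
from itertools import product


def prob_at_least_one_six(num_dice: int) -> Fraction:
    """Exact probability of at least one six, by enumerating all outcomes."""
    # Every ordered outcome of rolling `num_dice` fair six-sided dice.
    outcomes = list(product(range(1, 7), repeat=num_dice))
    # Count the outcomes that contain at least one six.
    favorable = sum(1 for roll in outcomes if 6 in roll)
    return Fraction(favorable, len(outcomes))


# Hypothetical illustration, not a problem from the paper.
exact = prob_at_least_one_six(2)   # 11/36
naive = Fraction(1, 6) * 2         # 1/3, the tempting but wrong shortcut

print(f"exact: {exact}, naive shortcut: {naive}")
```

The gap between the two values is small (11/36 versus 12/36), which is the sort of detail that a surface-level shortcut can miss and that exhaustive enumeration settles.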