How reliable are LLMs when it comes to playing dice?

Luca Avena, Gianmarco Bet, Bernardo Busoni | June 5, 2026 | arXiv

Key Takeaway

LLMs appear to solve probability problems through pattern matching rather than true reasoning: they fail on counterintuitive cases and are vulnerable to prompt manipulation, which points to a fundamental gap in probabilistic understanding.

Summary

This paper tests how well large language models (LLMs) solve probability problems such as dice games. The authors found that the models perform well on straightforward problems (96% accuracy) but struggle with counterintuitive problems designed to mislead intuition (59% accuracy).

evaluation reasoning

Key Terms

chain-of-thought token-bias probabilistic-reasoning