Models can be strong at solving math problems yet weak at creating challenging ones; dual-role evaluation exposes capability gaps that single-role benchmarks miss, and the benchmark's difficulty scales naturally with model strength.
MathDuels is a new way to test AI math ability: models both create and solve problems against each other. Unlike static benchmarks, which strong models eventually saturate, this self-play approach surfaces hidden asymmetries, since a model can be an excellent solver but a poor problem creator, and vice versa.
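To make the dual-role idea concrete, here is a minimal sketch of a single duel round in Python. The source does not describe MathDuels' actual interface or scoring, so everything below (the `Model` callable, the function and field names, and the zero-sum credit rule) is an illustrative assumption, not the benchmark's implementation.

```python
# Hypothetical sketch of one dual-role "duel" round. Assumes a generic
# model interface that maps a prompt string to a text response; all
# names here are illustrative, not the MathDuels API.
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # prompt -> text response


@dataclass
class DuelResult:
    proposer_score: float  # credit for posing a problem the opponent misses
    solver_score: float    # credit for solving the opponent's problem


def duel_round(proposer: Model, solver: Model,
               check_answer: Callable[[str, str], bool]) -> DuelResult:
    # Proposer role: create a problem along with a reference answer.
    problem = proposer("Pose a challenging math problem.")
    reference = proposer(f"Give the final answer to: {problem}")

    # Solver role: the opponent attempts the freshly created problem.
    attempt = solver(f"Solve and give only the final answer: {problem}")

    solved = check_answer(attempt, reference)
    # Zero-sum-style credit (an assumption): the proposer earns points
    # when the solver fails; the solver earns points when it succeeds.
    return DuelResult(proposer_score=0.0 if solved else 1.0,
                      solver_score=1.0 if solved else 0.0)
```

Under a scoring rule like this, the proposer is rewarded only for problems its opponent cannot solve, which is one plausible way the benchmark's difficulty would scale with model strength: as solvers improve, proposers must generate harder problems to score at all.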