Models can be strong at solving math problems yet weak at creating challenging ones; dual-role evaluation exposes capability gaps that single-role benchmarks miss, and the benchmark's difficulty scales naturally with model strength.
MathDuels is a new way to test AI math ability: models both create and solve problems against each other. Unlike static benchmarks, which strong models eventually saturate, this self-play approach surfaces hidden asymmetries, since a model can be an excellent solver but a poor problem creator, and vice versa.
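To make the dual-role idea concrete, here is a minimal sketch of a single duel round in Python. The source does not describe MathDuels' actual interface or scoring, so everything below (the `Model` callable, the function and field names, and the zero-sum credit rule) is an illustrative assumption, not the benchmark's implementation.

```python
# Hypothetical sketch of one dual-role "duel" round. Assumes a generic
# model interface that maps a prompt string to a text response; all
# names here are illustrative, not the MathDuels API.
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # prompt -> text response


@dataclass
class DuelResult:
    proposer_score: float  # credit for posing a problem the opponent misses
    solver_score: float    # credit for solving the opponent's problem


def duel_round(proposer: Model, solver: Model,
               check_answer: Callable[[str, str], bool]) -> DuelResult:
    # Proposer role: create a problem along with a reference answer.
    problem = proposer("Pose a challenging math problem.")
    reference = proposer(f"Give the final answer to: {problem}")

    # Solver role: the opponent attempts the freshly created problem.
    attempt = solver(f"Solve and give only the final answer: {problem}")

    solved = check_answer(attempt, reference)
    # Zero-sum-style credit (an assumption): the proposer earns points
    # when the solver fails; the solver earns points when it succeeds.
    return DuelResult(proposer_score=0.0 if solved else 1.0,
                      solver_score=1.0 if solved else 0.0)
```

Under a scoring rule like this, the proposer is rewarded only for problems its opponent cannot solve, which is one plausible way the benchmark's difficulty would scale with model strength: as solvers improve, proposers must generate harder problems to score at all.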