Instead of judging multiple reasoning attempts individually, which is noisy, compare them pairwise and aggregate the votes to find the best solution. Scaling test-time compute in breadth this way is more reliable than scaling the depth of a single reasoning trace.
OpenDeepThink improves LLM reasoning by running multiple solution attempts in parallel and selecting the best one through pairwise comparisons between candidates, rather than trying to judge each solution independently. The method uses Bradley-Terry aggregation to rank candidates based on LLM pairwise judgments, then evolves the top-ranked solutions using the critiques generated during those comparisons.
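
To make the aggregation step concrete, here is a minimal sketch of Bradley-Terry ranking from pairwise outcomes. The Bradley-Terry model gives each candidate i a positive strength p_i and assumes candidate i beats candidate j with probability p_i / (p_i + p_j). The summary above does not say how the strengths are fitted, so this sketch assumes the standard minorization-maximization (MM) update. The function names `bradley_terry` and `rank_candidates`, the `(winner, loser)` input format, and the 0.5 pseudo-count are all illustrative assumptions, not details of OpenDeepThink; in practice the outcomes would come from LLM pairwise judgments.

```python
from typing import List, Tuple


def bradley_terry(
    n: int,
    comparisons: List[Tuple[int, int]],
    iters: int = 200,
    prior: float = 0.5,
) -> List[float]:
    """Fit Bradley-Terry strengths for n >= 2 candidates with the MM update.

    comparisons holds (winner, loser) index pairs.
    prior is an assumed pseudo-count added to every ordered pair so that a
    candidate with no recorded wins still gets a positive strength.
    """
    # wins[i][j] = (smoothed) number of times candidate i beat candidate j
    wins = [[prior if i != j else 0.0 for j in range(n)] for i in range(n)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0

    p = [1.0 / n] * n  # start from uniform strengths
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each pair contributes (games played) / (sum of current strengths)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [x / norm for x in updated]  # normalize so strengths sum to 1
    return p


def rank_candidates(n: int, comparisons: List[Tuple[int, int]]) -> List[int]:
    """Return candidate indices ordered from strongest to weakest."""
    strengths = bradley_terry(n, comparisons)
    return sorted(range(n), key=lambda i: strengths[i], reverse=True)


# Hypothetical judgments: candidate 0 beats 1 twice and 2 once; 1 beats 2 once.
judgments = [(0, 1), (0, 2), (1, 2), (0, 1)]
ranking = rank_candidates(3, judgments)
```

Fitting one strength per candidate lets every comparison inform the whole ranking, so a single noisy judgment has less influence than it would if each candidate were scored in isolation. The top entries of `ranking` would then be the candidates passed to the evolution step.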