Aider Polyglot Coding Benchmark
Tests AI coding assistants on real-world programming tasks across multiple languages using the Aider coding tool. Measures ability to edit existing codebases to pass tests.
Models are asked to modify existing code so that failing tests pass. Tasks are drawn from Exercism coding exercises in C++, Go, Java, JavaScript, Python, and Rust. The benchmark evaluates practical code editing ability, not just code generation.
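The Score column can be read as a pass rate: the share of tasks whose tests pass after the model's edits. Below is a minimal sketch of that calculation, assuming one pass/fail outcome per task. `TaskResult` and `pass_rate` are hypothetical names used for illustration, not part of Aider, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one editing task (hypothetical record, not Aider's own format)."""
    language: str
    tests_passed: bool


def pass_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose tests pass after the model's edits."""
    if not results:
        raise ValueError("no results to score")
    # True counts as 1 and False as 0, so the sum is the number of solved tasks.
    solved = sum(r.tests_passed for r in results)
    return round(100.0 * solved / len(results), 1)


# Illustrative data only: 3 of 4 made-up tasks end with passing tests.
results = [
    TaskResult("python", True),
    TaskResult("rust", False),
    TaskResult("go", True),
    TaskResult("java", True),
]
print(f"{pass_rate(results):.1f}%")  # prints "75.0%"
```

A single percentage like this says nothing about which languages a model struggles with; grouping the results by `language` before scoring would give a per-language breakdown.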
This table includes both open-weight and commercial API models. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 88.0% |
| 2 | Grok 4 | 79.6% |
| 3 | o3 | 76.9% |
| 4 | Gemini 2.5 Pro | 76.9% |
| 5 | o4-mini | 72.0% |
| 6 | Claude Opus 4 | 72.0% |
| 7 | Claude 3.7 Sonnet | 64.9% |
| 8 | Claude Sonnet 4 | 61.3% |
| 9 | o3 Mini | 60.4% |
| 10 | Qwen3 235B A22B | 59.6% |
| 11 | DeepSeek R1 | 56.9% |
| 12 | Grok 3 | 53.3% |
| 13 | Claude 3.5 Sonnet | 51.6% |
| 14 | DeepSeek V3 | 48.4% |
| 15 | GPT-4o | 23.1% |
| 16 | Gemini 2.0 Flash | 22.2% |