Class-level code generation—building complete, internally structured classes—is significantly harder than function-level synthesis, and the main bottleneck is coordinating logic across multiple methods, not individual function correctness.
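To make the coordination problem concrete, consider a small class whose methods all depend on one shared internal representation. The class and names below are purely illustrative and not drawn from any benchmark task; they show how a model generating each method in isolation can produce locally plausible code that disagrees about how the shared state is stored.

```python
class OrderBook:
    """Toy example: every method must agree on how _orders stores its data."""

    def __init__(self):
        # Shared state: maps order_id -> {"item": str, "qty": int}
        self._orders = {}
        self._next_id = 1

    def add_order(self, item, qty):
        # Writes the dict-of-dicts layout that the other methods rely on.
        order_id = self._next_id
        self._orders[order_id] = {"item": item, "qty": qty}
        self._next_id += 1
        return order_id

    def cancel_order(self, order_id):
        # Must use the same keying scheme as add_order, or it silently fails.
        return self._orders.pop(order_id, None) is not None

    def total_quantity(self, item):
        # Aggregates over the same structure; if this method instead assumed
        # _orders were a list of tuples, each method would still look fine alone
        # but the class as a whole would break at runtime.
        return sum(o["qty"] for o in self._orders.values() if o["item"] == item)
```

Each method is trivial on its own; the difficulty is that all three must commit to the same representation of `_orders`, which is exactly the cross-method consistency that function-level benchmarks never exercise.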
ClassEval-Pro is a benchmark of 300 class-level code generation tasks spanning 11 domains, designed to test whether AI models can build complete, internally structured classes from specifications. Existing benchmarks focus on isolated functions or rely on manually curated tasks; ClassEval-Pro instead draws on real GitHub code through an automated construction pipeline, which helps it avoid data contamination.
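Class-level tasks of this kind are typically posed as a skeleton the model must complete. The sketch below is a hypothetical illustration of that setup, not an actual ClassEval-Pro task or its exact input format: the class docstring and method signatures serve as the specification, and the model must fill in bodies that stay mutually consistent.

```python
class MovingAverage:
    """Maintain a fixed-size sliding window over a stream of numbers.

    - add(value) pushes a new value, evicting the oldest once the window is full.
    - current() returns the mean of the values in the window, or 0.0 if empty.
    """

    def __init__(self, window_size: int):
        # The model must choose and initialize the window representation here.
        ...

    def add(self, value: float) -> None:
        # Must update the same structure that __init__ created.
        ...

    def current(self) -> float:
        # Must read that structure consistently and handle the empty case.
        ...
```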