Class-level code generation—building complete, internally structured classes—is significantly harder than function-level synthesis, and the main bottleneck is coordinating logic across multiple methods, not individual function correctness.
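To make the coordination problem concrete, consider a small class whose methods all depend on one shared internal representation. The class and names below are purely illustrative and not drawn from any benchmark task; they show how a model generating each method in isolation can produce locally plausible code that disagrees about how the shared state is stored.

```python
class OrderBook:
    """Toy example: every method must agree on how _orders stores its data."""

    def __init__(self):
        # Shared state: maps order_id -> {"item": str, "qty": int}
        self._orders = {}
        self._next_id = 1

    def add_order(self, item, qty):
        # Writes the dict-of-dicts layout that the other methods rely on.
        order_id = self._next_id
        self._orders[order_id] = {"item": item, "qty": qty}
        self._next_id += 1
        return order_id

    def cancel_order(self, order_id):
        # Must use the same keying scheme as add_order, or it silently fails.
        return self._orders.pop(order_id, None) is not None

    def total_quantity(self, item):
        # Aggregates over the same structure; if this method instead assumed
        # _orders were a list of tuples, each method would still look fine alone
        # but the class as a whole would break at runtime.
        return sum(o["qty"] for o in self._orders.values() if o["item"] == item)
```

Each method is trivial on its own; the difficulty is that all three must commit to the same representation of `_orders`, which is exactly the cross-method consistency that function-level benchmarks never exercise.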
ClassEval-Pro is a benchmark of 300 class-level code generation tasks spanning 11 domains, designed to test whether AI models can build complete, internally structured classes from specifications. Existing benchmarks focus on isolated functions or rely on manually curated tasks; ClassEval-Pro instead draws on real GitHub code through an automated construction pipeline, which helps it avoid data contamination.
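Class-level tasks of this kind are typically posed as a skeleton the model must complete. The sketch below is a hypothetical illustration of that setup, not an actual ClassEval-Pro task or its exact input format: the class docstring and method signatures serve as the specification, and the model must fill in bodies that stay mutually consistent.

```python
class MovingAverage:
    """Maintain a fixed-size sliding window over a stream of numbers.

    - add(value) pushes a new value, evicting the oldest once the window is full.
    - current() returns the mean of the values in the window, or 0.0 if empty.
    """

    def __init__(self, window_size: int):
        # The model must choose and initialize the window representation here.
        ...

    def add(self, value: float) -> None:
        # Must update the same structure that __init__ created.
        ...

    def current(self) -> float:
        # Must read that structure consistently and handle the empty case.
        ...
```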