AI coding agents are far from ready for autonomous scientific research: they excel at software engineering but fail at the domain-specific reasoning, procedure reconstruction, and result interpretation needed to reproduce real computational science claims.
This paper introduces AutoMat, a benchmark that tests whether AI coding agents can reproduce scientific findings from materials science papers. The benchmark reveals that current AI agents struggle significantly, achieving only 54% success: they fail to fully reconstruct experimental procedures from paper descriptions, deviate from required methods, and break down during execution.