Current AI systems and auditors are poor at detecting subtle sabotage in research code—even frontier LLMs only catch 77% of cases—highlighting a critical gap in oversight for autonomous AI research.
This paper introduces ASMR-Bench, a benchmark for testing whether AI systems and human auditors can detect sabotage hidden in ML research code. The benchmark includes 9 real ML projects with intentionally introduced bugs that change experimental results while keeping the paper's description accurate.