A multi-layer evaluation pipeline that applies increasingly lenient or human-intensive grading strategies to improve reliability.