LLMs can grade technical exams reliably on simpler tasks but struggle with complex questions. Rubric quality matters more than the choice of model, and a taxonomy-based approach helps identify which questions are safe to auto-grade.
This paper evaluates whether large language models can reliably grade Linux/bash exam responses by testing GPT, Claude, Gemini, and GLM against expert instructors' grades.
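As a rough illustration of what comparing model grades with instructor grades can involve, the sketch below computes two simple agreement figures for one model on a handful of questions. The function name `agreement_stats`, the `tolerance` parameter, and the toy scores are all hypothetical and chosen for illustration; the summary above does not state which agreement metrics the paper reports.

```python
from statistics import mean


def agreement_stats(expert, model, tolerance=0.0):
    """Compare per-question model grades with expert grades.

    Returns the mean absolute error and the fraction of questions where
    the model's grade is within `tolerance` points of the expert's grade.
    This is an illustrative choice of metrics, not the paper's method.
    """
    if len(expert) != len(model) or not expert:
        raise ValueError("expert and model must be non-empty and equal length")
    diffs = [abs(e - m) for e, m in zip(expert, model)]
    return {
        "mean_abs_error": mean(diffs),
        "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
    }


# Hypothetical toy scores (points per question), not data from the paper.
expert_scores = [2.0, 1.0, 0.0, 2.0]
model_scores = [2.0, 0.5, 0.0, 1.0]

stats = agreement_stats(expert_scores, model_scores, tolerance=0.5)
print(stats)
```

Running the same function separately for simple and complex question categories would be one way to see where a model's grades drift from the instructors' grades.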