Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard, Francisco J. Rodriguez-Martinez, Lorena Otero-Cerdeira | July 2, 2026 | arXiv

Key Takeaway

LLMs can grade technical exams reliably on simpler tasks but struggle with complex questions. Rubric quality matters more than the choice of model, and a taxonomy-based approach helps identify which questions are safe to auto-grade.

Summary

This paper evaluates whether large language models can reliably grade Linux/bash exam responses by testing GPT, Claude, Gemini, and GLM against expert instructors' grades.
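To make "testing against expert instructors' grades" concrete, the sketch below computes Cohen's kappa, one common chance-corrected agreement statistic for two raters. This summary does not say which agreement metric the paper uses, so kappa is an assumption here, and the function name `cohen_kappa` and the grade lists are hypothetical illustration data, not results from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical grades."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of answers given the same grade by both.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement by chance, from each rater's grade frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical grade throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical grades on a 0-2 rubric scale (NOT data from the paper).
instructor = [2, 1, 0, 2, 1, 2, 0, 1]
llm = [2, 1, 0, 1, 1, 2, 0, 2]

kappa = cohen_kappa(instructor, llm)
print(f"kappa = {kappa:.3f}")  # prints "kappa = 0.619"
```

In this toy example the two raters agree on 6 of 8 answers (0.75 raw agreement), but kappa is lower because some of that agreement would be expected by chance. Computing the statistic separately for each taxonomy level would be one way to see where agreement drops on more complex questions.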

evaluation applications training

Key Terms

rubric inter-rater-agreement cognitive-taxonomy prompt-engineering