Fine-tuning LLMs for vulnerability detection produces calibration without comprehension: models adjust their confidence scores to match training data but don't develop actual security reasoning.
This paper evaluates whether LLMs understand software vulnerabilities or merely memorize patterns. Using 834 carefully curated Linux kernel samples with strict temporal splits to prevent data leakage, the authors find that fine-tuning does not improve genuine security reasoning; it only adjusts output thresholds.
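
To make the temporal split concrete, here is a minimal sketch of one way such a split could be implemented. It is not the authors' pipeline: the `Sample` fields, the example records, and the cutoff date are all hypothetical, chosen only to illustrate the idea that every training sample predates every test sample.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    commit_id: str    # hypothetical identifier for a kernel code sample
    committed: date   # hypothetical: date the sample's code was committed
    vulnerable: bool  # ground-truth label


def temporal_split(samples, cutoff):
    """Put samples dated strictly before `cutoff` in train, the rest in test."""
    train = [s for s in samples if s.committed < cutoff]
    test = [s for s in samples if s.committed >= cutoff]
    return train, test


# Illustrative records only; these are not taken from the paper's dataset.
samples = [
    Sample("a1", date(2021, 3, 1), True),
    Sample("b2", date(2022, 7, 15), False),
    Sample("c3", date(2023, 1, 10), True),
    Sample("d4", date(2023, 9, 5), False),
]

# The cutoff date is an assumption made for this example.
train, test = temporal_split(samples, cutoff=date(2023, 1, 1))
```

Splitting on date rather than at random means the model is never evaluated on code that is older than something it trained on, which is the kind of leakage a random split can introduce.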