Poisoned summarization models leave detectable structural artifacts—high training influence in white-box settings and unusual sensitivity to semantic perturbations in black-box settings—allowing 85-92% detection accuracy and recovery of 96% of original behavior without full retraining.
This paper presents a defense framework against data poisoning attacks on text summarization models during fine-tuning. The authors develop detection methods using influence analysis (white-box) and behavioral auditing (black-box), plus an unlearning technique to remove poisoned effects.