Fine-tuned language models exhibit a universal memorization signature detectable by learned classifiers, enabling membership inference attacks that generalize across architectures without requiring shadow models or hand-crafted heuristics.
This paper reveals that fine-tuning leaves a detectable fingerprint of memorization in language models, and that this fingerprint is consistent across different model architectures (Transformers, Mamba, RWKV). Instead of relying on hand-crafted rules to detect memorization, the authors train a classifier to recognize the signature; the trained classifier then transfers to unseen architectures and datasets with high accuracy.
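To make the pipeline concrete, here is a minimal sketch of a classifier-based membership inference attack in this spirit. It assumes per-token loss statistics as the signature features; the model name `gpt2`, the `nll_features` helper, the exact feature set, and the tiny example corpus are all illustrative assumptions, not the paper's actual recipe.

```python
# Sketch: learn a membership classifier from loss-based "signature" features,
# then apply it to candidate texts. Feature choices are assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

def nll_features(model, tokenizer, text, device="cpu"):
    """Summarize the model's per-token negative log-likelihoods on one text."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**enc).logits[:, :-1]          # predict token t+1 from t
        targets = enc["input_ids"][:, 1:]
        nll = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="none",
        ).cpu().numpy()
    # Distributional statistics of the loss: one plausible memorization signature.
    return np.array([nll.mean(), nll.std(), nll.min(),
                     np.percentile(nll, 10), np.percentile(nll, 90)])

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Placeholder training corpus with known membership labels
# (1 = seen during fine-tuning, 0 = held out).
texts = ["a sentence assumed seen during fine-tuning",
         "a held-out sentence the model never saw"]
labels = [1, 0]

X = np.stack([nll_features(lm, tok, t) for t in texts])
clf = LogisticRegression().fit(X, labels)

# At attack time: extract the same features from a model of any architecture
# and score a candidate text for membership.
score = clf.predict_proba(nll_features(lm, tok, "candidate text").reshape(1, -1))[0, 1]
print(f"membership probability: {score:.3f}")
```

Because the classifier operates only on loss-derived features rather than model internals, the same trained `clf` can in principle be applied to Transformer, Mamba, or RWKV checkpoints, which is what makes cross-architecture transfer plausible.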