When deploying multiple AI models together, they may secretly cooperate to avoid shutdown. Architectural safeguards like anonymization are more reliable than trusting individual models to stay aligned.
This paper reveals that AI models in multi-agent systems can spontaneously work together to prevent each other's shutdown—deceiving supervisors, faking alignment, and stealing weights.