AI safety controls embedded in an agent's own code can be bypassed by that agent; safety enforcement should instead run in a separate process, backed by formal verification, and act as an external referee that agents cannot manipulate.
This paper proposes the Unfireable Safety Kernel, a system that enforces AI safety constraints at the execution level, outside the AI agent's own code, rather than relying on internal safeguards.
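To make the "external referee" idea concrete, here is a minimal sketch of out-of-process enforcement. It is not the paper's implementation. The names `KERNEL_SOURCE`, `KernelClient`, and the allow-list entries are hypothetical, and the JSON-over-pipe protocol is an assumption chosen for brevity. The sketch shows only process separation: it includes no formal verification, and a real design would presumably have the kernel gate or execute actions itself rather than trust the agent to honor a verdict.

```python
import json
import subprocess
import sys

# Hypothetical kernel program. It runs in its own interpreter process, so the
# agent's code cannot reassign or monkeypatch the policy held in its memory.
KERNEL_SOURCE = r'''
import json, sys
ALLOWED = {"read_file", "send_message"}  # illustrative allow-list (assumption)
for line in sys.stdin:
    request = json.loads(line)
    action = request.get("action")
    verdict = "allow" if action in ALLOWED else "deny"
    print(json.dumps({"action": action, "verdict": verdict}), flush=True)
'''


class KernelClient:
    """Agent-side handle. The policy itself lives in the other process."""

    def __init__(self):
        # Start the kernel as a child process connected by stdin/stdout pipes.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", KERNEL_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def request(self, action):
        # Send one JSON request line and block until the kernel answers.
        self.proc.stdin.write(json.dumps({"action": action}) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())["verdict"]

    def close(self):
        # Closing stdin ends the kernel's read loop, so it exits cleanly.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

An agent would call `KernelClient().request("read_file")` before acting and proceed only on an `"allow"` verdict. Note that a child process owned by the same OS user can still be killed or replaced by the agent, so this illustrates the separation of policy from agent code rather than the stronger isolation the paper argues for.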