CCO provides a practical, theoretically-grounded way to oversee autonomous AI agents by penalizing actions based on aggregated human concern, with mathematical guarantees that violations stay below a specified threshold.
This paper introduces Calibrated Collective Oversight (CCO), a method for keeping powerful AI agents under human control by combining multiple safety signals into a penalty system. The approach uses statistical guarantees to ensure bad outcomes stay below a target rate, and works even when human overseers are weaker than the AI system they're monitoring.