Evaluating agent executions to produce feedback signals for skill improvement. These signals are used to validate whether an edit to a skill actually improves performance.
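As a minimal sketch of this idea, the example below scores individual agent executions and compares the average score before and after a skill edit. Everything in it is an assumption made for illustration and not taken from the source: the `Execution` record, the scoring rule in `feedback_signal`, the `max_steps` budget, and the `min_gain` threshold are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Execution:
    """Hypothetical record of one agent run on one task."""
    task_id: str
    succeeded: bool
    steps: int


def feedback_signal(execution: Execution, max_steps: int = 20) -> float:
    """Score one execution in [0, 1] (assumed rule, not from the source).

    A failed run scores 0. A successful run scores at least 0.5,
    with the remainder awarded for using fewer steps.
    """
    if not execution.succeeded:
        return 0.0
    efficiency = max(0.0, 1.0 - execution.steps / max_steps)
    return 0.5 + 0.5 * efficiency


def mean_signal(executions: Sequence[Execution]) -> float:
    """Average feedback signal over a non-empty batch of executions."""
    return mean(feedback_signal(e) for e in executions)


def edit_improves(
    baseline: Sequence[Execution],
    edited: Sequence[Execution],
    min_gain: float = 0.05,
) -> bool:
    """Accept an edit only if the mean signal rises by at least min_gain."""
    return mean_signal(edited) - mean_signal(baseline) >= min_gain


# Runs on the same tasks before and after a hypothetical skill edit.
baseline = [Execution("t1", True, 10), Execution("t2", False, 20)]
edited = [Execution("t1", True, 8), Execution("t2", True, 12)]

if edit_improves(baseline, edited):
    print("keep the edit")
else:
    print("revert the edit")
```

Requiring a minimum gain, and not just any increase, is one way to avoid accepting edits whose apparent benefit is only noise. A real evaluator would likely need more runs per task and a proper significance test.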