Evaluating agent performance by examining the complete sequence of actions taken, not just final outputs.
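
As a minimal sketch of the difference, the Python below scores the same agent run two ways: once on the final output alone, and once on the sequence of actions that produced it. The `Step` type, the action names, and the in-order matching rule are all assumptions made for illustration; they are not taken from any particular evaluation framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One entry in an agent trajectory (hypothetical structure)."""
    action: str       # illustrative action name, e.g. a tool call
    observation: str  # what the agent saw after acting


def final_output_correct(output: str, expected: str) -> bool:
    """Output-only check: ignores how the agent arrived at the answer."""
    return output.strip() == expected.strip()


def trajectory_score(steps: list[Step], expected_actions: list[str]) -> float:
    """Fraction of expected actions matched, in order, by the trajectory.

    Greedy in-order match: an expected action only counts once every
    earlier expected action has already been seen. This is one assumed
    scoring rule among many possible ones.
    """
    if not expected_actions:
        return 1.0
    matched = 0
    for step in steps:
        if matched < len(expected_actions) and step.action == expected_actions[matched]:
            matched += 1
    return matched / len(expected_actions)


# Hypothetical run: the agent answers correctly but never reads a source.
trajectory = [
    Step("search", "3 results"),
    Step("answer", "42"),
]
expected_actions = ["search", "read", "answer"]

output_ok = final_output_correct(trajectory[-1].observation, "42")
path_score = trajectory_score(trajectory, expected_actions)
print(output_ok, round(path_score, 2))
```

Here the output-only check passes, while the trajectory score is low because the expected `read` step never occurred. That is the kind of gap that examining the full sequence of actions is meant to expose.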