Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents — ThinkLLM