Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents? — ThinkLLM