Vector Policy Optimization: Training for Diversity Improves Test-Time Search — ThinkLLM