Global Optimality for Constrained Exploration via Penalty Regularization — ThinkLLM