When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient