You can train reasoning models effectively using only positive examples: negative examples aren't necessary if you redistribute probability mass correctly and stabilize learning with a siamese network.
This paper proposes POPO, a new training method for reasoning-focused language models that learns exclusively from successful (positive) examples rather than mixing successes with failures. Where existing methods such as GRPO compare positive rollouts against negative ones, POPO uses importance sampling to implicitly learn what to avoid, stabilized by a siamese network architecture.
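To make that mechanism concrete, here is a minimal sketch of one way positive-only training with importance-sampled reweighting could look in PyTorch. Everything in it is an assumption for illustration rather than the paper's actual implementation: the `TinyPolicy` model, the `positive_only_step` function, the clipping constant, and the reading of the siamese component as a frozen weight-tied twin that serves as the reference distribution for the importance ratios.

```python
# Hypothetical sketch of positive-only training with clipped importance
# weights against a frozen twin network. Names and design choices are
# assumptions for illustration, not POPO's actual objective.
import copy
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Toy autoregressive policy over a small vocabulary (stand-in for an LM)."""
    def __init__(self, vocab_size=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def sequence_log_prob(self, tokens):
        # tokens: (batch, seq); predict each token from its predecessor.
        logits = self.head(self.embed(tokens[:, :-1]))
        logp = torch.log_softmax(logits, dim=-1)
        return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)

policy = TinyPolicy()
twin = copy.deepcopy(policy)            # frozen twin used as the reference policy
for p in twin.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def positive_only_step(positive_tokens, clip=2.0):
    """One update on a batch of positive rollouts only.

    The clipped ratio w = pi_theta / pi_twin caps how hard any single
    sequence is pushed once its probability has grown past the frozen
    twin's, so mass is redistributed gradually instead of collapsing
    onto the replayed successes. No negative examples appear anywhere.
    """
    logp = policy.sequence_log_prob(positive_tokens)
    with torch.no_grad():
        logp_twin = twin.sequence_log_prob(positive_tokens)
        w = torch.exp(logp - logp_twin).clamp(max=clip)
    loss = -(w * logp).mean()           # importance-weighted likelihood on positives
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage: one step on a random batch standing in for tokenized successful rollouts.
batch = torch.randint(0, 32, (8, 16))
print(positive_only_step(batch))
```

The clipped ratio plays the same stabilizing role as trust-region clipping in PPO-style objectives: without some anchor like the frozen twin, maximum likelihood on positives alone tends to concentrate probability mass on the replayed successes rather than redistributing it.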