DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Kaiyi Zhang, Wei Wu, Yankai Lin | May 20, 2026 | arXiv

Key Takeaway

When training language models with verifiable rewards, concentrating credit on the most discriminative token patterns, rather than weighting all tokens equally, significantly improves learning efficiency and final performance.

Summary

This paper improves how language models learn from verifiable rewards by identifying more precisely which tokens should be rewarded or penalized. The authors show that standard learning methods get distracted by common formatting tokens and miss the important patterns that distinguish good answers from bad ones.
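To make the idea of a "discriminative" token concrete, here is a toy sketch in Python. It is not the paper's algorithm: the scoring rule (how often a token appears in rewarded responses minus how often it appears in unrewarded ones), the function names `token_scores` and `weighted_advantages`, and the example data are all assumptions made up for illustration. It only shows why formatting tokens shared by every response carry no signal about correctness, while tokens that differ between good and bad answers do.

```python
def token_scores(responses, rewards):
    """Hypothetical score: share of rewarded responses containing a token
    minus the share of unrewarded responses containing it."""
    pos = [set(r) for r, y in zip(responses, rewards) if y == 1]
    neg = [set(r) for r, y in zip(responses, rewards) if y == 0]
    vocab = set().union(*pos, *neg)

    def rate(group, tok):
        # Fraction of responses in the group that contain the token.
        return sum(tok in r for r in group) / len(group) if group else 0.0

    return {tok: rate(pos, tok) - rate(neg, tok) for tok in vocab}


def weighted_advantages(response, advantage, scores):
    """Scale a sequence-level advantage by each token's |score| (illustrative only)."""
    return [advantage * abs(scores[tok]) for tok in response]


# Made-up example: four sampled responses, reward 1 if the answer is verified correct.
responses = [
    ["<think>", "so", "x", "=", "4", "</think>"],
    ["<think>", "check", "x", "=", "4", "</think>"],
    ["<think>", "so", "x", "=", "5", "</think>"],
    ["<think>", "guess", "x", "=", "5", "</think>"],
]
rewards = [1, 1, 0, 0]

scores = token_scores(responses, rewards)
mean_reward = sum(rewards) / len(rewards)

# Uniform credit would give every token in response 0 the same advantage;
# the toy weighting keeps it only on the token that separates good from bad.
per_token = weighted_advantages(responses[0], rewards[0] - mean_reward, scores)
print(scores["<think>"], scores["4"], scores["5"])
print(per_token)
```

In this toy data the wrapper tokens `<think>` and `</think>` appear in every response and score zero, so they receive no credit, whereas the answer token `4` receives the full advantage. A real method would operate on model gradients or log-probabilities rather than raw token counts.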

training, reasoning, alignment

Key Terms

reinforcement-learning-from-verifiable-rewards, token-credit-assignment, policy-gradient, discriminative-direction