TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

Sikai Bai, Haoxi Li, Jie Zhang, Yongjiang Liu, Song Guo | April 9, 2026 | arXiv

Key Takeaway

You can boost reasoning model performance on new domains without labeled data by synthesizing diverse question variants at test time and using hybrid exploration to balance accuracy with consistency across variants.

Summary

TTVS helps AI reasoning models improve themselves at test time by creating diverse variations of unlabeled questions and learning from them. Instead of relying on expensive labeled data, the system generates synthetic question variants and uses exploration strategies to learn the underlying problem logic rather than memorizing surface patterns.
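
As a rough illustration of balancing accuracy with consistency across variants, the sketch below scores answers sampled for one unlabeled question. It is not the paper's algorithm: this summary does not specify the reward or how variants are synthesized. The majority-vote pseudo-label, the function names, and the mixing weight alpha are all assumptions made for illustration, and variant generation is taken as given.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common answer in a non-empty list."""
    return Counter(answers).most_common(1)[0][0]


def hybrid_rewards(original_answers, variant_answers, alpha=0.5):
    """Score each answer sampled for the original question.

    original_answers: answers sampled for the unlabeled question.
    variant_answers: one list of sampled answers per synthesized variant.
    alpha: hypothetical weight between the two signals (an assumption).
    """
    # Accuracy proxy: with no labels, treat the majority answer on the
    # original question as a pseudo-label (an assumption of this sketch).
    pseudo_label = majority_answer(original_answers)

    # Consistency signal: the majority answer reached on each variant.
    variant_majorities = [majority_answer(a) for a in variant_answers]

    rewards = []
    for answer in original_answers:
        accuracy = 1.0 if answer == pseudo_label else 0.0
        if variant_majorities:
            matches = sum(answer == m for m in variant_majorities)
            consistency = matches / len(variant_majorities)
        else:
            consistency = 0.0
        rewards.append(alpha * accuracy + (1 - alpha) * consistency)
    return rewards


if __name__ == "__main__":
    sampled = ["4", "4", "5"]
    on_variants = [["4", "4"], ["5", "4", "4"], ["5", "5"]]
    print(hybrid_rewards(sampled, on_variants))
```

In this toy scoring, an answer earns more when it matches both the pseudo-label and the answers reached on rephrased versions of the question, so agreement that holds only for one surface form is rewarded less.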

training, reasoning, efficiency

Key Terms

reinforcement-learning-from-verifiable-rewards, test-time-adaptation, semantic-augmentation, exploration-exploitation-tradeoff