You can use your existing multi-GPU setup to search for better learning rates during training: each GPU replica trains with a slightly different rate, and the replicas are periodically synchronized, so the search adds no extra compute.
This paper proposes HDET, a method in which multiple GPU replicas, instead of computing identical updates, each train independently with a different learning rate and then synchronize periodically.
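To make the mechanism concrete, here is a minimal single-process sketch in PyTorch that mimics the idea with in-process model copies rather than actual GPU replicas. The learning-rate multipliers, the synchronization interval, and the use of simple parameter averaging as the synchronization step are illustrative assumptions, not details taken from the HDET paper.

```python
import copy
import torch
import torch.nn as nn

# Sketch: several copies of the same model train with slightly different
# learning rates and periodically synchronize by averaging their parameters.
# The averaging rule and LR spread below are assumptions for illustration.

torch.manual_seed(0)

# Toy regression data
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

base_model = nn.Linear(10, 1)
num_replicas = 4
sync_every = 10                      # steps between synchronizations (assumed)
base_lr = 1e-2
lr_spread = [0.5, 0.8, 1.25, 2.0]    # per-replica LR multipliers (assumed)

# Each replica gets its own copy of the model and its own optimizer/LR.
replicas = [copy.deepcopy(base_model) for _ in range(num_replicas)]
optims = [torch.optim.SGD(m.parameters(), lr=base_lr * s)
          for m, s in zip(replicas, lr_spread)]
loss_fn = nn.MSELoss()

for step in range(100):
    # Independent training steps on each replica
    for model, opt in zip(replicas, optims):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

    # Periodic synchronization: average parameters across replicas
    if (step + 1) % sync_every == 0:
        with torch.no_grad():
            avg_state = {
                k: torch.stack([m.state_dict()[k] for m in replicas]).mean(0)
                for k in replicas[0].state_dict()
            }
            for m in replicas:
                m.load_state_dict(avg_state)

print("final loss:", loss_fn(replicas[0](X), y).item())
```

In a real data-parallel job the replica loop would be replaced by one process per GPU, with the periodic averaging done via an all-reduce over parameters instead of explicit copies.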