A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems

Beste Oztop, Dhruva Kulkarni, Zhengji Zhao, Ayse Kivilcim Coskun, Kadidia Konate | April 2, 2026 | arXiv

Key Takeaway

GPU power and utilization can be predicted accurately from job submission metadata and performance metrics, enabling HPC systems to make smarter scheduling and power management decisions without expensive real-time monitoring.

Summary

This paper develops a two-stage system for predicting GPU resource usage and power consumption in HPC clusters. Using real workload data from NERSC's Perlmutter supercomputer, the authors report 92-97% accuracy in predicting GPU utilization and power draw, with job submission logs used either alone or in combination with GPU performance metrics. These predictions can inform scheduling and power management.
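
To make the two-stage idea concrete, here is a minimal sketch in Python. It is not the authors' method: the summary does not say what each stage predicts or which models are used, so the split below (stage 1 maps a submission-time feature to GPU utilization, stage 2 maps utilization to power), the single-feature least-squares model, the function names, and all of the numbers are assumptions made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one feature."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return slope, my - slope * mx


# Hypothetical toy history (NOT data from the paper):
# GPUs requested at submission, mean GPU utilization (%), mean GPU power (W).
requested_gpus = [1, 2, 4, 8]
gpu_util = [20.0, 30.0, 50.0, 90.0]
gpu_power = [160.0, 190.0, 250.0, 370.0]

# Stage 1 (assumed): submission metadata -> GPU utilization.
util_slope, util_bias = fit_line(requested_gpus, gpu_util)

# Stage 2 (assumed): GPU utilization -> GPU power.
power_slope, power_bias = fit_line(gpu_util, gpu_power)


def predict_job(n_gpus):
    """Chain both stages for a job that has not started running yet."""
    util = util_slope * n_gpus + util_bias
    power = power_slope * util + power_bias
    return util, power


if __name__ == "__main__":
    util, power = predict_job(6)
    print(f"predicted utilization: {util:.1f}%, predicted power: {power:.1f} W")
```

Chaining the stages this way means a power estimate is available at submission time, before any GPU metrics exist for the job. When measured metrics are available, they could be passed to the second stage in place of the stage 1 prediction.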

efficiency evaluation

Key Terms

gpu-memory workload-manager power-aware-scheduling