GPU power and utilization can be predicted accurately from job submission metadata and performance metrics, enabling HPC systems to make smarter scheduling and power management decisions without expensive real-time monitoring.
This paper develops a two-stage prediction system for GPU resource usage and power consumption in HPC clusters. Using real workload data from NERSC's Perlmutter supercomputer, the authors report 92-97% accuracy in predicting GPU utilization and power draw, using job submission logs either alone or in combination with GPU performance metrics. Such predictions can inform scheduling and power management.
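The summary does not spell out what the two stages are, so the following is only a minimal sketch under one assumed reading: a first stage that predicts power from submission metadata alone, and a second stage that corrects that estimate with a GPU performance metric. The field names (`account`, `gpus`, `sm_util`, `power_w`), the toy records, and the choice of a group-mean baseline plus a least-squares correction are all hypothetical illustrations, not the authors' models or data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records: submission metadata plus measured values.
# Field names and numbers are illustrative only, not from the paper's dataset.
JOBS = [
    {"account": "a", "gpus": 4, "sm_util": 0.9, "power_w": 380.0},
    {"account": "a", "gpus": 4, "sm_util": 0.5, "power_w": 300.0},
    {"account": "b", "gpus": 1, "sm_util": 0.2, "power_w": 140.0},
    {"account": "b", "gpus": 1, "sm_util": 0.4, "power_w": 180.0},
]


def fit_stage1(jobs):
    """Assumed stage 1: mean power per (account, gpus) group.

    Uses submission metadata only; unseen groups fall back to the overall mean.
    """
    groups = defaultdict(list)
    for job in jobs:
        groups[(job["account"], job["gpus"])].append(job["power_w"])
    overall = mean(job["power_w"] for job in jobs)
    table = {key: mean(values) for key, values in groups.items()}
    return lambda job: table.get((job["account"], job["gpus"]), overall)


def fit_stage2(jobs, stage1):
    """Assumed stage 2: least-squares line from sm_util to the stage-1 residual."""
    xs = [job["sm_util"] for job in jobs]
    residuals = [job["power_w"] - stage1(job) for job in jobs]
    mx, mr = mean(xs), mean(residuals)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (r - mr) for x, r in zip(xs, residuals))
    slope = cov / var if var else 0.0
    intercept = mr - slope * mx
    return lambda job: stage1(job) + intercept + slope * job["sm_util"]


def mse(jobs, predict):
    """Mean squared error of a predictor over a list of job records."""
    return mean((job["power_w"] - predict(job)) ** 2 for job in jobs)


stage1 = fit_stage1(JOBS)
stage2 = fit_stage2(JOBS, stage1)
print("stage 1 MSE:", mse(JOBS, stage1))
print("stage 2 MSE:", mse(JOBS, stage2))
```

On the training records, the second stage cannot have a higher mean squared error than the first, because a zero correction is one of the lines the least-squares fit can choose. A real evaluation would of course use held-out jobs and whatever accuracy metric the paper defines.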