WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs

Mauricio Fadel Argerich, Jonathan Fürst, Marta Patiño-Martínez | July 2, 2026 | arXiv

Key Takeaway

You can now predict LLM inference efficiency on GPUs you have never tested by combining public model specifications with GPU specifications. No profiling is needed, and the predictions are roughly 3-4x more accurate than physics-based estimates.

Summary

WattGPU predicts GPU power consumption and inference latency for large language models without requiring hardware profiling. It uses only public LLM metadata and GPU specifications, so it generalizes to GPU-LLM combinations it has never seen. It achieves 3-4x better accuracy than traditional baselines, which helps operators choose efficient GPU-LLM pairings.
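The summary does not say exactly how the physics-based baselines are computed. As an illustration only, a common roofline-style estimate for single-step decode latency takes the slower of two costs: streaming the weights from memory and performing the arithmetic. Power is then taken to be the GPU's thermal design power (TDP). The sketch below shows that idea. `GPUSpec`, `LLMSpec`, `roofline_itl_seconds` and `tdp_energy_per_token_joules` are hypothetical names, the spec values in the usage lines are made-up round numbers, and KV-cache traffic is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class GPUSpec:
    # Hypothetical spec-sheet fields; values are supplied by the caller.
    mem_bandwidth_bytes_per_s: float
    peak_flops: float
    tdp_watts: float


@dataclass
class LLMSpec:
    n_params: float
    bytes_per_param: float  # e.g. 2.0 for FP16/BF16 weights


def roofline_itl_seconds(llm: LLMSpec, gpu: GPUSpec, batch_size: int = 1) -> float:
    """Roofline estimate of one decode step (inter-token latency).

    Assumes every weight is read once per step and ignores KV-cache reads.
    """
    bytes_moved = llm.n_params * llm.bytes_per_param
    # Roughly 2 FLOPs (multiply + add) per parameter per generated token.
    flops = 2.0 * llm.n_params * batch_size
    memory_time = bytes_moved / gpu.mem_bandwidth_bytes_per_s
    compute_time = flops / gpu.peak_flops
    # The step is bound by whichever resource is slower.
    return max(memory_time, compute_time)


def tdp_energy_per_token_joules(llm: LLMSpec, gpu: GPUSpec, batch_size: int = 1) -> float:
    """Naive energy estimate: the GPU draws exactly its TDP for the whole step."""
    step_time = roofline_itl_seconds(llm, gpu, batch_size)
    return gpu.tdp_watts * step_time / batch_size


# Made-up example values, not taken from any real GPU or from the paper.
example_gpu = GPUSpec(mem_bandwidth_bytes_per_s=2e12, peak_flops=300e12, tdp_watts=400.0)
example_llm = LLMSpec(n_params=7e9, bytes_per_param=2.0)
itl = roofline_itl_seconds(example_llm, example_gpu)  # ≈ 0.007 s, memory-bound
```

Estimates like this depend only on spec-sheet numbers, which makes them cheap to compute. They miss effects such as real power draw sitting below TDP and kernels that fall short of peak bandwidth, and a learned predictor is positioned to capture those effects.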

efficiency, evaluation, scaling

Key Terms

inter-token latency, thermal design power, roofline model, leave-one-out cross-validation
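Leave-one-out cross-validation is what makes a claim about "unseen GPUs" testable. Each GPU is held out in turn, the predictor is fit on the remaining GPUs, and the error is measured only on the held-out one. The sketch below assumes a deliberately simple one-feature linear model fit by ordinary least squares, which is a stand-in and not WattGPU's actual model. The rows are synthetic points generated from a known line, and the names `leave_one_gpu_out`, `fit_line` and `mape_per_held_out_gpu` are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

# (gpu_name, feature, target), e.g. feature = a spec-derived estimate, target = measured value.
Row = Tuple[str, float, float]


def leave_one_gpu_out(rows: List[Row]) -> Iterator[Tuple[str, List[Row], List[Row]]]:
    """Yield (held_out_gpu, train_rows, test_rows), one split per distinct GPU."""
    for held_out in sorted({gpu for gpu, _, _ in rows}):
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        yield held_out, train, test


def fit_line(train: List[Row]) -> Tuple[float, float]:
    """Ordinary least squares for target = slope * feature + intercept."""
    xs = [x for _, x, _ in train]
    ys = [y for _, _, y in train]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)  # must be > 0 (features not all equal)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mape_per_held_out_gpu(rows: List[Row]) -> Dict[str, float]:
    """Mean absolute percentage error on each GPU when it was excluded from training."""
    results: Dict[str, float] = {}
    for gpu, train, test in leave_one_gpu_out(rows):
        slope, intercept = fit_line(train)
        errors = [abs(slope * x + intercept - y) / abs(y) for _, x, y in test]
        results[gpu] = sum(errors) / len(errors)
    return results


# Synthetic data drawn from an exact line, so a correct fit should have near-zero error.
def synthetic_target(x: float) -> float:
    return 1.3 * x + 0.002


synthetic_rows: List[Row] = [
    (gpu, x, synthetic_target(x))
    for gpu, x in [("gpu_a", 0.007), ("gpu_a", 0.014), ("gpu_b", 0.004),
                   ("gpu_b", 0.009), ("gpu_c", 0.020)]
]
errors_by_gpu = mape_per_held_out_gpu(synthetic_rows)
```

Grouping the splits by GPU, rather than by individual rows, keeps any measurement of the held-out GPU out of training. The same grouping works for holding out whole LLMs.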