For distributed model training, executing tasks based on actual readiness rather than pre-committed schedules can dramatically reduce GPU idle time and improve throughput, especially when computation times vary unpredictably.
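This paper introduces RRFP, a runtime system that improves GPU training efficiency by executing tasks as soon as they are ready instead of waiting for their turn in a pre-planned order. When large models are trained across multiple GPUs, computation times vary unpredictably, and a stage that is held to a fixed schedule sits idle whenever its next scheduled task is delayed, even if other work is already available.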
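The difference between a pre-planned order and readiness-driven execution can be illustrated with a toy simulation of a single stage. This is a minimal sketch under stated assumptions, not RRFP's actual implementation: the function names, the task labels (`F0`, `F1`, `B0`), the `(ready_at, duration)` task model, and the earliest-ready tie-break are all hypothetical, and dependencies between tasks are represented only through their ready times.

```python
# Toy model (hypothetical, not RRFP's API): each task is (ready_at, duration),
# where ready_at is the time its inputs arrive at this stage.

def run_fixed_order(tasks, order):
    """Run tasks in a pre-committed order, idling until the next one is ready."""
    clock = 0.0
    idle = 0.0
    for name in order:
        ready_at, duration = tasks[name]
        if ready_at > clock:
            # The scheduled task is late, so the stage waits for it.
            idle += ready_at - clock
            clock = ready_at
        clock += duration
    return clock, idle


def run_ready_first(tasks):
    """Run whichever task is already ready; idle only if nothing is ready."""
    pending = dict(tasks)
    clock = 0.0
    idle = 0.0
    while pending:
        ready = [n for n, (r, _) in pending.items() if r <= clock]
        if not ready:
            # Nothing can run yet: jump to the next arrival.
            next_ready = min(r for r, _ in pending.values())
            idle += next_ready - clock
            clock = next_ready
            continue
        # Assumed tie-break: pick the task that became ready earliest.
        name = min(ready, key=lambda n: pending[n][0])
        clock += pending.pop(name)[1]
    return clock, idle


if __name__ == "__main__":
    # F1's input is delayed until t=5, while B0 is ready at t=2.
    tasks = {"F0": (0.0, 2.0), "F1": (5.0, 2.0), "B0": (2.0, 3.0)}
    print(run_fixed_order(tasks, ["F0", "F1", "B0"]))  # (10.0, 3.0)
    print(run_ready_first(tasks))                      # (7.0, 0.0)
```

In this example the fixed order idles for 3 time units waiting on the delayed `F1`, whereas the readiness-driven loop fills that gap with `B0` and finishes at time 7 instead of 10. A real runtime would also have to respect data dependencies and memory limits, which this sketch leaves out.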