Accurate latency prediction under GPU contention is critical for priority-aware scheduling in inference serving—Strait reduces deadline violations for high-priority tasks by modeling interference effects that traditional systems ignore.
Strait is an ML inference serving system that improves deadline satisfaction for high-priority requests by better predicting latency under GPU contention and using priority-aware scheduling.