RL can be deployed more safely in production when learned policies adjust an existing system rather than replace it. Learning offline from delayed marketplace feedback, with conservative value estimation, guards against overoptimistic decisions.
DoorDash built a reinforcement learning system that learns to adjust how its delivery dispatch algorithm balances speed against efficiency, using real marketplace feedback. Instead of replacing the core optimizer, a learned policy selects adjustment multipliers based on delayed signals such as delivery times and courier workload, which enables safe offline learning from noisy production data.
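
To make the idea concrete, here is a minimal sketch of choosing a multiplier from logged data with a pessimistic value estimate. It is not DoorDash's implementation: the names `MULTIPLIERS`, `fit_conservative_values`, `select_multiplier`, and `penalty` are hypothetical, the problem is simplified to a single decision per state, and the count-based lower-bound penalty is only one simple stand-in for conservative value estimation, since the summary does not say which method is used.

```python
import math
from collections import defaultdict

# Hypothetical discrete multipliers applied to the optimizer's
# speed-vs-efficiency weight (assumption, not from the source).
MULTIPLIERS = [0.8, 1.0, 1.2]
DEFAULT_ACTION = 1  # index of 1.0: leave the core optimizer unchanged


def fit_conservative_values(logged, penalty=1.0):
    """Estimate a pessimistic value for each (state, action) pair.

    logged: iterable of (state, action_index, reward) tuples, where reward
    is a delayed outcome joined back to the decision offline.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, action, reward in logged:
        totals[(state, action)] += reward
        counts[(state, action)] += 1

    values = {}
    for key, n in counts.items():
        mean = totals[key] / n
        # Subtract more from rarely observed actions, so a lucky
        # outlier cannot dominate the choice.
        values[key] = mean - penalty / math.sqrt(n)
    return values


def select_multiplier(values, state):
    """Return the multiplier with the best pessimistic value.

    Actions never seen in this state are not considered, and an unseen
    state falls back to the neutral multiplier.
    """
    seen = [
        (values[(state, a)], a)
        for a in range(len(MULTIPLIERS))
        if (state, a) in values
    ]
    if not seen:
        return MULTIPLIERS[DEFAULT_ACTION]
    _, best_action = max(seen)
    return MULTIPLIERS[best_action]
```

In this sketch, a multiplier observed once with a high reward can lose to one observed many times with a slightly lower average reward, which is the behavior conservative estimation is meant to produce. The selected multiplier would then scale a weight inside the existing optimizer, so the optimizer itself stays in charge of dispatch.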