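A reinforcement learning method in which an agent learns from stored past experience, including transitions collected under earlier policies rather than only its current one (off-policy learning), and uses separate networks for action selection (the actor) and value estimation (the critic).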
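As a rough illustration of the two ideas in this definition, the Python sketch below pairs a replay buffer of past transitions with separate action-selection and value-estimation components. It is a minimal sketch under stated assumptions, not a reference implementation: small lookup tables stand in for the two networks, and the toy chain environment, the names `step`, `ReplayBuffer`, `softmax`, and `train`, and all hyperparameters are hypothetical choices made for this example.

```python
import math
import random
from collections import deque

# Hypothetical toy problem: a 4-state chain, action 1 moves right, action 0 moves left.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3
GAMMA = 0.9


def step(state, action):
    """Toy environment (an assumption): reward 1.0 on reaching GOAL, otherwise 0.0."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


class ReplayBuffer:
    """Stores past transitions so learning is not tied to the current policy."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self.data), min(batch_size, len(self.data)))


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(episodes=300, seed=0):
    rng = random.Random(seed)
    # Tables stand in for the two networks named in the definition.
    actor = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # action preferences
    critic = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # Q(s, a) estimates
    buffer = ReplayBuffer(capacity=1000)

    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 20:
            # Behaviour differs from the actor's policy: 30% of actions are random.
            if rng.random() < 0.3:
                action = rng.randrange(N_ACTIONS)
            else:
                action = rng.choices(range(N_ACTIONS), weights=softmax(actor[state]))[0]
            nxt, reward, done = step(state, action)
            buffer.add((state, action, reward, nxt, done))
            state, steps = nxt, steps + 1

            # Learn from replayed (possibly old) transitions.
            for s, a, r, s2, d in buffer.sample(8, rng):
                # Critic: move Q(s, a) toward r + gamma * expected value under the actor.
                pi2 = softmax(actor[s2])
                v2 = 0.0 if d else sum(p * q for p, q in zip(pi2, critic[s2]))
                critic[s][a] += 0.2 * (r + GAMMA * v2 - critic[s][a])
                # Actor: raise preferences for actions the critic rates above average.
                pi = softmax(actor[s])
                v = sum(p * q for p, q in zip(pi, critic[s]))
                for b in range(N_ACTIONS):
                    actor[s][b] += 0.5 * pi[b] * (critic[s][b] - v)
    return actor, critic


if __name__ == "__main__":
    actor, critic = train()
    for s in range(GOAL):
        print(s, [round(p, 3) for p in softmax(actor[s])])
```

Because the critic's target averages over the actor's own action probabilities rather than the action that was actually stored, transitions gathered by older or more exploratory behaviour remain usable. In a deep variant the two tables would be replaced by neural networks trained by gradient descent on the same kind of sampled batches.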