Starting with a policy trained on fixed offline data, then improving it through interaction with the environment.