[paper] title: Asynchronous Methods for Deep Reinforcement Learning.
classification
model-free, on-policy, policy-based, continuous state, continuous/discrete action space.
background
The combination of simple online RL algorithms with deep neural networks was fundamentally unstable. The reason is that the sequence of observations encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched or randomly sampled from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time it limits these methods to off-policy reinforcement learning algorithms.
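For reference, here is a minimal Python sketch of an experience replay buffer; it is not from the paper, and the class and method names (ReplayBuffer, push, sample) are illustrative. It only shows how storing transitions and sampling them uniformly at random mixes data from many time-steps, which is what decorrelates the updates.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Oldest transitions are dropped once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition observed by the online agent.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling mixes transitions from many different
        # time-steps, decorrelating the mini-batch used for an update.
        return random.sample(self.buffer, batch_size)
```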
Problem
Experience replay has several drawbacks: it uses more memory and computation per real interaction, and it requires off-policy learning algorithms that can update from data generated by an older policy.
reasons
xxx
solution
Instead of experience replay, multiple agents are executed asynchronously in parallel, each on its own instance of the environment. This parallelism decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents are experiencing a variety of different states. Each agent computes gradients in its own environment and accumulates them into d\theta; the accumulated d\theta is then used to asynchronously update the shared (global) network parameters \theta. A sketch of this pattern follows.
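The following is a minimal sketch of this accumulate-then-apply pattern, not the paper's implementation: the environment interaction and gradient computation are replaced by a toy stub (toy_gradient), and a lock is used for clarity even though the paper applies updates in a lock-free (Hogwild-style) manner. All names here are hypothetical.

```python
import threading
import numpy as np

shared_theta = np.zeros(8)        # global (shared) network parameters theta
lock = threading.Lock()
ALPHA, T_MAX = 1e-3, 5            # learning rate, steps accumulated per update

def toy_gradient(theta, rng):
    # Stand-in for the real policy/value gradient from one interaction.
    return rng.normal(size=theta.shape)

def worker(worker_id, num_updates=100):
    global shared_theta
    rng = np.random.default_rng(worker_id)           # each worker: own environment instance
    for _ in range(num_updates):
        d_theta = np.zeros_like(shared_theta)        # gradient accumulator d_theta
        for _ in range(T_MAX):
            d_theta += toy_gradient(shared_theta, rng)   # add each step's gradient
        with lock:                                   # apply accumulated gradient to shared theta
            shared_theta -= ALPHA * d_theta

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the workers run in parallel on different environment copies, the gradients arriving at the shared parameters at any moment come from a variety of states, which is the decorrelation effect the method relies on.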