Chao Huang

Ph.D. Candidate

Reinforcement Learning -- A3C

[paper][paper] title: Asynchronous Methods for Deep Reinforcement Learning.

classification

Model-free, on-policy, policy-based; continuous state space; continuous/discrete action space.

background

The combination of simple online RL algorithms with deep neural networks was fundamentally unstable. The reason is that the sequence of observations encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched or randomly sampled from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time it limits these methods to off-policy reinforcement learning algorithms.
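As a concrete illustration, here is a minimal sketch of an experience replay buffer (illustrative names, not the paper's code): transitions from many past time-steps are stored and sampled uniformly at random, which decorrelates the mini-batch updates but forces the learner to be off-policy, since the sampled data came from older policies.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay buffer (sketch for illustration only)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling mixes transitions from different time-steps,
        # breaking the temporal correlation of online data.
        return random.sample(self.buffer, batch_size)
```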

Problem

Experience replay has several drawbacks: it uses more memory and computation per real interaction, and it requires off-policy learning algorithms that can update from data generated by an older policy.

reasons

xxx

solution

Instead of experience replay, multiple agents are executed asynchronously in parallel, each on its own instance of the environment. This parallelism decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents are experiencing a variety of different states. Each agent computes gradients in its own environment, accumulates them into d\theta over a short rollout, and then applies d\theta asynchronously to the shared (global) network.
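Below is a toy, self-contained sketch of this accumulate-and-apply pattern, not the paper's actual implementation: each worker thread uses a simple quadratic loss with its own target (standing in for a separate environment instance), accumulates gradients over a short rollout, and applies them to the shared parameters without locking. Python threads are limited by the GIL, so this only illustrates the structure, not a real speed-up.

```python
import threading
import numpy as np

theta = np.zeros(4)                                   # shared (global) parameters
targets = [np.full(4, t) for t in (1.0, 2.0, 3.0)]    # one toy "environment" per worker

def worker(target, steps=200, t_max=5, lr=0.01):
    global theta
    for _ in range(steps):
        local = theta.copy()                   # sync the local copy with the shared params
        d_theta = np.zeros_like(local)         # gradient accumulator d(theta)
        for _ in range(t_max):                 # accumulate gradients over a short rollout
            d_theta += 2.0 * (local - target)  # toy gradient of ||local - target||^2
        theta -= lr * d_theta                  # apply accumulated gradients asynchronously

threads = [threading.Thread(target=worker, args=(t,)) for t in targets]
for th in threads:
    th.start()
for th in threads:
    th.join()
print("shared parameters after training:", theta)
```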

benefits analysis

First, this simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively with deep neural networks. Second, it achieves better results in far less time than previous GPU-based algorithms, while using far fewer resources than massively distributed approaches. Third, it mastered a variety of continuous motor control tasks and learned general strategies for exploring 3D mazes purely from visual inputs.

issue one

xxx

Experiment environment

The paper presents multi-threaded asynchronous variants of one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic, evaluated primarily on Atari 2600 games.
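To clarify what the n-step and actor-critic variants share, here is a small, self-contained sketch of the n-step return R_t = r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1} + gamma^n * V(s_{t+n}); the advantage is then R_t - V(s_t). The reward and discount values below are made up for illustration.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Compute n-step returns for a short rollout (illustrative sketch)."""
    # Work backwards from the value estimate of the last state in the rollout.
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

# Example: a 5-step rollout with toy rewards and a bootstrapped V(s_{t+5}) = 0.5.
rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
print(n_step_return(rewards, bootstrap_value=0.5))
```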

Metrics

Scores (rewards) obtained in the games.

[paper]: https://arxiv.org/abs/1602.01783

^_^