Chao Huang

Ph.D. Candidate

Reinforcement Learning -- Deterministic Policy Gradient (DPG)

Paper title: Deterministic Policy Gradient Algorithms (Silver et al., ICML 2014).

Classification

Model-free; policy-based (actor-critic), with both on-policy and off-policy variants; continuous state space; continuous action space.

Problem

Computing the stochastic policy gradient may require many more samples than the deterministic one, especially when the action space has many dimensions.

Reasons

In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space.
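
To make the sample-cost claim concrete, here is a toy numerical sketch (my own illustration, not from the paper). It compares a likelihood-ratio (score-function) estimate of the stochastic policy gradient for a Gaussian policy against the exact deterministic gradient on a simple quadratic reward over a 20-dimensional action; for this particular reward the two gradients coincide, so the deterministic value serves as ground truth. All names and constants are illustrative assumptions.

# Toy sketch (assumed setup, not from the paper): compare a likelihood-ratio
# (score-function) estimate of the stochastic policy gradient against the
# exact deterministic gradient for the quadratic reward r(a) = -||a - a_star||^2.
import numpy as np

rng = np.random.default_rng(0)
dim = 20                          # action dimensionality
a_star = np.ones(dim)             # optimum of the toy reward
theta = np.zeros(dim)             # policy parameter (mean of the Gaussian policy)
sigma = 0.5                       # exploration noise of the stochastic policy

def reward(a):
    return -np.sum((a - a_star) ** 2)

def stochastic_grad(n_samples):
    """Estimate E[ grad_theta log pi_theta(a) * r(a) ] from sampled actions."""
    actions = theta + sigma * rng.standard_normal((n_samples, dim))
    scores = (actions - theta) / sigma**2          # grad_theta log N(a; theta, sigma^2 I)
    returns = np.array([reward(a) for a in actions])
    return (scores * returns[:, None]).mean(axis=0)

# Deterministic gradient: differentiate the reward at a = theta directly.
# For this quadratic reward it equals the true stochastic policy gradient,
# so it serves as ground truth for the comparison.
deterministic_grad = -2.0 * (theta - a_star)

for n in (10, 100, 10000):
    estimate = stochastic_grad(n)
    rel_err = np.linalg.norm(estimate - deterministic_grad) / np.linalg.norm(deterministic_grad)
    print(f"n = {n:6d}   relative error of the stochastic estimate: {rel_err:.2f}")

With only a handful of samples the stochastic estimate is dominated by noise, while the deterministic gradient is obtained from a single analytic evaluation.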

Solution

Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces. The basic idea is to represent the policy by a parametric probability distribution \pi_{\theta}(a|s) that stochastically selects an action a in state s according to a parameter vector \theta. The paper instead considers deterministic policies a = \mu_{\theta}(s).

Benefits analysis

The deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, the paper introduces an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy (see the sketch below).
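
A minimal sketch of such an off-policy deterministic actor-critic update follows. The linear actor, the simple concatenated-feature critic (instead of the paper's compatible function approximator), the step sizes, and the single synthetic transition are all assumptions made for illustration, not the paper's exact construction.

# Minimal sketch of an off-policy deterministic actor-critic update.
# Assumptions (not from the paper): linear actor mu_theta(s) = theta @ s,
# simple linear critic Q_w(s, a) = w @ [s; a], hand-picked step sizes,
# and one synthetic transition in place of a real environment.
import numpy as np

state_dim, action_dim = 3, 2
rng = np.random.default_rng(0)

theta = np.zeros((action_dim, state_dim))   # actor parameters
w = np.zeros(state_dim + action_dim)        # critic parameters
alpha_actor, alpha_critic, gamma = 1e-3, 1e-2, 0.99

def mu(s):
    """Deterministic target policy mu_theta(s)."""
    return theta @ s

def phi(s, a):
    """Critic features (assumed): state and action concatenated."""
    return np.concatenate([s, a])

def q(s, a):
    return w @ phi(s, a)

def grad_a_q(s, a):
    """Gradient of Q_w with respect to the action (constant for this critic)."""
    return w[state_dim:]

def behaviour_policy(s, noise=0.3):
    """Exploratory behaviour policy: target policy plus Gaussian noise."""
    return mu(s) + noise * rng.standard_normal(action_dim)

def update(s, a, r, s_next):
    global theta, w
    # Critic: Q-learning-style TD update, bootstrapping with the target policy mu.
    td_error = r + gamma * q(s_next, mu(s_next)) - q(s, a)
    w = w + alpha_critic * td_error * phi(s, a)
    # Actor: ascend grad_theta mu_theta(s) * grad_a Q_w(s, a) evaluated at a = mu_theta(s).
    theta = theta + alpha_actor * np.outer(grad_a_q(s, mu(s)), s)

# One illustrative update on a synthetic off-policy transition.
s = rng.standard_normal(state_dim)
a = behaviour_policy(s)                     # data generated by the behaviour policy
r, s_next = 1.0, rng.standard_normal(state_dim)
update(s, a, r, s_next)

The key structural point is visible even in this toy form: the critic is trained off-policy from behaviour-policy data while bootstrapping with the target policy \mu, and the actor is moved along \nabla_{\theta}\mu_{\theta}(s) \nabla_{a} Q_w(s, a) evaluated at a = \mu_{\theta}(s).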

Math

The density at state s' after transitioning for t time steps from state s is denoted p(s \to s', t, \pi). We also denote the improper discounted state distribution by p_{\pi}(s') = \int_{S} \sum_{t=1}^{\infty} \gamma^{t-1} p_0(s) p(s \to s', t, \pi) \, ds, where p_0 is the initial state density. We can then write the performance objective as an expectation: J(\pi_{\theta}) = \int_{S} p_{\pi}(s) \int_{A} \pi_{\theta}(s, a) r(s, a) \, da \, ds = E_{s \sim p_{\pi}, a \sim \pi_{\theta}}[r(s, a)], i.e. the expected total discounted reward E[\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t)].

The gradient of this performance objective with respect to the policy parameters is given by the policy gradient theorem: \nabla_{\theta} J(\pi_{\theta}) = \int_{S} p_{\pi}(s) \int_{A} \nabla_{\theta} \pi_{\theta}(s, a) Q^{\pi}(s, a) \, da \, ds.
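
For contrast, the paper's deterministic policy gradient theorem removes the inner integral over actions. The stochastic gradient above can be written as an expectation over both states and actions, E_{s \sim p_{\pi}, a \sim \pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a|s) Q^{\pi}(s, a)], whereas for a deterministic policy a = \mu_{\theta}(s), with discounted state distribution p_{\mu}(s) defined as above, the paper shows \nabla_{\theta} J(\mu_{\theta}) = \int_{S} p_{\mu}(s) \nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)|_{a=\mu_{\theta}(s)} \, ds = E_{s \sim p_{\mu}}[\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)|_{a=\mu_{\theta}(s)}], an expectation over the state space only, which is the formal basis for the sample-efficiency argument above.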

Experiment environment

A high-dimensional bandit; several standard benchmark reinforcement learning tasks with low-dimensional action spaces; and a high-dimensional task for controlling an octopus arm.

Metrics

Total reward obtained on each task.

^_^