Chao Huang

Ph.D. Candidate

Reinforcement Learning -- PPO

paper title: Proximal Policy Optimization Algorithms.

classification

model-free, on-policy, policy-based, continuous state space, continuous action space.

background

Q-learning (with function approximation) fails on many simple problems and is poorly understood; vanilla policy gradient methods have poor data efficiency and robustness; and trust region policy optimization (TRPO) is relatively complicated and is not compatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks).

Problem

This paper seeks to improve the current state of affairs by introducing an algorithm that attains the data efficiency and reliable performance of TRPO, while using only first-order optimization.

reasons

In TRPO, an objective function (the “surrogate” objective) is maximized subject to a constraint on the size of the policy update. This problem can be solved approximately and efficiently with the conjugate gradient algorithm, after making a linear approximation to the objective and a quadratic approximation to the constraint. Hence, to achieve our goal of a first-order algorithm that emulates the monotonic improvement of TRPO, we can use a penalty on the KL divergence (with coefficient β) instead of a constraint. The reason why TRPO uses a hard constraint rather than a penalty is that it is hard to choose a single value of β that performs well across different problems, or even within a single problem, where the characteristics change over the course of learning.
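As a concrete illustration of the penalty alternative, the same paper also describes an adaptive KL penalty variant that adjusts β after each policy update so that the measured KL divergence stays near a target d_targ. A minimal Python sketch (the function name update_beta and its argument names are illustrative, not from the paper):

    def update_beta(beta, kl, kl_target):
        # beta: current KL-penalty coefficient; kl: mean KL divergence between the
        # old and new policy measured after the update; kl_target: desired KL (d_targ).
        if kl < kl_target / 1.5:
            beta /= 2.0   # penalty too strong, the policy barely moved: relax it
        elif kl > kl_target * 1.5:
            beta *= 2.0   # policy moved too far: strengthen the penalty
        return beta

The updated β is then used as the coefficient of the KL term in the penalized surrogate objective for the next policy update.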

solution

Clipped Surrogate Objective. Let r_t(\theta) denote the probability ratio r_t(\theta) = \pi_\theta(a_t | s_t) / \pi_{\theta_old}(a_t | s_t). The surrogate objective function in TRPO is L^{CPI}(\theta) = E_t[r_t(\theta) A_t], where A_t is an estimate of the advantage at timestep t. Without a constraint, maximization of L^{CPI} would lead to an excessively large policy update; hence, we now consider how to modify the objective to penalize changes to the policy that move r_t(\theta) away from 1. The main objective proposed is L^{CLIP}(\theta) = E_t[min(r_t(\theta) A_t, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t)], where \epsilon is the clipping hyperparameter (e.g., \epsilon = 0.2).
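A minimal sketch of L^{CLIP} as a per-minibatch training loss in PyTorch (a sketch under assumptions, not the paper's reference code: log_probs_new, log_probs_old, advantages, and the name clipped_surrogate_loss are illustrative and would be supplied by the surrounding training loop):

    import torch

    def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
        # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space
        ratio = torch.exp(log_probs_new - log_probs_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        # The element-wise minimum gives a pessimistic (lower) bound on the unclipped
        # objective; negate so a gradient-descent optimizer maximizes it.
        return -torch.min(unclipped, clipped).mean()

Because only the minimum of the clipped and unclipped terms is kept, a change in the ratio is ignored exactly when it would improve the objective beyond the clipped range, so there is no incentive to move r_t(\theta) outside [1 - \epsilon, 1 + \epsilon].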

benefits analysis

The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).

issue one

xxx

Experiment environment

Continuous Domain: Humanoid Running and Steering.

Metrics

cumulative episode reward (return) in each task.
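For illustration, a minimal rollout sketch showing how this metric (the total reward accumulated over an episode) is collected. It assumes the Gymnasium MuJoCo humanoid task as a stand-in for the Roboschool environments used in the paper; the environment id "Humanoid-v4" and the random-action placeholder for the learned policy are assumptions, not part of the paper:

    import gymnasium as gym

    env = gym.make("Humanoid-v4")
    obs, _ = env.reset(seed=0)
    episode_return, done = 0.0, False
    while not done:
        action = env.action_space.sample()   # placeholder for the trained PPO policy
        obs, reward, terminated, truncated, _ = env.step(action)
        episode_return += reward             # reward summed over the episode
        done = terminated or truncated
    print("episode return:", episode_return)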

^_^