Paper title: Human-level control through deep reinforcement learning.
Classification
Model-free, off-policy, value-based; continuous state space, discrete action space.
The goal is to find the optimal policy that maximizes cumulative discounted future reward. The optimal action-value function is $Q^*(s,a)=\max_{\pi}\mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t=s,\, a_t=a,\, \pi\right]$. In this paper, a deep neural network (the Q-network) is used to approximate $Q^*$.
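For concreteness, the paper's Q-network takes a stack of four 84x84 grayscale frames and outputs one Q-value per discrete action, via three convolutional layers and two fully connected layers. Below is a minimal sketch of that architecture; the framework (PyTorch) and the class name are my own choices, not the authors', but the layer shapes follow the paper.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Sketch of the paper's Q-network: input is a stack of 4 grayscale
    84x84 frames, output is one Q-value per discrete action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9 -> 7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # Q(s, a) for every action in one forward pass
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

Computing Q-values for all actions in a single forward pass (rather than one pass per action) is what makes the greedy $\max_{a}$ step cheap at both training and play time.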
The Q-learning update minimizes the loss $L(\theta)=\mathbb{E}_{(s,a,r,s')\sim U(D)}\left[\left(r + \gamma \max_{a'}Q(s',a'; \theta^-) - Q(s,a; \theta)\right)^2\right]$, where $D$ is the experience-replay memory sampled uniformly and $\theta^-$ are the parameters of a target network that is only periodically synced with the online parameters $\theta$.
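A minimal sketch of one evaluation of that loss, assuming a replay buffer that yields uniform mini-batches of `(s, a, r, s', done)` tensors and the `DQN` sketch above; the function and variable names are mine, not from the paper.

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    """Squared TD error: r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta),
    averaged over a uniformly sampled replay mini-batch."""
    s, a, r, s_next, done = batch
    # Q(s, a; theta) for the actions actually taken in the batch.
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Bootstrapped target uses the frozen target network theta^-; no gradients
    # flow through it, per the paper's target-network trick.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next  # no bootstrap past terminal states
    return F.mse_loss(q_sa, target)
```

During training, `target_net` is refreshed every C steps with a copy of the online weights (e.g. `target_net.load_state_dict(online_net.state_dict())`); this, together with uniform replay sampling, is what the paper credits with stabilizing the otherwise divergent nonlinear Q-learning update.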
Experiment environment
Atari 2600 platform (49 games, raw pixel input).
Metrics
Game score (cumulative in-game reward), reported relative to a professional human games tester.