
Lecture 14: Reinforcement Learning

What is reinforcement learning?


In state \(s_t\), the agent takes an action \(a_t\), receives a reward \(r_t\), and moves to the next state \(s_{t+1}\).
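
The interaction loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyEnv`, its `step` method, and `run_episode` are hypothetical names invented for this example, not part of the lecture or of any library.

```python
import random


class ToyEnv:
    """Hypothetical 1-D corridor: states 0..4, reward 1.0 on reaching state 4."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the state is clipped to [0, 4]
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Roll out one episode and return a list of (s_t, a_t, r_t, s_{t+1})."""
    trajectory = []
    state = env.state
    for _ in range(max_steps):
        action = policy(state)                       # agent picks a_t in s_t
        next_state, reward, done = env.step(action)  # environment returns r_t and s_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


# A random policy, seeded so the rollout is reproducible.
random.seed(0)
trajectory = run_episode(ToyEnv(), lambda s: random.choice([-1, 1]))
```

Each tuple in `trajectory` is one pass through the loop: state, action, reward, next state.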

Cart-Pole Problem


Robot Locomotion


Atari Games


Go


Markov Decision Process

  • Mathematical formulation of the RL problem.
  • Markov Property: Current state completely characterizes the state of the world.

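
In the standard formulation (the symbol names below are the conventional ones and are assumed here), an MDP is specified by the tuple

\[ (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma) \]

where \(\mathcal{S}\) is the set of possible states, \(\mathcal{A}\) the set of possible actions, \(\mathcal{R}\) the distribution of the reward given a (state, action) pair, \(\mathbb{P}\) the transition probability over the next state given a (state, action) pair, and \(\gamma\) the discount factor. A policy \(\pi\) maps each state to an action (or to a distribution over actions), and the goal is to find the policy that maximizes the cumulative discounted reward \(\sum_{t \ge 0} \gamma^{t} r_t\).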

A simple MDP: Grid World


The optimal policy \(\pi^{*}\)

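
Because the initial state, the actions, and the transitions are all random, the optimal policy is usually defined through an expectation. Written out under the usual conventions (assumed notation):

\[ \pi^{*} = \arg\max_{\pi} \, \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, \pi \right] \]

with \(s_0 \sim p(s_0)\), \(a_t \sim \pi(\cdot \mid s_t)\), and \(s_{t+1} \sim p(\cdot \mid s_t, a_t)\).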

Value Function and Q-value Function

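
In the usual notation (assumed here), the value function measures how good a state is, and the Q-value function measures how good a (state, action) pair is, both as expected cumulative discounted reward when following policy \(\pi\):

\[ V^{\pi}(s) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s, \pi \right] \]

\[ Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s, a_0 = a, \pi \right] \]

The optimal Q-value function is \(Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)\).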

Bellman Equation

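
The standard form of the Bellman equation for the optimal Q-value function, in the same assumed notation as \(Q^{*}(s, a)\) used later in these notes, is

\[ Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right] \]

Intuitively: if the optimal values \(Q^{*}(s', a')\) at the next step are known, the best strategy is to take the action that maximizes the expected value of \(r + \gamma Q^{*}(s', a')\). The optimal policy then simply picks \(\arg\max_{a} Q^{*}(s, a)\) in every state.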

Solving for the optimal policy: Q-learning

Q-learning uses a function approximator to estimate the action-value function:

\[ Q(s, a; \theta) \approx Q^*(s, a) \]

where \(\theta\) denotes the function parameters (weights).

If the function approximator is a deep neural network, this is called deep Q-learning.

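
The parameters \(\theta\) are typically trained so that \(Q(s, a; \theta)\) satisfies the Bellman equation as closely as possible. A common way to write the training objective at iteration \(i\) (the notation is an assumption, following the original deep Q-learning formulation):

\[ L_i(\theta_i) = \mathbb{E}_{s, a}\left[ \left( y_i - Q(s, a; \theta_i) \right)^{2} \right], \qquad y_i = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right] \]

The forward pass evaluates this loss; the backward pass updates \(\theta_i\) by gradient descent, pulling the Q-value toward the target \(y_i\).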

Case Study: Playing Atari Games

  • Objective: Complete the game with the highest score.
  • State: Raw pixel inputs of the game state.
  • Action: Game controls, e.g. left, right...
  • Reward: Score increase/decrease at each time step.

Q-network Architecture


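In this example the network outputs four values: the Q-values for up, down, left, and right. The most recent four frames are stacked as the input used to predict the Q-values.

The number of actions ranges from 4 to 18, depending on the Atari game.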
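
As a rough sanity check, the sketch below traces tensor shapes through a network of this kind. The layer sizes (84x84 input, two unpadded convolutions, a 256-unit hidden layer) are taken from the original DQN paper and are an assumption here, since these notes do not list them; `conv_out` and `q_network_shapes` are hypothetical helper names.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1


def q_network_shapes(num_actions, frame_size=84, num_frames=4):
    """Return (layer name, output shape) pairs for a DQN-style Q-network."""
    shapes = [("input", (num_frames, frame_size, frame_size))]
    s1 = conv_out(frame_size, kernel=8, stride=4)   # 16 filters of 8x8, stride 4
    shapes.append(("conv1", (16, s1, s1)))
    s2 = conv_out(s1, kernel=4, stride=2)           # 32 filters of 4x4, stride 2
    shapes.append(("conv2", (32, s2, s2)))
    shapes.append(("fc1", (256,)))                  # fully connected hidden layer
    shapes.append(("q_values", (num_actions,)))     # one Q-value per action
    return shapes


for name, shape in q_network_shapes(num_actions=4):
    print(name, shape)
```

Because the last layer has one output per action, a single forward pass yields the Q-values of all actions for the current state.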

Experience Replay

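
Experience replay stores past transitions \((s_t, a_t, r_t, s_{t+1})\) and trains the Q-network on random minibatches drawn from that store, rather than on consecutive, highly correlated samples. A minimal sketch, assuming a fixed-capacity buffer with uniform sampling; the class name `ReplayBuffer` and its methods are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

A training loop would call `push` after every environment step and, once the buffer holds enough transitions, call `sample` to build each minibatch.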

Policy Gradients

What is a problem with Q-learning?

The Q-function can be very complicated!

Example: a robot grasping an object has a very high-dimensional state, so it is hard to learn the exact value of every (state, action) pair. But the policy can be much simpler: just close your hand. Can we learn a policy directly, e.g. by finding the best policy from a collection of policies?


Here we define the value of a policy as the expected cumulative reward obtained by following it.


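The expected cumulative reward of a policy is simply the average of the reward over all trajectories, weighted by the probability of each trajectory.

Skipping the lengthy mathematical derivation, we arrive at the following result: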
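
In symbols, for a policy \(\pi_\theta\) with parameters \(\theta\) and a trajectory \(\tau = (s_0, a_0, r_0, s_1, \dots)\) (the notation is an assumption, following the usual presentation):

\[ J(\theta) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, \pi_\theta \right] = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[ r(\tau) \right] = \int_{\tau} r(\tau) \, p(\tau; \theta) \, d\tau \]

The goal is to find \(\theta^{*} = \arg\max_{\theta} J(\theta)\), for example by gradient ascent on \(\theta\).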

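
Presumably this is the standard REINFORCE estimator; under that assumption the key steps are as follows. Using \(\nabla_\theta p(\tau; \theta) = p(\tau; \theta) \nabla_\theta \log p(\tau; \theta)\), the gradient becomes an expectation that can be estimated by sampling:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[ r(\tau) \, \nabla_\theta \log p(\tau; \theta) \right] \]

Since \(\log p(\tau; \theta) = \sum_{t \ge 0} \left[ \log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \right]\) and the transition terms do not depend on \(\theta\), a single sampled trajectory gives

\[ \nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \]

so the estimate does not require knowing the transition probabilities.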

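The update pushes up the probability of actions that led to a high reward and pushes down the probability of actions that did not. After many training iterations, we expect the better actions to stand out.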
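
A toy version of this update can be run on a two-armed bandit, where each "trajectory" is a single action. This is a sketch under stated assumptions, not the lecture's setup: the softmax policy over per-action logits, the deterministic rewards, and the names `softmax` and `reinforce_bandit` are all chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=500, lr=0.1, seed=0):
    """REINFORCE on a hypothetical bandit with a fixed reward per action."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)          # policy parameters theta
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        for a in range(len(logits)):
            # d/d logit_a of log pi(action) = 1[a == action] - pi(a)
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            # Gradient ascent step, scaled by the reward that was received.
            logits[a] += lr * reward * grad_log_pi
    return softmax(logits)


final_probs = reinforce_bandit(rewards=[0.0, 1.0])
```

With these rewards, only samples of the second action change the parameters, and each such sample raises its probability, so `final_probs` ends up concentrated on that action.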

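However, learning can suffer from high variance: the estimate fluctuates strongly from one training iteration to the next, which makes the learning process unstable. In addition, in reinforcement learning, and especially when rewards are delayed, it is very hard to determine how much a particular action contributed to future reward. For example, an action may only show a noticeable effect much later. This makes credit assignment difficult and further increases the variance of the estimator. Because high variance seriously destabilizes learning, some technique is needed to reduce it.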
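
Two widely used variance-reduction ideas, named here as assumptions because the text above does not specify which techniques it means: weight each action only by the discounted reward that follows it, and subtract a baseline from that return. The helpers `discounted_returns` and `advantages` are hypothetical names for this sketch.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every time step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(rewards, gamma):
    """Subtract a simple baseline (the mean return) from each return.

    Assumes a non-empty reward list.
    """
    returns = discounted_returns(rewards, gamma)
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

The within-episode mean is only a stand-in to keep the example short; in practice the baseline is often a moving average of returns over past episodes or a learned value function.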
