These notes summarize Professor David Silver's lecture (link).
0. Introduction
In the last lecture we approximated the value or action-value function using parameters $\theta$. The policy was then generated directly from the value function (e.g. $\epsilon$-greedy)
$$\begin{array}{rcl}
V_{\theta}(s) & \approx & V^{\pi}(s) \\
Q_{\theta}(s,a) & \approx & Q^{\pi}(s,a)
\end{array}$$
In this lecture we will directly parametrize the policy and focus again on model-free RL.
$$\pi_{\theta}(s,a) = \mathbb{P}\left[ a|s,\theta \right]$$
(1) Policy-Based RL
- Advantages: better convergence properties; effective in high-dimensional or continuous action spaces; can learn stochastic policies
- Disadvantages: typically converges to a local rather than a global optimum; evaluating a policy is typically inefficient and high-variance
[Example: deterministic & stochastic policy]
(2) Policy Search
[Policy Objective Functions]
- Goal: given policy $\pi_{\theta}(s,a)$ with parameters $\theta$, find best $\theta$.
- But how do we measure the quality of a policy $\pi_{\theta}$?
- In episodic environments, use the start value
$$J_{1}(\theta) = V^{\pi_{\theta}}(s_{1})=\mathbb{E}_{\pi_{\theta}}\left[ v_{1} \right]$$
- In continuing environments, use the average value
$$J_{\text{av}V}(\theta) = \sum_{s}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)$$
- or the average reward per time-step
$$J_{\text{av}R}(\theta) = \sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s,a)\mathcal{R}^{a}_{s}$$
- (where $d^{\pi_{\theta}}(s)$ is the stationary distribution of the Markov chain induced by $\pi_{\theta}$)
[Policy Optimization]
- Policy-based reinforcement learning is an optimization problem: find $\color{red}\theta$ that maximizes $\color{red}J(\theta)$
- Some approaches do not use gradients (e.g. hill climbing, genetic algorithms, ...)
- Greater efficiency is often possible using gradients: gradient descent, conjugate gradient, quasi-Newton
- In this lecture, we focus on gradient descent and on methods that exploit sequential structure
1. Finite Difference Policy Gradient
[Policy Gradient]
- Let $J(\theta)$ be any policy objective function
- Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the policy, w.r.t parameters $\theta$
$$\Delta\theta = \alpha\nabla_{\theta}J(\theta)$$
- where $\nabla_{\theta}J(\theta)$ is the policy gradient and $\alpha$ is a step-size parameter
$$\nabla_{\theta}J(\theta) = \begin{pmatrix}
\frac{\partial J(\theta)}{\partial\theta_{1}} \\
\vdots \\
\frac{\partial J(\theta)}{\partial\theta_{n}}
\end{pmatrix}$$
[Computing Gradients By Finite Differences]
- To evaluate the policy gradient of $\pi_{\theta}(s,a)$, for each dimension $k \in [1,n]$
- Estimate the $k$th partial derivative of the objective function w.r.t $\theta$
- By perturbing $\theta$ by a small amount $\epsilon$ in the $k$th dimension, where $u_{k}$ is the unit vector with $1$ in the $k$th dimension and $0$ elsewhere
$$\frac{\partial J(\theta)}{\partial\theta_{k}} \approx \frac{J(\theta+\epsilon u_{k})-J(\theta)}{\epsilon}$$
- Uses $n$ evaluations to compute policy gradient in $n$ dimensions
- Simple, noisy, and inefficient, but sometimes effective (it works for arbitrary policies, even if the policy is not differentiable); a small sketch is given below
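As a rough illustration, here is a minimal NumPy sketch of the finite-difference estimate above. The function `evaluate_policy` is a hypothetical stand-in for whatever noisy estimate of $J(\theta)$ is available (e.g. the average return of a few rollouts).

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=0.01):
    """Estimate the policy gradient of J(theta) by perturbing each of the
    n parameter dimensions separately (n evaluations of the objective)."""
    n = theta.shape[0]
    grad = np.zeros(n)
    j_theta = evaluate_policy(theta)          # J(theta) at the current parameters
    for k in range(n):
        u_k = np.zeros(n)
        u_k[k] = 1.0                          # unit vector along dimension k
        grad[k] = (evaluate_policy(theta + eps * u_k) - j_theta) / eps
    return grad

# A gradient-ascent step would then be: theta = theta + alpha * grad
```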
2. Monte-Carlo Policy Gradient
(1) Score Function
- We now compute the policy gradient analytically
- Assume the policy $\pi_{\theta}$ is differentiable whenever it is non-zero, and that we know its gradient $\nabla_{\theta}\pi_{\theta}(s,a)$
- Likelihood ratios exploit the following identity
$$\begin{array}{rcl}
\nabla_{\theta}\pi_{\theta}(s,a) & = & \pi_{\theta}(s,a)\frac{\nabla_{\theta}\pi_{\theta}(s,a)}{\pi_{\theta}(s,a)} \\
& = & \pi_{\theta}(s,a)\nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)
\end{array}$$
- The score function is $\nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)$ (this value tells the policy in which direction to adjust so as to obtain more reward)
[Softmax Policy(discrete action space)]
- We will use a softmax policy as a running example
- Weight actions using a linear combination of features $\phi(s,a)^{\text{T}}\theta$
- Probability of actions is proportional to exponentiated weight
$$\pi_{\theta}(s,a) \propto e^{\phi(s,a)^{\text{T}}\theta}$$
- The score function is
$$\nabla_{\theta}\text{log} \ \pi_{\theta}(s,a) = \phi(s,a)-\mathbb{E}_{\pi_{\theta}}\left[ \phi(s,\cdot) \right]$$
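A minimal sketch of the softmax policy and its score function, assuming a hypothetical feature function `phi(s, a)` that returns a NumPy vector:

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(s, a) proportional to exp(phi(s, a)^T theta)."""
    prefs = np.array([phi(s, a) @ theta for a in actions])
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def softmax_score(theta, phi, s, a, actions):
    """Score function: phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(theta, phi, s, actions)
    expected_phi = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, a) - expected_phi
```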
[Gaussian Policy(continuous action space)]
- In continuous action spaces, a Gaussian policy is natural
- The mean is a linear combination of state features, $\mu(s) = \phi(s)^{\text{T}}\theta$. The variance may be fixed at $\sigma^{2}$, or it can also be parameterized
- Policy is Gaussian, $a \sim \mathcal{N}(\mu(s), \sigma^{2})$
- The score function is
$$\nabla_{\theta}\text{log} \ \pi_{\theta}(s,a) = \frac{\left( a-\mu(s) \right)\phi(s)}{\sigma^{2}}$$
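And a corresponding sketch for the Gaussian policy, assuming a state-feature function `phi(s)` and a fixed standard deviation `sigma`:

```python
import numpy as np

def gaussian_sample(theta, phi, s, sigma, rng):
    """Sample a ~ N(mu(s), sigma^2) with mean mu(s) = phi(s)^T theta."""
    return rng.normal(phi(s) @ theta, sigma)

def gaussian_score(theta, phi, s, a, sigma):
    """Score function: (a - mu(s)) * phi(s) / sigma^2."""
    return (a - phi(s) @ theta) * phi(s) / sigma ** 2
```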
(2) Policy Gradient Theorem
[One-Step MDPs]
- Consider a simple class of one-step MDPs
- Starting in state $s \sim d(s)$
- Terminating after one time-step with reward $r = \mathcal{R}_{s,a}$
- Use likelihood ratios to compute the policy gradient
$$\begin{array}{rcl}
J(\theta) & = & \mathbb{E}_{\pi_{\theta}}\left[ r \right] \\
& = & \sum_{s\in \mathcal{S}}d(s)\sum_{a\in \mathcal{A}}\pi_{\theta}(s,a)\mathcal{R}_{s,a} \\
\nabla_{\theta}J(\theta) & = & \sum_{s\in \mathcal{S}}d(s)\sum_{a\in \mathcal{A}}\pi_{\theta}(s,a)\nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)\mathcal{R}_{s,a} \\
&=&\mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)r \right]\\
&=&\mathbb{E}_{\pi_{\theta}}\left[ (\text{score} \times \text{reward})\right]
\end{array}$$
[Policy Gradient Theorem]
- The policy gradient theorem generalizes the likelihood-ratio approach to multi-step MDPs
- Replaces the instantaneous reward $r$ with the long-term value $Q^{\pi}(s,a)$ (how good the action was)
- The policy gradient theorem applies to the start-state, average-reward, and average-value objectives
[Theorem]
For any differentiable policy $\pi_{\theta}(s,a)$, for any of the policy objective functions $J=J_{1}, J_{\text{av}R}, \frac{1}{1-\gamma}J_{\text{av}V}$, the policy gradient is
$$\nabla_{\theta}J(\theta)= \color{red}\mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a) \right]$$
*Monte-Carlo Policy Gradient (REINFORCE)
- Update parameters by stochastic gradient ascent (a practical method that removes the expectation by sampling)
- Using policy gradient theorem
- Using return $v_{t}$ as an unbiased sample of $Q^{\pi_{\theta}}(s_{t},a_{t})$
$$\Delta\theta_{t}= \alpha\nabla_{\theta}\text{log} \ \pi_{\theta}(s_{t},a_{t})v_{t}$$
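A minimal Monte-Carlo policy gradient (REINFORCE) sketch. The `episode` is assumed to be a list of `(state, action, reward)` tuples collected by following $\pi_{\theta}$, and `score_fn(theta, s, a)` is assumed to return $\nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)$ (e.g. a thin wrapper around one of the score functions above).

```python
import numpy as np

def reinforce_update(theta, episode, score_fn, alpha=0.01, gamma=1.0):
    """One REINFORCE update over a complete episode:
    Delta theta = alpha * sum_t score(s_t, a_t) * v_t, with v_t the return from t."""
    returns, g = [], 0.0
    for (_, _, r) in reversed(episode):        # returns v_t, computed backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    grad = sum(score_fn(theta, s, a) * v_t
               for (s, a, _), v_t in zip(episode, returns))
    return theta + alpha * grad
```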
3. Actor-Critic Policy Gradient(★)
(1) Critic: reducing Variance
- Monte-Carlo policy gradient still has high variance
- We use a critic to estimate the action-value function (instead of sampling $Q$, use a critic to estimate $Q$ directly)
$$Q_{\mathrm{w}}(s,a) \approx Q^{\pi_{\theta}}(s,a)$$
- Actor-Critic algorithms maintain two sets of parameters
- Critic: Updates action-value function parameters $\color{red}\mathrm{w}$
- Actor: Updates policy parameters $\color{red}\theta$, in direction suggested by critic
- Actor-Critic algorithms follow an approximate policy gradient
$$\begin{array}{rcl}
\nabla_{\theta}J(\theta) & \approx &\mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)Q_{\mathrm{w}}(s,a) \right] \\
\Delta\theta & = & \alpha\nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)Q_{\mathrm{w}}(s,a)
\end{array}$$
[Estimating the Action-Value Function]
- The critic is solving a familiar problem: policy evaluation(How good is policy $\pi_{\theta}$ for current parameters $\theta$?)
- This problem was explored in previous two lectures, e.g. Monte-Carlo policy evaluation, Temporal-Difference learning, TD($\lambda$)
- Could also use e.g. least-squares policy evaluation
[Action-Value Actor-Critic]
- A simple actor-critic algorithm based on an action-value critic
- Uses a linear value function approximation $Q_{\mathrm{w}}(s,a) = \phi(s,a)^{\text{T}}\mathrm{w}$
- Critic: updates $\mathrm{w}$ by linear TD(0)
- Actor: updates $\theta$ by policy gradient (a sketch of this loop follows)
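A sketch of this action-value actor-critic (QAC) loop with a linear critic $Q_{\mathrm{w}}(s,a) = \phi(s,a)^{\text{T}}\mathrm{w}$. The Gym-style `env`, the feature function `phi`, and the helpers `sample_action` and `score_fn` are assumptions for illustration.

```python
import numpy as np

def qac_episode(env, theta, w, phi, sample_action, score_fn,
                alpha=0.01, beta=0.01, gamma=0.99):
    """One episode of QAC: the critic updates w by linear TD(0),
    the actor updates theta in the direction score * Q_w(s, a)."""
    s, _ = env.reset()
    a = sample_action(theta, s)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = sample_action(theta, s_next)
        q_sa = phi(s, a) @ w
        q_next = 0.0 if done else phi(s_next, a_next) @ w
        delta = r + gamma * q_next - q_sa                     # TD(0) error
        theta = theta + alpha * score_fn(theta, s, a) * q_sa  # actor step
        w = w + beta * delta * phi(s, a)                      # critic step
        s, a = s_next, a_next
    return theta, w
```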
(2) Compatible Function Approximation
[Bias in Actor-Critic Algorithms]
- Approximating the policy gradient introduces bias
- A biased policy gradient may not find the right solution (e.g. if $Q_{\mathrm{w}}(s,a)$ uses aliased features, can we still solve the gridworld example?)
- Luckily, if we choose the value function approximation carefully, we can avoid introducing any bias (i.e. we can still follow the exact policy gradient)
[Compatible Function Approximation Theorem]
If the following two conditions are satisfied:
- The value function approximator is compatible with the policy
$$\nabla_{\mathrm{w}} Q_{\mathrm{w}}(s,a) = \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)$$
- The value function parameters $\mathrm{w}$ minimize the mean-squared error
$$\epsilon = \mathbb{E}_{\pi_{\theta}}\left[ \left( Q^{\pi_{\theta}}(s,a)-Q_{\mathrm{w}}(s,a) \right)^{2} \right]$$
then the policy gradient is exact,
$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)Q_{\mathrm{w}}(s,a) \right]$$
*Proof of Compatible Function Approximation Theorem
- If $w$ is chosen to minimize the mean-squared error, the gradient of $\varepsilon$ w.r.t $w$ must be zero,
$$\begin{array}{rcl}
\nabla_{w}\varepsilon & = & 0 \\
\mathbb{E}_{\pi_{\theta}}\left[ \left( Q^{\theta}(s,a)-Q_{w}(s,a) \right)\nabla_{w}Q_{w}(s,a) \right] & = & 0 \\
\mathbb{E}_{\pi_{\theta}}\left[ \left( Q^{\theta}(s,a)-Q_{w}(s,a) \right)\nabla_{\theta}\text{log} \ \pi_{\theta} (s,a) \right] & = & 0 \\
\mathbb{E}_{\pi_{\theta}}\left[ Q^{\theta}(s,a)\nabla_{\theta} \text{log} \ \pi_{\theta} (s,a)\right] & = & \mathbb{E}_{\pi_{\theta}}\left[ Q_{w}(s,a)\nabla_{\theta} \text{log} \ \pi_{\theta} (s,a)\right]
\end{array} $$
- So $Q_{w}(s,a)$ can be substituted directly into the policy gradient,
$$\nabla_{\theta}J(\theta)= \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta} \text{log} \ \pi_{\theta}(s,a)Q_{w}(s,a)\right]$$
(3) Advantage Function Critic: reduce variance using a baseline(★)
- (a further refinement of the actor-critic approach)
- We subtract a baseline function $B(s)$ from the policy gradient
- This can reduce variance, without changing expectation
$$\begin{array}{rcl}
\mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)B(s) \right] & = & \sum_{s\in S}d^{\pi_{\theta}}(s)\sum_{a}\nabla_{\theta}\pi_{\theta}(s,a)B(s) \\
& = & \sum_{s\in S}d^{\pi_{\theta}}(s)B(s)\nabla_{\theta}\sum_{a\in A}\pi_{\theta}(s,a) \\
& = & 0 \ (\because \sum_{a\in A}\pi_{\theta}(s,a)=1)
\end{array}$$
- A good baseline is the state value function $B(s) = V^{\pi_{\theta}}(s)$
- So we can rewrite the policy gradient using the advantage function $A^{\pi_{\theta}}(s,a)$
- $Q^{\pi_{\theta}}(s,a)$ : how good it is to take action $a$ in state $s$,
- $V^{\pi_{\theta}}(s)$ : compared to how good it is to be in that state in general
$$\begin{array}{rcl}
A^{\pi_{\theta}}(s,a) & = & Q^{\pi_{\theta}}(s,a) - V^{\pi_{\theta}}(s) \\
\nabla_{\theta}J(\theta) & = & \color{red}\mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)A^{\pi_{\theta}}(s,a) \right]
\end{array}$$
[Estimating the Advantage Function]
- The advantage function can significantly reduce variance of policy gradient
- So the critic should really estimate the advantage function (e.g. by estimating both $V^{\pi_{\theta}}(s)$ and $Q^{\pi_{\theta}}(s,a)$), using two function approximators and two parameter vectors
$$\begin{array}{rcl}
V_{v}(s)& \approx & V^{\pi_{\theta}}(s) \\
Q_{w}(s,a) & \approx & Q^{\pi_{\theta}}(s,a) \\
A(s,a) & = & Q_{w}(s,a)-V_{v}(s)
\end{array}$$
- And update both value functions by e.g. TD learning
A better and more commonly used approach than the one above is as follows.
- For the true value function $V^{\pi_{\theta}}(s)$, the TD error $\delta^{\pi_{\theta}}=r + \gamma V^{\pi_{\theta}}(s^{'}) - V^{\pi_{\theta}}(s)$ is an unbiased estimate of the advantage function
$$\begin{array}{rcl}
\mathbb{E}_{\pi_{\theta}}\left[ \delta^{\pi_{\theta}}|s,a \right] & = & \mathbb{E}_{\pi_{\theta}}\left[ r + \gamma V^{\pi_{\theta}}(s^{'})|s,a \right]- V^{\pi_{\theta}}(s) \\
& = & Q^{\pi_{\theta}}(s,a)-V^{\pi_{\theta}}(s) \\
& = & A^{\pi_{\theta}}(s,a)
\end{array}$$
- So we can use the TD error to compute the policy gradient (we no longer need to estimate $Q$; the critic only needs to estimate $V$ with parameters $v$)
$$\nabla_{\theta}J(\theta) = \color{red} \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)\delta^{\pi_{\theta}} \right]$$
- In practice we can use an approximate TD error. This approach only requires one set of critic parameters $v$
$$\delta_{v} = r+\gamma V_{v}(s^{'})-V_{v}(s)$$
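A sketch of the resulting update, using the approximate TD error $\delta_{v}$ as the advantage estimate; a linear critic $V_{v}(s) = \phi(s)^{\text{T}}v$ and the helper names are assumptions.

```python
import numpy as np

def td_actor_critic_step(theta, v, transition, phi_state, score_fn,
                         alpha=0.01, beta=0.01, gamma=0.99):
    """One actor-critic step using the TD error
    delta_v = r + gamma*V_v(s') - V_v(s) as a sample of the advantage."""
    s, a, r, s_next, done = transition
    v_s = phi_state(s) @ v
    v_next = 0.0 if done else phi_state(s_next) @ v
    delta = r + gamma * v_next - v_s
    theta = theta + alpha * score_fn(theta, s, a) * delta   # actor: score * TD error
    v = v + beta * delta * phi_state(s)                     # critic: TD(0) on V_v
    return theta, v
```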
(4) Eligibility Traces
[Critics at Different Time-Scales]
- The critic can estimate the value function $V_{\theta}(s)$ from many different targets at different time-scales (see the last lecture)
- For MC the target is the return $\color{red}v_{t}$
$$\Delta \theta =\alpha\left( \color{red}v_{t}\color{black}-V_{\theta}(s) \right)\phi(s)$$
- For TD(0), the target is the TD target $\color{red}r+\gamma V(s^{'})$
$$\Delta \theta =\alpha\left( \color{red}r+\gamma V(s^{'})\color{black}-V_{\theta}(s) \right)\phi(s)$$
- For forward-view TD($\lambda$), the target is the $\lambda$-return $\color{red}v_{t}^{\lambda}$
$$\Delta \theta =\alpha\left( \color{red}v_{t}^{\lambda}\color{black}-V_{\theta}(s) \right)\phi(s)$$
- For backward-view TD($\lambda$), we use eligibility traces
$$\begin{array}{rcl}
\delta_{t} & = & r_{t+1}+\gamma V(s_{t+1})-V(s_{t}) \\
e_{t} & = & \gamma\lambda e_{t-1}+\phi(s_{t}) \\
\Delta \theta & = & \alpha\delta_{t}e_{t}
\end{array}$$
[Actors at Different Time-Scales]
- (the same idea applies to the actor)
- The policy gradient can also be estimated at many time-scales
$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a) \color{red} A^{\pi_{\theta}}(s,a) \color{black} \right] $$
- The Monte-Carlo policy gradient uses the error from the complete return
$$\Delta \theta = \alpha \left( \color{red} v_{t} \color{black} -V_{v}(s_{t}) \right)\nabla_{\theta}\text{log} \ \pi_{\theta}(s_{t},a_{t})$$
- Actor-Critic policy gradient uses the one-step TD error
$$\Delta \theta = \alpha \left( \color{red} r+\gamma V_{v}(s_{t+1}) \color{black} -V_{v}(s_{t}) \right)\nabla_{\theta}\text{log} \ \pi_{\theta}(s_{t},a_{t})$$
[Policy Gradient with Eligibility Traces]
- Just like forward-view TD($\lambda$), we can mix over time-scales
$$\Delta \theta = \alpha \left( \color{red} v_{t}^{\lambda} \color{black} -V_{v}(s_{t}) \right)\nabla_{\theta}\text{log} \ \pi_{\theta}(s_{t},a_{t})$$
- where $v_{t}^{\lambda} -V_{v}(s_{t})$ is a biased estimate of the advantage function
- Like backward-view TD($\lambda$) we can also use eligibility traces, by equivalence with TD($\lambda$), substituting $\phi(s) = \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)$
$$\begin{array}{rcl}
\delta & = & r_{t+1}+ \gamma V_{v}(s_{t+1})-V_{v}(s_{t}) \\
e_{t+1} & = & \lambda e_{t}+\nabla_{\theta}\text{log} \ \pi_{\theta}(s,a) \\
\Delta\theta & = & \alpha\delta e_{t}
\end{array}$$
- This update can be applied online, to incomplete sequences
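A sketch of the backward-view actor-critic update with eligibility traces, following the trace definitions above; again the linear critic and helper names are assumptions.

```python
import numpy as np

def trace_actor_critic_step(theta, v, e_actor, e_critic, transition,
                            phi_state, score_fn,
                            alpha=0.01, beta=0.01, gamma=0.99, lam=0.9):
    """Backward-view actor-critic: both actor and critic keep an eligibility
    trace, and the one-step TD error is broadcast along both traces."""
    s, a, r, s_next, done = transition
    v_next = 0.0 if done else phi_state(s_next) @ v
    delta = r + gamma * v_next - phi_state(s) @ v           # TD error
    e_critic = gamma * lam * e_critic + phi_state(s)        # critic trace (features)
    e_actor = lam * e_actor + score_fn(theta, s, a)         # actor trace (scores)
    v = v + beta * delta * e_critic
    theta = theta + alpha * delta * e_actor
    return theta, v, e_actor, e_critic
```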
(5) Natural Policy Gradient
[Alternative Policy Gradient Directions]
- Gradient ascent algorithms can follow any ascent direction
- A good ascent direction can significantly speed convergence
- Also, a policy can often be reparameterized without changing the action probabilities (e.g. increasing the score of all actions in a softmax policy)
- The vanilla gradient is sensitive to these reparameterizations
[Natural Policy Gradient]
- The natural policy gradient is parameterization independent
- It finds the ascent direction that is closest to the vanilla gradient when changing the policy by a small, fixed amount
$$\nabla_{\theta}^{\text{nat}}\pi_{\theta}(s,a) = G^{-1}_{\theta}\nabla_{\theta}\pi_{\theta}(s,a)$$
- where $G_{\theta}$ is the Fisher information matrix
$$G_{\theta} = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log}\pi_{\theta}(s,a) \nabla_{\theta}\text{log}\pi_{\theta}(s,a)^{\text{T}} \right]$$
[Natural Actor-Critic]
- Using compatible function approximation,
$$\nabla_{w}A_{w}(s,a) = \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)$$
- So the natural policy gradient simplifies,
$$\begin{array}{rcl}
\nabla_{\theta}J(\theta) & = & \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)A^{\pi_{\theta}}(s,a) \right] \\
& = & \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log}\pi_{\theta}(s,a) \nabla_{\theta}\text{log}\pi_{\theta}(s,a)^{\text{T}}w \right] \\
& = & G_{\theta}w \\
\color{red}\nabla_{\theta}^{\text{nat}} J(\theta) & \color{red}= & \color{red}w
\end{array}$$
- i.e. update the actor parameters in the direction of the critic parameters
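A minimal sketch of the natural gradient direction computed explicitly from sampled scores; in the compatible case above this direction is simply the critic weights $w$, so the explicit inverse is shown only for illustration.

```python
import numpy as np

def natural_gradient(score_samples, vanilla_grad, reg=1e-3):
    """Compute G^{-1} * grad, with the Fisher matrix G estimated as the
    empirical mean of score outer products E[score score^T]."""
    G = np.mean([np.outer(sc, sc) for sc in score_samples], axis=0)
    G += reg * np.eye(G.shape[0])            # small ridge term for invertibility
    return np.linalg.solve(G, vanilla_grad)
```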
4. Summary of Policy Gradient Algorithms
- The policy gradient has many equivalent forms
$$\begin{array}{rcl}
\nabla_{\theta}J(\theta) & = & \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)\color{red}v_{t}\color{black} \right] & \ & \text{REINFORCE}\\
& = & \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)\color{red} Q^{\mathrm{w}}(s,a) \color{black} \right] & \ & \text{Q Actor-Critic}\\
& = & \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)\color{red} A^{\mathrm{w}}(s,a) \color{black} \right] & \ & \text{Advantage Actor-Critic}\\
& = & \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)\color{red} \delta \color{black} \right] & \ & \text{TD Actor-Critic}\\
& = & \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta}\text{log} \ \pi_{\theta}(s,a)\color{red} \delta e \color{black} \right] & \ & \text{TD}(\lambda)\text{ Actor-Critic}\\
G^{-1}_{\theta}\nabla_{\theta}J(\theta) & = & \mathrm{w} & \ & \text{Natural Actor-Critic}\\
\end{array}$$
- Each leads to a stochastic gradient ascent algorithm
- Critic uses policy evaluation (e.g. MC or TD learning) to estimate $Q^{\pi}(s,a),A^{\pi}(s,a)$ or $V^{\pi}(s)$