This post summarizes the concept of Policy Gradient and REINFORCE, one of its most representative algorithms.
1. Concepts
(1) Monte Carlo methods
Monte Carlo methods are a group of methods that sample randomly from a distribution to eventually converge on some solution. In the context of RL, we use Monte Carlo methods to estimate the reward by averaging the rewards over many episodes of interaction with the environment. (source)
(2) Policy Gradient methods
Policy gradient methods are a class of reinforcement learning techniques that optimize parametrized policies with respect to the expected return (long-term cumulative reward) by gradient ascent.
They do not suffer from many of the problems that mar traditional reinforcement learning approaches, such as the lack of guarantees of a value function, the intractability resulting from uncertain state information, and the complexity arising from continuous states and actions. (source)
(3) Monte Carlo Policy Gradients
Monte Carlo Policy Gradient methods are the subset of policy gradient methods where we update the policy parameters after every episode. This is in contrast to Temporal Difference Learning approaches, where the parameters are updated after each step (i.e., each action). The Monte Carlo approach leads to a more stable (but slightly slower) convergence to the optimal parameters. (source)
2. Understanding REINFORCE
(1) REINFORCE
REINFORCE is the most representative policy gradient algorithm. It is model-free, applicable to both discrete and continuous action domains, and is fundamentally an on-policy method (off-policy variants are possible with importance sampling). Because it is a Monte Carlo method, the agent interacts with the environment, taking actions and collecting rewards, and only updates the policy via gradient ascent once the episode has ended. This process is repeated over new episodes until the expected total reward is maximized (episodic setting).
$$\nabla_{\theta} \mathbb{E}_{\pi_{\theta}}[\mathcal{R}(\tau)] = \mathbb{E}_{\pi_{\theta}}\left[ \sum_{t=0}^{T-1}G_{t}\,\nabla_{\theta} \log\pi_{\theta}(a_{t}|s_{t}) \right]$$
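To make the training loop concrete, below is a minimal sketch of REINFORCE on CartPole written with PyTorch and Gymnasium. This is my own illustrative code rather than code from the cited sources; the environment choice, network size, learning rate, and number of episodes are all arbitrary assumptions.

```python
import gymnasium as gym  # assumption: Gymnasium-style API (reset returns (obs, info), step returns a 5-tuple)
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
# Small softmax policy: 4 observation dims -> 2 action logits (sizes are illustrative)
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))        # log pi_theta(a_t | s_t)
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Discounted return-to-go G_t for every step of the finished episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # Gradient ascent on E[ sum_t G_t * log pi_theta(a_t|s_t) ],
    # implemented as gradient descent on the negative
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that the policy is only updated after the episode has finished, which is exactly the Monte Carlo aspect described above.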
(2) Notations
Following Lilian Weng's post, the notation is summarized below.
| Symbol | Meaning |
|---|---|
| $s \in \mathcal{S}$ | States |
| $a \in \mathcal{A}$ | Actions |
| $r \in \mathcal{R}$ | Rewards |
| $\gamma$ | Discounting factor, a penalty for the uncertainty of future rewards; $0 < \gamma \le 1$ |
| $G_{t}$ | Return, i.e., discounted future reward; $G_{t}=\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}$ |
| $\mathcal{P}(s',r \vert s,a)$ | Transition probability of getting to the next state $s'$ from the current state $s$ with action $a$ and reward $r$ |
| $\pi(a \vert s)$ | Stochastic policy, or agent behavior strategy |
| $V(s)$ | State-value function; measures the expected return of state $s$ |
| $V^{\pi}(s)$ | The value of state $s$ when we follow a policy $\pi$; $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[ G_{t} \mid S_{t}=s \right]$ |
| $Q(s,a)$ | Action-value function; measures the expected return of a (state, action) pair |
| $Q^{\pi}(s,a)$ | The value of a (state, action) pair when we follow a policy $\pi$; $Q^{\pi}(s,a) = \mathbb{E}_{a \sim \pi}\left[ G_{t} \mid S_{t}=s,A_{t}=a \right]$ |
(3) Objective function
The objective of a policy-based algorithm is as follows.
$$J(\theta) = \sum_{s \in \mathcal{S}}d^{\pi}(s)V^{\pi}(s) = \sum_{s \in \mathcal{S}}d^{\pi}(s)\sum_{a \in \mathcal{A}}\pi_{\theta}(a|s)Q^{\pi}(s,a)$$
Here, $d^{\pi}(s)$ is the stationary distribution of the Markov chain induced by the policy $\pi_{\theta}$.
Imagine that you can travel along the Markov chain's states forever, and eventually, as time progresses, the probability of ending up in any given state becomes unchanged — this is the stationary probability for $\pi_{\theta}$.
$d^{\pi}(s)=\lim_{t \to \infty} P(s_{t}=s|s_{0},\pi_{\theta})$ is the probability that $s_{t}=s$ when starting from $s_{0}$ and following policy $\pi_{\theta}$ for $t$ steps. In fact, the existence of the stationary distribution of a Markov chain is one main reason why the PageRank algorithm works. (source)
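As a quick illustration of what a stationary distribution looks like (my own toy example, not from the cited post), the snippet below repeatedly applies a hypothetical 3-state transition matrix to an initial state distribution; after enough steps the distribution stops changing, and that fixed point is $d^{\pi}$.

```python
import numpy as np

# Hypothetical transition matrix: P[i, j] = P(next state = j | current state = i)
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

d = np.array([1.0, 0.0, 0.0])   # start deterministically in state 0
for _ in range(1000):           # iterate d_{t+1} = d_t P until it stops changing
    d = d @ P

print(d)                        # stationary distribution (about [0.64, 0.29, 0.07] for this P)
print(np.allclose(d @ P, d))    # True: d is a fixed point of the chain
```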
(4) Policy Gradient Theorem
However, computing $\nabla_{\theta}J(\theta)$ directly is difficult because the stationary distribution $d^{\pi}(s)$ itself depends on $\pi_{\theta}$: changing the policy parameters also changes the state distribution. Thanks to the policy gradient theorem, the gradient can be simplified (up to a constant of proportionality) without differentiating the state distribution:
$$\begin{array}{rcl}
\nabla_{\theta}J(\theta) & = & \nabla_{\theta} \sum_{s \in \mathcal{S}}d^{\pi}(s)\sum_{a \in \mathcal{A}}Q^{\pi}(s,a)\pi_{\theta}(a|s) \\
& \propto & \sum_{s \in \mathcal{S}}d^{\pi}(s)\sum_{a \in \mathcal{A}}Q^{\pi}(s,a)\nabla_{\theta}\pi_{\theta}(a|s)
\end{array}$$
The Policy Gradient Theorem: the derivative of the expected return is the expectation of the product of the return and the gradient of the log of the policy $\pi_{\theta}$.
(5) Dissecting the Policy Gradient
Let us walk through how the policy gradient is derived (source). When an agent interacts with the environment, the following quantities arise:
- $\tau = (s_{0},a_{0},\dots,s_{T-1},a_{T-1},s_{T})$: the state-action sequence of trajectory $\tau$ ($s_{T}$ is the terminal state)
- $\mathcal{R}(s_{t},a_{t})$: the reward obtained by taking action $a_{t}$ in state $s_{t}$
- $\mathcal{R}(\tau) = \sum_{t=0}^{T-1}\gamma^{t}\mathcal{R}(s_{t},a_{t})$: the sum of discounted rewards along the trajectory
The goal is to maximize the expected total reward below.
$$\underset{\theta}{\text{max}} \ \mathbb{E}_{\pi_{\theta}}[\mathcal{R}(\tau)]$$
To do so, we search for a good policy $\pi_{\theta}$ via gradient ascent:
$$\theta \gets \theta+\alpha \nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}(\tau)] $$
Let $P(\tau|\theta)$ be the probability of trajectory $\tau$ under policy $\pi_{\theta}$. The gradient above can then be expanded as follows.
$$\begin{array}{rcll}
\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}(\tau)] & = & \nabla_{\theta}\sum_{\tau}P(\tau|\theta)\mathcal{R}(\tau) & \text{(definition of expectation)}\\
& = & \sum_{\tau}\nabla_{\theta}P(\tau|\theta)\mathcal{R}(\tau) & \text{(swap sum and gradient)}\\
& = & \sum_{\tau}P(\tau|\theta)\frac{\nabla_{\theta}P(\tau|\theta)}{P(\tau|\theta)}\mathcal{R}(\tau) & \text{(multiply and divide by } P(\tau|\theta)\text{)}\\
& = & \sum_{\tau}P(\tau|\theta)\nabla_{\theta}\log P(\tau|\theta)\,\mathcal{R}(\tau) & \text{(log trick)}\\
& = & \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log P(\tau|\theta)\,\mathcal{R}(\tau)\right] &
\end{array}$$
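The log trick used above (also called the score-function or likelihood-ratio trick) is easy to verify numerically. The toy example below is my own illustration, not from the cited sources: it estimates $\nabla_{\theta}\mathbb{E}_{x \sim p_{\theta}}[f(x)]$ for a Bernoulli distribution with parameter $\sigma(\theta)$ via $\mathbb{E}[f(x)\,\nabla_{\theta}\log p_{\theta}(x)]$ and compares it against the exact gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))   # Bernoulli parameter p = sigma(theta)
f = lambda x: 3.0 * x + 1.0        # arbitrary "reward" assigned to a sample

# Exact gradient: E[f(x)] = 3p + 1, so dE/dtheta = 3 * p * (1 - p)
exact = 3.0 * p * (1.0 - p)

# Score-function estimate: mean of f(x) * d/dtheta log p_theta(x),
# where d/dtheta log Bernoulli(x; sigma(theta)) = x - p
x = rng.binomial(1, p, size=1_000_000)
estimate = np.mean(f(x) * (x - p))

print(exact, estimate)             # the two numbers should agree closely
```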
Making the trajectory probability explicit, we can write:
$$P(\tau|\theta) = p(s_{0})\prod_{t=0}^{T-1}p(s_{t+1}|s_{t},a_{t})\pi_{\theta}(a_{t}|s_{t})$$
$p(s_{t+1}|s_{t},a_{t})$ is the probability of transitioning to state $s_{t+1}$ when taking action $a_{t}$ in state $s_{t}$; it is the transition dynamics $\mathcal{P}$ given by the environment's model. Taking the log of this probability and differentiating with respect to $\theta$ gives:
$$\begin{array}{rcl}
\nabla_{\theta}\log P(\tau|\theta) & = & \nabla_{\theta}\left( \log p(s_{0}) + \sum_{t=0}^{T-1}(\log p(s_{t+1}|s_{t},a_{t}) + \log \pi_{\theta}(a_{t}|s_{t})) \right) \\
& = & \sum_{t=0}^{T-1}\nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t})
\end{array}$$
The dynamics terms $p(s_{0})$ and $p(s_{t+1}|s_{t},a_{t})$ do not depend on $\theta$, so their gradients vanish; this is exactly why policy gradient methods can be model-free.
Substituting this back, we obtain the policy gradient:
$$\begin{array}{rcl}
\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}(\tau)] & = & \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log P(\tau|\theta)\,\mathcal{R}(\tau)\right]\\
& = & \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t})\,\mathcal{R}(\tau)\right]
\end{array}$$
This expectation can now be approximated by Monte Carlo sampling: collect several trajectories and average over them ($L$ is the number of sampled trajectories).
$$\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}(\tau)] \approx \frac{1}{L}\sum_{i=1}^{L}\sum_{t=0}^{T-1}\nabla_{\theta} \log \pi_{\theta}(a_{t}^{(i)}|s_{t}^{(i)})\,\mathcal{R}(\tau^{(i)})$$
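As a tiny numeric illustration of this averaging (purely illustrative; the score vectors and returns below are made-up numbers), the estimator sums the per-step score vectors $\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$ within each sampled trajectory, weights the sum by that trajectory's total reward $\mathcal{R}(\tau)$, and averages over the $L$ trajectories:

```python
import numpy as np

# Hypothetical score vectors grad_theta log pi(a_t|s_t): L = 2 trajectories,
# 3 steps each, 4 policy parameters. The values and rewards are made up.
grad_log_pi = [np.ones((3, 4)), 2.0 * np.ones((3, 4))]
R_tau = [5.0, 2.0]

L = len(grad_log_pi)
grad_estimate = sum(g.sum(axis=0) * R for g, R in zip(grad_log_pi, R_tau)) / L
print(grad_estimate)   # Monte Carlo estimate of grad_theta E[R(tau)], shape (4,)
```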
There is one thing worth thinking about in the expression above. Since $\mathcal{R}(\tau) = \sum_{t=0}^{T-1}\gamma^{t}\mathcal{R}(s_{t},a_{t})$, the weight multiplying $\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$ includes rewards that were collected before action $a_{t}$ was even taken, so those past rewards influence the gradient update for that action.
Putting together Reference 1, Reference 2, and Reference 3, the expression above can be rewritten as follows.
$$\nabla_{\theta}\mathbb{E}_{\pi_{\theta}}[\mathcal{R}(\tau)] = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t}) \sum_{t'=t}^{T-1}\gamma^{t'-t}\mathcal{R}(s_{t'},a_{t'}) \right]$$
Now only the rewards obtained after action $a_{t}$ is taken (the reward-to-go) contribute to the gradient update for that action.
[References]
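For completeness, here is a short helper (my own sketch) that turns a list of per-step rewards into the discounted reward-to-go $\sum_{t'=t}^{T-1}\gamma^{t'-t}\mathcal{R}(s_{t'},a_{t'})$, i.e., the weight that multiplies each $\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$ in this refined estimator:

```python
def reward_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} for every time step t."""
    returns = [0.0] * len(rewards)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# Worked example: rewards [1, 1, 1] with gamma = 0.5 -> [1.75, 1.5, 1.0]
print(reward_to_go([1.0, 1.0, 1.0], gamma=0.5))
```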
- https://www.cs.toronto.edu/~tingwuwang/REINFORCE.pdf
- https://dilithjay.com/blog/reinforce-a-quick-introduction-with-code/
- https://www.janisklaise.com/post/rl-policy-gradients/#appendix-a
- https://lilianweng.github.io/posts/2018-04-08-policy-gradient/