This post summarizes Professor David Silver's lecture (link). Lecture 3 is about how to solve MDP problems.
1. Dynamic Programming
Most MDP problems can be solved with Dynamic Programming. DP solves a complex problem by breaking it down into subproblems. DP relies on the following two properties.
- Optimal substructure: the optimal solution can be decomposed into subproblems (the principle of optimality applies)
- Overlapping subproblems: subproblems recur many times, and their solutions can be cached and reused
MDPs satisfy both properties, so they can be solved with DP.
- Bellman equation gives recursive decomposition
- Value Function stores and reuses solutions
| | Input | Output |
|---|---|---|
| Prediction | MDP <$\mathcal{S,A,P,R},\gamma$> and policy $\pi$ | value function $\color{blue}v_{\pi}$ |
| Control | MDP <$\mathcal{S,A,P,R},\gamma$> | optimal value function $\color{blue}v_{\star}$, optimal policy $\color{blue}\pi_{\star}$ |
2. (Iterative) Policy Evaluation
Iterative policy evaluation evaluates a policy using the Bellman Expectation Equation (note: the Bellman Optimality Equation is used for control).
- Problem: evaluate a given policy $\pi$
- Solution: iterative application of Bellman Expectation backup($v_{1} \rightarrow v_{2} \rightarrow \dots \rightarrow v_{\pi}$)
A synchronous backup is used here, in which the value functions of all states are updated at the same time.
- At each iteration $k+1$
- For all states $s \in \mathcal{S}$
- Update $v_{k+1}(s)$ from $v_{k}(s')$, $s^{'}$ is a successor state of $s$
$$\begin{array}{rcl} v_{k+1}(s) & = & \sum_{a\in \mathcal{A}}\pi(a|s) \left( \mathcal{R}^{a}_{s}+\gamma \sum_{s^{'}\in \mathcal{S}}\mathcal{P}^{a}_{ss^{'}}v_{k}(s^{'})\right) \\ v_{k+1} & = & \mathcal{R}^{\pi}+\gamma\mathcal{P}^{\pi}v_{k} \end{array}$$
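To make the backup concrete, here is a minimal NumPy sketch of synchronous iterative policy evaluation. The array layout for `P` (transition probabilities), `R` (expected rewards), and `pi` (policy) is my own convention for illustration, not something defined in the lecture.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Synchronous iterative policy evaluation.

    P  : transition probabilities P[s, a, s'],    shape (n_states, n_actions, n_states)
    R  : expected immediate rewards R[s, a],      shape (n_states, n_actions)
    pi : policy probabilities pi[s, a] = pi(a|s), shape (n_states, n_actions)
    """
    v = np.zeros(P.shape[0])
    while True:
        # v_{k+1}(s) = sum_a pi(a|s) * ( R(s,a) + gamma * sum_s' P(s,a,s') v_k(s') )
        q = R + gamma * P @ v            # one-step lookahead, shape (n_states, n_actions)
        v_new = np.sum(pi * q, axis=1)   # expectation over the policy
        if np.max(np.abs(v_new - v)) < theta:
            return v_new                 # approximately converged to v_pi
        v = v_new
```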
3. Policy Iteration
A policy is improved by repeatedly alternating between Policy Evaluation and Policy Improvement, as shown below.
In an MDP, this iterative policy evaluation & improvement process theoretically converges to the optimal value and the optimal policy.
1) Policy Improvement
Suppose we have a deterministic policy $a=\pi (s)$. We can improve this policy by acting greedily.
$$\pi^{'}(s) = \underset{a\in \mathcal{A}}{\text{argmax}} \ q_{\pi}(s,a)$$
This improves the value from any state $s$ over one step.
$$ q_{\pi}(s,\pi^{'}(s)) = \underset{a\in \mathcal{A}}{\text{max}} \ q_{\pi}(s,a) \ge q_{\pi}(s,\pi(s))=v_{\pi}(s)$$
Consequently, the value function improves as well.
$$\begin{array}{rcl}
v_{\pi}(s) & \le & q_{\pi}(s,\pi^{'}(s)) = \mathbb{E}_{\pi^{'}}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_{t}=s] \\
& \le & \mathbb{E}_{\pi^{'}}[R_{t+1}+\gamma q_{\pi}(S_{t+1}, \pi^{'}(S_{t+1}))|S_{t}=s] \\
& \le & \mathbb{E}_{\pi^{'}}[R_{t+1}+\gamma R_{t+2} + \gamma^{2}q_{\pi}(S_{t+2}, \pi^{'}(S_{t+2}))|S_{t}=s] \\
& \le & \mathbb{E}_{\pi^{'}}[R_{t+1}+\gamma R_{t+2} + \gamma^{2}R_{t+3}+\dots|S_{t}=s] =v_{\pi^{'}}(s)\\
\end{array}$$
If improvement stops, the following holds.
$$ q_{\pi}(s,\pi^{'}(s))=\underset{a\in \mathcal{A}}{\text{max}} \ q_{\pi}(s,a)=q_{\pi}(s,\pi(s))=v_{\pi}(s)$$
At that point the Bellman Optimality Equation is satisfied.
$$v_{\pi}(s)=\underset{a\in \mathcal{A}}{\text{max}} \ q_{\pi}(s,a)$$
Since $v_{\pi}(s) = v_{\star}(s)$ then holds for all states $s \in \mathcal{S}$, $\pi$ is an optimal policy.
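Putting evaluation and greedy improvement together, a minimal sketch of the full policy iteration loop might look like the following; it reuses the hypothetical `policy_evaluation` routine and the `P`, `R` arrays from the sketch above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the uniform random policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)             # evaluate the current policy
        q = R + gamma * P @ v                              # one-step lookahead q_pi(s, a)
        greedy = np.argmax(q, axis=1)                      # pi'(s) = argmax_a q_pi(s, a)
        pi_new = np.eye(n_actions)[greedy]                 # deterministic greedy policy
        if np.array_equal(pi_new, pi):                     # policy stable => Bellman optimality holds
            return v, greedy
        pi = pi_new
```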
2) Modified Policy Iteration
To make the policy iteration process above more efficient, we can ask the following questions.
- Does policy evaluation really need to run until it converges to $v_{\pi}$, or can we introduce a stopping condition (e.g. $\epsilon-\text{convergence}$)?
- Or can we simply stop iterative policy evaluation after $k$ iterations (see the sketch below)?
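As a sketch of the second question, the evaluation loop can simply be truncated after $k$ sweeps (assuming the same hypothetical array layout as above); combined with greedy improvement, $k=1$ corresponds to value iteration, discussed below.

```python
import numpy as np

def truncated_policy_evaluation(P, R, pi, gamma=0.9, k=3):
    """Run only k sweeps of the Bellman expectation backup instead of iterating to convergence."""
    v = np.zeros(P.shape[0])
    for _ in range(k):
        v = np.sum(pi * (R + gamma * P @ v), axis=1)
    return v
```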
3) Generalized Policy Iteration
A more generalized form of Policy Iteration is as follows.
The Generalized Policy Iteration (GPI) algorithm refers to an iterative procedure to improve the policy when combining policy evaluation and improvement.
In GPI, the value function is approximated repeatedly to be closer to the true value of the current policy and in the meantime, the policy is improved repeatedly to approach optimality. This policy iteration process works and always converges to the optimality.
- Lilian's Blog (OpenAI) (link)
Sutton and Barto's RL book (2020) gives a more concrete description of GPI, quoted here.
Policy iteration consists of two simultaneous, interacting processes, one making the value function consistent with the current policy (policy evaluation), and the other making the policy greedy with respect to the current value function (policy improvement).
We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy.
If both the evaluation process and the improvement process stabilize, that is, no longer produce changes, then the value function and policy must be optimal. The value function stabilizes only when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function. Thus, both processes stabilize only when a policy has been found that is greedy with respect to its own evaluation function.
4. Value Iteration
1) Principle of Optimality
Any optimal policy can be subdivided into two components:
- An optimal first action $A_{\star}$
- Followed by an optimal policy from successor state $S^{'}$
[Theorem]
Principle of Optimality: A policy $\pi(a|s)$ achieves the optimal value from state $s$, $v_{\pi}(s) = v_{\star}(s)$ if and only if
- For any state $s^{'}$ reachable from $s$
- $\pi$ achieves the optimal value from state $s^{'}, v_{\pi}(s^{'}) = v_{\star}(s^{'})$
2) Deterministic Value Iteration
If we know the solution to the subproblem $v_{\star}(s^{'})$, then $v_{\star}(s)$ can also be found by a one-step lookahead (see the backup diagram).
$$v_{\star}(s) \leftarrow \underset{a\in \mathcal{A}}{\text{max}}\left(\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}v_{\star}(s^{'})\right)$$
Value Iteration applies this update repeatedly using the Bellman Optimality Equation to obtain the optimal value.
- Intuition: start with final rewards and work backwards (still works with loopy, stochastic MDPs)
- Problem: find optimal policy $\pi$
- Solution: iterative application of Bellman Optimality backup
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
A synchronous backup is used here, in which the value functions of all states are updated at the same time.
- At each iteration $k+1$
- For all states $s \in \mathcal{S}$
- Update $v_{k+1}(s)$ from $v_{k}(s')$, $s^{'}$ is a successor state of $s$
$$\begin{array}{rcl} v_{k+1}(s) & = & \underset{a\in \mathcal{A}}{\text{max}} \left( \mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in \mathcal{S}}\mathcal{P}^{a}_{ss^{'}}v_{k}(s^{'}) \right) \\ v_{k+1} & = & \underset{a\in \mathcal{A}}{\text{max}} \left( \mathcal{R}^{a}+\gamma\mathcal{P}^{a}v_{k} \right) \end{array}$$
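A minimal NumPy sketch of synchronous value iteration, again using the hypothetical `P`, `R` layout from the earlier snippets:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Iterate the Bellman optimality backup until the value function stops changing."""
    v = np.zeros(P.shape[0])
    while True:
        # v_{k+1}(s) = max_a ( R(s,a) + gamma * sum_s' P(s,a,s') v_k(s') )
        v_new = np.max(R + gamma * P @ v, axis=1)   # synchronous backup over all states
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # no explicit policy is kept during the iterations; the greedy policy is read off at the end
    pi = np.argmax(R + gamma * P @ v_new, axis=1)
    return v_new, pi
```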
*Summary of DP Algorithms*
| Problem | Bellman Equation | Algorithm |
|---|---|---|
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
- Algorithms are based on state-value function $v_{\pi}(s)$ or $v_{\star}(s)$
- Could also apply to action-value function $q_{\pi}(s,a)$ or $q_{\star}(s,a)$
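For reference, the corresponding backups written in terms of the action-value function take the standard form below (my rendering, following the notation above). For $m$ actions and $n$ states this raises the per-iteration cost from roughly $O(mn^{2})$ to $O(m^{2}n^{2})$.

$$\begin{array}{rcl}
q_{k+1}(s,a) & = & \mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}\sum_{a^{'}\in\mathcal{A}}\pi(a^{'}|s^{'})q_{k}(s^{'},a^{'}) \\
q_{k+1}(s,a) & = & \mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}\underset{a^{'}\in\mathcal{A}}{\text{max}} \ q_{k}(s^{'},a^{'})
\end{array}$$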
5. Extensions to DP
1) Asynchronous Dynamic Programming
Policy Iteration and Value Iteration as described above are DP algorithms that use synchronous backups (i.e. all states are backed up in parallel).
Asynchronous DP instead updates the value of each state individually, in any order. It can greatly reduce computation, and it still converges as long as all states continue to be selected.
i. In-place DP
- Synchronous value iteration stores two copies of the value function.
$$\begin{array}{rcl}
\text{for all s in } \mathcal{S} & \ & \ \\
\color{red} v_{\text{new}}(s)\color{black} & \leftarrow & \underset{a\in \mathcal{A}}{\text{max}}\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}\color{red} v_{\text{old}}(s^{'}) \\
\color{red} v_{\text{old}} & \color{red}\leftarrow & \color{red} v_{\text{new}}
\end{array}$$
- In-place value iteration stores only one copy of the value function (updating the values in place)
$$\begin{array}{rcl}
\text{for all s in } \mathcal{S} & \ & \ \\
\color{red} v(s)\color{black} & \leftarrow & \underset{a\in \mathcal{A}}{\text{max}}\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}\color{red} v(s^{'})
\end{array}$$
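In code, the difference is simply whether a second array is kept. Below is a minimal sketch of the in-place version under the same hypothetical `P`, `R` layout as before; later states in the same sweep already see the freshly updated values.

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, n_sweeps=100):
    """In-place value iteration: a single value array, overwritten state by state."""
    n_states = P.shape[0]
    v = np.zeros(n_states)                       # only one copy of the value function
    for _ in range(n_sweeps):
        for s in range(n_states):
            # max_a ( R(s,a) + gamma * sum_s' P(s,a,s') v(s') ), using the freshest v
            v[s] = np.max(R[s] + gamma * P[s] @ v)
    return v
```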
ii. Prioritized Sweeping
Prioritized sweeping is a way of deciding which state to update next while solving an MDP with DP. If we can define 'some measure of how important it is to update any state', we keep the states in a priority queue and pick the next state to update in order of the magnitude of its value update (the Bellman error).
- Use magnitude of Bellman error to guide state selection e.g.
$$\left| \underset{a\in \mathcal{A}}{\text{max}}\left(\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}} v(s^{'}) \right)-v(s)
\right|$$
- Backup the state with the largest remaining Bellman error
- Can be implemented efficiently by maintaining a priority queue
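A minimal sketch of prioritized sweeping with Python's `heapq` as the priority queue (same hypothetical `P`, `R` layout; duplicate heap entries are tolerated and simply re-checked when popped):

```python
import heapq
import numpy as np

def bellman_error(P, R, v, s, gamma):
    """|max_a ( R(s,a) + gamma * sum_s' P(s,a,s') v(s') ) - v(s)| for a single state s."""
    return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

def prioritized_sweeping(P, R, gamma=0.9, n_backups=10000, theta=1e-6):
    n_states = P.shape[0]
    v = np.zeros(n_states)
    # max-heap via negated priorities: the state with the largest Bellman error is popped first
    heap = [(-bellman_error(P, R, v, s, gamma), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        if not heap:
            break
        _, s = heapq.heappop(heap)
        if bellman_error(P, R, v, s, gamma) < theta:   # entry may be stale; re-check before backing up
            continue
        v[s] = np.max(R[s] + gamma * P[s] @ v)         # back up the chosen state
        # states that can transition into s now have a changed Bellman error: re-insert them
        for p in np.nonzero(P[:, :, s].sum(axis=1) > 0)[0]:
            heapq.heappush(heap, (-bellman_error(P, R, v, p, gamma), int(p)))
    return v
```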
iii. Real-time DP
- Idea: update (select) the states that the agent actually visits (collect real samples from real trajectories)
- After each time-step $\mathcal{S}_{t},\mathcal{A}_{t},\mathcal{R}_{t+1}$, backup the state $\mathcal{S}_{t}$
$$\color{red} v(\mathcal{S}_{t})\color{black} \leftarrow \underset{a\in \mathcal{A}}{\text{max}}\left(\mathcal{R}^{a}_{\color{red}\mathcal{S}_{t}}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{\color{red}\mathcal{S}_{t}\color{black}s^{'}}\color{red} v(s^{'}) \color{black} \right)$$
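A minimal sketch of the real-time DP loop. The `env` object with `reset()`/`step(a)` returning `(next_state, reward, done)` is a hypothetical interface used only for illustration; `P` and `R` are still the full model arrays from the earlier sketches, since the backup itself remains a full one-step lookahead.

```python
import numpy as np

def real_time_dp(env, P, R, gamma=0.9, n_episodes=100):
    """Back up only the states the agent actually visits along real trajectories."""
    v = np.zeros(P.shape[0])
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            v[s] = np.max(R[s] + gamma * P[s] @ v)          # Bellman optimality backup at S_t only
            a = int(np.argmax(R[s] + gamma * P[s] @ v))     # act greedily w.r.t. the current values
            s, _, done = env.step(a)                        # hypothetical env interface
    return v
```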
2) Full-width and sample backups
DP uses full-width backups, which means that every action and every successor state must be considered at each step. Whether synchronous or asynchronous, each backup
- considers every successor state and action
- uses knowledge of the MDP transitions and reward function
For this reason, DP is effective for medium-sized problems (on the order of millions of states). For larger problems, DP suffers from Bellman's curse of dimensionality.
Later lectures therefore cover sample backups.
- Using sample rewards and sample transitions <$\mathcal{S,A,R,S^{'}}$>, instead of reward function $\mathcal{R}$ and transition dynamics $\mathcal{P}$
The advantages of sample backups are as follows.
- Model-free: no advance knowledge of the MDP required
- Breaks the curse of dimensionality through sampling
- Cost of backup is constant, independent of $n=|\mathcal{S}|$
(Backup diagrams: full-width backup vs. sample backup)
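As a preview of later lectures, a sample backup can be as simple as the TD(0)-style update below: it touches a single sampled transition <$\mathcal{S,A,R,S^{'}}$> rather than the full model, so its cost does not depend on $|\mathcal{S}|$. The step size `alpha` is a hypothetical parameter for illustration.

```python
def sample_backup(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0)-style sample backup: one sampled transition, no access to P or R of the MDP."""
    v[s] += alpha * (r + gamma * v[s_next] - v[s])   # move v(s) toward the sampled target
    return v
```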