This post summarizes Professor David Silver's lecture (link). Lecture 3 is about how to solve MDP problems.
1. Dynamic Programming
Most MDP problems can be solved with Dynamic Programming. DP solves a complex problem by breaking it down into subproblems. DP relies on the following two properties.
- Optimal substructure: the optimal solution can be decomposed into subproblems (the principle of optimality applies)
- Overlapping subproblems: subproblems recur many times, and their solutions can be cached and reused
MDPs satisfy both properties, so they can be solved with DP.
- Bellman equation gives recursive decomposition
- Value Function stores and reuses solutions
| | Input | Output |
|---|---|---|
| Prediction | MDP <$\mathcal{S,A,P,R},\gamma$> and policy $\pi$ | value function $\color{blue}v_{\pi}$ |
| Control | MDP <$\mathcal{S,A,P,R},\gamma$> | optimal value function $\color{blue}v_{\star}$, optimal policy $\color{blue}\pi_{\star}$ |
2. (Iterative) Policy Evaluation
Iterative policy evaluation evaluates a policy using the Bellman Expectation Equation (note: the Bellman Optimality Equation is used for control).
- Problem: evaluate a given policy $\pi$
- Solution: iterative application of Bellman Expectation backup($v_{1} \rightarrow v_{2} \rightarrow \dots \rightarrow v_{\pi}$)
A synchronous backup is used here, in which the value functions of all states are updated at the same time.
- At each iteration $k+1$
- For all states $s \in \mathcal{S}$
- Update $v_{k+1}(s)$ from $v_{k}(s')$, $s^{'}$ is a successor state of $s$
$$\begin{array}{rcl} v_{k+1}(s) & = & \sum_{a\in \mathcal{A}}\pi(a|s) \left( \mathcal{R}^{a}_{s}+\gamma \sum_{s^{'}\in \mathcal{S}}\mathcal{P}^{a}_{ss^{'}}v_{k}(s^{'})\right) \\ v_{k+1} & = & \mathcal{R}^{\pi}+\gamma\mathcal{P}^{\pi}v_{k} \end{array}$$
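To make the backup concrete, here is a minimal NumPy sketch of synchronous iterative policy evaluation. The array layout for `P` (transition probabilities), `R` (expected rewards), and `pi` (policy) is my own convention for illustration, not something defined in the lecture.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Synchronous iterative policy evaluation.

    P  : transition probabilities P[s, a, s'],    shape (n_states, n_actions, n_states)
    R  : expected immediate rewards R[s, a],      shape (n_states, n_actions)
    pi : policy probabilities pi[s, a] = pi(a|s), shape (n_states, n_actions)
    """
    v = np.zeros(P.shape[0])
    while True:
        # v_{k+1}(s) = sum_a pi(a|s) * ( R(s,a) + gamma * sum_s' P(s,a,s') v_k(s') )
        q = R + gamma * P @ v            # one-step lookahead, shape (n_states, n_actions)
        v_new = np.sum(pi * q, axis=1)   # expectation over the policy
        if np.max(np.abs(v_new - v)) < theta:
            return v_new                 # approximately converged to v_pi
        v = v_new
```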
3. Policy Iteration
A policy is improved by repeatedly alternating between Policy Evaluation and Policy Improvement, as shown below.
In an MDP, this iterative policy evaluation & improvement process theoretically converges to the optimal value and the optimal policy.
1) Policy Improvement
Suppose we have a deterministic policy $a=\pi (s)$. We can improve this policy by acting greedily.
$$\pi^{'}(s) = \underset{a\in \mathcal{A}}{\text{argmax}} \ q_{\pi}(s,a)$$
This improves the value from any state $s$ over one step.
$$ q_{\pi}(s,\pi^{'}(s)) = \underset{a\in \mathcal{A}}{\text{max}} \ q_{\pi}(s,a) \ge q_{\pi}(s,\pi(s))=v_{\pi}(s)$$
Consequently, the value function improves as well.
$$\begin{array}{rcl}
v_{\pi}(s) & \le & q_{\pi}(s,\pi^{'}(s)) = \mathbb{E}_{\pi^{'}}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_{t}=s] \\
& \le & \mathbb{E}_{\pi^{'}}[R_{t+1}+\gamma q_{\pi}(S_{t+1}, \pi^{'}(S_{t+1}))|S_{t}=s] \\
& \le & \mathbb{E}_{\pi^{'}}[R_{t+1}+\gamma R_{t+2} + \gamma^{2}q_{\pi}(S_{t+2}, \pi^{'}(S_{t+2}))|S_{t}=s] \\
& \le & \mathbb{E}_{\pi^{'}}[R_{t+1}+\gamma R_{t+2} + \gamma^{2}R_{t+3}+\dots|S_{t}=s] =v_{\pi^{'}}(s)\\
\end{array}$$
If improvement stops, the following holds.
$$ q_{\pi}(s,\pi^{'}(s))=\underset{a\in \mathcal{A}}{\text{max}} \ q_{\pi}(s,a)=q_{\pi}(s,\pi(s))=v_{\pi}(s)$$
At that point the Bellman Optimality Equation is satisfied.
$$v_{\pi}(s)=\underset{a\in \mathcal{A}}{\text{max}} \ q_{\pi}(s,a)$$
Since $v_{\pi}(s) = v_{\star}(s)$ then holds for all states $s \in \mathcal{S}$, $\pi$ is an optimal policy.
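Putting evaluation and greedy improvement together, a minimal sketch of the full policy iteration loop might look like the following; it reuses the hypothetical `policy_evaluation` routine and the `P`, `R` arrays from the sketch above.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the uniform random policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)             # evaluate the current policy
        q = R + gamma * P @ v                              # one-step lookahead q_pi(s, a)
        greedy = np.argmax(q, axis=1)                      # pi'(s) = argmax_a q_pi(s, a)
        pi_new = np.eye(n_actions)[greedy]                 # deterministic greedy policy
        if np.array_equal(pi_new, pi):                     # policy stable => Bellman optimality holds
            return v, greedy
        pi = pi_new
```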
2) Modified Policy Iteration
To make the policy iteration process above more efficient, we can ask the following questions.
- Does policy evaluation really need to run until it converges to $v_{\pi}$, or can we introduce a stopping condition (e.g. $\epsilon-\text{convergence}$)?
- Or can we simply stop iterative policy evaluation after $k$ iterations (see the sketch below)?
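As a sketch of the second question, the evaluation loop can simply be truncated after $k$ sweeps (assuming the same hypothetical array layout as above); combined with greedy improvement, $k=1$ corresponds to value iteration, discussed below.

```python
import numpy as np

def truncated_policy_evaluation(P, R, pi, gamma=0.9, k=3):
    """Run only k sweeps of the Bellman expectation backup instead of iterating to convergence."""
    v = np.zeros(P.shape[0])
    for _ in range(k):
        v = np.sum(pi * (R + gamma * P @ v), axis=1)
    return v
```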
3) Generalized Policy Iteration
A more generalized form of Policy Iteration is as follows.
The Generalized Policy Iteration (GPI) algorithm refers to an iterative procedure to improve the policy when combining policy evaluation and improvement.
In GPI, the value function is approximated repeatedly to be closer to the true value of the current policy and in the meantime, the policy is improved repeatedly to approach optimality. This policy iteration process works and always converges to the optimality.
- Lilian's Blog (OpenAI) (link)
Sutton and Barto's RL book (2020) gives a more concrete description of GPI, quoted here.
Policy iteration consists of two simultaneous, interacting processes, one making the value function consistent with the current policy (policy evaluation), and the other making the policy greedy with respect to the current value function (policy improvement).
We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy.
If both the evaluation process and the improvement process stabilize, that is, no longer produce changes, then the value function and policy must be optimal. The value function stabilizes only when it is consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the current value function. Thus, both processes stabilize only when a policy has been found that is greedy with respect to its own evaluation function.
4. Value Iteration
1) Principle of Optimality
Any optimal policy can be subdivided into two components:
- An optimal first action $A_{\star}$
- Followed by an optimal policy from successor state $S^{'}$
[Theorem]
Principle of Optimality: A policy $\pi(a|s)$ achieves the optimal value from state $s$, $v_{\pi}(s) = v_{\star}(s)$ if and only if
- For any state $s^{'}$ reachable from $s$
- $\pi$ achieves the optimal value from state $s^{'}, v_{\pi}(s^{'}) = v_{\star}(s^{'})$
2) Deterministic Value Iteration
If we know the solution to the subproblem $v_{\star}(s^{'})$, then $v_{\star}(s)$ can also be found by a one-step lookahead (see the backup diagram).
$$v_{\star}(s) \leftarrow \underset{a\in \mathcal{A}}{\text{max}}\left(\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}v_{\star}(s^{'})\right)$$
Value Iteration applies this update repeatedly using the Bellman Optimality Equation to obtain the optimal value.
- Intuition: start with final rewards and work backwards (still works with loopy, stochastic MDPs)
- Problem: find optimal policy $\pi$
- Solution: iterative application of Bellman Optimality backup
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
A synchronous backup is used here, in which the value functions of all states are updated at the same time.
- At each iteration $k+1$
- For all states $s \in \mathcal{S}$
- Update $v_{k+1}(s)$ from $v_{k}(s')$, $s^{'}$ is a successor state of $s$
$$\begin{array}{rcl} v_{k+1}(s) & = & \underset{a\in \mathcal{A}}{\text{max}} \left( \mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in \mathcal{S}}\mathcal{P}^{a}_{ss^{'}}v_{k}(s^{'}) \right) \\ v_{k+1} & = & \underset{a\in \mathcal{A}}{\text{max}} \left( \mathcal{R}^{a}+\gamma\mathcal{P}^{a}v_{k} \right) \end{array}$$
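A minimal NumPy sketch of synchronous value iteration, again using the hypothetical `P`, `R` layout from the earlier snippets:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Iterate the Bellman optimality backup until the value function stops changing."""
    v = np.zeros(P.shape[0])
    while True:
        # v_{k+1}(s) = max_a ( R(s,a) + gamma * sum_s' P(s,a,s') v_k(s') )
        v_new = np.max(R + gamma * P @ v, axis=1)   # synchronous backup over all states
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # no explicit policy is kept during the iterations; the greedy policy is read off at the end
    pi = np.argmax(R + gamma * P @ v_new, axis=1)
    return v_new, pi
```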
*Summary of DP Algorithms*
| Problem | Bellman Equation | Algorithm |
|---|---|---|
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
- Algorithms are based on state-value function $v_{\pi}(s)$ or $v_{\star}(s)$
- Could also apply to action-value function $q_{\pi}(s,a)$ or $q_{\star}(s,a)$
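For reference, the corresponding backups written in terms of the action-value function take the standard form below (my rendering, following the notation above). For $m$ actions and $n$ states this raises the per-iteration cost from roughly $O(mn^{2})$ to $O(m^{2}n^{2})$.

$$\begin{array}{rcl}
q_{k+1}(s,a) & = & \mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}\sum_{a^{'}\in\mathcal{A}}\pi(a^{'}|s^{'})q_{k}(s^{'},a^{'}) \\
q_{k+1}(s,a) & = & \mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}\underset{a^{'}\in\mathcal{A}}{\text{max}} \ q_{k}(s^{'},a^{'})
\end{array}$$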
5. Extensions to DP
1) Asynchronous Dynamic Programming
Policy Iteration and Value Iteration as described above are DP algorithms that use synchronous backups (i.e. all states are backed up in parallel).
Asynchronous DP instead updates the value of each state individually, in any order. It can greatly reduce computation, and it still converges as long as all states continue to be selected.
i. In-place DP
- Synchronous value iteration stores two copies of the value function.
$$\begin{array}{rcl}
\text{for all s in } \mathcal{S} & \ & \ \\
\color{red} v_{\text{new}}(s)\color{black} & \leftarrow & \underset{a\in \mathcal{A}}{\text{max}}\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}\color{red} v_{\text{old}}(s^{'}) \\
\color{red} v_{\text{old}} & \color{red}\leftarrow & \color{red} v_{\text{new}}
\end{array}$$
- In-place value iteration stores only one copy of the value function (updating the values in place)
$$\begin{array}{rcl}
\text{for all s in } \mathcal{S} & \ & \ \\
\color{red} v(s)\color{black} & \leftarrow & \underset{a\in \mathcal{A}}{\text{max}}\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}}\color{red} v(s^{'})
\end{array}$$
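In code, the difference is simply whether a second array is kept. Below is a minimal sketch of the in-place version under the same hypothetical `P`, `R` layout as before; later states in the same sweep already see the freshly updated values.

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, n_sweeps=100):
    """In-place value iteration: a single value array, overwritten state by state."""
    n_states = P.shape[0]
    v = np.zeros(n_states)                       # only one copy of the value function
    for _ in range(n_sweeps):
        for s in range(n_states):
            # max_a ( R(s,a) + gamma * sum_s' P(s,a,s') v(s') ), using the freshest v
            v[s] = np.max(R[s] + gamma * P[s] @ v)
    return v
```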
ii. Prioritized Sweeping
Prioritized sweeping is a way of deciding which state to update next while solving an MDP with DP. If we can define 'some measure of how important it is to update any state', we keep the states in a priority queue and pick the next state to update in order of the magnitude of its value update (the Bellman error).
- Use magnitude of Bellman error to guide state selection e.g.
$$\left| \underset{a\in \mathcal{A}}{\text{max}}\left(\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{ss^{'}} v(s^{'}) \right)-v(s)
\right|$$
- Backup the state with the largest remaining Bellman error
- Can be implemented efficiently by maintaining a priority queue
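A minimal sketch of prioritized sweeping with Python's `heapq` as the priority queue (same hypothetical `P`, `R` layout; duplicate heap entries are tolerated and simply re-checked when popped):

```python
import heapq
import numpy as np

def bellman_error(P, R, v, s, gamma):
    """|max_a ( R(s,a) + gamma * sum_s' P(s,a,s') v(s') ) - v(s)| for a single state s."""
    return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

def prioritized_sweeping(P, R, gamma=0.9, n_backups=10000, theta=1e-6):
    n_states = P.shape[0]
    v = np.zeros(n_states)
    # max-heap via negated priorities: the state with the largest Bellman error is popped first
    heap = [(-bellman_error(P, R, v, s, gamma), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        if not heap:
            break
        _, s = heapq.heappop(heap)
        if bellman_error(P, R, v, s, gamma) < theta:   # entry may be stale; re-check before backing up
            continue
        v[s] = np.max(R[s] + gamma * P[s] @ v)         # back up the chosen state
        # states that can transition into s now have a changed Bellman error: re-insert them
        for p in np.nonzero(P[:, :, s].sum(axis=1) > 0)[0]:
            heapq.heappush(heap, (-bellman_error(P, R, v, p, gamma), int(p)))
    return v
```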
iii. Real-time DP
- Idea: update (select) the states that the agent actually visits (collect real samples from real trajectories)
- After each time-step $\mathcal{S}_{t},\mathcal{A}_{t},\mathcal{R}_{t+1}$, backup the state $\mathcal{S}_{t}$
$$\color{red} v(\mathcal{S}_{t})\color{black} \leftarrow \underset{a\in \mathcal{A}}{\text{max}}\left(\mathcal{R}^{a}_{\color{red}\mathcal{S}_{t}}+\gamma\sum_{s^{'}\in\mathcal{S}}\mathcal{P}^{a}_{\color{red}\mathcal{S}_{t}\color{black}s^{'}}\color{red} v(s^{'}) \color{black} \right)$$
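A minimal sketch of the real-time DP loop. The `env` object with `reset()`/`step(a)` returning `(next_state, reward, done)` is a hypothetical interface used only for illustration; `P` and `R` are still the full model arrays from the earlier sketches, since the backup itself remains a full one-step lookahead.

```python
import numpy as np

def real_time_dp(env, P, R, gamma=0.9, n_episodes=100):
    """Back up only the states the agent actually visits along real trajectories."""
    v = np.zeros(P.shape[0])
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            v[s] = np.max(R[s] + gamma * P[s] @ v)          # Bellman optimality backup at S_t only
            a = int(np.argmax(R[s] + gamma * P[s] @ v))     # act greedily w.r.t. the current values
            s, _, done = env.step(a)                        # hypothetical env interface
    return v
```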
2) Full-width and sample backups
DP uses full-width backups, which means that every action and every successor state must be considered at each step. Whether synchronous or asynchronous, each backup
- considers every successor state and action
- uses knowledge of the MDP transitions and reward function
For this reason, DP is effective for medium-sized problems (on the order of millions of states). For larger problems, DP suffers from Bellman's curse of dimensionality.
Later lectures therefore cover sample backups.
- Using sample rewards and sample transitions <$\mathcal{S,A,R,S^{'}}$>, instead of reward function $\mathcal{R}$ and transition dynamics $\mathcal{P}$
The advantages of sample backups are as follows.
- Model-free: no advance knowledge of the MDP required
- Breaks the curse of dimensionality through sampling
- Cost of backup is constant, independent of $n=|\mathcal{S}|$
(Backup diagrams: full-width backup vs. sample backup)
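As a preview of later lectures, a sample backup can be as simple as the TD(0)-style update below: it touches a single sampled transition <$\mathcal{S,A,R,S^{'}}$> rather than the full model, so its cost does not depend on $|\mathcal{S}|$. The step size `alpha` is a hypothetical parameter for illustration.

```python
def sample_backup(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0)-style sample backup: one sampled transition, no access to P or R of the MDP."""
    v[s] += alpha * (r + gamma * v[s_next] - v[s])   # move v(s) toward the sampled target
    return v
```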