This post summarizes the content of Professor David Silver's lecture (link).
1. Introduction
A Markov Decision Process describes a setting in which the environment is fully observable.
In mathematics, a Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming.(wikipedia)
2. Markov Process
1) Markov Property
A state $S_{t}$ is Markov if and only if
$$\color{black}\mathbb{P}[S_{t+1}|S_{t}] = \mathbb{P}[S_{t+1}|S_{1},\cdots,S_{t}]$$
이는 "The future is independent of the past given the present"와 같이 표현할 수 있다.
- $H_{1:t} \rightarrow S_{t} \rightarrow H_{t+1:\infty }$
Once the state is known, the history may be thrown away.
The State is a sufficient statistic of the future.
Let $s$ be a Markov state and $s^{'}$ a successor state. The state transition probability is then defined as
$$\mathcal{P}_{ss^{'}}=\mathbb{P}[S_{t+1}=s^{'}|S_{t}=s]$$
Extending this to all states, we can express the transitions as a state transition matrix.
$$\mathcal{P}=\begin{bmatrix}
\mathcal{P}_{11} & \cdots & \mathcal{P}_{1n}\\
\vdots & \ddots & \vdots\\
\mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn}
\end{bmatrix}$$
2) Markov Process(Markov Chain)
A Markov process is a memoryless random process, i.e. a sequence of random states $S_{1}, S_{2}, \dots$ with the Markov property.
A Markov Process(or Markov Chain) is a tuple <$\color{red}\mathcal{S,P}$>
- $\color{red}\mathcal{S}$ is a (finite) set of states
- $\color{red}\mathcal{P}$ is a state transition probability matrix, $\color{red}\mathcal{P}_{ss^{'}} = \mathbb{P}[S_{t+1}=s^{'}\ |\ S_{t}=s]$
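To make the definition concrete, here is a minimal sketch in Python (the three states and the transition probabilities are made up for illustration, not taken from the lecture) of a Markov process $<\mathcal{S,P}>$ and of sampling a trajectory from it:

```python
import numpy as np

# Hypothetical 3-state Markov chain: states and transition matrix P (rows sum to 1)
states = ["Class", "Pub", "Sleep"]
P = np.array([
    [0.5, 0.3, 0.2],   # P[s, s'] = Pr(S_{t+1} = s' | S_t = s)
    [0.4, 0.4, 0.2],
    [0.0, 0.0, 1.0],   # "Sleep" is absorbing
])

def sample_chain(P, start=0, steps=10, seed=0):
    """Sample a trajectory S_1, S_2, ...; the next state depends only on the current one (Markov property)."""
    rng = np.random.default_rng(seed)
    s, traj = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])   # draw s' from the current state's row of P only
        traj.append(int(s))
    return traj

print([states[s] for s in sample_chain(P)])
```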
3) Markov Reward Process
Markov reward process is a Markov chain with scalar reward values.
A Markov Reward Process is a tuple $\color{black}<\mathcal{S,P,\color{red}R,\gamma}\color{black}>$
- $\color{black}\mathcal{S}$ is a (finite) set of states
- $\color{black}\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss^{'}} = \mathbb{P}[S_{t+1}=s^{'}\ |\ S_{t}=s]$
- $\color{red} \mathcal{R}$ is a reward function, $\color{red}\mathcal{R}_{s} = \mathbb{E}[R_{t+1}\ |\ S_{t}=s]$
- $\color{red}\gamma$ is a discount factor, $\gamma\in[0,1]$
4) Goal(Return)
The return $G_{t}$ is the total discounted reward from time-step $t$
$$\color{black}G_{t} = R_{t+1}+\gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}$$
The return is computed for each time-step $t$ of an episode.
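As a small illustration, a sketch of computing $G_{t}$ from a hypothetical reward sequence (the rewards and $\gamma$ are arbitrary example values):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...  (rewards[k] plays the role of R_{t+k+1})."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([-2, -2, -2, 10], gamma=0.9))  # return G_t for one hypothetical episode tail
```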
5) Value Function
The value function $v(s)$ gives the long-term value of state $s$.
The state value function $v(s)$ of an MRP is the expected return starting from state $s$
$$\color{black}v(s) = \mathbb{E}[G_{t}\ |\ S_{t}=s]$$
6) Bellman Equation
The value function can be decomposed into two parts:
- immediate reward $\color{blue}R_{t+1}$
- discounted value of successor state $\color{blue}\gamma v(S_{t+1})$ (the value function at the new state)
The Bellman equation is then defined as follows.
$$\begin{array}{rcl}
v(s)& = & \mathbb{E}[G_{t}\ |\ S_{t}=s] \\
& = & \mathbb{E}[R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\cdots\ |\ S_{t}=s] \\
& = & \mathbb{E}[R_{t+1}+\gamma(R_{t+2}+\gamma R_{t+3}+\cdots)\ |\ S_{t}=s] \\
& = & \mathbb{E}[R_{t+1}+\gamma G_{t+1}\ |\ S_{t}=s] \\
& = & \mathbb{E}[R_{t+1}+\gamma v(S_{t+1})\ |\ S_{t}=s] \\
\end{array}$$
The value function can thus be expressed with the Bellman equation (backup diagram omitted):
$$\begin{array}{rcl} v(s) & = & \mathbb{E}[R_{t+1}+\gamma v(S_{t+1})\ |\ S_{t}=s] \\ & = & \mathcal{R}_{s}+\gamma\sum_{s^{'}\in S}\mathcal{P}_{ss^{'}}v(s^{'}) \end{array}$$
The Bellman equation can also be written in matrix form.
$$v = \mathcal{R}+\gamma\mathcal{P}v, \ (v: \text{column vector})$$
$$\begin{bmatrix}
v(1) \\
\vdots \\
v(n)
\end{bmatrix} =
\begin{bmatrix}
\mathcal{R}_{1} \\
\vdots \\
\mathcal{R}_{n}
\end{bmatrix} +
\gamma
\begin{bmatrix}
\mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\
\vdots & \ddots & \vdots \\
\mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn}
\end{bmatrix}
\begin{bmatrix}
v(1) \\
\vdots \\
v(n)
\end{bmatrix}
$$
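Since this is a linear system, $v$ can be solved for directly as $v = (I-\gamma\mathcal{P})^{-1}\mathcal{R}$. A minimal sketch (the transition matrix, rewards, and $\gamma$ are made-up example values):

```python
import numpy as np

# Hypothetical MRP: transition matrix P (rows sum to 1), reward vector R, discount gamma
P = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -1.0, 0.0])
gamma = 0.9

# v = R + gamma * P @ v   <=>   (I - gamma * P) v = R
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)  # value of each state under this MRP
```

The direct solve costs $O(n^{3})$ for $n$ states, which is why iterative methods are preferred for large MRPs.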
7) Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov(MRP + agent's decision).
A Markov Decision Process is a tuple $\color{black}<\mathcal{S,\color{red}A\color{black},P,R,\gamma}>$
- $\color{black}\mathcal{S}$ is a (finite) set of states
- $\color{red} \mathcal{A}$ is a (finite) set of actions
- $\color{black}\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^{\color{red}a}_{ss^{'}} = \mathbb{P}[S_{t+1}=s^{'}\ |\ S_{t}=s, \color{red}A_{t}=a]$
- $\color{black}\mathcal{R}^{\color{red}a}$ is a reward function, $\mathcal{R}^{\color{red}a}_{s} = \mathbb{E} [R_{t+1}\ |\ S_{t}=s,\color{red}A_{t}=a]$
- $\color{black}\gamma$ is a discount factor, $\gamma\in[0,1]$
8) Policy
A policy $\color{red}\pi$ is a distribution over actions given states,
$$\color{black}\pi(a|s) = \mathbb{P}[A_{t}=a|S_{t}=s]$$
- A Policy fully defines the behaviour of an agent
- MDP policies depend on the current state, not the history
i.e. Policies are stationary (the action depends only on the state, not on the time-step)
$A_{t}\sim \pi(\cdot|S_{t}),\ \forall t>0$
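As a concrete (made-up) illustration, a policy over a small discrete MDP can be stored as a table $\pi[s,a]$ and sampled from at each step:

```python
import numpy as np

# Hypothetical policy over 3 states and 2 actions, stored as pi[s, a] = pi(a|s); rows sum to 1
pi = np.array([[0.7, 0.3],
               [0.5, 0.5],
               [1.0, 0.0]])

def sample_action(pi, s, seed=0):
    """Draw A_t ~ pi(.|S_t=s); the distribution depends only on the current state (stationary policy)."""
    rng = np.random.default_rng(seed)
    return int(rng.choice(pi.shape[1], p=pi[s]))

print(sample_action(pi, s=0))
```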
Given an MDP $\mathcal{M} = <\mathcal{S,A,P,R},\gamma>$ and a policy $\pi$, the induced Markov process, Markov reward process, transition matrix, and reward function can be redefined as follows (a short code sketch of this reduction follows the list).
- The state sequence $S_{1}, S_{2}, \dots$ is a Markov process under the policy: $<\mathcal{S},\mathcal{P}^{\pi}>$
- The state and reward sequence $S_{1},R_{2},S_{2},R_{3},\dots$ is a Markov reward process under the policy: $<\mathcal{S},\mathcal{P}^{\pi},\mathcal{R}^{\pi},\gamma>$
- Transition: $\mathcal{P}^{\pi}_{ss^{'}} = \sum_{a\in A}\pi(a|s)\mathcal{P}^{a}_{ss^{'}}$
- Reward function: $\mathcal{R}^{\pi}_{s} = \sum_{a\in A}\pi(a|s)\mathcal{R}^{a}_{s}$
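A minimal sketch of this reduction, assuming (purely for illustration) that the MDP is stored as arrays `P` of shape (A, S, S) and `R` of shape (S, A), and the policy `pi` as an array of shape (S, A):

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Average an MDP over a policy pi to get the induced MRP quantities P^pi and R^pi.

    P:  shape (A, S, S), P[a, s, t] = Pr(S_{t+1}=t | S_t=s, A_t=a)
    R:  shape (S, A),    R[s, a]    = E[R_{t+1} | S_t=s, A_t=a]
    pi: shape (S, A),    pi[s, a]   = pi(a|s)
    """
    P_pi = np.einsum("sa,ast->st", pi, P)   # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
    R_pi = (pi * R).sum(axis=1)             # R^pi_s     = sum_a pi(a|s) R^a_s
    return P_pi, R_pi
```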
9) Value Function with Policy
The state-value function $\color{red}v_{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$
$$\color{black}v_{\pi}(s) = \mathbb{E}_{\pi}[G_{t}\ |\ S_{t}=s]$$
The action-value function $\color{red}q_{\pi}(s,a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
$$\color{black}q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_{t}\ |\ S_{t}=s, A_{t}=a]$$
(With the notion of a policy added, there are now actions the agent can take! $v(s) \rightarrow v_{\pi}(s)$)
The state-value and action-value functions can likewise be decomposed via the Bellman equation into an immediate reward plus the discounted value of the successor state.
$$\begin{array}{rcl}
v_{\pi}(s) & = & \mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_{t}=s] \\
q_{\pi}(s,a) & = & \mathbb{E}_{\pi}[R_{t+1}+\gamma q_{\pi}(S_{t+1}, A_{t+1})|S_{t}=s,A_{t}=a]
\end{array}$$
With a policy in place, the value functions can be expressed recursively via the Bellman Expectation Equation as follows.
Definitions (backup diagrams omitted):
- State-Value Function(1): $v_{\pi}(s) = \sum_{a\in A}\pi(a|s)q_{\pi}(s,a)$
- Action-Value Function(1): $q_{\pi}(s,a) = \mathcal{R}^{a}_{s}+\gamma \sum_{s^{'}\in S}\mathcal{P}^{a}_{ss^{'}}v_{\pi}(s^{'})$
- State-Value Function(2): $v_{\pi}(s) = \sum_{a\in A}\pi(a|s)\left(\mathcal{R}^{a}_{s}+\gamma \sum_{s^{'}\in S}\mathcal{P}^{a}_{ss^{'}}v_{\pi}(s^{'})\right)$
- Action-Value Function(2): $q_{\pi}(s,a) = \mathcal{R}^{a}_{s}+\gamma \sum_{s^{'}\in S}\mathcal{P}^{a}_{ss^{'}}\sum_{a^{'}\in A}\pi(a^{'}|s^{'})q_{\pi}(s^{'},a^{'})$
The Bellman expectation equation with a policy can also be written in matrix form.
$$v_{\pi} = \mathcal{R}^{\pi}+\gamma\mathcal{P}^{\pi}v_{\pi}, \ (v_{\pi}: \text{column vector})$$
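Continuing the illustrative sketch above, $v_{\pi}$ then follows from the same linear solve as in the MRP case, and $q_{\pi}$ from the Action-Value Function(1) equation:

```python
import numpy as np

def solve_v_pi(P_pi, R_pi, gamma=0.9):
    # Bellman expectation equation in matrix form: v_pi = R^pi + gamma * P^pi v_pi
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

def q_from_v(P, R, v_pi, gamma=0.9):
    # q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')
    return R + gamma * np.einsum("ast,t->sa", P, v_pi)
```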
10) Optimal Value Function & Optimal Policy
"Solving" an MDP means ultimately obtaining the optimal value function.
The optimal state-value function $v_{\star}(s)$ is the maximum value function over all policies(maximum possible reward).
$$\color{black}v_{\star}(s) = \underset{\pi}{\text{max}} \ v_{\pi}(s)$$
The optimal action-value function $q_{\star}(s,a)$ is the maximum action-value function over all policies.
$$\color{black}q_{\star}(s,a) = \underset{\pi}{\text{max}} \ q_{\pi}(s,a)$$
A partial ordering over policies is defined as follows.
$$\pi \ge \pi^{'} \ \text{if} \ v_{\pi}(s) \ge v_{\pi^{'}}(s), \forall s$$
For any MDP,
- There exists an optimal policy $\pi_{\star}$ that is better than or equal to all other policies, $\color{black}\pi_{\star} \ge \pi, \ \forall \pi$
- All optimal policies achieve the optimal value function, $\color{black}v_{\pi_{\star}}(s) = v_{\star}(s)$
- All optimal policies achieve the optimal action-value function, $\color{black}q_{\pi_{\star}}(s,a) = q_{\star}(s,a)$
An optimal policy can be obtained by maximising over $q_{\star}(s,a)$ (in an MDP, $q_{\star}$ is the final boss!)
$$\pi_{\star}(a|s) = \left\{ \begin{array}{cl}
1 & : \ \text{if} \ a = \underset{a \in \mathcal{A}}{\text{argmax}} q_{\star}(s,a)\\
0 & : \ \text{otherwise}
\end{array} \right.$$
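A minimal sketch of reading this greedy policy off a `q_star` array of shape (S, A) (the array layout is just the convention used in the sketches above):

```python
import numpy as np

def greedy_policy(q_star):
    """Deterministic pi_star: probability 1 on argmax_a q_star(s, a), 0 on every other action."""
    best = q_star.argmax(axis=1)                  # best action per state
    pi_star = np.zeros_like(q_star)
    pi_star[np.arange(len(q_star)), best] = 1.0
    return pi_star
```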
11) Bellman Optimality Equation
The optimal value functions can be expressed recursively via the Bellman Optimality Equation as follows.
Definitions (backup diagrams omitted):
- State-Value Function(1): $v_{\star}(s) = \underset{a}{\text{max}} \ q_{\star}(s,a)$ (at a state, look at the value of each action and take the max rather than the average)
- Action-Value Function(1): $q_{\star}(s,a) =\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in S}\mathcal{P}^{a}_{ss^{'}}v_{\star}(s^{'})$
- State-Value Function(2): $v_{\star}(s) = \underset{a}{\text{max}}\left(\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in S}\mathcal{P}^{a}_{ss^{'}}v_{\star}(s^{'})\right)$
- Action-Value Function(2): $q_{\star}(s,a) =\mathcal{R}^{a}_{s}+\gamma\sum_{s^{'}\in S}\mathcal{P}^{a}_{ss^{'}}\underset{a^{'}}{\text{max}} \ q_{\star}(s^{'},a^{'})$
If we are only interested in the optimal values, rather than computing the expectation following a policy, we could jump right into the maximum returns during the alternative updates without using a policy...
If we have complete information of the environment, this turns into a planning problem, solvable by DP. Unfortunately, in most scenarios, we do not know $\color{red}\mathcal{P}^{a}_{ss^{'}}$ or $\color{red}\mathcal{R}(s,a)$, so we cannot solve MDPs by directly applying Bellman equations, but it lays the theoretical foundation for many RL algorithms.
- Lilian's Blog (OpenAI) (link)
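When $\mathcal{P}^{a}_{ss^{'}}$ and $\mathcal{R}^{a}_{s}$ are known, one such DP method is to turn the Bellman optimality equation into an iterative update (value iteration). A minimal sketch, using the same hypothetical (A, S, S) / (S, A) array layout as above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a [ R^a_s + gamma * sum_{s'} P^a_{ss'} v(s') ] until convergence."""
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * np.einsum("ast,t->sa", P, v)   # one-step lookahead q(s, a) from the current v
        v_new = q.max(axis=1)                          # Bellman optimality backup: max, not average
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # approximate v_star and a greedy policy
        v = v_new
```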