- 1. Monte-Carlo Reinforcement Learning
  - (1) Monte-Carlo Learning
  - (2) Monte-Carlo Policy Evaluation
  - (3) Incremental Monte-Carlo
- 2. Temporal-Difference Learning(TD)
  - (1) TD Learning
  - (2) TD algorithm
- 3. Comparing MC & TD
  - (1) Conceptual Comparison
  - (2) Advantages, Disadvantages
  - (3) Batch MC and TD
  - (4) Unified View for Model-Free Prediction
- 4. TD($\lambda$)
  - (1) $n$-Step Prediction
  - (2) TD($\lambda$): Averaging $n$-Step Returns
  - (3) Eligibility Trace and Forward & Backward view of TD($\lambda$)
This post summarizes Professor David Silver's lecture (link). The flow of the material is as follows.
- Lecture 3, Planning by DP: how to solve a known MDP (find the optimal behavior of the MDP that maximizes the amount of reward the agent can expect to get from any state of the environment)
- Lecture 4, Model-Free Prediction: estimate the value of an unknown MDP (policy evaluation)
- Lecture 5, Model-Free Control: optimize the value function of an unknown MDP
1. Monte-Carlo Reinforcement Learning
(1) Monte-Carlo Learning
The Monte-Carlo (MC) method learns values and policies from complete episodes; its core idea is that $\color{red}\text{value = mean return}$. MC estimates the state value $V(S_{t})$ from the return $G_{t}$ computed after an episode has terminated.
- MC is model-free: there is no knowledge of the environment, i.e., the MDP transitions and rewards are unknown
- MC learns from complete episodes (all episodes must terminate; no bootstrapping)
(2) Monte-Carlo Policy Evaluation
- Goal: learn $v_{\pi}$ from episodes of experience($S_{1}, A_{1}, R_{2}, \dots, S_{k}$) under policy $\pi$
- MC policy evaluation uses empirical mean return instead of expected return
- (recall..) return: total discounted reward = $\color{blue}G_{t} = R_{t+1}+\gamma R_{t+2}+ \dots + \gamma^{T-1} R_{T}$
- (recall..) value function: expected return = $\color{blue}v_{\pi}(s) = \mathbb{E}[G_{t}|S_{t}=s]$
Given multiple terminated episodes, the empirical mean return for every state $s$ can then be computed in one of the two ways below.
| | First-Visit MC Policy Evaluation | Every-Visit MC Policy Evaluation |
|---|---|---|
| When to update value | First time-step $t$ that state $s$ is visited in an episode | Every time-step $t$ that state $s$ is visited in an episode |
| Increment counter | $N(s) \leftarrow N(s)+1$ | $N(s) \leftarrow N(s)+1$ |
| Increment total return | $S(s) \leftarrow S(s)+G_{t}$ | $S(s) \leftarrow S(s)+G_{t}$ |
| Value is estimated by mean return | $V(s)=S(s)/N(s)$ | $V(s)=S(s)/N(s)$ |
| By law of large numbers | $V(s) \rightarrow v_{\pi}(s)$ as $N(s)\rightarrow \infty$ | $V(s) \rightarrow v_{\pi}(s)$ as $N(s)\rightarrow \infty$ |
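To make the table concrete, here is a minimal first-visit MC policy-evaluation sketch in Python. The episode format, the `sample_episode` callable, and the default `gamma` are illustrative assumptions, not something prescribed by the lecture.

```python
from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes, gamma=1.0):
    """First-visit Monte-Carlo policy evaluation (sketch).

    `sample_episode()` is assumed to return a terminated episode as a list of
    (state, reward) pairs, where reward is the reward received after leaving
    that state while following the policy being evaluated.
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    V = defaultdict(float)  # value estimate V(s)

    for _ in range(num_episodes):
        episode = sample_episode()
        # Compute the return G_t for every time-step by working backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G_t in returns:
            if state in seen:        # first-visit: only the first occurrence counts
                continue
            seen.add(state)
            N[state] += 1
            S[state] += G_t
            V[state] = S[state] / N[state]
    return V
```

Dropping the `seen` check turns this into every-visit MC.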
(3) Incremental Monte-Carlo
The mean $\mu_{1}, \mu_{2}, \dots$ of a sequence $x_{1}, x_{2}, \dots$ can be computed incrementally.
$$\begin{array}{rcl}
\mu_{k} & = & \frac{1}{k}\sum_{j=1}^{k}x_{j} \\
& = & \frac{1}{k}\left(x_{k} + \sum_{j=1}^{k-1}x_{j} \right) \\
& = & \frac{1}{k} \left( x_{k}+(k-1)\mu_{k-1} \right) \\
& = & \mu_{k-1} + \frac{1}{k} \left(\color{blue} x_{k}\color{black} - \color{green}\mu_{k-1} \color{black}\right) \\
\\
& &\color{blue} x_{k}\color{black}: \text{newly observed value}, \\
& &\color{green}\mu_{k-1}\color{black}: \text{previous estimate} \\
& & \color{blue} x_{k}\color{black} - \color{green}\mu_{k-1} \color{black}: \text{error term; } \mu \text{ is updated in the direction of this error}
\end{array}$$
- Incremental Monte-Carlo Updates: update $V(s)$ incrementally after episode $S_{1}, A_{1}, R_{2}, \dots, S_{T}$
- For each state $S_{t}$, with return $G_{t}$
$$\begin{array}{rcl}
N(S_{t}) & \leftarrow & N(S_{t})+1 \\
V(S_{t}) & \leftarrow &V(S_{t})+\frac{1}{N(S_{t})}(G_{t}-V(S_{t})) \\
\end{array}$$
- In non-stationary problems, it can be useful to track a running mean, i.e., forget old episodes
$$\color{red} V(S_{t}) \leftarrow V(S_{t})+\alpha(G_{t}-V(S_{t}))$$
The equation above is the Monte-Carlo update rule.
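As a quick sketch of the two step-size choices above (the function names and the tabular dictionaries are illustrative):

```python
def mc_update_running_mean(V, N, state, G):
    """Incremental MC update with step size 1/N(s): exact sample mean."""
    N[state] += 1
    V[state] += (G - V[state]) / N[state]

def mc_update_constant_alpha(V, state, G, alpha=0.1):
    """Constant-alpha MC update: a running mean that forgets old episodes,
    useful for non-stationary problems."""
    V[state] += alpha * (G - V[state])
```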
2. Temporal-Difference Learning(TD)
(1) TD Learning
TD learning updates the state value $V(S_{t})$ at each step as the episode progresses, rather than waiting for the episode to end.
- Goal: learn $v_{\pi}$ online from experience under policy $\pi$
- TD is model-free: there is no knowledge of the environment, i.e., the MDP transitions and rewards are unknown
- TD learns from incomplete episodes, by bootstrapping
(2) TD algorithm
First, consider the every-visit Monte-Carlo update below.
- Update value $V(S_{t})$ toward actual return $\color{red} G_{t}$
$$V(S_{t}) \leftarrow V(S_{t})+\alpha(\color{red} G_{t}\color{black} -V(S_{t}))$$
The simplest TD learning algorithm, TD(0), is as follows.
$$\begin{array}{rcl}
& V(S_{t}) \leftarrow V(S_{t})+\alpha(\color{red} R_{t+1}+\gamma V(S_{t+1})\color{black} -V(S_{t})) \\
& \\
& \left( R_{t+1}+\gamma V(S_{t+1}) \ : \text{TD target} \right) \\
& \left( \delta_{t}= R_{t+1}+\gamma V(S_{t+1})-V(S_{t}) \ : \text{TD error} \right)
\end{array}$$
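Below is a minimal tabular TD(0) prediction sketch; the gym-like `env.reset()`/`env.step()` interface, the `policy` callable, and the hyperparameters are assumptions made for illustration.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (sketch).

    Assumes an interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done).
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target: R_{t+1} + gamma * V(S_{t+1}); terminal states have value 0.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            td_error = td_target - V[state]     # delta_t
            V[state] += alpha * td_error        # update toward the TD target
            state = next_state
    return V
```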
3. Comparing MC & TD
(1) Conceptual Comparison
The Hugging Face RL course gives a good summary of the concepts behind MC and TD; it is a useful reference for the concrete calculation steps.
(2) Advantages, Disadvantages
i. TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
ii. TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
[Bias/Variance Trade-Off]
- Return $G_{t} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_{T}$ is unbiased estimate of $v_{\pi}(S_{t})$
- True TD target $R_{t+1}+ \gamma v_{\pi}(S_{t+1})$ is unbiased estimate of $v_{\pi}(S_{t})$
- TD target $R_{t+1}+ \gamma V(S_{t+1})$ is biased estimate of $v_{\pi}(S_{t})$
- TD target is much lower variance than the return
- Return depends on many random actions, transitions, rewards
- TD target depends on one random action, transition, reward
iii. MC has high variance, zero bias
- Good convergence properties(even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use
iv. TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_{\pi}(S_{t})$(but not always with function approximation)
- More sensitive to initial value

v. TD exploits Markov property
- Usually more efficient in Markov environments
vi. MC does not exploit Markov property
- Usually more effective in non-Markov environments (i.e., partially observable environments)
(3) Batch MC and TD
- MC and TD converge: $V(s) \rightarrow v_{\pi}(s)$ as experience $\rightarrow \infty$
- But what about the batch solution computed from finite experience?
Suppose we have 8 episodes over two states $A, B$ (the classic AB example: one episode $A,0,B,0$; six episodes $B,1$; one episode $B,0$). When the value function is computed from this batch, MC and TD give different answers.

- MC always converges to solution with minimum mean-squared error (MSE)
- Best fit to the observed returns
- Minimize with respect to all steps and episodes
$$\sum_{k=1}^{K}\sum_{t=1}^{T_{k}}(G^{k}_{t}-V(s^{k}_{t}))^{2}$$
- In the AB example, $V(A)=0$
- TD(0) converges to solution of maximum likelihood Markov model
- Solution to the MDP $\left\langle \mathcal{S}, \mathcal{A}, \mathcal{\hat{P}}, \mathcal{\hat{R}}, \gamma \right\rangle$ that best fits the data
$$\begin{array}{rcl}
\hat{\mathcal{P}}^{a}_{s,\acute{s}} & = &\frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_{k}}\textbf{1}(s^{k}_{t},a^{k}_{t},s^{k}_{t+1}=s,a,\acute{s}) \\
\hat{\mathcal{R}}^{a}_{s} & = & \frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_{k}}\textbf{1}(s^{k}_{t},a^{k}_{t}=s,a)r^{k}_{t}
\end{array}$$
- In the AB example, $V(A)=0.75$
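The AB result can be checked numerically with a small sketch (assuming the episode data listed above and $\gamma=1$): batch MC simply averages the observed returns, while the maximum-likelihood Markov model implies $V(A)=V(B)$.

```python
# The classic AB batch example (gamma = 1):
# one episode A(r=0) -> B(r=0) -> terminal, six episodes B(r=1), one episode B(r=0).
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte-Carlo: average the observed returns per state.
returns = {"A": [], "B": []}
for ep in episodes:
    G = 0
    for state, reward in reversed(ep):
        G = reward + G                  # gamma = 1
        returns[state].append(G)
V_mc = {s: sum(rs) / len(rs) for s, rs in returns.items()}
print(V_mc)   # {'A': 0.0, 'B': 0.75}

# Maximum-likelihood Markov model: from A we always transition to B with reward 0,
# so the batch TD(0) fixed point is V(A) = 0 + V(B).
V_td = {"B": V_mc["B"], "A": 0 + V_mc["B"]}
print(V_td)   # {'A': 0.75, 'B': 0.75}
```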
(4) Unified View for Model-Free Prediction
i. Back-up diagrams
- Monte-Carlo Backup: $V(S_{t}) \leftarrow V(S_{t})+\alpha(G_{t}-V(S_{t}))$ (backs up one complete sampled trajectory)
- Temporal-Difference Backup: $V(S_{t}) \leftarrow V(S_{t})+\alpha(R_{t+1}+\gamma V(S_{t+1})-V(S_{t}))$ (backs up one sampled step)
- Dynamic Programming Backup: $V(S_{t}) \leftarrow \mathbb{E}_{\pi} \left[ R_{t+1}+\gamma V(S_{t+1}) \right]$ (full-width backup over all successor states)

ii. Bootstrapping and Sampling
- Bootstrapping: update involves an estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
- Sampling: update samples an expectation
- MC samples
- DP does not sample
- TD samples
Note: this StackExchange post has a good explanation of bootstrapping and sampling in RL.
4. TD($\lambda$)
(1) $n$-Step Prediction
In TD learning, the target can be formed by looking further ahead than a single step.

- Consider the following $n$-step returns for $n=1,2,\dots,\infty$
$$\begin{array}{rcl}
n=1 & \text{(TD)} & G^{(1)}_{t}=R_{t+1}+\gamma V(S_{t+1}) \\
n=2 & \ & G^{(2)}_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2} V(S_{t+2}) \\
\vdots & & \vdots \\
n=\infty & \text{(MC)} & G^{(\infty)}_{t}=R_{t+1}+\gamma R_{t+2}+\cdots +\gamma^{T-1}R_{T}\\
\end{array}$$
$$\text{n-step return : } G^{(n)}_{t}=R_{t+1}+\gamma R_{t+2}+\cdots +\gamma^{n-1}R_{t+n}+\gamma^{n}V(S_{t+n})$$
- n-step temporal-difference learning
$$V(S_{t}) \leftarrow V(S_{t})+\alpha\left(\color{blue} G^{(n)}_{t}\color{black}-\color{green}V(S_{t}) \color{black}\right)$$
$$\begin{array}{rcl}
\color{blue} G^{(n)}_{t} \color{black} &:& \text{n-step return (target)} \\
\color{green}V(S_{t})\color{black} &:& \color{black}\text{current value estimate} \\
\color{blue}G^{(n)}_{t}\color{black}-\color{green}V(S_{t}) &:& \color{black}\text{error term}
\end{array}$$
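A short sketch of computing $G^{(n)}_{t}$ from a recorded trajectory; the trajectory layout (`rewards[k]` holding $R_{k+1}$, `values[k]` holding the current estimate for the $k$-th visited state) is an assumption made for illustration.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return G_t^(n) along one recorded trajectory (sketch).

    Assumes rewards[k] holds R_{k+1} (the reward received on leaving step k)
    and values[k] holds the current estimate V(S_k) for the k-th state of the
    trajectory. If t + n reaches past the terminal step, the return simply
    truncates there (equivalent to bootstrapping from a terminal value of 0).
    """
    T = len(rewards)                           # number of steps in the episode
    G = 0.0
    for k in range(min(n, T - t)):
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                              # bootstrap from V(S_{t+n}) if non-terminal
        G += (gamma ** n) * values[t + n]
    return G

# The n-step TD update is then simply:
#   V[S_t] <- V[S_t] + alpha * (n_step_return(...) - V[S_t])
```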

(2) TD($\lambda$): Averaging $n$-Step Returns
We can average $n$-step returns over different values of $n$ (e.g. average the 2-step and 4-step returns: $\frac{1}{2}G^{(2)}_{t}+\frac{1}{2}G^{(4)}_{t}$). Can we efficiently combine information from all time-steps?

- The $\lambda-\text{return} \ G^{\lambda}_{t}$ combines all $n$-step returns $G^{(n)}_{t}$, using weight $(1-\lambda)\lambda^{n-1}$
$$G^{\lambda}_{t} =(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G^{(n)}_{t}$$
- Forward-view TD($\lambda$)($G^{\lambda}_{t}$ is robust!)
$$V(S_{t}) \leftarrow V(S_{t})+\alpha(G^{\lambda}_{t}-V(S_{t}))$$
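A minimal sketch of the forward-view $\lambda$-return for a finite episode, reusing the hypothetical `n_step_return` helper above; the leftover weight $\lambda^{T-t-1}$ is placed on the full Monte-Carlo return so that the weights still sum to 1.

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Forward-view lambda-return G_t^lambda for an episodic trajectory (sketch).

    G_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * G_t^(n)
                 + lam^(T-t-1) * G_t,
    where the remaining weight lam^(T-t-1) falls on the full Monte-Carlo return.
    """
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):                  # n-step returns that bootstrap
        G_lam += (1 - lam) * (lam ** (n - 1)) * n_step_return(rewards, values, t, n, gamma)
    # Remaining weight on the full return up to termination (the MC return).
    G_lam += (lam ** (T - t - 1)) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam
```

With $\lambda=0$ this reduces to the one-step TD target, and as $\lambda \to 1$ it approaches the Monte-Carlo return, matching the TD(0)/TD(1) limits discussed below.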
(3) Eligibility Trace and Forward & Backward view of TD($\lambda$)
i. Forward view of TD($\lambda$)
- Update value function towards the $\lambda$-return
- Forward-view looks into the future to compute $G^{\lambda}_{t}$. But like MC, it can only be computed from complete episodes (TD's appeal was that only $n$ steps of look-ahead were needed, yet here, just as with MC, the episode has to finish before the update can be computed)

ii. Eligibility Trace
An eligibility trace identifies, among previously visited states, which ones contributed to the reward being received now, and distributes credit for that reward back to those states (reference blog).
Suppose an agent is randomly walking in an environment and finds a treasure. It then stops and looks backwards, trying to work out what led it to this treasure.
Naturally, the steps that are close to the treasure have more merit in finding it than the steps that are miles away. So closer locations are more valuable than distant ones and are thus assigned bigger values. (reference)
Ideas such as the frequency heuristic (assign credit to the most frequently visited states) and the recency heuristic (assign credit to the most recently visited states) can be applied here; the eligibility trace combines both.
$$\begin{array}{rcl}
E_{0}(s) & = & 0 \\
E_{t}(s) & = & \gamma\lambda E_{t-1}(s)+\textbf{1}(S_{t}=s)
\end{array}$$

iii. Backward view TD($\lambda$)
- Keep an eligibility trace for every state $s$
- Update value $V(s)$ for every state $s$
- In proportion to TD-error $\color{red}\delta_{t}$ and eligibility trace $\color{red}E_{t}(s)$
$$\begin{array}{rcl}
\delta_{t} & = & R_{t+1}+\gamma V(S_{t+1})-V(S_{t}) \\
V(s) & \leftarrow & V(s)+\alpha\delta_{t}E_{t}(s)
\end{array}$$
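A minimal backward-view TD($\lambda$) sketch with accumulating eligibility traces, using the same assumed gym-like interface and hyperparameters as the TD(0) sketch earlier.

```python
from collections import defaultdict

def td_lambda_backward(env, policy, num_episodes, lam=0.9, alpha=0.1, gamma=1.0):
    """Backward-view TD(lambda) with accumulating eligibility traces (sketch)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                 # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0                    # accumulate trace for the visited state
            # Every state is updated in proportion to delta_t and its trace E_t(s).
            for s in list(E.keys()):
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam            # decay all traces
            state = next_state
    return V
```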

iv. TD($\lambda$) and TD(0)
- When $\lambda=0$, only current state is updated
$$\begin{array}{rcl}
E_{t}(s) & = & \textbf{1}(S_{t}=s) \\
V(s) & \leftarrow & V(s)+\alpha\delta_{t}E_{t}(s)
\end{array}$$
- This is exactly equivalent to TD(0) update, $V(s) \leftarrow V(s)+\alpha\delta_{t}$
- When $\lambda=1$, credit is deferred until end of episode(consider episodic environments with offline updates). Over the course of an episode, total update for TD(1) is the same as total update for MC
[Theorem]
The sum of offline updates is identical for forward-view and backward-view TD($\lambda$)
$$\sum_{t=1}^{T}\alpha\delta_{t}E_{t}(s) = \sum_{t=1}^{T}\alpha(G^{\lambda}_{t}-V(S_{t}))\textbf{1}(S_{t}=s)$$
v. Offline & Online update
- Offline updates
- Updates are accumulated within episode
- but applied in batch at the end of episode
- Online updates
- TD($\lambda$) updates are applied online at each step within episode
- Forward and backward-view of TD($\lambda$) are slightly different
- Exact online TD($\lambda$) achieves perfect equivalence, by using a slightly different form of eligibility trace
vi. Summary of Forward and Backward TD($\lambda$)
