- 1. Monte-Carlo Reinforcement Learning
  - (1) Monte-Carlo Learning
  - (2) Monte-Carlo Policy Evaluation
  - (3) Incremental Monte-Carlo
- 2. Temporal-Difference Learning(TD)
  - (1) TD Learning
  - (2) TD algorithm
- 3. Comparing MC & TD
  - (1) Conceptual Comparison
  - (2) Advantages, Disadvantages
  - (3) Batch MC and TD
  - (4) Unified View for Model-Free Prediction
- 4. TD($\lambda$)
  - (1) $n$-Step Prediction
  - (2) TD($\lambda$): Averaging $n$-Step Returns
  - (3) Eligibility Trace and Forward & Backward view of TD($\lambda$)
This post summarizes Professor David Silver's lecture (link). The flow of the material is as follows.
- Lecture 3, Planning by DP: how to solve a known MDP (find the optimal behavior of the MDP that maximizes the amount of reward the agent can expect to get from any state of the environment)
- Lecture 4, Model-Free Prediction: estimate the value of an unknown MDP (policy evaluation)
- Lecture 5, Model-Free Control: optimize the value function of an unknown MDP
1. Monte-Carlo Reinforcement Learning
(1) Monte-Carlo Learning
The Monte-Carlo (MC) method learns values and policies from complete episodes; its core idea is that $\color{red}\text{value = mean return}$. MC estimates the state value $V(S_{t})$ from the return $G_{t}$ computed after an episode has terminated.
- MC is model-free: there is no knowledge of the environment, i.e., the MDP transitions and rewards are unknown
- MC learns from complete episodes (all episodes must terminate; no bootstrapping)
(2) Monte-Carlo Policy Evaluation
- Goal: learn $v_{\pi}$ from episodes of experience($S_{1}, A_{1}, R_{2}, \dots, S_{k}$) under policy $\pi$
- MC policy evaluation uses empirical mean return instead of expected return
- (recall..) return: total discounted reward = $\color{blue}G_{t} = R_{t+1}+\gamma R_{t+2}+ \dots + \gamma^{T-1} R_{T}$
- (recall..) value function: expected return = $\color{blue}v_{\pi}(s) = \mathbb{E}[G_{t}|S_{t}=s]$
Given multiple terminated episodes, the empirical mean return for every state $s$ can then be computed in one of the two ways below.
| | First-Visit MC Policy Evaluation | Every-Visit MC Policy Evaluation |
|---|---|---|
| When to update value | First time-step $t$ that state $s$ is visited in an episode | Every time-step $t$ that state $s$ is visited in an episode |
| Increment counter | $N(s) \leftarrow N(s)+1$ | $N(s) \leftarrow N(s)+1$ |
| Increment total return | $S(s) \leftarrow S(s)+G_{t}$ | $S(s) \leftarrow S(s)+G_{t}$ |
| Value is estimated by mean return | $V(s)=S(s)/N(s)$ | $V(s)=S(s)/N(s)$ |
| By law of large numbers | $V(s) \rightarrow v_{\pi}(s)$ as $N(s)\rightarrow \infty$ | $V(s) \rightarrow v_{\pi}(s)$ as $N(s)\rightarrow \infty$ |
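To make the table concrete, here is a minimal first-visit MC policy-evaluation sketch in Python. The episode format, the `sample_episode` callable, and the default `gamma` are illustrative assumptions, not something prescribed by the lecture.

```python
from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes, gamma=1.0):
    """First-visit Monte-Carlo policy evaluation (sketch).

    `sample_episode()` is assumed to return a terminated episode as a list of
    (state, reward) pairs, where reward is the reward received after leaving
    that state while following the policy being evaluated.
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    V = defaultdict(float)  # value estimate V(s)

    for _ in range(num_episodes):
        episode = sample_episode()
        # Compute the return G_t for every time-step by working backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G_t in returns:
            if state in seen:        # first-visit: only the first occurrence counts
                continue
            seen.add(state)
            N[state] += 1
            S[state] += G_t
            V[state] = S[state] / N[state]
    return V
```

Dropping the `seen` check turns this into every-visit MC.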
(3) Incremental Monte-Carlo
The mean $\mu_{1}, \mu_{2}, \dots$ of a sequence $x_{1}, x_{2}, \dots$ can be computed incrementally.
$$\begin{array}{rcl}
\mu_{k} & = & \frac{1}{k}\sum_{j=1}^{k}x_{j} \\
& = & \frac{1}{k}\left(x_{k} + \sum_{j=1}^{k-1}x_{j} \right) \\
& = & \frac{1}{k} \left( x_{k}+(k-1)\mu_{k-1} \right) \\
& = & \mu_{k-1} + \frac{1}{k} \left(\color{blue} x_{k}\color{black} - \color{green}\mu_{k-1} \color{black}\right) \\
\\
& &\color{blue} x_{k}\color{black}: \text{newly observed value}, \\
& &\color{green}\mu_{k-1}\color{black}: \text{previous estimate} \\
& & \color{blue} x_{k}\color{black} - \color{green}\mu_{k-1} \color{black}: \text{error term; } \mu \text{ is updated in the direction of this error}
\end{array}$$
- Incremental Monte-Carlo Updates: update $V(s)$ incrementally after episode $S_{1}, A_{1}, R_{2}, \dots, S_{T}$
- For each state $S_{t}$, with return $G_{t}$
$$\begin{array}{rcl}
N(S_{t}) & \leftarrow & N(S_{t})+1 \\
V(S_{t}) & \leftarrow &V(S_{t})+\frac{1}{N(S_{t})}(G_{t}-V(S_{t})) \\
\end{array}$$
- In non-stationary problems, it can be useful to track a running mean, i.e., forget old episodes
$$\color{red} V(S_{t}) \leftarrow V(S_{t})+\alpha(G_{t}-V(S_{t}))$$
The equation above is the Monte-Carlo update rule.
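As a quick sketch of the two step-size choices above (the function names and the tabular dictionaries are illustrative):

```python
def mc_update_running_mean(V, N, state, G):
    """Incremental MC update with step size 1/N(s): exact sample mean."""
    N[state] += 1
    V[state] += (G - V[state]) / N[state]

def mc_update_constant_alpha(V, state, G, alpha=0.1):
    """Constant-alpha MC update: a running mean that forgets old episodes,
    useful for non-stationary problems."""
    V[state] += alpha * (G - V[state])
```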
2. Temporal-Difference Learning(TD)
(1) TD Learning
TD learning updates the state value $V(S_{t})$ at each step as the episode progresses, rather than waiting for the episode to end.
- Goal: learn $v_{\pi}$ online from experience under policy $\pi$
- TD is model-free: there is no knowledge of the environment, i.e., the MDP transitions and rewards are unknown
- TD learns from incomplete episodes, by bootstrapping
(2) TD algorithm
First, consider the every-visit Monte-Carlo update below.
- Update value $V(S_{t})$ toward actual return $\color{red} G_{t}$
$$V(S_{t}) \leftarrow V(S_{t})+\alpha(\color{red} G_{t}\color{black} -V(S_{t}))$$
The simplest TD learning algorithm, TD(0), is as follows.
$$\begin{array}{rcl}
& V(S_{t}) \leftarrow V(S_{t})+\alpha(\color{red} R_{t+1}+\gamma V(S_{t+1})\color{black} -V(S_{t})) \\
& \\
& \left( R_{t+1}+\gamma V(S_{t+1}) \ : \text{TD target} \right) \\
& \left( \delta_{t}= R_{t+1}+\gamma V(S_{t+1})-V(S_{t}) \ : \text{TD error} \right)
\end{array}$$
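Below is a minimal tabular TD(0) prediction sketch; the gym-like `env.reset()`/`env.step()` interface, the `policy` callable, and the hyperparameters are assumptions made for illustration.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (sketch).

    Assumes an interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done).
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target: R_{t+1} + gamma * V(S_{t+1}); terminal states have value 0.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            td_error = td_target - V[state]     # delta_t
            V[state] += alpha * td_error        # update toward the TD target
            state = next_state
    return V
```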
3. Comparing MC & TD
(1) Conceptual Comparison
The Hugging Face RL course gives a good summary of the concepts behind MC and TD; it is a useful reference for the concrete calculation steps.
(2) Advantages, Disadvantages
i. TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
ii. TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
[Bias/Variance Trade-Off]
- Return $G_{t} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_{T}$ is unbiased estimate of $v_{\pi}(S_{t})$
- True TD target $R_{t+1}+ \gamma v_{\pi}(S_{t+1})$ is unbiased estimate of $v_{\pi}(S_{t})$
- TD target $R_{t+1}+ \gamma V(S_{t+1})$ is biased estimate of $v_{\pi}(S_{t})$
- TD target is much lower variance than the return
- Return depends on many random actions, transitions, rewards
- TD target depends on one random action, transition, reward
iii. MC has high variance, zero bias
- Good convergence properties(even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use
iv. TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to $v_{\pi}(S_{t})$(but not always with function approximation)
- More sensitive to initial value

v. TD exploits Markov property
- Usually more efficient in Markov environments
vi. MC does not exploit Markov property
- Usually more effective in non-Markov environments (i.e., partially observable environments)
(3) Batch MC and TD
- MC and TD converge: $V(s) \rightarrow v_{\pi}(s)$ as experience $\rightarrow \infty$
- But what about the batch solution computed from finite experience?
Suppose we have 8 episodes over two states $A, B$ (the classic AB example: one episode $A,0,B,0$; six episodes $B,1$; one episode $B,0$). When the value function is computed from this batch, MC and TD give different answers.

- MC always converges to solution with minimum mean-squared error (MSE)
- Best fit to the observed returns
- Minimize with respect to all steps and episodes
$$\sum_{k=1}^{K}\sum_{t=1}^{T_{k}}(G^{k}_{t}-V(s^{k}_{t}))^{2}$$
- In the AB example, $V(A)=0$
- TD(0) converges to solution of maximum likelihood Markov model
- Solution to the MDP $\left\langle \mathcal{S}, \mathcal{A}, \mathcal{\hat{P}}, \mathcal{\hat{R}}, \gamma \right\rangle$ that best fits the data
$$\begin{array}{rcl}
\hat{\mathcal{P}}^{a}_{s,\acute{s}} & = &\frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_{k}}\textbf{1}(s^{k}_{t},a^{k}_{t},s^{k}_{t+1}=s,a,\acute{s}) \\
\hat{\mathcal{R}}^{a}_{s} & = & \frac{1}{N(s,a)}\sum_{k=1}^{K}\sum_{t=1}^{T_{k}}\textbf{1}(s^{k}_{t},a^{k}_{t}=s,a)r^{k}_{t}
\end{array}$$
- In the AB example, $V(A)=0.75$
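The AB result can be checked numerically with a small sketch (assuming the episode data listed above and $\gamma=1$): batch MC simply averages the observed returns, while the maximum-likelihood Markov model implies $V(A)=V(B)$.

```python
# The classic AB batch example (gamma = 1):
# one episode A(r=0) -> B(r=0) -> terminal, six episodes B(r=1), one episode B(r=0).
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte-Carlo: average the observed returns per state.
returns = {"A": [], "B": []}
for ep in episodes:
    G = 0
    for state, reward in reversed(ep):
        G = reward + G                  # gamma = 1
        returns[state].append(G)
V_mc = {s: sum(rs) / len(rs) for s, rs in returns.items()}
print(V_mc)   # {'A': 0.0, 'B': 0.75}

# Maximum-likelihood Markov model: from A we always transition to B with reward 0,
# so the batch TD(0) fixed point is V(A) = 0 + V(B).
V_td = {"B": V_mc["B"], "A": 0 + V_mc["B"]}
print(V_td)   # {'A': 0.75, 'B': 0.75}
```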
(4) Unified View for Model-Free Prediction
i. Back-up diagrams
- Monte-Carlo Backup: $V(S_{t}) \leftarrow V(S_{t})+\alpha(G_{t}-V(S_{t}))$ (backs up one complete sampled trajectory)
- Temporal-Difference Backup: $V(S_{t}) \leftarrow V(S_{t})+\alpha(R_{t+1}+\gamma V(S_{t+1})-V(S_{t}))$ (backs up one sampled step)
- Dynamic Programming Backup: $V(S_{t}) \leftarrow \mathbb{E}_{\pi} \left[ R_{t+1}+\gamma V(S_{t+1}) \right]$ (full-width backup over all successor states)

ii. Bootstrapping and Sampling
- Bootstrapping: update involves an estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
- Sampling: update samples an expectation
- MC samples
- DP does not sample
- TD samples
Note: this StackExchange post has a good explanation of bootstrapping and sampling in RL.
4. TD($\lambda$)
(1) $n$-Step Prediction
In TD learning, the target can be formed by looking further ahead than a single step.

- Consider the following $n$-step returns for $n=1,2,\dots,\infty$
$$\begin{array}{rcl}
n=1 & \text{(TD)} & G^{(1)}_{t}=R_{t+1}+\gamma V(S_{t+1}) \\
n=2 & \ & G^{(2)}_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2} V(S_{t+2}) \\
\vdots & & \vdots \\
n=\infty & \text{(MC)} & G^{(\infty)}_{t}=R_{t+1}+\gamma R_{t+2}+\cdots +\gamma^{T-1}R_{T}\\
\end{array}$$
$$\text{n-step return : } G^{(n)}_{t}=R_{t+1}+\gamma R_{t+2}+\cdots +\gamma^{n-1}R_{t+n}+\gamma^{n}V(S_{t+n})$$
- n-step temporal-difference learning
$$V(S_{t}) \leftarrow V(S_{t})+\alpha\left(\color{blue} G^{(n)}_{t}\color{black}-\color{green}V(S_{t}) \color{black}\right)$$
$$\begin{array}{rcl}
\color{blue} G^{(n)}_{t} \color{black} &:& \text{n-step return (target)} \\
\color{green}V(S_{t})\color{black} &:& \color{black}\text{current value estimate} \\
\color{blue}G^{(n)}_{t}\color{black}-\color{green}V(S_{t}) &:& \color{black}\text{error term}
\end{array}$$
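A short sketch of computing $G^{(n)}_{t}$ from a recorded trajectory; the trajectory layout (`rewards[k]` holding $R_{k+1}$, `values[k]` holding the current estimate for the $k$-th visited state) is an assumption made for illustration.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return G_t^(n) along one recorded trajectory (sketch).

    Assumes rewards[k] holds R_{k+1} (the reward received on leaving step k)
    and values[k] holds the current estimate V(S_k) for the k-th state of the
    trajectory. If t + n reaches past the terminal step, the return simply
    truncates there (equivalent to bootstrapping from a terminal value of 0).
    """
    T = len(rewards)                           # number of steps in the episode
    G = 0.0
    for k in range(min(n, T - t)):
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                              # bootstrap from V(S_{t+n}) if non-terminal
        G += (gamma ** n) * values[t + n]
    return G

# The n-step TD update is then simply:
#   V[S_t] <- V[S_t] + alpha * (n_step_return(...) - V[S_t])
```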

(2) TD($\lambda$): Averaging $n$-Step Returns
We can average $n$-step returns over different values of $n$ (e.g. average the 2-step and 4-step returns: $\frac{1}{2}G^{(2)}_{t}+\frac{1}{2}G^{(4)}_{t}$). Can we efficiently combine information from all time-steps?

- The $\lambda-\text{return} \ G^{\lambda}_{t}$ combines all $n$-step returns $G^{(n)}_{t}$, using weight $(1-\lambda)\lambda^{n-1}$
$$G^{\lambda}_{t} =(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G^{(n)}_{t}$$
- Forward-view TD($\lambda$)($G^{\lambda}_{t}$ is robust!)
$$V(S_{t}) \leftarrow V(S_{t})+\alpha(G^{\lambda}_{t}-V(S_{t}))$$
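A minimal sketch of the forward-view $\lambda$-return for a finite episode, reusing the hypothetical `n_step_return` helper above; the leftover weight $\lambda^{T-t-1}$ is placed on the full Monte-Carlo return so that the weights still sum to 1.

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Forward-view lambda-return G_t^lambda for an episodic trajectory (sketch).

    G_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * G_t^(n)
                 + lam^(T-t-1) * G_t,
    where the remaining weight lam^(T-t-1) falls on the full Monte-Carlo return.
    """
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):                  # n-step returns that bootstrap
        G_lam += (1 - lam) * (lam ** (n - 1)) * n_step_return(rewards, values, t, n, gamma)
    # Remaining weight on the full return up to termination (the MC return).
    G_lam += (lam ** (T - t - 1)) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam
```

With $\lambda=0$ this reduces to the one-step TD target, and as $\lambda \to 1$ it approaches the Monte-Carlo return, matching the TD(0)/TD(1) limits discussed below.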
(3) Eligibility Trace and Forward & Backward view of TD($\lambda$)
i. Forward view of TD($\lambda$)
- Update value function towards the $\lambda$-return
- Forward-view looks into the future to compute $G^{\lambda}_{t}$. But like MC, it can only be computed from complete episodes (TD's appeal was that only $n$ steps of look-ahead were needed, yet here, just as with MC, the episode has to finish before the update can be computed)

ii. Eligibility Trace
An eligibility trace identifies, among previously visited states, which ones contributed to the reward being received now, and distributes credit for that reward back to those states (reference blog).
Suppose an agent is randomly walking in an environment and finds a treasure. It then stops and looks backwards, trying to work out what led it to this treasure.
Naturally, the steps that are close to the treasure have more merit in finding it than the steps that are miles away. So closer locations are more valuable than distant ones and are thus assigned bigger values. (reference)
Ideas such as the frequency heuristic (assign credit to the most frequently visited states) and the recency heuristic (assign credit to the most recently visited states) can be applied here; the eligibility trace combines both.
$$\begin{array}{rcl}
E_{0}(s) & = & 0 \\
E_{t}(s) & = & \gamma\lambda E_{t-1}(s)+\textbf{1}(S_{t}=s)
\end{array}$$

iii. Backward view TD($\lambda$)
- Keep an eligibility trace for every state $s$
- Update value $V(s)$ for every state $s$
- In proportion to TD-error $\color{red}\delta_{t}$ and eligibility trace $\color{red}E_{t}(s)$
$$\begin{array}{rcl}
\delta_{t} & = & R_{t+1}+\gamma V(S_{t+1})-V(S_{t}) \\
V(s) & \leftarrow & V(s)+\alpha\delta_{t}E_{t}(s)
\end{array}$$
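A minimal backward-view TD($\lambda$) sketch with accumulating eligibility traces, using the same assumed gym-like interface and hyperparameters as the TD(0) sketch earlier.

```python
from collections import defaultdict

def td_lambda_backward(env, policy, num_episodes, lam=0.9, alpha=0.1, gamma=1.0):
    """Backward-view TD(lambda) with accumulating eligibility traces (sketch)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                 # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0                    # accumulate trace for the visited state
            # Every state is updated in proportion to delta_t and its trace E_t(s).
            for s in list(E.keys()):
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam            # decay all traces
            state = next_state
    return V
```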

iv. TD($\lambda$) and TD(0)
- When $\lambda=0$, only current state is updated
$$\begin{array}{rcl}
E_{t}(s) & = & \textbf{1}(S_{t}=s) \\
V(s) & \leftarrow & V(s)+\alpha\delta_{t}E_{t}(s)
\end{array}$$
- This is exactly equivalent to TD(0) update, $V(s) \leftarrow V(s)+\alpha\delta_{t}$
- When $\lambda=1$, credit is deferred until end of episode(consider episodic environments with offline updates). Over the course of an episode, total update for TD(1) is the same as total update for MC
[Theorem]
The sum of offline updates is identical for forward-view and backward-view TD($\lambda$)
$$\sum_{t=1}^{T}\alpha\delta_{t}E_{t}(s) = \sum_{t=1}^{T}\alpha(G^{\lambda}_{t}-V(S_{t}))\textbf{1}(S_{t}=s)$$
v. Offline & Online update
- Offline updates
- Updates are accumulated within episode
- but applied in batch at the end of episode
- Online updates
- TD($\lambda$) updates are applied online at each step within episode
- Forward and backward-view of TD($\lambda$) are slightly different
- Exact online TD($\lambda$) achieves perfect equivalence, by using a slightly different form of eligibility trace
vi. Summary of Forward and Backward TD($\lambda$)
