This post summarizes Professor David Silver's lecture (link). The flow of the material covered so far is as follows.
- Lecture 4, Model-Free Prediction: estimate the value of an unknown MDP (policy evaluation)
- Lecture 5, Model-Free Control: optimize the value function of an unknown MDP (i.e., how to learn a policy that maximizes reward when the environment is not given)
A wide range of problems can be modeled as MDPs: Elevator, Robocup Soccer, Ship Steering, Helicopter, Game of Go, Robot Walking, Protein Folding, and more. These problems are tackled in one of two settings: 1) the MDP model is unknown, but experience can be sampled, or 2) the MDP model is known, but it is too big to use except by sampling. Model-free control lets us solve problems in both settings.
The key to understanding Lecture 5 is the distinction between On-Policy and Off-Policy Learning.
- On-Policy Learning: learn about policy $\pi$ from experience sampled from $\pi$
- e.g.) Policy-iteration, Value-Iteration, Monte Carlo for On-Policy, Sarsa
- Off-Policy Learning: learn about policy $\pi$ from experience sampled from $\mu$
- e.g.) Q-Learning
Sutton and Barto's RL Book (2020) explains well how the theory develops in the model-free setting, so I quote it here first.
If a model is not available, then it is particularly useful to estimate action values (the values of state–action pairs) rather than state values. With a model, state values alone are sufficient to determine a policy; one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy. Thus, one of our primary goals for Monte Carlo methods is to estimate $q_{*}$. To achieve this, we first consider the policy evaluation problem for action values.
The policy evaluation problem for action values is to estimate $q_{\pi}(s,a)$, the expected return when starting in state $s$, taking action $a$, and thereafter following policy $\pi$. The Monte Carlo methods for this are essentially the same as just presented for state values, except now we talk about visits to a state–action pair rather than to a state...
The only complication is that many state–action pairs may never be visited. If $\pi$ is a deterministic policy, then in following $\pi$ one will observe returns only for one of the actions from each state. With no returns to average, the Monte Carlo estimates of the other actions will not improve with experience. This is a serious problem because the purpose of learning action values is to help in choosing among the actions available in each state. To compare alternatives we need to estimate the value of all the actions from each state, not just the one we currently favor.
This is the general problem of maintaining exploration. For policy evaluation to work for action values, we must assure continual exploration. One way to do this is by specifying that the episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start. This guarantees that all state–action pairs will be visited an infinite number of times in the limit of an infinite number of episodes. We call this the assumption of exploring starts.
(...)
How can we avoid the unlikely assumption of exploring starts? The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods. On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.
1. On-Policy Monte-Carlo Control
(1) Generalized Policy Iteration(GPI)
Let us first revisit Generalized Policy Iteration (GPI), the framework in which an optimal policy is found by alternating policy evaluation and policy improvement.
GPI, as presented so far, obtains the optimal policy through policy evaluation and improvement when the model is given.
- Greedy policy improvement over $V(s)$ requires a model of the MDP, because the transition probability $\mathcal{P}$ and the reward function $\mathcal{R}$ must be given:
$$\pi^{'}(s)=\underset{a\in A}{\text{argmax}} \ \mathcal{R}^{a}_{s}+\sum_{s^{'}\in S}\mathcal{P}^{a}_{ss^{'}}V(s^{'})$$
However, improving the policy through $\color{red}Q(s,a)$ rather than $V(s)$ can be done in a model-free setting.
- Greedy policy improvement over $Q(s,a)$ is model-free
$$\pi^{'}(s)=\underset{a\in A}{\text{argmax}} \ Q(s,a)$$
[Generalized Policy Iteration with Action-Value Function]
[$\epsilon$-Greedy Exploration]
As the quote from the RL book explains, using $Q(s,a)$ instead of $V(s)$ in the model-free setting requires that all states and actions keep being explored. To this end we can use $\epsilon$-greedy exploration, a slight relaxation of greedy policy improvement (a short code sketch follows the formula below).
- Simplest idea for ensuring continual exploration: all $m$ actions are tried with non-zero probability
- With probability $1-\epsilon$ choose the greedy action; with probability $\epsilon$ choose an action at random
$$\pi(a|s) = \left\{ \begin{array}{cl}
\epsilon/m+1-\epsilon & : \ \text{if} \ a^{*}=\underset{a\in A}{\text{argmax}}\ Q(s,a) \\
\epsilon/m & : \ \text{otherwise}
\end{array} \right.$$
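As a concrete illustration, here is a minimal Python sketch of $\epsilon$-greedy action selection over a tabular $Q$. The array layout (`Q[state, action]`) and the argument names are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon, rng=None):
    """Pick the greedy action w.p. 1 - epsilon, a uniformly random action w.p. epsilon.

    The greedy action then has total probability epsilon/m + 1 - epsilon and every
    other action has probability epsilon/m, matching the formula above.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: any of the m actions
    return int(np.argmax(Q[state]))           # exploit: greedy action
```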
[Theorem]
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi^{'}$ with respect to $q_{\pi}$ is an improvement $v_{\pi^{'}} \ge v_{\pi}$
$$\begin{array}{rcl}
q_{\pi}(s,\pi^{'}(s)) & = & \sum_{a\in A}\pi^{'}(a|s)q_{\pi}(s,a) \\
& = & \epsilon/m\sum_{a\in A}q_{\pi}(s,a)+(1-\epsilon)\underset{a\in A}{\text{max}} \ q_{\pi}(s,a) \\
& \ge & \epsilon/m\sum_{a\in A}q_{\pi}(s,a)+(1-\epsilon)\sum_{a\in A}\frac{\pi(a|s)-\epsilon/m}{1-\epsilon} \ q_{\pi}(s,a) \\
& = &\sum_{a\in A}\pi(a|s)q_{\pi}(s,a)=v_{\pi}(s) \\
& & \therefore \pi^{'} \ge \pi \\
& & (m: \text{number of actions in state} \ s)
\end{array}$$
[Monte-Carlo Policy Iteration]
Policy iteration with $\color{blue}Q(s,a)$ and $\color{blue}\epsilon\text{-greedy}$ improvement can be expressed as follows.
- Policy evaluation: run some episodes using the current policy and estimate the value of every action from every state just from what we've seen (for every state–action pair, take the mean return). This can be very slow and inefficient because of the continual exploration.
- Policy improvement: update the policy using $\epsilon$-greedy (the policy is now stochastic)
[Monte-Carlo Control]
The Monte-Carlo method needs full, terminated episodes. In practice, however, even a few steps within an episode already give enough information to improve the policy (the idea: update the policy using the $Q$ estimated from the most recent information).
- Policy evaluation: run one episode using the current policy (one return), collect all the steps from that episode, and update the $Q$ values just for those steps (for every state–action pair visited along the way, update the mean value only for the states and actions seen in that episode)
- Policy improvement: update the policy using $\epsilon$-greedy (the policy is now stochastic)
[GLIE]
[Definition]
Greedy in the Limit with Infinite Exploration(GLIE)
$$\underset{k \rightarrow \infty}{\text{lim}} N_{k}(s,a)=\infty$$
- All state-action pairs are explored infinitely many times
$$\underset{k \rightarrow \infty}{\text{lim}} \pi_{k}(a|s)=\textbf{1}(a=\underset{a^{'}\in A}{\text{argmax}}\ Q_{k}(s,a^{'}))$$
- The policy converges on a greedy policy
- For example, $\epsilon$-greedy is GLIE if $\epsilon$ reduces to zero at $\epsilon_{k}=\frac{1}{k}$
GLIE Monte-Carlo control proceeds as follows (a short code sketch follows the update rules below).
[Theorem]
GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s,a) \rightarrow q_{*}(s,a)$
- Sample $k$th episode using $\pi: \{S_{1},A_{1},R_{2},\dots,S_{T}\} \sim \pi$
- For each state $S_{t}$ and action $A_{t}$ in the episode,
$$\begin{array}{rcl}
N(S_{t}, A_{t}) & \leftarrow & N(S_{t}, A_{t})+1 \\
Q(S_{t}, A_{t}) & \leftarrow & Q(S_{t}, A_{t}) +\frac{1}{N(S_{t}, A_{t})}(G_{t}-Q(S_{t}, A_{t}))
\end{array}$$
- Improve policy based on new action-value function
$$\begin{array}{rcl}
\epsilon & \leftarrow & \frac{1}{k} \\
\pi & \leftarrow & \epsilon-\text{greedy}(Q)
\end{array}$$
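Here is a minimal sketch of one GLIE Monte-Carlo control iteration, assuming tabular `Q` and visit counts `N` stored as NumPy arrays and an episode represented as a list of `(state, action, reward)` tuples; these representations are my assumptions, not the lecture's.

```python
import numpy as np

def glie_mc_control_update(Q, N, episode, k, gamma=1.0):
    """Every-visit incremental MC update of Q from one sampled episode, followed by
    the GLIE schedule epsilon_k = 1/k for the next epsilon-greedy policy.

    episode: list of (state, action, reward), reward received after taking the action.
    """
    G = 0.0
    # iterate backwards so G accumulates the return G_t from step t onward
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        N[state, action] += 1
        Q[state, action] += (G - Q[state, action]) / N[state, action]
    epsilon = 1.0 / k        # anneal epsilon for the next epsilon-greedy policy
    return epsilon
```

The returned `epsilon` would then drive the $\epsilon$-greedy policy used to sample episode $k+1$.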
2. On-Policy Temporal-Difference Learning
(1) MC vs TD Control
Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC):
- Lower variance / Online / Incomplete sequences
Natural idea: use TD instead of MC in our control loop
- Apply TD to $Q(S,A)$(no look-ahead), use $\epsilon$-greedy policy improvement
- Update every time-step
(2) Sarsa
We turn now to the use of TD prediction methods for the control problem. As usual, we follow the pattern of generalized policy iteration (GPI), only this time using TD methods for the evaluation or prediction part. As with Monte Carlo methods, we face the need to trade off exploration and exploitation, and again approaches fall into two main classes: on-policy and off-policy.
The first step is to learn an action-value function rather than a state-value function. In particular, for an on-policy method we must estimate $q_{\pi}(s,a)$ for the current behavior policy $\pi$ and for all states $s$ and actions $a$. This can be done using essentially the same TD method for learning $v_{\pi}$.
Recall that an episode consists of an alternating sequence of states and state–action pairs, $\dots, S_{t}, A_{t}, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, \dots$
In the previous section we considered transitions from state to state and learned the values of states. Now we consider transitions from state–action pair to state–action pair, and learn the values of state–action pairs. Formally these cases are identical: they are both Markov chains with a reward process.
$$Q(S,A) \leftarrow Q(S,A)+\alpha(R+\gamma Q(S^{'},A^{'})-Q(S,A))$$
This update is done after every transition from a nonterminal state $S_{t}$ (if $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero). This rule uses every element of the quintuple of events, $\color{blue}(S_{t}, A_{t}, R_{t+1}, S_{t+1}, A_{t+1})$, that makes up a transition from one state–action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm.
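A minimal sketch of this update for a tabular `Q`; the `done` flag for a terminal $S'$ and the array indexing are assumptions for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, done):
    """One Sarsa update: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A)).
    If S' is terminal, Q(S', A') is taken to be zero, as stated above."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```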
[Theorem]
Sarsa converges to the optimal action-value function, $Q(s,a) \rightarrow q_{*}(s,a)$, under the following conditions:$$\begin{array}{rcl} \sum_{t=1}^{\infty} \alpha_{t}& = & \infty \\ \sum_{t=1}^{\infty} \alpha_{t}^{2} & \lt & \infty \end{array}$$
- GLIE sequence of policies $\pi_{t}(a|s)$
- Robbins-Monro sequence of step-sizes $\alpha_{t}$
(3) (★) Sarsa($\color{red}\lambda$)
[ $n$-Step Sarsa]
- Consider the following $n$-step returns for $n=1,2,\dots,\infty$:
$$\begin{array}{rcl}
\color{red}n=1 & \color{red}\text{(Sarsa)} & q^{(1)}_{t}=R_{t+1}+\gamma Q(S_{t+1}, A_{t+1})\\
\color{red}n=2 & \ & q^{(2)}_{t}=R_{t+1}+\gamma R_{t+2} + \gamma^{2}Q(S_{t+2}, A_{t+2}) \\
\vdots&\ & \vdots \\
\color{red}n=\infty & \color{red}\text{(MC)} & q^{(\infty)}_{t}=R_{t+1}+\gamma R_{t+2} + \dots + \gamma^{T-1}R_{T}
\end{array}$$
- Define the $n$-step $Q$-return
$$q^{(n)}_{t}=R_{t+1}+\gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n}+\gamma^{n}Q(S_{t+n}, A_{t+n})$$
- $n$-step Sarsa updates $Q(s,a)$ towards the $n$-step $Q$-return
$$Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t})+\alpha(q_{t}^{(n)}-Q(S_{t}, A_{t}))$$
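A minimal sketch of computing the $n$-step $Q$-return from a stored trajectory, assuming `rewards[t]` holds $R_{t+1}$ and that $t+n$ stays within the episode; the container names are illustrative assumptions.

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma):
    """q_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * Q(S_{t+n}, A_{t+n})."""
    G = sum((gamma ** i) * rewards[t + i] for i in range(n))   # discounted rewards
    G += (gamma ** n) * Q[states[t + n], actions[t + n]]       # bootstrap from Q
    return G
```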
[Forward View of Sarsa($\lambda$)]
- The $q^{\lambda}$ return combines all $n$-step $Q$-returns $q_{t}^{(n)}$ using weight $(1-\lambda)\lambda^{n-1}$ (in the end we still have to wait until the episode terminates, which is why the backward view is needed)
$$q^{\lambda}_{t}=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}q_{t}^{(n)}$$
- Forward-view Sarsa($\lambda$)
$$Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t})+\alpha(q_{t}^{\lambda}-Q(S_{t}, A_{t}))$$
[Backward view of Sarsa($\lambda$)]
- Just like TD($\lambda$), we use eligibility traces in an online algorithm
- But Sarsa($\lambda$) has one eligibility trace for each state-action pair
$$\begin{array}{rcl}
E_{0}(s,a) & = & 0 \\
E_{t}(s,a) & = & \gamma\lambda E_{t-1}(s,a)+\textbf{1}(S_{t}=s, A_{t}=a)
\end{array}$$
- $Q(s,a)$ is updated for every state $s$ and action $a$, in proportion to TD-error $\delta_{t}$ and eligibility trace $E_{t}(s,a)$
$$\begin{array}{rcl}
\delta_{t} & = & R_{t+1}+\gamma Q(S_{t+1}, A_{t+1})-Q(S_{t}, A_{t}) \\
Q(s, a) & \leftarrow & Q(s, a)+\alpha\delta_{t}E_{t}(s,a)
\end{array}$$
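A minimal sketch of one backward-view Sarsa($\lambda$) time-step, assuming `Q` and the eligibility-trace table `E` are NumPy arrays of the same shape and updated in place; the names and the `done` handling are my assumptions.

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam, done):
    """Decay every trace by gamma*lambda, bump the visited (s, a), then move all
    Q(s, a) in proportion to the TD error delta_t and the trace E_t(s, a)."""
    delta = (r if done else r + gamma * Q[s_next, a_next]) - Q[s, a]
    E *= gamma * lam                  # E_t(s,a) = gamma * lambda * E_{t-1}(s,a)
    E[s, a] += 1.0                    # + 1(S_t = s, A_t = a)
    Q += alpha * delta * E            # online update of every state-action pair
    if done:
        E[:] = 0.0                    # traces are reset at the start of each episode
```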
3. Off-Policy Learning
(1) Off-Policy Learning
- Evaluate target policy $\pi(a|s)$ to compute $v_{\pi}(s)$ or $q_{\pi}(s,a)$, while following behavior policy $\mu(a|s)$
$$\{S_{1},A_{1},R_{2},\dots,S_{T}\} \sim \mu$$
So why does off-policy learning matter?
- Learn from observing humans or other agents
- Re-use experience generated from old policies $\pi_{1},\pi_{2},\dots,\pi_{t-1}$
- Learn about optimal policy while following exploratory policy
- Learn about multiple policies while following one policy
The RL Book explains this in more detail.
All learning control methods face a dilemma: They seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions (to find the optimal actions). How can they learn about the optimal policy while behaving according to an exploratory policy? The on-policy approach in the preceding section is actually a compromise - it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data "off" the target policy, and the overall process is termed off-policy learning.
Off-policy methods require additional concepts and notation, and because the data is due to a different policy, off-policy methods are often of greater variance and are slower to converge. On the other hand, off-policy methods are more powerful and general. They include on-policy methods as the special case in which the target and behavior policies are the same. Off-policy methods also have a variety of additional uses in applications.
(2) Importance Sampling
[Importance Sampling]
Estimate the expectation of a different distribution.
$$\begin{array}{rcl}
\mathbb{E}_{X \sim P}[f(X)] & = & \sum P(X)f(X) \\
& = & \sum Q(X)\frac{P(X)}{Q(X)} f(X) \\
& = & \mathbb{E}_{X \sim Q}[\frac{P(X)}{Q(X)}f(X)]
\end{array}$$
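A small numeric sketch of this identity: estimating $\mathbb{E}_{X \sim P}[f(X)]$ from samples drawn under $Q$ by re-weighting with $P(X)/Q(X)$. The two discrete distributions and $f$ are made-up examples.

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([0.7, 0.2, 0.1])    # target distribution over {0, 1, 2}
Q = np.array([1/3, 1/3, 1/3])    # behaviour (sampling) distribution
f = np.array([1.0, 5.0, 10.0])   # arbitrary function of X

x = rng.choice(3, size=100_000, p=Q)          # samples drawn from Q
is_estimate = np.mean((P[x] / Q[x]) * f[x])   # importance-weighted average

print(is_estimate, "vs exact value", float(P @ f))   # both close to 2.7
```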
We begin the study of off-policy methods by considering the prediction problem, in which both target and behavior policies are fixed. That is, suppose we wish to estimate $v_{\pi}$ or $q_{\pi}$, but all we have are episodes following another policy $b$, where $b \neq \pi$. In this case, $\pi$ is the target policy, $b$ is the behavior policy, and both policies are considered fixed and given.
In order to use episodes from $b$ to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under $b$. That is, we require that $\pi(a|s) >0$ implies $b(a|s) >0$. This is called the assumption of coverage. It follows from coverage that $b$ must be stochastic in states where it is not identical to $\pi$. The target policy $\pi$, on the other hand, may be deterministic.
(...)
Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.
Given a starting state $S_{t}$, the probability of the subsequent state–action trajectory, $A_{t},S_{t+1},A_{t+1},\dots,S_{T}$, occurring under any policy $\pi$ is
$$\begin{array}{rcl} & Pr\{ A_{t},S_{t+1},A_{t+1},\dots,S_{T}|S_{t},A_{t:T-1} \sim \pi \} \\ & = \ \pi(A_{t}|S_{t})p(S_{t+1}|S_{t},A_{t})\pi(A_{t+1}|S_{t+1})\cdots p(S_{T}|S_{T-1},A_{T-1}) \\ & = \ \prod_{k=t}^{T-1}\pi(A_{k}|S_{k})p(S_{k+1}|S_{k},A_{k}) \end{array}$$
where $p$ here is the state-transition probability function. Thus, the relative probability of the trajectory under the target and behavior policies (the importance sampling ratio) is
$$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1}\pi(A_{k}|S_{k})p(S_{k+1}|S_{k},A_{k})}{\prod_{k=t}^{T-1}b(A_{k}|S_{k})p(S_{k+1}|S_{k},A_{k})} =\prod_{k=t}^{T-1} \frac{ \pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}$$
Recall that we wish to estimate the expected returns (values) under the target policy, but all we have are returns $G_{t}$ due to the behavior policy. These returns have the wrong expectation $\mathbb{E}[G_{t}|S_{t}=s] = v_{b}(s)$ and so cannot be averaged to obtain $v_{\pi}$. This is where importance sampling comes in. The ratio $\rho_{t:T-1}$ transforms the returns to have the right expected value:
$$\mathbb{E}[\rho_{t:T-1}G_{t}|S_{t}=s] = v_{\pi}(s)$$
[Importance Sampling for Off-Policy Monte-Carlo]
- Use returns generated from $\mu$ to evaluate $\pi$
- Weight return $G_{t}$ according to similarity between policies
- Multiply importance sampling corrections along the whole episode (cannot be used if $\mu$ is zero where $\pi$ is non-zero)
$$G^{\pi/\mu}_{t} = \frac{ \pi(A_{t}|S_{t})\pi(A_{t+1}|S_{t+1}) }{ \mu(A_{t}|S_{t})\mu(A_{t+1}|S_{t+1}) }\cdots \frac{ \pi(A_{T}|S_{T}) }{ \mu(A_{T}|S_{T}) }G_{t}$$
- Update value towards corrected return
$$V(S_{t}) \leftarrow V(S_{t})+\alpha(\color{red} G^{\pi/\mu}_{t} \color{black}-V(S_{t}))$$
- Importance sampling can dramatically increase variance
[Importance Sampling for Off-Policy TD]
- Use TD targets generated from $\mu$ to evaluate $\pi$
- Weight the TD target $R+\gamma V(S^{'})$ by importance sampling (at every step!)
- Only need a single importance sampling correction
$$V(S_{t}) \leftarrow V(S_{t})+\alpha(\color{red} \frac{\pi(A_{t}|S_{t})}{\mu(A_{t}|S_{t})}(R_{t+1}+\gamma V(S_{t+1}))\color{black}-V(S_{t}) )$$
- Much lower variance than Monte-Carlo importance sampling
- Policies only need to be similar over a single step
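A minimal sketch of this one-step importance-sampled TD(0) update; `pi` and `mu` are assumed to be callables returning $\pi(a|s)$ and $\mu(a|s)$, which is an interface I am assuming for illustration.

```python
def off_policy_td0_update(V, pi, mu, s, a, r, s_next, alpha, gamma, done):
    """V(S_t) <- V(S_t) + alpha * (rho * (R + gamma * V(S_{t+1})) - V(S_t)),
    where rho = pi(a|s) / mu(a|s) is the single-step importance weight."""
    rho = pi(a, s) / mu(a, s)
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (rho * target - V[s])
```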
(3) (★) Q-Learning
- We now consider off-policy learning of action-values $Q(s,a)$
- No importance sampling is required
- Next action is chosen using behaviour policy $A_{t+1} \sim \mu(\cdot|S_{t})$ (the action actually taken)
- But we also consider an alternative successor action $A^{'} \sim \pi(\cdot|S_{t})$ (an alternative action we can imagine taking under the target policy)
- And update $Q(S_{t},A_{t})$ towards value of alternative action
$$Q(S_{t},A_{t}) \leftarrow Q(S_{t},A_{t}) + \alpha(\color{red} R_{t+1}+\gamma Q(S_{t+1},A^{'})\color{black}-Q(S_{t},A_{t}))$$
One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989), defined by
$$Q(S_{t},A_{t}) \leftarrow Q(S_{t},A_{t})+\alpha[ R_{t+1} + \gamma \ \underset{a}{\text{max}} \ Q(S_{t+1}, a)-Q(S_{t},A_{t})]$$
In this case, the learned action-value function, $Q$, directly approximates $q_{*}$, the optimal action-value function, independent of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs. The policy still has an effect in that it determines which state–action pairs are visited and updated. However, all that is required for correct convergence is that all pairs continue to be updated (a minimal requirement in the sense that any method guaranteed to find optimal behavior in the general case must require it).
Under this assumption and a variant of the usual stochastic approximation conditions on the sequence of step-size parameters, $Q$ has been shown to converge with probability 1 to $q_{*}$.
[Off-Policy Control with Q-Learning]
- We now allow both behaviour and target policies to improve
- The target policy $\pi$ is greedy w.r.t $Q(s,a)$
$$\pi(S_{t+1}) = \underset{a^{'}}{\text{argmax}} \ Q(S_{t+1}, a^{'})$$
- The behaviour policy $\mu$ is e.g. $\color{red}\epsilon$-greedy w.r.t $Q(s,a)$
- The Q-learning target then simplifies
$$\begin{array}{rcl}
R_{t+1} + \gamma Q(S_{t+1}, A^{'}) & = & R_{t+1} + \gamma Q(S_{t+1}, \underset{a^{'}}{\text{argmax}} \ Q(S_{t+1}, a^{'})) \\
& = & R_{t+1} + \underset{a^{'}}{\text{max}} \ \gamma Q(S_{t+1}, a^{'})
\end{array}$$
[Q-Learning Control Algorithm]
[Theorem]
Q-learning control converges to the optimal action-value function, $Q(s,a)\rightarrow q_{*}(s,a)$
$$Q(S,A) \leftarrow Q(S,A)+\alpha(\color{red} R + \gamma \ \underset{a^{'}}{\text{max}} \ Q(S^{'}, a^{'})\color{black}-Q(S,A))$$
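A minimal tabular Q-learning sketch of the update above; the transitions are assumed to come from some sufficiently exploratory behaviour policy (e.g. the `epsilon_greedy_action` sketch earlier), and the array layout is an assumption.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma, done):
    """Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a' Q(S',a') - Q(S,A)).
    The max over a' plays the role of the greedy target policy, regardless of
    which behaviour policy generated the transition."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```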
Summary