This post summarizes Professor David Silver's RL lectures (link).
1. Introduction
Reinforcement learning has the following characteristics.
- There is no supervisor, only a reward signal
- Feedback may be immediate or delayed
- Time matters (sequential, non-i.i.d. data)
- The agent's actions affect the subsequent data it receives
2. The RL Problem
1) Reward
The reward $R_{t}$ is a scalar feedback signal indicating how well the agent is doing at time step $t$. Reinforcement learning is based on the following reward hypothesis.
Reward Hypothesis
All goals can be described by the maximization of expected cumulative reward
The ultimate goal of reinforcement learning is to select actions that maximize total future reward. Actions may have long-term consequences, rewards may be delayed, and it can be better to sacrifice immediate reward to gain more long-term reward.
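As a small illustration (the numbers are hypothetical, not from the lecture), compare two reward sequences under a discount factor $\gamma = 0.9$: sequence A pays $+1$ immediately and nothing afterwards, while sequence B pays nothing at first and $+10$ two steps later.
$$G^{A} = 1 + 0.9 \cdot 0 + 0.9^{2} \cdot 0 = 1, \qquad G^{B} = 0 + 0.9 \cdot 0 + 0.9^{2} \cdot 10 = 8.1$$
Even though A has the larger immediate reward, B has the larger cumulative reward, so an agent maximizing expected cumulative reward should prefer B.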
2) Agent and Environment
The reinforcement learning setting consists of an agent and an environment.
At each step $t$:
i. Agent
executes action $A_{t}$
receives observation $O_{t}$
receives scalar reward $R_{t}$
ii. Environment
receives action $A_{t}$
emits observation $O_{t+1}$ and scalar reward $R_{t+1}$
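A minimal sketch of this interaction loop, assuming hypothetical `Agent` and `Environment` stubs (not from the lecture):

```python
# Minimal agent-environment interaction loop (sketch; Agent/Environment are placeholder stubs).
class Environment:
    def step(self, action):
        # Receives A_t and emits (O_{t+1}, R_{t+1}); trivial placeholder dynamics for illustration.
        observation, reward = 0, 1.0
        return observation, reward

class Agent:
    def act(self, observation):
        # Chooses A_t given O_t; constant placeholder action for illustration.
        return 0

env, agent = Environment(), Agent()
observation = 0
for t in range(10):
    action = agent.act(observation)           # agent executes A_t
    observation, reward = env.step(action)    # environment returns O_{t+1}, R_{t+1}
```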
3) History and State
The history is the sequence of observations, actions, and rewards recorded as the interaction proceeds.
- $H_{t} = O_{1}, R_{1}, A_{1}, \cdots, A_{t-1}, O_{t}, R_{t}$
The state is the information used to determine what happens next; it can be expressed as a function of the history.
- $S_{t} = f(H_{t})$
An information state (Markov state) contains all useful information from the history.
A state $S_{t}$ is Markov if and only if
$$\mathbb{P}[S_{t+1}|S_{t}] = \mathbb{P}[S_{t+1}|S_{1},\cdots,S_{t}]$$
이는 "The future is independent of the past given the present"와 같이 표현할 수 있다.
- $H_{1:t} \rightarrow S_{t} \rightarrow H_{t+1:\infty }$
Once the state is known, the history may be thrown away.
The State is a sufficient statistic of the future.
i. Environment State
The environment state $S^{e}_{t}$ is the environment's private representation and is usually not directly visible to the agent.
ii. Agent State
The agent state $S^{a}_{t}$ is the agent's internal representation and can be any function of the history: $S^{a}_{t} = f(H_{t})$.
4) Observability
i. Full observability
When the agent can directly observe the full state of the environment, the setting is fully observable; formally, this is a Markov Decision Process (MDP).
- Agent state = Environment state = Information State
$O_{t} = S^{a}_{t} = S^{e}_{t}$
ii. Partial observability
When the agent only observes the environment indirectly, the setting is a partially observable Markov Decision Process (POMDP). In this case, the agent must construct its own state representation $S^{a}_{t}$.
- Agent state $\neq $ Environment state
3. RL Agent
An RL agent may include one or more of the following components.
- Policy: agent's behaviour function
- Value Function: measures how good is each state or action
- Model: agent's representation of the environment
1) Policy
A policy determines the agent's behaviour; it is a map from state to action.
- Deterministic policy: $\color{blue}a = \pi(s)$
- Stochastic policy: $\color{blue}\pi(a|s) = \mathbb{P}[A_{t}=a|S_{t}=s]$
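A minimal sketch of the two policy types, assuming a small discrete state/action space (the state and action names are illustrative):

```python
import random

# Deterministic policy: a = pi(s), here a simple lookup table.
deterministic_pi = {"s1": "left", "s2": "right"}
a = deterministic_pi["s1"]

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], sampled from a per-state distribution.
stochastic_pi = {"s1": {"left": 0.8, "right": 0.2}}
actions, probs = zip(*stochastic_pi["s1"].items())
a = random.choices(actions, weights=probs, k=1)[0]
```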
2) Value Function
A value function is a prediction of future reward. It evaluates how good each state (or action) is and is used to decide which action to take.
- $\color{blue} v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1}+\gamma R_{t+2} + \gamma^{2} R_{t+3}+\cdots | S_{t}=s]$
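A minimal sketch of the discounted return inside this expectation; $v_{\pi}(s)$ is the expectation of this quantity over trajectories that start in $s$ and follow $\pi$ (the reward list below is hypothetical):

```python
def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 10.0]))  # 0 + 0.9*0 + 0.81*10 = 8.1
```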
3) Model
A model predicts what the environment will do next; it is typically expressed as a transition model and a reward function.
- Transition probability $\color{blue} \mathcal{P}$: the dynamics of the environment, predicts the next state
$\color{blue} \mathcal{P}^{a}_{ss^{'}} = \mathbb{P}[S_{t+1}=s^{'}|S_{t}=s, A_{t}=a]$
- Reward function $\color{blue} \mathcal{R}$: predicts the next (immediate) reward
$\color{blue} \mathcal{R}^{a}_{s} = \mathbb{E}[R_{t+1}|S_{t}=s, A_{t}=a]$
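A minimal sketch of a tabular model, storing $\mathcal{P}^{a}_{ss^{'}}$ and $\mathcal{R}^{a}_{s}$ as dictionaries (the states, actions, and numbers are made up for illustration):

```python
# P[(s, a)] is a distribution over next states s'; R[(s, a)] is the expected immediate reward.
P = {("s1", "go"): {"s1": 0.1, "s2": 0.9}}
R = {("s1", "go"): 1.0}

# Predict the next state and immediate reward for taking "go" in "s1".
next_state_dist = P[("s1", "go")]   # P^a_{ss'}
expected_reward = R[("s1", "go")]   # R^a_s
```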
4. RL Categories and Taxonomy
1) RL Categories
i-1) Value-based: the agent is represented by a value function
- No policy (implicit)
- Value Function
i-2) Policy-based: the agent is represented by a policy
- Policy
- No Value Function
i-3) Actor Critic
- Policy
- Value Function
ii-1) Model-free: does not estimate a model of the environment (e.g. transition probabilities)
- Policy and/or Value Function
- No Model
ii-2) Model-based: estimates a model of the environment (e.g. transition probabilities)
- Policy and/or Value Function
- Model
Model-free & Model-based RL에 대한 설명은 openAI글에 잘 설명되어 있다.
One of the most important branching points in an RL algorithm is the question of whether the agent has access to (or learns) a model of the environment. By a model of the environment, we mean a function which predicts state transitions and rewards.
The main upside to having a model is that it allows the agent to plan by thinking ahead, seeing what would happen for a range of possible choices, and explicitly deciding between its options...
The main downside is that a ground-truth model of the environment is usually not available to the agent. If an agent wants to use a model in this case, it has to learn the model purely from experience, which creates several challenges. The biggest challenge is that bias in the model can be exploited by the agent, resulting in an agent which performs well with respect to the learned model, but behaves sub-optimally (or super terribly) in the real environment...
Algorithms which use a model are called model-based methods, and those that don’t are called model-free.
2) Learning and Planning
Sequential decision-making problems can be classified as follows.
i. (Reinforcement) Learning
- The environment is initially unknown
- The agent interacts with the environment
- The agent improves its policy
ii. Planning
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
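As a hedged sketch of planning (not the lecture's own example), the agent can improve its policy purely by computation with a known model: here a one-step lookahead using P and R tables in the same format as the model sketch above, plus a hypothetical value table V.

```python
# One-step lookahead planning with a known model (sketch; all tables are illustrative).
def greedy_action(s, P, R, V, gamma=0.9, actions=("go", "stay")):
    # Evaluate each action purely with the model: R^a_s + gamma * sum_s' P^a_{ss'} * V(s').
    def q(a):
        return R.get((s, a), 0.0) + gamma * sum(p * V.get(s2, 0.0)
                                                for s2, p in P.get((s, a), {}).items())
    return max(actions, key=q)

V = {"s1": 0.0, "s2": 5.0}
P = {("s1", "go"): {"s2": 1.0}, ("s1", "stay"): {"s1": 1.0}}
R = {("s1", "go"): 0.0, ("s1", "stay"): 1.0}
print(greedy_action("s1", P, R, V))  # "go": 0 + 0.9*5.0 = 4.5 beats "stay": 1 + 0.9*0 = 1.0
```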
3) Exploration and Exploitation
- Exploration: finds more information about the environment
- Exploitation: exploits known information to maximize reward
The exploration-exploitation trade-off is explained well in Joseph Rocca's article.
Exploration-exploitation trade-off
“Should I go for the decision that seems to be optimal, assuming that my current knowledge is reliable enough? Or should I go for a decision that seems to be sub-optimal for now, making the assumption that my knowledge could be inaccurate and that gathering new information could help me to improve it?”
Exploitation consists of taking the decision assumed to be optimal with respect to the data observed so far. This « safe » approach tries to avoid bad decisions as much as possible but also prevents from discovering potential better decisions
Exploration consists of not taking the decision that seems to be optimal, betting on the fact that observed data are not sufficient to truly identify the best option. This more « risky » approach can sometimes lead to poor decisions but also makes it possible to discover better ones, if there exists any.
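A common, simple way to balance the two is ε-greedy action selection (a standard technique, not specific to this lecture); a minimal sketch with hypothetical action-value estimates Q:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon explore (random action), otherwise exploit (greedy action).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {("s1", "left"): 0.2, ("s1", "right"): 0.7}
print(epsilon_greedy(Q, "s1", ["left", "right"]))  # usually "right", occasionally a random action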
4) Prediction and Control
- Prediction: evaluate the future given a policy
- Control: optimize the future to find the best policy
A Stack Exchange answer explains this in more detail.
A prediction task in RL is where the policy is supplied, and the goal is to measure how well it performs. That is, to predict the expected total reward from any given state assuming the function $\pi (a|s)$ is fixed.
A control task in RL is where the policy is not fixed, and the goal is to find the optimal policy. That is, to find the policy $\pi (a|s)$ that maximises the expected total reward from any given state.
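A brief sketch of the difference, with illustrative names (not from the lecture): prediction estimates the value of a fixed policy, while control picks actions to improve the policy.

```python
# Prediction: evaluate a fixed policy pi by averaging sampled returns from a state (sketch).
def evaluate_policy(sampled_returns_from_s):
    return sum(sampled_returns_from_s) / len(sampled_returns_from_s)

# Control: improve the policy by acting greedily with respect to current value estimates (sketch).
def improve_policy(Q, state, actions):
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

print(evaluate_policy([8.1, 7.5, 9.0]))  # prediction: estimate of v_pi(s)
print(improve_policy({("s1", "left"): 0.2, ("s1", "right"): 0.7}, "s1", ["left", "right"]))  # control: "right"
```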