This post summarizes Multi-Armed Bandits (MAB), one of the topics within Reinforcement Learning (paper link).
The Multi-Armed Bandit problem (MAB) is a toy problem that models sequential decision tasks where the learner must simultaneously exploit their knowledge and explore unknown actions to gain knowledge for the future (exploration-exploitation tradeoff) (source).
0. Introduction: Scope and Motivation
1) Example
Multi-armed bandits is a simple but powerful framework for algorithms that make decisions over time under uncertainty. For example,
- News website: When a new user arrives, the website picks an article header to show and observes whether the user clicks on it. The site's goal is to maximize the total number of clicks.
- Dynamic pricing: A store is selling a digital good, e.g. an app or a song. When a new customer arrives, the store chooses a price offered to this customer. The customer buys (or not) and leaves forever. The store's goal is to maximize the total profit.
- Investment: Each morning, you choose one stock to invest into, and invest $1. In the end of the day, you observe the change in value for each stock. The goal is to maximize the total wealth.
2) Multi-Armed Bandits basic version
The MAB framework captures such scenarios. In the basic setting, an algorithm has $K$ possible actions to choose from, a.k.a. arms, and $T$ rounds. In each round, the algorithm chooses an arm and collects a reward for this arm.
The reward is drawn independently from some distribution which is fixed, but not known to the algorithm.
An algorithm observes the reward for the chosen arm after each round, but not for the other arms that could have been chosen. Therefore, the algorithm typically needs to explore: try out different arms to acquire new information. Indeed, if an algorithm always chooses arm 1, how would it know whether arm 2 is better? Thus, there is a tradeoff between exploration and exploitation: making optimal near-term decisions based on the available information. This tradeoff is essential in MAB.
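To make this protocol concrete, here is a minimal sketch (my own, not from the paper) of the basic bandit loop with an epsilon-greedy rule on $K$ Bernoulli arms; the arm means and the exploration rate `eps` are arbitrary illustrative choices.

```python
import random

def epsilon_greedy_bandit(means, T, eps=0.1, seed=0):
    """Basic MAB loop: K Bernoulli arms with fixed (unknown) means, T rounds."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K           # how many times each arm was pulled
    estimates = [0.0] * K      # empirical mean reward per arm
    total_reward = 0.0

    for t in range(T):
        # Explore with probability eps, otherwise exploit the best estimate so far.
        if rng.random() < eps:
            arm = rng.randrange(K)
        else:
            arm = max(range(K), key=lambda a: estimates[a])

        # Bandit feedback: only the chosen arm's reward is observed.
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, estimates

# Example: three arms with unknown click probabilities.
print(epsilon_greedy_bandit(means=[0.3, 0.5, 0.7], T=10_000))
```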
3) Multi-dimensional problem space
MAB is a huge problem space, with many "dimensions" along which the models can be made more expressive and closer to reality. We discuss some of these modeling dimensions.
i. Auxiliary feedback(★)
What feedback is available to the algorithm after each round, other than the reward for the chosen arm? Does the algorithm observe rewards for the other arms?
We can distinguish three types of feedback:
- bandit feedback: when the algorithm observes the reward for the chosen arm, and no other feedback.
- full feedback: when the algorithm observes the rewards for all arms that could have been chosen.
- partial feedback: when some information is revealed, in addition to the reward of the chosen arm, but it does not always amount to full feedback.
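As a rough illustration (my own sketch, not from the text), the three feedback types differ only in what part of the per-round reward vector the algorithm gets to see; the "chosen arm plus one neighbor" case is just a made-up example of partial feedback.

```python
import random

rng = random.Random(0)
K = 3

def draw_rewards():
    # One reward per arm for this round (the full vector exists either way).
    return [rng.random() for _ in range(K)]

chosen_arm = 1
rewards = draw_rewards()

bandit_feedback = rewards[chosen_arm]        # only the chosen arm's reward
full_feedback = list(rewards)                # rewards of all arms
partial_feedback = {                         # e.g. chosen arm plus one neighbor
    chosen_arm: rewards[chosen_arm],
    (chosen_arm + 1) % K: rewards[(chosen_arm + 1) % K],
}
print(bandit_feedback, full_feedback, partial_feedback)
```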
ii. Rewards model(★)
Where do the rewards come from? Several alternatives have been studied:
- IID rewards: the reward for each arm is drawn independently from a fixed distribution that depends on the arm but not on the round $t$.
- Adversarial rewards: rewards can be arbitrary, as if they are chosen by an 'adversary' that tries to fool the algorithm. The adversary may be oblivious to the algorithm's choices, or adaptive thereto.
- Constrained adversary: rewards are chosen by an adversary that is subject to some constraints, e.g. reward of each arm cannot change much from one round to another, or the reward of each arm can change at most a few times, or the total change in rewards is upper-bounded.
- Random-process rewards: an arm's state, which determines rewards, evolves over time as a random process, e.g. a random walk or a Markov chain. The state transition in a particular round may also depend on whether the arm is chosen by the algorithm.
(Adversarial rewards are discussed in more detail below.)
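A small sketch (assumptions mine) of how two of these reward models could be simulated: an IID Bernoulli arm versus a "constrained adversary" arm whose mean drifts by at most 0.01 per round.

```python
import random

rng = random.Random(0)
T = 1000

# IID rewards: a fixed Bernoulli mean, the same distribution in every round.
iid_mean = 0.6
iid_rewards = [1.0 if rng.random() < iid_mean else 0.0 for _ in range(T)]

# Constrained adversary: the mean may change, but by at most 0.01 per round.
drift_mean, drifting_rewards = 0.6, []
for t in range(T):
    drifting_rewards.append(1.0 if rng.random() < drift_mean else 0.0)
    drift_mean = min(1.0, max(0.0, drift_mean + rng.uniform(-0.01, 0.01)))
```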
iii. Contexts(★)
In each round, the algorithm may observe some context before choosing an action. Such context often comprises the known properties of the current user and allows for personalized actions. The reward now depends on both the context and the chosen arm. Accordingly, the algorithm's goal is to find the best policy, which maps contexts to arms.
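For instance, a toy contextual protocol (illustrative only; the contexts, arms, and click probabilities are invented) might look like the following, where a policy is just a table from contexts to arms:

```python
import random

rng = random.Random(0)
contexts = ["sports_fan", "politics_fan"]
K = 2  # arm 0 = sports article, arm 1 = politics article

# Unknown click probabilities: the reward depends on (context, arm).
click_prob = {("sports_fan", 0): 0.7, ("sports_fan", 1): 0.2,
              ("politics_fan", 0): 0.1, ("politics_fan", 1): 0.6}

# A policy maps contexts to arms; the goal is to learn the best such mapping.
policy = {"sports_fan": 0, "politics_fan": 1}

total = 0.0
for t in range(1000):
    x = rng.choice(contexts)    # observe the context first
    arm = policy[x]             # then choose an arm based on it
    total += 1.0 if rng.random() < click_prob[(x, arm)] else 0.0
print(total)
```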
iv. Bayesian priors
In the Bayesian approach, the problem instance comes from a known distribution, called the Bayesian prior. One is typically interested in provable guarantees in expectation over this distribution.
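A standard example of exploiting a Bayesian prior is Thompson Sampling with a Beta prior over each Bernoulli arm's mean; the sketch below is a generic illustration, not tied to the paper's notation.

```python
import random

def thompson_sampling(means, T, seed=0):
    """Thompson Sampling for Bernoulli arms with a Beta(1, 1) prior on each mean."""
    rng = random.Random(seed)
    K = len(means)
    alpha = [1.0] * K  # posterior parameter: successes + 1
    beta = [1.0] * K   # posterior parameter: failures + 1
    total = 0.0

    for t in range(T):
        # Sample a mean from each arm's posterior; play the arm with the largest sample.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(K)]
        arm = max(range(K), key=lambda a: samples[a])

        reward = 1.0 if rng.random() < means[arm] else 0.0
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        total += reward

    return total

print(thompson_sampling([0.3, 0.5, 0.7], T=10_000))
```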
v. Structured rewards
Rewards may have a known structure, e.g. arms correspond to points in $\mathbb{R}^{d}$, and in each round the reward is a linear (or, say, concave or Lipschitz) function of the chosen arm.
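For the linear case, a minimal sketch (parameters are illustrative): arms are vectors in $\mathbb{R}^d$ and the expected reward is the inner product $\langle \theta, a \rangle$ for some unknown $\theta$, so observing a few arms already constrains the rewards of all the others.

```python
import random

rng = random.Random(0)
d = 3
theta = [0.5, -0.2, 0.8]  # unknown parameter that determines all rewards

# Arms are points in R^d; the expected reward is the inner product <theta, arm>.
arms = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

def expected_reward(arm):
    return sum(t * x for t, x in zip(theta, arm))

for i, arm in enumerate(arms):
    print(i, round(expected_reward(arm), 3))
```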
vi. Global constraints
The algorithm can be subject to global constraints that bind across arms and across rounds. For example, in dynamic pricing there may be a limited inventory of items for sale.
vii. Structured actions
An algorithm may need to make several decisions at once, e.g. a news website may need to pick a slate of articles, and a seller may need to choose prices for the entire slate of offerings.
※Adversarial Rewards
The following is excerpted from lecture notes that explain adversarial rewards (lecture link).
What does “adversarial rewards” mean? It does not necessarily mean that there is a real adversary that tries hard to make our algorithm fail. The term denotes situations in which the costs change over time. For example, at a news website, an algorithm decides every day which news to show to the users. At some point, since people are mostly interested in the presidential election, the cost of showing election news may be low. Later, however, people may be attracted to other events (e.g. the Olympics); then the cost of the election news gets higher and the cost of the more interesting events becomes lower.
In general, the problem can be cast as playing a game against the adversary: in each round, the adversary chooses a cost vector over the arms, the algorithm chooses an arm, and the algorithm incurs the cost of the chosen arm.
An adversary is called oblivious if the chosen costs at each round $t$ do not depend on the algorithm’s choices before round $t$. Otherwise, it is called adaptive. Also, an adversary is called randomized if in each round it chooses a distribution over the cost vectors, and then draws the cost vector independently from this distribution.
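The distinction can be sketched as follows (my own illustration): an oblivious adversary fixes the whole cost sequence in advance, while an adaptive adversary may look at the algorithm's past choices before picking each round's cost vector.

```python
import random

rng = random.Random(0)
K, T = 2, 5

# Oblivious adversary: all cost vectors are fixed before the game starts,
# independently of anything the algorithm will do.
oblivious_costs = [[rng.random() for _ in range(K)] for _ in range(T)]

def adaptive_costs(history):
    """Adaptive adversary: may depend on the algorithm's choices so far,
    e.g. make whatever arm was played most often the expensive one."""
    favorite = max(range(K), key=history.count) if history else 0
    return [1.0 if a == favorite else 0.0 for a in range(K)]

history = []
for t in range(T):
    arm = rng.randrange(K)                     # stand-in for the algorithm's choice
    c_oblivious = oblivious_costs[t][arm]      # fixed before the game started
    c_adaptive = adaptive_costs(history)[arm]  # chosen after seeing the history
    history.append(arm)
```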