
Multi-armed bandit

From Wikipedia, the free encyclopedia


A multi-armed bandit, also sometimes called a K-armed bandit, is a simple machine learning problem based on an analogy with a traditional slot machine (a one-armed bandit), but with more than one lever. When pulled, each lever provides a reward drawn from a distribution associated with that specific lever. The objective of the gambler is to maximize the sum of rewards collected through iterative pulls. It is classically assumed that the gambler has no initial knowledge about the levers, but that through repeated trials he can learn to focus on the most rewarding ones.
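
As a concrete, if minimal, illustration of this setup, the following Python sketch models a K-armed bandit whose levers pay Gaussian rewards. The class name GaussianBandit and the choice of Gaussian reward distributions are assumptions made for the example, not part of the problem definition.

    import random

    class GaussianBandit:
        """A K-armed bandit whose levers pay Gaussian rewards (an illustrative assumption)."""

        def __init__(self, means, stddev=1.0, seed=None):
            self.means = list(means)      # one mean reward per lever, unknown to the gambler
            self.stddev = stddev
            self.rng = random.Random(seed)

        @property
        def k(self):
            return len(self.means)

        def pull(self, lever):
            """Pull one lever (0-indexed) and return a reward drawn from its distribution."""
            return self.rng.gauss(self.means[lever], self.stddev)

    # Example: a 3-armed bandit; the gambler must discover that lever 2 is best.
    bandit = GaussianBandit(means=[0.1, 0.5, 0.9], seed=0)
    reward = bandit.pull(2)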

Empirical motivation

The multi-armed bandit problem, originally described by Robbins (1952), is a simple model of an agent that tries to optimize its decisions while acquiring knowledge at the same time. Practical examples of the bandit problem include clinical trials, where different treatments must be tried while minimizing patient losses, and adaptive routing, where delays in a network must be minimized. The question that arises in all these cases is how to balance maximizing reward based on the knowledge already acquired against attempting new actions to further increase that knowledge; this is known as the exploitation vs. exploration tradeoff in reinforcement learning.

The multi-armed bandit model

The multi-armed bandit (bandit for short) can be seen as a set of real reward distributions B = {R_1, …, R_K}, each distribution being associated with the rewards delivered by one of the K levers. Let μ_1, …, μ_K be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. His objective is to maximize the sum of the collected rewards. The horizon H is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state Markov decision process. The regret ρ after T rounds is defined as the difference between the reward sum associated with an optimal strategy and the sum of the collected rewards:

ρ = T μ* − Σ_{t=1}^{T} r̂_t,

where μ* = max_k {μ_k} is the maximal reward mean and r̂_t is the reward at round t. A strategy whose average regret per round ρ/T tends to zero with probability 1 for any bandit problem when the number of played rounds tends to infinity is a zero-regret strategy. Intuitively, zero-regret strategies are guaranteed to converge to an optimal (not necessarily unique) strategy if enough rounds are played.
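
To make the regret definition concrete, here is a hedged sketch that reuses the GaussianBandit environment above; the choose_lever callback is a hypothetical policy supplied by the caller. It plays T rounds and computes ρ = T μ* − Σ r̂_t.

    import random

    def run_and_compute_regret(bandit, choose_lever, horizon):
        """Play `horizon` rounds with the policy `choose_lever(history)` and return
        the regret rho = horizon * mu_star - (sum of collected rewards)."""
        mu_star = max(bandit.means)       # maximal reward mean, known only to the simulator
        history = []                      # (lever, reward) pairs observed so far
        collected = 0.0
        for _ in range(horizon):
            lever = choose_lever(history)
            reward = bandit.pull(lever)
            history.append((lever, reward))
            collected += reward
        return horizon * mu_star - collected

    # A uniformly random policy, for comparison; its average regret per round
    # does not tend to zero, so it is not a zero-regret strategy.
    rng = random.Random(42)
    regret = run_and_compute_regret(bandit, lambda history: rng.randrange(bandit.k), horizon=1000)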

Common bandit strategies

Many strategies are known to solve the bandit problem. These strategies fall into the three broad categories detailed below.

Semi-uniform strategies

Semi-uniform strategies were the earliest (and simplest) strategies discovered to solve the bandit problem. All of these strategies share a greedy behaviour: the best lever (based on previous observations) is always pulled, except when a (uniformly) random action is taken.

  • Epsilon-greedy strategy: the lever with the best estimated mean reward is pulled for a proportion 1 − ε of the rounds, and a lever is chosen uniformly at random for a proportion ε (a sketch follows this list).
  • Epsilon-first strategy: a pure exploration phase (levers chosen at random) during the first rounds is followed by a pure exploitation phase (the best estimated lever is always pulled) during the remaining rounds.
  • Epsilon-decreasing strategy: similar to epsilon-greedy, except that ε decreases as the experiment progresses, so that exploitation takes over once the estimates become reliable.
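
A minimal sketch of the epsilon-greedy strategy, reusing the GaussianBandit environment above; the value ε = 0.1 and the incremental-mean bookkeeping are choices made for the example.

    import random

    def epsilon_greedy(bandit, horizon, epsilon=0.1, seed=1):
        """Epsilon-greedy: pull the empirically best lever, except that with
        probability epsilon a lever is chosen uniformly at random."""
        rng = random.Random(seed)
        counts = [0] * bandit.k           # number of pulls per lever
        estimates = [0.0] * bandit.k      # empirical mean reward per lever
        total = 0.0
        for _ in range(horizon):
            if rng.random() < epsilon:
                lever = rng.randrange(bandit.k)                           # explore
            else:
                lever = max(range(bandit.k), key=lambda a: estimates[a])  # exploit
            reward = bandit.pull(lever)
            counts[lever] += 1
            estimates[lever] += (reward - estimates[lever]) / counts[lever]  # running mean
            total += reward
        return total, estimates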

Probability matching strategies

Probability matching strategies reflect the idea that the number of pulls for a given lever should match its actual probability of being the optimal lever.

  • Softmax strategy (Boltzmann exploration): each lever is pulled with a probability proportional to the exponential of its estimated mean reward, scaled by a temperature parameter (a sketch follows this list).
  • Gaussmatch strategy: to be completed.
  • Exp3 strategy: an exponential-weighting strategy designed for the adversarial bandit setting.
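
As an illustration of the probability matching idea, the following hedged sketch implements softmax (Boltzmann) action selection; the temperature value tau = 0.2 is an arbitrary choice for the example, and the bookkeeping mirrors the epsilon-greedy sketch above.

    import math
    import random

    def softmax_bandit(bandit, horizon, tau=0.2, seed=2):
        """Softmax (Boltzmann) selection: each lever is pulled with probability
        proportional to exp(estimate / tau), so levers that look better are
        pulled more often without the others being abandoned entirely."""
        rng = random.Random(seed)
        counts = [0] * bandit.k
        estimates = [0.0] * bandit.k
        total = 0.0
        for _ in range(horizon):
            weights = [math.exp(q / tau) for q in estimates]
            lever = rng.choices(range(bandit.k), weights=weights, k=1)[0]
            reward = bandit.pull(lever)
            counts[lever] += 1
            estimates[lever] += (reward - estimates[lever]) / counts[lever]
            total += reward
        return total, estimates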

Pricing strategies

Pricing strategies establish a price for each lever; the lever with the highest price is always pulled.

  • Interval estimation strategy: the price of a lever is an optimistic upper bound of a confidence interval on its estimated mean reward (a sketch in this spirit follows the list).
  • Poker strategy: to be completed.
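
As a sketch of a pricing strategy, the following interval-estimation-style rule prices each lever by an optimistic upper confidence bound on its empirical mean reward and always pulls the highest-priced lever. The normal-approximation bound, the z = 1.96 constant, and the infinite price for unpulled levers are assumptions made for the example, not a description of the poker strategy listed above.

    import math

    def interval_estimation(bandit, horizon, z=1.96):
        """Pricing by interval estimation: the price of a lever is an optimistic
        upper bound of a confidence interval on its estimated mean reward, and
        the lever of highest price is always pulled."""
        counts = [0] * bandit.k
        sums = [0.0] * bandit.k
        sq_sums = [0.0] * bandit.k
        total = 0.0

        def price(a):
            if counts[a] == 0:
                return float("inf")       # unpulled levers are priced optimistically high
            mean = sums[a] / counts[a]
            var = max(sq_sums[a] / counts[a] - mean * mean, 0.0)
            return mean + z * math.sqrt(var / counts[a])

        for _ in range(horizon):
            lever = max(range(bandit.k), key=price)
            reward = bandit.pull(lever)
            counts[lever] += 1
            sums[lever] += reward
            sq_sums[lever] += reward * reward
            total += reward
        return total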

References

H. Robbins. Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society, volume 58, pages 527–535, 1952.

Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. (available online)

Bandit project (bandit.sourceforge.net), an open-source implementation of many bandit strategies.