Multi-Armed Bandits

One of the fundamental challenges of reinforcement learning is the exploration versus exploitation trade-off: the search for a balance between exploring the environment to find profitable actions and exploiting current knowledge by taking the empirically best action as often as possible. This trade-off has been studied most thoroughly through the multi-armed bandit problem, which is perhaps its simplest instance.

A k-armed Bandit Problem

The basic problem

  • You are faced repeatedly with a choice among k different options, or actions.
  • After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected.
  • Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
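
To make this setup concrete, here is a minimal sketch of a stationary k-armed bandit in Python. The class name, the Gaussian reward distributions, and the default of k = 10 arms are illustrative assumptions rather than part of the problem statement; the agent only ever sees the rewards returned by step, never the hidden means.

```python
import numpy as np

class KArmedBandit:
    """A stationary k-armed bandit: each arm pays a noisy reward drawn from
    a fixed distribution whose mean the agent does not know."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.k = k
        # True action values, hidden from the agent.
        self.q_star = self.rng.normal(loc=0.0, scale=1.0, size=k)

    def step(self, action):
        """Pull arm `action`; return a reward centered on its true value."""
        return self.rng.normal(loc=self.q_star[action], scale=1.0)
```

An agent interacts with such an object for, say, 1000 calls to step, trying to accumulate as much reward as possible.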

In our k-armed bandit problem, each of the k actions has an expected or mean reward given that action is selected; let us call this the value of that action.
If you knew the value of each action, then it would be trivial to solve the k-armed bandit problem: you would always select the action with the highest value. We therefore assume that you do not know the action values with certainty, although you may have estimates.
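
In symbols (a notational aside; the convention follows the symbols $Q_t(a)$ and $q_*(a)$ used later in these notes), the value of an arbitrary action $a$ is its expected reward given that $a$ is selected,$$ q_*(a) = \mathbb{E}[R_t \mid A_t = a],$$and $Q_t(a)$ denotes the estimated value of action $a$ at time step $t$. We would like $Q_t(a)$ to be close to $q_*(a)$.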

Mathematical notions

Greedy action

If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions.

Exploiting vs Exploring

When you select one of these greedy actions, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action’s value.
Because it is not possible both to explore and to exploit with any single action selection, one often refers to the “conflict” between exploration and exploitation.

Why explore?

Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.

For example, suppose a greedy action’s value is known with certainty, while several other actions are estimated to be nearly as good but with substantial uncertainty. The uncertainty is such that at least one of these other actions probably is actually better than the greedy action, but you don’t know which one. If you have many time steps ahead on which to make action selections, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action. Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them many times.

Stationary Multi-armed Bandit

In any specific case, whether it is better to explore or exploit depends in a complex way on the precise values of the estimates, uncertainties, and the number of remaining steps.
There are many sophisticated methods for balancing exploration and exploitation for particular mathematical formulations of the k-armed bandit and related problems. However, most of these methods make strong assumptions about stationarity and prior knowledge that are either violated or impossible to verify in applications and in the full reinforcement learning problem that we consider in subsequent chapters. The guarantees of optimality or bounded loss for these methods are of little comfort when the assumptions of their theory do not apply.

Action-value Methods

Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate this is by averaging the rewards actually received:$$ Q_t(a)=\frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t}=\frac{\sum_{i=1}^{t-1}R_i\cdot\mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb{1}_{A_i=a}},$$where $\mathbb{1}_{A_i=a}$ is 1 if action $a$ was selected at step $i$ and 0 otherwise. If the denominator is zero, $Q_t(a)$ is instead defined as some default value, such as 0.

We call this the sample-average method for estimating action values because each estimate is an average of the sample of relevant rewards.

Of course this is just one way to estimate action values, and not necessarily the best one. Nevertheless, for now let us stay with this simple estimation method and turn to the question of how the estimates might be used to select actions.
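
As a minimal sketch (the class and attribute names are illustrative), the sample-average method only needs a running reward sum and a count per action:

```python
import numpy as np

class SampleAverageEstimator:
    """Estimates Q_t(a) as the average of the rewards received for each action."""

    def __init__(self, k, default_value=0.0):
        self.reward_sums = np.zeros(k)                 # sum of rewards when a was taken
        self.counts = np.zeros(k, dtype=int)           # number of times a was taken
        self.q_estimates = np.full(k, default_value)   # Q_t(a); default before any samples

    def update(self, action, reward):
        """Fold one observed reward into the estimate for `action`."""
        self.reward_sums[action] += reward
        self.counts[action] += 1
        self.q_estimates[action] = self.reward_sums[action] / self.counts[action]
```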

Greedy algorithm

The simplest action selection rule is to select one of the actions with the highest estimated value, that is, one of the greedy actions as defined in the previous section. If there is more than one greedy action, then a selection is made among them in some arbitrary way, perhaps randomly. We write this greedy action selection method as$$ A_t=\operatorname{argmax}_a Q_t(a),$$where $\operatorname{argmax}_a$ denotes the action $a$ for which the expression that follows is maximized (again, with ties broken arbitrarily).

Greedy action selection always exploits current knowledge to maximize immediate reward; it spends no time at all sampling apparently inferior actions to see if they might really be better.
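
A minimal sketch of greedy selection under this rule; np.argmax alone would always return the first maximizer, so ties among greedy actions are broken at random explicitly:

```python
import numpy as np

def greedy_action(q_estimates, rng=None):
    """Return an action with the highest estimated value, breaking ties randomly."""
    if rng is None:
        rng = np.random.default_rng()
    q_estimates = np.asarray(q_estimates)
    best = np.flatnonzero(q_estimates == q_estimates.max())  # all greedy actions
    return int(rng.choice(best))
```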

ε-Greedy algorithm

A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select randomly from among all the actions with equal probability, independently of the action-value estimates. We call methods using this near-greedy action selection rule ε-greedy methods.

An advantage of these methods is that, in the limit as the number of steps increases, every action will be sampled an infinite number of times, thus ensuring that all the $Q_t(a)$ converge to $q_∗(a)$. This of course implies that the probability of selecting the optimal action converges to greater than 1 − ε, that is, to near certainty. These are just asymptotic guarantees, however, and say little about the practical effectiveness of the methods.
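
A minimal sketch of ε-greedy selection along these lines (the function name and the default ε = 0.1 are illustrative); with probability ε it selects uniformly among all actions, exactly as described above:

```python
import numpy as np

def epsilon_greedy_action(q_estimates, epsilon=0.1, rng=None):
    """With probability epsilon, explore uniformly at random;
    otherwise exploit by choosing a greedy action (ties broken randomly)."""
    if rng is None:
        rng = np.random.default_rng()
    q_estimates = np.asarray(q_estimates)
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))            # explore
    best = np.flatnonzero(q_estimates == q_estimates.max())
    return int(rng.choice(best))                              # exploit
```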

The 10-armed Testbed

On this testbed, the greedy method is compared with two ε-greedy methods (ε = 0.1 and ε = 0.01), with results averaged over many independently generated bandit tasks and reported as two measures: the average reward per step and the percentage of tasks in which the optimal action is selected at each step. The upper graph shows the increase in expected reward with experience. The greedy method improved slightly faster than the other methods at the very beginning, but then leveled off at a lower level. The greedy method performed significantly worse in the long run because it often got stuck performing suboptimal actions.
The lower graph shows that the greedy method found the optimal action in only approximately one-third of the tasks. In the other two-thirds, its initial samples of the optimal action were disappointing, and it never returned to it. The ε-greedy methods eventually performed better because they continued to explore and to improve their chances of recognizing the optimal action.
The ε = 0.1 method explored more, and usually found the optimal action earlier, but it never selected that action more than 91% of the time. The ε = 0.01 method improved more slowly, but eventually would perform better than the ε = 0.1 method on both performance measures shown in the figure. It is also possible to reduce ε over time to try to get the best of both high and low values.
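
The comparison behind these curves can be sketched in a short script. The setup below is an assumption about the testbed's details (many independently generated 10-armed tasks with true values drawn from a standard normal distribution, unit-variance Gaussian rewards, 1000 steps per task, sample-average estimates); running it with ε = 0, 0.1, and 0.01 gives the three methods discussed above.

```python
import numpy as np

def run_testbed(epsilon, n_tasks=2000, n_steps=1000, k=10, seed=0):
    """Average-reward and %-optimal-action curves for one value of epsilon
    (epsilon=0 gives the plain greedy method)."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_steps)
    pct_optimal = np.zeros(n_steps)
    for _ in range(n_tasks):
        q_star = rng.normal(size=k)              # true action values for this task
        optimal = np.argmax(q_star)
        q_est = np.zeros(k)                      # sample-average estimates Q_t(a)
        counts = np.zeros(k)
        for t in range(n_steps):
            if rng.random() < epsilon:
                a = rng.integers(k)              # explore
            else:
                best = np.flatnonzero(q_est == q_est.max())
                a = rng.choice(best)             # exploit, ties broken at random
            r = rng.normal(q_star[a], 1.0)       # noisy reward around the true value
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]   # running sample average for action a
            avg_reward[t] += r
            pct_optimal[t] += (a == optimal)
    return avg_reward / n_tasks, 100.0 * pct_optimal / n_tasks

# e.g. compare run_testbed(0.0), run_testbed(0.1), run_testbed(0.01)
```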

The advantage of ε-greedy over greedy methods depends on the task. Noisier rewards call for more exploration, so ε-greedy fares even better relative to greedy; with deterministic rewards, the greedy method learns the true value of each action after trying it once and might actually perform best. But even in the deterministic case there is a large advantage to exploring if we weaken some of the other assumptions. For example, suppose the bandit task were nonstationary, that is, the true values of the actions changed over time. Then exploration is needed to make sure one of the nongreedy actions has not changed to become better than the greedy one. And even if the underlying task is stationary and deterministic, the learner faces a set of bandit-like decision tasks, each of which changes over time as learning proceeds and the agent's policy changes. Reinforcement learning requires a balance between exploration and exploitation.

Incremental Implementation

Tracking a Nonstationary Problem

Optimistic Initial Values

Upper-Confidence-Bound Action Selection