
BY OTHERS · Thursday, February 4, 1988

by Richard S. Sutton

This article introduces temporal-difference methods — a class of incremental learning procedures that assign credit by means of the difference between temporally successive predictions. The E=mc² of reinforcement learning.

tags: reinforcement-learning, TD-learning, prediction, machine-learning, Markov-processes, convergence

∮   ∞   ∮
author
Richard S. Sutton
filed
Thursday, February 4, 1988
words
2,423
reading
~13 min

"Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions."

— Richard S. Sutton, Learning to Predict by the Methods of Temporal Differences, 1988

The Problem of Prediction

This article concerns the problem of learning to predict, that is, of using past experience with an incompletely known system to predict its future behavior. For example, through experience one might learn to predict for particular chess positions whether they will lead to a win, for particular cloud formations whether there will be rain, or for particular economic conditions how much the stock market will rise or fall. Learning to predict is one of the most basic and prevalent kinds of learning.

Note

From the paper: "An important advantage of prediction learning is that its training examples can be taken directly from the temporal sequence of ordinary sensory input: no special supervisor or teacher is required."

Most pattern recognition problems, for example, can be treated as prediction problems in which the classifier must predict the correct classifications. Learning-to-predict problems also arise in heuristic search, e.g., in learning an evaluation function that predicts the utility of searching particular parts of the search space, or in learning the underlying model of a problem domain.

Single-Step vs. Multi-Step Prediction

The crucial distinction. In single-step prediction problems, all information about the correctness of each prediction is revealed at once. In multi-step prediction problems, correctness is not revealed until more than one step after the prediction is made, but partial information relevant to its correctness is revealed at each step.

The weather prediction problem is a multi-step prediction problem because inconclusive evidence relevant to the correctness of Monday's prediction becomes available in the form of new observations on Tuesday, Wednesday, Thursday and Friday. On the other hand, if each day's weather were to be predicted on the basis of the previous day's observations — that is, on Monday predict Tuesday's weather, on Tuesday predict Wednesday's weather, etc. — one would have a single-step prediction problem, assuming no further observations were made between the time of each day's prediction and its confirmation or refutation on the following day.

In fact, many problems that are classically cast as single-step prediction problems are more naturally viewed as multi-step problems. Perceptual learning problems, such as vision or speech recognition, are classically treated as supervised learning, using a training set of isolated, correctly-classified input patterns. When humans hear or see things, on the other hand, they receive a stream of input over time and constantly update their hypotheses about what they are seeing or hearing.

The Core Idea: Temporal Differences

Warning

The TD(0) value update — the E=mc² of reinforcement learning (shown here in modern notation, with reward rₜ₊₁ and discount γ; Sutton's 1988 paper writes it in terms of successive predictions Pₜ):

V(sₜ) ← V(sₜ) + α [rₜ₊₁ + γ V(sₜ₊₁) − V(sₜ)]

The agent updates its current estimate using the difference between temporally successive predictions — never waiting for the final outcome.

Whereas conventional prediction-learning methods are driven by the error between predicted and actual outcomes, TD methods are similarly driven by the error or difference between temporally successive predictions; with them, learning occurs whenever there is a change in prediction over time. For example, suppose a weatherman attempts to predict on each day of the week whether it will rain on the following Saturday. The conventional approach is to compare each prediction to the actual outcome — whether or not it does rain on Saturday. A TD approach, on the other hand, is to compare each day's prediction with that made on the following day.

Note

From the paper: "If a 50% chance of rain is predicted on Monday, and a 75% chance on Tuesday, then a TD method increases predictions for days similar to Monday, whereas a conventional method might either increase or decrease them depending on Saturday's actual outcome."

Weather Prediction: Will it rain on Saturday?

[interactive figure: daily predictions of Saturday's rain, each initialized at 0.50, followed by the actual outcome, rain or sun]

Pₜ ← Pₜ + α(Pₜ₊₁ − Pₜ)

Sutton's weather example. On each day of the week, predict Saturday's rain. TD learning updates each day's prediction based on the next day's — information flows backward immediately.
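The weatherman comparison can be sketched in a few lines of Python (the daily prediction values and step size below are illustrative, not from the paper):

```python
import numpy as np

# Predictions of rain on Saturday, made Monday through Friday, with the
# actual outcome revealed on Saturday.  Numbers are illustrative only.
alpha = 0.1
predictions = np.array([0.50, 0.75, 0.60, 0.80, 0.90])  # Mon..Fri
outcome = 1.0                                           # it rained

# Conventional (supervised) errors: every day's prediction is compared
# with the final outcome, so nothing can be computed until Saturday.
supervised_errors = outcome - predictions

# TD errors: each day's prediction is compared with the following day's
# (the last prediction is compared with the outcome itself).
targets = np.append(predictions[1:], outcome)
td_errors = targets - predictions

# Monday's TD update is driven by Tuesday's 0.75, available immediately:
monday_increment = alpha * td_errors[0]   # 0.1 * (0.75 - 0.50) > 0

# The TD errors telescope: their sum equals the overall supervised error
# for Monday, z - P_Monday, so no information is lost by going one day
# at a time.
```

The telescoping sum in the last comment is exactly why the incremental TD updates and the end-of-week supervised update account for the same total error.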

Two Advantages of TD Methods

We will show that TD methods have two kinds of advantages over conventional prediction-learning methods.

PROPOSITION 1

TD methods are more incremental and therefore easier to compute. The TD method for predicting Saturday's weather can update each day's prediction on the following day, whereas the conventional method must wait until Saturday.

PROPOSITION 2

TD methods tend to make more efficient use of their experience: for the same amount of data, they converge faster and produce more accurate predictions than conventional methods.

"The conventional method would have to do more computing at one time than the TD method and would require more storage during the week. The second advantage of TD methods is that they tend to make more efficient use of their experience: they converge faster and produce better predictions."

The TD(λ) Family

The hallmark of temporal-difference methods is their sensitivity to changes in successive predictions rather than to overall error between predictions and the final outcome. In response to an increase (decrease) in prediction from Pₜ to Pₜ₊₁, an increment Δwₜ is determined that increases (decreases) the predictions for some or all of the preceding observation vectors x₁, ..., xₜ.

Note

The general TD(λ) update rule:

Δwₜ = α(Pₜ₊₁ − Pₜ) Σₖ₌₁ᵗ λᵗ⁻ᵏ ∇wPₖ

When λ = 1, this is equivalent to the conventional supervised-learning method (Widrow-Hoff). When λ = 0, we get "pure" TD — each weight update depends only on a pair of successive predictions.

For λ < 1, TD(λ) produces weight changes different from those made by any supervised-learning method. The difference is greatest in the case of TD(0), in which the weight increment is determined only by its effect on the prediction associated with the most recent observation:

Δwₜ = α(Pₜ₊₁ − Pₜ)∇wPₜ

Note that this procedure is formally very similar to the prototypical supervised-learning procedure. The two equations are identical except that the actual outcome z in the supervised version is replaced by the next prediction Pₜ₊₁ in the TD equation. The two methods use the same learning mechanism, but with different errors.
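Sutton's Theorem 1 makes the λ = 1 equivalence exact when the weight increments are accumulated and applied only at the end of a sequence. A numerical check of that telescoping identity, with linear predictions Pₜ = w·xₜ (the data here are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear predictions P_t = w . x_t over a sequence x_1..x_m with outcome z.
m, n = 6, 4
X = rng.normal(size=(m, n))     # observation vectors x_1..x_m
w = rng.normal(size=n)          # weights, held fixed within the sequence
z = 1.0                         # final outcome
alpha, lam = 0.05, 1.0

P = X @ w                       # predictions P_1..P_m
P_next = np.append(P[1:], z)    # by convention P_{m+1} = z

# Supervised (Widrow-Hoff) increments: alpha * (z - P_t) * x_t, summed.
dw_supervised = alpha * ((z - P)[:, None] * X).sum(axis=0)

# TD(lambda) increments: alpha * (P_{t+1} - P_t) * sum_{k<=t} lam^(t-k) x_k,
# maintained incrementally as an eligibility trace.
dw_td = np.zeros(n)
trace = np.zeros(n)             # running sum_k lam^(t-k) grad_w P_k
for t in range(m):
    trace = lam * trace + X[t]  # for linear P, grad_w P_t = x_t
    dw_td += alpha * (P_next[t] - P[t]) * trace

# For lam = 1 the two total increments are identical: the TD errors
# (P_{t+1} - P_t) telescope into (z - P_k) for each k.
```

Setting `lam` below 1 breaks the equality, which is precisely where TD(λ) departs from any supervised-learning method.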

A Game-Playing Example

Note

From the paper (Figure 1): "Suppose there is a game position that you have learned is bad for you, that has resulted most of the time in a loss and only rarely in a win for your side. Suppose a 'bad' state has led 90% of the time to a loss and only 10% to a win."

Suppose you play a game that reaches a novel position (one you have never seen before), that then progresses to reach the bad state, and that finally ends nevertheless in a victory for you. As a result of this experience, your opinion of the bad state would presumably improve, but what of the novel state?

A supervised-learning method would form a pair from the novel state and the win that followed it, and would conclude that the novel state is likely to lead to a win. A TD method, on the other hand, would form a pair from the novel state and the bad state that immediately followed it, and would conclude that the novel state is also a bad one, that it is likely to lead to defeat.

The Random Walk Experiment

The paper's central experiment. A bounded random walk on states A–B–C–D–E–F–G. Every walk starts at D. At each step, move left or right with equal probability. Reaching A (left edge) terminates with outcome z=0. Reaching G (right edge) terminates with z=1. The true values are 1/6, 2/6, 3/6, 4/6, 5/6 for states B through F.

A bounded random walk is a state sequence generated by taking random steps to the right or to the left until a boundary is reached. The system that generates such walks is one of the simplest possible dynamical systems.

We applied linear supervised-learning and TD methods to this problem in a straightforward way. A walk's outcome was defined to be z = 0 for a walk ending on the left at A and z = 1 for a walk ending at the right at G. The learning methods estimated the expected value of z; for this choice of z, its expected value is equal to the probability of a right-side termination.

[interactive figure: the random walk A–B–C–D–E–F–G, starting at D, with terminal outcomes z = 0 at A and z = 1 at G; TD(0)'s learned V(s) for states B–F, initialized at 0.50, is plotted against the true values 1/6, 2/6, 3/6, 4/6, 5/6, with adjustable α and a running RMS error over episodes]
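A minimal tabular sketch of the experiment, using online TD(0) rather than the paper's full TD(λ)-over-training-sets protocol (the step size and episode count are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# States 0..6 = A..G; A and G are absorbing with outcomes z=0 and z=1.
true_values = np.array([1/6, 2/6, 3/6, 4/6, 5/6])   # states B..F
V = np.full(7, 0.5)
V[0], V[6] = 0.0, 1.0          # terminal "values" are the outcomes z
alpha = 0.01

for _ in range(20000):
    s = 3                       # every walk starts at D
    while 0 < s < 6:
        s_next = s + rng.choice([-1, 1])
        # TD(0): move V(s) toward the next state's prediction
        V[s] += alpha * (V[s_next] - V[s])
        s = s_next

# V[1:6] should now approximate the true right-termination
# probabilities 1/6 .. 5/6 for states B..F.
```

With a constant step size the estimates keep fluctuating around the true values; the paper's experiments instead average performance over many small training sets.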

Experimental Results

Note

Figure 3 result: "Averaging over training sets, we found that performance improved rapidly as λ was reduced below 1 (the supervised-learning method) and was best at λ = 0 (the extreme TD method)."

This result contradicts conventional wisdom. It is well known that, under repeated presentations, the Widrow-Hoff procedure minimizes the RMS error between its predictions and the actual outcomes in the training set. How can it be that this optimal method performed worse than all the TD methods for λ < 1?

The answer is that the Widrow-Hoff procedure only minimizes error on the training set; it does not necessarily minimize error for future experience. In the following section, we prove that in fact it is linear TD(0) that converges to what can be considered the optimal estimates for matching future experience — those consistent with the maximum-likelihood estimate of the underlying Markov process.

PROPOSITION 3

Linear TD(0) converges under repeated presentations to the optimal predictions — those consistent with the maximum-likelihood estimate of the underlying Markov process. Widrow-Hoff converges to predictions that minimize training-set error, which is in general different and worse.
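The proposition can be checked on a toy training set. The walks below are my own choice, not from the paper; under repeated presentations, tabular TD(0) settles near the certainty-equivalence (maximum-likelihood) values, which differ from the per-state average outcomes that Widrow-Hoff converges to:

```python
import numpy as np

# A tiny fixed training set of walks over states A-B-C-D-E, where A and E
# are terminal with outcomes z=0 and z=1.
A, B, C, D, E = range(5)
walks = [([C, B, A], 0.0),
         ([C, D, E], 1.0),
         ([C, B, C, D, E], 1.0)]

# Maximum-likelihood values from the empirical transition counts:
#   P(B->A)=P(B->C)=1/2, P(C->B)=P(C->D)=1/2, P(D->E)=1
#   => V(B)=1/3, V(C)=2/3, V(D)=1.
ml_values = {B: 1/3, C: 2/3, D: 1.0}

V = np.full(5, 0.5)
V[A], V[E] = 0.0, 1.0
alpha = 0.002
for _ in range(20000):                      # repeated presentations
    for states, z in walks:
        for s, s_next in zip(states, states[1:]):
            V[s] += alpha * (V[s_next] - V[s])

# Widrow-Hoff instead converges to the average outcome observed from
# each state:  B: (0+1)/2 = 1/2,  C: (0+1+1+1)/4 = 3/4,  D: 1.
wh_values = {B: 0.5, C: 0.75, D: 1.0}
```

Note that state C ends up near 2/3, not the training-set average of 3/4: TD(0) exploits the Markov structure that the supervised pairing throws away.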

Convergence Theory

Note

Theorem 2 (from the paper): "For any absorbing Markov chain, for any distribution of starting probabilities μ_i, for any outcome distributions with finite expected values, and for any linearly independent set of observation vectors, there exists an ε > 0 such that, for all positive α < ε and for any initial weight vector, the predictions of linear TD(0) converge in expected value to the ideal predictions."

In this section, we provide a theoretical foundation for temporal-difference methods. Such a foundation is particularly needed for these methods because most of their learning is done on the basis of previously learned quantities. "Bootstrapping" in this way may be what makes TD methods efficient, but it can also make them difficult to analyze and to have confidence in. In fact, hitherto no TD method has ever been proved stable or convergent to the correct predictions.

"The theory developed here concerns the linear TD(0) procedure and a class of tasks typified by the random walk example discussed in the preceding section. Two major results are presented: (1) an asymptotic convergence theorem for linear TD(0) when presented with new data sequences; and (2) a theorem that linear TD(0) converges under repeated presentations to the optimal (maximum likelihood) estimates."

TD as Gradient Descent

Like many other statistical learning methods, TD methods can be viewed as gradient descent (hill climbing) in the space of the modifiable parameters (weights). That is, their goal can be viewed as minimizing an overall error measure J(w) over the space of weights.

The key insight for why TD beats supervised learning: Our real goal is for each prediction to match the expected value of the subsequent outcome, not the actual outcome occurring in the training set. TD methods can perform better than supervised-learning methods because the actual outcome of a sequence is often not the best estimate of its expected value. A subsequent prediction can be a better, less noisy target.
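The "less noisy target" claim can be made concrete on the random walk: starting from the center state D, the final outcome z is a coin flip with variance 1/4, whereas the true value of the state one step later takes only the values 1/3 and 2/3, with variance (1/6)². A simulation (the setup is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# True values of the 7-state walk A..G (indices 0..6), terminals included.
true_V = {0: 0.0, 1: 1/6, 2: 2/6, 3: 3/6, 4: 4/6, 5: 5/6, 6: 1.0}

outcomes, bootstrap_targets = [], []
for _ in range(20000):
    s = 3                                   # start at D
    first_next = s + rng.choice([-1, 1])
    bootstrap_targets.append(true_V[first_next])  # TD-style target
    s = first_next
    while 0 < s < 6:
        s += rng.choice([-1, 1])
    outcomes.append(float(s == 6))          # z = 1 iff absorbed at G

# Both targets are unbiased estimates of V(D) = 1/2, but the
# bootstrapped target has far lower variance:
#   Var(z) = 1/4,  Var(V(next)) = (1/6)^2 ~= 0.028.
var_mc = np.var(outcomes)
var_td = np.var(bootstrap_targets)
```

Of course, during learning the bootstrapped target uses estimated rather than true values, which is what makes TD both powerful and harder to analyze.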

Generalizations

Predicting Cumulative Outcomes

Temporal-difference methods are not limited to predicting only the final outcome of a sequence; they can also be used to predict a quantity that accumulates over a sequence. That is, each step of a sequence may incur a cost, where we wish to predict the expected total cost over the sequence.

Infinite Discounted Predictions

For infinite-horizon prediction problems, we include a form of discounting. If some process generates costs cₜ₊₁ at each transition from t to t + 1, we may want Pₜ to predict the discounted sum:

Note

Discounted return:

zₜ = Σₖ₌₀^∞ γᵏ cₜ₊ₖ₊₁

where the discount-rate parameter γ, 0 ≤ γ < 1, determines the extent to which we are concerned with short-range or long-range prediction. This leads to the recursive relationship Pₜ = cₜ₊₁ + γPₜ₊₁ — the foundation of Sutton's (1984) Adaptive Heuristic Critic.
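The recursion can be verified numerically on a finite cost sequence, taking costs beyond the episode to be zero (a simplification of the infinite sum; the numbers are arbitrary):

```python
# Check the recursion z_t = c_{t+1} + gamma * z_{t+1} for the discounted
# return z_t = sum_k gamma^k c_{t+k+1}, on a finite episode of costs.
gamma = 0.9
costs = [2.0, -1.0, 0.5, 3.0, 1.0]       # c_1 .. c_5; zero afterward

# Backward recursion, with the return after the last cost equal to 0.
z = [0.0] * (len(costs) + 1)
for t in range(len(costs) - 1, -1, -1):
    z[t] = costs[t] + gamma * z[t + 1]

# Direct definition: z_0 = sum_k gamma^k c_{k+1}
z0_direct = sum(gamma**k * c for k, c in enumerate(costs))
```

The backward pass and the direct sum agree exactly, which is the identity the Adaptive Heuristic Critic builds on.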

Conclusion

Because TD methods assign credit by means of the difference between temporally successive predictions rather than by waiting for a final outcome, they can be computed more incrementally and require significantly less memory and peak computation. One TD method makes exactly the same weight changes as a supervised-learning method, while retaining these computational advantages. Another TD method has been proved to converge asymptotically to the correct predictions. Empirically, TD methods appear to learn faster than supervised-learning methods, and one TD method has been proved to make optimal predictions for finite training sets that are presented repeatedly.

The connection to animal learning. Animals must also face the problem of learning internal models of the world. The learning process that seems to perform this function in animals is called Pavlovian or classical conditioning. If a dog is repeatedly presented with the sound of a bell and then fed, it will learn to predict the meal given just the bell alone, as evidenced by salivation to the bell alone. Some of the detailed features of this learning process suggest that animals may be using a TD method.

The progress made in this paper has been due primarily to treating TD methods as general methods for learning to predict rather than as specialized methods for learning evaluation functions, as they were in all previous work. This simplification makes their theory much easier and also greatly broadens their range of applicability. It is now clear that TD methods can be used for any pattern recognition problem in which data are gathered over time — for example, speech recognition, process monitoring, target identification, and market-trend prediction. Potentially, all of these can benefit from the advantages of TD methods vis-a-vis supervised-learning methods.
