DACHVARD


ESSAY · Friday, January 1, 1999

by Richard S. Sutton, Doina Precup, Satinder Singh

tags: reinforcement learning, temporal abstraction, options, MDP, AI


When you decide to go to lunch, you are not thinking about which muscles to contract. When you decide to lift a cup, you are not thinking about which joint torques to apply. Human intelligence operates fluidly across time scales that differ by orders of magnitude — from milliseconds to years — and the right level of abstraction for one decision is usually wrong for another.

Reinforcement learning, as originally conceived, had no way to capture this. Every action took exactly one time step. Every plan was a flat sequence. This paper, by Sutton, Precup, and Singh, changed that with a single elegant concept: the option.

What Is an Option?

An option is not an action. It is a closed-loop policy — a full decision procedure that runs for a variable amount of time and terminates when it chooses to.

Note

Options are generalizations of actions that can be of arbitrary duration. Like a primitive action, an option can be executed, terminated, and selected as part of a policy. Options are meant to represent higher-level courses of action, such as a locomotion behavior or a manipulation skill. — Sutton, Precup & Singh, 1999

Formally, an option is a triple:

⟨I, π, β⟩

  • I ⊆ S — the initiation set: states where the option may begin
  • π: S × A → [0,1] — the policy: what to do while executing
  • β: S → [0,1] — the termination condition: probability of stopping in each state

Picking up an object: I is the set of states where the object is within reach; π is a grasping policy; β is 1 when the object is held. Going to lunch: I is your office; π is a navigation policy; β is 1 when you arrive at the restaurant.

The elegant trick is that primitive actions are just special-case options — they always terminate after one step (β = 1 everywhere) and their policy picks a single action. So options don't replace actions; they subsume them. The space of options is a strict superset.
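The triple and its primitive-action special case can be sketched as a small data structure. This is an illustrative sketch, not the paper's code; the names and the deterministic policy are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
Action = Hashable

@dataclass(frozen=True)
class Option:
    """An option <I, pi, beta> over an MDP (illustrative sketch)."""
    initiation: Callable[[State], bool]    # I: may the option start in s?
    policy: Callable[[State], Action]      # pi: action chosen in s (deterministic here)
    termination: Callable[[State], float]  # beta: probability of stopping in s

def primitive(a: Action) -> Option:
    """A one-step action as an option: it can start anywhere,
    always picks `a`, and terminates immediately (beta = 1 everywhere)."""
    return Option(
        initiation=lambda s: True,
        policy=lambda s: a,
        termination=lambda s: 1.0,
    )
```

Because `primitive(a)` is an ordinary `Option`, any algorithm written over options handles the flat single-step case with no special treatment.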

[Figure: a three-level hierarchy of meta-options (prepare drink, serve drink), options (grasp cup, pour water, lift cup, walk to table, set down cup, release), and primitive actions. Each level abstracts time: one meta-option spans many options, one option spans many actions.]

The Rooms Example

The paper's workhorse illustration is a four-room gridworld: a grid divided by walls into four rooms, connected by narrow hallway gaps. The agent can move in four directions, with some randomness. Goal: reach a target cell.

Eight hallway options are defined: two per room, each navigating from anywhere inside that room to one of the two exits. Each option's policy is a flow field converging on the target hallway cell. Once the agent reaches the hallway, the option terminates.
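A hallway option's policy can be built mechanically: breadth-first search outward from the hallway cell yields the converging flow field. The layout below is an illustrative sketch, not the paper's exact grid:

```python
from collections import deque

# Illustrative four-room layout ('#' wall, '.' free); the three gaps are
# hallway cells. The layout is an assumption, not the paper's exact grid.
GRID = [
    "###########",
    "#.....#...#",
    "#.....#...#",
    "#.........#",   # vertical-wall gap at (3, 6)
    "#.....#...#",
    "#.....#...#",
    "##.####.###",   # horizontal-wall gaps at (6, 2) and (6, 7)
    "#.....#...#",
    "#.....#...#",
    "#.....#...#",
    "###########",
]

def hallway_policy(grid, hallway):
    """BFS outward from the hallway cell: map each reachable free cell to
    the neighbor one step closer to the hallway (a converging flow field)."""
    free = {(r, c) for r, row in enumerate(grid)
                   for c, ch in enumerate(row) if ch == "."}
    policy, seen, frontier = {}, {hallway}, deque([hallway])
    while frontier:
        r, c = frontier.popleft()
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nb in free and nb not in seen:
                seen.add(nb)
                policy[nb] = (r, c)   # from nb, step toward the hallway
                frontier.append(nb)
    return policy
```

Following `policy` from any cell walks straight to the hallway cell, where the option terminates (β = 1 there).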

[Figure: value-iteration sweeps over the four-room grid; circle radius ∝ V(s). With primitive actions, values spread cell-by-cell; with hallway options, room-by-room.]

Note

In the four-room example, the option-based policy dramatically reduces planning time. After only two sweeps of value iteration over the options, the value function is approximately correct everywhere. The primitive-action policy requires many more sweeps to propagate values across the full state space. — Sutton, Precup & Singh, 1999

Now watch what happens when you run value iteration:

With primitive actions only, values spread one cell per step — a slow ripple outward from the goal. To fill the whole four-room grid takes dozens of iterations.

With hallway options added, values spread room by room. The first iteration immediately gives accurate values to every cell in any room containing a hallway exit to the goal. The second iteration fills two more rooms. After just two iterations, the entire grid has near-correct values — not because we cheated, but because options enable planning to leap across time rather than crawl through it.

This is the core insight made concrete. Temporal abstraction isn't a trick; it's a structural speedup: the number of value-iteration sweeps needed to propagate values scales with the number of abstract decisions between a state and the goal, not with the number of primitive steps.

The Options–SMDP Connection

Here is the formal backbone. Semi-Markov decision processes (SMDPs) already had a mature theory before this paper — they generalize MDPs to allow actions of variable duration. Sutton et al. prove:

Theorem 1: For any MDP and any set of options defined on it, the decision process that selects among those options — executing each to termination — is an SMDP.

This means all SMDP theory (Bellman equations, Q-learning convergence, planning algorithms) applies immediately to options. The multi-time model for option o from state s captures exactly what you need: the expected cumulative discounted reward r^o_s until termination, and the discounted state-prediction distribution p^o_ss' over where you'll end up.

The Bellman equation becomes:

V*(s) = max_{o ∈ O_s} E[r + γ^k V*(s') | E(o,s)]

where k is the (random) number of steps the option takes. This is the same Bellman equation as before, just with a different notion of "action."
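With the multi-time model in hand, the backup above is ordinary value iteration. A minimal tabular sketch, assuming the reward vectors and discounted terminal-state matrices have already been computed (all names are mine):

```python
import numpy as np

def smdp_value_iteration(r, p, n_sweeps):
    """Value iteration over options via their multi-time models (sketch).
    r[o] is a vector of expected discounted rewards-to-termination r^o_s;
    p[o] is the discounted terminal-state matrix p^o_ss' (gamma^k folded
    in, so rows sum to less than 1). Backup:
        V(s) <- max_o [ r[o][s] + sum_s' p[o][s, s'] V(s') ]"""
    n_states = next(iter(r.values())).shape[0]
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        V = np.max([r[o] + p[o] @ V for o in r], axis=0)
    return V
```

With hallway options among the models, a couple of sweeps of this loop suffice to spread near-correct values room by room, exactly as in the four-room example.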

Interrupting Options

SMDP theory has a limitation: once you start an option, you follow it blindly to termination. But this is inefficient. If you're mid-execution of "go to hallway A" and you learn that hallway B would now be better — why not switch?

The paper shows you can. If you've computed the option-value function Q^µ(s,o), you can re-evaluate at every state during execution: if Q^µ(s,o) < V^µ(s), terminate early and pick a better option.
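The check itself is one line. A sketch assuming precomputed tables `Q` and `V` (the names and dictionary representation are mine, not the paper's):

```python
def maybe_interrupt(s, current, Q, V, options):
    """Interruption rule (sketch): on entering state s mid-option, compare
    the value of continuing `current` with the state's value. If continuing
    is strictly worse, terminate early and switch to the greedy option."""
    if Q[(s, current)] < V[s]:
        return max(options, key=lambda o: Q[(s, o)])
    return current
```

Theorem 2 guarantees the resulting policy is never worse than following every option blindly to termination.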

[Figure: landmark navigation. The SMDP path (dashed red) fully enters each landmark circle; interrupted options (amber) switch early, cutting corners once a better option becomes available.]

The interrupted policy is always at least as good as the non-interrupted one (Theorem 2, with a proof by trajectory decomposition). In the landmark navigation example above, SMDP-optimal requires 600 steps; the interrupted policy takes only 474. The true optimum is 425 — the interruption doesn't reach perfection, but it dramatically closes the gap, for free.

In a 1D dynamical control task (a mass on a spring with friction), SMDP-optimal stops unnecessarily at an intermediate position and then re-accelerates — wasting 200 steps. The interrupted policy stays in motion, reaching the goal in 121 steps.

Intra-Option Learning

SMDP learning has another limitation: to update the value of an option, you must wait for it to terminate. If an option takes 100 steps, you get one update every 100 steps.

"The key idea is that the same experience can be used to update the models of many options simultaneously. In particular, we can update the model of option o at state s_t even if o was not the option being executed at time t — so long as o's policy would have taken the same action as the one that was actually taken."

The paper introduces intra-option methods: Bellman-like equations for the option model, exploitable at every single primitive step. For a Markov option with deterministic policy, the update rule is temporal-difference learning applied to the option's reward model — at each step, the estimated reward is nudged toward the observed reward plus the discounted continuation value, weighted by the probability that the option continues. Same for the state-prediction model. You update on every step of every option consistent with the current action — whether or not you're actually executing that option.
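For a Markov option with a deterministic policy, the reward-model update just described is one TD step. A sketch with dictionary-based tables (names and defaults are assumptions):

```python
def intra_option_reward_update(R, o, pi, beta, s, a, r, s_next,
                               alpha=0.1, gamma=0.9):
    """One intra-option update of option o's reward model R[o] (sketch).
    Valid whenever o's policy agrees with the action actually taken,
    even if o was not the option being executed.
    Target: r + gamma * (1 - beta(s')) * R[o](s'), i.e. the observed
    reward plus the discounted continuation value, weighted by the
    probability (1 - beta) that the option continues past s'."""
    if pi(s) != a:
        return R                          # o would have acted differently
    cont = (1.0 - beta(s_next)) * R[o].get(s_next, 0.0)
    old = R[o].get(s, 0.0)
    R[o][s] = old + alpha * (r + gamma * cont - old)
    return R
```

The same pattern, with the expected-state target, updates the state-prediction model.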

The payoff is remarkable: intra-option Q-learning can learn option values without ever selecting those options. If every primitive action gets exercised sufficiently, all option values converge to their correct optima (Theorem 3). In the rooms example, option values learned correctly while the agent behaved purely randomly — picking primitive actions uniformly at random, never using a hallway option at all.
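Intra-option Q-learning follows the same pattern: every primitive transition updates every option consistent with the action taken. A sketch (the `Opt` tuple and all names are illustrative):

```python
from collections import namedtuple

Opt = namedtuple("Opt", ["name", "policy", "termination"])  # (id, pi, beta)

def intra_option_q_update(Q, s, a, r, s_next, options, alpha=0.1, gamma=0.9):
    """One primitive transition (s, a, r, s') updates Q(s, o) for every
    option o whose policy would have taken a in s (sketch).
    U(s', o) = (1 - beta(s')) * Q(s', o) + beta(s') * max_o' Q(s', o')."""
    v_next = max(Q[(s_next, o2)] for o2 in options)
    for o in options:
        if o.policy(s) != a:
            continue
        beta = o.termination(s_next)
        u = (1.0 - beta) * Q[(s_next, o)] + beta * v_next
        Q[(s, o)] += alpha * (r + gamma * u - Q[(s, o)])
    return Q
```

Because the option being executed never appears in the update, even a uniformly random primitive-action policy drives all Q(s, o) toward their correct values, as Theorem 3 requires.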

Subgoals for Better Options

The deepest level of the framework involves improving the options themselves. Assign a subgoal value function g(s) over a set of goal states G. Then find the policy that, starting from any state, maximizes the expected discounted subgoal value attained on termination in G. This is just another reinforcement learning problem — with its own Bellman equations and Q-learning algorithm.

In the rooms example: each hallway option has subgoal value +1 for its target hallway, 0 elsewhere. Learning under random behavior reliably recovers the correct policy for navigating to each hallway. The option can be constructed entirely from experience, without hand-coding its policy.
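Learning a hallway option's own policy from its subgoal values is ordinary Q-learning with g(s) as the terminal target. A sketch under the same tabular assumptions (names are mine):

```python
def subgoal_q_update(Q, s, a, s_next, goal_states, g, alpha=0.1, gamma=0.9):
    """Q-learning toward a subgoal (sketch). Inside the option the reward
    is zero; reaching a state in G terminates with value g(s'). The greedy
    policy of the learned Q becomes the option's policy pi."""
    if s_next in goal_states:
        target = g(s_next)                    # terminal subgoal value
    else:
        target = gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

Run on transitions gathered under any sufficiently exploratory behavior, this recovers the shortest-path policy to the option's hallway without hand-coding it.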

This closes the loop: you can discover the options, not just use them.

Why This Matters

The paper's framework is deliberately minimal — it introduces options without committing to any particular approach to state abstraction, function approximation, or hierarchy depth. This is a feature, not a limitation. It means options compose with everything else in reinforcement learning.

The closing observation is perhaps the most beautiful. If a robot learns a "docking" option — a policy for navigating to its battery charger — the model of that option naturally represents exactly the concept the robot needs: the set of states from which it can successfully dock. This is not a label assigned by a human. It's a concept that emerges from the structure of the option itself, learnable without supervision. Action and perception converge.

"The most useful concept for a robot recognizing its battery charger is the set of states from which it can successfully dock with the charger — and this is exactly what would be produced by the model of a docking option." — Sutton, Precup & Singh, 1999

The Bigger Picture

Note

We believe that options and the options framework provide a useful middle ground between the full generality of MDPs and the restricted structure of SMDPs. The options framework retains the Markov property and the full power of dynamic programming, while enabling temporal abstraction in a natural and principled way. — Sutton, Precup & Singh, 1999

The framework's lasting legacy: it showed that temporal abstraction doesn't require a new theoretical apparatus — it's already present in the MDP framework, waiting to be revealed by the right definition. An option is not a hack on top of reinforcement learning. It is reinforcement learning, unfolded across time.
