
ESSAY · Friday, January 1, 2021

by David Silver, Satinder Singh, Doina Precup, Richard S. Sutton

tags: reinforcement learning, artificial intelligence, AGI, reward, intelligence

∮   ∞   ∮

Intelligence is the computational part of the ability to achieve goals in the world. — John McCarthy

There is a tempting idea embedded in how AI has traditionally been built: that each expression of intelligence requires its own special formulation. Perception needs object recognition losses. Language needs parsing objectives. Social intelligence needs game-theoretic equilibria. Each ability gets its own module, its own dataset, its own goal.

This paper — by David Silver, Satinder Singh, Doina Precup, and Richard Sutton, published in Artificial Intelligence in 2021 — proposes the opposite. One objective. One signal. Everything else follows.

The Hypothesis

Note

Sophisticated abilities may arise from the maximisation of simple rewards in complex environments. For example, the minimisation of hunger in a squirrel's natural environment demands a skilful ability to manipulate nuts that arises from the interplay between the squirrel's musculoskeletal dynamics; objects such as leaves, branches, or soil; variations in the size and shape of nuts; environmental factors such as wind, rain, or snow; and changes due to ageing, disease or injury. — Silver et al., 2021

The reward-is-enough hypothesis states:

Intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment.

The argument hinges on a simple observation: environments are complex. To survive, navigate, communicate, and cooperate in a rich world, an agent must develop perception, memory, planning, language, and social reasoning. Not because it was told to — but because these abilities are demanded by the task of maximising reward.
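In standard reinforcement-learning notation (my notation, not the essay's), the single objective the hypothesis appeals to can be written as:

```latex
% Find a policy \pi maximising expected cumulative discounted reward:
\pi^{*} = \operatorname*{arg\,max}_{\pi} \;
  \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_{t} \right],
  \qquad 0 \le \gamma \le 1
```

On this view, every ability the paper discusses is instrumental to raising this one expectation.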

A single goal, pursued in a complex enough world, forces intelligence to emerge.

[Interactive diagram: a single reward signal giving rise to knowledge, perception, social intelligence, language, generalisation, imitation, learning, and general intelligence, after Silver, Singh, Precup & Sutton 2021.]

The Squirrel

Consider a squirrel. Its brain receives sensations and sends motor commands. Its goal: minimise hunger.

To minimise hunger, the squirrel must:

  • Perceive — identify good nuts from bad ones
  • Know — understand what nuts are, how they age, where they hide
  • Plan — decide where to cache each nut for winter
  • Remember — recall the locations of cached nuts months later
  • Navigate — move efficiently through a complex forest
  • Deceive — bluff about cache locations to prevent theft

Six abilities. One reward. The squirrel never learned "perception" as a goal. It learned to eat.

A kitchen robot with the goal of maximising cleanliness must similarly develop: perception (dirty vs clean), knowledge (utensils, their states), memory (where things go), language (predicting mess from overheard dialogue), and social intelligence (persuading children to be tidier).

Fig. 1 from the paper. Top: a squirrel maximising food (acorn symbol). Bottom: a kitchen robot maximising cleanliness (bubble symbol). In both cases, a rich set of abilities — shown on the right as a projection from experience — emerges in service of the singular goal.

One Goal → Seven Abilities

The paper walks through seven abilities that seem, at first glance, to resist this framing — and shows how reward maximisation yields each one.

Knowledge and Learning. An agent must build internal models of its world to predict future reward. Innate knowledge, learned representations, and ongoing adaptation are all demanded by the need to act well across time.
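As a concrete illustration of reward-driven prediction (my sketch, not from the paper), temporal-difference learning builds value estimates from experience alone, with no model handed to the agent:

```python
# A toy sketch of TD(0) prediction: the agent learns V(s), an estimate
# of future reward from each state, purely by comparing successive
# predictions against the reward it actually receives.

states = ["far", "near", "food"]
V = {s: 0.0 for s in states}          # value estimates, start at zero
alpha, gamma = 0.5, 0.9               # step size, discount (assumed values)

def step(s):
    """Toy environment: walking toward food yields reward at the end."""
    if s == "far":
        return "near", 0.0
    if s == "near":
        return "food", 1.0
    return None, 0.0                  # terminal

for _ in range(100):                  # episodes of experience
    s = "far"
    while s is not None:
        s_next, r = step(s)
        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])   # TD(0) update
        s = s_next

# V['near'] approaches 1.0 and V['far'] approaches gamma * 1.0 = 0.9
```

Nothing labelled "knowledge" is built in; the value table is the knowledge, and it exists only because it predicts reward.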

Perception. Perceptual abilities aren't just pattern recognition — they're intertwined with action (haptic sensing, visual saccades, echolocation) and shaped by what matters for reward. The cost of misclassifying a crocodile depends on whether you're swimming toward it.

Social Intelligence. In environments with other agents, anticipating and influencing others directly improves reward. Robustness, bluffing, mixed strategies — these aren't game theory abstractions; they're what a reward-maximising agent learns when other agents are in its world.

Language. Language is purposeful. It changes the mental state of others, which changes their behaviour, which changes the environment. An agent in a language-rich environment must learn that words have consequences — because consequences affect reward.

Generalisation. A rich environment contains continuously varying states. A fruit-eating animal encounters a new tree every day, along with injury, drought, and seasonal change. To accumulate reward across this variation, it must generalise — not across neatly labelled tasks, but across the continuous stream of its own experience.

Imitation. Observing other agents who have already learned useful behaviours is a powerful shortcut to reward. Imitation in the wild isn't symmetric cloning — it's inference under partial observation, adapting others' solutions to your own situation.

General Intelligence. The deepest claim: general intelligence — the ability to flexibly achieve goals in diverse contexts — can be understood as, and implemented by, maximising a single reward in a single complex environment. The environment's complexity forces subgoal structure, transferable representations, and flexible planning.

[Diagram: the specialised approach (seven goals, seven modules, integration left as an exercise) versus the reward-is-enough approach, where one reward yields all seven abilities implicitly.]

The AlphaZero Evidence

The best existence proof comes from Go.

Prior AI research decomposed Go into separate abilities: openings, shape, tactics, endgames — each with its own formalisation. AlphaZero ignored all of this. It maximised a single signal: +1 for winning, −1 for losing, 0 otherwise.
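That signal is small enough to write out in full. A sketch (illustrative only, not AlphaZero's actual code):

```python
def reward(game_outcome: str) -> int:
    """The entire training signal, as described in the paper:
    the terminal result of a game, and nothing else."""
    return {"win": +1, "loss": -1, "draw": 0}[game_outcome]
```

Every intermediate judgement — openings, shape, tactics, endgames — had to be derived from this single number arriving at the end of self-play games.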

What emerged:

  • Novel opening sequences, previously unknown to Go theory
  • Surprising shapes used with global context
  • Understanding of global interactions between local battles
  • Safe play when ahead, aggressive play when behind

Not just a stronger player. A different kind of understanding. One that had previously been hard to formalise because no one thought to let it emerge from a single goal.

"AlphaZero's abilities were innately integrated into a unified whole, whereas integration had proven highly problematic in prior work."

The same algorithm applied to chess and shogi produced different abilities: piece mobility, colour complexes, king safety — not because AlphaZero was told these things matter, but because they do matter for winning.

Learning Agents

The hypothesis is agnostic about how reward is maximised. But the paper proposes that learning — trial and error, online interaction with the environment — is the most natural approach.

Preprogramming behaviour requires the designer to have foreknowledge of the environment. Learning instead places faith in experience. An agent that can continually adjust its behaviour to improve cumulative reward will, placed in a rich enough environment, develop whatever abilities that environment demands.

Note

A sufficiently powerful and general reinforcement learning agent may ultimately give rise to intelligence and its associated abilities. In other words, if an agent can continually adjust its behaviour so as to improve its cumulative reward, then any abilities that are repeatedly demanded by its environment must ultimately be produced in the agent's behaviour. — Silver et al., 2021

[Interactive gridworld demo: one reward, +10 for food, with a 40% exploration rate. The agent develops navigation, spatial memory, and efficient foraging paths, not because it was told to, but because reward demands it.]
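The demo's mechanism is plain tabular Q-learning. A toy sketch, where the +10 food reward and 40% exploration rate come from the demo above, and everything else (grid size, learning rate, episode count) is my own assumption:

```python
import random

random.seed(0)
SIZE = 5                        # 5x5 grid (assumed size)
FOOD = (4, 4)                   # food location (assumed)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

# One number per (state, action): expected future reward. No map and
# no "navigation module" — just estimates refined by experience.
Q = {((x, y), a): 0.0 for x in range(SIZE) for y in range(SIZE) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.4    # eps matches the demo's 40% explore rate

def move(s, a):
    """Deterministic grid dynamics: step, clipped at the walls."""
    x = min(max(s[0] + a[0], 0), SIZE - 1)
    y = min(max(s[1] + a[1], 0), SIZE - 1)
    s2 = (x, y)
    return s2, (10.0 if s2 == FOOD else 0.0)   # +10 for food, as in the demo

for _ in range(500):                 # episodes of trial and error
    s = (0, 0)
    while s != FOOD:
        if random.random() < eps:                          # explore
            a = random.choice(ACTIONS)
        else:                                              # exploit
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r = move(s, a)
        best_next = 0.0 if s2 == FOOD else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```

After training, following the greedy action from the corner traces an efficient path to the food: "navigation" exists in the table only because the reward demanded it.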

This is a conjecture, not a theorem. The paper makes no guarantees about sample efficiency. But the recent examples — AlphaZero in Go, deep RL in Atari, robotic manipulation, vision-language models — provide practical evidence that powerful RL agents, in complex environments, produce broadly capable behaviour.

Hard Questions

The paper directly addresses the objections:

Which environment? The emergence of intelligence is robust to specifics. A human brain develops sophisticated abilities whatever culture or language it is raised in, because the natural world is complex enough to demand them regardless. The specific environment shapes the flavour of intelligence, not whether it appears.

Which reward signal? Even an innocuous reward (collect round pebbles) in a rich environment would, if pursued effectively, require: classification, navigation, manipulation, memory, social coordination, tool use, technology. The signal matters less than the environment's complexity.

"In order to maximise this reward signal effectively, an agent may need to classify pebbles, to manipulate pebbles, to navigate to pebble beaches, to store pebbles, to understand waves and tides, to persuade people to help collect pebbles, to use tools and vehicles to collect greater quantities, to quarry and shape new pebbles, to discover and build new technologies for collecting pebbles, or to build a corporation that collects pebbles."

What about unsupervised learning? Unsupervised learning and prediction give useful structure to observations, but provide no principle for action. They cannot, by themselves, be enough for goal-directed intelligence.

What about evolution? Evolution by natural selection can be understood as fitness maximisation — one possible reward signal that shaped natural intelligence. But artificial agents may pursue different goals with different methods, yielding different forms of intelligence.

What about offline learning? Offline data can only generalise to problems already solved within that data. Online interaction lets an agent specialise to its current problems and discover behaviours that no dataset would contain.

Why This Matters

If the hypothesis is right, it resolves a fragmentation problem that has plagued AI for decades. Why does an agent develop language? Because its environment demands communication to accumulate reward. Why does it develop memory? Because past states predict future reward. Why social intelligence? Because other agents are part of the environment.

The abilities don't need to be designed in. They need to be demanded — by a goal, pursued in a world complex enough to require them.

"The most useful concept for a robot recognising its battery charger is the set of states from which it can successfully dock with the charger — and this is exactly what would be produced by the model of a docking option." — Silver, Singh, Precup & Sutton, 2021

Action and perception converge. The question "what should I perceive?" is answered by "what helps me act well?" And the answer to "what helps me act well?" is just: reward.
