DACHVARD


← back to the archive

BY OTHERS · Sunday, March 1, 2009

by Alon Halevy, Peter Norvig & Fernando Pereira

The complete text of "The Unreasonable Effectiveness of Data," the 2009 essay by Halevy, Norvig, and Pereira on why massive, messy datasets trump elegant theories.

tags: machine-learning, nlp, data-science, google, seminal

∮   ∞   ∮
author
Alon Halevy, Peter Norvig & Fernando Pereira
filed
Sunday, March 1, 2009
words
1,273
reading
~7 min

"The enormous usefulness of mathematics in the natural sciences is something bordering on the mysterious and there is no rational explanation for it."

— Eugene Wigner, The Unreasonable Effectiveness of Mathematics in the Natural Sciences, 1960

Eugene Wigner's article examines why so much of physics can be neatly explained with simple mathematical formulas such as F = ma or E = mc². Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior.

An informal, incomplete grammar of the English language runs over 1,700 pages. Perhaps when it comes to natural language processing and related fields, we're doomed to complex theories that will never have the elegance of physics equations. But if that's so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

One of us, as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words. Since then, our field has seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long.

In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, and grammatical errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks — if only we knew how to extract the model from the data.

Learning from Text at Web Scale

The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks; they are in fact much harder than tasks such as document classification.

The reason is that translation is a natural task routinely done every day for a real human need. The same is true of speech transcription. In other words, a large training set of the input-output behavior that we seek to automate is available to us in the wild.

Note

The first lesson of Web-scale learning: Use available large-scale data rather than hoping for annotated data that isn't available.

Memorization vs. Generalization

Another important lesson is that memorization is a good policy if you have a lot of training data. Statistical language models consist primarily of a huge database of probabilities of n-grams.

Researchers have done extensive work on estimating the probabilities of new, unseen n-grams (using smoothing methods such as Good-Turing or Kneser-Ney). But the reality is this: with enough data, plain memorization of the n-grams you have actually seen carries most of the weight.

Early work on machine translation relied on elaborate rules for relationships between syntactic and semantic patterns. Currently, statistical translation models consist mostly of large memorized phrase tables. Today's models introduce general rules only when they improve translation over just memorizing particular phrases.
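The memorize-plus-smooth idea can be shown in a minimal sketch. Everything here is illustrative: a real system would use far larger counts and a more elaborate smoothing scheme (Kneser-Ney rather than the add-k correction used below), but the shape is the same, a big table of memorized counts with a small adjustment for unseen events.

```python
from collections import Counter

def train_bigram_counts(tokens):
    """Memorization step: store raw unigram and bigram counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w1, w2, unigrams, bigrams, k=1.0):
    """Add-k smoothed P(w2 | w1): memorized counts plus a small
    correction so unseen pairs get nonzero probability."""
    vocab_size = len(unigrams)
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)

tokens = "the cat sat on the mat the cat ran".split()
uni, bi = train_bigram_counts(tokens)
print(bigram_prob("the", "cat", uni, bi))
```

With more data, the smoothing term matters less and the memorized counts dominate, which is exactly the lesson above.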

The Threshold of Sufficiency

In many cases there appears to be a threshold of sufficient data. James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted automobile from a photograph and filling in the background with pixels from a large corpus. With thousands of photos, results were poor. With millions, the same algorithm performed quite well.

We know that the number of grammatical English sentences is theoretically infinite and the number of possible 2-Mbyte photos is 256^2,000,000. However, in practice we humans care to make only a finite number of distinctions.

[Interactive figure: linguistic phenomena plotted by word rarity (common → rare) against corpus size, from 1K to 1T words.]

Each dot is a linguistic phenomenon. Common ones (left) appear in any corpus. Rare ones (right), such as dialectal forms, code-switches, and hapax legomena, only emerge at web scale. As the corpus grows, the long tail fills in. That is the unreasonable effectiveness of data.

The Complexity of Language

For those who were hoping that a small number of general rules could explain language, it is worth noting that language is inherently complex. This suggests that we can't reduce what we want to say to the free combination of a few abstract primitives.

Throwing away rare events is almost always a bad idea, because much Web data consists of individually rare but collectively frequent events. Relying on overt statistics of words has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily.
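The "time proportional to available data, easily parallelized" point is just map-reduce over counts. A minimal sketch (the shards and merge strategy are illustrative; a real system would distribute the map step across machines):

```python
from collections import Counter
from functools import reduce

def count_shard(lines):
    """Map step: count words in one shard of the corpus.
    Shards are independent, so this step parallelizes trivially."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge_counts(counters):
    """Reduce step: Counter addition is associative, so partial
    results can be merged in any order (e.g. in a tree)."""
    return reduce(lambda a, b: a + b, counters, Counter())

shards = [["the cat sat", "on the mat"], ["the dog ran"]]
total = merge_counts(map(count_shard, shards))
print(total["the"])  # 3
```

Because each shard is counted independently, total work grows linearly with data, and rare events survive: they are kept, not thresholded away.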

A False Dichotomy

The success of n-gram models has led to a belief that there are only two approaches:

  1. Deep approach: Hand-coded grammars and ontologies.
  2. Statistical approach: Learning n-gram statistics.

In reality, three orthogonal problems arise: choosing a representation language, encoding a model, and performing inference. Many combinations are possible. Stefan Schoenmackers showed how relational logic and a 100-million-page corpus can answer complex questions like "what vegetables help prevent osteoporosis?" by combining disparate relational assertions.
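The kind of question answering described above, combining disparate relational assertions, amounts to a join over extracted relations. A hypothetical sketch (all the entity names and "facts" below are illustrative placeholders, not data from Schoenmackers's system):

```python
def answer(contains, prevents, condition):
    """Join two extracted relations:
    food -contains-> nutrient, nutrient -prevents-> condition.
    Neither relation alone answers the question; the join does."""
    nutrients = {n for n, c in prevents if c == condition}
    return sorted({food for food, n in contains if n in nutrients})

# Illustrative assertions, as might be extracted from Web text.
contains = [("kale", "calcium"), ("spinach", "calcium"),
            ("carrot", "beta-carotene")]
prevents = [("calcium", "osteoporosis")]
print(answer(contains, prevents, "osteoporosis"))
```

The representation here is relational logic, the model is a set of extracted tuples, and the inference is a join: three separable choices, as the paragraph above argues.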

Semantic Web versus Semantic Interpretation

The Semantic Web is a convention for formal representation languages that lets software services interact "without needing artificial intelligence." "The Semantic Web will enable machines to comprehend semantic documents and data, not human speech and writings." (Berners-Lee, Hendler, & Lassila) [1]

The problem of understanding human speech — the semantic interpretation problem — is quite different. Semantic interpretation deals with imprecise, ambiguous natural languages. The "semantics" in semantic interpretation is embodied in human cognitive and cultural processes.

Engineering Hurdles

  • Ontology writing: Easy cases (Dublin Core) are done. But there's a long tail of concepts too expensive to formalize. Project Halo's encoding of a chemistry textbook cost US$10,000 per page.
  • Difficulty of implementation: Creating a compliant Semantic Web service is substantially harder than publishing a static page.
  • Competition: Every ontology is a treaty. When a motive for sharing is lacking, so are common ontologies.
  • Inaccuracy and deception: Sound inference mechanisms fail when premises are mistaken or actors deceive.

Conclusion: Follow the Data

The semantic interpretation problem remains regardless of the framework. We can't know for sure what company the string "Joe's Pizza" refers to because hundreds of businesses have that name. Using a Semantic Web formalism just means interpretation must be done on shorter strings between angle brackets.

What we need are methods to infer relationships between entities. Interestingly, Web-scale data is the solution. The Web contains hundreds of millions of independently created tables. These represent how different people organize data.

PROPOSITION

If we see a pair of attributes A and B that rarely occur together but always occur with the same other attribute names, they are likely synonyms.
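A sketch of this heuristic over a handful of toy schemas (the function, thresholds, and data are hypothetical; a real system would work over millions of tables and use statistical rather than exact tests):

```python
from itertools import combinations

def likely_synonyms(schemas):
    """Flag attribute pairs that never co-occur in one schema
    but keep the same companions elsewhere, per the proposition."""
    companions = {}
    for schema in schemas:
        attrs = set(schema)
        for attr in attrs:
            companions.setdefault(attr, set()).update(attrs - {attr})
    pairs = []
    for a, b in combinations(sorted(companions), 2):
        co_occur = any({a, b} <= set(s) for s in schemas)
        if not co_occur and companions[a] - {b} == companions[b] - {a}:
            pairs.append((a, b))
    return pairs

schemas = [["make", "model", "price"], ["brand", "model", "price"]]
print(likely_synonyms(schemas))  # [('brand', 'make')]
```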

We can also offer a schema autocomplete feature. We discover that schemata having "Make" and "Model" also tend to have "Year" and "Color."
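Schema autocomplete can likewise be reduced to co-occurrence counting. A minimal sketch (the ranking rule and the toy schemas are illustrative assumptions, not the published system):

```python
from collections import Counter

def autocomplete(partial, schemas, k=2):
    """Suggest the k attributes that most often appear in schemas
    containing every attribute of the partial schema."""
    partial = set(partial)
    votes = Counter()
    for schema in schemas:
        attrs = set(schema)
        if partial <= attrs:
            votes.update(attrs - partial)
    return [attr for attr, _ in votes.most_common(k)]

# Toy corpus of observed table schemas.
schemas = [
    ["make", "model", "year", "color"],
    ["make", "model", "year", "price"],
    ["make", "model", "color"],
]
print(autocomplete(["make", "model"], schemas))
```

On this toy corpus, starting a schema with "make" and "model" surfaces "year" and "color", mirroring the observation above.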

Warning

Abstract representations that lack linguistic counterparts are hard to learn or validate and tend to lose information.

So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data. Represent all the data with a nonparametric model. Human language has already evolved words for the important concepts; let's use them. Now go out and gather some data, and see what it can do.

NOTES
  1. Scientific American, 17 May 2001.



a personal library in perpetual arrangement  ·  MMXXVI