by Alon Halevy, Peter Norvig & Fernando Pereira
The 2009 IEEE Intelligent Systems essay by Halevy, Norvig, and Pereira on why massive, messy datasets trump elegant theories.
tags: machine-learning, nlp, data-science, google, seminal
"The enormous usefulness of mathematics in the natural sciences is something bordering on the mysterious and there is no rational explanation for it."
Eugene Wigner's article "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc². Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior.
An informal, incomplete grammar of the English language runs over 1,700 pages. Perhaps when it comes to natural language processing and related fields, we're doomed to complex theories that will never have the elegance of physics equations. But if that's so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
One of us, as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words. Since then, our field has seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long.
In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, and grammatical errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks — if only we knew how to extract the model from the data.
The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks; they are in fact much harder than tasks such as document classification.
The reason is that translation is a natural task routinely done every day for a real human need. The same is true of speech transcription. In other words, a large training set of the input-output behavior that we seek to automate is available to us in the wild.
The first lesson of Web-scale learning: Use available large-scale data rather than hoping for annotated data that isn't available.
Another important lesson is that memorization is a good policy if you have a lot of training data. Statistical language models consist primarily of a huge database of probabilities of n-grams.
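The memorization view can be made concrete: a bigram language model is little more than a lookup table of counts turned into conditional probabilities. A minimal sketch, using a toy corpus and assuming whitespace tokenization:

```python
from collections import Counter

def bigram_model(tokens):
    """Memorize bigram counts and convert them into conditional probabilities."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = "the cat sat on the mat the cat ran".split()
probs = bigram_model(corpus)
print(probs[("the", "cat")])  # 2 of the 3 occurrences of "the" precede "cat" -> 0.666...
```

At web scale the table simply gets bigger; the model itself stays this simple.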
Researchers have done extensive work in estimating the probabilities of new n-grams — those never seen in the training data — using techniques such as Good-Turing or Kneser-Ney smoothing. But the reality is this: invariably, simple models and a lot of data trump more elaborate models based on less data.
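The core idea behind Good-Turing smoothing fits in a few lines: the total probability mass reserved for all unseen n-grams is estimated from the fraction of observations that were singletons. A minimal sketch over invented toy counts:

```python
from collections import Counter

def unseen_mass(ngram_counts):
    """Good-Turing estimate of the total probability of all unseen n-grams:
    N1 / N, the share of observations accounted for by items seen exactly once."""
    n1 = sum(1 for c in ngram_counts.values() if c == 1)
    total = sum(ngram_counts.values())
    return n1 / total

counts = Counter({("of", "the"): 5, ("in", "a"): 1, ("big", "red"): 1, ("on", "mars"): 1})
print(unseen_mass(counts))  # 3 singletons out of 8 observations -> 0.375
```

Full Good-Turing and Kneser-Ney also adjust the probabilities of seen n-grams; this shows only the unseen-mass estimate.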
Early work on machine translation relied on elaborate rules for relationships between syntactic and semantic patterns. Currently, statistical translation models consist mostly of large memorized phrase tables. Today's models introduce general rules only when they improve translation over just memorizing particular phrases.
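The phrase-table idea can be caricatured as a dictionary with greedy longest-match lookup. The entries below are hypothetical, and real decoders score many competing segmentations rather than matching greedily; this sketch only illustrates "memorize long phrases, fall back to shorter ones":

```python
# Toy phrase table: memorized source -> target mappings (hypothetical entries).
PHRASES = {
    ("la", "casa", "blanca"): "the white house",
    ("la", "casa"): "the house",
    ("blanca",): "white",
    ("es",): "is",
    ("grande",): "big",
}

def translate(words, table=PHRASES, max_len=3):
    """Greedy longest-match lookup: prefer a memorized long phrase over
    composing shorter ones."""
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + n])
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            out.append(words[i])  # pass unknown words through unchanged
            i += 1
    return " ".join(out)

print(translate("la casa blanca es grande".split()))  # -> "the white house is big"
```

Note that the memorized three-word phrase wins over composing "the house" + "white", which would get the word order wrong.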
In many cases there appears to be a threshold of sufficient data. James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted automobile from a photograph and filling in the background with pixels from a large corpus. With thousands of photos, results were poor. With millions, the same algorithm performed quite well.
We know that the number of grammatical English sentences is theoretically infinite and the number of possible 2-Mbyte photos is 256^2,000,000. However, in practice we humans care to make only a finite number of distinctions.
[Interactive figure: each dot is a linguistic phenomenon. Common phenomena (left) appear in any corpus; rare ones (right), such as dialectal forms, code-switches, and hapax legomena, emerge only at web scale. That is the unreasonable effectiveness of data.]
For those who were hoping that a small number of general rules could explain language, it is worth noting that language is inherently complex. This suggests that we can't reduce what we want to say to the free combination of a few abstract primitives.
Throwing away rare events is almost always a bad idea, because much Web data consists of individually rare but collectively frequent events. Relying on overt statistics of words has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily.
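The parallelization claim is easy to see with word counting: each shard of a corpus can be counted independently and the results merged, because merging counts is associative. A minimal map-reduce-style sketch (run sequentially here; each shard could just as well live on a different machine):

```python
from collections import Counter
from functools import reduce

def count_shard(text):
    """Map step: count words in one shard, independently of all other shards."""
    return Counter(text.split())

def merge(counts_iter):
    """Reduce step: Counter addition is associative, so partial counts
    can be combined in any order."""
    return reduce(lambda a, b: a + b, counts_iter, Counter())

shards = ["the data is big", "big data is messy", "the web is big"]
totals = merge(count_shard(s) for s in shards)
print(totals["big"])  # -> 3
```

Because no step looks at more than one shard at a time, total work grows linearly with the data.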
The success of n-gram models has led to a belief that there are only two approaches: a deep approach that relies on hand-coded grammars and ontologies, represented as complex networks of relations, and a statistical approach that relies on learning n-gram statistics from large corpora.
In reality, three orthogonal problems arise: choosing a representation language, encoding a model, and performing inference. Many combinations are possible. Stefan Schoenmackers showed how relational logic and a 100-million-page corpus can answer complex questions like "what vegetables help prevent osteoporosis?" by combining disparate relational assertions.
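This style of inference can be caricatured as a relational join. A toy sketch, with invented assertions standing in for facts extracted from the corpus — here combining a contains(food, nutrient) relation with a prevents(nutrient, condition) relation:

```python
# Hypothetical extracted assertions (illustrative only, not real corpus output).
contains = {("kale", "calcium"), ("spinach", "calcium"), ("carrot", "beta-carotene")}
prevents = {("calcium", "osteoporosis"), ("beta-carotene", "night blindness")}

def answer(condition):
    """Join the two relations on the shared nutrient to answer
    'what foods help prevent <condition>?'"""
    return sorted(food for food, nutrient in contains
                  for nutrient2, cond in prevents
                  if nutrient == nutrient2 and cond == condition)

print(answer("osteoporosis"))  # -> ['kale', 'spinach']
```

Neither relation alone answers the question; the answer emerges from combining disparate assertions, which is exactly the point.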
The Semantic Web is a convention for formal representation languages that lets software services interact "without needing artificial intelligence." As its creators put it, "The Semantic Web will enable machines to comprehend semantic documents and data, not human speech and writings" (Berners-Lee, Hendler, & Lassila).
The problem of understanding human speech — the semantic interpretation problem — is quite different. Semantic interpretation deals with imprecise, ambiguous natural languages. The "semantics" in semantic interpretation is embodied in human cognitive and cultural processes.
The semantic interpretation problem remains regardless of the framework. We can't know for sure what company the string "Joe's Pizza" refers to because hundreds of businesses have that name. Using a Semantic Web formalism just means interpretation must be done on shorter strings between angle brackets.
What we need are methods to infer relationships between entities. Interestingly, Web-scale data is the solution. The Web contains hundreds of millions of independently created tables. These represent how different people organize data.
If we see a pair of attributes A and B that rarely occur together but always occur with the same other attribute names, they are likely synonyms.
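This heuristic is easy to state as code. A minimal sketch over invented schemas, simplified to "never co-occur" and an exact match on the companion-attribute sets:

```python
from itertools import combinations

# Hypothetical schemas harvested from web tables about cars.
schemas = [
    {"make", "model", "year"},
    {"manufacturer", "model", "year"},
    {"make", "model", "year", "color"},
    {"manufacturer", "model", "color"},
]

def likely_synonyms(schemas):
    """Attributes that never appear in the same schema but keep the same
    companions are probably two names for the same thing."""
    attrs = set().union(*schemas)
    # For each attribute, the set of other attributes it co-occurs with.
    context = {a: set().union(*(s - {a} for s in schemas if a in s)) for a in attrs}
    pairs = []
    for a, b in combinations(sorted(attrs), 2):
        co_occur = any(a in s and b in s for s in schemas)
        if not co_occur and context[a] == context[b]:
            pairs.append((a, b))
    return pairs

print(likely_synonyms(schemas))  # -> [('make', 'manufacturer')]
```

A production system would use soft thresholds rather than exact matches, but the signal is the same.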
We can also offer a schema autocomplete feature. We discover that schemata having "Make" and "Model" also tend to have "Year" and "Color."
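Autocomplete suggestions can likewise be read straight off co-occurrence counts. A toy sketch (invented schemas): suggest the attributes that most often appear alongside everything the user has typed so far.

```python
from collections import Counter

# Hypothetical corpus of schemas scraped from web tables.
schemas = [
    {"make", "model", "year", "color"},
    {"make", "model", "year", "price"},
    {"make", "model", "color"},
    {"title", "author", "year"},
]

def autocomplete(partial, schemas, k=2):
    """Rank candidate attributes by how many schemas contain both the
    partial schema and the candidate."""
    votes = Counter()
    for s in schemas:
        if partial <= s:  # schema contains every attribute typed so far
            votes.update(s - partial)
    return [a for a, _ in votes.most_common(k)]

print(sorted(autocomplete({"make", "model"}, schemas)))  # -> ['color', 'year']
```

No ontology of cars is consulted; the suggestion comes entirely from how other people organized their data.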
Abstract representations that lack linguistic counterparts are hard to learn or validate and tend to lose information.
So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model. Human language has already evolved words for the important concepts; let's use them. Now go out and gather some data, and see what it can do.
Berners-Lee, T., Hendler, J., & Lassila, O. "The Semantic Web," Scientific American, 17 May 2001.