Authors: Ian Ayres
A variety of programmers have combined the free geographical information from Google Maps with virtually any other dataset that contains address information. These data “mashups” provide striking visual maps that can represent crime hot spots, campaign contributions, racial composition, or just about anything else. Zillow.com mashes up public tax information about house size and other neighborhood characteristics with information about recent neighborhood sales to produce beautiful maps with predicted housing values.
A “data commons” movement has created websites for people to post and link their data with others. In the last ten years, the norm of sharing datasets has become increasingly grounded in academics. The premier economics journal in the United States, the American Economic Review, requires that researchers post to a centralized website all the data backing up their empirical articles. So many researchers are posting their datasets to their personal web pages that it is now more likely than not that you can download the data for just about any empirical article by just typing a few words into Google. (You can find tons of my datasets at www.law.yale.edu/ayres/.)
Data aggregators like Acxiom and ChoicePoint have made an art of finding publicly available information and merging it into their pre-existing databases. The FBI has information on car theft in each city for each year; Allstate has information about how many anti-theft devices were used in particular cities in particular years. Nowadays, regardless of the initial digital format, it has become a fairly trivial task to link these two types of information together. Today, it's even possible to merge datasets when there doesn't exist a single unique identifier, such as the Social Security number, to match up observations. Indirect matches can be made by looking for similar patterns. For example, if you want to match house purchases from two different records, you might look for purchases that happened on the same day in the same city.
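The house-purchase example above can be sketched in a few lines of Python. This is only an illustration of indirect matching: the record layouts, field names, and values are invented for the example, and a real merge would need more fields than date and city to avoid coincidental pairs.

```python
from collections import defaultdict

def indirect_match(records_a, records_b, keys=("date", "city")):
    """Pair records from two sources that agree on every field in `keys`."""
    # Index the second source by the shared fields so lookups are cheap.
    index = defaultdict(list)
    for rec in records_b:
        index[tuple(rec[k] for k in keys)].append(rec)
    matches = []
    for rec in records_a:
        for candidate in index.get(tuple(rec[k] for k in keys), []):
            matches.append((rec, candidate))
    return matches

# Hypothetical records: neither source shares a unique identifier.
county_deeds = [
    {"id": "D1", "date": "2006-03-14", "city": "New Haven", "price": 310000},
    {"id": "D2", "date": "2006-03-15", "city": "Hartford", "price": 185000},
]
lender_files = [
    {"id": "M7", "date": "2006-03-14", "city": "New Haven", "loan": 248000},
    {"id": "M8", "date": "2006-03-14", "city": "Hartford", "loan": 150000},
]

pairs = indirect_match(county_deeds, lender_files)
matched_ids = [(a["id"], b["id"]) for a, b in pairs]
print(matched_ids)
```

Only the deed and the loan that share both a date and a city are paired. If two houses sold in the same city on the same day, this rule would pair each deed with each loan, which is exactly the kind of error the next paragraph warns about.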
Yet the art of indirect matching can also be prone to error. Database Technologies (DBT), a company that was ultimately purchased by ChoicePoint, got in a lot of trouble for indirectly identifying felons before the 2000 Florida elections. The state of Florida hired DBT to create a list of people who might be removed from the list of registered voters. DBT matched the database of registered voters to lists of convicted felons not just from Florida but from every state in the union. The most direct and conservative means to match would have been to use the voter's name and date of birth as necessary identifiers. But DBT, possibly under direction from Florida's Division of Elections, cast a much broader net in trying to identify potential convicts. Its matching algorithm required only a 90 percent match between the name of the registered voter and the name of the convict. In practice this meant that there were lots of false positives, registered voters who were wrongly identified as possibly being convicts. For example, the Rev. Willie D. Whiting, Jr., a registered voter from Tallahassee, was initially told that he could not vote because someone named Willie J. Whiting, born two days later, had a felony conviction. The Division of Elections also required DBT to perform “nickname matches” for first names and to match on first and last names regardless of their order, so that the name Deborah Ann would also match the name Ann Deborah, for example.
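To see how a loose name rule produces false positives, here is a minimal sketch that contrasts a 90 percent name-similarity rule with the conservative name-plus-birth-date rule. It is not DBT's algorithm, which the text does not spell out. The similarity score comes from Python's `difflib.SequenceMatcher`, the token sorting is a stand-in for order-insensitive matching, and the birth dates are placeholders (the text says only that they were two days apart).

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " "
                      for ch in name.lower())
    return " ".join(sorted(cleaned.split()))

def name_similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def loose_match(voter, felon, threshold=0.90):
    """Name-only fuzzy rule (hypothetical stand-in for a 90 percent match)."""
    return name_similarity(voter["name"], felon["name"]) >= threshold

def strict_match(voter, felon):
    """Conservative rule: identical normalized name and identical birth date."""
    return (normalize(voter["name"]) == normalize(felon["name"])
            and voter["dob"] == felon["dob"])

# Name suffix omitted for simplicity; the dates are placeholders, two days apart.
voter = {"name": "Willie D. Whiting", "dob": "1950-01-01"}
felon = {"name": "Willie J. Whiting", "dob": "1950-01-03"}

print("loose rule flags voter: ", loose_match(voter, felon))
print("strict rule flags voter:", strict_match(voter, felon))
print("order-insensitive score:", name_similarity("Deborah Ann", "Ann Deborah"))
```

Under these assumptions the loose rule flags the voter even though the middle initial and birth date differ, while the strict rule does not.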
The combination of these low matching requirements together with the broad universe of all state felonies produced a staggeringly large list of 57,746 registered Floridians who were identified as convicted felons. The concern was not just with the likely large number of false positives, but also with the likelihood that a disproportionate number of the so-called purged registrations would be for African-American voters. This is especially true because the algorithm was not relaxed when it came to race. Only registered voters who exactly matched the race of the convict were subject to exclusion from the voting rolls. So while Rev. Whiting, notwithstanding a different middle initial and birth date, could match convict Willie J. Whiting, a white voter with the same name and birth date would not qualify because the convict Whiting was black.
Mashups and mergers of datasets are easier today than ever before. But the DBT convict list is a cautionary tale. The new merging technology can fail through either inadvertent or advertent error. As the size of datasets balloons almost beyond the scope of our imagination, it becomes all the more important to continually audit them to check for the possibility of error. What makes the DBT story so troubling is that the convict/voter data seemed so poorly matched relative to the standards of modern-day merging and mashing.
Technology or Techniques?
The technological advances in the ability of firms to digitally capture and merge information have helped trigger the commodification of data. You're more likely to want to pay for data if you can easily merge them into your pre-existing database. And you're more likely to capture information if you think someone else is later going to pay you for it. So the ability of firms to more easily capture and merge information helps answer the “Why Now?” question.
At heart, the recent onslaught of Super Crunching is more a story of advances in technology than of advances in statistical techniques. It is not a story about new breakthroughs in the statistical art of prediction. The basic statistical techniques have existed for decades, even centuries. The randomized trials that are being used so devastatingly by companies like Offermatica have been known and used in medicine for years. Over the last fifty years, econometrics and statistical theory have improved, but the core regression and randomization techniques have been available for a long time.
What's more, the timing of the Super Crunching revolution isn't dominantly about the exponential increase in computational capacity. The increase in computer speed has helped, but the increase in computer speed came substantially before the rise of data-based decision making. In the old days, say, before the 1980s, CPUs (“central processing units”) were a real constraint. The number of mathematical operations necessary to calculate a regression goes up roughly with the square of the number of variables, so if you double the number of controls, you roughly quadruple the number of operations needed to estimate a regression equation. In the 1940s, the Harvard Computation Laboratory employed scores of secretaries armed with mechanical calculators who manually crunched the numbers behind individual regressions. When I was in grad school at MIT in the 1980s, CPUs were so scarce that grad students were only allotted time in the wee hours of the morning to run our programs.
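The “double the controls, quadruple the work” rule of thumb can be checked with a back-of-the-envelope count. The sketch below assumes a simple cost model that is not in the text: most of the work in a least-squares regression is building the cross-product matrix X'X, which takes about n × k × k multiply-adds for n observations and k variables.

```python
def cross_product_ops(n_obs, n_vars):
    """Approximate multiply-adds needed to build the k-by-k matrix X'X."""
    return n_obs * n_vars * n_vars

n_obs = 10_000                         # hypothetical sample size
base = cross_product_ops(n_obs, 20)    # 20 controls
doubled = cross_product_ops(n_obs, 40) # 40 controls
ratio = doubled / base

print(f"20 controls: {base:,} operations")
print(f"40 controls: {doubled:,} operations")
print(f"ratio: {ratio}")
```

Under this cost model the ratio is exactly four whatever the sample size, which is why a hand-cranked regression with many controls was such a chore.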
But thanks to Moore's Law (the phenomenon that processor power doubles every two years), Super Crunching has not been seriously hampered by a lack of cheap CPUs. For at least twenty years, computers have had the computation power to estimate some serious regression equations.
The timing of the current rise of Super Crunching has been impacted more by the increase in storage capacity. We are moving toward a world without delete buttons. Moore's Law is better known, but it is Kryder's Law, a regularity first proposed by the chief technology officer for hard drive manufacturer Seagate Technology, Mark Kryder, that is more responsible for the current onslaught of Super Crunching. Kryder observed that the storage capacity of hard drives has been doubling every two years.
Since the introduction of the disk drive in 1956, the density of information that can be recorded into the space of about a square inch has swelled an amazing 100-million fold. Anyone over thirty remembers the days when we had to worry frequently about filling up our hard disks. Today, the possibility of cheap data storage has revolutionized the possibility for keeping massively large datasets.
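As a rough sanity check, the 100-million-fold figure can be compared with a two-year doubling pace. This is only a sketch: it treats recording density and drive capacity as if they grew at the same rate, which is a simplifying assumption.

```python
import math

growth = 100_000_000                   # the 100-million-fold increase from the text
doublings = math.log2(growth)          # how many doublings that represents
years_at_two_year_pace = doublings * 2

print(f"doublings needed: {doublings:.1f}")
print(f"years at one doubling every two years: {years_at_two_year_pace:.0f}")
```

Roughly twenty-seven doublings are needed, which at a two-year pace takes a little over half a century.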
And as the density of storage has increased, the price of storage has dropped. Thirty to forty percent annual price declines in the cost per gigabyte of storage continue apace. Yahoo! currently records over twelve terabytes of data daily. On the one hand, this is a massive amount of information: it's roughly equivalent to more than half the information contained in all the books in the Library of Congress. On the other hand, this amount of disk storage does not require acres of servers or billions of dollars. In fact, right now you could add a terabyte of hard drive to your desktop for about $400. And industry experts predict that in a couple of years that price will drop in half.
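The arithmetic behind these figures is easy to reproduce. The sketch below uses only the numbers quoted above ($400 per terabyte, twelve terabytes a day, 30 to 40 percent annual price declines). The assumption that each day's data is stored once on consumer-priced drives is mine, made purely for illustration.

```python
import math

def years_to_halve(annual_decline):
    """Years for a price to fall by half at a constant annual rate of decline."""
    return math.log(0.5) / math.log(1.0 - annual_decline)

def price_after(price, annual_decline, years):
    """Price after compounding the annual decline for a number of years."""
    return price * (1.0 - annual_decline) ** years

price_per_tb = 400.0   # dollars per terabyte, from the text
daily_tb = 12          # terabytes recorded per day, from the text

daily_cost = daily_tb * price_per_tb
yearly_cost = daily_cost * 365

print(f"cost to store one day of data:  ${daily_cost:,.0f}")
print(f"cost to store one year of data: ${yearly_cost:,.0f}")
for decline in (0.30, 0.40):
    print(f"{decline:.0%} annual decline: price halves in "
          f"{years_to_halve(decline):.1f} years, "
          f"${price_after(price_per_tb, decline, 2):,.0f} per TB after two years")
```

At either rate of decline the price halves in under two years, which squares with the experts' prediction.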
The cutthroat competition to produce these humongous hard drives for personal consumer products is driven by video. TiVo and other digital video recorders can only remake the world of home video entertainment if they have adequate storage space. A terabyte drive will only hold about eight hours of HDTV (or nearly 14,000 music albums), but you can jam onto it about sixty-six million pages of text or numbers.
Both the compactness and cheapness of storage are important for the proliferation of data. Suddenly, it's feasible for Hertz or UPS to give each employee a handheld machine to capture and store individual transaction data that are only periodically downloaded to a server. Suddenly, every car includes a flash memory drive, a mini-black box recorder to tell what was happening at the time of an accident.
The abundance of supercheap storage, from the tiny flash drives (hidden inside everything from iPods and movie cameras to swimming goggles and birthday cards) to the terabyte server farms at Google and flickr.com, has opened up new vistas of data-mining possibilities. The recent onslaught of Super Crunching is dominantly driven by the same technological revolutions that have been reshaping so many other parts of our lives. The timing is best explained by the digital breakthroughs that make it cheaper to capture, to merge, and to store huge electronic databases. Now that mountains of data exist (on hard disks) to be mined, a new generation of empiricists is emerging to crunch it.
Can a Computer Be Taught to Think Like You?
There is, though, one new statistical technique that is an important contributor to the Super Crunching revolution: the “neural network.” Predictions using neural network equations are a newfangled competitor to the tried-and-true regression formula. The first neural networks were developed by academics to simulate the learning processes of the human brain. There's a great irony here: the last chapter detailed scores of studies showing why the human brain does a bad job of predicting. Neural networks, however, are attempts to make computers process information like human neurons. The human brain is a network of interconnected neurons that act as informational switches. Depending on the way the neuron switches are set, when a particular neuron receives an impulse, it may or may not send an impulse on to a subsequent set of neurons. Thinking is the result of particular flows of impulses through the network of neuron switches. When we learn from some experience, our neuron switches are being reprogrammed to respond differently to different types of information. When a curious young child reaches out and touches a hot stove, her neuron switches are going to be reprogrammed to fire differently so the next time the hot stove will not look so enticing.
The idea behind computer neural networks is essentially the same: computers can be programmed to update their responses based on new or different information. In a computer, a mathematical “neural network” is a series of interconnected switches that, like neurons, receive, evaluate, and transmit information. Each switch is a mathematical equation that takes and weighs multiple types of input information. If the weighted sum of the inputs in the equation is sufficiently large, the switch is turned on and its signal is sent as informational input to subsequent neural equation switches. At the end of the network is a final switch that collects information from previous neural switches and produces as its output the neural network's prediction. Unlike the regression approach, which estimates the weights to apply to a single equation, the neural approach uses a system of equations represented by a series of interconnected switches.
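The switch description above translates almost directly into code. Here is a minimal sketch of a two-layer network of on/off switches. The inputs (a dog's win rate and average finishing place), the weights, and the thresholds are all invented for illustration, and practical networks usually use smooth switches rather than hard on/off ones.

```python
def switch(inputs, weights, threshold):
    """Turn on (1.0) if the weighted sum of the inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 if total >= threshold else 0.0

def predict(inputs, hidden_layer, output_unit):
    """Feed inputs through the hidden switches, then through the final switch."""
    hidden_signals = [switch(inputs, w, t) for w, t in hidden_layer]
    out_weights, out_threshold = output_unit
    return switch(hidden_signals, out_weights, out_threshold)

# Hypothetical network: each entry is (weights, threshold).
hidden_layer = [
    ([1.0, 0.0], 0.5),    # fires when the win rate is at least 0.5
    ([0.0, -1.0], -3.0),  # fires when the average finishing place is 3 or better
]
output_unit = ([1.0, 1.0], 2.0)  # fires only if both hidden switches fire

strong_dog = [0.6, 2.0]  # [win rate, average finishing place]
weak_dog = [0.3, 2.0]

print("strong dog predicted to win:", predict(strong_dog, hidden_layer, output_unit))
print("weak dog predicted to win:  ", predict(weak_dog, hidden_layer, output_unit))
```

The final switch sees only the signals from the hidden switches, never the raw inputs, which is what makes this a system of equations rather than a single one.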
Just as experience trains our brain's neuron switches when to fire and when not to, computers use historical data to train the equation switches to come up with optimal weights. For example, researchers at the University of Arizona constructed a neural network to forecast winners in greyhound dog racing at the Tucson Greyhound Park. They fed in more than fifty pieces of information from thousands of daily racing sheets: things like the dogs' physical attributes, the dogs' trainers, and, of course, how the dogs did in particular races under particular conditions. Like the haphazard predictions of the curious young child, the weights on these greyhound racing equations were initially set randomly. The neural estimation process then tried out alternative weights on the same historic data over and over again (sometimes literally millions of times) to see which weights for the interconnecting equations produced the most accurate estimates. The researchers then used the weights from this training to predict the outcome of a hundred future dog races.
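The trial-and-error flavor of training can be sketched with a simple random search: start with random weights, nudge them, and keep a nudge only when it fits the historical data better. This is not the Arizona researchers' procedure, and real networks are usually trained with gradient-based methods. The single smooth switch, the six invented race records, and the parameter names are all assumptions made for the example.

```python
import math
import random

def predict(weights, features):
    """One smooth switch: a weighted sum squashed into the range (0, 1)."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    total = max(-60.0, min(60.0, total))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-total))

def error(weights, history):
    """Mean squared gap between predictions and what actually happened."""
    return sum((predict(weights, x) - y) ** 2 for x, y in history) / len(history)

def train(history, n_features, trials=5000, step=0.5, seed=0):
    """Try alternative weights over and over, keeping only improvements."""
    rng = random.Random(seed)
    best = [rng.uniform(-1, 1) for _ in range(n_features + 1)]  # random start
    start_error = best_error = error(best, history)
    for _ in range(trials):
        candidate = [w + rng.gauss(0, step) for w in best]
        candidate_error = error(candidate, history)
        if candidate_error < best_error:
            best, best_error = candidate, candidate_error
    return best, start_error, best_error

# Invented history: ([win rate, average finishing place], 1 if the dog won).
history = [
    ([0.7, 1.5], 1), ([0.6, 2.0], 1), ([0.5, 2.5], 1),
    ([0.3, 3.5], 0), ([0.2, 4.0], 0), ([0.1, 5.0], 0),
]

best, start_error, best_error = train(history, n_features=2)
print(f"error with random weights:  {start_error:.4f}")
print(f"error after training:       {best_error:.4f}")
print(f"predicted chance for [0.6, 2.0]: {predict(best, [0.6, 2.0]):.2f}")
```

Because a change is kept only when it lowers the error, the trained weights can never fit the history worse than the random starting point did.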
The researchers even set up a contest between their predictions and three expert habitués of the racetrack. For the test races, the neural network and the experts were each instructed to place $1 bets on a hundred different dogs. Not only did the neural network better predict the winners, but (more importantly) the network's predictions yielded substantially higher payoffs. In fact, while none of the three experts generated positive payoffs with their predictions (the best still lost $60), the neural network won $125. It won't surprise you to learn that lots of other bettors are now relying on neural prediction (if you google neural network and betting, you'll get tons of hits).
You might be wondering what's really new about this technique. After all, plain-old regression analysis also involves using historical data to predict results. What sets the neural network methodology apart is its flexibility and nuance. With traditional regressions, the Super Cruncher needs to specify the specific form of the equation. For example, it's the Super Cruncher who has to tell the machine whether or not the dog's previous win percentage needs to be multiplied by the dog's average place in a race in order to produce a more powerful prediction.
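To make the point concrete, here is a minimal sketch of what “specifying the form” means in practice. The data are synthetic and generated from an invented formula in which win percentage and average place interact. The regression can recover that relationship only if the analyst adds the product column by hand. The small least-squares solver is written out so that the example needs nothing beyond the Python standard library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def ols(rows, y):
    """Least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def sse(rows, beta, y):
    """Sum of squared prediction errors."""
    return sum((sum(b * v for b, v in zip(beta, r)) - t) ** 2
               for r, t in zip(rows, y))

# Invented "true" relationship: intercept, win pct, avg place, and their product.
true_beta = [1.0, 2.0, -0.5, 0.25]
grid = [(w, p) for w in (0.2, 0.5, 0.8) for p in (1.0, 2.0, 3.0, 4.0)]
y = [true_beta[0] + true_beta[1] * w + true_beta[2] * p + true_beta[3] * w * p
     for w, p in grid]

plain = [[1.0, w, p] for w, p in grid]                # no product term
with_product = [[1.0, w, p, w * p] for w, p in grid]  # analyst adds win pct x place

beta_plain = ols(plain, y)
beta_product = ols(with_product, y)
sse_plain = sse(plain, beta_plain, y)
sse_product = sse(with_product, beta_product, y)

print(f"squared error without the product term: {sse_plain:.5f}")
print(f"squared error with the product term:    {sse_product:.2e}")
```

The regression with the hand-built product column fits the synthetic data essentially perfectly, while the plain regression cannot, no matter how its weights are chosen.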