Google Flu Trends has long been the go-to example for anyone asserting the revolutionary potential of big data. Since 2008 the company has claimed it could use counts of flu-related Web searches to forecast flu outbreaks weeks ahead of data from the Centers for Disease Control and Prevention.
Unfortunately, this turned out to be what I call big-data hubris. Colleagues and I recently showed that Google’s tool has drifted further and further over time from accurately predicting CDC data. Among the underlying problems was that Google assumed a constant relationship between flu-related searches and flu prevalence, even as the search technology changed and people began using it in different ways.
That failure is the big-data era’s equivalent of the Chicago Tribune’s “Dewey Defeats Truman” headline in 1948. After public-opinion surveys erroneously predicted Dewey’s victory, the New York Times declared polling “unable to compute statistically the unpredictable and unfathomable nuances of human character.” Yet 64 years later, polling is used widely and successfully. In aggregate it predicted the overall margin of the latest presidential election within tenths of a percentage point, as well as the outcome in all 50 states. Surveys remain the bread and butter of social-science research.