These days, when people feel a fever and a sore throat coming on, their first move often isn’t to the medicine cabinet. Instead, it’s to a computer or smartphone to Google their symptoms.
These queries, which make up only a tiny fraction of the more than 7 billion total queries the search engine handles each day, are all stored by Google. The company uses this data for a variety of purposes; it can help Google improve its search results for users—which also boosts the company’s bottom line—and can benefit the population as a whole in other ways.
One example of the latter is Google Flu Trends (GFT), a statistical model developed by engineers at Google.org—the company’s philanthropic arm—in an effort to “now-cast” what’s happening with the flu on any given day.
But research has shown that GFT often misses its target. These results led Northeastern University network scientists and their colleagues to take a closer look at how Big Data should be used to advance scientific research. Their report was published online Thursday in the journal Science.
“Big Data have enormous scientific possibilities,” said Northeastern professor David Lazer. “But we have to be aware that most Big Data aren’t designed for scientific purposes.” Fully achieving Big Data’s enthusiastically lauded potential, he added, requires a synthesis of both computer science approaches to data as well as traditional approaches from the social sciences.
The paper was co-authored by Lazer, who holds joint appointments in the Department of Political Science and the College of Computer and Information Science; Alessandro Vespignani, the Sternberg Family Distinguished University Professor of Physics at Northeastern who has joint appointments in the College of Science, Bouvé College of Health Sciences, and the College of Computer and Information Science; Northeastern visiting research professor of political science Ryan Kennedy; and Gary King, a professor in the Harvard University Department of Government.
“In a sense, Google Flu Trends is not bad, but it’s no better than any basic approach to time series prediction,” Vespignani said. “So the issue is in the claims and the disregard of other techniques or data more than the actual result.”
In their paper, the researchers explain where Google Flu Trends went wrong and examine how the research community can best utilize the outputs of Big Data companies as well as how those companies should participate in the research effort.
By incorporating lagged data from the Centers for Disease Control and Prevention as well as making a few simple statistical tweaks to the model, Lazer said, the GFT engineers could have significantly improved their results. But in a companion report also released Thursday on the Social Science Research Network—an online repository of scholarly research and related materials—Lazer and his colleagues show that an updated version of GFT, which came about in response to a 2013 Nature article revealing GFT’s limitations, does little better than its predecessor.
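The kind of simple baseline the researchers have in mind can be illustrated with a toy autoregressive model fit to lagged surveillance data. This sketch is purely illustrative—the synthetic series, the choice of two lags, and the least-squares fit are assumptions for the example, not the actual GFT model or the authors’ revision:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for weekly CDC influenza-like-illness (ILI) data:
# a seasonal cycle plus noise, over five years of weekly observations.
weeks = 260
t = np.arange(weeks)
ili = 2.0 + np.sin(2 * np.pi * t / 52) + 0.1 * rng.standard_normal(weeks)

# Predict this week's value from the previous two weeks (lagged data).
lags = 2
X = np.column_stack([ili[lags - 1 - k : weeks - 1 - k] for k in range(lags)])
y = ili[lags:]

# Ordinary least squares with an intercept term.
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

pred = A @ coef
mae = float(np.mean(np.abs(pred - y)))
print(f"in-sample mean absolute error: {mae:.3f}")
```

Even a baseline this crude tracks a seasonal signal closely, which is the point of the critique: a model built on proprietary search data should at minimum outperform such readily available alternatives.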
While Big Data certainly holds great promise for research, Lazer said, it will only be successful if the methods and data are made—at least partially—accessible to the community. But that so far has not been the case with Google.
“Google wants to contribute to science but at the same time does not follow scientific praxis and the principles of reproducibility and data availability that are crucial for progress,” Vespignani said. “In other words, they want to contribute to science with a black box, which we cannot fully scrutinize and understand.”
If scientists are to “stand on the shoulders of giants,” as the old adage for advancing knowledge requires, they will need some help from the giants, Lazer said. Otherwise failures like that of Google Flu Trends will become common, with the potential to distort our understanding of everything from stock market trends to the spread of disease.