is harnessing the power of Big Data to track COVID-19 and other diseases

Photo by Matthew Modoono/Northeastern University

Around this time last year, Samuel Scarpino, an assistant professor in Northeastern’s Network Science Institute, received a call from a senior producer at VICE News. VICE’s reporters had failed to get any useful data about COVID-19 cases from the Centers for Disease Control and Prevention. Could he help?

“Welcome to public health,” Scarpino told the producer sarcastically.

One of the biggest issues with public health responses, particularly when it comes to infectious disease outbreaks, is a lack of detailed, accessible, and usable data, says Scarpino, who has dedicated his career to improving access to public health information. 

Scarpino and a team of researchers who were collecting COVID-19 data last spring encountered a similar problem when they reached the limit of the shared spreadsheet they were using—five million cells. Their vision of a comprehensive and publicly available COVID-19 dataset was too big for Google Sheets, but there was no alternative. 

A year later, through a partnership with Google’s nonprofit arm and The Rockefeller Foundation, the research team’s efforts to build a better system have culminated in a collaborative platform that consolidates disease outbreak data from around the world into one free resource.

As of now, the system houses more than 20 million anonymized case records from more than 100 countries. “And importantly, these records are more than just an aggregate case count that shows how many cases are in a certain location,” Scarpino says. The dataset excludes all personally identifiable information but includes specific details such as a patient’s travel history, age, nationality, and symptom-onset date.

The case records come from a diverse pool of sources including news reports, government press releases, and social media posts, and all records are entered manually by volunteer researchers. 

While the sources aren’t vetted—verifying reports is beyond the scope of the project—all the original sources are tagged in the dataset so that individuals can investigate further if needed, Scarpino says. 

The dataset is available to anyone, but was specifically designed to serve those who need the information the most—public health officials, researchers, and journalists. 

“The product team at Google conducted hundreds of hours of interviews with public health officials, subject matter experts, and journalists to understand what their needs were before we started building anything,” Scarpino says. 

While it’s predominantly being used to track COVID-19 cases right now, the platform was also designed to adapt to new diseases in the future.

“When a new pathogen emerges, researchers almost always have to invent a new data system specific to that pathogen,” Scarpino says. The platform eliminates the need to develop a new structure each time a novel disease emerges, giving healthcare workers and policymakers valuable time to react.

The system’s versatility is already being put to the test as researchers track new variants of the SARS-CoV-2 virus. “It’s like Wuhan all over again,” Scarpino says, comparing the spread of a new variant called B.1.1.7 to the initial outbreak of the virus in China last spring. “But this time we have a system in place to track this information and make it available in real time.” 

For media inquiries, please contact Marirose Sartoretto at 617-373-5718.