Can ‘digital traces’ from internet searches and social media predict outbreaks of COVID-19?

by Cynthia McCormick Hibbert

January 18, 2023

Mauricio Santillana, Director of the Machine Intelligence Lab, Network Science Institute. Photo by Matthew Modoono/Northeastern University

Your Google searches and Twitter accounts alert marketers about what items you might like to purchase—could they also serve as an early warning system when COVID-19 levels are about to take off?

A team of scientists including Northeastern University machine learning expert Mauricio Santillana says internet users’ “digital traces” can be used to alert public health officials to sharp increases in COVID-19 at the county level one to six weeks ahead of a major outbreak.

In a paper published Wednesday, Jan. 18, in Science Advances, Santillana and other authors say digital data will help close information gaps left by existing surveillance methods.

Analysis of the data streams will allow policymakers to get a jump on decisions such as whether to reissue masking recommendations or bump up vaccination and boosting campaigns, says Santillana, director of the Machine Intelligence Group for the Betterment of Health and the Environment at the Network Science Institute at Northeastern.


“What we aspire to do is to use the same information that Google or Amazon or any of these big firms use to send ads to you” to inform public health decisions early on in an outbreak, Santillana says.

COVID-19-related digital streams include internet searches for fever, clinician searches for COVID-19 treatments and Twitter users’ comments about being too sick to work, among other things.

The researchers also used machine learning methods that took historical information from outbreaks in 97 U.S. counties between Jan. 1, 2020, and 2022 and combined that information into a single predictive indicator.
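To make the idea of a single indicator concrete, here is a minimal sketch in Python. It is not the method used in the paper: it simply standardizes a few weekly signals, averages them, and flags the first week the average crosses a threshold. The signal names, the weekly values and the threshold of 1.0 are all made up for illustration, and standardizing over the whole series is a simplification that a real-time system could not use, since it would only see past weeks.

```python
from statistics import mean, pstdev


def zscores(series):
    """Standardize a series to zero mean and unit variance."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A flat series carries no signal; avoid dividing by zero.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def combined_indicator(signals):
    """Average the standardized signals week by week."""
    standardized = [zscores(values) for values in signals.values()]
    return [mean(week) for week in zip(*standardized)]


def first_alert_week(indicator, threshold=1.0):
    """Return the index of the first week above the threshold, or None."""
    for week, value in enumerate(indicator):
        if value > threshold:
            return week
    return None


# Hypothetical weekly counts for one county (illustrative values only).
signals = {
    "fever_searches": [10, 11, 10, 12, 18, 30],
    "clinician_searches": [5, 5, 6, 6, 9, 15],
    "sick_tweets": [2, 2, 3, 2, 5, 9],
}

indicator = combined_indicator(signals)
alert_week = first_alert_week(indicator)
print(indicator)
print(alert_week)
```

With these made-up numbers, the combined value stays below the threshold until the final week (index 5), when all three signals rise together.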

“The goal is not necessarily to quantify how many infections there are but to quantify when sharp increases in infections will happen,” says Santillana, who participated in the research with scientists from Boston Children’s Hospital, Harvard Medical School, Oklahoma State University and other organizations.

Researchers found that the predictive capacity at the state and county levels was roughly similar: the early warning system gave one to six weeks of advance notice at the county level and four to six weeks at the state level.

The study says the digital data will help fill in vital missing information for the Centers for Disease Control and Prevention, which it says has failed to reliably forecast “rapid changes in the trends of reported cases and hospitalizations.”

“When CDC COVID-19 forecasts were shared with the public, they very frequently missed the timing of when outbreaks were starting,” Santillana says. He says by the time actual case numbers were tallied, surges were already well under way.

“The next chapter would be for the CDC to say, ‘We know that this is an alternative and complementary way to anticipate outbreaks. We will implement it in-house, and we will have it as an additional tool in our toolbox,’” Santillana says.

He says the study is part of a new initiative started by President Biden, the Center for Forecasting and Outbreak Analytics within the CDC.

“It is within that effort that we did the work in this paper,” published in an open access journal of the American Association for the Advancement of Science, Santillana says.

He says he and his team had already been working with the CDC for three to four years on predicting flu incidence and flu hospitalizations, but he wasn’t satisfied with what he considered the CDC’s inability to incorporate novel internet-based sources of information into its prediction systems.

“When COVID hit, they called and said, ‘We need all hands on deck. So please do what you can.’”

“I asked if they could be flexible, because my team and I were interested in innovating rather than just continuing to implement the exact same models,” Santillana says.

“The model is not perfect,” he says.

The counties studied were only a fraction of the 3,006 counties in the United States, according to the paper on using digital traces to build prospective and real-time county-level early warning systems.

“Our internet search-based methods may struggle to perform well in areas with poor literacy rates and limited access to internet resources,” the paper says.

The researchers say a possible solution for counties with poor internet access or literacy challenges may be to use state-level early warning systems to guide county-level decisions around outbreaks.

“When we navigate the internet on our computer or phone it leaves traces,” Santillana says.

“Whether we like it or not, the reality is that most companies use this information to increase their profit or their margins,” he says.

“Instead, we want to use that information to inform public officials when the next outbreak will happen.”

Cynthia McCormick Hibbert is a Northeastern Global News reporter. Email her at c.hibbert@northeastern.edu or contact her on Twitter @HibbertCynthia