Scientists trying to solve today’s most pressing problems often face enormous hurdles in gathering the data they need to begin their research.
Enter Ramkumar Hariharan, a data scientist and computational biologist at Northeastern University in Seattle. A scientist and an engineer, Hariharan centers his current research on an emerging scientific field called geroscience, or the “study of aging as it relates to age-related diseases.” Hariharan has been trying to understand why some cancer patients respond better than others to certain kinds of immunotherapies.
Doing so requires lots of information about the patients themselves, the specific forms of cancer and the drugs used in treatment. Naturally, that is a lot of data to process, and it comes from a variety of sources. All of that information requires sorting (or cleansing), scraping (exporting data from one source, or program, into another) and “deriving” (combining or processing raw data into new information).
“The first part is creating artificial intelligence systems and pipelines,” Hariharan says. “And why are we doing it? We want to solve scientific problems.”
Hariharan and a team of Northeastern researchers received a grant to build an “end-to-end autoML pipeline” to help predict patients’ response to cancer immunotherapies. The automated machine learning (autoML) model uses so-called “deep learning,” a form of artificial intelligence modeled on human decision-making, to help researchers sift through massive amounts of raw data.
Specifically, the researchers are looking to see if they can prospectively identify patients who would best benefit from these different treatments and, in doing so, isolate the individual factors that make patients more or less responsive to them. Those could be factors such as a patient’s age, physical attributes, and overall health, among others.
The goal is to search for patterns in the available data (meaning, data accessible through published literature and other public databases) that help researchers construct a clinical picture of how patients might fare in treatment.
To be as precise as possible, researchers need more than simply a patient’s age, sex and health; they need other, more specific data points, such as the cellular composition of the cancerous tumors, and molecular measurements that provide insight into gene activity or expression.
One problem for researchers looking to scrape this specialized data is that a lot of it is so-called domain-specific knowledge, meaning it’s overseen by experts (here, medical and healthcare professionals) and locked away in disparate, poorly organized databases. Another challenge is the extensive hand-coding required to precisely calibrate many existing machine learning models.
Here’s where autoML comes in. As opposed to traditional machine learning models that require trained experts to manually tinker with the settings of an algorithm, autoML is an approach in which the system is constructed to learn how to optimize its dozens of “hyperparameters and control knobs” all on its own, Hariharan says.
“The autoML pipeline takes care of two things: one, you are much less reliant on domain experts, and two, your machine learning workflow is greatly accelerated,” he says. “You don’t need to create additional derived data and add that to the existing data because it can identify new, relevant derived data on its own.”
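To make the self-tuning idea concrete, here is a minimal sketch of one common technique, random search over hyperparameters. It is not the team’s pipeline, which the article does not describe in detail. The data are synthetic, and every name in it (`make_data`, `knn_predict`, `random_search`, the settings `k` and `p`) is hypothetical, chosen only for illustration. The program tries several settings of a simple nearest-neighbor classifier and keeps whichever scores best on held-out records, with no one adjusting the knobs by hand.

```python
import random


def make_data(n, rng):
    """Build synthetic two-feature records with a binary label (made up, not patient data)."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(0, 1), rng.uniform(0, 1)
        label = 1 if x1 + 0.5 * x2 + rng.gauss(0, 0.1) > 0.75 else 0
        rows.append(((x1, x2), label))
    return rows


def knn_predict(train, point, k, p):
    """Classify `point` by majority vote among its k nearest training rows."""
    def dist(a, b):
        # Minkowski distance of order p, without the final root
        # (the root does not change which rows are nearest).
        return sum(abs(u - v) ** p for u, v in zip(a, b))

    nearest = sorted(train, key=lambda row: dist(row[0], point))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, valid, k, p):
    """Fraction of validation rows the classifier labels correctly."""
    hits = sum(knn_predict(train, x, k, p) == y for x, y in valid)
    return hits / len(valid)


def random_search(train, valid, n_trials, rng):
    """Sample hyperparameter settings at random and keep the best-scoring one."""
    trials = []
    for _ in range(n_trials):
        config = {"k": rng.choice([1, 3, 5, 7, 9, 15]), "p": rng.choice([1, 2, 3])}
        trials.append((accuracy(train, valid, **config), config))
    best_score, best_config = max(trials, key=lambda t: t[0])
    return best_config, best_score, trials


rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_data(300, rng)
train, valid = data[:200], data[200:]
best_config, best_score, trials = random_search(train, valid, n_trials=10, rng=rng)
print("best settings:", best_config, "validation accuracy:", round(best_score, 3))
```

Real autoML systems generally search far larger spaces with more sophisticated strategies, but the loop is the same in spirit: propose settings, score them, keep the best.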
Hariharan’s team recently finished building the autoML pipeline and is now refining the system and measuring its performance against classical, hands-on models. The $50,000 in funding for the project comes from Northeastern’s Institute for Experiential AI. Rohit Gandikota, Alekya Kasturi, Shreyangi Prasad and Ayesha Mathur, all Northeastern-based, contributed research.
Hariharan says the complicated data project was spurred along by developments in geroscience, and a broader shift in the way scientists understand aging. As you age, your bodily functions begin to slow down. “Things start to fall apart,” Hariharan says. This in turn predisposes a person to a host of age-related diseases.
“Your probability of getting cancer significantly increases,” Hariharan says. “Yes, young people do get cancer, but they’re more like the outliers. And age isn’t the only factor that causes cancer, or Alzheimer’s disease, or cardiovascular disease.”
It also depends on your genetic inheritance, he says, and the “epigenetic marks” that “sit atop your DNA.” These marks are chemical modifications to the DNA letters, Hariharan says, that can offer clues about how we age. Diet and lifestyle, long thought to influence how quickly we age, can also impact the formation of these marks.
“There are so many ways to measure your biological age,” he says. “Looking at the epigenomic pattern is one way of doing it.”
Other so-called biomarkers of aging vary and can include, for example, how fast a person walks, their grip strength, and blood measurements, such as how they respond to glucose. As scientists’ understanding of the mechanisms of aging evolves, more potential data points emerge as variables and determinants of health, Hariharan says.
Machine learning, he says, will be the key to unlocking that data.
“We want to build AI-powered computational tools to come up with more reproducible ways of measuring biological aging,” Hariharan says. “We haven’t started that research yet, but we’re going to launch it soon.”