Data mining in a complex world

Gold mining requires a certain amount of patience: For example, you would have to sift through about 300 tons of earth and rock to come up with enough of the precious metal to make a single wedding ring. Data mining is similar. Every day, terabytes of data accumulate in the technology that society has come to rely on. But turning that chaotic mess of zeros and ones into meaningful knowledge can be a complex mathematical challenge.

Typically, researchers try to simplify this challenge by limiting the scope of their questions. But Yizhou Sun, a newly appointed assistant professor in the College of Computer and Information Science, believes that making useful predictions and inferences with new data requires us to account for its complexity.

“My philosophy is that in the real world, objects are connected together, but those objects belong to different types,” she said, pointing to humans, buildings, and digital devices as examples. “Even with humans we can still identify different groups.”

Instead of looking at two-dimensional relationships in an isolated system, her approach brings together a series of complex algorithms that simultaneously address objects from multiple domains and their interactions in a much bigger, real-world environment. She has used the method to probe social networks like Flickr and Twitter for similarities and patterns.

As a graduate student at the University of Illinois at Urbana-Champaign, Sun took on the task of mining the Digital Bibliography & Library Project’s (DBLP) dataset of computer science publications. She hoped to unearth some interesting and unexpected patterns, and she did.

She found that a researcher’s social connectedness was the most important factor in determining whom he or she would collaborate with in the future. She also found, thankfully, that social connections did not figure very highly in a researcher’s citations.

But perhaps most important, Sun found that her questions were always more complicated than she had expected. For instance, automatically identifying the most highly ranked authors in the DBLP dataset might require examining the ranking of the conferences they attended. But that requires automatically identifying conference ranking, which depends on the ranking of the authors in attendance.
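One common way to break a circular dependency like this is to start every author and conference with the same score and refine both sets of scores in alternating passes until they settle. The Python sketch below illustrates that general idea on invented data. It is not Sun's published method: the function name `mutual_rank`, the toy author and conference names, and the simple sum-and-normalize update rule are all assumptions made for illustration.

```python
def mutual_rank(papers, iterations=50):
    """Rank authors and conferences by mutual reinforcement.

    papers: list of (author, conference) pairs, one pair per publication.
    Returns two dicts of scores, each normalized to sum to 1.
    This is an illustrative sketch, not the algorithm from Sun's work.
    """
    authors = sorted({a for a, _ in papers})
    confs = sorted({c for _, c in papers})

    # Start with no opinion: every author gets the same score.
    author_score = {a: 1.0 / len(authors) for a in authors}
    conf_score = {c: 1.0 / len(confs) for c in confs}

    for _ in range(iterations):
        # A conference's score is the summed score of the authors
        # who published there, counted once per paper.
        new_conf = {c: 0.0 for c in confs}
        for a, c in papers:
            new_conf[c] += author_score[a]
        total = sum(new_conf.values())
        conf_score = {c: s / total for c, s in new_conf.items()}

        # An author's score is the summed score of the conferences
        # where that author published, counted once per paper.
        new_author = {a: 0.0 for a in authors}
        for a, c in papers:
            new_author[a] += conf_score[c]
        total = sum(new_author.values())
        author_score = {a: s / total for a, s in new_author.items()}

    return author_score, conf_score


# Invented toy data: each pair is one paper by an author at a conference.
papers = [
    ("alice", "ConfA"), ("alice", "ConfA"), ("alice", "ConfB"),
    ("bob", "ConfA"), ("bob", "ConfB"),
    ("carol", "ConfC"),
]

author_score, conf_score = mutual_rank(papers)
for name, score in sorted(author_score.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

In this toy example the author who publishes only at an otherwise unconnected conference sinks toward the bottom of the ranking, because neither her score nor the conference's is reinforced by the rest of the network. A simple scheme of this kind is sensitive to how the network is connected.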

The problem was that the data in question make up a complex, heterogeneous network wherein each piece affects every other. If Sun wanted to trust the products of her algorithm, she was going to have to understand the network it acted upon.

Sun made it her life’s work to understand and then design strategies for examining heterogeneous networks. Last year, she published the seminal book on the matter, Mining Heterogeneous Information Networks: Principles and Methodologies.

The implications of Sun’s work are vast. In order to take advantage of the terabytes of data now describing our world, we must understand the complex networks of which they are a part. “In the real world, there are so many different types of objects that interact with each other,” said Sun. “The real world system can be viewed as a gigantic heterogeneous information network.”