Students on Northeastern’s Vancouver campus found a way to make storing tensor data, used in machine learning models, more cost- and energy-efficient. It landed them a spot at the 2024 Data + AI Summit.
In 2024, data is the name of the game. What data a company collects and how that data gets analyzed is key to the business model of many modern companies, from Spotify to Meta. But there’s an even more pressing question for many of these companies: Where do you store all this data?
With more data in existence than ever before, data storage is a make-or-break issue. If mishandled, it can drive up financial and energy costs and negatively impact users.
Enter a trio of computer science graduate students from Northeastern University’s Vancouver campus. For their capstone project, Liam Bao, Liao-Liao Liu and Zhiyu Wu found a way to make storing a specific yet precious kind of data, the tensor data used in machine learning models like ChatGPT, cost less for companies, consumers and the planet.
“Traditionally, people are not using cloud technology,” Bao says. “They are using a premade database, so all of their data is stored on a physical level server in their own company. As the technology grows into the cloud native era, we can use the cloud service. We have a very elastic power of storage service and computer service. You can think that in the cloud, it’s unlimited.”
“We bring the tensor into a database instead of just purely on disk as a binary format,” Bao continues. “After we bring it into the database, we can leverage many different kinds of storage optimization techniques to save space. That’s the one big difference. After that, we can also increase the efficiency of reading and writing the machine learning model.”
To train effectively, a machine learning model or large language model requires a massive amount of data. As a result, the kinds of databases that organizations use to store and distribute data today can quickly become resource drains, Liu says.
“That data can scale to a very scary level, like a petabyte, even exabyte,” Liu says. “Even one single point of storage efficiency will bring a very big difference in the energy cost.”
According to the team’s paper, this storage solution reduces the amount of data that needs to be transferred over a network by 90% and improves energy efficiency by 10%. Together, those gains could also reduce the dollars spent.
“When you store all the data in the cloud, storing more data size means you have to pay more, so it’s also cost effective,” Wu says.
Outside of the potential improvements for companies, Liu says using their storage solution could be a boon for developers, too.
For example, someone training an object detection model relies on massive image datasets. To make those images usable for the model, developers typically have to use platforms that download entire collections of image files and then transform them into tensors. The team’s method cuts out that step entirely, speeding things up in the process.
“Any time the user wants to use the whole tensor or a part of a tensor, using our project they can say, ‘Hey, this portion of the data, can you help me grab it?’ and it will give it to you,” Liu says. “There’s no actual computation doing the transformation from the JPEG files into the tensors.”
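The idea Liu describes, asking for only one portion of a tensor, can be illustrated with a minimal sketch. This is not the team’s implementation, and the article does not say which database or format they use. The sketch assumes SQLite as a stand-in database, a hypothetical `chunks` table, and two hypothetical helper functions, `store_tensor` and `read_rows`; the chunk size is arbitrary. It shows the general technique of storing a tensor as row chunks so that a partial read touches only the chunks it needs.

```python
import sqlite3
import struct

CHUNK_ROWS = 2  # rows per stored chunk; an arbitrary choice for this sketch


def store_tensor(conn, name, rows):
    """Split a 2-D tensor (a list of equal-length float rows) into row
    chunks and store each chunk as one BLOB in a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "name TEXT, chunk_id INTEGER, n_rows INTEGER, n_cols INTEGER, "
        "data BLOB, PRIMARY KEY (name, chunk_id))"
    )
    n_cols = len(rows[0])
    for chunk_id, start in enumerate(range(0, len(rows), CHUNK_ROWS)):
        block = rows[start:start + CHUNK_ROWS]
        flat = [value for row in block for value in row]
        # Pack the chunk as little-endian 64-bit floats.
        blob = struct.pack(f"<{len(flat)}d", *flat)
        conn.execute(
            "INSERT INTO chunks VALUES (?, ?, ?, ?, ?)",
            (name, chunk_id, len(block), n_cols, blob),
        )
    conn.commit()


def read_rows(conn, name, start, stop):
    """Return rows [start, stop) of the tensor, fetching only the
    chunks that overlap the requested range."""
    if stop <= start:
        return []
    first_chunk = start // CHUNK_ROWS
    last_chunk = (stop - 1) // CHUNK_ROWS
    cursor = conn.execute(
        "SELECT chunk_id, n_rows, n_cols, data FROM chunks "
        "WHERE name = ? AND chunk_id BETWEEN ? AND ? ORDER BY chunk_id",
        (name, first_chunk, last_chunk),
    )
    result = []
    for chunk_id, n_rows, n_cols, blob in cursor:
        flat = struct.unpack(f"<{n_rows * n_cols}d", blob)
        base = chunk_id * CHUNK_ROWS  # index of this chunk's first row
        for i in range(n_rows):
            if start <= base + i < stop:
                result.append(list(flat[i * n_cols:(i + 1) * n_cols]))
    return result
```

In this sketch, a request for rows 1 through 3 of a five-row tensor reads two of the three stored chunks and never touches the last one. A production system would add compression, multi-dimensional chunking and cloud storage, none of which is shown here.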
With data storage on the minds of so many in the tech world, it took little time for Bao, Liu and Wu’s work to catch the attention of major industry players. After posting their work online, they were invited by Databricks, a global data and AI company, to present at its annual Data + AI Summit alongside the likes of Google, Apple and Nvidia.
For a team of students who had poured everything they had into their research, not expecting to be recognized by the larger tech community, it was a welcome surprise.
“Nothing feels better than your research getting recognized,” Liu says. “That moment is huge.”