Students on Northeastern’s Vancouver campus found a way to make storing tensor data, used in machine learning models, more cost and energy efficient. It landed them a spot at the 2024 Data + AI Summit.
In 2024, data is the name of the game. What data a company collects and how that data gets analyzed is key to the business model of many modern companies, from Spotify to Meta. But there’s an even more pressing question for many of these companies: Where do you store all this data?
Especially at a time when there has never been more data, data storage is a make-or-break issue. Mishandled, it can drive up both financial and energy costs and degrade the experience for users.
Enter a trio of computer science graduate students from Northeastern University’s Vancouver campus. For their capstone project, Liam Bao, Liao-Liao Liu and Zhiyu Wu found a way to make storing a specific, precious kind of data, the tensor data used in machine learning models like ChatGPT, cost less for companies, consumers and the planet.
“Traditionally, people are not using cloud technology,” Bao says. “They are using a premade database, so all of their data is stored on physical servers inside their own company. As the technology grows into the cloud-native era, we can use cloud services. We have very elastic storage and compute services. You can think of the cloud as unlimited.”
“We bring the tensor into a database instead of storing it purely on disk in a binary format,” Bao continues. “After we bring it into the database, we can leverage many different kinds of storage optimization techniques to save space. That’s the one big difference. After that, we can also increase the efficiency of reading and writing the machine learning model.”
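The article doesn’t reproduce the team’s implementation, but a minimal sketch of the general idea, storing tensors as compressed rows in a database rather than as raw binary files on disk, might look like this in Python. The SQLite backend, table schema and zlib compression here are illustrative assumptions, not the students’ actual design.

```python
import sqlite3
import zlib

import numpy as np

# Illustrative sketch only: persist a tensor as a compressed row in a
# relational database instead of dumping its raw bytes to disk.
# Schema and compression choice are assumptions for demonstration.

def store_tensor(conn, name, tensor):
    # One simple storage optimization: compress the payload before writing.
    blob = zlib.compress(tensor.tobytes())
    conn.execute(
        "INSERT OR REPLACE INTO tensors (name, dtype, shape, data) VALUES (?, ?, ?, ?)",
        (name, str(tensor.dtype), ",".join(map(str, tensor.shape)), blob),
    )
    conn.commit()

def load_tensor(conn, name):
    dtype, shape, blob = conn.execute(
        "SELECT dtype, shape, data FROM tensors WHERE name = ?", (name,)
    ).fetchone()
    dims = tuple(int(s) for s in shape.split(","))
    return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(dims)

conn = sqlite3.connect("tensors.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tensors "
    "(name TEXT PRIMARY KEY, dtype TEXT, shape TEXT, data BLOB)"
)

weights = np.zeros((1024, 1024), dtype=np.float32)  # toy, highly compressible
store_tensor(conn, "layer1.weight", weights)
assert np.array_equal(load_tensor(conn, "layer1.weight"), weights)
```

Once the tensor lives inside a database, the engine’s own machinery (compression, deduplication, indexed reads) can be layered on top, which is the kind of optimization Bao describes.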
To train effectively, a machine learning model or large language model requires a massive amount of data. As a result, the kinds of databases that organizations use to store and distribute that data can quickly become resource drains, Liu says.
“That data can scale to a very scary level, like a petabyte, even an exabyte,” Liu says. “Even a single point of storage efficiency will make a very big difference in energy cost.”
According to the team’s paper, the storage solution reduces the amount of data that needs to be transferred over the network by 90% and improves energy efficiency by 10%. Together, those gains could translate into a reduction in dollars spent as well.
“When you store all the data in the cloud, storing more data means you have to pay more, so it’s also cost-effective,” Wu says.
Outside of the potential improvements for companies, Liu says using their storage solution could be a boon for developers, too.
For example, someone trying to train an object detection model relies on massive image datasets. To make those images usable for the model, developers typically have to download entire collections of image files and transform them into tensors. The students’ method cuts out that step entirely, speeding things up in the process.
“Any time the user wants to use the whole tensor or a part of a tensor, using our project they can say, ‘Hey, this portion of the data, can you help me grab it?’ and it will give it to you,” Liu says. “There’s no actual computation doing the transformation from the JPEG files into the tensors.”
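The project’s actual interface isn’t shown here, but a rough sketch of what slice-level reads look like, assuming tensors are stored as fixed-size row chunks so that a request touches only the chunks overlapping it, could read as follows. The chunk layout and the dict-backed store are hypothetical stand-ins for the real system.

```python
import numpy as np

# Hypothetical sketch: a tensor stored as fixed-size row chunks, so a
# request for rows [lo, hi) reads only the chunks overlapping that range.
# The dict-backed "store" stands in for a real database or object store.

CHUNK_ROWS = 256

def write_chunks(store, name, tensor):
    for start in range(0, tensor.shape[0], CHUNK_ROWS):
        store[(name, start)] = tensor[start:start + CHUNK_ROWS].copy()

def read_rows(store, name, lo, hi, dtype, ncols):
    out = np.empty((hi - lo, ncols), dtype=dtype)
    for start in range(lo - lo % CHUNK_ROWS, hi, CHUNK_ROWS):
        chunk = store[(name, start)]
        a, b = max(lo, start), min(hi, start + chunk.shape[0])
        out[a - lo:b - lo] = chunk[a - start:b - start]
    return out

store = {}
images = np.random.rand(2_000, 3 * 32 * 32).astype(np.float32)  # toy image tensor
write_chunks(store, "train.images", images)

# Fetch only rows 500-799: three chunks are read, not the whole dataset.
window = read_rows(store, "train.images", 500, 800, np.float32, 3 * 32 * 32)
assert np.array_equal(window, images[500:800])
```

Because the stored data is already in tensor form, there is no JPEG-decoding step on the read path, which is exactly the shortcut Liu describes.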
With data storage on the minds of so many in the tech world, it took little time for Bao, Liu and Wu’s work to catch the attention of major industry players. After posting their work online, they were invited by Databricks, a global data and AI company, to present at its annual Data + AI Summit alongside the likes of Google, Apple and Nvidia.
For a team of students who had poured everything they had into their research, never expecting to be recognized by the larger tech community, it was a welcome surprise.
“Nothing feels better than your research getting recognized,” Liu says. “That moment is huge.”