This researcher put AI in the big game. It did not play well

Northeastern professor Lorenzo Torresani is interested in developing AI that can make sense of challenging group activities.

Researcher Lorenzo Torresani used sports footage to test advanced AI functions. AI models didn’t do so well. Photo by Focus on Sport/Getty Images

A researcher analyzing reams of data. A traveler translating a foreign language. A student writing an essay. There are many ways that artificial intelligence has been proven to help an individual in a challenging situation.

Northeastern University researcher Lorenzo Torresani wanted to test whether AI can help a group facing a challenge, and he found an interesting dataset with which to evaluate various popular AI models: sports footage.

The result? Let’s just say the AI models were no slam dunk.

“AI can really describe what people do and where they go when they perform an action – even on the pitch or on the court,” said Torresani, professor and President Joseph E. Aoun chair in the Khoury College of Computer Sciences at Northeastern. “It cannot tell you why things happen, and it cannot tell you what’s gonna happen next.”

An illustration shows four scenarios of robots, representing AI, assisting humans in surgical teams, autonomous vehicles, emergency responders and military units. — The research evaluates popular AI models’ advanced capabilities that may be useful for group challenges – from coordinating surgery to assessing an emergency. Courtesy photo.

Torresani is interested in developing AI that can make sense of challenging group activities where multiple people are working together in a dynamic environment and coordinating their efforts toward a goal. Often, those environments have many unknowns and preclude extensive verbal communication. For example, he wanted to know whether AI could help a surgery team in the operating room, a group of first responders on the scene of an emergency, or a military unit in action by proposing and then evaluating potential actions and predicting outcomes, among other things.

“So, [situations where] a lot needs to be inferred from the actions, rather than verbally,” Torresani said.

He found that AI models have not been really trained or tested in such scenarios because of a lack of data. Plus, he said, how do you objectively identify and test a concept like cause-and-effect?

“We humans have developed cause-and-effect reasoning over many, many years, since we were kids,” Torresani said. “But labeling something as a cause and as an effect, is very subjective and challenging in the real world.”

Smit Desai, an assistant professor of art and design and communication studies and director of the Conversational Human-AI Interactions (CHAI) Lab, which studies how people understand, trust, and collaborate with conversational AI, said finding datasets to test advanced AI concepts is “definitely a challenge.”

“Evaluating advanced AI capabilities like agency, planning, or causal reasoning is difficult because these behaviors are often highly context-dependent,” Desai said.

Desai explained that many datasets and evaluation tasks that we rely on today were designed for simpler, more narrowly defined tasks with a clearly defined correct answer – for example, a model might be asked to answer a factual question, solve a math problem, or identify an object in an image.

But tasks involving agency, planning, causal reasoning, or social intelligence are often more difficult to evaluate because there may not be a single correct answer, he said.

“Performance can depend heavily on context, long-term outcomes, or human judgment,” Desai said.

So, Torresani turned to an arena that’s rife with group work: Team sports.

Torresani and professor Gedas Bertasius at UNC Chapel Hill, built a dataset with 35,000 hours of videos of hockey, basketball and soccer games along with 23,000 game reports from journalists and 15,000 hours of expert commentary.

The dataset is publicly available as an open resource for the community.

“They provide lots of examples, one after another, of cause-and-effect,” Torresani said. “They also have verifiable outcomes.”
For example, Torresani explained that researchers can query AI models about certain plays or actions and then can check their answers with what actually happened – did AI models correctly predict that a team would lose possession of the ball or was the model wrong and the team maintained possession? Did a sequence of plays the model analyzed and based predictions on lead to scoring, or a failed shot, or a turnover? Did the model analyze all the data and correctly predict the winner?

Professor Lorenzo Torresani poses for a portrait in front of a deep blue-gray background. — Professor Lorenzo Torresani is interested in developing AI that can make sense of challenging group activities. Photo by Alyssa Stone/Northeastern University

The researchers put popular AI models, including GPT and Gemini to the test.

Illustration depicting five differently-colored droplets falling in five different shapes.

These researchers are demystifying oobleck, Dr. Seuss’s ‘green goo’

Jing-Ke Weng wearing safety glasses and gloves, adjusting equipment illuminated by blue light in a lab.

Meet the researcher who thinks plants could help save the world

Lola Satre Morales competes in the 800 meter race.

Track star sets Guatemalan running record, aims for Olympics

Two students in orange waders carry a wire lobster trap across a dock lined with barrels, coiled rope, and stacked traps, with a lobster boat moored alongside.

Researchers aim to keep New England seafood local from boat to plate

Lindsey Graham as seen from the side, in a dark room.

Doctor who studies sudden death weighs in on Lindsey Graham’s passing

The researchers evaluated the models on four different capabilities: perception, causal reasoning, counterfactual simulation and agency.

AI models did well on perception, correctly identifying who does what, when and where about 74% of the time, according to the research.

Causal reasoning (explaining what caused a certain event) and counterfactual simulation (if I did this instead, what would happen) was accurate about 40% to 50% of the time.

The success rate for agency, or asking AI to find evidence from clips to answer complex questions; however, was only 5%. For example, the models were asked to analyze video, play-by-play accounts and game stories by journalists up to specific points in the game – say just before the end of each quarter of a basketball game – and predict who would ultimately win. Only at the very end of the game were the predictions accurate.

In other words, AI is probably not coming for sportscasters anytime soon.

“A good sportscaster does much more than describe what’s on screen – they explain why a play worked, anticipate what’s next, and (perhaps most importantly) decide which moments matter,” Torresani said. “Our study shows AI is already reasonably good at the descriptive part, but collapses on the rest.”

This limitation also goes beyond sports, Torresani said.

“The same gap shows up in any job whose value lies not in describing what’s visible, but in understanding why events unfold, anticipating what comes next, deciding what matters, and recommending what to do about it,” Torresani said.

These researchers are demystifying oobleck, Dr. Seuss’s ‘green goo’

Meet the researcher who thinks plants could help save the world

What does the new British Prime Minister mean for Trump and Ukraine?

The numbers that show Spain’s dominance in the World Cup final

This researcher put AI in the big game. It did not play well

Editor’s Picks

These researchers are demystifying oobleck, Dr. Seuss’s ‘green goo’

Meet the researcher who thinks plants could help save the world

Track star sets Guatemalan running record, aims for Olympics

Researchers aim to keep New England seafood local from boat to plate

Doctor who studies sudden death weighs in on Lindsey Graham’s passing

University News

This project aims to rescue Oakland’s namesake tree species

Virtual games, but a real-life experience for these Explorers

Northeastern officials detail steps international students should take in light of DHS visa changes

Recent Stories

What does the new British Prime Minister mean for Trump and Ukraine?

The numbers that show Spain’s dominance in the World Cup final

One of the oldest psychedelics meets modern brain imaging

Northeastern Global News, in your inbox.

These researchers are demystifying oobleck, Dr. Seuss’s ‘green goo’

Meet the researcher who thinks plants could help save the world

Track star sets Guatemalan running record, aims for Olympics

Researchers aim to keep New England seafood local from boat to plate

Doctor who studies sudden death weighs in on Lindsey Graham’s passing

This project aims to rescue Oakland’s namesake tree species

Virtual games, but a real-life experience for these Explorers

Northeastern officials detail steps international students should take in light of DHS visa changes

What does the new British Prime Minister mean for Trump and Ukraine?

The numbers that show Spain’s dominance in the World Cup final

One of the oldest psychedelics meets modern brain imaging