
New Northeastern research tests the capabilities of AI chatbots on NPR’s Sunday Puzzles

New Northeastern research reveals how capable AI chatbots are at solving NPR’s Sunday Puzzle. Photo illustration by Matthew Modoono/Northeastern University

Listeners of NPR’s Sunday Puzzle are well aware of just how challenging the weekly quiz show can be, requiring participants to have a strong grasp of popular culture and the English language.

While the puzzles may not be the easiest to solve, they aren’t impossible. With some thinking and trial and error, everyday people answer them correctly every week. 

That’s what made them the perfect data source for a new benchmark researchers have developed to test the capabilities of the latest artificial intelligence reasoning models coming out of OpenAI, Google, Anthropic and DeepSeek. 

Arjun Guha, associate professor of computer science at Northeastern University, is one of the co-authors of the benchmark study. Photo by Matthew Modoono/Northeastern University

It’s common practice for AI researchers to develop specific benchmarks to measure progress and the capabilities of AI technologies, explains Arjun Guha, a Northeastern University professor in the Khoury College of Computer Sciences and one of the authors of the research. 

The issue, however, is that the models have become so advanced that the tasks used to test them have become harder to design and to measure. 

“You have questions that are very narrowly designed by Ph.D. students and are only answerable by people with Ph.D.s in a narrow field of expertise,” he says. 

The questions asked on NPR’s Sunday Puzzle, on the other hand, while difficult, can be easily understood by nonexperts. 

“You can really look at them as a test of verbal reasoning skills and general knowledge,” says Guha. “There’s a lot of ‘find a five-letter word with the following letter-by-letter properties, and it’s the name of some obscure city or some movie from the ’80s’ or something.” 
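To make that concrete, here is a minimal sketch of how a clue of that kind can be checked mechanically. The word list and the letter-by-letter constraints below are invented for illustration; they are not taken from the show or from the study.

```python
# Hypothetical example of a Sunday Puzzle-style constraint:
# "find a five-letter word whose first letter is 'c' and whose third letter is 'a'".
# The candidate word list and the constraints are made up for illustration.

def satisfies(word: str, length: int, fixed_letters: dict[int, str]) -> bool:
    """Return True if `word` has the required length and the required
    letters at the required (0-indexed) positions."""
    if len(word) != length:
        return False
    return all(word[i] == letter for i, letter in fixed_letters.items())

candidates = ["camel", "cocoa", "koala", "crane"]   # stand-in word list
constraints = {0: "c", 2: "a"}                      # letter-by-letter properties

answers = [w for w in candidates if satisfies(w, 5, constraints)]
print(answers)  # ['crane']
```

The point of the sketch is that verifying a candidate answer is purely mechanical, which is part of what makes these puzzles attractive as benchmark material.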

For the study, researchers tested out a new crop of reasoning models released by OpenAI, Google, Anthropic and DeepSeek in the past few months. What sets reasoning models apart is that they are trained with reinforcement learning techniques and “show their work,” meaning they explain step by step how they come up with their answers. 

None of them did particularly well in answering the questions, with OpenAI’s o1 model having the highest accuracy rate of 57%. 
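The article does not spell out the grading procedure, but an accuracy figure like 57% typically reflects exact-match scoring over the puzzle set. The sketch below assumes that setup; the normalization and the toy answers are illustrative, not the study’s actual pipeline.

```python
# A minimal sketch of how a benchmark accuracy figure could be computed:
# each puzzle has a known correct answer, and a model's final answer either
# matches it or it doesn't. Grading details here are assumptions.

def grade(predictions: list[str], answers: list[str]) -> float:
    """Fraction of puzzles whose predicted answer matches the reference answer."""
    assert len(predictions) == len(answers)
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

# Toy example: 2 of 4 answers correct -> 0.5 accuracy.
print(grade(["crane", "paris", "ohio", "lentil"],
            ["crane", "berlin", "ohio", "lentils"]))
```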

What the researchers observed is that models tended to fail in two particular ways.

Sometimes, the model would give “out-of-thin-air” final answers not based on anything outlined in its reasoning paper trail. Other times, the models would deliberately violate the constraints of the puzzle to answer the question. 
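As an illustration of those two failure modes, the sketch below flags a final answer that never appears in the model’s reasoning trace, and one that breaks an explicit puzzle constraint. The trace, the constraint and the answers are invented; the researchers’ own analysis may classify failures differently.

```python
# Illustrative only: one way to flag the two failure modes described above.

def out_of_thin_air(final_answer: str, reasoning_trace: str) -> bool:
    """Flag answers that never appear anywhere in the model's step-by-step reasoning."""
    return final_answer.lower() not in reasoning_trace.lower()

def violates_length_constraint(final_answer: str, required_length: int) -> bool:
    """Flag answers that break an explicit puzzle constraint (here, word length)."""
    return len(final_answer) != required_length

trace = "Trying CRANE... no. Maybe SLATE? That fits the clue."
print(out_of_thin_air("plume", trace))          # True: 'plume' was never considered
print(violates_length_constraint("plume", 6))   # True: the puzzle asked for six letters
```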

Interestingly, the models would also often express their frustration in trying to solve the challenges. 

“We saw a lot of interesting cases of failure where the models would give up or get stuck,” says Guha. “In a sense, what was most interesting is not the problem that it gets right. It’s the problem that it gets wrong.” 

Given the nature of the puzzles, it was easy for researchers to track how the models reasoned through their answers, making the benchmark a tool that even a layperson can understand. 

“I think what’s valuable about our benchmark is that it makes accessible to the average English-speaking American what these models can and cannot do to some extent,” he says.