In 2026, the hype around artificially intelligent agents is greater than ever. These semi-autonomous programs can “think” and perform well-defined tasks, typically using language models (LMs), in areas such as customer service and software development. However, fields such as medical diagnostics and scientific discovery require exploring a wide range of solutions in uncertain environments, which LMs struggle with.
Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard School of Engineering and Applied Sciences (SEAS) have taken a closer look at LMs to understand where they fall short in high-stakes situations. Their testbed: “Battleship,” a classic guessing game that helps cognitive scientists study how humans seek information.
The CSAIL and SEAS academics put a twist on it by reimagining the game around natural language questions and answers. In the “Cooperative Battleship” game, one participant acts as the “captain,” who asks about the whereabouts of a hidden ship, and a teammate acts as the “spotter,” who answers those questions in real time.
The researchers first constructed the BattleshipQA dataset by asking over 40 people to play the game together and collecting their questions and yes/no answers. These results helped the team compare cutting-edge LMs (such as GPT-5) and smaller models (such as Llama 4 Scout) when testing them in-game. They found that, even without further training, top LMs can “beat” humans at “Battleship,” i.e., complete the game in fewer turns, but smaller systems play far less rationally.
The main problem was that many models were not good at coming up with useful questions. To push the LMs to ask questions that revealed more information about the hidden ship, the researchers equipped each model with a Monte Carlo inference strategy that carefully weighs the likelihood that different options are correct after each response. The result: AI models that can beat typical human players at “Battleship,” regardless of model size.
Perhaps the most notable result was the improvement in Llama 4 Scout. As a relatively small LM, it initially beat humans only 8% of the time. With the improved inference strategy, however, the model achieved an 82% “Battleship” win rate against humans. This careful and efficient questioning style also allowed the model to outperform the frontier model GPT-5 while operating at approximately 1% of the cost.
In addition to this improvement, the researchers narrowed the gap between humans and LMs when answering questions. GPT-5 was a reliable spotter that helped models complete the game faster, but the smaller systems had a bad habit of giving incorrect answers about where ships were hidden. When the researchers had models translate the questions into code that spelled out explicitly how to verify the answers, the models’ accuracy improved by an average of 15% (for example, by having the model perform a simple search of the area when asked if a ship was there).
“Today’s language models are primarily optimized for answering complex queries, but it’s less clear that language models learn how to ask good questions themselves,” says MIT doctoral student and CSAIL researcher Gabriel Grand SM ’23, lead author of a paper on the study. “Our research shows that asking useful questions relies on the ability to predict and simulate the world. We found that when we give agents access to a ‘world model,’ they ask better questions and make discoveries more efficiently.”
Big changes for LMs
The team’s initial focus was to get LMs to ask better questions. With a Monte Carlo inference strategy, the LM represents its candidate guesses about the hidden board as individual particles. Particles that appear more valid given each answer from the spotter are given more weight, so the pool of plausible guesses expands and contracts with each turn. This more calculated and adaptive approach allows the captain to make inquiries that extract significantly more information from the spotter.
To aid AI spotters, the scientists turned to Python, a widely used programming language. Each question the captain asked was automatically translated into code. For example, a question like “Is there a ship in column 1 that spans two rows?” translates into instructions for the spotter LM to explore the area in question and assess the size of the digital game piece. With clear instructions in a language the models understood particularly well, each system began to return the correct answer far more often. For example, the lightweight system GPT-4o-mini improved performance by nearly 30%, and even the larger model Claude 4 Opus increased by about 8 points.
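To make the question-to-code idea concrete, here is a hedged sketch of what a translated question might look like. The board encoding (0 for water, positive integers for ship IDs), the 0-indexed columns, the reading of “spans two rows” as “at least two rows,” and the function name are all assumptions for illustration; the article does not show the code the models actually generated.

```python
from collections import defaultdict

# Hypothetical board encoding: 0 = water, positive integers = ship IDs.
BOARD = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
]


def ship_spans_rows_in_column(board, column, n_rows=2):
    """Answer: "Is there a ship in `column` that spans at least `n_rows` rows?"

    Columns are 0-indexed here, which is an assumption of this sketch.
    """
    rows_by_ship = defaultdict(set)
    for row_index, row in enumerate(board):
        ship_id = row[column]
        if ship_id != 0:
            rows_by_ship[ship_id].add(row_index)
    return any(len(rows) >= n_rows for rows in rows_by_ship.values())


# The spotter runs the generated check instead of answering from memory.
answer = ship_spans_rows_in_column(BOARD, column=1)
print("yes" if answer else "no")
```

Running the check against the board, rather than having the model reason about the grid in free text, turns the answer into a deterministic lookup.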
“The field has had a lot of success with ‘auto-formalization’ strategies, where LMs generate code and verify solutions,” says senior author Jacob Andreas, MIT associate professor of electrical engineering and computer science and CSAIL principal investigator. “What I find most exciting about this work is that by increasing the exploration and information gathering capabilities of LMs, we open up the possibility of using these techniques to create better solutions in the first place. We are excited to be able to scale this research from the scientific domain to applications such as coding and mathematical problem solving.”
Let’s play something else
But how would this approach work in other board games? The team tested the newly equipped LMs on “Guess Who?,” where large and small models narrowed down 100 choices to guess which hidden character had been chosen. Llama 4 Scout initially succeeded 30% of the time, but with some fine-tuning by Grand and his colleagues, it completed the task on more than 72% of runs. Meanwhile, GPT-4o jumped from 62 percent to 90 percent. GPT-5 served as the spotter for each game to ensure questions were answered as accurately as possible.
The LMs made encouraging progress in both games, but there is room for improvement. For example, compared to humans, models still struggle to answer complex questions. Co-author Valerio Pepe, an OpenAI researcher and recent Harvard University graduate, adds, “GPT-5 can beat the average ‘Battleship’ player, much better than our method. But unlike chess, where even top players can’t win against AI systems, skilled players still have a hard time beating any model.”
The researchers’ findings show that AI agents have untapped potential for “needle in a haystack” discovery: navigating a vast space of options to find rare solutions to scientific challenges. The researchers note that these improved information-seeking skills could make LMs excellent research assistants for tasks such as identifying the molecular structure of compounds, but they caution that “Cooperative Battleship” is a fairly simple testbed. They want to test LMs in more complex settings where the system has to consider far more options.
Grand also plans to study whether humans and AI models can work together more effectively. The models could also benefit from small tweaks to the game simulation, and with more computing power, the LMs could use more advanced inference to predict how the game will evolve.
“As AI systems become more agentic, the most difficult problems turn out to be social ones: tracking commonalities, resolving misunderstandings, and adapting to different partners over time,” said Robert Hawkins, an assistant professor of linguistics at Stanford University who was not involved in the paper. “This work elegantly captures these phenomena in a controlled, collaborative environment, and makes a convincing case that the real bottleneck for AI agents is not simply calculating the best questions, but the practical reasoning required to make the most of the answers.”
Grand and Pepe co-authored the paper with CSAIL principal investigators Jacob Andreas, an MIT associate professor, and Joshua Tenenbaum, an MIT professor. Their research was supported in part by the MIT Siegel Family Quest for Intelligence, the MIT-IBM Watson AI Lab, the FinTechAI@CSAIL initiative, the Sloan Research Fellowship, Intel, the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the National Science Foundation. They presented the paper as an oral presentation at the International Conference on Learning Representations (ICLR) held in April.
