MIT researchers use “Battleship” to enhance AI question-asking skills

By 2026, the buzz surrounding artificial intelligence agents has reached new heights. These semi-autonomous systems, which are built on language models (LMs), are effective in tasks such as customer service and software development. However, they struggle in areas like medical diagnosis and scientific research, where they must navigate complex environments and search among many possible solutions.

Researchers from MIT’s CSAIL and Harvard’s SEAS explored these challenges by examining LMs in high-stakes scenarios. They used the classic game “Battleship” to study information-seeking behaviors, modifying it to focus on natural language questions. In “Collaborative Battleship,” a captain asks about ship locations, while a spotter responds in real-time.

The team first had over 40 people play the game, creating the “BattleshipQA” dataset from their interactions. They then tested advanced models like GPT-5 and smaller ones like Llama 4 Scout. Without prior training, top LMs finished the game faster than humans, but smaller models struggled to ask rational questions.

The main problem was models’ difficulty in asking useful questions. To improve this, the researchers equipped each model with a Monte Carlo inference strategy, enhancing their ability to ask questions that reveal more about the hidden ships. This approach allowed AI models to outperform humans in the game.
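To make the idea of a question that "reveals more" concrete, here is a minimal Python sketch of Monte Carlo question selection. It is an illustration under stated assumptions, not the researchers' implementation: the 4x4 grid, the single three-cell ship, the three candidate questions, and all function names are hypothetical. It samples many possible boards, estimates how often each yes/no question would be answered "yes," and scores the question by the entropy of that answer, which for a noiseless yes/no answer equals the expected information gained about the board.

```python
import math
import random

GRID = 4  # hypothetical board size, far smaller than a real game


def sample_board(rng, ship_len=3):
    """Place one ship uniformly at random; return its occupied cells."""
    if rng.random() < 0.5:  # horizontal placement
        r = rng.randrange(GRID)
        c = rng.randrange(GRID - ship_len + 1)
        return frozenset((r, c + i) for i in range(ship_len))
    r = rng.randrange(GRID - ship_len + 1)  # vertical placement
    c = rng.randrange(GRID)
    return frozenset((r + i, c) for i in range(ship_len))


def entropy_bits(p):
    """Entropy of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_info_gain(question, boards):
    """Monte Carlo estimate: fraction of sampled boards answering 'yes',
    scored by the entropy of that answer distribution."""
    p_yes = sum(1 for b in boards if question(b)) / len(boards)
    return entropy_bits(p_yes)


def pick_question(questions, boards):
    """Return the name of the question with the highest estimated gain."""
    return max(questions, key=lambda n: expected_info_gain(questions[n], boards))


# Hypothetical candidate questions, each written as a predicate on a board.
QUESTIONS = {
    "is_horizontal": lambda b: len({r for r, _ in b}) == 1,
    "touches_top_half": lambda b: any(r < GRID // 2 for r, _ in b),
    "any_ship": lambda b: len(b) > 0,  # always 'yes', so it reveals nothing
}

rng = random.Random(0)
boards = [sample_board(rng) for _ in range(2000)]
best = pick_question(QUESTIONS, boards)
print(best)
```

A question that every sampled board answers the same way scores zero, while one that splits the samples roughly in half scores close to one bit. In a real game the sampled boards would also have to be consistent with the tiles already revealed, which this sketch leaves out.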

Llama 4 Scout, a smaller LM, initially beat humans only 8% of the time. With improved inference strategies, its win rate soared to 82%, surpassing even GPT-5 while being more cost-effective. The strategy also increased LMs’ accuracy in answering questions, with an average 15% boost.

MIT PhD student Gabriel Grand, a lead author, emphasized the importance of giving agents access to a “world model” to improve question-asking and discovery efficiency. The team’s focus was on enhancing LMs’ questioning abilities through Monte Carlo inference, using Python to convert captain questions into commands for spotters.

This method increased the accuracy of systems like GPT-4o-mini by nearly 30%, and even larger models like Claude 4 Opus saw improvements. Jacob Andreas, a senior author, highlighted the potential for these techniques to enhance LMs’ problem-solving capabilities beyond simple tasks.

The researchers also applied their methods to the game “Guess Who?”, where Llama 4 Scout’s success rate rose from 30% to over 72% after adjustments. GPT-4o’s performance improved from 62% to 90%, with GPT-5 ensuring accurate responses in each game.

Despite these advances, LMs still face challenges in answering complex questions. Valerio Pepe, an OpenAI researcher, noted that while GPT-5 can outperform average “Battleship” players, expert players remain difficult to beat. The study suggests AI agents could excel in finding rare solutions to scientific problems.

The team plans to test LMs in more complex environments and explore human-AI collaboration. Stanford’s Robert Hawkins, who was not involved in the research, remarked on the importance of social problem-solving for AI systems. Grand and Pepe collaborated with MIT’s Jacob Andreas and Joshua Tenenbaum on the study.

Original Source: news.mit.edu
