Anthropic reports four ways AI will advance scientific discovery

Machine Learning


Anthropic reports on four important advances in how artificial intelligence is supporting scientific discovery, including new benchmarks designed to rigorously assess AI’s bioinformatics capabilities. In an evaluation called BioMysteryBench, Claude analyzes real-world datasets in tasks that go beyond standard question-and-answer formats to reflect the complex workflows of actual scientific research. The move is part of a broader shift away from benchmarks such as MMLU-Pro and GPQA toward assessments that incorporate agents and tools: reading papers, writing code, and even designing experiments. “Science is challenging, and so is assessing it,” the discovery team researchers said, highlighting how hard it is to build standardized tests of scientific ability even for human experts. They found that the latest generation of Claude models not only performed as well as a panel of human experts but sometimes solved problems using strategies the experts had not tried.

In biology, there are many different “correct” ways to do something. If there were only one right way to answer research questions, PhD students could earn their degrees in months, corporate R&D departments would cease to exist, and science fair posters would no longer need a “Methods” section.

Individual research decisions are highly subjective, and in noisy datasets they can lead to very different conclusions. Even within a chosen research direction, one scientist may approve of a given decision while another raises significant objections.

Ask any author who has received conflicting recommendations during a round of peer review. The problem is compounded by the fact that biological datasets are often noisy, so small differences in research decisions can lead to completely different conclusions about the same data. A decade of research into predictors of metformin response illustrates this: small differences in study design led to very different conclusions. A 2011 paper reported a variant that predicted metformin response and was replicated in two cohorts, with a mechanism thought to involve AMPK activation.
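
To make the point concrete, here is a minimal simulation (not drawn from the metformin studies themselves; all numbers are invented) showing how one defensible preprocessing choice, the outlier cutoff, can move a borderline result across the significance threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented data: carriers of a hypothetical variant respond slightly better,
# but the effect is small relative to the noise, as in real pharmacogenomics.
carriers = rng.normal(loc=0.3, scale=1.0, size=60)
noncarriers = rng.normal(loc=0.0, scale=1.0, size=60)

def analyze(outlier_sd: float) -> float:
    """One analyst's pipeline: drop values beyond `outlier_sd` SDs, then t-test."""
    def trim(x: np.ndarray) -> np.ndarray:
        return x[np.abs(x - x.mean()) < outlier_sd * x.std()]
    _, p = stats.ttest_ind(trim(carriers), trim(noncarriers))
    return p

# Two equally defensible outlier rules; on borderline effects like this one,
# they can land on opposite sides of p = 0.05.
for cutoff in (2.0, 3.0):
    p = analyze(cutoff)
    print(f"cutoff {cutoff} SD: p = {p:.3f} -> "
          f"{'significant' if p < 0.05 else 'not significant'}")
```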

Brianna, a researcher on the discovery team, is spearheading an effort to evaluate Claude’s capabilities in bioinformatics using a new benchmark called BioMysteryBench. The initiative comes as evaluation of large language models expands beyond traditional metrics such as bar exam scores and Olympiad-level math toward specialized scientific domains. The development of BioMysteryBench marks a deliberate shift toward evaluating AI’s potential to address real open questions in biology, recognizing that the most impactful contributions may come where human expertise is at its limits. Many biological questions remain beyond human reach, and researchers are increasingly identifying exactly these questions as prime targets for artificial intelligence. Machine learning has already succeeded in areas where humans struggle, such as sequence prediction and protein modeling, by leveraging extensive experimental data rather than relying primarily on expert intuition.

Benchmarks such as ProteinGym and the long-running CASP competition exemplify this approach, basing their evaluations on experimental measurements that humans would not attempt to reproduce by hand. However, these existing benchmarks often focus on narrow tasks and fail to capture the full scope of bioinformatics research. BioMysteryBench aims to fill this gap by presenting models with messy, real-world data while maintaining rigorous evaluation criteria. The benchmark challenges Claude with questions written by subject-matter experts, each grounded in a controlled, objective property of a dataset rather than a subjective scientific conclusion. This design yields questions that are verifiable yet not necessarily easy for humans to solve. Claude answers them inside a container equipped with standard bioinformatics tools, the ability to install additional software, and access to key databases such as NCBI and Ensembl. A defining feature of BioMysteryBench is its method-agnostic approach, which gives Claude considerable freedom in its choice of tools and strategies.
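
To give a rough sense of the setup, here is a sketch of what one BioMysteryBench-style task record might look like; the class, field names, and example values are all hypothetical, not Anthropic’s actual harness:

```python
from dataclasses import dataclass, field

@dataclass
class BioMysteryTask:
    """Hypothetical task record: a dataset, an expert question, a verifiable answer."""
    question: str            # written by a subject-matter expert
    dataset_path: str        # messy real-world data mounted inside the container
    expected_answer: str     # an objective property of the data, not an opinion
    databases: list[str] = field(default_factory=lambda: ["NCBI", "Ensembl"])

task = BioMysteryTask(
    question="Which gene harbors the variant enriched in treatment responders?",
    dataset_path="/data/cohort.vcf.gz",  # placeholder path
    expected_answer="GENE_X",            # placeholder answer
)
```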

Scoring is based only on the final answer, not on the analytical path taken, rewarding the correct biological conclusion regardless of the method used. The benchmark also includes a set of questions specifically designed to be difficult or impossible for humans to solve. After rigorous quality control, 23 such questions remained, and the current models solved many that a panel of human experts could not, sometimes using very different strategies.
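
A method-agnostic, final-answer-only grader could be as simple as the sketch below; real grading presumably tolerates synonyms and formatting variants, so this exact-match version is only an assumption:

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string so formatting differences don't matter."""
    return answer.strip().lower()

def grade(model_answer: str, expected: str) -> bool:
    # The analytical path is never inspected; only the conclusion is scored.
    return normalize(model_answer) == normalize(expected)

assert grade("  Gene_X ", "gene_x")   # same conclusion, different formatting
assert not grade("gene_y", "gene_x")  # wrong conclusion fails, whatever the method
```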

Analysis of the transcripts revealed two main strategies. In one, Claude leverages the vast knowledge base it has accumulated from hundreds of thousands of papers; in the other, it layers methods when in doubt, combining different pieces of evidence to reach a conclusion. “This often allowed Claude to solve problems that humans couldn’t solve!” the researchers noted, highlighting examples where Claude combined its internal knowledge with live analysis, avoiding the need for time-consuming meta-analyses and database cross-referencing. While acknowledging the limits of evaluating tasks that remain open for both humans and models, the team emphasizes that a validation notebook can confirm that a signal exists in the data even when finding it proves very difficult. “So we ask both the models and the human baseline participants not to get too frustrated if, a year from now, no one has solved this human-hard problem set.”
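
The “layer methods when in doubt” strategy can be pictured as a consensus over independent lines of evidence, as in this hedged sketch; the voting rule is our illustration, not a description of Claude’s actual reasoning:

```python
from collections import Counter

def consensus(candidates: list[str]) -> str | None:
    """Return the answer a strict majority of methods agree on, else None."""
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer if votes > len(candidates) / 2 else None

# Three independent analyses (e.g. literature recall, an enrichment test,
# a database lookup) each nominate a gene; two out of three agree.
print(consensus(["GENE_X", "GENE_X", "GENE_Y"]))  # -> GENE_X
print(consensus(["GENE_X", "GENE_Y", "GENE_Z"]))  # -> None: still in doubt, add a method
```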

As soon as large language models could hold conversations, people began asking how they compared to human experts.

Figure: In the human-solvable set (left), all three models are strongly bimodal; each problem is either solved almost every time or not at all.

Recent benchmarking with BioMysteryBench revealed noticeable patterns in the performance of leading language models such as Claude. On problems that humans can solve, these models exhibit strongly bimodal behavior: a problem is typically either solved consistently over multiple attempts or not at all, suggesting a clear distinction between retained knowledge and guessing. This contrasts sharply with performance on harder tasks, where success is far more variable. Researchers discovered this dichotomy while evaluating Claude’s abilities in bioinformatics, a field whose expertise is not captured by common benchmarks such as the bar exam or Olympiad mathematics. On the “human-solvable” problem set, current models perform as well as human experts, and the latest generation has solved many problems that panels of human experts could not, sometimes using very different strategies.
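
The bimodality claim is easy to operationalize: estimate each problem’s solve rate over repeated attempts and check how many rates sit near 0 or 1. The helpers below are our sketch, with invented demo data:

```python
import numpy as np

def solve_rates(results: dict[str, list[bool]]) -> dict[str, float]:
    """Per-problem fraction of attempts that were solved."""
    return {pid: float(np.mean(attempts)) for pid, attempts in results.items()}

def bimodal_fraction(rates: dict[str, float], margin: float = 0.2) -> float:
    """Share of problems whose solve rate is within `margin` of 0 or 1."""
    vals = np.array(list(rates.values()))
    return float(np.mean((vals <= margin) | (vals >= 1 - margin)))

demo = {"q1": [True] * 10, "q2": [False] * 10, "q3": [True] * 6 + [False] * 4}
rates = solve_rates(demo)
print(rates)                    # q1 -> 1.0, q2 -> 0.0, q3 -> 0.6
print(bimodal_fraction(rates))  # 2 of 3 problems sit at the extremes
```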

This observation is not just about a gap in accuracy; it reveals a fundamental difference in how the models arrive at an answer. Experts expect this focus on reliability to grow in importance as AI tools are integrated into real-world scientific workflows, where consistent and reproducible results are paramount. The pattern persists in new-generation models. The team described its detailed examination of reliability as “a little… boring,” but the metric matters for evaluating model performance. “While this added some nuance to the performance analysis presented above, it did not surface any fundamentally new issues,” the researchers said. Nevertheless, the findings suggest that the models are beginning to show nascent “research taste,” pointing toward more sophisticated scientific reasoning in the future.
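
The gap between “can solve it at all” and “solves it reliably” also has a simple closed form under the simplifying assumption of independent attempts with per-attempt success probability p:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k

def pass_all_k(p: float, k: int) -> float:
    """Probability of succeeding on every one of k attempts (reliability)."""
    return p ** k

p = 0.5  # a problem the model gets right half the time
print(round(pass_at_k(p, 10), 3))   # 0.999: looks solved if any success counts
print(round(pass_all_k(p, 10), 3))  # 0.001: far from reliable
```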

Both are reasonable directions, and how you proceed often depends on your expertise and resources.


