Is AI drug discovery becoming a data infrastructure race?

Artificial intelligence now plays a central role in drug discovery. In antibody research specifically, models support sequence design, paratope prediction, affinity maturation, and likelihood screening. However, as these systems move from benchmark tasks to actual discovery programs, a familiar pattern is emerging. Models often perform well within a known training space but poorly when asked to generalize to new targets, new binders, or unusual interaction geometries. One reason may be that the field focuses on algorithms and underestimates the importance of the underlying biological data.

This is becoming increasingly difficult to ignore. Across biotech, conversations about AI still tend to focus on architecture, compute, and model optimization. These factors matter, but they do not remove the ceiling imposed by limited training data. In that sense, AI drug discovery is starting to look more like a data infrastructure race than a pure modeling contest.

The bottleneck is not compute, but the data behind it

Antibody discovery makes the problem particularly clear. Although antibody sequence space is vast and immeasurably diverse, the number of publicly available antibody-antigen co-structures remains relatively small, as reflected in resources such as the RCSB Protein Data Bank and the AlphaFold Protein Structure Database. This means that many AI systems are trained on a narrow, standard slice of the interaction space and are then expected to generalize across a much broader range of biological realities. Performance can drop when models encounter the cases that matter most in drug discovery, such as new CDR loop shapes, unfamiliar epitopes, rare scaffold structures, and unusual binding modes. The problem is not necessarily that the models are flawed, but that even the best models have not been exposed to enough of the relevant structural diversity.
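One way to make this generalization gap visible is to evaluate a model only on antigens that never appear in its training set. The following is a minimal sketch, assuming each example is a dictionary with a hypothetical "antigen" key; the function name split_by_antigen, the record layout, and the identifiers are illustrative assumptions rather than part of any particular pipeline.

```python
import random


def split_by_antigen(records, test_fraction=0.2, seed=0):
    """Split records so that no antigen appears in both train and test.

    `records` is a list of dicts with an "antigen" key (a hypothetical
    layout chosen for illustration).
    """
    antigens = sorted({r["antigen"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(antigens)

    # Hold out at least one antigen, even for small datasets.
    n_test = max(1, round(len(antigens) * test_fraction))
    held_out = set(antigens[:n_test])

    train = [r for r in records if r["antigen"] not in held_out]
    test = [r for r in records if r["antigen"] in held_out]
    return train, test


# Toy data: all identifiers are made up.
records = [
    {"antibody": "ab1", "antigen": "agA"},
    {"antibody": "ab2", "antigen": "agA"},
    {"antibody": "ab3", "antigen": "agB"},
    {"antibody": "ab4", "antigen": "agC"},
    {"antibody": "ab5", "antigen": "agD"},
    {"antibody": "ab6", "antigen": "agE"},
]
train, test = split_by_antigen(records)
print(len(train), len(test))
```

A random split over individual examples would let the same antigen sit on both sides, which tends to overstate how well a model handles unfamiliar targets.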

This data constraint reframes a fundamental question in the field. If multiple groups are training on the same public datasets with the same constraints, there may be a limit to how much their results can differ. At that point, the central competitive issue changes. It is no longer about which organization has the best model, but which organization can create the richest and most useful training environment.

Why public structural datasets are no longer good enough

Public structural repositories remain essential. They made today’s generation of protein and antibody AI possible. However, they were not built with large-scale machine learning as the primary use case.

Most traditional structures were generated to answer specific biological questions rather than to create large, balanced training corpora. The result is uneven coverage. While specific targets and protein classes are well represented, others remain sparse. For instance, antibody-antigen examples are relatively limited, dynamic interactions are underrepresented, and the available structures often reflect stabilized or engineered states rather than the types of solution-state interactions that discovery teams want their models to learn. This gap is particularly relevant given continued advances in structure prediction and antibody modeling, as well as their remaining limitations, as discussed in work such as "AlphaFold2 structure guides ligand discovery", IgFold, "Antibody antigen prediction in real-world applications", and SAINT-DB.

This distinction is important because AI systems learn not just from the quantity of examples they see, but from the characteristics of those examples. If the training range is narrow, the resulting model may be elegant but fragile. In real-world drug discovery, robustness is more important than benchmark performance against familiar data.

Structural biology was not built for industrial throughput

Traditional structural methods remain foundational. X-ray crystallography and cryo-electron microscopy provide extraordinary biological insight and remain essential for the tasks for which they are suited. However, they are limited in their ability to reflect the dynamic nature of protein-protein interactions and are not optimized to generate tens of thousands of training examples on demand. Cryo-EM has become powerful for biological discovery, as shown in the study "Discovery of functional and epitope-specific monoclonal antibodies by cryo-EM". Even so, it is different from a method naturally built for sequence inference and industrial-scale dataset generation.

Traditional structural biology is resource-intensive and iterative in nature. It often requires repeated rounds of protein engineering, purification, stabilization, and optimization. While these steps are well suited to probing individual biological problems in depth, they are poorly suited to the industrial task of generating broad and diverse structural datasets fast enough to support modern AI development cycles.

The distinction between depth and breadth is becoming more important. In the context of AI, the limiting factor is often not whether a team can solve a single structure at high resolution, but whether the team can generate sufficiently diverse and usable interaction data to improve generalization across many discovery settings. Recent thinking in the field increasingly reflects this shift toward breadth across sequence and interaction space, rather than reliance on a relatively small number of carefully resolved examples.

Data infrastructure is becoming part of the product

In software, data infrastructure typically refers to systems that capture, standardize, store, and activate information at scale. In drug discovery, its meaning has expanded to include the experimental and computational systems needed to generate biological data in a format that can be easily used in machine learning workflows. This means standardized workflows, reproducible output, high-throughput automation, strong quality control, and a reliable path from empirical measurements to model-ready structural representations. It also means designing data generation based on what the AI model actually needs, rather than treating machine learning as a downstream consumer of whatever data biology happens to generate.
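To make "standardized" and "quality control" concrete, the following is a minimal sketch of a record schema with a simple QC gate. It assumes a hypothetical InteractionRecord layout and placeholder thresholds; the field names, the assay label, and the choice of replicate variability as the QC criterion are illustrative assumptions, not a description of any organization's pipeline.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionRecord:
    """One standardized measurement (all field names are hypothetical)."""
    antibody_id: str
    antigen_id: str
    assay: str
    replicates: tuple  # repeated measurements of the same quantity


def qc_issues(record, min_replicates=2, max_cv=0.2):
    """Return a list of QC problems; an empty list means the record passes.

    The thresholds are placeholders, not recommended values.
    """
    issues = []
    if not record.antibody_id or not record.antigen_id:
        issues.append("missing_identifier")

    values = record.replicates
    # stdev needs at least two values, so never accept fewer than that.
    if len(values) < max(2, min_replicates):
        issues.append("too_few_replicates")
        return issues

    mean = statistics.fmean(values)
    if mean == 0:
        issues.append("zero_mean")
    elif statistics.stdev(values) / abs(mean) > max_cv:
        # Coefficient of variation: spread relative to the mean.
        issues.append("high_variability")
    return issues


record = InteractionRecord("ab1", "agA", "example_assay", (1.00, 1.05, 0.95))
print(qc_issues(record))
```

Applying the same gate to every record before it reaches a training set is one simple way to keep outputs reproducible across runs and instruments.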

This is where the infrastructure issue becomes concrete. Some experimental methods can generate rich structural interaction data at high throughput, but they do not naturally output the precise object types that many existing models expect. A workflow may measure changes in solvent accessibility or epitope-level interaction information, while the model is built to ingest structures in PDB format. Bridging this gap is a core infrastructure issue. Beyond generating new data, there is value in converting those measurements into representations that can be plugged directly into existing modeling pipelines.
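As a small illustration of this format gap, a common convention is to carry per-residue values in the B-factor (temperature factor) field of PDB ATOM records so that PDB-aware tools can read them. The sketch below assumes a hypothetical dictionary of per-residue scores keyed by chain and residue number; make_atom_line and annotate_bfactor are illustrative helpers, and the step is far simpler than turning measurements into an actual structural model of a complex.

```python
def make_atom_line(serial, name, resname, chain, resseq, x, y, z,
                   occupancy=1.0, bfactor=0.0):
    """Build a fixed-width PDB ATOM record (columns 1-66 only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{bfactor:6.2f}"
    )


def annotate_bfactor(pdb_text, scores, default=0.0):
    """Write per-residue scores into the B-factor field (columns 61-66).

    `scores` maps (chain_id, residue_number) to a float; residues without
    a score receive `default`. Values must fit the 6-character field.
    """
    out = []
    for line in pdb_text.splitlines():
        # Only touch coordinate records that are long enough to parse.
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 54:
            chain = line[21]            # column 22
            resseq = int(line[22:26])   # columns 23-26
            value = scores.get((chain, resseq), default)
            line = line.ljust(66)
            line = line[:60] + f"{value:6.2f}" + line[66:]
        out.append(line)
    return "\n".join(out)


# Toy input: coordinates and scores are made up.
pdb_text = "\n".join([
    make_atom_line(1, " CA ", "ALA", "A", 5, 1.0, 2.0, 3.0),
    make_atom_line(2, " CA ", "GLY", "A", 6, 4.0, 5.0, 6.0),
])
scores = {("A", 5): 0.75}  # hypothetical per-residue measurement
print(annotate_bfactor(pdb_text, scores))
```

Reusing an existing field keeps the file readable by unmodified PDB parsers, at the cost of overwriting whatever B-factor values were there before.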

Some organizations are already building on this change. Rather than treating structural biology as a downstream validation step, they are investing in systems that generate interaction data that is consistent enough to support model training, fine-tuning, and validation under standardized conditions at an AI-relevant scale. In practice, this means combining high-throughput experimental workflows with a computational layer that can transform empirical interaction measurements into model-compatible structural representations.

Why integrated systems are more important than standalone models

This is why the next lasting advantage in AI drug discovery may come from organizations that integrate biology, automation, data engineering, and modeling into one system.

A model trained on static public data may eventually converge with competitors facing the same inputs. Groups that can continually generate their own experimentally anchored interaction datasets have a different kind of advantage. They can train, fine-tune, and validate on an ever-growing body of new information. Over time, this may prove more meaningful than incremental improvements to the architecture alone.

This does not mean that model innovation is no longer important. It means that model innovation becomes just one layer in a broader stack. Training data, the workflows that generate it, and AI-enabled systems all begin to function as part of the competitive moat.

Biology-first AI requires biology-first datasets

There is also a broader lesson here. Drug discovery data should not be treated as interchangeable. Biological measurements differ in context, relevance, and modeling usefulness. Data generated under native or solution-state conditions may yield different lessons than data obtained from highly stabilized or engineered systems. Data generated to answer one narrow scientific question may be less useful for generalizable machine learning than data generated intentionally to capture breadth and diversity.

This suggests a change in thinking. Rather than asking only how AI can accelerate existing discovery workflows, the field may need to ask how the experimental systems themselves should evolve if the end goal is model performance. Some of the most important AI investments may occur before training begins, in areas such as assay design, automation, data standardization, and generation of better biological starting materials.

The race has already begun

Drug discovery cannot be a pure data infrastructure business. Success is still determined by biology, chemistry, translational judgment, and clinical execution. But in the field of AI-powered discovery, infrastructure is moving closer to the center of the competition.

Organizations that can generate large, diverse, experimentally driven structural datasets and make them available for use throughout modern modeling workflows are likely to be increasingly advantaged. The next stage of progress may come not from asking which models perform best on fixed benchmarks, but from asking which teams can build systems that continually expand the training space itself.

If this is correct, AI drug discovery will be more than just a model competition. It is also becoming a race to build the biological data infrastructure on which modeling depends.
