Recent advances in protein artificial intelligence (AI) have had a major impact. AlphaFold’s demonstration that it could predict protein structures with near-experimental accuracy reset expectations across structural biology, and subsequent models for protein design, interaction prediction, sequence generation, and more moved antibody discovery into a new computational era.
But real problems remain and are becoming increasingly difficult to ignore. The antibody-antigen structural data on which these tools rely has not kept pace with the demands placed on it. Models built on that foundation perform well on the problems the foundation covers. On the problems it does not cover, they often struggle silently.
Dan Benjamin, co-founder and chief technology officer at Imto Scientific, told DDN that the numbers show what’s missing. Although the Protein Data Bank holds more than 240,000 structures, only about 1,800 of them are antibody-antigen pairs, less than 1 percent of the total. “This results in a model that is very good at predicting the folding of monomeric chain proteins, but often misses the mark when predicting the same protein bound to an antibody,” he said.
What the training data actually contains
The scarcity of antibody-antigen structures in public databases is compounded by the way those structures were generated. Most were solved to answer specific biological or therapeutic questions. That means they cluster around targets that are experimentally tractable, structurally stable, and scientifically compelling enough for groups to invest in solving them. Although this selectivity is understandable, it produces a training corpus with significant blind spots.
Benjamin described the result as a “scientifically useful but heterogeneous” data set. Well-behaved proteins, common interaction geometries, and stable experimental systems predominate. Difficult targets, structurally flexible antigens, unbound antibodies, and nonfunctional binders are rare. In the field, this is often referred to as the missing “negative” data problem: the training set shows the model what a productive interaction looks like but provides limited exposure to when and why binding fails.
“If the training set is biased toward relatively well-behaved proteins, common interaction geometries, or stable experimental systems, a model may appear powerful in a benchmark setting but become less reliable when applied to new targets or more complex discovery problems,” Benjamin said. He added that because the first wave of advances in protein AI was so impressive, it was natural to focus on architecture, scale, and compute, and it took the field a while to recognize the data problem. “We’re now reaching a stage where the data layer becomes a more visible constraint,” he said.
This is important because AI models generalize from what they have seen before. A 2024 review in Frontiers in Drug Discovery on AI applications in antibody discovery pointed out that the main limitation of models trained on specific antibody-antigen pairs is that they are target-dependent. The training set determines the region in which the model can be expected to perform, and that region has boundaries that become apparent in a real-world discovery setting.
Those settings are exactly where generalization is needed. The target of interest to a discovery team is not necessarily one that was crystallized and deposited in 2005. The binding modes relevant to a therapeutic program may not resemble the interaction geometries that dominate the training corpus. A model that performs well on benchmarks built from public data may produce plausible structures for new targets yet be systematically wrong about the actual binding modes. This kind of failure is difficult to detect computationally and expensive to discover experimentally.
Why structural diversity is a real constraint
Capturing the structural diversity of antibody-antigen interactions requires data spanning a truly broad range of antigens, epitopes, paratope shapes, complementarity-determining region loop structures, scaffold types, binding orientations, affinities, and dynamic interaction states. Together these make up the combinatorial space in which antibodies encounter their targets in biological systems.
The dynamic dimension is particularly important and particularly undervalued. Traditional structural biology techniques, X-ray crystallography and cryo-electron microscopy (cryo-EM), offer exceptional resolution but require conditions that remove proteins from their biological context. Crystallization requires a stable, homogeneous structure. Cryo-EM samples are flash frozen. The result is a snapshot of complexes that may have been stabilized, engineered, or concentrated in ways that change what is observed. For flexible, membrane-associated, or structurally heterogeneous targets, these conditions can systematically exclude the interaction states most relevant to therapeutic function.
“Models can work well within familiar territory even when structural diversity is limited,” Benjamin said. “The problem emerges when they are asked to generalize.” A model may produce a large set of plausible structures with the biologically correct answer somewhere in the set, but rank them imprecisely. “In practice, this means teams may spend time optimizing based on inaccurate structural assumptions,” he said. “Even a small amount of experimentally anchored interaction data (sometimes as little as 20 antibody-antigen pairs) can help distinguish which structures are biologically plausible.”
In research published in Molecules in 2024, Joo et al. described the disparity between abundant antibody sequence data and scarce antibody structural information as a defining challenge for the field. They noted that tools designed to bridge the gap by predicting structure from sequence still face fundamental limitations when the training data does not represent the interaction geometry in question.
Where current AI models are most likely to fail
Failure modes resulting from structural data limitations are unevenly distributed across target classes. Certain categories of antibody discovery problems are consistently more exposed to these limitations than others.
Conformational epitopes (where an antibody’s binding site is defined by the folded three-dimensional shape of the protein surface rather than a continuous sequence) are among the most difficult. Models trained primarily on linear epitope structures may lack enough relevant examples to infer how antibodies engage such targets. Membrane proteins, intrinsically disordered regions, multimeric complexes, and targets with disease-associated structural changes present similar challenges. Antigens can exist in multiple conformational states, and the model may not have seen enough examples to tell which state presents the relevant epitope.
These limitations have the most immediate practical impact in the design of epitope-specific antibodies. Generating antibodies that bind to a target is a solved problem for many target classes. Generating antibodies that bind a defined epitope from the right angle, with adequate selectivity and developability, is extremely difficult and highly dependent on the quality and diversity of the training data on which predictions are based. As a 2025 review in npj Precision Oncology on AI in antibody-drug conjugate development points out, a lack of data limits the robustness of predictive models precisely when specificity and interaction geometry matter most.
“In settings like this, models often fail in ways that are not obvious,” Benjamin said. “Structures can be generated that look computationally reasonable but are biologically incorrect.” Errors only become apparent when the predicted structure is used as the basis for an optimization campaign and experimental results do not match expectations. By that point, meaningful resources have been devoted to a false hypothesis. “The future is unlikely to be purely computational or experimental,” he said. “This becomes an integrated loop where the empirical data helps the model make better structural decisions, and the model helps guide the next set of experiments.”
Data as infrastructure
“The next big thing could come from improving the training environment, not just the model architecture,” Benjamin said. “It’s not enough to just generate more examples if those examples reproduce the same bias in existing datasets.” The scale of the infrastructure this requires is substantial. Many AI systems built for structural biology work with an established format: coordinate files that encode the positions of atoms in three dimensions. High-throughput experimental methods, by contrast, can generate many types of structural readouts, including solvent accessibility measurements, epitope-level interaction data, hydrogen-deuterium exchange profiles, and chemical cross-linking patterns. Making these empirical readouts directly usable by AI models requires a translation layer, which is itself a significant engineering challenge that most structural biology workflows were not designed to address.
The most defensible applications of AI in antibody discovery have been described as those that are practical and tightly coupled to experimental workflows, which suggests that integrating empirical structural constraints with computational prediction is the direction the field is headed. The logic is that even small amounts of experimentally anchored interaction data can help the model distinguish which of the generated structures are biologically plausible. That is a disproportionate return on experimental investment compared with spending the same resources on additional model training alone.
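To make that logic concrete, the following is a minimal sketch in Python of how a measured epitope might be used to re-rank predicted complexes. It is an illustration under stated assumptions, not a method described by Benjamin or used by any particular tool: the `CandidatePose` container, the residue numbers, the model scores, and the choice of Jaccard overlap as the agreement measure are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePose:
    """One predicted antibody-antigen complex (hypothetical container)."""
    name: str
    model_score: float           # the model's own confidence, higher is better
    contact_residues: frozenset  # antigen residue numbers the antibody contacts


def epitope_agreement(pose, measured_epitope):
    """Jaccard overlap between predicted contacts and the measured epitope."""
    union = pose.contact_residues | measured_epitope
    if not union:
        return 0.0
    return len(pose.contact_residues & measured_epitope) / len(union)


def rerank(poses, measured_epitope):
    """Order poses by experimental agreement, using model score to break ties."""
    return sorted(
        poses,
        key=lambda p: (epitope_agreement(p, measured_epitope), p.model_score),
        reverse=True,
    )


# Invented example data: epitope residues from an experimental readout,
# plus three predicted poses with made-up scores and contacts.
measured = frozenset({45, 46, 47, 50, 52})
poses = [
    CandidatePose("pose_a", 0.91, frozenset({101, 102, 103, 105})),
    CandidatePose("pose_b", 0.84, frozenset({45, 46, 47, 50, 60})),
    CandidatePose("pose_c", 0.79, frozenset({45, 46, 110})),
]

ranked = rerank(poses, measured)
for pose in ranked:
    print(pose.name, round(epitope_agreement(pose, measured), 2))
```

In this invented example, the pose the model scores highest (`pose_a`) shares no residues with the measured epitope and drops to last place, while `pose_b` moves to the top. A real pipeline would need a more careful agreement measure and real experimental readouts.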
“AI drug discovery is becoming as much an infrastructure problem as a modeling problem,” Benjamin said. “Teams that can generate, standardize, and continually scale high-quality biological data will have a different kind of advantage. Models will still be important, but the most lasting progress may come from building systems that allow those models to learn from better biology.”
In that framing, the data factory is a core function that determines whether a model will work in a real-world discovery environment, rather than a supporting function that operates on the periphery of the science.



