PoseBench
The overall goal of PoseBench, our newly designed benchmark for protein–ligand docking and structure prediction, is to provide the research community with a centralized resource with which one can systematically measure, in a variety of macromolecular contexts, the methodological advancements of new conventional and DL methods proposed for this domain. In the following sections, we describe PoseBench’s design and composition (as portrayed in Fig. 1) and how we have used PoseBench to evaluate several recent DL-docking and cofolding methods (as well as a strong conventional baseline algorithm) for protein–ligand structure modelling.
Benchmark datasets
As shown in Table 1, PoseBench provides users with broadly applicable, preprocessed versions of four datasets with which to evaluate existing or new protein–ligand structure prediction methods: Astex Diverse44, PoseBusters Benchmark12, and the new DockGen-E and CASP15 PLI datasets that we have manually curated in this work.
Astex Diverse dataset
The Astex Diverse dataset is a collection of 85 PLI complexes composed of various drug-like molecules and cofactors known to be of pharmaceutical or agrochemical interest, where a primary (representative) ligand is annotated for each complex. This dataset can be considered an easy benchmarking dataset for methods trained on recent data contained in the PDB in that most of its complexes (deposited in the PDB up to 2007) are known to overlap with the commonly used PDBBind 2020 (time-split) training dataset14,50 containing complexes deposited in the PDB before 2019. As such, including this dataset for benchmarking allows one to estimate the breadth of a method’s structure prediction capabilities for important primary ligand–protein complexes represented in the PDB.
To perform unbound (apo) protein–ligand docking with this dataset, we used AF3 to predict the structure of each of its protein complexes, with all ligands and cofactors excluded. We then optimally aligned these predicted protein structures to the corresponding crystal (holo) PLI complex structures using a PLI binding site-focused structural alignment performed using PyMOL51, where each binding site is defined as all amino acid residues containing crystallized heavy atoms that are within 10 Å of any crystallized ligand heavy atom. To enable the broad availability of PoseBench’s benchmark datasets in both commercial and academic settings, we also provide unbound (apo) protein structures predicted using the MIT-licensed ESMFold model45, although in ‘Results and discussion’ we report results using AF3’s predicted structures as the default data source. We further note that, on average, across all benchmark datasets and methods, AF3’s predicted structures improve the structural accuracy rates of baseline docking methods by 5–10%.
PoseBusters Benchmark dataset
Version 2 of the popular PoseBusters Benchmark dataset12, which we adopt in this work, contains 308 recent primary ligand–protein complexes deposited in the PDB from 2019 onwards. Accordingly, in contrast with Astex Diverse, this dataset can be considered a moderately difficult benchmark dataset for baseline methods, because many of its complexes do not directly overlap with the most commonly used PDB-based training data. It is important to note that, among all baseline methods, AF3 and Boltz-1 used the most recent PDB training data cutoff of 30 September, 2021, which motivated us to report the results in ‘PoseBusters Benchmark results’ for only the subset of PoseBusters Benchmark complexes (n = 130 protein–ligand complexes) deposited in the PDB after this date. Like Astex Diverse, for the PoseBusters Benchmark dataset, we used AF3 (and ESMFold) to predict the apo protein structures of each of its complexes and then performed our PyMOL-based structural binding site alignments.
DockGen-E dataset
The original DockGen dataset13 contains 189 diverse primary ligand protein complexes, each representing a functionally distinct type of PLI binding pocket according to ECOD domain partitioning46,47. Consequently, this dataset can be considered PoseBench’s most difficult primary ligand dataset to model because its PLI binding sites are distinctly uncommon compared with those frequently found in the training datasets of all baseline methods, although it is important to note that these original DockGen complexes were deposited in the PDB from 2019 onward, making this benchmarking dataset partially overlap with the training datasets of baseline DL cofolding methods such as Chai-1, Boltz-1 and AF3. Nonetheless, in line with our initial hypotheses, the benchmarking results in ‘Results and discussion’ demonstrate that no baseline method can adequately predict the PLI binding sites and ligand poses represented by this bespoke subset of the PDB, suggesting that all baseline DL methods have yet to learn broadly applicable representations of protein–ligand binding.
Unfortunately, the original DockGen dataset contains only the primary protein chains representing each new binding pocket after filtering out all non-interacting chains and cofactors in a given biological assembly (bioassembly), which considerably reduces the biophysical context provided to baseline methods to make reasonable predictions. As such, we argue for the need to construct a new dataset that challenges baseline methods (in particular, DL cofolding methods) to predict full bioassemblies containing new PLI binding pockets, which we address with our enhanced version of DockGen called DockGen-E.
To construct DockGen-E, we collected the original DockGen dataset’s PLI binding pocket annotations for each complex. We then retrieved the corresponding first bioassembly listed in the PDB to obtain each PDB entry’s biologically relevant complex, filtering out DockGen complexes for which the first bioassembly could not be mapped to its original PLI binding pocket annotation (which indicates that these original DockGen PLI binding pockets were initially not derived from the PDB’s corresponding first bioassembly). This procedure left 122 biologically relevant assemblies remaining for benchmarking. Like Astex Diverse and PoseBusters Benchmark, for DockGen-E, we then used AF3 (and ESMFold) to predict the unbound (apo) protein structures of each complex in the dataset and structurally aligned the predicted protein structures to their corresponding crystallized PLI binding sites using PyMOL.
CASP15 dataset
To assess the multiprimary ligand (that is, multiligand) modelling capabilities of recent methods for protein–ligand docking and structure prediction, with PoseBench, we introduced a preprocessed, DL-ready version of the CASP15 PLI dataset debuted as a first-of-its-kind prediction category in the 15th Critical Assessment of Techniques for Structure Prediction (CASP) competition held in 202241. The CASP15 PLI dataset originally comprised 23 protein–ligand complexes released in the PDB from 2022 onwards, where we subsequently filtered out four complexes based on: (1) whether the CASP organizers ultimately assessed predictions for the complex; and (2) whether they are nucleic acid-ligand complexes with no interacting protein chains. The 19 remaining PLI complexes, which contain a total of 102 (fragment) ligands, consist of a variety of ligand types including single-atom (metal) ions and large drug-sized molecules with up to 92 atoms in each (fragment) ligand. As such, this dataset is appropriate for assessing how well structure prediction methods can model interactions between different (fragment) ligands in the same complex, which can yield insights into the interligand steric clash rates of each method. As with all other benchmark datasets, we used AF3 (and ESMFold) to predict the unbound (apo) structure of each protein complex in the dataset and then performed a PyMOL-based structural alignment of the corresponding PLI binding sites.
PLI similarity analysis between datasets
For an investigation of the similarity of PLIs represented in each dataset, in Supplementary Appendix E, we analyse the different types and frequencies of common, ProLIF-annotated protein–ligand binding pocket interactions52 natively found within the common PDBBind 2020 training dataset and the Astex Diverse, PoseBusters Benchmark, DockGen-E and CASP15 datasets, respectively, to quantify the diversity of the (predicted) interactions that each dataset can be used to evaluate. In short, we find that the DockGen-E and CASP15 benchmark datasets are the most dissimilar compared with the common PDBBind 2020 training dataset, further illustrating the unique PLI modelling challenges offered by these evaluation datasets.
Formulated tasks
In this work, we developed PoseBench to focus our analysis on the behaviour of different conventional and DL methods for protein–ligand structure prediction in a variety of macromolecular contexts (for example, with or without inorganic cofactors present). With this goal in mind, below we formalize the structure prediction tasks currently available with PoseBench, with its source code flexibly designed to accommodate new tasks in future work.
Primary ligand blind docking
For primary ligand blind docking, each baseline method is provided with a complex’s (multichain) protein sequence and an optional predicted (apo) protein structure as input along with its corresponding (fragment) ligand simplified molecular input line entry system (SMILES) strings, where fragment ligands include the primary binding ligand to be scored as well as all cofactors present in the corresponding crystal structure. In particular, no knowledge of the complex’s PLI binding pocket is provided to evaluate how well each method can: (1) identify the correct PLI binding pockets; and (2) correct ligand poses within each pocket; (3) with high chemical validity; and (4) with specificity for the pockets’ amino acid residues. After all fragment ligands are predicted, PoseBench extracts each method’s prediction of the primary binding ligand and reports evaluation results for these primary predictions.
Multiligand blind docking
For multiligand blind docking, each baseline method is provided with a complex’s (multichain) protein sequence and an optional predicted (apo) protein structure as input along with its corresponding (fragment) ligand (SMILES) strings. As in primary ligand blind docking, no knowledge of the PLI binding pockets is provided, which offers the opportunity to evaluate not only PLI binding pocket and conformation prediction accuracy but, in the context of multibinding ligands, also interligand steric clash rates.
Metrics
Traditional metrics
For PoseBench, we reference two key metrics in the field of structural bioinformatics: the r.m.s.d. and local distance difference test (lDDT)53. The r.m.s.d. between a predicted three-dimensional conformation (with atomic positions \({\hat{x}}_{i}\) for each of the molecule’s n heavy atoms) and the ground-truth (crystal structure) conformation (xi) is defined as
$$\,\text{r.m.s.d.}\,=\sqrt{\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{\parallel {\hat{x}}_{i}-{x}_{i}\parallel }^{2}}.$$
(1)
The lDDT score, which is commonly used to compare predicted and ground-truth protein three-dimensional structures, is defined as
$$\,\text{lDDT}\,=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\frac{1}{4}\mathop{\sum }\limits_{k=1}^{4}\left(\frac{1}{| {{\mathcal{N}}}_{i}| }\sum _{j\in {{\mathcal{N}}}_{i}}\varTheta (| {\hat{d}}_{ij}-{d}_{ij}| < {\Delta }_{k})\right),$$
(2)
where N is the total number of heavy atoms in the ground-truth structure; \({{\mathcal{N}}}_{i}\) is the set of neighbouring atoms of atom i within the inclusion radius Ro = 15 Å in the ground-truth structure, excluding atoms from the same residue; \({\hat{d}}_{ij}\) (dij) is the distance between atoms i and j in the predicted (ground-truth) structure; Δk are the distance tolerance thresholds (that is, 0.5 Å, 1 Å, 2 Å and 4 Å); Θ(x) is a step function that equals 1 if x is true, and 0 otherwise; and \(| {{\mathcal{N}}}_{i}|\) is the number of neighbouring atoms for atom i. As originally proposed in ref. 41, in this study, we adopted the PLI-specific variant of lDDT for scoring multiligand complexes, which calculates lDDT scores to compare predicted and ground-truth protein–(multi)ligand complex structures following optimal (chain-wise and residue-wise) structural alignment of the predicted and ground-truth PLI binding pockets.
Lastly, we also measure the molecule validity rates of each predicted PLI complex pose using the PoseBusters software suite (that is, PB-Valid)12. This suite runs several important chemical and structural sanity checks for each predicted pose, including energy ratio inspection and geometric (for example, flat aliphatic ring) assertions, which provide a secondary filter of accurate poses that are also chemically and structurally meaningful.
New metrics
The r.m.s.d., lDDT and PB-Valid metrics of a protein–ligand binding structure provide useful characterizations of how accurate and reasonable a predicted pose is. However, a key limitation of these metrics is that they do not measure how well a predicted pose resembles a native pose when comparing their induced PLIFs. Recently, in ref. 37, a complementary benchmarking metric, PLIF-valid, was introduced which assesses DL methods’ recovery rates of known PLIs. However, this metric only reports a strict recall rate of each method’s interaction types rather than a continuous measure of how well each method’s interactions match the distribution of crystallized PLIs. Moreover, in drug discovery, a primary concern when designing new drug candidates is ensuring that they produce amino acid-specific types of interactions (and not others); hence, we desire each baseline method to recall the correct types of PLIs for each pose and to avoid predicting (that is, hallucinating) types of interactions that are not natively present. Consequently, we argue that an ideal PLI-aware benchmarking metric is a single continuous metric that assesses the recall and precision of a method’s predicted distribution of amino acid-specific PLIFs. To this end, we propose two new benchmarking metrics, PLIF-EMD and PLIF-WM.
For each PLI complex, PLIF-EMD measures the earth mover’s distance (EMD)54 between a method’s predicted histogram of PLI type counts u (specific to each type of interaction) and the corresponding native histogram v, where each histogram of interaction type counts is represented as a one-dimensional discrete distribution. Formally, this equates to computing the Wasserstein distance between these two one-dimensional distributions u and v as
$$\,\text{PLIF-EMD}\,\,:= \,{l}_{1}(\mathbf{u},\mathbf{v})=\mathop{\inf }\limits_{\uppi \in {\rm{\Pi }}(\mathbf{u},\mathbf{v})}{\int}_{{\mathbb{R}}\times {\mathbb{R}}}| x-y| {\rm{d}}\uppi (x,y),$$
(3)
where Π(u, v) denotes the set of distributions on \({\mathbb{R}}\times {\mathbb{R}}\) whose marginals, u and v, are on the first and second factors, respectively. To penalize a baseline method for producing non-native interaction types, we unify the bins in each histogram before converting them into one-dimensional discrete representations. Namely, to perform this calculation, each PLI is first represented as a fingerprint tuple of < ligand type, amino acid type, interaction type > as determined by the software tool ProLIF52 and then grouped to count each type of tuple to form a histogram. As such, a lower PLIF-EMD value implies a better continuous agreement between predicted and native interaction histograms. PLIF-WM, derived from PLIF-EMD, assesses the WM score of a pair of PLIF histograms. Specifically, to obtain a more benchmarking-friendly score ranging from 0 to 1 (higher is better), we define PLIF-WM as
$$\,\text{PLIF-WM}\,\,:= \,1-\frac{\,\text{PLIF-EMD}-{\text{PLIF-EMD}}_{{\rm{min}}}}{{\text{PLIF-EMD}}_{{\rm{max}}}-{\text{PLIF-EMD}}_{{\rm{min}}}},$$
(4)
where PLIF-EMDmin and PLIF-EMDmax denote the minimum (best) and maximum (worst) values of PLIF-EMD, respectively. As a metric normalized relative to each collection of the latest baseline methods, PLIF-WM allows one to quickly identify which of the latest methods has the greatest capacity to produce realistic distributions of PLIs. As a practical note, we use SciPy 1.15.149 to provide users of PoseBench with an optimized implementation of PLIF-EMD and thereby PLIF-WM.
Baseline methods and experimental set-up
Overview
We designed PoseBench to answer specific modelling questions for PLI complexes such as: (1) which types of methods (if any) can predict both common and uncommon PLI complexes with high structural and chemical accuracy; and (2) which most accurately predict multiligand structures without steric clashes? In the following sections, we discuss which types of methods we evaluate in our benchmark and how we evaluate each method’s predictions for PLI complex targets.
Method categories
As illustrated in Fig. 1, to explore a range of the most well-known or recent methods to date, we divide PoseBench’s baseline methods into one of three categories: (1) conventional algorithms; (2) DL docking algorithms; and (3) DL cofolding algorithms.
As a representative algorithm for conventional protein–ligand docking, we pair AutoDock Vina (v.1.2.5)55 for molecular docking with P2Rank for protein–ligand binding site prediction56 to form a strong conventional (blind) docking baseline (P2Rank-Vina) for comparison with DL methods. To represent DL docking methods, we include DiffDock-L13 for docking with static protein structures and DynamicBind9 for flexible docking. Lastly, to represent some of the latest DL cofolding methods, we include NeuralPLexer10, RFAA7, Chai-130, Boltz-131 (versus Boltz-233 for the sake of time-split benchmarking validity) and AF311. For interested readers, each method’s input and output data formats are described in Supplementary Appendix F.
Prediction and evaluation procedures
The PLI complex structures that each method predicts are subsequently evaluated using different structural and chemical accuracy and molecule validity metrics depending on whether the targets are primary or multiligand complexes. In ‘Metrics’, we provide formal definitions of PoseBench’s evaluation metrics. Note that if a method’s prediction raises any errors in subsequent scoring stages (for example, due to missing entities or formatting violations), the prediction is excluded from the evaluation.
Primary ligand evaluation
For primary ligand targets, we report each method’s percentage of (top-1) ligand conformations within 2 Å of the corresponding crystal ligand structure (r.m.s.d. ≤ 2 Å), using 1 Å to instead assess whether the predicted ligand’s heavy atom centroid (that is, binding pocket) was correct (c.r.m.s.d. ≤ 1 Å), as well as the percentage of such ‘correct’ ligand conformations that are also considered to be chemically and structurally valid according to the PoseBusters software suite12 (r.m.s.d. ≤ 2 Å and PB-Valid). Importantly, as described in ‘Metrics’, we also report each method’s new PLIF-WM scores to study the relationship between its structural accuracy and chemical specificity.
Multiligand evaluation
Similarly to the protein–ligand scoring procedure employed in the CASP15 competition41, for multiligand targets, we report each method’s (top-1) percentage of ‘correct’ (binding site-superimposed) ligand conformations (r.m.s.d. ≤ 2 Å) as well as violin plots of the r.m.s.d. and PLI-specific lDDT scores of its protein–ligand conformations across all (fragment) ligands within the benchmark’s multiligand complexes (see Supplementary Appendix G for these plots). Notably, this latter metric, referred to as lDDT-PLI, allows one to evaluate specifically how well each method can model protein–ligand structural interfaces. In addition, we report each method’s PB-Valid rates (calculated once for each multiligand complex) and PLIF-WM scores.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
