GENEOnet: a breakthrough in protein binding pocket detection using group equivariant non-expansive operators

This section introduces the mathematical model that we use to identify pockets. We also explain how the model was trained together with the datasets used for training, model selection and testing.

Data sources

During the training, validation, and testing of GENEOnet, two data sources were utilized:

The PDBbind v.2020 database³⁹. The PDBbind database provides binding affinity data for protein-ligand complexes deposited in the RCSB PDB⁴⁰. The aim of using the PDBbind database is to provide high-quality datasets for drug design methods. A total of 12,295 protein-ligand complexes were retrieved, we will refer to this set as BIND. This set, as explained in the following, is subdivided into three parts:
- TRAIN, which consists of 200 complexes sampled uniformly at random from the whole BIND. During the model selection phase, 200 sets of size 200 are considered as different TRAINs, eventually choosing the optimal one for deployment. To avoid reserving only part of BIND for sampling TRAIN proteins, we decided to sample them from the entire BIND dataset, this choice generates small intersections between each TRAIN and the sets defined in the next points.
- BINDVAL, which consists of 3073 complexes (approximately 25% of the whole BIND), is used for model selection. Each of the TRAIN sets considered during model selection has an average intersection of 50 proteins with BINDVAL. In particular, the TRAIN version that is chosen as optimal has an intersection of 48 proteins with BINDVAL.
- BINDTEST, which initially consists of 9222 complexes (approximately 75% of the whole BIND), is used for a first comparison of the model with other tools from the state-of-the-art. After training and model selection, the 152 molecules contained in the intersection between the chosen TRAIN and BINDTEST are removed. The final BINDTEST is made of 9070 complexes, and it is disjoint from both BINDVAL and the chosen TRAIN.
The RCSB PDB is the largest resource for experimentally determined biomolecular structures, releasing new data daily. From this data source, 41,519 complexes were retrieved. First of all, we removed all the complexes already contained in BINDTEST, moreover, we also removed all the complexes whose ligands are classified as post-translational modifications (PTMs), which are of small interest from a pharmaceutical perspective. After this, we obtained a set of 33,341 complexes that we will denote as BANK, which is completely disjoint from BINDTEST.

Furthermore, to the sets that have been used for comparisons with other models, i.e., BINDTEST and BANK, an additional preprocessing step was applied: all the proteins having sequence identity larger than 80% with any of the proteins in the chosen TRAIN, after the model selection phase, were discarded. This was done to avoid possible bias in favour of GENEOnet and to guarantee a fair comparison with the other methods. Finally, after this step, BINDTEST contains 6854 proteins and BANK 28382.

Figure 1 provides a graphical visualization of the relationships between the diverse datasets that we considered. Furthermore, all protein structures were initially preprocessed using the Schrödinger Protein Preparation Wizard (Schrödinger, The Schrödinger Software. 2020). Exclusively for complexes coming from PDBbind (in particular those of BINDTEST), only those protein chains within a certain distance from the ligand were kept to avoid repetitions of the true pocket. This choice will be better detailed in section “Metrics of interest”.

The GENEOnet model

GENEOnet, whose architecture is depicted in Fig. 2, consists of five steps: a data preparation step, in which the input data are represented as functions defined on a grid of voxels; a GENEO layer, composed by $d=8$ operators which provides a first processing of the input data; a convex combination of the outputs of the GENEO layer, to obtain new GENEOs which provide a second and more refined processing; a thresholding step which gives a spatial prediction of presence or absence of a pocket; an evaluation step of a loss function which compares the output with the ground truth. The loss function is then optimized, using a training set, to identify the unknown parameters. A further step, applied after the training of the model, allows to compute the scoring of the identified pockets, and consequently their ranking. All the steps are described in detail in the following sections.

Data preparation

In the data preparation phase, we represent the protein-ligand complex stored in a PDB file as functions defined in the empty space surrounding the protein, which is discretized using a grid of cubic voxels. For each voxel, we compute approximations of the input functions, or channels $\varphi _i$. As shown in Table 1, such functions reflect a reasoned selection of geometrical, physical and chemical protein properties that are considered to be relevant for pockets detection by medicinal chemistry experts. Auxiliary software called GENEOprep (see Supplementary Information Section 1) was developed to automate the process of computing such channels. The co-crystallized ligand of a protein will be used in the evaluation step to define the true pocket (i.e. the ground truth function $\tau$) for the parameters identification.

Table 1 List of potentials that have been used to build GENEOnet.

GENEO layer

The channels computed in the grid of voxels are then fed to the layer of d GENEOs, $F_1,\dots ,F_d$, one per channel. Each operator $F_i$ is chosen from a specific parametric family, parametrized by a shape parameter $\sigma _i$. These families were designed to reflect the a priori knowledge of the experts of medicinal chemistry about the specific role of the corresponding potentials in the pocket identification. We opted for convolutional operators $F_i(\varphi ) = \varphi *K_i$, where $K_i$ are normalized kernels in $L^1(\mathbb {R}^3)$, symmetric with respect to the origin. This choice ensures that all the operators under consideration are indeed non-expansive and equivariant with respect to translations and rotations of the space. We set the parameters regulating the shape of each kernel, so that, also because of their central symmetry, each convolutional operator depends on a single real parameter $\sigma _i$ only, which regulates the “amplitude” of the kernel itself. For the details about the specific kernels employed, refer to the Supplementary Information, Section 2.

Convex combination

In the fourth step the intermediate GENEO outputs $\psi _i$ are combined through a convex combination, with weights $\alpha _1, \dots , \alpha _d$ in order to obtain a composite operator $F(\cdot ) = \sum _{i=1}^d \alpha _i F_i(\cdot )$, which is a new GENEO for each choice of the parameters. The output of the convex combination is then normalized to obtain the function $\psi$, defined from $\mathbb {R}^3$ to [0,1]. Here $\psi (x)$ can be read as the likelihood that voxel x belongs to a pocket. The coefficients $\alpha _1, \dots , \alpha _d$ can be regarded as feature importance scores, highlighting the importance of each channel in the pocket identification and thus providing a useful tool to explain the results the model delivers.

Thresholding

Finally, given a threshold $\theta \in [0,1]$, we get the different pockets returned by the model by taking connected components of the set of voxels where $\psi$ is above $\theta$. In this way, voxels located inside a pocket are labeled with the sequential number of the connected component they belong to, while they are labeled with 0 if they are not judged to belong to any pocket. To summarize, the model that was described so far has a total of 17 learnable parameters ($\sigma _1, \dots , \sigma _d$, $\alpha _1, \dots , \alpha _d$ and $\theta$).

Evaluation

For each crystallized complex, the ligand has been converted to the binary function $\tau$ that is equal to 1 on the voxels that (possibly partially) overlap with the ligand, and equal to 0 elsewhere. If we call $\widehat{\psi }$ the output of the model after thresholding, then we have to compare it to the ground truth represented by the binary function $\tau$ in order to asses the goodness of the prediction.

Training

In order to learn GENEOnet’s learnable parameters, we choose to optimize a loss function evaluating the volumetric matching of ground-truth $\tau$ and prediction $\widehat{\psi }$. The loss function, which needs to be maximized, is defined below:

$$\begin{aligned} L(\widehat{\psi }, \tau ) = \frac{|\widehat{\psi } \wedge \tau | + k|(\textbf{1} -\widehat{\psi }) \wedge (\textbf{1} -\tau )|}{|\tau | + k|\textbf{1} – \tau |} \in [0,1] \end{aligned}$$

(1)

Here $|\cdot |$ denotes the discretized volume, that is the number of voxels labelled with 1 inside the region, $\widehat{\psi } \wedge \tau$ is a function equal to 1 on the intersection between the prediction $\widehat{\psi }$ and the true pocket $\tau$, $\textbf{1}$ is a constant function equal to 1. All these functions are defined on the voxelized grid built around the protein. The hyperparameter k ranges in [0, 1]. We found that values in the range [0.01, 0.05] produce similar results, all characterized by a relatively small number of pockets of appropriate size (see the Supplementary Information, Section 3). Essentially, the choice of the loss function mitigates the imbalance in the number of voxels labeled as part of the pocket in the ground truth. The optimization of $L(\widehat{\psi }, \tau )$ was performed using Adam optimizer. Random sets of 200 proteins uniformly sampled from BIND were used as training sets. We chose this size since empirical evidence showed that increasing the size of the training set did not significantly impact parameter estimates (see the Supplementary Information, Section 6).

Pocket scoring

In medicinal chemistry, identifying and prioritizing potential ligand-binding pockets is pivotal. This process, essential for streamlining virtual screening, involves evaluating pockets based on their potential to accommodate ligands. Although GENEOnet, as described till now, is able to detect pockets, it does not prioritize them. We can refine this by deriving scores from GENEOnet’s pre-thresholding output. These scores are calculated by averaging the function $\psi (x)$ across voxels within each pocket and adjusting for pocket volume to prevent bias towards smaller pockets. This scoring yields a prioritized set of pockets, facilitating focused and efficient subsequent analyses.

Metrics of interest

To compare and select different models, we are interested in metrics that express the ability of a model to assign the highest score to the pocket that matches the true one. First of all, we say that a predicted pocket matches the true pocket if it has the largest overlap with the ground truth. By overlap, we mean the ratio between the discretized volume of the intersection and the volume of the true pocket. If no predicted pocket has an intersection with the true one, we say that the method failed on that protein. In this manner, when provided with a dataset of proteins, we can calculate a sequence of coefficients $H_{n + j}$ for $j \ge 0$. Fixing n as the protein-dependent number of true pockets, $H_n$ is the fraction of proteins whose ligand is identified within the first n-th predicted pockets. On the other hand, $H_{n + j}$ for $j \ge 1$ are the proportions of proteins whose ligand is identified by the $(n + j)$-th predicted pocket. Moreover, we can also consider cumulative sums of these proportions, we generate another sequence of coefficients $T_{n + j}$ for $j \ge 0$. We have that $T_n = H_n$ while if $j \ge 1$ then $T_{n + j} = \sum _{i = 0}^{j} H_{n+i}$ represents the fraction of proteins whose true pocket has been successfully recognized within the first $(n+j)$-th predicted pockets. See the Supplementary Information, Section 4, for the formal definitions of the $H_{n+j}$ and $T_{n+j}$ coefficients. In this way, different methods can be compared as follows: if a method shows higher $T_{n}$ for all $j \ge 0$ then it is the best method. In an optimal situation, we would like to have a model with $H_n = 1.0$ and $H_{n+j} = 0.0$ for every $j \ge 1$. We want to remark that in the case each protein has exactly one true pocket (as in the case of BINDTEST due to the applied preprocessing), we can safely consider $H_1, H_2, \dots$, since $n = 1$ for every protein.

We also consider additional metrics that measure the distance between the predicted pocket centroid and the ground truth. In particular, DCA/PPC is the minimal distance between the ground truth and the centroid of the prediction; on the other hand, DCC is the distance between the centroid of the ground truth and that of the prediction. In literature²⁴, usually predictions with DCA/DCC values lower than 4Å are considered successful, and the fractions of successes at different thresholds are usually plotted. Although reporting the results for such metrics, we notice that, differently from the overlap-related ones, DCA and, more importantly, DCC do not take into account the shapes of the ground truth (which may be smaller than the actual pocket) and the predictions. Thus, they shouldn’t be trusted too much in the presence of non-convex pockets or pockets much larger than the ligand; they are well suited to evaluate small and ellipsoid-like pockets.

Model selection

To assess the reliability of the estimation process and determine the most accurate model for ranking pockets, the loss function L was optimized 200 times with different TRAIN sets, all of size 200, each time starting from the same initial guess for the model parameters. For each trained model, BINDVAL was used to calculate the $H_1, H_2, \dots$ coefficients; such results are reported in Supplementary Information, Section 7. Finally, the model having the highest $H_1$ coefficient on BINDVAL was chosen for deployment. The TRAIN set corresponding to such a model, as already stated in section “Data sources”, has a small intersection of size 48 with BINDVAL. We consider this intersection to be negligible, due to the relative size compared to BINDVAL and the fact that BINDVAL is only used to select the optimal version of GENEOnet from the 200 trained instances, not for comparison with other models. The chosen version of TRAIN will also be used in the following in the ablation study described in section “Ablation study”. Moreover, the optimal parameter values of the selected model and additional details about their interpretation can be found in Supplementary Information, Section 5.

Software and hardware details

The implementation of GENEOnet was achieved through the integration of two distinct software components. The first component comprises a C library, dedicated to computing the potential functions listed in Table 1, while the second module is written in Python. The Python code initiates the computation process by invoking the GENEOprep auxiliary software and the C library via a Cython extension, which enables the calculation of protein potentials. Subsequently, the resulting potentials are passed as input into the GENEOnet network, whose architecture was developed using PyTorch. In terms of computational efficiency, we report that running the optimization algorithm for 50 epochs on the TRAIN dataset results in a processing time of approximately six minutes when performed on a laptop with an NVIDIA GeForce RTX 3060 GPU. In contrast, the same task takes approximately 40 minutes to complete using only the CPU on the same laptop having an 8-core Intel^® Core^TM i7-10870H processor.

Source link