Peptide array and data processing
Synthetic peptide arrays were synthesized by PepperPrint™ (Germany) using the PEPperCHIP protocol to develop overlapping arrays: VAR2CSA (FCR3) (GenBank: AAQ73926.1, accession: AY372123.1, residues 1-2659), VAR2CSA (NF54) (GenBank: EWC87419.1, accession: KE123842.1, residues 1-2652), DBPII, PcDBP, and PvEBP2. Peptides were 20 amino acids in length and there was a 19 amino acid overlap between peptides in all arrays except VAR2CSA (NF54) for which there was an 18 amino acid overlap.
For each data point, we used the average of the median foreground intensities for each spot (arbitrary units). For each peptide, binding was measured in duplicate (V1 and V2), and the average raw median intensity was used except for when the difference between the lesser of the two values was less than or equal to 40% the larger of the two values.
$${MIN}\left(V1,\,V2\right)\le {MAX}\left(V1,\,V2\right)* 0.4$$
(1)
Additionally, some peptides occurred in both VAR2CSA (FCR3) and VAR2CSA (NF54); in this case, the highest of the two averaged values were taken. This resulted in a total of 4378 data points being left; 215 duplicate data points were identified and the lower raw median intensity replicant was removed. We employed a training-validation split of 1/3 to 2/3 to randomly distribute the data. That is, 2/3 of the datapoints were randomly selected to be involved in training and validation phases of algorithm development and 1/3 of datapoints were selected to test the algorithm.
Model design and selection
The high similarity between peptides with a 19-amino acid overlap from related proteins and the low proportion of 3D10-reactive peptides present important considerations for feature selection model design. Specific training-validation splits can artificially decrease or increase error if reactive peptide examples are distributed asymmetrically. However, to extract as much data as possible on the 3D10 reactivity in our dataset, there is a need to maximize training data. We employed several strategies to minimize the impact of overfitting during machine learning model design and analysis.
We chose mean squared error (MSE) as the loss function for training the algorithm to predict reactive peptides51. In model selection, we preferred the algorithm that had the lowest MSE averaged over multiple random training-validation split instances to minimize bias introduced by a favourable data distribution. We avoided including more complex features that could identify specific peptide(s) and cause overfitting. For example, some dipeptide features are too rare in these proteins to be used to validate their impact on 3D10 reactivity in our dataset. Removal of these noisy features from the feature pool will vary between datasets and antibodies. To support the generalizability of the framework, we relied on additional testing to filter non-predictive features and synthesize more complex motifs/criteria correlated to reactivity. Specifically, feature selection, statistical analysis, and experimental validation were used to generate our final criteria. We used decision trees in our framework because they had the lowest average MSE in our comparisons, were used previously in machine learning applied to peptide array data, and are easily interpretable (Supplementary Fig. S5)52,53. We used GridSearchCV to optimize hyperparameters defined in the DecisionTreeRegressor function in Scikit-learn54. Five-fold cross-validation generated a decision tree with a maximum depth of four, two samples minimum per leaf, five samples minimum to split a node, and had a MSE of 5.9E + 5 and an R2 of 0.82. A randomly generated model from our dataset using these hyperparameters accurately predicted that none of the peptides in PcDBP reached high 3D10 reactivity (Supplementary Fig. S3).
Features
Features employed in machine learning can be placed into six categories: single amino acid counts, dipeptide counts, secondary structure, physicochemical values, charge, and disulphide bond potential (Supplementary Table S1). These values are either whole number, continuous, or binary values. Single amino acid counts are the total counts of each of the 20 different amino acids occurring in protein sequences. The secondary structure employs S4PRED to predict the secondary structure of the recombinant protein sequences, including the elongating “GSGSGSG” N-terminal and C-terminal linkers, and values were taken as the average of residue values for respective peptide sequence segments55. There were three values for secondary structure representing the odds of forming either an alpha-helix, beta-sheet, and a flexible coil. Physicochemical values are based on sums of values for amino acids from 3 different physicochemical scales: sidechain energy, hydropathy, and polarity26,27,56. Charge features were sums of positively charged residues (lysine, arginine, and histidine), negatively charged residues (glutamate and aspartate), and net charge. Disulphide bond potential is a binary feature which is set to ‘1’ if there are two cysteine residues in a peptide with two or more amino acids in between. Dipeptide counts, like single amino acid counts, is a count of total occurrences of each combination of 2 consecutive amino acids; 400 for each combination of 20 amino acids. This feature is adjusted by the hyperparameter ‘peptide trim.’ ‘Peptide trim’ for dipeptide counts defines the number of amino acids to skip before counting dipeptide features. This is employed because of the concern that the C-terminal of the peptides may be less accessible to 3D10 as this is the end attached to the glass slide. We chose not to employ this hyperparameter for other features as dipeptide count is intended to identify patterns of specific amino acids, whereas single amino acid counts, charge, etc., inform the physical properties of the whole peptide, which may affect its flexibility or be used in conjunction with other features. A ‘peptide trim’ of 4 minimized average error.
Feature selection
We employed three methods for feature selection: variance-based feature selection, feature utilization, and recursive feature elimination. Using an ensemble of features accounts for the potential of each method to be biased for missing predictive features and selecting nonpredictive features. Both variance-based feature selection and recursive feature elimination are provided by scikit-learn54. We created the feature selection method, feature utilization, for our project.
Variance-based feature selection is an unsupervised method, meaning it is agnostic to 3D10 reactivity. The goal of this method is to establish which features have high variance and, therefore, are less likely to be selected due to overfitting. Because it is not dependent on 3D10 reactivity, features selected by this method are not sufficient to qualify them for analysis. We used a threshold (\(p\)) of 1.
$${Var}\left|X\right|\ge p$$
(2)
Feature utilization is not available in scikit-learn. Because of the random nature of our training-validation splitting, this method involves quantifying feature usage frequency from the lowest MSE decision trees of 100 tournaments of 100 randomly generated trees with new training-validation splits57. To minimize the impact of favourable training-validation splits caused by similarities between peptides in the whole dataset, we used several small tournaments. If any feature is present in a tournament-winning tree, it is defined as being ‘selected’; we did not validate features based on their frequency of utilization in ‘tournament-winning’ trees. At a tree depth of four, each tree contributes 15 features; hyperparameter optimization may produce a different tree depth for different antibodies which may affect output. This method is not highly restrictive and could include features that enable overfitting. This emphasizes the importance of using multiple methods and qualifying selected features using statistical and experimental analysis.
Recursive feature elimination recursively eliminates features to a given value until a cohort of predictive features is left. Starting with the whole feature set (n features) will eliminate a poorly correlative feature to 3D10 reactivity and yield ‘n-1’ features. It repeats this until it reaches our defined values of 5, 10, or 15 features. Fifteen features represent the maximum number of features that can be applied by a tree with depth 4.
Statistical methods and feature analysis
We used three methods to analyse features in addition to their selection: heatmapping, enrichment, and sequence patterns. For enrichment analysis, we compared 3D10 reactive peptides to 3D10 non-reactive peptides (Supplementary Fig. S1). Using these three methods, we analysed selected features or groups of overlapping features that describe a majority of 3D10 reactive peptides or for which there is a significant difference between 3D10 reactive and 3D10 non-reactive peptides. Overfitting results in some selected features corresponding to very few reactive peptides and not correlating to 3D10 reactivity. Similarly, feature overlap may have resulted in features like ‘YK’ not being selected despite being equivalent to ‘KY’ in our testing, and strongly correlating to 3D10 reactivity, emphasizing the value in grouping and further testing. Data for feature values from this dataset with respective hyperparameters are accessible in a GitHub repository described in our data availability section. For heat mapping, we determined average feature values for each of the 20 positions amino acid positions that span the length of the peptides. As the peptide array contained significantly overlapping peptides, the average value for each position in the non-reactive population was almost identical to that of the average for the whole sequence. This meant that the comparison between non-reactive peptides and reactive peptides was not meaningful and may diminish trends observed in 3D10 reactive peptides if they were reflected in the whole sequence. For example, 3D10 reactive peptides were more likely to contain positively charged residues than non-reactive peptides; the difference in their average positive charge for each residue position does not improve the characterization of positionality of positive residues. The data provided by the difference between 3D10 reactive and non-reactive peptides would be better characterized by enrichment analysis.
Amino acid enrichment compared frequencies of amino acid occurrence in 3D10 reactive peptides and 3D10 non-reactive peptides. Statistical significance was determined by 2-way ANOVA selecting for relevant comparisons and correcting for number of comparisons (only comparing values for each amino acid between reactive and non-reactive peptide groups). For example, to compare three different amino acid occurrences in either group, we corrected for three comparisons.
The method for determining amino acid patterns is similar to the method for sequence analysis known as TMSTAT58. Like TMSTAT, we do not have to assume that the distribution of amino acids across sequences is not anti-segregated or co-segregated relative to the features we are seeking to analyse. Assuming non-anti-segregated and non-co-segregated distribution assumes that amino acid distributions are homogeneous across all populations. We do not need to perform this assumption because our statistical comparisons are between observed odds and the expected odds. The expected odds, in this case, would be the likelihood of a homogeneously distributed population. For two types of amino acids, ‘A’ and ‘B’, we calculated the expected probability of a distance between an ‘A’ amino acid to the closest ‘B’ amino acid in 3D10 reactive or 3D10 non-reactive peptides, \({P}_{{Exp}}\). This value depended on the average number of ‘B’ amino acids in the respective population, \(\bar{B}\). Expected odds for any distance was calculated by bootstrapping. We simulated all possible distributions of 1–5 ‘B’ amino acids across the peptide. We determined the shortest distance between an ‘A’ amino acid to the closest ‘B’ amino acid to develop a reference table (Supplementary Table S3). For each possible distance, \({D}^{X}\), we obtained a probability as a function of \(\bar{B}\) where \(X\) denotes the discrete minimum distance between ‘A’ and the closest ‘B’ amino acid; \(X\in \left\{1,\,2,\,3,\,\ldots ,\left.19\right\}\right.\). This probability function was calculated by performing second-order polynomial regression on the bootstrapped odds.
$${P}_{{Exp}}=f\left({D}^{x}\right|\bar{B})$$
(3)
This resulted in 19 equations for each possible value of \(X\) and enabled the prediction of expected minimum distances between ‘A’ and ‘B’ type amino acids for cases where \({B}\) was not a whole number. The observed probabilities were a proportional count for the occurrence for each distance to occur in the population \({P}_{{Obs}}\). Log odds ratio was obtained by a ratio of observed distance proportions and expected distance probabilities.
$$LN\left(\frac{{P}_{obs}}{{P}_{EXP}}\right)$$
(4)
We used 95% confidence intervals as a significance cut-off and calculated using previously published methods that assume a normal distribution59. With this cut-off, we could establish if the increased likelihood for a particular distance to occur between amino acids of type ‘A’ and ‘B’ was significantly more than expected. For situations where ‘A’ and ‘B’ were the same (like lysine-to-lysine distances), we employed a correction of \(\overline{{B}_{Cor}}=\overline{B}-1\) to account for the fact that the reference lysine serving as ‘A’ will not be contributing to \(\bar{B}\) relative to the expected probability of distances. We ignored cases in which only ‘A’ or ‘B’ were present as distances between them would be indeterminant. For a specific \({D}^{X}\), we excluded plotting data points for which the \({P}_{{Obs}}\) was 0 as this meant that the observed probability was indeterminately less common than \({P}_{{Exp}}\) as the log odds ratio is equal to \(-\infty\). For graphing, we plotted \({D}^{[\mathrm{0,5}]}\) as the 95% confidence interval grows as \({P}_{{Exp}}\to 0\).
Indirect ELISA
We measured the reactivity of 3D10, a mouse IgG mAb, to various proteins by an enzyme-linked immunosorbent assay (ELISA). 96-well Maxisorb plates (catalogue no. 439454; Thermo Fisher Scientific) were coated with synthetic peptides at 10 µg/mL and recombinant proteins at either 1 µg/mL (VAR2CSA) or 0.5 µg/mL (all others) at 50 µL diluted in 1× PBS and incubated overnight at 4 °C. We added 4% BSA for 1 h at 37 °C. We washed wells 4 times with 1X PBST (0.1% Tween 20). Plates were incubated with 50 µL of 3D10 or mouse IgG1 isotype control (catalogue no. MA5-14453; Invitrogen, Waltham, MA) for 1 h at room temperature (RT) and washed four times with 1× PBST. We then incubated samples at RT for 1 h with 50 µL of HRP-conjugated goat anti-mouse HRP (catalogue no. 170-6516; Bio-Rad, Mississauga, Canada). After repeated washes, 50 µL of 3,3′,5,5′-Tetramethylbenzidine (catalogue no. T0440; Sigma-Aldrich) was added and incubated for 30 min or when reactions approached saturation. This consideration was relevant for cases in which the optimized antibody concentration for a wildtype peptide (P1–P5) resulted in saturated optical density (OD) values for mutated peptides that significantly increased reactivity. We stopped reactions with 50 µL H2SO4 (0.5 N) and measured the OD of each well at 450 nm. All samples were tested in duplicate on the same plate and repeated in at least two independent experiments. For ELISAs with proteins treated with DTT, samples were incubated with DTT (10 mM) at 56 °C for 10 min and then added to the plate. ELISAs with DTT treatments were performed in volumes of 100 µL as opposed to 50 µL.
For the ELISAs with peptides, the OD values were corrected by subtracting the average of blank wells from the raw measurements and evaluating if IgG isotype control and secondary antibody control had no reactivity (OD < 0.25). To determine relative binding, OD values for mutated peptides were measured as a ratio to the wildtype peptide from which they are derived:
$${\rm{Relative\; binding}}=\left({{OD}}_{{mutated}}/{{OD}}_{{wildtype}}\right)* 100$$
(5)
For ELISAs with recombinant proteins, the OD for the wells with the isotype control was subtracted from the corresponding value with 3D10.
Peptides were synthesized by PEPMIC (China) at 70% purity. The 3D10 antibody was generously provided by Dr. John Adams. EBA-175 was obtained through BEI Resources, NIAID, NIH: Plasmodium falciparum Erythrocyte Binding Antigen-175 RII-Non-Glycosylated Protein, Recombinant from Pichia pastoris, MRA-1162, contributed by Annie Mo. PfMSP1 was purchased from CTK Biotech, San Diego, California (Catalogue no-A3003). PvMSP1-19 was obtained through BEI Resources, NIAID, NIH: Plasmodium vivax yP30P2-Pv200 MSP1-19 Protein, Recombinant from Saccharomyces cerevisiae, Strain 2905/6, MRA-60, contributed by David C. Kaslow. Full-length VAR2CSA (FV2; FCR3 allele) was generously provided by Dr. Ali Salanti. PcDBP was expressed and purified from E. coli. PvDBP was expressed in E. coli and purified by Mohammad Rafiul Hoque.
Criteria mapping and structure prediction
To map criteria onto 3D structures, we first generated structures of relevant protein segments in AlphaFold 332. For proteins that had high-resolution structures available, we used the experimentally determined structures. Only the EBA-175 structure was sourced from previously resolved structures. We used AlphaFold 3 generated structures because of the unavailability of target protein structures, or, in the case of VAR2CSA, available protein structures were insufficient. More specifically, previously resolved structures for VAR2CSA (FCR3) and (NF54) had several surface-exposed segments excluded60,61. EBA-175 had near-atomic accuracy resolution (<2.5 Å) because the structure was determined by high-accuracy methods: crystallography and solution NMR62,63.
The AlphaFold 3 generated VAR2CSA (FCR3) core region was evaluated against cryo-EM resolved structures for the same area (Supplementary Fig S5). Relative to the pruned RMSD score, these structures were sufficiently similar to one another (RMSD < 3 Å). Low resolution and a high proportion of unstructured regions may have resulted in a higher whole RMSD score (5.894 Å)64,65. Relative to our protein segment of greater than 1400 amino acids, the whole RMSD score suggests high accuracy66.
