Proteomes of extinct organisms
We collected extinct organisms from the NCBI taxonomy browser (https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=extinct, access time: December 2021). For each species, we checked the corresponding Entrez records and downloaded the available protein sequences. In total, we retrieved 208 extinct species and a total of 12,860 protein sequences (5,190 non-redundant protein sequences) from them.
Modern human proteome
To construct the human proteome, we downloaded 20,388 reviewed Homo sapiens proteins (20,307 unique ones) from UniProt (https://www.uniprot.org/).
In-house peptide dataset
We utilized our high-quality in-house peptide dataset to train and evaluate APEX. In total, the dataset contained 14,738 antimicrobial activity measurements obtained by determining the MIC of 988 peptides and 34 bacterial strains, including the following, which were used to train our model: E. coli ATCC 11775, P. aeruginosa PAO1, P. aeruginosa PA14, S. aureus ATCC 12600, E. coli AIC221, E. coli AIC222, K. pneumoniae ATCC 13883, A. baumannii ATCC 19606, Akkermansia muciniphila ATCC BAA-835, Bacteroides fragilis ATCC 25285, Bacteroides vulgatus (Phocaeicola vulgatus) ATCC 8482, Collinsella aerofaciens ATCC 25986, Clostridium scindens ATCC 35704, Bacteroides thetaiotaomicron ATCC 29148, B. thetaiotaomicron ∆tdk ∆lpxF (background: VPI 5482)38, Bacteroides uniformis ATCC 8492, Bacteroides eggerthi ATCC 27754, Clostridium spiroforme ATCC 29900, Parabacteroides distasonis ATCC 8503, Prevotella copri DSMZ 18205, Bacteroides ovatus ATCC 8483, Eubacterium rectale ATCC 33656, Clostridium symbiosum ATCC 14940, Ruminococcus obeum ATCC 29174, Ruminococcus torques ATCC 27756, methicillin-resistant S. aureus ATCC BAA-1556, vancomycin-resistant Enterococcus faecalis ATCC 700802, vancomycin-resistant E. faecium ATCC 700221, E. coli Nissle 1917, Salmonella enterica ATCC 9150 (BEIRES NR-515), S. enterica (BEIRES NR-170), S. enterica ATCC 9150 (BEIRES NR-174) and Listeria monocytogenes ATCC 19111 (BEIRES NR-106).
Inactive data points, that is, MIC values higher than 128 μmol l−1, were labelled 140 μmol l−1. All antimicrobial activities were transformed by \(-{\log }_{10}\frac{\rm{MIC}\; \rm{value}}{1,000,000}\) and were treated as labels to be predicted in the ML setting. To perform hyperparameter tuning and prediction performance evaluation, our in-house dataset was randomly split into a CV set and an independent set, which consisted of 790 and 198 peptides (that is, an 80%:20% split), respectively. Here the CV set was used to determine the optimal hyperparameters for ML models. ML models trained with determined hyperparameters on the CV set were evaluated on the independent set to measure their generalizability.
Publicly available AMP sequences
We augmented the peptide training data by incorporating publicly available AMPs and non-AMPs into our deep learning model training. Public AMP data were retrieved from DBAASP17. Peptide sequences that consisted of only the 20 canonical or unknown amino acid residues were selected. The unknown amino acids were denoted by X, which corresponds to any possible canonical amino acid residues usually present within proteins having isoforms whose composition was undetermined because there were issues with metagenomic studies. Any peptide sequences that overlapped with our in-house peptide database were removed. As a peptide may have multiple MIC values for different bacterial species, we used the median MIC value to binarize the data. By using a stringent cut-off (that is, AMPs with MIC ≤ 30 μmol l−1), we created a balanced binary classification dataset (5,093 AMPs and 5,500 non-AMPs) for data augmentation and model training. To compare physicochemical properties, 14,114 peptides consisting of 20 canonical amino acids and having sequence length ≥4 amino acid residues were retrieved. We labelled this group of peptides as the DBAASP dataset.
Physicochemical properties of peptides
To analyse the physicochemical properties of all peptide datasets (DBAASP, EPs generated by the scoring function16 and compounds identified by APEX), we used the DBAASP server to calculate the following 12 physicochemical properties that are usually considered in the design and study of peptide antibiotics17: normalized hydrophobic moment, normalized hydrophobicity, net charge, isoelectric point, penetration depth, tilt angle, disordered conformation propensity, linear moment, propensity to aggregation in vitro, angle subtended by the hydrophobic residues, amphiphilicity index and propensity to poly proline II (PPII) helix. We used the Eisenberg and Weiss scale (the consensus scale) as the hydrophobicity scale39.
Secreted protein labelling
As proteins from extinct organisms are not as well annotated as those from extant organisms, we resorted to Orthologous Matrix (OMA)40 and DeepGOWeb41 to predict Gene Ontology (GO) terms from protein sequences. Given a protein sequence, if any of its GO terms predicted by OMA or DeepGOWeb corresponds to an extracellular region (GO:0005576) or a child of extracellular region in a GO-directed acyclic graph, then we considered this sequence to be secreted. Among proteins collected from extinct organisms, 157 sequences were labelled as secreted.
Peptide sequence encoding
We treated a peptide as a sequence of amino acids. We further added two special characters as start (that is, ‘1’) and terminal symbols (that is, ‘2’) to the beginning and the end of this sequence, respectively. For each amino acid within the sequence, we used the AAindex42, which is a 566-dimensional vector storing various physicochemical and biochemical properties of each amino acid to represent it. Non-amino acid symbols and unknown amino acids were represented by 566-dimensional zero vectors. In this work, we only considered peptides shorter than 50 residues to ensure that they could be synthesized by solid-phase peptide synthesis. We created a fixed size input by zero-padding each sequence to maximum length, so that each peptide sequence could be represented by a matrix \({\boldsymbol{x}}\in {{\mathbb{R}}}^{52\times 566}\) (52 = maximum peptide length + two special characters).
Bacterial distance
The bacterial taxonomy tree (bac120.tree) was downloaded from the Genome Taxonomy Database43. The phylogenetic distance matrix \({\boldsymbol{D}}{{\mathbb{\in }}{\mathbb{R}}}^{\rm{g\times g}}\), which stores distance between bacterial species, was calculated via the Python package DendroPy44. Here, g denotes the number of bacterial species. We converted the distance matrix D to a bacterial similarity matrix \({\boldsymbol{P}}{{\mathbb{\in }}{\mathbb{R}}}^{\rm{g\times g}}\) using the following function:
$${\boldsymbol{P}}=e^{\frac{-{\boldsymbol{D}}}{\rm{median}({\boldsymbol{D}})}},$$
where median(D) stands for the median value of matrix D. We further binarized the similarity matrix P using the K-nearest neighbors algorithm, in which K was set to a heuristic number \({\rm{Ceiling}}(\frac{\sqrt{g}}{2})\). Here, Ceiling(·) stands for the ceiling function.
APEX encoder architecture
APEX encoder started from a recurrent neural network (RNN) to process peptide sequence input x and extract its hidden features, \({{\boldsymbol{h}}}_{{\rm{rnn}}}\in {{\mathbb{R}}}^{52\times \rm{n}}\):
$${{\boldsymbol{h}}}_{{\rm{intermediate}}}={\rm{GRU}}\left({{\boldsymbol{x}}}\right),$$
$${{\boldsymbol{h}}}_{{\rm{rnn}}}={\rm{Layer}}{{\_}}{\rm{normalization}}\,({{\boldsymbol{h}}}_{{\rm{intermediate}}}),$$
where n and GRU(·) denote the hidden feature dimension and gated recurrent unit45, respectively. In addition, we added a layer normalization46 to stabilize the model training. On top of the RNN, we designed a two-layer attention neural network to better model feature (that is, amino acids) interactions globally and compress the hidden features to a lower-dimensional representation, respectively. Specifically, the first attention layer has the following form:
$${{\boldsymbol{a}}}_{{\bf{1}}}={\rm{softmax}}\left({\rm{concat}}\left({{\boldsymbol{h}}}_{{\rm{rnn}}},{\boldsymbol{x}}\right)\times {{\boldsymbol{W}}}_{{\rm{att}}{{1}}}\right),$$
$${{\boldsymbol{h}}}_{{\rm{att}}{{1}}}={{\boldsymbol{a}}}_{{{1}}}^{{\rm{T}}}\times {{\boldsymbol{h}}}_{{\rm{rnn}}},$$
where \({\rm{concat}}\left({{\boldsymbol{h}}}_{{\rm{rnn}}}{{,}}\;{\boldsymbol{x}}\right){\boldsymbol{\in }}{{\mathbb{R}}}^{52\times (n+566)}\) stands for concatenation operation along feature dimension and can be considered as a residual connection47, \({{\boldsymbol{W}}}_{{\rm{att}}{{1}}}{{\in }}{{\mathbb{R}}}^{(n+566)\times 52}\) is the learnable weights in this attention layer, \({\rm{softmax}}{{(}}\bullet {{)}}\) stands for the softmax function, \({{\boldsymbol{a}}}_{{{1}}}{\boldsymbol{\in }}{{\mathbb{R}}}^{52\times 52}\) denotes the attention weights and \({{\boldsymbol{h}}}_{{\rm{att}}{{1}}}{\boldsymbol{\in }}{{\mathbb{R}}}^{52\times n}\) is the output of this attention layer. For the second attention layer, it can be written as follows:
$${{\boldsymbol{a}}}_{{{2}}}={\rm{softmax}}\left({{\boldsymbol{h}}}_{{\rm{att}}{{1}}}\times {{\boldsymbol{w}}}_{{\rm{att}}{{2}}}\right),$$
$${{\boldsymbol{h}}}_{{\rm{att}}{{2}}}={{\boldsymbol{a}}}_{{{2}}}^{{\rm{T}}}\times {{\boldsymbol{h}}}_{{\rm{att}}{{1}}},$$
where \({{\boldsymbol{w}}}_{{\rm{att}}{{2}}}{\boldsymbol{\in }}{{\mathbb{R}}}^{n\times 1}\) is the learnable weights in this attention layer, \({{\boldsymbol{a}}}_{{{2}}}{{\in }}{{\mathbb{R}}}^{52\times 1}\) denotes the attention weights and \({{\boldsymbol{h}}}_{{\rm{att}}{{2}}}{\boldsymbol{\in }}{{\mathbb{R}}}^{1\times n}\) is the output of the second attention layer. In addition, we used a learnable linear transformation with weight matrix \({{\boldsymbol{W}}}_{{{fc}}}{{\in }}{{\mathbb{R}}}^{n\times m}\) and bias term \({{\boldsymbol{b}}}_{{{fc}}}{{\in }}{{\mathbb{R}}}^{1\times m}\) to create the final hidden representation \({\boldsymbol{h}}{{\in }}{{\mathbb{R}}}^{1\times m}\) for a peptide:
$${\boldsymbol{h}}={{\boldsymbol{h}}}_{{\rm{att}}{{2}}}{\times {\boldsymbol{W}}}_{\rm{{fc}}}+{{\boldsymbol{b}}}_{\rm{{fc}}}.$$
APEX to predict antimicrobial activity
The hidden representation h of a peptide generated by the encoder above could be fed into two separate FCNNs that predict species-specific antimicrobial activity or a binary AMP/non-AMP label, respectively. For convenience of hyperparameter tuning, both FCNNs were implemented as a four-layer pyramid-like architecture:
$${{\boldsymbol{h}}}_{\rm{l,s}}={\rm{ReLU}}({\rm{Layer}}{{\_}}{\rm{normalization}}\,({\boldsymbol{h}}_{\rm{l-1}}{\times {\boldsymbol{W}}}_{\rm{l,s}}+{{\boldsymbol{b}}}_{\rm{l,s}})),$$
where \(s\in \{{\rm{in{\hbox{-}}house}},{\rm{public}}\}\) denotes the training dataset for the FCNN, \(l\in \{{1,2,3,4}\}\) denotes the layer index (note that \({{\boldsymbol{h}}}_{{{0}}}{{=}}{\boldsymbol{h}}\)) and \({{\boldsymbol{W}}}_{\rm{l,s}}\) and \({{\boldsymbol{b}}}_{\rm{l,s}}\) are the weight matrix and bias term of the lth layer, respectively. At the lth layer, a linear transformation was first performed and followed by a layer normalization, a nonlinear transformation using rectified linear unit48 (ReLU). In addition, if the lth layer is not an output layer, a dropout layer49 that randomly set the input value to 0 with probability P was added to its output side (we empirically set P = 0.1). The output dimensions of hidden layers (that is, \({{\boldsymbol{h}}}_{1,{\rm{s}}}\), \({{\boldsymbol{h}}}_{2,\rm{s}}\) and \({{\boldsymbol{h}}}_{3,\rm{s}}\)) were set as \(k,\frac{k}{2}\) and \(\frac{k}{4}\), respectively. The FCNN that was trained on our in-house data adopted a multitask learning strategy to predict species-specific antimicrobial activity. Suppose there are g bacterial species (that is, 34 in our context), the corresponding output \({{\boldsymbol{h}}}_{4,{\rm{in}}{\hbox{-}}{\rm{house}}}\) is a g-dimensional vector, in which each element is a predicted antimicrobial activity against a certain bacterial strain. The FCNN that was trained on public AMP data only outputted a scaler value \(\in [\mathrm{0,1}{{]}}\) indicating the probability of the input peptide to be antimicrobial.
The loss function for training the FCNN that performed binary classification was binary cross-entropy \({l}_{\rm{BCE}}\). For the other FCNN, the loss function for predicting species-specific antimicrobial activity was the mean squared error:
$${l}_{\rm{MSE}}=\frac{1}{g}\mathop{\sum }\limits_{i=1}^{g}\frac{1}{{d}_{\rm{i}}}\mathop{\sum }\limits_{j=1}^{{d}_{\rm{i}}}{\bf{Mask}}\left(i,j\right)\times{[{{\boldsymbol{h}}}_{{{4}}{{,}}{\rm{in}}{\hbox{-}}{\rm{house}}}(i,j)-{\boldsymbol{y}}(i,j)]}^{2},$$
where di denotes the number of training data points for ith bacterial strain; \({{\boldsymbol{h}}}_{{{4}}{{,}}{\rm{in}}{\hbox{-}}{\rm{house}}}\left(i,j\right),\) \({\boldsymbol{y}}(i,j)\) and \({\bf{Mask}}\left(i,j\right)\) are the predicted antimicrobial activity, the experimentally validated antimicrobial activity and the binary mask (1 for having antimicrobial activity, and 0 for not tested) between ith bacterial strain and jth peptide. In addition to these two loss functions on AMP prediction, we further imposed a constraint loss on the weights of output layer in the species-specific AMP prediction FCNN. Given a bacterial distance matrix \({\boldsymbol{P}}{\boldsymbol{\in }}{{\mathbb{R}}}^{g\times g}\) and the weights \({{\boldsymbol{W}}}_{4,{\rm{in}{\hbox{-}}{house}}}\) that we want to regularize, the constraint loss can be written as follows:
$${{\boldsymbol{D}}}_{\rm{task}}=1-{\rm{cosine}{{\_}}{similarity}}({{\boldsymbol{W}}}_{4,{\rm{in}{\hbox{-}}{house}}}{{)}}$$
$${l}_{\rm{multitask}{{\_}}{constrain}}=\frac{1}{2}\mathop{\sum }\limits_{i=1}^{g}\mathop{\sum }\limits_{j=1}^{g}{\boldsymbol{P}}\left({{i}}{{,}}{{j}}\right){\times}{{\boldsymbol{D}}}_{\rm{task}}(i,j),$$
where \({\rm{cosine}}\_{\rm{similarity}}(\bullet)\) calculates the pairwise cosine similarity between two rows of the given input matrix, and matrix \({{\boldsymbol{D}}}_{\rm{task}}{{\in }}{{\mathbb{R}}}^{g\times g}\) stores the pairwise cosine distance between two learnable weights of two predictors. Intuitively, if two tasks are similar, their predictors should also be similar (that is, learnable weights have shorter distances). Adding \({l}_{\rm{multitask}\_\rm{constrain}}\) to the loss function encourages similar bacterial strains to have similar predictors and outputs. Taken together, the final loss function has the following form:
$$L={{l}_{\rm{MSE}}+{\lambda }_{\rm{BCE}}{l}_{\rm{BCE}}+{\lambda }_{\rm{multitask}{{\_}}{constrain}}l}_{\rm{multitask}{{\_}}{constrain}}+{\lambda }_{l2}{\rm{l}}_{2},$$
where l2 is the L2 regularization for constraining deep learning model complexity, and \({\lambda }_{\rm{BCE}}\), \({\lambda }_{\rm{multitask}\_\rm{constrain}}\) and \({\lambda }_{l2}\) are the weight parameters that balance different types of loss. To train the deep learning models, we used mini-batch training with an Adam optimizer50. Specifically, at each iteration, we selected B peptides from the in-house dataset and the same number of peptides from public AMP data we curated to perform feed-forwarding pass and back propagation. The training terminated when the procedure iterated the whole in-house dataset 5,000 times. The learning rate of the Adam optimizer was empirically set to 0.0001 and was scheduled to decay 10 times every thousand training epochs. Batch size B was empirically set to 128.
Baseline methods
We compared the prediction performance of APEX to that of several baseline ML models, including elastic net, linear support vector regression, extra-trees regressor, random forest and gradient boosting decision tree. For the baseline models, we represented each peptide sequence by the following features: (1) k-mer (that is, frequency of k-residue substrings, where k = 1, 2 and 3) and (2) 10 peptide properties calculated by modlAMP (version 4.3.0)51, including sequence length, molecular weight, sequence charge, charge density, isoelectric point, instability index, aromaticity, aliphatic index, Boman index and hydrophobic ratio. Note that for some of the bacterial strains, the trained Elastic Net outputted a constant prediction regardless of the peptide inputs. In this case, the Pearson and Spearman correlations could not be calculated, and we used 0 as pseudo correlation.
Hyperparameter tuning, model evaluation and ensemble learning
We conducted a fivefold CV on the CV set to select the hyperparameters for deep learning and baseline models. Specifically, the fivefold CV split the whole dataset evenly into five groups. At each time, one group was selected as the test dataset, while the rest were used for ML model training. We used averaged R2, Pearson correlation coefficient and Spearman’s rank correlation coefficient under fivefold CV to evaluate the prediction performance on the test set. Grid search was used to find the best hyperparameters (see Supplementary Tables 1–4 for the hyperparameter range we searched). Hyperparameters were ranked by the averaged R2 under CV. For baseline methods, we determined the best hyperparameters to be those resulting in the highest R2 and trained the ML models with the selected hyperparameters. The trained models were then evaluated on the independent dataset. For APEX, we adopted an ensemble learning strategy. Specifically, we averaged the prediction results from the top eight APEX models. After plotting the prediction performances versus the number of deep learning models involved in the ensemble learning (Supplementary Figs. 8–10), we observed that the elbow region (that is, the area where the curve becomes smaller) was around 5–9 APEX models. After this step, improvement on prediction performance gradually became negligible. This observation led to the conclusion that we should average prediction results from no more than nine APEX models. We decided to select eight APEX models for ensemble learning, given our computational resources (that is, eight graphics processing units were available). To counter the potential stochastic behaviour during mini-batch training and to make prediction results more robust, we trained five copies of an APEX model with the same hyperparameters under different random seeds. In total, we trained 40 APEX models (that is, 8 different hyperparameters × 5 different random seeds) and used the averaged predictions on the independent dataset for prediction performance evaluation. After the performance evaluation, we retrained the 40 APEX models on the entire in-house dataset and used the averaged antimicrobial activity prediction values from the trained models to discover EP sequences within extinct organisms.
Selectivity score
As APEX was designed to predict species-specific antimicrobial activity, we defined the following two selectivity scores that quantify the peptides’ ability to specifically target Gram-positive or Gram-negative bacteria:
$$\begin{array}{l}{\rm{Gram}}-{\rm{positive}}\,{\rm{selectivity}}\,{\rm{score}}\\=\frac{\rm{Median}\,\rm{MIC}(\rm{Gram}-\rm{positive}\,\rm{pathogens})}{\rm{Median}\,\rm{MIC}(\rm{Gram}-\rm{negative}\,\rm{pathogens})},\end{array}$$
$$\begin{array}{l}{\rm{Gram}}-{\rm{negative}}\,{\rm{selectivity}}\,{\rm{score}}\\=\frac{\rm{Median}\,\rm{MIC}(\rm{Gram}-\rm{negative}\,\rm{pathogens})}{\rm{Median}\,\rm{MIC}(\rm{Gram}-\rm{positive}\,\rm{pathogens})},\end{array}$$
where Median MIC(·) calculates the median value from a given input list. The input list consisted of the Gram-positive pathogens S. aureus ATCC 12600, methicillin-resistant S. aureus ATCC BAA-1556, vancomycin-resistant E. faecalis ATCC 700802 and vancomycin-resistant E. faecium ATCC 700221 and the Gram-negative pathogens P. aeruginosa PAO1, P. aeruginosa PA14, E. coli ATCC11775, E. coli AIC221, E. coli AIC222, K. pneumoniae ATCC 13883 and A. baumannii ATCC 19606. A selectivity score of <1.0 means that the median MIC of the target bacteria (numerator term) is smaller than that of the off-target bacteria (denominator term), yielding a selective peptide sequence. Thus, the closer to 0, the better is the selective activity towards the specific bacterial targets.
Sequence similarity score
Given two peptide sequences i and j, we used the Smith–Waterman algorithm20 to calculate their sequence alignment score \({{\rm{SW}}(i,j)}\). The sequence similarity score between these two peptides was defined as the normalized alignment score: \(\frac{{\rm{SW}}(i,\;j)}{\sqrt{{\rm{SW}}\left(i,i\right) \times {{\rm{SW}}(j,\;j)}}}\in [\mathrm{0,1}]\). A higher score reflects higher sequence similarity between two peptides than a lower score.
Antibiotic peptide screening and selection
Given the proteome of extinct organisms, we considered as EP sequence substrings ranging from 8 to 50 amino acid residues. In total, paleoproteome mining yielded 10,311,899 unique molecules from extinct proteomes, of which 771,431 sequences came from secreted proteins. EPs from secreted proteins would have had a higher likelihood of encountering bacterial cells, which are mostly found outside host cells, compared to sequences from non-secreted proteins. Nevertheless, most EPs came from non-secreted proteins and consequently represented a more abundant peptide source. We hypothesized that these non-secreted proteins might also contain antibiotic-like substrings. Therefore, in this work, we selected, synthesized and validated EPs originating from both secreted and non-secreted proteins.
As we used 40 APEX models for activity prediction (that is, ensemble APEX v2 or APEX in the main text), we averaged the prediction results from these 40 models to provide a final predictive output for each peptide. Based on these APEX predictions, we used multiple criteria to select EPs from our predictions for downstream validation. First, we mainly focused on EPs that could target a subset of the eleven pathogens that were listed in the ‘Selectivity score section’ (that is, E. coli ATCC 11775, P. aeruginosa PAO1, P. aeruginosa PA14, S. aureus ATCC 12600, E. coli AIC221, E. coli AIC222, K. pneumoniae ATCC 13883, A. baumannii ATCC 19606, methicillin-resistant S. aureus ATCC BAA-1556, vancomycin-resistant E. faecalis ATCC 700802 and vancomycin-resistant E. faecium ATCC 700221). Given the entire peptide list, we first generated two EP lists; one with sequences predicted to selectively target Gram-positive and another with sequences predicted to target Gram-negative pathogens. To this end, the peptides in these lists were ranked increasingly by the Gram-positive selectivity score and the Gram-negative selectivity score, respectively. We then generated another peptide list by ranking the peptides based on their median MIC predictions for the 11 pathogens of interest, indicating the level of broad-spectrum antimicrobial activity of the peptides.
As a result, three lists were generated for EPs derived from non-secreted proteins, and another three lists were generated for EPs derived from secreted proteins, corresponding in each case to the following: (a) EPs predicted to selectively target Gram-positive bacteria, (b) EPs predicted to selectively target Gram-negative bacteria and (c) EPs predicted to have broad-spectrum activity. The six lists were filtered by the following criteria:
-
1.
The selected EP sequences were restricted to a range between 8 and 30 residues long. Selection of shorter sequences within this range facilitated chemical synthesis.
-
2.
Sequences that were present in the modern human proteome were excluded.
-
3.
Sequences that were present in our in-house dataset, including both natural and synthetic sequences, were excluded.
-
4.
EP candidates with more than 75% sequence similarity to peptides from DBAASP were considered analogues or derivatives of such peptides and were excluded.
-
5.
If two candidate EPs presented more than 75% sequence similarity, the one ranked higher by APEX (that is, lower median MIC or better selectivity score) was kept, and the analogue/derivative was excluded, encouraging diversity of the identified sequences.
-
6.
Six lists were generated, each containing up to 1,000 top candidate EP sequences. The lists contained peptide sequences that were predicted to be broad spectrum, selective towards Gram-negative and selective for Gram-positive bacteria. These sequences derived from non-secreted proteins or secreted proteins were present in extinct proteomes. The broad-spectrum lists were ranked by median MIC, while the selectivity lists were ranked by the selectivity score.
-
7.
For the broad-spectrum lists, candidate EPs with median MIC predictions of ≥80 μmol l−1 were excluded as they were deemed inactive. For peptide lists selective towards Gram-positive or Gram-negative bacteria, sequences with selectivity scores ≥0.75 were excluded as they were considered not sufficiently selective.
After applying the seventh filter, a total of 3,784 unique candidate EPs were left from the initial 10,311,899 sequences derived from extinct proteomes. Note that before selecting the peptides for synthesis, we grouped the candidate EPs by their source organisms. We selected the top one to five EPs (that is, more active or selective depending on which list they were from) taking into account species and sequence diversity (Supplementary Dataset 2). In summary, 21 EPs from secreted extinct proteins were selected for synthesis and subsequent experimental validation: 4 of these were predicted to selectively target Gram-positive bacteria, 10 to selectively target Gram-negative bacteria and 7 to have broad-spectrum activity. Another set of 48 EPs from non-secreted extinct proteins was also selected for downstream experimental validation: 19 that were predicted to selectively target Gram-positive bacteria, 10 that were predicted to selectively target Gram-negative bacteria and 19 that were predicted to show broad-spectrum activity. Therefore, a total of 69 peptides (21 from secreted extinct proteins and 48 from non-secreted extinct proteins) out of the 3,784 sequences were selected, synthesized and validated experimentally.
Classification guidelines to identify AEPs
As de-extinct sequences have not been previously identified, we developed our own classification guidelines to determine whether a particular peptide sequence qualified as truly de-extinct, that is, not present in available proteomic data from extant organisms. As EP sequences from the proteomes of extinct organisms may also be found in modern organisms, we used the following procedure to classify an EP sequence as an AEP: First, we accessed the NCBI taxonomy browser (https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=extinct, access time: December, 2022) to get the most up-to-date taxonomy IDs of extinct organisms, and this was our sole source of extinct proteins. Next, we created a protein sequence set containing all possible organism sources by downloading protein sequences and their corresponding taxonomy IDs from Reviewed (Swiss-Prot), Unreviewed (TrEMBL) and Isoform sequences at UniProt52 (https://www.uniprot.org/help/downloads). This constituted our source of extant proteins. We excluded from the protein sequence set those protein sequences whose taxonomy IDs belonged to extinct organisms. We labelled the resulting set as ‘extant protein set’ as it contained protein sequences from extant organisms. If a given EP sequence was not present in any protein sequence from the available extant protein set, we defined it as an AEP. Otherwise, the EP was considered as a MEP.
Peptide synthesis
All peptides were synthesized by solid-phase peptide synthesis using the Fmoc strategy and purchased from AAPPTec.
Bacterial strains and growth conditions used in the experiments
The following Gram-negative bacteria were used in our study: A. baumannii ATCC 19606, E. coli ATCC 11775, E. coli AIC221 (E. coli MG1655 phnE_2::FRT), E. coli AIC222 (E. coli MG1655 pmrA53 phnE_2::FRT (colistin resistant)), K. pneumoniae ATCC 13883, P. aeruginosa PAO1 and P. aeruginosa PA14. The following Gram-positive bacteria were also used in our study: S. aureus ATCC 12600, S. aureus ATCC BAA-1556 (methicillin-resistant strain), E. faecalis ATCC 700802 (vancomycin-resistant strain) and E. faecium ATCC 700221 (vancomycin-resistant strain). Bacteria were grown from frozen stocks and plated on Luria–Bertani (LB) or Pseudomonas isolation agar plates (P. aeruginosa strains) and incubated overnight at 37 °C. After the incubation period, a single colony was transferred to 5 ml of LB medium, and cultures were incubated overnight (16 h) at 37 °C. The following day, an inoculum was prepared by diluting the overnight cultures 1:100 in 5 ml of the respective media and incubating them at 37 °C until bacteria reached logarithmic phase (OD600 = 0.3–0.5).
Antibacterial assays
The in vitro antimicrobial activity of the peptides was assessed by using the broth microdilution assay36. MIC values of the peptides were determined with an initial inoculum of 2 × 106 cells ml−1 in LB in microtitre 96-well flat-bottom transparent plates. Aqueous solutions of the peptides were added to the plate at concentrations ranging from 1 to 64 μmol l−1. The lowest concentration of peptide that inhibited 100% of the visible growth of bacteria was established as the MIC value in an experiment of 20 h of exposure at 37 °C. The optical density of the plates was measured at 600 nm using a spectrophotometer. All assays were done as three biological replicates.
Outer membrane permeabilization assays
The membrane permeability of the peptides was determined by using the NPN uptake assay16. NPN is a hydrophobic fluorescent dye that does not readily permeate the bacterial outer membrane. However, when the membrane integrity is compromised, NPN can enter the cell and bind to the bacterial membrane lipids. This causes the dye to show a strong fluorescence. A. baumannii ATCC19606 and P. aeruginosa PAO1 were grown (OD600 = 0.4), centrifuged (10,000 r.p.m. at 4 °C for 10 min), washed and resuspended in buffer (5 mmol l−1 HEPES, 5 mmol l−1 glucose, pH 7.4). NPN solution (4 μl at the working concentration of 10 mmol l−1 after dilution) was added to 100 μl of the bacterial solution in a white 96-well plate. The fluorescence was recorded at λex = 350 nm and λem = 420 nm. Aqueous solutions of the peptides (100 μl final volume at their MIC against the strain of interest) were added to a white 96-well plate, and fluorescence was recorded for 45 min after no further increase in fluorescence was observed. All assays were done as three biological replicates. The relative fluorescence values were calculated for the entire course of the experiment using non-linear fitting, and the untreated control (buffer + bacteria + fluorescent dye) as baseline. The following equation was applied to show the percentage difference between the fluorescence of the untreated control (baseline) and the sample:
$${\rm{{Percentage}\,{difference}}}=\frac{100\times ({\rm{fluorescence}_{sample}}-{\rm{baseline}})}{\rm{baseline}}$$
Cytoplasmic membrane depolarization assays
The depolarization of the bacterial cytoplasmic membrane was determined by fluorescence measurements of the membrane potential-sensitive dye, DiSC3-5 (ref. 16). Briefly, A. baumannii ATCC 19606 and P. aeruginosa PAO1 were grown at 37 °C until mid-log phase (OD600 = 0.5). The cells were then centrifuged using the same conditions described for the NPN uptake assays, washed twice with washing buffer containing 20 mmol l−1 glucose and 5 mmol l−1 HEPES (pH 7.2). The cells were diluted 1:10 (OD600 = 0.05) in a buffer containing 0.1 mol l−1 KCl, 20 mmol l−1 glucose and 5 mmol l−1 HEPES (pH 7.2). About 100 μl of bacterial solution was then incubated for 15 min with 20 nmol l−1 of DiSC3-5 until the fluorescence reached a plateau, that is, the dye was fully internalized into the bacterial membrane. Transmembrane potential changes were monitored by observing the difference in the fluorescence emission intensity of DiSC3-5 (λex = 622 nm, λem = 670 nm), after the addition of 100 μl of peptide aqueous solution at its MIC. All assays were performed in three biological replicates. The relative fluorescence values were calculated for the course of the experiment using non-linear fitting, and the untreated control (buffer + bacteria + fluorescent dye) served as baseline. The following equation was applied to show the percentage difference between the fluorescence of the untreated control (baseline) and the sample:
$${\rm{{Percentage}\,{difference}}}=\frac{100 \times (\rm{{fluorescence}_{sample}}-{baseline})}{\rm{baseline}}$$
Synergy between antibiotic molecules from extinct organisms
P. aeruginosa PAO1 and A. baumannii ATCC 19606 were used to assess the synergistic interactions of peptides derived from the same organisms using checkerboard assays. These bacterial pathogens were selected due to their resistance to antibiotics. The most active de-extinct EPs against P. aeruginosa PAO1 and A. baumannii ATCC 19606 were orthogonally diluted using the microdilution technique to concentrations ranging from 4 times MIC to 0.0625 times MIC. Plates were incubated for 20 h at 37 °C. All assays were done in three biological replicates.
Eukaryotic cell culture conditions
HEK293T cells were obtained from the American Type Culture Collection (CRL-3216). The cells were cultured in high-glucose Dulbecco’s modified Eagle’s medium supplemented with 1% penicillin and streptomycin (antibiotics) and 10% fetal bovine serum and grown at 37 °C in a humidified atmosphere containing 5% CO2.
Cytotoxicity assays
One day before the experiment, an aliquot of 100 μl of the cells at 50,000 cells per ml was seeded into each well of the cell-treated 96-well plates used in the experiment (that is, 5,000 cells per well). The attached HEK293T cells were then exposed to increasing concentrations of the peptides (8–128 μmol l−1) for 24 h. After the incubation period, we performed the 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide tetrazolium reduction assay (MTT assay)53. The MTT reagent was dissolved at 0.5 mg ml−1 in medium without phenol red and was used to replace cell culture supernatants containing the peptides (100 μl per well), and the samples were incubated for 4 h at 37 °C in a humidified atmosphere containing 5% CO2 yielding the insoluble formazan salt. The resulting salts were then resuspended in hydrochloric acid (0.04 mol l−1) in anhydrous isopropanol and quantified by spectrophotometric measurements of absorbance at 570 nm. All assays were done as three biological replicates.
Resistance to proteolytic degradation assays
To assess the resistance of EPs to proteolysis, we incubated them in human serum54. The following five lead compounds were exposed to 25% human serum in water for 6 h at 37 °C: mammuthusin-2, hydrodamin-1, megalocerin-1, elephasin-2 and mylodonin-2, each at 3 mg ml−1. About 100 μl aliquots were collected after 0, 0.5, 1, 3 and 6 h, and 10 μl of 100% trifluoroacetic acid was added to each sample to induce protein precipitation and incubated for 10 min on ice (at ~4 °C). Samples were then processed in a Waters Acquity UPLCMS equipped with a photodiode array detector (190–400 nm data collection) and a Waters TQD triple quadrupole MSMS, with 5 μl injections. The column used was a Waters Acquity UPLC HSS C18, 1.8 μm (2.1 mm × 50 mm). The mobile phases used were A (100% water with 0.1%, v/v, formic acid) and B (100% acetonitrile with 0.1%, v/v, formic acid), Fisher optima grades. The solvent gradient used is described in detail in Supplementary Table 18. Measurements were made by ionization ESI +/− simultaneous over m/z 100–2,000 Da. The percentage of remaining peptide was calculated by integrating the area under the curve related to the peptide at time point 0. Experiments were performed in three independent replicates.
Circular dichroism assays
Circular dichroism assays were performed at the University of Pennsylvania’s Biological Chemistry Resource Center using a J1500 circular dichroism spectropolarimeter (Jasco). The circular dichroism spectra represent an average of three accumulations at 25 °C obtained using a quartz cuvette with an optical path length of 1.0 mm. The experiments covered a wavelength range from 260 to 190 nm at a rate of 50 nm min−1 and a bandwidth of 0.5 nm. The concentration of all peptides tested was kept at 50 μmol l−1, and the measurements were performed in a mixture of water and trifluoroethanol in a 3:2 ratio. A Fourier transform filter was applied to minimize background effects. Secondary structure fraction values were calculated using the single spectra analysis tool on the server BeStSel (version from March 27, 2022)55.
Skin abscess infection mouse model
A. baumannii ATCC 19606 cells were grown in LB medium to an OD600 = 0.5. Cells were washed twice with sterile phosphate-buffered saline (PBS) (pH 7.4, 13,000 r.p.m. for 2 min) and resuspended to a final concentration of 106 (A. baumannii cells) colony-forming units per millilitre (c.f.u. ml−1). Six-week-old female CD-1 mice from Charles River (stock number 18679700-022) were anaesthetized with isoflurane, and their backs were sterilized and shaved. A superficial linear skin abrasion was made with a needle to damage the stratum corneum and upper layer of the epidermis. An aliquot of 20 μl containing the bacterial load resuspended in PBS was inoculated over the scratched area. Two hours after the infection, peptides diluted in water at their MIC value were administered to the infected area, and the untreated mice were inoculated with the same volume of deionized water. Mice were euthanized, and the area of scarified skin was excised 2 days and 4 days post infection, homogenized using a bead beater for 20 min (25 Hz) and then serially diluted tenfold for quantification of colony-forming units. This quantitative method was used as it has been established to accurately reflect the number of bacteria present in a given infected area. MacConkey agar plates were used as A. baumannii colonies appear purple in this medium and can easily be distinguished from other strains. The experiments were performed with six mice per group. All experiments were performed blindly, and no animal subjects were excluded from the analysis. The skin abscess infection mouse model was approved by the University Laboratory Animal Resources from the University of Pennsylvania (protocol 806763). Statistical significance was determined using one-way analysis of variance (ANOVA) followed by Dunnett’s test in log10-transformed data to mitigate the effect of outliers; P values are presented for each group, with all groups being compared to the untreated control group.
Thigh infection mouse model
Six-week-old female CD-1 mice from Charles River (stock number 18679700-022) were rendered neutropenic by two doses of cyclophosphamide (150 mg kg−1) applied intraperitoneally with an interval of 72 h. One day after the last dose of cyclophosphamide, the mice were injected intramuscularly in their right thigh with a bacterial load of 106 c.f.u. ml−1 of A. baumannii ATCC 19606 cells. The bacteria had been grown in LB broth, washed twice with PBS (pH 7.4), and resuspended to the desired concentration. Two hours after bacterial injection, peptides resuspended in water were administered intraperitoneally, and the untreated mice were inoculated with the same volume of deionized water. Before each injection, mice were anaesthetized with isoflurane and monitored for respiratory rate and pedal reflexes. Next, we monitored the establishment of the infection and euthanized the mice. The infected area was excised 2 days and 4 days post infection, homogenized using a bead beater for 20 min (25 Hz), and then serially diluted tenfold for quantification of colony-forming units in MacConkey agar plates. The experiments were performed with six mice per group. All experiments were performed blindly, and no animal subjects were excluded from the analysis. The thigh infection mouse model was approved by the University Laboratory Animal Resources from the University of Pennsylvania (protocol 807055). Statistical significance was determined using one-way ANOVA followed by Dunnett’s test in log10-transformed data to mitigate the effect of outliers; P values are presented for each group, with all groups being compared to the untreated control group.
Statistical analyses
Unless otherwise stated, all assays were performed in three independent biological replicates as indicated in each figure legend and Methods sections. The cytotoxic activities values were estimated using non-linear regression based on the range of concentrations screened and were shown as the values that cause lysis of 50% of the cells in the experiment. Two technical replicates were performed in the cytotoxicity assays within each of the three biological replicates.
In the mouse experiments, the statistical significance was determined using one-way ANOVA followed by Dunnett’s test. All the P values are shown for each of the groups, and all groups were compared to the untreated control group. The solid line inside each box represents the mean value obtained for each group.
All calculation and statistical analyses of the experimental and computational data were conducted using GraphPad Prism v.10.0.2 and Python, respectively. Statistical significance between different groups was calculated using the tests indicated in each figure legend. No statistical methods were used to predetermine sample size.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

