Leveraging viral genome sequences and machine learning models for identification of potentially selective antiviral agents

Machine Learning


Feature vectors for viral genome sequences

Complete viral genome sequences were downloaded as FASTA files from three databases, including the Global Initiative on Sharing All Influenza Data database (GISAID, https://www.gisaid.org/, e.g., SARS-CoV-2 strains/variants), the European Bioinformatics Institute (EBI, https://www.ebi.ac.uk/genomes/virus.html, e.g., HPV-11), and the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?opt=virus&taxid=10239&host=human, e.g., influenza A virus). The sequences were processed using R software with two key packages: “msa” (version 1.28.0) for sequence import and “seqinr” (version 4.2-23) for calculating pairwise alignment distances based on sequence identity. Recognizing the analogy between viral genomes and natural language, the genome sequences were treated as sentences, with k-monomeric units (where k = 6) serving as “words”. This analogy allowed us to apply natural language processing techniques to genomic data. Specifically, we employed the FastText embedding model to generate a “continuous bag of nucleobases” representation. This was implemented using the “fastText” package in R, with the following parameters: 100 epochs, a “softmax” loss function, a learning rate of 0.1 (default), and a word vector size of 100 (default). This approach converted viral genomes of varying lengths into uniform 100-dimensional feature vectors (real values), suitable for machine learning-based predictive analysis.

Collection of approved and investigational antiviral drugs (AIADs)

The collection of AIADs was conducted in two main steps: initial identification and selection from databases, followed by systematic verification through literature review. First, compounds were sourced from the NCATS in-house collection of antivirals and the DrugBank database (https://go.drugbank.com/), ensuring a diverse and comprehensive representation of both approved and investigational antiviral compounds. Particular emphasis was placed on potential anti-SARS-CoV-2 compounds that had progressed to phase III clinical trials, reflecting their advanced status in the drug development pipeline. Other antiviral agents were included based on demonstrated efficacy against a range of viral pathogens across all clinical trial phases. Second, the antiviral activity of each compound was systematically verified through PubMed literature searches (https://pubmed.ncbi.nlm.nih.gov/) using a standardized keyword-based strategy, combining the compound name with terms such as “antiviral,” “antiviral activity,” and “viral inhibition.” Notably, the selected antiviral compounds target a wide spectrum of viral and host proteins involved in various stages of the viral lifecycle, including but not limited to viral entry, replication, assembly, and host immune modulation. Details of the AIADs collected were provided in Supplementary Table S2.

Collection of non-cytotoxic pharmaceutical compounds (NCPCs) based on Tox21 cell viability assay data

The NCATS Pharmaceutical Collection (NPC)46, consisting of ~3000 approved and investigational drugs, was systematically evaluated for cytotoxicity using in vitro cell-based high-throughput screening (HTS) assays as part of the Tox21 program. Detailed data and descriptions of these assays were accessed through the NCATS Tox21 public data browser (https://tripod.nih.gov/pubdata/). Each compound was screened in triplicate at 15 distinct concentrations, with activity quantified using a curve rank ranging from −9 to 947. Negative values (−9 to −1) indicated decreasing inhibitory activity, while positive values (1 to 9) indicated increasing activation activity. A curve rank of 0 signified inactivity. Compounds were classified as NCPCs if they exhibited a curve rank of 0 in at least 30 of the 55 cell viability assays. This threshold ensured the selection of compounds demonstrating minimal cytotoxic effects across a diverse range of cellular models. Details of NCPCs based on Tox21 cell viability assay data were provided in Supplementary Table S4.

Collection of anti-SARS-CoV-2 pharmaceutical compounds based on the PP and cytopathic effect (CPE) assays

We identified potentially cytotoxic and non-cytotoxic anti-SARS-CoV-2 compounds using cell viability counter screen data for both the PP and CPE assays based on our previous study41. For modeling purposes, compounds were designated as cytotoxic (assigned a value of 1) if they met the following criteria: AC50 <10 µM, efficacy <−50%, and curve rank <−1 in the cell viability counter screen. Compounds that did not meet these criteria were designated as non-cytotoxic (assigned a value of 0). To ensure a conservative cytotoxicity assessment, an additional rule was applied: any compound exhibiting cytotoxicity in either the PP or CPE cell viability counter screen was classified as cytotoxic. Conversely, compounds that did not exhibit cytotoxicity in either assay were designated as non-cytotoxic. By prioritizing compounds with both efficacy against SARS-CoV-2 and minimal cytotoxicity, this approach enhances the potential for identifying potential anti-SARS-CoV-2 candidates with favorable safety profiles.

Conversion of chemical structures to fingerprints and structural analysis

Chemical structures of compounds were converted to two types of commonly used structure fingerprints, i.e., extended connectivity fingerprints radius 4 (ECFP4) and ToxPrint. ECFP4 was used to build classification models and evaluate the structural similarities between compounds, while ToxPrint was used to identify chemical structural features significantly enriched in AIAD compounds. Molecular structures encoded in SMILES (simplified molecular input line entry system) were converted into ECFP4, which represents chemical compounds as 1024-bit binary vectors, where each bit indicates the presence (1) or absence (0) of a specific structural feature. This conversion was performed using the Chemistry Development Kit (CDK) integrated into the Konstanz Information Miner (KNIME) platform, version 4.7.1. Structural similarity between compounds was assessed by calculating the Tanimoto coefficient (TC) based on their ECFP4 fingerprints. The TC, which ranges from 0 to 1, measures similarity by dividing the number of shared structural features by the total number of features present in either compound. For each compound, its closest structural neighbor was identified by calculating the TC between it and all other compounds in the set, with the highest TC value defined as maxTC. This maxTC value was used to assess the structural similarity between compound sets. The ToxPrint fingerprints (729 bits) were generated using the publicly available ChemoTyper application (https://chemotyper.org/). Fisher’s exact test was used to determine the significance of structure features enriched in AIADs or NCPCs based on the ToxPrint fingerprints, and a p value <0.05 was considered statistically significant.

Implementation and evaluation of machine learning classification models

The machine learning modeling process was performed following methodologies in our previous studies48,49,50,51,52,53. Virus-selective models were built using a combination of chemical structural features (ECFP4 fingerprints) and viral genome sequence descriptors. Pan-antiviral models were built using chemical structural features only. The dataset was randomly split into a training set (70%) and a testing set (30%), with this process repeated 20 times to ensure robustness and mitigate sampling bias. Here, “the dataset” refers to the data used to build each individual model. Five classification models were built: Naïve Bayes (NB) and SVM using the “e1071” package, neural networks (NNET) with the “nnet” package, RF via the “Random Forest” package, and XGB using the “xgboost” package. Laplace smoothing was applied to the NB classifier to address zero probability issues, while the SVM classifier employed a Gaussian radial basis function kernel. Default parameters were maintained for the RF and NNET classifiers. In the XGB model, parameters were set to include a maximum tree depth of 3, a learning rate of 0.01, and a subsample ratio of 0.5 for constructing each tree. Model performance was evaluated using area under the receiver operating characteristic curve (AUC-ROC) and balanced accuracy (BA) via the “pROC” package, and Matthews correlation coefficient (MCC) computed using the “mltools” package. The entire machine learning modeling procedure was executed in R version 4.2.1.

Feature selection and data rebalance for machine learning model optimization

Feature selection was performed using four methods to identify the most informative features for machine learning model construction, with slight modifications to previously described approaches48,49,50,51,52,53. In Fisher’s exact test and t-test method, p value thresholds ranging from 0.01 to 0.05 in increments of 0.01 were utilized. The AUC-ROC method was implemented with cutoff thresholds from 0.52 to 0.56 in increments of 0.02, using the “pROC” package in R. For the Random Forest (RF) and eXtreme gradient boosting (XGB) methods, feature importance scores, such as Gini importance or Gain scores, were calculated using the “Random Forest” and “xgboost” packages, respectively. Features were selected at intervals from the top 20 to the top 100 ranked features. To address data imbalance, four sampling methods were employed: down-sampling, up-sampling, random over-sampling examples (ROSE), and synthetic minority over-sampling technique (SMOTE), utilizing the “ROSE” and “DMwR” packages in R. This comprehensive approach to feature selection and data balancing was designed to enhance the robustness and reliability of the machine learning models, improving their predictive power for identifying effective antiviral compounds.

SARS-CoV-2 pseudotyped particle (PP) entry assay

The PP entry assay was performed according to previously described protocols43. In brief, SARS-CoV-2 Spike protein containing PPs, along with control PPs (vesicular stomatitis virus glycoprotein PP and bald PP), were custom-produced by Dexorgen (Rockville, MD). The assay was conducted in HEK293 cells expressing human angiotensin-converting enzyme 2 (HEK293-ACE2) under biosafety level 2 (BSL-2) conditions. Compounds were tested in 11-point, 1:3 serial dilutions starting from a concentration of 57.5 μM. After 48 h of incubation at 37 °C with 5% CO₂, luciferase activity was measured using the bright-glo luciferase assay (Promega) to assess the PP entry. Data were normalized to wells with SARS-CoV-2 spike PPs (100%) and bald PPs (0%). Cytotoxicity was evaluated in parallel using an intracellular ATP assay without the addition of PPs, with cells and media as 100 and 0% controls, respectively. The dual assessments provided a comprehensive evaluation of both compound efficacy in inhibiting viral entry and their potential cytotoxicity. All compound libraries used in the study were assembled by the National Center for Advancing Translational Sciences (NCATS), ensuring high consistency and quality control throughout the screening process.

RNA-dependent RNA polymerase (RdRp) assay

The SARS-CoV-2 RdRp assay in the time-resolved fluorescence resonance energy transfer (TR-FRET) assay format was obtained from BPS Bioscience (San Diego, CA). The assay was optimized for high-throughput screening in a 1536-well plate format. Complete RdRp buffer was prepared by adding 10 µL of 0.5 M DTT to 5 mL of RdRp assay buffer component 1, followed by the addition of 20 µL of RdRp assay buffer component 2 (based on the manufacturer’s protocol). The RNAse inhibitor was then diluted 8-fold in the prepared complete RdRp buffer. The RdRp enzyme was diluted in the complete RdRp buffer to a final concentration of 60 ng/µL, ensuring that the enzyme was not refrozen after dilution. The RdRp reaction mixture was prepared by diluting the digoxigenin-labeled RNA duplex and biotinylated ATP 50-fold in the complete RdRp buffer. The enzyme mix was assembled for the 1536-well plate according to the following volumes per well: 0.5 µL of complete RdRp buffer, 1 µL of RdRp enzyme for test samples (no enzyme for blanks), 0.5 µL of RNAse inhibitor, and 0.5 µL of the RdRp reaction mixture, yielding a total volume of 2.5 µL per well. For each 1536-well plate, a 5 mL reaction mix needs to be prepared. Subsequently, 5 µL of the enzyme mix was dispensed into each well of the 1536-well plate. Test compounds were added by pintool, dispensing 23 nL of each compound into the respective wells, and the plate was incubated at 37 °C for 3 h. During the incubation period, the TR-FRET detection buffer was thawed on ice. Eu-labeled antibody was diluted to 1:600, and dye-labeled acceptor was diluted to 1:200, with 8.3 and 25 µL of each, respectively, added to 5 mL of the detection buffer. Following the incubation, 5 µL of the prepared detection solution was added to each well, and the plate was shaken on a rotator at room temperature for 20 minutes. Fluorescence intensity was measured using the BMG PHERAstar plate reader in the HTRF format with excitation at 317 nm and dual emissions at both 620 and 665 nm. For the 620 nm channel, a lag time of 60 µs and an integration time of 500 µs were set, and similar settings were applied for the 665 nm channel. Due to the difficulty in reading the 620 nm emission simultaneously on the 1536-well plate format, the primary readout was obtained at 665 nm. The focal height for optimal 665 nm reading was adjusted to 10 nm. This method enabled the high-throughput screening of RdRp inhibitors using a 1536-well plate format, facilitating the rapid and sensitive detection of fluorescence signals via TR-FRET.

Virtual screening and validation using in vitro assays

The optimal machine learning models were applied to screen the NCATS in-house collection of ~360 K diverse compounds. These compounds included known bioactive compounds and new small molecules designed for drug discovery purposes. Consensus predictions from multiple individual models were utilized to enhance reliability and accuracy. The consensus score for each compound was calculated as the sum of its probability scores from multiple models, weighted by the respective AUC-ROC values of each model. To ensure the selection of structurally diverse candidate compounds for validation using in vitro assays, molecules predicted as positive hits by multiple models were further stratified through clustering based on structural similarity using the k-means algorithm. This approach partitioned compounds into k distinct clusters, facilitating the selection of representative candidates spanning diverse chemical scaffolds. Compounds achieving the highest consensus scores within their respective clusters were prioritized for testing using in vitro assays. To analyze the validation results, concentration-response curves were fitted using four-parameter logistic regression, where % assay activity was the response variable, and log10 compound concentration served as the independent variable. This analysis was conducted using the “drc” statistical package in R. Data visualizations were generated through the “ggplot2” package in R, and representative chemical structures were rendered using ChemDraw Professional software (version 23.1.1). The comprehensive workflow, including data collection, data processing, model building, model evaluation, virtual screening, and validation using in vitro assays, is illustrated in Fig. S1.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *