Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A. & Kim, D. Methods of integrating data to uncover genotype–phenotype interactions. Nat. Rev. Genet. 16, 85–97 (2015).
Google Scholar
Moore, J. H., Asselbergs, F. W. & Williams, S. M. Bioinformatics challenges for genome-wide association studies. Bioinformatics 26, 445–455 (2010).
Google Scholar
Doshi-Velez, F. & Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv [stat.ML] (2017).
Leung, M. K. K., Delong, A., Alipanahi, B. & Frey, B. J. Machine learning in genomic medicine: A review of computational problems and data sets. Proc. IEEE 104, 176–197 (2016).
Yang, Y. et al. Machine learning for classifying tuberculosis drug-resistance from DNA sequencing data. Bioinformatics 34, 1666–1671 (2018).
Google Scholar
Hadikurniawati, W., Anwar, M. T., Marlina, D. & Kusumo, H. Predicting tuberculosis drug resistance using machine learning based on DNA sequencing data. J. Phys. Conf. Ser. 1869, 012093 (2021).
Google Scholar
Adam, G. et al. Machine learning approaches to drug response prediction: challenges and recent progress. NPJ Precis Oncol 4, 19 (2020).
Google Scholar
Wan, N. et al. Machine learning enables detection of early-stage colorectal cancer by whole-genome sequencing of plasma cell-free DNA. BMC Cancer 19, 832 (2019).
Google Scholar
Kurian, B. & Jyothi, V. L. Breast cancer prediction using an optimal machine learning technique for next generation sequences. Concurrent Eng. Res. Appl. 29, 49–57 (2021).
Lee, S. H., van der Werf, J. H. J., Hayes, B. J., Goddard, M. E. & Visscher, P. M. Predicting unobserved phenotypes for complex traits from whole-genome SNP data. PLoS Genet. 4, e1000231 (2008).
Google Scholar
Guzzetta, G., Jurman, G. & Furlanello, C. A machine learning pipeline for quantitative phenotype prediction from genotype data. BMC Bioinf. 11(Suppl 8), S3 (2010).
Drouin, A. et al. Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons. BMC Genomics 17, 754 (2016).
Google Scholar
Montesinos-López, A., Montesinos-López, O. A., Gianola, D., Crossa, J. & Hernández-Suárez, C. M. Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 8, 3813–3828 (2018).
Google Scholar
Ma, W. et al. A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 248, 1307–1318 (2018).
Google Scholar
Liu, Y. et al. Phenotype prediction and genome-wide association study using deep convolutional neural network of soybean. Front. Genet. 10, 1091 (2019).
Google Scholar
Lee, Y.-C. et al. Using machine learning to predict obesity based on genome-wide and epigenome-wide gene-gene and gene-diet interactions. Front. Genet. 12, 783845 (2021).
Google Scholar
Wang, L., Shen, H., Liu, H. & Guo, G. Mixture SNPs effect on phenotype in genome-wide association studies. BMC Genomics 16, 3 (2015).
Google Scholar
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Orgogozo, V., Morizot, B. & Martin, A. The differential view of genotype–phenotype relationships. Front. Genet. 6, 179 (2015).
Google Scholar
A Density-Based Algorithm for Discovering Clusters in Large. https://www.aaai.org › KDD › 1996 › KDD96-037https://www.aaai.org › KDD › 1996 › KDD96-037.
Dormann, C. F. et al. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography 36, 27–46 (2013).
Lees, J. A., Galardini, M., Bentley, S. D., Weiser, J. N. & Corander, J. Pyseer: A comprehensive tool for microbial pangenome-wide association studies. Bioinformatics 34, 4310–4312 (2018).
Google Scholar
Yokoyama, S., Tada, T., Zhang, H. & Britt, L. Elucidation of phenotypic adaptations: Molecular analyses of dim-light vision proteins in vertebrates. Proc. Natl. Acad. Sci. U. S. A. 105, 13480–13485 (2008).
Google Scholar
Katoh, K., Rozewicki, J. & Yamada, K. D. MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. Brief. Bioinform. 20, 1160–1166 (2017).
Google Scholar
Talavera, G., Castresana, J., Kjer, K., Page, R. & Sullivan, J. Gblocks. http://molevol.cmima.csic.es/castresana/Gblocks.html.
Frazer, S. A., Baghbanzadeh, M., Rahnavard, A., Crandall, K. A. & Oakley, T. H. Discovering genotype–phenotype relationships with machine learning and the visual physiology opsin database (VPOD). GigaScience 13, giae073 (2024).
Google Scholar
Lynch, R. M., Shen, T., Gnanakaran, S. & Derdeyn, C. A. Appreciating HIV type 1 diversity: subtype differences in Env. AIDS Res. Hum. Retroviruses 25, 237–248 (2009).
Google Scholar
Felsövályi, K., Nádas, A., Zolla-Pazner, S. & Cardozo, T. Distinct sequence patterns characterize the V3 region of HIV type 1 gp120 from subtypes A and C. AIDS Res. Hum. Retroviruses 22, 703–708 (2006).
Google Scholar
Compendium, H. Foley B, LT, Apetrei C, Hahn B, Mizrachi I, Mullins J, Rambaut A, Wolinsky S & Korber B, Eds. Biophysics Group, Los Alamos National Laboratory
Rahnavard, A. et al. Omics community detection using multi-resolution clustering. Bioinformatics 37, 3588–3594 (2021).
Google Scholar
Patel, M. B., Hoffman, N. G. & Swanstrom, R. Subtype-specific conformational differences within the V3 region of subtype B and subtype C human immunodeficiency virus type 1 Env proteins. J. Virol. 82, 903–916 (2008).
Google Scholar
Fouchier, R. A. et al. Phenotype-associated sequence variation in the third variable domain of the human immunodeficiency virus type 1 gp120 molecule. J. Virol. 66, 3183–3187 (1992).
Google Scholar
Culyba, M. J. & Van Tyne, D. Bacterial evolution during human infection: Adapt and live or adapt and die. PLoS Pathog. 17, e1009872 (2021).
Google Scholar
McDonald, M. J. Microbial Experimental Evolution – a proving ground for evolutionary theory and a tool for discovery. EMBO Rep. 20, e46992 (2019).
Google Scholar
Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
Truong, D. T., Tett, A., Pasolli, E., Huttenhower, C. & Segata, N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. 27, 626–638 (2017).
Google Scholar
Lloyd-Price, J. et al. Strains, functions and dynamics in the expanded human microbiome project. Nature 550, 61–66 (2017).
Google Scholar
Brinkman, K. K. & Larsen, R. A. Interactions of the energy transducer TonB with noncognate energy-harvesting complexes. J. Bacteriol. 190, 421–427 (2008).
Google Scholar
Samantarrai, D., Lakshman Sagar, A., Gudla, R. & Siddavattam, D. TonB-dependent transporters in sphingomonads: unraveling their distribution and function in environmental adaptation. Microorganisms 8, 359 (2020).
Google Scholar
Biou, V. et al. Structural and molecular determinants for the interaction of ExbB from Serratia marcescens and HasB, a TonB paralog. Commun. Biol. 5, 355 (2022).
Google Scholar
Knowles, T. J., Scott-Tucker, A., Overduin, M. & Henderson, I. R. Membrane protein architects: The role of the BAM complex in outer membrane protein assembly. Nat. Rev. Microbiol. 7, 206–214 (2009).
Google Scholar
Georgieva, M. et al. Mutations in the essential outer membrane protein BamA contribute to Escherichia coli resistance to the antimicrobial peptide TAT-RasGAP317-326. J. Biol. Chem. 301, 108018 (2025).
Google Scholar
Holmes, E. C. What does virus evolution tell us about virus origins?. J. Virol. 85, 5247–5251 (2011).
Google Scholar
Rahnavard, A. et al. Epidemiological associations with genomic variation in SARS-CoV-2. Sci. Rep. 11, 23023 (2021).
Lauring, A. S. & Malani, P. N. Variants of SARS-CoV-2. JAMA https://doi.org/10.1001/jama.2021.14181 (2021).
Google Scholar
Ilmjärv, S. et al. Concurrent mutations in RNA-dependent RNA polymerase and spike protein emerged as the epidemiologically most successful SARS-CoV-2 variant. Sci. Rep. 11, 13705 (2021).
Google Scholar
Aleem, A., Akbar Samad, A. B. & Slenker, A. K. Emerging Variants of SARS-CoV-2 And Novel Therapeutics Against Coronavirus (COVID-19). in StatPearls (StatPearls Publishing, Treasure Island (FL), 2022).
Tracking SARS-CoV-2 variants. https://www.who.int/activities/tracking-SARS-CoV-2-variants.
Khare, S. et al. GISAID’s role in pandemic response. China CDC Wkly 3, 1049–1051 (2021).
Google Scholar
Dayhoff, M., Schwartz, R. & Orcutt, B. 22 a model of evolutionary change in proteins. Atlas Protein Sequence Struct. 5, 345–352 (1978).
Clarke, R. et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat. Rev. Cancer 8, 37–49 (2008).
Google Scholar
Hierons, R. Machine learning. Tom M. Mitchell. Published by McGraw-Hill, Maidenhead, U.K., International Student Edition, 1997. ISBN: 0-07-115467-1, 414 pages. Price: U.K. £22.99, soft cover. Software Testing, Verification and Reliability vol. 9 191–193 Preprint at https://doi.org/10.1002/(sici)1099-1689(199909)9:3<191::aid-stvr184>3.0.co;2-e (1999).
Nordhausen, K. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition by Trevor Hastie, Robert Tibshirani, Jerome Friedman. International Statistical Review vol. 77 482–482 Preprint at https://doi.org/10.1111/j.1751-5823.2009.00095_18.x (2009).
Hosmer, D. W. Jr., Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression (John Wiley & Sons, 2013).
Coefficient, S. R. C. In The Concise Encyclopedia of Statistics. Preprint at (2008).
Hamming Distance. in Encyclopedia of Biometrics (eds. Li, S. Z. & Jain, A.) 668–668 (Springer US, Boston, MA, 2009).
Hancock, J. M. Jaccard Distance (Jaccard Index, Jaccard Similarity Coefficient). in Dictionary of Bioinformatics and Computational Biology (2014).
Kvalseth, T. O. Entropy and correlation: some comments. IEEE Trans. Syst. Man Cybern. 17, 517–519 (1987).
Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison. Proceedings of the 26th Annual International Conference on Machine Learning—ICML ’09 Preprint at https://doi.org/10.1145/1553374.1553511 (2009).
Morey, L. C. & Agresti, A. The measurement of classification agreement: An adjustment to the rand statistic for chance agreement. Educ. Psychol. Meas. 44, 33–37 (1984).
Singh, D. & Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 97, 105524 (2020).
Hoerl, A. E. & Kennard, R. W. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970).
Santosa, F. & Symes, W. W. Linear inversion of band-limited reflection seismograms. SIAM J. Sci. and Stat. Comput. 7, 1307–1330 (1986).
Google Scholar
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. Least angle regression. aos 32, 407–499 (2004).
Google Scholar
Huber, P. J. & Ronchetti, E. M. Robust Statistics. Wiley Series in Probability and Statistics Preprint at https://doi.org/10.1002/9780470434697 (2009).
Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63, 3–42 (2006).
Chen, T. & Guestrin, C. XGBoost. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Preprint at https://doi.org/10.1145/2939672.2939785 (2016).
Ke, Meng, Finley & Wang. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst.
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and Regression Trees (Chapman & Hall/CRC, 2017).
Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55, 119–139 (1997).
Google Scholar
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Google Scholar
Xgboost: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and More. Runs on Single Machine, Hadoop, Spark, Dask, Flink and DataFlow. (Github).
LightGBM: A Fast, Distributed, High Performance Gradient Boosting (GBT, GBDT, GBRT, GBM or MART) Framework Based on Decision Tree Algorithms, Used for Ranking, Classification and Many Other Machine Learning Tasks. (Github).
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
Google Scholar
McKinney, W. Data Structures for Statistical Computing in Python. in Proceedings of the 9th Python in Science Conference (SciPy, 2010). https://doi.org/10.25080/majora-92bf1922-00a.
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Google Scholar
Waskom, M. seaborn: Statistical data visualization. J. Open Sour. Softw. 6, 3021 (2021).
Hunter, J. D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
