deepBreaks identifies and prioritizes genotype–phenotype associations using machine learning




