Schaefer, J., Lehne, M., Schepers, J., Prasser, F. & Thun, S. The use of machine learning in rare diseases: a scoping review. Orphanet J. Rare Dis. 15, 145 (2020).
Google Scholar
Decherchi, S., Pedrini, E., Mordenti, M., Cavalli, A. & Sangiorgi, L. Opportunities and challenges for machine learning in rare diseases. Front. Med. 8, 747612 (2021).
Google Scholar
Li, A. et al. Unsupervised analysis of transcriptomic profiles reveals six glioma subtypes. Cancer Res. 69, 2091–2099 (2009).
Google Scholar
Senate and House of Representatives of the United States of America in Congress. Orphan Drug Act (1983).
Agarwal, V. et al. Learning statistical models of phenotypes using noisy labeled training data. J. Am. Med. Inform. Assoc. 23, 1166–1173 (2016).
Google Scholar
Frénay, B. & Verleysen, M. Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst. 25, 845–869 (2014).
Google Scholar
Toh, T. S., Dondelinger, F. & Wang, D. Looking beyond the hype: applied AI and machine learning in translational medicine. EBioMedicine 47, 607–615 (2019).
Google Scholar
Clarke, R. et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat. Rev. Cancer 8, 37–49 (2008).
Google Scholar
Altman, N. & Krzywinski, M. The curse(s) of dimensionality. Nat. Methods 15, 399–400 (2018).
Google Scholar
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Google Scholar
Leek, J. T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).
Google Scholar
Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
Google Scholar
Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).
Google Scholar
Dorrity, M. W., Saunders, L. M., Queitsch, C., Fields, S. & Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. 11, 1537 (2020).
Google Scholar
Chellappa, R. & Turaga, P. Feature selection. In Computer Vision: a Reference Guide 1–5 (Springer International, 2020).
Chen, C.-H., Härdle, W. & Unwin, A. Handbook of Data Visualization (Springer, 2008).
Jolliffe, I. T. & Cadima, J. Principal component analysis: a review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 374, 20150202 (2016).
Google Scholar
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://doi.org/10.48550/arXiv.1802.03426 (2018).
Nguyen, L. H. & Holmes, S. Ten quick tips for effective dimensionality reduction. PLoS Comput. Biol. 15, e1006907 (2019).
Google Scholar
Wattenberg, M., Viégas, F. & Johnson, I. How to use t-SNE effectively. Distill 1, https://doi.org/10.23915/distill.00002 (2016).
Way, G. P., Zietz, M., Rubinetti, V., Himmelstein, D. S. & Greene, C. S. Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations. Genome Biol. 21, 109 (2020).
Google Scholar
de Souto, M. C. P., Costa, I. G., de Araujo, D. S. A., Ludermir, T. B. & Schliep, A. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics 9, 497 (2008).
Kothari, S. et al. Removing batch effects from histopathological images for enhanced cancer diagnosis. IEEE J. Biomed. Health Inform. 18, 765–772 (2014).
Google Scholar
Dwivedi, S. K., Tjärnberg, A., Tegnér, J. & Gustafsson, M. Deriving disease modules from the compressed transcriptional space embedded in a deep autoencoder. Nat. Commun. 11, 856 (2020).
Google Scholar
Fertig, E. J., Ding, J., Favorov, A. V., Parmigiani, G. & Ochs, M. F. CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics 26, 2792–2793 (2010).
Google Scholar
Quellec, G., Lamard, M., Conze, P.-H., Massin, P. & Cochener, B. Automatic detection of rare pathologies in fundus photographs using few-shot learning. Med. Image Anal. 61, 101660 (2020).
Google Scholar
Arvaniti, E. & Claassen, M. Sensitive detection of rare disease-associated cell subsets via representation learning. Nat. Commun. 8, 14825 (2017).
Google Scholar
Chaabane, I., Guermazi, R. & Hammami, M. Enhancing techniques for learning decision trees from imbalanced data. Adv. Data Anal. Classif. 14, 677–745 (2020).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Google Scholar
Köpcke, F. et al. Evaluating predictive modeling algorithms to assess patient eligibility for clinical trials from routine data. BMC Med. Inform. Decis. Mak. 13, 134 (2013).
Google Scholar
Banerjee, J. et al. Integrative analysis identifies candidate tumor microenvironment and intracellular signaling pathways that define tumor heterogeneity in NF1. Genes 11, 226 (2020).
Colbaugh, R., Glass, K., Rudolf, C., & Tremblay, M. Learning to identify rare disease patients from electronic health records. AMIA Annu. Symp. Proc. 2018, 340–347 (2018).
Google Scholar
Heiselet, B., Serre, T., Pontil, M. & Poggio, T. Component-based face detection. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition I (CPRV, 2001).
Kasinski, A. & Schmidt, A. The architecture of the face and eyes detection system based on cascade classifiers. In Computer Recognition Systems 2 (ed. Kurzynski, M. et al.) 124–131 (Springer, 2007).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at arXiv https://doi.org/10.48550/arXiv.1301.3781 (2013).
Han, S., Williamson, B. D. & Fong, Y. Improving random forest predictions in small datasets from two-phase sampling designs. BMC Med. Inform. Decis. Mak. 21, 322 (2021).
Google Scholar
Ambert, K. H. & Cohen, A. M. A system for classifying disease comorbidity status from medical discharge summaries using automated hotspot and negated concept detection. J. Am. Med. Inform. Assoc. 16, 590–595 (2009).
Google Scholar
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
Google Scholar
More, A. Survey of resampling techniques for improving classification performance in unbalanced datasets. Preprint at arXiv https://doi.org/10.48550/arXiv.1608.06048 (2016).
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT, 2016).
Futoma, J., Simons, M., Doshi-Velez, F. & Kamaleswaran, R. Generalization in clinical prediction models: the blessing and curse of measurement indicator variables. Crit. Care Explor. 3, e0453 (2021).
Okser, S. et al. Regularized machine learning in the genetic prediction of complex traits. PLoS Genet. 10, e1004754 (2014).
Google Scholar
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B Stat. Methodol. 67, 301–320 (2005).
Google Scholar
Founta, K. et al. Gene targeting in amyotrophic lateral sclerosis using causality-based feature selection and machine learning. Mol. Med. 29, 12 (2023).
Google Scholar
Torang, A., Gupta, P. & Klinke, D. J. 2nd An elastic-net logistic regression approach to generate classifiers and gene signatures for types of immune cells and T helper cell subsets. BMC Bioinformatics 20, 433 (2019).
Dincer, A. B., Celik, S., Hiranuma, N. & Lee, S.-I. DeepProfile: deep learning of cancer molecular profiles for precision medicine. Preprint at bioRxiv https://doi.org/10.1101/278739 (2018).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at arXiv https://doi.org/10.48550/arXiv.1312.6114 (2013).
Sánchez Fernández, I. et al. Deep learning in rare disease. Detection of tubers in tuberous sclerosis complex. PLoS ONE 15, e0232376 (2020).
Google Scholar
Mungall, C. J. et al. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 45, D712–D722 (2017).
Google Scholar
Himmelstein, D. S. et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife 6, e26726 (2017).
Google Scholar
Callahan, T. J., Tripodi, I. J., Hunter, L. E. & Baumgartner, W. A. A framework for automated construction of heterogeneous large-scale biomedical knowledge graphs. Preprint at bioRxiv https://doi.org/10.1101/2020.04.30.071407 (2020).
Percha, B. & Altman, R. B. A global network of biomedical relationships derived from text. Bioinformatics 34, 2614–2624 (2018).
Google Scholar
Orphanet https://www.orpha.net/consor/cgi-bin/index.php (2023).
Queralt-Rosinach, N. et al. Structured reviews for data and knowledge-driven research. Database 2020, baaa015 (2020).
Google Scholar
Moon, C. et al. Learning drug–disease–target embedding (DDTE) from knowledge graphs to inform drug repurposing hypotheses. J. Biomed. Inform. 119, 103838 (2021).
Google Scholar
Li, X. et al. Improving rare disease classification using imperfect knowledge graph. BMC Med. Inform. Decis. Mak. 19, 238 (2019).
Google Scholar
Sosa, D. N. et al. A literature-based knowledge graph embedding method for identifying drug repurposing opportunities in rare diseases. In Biocomputing 2020 463–474 (World Scientific, 2019).
Shen, F. et al. Rare disease knowledge enrichment through a data-driven approach. BMC Med. Inform. Decis. Mak. 19, 32 (2019).
Google Scholar
Rao, A. et al. Phenotype-driven gene prioritization for rare diseases using graph convolution on heterogeneous networks. BMC Med. Genomics 11, 57 (2018).
Köhler, S. et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49, D1207–D1217 (2021).
Google Scholar
Rolland, T. et al. A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014).
Google Scholar
Martens, M. et al. WikiPathways: connecting communities. Nucleic Acids Res. 49, D613–D621 (2021).
Google Scholar
Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).
Google Scholar
Lee, S.-I. et al. A machine learning approach to integrate big data for precision medicine in acute myeloid leukemia. Nat. Commun. 9, 42 (2018).
Google Scholar
Mao, W., Zaslavsky, E., Hartmann, B. M., Sealfon, S. C. & Chikina, M. Pathway-level information extractor (PLIER) for gene expression data. Nat. Methods 16, 607–610 (2019).
Google Scholar
Taroni, J. N. et al. MultiPLIER: a transfer learning framework for transcriptomics reveals systemic features of rare disease. Cell Syst. 8, 380–394 (2019).
Google Scholar
Greene, D., NIHR BioResource, Richardson, S. & Turro, E. Phenotype similarity regression for identifying the genetic determinants of rare diseases. Am. J. Hum. Genet. 98, 490–499 (2016).
Google Scholar
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
Google Scholar
Ionita-Laza, I., Capanu, M., De Rubeis, S., McCallum, K. & Buxbaum, J. D. Identification of rare causal variants in sequence-based studies: methods and applications to VPS13B, a gene involved in Cohen syndrome and autism. PLoS Genet. 10, e1004729 (2014).
Google Scholar
Greene, D., NIHR BioResource, Richardson, S. & Turro, E. A fast association test for identifying pathogenic variants involved in rare diseases. Am. J. Hum. Genet. 101, 104–114 (2017).
Google Scholar
Boycott, K. M., Vanstone, M. R., Bulman, D. E. & MacKenzie, A. E. Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nat. Rev. Genet. 14, 681–691 (2013).
Google Scholar
Wright, C. F., FitzPatrick, D. R. & Firth, H. V. Paediatric genomics: diagnosing rare disease in children. Nat. Rev. Genet. 19, 253–268 (2018).
Google Scholar
Adams, D. R. & Eng, C. M. Next-generation sequencing to diagnose suspected genetic disorders. N. Engl. J. Med. 379, 1353–1362 (2018).
Google Scholar
Byrd, J. B., Greene, A. C., Prasad, D. V., Jiang, X. & Greene, C. S. Responsible, practical genomic data sharing that accelerates research. Nat. Rev. Genet. 21, 615–629 (2020).
Google Scholar
Rieke, N. et al. The future of digital health with federated learning. NPJ Digit. Med. 3, 119 (2020).
Google Scholar
Yan, Y. et al. A continuously benchmarked and crowdsourced challenge for rapid development and evaluation of models to predict COVID-19 diagnosis and hospitalization. JAMA Netw. Open 4, e2124946 (2021).
Google Scholar
Lundberg, S. M. et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2, 749–760 (2018).
Google Scholar
Zhou, G., Zhang, J., Su, J., Shen, D. & Tan, C. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics 20, 1178–1190 (2004).
Google Scholar
Blitzer, J., McDonald, R. & Pereira, F. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (eds. Jurafsky, D. & Gaussier, E.) 120–128 (Association for Computational Linguistics, 2006).
Wang, C. & Mahadevan, S. Heterogeneous domain adaptation using manifold alignment. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence 2 (ed. Walsh, T.) 1541–1546 (AAAI, 2011).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
Google Scholar
Collado-Torres, L. et al. Reproducible RNA-seq analysis using recount2. Nat. Biotechnol. 35, 319–321 (2017).
Google Scholar
Kuhn, M. & Johnson, K. Applied Predictive Modeling (Springer, 2013).
Davis, J. & Goadrich, M. The relationship between precision–recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (eds. Cohen, W. W. & Moore, A.) 233–240 (Association for Computing Machinery, 2006).
Hastie, T., Friedman, J. & Tibshirani, R. The Elements of Statistical Learning (Springer, 2001).
Shin, H.-C. et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35, 1285–1298 (2016).
Google Scholar