Pretraining a foundation model for small-molecule natural products

Machine Learning


  • Chandrasekhar, V. et al. COCONUT 2.0: a comprehensive overhaul and curation of the collection of open natural products database. Nucleic Acids Res. 53, 634–643 (2025).

  • Newman, D. J. & Cragg, G. M. Natural products as sources of new drugs from 1981 to 2014. J. Nat. Prod. 79, 629–661 (2016).

  • Clark, A. M. Natural products as a resource for new drugs. Pharm. Res. 13, 1133–1141 (1996).

  • Harvey, A. L. Natural products in drug discovery. Drug Discov. Today 13, 894–901 (2008).

  • Li, J. W.-H. & Vederas, J. C. Drug discovery and natural products: end of an era or an endless frontier? Science 325, 161–165 (2009).

  • Atanasov, A. G., Zotchev, S. B., Dirsch, V. M. & Supuran, C. T. Natural products in drug discovery: advances and opportunities. Nat. Rev. Drug Discov. 20, 200–216 (2021).

  • Corson, T. W. & Crews, C. M. Molecular understanding and modern application of traditional medicines: triumphs and trials. Cell 130, 769–774 (2007).

  • Irwin, J. J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60, 6065–6073 (2020).

  • Enamine REAL Database: The Largest Enumerated Dataset of Synthetically Feasible Drug-like Molecules (Enamine, accessed 7 October 2025); https://enamine.net/compound-collections/real-compounds/real-database

  • Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702 (2020).

  • Banerjee, P. et al. Super Natural II—a database of natural products. Nucleic Acids Res. 43, 935–939 (2015).

  • Sorokina, M., Merseburger, P., Rajan, K., Yirik, M. A. & Steinbeck, C. COCONUT online: collection of open natural products database. J. Cheminform. https://doi.org/10.1186/s13321-020-00478-9 (2021).

  • Rutz, A. et al. The LOTUS initiative for open knowledge management in natural products research. eLife 11, e70780 (2022).

  • Zeng, X. et al. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res. 46, 1217–1222 (2018).

  • van Santen, J. A. et al. The Natural Products Atlas: an open access knowledge base for microbial natural products discovery. ACS Cent. Sci. 5, 1824–1833 (2019).

  • Terlouw, B. R. et al. MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Res. 51, 603–610 (2023).

  • Lei, J. & Zhou, J. A marine natural product database. J. Chem. Inf. Comput. Sci. 42, 742–748 (2002).

  • Barbosa, A. J. & Roque, A. C. Free marine natural products databases for biotechnology and bioengineering. Biotechnol. J. 14, 1800607 (2019).

  • Lyu, C. et al. CMNPD: a comprehensive marine natural products database towards facilitating drug discovery from the ocean. Nucleic Acids Res. 49, 509–515 (2021).

  • Aghdam, S. A. & Brown, A. M. V. Deep learning approaches for natural product discovery from plant endophytic microbiomes. Environ. Microbiome 16, 6 (2021).

  • Zheng, S. et al. Deep learning driven biosynthetic pathways navigation for natural products with BioNavi-NP. Nat. Commun. https://doi.org/10.1038/s41467-022-30970-9 (2022).

  • Lai, J. et al. Privileged scaffold analysis of natural products with deep learning-based indication prediction model. Mol. Inform. 39, e2000057 (2020).

  • Yoo, S. et al. A deep learning-based approach for identifying the medicinal uses of plant-derived natural compounds. Front. Pharmacol. 11, 584875 (2020).

  • Hannigan, G. D. et al. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 47, e110 (2019).

  • Liu, Z. et al. Deep learning enables discovery of highly potent anti-osteoporosis natural products. Eur. J. Med. Chem. 210, 112982 (2021).

  • Xu, Q. et al. Composite machine learning strategy for natural products taxonomical classification and structural insights. Digital Discov. 3, 2192–2200 (2024).

  • Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 1–20 (2016).

  • Kim, H. W. et al. NPClassifier: a deep neural network-based structural classification tool for natural products. J. Nat. Prod. 84, 2795–2807 (2021).

  • Yu, L., Su, Y., Liu, Y. & Zeng, X. Review of unsupervised pretraining strategies for molecules representation. Brief. Funct. Genomics 20, 323–332 (2021).

  • Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).

  • Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing 1724–1734 (2014).

  • Vaswani, A. et al. Attention is all you need. In Proc. 30th International Conference on Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 6000–6010 (Curran, 2017).

  • Xu, Z., Wang, S., Zhu, F. & Huang, J. Seq2seq fingerprint: an unsupervised deep molecular embedding for drug discovery. In Proc. 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (eds Haspel, N. et al.) 285–294 (Association for Computing Machinery, 2017).

  • Jastrzębski, S., Leśniak, D. & Czarnecki, W. M. Learning to SMILE(S). Preprint at https://arxiv.org/abs/1602.06289 (2016).

  • Kearnes, S., McCloskey, K., Berndl, M., Pande, V. & Riley, P. Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des. 30, 595–608 (2016).

  • Schütt, K. et al. SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In Proc. 31st Conference on Neural Information Processing Systems (eds von Luxburg, U. et al.) 992–1002 (Curran Associates, 2017).

  • Wang, Y., Wang, J., Cao, Z. & Barati Farimani, A. Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell. 4, 279–287 (2022).

  • Hu, W. et al. Strategies for pre-training graph neural networks. In International Conference on Learning Representations (OpenReview.net, 2020).

  • Xia, J. et al. Mole-BERT: rethinking pre-training graph neural networks for molecules. In International Conference on Learning Representations https://openreview.net/pdf?id=jevY-DtiZTR (2023).

  • Liu, S. et al. Pre-training molecular graph representation with 3D geometry. Preprint at https://arxiv.org/abs/2110.07728 (2021).

  • Zhu, J. et al. Unified 2D and 3D pre-training of molecular representations. In Proc. 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (eds Zhang, A.) 2626–2636 (Association for Computing Machinery, 2022).

  • Li, H. et al. A knowledge-guided pre-training framework for improving molecular representation learning. Nat. Commun. https://doi.org/10.1038/s41467-023-43214-1 (2023).

  • Ni, Y. et al. Pre-training with fractional denoising to enhance molecular property prediction. Nat. Mach. Intell. 6, 1169–1178 (2024).

  • Mullowney, M. W. et al. Artificial intelligence for natural product drug discovery. Nat. Rev. Drug Discov. 22, 895–916 (2023).

  • Garcia-Castro, M., Zimmermann, S., Sankar, M. G. & Kumar, K. Scaffold diversity synthesis and its application in probe and drug discovery. Angew. Chem. Int. Ed. 55, 7586–7605 (2016).

  • Cruz-Monteagudo, M. et al. Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde? Drug Discov. Today 19, 1069–1080 (2014).

  • Stumpfe, D., Hu, H. & Bajorath, J. Evolving concept of activity cliffs. ACS Omega 4, 14360–14368 (2019).

  • van Tilborg, D., Alenicheva, A. & Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 62, 5938–5951 (2022).

  • Shen, W. X. et al. Online triplet contrastive learning enables efficient cliff awareness in molecular activity prediction. Preprint at ChemRxiv https://doi.org/10.26434/chemrxiv-2023-5cz7s-v2 (2023).

  • Sun, R., Dai, H. & Yu, A. W. Does GNN pretraining help molecular representation? In Proc. 36th International Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 12096–12109 (Curran, 2022).

  • Martin, J. F. & Liras, P. Organization and expression of genes involved in the biosynthesis of antibiotics and other secondary metabolites. Annu. Rev. Microbiol. 43, 173–206 (1989).

  • Martin, J. F. Clusters of genes for the biosynthesis of antibiotics: regulatory genes and overproduction of pharmaceuticals. J. Ind. Microbiol. 9, 73–90 (1992).

  • Carroll, L. M. et al. Accurate de novo identification of biosynthetic gene clusters with GECCO. Preprint at bioRxiv https://doi.org/10.1101/2021.05.03.442509 (2021).

  • Sanchez, S. et al. Expansion of novel biosynthetic gene clusters from diverse environments using SanntiS. Preprint at bioRxiv https://doi.org/10.1101/2023.05.23.540769 (2023).

  • Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, 412–419 (2021).

  • Marchler-Bauer, A. et al. CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res. 35, 237–240 (2007).

  • Ulrich, L. E. & Zhulin, I. B. The MiST2 database: a comprehensive genomics resource on microbial signal transduction. Nucleic Acids Res. 38, 401–407 (2010).

  • Zeng, T., Li, J. & Wu, R. Natural product databases for drug discovery: features and applications. Pharm. Sci. Adv. 2, 100050 (2024).

  • Maia, E. H. B., Assis, L. C., de Oliveira, T. A., da Silva, A. M. & Taranto, A. G. Structure-based virtual screening: from classical to artificial intelligence. Front. Chem. 8, 343 (2020).

  • Friesner, R. A. et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47, 1739–1749 (2004).

  • Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 455–461 (2010).

  • Kimber, T. B., Chen, Y. & Volkamer, A. Deep learning in virtual screening: recent applications and developments. Int. J. Mol. Sci. 22, 4435 (2021).

  • Ma, D.-L., Chan, D. S.-H. & Leung, C.-H. Molecular docking for virtual screening of natural product databases. Chem. Sci. 2, 1656–1665 (2011).

  • Soreq, H. & Seidman, S. Acetylcholinesterase—new roles for an old actor. Nat. Rev. Neurosci. 2, 294–302 (2001).

  • Delibegović, M., Dall’Angelo, S. & Dekeryte, R. Protein tyrosine phosphatase 1B in metabolic diseases and drug development. Nat. Rev. Endocrinol. 20, 366–378 (2024).

  • Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations https://openreview.net/pdf?id=ryGs6iA5Km (2019).

  • Hu, W. et al. OGB-LSC: a large-scale challenge for machine learning on graphs. In Proc. of the 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (eds Vanschoren, J. & Yeung, S.) https://openreview.net/pdf?id=qkcLxoC52kL (2021).

  • Wang, Y., Magar, R., Liang, C. & Barati Farimani, A. Improving molecular contrastive learning via faulty negative mitigation and decomposed fragment contrast. J. Chem. Inf. Model. 62, 2713–2725 (2022).

  • Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).

  • Ding, Y. et al. Data archiving and access for NaFM: pre-training a foundation model for small-molecule natural products. figshare https://doi.org/10.6084/m9.figshare.28980254.v1 (2025).

  • Kim, H. W. et al. NPClassifier: a deep neural network-based structural classification tool for natural products. J. Nat. Prod. 84, 2795–2807 (2021).

  • Ding, Y. et al. Model weights for NaFM: pre-training a foundation model for small-molecule natural products. Zenodo https://doi.org/10.5281/zenodo.15382660 (2025).

  • Ding, Y. et al. NaFM-Official: version 1.0.0. Zenodo https://doi.org/10.5281/zenodo.18871560 (2025).

  • Liu, S., Demirel, M. F. & Liang, Y. N-gram graph: simple unsupervised representation for graphs, with applications to molecules. In Proc. 33rd Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 8464–8476 (Curran Associates, 2019).

  • Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).

  • Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).


