Chandrasekhar, V. et al. COCONUT 2.0: a comprehensive overhaul and curation of the collection of open natural products database. Nucleic Acids Res. 53, 634–643 (2025).
Newman, D. J. & Cragg, G. M. Natural products as sources of new drugs from 1981 to 2014. J. Nat. Prod. 79, 629–661 (2016).
Clark, A. M. Natural products as a resource for new drugs. Pharm. Res. 13, 1133–1141 (1996).
Harvey, A. L. Natural products in drug discovery. Drug Discov. Today 13, 894–901 (2008).
Li, J. W.-H. & Vederas, J. C. Drug discovery and natural products: end of an era or an endless frontier? Science 325, 161–165 (2009).
Atanasov, A. G., Zotchev, S. B., Dirsch, V. M. & Supuran, C. T. Natural products in drug discovery: advances and opportunities. Nat. Rev. Drug Discov. 20, 200–216 (2021).
Corson, T. W. & Crews, C. M. Molecular understanding and modern application of traditional medicines: triumphs and trials. Cell 130, 769–774 (2007).
Irwin, J. J. et al. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60, 6065–6073 (2020).
Enamine REAL Database: The Largest Enumerated Dataset of Synthetically Feasible Drug-like Molecules (Enamine, accessed 7 October 2025); https://enamine.net/compound-collections/real-compounds/real-database
Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702 (2020).
Banerjee, P. et al. Super Natural II—a database of natural products. Nucleic Acids Res. 43, 935–939 (2015).
Sorokina, M., Merseburger, P., Rajan, K., Yirik, M. A. & Steinbeck, C. COCONUT online: collection of open natural products database. J. Cheminform. https://doi.org/10.1186/s13321-020-00478-9 (2021).
Rutz, A. et al. The LOTUS initiative for open knowledge management in natural products research. eLife 11, e70780 (2022).
Zeng, X. et al. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res. 46, 1217–1222 (2018).
van Santen, J. A. et al. The Natural Products Atlas: an open access knowledge base for microbial natural products discovery. ACS Cent. Sci. 5, 1824–1833 (2019).
Terlouw, B. R. et al. MIBiG 3.0: a community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Res. 51, 603–610 (2023).
Lei, J. & Zhou, J. A marine natural product database. J. Chem. Inf. Comput. Sci. 42, 742–748 (2002).
Barbosa, A. J. & Roque, A. C. Free marine natural products databases for biotechnology and bioengineering. Biotechnol. J. 14, 1800607 (2019).
Lyu, C. et al. CMNPD: a comprehensive marine natural products database towards facilitating drug discovery from the ocean. Nucleic Acids Res. 49, 509–515 (2021).
Aghdam, S. A. & Brown, A. M. V. Deep learning approaches for natural product discovery from plant endophytic microbiomes. Environ. Microbiome 16, 6 (2021).
Zheng, S. et al. Deep learning driven biosynthetic pathways navigation for natural products with BioNavi-NP. Nat. Commun. https://doi.org/10.1038/s41467-022-30970-9 (2022).
Lai, J. et al. Privileged scaffold analysis of natural products with deep learning-based indication prediction model. Mol. Inform. 39, e2000057 (2020).
Yoo, S. et al. A deep learning-based approach for identifying the medicinal uses of plant-derived natural compounds. Front. Pharmacol. 11, 584875 (2020).
Hannigan, G. D. et al. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 47, e110 (2019).
Liu, Z. et al. Deep learning enables discovery of highly potent anti-osteoporosis natural products. Eur. J. Med. Chem. 210, 112982 (2021).
Xu, Q. et al. Composite machine learning strategy for natural products taxonomical classification and structural insights. Digital Discov. 3, 2192–2200 (2024).
Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 1–20 (2016).
Kim, H. W. et al. NPClassifier: a deep neural network-based structural classification tool for natural products. J. Nat. Prod. 84, 2795–2807 (2021).
Yu, L., Su, Y., Liu, Y. & Zeng, X. Review of unsupervised pretraining strategies for molecules representation. Brief. Funct. Genomics 20, 323–332 (2021).
Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29, 97–101 (1989).
Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing 1724–1734 (2014).
Vaswani, A. et al. Attention is all you need. In Proc. 30th International Conference on Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 6000–6010 (Curran, 2017).
Xu, Z., Wang, S., Zhu, F. & Huang, J. Seq2seq fingerprint: an unsupervised deep molecular embedding for drug discovery. In Proc. 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (eds Haspel, N. et al.) 285–294 (Association for Computing Machinery, 2017).
Jastrzębski, S., Leśniak, D. & Czarnecki, W. M. Learning to SMILE(S). Preprint at https://arxiv.org/abs/1602.06289 (2016).
Kearnes, S., McCloskey, K., Berndl, M., Pande, V. & Riley, P. Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des. 30, 595–608 (2016).
Schütt, K. et al. SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. In Proc. 31st Conference on Neural Information Processing Systems (eds von Luxburg, U. et al.) 992–1002 (Curran Associates, 2017).
Wang, Y., Wang, J., Cao, Z. & Barati Farimani, A. Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell. 4, 279–287 (2022).
Hu, W. et al. Strategies for pre-training graph neural networks. In International Conference on Learning Representations (OpenReview.net, 2020).
Xia, J. et al. Mole-BERT: rethinking pre-training graph neural networks for molecules. In International Conference on Learning Representations https://openreview.net/pdf?id=jevY-DtiZTR (2023).
Liu, S. et al. Pre-training molecular graph representation with 3D geometry. Preprint at https://arxiv.org/abs/2110.07728 (2021).
Zhu, J. et al. Unified 2D and 3D pre-training of molecular representations. In Proc. 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (eds Zhang, A.) 2626–2636 (Association for Computing Machinery, 2022).
Li, H. et al. A knowledge-guided pre-training framework for improving molecular representation learning. Nat. Commun. https://doi.org/10.1038/s41467-023-43214-1 (2023).
Ni, Y. et al. Pre-training with fractional denoising to enhance molecular property prediction. Nat. Mach. Intell. 6, 1169–1178 (2024).
Mullowney, M. W. et al. Artificial intelligence for natural product drug discovery. Nat. Rev. Drug Discov. 22, 895–916 (2023).
Garcia-Castro, M., Zimmermann, S., Sankar, M. G. & Kumar, K. Scaffold diversity synthesis and its application in probe and drug discovery. Angew. Chem. Int. Ed. 55, 7586–7605 (2016).
Cruz-Monteagudo, M. et al. Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde? Drug Discov. Today 19, 1069–1080 (2014).
Stumpfe, D., Hu, H. & Bajorath, J. Evolving concept of activity cliffs. ACS Omega 4, 14360–14368 (2019).
van Tilborg, D., Alenicheva, A. & Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 62, 5938–5951 (2022).
Shen, W. X. et al. Online triplet contrastive learning enables efficient cliff awareness in molecular activity prediction. Preprint at ChemRxiv https://doi.org/10.26434/chemrxiv-2023-5cz7s-v2 (2023).
Sun, R., Dai, H. & Yu, A. W. Does GNN pretraining help molecular representation? In Proc. 36th International Conference on Neural Information Processing Systems (eds Koyejo, S. et al.) 12096–12109 (Curran, 2022).
Martin, J. F. & Liras, P. Organization and expression of genes involved in the biosynthesis of antibiotics and other secondary metabolites. Annu. Rev. Microbiol. 43, 173–206 (1989).
Martin, J. F. Clusters of genes for the biosynthesis of antibiotics: regulatory genes and overproduction of pharmaceuticals. J. Ind. Microbiol. 9, 73–90 (1992).
Carroll, L. M. et al. Accurate de novo identification of biosynthetic gene clusters with GECCO. Preprint at bioRxiv https://doi.org/10.1101/2021.05.03.442509 (2021).
Sanchez, S. et al. Expansion of novel biosynthetic gene clusters from diverse environments using SanntiS. Preprint at bioRxiv https://doi.org/10.1101/2023.05.23.540769 (2023).
Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, 412–419 (2021).
Marchler-Bauer, A. et al. CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res. 35, 237–240 (2007).
Ulrich, L. E. & Zhulin, I. B. The MiST2 database: a comprehensive genomics resource on microbial signal transduction. Nucleic Acids Res. 38, 401–407 (2010).
Zeng, T., Li, J. & Wu, R. Natural product databases for drug discovery: features and applications. Pharm. Sci. Adv. 2, 100050 (2024).
Maia, E. H. B., Assis, L. C., de Oliveira, T. A., da Silva, A. M. & Taranto, A. G. Structure-based virtual screening: from classical to artificial intelligence. Front. Chem. 8, 343 (2020).
Friesner, R. A. et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47, 1739–1749 (2004).
Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 455–461 (2010).
Kimber, T. B., Chen, Y. & Volkamer, A. Deep learning in virtual screening: recent applications and developments. Int. J. Mol. Sci. 22, 4435 (2021).
Ma, D.-L., Chan, D. S.-H. & Leung, C.-H. Molecular docking for virtual screening of natural product databases. Chem. Sci. 2, 1656–1665 (2011).
Soreq, H. & Seidman, S. Acetylcholinesterase—new roles for an old actor. Nat. Rev. Neurosci. 2, 294–302 (2001).
Delibegović, M., Dall’Angelo, S. & Dekeryte, R. Protein tyrosine phosphatase 1B in metabolic diseases and drug development. Nat. Rev. Endocrinol. 20, 366–378 (2024).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations https://openreview.net/pdf?id=ryGs6iA5Km (2019).
Hu, W. et al. OGB-LSC: a large-scale challenge for machine learning on graphs. In Proc. 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (eds Vanschoren, J. & Yeung, S.) https://openreview.net/pdf?id=qkcLxoC52kL (2021).
Wang, Y., Magar, R., Liang, C. & Barati Farimani, A. Improving molecular contrastive learning via faulty negative mitigation and decomposed fragment contrast. J. Chem. Inf. Model. 62, 2713–2725 (2022).
Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).
Ding, Y. et al. Data archiving and access for NaFM: pre-training a foundation model for small-molecule natural products. figshare https://doi.org/10.6084/m9.figshare.28980254.v1 (2025).
Kim, H. W. et al. NPClassifier: a deep neural network-based structural classification tool for natural products. J. Nat. Prod. 84, 2795–2807 (2021).
Ding, Y. et al. Model weights for NaFM: pre-training a foundation model for small-molecule natural products. Zenodo https://doi.org/10.5281/zenodo.15382660 (2025).
Ding, Y. et al. NaFM-Official: version 1.0.0. Zenodo https://doi.org/10.5281/zenodo.18871560 (2025).
Liu, S., Demirel, M. F. & Liang, Y. N-gram graph: simple unsupervised representation for graphs, with applications to molecules. In Proc. 33rd Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 8464–8476 (Curran Associates, 2019).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
