Scikit-fingerprints: An advanced Python library for efficient molecular fingerprint computation and integration with machine learning pipelines

https://arxiv.org/abs/2407.13291v1

In computational chemistry, molecules are often represented as molecular graphs, which need to be converted into multidimensional vectors for processing, especially in machine learning applications. This is achieved using molecular fingerprint feature extraction algorithms that encode molecular structures as vectors. These fingerprints are essential for cheminformatics tasks such as chemical space diversity, clustering, virtual screening, and molecular property prediction. While Python's scikit-learn library is widely used for machine learning tasks due to its intuitive API, popular open-source tools such as CDK, OpenBabel, and RDKit that compute molecular fingerprints are primarily written in Java or C++ and are not compatible with scikit-learn's API.
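The idea of turning a molecular graph into a fixed-length vector can be sketched in a few lines. The following is a toy illustration only, not the algorithm of any fingerprint in scikit-fingerprints, RDKit, CDK, or OpenBabel: it hashes each atom and its immediate neighborhood into a bit vector, then compares two molecules with Tanimoto similarity. The function names, the 64-bit length, and the hand-written heavy-atom graphs are assumptions made for the example.

```python
import zlib


def toy_fingerprint(atoms, bonds, n_bits=64):
    """Hash each atom's radius-0 and radius-1 environment into a bit vector."""
    # Build an adjacency list from the bond list (pairs of atom indices).
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    bits = [0] * n_bits
    for i, symbol in enumerate(atoms):
        # Radius 0: the atom symbol. Radius 1: the symbol plus sorted neighbor symbols.
        env0 = symbol
        env1 = symbol + "(" + ",".join(sorted(atoms[j] for j in neighbors[i])) + ")"
        for env in (env0, env1):
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            bits[zlib.crc32(env.encode()) % n_bits] = 1
    return bits


def tanimoto(x, y):
    """Shared on-bits divided by the on-bits present in either vector."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 1.0


# Hand-written heavy-atom graphs (hydrogens omitted), assumed for illustration.
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])

fp_ethanol = toy_fingerprint(*ethanol)
fp_propanol = toy_fingerprint(*propanol)
similarity = tanimoto(fp_ethanol, fp_propanol)
print(len(fp_ethanol), similarity)
```

Real fingerprints differ in what substructures they enumerate and how they hash them, but the output has the same shape: one fixed-length vector per molecule, which is what a downstream estimator consumes.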

Researchers from the AGH University of Krakow have developed scikit-fingerprints, a Python package for computing molecular fingerprints in cheminformatics. The library exposes an interface compatible with scikit-learn, so it integrates easily into machine learning pipelines, and its optimized parallel computation handles large molecular datasets efficiently. scikit-fingerprints contains more than 30 types of molecular fingerprints, both 2D (based on molecular graph topology) and 3D (exploiting spatial structures), which the authors describe as the most comprehensive collection available in the Python ecosystem. The library is open source and available on PyPI and GitHub.

The package targets cheminformatics and machine learning workflows. Because its fingerprint classes follow the scikit-learn interface, they can be placed directly into ML pipelines, and fingerprint computation can be parallelized for large datasets. Beyond the 30+ fingerprint types in 2D and 3D, key features include parallel and distributed computing with Joblib and Dask, preprocessing utilities for transforming and standardizing molecular data, and dataset loading through the HuggingFace Hub. The code is maintained with extensive testing, security checks, and CI/CD practices.
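What a scikit-learn compatible, parallelizable fingerprint transformer looks like can be sketched with the standard library alone. This is a hedged illustration of the convention (constructor hyperparameters, `fit`, `transform`, `get_params`, `set_params`), not the library's code: `ToyFingerprintTransformer`, `n_bits`, and `ngram` are hypothetical names, character n-grams of a SMILES string stand in for real chemistry, and threads are used only to show chunked dispatch, whereas the article states that the library relies on Joblib and Dask.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class ToyFingerprintTransformer:
    """Stateless transformer following the scikit-learn fit/transform convention."""

    def __init__(self, n_bits=128, ngram=3, n_jobs=1):
        # Hyperparameters are stored unchanged, as scikit-learn estimators do.
        self.n_bits = n_bits
        self.ngram = ngram
        self.n_jobs = n_jobs

    def get_params(self, deep=True):
        return {"n_bits": self.n_bits, "ngram": self.ngram, "n_jobs": self.n_jobs}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        return self  # nothing is learned from the data

    def _one(self, smiles):
        # Toy featurization: hash every character n-gram of the SMILES string.
        bits = [0] * self.n_bits
        for i in range(max(1, len(smiles) - self.ngram + 1)):
            piece = smiles[i:i + self.ngram]
            bits[zlib.crc32(piece.encode()) % self.n_bits] = 1
        return bits

    def _chunk(self, chunk):
        return [self._one(s) for s in chunk]

    def transform(self, X):
        X = list(X)
        if not X:
            return []
        # Split the input into at most n_jobs contiguous chunks.
        n_jobs = max(1, min(self.n_jobs, len(X)))
        size = -(-len(X) // n_jobs)  # ceiling division
        chunks = [X[i:i + size] for i in range(0, len(X), size)]
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            parts = pool.map(self._chunk, chunks)  # map preserves chunk order
            return [row for part in parts for row in part]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC"]
sequential = ToyFingerprintTransformer(n_jobs=1).fit_transform(smiles)
parallel = ToyFingerprintTransformer(n_jobs=4).fit_transform(smiles)
print(len(parallel), len(parallel[0]), sequential == parallel)
```

Because every molecule is processed independently, splitting the input into chunks and concatenating the results in order gives the same matrix as a sequential run. Exposing settings such as `n_bits` through `get_params` and `set_params` is what lets generic tuning tools treat fingerprint settings like any other hyperparameter.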

The library's parallel computation significantly speeds up the processing of large datasets. For example, when scaling to 16 cores, fingerprint computation time decreases almost linearly with the number of cores, which indicates near-ideal parallelism. Memory usage is reduced through support for sparse matrices, which substantially lowers storage requirements for large datasets such as PCBA. The package also simplifies molecular property prediction and the tuning of fingerprint hyperparameters, improving performance on a range of benchmarks. It supports complex 3D fingerprint pipelines as well, and it compares favorably with existing tools in the number of fingerprints, parallelism, and integrated datasets.
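Why sparse storage helps can be seen from a small count. Binary fingerprints are mostly zeros, so storing only the positions of the on-bits takes far fewer values than storing every cell. The sketch below builds a compressed sparse row (CSR) style layout with plain lists; it is an assumption-laden illustration (made-up on-bit positions, a 2048-bit length chosen for the example), and the exact sparse format used by the library is not stated in this article.

```python
def to_csr(rows):
    """Convert dense 0/1 rows into CSR-style (indices, indptr) lists."""
    indices, indptr = [], [0]
    for row in rows:
        # Keep only the column positions of the on-bits.
        indices.extend(j for j, bit in enumerate(row) if bit)
        # indptr[i]:indptr[i + 1] delimits row i inside indices.
        indptr.append(len(indices))
    return indices, indptr


def csr_row(indices, indptr, i, n_cols):
    """Rebuild dense row i from the CSR-style lists."""
    row = [0] * n_cols
    for j in indices[indptr[i]:indptr[i + 1]]:
        row[j] = 1
    return row


n_cols = 2048
# Hypothetical on-bit positions for four molecules (illustrative values only).
on_bits = [[3, 17, 512], [17, 900], [], [1, 2, 3, 2047]]

dense = []
for positions in on_bits:
    row = [0] * n_cols
    for j in positions:
        row[j] = 1
    dense.append(row)

indices, indptr = to_csr(dense)
dense_cells = len(dense) * n_cols
sparse_cells = len(indices) + len(indptr)
print(dense_cells, sparse_cells)  # prints: 8192 14
```

The dense form stores 8192 cells for these four rows, while the sparse form stores 9 column indices plus 5 row pointers. The gap widens as the number of molecules grows, which is the effect the article describes for large datasets.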

Scikit-fingerprints provides a robust library for computing molecular fingerprints, with more than 30 options across 2D and 3D. Its scikit-learn compatible interface eases integration into complex data processing pipelines. Efficient parallel computation speeds up the processing of large datasets, which is essential for tasks such as virtual screening and hyperparameter tuning. Its intuitive API supports users with different levels of programming expertise, including computational chemists and molecular biologists. The library's extensible architecture, high code quality, and active community engagement demonstrate its relevance and ease of use. It is already being used in studies of molecular property prediction and pesticide toxicity.

In conclusion, scikit-fingerprints is an advanced open-source Python library for computing molecular fingerprints, fully compatible with the scikit-learn API. The authors present it as the most feature-rich library of its kind in the Python ecosystem, supporting over 30 different fingerprints and providing efficient parallel computation for large datasets. The library is aimed at cheminformatics, de novo drug design, and computational molecular chemistry, enabling faster and more comprehensive experimentation. With a focus on high code quality, maintainability, and security, scikit-fingerprints offers a comprehensive solution for molecular fingerprint computation, simplifying tasks such as molecular property prediction and virtual screening.


All credit for this research goes to the researchers of this project.

Sana Hassan, a Consulting Intern at Marktechpost and a dual degree student at Indian Institute of Technology Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-world solutions.
