New AI tools can generate millions of new molecules

Chemists have long faced a tremendous problem. The number of possible useful molecules is so vast that the molecules already known represent only a tiny fraction of those that might exist.

A research team at Spain’s Rovira i Virgili University (URV) has ventured into that uncharted territory, announcing that they have built an artificial intelligence system that can generate millions of molecules that follow the laws of chemistry but are not present in current databases. The study, published in Nature Machine Intelligence, points to a faster way to explore chemical space, an almost unimaginable range of atomic combinations that could one day lead to new drugs, materials, refrigerants, or other compounds.

The scale of the space is hard to overstate. The authors point out that the number of drug-like molecules could be about 10 to the 60th power, which is far more than the number of water molecules in Earth’s oceans. Discovering molecules is therefore less like searching a library and more like sifting through a universe.

At the heart of the new work is a model called CoCoGraph, which works much like an image-generating AI. Rather than generating images from noise, it generates molecular structures by learning how valid molecules are decomposed and reconstructed.

“Our algorithm does the same thing, but with molecules,” says Roger Guimerà, ICREA research professor in the URV School of Chemical Engineering, in the source material.

Roger Guimerà, Manuel Ruiz-Botella and Marta Sales-Pardo from the Department of Chemical Engineering led the research. (Credit: University of Rovira i Virgili)

Rules first, inventions second

This project addresses a central weakness of many earlier molecule-generation systems. Previous approaches, including models based on variational autoencoders, generative adversarial networks, and graph neural networks, have advanced the field but often struggle with scale, efficiency, or chemical plausibility. Some produce structures that look ingenious but violate fundamental chemical rules.

CoCoGraph takes a different path. Rather than having the model learn these rules from scratch, the researchers built some of them directly into the generation process. Each atom keeps the correct number of bonds, so valence is preserved, and the molecular formula stays fixed throughout the process.

That design choice matters. It means that every molecule the system produces is chemically valid under the benchmarks used in the study. According to the authors, CoCoGraph achieved 100% chemical validity while producing highly novel output.

Marta Sales-Pardo, also from URV’s Department of Chemical Engineering, briefly explained the process: “We start with real molecules, break bonds, and randomly create new molecules. The model learns to reverse this process and rebuild a consistent structure.”

Unlike images, molecules are not smooth, continuous fields of values. They are discrete structures made up of atoms and bonds, which makes the mathematics harder. To address this, CoCoGraph uses what the team calls a constrained discrete diffusion process based on double edge swaps. In practice, pairs of bonds are repeatedly exchanged while the bonding requirements of every atom in the molecule are preserved.
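To make the bond-exchange idea concrete, here is a minimal sketch of a single double edge swap on a molecule stored as a plain list of bonds. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bond_counts`, `double_edge_swap`) are hypothetical, a double bond is simply listed twice, and the sketch does not check maximum bond order or whether the molecule stays in one piece.

```python
import random
from collections import Counter


def bond_counts(bonds):
    """Count how many bonds touch each atom (a double bond is listed twice)."""
    counts = Counter()
    for a, b in bonds:
        counts[a] += 1
        counts[b] += 1
    return counts


def double_edge_swap(bonds, rng):
    """Rewire two randomly chosen bonds (a-b, c-d) into (a-d, c-b).

    Returns a new bond list. If the sampled swap would bond an atom to
    itself, the input is returned unchanged.
    """
    i, j = rng.sample(range(len(bonds)), 2)
    a, b = bonds[i]
    c, d = bonds[j]
    if rng.random() < 0.5:  # pick one of the two possible rewirings
        c, d = d, c
    if a == d or c == b:  # reject swaps that would create a self-bond
        return list(bonds)
    swapped = list(bonds)
    swapped[i] = (a, d)
    swapped[j] = (c, b)
    return swapped


# Ethanol with explicit hydrogens: atoms 0-1 are C, 2 is O, 3-8 are H.
ethanol = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]

rng = random.Random(0)
scrambled = ethanol
for _ in range(200):
    scrambled = double_edge_swap(scrambled, rng)

# Every atom still has the same number of bonds it started with.
print(bond_counts(scrambled) == bond_counts(ethanol))  # prints True
```

Because each swap removes one bond end from each of four atoms and gives one back to each of them, the bond count per atom never changes, and neither does the set of atoms. That is why a process built from such swaps cannot leave the space of valence-respecting graphs with a fixed formula.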

The system also includes a second model, called a temporal model, that estimates how close a partially reconstructed graph is to a real molecule. This additional signal helps the main diffusion model steer the denoising process.

CoCoGraph, a constrained cooperative graph diffusion model. (Credit: Nature Machine Intelligence)

Smaller models, stronger realism

The researchers compared CoCoGraph with six other leading molecule generators using the GuacaMol benchmark, a standard test suite in the field. They evaluated all models against a filtered PubChem reference database of 94.7 million molecules that had no overlap with the training data.

Two versions of CoCoGraph were tested. The smaller BASE model used a total of 534,000 parameters, while the fingerprint-enhanced version, called FPS, used 4.4 million parameters. Even the larger version had fewer parameters than most competing models.

CoCoGraph performed well despite its lightweight design. Both versions achieved 100% chemical validity, uniqueness rates of 99.8% and 99.9%, and novelty of 95.7%. In the GuacaMol benchmark, the KL divergence score for property matching reached 95.7% for the BASE model and 96.3% for the FPS version, outperforming the baselines the team compared.

This is important because novelty alone is not enough. A useful model should produce new but still plausible molecules with physicochemical properties similar to those found in real chemistry.
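As a rough guide to how uniqueness and novelty figures like those above are usually read, the sketch below computes both as simple set ratios on toy data. It assumes each molecule has already been reduced to one canonical string per structure; the function names and the example strings are illustrative, and the exact definitions used in the study may differ in detail.

```python
def uniqueness(generated):
    """Fraction of generated molecules that are distinct from one another."""
    return len(set(generated)) / len(generated)


def novelty(generated, reference):
    """Fraction of distinct generated molecules absent from the reference set."""
    distinct = set(generated)
    return len(distinct - set(reference)) / len(distinct)


# Toy data: four generated strings, one duplicate, one already known.
generated = ["CCO", "CCO", "CCN", "CCC"]
reference = {"CCC"}

print(uniqueness(generated))          # prints 0.75
print(novelty(generated, reference))  # 2 of the 3 distinct molecules are new
```

A generator can score near 100% on both ratios while still producing implausible structures, which is why the property-matching score is reported alongside them.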

The authors also expanded the scope of their tests beyond the benchmark’s usual 10 properties. Across 36 chemical properties, CoCoGraph outperformed competing systems in at least 66.6% of them. The researchers reported particular strengths in topological features, electronic properties, and structural descriptors, attributes that may be important in medicinal chemistry and drug discovery.

Can a chemist tell the difference?

One of the most striking parts of the study came when the researchers moved away from automated benchmarks and asked human experts to judge the results.

They built a database of 8.2 million generated molecules with 7.1% redundancy. Based on the reported novelty rates, the database contains approximately 7.3 million new, unique, and chemically valid molecules not found in PubChem.
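The 7.3 million figure can be reproduced from the numbers in the article with a short calculation. The sketch assumes that the 95.7% novelty rate reported on the benchmark applies to the database after duplicates are removed; the variable names are illustrative.

```python
total_generated = 8_200_000
redundancy = 0.071   # share of the database that is duplicated
novelty_rate = 0.957  # share of unique molecules not found in PubChem

unique = total_generated * (1 - redundancy)
novel = unique * novelty_rate

print(f"{unique / 1e6:.2f} million unique, {novel / 1e6:.2f} million novel")
# prints: 7.62 million unique, 7.29 million novel
```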

Next, the team ran what amounts to a molecular Turing test. A total of 121 participants with backgrounds in organic chemistry, biochemistry, and related fields were shown 20 pairs of molecules. One molecule in each pair came from the original dataset and the other was generated by CoCoGraph. Both shared the same molecular formula, so participants had to judge structure rather than size or composition.

Across 2,420 evaluations, participants correctly picked the real molecule 62% of the time. Undergraduate participants scored 60% and graduate students scored 64%.

That’s better than chance, but not by much.

For some categories, including acyclic and primarily aliphatic molecules, performance was statistically compatible with random guessing. The authors are careful not to claim that generated molecules are indistinguishable from real ones, but they argue that the results show many of them look convincing even to trained chemists.
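A quick back-of-the-envelope check shows why the overall 62% score is clearly above chance yet still modest. The sketch uses a normal approximation and treats all 2,420 evaluations as independent, which is a simplifying assumption: in reality the same participants and the same pairs recur, so the study's own statistics are the authority here.

```python
import math

participants = 121
pairs_per_participant = 20
evaluations = participants * pairs_per_participant  # 2,420

observed = 0.62  # overall share of correct picks
chance = 0.50    # expected share if participants guessed at random

# Standard error of a proportion under the chance hypothesis.
standard_error = math.sqrt(chance * (1 - chance) / evaluations)
z_score = (observed - chance) / standard_error

print(evaluations)
print(f"z = {z_score:.1f}")
```

The gap from chance is many standard errors wide, so it is unlikely to be luck, but in absolute terms participants were still wrong on roughly 38 of every 100 pairs.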

The first step towards targeted design

Currently, CoCoGraph does not allow chemists to enter a wish list and receive the perfect molecule in return. It is not yet possible to directly engineer compounds for specific functions.

Still, the study includes an early demonstration of how the model can be useful. The research team searched the database of 8.2 million molecules for structures with physicochemical properties similar to those of paracetamol, using nine key properties to identify the top candidates. They also tested a repair-style approach that leaves part of an existing molecule fixed while adding small or medium fragments to create related variants.

50 random molecules generated by CoCoGraph FPS. (Credit: Nature Machine Intelligence)

This type of controlled editing can be important in drug optimization, where researchers often want to maintain the molecular scaffold while adjusting other parts of the structure.

“Right now, we’re just producing molecules,” Manuel Ruiz-Botella, a doctoral student involved in the research, said in the source material. “The next step is to apply concrete goals to this process.”

The study also outlines its limitations. CoCoGraph keeps the molecular formula fixed during generation, which may limit some applications. The model has also been developed for molecules with up to 70 atoms, and the authors note that extending it to larger structures will require retraining and more computing resources. They also point to future applications in mass spectrometry and conditional molecular generation, but these are directions for later research and not proven results.

Practical implications of the research

CoCoGraph’s immediate value is not that it already delivers new drugs or new materials. It does not. What it provides is a more efficient way to explore a chemical space that is too vast for humans to explore manually.

By producing only chemically valid structures with fewer parameters than many rivals, the system has the potential to reduce the computational cost of large-scale molecular searches. Its 8.2-million-molecule database could also give researchers a starting point for screening realistic candidates in drug development and materials research.

More broadly, the study suggests that AI systems in chemistry could be improved by building on the discipline’s own strict constraints, rather than simply mimicking known data. In this case, the rules of chemistry are not an afterthought. They are the reason the model works.
