AI Advances Sustainable Scientific Research

By its very nature, cutting-edge scientific research is a resource-intensive process, consuming electricity, water, and, in many cases, specialist chemicals and equipment. There’s an element of trial and error in exploring any new idea, and behind every important discovery is a string of failed experiments. But while these failures and wrong turns are a crucial part of the process, they carry a cost, not just in time and money, but in the resources consumed in the pursuit of this knowledge.

Scientific progress should not come at the cost of sustainability, and AI tools are transforming the way many researchers approach their work, whether that means streamlining the experimental process and tailoring project design, or optimizing lab operations and reducing waste.

In this article, we will look at a spectrum of tools and programs helping researchers from across the sciences work more sustainably. 

AI-powered electronic lab notebooks

A typical synthetic chemistry project can involve hundreds, if not thousands, of individual experiments, each necessitating detailed analysis and characterization. Maintaining thorough and up-to-date records remains a hefty organizational task, although the introduction of electronic lab notebooks (ELNs) over the last 20 years has already gone a long way towards streamlining this onerous but essential job.

Intended to replace traditional paper records, ELNs create a digital entry for each experiment, storing methods, data, and analyses in an easily searchable and machine-readable format. These digitized records foster ready collaboration between teams and provide the perfect input for machine learning models, which can prompt the user for missing information or highlight duplicate experiments.
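As a rough illustration of the kind of check an ELN can automate, the minimal sketch below fingerprints experiment records so that repeated entries can be flagged; the record fields and normalization steps are assumptions made for this example rather than any particular ELN's schema.

```python
import hashlib

def experiment_fingerprint(record: dict) -> str:
    """Build a stable fingerprint from the fields that define an experiment.

    The field names (reaction, solvent, temperature_c) are illustrative;
    a real ELN would canonicalize the reaction string before hashing.
    """
    key = "|".join([
        record["reaction"].strip(),
        record["solvent"].strip().lower(),
        f"{record['temperature_c']:.1f}",
    ])
    return hashlib.sha256(key.encode()).hexdigest()

def find_duplicates(records: list[dict]) -> dict[str, list[int]]:
    """Group the indices of records that share a fingerprint."""
    groups: dict[str, list[int]] = {}
    for i, rec in enumerate(records):
        groups.setdefault(experiment_fingerprint(rec), []).append(i)
    return {fp: idxs for fp, idxs in groups.items() if len(idxs) > 1}

# Two entries that differ only in capitalization are flagged as duplicates.
notebook = [
    {"reaction": "CCO>>CC=O", "solvent": "Acetone", "temperature_c": 25.0},
    {"reaction": "CCO>>CC=O", "solvent": "acetone", "temperature_c": 25.0},
]
print(find_duplicates(notebook))
```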

However, for computational chemist Professor Jonathan Hirst, this interactive interface presented an interesting opportunity to go beyond basic record-keeping functionality and begin to challenge users on the design of their experiments. In 2023, his team at the University of Nottingham launched AI4Green, the first electronic lab notebook with a central focus on sustainability.1 The software integrates the core functions of an ELN with a panel of simple apps and AI tools to calculate the green metrics of a planned reaction and propose sustainable alternatives where appropriate.

The user first sketches the intended reaction, adding key details such as reagents and quantities to the accompanying table. The software then automatically populates the rest of the table with hazards and chemical data imported from external databases, highlighting any particular safety or sustainability concerns. The associated summary also evaluates various other aspects of the reaction, including the reagents, temperature, catalyst recovery, and isolation method, prompting the user to consider each component and how it influences the sustainability.
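One of the simplest green metrics such a summary can report is atom economy, the molecular weight of the desired product expressed as a percentage of the combined molecular weight of all reactants. The snippet below is a minimal sketch of that calculation, not AI4Green's own implementation.

```python
def atom_economy(product_mw: float, reactant_mws: list[float]) -> float:
    """Atom economy (%) = MW of desired product / sum of reactant MWs x 100.

    A higher value means more of the starting material ends up in the
    product rather than as waste.
    """
    total_reactant_mw = sum(reactant_mws)
    if total_reactant_mw <= 0:
        raise ValueError("Reactant molecular weights must sum to a positive value")
    return 100.0 * product_mw / total_reactant_mw

# Illustrative example: an esterification where water is the only by-product.
# Acetic acid (60.05) + ethanol (46.07) -> ethyl acetate (88.11) + water (18.02)
print(f"{atom_economy(88.11, [60.05, 46.07]):.1f}%")  # ~83%
```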

In particular, solvents are a key target of these interventions—it’s estimated that as much as 90% of waste produced during pharmaceutical manufacture consists of solvent, the majority of which is ultimately incinerated.2,3 “We have some nice solvent selection tools where you can find a similar solvent with similar properties which is more environmentally friendly, and you can compare pairs of solvents side by side using our flash cards,” said Hirst.4
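The underlying idea of such a solvent-selection tool can be sketched very simply: describe each solvent by a short vector of physical properties, then look for nearby solvents that carry a better sustainability ranking. The property values and greenness scores below are invented placeholders, not AI4Green's data.

```python
import numpy as np

# Illustrative property vectors: [boiling point (degC), dielectric constant, logP],
# plus an invented greenness score (higher = preferable). Real tools draw these
# values from curated solvent-selection guides.
SOLVENTS = {
    "dichloromethane": {"props": [40, 8.9, 1.3], "green_score": 2},
    "ethyl acetate":   {"props": [77, 6.0, 0.7], "green_score": 8},
    "2-MeTHF":         {"props": [80, 6.2, 1.1], "green_score": 7},
    "toluene":         {"props": [111, 2.4, 2.7], "green_score": 5},
}

def greener_alternatives(query: str, solvents: dict = SOLVENTS) -> list[tuple[str, float]]:
    """Rank greener solvents by similarity (Euclidean distance on scaled properties)."""
    names = list(solvents)
    props = np.array([solvents[n]["props"] for n in names], dtype=float)
    scaled = (props - props.mean(axis=0)) / props.std(axis=0)
    q = scaled[names.index(query)]
    query_score = solvents[query]["green_score"]
    candidates = [
        (n, float(np.linalg.norm(scaled[i] - q)))
        for i, n in enumerate(names)
        if n != query and solvents[n]["green_score"] > query_score
    ]
    return sorted(candidates, key=lambda pair: pair[1])

print(greener_alternatives("dichloromethane"))  # nearest greener options first
```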

Post-experiment, AI4Green also enables the user to evaluate their reactions and draw trends from their data to identify a greener solvent for future iterations, specifically using the interactive principal component analysis (PCA) tool.5 PCA is a data representation technique that simplifies complex data sets into a more readable form; chemists often use this method to visualize the relationship between different solvent properties. Usually, this is a one-way process based on existing data, but Hirst’s interactive tool lets the researcher incorporate their own empirical observations into this representation. “The user can drag points together and then the model will update the plot in a mathematically rigorous fashion within those constraints to help them identify a green alternative solvent from the data that they have,” he explained.
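For readers unfamiliar with PCA in this setting, the sketch below shows the conventional, non-interactive step: standardizing a table of solvent properties and projecting it onto two principal components so that similar solvents sit close together. The property matrix is a made-up placeholder, and the drag-to-update, constrained behaviour Hirst describes is not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative solvent property matrix:
# columns = [boiling point (degC), dielectric constant, logP, viscosity (cP)]
solvent_names = ["water", "ethanol", "acetone", "toluene", "ethyl acetate"]
properties = np.array([
    [100, 80.1, -1.4, 0.89],
    [78,  24.5, -0.3, 1.07],
    [56,  20.7, -0.2, 0.31],
    [111,  2.4,  2.7, 0.56],
    [77,   6.0,  0.7, 0.42],
])

# Standardize so every property contributes on the same scale, then project
# onto the two directions of greatest variance.
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(properties))

for name, (pc1, pc2) in zip(solvent_names, scores):
    print(f"{name:14s} PC1={pc1:+.2f}  PC2={pc2:+.2f}")
```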

AI is becoming an increasingly important part of these supporting add-ons, and since its launch, AI4Green has incorporated a number of other tools developed for chemists, including AstraZeneca’s open-source AI route-scouting software, which helps users evaluate different pathways when planning a new chemical synthesis.6,7 The group is now working on a machine learning model for lifecycle analysis, which will ultimately give chemists a much broader overview of the impacts of their choices, from reagent sourcing through to purification.

But while the coding behind the software is complex, the user interface isn’t. Hirst has designed the ELN with the needs of the average organic chemist in mind. “We’ve deployed it live on the cloud so anyone with a web browser can register and use it with no specific skills required at all – we’ve worked hard to make it intuitive,” he said. “The biggest barrier is really mindset and chemists’ willingness to do something different from what they’ve done before. I think creating a more open-source environment is going to be one of the critical ways to drive sustainability and help people develop confidence in AI suggestions.” 

AI apps assist scientists in making new materials

Taking things up a notch, AI applications can more actively guide the direction of research, helping scientists target the most productive experiments. Data-driven models can dramatically accelerate the discovery process, spotting patterns, analyzing data, and identifying the most impactful variables. Such tools are particularly valuable where there is a huge amount of theoretical research space to explore.

The field of sustainable concrete design is a perfect example. Concrete, and particularly the cement binder, is a huge contributor to global CO2 emissions, with an estimated 8% of the annual total coming from the manufacture of cement products.8 Research groups around the world are investigating alternative formulations to reduce or replace this problematic ingredient with greener alternatives (including fly ash, biochar, and even coffee grounds), balancing sustainability considerations against mechanical properties and practical factors like cost.9,10,11 However, the sheer volume of possible combinations, in addition to the extended experimental times needed to validate properties such as compressive strength, restricts the rate of progress in this area.

Sequential learning, which combines a machine learning model with a decision-making rule to extrapolate from initial data, was therefore an obvious tool to streamline this process, reasoned materials informatician Christoph Völker, now head of industrial AI at Iteratec. In 2021, while based at Bundesanstalt für Materialforschung und -prüfung in Germany, Völker and his team developed a sequential learning program to evaluate alkali-activated binders as an alternative to cement and found several suitable candidates within just 11 experiments.12

But despite this success, and subsequent applications of the same methodology, relatively few researchers in the sustainable concrete space adopted AI-augmented approaches.13 The challenge, suggested Völker, is that actually building and coding these models is a technically demanding task, a barrier that hinders many researchers from employing artificial intelligence in their own work.

Aiming to democratize these tools, Völker’s team therefore developed their sequential learning program into an open-source, user-friendly app called SLAMD (Sequential Learning App for Materials Discovery).14 The user first outlines the desired properties of their concrete formulation, inputting existing experimental results and relevant literature as basic training data. The program then evaluates this initial information and suggests the most promising experiments to try next, according to various factors weighted by the researcher. Once these experiments are complete, the data is fed back into the system for a second iteration, which provides even more targeted suggestions.

This cycle of inputting training data, machine learning analysis, and validation by experiment rapidly focuses the investigation on the most impactful variables and optimizes the approach to reduce the overall volume of experiments. Crucially, this smaller experimental burden not only accelerates discovery, but also decreases the environmental impact of the research process itself, requiring fewer resources, less power, and less money. 
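A minimal sketch of this loop, under assumptions of my choosing rather than SLAMD's actual implementation, is shown below: a Gaussian-process surrogate is fitted to the experiments run so far, and an expected-improvement rule picks the next candidate formulation to test. The design space, the target property, and the run_experiment stand-in are all invented for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Hypothetical design space: binder fractions of [cement, fly ash, biochar].
candidates = rng.dirichlet(np.ones(3), size=200)

def run_experiment(x: np.ndarray) -> float:
    """Stand-in for a real lab measurement (e.g., 28-day compressive strength)."""
    return 40 - 25 * x[0] + 10 * x[1] + 5 * x[2] + rng.normal(0, 0.5)

# Seed the loop with a handful of initial experiments (the training data).
X = candidates[:5].copy()
y = np.array([run_experiment(x) for x in X])

for _ in range(10):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected improvement over the best result observed so far.
    best = y.max()
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    next_x = candidates[np.argmax(ei)]           # most promising candidate
    X = np.vstack([X, next_x])
    y = np.append(y, run_experiment(next_x))     # "validate by experiment"

print(f"Best formulation after {len(y)} experiments: {X[np.argmax(y)]} -> {y.max():.1f}")
```

In practice, each call to run_experiment would be a physical test, so the value of the loop lies in how few of those calls are needed.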

Generative AI and digital twins accelerate clinical trials

The time, money, and sustainability gains provided by AI solutions are most significant for large and experimentally complex studies, such as clinical trials. “Trials are the last stage of drug discovery and the majority of the drug development costs will go to this part,” said Jimeng Sun, a computer science professor at the University of Illinois Urbana-Champaign and co-founder of Keiji AI.

Over the course of the pharmaceutical development process, an initial screen of hundreds or thousands of compounds is whittled down to a final pool of just three or four candidates, but even this small handful is too expensive and too slow to thoroughly investigate in vivo.

Fortunately, advances in AI are already informing these critical scientific and financial decisions, with prediction and analytics models guiding the design and direction of clinical trials more efficiently than ever before. “Trial outcome prediction plays a role in what industry calls portfolio management, i.e., prioritizing which candidate is most likely to work. The actual experiment is still necessary, but this is a more systematic way to determine where to invest and which trials to run,” explained Sun.

Traditionally, decision makers would look at the historical track record for trials of similar types of drugs and benchmark from that figure. Machine learning methods, on the other hand, combine information from multiple different sources and use this more complex spread of data to determine which factors are most significant for trial success.

Sun’s team employed this approach in their first prediction model, HINT (Hierarchical Interaction Network), which used training data from over 8,000 past trials to predict the outcome of more than 3,400 recent drug studies.15

“We leveraged multiple sources of data—the molecular structure, the disease indications, the trial protocol—and then we augmented that with some knowledge that’s related to drug discovery, for example, wet lab properties, historical track records, etc,” explained Sun. “All this information was put together through a graph neural network machine learning model which considers the interaction of all these components to finally make a trial prediction.”
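A heavily simplified sketch of that multi-source idea is given below. It fuses invented feature blocks with an off-the-shelf gradient-boosting classifier rather than HINT's graph neural network, and every feature and label is synthetic, purely to show the shape of the workflow.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_trials = 500

# Invented feature blocks standing in for the data sources described above.
molecule = rng.normal(size=(n_trials, 8))    # e.g., molecular descriptors
disease  = rng.normal(size=(n_trials, 4))    # e.g., indication embeddings
protocol = rng.normal(size=(n_trials, 6))    # e.g., enrolment size, phase, duration

X = np.hstack([molecule, disease, protocol])  # simple feature fusion

# Synthetic outcomes loosely driven by a mix of the blocks, just for demonstration.
logits = molecule[:, 0] + 0.5 * disease[:, 1] - 0.8 * protocol[:, 2]
y = (logits + rng.normal(scale=0.5, size=n_trials) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print(f"Held-out accuracy on synthetic data: {model.score(X_test, y_test):.2f}")
```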

Notably, this initial model correctly predicted the success of Merck’s Sitagliptin (diabetes) and Bayer’s Aflibercept (glaucoma). It also anticipated the costly failures of the promising drugs Entresto (heart failure) and Fevipiprant (asthma), which cost an estimated $240 million in unsuccessful trials.15 The team later developed a second iteration of this model called SPOT (Sequential Predictive mOdeling of clinical Trial outcome), which weights input data according to time and therefore aligns predictions more closely with the structures and protocols of modern trials.16

Building on the success of HINT and SPOT, they most recently reported the Clinical Trial Outcome (CTO) benchmark, which establishes a next-generation dataset with over 125,000 trials, richer multimodal features, and continuous temporal updates.17 This enables more robust, forward-looking evaluation of trial outcome prediction models under real-world distribution shifts, setting a new standard for scalable and deployable clinical AI, said Sun.

But even with a solid prediction of success, the cost and difficulty of recruiting suitable patients for a trial can still delay advances in healthcare. A robust phase three trial requires a few thousand patients, split roughly 50:50 between treatment and control arms. Studying each individual patient consequently amounts to a huge investment in both time and money, and the challenge is further compounded for rare or aggressive diseases, where even finding sufficient patients for a valid trial can take years. “In these cases, sometimes the control arm is not even implemented at all because the patients are just so rare. And of course, for individual patients on the trials, they will probably prefer the treatment arms,” said Sun. “But from a scientific point of view, we do still need the control arms.”

One emerging solution is the digital twin, a dynamic virtual replica of a particular individual patient that can simulate that person’s health trajectory under different treatment regimens. “This is especially useful for control arms. Instead of using a real patient as a control arm, we can simulate the patient’s trajectory and compare that directly to the same patient receiving treatment,” explained Sun. “This reduces the trial recruitment process. You don’t need as many patients so it can speed up the trials and also put the patients on the new treatment options.”

In 2023, Sun’s team reported their first digital patient twin method, TWIN, a generative model that combines historic data from electronic health records (including details of prescribed medications, existing treatment regimens, and any adverse effects) and uses these to simulate various patient outcomes under different conditions.18 “To train a digital twin model, you code based on a cohort of patients to see what happened and the model learns from that large number,” Sun explained. “Once it’s trained, the application is then used for individual patients to simulate what will happen to that patient.”
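The simulation step itself can be sketched in a few lines: fit a simple next-visit model on a cohort, then roll it forward from one patient's baseline to produce treated and untreated trajectories. The linear dynamics, the invented measurements, and the fixed treatment effect below bear no relation to TWIN's generative model; they only illustrate the idea of a counterfactual control arm.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented cohort: 200 patients, 12 visits, 3 measurements per visit
# (e.g., a lab value, a symptom score, a vital sign).
n_patients, n_visits, n_feats = 200, 12, 3
cohort = rng.normal(size=(n_patients, n_visits, n_feats))

# "Training": fit a linear next-visit model, state_next ~ state_prev @ A.
prev = cohort[:, :-1, :].reshape(-1, n_feats)
nxt = cohort[:, 1:, :].reshape(-1, n_feats)
A, *_ = np.linalg.lstsq(prev, nxt, rcond=None)

def simulate(baseline: np.ndarray, steps: int, treated: bool) -> np.ndarray:
    """Roll the fitted model forward from a single patient's baseline visit."""
    traj = [baseline]
    for _ in range(steps):
        state = traj[-1] @ A
        if treated:
            state = state + np.array([0.3, -0.2, 0.0])  # invented treatment effect
        traj.append(state + rng.normal(scale=0.05, size=n_feats))
    return np.array(traj)

baseline = cohort[0, 0]
control_twin = simulate(baseline, steps=11, treated=False)   # simulated control arm
treated_twin = simulate(baseline, steps=11, treated=True)
print("Final-visit difference (treated - control):", treated_twin[-1] - control_twin[-1])
```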

This simulation approach is already being trialed by pharma companies, and Sun hopes this method can help developers design and refine their trials in the future. He is currently looking at reducing the training data demand for producing these patient-specific models. “At the moment, if you want to simulate a patient trajectory, you need patient-level data to do that,” said Sun. “We’re working on whether we can leverage publications about other trials with aggregate statistics. Ultimately, we want to reverse engineer a digital twin model that can produce individual patient-level data from the statistics of a cohort.”

 

Regardless of research field, there are tools available at all levels of complexity, making the integration of AI into scientific workflows accessible to everyone. This isn’t just relevant for researchers—departmental support staff can also incorporate these tools and systems into their own work, for example, introducing ELNs to teaching labs or developing a digital twin of building systems to optimize heating, lighting, and air flow. As more people familiarize themselves with these tools, the transformative impact of AI in science will grow, leading to a more efficient and sustainable future for research. 


