Advances in ML in data-poor drug discovery stages

Drug discovery pipelines are notoriously expensive, time-consuming, and failure-prone, and AI and machine learning are increasingly being used to accelerate progress and improve outcomes.

Currently, machine learning in drug discovery focuses on stages that generate abundant data for training algorithms. However, parts of the pipeline that generate less data can also benefit from machine learning.

Ahead of the Society for Laboratory Automation and Screening (SLAS) Conference 2026, Technology Networks spoke with Dr. Daniel Reker, assistant professor of biomedical engineering at Duke University, about his work on pairwise molecular learning, which enables better computational decision-making in data-poor scenarios.

In this interview, Reker discusses how pairwise molecular learning is opening up new avenues in drug discovery, including first-in-class drug candidates, and explores what happens when machine learning is integrated into automated laboratories.

Katie Brighton (KB): How would you describe the role machine learning plays in modern drug discovery today, and where is it still falling short?

Dr. Daniel Reker (DR): Machine learning is actively reshaping drug discovery across multiple stages of the pipeline, and we are seeing widespread adoption by pharmaceutical and biotech companies, as well as interest from technology companies and a number of startups. Currently, the majority of these efforts are focused on target identification, lead generation, and clinical trials. Although it is still too early to make a final assessment, early data suggest that computational approaches have accelerated timelines and slightly improved success rates, which could have important implications given how expensive, time-consuming, and failure-prone drug discovery can be.

However, the current impact of machine learning is largely concentrated in data-rich stages, such as high-throughput screening, genomics, and work with large clinical datasets, which enable training and fine-tuning of complex algorithms.

Significant progress still needs to be made in addressing data-poor drug discovery challenges such as lead optimization, safety, and formulation development. These steps involve complex synthesis, material characterization, and in vivo studies, yet they represent important decision points that determine the fate of drug candidates.

Innovations in experimental platforms and robust computational algorithms are poised to support these decisions, potentially reducing costs and failure rates even more than has been possible so far, and ultimately enabling the community to deliver more and better treatments to patients.

KB: Could you explain in more detail what pairwise molecular learning is?

DR: Pairwise molecular learning transforms traditional machine learning tasks into contrastive problems where the algorithm directly compares two molecules, rather than evaluating each independently.

Essentially, instead of asking a computer, “How potent is molecule A?”, we transform the question into “Which of these two molecules is more potent?” This allows for combinatorial data expansion, creating millions of molecular comparisons from just a few hundred to a few thousand original data points. Simply put, it gives deep neural networks different perspectives on the same underlying data to increase training efficiency.

This allows you to train state-of-the-art deep learning architectures on datasets of just 100 to 1,000 compounds. That is the data regime in which much of the real-world decision-making about important properties such as drug safety, metabolism, and pharmacokinetics takes place. Measuring these properties experimentally is expensive but essential for advancing the best candidates. We believe that pairwise learning allows the community to unlock the predictive power of deep neural networks for these data-poor but high-value decision points.
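
As an editorial illustration of the combinatorial expansion described above, the following minimal Python sketch pairs every molecule with every other one. It is not Reker's implementation: the function name `pairwise_expand`, the concatenated-feature representation, the descriptor values, and the potency numbers are all assumptions made for the example.

```python
from itertools import permutations


def pairwise_expand(features, values):
    """Turn N single-molecule records into N * (N - 1) ordered comparisons.

    Each pair is represented by the two feature vectors concatenated; the
    target is the property difference (second molecule minus first).
    """
    pairs, deltas = [], []
    for i, j in permutations(range(len(values)), 2):
        pairs.append(features[i] + features[j])  # list concatenation
        deltas.append(values[j] - values[i])
    return pairs, deltas


# Toy data: 4 hypothetical molecules with 3 made-up descriptor values each.
features = [[0.1, 1.0, 0.0], [0.4, 0.0, 1.0], [0.9, 1.0, 1.0], [0.3, 0.0, 0.0]]
potency = [5.2, 6.1, 7.4, 4.8]

pairs, deltas = pairwise_expand(features, potency)
print(len(pairs))  # 4 * 3 = 12 ordered pairs

# "Which of these two molecules is more potent?" as a yes/no label per pair.
second_is_more_potent = [d > 0 for d in deltas]
```

With this counting, 1,000 measured compounds would yield 1,000 × 999 = 999,000 ordered pairs, which is how a few hundred to a few thousand data points can grow into roughly a million comparisons.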

KB: What avenues does pairwise molecular learning open in drug discovery?

DR: Pairwise molecular learning opens up several exciting avenues in drug discovery. First, directly predicting which chemical modifications will improve important drug properties such as safety, metabolism, and efficacy enables more precise computational molecular optimization. This allows medicinal chemists to prioritize which compounds to synthesize next, saving time and resources.

Second, this pairwise expansion approach enables better computational decision-making in data-poor scenarios. This is particularly valuable for properties such as drug safety, metabolism, and formulation, critical decision points where experimental data are limited and expensive to generate.

It can also improve predictive performance for novel and challenging drug targets about which little knowledge has been accumulated to date, providing an opportunity for machine learning to better support the identification of first-in-class therapeutics. This capability is further enhanced by the ability of pairwise learning to incorporate limited or poorly characterized data points that are typically discarded from modeling efforts. Although these data points are not characterized well enough to be incorporated directly into traditional models, they still provide important perspective and contrast with stronger candidates.

Third, our data suggest that this algorithm excels at identifying truly novel molecules. Learning the effects of changes in molecules, rather than simply identifying analogs of known compounds, avoids the memorization problems common to complex algorithms and allows algorithms to focus on learning relationships and patterns. Our proof-of-concept data shows that this allows for more fundamental structural changes during optimization, with strong potential to further enhance the safety and efficacy of drug candidates.
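
One way to picture the earlier point about poorly characterized data points is sketched below in Python. It assumes, purely for illustration, that such points are censored measurements (for example "IC50 > 10 µM"), which cannot serve as exact regression targets but can still be ranked against well-measured compounds. The interview does not specify this; the `(qualifier, value)` encoding and the function names are hypothetical.

```python
import math


def interval(measurement):
    """Map a (qualifier, value) record to the range its true value lies in."""
    qualifier, value = measurement
    if qualifier == "=":
        return value, value
    if qualifier == ">":
        return value, math.inf
    if qualifier == "<":
        return 0.0, value
    raise ValueError(f"unknown qualifier: {qualifier!r}")


def second_is_more_potent(a, b):
    """Compare two IC50 records (micromolar; LOWER means more potent).

    Returns True if b is provably more potent than a, False if provably
    less potent, and None when the bounds leave the comparison undecided.
    """
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    if hi_b < lo_a:
        return True
    if hi_a < lo_b:
        return False
    return None


weak = (">", 10.0)    # assay only established "weaker than 10 µM"
strong = ("=", 0.5)   # well-characterized compound

print(second_is_more_potent(weak, strong))        # True: a usable training pair
print(second_is_more_potent(weak, (">", 30.0)))   # None: cannot be ranked
```

Under this assumption, the weakly characterized compound contributes a valid comparison against every compound measured below its bound, even though it would be dropped from a conventional single-molecule regression.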

KB: What are the biggest benefits you’ve seen by combining machine learning and automated labs? Where are the remaining bottlenecks?

DR: The biggest benefit I’ve seen from combining machine learning and automated labs is in creating truly adaptive experimental design loops. The machine learning community refers to these as “active learning workflows” to indicate that predictive algorithms are directly involved in data acquisition and can request the most useful and valuable data points. Our research and that of others has shown that such active learning settings can reduce the data needed for decision-making by up to 90% and enable better predictive models by directly addressing bias in the data. These settings have helped identify new drug candidates using fewer data points, as well as new nanoparticle formulations that enhance drug efficacy and safety with greater precision.

The main bottlenecks that remain in the deployment of such feedback loops center on the robustness of automation infrastructure and algorithms. Most high-throughput screening platforms are optimized for scale at the expense of flexibility. For example, they rely on rapid screening of predefined compound libraries rather than allowing adaptive selection of individual experiments suggested by algorithms. In addition, integrating material characterization and even in vivo studies into these automated workflows is difficult.

We believe these feedback cycles are most effective in scenarios where data are really sparse, such as early-stage projects with fewer than 100 data points. However, building predictive models and deciding which data points to acquire next remains difficult even with the most data-efficient computational approaches. We are tackling this problem through the development of pairwise learning techniques and other new active learning techniques such as yoked learning, where algorithms work together in pairs. There is significant scope for further innovation in automation architectures and experiment design strategies to maximize the impact of integrated laboratories on drug discovery.
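
The adaptive loop described in this answer can be sketched in a few lines of Python. This is an editorial illustration under stated assumptions, not the workflow used in Reker's lab: the model is a bootstrap ensemble of one-dimensional linear fits, the acquisition rule is "request the candidate the ensemble disagrees on most", and `run_experiment` is a hypothetical stand-in for an automated measurement.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares slope and intercept for 1-D data."""
    if len(set(xs)) < 2:  # degenerate resample: fall back to a constant
        return 0.0, statistics.fmean(ys)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def ensemble_predictions(train, x, rng, n_models=20):
    """Predict x with models fit on bootstrap resamples of the training set."""
    preds = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        slope, intercept = fit_line([s[0] for s in sample], [s[1] for s in sample])
        preds.append(slope * x + intercept)
    return preds


def active_learning(pool, run_experiment, n_initial=3, n_rounds=5, seed=0):
    """Start from a few random measurements, then let the model pick the rest."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    train = [(x, run_experiment(x)) for x in pool[:n_initial]]
    pool = pool[n_initial:]
    for _ in range(n_rounds):
        # Request the candidate with the largest ensemble disagreement.
        pick = max(
            pool,
            key=lambda x: statistics.pvariance(ensemble_predictions(train, x, rng)),
        )
        pool.remove(pick)
        train.append((pick, run_experiment(pick)))
    return train


candidates = [float(i) for i in range(20)]
measured = active_learning(candidates, run_experiment=lambda x: 0.5 * x + 1.0)
print(len(measured))  # 3 initial + 5 requested = 8 experiments
```

In a real automated lab the call to `run_experiment` is where the flexibility bottleneck mentioned above would bite, since each round asks the platform for one specific, algorithm-chosen experiment rather than a predefined library screen.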

KB: Can you give us a teaser of your talk at SLAS 2026?

DR: I’m really looking forward to an exciting SLAS 2026. There will be a number of great presentations and discussions centered around the intersection of automation and AI in drug discovery.

In particular, my talk will introduce some of the pairwise and active learning concepts that we have been developing, as well as new unpublished developments that we think the community will find interesting. One of the highlights is a new class of algorithms that strategically “forget” data to enhance learning. Although this seems counterintuitive, we see notable improvements in how quickly these models converge to better solutions.

To demonstrate the practical potential of these algorithms, we will incorporate specific examples from our work in drug discovery and nanoparticle design. The goal is to demonstrate how adaptive machine learning can lead to better decisions at every stage of drug development, from identifying early hits to optimizing formulations.

We look forward to connecting with potential partners and collaborators interested in implementing these approaches into their own pipelines. Real progress will come by getting these tools into the hands of more research teams in academia and industry.
