Machine learning tools accelerate materials discovery

Literature searches, simulations, and hands-on experiments have been part of the materials science toolkit for decades, but the past few years have seen an explosion of machine learning-driven software tools that promise to accelerate all three.

Many of the challenges facing the semiconductor manufacturing industry are fundamentally materials science problems. What metal has the lowest resistance at nanowire dimensions? And what precursors can be used to deposit that metal? Which photoresists offer the best combination of etch resistance and sensitivity to EUV photons? Which oxide semiconductors with excellent carrier mobility are most compatible with CMOS BEOL processes? What is happening chemically, electrically, and thermally at the interfaces between all these layers?

As manufacturing processes become more complex, the number of materials required also increases rapidly. Once-rare elements like ruthenium are now critical components of cutting-edge processes. Adhesion promoters, deposition precursors, and many other auxiliary compounds support each material that remains on the wafer.

Identifying and evaluating candidate materials requires process engineers to analyze vast amounts of data. Bulk properties such as resistivity and thermal conductivity are the starting point, but these properties often change as feature size decreases. Device integration also raises new questions, from surface interactions to long-term stability.

Materials discovery is often described as a funnel in which a large pool of initial candidates is gradually narrowed to a small number of potential solutions. At each step, engineers require more detailed information about each candidate's behavior. Some of this data already exists, either in the technical literature or in an organization's own institutional knowledge. Some can be calculated by simulation. And some can only be obtained through experiments.
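The funnel amounts to a chain of progressively stricter and more expensive filters. The short Python sketch below illustrates that structure only; the property names, thresholds, and candidate values are hypothetical placeholders, not real screening criteria.

```python
# Minimal sketch of a materials-discovery "funnel": each stage applies a
# stricter, more expensive filter to a shrinking candidate list.
# All property names and thresholds are hypothetical.
candidates = [
    {"name": "A", "resistivity": 5.2, "thermal_stability_C": 450, "film_uniformity": 0.91},
    {"name": "B", "resistivity": 7.8, "thermal_stability_C": 600, "film_uniformity": 0.88},
    {"name": "C", "resistivity": 4.9, "thermal_stability_C": 380, "film_uniformity": 0.95},
]

def screen_bulk(c):          # stage 1: cheap lookup of tabulated bulk data
    return c["resistivity"] < 8.0

def screen_simulation(c):    # stage 2: stand-in for a simulation result
    return c["thermal_stability_C"] > 400

def screen_experiment(c):    # stage 3: stand-in for lab measurements
    return c["film_uniformity"] > 0.9

survivors = candidates
for stage in (screen_bulk, screen_simulation, screen_experiment):
    survivors = [c for c in survivors if stage(c)]
    print(f"{stage.__name__}: {len(survivors)} candidates remain")
```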

Atomistic models: Filling the funnel
The first step is atomistic, physics-based simulation, which forms the foundation for the other tools. The most rigorous of these methods, density functional theory (DFT) models, attempt to solve the Schrödinger equation for the system of interest. Although improvements in computational power have made larger systems more accessible, DFT calculations remain most tractable for ideal crystals. Disordered systems like glasses, defective systems with broken symmetry, and thermal fluctuations are all important aspects of real materials, and all are extremely challenging computationally for DFT techniques.
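To make the cost argument concrete, the sketch below uses the Atomic Simulation Environment (ASE) with its built-in EMT toy potential as a cheap stand-in for a real DFT code (a production workflow would swap in a calculator such as VASP or Quantum ESPRESSO). It shows why defects are expensive: a perfect fcc crystal needs only a one-atom cell, while a single vacancy forces a 255-atom supercell calculation.

```python
# Sketch of an atomistic calculation with ASE. EMT is a toy classical
# potential used here only as a placeholder for DFT; real DFT codes plug
# into the same Atoms/calculator interface.
from ase.build import bulk
from ase.calculators.emt import EMT

# Perfect crystal: one atom in the primitive cell is enough.
perfect = bulk("Cu", "fcc", a=3.6)
perfect.calc = EMT()
e_per_atom = perfect.get_potential_energy() / len(perfect)

# Defective crystal: a 4x4x4 conventional supercell (256 atoms), one removed.
supercell = bulk("Cu", "fcc", a=3.6, cubic=True).repeat((4, 4, 4))
del supercell[0]
supercell.calc = EMT()
e_vacancy = supercell.get_potential_energy() - len(supercell) * e_per_atom
print(f"Approximate vacancy formation energy: {e_vacancy:.2f} eV")
```

Since DFT cost scales steeply with atom count, every broken symmetry of this kind multiplies the computational burden.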

Michele Ceriotti, a professor at the Swiss Federal Institute of Technology in Lausanne (EPFL), pointed to the development of machine-learned interatomic potentials (MLIPs) over the past decade as an important advance.[1] Given DFT simulations of representative reference configurations, machine learning tools can interpolate the potential energy surface between them. For example, such a model could start with accurate simulations of the perfect crystal and the immediate vicinity of a defect, then use an MLIP to investigate how the defect distorts the potential energy surface. How does that surface change as the number of defects increases and the spacing between them shrinks?
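A one-dimensional caricature of the MLIP idea appears below: fit a Gaussian process to a handful of expensive reference energies, then interpolate the potential energy surface between them at negligible cost. Real MLIP frameworks (GAP, NequIP, MACE, and others) do the same over high-dimensional, symmetry-aware descriptors; the quadratic "equation of state" here is an invented toy.

```python
# Toy 1-D MLIP: a Gaussian process interpolates between sparse, expensive
# reference calculations. Data are invented for illustration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Pretend each point is a converged DFT calculation at one lattice strain.
strain = np.array([-0.04, -0.02, 0.0, 0.02, 0.04]).reshape(-1, 1)
energy = 250.0 * strain.ravel() ** 2 - 3.71          # toy quadratic EOS, eV

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.02), normalize_y=True)
gp.fit(strain, energy)

# Query thousands of intermediate geometries cheaply.
grid = np.linspace(-0.05, 0.05, 1000).reshape(-1, 1)
pred, std = gp.predict(grid, return_std=True)        # std flags extrapolation
print(f"Predicted minimum near strain {grid[np.argmin(pred)][0]:+.3f}")
```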

Imec principal investigator Geoffrey Pourtois said in a recent interview that large-scale efforts like the Materials Project use atomistic simulations to build data libraries that characterize the fundamental properties of a vast range of compounds. These libraries support the first stage of the materials discovery funnel: initial screening.

Pourtois emphasized that a clearly formulated description of the problem to be solved is essential at this stage. For example, if a team is trying to identify a “better” interlayer dielectric, what does that mean? A lower dielectric constant? Increased mechanical stability? Less interaction with new metals introduced into the process? A tool trained on something like the Materials Project dataset may return hundreds or even thousands of candidates with good dielectric properties, but likely only a handful of them will meet the other requirements.
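In code, that problem statement becomes an explicit multi-criteria query. The fragment below sketches the idea against a hypothetical table of dielectric candidates; the column names, thresholds, and property values are illustrative, though in practice the rows might come from a Materials Project-style database export.

```python
# Translate "a better interlayer dielectric" into queryable criteria.
# All values are illustrative placeholders.
import pandas as pd

df = pd.DataFrame({
    "formula":        ["SiO2", "SiOC:H", "AlN", "BN", "SiCN"],
    "dielectric_k":   [3.9,    2.8,      8.9,   4.0,  5.1],
    "youngs_mod_GPa": [70,     8,        330,   100,  150],
    "band_gap_eV":    [8.9,    6.0,      6.2,   5.9,  4.5],
})

# "Better" spelled out: low k AND mechanically stable AND insulating.
hits = df.query("dielectric_k < 5.0 and youngs_mod_GPa > 50 and band_gap_eV > 5.0")
print(hits[["formula", "dielectric_k"]])
```

Note how the low-k champion in this toy table fails the mechanical criterion, which is exactly the kind of trade-off a vague problem statement hides.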

Anders Blom, principal solutions engineer at Synopsys, said that while atomistic simulations have limited ability to predict real-world behavior, their results can provide parameters for higher-level tools. The list of candidate materials may include well-characterized materials already used in the application, as well as materials for which little data exists beyond their basic properties. By combining atomistic simulations with whatever experimental data is available, machine learning models can predict where candidates will fall relative to the “known” materials. As experimental work progresses, the model can be refined and used to narrow the candidate list further.
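The sketch below illustrates that refinement loop: a regression model is trained on simulated descriptors of well-characterized materials, ranks new candidates against them, and is retrained as measurements arrive. The features, values, and choice of a random forest are all hypothetical; any regressor would show the same pattern.

```python
# Place poorly-characterized candidates relative to "known" materials,
# then fold new measurements back in. All numbers are invented.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Columns: [simulated bulk resistivity, simulated cohesive energy]
X_known = np.array([[5.5, 3.4], [7.1, 4.1], [4.2, 2.9], [6.0, 3.8]])
y_known = np.array([9.8, 12.5, 8.1, 10.9])   # measured thin-film resistivity

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_known, y_known)

X_candidates = np.array([[4.8, 3.1], [6.5, 4.0]])   # simulation only, no lab data
print("Predicted:", model.predict(X_candidates))

# As an experiment on a candidate completes, retrain with the new point.
X_known = np.vstack([X_known, X_candidates[:1]])
y_known = np.append(y_known, 8.9)                   # new measurement
model.fit(X_known, y_known)
```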

Large language models: Analyzing the literature
In addition to machine-readable databases like the Materials Project, much materials information is stored in formats designed to be read by humans. Technical journals and materials data sheets contain decades of experimental and theoretical results. Suresh Rajaraman, executive vice president and head of the thin films business at EMD Electronics, said large-scale analysis of these sources is the domain of large language models (LLMs).

Unfortunately, using the existing literature to support automated materials discovery is really two separate tasks. First, the model must be able to “understand” language in general. Recognizing that words in close proximity, such as “thermal conductivity of silicon,” convey a single concept is automatic for a human but poses a major challenge for a machine.

Commercial LLMs build their language models by analyzing huge datasets and incorporating billions or even trillions of parameters. Even so, Xue Jiang and colleagues at the University of Science and Technology Beijing found that these general-purpose models were unable to provide the specific, quantitative answers that materials discovery tasks require.[2]

The second task is to retrain the generic model on a more focused, topic-specific database. Once that is done, well-crafted queries can surface correlations that may never have been analyzed directly. For example, Jiang et al. identified materials that appear in the context of words such as “cathode” and “electrochemical” and are also associated with thermoelectricity, but whose thermoelectric properties have not yet been measured. (See Figure 1 below.)

Figure 1: Identifying candidate materials using contextual analysis [Ref. 2].
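A toy version of this contextual analysis appears below: train word embeddings on a materials corpus, then rank material names by vector similarity to a concept word like “thermoelectric,” even when the two never co-occur directly. The three-sentence corpus stands in for millions of abstracts, and gensim's Word2Vec is just one common embedding choice, not necessarily what Jiang et al. used.

```python
# Toy contextual analysis: embeddings trained on a (tiny) materials corpus.
# A real pipeline would train on millions of tokenized abstracts.
from gensim.models import Word2Vec

corpus = [
    ["Bi2Te3", "is", "a", "classic", "thermoelectric", "material"],
    ["SnSe", "shows", "a", "high", "thermoelectric", "figure", "of", "merit"],
    ["CuAlO2", "is", "studied", "as", "a", "cathode", "and", "electrochemical", "material"],
]
model = Word2Vec(corpus, vector_size=32, window=5, min_count=1, epochs=200, seed=1)

# Rank vocabulary terms by similarity to the concept word.
print(model.wv.most_similar("thermoelectric", topn=5))
```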

Generative models: Designing new materials
All of the tools described so far work with materials that already exist in the literature: someone has synthesized and characterized them, or modeled them in software.

For precursors, complex oxides, and similar compounds, however, the space of possible materials far exceeds such existing datasets. This is where generative tools come into play. Given a set of desired properties and a training set of materials that exhibit them, a generative neural network attempts to find new materials that “belong” with the training set.

Evaluation tools such as simulators test the proposed materials and feed the results back to the generative model, which refines itself to produce better candidates. Candidates generated this way can then be evaluated in the same simulation tools as “real” materials to determine which are actually worth synthesizing.[3]
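Stripped of any real chemistry, the feedback structure looks like the loop below. A true generative model would be a VAE, GAN, or diffusion model, and the evaluator a physics simulator; here both are placeholder functions over an invented three-component composition.

```python
# Toy generate-evaluate-feedback loop. Both functions are placeholders.
import random

def evaluate(composition):
    # Stand-in for a simulator: pretend the ideal composition is known.
    target = [0.5, 0.3, 0.2]
    return -sum((c - t) ** 2 for c, t in zip(composition, target))

def generate(parent, step=0.05):
    # Propose a new candidate near a promising one, renormalized to sum to 1.
    raw = [max(0.0, c + random.uniform(-step, step)) for c in parent]
    total = sum(raw) or 1.0
    return [c / total for c in raw]

best = [1 / 3] * 3                            # uninformed starting composition
for generation in range(200):
    proposal = generate(best)
    if evaluate(proposal) > evaluate(best):   # feedback: keep improvements
        best = proposal
print("Best composition found:", [round(c, 3) for c in best])
```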

It is difficult to overstate how much machine learning tools and GPU hardware are changing the way materials discovery is done. Without these tools, few researchers would consider evaluating hundreds, much less thousands, of candidate compounds. With them, the initial evaluation of a candidate material can take just milliseconds. Even on relatively modest computing hardware, simulations that previously took weeks or months can be completed over lunch or overnight.

Nevertheless, experimental studies become increasingly important as materials progress from initial screening to process integration. For new materials, Pourtois explained, there isn't enough data to feed process or DTCO models. Process integration schemes inherently involve more variables, increasing the amount of detail a complete simulation requires. AI tools cannot replace experiments, but they allow researchers to test a narrower range of candidates more intensively.

  1. Ceriotti, M., “Beyond potentials: Integrated machine learning models for materials,” MRS Bulletin 47, 1045–1053 (2022). https://doi.org/10.1557/s43577-022-00440-0
  2. Jiang, X., Wang, W., Tian, S., et al., “Applications of natural language processing and large language models in materials discovery,” npj Comput Mater 11, 79 (2025). https://doi.org/10.1038/s41524-025-01554-0
  3. Pyzer-Knapp, E.O., Pitera, J.W., Staar, P.W.J., et al., “Accelerating materials discovery using artificial intelligence, high-performance computing, and robotics,” npj Comput Mater 8, 84 (2022). https://doi.org/10.1038/s41524-022-00765-z
