Smart Test Collides With The Data Chain


Key Takeaways:

  • The promise of smart test is a data-chain problem before it is an algorithm problem.
  • A device can pass every checkpoint and still carry a latent defect the test record never captured.
  • As test grows more adaptive, the validity of the measurement environment matters as much as the measurement itself.

For years, the test roadmap has pointed toward more adaptive flows, better binning, shorter test times, and machine learning models that can decide what needs to be tested, what can be skipped, and where the next failure is likely to appear. This “smart test” approach still holds, but as devices become increasingly heterogeneous and more difficult to characterize at a single insertion point, there may not be enough context in the data feeding into an ML model to identify an actionable pattern.

As a result, smart test is becoming more of a costly traceability challenge than a test-time optimization problem. Manufacturers need to be able to connect and use data about what happened in the fab, what was measured at test, what the model predicted, and what the device later does in the field. But the model is only one layer. The harder problem is building the data infrastructure that lets the model act on the right device, at the right insertion, and with the right process history attached.

Smart test leverages adaptive test rules, machine-learning models, and feedforward or feedback data to adjust measurements, limits, binning, or material flow using information from fab metrology, inspection, electrical test, packaging, and in-field monitoring. It is not a single standard, protocol, or product category, but a way to describe the shift from fixed test sequences toward manufacturing decisions based on data gathered before, during, and after test insertions.

“The greatest impact on value is the ability to collect, align, and normalize data, and then have the infrastructure to deploy the model wherever it would be useful,” said Greg Prewitt, director of Exensio Solutions at PDF Solutions. “That requires a lot of platform data engineering underneath, and traceability through all that.”

A single device can now move through wafer sort, package assembly, burn-in, final test, system-level test, and field monitoring, carrying with it data generated by fab metrology tools, inspection systems, automated test equipment, handlers, probers, and customer-specific analytics platforms. The value of smart test depends on keeping all of those data streams connected, interpretable, and trusted across every handoff.

In the past, testers were optimized to cache data and build log files in ways that preserved throughput rather than minimized latency. Now they’re being asked to do something fundamentally different. If a measurement is expected to influence a limit, a flow, a site behavior, or a downstream insertion, the test cell needs faster access to data and enough compute capability to act without slowing production.

“Tolerating hours or days of time between data collection and action is what we’ve been doing for a long time,” said Eli Roth, smart manufacturing product manager at Teradyne. “Now we’re seeing that latency compressed to minutes, even seconds.”

That time compression is exposing weaknesses across the manufacturing data chain. Some are physical, including thermal instability, probe variation, socket wear, and contact resistance drift. Others are informational, including missing metadata, inconsistent device identity, and model outputs that can’t be traced back to a meaningful physical cause. Each can undermine adaptive test in a different way, but all point to the same underlying requirement: smart test depends on data that can be trusted quickly enough to act upon.

“I don’t think smart test today is about building the models or building the rules,” Roth said. “It’s about understanding your latency and your compute requirements without affecting throughput.”

The high cost of missed defects
Test coverage is always constrained by economics. IC makers may want more data, but each additional measurement has to be justified against equipment cost, test time, labor, power, floor space, throughput, and the average selling price of the part.

“IC makers have a budget for test costs,” said Don Blair, business development manager at Advantest. “They test as much as they can up to their budget. After that, they have to find some way to cut test costs, including cutting test time or removing tests.”

The economic stakes are clearest in advanced packaging. A missed defect that might cost only additional test time in a monolithic flow can consume the value of an entire assembled multi-die product. That asymmetry directly affects the value of screening.

“The main challenge in testing advanced packaging is making sure the dies you select as candidates for assembly are defect-free,” said Nir Sever, senior director of business development at proteanTecs. “If you miss a defect in the chiplet die during wafer sort, and then you assemble it and find it at final test, you basically scrap the entire product, which can be orders of magnitude more expensive than a single die.”

In a compounded-yield environment, small uncertainties multiply quickly. A die may pass broad statistical limits but still carry latent risk if its internal behavior does not match what its own process and timing signatures would predict. The inverse is equally important. A part that looks unusual relative to a population may still be healthy if it behaves consistently with its expected individual profile.
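
The multiplication is easy to see with a little arithmetic. The sketch below assumes that escapes are independent from die to die and that any escaped defect scraps the whole package. The 0.5% escape rate and the $2,000 package cost are illustrative placeholders, not figures from any company quoted here.

```python
def package_escape_probability(per_die_escape_rate: float, dies_per_package: int) -> float:
    """Chance that at least one die in a package carries an undetected defect.

    Assumes escapes are independent from die to die, which is a simplification.
    """
    return 1.0 - (1.0 - per_die_escape_rate) ** dies_per_package


def expected_scrap_cost(per_die_escape_rate: float, dies_per_package: int,
                        package_cost: float) -> float:
    """Expected value lost per assembled package if any escaped defect scraps it."""
    return package_escape_probability(per_die_escape_rate, dies_per_package) * package_cost


# Illustrative numbers only: 0.5% escapes per die, a hypothetical $2,000 package.
for dies in (1, 4, 8, 16):
    risk = package_escape_probability(0.005, dies)
    cost = expected_scrap_cost(0.005, dies, 2000.0)
    print(f"{dies:2d} dies: package risk {risk:.2%}, expected scrap ${cost:,.2f}")
```

Under these assumptions the package-level risk grows almost linearly with die count while escape rates are small, which is why a screening improvement that looks marginal per die can matter a great deal per package.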

“The question is, can you collect enough data during test that it informs more than a simple pass/fail?” Sever said. “Parametric data from within the chip, in thousands or tens of thousands of locations, can be used to train models and identify those outliers that cannot be found by any other statistical method. It’s a personalized assessment rather than a statistical one.”
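
The difference between a population view and a per-device view can be illustrated with a toy example. The sketch below is not the proteanTecs method. It is a deliberately simple stand-in that assumes one hypothetical process-monitor reading per die predicts one measured parameter through a straight-line fit. A die is a "population outlier" if its measurement sits far from the population mean, and a "personal outlier" if it sits far from what its own process reading predicts.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def z_scores(values):
    """Distance of each value from the mean, in population standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # no spread, so nothing stands out
    return [(v - mu) / sigma for v in values]


def classify(process_readings, measurements, z_limit=3.0):
    """Flag each die against the population and against its own expected value."""
    slope, intercept = fit_line(process_readings, measurements)
    residuals = [y - (slope * x + intercept)
                 for x, y in zip(process_readings, measurements)]
    population_z = z_scores(measurements)
    personal_z = z_scores(residuals)
    return [
        {"population_outlier": abs(p) > z_limit, "personal_outlier": abs(r) > z_limit}
        for p, r in zip(population_z, personal_z)
    ]
```

In this toy setup, a die whose measurement falls comfortably inside the population spread can still be flagged if it deviates from what its own process reading predicts. The reverse also holds: a die at the edge of the population passes the personal check when it tracks its expected profile.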

When the measurement path becomes part of the data
As testing becomes more adaptive, the validity of the measurement environment increasingly matters alongside the measurement itself. Socket wear, debris, contact resistance, thermal variation, calibration drift, and equipment state can all affect whether a result represents the device or the conditions under which it was measured. And if a contact problem looks like a device failure, an adaptive system may feed the wrong signal into a binning decision or a downstream insertion with no indication that anything went wrong.

“Intermittent contact resistance, false opens, and debris-induced shorts are the usual culprits, and they all look like device failures until someone thinks to check the socket,” said Vidya Vijay, director of business development at Nordson Test and Inspection. “Parameters like contact coplanarity and contact height from the seating plane are particularly deceptive when they drift, which can cause multiple issues.”

The same principle applies when test flows are asked to make decisions on very short timescales. If a tester has only milliseconds or seconds to adjust limits or flows, the data path must distinguish a device signal from a test setup artifact quickly enough to avoid amplifying the error.
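
One common-sense way to separate the two is to compare recent fail rates across test sites, on the reasoning that a worn socket or contaminated contact tends to affect one site while a genuine device problem tends to show up everywhere. The sketch below is a minimal illustration of that idea. The class name `SiteFailMonitor`, the window size, and the thresholds are all hypothetical and would need tuning against real production data.

```python
from collections import defaultdict, deque


class SiteFailMonitor:
    """Flags test sites whose recent fail rate is far above the other sites."""

    def __init__(self, window=50, ratio_limit=3.0, min_samples=20, floor=0.01):
        self.ratio_limit = ratio_limit  # how many times the baseline counts as suspect
        self.min_samples = min_samples  # ignore sites with too little history
        self.floor = floor              # keeps a zero baseline from flagging everything
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, site: int, passed: bool) -> None:
        """Store the latest result for a site; old results roll out of the window."""
        self.history[site].append(0 if passed else 1)

    def fail_rate(self, site: int) -> float:
        results = self.history[site]
        return sum(results) / len(results) if results else 0.0

    def suspect_sites(self):
        """Return sites that should be checked for contact problems."""
        ready = [s for s, h in self.history.items() if len(h) >= self.min_samples]
        suspects = []
        for site in ready:
            others = [self.fail_rate(s) for s in ready if s != site]
            if not others:
                continue  # nothing to compare against
            baseline = max(sum(others) / len(others), self.floor)
            if self.fail_rate(site) > self.ratio_limit * baseline:
                suspects.append(site)
        return sorted(suspects)
```

A flagged site is a prompt to inspect the socket or probe before trusting the fails, not proof of a contact problem. As the quote from Teradyne's Megna later in this section suggests, the thresholds matter: set them too tight and the cell spends its time on unnecessary maintenance.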

There is also a thermal dimension that has grown more acute with rising power density. For advanced devices, the thermal state of the device during test is a variable that interacts with every measurement being taken. Adding monitoring without carefully calibrating the decision rules around it can create new problems as readily as it solves old ones.

“If you set the sensitivity too high, you’re going to be cleaning the probe tips all the time, which reduces the life of them,” said Damian Megna, product manager for power and thermal instrument solutions at Teradyne. “Depending on how you act on it, it could actually be detrimental to your end goal.”

Models need context
Machine learning in test is often less dramatic than the marketing around it. Models can identify correlations, classify outliers, and recommend possible root causes. However, they do not automatically know whether the input data was labeled correctly or collected under valid conditions. That limitation matters in manufacturing, where a plausible-looking explanation can still be wrong enough to scrap good material or allow a latent defect to escape. Model outputs increasingly need to be treated as part of the controlled data environment. If a model influences a later operation, that output becomes part of the test history and needs to be stored, monitored, and checked for drift.

“It’s important when you use a model that generates predictions or features that those be loaded back into your analytics system as a virtual test operation,” said PDF’s Prewitt. “That lets you put some controls in place, or at least gives you the ability to recognize that the model suddenly shifted and is giving different outputs.”

That approach creates the basis for model governance, connecting model behavior to the process conditions and product mix it was trained on, and flagging when those conditions diverge from what the model is now being asked to evaluate.
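
A minimal sketch of what that could look like in practice follows. It assumes each prediction is stored as a record alongside real test results, tagged with the model version that produced it, and that drift is judged by how far the recent mean has moved from a baseline period. The record fields, the operation name, and the three-sigma limit are hypothetical choices for illustration, not part of any vendor's platform.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class VirtualTestRecord:
    """A model output stored like any other test result (field names are hypothetical)."""
    device_id: str
    operation: str       # e.g. "VIRTUAL_ML_SCORE", an invented operation name
    model_version: str
    value: float


def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    base_values = [r.value for r in baseline]
    recent_values = [r.value for r in recent]
    sigma = pstdev(base_values)
    if sigma == 0:
        raise ValueError("baseline has no spread; cannot scale the shift")
    return abs(mean(recent_values) - mean(base_values)) / sigma


def has_drifted(baseline, recent, limit=3.0):
    """True when the model's recent outputs have moved beyond the chosen limit."""
    return drift_score(baseline, recent) > limit
```

Because the prediction is stored with a device ID and a model version, a shift in outputs can be traced to a specific model release or to a change in incoming material. A production system would likely use a more robust statistic than a simple mean shift.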

“This may eventually evolve to a point where you write models to watch the models,” added Prewitt. “If you have the first-level model and its predictions, and then you have the response from the actual test operation, you could have a model watch those two results and potentially find variation.”

Metrology becomes test context
As adaptive test reaches beyond the tester, in-line metrology and inspection become more important to downstream test decisions. Structural variation may not appear as an electrical failure until after additional processing, packaging, or voltage stress. The value of in-line metrology therefore extends beyond catching excursions as they occur. It also lies in tying structural evidence to downstream electrical behavior.

In silicon carbide power devices, for example, crystalline defects in the substrate may propagate into epitaxial layers and later appear as latent or killer failures under high-voltage load. In another example, small distortions in vertical structures in 3D NAND devices may pass each individual inspection step, but still compound across successive layers until they lead to failure.

The practical impact of high-quality in-line metrology is two-fold. “On the one hand, it means catching process anomalies that truly matter at the time when they occur,” said Lei Zhong, product marketing senior director at Onto Innovation. “In view of the process control challenges in the 3D-device era, we are working closely with our customers to identify those potential ‘escape tunnels’ and find solutions to block them in any way imaginable.”

“On the other hand,” added Mike Rosa, chief marketing officer and senior vice president of strategy at Onto Innovation, “applying the known relationships between key structural device parameters and test capabilities means that key metrology data taken in-line and communicated downstream at the time of device test allows for better correlation between examples of process excursions that can lead to latent or killer defects and the key test parameters that can be used to accelerate failure before these devices go to market. Having metrology data from the fab tied to the device being tested would be a key part of that overall process, and obviously of tremendous value to device suppliers (enabling them to reduce latent or killer defects that may occur in the field).”

The problem is that the handoff often breaks before test engineering can use the information. Metrology or test data may exist, and known correlations between structural features and failure modes may exist. But the data still has to move through the fab, the supply chain, and the test ecosystem with enough discipline to remain attached to the right device.

“Unfortunately, this breakdown happens almost immediately,” added Rosa. “It occurs in the tracking of useful wafer-level metrology or inspection data all the way through wafer fab process. It happens in the correlation of known failure modes and attributes to device structure or materials properties. While in many cases the correlations are known, and the software tools are in place to log the metrology or inspection data and track against the flow of the devices through the supply chain and through test, this process relies on an extremely disciplined supply chain with similarly compatible data tracking capabilities throughout — something that is more often than not a hit-and-miss scenario today.”
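
The handoff problem can be pictured as a join that either succeeds or silently loses context. The sketch below assumes, purely for illustration, that both the fab and the test floor identify a die by lot, wafer, and die coordinates. Real flows are harder because identifiers change at singulation and packaging. All names here are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class DieKey(NamedTuple):
    """A hypothetical shared device identity across fab and test."""
    lot: str
    wafer: int
    x: int
    y: int


def attach_metrology(
    test_records: Dict[DieKey, dict],
    metrology: Dict[DieKey, dict],
) -> Tuple[Dict[DieKey, dict], List[DieKey]]:
    """Attach fab metrology to each tested die and report dies that lost their context."""
    joined: Dict[DieKey, dict] = {}
    orphans: List[DieKey] = []
    for key, test in test_records.items():
        context = metrology.get(key)
        if context is None:
            orphans.append(key)  # tested, but no metrology history came with it
        else:
            joined[key] = {**test, "metrology": context}
    return joined, orphans
```

The list of orphans is the useful output. It makes visible how much material reaches test without its process history, which is the breakdown Rosa describes.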

Physical analysis closes the evidence gap
Physical analysis adds another layer to the chain of evidence because electrical test can identify the presence and approximate location of a failure without always revealing what physically caused it. In advanced packaging, where defects may be buried inside stacked or heterogeneous structures, that distinction becomes increasingly important. Electrical techniques can localize a defect to within a few microns, but the root cause still may be a crack, delamination, non-wet interface, missing pillar, debris-induced short, or another structural feature smaller than the electrical localization can fully resolve.

“The most accurate electrical test will give you the location of a defect to within some microns,” said Thomas Rodgers, senior director of market strategy and head of business sector electronics at ZEISS Microscopy. “But when our customers then need to understand the root cause of that failure, you have to understand what’s going wrong with the device physically.”

That is where non-destructive imaging changes the failure-analysis sequence. If the only way to inspect a buried defect is to cut open the sample, the analysis itself can destroy the evidence. High-resolution 3D X-ray can provide a three-dimensional view before destructive analysis, helping engineers decide whether the X-ray image is sufficient or whether it should guide subsequent FIB-SEM or electron microscopy work.

“If your only way of inspecting defects is to cut the sample, then you’re always running the risk of breaking the item that you’re trying to look at,” Rodgers said. “If you cut past the defect, then the defect is gone, and the important information and learning are also gone. That becomes really critical in advanced packaging, because things are getting three-dimensional and more complex.”

In that role, physical analysis becomes a corrective for smart test. It isn’t practical to brute-force high-resolution imaging across every large chip or package. But once electrical test, acoustic inspection, or other localization techniques narrow the search field, imaging can validate whether an electrical signature corresponds to an actual structural defect and help feed root-cause learning back into the process flow.

The demand for traceability is also changing what customers expect from test coverage discipline. It is not enough to know that a test was run. Engineers need to know what it covered, what it missed, and whether the measurement was tied to a meaningful defect mechanism.

A flow cannot safely skip, shorten, or substitute tests unless it knows what evidence those tests supplied in the first place, and that knowledge degrades faster than most teams expect as designs change and packaging architectures become more heterogeneous. Historical data can guide future decisions, but only when the relationship between past and present remains valid.

“While past results offer some guidance, historical test results can be inconsistent due to evolving designs and technological advancements,” said Étienne Racine, product manager for Siemens EDA’s Tessent product line. “One result that is still valid today is that structural testing by digital scan tests and memory BIST is far more effective for fault detection and binning than functional tests.”

That makes coverage history another form of data-chain context. Adaptive test can only act on prior results if those results still describe the device, process, and defect mechanisms now in front of the tester.

Smart test migrates to the test cell
As the latency window shrinks, smart test begins to migrate into the test cell itself. The closer a decision gets to the touchdown, the more it depends on fast data movement, local compute, and careful control of throughput impact. A rule may respond to a simple repeated failure signature. A more complex model may require edge compute near the tester. In either case, the test system has to support the action without turning intelligence into a bottleneck.

“With our leading-edge customers, we’re figuring out how to change limits, change flows, change the site maps, and change the site behaviors in real time in the same touchdown,” said Teradyne’s Roth. “Some are getting into production, and they won’t talk about it because people don’t want to let on that they’re ahead or behind. But it’s certainly happening in labs and slowly getting into some production with customers that are prioritizing it.”

The architectural requirement is broader than the tester itself. A test decision may depend on tester data, thermal conditions, handler or prober status, previous metrology, package history, and model output, and in high-volume production the available window for that decision can be very short.

“There’s a lot of infrastructure required to implement those things,” said Jack Lewis, director of applications and product management at Modus Test. “Test times in these kinds of parts are typically quite fast. For example, we do an LDO, a low-dropout regulator, that has a lot of very high-precision test in it, but we’re doing 16 sites in 500 milliseconds.”
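
Lewis's figures give a sense of how tight the window is. The sketch below takes the 16 sites and 500 milliseconds from his example and adds two assumptions that are not from the source: a hypothetical cap on how much throughput overhead a decision may add, and hypothetical latencies for moving data and running a model.

```python
def amortized_ms_per_device(touchdown_ms: float, sites: int) -> float:
    """Test time per device when all sites are tested in one touchdown."""
    return touchdown_ms / sites


def decision_budget_ms(touchdown_ms: float, max_overhead_fraction: float) -> float:
    """Time an in-touchdown decision may take without exceeding the overhead cap."""
    return touchdown_ms * max_overhead_fraction


def fits_in_touchdown(touchdown_ms: float, max_overhead_fraction: float,
                      data_move_ms: float, inference_ms: float) -> bool:
    """True if moving the data and running the model fit inside the budget."""
    budget = decision_budget_ms(touchdown_ms, max_overhead_fraction)
    return data_move_ms + inference_ms <= budget


# 16 sites and 500 ms come from the example above; the 2% cap and the
# data-move and inference latencies are assumptions for illustration.
print(amortized_ms_per_device(500.0, 16))
print(decision_budget_ms(500.0, 0.02))
print(fits_in_touchdown(500.0, 0.02, data_move_ms=4.0, inference_ms=5.0))
print(fits_in_touchdown(500.0, 0.02, data_move_ms=40.0, inference_ms=5.0))
```

With a 2% cap, the whole decision has about 10 milliseconds. A round trip to a remote analytics server could use that up on its own, which is one reason more complex models are moving toward edge compute near the tester.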

Smart test extends into the field
Eventually, the same smart test logic extends beyond manufacturing. For high-value devices used in AI, cloud, automotive, and other reliability-sensitive applications, production test may not be the last point at which device behavior is meaningful. Field telemetry can reveal aging, workload stress, marginal cores, and latent defects that were not visible during production test. For a device operating under continuous mechanical, thermal, and voltage stress, the degradation profile may diverge significantly from what any production insertion could have predicted.

“Test is not a one-time or two-time or even three-time event,” said proteanTecs’ Sever. “Test is something that goes with the chip from the time it is powered up for the first time until it is dead.”

Embedded telemetry can identify anomalies at the level of individual logic cones, alert firmware or system-level controllers, and support responses ranging from removing a marginal core from an active pool to changing voltage or clock conditions.

“The data that we process inside the chip, and the signals that we bring either to the main controllers inside the chip or outside the chip to a system-level controller, is highly granular,” said Sever. “We can boil it down even to a particular logic cone in some implementations. That is basically all the logic that converges into a single flip-flop. That’s the level of granularity.”
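
The range of responses described above can be sketched as a simple policy that maps a per-core margin reading to an action. This is an illustration only. The timing-margin metric, the picosecond thresholds, and the function names are assumptions, and a real system-level controller would weigh far more inputs.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    ADJUST_OPERATING_POINT = "adjust_voltage_or_clock"
    REMOVE_FROM_POOL = "remove_from_pool"


def choose_action(timing_margin_ps: float, warn_ps: float = 20.0,
                  critical_ps: float = 5.0) -> Action:
    """Pick a response from a core's remaining timing margin (thresholds are hypothetical)."""
    if timing_margin_ps <= critical_ps:
        return Action.REMOVE_FROM_POOL
    if timing_margin_ps <= warn_ps:
        return Action.ADJUST_OPERATING_POINT
    return Action.NONE


def update_pool(active_cores, margins_ps):
    """Return the cores that stay active, plus the action chosen for each reporting core.

    Cores that reported no margin are left in the pool unchanged.
    """
    actions = {core: choose_action(margin) for core, margin in margins_ps.items()}
    still_active = {core for core in active_cores
                    if actions.get(core) is not Action.REMOVE_FROM_POOL}
    return still_active, actions
```

Logging the chosen actions along with the readings that triggered them would keep field decisions inside the same chain of evidence as production test decisions.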

The result is a feedback loop that stretches from production into operation. Field behavior can inform predictive maintenance, but it can also reveal which production signatures, process excursions, or marginal test results were early indicators of later degradation. That, in turn, can be fed back into future screening, binning, redundancy allocation, and design-for-test decisions.

“Customers are merging our telemetry data with their own telemetry data,” added Sever. “Ours is based on physical measurements coming mostly from within the chip. Theirs is coming mostly from their own in-chip functional monitors and system-level sensors. They are merging them together, and they are both feeding their own developed fleet-monitoring system.”

Conclusion: The chain of evidence
The strategic question for manufacturers is how much of this data chain they can make usable. The industry already generates enormous volumes of data, but the value of that data depends on whether it can be aligned across time, tools, insertions, and physical context. A model that predicts failure without traceability may be interesting. A model that predicts failure, ties it to a wafer-level signature, validates it against metrology, checks it against test conditions, and confirms it through field behavior is far more valuable. But it’s also far more difficult to build.

Smart test is not only about making test faster. It is about making decisions more accountable. Decisions to skip a test, tighten a limit, scrap a die, add an insertion, or remove a core from service all depend on confidence in the chain of evidence behind the action. That chain fails not because the industry lacks algorithms, but because measurement context is lost, physical causes are disconnected from electrical symptoms, metadata breaks traceability, or models are asked to act on data that nobody can fully vouch for.

The next phase of smart test will belong to the manufacturers that can preserve meaning across the full path, from design intent and process variation to test behavior, model output, package history, and field performance.

