Shell: Evaluating the Performance of Machine Learning Models Used in the Energy Sector

Machine Learning


Background and explanation

This project leverages deep learning to perform a computer vision task, semantic segmentation, in a specialized application domain. The project had about 15 deep learning (DL) models actively deployed. The DL models are applied in a cascading fashion: the predictions generated by one model feed a series of downstream tasks that produce the final output, which in turn feeds into a manual interpretation task. AI assurance through model performance evaluation is therefore important to guarantee robust and explainable results. Three types of model evaluation tests were designed and implemented in the DL inference pipeline:

  1. Regression tests (unit tests per DL model, sketched below),
  2. Integration tests (tests of the cascaded pipelines), and
  3. Statistical tests (stress tests to understand the operating limits of the models conditional on test data quality).
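
The following is a minimal sketch of the first test type, a per-model regression test for a segmentation model. The IoU metric, the baseline threshold, and the synthetic stand-ins for the model and test set are illustrative assumptions, not the project's actual assets.

```python
# Minimal sketch of a per-model regression test: compare mean IoU on a fixed,
# versioned test set against the baseline recorded for the last accepted model.
import numpy as np


def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-union for binary segmentation masks."""
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, truth).sum()) / float(union)


def regression_check(predict, test_pairs, baseline_iou):
    """Return True if mean IoU over the fixed test set meets the recorded baseline."""
    scores = [iou(predict(image), mask) for image, mask in test_pairs]
    return float(np.mean(scores)) >= baseline_iou


if __name__ == "__main__":
    # Synthetic stand-ins: an identity "model" evaluated on random binary masks.
    rng = np.random.default_rng(0)
    masks = [rng.integers(0, 2, size=(64, 64)).astype(bool) for _ in range(8)]
    test_pairs = [(m, m) for m in masks]          # each mask doubles as its own "image"
    print(regression_check(lambda x: x, test_pairs, baseline_iou=0.72))  # True
```

In practice, the baseline would be recorded alongside the model version and the check wired into the inference pipeline's test stage.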

How this methodology applies to the regulatory principles of the AI White Paper

Learn more about the regulatory principles of the AI White Paper here.

Safety, security and robustness

Regression and integration tests form the backbone of model interpretability on a set of test data. During model development, they provide a baseline for judging whether a model is performing better or worse, depending on its training data and parameters. During the model deployment phase, these tests can also provide an early indication of concept drift.
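
For the cascaded pipelines, an integration test can exercise the stages chained exactly as they are at inference time and check properties that the downstream tasks rely on. The sketch below assumes two hypothetical stages (a segmentation step feeding a clean-up step) and illustrative acceptance checks; it is not the project's actual pipeline.

```python
# Minimal sketch of an integration test over a two-stage cascaded pipeline.
import numpy as np


def run_pipeline(image, segment, denoise):
    """Chain the stages exactly as they are cascaded at inference time."""
    mask = segment(image)     # stage 1: semantic segmentation
    return denoise(mask)      # stage 2: clean up the predicted mask


def test_cascade_end_to_end():
    rng = np.random.default_rng(1)
    image = rng.random((64, 64))

    # Placeholder stages standing in for the deployed DL models.
    segment = lambda img: img > 0.5
    denoise = lambda mask: mask            # identity stand-in

    result = run_pipeline(image, segment, denoise)

    # Pipeline-level checks on properties the downstream tasks rely on:
    # output shape, mask dtype, and a plausible foreground fraction.
    assert result.shape == image.shape
    assert result.dtype == bool
    assert 0.2 <= result.mean() <= 0.8     # illustrative operating range


if __name__ == "__main__":
    test_cascade_end_to_end()
    print("integration test passed")
```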

Statistical tests are designed to predict model performance given the statistics of the test data, thus providing a mechanism to detect data drift during model deployment. They also show how robust the performance of the DL models is to statistical fluctuations in the test data.
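
One way to realise such a statistical check is to compare a summary statistic of the incoming inference data against reference values recorded from the training data. The sketch below uses a two-sample Kolmogorov-Smirnov test; the choice of feature (per-image mean intensity), the significance level, and the synthetic data are assumptions made for illustration.

```python
# Minimal sketch of a data-drift check using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy import stats


def drift_check(reference_values, incoming_values, alpha=0.01):
    """Flag drift when the incoming distribution differs significantly
    from the reference distribution recorded at training time."""
    statistic, p_value = stats.ks_2samp(reference_values, incoming_values)
    return p_value < alpha, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # Per-image mean intensities: recorded at training time vs. seen at inference.
    reference = rng.normal(0.50, 0.05, size=500)
    incoming = rng.normal(0.58, 0.05, size=100)    # shifted distribution
    drifted, p = drift_check(reference, incoming)
    print(f"drift detected: {drifted} (p = {p:.3g})")
```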

Appropriate transparency and explainability

The output of this AI assurance methodology is communicated to AI developers and product owners to monitor potential deviations from expected DL model performance. Additionally, in the event of performance deviations, these teams can take appropriate mitigation measures.

It also enables frontline users and business stakeholders to maintain a high degree of confidence in the results of DL models.

Accountability and governance

AI developers are responsible for designing and executing the model evaluation tests that automate performance testing. Product owners are responsible for using these tests as a first line of defense before deploying new models. The project teams work together to coordinate testing and to address data and concept drift during deployment.

Why We Take This Approach

In this project, DL model predictions ultimately generate the inputs for the manual interpretation task. This task is complex, time-consuming, and labor-intensive, so it is important that the starting point (in this case, the DL model predictions) is of high quality in terms of accuracy, detection range, and very low noise. Moreover, the results of the manual interpretation feed into decision-making processes with significant impact.

Therefore, the quality and robustness of the DL models' predictions are of the utmost importance. The most important check on their predictive performance remains human quality control; however, the model evaluation test suite methodology was adopted to automate performance testing as a first line of defense. Data versioning and an implicit ML experimentation pipeline were put in place primarily so that a model can be regenerated end-to-end (data, code, model performance) within tolerance.
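
To make the "within tolerance" criterion concrete, the following is a minimal sketch that compares metrics from a regenerated model against values recorded alongside a versioned experiment; the metric names, values, and tolerance are hypothetical, not the project's recorded figures.

```python
# Minimal sketch of a regenerate-within-tolerance check against recorded metrics.
import math

RECORDED = {"mean_iou": 0.741, "precision": 0.880}   # logged with the data/code version
TOLERANCE = 0.01                                      # acceptable absolute deviation


def reproduces_within_tolerance(fresh_metrics, recorded=RECORDED, tol=TOLERANCE):
    """True if every regenerated metric matches its recorded value within tol."""
    return all(
        math.isclose(fresh_metrics[name], value, abs_tol=tol)
        for name, value in recorded.items()
    )


if __name__ == "__main__":
    fresh = {"mean_iou": 0.737, "precision": 0.884}   # from retraining on the same data version
    print(reproduces_within_tolerance(fresh))         # True
```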

Benefits to the organization

  1. Automated DL performance testing as a first line of defense for quality assurance.

  2. Tests of model robustness and interpretability of DL model performance.

  3. A solid explanation of DL model performance for AI developers and end users.

  4. Greater trust in the DL models and workflows across the user community.

  5. A mechanism to detect concept drift, enabling model monitoring.

  6. An MLOps hook to enable CI/CD during model deployment.

Limitations of the approach

  1. The large number of DL models performing very different tasks, such as detection, classification, and noise reduction.

  2. The complexity and variability of the problems addressed by DL make KPI design difficult.

  3. A lack of high-quality, representative data for designing model evaluations.

  4. A lack of clear metrics and thresholds for designing regression, integration, and statistical tests.

  5. The lack of a stable model evaluation library.

Learn more about AI Assurance


