DATACORTECH: artificial intelligence platform for the virtual screen of aluminum corrosion inhibitors

Machine learning workflow

The workflow employed in the development of the ML model presented in this study (Fig. 1) involved the following steps: (1) collection of data; (2) calculation of the molecular descriptors; (3) division of the data into training/cross-validation (80% of the dataset) and test set (20% holdout sample); (4) selection of descriptors and electrolyte conditions to be considered in the model; and (5) training, optimization, cross-validation and, finally, testing of the predictive model. Technical details regarding this workflow are described in the “Methods” section.
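As a minimal sketch of step (3), the 80/20 division could look like the following. Python is used here only for illustration; the function name `stratified_split`, the fixed seed, and the decision to stratify by class are assumptions and are not taken from this work, whose actual procedure is described in the "Methods" section.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts.

    Hypothetical helper: stratification is an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes quoted in the text: 790 promising and 1176 less promising entries.
labels = [1] * 790 + [0] * 1176
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The test indices are set aside once and never used during feature selection or hyperparameter optimization.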

The aim of the ML model was to distinguish more promising compounds, with corrosion inhibition efficiencies higher than 70% (790 data entries), from less promising compounds (1176 data entries), thus employing a classification algorithm. Classification was chosen over regression because optimal descriptors for the protection mechanism of metallic alloys by organic corrosion inhibitors can be difficult to obtain, yet the quality of the available descriptors may be sufficient to distinguish more effective compounds from less effective ones, which is enough to perform an initial virtual screen of potential corrosion inhibitors for aluminum alloys, as shown by the promising results obtained in other works^22,32. The challenge of obtaining appropriate molecular descriptors has been evidenced by Kokalj and coworkers^18,24,25: direct correlations between well-known molecular descriptors and corrosion inhibition efficiencies are difficult to obtain, even for just a small number of corrosion inhibitors. ML algorithms, however, are able to consider different descriptors for different groups of inhibitors within the same model^18,22, which helps to overcome this difficulty. Even some robust ML strategies that were able to predict discrete corrosion inhibition efficiency values (regression) employ an a priori grouping strategy^26,29.

The non-linearity of the widely used inhibition efficiency metric has also been evidenced by Kokalj et al.¹⁸; it has less influence on classification problems, but a real effect on regression predictions. For example, from the experimental point of view, the same one-percentage-point change in inhibition efficiency corresponds to a different effective degradation of the substrate, as measured, for example, by the polarization resistance, depending on whether there is little corrosion (>75% inhibition efficiency) or more extensive corrosion (lower efficiencies). This means that degradation effects are actually not linear, but in terms of predictions, this has a lower influence on classification, since it attempts to distinguish classes of compounds, and not precise efficiency values, such as in regression.
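The non-linearity can be made concrete with a short sketch. It assumes the commonly used definition of inhibition efficiency from polarization resistance, IE% = 100 × (1 − Rp⁰/Rp), where Rp⁰ and Rp are the polarization resistances without and with inhibitor; the function names are hypothetical.

```python
def resistance_ratio(efficiency_percent):
    """Rp / Rp0 implied by an inhibition efficiency, assuming
    IE% = 100 * (1 - Rp0 / Rp)."""
    if not 0 <= efficiency_percent < 100:
        raise ValueError("efficiency must be in [0, 100)")
    return 1.0 / (1.0 - efficiency_percent / 100.0)


def is_promising(efficiency_percent, threshold=70.0):
    """Binary class label used for classification (efficiency above threshold)."""
    return efficiency_percent > threshold


# The same one-point step in efficiency at two different levels:
low_step = resistance_ratio(51) - resistance_ratio(50)
high_step = resistance_ratio(91) - resistance_ratio(90)
print(f"50 -> 51%: ratio gain {low_step:.2f}")
print(f"90 -> 91%: ratio gain {high_step:.2f}")
```

Under this definition, a one-point step near 90% efficiency changes the resistance ratio far more than the same step near 50%, whereas the class label is unchanged in both cases.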

Moreover, different measurement methodologies often result in different inhibition values³⁴, while not changing the classification of the inhibitor for the majority of the compounds. Taken together, these points can have a significant influence on regression results for a large number of compounds, but only a minor influence on classification, being relevant only for compounds with inhibition efficiencies near the classification threshold (in this case, 70% inhibition efficiency). The chosen threshold was the result of two factors: (1) preliminary tests to check how high the corrosion inhibition threshold could be set without degrading the performance metrics because the amount of data for promising inhibitors became too restricted; and (2) it is also a value near the point above which a change in efficiency means a more relevant improvement in corrosion protection according to polarization resistance¹⁸. By doing classification instead of regression, choosing 70% as the threshold, and discounting the influence of the categorical pH intervals (acidic, near neutral, and basic), only 70 examples for the same compounds had values both below and above 70% efficiency, among a total of 1966 examples. This corresponds to an uncertainty of around 4%, which results from the wide data mining strategy adopted in this work; a classification workflow is required to effectively minimize this uncertainty and the data variance resulting from different measurement conditions and techniques.
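One possible way to count such conflicting entries is sketched below. The counting rule (entries on the minority side of the threshold for each compound), the function name, and the toy data are assumptions for illustration; the exact rule used to arrive at 70 examples is not stated in the text.

```python
from collections import defaultdict


def count_conflicting_entries(entries, threshold=70.0):
    """entries: iterable of (compound, efficiency_percent) pairs.

    For a compound with measurements on both sides of the threshold, the
    entries on the minority side are counted as conflicting (assumed rule).
    """
    per_compound = defaultdict(lambda: [0, 0])  # [n_not_above, n_above]
    for compound, efficiency in entries:
        per_compound[compound][int(efficiency > threshold)] += 1
    return sum(min(below, above) for below, above in per_compound.values())


# Hypothetical toy data: only compound "A" straddles the threshold.
toy = [("A", 85), ("A", 90), ("A", 60), ("B", 20), ("B", 35), ("C", 75)]
print(count_conflicting_entries(toy))

# Share of conflicting examples quoted in the text: 70 out of 1966.
print(f"{70 / 1966:.1%}")
```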

In a previous study from our research group, the performance of several well-known ML algorithms was tested for one hundred compounds under a fixed set of conditions²². It was concluded that random forest, deep neural networks, and support vector machines were the best-performing algorithms. However, when several conditions were modeled together in the same composite dataset consisting of four hundred data entries, random forest had the best performance, being able to discern different conditions while taking advantage of the information gain provided by the larger dataset. Therefore, the random forest algorithm was used in this study.

Besides the calculated molecular descriptors used as input in the ML model, the pH was also considered in this work. This approach was partially implemented in a previous work²², by successfully taking into consideration the pH values. The pH was considered since at pH < 4 and pH > 10 the oxide film does not passivate the aluminum surface as effectively¹, and at near neutral pH, aggressive salts can promote localized corrosion³⁵. Therefore, this feature was considered for the initial selection as a categorical feature with three possibilities (pH < 4 – acidic, 4 ≤ pH ≤ 10 – near neutral, and pH > 10 – basic). Moreover, these pH ranges also take into account the possible speciation of the inhibitor, in terms of protonated, neutral, and deprotonated forms of the organic species, which can translate into different interaction mechanisms with the surface for each molecule depending on the pH range. The dataset also included inhibition efficiencies for different types of aluminum alloys, measurement times, temperatures, inhibitor concentrations, and aggressive salt concentrations, which means that when a certain structure is identified by the model as being a promising inhibitor, it has the potential to act as such according to several methodologies, for different aluminum alloys under different measurement and electrolyte conditions. Ideally, a composite model would take into account all possible variables as explicit inputs. However, this was not done, in order to avoid overfitting: although the dataset considered in this work for aluminum alloys is five times larger than in the previous study from our research group²² and twenty times larger than the larger datasets considered by other research groups⁴, it is still limited in terms of chemical and electrolyte diversity.
Nevertheless, as more data are collected within the CORDATA application developed by our research team¹², more input variables can be considered explicitly without having a detrimental effect on the model.
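The three pH categories described above can be encoded as follows. This is only a sketch: the function names and the one-hot column labels are hypothetical, while the boundaries (pH < 4, 4 ≤ pH ≤ 10, pH > 10) are those stated in the text.

```python
PH_CATEGORIES = ("acidic", "near neutral", "basic")


def ph_category(ph):
    """Map a numerical pH to the three categories used in this work."""
    if ph < 4:
        return "acidic"
    if ph <= 10:
        return "near neutral"
    return "basic"


def one_hot_ph(ph):
    """One indicator column per category (column names are hypothetical)."""
    category = ph_category(ph)
    return {f"pH ({name})": int(category == name) for name in PH_CATEGORIES}


print(ph_category(3.5), ph_category(7.0), ph_category(11.2))
print(one_hot_ph(7.0))
```

A one-hot encoding of this kind lets a feature selection step keep or discard each pH category independently of the others.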

Feature selection and importance

In this work, 209 molecular descriptors were obtained based on all the possible constitutional, structural, topological, electronic, and hybrid properties that could be calculated using the R interface to the Chemistry Development Kit (RCDK) cheminformatics software³⁶ for all the molecules in the dataset (the specification of all the calculated descriptors is available in the dataset provided together with this work). The RCDK package in R interfaces with the Chemistry Development Kit (CDK) to provide various cheminformatics tools, including the calculation of molecular descriptors. An overview of some of the categories and examples of descriptors that were calculated using RCDK is as follows: (1) constitutional descriptors (e.g., atom count, bond count, molecular weight, number of donor atoms for H-bonds, number of acceptor atoms for H-bonds); (2) topological descriptors (e.g., Wiener index, Randic index, Balaban J index, topological polar surface area (TPSA)); (3) electronic descriptors (e.g., polarizability, electron donor-acceptor descriptors); (4) geometric descriptors (e.g., 3D weighted holistic invariant molecular descriptors, moment of inertia, radius of gyration); (5) hydrophobicity descriptors (e.g., LogP, MLogP, XLogP); (6) connectivity indices (e.g., chi chain, chi cluster, chi path counts, Kappa shape indices); (7) information indices (e.g., Shannon entropy, symmetry indices); (8) pharmacophore feature descriptors (e.g., counts of pharmacophoric features); and (9) autocorrelation descriptors (e.g., Moran autocorrelation, Geary autocorrelation). After calculating the descriptors, those to be included in the final machine learning model (together with the type of pH) were first selected using recursive feature elimination (RFE) with random forest.

During the RFE step, an ML model without extensive hyperparameter optimization was fitted to the data previously sub-divided for cross-validation (the test set was not considered in any model optimization phase), and the weakest feature was removed repeatedly until a specified number of features was reached. This RFE step was performed using 5-fold cross-validation and accuracy as the performance metric. The RFE step validated the importance of the pH, which, together with 12 molecular descriptors (RFE profile presented in Fig. 2), was selected for the second feature selection step, which relied on the feature importance obtained from the random forest model with optimized hyperparameters. As a result of this feature selection process (RFE followed by feature importance), the pH (near neutral), a categorical feature, and 9 numerical features based on the molecular structure of the inhibitors (ALogP, tpsaEfficiency, bpol, apol, WTPT.5, ALogp2, ATSm1, WTPT.3, XLogP) were selected as descriptors to be included in the final ML model. Their description is presented in Table 1, the feature importance for the final model in Fig. 3, and the correlation matrix between the selected features, and between the selected features and the initial efficiency before pre-treatment (from numerical to class), in Fig. 4.
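The elimination loop at the core of RFE can be sketched as below. To keep the example self-contained, the importance of each feature is approximated by its absolute Pearson correlation with the class label; this is a stand-in for the random forest importance used in this work, and all names and toy data are hypothetical. The 5-fold cross-validated accuracy used to choose the number of features is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def abs_correlation_importance(columns, target):
    """Stand-in importance (NOT random forest importance)."""
    return {name: abs(pearson(values, target)) for name, values in columns.items()}


def recursive_feature_elimination(columns, target, n_keep, importance_fn):
    """Re-score the remaining features and drop the weakest one each round."""
    remaining = dict(columns)
    eliminated = []
    while len(remaining) > n_keep:
        scores = importance_fn(remaining, target)
        weakest = min(scores, key=scores.get)
        eliminated.append(weakest)
        del remaining[weakest]
    return list(remaining), eliminated


# Hypothetical toy data: one informative, one weak, and one constant feature.
target = [0, 0, 1, 1, 0, 1]
columns = {
    "signal": [0.1, 0.2, 0.9, 0.8, 0.3, 0.7],
    "weak": [1, 0, 1, 0, 0, 1],
    "noise": [5, 5, 5, 5, 5, 5],
}
kept, eliminated = recursive_feature_elimination(
    columns, target, n_keep=1, importance_fn=abs_correlation_importance
)
print(kept, eliminated)
```

Passing the importance function as an argument keeps the loop independent of the model: a function that refits a random forest on the remaining features each round would slot in without changing the elimination logic.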

Table 1 Description of features included in the final model

**Fig. 3: Random forest feature importance, representing the relative importance of each feature in predicting the target variable within the model.**

In terms of fundamental chemical properties, the main factors are the polarizability (apol, bpol and ALogp2), information about structural features (ATSm1, WTPT.3 and WTPT.5), the partition coefficient (ALogP and XLogP), and polarity (tpsaEfficiency). The correlation matrix shows that the features, although they describe the two classes well [inhibitors (>70% efficiency) vs non-inhibitors], do not correlate with the initial numerical efficiency. Moreover, the highest correlations between features are found between the two partition coefficient features (ALogP and XLogP) and between the two polarizability features (apol and bpol). Nevertheless, both sets of features were retained, since different methodologies can capture subtle differences between similar molecules in different ways, and the recursive feature elimination results supported this selection.

Regarding polarizability, molecules with polar functional groups, such as those containing electronegative atoms (e.g., oxygen or nitrogen), tend to have higher polarizability because these groups can induce a greater distortion in the electron cloud. Similarly, molecules with π electrons, especially in conjugated systems, can exhibit higher polarizability due to the increased electron cloud flexibility associated with π bonding. Polar functional groups and π electrons in conjugated systems are found in the majority of corrosion inhibitors, since they influence the ability of these compounds to interact with metal surfaces and form protective films, thereby inhibiting corrosion^37,38,39.

The influence of the partition coefficient on the corrosion inhibition mechanism has been investigated^40,41,42,43,44, and some works have proposed a relation between hydrophobicity, as measured by the octanol/water partition coefficient, and the entropy of adsorption of the corrosion inhibitor onto the metal^45,46, adsorption being the most common mechanism by which organic corrosion inhibitors protect metals⁴⁷. Before adsorbing onto the metal, the inhibitor is surrounded by solvent molecules. This organization of the solvent molecules around the inhibitor has to be at least partially disrupted to allow the formation of the surface film, resulting, in principle, in reduced degrees of freedom for the ensemble of inhibitor and solvent organized on the surface. This process can produce an enthalpic gain but at least a partial entropic penalty for the overall free energy of adsorption. Taylor et al. proposed that an entropic gain could be approximated by a higher hydrophobicity of the inhibitors, as expressed by the partition coefficient⁴⁵. This was also supported by recent work by Deng et al.⁴⁸. Indeed, the importance of entropic factors for the corrosion protection mechanism was also noticed in the previous model developed by our research team, which pointed to the role of self-association entropy between inhibitor molecules^22,38,39.

The polar surface area is related to the molecular surface associated with heteroatoms and polar functional groups, which are known to be key molecular moieties favoring the direct interaction between corrosion inhibitors and metallic surfaces⁴⁷, and has been employed in previous ML models²². In this case, the polar surface area is standardized regarding the molecular weight, thus discounting the effect of the size of the molecules.

Machine learning results

A classification model was developed to identify organic compounds that can be promising inhibitors for aluminum alloys. The model was based on the random forest algorithm, for reasons explained above and as concluded in a previous work²², being optimized on 80% of the dataset according to 10-fold cross-validation⁴⁹, and holding out 20% of the dataset for a final independent test. The results are presented in Table 2.
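The fold assignment behind a 10-fold cross-validation can be sketched as follows. The function name, the seed, and the sample count (1573, roughly 80% of the 1966 entries) are assumptions for illustration; model fitting and hyperparameter optimization are omitted.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, validation_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical size of the training/cross-validation portion.
folds = list(k_fold_indices(1573, k=10))
print(len(folds), [len(validation) for _, validation in folds])
```

Each entry is used for validation exactly once, and the reported cross-validation metrics are aggregated over the ten validation folds.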

Table 2 Performance metrics for the classification ML model

The developed model has a cross-validation accuracy of 73% and a balanced accuracy of 74%. The balanced accuracy corrects for the effect of the imbalanced dataset (40% promising inhibitor datapoints vs. 60% non-promising inhibitor datapoints) by giving equal weight to both classes (promising vs. non-promising inhibitors). To analyze the accuracy for each of the two classes, the sensitivity and specificity were also calculated, since they correspond to the ability to correctly identify promising and non-promising corrosion inhibitors, respectively. The sensitivity was 75% and the specificity was 73%, showing that the model performs similarly well for promising and non-promising inhibitors, according to the cross-validation results. The Cohen’s kappa statistic⁵⁰ and the Matthews correlation coefficient⁵¹ make it possible to evaluate the reliability of a classification model while accounting for the effect of random predictions and of imbalanced datasets. A Cohen’s kappa value between 0.2 and 0.4 indicates a fair agreement of the model, while a value between 0.4 and 0.6 indicates a moderate agreement of the model⁵². Therefore, the reliability of the model can be considered moderate for the cross-validation results. As for the Matthews correlation coefficient, the worst possible value is ‒1 and the best possible value is +1, meaning that the performance of the model is also reasonable according to this metric. Overall, the model performs satisfactorily for the train-validation set splits, with a reasonable performance that is clearly not random according to several performance statistics.
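For reference, all the metrics discussed above can be derived from the four cells of a binary confusion matrix, as in the sketch below. The counts are hypothetical and chosen only to mimic a 40/60 class imbalance; they are not the results in Table 2.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Metrics from a binary confusion matrix (positive = promising inhibitor)."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)  # promising compounds correctly identified
    specificity = tn / (tn + fp)  # non-promising compounds correctly identified
    accuracy = (tp + tn) / n
    balanced_accuracy = (sensitivity + specificity) / 2
    # Agreement expected by chance, given the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts: 80 promising and 120 non-promising entries.
metrics = classification_metrics(tp=60, fn=20, tn=90, fp=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Both kappa and the Matthews correlation coefficient are zero for a classifier that predicts at random, which is why they complement the plain accuracy on imbalanced data.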

The final independent test set results, which include different compounds from those in the training and validation sets, reveal a lower sensitivity (the ability to identify promising compounds), which decreases from 75% in cross-validation to 63% in the test set. The test set results are slightly lower in terms of general metrics, according to the accuracy, balanced accuracy, kappa value, and Matthews correlation coefficient, yet still promising considering the diversity of the dataset in terms of compounds and the broad range of measurement conditions from different laboratories. The diverse nature of the data can be challenging in terms of the statistical performance metrics, but it can also be more realistic. Corrosion is a phenomenon that occurs under a broad range of conditions and, therefore, the identified inhibitors should be able to achieve their purpose under such conditions, sometimes even beyond those tested by design in the lab.

DATACORTECH web application

The Datacortech application is a web-based tool (Fig. 5, https://datacor.shinyapps.io/datacortech/) that empowers users to conduct a virtual screening of corrosion inhibitors specifically designed for aluminum alloys. This application seamlessly integrates various technologies for web development and machine learning, based on the R programming language⁵³ and the Shiny package⁵⁴. The application enables users to sketch molecular structures using the Chemdoodle sketcher⁵⁵, select pH conditions, and predict if a compound has the potential to be an efficient corrosion inhibitor (>70% efficiency) using the machine learning model described above. The interface is user-friendly and intuitive to use, allowing individuals to design inhibitors, choose pH ranges, and receive predictions with a few mouse actions. This platform contributes to the advancement of corrosion science and materials research, with a focus on supporting the development of eco-friendly corrosion inhibitors for aluminum alloys. It assists researchers looking for corrosion inhibitors to be included in nanostructured coating additives or directly into coating matrices^56,57,58.

Conclusions, strengths, and weaknesses

The purpose of this work was to contribute to the development of a data driven model that can be used for the virtual screen of potential corrosion inhibitors for aluminum alloys. It is faster than an experimental test, but cannot live without one. After a promising structure is identified, it should be validated experimentally. In this article, we show one way to develop such a model, by taking into account corrosion inhibition efficiencies measured by different laboratories under different conditions, in order to build a more holistic composite model.

Composite models allow a larger dataset to be considered, with data originating from different laboratories and measured under different electrolyte conditions and methodologies, which mutually validate each other within the predictive model. Experimental corrosion tests are usually performed under different conditions, and there is a wealth of information already published in the literature. Hence, it was necessary to address this in order to be able to use a larger and more diverse dataset. In practice, corrosion inhibitors should be able to protect against corrosion under different conditions and as assessed through different techniques. This work is a step in that direction, in which the pH alone was first considered. In the future, a combination of different conditions will be considered. Furthermore, the philosophy behind this approach can find applications beyond corrosion science, such as in many chemistry and materials science applications, where different surrounding conditions are usually considered.

The model obtained is the result of a comprehensive machine learning workflow for predicting the corrosion inhibition potential of organic compounds. Nevertheless, this approach has both strengths and weaknesses.

Model strengths: (1) comprehensive dataset: the use of a dataset encompassing 1966 corrosion inhibition efficiencies for 173 organic compounds under various conditions provides a broad dataset for model training and evaluation; (2) feature selection: the implementation of RFE and random forest feature importance for feature selection helps in identifying the most relevant predictors, reducing the model’s complexity and improving its interpretability; (3) validation techniques: employing 10-fold cross-validation ensures that the model is reliable and generalizes well to unseen data, minimizing the risk of overfitting; (4) final evaluation on unseen data: testing the model on a held-out dataset (20% of the total) that was not used during training, hyperparameter optimization or feature selection stages provides a genuine measure of its predictive power and generalization ability; and (5) interactive web application: developing a web application using the Shiny package makes the model accessible to the corrosion science community, allowing for an interactive and practical application of the current work.

Model weaknesses: (1) data diversity: while the dataset is extensive, the application of the model is restricted to aluminum alloys, which may limit its applicability; (2) model specificity: the choice of a random forest model, while justified, means that the strengths and weaknesses of this particular model type will also influence the results, and other models can offer other insights; (3) feature selection dependence: the use of alternative feature selection methods, other than RFE and random forest feature importance, might yield different or even more informative features; and (4) model update and maintenance: the dynamic nature of both machine learning techniques and corrosion science means that the model and dataset require regular updates to remain relevant.

One possible shortcoming that is difficult to control in machine learning predictive modeling for chemistry and materials is that small changes in input variables can lead to large changes in output variables, which is typically associated with a problem known as high variance. This situation often occurs in models that are overfitting. Our strategy to minimize this issue was to rely on cross-validation followed by a final independent test, and on the internal mechanisms of the random forest algorithm that help to avoid overfitting: (1) bootstrap aggregation (bagging): training each tree on a different random subset of data; (2) random feature selection: using a random subset of features for splitting at each node; (3) ensemble averaging: averaging the predictions from multiple trees to reduce variance; (4) tree depth limitation: controlling the depth of each tree to prevent over-complexity; and (5) diversity in trees: ensuring that trees are diverse through random sampling and feature selection. Composite models do have higher data variability, which makes them more challenging, albeit more realistic. Regarding the model variance of a standard model (only one set of conditions) vs a composite model (several conditions within the same model), a direct comparison is not straightforward, since the two rely on different data and different input variables.
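Mechanisms (1) and (3) above can be illustrated with a deliberately simplified sketch: one-threshold decision stumps on a single feature are trained on bootstrap samples and combined by majority vote. This is not a random forest (there is no random feature selection and no tree depth to control), and all names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold t that best separates classes with the rule x > t."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(xs)):
        correct = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def fit_bagged_stumps(xs, ys, n_estimators=25, seed=0):
    """Bagging: each stump is fitted on a bootstrap sample (with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_estimators):
        sample = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return thresholds


def predict(thresholds, x):
    """Ensemble prediction: majority vote over all stumps."""
    votes = Counter(int(x > threshold) for threshold in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical one-feature data with one noisy label (x = 4 vs x = 5).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
model = fit_bagged_stumps(xs, ys)
print(predict(model, 0.5), predict(model, 9))
```

Individual stumps can disagree about where to place the threshold because each sees a different resample; the vote smooths out that disagreement, which is the variance reduction attributed to ensemble averaging.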

As a result, the described machine learning workflow is robust and comprehensive, employing state-of-the-art techniques for data processing, model training, and evaluation. However, it also presents challenges related to data and model specificity, and therefore its applicability, which are important considerations for future development and application of this work.

From the electrochemical and chemical standpoint, the dataset and resulting model also present specific strengths and weaknesses.

Electrochemical and chemical strengths: (1) diverse methodologies for efficiency evaluation: incorporating results from various experimental techniques (e.g., potentiodynamic polarization, electrochemical impedance spectroscopy) provides a comprehensive understanding of corrosion inhibition mechanisms across different experimental conditions, enriching the dataset’s value; (2) wide range of substrates and conditions: using a variety of aluminum alloys and under different electrolyte conditions enhances the model’s applicability to real-world scenarios, ensuring its predictions are relevant across a broad spectrum of practical applications of aluminum corrosion inhibitors; (3) chemical diversity of inhibitors: the dataset includes a wide range of organic compounds with molecular weights from 33 to 993 g/mol, allowing the model to learn from a broad chemical space to potentially identify novel inhibitors; (4) inclusion of a wide range of molecular descriptors: calculating 209 molecular descriptors enables a detailed understanding of the structural and chemical features that influence corrosion inhibition, facilitating the identification of key features driving corrosion inhibition potential; (5) pH consideration: categorizing the dataset based on pH and its impact on corrosion inhibition mechanisms (e.g., oxide film passivation, localized corrosion, protonation/deprotonation of inhibitors) is a critical factor of the metal-substrate interface; and (6) addressing non-linearity in corrosion inhibition: the choice of classification over regression, due to the non-linear nature of corrosion inhibition efficiency, encompasses in the model the underlying electrochemical complexity and the practical challenges in directly correlating molecular descriptors with inhibition efficiency.

Electrochemical and chemical weaknesses: (1) impact of measurement methodologies: different experimental techniques might yield varying inhibition efficiencies for the same compound, which could influence the accuracy of future simulations; (2) descriptor complexity: while the use of a broad range of descriptors is a strength, it also introduces complexity in understanding which features are most critical to inhibition efficiency, and the relationship between descriptors and corrosion protection is not straightforward; (3) threshold selection for classification: the choice of a 70% efficiency threshold for classifying compounds as effective inhibitors may not be strict enough to select highly effective compounds, and may not capture the nuances of moderately effective compounds that could still be of practical interest; and (4) limited consideration of experimental conditions: although the dataset includes various conditions, the model’s ability to incorporate these factors beyond pH is limited, which is something to be addressed in future works.

In summary, the machine learning approach results in an understanding of the chemical and electrochemical factors influencing corrosion protection by inhibitors. However, challenges remain in understanding more complex descriptor-efficiency relationships, and integrating the diverse nature of typical corrosion experimental data into a machine learning framework.

In this study, we present not only the results from the machine learning model, but also a web application (DATACORTECH, https://datacor.shinyapps.io/datacortech/), which is freely available for the corrosion science community to experiment with the model and perform an initial virtual screen of potential corrosion inhibitors for aluminum alloys to be further tested experimentally. The potential of this application can be enhanced when it is used in parallel with a previous application, CORDATA (https://datacor.shinyapps.io/cordata/)¹², which is an open data management tool for selecting appropriate corrosion inhibitors for specific application conditions from existing literature data for aluminum and other alloys.
