Machine learning model for accurately predicting the properties of CsPbCl3 perovskite quantum dots

Machine Learning


Data description

The study began with a thorough analysis of the existing literature and the compilation of a comprehensive database of hot-injection synthesis parameters for \(\textrm{CsPbCl}_3\) PQDs. Data were collected from 59 peer-reviewed articles, which are listed and cited in Table S2 (Supporting Information). Once the articles were selected, the associated synthesis parameters and corresponding output characteristics were manually extracted. The following parameters were treated as independent input variables for training the algorithms: injection temperature, source of chloride (Cl), amount of Cl in millimoles (mmol), source of lead (Pb), amount of Pb in mmol, amount of Cs in mmol, molar ratio of Cs to Pb, and molar ratio of Cs to Cl. Additionally, the amounts of octadecene (ODE), oleic acid (OA), and oleylamine (OLA) in milliliters (mL) and the total ligand volume in mL (OA+OLA) are included as input parameters. Furthermore, the ratios of total ligand volume and of Cl amount to Pb are also input features. The output target parameters are the size of the PQDs in nanometers (nm), the 1S abs peak in nm, and the PL peak in nm. 1S abs refers to the first excitonic absorption peak, corresponding to the lowest-energy optical transition of the PQDs, whereas PL represents the radiative emission between the lowest conduction band and the highest valence band of the PQDs. We organized the collected data so that each variable occupies its own column and each reported result its own row. As stated in the Supporting Information, 788 data points (531 inputs, 177 outputs) were used for the prediction. This amount of data is sufficient to accurately predict the properties of nanocrystals using ML19,24. This well-organized record collection allowed the models to be trained, and the data to be handled, more quickly.
The input features are treated as independent throughout the modeling process, and we assume that the preferred ML model sufficiently captures the fundamental association between the input and target variables. Table S1 in the Supporting Information shows the preparation stages used to enhance the quality and applicability of the data for the ML models. Dataset reliability was ensured by replacing missing values with the median and by removing outliers through residual analysis. We estimated residuals using a basic regression model and applied a Z-score threshold: data points with residuals beyond \(\pm 3\) standard deviations from the mean (Z-score > 3 or < -3) were classified as outliers and removed from the training data. This avoided distorted learning caused by extreme or inconsistent synthesis results. Furthermore, we employed principal component analysis (PCA) to keep large-scale calculations fast while retaining roughly \(95\%\) of the variance. Polynomial and logarithmic transformations were used in feature engineering to address skewness and preserve relationships within the dataset.
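The residual-based Z-score filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `remove_residual_outliers`, the choice of ordinary least squares as the "basic regression model", and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def remove_residual_outliers(X, y, z_threshold=3.0):
    """Drop rows whose linear-fit residual has |Z-score| above z_threshold."""
    # Assumed "basic regression model": ordinary least squares with an intercept.
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    # Standardize the residuals to Z-scores.
    z = (residuals - residuals.mean()) / residuals.std()
    keep = np.abs(z) <= z_threshold
    return X[keep], y[keep], keep

# Synthetic stand-in data (not from the study): 50 points on a noisy line.
rng = np.random.default_rng(0)
X = rng.uniform(140.0, 200.0, size=(50, 1))      # hypothetical input feature
y = 0.05 * X[:, 0] + rng.normal(0.0, 0.1, size=50)
y[7] += 5.0                                       # one planted outlier

X_clean, y_clean, keep = remove_residual_outliers(X, y)
print(len(y), "rows before,", len(y_clean), "rows after filtering")
```

Only the planted point exceeds the threshold here; on real data the threshold and the baseline model would need to be checked against the residual distribution.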

Metrics and Machine Learning Methods

The dataset was divided into training and test sets following a hierarchical clustering framework, rather than evaluating on the same samples used for fitting, to avoid cases where memorization or overfitting hampers performance on new data. We evaluated six regression methods suitable for small datasets: support vector regression (SVR), nearest neighbor distance (NND), deep learning (DL), decision tree (DT), random forest (RF), and gradient boosting machine (GBM). All of these algorithms were built with the Scikit-Learn library. Both random and stratified sampling techniques were used to ensure representative training and test samples. The training set comprised 80% of the examples and the test set the remaining 20%. Hyperparameters were tuned through grid search. Model performance was evaluated by computing the \(R^2\), mean absolute error (MAE), and root mean squared error (RMSE) metrics.

MAE is less sensitive to outliers than RMSE and allows models to be compared when their targets are measured on the same scale. A simple way to gauge a model's performance is to look at its MAE: lower values correspond to higher prediction accuracy. RMSE is best interpreted as the typical distance between the predicted and observed values of the data samples; an RMSE of zero means the model reproduces the observed values exactly. The coefficient of determination, \(R^2\), quantifies the degree to which the model represents the data, with values close to 1 indicating higher accuracy. We compared the accuracy and performance of the models using these three commonly employed statistical metrics: RMSE, MAE, and \(R^2\). They are defined mathematically as follows:

$$\begin{aligned}&\text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(w_i -{\hat{w}}_i)^2}\end{aligned}$$

(1)

$$\begin{aligned}&\text{MAE}=\frac{1}{n}\sum_{i=1}^{n}|w_i -{\hat{w}}_i|\end{aligned}$$

(2)

$$\begin{aligned}&R^2=1 -\frac{\sum_{i=1}^{n}(w_i -{\hat{w}}_i)^2}{\sum_{i=1}^{n}(w_i -{\bar{w}})^2}\end{aligned}$$

(3)

where \(w_i\) represents the observed value, \({\hat{w}}_i\) the predicted value, \(n\) the total number of predictions, and \({\bar{w}}\) the average of the observed values.
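As a worked example of Eqs. (1)-(3), the three metrics can be computed directly from their definitions. The sketch below uses only the standard library; the four observed and predicted values are made-up numbers chosen for easy arithmetic, not data from the study.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error, Eq. (1)."""
    n = len(observed)
    return math.sqrt(sum((w - w_hat) ** 2 for w, w_hat in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error, Eq. (2)."""
    n = len(observed)
    return sum(abs(w - w_hat) for w, w_hat in zip(observed, predicted)) / n

def r2(observed, predicted):
    """Coefficient of determination, Eq. (3)."""
    w_bar = sum(observed) / len(observed)
    ss_res = sum((w - w_hat) ** 2 for w, w_hat in zip(observed, predicted))
    ss_tot = sum((w - w_bar) ** 2 for w in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical values (for illustration only).
observed = [410.0, 405.0, 412.0, 408.0]
predicted = [409.0, 407.0, 411.0, 408.0]

print("RMSE:", rmse(observed, predicted))
print("MAE: ", mae(observed, predicted))
print("R^2: ", r2(observed, predicted))
```

The errors here are 1, -2, 1, and 0, so the MAE is exactly 1 while the RMSE is slightly larger because the squared term weights the error of 2 more heavily.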

In data science, SVR is a class of regression models that effectively model complex relationships within a dataset by mapping the input data to a higher-dimensional space. SVR is particularly useful for high-dimensional datasets and nonlinear relationships, although it can be computationally intensive, especially on large datasets. The SVR model was created with a radial basis function (RBF) kernel using the Scikit-Learn Python module. Hyperparameters were optimized using a grid search technique. SVR is particularly suitable for accurately predicting QD properties, as it can describe nonlinear dependencies between factors such as QD size, configuration, and the resulting attributes. Because these factors are closely tied to the properties of QDs, ML algorithms such as SVR can capture these complex correlations and reveal the interactions between QD characteristics and input variables.
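A minimal sketch of this workflow (an 80/20 split followed by a grid search over an RBF-kernel SVR) is given below. The synthetic data, the feature scaling step, the parameter grid, the 5-fold cross-validation, and the MAE-based scoring are assumptions for illustration; the source does not state the grid or the fold count that were actually used.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the synthesis dataset (three hypothetical features).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(120, 3))
y = np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0.0, 0.05, 120)

# 80% of the examples for training, the remaining 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling is an assumed preprocessing step; RBF kernels are scale sensitive.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Hypothetical grid; step parameters are addressed as "<step name>__<param>".
param_grid = {
    "svr__C": [1, 10, 100],
    "svr__gamma": [0.01, 0.1, 1.0],
    "svr__epsilon": [0.01, 0.1],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

test_r2 = r2_score(y_test, search.predict(X_test))
print("best parameters:", search.best_params_)
print("test R^2:", test_r2)
```

The test set is touched only once, after the grid search has selected its parameters on the training folds.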

The SVR model for predicting the properties of nanomaterials is described as follows:

$$\begin{aligned} f(w)=\sum_{i=1}^{n}(a_i -a_i^*)\,k(w_i, w) + b\end{aligned}$$

(4)

where \(w\) represents the input features, \(w_i\) the support vectors, \(a_i, a_i^*\) the Lagrange multipliers, \(k(w_i, w)\) the kernel function, and \(b\) the bias term. This model predicts properties such as band gaps and surface area by capturing nonlinear correlations between features using kernels such as the RBF and polynomial kernels. The method works very well on the small datasets typical of nanomaterial research29.
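Eq. (4) can be checked numerically against a fitted Scikit-Learn model. In the sketch below, which uses synthetic data and arbitrarily chosen hyperparameters, the fitted attribute `dual_coef_` holds the differences \(a_i - a_i^*\), `support_vectors_` holds the \(w_i\), and `intercept_` holds \(b\); summing the kernel terms by hand should reproduce `predict`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(60, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

# gamma is fixed to a number so the same kernel can be rebuilt by hand.
gamma = 0.5
model = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.05).fit(X, y)

X_new = rng.uniform(-1.0, 1.0, size=(5, 2))

# k(w_i, w) for every support vector w_i and every new input w.
K = rbf_kernel(model.support_vectors_, X_new, gamma=gamma)

# f(w) = sum_i (a_i - a_i^*) k(w_i, w) + b
manual = model.dual_coef_[0] @ K + model.intercept_[0]

print("manual: ", manual)
print("predict:", model.predict(X_new))
```

Only the support vectors enter the sum, which is why the fitted model can be compact even when the kernel implicitly works in a high-dimensional space.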

NND is an important concept in spatial analysis and machine learning. More precisely, it plays an important role in pattern recognition and classification algorithms, including the k-nearest neighbor (k-NN) method. The NND is defined as the shortest distance separating a point in a dataset from any other point. References show that NND is applied to fundamental concepts of computational geometry, the theoretical analysis of particle systems, and the convergence analysis of statistical estimators30. The Python Scikit-Learn library was also used to implement the NND model. The NND algorithm is an important tool that provides a robust foundation for understanding the fundamental mechanisms of nanomaterials31. Therefore, this algorithm can accurately predict properties such as PL, 1S abs, and QD size.

For a collection of nanoparticles \(N = \{w_1, w_2, \ldots, w_n\}\) in \({\mathbb{R}}^m\), the NND is defined as follows:

$$\begin{aligned}\rho_k(w_i,N)=\min_{j\ne i}\Vert w_i -w_j\Vert \end{aligned}$$

(5)

Here, \(\Vert w_i -w_j\Vert\) denotes the Euclidean distance between \(w_i\) and \(w_j\). The average \(k\)-th NND is given by:

$$\begin{aligned} h_{n,k}=\frac{1}{n}\sum_{w_i\in N}\log\frac{\rho_k(w_i,N)\,v_m\,e^{\psi(k)}}{f(w_i)}\end{aligned}$$

(6)

where \(f(w_i)\) is the local density, \(\psi(k)\) is a scaling function, and \(v_m\) is the volume of the \(m\)-dimensional ball. This measure provides information about the spatial distribution and composition of nanoparticles, which is essential for understanding their chemical and physical properties32.
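The per-point distance in Eq. (5) is straightforward to compute. The following sketch evaluates it by brute force with NumPy for four made-up points in the plane; the function name `nearest_neighbor_distances` is an assumption for the example, and the all-pairs approach is only practical for small collections.

```python
import numpy as np

def nearest_neighbor_distances(points):
    """Return min over j != i of ||w_i - w_j|| for every point w_i (Eq. 5)."""
    # Pairwise difference vectors, shape (n, n, m).
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    # Exclude the zero distance of each point to itself (the j != i condition).
    np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1)

# Hypothetical 2-D coordinates (illustration only).
points = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 5.0], [10.0, 0.0]])
nnd = nearest_neighbor_distances(points)
print(nnd)
```

The second and third points are each other's nearest neighbor at distance 1, while the first point's nearest neighbor lies at distance 5.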

The DT is a simple yet robust model that is easy to understand and explain. It can handle both numeric and categorical data, making it flexible across a wide range of datasets. Nevertheless, DTs are susceptible to overfitting, especially when the trees become too deep. The decision tree model was developed and trained using Python's Scikit-Learn module. The parameters of the model, including its maximum depth, were tuned using cross-validation. Integrating the DT algorithm into the design of QDs is important for predicting QD properties, since its ability to process complex datasets helps identify the optimal combination of material properties33.
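The overfitting behavior mentioned above can be seen by comparing training and cross-validated scores at several depths. This is a hedged sketch on synthetic noisy data; the candidate depths and the 5-fold setting are assumptions, not values reported in the source.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (illustrative only).
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(150, 2))
y = np.sin(4 * X[:, 0]) + rng.normal(0.0, 0.3, 150)

results = {}
for depth in (2, 4, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    cv_r2 = cross_val_score(tree, X, y, cv=5, scoring="r2").mean()
    train_r2 = tree.fit(X, y).score(X, y)
    results[depth] = (train_r2, cv_r2)

for depth, (train_r2, cv_r2) in results.items():
    print(f"max_depth={depth}: train R^2={train_r2:.3f}, CV R^2={cv_r2:.3f}")
```

A fully grown tree fits the training noise almost perfectly, so a large gap between its training and cross-validated scores is the signal that a smaller `max_depth` should be preferred.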

RF is a machine learning algorithm that has recently become popular34. Many regard it as one of the best machine learning algorithms because it can process a thousand variables without compromising accuracy, is fast, is easy to implement, and achieves high prediction accuracy35. The algorithm is known for its high predictive performance and is considered one of the most suitable out-of-the-box classification and regression algorithms, as it requires little tuning36. The RF model was implemented with the Python Scikit-Learn module. We trained the model using 500 trees and used cross-validation to optimize max_features, the number of features considered at each split. For regression, RF trains each tree independently on a different bootstrap sample drawn with replacement from the training data. Once trained, the RF prediction for an unknown sample \(w\) is calculated as the average of the predictions from the individual trees, given by the following equation:

$$\begin{aligned}{\hat{f}}(w)=\frac{1}{N}\sum_{n=1}^{N} f_n(w)\end{aligned}$$

(7)

where \(N\) is the number of decision trees in the random forest and \(f_n(w)\) is the prediction of the \(n\)-th tree for input \(w\). This ensemble approach reduces the variance of individual decision trees and makes the random forest a more robust model for regression.
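Eq. (7) can be verified on a fitted Scikit-Learn forest by averaging the predictions of its individual trees, which are exposed through the `estimators_` attribute. The data below are synthetic, and apart from the 500 trees mentioned in the text all settings are library defaults chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative only).
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(100, 4))
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(0.0, 0.1, 100)

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

X_new = rng.uniform(0.0, 1.0, size=(3, 4))

# One row of predictions per tree, shape (500, 3).
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])

# Average over the trees, as in Eq. (7).
manual = per_tree.mean(axis=0)

print("tree average:  ", manual)
print("forest.predict:", forest.predict(X_new))
print("spread across trees (std):", per_tree.std(axis=0))
```

The per-tree standard deviation printed at the end gives a rough sense of how much the individual trees disagree before averaging smooths them out.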

GBM is another very powerful machine learning technique, as it combines many weak learners. The technique is efficient for many classification tasks37. It is also recognized for its high prediction accuracy and its effectiveness in handling complex interactions within data. However, one drawback of GBM is that it tends to overfit if it is not properly tuned. The GBM models were trained in Python using the Scikit-Learn library. Key parameters such as the learning rate, the number of boosting rounds, and max_depth were optimized by cross-validation.
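To make the idea of combining many weak learners concrete, the sketch below hand-rolls gradient boosting for squared error using shallow Scikit-Learn trees. It is an illustration of the principle under assumed settings (learning rate 0.1, 50 rounds, depth 2, synthetic data), not the library's own GBM implementation and not the authors' configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only).
rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.sin(5 * X[:, 0]) + X[:, 1] + rng.normal(0.0, 0.1, 200)

learning_rate = 0.1   # shrinkage applied to every weak learner
n_rounds = 50         # number of boosting rounds
max_depth = 2         # shallow trees act as weak learners

# Start from the mean, then repeatedly fit a tree to the current residuals.
prediction = np.full_like(y, y.mean())
trees, train_mse = [], []
for _ in range(n_rounds):
    residuals = y - prediction  # negative gradient of the squared-error loss
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    train_mse.append(float(np.mean((y - prediction) ** 2)))

print("training MSE after round 1: ", train_mse[0])
print("training MSE after round 50:", train_mse[-1])
```

The training error keeps falling as rounds are added, which is exactly why the learning rate, the number of rounds, and the tree depth need to be chosen on held-out data; in Scikit-Learn these correspond to the `learning_rate`, `n_estimators`, and `max_depth` parameters of `GradientBoostingRegressor`.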

DL, in particular neural networks, can learn from examples much as humans do. These networks do not require a task-specific algorithm and can approximate nonlinear transformations. Therefore, they can be used to model the input/output behavior of complex systems38. Nevertheless, older model architectures suffer from issues such as a lack of balance within the dataset. As a result, they not only limit the generalization of the machine learning algorithm itself but also ignore redundancy in feature extraction and cross-layer feature interactions39. We used the Scikit-Learn Python library to train the model.


