Material embedding, machine learning architecture, and theoretical implementation
Previous studies have demonstrated success with partial-data models such as gradient-boosted trees (GBTs), random forests, k-nearest-neighbor classifiers, support-vector classifiers, and neural networks36. In particular, a GBT was previously trained successfully on this type of data22,37.
Following training of the GBT algorithm on topology data, subsequent analysis demonstrated that electron counts and space groups are the major determinants in classifying material topology22. The model's performance was excellent, peaking at 90% for the full GBT model. When the GBT was coupled with ab initio calculations that ignored spin-orbit coupling, the accuracy on materials with strong confidence in the predicted topological state peaked at 92%. Such calculations were not used to complement the ML models here, since complete spin-orbit ab initio calculations already allow direct prediction of material topology. The main advantage of purely structure-based predictions is their generality, which gives a simple way to retrain the model for new situations. Because the original dataset is inaccessible, the GBT algorithm without DFT input was reconstructed as faithfully as possible and applied to the current dataset; on the updated TQC dataset it achieved 76% accuracy, as shown in Table 3. All algorithms considered are compared against this 76% benchmark, as they do not include additional ab initio calculations. The CGNN of ref. 22 was also tested, but could not converge to reasonable accuracy on the topological prediction task. As shown below, excellent predictive capabilities can nevertheless be achieved.
Four faithful embeddings of the underlying material were tested. For each embedding, the data format is standardized as follows: let \(A\) be the set of atoms in a primitive cell. Each atom \(a \in A\) is associated with two types of information, an atomic identifier \(v_a\) and an atomic position \(p_a\). Lastly, a global vector \(g\) contains the dimensions and symmetry of the primitive cell. Different embeddings are considered for each input vector and are tested in all ML frameworks to determine the optimal representation.
For classification into \(n\) categories, recall that the one-hot encoding of the \(i\)-th category is \(0^{\oplus(i-1)}\oplus 1\oplus 0^{\oplus(n-i)}\). The embeddings were chosen to generalize beyond a one-hot embedding of atomic number: each atom is embedded as \(h(r)\oplus h(\lfloor c/2\rfloor)\oplus(c\,(\mathrm{mod}\,2))\), where \(r\) and \(c\) are the row and column of the left-step periodic table in Fig. 3 and \(h\) denotes one-hot encoding38. This allows generalization across rows and columns of the periodic table, with 7 (rows) + 16 (spinless columns) + 1 (spin slot) = 24 positions. Additional atomic properties were tested in the embedding, but no further performance improvements were found.
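Matching the 7 + 16 + 1 = 24 count, one consistent reading of the atomic embedding is a 7-dimensional row one-hot, a 16-dimensional one-hot over ⌊c/2⌋, and a single c mod 2 spin bit. A minimal sketch under that reading follows; the function and argument names are illustrative, not taken from the original code:

```python
import numpy as np

def one_hot(i, n):
    """n-dimensional one-hot vector with a 1 in slot i (0-indexed)."""
    v = np.zeros(n)
    v[i] = 1.0
    return v

def embed_atom(row, col):
    """Embed an atom by its left-step periodic-table position.

    row: 0-based row index (0..6); col: 0-based column index (0..31).
    Returns a 7 + 16 + 1 = 24-dimensional vector.
    """
    return np.concatenate([one_hot(row, 7),        # 7 row slots
                           one_hot(col // 2, 16),  # 16 spinless columns
                           [col % 2]])             # 1 spin slot
```

Two atoms in the same column but adjacent spin slots then share their 16-dimensional column block, which is what lets the model generalize across the periodic table.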

Each atom in the periodic table has column and row annotations used in ML.
The embedding of the position \(p_a\) is network dependent, but positions are stored in fractional units relative to the primitive-cell basis. The global data vector \(g\) has two main components: the first gives the primitive-cell dimensions using a sinusoidal encoding39, and the second records the space group with a one-hot embedding. Hyperparameter tuning was used to maximize network performance only on the TQC dataset; it was omitted for the remaining tests to demonstrate the models' ability to generalize out of the box.
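As a concrete sketch of how such a global vector could be assembled (sinusoidal encoding of the cell lengths plus a one-hot space-group slot), the following is a hypothetical illustration; the frequency schedule is our assumption and the scheme of ref. 39 may differ:

```python
import numpy as np

def sinusoidal_encode(x, n_freq=4):
    """Encode a scalar as sin/cos at geometrically spaced frequencies
    (an assumed schedule, for illustration only)."""
    freqs = 2.0 ** np.arange(n_freq)
    return np.concatenate([np.sin(freqs * x), np.cos(freqs * x)])

def global_vector(cell_lengths, space_group):
    """cell_lengths: the 3 primitive-cell dimensions; space_group: 1..230."""
    dims = np.concatenate([sinusoidal_encode(d) for d in cell_lengths])
    sg = np.zeros(230)               # one-hot space-group block
    sg[space_group - 1] = 1.0
    return np.concatenate([dims, sg])
```

The sinusoidal part gives the network a smooth, multi-scale view of cell sizes, while the one-hot block keeps the 230 space groups as discrete categories.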
The complete models described in the Methods can overfit a consistent training set to any degree; therefore, training accuracy is not emphasized. In some tables, previous papers are used as approximate benchmarks for comparison. These comparisons are indicative at best, as those papers may not use the same dataset.
One consequence of model fidelity is that limits must be placed on cell size to keep training times reasonable. To compare with the GBT algorithm, a penalty was assigned to primitive cells that could not fit the representation, as follows: without knowledge of the underlying input variables, the best predictor for the elements of a validation set \(v\) is a single label \(p\) assigned to every element of \(v\). This optimal element was therefore used as the default prediction whenever the ML model was not available. Note that \(p\) is extracted from the training set to prevent data contamination. For classification, \(p\) is the most common label. For regression, if the loss is the root-mean-square error (RMSE) or the mean absolute error (MAE), the \(p\) optimizing each of these measures of model error is the mean and the median, respectively. This gives a well-defined way of comparing different models on the underlying dataset. A simple baseline model is also provided for comparison, as shown in Tables 1-3.
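The default label \(p\) described above can be sketched as follows; this is a minimal illustration of the stated rule, with names of our choosing:

```python
from collections import Counter
import numpy as np

def baseline_classification(train_labels):
    """p for classification: the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]

def baseline_regression(train_targets, loss="rmse"):
    """p for regression: the mean minimizes RMSE, the median minimizes MAE."""
    t = np.asarray(train_targets, dtype=float)
    return float(np.mean(t)) if loss == "rmse" else float(np.median(t))
```

Both functions are computed on the training split only, so the fallback prediction never leaks validation information.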
Prediction of common quantum material characteristics
The material representations are sufficient to determine the symmetry group. Therefore, as a first test of the overall power of the ML algorithms, 151,000 materials were obtained from the Materials Project and ICSD datasets40,41. The POSCAR file format1 was used as input, feeding the atom types and positions, along with the primitive-cell basis, to the ML models. The target variable for each material was its space-group classification. Material symmetry is readily derived from the POSCAR description using structural analysis; the space-group classification is therefore exact and allows validation of the actual model implementation.
Two major encodings of the symmetry group were tested: one-hot encodings of the space group (230 labels) and of the point group (32 labels). As can be seen from Table 1, ML performance was lower than that of analytical techniques. This is in fact a known weakness of ML and an ongoing field of research in the ML community. The CCNN algorithm captures most of the spatial symmetry, showing that spatial relationships are best handled by this direct approach compared with the other three methods.
The per-atom formation energy and the magnetic classification were both indexed from ref. 40 for the 151,000 materials. Some error was expected due to the temperature dependence of the experimental results and the limited accuracy of DFT. Performance on the magnetic dataset was stronger than the 81% accuracy previously reported on a smaller dataset42. This illustrates the universality of the model designs, here also applied to formation energy (Table 2). However, as expected, the performance of classification models applied unchanged to regression tasks is weak; this is improved in subsequent development.
Topology Classification
Three main sources were used to train the models. The first dataset21 contains a comprehensive list of material topology indicators. Material information was extracted in POSCAR format from the two largest material datasets available40,41. Two sets of topology labels were extracted for each material: \(t_s\), a simplified labeling, and \(t_r\), a refinement of \(t_s\). Here, \(t_s\) consists of three labels: LCEBR, TI, and SM; \(t_r\) consists of five extended labels: LCEBR, NLC, SEBR, ES, and ESFD. There are 75,000 materials with this labeling.
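In standard TQC terminology, NLC and SEBR are topologically insulating classes while ES and ESFD are enforced-semimetal classes, so the coarsening from \(t_r\) to \(t_s\) can plausibly be written as the following mapping. This is our reading of the label hierarchy, not stated explicitly in the text:

```python
# Assumed coarsening of the refined labels t_r to the simplified labels t_s.
TR_TO_TS = {
    "LCEBR": "LCEBR",  # linear combination of elementary band reps (trivial)
    "NLC":   "TI",     # not a linear combination -> topological insulator
    "SEBR":  "TI",     # split elementary band rep -> topological insulator
    "ES":    "SM",     # enforced semimetal
    "ESFD":  "SM",     # enforced semimetal with Fermi degeneracy
}

def simplify(tr_label):
    """Map a refined t_r label to its simplified t_s label."""
    return TR_TO_TS[tr_label]
```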
Two requirements were placed on the data. First, primitive cells were required to have fewer than 60 atoms. The second criterion arose from the problem that materials are often replicated, appearing under the same stoichiometric label and symmetry group with slight variations in the POSCAR file. Therefore, if the topology labels agreed, the entries were condensed into one. If the topological data were inconsistent, the material was simply excluded from the dataset because of the high probability of an incorrect calculation or of anomalous ambient factors such as temperature or pressure. As an example of this situation, 39 material tuples appear distinct in the ICSD database but are merely distortions of one another, as identified in the Materials Project. After the filtering process, 36,580 materials remained and 455 data points were deleted. The original dataset appears to have contained thousands of replicated materials. It is noteworthy that such cross-contamination between training and test datasets causes ML models trained on the original dataset to score artificially high. The topological composition of the dataset is ES, TI, SM, NLC, and ESFD at fractions 0.10, 0.27, 0.07, 0.07, and 0.49 of the entire dataset, respectively.
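The deduplication rule described above, condensing entries that share a stoichiometry and space group when their topology labels agree and dropping materials whose duplicates conflict, can be sketched as follows (the names and tuple layout are illustrative):

```python
from collections import defaultdict

def condense(entries):
    """entries: iterable of (stoichiometry, space_group, topology_label) tuples.

    Duplicates with agreeing labels collapse to a single entry; materials
    whose duplicates carry conflicting labels are excluded entirely.
    Returns (kept, n_removed)."""
    groups = defaultdict(set)
    for stoich, sg, label in entries:
        groups[(stoich, sg)].add(label)
    kept = {k: next(iter(v)) for k, v in groups.items() if len(v) == 1}
    n_removed = sum(1 for v in groups.values() if len(v) > 1)
    return kept, n_removed
```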
The majority of model experiments were performed on the TQC dataset, which allows problems with a particular model to be diagnosed from its accuracy. Unless otherwise stated, all comments refer to the full five-category TQC task. At 49% accuracy, a model need not transfer any information between input and output, since non-topological materials, the most common type, comprise 49% of the dataset. Near the 75% accuracy range, an additional apparent plateau occurs, followed by slower improvement; the CGNN model in particular stalls near this threshold. The models were trained for 20-60 passes (epochs) over the available dataset to achieve maximum accuracy on the test set. All four tested models showed rapid early improvement, followed by an apparent plateau lasting about one epoch, before a more subtle long-term increase in accuracy. To account for differences in the datasets, the alternative GBT algorithm of Table 2 was trained according to the specifications of ref. 22, allowing a direct comparison of approaches. All models, as seen in Table 3, are comparable to or exceed the GBT baseline.
Optimized implementations of each network are provided on GitHub, including notes on the optimizations. Additional correlation effects were examined; a weak correspondence between formation energy, magnetic classification, and topological class is shown in the Supplementary Material. An ensemble was created to test for systematic model errors. Misclassification was found to occur most frequently for materials containing less common elements of the periodic table. Materials with multiple symmetry groups corresponding to the same stoichiometry were also more frequently misclassified, except when the topology labels were identical across all symmetric phases.
Due to the large differences in model architecture, the ensemble approach provides a way to strengthen model predictions. Given that every model achieved essentially full performance on the training set, a material that all models fail to classify correctly indicates one of two potential situations:
- The material is accurately represented by DFT, but the neural networks (NNs) misclassify it due to a violation of their internal heuristics.
- The material itself is mislabeled due to defects in the DFT calculation of the band structure.
A filtering process is performed as an extension of the model classification. The four model archetypes (NNN, CANL, CCNN, CGNN) can all achieve over 95% accuracy on the training dataset after several epochs. If the errors between the models were uncorrelated, the expected number of materials misclassified by all four would be 36,580 × (0.05)^4 ≈ 0. Deviation from this scenario shows interdependence between the models, allowing model-independent diagnosis of shared error sources. There were 54 such misclassified materials. Among them, CeIn2Ni9, Fe2SnU2, B4Fe (space group 58), and InNi4TMs are proactively identified as topological and may be misclassified due to insufficient DFT calculations. Furthermore, 1:3 and 1:5 compounds occur frequently among the misclassifications, corresponding to the compounds PtNi3, MoPt3, PdFe3, HfPd3, CrNi3, AlK3, HgTi3 and HoCu5, GdZn5, EuAg5, CePt5, ThNi5, SeNi5.
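The independence estimate quoted above works out as follows, taking the 5% per-model test error at face value:

```python
def expected_common_errors(n_materials, per_model_error, n_models):
    """Expected number of materials misclassified by every model,
    assuming the models' errors are statistically independent."""
    return n_materials * per_model_error ** n_models

expected = expected_common_errors(36_580, 0.05, 4)
# well below one expected material, versus 54 observed: errors are correlated
```

The gap between the near-zero independent expectation and the 54 observed common failures is what justifies attributing those materials to a shared source, such as the underlying DFT labels.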
