This study was conducted in an honest, transparent, and ethical manner and was approved by the ethics committee of Huainan Normal University. All methods were carried out in accordance with relevant guidelines and regulations, and all experimental protocols were approved by the designated institution. Written informed consent was obtained from every participant involved in the study, which ensures that their rights and privacy are protected throughout the research process. This research was conducted to better determine the loyalty of employees to their enterprises. It needed to address two concerns: how to analyze relevant historical data about employees, and how to explore the factors and models that influence employee loyalty. Figure 1 shows the model building process.

The process of model construction.
Business and Data Understanding: This part covers the understanding of the business and of its data, including the employee management system, loyalty mining, value analysis, and other business requirements and data. Data acquisition is a very important step in the data mining process because the quality of the collected data directly affects the training of the model and the accuracy of subsequent predictions.
Data pre-processing: It mainly includes data cleaning, data integration, data transformation and data specification.
Data mining: This step performs the employee loyalty mining analysis. Models of various dimensions are used to mine the data and analyze the relationships between data variables.
Model evaluation: This refers to the evaluation of model performance. To judge how well the model works and whether it meets the business needs of human resource management, various evaluation methods and indicators should be applied, and the managers of TSMEs should even be involved to evaluate the model thoroughly. In this research, algorithm performance was evaluated. If the model passes the evaluation, it enters the deployment phase; otherwise, it is iteratively updated.
Data sources
The first step of the research is to collect and store data related to employee loyalty. This step is the operation layer of the original data. The existing data storage forms of the enterprises are relatively diverse and may be text files or relational databases.
The total number of technology enterprises in the Yangtze River Delta exceeds 200,000. The total number of science and technology talents in the 41 cities of China’s Yangtze River Delta has gradually increased. With the integration of the Yangtze River Delta region established as a national strategy, features such as its competitiveness, fluidity, openness, and sharing have become more and more obvious. The data in this research are employee and evaluation data from enterprises in the Yangtze River Delta region of China. Because these enterprises are partners of the research unit, they provide a good practical basis for this research. These enterprises have been engaged in industrial software development and application for a long time and have obtained many qualification certificates, such as ISO full series system certification, CMMI software Maturity Level 3 certification, and ITSS operation and maintenance System level 3 certification. These enterprises are committed to building a professional, stable, well-structured, and dynamic team of highly skilled talents.
Through surveys and interviews at the partner enterprises, the research collected data related to employee loyalty and integrated them to form the original data. First, the main data were obtained through direct interviews with employees, for which we designed a series of questions around the theme of employee loyalty. These in-depth interviews provide a solid foundation for the subsequent analysis of the factors influencing employee loyalty. Second, data were obtained from the business systems of the partner enterprises, which record detailed information such as employee work performance, attendance, training records, project participation, and promotion records. Third, data were obtained from the official websites of the partner enterprises, which usually contain information about corporate culture, values, development history, honors and awards, and employee activities. We systematically reviewed and cleaned the company records to ensure the integrity and consistency of the data. The interview data were recorded as audio recordings and notes, and were organized and analyzed by professionals to extract key information. The data from the official websites, such as employee evaluations and messages, were also screened and verified to exclude false or misleading information. When integrating these data, we cross-validated data from different sources to ensure their consistency. Where discrepancies were found, we conducted further investigation and verification to ensure that the final data are accurate and reliable. This integration and verification process laid a solid foundation for the subsequent prediction and analysis of employee loyalty.
The interviewees are described in detail below. Senior Leaders are the supporters of the development of this system and the decision makers of the company, and they can offer constructive suggestions for the whole research process. Heads of the Human Resource Management department are the main users of the system and the people who collect all kinds of evaluation and result data. Through them, more opinions on system construction can be obtained, and they play an important role in data collection and system evaluation. Employees are one of the sources of data for system development and are also direct testers; their feedback therefore strongly supports the development of the system, and they play an important role in data collection. Department Leaders are the personnel who use the results of employee loyalty mining and value evaluation most directly, and direct opinions can be obtained through them. Lastly, the leaders were responsible for verifying the collected data.
The main interview guide questions are listed below. What problems have you encountered in developing employee loyalty prediction? How do you usually carry out employee loyalty analysis? Do you think it is necessary to systematically manage employee loyalty prediction? What problems did you encounter in preparing the report materials? What factors do you think affect employee loyalty? Do you think the employee loyalty management system with decision support will play a significant role in the development of your company? If possible, would you encourage the company to use the employee loyalty management system with decision support as a part of human resource development?
Reliability analysis is used to evaluate the stability and consistency of a measurement tool or test. In interview analysis, reliability refers to the dependability and consistency of the interview data. In this study, SPSS 25.0 statistical software was used to analyze the interviews, and Cronbach’s alpha coefficient was greater than 0.8, indicating good reliability, that is, good internal consistency among the multiple measurement items used to measure each concept or property in this study. The feedback method, the comparison method, and participant testing were used to test and improve the validity of the interview. Together, these methods ensured the effective implementation of the interview.
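As an illustration of the reliability coefficient mentioned above, the following minimal sketch computes Cronbach’s alpha with the Python standard library. The study itself used SPSS 25.0; the function name `cronbach_alpha` and the ratings are hypothetical and are not data from the study.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Sample variance of each item (column) across respondents.
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical ratings: 5 respondents x 3 items (illustrative only).
ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1], [3, 3, 3]]
alpha = cronbach_alpha(ratings)
print(f"Cronbach's alpha = {alpha:.3f}")
```

A value above 0.8, as reported for the interviews, is conventionally read as good internal consistency.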
Before data mining of employee loyalty, the collected data needed to be integrated to form raw data. After data collection and preliminary preprocessing, this research integrated 2,200 records, and Python was used to analyze and mine the employee loyalty data.
Data sources include Database and Datastore. This research communicated with the leaders of the enterprises, who assigned data specialists to coordinate each company’s data collection. Data were collected by a combination of random sampling and stratified sampling. To avoid introducing bias into the research results, the stratification criteria were selected with full consideration of the heterogeneity of the population and the research objectives, so that each layer represents a different subgroup of the population. In addition, the impact of stratified sampling on the research results was evaluated through methods such as cross validation to ensure the accuracy and reliability of the research findings. Once the stratification criteria were determined, employees were divided into layers. By department, the employees were divided into the administrative department, sales department, R&D department, and production department. The number of samples taken from each layer was calculated from the ratio of the number of employees in that layer to the total number of employees and from the total number of samples required for the research, so that the sample sizes of all layers sum to the total number of samples. Within each layer, the specified number of employees was drawn by simple random sampling, using the random number table method to ensure that each employee had an equal probability of being selected. Because the Yangtze River Delta region has a developed economy and a dense population, there are large economic and cultural differences between its cities and regions. Therefore, the sampling process gave full consideration to the representativeness of these regions and industries to ensure that the sample represents the overall characteristics of the entire Yangtze River Delta region.
The research also covers multiple departments, such as production, research and development, and sales. Sampling therefore ensured the representation of different economic sectors to avoid sampling bias caused by sectoral differences. In addition, strict quality control during data collection, such as training investigators and reviewing and cleaning the data, reduced errors.
Moreover, the current study analyzed employee loyalty: positive samples were those with high loyalty, and negative samples were those with low loyalty. Therefore, the data needed to be sampled from the two classes in a balanced ratio. Specifically, the original data were extracted from the company’s personnel management system and employee management system by stratified random sampling and then, after export, cleaned, integrated, and standardized.
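The proportional allocation and within-layer random draw described above can be sketched as follows. This is a minimal illustration under stated assumptions: the department head counts are hypothetical, the function names are invented for this example, and Python’s pseudo-random generator stands in for the random number table method used in the study.

```python
import random
from collections import defaultdict

def proportional_allocation(stratum_sizes, total_samples):
    """Allocate samples in proportion to stratum size (largest-remainder rounding)."""
    population = sum(stratum_sizes.values())
    quotas = {s: total_samples * n / population for s, n in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    # Hand out the samples lost to rounding, largest fractional part first,
    # so that the per-stratum counts sum exactly to total_samples.
    leftover = total_samples - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc

def stratified_sample(employees, key, total_samples, seed=0):
    """Draw a simple random sample of the allocated size within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for employee in employees:
        strata[employee[key]].append(employee)
    alloc = proportional_allocation({s: len(m) for s, m in strata.items()}, total_samples)
    sample = []
    for s, members in strata.items():
        sample.extend(rng.sample(members, alloc[s]))
    return sample

# Hypothetical head counts per department (illustrative only).
sizes = {"administrative": 100, "sales": 300, "R&D": 400, "production": 200}
employees = [{"ID": f"{dept}-{i}", "Department": dept}
             for dept, n in sizes.items() for i in range(n)]
allocation = proportional_allocation(sizes, 220)
sample = stratified_sample(employees, "Department", 220)
print(allocation, len(sample))
```

The largest-remainder step is one possible way to guarantee that the layer sample sizes add up to the required total when the proportional quotas are not whole numbers.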
Predictors
According to the current state of research, many factors affect employee loyalty, and the influencing factors differ across industries and positions. Relevant employee data were collected through earlier questionnaire surveys, interviews, enterprise databases, and other methods, and the important factors were then determined. The predictors in this study were generated according to the following criteria.
The first is relevance. A predictor should have a significant correlation with the prediction target. Relevant fields were identified and selected through Pearson correlation analysis.
The second is completeness. The data set must contain enough information to train the model; that is, enough fields need to be selected to capture the various features of the data. The integrity of the data set was ensured through data preprocessing, and new relevant fields could be added through feature construction, feature scaling, feature coding, and similar techniques.
The third is accuracy. The data for a predictor should be as accurate as possible to reduce the impact of noise and error on model training. Accuracy was improved by removing duplicate data, correcting erroneous data, and transforming data.
The fourth is interpretability. In some cases, predictors need to be interpretable so that the model’s decision-making process can be explained when it makes predictions. This research selected fields with clear enterprise or business implications and avoided overly complex features or combinations of features. Interpretable machine learning models can also be used to enhance the interpretability of the model.
Factors with low correlation, incomplete data, inaccurate data, or unclear descriptions were filtered out. Ultimately, 17 factors were retained. Specifically, Age represents an employee’s physiological age; Gender refers to an employee’s biological sex; Education reflects an employee’s knowledge background and academic attainment; Position indicates an employee’s role and responsibilities within the organization; Salary reflects an employee’s financial compensation; Vacation concerns an employee’s leave time; Welfare represents the benefits an employee enjoys; Ability Level embodies an employee’s skills and competencies; Team Spirit reflects an employee’s cooperative attitude within a team; Ambition denotes an employee’s career goals and aspirations; Value Recognition embodies an employee’s self-perception of their value; Honesty reflects an employee’s integrity; Belonging Sense indicates an employee’s loyalty and identification with the organization; Training Opportunities represent the likelihood of an employee receiving training; Promotion Opportunities reflect an employee’s prospects for advancement within the organization; Working Environment describes the physical and psychological conditions in which an employee works; Overtime involves an employee’s working hours and intensity. The following sections conduct importance analysis and correlation analysis on these 17 factors, with comments and descriptions, in order to identify the core factors and their correlations more accurately. Subsequently, this research adopted machine learning algorithms. Hence, there are 17 main predictors, as shown in Table 2 below.
According to the table, the predictors are Age, Gender, Education, Position, Salary, Vacation, Welfare, Ability Level, Team Spirit, Ambition, Value Recognition, Honesty, Belonging Sense, Training Opportunities, Promotion Opportunities, Working Environment, and Overtime.
Data preprocessing
In data mining, the original data contain a large number of incomplete, inconsistent, and abnormal records, which seriously affect the execution efficiency of data mining modeling and may even bias the mining results. Therefore, data cleaning is particularly important. After data cleaning is completed, a series of processing steps such as data integration, transformation, and specification are carried out; together these constitute data preprocessing. On the one hand, data preprocessing improves the quality of the data. On the other hand, it makes the data better suited to specific mining technologies or tools. In this paper, the main contents of data preprocessing are data cleaning, data integration, data transformation, and data specification. After data cleaning and preprocessing, the data are stored in a database or data warehouse for use in the next stage of data mining.
This research first deals with missing values and abnormal data. A missing value arises when data are clustered, grouped, deleted, or truncated because information is lacking in the raw data; in other words, the values of one or more attributes in the existing data set are incomplete. Missing values are generally handled by deletion or by interpolation. For high-dimensional data, noise features can be removed to reduce interference with the model. Interpolation methods include nearest neighbor interpolation, mean interpolation, median filling, and other methods. In this research, interpolation from the preceding and following values and the mean value method are used in combination.
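The text does not spell out exactly how the two filling methods are combined, so the following is only one plausible sketch: a missing entry is replaced by the average of the nearest observed values before and after it, and the column mean is used when a neighbor is unavailable on either side. The function name `fill_missing` and the scores are hypothetical.

```python
from statistics import mean

def fill_missing(values):
    """Fill None entries from neighbouring observations, falling back to the mean."""
    observed = [v for v in values if v is not None]
    column_mean = mean(observed)
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Nearest observed value before and after position i, if any.
        before = next((values[j] for j in range(i - 1, -1, -1)
                       if values[j] is not None), None)
        after = next((values[j] for j in range(i + 1, len(values))
                      if values[j] is not None), None)
        if before is not None and after is not None:
            filled[i] = (before + after) / 2   # interpolate between neighbours
        else:
            filled[i] = column_mean            # no neighbour on one side: use mean
    return filled

# Hypothetical scores on the 0-5 scale with gaps (illustrative only).
scores = [3.0, None, 5.0, None, None, 4.0, None]
filled_scores = fill_missing(scores)
print(filled_scores)
```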
Abnormal data, also called abnormal values or outliers, are a special case often encountered in data analysis. Abnormal data are sometimes not only useless but also distort the analysis results. Therefore, during data exploration, it is necessary to identify these abnormal data and handle them properly.
The visualization results are shown in Fig. 2. According to the box plot, some variables, such as Ability Level and Team Spirit, have abnormal values. In the collected data set, these variables are scored on a scale of 0–5.

Box Plot of Abnormal Value Detection.
Therefore, scores below 0 or above 5 are abnormal values, and these variables need to be processed so that only data within the normal range are retained.
After this simple pre-processing, in which missing values, outliers, and duplicate values are handled, a clean and complete data set is obtained, and the data are saved to tables and databases.
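The range check and duplicate removal described above can be sketched as follows. This is a minimal illustration, assuming that out-of-range records are dropped rather than corrected; the function name `clean_records` and the sample records are hypothetical.

```python
def clean_records(records, scored_fields, low=0.0, high=5.0):
    """Drop exact duplicate records and records with scores outside [low, high]."""
    seen = set()
    cleaned = []
    for record in records:
        # A record is a duplicate if all of its field values were seen before.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        # Keep the record only if every scored field lies in the valid range.
        if all(low <= record[field] <= high for field in scored_fields):
            cleaned.append(record)
    return cleaned

# Hypothetical records (illustrative only).
raw = [
    {"Ability Level": 4.2, "Team Spirit": 4.8},
    {"Ability Level": 5.6, "Team Spirit": 3.0},   # above 5: abnormal
    {"Ability Level": 4.2, "Team Spirit": 4.8},   # duplicate of the first record
    {"Ability Level": 3.1, "Team Spirit": -0.5},  # below 0: abnormal
    {"Ability Level": 0.0, "Team Spirit": 5.0},   # boundary values are valid
]
clean = clean_records(raw, ["Ability Level", "Team Spirit"])
print(len(raw), "->", len(clean))
```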
Before data mining, descriptive statistics is necessary.
Descriptive statistics
Descriptive statistics summarize data in a way that reveals the characteristics of the data distribution and can be used to express quantitative data. Descriptive statistics include data frequency analysis, data trend analysis, data dispersion analysis, distribution description, and statistical graph analysis.
Table 3 presents a statistical analysis of all attributes, including both discrete and continuous data. Some statistical indicators do not apply to discrete data and are represented by NaN. Similarly, some statistical indicators are unavailable for continuous data and are also represented by NaN.
As seen from the table, the descriptive statistics include count, unique, top, freq, mean, std, min, 25%, 50%, 75%, max, and other statistical indicators for each variable.
Unique represents the number of distinct values; top represents the attribute value with the largest frequency, and freq represents the frequency of that value; mean represents the average value of the sample, std represents the standard deviation of the sample, and min represents the minimum value of the sample. The 25%, 50%, and 75% entries are the quartiles, that is, the lower quartile, the median, and the upper quartile, respectively. Finally, max represents the maximum value of the sample.
For example, for the discrete Position variable, the medium level has the largest frequency. For the continuous Ambition variable, the mean is 3.17, std is 1.41, and min is 0; the 25%, 50%, and 75% values are 2.1, 3.7, and 4.3; and max is 5.
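The indicators listed above can be reproduced with the Python standard library, as in the sketch below. The helper names and the sample values are hypothetical; the study’s actual table may have been produced with a different tool, and quartile conventions differ slightly between tools (linear interpolation is assumed here).

```python
from collections import Counter
from statistics import mean, quantiles, stdev

def describe_continuous(values):
    """count, mean, std, min, quartiles, and max for a continuous variable."""
    # "inclusive" interpolates linearly between the observed minimum and maximum.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"count": len(values), "mean": mean(values), "std": stdev(values),
            "min": min(values), "25%": q1, "50%": q2, "75%": q3,
            "max": max(values)}

def describe_discrete(values):
    """count, unique, top, and freq for a discrete variable."""
    top, freq = Counter(values).most_common(1)[0]
    return {"count": len(values), "unique": len(set(values)),
            "top": top, "freq": freq}

# Hypothetical values (illustrative only).
ambition = [0.0, 2.0, 3.0, 4.0, 5.0]
position = ["medium", "high", "medium", "low", "medium"]
ambition_stats = describe_continuous(ambition)
position_stats = describe_discrete(position)
print(ambition_stats)
print(position_stats)
```

The indicators that do not apply to a variable type (for example, mean for Position or top for Ambition) are simply absent here, which corresponds to the NaN entries in the table.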
The distribution of characteristic data
This research analyzed several continuous characteristic variables such as Ability Level, Team Spirit, Ambition, Value Recognition, Honesty, and Belonging Sense through data visualization technology.
As seen from Fig. 3, the numerical distribution of each characteristic variable is irregular.

The distribution of characteristic data.
Ability Level is the evaluation of the ability level of employees. The scores mainly range from 2 to 5, with 5 as the highest score. The densest part of the distribution is between 4 and 4.3, indicating that the ability level of the company’s historical employees is relatively high.
Team Spirit represents the team spirit of employees, with scores concentrated between 2 and 5. Moreover, the number of employees scoring 4.7 to 5 is the largest, indicating that many employees have good team spirit.
Ambition represents the level of ambition of employees. The data are concentrated between 3.5 and 5, indicating that the enterprise has a large number of ambitious employees.
Value Recognition represents the level of employees’ recognition of the value of the enterprise. As seen from the figure, some scores fall between 0 and 3, while others fall between 3 and 5, indicating that employees’ recognition of the value of the enterprise is polarized.
Honesty represents the level of employees’ honesty. The data are distributed across the range from 0 to 5 points, with more employees scoring above 3 points, indicating that the honesty of employees in this enterprise is generally high.
Belonging Sense represents employees’ sense of belonging to the enterprise and behaves like Value Recognition: some scores fall between 0 and 3, while others fall between 3 and 5, showing polarization.
Character analysis
Figure 4 shows the relationship between the prediction label and Education, Position, Welfare, and Promotion Opportunities. Because Figure 4 depicts the distribution of labels, the labels are displayed in the diagram for an intuitive description. A label value of 0 indicates that employees are disloyal or have low loyalty, whereas a label value of 1 indicates that employees are loyal or have high loyalty.

The Relationship between Features and Prediction Label.
Several categorical variables related to the prediction label are analyzed. This research presents part of the variable analysis process.
In Fig. 4, a label value of 0 indicates low employee loyalty, while a label value of 1 indicates high employee loyalty. The proportion of employees with a doctoral background marked as disloyal is 22.61%, while the proportion marked as loyal is 9.83%; the proportion of employees with a bachelor’s degree marked as disloyal is 7.72%, while the proportion marked as loyal is 8.71%. As reflected in the figure, employees with a doctorate are therefore more likely to have low loyalty.
In Fig. 4, the proportion of employees in high positions marked as disloyal is 4.82%, while the proportion marked as loyal is 22.05%; the proportion of employees in low positions marked as disloyal is 26.40%, while the proportion marked as loyal is 0.42%. Among employees in high positions, those with high loyalty far outnumber those with low loyalty, whereas the opposite holds for the low-position group, indicating that the level of loyalty is closely related to the level of the position.
In Fig. 4, the proportion of employees with high benefits marked as disloyal is 7.30%, while the proportion marked as loyal is 41.71%; the proportion of employees with moderate benefits marked as disloyal is 25.14%, while the proportion marked as loyal is 0.84%; the proportion of employees with low benefits marked as disloyal is 23.74%, while the proportion marked as loyal is 1.26%. At medium and low welfare levels, disloyal employees outnumber loyal employees, whereas at the high welfare level, employees with high loyalty outnumber those with low loyalty.
In Fig. 4, the proportion of employees with high promotion opportunities marked as disloyal is 3.23%, while the proportion marked as loyal is 19.80%; the proportion of employees with low promotion opportunities marked as disloyal is 25.28%, while the proportion marked as loyal is 1.54%. At medium and low promotion opportunity levels, disloyal employees outnumber loyal employees, whereas at the high level, employees with high loyalty outnumber those with low loyalty.
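Proportions of this kind can be obtained by cross-tabulating a categorical feature against the label. The six Welfare percentages reported above sum to approximately 100%, which suggests that each percentage is a share of the whole sample rather than of one category; the sketch below assumes that reading. The function name `crosstab_percent` and the records are hypothetical.

```python
from collections import Counter

def crosstab_percent(records, feature, label="Label"):
    """Share of the whole sample (in %) for each (feature value, label) pair."""
    counts = Counter((record[feature], record[label]) for record in records)
    total = sum(counts.values())
    return {cell: round(100 * n / total, 2) for cell, n in counts.items()}

# Hypothetical records: 10 employees (illustrative only).
records = (
    [{"Welfare": "high", "Label": 1}] * 4
    + [{"Welfare": "high", "Label": 0}] * 1
    + [{"Welfare": "medium", "Label": 0}] * 3
    + [{"Welfare": "low", "Label": 0}] * 2
)
welfare_table = crosstab_percent(records, "Welfare")
for (level, label), pct in sorted(welfare_table.items()):
    print(f"Welfare={level:<6} Label={label}: {pct}%")
```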
Relevant analysis
The correlation between features is analyzed below. The correlation coefficients are visualized as a heat map, and the results are shown in Fig. 5.

Heat Map of Correlation Analysis between Features.
In the heat map, the lighter the color, the stronger the positive correlation; the darker the color, the stronger the negative correlation.
It can be seen from the figure that the Loyalty Label has a strong positive correlation with Belonging Sense, Value Recognition, Honesty, and Team Spirit. In particular, the correlation coefficients with Belonging Sense and Value Recognition indicate a strong positive correlation. This suggests that the larger these values are, the more loyal the employees are. However, loyalty has a strong negative correlation with the Ambition feature, reaching -0.783, indicating that the larger the Ambition value, the lower the loyalty.
In TSMEs, Belonging Sense is an employee’s sense of identity, security, and value with respect to a group or organization. The loyalty label is a measure of how loyal an employee is. When employees feel a strong sense of belonging to a group or organization, they are more likely to show high levels of loyalty. Value Recognition is an employee’s ability to judge and evaluate the value of things or behaviors. The formation of the loyalty label is often based on the employee’s identification with and acceptance of the values of the group or organization. When employees believe that the values of a group or organization are in line with their own, they are more likely to develop a sense of loyalty to it. Regarding Honesty, loyal employees are more likely to adhere to the principle of integrity because they are more willing to be honest and trustworthy for the good of the group or organization. Team Spirit is the group spirit formed within a team, characterized by consistency, mutual support, close cooperation, and selfless dedication. Employees with high loyalty labels tend to have stronger team spirit, and they are more willing to work toward the common goals of the team and to maintain good cooperative relations with team members.
The feature variables “Evaluation ID” and “ID” are meaningless for prediction. They are removed and do not participate in machine learning training. According to the correlation analysis, none of the current 17 variables has a correlation value of 0, and the number of feature variables is small. Therefore, to predict accurately, all 17 existing features are used in machine learning training.
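The correlation step can be illustrated with a short sketch that computes the Pearson coefficient between each feature and the label after dropping the identifier columns. The helper names and the four-row table are hypothetical and serve only to show the calculation, not to reproduce the reported coefficients.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def correlations_with_label(table, label="Label", drop=("ID", "Evaluation ID")):
    """Correlation of every remaining feature column with the label column."""
    return {name: pearson(column, table[label])
            for name, column in table.items()
            if name != label and name not in drop}

# Hypothetical columns (illustrative only).
table = {
    "ID": [1, 2, 3, 4],
    "Belonging Sense": [4.5, 4.0, 1.0, 0.5],
    "Ambition": [1.0, 2.0, 4.5, 5.0],
    "Label": [1, 1, 0, 0],
}
corr = correlations_with_label(table)
for name, r in corr.items():
    print(f"{name}: {r:+.3f}")
```

A heat map such as Fig. 5 is a color-coded display of the full matrix of these pairwise coefficients.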
Data modeling
This step is the core of the data mining work of this research. Model construction generalizes the patterns in the sampled data: the model reflects the general characteristics of the internal structure of the sampled data and is basically consistent with their specific structure. The construction of the prediction model includes model establishment, model training, model verification, and model prediction. This research uses Python, which is convenient for developing machine learning models. Seventeen predictors were selected to participate in machine learning training: Age, Gender, Education, Position, Salary, Vacation, Welfare, Ability Level, Team Spirit, Ambition, Value Recognition, Honesty, Belonging Sense, Training Opportunities, Promotion Opportunities, Working Environment, and Overtime.
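The four stages named above (establishment, training, verification, prediction) can be sketched end to end as follows. This section does not name the algorithm, so a simple k-nearest-neighbours classifier is used purely as a placeholder; the class and function names, the two features, and the synthetic data are all assumptions made for illustration.

```python
import random
from collections import Counter
from math import dist

def train_test_split(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle the row indices and hold out a share of them for verification."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_ratio))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [rows[i] for i in test],
            [labels[i] for i in train], [labels[i] for i in test])

class NearestNeighbours:
    """Placeholder classifier: majority vote among the k closest training rows."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, rows, labels):
        self.rows, self.labels = rows, labels
        return self

    def predict(self, rows):
        predictions = []
        for row in rows:
            nearest = sorted(range(len(self.rows)),
                             key=lambda i: dist(row, self.rows[i]))[:self.k]
            votes = Counter(self.labels[i] for i in nearest)
            predictions.append(votes.most_common(1)[0][0])
        return predictions

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic rows of [Belonging Sense, Ambition] (illustrative only).
loyal = [[4 + 0.1 * i, 1 + 0.1 * i] for i in range(10)]
disloyal = [[1 + 0.1 * i, 4 + 0.1 * i] for i in range(10)]
X, y = loyal + disloyal, [1] * 10 + [0] * 10

X_train, X_test, y_train, y_test = train_test_split(X, y)   # split the data
model = NearestNeighbours(k=3).fit(X_train, y_train)        # establish and train
holdout_accuracy = accuracy(y_test, model.predict(X_test))  # verify on held-out rows
new_prediction = model.predict([[4.5, 1.2]])                # predict a new employee
print(holdout_accuracy, new_prediction)
```

Holding out part of the data for verification keeps the performance estimate separate from the rows used for training, which matches the evaluate-then-deploy loop described in the model building process.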
