Integrating machine learning and explainable AI for employee attrition prediction in HR analytics

In this section, we present the experiments conducted to validate our approach for predicting employee attrition and job change using ML models. The experimental setup was implemented in Python on a Windows 11 system equipped with 128 GB of RAM, which provided sufficient computational resources for handling large datasets and complex model training. We used an 80%/20% train-test split: 80% of the data was used for training the models and 20% for evaluating their performance on unseen data. To optimize hyperparameters efficiently, we employed the TPE sampler with 2,500 optimization trials, which allowed us to explore the hyperparameter space effectively while limiting computational cost.
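The 80%/20% split described above can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: the function name `train_test_split_indices`, the seed, and the sample count of 1,000 are assumptions introduced here, and the source does not say whether the split was stratified.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx) after a seeded shuffle of range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = round(n_samples * test_ratio)
    # The first n_test shuffled indices form the held-out set, the rest train.
    return indices[n_test:], indices[:n_test]


# Hypothetical dataset size, used only for illustration.
train_idx, test_idx = train_test_split_indices(n_samples=1000)
print(len(train_idx), len(test_idx))
```

Any preprocessing that learns statistics from the data (scaling, feature selection, balancing) would normally be fitted on `train_idx` rows only, so that the 20% held-out portion stays unseen.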

The experiments involved four datasets: the IBM HR Analytics Employee Attrition & Performance dataset, the HR Analytics: Job Change of Data Scientists dataset, the HR Dataset v14, and the Attrition Rate of a Company dataset. They were preprocessed by addressing missing values, normalizing numeric features using Min-Max or Z-score scaling, and balancing class distributions using techniques like SMOTE to mitigate the effects of class imbalance. Feature selection was performed using both filter-based methods (e.g., Gini importance from tree-based models) and wrapper methods such as Recursive Feature Elimination (RFE). The selected features were then fed into various ML algorithms, including Logistic Regression, Decision Trees, Random Forests, Support Vector Machines (SVMs), and Gradient Boosting Machines (GBMs). Model performance was evaluated using weighted precision, recall, F1 score, accuracy, and specificity to account for the imbalanced nature of the target variable. Additionally, SHAP (SHapley Additive exPlanations) was used to provide interpretable insights into feature contributions, enhancing model transparency and trustworthiness.
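The two normalization options named above, Min-Max and Z-score scaling, can be written in a few lines. The sketch below is an illustration under stated assumptions: the helper names and the `monthly_income` values are hypothetical, and the Z-score variant uses the population standard deviation, which the source does not specify.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical numeric feature, used only for illustration.
monthly_income = [2000.0, 4000.0, 6000.0, 10000.0]
scaled = min_max_scale(monthly_income)
standardized = z_score_scale(monthly_income)
print(scaled)
```

Min-Max scaling bounds the feature to a fixed range, while Z-score scaling yields zero mean and unit variance. In a train/test setting, the minimum, maximum, mean, and standard deviation would be taken from the training split and reused on the test split.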

Statistical and quantitative analysis

Dataset: IBM HR analytics employee attrition and performance

Table 3 below presents the top-1 testing performance results for various ML models applied to the “IBM HR Analytics Employee Attrition and Performance” dataset. It highlights the most promising model, AdaBoost (AB), based on its superior F1-score, AUC-ROC, Specificity, and their average. Empty cells indicate that a specific configuration or technique was not applied (i.e., N/A). For clarity, we report only the most effective configuration for each model. In cases where a specific preprocessing step (like scaling or feature selection) had negligible impact on performance, it was omitted from the table to focus on the key contributing factors.

The AB model demonstrates superior performance with an average score of 79.69%, achieving high values across all metrics, particularly Precision (71.74%), Recall (70.21%), and Accuracy (90.82%). Notably, the use of MaxAbs scaling, Recursive Feature Elimination (RFE) with an 80% feature selection ratio, and ADASYN for data balancing contributed significantly to its success. Other models (such as Gradient Boosting (GB) and Support Vector Classifier (SVC)) also performed well but fell short of the AB model’s overall balance and consistency.

Interestingly, some models like Extra Trees (ETs) exhibited extreme trade-offs, with high Specificity (99.60%) but very low Recall (23.40%), resulting in suboptimal F1-scores. This underscores the importance of selecting appropriate preprocessing techniques and hyperparameters to achieve balanced performance across all metrics.
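The link between low Recall and a low F1-score follows directly from the definition of F1 as the harmonic mean of Precision and Recall. Writing \(P\) for Precision and \(R\) for Recall (symbols introduced here for illustration), F1 grows with \(P\) for a fixed \(R\), so setting \(P = 1\) gives an upper bound that depends on Recall alone. For the reported ET Recall of 23.40%:

\[
F_1 = \frac{2PR}{P + R} \;\le\; \frac{2R}{1 + R} = \frac{2 \times 0.2340}{1 + 0.2340} \approx 0.379
\]

Even with perfect Precision, a Recall of 23.40% therefore caps the F1-score at roughly 37.9%, regardless of how high Specificity is.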

Table 3 Top-1 testing performance results for each model (applied on the “IBM HR Analytics Employee Attrition and Performance” dataset), with the most promising model (AB) highlighted in the first row. Empty cells indicate that a specific configuration or technique was not applied (i.e., N/A). Metrics include Precision, Recall, F1-score, Accuracy, and Specificity.

The superior performance of the AdaBoost model with MaxAbs scaling, RFE at an 80% feature selection ratio, and ADASYN for balancing can be attributed to a synergistic effect. The 80% RFE ratio strikes an optimal balance, retaining the most informative features while reducing noise and computational complexity, which benefits the AdaBoost algorithm. MaxAbs scaling preserves the sparsity of the data, which is advantageous for tree-based ensembles. Finally, ADASYN effectively mitigates the class imbalance by generating synthetic samples near the decision boundary, allowing the AdaBoost model to better capture the nuanced patterns associated with the minority attrition class.
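The sparsity-preserving property of MaxAbs scaling mentioned above can be seen in a short sketch. The helper name `max_abs_scale` and the sample column are hypothetical. The point of the example is that dividing by the largest absolute value does not shift the data, so zero entries stay exactly zero.

```python
def max_abs_scale(values):
    """Divide by the largest absolute value so results lie in [-1, 1]."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return list(values)  # an all-zero column is left unchanged
    return [v / largest for v in values]


# Hypothetical sparse feature column, used only for illustration.
column = [0.0, -4.0, 0.0, 2.0, 8.0]
scaled = max_abs_scale(column)
print(scaled)  # [0.0, -0.5, 0.0, 0.25, 1.0]
```

By contrast, Min-Max or Z-score scaling subtracts an offset first, which generally turns zeros into non-zero values.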

To further validate the robustness of the top-performing model (AB), we conducted 10 independent trials using the same experimental setup. The statistical analysis and confusion matrix derived from these trials are presented in Figs. 9 and 10. Figure 9 shows the confusion matrix for the AB model, highlighting its ability to correctly classify both attrition (positive class) and non-attrition (negative class) cases. The model achieves high True Positive (TP) and True Negative (TN) rates, indicating strong predictive power. However, the small number of False Negatives (FN) indicates that some attrition cases are still missed, which could be critical in real-world applications.
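The five reported metrics can all be derived from the four cells of a binary confusion matrix. The sketch below is illustrative only: the counts are hypothetical integers chosen so that Precision, Recall, and Accuracy round to the values reported for the AB model, and they are not read from Fig. 9. It also computes plain positive-class metrics, whereas the paper describes its precision, recall, and F1 as weighted.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute positive-class metrics from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    specificity = tn / (tn + fp)  # recall of the negative class
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "specificity": specificity,
    }


# Hypothetical counts (assumption): picked to match the reported percentages.
metrics = classification_metrics(tp=33, fp=13, fn=14, tn=234)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these counts, the 14 false negatives are the missed attrition cases discussed above. Reducing them raises Recall but usually costs some Precision.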

Figure 10 presents a box plot summarizing the distribution of key performance metrics across the 10 trials. The narrow interquartile ranges for Precision, Recall, and F1-score demonstrate the model’s stability and reliability. The slight variability in Recall suggests that further tuning may be necessary to minimize FN rates consistently.
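The interquartile range that a box plot visualizes can be computed directly from per-trial scores. This is a sketch under stated assumptions: the ten Recall values are hypothetical placeholders, not the trial results behind Fig. 10, and `statistics.quantiles` uses its default "exclusive" method, which may differ slightly from the convention of a given plotting library.

```python
import statistics


def interquartile_range(values):
    """Return Q3 - Q1, the height of the box in a box plot."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical Recall values from 10 trials, used only for illustration.
recall_by_trial = [0.70, 0.68, 0.72, 0.66, 0.70, 0.74, 0.68, 0.70, 0.72, 0.66]
iqr = interquartile_range(recall_by_trial)
print(f"median={statistics.median(recall_by_trial):.2f}, IQR={iqr:.3f}")
```

A small IQR relative to the median is what the text above refers to as a narrow interquartile range.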

Fig. 9

Confusion matrix for the top-performing AB model, illustrating classification performance across attrition and non-attrition cases.

Fig. 10

Box plot summarizing the distribution of key performance metrics (Precision, Recall, F1-score, Accuracy, and Specificity) over 10 independent trials.

The global interpretability of the model is illustrated in Figs. 11 and 12, which show the SHAP Beeswarm and Bar plots. The Beeswarm plot (Fig. 11) highlights the most influential features for predicting employee attrition, with OverTime, JobLevel, StockOptionLevel, and JobSatisfaction emerging as the top contributors. The SHAP values indicate how each feature impacts the model’s output, with higher absolute values signifying greater importance. For instance, working overtime markedly increases the predicted likelihood of attrition, as evidenced by the clustering of high SHAP values for the OverTime feature. Similarly, lower JobSatisfaction correlates with higher attrition risk, demonstrating the model’s ability to capture nuanced relationships within the dataset. The Bar plot (Fig. 12) complements this analysis by summarizing the mean absolute SHAP values, reinforcing the dominance of OverTime and JobLevel as critical predictors. Together, these visualizations provide a comprehensive understanding of the factors driving attrition at a global level.
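The quantity summarized by a SHAP bar plot, the mean absolute SHAP value per feature, is simple to compute once per-instance SHAP values are available. The sketch below assumes those values already exist as a list of rows. The numbers are hypothetical and only the feature names come from the text.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean of |SHAP value| across all instances."""
    n_rows = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)  # sign is dropped: only magnitude counts
    scores = {name: total / n_rows for name, total in zip(feature_names, totals)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


features = ["OverTime", "JobLevel", "JobSatisfaction"]
# Hypothetical SHAP values (one row per employee), used only for illustration.
shap_rows = [
    [0.50, -0.25, 0.125],
    [-0.50, 0.25, -0.125],
    [0.25, -0.25, 0.125],
    [0.75, 0.25, -0.125],
]
ranking = mean_abs_shap(shap_rows, features)
print(ranking)
```

Because the absolute value is taken before averaging, a feature that pushes predictions strongly in both directions still ranks high. The beeswarm plot is what shows the direction of each effect.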

Fig. 11

SHAP Beeswarm Plot illustrating global feature importance. Features such as “OverTime”, “JobLevel”, “StockOptionLevel”, and “JobSatisfaction” are identified as the most influential predictors for employee attrition. The SHAP values indicate the direction and magnitude of each feature’s impact on the model output.

Fig. 12

SHAP Bar Plot summarizing global feature importance. The mean absolute SHAP values highlight OverTime as the most critical feature, followed by “JobLevel”, “StockOptionLevel”, and “JobSatisfaction”. The “Sum of 23 other features” represents the cumulative impact of less influential variables.

The identification of OverTime, JobLevel, StockOptionLevel, and JobSatisfaction as the most influential predictors provides clear, actionable guidance for HR practitioners. To mitigate attrition risk associated with “OverTime”, organizations should implement strict work-life balance policies, including mandatory time-off, flexible scheduling, and workload redistribution to prevent burnout. For “JobLevel”, the finding suggests that employees at mid-to-senior levels may feel stagnant or undervalued; targeted career development programs, mentorship opportunities, and clearer promotion pathways can address this. Low “JobSatisfaction” is a critical red flag; HR should conduct regular pulse surveys to identify sources of dissatisfaction and act on feedback promptly. Finally, “StockOptionLevel” indicates that financial incentives and long-term investment in the company are significant motivators; reviewing and potentially expanding equity-based compensation packages for key talent could be a strategic intervention.

At the individual level, Figs. 13 and 14 present SHAP Force plots for specific instances, offering insights into how feature contributions shape predictions for single data points. For instance, the Force plot for Instance 135 (Fig. 13) reveals that EducationField and other undisclosed features collectively push the prediction toward a higher likelihood of attrition. In contrast, the Force plot for Instance 25 (Fig. 14) shows a balanced interplay between features such as Age and DailyRate, resulting in a predicted label of “0” (no attrition). These plots are instrumental in explaining why the model assigns specific predictions to individual employees, thereby enhancing transparency and trust. By combining global and local interpretability analyses, we gain a holistic view of the model’s behavior, ensuring that both overarching trends and individual nuances are adequately captured and understood.
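A force plot rests on the additivity property of SHAP values: the model output for one instance equals a base value plus the sum of that instance's per-feature contributions. The sketch below illustrates this with hypothetical numbers. Only the feature names are taken from the text, and the output scale (probability or log-odds) depends on the explainer, which the source does not state.

```python
def explain_prediction(base_value, contributions):
    """Return the reconstructed output and features ordered by |contribution|."""
    output = base_value + sum(contributions.values())
    ordered = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return output, ordered


# Hypothetical base value and SHAP contributions for a single employee.
base_value = 0.25
contributions = {"EducationField": 0.25, "Age": -0.0625, "DailyRate": 0.03125}
output, ordered = explain_prediction(base_value, contributions)

print(f"model output: {output}")
for name, value in ordered:
    direction = "raises" if value > 0 else "lowers"
    print(f"{name} {direction} the prediction by {abs(value)}")
```

Positive contributions correspond to the segments that push a force plot toward a higher output, and negative ones push it back toward or below the base value.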

Fig. 13

SHAP Force Plot for Instance 135, showing how individual features contribute to the prediction for a specific employee. Features like “EducationField” positively influence the likelihood of attrition, pushing the prediction toward a higher risk of attrition.

Fig. 14

SHAP Force Plot for Instance 25, demonstrating the balance of feature contributions for an employee predicted not to leave (Predicted Label: 0). Features such as “Age” and “DailyRate” have minimal impact, resulting in a prediction close to the base value.

Dataset: HR analytics: job change of data scientists

Table 4 below presents the top-1 testing performance results for various ML models applied to the “HR Analytics: Job Change of Data Scientists” dataset. It highlights the most promising model (HGB) based on its superior F1-score and AUC-ROC. Empty cells indicate that a specific configuration or technique was not applied (i.e., N/A).

The HGB model demonstrates exceptional performance with an average score of 72.50%, achieving balanced results across all metrics, particularly in Recall (76.02%), F1-score (65.94%), and Accuracy (80.43%). The use of Random Oversampling (ROS) for data balancing appears to have significantly contributed to its success, as ROS consistently enhances the performance of several other models in the table. Notably, the HGB model also achieves high Specificity (81.89%), indicating its robustness in correctly identifying negative cases (i.e., employees not changing jobs). This balance between Recall and Specificity makes HGB the most reliable choice for predicting job changes among data scientists.
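The reported average can be cross-checked against the individual metrics. The sketch below assumes that the "average score" is the unweighted mean of Precision, Recall, F1-score, Accuracy, and Specificity. This is an assumption made here, since the text quotes the average without defining it. Under that assumption, the Precision not quoted in the paragraph above can be backed out and then checked against the reported F1-score.

```python
# Values quoted in the text for the HGB model (percentages).
reported = {"recall": 76.02, "f1": 65.94, "accuracy": 80.43, "specificity": 81.89}
reported_average = 72.50

# Assumption: average = (precision + recall + f1 + accuracy + specificity) / 5
implied_precision = 5 * reported_average - sum(reported.values())

# Cross-check: F1 recomputed as the harmonic mean of implied precision and recall.
recall = reported["recall"]
implied_f1 = 2 * implied_precision * recall / (implied_precision + recall)

print(f"implied precision: {implied_precision:.2f}%")
print(f"F1 from implied precision and recall: {implied_f1:.2f}%")
```

The implied Precision comes out at about 58.22%, and the F1-score recomputed from it is about 65.94%, matching the reported value. This supports the unweighted-mean reading but does not prove it. It also suggests that the high Recall is obtained at the cost of comparatively low Precision.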

Interestingly, some models like Extra Trees (ETs) exhibit trade-offs between Recall and Specificity, achieving high Recall (74.76%) but relatively lower Specificity (75.50%), which results in suboptimal overall performance. Similarly, Random Forest (RF) shows a strong bias toward Specificity (86.34%) at the expense of Recall (61.36%), leading to an imbalanced F1-score. These observations highlight the importance of selecting appropriate preprocessing techniques and hyperparameters to achieve a harmonious balance across all evaluation metrics. Furthermore, models like Decision Tree (DT) and K-Nearest Neighbors (KNN) demonstrate relatively weaker performance, underscoring the need for more sophisticated algorithms when dealing with complex datasets like this one.

Table 4 Top-1 testing performance results for each model (applied on the “HR Analytics: Job Change of Data Scientists” dataset), with the most promising model (HGB) highlighted in the first row. Empty cells indicate that a specific configuration or technique was not applied (i.e., N/A). Metrics include Precision, Recall, F1-score, Accuracy, and Specificity.

To further validate the robustness of the top-performing model (HGB), we conducted 10 independent trials using the same experimental setup. The statistical analysis and confusion matrix derived from these trials are presented in Figs. 15 and 16. Figure 15 shows the confusion matrix for the HGB model, highlighting its ability to correctly classify both job change (positive class) and no job change (negative class) cases. The model demonstrates strong performance in minimizing classification errors, with high True Positive (TP) and True Negative (TN) rates. Figure 16 presents a box plot summarizing the distribution of key performance metrics, including Precision, Recall, F1-score, and Accuracy, across the 10 trials. The narrow interquartile ranges indicate the model’s stability and reliability. However, the slight variability in Recall suggests that further tuning may be necessary to consistently reduce False Negative (FN) rates.

Fig. 15

Confusion matrix for the HGB model, illustrating classification performance across job change and no job change cases.

Fig. 16

Box plot summarizing the distribution of key performance metrics (Precision, Recall, F1-score, and Accuracy) over 10 independent trials.

The global interpretability of the model is illustrated in Figs. 17 and 18, which show the SHAP Beeswarm and Bar plots. The Beeswarm plot (Fig. 17) highlights the most influential features for predicting job changes, with OverTime, JobLevel, StockOptionLevel, and JobSatisfaction emerging as the top contributors. The SHAP values indicate how each feature impacts the model’s output, with higher absolute values signifying greater importance. For instance, working overtime markedly increases the predicted likelihood of job change, as evidenced by the clustering of high SHAP values for the OverTime feature. Similarly, lower JobSatisfaction correlates with a higher propensity for job change. The Bar plot (Fig. 18) complements this analysis by summarizing the mean absolute SHAP values, reinforcing the dominance of OverTime and JobLevel as critical predictors. Together, these visualizations provide a comprehensive understanding of the factors driving job changes at a global level.

Fig. 17

SHAP Beeswarm Plot illustrating global feature importance. Features such as OverTime, JobLevel, StockOptionLevel, and JobSatisfaction are identified as the most influential predictors for job change. The SHAP values indicate the direction and magnitude of each feature’s impact on the model output.

Fig. 18

SHAP Bar Plot summarizing global feature importance. The mean absolute SHAP values highlight OverTime as the most critical feature, followed by JobLevel, StockOptionLevel, and JobSatisfaction. The “Sum of 23 other features” represents the cumulative impact of less influential variables.

The consistent dominance of OverTime, JobLevel, StockOptionLevel, and JobSatisfaction as key drivers of job change among data scientists underscores the universal nature of these factors. For this highly skilled cohort, reducing “OverTime” is paramount; companies should invest in automation tools, hire additional support staff, and foster a culture that values efficiency over hours logged. Addressing “JobLevel” requires creating specialized career tracks for technical experts who may not wish to move into management. Enhancing “JobSatisfaction” might involve granting greater autonomy in project selection and providing access to cutting-edge technologies. Lastly, competitive “StockOptionLevel” remains a powerful tool for retention, signaling to employees that their contributions are valued and that they share in the company’s success.

At the individual level, Figs. 19 and 20 present a SHAP Force plot for a specific instance and a SHAP Scatter plot across all instances, offering insights into how feature contributions shape predictions. The Force plot for Instance 9358 (Fig. 19) reveals that features such as EducationField and DailyRate push the prediction toward a higher likelihood of job change. In contrast, the Scatter plot (Fig. 20) provides a broader view of feature impacts across all instances, highlighting patterns and outliers in the dataset. These plots are instrumental in explaining why the model assigns specific predictions to individual employees, thereby enhancing transparency and trust. By combining global and local interpretability analyses, we gain a holistic view of the model’s behavior, ensuring that both overarching trends and individual nuances are adequately captured and understood.

Fig. 19

SHAP Force Plot for Instance 9358, showing how individual features contribute to the prediction for a specific employee. Features like EducationField and DailyRate positively influence the likelihood of job change.

Fig. 20

SHAP Scatter Plot for all instances, illustrating the relationship between feature values and their impact on the model output. This visualization highlights patterns and potential outliers in the dataset.

Other datasets

In addition to the primary datasets, we evaluated the performance of our framework on two supplementary datasets: the “HR Dataset v14” and the “Attrition Rate of a Company Dataset”. The results indicate that the metrics achieved for these datasets are near-optimal, with performance scores approximating 100% across key evaluation criteria such as Precision, Recall, F1-score, and Accuracy. This exceptional performance can be attributed to the comprehensive preprocessing pipeline, which includes robust data cleaning, feature selection, and scaling techniques tailored to the unique characteristics of each dataset. Furthermore, the application of advanced data balancing methods (such as SMOTE and Random Oversampling (ROS)) effectively addressed class imbalance issues, ensuring that the models were well-equipped to handle minority classes. The combination of these strategies, along with the use of highly optimized ML models like HGB and XGB, underscores the adaptability and effectiveness of our approach in achieving near-perfect predictive performance on diverse HR-related datasets.
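Random Oversampling, one of the balancing methods named above, duplicates randomly chosen minority-class rows until the classes are the same size. The following is a minimal standard-library sketch. The function name, seed, and toy data are hypothetical, and it is not the implementation used in the experiments.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement from the class's own rows to fill the gap.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: four majority rows (label 0), two minority rows (label 1).
rows = [[1], [2], [3], [4], [5], [6]]
labels = [0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling is normally applied to the training split only. Duplicating rows before the train-test split would let copies of the same record appear on both sides and inflate test scores.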

Related studies comparisons

The proposed framework demonstrates superior performance compared to existing studies, particularly in terms of predictive accuracy, interpretability, and adaptability across diverse datasets. For instance, Setiawan et al. [31] achieved an accuracy of 75% using logistic regression, which is significantly lower than the 90.82% achieved by our AdaBoost model on the IBM HR Analytics dataset. Similarly, Krishna and Sidharth [32] reported high training accuracy (99.472%) using Random Forest with SMOTE but noted limitations in model interpretability, a challenge addressed in our study through SHAP-based visualizations.

Our approach also outperforms Nagpal et al. [33], who explored multiple ML models but highlighted trade-offs between model complexity and interpretability. By integrating feature selection techniques like Recursive Feature Elimination (RFE) and utilizing advanced optimization strategies such as TPE, we achieve a harmonious balance between performance and interpretability. Furthermore, the near-optimal metrics achieved on supplementary datasets (HR Dataset v14 and Attrition Rate of a Company) underscore the adaptability of our framework, a feature not commonly observed in prior studies. These advancements position our work as a significant contribution to the field of HR analytics.

Our framework’s superior performance compared to existing studies stems from the synergistic integration of its components. Unlike studies that focus on a single aspect (e.g., only using SMOTE [32] or only employing a single model type [33]), our approach combines advanced preprocessing (multiple scaling and balancing techniques), sophisticated hyperparameter optimization (TPE), and a diverse suite of ML algorithms. The use of TPE ensures that each model is finely tuned to its specific dataset and configuration, maximizing its potential. Furthermore, the inclusion of SHAP not only enhances interpretability but also provides feedback that can inform the feature selection process, creating a more robust and reliable predictive system.

Complexity analysis and real-time implementation

The computational complexity of the proposed framework is primarily driven by the preprocessing pipeline, model training, and hyperparameter optimization stages. Preprocessing steps, including data cleaning, normalization, and feature selection, exhibit linear complexity \(O(n)\), where \(n\) represents the number of samples. However, hyperparameter optimization using TPE introduces higher complexity due to its iterative nature, scaling approximately as \(O(k \times m)\), where \(k\) is the number of iterations (set to 2,500 in this study) and \(m\) is the number of hyperparameters being tuned. Despite this, TPE’s guided exploration of the hyperparameter space typically converges faster than grid search or random search.

Real-time implementation of the framework requires careful consideration of computational resources and latency constraints. Models like HGB and XGB, while computationally intensive during training, offer efficient inference times suitable for real-time deployment. To further optimize performance, techniques such as model pruning, quantization, and parallel processing can be employed. Additionally, deploying the framework on cloud-based platforms or edge devices ensures scalability and accessibility for organizations of varying sizes. These strategies collectively enable seamless integration into existing HR systems, facilitating proactive decision-making and talent management.

Relevance of the study

This study holds significant relevance for both academic researchers and HR practitioners. Academically, it advances the field of HR analytics by addressing key limitations in prior research (such as the lack of interpretability, reliance on single datasets, and insufficient exploration of advanced data balancing techniques). The integration of SHAP-based explainability tools provides a novel perspective on understanding the drivers of employee attrition, bridging the gap between predictive modeling and actionable insights.

For HR practitioners, the study offers a practical and scalable solution for managing employee turnover. By identifying critical predictors of attrition, such as OverTime, JobLevel, and JobSatisfaction, organizations can implement targeted retention strategies. Furthermore, the near-optimal performance achieved on diverse datasets underscores the framework’s adaptability, making it applicable to various organizational contexts. This dual relevance positions the study as a valuable resource for enhancing both theoretical understanding and practical applications in HR analytics.

Real-time deployment feasibility and computational load

Real-time deployment feasibility hinges on balancing computational load with system responsiveness. The proposed framework achieves this balance through several strategies. First, the use of optimized ML models like HGB and XGB ensures efficient inference times, even for large datasets. Second, the preprocessing pipeline is modular, allowing for incremental updates as new data becomes available. This modularity reduces the need for retraining the entire model, thereby minimizing computational overhead.

To address potential computational bottlenecks, the framework utilizes parallel processing and distributed computing techniques. For instance, hyperparameter optimization using TPE can be parallelized across multiple nodes, significantly reducing training time. Additionally, deploying the framework on cloud platforms enables dynamic scaling based on workload demands, ensuring consistent performance even during peak usage periods. These strategies collectively ensure that the framework is not only feasible for real-time deployment but also capable of handling the computational demands of modern HR systems.


