Evaluating Machine Learning Model Performance Before Production: An Employee Attrition Case Study
This tutorial walks through a complete workflow for assessing machine‑learning models before production. Using a Kaggle HR attrition dataset, it compares a Random Forest and a Gradient Boosting model on ROC‑AUC, precision and recall, and adds segment‑level analysis with the Evidently library to decide which model is ready for deployment.
Before deploying a machine‑learning model to production, it is essential to evaluate its performance beyond standard test‑set metrics. This tutorial demonstrates the process using a fictional employee‑attrition dataset from a Kaggle competition.
Dataset Overview
The data contain 1,470 employee records with 35 features describing background, job details, work history, compensation and more, plus a binary label indicating whether the employee left the company. The task is probabilistic binary classification: estimate each employee's likelihood of leaving.
Model Training and Initial Metrics
Two models are trained on the same training split: a Random Forest and a Gradient‑Boosting model. Their ROC‑AUC scores on the held‑out test set are 0.795 and 0.803 respectively, indicating comparable overall discrimination.
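As a minimal, self‑contained sketch of this setup (using scikit‑learn and a synthetic stand‑in for the attrition data, since the Kaggle file is not bundled here; exact scores will differ from the tutorial's 0.795 and 0.803), the two models can be trained and compared like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in mirroring the dataset's shape: 1,470 rows,
# 35 features, roughly 16% positive (attrition) class
X, y = make_classification(n_samples=1470, n_features=35,
                           weights=[0.84], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Train both candidate models on the same split
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
gb = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Compare overall discrimination on the held-out test set
rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
gb_auc = roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1])
print(f"Random Forest ROC-AUC:     {rf_auc:.3f}")
print(f"Gradient Boosting ROC-AUC: {gb_auc:.3f}")
```

Similar single-number scores are exactly why the rest of the tutorial looks past ROC‑AUC alone.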
Using Evidently for Model Comparison
The open‑source Evidently library is employed to generate a side‑by‑side performance dashboard.
from evidently.dashboard import Dashboard
from evidently.tabs import ProbClassificationPerformanceTab

comparison_report = Dashboard(rf_merged_test, cat_merged_test,
                              column_mapping=column_mapping,
                              tabs=[ProbClassificationPerformanceTab])
comparison_report.show()

The dashboard visualizes ROC‑AUC, confusion matrices, class‑wise metrics and other diagnostics for both models.
Beyond Accuracy: Class Imbalance and Metric Choice
Only 16% of employees in the data actually leave, making accuracy a misleading metric: a naïve model that predicts "stay" for everyone achieves 84% accuracy while catching no attrition at all. Recall, precision, F1‑score and class‑specific metrics therefore become crucial.
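The accuracy trap is easy to demonstrate numerically. The sketch below uses simulated labels with the tutorial's 16% attrition rate (not the real dataset) to show that the all‑"stay" baseline scores high on accuracy while having zero recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Simulated labels with ~16% attrition, mirroring the tutorial's class balance
rng = np.random.default_rng(0)
y_true = (rng.random(1470) < 0.16).astype(int)

# Naive baseline: predict "stay" (0) for every employee
y_naive = np.zeros_like(y_true)

naive_acc = accuracy_score(y_true, y_naive)
naive_recall = recall_score(y_true, y_naive, zero_division=0)
print(f"Naive accuracy: {naive_acc:.2f}")   # high, despite being useless
print(f"Naive recall:   {naive_recall:.2f}")  # zero attritions caught
```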
Practical Scenarios
Scenario 1 – Tagging Employees: When the model is used to label each employee in an HR system, higher recall (capturing more true attritions) may be preferred even at the cost of a few false positives.
Scenario 2 – Proactive Alerts: If predictions trigger email alerts to managers, the cost of false positives rises, so a higher precision threshold (e.g., 0.8) may be chosen to limit unnecessary notifications.
Scenario 3 – Selective Model Application: Segment analysis reveals that model performance varies across job levels and stock‑option tiers; the organization can apply the model only to segments where it performs well.
Threshold Tuning and Precision‑Recall Trade‑off
By adjusting the probability threshold (e.g., from the default 0.5 to 0.6, 0.8, or selecting the top‑X predictions), practitioners can balance precision against recall to match business needs. Evidently’s class‑separation and precision‑recall tables help visualize these effects.
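The threshold sweep can also be reproduced outside the dashboard. This sketch (again on synthetic stand‑in data, with scikit‑learn) shows how precision tends to rise and recall to fall as the cut‑off moves from the default 0.5 toward 0.8:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in with the tutorial's class balance
X, y = make_classification(n_samples=1470, n_features=35,
                           weights=[0.84], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Evaluate precision/recall at several probability cut-offs
results = {}
for threshold in (0.5, 0.6, 0.8):
    preds = (probs >= threshold).astype(int)
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds)
    results[threshold] = (p, r)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold can only shrink the set of flagged employees, so recall never increases with a stricter cut‑off; the business question is how much recall to trade away.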
Segment‑Level Diagnostics
Classification quality tables map prediction errors to specific feature values (e.g., job level, stock‑option level), allowing the team to understand where each model succeeds or fails and to consider data augmentation or rule‑based overrides for weak segments.
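A lightweight version of this segment view can be built with pandas: group the scored records by a feature of interest and compute a metric per segment. The frame below is hypothetical (random labels and predictions, a made‑up JobLevel column) purely to show the shape of the analysis:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical scored output: true label, model prediction, and a segment column
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "JobLevel": rng.integers(1, 6, 500),   # segments 1..5
    "target": rng.integers(0, 2, 500),
    "prediction": rng.integers(0, 2, 500),
})

# Recall per job level: which segments does the model miss attritions in?
segment_recall = df.groupby("JobLevel").apply(
    lambda g: recall_score(g["target"], g["prediction"], zero_division=0)
)
print(segment_recall)
```

Segments where recall drops sharply are candidates for the rule‑based overrides or extra training data mentioned above.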
Conclusion
Although both models achieve similar ROC‑AUC, the Gradient‑Boosting model generally provides higher recall and better coverage across employee segments, making it the preferred choice for most use‑cases. The tutorial emphasizes the importance of multi‑metric evaluation, threshold selection, and segment‑aware deployment.
References
Dataset: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
Evidently library: https://github.com/evidentlyai/evidently
Jupyter notebook example: https://github.com/evidentlyai/evidently/blob/main/evidently/examples/ibm_hr_attrition_model_validation.ipynb
Original article: https://evidentlyai.com/blog/tutorial-2-model-evaluation-hr-attrition
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.