Employee Turnover Prediction

Overview

Turnover prediction uses classification models (logistic regression, random forest, XGBoost) to estimate the probability an employee will leave within a defined period (typically 6-12 months). Features include tenure, compensation, performance, promotion history, and engagement signals.

When to Use

Trigger conditions:

Identifying employees at high risk of voluntary departure
Quantifying which factors drive turnover for targeted interventions
Prioritizing retention budgets toward highest-impact employees

When NOT to use:

For involuntary termination planning (different process and ethics)
When headcount is < 200 (insufficient data for reliable modeling)

Algorithm

IRON LAW: Turnover Models Predict RISK, Not Certainty
A predicted 80% turnover probability means "employees with similar
profiles historically left 80% of the time." It does NOT mean this
specific employee WILL leave. Never use model outputs as the sole basis
for employment decisions; doing so creates legal and ethical liability.

Phase 1: Input Validation

Collect: employee demographics, tenure, compensation (relative to market), last promotion date, performance ratings, manager change history, engagement survey scores, commute distance.
Outcome: voluntary departure within N months.
Gate: minimum 200 turnover events; all features available before the departure date.
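
The Phase 1 gate can be sketched as a small validation function. This is an illustration under assumptions: the record keys (employee_id, feature_snapshot_date, departure_date) are hypothetical names, and "available before departure" is interpreted as a feature snapshot dated strictly before the departure date.

```python
from datetime import date

MIN_TURNOVER_EVENTS = 200  # gate stated in Phase 1


def validate_inputs(records):
    """Return (ok, problems) for a list of employee record dicts.

    Hypothetical keys per record:
      'employee_id', 'feature_snapshot_date' (date),
      'departure_date' (date, or None if still employed).
    """
    problems = []
    events = [r for r in records if r["departure_date"] is not None]

    if len(events) < MIN_TURNOVER_EVENTS:
        problems.append(
            f"only {len(events)} turnover events; need {MIN_TURNOVER_EVENTS}"
        )

    for r in events:
        # Features must have been observed before the employee left.
        if r["feature_snapshot_date"] >= r["departure_date"]:
            problems.append(
                f"{r['employee_id']}: features not observed before departure"
            )

    return (not problems, problems)
```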

Phase 2: Core Algorithm

Feature engineering: tenure buckets, comp ratio (salary/market median), time since last promotion, manager tenure, engagement trend
Handle class imbalance: turnover rate typically 10-20%. Use SMOTE or class weights.
Train: logistic regression (interpretable, HR-preferred) or gradient-boosted decision trees (GBDT, e.g. XGBoost; often higher accuracy)
Output: probability of departure + top risk factors per employee
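
A minimal sketch of the feature engineering and class-weight steps, using only the standard library. The field names, the tenure cut points, and the two-survey engagement trend are assumptions for illustration, and only a subset of the listed features is shown. The weights follow the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter
from datetime import date


def tenure_bucket(tenure_years):
    """Map tenure in years to a coarse bucket (cut points are illustrative)."""
    if tenure_years < 1:
        return "<1y"
    if tenure_years < 3:
        return "1-3y"
    if tenure_years < 5:
        return "3-5y"
    return "5y+"


def engineer_features(emp, as_of):
    """Build model features for one employee dict (hypothetical field names)."""
    promo = emp["last_promotion"]
    return {
        "tenure_bucket": tenure_bucket(emp["tenure_years"]),
        # Comp ratio: salary divided by the market median for the role.
        "comp_ratio": emp["salary"] / emp["market_median"],
        "months_since_promotion": (as_of.year - promo.year) * 12
        + (as_of.month - promo.month),
        # Engagement trend: latest survey score minus the previous one.
        "engagement_trend": emp["engagement_scores"][-1]
        - emp["engagement_scores"][-2],
    }


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}
```

With a 15% turnover rate, the leaver class receives a weight of about 3.3 and the stayer class about 0.59, so errors on leavers count proportionally more during training.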

Phase 3: Verification

Evaluate: AUC and precision-recall at actionable thresholds.
Backtest: did the model correctly flag the employees who left in the past 6 months?
Gate: AUC > 0.70 and precision > 50% in the top decile.
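
The verification gate can be sketched without any ML library. The version below computes AUC as the fraction of leaver/stayer pairs in which the leaver has the higher score (ties count as half) and precision among the top 10% of scores. The pairwise AUC is quadratic in the number of employees, which is acceptable for a sketch but not for large datasets; the function names are illustrative.

```python
def auc(y_true, scores):
    """Probability that a random leaver is scored above a random stayer."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one leaver and one stayer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_top_decile(y_true, scores):
    """Share of actual leavers among the highest-scored 10% of employees."""
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    k = max(1, len(ranked) // 10)
    return sum(y for _, y in ranked[:k]) / k


def passes_gate(y_true, scores, min_auc=0.70, min_precision=0.50):
    """Apply the Phase 3 gate: AUC > 0.70 and top-decile precision > 50%."""
    return (
        auc(y_true, scores) > min_auc
        and precision_at_top_decile(y_true, scores) > min_precision
    )
```

For a backtest, y_true would hold the actual outcomes for the past 6 months and scores the probabilities the model produced before that window began.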

Phase 4: Output

Return risk scores with driver analysis.

Output Format

{
  "risk_scores": [{"employee_id": "E123", "turnover_prob": 0.72, "risk_tier": "high", "top_drivers": ["low_comp_ratio", "no_promotion_3yr"]}],
  "metadata": {"model": "xgboost", "auc": 0.78, "prediction_window_months": 12}
}
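
A sketch of how scored employees might be assembled into this structure. The "high" cutoff of 0.6 mirrors the sample in the Examples section; the "medium" cutoff of 0.3 and the function names are assumptions for illustration.

```python
import json


def risk_tier(prob, high=0.6, medium=0.3):
    """Map a probability to a tier (medium cutoff is an assumed value)."""
    if prob > high:
        return "high"
    if prob > medium:
        return "medium"
    return "low"


def build_output(scored, model_name, auc, window_months):
    """scored: list of (employee_id, probability, drivers) tuples."""
    return {
        "risk_scores": [
            {
                "employee_id": emp_id,
                "turnover_prob": round(prob, 2),
                "risk_tier": risk_tier(prob),
                "top_drivers": list(drivers),
            }
            for emp_id, prob, drivers in scored
        ],
        "metadata": {
            "model": model_name,
            "auc": auc,
            "prediction_window_months": window_months,
        },
    }


if __name__ == "__main__":
    result = build_output(
        [("E123", 0.72, ["low_comp_ratio", "no_promotion_3yr"])],
        model_name="xgboost",
        auc=0.78,
        window_months=12,
    )
    print(json.dumps(result, indent=2))
```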

Examples

Sample I/O

Input: Employee with 4yr tenure, comp ratio 0.85, no promotion in 3yr, declining engagement score
Expected: High risk (>0.6). Top drivers: below-market compensation, stalled career progression.

Edge Cases

Input	Expected	Why
New hire (< 6 months)	Unreliable prediction	Insufficient behavioral data
Top performer, high comp	Still could leave	Non-financial factors (manager, culture) matter
Post-reorg period	Model drift likely	Unusual conditions distort patterns

Gotchas

Survivorship bias: Training data only includes people who were hired and stayed long enough to be observed, so early-stage leavers may be underrepresented.
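Feature leakage and off-limits signals: "Started job searching" or "updated LinkedIn" are strong predictors, but they sit so close to the outcome that they add little actionable lead time, and collecting them is ethically and legally problematic. Stick to internal HR data.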
Self-fulfilling prophecy: If managers treat "high risk" employees differently (less investment, fewer projects), the model prediction becomes self-fulfilling.
Legal constraints: Using protected attributes (age, gender, ethnicity) directly or via proxies may violate employment law. Audit for disparate impact.
Retention intervention timing: Identifying risk is only useful if HR acts. Build the model into a retention workflow with specific intervention triggers.
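
One possible first-pass screen for the disparate impact audit mentioned above compares the share of employees flagged high-risk across groups. The sketch below borrows the four-fifths (0.8) ratio as a review trigger; applying that threshold to risk flags is an assumption here, not a legal test, and the function names are illustrative.

```python
from collections import defaultdict


def flag_rates(records):
    """records: iterable of (group, flagged) pairs; returns flag rate per group."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for group, is_flagged in records:
        totals[group] += 1
        flagged[group] += int(bool(is_flagged))
    return {g: flagged[g] / totals[g] for g in totals}


def impact_ratio(rates):
    """Lowest group flag rate divided by the highest (1.0 means parity)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody flagged in any group
    return min(rates.values()) / highest


def needs_review(rates, threshold=0.8):
    """True when the ratio falls below the assumed four-fifths threshold."""
    return impact_ratio(rates) < threshold
```

A ratio below the threshold does not by itself establish a violation; it indicates that the features driving the gap should be examined for proxies of protected attributes.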

References

For feature engineering from HR data, see references/hr-features.md
For ethical AI in HR applications, see references/ethical-hr-ai.md
