regression
Regression with XGBoost + Conformal Prediction Intervals
For tabular regression, default to XGBoost as the point estimator and use conformalized quantile regression (CQR) to attach prediction intervals that actually achieve their stated coverage. Point estimates without intervals are not a model — they're a guess. This skill teaches the workflow for shipping a regressor that a downstream system can trust.
When to use this skill
- The target is continuous (price, count, score, time, demand)
- The features are tabular (numbers, categories, dates) — not images, text, or audio
- You have at least a few hundred labeled examples
- Downstream consumers need both a point estimate and an interval
When NOT to use this skill
- Classification (binary / multiclass / multilabel) → see the classification skills
- Time-series with strong temporal structure → use time-series methods, not XGBoost on a flat dataframe
- Truly linear data with < 100 rows → a regularized linear model
(
Ridge,Lasso,ElasticNet) will be hard to beat and gives closed-form coefficient interpretation - You need a fully-Bayesian posterior over predictions (e.g. for decision analysis with explicit utilities) → use PyMC GLMs
Project layout
<project>/
├── data/ # input parquet/csv
├── src/
│ ├── train.py # ibis read → 3 XGBRegressors → conformal cal → MLflow
│ ├── predict.py # reload models + conformal_q, return point + interval
│ └── plots.py # predicted vs actual, residual diagnostics, coverage, SHAP
├── notebooks/
│ └── demo.py # marimo walkthrough
└── mlruns/ # MLflow tracking store (gitignored)
Data access — ibis at the source, pandas at the sklearn boundary
Use ibis (ibis-framework[duckdb]) to read data and compute
summaries; materialize with .execute() exactly once just before
sklearn:
import ibis
table = ibis.duckdb.connect().read_parquet("data/train.parquet")
feature_cols = [c for c in table.columns if c.startswith("feature_")]
target_stats = (
table
.aggregate(
target_mean=table.target.mean(),
target_std=table.target.std(),
n_total=table.count(),
)
.execute()
.iloc[0]
)
data = (
table
.select(*feature_cols, "target")
.execute()
)
X = data[feature_cols]
y = data["target"]
The pipeline — three models, not one
For regression with intervals you fit three XGBoost models on the same training data:
- Point estimator —
objective="reg:squarederror", gives the expected value - Lower quantile —
objective="reg:quantileerror",quantile_alpha=0.05(or 0.10), gives the lower bound - Upper quantile — same, with
quantile_alpha=0.95(or 0.90)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor
def build_xgb_regressor(feature_cols, seed, *, objective="reg:squarederror", quantile_alpha=None):
kwargs = dict(
n_estimators=400, max_depth=4, learning_rate=0.05,
subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0,
objective=objective, random_state=seed, n_jobs=-1,
)
if quantile_alpha is not None:
kwargs["quantile_alpha"] = quantile_alpha
return Pipeline([
("preprocess", ColumnTransformer([("num", StandardScaler(), feature_cols)])),
("clf", XGBRegressor(**kwargs)),
])
xgb_point = build_xgb_regressor(feature_cols, seed=42)
xgb_lower = build_xgb_regressor(feature_cols, seed=42, objective="reg:quantileerror", quantile_alpha=0.05)
xgb_upper = build_xgb_regressor(feature_cols, seed=42, objective="reg:quantileerror", quantile_alpha=0.95)
XGBoost 2.0+ has native quantile regression via reg:quantileerror.
Earlier versions don't — if you're stuck on 1.x, use a different
library (lightgbm has supported it for years) or switch to a quantile
random forest.
The four things that separate this from a tutorial
1. Conformalized quantile regression — intervals that actually cover
Vanilla quantile XGBoost optimizes pinball loss but does not
guarantee coverage. On Friedman1 with quantile_alpha=0.05/0.95
(nominal 90%), raw empirical coverage is typically 70-75% — way
below target. The fix is conformal calibration.
The recipe:
import numpy as np
from sklearn.model_selection import train_test_split
# 1. Split a calibration set off the training data
X_train, X_calib, y_train, y_calib = train_test_split(
X_train_full, y_train_full, test_size=0.2, random_state=42,
)
# 2. Fit quantile models on the (smaller) train set
xgb_lower.fit(X_train, y_train)
xgb_upper.fit(X_train, y_train)
# 3. Compute conformity scores on the calibration set:
# E_i = max(q_low(x_i) - y_i, y_i - q_high(x_i))
# Positive when y_i is OUTSIDE the predicted interval.
cal_low = xgb_lower.predict(X_calib)
cal_high = xgb_upper.predict(X_calib)
conformity = np.maximum(cal_low - y_calib, y_calib - cal_high)
# 4. Find the appropriate quantile of the conformity scores
nominal_coverage = 0.90
n_cal = len(y_calib)
q_level = min(1.0, np.ceil(nominal_coverage * (n_cal + 1)) / n_cal)
conformal_q = float(np.quantile(conformity, q_level))
# 5. At inference time, expand the raw quantile bounds by ±conformal_q
def predict_interval(X_new):
y_low_raw = xgb_lower.predict(X_new)
y_high_raw = xgb_upper.predict(X_new)
return y_low_raw - conformal_q, y_high_raw + conformal_q
This is conformalized quantile regression (Romano, Patterson,
Candes 2019). It guarantees marginal coverage of at least
nominal_coverage on test data drawn from the same distribution as
calibration. The intervals get wider, but they're now honest.
Always log both the raw and conformalized empirical coverage to MLflow so you can see the gap. If raw coverage is already at the nominal level, your data is well-behaved and the correction is small; if it's far off, the correction was load-bearing.
2. Residual diagnostics — what the model is missing
After training, plot the residuals (y_true - y_pred) three ways:
- Residual vs predicted: should be a flat band around 0. A funnel shape = heteroscedasticity (the variance depends on the prediction). A curve = the model is missing non-linear structure.
- Residual histogram + Normal overlay: should look roughly Normal
for squared-error models. Heavy tails suggest you should use
reg:absoluteerror(MAE / L1) orreg:huberinstead. - QQ plot vs Normal: quick visual check for tail behavior.
import matplotlib.pyplot as plt
from scipy import stats
residuals = y_test - y_pred
fig, axes = plt.subplots(1, 3, figsize=(14, 4.5))
axes[0].scatter(y_pred, residuals, alpha=0.5)
axes[0].axhline(0, color="red", ls="--")
axes[1].hist(residuals, bins=40, density=True)
stats.probplot(residuals, dist="norm", plot=axes[2])
If the residuals show structure, the model is leaving signal on the table. Add features, change the loss, or try a more flexible model.
3. Honest metrics — RMSE and MAE in domain units, not just R²
R² ("explained variance") looks great for a wide range of models and hides important information. Always report:
- RMSE (root mean squared error) — same units as the target, penalizes large errors. The benchmark to beat is the irreducible error (the std of the noise). If RMSE ≈ noise std, the model is essentially perfect.
- MAE (mean absolute error) — same units as the target, more interpretable than RMSE, less sensitive to outliers. "On average my prediction is off by X."
- R² — useful as a sanity check (≥ 0?) but don't optimize for it in isolation. R² = 0.95 on a problem with no signal is meaningless.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
mae = float(mean_absolute_error(y_test, y_pred))
r2 = float(r2_score(y_test, y_pred))
4. SHAP for feature importance
Same advice as for binary classification: don't use
feature_importances_, it has biases. Use SHAP's TreeExplainer:
import shap
clf = pipeline.named_steps["clf"] # the XGBRegressor
preprocessor = pipeline.named_steps["preprocess"]
X_test_t = preprocessor.transform(X_test.iloc[:200])
explainer = shap.TreeExplainer(clf)
shap_values = explainer(X_test_t)
shap.summary_plot(shap_values, X_test_t, feature_names=feature_cols)
SHAP also gives you per-prediction explanations (shap.plots.waterfall)
which are essential for any deployed model that affects users.
MLflow logging
Every run logs:
| Kind | What |
|---|---|
params |
data path, n_rows, n_features, target_mean / target_std, seed, lower_quantile, upper_quantile, nominal_coverage, model name, hyperparameters |
metrics |
test_rmse, test_mae, test_r2, irreducible_rmse (when known), rmse_above_irreducible, conformal_q, coverage_raw vs coverage_conformal, interval_width_raw vs interval_width_conformal |
tags |
data hash, target distribution stats |
artifacts |
three models (model_point, model_lower, model_upper), predicted-vs-actual scatter with interval bands, residual diagnostics, interval coverage plot (raw + conformal), SHAP summary, sidecar JSON |
The most important metric is coverage_conformal — if it's not within
~1% of nominal_coverage, something is wrong with your calibration set
(too small? distribution shift?).
Common pitfalls
- Reporting only R². R² of 0.95 can be a model that's useless in practice. Always report RMSE/MAE in domain units alongside.
- Skipping conformal calibration. Vanilla quantile XGBoost intervals undercover. Without CQR your "90% interval" is probably a 70-75% interval. This silently breaks any downstream system that trusts the bounds.
- Calibrating on the test set. The whole point of conformal prediction is that the calibration set is held out from both training and evaluation. Don't peek.
- Heteroscedastic data with squared-error loss. If
residual vs predictedshows a funnel, your variance depends on the prediction. Switch toreg:quantileerror(which doesn't assume constant variance) orreg:huber(robust to outliers). - Trusting
feature_importances_. Use SHAP. - Ignoring the irreducible error. If you know the noise level
(e.g. measurement uncertainty in the target), report
RMSE - noise_std. That's the model's actual error budget. - Treating prediction intervals as confidence intervals. They are not the same. A prediction interval covers the next observation; a confidence interval covers the true mean. PIs are wider.
Worked example
See demo.py (marimo notebook). It generates the Friedman1 non-linear
regression dataset inline (y = 10·sin(π·x₀·x₁) + 20·(x₂−0.5)² + 10·x₃ + 5·x₄ + ε),
fits all three XGBoost models, applies conformal calibration, plots
residual diagnostics + interval coverage, and includes a baseline
LinearRegression cell so you can see why a linear model fails on
non-linear structure.
More from brojonat/llmsrules
ibis-data
Use Ibis for database-agnostic data access in Python. Use when writing data queries, connecting to databases (DuckDB, PostgreSQL, SQLite), or building portable data pipelines that should work across backends.
13go-service
Build Go microservices with stdlib HTTP handlers, sqlc, urfave/cli, and slog. Use when creating or modifying a Go HTTP server, adding routes, middleware, database queries, or CLI commands.
13temporal-go
Build Temporal workflow applications in Go. Use when creating or modifying Temporal workflows, activities, workers, clients, signals, queries, updates, retry policies, saga patterns, or writing Temporal tests.
13parquet-analysis
Analyze parquet files using Python and Ibis. Use when the user wants to explore, transform, or analyze parquet data files, perform aggregations, joins, or export results. Works with local parquet files and provides database-agnostic data operations.
12ducklake
Work with DuckLake, an open lakehouse format built on DuckDB. Use when creating or querying DuckLake tables, managing snapshots, time travel, schema evolution, partitioning, or lakehouse maintenance operations.
12temporal-python
Build Temporal applications in Python using the temporalio SDK. Use when creating workflows, activities, workers, clients, signals, queries, updates, child workflows, timers, retry policies, saga/compensation patterns, testing, or any durable execution pattern in Python.
12