# Walk-Forward Validation
Walk-forward validation framework for trading strategies and ML models. Standard cross-validation (k-fold, random splits) fails catastrophically for financial time series because it introduces lookahead bias and ignores autocorrelation. This skill covers proper time-series validation techniques including rolling and expanding windows, purged cross-validation, combinatorial purged cross-validation (CPCV), and overfit detection metrics.
## Why Standard Cross-Validation Fails
Standard k-fold CV assumes data points are independent and identically distributed (IID). Financial time series violate both assumptions:
- Lookahead bias — Random splits let the model train on future data and predict past data, artificially inflating performance.
- Autocorrelation — Adjacent observations are correlated. A random split that puts Monday in test and Tuesday in train leaks information.
- Regime dependence — Markets shift between regimes. A model trained on a bull market and tested on a bull market tells you nothing about bear market performance.
- Label overlap — If labels are computed over windows (e.g., 24h forward return), adjacent train/test samples share label computation periods, leaking information.
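The lookahead problem is easy to demonstrate. A minimal sketch (plain NumPy, illustrative only): with a shuffled k-fold style split, most training samples land after the earliest test sample, so the model trains on the test set's future.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # pretend these are 100 consecutive bars
shuffled = rng.permutation(n)

# Shuffled k-fold style assignment: first "fold" becomes the test set
test_idx = np.sort(shuffled[:20])
train_idx = np.sort(shuffled[20:])

# Training samples that occur AFTER the first test sample in time
leaked = int(np.sum(train_idx > test_idx.min()))
print(f"{leaked} of {train_idx.size} training samples lie in the test set's future")
```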
## Walk-Forward Framework

### Rolling Window (Fixed Train Size)

The train window has a fixed size and slides forward in time. This is preferred when you believe older data is less relevant (common in crypto).

```
Window 1: [===TRAIN===][=TEST=]
Window 2:        [===TRAIN===][=TEST=]
Window 3:               [===TRAIN===][=TEST=]
```

Parameters:
- `train_size`: Number of bars/days in the training window
- `test_size`: Number of bars/days in the test window
- `step_size`: How far to advance between folds (often equals `test_size`)
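Using the parameter names above, rolling fold generation can be sketched as follows (the function name is illustrative, not part of this skill's scripts):

```python
def rolling_windows(n_samples: int, train_size: int, test_size: int, step_size: int):
    """Yield (train_indices, test_indices) pairs for a rolling walk-forward."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step_size  # slide both windows forward

folds = list(rolling_windows(n_samples=100, train_size=50, test_size=10, step_size=10))
# fold 0 trains on bars 0-49 and tests on bars 50-59; 5 folds fit in 100 bars
```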
### Expanding Window (Growing Train)

The train window starts at the beginning and expands forward. This uses all available historical data, which helps when data is scarce.

```
Window 1: [==TRAIN==][=TEST=]
Window 2: [====TRAIN====][=TEST=]
Window 3: [======TRAIN======][=TEST=]
```

Parameters:
- `min_train_size`: Minimum training samples before first fold
- `test_size`: Fixed test window size
- `step_size`: How far to advance between folds
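A matching sketch for the expanding variant (again illustrative; only the train start is pinned to bar 0):

```python
def expanding_windows(n_samples: int, min_train_size: int, test_size: int, step_size: int):
    """Yield (train_indices, test_indices) pairs for an expanding walk-forward."""
    train_end = min_train_size
    while train_end + test_size <= n_samples:
        yield list(range(train_end)), list(range(train_end, train_end + test_size))
        train_end += step_size  # train window grows, test window slides

folds = list(expanding_windows(n_samples=100, min_train_size=40, test_size=10, step_size=20))
# train sizes grow 40 -> 60 -> 80 while each test window stays 10 bars
```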
### Choosing Between Them
| Factor | Rolling | Expanding |
|---|---|---|
| Data recency | Prioritizes recent data | Uses all history |
| Regime changes | Better adapts to new regimes | May dilute recent regime |
| Sample size | Fixed, may be small | Grows over time |
| Crypto preference | Preferred for < 6mo horizons | Better for regime-stable models |
## Purging and Embargo

### Purging
Remove training samples whose labels overlap with the test set's time range. If a label is computed as the 24h forward return starting at time t, any training sample where t + 24h extends into the test period must be purged.
```python
def purge_train_indices(
    train_idx: list[int],
    test_start: int,
    label_horizon: int,
    timestamps: list[int],
) -> list[int]:
    """Remove train samples whose label windows overlap test period."""
    test_start_time = timestamps[test_start]
    return [
        i for i in train_idx
        if timestamps[i] + label_horizon < test_start_time
    ]
```
### Embargo

Add a buffer gap between the end of training and start of testing to account for serial correlation that purging alone does not eliminate.

```
[===TRAIN===][--EMBARGO--][=TEST=]
```
Typical embargo sizes:
- 1-minute bars: 60–240 bars (1–4 hours)
- 5-minute bars: 12–48 bars (1–4 hours)
- Hourly bars: 6–24 bars (6–24 hours)
- Daily bars: 2–5 bars (2–5 days)
- Crypto rule of thumb: Embargo >= 2x the label computation horizon
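Applied to bar indices, an embargo is just a second filter on the training set after purging. A minimal sketch (`apply_embargo` is a hypothetical helper, not from this skill's scripts):

```python
def apply_embargo(train_idx: list[int], test_start: int, embargo_bars: int) -> list[int]:
    """Drop training samples within `embargo_bars` bars of the test start."""
    return [i for i in train_idx if i < test_start - embargo_bars]

train = list(range(100))
kept = apply_embargo(train, test_start=100, embargo_bars=10)
# bars 90-99 are embargoed; training now ends at bar 89
```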
## Combinatorial Purged Cross-Validation (CPCV)
CPCV (Lopez de Prado, 2018) generates all possible train/test combinations from N groups while maintaining temporal ordering. This produces far more test paths than standard walk-forward, enabling statistical tests for overfitting.
Key properties:
- Splits data into `N` contiguous groups
- For each combination of `k` test groups, the remaining `N-k` groups form the training set
- Applies purging and embargo at each train/test boundary
- Produces `C(N, k)` backtest paths (e.g., `N=6, k=2` gives 15 paths)
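The path count follows directly from `itertools.combinations`; a sketch of the group bookkeeping (purging and embargo at the boundaries omitted for brevity):

```python
from itertools import combinations

N, k = 6, 2
groups = list(range(N))

splits = []
for test_groups in combinations(groups, k):
    train_groups = [g for g in groups if g not in test_groups]
    splits.append((train_groups, list(test_groups)))

print(len(splits))  # C(6, 2) = 15 train/test combinations
```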
See `references/methodology.md` for the full CPCV algorithm and formulas.
## Overfit Detection

### Deflated Sharpe Ratio (DSR)
The observed Sharpe ratio must be adjusted for:
- Number of strategies tested (multiple testing)
- Non-normality of returns (skewness, kurtosis)
- Length of the backtest
```python
import numpy as np
from scipy.stats import norm


def deflated_sharpe_ratio(
    observed_sr: float,
    num_trials: int,
    backtest_length: int,
    skewness: float = 0.0,
    kurtosis: float = 3.0,
    trials_sr_var: float = 1.0,
) -> float:
    """Compute the probability that the observed SR > 0 after deflation.

    Args:
        observed_sr: Sharpe ratio of the selected strategy, measured at
            the frequency of the return observations (not annualized).
        num_trials: Number of strategies tested (including discarded ones).
        backtest_length: Number of return observations.
        skewness: Skewness of returns.
        kurtosis: Kurtosis of returns (3.0 for normally distributed returns).
        trials_sr_var: Variance of the SR estimates across the trials tested.

    Returns:
        Probability that the true SR is greater than 0.
    """
    # Standard error of the SR estimator under non-normal returns
    sr_std = np.sqrt(
        (1 - skewness * observed_sr + (kurtosis - 1) / 4 * observed_sr**2)
        / (backtest_length - 1)
    )
    # Expected maximum SR across num_trials under the null of no skill
    # (Euler-Mascheroni approximation), scaled by the trial SR dispersion
    euler_mascheroni = 0.5772156649
    expected_max_sr = np.sqrt(trials_sr_var) * (
        (1 - euler_mascheroni) * norm.ppf(1 - 1 / num_trials)
        + euler_mascheroni * norm.ppf(1 - 1 / (num_trials * np.e))
    )
    return norm.cdf((observed_sr - expected_max_sr) / sr_std)
```
A DSR below 0.95 means the strategy's Sharpe ratio is not statistically distinguishable from zero at the 95% confidence level once selection across the trials tested is accounted for — the observed performance is plausibly an artifact of overfitting.
### Probability of Backtest Overfitting (PBO)
PBO uses CPCV to measure the fraction of backtest paths where the in-sample optimal strategy underperforms the median out-of-sample. A PBO above 0.50 indicates more-likely-than-not overfitting.
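With per-path in-sample and out-of-sample performance matrices in hand, the PBO estimate reduces to a few lines. A sketch on synthetic, skill-free data (random Sharpe ratios, so PBO should hover near 0.5):

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, n_strategies = 15, 8

# Synthetic Sharpe ratios per CPCV path: no strategy has genuine skill
is_perf = rng.normal(size=(n_paths, n_strategies))   # in-sample
oos_perf = rng.normal(size=(n_paths, n_strategies))  # out-of-sample

best = is_perf.argmax(axis=1)  # IS-optimal strategy on each path
oos_of_best = oos_perf[np.arange(n_paths), best]

# Fraction of paths where the IS winner underperforms the OOS median
pbo = float(np.mean(oos_of_best < np.median(oos_perf, axis=1)))
```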
See `references/overfit_detection.md` for complete derivations and implementation details.
## Crypto-Specific Considerations
- Shorter windows: Crypto regimes change faster than equities. A 90-day rolling window may be more appropriate than 252 days.
- 24/7 markets: No weekends or holidays to account for, but funding rate resets (every 8h on perps) create microstructure effects.
- Survivorship bias: Many tokens delist. Validation must include delisted tokens or at minimum acknowledge this limitation.
- Liquidity regime shifts: A token's liquidity profile can change dramatically (new CEX listing, liquidity mining end). Train/test splits should ideally not straddle major liquidity events.
- Data availability: Many tokens have < 1 year of data. Expanding windows with a small `min_train_size` may be necessary.
## Practical Window Sizes for Crypto
| Strategy Timeframe | Train Window | Test Window | Embargo |
|---|---|---|---|
| Scalping (1-5min) | 3-7 days | 1 day | 2-4 hours |
| Intraday (15min-1h) | 14-30 days | 3-7 days | 12-24 hours |
| Swing (4h-daily) | 30-90 days | 7-14 days | 2-5 days |
| Position (daily-weekly) | 90-180 days | 30 days | 5-10 days |
## Quick Start
```python
from walk_forward import WalkForwardValidator, WalkForwardConfig

config = WalkForwardConfig(
    train_size=90,
    test_size=14,
    step_size=14,
    window_type="rolling",
    embargo_size=3,
    purge_horizon=1,
)

validator = WalkForwardValidator(config)
for fold in validator.split(price_data):
    model.fit(fold.train_X, fold.train_y)
    predictions = model.predict(fold.test_X)
    fold.record_performance(predictions, fold.test_y)

results = validator.aggregate_results()
print(f"OOS Sharpe: {results.oos_sharpe:.3f}")
print(f"Train/Test Sharpe ratio: {results.sharpe_ratio_ratio:.2f}")
```
## Files

### References
- `references/methodology.md` — Walk-forward theory, window types, purging, embargo, CPCV algorithm with formulas
- `references/overfit_detection.md` — Deflated Sharpe ratio, probability of backtest overfitting, multiple testing corrections
- `references/practical_guide.md` — Window size selection for crypto, regime considerations, common validation mistakes
### Scripts
- `scripts/walk_forward.py` — Walk-forward validation engine with rolling and expanding windows; `--demo` mode with synthetic data
- `scripts/overfit_detector.py` — Deflated Sharpe ratio and PBO computation; `--demo` mode with synthetic backtest results