Statistical rigor built in

How We Test Strategy Credibility

Most backtesting tools let you overfit without knowing it. Our 6-layer validation pipeline applies institutional-grade statistical rigor to every optimization.

1. 3-Way Data Split
2. Walk-Forward Validation
3. Deflated Sharpe Ratio
4. Monte Carlo Significance
5. Realistic Commission Models
6. Parameter Robustness
1

3-Way Data Split

The foundation of honest backtesting is data isolation. We split every ticker's historical data into three non-overlapping segments:

Historical price data, split sequentially: Train (50%, optimizer explores here), Validate (25%, best candidate selected), Holdback (25%, never seen: true OOS).
  • Train (50%) — the optimizer explores parameter combinations here.
  • Validate (25%) — used to select the best candidate from the training results.
  • Holdback (25%, one-year minimum) — data the strategy NEVER sees during optimization. Holdback returns are the closest proxy for live performance.

Why this matters: Without a holdback period, every backtest is in-sample. You are evaluating performance on the same data used to choose the strategy. This is the most common source of backtest overfitting, and most retail tools do not guard against it.
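As a sketch, the chronological split described above fits in a few lines. The function name and the 50/25/25 proportions follow the description; the production pipeline's actual implementation is not shown here:

```python
def three_way_split(prices, train=0.50, validate=0.25):
    """Chronological 3-way split: first 50% for optimization, next 25%
    for candidate selection, the rest held back as true out-of-sample.
    Illustrative sketch; proportions are parameters, not fixed."""
    n = len(prices)
    i = int(n * train)
    j = int(n * (train + validate))
    return prices[:i], prices[i:j], prices[j:]
```

The split is sequential, never random: shuffling price data before splitting would leak future information into the training segment and defeat the purpose of a holdback.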

2

Walk-Forward Validation

Within the training period, we run a 5-window rolling train/test protocol. Each window trains on a portion of the data and tests on the immediately following out-of-sample segment.

  • Strategies that fail walk-forward are eliminated automatically before reaching the validation phase.
  • This tests whether the strategy adapts to shifting market regimes — trending, mean-reverting, and volatile environments.
  • Only strategies that perform consistently across all five windows advance.

Why this matters: A strategy that works brilliantly in one time window but fails in others is likely overfit to a specific market regime. Walk-forward validation catches this before you risk capital.

Pardo, R. (2008). "The Evaluation and Optimization of Trading Strategies." Wiley.
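The rolling protocol reduces to index arithmetic. A minimal sketch assuming equal-width slices (the production windowing scheme may differ):

```python
def walk_forward_windows(n_obs, n_windows=5):
    """Cut [0, n_obs) into n_windows + 1 equal slices; window k trains
    on slice k and tests on slice k + 1, so each test segment
    immediately follows its training segment. Simplified sketch."""
    size = n_obs // (n_windows + 1)
    return [((k * size, (k + 1) * size), ((k + 1) * size, (k + 2) * size))
            for k in range(n_windows)]
```

Each window's test range begins exactly where its training range ends, so every test segment is out-of-sample relative to the parameters fitted in that window.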

3

Deflated Sharpe Ratio

When an optimizer tests hundreds or thousands of parameter combinations, the best result will look impressive by pure chance. The Deflated Sharpe Ratio (DSR) corrects for this multiple-testing bias.

  • Uses the number of independent trials to adjust the significance threshold for the Sharpe Ratio.
  • Accounts for non-normality (skewness, kurtosis) of return distributions.
  • A Sharpe of 2.0 from 1,000 trials is statistically very different from a Sharpe of 2.0 from a single trial. DSR quantifies this difference.

Why this matters: Without this correction, optimizers will always find something that looks good — it is a mathematical certainty when the trial count is large enough. DSR tells you whether your best result is genuine or a statistical artifact.

Bailey, D.H. & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality." Journal of Portfolio Management, 40(5), 94–107.
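A compact rendering of the DSR calculation from Bailey and López de Prado (2014), using only the standard library. The default null variance of the Sharpe estimates is a simplifying assumption; the paper derives it from the cross-section of trials:

```python
import math
from statistics import NormalDist

def deflated_sharpe(sr, n_trials, n_obs, skew=0.0, kurt=3.0, var_sr=None):
    """Probability that the observed per-period Sharpe ratio exceeds the
    expected maximum Sharpe of n_trials random strategies.
    n_trials must be at least 2. Sketch under simplifying assumptions."""
    nd = NormalDist()
    gamma = 0.5772156649015329  # Euler-Mascheroni constant
    if var_sr is None:
        var_sr = 1.0 / n_obs  # simplified null variance of SR estimates
    # Expected maximum Sharpe under the null of no skill
    sr0 = math.sqrt(var_sr) * ((1 - gamma) * nd.inv_cdf(1 - 1 / n_trials)
                               + gamma * nd.inv_cdf(1 - 1 / (n_trials * math.e)))
    # Probabilistic Sharpe ratio against that threshold, adjusted for
    # skewness and kurtosis of the return distribution
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return nd.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)
```

Note how the same observed Sharpe deflates as the trial count grows: the more combinations the optimizer tried, the higher the bar the winner must clear.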

4

Monte Carlo Significance

We use a permutation test to determine whether the strategy's returns are statistically distinguishable from random chance.

  • Shuffles daily portfolio returns using block bootstrap to destroy the timing signal while preserving serial correlation structure.
  • Runs 1,000+ permutations per test to build a null distribution.
  • The strategy must achieve a p-value below 0.05 — meaning there is less than a 5% probability that random reordering would produce equal or better returns.

Why this matters: Even after DSR correction, a strategy could have a decent Sharpe purely from favorable return clustering. The permutation test asks a direct question: does the specific ordering of trades matter, or would any random ordering do just as well?

White, H. (2000). "A Reality Check for Data Snooping." Econometrica, 68(5), 1097–1126.
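The idea can be illustrated by shuffling a position series in contiguous blocks against fixed market returns, which destroys the timing signal while preserving short-range structure. This is a simplified stand-in for the production test, not its actual implementation:

```python
import random

def permutation_pvalue(market_returns, positions, n_perms=1000, block=5, seed=42):
    """Block-permutation test: does the timing of positions matter?
    Shuffles the position series in blocks and counts how often a
    random timing matches or beats the real one. Illustrative sketch."""
    rng = random.Random(seed)
    observed = sum(p * r for p, r in zip(positions, market_returns))
    blocks = [positions[i:i + block] for i in range(0, len(positions), block)]
    beat = 0
    for _ in range(n_perms):
        rng.shuffle(blocks)
        shuffled = [p for b in blocks for p in b]
        if sum(p * r for p, r in zip(shuffled, market_returns)) >= observed:
            beat += 1
    return (beat + 1) / (n_perms + 1)  # add-one correction avoids p = 0
```

A strategy with genuine timing skill produces a small p-value; a strategy whose returns any random ordering could match does not.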

5

Realistic Commission Models

A strategy that looks profitable with zero transaction costs often is not. We apply broker-specific commission schedules to every backtest.

  • 12 broker presets: IBKR Pro Tiered, IBKR Pro Fixed, Robinhood, Schwab, Fidelity, E*TRADE, TD Ameritrade, Webull, Firstrade, TradeStation, Tradier, and Alpaca.
  • Per-share rates with minimum and maximum per-order caps.
  • SEC Section 31 fees and FINRA TAF regulatory fees applied to every sell order.

Why this matters: Frequent-trading strategies are particularly sensitive to commissions. A strategy with 200 round-trips per year can lose 2–5% of its returns to transaction costs alone. If your backtest ignores this, its equity curve is fiction.
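A generic per-share schedule with caps and sell-side regulatory fees might look like the following. Every rate here is an illustrative placeholder, not any specific broker's actual schedule, and regulatory fee rates change periodically:

```python
def commission(shares, price, *, per_share=0.005, min_fee=1.00, max_pct=0.01,
               is_sell=False, sec_rate=0.0000278, taf_rate=0.000166, taf_cap=8.30):
    """Hypothetical per-share commission with minimum and maximum caps,
    plus SEC Section 31 and FINRA TAF fees on sells. All rates are
    illustrative placeholders, not a real broker's schedule."""
    fee = max(min_fee, shares * per_share)
    fee = min(fee, max_pct * shares * price)    # cap at 1% of trade value
    if is_sell:
        fee += sec_rate * shares * price        # SEC fee on sale notional
        fee += min(taf_rate * shares, taf_cap)  # TAF per share, capped
    return round(fee, 2)
```

Applying a function like this to every fill, rather than a flat per-trade guess, is what makes the equity curve of a high-turnover strategy honest.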

6

Parameter Robustness

After optimization selects the best parameters, we perturb each one to see if performance degrades gracefully or collapses.

  • Each optimized parameter is varied ±20% from its selected value.
  • If performance collapses outside a narrow range, the strategy is flagged as curve-fitted.
  • Robust strategies exhibit a performance plateau — they work across a range of nearby parameter values, not just one magic setting.

Why this matters: A strategy that only works at exactly RSI(14) with a 2.1 standard deviation Bollinger Band is almost certainly overfit. Real market edges are broad enough to survive minor parameter variation.
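The perturbation check reduces to a small loop. Here `backtest` is a hypothetical stand-in for whatever scoring function the pipeline uses (e.g. one that returns a Sharpe ratio for a parameter set); the interface is assumed for illustration:

```python
def robustness_check(backtest, params, tolerance=0.5, step=0.2):
    """Perturb each optimized parameter ±20% (one at a time) and flag
    any parameter whose perturbation drops the score by more than
    `tolerance` as a fraction of the baseline. Illustrative sketch."""
    base = backtest(params)
    flagged = {}
    for name, value in params.items():
        for direction in (-step, +step):
            perturbed = dict(params, **{name: value * (1 + direction)})
            if base > 0 and backtest(perturbed) < base * (1 - tolerance):
                flagged.setdefault(name, []).append(direction)
    return flagged  # empty dict means a performance plateau
```

An empty result is the plateau described above; any flagged parameter marks a cliff edge where the strategy's performance depends on one magic setting.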

Ready to test your strategies?

Run your first optimization with full credibility testing. Free to start, no credit card required.

alphactor.ai provides AI-powered stock research tools for informational and educational purposes only. We are not a registered investment advisor. Nothing on this site constitutes financial, investment, or trading advice. Past performance does not guarantee future results.