How We Test Strategy Credibility
Most backtesting tools let you overfit without knowing it. Our 10-layer validation pipeline applies institutional-grade statistical rigor to every optimization.
3-Way Data Split
The foundation of honest backtesting is data isolation. We split every ticker's historical data into three non-overlapping segments:
Historical Price Data
- Train (50%) — the optimizer explores parameter combinations here.
- Validate (25%) — used to select the best candidate from the training results.
- Holdback (1-year minimum) — data the strategy NEVER sees during optimization. Holdback returns are the closest proxy for live performance.
Why this matters: Without a holdback period, every backtest is in-sample. You are evaluating performance on the same data used to choose the strategy. This is the most common source of backtest overfitting, and most retail tools do not guard against it.
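The split described above can be sketched in a few lines. This is a minimal illustration, not the production splitter: the function name `three_way_split`, the 252-bar year, and the rule that train and validate share the non-holdback data 2:1 are all assumptions made to reconcile the 50% / 25% / 1-year-minimum figures.

```python
TRADING_DAYS_PER_YEAR = 252  # assumption: daily bars, roughly 252 per year


def three_way_split(bars, min_holdback=TRADING_DAYS_PER_YEAR):
    """Return (train, validate, holdback) as contiguous, non-overlapping slices."""
    n = len(bars)
    # Holdback is the most recent quarter of history, but never under one year.
    holdback_len = max(min_holdback, n // 4)
    if holdback_len >= n:
        raise ValueError("not enough history to reserve a holdback segment")
    remaining = n - holdback_len
    # Train and validate share the rest 2:1, i.e. 50% / 25% when holdback is 25%.
    train_len = (remaining * 2) // 3
    return bars[:train_len], bars[train_len:remaining], bars[remaining:]
```

Because the segments are taken in chronological order, the holdback is always the most recent data, so nothing the optimizer touches can postdate it.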
Walk-Forward Validation
Within the training period, we run a 5-window rolling train/test protocol. Each window trains on a portion of the data and tests on the immediately following out-of-sample segment.
- Strategies that fail walk-forward are eliminated automatically before reaching the validation phase.
- This tests whether the strategy adapts to shifting market regimes — trending, mean-reverting, and volatile environments.
- Only strategies that perform consistently across all five windows advance.
Why this matters: A strategy that works brilliantly in one time window but fails in others is likely overfit to a specific market regime. Walk-forward validation catches this before you risk capital.
Pardo, R. (2008). "The Evaluation and Optimization of Trading Strategies." Wiley.
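As a rough sketch of the rolling protocol, the helper below generates five train/test index windows and applies an all-windows-must-pass rule. The window sizing (`train_frac`), the function names, and the `min_score` threshold are hypothetical; the source only states that there are five windows and that each test segment immediately follows its training segment.

```python
def walk_forward_windows(n_bars, n_windows=5, train_frac=0.5):
    """Return (train_start, train_end, test_end) triples; each test starts at train_end."""
    train_len = int(n_bars * train_frac)
    test_len = (n_bars - train_len) // n_windows
    if train_len < 1 or test_len < 1:
        raise ValueError("not enough bars for the requested windows")
    windows = []
    for i in range(n_windows):
        train_start = i * test_len  # roll the whole window forward by one test segment
        train_end = train_start + train_len
        windows.append((train_start, train_end, train_end + test_len))
    return windows


def passes_walk_forward(test_scores, min_score=0.0):
    """A strategy advances only if every out-of-sample window clears the bar."""
    return all(score > min_score for score in test_scores)
```

For 1,000 bars this yields windows that train on bars 0-499, 100-599, and so on, each tested on the next 100 bars.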
Deflated Sharpe Ratio
When an optimizer tests hundreds or thousands of parameter combinations, the best result will look impressive by pure chance. The Deflated Sharpe Ratio (DSR) corrects for this multiple-testing bias.
- Uses the number of independent trials to adjust the significance threshold for the Sharpe Ratio.
- Accounts for non-normality (skewness, kurtosis) of return distributions.
- A Sharpe of 2.0 from 1,000 trials is statistically very different from a Sharpe of 2.0 from a single trial. DSR quantifies this difference.
Why this matters: Without this correction, optimizers will always find something that looks good — it is a mathematical certainty when the trial count is large enough. DSR tells you whether your best result is genuine or a statistical artifact.
Bailey, D.H. & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality." Journal of Portfolio Management, 40(5), 94–107.
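The correction can be illustrated with the closed-form expressions from Bailey & López de Prado (2014). This is a sketch under stated assumptions: the Sharpe ratio is per-period (not annualized), kurtosis is the raw value (3 for a normal distribution), and the function and parameter names are illustrative rather than the pipeline's own.

```python
import math
from statistics import NormalDist

_EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, trials_sharpe_var):
    """Expected best Sharpe among n_trials when every trial has zero true edge."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection, so nothing to deflate
    z1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(trials_sharpe_var) * ((1.0 - _EULER_GAMMA) * z1 + _EULER_GAMMA * z2)


def deflated_sharpe_ratio(sharpe, n_obs, skew, kurtosis, n_trials, trials_sharpe_var):
    """Probability that the observed Sharpe exceeds the luck-adjusted benchmark."""
    sr0 = expected_max_sharpe(n_trials, trials_sharpe_var)
    # Non-normality widens the Sharpe estimator's standard error.
    denom = math.sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
    z = (sharpe - sr0) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)
```

With a daily Sharpe of 0.1 over 1,250 observations, the same result scores close to 1 as a single trial but falls to a few percent when it is the best of 1,000 trials whose Sharpe ratios have a standard deviation of 0.05.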
Monte Carlo Significance
We use a permutation test to determine whether the strategy's returns are statistically distinguishable from random chance.
- Shuffles daily portfolio returns using block bootstrap to destroy the timing signal while preserving serial correlation structure.
- Runs 1,000+ permutations per test to build a null distribution.
- The strategy must achieve a p-value below 0.05 — meaning there is less than a 5% probability that random reordering would produce equal or better returns.
Why this matters: Even after DSR correction, a strategy could have a decent Sharpe purely from favorable return clustering. The permutation test asks a direct question: does the specific ordering of trades matter, or would any random ordering do just as well?
White, H. (2000). "A Reality Check for Data Snooping." Econometrica, 68(5), 1097–1126.
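One way to build such a null distribution is sketched below. Note the assumption: a pure reordering leaves order-invariant statistics such as the mean return unchanged, so this sketch resamples blocks with replacement (a circular block bootstrap) from returns re-centered to zero mean, in the spirit of White (2000) but far simpler than the full Reality Check. The block length, resample count, seed, and function name are illustrative choices, not the pipeline's settings.

```python
import random
from statistics import fmean


def block_bootstrap_pvalue(returns, block_len=10, n_resamples=1000, seed=0):
    """One-sided p-value for mean return > 0 using a circular block bootstrap."""
    rng = random.Random(seed)
    n = len(returns)
    observed = fmean(returns)
    # Impose the null hypothesis (no edge) by removing the observed mean.
    centered = [r - observed for r in returns]
    n_blocks = -(-n // block_len)  # ceiling division
    hits = 0
    for _ in range(n_resamples):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(n)
            # Consecutive bars stay together, which keeps short-range autocorrelation.
            sample.extend(centered[(start + j) % n] for j in range(block_len))
        if fmean(sample[:n]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)
```

A strategy would pass the 0.05 gate only if fewer than roughly 5% of the resampled null series match or beat its observed mean return.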
Realistic Commission Models
A strategy that looks profitable with zero transaction costs often is not. We apply broker-specific commission schedules to every backtest.
- 12 broker presets: IBKR Pro Tiered, IBKR Pro Fixed, Robinhood, Schwab, Fidelity, E*TRADE, TD Ameritrade, Webull, Firstrade, TradeStation, Tradier, and Alpaca.
- Per-share rates with minimum and maximum per-order caps.
- SEC Section 31 fees and FINRA TAF regulatory fees applied to every sell order.
Why this matters: Frequent-trading strategies are particularly sensitive to commissions. A strategy with 200 round-trips per year can lose 2–5% of its returns to transaction costs alone. If your backtest ignores this, its equity curve is fiction.
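The fee structure described above (per-share rate, per-order minimum and maximum, regulatory fees on sells) can be sketched as follows. Every number in the defaults is a placeholder chosen for illustration, not an actual broker schedule or a current SEC/FINRA rate, and the function name is hypothetical.

```python
def order_cost(shares, price, side,
               per_share=0.005,        # placeholder commission per share
               min_fee=1.00,           # placeholder per-order minimum
               max_pct_of_value=0.01,  # placeholder cap as a fraction of trade value
               sec_fee_rate=0.00002,   # placeholder Section 31 rate per dollar sold
               taf_per_share=0.0002,   # placeholder FINRA TAF per share sold
               taf_cap=10.0):          # placeholder TAF per-order cap
    """Total cost in dollars for one order; side is 'buy' or 'sell'."""
    value = shares * price
    commission = max(shares * per_share, min_fee)
    commission = min(commission, max_pct_of_value * value)
    regulatory = 0.0
    if side == "sell":
        # Regulatory fees are charged on sell orders only.
        regulatory = value * sec_fee_rate + min(shares * taf_per_share, taf_cap)
    return commission + regulatory
```

Under these placeholder rates, buying 100 shares at $50 costs the $1.00 minimum, while selling 1,000 shares at $50 costs $5.00 in commission plus $1.20 in regulatory fees.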
Parameter Robustness
After optimization selects the best parameters, we perturb each one to see if performance degrades gracefully or collapses.
- Each optimized parameter is varied ±20% from its selected value.
- If performance collapses outside a narrow range, the strategy is flagged as curve-fitted.
- Robust strategies exhibit a performance plateau — they work across a range of nearby parameter values, not just one magic setting.
Why this matters: A strategy that only works at exactly RSI(14) with a 2.1 standard deviation Bollinger Band is almost certainly overfit. Real market edges are broad enough to survive minor parameter variation.
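A perturbation check along these lines might look like the sketch below. The ±20% span comes from the text; the intermediate ±10% steps, the 50% retention threshold, and the names `is_robust` and `score_fn` are assumptions for illustration.

```python
def is_robust(score_fn, params, span=0.20, steps=(-1.0, -0.5, 0.5, 1.0), min_retained=0.5):
    """True if the score stays near its base value when each parameter is nudged.

    score_fn maps a parameter dict to a performance score (higher is better).
    Parameters are perturbed one at a time; integer parameters would need
    rounding in a real system, which this sketch skips.
    """
    base = score_fn(params)
    if base <= 0:
        return False  # nothing worth protecting
    for name, value in params.items():
        for step in steps:
            perturbed = dict(params)
            perturbed[name] = value * (1.0 + span * step)
            if score_fn(perturbed) < min_retained * base:
                return False  # performance collapsed: likely curve-fitted
    return True
```

A score surface that declines gently around the optimum passes, while one that only pays off at a single exact setting is rejected.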
Market Regime Detection
Every signal and strategy operates within a market regime — bull, bear, sideways, or crisis. alphactor.ai classifies the current regime using a three-layer ensemble grounded in two of the most cited papers in financial econometrics.
- Layer 1: Hamilton (1989) Markov-Switching Regression fits a 2-regime model on log returns with switching variance. The Hamilton smoother produces proper Bayesian posterior state probabilities — not ad-hoc threshold rules.
- Layer 2: Kritzman et al. (2012) volatility overlay detects crisis conditions by ranking current realized volatility against its full historical distribution. Bear markets with volatility above the 80th percentile are escalated to Bear/Crisis.
- Layer 3: SMA-50 trend confirmation prevents false bull signals during bear-market bounces and vice versa, providing the Sideways classification when the MS model is uncertain.
Why this matters: Strategies that backtest well in one regime often fail catastrophically in another. Regime-aware analysis helps you understand whether a signal is robust across environments or merely overfit to the current market phase.
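Hamilton, J.D. (1989). "A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle." Econometrica, 57(2). · Kritzman, M., Page, S., & Turkington, D. (2012). "Regime Shifts: Implications for Dynamic Strategies." Financial Analysts Journal, 68(3). · Nystrup, P., Lindström, E., Pinson, P., & Madsen, H. (2024). "Learning Hidden Markov Models for Regression with Unaligned Timestamps." arXiv:2402.05272.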
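How the three layers might combine can be sketched as below. The Markov-switching fit itself is out of scope here, so the bear-state posterior `p_bear` is taken as an input. The 80th-percentile crisis rule and the SMA-50 come from the text; the 0.4-0.6 "uncertain" band and the rule that a model/trend disagreement resolves to Sideways are assumptions.

```python
from statistics import fmean


def classify_regime(p_bear, closes, vol_history, current_vol,
                    uncertain_band=(0.4, 0.6), crisis_pct=0.80, sma_len=50):
    """Combine a bear-state posterior, a volatility rank, and an SMA trend check."""
    if len(closes) < sma_len:
        raise ValueError("need at least sma_len closes")
    price_above_sma = closes[-1] > fmean(closes[-sma_len:])
    # Layer 2: rank current realized volatility against its full history.
    vol_rank = sum(v <= current_vol for v in vol_history) / len(vol_history)
    low, high = uncertain_band
    if low <= p_bear <= high:
        return "Sideways"  # Layer 1 is undecided
    if p_bear > high:
        if price_above_sma:
            return "Sideways"  # Layer 3 does not confirm the bear call
        return "Bear/Crisis" if vol_rank > crisis_pct else "Bear"
    # Layer 1 says bull; require trend confirmation to avoid bear-market bounces.
    return "Bull" if price_above_sma else "Sideways"
```

In a fuller implementation `p_bear` would come from a fitted two-regime switching-variance model rather than being supplied by hand.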
Regime-Conditioned Strategy Selection
Knowing the current market regime is only half the problem. The other half is selecting which strategy to deploy in each regime. alphactor.ai uses two regime-conditioned selection methods inspired by mixture-of-experts architectures in machine learning.
- Hard Switch — a single champion strategy is selected per regime via argmax over regime-conditional expected scores. Hysteresis and a switching penalty prevent whipsawing between strategies during regime transitions.
- Soft Mixture-of-Experts (MoE) — instead of picking one winner, a softmax gate blends the top-K strategies weighted by their regime-conditional scores. A hierarchical family-level gate groups strategies by signal type, preventing overconcentration in one signal family.
- Both methods use an online regime posterior (the Hamilton filter from Layer 7) so that strategy weights at each bar depend only on data available at that point — no lookahead bias.
Why this matters: A single strategy optimized across all market conditions is a compromise that underperforms in every regime. Regime-conditioned selection lets the system deploy the right tool for the current environment while controlling switching costs and turnover.
Meta-Learning Mixture of Experts for Regime-Aware Portfolio Construction. arXiv:2505.03659.
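The two selection modes can be sketched with a few helpers. This is a simplified illustration: the score tables, the `switch_penalty` value, the softmax `temperature`, and all function names are hypothetical, and the hierarchical family-level gate is omitted.

```python
import math


def expected_scores(scores, posterior):
    """scores: {strategy: {regime: score}}; posterior: {regime: probability}."""
    return {name: sum(posterior.get(regime, 0.0) * s for regime, s in by_regime.items())
            for name, by_regime in scores.items()}


def hard_switch(expected, current=None, switch_penalty=0.1):
    """Keep the current champion unless a challenger wins by more than the penalty."""
    best = max(expected, key=expected.get)
    if current is None or current not in expected:
        return best
    if expected[best] - expected[current] > switch_penalty:
        return best
    return current  # hysteresis: small leads do not trigger a switch


def soft_moe_weights(expected, top_k=3, temperature=1.0):
    """Softmax weights over the top_k strategies by expected score."""
    top = sorted(expected, key=expected.get, reverse=True)[:top_k]
    peak = max(expected[name] for name in top)  # subtract the max for numerical stability
    exps = {name: math.exp((expected[name] - peak) / temperature) for name in top}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}
```

Feeding these functions a posterior computed only from data up to the current bar is what keeps the resulting weights free of lookahead.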
Credibility-First Champion Selection
Picking which strategy to surface as the champion is itself a statistical decision. We rank candidates with a hard credibility floor + cup-tier priority — so a backtest only earns the champion slot if it both beats the underlying stock on the held-out year AND survives our credibility tests.
- BEATS cup tier (1y holdback) — strategies are graded 🏆 to 🏆🏆🏆🏆 based on whether they beat the underlying stock and SPY on the held-out 1-year window. cup_4 = beats stock and SPY both by ≥5pp; trails = doesn't beat the stock.
- Hard credibility floor — a candidate must pass p-value ≤ 0.05 AND DSR ≥ 0.20 AND ≥ 30 trades to be eligible for cup-priority promotion. Below the floor we fall back to the credibility-haircut composite score, so a credible-but-trailing champion never gets displaced by a noisy cup_3 from too few trades.
- Cup priority within the credible cohort — once the floor is met, cup_4 outranks cup_3 outranks cup_2, with composite score as the tiebreak. This stops the system from rewarding a marginal in-sample edge over a strategy that actually beat the stock on the held-out year.
Why this matters: A pure 'highest score wins' system rewards in-sample optimization tricks. By promoting champions on a held-out 1-year metric AND requiring statistical credibility, we surface strategies that are both real and useful — and we tell the user honestly when no candidate passes both gates.
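The ranking rule can be expressed as a sort key. The thresholds below are the ones stated above; the `Candidate` fields, the `cup_1` tier name (inferred from the one-to-four trophy scale), and the function names are assumptions for illustration.

```python
from dataclasses import dataclass

# Tier order; "cup_1" is inferred from the one-to-four trophy scale.
CUP_RANK = {"trails": 0, "cup_1": 1, "cup_2": 2, "cup_3": 3, "cup_4": 4}


@dataclass
class Candidate:
    name: str
    p_value: float
    dsr: float
    trades: int
    cup_tier: str
    composite: float  # credibility-haircut composite score


def passes_floor(c):
    """Hard credibility floor: all three conditions must hold."""
    return c.p_value <= 0.05 and c.dsr >= 0.20 and c.trades >= 30


def pick_champion(candidates):
    """Cup tier counts only above the floor; composite score breaks ties."""
    if not candidates:
        return None

    def key(c):
        cup = CUP_RANK[c.cup_tier] if passes_floor(c) else 0
        return (cup, c.composite)

    return max(candidates, key=key)
```

A caller can run `passes_floor` on the returned champion to decide whether to tell the user that no candidate cleared both gates.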
Sign-Correct Short Accounting
When a strategy can go short, every percentage return, drawdown, and Sharpe must be computed with the correct sign — or the cup tier shows a strategy that looks profitable but isn't. Phase 1 ships an internal accounting engine that handles long, short, and mixed books with textbook-orthodox math.
- Short proceeds credit cash at the open; margin reserved at Reg T 50% initial, 30% maintenance. Buying power equals equity minus margin used, never "available cash".
- Daily borrow accrual: short_qty × close × annual_borrow_rate / 252, debited from cash each bar. Backtests assume 2%/yr default; hard-to-borrow names bump to 15%/yr.
- Forced cover trigger: when equity drops below 30% of short market value, the largest-loser short closes at the next bar — same behavior your real broker would force on you.
Why this matters: A long position can lose at most 100% of capital. A short position can lose multiples of that, and the math behind drawdown, win rate, and Sharpe is direction-dependent. Without sign-correct accounting, a backtest silently overstates short profits by the borrow cost it never paid and understates drawdowns by the margin call it never felt. The cup tier shown to the user is then a half-truth at best.
See docs/short-selling-plan.md §7.2 and §10 for the accounting equations and the 20-test regression corpus that gates v2 from breaking long-only behavior.
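The three rules above translate directly into small functions. This is a sketch of the stated formulas only, not the accounting engine in docs/short-selling-plan.md; the function names and the position-tuple layout are hypothetical.

```python
BARS_PER_YEAR = 252


def daily_borrow_cost(short_qty, close, annual_borrow_rate=0.02):
    """Borrow fee debited from cash each bar: qty x close x annual rate / 252."""
    return short_qty * close * annual_borrow_rate / BARS_PER_YEAR


def buying_power(equity, margin_used):
    """Buying power is equity minus margin used, never raw cash."""
    return equity - margin_used


def needs_forced_cover(equity, short_market_value, maintenance=0.30):
    """True when equity falls below the maintenance fraction of short market value."""
    return short_market_value > 0 and equity < maintenance * short_market_value


def largest_loser(shorts):
    """shorts: {ticker: (qty, entry_price, last_close)}; a short loses as price rises."""
    return max(shorts, key=lambda t: shorts[t][0] * (shorts[t][2] - shorts[t][1]))
```

For example, 100 shares short at a $50 close accrue about $0.40 per bar at the 2% default rate and about $2.98 per bar at the 15% hard-to-borrow rate.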
52 academic alpha families
Each family is a published, peer-reviewed strategy from the academic finance canon, grounded in an explicit economic mechanism. All 52 flow through the same 10-layer credibility harness above; the picker ranks them lane-agnostically so the strongest evidence wins, regardless of which paper it came from.
Families per sleeve · 52 total
Family · Signal logic · Basis
multi_horizon_trend · Multi-window SMA-cross vote across {50/200, 20/60, 10/30} · Multi-window trend votes (canonical TA construction)
cross_sectional_momentum · Top/bottom decile of 6-month return vs universe · Jegadeesh & Titman 1993, JF
breakout_proximity · Close ≥ 95% of N-day high with volume confirm · Breakout / anchor effect (TA canon)
vol_timed_ma · SMA crossover gated by realized-vol regime · Vol-timed trend (Moskowitz-Ooi-Pedersen 2012 derivative)
pairs_relative_value · Sector/peer z-score, ±2σ entry, revert-to-zero exit · Engle-Granger pairs (TA quant standard)
range_regime_meanrev · RSI + range position, gated on low ADX · Bollinger / RSI mean-rev (TA canon)
breakout_volume · Donchian breakout + volume surge + ATR expansion · Donchian / Turtle Traders (Curtis Faith)
flow_confirmed · FINRA short-volume + (planned) options OI/GEX confirmation · Hasbrouck 1995, JF (flow-info-content)
event_aware · Earnings calendar filter + post-event drift · Ball & Brown 1968, JAR
regime_overlay · VIX × SPY 200d-MA regime gate (risk-on/off) · Daniel-Moskowitz 2016 (regime overlay)
tsmom · Sign(trailing 12m return) × inverse-vol scaling · Moskowitz, Ooi & Pedersen 2012, JFE
idiosyncratic_momentum · Trailing residual return after stripping FF5+MOM betas · Blitz, Hanauer & Vidojevic 2020, International Review of Economics & Finance
pead · Post-Earnings Announcement Drift on surprise > +5% · Bernard & Thomas 1989, JAR
iv_skew · Put-call IV skew + ATM IV term structure (skeleton; needs vendor) · Xing, Zhang & Zhao 2010, JFQA
lottery_max · Bottom-quartile MAX[5d return] outperforms top · Bali, Cakici & Whitelaw 2011, JFE
bab · Betting-Against-Beta: long low-β levered, short high-β hedged · Frazzini & Pedersen 2014, JFE
sector_momentum · Ticker vs sector ETF (XLF/XLK/...) outperformance · Moskowitz & Grinblatt 1999, JF
short_term_reversal · Sub-week mean reversion + reversal-up confirm · Jegadeesh 1990, JF
vrp_vix_term · Variance Risk Premium + VIX/VIX3M term structure · Bollerslev, Tauchen & Zhou 2009, RFS
pairs_cointegration · Engle-Granger residual ±2σ, cointegration-tested · Gatev, Goetzmann & Rouwenhorst 2006, RFS
qmj · Quality-Minus-Junk composite (profitability × growth × safety) · Asness, Frazzini & Pedersen 2019, RAS
ofi_microstructure · L1 NBBO order-flow imbalance (skeleton; needs vendor) · Cont, Kukanov & Stoikov 2014, Journal of Financial Econometrics
champion_overlay · Production champion + regime gate (VIX × SPY) · Asness-Frazzini-Israel-Moskowitz 2015 (AQR factor timing)
short_interest_change · FINRA daily short-volume z-score (continuation + squeeze-fade) · Boehmer, Huang & Jiang 2010, RFS
borrow_rate_spike · IBKR borrow-fee z-score, hard-to-borrow filter · Engelberg, Reed & Ringgenberg 2018, JF
ftd_threshold_list · SEC Reg SHO threshold inclusion streak (FTD persistence) · Boulton & Braga-Alves 2010 (Reg SHO compliance)
gap_play · Overnight gap continuation (gap-and-go) and fade reversal · Gap dynamics (microstructure canon)
atr_breakout · Donchian channel break ± 1.5×ATR(20), Turtle 10-day exit · Faith 2003 (Turtle Traders)
high_52w_momentum · Proximity to 52-week high + deep-drawdown bounce · George & Hwang 2004, JF; Geczy & Samonov 2015
calendar_anomalies · Turn-of-month + pre-FOMC drift + Wed/Thu effect · Ariel 1987, JFE; Lucca & Moench 2015, JF
insider_form4 · SEC Form 4 cluster-buy: ≥3 insiders OR CEO/CFO > $1M · Cohen, Malloy & Pomorski 2012, JF
earnings_announcement_premium · T-2 to T+1 window around scheduled earnings · Frazzini & Lamont 2007, NBER Working Paper
cash_operating_profitability · Cash-OP / total_assets quarterly z-score + trend filter · Ball, Gerakos, Linnainmaa & Nikolaev 2016, JFE
idio_vol_puzzle · FF3-residual IVOL z-score: short high-IVOL, long low · Ang, Hodrick, Xing & Zhang 2006, JF
industry_lead_lag · Sector ETF vs SPY × ticker vs sector lead-lag · Menzly & Ozbas 2010, JF
buyback_drift · Net shares-outstanding contraction over 2 consecutive quarters · Peyer & Vermaelen 2009, RFS; Ikenberry et al 1995, JFE
sloan_accruals · Balance-sheet accruals z-score: long low-accrual, short high · Sloan 1996, AR
novy_marx_gross_profitability · Gross profit / assets z-score, trend-confirmed · Novy-Marx 2013, JFE
rd_capitalized_value · R&D intensity z-score × price below 200d MA · Peters & Taylor 2017, JFE; Eisfeldt-Kim-Papanikolaou 2020
max_drawdown_premium · Trailing 252d MaxDD threshold + 60d MA recovery · Atilgan, Bali, Demirtas & Gunaydin 2020, JFE
speculative_beta · Rolling-window β dispersion: short high-β + high dispersion · Hong & Sraer 2016, JF
meta_equal_weight · 5-signal 1/N consensus (≥ 3/5 or 4/5 agree) · DeMiguel, Garlappi & Uppal 2009, RFS
meta_regime_router · VIX × SPY routes trend / revert / crisis-short sub-signal · Daniel & Moskowitz 2016, JFE (Momentum Crashes)
macro_regime · FRED term-spread + credit-spread + Fed-funds cycle gate · Estrella-Mishkin 1998 (yield-curve recession signal)
attention_spike · Google Trends SVI z-score: short spike, long quiet · Da, Engelberg & Gao 2011, JF
lazy_prices · 10-K filing year-over-year cosine similarity ≥ 0.85 · Cohen, Malloy & Nguyen 2020, JF
m_and_a_arb · SEC 8-K Item 1.01/2.01/8.01 post-event drift · Mitchell & Pulvino 2001, JF
crowding_reversal · 13F crowding-z reversal (top-decile fade, bottom-decile rebound) · Wardlaw 2020, JF
realized_skew_xs · Own-history percentile of rolling realized skewness → long bottom, short top · Amaya, Christoffersen, Jacobs & Vasquez 2015, JFE
mag7_factor_overlay · Mag-7 concentration regime: rotation_into / rotation_away / regime_gate variants · AQR Capital, A New Paradigm in Active Equity, Q1 2025
pre_fomc_drift · Long 1-3 trading days before scheduled FOMC + post-fade combo (press-conference filter) · Lucca & Moench 2015, JF (NY Fed SR-512); 2024 Appl. Econ. update
index_rebalance_drift · S&P 500 add-front-run, add-post-reversal, remove-bounce around effective date · Chen & Singal 2023, Financial Analysts Journal
Source: services/worker/alpha_experiments_runner.py · Updated weekly · See family-by-family signal logic
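To make one row of the family list concrete, here is a minimal sketch of the tsmom signal (sign of trailing 12-month return times inverse-volatility scaling). It is not the code in services/worker/alpha_experiments_runner.py: the 60-bar volatility window, the 10% target volatility, the 2x leverage cap, and the function name are all assumptions.

```python
import math
from statistics import pstdev


def tsmom_position(closes, lookback=252, vol_window=60, target_vol=0.10, max_leverage=2.0):
    """Signed position size: direction from trailing return, size from inverse volatility."""
    if len(closes) < lookback + 1:
        raise ValueError("need at least lookback + 1 closes")
    trailing = closes[-1] / closes[-1 - lookback] - 1.0
    direction = (trailing > 0) - (trailing < 0)  # +1, 0, or -1
    start = len(closes) - vol_window
    daily = [closes[i] / closes[i - 1] - 1.0 for i in range(start, len(closes))]
    ann_vol = pstdev(daily) * math.sqrt(252)
    # Scale toward the volatility target, capped to avoid runaway leverage.
    scale = max_leverage if ann_vol == 0 else min(target_vol / ann_vol, max_leverage)
    return direction * scale
```

Quieter assets receive larger positions and noisier ones smaller, so each contributes a comparable amount of risk.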