Statistical rigor built in

How We Test Strategy Credibility

Most backtesting tools let you overfit without knowing it. Our 13-layer validation pipeline applies institutional-grade statistical rigor to every strategy, ranks the whole market daily, and, when a stock cannot pass validation on its own, says so instead of pretending.

1. 3-Way Data Split
2. Walk-Forward Validation
3. Deflated Sharpe Ratio
4. Monte Carlo Significance
5. Realistic Commission Models
6. Parameter Robustness
7. Market Regime Detection
8. Combining strategies on each stock
9. Picking the single best of everything
10. Honest short-selling math
11. Runs fresh every day
12. Cross-sectional ranking
13. Honest serving states
1

3-Way Data Split

The foundation of honest backtesting is data isolation. We split every ticker's historical data into three non-overlapping segments so the strategy is never tested on data it has already seen.

[Figure: Historical Price Data, split chronologically into Train (50%, optimizer explores here), Validate (25%, best candidate selected), and Holdback (25%, never seen, true OOS).]
  • Train (50%) is where the optimizer explores parameter combinations.
  • Validate (25%) is used to select the best candidate from the training results.
  • Holdback (25%, one year minimum) is data the strategy NEVER sees during optimization. Holdback returns are the closest proxy for live performance.

Why this matters: Without a holdback period, every backtest is in-sample: you are evaluating performance on the same data used to choose the strategy. This is the most common source of backtest overfitting, and most retail tools do not guard against it.
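A minimal sketch of the chronological split described above, assuming a plain list of prices ordered oldest first. The function name `three_way_split` and its parameters are illustrative, not the platform's API, and the one-year minimum on the holdback is not enforced here.

```python
from typing import List, Tuple


def three_way_split(
    prices: List[float],
    train_frac: float = 0.50,
    validate_frac: float = 0.25,
) -> Tuple[List[float], List[float], List[float]]:
    """Split a chronologically ordered series into train / validate / holdback."""
    n = len(prices)
    train_end = int(n * train_frac)
    validate_end = train_end + int(n * validate_frac)
    # The three slices are contiguous and non-overlapping;
    # the holdback is always the most recent stretch of data.
    return prices[:train_end], prices[train_end:validate_end], prices[validate_end:]


# Stand-in for 1,000 daily closes, oldest first (hypothetical data).
bars = [float(i) for i in range(1000)]
train, validate, holdback = three_way_split(bars)
```

Because the split is by position rather than by random sampling, no bar in the holdback can precede a bar the optimizer has already seen.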

2

Walk-Forward Validation

Within the training period, we run a 5-window rolling train/test protocol. Each window trains on a portion of the data and tests on the immediately following out-of-sample segment, so the optimizer has to keep working as the market changes underneath it.

  • Strategies that fail walk-forward are eliminated automatically before reaching the validation phase.
  • This tests whether the strategy adapts to shifting market regimes (trending, mean-reverting, and volatile environments).
  • Only strategies that perform consistently across all five windows advance.

Why this matters: A strategy that works brilliantly in one time window but fails in others is likely overfit to a specific market regime. Walk-forward validation catches this before you risk capital.

Pardo, R. (2008). "The Evaluation and Optimization of Trading Strategies." Wiley.
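The rolling protocol can be sketched as index arithmetic. This is an illustration only: the 3:1 train-to-test ratio, the equal window step, and the pass rule (every window's out-of-sample score above a threshold) are assumptions, since the text specifies only that there are five windows and that each test segment immediately follows its training segment.

```python
from typing import List, Sequence, Tuple

Window = Tuple[int, int, int, int]  # (train_start, train_end, test_start, test_end), end-exclusive


def walk_forward_windows(n_bars: int, n_windows: int = 5, train_ratio: int = 3) -> List[Window]:
    """Build rolling train/test index ranges over the training period."""
    step = n_bars // (n_windows + train_ratio)
    if step == 0:
        raise ValueError("not enough bars for the requested number of windows")
    windows = []
    for k in range(n_windows):
        train_start = k * step
        train_end = train_start + train_ratio * step
        # The test segment starts exactly where training ends.
        windows.append((train_start, train_end, train_end, train_end + step))
    return windows


def passes_walk_forward(oos_scores: Sequence[float], threshold: float = 0.0) -> bool:
    """Assumed pass rule: every window's out-of-sample score must exceed the threshold."""
    return all(score > threshold for score in oos_scores)


windows = walk_forward_windows(800)
```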

3

Deflated Sharpe Ratio

When an optimizer tests hundreds or thousands of parameter combinations, the best result will look impressive by pure chance. The Deflated Sharpe Ratio (DSR) discounts that lucky-winner effect so we judge a backtest against the right bar.

  • Uses the number of independent trials to adjust the significance threshold for the Sharpe Ratio.
  • Accounts for non-normality (skewness, kurtosis) of return distributions.
  • A Sharpe of 2.0 from 1,000 trials is statistically very different from a Sharpe of 2.0 from a single trial. DSR quantifies this difference.

Why this matters: Without this correction, optimizers will always find something that looks good. It is a mathematical certainty when the trial count is large enough. DSR tells you whether your best result is genuine or a statistical artifact.

Bailey, D.H. & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality." Journal of Portfolio Management, 40(5), 94–107.
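To make the adjustment concrete, here is a sketch of the DSR calculation following the formula in the cited Bailey and López de Prado paper. Sharpe ratios are per period (not annualized), `kurtosis` is the raw kurtosis (3 for a normal distribution), and `trial_sr_variance` is the variance of Sharpe ratios across the trials. The function names and the example inputs are illustrative assumptions, not the platform's code.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, trial_sr_variance: float) -> float:
    """Expected best Sharpe among n_trials independent trials whose true Sharpe is zero."""
    if n_trials < 2:
        return 0.0  # a single trial carries no selection bias
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(trial_sr_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr: float, n_obs: int, skew: float, kurtosis: float,
                    n_trials: int, trial_sr_variance: float) -> float:
    """Probability that the observed Sharpe exceeds the lucky-winner benchmark."""
    sr0 = expected_max_sharpe(n_trials, trial_sr_variance)
    # The denominator corrects the Sharpe estimator's variance for non-normal returns.
    denom = math.sqrt(1.0 - skew * sr + (kurtosis - 1.0) / 4.0 * sr * sr)
    return _NORMAL.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)


# Same observed Sharpe, very different verdicts depending on the trial count.
single = deflated_sharpe(0.1, 1250, 0.0, 3.0, n_trials=1, trial_sr_variance=0.001)
many = deflated_sharpe(0.1, 1250, 0.0, 3.0, n_trials=1000, trial_sr_variance=0.001)
```

With these example inputs the single-trial result is close to certain, while the 1,000-trial result is roughly a coin flip, which is the difference the third bullet describes.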

4

Monte Carlo Significance

We use a permutation test to determine whether the strategy's returns are statistically distinguishable from random chance. The question we want answered is simple: would any random reordering of these trades look as good?

  • Shuffles daily portfolio returns in blocks (a block bootstrap), which destroys the timing signal while preserving the serial correlation inside each block.
  • Runs 1,000+ permutations per test to build a null distribution.
  • The strategy must achieve a p-value below 0.05, meaning that fewer than 5% of the random reorderings produce equal or better returns.

Why this matters: Even after DSR correction, a strategy could have a decent Sharpe purely from favorable return clustering. The permutation test asks a direct question: does the specific ordering of trades matter, or would any random ordering do just as well?

White, H. (2000). "A Reality Check for Data Snooping." Econometrica, 68(5), 1097–1126.
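A sketch of a block-shuffle significance test in the spirit of the bullets above. The text does not spell out how the shuffled returns are paired with the strategy, so this example makes an explicit assumption: the strategy's daily positions stay fixed while the asset's daily returns are shuffled in blocks, which breaks the timing link between the two. Strictly, this is a block permutation (without replacement) rather than a bootstrap with replacement. All names are hypothetical.

```python
import random
from typing import List, Sequence


def block_shuffle(values: Sequence[float], block_size: int, rng: random.Random) -> List[float]:
    """Reorder a series block by block, keeping the order inside each block."""
    blocks = [list(values[i:i + block_size]) for i in range(0, len(values), block_size)]
    rng.shuffle(blocks)
    return [v for block in blocks for v in block]


def permutation_p_value(positions: Sequence[float], asset_returns: Sequence[float],
                        n_permutations: int = 1000, block_size: int = 5, seed: int = 0) -> float:
    """Share of block-shuffled histories on which the fixed positions do at least as well."""
    rng = random.Random(seed)
    observed = sum(p * r for p, r in zip(positions, asset_returns))
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = block_shuffle(asset_returns, block_size, rng)
        if sum(p * r for p, r in zip(positions, shuffled)) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed ordering itself, so the p-value is never zero.
    return (at_least_as_good + 1) / (n_permutations + 1)


# Hypothetical data: a signal with perfect foresight should be clearly significant.
_rng = random.Random(42)
returns = [_rng.gauss(0.0, 0.01) for _ in range(250)]
perfect_positions = [1.0 if r > 0 else 0.0 for r in returns]
p_value = permutation_p_value(perfect_positions, returns)
```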

5

Realistic Commission Models

A strategy that looks profitable with zero transaction costs often is not. We apply broker-specific commission schedules to every backtest, so the equity curve you see is what you would actually keep after fees.

  • 12 broker presets: IBKR Pro Tiered, IBKR Pro Fixed, Robinhood, Schwab, Fidelity, E*TRADE, TD Ameritrade, Webull, Firstrade, TradeStation, Tradier, and Alpaca.
  • Per-share rates with minimum and maximum per-order caps.
  • SEC Section 31 fees and FINRA TAF regulatory fees applied to every sell order.

Why this matters: Frequent-trading strategies are particularly sensitive to commissions. A strategy with 200 round-trips per year can lose 2 to 5% of its returns to transaction costs alone. If your backtest ignores this, its equity curve is fiction.
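The fee structure in the bullets above (a per-share rate, per-order minimum and maximum, and regulatory fees on sells) can be sketched as follows. Every rate in `EXAMPLE` is a placeholder chosen for easy arithmetic, not any broker's or regulator's actual schedule, and expressing the maximum as a percentage of trade value is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommissionSchedule:
    per_share: float           # commission per share
    min_per_order: float       # floor on the commission for one order
    max_pct_of_value: float    # cap, as a fraction of trade value (assumed form of the cap)
    sell_fee_rate: float       # regulatory fee as a fraction of sale proceeds (placeholder)
    sell_fee_per_share: float  # regulatory fee per share sold (placeholder)


def order_cost(s: CommissionSchedule, shares: int, price: float, is_sell: bool) -> float:
    """Total cost of one order: capped commission plus sell-side regulatory fees."""
    value = shares * price
    commission = max(s.min_per_order, shares * s.per_share)
    commission = min(commission, s.max_pct_of_value * value)
    regulatory = 0.0
    if is_sell:
        regulatory = s.sell_fee_rate * value + s.sell_fee_per_share * shares
    return commission + regulatory


# Illustrative numbers only.
EXAMPLE = CommissionSchedule(per_share=0.005, min_per_order=1.00, max_pct_of_value=0.01,
                             sell_fee_rate=0.00001, sell_fee_per_share=0.0001)
small_buy = order_cost(EXAMPLE, 100, 50.0, is_sell=False)     # the minimum applies
penny_buy = order_cost(EXAMPLE, 10_000, 0.10, is_sell=False)  # the cap applies
large_sell = order_cost(EXAMPLE, 1000, 50.0, is_sell=True)    # regulatory fees added
```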

6

Parameter Robustness

After optimization selects the best parameters, we perturb each one to see if performance degrades gracefully or collapses. A real edge should survive small changes to the dials.

  • Each optimized parameter is varied ±20% from its selected value.
  • If performance collapses outside a narrow range, the strategy is flagged as curve-fitted.
  • Robust strategies exhibit a performance plateau: they work across a range of nearby parameter values, not just one magic setting.

Why this matters: A strategy that only works at exactly RSI(14) with a 2.1 standard deviation Bollinger Band is almost certainly overfit. Real market edges are broad enough to survive minor parameter variation.
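A sketch of the ±20% perturbation check. The `score` callable stands in for a full backtest, and the rule that a drop of more than half counts as a collapse (`max_drop`) is an assumption made for illustration, as are the two toy scoring functions. Integer parameters such as a lookback length would need rounding in practice, and the rule assumes a positive baseline score.

```python
from typing import Callable, Dict

Params = Dict[str, float]


def robustness_report(score: Callable[[Params], float], params: Params,
                      shift: float = 0.20, max_drop: float = 0.5) -> Dict[str, object]:
    """Perturb each parameter by +/- shift and record the worst score seen."""
    base = score(params)
    worst = base
    for name, value in params.items():
        for factor in (1.0 - shift, 1.0 + shift):
            perturbed = dict(params)
            perturbed[name] = value * factor  # vary one parameter at a time
            worst = min(worst, score(perturbed))
    return {"base": base, "worst": worst, "curve_fitted": worst < base * (1.0 - max_drop)}


def plateau(p: Params) -> float:
    """Toy score that holds up across a range of nearby lookbacks."""
    return 1.0 if 10 <= p["lookback"] <= 20 else 0.0


def spike(p: Params) -> float:
    """Toy score that only works at one magic setting."""
    return 1.0 if abs(p["lookback"] - 14) < 0.5 else 0.0


robust = robustness_report(plateau, {"lookback": 14})
fragile = robustness_report(spike, {"lookback": 14})
```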

7

Market Regime Detection

Every signal operates inside a market regime (bull, bear, sideways, or crisis). alphactor.ai classifies the current regime with a three-layer ensemble built on two of the most cited papers in financial econometrics, so we can tell which environment a strategy actually thrives in.

  • Layer 1: Hamilton (1989) Markov-Switching Regression fits a 2-regime model on log returns with switching variance. The Hamilton smoother produces proper Bayesian posterior state probabilities, not ad-hoc threshold rules.
  • Layer 2: Kritzman et al. (2012) volatility overlay detects crisis conditions by ranking current realized volatility against its full historical distribution. Bear markets with volatility above the 80th percentile are escalated to Bear/Crisis.
  • Layer 3: SMA-50 trend confirmation prevents false bull signals during bear-market bounces and vice versa, providing the Sideways classification when the MS model is uncertain.

Why this matters: Strategies that backtest well in one regime often fail catastrophically in another. Regime-aware analysis helps you understand whether a signal is robust across environments or merely overfit to the current market phase.

Hamilton, J.D. (1989). Econometrica, 57(2). · Kritzman, M., Page, S., & Turkington, D. (2012). Financial Analysts Journal, 68(3). · Nystrup, P., Lindström, E., Pinson, P., & Madsen, H. (2024). "Learning Hidden Markov Models for Regression with Unaligned Timestamps." arXiv:2402.05272.
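One way the three layers could be combined into a single label is sketched below. It takes the Markov-switching posterior probability of the bear state as an input instead of fitting the model, and it makes two assumptions the text does not state: a 0.7 confidence threshold for the model, and a Sideways label whenever the model and the SMA-50 trend disagree. Only the 80th-percentile volatility cutoff comes from the description above.

```python
from bisect import bisect_right
from typing import Sequence


def percentile_rank(history: Sequence[float], current: float) -> float:
    """Percent of historical observations at or below `current` (0 to 100)."""
    ordered = sorted(history)
    return 100.0 * bisect_right(ordered, current) / len(ordered)


def classify_regime(p_bear: float, vol_history: Sequence[float], current_vol: float,
                    price: float, sma50: float,
                    confident: float = 0.7, crisis_pct: float = 80.0) -> str:
    """Combine the model posterior, volatility rank, and trend into one regime label."""
    trend_up = price > sma50
    if p_bear >= confident and not trend_up:
        # Layer 2: escalate a confirmed bear market when volatility is extreme.
        if percentile_rank(vol_history, current_vol) > crisis_pct:
            return "Bear/Crisis"
        return "Bear"
    if p_bear <= 1.0 - confident and trend_up:
        return "Bull"
    # The model is uncertain, or the trend contradicts it (assumed rule).
    return "Sideways"


# Hypothetical realized-volatility history: 0.01, 0.02, ..., 1.00.
vol_history = [0.01 * i for i in range(1, 101)]
label = classify_regime(0.9, vol_history, 0.905, price=95.0, sma50=100.0)
```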

8

Combining strategies on each stock

Once single strategies pass the checks above, the system looks for combinations: small teams of strong strategies on the same stock that, together, may be steadier than any one alone. Combining is never assumed to help. Each combination has to prove itself on held-back data before it is kept.

  • For each stock the system tries many recipes: how many strategies to combine (two, three, five or eight), how much weight to give each, whether to switch the combination off in certain market conditions, and when to cut it if it starts losing.
  • Every recipe must clear the same honesty checks a single strategy faces: it is re-tested on data set aside in several different ways, it is compared against a thousand random reshuffles of its own trades, and its score is discounted for how many recipes were tried. Miss any one and it is discarded.
  • Only the single best survivor is kept, one for each risk setting (conservative, balanced, aggressive). Stocks whose every recipe fails simply keep their best single strategy. In practice about 65% of stocks end up with a combination that clears the checks.

Why this matters: A careless ensemble just averages everything together. Here a combination has to be measurably better than its parts on data it was never tuned on, or it is thrown away. That keeps combinations that only looked good in hindsight out of the results.

López de Prado, M. (2018). "Advances in Financial Machine Learning," Ch. 7–8. Wiley. · Bailey, D.H. & López de Prado, M. (2014). "The Deflated Sharpe Ratio."

9

Picking the single best of everything

With both single strategies and combinations validated, the system picks one winner per stock and risk setting. Singles and combinations are compared on exactly the same footing, so the strongest evidence wins no matter where it came from.

  • What matters first is how decisively a strategy beat the stock itself on the held-back year. Ties are broken by how clearly it beat random chance, then by a quality score tuned to the risk setting, then by which result is more recent.
  • An independent second ranker scores every candidate on its risk-adjusted return in today's market, its quality, and how different it is from the other picks (so the whole book is not the same bet repeated). When this second opinion disagrees strongly with the first, its pick wins.
  • The winner becomes what you see on the stock, and the full reasoning is recorded so any decision can be reviewed later. About one pick in eight changes hands through this second opinion on a typical day.

Why this matters: The simple ranking is easy to audit but blind to the bigger picture. A slightly lower-scoring combination can be the better choice when it diversifies the overall book. The second opinion catches those cases, and only overrides when the gap is large enough to matter.
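The first ranker's tie-break order can be expressed as a lexicographic sort key. This sketch covers only that first ranking, not the second-opinion override, and the field names and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    holdback_edge: float  # how decisively it beat the stock on the held-back year
    p_value: float        # significance versus random chance (lower is better)
    quality: float        # quality score for the chosen risk setting
    as_of: date           # date of the result


def pick_winner(candidates: Sequence[Candidate]) -> Candidate:
    """Compare on edge first, then significance, then quality, then recency."""
    return max(candidates, key=lambda c: (c.holdback_edge, -c.p_value, c.quality, c.as_of))


single = Candidate("single", 0.40, 0.03, 0.70, date(2026, 1, 1))
combo = Candidate("combo", 0.40, 0.01, 0.60, date(2025, 12, 1))
winner = pick_winner([single, combo])  # equal edge, so the lower p-value decides
```

Singles and combinations share one record type here, which mirrors the point that both are compared on the same footing.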

10

Honest short-selling math

When a strategy can bet against a stock (sell it short), every return, drawdown, and risk number has to be computed with the correct sign, or a losing strategy can be made to look profitable. The engine accounts for long, short, and mixed positions with textbook-correct math.

  • Short proceeds credit cash at the open. Margin is reserved at Reg T 50% initial, 30% maintenance. Buying power equals equity minus margin used, never "available cash".
  • Daily borrow accrual: short_qty × close × annual_borrow_rate / 252, debited from cash each bar. Backtests assume 2%/yr by default; hard-to-borrow names bump to 15%/yr.
  • Forced cover trigger: when equity drops below 30% of short market value, the largest-loser short closes at the next bar, the same behavior your real broker would force on you.

Why this matters: A normal position can lose at most 100% of what you put in. A short bet can lose far more, and the math behind drawdown, win rate, and risk depends on the direction. Without honest accounting, a backtest quietly overstates short profits by a borrow cost it never paid and hides the margin call it never felt, so the result you see would be a half-truth at best.
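The borrow accrual, buying-power, and forced-cover rules above translate directly into arithmetic. The function names are illustrative; the rates and thresholds are the ones quoted in the bullets.

```python
TRADING_DAYS = 252


def initial_margin(short_market_value: float, rate: float = 0.50) -> float:
    """Margin reserved when a short is opened (50% initial, as stated above)."""
    return rate * short_market_value


def daily_borrow_cost(short_qty: float, close: float, annual_borrow_rate: float = 0.02) -> float:
    """short_qty x close x annual_borrow_rate / 252, debited from cash each bar."""
    return short_qty * close * annual_borrow_rate / TRADING_DAYS


def buying_power(equity: float, margin_used: float) -> float:
    """Buying power is equity minus margin used, not available cash."""
    return equity - margin_used


def forced_cover_triggered(equity: float, short_market_value: float,
                           maintenance: float = 0.30) -> bool:
    """True when equity has fallen below 30% of short market value."""
    return equity < maintenance * short_market_value


# 100 shares short at a $50 close: default 2%/yr versus a hard-to-borrow 15%/yr.
easy = daily_borrow_cost(100, 50.0)
hard = daily_borrow_cost(100, 50.0, annual_borrow_rate=0.15)
```

On this example the daily debit is about $0.40 at the default rate and about $2.98 at the hard-to-borrow rate, small per bar but material over a multi-month hold.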

11

Runs fresh every day

Everything above runs again, end to end, every day. It is built to repeat safely: each step saves its own progress and skips whatever has not changed, so a re-run after an interruption simply picks up where it left off.

  • First it flags stocks that have gone stale or are underperforming, re-runs the strategy search and all the honesty checks on them, rebuilds their combinations, and recomputes the second-opinion rankings. Stocks with nothing new are skipped, so a calm market day is fast.
  • Then it publishes the fresh winners (giving a brand-new type of strategy a short grace window to spread across the market before a daily limit applies), re-points each stock to its new winner, and retires any winner whose live results have drifted away from what its backtest promised.
  • Finally a read-only step records the day's activity: how much was re-checked, what was promoted or retired in each family, how broadly each family is working, and an early-warning score for strategies whose edge is quietly fading.

Why this matters: Separating the heavy re-testing from the light publishing step lets the system improve its candidates every night without disturbing what you see, unless a genuine improvement is found. The grace window gives a new strategy type time to prove itself broadly, and the retirement step keeps stale winners from lingering after their edge decays.

12

Cross-sectional ranking

Alongside the per-stock work, roughly 6,000 stocks are scored every day by a composite of validated signal families and ranked against each other. Portfolios built on this ranking trade rank membership, not per-stock signals: what matters is where a stock stands relative to the whole market, not what any single strategy says about it in isolation.

  • A stock enters the book only when it ranks in the top decile of the composite, stays while it holds the top quintile, and exits when it falls out of the top quintile. The wide exit band keeps borderline names from churning in and out on small rank wiggles.
  • Positions are held at equal weight. We tested weighting names by signal strength and by past Sharpe ratio; both failed out-of-sample, which matches the published research: optimized weights look better in-sample and lose to plain 1/N when tested honestly.
  • Only signal families that passed the validation layers above contribute to the composite, the daily universe is filtered for liquidity, and leveraged and inverse ETFs are excluded from the book.

Why this matters: Many stocks carry signals that are individually too weak to pass the per-stock evidence bar but real in aggregate. Ranking the whole market lets that pooled evidence be used honestly: spread across a broad, equal-weight book where no single name has to carry the claim, instead of being dressed up as a per-stock promise it cannot support.

DeMiguel, V., Garlappi, L., & Uppal, R. (2009). "Optimal Versus Naive Diversification: How Inefficient is the 1/N Portfolio Strategy?" Review of Financial Studies, 22(5), 1915–1953.
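The entry and exit bands and the equal weighting can be sketched as a daily book update. This is a simplified illustration: `ranked` is assumed to be the liquidity-filtered universe ordered best first, and the function name is hypothetical.

```python
from typing import Dict, List, Set


def update_book(held: Set[str], ranked: List[str],
                enter_frac: float = 0.10, hold_frac: float = 0.20) -> Dict[str, float]:
    """Apply the enter-at-top-decile, hold-through-top-quintile rule, then weight 1/N."""
    n = len(ranked)
    top_decile = set(ranked[:int(n * enter_frac)])
    top_quintile = set(ranked[:int(n * hold_frac)])
    # Existing names survive while they stay in the top quintile;
    # new names must reach the top decile to get in.
    book = (held & top_quintile) | top_decile
    weight = 1.0 / len(book) if book else 0.0
    return {ticker: weight for ticker in sorted(book)}


# Hypothetical 20-stock universe: S00 ranks best, S19 worst.
ranked = [f"S{i:02d}" for i in range(20)]
book = update_book(held={"S03", "S10"}, ranked=ranked)
```

In this example S03 is kept because it still sits in the top quintile, S10 is dropped, and S00 and S01 enter from the top decile.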

13

Honest serving states

Every per-stock strategy that reaches you has passed Monte Carlo significance testing and an out-of-sample holdback. When a stock cannot clear that bar on its own, the platform does not quietly lower the bar; it tells you, and points to where the evidence actually lives. Every stock sits in exactly one of three states.

  • Direct signal: at least one strategy on this stock beat both the stock itself and random chance on held-back data, so the stock carries its own validated strategy.
  • Basket or portfolio only: nothing passes on this stock alone, but the name carries pooled, cross-sectional evidence, so it is served through a small basket or the ranked portfolio instead of a per-stock signal it has not earned.
  • Too new to judge: the stock does not have enough trading history for honest validation, so the app says exactly that instead of dressing up a guess.

Why this matters: A tool that always has a confident answer is not giving you information. On many stocks the most useful output is the plain admission that no per-stock edge passed, paired with a route that still uses what the data does support. That honesty is the methodology, not a disclaimer bolted onto it.

How a strategy earns its grade

Beat the stock, and beat the market

Every strategy is judged on a recent stretch of history it was never allowed to learn from. The grade answers one question: on that held-back data, how decisively did it beat simply buying and holding the same stock? The primary measure is Sharpe-ratio delta versus buy-and-hold on the held-out window. The top grades also have to beat the broad market (the S&P 500), not just the single stock.

Grade ladder, from lowest to highest: Lost to buy & hold · Trailed slightly · Matched the stock · Beat it modestly · Beat it clearly · Beat it by a wide margin

A strategy has to land near the top of this ladder, beating both the stock and the market on data it never saw, before it is allowed to influence what you see. Beating the stock but not the market, or only by a hair, is not enough. And when nothing on a stock clears the bar, the app says so plainly instead of serving the least-bad option.
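The primary measure, Sharpe-ratio delta versus buy-and-hold on the held-out window, can be sketched as below. It assumes daily returns, a zero risk-free rate, and annualization by the square root of 252; the grade cutoffs themselves are not given in the text, so none are invented here.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio with a zero risk-free rate (an assumption)."""
    sd = stdev(returns)
    if sd == 0:
        return 0.0
    return mean(returns) / sd * math.sqrt(periods_per_year)


def sharpe_delta(strategy_returns: Sequence[float], buy_hold_returns: Sequence[float]) -> float:
    """Positive when the strategy beat simply holding the stock on the same window."""
    return sharpe(strategy_returns) - sharpe(buy_hold_returns)


# Hypothetical held-out window: the strategy sidesteps the stock's two down days.
stock = [0.01, -0.02, 0.01, -0.02]
strategy = [0.01, 0.02, 0.01, 0.02]
delta = sharpe_delta(strategy, stock)
```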

From one signal to a full model

Single strategies are only the starting point

A single strategy reads one signal and makes one bet. The same evidence bar is applied at every level as those signals are combined into richer models, from teams of strategies on one stock, to models that understand how companies are connected, to balanced books that span the whole market.

1. Single and composite strategies

The 570-generator library spans single-signal anomalies, confirmation composites (a signal is only acted on when price also agrees), and router families (a gatekeeper that switches between sub-strategies depending on market conditions). Every generator, regardless of its internal structure, is graded the same way: how decisively did it beat simply holding the stock on data it never saw?

2. Blends, a team per stock

On any given stock, the strongest few strategies can be combined into a team that is steadier than any single member. The team is not trusted automatically: it has to clear the same checks before it is used anywhere.

3. Company relationships

Companies are connected: suppliers and customers, direct competitors, firms held by the same funds, names that tend to move together. We map these links into a network and let a model learn how a strong read on one company should ripple out to the ones connected to it.

4. Whole-market models

At the top, models score roughly 6,000 stocks every day with a composite of validated signal families and rank them against each other. Portfolios built from this ranking trade rank membership (a stock enters near the top of the ranking and leaves when it slips), with positions held at equal weight. These power the ready-made Quant Portfolios and are graded on the steadiness of the whole book, not one stock at a time.

Knowing the terrain

The market's structure is an input too

Signals are only half the picture. The models also need to know what each instrument actually is and when the ground shifts underneath it, so two structural feeds run alongside the signal library.

Every liquid ETF gets a role

Over 2,300 liquid ETFs are classified into role sleeves (broad equity, bonds, gold, and so on) from their actual return behavior: a year of daily correlation and beta against sleeve benchmarks, not the marketing name on the fund. Leveraged and inverse products are flagged explicitly, and funds that behave like cash are labeled as such. The sleeves feed a regime-aware allocation view that shows how a simple, equal-weight sleeve mix tilts as market conditions change. It is a transparent illustration with no fitted optimization, shown for education rather than as a recommendation.
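A behavior-based classification of this kind can be sketched with plain correlation and beta. The assignment rule (highest absolute correlation wins) and the beta thresholds used to flag leveraged and inverse products are assumptions made for illustration; the text says only that correlation and beta against sleeve benchmarks are used.

```python
from statistics import mean
from typing import Dict, Optional, Sequence, Tuple


def _cov(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    return _cov(x, y) / (_cov(x, x) * _cov(y, y)) ** 0.5


def beta(fund: Sequence[float], bench: Sequence[float]) -> float:
    return _cov(fund, bench) / _cov(bench, bench)


def assign_sleeve(fund: Sequence[float], benchmarks: Dict[str, Sequence[float]],
                  leverage_beta: float = 1.5) -> Tuple[str, float, Optional[str]]:
    """Pick the sleeve whose benchmark the fund tracks most closely, then flag its beta."""
    sleeve = max(benchmarks, key=lambda name: abs(correlation(fund, benchmarks[name])))
    b = beta(fund, benchmarks[sleeve])
    flag = "inverse" if b < 0 else "leveraged" if b > leverage_beta else None
    return sleeve, b, flag


# Hypothetical daily returns for two sleeve benchmarks and two funds.
benchmarks = {
    "broad equity": [0.01, -0.02, 0.015, -0.005, 0.02, -0.01],
    "bonds": [0.001, 0.002, -0.001, 0.0015, -0.002, 0.001],
}
double_long = [2 * r for r in benchmarks["broad equity"]]
inverse = [-r for r in benchmarks["broad equity"]]
```

Calling `assign_sleeve(double_long, benchmarks)` places the fund in the equity sleeve with a beta near 2 and a leveraged flag, whatever its marketing name says.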

Index changes, captured the day they are announced

When a stock is added to or dropped from a major index, the announcement and the effective date are different events, and the gap between them is where most of the price reaction happens. The platform watches the official index-provider press feeds and records add and drop announcements the same day they are published, keyed to the announcement date rather than the later effective date, so event studies and signals line up with what the market actually knew and when.

Continuous hardening

We keep auditing our own backtests

The tests above are only as honest as their implementation. We periodically run an adversarial audit over the whole library and the scoring itself, looking for any way the future could leak into the past, and every flagged issue is independently re-checked before a fix lands. Highlights from the latest pass:

A higher bar for trying more ideas

The luck-correction now scales with how many variations were tried, instead of using one fixed hurdle. Searching more combinations is correctly held to a tougher standard.

Graded on data it never saw

Each candidate is scored on a recent stretch of history it was not tuned on, with its trades counted only inside that window, so a good in-sample fit cannot masquerade as real out-of-sample evidence.

No peeking into the future

Every strategy was audited for accidentally using information that would not have been available at the time. Fixes replace windows that could reach into the future with strictly backward-looking ones and delay outside data to its real publication date.

Honest market-condition labels

When we re-test across calm and stormy markets, the labels for which was which use only what was known at the time, with no after-the-fact smoothing that would quietly leak the answer.

One cost model, applied equally

Transaction costs were previously hardcoded in three places with different assumptions, so alpha and technical-analysis strategies were compared on an uneven footing. A single shared module now applies 5 bps per side (doubled for short legs to reflect borrow and locate friction) to every strategy uniformly.
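The uniform rule is simple enough to state in a few lines. This is a sketch of the stated arithmetic (5 bps per side, doubled for short legs); the function names are hypothetical and not the shared module's actual interface.

```python
BPS = 1e-4  # one basis point as a fraction


def side_cost(notional: float, is_short_leg: bool, bps_per_side: float = 5.0) -> float:
    """Cost of one side of a trade: 5 bps, doubled for short legs."""
    rate = bps_per_side * BPS
    if is_short_leg:
        rate *= 2  # reflects borrow and locate friction
    return abs(notional) * rate


def round_trip_cost(notional: float, is_short_leg: bool) -> float:
    """Entry plus exit at the same notional."""
    return 2 * side_cost(notional, is_short_leg)


long_side = side_cost(10_000.0, is_short_leg=False)
short_side = side_cost(10_000.0, is_short_leg=True)
```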

Point-in-time correctness for external data

Families that incorporate analyst revisions, earnings surprises, patent publications, or peer-network events are each audited to confirm the signal is anchored to the date the information became public, not to the period it describes. Look-ahead surfaces are documented and guarded by publication-lag offsets.

Some strategies adapt their source research rather than copy it exactly (a different set of stocks, frequency, or stand-in data). Where a strategy departs from the original method we treat it as inspired by, not a literal reproduction of, that work.

The strategy library

570 generators, six signal groups

Each strategy is grounded in published research or a documented market effect, with a clear reason a price should move. The library spans single-signal anomalies, confirmation composites that require price agreement before acting, and router families that switch between sub-strategies as conditions change. To keep them navigable we group everything into six segments by the kind of signal they read. All generators are graded the same way, so the strongest evidence wins no matter which segment it belongs to.

Company Events & Earnings

102 strategies

Triggered by what a company does: earnings surprises, guidance changes, product news, mergers and other corporate events.

For example: 13d Activist Filing Drift · Activist Pair Revert · Adcomm Split Vote Short · Atm Offering Overhang Short · Black Box Warning Short

Economy & Policy

94 strategies

Read from the wider world: interest rates, inflation, the Fed, government action and geopolitics that move whole markets.

For example: Acled Conflict Onset Defense · Acled Mining Disruption · Acled Oil Supply Shock Long · Acled Protest Consumer Short · Baa Aaa Quality Spread

Filings, Insiders & Ownership

119 strategies

What the paperwork and the people closest to a company reveal: regulatory filings, insider buying and selling, short interest and who owns the stock.

For example: AI Disclosure Growth · Analyst Surprise Momentum · Cluster Buy Post Drawdown · Cyber Risk Disclosure Short · Def14a Comp Shift

Real-World & Alternative Data

104 strategies

Signals from outside the market: weather, shipping, crops, consumer demand, web attention and other physical-world measurements.

For example: Ag Basis Signal · Analyst Forecast Dispersion · Analyst Revision Breadth · Analyst Revision Jump · Attention Spike

Price & Market Behavior

126 strategies

Read straight from price, volume and volatility, including how a stock trades against the market and its peers.

For example: 52-Week-High Momentum · Accruals Quality · Afternoon Drift · Altman Z Score · Amihud Illiquidity

Combined Strategies

25 strategies

Strategies that blend several of the others into one signal and switch between them as market conditions change.

For example: Auto-Weighted Blend · Best-Strategy Overlay · Champion Disagreement Filter · Cme Silver Gold Ratio Regime · Combo Amihud X Max Drawdown

Newest research wave (June 2026): 23 generators across 7 new families are entering the validation pipeline. They cover revision-and-price confirmation (PEAD + analyst-revision drift confirmed by trend), fundamental inflection (improving quality metrics confirmed by market), transcript-vs-revision disagreement (management call tone against analyst revision behavior), short-pressure composites (squeeze-long vs overhang-short using borrow rates and short flow), ETF-flow and sector relative-strength, event-conditioned graph shock propagation (peer earnings and contract surprises rippling through TNIC networks), and patent-innovation velocity (application acceleration and KPSS value momentum). These are registered and in the queue; none have validated champions yet.

Updated weekly. Open any strategy for its plain-English summary, the idea behind it, and the data it uses.

Ready to test your strategies?

Run your first optimization with full credibility testing. Free to start, no credit card required.

alphactor.ai provides AI-powered stock research tools for informational and educational purposes only. We are not a registered investment advisor. Nothing on this site constitutes financial, investment, or trading advice. Past performance does not guarantee future results.