backtest_engine
A reusable Python framework for evaluating systematic strategies — signal evaluation, vector-friendly sizing, transaction-cost modeling, and a metrics layer built around honest accounting of look-ahead and slippage.
Most ad-hoc backtests lie — and the ways they lie are well-known
Look-ahead bias. Survivorship bias. Frictionless fills. Sharpe ratios that don't survive a 5 bps transaction cost. Every quant researcher writes the same scaffolding from scratch and gets the same things wrong in subtly different ways.
This engine is the answer to "can I trust this PnL?" — a single framework where the inputs (prices, signals, costs) are well-typed, the accounting is auditable, and the failure modes are explicit. If a backtest passes here, I can stand behind the number.
Strategy in, metrics out — with every transformation auditable
┌──────────────────────────────────────────────────────────────┐
│  INPUTS                                                      │
│    prices:     DataFrame[date × ticker]   (adjusted close)   │
│    signals:    DataFrame[date × ticker]   (model output)     │
│    universe:   Set[ticker] per date       (point-in-time)    │
│    cost_model: CostModel                                     │
└──────────────────────────────────────────────────────────────┘
                             │
                             ▼
                    ┌────────────────────┐
                    │ align & lag        │  t+1 fills, no peek
                    └────────┬───────────┘
                             ▼
                    ┌────────────────────┐
                    │ position sizing    │  vol-target / rank / fixed
                    └────────┬───────────┘
                             ▼
                    ┌────────────────────┐
                    │ apply costs        │  commission + slippage
                    └────────┬───────────┘
                             ▼
                    ┌────────────────────┐
                    │ PnL accumulator    │  gross, net, by sleeve
                    └────────┬───────────┘
                             ▼
                    ┌────────────────────┐
                    │ metrics & reports  │  Sharpe / DD / hit / turnover
                    └────────────────────┘
The contract is small: signals come in lagged, metrics come out audited. Each stage in the pipeline is a pure function over DataFrames, so a sizing rule or cost model can be swapped without touching the rest.
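A minimal sketch of what swapping a sizing rule could look like when each stage is a pure function over DataFrames. The `Sizer` alias and the `size_rank`, `size_fixed`, and `target_weights` names are hypothetical and not taken from the engine; the tickers and values are made up for illustration.

```python
from typing import Callable

import numpy as np
import pandas as pd

# Hypothetical stage signature: signals in, target weights out.
Sizer = Callable[[pd.DataFrame], pd.DataFrame]


def size_rank(signals: pd.DataFrame) -> pd.DataFrame:
    """Dollar-neutral weights from cross-sectional ranks, gross exposure 1."""
    ranks = signals.rank(axis=1)                      # 1..n within each date
    centered = ranks.sub(ranks.mean(axis=1), axis=0)  # zero mean per date
    gross = centered.abs().sum(axis=1).replace(0.0, np.nan)
    return centered.div(gross, axis=0).fillna(0.0)


def size_fixed(signals: pd.DataFrame) -> pd.DataFrame:
    """Equal weight in the direction of the signal's sign, gross exposure 1."""
    direction = np.sign(signals)
    gross = direction.abs().sum(axis=1).replace(0.0, np.nan)
    return direction.div(gross, axis=0).fillna(0.0)


def target_weights(signals: pd.DataFrame, sizer: Sizer) -> pd.DataFrame:
    """Sizing stage: depends only on its arguments, so sizers swap freely."""
    return sizer(signals)


dates = pd.date_range("2024-01-01", periods=3, freq="D")
signals = pd.DataFrame(
    {"AAA": [0.5, -1.0, 2.0], "BBB": [1.5, 0.0, -2.0], "CCC": [-2.0, 3.0, 0.1]},
    index=dates,
)

w_rank = target_weights(signals, size_rank)
w_fixed = target_weights(signals, size_fixed)
```

Because both sizers share one signature, the rest of the pipeline never needs to know which one produced the weights.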
The engine doesn't fetch data — it consumes it
By design, ingestion is somebody else's problem (see the QuantFinance-Databases case study). The engine accepts:
| Input | Shape | Notes |
|---|---|---|
| Prices | DataFrame[date × ticker] | Total-return adjusted close. Daily by default. |
| Signals | DataFrame[date × ticker] | Float, any unit — sizer handles normalization. |
| Universe | Series[date] → Set[ticker] | Point-in-time index membership to avoid survivorship. |
| Cost model | CostModel object | Pluggable; defaults to linear bps. Square-root impact supported. |
This separation means the same engine can be pointed at a different vendor or asset class without changes to engine code, as long as the inputs match the shapes above.
The decisions that make or break a backtest
Mandatory signal lag at the boundary
Why: Look-ahead is the #1 way backtests lie. The engine refuses to consume an unlagged signal: bt.run() raises if signals.index[-1] >= prices.index[-1].
Cost: Slightly more verbose user code; worth it.
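A sketch of the boundary check described above, using the same index comparison. The function name `check_signal_lag` and the exception type `UnlaggedSignalError` are hypothetical; what `bt.run()` actually raises is not specified here.

```python
import pandas as pd


class UnlaggedSignalError(ValueError):
    """Hypothetical exception type; the engine's real one may differ."""


def check_signal_lag(signals: pd.DataFrame, prices: pd.DataFrame) -> None:
    """Raise unless the last signal date is strictly before the last price date."""
    if signals.index[-1] >= prices.index[-1]:
        raise UnlaggedSignalError(
            f"last signal {signals.index[-1]} must precede "
            f"last price {prices.index[-1]}"
        )


dates = pd.date_range("2024-01-01", periods=4, freq="D")
prices = pd.DataFrame({"AAA": [100.0, 101.0, 99.0, 102.0]}, index=dates)
raw_signal = pd.DataFrame({"AAA": [1.0, -1.0, 1.0, 1.0]}, index=dates)

# A signal stamped on the final price date has no t+1 bar to fill on.
try:
    check_signal_lag(raw_signal, prices)
    rejected = False
except UnlaggedSignalError:
    rejected = True

# Dropping the final row leaves every signal with a later price to trade at.
check_signal_lag(raw_signal.iloc[:-1], prices)
```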
Vectorized over event-driven
Why: EOD strategies don't need event-driven simulation. Vectorized pandas is 100× faster and the bookkeeping is easier to audit.
Cost: Limit orders and intra-bar fills aren't modeled, which is fine for the daily strategies this targets.
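To illustrate why the vectorized bookkeeping is easy to audit, here is a small sketch with no per-bar loop. It assumes weights decided at the close of day t earn the return from t to t+1; the tickers, prices, and weights are made up.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5, freq="D")
prices = pd.DataFrame(
    {
        "AAA": [100.0, 102.0, 101.0, 103.0, 104.0],
        "BBB": [50.0, 49.0, 50.0, 50.5, 50.0],
    },
    index=dates,
)
weights = pd.DataFrame(
    {
        "AAA": [0.5, 0.5, -0.5, -0.5, 0.5],
        "BBB": [-0.5, -0.5, 0.5, 0.5, -0.5],
    },
    index=dates,
)

# Simple close-to-close returns; the first row is NaN by construction.
returns = prices / prices.shift(1) - 1.0

# Lag the weights one bar so day-t PnL only uses information from day t-1.
held = weights.shift(1)

# Per-name PnL, then the portfolio total: one line each, easy to inspect.
pnl_by_name = (held * returns).fillna(0.0)
gross_pnl = pnl_by_name.sum(axis=1)

# One-way turnover per day as the sum of absolute weight changes.
turnover = weights.diff().abs().sum(axis=1)
```

Every intermediate frame (`returns`, `held`, `pnl_by_name`) keeps the date × ticker shape, so any single cell can be checked by hand.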
Costs are a separate object, not a flag
Why: Linear bps is a placeholder. Real research needs square-root impact, fixed commission tiers, and borrow costs. Each is its own class implementing CostModel: composition over a config dict.
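One way the composition could look, as a sketch rather than the engine's actual classes. Only the name `CostModel` comes from the text; the `cost` method, `LinearBps`, `SqrtImpact`, `CompositeCost`, and all parameters are assumptions. The square-root term is a stylized form (cost per unit traded proportional to the square root of participation) with volatility omitted.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple

import numpy as np
import pandas as pd


class CostModel(Protocol):
    def cost(self, trades: pd.DataFrame) -> pd.Series:
        """Daily cost as a fraction of notional, given traded weights."""
        ...


@dataclass(frozen=True)
class LinearBps:
    bps: float  # cost per unit of traded weight, in basis points

    def cost(self, trades: pd.DataFrame) -> pd.Series:
        return trades.abs().sum(axis=1) * self.bps / 1e4


@dataclass(frozen=True)
class SqrtImpact:
    coeff: float       # hypothetical impact coefficient
    adv_weight: float  # hypothetical: daily volume as a multiple of notional

    def cost(self, trades: pd.DataFrame) -> pd.Series:
        size = trades.abs()
        participation = size / self.adv_weight
        return (self.coeff * np.sqrt(participation) * size).sum(axis=1)


@dataclass(frozen=True)
class CompositeCost:
    parts: Tuple[CostModel, ...]

    def cost(self, trades: pd.DataFrame) -> pd.Series:
        total = pd.Series(0.0, index=trades.index)
        for part in self.parts:
            total = total + part.cost(trades)
        return total


trades = pd.DataFrame({"AAA": [0.1, -0.2], "BBB": [0.0, 0.3]})
linear = LinearBps(bps=5.0).cost(trades)
combined = CompositeCost(
    (LinearBps(bps=5.0), SqrtImpact(coeff=0.001, adv_weight=10.0))
).cost(trades)
```

Adding borrow costs or commission tiers would then mean writing one more class with a `cost` method, not extending a config schema.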
Point-in-time universe is required
Why: Backtesting a strategy on today's S&P 500 constituents back to 2010 is a survivorship lie. The engine takes a point-in-time universe (membership as of each date) and enforces it.
Cost: Need to source historical index membership; well worth the friction.
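A sketch of how point-in-time membership might be enforced, assuming the universe arrives as a Series mapping each date to a set of tickers. The helper names `universe_mask` and `apply_universe` are hypothetical, and the membership data is made up.

```python
import pandas as pd


def universe_mask(universe: pd.Series, columns: pd.Index) -> pd.DataFrame:
    """Boolean date x ticker frame: True where the ticker is a member that day."""
    rows = [[ticker in members for ticker in columns] for members in universe]
    return pd.DataFrame(rows, index=universe.index, columns=columns)


def apply_universe(signals: pd.DataFrame, universe: pd.Series) -> pd.DataFrame:
    """Blank out signals for names outside the universe on each date."""
    missing = signals.index.difference(universe.index)
    if len(missing) > 0:
        raise KeyError(f"no universe membership for {len(missing)} signal dates")
    mask = universe_mask(universe.loc[signals.index], signals.columns)
    return signals.where(mask)  # non-members become NaN


dates = pd.date_range("2024-01-01", periods=2, freq="D")
signals = pd.DataFrame(
    {"AAA": [0.2, 0.4], "BBB": [-0.1, 0.3], "CCC": [0.9, -0.5]},
    index=dates,
)
# CCC joins on the second date and BBB drops out.
universe = pd.Series(
    [{"AAA", "BBB"}, {"AAA", "CCC"}], index=dates, dtype=object
)

masked = apply_universe(signals, universe)
```

Raising on missing dates, rather than silently forward-filling membership, keeps the failure mode explicit.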
Metrics are reported with confidence intervals
Why: A Sharpe of 1.5 over five years and a Sharpe of 1.5 over five months are different stories. The engine bootstraps a CI around every reported metric so I know when a result is small-sample noise.
TODO: confirm CI defaults match repo
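A minimal sketch of a percentile bootstrap around the Sharpe ratio, under the IID-resampling assumption. The function names and the defaults (`n_boot=2000`, `level=0.95`, 252 periods per year) are illustrative assumptions, not the engine's confirmed settings; the return series is simulated.

```python
import numpy as np


def sharpe(returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of a per-period return series."""
    returns = np.asarray(returns, dtype=float)
    return float(np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1))


def bootstrap_sharpe_ci(
    returns: np.ndarray,
    n_boot: int = 2000,
    level: float = 0.95,
    seed: int = 0,
    periods_per_year: int = 252,
) -> tuple:
    """Percentile CI from IID resampling with replacement."""
    returns = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(returns)
    idx = rng.integers(0, n, size=(n_boot, n))  # one row per resample
    samples = returns[idx]
    stats = (
        np.sqrt(periods_per_year)
        * samples.mean(axis=1)
        / samples.std(axis=1, ddof=1)
    )
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(lo), float(hi)


# Simulated daily returns: about five years, and the first five months of it.
sim = np.random.default_rng(42).normal(0.0006, 0.01, size=1260)
five_years = sim
five_months = sim[:105]

ci_long = bootstrap_sharpe_ci(five_years)
ci_short = bootstrap_sharpe_ci(five_months)
```

The shorter sample produces a much wider interval, which is the small-sample warning the metric layer is meant to surface.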
If the engine is wrong, every result downstream is wrong
- No look-ahead. A test feeds a signal that perfectly predicts t+1 returns and asserts the engine fails or returns a reasonable Sharpe — not an infinite one.
- Cost monotonicity. Increasing commission bps must produce a monotonically lower net Sharpe; tested across a grid.
- Cash invariant. Sum of position PnL each day equals the gross PnL series to 1e-9 — closes off any silent accounting drift.
- Turnover sanity. A constant-signal strategy has turnover near zero; an alternating-signal strategy has turnover near 2× notional.
- Bootstrap stability. Re-running the bootstrap with a fixed seed yields identical confidence intervals.
- TODO: add hypothesis-based property tests for edge cases (zero-vol asset, all-zero signal, etc.)
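Two of the checks above, sketched as pytest-style functions. The `run` helper is a hypothetical stand-in for the engine (lagged weights, linear bps cost on turnover), and the inputs are synthetic random data, so this shows the shape of the tests and not the repo's actual suite.

```python
import numpy as np
import pandas as pd


def run(weights: pd.DataFrame, returns: pd.DataFrame, bps: float = 0.0):
    """Hypothetical stand-in for the engine: returns (net PnL, turnover)."""
    held = weights.shift(1).fillna(0.0)          # trade on t, earn on t+1
    gross = (held * returns).sum(axis=1)
    turnover = weights.diff().abs().sum(axis=1)
    net = gross - turnover * bps / 1e4
    return net, turnover


def sharpe(pnl: pd.Series) -> float:
    return float(np.sqrt(252) * pnl.mean() / pnl.std(ddof=1))


def make_inputs(seed: int = 0, n_days: int = 500):
    """Synthetic returns and sign-based weights for four made-up tickers."""
    tickers = ["AAA", "BBB", "CCC", "DDD"]
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2022-01-03", periods=n_days, freq="B")
    shape = (n_days, len(tickers))
    returns = pd.DataFrame(rng.normal(0.0, 0.01, shape), index=dates, columns=tickers)
    weights = pd.DataFrame(
        np.sign(rng.normal(size=shape)) / len(tickers), index=dates, columns=tickers
    )
    return weights, returns


def test_cost_monotonicity():
    weights, returns = make_inputs()
    grid = (0.0, 1.0, 5.0, 10.0, 20.0)
    sharpes = [sharpe(run(weights, returns, bps)[0]) for bps in grid]
    assert all(a > b for a, b in zip(sharpes, sharpes[1:]))


def test_turnover_sanity():
    dates = pd.date_range("2022-01-03", periods=10, freq="B")
    flat = pd.DataFrame({"AAA": 0.0}, index=dates)  # zero returns
    constant = pd.DataFrame({"AAA": 1.0}, index=dates)
    alternating = pd.DataFrame(
        {"AAA": [1.0 if i % 2 == 0 else -1.0 for i in range(len(dates))]},
        index=dates,
    )
    assert run(constant, flat)[1].sum() == 0.0
    assert (run(alternating, flat)[1].iloc[1:] == 2.0).all()
```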
Sample strategy run — momentum, US equities
The numbers below are illustrative of what the engine produces; specific values depend on the universe and date range chosen.
The point of the engine is that those numbers come from one place, with one accounting, and the same code is what would gate a strategy's move from research into paper trading.
Equity curve, drawdown, exposure
TODO: export matplotlib runs to PNG, drop into ../assets/img/backtest-engine/, replace placeholders below.
(placeholder)
(placeholder)
(placeholder)
(placeholder)
What I'd want a reviewer to know
- EOD only. Intra-bar microstructure isn't modeled — this is a research tool for daily strategies, not an HFT simulator.
- No live execution path. The engine simulates fills; it does not connect to a broker.
- Costs are assumptions, not measurements. Linear bps is a starting point. For a real strategy, calibrate against measured fills.
- Borrow / shorting costs default to zero. Long-short strategies need a non-default cost model to be honest about financing.
- Single-currency. No FX P&L accounting for cross-listed names.
- Bootstrapped CIs assume IID returns. Block bootstrap is a TODO for serial correlation.
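For the block-bootstrap TODO, one common option is the moving-block bootstrap, which resamples contiguous runs of days so that short-range serial correlation survives inside each block. The sketch below only generates resampling indices; the function name and the block length of 20 are assumptions, and the engine may end up using a different variant.

```python
import numpy as np


def moving_block_indices(n: int, block: int, rng: np.random.Generator) -> np.ndarray:
    """Indices for one moving-block resample of a length-n series."""
    if not 1 <= block <= n:
        raise ValueError("block length must be between 1 and n")
    n_blocks = -(-n // block)  # ceiling division
    # Each block starts at a random position and runs for `block` days.
    starts = rng.integers(0, n - block + 1, size=n_blocks)
    idx = (starts[:, None] + np.arange(block)[None, :]).ravel()
    return idx[:n]  # trim the overshoot from the last block


rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=250)  # simulated daily returns

idx = moving_block_indices(len(returns), block=20, rng=rng)
resampled = returns[idx]
```

Feeding `resampled` into the same metric function used for the IID bootstrap would give one block-bootstrap draw; repeating it many times yields the interval.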