Case Study · Backtesting Framework

backtest_engine

A reusable Python framework for evaluating systematic strategies — signal evaluation, vector-friendly sizing, transaction-cost modeling, and a metrics layer built around honest accounting of look-ahead and slippage.

Status · Active
Year · 2025–present
Stack · Python · NumPy · pandas · matplotlib
Tested on · US equities EOD


01 Problem

Most ad-hoc backtests lie — and the ways they lie are well-known

Look-ahead bias. Survivorship bias. Frictionless fills. Sharpe ratios that don't survive a 5 bps transaction cost. Every quant researcher writes the same scaffolding from scratch and gets the same things wrong in subtly different ways.

This engine is the answer to "can I trust this PnL?" — a single framework where the inputs (prices, signals, costs) are well-typed, the accounting is auditable, and the failure modes are explicit. If a backtest passes here, I can stand behind the number.

02 Architecture

Strategy in, metrics out — with every transformation auditable

┌──────────────────────────────────────────────────────────────┐
│                      INPUTS                                  │
│   prices: DataFrame[date × ticker]    (adjusted close)       │
│   signals: DataFrame[date × ticker]   (model output)         │
│   universe: Set[ticker] per date      (point-in-time)        │
│   cost_model: CostModel                                      │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
                   ┌────────────────────┐
                   │  align & lag       │   t+1 fills, no peek
                   └────────┬───────────┘
                            ▼
                   ┌────────────────────┐
                   │ position sizing    │   vol-target / rank / fixed
                   └────────┬───────────┘
                            ▼
                   ┌────────────────────┐
                   │ apply costs        │   commission + slippage
                   └────────┬───────────┘
                            ▼
                   ┌────────────────────┐
                   │ PnL accumulator    │   gross, net, by sleeve
                   └────────┬───────────┘
                            ▼
                   ┌────────────────────┐
                   │ metrics & reports  │   Sharpe / DD / hit / turnover
                   └────────────────────┘

The contract is small: signals come in lagged, exits come out audited. Each stage in the pipeline is a pure function over DataFrames, which makes it trivial to swap a sizing rule or cost model without touching the rest.

# minimal usage
from backtest_engine import Backtest, VolTargetSizer, LinearCosts

bt = Backtest(
    prices=prices,
    signals=momentum_zscore,
    sizer=VolTargetSizer(target=0.10, lookback=63),
    costs=LinearCosts(commission_bps=1.0, slippage_bps=5.0),
    rebalance="W-FRI",
)
result = bt.run()
print(result.metrics)   # Sharpe, Sortino, max DD, turnover, hit rate
result.plot()           # equity curve + drawdown + exposure
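To make the "pure function over DataFrames" idea concrete, here is a minimal sketch of how such stages could compose. Every function name below (lag_signals, rank_sizer, fixed_sizer, run_pipeline) is hypothetical and not taken from the engine, and the timing convention (weights on row t are built from signals on row t-1 and earn the close-to-close return ending at t) is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stage functions; they illustrate the pattern, not the engine's API.

def lag_signals(signals: pd.DataFrame, periods: int = 1) -> pd.DataFrame:
    """Shift signals forward so a value stamped t is first usable on row t+1."""
    return signals.shift(periods)

def rank_sizer(signals: pd.DataFrame) -> pd.DataFrame:
    """Cross-sectional rank weights, demeaned and scaled to unit gross exposure."""
    ranks = signals.rank(axis=1)
    centered = ranks.sub(ranks.mean(axis=1), axis=0)
    gross = centered.abs().sum(axis=1).replace(0.0, np.nan)
    return centered.div(gross, axis=0).fillna(0.0)

def fixed_sizer(signals: pd.DataFrame) -> pd.DataFrame:
    """Equal-weight long every name with a positive signal."""
    longs = (signals > 0).astype(float)
    count = longs.sum(axis=1).replace(0.0, np.nan)
    return longs.div(count, axis=0).fillna(0.0)

def run_pipeline(prices, signals, sizer, cost_bps=0.0) -> pd.DataFrame:
    """Compose the stages: lag, size, charge linear costs on turnover, accumulate."""
    returns = (prices / prices.shift(1) - 1.0).fillna(0.0)
    weights = sizer(lag_signals(signals))
    gross = (weights * returns).sum(axis=1)
    turnover = weights.diff().abs().sum(axis=1).fillna(0.0)
    net = gross - turnover * cost_bps / 1e4
    return pd.DataFrame({"gross": gross, "net": net, "turnover": turnover})

dates = pd.bdate_range("2024-01-01", periods=6)
prices = pd.DataFrame(
    {"AAA": [100, 101, 102, 103, 104, 105], "BBB": [50, 49.5, 49, 48.5, 48, 47.5]},
    index=dates,
    dtype=float,
)
signals = pd.DataFrame({"AAA": 1.0, "BBB": -1.0}, index=dates)

# Swapping the sizing rule is a one-argument change; nothing else moves.
by_rank = run_pipeline(prices, signals, rank_sizer, cost_bps=5.0)
by_fixed = run_pipeline(prices, signals, fixed_sizer, cost_bps=5.0)
print(by_rank.round(6))
```

Because each stage takes and returns plain DataFrames, any intermediate (lagged signals, weights, turnover) can be inspected or diffed on its own when a result looks suspicious.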

03 Data sources

The engine doesn't fetch data — it consumes it

By design, ingestion is somebody else's problem (see the QuantFinance-Databases case study). The engine accepts:

Input	Shape	Notes
Prices	DataFrame[date × ticker]	Total-return adjusted close. Daily by default.
Signals	DataFrame[date × ticker]	Float, any unit — sizer handles normalization.
Universe	Series[date] → Set[ticker]	Point-in-time index membership to avoid survivorship.
Cost model	CostModel object	Pluggable; defaults to linear bps. Square-root impact supported.

This separation means the same engine can be pointed at a different vendor or asset class with zero changes.

04 Design choices

The decisions that make or break a backtest

Mandatory signal lag at the boundary

Why: Look-ahead is the #1 way backtests lie. The engine refuses to consume an unlagged signal: bt.run() raises if signals.index[-1] >= prices.index[-1].
Cost: Slightly more verbose user code; worth it.
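A minimal sketch of what such a boundary check could look like, using the condition quoted above. The function name check_signal_lag and the exception class UnlaggedSignalError are hypothetical; the text does not say what the engine actually raises.

```python
import pandas as pd

class UnlaggedSignalError(ValueError):
    """Hypothetical exception type used only for this sketch."""

def check_signal_lag(signals: pd.DataFrame, prices: pd.DataFrame) -> None:
    """Raise if the last signal timestamp is not strictly before the last price."""
    if signals.index[-1] >= prices.index[-1]:
        raise UnlaggedSignalError(
            f"last signal {signals.index[-1]} is not before last price {prices.index[-1]}"
        )

def is_rejected(signals: pd.DataFrame, prices: pd.DataFrame) -> bool:
    """Small helper so the demo below can show both outcomes without a traceback."""
    try:
        check_signal_lag(signals, prices)
    except UnlaggedSignalError:
        return True
    return False

dates = pd.bdate_range("2024-01-01", periods=5)
prices = pd.DataFrame({"AAA": [100.0, 101.0, 102.0, 103.0, 104.0]}, index=dates)
signals = pd.DataFrame({"AAA": [0.1, 0.2, -0.1, 0.3, 0.0]}, index=dates)

print(is_rejected(signals, prices))            # signal ends on the same date as prices
print(is_rejected(signals.iloc[:-1], prices))  # signal ends one bar earlier
```

A check like this only inspects the index boundary. It catches the common mistake of passing a raw, same-day signal, but it cannot see look-ahead that was baked into the signal values upstream.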

Vectorized over event-driven

Why: EOD strategies don't need event-driven simulation. Vectorized pandas is roughly 100× faster than an event loop, and the bookkeeping is easier to audit.
Cost: Limit orders and intra-bar fills aren't modeled, which is fine for the asset class this targets.
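One way to see why the vectorized form is easy to audit: the whole daily PnL calculation is a single expression, and it can be checked against a deliberately naive loop. The sketch below uses synthetic data and hypothetical function names (pnl_vectorized, pnl_loop); it makes no claim about the engine's internals or about timings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-01", periods=250)
tickers = ["AAA", "BBB", "CCC"]
returns = pd.DataFrame(rng.normal(0.0, 0.01, (250, 3)), index=dates, columns=tickers)
weights = pd.DataFrame(rng.uniform(-1.0, 1.0, (250, 3)), index=dates, columns=tickers)

# Weights decided on row t-1 are the ones held over the return on row t.
held = weights.shift(1).fillna(0.0)

def pnl_vectorized(held: pd.DataFrame, returns: pd.DataFrame) -> pd.Series:
    """Daily portfolio return as one elementwise product and a row sum."""
    return (held * returns).sum(axis=1)

def pnl_loop(held: pd.DataFrame, returns: pd.DataFrame) -> pd.Series:
    """Reference implementation: the same accounting, one cell at a time."""
    out = []
    for date in returns.index:
        day = 0.0
        for ticker in returns.columns:
            day += held.at[date, ticker] * returns.at[date, ticker]
        out.append(day)
    return pd.Series(out, index=returns.index)

vec = pnl_vectorized(held, returns)
ref = pnl_loop(held, returns)
max_abs_diff = float((vec - ref).abs().max())
print(f"max abs difference between the two: {max_abs_diff:.2e}")
```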

Costs are a separate object, not a flag

Why: Linear bps is a placeholder. Real research needs square-root impact, fixed commission tiers, and borrow costs. Each is its own class implementing CostModel: composition over a config dict.
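A sketch of what "composition over a config dict" could look like in practice. CostModel and LinearCosts (with commission_bps and slippage_bps) are named in the text, but the cost(trades) method signature, SqrtImpactCosts, CompositeCosts, and the coefficient value are all assumptions made for illustration. The impact term also uses trade size in portfolio-weight units rather than participation in market volume, which is a simplification.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

import numpy as np
import pandas as pd

class CostModel(Protocol):
    """Assumed interface: map a frame of trades (weight changes) to a daily cost."""
    def cost(self, trades: pd.DataFrame) -> pd.Series: ...

@dataclass(frozen=True)
class LinearCosts:
    commission_bps: float = 1.0
    slippage_bps: float = 5.0

    def cost(self, trades: pd.DataFrame) -> pd.Series:
        # Cost is proportional to the absolute weight traded.
        rate = (self.commission_bps + self.slippage_bps) / 1e4
        return trades.abs().sum(axis=1) * rate

@dataclass(frozen=True)
class SqrtImpactCosts:
    coefficient_bps: float = 10.0  # hypothetical value, not calibrated

    def cost(self, trades: pd.DataFrame) -> pd.Series:
        # Per-unit cost grows with sqrt(size), so total cost grows with size ** 1.5.
        size = trades.abs()
        return (size * np.sqrt(size)).sum(axis=1) * self.coefficient_bps / 1e4

@dataclass(frozen=True)
class CompositeCosts:
    models: Sequence[CostModel]

    def cost(self, trades: pd.DataFrame) -> pd.Series:
        # Sum the component models; each one stays independently testable.
        total = pd.Series(0.0, index=trades.index)
        for model in self.models:
            total = total + model.cost(trades)
        return total

trades = pd.DataFrame({"AAA": [0.5, 0.0], "BBB": [-0.5, 0.0]})
linear = LinearCosts(commission_bps=1.0, slippage_bps=5.0)
impact = SqrtImpactCosts()
combined = CompositeCosts(models=(linear, impact))
print(combined.cost(trades))
```

Adding borrow costs or commission tiers would then mean writing one more small class and appending it to the tuple, rather than threading another flag through the engine.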

Point-in-time universe is required

Why: Backtesting a strategy on today's S&P 500 constituents back to 2010 is a survivorship lie. The engine takes a universe-as-of function and enforces it.
Cost: Need to source historical index membership; well worth the friction.
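As an illustration of enforcing point-in-time membership, the sketch below masks signals with a per-date set of tickers, following the Series[date] → Set[ticker] shape described for the universe input. The function names universe_mask and apply_universe and the tickers are hypothetical; how the engine itself enforces the universe is not shown in the text.

```python
import pandas as pd

def universe_mask(universe: pd.Series, columns: pd.Index) -> pd.DataFrame:
    """Boolean frame: True where the ticker was a member on that date."""
    rows = [[ticker in members for ticker in columns] for members in universe]
    return pd.DataFrame(rows, index=universe.index, columns=columns)

def apply_universe(signals: pd.DataFrame, universe: pd.Series) -> pd.DataFrame:
    """Blank out signals for names that were not in the universe on that date."""
    missing = signals.index.difference(universe.index)
    if len(missing) > 0:
        raise KeyError(f"no universe membership for {len(missing)} signal dates")
    mask = universe_mask(universe.loc[signals.index], signals.columns)
    return signals.where(mask)

dates = pd.bdate_range("2024-01-01", periods=3)
# OLD leaves the index and NEW joins it on the third date.
universe = pd.Series(
    [{"AAA", "OLD"}, {"AAA", "OLD"}, {"AAA", "NEW"}],
    index=dates,
)
signals = pd.DataFrame(1.0, index=dates, columns=["AAA", "OLD", "NEW"])

masked = apply_universe(signals, universe)
print(masked)
```

Names outside the universe come back as NaN rather than zero, so a downstream sizer can tell "not eligible" apart from "eligible with a flat signal".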

Metrics are reported with confidence intervals

Why: A Sharpe of 1.5 measured over five years tells a different story than the same figure over five months. The engine bootstraps a CI around every reported metric so I know when a result is small-sample noise.
TODO: confirm CI defaults match repo
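A minimal sketch of a percentile bootstrap around an annualized Sharpe ratio, run on synthetic returns to show how the interval widens on a short sample. The function names, the 252-day annualization, the 1000 resamples, and the 95% level are assumptions for illustration, not the engine's confirmed defaults. The resampling is IID, which ignores serial correlation.

```python
import numpy as np

def annualized_sharpe(returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Mean over sample standard deviation, scaled to annual units."""
    return float(np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1))

def bootstrap_sharpe_ci(returns, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI: resample days with replacement, recompute Sharpe."""
    r = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, r.size, size=(n_boot, r.size))
    samples = r[idx]
    sharpes = np.sqrt(252) * samples.mean(axis=1) / samples.std(axis=1, ddof=1)
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(sharpes, [alpha, 1.0 - alpha])
    return float(lo), float(hi)

# Synthetic daily returns: about five years, and the first five months of the same series.
rng = np.random.default_rng(42)
five_years = rng.normal(0.0006, 0.01, 1260)
five_months = five_years[:105]

ci_long = bootstrap_sharpe_ci(five_years)
ci_short = bootstrap_sharpe_ci(five_months)
width_long = ci_long[1] - ci_long[0]
width_short = ci_short[1] - ci_short[0]
print(f"5y  Sharpe {annualized_sharpe(five_years):.2f}, CI width {width_long:.2f}")
print(f"5mo Sharpe {annualized_sharpe(five_months):.2f}, CI width {width_short:.2f}")
```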

05 Tests

If the engine is wrong, every result downstream is wrong

No look-ahead. A test feeds a signal that perfectly predicts t+1 returns and asserts the engine fails or returns a reasonable Sharpe — not an infinite one.
Cost monotonicity. Increasing commission bps must produce a monotonically lower net Sharpe; tested across a grid.
Cash invariant. Sum of position PnL each day equals the gross PnL series to 1e-9 — closes off any silent accounting drift.
Turnover sanity. A constant-signal strategy has turnover near zero; an alternating-signal strategy has turnover near 2× notional.
Bootstrap stability. Re-running the bootstrap with a fixed seed yields identical confidence intervals.
TODO: add hypothesis-based property tests for edge cases (zero-vol asset, all-zero signal, etc.)
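Two of the checks above (turnover sanity and cost monotonicity) can be sketched against a toy single-asset stand-in. The toy_backtest function is hypothetical and far simpler than the engine; it exists only so the test shapes are runnable here. The test functions follow the pytest naming convention but need nothing beyond NumPy and pandas.

```python
import numpy as np
import pandas as pd

def toy_backtest(returns: pd.Series, signal: pd.Series, cost_bps: float):
    """Hypothetical stand-in: hold the sign of yesterday's signal, pay linear costs."""
    position = np.sign(signal).shift(1).fillna(0.0)
    turnover = position.diff().abs().fillna(0.0)
    net = position * returns - turnover * cost_bps / 1e4
    return net, turnover

def sharpe(returns: pd.Series) -> float:
    return float(np.sqrt(252) * returns.mean() / returns.std(ddof=1))

n = 500
dates = pd.bdate_range("2022-01-03", periods=n)
returns = pd.Series(np.random.default_rng(7).normal(0.0003, 0.01, n), index=dates)
constant = pd.Series(1.0, index=dates)
alternating = pd.Series(np.where(np.arange(n) % 2 == 0, 1.0, -1.0), index=dates)

def test_turnover_sanity():
    # Constant signal: one entry trade, then nothing.
    _, turnover = toy_backtest(returns, constant, cost_bps=0.0)
    assert turnover.iloc[1] == 1.0
    assert turnover.iloc[2:].sum() == 0.0
    # Alternating signal: flips from +1 to -1 each day, so 2x notional per day.
    _, turnover = toy_backtest(returns, alternating, cost_bps=0.0)
    assert np.allclose(turnover.iloc[2:], 2.0)

def test_cost_monotonicity():
    # On a strategy that actually trades, higher costs must lower net Sharpe.
    grid = (0.0, 1.0, 5.0, 10.0, 25.0)
    sharpes = [sharpe(toy_backtest(returns, alternating, c)[0]) for c in grid]
    assert all(a > b for a, b in zip(sharpes, sharpes[1:]))

test_turnover_sanity()
test_cost_monotonicity()
print("both checks passed")
```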

06 Results

Sample strategy run — momentum, US equities

The numbers below are illustrative of what the engine produces; specific values depend on the universe and date range chosen.

Net Sharpe (5y): TODO
Max drawdown: TODO
Annual turnover: TODO
Hit rate: TODO
Run time (10y EOD): TODO
Cost as % of gross: TODO

The point of the engine is that those numbers come from one place, with one accounting, and that the same code would gate a strategy's move from research into paper trading.

07 Screenshots

Equity curve, drawdown, exposure

TODO: export matplotlib runs to PNG, drop into ../assets/img/backtest-engine/, replace placeholders below.

Equity curve
(placeholder)

Drawdown waterfall
(placeholder)

Rolling Sharpe
(placeholder)

Exposure heatmap
(placeholder)

08 Limitations

What I'd want a reviewer to know

EOD only. Intra-bar microstructure isn't modeled — this is a research tool for daily strategies, not an HFT simulator.
No live execution path. The engine simulates fills; it does not connect to a broker.
Costs are assumptions, not measurements. Linear bps is a starting point. For a real strategy, calibrate against measured fills.
Borrow / shorting costs default to zero. Long-short strategies need a non-default cost model to be honest about financing.
Single-currency. No FX P&L accounting for cross-listed names.
Bootstrapped CIs assume IID returns. Block bootstrap is a TODO for serial correlation.
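For reference, a moving-block bootstrap is one common way to respect serial correlation: it resamples contiguous runs of days instead of single days. The sketch below is an illustration of that technique under assumed parameters (21-day blocks, 1000 resamples, 95% level) and hypothetical function names. It is not the engine's implementation, which the list above marks as still to do.

```python
import numpy as np

def moving_block_indices(n: int, block: int, rng: np.random.Generator) -> np.ndarray:
    """Indices for one resample of length n built from contiguous blocks."""
    if not 1 <= block <= n:
        raise ValueError("block length must be between 1 and the sample size")
    n_blocks = -(-n // block)  # ceiling division
    starts = rng.integers(0, n - block + 1, size=n_blocks)
    idx = (starts[:, None] + np.arange(block)[None, :]).ravel()
    return idx[:n]  # trim the overhang from the last block

def block_bootstrap_sharpe_ci(returns, block=21, n_boot=1000, level=0.95, seed=0):
    """Percentile CI for annualized Sharpe using moving-block resamples."""
    r = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        sample = r[moving_block_indices(r.size, block, rng)]
        stats[b] = np.sqrt(252) * sample.mean() / sample.std(ddof=1)
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(lo), float(hi)

# Synthetic daily returns, only to exercise the functions.
daily = np.random.default_rng(1).normal(0.0005, 0.01, 756)
example_idx = moving_block_indices(daily.size, 21, np.random.default_rng(0))
ci = block_bootstrap_sharpe_ci(daily)
print(f"block-bootstrap CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```

The block length is a tuning choice: longer blocks preserve more of the dependence structure but leave fewer distinct blocks to draw from.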