Simulation & Backtesting
Methodology
HFT simulation requires tick-level or order-book data replay rather than daily OHLCV bars. Historical tick data is replayed in chronological order, with configurable latency, queue position modeling, and fill assumptions (passive vs aggressive).
Key simulation parameters include round-trip latency, maker/taker fee structure, and queue position at the time of order submission. Survivorship bias and data quality are critical considerations when sourcing historical tick data.
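These parameters can be bundled into a single configuration object so every replay run is reproducible. A minimal sketch, with field names and defaults that are illustrative assumptions rather than any specific engine's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimConfig:
    # Round-trip latency in microseconds (order out + ack back).
    round_trip_latency_us: float = 250.0
    # Maker rebate and taker fee, in price units per unit traded.
    maker_rebate: float = 0.0002
    taker_fee: float = 0.0003
    # Passive-fill assumption: "queue" (model queue position) or
    # "touch" (optimistic: fill whenever price trades at our level).
    passive_fill_model: str = "queue"

cfg = SimConfig(round_trip_latency_us=180.0)
```

Freezing the dataclass keeps a run's parameters immutable once the replay starts.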
This section dives into a niche but decisive problem: realistic passive-fill simulation. Most HFT backtests fail because they model alpha well but model queue dynamics poorly. If your fill model is optimistic, every downstream result looks better than reality.
Research objective: estimate fill hazard, not just fill ratio
Instead of asking "did this order fill?", ask "what is the instantaneous hazard of fill given queue state and flow?" That turns simulation into a conditional survival problem:
h(t | x) = lim(dt -> 0) P(fill in [t, t+dt] | not filled by t, x) / dt
Here `x` includes queue-ahead size, short-horizon order-flow imbalance (OFI), spread regime, and local trade intensity. This model naturally captures why two orders at the same price can have very different outcomes.
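Conditioning on `x` is often implemented by bucketing orders into discrete states and estimating a discrete-time hazard within each bucket. A minimal sketch of the unconditional estimator (the per-bucket version simply applies it to each state bucket's orders):

```python
def discrete_hazard(order_lifetimes, filled, n_bins, bin_width):
    """Estimate the discrete-time fill hazard
    h[j] = P(fill in bin j | still working at start of bin j).

    order_lifetimes: seconds each order was working before fill or cancel
    filled: parallel booleans, True if the order ended in a fill
    """
    at_risk = [0] * n_bins
    fills = [0] * n_bins
    for life, was_filled in zip(order_lifetimes, filled):
        end_bin = min(int(life / bin_width), n_bins - 1)
        # The order is at risk in every bin up to and including its last.
        for j in range(end_bin + 1):
            at_risk[j] += 1
        if was_filled:
            fills[end_bin] += 1
    return [f / r if r else 0.0 for f, r in zip(fills, at_risk)]
```

Cancelled orders are censored: they contribute to the at-risk counts but never to the fill counts, which is exactly what the survival formulation requires.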
Event-driven replay with queue state vectors
In each replay step, we maintain a queue-state vector for every working order:
- Q_ahead: estimated visible queue ahead at placement
- dQ_trade: queue consumed by aggressive trades at level
- dQ_cancel: queue reduction from cancellations ahead
- dQ_insert: new same-price arrivals that gain priority over us (zero on strict price-time-priority venues; relevant under pro-rata allocation or when our own queue position is uncertain)
Order fill progression then follows:
Q_ahead(t+1) = Q_ahead(t) - dQ_trade - dQ_cancel + dQ_insert
Fill is triggered when `Q_ahead <= 0` with venue-consistent timing and ack-latency constraints.
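A minimal sketch of that per-step update, ignoring the ack-latency gating and venue-consistent timing that the full engine must enforce:

```python
def step_queue(q_ahead, traded, canceled_ahead, inserted_ahead):
    """One replay step of the queue-state update:
    Q_ahead(t+1) = Q_ahead(t) - dQ_trade - dQ_cancel + dQ_insert.

    Returns (new_q_ahead, filled), where filled is True once the
    order reaches the front of the queue (Q_ahead <= 0).
    """
    q = q_ahead - traded - canceled_ahead + inserted_ahead
    return max(q, 0), q <= 0
```

In a full engine the `filled` flag would only take effect after the venue's matching rules and our ack latency are applied, not instantaneously.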
Calibration workflow from historical data
A practical calibration pipeline for this niche problem:
- Collect order-level outcomes with timestamps and queue context features
- Bucket by spread state, volatility state, and time-of-day regime
- Fit hazard or discrete-time fill-probability model per bucket
- Validate calibration drift using rolling out-of-sample windows
This avoids one global model that silently underfits open/close behavior and overfits midday calm.
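The bucketed-fit step can be sketched as follows, with an empirical fill probability standing in for a proper hazard fit. The bucket thresholds below are illustrative assumptions, not calibrated values:

```python
from collections import defaultdict

def bucket_key(spread_ticks, vol_regime, hour):
    # Coarse, illustrative bucketing; real thresholds come from data.
    spread_state = "wide" if spread_ticks > 1 else "tight"
    tod = "open" if hour < 10 else ("close" if hour >= 15 else "mid")
    return (spread_state, vol_regime, tod)

def fit_fill_prob_per_bucket(orders):
    """orders: dicts with spread_ticks, vol_regime, hour, filled.
    Returns the empirical fill probability per bucket (a stand-in
    for fitting a hazard model within each bucket)."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [fills, total]
    for o in orders:
        k = bucket_key(o["spread_ticks"], o["vol_regime"], o["hour"])
        counts[k][0] += o["filled"]
        counts[k][1] += 1
    return {k: f / n for k, (f, n) in counts.items()}
```

Rolling out-of-sample validation then amounts to re-running this fit on trailing windows and comparing bucket estimates across windows.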
PnL decomposition that exposes model errors
We keep a strict decomposition:
NetPnL = SpreadCapture + Rebates - Fees - MarkoutLoss - Slippage - InventoryCarry
If simulation overestimates queue quality, you will usually see inflated SpreadCapture and artificially low MarkoutLoss. This decomposition makes that mismatch obvious during post-run diagnostics.
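The decomposition is a signed sum, which makes it trivial to assert during post-run diagnostics; a sketch:

```python
def net_pnl(spread_capture, rebates, fees, markout_loss,
            slippage, inventory_carry):
    """NetPnL = SpreadCapture + Rebates - Fees
              - MarkoutLoss - Slippage - InventoryCarry."""
    return (spread_capture + rebates - fees
            - markout_loss - slippage - inventory_carry)
```

In practice each component is accumulated per fill, and a run fails validation if the components do not reconcile with raw NetPnL.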
Markout parity tests (the most useful validation I run)
For each simulated fill, compute k-horizon markouts and compare live vs sim distributions:
Markout_k = side * (Mid_(t+k) - FillPrice_t), with side = +1 for buys, -1 for sells
If simulated short-horizon markouts are systematically better than live for similar queue states, your hazard model or cancellation-latency model is too optimistic.
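A minimal markout helper under the convention side = +1 for buys and -1 for sells, assuming mid prices sampled on a fixed grid starting at the fill time:

```python
def markout(side, fill_price, mid_path, k):
    """Markout_k = side * (Mid_{t+k} - FillPrice_t).

    side: +1 for a buy fill, -1 for a sell fill
    mid_path: mid prices sampled from the fill time (index 0) onward
    k: horizon in grid steps
    """
    return side * (mid_path[k] - fill_price)
```

The parity test then compares the distribution of these values, bucketed by queue state, between live fills and simulated fills.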
Latency model coupling with queue model
Queue simulation and latency simulation cannot be separated. Cancellation delay changes queue exposure window, which directly changes fill and markout quality. I model latency as regime-dependent random variables:
T_cancel ~ D_regime, regime ∈ {open, midday, news}
Then evaluate strategy robustness across these distributions, not one fixed value.
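One way to realize regime-dependent cancel latency is a lognormal distribution per regime. The regime names match the text, but the medians and dispersions below are illustrative assumptions:

```python
import math
import random

# (median_us, sigma) per regime -- illustrative values, not calibrated.
LATENCY_REGIMES = {
    "open":   (400.0, 0.8),
    "midday": (150.0, 0.3),
    "news":   (900.0, 1.0),
}

def sample_cancel_latency_us(regime, rng=random):
    """Draw a cancel latency (microseconds) for the given regime.
    Lognormal parameterized by its median: exp(mu) = median."""
    median, sigma = LATENCY_REGIMES[regime]
    return rng.lognormvariate(math.log(median), sigma)
```

Robustness evaluation then reruns the replay under each regime's distribution (and stressed variants) rather than a single fixed latency.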
Statistical checks before trusting results
- Parameter stability across rolling windows
- Error bounds for fill probability and markout estimates
- Sensitivity of EV to small latency shifts and fee changes
- Out-of-sample degradation from open to close regimes
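The rolling-window stability check above can be sketched generically: re-fit the parameter of interest on successive windows and inspect the spread of the estimates:

```python
def rolling_estimates(values, window, step, estimator):
    """Re-estimate a parameter on rolling windows to check stability.

    values: time-ordered observations
    estimator: callable mapping a window of values to an estimate
    Returns the list of per-window estimates; a wide spread signals
    parameter instability.
    """
    out = []
    for start in range(0, len(values) - window + 1, step):
        out.append(estimator(values[start:start + window]))
    return out
```

The same helper serves the sensitivity checks: pass an estimator that re-simulates EV under a shifted latency or fee and compare the per-window results.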
For summary quality metrics:
Sharpe = mean(r) / std(r) * sqrt(N), where N is the number of return periods per year; MaxDrawdown = max peak-to-trough decline of the cumulative PnL curve
But for this niche use case, calibration error on fill/markout often matters more than Sharpe itself.
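Those summary metrics, as defined above, in a dependency-free sketch (N is the number of return periods per year; sample standard deviation is used):

```python
import math

def sharpe(returns, periods_per_year):
    """Annualized Sharpe: mean(r) / std(r) * sqrt(N)."""
    mu = sum(returns) / len(returns)
    var = sum((r - mu) ** 2 for r in returns) / (len(returns) - 1)
    return mu / math.sqrt(var) * math.sqrt(periods_per_year)

def max_drawdown(pnl_curve):
    """Largest peak-to-trough decline of a cumulative PnL series."""
    peak, mdd = float("-inf"), 0.0
    for x in pnl_curve:
        peak = max(peak, x)
        mdd = max(mdd, peak - x)
    return mdd
```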
How to use this in the project workflow
Use the engine iteratively: fit queue-hazard assumptions, replay with production-like latency, compare simulated and realized markouts, then tighten only the mismatched components. This cycle is slower than simplistic backtesting, but it is what makes live behavior converge with research behavior.