Latency & Infrastructure
Methodology
Latency in HFT is measured at every layer of the stack: network round-trip time, kernel processing, application logic, and order submission. Reducing tick-to-trade latency requires co-location, kernel bypass networking, and cache-optimized data structures.
This page covers the key infrastructure components and techniques used by institutional HFT firms, from co-location and DMA to FPGA acceleration and lock-free programming patterns.
This is the niche latency topic most teams avoid until production pain forces it: market-open microburst survival. Not median speed, not benchmark screenshots, but how your stack behaves in the first 30-90 seconds when message rate explodes, queue depths churn, and every hidden scheduling decision becomes visible in p99.99 latency.
The real problem: open-auction aftermath jitter
On normal flow, many stacks look "fast enough." At open, feed bursts expose architectural debt instantly: shared cores, drifting IRQs, remote NUMA memory, allocator spikes, and cold branch paths in risk code. The result is a latency plot with a clean median and catastrophic tails.
In one practical decomposition, we track tick-to-trade as:
T_total = T_rx + T_decode + T_signal + T_risk + T_serialize + T_tx
A useful discipline is to maintain tail budgets per segment. For example, if your hard target is a sub-6us tick-to-trade at p99.9, you do not allocate 5.8us to one "fast" stage and hope the rest behaves. Every stage must own a bounded tail budget.
What I tune first on a fresh host
- NIC NUMA node and process memory node must match
- RX poll thread, strategy thread, and TX thread pinned to isolated cores
- IRQ affinity locked to non-overlapping cores for the relevant queues
- Hot path allocation-free and lock-free under sustained burst
- Telemetry offloaded to non-critical cores
Teams often do 60% of this and still wonder why tails are unstable. The missing 40% usually sits at OS scheduling and hardware locality boundaries, not in business logic.
Niche bottleneck: RX ring pressure and parser backpressure mismatch
One subtle failure mode at open is the parser consuming bursts in variable-sized batches while downstream strategy logic handles fixed-sized work units. When RX burst size momentarily exceeds parser-to-strategy transfer capacity, queue depth oscillates and latency "breathes" in periodic spikes.
The fix is usually architectural, not magical: use fixed-capacity lock-free channels, pre-size decode objects, and align burst-handling policy across stages. If ingress processes 32 packets per poll but the strategy loop consumes only 8 packets' worth of work per iteration, you are injecting synthetic jitter.
Co-location and distance still matter, but only with deterministic internals
Co-location and clean cross-connect paths buy you a physics-level advantage. But a short fiber run does not rescue poor host determinism. If two firms are equally close, the one with the tighter p99.99 usually wins queue-position races more consistently over time.
Timestamping for forensic clarity
During incident reviews, unsynchronized clocks create fake narratives. Use a consistent timestamping strategy and stamp each segment boundary: at ingress, post-decode, post-decision, and pre-transmit. A single end-to-end metric hides root causes.
- Track p50, p95, p99, p99.9, and p99.99 per segment
- Store short rolling histograms during high-vol windows
- Correlate latency bursts with sequence-gap and cancel-rate bursts
A practical open-session runbook
Before market open, I run deterministic replay bursts at 1.2x to 1.5x the expected open message rate. If tail budgets fail, there is no deployment. After the open, I compare live segment histograms to the replay baseline and inspect drift within the first minute.
- Reject launch if replay p99.9 exceeds policy threshold
- Auto-degrade strategy aggressiveness if live tails exceed envelope
- Trigger quote-width expansion if cancel-ack latency spikes
- Fail safe when sequence-gap or stale-book risk appears
The niche insight is simple: speed helps, but repeatability under burst wins. A stable 5us system beats a flashy 1us median with random 20us spikes.