I spent years optimizing median latency and then losing trades anyway because of p99.99 spikes. This guide is the playbook I now use to reduce tail jitter in a live low-latency stack. It is not a theoretical checklist. It is the exact sequence I run when I deploy a strategy host and need stable, boring, predictable microsecond behavior.
In production HFT, a strategy usually dies from variance, not from average speed. You can have a beautiful 700ns median packet-to-decision path and still lose money if a random scheduler event, IRQ burst, or NUMA remote memory access pushes some packets into 8-15us territory.
That tail behavior breaks queue position assumptions, invalidates your fill model, and silently turns a profitable simulator into a live disappointment. So this post is about discipline: pinning every critical thread, aligning memory locality, constraining interrupts, and continuously proving that jitter stayed under control.
I separate the machine into three lanes: two lanes that carry the critical path (Lane A and Lane B) and a third for everything else, from management daemons to background logging (Lane C).
The cardinal rule is simple: Lane A and Lane B do not share cores with Lane C. Not even “just one helper thread.” If I violate this once, latency histograms show it immediately.
First thing I do on a new box:
```bash
lscpu -e=CPU,NODE,SOCKET,CORE
numactl --hardware
cat /sys/class/net/eth0/device/numa_node
ethtool -i eth0
```
I want to know which NUMA node owns the NIC. If the NIC is on NUMA node 1 and my strategy thread runs on CPUs in node 0, I am paying remote-memory penalties and extra interconnect traffic for no benefit.
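If I want the engine itself to enforce this at startup, a minimal sketch reads the same sysfs entry; the interface name and the hard-fail policy are my assumptions, not something this playbook prescribes:

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Read which NUMA node owns the NIC from sysfs (returns -1 when the
// platform does not report one, e.g. some single-socket boxes).
inline int nic_numa_node(const std::string& ifname) {
    std::ifstream f("/sys/class/net/" + ifname + "/device/numa_node");
    int node = -1;
    if (!(f >> node)) {
        throw std::runtime_error("could not read NUMA node for " + ifname);
    }
    return node;
}
```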
Practical Rule
Keep NIC RX/TX threads, parser buffers, and strategy hot data on the same NUMA node whenever possible.
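One way to act on that rule in code is to allocate hot buffers explicitly on the NIC's node with libnuma. This is a sketch under that assumption (the post does not say which allocator it uses), and `alloc_on_nic_node` is an illustrative name:

```cpp
#include <numa.h>  // link with -lnuma
#include <cstddef>
#include <stdexcept>

// Allocate a hot buffer on a specific NUMA node (the one that owns the NIC).
// Caller releases it later with numa_free(ptr, bytes).
inline void* alloc_on_nic_node(std::size_t bytes, int nic_node) {
    if (numa_available() < 0) {
        throw std::runtime_error("libnuma reports NUMA is unavailable");
    }
    void* p = numa_alloc_onnode(bytes, nic_node);
    if (p == nullptr) {
        throw std::runtime_error("NUMA-local allocation failed");
    }
    return p;
}
```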
`taskset` alone is not enough. The kernel can still schedule housekeeping tasks, timer work, or softirq activity on “your” cores unless you isolate them deliberately.
A typical low-jitter kernel command line I use as a baseline:
```
isolcpus=2-9 nohz_full=2-9 rcu_nocbs=2-9 intel_pstate=disable processor.max_cstate=1 idle=poll
```
This is aggressive and power-inefficient, but very effective for jitter control during market hours. I keep management and background work on cores 0-1, and reserve 2-9 for the hot path.
After reboot, I verify isolation and housekeeping placement before doing anything else.
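The exact checks depend on the box, but the spirit is something like this (core range 2-9 taken from the command line above):

```bash
# confirm the kernel honored the isolation flags
cat /sys/devices/system/cpu/isolated
cat /sys/devices/system/cpu/nohz_full
# list anything currently scheduled on the reserved cores (2-9 here)
ps -eLo pid,tid,psr,comm --sort=psr | awk 'NR > 1 && $3 >= 2 && $3 <= 9'
```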
Most teams pin application threads but forget interrupt affinity. Then NIC interrupts hit random cores, and tails explode.
My approach:
```bash
# discover IRQs for the NIC
grep -i eth0 /proc/interrupts
# pin an IRQ (example IRQ 141) to CPU 4 -- smp_affinity takes a hex CPU mask, 0x10 = CPU 4
echo 10 > /proc/irq/141/smp_affinity
# verify
cat /proc/irq/141/smp_affinity_list
```
I monitor `/proc/interrupts` during replay to confirm packet interrupts are incrementing only where expected. If an unrelated IRQ drifts into the isolated set, I fix that before running strategy tests.
For hot buffers, I use NUMA-local allocation and huge pages where they relieve TLB pressure. The win is not always raw speed; the win is fewer surprise stalls.
```bash
numactl --cpunodebind=1 --membind=1 ./hft_engine --config prod.yaml
```
If I run DPDK, I pre-provision huge pages per NUMA node, verify socket memory allocation, and reject startup if memory lands on the wrong node. Silent fallback is dangerous.
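For the hugepage side, a sketch of per-node reservation and verification; node 1, 2 MiB pages, and the page count are placeholders:

```bash
# reserve 2 MiB huge pages on the NIC's node (node 1, 1024 pages as an example)
echo 1024 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
# verify per-node placement instead of trusting the global counter
grep -H . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
numastat -m | grep -i huge
```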
“Main thread pinned” is not enough. Parser thread, strategy thread, risk thread, tx thread, and even timer threads should have explicit affinity and scheduling policy based on their role.
```cpp
#include <pthread.h>
#include <sched.h>
#include <stdexcept>

// Pin the calling thread to a single CPU so the scheduler cannot migrate it.
inline void pin_current_thread(int cpu) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    if (pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) != 0) {
        throw std::runtime_error("failed to set thread affinity");
    }
}

// Give the calling thread a SCHED_FIFO priority; reserve this for hot-path threads.
inline void set_realtime_fifo(int priority) {
    sched_param sp{};
    sp.sched_priority = priority;
    if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) != 0) {
        throw std::runtime_error("failed to set SCHED_FIFO");
    }
}
```
I only grant real-time priority to truly critical threads and keep watchdog/kill-switch logic on reliable reserved cores. Misusing real-time policies can starve your own safety systems.
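Wiring a role up with these helpers might look like the following; the core number, priority, and `run_parser_loop` are placeholders rather than recommendations:

```cpp
#include <thread>

void run_parser_loop();  // hypothetical hot loop, defined elsewhere

// Illustrative wiring: the parser gets its own isolated core and a
// real-time policy; other roles get their own cores the same way.
std::thread start_parser_thread() {
    return std::thread([] {
        pin_current_thread(4);
        set_realtime_fifo(80);
        run_parser_loop();
    });
}
```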
Once market data starts, my hot path does not allocate, lock, or write logs to disk. Not “rarely.” Never. Every accidental `malloc`, filesystem write, or blocking syscall eventually appears as tail jitter.
I route telemetry to lock-free ring buffers and flush on non-critical cores. Hot path should look like: parse packet, update book state, compute quote, serialize order, submit.
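The post doesn't show its ring, so here is a minimal single-producer/single-consumer sketch of the shape it describes; the name, fixed power-of-two capacity, and drop-on-full policy are my assumptions:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

// Minimal single-producer / single-consumer ring: the hot path pushes,
// a telemetry thread on a non-critical core drains. No locks, no
// allocation, no syscalls on the producer side; if the ring is full we drop.
template <typename T, std::size_t CapacityPow2>
class SpscRing {
    static_assert((CapacityPow2 & (CapacityPow2 - 1)) == 0,
                  "capacity must be a power of two");
public:
    bool try_push(const T& item) {  // hot-path side
        const uint64_t head = head_.load(std::memory_order_relaxed);
        const uint64_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == CapacityPow2) return false;  // full: drop, never block
        buf_[head & (CapacityPow2 - 1)] = item;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> try_pop() {  // telemetry-core side
        const uint64_t tail = tail_.load(std::memory_order_relaxed);
        const uint64_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return std::nullopt;  // empty
        T item = buf_[tail & (CapacityPow2 - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return item;
    }

private:
    std::array<T, CapacityPow2> buf_{};
    alignas(64) std::atomic<uint64_t> head_{0};  // producer-owned index
    alignas(64) std::atomic<uint64_t> tail_{0};  // consumer-owned index
};
```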
I treat timing like unit tests. Every release must pass latency budgets, not just PnL checks.
```cpp
#include <x86intrin.h>
#include <cstdint>

// RDTSCP waits for all prior instructions to execute before reading the TSC,
// which makes it a cheap, reasonably ordered timestamp for segment timing.
inline uint64_t rdtscp_cycles() {
    unsigned aux;
    return __rdtscp(&aux);
}

struct LatencyProbe {
    uint64_t t_rx;
    uint64_t t_decision;
    uint64_t t_tx;
};

// Defined elsewhere: writes into a preallocated ring drained by the telemetry core.
void publish_probe(const LatencyProbe& probe);

inline void on_packet() {
    const uint64_t t0 = rdtscp_cycles();
    // parse + book update + signal + risk + order build
    const uint64_t t1 = rdtscp_cycles();
    // tx submit
    const uint64_t t2 = rdtscp_cycles();
    // write to preallocated ring; consume on telemetry core
    publish_probe({t0, t1, t2});
}
```
I calibrate cycles-to-nanoseconds on startup and continuously track p50/p95/p99/p99.9/p99.99 for each segment: RX-to-parse, parse-to-decision, decision-to-TX.
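For the startup calibration, one minimal approach is to time a short CLOCK_MONOTONIC_RAW interval in TSC cycles; this assumes an invariant TSC, reuses `rdtscp_cycles` from above, and the 50 ms interval is arbitrary:

```cpp
#include <cstdint>
#include <time.h>

// Rough startup calibration: count TSC cycles across a short
// CLOCK_MONOTONIC_RAW interval and derive nanoseconds per cycle.
// A real deployment would repeat the measurement and sanity-check it.
inline double calibrate_ns_per_cycle() {
    timespec ts0{}, ts1{};
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts0);
    const uint64_t c0 = rdtscp_cycles();
    do {  // spin for roughly 50 ms of wall time
        clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
    } while ((ts1.tv_sec - ts0.tv_sec) * 1'000'000'000LL
                 + (ts1.tv_nsec - ts0.tv_nsec) < 50'000'000LL);
    const uint64_t c1 = rdtscp_cycles();
    const double elapsed_ns = (ts1.tv_sec - ts0.tv_sec) * 1e9
                            + static_cast<double>(ts1.tv_nsec - ts0.tv_nsec);
    return elapsed_ns / static_cast<double>(c1 - c0);
}
```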
I keep hardware tuning focused and minimal, and I avoid turning this into a hardware shopping essay because software and OS placement mistakes usually dominate before exotic hardware does.
This is the part of HFT engineering that feels less glamorous than strategy research, but it is often where live edge is protected. You are not just building a fast program. You are building a deterministic machine behavior profile under stress.
If your latency plot is smooth and boring at market open, your strategy work finally has a fair chance.
Coming next: a full tick-to-trade observability stack design (drop counters, replay parity checks, and production-safe kill-switch telemetry).