Deep Dive 6: NUMA, CPU Pinning, and Jitter Control (What I Actually Run)

I spent years optimizing median latency and then losing trades anyway because of p99.99 spikes. This guide is the playbook I now use to reduce tail jitter in a live low-latency stack. It is not a theoretical checklist. It is the exact sequence I run when I deploy a strategy host and need stable, boring, predictable microsecond behavior.

Why This Matters More Than Another 50ns Micro-Optimization

In production HFT, a strategy usually dies from variance, not from average speed. You can have a beautiful 700ns median packet-to-decision path and still lose money if a random scheduler event, IRQ burst, or NUMA remote memory access pushes some packets into 8-15us territory.

That tail behavior breaks queue position assumptions, invalidates your fill model, and silently turns a profitable simulator into a live disappointment. So this post is about discipline: pinning every critical thread, aligning memory locality, constraining interrupts, and continuously proving that jitter stayed under control.

The Three-Lane Mental Model I Use

I separate the machine into three lanes:

  • Lane A (RX lane): NIC receive queue thread, parser, and first-stage normalization.
  • Lane B (Strategy lane): signal computation, risk checks, quote/order decisions.
  • Lane C (Everything else): logging, metrics, persistence, shell activity, cron noise.

The cardinal rule is simple: Lane A and Lane B do not share cores with Lane C. Not even “just one helper thread.” If I violate this once, latency histograms show it immediately.
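
The rule is easy to state and easy to drift from, so I encode it as a startup check rather than a convention. A minimal sketch, where the core numbers and the `LaneMap` shape are illustrative rather than a real config:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical lane-to-core map; the shape is illustrative, not a real config.
constexpr std::size_t kMaxCpus = 64;
using CpuSet = std::bitset<kMaxCpus>;

inline CpuSet make_set(std::initializer_list<int> cpus) {
    CpuSet s;
    for (int c : cpus) s.set(static_cast<std::size_t>(c));
    return s;
}

struct LaneMap {
    CpuSet rx;        // Lane A: NIC RX, parser, normalization
    CpuSet strategy;  // Lane B: signals, risk, quote/order decisions
    CpuSet misc;      // Lane C: logging, metrics, everything else

    // The cardinal rule, enforced at startup instead of by convention.
    bool valid() const {
        return (rx & misc).none() && (strategy & misc).none();
    }
};
```

I call `valid()` once at boot and refuse to start on any overlap. A histogram will tell you about a violation eventually; the startup check tells you immediately.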

Step 1: Map NUMA Topology Before Touching the Code

First thing I do on a new box:

lscpu -e=CPU,NODE,SOCKET,CORE
numactl --hardware
cat /sys/class/net/eth0/device/numa_node
ethtool -i eth0

I want to know which NUMA node owns the NIC. If the NIC is on NUMA node 1 and my strategy thread runs on CPUs in node 0, I am paying remote-memory penalties and extra interconnect traffic for no benefit.

Practical Rule

Keep NIC RX/TX threads, parser buffers, and strategy hot data on the same NUMA node whenever possible.
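
To make that rule checkable rather than aspirational, I read the NIC's node from sysfs at startup and compare it to where the hot threads will land. A sketch; the sysfs path format is real, the startup-check policy around it is mine:

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Read a device's NUMA node from sysfs, e.g.
// /sys/class/net/eth0/device/numa_node. The kernel reports -1 when the
// node is unknown or the system is not NUMA; we also return -1 on error.
inline int read_numa_node(const std::string& sysfs_path) {
    std::ifstream f(sysfs_path);
    int node = -1;
    if (f >> node) return node;
    return -1;
}
```

If the returned node disagrees with the node of the configured strategy cores, I fail startup instead of eating remote-memory penalties all day.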

Step 2: CPU Isolation at Boot (Not Just `taskset`)

`taskset` alone is not enough. The kernel can still schedule housekeeping tasks, timer work, or softirq activity on “your” cores unless you isolate them deliberately.

A typical low-jitter kernel command line I use as a baseline:

isolcpus=2-9 nohz_full=2-9 rcu_nocbs=2-9 intel_pstate=disable processor.max_cstate=1 idle=poll

This is aggressive and power-inefficient, but very effective for jitter control during market hours. I keep management and background work on cores 0-1, and reserve 2-9 for the hot path.

After reboot, I verify isolation and housekeeping placement before doing anything else.
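
Verification means reading back what the kernel actually applied, not trusting the boot line. `/sys/devices/system/cpu/isolated` holds a CPU list like `2-9`; a small parser makes the check scriptable. The parsing helper is mine, the file format is the kernel's:

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Parse a kernel CPU list ("2-9", "0,2,4-7") as found in
// /sys/devices/system/cpu/isolated and .../nohz_full.
inline std::set<int> parse_cpu_list(const std::string& list) {
    std::set<int> cpus;
    std::stringstream ss(list);
    std::string tok;
    while (std::getline(ss, tok, ',')) {
        if (tok.empty()) continue;
        const auto dash = tok.find('-');
        if (dash == std::string::npos) {
            cpus.insert(std::stoi(tok));
        } else {
            const int lo = std::stoi(tok.substr(0, dash));
            const int hi = std::stoi(tok.substr(dash + 1));
            for (int c = lo; c <= hi; ++c) cpus.insert(c);
        }
    }
    return cpus;
}
```

I assert the parsed set matches the cores I put on the command line, and that no housekeeping work is scheduled inside it.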

Step 3: IRQ Affinity and RSS Queue Ownership

Most teams pin application threads but forget interrupt affinity. Then NIC interrupts hit random cores, and tails explode.

My approach:

  1. Create fixed RX queues per strategy lane.
  2. Pin each queue IRQ to a dedicated isolated core.
  3. Pin the user-space poll/parse thread to the same core or sibling pair based on throughput needs.
  4. Keep RSS hash deterministic so flow-to-core mapping is stable.

# discover IRQs for the NIC
grep -i eth0 /proc/interrupts

# pin an IRQ (example IRQ 141) to CPU 4
# (smp_affinity takes a hex bitmask: 0x10 = bit 4 = CPU 4)
echo 10 > /proc/irq/141/smp_affinity

# verify
cat /proc/irq/141/smp_affinity_list

I monitor `/proc/interrupts` during replay to confirm packet interrupts are incrementing only where expected. If an unrelated IRQ drifts into the isolated set, I fix that before running strategy tests.
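
Because `smp_affinity` takes a hex CPU bitmask, I generate the mask strings instead of hand-computing them. A helper I'd sketch like this (boxes with more than 64 CPUs need the kernel's comma-separated word format, which this sketch ignores):

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hex bitmask in the format /proc/irq/<n>/smp_affinity expects,
// for a single CPU: CPU 4 -> bit 4 -> "10".
inline std::string smp_affinity_mask(int cpu) {
    const std::uint64_t mask = std::uint64_t{1} << cpu;
    std::ostringstream os;
    os << std::hex << mask;
    return os.str();
}
```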

Step 4: Memory Placement and Huge Pages for Predictability

For hot buffers, I use NUMA-local allocation and huge pages where it helps TLB pressure. The win is not always raw speed; the win is fewer surprise stalls.

numactl --cpunodebind=1 --membind=1 ./hft_engine --config prod.yaml

If I run DPDK, I pre-provision huge pages per NUMA node, verify socket memory allocation, and reject startup if memory lands on the wrong node. Silent fallback is dangerous.
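
The "reject startup" point deserves code, because the default behavior of most allocation paths is exactly the silent fallback I want to avoid. A sketch of the huge-page side; `MAP_HUGETLB` is the real flag, the fail-loudly policy is mine, and NUMA binding (via mbind/libnuma) is omitted for brevity:

```cpp
#include <cstddef>
#include <stdexcept>
#include <sys/mman.h>

// Allocate huge-page-backed memory or fail loudly. No transparent
// fallback to 4K pages: if the pool on this node is exhausted, I want
// startup to abort, not a quiet TLB-pressure regression.
inline void* alloc_huge_or_die(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        throw std::runtime_error("huge page mmap failed; refusing to start");
    }
    return p;
}
```

Note the call fails outright on hosts with no huge pages reserved, which is the point: provisioning mistakes surface at startup, not at market open.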

Step 5: Pin Every Thread Explicitly in Code

“Main thread pinned” is not enough. Parser thread, strategy thread, risk thread, TX thread, and even timer threads should have explicit affinity and scheduling policy based on their role.

#include <pthread.h>
#include <sched.h>
#include <stdexcept>

inline void pin_current_thread(int cpu) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);

    if (pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) != 0) {
        throw std::runtime_error("failed to set thread affinity");
    }
}

inline void set_realtime_fifo(int priority) {
    sched_param sp{};
    sp.sched_priority = priority;
    if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) != 0) {
        throw std::runtime_error("failed to set SCHED_FIFO");
    }
}

I only grant real-time priority to truly critical threads and keep watchdog/kill-switch logic on reliable reserved cores. Misusing real-time policies can starve your own safety systems.
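
When a lane thread launches, I pin it immediately and then read the affinity back, because "I called the API" is not the same as "the kernel agreed." A standalone sketch that inlines the affinity call; it pins to CPU 0 only so it runs anywhere, whereas production targets come from the lane plan:

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a freshly launched lane thread, then verify the affinity took
// by reading it back before any lane work starts.
inline void run_pinned(int cpu) {
    std::thread t([cpu] {
        cpu_set_t want;
        CPU_ZERO(&want);
        CPU_SET(cpu, &want);
        pthread_setaffinity_np(pthread_self(), sizeof(want), &want);

        cpu_set_t got;
        CPU_ZERO(&got);
        pthread_getaffinity_np(pthread_self(), sizeof(got), &got);
        assert(CPU_ISSET(cpu, &got) && CPU_COUNT(&got) == 1);

        // ... lane work runs here, affinity proven ...
    });
    t.join();
}
```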

Step 6: Keep the Hot Path Allocation-Free and Syscall-Free

Once market data starts, my hot path does not allocate, lock, or write logs to disk. Not “rarely.” Never. Every accidental `malloc`, filesystem write, or blocking syscall eventually appears as tail jitter.

I route telemetry to lock-free ring buffers and flush on non-critical cores. Hot path should look like: parse packet, update book state, compute quote, serialize order, submit.
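
The ring itself does not need to be clever. A minimal single-producer/single-consumer sketch of what I mean; the real one carries fixed-size records and pads against false sharing, which I omit here:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Minimal SPSC ring for telemetry records: one producer (hot path),
// one consumer (telemetry core). Capacity must be a power of two.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // written only by producer
    std::atomic<std::size_t> tail_{0};  // written only by consumer

public:
    bool try_push(const T& v) {
        const auto h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false;  // full
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool try_pop(T& out) {
        const auto t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
};
```

`try_push` drops on overflow by design: losing a telemetry record beats blocking the strategy thread.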

Step 7: Measure at Nanosecond Resolution and Store Histograms

I treat timing like unit tests. Every release must pass latency budgets, not just PnL checks.

#include <x86intrin.h>
#include <cstdint>

inline uint64_t rdtscp_cycles() {
    unsigned aux;
    // rdtscp waits for prior instructions to retire before reading the
    // TSC, but unlike cpuid+rdtsc it does not fence later instructions
    return __rdtscp(&aux);
}

struct LatencyProbe {
    uint64_t t_rx;
    uint64_t t_decision;
    uint64_t t_tx;
};

void publish_probe(const LatencyProbe&); // consumed on the telemetry core

inline void on_packet() {
    const uint64_t t0 = rdtscp_cycles();
    // parse + book update + signal + risk + order build
    const uint64_t t1 = rdtscp_cycles();
    // tx submit
    const uint64_t t2 = rdtscp_cycles();

    // write to preallocated ring; consume on telemetry core
    publish_probe({t0, t1, t2});
}

I calibrate cycles-to-nanoseconds on startup and continuously track p50/p95/p99/p99.9/p99.99 for each segment: RX-to-parse, parse-to-decision, decision-to-TX.
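
Extracting those percentiles from a batch of samples is simple enough to show. The live system uses fixed-bucket histograms so the telemetry core never allocates, but this sorted-copy version (mine, for clarity) defines what the numbers mean:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank-style percentile over a batch of latency samples (ns).
// p is in [0, 100]. Copy-and-sort is fine offline, but never on a hot
// or telemetry core -- there I bump histogram buckets instead.
inline std::uint64_t percentile(std::vector<std::uint64_t> samples, double p) {
    if (samples.empty()) return 0;
    std::sort(samples.begin(), samples.end());
    const auto idx =
        static_cast<std::size_t>((p / 100.0) * (samples.size() - 1) + 0.5);
    return samples[idx];
}
```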

A Short Hardware Section (Only What Actually Moves the Needle)

I keep hardware tuning focused and minimal:

  • Low-latency NIC with solid user-space tooling: stability and driver maturity matter more than marketing claims.
  • CPU frequency behavior: fixed/performance governor is usually better than aggressive power-save transitions for jitter-sensitive paths.
  • NUMA-aware PCIe placement: place NIC and hot threads on the same socket whenever possible.
  • RAM consistency: prefer stable, validated memory configs over pushing memory overclocks that increase unpredictability.

I avoid turning this into a hardware shopping essay because software and OS placement mistakes usually dominate before exotic hardware does.

The Bring-Up Checklist I Run on Every Host

  1. Verify NUMA topology and NIC node placement.
  2. Apply boot-time CPU isolation and reboot.
  3. Bind IRQs and confirm interrupt counts only hit intended cores.
  4. Launch process with NUMA cpu/memory bind.
  5. Pin each critical thread in code and verify at runtime.
  6. Replay market data burst and collect latency histograms.
  7. Reject deployment if p99.99 breaches strategy budget.

Common Failure Modes I Keep Seeing

  • “It worked in staging”: staging lacked market-open packet burst intensity.
  • Pinned app threads, unpinned IRQs: core isolation looked correct but interrupt noise destroyed tails.
  • Cross-node allocations: process pinned correctly, memory not pinned, causing remote access spikes.
  • Hidden logging path: error branch occasionally performed blocking I/O on a critical thread.
  • Percentiles too shallow: teams looked at p99, missed p99.99 outliers that actually lose queue priority.

Final Notes

This is the part of HFT engineering that feels less glamorous than strategy research, but it is often where live edge is protected. You are not just building a fast program. You are building a deterministic machine behavior profile under stress.

If your latency plot is smooth and boring at market open, your strategy work finally has a fair chance.

Next continuation: a full tick-to-trade observability stack design (drop counters, replay parity checks, and production-safe kill-switch telemetry).