Deep Dive 2: Defeating the Heap with Custom Memory Allocators

In the ultra-competitive landscape of quantitative execution, your application's relationship with the operating system must be strictly on your own terms. Relying on the standard libc `malloc` or the C++ `new` operator for dynamic object allocation is an archaic practice that introduces unpredictable, effectively unbounded latency spikes. We will fundamentally sever our reliance on the OS heap.

The Pathology of malloc and new

When a standard C++ application invokes new Order(), it hands control to the language runtime, which delegates to the general-purpose heap allocator (glibc's ptmalloc, jemalloc, or similar). The allocator must traverse its internal data structures (often bins and linked lists of variable-sized memory blocks) to find a sufficiently large hole to satisfy your request; only when its own arenas are exhausted does it fall back to the kernel's memory manager.

The Fragmentation Penalty

As your strategy runs over a trading session, allocating and freeing orders of various sizes, the heap fragments. The traversal time to find a free block grows non-deterministically. Even worse, if the allocator has no suitable free block left in its arenas, it must issue a `brk()` or `mmap()` syscall to map new pages into your virtual address space: a stall on the order of 10,000+ nanoseconds.

In HFT, average latency is a vanity metric; worst-case latency is all that matters. If your 99.99th percentile tick-to-trade latency spikes from 1µs to 12µs because the kernel had to fault in freshly mapped pages while your code was creating an ExecutionReport, your algorithm just lost the trade to a competitor.

The Solution: Arena and Slab Allocators

The ironclad rule of low-latency architecture is: All memory must be pre-allocated during application initialization.

Zero allocations should occur on the hot path (during active market hours). Instead, we allocate massive, contiguous blocks of memory at startup—often referred to as an "Arena" or a "Memory Pool."

  • Arena Allocator: A large block of pre-allocated memory where allocation simply involves moving a pointer forward (a "bump allocator"). Individual deallocations are forbidden; the arena is reset as a batch at the end of the trading day.
  • Slab Allocator (Object Pool): A highly structured pool managing chunks of identically sized objects (e.g., 256-byte Order nodes). When an object is "freed," it isn't returned to the OS; it is simply placed onto a local lock-free free-list to be reused in O(1) time.

Implementing a Template Memory Pool

To integrate a custom Object Pool into an existing C++ codebase without rewriting every call site, we override the class-specific operator new and operator delete.

#include <vector>
#include <cstdint>
#include <cstddef>
#include <cstring>  // std::memcpy, used by LimitOrder below
#include <cassert>

template <typename T, size_t BlockSize = 4096>
class ObjectPool {
private:
    union Node {
        T data;
        Node* next; // Intrusive free-list pointer overlaying the object memory

        // User-provided (empty) special members: required so the union stays
        // default-constructible even when T has no default constructor.
        Node() {}
        ~Node() {}
    };

    Node* freeList_ = nullptr;
    std::vector<std::vector<Node>> blocks_; // inner vectors never resize, so Node addresses stay stable

    void allocateBlock() {
        // Emplace a new contiguous chunk of generic nodes.
        // Note: Only happens at startup or in extreme exhaustion cases.
        blocks_.emplace_back(BlockSize);
        auto& block = blocks_.back();

        // Chain the newly allocated contiguous memory into our free list.
        for (size_t i = 0; i < BlockSize - 1; ++i) {
            block[i].next = &block[i + 1];
        }
        block[BlockSize - 1].next = freeList_;
        freeList_ = &block[0];
    }

public:
    ObjectPool() {
        // Prewarm the pool on construction (long before the Opening Bell)
        allocateBlock();
    }

    // O(1) allocation: popping the free-list head is a handful of instructions.
    // Deliberately not noexcept: the exhaustion fallback below may allocate and throw.
    void* allocate() {
        if (!freeList_) {
            // CRITICAL PATH VIOLATION: we exhausted our pre-allocation.
            // A strict HFT build would treat this as fatal rather than grow the pool.
            allocateBlock(); 
        }
        Node* node = freeList_;
        freeList_ = freeList_->next; // Pop from the singly-linked list
        return node;
    }

    // O(1) deallocation: push the chunk back onto the free list
    void deallocate(void* ptr) noexcept {
        if (!ptr) return;
        Node* node = static_cast<Node*>(ptr);
        
        // Intrusively overwrite the stale T memory with our free-list metadata
        node->next = freeList_;
        freeList_ = node; // Push head
    }
};

Integrating the Allocator via operator new

Now that our ultra-fast O(1) memory pool exists, we must force the C++ compiler to route our application's class instantiations into the pool rather than the OS heap. We achieve this by overloading the memory operators directly inside our target structure.

class LimitOrder {
private:
    uint64_t order_id_;
    uint32_t price_;
    uint32_t volume_;
    char symbol_[8];

    // Thread-local static memory pool: no locks or atomics are needed on allocation.
    static thread_local ObjectPool<LimitOrder, 65536> pool_;

public:
    LimitOrder(uint64_t id, uint32_t px, uint32_t vol, const char* sym)
        : order_id_(id), price_(px), volume_(vol) {
        std::memcpy(symbol_, sym, 8); // caller must supply at least 8 readable bytes
    }

    // OVERLOAD: Route 'new LimitOrder()' calls directly into our pool
    static void* operator new(size_t size) {
        assert(size == sizeof(LimitOrder));
        return pool_.allocate();
    }

    // OVERLOAD: Route 'delete order' calls to return memory to our pool
    static void operator delete(void* ptr) noexcept {
        pool_.deallocate(ptr);
    }
};

// Instantiate the static pool. Each execution thread gets its independent pool (no mutexes!)
thread_local ObjectPool<LimitOrder, 65536> LimitOrder::pool_;

The Intrusive Free-List Trick

Look closely at the union Node block inside the template code. This is an elegant display of memory overlaying.

When a LimitOrder is "allocated," that memory chunk operates as completely standard struct memory—holding prices, volumes, and IDs. But what happens when we delete the order?

Instead of creating separate, bloated metadata structs outside the pool to remember which chunks are free, we reuse the dead object's own storage as a basic Node* next pointer! We intrusively embed the linked-list traversal pointers inside the freed object's memory itself.

Because C++ unions map their members to the exact same physical byte offsets, storing the next pointer inherently corrupts whatever price/volume data used to be in the order. But we don't care: the order is dead! This means our memory framework carries zero per-object overhead. For any T at least as large as a pointer (and LimitOrder comfortably is), sizeof(Node) is identical to sizeof(T); only objects smaller than a pointer get padded up.

Conclusion

By moving to statically sized, pre-warmed, thread-local Slab Allocators featuring intrusive free lists, memory allocation collapses to a handful of CPU instructions: pop a pointer, return it. We sever the hot path's dependency on `brk` and `mmap` entirely and tightly bound execution jitter.

Coming in Deep Dive 3: We expand on structure design by tearing apart "LimitOrder" structs, looking at the exact byte boundaries, alignment padding, and how SoA (Struct of Arrays) allows us to hijack CPU hardware prefetchers.