Object-Oriented Programming (OOP) taught you to model reality. In High-Frequency Trading, that model is a liability; the only reality that matters is the CPU's hardware execution pipeline. Data-Oriented Design (DoD) forces you to lay out your memory for how the processor's prefetcher actually breathes in data.
CPUs are incredibly fast, executing instructions in under 0.5 nanoseconds. But Main Memory (DRAM) is painfully slow, taking 60 to 100 nanoseconds just to retrieve a single integer. When a processor asks for a variable that is not currently loaded into its local L1 or L2 cache, it triggers a Cache Miss. The CPU must stall its pipeline for ~100 ns while the data crosses the memory bus.
The 64-Byte Fetch Rule
Whenever the CPU fetches data from Main Memory, it NEVER fetches just 1 byte or 4 bytes. It fetches an entire 64-byte block (a Cache Line). If you request a 4-byte `price` integer, the CPU dutifully retrieves those 4 bytes alongside the 60 bytes physically next to it in RAM.
If those 60 neighboring bytes contain useless objects or padding, you just wasted over 90% of your memory bandwidth. But if those adjacent bytes hold the exact data your algorithm needs to process next, then up to 15 subsequent loop iterations execute with zero additional memory latency!
Consider a typical Limit Order Book processing loop. You want to iterate over 1,000 active orders to calculate the Total Volume resting at a specific price level.
In traditional OOP, you build an Array of Structs (AoS):
struct Order {
    uint64_t order_id;   // 8 bytes
    double   price;      // 8 bytes
    uint32_t volume;     // 4 bytes
    bool     is_buy;     // 1 byte
    uint8_t  flags[8];   // 8 bytes
    // Compiler adds 3 bytes of tail padding to round up to the 8-byte alignment!
};                       // Total == 32 bytes
std::vector<Order> book;
Each Order is exactly 32 bytes. When your loop iterates over book[i].volume to sum up the sizes, each 64-byte cache line the CPU fetches contains exactly two complete Order structs (64 / 32 = 2).
When you sum the volume, you read 4 bytes for Order 0's volume and 4 bytes for Order 1's volume. You used exactly 8 of the 64 bytes in the cache line; the remaining 56 bytes (prices, order IDs, flags) were fetched and discarded. You suffer one cache miss for every 2 iterations.
Data-Oriented Design dictates that we group data by how the algorithm consumes it, not by logical human "objects".
struct OrderBookSoA {
    // Arrays separate the fields completely!
    std::vector<uint64_t> order_ids;
    std::vector<double>   prices;
    std::vector<uint32_t> volumes;
    std::vector<uint8_t>  is_buy;

    // Summation logic
    uint32_t sum_volume_at_price(double target_price, size_t count) {
        uint32_t total = 0;
        for (size_t i = 0; i < count; ++i) {
            if (prices[i] == target_price) {
                total += volumes[i];
            }
        }
        return total;
    }
};
Now look at the volumes array. It is a dense, contiguous array of packed 4-byte integers. When the CPU pulls a 64-byte cache line from DRAM, it grabs exactly 16 contiguous volumes (64 / 4 = 16).
Your loop runs 16 iterations with zero additional RAM lookups. You only take one cache miss for every 16 operations—an 8x reduction in misses. Furthermore, because the memory access is strictly linear and predictable, the CPU's hardware prefetcher will asynchronously fetch the *next* block of 16 volumes into L1 before your loop even asks for them!
Struct Padding and Alignment
C++ compilers do not pack struct members tightly. They strictly align each member to its natural byte boundary: a uint64_t must start at an address divisible by 8.
struct BadAlignment {
    char   a;   // 1 byte
    // COMPILER INSERTS 7 BYTES OF PADDING HERE
    double c;   // 8 bytes
    float  b;   // 4 bytes
    // COMPILER INSERTS 4 BYTES OF PADDING AT THE END
};              // Total Size: 24 bytes! Unbelievably wasteful.
When designing message-format structures (especially structs that parse binary network packets like NASDAQ ITCH directly off the wire), arrange members from largest byte size down to smallest to eliminate the implicit padding holes entirely.
struct PerfectAlignment {
    double c;   // 8 bytes (offset 0)
    float  b;   // 4 bytes (offset 8)
    char   a;   // 1 byte  (offset 12)
    // COMPILER INSERTS 3 BYTES at the END to pad out to an 8-byte boundary.
};              // Total Size: 16 bytes.
The Hardware Prefetcher
Modern Intel and AMD processors have dedicated hardware logic that watches your program's memory read patterns. If it observes your thread requesting addresses `0x1000`, `0x1040`, `0x1080`..., it recognizes a forward stride of 64 bytes. The CPU autonomously issues a read for `0x10C0` directly into the L1 cache ahead of time!
Pointer chasing kills prefetchers. Linked lists (`std::list`), node-based maps (`std::unordered_map`), and graph structures store pointers that scatter nodes across gigabytes of virtual memory, and the next address is only known once the current node has loaded. The prefetcher cannot run ahead, so your core stalls on DRAM continuously. Always default to flat, dense, linear `std::vector` or raw arrays.
Once a miss reaches DRAM, the latency cannot be hidden; prevention is your only tool. If your structs are misaligned, or your algorithms iterate over massive monolithic "OOP" objects just to check tiny boolean flags, you are blinding the hardware prefetcher and squandering your 64-byte fetches on padding and unused fields.
Coming in Deep Dive 4: Moving out of the application's RAM entirely. Let's dig into bypassing the bloated Linux kernel network stack and fetching datagrams straight off the Network Interface Card (NIC) with DPDK.