Summary
This postmortem analyzes the performance tradeoffs between compact bit-packed representations (using bitmasks and shifts) and expanded aligned representations (fat capabilities) for capabilities in a CHERI-like architecture. The key finding is that memory locality and cache efficiency dominate raw ALU cost, making compact representations generally superior in real systems despite requiring more bitwise operations. However, access patterns and instruction scheduling critically determine the actual performance delta.
Root Cause
The fundamental performance tension arises from two competing system bottlenecks:
- Memory bandwidth and latency (dominant in most modern systems)
- CPU pipeline and execution unit availability (typically abundant)
The compact representation exploits that memory is orders of magnitude slower than ALU operations, while the expanded representation assumes aligned access is “free” (which is only true in limited L1 cache scenarios).
Primary technical root causes:
- Memory subsystem saturation: Compact representations reduce memory footprint by 50%, improving cache hit rates and reducing DRAM traffic
- Pipeline stalls from dependency chains: Bit manipulation operations (masks, shifts) create dependency chains that can stall on data hazards, but modern CPUs have deep pipelines and multiple execution units to absorb this
- Alignment checking overhead: While aligned access avoids bus errors, unaligned access on modern x86/ARM is often handled in microcode with minimal penalty if the data stays within cache lines
- Vectorization barriers: Bit-packed fields complicate SIMD vectorization, which can matter for bulk operations
Why This Happens in Real Systems
Modern architectures optimize for spatial locality and sequential access patterns:
- Cache line granularity: CPUs fetch 64-byte cache lines; four expanded 128-bit capabilities fit in a single cache line, while eight compact 64-bit capabilities fit in the same line, doubling density
- Prefetcher effectiveness: Hardware prefetchers recognize sequential access patterns better when data is contiguous; bit-packing breaks this pattern
- Virtual memory overhead: Each capability access may trigger TLB lookups; compact representations reduce the number of TLB entries needed for large arrays
- Real-world example: Database systems and kernel data structures (e.g., Linux task_struct) use bitfields for memory efficiency despite the masking overhead, because cache misses are 100-300x more expensive than a bit operation
Critical insight: In practice, memory latency is the bottleneck, not instruction throughput. The expanded representation’s aligned access only provides a benefit when the data is already hot in L1 cache; for cold data or large working sets, the compact representation wins due to better cache utilization.
Real-World Impact
Performance impacts vary dramatically by access pattern:
- Array traversal (capabilities as objects in a container):
- Compact: 30-50% faster due to reduced memory footprint and better cache locality
- Expanded: Suffers from capacity misses when arrays exceed L2/L3 cache sizes
- Impact: Scalability limits at high object counts
- Random access (lookup by capability ID):
- Compact: Requires 1-3 additional instructions (AND, SHIFT) per field
- Expanded: Direct load/store; zero bitwise overhead
- Impact: Negligible (<2% difference) for isolated accesses; matters for tight loops
- Bulk operations (e.g., cloning capabilities, zeroing arrays):
- Compact: SIMD-friendly only if fields are aligned to byte boundaries; otherwise, requires scalar operations
- Expanded: Easy SIMD vectorization (each field is naturally aligned to 2-byte or 4-byte boundaries)
- Impact: 2-5x slowdown for compact in vectorized loops
- Power consumption:
- Compact: Lower DRAM activation energy due to fewer cache lines fetched
- Expanded: Higher memory bandwidth usage → 20-40% more power in memory-bound scenarios
- Impact: Critical for mobile/embedded systems
Real-system evidence: Google’s Protocol Buffers and Apache Avro use variable-length encoding (even more compact than bitfields) because network bandwidth and storage are more expensive than CPU cycles.
Example or Code
#include <stdint.h>
#include <stddef.h>

// Compact representation (64 bits)
struct compact_capability {
    uint64_t type     : 12;
    uint64_t pid      : 12;
    uint64_t perms    : 8;
    uint64_t flags    : 8;
    uint64_t reserved : 8;   // pads the fields to a full 64 bits
    uint64_t id       : 16;
};
// Expanded representation (128 bits)
struct expanded_capability {
    uint16_t type;     // 16 bits
    uint16_t pid;      // 16 bits
    uint16_t perms;    // 16 bits
    uint16_t flags;    // 16 bits
    uint32_t reserved; // 32 bits
    uint32_t id;       // 32 bits
};
// Access pattern comparison
void process_capabilities_array(struct compact_capability *caps, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint16_t pid   = caps[i].pid;   // compiler emits shift+mask, ~1-2 cycles
        uint8_t  perms = caps[i].perms; // second shift+mask on the same word
        // Memory access: one 64-bit load; 8 capabilities per 64-byte cache line
    }
}
void process_capabilities_array_expanded(struct expanded_capability *caps, size_t n) {
    for (size_t i = 0; i < n; i++) {
        // Direct aligned load: no bit manipulation
        uint16_t pid   = caps[i].pid;   // 1 cycle if L1 hit
        uint16_t perms = caps[i].perms;
        // Memory access: 16-byte struct; only 4 capabilities per 64-byte cache line
    }
}
How Senior Engineers Fix It
Key strategies used by experienced engineers:
- Profile before optimizing: Use perf or VTune to measure cache misses, branch mispredictions, and IPC (instructions per cycle). Never assume bit operations are “free”
- Hybrid approach: Use compact representation for large arrays (sorted, bulk-processed) and expanded for small, frequently accessed objects (where L1 cache is guaranteed)
- Bitfield layout optimization: Group fields by access frequency:
- Hot fields (accessed together): Place in contiguous bit ranges to minimize mask operations
- Cold fields: Isolate to reduce instruction cache pressure
- Assembly inspection: Use objdump or Compiler Explorer to verify that bitfield access compiles to efficient instructions (e.g., pext on x86 for fast bit extraction)
- Alignment tricks: For mixed workloads, pad arrays of compact structs to 128-bit boundaries to enable vectorization where possible
- Compiler directives: Use __attribute__((packed)) sparingly; prefer explicit bit masks for fine-grained control and better portability
Senior engineer rule of thumb: Compact unless profiling shows cache misses are not the bottleneck. Memory hierarchy wins.
Why Juniors Miss It
Common misunderstandings that lead to suboptimal choices:
- Overestimating ALU cost: Juniors often think “bit operations are expensive” when modern CPUs can execute 3-4 ALU ops per cycle. They underestimate that memory latency dominates (100+ cycles vs 1 cycle)
- Ignoring compiler optimizations: Modern compilers (GCC, Clang) optimize bitfield access into efficient instructions (e.g., an and with an immediate mask). Juniors may manually bit-shift, adding unnecessary code
- Benchmarking on hot data only: Running microbenchmarks on tiny datasets (that fit in L1) makes expanded look better; real workloads exceed cache sizes
- Misunderstanding alignment: Assuming aligned access is always faster without considering that unaligned access on modern ISAs (x86-64, ARMv8) has negligible penalty if within cache lines
- Premature optimization: Choosing expanded for “simplicity” without measuring; compact is often simpler once you use helper functions/macros for bit extraction
- Not considering scalability: Junior code works for 1000 objects but crashes performance at 1M objects due to cache thrashing
- Overlooking pipeline effects: Thinking bitfield access creates long dependency chains; in reality, out-of-order execution hides most latency
Critical junior mistake: Benchmarking inside a VM, as the questioner mentioned. A virtualized ISA obscures the memory hierarchy, making results non-representative of real hardware. Always benchmark on bare metal, or under a hypervisor configured with a realistic memory profile.