Ch9.7: list

Overview

::fast_io::list provides several operations that manipulate the list structure directly without copying elements. These are more efficient than the equivalent operations on a vector because they only adjust pointers.

1. Special Operations Table

Operation Description
splice(pos, it) Move the element pointed to by it to before pos
splice(pos, first, last) Move range [first, last) to before pos
splice(pos, other) Move all elements from other to before pos
sort() / sort(cmp) Sort the entire list in-place
sort(first, last) Sort a subrange in-place
merge(other) / merge(other, cmp) Merge sorted other into this sorted list
reverse() Reverse the order of all elements
is_single() Returns true if exactly one element
has_multiple() Returns true if two or more elements

2. Complete Example


#include <fast_io_dsal/list.h>
#include <fast_io.h>

int main() {
    ::fast_io::list<::std::size_t> a{10zu, 30zu, 50zu};
    ::fast_io::list<::std::size_t> b{20zu, 40zu};

    // Splice: move element from b to a
    auto bit = b.begin(); // points to 20
    auto ait = a.begin() + 1; // points to 30
    a.splice(ait, bit);
    // a is now {10zu, 20zu, 30zu, 50zu}
    // b is now {40zu}

    // Sort
    a.sort(); // a is now {10zu, 20zu, 30zu, 50zu} (already sorted)

    // Merge two sorted lists
    b.sort(); // b = {40}
    a.merge(::std::move(b));
    // a is now {10zu, 20zu, 30zu, 40zu, 50zu}
    // b is now empty

    // Reverse
    a.reverse();
    // a is now {50zu, 40zu, 30zu, 20zu, 10zu}
}

3. Why list Is Often Much Slower Than vector and deque

Despite its theoretical Θ(1) insertion and removal, list is often 10× to 100× slower than vector or deque in real-world workloads. The reason is data locality — how well the container’s memory layout matches the CPU cache architecture.

Cache line utilisation

Modern CPUs fetch memory in cache lines of typically 64 bytes. When any byte in a line is accessed, the entire 64 bytes are loaded into the L1 cache. The question is: how much of that cache line is actually useful data?

Container Layout Useful bytes per 64-byte cache line (for size_t) Efficiency
vector Contiguous: [e0][e1][e2]...[e7] 64 / 64 = 64 bytes (8 elements) 100%
deque Block-chunked: each block is contiguous 64 / 64 = 64 bytes (8 elements per block) ~100% within a block
list Scattered nodes: each node is a separate heap allocation 8 / 64 = 8 bytes (1 element per cache line) ~12%

A list<::std::size_t> node contains the 8-byte value plus two 8-byte pointers (prev/next), totalling ~24–32 bytes with alignment padding. But each node lives at a different heap address, so each access likely triggers a cache miss. The CPU must wait for main memory (~100 ns) instead of L1 cache (~1 ns).

Pointer chasing defeats prefetchers

Modern CPUs use hardware prefetchers that detect sequential access patterns and load upcoming cache lines before they are needed. This works perfectly for vector (contiguous, predictable) and reasonably well for deque (sequential blocks). But list requires pointer chasing: each node’s address is only known after reading the previous node’s next pointer. The prefetcher cannot predict these addresses, so every access is a potential cache miss.

The constant factor overwhelms the asymptotic advantage

List offers Θ(1) insertion/removal vs vector’s O(n) (for mid-sequence operations). But the constant factors differ enormously:

In practice, vector wins for almost any workload except when you need stable iterators (iterators that are not invalidated by insertions/erasures at other positions).

Benchmark: list vs vector vs deque

Pushing 10 million ::std::size_t elements, then iterating to compute a sum:

Container push_back loop (sum) Total
vector (no reserve) 0.162 s 0.014 s 0.185 s
vector + reserve() 0.093 s 0.012 s 0.114 s
deque (plain) 0.096 s 0.026 s 0.138 s
deque + resize_for_overwrite() 0.078 s 0.012 s 0.105 s
list 0.768 s 0.121 s 1.686 s

list is 9× slower than vector without reserve (1.686 s vs 0.185 s), and 15× slower than vector with reserve (1.686 s vs 0.114 s). Even the iteration alone is 9× slower than vector (0.121 s vs 0.014 s) because of cache misses on every node.

The gap grows dramatically with data size. When the dataset exceeds RAM, the operating system swaps pages to disk. Each list node access becomes a page fault requiring disk I/O — roughly 10 ms per access, or 100,000× slower than an L1 cache hit (~100 ns). A vector, by contrast, benefits from the OS’s read-ahead: when it faults on one page, the OS prefetches adjacent pages in bulk, amortising the disk latency across hundreds of elements. For list, each random address is unlikely to be in a prefetched page, so nearly every access pays the full disk penalty. At network-storage latencies (milliseconds per access), list becomes completely impractical.

When to actually use list

For almost everything else — especially iteration-heavy workloads — prefer vector or deque.

Key takeaways