Ch9.7: list

Overview

::fast_io::list provides several operations that manipulate the list structure directly without copying elements. These are more efficient than the equivalent operations on a vector because they only adjust pointers.

1. Special Operations Table

Operation	Description
`splice(pos, it)`	Move the element pointed to by `it` to before `pos`
`splice(pos, first, last)`	Move range `[first, last)` to before `pos`
`splice(pos, other)`	Move all elements from `other` to before `pos`
`sort()` / `sort(cmp)`	Sort the entire list in-place
`sort(first, last)`	Sort a subrange in-place
`merge(other)` / `merge(other, cmp)`	Merge sorted `other` into this sorted list
`reverse()`	Reverse the order of all elements
`is_single()`	Returns `true` if exactly one element
`has_multiple()`	Returns `true` if two or more elements

2. Complete Example


#include <fast_io_dsal/list.h>
#include <fast_io.h>

int main() {
    ::fast_io::list<::std::size_t> a{10zu, 30zu, 50zu};
    ::fast_io::list<::std::size_t> b{20zu, 40zu};

    // Splice: move element from b to a
    auto bit = b.begin(); // points to 20
    auto ait = a.begin() + 1; // points to 30
    a.splice(ait, bit);
    // a is now {10zu, 20zu, 30zu, 50zu}
    // b is now {40zu}

    // Sort
    a.sort(); // a is now {10zu, 20zu, 30zu, 50zu} (already sorted)

    // Merge two sorted lists
    b.sort(); // b = {40}
    a.merge(::std::move(b));
    // a is now {10zu, 20zu, 30zu, 40zu, 50zu}
    // b is now empty

    // Reverse
    a.reverse();
    // a is now {50zu, 40zu, 30zu, 20zu, 10zu}
}

3. Why list Is Often Much Slower Than vector and deque

Despite its theoretical Θ(1) insertion and removal, list is often 10× to 100× slower than vector or deque in real-world workloads. The reason is data locality — how well the container’s memory layout matches the CPU cache architecture.

Cache line utilisation

Modern CPUs fetch memory in cache lines of typically 64 bytes. When any byte in a line is accessed, the entire 64 bytes are loaded into the L1 cache. The question is: how much of that cache line is actually useful data?

Container	Layout	Useful bytes per 64-byte cache line (for `size_t`)	Efficiency
`vector`	Contiguous: `[e0][e1][e2]...[e7]`	64 / 64 = 64 bytes (8 elements)	100%
`deque`	Block-chunked: each block is contiguous	64 / 64 = 64 bytes (8 elements per block)	~100% within a block
`list`	Scattered nodes: each node is a separate heap allocation	8 / 64 = 8 bytes (1 element per cache line)	~12%

A list<::std::size_t> node contains the 8-byte value plus two 8-byte pointers (prev/next), totalling ~24–32 bytes with alignment padding. But each node lives at a different heap address, so each access likely triggers a cache miss. The CPU must wait for main memory (~100 ns) instead of L1 cache (~1 ns).

Pointer chasing defeats prefetchers

Modern CPUs use hardware prefetchers that detect sequential access patterns and load upcoming cache lines before they are needed. This works perfectly for vector (contiguous, predictable) and reasonably well for deque (sequential blocks). But list requires pointer chasing: each node’s address is only known after reading the previous node’s next pointer. The prefetcher cannot predict these addresses, so every access is a potential cache miss.

The constant factor overwhelms the asymptotic advantage

List offers Θ(1) insertion/removal vs vector’s O(n) (for mid-sequence operations). But the constant factors differ enormously:

vector insertion: move ~n elements via memcpy at ~10–50 GB/s — a few microseconds for a million elements.
list insertion: adjust 4 pointers, but each pointer access may be a cache miss at ~100 ns — tens of microseconds even for a modest number of elements.

In practice, vector wins for almost any workload except when you need stable iterators (iterators that are not invalidated by insertions/erasures at other positions).

Benchmark: list vs vector vs deque

Pushing 10 million ::std::size_t elements, then iterating to compute a sum:

Container	push_back	loop (sum)	Total
`vector` (no reserve)	0.162 s	0.014 s	0.185 s
`vector` + `reserve()`	0.093 s	0.012 s	0.114 s
`deque` (plain)	0.096 s	0.026 s	0.138 s
`deque` + `resize_for_overwrite()`	0.078 s	0.012 s	0.105 s
`list`	0.768 s	0.121 s	1.686 s

list is 9× slower than vector without reserve (1.686 s vs 0.185 s), and 15× slower than vector with reserve (1.686 s vs 0.114 s). Even the iteration alone is 9× slower than vector (0.121 s vs 0.014 s) because of cache misses on every node.

The gap grows dramatically with data size. When the dataset exceeds RAM, the operating system swaps pages to disk. Each list node access becomes a page fault requiring disk I/O — roughly 10 ms per access, or 100,000× slower than an L1 cache hit (~100 ns). A vector, by contrast, benefits from the OS’s read-ahead: when it faults on one page, the OS prefetches adjacent pages in bulk, amortising the disk latency across hundreds of elements. For list, each random address is unlikely to be in a prefetched page, so nearly every access pays the full disk penalty. At network-storage latencies (milliseconds per access), list becomes completely impractical.

When to actually use list

You need iterators or references to remain valid across insertions and erasures at other positions.
You frequently splice elements between containers.
The elements are large (so copying is expensive) and you need frequent mid-sequence insertion.
You are implementing algorithms that fundamentally require linked structure (e.g., certain graph algorithms).

For almost everything else — especially iteration-heavy workloads — prefer vector or deque.

Key takeaways

splice() moves elements between lists without copying.
sort(), merge(), and reverse() only adjust pointers.
is_single() and has_multiple() are list-specific queries.
List operations never invalidate iterators to elements that are not moved or erased.
list is ~9× slower than vector and ~15× slower than vector + reserve() — prefer vector unless you need stable iterators.