Ch9.7: list
Overview
::fast_io::list provides several operations that manipulate
the list structure directly without copying elements. These are more
efficient than the equivalent operations on a vector because they only
adjust pointers.
1. Special Operations Table
| Operation | Description |
|---|---|
splice(pos, it) |
Move the element pointed to by it to before pos |
splice(pos, first, last) |
Move range [first, last) to before pos |
splice(pos, other) |
Move all elements from other to before pos |
sort() / sort(cmp) |
Sort the entire list in-place |
sort(first, last) |
Sort a subrange in-place |
merge(other) / merge(other, cmp) |
Merge sorted other into this sorted list |
reverse() |
Reverse the order of all elements |
is_single() |
Returns true if exactly one element |
has_multiple() |
Returns true if two or more elements |
2. Complete Example
#include <fast_io_dsal/list.h>
#include <fast_io.h>
int main() {
::fast_io::list<::std::size_t> a{10zu, 30zu, 50zu};
::fast_io::list<::std::size_t> b{20zu, 40zu};
// Splice: move element from b to a
auto bit = b.begin(); // points to 20
auto ait = a.begin() + 1; // points to 30
a.splice(ait, bit);
// a is now {10zu, 20zu, 30zu, 50zu}
// b is now {40zu}
// Sort
a.sort(); // a is now {10zu, 20zu, 30zu, 50zu} (already sorted)
// Merge two sorted lists
b.sort(); // b = {40}
a.merge(::std::move(b));
// a is now {10zu, 20zu, 30zu, 40zu, 50zu}
// b is now empty
// Reverse
a.reverse();
// a is now {50zu, 40zu, 30zu, 20zu, 10zu}
}
3. Why list Is Often Much Slower Than vector and deque
Despite its theoretical Θ(1) insertion and removal,
list is often 10× to 100× slower
than vector or deque in real-world workloads.
The reason is data locality — how well the
container’s memory layout matches the CPU cache architecture.
Cache line utilisation
Modern CPUs fetch memory in cache lines of typically 64 bytes. When any byte in a line is accessed, the entire 64 bytes are loaded into the L1 cache. The question is: how much of that cache line is actually useful data?
| Container | Layout | Useful bytes per 64-byte cache line (for size_t) |
Efficiency |
|---|---|---|---|
vector |
Contiguous: [e0][e1][e2]...[e7] |
64 / 64 = 64 bytes (8 elements) | 100% |
deque |
Block-chunked: each block is contiguous | 64 / 64 = 64 bytes (8 elements per block) | ~100% within a block |
list |
Scattered nodes: each node is a separate heap allocation | 8 / 64 = 8 bytes (1 element per cache line) | ~12% |
A list<::std::size_t> node contains the 8-byte value
plus two 8-byte pointers (prev/next), totalling ~24–32 bytes with
alignment padding. But each node lives at a different heap
address, so each access likely triggers a cache miss.
The CPU must wait for main memory (~100 ns) instead of L1 cache
(~1 ns).
Pointer chasing defeats prefetchers
Modern CPUs use hardware prefetchers that detect
sequential access patterns and load upcoming cache lines before they are
needed. This works perfectly for vector (contiguous,
predictable) and reasonably well for deque (sequential
blocks). But list requires pointer chasing:
each node’s address is only known after reading the previous
node’s next pointer. The prefetcher cannot predict
these addresses, so every access is a potential cache miss.
The constant factor overwhelms the asymptotic advantage
List offers Θ(1) insertion/removal vs vector’s
O(n) (for mid-sequence operations). But the constant factors differ
enormously:
vectorinsertion: move ~n elements viamemcpyat ~10–50 GB/s — a few microseconds for a million elements.listinsertion: adjust 4 pointers, but each pointer access may be a cache miss at ~100 ns — tens of microseconds even for a modest number of elements.
In practice, vector wins for almost any workload except
when you need stable iterators (iterators that are not
invalidated by insertions/erasures at other positions).
Benchmark: list vs vector vs deque
Pushing 10 million ::std::size_t elements, then
iterating to compute a sum:
| Container | push_back | loop (sum) | Total |
|---|---|---|---|
vector (no reserve) |
0.162 s | 0.014 s | 0.185 s |
vector + reserve() |
0.093 s | 0.012 s | 0.114 s |
deque (plain) |
0.096 s | 0.026 s | 0.138 s |
deque + resize_for_overwrite() |
0.078 s | 0.012 s | 0.105 s |
list |
0.768 s | 0.121 s | 1.686 s |
list is 9× slower than
vector without reserve (1.686 s vs 0.185 s), and
15× slower than vector with
reserve (1.686 s vs 0.114 s). Even the iteration
alone is 9× slower than vector
(0.121 s vs 0.014 s) because of cache misses on every node.
The gap grows dramatically with data size. When the
dataset exceeds RAM, the operating system swaps pages to disk. Each
list node access becomes a page fault
requiring disk I/O — roughly 10 ms per access, or
100,000× slower than an L1 cache hit (~100 ns). A
vector, by contrast, benefits from the OS’s
read-ahead: when it faults on one page, the OS
prefetches adjacent pages in bulk, amortising the disk latency across
hundreds of elements. For list, each random address is
unlikely to be in a prefetched page, so nearly every access pays the
full disk penalty. At network-storage latencies (milliseconds per
access), list becomes completely impractical.
When to actually use list
- You need iterators or references to remain valid across insertions and erasures at other positions.
- You frequently splice elements between containers.
- The elements are large (so copying is expensive) and you need frequent mid-sequence insertion.
- You are implementing algorithms that fundamentally require linked structure (e.g., certain graph algorithms).
For almost everything else — especially iteration-heavy workloads
— prefer vector or deque.
Key takeaways
splice()moves elements between lists without copying.sort(),merge(), andreverse()only adjust pointers.is_single()andhas_multiple()are list-specific queries.- List operations never invalidate iterators to elements that are not moved or erased.
listis ~9× slower thanvectorand ~15× slower thanvector+reserve()— prefervectorunless you need stable iterators.