A deep dive into SmallVector::push

SmallVector is LLVM's most-used container, and push_back its hot operation. For the trivially-copyable specialization the fast path should be fast.

1
2
3

#include <llvm/ADT/SmallVector.h>

void f(llvm::SmallVectorImpl<int> &v, int x) { v.push_back(x); }

clang -S --target=x86_64 -O2 -DNDEBUG a.cc generates:

push   rbp                 # callee-saved spills + a stack realignment,
push   rbx                 # all on the fast path
push   rax
mov    eax, [rdi + 8]      # size
cmp    eax, [rdi + 12]     # vs capacity
jae    .Lgrow
.Lstore:                   # reached from the fast path AND from .Lgrow
mov    rcx, [rdi]
mov    [rcx + rax*4], esi
inc    dword ptr [rdi + 8]
add    rsp, 8
pop    rbx
pop    rbp
ret
.Lgrow:
mov    rbx, rdi            # keep `this`/`x` alive across the call
mov    ebp, esi
call   SmallVectorBase<unsigned>::grow_pod
...
jmp    .Lstore

push_back reserves capacity and then stores, so the store at .Lstore is shared between the no-grow and post-grow paths. On the grow path this and x must survive the grow_pod call, which means they are saved in callee-saved registers, leading to push rbx/push rbp in the prologue. push rbp is needed to maintain the 16-byte alignment of the stack frame.

GCC's output is also inefficient:

push   rbp ; mov ebp, esi    # x -> rbp, in the entry block
push   rbx ; mov rbx, rdi    # this -> rbx
... ; cmp ; jnb .Lslow
.Lmerge:                     # reached by both paths, reads rbx/rbp
mov    rdx, [rbx] ; mov [rdx+rax*4], ebp ; ...

Shrink wrapping can't remove it

Shrink wrapping relocates the save/restore of callee-saved registers; it never duplicates a block. To carry this/x across the conditional grow_pod call into a store the fast path also reaches, a callee-saved register must be live from entry. clang -mllvm -debug-only=shrink-wrap reports No Shrink wrap candidate found. GCC's -fshrink-wrap-separate (on at -O2) does not optimize this as well.

The transformation that would help is tail duplication — give the slow path its own copy of the store so the fast path keeps this/x in their argument registers. Neither compiler does it here, and it is not shrink-wrapping's job.

The fix: tail calling the slow path

Move the grow-and-store out of line and tail calls it:

LLVM_ATTRIBUTE_NOINLINE void growAndPushBack(ValueParamT Elt) {
  T Tmp = Elt;  
  this->grow(this->size() + 1);
  std::memcpy(reinterpret_cast<void *>(this->end()), &Tmp, sizeof(T));
  this->set_size(this->size() + 1);
}

void push_back(ValueParamT Elt) {
  if (LLVM_UNLIKELY(this->size() >= this->capacity()))
    return growAndPushBack(Elt);
  std::memcpy(reinterpret_cast<void *>(this->end()), &Elt, sizeof(T));
  this->set_size(this->size() + 1);
}

mov    eax, [rdi + 8]
cmp    eax, [rdi + 12]
jae    growAndPushBack         # TAILCALL
mov    rcx, [rdi]
mov    [rcx + rax*4], esi
inc    dword ptr [rdi + 8]
ret

7 instructions instead of 14, no callee-saved registers, nothing to shrink-wrap.

growAndPushBack is an vague linkage function that is emitted in a separate section using COMDAT, costing some .o file size.

Two details:

noinline is load-bearing. Without it Clang and GCC may sometimes inline the helper back and the prologue returns.
T Tmp = Elt handles Elt referencing the vector's own storage. It is elided for small by-value types and, for large by-const& types, is smaller than reusing the old pointer-relocation machinery.

#include <llvm/ADT/SmallVector.h>

void DecodeMOVDDUPMask(unsigned n, llvm::SmallVectorImpl<int> &v) {
  for (unsigned l = 0; l < n; l += 2)
    for (unsigned i = 0; i < 2; ++i)
      v.push_back(i);
}

Results

lld .text shrinks 40,512 bytes; by-const& element types win most, e.g. GotSection::addConstant goes 167 → 45 bytes. On the LLVM compile-time tracker the clang build is 0.41–0.51% fewer instructions:u across every configuration, for +0.13% binary size.

Sorted by relative size, a few outliers grow ~13.8% — the constexpr ByteCode interpreter (Interp.cpp, EvalEmitter.cpp). A smaller push_back likely perturbs the bottom-up inliner's near-threshold decisions.

Aside: "approximately trivially copyable"

1
2
3

std::is_trivially_copy_constructible<T> &&
std::is_trivially_move_constructible<T> &&
std::is_trivially_destructible<T>

is the predicate that selects the SmallVectorTemplateBase<..., true> specialization, where copy/move construction optimizes to memcpy and destroy_range is a no-op.

The condition is deliberately broader than is_trivially_copyable, which also demands trivial assignment. The motivating case is std::pair<int,int>: its constructors are trivial, but its assignment is user-provided (to support pair<T&,U&>), so is_trivially_copyable is false. SmallVector only ever copies or moves elements by construction into uninitialized storage (memcpy), never by assignment, so the distinction is unobservable and memcpy is sound — and std::pair<POD,POD> stays on the fast path.

The condition is also stronger than trivial relocatability, whose operational definition is just is_trivially_move_constructible && is_trivially_destructible. The extra is_trivially_copy_constructible is there because SmallVector also calls memcpys when copying a live element — push_back(const T&), or copy-constructing from another vector.

std::vector stores three 8-byte members. SmallVector stores a begin pointer plus a 32-bit size and a 32-bit capacity when sizeof(T) >= 4.

template <class Size_T> class SmallVectorBase {
  void *BeginX;
  Size_T Size, Capacity;
};

So for int, pointers, and most structs the header is 16 bytes — 8 fewer than std::vector. The cost of carrying a size instead of an end pointer is that addressing the end (begin + size * sizeof(T)) needs a multiply, visible when sizeof(T) is not a power of two.

std::vector::push_back is slow in both libc++ and libstdc++

Neither standard vector has a frame-free push_back fast path.

libc++'s push_back forwards to emplace_back, which routes the grow decision through std::__if_likely_else(cond, fast, slow). The slow path is kept out of line, but as a by-reference lambda, so its closure {&__end_, &__x, this} is materialized on the stack and the trailing this->__end_ = __end is a merge. The fast path therefore spills the closure and runs with a 48-byte frame:

push   rbx
sub    rsp, 48
mov    [rsp+12], esi          # spill x
mov    rax, [rdi+8]           # __end
mov    [rsp+16], rax
lea    rcx, [rsp+16] ; mov [rsp+24], rcx   # } closure {&__end,
lea    rcx, [rsp+12] ; mov [rsp+32], rcx   # }  &x, this}, built
mov    [rsp+40], rdi                       # }  on the fast path
jae    .Lslow                # else: store x; this->__end_ = __end

libstdc++ is heavier still: its push_back inlines _M_realloc_insert, pulling the whole reallocation — operator new, memcpy, operator delete, and the length_error throw — into the function. To keep state live across those calls the fast path holds six callee-saved registers, on both g++ and clang.

A direct out-of-line member taking (this, Elt) in registers — the growAndPushBack above — is what keeps the fast path free of both a frame and callee-saved registers.

Note: many libc++ builds enable hardening by default. Disable it (and exceptions) for the best performance:

1	-fno-exceptions -D_LIBCPP_HARDENING_MODE=_LIBCPP_HARDENING_MODE_NONE

Boost's small_vector has the same frame

boost::container::small_vector<int, N>::push_back tells the same story, independent of the inline capacity N (even N == 0):

sub    rsp, 24                  # frame on the fast path
mov    [rsp+12], esi            # spill x — dead on the fast path
mov    rax, [rdi+8]             # size (64-bit)
lea    rdx, [4*rax] ; add rdx, [rdi] ; cmp rax, [rdi+16]   # end; vs capacity
je     .Lgrow
mov    [rdx], esi ; inc rax ; mov [rdi+8], rax ; add rsp, 24 ; ret
.Lgrow:
lea    r8, [rsp+12]             # &x, for vector::priv_insert_…'s insert_emplace_proxy<const int&>
call   ...

x is spilled only so the cold grow path can pass its address to the emplace proxy, yet the sub rsp, 24 and the spill land on the fast path (the store itself uses esi). Boost also keeps size and capacity as size_t, so small_vector<int,0> is 24 bytes — like std::vector, 8 more than SmallVector<int,0>.

Takeaways

A fast/slow merge that rejoins after a call forces callee-saved spills onto the hot path, and shrink-wrapping can't remove them.
A tail-called out-of-line slow path removes the overhead.
Inliner behavior makes size effects are non-monotonic.