What are the best instruction sequences to generate vector constants on the fly?

All-zero: pxor xmm0,xmm0 (or xorps xmm0,xmm0, which is one instruction byte shorter). There isn’t much difference on modern CPUs, but on Nehalem (before xor-zero elimination), the xorps uop could only run on port 5. I think that’s why compilers favour pxor-zeroing even for registers that will be used with FP instructions.

All-ones: pcmpeqw xmm0,xmm0. This is the usual starting point for generating other constants, because (like pxor) it breaks the dependency on the previous value of the register (except on old CPUs like K10 and pre-Core2 P6).

There’s no advantage to the W version over the byte or dword element size versions of pcmpeq on any CPU in Agner Fog’s instruction tables, but pcmpeqQ takes an extra byte, is slower on Silvermont, and requires SSE4.1.
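As a minimal compilable sketch of these two idioms (the function name is mine, and the exact pcmpeq flavour is up to the compiler):

    #include <immintrin.h>

    /* Zeroing and all-ones idioms as intrinsics.  Compilers recognize
       both patterns as dependency-breaking. */
    __m128i zero_and_all_ones_demo(void)
    {
        __m128i zero = _mm_setzero_si128();   /* pxor xmm,xmm (or xorps)        */
        __m128i ones = _mm_set1_epi32(-1);    /* pcmpeqd same,same              */
        return _mm_sub_epi8(zero, ones);      /* psubb: 0 - (-1) = set1_epi8(1) */
    }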

SO doesn’t really have table formatting, so I’m just going to list additions to Agner Fog’s table 13.10, rather than an improved version. Sorry. Maybe if this answer becomes popular, I’ll use an ascii-art table-generator, but hopefully improvements will be rolled into future versions of the guide.


The main difficulty is 8-bit vectors, because there’s no PSLLB

Agner Fog’s table generates vectors of 16-bit elements and uses packuswb to work around this. For example, pcmpeqw xmm0,xmm0 / psrlw xmm0,15 / psllw xmm0,1 / packuswb xmm0,xmm0 generates a vector where every byte is 2. (This pattern of shifts, with different counts, is the main way to produce most constants for wider vectors). There is a better way:
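In intrinsics, the shift/pack version of set1_epi8(2) looks like this (a sketch; the function name is mine):

    #include <immintrin.h>

    __m128i set1_epi8_2_shiftpack(void)
    {
        __m128i v = _mm_set1_epi32(-1);   /* pcmpeqw xmm0,xmm0: all-ones  */
        v = _mm_srli_epi16(v, 15);        /* psrlw xmm0,15: each word = 1 */
        v = _mm_slli_epi16(v, 1);         /* psllw xmm0,1:  each word = 2 */
        return _mm_packus_epi16(v, v);    /* packuswb: every byte = 2     */
    }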

paddb xmm0,xmm0 (SSE2) works as a left-shift by one with byte granularity, so a vector of -2 bytes can be generated with only two instructions (pcmpeqw / paddb). paddw/d/q as a left-shift-by-one for other element sizes saves one byte of machine code compared to shifts, and can generally run on more ports than a shift-imm.
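For example (a sketch; function name mine):

    #include <immintrin.h>

    __m128i set1_epi8_m2(void)
    {
        __m128i v = _mm_set1_epi32(-1);   /* pcmpeqw: 0xFF in every byte        */
        return _mm_add_epi8(v, v);        /* paddb: 0xFF+0xFF = 0xFE = -2 bytes */
    }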

pabsb xmm1,xmm0 (SSSE3) turns a vector of all-ones (-1) into a vector of 1 bytes, and can be used non-destructively (with a separate destination register) so you still have the set1(-1) vector.

(You sometimes don’t need set1(1). You can add 1 to every element by subtracting -1 with psubb instead.)
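Both tricks in one sketch (function name mine; in asm, all_ones survives pabsb because the destination is write-only):

    #include <immintrin.h>

    __m128i add_one_per_byte(__m128i x)
    {
        __m128i all_ones = _mm_set1_epi32(-1);   /* pcmpeqd: set1_epi8(-1)           */
        __m128i one = _mm_abs_epi8(all_ones);    /* pabsb (SSSE3): set1_epi8(1)      */
        (void)one;                               /* kept only to show it's available */
        return _mm_sub_epi8(x, all_ones);        /* psubb: x - (-1) == x + 1         */
    }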

We can generate a vector of 2 bytes with pcmpeqw / paddb / pabsb. (The order of add vs. abs doesn’t matter.) pabs doesn’t need an imm8, but for other element widths it only saves code bytes vs. a right shift when both need a 3-byte VEX prefix, which happens only when the source register is xmm8-15. (vpabsb/w/d always requires a 3-byte VEX prefix for VEX.128.66.0F38.WIG, but vpsrlw dest,src,imm can otherwise use a 2-byte VEX prefix for its VEX.NDD.128.66.0F.WIG.)

We can actually save instructions in generating power-of-2 bytes like 4, too: pcmpeqw / pabsb / psllw xmm0, 2. All the bits that are shifted across byte boundaries by the word-shift are zero, thanks to pabsb. Obviously other shift counts can put the single set-bit at other locations, including the sign bit to generate a vector of -128 (0x80) bytes. Note that pabsb is non-destructive (the destination operand is write-only, and doesn’t need to be the same as the source to get the desired behaviour). You can keep the all-ones around as a constant, or as the start of generating another constant, or as a source operand for psubb (to increment by one).
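A sketch of set1_epi8(4) this way (function name mine):

    #include <immintrin.h>

    __m128i set1_epi8_4(void)
    {
        __m128i v = _mm_set1_epi32(-1);   /* pcmpeqw: all-ones                      */
        v = _mm_abs_epi8(v);              /* pabsb: only bit 0 of each byte set     */
        return _mm_slli_epi16(v, 2);      /* psllw 2: no bits cross a byte boundary */
    }
    /* _mm_slli_epi16(v, 7) instead would give set1_epi8(-128), i.e. 0x80 bytes. */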

A vector of 0x80 bytes can also (see the previous paragraph) be generated from anything that saturates to -128, using packsswb. e.g. if you already have a vector of 0xFF00 words for something else, just copy it and use packsswb. Constants loaded from memory that happen to saturate correctly are potential targets for this.
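For example (a sketch; the 0xFF00 vector is materialized with set1 here just to have something to pack, on the assumption it already exists):

    #include <immintrin.h>

    __m128i bytes_0x80_from_ff00(void)
    {
        __m128i ff00 = _mm_set1_epi16((short)0xFF00);  /* assume this already exists        */
        return _mm_packs_epi16(ff00, ff00);  /* packsswb: -256 saturates to -128 = 0x80     */
    }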

A vector of 0x7f bytes can be generated with pcmpeqw / paddb xmm0,xmm0 / psrlw xmm0, 1. This is slightly better than pcmpeqw / psrlw xmm0, 9 / packuswb xmm0,xmm0 (the usual trick of generating the value in each word and then using packuswb), because paddb can run on more ports than pack instructions on most CPUs.
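As a sketch (function name mine):

    #include <immintrin.h>

    __m128i set1_epi8_0x7f(void)
    {
        __m128i v = _mm_set1_epi32(-1);   /* pcmpeqw: all-ones             */
        v = _mm_add_epi8(v, v);           /* paddb: 0xFE bytes             */
        return _mm_srli_epi16(v, 1);      /* psrlw 1: 0xFEFE >> 1 = 0x7F7F */
    }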

pavgb (SSE2) against a zeroed register can right-shift by one, but only if the value is even. (It does unsigned dst = (dst+src+1)>>1 for rounding, with 9-bit internal precision for the temporary.) This doesn’t seem to be useful for constant-generation, though, because 0xff is odd: pxor xmm1,xmm1 / pcmpeqw xmm0,xmm0 / paddb xmm0,xmm0 / pavgb xmm0, xmm1 produces 0x7f bytes with one more insn than shift/pack. If a zeroed register is already needed for something else, though, paddb / pavgb does save one instruction byte.
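A sketch of the pavgb version, assuming the caller already has a zeroed register:

    #include <immintrin.h>

    __m128i set1_epi8_0x7f_avg(__m128i zero)   /* zero = _mm_setzero_si128() */
    {
        __m128i v = _mm_set1_epi32(-1);        /* pcmpeqw: all-ones                  */
        v = _mm_add_epi8(v, v);                /* paddb: 0xFE (even, so pavgb exact) */
        return _mm_avg_epu8(v, zero);          /* pavgb: (0xFE + 0 + 1) >> 1 = 0x7F  */
    }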


I have tested these sequences. The easiest way is to throw them in a .asm, assemble/link, and run gdb on it. layout asm, display /x $xmm0.v16_int8 to dump that after every single-step, and single-step instructions (ni or si). In layout reg mode, you can do tui reg vec to switch to a display of vector regs, but it’s nearly useless because you can’t select which interpretation to display (you always get all of them, and can’t hscroll, and the columns don’t line up between registers). It’s excellent for integer regs/flags, though.
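If you’d rather check a sequence from C than single-step it, a throwaway harness like this (my own, not from any library) dumps the bytes of a vector:

    #include <stdio.h>
    #include <immintrin.h>

    static void dump_epi8(__m128i v)
    {
        unsigned char b[16];
        _mm_storeu_si128((__m128i *)b, v);   /* spill to memory so we can print */
        for (int i = 0; i < 16; i++)
            printf("%02x ", b[i]);
        putchar('\n');
    }

    int main(void)
    {
        __m128i v = _mm_set1_epi32(-1);    /* pcmpeqd: all-ones  */
        dump_epi8(_mm_add_epi8(v, v));     /* expect fe x16      */
        return 0;
    }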


Note that using these with intrinsics can be tricky. Compilers don’t like to operate on uninitialized variables, so you should use _mm_undefined_si128() to tell the compiler that’s what you meant. Or perhaps using _mm_set1_epi32(-1) will get your compiler to emit a pcmpeqd same,same. Without this, some compilers will xor-zero uninitialized vector variables before use, or even (MSVC) load uninitialized memory from the stack.
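A sketch of the intrinsics version (whether you actually get pcmpeq same,same with no wasted xor-zeroing is compiler-dependent):

    #include <immintrin.h>

    __m128i all_ones_from_intrinsics(void)
    {
        __m128i undef = _mm_undefined_si128();   /* "don't care" starting value */
        return _mm_cmpeq_epi32(undef, undef);    /* hopefully pcmpeqd same,same */
    }
    /* Or simply _mm_set1_epi32(-1), which most compilers also turn into pcmpeqd. */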


Many constants can be stored more compactly in memory by taking advantage of SSE4.1’s pmovzx or pmovsx for zero or sign-extension on the fly. For example, a 128b vector of {1, 2, 3, 4} as 32bit elements could be generated with a pmovzx load from a 32bit memory location. Memory operands can micro-fuse with pmovzx, so it doesn’t take any extra fused-domain uops. It does prevent using the constant directly as a memory operand, though.

C/C++ intrinsics support for using pmovz/sx as a load is terrible: there’s _mm_cvtepu8_epi32(__m128i a), but no version that takes a uint32_t * pointer operand. You can hack around it, but it’s ugly, and compiler optimization failure is a problem. See the linked question for details and links to the gcc bug reports.
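A sketch of the workaround for the {1, 2, 3, 4} example above (names mine; whether the movd folds into a pmovzxbd load is exactly the optimization failure in question):

    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>

    static const uint8_t vec1234[4] = { 1, 2, 3, 4 };   /* 4 bytes instead of 16 */

    __m128i load_1234_epi32(void)
    {
        uint32_t tmp;
        memcpy(&tmp, vec1234, 4);                   /* strict-aliasing-safe 4-byte load */
        __m128i v = _mm_cvtsi32_si128((int)tmp);    /* movd                             */
        return _mm_cvtepu8_epi32(v);                /* pmovzxbd: {1,2,3,4} as dwords    */
    }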

With 256b and (not so) soon 512b constants, the savings in memory are larger. This only matters much if multiple useful constants can share a cache-line, though.

The FP equivalent of this is VCVTPH2PS xmm1, xmm2/m64, requiring the F16C (half precision) feature flag. (There’s also a store instruction that packs single to half, but no computation at half precision. It’s a memory bandwidth / cache footprint optimization only.)
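A sketch with the corresponding intrinsic (compile with -mf16c; names mine):

    #include <immintrin.h>

    __m128 load_4_halves(const void *p)   /* p: 8 bytes holding 4 half-precision floats */
    {
        __m128i halves = _mm_loadl_epi64((const __m128i *)p);  /* 64-bit load             */
        return _mm_cvtph_ps(halves);                           /* vcvtph2ps (load can fold) */
    }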


Obviously when all elements are the same (but not suitable for generating on the fly), pshufd or AVX vbroadcastps / AVX2 vpbroadcastb/w/d/q/i128 are useful. pshufd can take a memory source operand, but it has to be 128b. movddup (SSE3) does a 64bit load, broadcasting it to fill a 128b register. On Intel, it doesn’t need an ALU execution unit, only a load port. (Similarly, AVX v[p]broadcast loads of dword size and larger are handled in the load unit, without an ALU uop.)
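A sketch of the load+broadcast forms (function names mine; the second requires AVX):

    #include <immintrin.h>

    __m128i broadcast_64(const double *p)   /* any 8-byte bit pattern works */
    {
        return _mm_castpd_si128(_mm_loaddup_pd(p));   /* movddup: no ALU uop on Intel */
    }

    __m128 broadcast_32(const float *p)
    {
        return _mm_broadcast_ss(p);   /* vbroadcastss xmm, m32: handled in the load port */
    }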

Broadcasts or pmovz/sx are excellent for saving executable size when you’re going to load a mask into a register for repeated use in a loop. Generating multiple similar masks from one starting point can also save space, if it only takes one instruction.

See also For an SSE vector that has all the same components, generate on the fly or precompute?, which is asking more about using the set1 intrinsic; it isn’t clear whether it’s asking about constants or broadcasts of variables.

I also experimented some with compiler output for broadcasts.


If cache misses are a problem, take a look at your code and see if the compiler has duplicated _mm_set constants when the same function is inlined into different callers. Also watch out for constants that are used together (e.g. in functions called one after another) being scattered into different cache lines. Many scattered loads for constants are far worse than loading a lot of constants all from near each other.

pmovzx and/or broadcast loads let you pack more constants into a cache line, with very low overhead for loading them into a register. The load won’t be on the critical path, so even if it takes an extra uop, it can take a free execution unit at any cycle over a long window.

clang actually does a good job of this: separate set1 constants in different functions are recognized as identical, the way identical string literals can be merged. Note that clang’s asm source output appears to show each function having its own copy of the constant, but the binary disassembly shows that all those RIP-relative effective addresses reference the same location.

For 256b versions of the repeated functions, clang also uses vbroadcastsd to only require an 8B load, at the expense of an extra instruction in each function. (This is at -O3, so clearly the clang devs have realized that size matters for performance, not just for -Os.) IDK why it doesn’t go down to a 4B constant with vbroadcastss, because that should be just as fast. Unfortunately, the vbroadcast loads don’t simply come from part of the 16B constant the other functions used.

This maybe makes sense: an AVX version of something could probably only merge some of its constants with an SSE version anyway. It’s better to leave the memory pages with SSE constants completely cold, and have the AVX version keep all its constants together. Also, it’s a harder pattern-matching problem to be handled at assemble or link time. (However it’s done; I didn’t read every directive to figure out which one enables the merging.)

gcc 5.3 also merges constants, but doesn’t use broadcast-loads to compress 32B constants. Again the 16B constant doesn’t overlap with the 32B constant.


GCC12 has started to prefer constructing some vector constants on the fly when AVX is available, but it starts with mov reg, imm64 / vmovq / shuffle. Even if the pattern is simple and repeated, it will use a bulky 64-bit immediate and a vpunpcklqdq instead of a 5-byte mov eax, imm32 and a dword broadcast. Ironically, it also doesn’t construct on the fly for set1_epi16(0x00ff), which is trivial (pcmpeqd / psrlw xmm, 8). https://godbolt.org/z/78cMaxjMz

With AVX-512 available, mov r,imm / vpbroadcastd x/y/zmm, eax is only 2 instructions. (vpbroadcastd ymm0, eax is 1 uop on Intel, 2 uops on Zen 4.) That makes it more attractive to construct vectors that way, and easier for compilers to come up with.
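From C there’s nothing special to write; a plain set1 is enough, and whether the compiler picks the GPR-broadcast over a memory constant is up to it (a sketch, assuming -mavx512vl; the constant is an arbitrary example):

    #include <immintrin.h>

    __m256i set1_dword(void)
    {
        /* may compile to: mov eax, imm32 / vpbroadcastd ymm0, eax */
        return _mm256_set1_epi32(0x01020304);
    }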
