Why is this seemingly slower C loop actually twice as fast as the other way?

If you want fast vectorized code, don’t do short-circuit evaluation, and don’t branch in general. You want the compiler to be able to do 16 or 32 elements at once with SIMD operations, using 8-bit elements. (Compilers can optimize ifs to branchless code if it’s safe to do the work unconditionally, including dereferences, and there aren’t side-effects. This is called if-conversion, and is typically necessary for auto-vectorization of code like this.)
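For a concrete picture (hypothetical names, not the question's actual code), compare a short-circuiting loop with its branchless equivalent; if-conversion is what turns the first into the second, when doing the work unconditionally is safe:

```cpp
#include <cstddef>

// Assumes the input bytes are 0 or 1, so & and && agree on the result.
void and_branchy(std::size_t n, const unsigned char *x,
                 const unsigned char *y, unsigned char *out) {
    for (std::size_t i = 0; i < n; i++)
        out[i] = x[i] && y[i];    // && short-circuits: y[i] might not be read
}

void and_branchless(std::size_t n, const unsigned char *x,
                    const unsigned char *y, unsigned char *out) {
    for (std::size_t i = 0; i < n; i++)
        out[i] = x[i] & y[i];     // always reads both: trivially vectorizable
}
```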

And you don’t want the compiler to worry that it’s not allowed to touch some memory, because the C abstract machine doesn’t touch it. E.g., if all the v_out[i] elements are false, v_x might be a NULL pointer without causing UB! So the compiler can’t invent reads of objects that the C logic doesn’t read at all.

If v_x were truly an array, not just a pointer, the compiler would know that it’s readable, and would be allowed to invent accesses to it by if-converting the short-circuit logic to branchless. But if its cost heuristics don’t see a really big benefit (like auto-vectorization), it might choose not to. In practice, branchy code will often be slower with a random mix of trues and falses (and NAs), thanks to branch mispredictions.

As you can see in the compiler’s assembly output (Clang 15 -O2 on Compiler Explorer), option 1 auto-vectorizes with SIMD, branchlessly handling 4 optional-bools in parallel (with just SSE2, more with -march=native). (Thanks to Richard in comments for cooking up the Compiler Explorer link; it’s probably reflective of what Apple Clang will do to your real code in main.)


Your 3-state bool that supports an NA state can be implemented with 2 bits, in a way that bitwise AND does your && operation. You can store arrays of it as one per unsigned char, or packed 4 per char to quadruple your throughput for vectorized operations, at the cost of slower access. (Or in general CHAR_BIT/2 per char, but on mainstream C implementations for x86 that’s 4.)

  • F = 00
  • N = 10 (in binary, so C 0b10 aka 2)
  • T = 11
  • conversion to bool with val & 1.
  • conversion from bool with 0b11 * b or something to broadcast the low bit to both positions.
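A minimal compile-time check of that encoding (the enum values are just the bullet list above; nothing else is assumed):

```cpp
// F=00, N=10, T=11: bitwise AND implements the 3-state &&.
enum : unsigned char { F = 0b00, N = 0b10, T = 0b11 };

static_assert((F & N) == F && (F & T) == F, "F & anything == F");
static_assert((N & T) == N && (T & N) == N, "N absorbs T");
static_assert((N & N) == N && (T & T) == T, "x & x == x");
static_assert((T & 1) == 1 && (N & 1) == 0, "low bit is the plain-bool value");
static_assert((0b11 * true) == T, "broadcast a bool's low bit to both positions");
```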

F & anything == F because F is all-zero bits. N&N == N; x&x == x is trivially true for any bit-pattern.
The “clever” part is that N&T = T&N = N, since the set bits in T are a superset of those in N.

This also works for || with bitwise |:
F|N == N and F|T == T because 0|x == x. Also x|x == x for any x, so we’re still fine there.

N = 0b10 won’t set the low bit when ORing, but will clear it when ANDing.
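The same kind of compile-time check for the | case:

```cpp
// Same F=00, N=10, T=11 encoding: bitwise OR implements the 3-state ||.
enum : unsigned char { F = 0b00, N = 0b10, T = 0b11 };

static_assert((F | N) == N && (F | T) == T, "0 | x == x");
static_assert((N | T) == T && (T | N) == T, "T's set bits are a superset of N's");
static_assert((N | N) == N && (F | F) == F && (T | T) == T, "x | x == x");
```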


I forgot you said C instead of C++, so this bare-bones class wrapper (only sufficient to demo a few test callers) might not be relevant, but a loop doing c1[i] &= c2[i]; in plain C for unsigned char *c1, *c2 will auto-vectorize exactly the same way.
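That plain-C loop might look like this (hypothetical function name; valid as both C and C++):

```cpp
#include <stddef.h>

// ANDs c2 into c1 element-wise; auto-vectorizes the same as the C++ version.
void and_arrays(size_t len, unsigned char *c1, const unsigned char *c2) {
    for (size_t i = 0; i < len; i++)
        c1[i] &= c2[i];
}
```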

struct NBool{   // Nullable bool, should probably rename to optional_bool
    unsigned char val;
    static const unsigned char F = 0b00;
    static const unsigned char T = 0b11;
    static const unsigned char N = 0b10;  // N&T == N;  N&N == N;  N&F == F

    NBool& operator&=(NBool rhs){   // define && the same way if you want, as non-short-circuiting
        val &= rhs.val;
        return *this;               // return a reference, not a copy
    }
    operator bool() const { return val & 1; }  // the low bit holds the plain-bool value

    constexpr NBool(unsigned char x) : val(x) {}
    constexpr NBool& operator=(const NBool &) = default;
};

#include <stdint.h>
#include <stdlib.h>

bool test(NBool a){
    return a;
}

bool test2(NBool a){
    NBool b = NBool::F;
    return a &= b;   // always false: anything & F == F
}


void foo(size_t len, NBool *a1, NBool *a2 )
{
    for (size_t i = 0 ; i < len ; i++){
        a1[i] &= a2[i];
    }
}

(I think “Nullable” isn’t really correct terminology for something that can be NaN / NA; it’s always safe to read, and it’s not a reference in the first place. Maybe optional_bool, like C++ std::optional, which is a value that may or may not be present.)

This compiles on Compiler Explorer with GCC and Clang. Clang auto-vectorizes fairly nicely with an unrolled loop doing vandps. (A bit of a weird choice by clang; on -march=haswell, vpand has better throughput.) But it is still limited by 1/clock store and 2/clock load anyway; this very much bottlenecks on load/store with such low computational intensity, even if the data is hot in L1d cache.

(Intel’s optimization manual says that even though Skylake’s peak L1d bandwidth is 2 loads + 1 store per clock (96 bytes with 32-byte vectors), the sustained bandwidth is more like 84 bytes per clock.)

It can still come relatively close to 32 bytes ANDed per clock cycle, with AVX. So that’s 32 NBool & operations, or 128 per clock if you pack 4 NBools per byte.

Packing NBools down to a packed bitmap of 1-bit bools could be done with pslld xmm, 7 / pmovmskb to extract the low bit of each byte (after shifting it to the high bit).
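Here’s a sketch of that extraction with SSE2 intrinsics (x86-only; the function name is mine; `_mm_slli_epi32` is `pslld`, `_mm_movemask_epi8` is `pmovmskb`):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

// Pack 16 one-NBool-per-byte values into a 16-bit bitmap of plain bools.
// Garbage shifted in from neighboring bytes doesn't matter: pmovmskb only
// reads the high bit of each byte, which came from that byte's own low bit.
uint16_t pack_nbools(const unsigned char *p) {
    __m128i v = _mm_loadu_si128((const __m128i*)p);
    v = _mm_slli_epi32(v, 7);                 // low bit of each byte -> high bit
    return (uint16_t)_mm_movemask_epi8(v);    // grab the high bit of each byte
}
```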

If stored 4 per byte, some SIMD bit-manipulation is in order to pack down to bools, perhaps vpshufb as a 4-bit LUT to pack pairs of NBools down to a pair of bools in the bottom of a nibble, then combine? Or use scalar BMI2 pext to extract every other bit from 64 bits, if you’re on a CPU with fast pext (Intel Haswell or later, or AMD Zen 3 or later).
