10 April, 2016

Autovectorization Nightmare

The Opportunity of the Compiler to Break Your Code

C++ language and compilers are moving forward so fast. Today it's pretty common that the C++ compiler is able to take advantage of SIMD and automatically vectorize your plain scalar C++ code into something that looks like a monster. Sometimes this may give you some gains, but the biggest disaster is a compiler optimizing your code that has been already optimized - like optimizing leading loops that precede optimized loops, or optimizing initialization loops that are executed just once, etc.

I remember reporting a clang bug where it vectorized a loop containing a sin() call by just unrolling it 8 times and calling the same sin() 8 times sequentially. This of course didn't improve the performance at all, but was not breaking the code at least.

Well, it would be cool if the compiler just optimizes your code without actually breaking it, but more aggressive optimizations usually lead to more opportunities of breaking your humble scalar C++ code. And... because the universe is not fair, and Murphy's Law applies to all of us, I have hit the issue too, today.

There is a term called undefined behavior, and pretty much everybody writing C++ code knows about it and tries to avoid it. The problem is that on many targets the undefined behavior is actually a well-defined undefined-behavior. The reason is that you know the platform and you can take advantage of certain constructs when optimizing certain things. So what was the problem? Look at the simple code below:

static void write_unaligned_32(void* ptr, uint32_t value) noexcept {
  if (HAS_UNALIGNED_WRITE) {
    static_cast<uint32_t*>(ptr)[0] = value;
  }
  else {
    // NOTE: This example doesn't care of endianness (this is LE).
    static_cast<uint8_t*>(ptr)[0] = static_cast<uint8_t>((value      ) & 0xFF);
    static_cast<uint8_t*>(ptr)[1] = static_cast<uint8_t>((value >>  8) & 0xFF);
    static_cast<uint8_t*>(ptr)[2] = static_cast<uint8_t>((value >> 16) & 0xFF);
    static_cast<uint8_t*>(ptr)[3] = static_cast<uint8_t>((value >> 24));
  }
}

This is basically an abstraction that I use to make the code itself more portable, and to keep non-portable code at a specific place only. Unaligned reads and writes are fine on x86 architectures and are much faster than reading or writing individual bytes. In addition, if all unaligned writes go through write_unaligned_32 function it's quite easy to assume that all other writes are aligned and to simply locate code that requires unaligned writes for future refactorization, optimizations, and sanitization.

The write_unaligned_32 function looks sane. It uses unaligned write only when it's allowed by the target. However, the problem is that if you use it in a loop that the compiler tries to auto-vectorize, then the compiler may assume that the ptr is aligned to 4 bytes, because you store uint32_t value to it. And this is exactly what have happened. The compiler auto-vectorized the loop and used `movdqa` instruction to fetch 16 bytes at once assuming that the buffer is 16-byte aligned, which basically led to a general protection fault (#GP). The trickiest part is that this only happened in release mode.

Possible solutions

  • Tell the compiler to not autovectorize? Workaround! Disabling something should never be a solution.
  • Don't use unaligned reads / writes? Workaround! Unnecessarily pessimistic.
  • Tell the compiler the pointer is not aligned? Looks like a solution to me!

There are vendor-specific extensions that can be used to tell the compiler that the pointer isn't aligned as it thinks. However, some #ifdefs are necessary, as usual, to take advantage of that.

The first thing I did was to add CC_ASSUME_ALIGNED template to the cxxtool. The initial implementation looks like:

// [@CC_ASSUME_ALIGNED{@]
// \def @prefix@_ASSUME_ALIGNED(p, alignment)
// Assume that the pointer 'p' is aligned to at least 'alignment' bytes.
#if @prefix@_CC_HAS_ASSUME_ALIGNED
# define @prefix@_ASSUME_ALIGNED(p, alignment) __assume_aligned(p, alignment)
#elif @prefix@_CC_HAS_BUILTIN_ASSUME_ALIGNED
# define @prefix@_ASSUME_ALIGNED(p, alignment) p = __builtin_assume_aligned(p, alignment)
#else
# define @prefix@_ASSUME_ALIGNED(p, alignment) ((void)0)
#endif
// [@CC_ASSUME_ALIGNED}@]

The compiler detection module detects whether __assume_aligned or __builtin_assume_aligned is provided by the compiler, and then uses the one available. If none of them is provided then @prefix@_ASSUME_ALIGNED() is simply a NOP, which is fine.

Then the write_unaligned_32 function can be changed to something like this:

static void write_unaligned_32(void* ptr, uint32_t value) noexcept {
  ASSUME_ALIGNED(ptr, 1);

  if (HAS_UNALIGNED_WRITE_32) {
    static_cast<uint32_t*>(ptr)[0] = value;
  }
  else {
    // NOTE: This example doesn't care of endianness (this is LE).
    static_cast<uint8_t*>(ptr)[0] = static_cast<uint8_t>((value      ) & 0xFF);
    static_cast<uint8_t*>(ptr)[1] = static_cast<uint8_t>((value >>  8) & 0xFF);
    static_cast<uint8_t*>(ptr)[2] = static_cast<uint8_t>((value >> 16) & 0xFF);
    static_cast<uint8_t*>(ptr)[3] = static_cast<uint8_t>((value >> 24));
  }
}

What Triggered It

This bug happened in Blend2D's function that reads a non-premultiplied ARGB32 pixels from one buffer, converts the pixels to a premultiplied ARGB32, and stores them into another buffer. The function has been designed for image codecs so it must support reading from misaligned memory. This is common for example in PNG, which adds one BYTE for each ROW of pixels to specify the row-filter, and this additional byte causes all 16-bit and 32-bit rows to become misaligned. The bug has been fixed in both Blend2D and AsmJit.

Conclusion

Having all non-portable code at a single place is always a good idea. Basically one line fixed a bug that wouldn't be trivial to find by a non-skilled developer and that was only triggered by compiler's auto-vectorization feature. I fixed it by intuition, basically when I saw #GP I knew that there is an unaligned memory access, and because the function doesn't use SIMD by itself, it was obvious that the compiler did something I didn't ask for. I don't like this feature at all, but it's always good to have code that just works.

No comments:

Post a Comment