13 August, 2018

Are we autovectorized yet?

Short post about the compiler's ability (or inability?) to autovectorize your code. A lot of people write claims about compiler optimizations without actually proving them. Some people even state things such as "compiler is smarter than you", "compiler can autovectorize your code much better than you", and similar nonsense, and then use these as a foundation against people that care how their code is compiled or that even write optimized functions themselves. Hopefully, there is enough tools online to verify such claims and to prove which optimizations are likely and which aren't.

Here is our sample function:

struct Matrix2D {
  double m00;
  double m01;
  double m10;
  double m11;
  double m20;
  double m21;

  inline void reset(double a, double b, double c, double d, double e, double f) noexcept {
    m00 = a;
    m01 = b;
    m10 = c;
    m11 = d;
    m20 = e;
    m21 = f;
  }
};

void Matrix2D_Multiply(Matrix2D* dst, const Matrix2D* a, const Matrix2D* b) noexcept {
  dst->reset(a->m00 * b->m00 + a->m01 * b->m10,
             a->m00 * b->m01 + a->m01 * b->m11,
             a->m10 * b->m00 + a->m11 * b->m10,
             a->m10 * b->m01 + a->m11 * b->m11,
             a->m20 * b->m00 + a->m21 * b->m10 + b->m20,
             a->m20 * b->m01 + a->m21 * b->m11 + b->m21);
}

Yes it's a simple affine matrix multiplication used commonly in 2D graphics. I initially thought that all major C++ compilers would be able to autovectorize this code as it looks pretty straightforward, but I was wrong.

MSVC 2017 [/Ox /arch:AVX2]

; 48 instructions
Matrix2D_Multiply:
  mov     rax, rsp
  sub     rsp, 120
  vmovsd  xmm4, QWORD PTR [r8+8]
  vmulsd  xmm1, xmm4, QWORD PTR [rdx+32]
  vmovaps XMMWORD PTR [rax-24], xmm6
  vmovaps XMMWORD PTR [rax-40], xmm7
  vmovaps XMMWORD PTR [rax-56], xmm8
  vmovsd  xmm8, QWORD PTR [r8]
  vmulsd  xmm2, xmm8, QWORD PTR [rdx+16]
  vmovaps XMMWORD PTR [rax-72], xmm9
  vmovsd  xmm9, QWORD PTR [r8+16]
  vmovaps XMMWORD PTR [rax-88], xmm10
  vmovaps XMMWORD PTR [rax-104], xmm11
  vmovsd  xmm11, QWORD PTR [r8+24]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+40]
  vaddsd  xmm1, xmm1, xmm0
  vmulsd  xmm0, xmm9, QWORD PTR [rdx+40]
  vmovaps XMMWORD PTR [rsp], xmm12
  vaddsd  xmm12, xmm1, QWORD PTR [r8+40]
  vmulsd  xmm1, xmm8, QWORD PTR [rdx+32]
  vaddsd  xmm1, xmm1, xmm0
  vaddsd  xmm10, xmm1, QWORD PTR [r8+32]
  vmulsd  xmm1, xmm4, QWORD PTR [rdx+16]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+24]
  vaddsd  xmm7, xmm1, xmm0
  vmulsd  xmm1, xmm9, QWORD PTR [rdx+24]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+8]
  vmovaps xmm11, XMMWORD PTR [rax-104]
  vaddsd  xmm6, xmm2, xmm1
  vmulsd  xmm1, xmm4, QWORD PTR [rdx]
  vmulsd  xmm2, xmm8, QWORD PTR [rdx]
  vmovaps xmm8, XMMWORD PTR [rax-56]
  vaddsd  xmm4, xmm1, xmm0
  vmulsd  xmm1, xmm9, QWORD PTR [rdx+8]
  vmovaps xmm9, XMMWORD PTR [rax-72]
  vaddsd  xmm0, xmm2, xmm1
  vmovsd  QWORD PTR [rcx+16], xmm6
  vmovaps xmm6, XMMWORD PTR [rax-24]
  vmovsd  QWORD PTR [rcx+24], xmm7
  vmovaps xmm7, XMMWORD PTR [rax-40]
  vmovsd  QWORD PTR [rcx+32], xmm10
  vmovaps xmm10, XMMWORD PTR [rax-88]
  vmovsd  QWORD PTR [rcx+40], xmm12
  vmovaps xmm12, XMMWORD PTR [rsp]
  vmovsd  QWORD PTR [rcx], xmm0
  vmovsd  QWORD PTR [rcx+8], xmm4
  add     rsp, 120
  ret     0

Not so great, but there is an explanation for that. Firstly, the compiler decided to go scalar (failed to autovectorize the code), which means that it would need a lot of registers (this decision basically predated everything). Secondly, WIN64 calling convention requires some registers to be preserved across function calls so in order to use as many registers as it needs it has to additionally save and restore some of them. Don't be confused with rax use here, it's just a trick and it's used as an original stack pointer. This trick is just a code-size optimization.

GCC trunk [-O2 -mavx2]

; 37 instructions
Matrix2D_Multiply:
  vmovsd  xmm0, QWORD PTR [rdx+8]
  vmovsd  xmm5, QWORD PTR [rdx+24]
  vmovsd  xmm2, QWORD PTR [rsi+32]
  vmovsd  xmm3, QWORD PTR [rsi+40]
  vmovsd  xmm6, QWORD PTR [rdx+16]
  vmovsd  xmm9, QWORD PTR [rsi+16]
  vmulsd  xmm1, xmm3, xmm5
  vmovsd  xmm8, QWORD PTR [rsi+24]
  vmovsd  xmm7, QWORD PTR [rsi+8]
  vmulsd  xmm4, xmm2, xmm0
  vmulsd  xmm3, xmm3, xmm6
  vmulsd  xmm11, xmm6, xmm7
  vmulsd  xmm7, xmm5, xmm7
  vmulsd  xmm6, xmm6, xmm8
  vaddsd  xmm4, xmm4, xmm1
  vmovsd  xmm1, QWORD PTR [rdx]
  vmulsd  xmm5, xmm5, xmm8
  vaddsd  xmm4, xmm4, QWORD PTR [rdx+40]
  vmulsd  xmm2, xmm2, xmm1
  vaddsd  xmm3, xmm2, xmm3
  vmovsd  xmm2, QWORD PTR [rsi]
  vaddsd  xmm3, xmm3, QWORD PTR [rdx+32]
  vmovsd  QWORD PTR [rdi+40], xmm4
  vmulsd  xmm10, xmm1, xmm2
  vmulsd  xmm2, xmm0, xmm2
  vmovsd  QWORD PTR [rdi+32], xmm3
  vmulsd  xmm1, xmm1, xmm9
  vmulsd  xmm0, xmm0, xmm9
  vaddsd  xmm10, xmm10, xmm11
  vaddsd  xmm2, xmm2, xmm7
  vaddsd  xmm1, xmm1, xmm6
  vaddsd  xmm0, xmm0, xmm5
  vmovsd  QWORD PTR [rdi], xmm10
  vmovsd  QWORD PTR [rdi+8], xmm2
  vmovsd  QWORD PTR [rdi+16], xmm1
  vmovsd  QWORD PTR [rdi+24], xmm0
  ret

A perfect scalar version, but nothing more.

Clang trunk [-O2 -mavx2]

; 22 instructions
Matrix2D_Multiply:
  vmovddup xmm0, qword ptr [rsi] # xmm0 = mem[0,0]
  vmovupd xmm1, xmmword ptr [rdx]
  vmovupd xmm2, xmmword ptr [rdx + 16]
  vmulpd xmm0, xmm0, xmm1
  vmovddup xmm3, qword ptr [rsi + 8] # xmm3 = mem[0,0]
  vmulpd xmm3, xmm3, xmm2
  vaddpd xmm0, xmm0, xmm3
  vmovddup xmm3, qword ptr [rsi + 16] # xmm3 = mem[0,0]
  vmulpd xmm3, xmm1, xmm3
  vmovddup xmm4, qword ptr [rsi + 24] # xmm4 = mem[0,0]
  vmulpd xmm4, xmm2, xmm4
  vmovddup xmm5, qword ptr [rsi + 32] # xmm5 = mem[0,0]
  vmulpd xmm1, xmm1, xmm5
  vmovddup xmm5, qword ptr [rsi + 40] # xmm5 = mem[0,0]
  vmulpd xmm2, xmm2, xmm5
  vaddpd xmm1, xmm1, xmm2
  vaddpd xmm1, xmm1, xmmword ptr [rdx + 32]
  vaddpd xmm2, xmm3, xmm4
  vmovupd xmmword ptr [rdi], xmm0
  vmovupd xmmword ptr [rdi + 16], xmm2
  vmovupd xmmword ptr [rdi + 32], xmm1
  ret

Well, this is what I have initially expected to get. An autovectorized code that is as close to a hand-written asm as I can think of.

Conclusion

Compilers are generally improving, but some of them still lack to perform a really basic autovectorization optimizations. So we are not autovectorized yet and people should in general verify what they are claiming. Time will tell...

Update

I fixed one statement. Usage of [rax] by MSVC is actually smart. Addressing [rax + imm] is one byte smaller than addressing [rsp + imm].

You can use Compiler Exporer and try it yourself!

1 comment:

  1. Interestingly, on GCC... if you go with "-O3"... you get down to around 28 instructions and "-march=native" down to around 20. Granted I don't know what hardware they're running, but it's fair to say that it can do "better" under the right conditions.

    ReplyDelete