## 13 August, 2018

### Are we autovectorized yet?

A short post about the compiler's ability (or inability?) to autovectorize your code. A lot of people make claims about compiler optimizations without actually proving them. Some even state things such as "the compiler is smarter than you" or "the compiler can autovectorize your code much better than you", and then use this nonsense as an argument against people who care how their code is compiled or who write optimized functions themselves. Fortunately, there are enough tools online to verify such claims and to show which optimizations are likely and which aren't.

Here is our sample function:

```cpp
struct Matrix2D {
  double m00;
  double m01;
  double m10;
  double m11;
  double m20;
  double m21;

  inline void reset(double a, double b, double c, double d, double e, double f) noexcept {
    m00 = a;
    m01 = b;
    m10 = c;
    m11 = d;
    m20 = e;
    m21 = f;
  }
};

void Matrix2D_Multiply(Matrix2D* dst, const Matrix2D* a, const Matrix2D* b) noexcept {
  dst->reset(a->m00 * b->m00 + a->m01 * b->m10,
             a->m00 * b->m01 + a->m01 * b->m11,
             a->m10 * b->m00 + a->m11 * b->m10,
             a->m10 * b->m01 + a->m11 * b->m11,
             a->m20 * b->m00 + a->m21 * b->m10 + b->m20,
             a->m20 * b->m01 + a->m21 * b->m11 + b->m21);
}
```

Yes, it's a simple affine matrix multiplication, commonly used in 2D graphics. I initially thought that all major C++ compilers would be able to autovectorize this code, as it looks pretty straightforward, but I was wrong.

## MSVC 2017 [/Ox /arch:AVX2]

```asm
; 48 instructions
Matrix2D_Multiply:
mov     rax, rsp
sub     rsp, 120
vmovsd  xmm4, QWORD PTR [r8+8]
vmulsd  xmm1, xmm4, QWORD PTR [rdx+32]
vmovaps XMMWORD PTR [rax-24], xmm6
vmovaps XMMWORD PTR [rax-40], xmm7
vmovaps XMMWORD PTR [rax-56], xmm8
vmovsd  xmm8, QWORD PTR [r8]
vmulsd  xmm2, xmm8, QWORD PTR [rdx+16]
vmovaps XMMWORD PTR [rax-72], xmm9
vmovsd  xmm9, QWORD PTR [r8+16]
vmovaps XMMWORD PTR [rax-88], xmm10
vmovaps XMMWORD PTR [rax-104], xmm11
vmovsd  xmm11, QWORD PTR [r8+24]
vmulsd  xmm0, xmm11, QWORD PTR [rdx+40]
vaddsd  xmm1, xmm1, xmm0
vmulsd  xmm0, xmm9, QWORD PTR [rdx+40]
vmovaps XMMWORD PTR [rsp], xmm12
vaddsd  xmm12, xmm1, QWORD PTR [r8+40]
vmulsd  xmm1, xmm8, QWORD PTR [rdx+32]
vaddsd  xmm1, xmm1, xmm0
vaddsd  xmm10, xmm1, QWORD PTR [r8+32]
vmulsd  xmm1, xmm4, QWORD PTR [rdx+16]
vmulsd  xmm0, xmm11, QWORD PTR [rdx+24]
vaddsd  xmm7, xmm1, xmm0
vmulsd  xmm1, xmm9, QWORD PTR [rdx+24]
vaddsd  xmm6, xmm1, xmm2
vmulsd  xmm0, xmm11, QWORD PTR [rdx+8]
vmovaps xmm11, XMMWORD PTR [rax-104]
vmulsd  xmm1, xmm4, QWORD PTR [rdx]
vaddsd  xmm4, xmm1, xmm0
vmulsd  xmm2, xmm8, QWORD PTR [rdx]
vmovaps xmm8, XMMWORD PTR [rax-56]
vmulsd  xmm1, xmm9, QWORD PTR [rdx+8]
vaddsd  xmm0, xmm2, xmm1
vmovaps xmm9, XMMWORD PTR [rax-72]
vmovsd  QWORD PTR [rcx+16], xmm6
vmovaps xmm6, XMMWORD PTR [rax-24]
vmovsd  QWORD PTR [rcx+24], xmm7
vmovaps xmm7, XMMWORD PTR [rax-40]
vmovsd  QWORD PTR [rcx+32], xmm10
vmovaps xmm10, XMMWORD PTR [rax-88]
vmovsd  QWORD PTR [rcx+40], xmm12
vmovaps xmm12, XMMWORD PTR [rsp]
vmovsd  QWORD PTR [rcx], xmm0
vmovsd  QWORD PTR [rcx+8], xmm4
add     rsp, 120
ret     0
```

Not so great, but there is an explanation for that. Firstly, the compiler decided to go scalar (it failed to autovectorize the code), which means it needs a lot of registers; this decision basically dictated everything else. Secondly, the WIN64 calling convention requires some SIMD registers (`xmm6` to `xmm15`) to be preserved across function calls, so in order to use as many registers as it needs, the function has to additionally save and restore some of them. Don't be confused by the use of `rax` here: it just holds the original stack pointer and is used as a base for addressing. This trick is a code-size optimization.

## GCC trunk [-O2 -mavx2]

```asm
; 37 instructions
Matrix2D_Multiply:
vmovsd  xmm0, QWORD PTR [rdx+8]
vmovsd  xmm5, QWORD PTR [rdx+24]
vmovsd  xmm2, QWORD PTR [rsi+32]
vmovsd  xmm3, QWORD PTR [rsi+40]
vmovsd  xmm6, QWORD PTR [rdx+16]
vmovsd  xmm9, QWORD PTR [rsi+16]
vmulsd  xmm1, xmm3, xmm5
vmovsd  xmm8, QWORD PTR [rsi+24]
vmovsd  xmm7, QWORD PTR [rsi+8]
vmulsd  xmm4, xmm2, xmm0
vmulsd  xmm3, xmm3, xmm6
vmulsd  xmm11, xmm6, xmm7
vmulsd  xmm7, xmm5, xmm7
vmulsd  xmm6, xmm6, xmm8
vaddsd  xmm4, xmm4, xmm1
vmovsd  xmm1, QWORD PTR [rdx]
vmulsd  xmm5, xmm5, xmm8
vaddsd  xmm4, xmm4, QWORD PTR [rdx+40]
vmulsd  xmm2, xmm2, xmm1
vaddsd  xmm3, xmm3, xmm2
vmovsd  xmm2, QWORD PTR [rsi]
vaddsd  xmm3, xmm3, QWORD PTR [rdx+32]
vmovsd  QWORD PTR [rdi+40], xmm4
vmulsd  xmm10, xmm1, xmm2
vmulsd  xmm2, xmm0, xmm2
vaddsd  xmm10, xmm10, xmm11
vmovsd  QWORD PTR [rdi+32], xmm3
vmulsd  xmm1, xmm1, xmm9
vmulsd  xmm0, xmm0, xmm9
vaddsd  xmm2, xmm2, xmm7
vaddsd  xmm1, xmm1, xmm6
vaddsd  xmm0, xmm0, xmm5
vmovsd  QWORD PTR [rdi], xmm10
vmovsd  QWORD PTR [rdi+8], xmm2
vmovsd  QWORD PTR [rdi+16], xmm1
vmovsd  QWORD PTR [rdi+24], xmm0
ret
```

A perfect scalar version, but nothing more.

## Clang trunk [-O2 -mavx2]

```asm
; 22 instructions
Matrix2D_Multiply:
vmovddup xmm0, qword ptr [rsi] # xmm0 = mem[0,0]
vmovupd xmm1, xmmword ptr [rdx]
vmovupd xmm2, xmmword ptr [rdx + 16]
vmulpd xmm0, xmm0, xmm1
vmovddup xmm3, qword ptr [rsi + 8] # xmm3 = mem[0,0]
vmulpd xmm3, xmm3, xmm2
vaddpd xmm0, xmm0, xmm3
vmovddup xmm3, qword ptr [rsi + 16] # xmm3 = mem[0,0]
vmulpd xmm3, xmm1, xmm3
vmovddup xmm4, qword ptr [rsi + 24] # xmm4 = mem[0,0]
vmulpd xmm4, xmm2, xmm4
vaddpd xmm3, xmm3, xmm4
vmovddup xmm5, qword ptr [rsi + 32] # xmm5 = mem[0,0]
vmulpd xmm1, xmm1, xmm5
vmovddup xmm5, qword ptr [rsi + 40] # xmm5 = mem[0,0]
vmulpd xmm2, xmm2, xmm5
vaddpd xmm1, xmm1, xmm2
vaddpd xmm1, xmm1, xmmword ptr [rdx + 32]
vmovupd xmmword ptr [rdi], xmm0
vmovupd xmmword ptr [rdi + 16], xmm3
vmovupd xmmword ptr [rdi + 32], xmm1
ret
```

Well, this is what I initially expected to get: autovectorized code that is about as close to hand-written assembly as I can imagine.
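For comparison, a hand-written counterpart of what Clang produced might look like the following SSE2 intrinsics sketch. This is my code, not from the post: Clang's `vmovddup` broadcast is approximated with `_mm_set1_pd`, and it assumes an x86-64 target, where SSE2 is always available.

```cpp
#include <emmintrin.h> // SSE2 intrinsics

struct Matrix2D {
  double m00, m01, m10, m11, m20, m21;
};

// Hand-vectorized sketch: each __m128d holds one matrix row [mX0, mX1].
void Matrix2D_MultiplySSE2(Matrix2D* dst, const Matrix2D* a, const Matrix2D* b) noexcept {
  __m128d b0 = _mm_loadu_pd(&b->m00); // [b00, b01]
  __m128d b1 = _mm_loadu_pd(&b->m10); // [b10, b11]
  __m128d b2 = _mm_loadu_pd(&b->m20); // [b20, b21]

  // row_i(dst) = a_i0 * row0(b) + a_i1 * row1(b), plus row2(b) for the
  // translation row.
  __m128d r0 = _mm_add_pd(_mm_mul_pd(_mm_set1_pd(a->m00), b0),
                          _mm_mul_pd(_mm_set1_pd(a->m01), b1));
  __m128d r1 = _mm_add_pd(_mm_mul_pd(_mm_set1_pd(a->m10), b0),
                          _mm_mul_pd(_mm_set1_pd(a->m11), b1));
  __m128d r2 = _mm_add_pd(_mm_add_pd(_mm_mul_pd(_mm_set1_pd(a->m20), b0),
                                     _mm_mul_pd(_mm_set1_pd(a->m21), b1)),
                          b2);

  _mm_storeu_pd(&dst->m00, r0);
  _mm_storeu_pd(&dst->m10, r1);
  _mm_storeu_pd(&dst->m20, r2);
}
```

Compiled at `-O2`, this should boil down to essentially the same loads, broadcasts, multiplies, and adds as Clang's autovectorized output above.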

## Conclusion

Compilers are generally improving, but some of them still fail at really basic autovectorization. So no, we are not autovectorized yet, and people should verify claims like these before repeating them. Time will tell...

## Update

I fixed one statement. The use of `[rax]` by MSVC is actually smart: addressing `[rax + imm]` encodes one byte smaller than addressing `[rsp + imm]`.

You can use Compiler Explorer and try it yourself!

#### 1 comment:

1. Interestingly, on GCC... if you go with "-O3"... you get down to around 28 instructions and "-march=native" down to around 20. Granted I don't know what hardware they're running, but it's fair to say that it can do "better" under the right conditions.