Short post about the compiler's ability (or inability?) to autovectorize your code. A lot of people write claims about compiler optimizations without actually proving them. Some people even state things such as "compiler is smarter than you", "compiler can autovectorize your code much better than you", and similar nonsense, and then use these as a foundation against people that care how their code is compiled or that even write optimized functions themselves. Hopefully, there is enough tools online to verify such claims and to prove which optimizations are likely and which aren't.
Here is our sample function:
struct Matrix2D {
double m00;
double m01;
double m10;
double m11;
double m20;
double m21;
inline void reset(double a, double b, double c, double d, double e, double f) noexcept {
m00 = a;
m01 = b;
m10 = c;
m11 = d;
m20 = e;
m21 = f;
}
};
void Matrix2D_Multiply(Matrix2D* dst, const Matrix2D* a, const Matrix2D* b) noexcept {
dst->reset(a->m00 * b->m00 + a->m01 * b->m10,
a->m00 * b->m01 + a->m01 * b->m11,
a->m10 * b->m00 + a->m11 * b->m10,
a->m10 * b->m01 + a->m11 * b->m11,
a->m20 * b->m00 + a->m21 * b->m10 + b->m20,
a->m20 * b->m01 + a->m21 * b->m11 + b->m21);
}
Yes it's a simple affine matrix multiplication used commonly in 2D graphics. I initially thought that all major C++ compilers would be able to autovectorize this code as it looks pretty straightforward, but I was wrong.
MSVC 2017 [/Ox /arch:AVX2]
; 48 instructions
Matrix2D_Multiply:
mov rax, rsp
sub rsp, 120
vmovsd xmm4, QWORD PTR [r8+8]
vmulsd xmm1, xmm4, QWORD PTR [rdx+32]
vmovaps XMMWORD PTR [rax-24], xmm6
vmovaps XMMWORD PTR [rax-40], xmm7
vmovaps XMMWORD PTR [rax-56], xmm8
vmovsd xmm8, QWORD PTR [r8]
vmulsd xmm2, xmm8, QWORD PTR [rdx+16]
vmovaps XMMWORD PTR [rax-72], xmm9
vmovsd xmm9, QWORD PTR [r8+16]
vmovaps XMMWORD PTR [rax-88], xmm10
vmovaps XMMWORD PTR [rax-104], xmm11
vmovsd xmm11, QWORD PTR [r8+24]
vmulsd xmm0, xmm11, QWORD PTR [rdx+40]
vaddsd xmm1, xmm1, xmm0
vmulsd xmm0, xmm9, QWORD PTR [rdx+40]
vmovaps XMMWORD PTR [rsp], xmm12
vaddsd xmm12, xmm1, QWORD PTR [r8+40]
vmulsd xmm1, xmm8, QWORD PTR [rdx+32]
vaddsd xmm1, xmm1, xmm0
vaddsd xmm10, xmm1, QWORD PTR [r8+32]
vmulsd xmm1, xmm4, QWORD PTR [rdx+16]
vmulsd xmm0, xmm11, QWORD PTR [rdx+24]
vaddsd xmm7, xmm1, xmm0
vmulsd xmm1, xmm9, QWORD PTR [rdx+24]
vmulsd xmm0, xmm11, QWORD PTR [rdx+8]
vmovaps xmm11, XMMWORD PTR [rax-104]
vaddsd xmm6, xmm2, xmm1
vmulsd xmm1, xmm4, QWORD PTR [rdx]
vmulsd xmm2, xmm8, QWORD PTR [rdx]
vmovaps xmm8, XMMWORD PTR [rax-56]
vaddsd xmm4, xmm1, xmm0
vmulsd xmm1, xmm9, QWORD PTR [rdx+8]
vmovaps xmm9, XMMWORD PTR [rax-72]
vaddsd xmm0, xmm2, xmm1
vmovsd QWORD PTR [rcx+16], xmm6
vmovaps xmm6, XMMWORD PTR [rax-24]
vmovsd QWORD PTR [rcx+24], xmm7
vmovaps xmm7, XMMWORD PTR [rax-40]
vmovsd QWORD PTR [rcx+32], xmm10
vmovaps xmm10, XMMWORD PTR [rax-88]
vmovsd QWORD PTR [rcx+40], xmm12
vmovaps xmm12, XMMWORD PTR [rsp]
vmovsd QWORD PTR [rcx], xmm0
vmovsd QWORD PTR [rcx+8], xmm4
add rsp, 120
ret 0
Not so great, but there is an explanation for that. Firstly, the compiler decided to go scalar (failed to autovectorize the code), which means that it would need a lot of registers (this decision basically predated everything). Secondly, WIN64 calling convention requires some registers to be preserved across function calls so in order to use as many registers as it needs it has to additionally save and restore some of them. Don't be confused with rax
use here, it's just a trick and it's used as an original stack pointer. This trick is just a code-size optimization.
GCC trunk [-O2 -mavx2]
; 37 instructions
Matrix2D_Multiply:
vmovsd xmm0, QWORD PTR [rdx+8]
vmovsd xmm5, QWORD PTR [rdx+24]
vmovsd xmm2, QWORD PTR [rsi+32]
vmovsd xmm3, QWORD PTR [rsi+40]
vmovsd xmm6, QWORD PTR [rdx+16]
vmovsd xmm9, QWORD PTR [rsi+16]
vmulsd xmm1, xmm3, xmm5
vmovsd xmm8, QWORD PTR [rsi+24]
vmovsd xmm7, QWORD PTR [rsi+8]
vmulsd xmm4, xmm2, xmm0
vmulsd xmm3, xmm3, xmm6
vmulsd xmm11, xmm6, xmm7
vmulsd xmm7, xmm5, xmm7
vmulsd xmm6, xmm6, xmm8
vaddsd xmm4, xmm4, xmm1
vmovsd xmm1, QWORD PTR [rdx]
vmulsd xmm5, xmm5, xmm8
vaddsd xmm4, xmm4, QWORD PTR [rdx+40]
vmulsd xmm2, xmm2, xmm1
vaddsd xmm3, xmm2, xmm3
vmovsd xmm2, QWORD PTR [rsi]
vaddsd xmm3, xmm3, QWORD PTR [rdx+32]
vmovsd QWORD PTR [rdi+40], xmm4
vmulsd xmm10, xmm1, xmm2
vmulsd xmm2, xmm0, xmm2
vmovsd QWORD PTR [rdi+32], xmm3
vmulsd xmm1, xmm1, xmm9
vmulsd xmm0, xmm0, xmm9
vaddsd xmm10, xmm10, xmm11
vaddsd xmm2, xmm2, xmm7
vaddsd xmm1, xmm1, xmm6
vaddsd xmm0, xmm0, xmm5
vmovsd QWORD PTR [rdi], xmm10
vmovsd QWORD PTR [rdi+8], xmm2
vmovsd QWORD PTR [rdi+16], xmm1
vmovsd QWORD PTR [rdi+24], xmm0
ret
A perfect scalar version, but nothing more.
Clang trunk [-O2 -mavx2]
; 22 instructions
Matrix2D_Multiply:
vmovddup xmm0, qword ptr [rsi] # xmm0 = mem[0,0]
vmovupd xmm1, xmmword ptr [rdx]
vmovupd xmm2, xmmword ptr [rdx + 16]
vmulpd xmm0, xmm0, xmm1
vmovddup xmm3, qword ptr [rsi + 8] # xmm3 = mem[0,0]
vmulpd xmm3, xmm3, xmm2
vaddpd xmm0, xmm0, xmm3
vmovddup xmm3, qword ptr [rsi + 16] # xmm3 = mem[0,0]
vmulpd xmm3, xmm1, xmm3
vmovddup xmm4, qword ptr [rsi + 24] # xmm4 = mem[0,0]
vmulpd xmm4, xmm2, xmm4
vmovddup xmm5, qword ptr [rsi + 32] # xmm5 = mem[0,0]
vmulpd xmm1, xmm1, xmm5
vmovddup xmm5, qword ptr [rsi + 40] # xmm5 = mem[0,0]
vmulpd xmm2, xmm2, xmm5
vaddpd xmm1, xmm1, xmm2
vaddpd xmm1, xmm1, xmmword ptr [rdx + 32]
vaddpd xmm2, xmm3, xmm4
vmovupd xmmword ptr [rdi], xmm0
vmovupd xmmword ptr [rdi + 16], xmm2
vmovupd xmmword ptr [rdi + 32], xmm1
ret
Well, this is what I have initially expected to get. An autovectorized code that is as close to a hand-written asm as I can think of.
Conclusion
Compilers are generally improving, but some of them still lack to perform a really basic autovectorization optimizations. So we are not autovectorized yet and people should in general verify what they are claiming. Time will tell...
Update
I fixed one statement. Usage of [rax]
by MSVC is actually smart. Addressing [rax + imm]
is one byte smaller than addressing [rsp + imm]
.
You can use Compiler Exporer and try it yourself!
Interestingly, on GCC... if you go with "-O3"... you get down to around 28 instructions and "-march=native" down to around 20. Granted I don't know what hardware they're running, but it's fair to say that it can do "better" under the right conditions.
ReplyDelete