18 August, 2016

Comparing register allocator of GCC and Clang

Introduction

I'm preparing to write a new register allocator for AsmJit and part of my preparation work is to study how the best open-source C++ compilers allocate registers of some C++ samples, and how they create function prologs and epilogs. Today, I wrote a very simple code that tries to exploit one thing - in 32-bit mode there is just 8 GP and SIMD registers, so let's see what GCC and Clang do with the following code:

#include <immintrin.h>

int fn(
  const int* px, const int* py,
  const int* pz, const int* pw,
  const int* pa, const int* pb,
  const int* pc, const int* pd) {

  __m256i a0 = _mm256_loadu_si256((__m256i*)px);
  __m256i a1 = _mm256_loadu_si256((__m256i*)py);
  __m256i a2 = _mm256_loadu_si256((__m256i*)pz);
  __m256i a3 = _mm256_loadu_si256((__m256i*)pw);
  __m256i a4 = _mm256_loadu_si256((__m256i*)pa);
  __m256i b0 = _mm256_loadu_si256((__m256i*)pb);
  __m256i b1 = _mm256_loadu_si256((__m256i*)pc);
  __m256i b2 = _mm256_loadu_si256((__m256i*)pd);
  __m256i b3 = _mm256_loadu_si256((__m256i*)pc + 1);
  __m256i b4 = _mm256_loadu_si256((__m256i*)pd + 1);
  
  __m256i x0 = _mm256_packus_epi16(a0, b0);
  __m256i x1 = _mm256_packus_epi16(a1, b1);
  __m256i x2 = _mm256_packus_epi16(a2, b2);
  __m256i x3 = _mm256_packus_epi16(a3, b3);
  __m256i x4 = _mm256_packus_epi16(a4, b4);
  
  x0 = _mm256_add_epi16(x0, a0);
  x1 = _mm256_add_epi16(x1, a1);
  x2 = _mm256_add_epi16(x2, a2);
  x3 = _mm256_add_epi16(x3, a3);
  x4 = _mm256_add_epi16(x4, a4);

  x0 = _mm256_sub_epi16(x0, b0);
  x1 = _mm256_sub_epi16(x1, b1);
  x2 = _mm256_sub_epi16(x2, b2);
  x3 = _mm256_sub_epi16(x3, b3);
  x4 = _mm256_sub_epi16(x4, b4);
  
  x0 = _mm256_packus_epi16(x0, x1);
  x0 = _mm256_packus_epi16(x0, x2);
  x0 = _mm256_packus_epi16(x0, x3);
  x0 = _mm256_packus_epi16(x0, x4);
  return _mm256_extract_epi32(x0, 1);
}

The function does nothing useful, it's just a dumb code that tricks the compiler to not eliminate any part of the function and to emit exactly what is written in its body. I'm trying to exploit the following: I'm using 8 arguments, which are passed by stack (32-bit mode), and each argument is a pointer to a 256-bit vector. The last two arguments are read twice, the first 6 arguments just once. This means that the compiler has the opportunity to store all arguments in GP registers if it just follows the code path. The second challenge is that I'm using 15 YMM registers at the same time, but only 8 are available.

GCC Output

Not so good, here is the asm with my annotations:

; GCC 6.1 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
  lea       ecx, [esp+4]                      ; Return address + 4 (first argument on the stack)
  and       esp, -32                          ; Align the stack to 32 bytes
  push      DWORD PTR [ecx-4]                 ; Push returned address
  push      ebp                               ; Save frame-pointer even if I told GCC to not to
  mov       ebp, esp
  push      edi                               ; Save GP regs
  push      esi
  push      ebx
  push      ecx
  sub       esp, 296                          ; Reserve stack for YMM spills
  mov       eax, DWORD PTR [ecx+16]           ; LOAD 'pa'
  mov       esi, DWORD PTR [ecx+4]            ; LOAD 'py'
  mov       edi, DWORD PTR [ecx]              ; LOAD 'px'
  mov       ebx, DWORD PTR [ecx+8]            ; LOAD 'pz'
  mov       edx, DWORD PTR [ecx+12]           ; LOAD 'pw'
  mov       DWORD PTR [ebp-120], eax          ; SPILL 'pa'
  mov       eax, DWORD PTR [ecx+20]           ; LOAD 'pb'
  mov       DWORD PTR [ebp-152], eax          ; SPILL 'pb'
  mov       eax, DWORD PTR [ecx+24]           ; LOAD 'pc'
  vmovdqu   ymm4, YMMWORD PTR [esi]
  mov       ecx, DWORD PTR [ecx+28]           ; LOAD 'pd'
  vmovdqu   ymm7, YMMWORD PTR [edi]
  vmovdqa   YMMWORD PTR [ebp-56], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [ebx]
  mov       ebx, DWORD PTR [ebp-152]          ; LOAD 'pb'
  vmovdqa   YMMWORD PTR [ebp-88], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [edx]
  mov       edx, DWORD PTR [ebp-120]          ; LOAD 'pa'
  vmovdqu   ymm6, YMMWORD PTR [edx]
  vmovdqa   YMMWORD PTR [ebp-120], ymm6       ; SPILL VEC
  vmovdqu   ymm0, YMMWORD PTR [ecx]
  vmovdqu   ymm6, YMMWORD PTR [ebx]
  vmovdqa   ymm5, ymm0                        ; Why to move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [eax+32]
  vmovdqu   ymm2, YMMWORD PTR [eax]
  vmovdqa   ymm1, ymm0                        ; Why to move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [ecx+32]
  vmovdqa   YMMWORD PTR [ebp-152], ymm2
  vmovdqa   ymm3, ymm0                        ; Why to move anything when using AVX?
  vpackuswb ymm0, ymm7, ymm6
  vmovdqa   YMMWORD PTR [ebp-184], ymm5       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-248], ymm3       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-280], ymm0       ; SPILL VEC
  vmovdqa   ymm0, YMMWORD PTR [ebp-56]        ; ALLOC VEC
  vmovdqa   YMMWORD PTR [ebp-216], ymm1       ; SPILL VEC
  vpackuswb ymm2, ymm0, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-88]        ; ALLOC VEC
  vpackuswb ymm1, ymm4, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vpackuswb ymm5, ymm0, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpaddw    ymm2, ymm2, YMMWORD PTR [ebp-56]  ; Uses SPILL slot
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vpackuswb ymm3, ymm0, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpaddw    ymm0, ymm7, YMMWORD PTR [ebp-280] ; Uses SPILL slot
  vpsubw    ymm0, ymm0, ymm6
  vmovdqa   ymm7, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpackuswb ymm0, ymm0, ymm2
  vpaddw    ymm2, ymm4, ymm1
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vmovdqa   YMMWORD PTR [ebp-312], ymm3       ; SPILL VEC
  vpaddw    ymm3, ymm5, YMMWORD PTR [ebp-88]  ; Uses SPILL slot
  vpsubw    ymm3, ymm3, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm3
  vpaddw    ymm1, ymm7, YMMWORD PTR [ebp-312] ; Uses SPILL slot
  vpsubw    ymm1, ymm1, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm2
  vpackuswb ymm0, ymm0, ymm1
  vpextrd   eax, xmm0, 1                      ; Return value
  vzeroupper
  add       esp, 296
  pop       ecx
  pop       ebx
  pop       esi
  pop       edi
  pop       ebp
  lea       esp, [ecx-4]
  ret

Here are my observations based on the output:

  • The first thing GCC does is to allocate arguments to registers and/or to move them to their home slots if there is not enough physical registers for all args.
  • It preserves stack pointer even if I told it to not to. The EBP register is valuable in our case. It does this (probably) because it wants to align the stack.
  • It uses [ebp-X] even when [esp+X] would be shorter in asm - for example it should use [esp+16] instead of [ebp-280] to make the instruction with such address shorter - we are talking about 2 bytes per instruction here.
  • YMM registers usage is a mystery to me - it does so many allocs and spills and also uses 'vmovdqa' to copy one register to another, which seems crazy in AVX code.

Verdict: GCC is worse than I expected in this test, it looks like it doesn't properly use DFG (data-flow-graph) and its register allocator is somewhat confused.

Clang Output

Clang surprised me, the output:

; Clang 3.8 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
   mov       eax, dword ptr [esp + 32]     ; LOAD 'pd'
   mov       ecx, dword ptr [esp + 4]      ; LOAD 'px'
   vmovdqu   ymm0, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 8]      ; LOAD 'py'
   vmovdqu   ymm1, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 12]     ; LOAD 'pz'
   vmovdqu  ymm2, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 16]     ; LOAD 'pw'
   vmovdqu   ymm3, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 20]     ; LOAD 'pa'
   vmovdqu   ymm4, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 24]     ; LOAD 'pb'
   vmovdqu   ymm5, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 28]     ; LOAD 'pc'
   vpackuswb ymm6, ymm0, ymm5
   vpsubw    ymm0, ymm0, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx]
   vpaddw    ymm0, ymm0, ymm6
   vpackuswb ymm6, ymm1, ymm5
   vpsubw    ymm1, ymm1, ymm5
   vmovdqu   ymm5, ymmword ptr [eax]
   vpaddw    ymm1, ymm1, ymm6
   vpackuswb ymm6, ymm2, ymm5
   vpsubw    ymm2, ymm2, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx + 32]
   vpaddw    ymm2, ymm2, ymm6
   vpackuswb ymm6, ymm3, ymm5
   vpsubw    ymm3, ymm3, ymm5
   vmovdqu   ymm5, ymmword ptr [eax + 32]
   vpaddw    ymm3, ymm3, ymm6
   vpackuswb ymm6, ymm4, ymm5
   vpsubw    ymm4, ymm4, ymm5
   vpaddw    ymm4, ymm4, ymm6
   vpackuswb ymm0, ymm0, ymm1
   vpackuswb ymm0, ymm0, ymm2
   vpackuswb ymm0, ymm0, ymm3
   vpackuswb ymm0, ymm0, ymm4
   vpextrd   eax, xmm0, 1                  ; Return value
   vzeroupper
   ret

Wow! Clang didn't spill any GP/VEC register and didn't use 'movdqa' like GCC. It rearranged the code in a way to prevent spills, most probably because it used DFG properly. I must applaud to clang developers as this output is amazing. It surprised me as I wrote the code to force the compiler to spill some YMMs (I know, bad design, but it was just a quick type-type-copy-paste:).

Conclusion

I cannot make a conclusion based on a single test, but GCC failed me in this case. Don't trust people that say that compilers produce better asm than people, it's a myth :)

You can try it yourself by copy-pasting the code here - it's a service that can translate your C++ code to asm online.

GCC bug #77287 reported.

No comments:

Post a Comment