18 August, 2016

Comparing register allocator of GCC and Clang

Introduction

I'm preparing to write a new register allocator for AsmJit, and part of that preparation is studying how the best open-source C++ compilers allocate registers for some C++ samples, and how they create function prologs and epilogs. Today I wrote a very simple function that tries to exploit one thing - in 32-bit mode there are just 8 GP and 8 SIMD registers - so let's see what GCC and Clang do with the following code:

#include <immintrin.h>

int fn(
  const int* px, const int* py,
  const int* pz, const int* pw,
  const int* pa, const int* pb,
  const int* pc, const int* pd) {

  __m256i a0 = _mm256_loadu_si256((__m256i*)px);
  __m256i a1 = _mm256_loadu_si256((__m256i*)py);
  __m256i a2 = _mm256_loadu_si256((__m256i*)pz);
  __m256i a3 = _mm256_loadu_si256((__m256i*)pw);
  __m256i a4 = _mm256_loadu_si256((__m256i*)pa);
  __m256i b0 = _mm256_loadu_si256((__m256i*)pb);
  __m256i b1 = _mm256_loadu_si256((__m256i*)pc);
  __m256i b2 = _mm256_loadu_si256((__m256i*)pd);
  __m256i b3 = _mm256_loadu_si256((__m256i*)pc + 1);
  __m256i b4 = _mm256_loadu_si256((__m256i*)pd + 1);
  
  __m256i x0 = _mm256_packus_epi16(a0, b0);
  __m256i x1 = _mm256_packus_epi16(a1, b1);
  __m256i x2 = _mm256_packus_epi16(a2, b2);
  __m256i x3 = _mm256_packus_epi16(a3, b3);
  __m256i x4 = _mm256_packus_epi16(a4, b4);
  
  x0 = _mm256_add_epi16(x0, a0);
  x1 = _mm256_add_epi16(x1, a1);
  x2 = _mm256_add_epi16(x2, a2);
  x3 = _mm256_add_epi16(x3, a3);
  x4 = _mm256_add_epi16(x4, a4);

  x0 = _mm256_sub_epi16(x0, b0);
  x1 = _mm256_sub_epi16(x1, b1);
  x2 = _mm256_sub_epi16(x2, b2);
  x3 = _mm256_sub_epi16(x3, b3);
  x4 = _mm256_sub_epi16(x4, b4);
  
  x0 = _mm256_packus_epi16(x0, x1);
  x0 = _mm256_packus_epi16(x0, x2);
  x0 = _mm256_packus_epi16(x0, x3);
  x0 = _mm256_packus_epi16(x0, x4);
  return _mm256_extract_epi32(x0, 1);
}

The function does nothing useful; it's just dumb code that tricks the compiler into not eliminating any part of the function and emitting exactly what is written in its body. I'm trying to exploit the following: I'm using 8 arguments, which are passed on the stack (32-bit mode), and each argument is a pointer to a 256-bit vector. The last two arguments are read twice, the first six just once. This means the compiler has the opportunity to keep the arguments in GP registers if it just follows the code. The second challenge is that I'm using 15 YMM variables at the same time, but only 8 registers are available.

GCC Output

Not so good. Here is the asm with my annotations:

; GCC 6.1 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
  lea       ecx, [esp+4]                      ; Return address + 4 (first argument on the stack)
  and       esp, -32                          ; Align the stack to 32 bytes
  push      DWORD PTR [ecx-4]                 ; Push return address
  push      ebp                               ; Save frame pointer even though I told GCC not to
  mov       ebp, esp
  push      edi                               ; Save GP regs
  push      esi
  push      ebx
  push      ecx
  sub       esp, 296                          ; Reserve stack for YMM spills
  mov       eax, DWORD PTR [ecx+16]           ; LOAD 'pa'
  mov       esi, DWORD PTR [ecx+4]            ; LOAD 'py'
  mov       edi, DWORD PTR [ecx]              ; LOAD 'px'
  mov       ebx, DWORD PTR [ecx+8]            ; LOAD 'pz'
  mov       edx, DWORD PTR [ecx+12]           ; LOAD 'pw'
  mov       DWORD PTR [ebp-120], eax          ; SPILL 'pa'
  mov       eax, DWORD PTR [ecx+20]           ; LOAD 'pb'
  mov       DWORD PTR [ebp-152], eax          ; SPILL 'pb'
  mov       eax, DWORD PTR [ecx+24]           ; LOAD 'pc'
  vmovdqu   ymm4, YMMWORD PTR [esi]
  mov       ecx, DWORD PTR [ecx+28]           ; LOAD 'pd'
  vmovdqu   ymm7, YMMWORD PTR [edi]
  vmovdqa   YMMWORD PTR [ebp-56], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [ebx]
  mov       ebx, DWORD PTR [ebp-152]          ; LOAD 'pb'
  vmovdqa   YMMWORD PTR [ebp-88], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [edx]
  mov       edx, DWORD PTR [ebp-120]          ; LOAD 'pa'
  vmovdqu   ymm6, YMMWORD PTR [edx]
  vmovdqa   YMMWORD PTR [ebp-120], ymm6       ; SPILL VEC
  vmovdqu   ymm0, YMMWORD PTR [ecx]
  vmovdqu   ymm6, YMMWORD PTR [ebx]
  vmovdqa   ymm5, ymm0                        ; Why move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [eax+32]
  vmovdqu   ymm2, YMMWORD PTR [eax]
  vmovdqa   ymm1, ymm0                        ; Why move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [ecx+32]
  vmovdqa   YMMWORD PTR [ebp-152], ymm2       ; SPILL VEC
  vmovdqa   ymm3, ymm0                        ; Why move anything when using AVX?
  vpackuswb ymm0, ymm7, ymm6
  vmovdqa   YMMWORD PTR [ebp-184], ymm5       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-248], ymm3       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-280], ymm0       ; SPILL VEC
  vmovdqa   ymm0, YMMWORD PTR [ebp-56]        ; ALLOC VEC
  vmovdqa   YMMWORD PTR [ebp-216], ymm1       ; SPILL VEC
  vpackuswb ymm2, ymm0, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-88]        ; ALLOC VEC
  vpackuswb ymm1, ymm4, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vpackuswb ymm5, ymm0, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpaddw    ymm2, ymm2, YMMWORD PTR [ebp-56]  ; Uses SPILL slot
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vpackuswb ymm3, ymm0, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpaddw    ymm0, ymm7, YMMWORD PTR [ebp-280] ; Uses SPILL slot
  vpsubw    ymm0, ymm0, ymm6
  vmovdqa   ymm7, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpackuswb ymm0, ymm0, ymm2
  vpaddw    ymm2, ymm4, ymm1
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vmovdqa   YMMWORD PTR [ebp-312], ymm3       ; SPILL VEC
  vpaddw    ymm3, ymm5, YMMWORD PTR [ebp-88]  ; Uses SPILL slot
  vpsubw    ymm3, ymm3, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm3
  vpaddw    ymm1, ymm7, YMMWORD PTR [ebp-312] ; Uses SPILL slot
  vpsubw    ymm1, ymm1, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm2
  vpackuswb ymm0, ymm0, ymm1
  vpextrd   eax, xmm0, 1                      ; Return value
  vzeroupper
  add       esp, 296
  pop       ecx
  pop       ebx
  pop       esi
  pop       edi
  pop       ebp
  lea       esp, [ecx-4]
  ret

Here are my observations based on the output:

  • The first thing GCC does is allocate arguments to registers and/or move them to their home slots when there are not enough physical registers for all of them.
  • It sets up a frame pointer even though I told it not to (-fomit-frame-pointer). The EBP register is valuable in our case. It (probably) does this because it wants to align the stack.
  • It uses [ebp-X] addressing even when [esp+X] would encode shorter - for example it could use [esp+32] instead of [ebp-280], replacing a 4-byte displacement with a SIB byte plus a 1-byte displacement - we are talking about 2 bytes per instruction here (see the worked example below).
  • YMM register usage is a mystery to me - it does so many allocs and spills, and it also uses 'vmovdqa' to copy one register to another, which seems crazy in AVX code (the three-operand syntax should make most of these copies unnecessary).
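
For reference, here is a worked example of the size difference (instruction bytes computed by hand from the x86 encoding rules, not taken from the GCC output above; the [esp+32] form assumes the frame layout shown in the prolog):

#include <cstdint>

// The same vpaddw, once with an EBP-relative operand (ModRM + 4-byte displacement)
// and once with the equivalent ESP-relative operand (ModRM + SIB + 1-byte displacement).
constexpr uint8_t vpaddw_ebp[] = { 0xC5, 0xED, 0xFD, 0x95, 0xE8, 0xFE, 0xFF, 0xFF }; // vpaddw ymm2, ymm2, [ebp-280] -> 8 bytes
constexpr uint8_t vpaddw_esp[] = { 0xC5, 0xED, 0xFD, 0x54, 0x24, 0x20 };             // vpaddw ymm2, ymm2, [esp+32]  -> 6 bytes

static_assert(sizeof(vpaddw_ebp) - sizeof(vpaddw_esp) == 2, "the EBP-relative form is 2 bytes longer");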

Verdict: GCC did worse than I expected in this test; it looks like it doesn't properly use a DFG (data-flow graph) and its register allocator got confused.

Clang Output

Clang surprised me. Here is the output:

; Clang 3.8 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
   mov       eax, dword ptr [esp + 32]     ; LOAD 'pd'
   mov       ecx, dword ptr [esp + 4]      ; LOAD 'px'
   vmovdqu   ymm0, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 8]      ; LOAD 'py'
   vmovdqu   ymm1, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 12]     ; LOAD 'pz'
   vmovdqu   ymm2, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 16]     ; LOAD 'pw'
   vmovdqu   ymm3, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 20]     ; LOAD 'pa'
   vmovdqu   ymm4, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 24]     ; LOAD 'pb'
   vmovdqu   ymm5, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 28]     ; LOAD 'pc'
   vpackuswb ymm6, ymm0, ymm5
   vpsubw    ymm0, ymm0, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx]
   vpaddw    ymm0, ymm0, ymm6
   vpackuswb ymm6, ymm1, ymm5
   vpsubw    ymm1, ymm1, ymm5
   vmovdqu   ymm5, ymmword ptr [eax]
   vpaddw    ymm1, ymm1, ymm6
   vpackuswb ymm6, ymm2, ymm5
   vpsubw    ymm2, ymm2, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx + 32]
   vpaddw    ymm2, ymm2, ymm6
   vpackuswb ymm6, ymm3, ymm5
   vpsubw    ymm3, ymm3, ymm5
   vmovdqu   ymm5, ymmword ptr [eax + 32]
   vpaddw    ymm3, ymm3, ymm6
   vpackuswb ymm6, ymm4, ymm5
   vpsubw    ymm4, ymm4, ymm5
   vpaddw    ymm4, ymm4, ymm6
   vpackuswb ymm0, ymm0, ymm1
   vpackuswb ymm0, ymm0, ymm2
   vpackuswb ymm0, ymm0, ymm3
   vpackuswb ymm0, ymm0, ymm4
   vpextrd   eax, xmm0, 1                  ; Return value
   vzeroupper
   ret

Wow! Clang didn't spill a single GP/VEC register and didn't emit any register-to-register 'vmovdqa' like GCC did. It rearranged the code in a way that prevents spills, most probably because it uses the DFG properly. I must applaud the Clang developers, because this output is amazing. It surprised me, as I wrote the code specifically to force the compiler to spill some YMMs (I know, bad design, but it was just a quick copy-paste job :).

Conclusion

I cannot draw a conclusion from a single test, but GCC failed me in this case. Don't trust people who say that compilers produce better asm than humans - it's a myth :)

You can try it yourself by copy-pasting the code here - an online service that translates your C++ code to asm.

GCC bug #77287 reported.

15 August, 2016

AsmJit & AVX-512

The World of Prefixes

The X86 architecture is known for its prefixes. It's no surprise that AVX-512 adds another one to the family - EVEX. Let's summarize the four most recent prefix forms introduced in X86 - VEX2, VEX3, XOP, and EVEX:

  • VEX - 2-byte (VEX2) and 3-byte (VEX3) prefix initially designed by Intel to encode AVX instructions, but now also used by other CPU extensions like BMI and BMI2. VEX2 was designed to make some instructions 1 byte shorter than VEX3, but its usage is quite limited (it cannot encode the W bit, the X/B register-extension bits, or opcode maps other than 0F).
  • XOP - 3-byte prefix designed by AMD to support their XOP extension in a way that doesn't interfere with the existing VEX prefix. XOP was never adopted by Intel, and AMD will not support it in their new Zen processors (together with other extensions like FMA4). It's a dead end - dead silicon, and dead code that supports this prefix.
  • EVEX - 4-byte prefix designed by Intel to support 512-bit wide vectors and 32 vector registers. Each AVX-512 instruction that works with vector registers uses this prefix. Many AVX and AVX2 instructions can be encoded with this new prefix as well (a worked size comparison follows this list). There are, however, several exceptions, but those would require a separate post.
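
Here is a worked size comparison I put together (encodings computed by hand from the prefix layouts, not produced by AsmJit - treat it purely as an illustration): the same vpaddd ymm0, ymm1, ymm2 expressed with each prefix form.

#include <cstdint>

// One operation, three possible prefixes - only the prefix length differs.
constexpr uint8_t vpaddd_vex2[] = { 0xC5, 0xF5, 0xFE, 0xC2 };             // 2-byte VEX  -> 4-byte instruction
constexpr uint8_t vpaddd_vex3[] = { 0xC4, 0xE1, 0x75, 0xFE, 0xC2 };       // 3-byte VEX  -> 5-byte instruction
constexpr uint8_t vpaddd_evex[] = { 0x62, 0xF1, 0x75, 0x28, 0xFE, 0xC2 }; // 4-byte EVEX -> 6-byte instruction

static_assert(sizeof(vpaddd_vex2) < sizeof(vpaddd_vex3) && sizeof(vpaddd_vex3) < sizeof(vpaddd_evex),
              "the shortest prefix that can express the operands should always win");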

AVX-512 Status in AsmJit

AVX-512 support in AsmJit is mostly finished. AsmJit's instruction database now contains all AVX-512 instructions together with the older AVX and AVX2 instructions. The reorganization of the instruction database and X86Assembler was quite drastic. AsmJit now contains a single path that encodes a VEX, XOP, or EVEX instruction, which greatly simplified the logic in the assembler. XOP encoding IDs are no longer needed, as each instruction now carries VEX, XOP, and EVEX bits. These bits instruct the encoder to use the correct prefix.
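
To sketch what "each instruction carries VEX, XOP, and EVEX bits" could look like, here is a minimal, hypothetical illustration (my own names and layout - AsmJit's real instruction database packs far more information than this):

#include <cstdint>

// Hypothetical per-instruction prefix flags driving a single encoder path.
enum PrefixFlags : uint32_t {
  kPrefixVex  = 0x1, // Instruction is encodable with a VEX prefix.
  kPrefixXop  = 0x2, // Instruction requires the XOP prefix (AMD only).
  kPrefixEvex = 0x4  // Instruction is encodable with an EVEX prefix.
};

struct InstRecord {
  uint32_t opCode;   // Packed opcode, map (mm), pp, W, and L fields.
  uint32_t prefixes; // Combination of PrefixFlags.
};

// The encoder has a single entry point: it builds the prefix payload and then
// picks the cheapest prefix that both `prefixes` and the operands allow.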

Encoder Improvements

The previous encoder was designed to encode each byte of the [VEX] prefix separately and then write the result byte-by-byte into the destination buffer. This design was fairly simple and, according to my benchmarks, also very fast. However, this approach seemed infeasible for supporting the new EVEX prefix, which contains 3 bytes of payload (instead of two), and the encoder must check all 3 bytes before it can decide whether to emit a VEX or EVEX prefix. The new code does this differently - it uses a single 32-bit integer that represents the whole EVEX prefix, and then decides between EVEX and VEX by checking specific bits in it. If any of the checked bits is '1', the instruction is EVEX only. This guarantees that the EVEX prefix will never be used for a legacy AVX instruction, and also guarantees that the best encoding (shortest prefix) is used. AsmJit allows overriding this decision by using the `evex()` option, which instructs the encoder to emit an EVEX prefix, and similarly supports a `vex3()` option, which instructs the encoder to emit a 3-byte VEX prefix instead of the shorter 2-byte VEX prefix. EVEX wins if both `evex()` and `vex3()` are specified.

A simplified version of AsmJit's VEX|EVEX encoder looks like this:

// Encode most of EVEX prefix, based on instruction operands and definition.
uint32_t x = EncodeMostEvex(...);             //  [........|zLL..aaa|Vvvvv..R|RBBmmmmm]
// Check if EVEX is required by checking:     x & [........|xx...xxx|x......x|.x.x....]
if (x & 0x00C78150U) {
  // Encode EVEX - uses most of `x`.
  // ... no more branches here - requires around 14 ops to finalize EVEX ...
  //                                                   _     ____    ____
  //                                              [zLLbVaaa|Wvvvv1pp|RBBR00mm|01100010].
}

// Not EVEX, prepare `x` for VEX2 or VEX3 (5 ops):[........|00L00000|0vvvv000|R0B0mmmm]
x |= ((opCode >> (kSHR_W_PP + 8)) & 0x8300U) | // [00000000|00L00000|Wvvvv0pp|R0B0mmmm]
     ((x      >> 11             ) & 0x0400U) ; // [00000000|00L00000|WvvvvLpp|R0B0mmmm]

// Check if VEX3 is needed by checking        x & [........|........|x.......|..x..x..]
if (x & 0x00008024U) {
  // Encode VEX3 or XOP.
  // ... no more branches here - requires around 7 ops to finalize VEX3 ...
  //                                                         ____    _ _
  //                                              [_OPCODE_|WvvvvLpp|R1Bmmmmm|VEX3|XOP].
}
else {
  // Encode VEX2.
  // ... no more branches here - requires around 3 ops to finalize VEX2 ...
}

This means that AsmJit requires just a single branch to decide between the VEX and EVEX prefixes, and another branch to decide between a 3-byte VEX|XOP and a 2-byte VEX prefix. This is good news for everybody expecting high performance, as this approach is nearly as fast as the old AsmJit code, which didn't support AVX-512 at all. It took me some time and thinking to design such an approach and to reorganize the instruction opcode database in a way that allows the initial EVEX prefix to be encoded quickly. My initial approach was around 25% slower than the old AsmJit, and the final code (similar to the snippet shown above) is roughly 3-5% slower, which is pretty close to the old code. The new functionality is nontrivial, so I'm really happy with such metrics (and, to be honest, I would like to see some metrics from other assemblers).

It's also obvious from the code that the new approach is basically optimistic for EVEX - emitting EVEX instructions is much cheaper than emitting VEX|XOP instructions. This wasn't the goal; it's rather a consequence: all the bits that the EVEX prefix introduces must be checked in order to decide between VEX and EVEX, so AsmJit puts most of these bits into the right position up front and only performs minor bit shuffling when converting to a prefix that uses fewer bytes (EVEX->VEX3->VEX2).

Future Work

Future posts will be about a new emitter called CodeBuilder. In short, it allows emitting instructions into a representation that can be processed afterwards. This representation has been in AsmJit from the beginning as a part of Compiler. Compiler has many high-level features that some people don't need, so it was split into CodeBuilder, which can be used exactly like Assembler, and CodeCompiler, which keeps all the high-level features.
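
To give a rough idea of what "a representation that can be processed afterwards" means, here is a conceptual sketch (hypothetical types, not AsmJit's actual classes): instead of serializing machine code immediately like an assembler, a builder records each instruction as a node that later passes can inspect or rewrite before final encoding.

#include <cstdint>
#include <vector>

// Hypothetical sketch of a builder-style emitter.
struct InstNode {
  uint32_t instId;                 // Which instruction (mov, vpaddw, ...).
  std::vector<uint32_t> operands;  // Simplified operand representation.
};

struct BuilderSketch {
  std::vector<InstNode> nodes;     // Instructions are recorded here, not encoded yet.

  void emit(uint32_t instId, std::vector<uint32_t> ops) {
    nodes.push_back(InstNode{instId, std::move(ops)});
  }
  // A later pass (register allocation, peephole optimization, ...) can walk or
  // modify `nodes` before an assembler finally serializes them to machine code.
};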