03 March, 2017

C++ Compilers and Absurd Optimizations

Everything Started with a Bug

Yesterday, I finally implemented some code paths that use AVX and AVX2 extensions to accelerate matrix transformations in Blend2D. Of course the initial version didn't work, because I made a small mistake in vector shuffling, which resulted in rendering something different than expected. I'm usually done with this kind of bug very quickly, but this time the mistake was in a shuffle predicate and it was harder to find than usual. After 15 minutes of staring at the code I disassembled the whole function, because I started to suspect a compiler bug. And I'm glad I looked at the disassembly, because I found some ridiculous code generated by the C++ compiler that I simply couldn't understand.

Don't Trust C++ Compiler

As the title suggests, the C++ compiler failed me again by generating ridiculous code around relatively simple loops. Manually vectorized C++ code usually requires at least two loops (one vectorized and one scalar for the remaining items), so the implementation typically looks like the following:

void proc_v1(double* dst, const double* src, size_t length) {
  // Main loop - process 4 units at a time (vectorized).
  for (size_t i = length >> 2; i; i--) {
    ...

    dst += 4;
    src += 4;
  }

  // Tail loop - process 1 unit at a time (scalar, remaining input).
  for (size_t i = length & 3; i; i--) {
    ...

    dst += 1;
    src += 1;
  }
}

This is pretty standard code used in many libraries. However, when you need to save registers and you write hand-optimized loops in assembly, you usually don't translate such a construct to machine code directly, because it would cost you two registers. Instead, the following trick can be used:

void proc_v2(double* dst, const double* src, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);

  // Main loop - process 4 units at a time (vectorized).
  while ((i -= 4) >= 0) {
    ...

    dst += 4;
    src += 4;
  }

  // Now it's less than zero, so normalize it to get the remaining count.
  i += 4;

  // Tail loop - process 1 unit at a time (scalar, remaining input).
  while (i) {
    ...

    dst += 1;
    src += 1;

    i--;
  }
}

This can be translated to asm 1:1 and uses only a single register. I personally don't know of any case where this approach would be slower than the previous one, and if the size of the item you are processing is greater than one byte it's always safe, because the item count can then never exceed what intptr_t can represent (I would say it's safe regardless of the size, as I'm not sure a 32-bit OS would even give a user process 2GB of contiguous memory, but let's stay on the safe side).
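
To convince yourself that the counter math really touches every item exactly once, here is a quick throwaway check (not part of any real code, just a sanity test):

#include <assert.h>
#include <stdint.h>

// Verifies the signed-counter trick: the main loop takes 4 items per
// iteration, the counter is then normalized, and the tail loop takes
// the rest one by one.
int main() {
  for (intptr_t length = 0; length < 1000; length++) {
    intptr_t processed = 0;
    intptr_t i = length;

    while ((i -= 4) >= 0) processed += 4; // Main loop (4 at a time).
    i += 4;                               // Normalize the negative counter.
    while (i) { processed++; i--; }       // Tail loop (1 at a time).

    assert(processed == length);
  }
  return 0;
}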

Translated to x86 asm, proc_v2 would look like this:

; RDI - destination
; RSI - source
; RCX - counter

; Main Loop header.
sub rcx, 4    ; Check whether at least 4 items are available.
js MainSkip   ; If negative, there are fewer than 4 items left.

MainLoop:
...           ; Some work to do...
add rdi, 32   ; Advance destination 4 * sizeof(double).
add rsi, 32   ; Advance source.
sub rcx, 4    ; Decrement loop counter.
jns MainLoop  ; Jump if not negative.

MainSkip:
add rcx, 4    ; Fix the loop counter.
jz TailSkip   ; Exit if zero.

TailLoop:
...           ; Some work to do...
add rdi, 8    ; Advance destination 1 * sizeof(double).
add rsi, 8    ; Advance source.
sub rcx, 1    ; Decrement loop counter.
jnz TailLoop  ; Jump if not zero.

TailSkip:

The asm is not just nice; it's also very compact and should perform well on pretty much any hardware.

Working Example

I cannot use an empty loop body to demonstrate how badly C++ compilers handle this construct, so I wrote a very simple function that actually does something:

#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// Destination and source are points (pairs of x|y); this simplified function
// just doubles each point (the matrix parameter is unused in this version).
void transform(double* dst, const double* src, const double* matrix, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);

  while ((i -= 8) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src +  0);
    __m256d s1 = _mm256_loadu_pd(src +  4);
    __m256d s2 = _mm256_loadu_pd(src +  8);
    __m256d s3 = _mm256_loadu_pd(src + 12);

    _mm256_storeu_pd(dst +  0, _mm256_add_pd(s0, s0));
    _mm256_storeu_pd(dst +  4, _mm256_add_pd(s1, s1));
    _mm256_storeu_pd(dst +  8, _mm256_add_pd(s2, s2));
    _mm256_storeu_pd(dst + 12, _mm256_add_pd(s3, s3));
    
    dst += 16;
    src += 16;
  }
  i += 8; // Normalize: i is now the number of points that remain (0..7).

  while ((i -= 2) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src);
    _mm256_storeu_pd(dst, _mm256_add_pd(s0, s0));
    
    dst += 4;
    src += 4;
  }
  
  // i is now -1 or -2; its low bit is set only when one (x, y) point remains.
  if (i & 1) {
    __m128d s0 = _mm_loadu_pd(src);
    _mm_storeu_pd(dst, _mm_add_pd(s0, s0));
  }
}

The loop bodies perform only a very simple operation, so it's easy to tell the actual work apart from the surrounding loop management.
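
Just as a sanity check (this snippet is not part of the benchmark), a tiny harness like the one below can be appended after transform(); it assumes an AVX-capable CPU and, for GCC/Clang, the -mavx flag used in the listings below:

#include <assert.h>
#include <stdio.h>

int main() {
  // 11 points (22 doubles): 8 hit the main loop, 2 the tail loop, 1 the final branch.
  double src[22], dst[22] = {};
  for (int k = 0; k < 22; k++) src[k] = k * 0.5;

  double matrix[6] = {}; // Unused by this simplified transform().
  transform(dst, src, matrix, 11);

  for (int k = 0; k < 22; k++) assert(dst[k] == src[k] + src[k]);
  printf("ok\n");
  return 0;
}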

GCC (v7 -O2 -mavx -fno-exceptions -fno-tree-vectorize)

transform(double*, double const*, double const*, unsigned long):
        ; --- Ridiculous code ---
        mov     r9, rcx
        mov     r8, rcx
        sub     r9, 8
        js      .L6
        mov     rax, r9
        sub     rcx, 16
        mov     r8, r9
        and     rax, -8
        mov     rdx, rsi
        sub     rcx, rax
        mov     rax, rdi
        ; -----------------------
.L5:
        vmovupd xmm3, XMMWORD PTR [rdx]
        sub     r8, 8
        sub     rax, -128
        sub     rdx, -128
        vmovupd xmm2, XMMWORD PTR [rdx-96]
        vinsertf128     ymm3, ymm3, XMMWORD PTR [rdx-112], 0x1
        vmovupd xmm1, XMMWORD PTR [rdx-64]
        vinsertf128     ymm2, ymm2, XMMWORD PTR [rdx-80], 0x1
        vmovupd xmm0, XMMWORD PTR [rdx-32]
        vinsertf128     ymm1, ymm1, XMMWORD PTR [rdx-48], 0x1
        vaddpd  ymm3, ymm3, ymm3
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rdx-16], 0x1
        vaddpd  ymm2, ymm2, ymm2
        vaddpd  ymm1, ymm1, ymm1
        vmovups XMMWORD PTR [rax-128], xmm3
        vaddpd  ymm0, ymm0, ymm0
        vextractf128    XMMWORD PTR [rax-112], ymm3, 0x1
        vmovups XMMWORD PTR [rax-96], xmm2
        vextractf128    XMMWORD PTR [rax-80], ymm2, 0x1
        vmovups XMMWORD PTR [rax-64], xmm1
        vextractf128    XMMWORD PTR [rax-48], ymm1, 0x1
        vmovups XMMWORD PTR [rax-32], xmm0
        vextractf128    XMMWORD PTR [rax-16], ymm0, 0x1

        ; --- Instead of using sub/jns it uses sub/cmp/jne ---
        cmp     r8, rcx
        jne     .L5
        ; ----------------------------------------------------

        ; --- Ridiculous code ---
        mov     rax, r9
        shr     rax, 3
        lea     rdx, [rax+1]
        neg     rax
        lea     r8, [r9+rax*8]
        sal     rdx, 7
        add     rdi, rdx
        add     rsi, rdx
        ; -----------------------

.L6:
        ; --- Ridiculous code ---
        mov     rdx, r8
        sub     rdx, 2
        js      .L4
        shr     rdx
        xor     eax, eax
        lea     rcx, [rdx+1]
        sal     rcx, 5
        ; -----------------------

.L7:
        vmovupd xmm0, XMMWORD PTR [rsi+rax]
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rsi+16+rax], 0x1
        vaddpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+rax], xmm0
        vextractf128    XMMWORD PTR [rdi+16+rax], ymm0, 0x1

        ; --- Instead of using sub/jns it uses add/cmp/jne ---
        add     rax, 32
        cmp     rax, rcx
        jne     .L7
        ; ----------------------------------------------------

        ; --- Ridiculous code ---
        neg     rdx
        add     rdi, rax
        add     rsi, rax
        lea     rdx, [r8-4+rdx*2]
        ; -----------------------
.L4:
        and     edx, 1
        je      .L14
        vmovupd xmm0, XMMWORD PTR [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovups XMMWORD PTR [rdi], xmm0
.L14:
        vzeroupper
        ret

No comment - GCC magically transformed 9 initial instructions into 37, making the code larger and slower!

Clang (3.9.1 -O2 -mavx -fno-exceptions -fno-tree-vectorize)

transform(double*, double const*, double const*, unsigned long):
        ; --- Ridiculous code ---
        mov     r10, rcx
        add     r10, -8
        js      .LBB0_7
        mov     r11, r10
        shr     r11, 3
        mov     r8, r11
        shl     r8, 4
        lea     r9d, [r11 + 1]
        and     r9d, 1
        mov     rdx, rdi
        mov     rax, rsi
        test    r11, r11
        je      .LBB0_4
        lea     rcx, [r9 - 1]
        sub     rcx, r11
        mov     rdx, rdi
        mov     rax, rsi
        ; -----------------------

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rax]
        vmovupd ymm1, ymmword ptr [rax + 32]
        vmovupd ymm2, ymmword ptr [rax + 64]
        vmovupd ymm3, ymmword ptr [rax + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 96], ymm0
        vmovupd ymm0, ymmword ptr [rax + 128]
        vmovupd ymm1, ymmword ptr [rax + 160]
        vmovupd ymm2, ymmword ptr [rax + 192]
        vmovupd ymm3, ymmword ptr [rax + 224]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx + 128], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 160], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 192], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 224], ymm0
        add     rdx, 256
        add     rax, 256

        ; --- Instead of using sub/jns it uses add/jne ---
        add     rcx, 2
        jne     .LBB0_3
        ; ------------------------------------------------

        ; --- CLANG Unrolled the tail loop - OMG ---
.LBB0_4:
        shl     r11, 3
        lea     rcx, [r8 + 16]
        test    r9, r9
        je      .LBB0_6
        vmovupd ymm0, ymmword ptr [rax]
        vmovupd ymm1, ymmword ptr [rax + 32]
        vmovupd ymm2, ymmword ptr [rax + 64]
        vmovupd ymm3, ymmword ptr [rax + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 96], ymm0
.LBB0_6:
        lea     rdi, [rdi + 8*r8 + 128]
        sub     r10, r11
        lea     rsi, [rsi + 8*rcx]
        mov     rcx, r10
.LBB0_7:
        mov     r10, rcx
        add     r10, -2
        js      .LBB0_15
        mov     r8, r10
        shr     r8
        lea     r9d, [r8 + 1]
        and     r9d, 3
        mov     rax, rdi
        mov     rdx, rsi
        cmp     r10, 6
        jb      .LBB0_11
        lea     r10, [r9 - 1]
        sub     r10, r8
        mov     rax, rdi
        mov     rdx, rsi
.LBB0_10:                               # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rdx]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 32]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 32], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 64]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 64], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 96], ymm0
        sub     rax, -128
        sub     rdx, -128
        add     r10, 4
        jne     .LBB0_10
.LBB0_11:
        lea     r10, [4*r8 + 4]
        lea     r11, [4*r8]
        add     r8, r8
        test    r9, r9
        je      .LBB0_14
        neg     r9
.LBB0_13:                               # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rdx]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        add     rax, 32
        add     rdx, 32
        inc     r9
        jne     .LBB0_13
.LBB0_14:
        lea     rsi, [rsi + 8*r11 + 32]
        lea     rdi, [rdi + 8*r10]
        add     rcx, -4
        sub     rcx, r8
        mov     r10, rcx
        ; --- End of the unrolled loop ---

.LBB0_15:
        test    r10b, 1
        je      .LBB0_17
        vmovupd xmm0, xmmword ptr [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi], xmm0
.LBB0_17:
        vzeroupper
        ret

This was even worse than I would ever have expected - clang unrolled the tail loop and generated optimized code for something that can never run more than three times. It completely misunderstood the code. And instead of following the C++ code, it inserted dozens of instructions just to manage the unrolling; I won't even count them, because the code is so ridiculous.

MSVC (v19 /O2 /arch:AVX)

transform, COMDAT PROC
        ; --- Ridiculous code ---
        add      r9, -8
        js       SHORT $LN3@transform
        lea      r8, QWORD PTR [r9+8]
        shr      r8, 3
        mov      rax, r8
        neg      rax
        lea      r9, QWORD PTR [r9+rax*8]
        npad     8
        ; -----------------------

$LL2@transform:
        vmovupd ymm0, YMMWORD PTR [rdx]
        vmovupd ymm1, YMMWORD PTR [rdx+32]
        vmovupd ymm4, YMMWORD PTR [rdx+64]
        vmovupd ymm5, YMMWORD PTR [rdx+96]
        vaddpd   ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rcx], ymm0
        vaddpd   ymm1, ymm1, ymm1
        vmovupd YMMWORD PTR [rcx+32], ymm1
        vaddpd   ymm0, ymm4, ymm4
        vmovupd YMMWORD PTR [rcx+64], ymm0
        vaddpd   ymm1, ymm5, ymm5
        vmovupd YMMWORD PTR [rcx+96], ymm1
        sub      rcx, -128                ; ffffffffffffff80H
        sub      rdx, -128                ; ffffffffffffff80H
        ; --- Instead of using sub/jns it uses sub/jne ---
        sub      r8, 1
        jne      SHORT $LL2@transform
        ; ------------------------------------------------

$LN3@transform:
        ; --- Ridiculous code ---
        add      r9, 6
        js       SHORT $LN5@transform
        lea      r8, QWORD PTR [r9+2]
        shr      r8, 1
        mov      rax, r8
        neg      rax
        lea      r9, QWORD PTR [r9+rax*2]
        npad     5
        ; -----------------------

$LL4@transform:
        vmovupd ymm0, YMMWORD PTR [rdx]
        vaddpd   ymm2, ymm0, ymm0
        vmovupd YMMWORD PTR [rcx], ymm2
        add      rcx, 32              ; 00000020H
        add      rdx, 32              ; 00000020H
        ; --- Instead of using sub/jns it uses sub/jne ---
        sub      r8, 1
        jne      SHORT $LL4@transform
        ; ------------------------------------------------
$LN5@transform:
        test     r9b, 1
        je       SHORT $LN6@transform
        vmovupd xmm0, XMMWORD PTR [rdx]
        vaddpd   xmm2, xmm0, xmm0
        vmovupd XMMWORD PTR [rcx], xmm2
$LN6@transform:
        vzeroupper
        ret      0

Actually not bad. It still does things I didn't ask for, but it's much better than GCC and Clang in this case.

ICC (v17 -O2 -mavx -fno-exceptions)

transform(double*, double const*, double const*, unsigned long):
        add       rcx, -8                                       #11.11 FOLLOWS C++ code!
        js        ..B1.5        # Prob 2%                       #11.22
..B1.3:                         # Preds ..B1.1 ..B1.3
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #12.34
        vmovupd   xmm1, XMMWORD PTR [32+rsi]                    #13.34
        vmovupd   xmm2, XMMWORD PTR [64+rsi]                    #14.34
        vmovupd   xmm3, XMMWORD PTR [96+rsi]                    #15.34
        vinsertf128 ymm6, ymm1, XMMWORD PTR [48+rsi], 1         #13.34
        vinsertf128 ymm4, ymm0, XMMWORD PTR [16+rsi], 1         #12.34
        vinsertf128 ymm8, ymm2, XMMWORD PTR [80+rsi], 1         #14.34
        vinsertf128 ymm10, ymm3, XMMWORD PTR [112+rsi], 1       #15.34
        add       rsi, 128                                      #23.5
        vaddpd    ymm5, ymm4, ymm4                              #17.32
        vaddpd    ymm7, ymm6, ymm6                              #18.32
        vaddpd    ymm9, ymm8, ymm8                              #19.32
        vaddpd    ymm11, ymm10, ymm10                           #20.32
        vmovupd   XMMWORD PTR [rdi], xmm5                       #17.22
        vmovupd   XMMWORD PTR [32+rdi], xmm7                    #18.22
        vmovupd   XMMWORD PTR [64+rdi], xmm9                    #19.22
        vmovupd   XMMWORD PTR [96+rdi], xmm11                   #20.22
        vextractf128 XMMWORD PTR [16+rdi], ymm5, 1              #17.22
        vextractf128 XMMWORD PTR [48+rdi], ymm7, 1              #18.22
        vextractf128 XMMWORD PTR [80+rdi], ymm9, 1              #19.22
        vextractf128 XMMWORD PTR [112+rdi], ymm11, 1            #20.22
        add       rdi, 128                                      #22.5
        add       rcx, -8                                       #11.11
        jns       ..B1.3        # Prob 82%                      #11.22 FOLLOWS C++ code!
..B1.5:                         # Preds ..B1.3 ..B1.1
        add       rcx, 6                                        #25.3
        js        ..B1.9        # Prob 2%                       #27.22 FOLLOWS C++ code!
..B1.7:                         # Preds ..B1.5 ..B1.7
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #28.34
        vinsertf128 ymm1, ymm0, XMMWORD PTR [16+rsi], 1         #28.34
        add       rsi, 32                                       #32.5
        vaddpd    ymm2, ymm1, ymm1                              #29.27
        vmovupd   XMMWORD PTR [rdi], xmm2                       #29.22
        vextractf128 XMMWORD PTR [16+rdi], ymm2, 1              #29.22
        add       rdi, 32                                       #31.5
        add       rcx, -2                                       #27.11 FOLLOWS C++ code!
        jns       ..B1.7        # Prob 82%                      #27.22
..B1.9:                         # Preds ..B1.7 ..B1.5
        test      rcx, 1                                        #35.11
        je        ..B1.11       # Prob 60%                      #35.11
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #36.31
        vaddpd    xmm1, xmm0, xmm0                              #37.24
        vmovupd   XMMWORD PTR [rdi], xmm1                       #37.19
..B1.11:                        # Preds ..B1.9 ..B1.10
        vzeroupper                                              #39.1
        ret                                                     #39.1

ICC is the clear winner in this case. I don't know why the other compilers cannot generate code like this; it's small and good.

Bonus GCC with -Os

transform(double*, double const*, double const*, unsigned long):
.L3:
        mov     rax, rcx
        sub     rax, 8
        js      .L2
        vmovupd ymm3, YMMWORD PTR [rsi]
        sub     rdi, -128
        sub     rsi, -128
        vmovupd ymm2, YMMWORD PTR [rsi-96]
        mov     rcx, rax
        vmovupd ymm1, YMMWORD PTR [rsi-64]
        vaddpd  ymm3, ymm3, ymm3
        vmovupd ymm0, YMMWORD PTR [rsi-32]
        vaddpd  ymm2, ymm2, ymm2
        vaddpd  ymm1, ymm1, ymm1
        vmovupd YMMWORD PTR [rdi-128], ymm3
        vaddpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdi-96], ymm2
        vmovupd YMMWORD PTR [rdi-64], ymm1
        vmovupd YMMWORD PTR [rdi-32], ymm0
        jmp     .L3 ; Well, I asked -Os, but here sub/jns won't hurt
.L2:
        sub     rcx, 2
        js      .L4
        vmovupd ymm0, YMMWORD PTR [rsi]
        add     rdi, 32
        add     rsi, 32
        vaddpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdi-32], ymm0
        jmp     .L2 ; Also here...
.L4:
        and     cl, 1
        je      .L1
        vmovupd xmm0, XMMWORD PTR [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovups XMMWORD PTR [rdi], xmm0
.L1:
        ret

Much closer to the expected code. However, since I asked for the smallest possible code, GCC didn't rotate the loops to put the conditional jump at the end, so every iteration pays for an extra unconditional jmp back to the loop header, which is not good in general.

Bonus Clang with -Oz (should be even smaller than -Os)

transform(double*, double const*, double const*, unsigned long):                  # @transform(double*, double const*, double const*, unsigned long)
        push    7
        pop     rax
        sub     rax, rcx
        mov     rdx, rcx
        shr     rdx, 3
        xor     r10d, r10d
        test    rax, rax
        cmovle  r10, rdx
        mov     r9, r10
        shl     r9, 4
        lea     r8, [8*r10]
        shl     r10, 7
        add     r10, rsi
        mov     rdx, rcx
        mov     rax, rdi
        jmp     .LBB0_1
.LBB0_8:                                #   in Loop: Header=BB0_1 Depth=1
        vmovupd ymm0, ymmword ptr [rsi]
        vmovupd ymm1, ymmword ptr [rsi + 32]
        vmovupd ymm2, ymmword ptr [rsi + 64]
        vmovupd ymm3, ymmword ptr [rsi + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rax + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rax + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rax + 96], ymm0
        sub     rsi, -128
        sub     rax, -128
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rdx, -8
        jns     .LBB0_8
        sub     rcx, r8
        lea     r8, [rdi + 8*r9]
        push    1
        pop     rax
        sub     rax, rcx
        lea     rdx, [rcx + rcx]
        and     rdx, -4
        xor     esi, esi
        test    rax, rax
        cmovle  rsi, rdx
        mov     rax, rcx
        mov     rdi, r10
        mov     rdx, r8
        jmp     .LBB0_3
.LBB0_4:                                #   in Loop: Header=BB0_3 Depth=1
        vmovupd ymm0, ymmword ptr [rdi]
        add     rdi, 32
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        add     rdx, 32
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        add     rax, -2
        jns     .LBB0_4
        test    cl, 1
        je      .LBB0_7
        vmovupd xmm0, xmmword ptr [r10 + 8*rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [r8 + 8*rsi], xmm0
.LBB0_7:
        vzeroupper
        ret

No, no, no - clang just doesn't like compact code... :(

Conclusion

People saying that C++ compilers always generate better code than humans are simply wrong. I'm not saying that they always generate worse code, but it seems that the newer the compiler is, the more instructions it emits, and that's becoming scary. More arguments for Blend2D and its hand-optimized pipelines!

You can play with the code here.

Thanks for reading, and don't forget that this is not a hate post. GCC bug #79830 has already been reported (with slightly different sample code), and clang bug #32132 as well. Let's hope this can be improved, as it actually affects the performance of non-JIT code in Blend2D.

Update

After reading some comments about this post on other sites, I think people really misunderstood it and tried to put it into a different context. Please note that what is inside the loops is not important in this case. What is important is how various C++ compilers translate valid loop constructs into assembly, and what impact that has on the size of the generated code. So please, instead of justifying the bloated code I have shown, let's fix it; it won't hurt to generate smaller code in this case. And I really think that if the commercial ICC compiler can generate perfect code for this case, GCC and clang should be able to do it too, because these are the compilers I use daily...