03 March, 2017

C++ Compilers and Absurd Optimizations

Everything Started with a Bug

Yesterday, I finally implemented some code paths that use the AVX and AVX2 extensions to accelerate matrix transformations in Blend2D. Of course the initial version didn't work, because I made a small mistake in vector shuffling, which resulted in rendering something different than expected. I'm usually done with these kinds of bugs very quickly, but this time the mistake was in a shuffle predicate and was harder to find than usual. After 15 minutes of looking at the code I disassembled the whole function, because I started to suspect a compiler bug. And I'm glad I looked at the disassembly, because I found some ridiculous code generated by the C++ compiler that I simply couldn't understand.

Don't Trust C++ Compiler

As the title suggests, the C++ compiler failed me again by generating ridiculous code around relatively simple loops. Since manually vectorized C++ code usually requires at least two loops (one vectorized and one scalar), the implementation typically looks like the following:

void proc_v1(double* dst, const double* src, size_t length) {
  // Main loop - process 4 units at a time (vectorized).
  for (size_t i = length >> 2; i; i--) {
    ...

    dst += 4;
    src += 4;
  }

  // Tail loop - process 1 unit at a time (scalar, remaining input).
  for (size_t i = length & 3; i; i--) {
    ...

    dst += 1;
    src += 1;
  }
}

This is pretty standard code used in many libraries. However, when you need to save registers and you write hand-optimized loops in assembly, you usually don't translate such a construct directly into machine code, because it costs you two registers. Instead, the following trick can be used:

void proc_v2(double* dst, const double* src, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);

  // Main loop - process 4 units at a time (vectorized).
  while ((i -= 4) >= 0) {
    ...

    dst += 4;
    src += 4;
  }

  // Now it's less than zero, so normalize it to get the remaining count.
  i += 4;

  // Tail loop - process 1 unit at a time (scalar, remaining input).
  while (i) {
    ...

    dst += 1;
    src += 1;

    i--;
  }
}

This can be translated to asm 1:1 and uses only a single register. I personally don't know of any case where this approach would be slower than the previous one, and if the size of the item you are processing is greater than 1 it's always safe (I would say it's safe regardless of the size, as I'm not sure a 32-bit OS would even be able to provide 2GB of contiguous memory to the user, but let's stay on the safe side).

Translated to x86 asm, the code would look like this:

; RDI - destination
; RSI - source
; RCX - counter

; Main Loop header.
sub rcx, 4    ; Try if we can use vectorized loop.
js MainSkip   ; If negative it means that there is less than 4 items.

MainLoop:
...           ; Some work to do...
add rdi, 32   ; Advance destination 4 * sizeof(double).
add rsi, 32   ; Advance source.
sub rcx, 4    ; Decrement loop counter.
jns MainLoop  ; Jump if not negative.

MainSkip:
add rcx, 4    ; Fix the loop counter.
jz TailSkip   ; Exit if zero.

TailLoop:
...           ; Some work to do...
add rdi, 8    ; Advance destination 1 * sizeof(double).
add rsi, 8    ; Advance source.
sub rcx, 1    ; Decrement loop counter.
jnz TailLoop  ; Jump if not zero.

TailSkip:

The asm code is not just nice, it's also very compact and, in general, it will perform well on any hardware.

Working Example

I cannot just use an empty loop body to demonstrate how badly C++ compilers understand this code, so I wrote a very simple function that does something:

#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// Destination and source are points (x|y pairs); the function just adds each
// point to itself (the matrix parameter is unused in this simplified example).
void transform(double* dst, const double* src, const double* matrix, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);

  while ((i -= 8) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src +  0);
    __m256d s1 = _mm256_loadu_pd(src +  4);
    __m256d s2 = _mm256_loadu_pd(src +  8);
    __m256d s3 = _mm256_loadu_pd(src + 12);

    _mm256_storeu_pd(dst +  0, _mm256_add_pd(s0, s0));
    _mm256_storeu_pd(dst +  4, _mm256_add_pd(s1, s1));
    _mm256_storeu_pd(dst +  8, _mm256_add_pd(s2, s2));
    _mm256_storeu_pd(dst + 12, _mm256_add_pd(s3, s3));
    
    dst += 16;
    src += 16;
  }
  i += 8;

  while ((i -= 2) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src);
    _mm256_storeu_pd(dst, _mm256_add_pd(s0, s0));
    
    dst += 4;
    src += 4;
  }
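  // At this point `i` is -1 if one point remains, or -2 if nothing remains;
  // the `i & 1` check below relies on two's complement representation.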
  
  if (i & 1) {
    __m128d s0 = _mm_loadu_pd(src);
    _mm_storeu_pd(dst, _mm_add_pd(s0, s0));
  }
}

The function performs only a very simple operation on its inputs, so that the loop bodies can be distinguished from everything around them.

GCC (v7 -O2 -mavx -fno-exceptions -fno-tree-vectorize)

transform(double*, double const*, double const*, unsigned long):
        ; --- Ridiculous code ---
        mov     r9, rcx
        mov     r8, rcx
        sub     r9, 8
        js      .L6
        mov     rax, r9
        sub     rcx, 16
        mov     r8, r9
        and     rax, -8
        mov     rdx, rsi
        sub     rcx, rax
        mov     rax, rdi
        ; -----------------------
.L5:
        vmovupd xmm3, XMMWORD PTR [rdx]
        sub     r8, 8
        sub     rax, -128
        sub     rdx, -128
        vmovupd xmm2, XMMWORD PTR [rdx-96]
        vinsertf128     ymm3, ymm3, XMMWORD PTR [rdx-112], 0x1
        vmovupd xmm1, XMMWORD PTR [rdx-64]
        vinsertf128     ymm2, ymm2, XMMWORD PTR [rdx-80], 0x1
        vmovupd xmm0, XMMWORD PTR [rdx-32]
        vinsertf128     ymm1, ymm1, XMMWORD PTR [rdx-48], 0x1
        vaddpd  ymm3, ymm3, ymm3
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rdx-16], 0x1
        vaddpd  ymm2, ymm2, ymm2
        vaddpd  ymm1, ymm1, ymm1
        vmovups XMMWORD PTR [rax-128], xmm3
        vaddpd  ymm0, ymm0, ymm0
        vextractf128    XMMWORD PTR [rax-112], ymm3, 0x1
        vmovups XMMWORD PTR [rax-96], xmm2
        vextractf128    XMMWORD PTR [rax-80], ymm2, 0x1
        vmovups XMMWORD PTR [rax-64], xmm1
        vextractf128    XMMWORD PTR [rax-48], ymm1, 0x1
        vmovups XMMWORD PTR [rax-32], xmm0
        vextractf128    XMMWORD PTR [rax-16], ymm0, 0x1

        ; --- Instead of using sub/jns it uses sub/cmp/jne ---
        cmp     r8, rcx
        jne     .L5
        ; ----------------------------------------------------

        ; --- Ridiculous code ---
        mov     rax, r9
        shr     rax, 3
        lea     rdx, [rax+1]
        neg     rax
        lea     r8, [r9+rax*8]
        sal     rdx, 7
        add     rdi, rdx
        add     rsi, rdx
        ; -----------------------

.L6:
        ; --- Ridiculous code ---
        mov     rdx, r8
        sub     rdx, 2
        js      .L4
        shr     rdx
        xor     eax, eax
        lea     rcx, [rdx+1]
        sal     rcx, 5
        ; -----------------------

.L7:
        vmovupd xmm0, XMMWORD PTR [rsi+rax]
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rsi+16+rax], 0x1
        vaddpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+rax], xmm0
        vextractf128    XMMWORD PTR [rdi+16+rax], ymm0, 0x1

        ; --- Instead of using sub/jns it uses add/cmp/jne ---
        add     rax, 32
        cmp     rax, rcx
        jne     .L7
        ; ----------------------------------------------------

        ; --- Ridiculous code ---
        neg     rdx
        add     rdi, rax
        add     rsi, rax
        lea     rdx, [r8-4+rdx*2]
        ; -----------------------
.L4:
        and     edx, 1
        je      .L14
        vmovupd xmm0, XMMWORD PTR [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovups XMMWORD PTR [rdi], xmm0
.L14:
        vzeroupper
        ret

No comment - GCC magically transformed 9 initial instructions into 37, making the code larger and slower!

Clang (3.9.1 -O2 -mavx -fno-exceptions -fno-tree-vectorize)

transform(double*, double const*, double const*, unsigned long):
        ; --- Ridiculous code ---
        mov     r10, rcx
        add     r10, -8
        js      .LBB0_7
        mov     r11, r10
        shr     r11, 3
        mov     r8, r11
        shl     r8, 4
        lea     r9d, [r11 + 1]
        and     r9d, 1
        mov     rdx, rdi
        mov     rax, rsi
        test    r11, r11
        je      .LBB0_4
        lea     rcx, [r9 - 1]
        sub     rcx, r11
        mov     rdx, rdi
        mov     rax, rsi
        ; -----------------------

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rax]
        vmovupd ymm1, ymmword ptr [rax + 32]
        vmovupd ymm2, ymmword ptr [rax + 64]
        vmovupd ymm3, ymmword ptr [rax + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 96], ymm0
        vmovupd ymm0, ymmword ptr [rax + 128]
        vmovupd ymm1, ymmword ptr [rax + 160]
        vmovupd ymm2, ymmword ptr [rax + 192]
        vmovupd ymm3, ymmword ptr [rax + 224]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx + 128], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 160], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 192], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 224], ymm0
        add     rdx, 256
        add     rax, 256

        ; --- Instead of using sub/jns it uses add/jne ---
        add     rcx, 2
        jne     .LBB0_3
        ; ------------------------------------------------

        ; --- CLANG Unrolled the tail loop - OMG ---
.LBB0_4:
        shl     r11, 3
        lea     rcx, [r8 + 16]
        test    r9, r9
        je      .LBB0_6
        vmovupd ymm0, ymmword ptr [rax]
        vmovupd ymm1, ymmword ptr [rax + 32]
        vmovupd ymm2, ymmword ptr [rax + 64]
        vmovupd ymm3, ymmword ptr [rax + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 96], ymm0
.LBB0_6:
        lea     rdi, [rdi + 8*r8 + 128]
        sub     r10, r11
        lea     rsi, [rsi + 8*rcx]
        mov     rcx, r10
.LBB0_7:
        mov     r10, rcx
        add     r10, -2
        js      .LBB0_15
        mov     r8, r10
        shr     r8
        lea     r9d, [r8 + 1]
        and     r9d, 3
        mov     rax, rdi
        mov     rdx, rsi
        cmp     r10, 6
        jb      .LBB0_11
        lea     r10, [r9 - 1]
        sub     r10, r8
        mov     rax, rdi
        mov     rdx, rsi
.LBB0_10:                               # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rdx]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 32]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 32], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 64]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 64], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 96], ymm0
        sub     rax, -128
        sub     rdx, -128
        add     r10, 4
        jne     .LBB0_10
.LBB0_11:
        lea     r10, [4*r8 + 4]
        lea     r11, [4*r8]
        add     r8, r8
        test    r9, r9
        je      .LBB0_14
        neg     r9
.LBB0_13:                               # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rdx]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        add     rax, 32
        add     rdx, 32
        inc     r9
        jne     .LBB0_13
.LBB0_14:
        lea     rsi, [rsi + 8*r11 + 32]
        lea     rdi, [rdi + 8*r10]
        add     rcx, -4
        sub     rcx, r8
        mov     r10, rcx
        ; --- End of the unrolled loop ---

.LBB0_15:
        test    r10b, 1
        je      .LBB0_17
        vmovupd xmm0, xmmword ptr [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi], xmm0
.LBB0_17:
        vzeroupper
        ret

This was even worse than I would have ever expected - clang unrolled the tail loop and generated optimized code for something that can never execute more than a few times. It completely misunderstood the code. Instead of following the C++ code it inserted dozens of instructions to help with the unrolling; I won't even count them because the code is so ridiculous.

MSVC (v19 /O2 /arch:AVX)

transform, COMDAT PROC
        ; --- Ridiculous code ---
        add      r9, -8
        js       SHORT $LN3@transform
        lea      r8, QWORD PTR [r9+8]
        shr      r8, 3
        mov      rax, r8
        neg      rax
        lea      r9, QWORD PTR [r9+rax*8]
        npad     8
        ; -----------------------

$LL2@transform:
        vmovupd ymm0, YMMWORD PTR [rdx]
        vmovupd ymm1, YMMWORD PTR [rdx+32]
        vmovupd ymm4, YMMWORD PTR [rdx+64]
        vmovupd ymm5, YMMWORD PTR [rdx+96]
        vaddpd   ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rcx], ymm0
        vaddpd   ymm1, ymm1, ymm1
        vmovupd YMMWORD PTR [rcx+32], ymm1
        vaddpd   ymm0, ymm4, ymm4
        vmovupd YMMWORD PTR [rcx+64], ymm0
        vaddpd   ymm1, ymm5, ymm5
        vmovupd YMMWORD PTR [rcx+96], ymm1
        sub      rcx, -128                ; ffffffffffffff80H
        sub      rdx, -128                ; ffffffffffffff80H
        ; --- Instead of using sub/jns it uses sub/jne ---
        sub      r8, 1
        jne      SHORT $LL2@transform
        ; ------------------------------------------------

$LN3@transform:
        ; --- Ridiculous code ---
        add      r9, 6
        js       SHORT $LN5@transform
        lea      r8, QWORD PTR [r9+2]
        shr      r8, 1
        mov      rax, r8
        neg      rax
        lea      r9, QWORD PTR [r9+rax*2]
        npad     5
        ; -----------------------

$LL4@transform:
        vmovupd ymm0, YMMWORD PTR [rdx]
        vaddpd   ymm2, ymm0, ymm0
        vmovupd YMMWORD PTR [rcx], ymm2
        add      rcx, 32              ; 00000020H
        add      rdx, 32              ; 00000020H
        ; --- Instead of using sub/jns it uses sub/jne ---
        sub      r8, 1
        jne      SHORT $LL4@transform
        ; ------------------------------------------------
$LN5@transform:
        test     r9b, 1
        je       SHORT $LN6@transform
        vmovupd xmm0, XMMWORD PTR [rdx]
        vaddpd   xmm2, xmm0, xmm0
        vmovupd XMMWORD PTR [rcx], xmm2
$LN6@transform:
        vzeroupper
        ret      0

Actually not bad - it still does things I didn't ask for, but it's much better than GCC and Clang in this case.

ICC (v17 -O2 -mavx -fno-exceptions)

transform(double*, double const*, double const*, unsigned long):
        add       rcx, -8                                       #11.11 FOLLOWS C++ code!
        js        ..B1.5        # Prob 2%                       #11.22
..B1.3:                         # Preds ..B1.1 ..B1.3
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #12.34
        vmovupd   xmm1, XMMWORD PTR [32+rsi]                    #13.34
        vmovupd   xmm2, XMMWORD PTR [64+rsi]                    #14.34
        vmovupd   xmm3, XMMWORD PTR [96+rsi]                    #15.34
        vinsertf128 ymm6, ymm1, XMMWORD PTR [48+rsi], 1         #13.34
        vinsertf128 ymm4, ymm0, XMMWORD PTR [16+rsi], 1         #12.34
        vinsertf128 ymm8, ymm2, XMMWORD PTR [80+rsi], 1         #14.34
        vinsertf128 ymm10, ymm3, XMMWORD PTR [112+rsi], 1       #15.34
        add       rsi, 128                                      #23.5
        vaddpd    ymm5, ymm4, ymm4                              #17.32
        vaddpd    ymm7, ymm6, ymm6                              #18.32
        vaddpd    ymm9, ymm8, ymm8                              #19.32
        vaddpd    ymm11, ymm10, ymm10                           #20.32
        vmovupd   XMMWORD PTR [rdi], xmm5                       #17.22
        vmovupd   XMMWORD PTR [32+rdi], xmm7                    #18.22
        vmovupd   XMMWORD PTR [64+rdi], xmm9                    #19.22
        vmovupd   XMMWORD PTR [96+rdi], xmm11                   #20.22
        vextractf128 XMMWORD PTR [16+rdi], ymm5, 1              #17.22
        vextractf128 XMMWORD PTR [48+rdi], ymm7, 1              #18.22
        vextractf128 XMMWORD PTR [80+rdi], ymm9, 1              #19.22
        vextractf128 XMMWORD PTR [112+rdi], ymm11, 1            #20.22
        add       rdi, 128                                      #22.5
        add       rcx, -8                                       #11.11
        jns       ..B1.3        # Prob 82%                      #11.22 FOLLOWS C++ code!
..B1.5:                         # Preds ..B1.3 ..B1.1
        add       rcx, 6                                        #25.3
        js        ..B1.9        # Prob 2%                       #27.22 FOLLOWS C++ code!
..B1.7:                         # Preds ..B1.5 ..B1.7
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #28.34
        vinsertf128 ymm1, ymm0, XMMWORD PTR [16+rsi], 1         #28.34
        add       rsi, 32                                       #32.5
        vaddpd    ymm2, ymm1, ymm1                              #29.27
        vmovupd   XMMWORD PTR [rdi], xmm2                       #29.22
        vextractf128 XMMWORD PTR [16+rdi], ymm2, 1              #29.22
        add       rdi, 32                                       #31.5
        add       rcx, -2                                       #27.11 FOLLOWS C++ code!
        jns       ..B1.7        # Prob 82%                      #27.22
..B1.9:                         # Preds ..B1.7 ..B1.5
        test      rcx, 1                                        #35.11
        je        ..B1.11       # Prob 60%                      #35.11
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #36.31
        vaddpd    xmm1, xmm0, xmm0                              #37.24
        vmovupd   XMMWORD PTR [rdi], xmm1                       #37.19
..B1.11:                        # Preds ..B1.9 ..B1.10
        vzeroupper                                              #39.1
        ret                                                     #39.1

ICC is the clear winner in this case. I don't know why the other compilers cannot generate code like this; it's small and good.

Bonus GCC with -Os

transform(double*, double const*, double const*, unsigned long):
.L3:
        mov     rax, rcx
        sub     rax, 8
        js      .L2
        vmovupd ymm3, YMMWORD PTR [rsi]
        sub     rdi, -128
        sub     rsi, -128
        vmovupd ymm2, YMMWORD PTR [rsi-96]
        mov     rcx, rax
        vmovupd ymm1, YMMWORD PTR [rsi-64]
        vaddpd  ymm3, ymm3, ymm3
        vmovupd ymm0, YMMWORD PTR [rsi-32]
        vaddpd  ymm2, ymm2, ymm2
        vaddpd  ymm1, ymm1, ymm1
        vmovupd YMMWORD PTR [rdi-128], ymm3
        vaddpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdi-96], ymm2
        vmovupd YMMWORD PTR [rdi-64], ymm1
        vmovupd YMMWORD PTR [rdi-32], ymm0
        jmp     .L3 ; Well, I asked -Os, but here sub/jns won't hurt
.L2:
        sub     rcx, 2
        js      .L4
        vmovupd ymm0, YMMWORD PTR [rsi]
        add     rdi, 32
        add     rsi, 32
        vaddpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdi-32], ymm0
        jmp     .L2 ; Also here...
.L4:
        and     cl, 1
        je      .L1
        vmovupd xmm0, XMMWORD PTR [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovups XMMWORD PTR [rdi], xmm0
.L1:
        ret

Much closer to the expected code; however, since I asked for the smallest possible code, it didn't generate conditional jumps at the ends of the loops, which is generally not good for performance.

Bonus Clang with -Oz (should be even smaller than -Os)

transform(double*, double const*, double const*, unsigned long):                  # @transform(double*, double const*, double const*, unsigned long)
        push    7
        pop     rax
        sub     rax, rcx
        mov     rdx, rcx
        shr     rdx, 3
        xor     r10d, r10d
        test    rax, rax
        cmovle  r10, rdx
        mov     r9, r10
        shl     r9, 4
        lea     r8, [8*r10]
        shl     r10, 7
        add     r10, rsi
        mov     rdx, rcx
        mov     rax, rdi
        jmp     .LBB0_1
.LBB0_8:                                #   in Loop: Header=BB0_1 Depth=1
        vmovupd ymm0, ymmword ptr [rsi]
        vmovupd ymm1, ymmword ptr [rsi + 32]
        vmovupd ymm2, ymmword ptr [rsi + 64]
        vmovupd ymm3, ymmword ptr [rsi + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rax + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rax + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rax + 96], ymm0
        sub     rsi, -128
        sub     rax, -128
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rdx, -8
        jns     .LBB0_8
        sub     rcx, r8
        lea     r8, [rdi + 8*r9]
        push    1
        pop     rax
        sub     rax, rcx
        lea     rdx, [rcx + rcx]
        and     rdx, -4
        xor     esi, esi
        test    rax, rax
        cmovle  rsi, rdx
        mov     rax, rcx
        mov     rdi, r10
        mov     rdx, r8
        jmp     .LBB0_3
.LBB0_4:                                #   in Loop: Header=BB0_3 Depth=1
        vmovupd ymm0, ymmword ptr [rdi]
        add     rdi, 32
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        add     rdx, 32
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        add     rax, -2
        jns     .LBB0_4
        test    cl, 1
        je      .LBB0_7
        vmovupd xmm0, xmmword ptr [r10 + 8*rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [r8 + 8*rsi], xmm0
.LBB0_7:
        vzeroupper
        ret

No, no, no - clang just doesn't like compact code...:(

Conclusion

People who say that C++ compilers always generate better code than humans are simply wrong. I'm not saying that they always generate worse code, but it seems that the newer the compiler is, the more instructions it emits, and that's becoming scary. More arguments for Blend2D and its hand-optimized pipelines!

You can play with the code here.

Thanks for reading, and don't forget that this is not a hate-post. GCC bug #79830 has already been reported (with slightly different sample code), and so has clang bug #32132. Let's hope this can be improved, as it actually affects the performance of non-JIT code in Blend2D.

Update

After reading some comments about this post on other sites I think people really misunderstood it and tried to put it into a different context. Please note that what is inside the loops is not important in this case. What is important is how various C++ compilers translate valid loop constructs into assembly, and what impact that has on the size of the generated code. So please, instead of justifying the bloated code I have shown, let's fix it - it won't hurt to generate smaller code in this case. And I really think that if the commercial ICC compiler can generate perfect code for this case, GCC and clang should be able to do it too, because these are the compilers I use daily...

13 February, 2017

PABSB|W|D|Q Without SSSE3

Packed Absolute Value

SSSE3 introduced instructions for computing the packed absolute value of 8-bit, 16-bit, and 32-bit integers. In this post I will show how to implement these in pure SSE2, and also how to implement the missing pabsq (packed absolute value of 64-bit integers), which doesn't exist until AVX512-F.

Straightforward Implementation

Let's start with a straightforward implementation in C first:

inline uint32_t abs32(int32_t x) {
  return x >= 0 ? uint32_t(x) : uint32_t(-x);
}

Although it contains a branch, C++ compilers are usually able to recognize such code and create an optimized branch-less version of it. To come up with a branch-less solution yourself you must understand how negation works in 2's complement arithmetic. The expression -x is equivalent to 0 - x, which is equivalent to ~x + 1. Now we know how to change the sign of an integer; however, absolute value changes the sign only if the input is negative. Since all negative numbers in 2's complement arithmetic have the most significant bit set to 1, we can use an arithmetic shift to create a mask (all zeros or all ones), which can then be used to flip all bits of the original value. The remaining addition of 1 can be turned into a subtraction of -1 (as -1 is represented as all ones in 2's complement arithmetic). Thus, we can rewrite the original code as (x ^ mask) - mask, which does nothing if mask is zero, and negates the input if mask is all ones.

A branch-less implementation of the previous code would look like:

inline uint32_t abs32(int32_t x) {
  // NOTE: x >> y must be translated to an arithmetic shift here...
  uint32_t mask = uint32_t(x >> (sizeof(int32_t) * 8 - 1));
  return (uint32_t(x) ^ mask) - mask;
}
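
A quick sanity check of the trick (my own worked example, assuming a 32-bit int and an arithmetic shift for >>):

// x = -5: mask = -5 >> 31 = -1 (all ones)
//         (x ^ mask) - mask = ~(-5) - (-1) = 4 + 1 = 5
// x =  5: mask = 5 >> 31 = 0
//         (x ^ mask) - mask = 5 - 0 = 5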

SSE2 Implementation

The C++ code can be directly translated to SSE2 for 16-bit and 32-bit integer sizes:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psraw  xmm7, 15              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubw  xmm0, xmm7            ; Add one if mask is all ones

; SSE2 compatible PABSD implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubd  xmm0, xmm7            ; Add one if mask is all ones

A 64-bit packed absolute value is trickier as there is no PSRAQ instruction in SSE2 (VPSRAQ was first introduced in AVX512-F); however, we can shuffle the input a bit and use PSRAD again:

; SSE2 compatible PABSQ implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
pshufd xmm7, xmm0, 0xF5      ; Like _MM_SHUFFLE(3, 3, 1, 1)
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubq  xmm0, xmm7            ; Add one if mask is all ones
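
For completeness, here is roughly the same shift-based trick expressed with SSE2 intrinsics (just a sketch; the helper names are mine):

#include <emmintrin.h>

// SSE2 compatible PABSD - shift-based trick.
static inline __m128i abs_epi32_sse2(__m128i x) {
  __m128i mask = _mm_srai_epi32(x, 31);                // Arithmetic shift creates the mask.
  return _mm_sub_epi32(_mm_xor_si128(x, mask), mask);  // (x ^ mask) - mask.
}

// SSE2 compatible PABSQ - PSRAQ replaced by PSHUFD + PSRAD.
static inline __m128i abs_epi64_sse2(__m128i x) {
  __m128i mask = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 1, 1)); // Broadcast the high dwords.
  mask = _mm_srai_epi32(mask, 31);                              // Sign mask for both 64-bit lanes.
  return _mm_sub_epi64(_mm_xor_si128(x, mask), mask);           // (x ^ mask) - mask.
}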

These were straightforward translations of the C++ code shown at the beginning of the post. However, there is a better way of implementing PABSW, and there is also a way of implementing PABSB without any shifts (there is no packed shift that operates on 8-bit entities). Since absolute value can also be written as max(x, -x), we can use packed min/max to implement PABSB and PABSW:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubw  xmm7, xmm0            ; Negate all input values
pmaxsw xmm0, xmm7            ; Select all positive values

; SSE2 compatible PABSB implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubb  xmm7, xmm0            ; Negate all input values
pminub xmm0, xmm7            ; Select all positive values

The PABSW implementation is straightforward and I have nothing to add. The PABSB implementation is more interesting, as it works around the missing PMAXSB instruction (introduced in SSE4.1) by using PMINUB instead. That works here because of what we know about the two inputs: they are negations of each other, so selecting the minimum unsigned value is the same as selecting the maximum signed value.
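
The min/max variants are just as compact with intrinsics (again only a sketch, with helper names of my own choosing):

#include <emmintrin.h>

// SSE2 compatible PABSW - max(x, -x).
static inline __m128i abs_epi16_sse2(__m128i x) {
  return _mm_max_epi16(x, _mm_sub_epi16(_mm_setzero_si128(), x));
}

// SSE2 compatible PABSB - unsigned min of x and -x.
static inline __m128i abs_epi8_sse2(__m128i x) {
  return _mm_min_epu8(x, _mm_sub_epi8(_mm_setzero_si128(), x));
}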

Conclusion

Hope you enjoyed reading the post. I'm preparing a very small library for JIT code generation with asmjit that will have all of these tricks implemented and ready to use. Any wishes for the next post? I was thinking about some pre-SSE4.1 rounding tricks (float|double), basically the same tricks I used in MathPresso.

10 February, 2017

PMINUW and PMAXUW Without SSE4.1

Pminuw and Pmaxuw

The SSE4.1 extension introduced a lot of instructions that, I would say, should have been part of the baseline SSE2. There are instructions that are hard to work around, like pshufb, and there are also instructions that just complete the unfinished SSE2 instruction set, like pminuw (packed minimum of uint16_t) and pmaxuw (packed maximum of uint16_t). I have seen various workarounds for implementing these two, but since I'm working with JIT I always think about the best possible solution, one that:

  • Doesn't need more than one temporary register
  • Doesn't need constants, if possible
  • Is as short as possible

Existing Solutions

Before I started thinking about the best possible implementation I checked libsimdpp, which contains implementations of many post-SSE2 instructions. The min/max implementation can be found in the i_min.h and i_max.h files. What libsimdpp does is XOR the most significant bit (basically the sign bit) of both inputs so they can be used with the signed pminsw or pmaxsw instruction. The problem with this approach is that it needs a constant (0x8000), a register move, three XORs, and one packed min or max. In other words, that is a lot of operations for just a packed minimum or maximum.

The machine code of that solution may look like:

; SSE2 compatible PMINUW|PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm1            ; Move xmm1 to temporary
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0
pxor xmm7, [constant_0x8000] ; Toggle the sign bit of xmm7 (temporary)
pminsw|pmaxsw xmm0, xmm7     ; Perform packed min|max
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0

Of course, if the second operand (xmm1) is not needed after the operation, the temporary register (and the move) can be eliminated.
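
With intrinsics, the sign-bit-toggle approach described above looks roughly like this (a sketch for comparison; the helper name and the way the constant is materialized are my choices):

#include <emmintrin.h>

// SSE2 compatible PMINUW using the sign-bit-toggle (libsimdpp-style) approach.
static inline __m128i min_epu16_xor(__m128i a, __m128i b) {
  const __m128i kSign = _mm_set1_epi16(-32768);        // 0x8000 in every lane.
  __m128i r = _mm_min_epi16(_mm_xor_si128(a, kSign),   // Toggle the sign bit of a.
                            _mm_xor_si128(b, kSign));  // Toggle the sign bit of b.
  return _mm_xor_si128(r, kSign);                      // Toggle the sign bit back.
}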

Is there a better way?

I have used a similar solution in the past, but I was never really happy with it. Today I tried to think harder about the problem and the instructions I could use, and I found the following approach: since SSE2 has PSUBUSW (packed subtract with unsigned saturation), I can use that instruction instead of the three XORs and then subtract its result from the original value to get the packed unsigned minimum. This trick of course only works for packed uint16_t operations, as X86 SIMD has no instructions for packed saturated addition|subtraction of elements wider than 16 bits.

The machine code of this solution would look like:

; SSE2 compatible PMINUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psubusw xmm7, xmm1           ; Subtract with unsigned saturation
psubw xmm0, xmm7             ; Subtract (no saturation, would never overflow here)

Why does it work? If we compute a - b with unsigned saturation we get either zero, which means that b is equal to or greater than a, or a non-zero value, which means that a is greater than b and the value is their (unsigned) difference. We can then subtract that value from the original a to get the unsigned minimum.

The machine code of the pmaxuw implementation is even simpler:

; SSE2 compatible PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
psubusw xmm0, xmm1           ; Subtract with unsigned saturation
paddw xmm0, xmm1             ; Add (no saturation, would never overflow here)

In this last case (unsigned uint16_t maximum) we don't need any temporary register. The difference left in xmm0 is enough to reconstruct the maximum value from the content of xmm1.
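
Both tricks map directly to intrinsics (a sketch; the helper names are mine):

#include <emmintrin.h>

// SSE2 compatible PMINUW: a - max(a - b, 0) == min(a, b).
static inline __m128i min_epu16_sse2(__m128i a, __m128i b) {
  return _mm_sub_epi16(a, _mm_subs_epu16(a, b));
}

// SSE2 compatible PMAXUW: max(a - b, 0) + b == max(a, b).
static inline __m128i max_epu16_sse2(__m128i a, __m128i b) {
  return _mm_add_epi16(_mm_subs_epu16(a, b), b);
}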

Conclusion

There is always a way to do something better. I always value solutions that need fewer temporaries and don't need extra constants. Sometimes these requirements are impossible to meet, but sometimes instructions that you may not think of at first can help. And... old CPUs will thank you for using this approach!

28 January, 2017

Reflect-Trick

Baseline

2D rendering APIs must define how a gradient or texture should be rendered when the coordinates accessing it are out of bounds. In 2D this is usually called the repeat or spread mode, and these are the most used ones:

  • Pad - Coordinate is saturated [result = pad(x, width)]
  • Repeat - Coordinate is repeated [result = repeat(x, width)]
  • Reflect - Coordinate is reflected [result = reflect(x, width)]

Here result, x, and width are integers, and result is always within the [0, width) range (note that width is exclusive). A baseline implementation of the pad, repeat, and reflect functions is below:

inline int pad(int x, int width) {
  if (x < 0)
    return 0;

  if (x >= width)
    return width - 1;

  return x;
}

inline int repeat(int x, int width) {
  // Repeat from 0 to width.
  x = x % width;
  if (x < 0)
    x += width;

  return x;
}

inline int reflect(int x, int width) {
  // Repeat from 0 to width * 2.
  x = x % (width * 2);
  if (x < 0)
    x += width * 2;

  // Reflect from 0 to width.
  if (x >= width)
    x = width * 2 - 1 - x;

  return x;
}

When Width is a Power of 2

In many cases width is a power of 2, especially when rendering gradients. In that case the expensive MOD operator (%) can be replaced by the AND operator (&), which makes the implementation shorter and generally faster:

inline int repeat(int x, int width) {
  return x & (width - 1);
}

inline int reflect(int x, int width) {
  // Repeat from 0 to width * 2.
  x = x & ((width * 2) - 1);

  // Reflect from 0 to width.
  if (x >= width)
    x = width * 2 - 1 - x;

  return x;
}

AND solves two problems: it implicitly handles the x < 0 condition, because width is never negative (for example -2 & 255 == 254, which is the same as -2 + 256 in 2's complement arithmetic), and it replaces a very expensive MOD operator with a totally cheap AND operator that does the same job in our case thanks to the power-of-2 constraint. However, the reflect function is still pretty long and could be simplified even more, especially if we want to vectorize it.

Reflect-Trick Suitable for SIMD

We know that width can be used to precalculate other values that are then used instead of it. For example, if we always use width * 2 - 1 as a mask, it's better to calculate it once and reuse the precalculated value. For the following reflect-trick we precalculate exactly that, so the function receives a mask instead of a width:

inline int precalc(int width) {
  return width * 2 - 1;
}

We also know that SIMD code favors a min/max approach over an if/mask approach, so instead of using the condition, the reflection itself can be rewritten to use a single subtraction and a minimum:

// y is the precomputed value, equal to `width * 2 - 1`.
inline int reflect(int x, int y) {
  // Repeat from 0 to width * 2.
  x = x & y;

  // Reflect by using the reflect-trick.
  return min(x, y - x);
}

The following reasons summarize why this implementation is generally better for SIMD (and probably for scalar code too); a short intrinsics sketch follows the list:

  • It uses only one constant (y), which can be kept in a register.
  • It doesn't use branching, which would otherwise have to be translated into a comparison and masking.
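
To illustrate (just a sketch, not code taken from Blend2D), here is the same reflect-trick applied to eight 16-bit indices at a time with SSE2 intrinsics. Here y holds width * 2 - 1 in every lane, and width * 2 must not exceed 32768 so that the signed 16-bit minimum remains valid:

#include <emmintrin.h>

static inline __m128i reflect_epi16(__m128i x, __m128i y) {
  x = _mm_and_si128(x, y);                       // Repeat from 0 to width * 2.
  return _mm_min_epi16(x, _mm_sub_epi16(y, x));  // Reflect by using min(x, y - x).
}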

Handling Both Repeat & Reflect at the Same Time

If we introduce another variable, it's possible to handle both repeat and reflect modes in a single function. This is extremely beneficial for 2D pipelines that allow different repeat modes for X and Y. We name our constants repeatValue and reflectValue, and precalculate them the following way:

inline int precalcRepeatValue(int width, bool isReflecting) {
  return isReflecting ? width * 2 - 1 : width - 1;
}

inline int precalcReflectValue(int width, bool isReflecting) {
  return width * 2 - 1;
}

And use these values in a single function:

inline int repeatOrReflect(int x, int repeatValue, int reflectValue) {
  x = x & repeatValue;
  return min(x, reflectValue - x);
}

Based on the initialization, the repeatOrReflect() function will either repeat or reflect.
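
To make the behavior concrete, here is a small worked example with width == 256 (the numbers are mine, purely for illustration):

// Reflect mode: repeatValue = 511, reflectValue = 511
//   repeatOrReflect(300, 511, 511) -> x = 300 & 511 = 300 -> min(300, 211) = 211
//   (same as reflect(300, 256) == 512 - 1 - 300 == 211)
//
// Repeat mode:  repeatValue = 255, reflectValue = 511
//   repeatOrReflect(300, 255, 511) -> x = 300 & 255 = 44 -> min(44, 467) = 44
//   (same as repeat(300, 256) == 300 % 256 == 44)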

Reflect-Trick that uses SAR and XOR

On the X86 architecture the reflection can also be performed by using the SAR (arithmetic shift right) and XOR instructions. The idea is to shift the interval of the value to be reflected from [0..width*2) to [-width..width). If we do so, we can use the following:

inline int reflect(int x) {
  const int kShift = (sizeof(int) * 8 - 1);
  return (x >> kShift) ^ x;
}

The function in this case expects a value in the correct range and always outputs a value within [0..width). On X86 the function typically translates to 3 instructions:

mov idx, x   ; Copy x to idx
sar idx, 31  ; Copy sign bits across idx
xor idx, x   ; Reflect

Or 2 instructions, if we can make sure that the input x is in the eax register and the output goes to the edx register:

cdq          ; Sign-extend eax to edx:eax
xor edx, eax ; Reflect
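
The reflect() function above expects its input to already be in the [-width..width) range. When width is a power of 2, one way to get there is the following (just a sketch to illustrate the idea; b2dpipe may do this differently):

inline int reflectAny(int x, int width) {
  // Map x into [-width..width); t is congruent to x modulo width * 2.
  int t = ((x + width) & (width * 2 - 1)) - width;

  // Reflect-trick: the result is within [0..width).
  const int kShift = (sizeof(int) * 8 - 1);
  return (t >> kShift) ^ t;
}

// For example reflectAny(300, 256) == 211 and reflectAny(100, 256) == 100.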

Conclusion

These tricks can be used to improve the performance of gradient and texture rendering. In this post I focused mostly on repeating and reflection constrained to a power of 2, which is very useful for gradients, but not so much for textures. If the repeatValue is removed from the repeatOrReflect function and the implementation handles the repeat while advancing the x and y coordinates, the same code can be used to repeat/reflect a value of any range, which is exactly how b2dpipe works.

22 November, 2016

Using Raspberry for ARM Testing

Introduction

Developing for x86/x64 architectures is relatively easy as it's the standard architecture used by today's desktops; the machines are powerful and there are plenty of options when it comes to dev-tools and IDEs. When it comes to other architectures (I mean ARM here) the situation changes dramatically. These devices are usually not that powerful (both CPU power and memory become limited resources) and some devices running OSes like Android don't even provide the tools that we developers are used to. I personally cannot code properly on a laptop, so devices like tablets or phones are completely out of the question. Using simulators is also not an option as I need to experiment with JIT code generation.

After doing some experiments with my Raspberry Pi 2 (Model B) I decided to try to improve the comfort of using this kind of device for testing. The model I have has 1GB of RAM and a quad-core ARM processor. Using that device to run an IDE would suck, but accessing it remotely and using it only for compilation and unit testing works pretty well.

Preparation

I have done the following steps to prepare the Raspberry device itself:

  1. Download Raspbian Lite distribution
  2. Unpack the image to get the *.img file, for example as raspbian.img
  3. Plug your SD card to your computer, and check which device represents it, I will call it /dev/sdx here
  4. Use the dd tool (Linux) to copy the content of the *.img file to your SD card, something like dd bs=4M if=./raspbian.img of=/dev/sdx (note that the target is the whole device, not a partition)
  5. Mount /dev/sdx2 drive to somewhere so you can access it, like /mnt/rpi
  6. Enable SSH daemon by default, go to /mnt/rpi/etc/rc2.d and rename Kxxsshd to S02sshd
  7. Unmount /dev/sdx2
  8. Sync drives by using sync command (just in case, should not be needed)
  9. Unplug the SD card and put it to your Raspberry

Now the device should be ready; the only thing left is to turn it on and connect it to your network. I use LAN and DHCP here, so that's it. You can use the netstat tool to get the IP address of the device, or ping all 256 addresses on your local network if you don't like automation :) After the device is powered up it should accept SSH connections, so just connect to it and use the pi username and raspberry password, which should be changed immediately after you log in.

In the past I used the raspi-config tool that comes with Raspbian to enable the SSH daemon, but to do that you need to connect a screen and keyboard to the device, which I would like to avoid if possible.

Setting up Cloud9 for C++ Development

I decided to go with the Cloud9 IDE, which can run on the device and be accessed through a browser. I think this is a good deal as the UI is fast (rendered by your browser) and it doesn't waste the precious RAM on the device itself. The only thing that I miss is a C++ debugger, but since Cloud9 is not really targeting C++ I'm fine with that (the syntax highlighting is just fine).

Installing Cloud9 requires the following steps on your Raspberry device:

  1. Install the following dependencies: sudo apt-get install build-essential pkg-config libssl-dev openssl libxml2-dev
  2. Clone Cloud9 IDE to some directory, for example Cloud9: git clone https://github.com/c9/core/ Cloud9
  3. Prepare Cloud9 by using their install-sdk script (takes maybe 15 minutes to compile all deps, also downloads a lot of packages from net): ./Cloud9/scripts/install-sdk.sh
  4. Create your workspace, mine is mkdir workspace (at ~)
  5. Prepare a script for starting the Cloud9 server; my run.sh looks like this:
    #!/bin/sh
    node server.js -w ~/workspace --listen=0.0.0.0 --port=8181 --auth=username:password
    
  6. Don't forget to make the script executable chmod +x run.sh
  7. Run the server ./run.sh

Note that Cloud9 requires setting --listen to 0.0.0.0, otherwise it blocks connections that are not coming from localhost. You can now connect to the device through your browser.

The IDE provides access to the workspace and also features a built-in terminal, which you can use through the web browser as well. This means that configuring the project, compiling it, and running unit tests can be done completely from the browser.

Memory Consumption

$ cat /proc/meminfo 
MemTotal:         947736 kB
MemFree:          130244 kB
MemAvailable:     798108 kB
Buffers:           57904 kB
Cached:           615464 kB

This means that roughly 800MB is still available for the C++ compiler and other dev-tools, which is fine for most projects. However, from my own experience, don't use the make -j command to build your project, as it will spawn 4 parallel compilers, leaving only around 200MB for each compilation unit - that is not enough even for smaller projects like AsmJit.

Conclusion

This setup really works for me and allows me to work on AsmJit's ARM support on a real ARM device. However, it's still not a high priority for me...

07 September, 2016

AsmJit & AsmTK integrated into x64dbg

AsmTK > AsmJit

AsmTK, as the name implies, is an assembler toolkit that uses AsmJit as a base and builds additional features on top of it. The only feature it currently offers is AsmParser, which implements a parser that can consume an assembler string and emit it into a CodeEmitter. More features, like a Linker, are planned for the future.

AsmTK is the first library that takes advantage of AsmJit's instruction validation framework, which requires just 5kB of data to validate everything AsmJit understands. This feature was necessary for building such tools and is now an integral part of AsmJit; it can be enabled per CodeHolder and/or Assembler.

x64dbg > AsmTK

It didn't take long for AsmTK to gain some attention. The first project to officially start using AsmTK is x64dbg, an open-source debugger for Windows. It uses AsmParser to encode asm instructions on-the-fly and currently allows selecting between 3 assembler engines to do the job. It will be interesting to see which engine people prefer the most, and which proves to be the most reliable.

The collaboration started after the author of x64dbg pointed out that all of his ~150 tests were failing when using AsmTK as a backend. I have incrementally fixed most of them and integrated a better test suite into AsmTK itself, so we can add more tests and have them run on continuous integration servers. Some tests were failing because AsmTK parsed instructions in a case-sensitive manner (so all UPPERCASED instructions failed for no good reason); however, other failures required a lot of fixes in AsmTK and AsmJit themselves - for example the REP and REPE/REPNE prefixes were completely reworked (I had planned that anyway, but these failures accelerated my motivation to do it now). Other minor reorganizations in AsmJit happened mostly to increase compatibility with other assemblers.

What's Next?

AsmTK needs more people using it and reporting bugs. I'm completely open to implementing new features and to accepting pull requests that extend its functionality. AsmTK is much more open to extra functionality than AsmJit in this regard, as I try to keep AsmJit lean, mean, and focused on JIT (while providing a foundation to build on top of).

So, if you need to parse some asm, take a look at AsmTK and join AsmJit's room on gitter. You will get pretty powerful functionality in just 250kB (and the size can be reduced further by disabling some AsmJit features such as Compiler).

18 August, 2016

Comparing register allocator of GCC and Clang

Introduction

I'm preparing to write a new register allocator for AsmJit, and part of my preparation is studying how the best open-source C++ compilers allocate registers for some C++ samples, and how they create function prologs and epilogs. Today I wrote a very simple piece of code that tries to exploit one thing - in 32-bit mode there are just 8 GP and 8 SIMD registers - so let's see what GCC and Clang do with the following code:

#include <immintrin.h>

int fn(
  const int* px, const int* py,
  const int* pz, const int* pw,
  const int* pa, const int* pb,
  const int* pc, const int* pd) {

  __m256i a0 = _mm256_loadu_si256((__m256i*)px);
  __m256i a1 = _mm256_loadu_si256((__m256i*)py);
  __m256i a2 = _mm256_loadu_si256((__m256i*)pz);
  __m256i a3 = _mm256_loadu_si256((__m256i*)pw);
  __m256i a4 = _mm256_loadu_si256((__m256i*)pa);
  __m256i b0 = _mm256_loadu_si256((__m256i*)pb);
  __m256i b1 = _mm256_loadu_si256((__m256i*)pc);
  __m256i b2 = _mm256_loadu_si256((__m256i*)pd);
  __m256i b3 = _mm256_loadu_si256((__m256i*)pc + 1);
  __m256i b4 = _mm256_loadu_si256((__m256i*)pd + 1);
  
  __m256i x0 = _mm256_packus_epi16(a0, b0);
  __m256i x1 = _mm256_packus_epi16(a1, b1);
  __m256i x2 = _mm256_packus_epi16(a2, b2);
  __m256i x3 = _mm256_packus_epi16(a3, b3);
  __m256i x4 = _mm256_packus_epi16(a4, b4);
  
  x0 = _mm256_add_epi16(x0, a0);
  x1 = _mm256_add_epi16(x1, a1);
  x2 = _mm256_add_epi16(x2, a2);
  x3 = _mm256_add_epi16(x3, a3);
  x4 = _mm256_add_epi16(x4, a4);

  x0 = _mm256_sub_epi16(x0, b0);
  x1 = _mm256_sub_epi16(x1, b1);
  x2 = _mm256_sub_epi16(x2, b2);
  x3 = _mm256_sub_epi16(x3, b3);
  x4 = _mm256_sub_epi16(x4, b4);
  
  x0 = _mm256_packus_epi16(x0, x1);
  x0 = _mm256_packus_epi16(x0, x2);
  x0 = _mm256_packus_epi16(x0, x3);
  x0 = _mm256_packus_epi16(x0, x4);
  return _mm256_extract_epi32(x0, 1);
}

The function does nothing useful; it's just dumb code that tricks the compiler into not eliminating any part of the function and into emitting exactly what is written in its body. I'm trying to exploit the following: I'm using 8 arguments, which are passed on the stack (32-bit mode), and each argument is a pointer to a 256-bit vector. The last two arguments are read twice, the first 6 arguments just once. This means the compiler has the opportunity to keep all arguments in GP registers if it just follows the code path. The second challenge is that I'm using 15 YMM registers at the same time, while only 8 are available.

GCC Output

Not so good, here is the asm with my annotations:

; GCC 6.1 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
  lea       ecx, [esp+4]                      ; Return address + 4 (first argument on the stack)
  and       esp, -32                          ; Align the stack to 32 bytes
  push      DWORD PTR [ecx-4]                 ; Push returned address
  push      ebp                               ; Save frame-pointer even if I told GCC to not to
  mov       ebp, esp
  push      edi                               ; Save GP regs
  push      esi
  push      ebx
  push      ecx
  sub       esp, 296                          ; Reserve stack for YMM spills
  mov       eax, DWORD PTR [ecx+16]           ; LOAD 'pa'
  mov       esi, DWORD PTR [ecx+4]            ; LOAD 'py'
  mov       edi, DWORD PTR [ecx]              ; LOAD 'px'
  mov       ebx, DWORD PTR [ecx+8]            ; LOAD 'pz'
  mov       edx, DWORD PTR [ecx+12]           ; LOAD 'pw'
  mov       DWORD PTR [ebp-120], eax          ; SPILL 'pa'
  mov       eax, DWORD PTR [ecx+20]           ; LOAD 'pb'
  mov       DWORD PTR [ebp-152], eax          ; SPILL 'pb'
  mov       eax, DWORD PTR [ecx+24]           ; LOAD 'pc'
  vmovdqu   ymm4, YMMWORD PTR [esi]
  mov       ecx, DWORD PTR [ecx+28]           ; LOAD 'pd'
  vmovdqu   ymm7, YMMWORD PTR [edi]
  vmovdqa   YMMWORD PTR [ebp-56], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [ebx]
  mov       ebx, DWORD PTR [ebp-152]          ; LOAD 'pb'
  vmovdqa   YMMWORD PTR [ebp-88], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [edx]
  mov       edx, DWORD PTR [ebp-120]          ; LOAD 'pa'
  vmovdqu   ymm6, YMMWORD PTR [edx]
  vmovdqa   YMMWORD PTR [ebp-120], ymm6       ; SPILL VEC
  vmovdqu   ymm0, YMMWORD PTR [ecx]
  vmovdqu   ymm6, YMMWORD PTR [ebx]
  vmovdqa   ymm5, ymm0                        ; Why to move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [eax+32]
  vmovdqu   ymm2, YMMWORD PTR [eax]
  vmovdqa   ymm1, ymm0                        ; Why to move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [ecx+32]
  vmovdqa   YMMWORD PTR [ebp-152], ymm2
  vmovdqa   ymm3, ymm0                        ; Why to move anything when using AVX?
  vpackuswb ymm0, ymm7, ymm6
  vmovdqa   YMMWORD PTR [ebp-184], ymm5       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-248], ymm3       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-280], ymm0       ; SPILL VEC
  vmovdqa   ymm0, YMMWORD PTR [ebp-56]        ; ALLOC VEC
  vmovdqa   YMMWORD PTR [ebp-216], ymm1       ; SPILL VEC
  vpackuswb ymm2, ymm0, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-88]        ; ALLOC VEC
  vpackuswb ymm1, ymm4, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vpackuswb ymm5, ymm0, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpaddw    ymm2, ymm2, YMMWORD PTR [ebp-56]  ; Uses SPILL slot
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vpackuswb ymm3, ymm0, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpaddw    ymm0, ymm7, YMMWORD PTR [ebp-280] ; Uses SPILL slot
  vpsubw    ymm0, ymm0, ymm6
  vmovdqa   ymm7, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpackuswb ymm0, ymm0, ymm2
  vpaddw    ymm2, ymm4, ymm1
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vmovdqa   YMMWORD PTR [ebp-312], ymm3       ; SPILL VEC
  vpaddw    ymm3, ymm5, YMMWORD PTR [ebp-88]  ; Uses SPILL slot
  vpsubw    ymm3, ymm3, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm3
  vpaddw    ymm1, ymm7, YMMWORD PTR [ebp-312] ; Uses SPILL slot
  vpsubw    ymm1, ymm1, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm2
  vpackuswb ymm0, ymm0, ymm1
  vpextrd   eax, xmm0, 1                      ; Return value
  vzeroupper
  add       esp, 296
  pop       ecx
  pop       ebx
  pop       esi
  pop       edi
  pop       ebp
  lea       esp, [ecx-4]
  ret

Here are my observations based on the output:

  • The first thing GCC does is allocate arguments to registers and/or move them to their home slots if there aren't enough physical registers for all the args.
  • It sets up a frame pointer (EBP) even though I told it not to (-fomit-frame-pointer). The EBP register is valuable in our case. It (probably) does this because it wants to align the stack.
  • It uses [ebp-X] addressing even when [esp+X] would be shorter - for example it should use [esp+16] instead of [ebp-280] to make the instruction shorter - we are talking about 2 bytes per instruction here.
  • YMM register usage is a mystery to me - it does so many allocs and spills, and it also uses 'vmovdqa' to copy one register to another, which seems crazy in AVX code.

Verdict: GCC is worse than I expected in this test; it looks like it doesn't properly use a DFG (data-flow graph) and its register allocator gets confused.

Clang Output

Clang surprised me; here is the output:

; Clang 3.8 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
   mov       eax, dword ptr [esp + 32]     ; LOAD 'pd'
   mov       ecx, dword ptr [esp + 4]      ; LOAD 'px'
   vmovdqu   ymm0, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 8]      ; LOAD 'py'
   vmovdqu   ymm1, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 12]     ; LOAD 'pz'
   vmovdqu   ymm2, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 16]     ; LOAD 'pw'
   vmovdqu   ymm3, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 20]     ; LOAD 'pa'
   vmovdqu   ymm4, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 24]     ; LOAD 'pb'
   vmovdqu   ymm5, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 28]     ; LOAD 'pc'
   vpackuswb ymm6, ymm0, ymm5
   vpsubw    ymm0, ymm0, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx]
   vpaddw    ymm0, ymm0, ymm6
   vpackuswb ymm6, ymm1, ymm5
   vpsubw    ymm1, ymm1, ymm5
   vmovdqu   ymm5, ymmword ptr [eax]
   vpaddw    ymm1, ymm1, ymm6
   vpackuswb ymm6, ymm2, ymm5
   vpsubw    ymm2, ymm2, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx + 32]
   vpaddw    ymm2, ymm2, ymm6
   vpackuswb ymm6, ymm3, ymm5
   vpsubw    ymm3, ymm3, ymm5
   vmovdqu   ymm5, ymmword ptr [eax + 32]
   vpaddw    ymm3, ymm3, ymm6
   vpackuswb ymm6, ymm4, ymm5
   vpsubw    ymm4, ymm4, ymm5
   vpaddw    ymm4, ymm4, ymm6
   vpackuswb ymm0, ymm0, ymm1
   vpackuswb ymm0, ymm0, ymm2
   vpackuswb ymm0, ymm0, ymm3
   vpackuswb ymm0, ymm0, ymm4
   vpextrd   eax, xmm0, 1                  ; Return value
   vzeroupper
   ret

Wow! Clang didn't spill a single GP/VEC register and didn't use 'vmovdqa' to copy registers like GCC did. It rearranged the code in a way that prevents spills, most probably because it uses the DFG properly. I must applaud the clang developers, as this output is amazing. It surprised me, because I wrote the code specifically to force the compiler to spill some YMMs (I know, bad design, but it was just a quick type-type-copy-paste:).

Conclusion

I cannot draw a conclusion from a single test, but GCC failed me in this case. Don't trust people who say that compilers always produce better asm than humans; it's a myth :)

You can try it yourself by copy-pasting the code here - it's a service that can translate your C++ code to asm online.

GCC bug #77287 reported.

15 August, 2016

AsmJit & AVX-512

The World of Prefixes

The X86 architecture is known for its prefixes, and it's no surprise that AVX-512 adds another one to the family - EVEX. Let's summarize the last 4 prefixes introduced in X86:

  • VEX - 2-byte (VEX2) and 3-byte (VEX3) prefix initially designed by Intel to encode AVX instructions, but now used by other CPU extensions like BMI and BMI2. VEX2 was designed to make some instructions 1 byte shorter than VEX3, but its usage is quite limited.
  • XOP - 3-byte prefix designed by AMD to support their XOP extensions in a way that doesn't interfere with the existing VEX prefix. XOP was never adopted by Intel, and AMD will not support it in their new Zen processors (together with other extensions like FMA4). It's a dead end - dead silicon, and dead code that supports this prefix.
  • EVEX - 4-byte prefix designed by Intel to support 512-bit wide vectors and 32 vector registers. Each AVX-512 instruction that works with vector registers uses this prefix. Many AVX and AVX2 instructions can be encoded with this new prefix as well. There are, however, several exceptions, but that would require a separate post.

AVX-512 Status in AsmJit

AVX-512 support in AsmJit is mostly finished. AsmJit's instruction database now contains all AVX-512 instructions together with the older AVX and AVX2 instructions. The reorganization of the instruction database and X86Assembler was quite drastic. AsmJit now contains a single code path that encodes any VEX, XOP, or EVEX instruction, which greatly simplified the logic in the assembler. XOP encoding IDs are no longer needed as each instruction now carries VEX, XOP, and EVEX bits. These bits instruct the encoder to use the correct prefix.

Encoder Improvements

The previous encoder was designed to encode each byte of the [VEX] prefix separately and then write the result byte-by-byte into the destination buffer. This design was fairly simple and, according to my benchmarks, also very fast. However, this approach seemed unfeasible for supporting the new EVEX prefix, which contains 3 bytes of payload (instead of two); the encoder must check all 3 bytes before it can decide whether to emit a VEX or EVEX prefix. The new code does this differently - it uses a single 32-bit integer that represents the whole EVEX prefix, and then decides between EVEX and VEX by checking specific bits in it. If any of the checked bits is '1' then the instruction must be encoded as EVEX. This guarantees that the EVEX prefix will never be used for a legacy AVX instruction, and also guarantees that the best (shortest) encoding is used. AsmJit allows overriding this decision by using the `evex()` option, which instructs the encoder to emit the EVEX prefix, and similarly supports the `vex3()` option, which instructs the encoder to emit a 3-byte VEX prefix instead of a shorter 2-byte VEX prefix. EVEX wins if both `evex()` and `vex3()` are specified.

A simplified version of AsmJit's VEX|EVEX encoder looks like this:

// Encode most of EVEX prefix, based on instruction operands and definition.
uint32_t x = EncodeMostEvex(...);             //  [........|zLL..aaa|Vvvvv..R|RBBmmmmm]
// Check if EVEX is required by checking:     x & [........|xx...xxx|x......x|.x.x....]
if (x & 0x00C78150U) {
  // Encode EVEX - uses most of `x`.
  // ... no more branches here - requires around 14 ops to finalize EVEX ...
  //                                                   _     ____    ____
  //                                              [zLLbVaaa|Wvvvv1pp|RBBR00mm|01100010].
}

// Not EVEX, prepare `x` for VEX2 or VEX3 (5 ops):[........|00L00000|0vvvv000|R0B0mmmm]
x |= ((opCode >> (kSHR_W_PP + 8)) & 0x8300U) | // [00000000|00L00000|Wvvvv0pp|R0B0mmmm]
     ((x      >> 11             ) & 0x0400U) ; // [00000000|00L00000|WvvvvLpp|R0B0mmmm]

// Check if VEX3 is needed by checking        x & [........|........|x.......|..x..x..]
if (x & 0x00008024U) {
  // Encode VEX3 or XOP.
  // ... no more branches here - requires around 7 ops to finalize VEX3 ...
  //                                                         ____    _ _
  //                                              [_OPCODE_|WvvvvLpp|R1Bmmmmm|VEX3|XOP].
}
else {
  // Encode VEX2.
  // ... no more branches here - requires around 3 ops to finalize VEX2 ...
}

This means that AsmJit requires just a single branch to decide whether to use the VEX or EVEX prefix, and another branch to decide between a 3-byte VEX|XOP or a 2-byte VEX prefix. This is good news for everybody expecting high performance, as this approach is nearly as fast as the old AsmJit code, which didn't support AVX-512 at all. It took me some time and thinking to design such an approach and to reorganize the instruction opcode database in a way that allows the initial EVEX prefix to be encoded quickly. My initial approach was around 25% slower than the old AsmJit, and the final code (similar to the snippet shown above) is roughly 3-5% slower, which is pretty close to the old code. The new functionality is nontrivial so I'm really happy with such metrics (and to be honest I would like to see some metrics from other assemblers).

It's also obvious from the code that the new approach is basically optimistic for EVEX - emitting EVEX instructions is much cheaper than emitting VEX|XOP instructions. This wasn't the goal, it's rather a consequence: all the bits that the EVEX prefix introduces must be checked in order to decide between VEX vs. EVEX, thus AsmJit just puts most of these bits into the right position and only performs minor bit shuffling when converting to a prefix that uses fewer bytes (EVEX->VEX3->VEX2).

Future Work

Future posts will be about a new emitter called CodeBuilder. In short, it allows emitting instructions into a representation that can be processed afterwards. This representation was in AsmJit from the beginning as a part of Compiler. Compiler has many high-level features that some people don't need, so it was split into CodeBuilder, which can be used exactly like Assembler, and CodeCompiler, which keeps all the high-level features.

14 July, 2016

Introducing AsmTK

AsmTK - A Toolkit Based on AsmJit

The AsmJit library provides low-level and high-level JIT functionality that allows applications to generate code at run-time. The library was designed from scratch to be efficient and highly dynamic. Efficiency is achieved by having a single (dispatch) function that can encode all supported instructions without jumping to other helper functions. This function is actually pretty big, but I have always tried to keep it organized and consistent. Dynamism is achieved by using a structure called Operand, which is a base class for any operand that can be used by the assembler, and guarantees that each Operand has the same size (16 bytes) regardless of its type and content.

The dynamic nature of AsmJit is actually what makes it much more powerful than other JIT assemblers out there. It's also a feature that makes it possible to have X86Compiler as a part of AsmJit without a significant library size increase; and it also makes it possible to create tools that use AsmJit as a base library to generate and process assembly at run-time. One missing feature that I have been frequently asked about is the ability to assemble code from a string. This is now provided by the AsmTK library!

AsmParser

AsmTK's AsmParser exploits what AsmJit offers - it parses the input string and constructs instruction operands on-the-fly, then passes the whole thing to the instruction validator, and finally passes it to the assembler itself. AsmTK supports all instructions provided by AsmJit, because it uses the AsmJit API for instruction-name-to-ID conversion and strict validation.

Here is the output of a sample application that I wrote in less than 15 minutes - it's basically an on-the-fly X86/X64 instruction encoder based on AsmTK and AsmJit. You enter an instruction, and it tries to encode it and outputs its binary representation:

=========================================================
AsmTK-Test-Cmd - Architecture = x64 (use --x86 and --x64)
---------------------------------------------------------
Usage:
  1. Enter instruction and its operands to be encoded.
  2. Enter empty string to exit.
=========================================================
mov eax, ebx
8BC3
mov rax, rbx
488BC3
mov r15, rax
4C8BF8
cmp ah, al
3AE0
vandpd ymm0, ymm10, ymm13
C4C12D54C5
movdqa xmm0, [rax + rcx * 8 + 16]
660F6F44C810
movdqa rax, xmm0
ERROR: 0x0000000B (Illegal instruction)

The tool can be used to quickly verify whether an instruction encodes correctly and also to check if the encoding is optimal (for example whether AsmJit encodes it the shortest way possible, etc). I have already made two fixes in AsmJit to use a shorter encoding of [mov gpq, u32 imm] and [and gpq, u32 imm] instructions.

Conclusion

The AsmTK library is a fresh piece of software that currently contains less than 1000 lines of code. It relies on AsmJit heavily and uses its new instruction validation API. It also serves as a demonstration of AsmJit capabilities that are not obvious from the AsmJit documentation.

12 July, 2016

AsmJit and Instruction Validation

AsmDB

AsmDB is an X86/X64 instruction database in a JSON-like format that I started after I saw the complexity of the AVX-512 instruction set. I initially thought that I would just add it manually to the AsmJit database, but after a few hours I realized that it is extremely complex and not as straightforward as I thought.

AsmDB -> AsmJit

The solution was to create a database that contains all instructions in a format similar to the one used by instruction-set manuals, and to write a tool that can index the database and programmatically create all the tables AsmJit is using. At the moment I would say that 50% of the work is done - the AsmJit tool that generates parts of the x86inst.cpp file now uses AsmDB to generate space-efficient operand tables that can be used to validate the operands of any x86 and x64 instruction supported by AsmJit. This replaces the old operand tables that were basically useless, as they combined all possibilities of all possible instruction encodings.

The new validation API is still a work-in-progress, but in general you can do something like this:

Error err = X86Inst::validate(
  kArchX64,                        // Architecture - kArchX86, kArchX64.
  kX86InstIdVpunpckhbw,            // Instruction id, see X86InstId enum.
  0,                               // Instruction options, see X86InstOptions enum.
  x86::xmm0, x86::xmm1, x86::eax); // Individual operands, or operands[] and count.

The call to validate() will return an error in this case, because the vpunpckhbw instruction is defined for either [xmm, xmm, xmm/mem] or [ymm, ymm, ymm/mem] operands. The validator is very strict and has access to very detailed information about every instruction - it knows about implicit operands, operands that require a specific register, and possible immediate and memory sizes. It can be used to implement an asm parser as well, and probably much more in the future.

X86Assembler enhancements

At the moment I'm still integrating the validation code into the assembler and compiler classes. What I can say is that I can simplify and remove most of the validation code from the assembler in favor of the new validation code. The reason is that the assembler was always somewhat lenient in terms of validation - it cares about performance and omits everything that is not necessary. This means, for example, that it allows something like mov eax, al and encodes it as mov eax, eax. Basically it checks the size of the destination register and doesn't care much about the source register except for its index for that particular instruction.

This gets much more problematic when using an unsafe API (an API that allows the use of untyped operands). It's possible to emit a ridiculous combination by doing, for example, a.emit(kX86InstIdAdd, x86::eax, x86::xmm5). Such a combination doesn't exist at all, but AsmJit will encode it as add eax, ebp as the operands match the REG/REG signature (the ebp register has the same index as xmm5), and since the 'add' instruction is only defined for GP registers the assembler can omit the register type check, because there is no typed API that provides the 'add' instruction with such operands.

The new validation API changes the game since operands can now be validated and the assembler can validate each instruction before it actually tries to encode it. This means that such cases can be checked by the assembler without making it more complex. The only problem is that this kind of validation is more expensive, thus the assembler needs a new option to enable and disable it.

X86Compiler enhancements

Anybody who has ever used AsmJit's compiler knows that it's not really trivial to debug it when something goes wrong. Compiler stores each instruction, processes it, and then serializes it to the assembler. If the instruction is invalid from the beginning it will be serialized as-is to the assembler as well, which would fail. The problem is that sometimes it's too late and you have to use debug output to figure out the exact place where the instruction was generated. The new validation API should solve most of the issues mentioned, as the compiler can now validate each instruction before it stores it. This makes it much faster to find code that misuses the compiler.

Conclusion

There are still many things to do in AsmJit, but the library is slowly getting better and I hope that AsmJit users will find these new features useful. The good news is that since I reorganized the instruction tables there is still some space I can fill without increasing the library size. There are many things that can be put into these tables, but the first candidate is an SSE to AVX translation, which is very likely to be implemented next. The next goal is, of course, to finalize AVX-512 :)

16 June, 2016

Rendering Game Maps in HTML Canvas

Introduction

Around 10 days ago I started a simple toy project. My goal was to render a game map based on assets extracted from the original Civilization 1 game (DOS version). After one day the basic work was done and I started thinking about how the rendering could be improved. I won't post a screenshot of that initial version here as I used original assets from the game (you can find many on google images), however, this article should describe how I built a very complex texture-based map renderer from an initial tile-based renderer that used a simple atlas to render its tiles. Rendering by blitting pre-rendered tiles is a technique that was used by the original Civilization game and is used today by many turn-based strategy games (including civ-clones).

I picked HTML's <canvas> as the technology used by the renderer. I found it extremely challenging to write a renderer for such a backend, as the performance of canvas varies across browsers and its state-based architecture is extremely unsuitable for such rendering tasks. I don't know if webgl would be faster or not, but since I'm mostly interested in 2D I'm happy I started with canvas, as it's challenging to push the boundaries of what one can do with it. I guess that even with webgl there would be some struggles, as there are a lot of tiles to render and a lot of layers that have to be rendered per tile, so it may be much faster, but it would also require some discipline to make it work.

Engine Design

Before I start describing the renderer itself I would like to describe the engine a bit and how I structured it (and you can skip this if you just want to see the results):

  • Game - connects game rules, game map, map renderer, and user interface.
  • GameMap - defines a game-map, implements basic map operations and algorithms.
  • GameTile - defines a single tile used by GameMap - contains what is required for the game to use that tile, and doesn't contain any rendering hints.
  • GameRules - provides information about objects in the game and their interaction. There are 7 types of objects that can be added to the rules; the 2 most important for us are Assets and Terrains.
  • GameUtils - a static class that contains some helpers used across the engine.

The renderer is completely separated from the engine and looks like the following:

  • AssetLoader - a class that takes all asset files from game rules and queues them for loading.
  • Renderer - the main renderer class that implements the rendering logic. When attached to the game it starts listening to invalidate events and recalculates the affected tiles before rendering.
  • RendererTile - a tile used and maintained by the renderer. This tile only contains information about rendering and is updated every time a tile is invalidated by the game.
  • RenderUtils - utility functions used by renderer - most of the functionality provided here is used only once to process loaded assets into something the renderer would like to use.

When Renderer is attached to the game it starts listening to events that invalidate tiles and require them to be recalculated - these events are only related to the tiles themselves, not to what happens on those tiles. For example, moving a unit across your territory won't cause recalculation of rendering data; however, if you change the terrain type of some tile, build a road, or uncover a hidden tile on the map, then the tile and all its surroundings have to be invalidated and recalculated. The game sends `invalidateMap` (if the whole map changed), `invalidateTile` if a single tile was changed, and `invalidateRect` if an area of tiles was changed.

When the renderer receives such an event it uses a bit-array that describes the region on the map that needs to be recalculated. Since invalidating a single tile also invalidates all its surroundings, the minimum number of tiles to be recalculated is always 9 (a 3x3 matrix - 1 tile and its 8 surroundings). I have chosen a 4x4 grid of tiles to be represented by a single bit in a bit-array called `dirtyTiles`. So when the renderer receives an invalidation event it just updates bits in `dirtyTiles` and that's it; the tiles are recalculated later, as it's very likely that more neighboring tiles will be invalidated during gameplay. When the renderer receives instructions to render the map it first checks if there are dirty tiles and recalculates them. After all tiles are recalculated it clears all dirty bits and starts the rendering process.
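To make the idea concrete, here is a minimal sketch of such a dirty-bit structure, assuming a rectangular map of mapW x mapH tiles. The class name, the `recalcTile` callback, and the exact layout are assumptions for illustration - the engine's real data structures differ:


const BLOCK = 4; // One dirty bit covers a 4x4 block of tiles.

class DirtyTiles {
  constructor(mapW, mapH) {
    this.mapW = mapW;
    this.mapH = mapH;
    this.blocksX = Math.ceil(mapW / BLOCK);
    this.blocksY = Math.ceil(mapH / BLOCK);
    this.bits = new Uint8Array(this.blocksX * this.blocksY); // 0|1 per block.
    this.any = false;
  }

  // Mark the tile and its 8 surroundings as dirty (used by `invalidateTile`).
  invalidateTile(x, y) {
    const bx0 = Math.max(0, Math.floor((x - 1) / BLOCK));
    const by0 = Math.max(0, Math.floor((y - 1) / BLOCK));
    const bx1 = Math.min(this.blocksX - 1, Math.floor((x + 1) / BLOCK));
    const by1 = Math.min(this.blocksY - 1, Math.floor((y + 1) / BLOCK));

    for (let by = by0; by <= by1; by++)
      for (let bx = bx0; bx <= bx1; bx++)
        this.bits[by * this.blocksX + bx] = 1;
    this.any = true;
  }

  // Called right before rendering - recalculate every tile covered by a
  // dirty block, then clear all dirty bits.
  flush(recalcTile) {
    if (!this.any) return;

    for (let by = 0; by < this.blocksY; by++) {
      for (let bx = 0; bx < this.blocksX; bx++) {
        if (!this.bits[by * this.blocksX + bx]) continue;

        for (let y = by * BLOCK; y < Math.min((by + 1) * BLOCK, this.mapH); y++)
          for (let x = bx * BLOCK; x < Math.min((bx + 1) * BLOCK, this.mapW); x++)
            recalcTile(x, y);

        this.bits[by * this.blocksX + bx] = 0;
      }
    }
    this.any = false;
  }
}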

That was a very high-level introduction to the architecture I developed. The primary goal was to enable fast prototyping of the renderer. The next sections cover all the steps I used to write the texture-based renderer.

Step 1 - Initial Implementation

The renderer has to be able to render tiles based on their assigned textures. So the first thing to do is to add some assets to the game rules:


rules.assets = [
  { name: "Texture.Ocean"      , file: "ocean.png"          },
  { name: "Texture.Grassland"  , file: "grassland.png"      }
];

And to create terrain that will use the assets defined:


rules.terrain = [
  { name: "Grassland"          , id: TerrainType.Grassland, asset: "_[Texture.Grassland]" },
  { name: "Ocean"              , id: TerrainType.Ocean    , asset: "_[Texture.Ocean]"     }
];

The assets are referenced as `_[AssetName]`. It may be confusing why the name changes here, but the reason is that each kind of item in the rules system has its own link format. This means that items of different kinds can have the same name and still be referenced without ambiguity. Rules use references for many things. For example, if you have a building and you need something in order to build that building in your city, you use the system of references and add prerequisites to that building (and the prerequisite could be a nation, technology, resource, or anything else that is defined by the game rules).

But back to rendering! If you create two textures called ocean.png and grassland.png, each of them exactly 256x256 pixels, then you can render each texture on each tile by calculating the tile's world coordinates and keeping only the low 8 bits of them (the number of bits depends on the size of the texture; I would recommend using powers of 2, other dimensions will make your work harder, but not impossible). This way you can render something like the following:

While some devs would be satisfied with such a wonderful image, I was not :)
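For reference, a minimal sketch of this Step 1 renderer could look like the following. The `tileSize`, `map.tileAt`, and `textureOf` names are assumptions; the only important part is the `& 255` that turns world coordinates into texture offsets (assuming the tile size divides 256):


function renderBaseTiles(ctx, map, textureOf, tileSize) {
  for (let ty = 0; ty < map.height; ty++) {
    for (let tx = 0; tx < map.width; tx++) {
      const tex = textureOf(map.tileAt(tx, ty)); // 256x256 image or canvas.
      const wx = tx * tileSize;                  // World coordinates.
      const wy = ty * tileSize;

      // Keep only the low 8 bits of the world coordinates to get the
      // source offset, so same-terrain neighbors tile seamlessly.
      ctx.drawImage(tex,
        wx & 255, wy & 255, tileSize, tileSize,  // Source rectangle.
        wx, wy, tileSize, tileSize);             // Destination rectangle.
    }
  }
}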

Step 2 - Adding Blendmaps

To make the transitions between different tiles smooth we use a concept called blendmaps. A blendmap is an image that contains only an alpha channel and describes various transitions between two tiles. I started with a blendmap that has 16 tiles, where the first tile is divided into 4 subtiles specifying the blending of each corner, and the next 15 tiles specify the blending of sides and their combinations. A blendmap looks like the following (see the arrows of blending for illustration):

Even if it looks chaotic there is a very simple logic behind it. Each bit defines a blending side. The renderer then uses the following bits to describe a side or corner:

  • [1] - Top (Side)
  • [2] - Right (Side)
  • [3] - Bottom (Side)
  • [4] - Left (Side)
  • [5] - Top-Left (Corner)
  • [6] - Top-Right (Corner)
  • [7] - Bottom-Left (Corner)
  • [8] - Bottom-Right (Corner)

Sides come first and corners follow. I found this logic the best, because when you keep only the first four bits you get a mask of the 4 sides. Some assets need just these four to render them properly, like rivers, which will be explained later. When you represent these four bits in binary like 0001 and then convert them to decimal form (0001 becomes 1, 0010 becomes 2, etc) you get the blending sides and their indexes in the blendmap (zero indexed - that's why the corners tile is placed first in the blendmap, at index 0). So for example the last tile has all four side bits set (1111, 15 in decimal), which means that it's a top-right-bottom-left transition.
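The snippets later in this article reference an `EdgeFlags` object whose definition isn't shown; based on the bit assignment above it would presumably look like this (the values are inferred from the list, not copied from the original code):


// Edge bits as described above: the low 4 bits are sides, the high 4 bits
// are corners. Inferred from the bit list - the actual definition is not
// shown in the article.
const EdgeFlags = {
  None        : 0x00,
  Top         : 0x01, // [1]
  Right       : 0x02, // [2]
  Bottom      : 0x04, // [3]
  Left        : 0x08, // [4]
  TopLeft     : 0x10, // [5]
  TopRight    : 0x20, // [6]
  BottomLeft  : 0x40, // [7]
  BottomRight : 0x80, // [8]

  Sides       : 0x0F, // All side bits.
  Corners     : 0xF0  // All corner bits.
};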

From now on I will use the RendererTile to store some information about each tile on the map. At the moment we need just two properties, called `baseTexture` and `terrainEdges`. The base texture is set to the ID of the texture that is rendered first (ocean, grassland, etc). The terrain edges are calculated this way:

  • Get all surrounding tiles of the tile you are recalculating (sides and corners).
  • For each surrounding tile, set the corresponding bit in `terrainEdges` if its terrain is different, otherwise clear it (see the sketch below).
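A minimal sketch of this recalculation, assuming a `map.terrainAt(x, y)` helper and the usual screen-space neighbor offsets (both are assumptions made for illustration):


// Neighbor offsets in the same order as the edge bits listed above.
const EdgeOffsets = [
  [ 0, -1], // [1] Top
  [ 1,  0], // [2] Right
  [ 0,  1], // [3] Bottom
  [-1,  0], // [4] Left
  [-1, -1], // [5] Top-Left
  [ 1, -1], // [6] Top-Right
  [-1,  1], // [7] Bottom-Left
  [ 1,  1]  // [8] Bottom-Right
];

function calcTerrainEdges(map, x, y) {
  const self = map.terrainAt(x, y);
  let edges = 0;

  for (let i = 0; i < 8; i++) {
    const nx = x + EdgeOffsets[i][0];
    const ny = y + EdgeOffsets[i][1];
    // Set the bit if the neighboring tile has a different terrain.
    if (map.terrainAt(nx, ny) !== self)
      edges |= 1 << i;
  }
  return edges;
}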

After all tiles are precalculated this way you can implement a very simple renderer that will blend one texture with another based on the tile sides and corners. So for each tile to be rendered do the following:

  • Blit the base texture first. For example if the tile is ocean, blit ocean, if it's grassland, blit grassland.
  • Blit the tile sides based on the first four bits of `terrainEdges`; the x position in the blendmap is calculated as `(terrainEdges & 15) * tileSize`.
  • Blit masked tile corners that represent each quarter of the tile - top-left, top-right, bottom-left, and/or bottom-right.

(NOTE: Rendering the most complicated tile requires 5 transitions in our case - one for the sides and up to 4 more, one per corner)

To make a masked tile-blit you need to do the following:

  • Clear the alpha value of the destination pixels defined by the blendmask.
  • Use destination-over operator to blend the second texture in the transparent area produced by the previous step.

The function may be implemented like this:


// ctx  - Canvas 2d context
// dx   - Destination X
// dy   - Destination Y
//
// tex  - Texture to blend
// texx - Texture X coordinate
// texy - Texture Y coordinate
//
// msk  - Blendmask
// mskx - Blendmask X coordinate
// msky - Blendmask Y coordinate
//
// sq   - Tile (square) size
function renderTransition(ctx, dx, dy, tex, texx, texy, msk, mskx, msky, sq) {
  // Clear pixels and alpha defined by the blend-mask.
  ctx.globalCompositeOperation = "xor";
  ctx.drawImage(msk, mskx, msky, sq, sq, dx, dy, sq, sq);

  // Blit pixels that were cleared by the previous operation.
  ctx.globalCompositeOperation = "destination-over";
  ctx.drawImage(tex, texx, texy, sq, sq, dx, dy, sq, sq);

  // Restore the composition operator.
  ctx.globalCompositeOperation = "source-over";
}
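And here is how a single tile from Step 2 might be rendered with it, following the list above (the base blit plus one side transition; the corner blits work the same way with quarter-size rectangles taken from the first blendmap tile, and are omitted here). `otherTex` - the texture of the neighboring terrain - and the world-aligned `dx`/`dy` are assumptions of this sketch:


function renderTile(ctx, tile, dx, dy, baseTex, otherTex, blendMask, sq) {
  // 1. Blit the base texture (ocean, grassland, ...). Here dx/dy are assumed
  //    to be world-aligned so they can double as texture offsets.
  ctx.drawImage(baseTex, dx & 255, dy & 255, sq, sq, dx, dy, sq, sq);

  // 2. Blit the side transition - the low 4 bits of `terrainEdges` select
  //    the blendmap tile: x = (terrainEdges & 15) * tileSize.
  const sides = tile.terrainEdges & 15;
  if (sides)
    renderTransition(ctx, dx, dy, otherTex, dx & 255, dy & 255,
                     blendMask, sides * sq, 0, sq);
}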

If you implement it correctly and render the same data as in Step 1 it would look like the following:

While it's much better than the previous rendering there are many things that can be improved. But before we go into step 3 I would like to present one trick to reduce the maximum number of blends per such transition to one. The key point is to define each logical combination that can happen and post-process the blendmap to have more tiles. I implemented this in the following way:

  • First define a table that contains all important corners for each combination of sides.
  • Use that table to resolve all possible combinations and create a lookup table where the index is `terrainEdges` (with all bits) and the value is a new index to a post-processed blendmap.
  • Post-process the blendmap to contain all possible combinations.

In code it looks like this:


function makeTransitionData(sides) {
  const PRE = [];
  const LUT = []; for (var i = 0; i < 256; i++) LUT.push(-1);

  for (var side = 0; side < 16; side++) {
    const effective = sides[side];

    for (var corner = 16; corner < 256; corner += 16) {
      var lutIndex = side | corner;
      var altIndex = side | (corner & effective);

      if (LUT[altIndex] !== -1) {
        // Already in `PRE` table.
        LUT[lutIndex] = LUT[altIndex];
      }
      else {
        // New combination.
        const preIndex = PRE.length;
        PRE.push(altIndex);
        LUT[lutIndex] = LUT[altIndex] = preIndex;
      }
    }
  }

  return {
    PRE: PRE, // Preprocessing table.
    LUT: LUT  // Render lookup table.
  };
}

I then used that function with the following data, where each combination of sides specifies which corners are important for it:


const TerrainTransitions = makeTransitionData([
  EdgeFlags.Corners                            , // |       |
  EdgeFlags.BottomLeft | EdgeFlags.BottomRight , // |      T|
  EdgeFlags.TopLeft    | EdgeFlags.BottomLeft  , // |    R  |
  EdgeFlags.BottomLeft                         , // |    R T|
  EdgeFlags.TopLeft    | EdgeFlags.TopRight    , // |  B    |
  EdgeFlags.None                               , // |  B   T|
  EdgeFlags.TopLeft                            , // |  B R  |
  EdgeFlags.None                               , // |  B R T|
  EdgeFlags.TopRight   | EdgeFlags.BottomRight , // |L      |
  EdgeFlags.BottomRight                        , // |L     T|
  EdgeFlags.None                               , // |L   R  |
  EdgeFlags.None                               , // |L   R T|
  EdgeFlags.TopRight                           , // |L B    |
  EdgeFlags.None                               , // |L B   T|
  EdgeFlags.None                               , // |L B R  |
  EdgeFlags.None                                 // |L B R T|
]);

The total number of terrain transitions we defined is 46, and the preprocessing table contains the following masks:

[16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 1, 65, 129, 193, 18, 2, 66, 82, 3, 67, 20, 36, 52, 4, 5, 22, 6, 7, 8, 40, 136, 168, 9, 137, 10, 11, 12, 44, 13, 14, 15]

And the post-processed blendmap would look like this (note that you should post-process it programmatically, not manually):

While it's not necessary to do it this way, I found it much simpler to preprocess the blendmaps I'm using and use just one call to `renderTransition()` with the appropriate blendmap position. If you plan to render many things per tile then I would consider this trick necessary, as it improves performance a lot and it's not memory hungry.

Step 3 - Adding Rivers

The renderer can be improved to support rivers; to do that I did the following:

  • Add a new property to the `RendererTile` called `riverEdges`.
  • Add logic to the renderer to recalculate `riverEdges`.
  • River connections are only side-based, so if a tile is a river, check all four sides; each river or ocean side adds a particular bit to `RendererTile.riverEdges` (see the sketch after this list).
  • For ocean tiles, check all neighbors that are rivers and set `RendererTile.riverEdges` accordingly so you can render the river flow into the ocean.
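A minimal sketch of that recalculation; `map.isRiver`, `map.isOcean`, and the neighbor offsets are assumptions made for illustration, and the side bits reuse the same Top/Right/Bottom/Left order as `terrainEdges`:


function calcRiverEdges(map, x, y) {
  const sides = [[0, -1], [1, 0], [0, 1], [-1, 0]]; // T, R, B, L.
  let edges = 0;

  if (map.isRiver(x, y)) {
    // River tile - connect to any neighboring river or ocean side.
    for (let i = 0; i < 4; i++) {
      const nx = x + sides[i][0], ny = y + sides[i][1];
      if (map.isRiver(nx, ny) || map.isOcean(nx, ny))
        edges |= 1 << i;
    }
  }
  else if (map.isOcean(x, y)) {
    // Ocean tile - remember which sides have a river flowing into it.
    for (let i = 0; i < 4; i++) {
      const nx = x + sides[i][0], ny = y + sides[i][1];
      if (map.isRiver(nx, ny))
        edges |= 1 << i;
    }
  }
  return edges;
}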

Then you need another blendmap that describes river transitions. I created the following one (and please note how the positions match the terrain blendmap from Step 2):

TIP: if you have just started creating your own blendmaps in Gimp or Photoshop: create layers and paint your blendmap using a white color in a transparent layer. Then, after you finish, you can create another layer, put your texture there, and use the multiply blend mode to blend the white with the texture. If you do it properly you will see something like this (and this is how it would look on the map):

If you have all of this then it's pretty trivial to add river support to the renderer (a sketch follows this list):

  • On a ground tile, check `riverEdges` and, if non-zero, use `renderTransition()` to render the blendmap at the index specified by `riverEdges` - it's four bits, which means 15 transitions if you ignore the zero index (a river tile without any connection, which only makes sense in an editor).
  • On an ocean tile, check `riverEdges` and render each side separately, using indexes 16, 17, 18, and 19, which provide the flow into the ocean.
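Reusing `renderTransition()`, river rendering might look roughly like this (the `tile.isOcean` flag and the horizontal blendmap layout are assumptions of this sketch):


function renderRiver(ctx, tile, dx, dy, riverTex, riverMask, sq) {
  const edges = tile.riverEdges;
  if (!edges) return;

  if (!tile.isOcean) {
    // Ground tile - a single transition selected directly by the side bits.
    renderTransition(ctx, dx, dy, riverTex, dx & 255, dy & 255,
                     riverMask, edges * sq, 0, sq);
  }
  else {
    // Ocean tile - one blit per connected side (blendmap indexes 16..19).
    for (let side = 0; side < 4; side++) {
      if (edges & (1 << side))
        renderTransition(ctx, dx, dy, riverTex, dx & 255, dy & 255,
                         riverMask, (16 + side) * sq, 0, sq);
    }
  }
}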

I use the ocean texture for blending rivers; you can have a separate one if you want to make rivers brighter, for example. The rendered map with rivers should look like this:

Step 4 - Adding Dominance

Until now we have rendered just two kinds of tiles (grassland and ocean) and rivers. What if we add more tiles, for example desert, plains, tundra, arctic, etc? Renderers of some games solve this problem in a simple way - they define a texture which is used between different terrain types as a transitional texture. So for example if I define desert to be that texture, then all transitions between plains and grasslands would go through desert, etc. The problem is that this never looks good and it's painful to pick the texture to be used for such transitions. Some games solve this problem in another way - they only allow certain kinds of tiles to be next to each other to work around the issue. But there is another concept, simply called `dominance`.

Dominance works this way: assign a dominance property to each tile and use that dominance to determine which neighbor merges where. Tiles with higher dominance 'dominate' neighbors with lesser dominance. For example, if a dominant tile is grassland and it's surrounded by all plains, then the grassland is rendered as-is and each plains tile surrounding it contains a transition from the grassland, as the grassland dominates it. I found this concept well described here and followed it to create my own implementation.

The first thing I needed was another blendmap for terrain transitions. Currently I use a single blendmap for all terrain transitions, but that's just for simplicity, as it's configurable to have more blendmaps and to specify which should be used where. Here is a blendmap that I created for terrain transitions:

And here is what needs to be done to support it:

  • Remove the `terrainEdges` property from `RendererTile`.
  • Add a new property to the `RendererTile` called `transitions`, which is an array containing pairs of values [textureId, edges].
  • Add logic to the renderer to recalculate `transitions`.

The `RendererTile.transitions` are recalculated in the following way (a minimal sketch follows the list):

  • Clear `RendererTile.transitions`.
  • Check the tile dominance and loop from `dominance + 1` to `maxDominance` (maximum possible dominance of all supported terrain types).
  • For each dominance index check all neighboring tiles and collect a bit mask of edges of that particular dominance. If the mask is not empty then add [textureId, edges] to `RendererTile.transitions`.
  • If the tile is an ocean, set the base texture to ocean if the tile has only ocean neighbors, or to desert if the tile has ground neighbors. By doing this you create a nice coast that looks like sand, into which all neighboring tiles blend.
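Here is a rough sketch of both the recalculation and the rendering loop described in the next paragraph. The helpers `dominanceOf` and `textureOfDominance`, and the reuse of `EdgeOffsets` and `TerrainTransitions.LUT` from the earlier sketches, are assumptions made for illustration:


function calcTransitions(map, x, y, tile, maxDominance) {
  tile.transitions = [];
  const dom = dominanceOf(map.terrainAt(x, y));

  // Collect one [textureId, edges] pair per more dominant terrain around us.
  // (The ocean/desert base-texture rule that forms the coast is omitted here.)
  for (let d = dom + 1; d <= maxDominance; d++) {
    let edges = 0;
    for (let i = 0; i < 8; i++) {
      const nx = x + EdgeOffsets[i][0];
      const ny = y + EdgeOffsets[i][1];
      if (dominanceOf(map.terrainAt(nx, ny)) === d)
        edges |= 1 << i;
    }
    if (edges)
      tile.transitions.push([textureOfDominance(d), edges]);
  }
}

// Rendering: blit the base texture first, then blend every collected transition.
function renderTileTransitions(ctx, tile, dx, dy, textures, blendMask, sq) {
  ctx.drawImage(textures[tile.baseTexture],
                dx & 255, dy & 255, sq, sq, dx, dy, sq, sq);

  for (const [textureId, edges] of tile.transitions) {
    // LUT maps the full 8-bit edge mask to a tile of the post-processed blendmap.
    const index = TerrainTransitions.LUT[edges];
    if (index < 0) continue; // Combination not covered by the preprocessed blendmap.

    renderTransition(ctx, dx, dy, textures[textureId], dx & 255, dy & 255,
                     blendMask, index * sq, 0, sq);
  }
}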

Then, during the rendering process, first blit the base texture and then loop over `RendererTile.transitions`, blending each texture using its textureId and edges (an index into the blendmap). For example, if you define the dominance like this:


rules.assets = [
  { name: "Texture.Ocean"      , file: "ocean.png"        , dominance: 0 },
  { name: "Texture.Desert"     , file: "desert.png"       , dominance: 1 },
  { name: "Texture.Arctic"     , file: "arctic.png"       , dominance: 2 },
  { name: "Texture.Tundra"     , file: "tundra.png"       , dominance: 3 },
  { name: "Texture.Plains"     , file: "plains.png"       , dominance: 4 },
  { name: "Texture.Grassland"  , file: "grassland.png"    , dominance: 5 },
];

rules.terrain = [
  { name: "Desert"             , id: TerrainType.Desert   , asset: "_[Texture.Desert]"    },
  { name: "Plains"             , id: TerrainType.Plains   , asset: "_[Texture.Plains]"    },
  { name: "Grassland"          , id: TerrainType.Grassland, asset: "_[Texture.Grassland]" },
  { name: "Tundra"             , id: TerrainType.Tundra   , asset: "_[Texture.Tundra]"    },
  { name: "Jungle"             , id: TerrainType.Jungle   , asset: "_[Texture.Grassland]" },
  { name: "Ocean"              , id: TerrainType.Ocean    , asset: "_[Texture.Ocean]"     }
];

Then the sample map would be rendered the following way:

Playing with the terrain dominance settings will yield different renderings; for example, increasing the dominance of arctic would render the same map differently:

It's tricky, and the result very much depends on the blendmap used to do the transitions. For example, the blendmap I created is good for snow transitions, but I will create a different one for other terrain types.

Step 5 - Adding Coast

Let's improve the result of Step 4 by adding a nicer coast and doing it programmatically instead of messing with more blendmaps! To create a better coast we need to add a slightly brighter sea that surrounds it. What I'm gonna do is perform some image processing on the original blendmask within the browser: invert -> blur -> brighten -> combine, as illustrated below on a single tile:

If you process each tile of the coast blendmap this way and use the resulting blendmap with a much brighter texture, you get a really nice coastline. The following image also uses the same technique to enhance the look of rivers:

Of course, there are endless possibilities of image processing that can be done with blendmaps.
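For completeness, here is one possible reading of the invert -> blur -> brighten part of that pipeline, operating on the alpha channel of a single blendmask tile drawn into a canvas. The blur radius and brighten factor are assumptions, and the final combine step is left out:


function processCoastTile(ctx, x, y, sq) {
  const img = ctx.getImageData(x, y, sq, sq);
  const src = new Uint8ClampedArray(sq * sq);

  // Invert the alpha channel of the mask.
  for (let i = 0; i < sq * sq; i++)
    src[i] = 255 - img.data[i * 4 + 3];

  // Naive box blur (radius 2) and brighten (1.5x), written back as alpha.
  const r = 2, f = 1.5;
  for (let py = 0; py < sq; py++) {
    for (let px = 0; px < sq; px++) {
      let sum = 0, n = 0;
      for (let oy = -r; oy <= r; oy++) {
        for (let ox = -r; ox <= r; ox++) {
          const qx = px + ox, qy = py + oy;
          if (qx >= 0 && qx < sq && qy >= 0 && qy < sq) {
            sum += src[qy * sq + qx];
            n++;
          }
        }
      }
      img.data[(py * sq + px) * 4 + 3] = Math.min(255, (sum / n) * f);
    }
  }
  ctx.putImageData(img, x, y);
}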

Next Steps, TODOs

I didn't implement terrains like hills, forests, and jungle because I don't really have assets for that. So if you have any idea how that can be done, or if you volunteer to provide such assets, I can dig into it. Other things like fog-of-war or an uncovered map area are simply about adding more blendmaps and more recalculation to the renderer.

Update

A working demo is now available here!

Conclusion

I know that this article is really high level and doesn't cover everything. My writing skills are also a bit limited; however, I wrote it to demonstrate that it's possible to render such images based only on textures and blendmaps, and to do it in real-time (I didn't count frames, but scrolling is smooth even in fullscreen). The work on this project also gave me new ideas that I can implement in Blend2D, because I think that many drawing APIs suffer from not having the ability to use blendmaps in a single operation. I'm not sure when the renderer will be released, but it's gonna be an open-source project for sure. Discussion and new ideas are welcome!