04 June, 2017

Idio[ma]tic CMake

Welcome to Trial and Error Scripting

If I counted the time I have spent figuring out why CMake doesn't work as expected, it would probably add up to weeks. I don't think that CMake itself is a bad tool; however, I think that the CMake scripting language is the most idiotic language ever invented, seriously. The other problem is the documentation, as I have never solved anything by reading the CMake docs (if anything, I have probably wasted more time by reading them).

Consider a very small example that should describe an issue I was dealing with. We have some files that are always compiled as a part of some target, and some files that require specific compiler flags, for example `-mavx2` and some compile-time constants:

project(simple_app C CXX)          # CMake project.

set(SOURCE_FILES main.cpp)         # Source files that are always compiled.
set(CUSTOM_FLAGS -DCUSTOM_IMPL=1)  # Custom compiler flags we want to append to a specific file.

if (BUILD_CUSTOM_FILES)
  set(CUSTOM_FILES impl_avx2.cpp)
  set(AVX2_FLAGS ${CUSTOM_FLAGS} -DAVX2_AVAILABLE=1 -mavx2)
  set_property(SOURCE ${CUSTOM_FILES} APPEND PROPERTY COMPILE_FLAGS ${AVX2_FLAGS})

  # Add all arch-specific files to SOURCE_FILES...
  list(APPEND SOURCE_FILES ${CUSTOM_FILES})
endif()

add_executable(test_app ${SOURCE_FILES})

The problem is that it will not work, and you will have a hard time figuring out why. The compile command CMake generates for impl_avx2.cpp would look like this:

/usr/bin/c++ -DCUSTOM_IMPL=1;-DAVX2_AVAILABLE=1;-mavx2 -o impl_avx2.o -c impl_avx2.cpp

Which is of course completely broken, as it contains semicolons instead of spaces. The reason is that the CMake language doesn't really support arrays - every list is just a string whose items are separated by semicolons. In fact these two lines are identical:

set(SOMETHING A B)
set(SOMETHING "A;B")

And there is no way to distinguish between the two. To make it clearer what is happening, I wrote a simple test script:

function(my_func PREFIX FIRST)
  message("${PREFIX} FIRST=${FIRST}")
  set(ARG_INDEX 0)
  foreach(ARG_VA ${ARGN})
    message("${PREFIX} #${ARG_INDEX} ${ARG_VA}")
    math(EXPR ARG_INDEX "${ARG_INDEX}+1")
  endforeach()
endfunction()

my_func("1:" arg)
my_func("2:" arg second)
my_func("3:" arg second third)
my_func("4:" arg "second;third")
my_func("5:" arg "second third")

Which outputs:

1: FIRST=arg
2: FIRST=arg
2: #0 second
3: FIRST=arg
3: #0 second
3: #1 third
4: FIRST=arg
4: #0 second
4: #1 third
5: FIRST=arg
5: #0 second third

Okay, so we know that CMake treats semicolons as separators. What we can do, then, is simply foreach() over the flags and append them one by one, so let's modify the first example:

project(simple_app C CXX)

set(SOURCE_FILES main.cpp)
set(CUSTOM_FLAGS -DCUSTOM_IMPL=1)

if (BUILD_CUSTOM_FILES)
  set(CUSTOM_FILES impl_avx2.cpp)
  set(AVX2_FLAGS ${CUSTOM_FLAGS} -DAVX2_AVAILABLE=1 -mavx2)
  foreach(flag ${AVX2_FLAGS})
    set_property(SOURCE ${CUSTOM_FILES} APPEND PROPERTY COMPILE_FLAGS ${flag})
  endforeach()

  # Add all arch-specific files to SOURCE_FILES...
  list(APPEND SOURCE_FILES ${CUSTOM_FILES})
endif()

add_executable(test_app ${SOURCE_FILES})

Well, the output would be the same as before, just try it:

/usr/bin/c++ -DCUSTOM_IMPL=1;-DAVX2_AVAILABLE=1;-mavx2 -o impl_avx2.o -c impl_avx2.cpp

Would you expect this? CMake developers are actually aware of it, and to make things even more confusing there are both APPEND and APPEND_STRING options. APPEND appends to the property as a list, which is then stringified with semicolons, and we are back at the beginning. APPEND_STRING appends to the property as a raw string instead:

project(simple_app C CXX)

set(SOURCE_FILES main.cpp)
set(CUSTOM_FLAGS -DCUSTOM_IMPL=1)

if (BUILD_CUSTOM_FILES)
  set(CUSTOM_FILES impl_avx2.cpp)
  set(AVX2_FLAGS ${CUSTOM_FLAGS} -DAVX2_AVAILABLE=1 -mavx2)
  foreach(flag ${AVX2_FLAGS})
    set_property(SOURCE ${CUSTOM_FILES} APPEND_STRING PROPERTY COMPILE_FLAGS ${flag})
  endforeach()

  list(APPEND SOURCE_FILES ${CUSTOM_FILES})
endif()

add_executable(test_app ${SOURCE_FILES})

Which yields:

/usr/bin/c++ -DCUSTOM_IMPL=1-DAVX2_AVAILABLE=1-mavx2 -o impl_avx2.o -c impl_avx2.cpp

Cool, we got rid of the semicolons, but as a side effect there are no spaces between our flags. The problem is that CMake's COMPILE_FLAGS property is in fact a string, not a list, so to append a flag properly we must prepend a space to it, which of course inserts a leading space if the property was empty:

project(simple_app C CXX)

set(SOURCE_FILES main.cpp)
set(CUSTOM_FLAGS -DCUSTOM_IMPL=1)

if (BUILD_CUSTOM_FILES)
  set(CUSTOM_FILES impl_avx2.cpp)
  set(AVX2_FLAGS ${CUSTOM_FLAGS} -DAVX2_AVAILABLE=1 -mavx2)
  foreach(flag ${AVX2_FLAGS})
    set_property(SOURCE ${CUSTOM_FILES} APPEND_STRING PROPERTY COMPILE_FLAGS " ${flag}")
  endforeach()

  list(APPEND SOURCE_FILES ${CUSTOM_FILES})
endif()

add_executable(test_app ${SOURCE_FILES})

Which is quasi-working:

/usr/bin/c++  -DCUSTOM_IMPL=1 -DAVX2_AVAILABLE=1 -mavx2 -o impl_avx2.o -c impl_avx2.cpp

Now let me ask: would you have written the working version on the first try? For me this was simply trial and error until I found a solution that worked, and I personally don't like this way of solving problems.
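For the record, the per-flag foreach() can also be avoided by joining the list into a space-separated string first. Something along these lines should behave the same way (an untested sketch):

string(REPLACE ";" " " AVX2_FLAGS_STR "${AVX2_FLAGS}")
set_property(SOURCE ${CUSTOM_FILES} APPEND_STRING PROPERTY COMPILE_FLAGS " ${AVX2_FLAGS_STR}")

Either way, the lesson is the same: COMPILE_FLAGS wants a single space-separated string, while almost everything else in CMake hands you semicolon-separated lists.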

Time to Migrate Away?

CMake should really switch to a sane language, otherwise I can't see myself using it in the future. I have already checked out Meson, as it was mentioned on several sites that I visit. Is it better? It probably is, but it's yet another tool that employs a home-grown language that you cannot really debug and that forces you to write weird shell scripts as part of your project definition. I mean, why invent a language that cannot do the job and requires running a shell script just to list files in a directory?

I'm looking for a project generator that uses embedded JavaScript, or something really close to it, and can be debugged like a normal programming language. It would be syntactically similar to C/C++ and could be linted by existing tools. I don't see a reason to invent a new language for something like a project generator. It's kind of a paradox that all C/C++ project generators use languages that are not even close to C and require you to write five lines to implement a simple if/else construct.

25 May, 2017

AVX512 {er} and {sae} Quirks

Embedded-Rounding and Suppress-All-Exceptions

Today I finally implemented all missing (and TODO'd) features of AVX512 instruction encoding in AsmJit's X86Assembler (including a parser update in AsmTK), and I found an interesting issue regarding the encoding of {sae} (suppress-all-exceptions) in EVEX-encoded instructions. The EVEX prefix uses many fields to describe the operation; the most important ones for this post are:

  • 'b' (broadcast bit) - if set and the instruction fetches from memory, it's a broadcast; otherwise it selects embedded-rounding {er} mode or suppress-all-exceptions {sae} mode.
  • 'LL' (vector-length) - 2 bits that select between a 128-bit, 256-bit, and 512-bit operation (extensible in the future up to 1024 bits).

The problem is that when embedded-rounding {er} is used, the LL field becomes the rounding mode instead of the vector length, which is then assumed to be 512 bits (LL equal to 0b10). This is the main reason the Intel Manual only allows {er}/{sae} in instructions that operate either on 512-bit vectors or on scalars [as scalars don't use the LL field]. However, the Intel Manual also says that if {sae} is used (which uses the same 'b' bit as {er}, but doesn't use the LL field to specify the rounding mode), the LL field should still describe the vector length. I checked what C++ compilers output in that case and found that GCC and Clang/LLVM clear the LL field when {sae} is used, but the Intel Compiler (ICC) doesn't. This is confusing, as the Intel Manual doesn't say anything about ignoring the LL field when executing instructions that use {sae}.

I created an online playground that can be used to quickly check the output of various C++ compilers. The instruction that uses {sae} is vcmppd (shown as vcmpeqpd), and here is a short comparison:

; EVEX - EVEX prefix (4 bytes).
; Op   - Instruction's opcode.
; Mr   - ModR/M byte.
; Pr   - Comparison predicate.

; Instruction and operands       ; |__EVEX__|Op|Mr|Pr| Compiler and comment
  vcmpeqpd k0, zmm0, zmm1, {sae} ; |62F1FD58|C2|C1|00| ICC   (uses k0)
  vcmpeqpd k1, zmm0, zmm1, {sae} ; |62F1FD18|C2|C9|00| GCC   (uses k1, clears LL)
  vcmpeqpd k0, zmm0, zmm1, {sae} ; |62F1FD18|C2|C1|00| Clang (uses k0, clears LL)
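For reference, the playground source was roughly along these lines - a sketch using the standard AVX512-F intrinsics, not the exact code:

#include <immintrin.h>

// Requires AVX512-F (e.g. -mavx512f). Passing _MM_FROUND_NO_EXC to the
// rounded variant of the compare is what produces the {sae} modifier.
__mmask8 cmp_eq_sae(__m512d a, __m512d b) {
  return _mm512_cmp_round_pd_mask(a, b, _CMP_EQ_OQ, _MM_FROUND_NO_EXC);
}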

Let's decompose the bytes into individual fields:


; Instruction:
;   vcmppd k {kz}, zmm, zmm/m512/b64, ub {sae}
;
; Encoding:
;   [RVMI-FV] EVEX.NDS.512.66.0F.W1 C2 /r ib
;          ____      ____        _             
; __EVEX__|RBBR00mm|Wvvvv1pp|zLLbVaaa| OpCode | ModR/M |CompPred| Compiler
; 01100010|11110001|11111101|01011000|11000010|11000001|00000000| ICC
; 01100010|11110001|11111101|00011000|11000010|11001001|00000000| GCC
; 01100010|11110001|11111101|00011000|11000010|11000001|00000000| Clang

Now the differences should be clear - basically the LL field and the ModR/M reg field that selects the 'k' register are what differ. For those who cannot read the encoding, here is a small howto for decoding the registers:


; Instruction:
;   vcmppd k {kz}, zmm, zmm/m512/b64, ub {sae}
;
; Encoding:
;   [RVMI-FV] EVEX.NDS.512.66.0F.W1 C2 /r ib
;
; RVMI specifies how registers are encoded, in order, we name them 'a', 'b', and 'c':
;
;   [R|V|M|I] (I == Immediate value)
;   [a|b|c|.]
;
;   'a' - R field in Mod/R (2 bits in EVEX/R'R and 3 bits in Mod/R)
;   'b' - V field in EVEX  (5 bits in EVEX/V'vvvv)
;   'c' - M field in Mod/M (2 bits in EVEX/B'B and 3 bits in Mod/M)
;
; Registers in 'vcmpeqpd a:k, b:zmm, c:zmm, {sae}':
;          ____      ____        _             
; ........|acca....|.bbbb...|.LL.b...|........|..aaaccc|........| Compiler
; 01100010|11110001|11111101|01011000|11000010|11000001|00000000| ICC
; 01100010|11110001|11111101|00011000|11000010|11001001|00000000| GCC
; 01100010|11110001|11111101|00011000|11000010|11000001|00000000| Clang
;
;     __
; a = 11000 -> 00000 -> k0
;     __
;     11001 -> 00001 -> k1
;     _____
; b = 11111 -> 00000 -> zmm0
;     __
; c = 11001 -> 00001 -> zmm1

You can also check out the EVEX prefix on Wikipedia.

What Should AsmJit Emit?

That's what I don't know! If 'LL' is really ignored, I would still keep it, as it describes the vector length (that's what ICC does). I will keep an eye on it and try to test it on real hardware when I get the chance.

Using k0 in a write operation

This article should also help with understanding how k0 can be used. AVX512 only restricts its use in EVEX's 'aaa' field, which encodes the write-mask {k1-k7}; zero disables write-masking completely, so {k0} is not encodable there. However, k0 can still be used by instructions that don't encode a k register in the 'aaa' field. This means that a register allocator can use k0 (with some limitations), and indeed ICC and Clang do so.

03 March, 2017

C++ Compilers and Absurd Optimizations

Everything Started with a Bug

Yesterday I finally implemented some code paths that use the AVX and AVX2 extensions to accelerate matrix transformations in Blend2D. Of course the initial version didn't work, because I made a small mistake in vector shuffling, which resulted in rendering something different than expected. I'm usually done with this kind of bug very quickly, but this time the mistake was in a shuffle predicate and it was harder to find than usual. After 15 minutes of staring at the code I disassembled the whole function, because I started to suspect a compiler bug. And I'm glad I looked at the disassembly, because I found some ridiculous code generated by the C++ compiler that I simply couldn't understand.

Don't Trust C++ Compiler

As the title suggests, the C++ compiler failed me again by generating ridiculous code around relatively simple loops. Manually vectorized C++ code usually requires at least two loops (one vectorized and one scalar), so the implementation typically looks like the following:

void proc_v1(double* dst, const double* src, size_t length) {
  size_t i;

  // Main loop - process 4 units at a time (vectorized).
  for (i = length >> 2; i; i--) {
    ...

    dst += 4;
    src += 4;
  }

  // Tail loop - process 1 unit at a time (scalar, remaining input).
  for (i = length & 3; i; i--) {
    ...

    dst += 1;
    src += 1;
  }
}

This is pretty standard code used in many libraries. However, when you need to save registers and you write hand-optimized loops in assembly, you usually don't translate such a construct directly to machine code, because it would cost you two registers. Instead, the following trick can be used:

void proc_v1(double* dst, const double* src, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);

  // Main loop - process 4 units at a time (vectorized).
  while ((i -= 4) >= 0) {
    ...

    dst += 4;
    src += 4;
  }

  // Now it's less than zero, so normalize it to get the remaining count.
  i += 4;

  // Tail loop - process 1 unit at a time (scalar, remaining input).
  while (i) {
    ...

    dst += 1;
    src += 1;

    i--;
  }
}

This can be translated to asm 1:1 and uses only a single register. I personally don't know of any case where this approach would be slower than the previous one, and if the size of the item you are processing is greater than 1 it's always safe (I would say it's safe regardless of the size, as I'm not sure a 32-bit OS would be able to provide 2GB of contiguous memory to user space, but let's stay on the safe side).

Translated to x86 asm, the code would look like this:

; RDI - destination
; RSI - source
; RCX - counter

; Main Loop header.
sub rcx, 4    ; Try if we can use vectorized loop.
js MainSkip   ; If negative it means that there is less than 4 items.

MainLoop:
...           ; Some work to do...
add rdi, 32   ; Advance destination by 4 * sizeof(double).
add rsi, 32   ; Advance source.
sub rcx, 4    ; Decrement loop counter.
jns MainLoop  ; Jump if not negative.

MainSkip:
add rcx, 4    ; Fix the loop counter.
jz TailSkip   ; Exit if zero.

TailLoop:
...           ; Some work to do...
add rdi, 8    ; Advance destination by 1 * sizeof(double).
add rsi, 8    ; Advance source.
sub rcx, 1    ; Decrement loop counter.
jnz TailLoop  ; Jump if not zero.

TailSkip:

The asm code is not just nice, it's also very compact and will generally perform well on any hardware.

Working Example

I cannot just use an empty loop body to demonstrate how badly C++ compilers understand this code, so I wrote a very simple function that does something:

#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// Destination and source are points (pairs of x|y); the function just adds each
// point to itself (the matrix parameter is unused in this simplified example).
void transform(double* dst, const double* src, const double* matrix, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);

  while ((i -= 8) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src +  0);
    __m256d s1 = _mm256_loadu_pd(src +  4);
    __m256d s2 = _mm256_loadu_pd(src +  8);
    __m256d s3 = _mm256_loadu_pd(src + 12);

    _mm256_storeu_pd(dst +  0, _mm256_add_pd(s0, s0));
    _mm256_storeu_pd(dst +  4, _mm256_add_pd(s1, s1));
    _mm256_storeu_pd(dst +  8, _mm256_add_pd(s2, s2));
    _mm256_storeu_pd(dst + 12, _mm256_add_pd(s3, s3));
    
    dst += 16;
    src += 16;
  }
  i += 8;

  while ((i -= 2) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src);
    _mm256_storeu_pd(dst, _mm256_add_pd(s0, s0));
    
    dst += 4;
    src += 4;
  }
  
  if (i & 1) {
    __m128d s0 = _mm_loadu_pd(src);
    _mm_storeu_pd(dst, _mm_add_pd(s0, s0));
  }
}

The function performs only a very simple operation on its inputs - just enough to distinguish the loop bodies from everything else.

GCC (v7 -O2 -mavx -fno-exceptions -fno-tree-vectorize)

transform(double*, double const*, double const*, unsigned long):
        ; --- Ridiculous code ---
        mov     r9, rcx
        mov     r8, rcx
        sub     r9, 8
        js      .L6
        mov     rax, r9
        sub     rcx, 16
        mov     r8, r9
        and     rax, -8
        mov     rdx, rsi
        sub     rcx, rax
        mov     rax, rdi
        ; -----------------------
.L5:
        vmovupd xmm3, XMMWORD PTR [rdx]
        sub     r8, 8
        sub     rax, -128
        sub     rdx, -128
        vmovupd xmm2, XMMWORD PTR [rdx-96]
        vinsertf128     ymm3, ymm3, XMMWORD PTR [rdx-112], 0x1
        vmovupd xmm1, XMMWORD PTR [rdx-64]
        vinsertf128     ymm2, ymm2, XMMWORD PTR [rdx-80], 0x1
        vmovupd xmm0, XMMWORD PTR [rdx-32]
        vinsertf128     ymm1, ymm1, XMMWORD PTR [rdx-48], 0x1
        vaddpd  ymm3, ymm3, ymm3
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rdx-16], 0x1
        vaddpd  ymm2, ymm2, ymm2
        vaddpd  ymm1, ymm1, ymm1
        vmovups XMMWORD PTR [rax-128], xmm3
        vaddpd  ymm0, ymm0, ymm0
        vextractf128    XMMWORD PTR [rax-112], ymm3, 0x1
        vmovups XMMWORD PTR [rax-96], xmm2
        vextractf128    XMMWORD PTR [rax-80], ymm2, 0x1
        vmovups XMMWORD PTR [rax-64], xmm1
        vextractf128    XMMWORD PTR [rax-48], ymm1, 0x1
        vmovups XMMWORD PTR [rax-32], xmm0
        vextractf128    XMMWORD PTR [rax-16], ymm0, 0x1

        ; --- Instead of using sub/jns it uses sub/cmp/jne ---
        cmp     r8, rcx
        jne     .L5
        ; ----------------------------------------------------

        ; --- Ridiculous code ---
        mov     rax, r9
        shr     rax, 3
        lea     rdx, [rax+1]
        neg     rax
        lea     r8, [r9+rax*8]
        sal     rdx, 7
        add     rdi, rdx
        add     rsi, rdx
        ; -----------------------

.L6:
        ; --- Ridiculous code ---
        mov     rdx, r8
        sub     rdx, 2
        js      .L4
        shr     rdx
        xor     eax, eax
        lea     rcx, [rdx+1]
        sal     rcx, 5
        ; -----------------------

.L7:
        vmovupd xmm0, XMMWORD PTR [rsi+rax]
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rsi+16+rax], 0x1
        vaddpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+rax], xmm0
        vextractf128    XMMWORD PTR [rdi+16+rax], ymm0, 0x1

        ; --- Instead of using sub/jns it uses add/cmp/jne ---
        add     rax, 32
        cmp     rax, rcx
        jne     .L7
        ; ----------------------------------------------------

        ; --- Ridiculous code ---
        neg     rdx
        add     rdi, rax
        add     rsi, rax
        lea     rdx, [r8-4+rdx*2]
        ; -----------------------
.L4:
        and     edx, 1
        je      .L14
        vmovupd xmm0, XMMWORD PTR [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovups XMMWORD PTR [rdi], xmm0
.L14:
        vzeroupper
        ret

No comment - GCC magically transformed 9 initial instructions into 37, making the code larger and slower!

Clang (3.9.1 -O2 -mavx -fno-exceptions -fno-tree-vectorize)

transform(double*, double const*, double const*, unsigned long):
        ; --- Ridiculous code ---
        mov     r10, rcx
        add     r10, -8
        js      .LBB0_7
        mov     r11, r10
        shr     r11, 3
        mov     r8, r11
        shl     r8, 4
        lea     r9d, [r11 + 1]
        and     r9d, 1
        mov     rdx, rdi
        mov     rax, rsi
        test    r11, r11
        je      .LBB0_4
        lea     rcx, [r9 - 1]
        sub     rcx, r11
        mov     rdx, rdi
        mov     rax, rsi
        ; -----------------------

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rax]
        vmovupd ymm1, ymmword ptr [rax + 32]
        vmovupd ymm2, ymmword ptr [rax + 64]
        vmovupd ymm3, ymmword ptr [rax + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 96], ymm0
        vmovupd ymm0, ymmword ptr [rax + 128]
        vmovupd ymm1, ymmword ptr [rax + 160]
        vmovupd ymm2, ymmword ptr [rax + 192]
        vmovupd ymm3, ymmword ptr [rax + 224]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx + 128], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 160], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 192], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 224], ymm0
        add     rdx, 256
        add     rax, 256

        ; --- Instead of using sub/jns it uses add/jne ---
        add     rcx, 2
        jne     .LBB0_3
        ; ------------------------------------------------

        ; --- CLANG Unrolled the tail loop - OMG ---
.LBB0_4:
        shl     r11, 3
        lea     rcx, [r8 + 16]
        test    r9, r9
        je      .LBB0_6
        vmovupd ymm0, ymmword ptr [rax]
        vmovupd ymm1, ymmword ptr [rax + 32]
        vmovupd ymm2, ymmword ptr [rax + 64]
        vmovupd ymm3, ymmword ptr [rax + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 96], ymm0
.LBB0_6:
        lea     rdi, [rdi + 8*r8 + 128]
        sub     r10, r11
        lea     rsi, [rsi + 8*rcx]
        mov     rcx, r10
.LBB0_7:
        mov     r10, rcx
        add     r10, -2
        js      .LBB0_15
        mov     r8, r10
        shr     r8
        lea     r9d, [r8 + 1]
        and     r9d, 3
        mov     rax, rdi
        mov     rdx, rsi
        cmp     r10, 6
        jb      .LBB0_11
        lea     r10, [r9 - 1]
        sub     r10, r8
        mov     rax, rdi
        mov     rdx, rsi
.LBB0_10:                               # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rdx]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 32]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 32], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 64]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 64], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 96], ymm0
        sub     rax, -128
        sub     rdx, -128
        add     r10, 4
        jne     .LBB0_10
.LBB0_11:
        lea     r10, [4*r8 + 4]
        lea     r11, [4*r8]
        add     r8, r8
        test    r9, r9
        je      .LBB0_14
        neg     r9
.LBB0_13:                               # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rdx]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        add     rax, 32
        add     rdx, 32
        inc     r9
        jne     .LBB0_13
.LBB0_14:
        lea     rsi, [rsi + 8*r11 + 32]
        lea     rdi, [rdi + 8*r10]
        add     rcx, -4
        sub     rcx, r8
        mov     r10, rcx
        ; --- End of the unrolled loop ---

.LBB0_15:
        test    r10b, 1
        je      .LBB0_17
        vmovupd xmm0, xmmword ptr [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi], xmm0
.LBB0_17:
        vzeroupper
        ret

This was even worse than I would have ever expected - Clang unrolled the tail loop and generated optimized code for something that would never be executed more than once. It completely misunderstood the code. Also, instead of following the C++ code, it inserted dozens of instructions to help with the unrolling; I won't even count them, because the code is so ridiculous.

MSVC (v19 /O2 /arch:AVX)

transform, COMDAT PROC
        ; --- Ridiculous code ---
        add      r9, -8
        js       SHORT $LN3@transform
        lea      r8, QWORD PTR [r9+8]
        shr      r8, 3
        mov      rax, r8
        neg      rax
        lea      r9, QWORD PTR [r9+rax*8]
        npad     8
        ; -----------------------

$LL2@transform:
        vmovupd ymm0, YMMWORD PTR [rdx]
        vmovupd ymm1, YMMWORD PTR [rdx+32]
        vmovupd ymm4, YMMWORD PTR [rdx+64]
        vmovupd ymm5, YMMWORD PTR [rdx+96]
        vaddpd   ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rcx], ymm0
        vaddpd   ymm1, ymm1, ymm1
        vmovupd YMMWORD PTR [rcx+32], ymm1
        vaddpd   ymm0, ymm4, ymm4
        vmovupd YMMWORD PTR [rcx+64], ymm0
        vaddpd   ymm1, ymm5, ymm5
        vmovupd YMMWORD PTR [rcx+96], ymm1
        sub      rcx, -128                ; ffffffffffffff80H
        sub      rdx, -128                ; ffffffffffffff80H
        ; --- Instead of using sub/jns it uses sub/jne ---
        sub      r8, 1
        jne      SHORT $LL2@transform
        ; ------------------------------------------------

$LN3@transform:
        ; --- Ridiculous code ---
        add      r9, 6
        js       SHORT $LN5@transform
        lea      r8, QWORD PTR [r9+2]
        shr      r8, 1
        mov      rax, r8
        neg      rax
        lea      r9, QWORD PTR [r9+rax*2]
        npad     5
        ; -----------------------

$LL4@transform:
        vmovupd ymm0, YMMWORD PTR [rdx]
        vaddpd   ymm2, ymm0, ymm0
        vmovupd YMMWORD PTR [rcx], ymm2
        add      rcx, 32              ; 00000020H
        add      rdx, 32              ; 00000020H
        ; --- Instead of using sub/jns it uses sub/jne ---
        sub      r8, 1
        jne      SHORT $LL4@transform
        ; ------------------------------------------------
$LN5@transform:
        test     r9b, 1
        je       SHORT $LN6@transform
        vmovupd xmm0, XMMWORD PTR [rdx]
        vaddpd   xmm2, xmm0, xmm0
        vmovupd XMMWORD PTR [rcx], xmm2
$LN6@transform:
        vzeroupper
        ret      0

Actually not bad - it still does things I didn't ask for, but it's much better than GCC and Clang in this case.

ICC (v17 -O2 -mavx -fno-exceptions)

transform(double*, double const*, double const*, unsigned long):
        add       rcx, -8                                       #11.11 FOLLOWS C++ code!
        js        ..B1.5        # Prob 2%                       #11.22
..B1.3:                         # Preds ..B1.1 ..B1.3
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #12.34
        vmovupd   xmm1, XMMWORD PTR [32+rsi]                    #13.34
        vmovupd   xmm2, XMMWORD PTR [64+rsi]                    #14.34
        vmovupd   xmm3, XMMWORD PTR [96+rsi]                    #15.34
        vinsertf128 ymm6, ymm1, XMMWORD PTR [48+rsi], 1         #13.34
        vinsertf128 ymm4, ymm0, XMMWORD PTR [16+rsi], 1         #12.34
        vinsertf128 ymm8, ymm2, XMMWORD PTR [80+rsi], 1         #14.34
        vinsertf128 ymm10, ymm3, XMMWORD PTR [112+rsi], 1       #15.34
        add       rsi, 128                                      #23.5
        vaddpd    ymm5, ymm4, ymm4                              #17.32
        vaddpd    ymm7, ymm6, ymm6                              #18.32
        vaddpd    ymm9, ymm8, ymm8                              #19.32
        vaddpd    ymm11, ymm10, ymm10                           #20.32
        vmovupd   XMMWORD PTR [rdi], xmm5                       #17.22
        vmovupd   XMMWORD PTR [32+rdi], xmm7                    #18.22
        vmovupd   XMMWORD PTR [64+rdi], xmm9                    #19.22
        vmovupd   XMMWORD PTR [96+rdi], xmm11                   #20.22
        vextractf128 XMMWORD PTR [16+rdi], ymm5, 1              #17.22
        vextractf128 XMMWORD PTR [48+rdi], ymm7, 1              #18.22
        vextractf128 XMMWORD PTR [80+rdi], ymm9, 1              #19.22
        vextractf128 XMMWORD PTR [112+rdi], ymm11, 1            #20.22
        add       rdi, 128                                      #22.5
        add       rcx, -8                                       #11.11
        jns       ..B1.3        # Prob 82%                      #11.22 FOLLOWS C++ code!
..B1.5:                         # Preds ..B1.3 ..B1.1
        add       rcx, 6                                        #25.3
        js        ..B1.9        # Prob 2%                       #27.22 FOLLOWS C++ code!
..B1.7:                         # Preds ..B1.5 ..B1.7
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #28.34
        vinsertf128 ymm1, ymm0, XMMWORD PTR [16+rsi], 1         #28.34
        add       rsi, 32                                       #32.5
        vaddpd    ymm2, ymm1, ymm1                              #29.27
        vmovupd   XMMWORD PTR [rdi], xmm2                       #29.22
        vextractf128 XMMWORD PTR [16+rdi], ymm2, 1              #29.22
        add       rdi, 32                                       #31.5
        add       rcx, -2                                       #27.11 FOLLOWS C++ code!
        jns       ..B1.7        # Prob 82%                      #27.22
..B1.9:                         # Preds ..B1.7 ..B1.5
        test      rcx, 1                                        #35.11
        je        ..B1.11       # Prob 60%                      #35.11
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #36.31
        vaddpd    xmm1, xmm0, xmm0                              #37.24
        vmovupd   XMMWORD PTR [rdi], xmm1                       #37.19
..B1.11:                        # Preds ..B1.9 ..B1.10
        vzeroupper                                              #39.1
        ret                                                     #39.1

ICC is the clear winner in this case. I don't know why the other compilers cannot generate code like this - it's small and good.

Bonus GCC with -Os

transform(double*, double const*, double const*, unsigned long):
.L3:
        mov     rax, rcx
        sub     rax, 8
        js      .L2
        vmovupd ymm3, YMMWORD PTR [rsi]
        sub     rdi, -128
        sub     rsi, -128
        vmovupd ymm2, YMMWORD PTR [rsi-96]
        mov     rcx, rax
        vmovupd ymm1, YMMWORD PTR [rsi-64]
        vaddpd  ymm3, ymm3, ymm3
        vmovupd ymm0, YMMWORD PTR [rsi-32]
        vaddpd  ymm2, ymm2, ymm2
        vaddpd  ymm1, ymm1, ymm1
        vmovupd YMMWORD PTR [rdi-128], ymm3
        vaddpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdi-96], ymm2
        vmovupd YMMWORD PTR [rdi-64], ymm1
        vmovupd YMMWORD PTR [rdi-32], ymm0
        jmp     .L3 ; Well, I asked -Os, but here sub/jns won't hurt
.L2:
        sub     rcx, 2
        js      .L4
        vmovupd ymm0, YMMWORD PTR [rsi]
        add     rdi, 32
        add     rsi, 32
        vaddpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdi-32], ymm0
        jmp     .L2 ; Also here...
.L4:
        and     cl, 1
        je      .L1
        vmovupd xmm0, XMMWORD PTR [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovups XMMWORD PTR [rdi], xmm0
.L1:
        ret

Much closer to the expected code; however, since I asked for the smallest possible code, it didn't generate the conditional jumps at the ends of the loops, which is not good in general.

Bonus Clang with -Oz (should be even better than -Os)

transform(double*, double const*, double const*, unsigned long):                  # @transform(double*, double const*, double const*, unsigned long)
        push    7
        pop     rax
        sub     rax, rcx
        mov     rdx, rcx
        shr     rdx, 3
        xor     r10d, r10d
        test    rax, rax
        cmovle  r10, rdx
        mov     r9, r10
        shl     r9, 4
        lea     r8, [8*r10]
        shl     r10, 7
        add     r10, rsi
        mov     rdx, rcx
        mov     rax, rdi
        jmp     .LBB0_1
.LBB0_8:                                #   in Loop: Header=BB0_1 Depth=1
        vmovupd ymm0, ymmword ptr [rsi]
        vmovupd ymm1, ymmword ptr [rsi + 32]
        vmovupd ymm2, ymmword ptr [rsi + 64]
        vmovupd ymm3, ymmword ptr [rsi + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rax + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rax + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rax + 96], ymm0
        sub     rsi, -128
        sub     rax, -128
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rdx, -8
        jns     .LBB0_8
        sub     rcx, r8
        lea     r8, [rdi + 8*r9]
        push    1
        pop     rax
        sub     rax, rcx
        lea     rdx, [rcx + rcx]
        and     rdx, -4
        xor     esi, esi
        test    rax, rax
        cmovle  rsi, rdx
        mov     rax, rcx
        mov     rdi, r10
        mov     rdx, r8
        jmp     .LBB0_3
.LBB0_4:                                #   in Loop: Header=BB0_3 Depth=1
        vmovupd ymm0, ymmword ptr [rdi]
        add     rdi, 32
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        add     rdx, 32
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        add     rax, -2
        jns     .LBB0_4
        test    cl, 1
        je      .LBB0_7
        vmovupd xmm0, xmmword ptr [r10 + 8*rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [r8 + 8*rsi], xmm0
.LBB0_7:
        vzeroupper
        ret

No, no, no - clang just doesn't like compact code...:(

Conclusion

People saying that C++ compilers always generate better code than humans are simply wrong. I'm not saying that compilers always generate worse code, but it seems that the newer the compiler is, the more instructions it emits, and it's becoming scary. One more argument for Blend2D and its hand-optimized pipelines!

You can play with the code here.

Thanks for reading, and don't forget that this is not a hate post. GCC bug #79830 has already been reported (with slightly different sample code), and Clang bug #32132 as well. Let's hope this can be improved, as it actually affects the performance of non-JIT code in Blend2D.

Update

After reading some comments about this post on other sites, I think people really misunderstood it and tried to put it into a different context. Please note that what is inside the loops is not important in this case. What is important is how various C++ compilers translate valid loop constructs into assembly and the impact on the size of the generated code. So please, instead of justifying the bloated code I have shown, let's fix it; it won't hurt to generate smaller code in this case. And I really think that if the commercial ICC compiler can generate perfect code for this case, GCC and Clang should be able to do it too, because those are the compilers I use daily...

13 February, 2017

PABSB|W|D|Q Without SSSE3

Packed Absolute Value

SSSE3 introduced instructions for computing packed absolute values of 8-bit, 16-bit, and 32-bit integers. In this post I will show how to implement these in pure SSE2, and also how to implement the missing pabsq (packed absolute value of 64-bit integers), which is not provided until AVX512-F.

Straightforward Implementation

Let's start with a straightforward implementation in C first:

inline uint32_t abs32(int32_t x) {
  return x >= 0 ? uint32_t(x) : uint32_t(-x);
}

Although it contains branches, C++ compilers are usually able to recognize such code and create an optimized branch-less version of it. To think about a possible branch-less solution you must understand how negation works in 2's complement arithmetic. The expression -x is equivalent to 0 - x, which is equivalent to ~x + 1. Now we know how to change the sign of an integer; however, absolute value changes the sign only if the input is negative. Since all negative numbers in 2's complement arithmetic have the most significant bit set to 1, we can use an arithmetic shift to get a mask (all zeros or all ones), which can then be used to flip all bits of the original value. The remaining addition of 1 can be turned into a subtraction of -1 (as -1 is represented as all ones in 2's complement arithmetic). Thus, we can rewrite the original code as (x ^ mask) - mask, which does nothing if mask is zero, and negates the input if mask is all ones.

A branch-less implementation of the previous code would look like:

inline uint32_t abs32(int32_t x) {
  // NOTE: x >> y must be translated to an arithmetic shift here...
  uint32_t mask = uint32_t(x >> (sizeof(int32_t) * 8 - 1));
  return (uint32_t(x) ^ mask) - mask;
}

SSE2 Implementation

The C++ code can be directly translated to SSE2 for 16-bit and 32-bit integer sizes:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psraw  xmm7, 15              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubw  xmm0, xmm7            ; Add one if mask is all ones

; SSE2 compatible PABSD implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubd  xmm0, xmm7            ; Add one if mask is all ones

A 64-bit packed absolute value is trickier, as there is no PSRAQ instruction in SSE2 (VPSRAQ was first introduced in AVX512-F); however, we can shuffle the input a bit and use PSRAD instead:

; SSE2 compatible PABSQ implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
pshufd xmm7, xmm0, 0xF5      ; Like _MM_SHUFFLE(3, 3, 1, 1)
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubq  xmm0, xmm7            ; Add one if mask is all ones
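The same PABSQ emulation expressed with SSE2 intrinsics might look like this (a sketch; the function name is mine):

#include <emmintrin.h> // SSE2

static inline __m128i sse2_abs_epi64(__m128i x) {
  // Broadcast the sign bit of each 64-bit lane by shuffling the high 32-bit
  // halves into both positions and shifting them arithmetically.
  __m128i mask = _mm_srai_epi32(_mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 1, 1)), 31);
  // (x ^ mask) - mask == abs(x) in 2's complement arithmetic.
  return _mm_sub_epi64(_mm_xor_si128(x, mask), mask);
}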

These were straightforward translations based on the initial C++ code shown at the beginning of the post. However, there is a better way of implementing PABSW, and there is also a way of implementing PABSB without any shifts (there is no packed shift that operates on 8-bit entities anyway). Since absolute value can also be written as max(x, -x), we can use packed min/max to implement PABSB and PABSW:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (negated input)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubw  xmm7, xmm0            ; Negate all input values
pmaxsw xmm0, xmm7            ; Select all positive values

; SSE2 compatible PABSB implementation
;   xmm0 = in|out
;   xmm7 = temporary (negated input)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubb  xmm7, xmm0            ; Negate all input values
pminub xmm0, xmm7            ; Select all positive values

The PABSW implementation is straightforward and I have nothing to add. The PABSB implementation is more interesting, as it works around the missing PMAXSB instruction (which was only introduced in SSE4.1) by using PMINUB instead. This works thanks to what we know about the two inputs: since they are negations of each other, selecting the minimum unsigned value is the same as selecting the maximum signed value.
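For completeness, the min/max variants expressed with SSE2 intrinsics might look like this (a sketch; function names are mine):

#include <emmintrin.h> // SSE2

static inline __m128i sse2_abs_epi16(__m128i x) {
  // max(x, -x) for each 16-bit lane.
  return _mm_max_epi16(x, _mm_sub_epi16(_mm_setzero_si128(), x));
}

static inline __m128i sse2_abs_epi8(__m128i x) {
  // The unsigned minimum of x and -x selects the non-negative one,
  // which is exactly the signed maximum in this special case.
  return _mm_min_epu8(x, _mm_sub_epi8(_mm_setzero_si128(), x));
}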

Conclusion

Hope you enjoyed reading the post. I'm preparing a very small library for JIT code generation with asmjit that will have all of these tricks implemented and ready to use. Any wishes for the next post? I was thinking about some pre-SSE4.1 rounding tricks (float|double) - basically the same tricks I used in MathPresso.

10 February, 2017

PMINUW and PMAXUW Without SSE4.1

Pminuw and Pmaxuw

The SSE4.1 extension introduced a lot of instructions that, I would say, should have been part of the baseline SSE2. Some post-SSE2 instructions are hard to work around, like pshufb (SSSE3), and others just complete the unfinished SSE2 instruction set, like pminuw (packed minimum of uint16_t) and pmaxuw (packed maximum of uint16_t). I have seen various workarounds for these two, but since I'm working on a JIT I always look for the best possible solution that:

  • Doesn't need more than one temporary register
  • Doesn't need constants, if possible
  • Is as short as possible

Existing Solutions

Before I started thinking about the best possible implementation I checked libsimdpp, which contains implementations of many post-SSE2 instructions. Its min/max implementations can be found in the i_min.h and i_max.h files. What libsimdpp does is XOR the most significant bit (basically the sign bit) of both inputs so that the signed pminsw or pmaxsw instruction can be used. The problem with this approach is that it needs a constant (0x8000), two moves, three XORs, and one packed min or max. In other words, that is a lot of operations for just a packed minimum or maximum.

The machine code of that solution may look like:

; SSE2 compatible PMINUW|PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm1            ; Move xmm1 to temporary
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0
pxor xmm7, [constant_0x8000] ; Toggle the sign bit of xmm7 (temporary)
pminsw|pmaxsw xmm0, xmm7     ; Perform packed min|max
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0

Of course, if the second operand (xmm1) is not required after the operation, the temporary register (and the move) can be eliminated.

Is there a better way?

I have used a similar solution in the past, but I was never really happy with it. Today I tried to think harder about the problem and the instructions I could use, and I found the following approach - since SSE2 has PSUBUSW (packed subtract with unsigned saturation), I can use that instruction instead of the three XORs and then subtract its result from the original value to get the packed unsigned minimum. This trick of course only works for packed uint16_t operations, as x86 SIMD doesn't have instructions that perform packed saturated addition|subtraction of elements greater than 16 bits.

The machine code of this solution would look like:

; SSE2 compatible PMINUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psubusw xmm7, xmm1           ; Subtract with unsigned saturation
psubw xmm0, xmm7             ; Subtract (no saturation, would never overflow here)

Why does it work? If we compute a-b with unsigned saturation, we get either zero, which means that b is equal to or greater than a, or some non-zero value, which means that a is greater than b and the value is their (unsigned) difference. Based on that, we can subtract the result from the original a and get our unsigned minimum.

The machine code of the pmaxuw implementation is even simpler:

; SSE2 compatible PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
psubusw xmm0, xmm1           ; Subtract with unsigned saturation
paddw xmm0, xmm1             ; Add (no saturation, would never overflow here)

In this last case (uint16_t unsigned maximum) we don't need any temporary. The saturated difference left in xmm0 is enough to reconstruct the maximum once the content of xmm1 is added back.
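Translated to SSE2 intrinsics, both tricks might look like this (a sketch; function names are mine):

#include <emmintrin.h> // SSE2

static inline __m128i sse2_min_epu16(__m128i a, __m128i b) {
  // saturated(a - b) is zero where b >= a, so subtracting it from a yields min(a, b).
  return _mm_sub_epi16(a, _mm_subs_epu16(a, b));
}

static inline __m128i sse2_max_epu16(__m128i a, __m128i b) {
  // saturated(a - b) + b == max(a, b).
  return _mm_add_epi16(_mm_subs_epu16(a, b), b);
}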

Conclusion

There is always a way to do something better. I always value solutions that need fewer temporaries and no extra constants. Sometimes these requirements are impossible to meet, but sometimes instructions you might not think of at first can help. And... old CPUs will thank you for using this approach!

28 January, 2017

Reflect-Trick

Baseline

2D rendering APIs must define how a gradient or texture is rendered when the coordinates accessing it are out of bounds. In 2D this is usually called the repeat or spread mode, and here are the most used ones:

  • Pad - Coordinate is saturated [result = pad(x, width)]
  • Repeat - Coordinate is repeated [result = repeat(x, width)]
  • Reflect - Coordinate is reflected [result = reflect(x, width)]

Where result, x, and width are integers and result is always within the [0, width) range (note that width is exclusive). A baseline implementation of the pad, repeat, and reflect functions is below:

inline int pad(int x, int width) {
  if (x < 0)
    return 0;

  if (x >= width)
    return width - 1;

  return x;
}

inline int repeat(int x, int width) {
  // Repeat from 0 to width.
  x = x % width;
  if (x < 0)
    x += width;

  return x;
}

inline int reflect(int x, int width) {
  // Repeat from 0 to width * 2.
  x = x % (width * 2);
  if (x < 0)
    x += width * 2;

  // Reflect from 0 to width.
  if (x >= width)
    x = width * 2 - 1 - x;

  return x;
}

When Width is a Power of 2

In many cases width is a power of 2, especially when rendering gradients. In that case the expensive MOD operator (%) can be replaced by the AND operator (&), which makes the implementation shorter and generally faster:

inline int repeat(int x, int width) {
  return x & (width - 1);
}

inline int reflect(int x, int width) {
  // Repeat from 0 to width * 2.
  x = x & ((width * 2) - 1);

  // Reflect from 0 to width.
  if (x >= width)
    x = width * 2 - 1 - x;

  return x;
}

AND solves two problems: it implicitly handles the x < 0 case, because width is never negative (for example -2 & 255 == 254, which is the same as -2 + 256 in 2's complement arithmetic), and it replaces the very expensive MOD operator with a totally cheap AND operator, which does the same job in our case because of the power-of-2 constraint. However, the reflect function is still pretty long and could be simplified even more, especially if we want to vectorize it.

Reflect-Trick Suitable for SIMD

We know that width can be used to precalculate other values that are then used in its place. For example, if we always use width * 2 - 1 as a mask, it's better to calculate it once and reuse the precalculated value. For the following reflect-trick we precalculate exactly that, so the function takes a mask instead of width:

inline int precalc(int width) {
  return width * 2 - 1;
}

We also know that SIMD code prefers a min/max approach over an if/mask approach, so instead of using a condition, the reflection itself can be rewritten to use a single subtraction and a minimum:

// y is the precomputed value, equal to `width * 2 - 1`.
inline int reflect(int x, int y) {
  // Repeat from 0 to width * 2.
  x = x & y;

  // Reflect by using the reflect-trick.
  return min(x, y - x);
}

The following reasons summarize why this implementation is generally better for SIMD (and probably for scalar code too) - a vectorized sketch follows the list:

  • It uses only one constant (y), which can be kept in a register.
  • It doesn't use branching, which would otherwise have to be translated into a comparison and masking.
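A vectorized version is then a direct translation - a sketch that assumes SSE4.1 for _mm_min_epi32 (with plain SSE2, a 16-bit variant using _mm_min_epi16 would follow the same pattern):

#include <smmintrin.h> // SSE4.1

// Branch-less reflect of four 32-bit coordinates at once; y holds width * 2 - 1.
static inline __m128i reflect_epi32(__m128i x, __m128i y) {
  x = _mm_and_si128(x, y);                       // Repeat from 0 to width * 2.
  return _mm_min_epi32(x, _mm_sub_epi32(y, x));  // Reflect via min(x, y - x).
}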

Handling Both Repeat & Reflect at the Same Time

If we introduce another variable, it becomes possible to handle both the repeat and reflect modes inside a single function. This is extremely beneficial for 2D pipelines that allow different repeat modes for X and Y. We name our constants repeatValue and reflectValue, and precalculate them the following way:

inline int precalcRepeatValue(int width, bool isReflecting) {
  return isReflecting ? width * 2 - 1 : width - 1;
}

inline int precalcReflectValue(int width, bool isReflecting) {
  return width * 2 - 1;
}

And then use these values in a single function:

inline int repeatOrReflect(int x, int repeatValue, int reflectValue) {
  x = x & repeatValue;
  return min(x, reflectValue - x);
}

Based on the initialization, the repeatOrReflect() function will either repeat or reflect.
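To make the behavior concrete, here is a tiny worked example (the numbers are mine, with width == 256 chosen just for illustration):

#include <algorithm>
#include <cassert>

// Same function as above, with std::min spelled out so the example compiles standalone.
static int repeatOrReflect(int x, int repeatValue, int reflectValue) {
  x = x & repeatValue;
  return std::min(x, reflectValue - x);
}

int main() {
  // Reflect mode with width == 256: repeatValue = 511, reflectValue = 511.
  assert(repeatOrReflect(300, 511, 511) == 211); // 2 * 256 - 1 - 300
  // Repeat mode with width == 256: repeatValue = 255, reflectValue = 511.
  assert(repeatOrReflect(300, 255, 511) == 44);  // 300 & 255
  return 0;
}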

Reflect-Trick that uses SAR and XOR

On the x86 architecture the reflection can also be performed by using the SAR (arithmetic shift right) and XOR instructions. The idea is to shift the interval of the value to be reflected from [0..width*2) to [-width..width). If we do so, we can use the following:

inline int reflect(int x) {
  const int kShift = (sizeof(int) * 8 - 1);
  return (x >> kShift) ^ x;
}

In this case the function expects a value already in the correct range and will always output a value within [0..width). On x86 it would typically translate to 3 instructions:

mov idx, x   ; Copy x to idx
sar idx, 31  ; Copy sign bits across idx
xor idx, x   ; Reflect

Or to 2 instructions, if we can make sure that the input x is in the eax register and the output can go to the edx register:

cdq          ; Sign-extend eax to edx:eax
xor edx, eax ; Reflect
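And a quick sanity check of the mapping (values are mine, already shifted into the [-width..width) interval with width == 256):

#include <cassert>

// Same function as above, repeated so the example compiles standalone.
static int reflect(int x) {
  const int kShift = (sizeof(int) * 8 - 1);
  return (x >> kShift) ^ x;
}

int main() {
  assert(reflect(5)    == 5);   // Non-negative values pass through.
  assert(reflect(0)    == 0);
  assert(reflect(-1)   == 0);   // Negative values are mirrored to -x - 1.
  assert(reflect(-256) == 255);
  return 0;
}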

Conclusion

These tricks can be used to improve the performance of gradient and texture rendering. In this post I focused mostly on repetition and reflection constrained to a power of 2, which is very useful for gradients, but less so for textures. If the repeatValue masking is removed from the repeatOrReflect function and the implementation instead handles the repeat while advancing the x and y coordinates, the same code can be used to repeat/reflect a value of any range, which is exactly how b2dpipe works.