09 November, 2017

C++11 and MSVC Gone Wrong

It's been a few weeks since I started switching all my projects to C++11. The reason is simple - it has been standardized for years, and all major compilers now claim complete (or almost complete) C++11 support. I thought it was time to switch to C++11 and to update all my projects. Well... it was a good decision, but I found a severe bug in both the VS2015 and VS2017 compilers that made the transition a nightmare...

Bugs happen and there are usually workarounds; however, this time I found one only for VS2017, which means that all my projects will require at least VS2017 to work (GCC and Clang are unaffected). The tricky part of the issue is that VS compiles the code without any errors and it only fails at runtime. This means that you have to run the code to actually discover the problem. I first discovered the bug by running the AsmJit test suite, where some really basic tests surprisingly started failing.

In this post I finally decided to dig into the issue a bit more and create a repro that demonstrates it.

The Issue

MSVC doesn't like constant expressions in a struct that contains a union. Instead of initializing the first member of the union with the given values, it zero-initializes it. What I found most confusing is that even if you guard your code with a static_assert to verify the initialized value, the assertion passes! Here is the code:

struct Data {
  struct A { int x, y; };
  struct B { int x, y; };
  union {
    A a;
    B b;
  };
};

class Object : public Data {
public:
  constexpr Object() noexcept 
    : Data {{ 0, 0 }} {}
  constexpr Object(int x, int y) noexcept
    : Data {{ x, y }} {}
};

int func() {
  constexpr Object obj(1, 2);
  static_assert(obj.a.x == 1);
  return obj.a.x;
}

I tested all compilers (GCC, Clang, Intel, MSVC) and all of them compile the code just fine. The function should return '1', but MSVC compiles it to return '0', even though the static_assert(obj.a.x == 1) passes:

; GCC/Clang/Intel output
func():
  mov eax, 1
  ret

; MSVC output
func PROC
  xor eax, eax
  ret 0
func ENDP

Sneaky, right?

VS2017 Workaround

I was able to trick the compiler into compiling the code correctly by adding an array as the first member of the union:

struct Data {
  struct A { int x, y; };
  struct B { int x, y; };
  union {
    int data[2];
    A a;
    B b;
  };
};

class Object : public Data {
public:
  constexpr Object() noexcept 
    : Data {{{ 0, 0 }}} {}
  constexpr Object(int x, int y) noexcept
    : Data {{{ x, y }}} {}
};

int func() {
  constexpr Object obj(1, 2);
  static_assert(obj.data[0] == 1);
  return obj.a.x;
}

All compilers except VS2015 compile this code correctly:

; GCC/Clang/Intel output
func():
  mov eax, 1
  ret

; MSVC output
func PROC
  mov eax, 1
  ret 0
func ENDP

Conclusion

I found that this bug had already been reported here, but it seems it's not interesting to the VS team, so I'm not going to invest any more time into it.

You can use Compiler Explorer to test it yourself. If you know how to work around this in VS2015, please leave a note here; I'm interested unless it requires a complete code rewrite.

UPDATE

It seems the issue has been resolved; however, the fix will not be available as an update to the broken VS2015/VS2017 releases.

07 October, 2017

AsmJit - Register Allocator Progress

Introduction

This post is dedicated to the new register allocator (RA) in the AsmJit library. AsmJit has provided register allocation since the beginning, as it makes writing JIT code fast and, especially, makes it easy to connect various asm components together without significant effort. However, register allocation is a complicated and challenging subject, and the previous AsmJit version simply used a basic linear scan with some ad-hoc tweaks.

New RA summary:

  • Control Flow Graph (CFG), CFG builder, and TiedRegs.
  • Liveness analysis at block level (IN, OUT, GEN, KILL, per-block statistics).
  • Liveness analysis at instruction level (live-ranges and per-instruction statistics).
  • Global allocator (bin packing).
  • Local allocation (instruction-by-instruction).
  • Rewriter.

Another thing that really improved is logging. Each RA phase can now log a lot of details about the work it does, which is really useful for tuning and bug fixing. All components log through AsmJit's Logger interface, which can be turned on/off at runtime. This means that shipped code that has logging turned off by default can still provide an option to enable it at runtime (based on a command-line switch, for example).
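
For illustration, here is roughly how such runtime-switchable logging can be wired up. Treat this as a sketch - the verboseEnabled flag is hypothetical, and the exact names (FileLogger, CodeHolder::setLogger) may differ between AsmJit versions:

#include <cstdio>
#include <asmjit/asmjit.h>

void initCode(asmjit::CodeHolder& code, asmjit::JitRuntime& rt, bool verboseEnabled) {
  code.init(rt.getCodeInfo());

  // The logger can exist for the whole lifetime of the process and is only
  // attached when requested, e.g. by a --verbose command-line switch.
  static asmjit::FileLogger logger(stdout);
  if (verboseEnabled)
    code.setLogger(&logger);
}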

Sample code

All parts of this post will use a sample 'AlphaBlend' function used by AsmJit tests:

void generateAlphaBlend(asmjit::X86Compiler& cc) {
  using namespace asmjit;

  X86Gp dst = cc.newIntPtr("dst");
  X86Gp src = cc.newIntPtr("src");

  X86Gp i = cc.newIntPtr("i");
  X86Gp j = cc.newIntPtr("j");
  X86Gp t = cc.newIntPtr("t");

  X86Xmm vzero = cc.newXmm("vzero");
  X86Xmm v0080 = cc.newXmm("v0080");
  X86Xmm v0101 = cc.newXmm("v0101");

  Label L_SmallLoop = cc.newLabel();
  Label L_SmallEnd  = cc.newLabel();
  Label L_LargeLoop = cc.newLabel();
  Label L_LargeEnd  = cc.newLabel();
  Label L_DataPool  = cc.newLabel();

  cc.addFunc(FuncSignature3<void, void*, const void*, size_t>(cc.getCodeInfo().getCdeclCallConv()));

  cc.setArg(0, dst);
  cc.setArg(1, src);
  cc.setArg(2, i);

  // How many pixels have to be processed to make the loop aligned.
  cc.lea(t, x86::ptr(L_DataPool));
  cc.xorps(vzero, vzero);
  cc.movaps(v0080, x86::ptr(t, 0));
  cc.movaps(v0101, x86::ptr(t, 16));

  cc.xor_(j, j);
  cc.sub(j, dst);
  cc.and_(j, 15);
  cc.shr(j, 2);
  cc.jz(L_SmallEnd);

  cc.cmp(j, i);
  cc.cmovg(j, i); // j = min(i, j).
  cc.sub(i, j);   // i -= j.

  // Small loop.
  cc.bind(L_SmallLoop);
  {
    X86Xmm x0 = cc.newXmm("x0");
    X86Xmm y0 = cc.newXmm("y0");
    X86Xmm a0 = cc.newXmm("a0");

    cc.movd(y0, x86::ptr(src));
    cc.movd(x0, x86::ptr(dst));

    cc.pcmpeqb(a0, a0);
    cc.pxor(a0, y0);
    cc.psrlw(a0, 8);
    cc.punpcklbw(x0, vzero);

    cc.pshuflw(a0, a0, x86::shufImm(1, 1, 1, 1));
    cc.punpcklbw(y0, vzero);

    cc.pmullw(x0, a0);
    cc.paddsw(x0, v0080);
    cc.pmulhuw(x0, v0101);

    cc.paddw(x0, y0);
    cc.packuswb(x0, x0);

    cc.movd(x86::ptr(dst), x0);

    cc.add(dst, 4);
    cc.add(src, 4);

    cc.dec(j);
    cc.jnz(L_SmallLoop);
  }

  // Second section, prepare for an aligned loop.
  cc.bind(L_SmallEnd);

  cc.test(i, i);
  cc.mov(j, i);
  cc.jz(cc.getFunc()->getExitLabel());

  cc.and_(j, 3);
  cc.shr(i, 2);
  cc.jz(L_LargeEnd);

  // Aligned loop.
  cc.bind(L_LargeLoop);
  {
    X86Xmm x0 = cc.newXmm("x0");
    X86Xmm x1 = cc.newXmm("x1");
    X86Xmm y0 = cc.newXmm("y0");
    X86Xmm a0 = cc.newXmm("a0");
    X86Xmm a1 = cc.newXmm("a1");

    cc.movups(y0, x86::ptr(src));
    cc.movaps(x0, x86::ptr(dst));

    cc.pcmpeqb(a0, a0);
    cc.xorps(a0, y0);
    cc.movaps(x1, x0);

    cc.psrlw(a0, 8);
    cc.punpcklbw(x0, vzero);

    cc.movaps(a1, a0);
    cc.punpcklwd(a0, a0);

    cc.punpckhbw(x1, vzero);
    cc.punpckhwd(a1, a1);

    cc.pshufd(a0, a0, x86::shufImm(3, 3, 1, 1));
    cc.pshufd(a1, a1, x86::shufImm(3, 3, 1, 1));

    cc.pmullw(x0, a0);
    cc.pmullw(x1, a1);

    cc.paddsw(x0, v0080);
    cc.paddsw(x1, v0080);

    cc.pmulhuw(x0, v0101);
    cc.pmulhuw(x1, v0101);

    cc.add(src, 16);
    cc.packuswb(x0, x1);

    cc.paddw(x0, y0);
    cc.movaps(x86::ptr(dst), x0);

    cc.add(dst, 16);

    cc.dec(i);
    cc.jnz(L_LargeLoop);
  }

  cc.bind(L_LargeEnd);
  cc.test(j, j);
  cc.jnz(L_SmallLoop);
  cc.endFunc();

  // Data.
  cc.align(kAlignData, 16);
  cc.bind(L_DataPool);
  cc.dxmm(Data128::fromI16(0x0080));
  cc.dxmm(Data128::fromI16(0x0101));
}

Control Flow Graph (CFG)

A CFG is a directed graph of basic blocks (BBs). Each BB has a list of instructions, predecessors, and successors, in addition to other metadata like statistics. The previous version of AsmJit didn't use a CFG; instead it used a list of jumps and their targets generated during the analysis phase (called 'fetch' in the previous AsmJit).

A CFG makes the code much clearer, and it's something you will see in all compiler-related literature and implementations (all register allocators I checked out recently use a CFG).
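
To give an idea of the shape of the data, here is a minimal sketch of what a basic block might carry. These are illustrative types only - InstNode and BitVector are placeholders, not AsmJit's real structures:

#include <vector>

struct InstNode;  // an instruction node (placeholder)
struct BitVector { /* dense bit-set over virtual register ids; assume the usual set ops */ };

struct BasicBlock {
  std::vector<InstNode*>   insts;         // instructions in program order
  std::vector<BasicBlock*> predecessors;  // blocks that can jump here
  std::vector<BasicBlock*> successors;    // blocks this one can jump to
  BitVector liveIn, liveOut, gen, kill;   // filled later by liveness analysis
};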

The CFG of the sample function:

[RAPass::BuildCFG]
  [Func] L5: void(u64@rdi dst, u64@rsi src, u64@rdx i)
  {#0}
    lea t, [L4]
    xorps vzero, vzero
    movaps v0080, [t]
    movaps v0101, [t+16]
    xor j, j
    sub j, dst
    and j, 15
    shr j, 2
    jz L1
  {#1}
    cmp j, i
    cmovg j, i
    sub i, j
  {#2}
  L0:
    movd y0, [src]
    movd x0, [dst]
    pcmpeqb a0, a0
    pxor a0, y0
    psrlw a0, 8
    punpcklbw x0, vzero
    pshuflw a0, a0, 85{1|1|1|1}
    punpcklbw y0, vzero
    pmullw x0, a0
    paddsw x0, v0080
    pmulhuw x0, v0101
    paddw x0, y0
    packuswb x0, x0
    movd [dst], x0
    add dst, 4
    add src, 4
    dec j
    jnz L0
  {#3}
  L1:
    test i, i
    mov j, i
    jz L6
  {#4}
    and j, 3
    shr i, 2
    jz L3
  {#5}
  L2:
    movups y0, [src]
    movaps x0, [dst]
    pcmpeqb a0, a0
    xorps a0, y0
    movaps x1, x0
    psrlw a0, 8
    punpcklbw x0, vzero
    movaps a1, a0
    punpcklwd a0, a0
    punpckhbw x1, vzero
    punpckhwd a1, a1
    pshufd a0, a0, 245{1|1|3|3}
    pshufd a1, a1, 245{1|1|3|3}
    pmullw x0, a0
    pmullw x1, a1
    paddsw x0, v0080
    paddsw x1, v0080
    pmulhuw x0, v0101
    pmulhuw x1, v0101
    add src, 16
    packuswb x0, x1
    paddw x0, y0
    movaps [dst], x0
    add dst, 16
    dec i
    jnz L2
  {#6}
  L3:
    test j, j
    jnz L0
  {#7}
  L6:
    [FuncEnd]
[RAPass::BuildViews]
[RAPass::BuildDominators]
  IDom of #1 -> #0
  IDom of #2 -> #1
  IDom of #3 -> #0
  IDom of #4 -> #3
  IDom of #5 -> #4
  IDom of #6 -> #4
  IDom of #7 -> #3
  IDom of #2 -> #0
  Done (3 iterations)

You can see how much the logging improved. The logger shows the function signature and each basic block {#XXX}, and indents the code to make it readable. Part of CFG construction is also building CFG views (currently only a post-order view) and dominators.

AsmJit does one more step during CFG construction that is not shown above: building TiedRegs from operands and filling in their read/write information. AsmJit uses a data structure called RAInst, which contains an array of TiedRegs and some other data. TiedRegs solve the problem of one virtual register being used more than once in an instruction's operands. A single TiedReg can link to one or more operands of an instruction, and an instruction never has two TiedRegs that refer to the same virtual register.

Each TiedReg has the following information used during allocation and rewriting:

  • READ flag (R) - Read access.
  • WRITE flag (W) - Write access.
  • RW flag (X) - Combination of R and W flags.
  • USE slot - Register has a USE slot (either read or RW slot).
  • OUT slot - Register has an OUT slot (Write-only slot).

You may wonder why there are access flags as well as slots. Well, slots are used by the allocator in a very specific way. Since TiedRegs can contain data merged from multiple operands, it's important to keep track of what each TiedReg represents. Consider the following cases:


add x, y     ; x={R|W|Use}     y={R|Use}
lea x, [x+y] ; x={R|W|Use|Out} y={R|Use} 

Note the 'Out' in the second case, which tells us that 'x' occupies both USE and OUT slots. In other words, TiedReg('x') contains combined information from two operands, [R|Use] and [W|Out], which still preserves the fact that we have USE and OUT slots and that the register allocator can move 'x' into a different register by assigning to its OUT slot. And it would work even if 'x' were the only register used by the LEA:


lea x, [x+x] ; x={R|W|Use|Out}

As you can see, we still know that there are USE and OUT slots; we just lost the second USE slot, which doesn't really matter.

Liveness Analysis

Liveness analysis tells us which virtual registers are alive and where. It happens in two phases, as listed below (a generic sketch of the first phase follows the list):

  • 1. Block level - build LIVE_IN, LIVE_OUT, GEN, and KILL bit-vectors for each virtual register in each block.
  • 2. Instruction level - build live-ranges based on block-level liveness analysis and instructions of each block.
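
To make the block-level phase concrete, here is a generic sketch of the classic backward data-flow iteration, reusing the placeholder BasicBlock from the CFG section above (this is not AsmJit's actual code, and BitVector is assumed to provide the usual set operations):

// IN[b]  = GEN[b] | (OUT[b] & ~KILL[b])
// OUT[b] = union of IN[succ] over all successors of b
bool changed = true;
while (changed) {                      // iterate until a fixed point is reached
  changed = false;
  for (BasicBlock* b : postOrderView) {  // liveness flows backwards; post-order converges fast
    BitVector out = b->liveOut;
    for (BasicBlock* succ : b->successors)
      out |= succ->liveIn;
    BitVector in = b->gen | (out & ~b->kill);
    if (in != b->liveIn || out != b->liveOut) {
      b->liveIn  = in;
      b->liveOut = out;
      changed = true;
    }
  }
}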

I really tried to optimize the liveness analysis, as it has to collect a lot of information and iterate over the whole function. Here is the output of the 'BuildLiveness' phase:

[RAPass::BuildLiveness]
  LiveIn/Out Done (16 visits)
  {#0}
    IN   [dst, i, src]
    OUT  [vzero, v0080, v0101, j, dst, i, src]
    GEN  [t, j, dst]
    KILL [t, vzero, v0080, v0101, j]
  {#1}
    IN   [vzero, v0080, v0101, j, dst, i, src]
    OUT  [vzero, v0080, v0101, j, dst, i, src]
    GEN  [j, i]
  {#2}
    IN   [vzero, v0080, v0101, j, dst, i, src]
    OUT  [vzero, v0080, v0101, j, dst, i, src]
    GEN  [vzero, v0080, v0101, j, dst, y0, src, x0, a0]
    KILL [y0, x0, a0]
  {#3}
    IN   [vzero, v0080, v0101, dst, i, src]
    OUT  [vzero, v0080, v0101, j, dst, i, src]
    GEN  [i]
    KILL [j]
  {#4}
    IN   [vzero, v0080, v0101, j, dst, i, src]
    OUT  [vzero, v0080, v0101, j, dst, i, src]
    GEN  [j, i]
  {#5}
    IN   [vzero, v0080, v0101, j, dst, i, src]
    OUT  [vzero, v0080, v0101, j, dst, i, src]
    GEN  [vzero, v0080, v0101, dst, i, src, y0, x0, a0, x1, a1]
    KILL [y0, x0, a0, x1, a1]
  {#6}
    IN   [vzero, v0080, v0101, j, dst, i, src]
    OUT  [vzero, v0080, v0101, j, dst, i, src]
    GEN  [j]
  {#7}
  t     {id:0260 width: 6    freq: 0.5000}: [3:9]
  vzero {id:0261 width: 125  freq: 0.0400}: [5:130]
  v0080 {id:0262 width: 123  freq: 0.0325}: [7:130]
  v0101 {id:0263 width: 121  freq: 0.0331}: [9:130]
  j     {id:0259 width: 116  freq: 0.0948}: [11:62], [65:130]
  dst   {id:0256 width: 128  freq: 0.0547}: [2:130]
  i     {id:0258 width: 128  freq: 0.0547}: [2:130]
  y0    {id:0265 width: 22   freq: 0.1818}: [27:49]
  src   {id:0257 width: 128  freq: 0.0312}: [2:130]
  x0    {id:0264 width: 24   freq: 0.3333}: [29:53]
  a0    {id:0266 width: 12   freq: 0.4167}: [31:43]
  y0    {id:0269 width: 42   freq: 0.0714}: [75:117]
  x0    {id:0267 width: 42   freq: 0.2143}: [77:119]
  a0    {id:0270 width: 22   freq: 0.3182}: [79:101]
  x1    {id:0268 width: 32   freq: 0.1875}: [83:115]
  a1    {id:0271 width: 14   freq: 0.2857}: [89:103]

The output shows the liveness analysis of each block and the live ranges of each virtual register. To build live ranges, each instruction must have a position in the code (positions increment by 2 so that read and write operations can be distinguished), and based on that information it's very simple to calculate additional statistics like the number of occurrences (reads/writes), total width, and frequency (occurrences divided by width). For example, 't' above is used 3 times within a live range of width 6, which gives a frequency of 0.5. Frequency, in particular, is a very interesting metric that is used later in bin packing.

Global allocation - Bin Packing

Bin packing simply tries to pack the live ranges of multiple virtual registers into a single physical register. AsmJit first sorts virtual registers by weight (frequency) and then bin packs them from the highest to the lowest. Here is the output of the 'BinPack' phase:


[RAPass::BinPack] Available=15 (0x0000FFEF) Count=5
  00: [3:9@260], [11:62@259], [65:130@259]
  01: [2:130@256]
  02: [2:130@258]
  03: [2:130@257]
  Completed.
[RAPass::BinPack] Available=16 (0x0000FFFF) Count=11
  00: [31:43@266], [79:101@270]
  01: [29:53@264], [89:103@271]
  02: [27:49@265], [77:119@267]
  03: [83:115@268]
  04: [75:117@269]
  05: [5:130@261]
  06: [9:130@263]
  07: [7:130@262]
  Completed.

As we can see, the solution for the sample function is quite simple. In more complex cases, when bin packing fails, all registers that were not packed are marked 'StackPreferred', which means the local allocator will prefer to keep them on the stack. After bin packing is done, it's time for the local allocator.
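
The idea can be sketched as a simple first-fit loop. The following is illustrative code only - WorkReg, PhysReg, LiveRanges, and the overlap/merge helpers are my assumptions, not AsmJit's real data structures:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative types, not AsmJit's real ones.
struct LiveRanges {
  std::vector<std::pair<uint32_t, uint32_t>> spans; // sorted [start, end) positions
  bool overlaps(const LiveRanges& other) const;     // interval intersection test
  void merge(const LiveRanges& other);              // interval union
};

struct WorkReg { LiveRanges ranges; double freq; int homeId = -1; };
struct PhysReg { LiveRanges assigned; int id; };

void binPack(std::vector<WorkReg*>& workRegs, std::vector<PhysReg>& physRegs) {
  // Highest frequency first - hot registers get the best chance to be packed.
  std::sort(workRegs.begin(), workRegs.end(),
            [](const WorkReg* a, const WorkReg* b) { return a->freq > b->freq; });

  for (WorkReg* w : workRegs) {
    bool packed = false;
    for (PhysReg& p : physRegs) {
      // First fit: a physical register can host this virtual register only
      // if none of its already-assigned live ranges overlap the new ones.
      if (!p.assigned.overlaps(w->ranges)) {
        p.assigned.merge(w->ranges);
        w->homeId = p.id;
        packed = true;
        break;
      }
    }
    if (!packed)
      w->homeId = -1; // not packed - marked 'StackPreferred' in AsmJit's log
  }
}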

Local allocation - Instruction by Instruction

The local allocator is very similar to the previous one. It goes instruction by instruction and allocates registers based on the information produced by all previous phases. It also handles additional cases like loads/spills and situations where a fixed register is required.

The local allocator works on a per-block basis. To make sure the allocation is correct, it records the exact state of all registers at each block entry and emits a state switch when moving between blocks whose states differ. This design is still very similar to the previous register allocator except for one thing - in some cases the new allocator may slightly change the state of not-yet-allocated blocks, for example by setting the dirty flag of some register.
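
As an illustration only (made-up names, not AsmJit's API), a state switch conceptually boils down to something like this:

// Reconcile the current register assignment with the state recorded at the
// target block's entry. Real code must also handle spills/loads and cycles
// between registers (e.g. via swaps or a scratch register).
void switchState(const RAState& cur, const RAState& target, Emitter& emitter) {
  for (VirtReg* vReg : target.liveRegs()) {
    PhysId curId    = cur.physIdOf(vReg);
    PhysId targetId = target.physIdOf(vReg);
    if (curId != targetId)
      emitter.mov(targetId, curId); // move the value where the target expects it
  }
}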

Rewriter

The rewriter is executed after register allocation is complete. It traverses each instruction and rewrites virtual registers into physical registers based on the instruction's TiedRegs (each TiedReg stores the physical register of its USE and OUT slots).
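
Conceptually the pass boils down to a loop like this (a sketch with made-up names, not AsmJit's API):

// Each TiedReg remembers the physical IDs assigned to its USE and OUT slots,
// so rewriting is a simple lookup per operand.
for (InstNode* inst : instructions) {
  for (TiedReg& tied : tiedRegsOf(inst)) {
    for (OperandRef op : tied.useOperands(inst)) op.assignPhysId(tied.usePhysId());
    for (OperandRef op : tied.outOperands(inst)) op.assignPhysId(tied.outPhysId());
  }
}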

And finally, here is the rewritten code of our function:


[RAPass::Rewrite]
  L5:
    push rbx                        ; 53
    mov rcx, rdi                    ; 488BCF
    mov rbx, rsi                    ; 488BDE
    lea rax, [L4]                   ; 488D05........| lea t, [L4]                 | t{W|Out}
    xorps xmm5, xmm5                ; 0F57ED        | xorps vzero, vzero          | vzero{W|Out|Last}
    movaps xmm7, [rax]              ; 0F2838        | movaps v0080, [t]           | t{R|Use} v0080{W|Out|Last}
    movaps xmm6, [rax+16]           ; 0F287010      | movaps v0101, [t+16]        | t{R|Use|Last|Kill} v0101{W|Out|Last}
    xor rax, rax                    ; 4833C0        | xor j, j                    | j{W|Out}
    sub rax, rcx                    ; 482BC1        | sub j, dst                  | j{X|Use} dst{R|Use|Last}
    and rax, 15                     ; 83E00F        | and j, 15                   | j{X|Use}
    shr rax, 2                      ; 48C1E802      | shr j, 2                    | j{X|Use|Last}
    jz L1                           ; 0F84........  | jz L1
    cmp rax, rdx                    ; 483BC2        | cmp j, i                    | j{R|Use} i{R|Use}
    cmovg rax, rdx                  ; 480F4FC2      | cmovg j, i                  | j{X|Use} i{R|Use}
    sub rdx, rax                    ; 482BD0        | sub i, j                    | i{X|Use|Last} j{R|Use|Last}
  L0:                               ;               | L0:
    movd xmm2, [rbx]                ; 660F6E13      | movd y0, [src]              | src{R|Use} y0{W|Out}
    movd xmm1, [rcx]                ; 660F6E09      | movd x0, [dst]              | dst{R|Use} x0{W|Out}
    pcmpeqb xmm0, xmm0              ; 660F74C0      | pcmpeqb a0, a0              | a0{W|Out}
    pxor xmm0, xmm2                 ; 660FEFC2      | pxor a0, y0                 | a0{X|Use} y0{R|Use}
    psrlw xmm0, 8                   ; 660F71D008    | psrlw a0, 8                 | a0{X|Use}
    punpcklbw xmm1, xmm5            ; 660F60CD      | punpcklbw x0, vzero         | x0{X|Use} vzero{R|Use}
    pshuflw xmm0, xmm0, 85{1|1|1|1} ; F20F70C055    | pshuflw a0, a0, 85{1|1|1|1} | a0{X|Use|Out}
    punpcklbw xmm2, xmm5            ; 660F60D5      | punpcklbw y0, vzero         | y0{X|Use} vzero{R|Use|Last}
    pmullw xmm1, xmm0               ; 660FD5C8      | pmullw x0, a0               | x0{X|Use} a0{R|Use|Last|Kill}
    paddsw xmm1, xmm7               ; 660FEDCF      | paddsw x0, v0080            | x0{X|Use} v0080{R|Use|Last}
    pmulhuw xmm1, xmm6              ; 660FE4CE      | pmulhuw x0, v0101           | x0{X|Use} v0101{R|Use|Last}
    paddw xmm1, xmm2                ; 660FFDCA      | paddw x0, y0                | x0{X|Use} y0{R|Use|Last|Kill}
    packuswb xmm1, xmm1             ; 660F67C9      | packuswb x0, x0             | x0{X|Use}
    movd [rcx], xmm1                ; 660F7E09      | movd [dst], x0              | dst{R|Use} x0{R|Use|Last|Kill}
    add rcx, 4                      ; 4883C104      | add dst, 4                  | dst{X|Use|Last}
    add rbx, 4                      ; 4883C304      | add src, 4                  | src{X|Use|Last}
    dec rax                         ; 48FFC8        | dec j                       | j{X|Use|Last}
    short jnz L0                    ; 75B9          | jnz L0
  L1:                               ;               | L1:
    test rdx, rdx                   ; 4885D2        | test i, i                   | i{R|Use}
    mov rax, rdx                    ; 488BC2        | mov j, i                    | j{W|Out|Last} i{R|Use|Last}
    jz L6                           ; 0F84........  | jz L6
    and rax, 3                      ; 83E003        | and j, 3                    | j{X|Use|Last}
    shr rdx, 2                      ; 48C1EA02      | shr i, 2                    | i{X|Use|Last}
    jz L3                           ; 0F84........  | jz L3
  L2:                               ;               | L2:
    movups xmm4, [rbx]              ; 0F1023        | movups y0, [src]            | src{R|Use} y0{W|Out}
    movaps xmm2, [rcx]              ; 0F2811        | movaps x0, [dst]            | dst{R|Use} x0{W|Out}
    pcmpeqb xmm0, xmm0              ; 660F74C0      | pcmpeqb a0, a0              | a0{W|Out}
    xorps xmm0, xmm4                ; 0F57C4        | xorps a0, y0                | a0{X|Use} y0{R|Use}
    movaps xmm3, xmm2               ; 0F28DA        | movaps x1, x0               | x1{W|Out} x0{R|Use}
    psrlw xmm0, 8                   ; 660F71D008    | psrlw a0, 8                 | a0{X|Use}
    punpcklbw xmm2, xmm5            ; 660F60D5      | punpcklbw x0, vzero         | x0{X|Use} vzero{R|Use}
    movaps xmm1, xmm0               ; 0F28C8        | movaps a1, a0               | a1{W|Out} a0{R|Use}
    punpcklwd xmm0, xmm0            ; 660F61C0      | punpcklwd a0, a0            | a0{X|Use}
    punpckhbw xmm3, xmm5            ; 660F68DD      | punpckhbw x1, vzero         | x1{X|Use} vzero{R|Use|Last}
    punpckhwd xmm1, xmm1            ; 660F69C9      | punpckhwd a1, a1            | a1{X|Use}
    pshufd xmm0, xmm0, 245{1|1|3|3} ; 660F70C0F5    | pshufd a0, a0, 245{1|1|3|3} | a0{X|Use|Out}
    pshufd xmm1, xmm1, 245{1|1|3|3} ; 660F70C9F5    | pshufd a1, a1, 245{1|1|3|3} | a1{X|Use|Out}
    pmullw xmm2, xmm0               ; 660FD5D0      | pmullw x0, a0               | x0{X|Use} a0{R|Use|Last|Kill}
    pmullw xmm3, xmm1               ; 660FD5D9      | pmullw x1, a1               | x1{X|Use} a1{R|Use|Last|Kill}
    paddsw xmm2, xmm7               ; 660FEDD7      | paddsw x0, v0080            | x0{X|Use} v0080{R|Use}
    paddsw xmm3, xmm7               ; 660FEDDF      | paddsw x1, v0080            | x1{X|Use} v0080{R|Use|Last}
    pmulhuw xmm2, xmm6              ; 660FE4D6      | pmulhuw x0, v0101           | x0{X|Use} v0101{R|Use}
    pmulhuw xmm3, xmm6              ; 660FE4DE      | pmulhuw x1, v0101           | x1{X|Use} v0101{R|Use|Last}
    add rbx, 16                     ; 4883C310      | add src, 16                 | src{X|Use|Last}
    packuswb xmm2, xmm3             ; 660F67D3      | packuswb x0, x1             | x0{X|Use} x1{R|Use|Last|Kill}
    paddw xmm2, xmm4                ; 660FFDD4      | paddw x0, y0                | x0{X|Use} y0{R|Use|Last|Kill}
    movaps [rcx], xmm2              ; 0F2911        | movaps [dst], x0            | dst{R|Use} x0{R|Use|Last|Kill}
    add rcx, 16                     ; 4883C110      | add dst, 16                 | dst{X|Use|Last}
    dec rdx                         ; 48FFCA        | dec i                       | i{X|Use|Last}
    short jnz L2                    ; 759E          | jnz L2
  L3:                               ;               | L3:
    test rax, rax                   ; 4885C0        | test j, j                   | j{R|Use|Last}
    jnz L0                          ; 0F8535FFFFFF  | jnz L0
  L6:                               ;               | L6:
    pop rbx                         ; 5B
    ret                             ; C3
    .align 16
  L4:
    .data 80008000800080008000800080008000
    .data 01010101010101010101010101010101

Conclusion

I hope you enjoyed this summary of my part-time work on AsmJit over the past 9 months. The changes currently reside in the next-wip branch, which I would like to merge after I fix all remaining blockers. Today is the first day I managed to get Blend2D running with the new AsmJit, so I guess it will not take much longer now.

04 June, 2017

Idio[ma]tic Cmake

Welcome to Trial and Error Scripting

If I counted the time I've spent figuring out why CMake doesn't work as expected, it would probably add up to weeks. I don't think CMake itself is a bad tool; however, I think the CMake script language is the most idiotic language ever invented, seriously. The other problem is the documentation, as I have never solved anything by reading the CMake docs (reading them has probably just wasted more of my time).

Consider a very small example that describes an issue I was dealing with. We have some files that are always compiled as part of a target, and some files that require specific compiler flags, for example `-mavx2`, and some compile-time constants:

project(simple_app C CXX)          # CMake project.

set(SOURCE_FILES main.cpp)         # Source files that are always compiled.
set(CUSTOM_FLAGS -DCUSTOM_IMPL=1)  # Custom compiler flags we want to append to a specific file.

if (BUILD_CUSTOM_FILES)
  set(CUSTOM_FILES impl_avx2.cpp)
  set(AVX2_FLAGS ${CUSTOM_FLAGS} -DAVX2_AVAILABLE=1 -mavx2)
  set_property(SOURCE ${CUSTOM_FILES} APPEND PROPERTY COMPILE_FLAGS ${AVX2_FLAGS})

  # Add all arch-specific files to SOURCE_FILES...
  list(APPEND SOURCE_FILES ${CUSTOM_FILES})
endif()

add_executable(test_app ${SOURCE_FILES})

The problem is that it will not work, and you will have a hard time figuring out why. The compiler command CMake generates for impl_avx2.cpp would look like this:

/usr/bin/c++ -DCUSTOM_IMPL=1;-DAVX2_AVAILABLE=1;-mavx2 -o impl_avx2.o -c impl_avx2.cpp

Which is of course completely broken, as it contains semicolons instead of spaces. The reason is that the CMake language doesn't really support arrays; all arrays are strings whose elements are separated by semicolons. Actually, these two lines are the same:

set(SOMETHING A B)
set(SOMETHING "A;B")

And there is no way to distinguish between the two. To make it clearer what is happening, I wrote a simple test script:

function(my_func PREFIX FIRST)
  message("${PREFIX} FIRST=${FIRST}")
  SET(ARG_INDEX 0)
  foreach(ARG_VA ${ARGN})
    message("${PREFIX} #${ARG_INDEX} ${ARG_VA}")
    math(EXPR ARG_INDEX "${ARG_INDEX}+1")
  endforeach()
endfunction()

my_func("1:" arg)
my_func("2:" arg second)
my_func("3:" arg second third)
my_func("4:" arg "second;third")
my_func("5:" arg "second third")

Which outputs:

1: FIRST=arg
2: FIRST=arg
2: #0 second
3: FIRST=arg
3: #0 second
3: #1 third
4: FIRST=arg
4: #0 second
4: #1 third
5: FIRST=arg
5: #0 second third

Okay, so we know that CMake treats semicolons as separators. What we can do is simply foreach() over the flags and append them one by one, so let's modify the first example:

project(simple_app C CXX)

set(SOURCE_FILES main.cpp)
set(CUSTOM_FLAGS -DCUSTOM_IMPL=1)

if (BUILD_CUSTOM_FILES)
  set(CUSTOM_FILES impl_avx2.cpp)
  set(AVX2_FLAGS ${CUSTOM_FLAGS} -DAVX2_AVAILABLE=1 -mavx2)
  foreach(flag ${AVX2_FLAGS})
    set_property(SOURCE ${CUSTOM_FILES} APPEND PROPERTY COMPILE_FLAGS ${flag})
  endforeach()

  # Add all arch-specific files to SOURCE_FILES...
  list(APPEND SOURCE_FILES ${CUSTOM_FILES})
endif()

add_executable(test_app ${SOURCE_FILES})

Well, the output would be the same as before, just try it:

/usr/bin/c++ -DCUSTOM_IMPL=1;-DAVX2_AVAILABLE=1;-mavx2 -o impl_avx2.o -c impl_avx2.cpp

Would you expect this? CMake developers are actually aware of it, and to make things even more confusing, there are APPEND and APPEND_STRING options. APPEND just appends to the given property, making it a list, which is then stringified with semicolons, and we are back at the beginning. APPEND_STRING always appends to a raw string instead:

project(simple_app C CXX)

set(SOURCE_FILES main.cpp)
set(CUSTOM_FLAGS -DCUSTOM_IMPL=1)

if (BUILD_CUSTOM_FILES)
  set(CUSTOM_FILES impl_avx2.cpp)
  set(AVX2_FLAGS ${CUSTOM_FLAGS} -DAVX2_AVAILABLE=1 -mavx2)
  foreach(flag ${AVX2_FLAGS})
    set_property(SOURCE ${CUSTOM_FILES} APPEND_STRING PROPERTY COMPILE_FLAGS ${flag})
  endforeach()

  list(APPEND SOURCE_FILES ${CUSTOM_FILES})
endif()

add_executable(test_app ${SOURCE_FILES})

Which yields:

/usr/bin/c++ -DCUSTOM_IMPL=1-DAVX2_AVAILABLE=1-mavx2 -o impl_avx2.o -c impl_avx2.cpp

Cool, we got rid of the semicolons, but as a side effect there are no spaces between our flags. The problem is that CMake's COMPILE_FLAGS is in fact a string, not a list, so to append a flag properly we must prepend a space to it, which of course inserts a leading space if the property was empty:

project(simple_app C CXX)

set(SOURCE_FILES main.cpp)
set(CUSTOM_FLAGS -DCUSTOM_IMPL=1)

if (BUILD_CUSTOM_FILES)
  set(CUSTOM_FILES impl_avx2.cpp)
  set(AVX2_FLAGS ${CUSTOM_FLAGS} -DAVX2_AVAILABLE=1 -mavx2)
  foreach(flag ${AVX2_FLAGS})
    set_property(SOURCE ${CUSTOM_FILES} APPEND_STRING PROPERTY COMPILE_FLAGS " ${flag}")
  endforeach()

  list(APPEND SOURCE_FILES ${CUSTOM_FILES})
endif()

add_executable(test_app ${SOURCE_FILES})

Which more or less works:

/usr/bin/c++  -DCUSTOM_IMPL=1 -DAVX2_AVAILABLE=1 -mavx2 -o impl_avx2.o -c impl_avx2.cpp

Now let me ask you: would you have written the working version from the beginning? For me this was simply trial and error until I found a solution that worked, and I personally don't like this approach to solving problems.

Time to Migrate Away?

CMake should really switch to a sane language, otherwise I can't see myself using it in the future. I have already checked out Meson, as it was mentioned on several sites that I visit. Is it better? It probably is, but it's yet another tool that employs a home-grown language that you cannot really debug, and it forces you to write weird shell scripts as part of your project definition. I mean, why invent a language that cannot do the task and requires running a shell script just to list files in a directory?

I'm looking for a project generator that uses embedded JavaScript, or something really close to it, and can be debugged like a normal programming language. It would be similar to C/C++ syntactically and could be linted by existing tools. I don't see a reason to invent a new language for something like a project generator. It's kind of a paradox that all C/C++ project generators use languages that are not even close to C and require you to write 5 lines to implement a simple if/else construct.

25 May, 2017

AVX512 {er} and {sae} Quirks

Embedded-Rounding and Suppress-All-Exceptions

Today I finally implemented all the missing (and TODOed) AVX512 instruction encoding features in AsmJit's X86Assembler (including a parser update in AsmTK), and I found an interesting issue regarding the encoding of {sae} (suppress-all-exceptions) in EVEX-encoded instructions. The EVEX prefix uses many fields to describe the operation; the most important for this post are:

  • 'b' (broadcast bit) - if set and the instruction fetches from memory, it's a broadcast; otherwise it specifies the embedded-rounding {er} mode or the suppress-all-exceptions {sae} mode.
  • 'LL' (vector-length) - 2 bits that select between 128-bit, 256-bit, and 512-bit operation (extensible in the future up to 1024 bits).

The problem is that when embedded-rounding {er} is used, the LL field becomes the rounding mode instead of the vector length, which is then assumed to be 512 bits (LL equals 0b10). This is the main reason the Intel manual only allows {er}/{sae} in instructions that operate on either 512-bit vectors or scalars [as scalars don't use the LL field]. However, the Intel manual also says that if {sae} is used (which uses the same 'b' bit as {er} but doesn't use the LL field to specify a rounding mode), the LL field should still describe the vector length. But I checked what C++ compilers output in that case and found that GCC and Clang/LLVM change the LL field to zero when {sae} is used, while the Intel Compiler (ICC) doesn't. This is confusing, as the Intel manual doesn't say anything about ignoring the LL field when executing instructions that use {sae}.

I created an online playground that can be used to quickly check the output of various C++ compilers. The instruction that uses {sae} is vcmppd (shown as vcmpeqpd), and here is a short comparison:

; EVEX - EVEX prefix (4 bytes).
; Op   - Instruction's opcode.
; Mr   - ModR/M byte.
; Pr   - Comparison predicate.

; Instruction and operands       ; |__EVEX__|Op|Mr|Pr| Compiler and comment
  vcmpeqpd k0, zmm0, zmm1, {sae} ; |62F1FD58|C2|C1|00| ICC   (uses k0)
  vcmpeqpd k1, zmm0, zmm1, {sae} ; |62F1FD18|C2|C9|00| GCC   (uses k1, clears LL)
  vcmpeqpd k0, zmm0, zmm1, {sae} ; |62F1FD18|C2|C1|00| Clang (uses k0, clears LL)

Let's decompose the bytes into individual fields:


; Instruction:
;   vcmppd k {kz}, zmm, zmm/m512/b64, ub {sae}
;
; Encoding:
;   [RVMI-FV] EVEX.NDS.512.66.0F.W1 C2 /r ib
;          ____      ____        _             
; __EVEX__|RBBR00mm|Wvvvv1pp|zLLbVaaa| OpCode | ModR/M |CompPred| Compiler
; 01100010|11110001|11111101|01011000|11000010|11000001|00000000| ICC
; 01100010|11110001|11111101|00011000|11000010|11001001|00000000| GCC
; 01100010|11110001|11111101|00011000|11000010|11000001|00000000| Clang

Now the differences should be clear - basically the LL field and the Mod/R field describing the 'k' register index differ. For those who cannot read the encoding, here is a small how-to for reading the registers:


; Instruction:
;   vcmppd k {kz}, zmm, zmm/m512/b64, ub {sae}
;
; Encoding:
;   [RVMI-FV] EVEX.NDS.512.66.0F.W1 C2 /r ib
;
; RVMI specifies how registers are encoded, in order, we name them 'a', 'b', and 'c':
;
;   [R|V|M|I] (I == Immediate value)
;   [a|b|c|.]
;
;   'a' - R field in Mod/R (2 bits in EVEX/R'R and 3 bits in Mod/R)
;   'b' - V field in EVEX  (5 bits in EVEX/V'vvvv)
;   'c' - M field in Mod/M (2 bits in EVEX/B'B and 3 bits in Mod/M)
;
; Registers in 'vcmpeqpd a:k, b:zmm, c:zmm, {sae}':
;          ____      ____        _             
; ........|acca....|.bbbb...|.LL.b...|........|..aaaccc|........| Compiler
; 01100010|11110001|11111101|01011000|11000010|11000001|00000000| ICC
; 01100010|11110001|11111101|00011000|11000010|11001001|00000000| GCC
; 01100010|11110001|11111101|00011000|11000010|11000001|00000000| Clang
;
;     __
; a = 11000 -> 00000 -> k0
;     __
;     11001 -> 00001 -> k1
;     _____
; b = 11111 -> 00000 -> zmm0
;     __
; c = 11001 -> 00001 -> zmm1

You can also check out the EVEX prefix on Wikipedia.

What AsmJit Should Emit?

That's what I don't know! If 'LL' is really ignored, I would still keep it describing the vector length (that's what ICC does). I will keep an eye on it and try to test it on real hardware when I get the chance.

Using k0 in a write operation

This article should also help with understanding how k0 can be used. AVX512 restricts its use only in EVEX's 'aaa' field, which encodes the write-mask {k1-k7}. Zero disables write-masking completely, so {k0} is not encodable in the EVEX prefix; however, k0 can still be used by instructions that don't encode the k register in the 'aaa' field. This means that a register allocator can use k0 (with some limitations), and indeed ICC and Clang do.

03 March, 2017

C++ Compilers and Absurd Optimizations

Everything Started with a Bug

Yesterday I finally implemented some code paths that use the AVX and AVX2 extensions to accelerate matrix transformations in Blend2D. Of course the initial version didn't work, because I made one small mistake in vector shuffling, which resulted in rendering something different than expected. I'm usually done with this kind of bug very fast, but this time the mistake was in a shuffle predicate, and the bug was harder to find than usual. After 15 minutes of looking at the code I disassembled the whole function, because I started to suspect I was facing a compiler bug. And I'm happy I looked at the disassembly, as I found some ridiculous code generated by the C++ compiler that I simply couldn't understand.

Don't Trust C++ Compiler

As the title suggests, the C++ compiler failed me again by generating ridiculous code around relatively simple loops. Manually vectorized C++ code usually requires at least two loops (one vectorized and one scalar), so the implementation typically looks like the following:

void proc_v1(double* dst, const double* src, size_t length) {
  size_t i;

  // Main loop - process 4 units at a time (vectorized).
  for (i = length >> 2; i; i--) {
    ...

    dst += 4;
    src += 4;
  }

  // Tail loop - process 1 unit at a time (scalar, remaining input).
  for (i = length & 3; i; i--) {
    ...

    dst += 1;
    src += 1;
  }
}

This is pretty standard code used in many libraries. However, when you need to save registers and you write hand-optimized loops in assembly, you usually don't translate such a construct directly to machine code, because it will cost you 2 registers. Instead, the following trick can be used:

void proc_v1(double* dst, const double* src, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);

  // Main loop - process 4 units at a time (vectorized).
  while ((i -= 4) >= 0) {
    ...

    dst += 4;
    src += 4;
  }

  // Now it's less than zero, so normalize it to get the remaining count.
  i += 4;

  // Tail loop - process 1 unit at a time (scalar, remaining input).
  while (i) {
    ...

    dst += 1;
    src += 1;

    i--;
  }
}

This can be translated to asm 1:1 and uses only a single register. I personally don't know of any case where this approach would be slower than the previous one, and if the size of the item you are processing is greater than 1 it's always safe (I would say it's safe regardless of the size, as I'm not sure a 32-bit OS could even provide 2GB of contiguous memory to the user, but let's stay on the safe side).

Translated to x86 asm, the code would look like this:

; RDI - destination
; RSI - source
; RCX - counter

; Main Loop header.
sub rcx, 4    ; Check whether we can enter the vectorized loop.
js MainSkip   ; If negative, there are fewer than 4 items.

MainLoop:
...           ; Some work to do...
add rdi, 32   ; Advance destination by 4 * sizeof(double).
add rsi, 32   ; Advance source.
sub rcx, 4    ; Decrement loop counter.
jns MainLoop  ; Jump if not negative.

MainSkip:
add rcx, 4    ; Fix the loop counter.
jz TailSkip   ; Exit if zero.

TailLoop:
...           ; Some work to do...
add rdi, 8    ; Advance destination by 1 * sizeof(double).
add rsi, 8    ; Advance source.
sub rcx, 1    ; Decrement loop counter.
jnz TailLoop  ; Jump if not zero.

TailSkip:

The asm code is not just nice, it's also very compact and will generally perform well on any hardware.

Working Example

I cannot just use an empty loop body to demonstrate how badly C++ compilers understand this code, so I wrote a very simple function that does something:

#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

// Destination and source are points (pairs of x|y); the function only doubles
// them (adds each point to itself). Please note that the function processes
// `length * 2` doubles; you can look at it as if it were
// `struct Point { double x, y; }` cast to `double*`.
void transform(double* dst, const double* src, const double* matrix, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);

  while ((i -= 8) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src +  0);
    __m256d s1 = _mm256_loadu_pd(src +  4);
    __m256d s2 = _mm256_loadu_pd(src +  8);
    __m256d s3 = _mm256_loadu_pd(src + 12);

    _mm256_storeu_pd(dst +  0, _mm256_add_pd(s0, s0));
    _mm256_storeu_pd(dst +  4, _mm256_add_pd(s1, s1));
    _mm256_storeu_pd(dst +  8, _mm256_add_pd(s2, s2));
    _mm256_storeu_pd(dst + 12, _mm256_add_pd(s3, s3));
    
    dst += 16;
    src += 16;
  }
  i += 8;

  while ((i -= 2) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src);
    _mm256_storeu_pd(dst, _mm256_add_pd(s0, s0));
    
    dst += 4;
    src += 4;
  }
  
  if (i & 1) {
    __m128d s0 = _mm_loadu_pd(src);
    _mm_storeu_pd(dst, _mm_add_pd(s0, s0));
  }
}

The function performs only a very simple operation on its inputs, just enough to distinguish the loop bodies from everything else.

GCC (v7 -O2 -mavx -fno-exceptions -fno-tree-vectorize)

transform(double*, double const*, double const*, unsigned long):
        ; --- Ridiculous code ---
        mov     r9, rcx
        mov     r8, rcx
        sub     r9, 8
        js      .L6
        mov     rax, r9
        sub     rcx, 16
        mov     r8, r9
        and     rax, -8
        mov     rdx, rsi
        sub     rcx, rax
        mov     rax, rdi
        ; -----------------------
.L5:
        vmovupd xmm3, XMMWORD PTR [rdx]
        sub     r8, 8
        sub     rax, -128
        sub     rdx, -128
        vmovupd xmm2, XMMWORD PTR [rdx-96]
        vinsertf128     ymm3, ymm3, XMMWORD PTR [rdx-112], 0x1
        vmovupd xmm1, XMMWORD PTR [rdx-64]
        vinsertf128     ymm2, ymm2, XMMWORD PTR [rdx-80], 0x1
        vmovupd xmm0, XMMWORD PTR [rdx-32]
        vinsertf128     ymm1, ymm1, XMMWORD PTR [rdx-48], 0x1
        vaddpd  ymm3, ymm3, ymm3
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rdx-16], 0x1
        vaddpd  ymm2, ymm2, ymm2
        vaddpd  ymm1, ymm1, ymm1
        vmovups XMMWORD PTR [rax-128], xmm3
        vaddpd  ymm0, ymm0, ymm0
        vextractf128    XMMWORD PTR [rax-112], ymm3, 0x1
        vmovups XMMWORD PTR [rax-96], xmm2
        vextractf128    XMMWORD PTR [rax-80], ymm2, 0x1
        vmovups XMMWORD PTR [rax-64], xmm1
        vextractf128    XMMWORD PTR [rax-48], ymm1, 0x1
        vmovups XMMWORD PTR [rax-32], xmm0
        vextractf128    XMMWORD PTR [rax-16], ymm0, 0x1

        ; --- Instead of using sub/jns it uses sub/cmp/jne ---
        cmp     r8, rcx
        jne     .L5
        ; ----------------------------------------------------

        ; --- Ridiculous code ---
        mov     rax, r9
        shr     rax, 3
        lea     rdx, [rax+1]
        neg     rax
        lea     r8, [r9+rax*8]
        sal     rdx, 7
        add     rdi, rdx
        add     rsi, rdx
        ; -----------------------

.L6:
        ; --- Ridiculous code ---
        mov     rdx, r8
        sub     rdx, 2
        js      .L4
        shr     rdx
        xor     eax, eax
        lea     rcx, [rdx+1]
        sal     rcx, 5
        ; -----------------------

.L7:
        vmovupd xmm0, XMMWORD PTR [rsi+rax]
        vinsertf128     ymm0, ymm0, XMMWORD PTR [rsi+16+rax], 0x1
        vaddpd  ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+rax], xmm0
        vextractf128    XMMWORD PTR [rdi+16+rax], ymm0, 0x1

        ; --- Instead of using sub/jns it uses add/cmp/jne ---
        add     rax, 32
        cmp     rax, rcx
        jne     .L7
        ; ----------------------------------------------------

        ; --- Ridiculous code ---
        neg     rdx
        add     rdi, rax
        add     rsi, rax
        lea     rdx, [r8-4+rdx*2]
        ; -----------------------
.L4:
        and     edx, 1
        je      .L14
        vmovupd xmm0, XMMWORD PTR [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovups XMMWORD PTR [rdi], xmm0
.L14:
        vzeroupper
        ret

No comment - GCC magically transformed 9 initial instructions into 37, making the code larger and slower!

Clang (3.9.1 -O2 -mavx -fno-exceptions -fno-tree-vectorize)

transform(double*, double const*, double const*, unsigned long):
        ; --- Ridiculous code ---
        mov     r10, rcx
        add     r10, -8
        js      .LBB0_7
        mov     r11, r10
        shr     r11, 3
        mov     r8, r11
        shl     r8, 4
        lea     r9d, [r11 + 1]
        and     r9d, 1
        mov     rdx, rdi
        mov     rax, rsi
        test    r11, r11
        je      .LBB0_4
        lea     rcx, [r9 - 1]
        sub     rcx, r11
        mov     rdx, rdi
        mov     rax, rsi
        ; -----------------------

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rax]
        vmovupd ymm1, ymmword ptr [rax + 32]
        vmovupd ymm2, ymmword ptr [rax + 64]
        vmovupd ymm3, ymmword ptr [rax + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 96], ymm0
        vmovupd ymm0, ymmword ptr [rax + 128]
        vmovupd ymm1, ymmword ptr [rax + 160]
        vmovupd ymm2, ymmword ptr [rax + 192]
        vmovupd ymm3, ymmword ptr [rax + 224]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx + 128], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 160], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 192], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 224], ymm0
        add     rdx, 256
        add     rax, 256

        ; --- Instead of using sub/jns it uses add/jne ---
        add     rcx, 2
        jne     .LBB0_3
        ; ------------------------------------------------

        ; --- CLANG Unrolled the tail loop - OMG ---
.LBB0_4:
        shl     r11, 3
        lea     rcx, [r8 + 16]
        test    r9, r9
        je      .LBB0_6
        vmovupd ymm0, ymmword ptr [rax]
        vmovupd ymm1, ymmword ptr [rax + 32]
        vmovupd ymm2, ymmword ptr [rax + 64]
        vmovupd ymm3, ymmword ptr [rax + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rdx + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rdx + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rdx + 96], ymm0
.LBB0_6:
        lea     rdi, [rdi + 8*r8 + 128]
        sub     r10, r11
        lea     rsi, [rsi + 8*rcx]
        mov     rcx, r10
.LBB0_7:
        mov     r10, rcx
        add     r10, -2
        js      .LBB0_15
        mov     r8, r10
        shr     r8
        lea     r9d, [r8 + 1]
        and     r9d, 3
        mov     rax, rdi
        mov     rdx, rsi
        cmp     r10, 6
        jb      .LBB0_11
        lea     r10, [r9 - 1]
        sub     r10, r8
        mov     rax, rdi
        mov     rdx, rsi
.LBB0_10:                               # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rdx]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 32]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 32], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 64]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 64], ymm0
        vmovupd ymm0, ymmword ptr [rdx + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax + 96], ymm0
        sub     rax, -128
        sub     rdx, -128
        add     r10, 4
        jne     .LBB0_10
.LBB0_11:
        lea     r10, [4*r8 + 4]
        lea     r11, [4*r8]
        add     r8, r8
        test    r9, r9
        je      .LBB0_14
        neg     r9
.LBB0_13:                               # =>This Inner Loop Header: Depth=1
        vmovupd ymm0, ymmword ptr [rdx]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        add     rax, 32
        add     rdx, 32
        inc     r9
        jne     .LBB0_13
.LBB0_14:
        lea     rsi, [rsi + 8*r11 + 32]
        lea     rdi, [rdi + 8*r10]
        add     rcx, -4
        sub     rcx, r8
        mov     r10, rcx
        ; --- End of the unrolled loop ---

.LBB0_15:
        test    r10b, 1
        je      .LBB0_17
        vmovupd xmm0, xmmword ptr [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [rdi], xmm0
.LBB0_17:
        vzeroupper
        ret

This was even worse than I would ever have expected - Clang unrolled the tail loop and generated optimized code for a loop that can only run a few times (at most 3 iterations here, so the four-times-unrolled body can never even be entered). It completely misunderstood the code. Also, instead of following the C++ code, it inserted dozens of instructions to help with the unrolling; I won't even count them, because the code is so ridiculous.

MSVC (v19 /O2 /arch:AVX)

transform, COMDAT PROC
        ; --- Ridiculous code ---
        add      r9, -8
        js       SHORT $LN3@transform
        lea      r8, QWORD PTR [r9+8]
        shr      r8, 3
        mov      rax, r8
        neg      rax
        lea      r9, QWORD PTR [r9+rax*8]
        npad     8
        ; -----------------------

$LL2@transform:
        vmovupd ymm0, YMMWORD PTR [rdx]
        vmovupd ymm1, YMMWORD PTR [rdx+32]
        vmovupd ymm4, YMMWORD PTR [rdx+64]
        vmovupd ymm5, YMMWORD PTR [rdx+96]
        vaddpd   ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rcx], ymm0
        vaddpd   ymm1, ymm1, ymm1
        vmovupd YMMWORD PTR [rcx+32], ymm1
        vaddpd   ymm0, ymm4, ymm4
        vmovupd YMMWORD PTR [rcx+64], ymm0
        vaddpd   ymm1, ymm5, ymm5
        vmovupd YMMWORD PTR [rcx+96], ymm1
        sub      rcx, -128                ; ffffffffffffff80H
        sub      rdx, -128                ; ffffffffffffff80H
        ; --- Instead of using sub/jns it uses sub/jne ---
        sub      r8, 1
        jne      SHORT $LL2@transform
        ; ------------------------------------------------

$LN3@transform:
        ; --- Ridiculous code ---
        add      r9, 6
        js       SHORT $LN5@transform
        lea      r8, QWORD PTR [r9+2]
        shr      r8, 1
        mov      rax, r8
        neg      rax
        lea      r9, QWORD PTR [r9+rax*2]
        npad     5
        ; -----------------------

$LL4@transform:
        vmovupd ymm0, YMMWORD PTR [rdx]
        vaddpd   ymm2, ymm0, ymm0
        vmovupd YMMWORD PTR [rcx], ymm2
        add      rcx, 32              ; 00000020H
        add      rdx, 32              ; 00000020H
        ; --- Instead of using sub/jns it uses sub/jne ---
        sub      r8, 1
        jne      SHORT $LL4@transform
        ; ------------------------------------------------
$LN5@transform:
        test     r9b, 1
        je       SHORT $LN6@transform
        vmovupd xmm0, XMMWORD PTR [rdx]
        vaddpd   xmm2, xmm0, xmm0
        vmovupd XMMWORD PTR [rcx], xmm2
$LN6@transform:
        vzeroupper
        ret      0

Actually not bad, still does things I didn't ask for, but it's much better than GCC and Clang in this case.

ICC (v17 -O2 -mavx -fno-exceptions)

transform(double*, double const*, double const*, unsigned long):
        add       rcx, -8                                       #11.11 FOLLOWS C++ code!
        js        ..B1.5        # Prob 2%                       #11.22
..B1.3:                         # Preds ..B1.1 ..B1.3
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #12.34
        vmovupd   xmm1, XMMWORD PTR [32+rsi]                    #13.34
        vmovupd   xmm2, XMMWORD PTR [64+rsi]                    #14.34
        vmovupd   xmm3, XMMWORD PTR [96+rsi]                    #15.34
        vinsertf128 ymm6, ymm1, XMMWORD PTR [48+rsi], 1         #13.34
        vinsertf128 ymm4, ymm0, XMMWORD PTR [16+rsi], 1         #12.34
        vinsertf128 ymm8, ymm2, XMMWORD PTR [80+rsi], 1         #14.34
        vinsertf128 ymm10, ymm3, XMMWORD PTR [112+rsi], 1       #15.34
        add       rsi, 128                                      #23.5
        vaddpd    ymm5, ymm4, ymm4                              #17.32
        vaddpd    ymm7, ymm6, ymm6                              #18.32
        vaddpd    ymm9, ymm8, ymm8                              #19.32
        vaddpd    ymm11, ymm10, ymm10                           #20.32
        vmovupd   XMMWORD PTR [rdi], xmm5                       #17.22
        vmovupd   XMMWORD PTR [32+rdi], xmm7                    #18.22
        vmovupd   XMMWORD PTR [64+rdi], xmm9                    #19.22
        vmovupd   XMMWORD PTR [96+rdi], xmm11                   #20.22
        vextractf128 XMMWORD PTR [16+rdi], ymm5, 1              #17.22
        vextractf128 XMMWORD PTR [48+rdi], ymm7, 1              #18.22
        vextractf128 XMMWORD PTR [80+rdi], ymm9, 1              #19.22
        vextractf128 XMMWORD PTR [112+rdi], ymm11, 1            #20.22
        add       rdi, 128                                      #22.5
        add       rcx, -8                                       #11.11
        jns       ..B1.3        # Prob 82%                      #11.22 FOLLOWS C++ code!
..B1.5:                         # Preds ..B1.3 ..B1.1
        add       rcx, 6                                        #25.3
        js        ..B1.9        # Prob 2%                       #27.22 FOLLOWS C++ code!
..B1.7:                         # Preds ..B1.5 ..B1.7
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #28.34
        vinsertf128 ymm1, ymm0, XMMWORD PTR [16+rsi], 1         #28.34
        add       rsi, 32                                       #32.5
        vaddpd    ymm2, ymm1, ymm1                              #29.27
        vmovupd   XMMWORD PTR [rdi], xmm2                       #29.22
        vextractf128 XMMWORD PTR [16+rdi], ymm2, 1              #29.22
        add       rdi, 32                                       #31.5
        add       rcx, -2                                       #27.11 FOLLOWS C++ code!
        jns       ..B1.7        # Prob 82%                      #27.22
..B1.9:                         # Preds ..B1.7 ..B1.5
        test      rcx, 1                                        #35.11
        je        ..B1.11       # Prob 60%                      #35.11
        vmovupd   xmm0, XMMWORD PTR [rsi]                       #36.31
        vaddpd    xmm1, xmm0, xmm0                              #37.24
        vmovupd   XMMWORD PTR [rdi], xmm1                       #37.19
..B1.11:                        # Preds ..B1.9 ..B1.10
        vzeroupper                                              #39.1
        ret                                                     #39.1

ICC is the clear winner in this case. I don't know why the other compilers cannot generate code like this; it's small and good.

Bonus GCC with -Os

transform(double*, double const*, double const*, unsigned long):
.L3:
        mov     rax, rcx
        sub     rax, 8
        js      .L2
        vmovupd ymm3, YMMWORD PTR [rsi]
        sub     rdi, -128
        sub     rsi, -128
        vmovupd ymm2, YMMWORD PTR [rsi-96]
        mov     rcx, rax
        vmovupd ymm1, YMMWORD PTR [rsi-64]
        vaddpd  ymm3, ymm3, ymm3
        vmovupd ymm0, YMMWORD PTR [rsi-32]
        vaddpd  ymm2, ymm2, ymm2
        vaddpd  ymm1, ymm1, ymm1
        vmovupd YMMWORD PTR [rdi-128], ymm3
        vaddpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdi-96], ymm2
        vmovupd YMMWORD PTR [rdi-64], ymm1
        vmovupd YMMWORD PTR [rdi-32], ymm0
        jmp     .L3 ; Well, I asked -Os, but here sub/jns won't hurt
.L2:
        sub     rcx, 2
        js      .L4
        vmovupd ymm0, YMMWORD PTR [rsi]
        add     rdi, 32
        add     rsi, 32
        vaddpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdi-32], ymm0
        jmp     .L2 ; Also here...
.L4:
        and     cl, 1
        je      .L1
        vmovupd xmm0, XMMWORD PTR [rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovups XMMWORD PTR [rdi], xmm0
.L1:
        ret

Much closer to the expected code; however, since I asked for the smallest possible code, it didn't generate conditional jumps at the ends of the loops, which is not good in general.

Bonus Clang with -Oz (should be even better than -Os)

transform(double*, double const*, double const*, unsigned long):                  # @transform(double*, double const*, double const*, unsigned long)
        push    7
        pop     rax
        sub     rax, rcx
        mov     rdx, rcx
        shr     rdx, 3
        xor     r10d, r10d
        test    rax, rax
        cmovle  r10, rdx
        mov     r9, r10
        shl     r9, 4
        lea     r8, [8*r10]
        shl     r10, 7
        add     r10, rsi
        mov     rdx, rcx
        mov     rax, rdi
        jmp     .LBB0_1
.LBB0_8:                                #   in Loop: Header=BB0_1 Depth=1
        vmovupd ymm0, ymmword ptr [rsi]
        vmovupd ymm1, ymmword ptr [rsi + 32]
        vmovupd ymm2, ymmword ptr [rsi + 64]
        vmovupd ymm3, ymmword ptr [rsi + 96]
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rax], ymm0
        vaddpd  ymm0, ymm1, ymm1
        vmovupd ymmword ptr [rax + 32], ymm0
        vaddpd  ymm0, ymm2, ymm2
        vmovupd ymmword ptr [rax + 64], ymm0
        vaddpd  ymm0, ymm3, ymm3
        vmovupd ymmword ptr [rax + 96], ymm0
        sub     rsi, -128
        sub     rax, -128
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rdx, -8
        jns     .LBB0_8
        sub     rcx, r8
        lea     r8, [rdi + 8*r9]
        push    1
        pop     rax
        sub     rax, rcx
        lea     rdx, [rcx + rcx]
        and     rdx, -4
        xor     esi, esi
        test    rax, rax
        cmovle  rsi, rdx
        mov     rax, rcx
        mov     rdi, r10
        mov     rdx, r8
        jmp     .LBB0_3
.LBB0_4:                                #   in Loop: Header=BB0_3 Depth=1
        vmovupd ymm0, ymmword ptr [rdi]
        add     rdi, 32
        vaddpd  ymm0, ymm0, ymm0
        vmovupd ymmword ptr [rdx], ymm0
        add     rdx, 32
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        add     rax, -2
        jns     .LBB0_4
        test    cl, 1
        je      .LBB0_7
        vmovupd xmm0, xmmword ptr [r10 + 8*rsi]
        vaddpd  xmm0, xmm0, xmm0
        vmovupd xmmword ptr [r8 + 8*rsi], xmm0
.LBB0_7:
        vzeroupper
        ret

No, no, no - clang just doesn't like compact code...:(

Conclusion

People saying that C++ compilers always generate better code than humans are simply wrong. I'm not saying that they always generate worse code, but it seems that the newer the compiler is, the more instructions it emits, and it's becoming scary. More arguments for Blend2D and its hand-optimized pipelines!

You can play with the code here.

Thanks for reading, and don't forget that this is not a hate-post. GCC bug #79830 was already reported (with a slightly different sample code) and clang bug #32132 as well. Let's hope this can be improved, as it actually affects the performance of non-JIT code in Blend2D.

Update

After reading some comments on other portals about this post I think people really misunderstood it and tried to put it into a different context. Please note that what is inside the loops is not important in this case. What is important is how various C++ compilers translate valid loop constructs into assembly, and the impact that has on the size of the generated code. So please, instead of justifying the bloated code I have shown, let's fix it - it won't hurt to have smaller code generated in this case. And I really think that if the commercial ICC compiler can generate perfect code for this case, GCC and clang should be able to do it too, because these are the compilers I use daily...

13 February, 2017

PABSB|W|D|Q Without SSSE3

Packed Absolute Value

SSSE3 introduced instructions for computing packed absolute values of 8-bit, 16-bit, and 32-bit integers. In this post I will show how to implement these in pure SSE2, and also how to implement the missing pabsq (packed absolute value of 64-bit integers), which doesn't appear until AVX512-F.

Straightforward Implementation

Let's start with a straightforward implementation in C first:

#include <cstdint>

inline uint32_t abs32(int32_t x) {
  return x >= 0 ? uint32_t(x) : uint32_t(-x);
}

Although it contains a branch, C++ compilers are usually able to recognize such code and create an optimized branch-less version of it. To think about a possible branch-less solution you must understand how negation works in 2's complement arithmetic. The code -x is equivalent to 0 - x, which is equivalent to ~x + 1. Now we know how to change the sign of an integer; however, absolute value changes the sign only if the input is negative. Since all negative numbers in 2's complement arithmetic have the most significant bit set to 1, we can use an arithmetic shift to get a mask (all zeros or all ones), which can then be used to conditionally negate all bits of the original value. The remaining addition of 1 can be turned into a subtraction of -1 (as -1 is represented as all ones in 2's complement arithmetic). Thus, we can rewrite the original code to (x ^ mask) - mask, which does nothing if mask is all zeros and negates the input if mask is all ones.

A branch-less implementation of the previous code would look like:

inline uint32_t abs32(int32_t x) {
  // NOTE: x >> y must be translated to an arithmetic shift here...
  uint32_t mask = uint32_t(x >> (sizeof(int32_t) * 8 - 1));
  return (uint32_t(x) ^ mask) - mask;
}

SSE2 Implementation

The C++ code can be directly translated to SSE2 for 16-bit and 32-bit integer sizes:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psraw  xmm7, 15              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubw  xmm0, xmm7            ; Add one if mask is all ones

; SSE2 compatible PABSD implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubd  xmm0, xmm7            ; Add one if mask is all ones

64-bit packed absolute value is trickier as there is no PSRAQ instruction in SSE2 (VPSRAQ was first introduced in AVX512-F); however, we can shuffle the input a bit and use PSRAD again:

; SSE2 compatible PABSQ implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
pshufd xmm7, xmm0, 0xF5      ; Like _MM_SHUFFLE(3, 3, 1, 1)
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubq  xmm0, xmm7            ; Add one if mask is all ones
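
The same sequence can also be written with SSE2 intrinsics. Here is a minimal sketch of mine (the function name is not from any library):

#include <emmintrin.h> // SSE2

static inline __m128i pabsq_sse2(__m128i x) {
  // Duplicate the high dword of each 64-bit lane, then spread its sign bit.
  __m128i mask = _mm_srai_epi32(
    _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 1, 1)), 31);
  // (x ^ mask) - mask, done with a 64-bit subtract.
  return _mm_sub_epi64(_mm_xor_si128(x, mask), mask);
}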

These were straightforward translations of the initial C++ code shown at the beginning of the post. However, there is a better way of implementing PABSW, and there is also a way of implementing PABSB without any shifts (there is no packed shift that operates on 8-bit entities). Since absolute value can also be written as max(x, -x), we can use packed min/max to implement both PABSB and PABSW:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (negated input)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubw  xmm7, xmm0            ; Negate all input values
pmaxsw xmm0, xmm7            ; Select all positive values

; SSE2 compatible PABSB implementation
;   xmm0 = in|out
;   xmm7 = temporary (negated input)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubb  xmm7, xmm0            ; Negate all input values
pminub xmm0, xmm7            ; Select all positive values

The PABSW implementation is straightforward and I have nothing to add. The PABSB implementation is more interesting, as it works around the missing PMAXSB instruction (introduced in SSE4.1) by using PMINUB instead. That works thanks to what we know about the two inputs: since they are negations of each other, selecting the minimum unsigned value is the same as selecting the maximum signed value.
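
For completeness, here is a minimal intrinsics sketch of both min/max-based sequences (the function names are mine):

#include <emmintrin.h> // SSE2

static inline __m128i pabsw_sse2(__m128i x) {
  __m128i neg = _mm_sub_epi16(_mm_setzero_si128(), x); // Negate all input values.
  return _mm_max_epi16(x, neg);                        // Select all positive values.
}

static inline __m128i pabsb_sse2(__m128i x) {
  __m128i neg = _mm_sub_epi8(_mm_setzero_si128(), x);  // Negate all input values.
  return _mm_min_epu8(x, neg);                         // Unsigned min == signed max here.
}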

Conclusion

Hope you enjoyed reading the post. I'm preparing a very small JIT code-generation library for asmjit that will have all of these tricks implemented and ready to use. Any wishes for the next post? I was thinking about some pre-SSE4.1 rounding tricks (float|double), basically the same tricks I used in MathPresso.

10 February, 2017

PMINUW and PMAXUW Without SSE4.1

PMINUW and PMAXUW

The SSE4.1 extension introduced a lot of instructions that, I would say, should have been part of the baseline SSE2. There are instructions that are hard to work around, like pshufb, and there are also instructions that just complete the unfinished SSE2 instruction set, like pminuw (packed minimum of uint16_t) and pmaxuw (packed maximum of uint16_t). I have seen various workarounds for implementing these two, but since I'm working with JIT I always look for the best possible solution - one that:

  • Doesn't need more than one temporary register
  • Doesn't need constants, if possible
  • Is as short as possible

Existing Solutions

Before I started thinking about the best possible implementation I checked libsimdpp, which contains implementations of many post-SSE2 instructions. The min/max implementation can be found in the i_min.h and i_max.h files. What libsimdpp does is XOR the most significant bit (basically the sign bit) of both inputs to prepare them for either the pminsw or pmaxsw instruction. The problem with this approach is that it needs a constant (0x8000), two moves, three XORs, and one packed min or max. In other words, that is a lot of operations for just a packed minimum or maximum.

The machine code of that solution may look like:

; SSE2 compatible PMINUW|PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm1            ; Move xmm1 to temporary
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0
pxor xmm7, [constant_0x8000] ; Toggle the sign bit of xmm7 (temporary)
pminsw|pmaxsw xmm0, xmm7     ; Perform packed min|max
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0

Of course if the second operand (xmm1) is not required after the operation the temporary variable (and move) could be eliminated.

Is there a better way?

I have used a similar solution in the past, but I was never really happy with it. Today I tried to think harder about the problem and the possible instructions I could use, and I found the following approach - since SSE2 has PSUBUSW (packed subtract with unsigned saturation) I can use that instruction instead of three XORs, and then subtract the result from the original value to get the packed unsigned minimum. This trick of course only works for packed uint16_t operations, as X86 SIMD doesn't provide saturated addition|subtraction of elements greater than 16 bits.

The machine code of this solution would look like:

; SSE2 compatible PMINUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psubusw xmm7, xmm1           ; Subtract with unsigned saturation
psubw xmm0, xmm7             ; Subtract (no saturation, would never overflow here)

Why does it work? If we perform a-b with unsigned saturation we get either zero, which means that b is equal to or greater than a, or some non-zero value, which means that a is greater than b and the value is their (unsigned) difference. We can then subtract that value from the original a and get our unsigned minimum.
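
A scalar C++ illustration of the same idea (just a sketch of mine, not library code):

#include <cstdint>

static inline uint16_t minu16(uint16_t a, uint16_t b) {
  uint16_t d = a > b ? uint16_t(a - b) : uint16_t(0); // Saturated a - b.
  return uint16_t(a - d);                             // Yields min(a, b).
}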

The machine code of pmaxuw implementation would be much simpler:

; SSE2 compatible PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
psubusw xmm0, xmm1           ; Subtract with unsigned saturation
paddw xmm0, xmm1             ; Add (no saturation, would never overflow here)

In this last case (uint16_t maximum) we don't need any temporary. The saturated difference left in xmm0 is enough to reconstruct the maximum value from the content of xmm1.
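
Both tricks map directly to SSE2 intrinsics as well; a minimal sketch (the function names are mine):

#include <emmintrin.h> // SSE2

static inline __m128i pminuw_sse2(__m128i a, __m128i b) {
  // a - saturated(a - b) == min(a, b) for unsigned 16-bit elements.
  return _mm_sub_epi16(a, _mm_subs_epu16(a, b));
}

static inline __m128i pmaxuw_sse2(__m128i a, __m128i b) {
  // saturated(a - b) + b == max(a, b) for unsigned 16-bit elements.
  return _mm_add_epi16(_mm_subs_epu16(a, b), b);
}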

Conclusion

There is always a way to do something better. I always value solutions that need fewer temporaries and don't need extra constants. Sometimes these requirements are impossible to meet, but sometimes instructions you may not immediately think of can help. And... old CPUs will thank you for using this approach!

28 January, 2017

Reflect-Trick

Baseline

2D rendering APIs must define how a gradient or texture should be rendered when the coordinates accessing it are out of bounds. In 2D this is usually called repeat or spread mode, and here are the most used ones:

  • Pad - Coordinate is saturated [result = pad(x, width)]
  • Repeat - Coordinate is repeated [result = repeat(x, width)]
  • Reflect - Coordinate is reflected [result = reflect(x, width)]

Here result, x, and width are integers, and result is always within the [0, width) range (note that width is exclusive). A baseline implementation of the pad, repeat, and reflect functions is below:

inline int pad(int x, int width) {
  if (x < 0)
    return 0;

  if (x >= width)
    return width - 1;

  return x;
}

inline int repeat(int x, int width) {
  // Repeat from 0 to width.
  x = x % width;
  if (x < 0)
    x += width;

  return x;
}

inline int reflect(int x, int width) {
  // Repeat from 0 to width * 2.
  x = x % (width * 2);
  if (x < 0)
    x += width * 2;

  // Reflect from 0 to width.
  if (x >= width)
    x = width * 2 - 1 - x;

  return x;
}

When Width is a Power of 2

In many cases width is a power of 2, especially when rendering gradients. In that case the expensive MOD operator (%) can be replaced by the AND operator (&), which makes the implementation shorter and generally faster:

inline int repeat(int x, int width) {
  return x & (width - 1);
}

inline int reflect(int x, int width) {
  // Repeat from 0 to width * 2.
  x = x & ((width * 2) - 1);

  // Reflect from 0 to width.
  if (x >= width)
    x = width * 2 - 1 - x;

  return x;
}

AND solves two problems - it implicitly handles the x < 0 condition, because width is never negative (for example -2 & 255 == 254, which is the same as -2 + 256 in 2's complement arithmetic), and it replaces the very expensive MOD operator with a totally cheap AND operator that does the same job in our case thanks to the power-of-2 constraint. However, the reflect function is still pretty long and can be simplified even more, especially if we want to vectorize it.

Reflect-Trick Suitable for SIMD

We know that width can be used to precalculate other values that are then used instead of it. For example, if we always use width * 2 - 1 as a mask, it's better to calculate it once and reuse the precalculated value. For the following reflect-trick we precalculate exactly that, so the function receives a mask instead of width:

inline int precalc(int width) {
  return width * 2 - 1;
}

We also know that SIMD code favors a min/max approach over an if/mask approach, so instead of using the condition, the reflection itself can be rewritten to use a single subtraction and a minimum:

// y is the precomputed value, equal to `width * 2 - 1`.
inline int reflect(int x, int y) {
  // Repeat from 0 to width * 2.
  x = x & y;

  // Reflect by using the reflect-trick.
  return min(x, y - x);
}

The following reasons summarize why this implementation is generally better for SIMD (and probably for scalar code too); a small intrinsics sketch follows the list:

  • It uses only one constant (y), which can be kept in a register.
  • It doesn't use branching, which would need to be translated to comparison and masking.
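
Here is that sketch. It needs SSE4.1 for _mm_min_epi32 (plain SSE2 has no packed minimum of int32); everything else is SSE2:

#include <smmintrin.h> // SSE4.1 (for _mm_min_epi32)

// x = four packed indices, y = broadcast `width * 2 - 1` (width is a power of 2).
static inline __m128i reflect4(__m128i x, __m128i y) {
  x = _mm_and_si128(x, y);                      // Repeat from 0 to width * 2.
  return _mm_min_epi32(x, _mm_sub_epi32(y, x)); // Reflect-trick: min(x, y - x).
}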

Handling Both Repeat & Reflect at the Same Time

If we introduce another variable, it is possible to handle both repeat and reflect modes inside a single function. This is extremely beneficial for 2D pipelines that allow different repeat modes for X and Y. We name our constants repeatValue and reflectValue, and precalculate them the following way:

inline int precalcRepeatValue(int width, bool isReflecting) {
  return isReflecting ? width * 2 - 1 : width - 1;
}

inline int precalcReflectValue(int width, bool isReflecting) {
  return width * 2 - 1;
}

And then use these values in a single function:

inline int repeatOrReflect(int x, int repeatValue, int reflectValue) {
  x = x & repeatValue;
  return min(x, reflectValue - x);
}

Based on the initialization, the repeatOrReflect() function will either repeat or reflect.
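
A quick worked example with width = 256 (the numbers are mine):

// Repeat mode : repeatValue = 255, reflectValue = 511.
// Reflect mode: repeatValue = 511, reflectValue = 511.
repeatOrReflect(300, 255, 511); // == 44  (300 & 255 == 44;  min(44, 467)  == 44)
repeatOrReflect(300, 511, 511); // == 211 (300 & 511 == 300; min(300, 211) == 211)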

Reflect-Trick that uses SAR and XOR

On the X86 architecture the reflection can also be performed by using the SAR (arithmetic shift right) and XOR instructions. The idea is to shift the interval of the value to be reflected from [0..width*2) to [-width..width). If we do so we can use the following:

inline int reflect(int x) {
  const int kShift = (sizeof(int) * 8 - 1);
  return (x >> kShift) ^ x;
}

The function in this case expects a value in the correct range and will always output a value within [0..width). On X86 the function typically translates to 3 instructions:

mov idx, x   ; Copy x to idx
sar idx, 31  ; Copy sign bits across idx
xor idx, x   ; Reflect

Or 2 instructions if we can make sure that the input x is in the eax register and the output goes to the edx register:

cdq          ; Sign-extend eax to edx:eax
xor edx, eax ; Reflect
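
Putting the pieces together, here is a branch-less reflect for power-of-2 widths that combines the range shift with the SAR/XOR step. This is my own sketch (not the b2dpipe code) and, like before, it relies on signed >> being translated to an arithmetic shift:

inline int reflectSarXor(int x, int width) {
  // Shift the repeated interval from [0..width*2) to [-width..width).
  int v = ((x + width) & (width * 2 - 1)) - width;
  // Non-negative values pass through, negative values get mirrored (~v).
  return (v >> (sizeof(int) * 8 - 1)) ^ v;
}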

Conclusion

These tricks can be used to improve the performance of gradient and texture rendering. In this post I focused mostly on repeating and reflection constrained by a power of 2, which is very useful for gradients, but not that useful for textures. If repeatValue is removed from the repeatOrReflect function and the implementation handles the repeat while advancing the x and y coordinates, the same code can be used to repeat/reflect a value of any range, which is exactly how b2dpipe works.

22 November, 2016

Using Raspberry for ARM Testing

Introduction

Developing for x86/x64 architectures is relatively easy as it's the standard architecture used by today's desktops; it's powerful and there are plenty of options when it comes to dev-tools and IDEs. When it comes to other architectures (I mean ARM here) the situation changes dramatically. These devices are usually not that powerful (both CPU power and memory become limited resources) and some devices running OSes like Android don't even provide the tools that we developers are used to. I personally cannot code properly on a laptop, so devices like tablets or phones are completely out of the question. Using simulators is also not an option as I need to experiment with JIT code generation.

After doing some experiments with my RaspberryPi (Model B2) I decided to try to improve the comfort of using this kind of device for testing. The model I have has 1GB of RAM and a quad-core ARM processor. Using that device to run an IDE would suck, but accessing it remotely and using it only for compilation and unit testing works pretty well.

Preparation

I have done the following steps to prepare the Raspberry device itself:

  1. Download the Raspbian Lite distribution
  2. Unpack the image to get the *.img file, for example as raspbian.img
  3. Plug your SD card into your computer and check which device represents it; I will call it /dev/sdx here
  4. Use the dd tool (Linux) to copy the content of the *.img file to the SD card, something like dd bs=4M if=./raspbian.img of=/dev/sdx (the device doesn't specify a partition id)
  5. Mount the /dev/sdx2 partition somewhere so you can access it, like /mnt/rpi
  6. Enable the SSH daemon by default: go to /mnt/rpi/etc/rc2.d and rename Kxxsshd to S02sshd
  7. Unmount /dev/sdx2
  8. Sync drives by using the sync command (just in case; should not be needed)
  9. Unplug the SD card and put it into your Raspberry

Now the device should be ready; the only thing left is to turn it on and connect it to your network. I use LAN and DHCP here, so that's it. You can check your DHCP server's lease table to get the IP address of the device, or ping 256 addresses on your local network if you don't like automation :) After the device is powered up it should allow SSH connections, so just connect to it and use the pi username and raspberry password, which should be changed immediately after you log in.

In the past I used the raspi-config tool that comes with Raspbian to enable the SSH daemon, but to do that you would need to connect a screen and keyboard to the device, which I wanted to avoid if possible.

Setting up Cloud9 for C++ Development

I decided to go with the Cloud9 IDE, which can run on the device and be accessed through a browser. I think this is a good deal as the UI is fast (rendered by your browser) and it doesn't waste the precious RAM on the device itself. The only thing that I miss is a C++ debugger, but since Cloud9 is not really targeting C++ I'm fine with that (the syntax highlighting is just fine).

Installing Cloud9 requires the following steps on your Raspberry device:

  1. Install the following dependencies: sudo apt-get install build-essential pkg-config libssl-dev openssl libxml2-dev
  2. Clone Cloud9 IDE to some directory, for example Cloud9: git clone https://github.com/c9/core/ Cloud9
  3. Prepare Cloud9 by using their install-sdk script (it takes maybe 15 minutes to compile all deps and downloads a lot of packages from the net): ./Cloud9/scripts/install-sdk.sh
  4. Create your workspace, mine is mkdir workspace (at ~)
  5. Prepare your script for starting the Cloud9 server; my run.sh looks like this:
    #!/bin/sh
    node server.js -w ~/workspace --listen=0.0.0.0 --port=8181 --auth=username:password
    
  6. Don't forget to make the script executable chmod +x run.sh
  7. Run the server ./run.sh

Note that Cloud9 requires --listen to be set to 0.0.0.0, otherwise it will block connections that are not coming from localhost. You can now connect to the device from your browser.

The IDE provides access to the workspace and also features a built-in terminal, which you can use through the web browser as well. This means that configuring the project, compiling it, and running unit tests can be done entirely from the browser.

Memory Consumption

$ cat /proc/meminfo 
MemTotal:         947736 kB
MemFree:          130244 kB
MemAvailable:     798108 kB
Buffers:           57904 kB
Cached:           615464 kB

This means that roughly 800MB is still available to the C++ compiler and other dev-tools, which is fine for most projects. However, from my own experience, don't use the make -j command to build your project as it will create 4 processes, leaving only around 200MB for each compilation unit - and that is not enough even for smaller projects like AsmJit.

Conclusion

This setup really works for me and allows me to work on AsmJit's ARM support on a real ARM device. However, it's still not a high priority for me...

07 September, 2016

AsmJit & AsmTK integrated into x64dbg

AsmTK > AsmJit

AsmTK, as the name implies, is an assembler toolkit that uses AsmJit as a base for everything and builds additional features on top of it. The only feature it currently offers is AsmParser, which implements a parser that can consume an assembler string and emit it into a CodeEmitter. More features like a Linker are planned for the future.

AsmTK is the first library that takes advantage of AsmJit's instruction validation framework, which requires just 5kB of data to validate everything AsmJit understands. This feature was necessary for building such tools and it's now an integral part of AsmJit that can be enabled per CodeHolder and/or Assembler.

x64dbg > AsmTK

It didn't take long for AsmTK to gain some attention. The first project that officially started using AsmTK is x64dbg, an open-source debugger for Windows. It uses AsmParser to encode asm instruction(s) on-the-fly and currently allows selecting between 3 assembler engines to do the job. It will be interesting to see which engine people prefer the most, and which proves the most reliable.

The collaboration started after the author of x64dbg pointed out that all of his ~150 tests were failing when using AsmTK as a backend. I have incrementally fixed most of them and integrated a better test suite into AsmTK itself, so we can add more tests and have them run on continuous-integration servers. Some tests were failing because AsmTK parsed instructions in a case-sensitive manner (so all UPPERCASED instructions failed for no good reason); other failures, however, required a lot of fixes in AsmTK and AsmJit themselves - for example the REP and REPE/REPNE prefixes were completely reworked (I had planned that anyway, but these failures accelerated my motivation to fix it now). Other minor reorganizations in AsmJit happened mostly to increase compatibility with other assemblers.

What's Next?

AsmTK needs more people using it and reporting bugs. I'm completely open to implementing many features and to accepting pull requests that extend its functionality. AsmTK is much more open to extra functionality than AsmJit, which I try to keep lean, mean, and focused on JIT (while providing a foundation to build on top of).

So, if you need to parse some asm, take a look at AsmTK and join AsmJit's room on gitter. You will get pretty powerful functionality in just 250kB (and the size can be reduced further by disabling some AsmJit features such as Compiler).

18 August, 2016

Comparing register allocator of GCC and Clang

Introduction

I'm preparing to write a new register allocator for AsmJit, and part of my preparation is studying how the best open-source C++ compilers allocate registers for some C++ samples, and how they create function prologs and epilogs. Today I wrote a very simple piece of code that tries to exploit one thing - in 32-bit mode there are just 8 GP and 8 SIMD registers, so let's see what GCC and Clang do with the following code:

#include <immintrin.h>

int fn(
  const int* px, const int* py,
  const int* pz, const int* pw,
  const int* pa, const int* pb,
  const int* pc, const int* pd) {

  __m256i a0 = _mm256_loadu_si256((__m256i*)px);
  __m256i a1 = _mm256_loadu_si256((__m256i*)py);
  __m256i a2 = _mm256_loadu_si256((__m256i*)pz);
  __m256i a3 = _mm256_loadu_si256((__m256i*)pw);
  __m256i a4 = _mm256_loadu_si256((__m256i*)pa);
  __m256i b0 = _mm256_loadu_si256((__m256i*)pb);
  __m256i b1 = _mm256_loadu_si256((__m256i*)pc);
  __m256i b2 = _mm256_loadu_si256((__m256i*)pd);
  __m256i b3 = _mm256_loadu_si256((__m256i*)pc + 1);
  __m256i b4 = _mm256_loadu_si256((__m256i*)pd + 1);
  
  __m256i x0 = _mm256_packus_epi16(a0, b0);
  __m256i x1 = _mm256_packus_epi16(a1, b1);
  __m256i x2 = _mm256_packus_epi16(a2, b2);
  __m256i x3 = _mm256_packus_epi16(a3, b3);
  __m256i x4 = _mm256_packus_epi16(a4, b4);
  
  x0 = _mm256_add_epi16(x0, a0);
  x1 = _mm256_add_epi16(x1, a1);
  x2 = _mm256_add_epi16(x2, a2);
  x3 = _mm256_add_epi16(x3, a3);
  x4 = _mm256_add_epi16(x4, a4);

  x0 = _mm256_sub_epi16(x0, b0);
  x1 = _mm256_sub_epi16(x1, b1);
  x2 = _mm256_sub_epi16(x2, b2);
  x3 = _mm256_sub_epi16(x3, b3);
  x4 = _mm256_sub_epi16(x4, b4);
  
  x0 = _mm256_packus_epi16(x0, x1);
  x0 = _mm256_packus_epi16(x0, x2);
  x0 = _mm256_packus_epi16(x0, x3);
  x0 = _mm256_packus_epi16(x0, x4);
  return _mm256_extract_epi32(x0, 1);
}

The function does nothing useful; it's just dumb code that tricks the compiler into not eliminating any part of the function and into emitting exactly what is written in its body. I'm trying to exploit the following: I'm using 8 arguments, which are passed on the stack (32-bit mode), and each argument is a pointer to a 256-bit vector. The last two arguments are read twice, the first 6 arguments just once. This means that the compiler has the opportunity to keep all arguments in GP registers if it just follows the code path. The second challenge is that I'm using 15 YMM registers at the same time, but only 8 are available.

GCC Output

Not so good; here is the asm with my annotations:

; GCC 6.1 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
  lea       ecx, [esp+4]                      ; Return address + 4 (first argument on the stack)
  and       esp, -32                          ; Align the stack to 32 bytes
  push      DWORD PTR [ecx-4]                 ; Push returned address
  push      ebp                               ; Save frame-pointer even if I told GCC to not to
  mov       ebp, esp
  push      edi                               ; Save GP regs
  push      esi
  push      ebx
  push      ecx
  sub       esp, 296                          ; Reserve stack for YMM spills
  mov       eax, DWORD PTR [ecx+16]           ; LOAD 'pa'
  mov       esi, DWORD PTR [ecx+4]            ; LOAD 'py'
  mov       edi, DWORD PTR [ecx]              ; LOAD 'px'
  mov       ebx, DWORD PTR [ecx+8]            ; LOAD 'pz'
  mov       edx, DWORD PTR [ecx+12]           ; LOAD 'pw'
  mov       DWORD PTR [ebp-120], eax          ; SPILL 'pa'
  mov       eax, DWORD PTR [ecx+20]           ; LOAD 'pb'
  mov       DWORD PTR [ebp-152], eax          ; SPILL 'pb'
  mov       eax, DWORD PTR [ecx+24]           ; LOAD 'pc'
  vmovdqu   ymm4, YMMWORD PTR [esi]
  mov       ecx, DWORD PTR [ecx+28]           ; LOAD 'pd'
  vmovdqu   ymm7, YMMWORD PTR [edi]
  vmovdqa   YMMWORD PTR [ebp-56], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [ebx]
  mov       ebx, DWORD PTR [ebp-152]          ; LOAD 'pb'
  vmovdqa   YMMWORD PTR [ebp-88], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [edx]
  mov       edx, DWORD PTR [ebp-120]          ; LOAD 'pa'
  vmovdqu   ymm6, YMMWORD PTR [edx]
  vmovdqa   YMMWORD PTR [ebp-120], ymm6       ; SPILL VEC
  vmovdqu   ymm0, YMMWORD PTR [ecx]
  vmovdqu   ymm6, YMMWORD PTR [ebx]
  vmovdqa   ymm5, ymm0                        ; Why move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [eax+32]
  vmovdqu   ymm2, YMMWORD PTR [eax]
  vmovdqa   ymm1, ymm0                        ; Why move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [ecx+32]
  vmovdqa   YMMWORD PTR [ebp-152], ymm2
  vmovdqa   ymm3, ymm0                        ; Why move anything when using AVX?
  vpackuswb ymm0, ymm7, ymm6
  vmovdqa   YMMWORD PTR [ebp-184], ymm5       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-248], ymm3       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-280], ymm0       ; SPILL VEC
  vmovdqa   ymm0, YMMWORD PTR [ebp-56]        ; ALLOC VEC
  vmovdqa   YMMWORD PTR [ebp-216], ymm1       ; SPILL VEC
  vpackuswb ymm2, ymm0, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-88]        ; ALLOC VEC
  vpackuswb ymm1, ymm4, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vpackuswb ymm5, ymm0, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpaddw    ymm2, ymm2, YMMWORD PTR [ebp-56]  ; Uses SPILL slot
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vpackuswb ymm3, ymm0, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpaddw    ymm0, ymm7, YMMWORD PTR [ebp-280] ; Uses SPILL slot
  vpsubw    ymm0, ymm0, ymm6
  vmovdqa   ymm7, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpackuswb ymm0, ymm0, ymm2
  vpaddw    ymm2, ymm4, ymm1
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vmovdqa   YMMWORD PTR [ebp-312], ymm3       ; SPILL VEC
  vpaddw    ymm3, ymm5, YMMWORD PTR [ebp-88]  ; Uses SPILL slot
  vpsubw    ymm3, ymm3, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm3
  vpaddw    ymm1, ymm7, YMMWORD PTR [ebp-312] ; Uses SPILL slot
  vpsubw    ymm1, ymm1, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm2
  vpackuswb ymm0, ymm0, ymm1
  vpextrd   eax, xmm0, 1                      ; Return value
  vzeroupper
  add       esp, 296
  pop       ecx
  pop       ebx
  pop       esi
  pop       edi
  pop       ebp
  lea       esp, [ecx-4]
  ret

Here are my observations based on the output:

  • The first thing GCC does is allocate arguments to registers and/or move them to their home slots if there are not enough physical registers for all args.
  • It sets up a frame pointer (EBP) even though I told it not to, and EBP is a valuable register in our case. It (probably) does this because it wants to align the stack.
  • It uses [ebp-X] addressing even when [esp+X] would be shorter in asm - for example it should use [esp+16] instead of [ebp-280] to make the instruction shorter - we are talking about 2 bytes per instruction here.
  • The YMM register usage is a mystery to me - it does so many allocs and spills and also uses 'vmovdqa' to copy one register to another, which seems crazy in AVX code.

Verdict: GCC is worse than I expected in this test; it looks like it doesn't properly use a DFG (data-flow graph) and its register allocator gets confused.

Clang Output

Clang surprised me; here is the output:

; Clang 3.8 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
   mov       eax, dword ptr [esp + 32]     ; LOAD 'pd'
   mov       ecx, dword ptr [esp + 4]      ; LOAD 'px'
   vmovdqu   ymm0, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 8]      ; LOAD 'py'
   vmovdqu   ymm1, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 12]     ; LOAD 'pz'
   vmovdqu   ymm2, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 16]     ; LOAD 'pw'
   vmovdqu   ymm3, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 20]     ; LOAD 'pa'
   vmovdqu   ymm4, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 24]     ; LOAD 'pb'
   vmovdqu   ymm5, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 28]     ; LOAD 'pc'
   vpackuswb ymm6, ymm0, ymm5
   vpsubw    ymm0, ymm0, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx]
   vpaddw    ymm0, ymm0, ymm6
   vpackuswb ymm6, ymm1, ymm5
   vpsubw    ymm1, ymm1, ymm5
   vmovdqu   ymm5, ymmword ptr [eax]
   vpaddw    ymm1, ymm1, ymm6
   vpackuswb ymm6, ymm2, ymm5
   vpsubw    ymm2, ymm2, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx + 32]
   vpaddw    ymm2, ymm2, ymm6
   vpackuswb ymm6, ymm3, ymm5
   vpsubw    ymm3, ymm3, ymm5
   vmovdqu   ymm5, ymmword ptr [eax + 32]
   vpaddw    ymm3, ymm3, ymm6
   vpackuswb ymm6, ymm4, ymm5
   vpsubw    ymm4, ymm4, ymm5
   vpaddw    ymm4, ymm4, ymm6
   vpackuswb ymm0, ymm0, ymm1
   vpackuswb ymm0, ymm0, ymm2
   vpackuswb ymm0, ymm0, ymm3
   vpackuswb ymm0, ymm0, ymm4
   vpextrd   eax, xmm0, 1                  ; Return value
   vzeroupper
   ret

Wow! Clang didn't spill a single GP/VEC register and didn't use 'vmovdqa' to copy registers like GCC did. It rearranged the code in a way that prevents spills, most probably because it uses a DFG properly. I must applaud the clang developers, as this output is amazing. It surprised me because I wrote the code to force the compiler to spill some YMMs (I know, bad design, but it was just a quick type-type-copy-paste :).

Conclusion

I cannot draw a conclusion from a single test, but GCC failed me in this case. Don't trust people who say that compilers always produce better asm than humans - it's a myth :)

You can try it yourself by copy-pasting the code here - it's a service that translates your C++ code to asm online.

GCC bug #77287 reported.