22 November, 2016

Using Raspberry for ARM Testing

Introduction

Developing for x86/x64 architectures is relatively easy as it's the standard architecture used by today's desktops; it's powerful and there are plenty of options when it comes to dev-tools and IDEs. When it comes to other architectures (I mean ARM here) the situation changes dramatically. These devices are usually not that powerful (both CPU power and memory become limited resources) and some devices running OSes like Android don't even provide the tools that we developers are used to. I personally cannot code properly on a laptop, so devices like tablets or phones are completely out of the question. Using simulators is also not an option as I need to experiment with JIT code generation.

After doing some experiments with my Raspberry Pi 2 (Model B) I decided to try to improve the comfort of using this kind of device for testing. The model I have has 1GB of RAM and a quad-core ARM processor. Using that device to run an IDE would suck, but accessing it remotely and using it only for compilation and unit testing would work pretty well.

Preparation

I have done the following steps to prepare the Raspberry device itself:

  1. Download the Raspbian Lite distribution
  2. Unpack the image to get the *.img file, for example as raspbian.img
  3. Plug your SD card into your computer and check which device represents it; I will call it /dev/sdx here
  4. Use the dd tool (Linux) to copy the content of the *.img file to the SD card, something like dd bs=4M if=./raspbian.img of=/dev/sdx (the device doesn't specify a partition id)
  5. Mount the /dev/sdx2 drive somewhere so you can access it, like /mnt/rpi
  6. Enable the SSH daemon by default: go to /mnt/rpi/etc/rc2.d and rename Kxxsshd to S02sshd
  7. Unmount /dev/sdx2
  8. Sync drives by using the sync command (just in case, should not be needed)
  9. Unplug the SD card and put it into your Raspberry

Now the device should be ready; the only thing needed is to turn it on and connect it to your network. I use LAN and DHCP here, so that's it. You can use the netstat tool to get the IP address of the device, or ping 256 addresses on your local network if you don't like automation :) After the device is powered up it should allow SSH connections, so just connect to it and use the pi username and raspberry password, which should be changed immediately after you are logged in.

In the past I used the raspi-config tool that comes with Raspbian to enable the SSH daemon, but to do that you would need to connect a screen and keyboard to the device, which I would like to avoid if possible.

Setting up Cloud9 for C++ Development

I decided to go with the Cloud9 IDE, which can run on the device and can be accessed through a browser. I think this is a good deal as the UI is fast (rendered by your browser) and it doesn't waste the precious RAM on the device itself. The only thing that I miss is a C++ debugger, but since Cloud9 is not really targeting C++ I'm fine with that (syntax highlighting is just fine).

Installing Cloud9 requires the following steps on your Raspberry device:

  1. Install the following dependencies: sudo apt-get install build-essential pkg-config libssl-dev openssl libxml2-dev
  2. Clone Cloud9 IDE to some directory, for example Cloud9: git clone https://github.com/c9/core/ Cloud9
  3. Prepare Cloud9 by using their install-sdk script (takes maybe 15 minutes to compile all deps, and also downloads a lot of packages from the net): ./Cloud9/scripts/install-sdk.sh
  4. Create your workspace; mine was created with mkdir workspace (in ~)
  5. Prepare your script for starting the Cloud9 server; my run.sh looks like this:
    #!/bin/sh
    node server.js -w ~/workspace --listen=0.0.0.0 --port=8181 --auth=username:password
    
  6. Don't forget to make the script executable: chmod +x run.sh
  7. Run the server: ./run.sh

Note that Cloud9 requires --listen to be set to 0.0.0.0, otherwise it will block connections that are not coming from localhost. You can now connect to the device. The result looks like this:

The IDE provides access to the workspace and also features a built-in terminal, which you can use through the web browser as well. This means that configuring the project, compiling it, and running unit tests can be done completely via the browser.

Memory Consumption

$ cat /proc/meminfo 
MemTotal:         947736 kB
MemFree:          130244 kB
MemAvailable:     798108 kB
Buffers:           57904 kB
Cached:           615464 kB

This means that roughly 800MB is still available to be used by the C++ compiler and other dev-tools, which is fine for most projects. However, from my own experience, don't use the make -j command for building your project as it will create 4 processes, leaving around 200MB for each compilation unit - that is not enough even for smaller projects like AsmJit.

Conclusion

This setup really works for me, and it allows me to work on AsmJit's ARM support on a real ARM device. However, it's still not a high priority for me...

07 September, 2016

AsmJit & AsmTK integrated into x64dbg

AsmTK > AsmJit

AsmTK, as the name implies, is an assembler toolkit that uses AsmJit as a base for everything and builds additional features on top of it. The only feature it currently offers is AsmParser, which implements a parser that can consume an assembler string and emit it into a CodeEmitter. More features, like a Linker, are planned for the future.

AsmTK is the first library that takes advantage of AsmJit's instruction validation framework, which requires just 5kB of data to validate everything AsmJit understands. This feature was necessary to build such tools and it's now an integral part of AsmJit, which can be enabled per CodeHolder and/or Assembler.

x64dbg > AsmTK

It didn't take long for AsmTK to gain some attention. The first project that officially started using AsmTK is x64dbg, an open-source debugger for Windows. It uses AsmParser to encode asm instruction(s) on-the-fly and it currently allows selecting between 3 assembler engines to do the job. It will be interesting to see which engine people prefer the most, and which will be the most reliable one to use.

The collaboration started after the author of x64dbg pointed out that all of his ~150 tests were failing when using AsmTK as a backend. I have incrementally fixed most of them and integrated a better test suite into AsmTK itself, so we can add more tests and have them run on continuous integration servers. Some tests were failing because AsmTK was parsing instructions in a case-sensitive manner (so all UPPERCASED instructions were failing for no good reason); however, other failures required a lot of fixes in AsmTK and AsmJit themselves - for example REP and REPE/REPNE prefixes were completely reworked (I had planned that, but these failures motivated me to fix it now). Other minor reorganizations in AsmJit happened mostly to increase compatibility with other assemblers.

What's Next?

AsmTK needs more people using it and reporting bugs. I'm completely open to implementing many features and to accepting pull requests that extend its functionality. AsmTK is much more open to extra functionality in this regard than AsmJit, which I try to keep lean and mean and focused on JIT (while providing a foundation to build on top of it).

So, if you need to parse some asm, take a look at AsmTK and join AsmJit's room on gitter. You will get pretty powerful functionality in just 250kB (and the size can be reduced further by disabling some AsmJit features such as Compiler).

18 August, 2016

Comparing the Register Allocators of GCC and Clang

Introduction

I'm preparing to write a new register allocator for AsmJit and part of my preparation is to study how the best open-source C++ compilers allocate registers in some C++ samples, and how they create function prologs and epilogs. Today, I wrote a very simple piece of code that tries to exploit one thing - in 32-bit mode there are just 8 GP and 8 SIMD registers, so let's see what GCC and Clang do with the following code:

#include <immintrin.h>

int fn(
  const int* px, const int* py,
  const int* pz, const int* pw,
  const int* pa, const int* pb,
  const int* pc, const int* pd) {

  __m256i a0 = _mm256_loadu_si256((__m256i*)px);
  __m256i a1 = _mm256_loadu_si256((__m256i*)py);
  __m256i a2 = _mm256_loadu_si256((__m256i*)pz);
  __m256i a3 = _mm256_loadu_si256((__m256i*)pw);
  __m256i a4 = _mm256_loadu_si256((__m256i*)pa);
  __m256i b0 = _mm256_loadu_si256((__m256i*)pb);
  __m256i b1 = _mm256_loadu_si256((__m256i*)pc);
  __m256i b2 = _mm256_loadu_si256((__m256i*)pd);
  __m256i b3 = _mm256_loadu_si256((__m256i*)pc + 1);
  __m256i b4 = _mm256_loadu_si256((__m256i*)pd + 1);
  
  __m256i x0 = _mm256_packus_epi16(a0, b0);
  __m256i x1 = _mm256_packus_epi16(a1, b1);
  __m256i x2 = _mm256_packus_epi16(a2, b2);
  __m256i x3 = _mm256_packus_epi16(a3, b3);
  __m256i x4 = _mm256_packus_epi16(a4, b4);
  
  x0 = _mm256_add_epi16(x0, a0);
  x1 = _mm256_add_epi16(x1, a1);
  x2 = _mm256_add_epi16(x2, a2);
  x3 = _mm256_add_epi16(x3, a3);
  x4 = _mm256_add_epi16(x4, a4);

  x0 = _mm256_sub_epi16(x0, b0);
  x1 = _mm256_sub_epi16(x1, b1);
  x2 = _mm256_sub_epi16(x2, b2);
  x3 = _mm256_sub_epi16(x3, b3);
  x4 = _mm256_sub_epi16(x4, b4);
  
  x0 = _mm256_packus_epi16(x0, x1);
  x0 = _mm256_packus_epi16(x0, x2);
  x0 = _mm256_packus_epi16(x0, x3);
  x0 = _mm256_packus_epi16(x0, x4);
  return _mm256_extract_epi32(x0, 1);
}

The function does nothing useful; it's just dumb code that tricks the compiler into not eliminating any part of the function and into emitting exactly what is written in its body. I'm trying to exploit the following: I'm using 8 arguments, which are passed on the stack (32-bit mode), and each argument is a pointer to a 256-bit vector. The last two arguments are read twice, the first 6 arguments just once. This means that the compiler has the opportunity to store all arguments in GP registers if it just follows the code path. The second challenge is that I'm using 15 YMM registers at the same time, but only 8 are available.

GCC Output

Not so good, here is the asm with my annotations:

; GCC 6.1 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
  lea       ecx, [esp+4]                      ; Return address + 4 (first argument on the stack)
  and       esp, -32                          ; Align the stack to 32 bytes
  push      DWORD PTR [ecx-4]                 ; Push return address
  push      ebp                               ; Save frame-pointer even though I told GCC not to
  mov       ebp, esp
  push      edi                               ; Save GP regs
  push      esi
  push      ebx
  push      ecx
  sub       esp, 296                          ; Reserve stack for YMM spills
  mov       eax, DWORD PTR [ecx+16]           ; LOAD 'pa'
  mov       esi, DWORD PTR [ecx+4]            ; LOAD 'py'
  mov       edi, DWORD PTR [ecx]              ; LOAD 'px'
  mov       ebx, DWORD PTR [ecx+8]            ; LOAD 'pz'
  mov       edx, DWORD PTR [ecx+12]           ; LOAD 'pw'
  mov       DWORD PTR [ebp-120], eax          ; SPILL 'pa'
  mov       eax, DWORD PTR [ecx+20]           ; LOAD 'pb'
  mov       DWORD PTR [ebp-152], eax          ; SPILL 'pb'
  mov       eax, DWORD PTR [ecx+24]           ; LOAD 'pc'
  vmovdqu   ymm4, YMMWORD PTR [esi]
  mov       ecx, DWORD PTR [ecx+28]           ; LOAD 'pd'
  vmovdqu   ymm7, YMMWORD PTR [edi]
  vmovdqa   YMMWORD PTR [ebp-56], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [ebx]
  mov       ebx, DWORD PTR [ebp-152]          ; LOAD 'pb'
  vmovdqa   YMMWORD PTR [ebp-88], ymm4        ; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [edx]
  mov       edx, DWORD PTR [ebp-120]          ; LOAD 'pa'
  vmovdqu   ymm6, YMMWORD PTR [edx]
  vmovdqa   YMMWORD PTR [ebp-120], ymm6       ; SPILL VEC
  vmovdqu   ymm0, YMMWORD PTR [ecx]
  vmovdqu   ymm6, YMMWORD PTR [ebx]
  vmovdqa   ymm5, ymm0                        ; Why move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [eax+32]
  vmovdqu   ymm2, YMMWORD PTR [eax]
  vmovdqa   ymm1, ymm0                        ; Why move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [ecx+32]
  vmovdqa   YMMWORD PTR [ebp-152], ymm2
  vmovdqa   ymm3, ymm0                        ; Why move anything when using AVX?
  vpackuswb ymm0, ymm7, ymm6
  vmovdqa   YMMWORD PTR [ebp-184], ymm5       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-248], ymm3       ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-280], ymm0       ; SPILL VEC
  vmovdqa   ymm0, YMMWORD PTR [ebp-56]        ; ALLOC VEC
  vmovdqa   YMMWORD PTR [ebp-216], ymm1       ; SPILL VEC
  vpackuswb ymm2, ymm0, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-88]        ; ALLOC VEC
  vpackuswb ymm1, ymm4, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vpackuswb ymm5, ymm0, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpaddw    ymm2, ymm2, YMMWORD PTR [ebp-56]  ; Uses SPILL slot
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vpackuswb ymm3, ymm0, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpaddw    ymm0, ymm7, YMMWORD PTR [ebp-280] ; Uses SPILL slot
  vpsubw    ymm0, ymm0, ymm6
  vmovdqa   ymm7, YMMWORD PTR [ebp-120]       ; ALLOC VEC
  vpackuswb ymm0, ymm0, ymm2
  vpaddw    ymm2, ymm4, ymm1
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vmovdqa   YMMWORD PTR [ebp-312], ymm3       ; SPILL VEC
  vpaddw    ymm3, ymm5, YMMWORD PTR [ebp-88]  ; Uses SPILL slot
  vpsubw    ymm3, ymm3, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm3
  vpaddw    ymm1, ymm7, YMMWORD PTR [ebp-312] ; Uses SPILL slot
  vpsubw    ymm1, ymm1, YMMWORD PTR [ebp-248] ; Uses SPILL slot
  vpackuswb ymm0, ymm0, ymm2
  vpackuswb ymm0, ymm0, ymm1
  vpextrd   eax, xmm0, 1                      ; Return value
  vzeroupper
  add       esp, 296
  pop       ecx
  pop       ebx
  pop       esi
  pop       edi
  pop       ebp
  lea       esp, [ecx-4]
  ret

Here are my observations based on the output:

  • The first thing GCC does is to allocate arguments to registers and/or to move them to their home slots if there are not enough physical registers for all args.
  • It preserves the frame pointer (EBP) even though I told it not to. The EBP register is valuable in our case. It does this (probably) because it wants to align the stack.
  • It uses [ebp-X] even when [esp+X] would be shorter in asm - for example it should use [esp+16] instead of [ebp-280] to make the instruction with such an address shorter - we are talking about 2 bytes per instruction here.
  • YMM register usage is a mystery to me - it does so many allocs and spills and also uses 'vmovdqa' to copy one register to another, which seems crazy in AVX code.

Verdict: GCC is worse than I expected in this test; it looks like it doesn't properly use a DFG (data-flow graph) and its register allocator is somewhat confused.

Clang Output

Clang surprised me; here is the output:

; Clang 3.8 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
   mov       eax, dword ptr [esp + 32]     ; LOAD 'pd'
   mov       ecx, dword ptr [esp + 4]      ; LOAD 'px'
   vmovdqu   ymm0, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 8]      ; LOAD 'py'
   vmovdqu   ymm1, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 12]     ; LOAD 'pz'
   vmovdqu   ymm2, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 16]     ; LOAD 'pw'
   vmovdqu   ymm3, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 20]     ; LOAD 'pa'
   vmovdqu   ymm4, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 24]     ; LOAD 'pb'
   vmovdqu   ymm5, ymmword ptr [ecx]
   mov       ecx, dword ptr [esp + 28]     ; LOAD 'pc'
   vpackuswb ymm6, ymm0, ymm5
   vpsubw    ymm0, ymm0, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx]
   vpaddw    ymm0, ymm0, ymm6
   vpackuswb ymm6, ymm1, ymm5
   vpsubw    ymm1, ymm1, ymm5
   vmovdqu   ymm5, ymmword ptr [eax]
   vpaddw    ymm1, ymm1, ymm6
   vpackuswb ymm6, ymm2, ymm5
   vpsubw    ymm2, ymm2, ymm5
   vmovdqu   ymm5, ymmword ptr [ecx + 32]
   vpaddw    ymm2, ymm2, ymm6
   vpackuswb ymm6, ymm3, ymm5
   vpsubw    ymm3, ymm3, ymm5
   vmovdqu   ymm5, ymmword ptr [eax + 32]
   vpaddw    ymm3, ymm3, ymm6
   vpackuswb ymm6, ymm4, ymm5
   vpsubw    ymm4, ymm4, ymm5
   vpaddw    ymm4, ymm4, ymm6
   vpackuswb ymm0, ymm0, ymm1
   vpackuswb ymm0, ymm0, ymm2
   vpackuswb ymm0, ymm0, ymm3
   vpackuswb ymm0, ymm0, ymm4
   vpextrd   eax, xmm0, 1                  ; Return value
   vzeroupper
   ret

Wow! Clang didn't spill any GP/VEC register and didn't use 'vmovdqa' to copy registers like GCC did. It rearranged the code in a way that prevents spills, most probably because it uses a DFG properly. I must applaud the Clang developers as this output is amazing. It surprised me as I wrote the code to force the compiler to spill some YMMs (I know, bad design, but it was just a quick type-type-copy-paste :)

Conclusion

I cannot make a conclusion based on a single test, but GCC failed me in this case. Don't trust people who say that compilers produce better asm than people; it's a myth :)

You can try it yourself by copy-pasting the code here - it's a service that can translate your C++ code to asm online.

GCC bug #77287 reported.

15 August, 2016

AsmJit & AVX-512

The World of Prefixes

X86 architecture is known for its prefixes. It's no surprise that AVX-512 adds another one to the family - EVEX. Let's summarize the last 4 prefixes introduced in X86:

  • VEX - 2-byte (VEX2) and 3-byte (VEX3) prefix initially designed by Intel to encode AVX instructions, but now used by other CPU extensions like BMI and BMI2. VEX2 was designed to make some instructions 1 byte shorter than VEX3, but its usage is quite limited.
  • XOP - 3-byte prefix designed by AMD to support their XOP extensions in a way that doesn't interfere with the existing VEX prefix. XOP was never adopted by Intel and AMD will not support it in their new Zen processors (together with other extensions like FMA4). It's a dead end, dead silicon, and dead code that supports this prefix.
  • EVEX - 4-byte prefix designed by Intel to support 512-bit wide vectors and 32 vector registers. Each AVX-512 instruction that works with vector registers uses this prefix. Many AVX and AVX2 instructions can be encoded by this new prefix as well. There are, however, several exceptions, but that would require a separate post.

AVX-512 Status in AsmJit

AVX-512 support in AsmJit is mostly finished. AsmJit's instruction database now contains all AVX-512 instructions together with the older AVX and AVX2 instructions. The reorganization of the instruction database and X86Assembler was quite drastic. AsmJit now contains a single path to encode a VEX, XOP, or EVEX instruction, which greatly simplified the logic in the assembler. XOP encoding IDs are no longer needed as each instruction now contains VEX, XOP, and EVEX bits. These bits instruct the encoder to use the correct prefix.

Encoder Improvements

The previous encoder was designed to encode each byte in the [VEX] prefix separately, and then write the result byte-by-byte into the destination buffer. This design was fairly simple, and according to my benchmarks, it was also very fast. However, this approach seemed unfeasible for supporting the new EVEX prefix, which contains 3 bytes of payload (instead of two), and the encoder must check all 3 bytes before it can decide whether to emit a VEX or EVEX prefix. The new code does this differently - it uses a single 32-bit integer that represents the whole EVEX prefix, and then decides whether to use EVEX or VEX by checking specific bits in it. If any of the bits checked is '1' then the instruction is EVEX only. This guarantees that the EVEX prefix will never be used for a legacy AVX instruction, and also guarantees that the best encoding (shortest prefix) is used. AsmJit allows overriding this decision by using the `evex()` option, which instructs the encoder to emit the EVEX prefix, and similarly also supports the `vex3()` option, which instructs the encoder to emit a 3-byte VEX prefix instead of a shorter 2-byte VEX prefix. EVEX wins if both `evex()` and `vex3()` are specified.

A simplified version of AsmJit's VEX|EVEX encoder looks like this:

// Encode most of EVEX prefix, based on instruction operands and definition.
uint32_t x = EncodeMostEvex(...);             //  [........|zLL..aaa|Vvvvv..R|RBBmmmmm]
// Check if EVEX is required by checking:     x & [........|xx...xxx|x......x|.x.x....]
if (x & 0x00C78150U) {
  // Encode EVEX - uses most of `x`.
  // ... no more branches here - requires around 14 ops to finalize EVEX ...
  //                                                   _     ____    ____
  //                                              [zLLbVaaa|Wvvvv1pp|RBBR00mm|01100010].
}

// Not EVEX, prepare `x` for VEX2 or VEX3 (5 ops):[........|00L00000|0vvvv000|R0B0mmmm]
x |= ((opCode >> (kSHR_W_PP + 8)) & 0x8300U) | // [00000000|00L00000|Wvvvv0pp|R0B0mmmm]
     ((x      >> 11             ) & 0x0400U) ; // [00000000|00L00000|WvvvvLpp|R0B0mmmm]

// Check if VEX3 is needed by checking        x & [........|........|x.......|..x..x..]
if (x & 0x00008024U) {
  // Encode VEX3 or XOP.
  // ... no more branches here - requires around 7 ops to finalize VEX3 ...
  //                                                         ____    _ _
  //                                              [_OPCODE_|WvvvvLpp|R1Bmmmmm|VEX3|XOP].
}
else {
  // Encode VEX2.
  // ... no more branches here - requires around 3 ops to finalize VEX2 ...
}

This means that AsmJit requires just a single branch to decide whether to use the VEX or EVEX prefix, and another branch to decide between a 3-byte VEX|XOP or a 2-byte VEX prefix. This is good news for everybody expecting high performance as this approach is nearly as fast as the old AsmJit one, which didn't support AVX-512 at all. It took me some time and thinking to actually design such an approach and to reorganize the instruction opcode database in a way that allows the initial EVEX prefix to be encoded quickly. My initial approach was around 25% slower than the old AsmJit, and the final code (similar to the snippet shown above) is roughly 3-5% slower, which is pretty close to the old code. The new functionality is nontrivial so I'm really happy with such metrics (and to be honest I would like to see some metrics from other assemblers).

It's also obvious from the code that the new approach is basically optimistic for EVEX - emitting EVEX instructions is much cheaper than emitting VEX|XOP instructions. This wasn't the goal, it's rather a consequence: all the bits that the EVEX prefix introduces must be checked in order to decide between VEX and EVEX, thus AsmJit just puts most of these bits into the right position and only performs minor bit shuffling when converting to a prefix that uses fewer bytes (EVEX->VEX3->VEX2).

Future Work

Future posts will be about a new emitter called CodeBuilder. In short, it allows emitting instructions into a representation that can be processed afterwards. This representation was in AsmJit from the beginning as a part of Compiler. Compiler has many high-level features that some people don't need, so it was split into CodeBuilder, which can be used exactly like Assembler, and CodeCompiler, which keeps all the high-level features.

14 July, 2016

Introducing AsmTK

AsmTK - A Toolkit Based on AsmJit

The AsmJit library provides low-level and high-level JIT functionality that allows applications to generate code at run-time. The library was designed from scratch to be efficient and highly dynamic. Efficiency is achieved by having a single (dispatch) function that can encode all supported instructions without jumping to other helper functions. This function is actually pretty big, but I have always tried to keep it organized and consistent. Dynamism is achieved by using a structure called Operand, which is a base class for any operand that can be used by the assembler, and which guarantees that each Operand has the same size (16 bytes) regardless of its type and content.

The dynamic nature of AsmJit is actually what makes it much more powerful than other JIT assemblers out there. It's also the feature that makes it possible to have X86Compiler as a part of AsmJit without a significant increase in library size; and it also makes it possible to create tools that use AsmJit as a base library to generate and process assembly at run-time. One missing feature that I have been frequently asked about was assembling code from a string. This is now provided by the AsmTK library!

AsmParser

AsmTK's AsmParser exploits what AsmJit offers - it parses the input string and constructs instruction operands on-the-fly, then passes the whole thing to the instruction validator, and finally passes it to the assembler itself. AsmTK supports all instructions provided by AsmJit, because it uses the AsmJit API for instruction-name-to-ID conversion and strict validation.

Here is the output of a sample application that I wrote in less than 15 minutes - it's basically an on-the-fly X86/X64 instruction encoder based on AsmTK and AsmJit. You enter an instruction, and it tries to encode it and outputs its binary representation:

=========================================================
AsmTK-Test-Cmd - Architecture = x64 (use --x86 and --x64)
---------------------------------------------------------
Usage:
  1. Enter instruction and its operands to be encoded.
  2. Enter empty string to exit.
=========================================================
mov eax, ebx
8BC3
mov rax, rbx
488BC3
mov r15, rax
4C8BF8
cmp ah, al
3AE0
vandpd ymm0, ymm10, ymm13
C4C12D54C5
movdqa xmm0, [rax + rcx * 8 + 16]
660F6F44C810
movdqa rax, xmm0
ERROR: 0x0000000B (Illegal instruction)

The tool can be used to quickly verify whether an instruction encodes correctly and also to check if the encoding is optimal (for example whether AsmJit encodes it the shortest way possible, etc). I have already made two fixes in AsmJit to use a shorter encoding of the [mov gpq, u32 imm] and [and gpq, u32 imm] instructions.

Conclusion

The AsmTK library is a fresh piece of software that currently contains less than 1000 lines of code. It relies heavily on AsmJit and uses its new instruction validation API. It also serves as a demonstration of AsmJit capabilities that are not obvious from the AsmJit documentation.

12 July, 2016

AsmJit and Instruction Validation

AsmDB

AsmDB is an X86/X64 instruction database in a JSON-like format that I started after I saw the complexity of the AVX-512 instruction set. I initially thought that I would just add it manually to the AsmJit database, but after a few hours I realized that it is extremely complex and not as straightforward as I thought.

AsmDB -> AsmJit

The solution was to create a database that contains all instructions in a format similar to the one used by instruction-set manuals, and to write a tool that can index the database and create all the tables AsmJit uses programmatically. At the moment I would say that 50% of the work is done - the AsmJit tool that generates parts of the x86inst.cpp file now uses AsmDB to generate space-efficient operand tables that can be used to validate the operands of any x86 and x64 instruction supported by AsmJit. This replaces the old operand tables that were basically useless as they combined all possibilities of all possible instruction encodings.

The new validation API is still a work-in-progress, but in general you can do something like this:

Error err = X86Inst::validate(
  kArchX64,                        // Architecture - kArchX86, kArchX64.
  kX86InstIdVpunpckhbw,            // Instruction id, see X86InstId enum.
  0,                               // Instruction options, see X86InstOptions enum.
  x86::xmm0, x86::xmm1, x86::eax); // Individual operands, or operands[] and count.

The call to validate() will return an error in this case, because the vpunpckhbw instruction is defined for either [xmm, xmm, xmm/mem] or [ymm, ymm, ymm/mem] operands. The validator is very strict and has access to very detailed information about every instruction - it knows about implicit operands, operands that require a specific register, and possible immediate and memory sizes. It can be used to implement an asm parser as well, and probably much more in the future.

X86Assembler enhancements

At the moment I'm still integrating the validation code into the assembler and compiler classes. What I can say is that I can simplify and remove most of the validation code from the assembler in favor of the new validation code. The reason is that the assembler was always somewhat lenient in terms of validation - it cares about performance and omits everything that is not necessary. This means, for example, that it allows something like mov eax, al and encodes it as mov eax, eax. Basically it checks the size of the destination register and doesn't care much about the source register except for its index for that particular instruction.

This gets much more problematic when using an unsafe API (an API that allows using untyped operands). It's possible to emit a ridiculous combination by doing, for example, a.emit(kX86InstIdAdd, x86::eax, x86::xmm5). Such a combination doesn't exist at all, but AsmJit will encode it as add eax, ebp as the operands match the REG/REG signature (the ebp register has the same index as xmm5), and since the 'add' instruction is only defined for GP registers the assembler can omit the register type check, because there is no typed API that provides the 'add' instruction with such operands.
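
To make this concrete, here is a small sketch of how such a combination can be rejected up front by reusing the validate() call shown earlier (an illustration only - the exact way the assembler wires this check in internally may differ, and the two-operand call shape is assumed from the "individual operands" note above):

// Sketch - validate the operands of the unsafe emit() example before encoding.
Error err = X86Inst::validate(
  kArchX64,                      // Architecture.
  kX86InstIdAdd,                 // Instruction id.
  0,                             // No instruction options.
  x86::eax, x86::xmm5);          // The operands passed to emit().

if (err != kErrorOk) {
  // Reject the instruction instead of silently encoding 'add eax, ebp'.
}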

The new validation API changes the game since operands can now be validated and the assembler can validate each instruction before it actually tries to encode it. This means that such cases can be checked by the assembler without making it more complex. The only problem is that this kind of validation is more expensive, thus the assembler needs a new option to enable and disable it.

X86Compiler enhancements

Anybody who has ever used AsmJit's compiler knows that it's not really trivial to debug when something goes wrong. Compiler stores each instruction, processes it, and then serializes it to the assembler. If an instruction is invalid from the beginning it will be serialized as-is to the assembler, which will then fail. The problem is that by then it's sometimes too late and you have to use the debug output to figure out the exact place where the instruction was generated. The new validation API should solve most of the issues mentioned, as the compiler can now validate each instruction before it stores it. This makes it much faster to find code that misuses the compiler.

Conclusion

There are still many things to do in AsmJit, but the library is slowly getting better and I hope that AsmJit users will find these new features useful. The good news is that since I reorganized the instruction tables there is still some space I can fill without increasing the library size. There are many things that can be put into these tables; the first candidate is an SSE to AVX translation, which is very likely to be implemented first. The next goal is to finalize AVX-512, of course :)

16 June, 2016

Rendering Game Maps in HTML Canvas

Introduction

It's been around 10 days since I started a simple toy project. My goal was to render a game-map based on assets extracted from the original Civilization 1 game (DOS version). After one day I had that done and I started thinking about how the rendering could be improved. I won't post a screenshot of that initial version here as I used original assets from the game (you can find many on Google Images); however, this article should describe how I built a very complex texture-based map renderer from an initial tile-based renderer that used a simple atlas to render its tiles. Rendering by blitting pre-rendered tiles is a technique that was used by the original Civilization game and it's used today by many turn-based strategy games (including civ-clones).

I picked HTML's <canvas> as the technology used by the renderer. I found it extremely challenging to write a renderer for such a backend as the performance of canvas varies across browsers and its state-based architecture is extremely unsuitable for such rendering tasks. I don't know if WebGL would be faster or not, but since I'm mostly interested in 2D I'm happy I started with canvas, as it's challenging to move the boundaries of what one can do with it. I guess that even in WebGL there would be some struggles as there are a lot of tiles to render and a lot of layers that have to be rendered per tile, so it may be much faster, but it would also require some discipline to make it work.

Engine Design

Before I start describing the renderer itself I would like to describe the engine a bit and how I structured it (and you can skip this if you just want to see the results):

  • Game - connects game rules, game map, map renderer, and user interface.
  • GameMap - defines a game-map, implements basic map operations and algorithms.
  • GameTile - defines a single tile used by GameMap - contains what is required for game to use that tile, doesn't contain any rendering hints.
  • GameRules - provides information about objects in the game and their interaction. There are 7 types of objects that can be added to the rules; the 2 most important for us are Assets and Terrains.
  • GameUtils - a static class that contains some helpers used across the engine.

The renderer is completely separated from the engine and looks like the following:

  • AssetLoader - a class that takes all asset files from game rules and queues them for loading.
  • Renderer - the main renderer class that implements the rendering logic. When attached to the game it starts listening to invalidate events and recalculates the affected tiles (described below).
  • RendererTile - a tile used and maintained by the renderer. This tile only contains information about rendering and is updated every time a tile is invalidated by the game.
  • RenderUtils - utility functions used by the renderer - most of the functionality provided here is used only once, to process loaded assets into something the renderer would like to use.

When the Renderer is attached to the game it starts listening to events that invalidate tiles and require them to be recalculated - these events are only related to the tiles themselves, not to what happens on those tiles. For example, moving a unit across your territory won't cause a recalculation of rendering data; however, if you change the terrain type of some tile, build a road, or uncover a hidden tile on the map, then the tile and all its surroundings have to be invalidated and recalculated. The game sends `invalidateMap` (if the whole map changed), `invalidateTile` if a single tile was changed, and `invalidateRect` if an area of tiles was changed.

When the renderer receives such an event it uses a bit-array that describes the region on the map that needs to be recalculated. Since invalidating a single tile also invalidates all its surroundings, the minimum number of tiles to be recalculated is always 9 (a 3x3 matrix, 1 tile and 8 surroundings). I have chosen a 4x4 grid of tiles to be represented by a single bit in a bit-array called `dirtyTiles`. So when the renderer receives an invalidation event it just updates bits in `dirtyTiles` and that's it; it will recalculate the tiles later as it's very likely that more neighboring tiles will be invalidated during game-play. When the renderer receives instructions to render the map it first checks if there are dirty tiles and recalculates them. After all tiles are recalculated it sets all dirty bits to false and starts the rendering process.

That was a very high-level introduction to the architecture I developed. The primary goal was to enable fast prototyping of the renderer. The next sections cover all the steps I used to write the texture-based renderer.

Step 1 - Initial Implementation

The renderer has to be able to render tiles based on their assigned textures. So the first thing to do is to add some assets to the game rules:


rules.assets = [
  { name: "Texture.Ocean"      , file: "ocean.png"          },
  { name: "Texture.Grassland"  , file: "grassland.png"      }
];

And to create terrain that will use the assets defined:


rules.terrain = [
  { name: "Grassland"          , id: TerrainType.Grassland, asset: "_[Texture.Grassland]" },
  { name: "Ocean"              , id: TerrainType.Ocean    , asset: "_[Texture.Ocean]"     }
];

The assets are referenced as `_[AssetName]`. It could be confusing why the name changes, but the reason is that each kind of item in the rules system has its own link format. This means that items of different kinds can have the same name and still be referenced without ambiguity. Rules use references for many things; for example, if you have a building and you need something in order to build that building in your city, you will use the system of references and add prerequisites to that building (and the prerequisite could be a nation, technology, resource, or anything else that is defined by the game rules).

But back to rendering! If you create two textures called ocean.png and grassland.png, each of them exactly 256x256 pixels, then you can render a texture on each tile by calculating the tile's world coordinates and keeping only the low 8 bits of them (this depends on the size of the texture; I would recommend using powers of 2, other dimensions will make your work harder, but not impossible). This way you can render something like the following:

While some devs would be satisfied with such a wonderful image, I was not :)

Step 2 - Adding Blendmaps

To make the transitions between different tiles smooth we use a concept called blendmaps. A blendmap is an image that contains only an alpha channel and contains various transitions between two tiles. I started with a blendmap that has 16 tiles, where the first tile is divided into 4 subtiles specifying the blending of each corner, and the next 15 tiles specify the blending of sides and their combinations. A blendmap looks like the following (see the arrows of blending for illustration):

Even if it looks chaotic there is a very simple logic behind it. Each bit defines a blending side. The renderer then uses the following bits to describe a side or corner:

  • [1] - Top (Side)
  • [2] - Right (Side)
  • [3] - Bottom (Side)
  • [4] - Left (Side)
  • [5] - Top-Left (Corner)
  • [6] - Top-Right (Corner)
  • [7] - Bottom-Left (Corner)
  • [8] - Bottom-Right (Corner)

Sides come first and corners follow. I found this logic the best because when you keep only the first four bits you get a mask of the 4 sides. Some assets need just these four to render them properly, like rivers, which will be explained later. When you represent these four bits in binary like 0001 and then convert to decimal (0001 becomes 1, 0010 becomes 2, etc) you get the blending sides and their indexes in the blendmap (zero indexed, which is why the corners occupy the first blendmap tile). So for example the last tile has all four bits set (1111, 15 in decimal), which means that it's a top+right+bottom+left transition.

From now on I will start using the RendererTile to store some information about each tile on the map. At the moment we need just two properties called `baseTexture` and `terrainEdges`. The base texture is set to the ID of the texture that is rendered first (like ocean, grassland, etc). Terrain edges are calculated this way:

  • Get all surrounding tiles of the tile you are recalculating (sides and corners).
  • For each surrounding tile, set the particular bit in `terrainEdges` if the tile is not the same, or clear it.

After all tiles are precalculated this way you can implement a very simple renderer that blends one texture with another based on the tile sides and corners. So for each tile to be rendered do the following:

  • Blit the base texture first. For example if the tile is ocean, blit ocean, if it's grassland, blit grassland.
  • Blit tile sides based on the first four bits of `terrainEdges`, you calculate the x of the blendmap as `(terrainEdges & 15) * tileSize`.
  • Blit masked tile corners that represent each quarter of the tile - top-left, top-right, bottom-left, and/or bottom-right.

(NOTE: Rendering the most complicated tile requires 5 transitions in our case - one for the sides and at most 4 more, one for each corner)

To make a masked tile-blit you need to do the following:

  • Clear the alpha value of the destination pixels defined by the blendmask.
  • Use destination-over operator to blend the second texture in the transparent area produced by the previous step.

The function may be implemented like this:


// ctx  - Canvas 2d context
// dx   - Destination X
// dy   - Destination Y
//
// tex  - Texture to blend
// texx - Texture X coordinate
// texy - Texture Y coordinate
//
// msk  - Blendmask
// mskx - Blendmask X coordinate
// msky - Blendmask Y coordinate
renderTransition(ctx, dx, dy, tex, texx, texy, msk, mskx, msky, sq) {
  // Clear pixels and alpha defined by the blend-mask.
  ctx.globalCompositeOperation = "xor";
  ctx.drawImage(msk, mskx, msky, sq, sq, dx, dy, sq, sq);

  // Blit pixels that were cleared by the previous operation.
  ctx.globalCompositeOperation = "destination-over";
  ctx.drawImage(tex, texx, texy, sq, sq, dx, dy, sq, sq);

  // Restore the composition operator.
  ctx.globalCompositeOperation = "source-over";
}

If you implement it correctly and render the same data as in Step 1, it will look like the following:

While it's much better than the previous rendering there are many things that can be improved. But before we go into step 3 I would like to present one trick to reduce the maximum number of blends per such transition to one. The key point is to define each logical combination that can happen and post-process the blendmap to have more tiles. I implemented this in the following way:

  • First define a table that contains all important corners for each combination of sides.
  • Use that table to resolve all possible combinations and create a lookup table where the index is `terrainEdges` (with all bits) and the value is a new index to a post-processed blendmap.
  • Post-process the blendmap to contain all possible combinations.

In code it looks like this:


function makeTransitionData(sides) {
  const PRE = [];
  const LUT = []; for (var i = 0; i < 256; i++) LUT.push(-1);

  for (var side = 0; side < 16; side++) {
    const effective = sides[side];

    for (var corner = 16; corner < 256; corner += 16) {
      var lutIndex = side | corner;
      var altIndex = side | (corner & effective);

      if (LUT[altIndex] !== -1) {
        // Already in `PRE` table.
        LUT[lutIndex] = LUT[altIndex];
      }
      else {
        // New combination.
        const preIndex = PRE.length;
        PRE.push(altIndex);
        LUT[lutIndex] = LUT[altIndex] = preIndex;
      }
    }
  }

  return {
    PRE: PRE, // Preprocessing table.
    LUT: LUT  // Render lookup table.
  };
}

And then I used that table with the following data, where each combination of sides lists the corners that are still important:


const TerrainTransitions = makeTransitionData([
  EdgeFlags.Corners                            , // |       |
  EdgeFlags.BottomLeft | EdgeFlags.BottomRight , // |      T|
  EdgeFlags.TopLeft    | EdgeFlags.BottomLeft  , // |    R  |
  EdgeFlags.BottomLeft                         , // |    R T|
  EdgeFlags.TopLeft    | EdgeFlags.TopRight    , // |  B    |
  EdgeFlags.None                               , // |  B   T|
  EdgeFlags.TopLeft                            , // |  B R  |
  EdgeFlags.None                               , // |  B R T|
  EdgeFlags.TopRight   | EdgeFlags.BottomRight , // |L      |
  EdgeFlags.BottomRight                        , // |L     T|
  EdgeFlags.None                               , // |L   R  |
  EdgeFlags.None                               , // |L   R T|
  EdgeFlags.TopRight                           , // |L B    |
  EdgeFlags.None                               , // |L B   T|
  EdgeFlags.None                               , // |L B R  |
  EdgeFlags.None                                 // |L B R T|
]);

The total number of terrain transitions we defined is 46 and the preprocessing table contains the following masks:

[16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 1, 65, 129, 193, 18, 2, 66, 82, 3, 67, 20, 36, 52, 4, 5, 22, 6, 7, 8, 40, 136, 168, 9, 137, 10, 11, 12, 44, 13, 14, 15]

And the post-processed blendmap would look like this (note that you should post-process it programmatically, not manually):

While it's not necessary to do it this way, I found it much simpler to preprocess the blendmaps I'm using and use just one call to `renderTransition()` with the appropriate blendmap position. If you plan to render many things per tile then I would consider this trick necessary, as it improves performance a lot and it's not memory hungry.

Step 3 - Adding Rivers

The renderer can be improved to support rivers; to do that I did the following:

  • Add a new property to the `RendererTile` called `riverEdges`.
  • Add logic to recalculate `riverEdges` to the renderer.
  • River connections are only side-based, so if a tile is a river, check all four sides; each river or ocean side adds a particular bit to `RendererTile.riverEdges`.
  • For ocean tiles, check for all neighbors that are rivers and set `RendererTile.riverEdges` so you can render the river flowing into the ocean.

Then you need another blendmap that describes river transitions. I created the following one (please note how the positions match the terrain blendmap from Step 2):

TIP: if you just started creating your own blendmaps in GIMP or PS: create layers and paint your blendmap by using a white color in a transparent layer. Then, after you finish, you can create another layer, put your texture there, and use the multiply blend mode to blend the white with the texture. If you do it properly you will see something like this (and this is how it would look on the map):

If you have all of this then it's pretty trivial to add river support to the renderer:

  • On a ground tile, check `riverEdges` and if it's non-zero use `renderTransition()` to render the blendmap at the index specified by `riverEdges` - it's four bits, which means 15 transitions if you ignore the zero index (which only makes sense in an editor - it's a tile with a river without any connection).
  • On an ocean tile, check `riverEdges` and render each side separately; use indexes 16, 17, 18, and 19, which provide the flow into the ocean.

I use an ocean texture for blending rivers; you can have a separate one if you want to make rivers brighter, for example. The rendered map with rivers should look like this:

Step 4 - Adding Dominance

Until now we have rendered just two kinds of tiles (grassland and ocean) and rivers. What if we add more tiles, for example desert, plains, tundra, arctic, etc? Renderers of some games solve this problem in a simple way - they define a texture which is used between different terrain types as a transitional texture. So for example if I define desert to be that texture, then all transitions between plains and grasslands would go through desert, etc. The struggle is that this never looks good and it's painful to pick the texture to be used for such transitions. Some games solve this problem in another way - they only allow one kind of tile to be next to another to work around the issue. But there is another concept, simply called `dominance`.

Dominance works this way: assign a dominance property to each tile and use that dominance to determine which neighbor merges where. Tiles with higher dominance 'dominate' neighbors with lesser dominance. For example, if a dominant tile is grassland and it's surrounded by all plains, then the grassland is rendered as-is and each plains tile surrounding it contains a transition from the grassland, as it dominates them. I found this concept well described here and followed it to create my own implementation.

The first thing I needed was another blendmap for terrain transitions. Currently I use a single blendmap for all terrain transitions, but that's just for simplicity; it's configurable to have more blendmaps and to specify which should be used where. Here is the blendmap that I created for terrain transitions:

And here is what needs to be done to support it:

  • Remove property `terrainEdges` from `RendererTile`.
  • Add a new property to the `RendererTile` called `transitions`, which is an array containing pairs of values [textureId, edges].
  • Add logic to recalculate `transitions` to the renderer.

The `RendererTile.transitions` are recalculated in the following way:

  • Clear `RendererTile.transitions`.
  • Check the tile's dominance and loop from `dominance + 1` to `maxDominance` (the maximum possible dominance of all supported terrain types).
  • For each dominance index check all neighboring tiles and collect a bit mask of edges of that particular dominance. If the mask is not empty then add [textureId, edges] to `RendererTile.transitions`.
  • If the tile is an ocean, set the base tile to ocean if the tile has only ocean neighbors, or to desert if it has ground neighbors. By doing this you create a nice coast that looks like sand into which all neighboring tiles blend.

Then during the rendering process, first blit the base texture and then loop over `RendererTile.transitions` and blend each texture using the textureId and edges (index into the blendmap). For example, if you define the dominance like this:


rules.assets = [
  { name: "Texture.Ocean"      , file: "ocean.png"        , dominance: 0 },
  { name: "Texture.Desert"     , file: "desert.png"       , dominance: 1 },
  { name: "Texture.Arctic"     , file: "arctic.png"       , dominance: 2 },
  { name: "Texture.Tundra"     , file: "tundra.png"       , dominance: 3 },
  { name: "Texture.Plains"     , file: "plains.png"       , dominance: 4 },
  { name: "Texture.Grassland"  , file: "grassland.png"    , dominance: 5 },
];

rules.terrain = [
  { name: "Desert"             , id: TerrainType.Desert   , asset: "_[Texture.Desert]"    },
  { name: "Plains"             , id: TerrainType.Plains   , asset: "_[Texture.Plains]"    },
  { name: "Grassland"          , id: TerrainType.Grassland, asset: "_[Texture.Grassland]" },
  { name: "Tundra"             , id: TerrainType.Tundra   , asset: "_[Texture.Tundra]"    },
  { name: "Jungle"             , id: TerrainType.Jungle   , asset: "_[Texture.Grassland]" },
  { name: "Ocean"              , id: TerrainType.Ocean    , asset: "_[Texture.Ocean]"     }
];

Then the sample map would be rendered the following way:

Playing with terrain dominance settings will yield different renderings; for example, increasing the dominance of arctic would render the same map differently:

It's tricky and the result very much depends on the blendmap used to do the transitions. For example, the blendmap I created is good for snow transitions, but I will create a different one for other terrain types.

Step 5 - Adding Coast

Let's improve the result of Step 4 by adding a nicer coast, and let's do it programmatically instead of messing with more blendmaps! To create a better coast we need to add a slightly brighter sea that surrounds it. What I'm gonna do is perform some image processing of the original blendmask within the browser: invert -> blur -> brighten -> combine, as illustrated below on a single tile:

If you process each tile from the coast blendmap and use that blendmap with a much brighter texture, the result is a really nice coastline. The following image also uses the same technique to enhance the look of rivers:

Of course, there are endless possibilities of image processing that can be done with blendmaps.

Next Steps, TODOs

I didn't implement terrains like hills, forests, and jungle because I don't really have assets for that. So if you have any idea how that can be done, or if you volunteer to provide such assets, I can dig into it. Other things like fog-of-war or uncovered map areas are simply about adding more blendmaps and more recalculation to the renderer.

Update

A working demo is now available here!

Conclusion

I know that this article is really high-level and doesn't really cover everything. My writing skills are also a bit limited; however, I wrote it to demonstrate that it's possible to render such images based only on textures and blendmaps, and that it's possible to do that in real-time (I didn't count frames but scrolling is smooth even in fullscreen). The work on this project also gave me new ideas that I can implement in Blend2D, because I think that many drawing APIs suffer from not having the ability to use blendmaps in a single operation. I'm not sure when the renderer will be released, but it's gonna be an open-source project for sure. Discussion and new ideas are welcome!

20 April, 2016

Blend2D's Rasterizer Debugging

[Or] How to Visualize the Rasterizer's Output Differently

This is the first post about Blend2D and it will be a bit unusual. Today I implemented one feature that is only interesting for developers. It allows visually distinguishing between consecutive rasterizer spans. Blend2D's rasterizer is based on AntiGrain (or AGG) and also uses cells, which are used to calculate the final alpha of the resulting pixels after the whole input is processed.

There is a fundamental difference between how the cells are stored in Blend2D and in AGG. AGG's approach is to append cells into cell blocks (each cell then includes its coordinate as well) and at the end of the rasterization process it traverses the cells to gather the number of cells per scanline, to merge all cells that have the same coordinate, and to sort the rest. These additional passes are actually the most expensive ones. I don't want to go into much detail here, but it simply requires a lot of cycles to prepare all the cells and to generate spans that can be passed to the compositor.

Blend2D uses a different approach. It introduces a data structure called Span (don't confuse it with AGG spans here), which is simply an array of cells. When the rasterizer generates cells it always groups them into spans and then inserts these spans into the scanline container (also not to be confused with AGG's scanline), which keeps them sorted. This introduces some overhead during the rasterization phase, but prevents the additional traversal and sorting in a later phase. Additionally, this approach prevents having multiple cells that share the same coordinate - if a cell at a given coordinate already exists, the new cell is merged into it instead of creating another one. Blend2D's approach beats the AGG rasterizer significantly and keeps the memory requirements during rasterization low - memory will not grow infinitely, whereas AGG fails to continue when it reaches its hard-coded cell limit.
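
To illustrate the idea, here is a conceptual sketch (these are not Blend2D's actual types, just an illustration of the structure described above): a span stores one coordinate for a whole run of cells, so the scanline container only needs to keep spans ordered by their start coordinate instead of sorting individual cells at the end.

// Conceptual sketch only - not Blend2D's real data structures.
struct Cell {
  int cover;      // Accumulated coverage contribution.
  int area;       // Accumulated area contribution.
};

struct Span {
  int x;          // X coordinate of the first cell in the span.
  int length;     // Number of consecutive cells.
  Cell* cells;    // One cell per covered pixel - no per-cell coordinates needed.
};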

But Nothing is Perfect

But nothing is perfect and Blend2D is no exception. Currently, when the rasterizer inserts a span into the scanline, it will stay there forever. More cells can be merged into it, but the data structure itself will hold the span as-is and won't coalesce it with other newly inserted span(s) even if they are right next to it. This approach is beneficial for shapes that are not very complex, because it allows rendering them quickly. However, it fails when the input shape contains a lot of small line segments, which happens quite often, especially when rendering complex art and strokes.

And that's the area where I would like to improve Blend2D. Before I can do any improvement I have to see some data first. However, the rasterizer produces a lot of information and it would be highly inefficient to study it in text form. So instead of printing text, I introduced a new property called debugFlags, and hacked the pipeline generator to use a special fetcher when it's set. When enabled, it flips the color each time a new span is processed.

Circles

To demonstrate what it does take a look at the following circle:
(note: all images were upscaled to make the details more visible)

And now look at the debug version:

Not bad, although some erratic lines are visible. All lines that are red contain at least two spans at the beginning of the scanline - the second span was inserted after the first without coalescing (these spans most likely have just one cell).

The stroke shows some erratic spans as well, although it's still acceptable.

Random Paths

Let's look at something more complex that can reveal a bit more:

The same shape but stroked:

Pathological Stroke

A stroke of a path that contains more than 100 quadratic curves.

Interpreting the Images

The first 3 circles didn't look as bad as the polygons that follow. However, even the circle's stroke could be improved, because small spans degrade the ability of the compositor to process multiple pixels at a time. Blend2D's compositor can process more than 4 pixels at a time (a maximum of 16), which means that if the scanline contains a lot of small spans the compositor doesn't use the full width of the SIMD registers available. The biggest overhead is caused by single-cell spans, which are commonly generated by steep (nearly vertical) lines.

Another problem is the pathological stroke. The shape itself is pathological and not common in 2D graphics; however, it shows that the rasterizer's inability to coalesce neighboring spans doesn't scale well. The more spans the rasterizer has to keep track of, the more overhead it takes to insert a new span into the list.

Some Code

The only thing I had to do to implement the debug-mode was to add a new property (debugFlags) to the 2D context and to implement a new debug-fetch (including some logic to select it). The initial version of the debug-fetch looks like this:

void SWPipeFetchPart::debugFetch1(SWPipePixelARGB& p, uint32_t flags) noexcept {
  X86GpVar dpTmp0 = c->newUInt32("dpTmp0");
  X86GpVar dpTmp1 = c->newUInt32("dpTmp1");

  // `_debugVar` is a mask of either all zeros or all ones, flipped per span.
  // dpTmp0 = ~_debugVar & 0xFFFFFFFF -> 0xFFFFFFFF (white) when the mask is zero.
  // dpTmp1 =  _debugVar & 0xFFFF5050 -> 0xFFFF5050 (red)   when the mask is all ones.
  c->mov(dpTmp0, g->_debugVar);
  c->not_(dpTmp0);
  c->mov(dpTmp1, g->_debugVar);
  c->and_(dpTmp0, 0xFFFFFFFF);
  c->and_(dpTmp1, 0xFFFF5050);
  c->add(dpTmp0, dpTmp1);

  // Move the resulting ARGB32 pixel into an XMM register (packed color).
  p.pc[0] = c->newXmm("dpXmm");
  c->movd(p.pc[0], dpTmp0);

  g->xSatisfyARGB32_1x(p, flags);
}

Nothing fancy or over-complicated, just straightforward asmjit :) It checks _debugVar (a mask that has either all bits zero or all bits set) and generates a white or red pixel based on it. It then stores the pixel to p.pc[], which stands for Packed Color (a packed pixel that contains color information - ARGB). At the end it calls the pipeline function that converts the pixel into the format requested by the compositor.

Further Development

This tool gave me some information that I can use to improve the rasterizer's performance. The rasterizer debug-mode can be enabled at runtime (no recompilation needed) and causes no overhead to the rasterizer itself or to the rendering context. I just need to find the right constants for span coalescing (like how far apart two spans can be and still get coalesced) and to tweak the rasterizer. The benchmark suite will then be used to verify the performance gain and to make sure the change didn't cause regressions.
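
To make this concrete, here is a minimal sketch of the coalescing rule I'm considering, reusing the illustrative Span type from the sketch above; kCoalesceDistance and tryCoalesce are made-up names, not Blend2D identifiers:

// Hypothetical coalescing-on-insert rule (kCoalesceDistance is a tuning constant).
static const int kCoalesceDistance = 4;

// If the new cell starts close enough to the end of the previous span, extend
// the previous span (padding the gap with zero cells) instead of inserting a
// new record into the scanline.
bool tryCoalesce(Span& prev, int newX, int value) {
  int end = prev.x + int(prev.cells.size());
  if (newX >= end && newX - end <= kCoalesceDistance) {
    prev.cells.resize(size_t(newX - prev.x) + 1, 0);
    prev.cells.back() += value;
    return true;
  }
  return false;
}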

I have thought many times about making the rasterizer tile-based, or more precisely, making all spans fixed size and managing them differently. A radical change like this would make pathological cases (and very complex renderings) faster while hindering the common paths. I'm not fully convinced yet; maybe an adaptive algorithm that switches to a tile-based approach (per scanline) once the number of spans the scanline holds exceeds some threshold would be a better fit. Switching the algorithm is not a high priority at the moment, but the rasterizer will definitely be updated before the beta release.

18 April, 2016

Tricky _mm_set_sd()

[Or] How a C++ Compiler Can Ruin Your Optimization Efforts

Compiler intrinsics targeting the SSE and SSE2 instruction sets were historically used to instruct the compiler to generate better code for floating-point operations. The reason was that some CPU architectures have a global floating-point unit state (or FPU state) that contains the rounding mode, precision, exception mask, and some other flags. This basically prevented the compiler from optimizing operations like round(x), static_cast<int>(x), (x < y ? x : y), etc.
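
As an illustration of that style, a truncating double-to-int conversion is the typical case; with SSE2 it can be done without touching the global FPU state (my own minimal sketch, not code from any particular library):

#include <emmintrin.h>

// Truncating double -> int conversion that doesn't depend on the global FPU
// rounding mode; helpers like this were common before compilers handled it well.
static inline int TruncToInt(double x) {
  return _mm_cvttsd_si32(_mm_set_sd(x));
}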

Of course this situation was not performance-friendly, and many developers simply started using SSE+ intrinsics, when they became available, to solve these problems in a more performance-friendly way. This approach was also blessed by some CPU manufacturers as the better way of dealing with the problem. Today's compilers are much smarter than before and some of these constructs may be considered obsolete. However, some still hold and are probably better than pure C/C++ code in certain cases, especially if you need to do something the compiler won't do by default.

Scalar <-> SIMD Problem

The problem I faced is closely related to the conversion between float/double and SIMD registers. When SSE2 code generation is turned ON these types are equivalent, and using the _mm_set_sd intrinsic should generate no code at all; however, that's not always true, as demonstrated by the snippet below:

#include <emmintrin.h>

double MyMinV1(double a, double b) {
  return a < b ? a : b;
}

double MyMinV2(double a, double b) {
  __m128d x = _mm_set_sd(a);
  __m128d y = _mm_set_sd(b);
  return _mm_cvtsd_f64(_mm_min_sd(x, y));
}

One would expect that a modern compiler will generate the same assembly for both functions. Unfortunately, the reality is different. Let's compare gcc and clang.

Assembly generated by clang:

MyMinV1:
  minsd   xmm0, xmm1
  ret

MyMinV2:
  minsd   xmm0, xmm1
  ret

Assembly generated by gcc:

MyMinV1:
  minsd   xmm0, xmm1
  ret
MyMinV2:
  movsd   QWORD PTR [rsp-24], xmm0
  movsd   QWORD PTR [rsp-16], xmm1
  movsd   xmm0, QWORD PTR [rsp-24]
  movsd   xmm1, QWORD PTR [rsp-16]
  minsd   xmm0, xmm1
  ret

That's not good; gcc completely missed the point and turned the C++ code into the worst assembly you could imagine. You can check it yourself here if you don't believe me.

The Core of the Problem

The problem is caused by using _mm_set_sd() to convert a double into a SIMD type. The intrinsic should emit the movsd xmm0, xmm1|mem64 instruction, which ironically has two behaviors:

  • If the second operand (the source operand) is an XMM register, it preserves the high-part of the first operand (the destination operand).
  • If the second operand (the source operand) is a memory location, it clears the high-part of the first operand (the destination operand).

This is tricky, because the documentation says that the _mm_set_sd() intrinsic clears the high part of the register, so gcc is basically following the specification. However, it's unexpected if you have no use for the high part of the register, and using two movsd's is suboptimal as they could be replaced by a single movq, which is what clang does in some other cases.

A Possible Workaround

After submitting a gcc bug #70708 (which turned out to be a duplicate of #68211) I got instant feedback and a workaround was found. We can use a bit of asm magic to tell the compiler what to do. Consider the updated code, which also works well with clang:

static inline __m128d DoubleAsXmm(double x) {
  __m128d xmm;
  asm ("" : "=x" (xmm) : "0" (x));
  return xmm;
}

double MyMinV3(double a, double b) {
  return _mm_cvtsd_f64(
    _mm_min_sd(
      DoubleAsXmm(a),
      DoubleAsXmm(b)));
}

Which leads to the desired result (both gcc and clang):

MyMinV3:
  minsd   xmm0, xmm1
  ret

Of course you need some preprocessor magic to make the conversion portable across compilers. However, it's a nice solution as it basically tells the compiler that you don't care about the high part of the SIMD register. As a bonus, this also prevents clang from emitting a movq instruction in some other cases.
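
For example, the preprocessor part could look something like the snippet below - a sketch that assumes only gcc and clang need the asm path, not the exact code I use:

#include <emmintrin.h>

// Sketch of a portable wrapper; the GNU-style inline asm path is used by gcc
// and clang, other compilers fall back to the plain intrinsic.
#if defined(__GNUC__) || defined(__clang__)
static inline __m128d DoubleAsXmm(double x) {
  __m128d xmm;
  asm("" : "=x"(xmm) : "0"(x));
  return xmm;
}
#else
static inline __m128d DoubleAsXmm(double x) {
  return _mm_set_sd(x);
}
#endif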

Lesson Learned

I think the time when we had to specialize min/max by using SSE and SSE2 is finally over; compilers are fine doing that for us. However, if you still need to use SSE+ intrinsics, check how the compiler converts your scalars into SIMD registers, especially if you only use them for scalar operations.

16 April, 2016

C++ Moving Forward

Time to move forward and require C++11 by default?

Although I don't use much of C++11, I'm becoming tired of making my code backwards compatible by using macros and other tricks. Removing all of this probably won't make much difference for me as I don't really like the STL. However, certain features like noexcept, move semantics, and constexpr are nice additions that I really like, and every time I have to write a macro instead of the keyword I feel like I'm working on a 20-year-old project.

ApiBegin

I already use many C++11 keywords and just define them if they are not available. I have developed a way of solving this problem without affecting other libraries and user code that is included afterwards. I call it ApiBegin & ApiEnd. For example, the following is from Blend2D's b2d_api_begin.h:

#if !B2D_CC_HAS_ALIGNAS && !defined(alignas)
# define alignas(x) B2D_ALIGNAS(x)
# define B2D_UNDEF_ALIGNAS
#endif // !B2D_CC_HAS_ALIGNAS && !alignas

#if !B2D_CC_HAS_NOEXCEPT && !defined(noexcept)
# define noexcept B2D_NOEXCEPT
# define B2D_UNDEF_NOEXCEPT
#endif // !B2D_CC_HAS_NOEXCEPT && !noexcept

#if !B2D_CC_HAS_NULLPTR && !defined(nullptr)
# define nullptr NULL
# define B2D_UNDEF_NULLPTR
#endif // !B2D_CC_HAS_NULLPTR && !nullptr

And also b2d_api_end.h, which clears these definitions so they do no harm to code that includes Blend2D:

#if defined(B2D_UNDEF_ALIGNAS)
# undef alignas
# undef B2D_UNDEF_ALIGNAS
#endif // B2D_UNDEF_ALIGNAS

#if defined(B2D_UNDEF_NOEXCEPT)
# undef noexcept
# undef B2D_UNDEF_NOEXCEPT
#endif // B2D_UNDEF_NOEXCEPT

#if defined(B2D_UNDEF_NULLPTR)
# undef nullptr
# undef B2D_UNDEF_NULLPTR
#endif // B2D_UNDEF_NULLPTR

The intention is clear - this small hack makes Blend2D compilable by pre-C++11 compilers and allows the Blend2D code itself to look more modern. I just use these features without making them ugly; for example noexcept looks much nicer than B2D_NOEXCEPT, etc. However, this also requires really good compiler detection. Instead of solving this per project, I just wrote a tool to generate most of it. It's a monster, I know, but it offers a centralized way of solving the compiler problem, and when a new compiler is released or some feature is wrongly detected, a single fix fixes 'em all.

I don't even see a problem in my approach. It's clean, it keeps the project backwards compatible, and it allows me to use at least some features I find useful. However, the problem I do see is that many projects have simply switched to C++11 and don't look back. So I'm asking myself, is it time to switch now, or is it better to wait? Not that long ago I was asked to make asmjit compilable by VS2003 (yeah), and it was possible (maybe it still is, I don't know). But these seem to be rare cases to base any decision on.

ApiEnd

Okay, this was a short boring post :) Next time I should finally write something about Blend2D, but I don't yet have a clue where to start. Maybe I can reveal that I finally bought an AVX2 capable machine so Blend2D will be released with a fully optimized AVX2 pipeline codegen?

10 April, 2016

Autovectorization Nightmare

The Opportunity of the Compiler to Break Your Code

The C++ language and compilers are moving forward fast. Today it's pretty common that a C++ compiler is able to take advantage of SIMD and automatically vectorize your plain scalar C++ code into something that looks like a monster. Sometimes this may give you some gains, but the biggest disaster is the compiler optimizing code that has already been optimized - like vectorizing leading loops that precede optimized loops, or initialization loops that are executed just once, etc.

I remember reporting a clang bug where it vectorized a loop containing a sin() call by just unrolling it 8 times and calling the same sin() 8 times sequentially. This of course didn't improve the performance at all, but at least it didn't break the code.

Well, it would be cool if the compiler just optimized your code without actually breaking it, but more aggressive optimizations usually lead to more opportunities to break your humble scalar C++ code. And... because the universe is not fair and Murphy's law applies to all of us, I hit the issue too, today.

There is a term called undefined behavior, and pretty much everybody writing C++ code knows about it and tries to avoid it. The problem is that on many targets the undefined behavior is actually a well-defined undefined behavior - you know the platform, so you can take advantage of certain constructs when optimizing certain things. So what was the problem? Look at the simple code below:

static void write_unaligned_32(void* ptr, uint32_t value) noexcept {
  if (HAS_UNALIGNED_WRITE) {
    static_cast<uint32_t*>(ptr)[0] = value;
  }
  else {
    // NOTE: This example doesn't care about endianness (this is LE).
    static_cast<uint8_t*>(ptr)[0] = static_cast<uint8_t>((value      ) & 0xFF);
    static_cast<uint8_t*>(ptr)[1] = static_cast<uint8_t>((value >>  8) & 0xFF);
    static_cast<uint8_t*>(ptr)[2] = static_cast<uint8_t>((value >> 16) & 0xFF);
    static_cast<uint8_t*>(ptr)[3] = static_cast<uint8_t>((value >> 24));
  }
}

This is basically an abstraction I use to make the code itself more portable and to keep non-portable code in one specific place. Unaligned reads and writes are fine on x86 architectures and are much faster than reading or writing individual bytes. In addition, if all unaligned writes go through the write_unaligned_32 function, it's easy to assume that all other writes are aligned, and to locate the code that requires unaligned writes for future refactoring, optimization, and sanitization.

The write_unaligned_32 function looks sane. It uses an unaligned write only when it's allowed by the target. However, if you use it in a loop the compiler tries to auto-vectorize, the compiler may assume that ptr is aligned to 4 bytes, because you store a uint32_t value to it. And this is exactly what happened. The compiler auto-vectorized the loop and used the movdqa instruction to fetch 16 bytes at once, assuming that the buffer is 16-byte aligned, which led to a general protection fault (#GP). The trickiest part is that this only happened in release mode.
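
To give an idea of the pattern, a loop like the one below is enough to trigger it (an illustrative example, not the exact Blend2D code that crashed):

#include <cstddef>
#include <cstdint>

// Illustrative only - a loop like this gives the compiler room to assume the
// destination is suitably aligned for uint32_t once it decides to vectorize it.
static void fill_u32(void* dst, uint32_t value, size_t count) noexcept {
  uint8_t* p = static_cast<uint8_t*>(dst);
  for (size_t i = 0; i < count; i++, p += 4)
    write_unaligned_32(p, value);
}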

Possible solutions

  • Tell the compiler to not autovectorize? Workaround! Disabling something should never be a solution.
  • Don't use unaligned reads / writes? Workaround! Unnecessarily pessimistic.
  • Tell the compiler the pointer is not aligned? Looks like a solution to me!

There are vendor-specific extensions that can be used to tell the compiler that the pointer isn't aligned as it thinks. However, as usual, some #ifdefs are necessary to take advantage of them.

The first thing I did was to add CC_ASSUME_ALIGNED template to the cxxtool. The initial implementation looks like:

// [@CC_ASSUME_ALIGNED{@]
// \def @prefix@_ASSUME_ALIGNED(p, alignment)
// Assume that the pointer 'p' is aligned to at least 'alignment' bytes.
#if @prefix@_CC_HAS_ASSUME_ALIGNED
# define @prefix@_ASSUME_ALIGNED(p, alignment) __assume_aligned(p, alignment)
#elif @prefix@_CC_HAS_BUILTIN_ASSUME_ALIGNED
# define @prefix@_ASSUME_ALIGNED(p, alignment) p = __builtin_assume_aligned(p, alignment)
#else
# define @prefix@_ASSUME_ALIGNED(p, alignment) ((void)0)
#endif
// [@CC_ASSUME_ALIGNED}@]

The compiler detection module detects whether __assume_aligned or __builtin_assume_aligned is provided by the compiler, and then uses whichever is available. If neither is provided then @prefix@_ASSUME_ALIGNED() is simply a NOP, which is fine.

Then the write_unaligned_32 function can be changed to something like this:

static void write_unaligned_32(void* ptr, uint32_t value) noexcept {
  ASSUME_ALIGNED(ptr, 1);

  if (HAS_UNALIGNED_WRITE_32) {
    static_cast<uint32_t*>(ptr)[0] = value;
  }
  else {
    // NOTE: This example doesn't care about endianness (this is LE).
    static_cast<uint8_t*>(ptr)[0] = static_cast<uint8_t>((value      ) & 0xFF);
    static_cast<uint8_t*>(ptr)[1] = static_cast<uint8_t>((value >>  8) & 0xFF);
    static_cast<uint8_t*>(ptr)[2] = static_cast<uint8_t>((value >> 16) & 0xFF);
    static_cast<uint8_t*>(ptr)[3] = static_cast<uint8_t>((value >> 24));
  }
}
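
As a side note, another common way to express an unaligned store without invoking undefined behavior is memcpy, which compilers typically lower to a single unaligned store on x86 - my observation, not what the code above uses:

#include <cstring>
#include <cstdint>

// memcpy-based unaligned store; carries no alignment assumption at all.
static inline void write_unaligned_32_memcpy(void* ptr, uint32_t value) noexcept {
  std::memcpy(ptr, &value, sizeof(value));
}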

What Triggered It

This bug happened in Blend2D's function that reads non-premultiplied ARGB32 pixels from one buffer, converts them to premultiplied ARGB32, and stores them into another buffer. The function has been designed for image codecs, so it must support reading from misaligned memory. This is common for example in PNG, which adds one BYTE per ROW of pixels to specify the row filter; this additional byte causes all 16-bit and 32-bit rows to become misaligned. The bug has been fixed in both Blend2D and AsmJit.

Conclusion

Having all non-portable code in a single place is always a good idea. Basically one line fixed a bug that wouldn't be trivial to find for a less experienced developer and that was only triggered by the compiler's auto-vectorization feature. I fixed it by intuition: when I saw the #GP I knew there was an unaligned memory access, and because the function doesn't use SIMD by itself, it was obvious that the compiler did something I didn't ask for. I don't like this feature at all, but it's always good to have code that just works.

29 March, 2016

MPSL's new DSP features

Intro

I spent the last 3 days working on some MPSL features. After I fixed the AstToIR phase to finally split a 256-bit register into two 128-bit registers (to support pre-AVX machines) and got all tests working, I started thinking more about DSP instructions that MPSL could provide. Initially, I didn't complicate the design of the language and basically limited it to four basic types - bool, int, float, and double. However, each variable can form up to a 256-bit vector, which means that any 32-bit type can form up to an 8-element vector, and any 64-bit type up to a 4-element vector. The DSP extension basically allows the int type and its vectors to temporarily act as packed bytes, words, double-words, and quad-words. It doesn't change the type system though; the inputs and outputs are defined by the DSP intrinsics themselves.

Intrinsics

Initially, I have implemented these:

  • Packed addition and subtraction with saturation
  • Packed multiplication
  • Packed minimum and maximum
  • Packed comparison
  • Packed shift by scalar

All DSP intrinsics have names that are close to their assembler mnemonics, for example (not a full listing):

  • paddb does an addition of packed bytes (maps to paddb instruction)
  • psubusw does a subtraction with unsigned saturation of packed words (maps to psubusw instruction)
  • pmulw does a multiplication of packed words and stores the LO word of each dword result (maps to pmullw instruction)
  • pmulhuw does an unsigned multiplication of packed words and stores the HI word of each dword result (maps to pmulhuw instruction)
  • pminud selects the minimum of packed unsigned dwords (maps to pminud instruction)
  • pmaxsw selects the maximum of packed signed words (maps to pmaxsw instruction)
  • pcmpeqb does a comparison of packed bytes (maps to pcmpeqb instruction)

Example

One can use the new DSP intrinsics to write low-level pixel processing, for example:

// Embedder provides `bg`, `fg`, and `alpha` variables as packed 16-bit ints stored as `int4`.
int4 main() {
  // HACK: TODO: creates a 32-bit integer having two 16-bit values (0x0100).
  const int inv = 0x01000100;

  // Lerp background and foreground - scalars automatically promoted to vectors.
  int4 x = pmulw(bg, psubw(inv, alpha));
  int4 y = pmulw(fg, alpha);

  // Combine them and return.
  return psrlw(paddw(x, y), 8);
}

The example is not the best, I know, but MPSL is getting better and I'm adding new concepts every day. The advantage of such a program is that it works with packed 16-bit integers and can process twice as much data as a program working with floats. It's also very simple for embedders to actually use that shader - maybe 10 lines of C++ to set up the data layout and create the shader.

Conclusion

There are currently more important things to implement, like branching and loops; however, adding DSP intrinsics wasn't for profit, so why not :) The DSP implementation is not fully complete either - some useful intrinsics are still missing, like packing, unpacking, some peculiarities like pmaddwd and psadbw, and also element swizzling, which is next on my list.

Update

There is now a working example mp_dsp that creates and executes the shader.

25 March, 2016

Vector sin/cos (SSE2)

Intro

Using SIMD intrinsics in C++ is so cool until you find something missing. Most of the time it's just about helpers that combine a few intrinsics into something that should have obviously been implemented by the CPU, but isn't. However, sometimes you would like to have intrinsics that provide much more - like sine and cosine. They are already available in the C library, so why are there no intrinsics for them? Because it's not that easy. There are some libraries that provide non-conforming implementations of sine and cosine (meaning they don't handle special numbers and only work within a certain range), which is usually fine as it's really rare to compute something like sin(1e16), although a conforming implementation should be able to handle such input without returning a completely wrong result.

After I released an updated version of MathPresso and a preview of MPSL I faced two problems: the sine and cosine implementations provided by the C runtime vary across the platforms I would like to support, and there are no vector versions that I can just use in MPSL (and to be honest, I want these functions to be inlined anyway). However, the problem is even deeper - I discovered that sine and cosine results don't just vary across platforms; some implementations (like Visual Studio's C runtime) produce different results even in debug and release builds. This means that MPSL needs its own implementation of all functions it provides in order to produce the same results on all supported platforms.

Existing Implementations

Instead of trying to implement my own sine and cosine from scratch, I looked at existing open-source implementations and tried to compare their performance and precision. There are basically 2 implementations I found interesting:

  • Netlib's cephes - Provides an implementation of all math functions available in the C runtime. The most interesting part of the sine implementation is the selection of a 6-degree polynomial based on the input. Not really SIMD friendly, but solvable - for example ssemath just executes both branches and then masks the wrong values out (see the small sketch after this list). The ssemath library, however, provides only single-precision sine and cosine, which I don't need.
  • Erik's vecmathlib - Provides an implementation that uses only one 9-degree polynomial. The implementation seems to be based on a public-domain library called sleef, which implements the same algorithms that are used by vecmathlib (VML).
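
The "execute both branches and mask" trick mentioned above boils down to a branchless per-lane select; a minimal sketch in SSE2 terms (not taken from ssemath):

#include <emmintrin.h>

// Branchless select: returns `a` where the mask bits are set, `b` elsewhere.
// SIMD sine implementations evaluate both polynomials and pick the per-lane
// results this way instead of branching.
static inline __m128d select_pd(__m128d mask, __m128d a, __m128d b) {
  return _mm_or_pd(_mm_and_pd(mask, a), _mm_andnot_pd(mask, b));
}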

Testing Code

At the moment I'm mostly interested in double-precision scalar and vector implementations. I wrote my own vector sine that uses the cephes approach (two polynomials for different ranges of inputs), and took vecmathlib's implementation for precision and speed comparison. Initially, I compared the results of the tested functions against results obtained by calling the native sin() function, but since I started seeing differences between debug and release builds I decided to use a multi-precision library to calculate sine in much higher precision and compare against those results instead.

The following is my implementation, which is based on cephes's approach:

static SIMD_INLINE __m128d sin_cephes_pd(__m128d x) {
  SIMD_CONST_SQ(sign     , SIMD_UINT64_C(0x8000000000000000));
  SIMD_CONST_SQ(inv_sign , SIMD_UINT64_C(0x7FFFFFFFFFFFFFFF));
  SIMD_CONST_SI(int32_one, 1);
  SIMD_CONST_SD(4_DIV_PI , 1.27323954473516268615107010698);
  SIMD_CONST_SD(DP1      , 7.85398125648498535156e-1);
  SIMD_CONST_SD(DP2      , 3.77489470793079817668e-8);
  SIMD_CONST_SD(DP3      , 2.69515142907905952645e-15);

#define DEFINE_DATA(name, x0, x1, x2, x3, x4, x5, xm, xa, y0, y1, y2, y3, y4, y5, ym, ya) \
  SIMD_ALIGN_VAR(static const double, name[], 16) = { \
    x0, x0, x1, x1, x2, x2, x3, x3, x4, x4, x5, x5, xm, xm, xa, xa, \
    y0, x0, y1, x1, y2, x2, y3, x3, y4, x4, y5, x5, ym, xm, ya, xa, \
    x0, y0, x1, y1, x2, y2, x3, y3, x4, y4, x5, y5, xm, ym, xa, ya, \
    y0, y0, y1, y1, y2, y2, y3, y3, y4, y4, y5, y5, ym, ym, ya, ya  \
  }

  DEFINE_DATA(sincos_coeff,
    1.58962301576546568060e-10,-2.50507477628578072866e-8,
    2.75573136213857245213e-6 ,-1.98412698295895385996e-4,
    8.33333333332211858878e-3 ,-1.66666666666666307295e-1, 1.0, 0.0,

   -1.13585365213876817300e-11, 2.08757008419747316778e-9,
   -2.75573141792967388112e-7 , 2.48015872888517045348e-5,
   -1.38888888888730564116e-3 , 4.16666666666665929218e-2,-0.5, 1.0);

  __m128d y;
  __m128d sign = x;                              // Sign bit.

  x = _mm_and_pd(x, SIMD_GET_PD(inv_sign));      // Take the absolute value.
  y = _mm_mul_pd(x, SIMD_GET_PD(4_DIV_PI));      // Integer part of `x * 4 / PI`.

  __m128i ival = _mm_cvttpd_epi32(y);            // Extract the integer part of y.
  __m128i ione = SIMD_GET_PI(int32_one);

  ival = _mm_add_epi32(ival, ione);              // j += 1.
  ival = _mm_andnot_si128(ione, ival);           // j &=~1.

  y = _mm_cvtepi32_pd(ival);
  ival = _mm_unpacklo_epi32(ival, ival);

  sign = _mm_xor_pd(sign,             
    _mm_castsi128_pd(_mm_slli_epi64(ival, 61))); // Swap the sign bit if `j & 4`.
  sign = _mm_and_pd(sign, SIMD_GET_PD(sign));    // Keep only the sign bit.

  // Get the polynomial selection mask (j & 2):
  //   1. `0x0000000000000000` => `0    <= x <= PI/4`
  //   2. `0xFFFFFFFFFFFFFFFF` => `PI/4 <  x <= PI/2`
  ival = _mm_slli_epi32(ival, 30);
  ival = _mm_srai_epi32(ival, 31);

  // Extended precision modular arithmetic:
  //   x = ((x - y * DP1) - y * DP2) - y * DP3
  x = _mm_sub_pd(x, _mm_mul_pd(y, SIMD_GET_PD(DP1)));
  x = _mm_sub_pd(x, _mm_mul_pd(y, SIMD_GET_PD(DP2)));
  x = _mm_sub_pd(x, _mm_mul_pd(y, SIMD_GET_PD(DP3)));

  // Get the polynomial coefficients for each lane (sin/cos).
  __m128d poly_mask = _mm_castsi128_pd(ival);
  const __m128d* coeff = reinterpret_cast<const __m128d*>(sincos_coeff) +
    static_cast<uintptr_t>(_mm_movemask_pd(poly_mask)) * 8;

  __m128d xx = _mm_mul_pd(x, x);
  y = coeff[0];
  y = Simd128::mad(y, xx, coeff[1]);
  y = Simd128::mad(y, xx, coeff[2]);
  y = Simd128::mad(y, xx, coeff[3]);
  y = Simd128::mad(y, xx, coeff[4]);
  y = Simd128::mad(y, xx, coeff[5]);
  y = _mm_mul_pd(y, xx);

  __m128d x_or_xx = _mm_or_pd(
    _mm_and_pd(xx, poly_mask),
    _mm_andnot_pd(poly_mask, x));

  y = _mm_mul_pd(y, x_or_xx);
  y = _mm_add_pd(y, _mm_mul_pd(x_or_xx, coeff[6]));
  y = _mm_add_pd(y, coeff[7]);

  return _mm_xor_pd(y, sign);
}

The following code uses the vecmathlib approach:

static SIMD_INLINE __m128d sin_vecmathlib_pd(__m128d x) {
  SIMD_CONST_SD(1_PI , 0.318309886183790671538);
  SIMD_CONST_SD(PI4_A,-3.14159265160560607912);
  SIMD_CONST_SD(PI4_B,-1.98418715485759733496e-9);
  SIMD_CONST_SD(PI4_C,-4.5034835412693155724e-18);
  SIMD_CONST_SD(PI4_D,-7.0431197303664003632e-27);

  SIMD_CONST_SD(sin_0,-7.97255955009037868891952e-18);
  SIMD_CONST_SD(sin_1, 2.81009972710863200091251e-15);
  SIMD_CONST_SD(sin_2,-7.64712219118158833288484e-13);
  SIMD_CONST_SD(sin_3, 1.60590430605664501629054e-10);
  SIMD_CONST_SD(sin_4,-2.50521083763502045810755e-08);
  SIMD_CONST_SD(sin_5, 2.75573192239198747630416e-06);
  SIMD_CONST_SD(sin_6,-1.98412698412696162806809e-04);
  SIMD_CONST_SD(sin_7, 8.33333333333332974823815e-03);
  SIMD_CONST_SD(sin_8,-1.66666666666666657414808e-01);

  SIMD_CONST_SD(magic, 6755399441055744.0);

  __m128d q = Simd128::m128roundeven(
    _mm_mul_pd(x, SIMD_GET_PD(1_PI)));

  __m128d i = _mm_castsi128_pd(
    _mm_slli_epi64(
      _mm_castpd_si128(_mm_add_pd(q, SIMD_GET_PD(magic))), 63));

  x = Simd128::mad(q, SIMD_GET_PD(PI4_A), x);
  x = Simd128::mad(q, SIMD_GET_PD(PI4_B), x);
  x = Simd128::mad(q, SIMD_GET_PD(PI4_C), x);
  x = Simd128::mad(q, SIMD_GET_PD(PI4_D), x);

  __m128d xx = _mm_mul_pd(x, x);
  x = _mm_xor_pd(x, i);

  __m128d u = SIMD_GET_PD(sin_0);
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_1));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_2));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_3));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_4));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_5));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_6));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_7));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_8));
  u = Simd128::mad(xx, _mm_mul_pd(u, x), x);

  return u;
}

The functions above are used in the following way to calculate an array of sines:

void sin_array_pd(double* dst, const double* src, size_t length) {
  size_t i = length;

  while (i) {
    // Handle a single value when `dst` is not 16-byte aligned or only one value remains.
    if (!SimdUtils::isAligned(dst, 16) || i == 1) {
      __m128d d = _mm_load_sd(src);
      _mm_store_sd(dst, sin_impl(d));

      dst++;
      src++;

      if (--i == 0)
        break;
    }

    // Main loop - process two values at a time (unaligned load, aligned store).
    while (i >= 2) {
      __m128d d = _mm_loadu_pd(src);
      _mm_store_pd(dst, sin_impl(d));

      dst += 2;
      src += 2;
      i -= 2;
    }
  }
}

A pure C implementation looks like this (giving the compiler a chance to vectorize the loop):

void sin_array_math_h(double* dst, const double* src, size_t length) {
  for (size_t i = 0; i < length; i++)
    dst[i] = ::sin(src[i]);
}

And finally, here is a multi-precision sine, which uses the LolRemez engine:

void sin_array_precise(double* dst, const double* src, size_t length) {
  for (size_t i = 0; i < length; i++)
    dst[i] = lol::sin(lol::real(src[i]));
}

Results

The test application first verifies each vector implementation and then benchmarks it. During verification it calculates the maximum ULP error; SumUlp is the sum of the ULP errors of all results.
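
For reference, the ULP distance between a result and the reference value can be computed roughly like below - a simplified sketch that assumes finite values of the same sign, not the exact code used by the test application:

#include <cstdint>
#include <cstring>

// Reinterpret the doubles as ordered 64-bit integers and take the difference.
// Valid for finite values of the same sign only.
static uint64_t ulpDistance(double a, double b) {
  uint64_t ua, ub;
  std::memcpy(&ua, &a, sizeof(ua));
  std::memcpy(&ub, &b, sizeof(ub));
  return ua > ub ? ua - ub : ub - ua;
}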

[TRIGO] Test 1000000 values in domain
  MIN=-3.14159
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err=317248/1000000 MaxUlp=2 SumUlp=319132 MaxEps=2.2204460492503131e-16]
[ERROR] IMPL=sin (Cephes-SSE2)  (Input=-3.1415926535897931)
  ref(-1.2246467991473532e-16) !=
  res(-1.2246467991473515e-16) (ulp=7 | epsilon=1.7256332301709633e-31)
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=114265/1000000 MaxUlp=7 SumUlp=114271 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=201188/1000000 MaxUlp=2 SumUlp=201344 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.760 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.087 s]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=3.14159
[CHECK] IMPL=sin (math.h)       [Err=317091/1000000 MaxUlp=2 SumUlp=318853 MaxEps=2.2204460492503131e-16]
[ERROR] IMPL=sin (Cephes-SSE2)  (Input=3.1415926535897931)
  ref(1.2246467991473532e-16) !=
  res(1.2246467991473515e-16) (ulp=7 | epsilon=1.7256332301709633e-31)
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=113862/1000000 MaxUlp=7 SumUlp=113868 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=200971/1000000 MaxUlp=2 SumUlp=201124 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.525 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.759 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.071 s]

[TRIGO] Test 1000000 values in domain
  MIN=-100
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err=348528/1000000 MaxUlp=2 SumUlp=351145 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165038/1000000 MaxUlp=1 SumUlp=165038 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299594/1000000 MaxUlp=2 SumUlp=300467 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.759 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.087 s]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=100
[CHECK] IMPL=sin (math.h)       [Err=348356/1000000 MaxUlp=2 SumUlp=350857 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165389/1000000 MaxUlp=1 SumUlp=165389 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299318/1000000 MaxUlp=2 SumUlp=300179 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.744 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.071 s]

[TRIGO] Test 1000000 values in domain
  MIN=100
  MAX=10000
[CHECK] IMPL=sin (math.h)       [Err=351311/1000000 MaxUlp=2 SumUlp=354196 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=166589/1000000 MaxUlp=1 SumUlp=166589 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=310995/1000000 MaxUlp=2 SumUlp=312241 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.744 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.071 s]

The output basically says that the C implementation (VS2015) is the fastest one, but not the most precise (within 2 ULPs). The trick is that the VS compiler has its own SSE2 vector implementation exported as __vdecl_sin2(). However, it seems that this function is not fully compatible with VS's own scalar sin(), as shown below. Furthermore, the implementation that uses the Cephes approach is only slightly slower than the native implementation and is generally much more precise (all outputs except one are within 1 ULP). The VML implementation is the slowest one, but produces outputs within 2 ULPs of the precise implementation.

The following output was obtained from a debug build of the test application, without benchmarks:

[TRIGO] Test 1000000 values in domain
  MIN=-3.14159
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err= 33544/1000000 MaxUlp=1 SumUlp=33544 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=114265/1000000 MaxUlp=7 SumUlp=114271 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=201188/1000000 MaxUlp=2 SumUlp=201344 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=3.14159
[CHECK] IMPL=sin (math.h)       [Err= 33329/1000000 MaxUlp=1 SumUlp=33329 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=113862/1000000 MaxUlp=7 SumUlp=113868 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=200971/1000000 MaxUlp=2 SumUlp=201124 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=-100
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err= 34408/1000000 MaxUlp=1 SumUlp=34408 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165038/1000000 MaxUlp=1 SumUlp=165038 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299594/1000000 MaxUlp=2 SumUlp=300467 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=100
[CHECK] IMPL=sin (math.h)       [Err= 34528/1000000 MaxUlp=1 SumUlp=34528 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165389/1000000 MaxUlp=1 SumUlp=165389 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299318/1000000 MaxUlp=2 SumUlp=300179 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=100
  MAX=10000
[CHECK] IMPL=sin (math.h)       [Err= 33902/1000000 MaxUlp=1 SumUlp=33902 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=166589/1000000 MaxUlp=1 SumUlp=166589 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=310995/1000000 MaxUlp=2 SumUlp=312241 MaxEps=2.2204460492503131e-16]

Well, that sucks! VS has two sine implementations that give different results for the same inputs, and there is basically no control over which one gets used. What I have observed is that debug builds simply don't use the vector implementation at all. Also note that this is not an x87 vs SSE2 problem as the application is 64-bit. The difference between DEBUG and RELEASE builds is shown below:

[TRIGO] Test 1000000 values in domain
  MIN=-3.14159
  MAX=0
[CHECK] IMPL=sin (math.h REL)   [Err=317248/1000000 MaxUlp=2 SumUlp=319132 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 33544/1000000 MaxUlp=1 SumUlp= 33544 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=3.14159
[CHECK] IMPL=sin (math.h REL)   [Err=317091/1000000 MaxUlp=2 SumUlp=318853 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 33329/1000000 MaxUlp=1 SumUlp= 33329 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=-100
  MAX=0
[CHECK] IMPL=sin (math.h REL)   [Err=348528/1000000 MaxUlp=2 SumUlp=351145 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 34408/1000000 MaxUlp=1 SumUlp= 34408 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=100
[CHECK] IMPL=sin (math.h REL)   [Err=348356/1000000 MaxUlp=2 SumUlp=350857 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 34528/1000000 MaxUlp=1 SumUlp= 34528 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=100
  MAX=10000
[CHECK] IMPL=sin (math.h REL)   [Err=351311/1000000 MaxUlp=2 SumUlp=354196 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 33902/1000000 MaxUlp=1 SumUlp= 33902 MaxEps=1.1102230246251565e-16]

Robustness

What we have seen so far were tests with inputs no greater than 1e4. That was on purpose, because these are safe inputs that are easily reduced to the 0..PI range. The range reduction process is basically the most complicated part of a robust sine implementation. PI is an irrational number, which makes it tricky to range-reduce an arbitrary number to 0..PI. Check out fdlibm's e_rem_pio2.c and k_rem_pio2.c to see how the range reduction actually works. The same approach can be found in the V8 engine in their fdlibm.cc, and so on. Basically everybody uses the code written by Sun more than 20 years ago. So, what happens if we pass huge numbers to the sine implementations provided in this post?

[TRIGO] Test 1000000 values in domain
  MIN=100000
  MAX=1.68663e+09
[CHECK] IMPL=sin (math.h)       [Err=222251/1000000 MaxUlp=2 SumUlp=223913 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=166185/1000000 MaxUlp=1 SumUlp=166185 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=768543/1000000 MaxUlp=140737488387297 SumUlp=1926704344978426 MaxEps=1.1920928956520792e-07]
[BENCH] IMPL=sin (math.h)       [14.462 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.807 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.134 s]

[TRIGO] Test 1000000 values in domain
  MIN=1.68663e+09
  MAX=1e+100
[CHECK] IMPL=sin (math.h)       [Err=1252/1000000 MaxUlp=1 SumUlp=1252 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=999999/1000000 MaxUlp=4698695118786935215 SumUlp=11457351783800485383 MaxEps=inf]
[CHECK] IMPL=sin (VML-SSE2)     [Err=999999/1000000 MaxUlp=13912998465449088384 SumUlp=15519466227126556199 MaxEps=inf]
[BENCH] IMPL=sin (math.h)       [26.317 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.853 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.165 s]

The results clearly show that the C implementation should always behave correctly and that the other implementations may simply fail to deliver safe results. They also show that the C implementation uses a multi-precision range reduction (maybe derived from fdlibm), because it starts getting slower with huge inputs. The Cephes implementation works in our case for inputs up to 1.68663e+9, which is really good. VML starts failing at 9.8999887e+7, which is still quite good.
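
To illustrate why a single-step reduction eventually breaks down, consider the naive approach below (a sketch, not code from any of the tested implementations): subtracting k*PI in one step cancels most of the significant bits when x is huge, which is exactly why the Cephes code above splits PI into DP1/DP2/DP3 and why fdlibm switches to a multi-precision algorithm:

#include <cmath>

// Naive range reduction: fine for small |x|, catastrophic for huge |x| because
// PI is stored with only 53 bits of precision and the subtraction cancels most
// of the remaining significant digits.
static double naiveReduce(double x) {
  double k = std::nearbyint(x / 3.14159265358979323846);
  return x - k * 3.14159265358979323846;
}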

Conclusion

I don't have one! My favorite implementation is the cephes one (of course with proper range reduction so it can handle all inputs), because most of the time the maximum error stays within 1 ULP and it's also faster than VML. What has delayed me from using it is the discovery of the problematic input x=3.1415926535897931, which results in a 7 ULP error (epsilon=1.1102230246251565e-16).

The code I used for testing is available at my simdtests repo.