15 August, 2016

AsmJit & AVX-512

The World of Prefixes

The X86 architecture is known for its prefixes. It's no surprise that AVX-512 adds another one to the family: EVEX. Let's summarize the last 4 prefixes introduced in X86 (counting the two VEX forms separately):

  • VEX - 2-byte (VEX2) and 3-byte (VEX3) prefixes initially designed by Intel to encode AVX instructions, but now also used by other CPU extensions like BMI and BMI2. VEX2 was designed to make some instructions 1 byte shorter than with VEX3, but its usage is quite limited.
  • XOP - 3-byte prefix designed by AMD to support their XOP extensions without interfering with the existing VEX prefix. XOP was never adopted by Intel, and AMD will not support it in their new Zen processors (together with other extensions like FMA4). It's a dead end, dead silicon, and dead code that supports this prefix.
  • EVEX - 4-byte prefix designed by Intel to support 512-bit wide vectors and 32 vector registers. Each AVX-512 instruction that works with vector registers uses this prefix. Many AVX and AVX2 instructions can be encoded with this new prefix as well. There are several exceptions, but covering them would require a separate post.
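To make the byte counts above concrete, here is a minimal sketch that maps the leading byte of each prefix to its total length. The function name `prefixSize` is hypothetical and not part of AsmJit. The sketch assumes 64-bit mode and looks at the first byte only; a real decoder must also inspect the following byte, for example to tell XOP apart from the legacy `POP` opcode that shares 0x8F.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an AsmJit API): total prefix length in bytes for a
// given leading byte, or 0 if the byte does not start one of these prefixes.
// Assumption: 64-bit mode, first byte only (no check of the second byte).
constexpr std::size_t prefixSize(std::uint8_t first) {
  return first == 0xC5 ? 2   // VEX2
       : first == 0xC4 ? 3   // VEX3
       : first == 0x8F ? 3   // XOP
       : first == 0x62 ? 4   // EVEX
       : 0;
}

static_assert(prefixSize(0xC5) + 1 == prefixSize(0xC4), "VEX2 is 1 byte shorter than VEX3");
static_assert(prefixSize(0x62) == 4, "EVEX is a 4-byte prefix");
```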

AVX-512 Status in AsmJit

AVX-512 support in AsmJit is mostly finished. AsmJit's instruction database now contains all AVX-512 instructions together with the older AVX and AVX2 instructions. The reorganization of the instruction database and X86Assembler was quite drastic. AsmJit now contains a single path to encode a VEX, XOP, or EVEX instruction, which greatly simplified the logic in the assembler. XOP encoding IDs are no longer needed, as each instruction now contains VEX, XOP, and EVEX bits. These bits instruct the encoder to use the correct prefix.

Encoder Improvements

The previous encoder was designed to encode each byte of the VEX prefix separately and then write the result byte by byte into the destination buffer. This design was fairly simple and, according to my benchmarks, also very fast. However, this approach seemed infeasible for the new EVEX prefix, which contains 3 bytes of payload (instead of two): the encoder must check all 3 bytes before it can decide whether to emit a VEX or an EVEX prefix. The new code works differently. It uses a single 32-bit integer that represents the whole EVEX prefix and then decides whether to use EVEX or VEX by checking specific bits in it. If any of the checked bits is '1', the instruction is EVEX-only. This guarantees that the EVEX prefix is never used for a legacy AVX instruction and that the best encoding (the shortest prefix) is always chosen. AsmJit allows overriding this decision with the `evex()` option, which instructs the encoder to emit an EVEX prefix, and similarly supports the `vex3()` option, which instructs the encoder to emit a 3-byte VEX prefix instead of the shorter 2-byte one. EVEX wins if both `evex()` and `vex3()` are specified.
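The precedence between the operand-driven choice and the two override options can be sketched as a small decision function. This is only an illustration of the rules stated above, not AsmJit's implementation: `Prefix`, `Options`, and `choosePrefix` are hypothetical names, and the two `needs...` flags stand in for the bit checks the real encoder performs on its packed integer.

```cpp
// Hypothetical types for illustration only; none of these are AsmJit APIs.
enum class Prefix { Vex2, Vex3, Evex };

struct Options {
  bool forceEvex;  // Models the effect of the `evex()` option.
  bool forceVex3;  // Models the effect of the `vex3()` option.
};

// needsEvex / needsVex3: what the operands and the instruction definition
// require on their own (assumed to be computed elsewhere).
constexpr Prefix choosePrefix(bool needsEvex, bool needsVex3, Options o) {
  return (needsEvex || o.forceEvex) ? Prefix::Evex   // EVEX wins over everything.
       : (needsVex3 || o.forceVex3) ? Prefix::Vex3
       : Prefix::Vex2;                               // Shortest prefix by default.
}

static_assert(choosePrefix(false, false, Options{false, false}) == Prefix::Vex2,
              "shortest prefix is used when nothing requires more");
static_assert(choosePrefix(false, false, Options{true, true}) == Prefix::Evex,
              "EVEX wins if both options are specified");
```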

A simplified version of AsmJit's VEX|EVEX encoder looks like this:

// Encode most of EVEX prefix, based on instruction operands and definition.
uint32_t x = EncodeMostEvex(...);             //  [........|zLL..aaa|Vvvvv..R|RBBmmmmm]
// Check if EVEX is required by checking:     x & [........|xx...xxx|x......x|.x.x....]
if (x & 0x00C78150U) {
  // Encode EVEX - uses most of `x`.
  // ... no more branches here - requires around 14 ops to finalize EVEX ...
  //                                                   _     ____    ____
  //                                              [zLLbVaaa|Wvvvv1pp|RBBR00mm|01100010].
}

// Not EVEX, prepare `x` for VEX2 or VEX3 (5 ops):[........|00L00000|0vvvv000|R0B0mmmm]
x |= ((opCode >> (kSHR_W_PP + 8)) & 0x8300U) | // [00000000|00L00000|Wvvvv0pp|R0B0mmmm]
     ((x      >> 11             ) & 0x0400U) ; // [00000000|00L00000|WvvvvLpp|R0B0mmmm]

// Check if VEX3 is needed by checking        x & [........|........|x.......|..x..x..]
if (x & 0x00008024U) {
  // Encode VEX3 or XOP.
  // ... no more branches here - requires around 7 ops to finalize VEX3 ...
  //                                                         ____    _ _
  //                                              [_OPCODE_|WvvvvLpp|R1Bmmmmm|VEX3|XOP].
} else {
  // Encode VEX2.
  // ... no more branches here - requires around 3 ops to finalize VEX2 ...
}

This means that AsmJit requires just a single branch to decide whether to use a VEX or an EVEX prefix, and another branch to decide between the 3-byte VEX|XOP prefix and the 2-byte VEX prefix. This is good news for everybody expecting high performance, as this approach is nearly as fast as the old AsmJit encoder, which didn't support AVX-512 at all. It took me some time and thinking to design this approach and to reorganize the instruction opcode database so that the initial EVEX prefix can be encoded quickly. My initial approach was around 25% slower than the old AsmJit; the final code (similar to the snippet shown above) is roughly 3-5% slower, which is pretty close to the old code. The new functionality is nontrivial, so I'm really happy with these numbers (and, to be honest, I would like to see some numbers from other assemblers).
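The single-branch EVEX test can be tried out in isolation. The sketch below reuses the mask value 0x00C78150 from the snippet; the bit positions are read off its layout comment `[........|zLL..aaa|Vvvvv..R|RBBmmmmm]`, and the constant names are my own, hypothetical labels rather than AsmJit identifiers.

```cpp
#include <cstdint>

// Mask taken from the snippet above; everything else is a hypothetical label.
constexpr std::uint32_t kEvexOnlyMask = 0x00C78150u;

constexpr std::uint32_t kBitZ      = 1u << 23;  // 'z': zeroing-masking.
constexpr std::uint32_t kBitL512   = 1u << 22;  // Upper 'L' bit: 512-bit vector length.
constexpr std::uint32_t kBitL256   = 1u << 21;  // Lower 'L' bit: 256-bit, VEX can express it.
constexpr std::uint32_t kShiftAaa  = 16;        // 'aaa': opmask register selector.
constexpr std::uint32_t kShiftVvvv = 11;        // 'vvvv': low 4 bits of the extra register.

// One AND plus one branch decides whether the instruction must use EVEX.
constexpr bool requiresEvex(std::uint32_t x) {
  return (x & kEvexOnlyMask) != 0;
}

// A 256-bit operation with a low 'vvvv' register and a low map value stays VEX.
static_assert(!requiresEvex(kBitL256 | (5u << kShiftVvvv) | 0x01u), "VEX is enough");
// 512-bit length, an opmask register, or zeroing each force EVEX.
static_assert(requiresEvex(kBitL512), "512-bit vectors need EVEX");
static_assert(requiresEvex(1u << kShiftAaa), "opmask registers need EVEX");
static_assert(requiresEvex(kBitZ), "zeroing-masking needs EVEX");
```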

It's also obvious from the code that the new approach is optimistic for EVEX: emitting EVEX instructions is much cheaper than emitting VEX|XOP instructions. This wasn't a goal but rather a consequence: all the bits that the EVEX prefix introduces must be checked in order to decide between VEX and EVEX, so AsmJit just puts most of these bits into the right position and performs only minor bit shuffling when converting to a prefix that uses fewer bytes (EVEX->VEX3->VEX2).
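The last step of that chain, choosing between VEX3 and VEX2, can be illustrated with a standalone sketch. It works on an explicit, hypothetical `VexFields` struct instead of AsmJit's packed integer, so it shows the encoding rule rather than the actual bit shuffling: VEX2 is only possible when X, B, and W are clear and the opcode map is 0F.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical field bundle for illustration; AsmJit keeps these bits packed
// in a single integer instead.
struct VexFields {
  bool r, x, b;        // Register-extension bits (true = register index 8..15).
  bool w;              // VEX.W.
  std::uint8_t mmmmm;  // Opcode map: 1 = 0F, 2 = 0F38, 3 = 0F3A.
  std::uint8_t vvvv;   // Extra source register index (0..15), not inverted.
  bool l;              // Vector length: false = 128-bit, true = 256-bit.
  std::uint8_t pp;     // Implied prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2.
};

// Writes the shortest VEX prefix into `out` (at least 3 bytes), returns its size.
inline std::size_t emitVex(const VexFields& f, std::uint8_t* out) {
  // Low 7 bits shared by the last byte of both forms: ~vvvv, L, pp.
  const unsigned tail = ((~unsigned(f.vvvv) & 0xFu) << 3) | (unsigned(f.l) << 2) | (f.pp & 0x3u);
  const unsigned notR = f.r ? 0x00u : 0x80u;  // R, X, and B are stored inverted.

  if (!f.x && !f.b && !f.w && f.mmmmm == 1) {
    out[0] = 0xC5;  // VEX2: [C5][~R ~vvvv L pp]
    out[1] = static_cast<std::uint8_t>(notR | tail);
    return 2;
  }
  out[0] = 0xC4;    // VEX3: [C4][~R ~X ~B mmmmm][W ~vvvv L pp]
  out[1] = static_cast<std::uint8_t>(notR | (f.x ? 0u : 0x40u) | (f.b ? 0u : 0x20u) | (f.mmmmm & 0x1Fu));
  out[2] = static_cast<std::uint8_t>((f.w ? 0x80u : 0u) | tail);
  return 3;
}

// Usage: the prefix of `vaddps ymm0, ymm1, ymm2` fits VEX2 (C5 F4), while
// `vaddps ymm0, ymm1, ymm8` sets B and therefore needs VEX3 (C4 C1 74).
inline void checkVaddps() {
  std::uint8_t buf[3] = {};
  VexFields f{false, false, false, false, 1, 1, true, 0};
  assert(emitVex(f, buf) == 2 && buf[0] == 0xC5 && buf[1] == 0xF4);
  f.b = true;
  assert(emitVex(f, buf) == 3 && buf[0] == 0xC4 && buf[1] == 0xC1 && buf[2] == 0x74);
}
```

The only difference between the two forms is where R, X, B, the map, and W live, which is why dropping from VEX3 to VEX2 costs just a few operations once the shared bits are already in place.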

Future Work

Future posts will be about a new emitter called CodeBuilder. In short, it allows emitting instructions into a representation that can be processed afterwards. This representation has been in AsmJit from the beginning as a part of Compiler. Compiler has many high-level features that some people don't need, so it was split into CodeBuilder, which can be used exactly like Assembler, and CodeCompiler, which keeps all the high-level features.

4 comments:

  1. Any plans to use the recently released Intel XED encoder/decoder library (https://github.com/intelxed/xed)? IMHO it could potentially reduce the effort needed to support future ISA extensions.

  2. Hey, no plans for that. AsmJit will stay independent and lightweight. It already contains a lot of data about instructions, and it's not a problem to add more, so no XED needed.

  3. The main point of using XED is that it is officially supported by Intel. It means that the support for new instruction sets will be available the same day the instruction set is published - IMHO it is a big deal.

  4. The main point of using AsmJit is that it adds only about 200kB to the product. I'm sure there are projects that can benefit from XED, but I see no reason why I should use it. I already have the DB, so there is no reason to start rewriting it.
