25 May, 2017

AVX512 {er} and {sae} Quirks

Embedded-Rounding and Suppress-All-Exceptions

Today I finally implemented all the missing (and TODO-marked) features of AVX512 instruction encoding in AsmJit's X86Assembler (along with a parser update in AsmTK), and I found an interesting issue regarding the encoding of {sae} (suppress-all-exceptions) in EVEX-encoded instructions. The EVEX prefix uses many fields to describe the operation; the most important ones for this post are:

  • 'b' (broadcast bit) - if set and the instruction fetches from memory, it selects a broadcast; otherwise it specifies embedded-rounding {er} mode or suppress-all-exceptions {sae} mode.
  • 'LL' (vector-length) - 2 bits that select between 128-bit, 256-bit, and 512-bit operation (extensible to 1024 bits in the future).

The problem is that when embedded-rounding {er} is used, the LL field specifies the rounding mode instead of the vector length, which is then assumed to be 512 bits (as if LL were 0b10). This is the main reason why the Intel manual only allows {er}/{sae} in instructions that operate on either 512-bit vectors or scalars (scalars don't use the LL field). However, the Intel manual also says that if {sae} is used (which uses the same 'b' bit as {er}, but doesn't use the LL field to specify a rounding mode), the LL field should still describe the vector length. I checked what C++ compilers output in that case and found that GCC and Clang/LLVM set the LL field to zero when {sae} is used, but the Intel Compiler (ICC) doesn't. This is confusing, as the Intel manual doesn't say anything about ignoring the LL field when executing instructions that use {sae}.
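The rules above can be sketched as a small lookup. This is only an illustration of how I read the manual, not AsmJit code: the function name `interpret_ll` and its parameters are made up for this post, and the rounding-mode order (rn, rd, ru, rz) follows the usual EVEX rounding-control numbering.

```python
# Hypothetical helper (not part of AsmJit): what the two LL bits mean
# depending on the 'b' bit and the kind of instruction.
VECTOR_LENGTH = {0b00: "128-bit", 0b01: "256-bit", 0b10: "512-bit", 0b11: "reserved"}
ROUNDING_MODE = {0b00: "rn-sae", 0b01: "rd-sae", 0b10: "ru-sae", 0b11: "rz-sae"}


def interpret_ll(ll, b, has_memory_operand, supports_er):
    """Describe how the LL field is read for one instruction instance."""
    if b and has_memory_operand:
        # 'b' selects a broadcast, LL keeps its vector-length meaning.
        return "broadcast, vector length " + VECTOR_LENGTH[ll]
    if b and supports_er:
        # 'b' selects embedded rounding, LL becomes the rounding mode.
        return "rounding mode {" + ROUNDING_MODE[ll] + "}, vector length implied"
    if b:
        # 'b' selects {sae}; going by the manual's wording, LL should
        # still describe the vector length.
        return "{sae}, vector length " + VECTOR_LENGTH[ll]
    return "vector length " + VECTOR_LENGTH[ll]


# vcmppd supports {sae} but not {er}:
print(interpret_ll(0b10, 1, False, False))  # LL as ICC emits it
print(interpret_ll(0b00, 1, False, False))  # LL as GCC and Clang emit it
```

Read literally, the second call describes a 128-bit operation, which is exactly why the GCC and Clang output looks odd.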

I created an online playground that can be used to quickly check the output of various C++ compilers. The instruction that uses {sae} is vcmppd (shown as vcmpeqpd) and here is a short comparison:

; EVEX - EVEX prefix (4 bytes).
; Op   - Instruction's opcode.
; Mr   - ModR/M byte.
; Pr   - Comparison predicate.

; Instruction and operands       ; |__EVEX__|Op|Mr|Pr| Compiler and comment
  vcmpeqpd k0, zmm0, zmm1, {sae} ; |62F1FD58|C2|C1|00| ICC   (uses k0)
  vcmpeqpd k1, zmm0, zmm1, {sae} ; |62F1FD18|C2|C9|00| GCC   (uses k1, clears LL)
  vcmpeqpd k0, zmm0, zmm1, {sae} ; |62F1FD18|C2|C1|00| Clang (uses k0, clears LL)

Let's decompose the bytes into individual fields:


; Instruction:
;   vcmppd k {kz}, zmm, zmm/m512/b64, ub {sae}
;
; Encoding:
;   [RVMI-FV] EVEX.NDS.512.66.0F.W1 C2 /r ib
;          ____      ____        _             
; __EVEX__|RBBR00mm|Wvvvv1pp|zLLbVaaa| OpCode | ModR/M |CompPred| Compiler
; 01100010|11110001|11111101|01011000|11000010|11000001|00000000| ICC
; 01100010|11110001|11111101|00011000|11000010|11001001|00000000| GCC
; 01100010|11110001|11111101|00011000|11000010|11000001|00000000| Clang
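The same decomposition can be done mechanically. The sketch below is a throwaway decoder written for this post (`decode_evex` and `differing_fields` are hypothetical names, not AsmJit API). It returns the bits as stored, without undoing the inversion, and it names the second bit of the second byte X, as Intel does, where the diagram above draws it as the first 'B'.

```python
def decode_evex(code):
    """Split a 4-byte EVEX prefix into named fields (bits as stored)."""
    if code[0] != 0x62:
        raise ValueError("not an EVEX prefix")
    p0, p1, p2 = code[1], code[2], code[3]
    return {
        "R": (p0 >> 7) & 1,
        "X": (p0 >> 6) & 1,
        "B": (p0 >> 5) & 1,
        "R'": (p0 >> 4) & 1,
        "mm": p0 & 0b11,
        "W": (p1 >> 7) & 1,
        "vvvv": (p1 >> 3) & 0b1111,
        "pp": p1 & 0b11,
        "z": (p2 >> 7) & 1,
        "LL": (p2 >> 5) & 0b11,
        "b": (p2 >> 4) & 1,
        "V'": (p2 >> 3) & 1,
        "aaa": p2 & 0b111,
    }


def differing_fields(x, y):
    """Return {field: (value_in_x, value_in_y)} for prefix fields that differ."""
    fx, fy = decode_evex(x), decode_evex(y)
    return {name: (fx[name], fy[name]) for name in fx if fx[name] != fy[name]}


# Byte sequences copied from the comparison above.
icc = bytes.fromhex("62F1FD58C2C100")
gcc = bytes.fromhex("62F1FD18C2C900")
clang = bytes.fromhex("62F1FD18C2C100")

print(differing_fields(icc, gcc))    # only LL differs inside the prefix
print(differing_fields(gcc, clang))  # identical prefixes
```

The ModR/M difference between GCC and the other two is outside the prefix, so it doesn't show up here.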

Now the differences should be clear: only the LL field and the ModR/M field that encodes the 'k' register index differ. For those who cannot read the encoding, here is a short how-to on reading the registers:


; Instruction:
;   vcmppd k {kz}, zmm, zmm/m512/b64, ub {sae}
;
; Encoding:
;   [RVMI-FV] EVEX.NDS.512.66.0F.W1 C2 /r ib
;
; RVMI specifies how registers are encoded, in order, we name them 'a', 'b', and 'c':
;
;   [R|V|M|I] (I == Immediate value)
;   [a|b|c|.]
;
;   'a' - R field in ModR/M (2 bits in EVEX/R'R and 3 bits in ModR/M.reg)
;   'b' - V field in EVEX   (5 bits in EVEX/V'vvvv)
;   'c' - M field in ModR/M (2 bits in EVEX/B'B and 3 bits in ModR/M.rm)
;
; Registers in 'vcmpeqpd a:k, b:zmm, c:zmm, {sae}':
;          ____      ____        _             
; ........|acca....|.bbbb...|.LL.b...|........|..aaaccc|........| Compiler
; 01100010|11110001|11111101|01011000|11000010|11000001|00000000| ICC
; 01100010|11110001|11111101|00011000|11000010|11001001|00000000| GCC
; 01100010|11110001|11111101|00011000|11000010|11000001|00000000| Clang
;
;     __
; a = 11000 -> 00000 -> k0
;     __
;     11001 -> 00001 -> k1
;     _____
; b = 11111 -> 00000 -> zmm0
;     __
; c = 11001 -> 00001 -> zmm1
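The register reading above can also be written as a few lines of code. This is a minimal sketch that assumes the exact layout used in this post (4-byte EVEX prefix, one opcode byte, then ModR/M, register-register form only). `register_indices` is a hypothetical name, not something AsmJit provides.

```python
def register_indices(code):
    """Return the (a, b, c) register indices of an RVM(I) register form."""
    p0, p1, p2, modrm = code[1], code[2], code[3], code[5]
    if (modrm >> 6) != 0b11:
        raise ValueError("this sketch only handles register-register forms")

    def inverted(byte, shift):
        # The extension bits in the EVEX prefix are stored inverted.
        return ((byte >> shift) & 1) ^ 1

    # 'a': R' and R from the prefix, low 3 bits from ModR/M.reg.
    a = (inverted(p0, 4) << 4) | (inverted(p0, 7) << 3) | ((modrm >> 3) & 0b111)
    # 'b': V' and vvvv, all stored inverted.
    b = (inverted(p2, 3) << 4) | (((p1 >> 3) & 0b1111) ^ 0b1111)
    # 'c': the two extension bits drawn as B'B above, low 3 bits from ModR/M.rm.
    c = (inverted(p0, 6) << 4) | (inverted(p0, 5) << 3) | (modrm & 0b111)
    return a, b, c


print(register_indices(bytes.fromhex("62F1FD58C2C100")))  # ICC: k0, zmm0, zmm1
print(register_indices(bytes.fromhex("62F1FD18C2C900")))  # GCC: k1, zmm0, zmm1
```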

You can also check out the EVEX prefix article on Wikipedia.

What Should AsmJit Emit?

That's what I don't know! Even if 'LL' is really ignored, I would still keep it, as it describes the vector length (that's what ICC does). I will keep an eye on this and try to test it on real hardware when I get the chance.
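To make the two options concrete, here is a sketch of building the last EVEX prefix byte under either policy. It is not how AsmJit's encoder is written. The function `evex_p2` and its `clear_ll_on_sae` switch are invented for illustration only.

```python
def evex_p2(ll, sae, aaa=0, z=0, v_hi=0, clear_ll_on_sae=False):
    """Build the last EVEX prefix byte, laid out as z LL b V' aaa.

    ll   - vector length bits (0b10 for 512-bit)
    sae  - True if the instruction uses {sae}
    v_hi - bit 4 of the 'b' register index (stored inverted as V')
    """
    if sae and clear_ll_on_sae:
        ll = 0  # what GCC and Clang appear to do
    b = 1 if sae else 0
    v_inverted = (v_hi & 1) ^ 1
    return (z << 7) | (ll << 5) | (b << 4) | (v_inverted << 3) | aaa


# vcmpeqpd k, zmm0, zmm1, {sae}
print(hex(evex_p2(0b10, sae=True)))                        # keeps LL, like ICC
print(hex(evex_p2(0b10, sae=True, clear_ll_on_sae=True)))  # clears LL, like GCC/Clang
```

Keeping LL costs nothing in the encoder and leaves the vector length readable in the emitted bytes.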

Using k0 in a write operation

This article should also help with understanding how k0 can be used. AVX512 restricts its use only in EVEX's 'aaa' field, which encodes the write-mask {k1-k7}. A zero value disables the write-mask completely, so {k0} is not encodable in the EVEX prefix; however, k0 can be used by instructions that don't encode the k register in the 'aaa' field. This means that a register allocator can use k0 (with some limitations), and indeed ICC and Clang do.
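The restriction can be expressed as two tiny checks, one per place where a k register index can end up. Both function names are hypothetical and only illustrate the rule described above.

```python
def write_mask_field(k_index):
    """Return the 'aaa' bits for a {k} write-mask; 0 means 'no masking'."""
    if not 1 <= k_index <= 7:
        raise ValueError("only k1-k7 can be encoded as a write-mask")
    return k_index


def modrm_reg_field(k_index):
    """Return the ModR/M.reg bits for a k register used as a plain operand."""
    if not 0 <= k_index <= 7:
        raise ValueError("k register index out of range")
    return k_index


print(modrm_reg_field(0))   # k0 as the destination of vcmppd is fine
print(write_mask_field(1))  # {k1} as a write-mask is fine
# write_mask_field(0) would raise: {k0} cannot be expressed in 'aaa'
```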