13 February, 2017

PABSB|W|D|Q Without SSSE3

Packed Absolute Value

SSSE3 extensions introduced instructions for computing packed absolute values of 8-bit, 16-bit, and 32-bit integers. In this post I will show how to implement these in pure SSE2, and how to implement a missing pabsq (packed absolute value of 64-bit integers), which is not provided until AVX512-F.

Straightforward Implementation

Let's start with a straightforward implementation in C first:

inline uint32_t abs32(int32_t x) {
  return x >= 0 ? uint32_t(x) : uint32_t(-x);
}

Although it contains branches C++ compilers are usually able to recognize such code and create an optimized branch-less version of it. If you think about a possible branch-less solution you must understand how negation in 2s complement arithmetic works. The code -x is equivalent to 0-x, which is equivalent to ~x + 1. Now we know how to change a sign of some integer, however, what absolute value does is changing the sign only if the input is negative. Since all negative numbers in 2s complement arithmetic have the most significant bit set to 1 we can use arithmetic shift to get a mask (all zeros or all ones), which can be then used to negate all bits of the original value. The remaining addition of 1 can be turned into a subtraction of -1 (as -1 is represented as all ones in 2s complement arithmetic). Thus, we can rewrite the original code to (x ^ mask) - mask, which would do nothing if mask is zero, and negate the input if mask is all ones.

A branch-less implementation of the previous code would look like:

inline uint32_t abs32(int32_t x) {
  // NOTE: x >> y must be translated to an arithmetic shift here...
  uint32_t mask = uint32_t(x >> (sizeof(int32_t) * 8 - 1));
  return (uint32_t(x) ^ mask) - mask;
}

SSE2 Implementation

The C++ code can be directly translated to SSE2 for 16-bit and 32-bit integer sizes:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psraw  xmm7, 15              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubw  xmm0, xmm7            ; Add one if mask is all ones

; SSE2 compatible PABSD implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubd  xmm0, xmm7            ; Add one if mask is all ones

64-bit packed absolute value is trickier as there is no PSRAQ instruction in SSE2 (VPSRAQ was first introduced in AVX512-F), however, we can shuffle the input a bit and use PSRAD again:

; SSE2 compatible PABSQ implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
pshufd xmm7, xmm0, 0xF5      ; Like _MM_SHUFFLE(3, 3, 1, 1)
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubq  xmm0, xmm7            ; Add one if mask is all ones

These were straightforward translations based on the initial C++ code shown at the beginning of the post. However, there is a better way of implementing PABSW and there is also a way of implementing PABSB without any shifts (because there is no packed shift that operates on 8-bit entities). Since absolute value could be also written as max(x, -x) we can use packed min/max to implement PABSB and PABSW:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubw  xmm7, xmm0            ; Negate all input values
pmaxsw xmm0, xmm7            ; Select all positive values

; SSE2 compatible PABSB implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubb  xmm7, xmm0            ; Negate all input values
pminub xmm0, xmm7            ; Select all positive values

The PABSW implementation is straightforward and I have nothing to add, however, PABSB implementation is interesting as it workarounds the missing PMAXSB instruction (which was introduced in SSE4.1) by using PMINUB instead, which works for us based on the knowledge about both inputs (selecting the minimum unsigned value is the same as selecting the maximum signed value in our case, as we know that they are negations of each other).

Conclusion

Hope you enjoyed reading the post. I'm preparing a very small library for JIT code generation for asmjit that will have all of these tricks implemented and ready to use. Any wishes about next post? I was thinking about some pre-SSE4.1 rounding tricks (float|double), basically the same tricks I have used in MathPresso.

10 February, 2017

PMINUW and PMAXUW Without SSE4.1

Pminuw and Pmaxuw

SSE4.1 extension introduced a lot of instructions that I would say should have been part of the baseline SSE2. There are instructions that are hard to workaround, like pshufb, and there are also instructions that just complete the unfinished SSE2 instruction set, like pminuw (packed minimum of uint16_t) and pmaxuw (packed maximum of uint16_t). I have seen various workarounds for implementing these two, but since I'm working with JIT I always think about the best possible solution that:

  • Doesn't need more than one temporary register
  • Doesn't need constants, if possible
  • Is as short as possible

Existing Solutions

Before I started thinking of the best possible implementation I checked libsimdpp, which contains implementation of many post-SSE2 instructions. The min/max implementation can be found at i_min.h and i_max.h files. What libsimdpp does is to XOR the most significant bit (basically the sign-bit) of both inputs to prepare them being used by either pminsw or pmaxsw instruction. The problem with this approach is that it needs a constant (0x8000), two moves, three XORs, and one packed min or max. In other words, this is a lot of operations to do just packed minimum or maximum.

The machine code of that solution may look like:

; SSE2 compatible PMINUW|PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm1            ; Move xmm1 to temporary
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0
pxor xmm7, [constant_0x8000] ; Toggle the sign bit of xmm7 (temporary)
pminsw|pmaxsw xmm0, xmm7     ; Perform packed min|max
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0

Of course if the second operand (xmm1) is not required after the operation the temporary variable (and move) could be eliminated.

Is there a better way?

I have used a similar solution in the past, but I was never really happy with it. Today I tried to think harder about the problem and possible instructions that I can use and I have found the following approach - since SSE2 has PSUBUSW (packed subtract with unsigned saturation) I can use that instruction instead of three XORs and then subtract the result from the original value to get the packed unsigned minimum. This trick of course only works for packed uint16_t operations as X86 SIMD doesn't have instructions to perform packed saturated addition|subtraction of elements greater than 16 bits.

The machine code of this solution would look like:

; SSE2 compatible PMINUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psubusw xmm7, xmm1           ; Subtract with unsigned saturation
psubw xmm0, xmm7             ; Subtract (no saturation, would never overflow here)

Why it works? If we perform a-b with unsigned saturation we get either zero, which means that b is either equal or greater than a, or some non-zero value, which means that a is greater than b, and the value is their difference (unsigned). Based on these we can subtract that value from the original a and get our unsigned minimum.

The machine code of pmaxuw implementation would be much simpler:

; SSE2 compatible PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
psubusw xmm0, xmm1           ; Subtract with unsigned saturation
paddw xmm0, xmm1             ; Add (no saturation, would never overflow here)

In this last case (unsigned uint16_t maximum) we don't need any temporary. The possible difference in xmm0 is enough to reconstruct the maximum value based on the content of xmm1.

Conclusion

There is always a way to do something better. I always valuate solutions that need less temporaries and don't need extra constants. Sometimes these requirements are impossible, but sometimes other instructions that you may not think of may help. And... old CPUs would thank you for using this approach!