13 February, 2017

PABSB|W|D|Q Without SSSE3

Packed Absolute Value

The SSSE3 extension introduced instructions for computing packed absolute values of 8-bit, 16-bit, and 32-bit integers (PABSB, PABSW, and PABSD). In this post I will show how to implement them in pure SSE2, and how to implement the missing PABSQ (packed absolute value of 64-bit integers), which was not provided until AVX512-F.

Straightforward Implementation

Let's start with a straightforward implementation in C++ first:

inline uint32_t abs32(int32_t x) {
  // Negate as unsigned so that INT32_MIN does not overflow.
  return x >= 0 ? uint32_t(x) : uint32_t(0) - uint32_t(x);
}

Although it contains a branch, C++ compilers are usually able to recognize such code and generate an optimized branch-less version of it. To come up with a branch-less solution yourself you must understand how negation works in two's complement arithmetic. The expression -x is equivalent to 0 - x, which is equivalent to ~x + 1. Now we know how to change the sign of an integer; however, absolute value changes the sign only if the input is negative. Since all negative numbers in two's complement arithmetic have the most significant bit set to 1, we can use an arithmetic shift right by (bit width - 1) to get a mask (all zeros for non-negative inputs, all ones for negative inputs), which can then be XORed with the original value to conditionally invert all of its bits. The remaining addition of 1 can be turned into a subtraction of -1 (as -1 is represented as all ones in two's complement arithmetic), which is exactly the mask. Thus, we can rewrite the original code as (x ^ mask) - mask, which does nothing if mask is zero and negates the input if mask is all ones.

A branch-less implementation of the previous code would look like:

inline uint32_t abs32(int32_t x) {
  // NOTE: x >> y must be translated to an arithmetic shift here...
  uint32_t mask = uint32_t(x >> (sizeof(int32_t) * 8 - 1));
  return (uint32_t(x) ^ mask) - mask;
}
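To make the equivalence concrete, here is a small sketch that compares a branching version against the branch-less one on a few edge cases. The names abs32_ref, abs32_branchless, and check_abs32 are made up for this illustration, and the code assumes that x >> 31 compiles to an arithmetic shift (guaranteed since C++20, implementation-defined but near-universal before that).

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Branching reference. The negation is done on the unsigned value so that
// INT32_MIN does not trigger signed overflow.
inline uint32_t abs32_ref(int32_t x) {
  return x >= 0 ? uint32_t(x) : uint32_t(0) - uint32_t(x);
}

// Branch-less version using (x ^ mask) - mask.
// Assumption: x >> 31 is an arithmetic shift.
inline uint32_t abs32_branchless(int32_t x) {
  uint32_t mask = uint32_t(x >> 31);   // 0x00000000 or 0xFFFFFFFF
  return (uint32_t(x) ^ mask) - mask;  // ~x + 1 if negative, x otherwise
}

// Checks that both versions agree on a handful of edge cases.
inline void check_abs32() {
  const int32_t samples[] = {
    0, 1, -1, 42, -42,
    std::numeric_limits<int32_t>::max(),
    std::numeric_limits<int32_t>::min()
  };
  for (int32_t x : samples)
    assert(abs32_ref(x) == abs32_branchless(x));
}
```

Note that the most negative input has no positive counterpart in 32 bits, so both versions return 0x80000000 for it when the result is viewed as unsigned.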

SSE2 Implementation

The C++ code can be directly translated to SSE2 for 16-bit and 32-bit integer sizes:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psraw  xmm7, 15              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubw  xmm0, xmm7            ; Add one if mask is all ones

; SSE2 compatible PABSD implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubd  xmm0, xmm7            ; Add one if mask is all ones
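If you prefer intrinsics over assembly, the same two sequences could be written roughly as follows. This is only a sketch: the function names abs_epi16_sse2, abs_epi32_sse2, and check_abs_sse2 are invented here, while the intrinsics themselves come from <emmintrin.h>.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// PABSW equivalent: absolute value of eight signed 16-bit lanes.
inline __m128i abs_epi16_sse2(__m128i x) {
  __m128i mask = _mm_srai_epi16(x, 15);  // psraw: 0 or -1 per lane
  x = _mm_xor_si128(x, mask);            // pxor:  bit-not where mask is -1
  return _mm_sub_epi16(x, mask);         // psubw: add one where mask is -1
}

// PABSD equivalent: absolute value of four signed 32-bit lanes.
inline __m128i abs_epi32_sse2(__m128i x) {
  __m128i mask = _mm_srai_epi32(x, 31);  // psrad
  x = _mm_xor_si128(x, mask);            // pxor
  return _mm_sub_epi32(x, mask);         // psubd
}

// Small usage example with a few edge cases.
inline void check_abs_sse2() {
  const int16_t in16[8] = {0, 1, -1, 300, -300, 32767, -32767, -32768};
  const int32_t in32[4] = {0, -7, 2147483647, -2147483647};
  int16_t out16[8];
  int32_t out32[4];

  _mm_storeu_si128(reinterpret_cast<__m128i*>(out16),
    abs_epi16_sse2(_mm_loadu_si128(reinterpret_cast<const __m128i*>(in16))));
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out32),
    abs_epi32_sse2(_mm_loadu_si128(reinterpret_cast<const __m128i*>(in32))));

  assert(out16[0] == 0 && out16[2] == 1 && out16[4] == 300 && out16[6] == 32767);
  assert(out16[7] == -32768);  // 0x8000 has no positive counterpart
  assert(out32[0] == 0 && out32[1] == 7 && out32[3] == 2147483647);
}
```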

64-bit packed absolute value is trickier because there is no PSRAQ instruction in SSE2 (VPSRAQ was first introduced in AVX512-F). However, we can use PSHUFD to copy the high 32 bits of each 64-bit element into both halves of that element and then use PSRAD again:

; SSE2 compatible PABSQ implementation
;   xmm0 = in|out
;   xmm7 = temporary (mask)
pshufd xmm7, xmm0, 0xF5      ; Like _MM_SHUFFLE(3, 3, 1, 1)
psrad  xmm7, 31              ; Arithmetic shift right (creates the mask)
pxor   xmm0, xmm7            ; Bit-not if mask is all ones
psubq  xmm0, xmm7            ; Add one if mask is all ones
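The same sequence can be sketched with intrinsics. The names abs_epi64_sse2 and check_abs_epi64 are hypothetical; the shuffle constant is spelled with _MM_SHUFFLE so that it is easier to see which 32-bit elements get duplicated.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (also provides _MM_SHUFFLE)

// PABSQ equivalent: absolute value of two signed 64-bit lanes.
inline __m128i abs_epi64_sse2(__m128i x) {
  // Copy the high dword of each qword into both dwords of that qword.
  __m128i mask = _mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 1, 1));  // pshufd 0xF5
  mask = _mm_srai_epi32(mask, 31);  // psrad: 0 or -1 across the whole qword
  x = _mm_xor_si128(x, mask);       // pxor
  return _mm_sub_epi64(x, mask);    // psubq
}

// Small usage example: one value that needs more than 32 bits, one that does not.
inline void check_abs_epi64() {
  const int64_t in[2] = {-5000000000LL, 7};
  int64_t out[2];
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out),
    abs_epi64_sse2(_mm_loadu_si128(reinterpret_cast<const __m128i*>(in))));
  assert(out[0] == 5000000000LL);
  assert(out[1] == 7);
}
```

The shift only has to look at the high dword because that is where the sign bit of the 64-bit value lives; the low dword plays no role in building the mask.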

These were straightforward translations of the C++ code shown at the beginning of the post. However, there is a better way of implementing PABSW, and there is also a way of implementing PABSB without any shifts (which is necessary, because SSE2 has no packed shift that operates on 8-bit elements). Since absolute value can also be written as max(x, -x), we can use packed min/max to implement PABSB and PABSW:

; SSE2 compatible PABSW implementation
;   xmm0 = in|out
;   xmm7 = temporary (negated input)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubw  xmm7, xmm0            ; Negate all input values (0 - x)
pmaxsw xmm0, xmm7            ; Signed max(x, -x) is the absolute value

; SSE2 compatible PABSB implementation
;   xmm0 = in|out
;   xmm7 = temporary (negated input)
pxor   xmm7, xmm7            ; Zero xmm7 (temporary)
psubb  xmm7, xmm0            ; Negate all input values (0 - x)
pminub xmm0, xmm7            ; Unsigned min(x, -x) is the absolute value

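The PABSW implementation is straightforward and I have nothing to add. The PABSB implementation is more interesting, as it works around the missing PMAXSB instruction (which was introduced in SSE4.1) by using PMINUB instead. This works because of what we know about the two inputs: they are negations of each other, so whenever they differ one of them is positive (1 to 127 as an unsigned byte) and the other is negative (129 to 255 as an unsigned byte). Selecting the minimum unsigned value is therefore the same as selecting the maximum signed value. The remaining inputs, 0 and -128 (0x80), are their own negations, so both operands are equal and the result matches what PABSB would return.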
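The min/max variants can also be sketched with intrinsics, and because a byte has only 256 possible values the PABSB trick can be checked exhaustively against a scalar reference. The names abs_epi16_max_sse2, abs_epi8_min_sse2, and check_abs_epi8 are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// PABSW via max(x, -x) on signed 16-bit lanes.
inline __m128i abs_epi16_max_sse2(__m128i x) {
  __m128i neg = _mm_sub_epi16(_mm_setzero_si128(), x);  // pxor + psubw
  return _mm_max_epi16(x, neg);                         // pmaxsw
}

// PABSB via unsigned min(x, -x) on 8-bit lanes.
inline __m128i abs_epi8_min_sse2(__m128i x) {
  __m128i neg = _mm_sub_epi8(_mm_setzero_si128(), x);   // pxor + psubb
  return _mm_min_epu8(x, neg);                          // pminub
}

// Exhaustive check of all 256 byte values, 16 at a time.
inline void check_abs_epi8() {
  for (int base = -128; base < 128; base += 16) {
    int8_t in[16];
    uint8_t out[16];
    for (int i = 0; i < 16; i++)
      in[i] = int8_t(base + i);

    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), abs_epi8_min_sse2(v));

    for (int i = 0; i < 16; i++) {
      int x = in[i];  // widen first so that -(-128) is representable
      assert(out[i] == uint8_t(x < 0 ? -x : x));
    }
  }
}
```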

Conclusion

I hope you enjoyed reading the post. I'm preparing a very small library for JIT code generation for asmjit that will have all of these tricks implemented and ready to use. Any wishes for the next post? I was thinking about some pre-SSE4.1 rounding tricks (float|double), basically the same tricks I have used in MathPresso.
