10 February, 2017

PMINUW and PMAXUW Without SSE4.1

Pminuw and Pmaxuw

SSE4.1 extension introduced a lot of instructions that I would say should have been part of the baseline SSE2. There are instructions that are hard to workaround, like pshufb, and there are also instructions that just complete the unfinished SSE2 instruction set, like pminuw (packed minimum of uint16_t) and pmaxuw (packed maximum of uint16_t). I have seen various workarounds for implementing these two, but since I'm working with JIT I always think about the best possible solution that:

  • Doesn't need more than one temporary register
  • Doesn't need constants, if possible
  • Is as short as possible

Existing Solutions

Before I started thinking of the best possible implementation I checked libsimdpp, which contains implementation of many post-SSE2 instructions. The min/max implementation can be found at i_min.h and i_max.h files. What libsimdpp does is to XOR the most significant bit (basically the sign-bit) of both inputs to prepare them being used by either pminsw or pmaxsw instruction. The problem with this approach is that it needs a constant (0x8000), two moves, three XORs, and one packed min or max. In other words, this is a lot of operations to do just packed minimum or maximum.

The machine code of that solution may look like:

; SSE2 compatible PMINUW|PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm1            ; Move xmm1 to temporary
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0
pxor xmm7, [constant_0x8000] ; Toggle the sign bit of xmm7 (temporary)
pminsw|pmaxsw xmm0, xmm7     ; Perform packed min|max
pxor xmm0, [constant_0x8000] ; Toggle the sign bit of xmm0

Of course if the second operand (xmm1) is not required after the operation the temporary variable (and move) could be eliminated.

Is there a better way?

I have used a similar solution in the past, but I was never really happy with it. Today I tried to think harder about the problem and possible instructions that I can use and I have found the following approach - since SSE2 has PSUBUSW (packed subtract with unsigned saturation) I can use that instruction instead of three XORs and then subtract the result from the original value to get the packed unsigned minimum. This trick of course only works for packed uint16_t operations as X86 SIMD doesn't have instructions to perform packed saturated addition|subtraction of elements greater than 16 bits.

The machine code of this solution would look like:

; SSE2 compatible PMINUW implementation
;   xmm0 = in|out
;   xmm1 = in
;   xmm7 = temporary
movdqa xmm7, xmm0            ; Move xmm0 to temporary
psubusw xmm7, xmm1           ; Subtract with unsigned saturation
psubw xmm0, xmm7             ; Subtract (no saturation, would never overflow here)

Why it works? If we perform a-b with unsigned saturation we get either zero, which means that b is either equal or greater than a, or some non-zero value, which means that a is greater than b, and the value is their difference (unsigned). Based on these we can subtract that value from the original a and get our unsigned minimum.

The machine code of pmaxuw implementation would be much simpler:

; SSE2 compatible PMAXUW implementation
;   xmm0 = in|out
;   xmm1 = in
psubusw xmm0, xmm1           ; Subtract with unsigned saturation
paddw xmm0, xmm1             ; Add (no saturation, would never overflow here)

In this last case (unsigned uint16_t maximum) we don't need any temporary. The possible difference in xmm0 is enough to reconstruct the maximum value based on the content of xmm1.

Conclusion

There is always a way to do something better. I always valuate solutions that need less temporaries and don't need extra constants. Sometimes these requirements are impossible, but sometimes other instructions that you may not think of may help. And... old CPUs would thank you for using this approach!

No comments:

Post a Comment