18 April, 2016

Tricky _mm_set_sd()

[Or] How a C++ Compiler Can Ruin Your Optimization Efforts

Compiler intrinsics targeting the SSE and SSE2 instruction sets were historically used to get the compiler to generate better code for floating point operations. The reason was that some CPU architectures have a global floating-point unit (FPU) state that contains the rounding mode, precision, exception masks, and some other flags. This state basically prevented the compiler from optimizing operations like round(x), static_cast<int>(x), (x < y ? x : y), etc.
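
For example, on x86 without SSE2 a plain static_cast<int>(x) had to save, modify, and restore the x87 control word just to get truncation semantics, because x87 stores honor the current rounding mode. The SSE2 instruction cvttsd2si always truncates, which made the intrinsics-based pattern attractive. A minimal sketch (the helper name is mine):

#include <emmintrin.h>

static inline int TruncateToInt(double x) {
  // cvttsd2si truncates regardless of the global rounding mode,
  // so no FPU state has to be saved and restored around it.
  return _mm_cvttsd_si32(_mm_set_sd(x));
}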

Of course this situation was not performance-friendly, and many developers simply started using SSE+ intrinsics, once they became available, to solve their problems more efficiently. This approach was also blessed by some CPU manufacturers as the better way of dealing with the problem. Today's compilers are much smarter than before and some of these constructs may be considered obsolete. However, some still hold and are probably better than pure C/C++ code in certain cases, especially if you need the compiler to do something it won't do by default.

Scalar <-> SIMD Problem

The problem I faced is closely related to the conversion between float/double and SIMD registers. When SSE2 code generation is turned on these types are equivalent, and using the _mm_set_sd intrinsic should generate no code at all. However, that's not always true, as demonstrated by the snippet below:

#include <emmintrin.h> // SSE2 intrinsics; xmmintrin.h alone only covers SSE

double MyMinV1(double a, double b) {
  return a < b ? a : b;
}

double MyMinV2(double a, double b) {
  __m128d x = _mm_set_sd(a);               // scalar -> SIMD
  __m128d y = _mm_set_sd(b);               // scalar -> SIMD
  return _mm_cvtsd_f64(_mm_min_sd(x, y));  // minsd, then extract the low double
}

One would expect a modern compiler to generate the same assembly for both functions. Unfortunately, the reality is different. Let's compare gcc and clang.

Assembly generated by clang:

MyMinV1:
  minsd   xmm0, xmm1
  ret

MyMinV2:
  minsd   xmm0, xmm1
  ret

Assembly generated by gcc:

MyMinV1:
  minsd   xmm0, xmm1
  ret

MyMinV2:
  movsd   QWORD PTR [rsp-24], xmm0
  movsd   QWORD PTR [rsp-16], xmm1
  movsd   xmm0, QWORD PTR [rsp-24]
  movsd   xmm1, QWORD PTR [rsp-16]
  minsd   xmm0, xmm1
  ret

That's not good; gcc completely missed the point and turned the C++ code into the worst assembly you could imagine. You can check it yourself if you don't believe me.

The Core of the Problem

The problem is caused by using _mm_set_sd() to convert a double into a SIMD type. The intrinsic should emit a movsd xmm0, xmm1|mem64 instruction, which ironically has two behaviors:

  • If the second operand (the source operand) is an XMM register, it preserves the high-part of the first operand (the destination operand).
  • If the second operand (the source operand) is a memory location, it clears the high-part of the first operand (the destination operand).

This is tricky, because the documentation says that the _mm_set_sd() intrinsic clears the high-part of the register, so gcc is basically following the specification. However, it's unexpected if you have no use for the high-part of the register, and the store/reload pair of movsd's is suboptimal; it could be replaced by a single movq, which is what clang emits in some other cases.
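
Both behaviors can be observed from C++ as well. Here is a minimal illustration (my own example) using _mm_move_sd, which maps to the register form of movsd:

#include <emmintrin.h>
#include <cstdio>

int main() {
  __m128d v = _mm_set_pd(42.0, 1.0);           // high = 42.0, low = 1.0

  // Register form (movsd xmm, xmm): the high-part of the destination
  // is preserved, so r = [high 42.0, low 3.0].
  __m128d r = _mm_move_sd(v, _mm_set_sd(3.0));

  // _mm_set_sd() is specified to clear the high-part, matching the
  // memory form (movsd xmm, mem64), so s = [high 0.0, low 3.0].
  __m128d s = _mm_set_sd(3.0);

  double hi;
  _mm_storeh_pd(&hi, r);
  std::printf("high after _mm_move_sd: %f\n", hi); // 42.000000
  _mm_storeh_pd(&hi, s);
  std::printf("high after _mm_set_sd : %f\n", hi); // 0.000000
}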

A Possible Workaround

After submitting gcc bug #70708 (which turned out to be a duplicate of #68211) I got instant feedback and a workaround was found. We can use some asm magic to tell the compiler what to do. Consider the updated code, which also works well with clang:

static inline __m128d DoubleAsXmm(double x) {
  __m128d xmm;
  // Empty asm body: "=x" forces the output into an XMM register and
  // "0" ties the input to the same register, so no instruction is emitted.
  asm ("" : "=x" (xmm) : "0" (x));
  return xmm;
}

double MyMinV3(double a, double b) {
  return _mm_cvtsd_f64(
    _mm_min_sd(
      DoubleAsXmm(a),
      DoubleAsXmm(b)));
}

This leads to the desired result with both gcc and clang:

MyMinV3:
  minsd   xmm0, xmm1
  ret

Of course you need some preprocessor magic to make the conversion portable across compilers. However, it's a nice solution as it basically tells the compiler that you don't care about the high-part of the SIMD register. As a bonus, this will also prevent clang from emitting a movq instruction in some other cases.
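
Such a wrapper could look like the sketch below (the structure is mine): use the asm cast where GNU-style inline asm is available and fall back to _mm_set_sd() elsewhere, accepting the possibly redundant high-part clearing there.

#include <emmintrin.h>

#if defined(__GNUC__) || defined(__clang__)
static inline __m128d DoubleAsXmm(double x) {
  __m128d xmm;
  asm ("" : "=x" (xmm) : "0" (x)); // no code emitted, just a register constraint
  return xmm;
}
#else
static inline __m128d DoubleAsXmm(double x) {
  return _mm_set_sd(x); // portable fallback (e.g. MSVC)
}
#endif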

Lesson Learned

I think the days when we had to specialize min/max by using SSE and SSE2 are finally over; compilers are fine doing that for us. However, if you still need to use SSE+ intrinsics, check how the compiler converts your scalars into SIMD registers, especially if you use them for scalar-only operations.