[Or] How a C++ Compiler Can Ruin Your Optimization Efforts
Compiler intrinsics targeting the SSE and SSE2 instruction sets were historically used to instruct the compiler to generate better floating-point code. The reason was that some CPU architectures have a global floating-point unit state (FPU state) that contains the rounding mode, precision, exception mask, and some other flags. This global state basically prevented the compiler from optimizing operations like round(x), static_cast<int>(x), (x < y ? x : y), etc.
Of course, this situation was not performance-friendly, and many developers simply started using SSE+ intrinsics, once they became available, to solve the problem in a more performance-friendly way. This approach was also blessed by some CPU manufacturers as the better way of dealing with the problem. Today's compilers are much smarter than before, and some of these constructs may now be considered obsolete. However, some still hold and are probably better than pure C/C++ code in certain cases, especially if you need to do something the compiler won't do by default.
Scalar <-> SIMD Problem
The problem I faced is closely related to the conversion between float/double and SIMD registers. When SSE2 code generation is turned on, these types live in the same registers, so using the _mm_set_sd intrinsic should generate no code at all. However, that's not always true, as demonstrated by the snippet below:
#include <emmintrin.h> // SSE2: _mm_set_sd, _mm_min_sd, _mm_cvtsd_f64
double MyMinV1(double a, double b) {
return a < b ? a : b;
}
double MyMinV2(double a, double b) {
__m128d x = _mm_set_sd(a);
__m128d y = _mm_set_sd(b);
return _mm_cvtsd_f64(_mm_min_sd(x, y));
}
One would expect a modern compiler to generate the same assembly for both functions. Unfortunately, the reality is different. Let's compare gcc and clang.
Assembly generated by clang:
MyMinV1:
minsd xmm0, xmm1
ret
MyMinV2:
minsd xmm0, xmm1
ret
Assembly generated by gcc:
MyMinV1:
minsd xmm0, xmm1
ret
MyMinV2:
movsd QWORD PTR [rsp-24], xmm0
movsd QWORD PTR [rsp-16], xmm1
movsd xmm0, QWORD PTR [rsp-24]
movsd xmm1, QWORD PTR [rsp-16]
minsd xmm0, xmm1
ret
That's not good; gcc completely missed the point and turned the C++ code into the worst assembly you could imagine. You can verify it yourself in a compiler explorer if you don't believe me.
The Core of the Problem
The problem is caused by using _mm_set_sd() to convert a double into a SIMD type. The intrinsic should emit a movsd xmm, xmm|mem64 instruction, which ironically has two behaviors:
- If the source operand is an XMM register, the high part of the destination register is preserved.
- If the source operand is a memory location, the high part of the destination register is cleared.
This is tricky, because the documentation says that the _mm_set_sd() intrinsic clears the high part of the register, so gcc is basically following the specification. However, it's unexpected if you have no use for the high part of the register, and the store/load pair of movsd's is suboptimal, as it could be replaced by a single movq, which is what clang does in some other cases.
A Possible Workaround
After submitting gcc bug #70708 (which turned out to be a duplicate of #68211), I got instant feedback and a workaround was found. We can use some asm magic to tell the compiler what to do. Consider the updated code, which also works well with clang:
static inline __m128d DoubleAsXmm(double x) {
__m128d xmm;
asm ("" : "=x" (xmm) : "0" (x)); // x is already in an XMM register; emit no code
return xmm;
}
double MyMinV3(double a, double b) {
return _mm_cvtsd_f64(
_mm_min_sd(
DoubleAsXmm(a),
DoubleAsXmm(b)));
}
Which leads to the desired result (both gcc and clang):
MyMinV3:
minsd xmm0, xmm1
ret
Of course, you need some preprocessor magic to make the conversion portable across compilers. However, it's a nice solution, as it basically tells the compiler that you don't care about the high part of the SIMD register. As a bonus, this will also prevent clang from emitting a movq instruction in some other cases.
Lesson Learned
I think the times when we had to specialize min/max by using SSE and SSE2 are finally over; compilers are fine doing that for us. However, if you still need to use SSE+ intrinsics, check how the compiler converts your scalars into SIMD registers, especially if you only use them for scalar operations.