This is a first part of my autovectorization experiments. As a part of my refactoring work I did some cleanup in a Blend2D's internal algebra module and I finally tried how compilers optimize some of its features. I would like to highlight that the performance of this code is very important as essentially all vectors passed to Blend2D's rendering context are processed before the rasterizer can do its work. The processing depends on what you want to do with them - if it's just for filling then the input is transformed, clipped, and passed to edge builder. If you intend to stroke a 2D path then the processing is much heavier and a lot of approximations come into the scene.
Today's post contains more code than usual, however, the code solves real problems so people can seriously consider which compiler would optimize their code the most. Spoiler alert - Clang is superior!
Preparation
Instead of using one huge code section I decided to provide a utility code first and then functions that I would compare. Some code is directly from Blend2D and some was added just for simplicity to make the code compile:
#include <algorithm>
#include <cmath>
#include <stdint.h>
// Point structure [x, y]
struct BLPoint {
double x;
double y;
inline BLPoint() noexcept = default;
constexpr BLPoint(const BLPoint&) noexcept = default;
constexpr BLPoint(double x, double y) noexcept
: x(x), y(y) {}
inline void reset() noexcept { reset(0, 0); }
inline void reset(double x, double y) noexcept {
this->x = x;
this->y = y;
}
};
// Box structure [x0, y0, x1, y1]
struct BLBox {
double x0;
double y0;
double x1;
double y1;
inline BLBox() noexcept = default;
constexpr BLBox(const BLBox&) noexcept = default;
constexpr BLBox(double x0, double y0, double x1, double y1) noexcept
: x0(x0), y0(y0), x1(x1), y1(y1) {}
inline void reset() noexcept { reset(0.0, 0.0, 0.0, 0.0); }
inline void reset(double x0, double y0, double x1, double y1) noexcept {
this->x0 = x0;
this->y0 = y0;
this->x1 = x1;
this->y1 = y1;
}
};
// Overloads to make vector processing simpler.
static constexpr BLPoint operator-(const BLPoint& a) noexcept { return BLPoint(-a.x, -a.y); }
static constexpr BLPoint operator+(const BLPoint& a, double b) noexcept { return BLPoint(a.x + b, a.y + b); }
static constexpr BLPoint operator-(const BLPoint& a, double b) noexcept { return BLPoint(a.x - b, a.y - b); }
static constexpr BLPoint operator*(const BLPoint& a, double b) noexcept { return BLPoint(a.x * b, a.y * b); }
static constexpr BLPoint operator/(const BLPoint& a, double b) noexcept { return BLPoint(a.x / b, a.y / b); }
static constexpr BLPoint operator+(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x + b.x, a.y + b.y); }
static constexpr BLPoint operator-(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x - b.x, a.y - b.y); }
static constexpr BLPoint operator*(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x * b.x, a.y * b.y); }
static constexpr BLPoint operator/(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x / b.x, a.y / b.y); }
static constexpr BLPoint operator+(double a, const BLPoint& b) noexcept { return BLPoint(a + b.x, a + b.y); }
static constexpr BLPoint operator-(double a, const BLPoint& b) noexcept { return BLPoint(a - b.x, a - b.y); }
static constexpr BLPoint operator*(double a, const BLPoint& b) noexcept { return BLPoint(a * b.x, a * b.y); }
static constexpr BLPoint operator/(double a, const BLPoint& b) noexcept { return BLPoint(a / b.x, a / b.y); }
static inline BLPoint& operator+=(BLPoint& a, double b) noexcept { a.x += b; a.y += b; return a; }
static inline BLPoint& operator-=(BLPoint& a, double b) noexcept { a.x -= b; a.y -= b; return a; }
static inline BLPoint& operator*=(BLPoint& a, double b) noexcept { a.x *= b; a.y *= b; return a; }
static inline BLPoint& operator/=(BLPoint& a, double b) noexcept { a.x /= b; a.y /= b; return a; }
static inline BLPoint& operator+=(BLPoint& a, const BLPoint& b) noexcept { a.x += b.x; a.y += b.y; return a; }
static inline BLPoint& operator-=(BLPoint& a, const BLPoint& b) noexcept { a.x -= b.x; a.y -= b.y; return a; }
static inline BLPoint& operator*=(BLPoint& a, const BLPoint& b) noexcept { a.x *= b.x; a.y *= b.y; return a; }
static inline BLPoint& operator/=(BLPoint& a, const BLPoint& b) noexcept { a.x /= b.x; a.y /= b.y; return a; }
// Min/Max - different semantics compared to std.
template<typename T> constexpr T blMin(const T& a, const T& b) noexcept { return b < a ? b : a; }
template<typename T> constexpr T blMax(const T& a, const T& b) noexcept { return a < b ? b : a; }
// Linear interpolation, works with points as well.
template<typename V, typename T = double>
inline V blLerp(const V& a, const V& b, const T& t) noexcept {
return (a * (1.0 - t)) + (b * t);
}
// Merge a point into a box by possibly increasing its size.
inline void blBoxMergePoint(BLBox& box, const BLPoint& p) noexcept {
box.x0 = blMin(box.x0, p.x);
box.y0 = blMin(box.y0, p.y);
box.x1 = blMax(box.x1, p.x);
box.y1 = blMax(box.y1, p.y);
}
// Quadratic Bezier Coefficients and Derivative.
inline void blQuadExtractCoefficients(const BLPoint bez[3], BLPoint& a, BLPoint& b, BLPoint& c) noexcept {
a = bez[0] - bez[1] * 2.0 + bez[2];
b = (bez[1] - bez[0]) * 2.0;
c = bez[0];
}
inline void blQuadExtractDerivative(const BLPoint bez[3], BLPoint& a, BLPoint& b) noexcept {
a = (bez[2] - bez[1] * 2.0 + bez[0]) * 2.0;
b = (bez[1] - bez[0]) * 2.0;
}
// Cubic Bezier Coefficients and Derivative.
inline void blCubicExtractCoefficients(const BLPoint bez[4], BLPoint& a, BLPoint& b, BLPoint& c, BLPoint& d) noexcept {
d = bez[0];
a = (bez[1] - bez[2]) * 3.0 + bez[3] - d;
b = (bez[0] - bez[1] * 2.0 + bez[2]) * 3.0;
c = (bez[1] - bez[0]) * 3.0;
}
inline void blCubicExtractDerivative(const BLPoint bez[4], BLPoint& a, BLPoint& b, BLPoint& c) noexcept {
a = ((bez[1] - bez[2]) * 3.0 + bez[3] - bez[0]) * 3.0;
b = (bez[2] - bez[1] * 2.0 + bez[0]) * 6.0;
c = (bez[1] - bez[0]) * 3.0;
}
I don't know whether to comment it - it's pretty simple code that provides Point, Box, and some utility functions we rely on.
Quadratic Bezier Evaluation
These functions return a point of quadratic Bezier at the given t
:
// Quadratic Bezier Evaluation.
BLPoint blQuadEvaluateAtT(const BLPoint bez[3], double t) noexcept {
BLPoint a, b, c;
blQuadExtractCoefficients(bez, a, b, c);
return (a * t + b) * t + c;
}
BLPoint blQuadPreciseEvaluateAtT(const BLPoint bez[3], double t) noexcept {
return blLerp(blLerp(bez[0], bez[1], t),
blLerp(bez[1], bez[2], t), t);
}
GCC [-std=c++17 -O3 -fno-math-errno -mavx2]
blQuadEvaluateAtT(BLPoint const*, double):
vmovupd xmm2, XMMWORD PTR [rdi+16]
vmovupd xmm4, XMMWORD PTR [rdi]
vmovddup xmm3, xmm0
vaddpd xmm1, xmm2, xmm2
vsubpd xmm2, xmm2, xmm4
vsubpd xmm1, xmm4, xmm1
vaddpd xmm0, xmm1, XMMWORD PTR [rdi+32]
vaddpd xmm2, xmm2, xmm2
vmulpd xmm1, xmm0, xmm3
vaddpd xmm0, xmm1, xmm2
vmulpd xmm0, xmm0, xmm3
vaddpd xmm5, xmm0, xmm4
vmovaps XMMWORD PTR [rsp-24], xmm5
vmovsd xmm1, QWORD PTR [rsp-16]
vmovsd xmm0, QWORD PTR [rsp-24]
ret
blQuadPreciseEvaluateAtT(BLPoint const*, double):
vmovapd xmm3, XMMWORD PTR .LC0[rip]
vmovddup xmm0, xmm0
vmovupd xmm4, XMMWORD PTR [rdi+16]
vmulpd xmm2, xmm0, XMMWORD PTR [rdi+32]
vsubpd xmm3, xmm3, xmm0
vmulpd xmm1, xmm3, xmm4
vaddpd xmm2, xmm2, xmm1
vmulpd xmm1, xmm3, XMMWORD PTR [rdi]
vmulpd xmm2, xmm2, xmm0
vmulpd xmm0, xmm4, xmm0
vaddpd xmm0, xmm1, xmm0
vmulpd xmm0, xmm0, xmm3
vaddpd xmm5, xmm2, xmm0
vmovaps XMMWORD PTR [rsp-24], xmm5
vmovsd xmm1, QWORD PTR [rsp-16]
vmovsd xmm0, QWORD PTR [rsp-24]
ret
Clang [-std=c++17 -O3 -fno-math-errno -mavx2]
blQuadEvaluateAtT(BLPoint const*, double):
vmovupd xmm1, xmmword ptr [rdi]
vmovupd xmm2, xmmword ptr [rdi + 16]
vaddpd xmm3, xmm2, xmm2
vsubpd xmm3, xmm1, xmm3
vaddpd xmm3, xmm3, xmmword ptr [rdi + 32]
vsubpd xmm2, xmm2, xmm1
vaddpd xmm2, xmm2, xmm2
vmovddup xmm0, xmm0 # xmm0 = xmm0[0,0]
vmulpd xmm3, xmm0, xmm3
vaddpd xmm2, xmm2, xmm3
vmulpd xmm0, xmm0, xmm2
vaddpd xmm0, xmm1, xmm0
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
ret
.LCPI1_0:
.quad 4607182418800017408 # double 1
blQuadPreciseEvaluateAtT(BLPoint const*, double):
vmovsd xmm1, qword ptr [rip + .LCPI1_0] # xmm1 = mem[0],zero
vsubsd xmm1, xmm1, xmm0
vmovddup xmm1, xmm1 # xmm1 = xmm1[0,0]
vmulpd xmm2, xmm1, xmmword ptr [rdi]
vmovupd xmm3, xmmword ptr [rdi + 16]
vmovddup xmm0, xmm0 # xmm0 = xmm0[0,0]
vmulpd xmm4, xmm0, xmm3
vaddpd xmm2, xmm2, xmm4
vmulpd xmm3, xmm1, xmm3
vmulpd xmm4, xmm0, xmmword ptr [rdi + 32]
vaddpd xmm3, xmm3, xmm4
vmulpd xmm1, xmm1, xmm2
vmulpd xmm0, xmm0, xmm3
vaddpd xmm0, xmm1, xmm0
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
ret
MSVC [/O2 /Qpar /arch:AVX2]
; NOT VECTORIZED!
__$ReturnUdt$ = 80
bez$ = 88
t$ = 96
BLPoint blQuadEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN42:
sub rsp, 72 ; 00000048H
vmovsd xmm0, QWORD PTR [rdx+16]
vmovsd xmm1, QWORD PTR [rdx+24]
vmovsd xmm5, QWORD PTR [rdx]
vmovaps XMMWORD PTR [rsp+48], xmm6
mov rax, rcx
vmovsd xmm6, QWORD PTR [rdx+8]
vmovaps XMMWORD PTR [rsp+32], xmm7
vaddsd xmm3, xmm0, xmm0
vsubsd xmm0, xmm5, xmm3
vmovaps XMMWORD PTR [rsp+16], xmm8
vaddsd xmm8, xmm0, QWORD PTR [rdx+32]
vmovsd xmm0, QWORD PTR [rdx+16]
vsubsd xmm3, xmm0, xmm5
vmovaps XMMWORD PTR [rsp], xmm9
vmovaps xmm9, xmm2
vaddsd xmm4, xmm1, xmm1
vsubsd xmm1, xmm6, xmm4
vaddsd xmm7, xmm1, QWORD PTR [rdx+40]
vmulsd xmm4, xmm3, QWORD PTR __real@4000000000000000
vmovsd xmm1, QWORD PTR [rdx+24]
vsubsd xmm2, xmm1, xmm6
vmulsd xmm5, xmm2, QWORD PTR __real@4000000000000000
vmovups xmm6, XMMWORD PTR [rdx]
vmulsd xmm2, xmm7, xmm9
vmovaps xmm7, XMMWORD PTR [rsp+32]
vmulsd xmm1, xmm8, xmm9
vmovaps xmm8, XMMWORD PTR [rsp+16]
vaddsd xmm0, xmm4, xmm1
vmulsd xmm4, xmm0, xmm9
vaddsd xmm0, xmm4, xmm6
vaddsd xmm3, xmm2, xmm5
vmulsd xmm5, xmm3, xmm9
vmovaps xmm9, XMMWORD PTR [rsp]
vunpckhpd xmm2, xmm6, xmm6
vmovaps xmm6, XMMWORD PTR [rsp+48]
vmovsd QWORD PTR [rcx], xmm0
vaddsd xmm0, xmm5, xmm2
vmovsd QWORD PTR [rcx+8], xmm0
add rsp, 72 ; 00000048H
ret 0
BLPoint blQuadEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP
; NOT VECTORIZED!
__$ReturnUdt$ = 64
bez$ = 72
t$ = 80
BLPoint blQuadPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN46:
sub rsp, 56 ; 00000038H
vmulsd xmm4, xmm2, QWORD PTR [rdx+32]
vmulsd xmm3, xmm2, QWORD PTR [rdx+40]
vmovsd xmm0, QWORD PTR __real@3ff0000000000000
vmovaps XMMWORD PTR [rsp+32], xmm6
mov rax, rcx
vmovaps XMMWORD PTR [rsp+16], xmm7
vsubsd xmm7, xmm0, xmm2
vmulsd xmm1, xmm7, QWORD PTR [rdx+16]
vmulsd xmm0, xmm7, QWORD PTR [rdx+24]
vaddsd xmm6, xmm3, xmm0
vmulsd xmm0, xmm7, QWORD PTR [rdx+8]
vaddsd xmm5, xmm4, xmm1
vmulsd xmm1, xmm7, QWORD PTR [rdx]
vmovaps XMMWORD PTR [rsp], xmm8
vmovaps xmm8, xmm2
vmulsd xmm3, xmm8, QWORD PTR [rdx+24]
vmulsd xmm2, xmm2, QWORD PTR [rdx+16]
vaddsd xmm2, xmm2, xmm1
vaddsd xmm3, xmm0, xmm3
vmulsd xmm4, xmm5, xmm8
vmulsd xmm5, xmm6, xmm8
vmovaps xmm6, XMMWORD PTR [rsp+32]
vmovaps xmm8, XMMWORD PTR [rsp]
vmulsd xmm0, xmm2, xmm7
vmulsd xmm1, xmm3, xmm7
vmovaps xmm7, XMMWORD PTR [rsp+16]
vaddsd xmm0, xmm0, xmm4
vaddsd xmm1, xmm1, xmm5
vmovsd QWORD PTR [rcx], xmm0
vmovsd QWORD PTR [rcx+8], xmm1
add rsp, 56 ; 00000038H
ret 0
BLPoint blQuadPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP
Interpreting the Results
GCC and Clang okay, MSVC didn't use SIMD.
Cubic Bezier Evaluation
These functions return a point of cubic Bezier at the given t
:
// Cubic Bezier Evaluation.
BLPoint blCubicEvaluateAtT(const BLPoint bez[4], double t) noexcept {
BLPoint a, b, c, d;
blCubicExtractCoefficients(bez, a, b, c, d);
return ((a * t + b) * t + c) * t + c;
}
BLPoint blCubicPreciseEvaluateAtT(const BLPoint bez[4], double t) noexcept {
BLPoint p01(blLerp(bez[0], bez[1], t));
BLPoint p12(blLerp(bez[1], bez[2], t));
BLPoint p23(blLerp(bez[2], bez[3], t));
return blLerp(blLerp(p01, p12, t), blLerp(p12, p23, t), t);
}
GCC [-std=c++17 -O3 -fno-math-errno -mavx2]
blCubicEvaluateAtT(BLPoint const*, double):
vmovupd xmm3, XMMWORD PTR [rdi+16]
vmovddup xmm4, xmm0
vmovupd xmm1, XMMWORD PTR [rdi]
vmovupd xmm6, XMMWORD PTR [rdi+32]
vmovapd xmm5, XMMWORD PTR .LC1[rip]
vsubpd xmm2, xmm3, xmm1
vsubpd xmm0, xmm3, xmm6
vaddpd xmm3, xmm3, xmm3
vmulpd xmm2, xmm2, xmm5
vmulpd xmm0, xmm0, xmm5
vaddpd xmm0, xmm0, XMMWORD PTR [rdi+48]
vsubpd xmm0, xmm0, xmm1
vsubpd xmm1, xmm1, xmm3
vaddpd xmm1, xmm1, xmm6
vmulpd xmm0, xmm0, xmm4
vmulpd xmm1, xmm1, xmm5
vaddpd xmm0, xmm0, xmm1
vmulpd xmm0, xmm0, xmm4
vaddpd xmm0, xmm0, xmm2
vmulpd xmm0, xmm0, xmm4
vaddpd xmm7, xmm0, xmm2
vmovaps XMMWORD PTR [rsp-24], xmm7
vmovsd xmm1, QWORD PTR [rsp-16]
vmovsd xmm0, QWORD PTR [rsp-24]
ret
blCubicPreciseEvaluateAtT(BLPoint const*, double):
vmovddup xmm2, xmm0
vmovapd xmm0, XMMWORD PTR .LC0[rip]
vmovupd xmm5, XMMWORD PTR [rdi+16]
vmulpd xmm4, xmm2, XMMWORD PTR [rdi+32]
vsubpd xmm3, xmm0, xmm2
vmulpd xmm1, xmm2, XMMWORD PTR [rdi+48]
vmulpd xmm0, xmm3, xmm5
vmulpd xmm5, xmm5, xmm2
vaddpd xmm4, xmm4, xmm0
vmulpd xmm0, xmm3, XMMWORD PTR [rdi+32]
vaddpd xmm1, xmm1, xmm0
vmulpd xmm0, xmm3, xmm4
vmulpd xmm1, xmm1, xmm2
vaddpd xmm1, xmm1, xmm0
vmulpd xmm0, xmm3, XMMWORD PTR [rdi]
vmulpd xmm1, xmm1, xmm2
vmulpd xmm2, xmm4, xmm2
vaddpd xmm0, xmm0, xmm5
vmulpd xmm0, xmm0, xmm3
vaddpd xmm0, xmm0, xmm2
vmulpd xmm0, xmm0, xmm3
vaddpd xmm6, xmm1, xmm0
vmovaps XMMWORD PTR [rsp-24], xmm6
vmovsd xmm1, QWORD PTR [rsp-16]
vmovsd xmm0, QWORD PTR [rsp-24]
ret
Clang [-std=c++17 -O3 -fno-math-errno -mavx2]
.LCPI2_0:
.quad 4613937818241073152 # double 3
.quad 4613937818241073152 # double 3
blCubicEvaluateAtT(BLPoint const*, double):
vmovupd xmm1, xmmword ptr [rdi + 16]
vmovupd xmm2, xmmword ptr [rdi + 32]
vsubpd xmm3, xmm1, xmm2
vmovapd xmm4, xmmword ptr [rip + .LCPI2_0] # xmm4 = [3.000000e+00,3.000000e+00]
vmulpd xmm3, xmm3, xmm4
vaddpd xmm3, xmm3, xmmword ptr [rdi + 48]
vmovupd xmm5, xmmword ptr [rdi]
vsubpd xmm3, xmm3, xmm5
vaddpd xmm6, xmm1, xmm1
vsubpd xmm6, xmm5, xmm6
vaddpd xmm2, xmm6, xmm2
vmulpd xmm2, xmm2, xmm4
vsubpd xmm1, xmm1, xmm5
vmulpd xmm1, xmm1, xmm4
vmovddup xmm0, xmm0 # xmm0 = xmm0[0,0]
vmulpd xmm3, xmm0, xmm3
vaddpd xmm2, xmm2, xmm3
vmulpd xmm2, xmm0, xmm2
vaddpd xmm2, xmm1, xmm2
vmulpd xmm0, xmm0, xmm2
vaddpd xmm0, xmm1, xmm0
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
ret
.LCPI3_0:
.quad 4607182418800017408 # double 1
blCubicPreciseEvaluateAtT(BLPoint const*, double):
vmovsd xmm1, qword ptr [rip + .LCPI3_0] # xmm1 = mem[0],zero
vsubsd xmm1, xmm1, xmm0
vmovddup xmm1, xmm1 # xmm1 = xmm1[0,0]
vmulpd xmm2, xmm1, xmmword ptr [rdi]
vmovddup xmm0, xmm0 # xmm0 = xmm0[0,0]
vmovupd xmm3, xmmword ptr [rdi + 16]
vmovupd xmm4, xmmword ptr [rdi + 32]
vmulpd xmm5, xmm0, xmm3
vaddpd xmm2, xmm2, xmm5
vmulpd xmm3, xmm1, xmm3
vmulpd xmm5, xmm0, xmm4
vaddpd xmm3, xmm3, xmm5
vmulpd xmm5, xmm0, xmmword ptr [rdi + 48]
vmulpd xmm4, xmm1, xmm4
vaddpd xmm4, xmm4, xmm5
vmulpd xmm2, xmm1, xmm2
vmulpd xmm5, xmm0, xmm3
vaddpd xmm2, xmm2, xmm5
vmulpd xmm3, xmm1, xmm3
vmulpd xmm4, xmm0, xmm4
vaddpd xmm3, xmm3, xmm4
vmulpd xmm1, xmm1, xmm2
vmulpd xmm0, xmm0, xmm3
vaddpd xmm0, xmm1, xmm0
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
ret
MSVC [/O2 /Qpar /arch:AVX2]
; NOT VECTORIZED!
__$ReturnUdt$ = 96
bez$ = 104
t$ = 112
BLPoint blCubicEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN70:
sub rsp, 88 ; 00000058H
vmovsd xmm0, QWORD PTR [rdx+16]
vsubsd xmm3, xmm0, QWORD PTR [rdx+32]
vmovsd xmm1, QWORD PTR [rdx+24]
vmovups xmm5, XMMWORD PTR [rdx]
vmovaps XMMWORD PTR [rsp+64], xmm6
mov rax, rcx
vmovaps XMMWORD PTR [rsp+48], xmm7
vmovsd xmm7, QWORD PTR __real@4008000000000000
vmovaps XMMWORD PTR [rsp+32], xmm8
vmulsd xmm0, xmm3, xmm7
vmovaps XMMWORD PTR [rsp+16], xmm9
vmovaps XMMWORD PTR [rsp], xmm10
vmovaps xmm10, xmm2
vsubsd xmm2, xmm1, QWORD PTR [rdx+40]
vaddsd xmm1, xmm0, QWORD PTR [rdx+48]
vmovsd xmm0, QWORD PTR [rdx+16]
vsubsd xmm9, xmm1, xmm5
vmulsd xmm4, xmm2, xmm7
vaddsd xmm2, xmm4, QWORD PTR [rdx+56]
vunpckhpd xmm1, xmm5, xmm5
vmovsd xmm5, QWORD PTR [rdx]
vsubsd xmm8, xmm2, xmm1
vmovsd xmm1, QWORD PTR [rdx+24]
vaddsd xmm2, xmm0, xmm0
vsubsd xmm0, xmm5, xmm2
vmovsd xmm2, QWORD PTR [rdx+8]
vaddsd xmm3, xmm1, xmm1
vaddsd xmm1, xmm0, QWORD PTR [rdx+32]
vsubsd xmm4, xmm2, xmm3
vaddsd xmm0, xmm4, QWORD PTR [rdx+40]
vmulsd xmm6, xmm0, xmm7
vmovsd xmm0, QWORD PTR [rdx+16]
vmulsd xmm4, xmm1, xmm7
vmovsd xmm1, QWORD PTR [rdx+24]
vsubsd xmm2, xmm1, xmm2
vsubsd xmm3, xmm0, xmm5
vmulsd xmm5, xmm3, xmm7
vmulsd xmm0, xmm9, xmm10
vmovaps xmm9, XMMWORD PTR [rsp+16]
vmulsd xmm1, xmm8, xmm10
vmovaps xmm8, XMMWORD PTR [rsp+32]
vaddsd xmm3, xmm1, xmm6
vmovaps xmm6, XMMWORD PTR [rsp+64]
vmulsd xmm7, xmm2, xmm7
vaddsd xmm2, xmm0, xmm4
vmulsd xmm4, xmm2, xmm10
vmulsd xmm0, xmm3, xmm10
vaddsd xmm1, xmm4, xmm5
vaddsd xmm2, xmm0, xmm7
vmulsd xmm3, xmm1, xmm10
vmulsd xmm4, xmm2, xmm10
vmovaps xmm10, XMMWORD PTR [rsp]
vaddsd xmm1, xmm4, xmm7
vmovaps xmm7, XMMWORD PTR [rsp+48]
vaddsd xmm0, xmm3, xmm5
vmovsd QWORD PTR [rcx], xmm0
vmovsd QWORD PTR [rcx+8], xmm1
add rsp, 88 ; 00000058H
ret 0
BLPoint blCubicEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP
; NOT VECTORIZED!
__$ReturnUdt$ = 128
bez$ = 136
t$ = 144
BLPoint blCubicPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN88:
mov r11, rsp
sub rsp, 120 ; 00000078H
vmulsd xmm4, xmm2, QWORD PTR [rdx+16]
vmulsd xmm3, xmm2, QWORD PTR [rdx+24]
vmovsd xmm0, QWORD PTR __real@3ff0000000000000
vmovaps XMMWORD PTR [rsp+96], xmm6
mov rax, rcx
vmovaps XMMWORD PTR [rsp+80], xmm7
vmovaps XMMWORD PTR [r11-56], xmm8
vmovaps XMMWORD PTR [r11-72], xmm9
vmovaps XMMWORD PTR [r11-88], xmm10
vmovaps XMMWORD PTR [r11-104], xmm11
vsubsd xmm11, xmm0, xmm2
vmulsd xmm1, xmm11, QWORD PTR [rdx]
vmulsd xmm0, xmm11, QWORD PTR [rdx+8]
vaddsd xmm10, xmm3, xmm0
vmulsd xmm3, xmm2, QWORD PTR [rdx+32]
vmulsd xmm0, xmm11, QWORD PTR [rdx+24]
vaddsd xmm9, xmm4, xmm1
vmulsd xmm1, xmm11, QWORD PTR [rdx+16]
vaddsd xmm5, xmm3, xmm1
vmulsd xmm1, xmm11, QWORD PTR [rdx+32]
vmovaps XMMWORD PTR [rsp], xmm12
vmovaps xmm12, xmm2
vmulsd xmm2, xmm2, QWORD PTR [rdx+40]
vmulsd xmm3, xmm12, QWORD PTR [rdx+48]
vaddsd xmm8, xmm2, xmm0
vmulsd xmm0, xmm11, QWORD PTR [rdx+40]
vmulsd xmm2, xmm12, QWORD PTR [rdx+56]
vaddsd xmm1, xmm1, xmm3
vaddsd xmm2, xmm2, xmm0
vmulsd xmm3, xmm1, xmm12
vmulsd xmm4, xmm2, xmm12
vmulsd xmm1, xmm8, xmm11
vaddsd xmm7, xmm1, xmm4
vmulsd xmm0, xmm5, xmm11
vaddsd xmm6, xmm0, xmm3
vmulsd xmm1, xmm10, xmm11
vmovaps xmm10, XMMWORD PTR [r11-88]
vmulsd xmm2, xmm5, xmm12
vmulsd xmm3, xmm8, xmm12
vmovaps xmm8, XMMWORD PTR [r11-56]
vmulsd xmm0, xmm9, xmm11
vmovaps xmm9, XMMWORD PTR [r11-72]
vaddsd xmm5, xmm1, xmm3
vaddsd xmm4, xmm0, xmm2
vmulsd xmm2, xmm6, xmm12
vmovaps xmm6, XMMWORD PTR [rsp+96]
vmulsd xmm3, xmm7, xmm12
vmovaps xmm7, XMMWORD PTR [rsp+80]
vmovaps xmm12, XMMWORD PTR [rsp]
vmulsd xmm1, xmm5, xmm11
vmulsd xmm0, xmm4, xmm11
vmovaps xmm11, XMMWORD PTR [r11-104]
vaddsd xmm1, xmm1, xmm3
vaddsd xmm0, xmm0, xmm2
vmovsd QWORD PTR [rcx+8], xmm1
vmovsd QWORD PTR [rcx], xmm0
add rsp, 120 ; 00000078H
ret 0
BLPoint blCubicPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP
Interpreting the Results
GCC and Clang okay, MSVC didn't use SIMD and generated so much code again.
Quadratic Bezier Bounding Box
Calculating bounding box of a quadratic Bezier curve is quite common. I wrote two solutions - the first version (V1) uses scalar code that only merges possible extremas when they are found. The second version (V2) doesn't bother and always calculates both extremas even if they are outside of the domain (they would be clamped in such case). The second version is SIMD friendly and that was the intention:
// Scalar version - not good for SIMD.
inline void blQuadCoefficientsAtT(double t, double& a, double& b, double& c) noexcept {
double tInv = 1.0 - t;
a = tInv * tInv;
b = 2.0 * (t - tInv);
c = t * t;
}
void blQuadBoundingBoxV1(const BLPoint bez[3], BLBox& bBox) noexcept {
// Get bounding box of start and end points.
bBox.reset(blMin(bez[0].x, bez[2].x), blMin(bez[0].y, bez[2].y),
blMax(bez[0].x, bez[2].x), blMax(bez[0].y, bez[2].y));
// Merge X extrema.
if (bez[1].x < bBox.x0 || bez[1].x > bBox.x1) {
double t = (bez[0].x - bez[1].x) / (bez[0].x - 2.0 * bez[1].x + bez[2].x);
double a, b, c;
blQuadCoefficientsAtT(t, a, b, c);
double coord = a * bez[0].x + b * bez[1].x + c * bez[2].x;
bBox.x0 = blMin(bBox.x0, coord);
bBox.x1 = blMax(bBox.x1, coord);
}
// Merge Y extrema.
if (bez[1].y < bBox.y0 || bez[1].y > bBox.y1) {
double t = (bez[0].y - bez[1].y) / (bez[0].y - 2.0 * bez[1].y + bez[2].y);
double a, b, c;
blQuadCoefficientsAtT(t, a, b, c);
double coord = a * bez[0].y + b * bez[1].y + c * bez[2].y;
bBox.y0 = blMin(bBox.y0, coord);
bBox.y1 = blMax(bBox.y1, coord);
}
}
// SIMD friendly (branchless).
void blQuadBoundingBoxV2(const BLPoint bez[3], BLBox& bBox) noexcept {
// Bounding box of start and end points.
bBox.reset(blMin(bez[0].x, bez[2].x), blMin(bez[0].y, bez[2].y),
blMax(bez[0].x, bez[2].x), blMax(bez[0].y, bez[2].y));
BLPoint t = (bez[0] - bez[1]) / (bez[0] - bez[1] * 2.0 + bez[2]);
t.x = blMax(t.x, 0.0);
t.y = blMax(t.y, 0.0);
t.x = blMin(t.x, 1.0);
t.y = blMin(t.y, 1.0);
blBoxMergePoint(bBox, blLerp(blLerp(bez[0], bez[1], t),
blLerp(bez[1], bez[2], t), t));
}
GCC [-std=c++17 -O3 -fno-math-errno -mavx2]
# NOT VECTORIZED!
blQuadBoundingBoxV1(BLPoint const*, BLBox&):
vmovsd xmm7, QWORD PTR [rdi+8]
vmovsd xmm6, QWORD PTR [rdi+40]
vmovsd xmm11, QWORD PTR [rdi]
vmovsd xmm10, QWORD PTR [rdi+32]
vmaxsd xmm4, xmm6, xmm7
vminsd xmm3, xmm6, xmm7
vminsd xmm8, xmm10, xmm11
vmaxsd xmm9, xmm10, xmm11
vunpcklpd xmm1, xmm8, xmm3
vunpcklpd xmm0, xmm9, xmm4
vmovups XMMWORD PTR [rsi], xmm1
vmovups XMMWORD PTR [rsi+16], xmm0
vmovsd xmm1, QWORD PTR [rdi+16]
vcomisd xmm8, xmm1
ja .L11
vcomisd xmm1, xmm9
ja .L11
.L12:
vmovsd xmm0, QWORD PTR [rdi+24]
vcomisd xmm3, xmm0
ja .L16
vcomisd xmm0, xmm4
jbe .L27
.L16:
vaddsd xmm1, xmm0, xmm0
vsubsd xmm2, xmm7, xmm0
vmovsd xmm5, QWORD PTR .LC2[rip]
vsubsd xmm1, xmm7, xmm1
vaddsd xmm1, xmm1, xmm6
vdivsd xmm2, xmm2, xmm1
vsubsd xmm5, xmm5, xmm2
vsubsd xmm1, xmm2, xmm5
vmulsd xmm5, xmm5, xmm5
vmulsd xmm2, xmm2, xmm2
vaddsd xmm1, xmm1, xmm1
vmulsd xmm5, xmm5, xmm7
vmulsd xmm0, xmm1, xmm0
vmulsd xmm2, xmm2, xmm6
vaddsd xmm0, xmm0, xmm5
vaddsd xmm0, xmm0, xmm2
vminsd xmm3, xmm0, xmm3
vmaxsd xmm0, xmm0, xmm4
vmovsd QWORD PTR [rsi+8], xmm3
vmovsd QWORD PTR [rsi+24], xmm0
.L27:
ret
.L11:
vaddsd xmm0, xmm1, xmm1
vsubsd xmm5, xmm11, xmm1
vsubsd xmm0, xmm11, xmm0
vaddsd xmm0, xmm0, xmm10
vdivsd xmm5, xmm5, xmm0
vmovsd xmm0, QWORD PTR .LC2[rip]
vsubsd xmm0, xmm0, xmm5
vsubsd xmm2, xmm5, xmm0
vmulsd xmm0, xmm0, xmm0
vmulsd xmm5, xmm5, xmm5
vaddsd xmm2, xmm2, xmm2
vmulsd xmm0, xmm0, xmm11
vmulsd xmm1, xmm2, xmm1
vmulsd xmm5, xmm5, xmm10
vaddsd xmm1, xmm1, xmm0
vaddsd xmm1, xmm1, xmm5
vminsd xmm8, xmm1, xmm8
vmaxsd xmm1, xmm1, xmm9
vmovsd QWORD PTR [rsi], xmm8
vmovsd QWORD PTR [rsi+16], xmm1
jmp .L12
# NOT VECTORIZED!
blQuadBoundingBoxV2(BLPoint const*, BLBox&):
push rbp
mov rbp, rsp
and rsp, -32
vmovsd xmm1, QWORD PTR [rdi+8]
vmovsd xmm0, QWORD PTR [rdi]
vmovsd xmm5, QWORD PTR [rdi+40]
vmovsd xmm6, QWORD PTR [rdi+32]
vmaxsd xmm13, xmm5, xmm1
vmaxsd xmm12, xmm6, xmm0
vminsd xmm5, xmm5, xmm1
vminsd xmm6, xmm6, xmm0
vunpcklpd xmm0, xmm12, xmm13
vunpcklpd xmm1, xmm6, xmm5
vmovups XMMWORD PTR [rsi+16], xmm0
vmovups XMMWORD PTR [rsi], xmm1
vmovsd xmm2, QWORD PTR [rdi+24]
vmovsd xmm10, QWORD PTR [rdi+8]
vmovsd xmm1, QWORD PTR [rdi+40]
vmovsd xmm7, QWORD PTR [rdi+16]
vaddsd xmm4, xmm2, xmm2
vsubsd xmm9, xmm10, xmm2
vmovsd xmm3, QWORD PTR [rdi]
vmovsd xmm0, QWORD PTR [rdi+32]
vsubsd xmm8, xmm3, xmm7
vsubsd xmm4, xmm10, xmm4
vaddsd xmm4, xmm4, xmm1
vdivsd xmm9, xmm9, xmm4
vaddsd xmm4, xmm7, xmm7
vsubsd xmm4, xmm3, xmm4
vaddsd xmm4, xmm4, xmm0
vdivsd xmm8, xmm8, xmm4
vxorpd xmm4, xmm4, xmm4
vcomisd xmm4, xmm8
ja .L33
vcomisd xmm4, xmm9
jbe .L62
vmovsd xmm11, QWORD PTR .LC2[rip]
vmulsd xmm14, xmm1, xmm4
vmulsd xmm9, xmm2, xmm4
vcomisd xmm8, xmm11
jbe .L63
vmovsd QWORD PTR [rsp-16], xmm2
vmovapd xmm1, xmm14
vmovapd xmm2, xmm9
vxorpd xmm14, xmm14, xmm14
vmovsd QWORD PTR [rsp-8], xmm7
vmulsd xmm3, xmm3, xmm4
vmovapd xmm15, xmm11
vmovapd xmm8, xmm11
vmulsd xmm7, xmm7, xmm4
vxorpd xmm9, xmm9, xmm9
jmp .L40
.L33:
vmulsd xmm11, xmm7, xmm4
vcomisd xmm4, xmm9
vxorpd xmm8, xmm8, xmm8
vmulsd xmm0, xmm0, xmm4
vmovsd QWORD PTR [rsp-8], xmm11
vmovsd xmm11, QWORD PTR .LC2[rip]
vmovapd xmm14, xmm11
jbe .L37
.L46:
vmovsd QWORD PTR [rsp-16], xmm2
vmulsd xmm1, xmm1, xmm4
vmovapd xmm15, xmm11
vxorpd xmm9, xmm9, xmm9
vmulsd xmm2, xmm2, xmm4
jmp .L40
.L62:
vmovsd xmm11, QWORD PTR .LC2[rip]
vcomisd xmm8, xmm11
jbe .L56
vmovsd QWORD PTR [rsp-8], xmm7
vmulsd xmm3, xmm3, xmm4
vxorpd xmm14, xmm14, xmm14
vmovapd xmm8, xmm11
vmulsd xmm7, xmm7, xmm4
.L37:
vcomisd xmm9, xmm11
jbe .L57
vmulsd xmm15, xmm2, xmm4
vmovapd xmm9, xmm11
vmulsd xmm10, xmm10, xmm4
vmovsd QWORD PTR [rsp-16], xmm15
vxorpd xmm15, xmm15, xmm15
.L40:
vaddsd xmm1, xmm1, QWORD PTR [rsp-16]
vaddsd xmm3, xmm3, QWORD PTR [rsp-8]
vaddsd xmm2, xmm2, xmm10
vaddsd xmm0, xmm0, xmm7
vmulsd xmm9, xmm1, xmm9
vmulsd xmm15, xmm2, xmm15
vmulsd xmm8, xmm0, xmm8
vmulsd xmm14, xmm3, xmm14
vaddsd xmm9, xmm9, xmm15
vaddsd xmm14, xmm8, xmm14
vminsd xmm5, xmm9, xmm5
vmaxsd xmm9, xmm9, xmm13
vminsd xmm6, xmm14, xmm6
vmaxsd xmm14, xmm14, xmm12
vmovsd QWORD PTR [rsi+8], xmm5
vmovsd QWORD PTR [rsi+24], xmm9
vmovsd QWORD PTR [rsi], xmm6
vmovsd QWORD PTR [rsi+16], xmm14
leave
ret
.L56:
vmulsd xmm15, xmm7, xmm8
vsubsd xmm14, xmm11, xmm8
vmulsd xmm0, xmm0, xmm8
vmulsd xmm3, xmm3, xmm14
vmulsd xmm7, xmm7, xmm14
vmovsd QWORD PTR [rsp-8], xmm15
jmp .L37
.L63:
vmulsd xmm15, xmm7, xmm8
vsubsd xmm14, xmm11, xmm8
vmulsd xmm0, xmm0, xmm8
vmulsd xmm3, xmm3, xmm14
vmulsd xmm7, xmm7, xmm14
vmovsd QWORD PTR [rsp-8], xmm15
jmp .L46
.L57:
vsubsd xmm15, xmm11, xmm9
vmulsd xmm1, xmm1, xmm9
vmulsd xmm4, xmm2, xmm15
vmulsd xmm10, xmm10, xmm15
vmulsd xmm2, xmm2, xmm9
vmovsd QWORD PTR [rsp-16], xmm4
jmp .L40
.LC0:
.long 0
.long 1072693248
.long 0
.long 1072693248
.LC1:
.long 0
.long 1074266112
.long 0
.long 1074266112
.LC2:
.long 0
.long 1072693248
Clang [-std=c++17 -O3 -fno-math-errno -mavx2]
# NOT VECTORIZED!
.LCPI4_0:
.quad 4607182418800017408 # double 1
blQuadBoundingBoxV1(BLPoint const*, BLBox&): # @blQuadBoundingBoxV1(BLPoint const*, BLBox&)
vmovupd xmm1, xmmword ptr [rdi]
vmovupd xmm0, xmmword ptr [rdi + 32]
vminpd xmm3, xmm0, xmm1
vmaxpd xmm2, xmm0, xmm1
vmovupd xmmword ptr [rsi], xmm3
vmovupd xmmword ptr [rsi + 16], xmm2
vmovsd xmm4, qword ptr [rdi + 16] # xmm4 = mem[0],zero
vucomisd xmm3, xmm4
ja .LBB4_2
vucomisd xmm4, xmm2
jbe .LBB4_3
.LBB4_2:
vsubsd xmm5, xmm1, xmm4
vaddsd xmm6, xmm4, xmm4
vsubsd xmm6, xmm1, xmm6
vaddsd xmm6, xmm0, xmm6
vdivsd xmm5, xmm5, xmm6
vmovsd xmm6, qword ptr [rip + .LCPI4_0] # xmm6 = mem[0],zero
vsubsd xmm6, xmm6, xmm5
vmulsd xmm7, xmm6, xmm6
vsubsd xmm6, xmm5, xmm6
vaddsd xmm6, xmm6, xmm6
vmulsd xmm5, xmm5, xmm5
vmulsd xmm7, xmm1, xmm7
vmulsd xmm4, xmm4, xmm6
vaddsd xmm4, xmm7, xmm4
vmulsd xmm5, xmm0, xmm5
vaddsd xmm4, xmm5, xmm4
vminsd xmm5, xmm4, xmm3
vmovsd qword ptr [rsi], xmm5
vmaxsd xmm4, xmm4, xmm2
vmovsd qword ptr [rsi + 16], xmm4
.LBB4_3:
vmovsd xmm4, qword ptr [rdi + 24] # xmm4 = mem[0],zero
vpermilpd xmm3, xmm3, 1 # xmm3 = xmm3[1,0]
vucomisd xmm3, xmm4
vpermilpd xmm2, xmm2, 1 # xmm2 = xmm2[1,0]
ja .LBB4_5
vucomisd xmm4, xmm2
ja .LBB4_5
ret
.LBB4_5:
vpermilpd xmm1, xmm1, 1 # xmm1 = xmm1[1,0]
vsubsd xmm5, xmm1, xmm4
vaddsd xmm6, xmm4, xmm4
vsubsd xmm6, xmm1, xmm6
vpermilpd xmm0, xmm0, 1 # xmm0 = xmm0[1,0]
vaddsd xmm6, xmm0, xmm6
vdivsd xmm5, xmm5, xmm6
vmovsd xmm6, qword ptr [rip + .LCPI4_0] # xmm6 = mem[0],zero
vsubsd xmm6, xmm6, xmm5
vmulsd xmm7, xmm6, xmm6
vsubsd xmm6, xmm5, xmm6
vaddsd xmm6, xmm6, xmm6
vmulsd xmm5, xmm5, xmm5
vmulsd xmm1, xmm1, xmm7
vmulsd xmm4, xmm4, xmm6
vaddsd xmm1, xmm1, xmm4
vmulsd xmm0, xmm0, xmm5
vaddsd xmm0, xmm0, xmm1
vminsd xmm1, xmm0, xmm3
vmovsd qword ptr [rsi + 8], xmm1
vmaxsd xmm0, xmm0, xmm2
vmovsd qword ptr [rsi + 24], xmm0
ret
.LCPI5_0:
.quad 4607182418800017408 # double 1
.quad 4607182418800017408 # double 1
# Vectorized
blQuadBoundingBoxV2(BLPoint const*, BLBox&): # @blQuadBoundingBoxV2(BLPoint const*, BLBox&)
vmovupd xmm0, xmmword ptr [rdi]
vmovupd xmm1, xmmword ptr [rdi + 16]
vmovupd xmm2, xmmword ptr [rdi + 32]
vminpd xmm3, xmm2, xmm0
vmaxpd xmm4, xmm2, xmm0
vsubpd xmm5, xmm0, xmm1
vaddpd xmm6, xmm1, xmm1
vsubpd xmm6, xmm0, xmm6
vaddpd xmm6, xmm2, xmm6
vdivpd xmm5, xmm5, xmm6
vxorpd xmm6, xmm6, xmm6
vmaxpd xmm5, xmm6, xmm5
vmovapd xmm6, xmmword ptr [rip + .LCPI5_0] # xmm6 = [1.000000e+00,1.000000e+00]
vminpd xmm5, xmm6, xmm5
vsubpd xmm6, xmm6, xmm5
vmulpd xmm0, xmm0, xmm6
vmulpd xmm7, xmm1, xmm5
vaddpd xmm0, xmm7, xmm0
vmulpd xmm1, xmm1, xmm6
vmulpd xmm2, xmm2, xmm5
vaddpd xmm1, xmm2, xmm1
vmulpd xmm0, xmm6, xmm0
vmulpd xmm1, xmm5, xmm1
vaddpd xmm0, xmm0, xmm1
vminpd xmm1, xmm0, xmm3
vmovupd xmmword ptr [rsi], xmm1
vmaxpd xmm0, xmm0, xmm4
vmovupd xmmword ptr [rsi + 16], xmm0
ret
MSVC [/O2 /Qpar /arch:AVX2]
; NOT VECTORIZED!
bez$ = 128
bBox$ = 136
void blQuadBoundingBoxV1(BLPoint const * __ptr64 const,BLBox & __ptr64) PROC
$LN46:
sub rsp, 120 ; 00000078H
vmovsd xmm3, QWORD PTR [rcx+40]
vmovsd xmm1, QWORD PTR [rcx+32]
vminsd xmm0, xmm1, QWORD PTR [rcx]
vmovaps XMMWORD PTR [rsp+96], xmm6
vmovaps XMMWORD PTR [rsp+80], xmm8
vmovaps XMMWORD PTR [rsp+64], xmm9
vmaxsd xmm9, xmm3, QWORD PTR [rcx+8]
vmovaps XMMWORD PTR [rsp+48], xmm10
vminsd xmm10, xmm3, QWORD PTR [rcx+8]
vmovaps XMMWORD PTR [rsp], xmm13
vmovsd xmm13, QWORD PTR __real@3ff0000000000000
vmovaps XMMWORD PTR [rsp+32], xmm11
vmaxsd xmm11, xmm1, QWORD PTR [rcx]
vmovsd QWORD PTR [rdx+16], xmm11
vmovsd QWORD PTR [rdx], xmm0
vmovsd QWORD PTR [rdx+8], xmm10
vmovsd QWORD PTR [rdx+24], xmm9
vmovsd xmm8, QWORD PTR [rcx+16]
vmovaps XMMWORD PTR [rsp+16], xmm12
vmovaps xmm12, xmm0
vcomisd xmm12, xmm8
ja SHORT $LN3@blQuadBoun
vcomisd xmm8, xmm11
jbe SHORT $LN2@blQuadBoun
$LN3@blQuadBoun:
vmovsd xmm6, QWORD PTR [rcx]
vaddsd xmm0, xmm8, xmm8
vsubsd xmm1, xmm6, xmm0
vaddsd xmm2, xmm1, QWORD PTR [rcx+32]
vsubsd xmm3, xmm6, xmm8
vdivsd xmm5, xmm3, xmm2
vsubsd xmm4, xmm13, xmm5
vsubsd xmm0, xmm5, xmm4
vaddsd xmm1, xmm0, xmm0
vmulsd xmm3, xmm1, xmm8
vmulsd xmm2, xmm4, xmm4
vmulsd xmm0, xmm2, xmm6
vmulsd xmm1, xmm5, xmm5
vmulsd xmm2, xmm1, QWORD PTR [rcx+32]
vaddsd xmm4, xmm3, xmm0
vaddsd xmm3, xmm4, xmm2
vminsd xmm0, xmm3, xmm12
vmaxsd xmm1, xmm3, xmm11
vmovsd QWORD PTR [rdx], xmm0
vmovsd QWORD PTR [rdx+16], xmm1
$LN2@blQuadBoun:
vmovsd xmm8, QWORD PTR [rcx+24]
vcomisd xmm10, xmm8
vmovaps xmm12, XMMWORD PTR [rsp+16]
vmovaps xmm11, XMMWORD PTR [rsp+32]
ja SHORT $LN5@blQuadBoun
vcomisd xmm8, xmm9
jbe SHORT $LN4@blQuadBoun
$LN5@blQuadBoun:
vmovsd xmm6, QWORD PTR [rcx+8]
vaddsd xmm0, xmm8, xmm8
vsubsd xmm1, xmm6, xmm0
vaddsd xmm2, xmm1, QWORD PTR [rcx+40]
vsubsd xmm3, xmm6, xmm8
vdivsd xmm5, xmm3, xmm2
vsubsd xmm4, xmm13, xmm5
vsubsd xmm0, xmm5, xmm4
vaddsd xmm1, xmm0, xmm0
vmulsd xmm3, xmm1, xmm8
vmulsd xmm2, xmm4, xmm4
vmulsd xmm0, xmm2, xmm6
vmulsd xmm1, xmm5, xmm5
vmulsd xmm2, xmm1, QWORD PTR [rcx+40]
vaddsd xmm4, xmm3, xmm0
vaddsd xmm3, xmm4, xmm2
vminsd xmm0, xmm3, xmm10
vmaxsd xmm1, xmm3, xmm9
vmovsd QWORD PTR [rdx+8], xmm0
vmovsd QWORD PTR [rdx+24], xmm1
$LN4@blQuadBoun:
vmovaps xmm6, XMMWORD PTR [rsp+96]
vmovaps xmm8, XMMWORD PTR [rsp+80]
vmovaps xmm9, XMMWORD PTR [rsp+64]
vmovaps xmm10, XMMWORD PTR [rsp+48]
vmovaps xmm13, XMMWORD PTR [rsp]
add rsp, 120 ; 00000078H
ret 0
void blQuadBoundingBoxV1(BLPoint const * __ptr64 const,BLBox & __ptr64) ENDP
; NOT VECTORIZED!
bez$ = 176
bBox$ = 184
void blQuadBoundingBoxV2(BLPoint const * __ptr64 const,BLBox & __ptr64) PROC
$LN130:
mov rax, rsp
sub rsp, 168 ; 000000a8H
vmovsd xmm3, QWORD PTR [rcx+40]
vmovsd xmm1, QWORD PTR [rcx+32]
vminsd xmm0, xmm1, QWORD PTR [rcx]
vmovaps XMMWORD PTR [rax-24], xmm6
lea r11, QWORD PTR [rax]
vmovaps XMMWORD PTR [rax-40], xmm7
vmovaps XMMWORD PTR [rax-56], xmm8
vmovaps XMMWORD PTR [rax-72], xmm9
vmovaps XMMWORD PTR [rax-88], xmm10
vmovaps XMMWORD PTR [rax-104], xmm11
vmovaps XMMWORD PTR [rax-120], xmm12
vmovaps XMMWORD PTR [rsp+32], xmm13
vminsd xmm13, xmm3, QWORD PTR [rcx+8]
vmovaps XMMWORD PTR [rsp+16], xmm14
vmaxsd xmm14, xmm1, QWORD PTR [rcx]
vmovaps XMMWORD PTR [rsp], xmm15
vmaxsd xmm15, xmm3, QWORD PTR [rcx+8]
vmovsd QWORD PTR [rdx], xmm0
vmovsd QWORD PTR [rdx+24], xmm15
vmovsd QWORD PTR [rdx+8], xmm13
vmovsd QWORD PTR [rdx+16], xmm14
vmovsd xmm12, QWORD PTR [rcx]
vmovsd xmm11, QWORD PTR [rcx+8]
vmovsd xmm0, QWORD PTR [rcx+16]
vmovsd xmm1, QWORD PTR [rcx+24]
vaddsd xmm2, xmm0, xmm0
vaddsd xmm3, xmm1, xmm1
vsubsd xmm0, xmm12, xmm2
vaddsd xmm2, xmm0, QWORD PTR [rcx+32]
vsubsd xmm0, xmm12, QWORD PTR [rcx+16]
vdivsd xmm2, xmm0, xmm2
vsubsd xmm1, xmm11, xmm3
vaddsd xmm3, xmm1, QWORD PTR [rcx+40]
vsubsd xmm1, xmm11, QWORD PTR [rcx+24]
vdivsd xmm5, xmm1, xmm3
vxorpd xmm4, xmm4, xmm4
vmaxsd xmm0, xmm4, xmm2
vmaxsd xmm1, xmm4, xmm5
vmovsd xmm4, QWORD PTR __real@3ff0000000000000
vminsd xmm10, xmm4, xmm0
vmulsd xmm3, xmm10, QWORD PTR [rcx+32]
vminsd xmm9, xmm4, xmm1
vmulsd xmm2, xmm9, QWORD PTR [rcx+40]
vsubsd xmm8, xmm4, xmm10
vmulsd xmm1, xmm8, QWORD PTR [rcx+16]
vaddsd xmm5, xmm1, xmm3
vmulsd xmm3, xmm10, QWORD PTR [rcx+16]
vsubsd xmm7, xmm4, xmm9
vmulsd xmm0, xmm7, QWORD PTR [rcx+24]
vaddsd xmm6, xmm2, xmm0
vmulsd xmm2, xmm9, QWORD PTR [rcx+24]
vmulsd xmm0, xmm8, xmm12
vmovaps xmm12, XMMWORD PTR [r11-120]
vmulsd xmm1, xmm7, xmm11
vmovaps xmm11, XMMWORD PTR [r11-104]
vaddsd xmm4, xmm2, xmm1
vaddsd xmm3, xmm3, xmm0
vmulsd xmm0, xmm3, xmm8
vmovaps xmm8, XMMWORD PTR [r11-56]
vmulsd xmm1, xmm4, xmm7
vmovaps xmm7, XMMWORD PTR [r11-40]
vmulsd xmm2, xmm6, xmm9
vmovaps xmm6, XMMWORD PTR [r11-24]
vmovaps xmm9, XMMWORD PTR [r11-72]
vmulsd xmm5, xmm5, xmm10
vmovaps xmm10, XMMWORD PTR [r11-88]
vaddsd xmm3, xmm0, xmm5
vminsd xmm0, xmm3, QWORD PTR [rdx]
vaddsd xmm4, xmm1, xmm2
vminsd xmm1, xmm4, xmm13
vmovaps xmm13, XMMWORD PTR [rsp+32]
vmovsd QWORD PTR [rdx], xmm0
vmaxsd xmm0, xmm3, xmm14
vmovaps xmm14, XMMWORD PTR [rsp+16]
vmovsd QWORD PTR [rdx+8], xmm1
vmaxsd xmm1, xmm4, xmm15
vmovaps xmm15, XMMWORD PTR [rsp]
vmovsd QWORD PTR [rdx+16], xmm0
vmovsd QWORD PTR [rdx+24], xmm1
mov rsp, r11
ret 0
void blQuadBoundingBoxV2(BLPoint const * __ptr64 const,BLBox & __ptr64) ENDP
Interpreting the Results
Only Clang was able to use SIMD in the second version. The rest was uninteresting scalar asm.
Quadratic Bezier Bounding Box - Clang Experiments
I did a bit of experimenting and I would like to show how sensitive compilers are when it comes to certain expressions. I created my own version of some basic functions like blMin(a, b)
and blMax(a, b)
. The main reason was to have these constexpr and to return a copy instead of a reference to either a
or b
, which is useful when the type is not primitive and it's some structure like BLPoint
. Let's take a look what would happen if we used standard std::min
and std::max
functions:
// Experiment, use std::min and std::max instead of our own.
void blQuadBoundingBoxV3(const BLPoint bez[3], BLBox& bBox) noexcept {
// Bounding box of start and end points.
bBox.reset(std::min(bez[0].x, bez[2].x), std::min(bez[0].y, bez[2].y),
std::max(bez[0].x, bez[2].x), std::max(bez[0].y, bez[2].y));
BLPoint t = (bez[0] - bez[1]) / (bez[0] - bez[1] * 2.0 + bez[2]);
t.x = std::max(t.x, 0.0);
t.y = std::max(t.y, 0.0);
t.x = std::min(t.x, 1.0);
t.y = std::min(t.y, 1.0);
blBoxMergePoint(bBox, blLerp(blLerp(bez[0], bez[1], t),
blLerp(bez[1], bez[2], t), t));
}
And the output:
# The code becomes a mixture of scalar and SIMD, much worse than the previous one.
blQuadBoundingBoxV3(BLPoint const*, BLBox&):
vmovupd xmm3, xmmword ptr [rdi]
vmovupd xmm2, xmmword ptr [rdi + 32]
vminpd xmm9, xmm2, xmm3
vmaxpd xmm8, xmm2, xmm3
vmovupd xmm4, xmmword ptr [rdi + 16]
vsubsd xmm5, xmm3, xmm4
vpermilpd xmm6, xmm4, 1 # xmm6 = xmm4[1,0]
vpermilpd xmm7, xmm3, 1 # xmm7 = xmm3[1,0]
vsubsd xmm0, xmm7, xmm6
vaddsd xmm1, xmm4, xmm4
vaddsd xmm6, xmm6, xmm6
vsubsd xmm1, xmm3, xmm1
vsubsd xmm6, xmm7, xmm6
vaddsd xmm1, xmm2, xmm1
vdivsd xmm1, xmm5, xmm1
vpermilpd xmm5, xmm2, 1 # xmm5 = xmm2[1,0]
vaddsd xmm5, xmm5, xmm6
vdivsd xmm0, xmm0, xmm5
vmovsd qword ptr [rsp - 16], xmm1
vmovsd qword ptr [rsp - 24], xmm0
vxorpd xmm5, xmm5, xmm5
vucomisd xmm5, xmm1
lea rax, [rsp - 8]
lea rcx, [rsp - 16]
cmova rcx, rax
mov qword ptr [rsp - 8], 0
mov rcx, qword ptr [rcx]
vucomisd xmm5, xmm0
lea rdx, [rsp - 24]
cmova rdx, rax
mov qword ptr [rsp - 16], rcx
mov qword ptr [rsp - 8], 0
mov rax, qword ptr [rdx]
mov qword ptr [rsp - 24], rax
vmovsd xmm0, qword ptr [rsp - 16] # xmm0 = mem[0],zero
vmovq xmm1, rcx
vmovq xmm5, rax
vpunpcklqdq xmm1, xmm1, xmm5 # xmm1 = xmm1[0],xmm5[0]
vmovapd xmm5, xmmword ptr [rip + .LCPI6_0] # xmm5 = [1.000000e+00,1.000000e+00]
vcmpltpd xmm1, xmm5, xmm1
vmovhpd xmm0, xmm0, qword ptr [rsp - 24] # xmm0 = xmm0[0],mem[0]
vblendvpd xmm0, xmm0, xmm5, xmm1
vsubpd xmm1, xmm5, xmm0
vmulpd xmm3, xmm3, xmm1
vmulpd xmm5, xmm4, xmm0
vaddpd xmm3, xmm5, xmm3
vmulpd xmm4, xmm4, xmm1
vmulpd xmm2, xmm2, xmm0
vaddpd xmm2, xmm2, xmm4
vmulpd xmm1, xmm1, xmm3
vmulpd xmm0, xmm2, xmm0
vaddpd xmm0, xmm1, xmm0
vminpd xmm1, xmm0, xmm9
vmovupd xmmword ptr [rsi], xmm1
vmaxpd xmm0, xmm0, xmm8
vmovupd xmmword ptr [rsi + 16], xmm0
ret
As you can see a slight modification of the code resulted in disaster. That nicely vectorized code is gone and instead we've got a mixture scalar and SIMD code that is still better than scalar-only version, but it's much longer than the previous fully autovectorized one.
Conclusion
I initially planned to show more snippets, but I think this was enough for a single post. I'm planning to continue with these experiments to show how code that I use in production compiles and possibly how to improve or rethink what you do to make it more likely to be optimized the way you want. I think that the best compiler at the moment is Clang as it's able to autovectorize well written code really easily, however, as the last snippet shows even Clang can be picky about the constructs you use and sometimes it's pure gambling.
In next posts I would like to focus moslty on GCC and Clang as these compilers are not shy to use SIMD. I tried also ICC before publishing this post, but since it produced similar output as MSVC it doesn't deserve my focus at all - one shitty compiler that makes this post longer is enough :)
You can try it yourself in Compiler Exporer.
Update
I reported a GCC Bug #87105 and clang bug #38705 as I think fixing these would help other code-bases as well.