25 August, 2018

C++ Autovectorization - Part 1

This is a first part of my autovectorization experiments. As a part of my refactoring work I did some cleanup in a Blend2D's internal algebra module and I finally tried how compilers optimize some of its features. I would like to highlight that the performance of this code is very important as essentially all vectors passed to Blend2D's rendering context are processed before the rasterizer can do its work. The processing depends on what you want to do with them - if it's just for filling then the input is transformed, clipped, and passed to edge builder. If you intend to stroke a 2D path then the processing is much heavier and a lot of approximations come into the scene.

Today's post contains more code than usual, however, the code solves real problems so people can seriously consider which compiler would optimize their code the most. Spoiler alert - Clang is superior!

Preparation

Instead of using one huge code section I decided to provide a utility code first and then functions that I would compare. Some code is directly from Blend2D and some was added just for simplicity to make the code compile:

#include <algorithm>
#include <cmath>
#include <stdint.h>

// Point structure [x, y]
struct BLPoint {
  double x;
  double y;

  inline BLPoint() noexcept = default;
  constexpr BLPoint(const BLPoint&) noexcept = default;

  constexpr BLPoint(double x, double y) noexcept
    : x(x), y(y) {}

  inline void reset() noexcept { reset(0, 0); }
  inline void reset(double x, double y) noexcept {
    this->x = x;
    this->y = y;
  }
};

// Box structure [x0, y0, x1, y1]
struct BLBox {
  double x0;
  double y0;
  double x1;
  double y1;

  inline BLBox() noexcept = default;
  constexpr BLBox(const BLBox&) noexcept = default;

  constexpr BLBox(double x0, double y0, double x1, double y1) noexcept
    : x0(x0), y0(y0), x1(x1), y1(y1) {}

  inline void reset() noexcept { reset(0.0, 0.0, 0.0, 0.0); }
  inline void reset(double x0, double y0, double x1, double y1) noexcept {
    this->x0 = x0;
    this->y0 = y0;
    this->x1 = x1;
    this->y1 = y1;
  }
};

// Overloads to make vector processing simpler.
static constexpr BLPoint operator-(const BLPoint& a) noexcept { return BLPoint(-a.x, -a.y); }

static constexpr BLPoint operator+(const BLPoint& a, double b) noexcept { return BLPoint(a.x + b, a.y + b); }
static constexpr BLPoint operator-(const BLPoint& a, double b) noexcept { return BLPoint(a.x - b, a.y - b); }
static constexpr BLPoint operator*(const BLPoint& a, double b) noexcept { return BLPoint(a.x * b, a.y * b); }
static constexpr BLPoint operator/(const BLPoint& a, double b) noexcept { return BLPoint(a.x / b, a.y / b); }

static constexpr BLPoint operator+(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x + b.x, a.y + b.y); }
static constexpr BLPoint operator-(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x - b.x, a.y - b.y); }
static constexpr BLPoint operator*(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x * b.x, a.y * b.y); }
static constexpr BLPoint operator/(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x / b.x, a.y / b.y); }

static constexpr BLPoint operator+(double a, const BLPoint& b) noexcept { return BLPoint(a + b.x, a + b.y); }
static constexpr BLPoint operator-(double a, const BLPoint& b) noexcept { return BLPoint(a - b.x, a - b.y); }
static constexpr BLPoint operator*(double a, const BLPoint& b) noexcept { return BLPoint(a * b.x, a * b.y); }
static constexpr BLPoint operator/(double a, const BLPoint& b) noexcept { return BLPoint(a / b.x, a / b.y); }

static inline BLPoint& operator+=(BLPoint& a, double b) noexcept { a.x += b; a.y += b; return a; }
static inline BLPoint& operator-=(BLPoint& a, double b) noexcept { a.x -= b; a.y -= b; return a; }
static inline BLPoint& operator*=(BLPoint& a, double b) noexcept { a.x *= b; a.y *= b; return a; }
static inline BLPoint& operator/=(BLPoint& a, double b) noexcept { a.x /= b; a.y /= b; return a; }

static inline BLPoint& operator+=(BLPoint& a, const BLPoint& b) noexcept { a.x += b.x; a.y += b.y; return a; }
static inline BLPoint& operator-=(BLPoint& a, const BLPoint& b) noexcept { a.x -= b.x; a.y -= b.y; return a; }
static inline BLPoint& operator*=(BLPoint& a, const BLPoint& b) noexcept { a.x *= b.x; a.y *= b.y; return a; }
static inline BLPoint& operator/=(BLPoint& a, const BLPoint& b) noexcept { a.x /= b.x; a.y /= b.y; return a; }

// Min/Max - different semantics compared to std.
template<typename T> constexpr T blMin(const T& a, const T& b) noexcept { return b < a ? b : a; }
template<typename T> constexpr T blMax(const T& a, const T& b) noexcept { return a < b ? b : a; }

// Linear interpolation, works with points as well.
template<typename V, typename T = double>
inline V blLerp(const V& a, const V& b, const T& t) noexcept {
  return (a * (1.0 - t)) + (b * t);
}

// Merge a point into a box by possibly increasing its size.
inline void blBoxMergePoint(BLBox& box, const BLPoint& p) noexcept {
  box.x0 = blMin(box.x0, p.x);
  box.y0 = blMin(box.y0, p.y);
  box.x1 = blMax(box.x1, p.x);
  box.y1 = blMax(box.y1, p.y);
}

// Quadratic Bezier Coefficients and Derivative.
inline void blQuadExtractCoefficients(const BLPoint bez[3], BLPoint& a, BLPoint& b, BLPoint& c) noexcept {
  a = bez[0] - bez[1] * 2.0 + bez[2];
  b = (bez[1] - bez[0]) * 2.0;
  c = bez[0];
}

inline void blQuadExtractDerivative(const BLPoint bez[3], BLPoint& a, BLPoint& b) noexcept {
  a = (bez[2] - bez[1] * 2.0 + bez[0]) * 2.0;
  b = (bez[1] - bez[0]) * 2.0;
}

// Cubic Bezier Coefficients and Derivative.
inline void blCubicExtractCoefficients(const BLPoint bez[4], BLPoint& a, BLPoint& b, BLPoint& c, BLPoint& d) noexcept {
  d = bez[0];
  a = (bez[1] - bez[2]) * 3.0 + bez[3] - d;
  b = (bez[0] - bez[1]  * 2.0 + bez[2]) * 3.0;
  c = (bez[1] - bez[0]) * 3.0;
}

inline void blCubicExtractDerivative(const BLPoint bez[4], BLPoint& a, BLPoint& b, BLPoint& c) noexcept {
  a = ((bez[1] - bez[2]) * 3.0 + bez[3] - bez[0]) * 3.0;
  b = (bez[2] - bez[1] * 2.0 + bez[0]) * 6.0;
  c = (bez[1] - bez[0]) * 3.0;
}

I don't know whether to comment it - it's pretty simple code that provides Point, Box, and some utility functions we rely on.

Quadratic Bezier Evaluation

These functions return a point of quadratic Bezier at the given t:

// Quadratic Bezier Evaluation.
BLPoint blQuadEvaluateAtT(const BLPoint bez[3], double t) noexcept {
  BLPoint a, b, c;
  blQuadExtractCoefficients(bez, a, b, c);
  return (a * t + b) * t + c;
}

BLPoint blQuadPreciseEvaluateAtT(const BLPoint bez[3], double t) noexcept {
  return blLerp(blLerp(bez[0], bez[1], t),
                blLerp(bez[1], bez[2], t), t);
}

GCC [-std=c++17 -O3 -fno-math-errno -mavx2]

blQuadEvaluateAtT(BLPoint const*, double):
  vmovupd  xmm2, XMMWORD PTR [rdi+16]
  vmovupd  xmm4, XMMWORD PTR [rdi]
  vmovddup xmm3, xmm0
  vaddpd   xmm1, xmm2, xmm2
  vsubpd   xmm2, xmm2, xmm4
  vsubpd   xmm1, xmm4, xmm1
  vaddpd   xmm0, xmm1, XMMWORD PTR [rdi+32]
  vaddpd   xmm2, xmm2, xmm2
  vmulpd   xmm1, xmm0, xmm3
  vaddpd   xmm0, xmm1, xmm2
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm5, xmm0, xmm4
  vmovaps  XMMWORD PTR [rsp-24], xmm5
  vmovsd   xmm1, QWORD PTR [rsp-16]
  vmovsd   xmm0, QWORD PTR [rsp-24]
  ret

blQuadPreciseEvaluateAtT(BLPoint const*, double):
  vmovapd  xmm3, XMMWORD PTR .LC0[rip]
  vmovddup xmm0, xmm0
  vmovupd  xmm4, XMMWORD PTR [rdi+16]
  vmulpd   xmm2, xmm0, XMMWORD PTR [rdi+32]
  vsubpd   xmm3, xmm3, xmm0
  vmulpd   xmm1, xmm3, xmm4
  vaddpd   xmm2, xmm2, xmm1
  vmulpd   xmm1, xmm3, XMMWORD PTR [rdi]
  vmulpd   xmm2, xmm2, xmm0
  vmulpd   xmm0, xmm4, xmm0
  vaddpd   xmm0, xmm1, xmm0
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm5, xmm2, xmm0
  vmovaps  XMMWORD PTR [rsp-24], xmm5
  vmovsd   xmm1, QWORD PTR [rsp-16]
  vmovsd   xmm0, QWORD PTR [rsp-24]
  ret

Clang [-std=c++17 -O3 -fno-math-errno -mavx2]

blQuadEvaluateAtT(BLPoint const*, double):
  vmovupd  xmm1, xmmword ptr [rdi]
  vmovupd  xmm2, xmmword ptr [rdi + 16]
  vaddpd   xmm3, xmm2, xmm2
  vsubpd   xmm3, xmm1, xmm3
  vaddpd   xmm3, xmm3, xmmword ptr [rdi + 32]
  vsubpd   xmm2, xmm2, xmm1
  vaddpd   xmm2, xmm2, xmm2
  vmovddup xmm0, xmm0      # xmm0 = xmm0[0,0]
  vmulpd   xmm3, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm3
  vmulpd   xmm0, xmm0, xmm2
  vaddpd   xmm0, xmm1, xmm0
  vpermilpd xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
  ret

.LCPI1_0:
  .quad    4607182418800017408     # double 1

blQuadPreciseEvaluateAtT(BLPoint const*, double):
  vmovsd   xmm1, qword ptr [rip + .LCPI1_0] # xmm1 = mem[0],zero
  vsubsd   xmm1, xmm1, xmm0
  vmovddup xmm1, xmm1      # xmm1 = xmm1[0,0]
  vmulpd   xmm2, xmm1, xmmword ptr [rdi]
  vmovupd  xmm3, xmmword ptr [rdi + 16]
  vmovddup xmm0, xmm0      # xmm0 = xmm0[0,0]
  vmulpd   xmm4, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm4
  vmulpd   xmm3, xmm1, xmm3
  vmulpd   xmm4, xmm0, xmmword ptr [rdi + 32]
  vaddpd   xmm3, xmm3, xmm4
  vmulpd   xmm1, xmm1, xmm2
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm0, xmm1, xmm0
  vpermilpd xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
  ret

MSVC [/O2 /Qpar /arch:AVX2]

; NOT VECTORIZED!
__$ReturnUdt$ = 80
bez$ = 88
t$ = 96
BLPoint blQuadEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN42:
  sub     rsp, 72                             ; 00000048H
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vmovsd  xmm5, QWORD PTR [rdx]
  vmovaps XMMWORD PTR [rsp+48], xmm6
  mov     rax, rcx
  vmovsd  xmm6, QWORD PTR [rdx+8]
  vmovaps XMMWORD PTR [rsp+32], xmm7
  vaddsd  xmm3, xmm0, xmm0
  vsubsd  xmm0, xmm5, xmm3
  vmovaps XMMWORD PTR [rsp+16], xmm8
  vaddsd  xmm8, xmm0, QWORD PTR [rdx+32]
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vsubsd  xmm3, xmm0, xmm5
  vmovaps XMMWORD PTR [rsp], xmm9
  vmovaps xmm9, xmm2
  vaddsd  xmm4, xmm1, xmm1
  vsubsd  xmm1, xmm6, xmm4
  vaddsd  xmm7, xmm1, QWORD PTR [rdx+40]
  vmulsd  xmm4, xmm3, QWORD PTR __real@4000000000000000
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vsubsd  xmm2, xmm1, xmm6
  vmulsd  xmm5, xmm2, QWORD PTR __real@4000000000000000
  vmovups xmm6, XMMWORD PTR [rdx]
  vmulsd  xmm2, xmm7, xmm9
  vmovaps xmm7, XMMWORD PTR [rsp+32]
  vmulsd  xmm1, xmm8, xmm9
  vmovaps xmm8, XMMWORD PTR [rsp+16]
  vaddsd  xmm0, xmm4, xmm1
  vmulsd  xmm4, xmm0, xmm9
  vaddsd  xmm0, xmm4, xmm6
  vaddsd  xmm3, xmm2, xmm5
  vmulsd  xmm5, xmm3, xmm9
  vmovaps xmm9, XMMWORD PTR [rsp]
  vunpckhpd xmm2, xmm6, xmm6
  vmovaps xmm6, XMMWORD PTR [rsp+48]
  vmovsd  QWORD PTR [rcx], xmm0
  vaddsd  xmm0, xmm5, xmm2
  vmovsd  QWORD PTR [rcx+8], xmm0
  add     rsp, 72                             ; 00000048H
  ret     0
BLPoint blQuadEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP

; NOT VECTORIZED!
__$ReturnUdt$ = 64
bez$ = 72
t$ = 80
BLPoint blQuadPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN46:
  sub     rsp, 56                             ; 00000038H
  vmulsd  xmm4, xmm2, QWORD PTR [rdx+32]
  vmulsd  xmm3, xmm2, QWORD PTR [rdx+40]
  vmovsd  xmm0, QWORD PTR __real@3ff0000000000000
  vmovaps XMMWORD PTR [rsp+32], xmm6
  mov     rax, rcx
  vmovaps XMMWORD PTR [rsp+16], xmm7
  vsubsd  xmm7, xmm0, xmm2
  vmulsd  xmm1, xmm7, QWORD PTR [rdx+16]
  vmulsd  xmm0, xmm7, QWORD PTR [rdx+24]
  vaddsd  xmm6, xmm3, xmm0
  vmulsd  xmm0, xmm7, QWORD PTR [rdx+8]
  vaddsd  xmm5, xmm4, xmm1
  vmulsd  xmm1, xmm7, QWORD PTR [rdx]
  vmovaps XMMWORD PTR [rsp], xmm8
  vmovaps xmm8, xmm2
  vmulsd  xmm3, xmm8, QWORD PTR [rdx+24]
  vmulsd  xmm2, xmm2, QWORD PTR [rdx+16]
  vaddsd  xmm2, xmm2, xmm1
  vaddsd  xmm3, xmm0, xmm3
  vmulsd  xmm4, xmm5, xmm8
  vmulsd  xmm5, xmm6, xmm8
  vmovaps xmm6, XMMWORD PTR [rsp+32]
  vmovaps xmm8, XMMWORD PTR [rsp]
  vmulsd  xmm0, xmm2, xmm7
  vmulsd  xmm1, xmm3, xmm7
  vmovaps xmm7, XMMWORD PTR [rsp+16]
  vaddsd  xmm0, xmm0, xmm4
  vaddsd  xmm1, xmm1, xmm5
  vmovsd  QWORD PTR [rcx], xmm0
  vmovsd  QWORD PTR [rcx+8], xmm1
  add     rsp, 56                             ; 00000038H
  ret     0
BLPoint blQuadPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP

Interpreting the Results

GCC and Clang okay, MSVC didn't use SIMD.

Cubic Bezier Evaluation

These functions return a point of cubic Bezier at the given t:

// Cubic Bezier Evaluation.
BLPoint blCubicEvaluateAtT(const BLPoint bez[4], double t) noexcept {
  BLPoint a, b, c, d;
  blCubicExtractCoefficients(bez, a, b, c, d);
  return ((a * t + b) * t + c) * t + c;
}

BLPoint blCubicPreciseEvaluateAtT(const BLPoint bez[4], double t) noexcept {
  BLPoint p01(blLerp(bez[0], bez[1], t));
  BLPoint p12(blLerp(bez[1], bez[2], t));
  BLPoint p23(blLerp(bez[2], bez[3], t));
  return blLerp(blLerp(p01, p12, t), blLerp(p12, p23, t), t);
}

GCC [-std=c++17 -O3 -fno-math-errno -mavx2]

blCubicEvaluateAtT(BLPoint const*, double):
  vmovupd  xmm3, XMMWORD PTR [rdi+16]
  vmovddup xmm4, xmm0
  vmovupd  xmm1, XMMWORD PTR [rdi]
  vmovupd  xmm6, XMMWORD PTR [rdi+32]
  vmovapd  xmm5, XMMWORD PTR .LC1[rip]
  vsubpd   xmm2, xmm3, xmm1
  vsubpd   xmm0, xmm3, xmm6
  vaddpd   xmm3, xmm3, xmm3
  vmulpd   xmm2, xmm2, xmm5
  vmulpd   xmm0, xmm0, xmm5
  vaddpd   xmm0, xmm0, XMMWORD PTR [rdi+48]
  vsubpd   xmm0, xmm0, xmm1
  vsubpd   xmm1, xmm1, xmm3
  vaddpd   xmm1, xmm1, xmm6
  vmulpd   xmm0, xmm0, xmm4
  vmulpd   xmm1, xmm1, xmm5
  vaddpd   xmm0, xmm0, xmm1
  vmulpd   xmm0, xmm0, xmm4
  vaddpd   xmm0, xmm0, xmm2
  vmulpd   xmm0, xmm0, xmm4
  vaddpd   xmm7, xmm0, xmm2
  vmovaps  XMMWORD PTR [rsp-24], xmm7
  vmovsd   xmm1, QWORD PTR [rsp-16]
  vmovsd   xmm0, QWORD PTR [rsp-24]
  ret

blCubicPreciseEvaluateAtT(BLPoint const*, double):
  vmovddup xmm2, xmm0
  vmovapd  xmm0, XMMWORD PTR .LC0[rip]
  vmovupd  xmm5, XMMWORD PTR [rdi+16]
  vmulpd   xmm4, xmm2, XMMWORD PTR [rdi+32]
  vsubpd   xmm3, xmm0, xmm2
  vmulpd   xmm1, xmm2, XMMWORD PTR [rdi+48]
  vmulpd   xmm0, xmm3, xmm5
  vmulpd   xmm5, xmm5, xmm2
  vaddpd   xmm4, xmm4, xmm0
  vmulpd   xmm0, xmm3, XMMWORD PTR [rdi+32]
  vaddpd   xmm1, xmm1, xmm0
  vmulpd   xmm0, xmm3, xmm4
  vmulpd   xmm1, xmm1, xmm2
  vaddpd   xmm1, xmm1, xmm0
  vmulpd   xmm0, xmm3, XMMWORD PTR [rdi]
  vmulpd   xmm1, xmm1, xmm2
  vmulpd   xmm2, xmm4, xmm2
  vaddpd   xmm0, xmm0, xmm5
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm0, xmm0, xmm2
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm6, xmm1, xmm0
  vmovaps  XMMWORD PTR [rsp-24], xmm6
  vmovsd   xmm1, QWORD PTR [rsp-16]
  vmovsd   xmm0, QWORD PTR [rsp-24]
  ret

Clang [-std=c++17 -O3 -fno-math-errno -mavx2]

.LCPI2_0:
  .quad    4613937818241073152     # double 3
  .quad    4613937818241073152     # double 3

blCubicEvaluateAtT(BLPoint const*, double):
  vmovupd  xmm1, xmmword ptr [rdi + 16]
  vmovupd  xmm2, xmmword ptr [rdi + 32]
  vsubpd   xmm3, xmm1, xmm2
  vmovapd  xmm4, xmmword ptr [rip + .LCPI2_0] # xmm4 = [3.000000e+00,3.000000e+00]
  vmulpd   xmm3, xmm3, xmm4
  vaddpd   xmm3, xmm3, xmmword ptr [rdi + 48]
  vmovupd  xmm5, xmmword ptr [rdi]
  vsubpd   xmm3, xmm3, xmm5
  vaddpd   xmm6, xmm1, xmm1
  vsubpd   xmm6, xmm5, xmm6
  vaddpd   xmm2, xmm6, xmm2
  vmulpd   xmm2, xmm2, xmm4
  vsubpd   xmm1, xmm1, xmm5
  vmulpd   xmm1, xmm1, xmm4
  vmovddup xmm0, xmm0      # xmm0 = xmm0[0,0]
  vmulpd   xmm3, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm3
  vmulpd   xmm2, xmm0, xmm2
  vaddpd   xmm2, xmm1, xmm2
  vmulpd   xmm0, xmm0, xmm2
  vaddpd   xmm0, xmm1, xmm0
  vpermilpd xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
  ret

.LCPI3_0:
  .quad    4607182418800017408     # double 1

blCubicPreciseEvaluateAtT(BLPoint const*, double):
  vmovsd   xmm1, qword ptr [rip + .LCPI3_0] # xmm1 = mem[0],zero
  vsubsd   xmm1, xmm1, xmm0
  vmovddup xmm1, xmm1      # xmm1 = xmm1[0,0]
  vmulpd   xmm2, xmm1, xmmword ptr [rdi]
  vmovddup xmm0, xmm0      # xmm0 = xmm0[0,0]
  vmovupd  xmm3, xmmword ptr [rdi + 16]
  vmovupd  xmm4, xmmword ptr [rdi + 32]
  vmulpd   xmm5, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm5
  vmulpd   xmm3, xmm1, xmm3
  vmulpd   xmm5, xmm0, xmm4
  vaddpd   xmm3, xmm3, xmm5
  vmulpd   xmm5, xmm0, xmmword ptr [rdi + 48]
  vmulpd   xmm4, xmm1, xmm4
  vaddpd   xmm4, xmm4, xmm5
  vmulpd   xmm2, xmm1, xmm2
  vmulpd   xmm5, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm5
  vmulpd   xmm3, xmm1, xmm3
  vmulpd   xmm4, xmm0, xmm4
  vaddpd   xmm3, xmm3, xmm4
  vmulpd   xmm1, xmm1, xmm2
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm0, xmm1, xmm0
  vpermilpd xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
  ret

MSVC [/O2 /Qpar /arch:AVX2]

; NOT VECTORIZED!
__$ReturnUdt$ = 96
bez$ = 104
t$ = 112
BLPoint blCubicEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN70:
  sub     rsp, 88                             ; 00000058H
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vsubsd  xmm3, xmm0, QWORD PTR [rdx+32]
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vmovups xmm5, XMMWORD PTR [rdx]
  vmovaps XMMWORD PTR [rsp+64], xmm6
  mov     rax, rcx
  vmovaps XMMWORD PTR [rsp+48], xmm7
  vmovsd  xmm7, QWORD PTR __real@4008000000000000
  vmovaps XMMWORD PTR [rsp+32], xmm8
  vmulsd  xmm0, xmm3, xmm7
  vmovaps XMMWORD PTR [rsp+16], xmm9
  vmovaps XMMWORD PTR [rsp], xmm10
  vmovaps xmm10, xmm2
  vsubsd  xmm2, xmm1, QWORD PTR [rdx+40]
  vaddsd  xmm1, xmm0, QWORD PTR [rdx+48]
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vsubsd  xmm9, xmm1, xmm5
  vmulsd  xmm4, xmm2, xmm7
  vaddsd  xmm2, xmm4, QWORD PTR [rdx+56]
  vunpckhpd xmm1, xmm5, xmm5
  vmovsd  xmm5, QWORD PTR [rdx]
  vsubsd  xmm8, xmm2, xmm1
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vaddsd  xmm2, xmm0, xmm0
  vsubsd  xmm0, xmm5, xmm2
  vmovsd  xmm2, QWORD PTR [rdx+8]
  vaddsd  xmm3, xmm1, xmm1
  vaddsd  xmm1, xmm0, QWORD PTR [rdx+32]
  vsubsd  xmm4, xmm2, xmm3
  vaddsd  xmm0, xmm4, QWORD PTR [rdx+40]
  vmulsd  xmm6, xmm0, xmm7
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vmulsd  xmm4, xmm1, xmm7
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vsubsd  xmm2, xmm1, xmm2
  vsubsd  xmm3, xmm0, xmm5
  vmulsd  xmm5, xmm3, xmm7
  vmulsd  xmm0, xmm9, xmm10
  vmovaps xmm9, XMMWORD PTR [rsp+16]
  vmulsd  xmm1, xmm8, xmm10
  vmovaps xmm8, XMMWORD PTR [rsp+32]
  vaddsd  xmm3, xmm1, xmm6
  vmovaps xmm6, XMMWORD PTR [rsp+64]
  vmulsd  xmm7, xmm2, xmm7
  vaddsd  xmm2, xmm0, xmm4
  vmulsd  xmm4, xmm2, xmm10
  vmulsd  xmm0, xmm3, xmm10
  vaddsd  xmm1, xmm4, xmm5
  vaddsd  xmm2, xmm0, xmm7
  vmulsd  xmm3, xmm1, xmm10
  vmulsd  xmm4, xmm2, xmm10
  vmovaps xmm10, XMMWORD PTR [rsp]
  vaddsd  xmm1, xmm4, xmm7
  vmovaps xmm7, XMMWORD PTR [rsp+48]
  vaddsd  xmm0, xmm3, xmm5
  vmovsd  QWORD PTR [rcx], xmm0
  vmovsd  QWORD PTR [rcx+8], xmm1
  add     rsp, 88                             ; 00000058H
  ret     0
BLPoint blCubicEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP

; NOT VECTORIZED!
__$ReturnUdt$ = 128
bez$ = 136
t$ = 144
BLPoint blCubicPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN88:
  mov     r11, rsp
  sub     rsp, 120                      ; 00000078H
  vmulsd  xmm4, xmm2, QWORD PTR [rdx+16]
  vmulsd  xmm3, xmm2, QWORD PTR [rdx+24]
  vmovsd  xmm0, QWORD PTR __real@3ff0000000000000
  vmovaps XMMWORD PTR [rsp+96], xmm6
  mov     rax, rcx
  vmovaps XMMWORD PTR [rsp+80], xmm7
  vmovaps XMMWORD PTR [r11-56], xmm8
  vmovaps XMMWORD PTR [r11-72], xmm9
  vmovaps XMMWORD PTR [r11-88], xmm10
  vmovaps XMMWORD PTR [r11-104], xmm11
  vsubsd  xmm11, xmm0, xmm2
  vmulsd  xmm1, xmm11, QWORD PTR [rdx]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+8]
  vaddsd  xmm10, xmm3, xmm0
  vmulsd  xmm3, xmm2, QWORD PTR [rdx+32]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+24]
  vaddsd  xmm9, xmm4, xmm1
  vmulsd  xmm1, xmm11, QWORD PTR [rdx+16]
  vaddsd  xmm5, xmm3, xmm1
  vmulsd  xmm1, xmm11, QWORD PTR [rdx+32]
  vmovaps XMMWORD PTR [rsp], xmm12
  vmovaps xmm12, xmm2
  vmulsd  xmm2, xmm2, QWORD PTR [rdx+40]
  vmulsd  xmm3, xmm12, QWORD PTR [rdx+48]
  vaddsd  xmm8, xmm2, xmm0
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+40]
  vmulsd  xmm2, xmm12, QWORD PTR [rdx+56]
  vaddsd  xmm1, xmm1, xmm3
  vaddsd  xmm2, xmm2, xmm0
  vmulsd  xmm3, xmm1, xmm12
  vmulsd  xmm4, xmm2, xmm12
  vmulsd  xmm1, xmm8, xmm11
  vaddsd  xmm7, xmm1, xmm4
  vmulsd  xmm0, xmm5, xmm11
  vaddsd  xmm6, xmm0, xmm3
  vmulsd  xmm1, xmm10, xmm11
  vmovaps xmm10, XMMWORD PTR [r11-88]
  vmulsd  xmm2, xmm5, xmm12
  vmulsd  xmm3, xmm8, xmm12
  vmovaps xmm8, XMMWORD PTR [r11-56]
  vmulsd  xmm0, xmm9, xmm11
  vmovaps xmm9, XMMWORD PTR [r11-72]
  vaddsd  xmm5, xmm1, xmm3
  vaddsd  xmm4, xmm0, xmm2
  vmulsd  xmm2, xmm6, xmm12
  vmovaps xmm6, XMMWORD PTR [rsp+96]
  vmulsd  xmm3, xmm7, xmm12
  vmovaps xmm7, XMMWORD PTR [rsp+80]
  vmovaps xmm12, XMMWORD PTR [rsp]
  vmulsd  xmm1, xmm5, xmm11
  vmulsd  xmm0, xmm4, xmm11
  vmovaps xmm11, XMMWORD PTR [r11-104]
  vaddsd  xmm1, xmm1, xmm3
  vaddsd  xmm0, xmm0, xmm2
  vmovsd  QWORD PTR [rcx+8], xmm1
  vmovsd  QWORD PTR [rcx], xmm0
  add     rsp, 120                      ; 00000078H
  ret     0
BLPoint blCubicPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP

Interpreting the Results

GCC and Clang okay, MSVC didn't use SIMD and generated so much code again.

Quadratic Bezier Bounding Box

Calculating bounding box of a quadratic Bezier curve is quite common. I wrote two solutions - the first version (V1) uses scalar code that only merges possible extremas when they are found. The second version (V2) doesn't bother and always calculates both extremas even if they are outside of the domain (they would be clamped in such case). The second version is SIMD friendly and that was the intention:

// Scalar version - not good for SIMD.
inline void blQuadCoefficientsAtT(double t, double& a, double& b, double& c) noexcept {
  double tInv = 1.0 - t;
  a = tInv * tInv;
  b = 2.0 * (t - tInv);
  c = t * t;
}

void blQuadBoundingBoxV1(const BLPoint bez[3], BLBox& bBox) noexcept {
  // Get bounding box of start and end points.
  bBox.reset(blMin(bez[0].x, bez[2].x), blMin(bez[0].y, bez[2].y),
             blMax(bez[0].x, bez[2].x), blMax(bez[0].y, bez[2].y));

  // Merge X extrema.
  if (bez[1].x < bBox.x0 || bez[1].x > bBox.x1) {
    double t = (bez[0].x - bez[1].x) / (bez[0].x - 2.0 * bez[1].x + bez[2].x);
    double a, b, c;

    blQuadCoefficientsAtT(t, a, b, c);
    double coord = a * bez[0].x + b * bez[1].x + c * bez[2].x;
    bBox.x0 = blMin(bBox.x0, coord);
    bBox.x1 = blMax(bBox.x1, coord);
  }

  // Merge Y extrema.
  if (bez[1].y < bBox.y0 || bez[1].y > bBox.y1) {
    double t = (bez[0].y - bez[1].y) / (bez[0].y - 2.0 * bez[1].y + bez[2].y);
    double a, b, c;

    blQuadCoefficientsAtT(t, a, b, c);
    double coord = a * bez[0].y + b * bez[1].y + c * bez[2].y;
    bBox.y0 = blMin(bBox.y0, coord);
    bBox.y1 = blMax(bBox.y1, coord);
  }
}

// SIMD friendly (branchless).
void blQuadBoundingBoxV2(const BLPoint bez[3], BLBox& bBox) noexcept {
  // Bounding box of start and end points.
  bBox.reset(blMin(bez[0].x, bez[2].x), blMin(bez[0].y, bez[2].y),
             blMax(bez[0].x, bez[2].x), blMax(bez[0].y, bez[2].y));

  BLPoint t = (bez[0] - bez[1]) / (bez[0] - bez[1] * 2.0 + bez[2]);
  t.x = blMax(t.x, 0.0);
  t.y = blMax(t.y, 0.0);
  t.x = blMin(t.x, 1.0);
  t.y = blMin(t.y, 1.0);

  blBoxMergePoint(bBox, blLerp(blLerp(bez[0], bez[1], t),
                               blLerp(bez[1], bez[2], t), t));
}

GCC [-std=c++17 -O3 -fno-math-errno -mavx2]

# NOT VECTORIZED!
blQuadBoundingBoxV1(BLPoint const*, BLBox&):
  vmovsd  xmm7, QWORD PTR [rdi+8]
  vmovsd  xmm6, QWORD PTR [rdi+40]
  vmovsd  xmm11, QWORD PTR [rdi]
  vmovsd  xmm10, QWORD PTR [rdi+32]
  vmaxsd  xmm4, xmm6, xmm7
  vminsd  xmm3, xmm6, xmm7
  vminsd  xmm8, xmm10, xmm11
  vmaxsd  xmm9, xmm10, xmm11
  vunpcklpd       xmm1, xmm8, xmm3
  vunpcklpd       xmm0, xmm9, xmm4
  vmovups XMMWORD PTR [rsi], xmm1
  vmovups XMMWORD PTR [rsi+16], xmm0
  vmovsd  xmm1, QWORD PTR [rdi+16]
  vcomisd xmm8, xmm1
  ja      .L11
  vcomisd xmm1, xmm9
  ja      .L11
.L12:
  vmovsd  xmm0, QWORD PTR [rdi+24]
  vcomisd xmm3, xmm0
  ja      .L16
  vcomisd xmm0, xmm4
  jbe     .L27
.L16:
  vaddsd  xmm1, xmm0, xmm0
  vsubsd  xmm2, xmm7, xmm0
  vmovsd  xmm5, QWORD PTR .LC2[rip]
  vsubsd  xmm1, xmm7, xmm1
  vaddsd  xmm1, xmm1, xmm6
  vdivsd  xmm2, xmm2, xmm1
  vsubsd  xmm5, xmm5, xmm2
  vsubsd  xmm1, xmm2, xmm5
  vmulsd  xmm5, xmm5, xmm5
  vmulsd  xmm2, xmm2, xmm2
  vaddsd  xmm1, xmm1, xmm1
  vmulsd  xmm5, xmm5, xmm7
  vmulsd  xmm0, xmm1, xmm0
  vmulsd  xmm2, xmm2, xmm6
  vaddsd  xmm0, xmm0, xmm5
  vaddsd  xmm0, xmm0, xmm2
  vminsd  xmm3, xmm0, xmm3
  vmaxsd  xmm0, xmm0, xmm4
  vmovsd  QWORD PTR [rsi+8], xmm3
  vmovsd  QWORD PTR [rsi+24], xmm0
.L27:
  ret
.L11:
  vaddsd  xmm0, xmm1, xmm1
  vsubsd  xmm5, xmm11, xmm1
  vsubsd  xmm0, xmm11, xmm0
  vaddsd  xmm0, xmm0, xmm10
  vdivsd  xmm5, xmm5, xmm0
  vmovsd  xmm0, QWORD PTR .LC2[rip]
  vsubsd  xmm0, xmm0, xmm5
  vsubsd  xmm2, xmm5, xmm0
  vmulsd  xmm0, xmm0, xmm0
  vmulsd  xmm5, xmm5, xmm5
  vaddsd  xmm2, xmm2, xmm2
  vmulsd  xmm0, xmm0, xmm11
  vmulsd  xmm1, xmm2, xmm1
  vmulsd  xmm5, xmm5, xmm10
  vaddsd  xmm1, xmm1, xmm0
  vaddsd  xmm1, xmm1, xmm5
  vminsd  xmm8, xmm1, xmm8
  vmaxsd  xmm1, xmm1, xmm9
  vmovsd  QWORD PTR [rsi], xmm8
  vmovsd  QWORD PTR [rsi+16], xmm1
  jmp     .L12

# NOT VECTORIZED!
blQuadBoundingBoxV2(BLPoint const*, BLBox&):
  push    rbp
  mov     rbp, rsp
  and     rsp, -32
  vmovsd  xmm1, QWORD PTR [rdi+8]
  vmovsd  xmm0, QWORD PTR [rdi]
  vmovsd  xmm5, QWORD PTR [rdi+40]
  vmovsd  xmm6, QWORD PTR [rdi+32]
  vmaxsd  xmm13, xmm5, xmm1
  vmaxsd  xmm12, xmm6, xmm0
  vminsd  xmm5, xmm5, xmm1
  vminsd  xmm6, xmm6, xmm0
  vunpcklpd       xmm0, xmm12, xmm13
  vunpcklpd       xmm1, xmm6, xmm5
  vmovups XMMWORD PTR [rsi+16], xmm0
  vmovups XMMWORD PTR [rsi], xmm1
  vmovsd  xmm2, QWORD PTR [rdi+24]
  vmovsd  xmm10, QWORD PTR [rdi+8]
  vmovsd  xmm1, QWORD PTR [rdi+40]
  vmovsd  xmm7, QWORD PTR [rdi+16]
  vaddsd  xmm4, xmm2, xmm2
  vsubsd  xmm9, xmm10, xmm2
  vmovsd  xmm3, QWORD PTR [rdi]
  vmovsd  xmm0, QWORD PTR [rdi+32]
  vsubsd  xmm8, xmm3, xmm7
  vsubsd  xmm4, xmm10, xmm4
  vaddsd  xmm4, xmm4, xmm1
  vdivsd  xmm9, xmm9, xmm4
  vaddsd  xmm4, xmm7, xmm7
  vsubsd  xmm4, xmm3, xmm4
  vaddsd  xmm4, xmm4, xmm0
  vdivsd  xmm8, xmm8, xmm4
  vxorpd  xmm4, xmm4, xmm4
  vcomisd xmm4, xmm8
  ja      .L33
  vcomisd xmm4, xmm9
  jbe     .L62
  vmovsd  xmm11, QWORD PTR .LC2[rip]
  vmulsd  xmm14, xmm1, xmm4
  vmulsd  xmm9, xmm2, xmm4
  vcomisd xmm8, xmm11
  jbe     .L63
  vmovsd  QWORD PTR [rsp-16], xmm2
  vmovapd xmm1, xmm14
  vmovapd xmm2, xmm9
  vxorpd  xmm14, xmm14, xmm14
  vmovsd  QWORD PTR [rsp-8], xmm7
  vmulsd  xmm3, xmm3, xmm4
  vmovapd xmm15, xmm11
  vmovapd xmm8, xmm11
  vmulsd  xmm7, xmm7, xmm4
  vxorpd  xmm9, xmm9, xmm9
  jmp     .L40
.L33:
  vmulsd  xmm11, xmm7, xmm4
  vcomisd xmm4, xmm9
  vxorpd  xmm8, xmm8, xmm8
  vmulsd  xmm0, xmm0, xmm4
  vmovsd  QWORD PTR [rsp-8], xmm11
  vmovsd  xmm11, QWORD PTR .LC2[rip]
  vmovapd xmm14, xmm11
  jbe     .L37
.L46:
  vmovsd  QWORD PTR [rsp-16], xmm2
  vmulsd  xmm1, xmm1, xmm4
  vmovapd xmm15, xmm11
  vxorpd  xmm9, xmm9, xmm9
  vmulsd  xmm2, xmm2, xmm4
  jmp     .L40
.L62:
  vmovsd  xmm11, QWORD PTR .LC2[rip]
  vcomisd xmm8, xmm11
  jbe     .L56
  vmovsd  QWORD PTR [rsp-8], xmm7
  vmulsd  xmm3, xmm3, xmm4
  vxorpd  xmm14, xmm14, xmm14
  vmovapd xmm8, xmm11
  vmulsd  xmm7, xmm7, xmm4
.L37:
  vcomisd xmm9, xmm11
  jbe     .L57
  vmulsd  xmm15, xmm2, xmm4
  vmovapd xmm9, xmm11
  vmulsd  xmm10, xmm10, xmm4
  vmovsd  QWORD PTR [rsp-16], xmm15
  vxorpd  xmm15, xmm15, xmm15
.L40:
  vaddsd  xmm1, xmm1, QWORD PTR [rsp-16]
  vaddsd  xmm3, xmm3, QWORD PTR [rsp-8]
  vaddsd  xmm2, xmm2, xmm10
  vaddsd  xmm0, xmm0, xmm7
  vmulsd  xmm9, xmm1, xmm9
  vmulsd  xmm15, xmm2, xmm15
  vmulsd  xmm8, xmm0, xmm8
  vmulsd  xmm14, xmm3, xmm14
  vaddsd  xmm9, xmm9, xmm15
  vaddsd  xmm14, xmm8, xmm14
  vminsd  xmm5, xmm9, xmm5
  vmaxsd  xmm9, xmm9, xmm13
  vminsd  xmm6, xmm14, xmm6
  vmaxsd  xmm14, xmm14, xmm12
  vmovsd  QWORD PTR [rsi+8], xmm5
  vmovsd  QWORD PTR [rsi+24], xmm9
  vmovsd  QWORD PTR [rsi], xmm6
  vmovsd  QWORD PTR [rsi+16], xmm14
  leave
  ret
.L56:
  vmulsd  xmm15, xmm7, xmm8
  vsubsd  xmm14, xmm11, xmm8
  vmulsd  xmm0, xmm0, xmm8
  vmulsd  xmm3, xmm3, xmm14
  vmulsd  xmm7, xmm7, xmm14
  vmovsd  QWORD PTR [rsp-8], xmm15
  jmp     .L37
.L63:
  vmulsd  xmm15, xmm7, xmm8
  vsubsd  xmm14, xmm11, xmm8
  vmulsd  xmm0, xmm0, xmm8
  vmulsd  xmm3, xmm3, xmm14
  vmulsd  xmm7, xmm7, xmm14
  vmovsd  QWORD PTR [rsp-8], xmm15
  jmp     .L46
.L57:
  vsubsd  xmm15, xmm11, xmm9
  vmulsd  xmm1, xmm1, xmm9
  vmulsd  xmm4, xmm2, xmm15
  vmulsd  xmm10, xmm10, xmm15
  vmulsd  xmm2, xmm2, xmm9
  vmovsd  QWORD PTR [rsp-16], xmm4
  jmp     .L40
.LC0:
  .long   0
  .long   1072693248
  .long   0
  .long   1072693248
.LC1:
  .long   0
  .long   1074266112
  .long   0
  .long   1074266112
.LC2:
  .long   0
  .long   1072693248

Clang [-std=c++17 -O3 -fno-math-errno -mavx2]

# NOT VECTORIZED!
.LCPI4_0:
  .quad   4607182418800017408     # double 1

blQuadBoundingBoxV1(BLPoint const*, BLBox&): # @blQuadBoundingBoxV1(BLPoint const*, BLBox&)
  vmovupd xmm1, xmmword ptr [rdi]
  vmovupd xmm0, xmmword ptr [rdi + 32]
  vminpd  xmm3, xmm0, xmm1
  vmaxpd  xmm2, xmm0, xmm1
  vmovupd xmmword ptr [rsi], xmm3
  vmovupd xmmword ptr [rsi + 16], xmm2
  vmovsd  xmm4, qword ptr [rdi + 16] # xmm4 = mem[0],zero
  vucomisd        xmm3, xmm4
  ja      .LBB4_2
  vucomisd        xmm4, xmm2
  jbe     .LBB4_3
.LBB4_2:
  vsubsd  xmm5, xmm1, xmm4
  vaddsd  xmm6, xmm4, xmm4
  vsubsd  xmm6, xmm1, xmm6
  vaddsd  xmm6, xmm0, xmm6
  vdivsd  xmm5, xmm5, xmm6
  vmovsd  xmm6, qword ptr [rip + .LCPI4_0] # xmm6 = mem[0],zero
  vsubsd  xmm6, xmm6, xmm5
  vmulsd  xmm7, xmm6, xmm6
  vsubsd  xmm6, xmm5, xmm6
  vaddsd  xmm6, xmm6, xmm6
  vmulsd  xmm5, xmm5, xmm5
  vmulsd  xmm7, xmm1, xmm7
  vmulsd  xmm4, xmm4, xmm6
  vaddsd  xmm4, xmm7, xmm4
  vmulsd  xmm5, xmm0, xmm5
  vaddsd  xmm4, xmm5, xmm4
  vminsd  xmm5, xmm4, xmm3
  vmovsd  qword ptr [rsi], xmm5
  vmaxsd  xmm4, xmm4, xmm2
  vmovsd  qword ptr [rsi + 16], xmm4
.LBB4_3:
  vmovsd  xmm4, qword ptr [rdi + 24] # xmm4 = mem[0],zero
  vpermilpd       xmm3, xmm3, 1   # xmm3 = xmm3[1,0]
  vucomisd        xmm3, xmm4
  vpermilpd       xmm2, xmm2, 1   # xmm2 = xmm2[1,0]
  ja      .LBB4_5
  vucomisd        xmm4, xmm2
  ja      .LBB4_5
  ret
.LBB4_5:
  vpermilpd       xmm1, xmm1, 1   # xmm1 = xmm1[1,0]
  vsubsd  xmm5, xmm1, xmm4
  vaddsd  xmm6, xmm4, xmm4
  vsubsd  xmm6, xmm1, xmm6
  vpermilpd       xmm0, xmm0, 1   # xmm0 = xmm0[1,0]
  vaddsd  xmm6, xmm0, xmm6
  vdivsd  xmm5, xmm5, xmm6
  vmovsd  xmm6, qword ptr [rip + .LCPI4_0] # xmm6 = mem[0],zero
  vsubsd  xmm6, xmm6, xmm5
  vmulsd  xmm7, xmm6, xmm6
  vsubsd  xmm6, xmm5, xmm6
  vaddsd  xmm6, xmm6, xmm6
  vmulsd  xmm5, xmm5, xmm5
  vmulsd  xmm1, xmm1, xmm7
  vmulsd  xmm4, xmm4, xmm6
  vaddsd  xmm1, xmm1, xmm4
  vmulsd  xmm0, xmm0, xmm5
  vaddsd  xmm0, xmm0, xmm1
  vminsd  xmm1, xmm0, xmm3
  vmovsd  qword ptr [rsi + 8], xmm1
  vmaxsd  xmm0, xmm0, xmm2
  vmovsd  qword ptr [rsi + 24], xmm0
  ret
.LCPI5_0:
  .quad   4607182418800017408     # double 1
  .quad   4607182418800017408     # double 1

# Vectorized
blQuadBoundingBoxV2(BLPoint const*, BLBox&): # @blQuadBoundingBoxV2(BLPoint const*, BLBox&)
  vmovupd xmm0, xmmword ptr [rdi]
  vmovupd xmm1, xmmword ptr [rdi + 16]
  vmovupd xmm2, xmmword ptr [rdi + 32]
  vminpd  xmm3, xmm2, xmm0
  vmaxpd  xmm4, xmm2, xmm0
  vsubpd  xmm5, xmm0, xmm1
  vaddpd  xmm6, xmm1, xmm1
  vsubpd  xmm6, xmm0, xmm6
  vaddpd  xmm6, xmm2, xmm6
  vdivpd  xmm5, xmm5, xmm6
  vxorpd  xmm6, xmm6, xmm6
  vmaxpd  xmm5, xmm6, xmm5
  vmovapd xmm6, xmmword ptr [rip + .LCPI5_0] # xmm6 = [1.000000e+00,1.000000e+00]
  vminpd  xmm5, xmm6, xmm5
  vsubpd  xmm6, xmm6, xmm5
  vmulpd  xmm0, xmm0, xmm6
  vmulpd  xmm7, xmm1, xmm5
  vaddpd  xmm0, xmm7, xmm0
  vmulpd  xmm1, xmm1, xmm6
  vmulpd  xmm2, xmm2, xmm5
  vaddpd  xmm1, xmm2, xmm1
  vmulpd  xmm0, xmm6, xmm0
  vmulpd  xmm1, xmm5, xmm1
  vaddpd  xmm0, xmm0, xmm1
  vminpd  xmm1, xmm0, xmm3
  vmovupd xmmword ptr [rsi], xmm1
  vmaxpd  xmm0, xmm0, xmm4
  vmovupd xmmword ptr [rsi + 16], xmm0
  ret

MSVC [/O2 /Qpar /arch:AVX2]

; NOT VECTORIZED!
bez$ = 128
bBox$ = 136
void blQuadBoundingBoxV1(BLPoint const * __ptr64 const,BLBox & __ptr64) PROC
$LN46:
  sub     rsp, 120                      ; 00000078H
  vmovsd  xmm3, QWORD PTR [rcx+40]
  vmovsd  xmm1, QWORD PTR [rcx+32]
  vminsd  xmm0, xmm1, QWORD PTR [rcx]
  vmovaps XMMWORD PTR [rsp+96], xmm6
  vmovaps XMMWORD PTR [rsp+80], xmm8
  vmovaps XMMWORD PTR [rsp+64], xmm9
  vmaxsd  xmm9, xmm3, QWORD PTR [rcx+8]
  vmovaps XMMWORD PTR [rsp+48], xmm10
  vminsd  xmm10, xmm3, QWORD PTR [rcx+8]
  vmovaps XMMWORD PTR [rsp], xmm13
  vmovsd  xmm13, QWORD PTR __real@3ff0000000000000
  vmovaps XMMWORD PTR [rsp+32], xmm11
  vmaxsd  xmm11, xmm1, QWORD PTR [rcx]
  vmovsd  QWORD PTR [rdx+16], xmm11
  vmovsd  QWORD PTR [rdx], xmm0
  vmovsd  QWORD PTR [rdx+8], xmm10
  vmovsd  QWORD PTR [rdx+24], xmm9
  vmovsd  xmm8, QWORD PTR [rcx+16]
  vmovaps XMMWORD PTR [rsp+16], xmm12
  vmovaps xmm12, xmm0
  vcomisd xmm12, xmm8
  ja      SHORT $LN3@blQuadBoun
  vcomisd xmm8, xmm11
  jbe     SHORT $LN2@blQuadBoun
$LN3@blQuadBoun:
  vmovsd  xmm6, QWORD PTR [rcx]
  vaddsd  xmm0, xmm8, xmm8
  vsubsd  xmm1, xmm6, xmm0
  vaddsd  xmm2, xmm1, QWORD PTR [rcx+32]
  vsubsd  xmm3, xmm6, xmm8
  vdivsd  xmm5, xmm3, xmm2
  vsubsd  xmm4, xmm13, xmm5
  vsubsd  xmm0, xmm5, xmm4
  vaddsd  xmm1, xmm0, xmm0
  vmulsd  xmm3, xmm1, xmm8
  vmulsd  xmm2, xmm4, xmm4
  vmulsd  xmm0, xmm2, xmm6
  vmulsd  xmm1, xmm5, xmm5
  vmulsd  xmm2, xmm1, QWORD PTR [rcx+32]
  vaddsd  xmm4, xmm3, xmm0
  vaddsd  xmm3, xmm4, xmm2
  vminsd  xmm0, xmm3, xmm12
  vmaxsd  xmm1, xmm3, xmm11
  vmovsd  QWORD PTR [rdx], xmm0
  vmovsd  QWORD PTR [rdx+16], xmm1
$LN2@blQuadBoun:
  vmovsd  xmm8, QWORD PTR [rcx+24]
  vcomisd xmm10, xmm8
  vmovaps xmm12, XMMWORD PTR [rsp+16]
  vmovaps xmm11, XMMWORD PTR [rsp+32]
  ja      SHORT $LN5@blQuadBoun
  vcomisd xmm8, xmm9
  jbe     SHORT $LN4@blQuadBoun
$LN5@blQuadBoun:
  vmovsd  xmm6, QWORD PTR [rcx+8]
  vaddsd  xmm0, xmm8, xmm8
  vsubsd  xmm1, xmm6, xmm0
  vaddsd  xmm2, xmm1, QWORD PTR [rcx+40]
  vsubsd  xmm3, xmm6, xmm8
  vdivsd  xmm5, xmm3, xmm2
  vsubsd  xmm4, xmm13, xmm5
  vsubsd  xmm0, xmm5, xmm4
  vaddsd  xmm1, xmm0, xmm0
  vmulsd  xmm3, xmm1, xmm8
  vmulsd  xmm2, xmm4, xmm4
  vmulsd  xmm0, xmm2, xmm6
  vmulsd  xmm1, xmm5, xmm5
  vmulsd  xmm2, xmm1, QWORD PTR [rcx+40]
  vaddsd  xmm4, xmm3, xmm0
  vaddsd  xmm3, xmm4, xmm2
  vminsd  xmm0, xmm3, xmm10
  vmaxsd  xmm1, xmm3, xmm9
  vmovsd  QWORD PTR [rdx+8], xmm0
  vmovsd  QWORD PTR [rdx+24], xmm1
$LN4@blQuadBoun:
  vmovaps xmm6, XMMWORD PTR [rsp+96]
  vmovaps xmm8, XMMWORD PTR [rsp+80]
  vmovaps xmm9, XMMWORD PTR [rsp+64]
  vmovaps xmm10, XMMWORD PTR [rsp+48]
  vmovaps xmm13, XMMWORD PTR [rsp]
  add     rsp, 120                      ; 00000078H
  ret     0
void blQuadBoundingBoxV1(BLPoint const * __ptr64 const,BLBox & __ptr64) ENDP

; NOT VECTORIZED!
bez$ = 176
bBox$ = 184
void blQuadBoundingBoxV2(BLPoint const * __ptr64 const,BLBox & __ptr64) PROC
$LN130:
  mov     rax, rsp
  sub     rsp, 168                      ; 000000a8H
  vmovsd  xmm3, QWORD PTR [rcx+40]
  vmovsd  xmm1, QWORD PTR [rcx+32]
  vminsd  xmm0, xmm1, QWORD PTR [rcx]
  vmovaps XMMWORD PTR [rax-24], xmm6
  lea     r11, QWORD PTR [rax]
  vmovaps XMMWORD PTR [rax-40], xmm7
  vmovaps XMMWORD PTR [rax-56], xmm8
  vmovaps XMMWORD PTR [rax-72], xmm9
  vmovaps XMMWORD PTR [rax-88], xmm10
  vmovaps XMMWORD PTR [rax-104], xmm11
  vmovaps XMMWORD PTR [rax-120], xmm12
  vmovaps XMMWORD PTR [rsp+32], xmm13
  vminsd  xmm13, xmm3, QWORD PTR [rcx+8]
  vmovaps XMMWORD PTR [rsp+16], xmm14
  vmaxsd  xmm14, xmm1, QWORD PTR [rcx]
  vmovaps XMMWORD PTR [rsp], xmm15
  vmaxsd  xmm15, xmm3, QWORD PTR [rcx+8]
  vmovsd  QWORD PTR [rdx], xmm0
  vmovsd  QWORD PTR [rdx+24], xmm15
  vmovsd  QWORD PTR [rdx+8], xmm13
  vmovsd  QWORD PTR [rdx+16], xmm14
  vmovsd  xmm12, QWORD PTR [rcx]
  vmovsd  xmm11, QWORD PTR [rcx+8]
  vmovsd  xmm0, QWORD PTR [rcx+16]
  vmovsd  xmm1, QWORD PTR [rcx+24]
  vaddsd  xmm2, xmm0, xmm0
  vaddsd  xmm3, xmm1, xmm1
  vsubsd  xmm0, xmm12, xmm2
  vaddsd  xmm2, xmm0, QWORD PTR [rcx+32]
  vsubsd  xmm0, xmm12, QWORD PTR [rcx+16]
  vdivsd  xmm2, xmm0, xmm2
  vsubsd  xmm1, xmm11, xmm3
  vaddsd  xmm3, xmm1, QWORD PTR [rcx+40]
  vsubsd  xmm1, xmm11, QWORD PTR [rcx+24]
  vdivsd  xmm5, xmm1, xmm3
  vxorpd  xmm4, xmm4, xmm4
  vmaxsd  xmm0, xmm4, xmm2
  vmaxsd  xmm1, xmm4, xmm5
  vmovsd  xmm4, QWORD PTR __real@3ff0000000000000
  vminsd  xmm10, xmm4, xmm0
  vmulsd  xmm3, xmm10, QWORD PTR [rcx+32]
  vminsd  xmm9, xmm4, xmm1
  vmulsd  xmm2, xmm9, QWORD PTR [rcx+40]
  vsubsd  xmm8, xmm4, xmm10
  vmulsd  xmm1, xmm8, QWORD PTR [rcx+16]
  vaddsd  xmm5, xmm1, xmm3
  vmulsd  xmm3, xmm10, QWORD PTR [rcx+16]
  vsubsd  xmm7, xmm4, xmm9
  vmulsd  xmm0, xmm7, QWORD PTR [rcx+24]
  vaddsd  xmm6, xmm2, xmm0
  vmulsd  xmm2, xmm9, QWORD PTR [rcx+24]
  vmulsd  xmm0, xmm8, xmm12
  vmovaps xmm12, XMMWORD PTR [r11-120]
  vmulsd  xmm1, xmm7, xmm11
  vmovaps xmm11, XMMWORD PTR [r11-104]
  vaddsd  xmm4, xmm2, xmm1
  vaddsd  xmm3, xmm3, xmm0
  vmulsd  xmm0, xmm3, xmm8
  vmovaps xmm8, XMMWORD PTR [r11-56]
  vmulsd  xmm1, xmm4, xmm7
  vmovaps xmm7, XMMWORD PTR [r11-40]
  vmulsd  xmm2, xmm6, xmm9
  vmovaps xmm6, XMMWORD PTR [r11-24]
  vmovaps xmm9, XMMWORD PTR [r11-72]
  vmulsd  xmm5, xmm5, xmm10
  vmovaps xmm10, XMMWORD PTR [r11-88]
  vaddsd  xmm3, xmm0, xmm5
  vminsd  xmm0, xmm3, QWORD PTR [rdx]
  vaddsd  xmm4, xmm1, xmm2
  vminsd  xmm1, xmm4, xmm13
  vmovaps xmm13, XMMWORD PTR [rsp+32]
  vmovsd  QWORD PTR [rdx], xmm0
  vmaxsd  xmm0, xmm3, xmm14
  vmovaps xmm14, XMMWORD PTR [rsp+16]
  vmovsd  QWORD PTR [rdx+8], xmm1
  vmaxsd  xmm1, xmm4, xmm15
  vmovaps xmm15, XMMWORD PTR [rsp]
  vmovsd  QWORD PTR [rdx+16], xmm0
  vmovsd  QWORD PTR [rdx+24], xmm1
  mov     rsp, r11
  ret     0
void blQuadBoundingBoxV2(BLPoint const * __ptr64 const,BLBox & __ptr64) ENDP

Interpreting the Results

Only Clang was able to use SIMD in the second version. The rest was uninteresting scalar asm.

Quadratic Bezier Bounding Box - Clang Experiments

I did a bit of experimenting and I would like to show how sensitive compilers are when it comes to certain expressions. I created my own version of some basic functions like blMin(a, b) and blMax(a, b). The main reason was to have these constexpr and to return a copy instead of a reference to either a or b, which is useful when the type is not primitive and it's some structure like BLPoint. Let's take a look what would happen if we used standard std::min and std::max functions:

// Experiment, use std::min and std::max instead of our own.
void blQuadBoundingBoxV3(const BLPoint bez[3], BLBox& bBox) noexcept {
  // Bounding box of start and end points.
  bBox.reset(std::min(bez[0].x, bez[2].x), std::min(bez[0].y, bez[2].y),
             std::max(bez[0].x, bez[2].x), std::max(bez[0].y, bez[2].y));

  BLPoint t = (bez[0] - bez[1]) / (bez[0] - bez[1] * 2.0 + bez[2]);
  t.x = std::max(t.x, 0.0);
  t.y = std::max(t.y, 0.0);
  t.x = std::min(t.x, 1.0);
  t.y = std::min(t.y, 1.0);

  blBoxMergePoint(bBox, blLerp(blLerp(bez[0], bez[1], t),
                               blLerp(bez[1], bez[2], t), t));
}

And the output:

# The code becomes a mixture of scalar and SIMD, much worse than the previous one.
blQuadBoundingBoxV3(BLPoint const*, BLBox&):
  vmovupd  xmm3, xmmword ptr [rdi]
  vmovupd  xmm2, xmmword ptr [rdi + 32]
  vminpd   xmm9, xmm2, xmm3
  vmaxpd   xmm8, xmm2, xmm3
  vmovupd  xmm4, xmmword ptr [rdi + 16]
  vsubsd   xmm5, xmm3, xmm4
  vpermilpd xmm6, xmm4, 1   # xmm6 = xmm4[1,0]
  vpermilpd xmm7, xmm3, 1   # xmm7 = xmm3[1,0]
  vsubsd   xmm0, xmm7, xmm6
  vaddsd   xmm1, xmm4, xmm4
  vaddsd   xmm6, xmm6, xmm6
  vsubsd   xmm1, xmm3, xmm1
  vsubsd   xmm6, xmm7, xmm6
  vaddsd   xmm1, xmm2, xmm1
  vdivsd   xmm1, xmm5, xmm1
  vpermilpd xmm5, xmm2, 1   # xmm5 = xmm2[1,0]
  vaddsd   xmm5, xmm5, xmm6
  vdivsd   xmm0, xmm0, xmm5
  vmovsd   qword ptr [rsp - 16], xmm1
  vmovsd   qword ptr [rsp - 24], xmm0
  vxorpd   xmm5, xmm5, xmm5
  vucomisd xmm5, xmm1
  lea      rax, [rsp - 8]
  lea      rcx, [rsp - 16]
  cmova    rcx, rax
  mov      qword ptr [rsp - 8], 0
  mov      rcx, qword ptr [rcx]
  vucomisd xmm5, xmm0
  lea      rdx, [rsp - 24]
  cmova    rdx, rax
  mov      qword ptr [rsp - 16], rcx
  mov      qword ptr [rsp - 8], 0
  mov      rax, qword ptr [rdx]
  mov      qword ptr [rsp - 24], rax
  vmovsd   xmm0, qword ptr [rsp - 16] # xmm0 = mem[0],zero
  vmovq    xmm1, rcx
  vmovq    xmm5, rax
  vpunpcklqdq xmm1, xmm1, xmm5 # xmm1 = xmm1[0],xmm5[0]
  vmovapd  xmm5, xmmword ptr [rip + .LCPI6_0] # xmm5 = [1.000000e+00,1.000000e+00]
  vcmpltpd xmm1, xmm5, xmm1
  vmovhpd  xmm0, xmm0, qword ptr [rsp - 24] # xmm0 = xmm0[0],mem[0]
  vblendvpd xmm0, xmm0, xmm5, xmm1
  vsubpd   xmm1, xmm5, xmm0
  vmulpd   xmm3, xmm3, xmm1
  vmulpd   xmm5, xmm4, xmm0
  vaddpd   xmm3, xmm5, xmm3
  vmulpd   xmm4, xmm4, xmm1
  vmulpd   xmm2, xmm2, xmm0
  vaddpd   xmm2, xmm2, xmm4
  vmulpd   xmm1, xmm1, xmm3
  vmulpd   xmm0, xmm2, xmm0
  vaddpd   xmm0, xmm1, xmm0
  vminpd   xmm1, xmm0, xmm9
  vmovupd  xmmword ptr [rsi], xmm1
  vmaxpd   xmm0, xmm0, xmm8
  vmovupd  xmmword ptr [rsi + 16], xmm0
  ret

As you can see a slight modification of the code resulted in disaster. That nicely vectorized code is gone and instead we've got a mixture scalar and SIMD code that is still better than scalar-only version, but it's much longer than the previous fully autovectorized one.

Conclusion

I initially planned to show more snippets, but I think this was enough for a single post. I'm planning to continue with these experiments to show how code that I use in production compiles and possibly how to improve or rethink what you do to make it more likely to be optimized the way you want. I think that the best compiler at the moment is Clang as it's able to autovectorize well written code really easily, however, as the last snippet shows even Clang can be picky about the constructs you use and sometimes it's pure gambling.

In next posts I would like to focus moslty on GCC and Clang as these compilers are not shy to use SIMD. I tried also ICC before publishing this post, but since it produced similar output as MSVC it doesn't deserve my focus at all - one shitty compiler that makes this post longer is enough :)

You can try it yourself in Compiler Exporer.

Update

I reported a GCC Bug #87105 and clang bug #38705 as I think fixing these would help other code-bases as well.

24 August, 2018

C++ Compilers and FP Rounding on X86

Historically a floating point rounding was always a painful job on X86 targets. The main reason is lacking support from HW point of view as X86 initially offered only instructions to round by using the current rounding mode, which is in most cases assumed to be round-to-even. Round-to-even is the best option for intermediate computations (all floating computations you do would be rounded this way implicitly), however, it's probably the least used "explicit" rounding mode - think how many times have you used it? I guess zero. Please note that by explicit I mean code that you write to round something like std::floor(x).

As I work with floating point numbers often I always tend to check what compilers do with my code to identify if I can help them a bit to generate what I want. I decided to take a look and compare how different compilers handle rounding operations. These functions are covered in this article:

  • nearbyint
  • trunc
  • floor
  • ceil

Baseline

The following code is a baseline that is used for testing:

#include <cmath>

double my_nearby(double x) noexcept { return std::nearbyint(x); }
double my_trunc(double x) noexcept { return std::trunc(x); }
double my_floor(double x) noexcept { return std::floor(x); }
double my_ceil(double x) noexcept { return std::ceil(x); }

SSE2 Code Generation

This is a baseline code generation targeting pre-SSE4.1 hardware.

GCC [-O2]

my_nearby(double):
  jmp       nearbyint

my_trunc(double):
  movsd     xmm2, QWORD PTR .LC1[rip]
  movapd    xmm3, xmm0
  movsd     xmm4, QWORD PTR .LC0[rip]
  movapd    xmm1, xmm0
  andpd     xmm3, xmm2
  ucomisd   xmm4, xmm3
  jbe     .L4
  cvttsd2si rax, xmm0
  pxor      xmm1, xmm1
  andnpd    xmm2, xmm0
  cvtsi2sdq xmm1, rax
  orpd      xmm1, xmm2
  .L4:
  movapd    xmm0, xmm1
  ret

my_floor(double):
  movsd     xmm3, QWORD PTR .LC1[rip]
  movapd    xmm2, xmm0
  movsd     xmm4, QWORD PTR .LC0[rip]
  movapd    xmm1, xmm0
  andpd     xmm2, xmm3
  ucomisd   xmm4, xmm2
  jbe     .L6
  cvttsd2si rax, xmm0
  pxor      xmm2, xmm2
  movsd     xmm4, QWORD PTR .LC2[rip]
  andnpd    xmm3, xmm0
  cvtsi2sdq xmm2, rax
  movapd    xmm5, xmm2
  cmpnlesd  xmm5, xmm0
  movapd    xmm1, xmm5
  andpd     xmm1, xmm4
  subsd     xmm2, xmm1
  movapd    xmm1, xmm2
  orpd      xmm1, xmm3
  .L6:
  movapd    xmm0, xmm1
  ret

my_ceil(double):
  movsd     xmm3, QWORD PTR .LC1[rip]
  movapd    xmm2, xmm0
  movsd     xmm4, QWORD PTR .LC0[rip]
  movapd    xmm1, xmm0
  andpd     xmm2, xmm3
  ucomisd   xmm4, xmm2
  jbe     .L8
  cvttsd2si rax, xmm0
  pxor      xmm2, xmm2
  movsd     xmm4, QWORD PTR .LC2[rip]
  andnpd    xmm3, xmm0
  cvtsi2sdq xmm2, rax
  cmpnlesd  xmm1, xmm2
  andpd     xmm1, xmm4
  addsd     xmm1, xmm2
  orpd      xmm1, xmm3
  .L8:
  movapd    xmm0, xmm1
  ret

.LC0:
  .long     0
  .long     1127219200

.LC1:
  .long     4294967295
  .long     2147483647
  .long     0
  .long     0

.LC2:
  .long     0
  .long     1072693248

Clang [-O2]

my_nearby(double):
  jmp nearbyint

my_trunc(double):
  jmp trunc

my_floor(double):
  jmp floor

my_ceil(double):
  jmp ceil

MSVC [/O2]

double my_nearby(double) PROC
  jmp nearbyint
double my_nearby(double) ENDP

double my_trunc(double) PROC
  jmp trunc
double my_trunc(double) ENDP

double my_floor(double) PROC
  jmp floor
double my_floor(double) ENDP

double my_ceil(double) PROC
  jmp ceil
double my_ceil(double) ENDP

Interpreting the Results

As we can see GCC inlined SSE2 code and other compilers just tail-called C functions that would handle the problem at a library level. The GCC inlined code shows how complicated the rounding really is on a pre-SSE4.1 hardware. It's a lot of code to cover a lot of cases and to perform rounding by using instructions not really dedicated to that task.

Next experiment will only cover GCC and Clang compilers as I haven't found a way to tell MSVC to use SSE4.1 instruction set for that, but I would add some tips for MSVC and AVX code generation afterwards.

SSE4.1 Code Generation

SSE4.1 output of GCC and Clang by using an additional -msse4.1 option.

GCC [-O2 -msse4.1]

my_nearby(double):
  jmp nearbyint

my_trunc(double):
  roundsd xmm0, xmm0, 11
  ret

my_floor(double):
  roundsd xmm0, xmm0, 9
  ret

my_ceil(double):
  roundsd xmm0, xmm0, 10
  ret

Clang [-O2 -msse4.1]

my_nearby(double):
  roundsd xmm0, xmm0, 12
  ret

my_trunc(double):
  roundsd xmm0, xmm0, 11
  ret

my_floor(double):
  roundsd xmm0, xmm0, 9
  ret

my_ceil(double):
  roundsd xmm0, xmm0, 10
  ret

Interpreting the Results

As we can see most of the time we've got exactly what we wanted. Both compilers have chosen wisely the instructions dedicated for rounding so you get virtually no overhead here except in GCC's nearbyint() case, which I don't quite understand. It appears to be just a missed optimization, which was already reported as #71278. As I mentioned earlier, I don't know any way how to tell MSVC to generate such code so I cannot cover this compiler here (MSVC /arch parameter can only be used to switch to AVX code generation, which is far more advanced than SSE4.1)

AVX Code Generation to Cover MSVC Inability to Generate SSE4.1

The only way to tell MSVC to use instructions dedicated to rounding is to turn on AVX code generation. I initially didn't want to go this way, but I have found a very interesting thing that people should know. Even when AVX/AVX2 is turned on MSVC would not use it for rounding by default!

MSVC [/O2 /arch:AVX]

double my_nearby(double) PROC
  jmp nearbyint
double my_nearby(double) ENDP

double my_trunc(double) PROC
  jmp trunc
double my_trunc(double) ENDP

double my_floor(double) PROC
  jmp floor
double my_floor(double) ENDP

double my_ceil(double) PROC
  jmp ceil
double my_ceil(double) ENDP

Surprise! To be honest I don't know why MSVC cannot do that, but I have found a solution for some cases, but it's compiler-specific. We can use pragmas to temporarily tell MSVC that we want "fast" floating point operations and MSVC would then output instructions that we would expect [in some cases]. There are also compiler options to do this globally, but I personally don't recommend it as it gives the compiler so much power to reorder floating point operations that sometimes shouldn't be reordered.

MSVC Specific Code

#include <cmath>

#ifdef _MSC_VER
#pragma float_control(precise, off, push)
#endif

double my_nearby(double x) noexcept { return std::nearbyint(x); }
double my_trunc(double x) noexcept { return std::trunc(x); }
double my_floor(double x) noexcept { return std::floor(x); }
double my_ceil(double x) noexcept { return std::ceil(x); }

#ifdef _MSC_VER
#pragma float_control(pop)
#endif
double my_nearby(double) PROC
  jmp nearbyint
double my_nearby(double) ENDP

double my_trunc(double) PROC
 jmp trunc
double my_trunc(double) ENDP

double my_floor(double) PROC
  vxorpd   xmm2, xmm2, xmm2
  vroundsd xmm0, xmm2, xmm0, 1
  ret      0
double my_floor(double) ENDP

double my_ceil(double) PROC
  vxorpd   xmm2, xmm2, xmm2
  vroundsd xmm0, xmm2, xmm0, 2
  ret      0
double my_ceil(double) ENDP

Well, the code is much better for floor/ceil case, but the use of vxorpd instruction is just nonsensical as it zeroes a part of the register that is irrelevant for both rounding and return. This is in general what MSVC does often without any good reason. Why MSVC didn't inline also vroundsd instruction for nearby and trunc cases is not clear to me, but compilers often behave irrationally.

SSE2 Code Generation Through Intrinsics

As a bonus I have decided to also provide a pure SSE2 intristics solution that implements all four rounding operations in a branchless manner. I use this code in Blend2D for pre-SSE4.1 hardware and it's not the same as the code that GCC generates. In addition this code can be easily used to convert 2 double precision floating numbers at a time in cases in which you need to round vectors. The code can be further optimized as _mm_set_sd() is not really an efficient way to address constants on X86.

#include <stdint.h>
#include <emmintrin.h>

static inline __m128d my_blend(const __m128d& x, const __m128d& y, const __m128d& mask) noexcept {
  return _mm_or_pd(_mm_and_pd(mask, x),
                   _mm_and_pd(y, mask));
}

double my_nearby(double x) noexcept {
  __m128d src = _mm_set_sd(x);
  __m128d maxn = _mm_set_sd(4503599627370496.0);
  __m128d magic = _mm_set_sd(6755399441055744.0);

  __m128d msk = _mm_cmpnlt_sd(src, maxn);
  __m128d rnd = _mm_sub_sd(_mm_add_sd(src, magic), magic);

  return _mm_cvtsd_f64(my_blend(rnd, src, msk));
}

double my_trunc(double x) noexcept {
  static const uint64_t kSepMask[1] = { 0x7FFFFFFFFFFFFFFFu };

  __m128d src = _mm_set_sd(x);
  __m128d sep = _mm_load_sd(reinterpret_cast<const double*>(kSepMask));

  __m128d src_abs = _mm_and_pd(src, sep);
  __m128d sign = _mm_andnot_pd(sep, src);

  __m128d maxn = _mm_set_sd(4503599627370496.0);
  __m128d magic = _mm_set_sd(6755399441055744.0);

  __m128d msk = _mm_or_pd(_mm_cmpnlt_sd(src_abs, maxn), sign);
  __m128d rnd = _mm_sub_sd(_mm_add_sd(src_abs, magic), magic);
  __m128d maybeone = _mm_and_pd(_mm_cmplt_sd(src_abs, rnd), _mm_set_sd(1.0));

  return _mm_cvtsd_f64(my_blend(_mm_sub_sd(rnd, maybeone), src, msk));
}

double my_floor(double x) noexcept {
  __m128d src = _mm_set_sd(x);
  __m128d maxn = _mm_set_sd(4503599627370496.0);
  __m128d magic = _mm_set_sd(6755399441055744.0);

  __m128d msk = _mm_cmpnlt_sd(src, maxn);
  __m128d rnd = _mm_sub_sd(_mm_add_sd(src, magic), magic);
  __m128d maybeone = _mm_and_pd(_mm_cmplt_sd(src, rnd), _mm_set_sd(1.0));

  return _mm_cvtsd_f64(my_blend(_mm_sub_sd(rnd, maybeone), src, msk));
}

double my_ceil(double x) noexcept {
  __m128d src = _mm_set_sd(x);
  __m128d maxn = _mm_set_sd(4503599627370496.0);
  __m128d magic = _mm_set_sd(6755399441055744.0);

  __m128d msk = _mm_cmpnlt_sd(src, maxn);
  __m128d rnd = _mm_sub_sd(_mm_add_sd(src, magic), magic);
  __m128d maybeone = _mm_and_pd(_mm_cmpnle_sd(src, rnd), _mm_set_sd(1.0));

  return _mm_cvtsd_f64(my_blend(_mm_add_sd(rnd, maybeone), src, msk));
}

GCC [-O2]

my_nearby(double):
  movq     xmm0, xmm0
  movq     xmm3, QWORD PTR .LC1[rip]
  movapd   xmm2, xmm0
  movapd   xmm1, xmm0
  cmpnltsd xmm1, QWORD PTR .LC0[rip]
  addsd    xmm2, xmm3
  subsd    xmm2, xmm3
  andpd    xmm0, xmm1
  andpd    xmm1, xmm2
  orpd     xmm0, xmm1
  ret

my_trunc(double):
  movq     xmm1, QWORD PTR .LC2[rip]
  movq     xmm0, xmm0
  movq     xmm4, QWORD PTR .LC1[rip]
  movapd   xmm2, xmm0
  andpd    xmm2, xmm1
  movapd   xmm3, xmm1
  movapd   xmm1, xmm2
  andnpd   xmm3, xmm0
  cmpnltsd xmm1, QWORD PTR .LC0[rip]
  orpd     xmm1, xmm3
  movapd   xmm3, xmm2
  addsd    xmm3, xmm4
  andpd    xmm0, xmm1
  subsd    xmm3, xmm4
  cmpltsd  xmm2, xmm3
  andpd    xmm2, XMMWORD PTR .LC3[rip]
  subsd    xmm3, xmm2
  andpd    xmm1, xmm3
  orpd     xmm1, xmm0
  movapd   xmm0, xmm1
  ret

my_floor(double):
  movq     xmm0, xmm0
  movq     xmm3, QWORD PTR .LC1[rip]
  movapd   xmm2, xmm0
  movapd   xmm1, xmm0
  cmpnltsd xmm1, QWORD PTR .LC0[rip]
  addsd    xmm2, xmm3
  subsd    xmm2, xmm3
  movapd   xmm3, xmm0
  andpd    xmm0, xmm1
  cmpltsd  xmm3, xmm2
  andpd    xmm3, XMMWORD PTR .LC3[rip]
  subsd    xmm2, xmm3
  andpd    xmm1, xmm2
  orpd     xmm0, xmm1
  ret

my_ceil(double):
  movq     xmm0, xmm0
  movq     xmm3, QWORD PTR .LC1[rip]
  movapd   xmm2, xmm0
  movapd   xmm1, xmm0
  cmpnltsd xmm1, QWORD PTR .LC0[rip]
  addsd    xmm2, xmm3
  subsd    xmm2, xmm3
  movapd   xmm3, xmm0
  andpd    xmm0, xmm1
  cmpnlesd xmm3, xmm2
  andpd    xmm3, XMMWORD PTR .LC3[rip]
  addsd    xmm2, xmm3
  andpd    xmm1, xmm2
  orpd     xmm0, xmm1
  ret

.LC0:
  .long    0
  .long    1127219200
  .long    0
  .long    0

.LC1:
  .long    0
  .long    1127743488
  .long    0
  .long    0

.LC2:
  .long    4294967295
  .long    2147483647
  .long    0
  .long    0

.LC3:
  .long    0
  .long    1072693248
  .long    0
  .long    0

Clang [-O2]

.LCPI0_0:
  .quad    4841369599423283200     # double 4503599627370496
.LCPI0_1:
  .quad    4843621399236968448     # double 6755399441055744
.LCPI0_2:
  .quad    -4379750637617807360    # double -6755399441055744

my_nearby(double):
  movq     xmm1, xmm0
  addsd    xmm0, qword ptr [rip + .LCPI0_1]
  addsd    xmm0, qword ptr [rip + .LCPI0_2]
  orpd     xmm0, xmm1
  cmpnltsd xmm1, qword ptr [rip + .LCPI0_0]
  andpd    xmm0, xmm1
  ret

.LCPI1_0:
  .quad    4841369599423283200     # double 4503599627370496
.LCPI1_1:
  .quad    4843621399236968448     # double 6755399441055744
.LCPI1_2:
  .quad    -4379750637617807360    # double -6755399441055744

my_trunc(double):
  movq     xmm1, xmm0
  movabs   rax, 9223372036854775807
  movq     xmm2, rax
  pand     xmm2, xmm1
  movabs   rax, -9223372036854775808
  movq     xmm3, rax
  pand     xmm3, xmm1
  movdqa   xmm4, xmm2
  cmpnltsd xmm4, qword ptr [rip + .LCPI1_0]
  movsd    xmm0, qword ptr [rip + .LCPI1_1]
  addsd    xmm0, xmm2
  addsd    xmm0, qword ptr [rip + .LCPI1_2]
  orpd     xmm4, xmm3
  cmpltsd  xmm2, xmm0
  movabs   rax, 4607182418800017408
  movq     xmm3, rax
  pand     xmm3, xmm2
  subsd    xmm0, xmm3
  orpd     xmm0, xmm1
  andpd    xmm0, xmm4
  ret

.LCPI2_0:
  .quad    4841369599423283200     # double 4503599627370496
.LCPI2_1:
  .quad    4843621399236968448     # double 6755399441055744
.LCPI2_2:
  .quad    -4379750637617807360    # double -6755399441055744

my_floor(double):
  movq     xmm1, xmm0
  addsd    xmm0, qword ptr [rip + .LCPI2_1]
  addsd    xmm0, qword ptr [rip + .LCPI2_2]
  movdqa   xmm2, xmm1
  cmpltsd  xmm2, xmm0
  movabs   rax, 4607182418800017408
  movq     xmm3, rax
  pand     xmm3, xmm2
  subsd    xmm0, xmm3
  orpd     xmm0, xmm1
  cmpnltsd xmm1, qword ptr [rip + .LCPI2_0]
  andpd    xmm0, xmm1
  ret

.LCPI3_0:
  .quad    4841369599423283200     # double 4503599627370496
.LCPI3_1:
  .quad    4843621399236968448     # double 6755399441055744
.LCPI3_2:
  .quad    -4379750637617807360    # double -6755399441055744

my_ceil(double):
  movq     xmm1, xmm0
  addsd    xmm0, qword ptr [rip + .LCPI3_1]
  addsd    xmm0, qword ptr [rip + .LCPI3_2]
  movdqa   xmm2, xmm1
  cmpnlesd xmm2, xmm0
  movabs   rax, 4607182418800017408
  movq     xmm3, rax
  pand     xmm3, xmm2
  addsd    xmm0, xmm3
  orpd     xmm0, xmm1
  cmpnltsd xmm1, qword ptr [rip + .LCPI3_0]
  andpd    xmm0, xmm1
  ret

MSVC [/O2]

double my_nearby(double) PROC
  movaps   xmm1, xmm0
  xorps    xmm4, xmm4
  movsd    xmm4, xmm1
  movaps   xmm2, xmm4
  movaps   xmm0, xmm4
  addsd    xmm0, QWORD PTR __real@4338000000000000
  cmpnltsd xmm2, QWORD PTR __real@4330000000000000
  subsd    xmm0, QWORD PTR __real@4338000000000000
  andps    xmm0, xmm2
  andps    xmm2, xmm4
  orps     xmm0, xmm2
  ret      0
double my_nearby(double) ENDP

double my_trunc(double) PROC
$LN6:
  sub      rsp, 24
  movsd    xmm3, QWORD PTR unsigned __int64 const * const `double my_trunc(double)'::`2'::kSepMask
  movaps   xmm1, xmm0
  movsd    xmm0, QWORD PTR __real@3ff0000000000000
  movaps   xmm5, xmm3
  movaps   XMMWORD PTR [rsp], xmm6
  xorps    xmm6, xmm6
  movsd    xmm6, xmm1
  andnpd   xmm3, xmm6
  andps    xmm5, xmm6
  movaps   xmm4, xmm5
  cmpnltsd xmm4, QWORD PTR __real@4330000000000000
  orps     xmm4, xmm3
  movaps   xmm3, xmm5
  addsd    xmm3, QWORD PTR __real@4338000000000000
  subsd    xmm3, QWORD PTR __real@4338000000000000
  cmpltsd  xmm5, xmm3
  andps    xmm5, xmm0
  subsd    xmm3, xmm5
  andps    xmm3, xmm4
  andps    xmm4, xmm6
  movaps   xmm6, XMMWORD PTR [rsp]
  orps     xmm3, xmm4
  movaps   xmm0, xmm3
  add      rsp, 24
  ret      0
double my_trunc(double) ENDP

double my_floor(double) PROC
  movaps   xmm1, xmm0
  xorps    xmm5, xmm5
  movsd    xmm0, QWORD PTR __real@3ff0000000000000
  movsd    xmm5, xmm1
  movaps   xmm3, xmm5
  movaps   xmm4, xmm5
  addsd    xmm4, QWORD PTR __real@4338000000000000
  cmpnltsd xmm3, QWORD PTR __real@4330000000000000
  movaps   xmm2, xmm5
  subsd    xmm4, QWORD PTR __real@4338000000000000
  cmpltsd  xmm2, xmm4
  andps    xmm2, xmm0
  subsd    xmm4, xmm2
  andps    xmm4, xmm3
  andps    xmm3, xmm5
  orps     xmm4, xmm3
  movaps   xmm0, xmm4
  ret      0
double my_floor(double) ENDP

double my_ceil(double) PROC
  movaps   xmm1, xmm0
  xorps    xmm5, xmm5
  movsd    xmm0, QWORD PTR __real@3ff0000000000000
  movsd    xmm5, xmm1
  movaps   xmm3, xmm5
  movaps   xmm4, xmm5
  addsd    xmm4, QWORD PTR __real@4338000000000000
  cmpnltsd xmm3, QWORD PTR __real@4330000000000000
  movaps   xmm2, xmm5
  subsd    xmm4, QWORD PTR __real@4338000000000000
  cmpnlesd xmm2, xmm4
  andps    xmm2, xmm0
  addsd    xmm4, xmm2
  andps    xmm4, xmm3
  andps    xmm3, xmm5
  orps     xmm4, xmm3
  movaps   xmm0, xmm4
  ret      0
double my_ceil(double) ENDP

Interpreting the Results

There is not much to add as it's pretty close to the C++ code. Some compilers add unnecessary moves, but that's it in general.

SSE4.1 Code Generation Through Intrinsics

SSE4.1 offers dedicated instructions for rounding purposes, can C++ compilers screw up?

#include <nmmintrin.h>

template<int ControlFlags>
static inline __m128d my_round_sse4_1(double x) noexcept {
  __m128d y = _mm_set_sd(x);
  return _mm_round_sd(y, y, ControlFlags | _MM_FROUND_NO_EXC);
}

double my_nearby(double x) noexcept { return _mm_cvtsd_f64(my_round_sse4_1<_MM_FROUND_CUR_DIRECTION>(x)); }
double my_trunc (double x) noexcept { return _mm_cvtsd_f64(my_round_sse4_1<_MM_FROUND_TO_ZERO      >(x)); }
double my_floor (double x) noexcept { return _mm_cvtsd_f64(my_round_sse4_1<_MM_FROUND_TO_NEG_INF   >(x)); }
double my_ceil  (double x) noexcept { return _mm_cvtsd_f64(my_round_sse4_1<_MM_FROUND_TO_POS_INF   >(x)); }

GCC [-O2 -msse4.1]

my_nearby(double):
  movq    xmm0, xmm0
  roundsd xmm0, xmm0, 12
  ret

my_trunc(double):
  movq    xmm0, xmm0
  roundsd xmm0, xmm0, 11
  ret

my_floor(double):
  movq    xmm0, xmm0
  roundsd xmm0, xmm0, 9
  ret

my_ceil(double):
  movq    xmm0, xmm0
  roundsd xmm0, xmm0, 10
  ret

Clang [-O2 -msse4.1]

my_nearby(double):
  roundsd xmm0, xmm0, 12
  ret

my_trunc(double):
  roundsd xmm0, xmm0, 11
  ret

my_floor(double):
  roundsd xmm0, xmm0, 9
  ret

my_ceil(double):
  roundsd xmm0, xmm0, 10
  ret

MSVC [/O2]

double my_nearby(double) PROC
  movaps  xmm1, xmm0
  xorps   xmm2, xmm2
  movsd   xmm2, xmm1
  roundsd xmm2, xmm2, 12
  movaps  xmm0, xmm2
  ret     0
double my_nearby(double) ENDP

double my_trunc(double) PROC
  movaps  xmm1, xmm0
  xorps   xmm2, xmm2
  movsd   xmm2, xmm1
  roundsd xmm2, xmm2, 11
  movaps  xmm0, xmm2
  ret     0
double my_trunc(double) ENDP

double my_floor(double) PROC
  movaps  xmm1, xmm0
  xorps   xmm2, xmm2
  movsd   xmm2, xmm1
  roundsd xmm2, xmm2, 9
  movaps  xmm0, xmm2
  ret     0
double my_floor(double) ENDP

double my_ceil(double) PROC
  movaps  xmm1, xmm0
  xorps   xmm2, xmm2
  movsd   xmm2, xmm1
  roundsd xmm2, xmm2, 10
  movaps  xmm0, xmm2
  ret     0
double my_ceil(double) ENDP

Interpreting the Results

Do you also have mixed feelings about the results? We have avoided function calls in all cases and that's cool, but I would say that the only compiler that does what I would expect is Clang. GCC is not that bad as it just emits the extra movq instruction that acts as vxorpd in MSVC case and I could live with that. However, what MSVC does here is pure insanity and it shows how badly designed its SIMD pipeline really is.

Conclusion

Watch your compiler especially if you use MSVC for any reason :)

Compiler Exporer

Here are some links related to the compiler exporer so you can play with it yourself:

13 August, 2018

Are we autovectorized yet?

Short post about the compiler's ability (or inability?) to autovectorize your code. A lot of people write claims about compiler optimizations without actually proving them. Some people even state things such as "compiler is smarter than you", "compiler can autovectorize your code much better than you", and similar nonsense, and then use these as a foundation against people that care how their code is compiled or that even write optimized functions themselves. Hopefully, there is enough tools online to verify such claims and to prove which optimizations are likely and which aren't.

Here is our sample function:

struct Matrix2D {
  double m00;
  double m01;
  double m10;
  double m11;
  double m20;
  double m21;

  inline void reset(double a, double b, double c, double d, double e, double f) noexcept {
    m00 = a;
    m01 = b;
    m10 = c;
    m11 = d;
    m20 = e;
    m21 = f;
  }
};

void Matrix2D_Multiply(Matrix2D* dst, const Matrix2D* a, const Matrix2D* b) noexcept {
  dst->reset(a->m00 * b->m00 + a->m01 * b->m10,
             a->m00 * b->m01 + a->m01 * b->m11,
             a->m10 * b->m00 + a->m11 * b->m10,
             a->m10 * b->m01 + a->m11 * b->m11,
             a->m20 * b->m00 + a->m21 * b->m10 + b->m20,
             a->m20 * b->m01 + a->m21 * b->m11 + b->m21);
}

Yes it's a simple affine matrix multiplication used commonly in 2D graphics. I initially thought that all major C++ compilers would be able to autovectorize this code as it looks pretty straightforward, but I was wrong.

MSVC 2017 [/Ox /arch:AVX2]

; 48 instructions
Matrix2D_Multiply:
  mov     rax, rsp
  sub     rsp, 120
  vmovsd  xmm4, QWORD PTR [r8+8]
  vmulsd  xmm1, xmm4, QWORD PTR [rdx+32]
  vmovaps XMMWORD PTR [rax-24], xmm6
  vmovaps XMMWORD PTR [rax-40], xmm7
  vmovaps XMMWORD PTR [rax-56], xmm8
  vmovsd  xmm8, QWORD PTR [r8]
  vmulsd  xmm2, xmm8, QWORD PTR [rdx+16]
  vmovaps XMMWORD PTR [rax-72], xmm9
  vmovsd  xmm9, QWORD PTR [r8+16]
  vmovaps XMMWORD PTR [rax-88], xmm10
  vmovaps XMMWORD PTR [rax-104], xmm11
  vmovsd  xmm11, QWORD PTR [r8+24]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+40]
  vaddsd  xmm1, xmm1, xmm0
  vmulsd  xmm0, xmm9, QWORD PTR [rdx+40]
  vmovaps XMMWORD PTR [rsp], xmm12
  vaddsd  xmm12, xmm1, QWORD PTR [r8+40]
  vmulsd  xmm1, xmm8, QWORD PTR [rdx+32]
  vaddsd  xmm1, xmm1, xmm0
  vaddsd  xmm10, xmm1, QWORD PTR [r8+32]
  vmulsd  xmm1, xmm4, QWORD PTR [rdx+16]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+24]
  vaddsd  xmm7, xmm1, xmm0
  vmulsd  xmm1, xmm9, QWORD PTR [rdx+24]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+8]
  vmovaps xmm11, XMMWORD PTR [rax-104]
  vaddsd  xmm6, xmm2, xmm1
  vmulsd  xmm1, xmm4, QWORD PTR [rdx]
  vmulsd  xmm2, xmm8, QWORD PTR [rdx]
  vmovaps xmm8, XMMWORD PTR [rax-56]
  vaddsd  xmm4, xmm1, xmm0
  vmulsd  xmm1, xmm9, QWORD PTR [rdx+8]
  vmovaps xmm9, XMMWORD PTR [rax-72]
  vaddsd  xmm0, xmm2, xmm1
  vmovsd  QWORD PTR [rcx+16], xmm6
  vmovaps xmm6, XMMWORD PTR [rax-24]
  vmovsd  QWORD PTR [rcx+24], xmm7
  vmovaps xmm7, XMMWORD PTR [rax-40]
  vmovsd  QWORD PTR [rcx+32], xmm10
  vmovaps xmm10, XMMWORD PTR [rax-88]
  vmovsd  QWORD PTR [rcx+40], xmm12
  vmovaps xmm12, XMMWORD PTR [rsp]
  vmovsd  QWORD PTR [rcx], xmm0
  vmovsd  QWORD PTR [rcx+8], xmm4
  add     rsp, 120
  ret     0

Not so great, but there is an explanation for that. Firstly, the compiler decided to go scalar (failed to autovectorize the code), which means that it would need a lot of registers (this decision basically predated everything). Secondly, WIN64 calling convention requires some registers to be preserved across function calls so in order to use as many registers as it needs it has to additionally save and restore some of them. Don't be confused with rax use here, it's just a trick and it's used as an original stack pointer. This trick is just a code-size optimization.

GCC trunk [-O2 -mavx2]

; 37 instructions
Matrix2D_Multiply:
  vmovsd  xmm0, QWORD PTR [rdx+8]
  vmovsd  xmm5, QWORD PTR [rdx+24]
  vmovsd  xmm2, QWORD PTR [rsi+32]
  vmovsd  xmm3, QWORD PTR [rsi+40]
  vmovsd  xmm6, QWORD PTR [rdx+16]
  vmovsd  xmm9, QWORD PTR [rsi+16]
  vmulsd  xmm1, xmm3, xmm5
  vmovsd  xmm8, QWORD PTR [rsi+24]
  vmovsd  xmm7, QWORD PTR [rsi+8]
  vmulsd  xmm4, xmm2, xmm0
  vmulsd  xmm3, xmm3, xmm6
  vmulsd  xmm11, xmm6, xmm7
  vmulsd  xmm7, xmm5, xmm7
  vmulsd  xmm6, xmm6, xmm8
  vaddsd  xmm4, xmm4, xmm1
  vmovsd  xmm1, QWORD PTR [rdx]
  vmulsd  xmm5, xmm5, xmm8
  vaddsd  xmm4, xmm4, QWORD PTR [rdx+40]
  vmulsd  xmm2, xmm2, xmm1
  vaddsd  xmm3, xmm2, xmm3
  vmovsd  xmm2, QWORD PTR [rsi]
  vaddsd  xmm3, xmm3, QWORD PTR [rdx+32]
  vmovsd  QWORD PTR [rdi+40], xmm4
  vmulsd  xmm10, xmm1, xmm2
  vmulsd  xmm2, xmm0, xmm2
  vmovsd  QWORD PTR [rdi+32], xmm3
  vmulsd  xmm1, xmm1, xmm9
  vmulsd  xmm0, xmm0, xmm9
  vaddsd  xmm10, xmm10, xmm11
  vaddsd  xmm2, xmm2, xmm7
  vaddsd  xmm1, xmm1, xmm6
  vaddsd  xmm0, xmm0, xmm5
  vmovsd  QWORD PTR [rdi], xmm10
  vmovsd  QWORD PTR [rdi+8], xmm2
  vmovsd  QWORD PTR [rdi+16], xmm1
  vmovsd  QWORD PTR [rdi+24], xmm0
  ret

A perfect scalar version, but nothing more.

Clang trunk [-O2 -mavx2]

; 22 instructions
Matrix2D_Multiply:
  vmovddup xmm0, qword ptr [rsi] # xmm0 = mem[0,0]
  vmovupd xmm1, xmmword ptr [rdx]
  vmovupd xmm2, xmmword ptr [rdx + 16]
  vmulpd xmm0, xmm0, xmm1
  vmovddup xmm3, qword ptr [rsi + 8] # xmm3 = mem[0,0]
  vmulpd xmm3, xmm3, xmm2
  vaddpd xmm0, xmm0, xmm3
  vmovddup xmm3, qword ptr [rsi + 16] # xmm3 = mem[0,0]
  vmulpd xmm3, xmm1, xmm3
  vmovddup xmm4, qword ptr [rsi + 24] # xmm4 = mem[0,0]
  vmulpd xmm4, xmm2, xmm4
  vmovddup xmm5, qword ptr [rsi + 32] # xmm5 = mem[0,0]
  vmulpd xmm1, xmm1, xmm5
  vmovddup xmm5, qword ptr [rsi + 40] # xmm5 = mem[0,0]
  vmulpd xmm2, xmm2, xmm5
  vaddpd xmm1, xmm1, xmm2
  vaddpd xmm1, xmm1, xmmword ptr [rdx + 32]
  vaddpd xmm2, xmm3, xmm4
  vmovupd xmmword ptr [rdi], xmm0
  vmovupd xmmword ptr [rdi + 16], xmm2
  vmovupd xmmword ptr [rdi + 32], xmm1
  ret

Well, this is what I have initially expected to get. An autovectorized code that is as close to a hand-written asm as I can think of.

Conclusion

Compilers are generally improving, but some of them still lack to perform a really basic autovectorization optimizations. So we are not autovectorized yet and people should in general verify what they are claiming. Time will tell...

Update

I fixed one statement. Usage of [rax] by MSVC is actually smart. Addressing [rax + imm] is one byte smaller than addressing [rsp + imm].

You can use Compiler Exporer and try it yourself!