23 October, 2023

This Blog is Moving

This blog is moving to my homepage. I have started deleting the posts on this site and when I'm done the whole blogger account will be deleted.

04 September, 2018

C++ - Generating Lookup Tables at Compile-Time

There are many ways of creating lookup tables in C++. Let's assume we have a set of keys that are compile-time constants and we want to associate a value with each of them; however, when the value is accessed, the key won't be a constant expression. Today's compilers are smart and can generate nice code that performs a binary search, or they can even build a lookup table themselves, but you should never assume what they will do, as each compiler has different heuristics and may take a different path to solve the problem.

What I will discuss today is creating arrays of values at compile-time that can be used however you want, as the format of the data is defined by you. These arrays can then be used as lookup tables, or as regular arrays if that's what you need. So let's describe the problem first:

#include <stdint.h>
#include <stdlib.h>

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

size_t objectSize(uint32_t tid) {
  return tid == TID_NULL    ?  0 :
         tid == TID_BOOLEAN ?  1 :
         tid == TID_INT8    ?  1 :
         tid == TID_INT16   ?  2 :
         tid == TID_INT32   ?  4 :
         tid == TID_INT64   ?  8 :
         tid == TID_UINT8   ?  1 :
         tid == TID_UINT16  ?  2 :
         tid == TID_UINT32  ?  4 :
         tid == TID_UINT64  ?  8 :
         tid == TID_FLOAT32 ?  4 :
         tid == TID_FLOAT64 ?  8 :
         tid == TID_STRING  ? 16 :
         tid == TID_ARRAY   ? 24 :
         tid == TID_MAP     ? 32 : SIZE_MAX;
}

[Compiler Explorer]

GCC and MSVC take the 'compare and return' approach, while clang uses a single lookup table in our case. Neither approach is wrong here, as compilers are free to do anything that returns a valid result.

Approach #1 - Naive LUT

We can use a really naive approach to turn such code into a lookup table by creating an array and filling it:

#include <stdint.h>
#include <stdlib.h>

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

static const uint8_t ObjectSizeArray[] = {
  /* TID_NULL    */  0 ,
  /* TID_BOOLEAN */  1 ,
  /* TID_INT8    */  1 ,
  /* TID_INT16   */  2 ,
  /* TID_INT32   */  4 ,
  /* TID_INT64   */  8 ,
  /* TID_UINT8   */  1 ,
  /* TID_UINT16  */  2 ,
  /* TID_UINT32  */  4 ,
  /* TID_UINT64  */  8 ,
  /* TID_FLOAT32 */  4 ,
  /* TID_FLOAT64 */  8 ,
  /* TID_STRING  */ 16 ,
  /* TID_ARRAY   */ 24 ,
  /* TID_MAP     */ 32
};

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

There is not much to add as the code explicitly creates the lookup table and then uses it. This is one of the simplest approaches and you will see it in many projects to date. It's fine until you start messing with the keys. Imagine somebody reordered TID_UINT8 to follow TID_INT8, including the TID value. The table would immediately be broken, but the code would still compile just fine; it would only fail at runtime, and such bugs can be really time-consuming to debug as everything looks just perfect.
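One cheap mitigation, which I should note cannot catch reordered keys but at least catches a missing or extra entry, is a static_assert on the table size. A minimal sketch, using a reduced three-entry enum for brevity:

```cpp
#include <stdint.h>
#include <stddef.h>

// Reduced enum for illustration.
enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,

  TID_COUNT   = 3
};

static const uint8_t ObjectSizeArray[] = {
  /* TID_NULL    */ 0,
  /* TID_BOOLEAN */ 1,
  /* TID_INT8    */ 1
};

// Fails to compile if an entry is added to the enum but not to the table
// (or vice versa). It cannot catch reordered keys, only a size mismatch.
static_assert(sizeof(ObjectSizeArray) / sizeof(ObjectSizeArray[0]) == TID_COUNT,
              "ObjectSizeArray is out of sync with TypeId");
```

This catches only one class of the bugs described above, which is why the approaches below go further.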

Approach #2 - Key/Value as a Macro

The idea is to create a macro that will be used N times to populate the table:

#include <stdint.h>
#include <stdlib.h>

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

#define V(tid) (              \
  tid == TID_NULL    ?  0 :   \
  tid == TID_BOOLEAN ?  1 :   \
  tid == TID_INT8    ?  1 :   \
  tid == TID_INT16   ?  2 :   \
  tid == TID_INT32   ?  4 :   \
  tid == TID_INT64   ?  8 :   \
  tid == TID_UINT8   ?  1 :   \
  tid == TID_UINT16  ?  2 :   \
  tid == TID_UINT32  ?  4 :   \
  tid == TID_UINT64  ?  8 :   \
  tid == TID_FLOAT32 ?  4 :   \
  tid == TID_FLOAT64 ?  8 :   \
  tid == TID_STRING  ? 16 :   \
  tid == TID_ARRAY   ? 24 :   \
  tid == TID_MAP     ? 32 : 0 \
)

static const uint8_t ObjectSizeArray[] = {
  V(0),
  V(1),
  V(2),
  V(3),
  V(4),
  V(5),
  V(6),
  V(7),
  V(8),
  V(9),
  V(10),
  V(11),
  V(12),
  V(13),
  V(14)
};

#undef V

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

The result should compile to the same code as the previous example. We fixed one thing and designed a solution that works for simple cases in which a key changes; however, it's not really a nice solution and requires a lot of additional lines compared to the first one.
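A related pattern you will also meet in the wild is an X-macro list that defines the key/value pairs in a single place and expands them into both the enum and the table. A minimal sketch with a reduced entry list (this still assumes the ids are dense and listed in order):

```cpp
#include <stdint.h>
#include <stddef.h>

// Single source of truth: each entry is (name, id, size).
#define TYPE_ID_LIST(X) \
  X(TID_NULL,    0, 0)  \
  X(TID_BOOLEAN, 1, 1)  \
  X(TID_INT8,    2, 1)

// Expand the list into the enum...
enum TypeId : uint32_t {
#define X(name, id, size) name = id,
  TYPE_ID_LIST(X)
#undef X
  TID_COUNT
};

// ...and into the lookup table.
static const uint8_t ObjectSizeArray[] = {
#define X(name, id, size) size,
  TYPE_ID_LIST(X)
#undef X
};
```

This keeps the enum and the table in sync automatically, at the cost of yet more preprocessor machinery.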

Approach #3 - Key/Value as a Template

#include <stdint.h>
#include <stdlib.h>

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

template<uint32_t tid>
struct KeyMapper {
  enum : uint8_t {
    value =
      tid == TID_NULL    ?  0 :
      tid == TID_BOOLEAN ?  1 :
      tid == TID_INT8    ?  1 :
      tid == TID_INT16   ?  2 :
      tid == TID_INT32   ?  4 :
      tid == TID_INT64   ?  8 :
      tid == TID_UINT8   ?  1 :
      tid == TID_UINT16  ?  2 :
      tid == TID_UINT32  ?  4 :
      tid == TID_UINT64  ?  8 :
      tid == TID_FLOAT32 ?  4 :
      tid == TID_FLOAT64 ?  8 :
      tid == TID_STRING  ? 16 :
      tid == TID_ARRAY   ? 24 :
      tid == TID_MAP     ? 32 : 0
  };
};

static const uint8_t ObjectSizeArray[] = {
  KeyMapper<0>::value,
  KeyMapper<1>::value,
  KeyMapper<2>::value,
  KeyMapper<3>::value,
  KeyMapper<4>::value,
  KeyMapper<5>::value,
  KeyMapper<6>::value,
  KeyMapper<7>::value,
  KeyMapper<8>::value,
  KeyMapper<9>::value,
  KeyMapper<10>::value,
  KeyMapper<11>::value,
  KeyMapper<12>::value,
  KeyMapper<13>::value,
  KeyMapper<14>::value
};

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

Okay, the nasty macro was replaced by a nasty template, but overall it's the same solution and it's still annoying to write and maintain.

Approach #4 - std::index_sequence and constexpr (C++14)

#include <stdint.h>
#include <stdlib.h>
#include <utility>

// ------------------------------------------------------------------
// Normally you would put these helpers somewhere else, not into
// the .cpp file, however, let's keep it simple.
// ------------------------------------------------------------------

template<typename T, size_t N>
struct LookupTable {
  T data[N];

  constexpr size_t size() const noexcept { return N; }
  constexpr const T& operator[](size_t i) const noexcept { return data[i]; }
};

template<typename T, size_t N, class Func, size_t... Indexes>
static constexpr LookupTable<T, N> MakeLookupTableImpl(Func f, std::index_sequence<Indexes...>) noexcept {
  return {{ T(f(Indexes))... }};
}

template<typename T, size_t N, class Func>
static constexpr LookupTable<T, N> MakeLookupTable(Func f) noexcept {
  return MakeLookupTableImpl<T, N>(f, std::make_index_sequence<N>{});
}

// The original code...
enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

// And the original function turned into a constexpr.
static constexpr uint8_t ObjectSizeFunc(uint32_t tid) {
  return tid == TID_NULL    ?  0 :
         tid == TID_BOOLEAN ?  1 :
         tid == TID_INT8    ?  1 :
         tid == TID_INT16   ?  2 :
         tid == TID_INT32   ?  4 :
         tid == TID_INT64   ?  8 :
         tid == TID_UINT8   ?  1 :
         tid == TID_UINT16  ?  2 :
         tid == TID_UINT32  ?  4 :
         tid == TID_UINT64  ?  8 :
         tid == TID_FLOAT32 ?  4 :
         tid == TID_FLOAT64 ?  8 :
         tid == TID_STRING  ? 16 :
         tid == TID_ARRAY   ? 24 :
         tid == TID_MAP     ? 32 : 0;
}

static constexpr const auto ObjectSizeArray =
  MakeLookupTable<uint8_t, TID_COUNT>(ObjectSizeFunc);

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

I think this is the cleanest approach so far and it can be maintained very well. Adding a new value still requires updating the ObjectSizeFunc function, but changing keys would do no harm.
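As an aside, if you can use C++17, the same table can be built without any helper templates at all: constexpr lambdas plus loops in constexpr functions make the index_sequence machinery unnecessary. A sketch with a reduced three-entry enum for brevity:

```cpp
#include <stdint.h>
#include <stddef.h>
#include <array>

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,

  TID_COUNT   = 3
};

static constexpr uint8_t ObjectSizeFunc(uint32_t tid) {
  return tid == TID_NULL    ? 0 :
         tid == TID_BOOLEAN ? 1 :
         tid == TID_INT8    ? 1 : 0;
}

// C++17: fill a std::array in an immediately-invoked constexpr lambda.
static constexpr auto ObjectSizeArray = []{
  std::array<uint8_t, TID_COUNT> table{};
  for (uint32_t i = 0; i < TID_COUNT; i++)
    table[i] = ObjectSizeFunc(i);
  return table;
}();
```

The constexpr guarantee is the same as in the approach above; only the plumbing is shorter.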

Approach #5 - Minor update of #4

#include <stdint.h>
#include <stdlib.h>
#include <utility>

template<typename T, size_t N>
struct LookupTable {
  T data[N];

  constexpr size_t size() const noexcept { return N; }
  constexpr const T& operator[](size_t i) const noexcept { return data[i]; }
};

template<typename T, size_t N, class Generator, size_t... Indexes>
static constexpr LookupTable<T, N> MakeLookupTableImpl(std::index_sequence<Indexes...>) noexcept {
  constexpr LookupTable<T, N> table {{ T(Generator::value(Indexes))... }};
  return table;
}

template<typename T, size_t N, class Generator>
static constexpr LookupTable<T, N> MakeLookupTable() noexcept {
  return MakeLookupTableImpl<T, N, Generator>(std::make_index_sequence<N>{});
}

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

struct ObjectSizeGenerator {
  static constexpr uint8_t value(size_t tid) noexcept {
    return tid == TID_NULL    ?  0 :
           tid == TID_BOOLEAN ?  1 :
           tid == TID_INT8    ?  1 :
           tid == TID_INT16   ?  2 :
           tid == TID_INT32   ?  4 :
           tid == TID_INT64   ?  8 :
           tid == TID_UINT8   ?  1 :
           tid == TID_UINT16  ?  2 :
           tid == TID_UINT32  ?  4 :
           tid == TID_UINT64  ?  8 :
           tid == TID_FLOAT32 ?  4 :
           tid == TID_FLOAT64 ?  8 :
           tid == TID_STRING  ? 16 :
           tid == TID_ARRAY   ? 24 :
           tid == TID_MAP     ? 32 : 0;
  }
};

static constexpr const auto ObjectSizeArray =
  MakeLookupTable<uint8_t, TID_COUNT, ObjectSizeGenerator>();

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

This is an updated approach in which I have moved the generator into a helper struct/class, as I noticed that in some cases MSVC would emit runtime initializers for the lookup table if it's part of a larger structure.

Conclusion

I just wanted to demonstrate different ways of creating lookup tables at compile-time. If you explore enough projects you will probably find all of the approaches mentioned here, with some minor modifications (more macros, more clever macros, different templates, etc.). There may well be a better solution than the last one, but I'm happily switching to it as it gives me great flexibility. The function that returns the value can actually be pretty complex, and the constexpr ObjectSizeArray still guarantees that the compiler will calculate everything at compile-time or fail. I have some tables in Blend2D that are used during glyph decoding (of both TrueType and OpenType fonts) and they work really well and are really easy to maintain. I plan to use this approach in AsmJit as well, but Blend2D has a higher priority at the moment.

I would also note that the last approach provides great flexibility for defining tables that are a bit sparse: for example a table of 256 values (mapping a byte key to a value) where only, say, 10 keys are of interest and map to non-zero values.
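A sketch of such a sparse table (the classifier and its flags are made up for illustration): 256 entries generated from a tiny constexpr function, where everything except a handful of byte values maps to zero. This uses std::index_sequence directly so the snippet stays self-contained:

```cpp
#include <stdint.h>
#include <stddef.h>
#include <array>
#include <utility>

// Hypothetical classifier: only a few byte values are "interesting".
static constexpr uint8_t ByteFlags(size_t c) {
  return c == '\0' ? 1 :   // terminator
         c == ' '  ? 2 :   // whitespace
         c == '\n' ? 2 :   // whitespace
         c == '"'  ? 4 :   // string delimiter
         0;                // everything else maps to zero
}

template<size_t... I>
static constexpr std::array<uint8_t, sizeof...(I)>
MakeByteTable(std::index_sequence<I...>) {
  return {{ ByteFlags(I)... }};
}

// 256 entries, only 4 of them non-zero - the generator stays tiny.
static constexpr auto ByteFlagsTable =
  MakeByteTable(std::make_index_sequence<256>{});
```

The generator only has to describe the interesting keys; the table fills itself.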

Use comments if you have anything to add or if you think it should be done differently!

18 April, 2016

Tricky _mm_set_sd()

Or: How a C++ Compiler Can Ruin Your Optimization Efforts

Compiler intrinsics targeting the SSE and SSE2 instruction sets were historically used to make the compiler generate better code for floating point operations. The reason was that some CPU architectures have a global floating-point unit state (or FPU state) that contains the rounding mode, precision, exception mask, and some other flags. This basically prevented the compiler from optimizing operations like round(x), static_cast<int>(x), (x < y ? x : y), etc.

Of course this situation was not performance-friendly, and many developers simply started using SSE+ intrinsics, when they became available, to solve their problems in a more performance-friendly way. This approach was also blessed by some CPU manufacturers as the better way of dealing with the problem. Today's compilers are much smarter than before and some constructs may be considered obsolete. However, some still hold and are probably better than pure C/C++ code in certain cases, especially if you need something the compiler won't do by default.

Scalar <-> SIMD Problem

The problem I faced is closely related to the conversion between float/double and SIMD registers. When SSE2 code generation is turned on, these types are equivalent, and using the _mm_set_sd intrinsic should generate no code at all; however, that's not always true, as demonstrated by the snippet below:

#include <xmmintrin.h>

double MyMinV1(double a, double b) {
  return a < b ? a : b;
}

double MyMinV2(double a, double b) {
  __m128d x = _mm_set_sd(a);
  __m128d y = _mm_set_sd(b);
  return _mm_cvtsd_f64(_mm_min_sd(x, y));
}

One would expect that a modern compiler will generate the same assembly for both functions. Unfortunately, the reality is different. Let's compare gcc and clang.

Assembly generated by clang:

MyMinV1:
  minsd   xmm0, xmm1
  ret

MyMinV2:
  minsd   xmm0, xmm1
  ret

Assembly generated by gcc:

MyMinV1:
  minsd   xmm0, xmm1
  ret
MyMinV2:
  movsd   QWORD PTR [rsp-24], xmm0
  movsd   QWORD PTR [rsp-16], xmm1
  movsd   xmm0, QWORD PTR [rsp-24]
  movsd   xmm1, QWORD PTR [rsp-16]
  minsd   xmm0, xmm1
  ret

That's not good; gcc completely missed the point and turned the C++ code into the worst assembly you could imagine. You can check it yourself here if you don't believe me.

The Core of the Problem

The problem is caused by using _mm_set_sd() to convert a double into a SIMD type. The intrinsic should emit the movsd xmm0, xmm1|mem64 instruction, which ironically has two behaviors:

  • If the second operand (the source operand) is an XMM register, it preserves the high-part of the first operand (the destination operand).
  • If the second operand (the source operand) is a memory location, it clears the high-part of the first operand (the destination operand).

This is tricky, because the documentation says that the _mm_set_sd() intrinsic clears the high part of the register, so gcc is basically following the specification. However, it's unexpected if you have no use for the high part of the register, and the two movsd's are suboptimal, as they could be replaced by a single movq, which is what clang does in some other cases.
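The single-movq form can also be written explicitly. A sketch (the helper name is made up): bit-cast the double through a general-purpose register into the low lane, which zero-fills the high lane by definition:

```cpp
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

// _mm_cvtsi64_si128 maps to a single `movq xmm, r64` and zero-fills the
// upper lane. 64-bit targets only.
static inline __m128d DoubleAsXmmViaMovq(double x) {
  int64_t bits;
  memcpy(&bits, &x, sizeof(bits));     // bit-cast double -> int64_t
  return _mm_castsi128_pd(_mm_cvtsi64_si128(bits));
}
```

This honors the "clears the high part" contract with one instruction instead of a store/load pair, though it still doesn't express "I don't care about the high part" like the workaround below.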

A Possible Workaround

After submitting gcc bug #70708 (which turned out to be a duplicate of #68211) I got instant feedback and a workaround was found. We can use some asm magic to tell the compiler what to do. Consider the updated code, which also works well with clang:

static inline __m128d DoubleAsXmm(double x) {
  __m128d xmm;
  asm ("" : "=x" (xmm) : "0" (x));
  return xmm;
}

double MyMinV3(double a, double b) {
  return _mm_cvtsd_f64(
    _mm_min_sd(
      DoubleAsXmm(a),
      DoubleAsXmm(b)));
}

Which leads to the desired result (both gcc and clang):

MyMinV3:
  minsd   xmm0, xmm1
  ret

Of course, you need some preprocessor magic to make the conversion portable across compilers. However, it's a nice solution, as it basically tells the compiler that you don't care about the high part of the SIMD register. As a bonus, it also prevents clang from emitting a movq instruction in some other cases.
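The preprocessor part could look something like this (a sketch; it falls back to plain _mm_set_sd on compilers without GCC-style inline asm, accepting whatever code they emit):

```cpp
#include <emmintrin.h>

#if defined(__GNUC__) || defined(__clang__)
// GCC/clang: the empty asm reinterprets the double as an XMM register
// without implying anything about the high lane.
static inline __m128d DoubleAsXmm(double x) {
  __m128d xmm;
  asm("" : "=x"(xmm) : "0"(x));
  return xmm;
}
#else
// MSVC and others: plain _mm_set_sd.
static inline __m128d DoubleAsXmm(double x) {
  return _mm_set_sd(x);
}
#endif

static inline double MyMin(double a, double b) {
  return _mm_cvtsd_f64(_mm_min_sd(DoubleAsXmm(a), DoubleAsXmm(b)));
}
```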

Lesson Learned

I think the times when we had to specialize min/max by using SSE and SSE2 are finally over; compilers are fine doing that for us. However, if you still need to use SSE+ intrinsics, check how the compiler converts your scalars into SIMD registers, especially if you only use them for scalar operations.

25 March, 2016

Vector sin/cos (SSE2)

Intro

Using SIMD intrinsics in C++ is so cool until you find something missing. Most of the time it's just about helpers that combine a few intrinsics into something that should have obviously been implemented by the CPU, but isn't. However, sometimes you would like intrinsics that provide much more, like sine and cosine. These are already available in the C library, so why are there no intrinsics for them? Because it's not that easy. There are libraries that provide non-conforming implementations of sine and cosine (meaning they don't handle special numbers and only work within a certain input range), which is usually fine, as it's really rare to compute something like sin(1e16), although a conforming implementation should handle such input without returning a completely wrong result.

After I released an updated version of MathPresso and a preview of MPSL, I faced two problems: the sine and cosine implementations provided by the C runtime vary across the platforms I would like to support, and there are no vector versions I can just use in MPSL (and, to be honest, I want these functions inlined anyway). However, the problem goes even deeper: I discovered that sine and cosine results don't just vary across platforms; some implementations (like Visual Studio's C runtime) produce different results even between debug and release builds. This means that MPSL needs its own implementation of every function it provides in order to produce the same results on all supported platforms.

Existing Implementations

Instead of trying to implement my own sine and cosine from scratch, I looked at existing open-source implementations and tried to compare their performance and precision. There are basically two implementations I found interesting:

  • Netlib's cephes - Provides an implementation of all math functions available in the C runtime. The most interesting part of the sine implementation is the selection of a 6-degree polynomial based on the input. Not really SIMD-friendly, but solvable; for example, ssemath simply executes both branches and then masks the wrong values out. The ssemath library, however, provides only single-precision sine and cosine, which is not what I need.
  • Erik's vecmathlib - Provides an implementation that uses only one 9-degree polynomial. The implementation seems to be based on a public domain library called sleef, which implements the same algorithms as those used by vecmathlib (VML).

Testing Code

At the moment I'm mostly interested in double-precision scalar and vector implementations. I wrote my own vector sine that uses the cephes approach (two polynomials for different input ranges), and took vecmathlib's implementation for precision and speed comparison. Initially I compared the results of the functions under test against results obtained by calling the native sin() function, but since I started seeing differences between debug and release builds, I decided to use a multi-precision library to calculate sine at a much higher precision and compare against those results instead.

The following is my implementation, which is based on the cephes approach:

static SIMD_INLINE __m128d sin_cephes_pd(__m128d x) {
  SIMD_CONST_SQ(sign     , SIMD_UINT64_C(0x8000000000000000));
  SIMD_CONST_SQ(inv_sign , SIMD_UINT64_C(0x7FFFFFFFFFFFFFFF));
  SIMD_CONST_SI(int32_one, 1);
  SIMD_CONST_SD(4_DIV_PI , 1.27323954473516268615107010698);
  SIMD_CONST_SD(DP1      , 7.85398125648498535156e-1);
  SIMD_CONST_SD(DP2      , 3.77489470793079817668e-8);
  SIMD_CONST_SD(DP3      , 2.69515142907905952645e-15);

#define DEFINE_DATA(name, x0, x1, x2, x3, x4, x5, xm, xa, y0, y1, y2, y3, y4, y5, ym, ya) \
  SIMD_ALIGN_VAR(static const double, name[], 16) = { \
    x0, x0, x1, x1, x2, x2, x3, x3, x4, x4, x5, x5, xm, xm, xa, xa, \
    y0, x0, y1, x1, y2, x2, y3, x3, y4, x4, y5, x5, ym, xm, ya, xa, \
    x0, y0, x1, y1, x2, y2, x3, y3, x4, y4, x5, y5, xm, ym, xa, ya, \
    y0, y0, y1, y1, y2, y2, y3, y3, y4, y4, y5, y5, ym, ym, ya, ya  \
  }

  DEFINE_DATA(sincos_coeff,
    1.58962301576546568060e-10,-2.50507477628578072866e-8,
    2.75573136213857245213e-6 ,-1.98412698295895385996e-4,
    8.33333333332211858878e-3 ,-1.66666666666666307295e-1, 1.0, 0.0,

   -1.13585365213876817300e-11, 2.08757008419747316778e-9,
   -2.75573141792967388112e-7 , 2.48015872888517045348e-5,
   -1.38888888888730564116e-3 , 4.16666666666665929218e-2,-0.5, 1.0);

  __m128d y;
  __m128d sign = x;                              // Sign bit.

  x = _mm_and_pd(x, SIMD_GET_PD(inv_sign));      // Take the absolute value.
  y = _mm_mul_pd(x, SIMD_GET_PD(4_DIV_PI));      // Integer part of `x * 4 / PI`.

  __m128i ival = _mm_cvttpd_epi32(y);            // Extract the integer part of y.
  __m128i ione = SIMD_GET_PI(int32_one);

  ival = _mm_add_epi32(ival, ione);              // j += 1.
  ival = _mm_andnot_si128(ione, ival);           // j &=~1.

  y = _mm_cvtepi32_pd(ival);
  ival = _mm_unpacklo_epi32(ival, ival);

  sign = _mm_xor_pd(sign,             
    _mm_castsi128_pd(_mm_slli_epi64(ival, 61))); // Swap the sign bit if `j & 4`.
  sign = _mm_and_pd(sign, SIMD_GET_PD(sign));    // Keep only the sign bit.

  // Get the polynomial selection mask (j & 2):
  //   1. `0x0000000000000000` => `0    <= x <= PI/4`
  //   2. `0xFFFFFFFFFFFFFFFF` => `PI/4 <  x <= PI/2`
  ival = _mm_slli_epi32(ival, 30);
  ival = _mm_srai_epi32(ival, 31);

  // Extended precision modular arithmetic:
  //   x = ((x - y * DP1) - y * DP2) - y * DP3
  x = _mm_sub_pd(x, _mm_mul_pd(y, SIMD_GET_PD(DP1)));
  x = _mm_sub_pd(x, _mm_mul_pd(y, SIMD_GET_PD(DP2)));
  x = _mm_sub_pd(x, _mm_mul_pd(y, SIMD_GET_PD(DP3)));

  // Get the polynomial coefficients for each lane (sin/cos).
  __m128d poly_mask = _mm_castsi128_pd(ival);
  const __m128d* coeff = reinterpret_cast<const __m128d*>(sincos_coeff) +
    static_cast<uintptr_t>(_mm_movemask_pd(poly_mask)) * 8;

  __m128d xx = _mm_mul_pd(x, x);
  y = coeff[0];
  y = Simd128::mad(y, xx, coeff[1]);
  y = Simd128::mad(y, xx, coeff[2]);
  y = Simd128::mad(y, xx, coeff[3]);
  y = Simd128::mad(y, xx, coeff[4]);
  y = Simd128::mad(y, xx, coeff[5]);
  y = _mm_mul_pd(y, xx);

  __m128d x_or_xx = _mm_or_pd(
    _mm_and_pd(xx, poly_mask),
    _mm_andnot_pd(poly_mask, x));

  y = _mm_mul_pd(y, x_or_xx);
  y = _mm_add_pd(y, _mm_mul_pd(x_or_xx, coeff[6]));
  y = _mm_add_pd(y, coeff[7]);

  return _mm_xor_pd(y, sign);
}

The following code uses the vecmathlib approach:

static SIMD_INLINE __m128d sin_vecmathlib_pd(__m128d x) {
  SIMD_CONST_SD(1_PI , 0.318309886183790671538);
  SIMD_CONST_SD(PI4_A,-3.14159265160560607912);
  SIMD_CONST_SD(PI4_B,-1.98418715485759733496e-9);
  SIMD_CONST_SD(PI4_C,-4.5034835412693155724e-18);
  SIMD_CONST_SD(PI4_D,-7.0431197303664003632e-27);

  SIMD_CONST_SD(sin_0,-7.97255955009037868891952e-18);
  SIMD_CONST_SD(sin_1, 2.81009972710863200091251e-15);
  SIMD_CONST_SD(sin_2,-7.64712219118158833288484e-13);
  SIMD_CONST_SD(sin_3, 1.60590430605664501629054e-10);
  SIMD_CONST_SD(sin_4,-2.50521083763502045810755e-08);
  SIMD_CONST_SD(sin_5, 2.75573192239198747630416e-06);
  SIMD_CONST_SD(sin_6,-1.98412698412696162806809e-04);
  SIMD_CONST_SD(sin_7, 8.33333333333332974823815e-03);
  SIMD_CONST_SD(sin_8,-1.66666666666666657414808e-01);

  SIMD_CONST_SD(magic, 6755399441055744.0);

  __m128d q = Simd128::m128roundeven(
    _mm_mul_pd(x, SIMD_GET_PD(1_PI)));

  __m128d i = _mm_castsi128_pd(
    _mm_slli_epi64(
      _mm_castpd_si128(_mm_add_pd(q, SIMD_GET_PD(magic))), 63));

  x = Simd128::mad(q, SIMD_GET_PD(PI4_A), x);
  x = Simd128::mad(q, SIMD_GET_PD(PI4_B), x);
  x = Simd128::mad(q, SIMD_GET_PD(PI4_C), x);
  x = Simd128::mad(q, SIMD_GET_PD(PI4_D), x);

  __m128d xx = _mm_mul_pd(x, x);
  x = _mm_xor_pd(x, i);

  __m128d u = SIMD_GET_PD(sin_0);
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_1));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_2));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_3));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_4));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_5));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_6));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_7));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_8));
  u = Simd128::mad(xx, _mm_mul_pd(u, x), x);

  return u;
}
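The `magic` constant (6755399441055744.0 = 1.5 * 2^52) deserves a note: adding it to a double that holds a small integer forces that integer into the low mantissa bits, so the parity of `q` can be shifted straight into the sign position. A scalar sketch of the trick (the helper name is made up; it assumes the default round-to-nearest mode and |q| small enough to fit the mantissa):

```cpp
#include <stdint.h>
#include <string.h>

// Adding 1.5 * 2^52 to a small integer-valued double leaves that integer in
// the low bits of the mantissa; bit 0 then holds its parity.
static inline int64_t ParityViaMagic(double q) {
  const double magic = 6755399441055744.0; // 1.5 * 2^52
  double t = q + magic;
  int64_t bits;
  memcpy(&bits, &t, sizeof(bits));        // reinterpret the double's bits
  return bits & 1;
}
```

In the vector code above, the same idea is applied lane-wise and the extracted bit is shifted left by 63 to become the sign-flip mask.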

The two sine implementations above are used in the following way to calculate an array of sines:

void sin_array_pd(double* dst, const double* src, size_t length) {
  size_t i = length;

  while (i) {
    if (!SimdUtils::isAligned(dst, 16) || i == 1) {
      __m128d d = _mm_load_sd(src);
      _mm_store_sd(dst, sin_impl(d));
      
      dst++;
      src++;

      if (--i == 0)
        break;
    }

    while (i >= 2) {
      __m128d d = _mm_loadu_pd(src);
      _mm_store_pd(dst, sin_impl(d));

      dst += 2;
      src += 2;
      i -= 2;
    }
  }
}

A pure C implementation looks like this (giving the compiler a chance to vectorize the loop):

void sin_array_math_h(double* dst, const double* src, size_t length) {
  for (size_t i = 0; i < length; i++)
    dst[i] = ::sin(src[i]);
}

And finally, here is a multi-precision sine, which uses the LolRemez engine:

void sin_array_precise(double* dst, const double* src, size_t length) {
  for (size_t i = 0; i < length; i++)
    dst[i] = lol::sin(lol::real(src[i]));
}

Results

The test application first verifies each vector implementation and then benchmarks it. During verification it calculates the maximum ULP error; SumUlp is the sum of the ULP errors of all results.
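The ULP distance used by these checks can be computed by reinterpreting both doubles as integers, since consecutive finite doubles of the same sign have consecutive bit representations. A sketch (the helper name is made up; it assumes same-sign, finite inputs, which holds for sine outputs compared against a reference):

```cpp
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

// ULP distance of two finite doubles of the same sign.
static inline int64_t UlpDistance(double a, double b) {
  int64_t ia, ib;
  memcpy(&ia, &a, sizeof(ia));  // reinterpret bits, no conversion
  memcpy(&ib, &b, sizeof(ib));
  return llabs(ia - ib);
}
```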

[TRIGO] Test 1000000 values in domain
  MIN=-3.14159
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err=317248/1000000 MaxUlp=2 SumUlp=319132 MaxEps=2.2204460492503131e-16]
[ERROR] IMPL=sin (Cephes-SSE2)  (Input=-3.1415926535897931)
  ref(-1.2246467991473532e-16) !=
  res(-1.2246467991473515e-16) (ulp=7 | epsilon=1.7256332301709633e-31)
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=114265/1000000 MaxUlp=7 SumUlp=114271 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=201188/1000000 MaxUlp=2 SumUlp=201344 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.760 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.087 s]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=3.14159
[CHECK] IMPL=sin (math.h)       [Err=317091/1000000 MaxUlp=2 SumUlp=318853 MaxEps=2.2204460492503131e-16]
[ERROR] IMPL=sin (Cephes-SSE2)  (Input=3.1415926535897931)
  ref(1.2246467991473532e-16) !=
  res(1.2246467991473515e-16) (ulp=7 | epsilon=1.7256332301709633e-31)
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=113862/1000000 MaxUlp=7 SumUlp=113868 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=200971/1000000 MaxUlp=2 SumUlp=201124 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.525 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.759 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.071 s]

[TRIGO] Test 1000000 values in domain
  MIN=-100
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err=348528/1000000 MaxUlp=2 SumUlp=351145 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165038/1000000 MaxUlp=1 SumUlp=165038 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299594/1000000 MaxUlp=2 SumUlp=300467 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.759 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.087 s]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=100
[CHECK] IMPL=sin (math.h)       [Err=348356/1000000 MaxUlp=2 SumUlp=350857 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165389/1000000 MaxUlp=1 SumUlp=165389 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299318/1000000 MaxUlp=2 SumUlp=300179 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.744 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.071 s]

[TRIGO] Test 1000000 values in domain
  MIN=100
  MAX=10000
[CHECK] IMPL=sin (math.h)       [Err=351311/1000000 MaxUlp=2 SumUlp=354196 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=166589/1000000 MaxUlp=1 SumUlp=166589 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=310995/1000000 MaxUlp=2 SumUlp=312241 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.744 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.071 s]

The output basically says that the C implementation (VS2015) is the fastest, but not the most precise (within 2 ULPs). The trick is that the VS compiler has its own SSE2 implementation exported as __vdecl_sin2(). However, this function doesn't seem to be fully compatible with its own scalar sin(), as shown below. Furthermore, the implementation that uses the Cephes approach is only slightly slower than the native implementation and is generally much more precise (all outputs except one are within 1 ULP). The VML implementation is the slowest, but produces outputs within 2 ULPs of the precise implementation.

The following output was obtained from a debug build of the test application, without benchmarks:

[TRIGO] Test 1000000 values in domain
  MIN=-3.14159
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err= 33544/1000000 MaxUlp=1 SumUlp=33544 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=114265/1000000 MaxUlp=7 SumUlp=114271 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=201188/1000000 MaxUlp=2 SumUlp=201344 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=3.14159
[CHECK] IMPL=sin (math.h)       [Err= 33329/1000000 MaxUlp=1 SumUlp=33329 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=113862/1000000 MaxUlp=7 SumUlp=113868 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=200971/1000000 MaxUlp=2 SumUlp=201124 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=-100
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err= 34408/1000000 MaxUlp=1 SumUlp=34408 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165038/1000000 MaxUlp=1 SumUlp=165038 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299594/1000000 MaxUlp=2 SumUlp=300467 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=100
[CHECK] IMPL=sin (math.h)       [Err= 34528/1000000 MaxUlp=1 SumUlp=34528 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165389/1000000 MaxUlp=1 SumUlp=165389 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299318/1000000 MaxUlp=2 SumUlp=300179 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=100
  MAX=10000
[CHECK] IMPL=sin (math.h)       [Err= 33902/1000000 MaxUlp=1 SumUlp=33902 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=166589/1000000 MaxUlp=1 SumUlp=166589 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=310995/1000000 MaxUlp=2 SumUlp=312241 MaxEps=2.2204460492503131e-16]

Well, that sucks! VS has two sine implementations that give different results for the same inputs, and there is basically no control over which one is used. What I have observed is that debug builds simply don't use the vector implementation at all. Also note that this is not an X87 vs SSE2 problem, as the application is 64-bit. The difference between DEBUG and RELEASE builds is shown below:

[TRIGO] Test 1000000 values in domain
  MIN=-3.14159
  MAX=0
[CHECK] IMPL=sin (math.h REL)   [Err=317248/1000000 MaxUlp=2 SumUlp=319132 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 33544/1000000 MaxUlp=1 SumUlp= 33544 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=3.14159
[CHECK] IMPL=sin (math.h REL)   [Err=317091/1000000 MaxUlp=2 SumUlp=318853 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 33329/1000000 MaxUlp=1 SumUlp= 33329 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=-100
  MAX=0
[CHECK] IMPL=sin (math.h REL)   [Err=348528/1000000 MaxUlp=2 SumUlp=351145 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 34408/1000000 MaxUlp=1 SumUlp= 34408 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=100
[CHECK] IMPL=sin (math.h REL)   [Err=348356/1000000 MaxUlp=2 SumUlp=350857 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 34528/1000000 MaxUlp=1 SumUlp= 34528 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=100
  MAX=10000
[CHECK] IMPL=sin (math.h REL)   [Err=351311/1000000 MaxUlp=2 SumUlp=354196 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 33902/1000000 MaxUlp=1 SumUlp= 33902 MaxEps=1.1102230246251565e-16]

Robustness

So far, all tests used inputs no greater than 1e4. That was on purpose, because these are safe inputs that are easily reduced to the 0..PI range. Range reduction is the most complicated part of a robust sine implementation: PI is irrational, which makes it tricky to accurately reduce an arbitrary number to 0..PI. Check out fdlibm's e_rem_pio2.c and k_rem_pio2.c to see how the range reduction actually works. The same approach can be found in the V8 engine's fdlibm.cc, and so on; basically everybody uses the code written by Sun more than 20 years ago. So what happens if we pass huge numbers to the sine implementations discussed in this post?

[TRIGO] Test 1000000 values in domain
  MIN=100000
  MAX=1.68663e+09
[CHECK] IMPL=sin (math.h)       [Err=222251/1000000 MaxUlp=2 SumUlp=223913 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=166185/1000000 MaxUlp=1 SumUlp=166185 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=768543/1000000 MaxUlp=140737488387297 SumUlp=1926704344978426 MaxEps=1.1920928956520792e-07]
[BENCH] IMPL=sin (math.h)       [14.462 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.807 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.134 s]

[TRIGO] Test 1000000 values in domain
  MIN=1.68663e+09
  MAX=1e+100
[CHECK] IMPL=sin (math.h)       [Err=1252/1000000 MaxUlp=1 SumUlp=1252 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=999999/1000000 MaxUlp=4698695118786935215 SumUlp=11457351783800485383 MaxEps=inf]
[CHECK] IMPL=sin (VML-SSE2)     [Err=999999/1000000 MaxUlp=13912998465449088384 SumUlp=15519466227126556199 MaxEps=inf]
[BENCH] IMPL=sin (math.h)       [26.317 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.853 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.165 s]

The results clearly show that the C implementation should always behave correctly, while the other implementations may simply fail to deliver correct results. They also show that the C implementation uses a multi-precision range reduction (maybe derived from fdlibm), because it gets noticeably slower for huge inputs. The Cephes implementation works in our case for inputs up to 1.68663e+9, which is really good. VML starts failing at 9.8999887e+7, which is also really good.

Conclusion

I don't have one! My favorite implementation is the Cephes one (with proper range reduction added so it can handle all inputs), because most of the time its maximum error stays within 1 ULP and it is also faster than VML. What has delayed me from using it is the discovery of the erratic input x=3.1415926535897931, which results in a 7 ULP error (epsilon=1.1102230246251565e-16).

The code I used for testing is available at my simdtests repo.