25 March, 2016

Vector sin/cos (SSE2)

Intro

Using SIMD intrinsics in C++ is so cool until you find something missing. Most of the time it's just about helpers that combine few intrinsics into something that should have been obviously implemented by the CPU, but it's not. However, sometimes you would like to have intrinsics that provide something much more - like sine and cosine. It's already available in C library, so why there are no intrinsics for it? Because it's not that easy. There are some libraries that provide non-conforming implementation of sine and cosine (that means that they don't handle special numbers and they only work within a certain range), which is usually fine as it's really rare to compute something like sin(1e16) although a conforming implementation should be able to handle such input without returning completely wrong result.

After I released an updated version of MathPresso and a preview of MPSL I faced two problems: The sine and cosine implementations that are implemented by C runtime vary across platforms I would like to support, and there are no vector ones that I can just use in MPSL (and to be honest, I want these functions to be inlined anyway). However, the problem is even deeper - I discovered that sine and cosine results don't vary just across platforms, but some implementations (like Visual Studio's C runtime) produce different results even in debug and release builds. This means that MPSL needs its own implementation of all functions it provides in order to produce the same results on all supported platforms.

Existing Implementations

Instead of trying to implement my own sine and cosine from scratch, I looked at existing open-source implementations and tried to compare their performance and precision. There are basically 2 implementations I found interesting:

  • Netlib's cephes - Provides an implementation of all math functions available in C runtime. The most interesting part of the sine implementation is a selection of a 6-degree polynomial based on the input. Not really SIMD friendly, but solvable, for example ssemath just executes both branches and then masks the wrong values out. The ssemath library, however, provides only single-precision sine and cosine that I don't need.
  • Erik's vecmathlib - Provides an implementation that use only one 9-degree polynomial. The implementation seems to be based on a public domain library called sleef, which implements the same algorithms that are used by vecmathlib (VML).

Testing Code

I'm at the moment mostly interested in a double-precision scalar and vector implementations. I wrote my own vector sine that uses cephes approach (two polynomials for a different range of inputs), and took the vecmathlib's implementation for precision and speed comparison. Initially, I have been comparing results from functions I tested against results obtained by calling a native sin() function, but since I started having differences in debug and release builds I decided to use a multi-precision library to calculate sine in a much higher precision and compare with these results instead.

The following is my implementation, which is based on cephes's approach:

static SIMD_INLINE __m128d sin_cephes_pd(__m128d x) {
  SIMD_CONST_SQ(sign     , SIMD_UINT64_C(0x8000000000000000));
  SIMD_CONST_SQ(inv_sign , SIMD_UINT64_C(0x7FFFFFFFFFFFFFFF));
  SIMD_CONST_SI(int32_one, 1);
  SIMD_CONST_SD(4_DIV_PI , 1.27323954473516268615107010698);
  SIMD_CONST_SD(DP1      , 7.85398125648498535156e-1);
  SIMD_CONST_SD(DP2      , 3.77489470793079817668e-8);
  SIMD_CONST_SD(DP3      , 2.69515142907905952645e-15);

#define DEFINE_DATA(name, x0, x1, x2, x3, x4, x5, xm, xa, y0, y1, y2, y3, y4, y5, ym, ya) \
  SIMD_ALIGN_VAR(static const double, name[], 16) = { \
    x0, x0, x1, x1, x2, x2, x3, x3, x4, x4, x5, x5, xm, xm, xa, xa, \
    y0, x0, y1, x1, y2, x2, y3, x3, y4, x4, y5, x5, ym, xm, ya, xa, \
    x0, y0, x1, y1, x2, y2, x3, y3, x4, y4, x5, y5, xm, ym, xa, ya, \
    y0, y0, y1, y1, y2, y2, y3, y3, y4, y4, y5, y5, ym, ym, ya, ya  \
  }

  DEFINE_DATA(sincos_coeff,
    1.58962301576546568060e-10,-2.50507477628578072866e-8,
    2.75573136213857245213e-6 ,-1.98412698295895385996e-4,
    8.33333333332211858878e-3 ,-1.66666666666666307295e-1, 1.0, 0.0,

   -1.13585365213876817300e-11, 2.08757008419747316778e-9,
   -2.75573141792967388112e-7 , 2.48015872888517045348e-5,
   -1.38888888888730564116e-3 , 4.16666666666665929218e-2,-0.5, 1.0);

  __m128d y;
  __m128d sign = x;                              // Sign bit.

  x = _mm_and_pd(x, SIMD_GET_PD(inv_sign));      // Take the absolute value.
  y = _mm_mul_pd(x, SIMD_GET_PD(4_DIV_PI));      // Integer part of `x * 4 / PI`.

  __m128i ival = _mm_cvttpd_epi32(y);            // Extract the integer part of y.
  __m128i ione = SIMD_GET_PI(int32_one);

  ival = _mm_add_epi32(ival, ione);              // j += 1.
  ival = _mm_andnot_si128(ione, ival);           // j &=~1.

  y = _mm_cvtepi32_pd(ival);
  ival = _mm_unpacklo_epi32(ival, ival);

  sign = _mm_xor_pd(sign,             
    _mm_castsi128_pd(_mm_slli_epi64(ival, 61))); // Swap the sign bit if `j & 4`.
  sign = _mm_and_pd(sign, SIMD_GET_PD(sign));    // Keep only the sign bit.

  // Get the polynom selection mask (j & 2):
  //   1. `0x0000000000000000` => `0    <= x <= PI/4`
  //   2. `0xFFFFFFFFFFFFFFFF` => `PI/4 <  x <= PI/2`
  ival = _mm_slli_epi32(ival, 30);
  ival = _mm_srai_epi32(ival, 31);

  // Extended precision modular arithmetic:
  //   x = ((x - y * DP1) - y * DP2) - y * DP3
  x = _mm_sub_pd(x, _mm_mul_pd(y, SIMD_GET_PD(DP1)));
  x = _mm_sub_pd(x, _mm_mul_pd(y, SIMD_GET_PD(DP2)));
  x = _mm_sub_pd(x, _mm_mul_pd(y, SIMD_GET_PD(DP3)));

  // Get the polynom coefficients for each lane (sin/cos).
  __m128d poly_mask = _mm_castsi128_pd(ival);
  const __m128d* coeff = reinterpret_cast<const __m128d*>(sincos_coeff) +
    static_cast<uintptr_t>(_mm_movemask_pd(poly_mask)) * 8;

  __m128d xx = _mm_mul_pd(x, x);
  y = coeff[0];
  y = Simd128::mad(y, xx, coeff[1]);
  y = Simd128::mad(y, xx, coeff[2]);
  y = Simd128::mad(y, xx, coeff[3]);
  y = Simd128::mad(y, xx, coeff[4]);
  y = Simd128::mad(y, xx, coeff[5]);
  y = _mm_mul_pd(y, xx);

  __m128d x_or_xx = _mm_or_pd(
    _mm_and_pd(xx, poly_mask),
    _mm_andnot_pd(poly_mask, x));

  y = _mm_mul_pd(y, x_or_xx);
  y = _mm_add_pd(y, _mm_mul_pd(x_or_xx, coeff[6]));
  y = _mm_add_pd(y, coeff[7]);

  return _mm_xor_pd(y, sign);
}

The following code uses the vecmathlib approach:

static SIMD_INLINE __m128d sin_vecmathlib_pd(__m128d x) {
  SIMD_CONST_SD(1_PI , 0.318309886183790671538);
  SIMD_CONST_SD(PI4_A,-3.14159265160560607912);
  SIMD_CONST_SD(PI4_B,-1.98418715485759733496e-9);
  SIMD_CONST_SD(PI4_C,-4.5034835412693155724e-18);
  SIMD_CONST_SD(PI4_D,-7.0431197303664003632e-27);

  SIMD_CONST_SD(sin_0,-7.97255955009037868891952e-18);
  SIMD_CONST_SD(sin_1, 2.81009972710863200091251e-15);
  SIMD_CONST_SD(sin_2,-7.64712219118158833288484e-13);
  SIMD_CONST_SD(sin_3, 1.60590430605664501629054e-10);
  SIMD_CONST_SD(sin_4,-2.50521083763502045810755e-08);
  SIMD_CONST_SD(sin_5, 2.75573192239198747630416e-06);
  SIMD_CONST_SD(sin_6,-1.98412698412696162806809e-04);
  SIMD_CONST_SD(sin_7, 8.33333333333332974823815e-03);
  SIMD_CONST_SD(sin_8,-1.66666666666666657414808e-01);

  SIMD_CONST_SD(magic, 6755399441055744.0);

  __m128d q = Simd128::m128roundeven(
    _mm_mul_pd(x, SIMD_GET_PD(1_PI)));

  __m128d i = _mm_castsi128_pd(
    _mm_slli_epi64(
      _mm_castpd_si128(_mm_add_pd(q, SIMD_GET_PD(magic))), 63));

  x = Simd128::mad(q, SIMD_GET_PD(PI4_A), x);
  x = Simd128::mad(q, SIMD_GET_PD(PI4_B), x);
  x = Simd128::mad(q, SIMD_GET_PD(PI4_C), x);
  x = Simd128::mad(q, SIMD_GET_PD(PI4_D), x);

  __m128d xx = _mm_mul_pd(x, x);
  x = _mm_xor_pd(x, i);

  __m128d u = SIMD_GET_PD(sin_0);
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_1));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_2));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_3));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_4));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_5));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_6));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_7));
  u = Simd128::mad(u, xx, SIMD_GET_PD(sin_8));
  u = Simd128::mad(xx, _mm_mul_pd(u, x), x);

  return u;
}
The functions above are used in the following way to calculate an array of sines:
void sin_array_pd(double* dst, const double* src, size_t length) {
  size_t i = length;

  while (i) {
    if (!SimdUtils::isAligned(dst, 16) || i == 1) {
      __m128d d = _mm_load_sd(src);
      _mm_store_sd(dst, sin_impl(d));
      
      dst++;
      src++;

      if (--i == 0)
        break;
    }

    while (i >= 2) {
      __m128d d = _mm_loadu_pd(src);
      _mm_store_pd(dst, sin_impl(d));

      dst += 2;
      src += 2;
      i -= 2;
    }
  }
}
A pure C implementation is implemented like this (giving compiler a chance to vectorize the loop):
void sin_array_math_h(double* dst, const double* src, size_t length) {
  for (size_t i = 0; i < length; i++)
    dst[i] = ::sin(src[i]);
}
And finally, here is a multi-precision sine, which uses LolRemez engine:
void sin_array_precise(double* dst, const double* src, size_t length) {
  for (size_t i = 0; i < length; i++)
    dst[i] = lol::sin(lol::real(src[i]));
}

Results

The test application first verifies the vector implementation and then benchmarks it. When making the verification it calculates the maximum ULP error. The SumUlp is a sum of all ULP errors of all results.

[TRIGO] Test 1000000 values in domain
  MIN=-3.14159
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err=317248/1000000 MaxUlp=2 SumUlp=319132 MaxEps=2.2204460492503131e-16]
[ERROR] IMPL=sin (Cephes-SSE2)  (Input=-3.1415926535897931)
  ref(-1.2246467991473532e-16) !=
  res(-1.2246467991473515e-16) (ulp=7 | epsilon=1.7256332301709633e-31)
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=114265/1000000 MaxUlp=7 SumUlp=114271 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=201188/1000000 MaxUlp=2 SumUlp=201344 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.760 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.087 s]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=3.14159
[CHECK] IMPL=sin (math.h)       [Err=317091/1000000 MaxUlp=2 SumUlp=318853 MaxEps=2.2204460492503131e-16]
[ERROR] IMPL=sin (Cephes-SSE2)  (Input=3.1415926535897931)
  ref(1.2246467991473532e-16) !=
  res(1.2246467991473515e-16) (ulp=7 | epsilon=1.7256332301709633e-31)
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=113862/1000000 MaxUlp=7 SumUlp=113868 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=200971/1000000 MaxUlp=2 SumUlp=201124 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.525 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.759 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.071 s]

[TRIGO] Test 1000000 values in domain
  MIN=-100
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err=348528/1000000 MaxUlp=2 SumUlp=351145 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165038/1000000 MaxUlp=1 SumUlp=165038 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299594/1000000 MaxUlp=2 SumUlp=300467 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.759 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.087 s]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=100
[CHECK] IMPL=sin (math.h)       [Err=348356/1000000 MaxUlp=2 SumUlp=350857 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165389/1000000 MaxUlp=1 SumUlp=165389 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299318/1000000 MaxUlp=2 SumUlp=300179 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.744 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.071 s]

[TRIGO] Test 1000000 values in domain
  MIN=100
  MAX=10000
[CHECK] IMPL=sin (math.h)       [Err=351311/1000000 MaxUlp=2 SumUlp=354196 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=166589/1000000 MaxUlp=1 SumUlp=166589 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=310995/1000000 MaxUlp=2 SumUlp=312241 MaxEps=2.2204460492503131e-16]
[BENCH] IMPL=sin (math.h)       [03.510 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.744 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.071 s]

The output basically says that the C implementation (VS2015) is the fastest one, but not the most precise one (within 2 ULPs). The trick is that VS compiler has its own SSE2 implementation exported as __vdecl_sin2(). However, it seems that this function is not fully compatible with its own scalar sin(), as shown below. Furthermore, the implementation that uses the Cephes approach is only slightly slower than the native implementation and is generally much more precise (all outputs except one are within 1 ULP). The VML implementation is the slowest one, but produces outputs within 2 ULPs compared to the precise implementation.

The following output was obtained from a debug build of the test application, without benchmarks:

[TRIGO] Test 1000000 values in domain
  MIN=-3.14159
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err= 33544/1000000 MaxUlp=1 SumUlp=33544 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=114265/1000000 MaxUlp=7 SumUlp=114271 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=201188/1000000 MaxUlp=2 SumUlp=201344 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=3.14159
[CHECK] IMPL=sin (math.h)       [Err= 33329/1000000 MaxUlp=1 SumUlp=33329 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=113862/1000000 MaxUlp=7 SumUlp=113868 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=200971/1000000 MaxUlp=2 SumUlp=201124 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=-100
  MAX=0
[CHECK] IMPL=sin (math.h)       [Err= 34408/1000000 MaxUlp=1 SumUlp=34408 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165038/1000000 MaxUlp=1 SumUlp=165038 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299594/1000000 MaxUlp=2 SumUlp=300467 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=100
[CHECK] IMPL=sin (math.h)       [Err= 34528/1000000 MaxUlp=1 SumUlp=34528 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=165389/1000000 MaxUlp=1 SumUlp=165389 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=299318/1000000 MaxUlp=2 SumUlp=300179 MaxEps=2.2204460492503131e-16]

[TRIGO] Test 1000000 values in domain
  MIN=100
  MAX=10000
[CHECK] IMPL=sin (math.h)       [Err= 33902/1000000 MaxUlp=1 SumUlp=33902 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=166589/1000000 MaxUlp=1 SumUlp=166589 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=310995/1000000 MaxUlp=2 SumUlp=312241 MaxEps=2.2204460492503131e-16]

Well, that sucks! VS has two sine implementations that give different results for the same inputs; and there is basically no control over which one to use. What I have observed is that debug builds simply don't use the vector implementation at all. Also note that this is not an X87 vs SSE2 problem as the application is 64-bit. The difference between DEBUG and RELEASE builds is shown below:

[TRIGO] Test 1000000 values in domain
  MIN=-3.14159
  MAX=0
[CHECK] IMPL=sin (math.h REL)   [Err=317248/1000000 MaxUlp=2 SumUlp=319132 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 33544/1000000 MaxUlp=1 SumUlp= 33544 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=3.14159
[CHECK] IMPL=sin (math.h REL)   [Err=317091/1000000 MaxUlp=2 SumUlp=318853 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 33329/1000000 MaxUlp=1 SumUlp= 33329 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=-100
  MAX=0
[CHECK] IMPL=sin (math.h REL)   [Err=348528/1000000 MaxUlp=2 SumUlp=351145 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 34408/1000000 MaxUlp=1 SumUlp= 34408 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=0
  MAX=100
[CHECK] IMPL=sin (math.h REL)   [Err=348356/1000000 MaxUlp=2 SumUlp=350857 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 34528/1000000 MaxUlp=1 SumUlp= 34528 MaxEps=1.1102230246251565e-16]

[TRIGO] Test 1000000 values in domain
  MIN=100
  MAX=10000
[CHECK] IMPL=sin (math.h REL)   [Err=351311/1000000 MaxUlp=2 SumUlp=354196 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (math.h DBG)   [Err= 33902/1000000 MaxUlp=1 SumUlp= 33902 MaxEps=1.1102230246251565e-16]

Robustness

What we have seen so far were tests for inputs that weren't greater than 1e4. That was on purpose, because these are safe inputs that are easily reduced to 0..PI range. Basically the range reduction process is the most complicated part of a robust sine implementation. PI is an irrational number, which makes it tricky to range-reduce some other number to 0..PI. Check out fdlibm's e_rem_pio2.c and k_rem_pio2.c to see how the range reduction actually works. The same approach can be found in V8 engine in their fdlibm.cc, and so on. Basically everybody uses the code written by Sun more than 20 years ago. So, what happens if we try to pass huge numbers to the sine implementations provided in this post?

[TRIGO] Test 1000000 values in domain
  MIN=100000
  MAX=1.68663e+09
[CHECK] IMPL=sin (math.h)       [Err=222251/1000000 MaxUlp=2 SumUlp=223913 MaxEps=2.2204460492503131e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=166185/1000000 MaxUlp=1 SumUlp=166185 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (VML-SSE2)     [Err=768543/1000000 MaxUlp=140737488387297 SumUlp=1926704344978426 MaxEps=1.1920928956520792e-07]
[BENCH] IMPL=sin (math.h)       [14.462 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.807 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.134 s]

[TRIGO] Test 1000000 values in domain
  MIN=1.68663e+09
  MAX=1e+100
[CHECK] IMPL=sin (math.h)       [Err=1252/1000000 MaxUlp=1 SumUlp=1252 MaxEps=1.1102230246251565e-16]
[CHECK] IMPL=sin (Cephes-SSE2)  [Err=999999/1000000 MaxUlp=4698695118786935215 SumUlp=11457351783800485383 MaxEps=inf]
[CHECK] IMPL=sin (VML-SSE2)     [Err=999999/1000000 MaxUlp=13912998465449088384 SumUlp=15519466227126556199 MaxEps=inf]
[BENCH] IMPL=sin (math.h)       [26.317 s]
[BENCH] IMPL=sin (Cephes-SSE2)  [03.853 s]
[BENCH] IMPL=sin (VML-SSE2)     [04.165 s]

The results clearly show that the C implementation should always behave correctly and that other implementation may simply fail to deliver safe results. It also shows that the C implementation uses a multi-precision range reduction (maybe derived from fdlibm), because it starts getting slower when used with huge inputs. Cephes implementation works in our case for inputs up to 1.68663e+9, which is really good. VML starts failing at 9.8999887e+7, which is also really good.

Conclusion

I don't have one! My favorite implementation is the cephes one (of course with proper range reduction so it can handle all inputs), because most of the time the maximum error stays within 1 ULP and is also faster than VML. What have delayed me from using it was the discovery of the erratic input x=3.1415926535897931, which results in 7 ULPs error (epsilon=1.1102230246251565e-16).

The code I used for testing is available at my simdtests repo.