15 April, 2020

Blend2D - Multi-Threaded Rendering

Today a new beta version of Blend2D has been released. It features an initial implementation of a multi-threaded rendering context that can, at the moment, use up to 32 threads to accelerate rendering of 2D graphics into a pixel buffer. Multi-threaded (MT) rendering was planned from the start; however, other features had a higher priority in 2019. I finally started working on MT rendering in early 2020, but the initial planning goes back years.

The Blend2D project is an ongoing effort to innovate in 2D rendering. It was the first open source rendering engine to use a JIT compiler to accelerate software-based rendering, and it's likely the first to offer MT rendering, at least compared to the 2D rendering capabilities of libraries such as AGG, Cairo, Qt, and Skia. Multi-threaded rendering is another important optimization towards maximizing 2D graphics performance.

NOTE: A developer introduction to MT rendering is covered by the Multithreaded Rendering page. Furthermore, initial benchmarks are available on the Performance page. In the following I would like to talk about the design and other considerations.

Motivation

Every time I finish a new feature I ask myself - what to do next? Is there anything I can do to improve the performance even further? Multi-threading resonated in my head for quite a long time. Although I had an idea, I never actually started working on it because there were always many missing features that I wanted to tackle first. However, since screens are getting larger and their resolutions are increasing as well, I thought it's also important to have an accelerated renderer capable of real-time rendering into large framebuffers. Based on my own experiments it's perfectly okay to use a single-threaded renderer to render into a FullHD framebuffer. Many of the Blend2D demos that use the Qt framework are able to achieve 240fps at such a resolution. That also includes the time spent in Qt blitting the QImage into Qt's backing store. I haven't debugged what exactly happens there on my Linux box, but I assume the pixels are copied once or maybe even twice to get them on screen.

The problem comes when a framebuffer gets larger, like 4K - which is approximately 32MB of pixel data at 32 bits per pixel. Such a quantity usually doesn't fit into the CPU cache, so traditional synchronous rendering might become a bottleneck. This was taken into consideration when designing the MT renderer as well - it should always process a smaller area and move to the next one once the work is done, but this will be explained later. Since modern CPUs offer at least 4 cores, and high-end ones 8 or more, it was about time to finally start working on MT rendering, because it could have a greater impact than improving SIMD acceleration, for example.

I would like to discuss SIMD before I start with MT. On the x86 architecture SIMD has evolved rapidly since the initial MMX era. We went from 64-bit SIMD to 512-bit SIMD - every time with new instructions, new prefixes, and new ways of achieving the same thing. I don't think this is a bad thing; I'm trying to say that the transition was never straightforward. Maybe the easiest one was from MMX to SSE2, as the register width just doubled and the instructions remained the same. But this doesn't apply to SSE2 vs AVX2. In the AVX2 case the 256-bit YMM registers are split into two 128-bit parts and the CPU processes them independently. This creates a lot of issues with data packing and unpacking as the packing is also split into two 128-bit parts. It means it's not possible to just change the size of SIMD in the code, load bigger quantities, and expect the code to work as is. It has to be carefully ported. The same thing happened again with AVX-512. Since AVX-512 introduced new mask registers it also repurposed many instructions to use such registers instead of SIMD registers. So when porting code from 256-bit SIMD to 512-bit SIMD you face the same problem - you cannot just increase the operation width; you have to port the code, because the same instructions that worked with YMM registers now expect a K register instead of a ZMM register to store or use a mask.

And now to the point - multi-threading solves one big problem here. When a new CPU is released and provides more cores, existing software capable of executing tasks on multiple threads can use the new hardware immediately, so the user instantly sees the improved performance. This doesn't happen with new SIMD extensions, which usually take years to get into existing software.

The Design of Multi-Threaded Renderer

Blend2D's multi-threaded rendering context is 100% compatible with the single-threaded rendering context. From now on I will call the single-threaded rendering context synchronous and the MT rendering context asynchronous. In the synchronous case each render call ends up modifying pixels and returns after all relevant pixels have been modified. In the asynchronous case the rendering context becomes a serializer of render commands. Each render call gets translated into a command, which is then added to a command queue to be processed later.
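
To illustrate, here is a minimal sketch of how an application could ask for the asynchronous context. I'm assuming the public beta API here, where BLContextCreateInfo::threadCount selects the number of worker threads and leaving it at zero keeps the context synchronous:

#include <blend2d.h>

int main() {
  BLImage img(1920, 1080, BL_FORMAT_PRGB32);

  // Assumed beta API: a non-zero threadCount requests the asynchronous context.
  BLContextCreateInfo createInfo {};
  createInfo.threadCount = 4;

  BLContext ctx(img, createInfo);
  ctx.setFillStyle(BLRgba32(0xFFFFFFFFu));
  ctx.fillAll();                  // serialized as a command, not executed immediately
  ctx.fillCircle(960, 540, 400);  // another command (plus a job that builds edges)
  ctx.end();                      // flushes the last batch and waits for the workers

  return 0;
}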

In the Blend2D implementation commands describe only a part of the work. The rendering context uses another concept called jobs. A job can be seen as a prerequisite of a command that can, however, be processed asynchronously and in any order. The important thing to remember is that a job must be processed before the command that depends on it - this is guaranteed by synchronization. When the rendering context starts processing batches, it first executes job processors, waits for their completion, and then executes command processors.

The main difference between jobs and commands is that jobs never modify pixels in the framebuffer whereas commands do. A job typically processes input data like paths and text. E.g. there are FillGeometry and StrokeGeometry jobs that only process the input geometry and generate edges, which will later be consumed by the rasterizer. There are also FillText and StrokeText jobs, which do a very similar operation, but their input is either text or glyph-runs.

Another difference between jobs and commands is that commands are stateless - command inputs were already transformed and clipped, while the inputs of jobs can be the same data passed to a render call. Therefore, jobs need to know the transformation matrix and clip region to process the input geometry. This is realized through shared states. A shared state can be shared across the whole rendering batch (multiple commands/jobs - thousands, even tens of thousands). It is created on demand by the rendering context when it creates a job to be used with a serialized command. Once a shared state is created the rendering context keeps reusing it until the user changes one or more properties that invalidate it. There are currently two shared states - FillState and StrokeState. FillState is used only by filling and contains everything needed to transform and clip the input geometry. StrokeState (together with FillState) is used by stroking and contains additional data required by the path stroker.

Commands and jobs are part of a rendering batch, which is basically a queue of jobs, commands, and a storage of shared states. When the rendering context decides to process the batch, it does the following steps:

  • 1. Wake up workers to process jobs
  • 2. Wait until all jobs get processed
  • 3. Wake up workers to process bands (includes command processing)
  • 4. Wait until all bands get processed

Job Processing

Each worker thread operates on the same job queue, which is part of the same batch. It knows how many jobs are in the queue, and it knows the address of an atomic variable provided by the batch, called jobIndex. When a worker grabs a new job it simply increments the index with fetch_add() and, if the returned index is valid, it has the index of the job to process. Atomic operations guarantee that each worker processes unique jobs. When jobIndex goes out of range it means that there are no more jobs to process and the worker has to either wait for the remaining workers, or, if this was the last job, wake up the workers and start processing commands.
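
A simplified sketch of this job-grabbing loop could look like the following (my illustration of the pattern, not the actual Blend2D code):

#include <atomic>
#include <stddef.h>

struct JobBatch {
  std::atomic<size_t> jobIndex {0};
  size_t jobCount {0};
};

// Executed by every worker thread; fetch_add() guarantees that each job
// index is handed out exactly once.
template<typename ProcessJob>
void workerProcessJobs(JobBatch& batch, ProcessJob&& processJob) {
  for (;;) {
    size_t index = batch.jobIndex.fetch_add(1, std::memory_order_relaxed);
    if (index >= batch.jobCount)
      break; // out of range - either wait for the others or move on to commands
    processJob(index);
  }
}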

When a job processor finishes a job, it usually updates the command associated with it. E.g. if a command uses a rasterizer that needs edges, the pointer to those edges is initially null, and when the job processor finishes edge building it updates the command.

Command Processing

The processing of commands is actually much more interesting than job processing. Blend2D uses bands for raster operations. A band is simply a bunch of consecutive scanlines which are processed together. The whole engine works in band units - the rasterizer, pipelines, and now also the command processor. Similarly to job processing, each batch has a shared atomic variable called bandIndex. When a worker thread wants to process a band, it simply increments this index with fetch_add() and, when the returned index is valid, it's the index of a band to process. Each band of the framebuffer is processed exactly once per batch. Since each band has an index, we can calculate which scanlines belong to it. I.e. if bandHeight is 32 and bandIndex is 2 then the band refers to scanlines 64..95.

When a worker gets a bandIndex to process, it runs all command processors in order for the whole band. Since the processing is guaranteed to be top-to-bottom, it's also guaranteed that once a command finishes (its bounding box won't intersect with the next bands) it doesn't have to be processed again in the following bands. Each worker manages its own bit-array for this purpose, where each bit represents a command index. When a worker starts processing commands, all bits are assumed to be ones (1 means pending). When a command finishes and it's guaranteed that it won't produce anything in the next bands, the corresponding bit in the bit-array is cleared to zero. The command processor simply iterates the bits in the bit-array to know which commands to process, and those iterations are very fast because CPUs have a dedicated instruction for bit-scanning that executes in a cycle or two. In practice, when a worker processes the first band it goes over all the commands, and with each following band the number of zero bits in the bit-array increases (as does the number of commands it skips).
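
The following sketch shows how such a bit-array can be iterated (again my illustration; __builtin_ctzll stands in for the bit-scan instruction and is a GCC/Clang builtin):

#include <stddef.h>
#include <stdint.h>

enum : size_t { kMaxCommands = 256, kBitWords = kMaxCommands / 64 };

// One bit per command index; 1 means the command is still pending.
struct PendingBits {
  uint64_t words[kBitWords];
};

template<typename ProcessCommand, typename FinishedNow>
void processBandCommands(PendingBits& pending, size_t commandCount,
                         ProcessCommand&& processCommand, FinishedNow&& finishedNow) {
  for (size_t w = 0; w < kBitWords; w++) {
    uint64_t bits = pending.words[w];
    while (bits) {
      unsigned bit = unsigned(__builtin_ctzll(bits)); // bit-scan forward
      bits &= bits - 1;                               // clear the lowest set bit
      size_t index = w * 64 + bit;
      if (index >= commandCount)
        return;
      processCommand(index);
      if (finishedNow(index))                         // won't intersect the next bands
        pending.words[w] &= ~(uint64_t(1) << bit);
    }
  }
}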

Synchronization

At the moment there are two synchronization points during batch processing - the first ensures that all jobs have been processed, and the second ensures that all bands (which includes commands) have been processed. There is currently no hard limit on how many operations can fit into a single batch, so in most scenarios there is only a single batch to render, which means that ideally the rendering context synchronizes only twice in its lifetime.
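
A self-contained sketch of such a synchronization point (my simplification, not the actual implementation) could look like this:

#include <condition_variable>
#include <mutex>
#include <stdint.h>

class SyncPoint {
public:
  // Called before waking up workers for a phase (jobs or bands).
  void begin(uint32_t workerCount) {
    std::lock_guard<std::mutex> guard(_mutex);
    _pending = workerCount;
  }

  // Called by each worker when it runs out of jobs/bands to grab.
  void notifyDone() {
    std::lock_guard<std::mutex> guard(_mutex);
    if (--_pending == 0)
      _done.notify_all();
  }

  // Called by the thread that flushed the batch.
  void wait() {
    std::unique_lock<std::mutex> lock(_mutex);
    _done.wait(lock, [&] { return _pending == 0; });
  }

private:
  std::mutex _mutex;
  std::condition_variable _done;
  uint32_t _pending {0};
};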

Performance

The performance is already covered here and, in my opinion, it looks very promising. However, I would like to add some notes regarding the numbers shown on the performance page and the real FPS gains of the Blend2D Samples that use the Qt library. Some demos do not yield the same gains as shown on the performance page because the pixels have to be transferred to the screen, which is an additional operation that happens after each frame is rendered. Copying pixels to the screen seems to be a pretty expensive operation and there are definitely limits that cannot be crossed. Additionally, I don't know how many times the pixels are copied; this is solely up to Qt and it's very possible they are copied twice, as all demos render to a QImage, which is then blitted to a backing store.

Another problem with some of the demos is that they only use a single path to stress the rasterizer (bl-qt-circles, bl-qt-polys, bl-qt-particles), which means that there is only a single job in the whole rendering batch. In this case workers are not utilized properly and basically only one worker processes the job while the others have to wait until it finishes. Regardless, I think that stressing the rasterizer is still important, as not all paths are perfect and it simply must be able to handle complex paths having hundreds of figures, like text runs may have.

Future Work

What I have described in this post covers the initial implementation of the multi-threaded rendering context offered by Blend2D. There are still things that should be improved in the future:

  • Less synchronization - I think that the implementation can be changed to start processing commands even when some jobs are not yet finished. The command processor should execute immediately after there are no more jobs and should wait only when it encounters a command that depends on an unfinished job. This could improve the rendering time by avoiding synchronization, which may be unnecessary in many workloads, especially if the first render call is something like clearAll() or fillAll().
  • Background Processing - Currently, the implementation wakes up all threads once a batch is ready to be processed. I think that, since jobs can be processed in any order, the rendering context can simply start processing jobs in the background (while the context is still being used by the user). This should also improve the total rendering time as workers will start doing their work much earlier.
  • Smarter Synchronization - At the moment workers are synchronized with condition variables, which are guarded by mutexes. I started exploring the possibility of using futexes on platforms that offer such a synchronization primitive.

Conclusion

It was fun to work on something new like this and to push the performance of Blend2D even further. Now it's time to focus more on features that are not strictly related to performance, e.g. better text support and layers in the rendering context.

31 March, 2020

The Cost of Atomics

Introduction

Atomic operations, which are now available in many programming languages including C/C++ (introduced in C11 and C++11), are very useful operations that can be used to implement lock-free algorithms without resorting to hand-written assembly. The C++ compiler translates each atomic operation into an instruction or a set of instructions that guarantee atomicity. But what is the cost of atomic operations?

This article tries to demonstrate the cost of atomic operations on a unique 64-bit ID generator, which uses a global counter and generates unique IDs by incrementing that counter atomically. This guarantees that no locking is required to obtain such unique IDs. However, there is still the cost of the atomic operation itself, which can be quite high regardless of the level of concurrency.

A Naive Approach

Let's say that we want to write a function that returns a unique 64-bit identifier. I have chosen a 64-bit number as I think it's impossible to overflow such a number in today's mainstream applications. Theoretically, if some process generates 4 billion IDs per second it would still take 136 years to overflow the global counter, which seems unrealistic at the moment. However, if we improve the performance of the ID generator and run it on a high-end multicore CPU to generate such IDs in parallel, then the theoretical time to overflow the counter can be drastically reduced, so it really depends on the nature of the application whether a 64-bit identifier is enough. My recommendation would be to always abstract the data type used to represent such an ID so it can be painlessly changed in the future.
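
For example, a minimal sketch of that recommendation (my illustration, not code taken from Blend2D):

#include <stdint.h>

// If 64 bits ever become insufficient, only this alias has to change.
typedef uint64_t UniqueId;

UniqueId generateId();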

So, what would the simplest ID generator look like?

uint64_t generateIdGlobal() {
  static std::atomic<uint64_t> globalCounter(0);
  return ++globalCounter;
}

This was actually the initial implementation that I used in the past, considering it just okay as I didn't expect the ID generator to be called that often. In Blend2D such IDs are only needed to create unique identifiers for caching purposes, so it seemed fine. However, there is a small detail - what would happen if the target architecture doesn't implement 64-bit atomics? Such code would probably not compile on 32-bit ARM or MIPS hardware, as the target instruction set doesn't have to provide such a feature, but it would compile just fine on 32-bit x86 as it offers the cmpxchg8b instruction, which is enough for implementing any kind of 64-bit atomic operation.
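
If you want to know whether the target actually provides lock-free 64-bit atomics you can ask the standard library. A small sketch (my addition, not part of the ID generator):

#include <atomic>
#include <stdint.h>

bool has_lock_free_64bit_atomics() {
#if defined(__cpp_lib_atomic_is_always_lock_free)
  // C++17 can answer at compile time.
  return std::atomic<uint64_t>::is_always_lock_free;
#else
  // C++11 has to ask at runtime.
  std::atomic<uint64_t> probe(0);
  return probe.is_lock_free();
#endif
}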

Thread-Local Approach

If the only requirement for the ID generator is to return unique numbers that do not always have to be incrementing, we can use thread-local storage to implement a local cache and only use atomics for incrementing the global counter when we have exhausted the range of locally available IDs:

uint64_t generateIdLocal() {
  static std::atomic<uint64_t> globalCounter(0);

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0)
    localIdx = globalCounter.fetch_add(uint64_t(cacheSize));

  return ++localIdx;
}

This approach is of course longer, but it also minimizes the use of atomic operations on the global counter. In addition, since we have minimized the use of atomics, we can also think about using a mutex to guard access to globalCounter in case we run on hardware that doesn't have 64-bit atomics:

static std::mutex globalMutex;

uint64_t generateIdLocalMutex() {
  static uint64_t globalCounter = 0;

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0) {
    std::lock_guard<std::mutex> guard(globalMutex);
    localIdx = globalCounter;
    globalCounter += cacheSize;
  }

  return ++localIdx;
}

Performance

So what would be your guess regarding the performance of each approach? I have written the following code to benchmark various implementations of the ID generator:

#include <stdint.h>
#include <stdio.h>
#include <atomic>
#include <mutex>
#include <thread>
#include <chrono>

typedef uint64_t (*GenerateIdFunc)(void);

static std::mutex globalMutex;

static uint64_t generateIdGlobal() {
  static std::atomic<uint64_t> globalCounter(0);
  return ++globalCounter;
}

static uint64_t generateIdGlobalMutex() {
  static uint64_t globalCounter = 0;
  std::lock_guard<std::mutex> guard(globalMutex);
  return ++globalCounter;
}

static uint64_t generateIdLocal() {
  static std::atomic<uint64_t> globalCounter(0);

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0)
    localIdx = globalCounter.fetch_add(uint64_t(cacheSize));

  return ++localIdx;
}

static uint64_t generateIdLocalMutex() {
  static uint64_t globalCounter = 0;

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0) {
    std::lock_guard<std::mutex> guard(globalMutex);
    localIdx = globalCounter;
    globalCounter += cacheSize;
  }

  return ++localIdx;
}

static void testFunction(GenerateIdFunc func, const char* name) {
  std::atomic<uint64_t> result {};
  constexpr size_t numThreads = 8;
  constexpr size_t numIterations = 1024 * 1024 * 16;

  printf("Testing %s:\n", name);
  auto start = std::chrono::high_resolution_clock::now();

  std::thread threads[numThreads];
  for (size_t i = 0; i < numThreads; i++)
    threads[i] = std::thread([&]() {
      uint64_t localResult = 0;
      for (size_t j = 0; j < numIterations; j++)
        localResult += func();
      result.fetch_add(localResult);
    });

  for (size_t i = 0; i < numThreads; i++)
    threads[i].join();

  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed = end - start;

  printf("  Time: %0.3g s\n", elapsed.count());
  printf("  Result: %llu\n", (unsigned long long)result.load());
}

int main(int argc, char** argv) {
  testFunction(generateIdGlobal, "Global");
  testFunction(generateIdGlobalMutex, "GlobalMutex");
  testFunction(generateIdLocal, "Local");
  testFunction(generateIdLocalMutex, "LocalMutex");
  return 0;
}

Results on AMD Ryzen 3950x (Linux):

Approach          1 Thread   2 Threads   4 Threads   8 Threads   16 Threads
Global - Atomic   0.081s     0.292s      0.580s      1.070s      1.980s
Global - Mutex    0.164s     0.782s      4.440s      9.970s      19.00s
Local - Atomic    0.030s     0.039s      0.039s      0.038s      0.041s
Local - Mutex     0.038s     0.039s      0.037s      0.037s      0.056s

Conclusion

The results should be self-explanatory - atomic operations are always faster than using synchronization primitives to access a shared resource, but the cost of atomic operations is still not negligible and there is a limit to how many atomic operations can be performed on the same cache line. Accessing thread-local storage is faster than atomics and is beneficial especially in a highly concurrent environment. However, thread-local storage is a scarce resource that should always be used wisely.

I would like to note that this is a microbenchmark that basically stresses access to a shared resource. Such high contention should not happen in reasonable code and the speedup offered by using a different approach may be totally negligible in many real-world applications. In addition, there is a difference between accessing thread-local storage in executables and in dynamically linked libraries, so always benchmark your code to make sure that you don't increase the complexity of the design for no gain.

25 July, 2019

Is C++20 ssize() Broken by Design?

This is a short post in which I would like to raise some concerns regarding the std::ssize() (P1227) proposal. Sizes have always been represented by [an unsigned] size_t type, which "on many platforms (an exception is systems with segmented addressing) can safely store the value of any non-member pointer, in which case it is synonymous with uintptr_t". This means that the size of size_t equals the size of a machine word on most targets.

I don't think the P1227 proposal was thought through and I believe it's broken by design. The immediate question is whether a 32-bit process can allocate 2^31 bytes of contiguous memory and use such memory in a container. The answer is yes, it's possible, and I have verified it. The next question is how the proposed std::ssize() implementation deals with that. The answer is undefined behavior, as the std::ssize() function simply casts its input to ptrdiff_t.

P1227 Motivation

Let's go back to the beginning and analyze the sample code from the motivation section of P1227:

template <typename T>
bool has_repeated_values(const T& container) {
  for (int i = 0; i < container.size() - 1; ++i)
    if (container[i] == container[i + 1])
      return true;
  return false;
}

This is the sample code that tries to explain the need for the std::ssize() function. The code is wrong on so many levels. It uses the int type as an array index, which is signed and won't have a proper size on 64-bit targets. A decent compiler should at least warn about the comparison of signed and unsigned values, as sizes are normally unsigned. In addition, subtracting one from the size means that the result would wrap around if the container is empty, which is definitely not a corner case. I don't think that such an example should be used as a foundation in a feature proposal.

P1227 Solution

The proposal also talks about a possible solution by taking advantage of the proposed std::ssize() function, which would return ptrdiff_t, which is, simply put, a signed version of size_t. There is a problem though: if an array is so large (greater than PTRDIFF_MAX elements, but less than SIZE_MAX bytes) that the difference between two pointers may not be representable as ptrdiff_t, the result of subtracting two such pointers is undefined. This implies that using ssize() is also UNDEFINED when the size of a container is greater than PTRDIFF_MAX, which is 2^31 - 1 on 32-bit targets.

So finally here is the proposed solution to the problem:

template <typename T>
bool has_repeated_values(const T& container) {
  for (ptrdiff_t i = 0; i < std::ssize(container) - 1; ++i)
    if (container[i] == container[i + 1])
      return true;
  return false;
}

The code definitely fixes the empty-container problem, but it adds a possibility of undefined behavior as ptrdiff_t cannot represent all values of size_t.
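
If a signed size is really wanted, the cast would at least have to be checked. A sketch of what such a helper could look like (my illustration, not part of P1227):

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

template <typename T>
ptrdiff_t checked_ssize(const T& container) {
  size_t size = container.size();
  // Sizes above PTRDIFF_MAX are not representable as ptrdiff_t.
  assert(size <= size_t(PTRDIFF_MAX));
  return ptrdiff_t(size);
}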

Undefined Behavior Example

I have written a simple example in which I would like to demonstrate that replacing size_t with ptrdiff_t simply doesn't work:

#include <stdint.h>
#include <stdio.h>
#include <new>
#include <vector>

// A possible implementation of std::ssize(), simplified for our purpose.
static ptrdiff_t my_ssize(size_t size) {
  return ptrdiff_t(size);
}

// The proposed solution in P1227 that uses my_ssize() instead of std::ssize().
template <typename T>
bool has_repeated_values(const T& container) {
  for (ptrdiff_t i = 0; i < my_ssize(container.size()) - 1; ++i)
    if (container[i] == container[i + 1])
      return true;
  return false;
}

int main(int argc, char* argv[]) {
  try {
    // Let's create a vector that has more items than 2^31.
    std::vector<uint8_t> v;
    v.resize(2147484000, uint8_t(0));

    printf("Have a vector of size: %zu\n", v.size());
    printf("Have a vector of ssize: %zi\n", my_ssize(v.size()));

    printf("Has repeated values: %d\n", int(has_repeated_values(v)));
    return 0;
  }
  catch(std::bad_alloc& ex) {
    printf("Bad alloc thrown!\n");
    return 1;
  }
}

Compile the code:

g++ example.cpp -m32 -O2 -std=c++11 -o example

And execute it:

Have a vector of size: 2147484000
Have a vector of ssize: -2147483296
Has repeated values: 0

The output is obviously wrong; reporting the size of the container as ptrdiff_t simply doesn't work, as the size of the container [2147484000] is not representable in ptrdiff_t. The proposed solution doesn't work either, as the loop terminates immediately because the value returned by std::ssize() is negative [-2147483296].

Better Solution

If you want to compare adjacent pairs in an indexed container then the best approach is to start from the second index (1) and compare the previous value with the current one:

#include <stdint.h>
#include <stdio.h>
#include <new>
#include <vector>

// Simple solution.
template <typename T>
bool has_repeated_values(const T& container) {
  for (size_t i = 1; i < container.size(); i++)
    if (container[i - 1] == container[i])
      return true;
  return false;
}

int main(int argc, char* argv[]) {
  try {
    std::vector<uint8_t> v;
    v.resize(2147484000, uint8_t(0));

    printf("Have a vector of size: %zu\n", v.size());
    printf("Has repeated values: %d\n", int(has_repeated_values(v)));
    return 0;
  }
  catch(std::bad_alloc& ex) {
    printf("Bad alloc thrown!\n");
    return 1;
  }
}

The compiled code would work properly regardless of the container size:

Have a vector of size: 2147484000
Has repeated values: 1

Conclusion

I know that the motivation for std::ssize() is ranges support and not a has_repeated_values() function; however, I'm concerned about the trend here. If it's not possible to safely cast size_t to ptrdiff_t then ssize() is broken by design and any code using it is broken, including the P1227 solution.

04 September, 2018

C++ - Generating Lookup Tables at Compile-Time

There are many ways of creating lookup tables in C++. Let's assume that we have a set of keys which are compile-time constants and we want to associate some value with each of them. However, when accessing the value the key won't be a constant expression. Today's compilers are smart and can generate nice code that does a binary search or even builds a lookup table, but you should never assume what they will do, as each compiler has different heuristics and could take a different path to solve the problem.

What I will discuss today is creating arrays of values at compile-time that can be used however you want, as the format of the data is defined by you. These arrays can then be used as lookup tables or regular arrays if that's what you need. So let's describe the problem first:

#include <stdint.h>
#include <stdlib.h>

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

size_t objectSize(uint32_t tid) {
  return tid == TID_NULL    ?  0 :
         tid == TID_BOOLEAN ?  1 :
         tid == TID_INT8    ?  1 :
         tid == TID_INT16   ?  2 :
         tid == TID_INT32   ?  4 :
         tid == TID_INT64   ?  8 :
         tid == TID_UINT8   ?  1 :
         tid == TID_UINT16  ?  2 :
         tid == TID_UINT32  ?  4 :
         tid == TID_UINT64  ?  8 :
         tid == TID_FLOAT32 ?  4 :
         tid == TID_FLOAT64 ?  8 :
         tid == TID_STRING  ? 16 :
         tid == TID_ARRAY   ? 24 :
         tid == TID_MAP     ? 32 : SIZE_MAX;
}

[Compiler Explorer]

GCC and MSVC take a 'compare and return' approach while Clang uses a single lookup table in our case. No approach is wrong here as compilers are free to do anything that returns a valid result.

Approach #1 - Naive LUT

We can use a really naive approach to turn such code into a lookup table by creating an array and filling it:

#include <stdint.h>
#include <stdlib.h>

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

static const uint8_t ObjectSizeArray[] = {
  /* TID_NULL    */  0 ,
  /* TID_BOOLEAN */  1 ,
  /* TID_INT8    */  1 ,
  /* TID_INT16   */  2 ,
  /* TID_INT32   */  4 ,
  /* TID_INT64   */  8 ,
  /* TID_UINT8   */  1 ,
  /* TID_UINT16  */  2 ,
  /* TID_UINT32  */  4 ,
  /* TID_UINT64  */  8 ,
  /* TID_FLOAT32 */  4 ,
  /* TID_FLOAT64 */  8 ,
  /* TID_STRING  */ 16 ,
  /* TID_ARRAY   */ 24 ,
  /* TID_MAP     */ 32
};

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

There is not much to add as the code explicitly creates the lookup table and then uses it. This is one of the simplest approaches and you will see it in many projects to date. It's fine until you start messing with the keys. Imagine somebody reordered TID_UINT8 to follow TID_INT8, including the TID value. The table would immediately be broken, but the code would still compile just fine - it would only fail at runtime, and such bugs can be really time consuming to debug as everything looks just perfect.

Approach #2 - Key/Value as a Macro

The idea is to create a macro that will be used N times to populate the table:

#include <stdint.h>
#include <stdlib.h>

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

#define V(tid) (              \
  tid == TID_NULL    ?  0 :   \
  tid == TID_BOOLEAN ?  1 :   \
  tid == TID_INT8    ?  1 :   \
  tid == TID_INT16   ?  2 :   \
  tid == TID_INT32   ?  4 :   \
  tid == TID_INT64   ?  8 :   \
  tid == TID_UINT8   ?  1 :   \
  tid == TID_UINT16  ?  2 :   \
  tid == TID_UINT32  ?  4 :   \
  tid == TID_UINT64  ?  8 :   \
  tid == TID_FLOAT32 ?  4 :   \
  tid == TID_FLOAT64 ?  8 :   \
  tid == TID_STRING  ? 16 :   \
  tid == TID_ARRAY   ? 24 :   \
  tid == TID_MAP     ? 32 : 0 \
)

static const uint8_t ObjectSizeArray[] = {
  V(0),
  V(1),
  V(2),
  V(3),
  V(4),
  V(5),
  V(6),
  V(7),
  V(8),
  V(9),
  V(10),
  V(11),
  V(12),
  V(13),
  V(14)
};

#undef V

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

The result should compile to the same code as the previous example. We fixed one thing and designed a solution that would work for simple cases in which a key is changed. However, what we have done is not really a nice solution and it requires a lot of additional lines compared to the first one.

Approach #3 - Key/Value as a Template

#include <stdint.h>
#include <stdlib.h>

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

template<uint32_t tid>
struct KeyMapper {
  enum : uint8_t {
    value =
      tid == TID_NULL    ?  0 :
      tid == TID_BOOLEAN ?  1 :
      tid == TID_INT8    ?  1 :
      tid == TID_INT16   ?  2 :
      tid == TID_INT32   ?  4 :
      tid == TID_INT64   ?  8 :
      tid == TID_UINT8   ?  1 :
      tid == TID_UINT16  ?  2 :
      tid == TID_UINT32  ?  4 :
      tid == TID_UINT64  ?  8 :
      tid == TID_FLOAT32 ?  4 :
      tid == TID_FLOAT64 ?  8 :
      tid == TID_STRING  ? 16 :
      tid == TID_ARRAY   ? 24 :
      tid == TID_MAP     ? 32 : 0
  };
};

static const uint8_t ObjectSizeArray[] = {
  KeyMapper<0>::value,
  KeyMapper<1>::value,
  KeyMapper<2>::value,
  KeyMapper<3>::value,
  KeyMapper<4>::value,
  KeyMapper<5>::value,
  KeyMapper<6>::value,
  KeyMapper<7>::value,
  KeyMapper<8>::value,
  KeyMapper<9>::value,
  KeyMapper<10>::value,
  KeyMapper<11>::value,
  KeyMapper<12>::value,
  KeyMapper<13>::value,
  KeyMapper<14>::value
};

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

Okay, the nasty macro was replaced by a nasty template, but overall it's the same solution and it's still annoying to write and maintain.

Approach #4 - std::index_sequence and constexpr (C++14)

#include <stdint.h>
#include <stdlib.h>
#include <utility>

// ------------------------------------------------------------------
// Normally you would put these helpers somewhere else, not into the
// .cpp file, however, let's keep it simple.
// ------------------------------------------------------------------

template<typename T, size_t N>
struct LookupTable {
  T data[N];

  constexpr size_t size() const noexcept { return N; }
  constexpr const T& operator[](size_t i) const noexcept { return data[i]; }
};

template<typename T, size_t N, class Func, size_t... Indexes>
static constexpr LookupTable<T, N> MakeLookupTableImpl(Func f, std::index_sequence<Indexes...>) noexcept {
  return {{ T(f(Indexes))... }};
}

template<typename T, size_t N, class Func>
static constexpr LookupTable<T, N> MakeLookupTable(Func f) noexcept {
  return MakeLookupTableImpl<T, N>(f, std::make_index_sequence<N>{});
}

// The original code...
enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

// And the original function turned into a constexpr.
static constexpr uint8_t ObjectSizeFunc(uint32_t tid) {
  return tid == TID_NULL    ?  0 :
         tid == TID_BOOLEAN ?  1 :
         tid == TID_INT8    ?  1 :
         tid == TID_INT16   ?  2 :
         tid == TID_INT32   ?  4 :
         tid == TID_INT64   ?  8 :
         tid == TID_UINT8   ?  1 :
         tid == TID_UINT16  ?  2 :
         tid == TID_UINT32  ?  4 :
         tid == TID_UINT64  ?  8 :
         tid == TID_FLOAT32 ?  4 :
         tid == TID_FLOAT64 ?  8 :
         tid == TID_STRING  ? 16 :
         tid == TID_ARRAY   ? 24 :
         tid == TID_MAP     ? 32 : 0;
}

static constexpr const auto ObjectSizeArray =
  MakeLookupTable<uint8_t, TID_COUNT>(ObjectSizeFunc);

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

I think this is so far the cleanest approach and it can be maintained very well. Adding a new value still requires updating the ObjectSizeFunc function, but changing keys would do no harm.

Approach #5 - Minor update of #4

#include <stdint.h>
#include <stdlib.h>
#include <utility>

template<typename T, size_t N>
struct LookupTable {
  T data[N];

  constexpr size_t size() const noexcept { return N; }
  constexpr const T& operator[](size_t i) const noexcept { return data[i]; }
};

template<typename T, size_t N, class Generator, size_t... Indexes>
static constexpr LookupTable<T, N> MakeLookupTableImpl(std::index_sequence<Indexes...>) noexcept {
  constexpr LookupTable<T, N> table {{ T(Generator::value(Indexes))... }};
  return table;
}

template<typename T, size_t N, class Generator>
static constexpr LookupTable<T, N> MakeLookupTable() noexcept {
  return MakeLookupTableImpl<T, N, Generator>(std::make_index_sequence<N>{});
}

enum TypeId : uint32_t {
  TID_NULL    = 0,
  TID_BOOLEAN = 1,
  TID_INT8    = 2,
  TID_INT16   = 3,
  TID_INT32   = 4,
  TID_INT64   = 5,
  TID_UINT8   = 6,
  TID_UINT16  = 7,
  TID_UINT32  = 8,
  TID_UINT64  = 9,
  TID_FLOAT32 = 10,
  TID_FLOAT64 = 11,
  TID_STRING  = 12,
  TID_ARRAY   = 13,
  TID_MAP     = 14,

  TID_COUNT   = 15
};

struct ObjectSizeGenerator {
  static constexpr uint8_t value(size_t tid) noexcept {
    return tid == TID_NULL    ?  0 :
           tid == TID_BOOLEAN ?  1 :
           tid == TID_INT8    ?  1 :
           tid == TID_INT16   ?  2 :
           tid == TID_INT32   ?  4 :
           tid == TID_INT64   ?  8 :
           tid == TID_UINT8   ?  1 :
           tid == TID_UINT16  ?  2 :
           tid == TID_UINT32  ?  4 :
           tid == TID_UINT64  ?  8 :
           tid == TID_FLOAT32 ?  4 :
           tid == TID_FLOAT64 ?  8 :
           tid == TID_STRING  ? 16 :
           tid == TID_ARRAY   ? 24 :
           tid == TID_MAP     ? 32 : 0;
  }
};

static constexpr const auto ObjectSizeArray =
  MakeLookupTable<uint8_t, TID_COUNT, ObjectSizeGenerator>();

size_t objectSize(uint32_t tid) {
  if (tid >= TID_COUNT)
    return SIZE_MAX;
  else
    return ObjectSizeArray[tid];
}

[Compiler Explorer]

This is an updated approach in which I have moved the generator into a helper struct/class, as I noticed that in some cases MSVC would emit runtime initializers of the lookup table if it is part of a larger structure.

Conclusion

I just wanted to demonstrate different ways of creating lookup tables at compile-time. If you explore enough projects you will probably find all the approaches mentioned here with some minor modifications (like more macros, more clever macros, different templates, etc.). There may well be an even better solution than the last one, but I'm happily switching to it as it provides great flexibility. The function that returns the value can actually be pretty complex and the constexpr [ObjectSizeArray] would still guarantee that the compiler calculates everything at compile-time or fails. I have some tables in Blend2D that are used during glyph decoding (of both TrueType and OpenType fonts) and they work really well and are really easy to maintain. I plan to use this approach in AsmJit as well, but Blend2D still has a higher priority at the moment.

I would also note that the last approach provides great flexibility for defining tables that may be a bit sparse, like a table that has 256 values [mapping a byte key to a value] where only, let's say, 10 of them are non-zero [or keys of interest].
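
For example, here is a sketch of such a sparse table that reuses the MakeLookupTable helper from Approach #5 (the keys and values are made up just for illustration):

struct DigitValueGenerator {
  static constexpr uint8_t value(size_t key) noexcept {
    // Only the byte keys '0'..'9' are of interest, everything else stays zero.
    return (key >= '0' && key <= '9') ? uint8_t(key - '0' + 1) : uint8_t(0);
  }
};

static constexpr const auto DigitValueTable =
  MakeLookupTable<uint8_t, 256, DigitValueGenerator>();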

Use comments if you have anything to add or if you think it should be done differently!

25 August, 2018

C++ Autovectorization - Part 1

This is the first part of my autovectorization experiments. As part of my refactoring work I did some cleanup in Blend2D's internal algebra module and I finally tried to see how compilers optimize some of its features. I would like to highlight that the performance of this code is very important, as essentially all vectors passed to Blend2D's rendering context are processed before the rasterizer can do its work. The processing depends on what you want to do with them - if it's just filling then the input is transformed, clipped, and passed to the edge builder. If you intend to stroke a 2D path then the processing is much heavier and a lot of approximations come into the scene.

Today's post contains more code than usual; however, the code solves real problems, so people can seriously consider which compiler optimizes their code the most. Spoiler alert - Clang is superior!

Preparation

Instead of using one huge code section I decided to provide the utility code first and then the functions that I am going to compare. Some code is directly from Blend2D and some was added just for simplicity, to make the code compile:

#include <algorithm>
#include <cmath>
#include <stdint.h>

// Point structure [x, y]
struct BLPoint {
  double x;
  double y;

  inline BLPoint() noexcept = default;
  constexpr BLPoint(const BLPoint&) noexcept = default;

  constexpr BLPoint(double x, double y) noexcept
    : x(x), y(y) {}

  inline void reset() noexcept { reset(0, 0); }
  inline void reset(double x, double y) noexcept {
    this->x = x;
    this->y = y;
  }
};

// Box structure [x0, y0, x1, y1]
struct BLBox {
  double x0;
  double y0;
  double x1;
  double y1;

  inline BLBox() noexcept = default;
  constexpr BLBox(const BLBox&) noexcept = default;

  constexpr BLBox(double x0, double y0, double x1, double y1) noexcept
    : x0(x0), y0(y0), x1(x1), y1(y1) {}

  inline void reset() noexcept { reset(0.0, 0.0, 0.0, 0.0); }
  inline void reset(double x0, double y0, double x1, double y1) noexcept {
    this->x0 = x0;
    this->y0 = y0;
    this->x1 = x1;
    this->y1 = y1;
  }
};

// Overloads to make vector processing simpler.
static constexpr BLPoint operator-(const BLPoint& a) noexcept { return BLPoint(-a.x, -a.y); }

static constexpr BLPoint operator+(const BLPoint& a, double b) noexcept { return BLPoint(a.x + b, a.y + b); }
static constexpr BLPoint operator-(const BLPoint& a, double b) noexcept { return BLPoint(a.x - b, a.y - b); }
static constexpr BLPoint operator*(const BLPoint& a, double b) noexcept { return BLPoint(a.x * b, a.y * b); }
static constexpr BLPoint operator/(const BLPoint& a, double b) noexcept { return BLPoint(a.x / b, a.y / b); }

static constexpr BLPoint operator+(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x + b.x, a.y + b.y); }
static constexpr BLPoint operator-(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x - b.x, a.y - b.y); }
static constexpr BLPoint operator*(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x * b.x, a.y * b.y); }
static constexpr BLPoint operator/(const BLPoint& a, const BLPoint& b) noexcept { return BLPoint(a.x / b.x, a.y / b.y); }

static constexpr BLPoint operator+(double a, const BLPoint& b) noexcept { return BLPoint(a + b.x, a + b.y); }
static constexpr BLPoint operator-(double a, const BLPoint& b) noexcept { return BLPoint(a - b.x, a - b.y); }
static constexpr BLPoint operator*(double a, const BLPoint& b) noexcept { return BLPoint(a * b.x, a * b.y); }
static constexpr BLPoint operator/(double a, const BLPoint& b) noexcept { return BLPoint(a / b.x, a / b.y); }

static inline BLPoint& operator+=(BLPoint& a, double b) noexcept { a.x += b; a.y += b; return a; }
static inline BLPoint& operator-=(BLPoint& a, double b) noexcept { a.x -= b; a.y -= b; return a; }
static inline BLPoint& operator*=(BLPoint& a, double b) noexcept { a.x *= b; a.y *= b; return a; }
static inline BLPoint& operator/=(BLPoint& a, double b) noexcept { a.x /= b; a.y /= b; return a; }

static inline BLPoint& operator+=(BLPoint& a, const BLPoint& b) noexcept { a.x += b.x; a.y += b.y; return a; }
static inline BLPoint& operator-=(BLPoint& a, const BLPoint& b) noexcept { a.x -= b.x; a.y -= b.y; return a; }
static inline BLPoint& operator*=(BLPoint& a, const BLPoint& b) noexcept { a.x *= b.x; a.y *= b.y; return a; }
static inline BLPoint& operator/=(BLPoint& a, const BLPoint& b) noexcept { a.x /= b.x; a.y /= b.y; return a; }

// Min/Max - different semantics compared to std.
template<typename T> constexpr T blMin(const T& a, const T& b) noexcept { return b < a ? b : a; }
template<typename T> constexpr T blMax(const T& a, const T& b) noexcept { return a < b ? b : a; }

// Linear interpolation, works with points as well.
template<typename V, typename T = double>
inline V blLerp(const V& a, const V& b, const T& t) noexcept {
  return (a * (1.0 - t)) + (b * t);
}

// Merge a point into a box by possibly increasing its size.
inline void blBoxMergePoint(BLBox& box, const BLPoint& p) noexcept {
  box.x0 = blMin(box.x0, p.x);
  box.y0 = blMin(box.y0, p.y);
  box.x1 = blMax(box.x1, p.x);
  box.y1 = blMax(box.y1, p.y);
}

// Quadratic Bezier Coefficients and Derivative.
inline void blQuadExtractCoefficients(const BLPoint bez[3], BLPoint& a, BLPoint& b, BLPoint& c) noexcept {
  a = bez[0] - bez[1] * 2.0 + bez[2];
  b = (bez[1] - bez[0]) * 2.0;
  c = bez[0];
}

inline void blQuadExtractDerivative(const BLPoint bez[3], BLPoint& a, BLPoint& b) noexcept {
  a = (bez[2] - bez[1] * 2.0 + bez[0]) * 2.0;
  b = (bez[1] - bez[0]) * 2.0;
}

// Cubic Bezier Coefficients and Derivative.
inline void blCubicExtractCoefficients(const BLPoint bez[4], BLPoint& a, BLPoint& b, BLPoint& c, BLPoint& d) noexcept {
  d = bez[0];
  a = (bez[1] - bez[2]) * 3.0 + bez[3] - d;
  b = (bez[0] - bez[1]  * 2.0 + bez[2]) * 3.0;
  c = (bez[1] - bez[0]) * 3.0;
}

inline void blCubicExtractDerivative(const BLPoint bez[4], BLPoint& a, BLPoint& b, BLPoint& c) noexcept {
  a = ((bez[1] - bez[2]) * 3.0 + bez[3] - bez[0]) * 3.0;
  b = (bez[2] - bez[1] * 2.0 + bez[0]) * 6.0;
  c = (bez[1] - bez[0]) * 3.0;
}

I don't know whether to comment on it - it's pretty simple code that provides Point, Box, and some utility functions we rely on.

Quadratic Bezier Evaluation

These functions return a point on a quadratic Bezier at the given t:

// Quadratic Bezier Evaluation.
BLPoint blQuadEvaluateAtT(const BLPoint bez[3], double t) noexcept {
  BLPoint a, b, c;
  blQuadExtractCoefficients(bez, a, b, c);
  return (a * t + b) * t + c;
}

BLPoint blQuadPreciseEvaluateAtT(const BLPoint bez[3], double t) noexcept {
  return blLerp(blLerp(bez[0], bez[1], t),
                blLerp(bez[1], bez[2], t), t);
}

GCC [-std=c++17 -O3 -fno-math-errno -mavx2]

blQuadEvaluateAtT(BLPoint const*, double):
  vmovupd  xmm2, XMMWORD PTR [rdi+16]
  vmovupd  xmm4, XMMWORD PTR [rdi]
  vmovddup xmm3, xmm0
  vaddpd   xmm1, xmm2, xmm2
  vsubpd   xmm2, xmm2, xmm4
  vsubpd   xmm1, xmm4, xmm1
  vaddpd   xmm0, xmm1, XMMWORD PTR [rdi+32]
  vaddpd   xmm2, xmm2, xmm2
  vmulpd   xmm1, xmm0, xmm3
  vaddpd   xmm0, xmm1, xmm2
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm5, xmm0, xmm4
  vmovaps  XMMWORD PTR [rsp-24], xmm5
  vmovsd   xmm1, QWORD PTR [rsp-16]
  vmovsd   xmm0, QWORD PTR [rsp-24]
  ret

blQuadPreciseEvaluateAtT(BLPoint const*, double):
  vmovapd  xmm3, XMMWORD PTR .LC0[rip]
  vmovddup xmm0, xmm0
  vmovupd  xmm4, XMMWORD PTR [rdi+16]
  vmulpd   xmm2, xmm0, XMMWORD PTR [rdi+32]
  vsubpd   xmm3, xmm3, xmm0
  vmulpd   xmm1, xmm3, xmm4
  vaddpd   xmm2, xmm2, xmm1
  vmulpd   xmm1, xmm3, XMMWORD PTR [rdi]
  vmulpd   xmm2, xmm2, xmm0
  vmulpd   xmm0, xmm4, xmm0
  vaddpd   xmm0, xmm1, xmm0
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm5, xmm2, xmm0
  vmovaps  XMMWORD PTR [rsp-24], xmm5
  vmovsd   xmm1, QWORD PTR [rsp-16]
  vmovsd   xmm0, QWORD PTR [rsp-24]
  ret

Clang [-std=c++17 -O3 -fno-math-errno -mavx2]

blQuadEvaluateAtT(BLPoint const*, double):
  vmovupd  xmm1, xmmword ptr [rdi]
  vmovupd  xmm2, xmmword ptr [rdi + 16]
  vaddpd   xmm3, xmm2, xmm2
  vsubpd   xmm3, xmm1, xmm3
  vaddpd   xmm3, xmm3, xmmword ptr [rdi + 32]
  vsubpd   xmm2, xmm2, xmm1
  vaddpd   xmm2, xmm2, xmm2
  vmovddup xmm0, xmm0      # xmm0 = xmm0[0,0]
  vmulpd   xmm3, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm3
  vmulpd   xmm0, xmm0, xmm2
  vaddpd   xmm0, xmm1, xmm0
  vpermilpd xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
  ret

.LCPI1_0:
  .quad    4607182418800017408     # double 1

blQuadPreciseEvaluateAtT(BLPoint const*, double):
  vmovsd   xmm1, qword ptr [rip + .LCPI1_0] # xmm1 = mem[0],zero
  vsubsd   xmm1, xmm1, xmm0
  vmovddup xmm1, xmm1      # xmm1 = xmm1[0,0]
  vmulpd   xmm2, xmm1, xmmword ptr [rdi]
  vmovupd  xmm3, xmmword ptr [rdi + 16]
  vmovddup xmm0, xmm0      # xmm0 = xmm0[0,0]
  vmulpd   xmm4, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm4
  vmulpd   xmm3, xmm1, xmm3
  vmulpd   xmm4, xmm0, xmmword ptr [rdi + 32]
  vaddpd   xmm3, xmm3, xmm4
  vmulpd   xmm1, xmm1, xmm2
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm0, xmm1, xmm0
  vpermilpd xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
  ret

MSVC [/O2 /Qpar /arch:AVX2]

; NOT VECTORIZED!
__$ReturnUdt$ = 80
bez$ = 88
t$ = 96
BLPoint blQuadEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN42:
  sub     rsp, 72                             ; 00000048H
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vmovsd  xmm5, QWORD PTR [rdx]
  vmovaps XMMWORD PTR [rsp+48], xmm6
  mov     rax, rcx
  vmovsd  xmm6, QWORD PTR [rdx+8]
  vmovaps XMMWORD PTR [rsp+32], xmm7
  vaddsd  xmm3, xmm0, xmm0
  vsubsd  xmm0, xmm5, xmm3
  vmovaps XMMWORD PTR [rsp+16], xmm8
  vaddsd  xmm8, xmm0, QWORD PTR [rdx+32]
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vsubsd  xmm3, xmm0, xmm5
  vmovaps XMMWORD PTR [rsp], xmm9
  vmovaps xmm9, xmm2
  vaddsd  xmm4, xmm1, xmm1
  vsubsd  xmm1, xmm6, xmm4
  vaddsd  xmm7, xmm1, QWORD PTR [rdx+40]
  vmulsd  xmm4, xmm3, QWORD PTR __real@4000000000000000
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vsubsd  xmm2, xmm1, xmm6
  vmulsd  xmm5, xmm2, QWORD PTR __real@4000000000000000
  vmovups xmm6, XMMWORD PTR [rdx]
  vmulsd  xmm2, xmm7, xmm9
  vmovaps xmm7, XMMWORD PTR [rsp+32]
  vmulsd  xmm1, xmm8, xmm9
  vmovaps xmm8, XMMWORD PTR [rsp+16]
  vaddsd  xmm0, xmm4, xmm1
  vmulsd  xmm4, xmm0, xmm9
  vaddsd  xmm0, xmm4, xmm6
  vaddsd  xmm3, xmm2, xmm5
  vmulsd  xmm5, xmm3, xmm9
  vmovaps xmm9, XMMWORD PTR [rsp]
  vunpckhpd xmm2, xmm6, xmm6
  vmovaps xmm6, XMMWORD PTR [rsp+48]
  vmovsd  QWORD PTR [rcx], xmm0
  vaddsd  xmm0, xmm5, xmm2
  vmovsd  QWORD PTR [rcx+8], xmm0
  add     rsp, 72                             ; 00000048H
  ret     0
BLPoint blQuadEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP

; NOT VECTORIZED!
__$ReturnUdt$ = 64
bez$ = 72
t$ = 80
BLPoint blQuadPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN46:
  sub     rsp, 56                             ; 00000038H
  vmulsd  xmm4, xmm2, QWORD PTR [rdx+32]
  vmulsd  xmm3, xmm2, QWORD PTR [rdx+40]
  vmovsd  xmm0, QWORD PTR __real@3ff0000000000000
  vmovaps XMMWORD PTR [rsp+32], xmm6
  mov     rax, rcx
  vmovaps XMMWORD PTR [rsp+16], xmm7
  vsubsd  xmm7, xmm0, xmm2
  vmulsd  xmm1, xmm7, QWORD PTR [rdx+16]
  vmulsd  xmm0, xmm7, QWORD PTR [rdx+24]
  vaddsd  xmm6, xmm3, xmm0
  vmulsd  xmm0, xmm7, QWORD PTR [rdx+8]
  vaddsd  xmm5, xmm4, xmm1
  vmulsd  xmm1, xmm7, QWORD PTR [rdx]
  vmovaps XMMWORD PTR [rsp], xmm8
  vmovaps xmm8, xmm2
  vmulsd  xmm3, xmm8, QWORD PTR [rdx+24]
  vmulsd  xmm2, xmm2, QWORD PTR [rdx+16]
  vaddsd  xmm2, xmm2, xmm1
  vaddsd  xmm3, xmm0, xmm3
  vmulsd  xmm4, xmm5, xmm8
  vmulsd  xmm5, xmm6, xmm8
  vmovaps xmm6, XMMWORD PTR [rsp+32]
  vmovaps xmm8, XMMWORD PTR [rsp]
  vmulsd  xmm0, xmm2, xmm7
  vmulsd  xmm1, xmm3, xmm7
  vmovaps xmm7, XMMWORD PTR [rsp+16]
  vaddsd  xmm0, xmm0, xmm4
  vaddsd  xmm1, xmm1, xmm5
  vmovsd  QWORD PTR [rcx], xmm0
  vmovsd  QWORD PTR [rcx+8], xmm1
  add     rsp, 56                             ; 00000038H
  ret     0
BLPoint blQuadPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP

Interpreting the Results

GCC and Clang vectorized the code okay, while MSVC didn't use SIMD at all.

Cubic Bezier Evaluation

These functions return a point on a cubic Bezier at the given t:

// Cubic Bezier Evaluation.
BLPoint blCubicEvaluateAtT(const BLPoint bez[4], double t) noexcept {
  BLPoint a, b, c, d;
  blCubicExtractCoefficients(bez, a, b, c, d);
  return ((a * t + b) * t + c) * t + d;
}

BLPoint blCubicPreciseEvaluateAtT(const BLPoint bez[4], double t) noexcept {
  BLPoint p01(blLerp(bez[0], bez[1], t));
  BLPoint p12(blLerp(bez[1], bez[2], t));
  BLPoint p23(blLerp(bez[2], bez[3], t));
  return blLerp(blLerp(p01, p12, t), blLerp(p12, p23, t), t);
}

GCC [-std=c++17 -O3 -fno-math-errno -mavx2]

blCubicEvaluateAtT(BLPoint const*, double):
  vmovupd  xmm3, XMMWORD PTR [rdi+16]
  vmovddup xmm4, xmm0
  vmovupd  xmm1, XMMWORD PTR [rdi]
  vmovupd  xmm6, XMMWORD PTR [rdi+32]
  vmovapd  xmm5, XMMWORD PTR .LC1[rip]
  vsubpd   xmm2, xmm3, xmm1
  vsubpd   xmm0, xmm3, xmm6
  vaddpd   xmm3, xmm3, xmm3
  vmulpd   xmm2, xmm2, xmm5
  vmulpd   xmm0, xmm0, xmm5
  vaddpd   xmm0, xmm0, XMMWORD PTR [rdi+48]
  vsubpd   xmm0, xmm0, xmm1
  vsubpd   xmm1, xmm1, xmm3
  vaddpd   xmm1, xmm1, xmm6
  vmulpd   xmm0, xmm0, xmm4
  vmulpd   xmm1, xmm1, xmm5
  vaddpd   xmm0, xmm0, xmm1
  vmulpd   xmm0, xmm0, xmm4
  vaddpd   xmm0, xmm0, xmm2
  vmulpd   xmm0, xmm0, xmm4
  vaddpd   xmm7, xmm0, xmm2
  vmovaps  XMMWORD PTR [rsp-24], xmm7
  vmovsd   xmm1, QWORD PTR [rsp-16]
  vmovsd   xmm0, QWORD PTR [rsp-24]
  ret

blCubicPreciseEvaluateAtT(BLPoint const*, double):
  vmovddup xmm2, xmm0
  vmovapd  xmm0, XMMWORD PTR .LC0[rip]
  vmovupd  xmm5, XMMWORD PTR [rdi+16]
  vmulpd   xmm4, xmm2, XMMWORD PTR [rdi+32]
  vsubpd   xmm3, xmm0, xmm2
  vmulpd   xmm1, xmm2, XMMWORD PTR [rdi+48]
  vmulpd   xmm0, xmm3, xmm5
  vmulpd   xmm5, xmm5, xmm2
  vaddpd   xmm4, xmm4, xmm0
  vmulpd   xmm0, xmm3, XMMWORD PTR [rdi+32]
  vaddpd   xmm1, xmm1, xmm0
  vmulpd   xmm0, xmm3, xmm4
  vmulpd   xmm1, xmm1, xmm2
  vaddpd   xmm1, xmm1, xmm0
  vmulpd   xmm0, xmm3, XMMWORD PTR [rdi]
  vmulpd   xmm1, xmm1, xmm2
  vmulpd   xmm2, xmm4, xmm2
  vaddpd   xmm0, xmm0, xmm5
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm0, xmm0, xmm2
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm6, xmm1, xmm0
  vmovaps  XMMWORD PTR [rsp-24], xmm6
  vmovsd   xmm1, QWORD PTR [rsp-16]
  vmovsd   xmm0, QWORD PTR [rsp-24]
  ret

Clang [-std=c++17 -O3 -fno-math-errno -mavx2]

.LCPI2_0:
  .quad    4613937818241073152     # double 3
  .quad    4613937818241073152     # double 3

blCubicEvaluateAtT(BLPoint const*, double):
  vmovupd  xmm1, xmmword ptr [rdi + 16]
  vmovupd  xmm2, xmmword ptr [rdi + 32]
  vsubpd   xmm3, xmm1, xmm2
  vmovapd  xmm4, xmmword ptr [rip + .LCPI2_0] # xmm4 = [3.000000e+00,3.000000e+00]
  vmulpd   xmm3, xmm3, xmm4
  vaddpd   xmm3, xmm3, xmmword ptr [rdi + 48]
  vmovupd  xmm5, xmmword ptr [rdi]
  vsubpd   xmm3, xmm3, xmm5
  vaddpd   xmm6, xmm1, xmm1
  vsubpd   xmm6, xmm5, xmm6
  vaddpd   xmm2, xmm6, xmm2
  vmulpd   xmm2, xmm2, xmm4
  vsubpd   xmm1, xmm1, xmm5
  vmulpd   xmm1, xmm1, xmm4
  vmovddup xmm0, xmm0      # xmm0 = xmm0[0,0]
  vmulpd   xmm3, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm3
  vmulpd   xmm2, xmm0, xmm2
  vaddpd   xmm2, xmm1, xmm2
  vmulpd   xmm0, xmm0, xmm2
  vaddpd   xmm0, xmm1, xmm0
  vpermilpd xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
  ret

.LCPI3_0:
  .quad    4607182418800017408     # double 1

blCubicPreciseEvaluateAtT(BLPoint const*, double):
  vmovsd   xmm1, qword ptr [rip + .LCPI3_0] # xmm1 = mem[0],zero
  vsubsd   xmm1, xmm1, xmm0
  vmovddup xmm1, xmm1      # xmm1 = xmm1[0,0]
  vmulpd   xmm2, xmm1, xmmword ptr [rdi]
  vmovddup xmm0, xmm0      # xmm0 = xmm0[0,0]
  vmovupd  xmm3, xmmword ptr [rdi + 16]
  vmovupd  xmm4, xmmword ptr [rdi + 32]
  vmulpd   xmm5, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm5
  vmulpd   xmm3, xmm1, xmm3
  vmulpd   xmm5, xmm0, xmm4
  vaddpd   xmm3, xmm3, xmm5
  vmulpd   xmm5, xmm0, xmmword ptr [rdi + 48]
  vmulpd   xmm4, xmm1, xmm4
  vaddpd   xmm4, xmm4, xmm5
  vmulpd   xmm2, xmm1, xmm2
  vmulpd   xmm5, xmm0, xmm3
  vaddpd   xmm2, xmm2, xmm5
  vmulpd   xmm3, xmm1, xmm3
  vmulpd   xmm4, xmm0, xmm4
  vaddpd   xmm3, xmm3, xmm4
  vmulpd   xmm1, xmm1, xmm2
  vmulpd   xmm0, xmm0, xmm3
  vaddpd   xmm0, xmm1, xmm0
  vpermilpd xmm1, xmm0, 1   # xmm1 = xmm0[1,0]
  ret

MSVC [/O2 /Qpar /arch:AVX2]

; NOT VECTORIZED!
__$ReturnUdt$ = 96
bez$ = 104
t$ = 112
BLPoint blCubicEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN70:
  sub     rsp, 88                             ; 00000058H
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vsubsd  xmm3, xmm0, QWORD PTR [rdx+32]
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vmovups xmm5, XMMWORD PTR [rdx]
  vmovaps XMMWORD PTR [rsp+64], xmm6
  mov     rax, rcx
  vmovaps XMMWORD PTR [rsp+48], xmm7
  vmovsd  xmm7, QWORD PTR __real@4008000000000000
  vmovaps XMMWORD PTR [rsp+32], xmm8
  vmulsd  xmm0, xmm3, xmm7
  vmovaps XMMWORD PTR [rsp+16], xmm9
  vmovaps XMMWORD PTR [rsp], xmm10
  vmovaps xmm10, xmm2
  vsubsd  xmm2, xmm1, QWORD PTR [rdx+40]
  vaddsd  xmm1, xmm0, QWORD PTR [rdx+48]
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vsubsd  xmm9, xmm1, xmm5
  vmulsd  xmm4, xmm2, xmm7
  vaddsd  xmm2, xmm4, QWORD PTR [rdx+56]
  vunpckhpd xmm1, xmm5, xmm5
  vmovsd  xmm5, QWORD PTR [rdx]
  vsubsd  xmm8, xmm2, xmm1
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vaddsd  xmm2, xmm0, xmm0
  vsubsd  xmm0, xmm5, xmm2
  vmovsd  xmm2, QWORD PTR [rdx+8]
  vaddsd  xmm3, xmm1, xmm1
  vaddsd  xmm1, xmm0, QWORD PTR [rdx+32]
  vsubsd  xmm4, xmm2, xmm3
  vaddsd  xmm0, xmm4, QWORD PTR [rdx+40]
  vmulsd  xmm6, xmm0, xmm7
  vmovsd  xmm0, QWORD PTR [rdx+16]
  vmulsd  xmm4, xmm1, xmm7
  vmovsd  xmm1, QWORD PTR [rdx+24]
  vsubsd  xmm2, xmm1, xmm2
  vsubsd  xmm3, xmm0, xmm5
  vmulsd  xmm5, xmm3, xmm7
  vmulsd  xmm0, xmm9, xmm10
  vmovaps xmm9, XMMWORD PTR [rsp+16]
  vmulsd  xmm1, xmm8, xmm10
  vmovaps xmm8, XMMWORD PTR [rsp+32]
  vaddsd  xmm3, xmm1, xmm6
  vmovaps xmm6, XMMWORD PTR [rsp+64]
  vmulsd  xmm7, xmm2, xmm7
  vaddsd  xmm2, xmm0, xmm4
  vmulsd  xmm4, xmm2, xmm10
  vmulsd  xmm0, xmm3, xmm10
  vaddsd  xmm1, xmm4, xmm5
  vaddsd  xmm2, xmm0, xmm7
  vmulsd  xmm3, xmm1, xmm10
  vmulsd  xmm4, xmm2, xmm10
  vmovaps xmm10, XMMWORD PTR [rsp]
  vaddsd  xmm1, xmm4, xmm7
  vmovaps xmm7, XMMWORD PTR [rsp+48]
  vaddsd  xmm0, xmm3, xmm5
  vmovsd  QWORD PTR [rcx], xmm0
  vmovsd  QWORD PTR [rcx+8], xmm1
  add     rsp, 88                             ; 00000058H
  ret     0
BLPoint blCubicEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP

; NOT VECTORIZED!
__$ReturnUdt$ = 128
bez$ = 136
t$ = 144
BLPoint blCubicPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) PROC
$LN88:
  mov     r11, rsp
  sub     rsp, 120                      ; 00000078H
  vmulsd  xmm4, xmm2, QWORD PTR [rdx+16]
  vmulsd  xmm3, xmm2, QWORD PTR [rdx+24]
  vmovsd  xmm0, QWORD PTR __real@3ff0000000000000
  vmovaps XMMWORD PTR [rsp+96], xmm6
  mov     rax, rcx
  vmovaps XMMWORD PTR [rsp+80], xmm7
  vmovaps XMMWORD PTR [r11-56], xmm8
  vmovaps XMMWORD PTR [r11-72], xmm9
  vmovaps XMMWORD PTR [r11-88], xmm10
  vmovaps XMMWORD PTR [r11-104], xmm11
  vsubsd  xmm11, xmm0, xmm2
  vmulsd  xmm1, xmm11, QWORD PTR [rdx]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+8]
  vaddsd  xmm10, xmm3, xmm0
  vmulsd  xmm3, xmm2, QWORD PTR [rdx+32]
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+24]
  vaddsd  xmm9, xmm4, xmm1
  vmulsd  xmm1, xmm11, QWORD PTR [rdx+16]
  vaddsd  xmm5, xmm3, xmm1
  vmulsd  xmm1, xmm11, QWORD PTR [rdx+32]
  vmovaps XMMWORD PTR [rsp], xmm12
  vmovaps xmm12, xmm2
  vmulsd  xmm2, xmm2, QWORD PTR [rdx+40]
  vmulsd  xmm3, xmm12, QWORD PTR [rdx+48]
  vaddsd  xmm8, xmm2, xmm0
  vmulsd  xmm0, xmm11, QWORD PTR [rdx+40]
  vmulsd  xmm2, xmm12, QWORD PTR [rdx+56]
  vaddsd  xmm1, xmm1, xmm3
  vaddsd  xmm2, xmm2, xmm0
  vmulsd  xmm3, xmm1, xmm12
  vmulsd  xmm4, xmm2, xmm12
  vmulsd  xmm1, xmm8, xmm11
  vaddsd  xmm7, xmm1, xmm4
  vmulsd  xmm0, xmm5, xmm11
  vaddsd  xmm6, xmm0, xmm3
  vmulsd  xmm1, xmm10, xmm11
  vmovaps xmm10, XMMWORD PTR [r11-88]
  vmulsd  xmm2, xmm5, xmm12
  vmulsd  xmm3, xmm8, xmm12
  vmovaps xmm8, XMMWORD PTR [r11-56]
  vmulsd  xmm0, xmm9, xmm11
  vmovaps xmm9, XMMWORD PTR [r11-72]
  vaddsd  xmm5, xmm1, xmm3
  vaddsd  xmm4, xmm0, xmm2
  vmulsd  xmm2, xmm6, xmm12
  vmovaps xmm6, XMMWORD PTR [rsp+96]
  vmulsd  xmm3, xmm7, xmm12
  vmovaps xmm7, XMMWORD PTR [rsp+80]
  vmovaps xmm12, XMMWORD PTR [rsp]
  vmulsd  xmm1, xmm5, xmm11
  vmulsd  xmm0, xmm4, xmm11
  vmovaps xmm11, XMMWORD PTR [r11-104]
  vaddsd  xmm1, xmm1, xmm3
  vaddsd  xmm0, xmm0, xmm2
  vmovsd  QWORD PTR [rcx+8], xmm1
  vmovsd  QWORD PTR [rcx], xmm0
  add     rsp, 120                      ; 00000078H
  ret     0
BLPoint blCubicPreciseEvaluateAtT(BLPoint const * __ptr64 const,double) ENDP

Interpreting the Results

GCC and Clang did a reasonable job here; MSVC once again didn't use SIMD and generated a lot of scalar code.

Quadratic Bezier Bounding Box

Calculating the bounding box of a quadratic Bezier curve is quite common. I wrote two solutions - the first version (V1) uses scalar code that merges the possible extrema only when they are found. The second version (V2) doesn't bother and always calculates both extrema even if they lie outside of the domain (they are clamped in that case). The second version is SIMD friendly, and that was the intention:

// Scalar version - not good for SIMD.
inline void blQuadCoefficientsAtT(double t, double& a, double& b, double& c) noexcept {
  double tInv = 1.0 - t;
  a = tInv * tInv;
  b = 2.0 * (t * tInv);
  c = t * t;
}

void blQuadBoundingBoxV1(const BLPoint bez[3], BLBox& bBox) noexcept {
  // Get bounding box of start and end points.
  bBox.reset(blMin(bez[0].x, bez[2].x), blMin(bez[0].y, bez[2].y),
             blMax(bez[0].x, bez[2].x), blMax(bez[0].y, bez[2].y));

  // Merge X extrema.
  if (bez[1].x < bBox.x0 || bez[1].x > bBox.x1) {
    double t = (bez[0].x - bez[1].x) / (bez[0].x - 2.0 * bez[1].x + bez[2].x);
    double a, b, c;

    blQuadCoefficientsAtT(t, a, b, c);
    double coord = a * bez[0].x + b * bez[1].x + c * bez[2].x;
    bBox.x0 = blMin(bBox.x0, coord);
    bBox.x1 = blMax(bBox.x1, coord);
  }

  // Merge Y extrema.
  if (bez[1].y < bBox.y0 || bez[1].y > bBox.y1) {
    double t = (bez[0].y - bez[1].y) / (bez[0].y - 2.0 * bez[1].y + bez[2].y);
    double a, b, c;

    blQuadCoefficientsAtT(t, a, b, c);
    double coord = a * bez[0].y + b * bez[1].y + c * bez[2].y;
    bBox.y0 = blMin(bBox.y0, coord);
    bBox.y1 = blMax(bBox.y1, coord);
  }
}

// SIMD friendly (branchless).
void blQuadBoundingBoxV2(const BLPoint bez[3], BLBox& bBox) noexcept {
  // Bounding box of start and end points.
  bBox.reset(blMin(bez[0].x, bez[2].x), blMin(bez[0].y, bez[2].y),
             blMax(bez[0].x, bez[2].x), blMax(bez[0].y, bez[2].y));

  BLPoint t = (bez[0] - bez[1]) / (bez[0] - bez[1] * 2.0 + bez[2]);
  t.x = blMax(t.x, 0.0);
  t.y = blMax(t.y, 0.0);
  t.x = blMin(t.x, 1.0);
  t.y = blMin(t.y, 1.0);

  blBoxMergePoint(bBox, blLerp(blLerp(bez[0], bez[1], t),
                               blLerp(bez[1], bez[2], t), t));
}
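
For completeness, blLerp here interpolates each component with its own t and blBoxMergePoint grows the box by a point. The following is only a minimal sketch of what such helpers might look like, assuming BLPoint and BLBox are plain structs with public double members; the actual helpers used by Blend2D may differ:

inline BLPoint blLerp(const BLPoint& a, const BLPoint& b, const BLPoint& t) noexcept {
  // Component-wise a * (1 - t) + b * t, with a separate t for X and Y.
  return BLPoint(a.x * (1.0 - t.x) + b.x * t.x,
                 a.y * (1.0 - t.y) + b.y * t.y);
}

inline void blBoxMergePoint(BLBox& box, const BLPoint& p) noexcept {
  // Extend the box so it contains the point.
  box.x0 = blMin(box.x0, p.x);
  box.y0 = blMin(box.y0, p.y);
  box.x1 = blMax(box.x1, p.x);
  box.y1 = blMax(box.y1, p.y);
}

Both helpers only combine pairs of doubles with no branching, which is part of what makes V2 a good candidate for autovectorization.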

GCC [-std=c++17 -O3 -fno-math-errno -mavx2]

# NOT VECTORIZED!
blQuadBoundingBoxV1(BLPoint const*, BLBox&):
  vmovsd  xmm7, QWORD PTR [rdi+8]
  vmovsd  xmm6, QWORD PTR [rdi+40]
  vmovsd  xmm11, QWORD PTR [rdi]
  vmovsd  xmm10, QWORD PTR [rdi+32]
  vmaxsd  xmm4, xmm6, xmm7
  vminsd  xmm3, xmm6, xmm7
  vminsd  xmm8, xmm10, xmm11
  vmaxsd  xmm9, xmm10, xmm11
  vunpcklpd       xmm1, xmm8, xmm3
  vunpcklpd       xmm0, xmm9, xmm4
  vmovups XMMWORD PTR [rsi], xmm1
  vmovups XMMWORD PTR [rsi+16], xmm0
  vmovsd  xmm1, QWORD PTR [rdi+16]
  vcomisd xmm8, xmm1
  ja      .L11
  vcomisd xmm1, xmm9
  ja      .L11
.L12:
  vmovsd  xmm0, QWORD PTR [rdi+24]
  vcomisd xmm3, xmm0
  ja      .L16
  vcomisd xmm0, xmm4
  jbe     .L27
.L16:
  vaddsd  xmm1, xmm0, xmm0
  vsubsd  xmm2, xmm7, xmm0
  vmovsd  xmm5, QWORD PTR .LC2[rip]
  vsubsd  xmm1, xmm7, xmm1
  vaddsd  xmm1, xmm1, xmm6
  vdivsd  xmm2, xmm2, xmm1
  vsubsd  xmm5, xmm5, xmm2
  vsubsd  xmm1, xmm2, xmm5
  vmulsd  xmm5, xmm5, xmm5
  vmulsd  xmm2, xmm2, xmm2
  vaddsd  xmm1, xmm1, xmm1
  vmulsd  xmm5, xmm5, xmm7
  vmulsd  xmm0, xmm1, xmm0
  vmulsd  xmm2, xmm2, xmm6
  vaddsd  xmm0, xmm0, xmm5
  vaddsd  xmm0, xmm0, xmm2
  vminsd  xmm3, xmm0, xmm3
  vmaxsd  xmm0, xmm0, xmm4
  vmovsd  QWORD PTR [rsi+8], xmm3
  vmovsd  QWORD PTR [rsi+24], xmm0
.L27:
  ret
.L11:
  vaddsd  xmm0, xmm1, xmm1
  vsubsd  xmm5, xmm11, xmm1
  vsubsd  xmm0, xmm11, xmm0
  vaddsd  xmm0, xmm0, xmm10
  vdivsd  xmm5, xmm5, xmm0
  vmovsd  xmm0, QWORD PTR .LC2[rip]
  vsubsd  xmm0, xmm0, xmm5
  vsubsd  xmm2, xmm5, xmm0
  vmulsd  xmm0, xmm0, xmm0
  vmulsd  xmm5, xmm5, xmm5
  vaddsd  xmm2, xmm2, xmm2
  vmulsd  xmm0, xmm0, xmm11
  vmulsd  xmm1, xmm2, xmm1
  vmulsd  xmm5, xmm5, xmm10
  vaddsd  xmm1, xmm1, xmm0
  vaddsd  xmm1, xmm1, xmm5
  vminsd  xmm8, xmm1, xmm8
  vmaxsd  xmm1, xmm1, xmm9
  vmovsd  QWORD PTR [rsi], xmm8
  vmovsd  QWORD PTR [rsi+16], xmm1
  jmp     .L12

# NOT VECTORIZED!
blQuadBoundingBoxV2(BLPoint const*, BLBox&):
  push    rbp
  mov     rbp, rsp
  and     rsp, -32
  vmovsd  xmm1, QWORD PTR [rdi+8]
  vmovsd  xmm0, QWORD PTR [rdi]
  vmovsd  xmm5, QWORD PTR [rdi+40]
  vmovsd  xmm6, QWORD PTR [rdi+32]
  vmaxsd  xmm13, xmm5, xmm1
  vmaxsd  xmm12, xmm6, xmm0
  vminsd  xmm5, xmm5, xmm1
  vminsd  xmm6, xmm6, xmm0
  vunpcklpd       xmm0, xmm12, xmm13
  vunpcklpd       xmm1, xmm6, xmm5
  vmovups XMMWORD PTR [rsi+16], xmm0
  vmovups XMMWORD PTR [rsi], xmm1
  vmovsd  xmm2, QWORD PTR [rdi+24]
  vmovsd  xmm10, QWORD PTR [rdi+8]
  vmovsd  xmm1, QWORD PTR [rdi+40]
  vmovsd  xmm7, QWORD PTR [rdi+16]
  vaddsd  xmm4, xmm2, xmm2
  vsubsd  xmm9, xmm10, xmm2
  vmovsd  xmm3, QWORD PTR [rdi]
  vmovsd  xmm0, QWORD PTR [rdi+32]
  vsubsd  xmm8, xmm3, xmm7
  vsubsd  xmm4, xmm10, xmm4
  vaddsd  xmm4, xmm4, xmm1
  vdivsd  xmm9, xmm9, xmm4
  vaddsd  xmm4, xmm7, xmm7
  vsubsd  xmm4, xmm3, xmm4
  vaddsd  xmm4, xmm4, xmm0
  vdivsd  xmm8, xmm8, xmm4
  vxorpd  xmm4, xmm4, xmm4
  vcomisd xmm4, xmm8
  ja      .L33
  vcomisd xmm4, xmm9
  jbe     .L62
  vmovsd  xmm11, QWORD PTR .LC2[rip]
  vmulsd  xmm14, xmm1, xmm4
  vmulsd  xmm9, xmm2, xmm4
  vcomisd xmm8, xmm11
  jbe     .L63
  vmovsd  QWORD PTR [rsp-16], xmm2
  vmovapd xmm1, xmm14
  vmovapd xmm2, xmm9
  vxorpd  xmm14, xmm14, xmm14
  vmovsd  QWORD PTR [rsp-8], xmm7
  vmulsd  xmm3, xmm3, xmm4
  vmovapd xmm15, xmm11
  vmovapd xmm8, xmm11
  vmulsd  xmm7, xmm7, xmm4
  vxorpd  xmm9, xmm9, xmm9
  jmp     .L40
.L33:
  vmulsd  xmm11, xmm7, xmm4
  vcomisd xmm4, xmm9
  vxorpd  xmm8, xmm8, xmm8
  vmulsd  xmm0, xmm0, xmm4
  vmovsd  QWORD PTR [rsp-8], xmm11
  vmovsd  xmm11, QWORD PTR .LC2[rip]
  vmovapd xmm14, xmm11
  jbe     .L37
.L46:
  vmovsd  QWORD PTR [rsp-16], xmm2
  vmulsd  xmm1, xmm1, xmm4
  vmovapd xmm15, xmm11
  vxorpd  xmm9, xmm9, xmm9
  vmulsd  xmm2, xmm2, xmm4
  jmp     .L40
.L62:
  vmovsd  xmm11, QWORD PTR .LC2[rip]
  vcomisd xmm8, xmm11
  jbe     .L56
  vmovsd  QWORD PTR [rsp-8], xmm7
  vmulsd  xmm3, xmm3, xmm4
  vxorpd  xmm14, xmm14, xmm14
  vmovapd xmm8, xmm11
  vmulsd  xmm7, xmm7, xmm4
.L37:
  vcomisd xmm9, xmm11
  jbe     .L57
  vmulsd  xmm15, xmm2, xmm4
  vmovapd xmm9, xmm11
  vmulsd  xmm10, xmm10, xmm4
  vmovsd  QWORD PTR [rsp-16], xmm15
  vxorpd  xmm15, xmm15, xmm15
.L40:
  vaddsd  xmm1, xmm1, QWORD PTR [rsp-16]
  vaddsd  xmm3, xmm3, QWORD PTR [rsp-8]
  vaddsd  xmm2, xmm2, xmm10
  vaddsd  xmm0, xmm0, xmm7
  vmulsd  xmm9, xmm1, xmm9
  vmulsd  xmm15, xmm2, xmm15
  vmulsd  xmm8, xmm0, xmm8
  vmulsd  xmm14, xmm3, xmm14
  vaddsd  xmm9, xmm9, xmm15
  vaddsd  xmm14, xmm8, xmm14
  vminsd  xmm5, xmm9, xmm5
  vmaxsd  xmm9, xmm9, xmm13
  vminsd  xmm6, xmm14, xmm6
  vmaxsd  xmm14, xmm14, xmm12
  vmovsd  QWORD PTR [rsi+8], xmm5
  vmovsd  QWORD PTR [rsi+24], xmm9
  vmovsd  QWORD PTR [rsi], xmm6
  vmovsd  QWORD PTR [rsi+16], xmm14
  leave
  ret
.L56:
  vmulsd  xmm15, xmm7, xmm8
  vsubsd  xmm14, xmm11, xmm8
  vmulsd  xmm0, xmm0, xmm8
  vmulsd  xmm3, xmm3, xmm14
  vmulsd  xmm7, xmm7, xmm14
  vmovsd  QWORD PTR [rsp-8], xmm15
  jmp     .L37
.L63:
  vmulsd  xmm15, xmm7, xmm8
  vsubsd  xmm14, xmm11, xmm8
  vmulsd  xmm0, xmm0, xmm8
  vmulsd  xmm3, xmm3, xmm14
  vmulsd  xmm7, xmm7, xmm14
  vmovsd  QWORD PTR [rsp-8], xmm15
  jmp     .L46
.L57:
  vsubsd  xmm15, xmm11, xmm9
  vmulsd  xmm1, xmm1, xmm9
  vmulsd  xmm4, xmm2, xmm15
  vmulsd  xmm10, xmm10, xmm15
  vmulsd  xmm2, xmm2, xmm9
  vmovsd  QWORD PTR [rsp-16], xmm4
  jmp     .L40
.LC0:
  .long   0
  .long   1072693248
  .long   0
  .long   1072693248
.LC1:
  .long   0
  .long   1074266112
  .long   0
  .long   1074266112
.LC2:
  .long   0
  .long   1072693248

Clang [-std=c++17 -O3 -fno-math-errno -mavx2]

# NOT VECTORIZED!
.LCPI4_0:
  .quad   4607182418800017408     # double 1

blQuadBoundingBoxV1(BLPoint const*, BLBox&): # @blQuadBoundingBoxV1(BLPoint const*, BLBox&)
  vmovupd xmm1, xmmword ptr [rdi]
  vmovupd xmm0, xmmword ptr [rdi + 32]
  vminpd  xmm3, xmm0, xmm1
  vmaxpd  xmm2, xmm0, xmm1
  vmovupd xmmword ptr [rsi], xmm3
  vmovupd xmmword ptr [rsi + 16], xmm2
  vmovsd  xmm4, qword ptr [rdi + 16] # xmm4 = mem[0],zero
  vucomisd        xmm3, xmm4
  ja      .LBB4_2
  vucomisd        xmm4, xmm2
  jbe     .LBB4_3
.LBB4_2:
  vsubsd  xmm5, xmm1, xmm4
  vaddsd  xmm6, xmm4, xmm4
  vsubsd  xmm6, xmm1, xmm6
  vaddsd  xmm6, xmm0, xmm6
  vdivsd  xmm5, xmm5, xmm6
  vmovsd  xmm6, qword ptr [rip + .LCPI4_0] # xmm6 = mem[0],zero
  vsubsd  xmm6, xmm6, xmm5
  vmulsd  xmm7, xmm6, xmm6
  vsubsd  xmm6, xmm5, xmm6
  vaddsd  xmm6, xmm6, xmm6
  vmulsd  xmm5, xmm5, xmm5
  vmulsd  xmm7, xmm1, xmm7
  vmulsd  xmm4, xmm4, xmm6
  vaddsd  xmm4, xmm7, xmm4
  vmulsd  xmm5, xmm0, xmm5
  vaddsd  xmm4, xmm5, xmm4
  vminsd  xmm5, xmm4, xmm3
  vmovsd  qword ptr [rsi], xmm5
  vmaxsd  xmm4, xmm4, xmm2
  vmovsd  qword ptr [rsi + 16], xmm4
.LBB4_3:
  vmovsd  xmm4, qword ptr [rdi + 24] # xmm4 = mem[0],zero
  vpermilpd       xmm3, xmm3, 1   # xmm3 = xmm3[1,0]
  vucomisd        xmm3, xmm4
  vpermilpd       xmm2, xmm2, 1   # xmm2 = xmm2[1,0]
  ja      .LBB4_5
  vucomisd        xmm4, xmm2
  ja      .LBB4_5
  ret
.LBB4_5:
  vpermilpd       xmm1, xmm1, 1   # xmm1 = xmm1[1,0]
  vsubsd  xmm5, xmm1, xmm4
  vaddsd  xmm6, xmm4, xmm4
  vsubsd  xmm6, xmm1, xmm6
  vpermilpd       xmm0, xmm0, 1   # xmm0 = xmm0[1,0]
  vaddsd  xmm6, xmm0, xmm6
  vdivsd  xmm5, xmm5, xmm6
  vmovsd  xmm6, qword ptr [rip + .LCPI4_0] # xmm6 = mem[0],zero
  vsubsd  xmm6, xmm6, xmm5
  vmulsd  xmm7, xmm6, xmm6
  vsubsd  xmm6, xmm5, xmm6
  vaddsd  xmm6, xmm6, xmm6
  vmulsd  xmm5, xmm5, xmm5
  vmulsd  xmm1, xmm1, xmm7
  vmulsd  xmm4, xmm4, xmm6
  vaddsd  xmm1, xmm1, xmm4
  vmulsd  xmm0, xmm0, xmm5
  vaddsd  xmm0, xmm0, xmm1
  vminsd  xmm1, xmm0, xmm3
  vmovsd  qword ptr [rsi + 8], xmm1
  vmaxsd  xmm0, xmm0, xmm2
  vmovsd  qword ptr [rsi + 24], xmm0
  ret
.LCPI5_0:
  .quad   4607182418800017408     # double 1
  .quad   4607182418800017408     # double 1

# Vectorized
blQuadBoundingBoxV2(BLPoint const*, BLBox&): # @blQuadBoundingBoxV2(BLPoint const*, BLBox&)
  vmovupd xmm0, xmmword ptr [rdi]
  vmovupd xmm1, xmmword ptr [rdi + 16]
  vmovupd xmm2, xmmword ptr [rdi + 32]
  vminpd  xmm3, xmm2, xmm0
  vmaxpd  xmm4, xmm2, xmm0
  vsubpd  xmm5, xmm0, xmm1
  vaddpd  xmm6, xmm1, xmm1
  vsubpd  xmm6, xmm0, xmm6
  vaddpd  xmm6, xmm2, xmm6
  vdivpd  xmm5, xmm5, xmm6
  vxorpd  xmm6, xmm6, xmm6
  vmaxpd  xmm5, xmm6, xmm5
  vmovapd xmm6, xmmword ptr [rip + .LCPI5_0] # xmm6 = [1.000000e+00,1.000000e+00]
  vminpd  xmm5, xmm6, xmm5
  vsubpd  xmm6, xmm6, xmm5
  vmulpd  xmm0, xmm0, xmm6
  vmulpd  xmm7, xmm1, xmm5
  vaddpd  xmm0, xmm7, xmm0
  vmulpd  xmm1, xmm1, xmm6
  vmulpd  xmm2, xmm2, xmm5
  vaddpd  xmm1, xmm2, xmm1
  vmulpd  xmm0, xmm6, xmm0
  vmulpd  xmm1, xmm5, xmm1
  vaddpd  xmm0, xmm0, xmm1
  vminpd  xmm1, xmm0, xmm3
  vmovupd xmmword ptr [rsi], xmm1
  vmaxpd  xmm0, xmm0, xmm4
  vmovupd xmmword ptr [rsi + 16], xmm0
  ret

MSVC [/O2 /Qpar /arch:AVX2]

; NOT VECTORIZED!
bez$ = 128
bBox$ = 136
void blQuadBoundingBoxV1(BLPoint const * __ptr64 const,BLBox & __ptr64) PROC
$LN46:
  sub     rsp, 120                      ; 00000078H
  vmovsd  xmm3, QWORD PTR [rcx+40]
  vmovsd  xmm1, QWORD PTR [rcx+32]
  vminsd  xmm0, xmm1, QWORD PTR [rcx]
  vmovaps XMMWORD PTR [rsp+96], xmm6
  vmovaps XMMWORD PTR [rsp+80], xmm8
  vmovaps XMMWORD PTR [rsp+64], xmm9
  vmaxsd  xmm9, xmm3, QWORD PTR [rcx+8]
  vmovaps XMMWORD PTR [rsp+48], xmm10
  vminsd  xmm10, xmm3, QWORD PTR [rcx+8]
  vmovaps XMMWORD PTR [rsp], xmm13
  vmovsd  xmm13, QWORD PTR __real@3ff0000000000000
  vmovaps XMMWORD PTR [rsp+32], xmm11
  vmaxsd  xmm11, xmm1, QWORD PTR [rcx]
  vmovsd  QWORD PTR [rdx+16], xmm11
  vmovsd  QWORD PTR [rdx], xmm0
  vmovsd  QWORD PTR [rdx+8], xmm10
  vmovsd  QWORD PTR [rdx+24], xmm9
  vmovsd  xmm8, QWORD PTR [rcx+16]
  vmovaps XMMWORD PTR [rsp+16], xmm12
  vmovaps xmm12, xmm0
  vcomisd xmm12, xmm8
  ja      SHORT $LN3@blQuadBoun
  vcomisd xmm8, xmm11
  jbe     SHORT $LN2@blQuadBoun
$LN3@blQuadBoun:
  vmovsd  xmm6, QWORD PTR [rcx]
  vaddsd  xmm0, xmm8, xmm8
  vsubsd  xmm1, xmm6, xmm0
  vaddsd  xmm2, xmm1, QWORD PTR [rcx+32]
  vsubsd  xmm3, xmm6, xmm8
  vdivsd  xmm5, xmm3, xmm2
  vsubsd  xmm4, xmm13, xmm5
  vsubsd  xmm0, xmm5, xmm4
  vaddsd  xmm1, xmm0, xmm0
  vmulsd  xmm3, xmm1, xmm8
  vmulsd  xmm2, xmm4, xmm4
  vmulsd  xmm0, xmm2, xmm6
  vmulsd  xmm1, xmm5, xmm5
  vmulsd  xmm2, xmm1, QWORD PTR [rcx+32]
  vaddsd  xmm4, xmm3, xmm0
  vaddsd  xmm3, xmm4, xmm2
  vminsd  xmm0, xmm3, xmm12
  vmaxsd  xmm1, xmm3, xmm11
  vmovsd  QWORD PTR [rdx], xmm0
  vmovsd  QWORD PTR [rdx+16], xmm1
$LN2@blQuadBoun:
  vmovsd  xmm8, QWORD PTR [rcx+24]
  vcomisd xmm10, xmm8
  vmovaps xmm12, XMMWORD PTR [rsp+16]
  vmovaps xmm11, XMMWORD PTR [rsp+32]
  ja      SHORT $LN5@blQuadBoun
  vcomisd xmm8, xmm9
  jbe     SHORT $LN4@blQuadBoun
$LN5@blQuadBoun:
  vmovsd  xmm6, QWORD PTR [rcx+8]
  vaddsd  xmm0, xmm8, xmm8
  vsubsd  xmm1, xmm6, xmm0
  vaddsd  xmm2, xmm1, QWORD PTR [rcx+40]
  vsubsd  xmm3, xmm6, xmm8
  vdivsd  xmm5, xmm3, xmm2
  vsubsd  xmm4, xmm13, xmm5
  vsubsd  xmm0, xmm5, xmm4
  vaddsd  xmm1, xmm0, xmm0
  vmulsd  xmm3, xmm1, xmm8
  vmulsd  xmm2, xmm4, xmm4
  vmulsd  xmm0, xmm2, xmm6
  vmulsd  xmm1, xmm5, xmm5
  vmulsd  xmm2, xmm1, QWORD PTR [rcx+40]
  vaddsd  xmm4, xmm3, xmm0
  vaddsd  xmm3, xmm4, xmm2
  vminsd  xmm0, xmm3, xmm10
  vmaxsd  xmm1, xmm3, xmm9
  vmovsd  QWORD PTR [rdx+8], xmm0
  vmovsd  QWORD PTR [rdx+24], xmm1
$LN4@blQuadBoun:
  vmovaps xmm6, XMMWORD PTR [rsp+96]
  vmovaps xmm8, XMMWORD PTR [rsp+80]
  vmovaps xmm9, XMMWORD PTR [rsp+64]
  vmovaps xmm10, XMMWORD PTR [rsp+48]
  vmovaps xmm13, XMMWORD PTR [rsp]
  add     rsp, 120                      ; 00000078H
  ret     0
void blQuadBoundingBoxV1(BLPoint const * __ptr64 const,BLBox & __ptr64) ENDP

; NOT VECTORIZED!
bez$ = 176
bBox$ = 184
void blQuadBoundingBoxV2(BLPoint const * __ptr64 const,BLBox & __ptr64) PROC
$LN130:
  mov     rax, rsp
  sub     rsp, 168                      ; 000000a8H
  vmovsd  xmm3, QWORD PTR [rcx+40]
  vmovsd  xmm1, QWORD PTR [rcx+32]
  vminsd  xmm0, xmm1, QWORD PTR [rcx]
  vmovaps XMMWORD PTR [rax-24], xmm6
  lea     r11, QWORD PTR [rax]
  vmovaps XMMWORD PTR [rax-40], xmm7
  vmovaps XMMWORD PTR [rax-56], xmm8
  vmovaps XMMWORD PTR [rax-72], xmm9
  vmovaps XMMWORD PTR [rax-88], xmm10
  vmovaps XMMWORD PTR [rax-104], xmm11
  vmovaps XMMWORD PTR [rax-120], xmm12
  vmovaps XMMWORD PTR [rsp+32], xmm13
  vminsd  xmm13, xmm3, QWORD PTR [rcx+8]
  vmovaps XMMWORD PTR [rsp+16], xmm14
  vmaxsd  xmm14, xmm1, QWORD PTR [rcx]
  vmovaps XMMWORD PTR [rsp], xmm15
  vmaxsd  xmm15, xmm3, QWORD PTR [rcx+8]
  vmovsd  QWORD PTR [rdx], xmm0
  vmovsd  QWORD PTR [rdx+24], xmm15
  vmovsd  QWORD PTR [rdx+8], xmm13
  vmovsd  QWORD PTR [rdx+16], xmm14
  vmovsd  xmm12, QWORD PTR [rcx]
  vmovsd  xmm11, QWORD PTR [rcx+8]
  vmovsd  xmm0, QWORD PTR [rcx+16]
  vmovsd  xmm1, QWORD PTR [rcx+24]
  vaddsd  xmm2, xmm0, xmm0
  vaddsd  xmm3, xmm1, xmm1
  vsubsd  xmm0, xmm12, xmm2
  vaddsd  xmm2, xmm0, QWORD PTR [rcx+32]
  vsubsd  xmm0, xmm12, QWORD PTR [rcx+16]
  vdivsd  xmm2, xmm0, xmm2
  vsubsd  xmm1, xmm11, xmm3
  vaddsd  xmm3, xmm1, QWORD PTR [rcx+40]
  vsubsd  xmm1, xmm11, QWORD PTR [rcx+24]
  vdivsd  xmm5, xmm1, xmm3
  vxorpd  xmm4, xmm4, xmm4
  vmaxsd  xmm0, xmm4, xmm2
  vmaxsd  xmm1, xmm4, xmm5
  vmovsd  xmm4, QWORD PTR __real@3ff0000000000000
  vminsd  xmm10, xmm4, xmm0
  vmulsd  xmm3, xmm10, QWORD PTR [rcx+32]
  vminsd  xmm9, xmm4, xmm1
  vmulsd  xmm2, xmm9, QWORD PTR [rcx+40]
  vsubsd  xmm8, xmm4, xmm10
  vmulsd  xmm1, xmm8, QWORD PTR [rcx+16]
  vaddsd  xmm5, xmm1, xmm3
  vmulsd  xmm3, xmm10, QWORD PTR [rcx+16]
  vsubsd  xmm7, xmm4, xmm9
  vmulsd  xmm0, xmm7, QWORD PTR [rcx+24]
  vaddsd  xmm6, xmm2, xmm0
  vmulsd  xmm2, xmm9, QWORD PTR [rcx+24]
  vmulsd  xmm0, xmm8, xmm12
  vmovaps xmm12, XMMWORD PTR [r11-120]
  vmulsd  xmm1, xmm7, xmm11
  vmovaps xmm11, XMMWORD PTR [r11-104]
  vaddsd  xmm4, xmm2, xmm1
  vaddsd  xmm3, xmm3, xmm0
  vmulsd  xmm0, xmm3, xmm8
  vmovaps xmm8, XMMWORD PTR [r11-56]
  vmulsd  xmm1, xmm4, xmm7
  vmovaps xmm7, XMMWORD PTR [r11-40]
  vmulsd  xmm2, xmm6, xmm9
  vmovaps xmm6, XMMWORD PTR [r11-24]
  vmovaps xmm9, XMMWORD PTR [r11-72]
  vmulsd  xmm5, xmm5, xmm10
  vmovaps xmm10, XMMWORD PTR [r11-88]
  vaddsd  xmm3, xmm0, xmm5
  vminsd  xmm0, xmm3, QWORD PTR [rdx]
  vaddsd  xmm4, xmm1, xmm2
  vminsd  xmm1, xmm4, xmm13
  vmovaps xmm13, XMMWORD PTR [rsp+32]
  vmovsd  QWORD PTR [rdx], xmm0
  vmaxsd  xmm0, xmm3, xmm14
  vmovaps xmm14, XMMWORD PTR [rsp+16]
  vmovsd  QWORD PTR [rdx+8], xmm1
  vmaxsd  xmm1, xmm4, xmm15
  vmovaps xmm15, XMMWORD PTR [rsp]
  vmovsd  QWORD PTR [rdx+16], xmm0
  vmovsd  QWORD PTR [rdx+24], xmm1
  mov     rsp, r11
  ret     0
void blQuadBoundingBoxV2(BLPoint const * __ptr64 const,BLBox & __ptr64) ENDP

Interpreting the Results

Only Clang was able to vectorize the second version; the other compilers produced uninteresting scalar code.

Quadratic Bezier Bounding Box - Clang Experiments

I did a bit of experimenting and I would like to show how sensitive compilers are when it comes to certain expressions. I created my own versions of some basic functions like blMin(a, b) and blMax(a, b). The main reason was to make them constexpr and to return a copy instead of a reference to either a or b, which is useful when the type is not primitive but a structure like BLPoint.
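
The following is only a minimal sketch of that idea, not the exact Blend2D implementation - the BLPoint overloads and the constexpr BLPoint constructor are assumptions made for illustration:

// Sketch only - by-value, constexpr min/max helpers in the spirit of blMin/blMax.
template<typename T>
constexpr T blMin(const T& a, const T& b) noexcept { return b < a ? b : a; }

template<typename T>
constexpr T blMax(const T& a, const T& b) noexcept { return a < b ? b : a; }

// Returning a copy makes component-wise overloads for structures natural
// (assumes BLPoint has a constexpr BLPoint(double, double) constructor).
constexpr BLPoint blMin(const BLPoint& a, const BLPoint& b) noexcept {
  return BLPoint(blMin(a.x, b.x), blMin(a.y, b.y));
}

constexpr BLPoint blMax(const BLPoint& a, const BLPoint& b) noexcept {
  return BLPoint(blMax(a.x, b.x), blMax(a.y, b.y));
}

Now let's take a look at what happens if we use the standard std::min and std::max functions instead: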

// Experiment, use std::min and std::max instead of our own.
void blQuadBoundingBoxV3(const BLPoint bez[3], BLBox& bBox) noexcept {
  // Bounding box of start and end points.
  bBox.reset(std::min(bez[0].x, bez[2].x), std::min(bez[0].y, bez[2].y),
             std::max(bez[0].x, bez[2].x), std::max(bez[0].y, bez[2].y));

  BLPoint t = (bez[0] - bez[1]) / (bez[0] - bez[1] * 2.0 + bez[2]);
  t.x = std::max(t.x, 0.0);
  t.y = std::max(t.y, 0.0);
  t.x = std::min(t.x, 1.0);
  t.y = std::min(t.y, 1.0);

  blBoxMergePoint(bBox, blLerp(blLerp(bez[0], bez[1], t),
                               blLerp(bez[1], bez[2], t), t));
}

And the output:

# The code becomes a mixture of scalar and SIMD, much worse than the previous one.
blQuadBoundingBoxV3(BLPoint const*, BLBox&):
  vmovupd  xmm3, xmmword ptr [rdi]
  vmovupd  xmm2, xmmword ptr [rdi + 32]
  vminpd   xmm9, xmm2, xmm3
  vmaxpd   xmm8, xmm2, xmm3
  vmovupd  xmm4, xmmword ptr [rdi + 16]
  vsubsd   xmm5, xmm3, xmm4
  vpermilpd xmm6, xmm4, 1   # xmm6 = xmm4[1,0]
  vpermilpd xmm7, xmm3, 1   # xmm7 = xmm3[1,0]
  vsubsd   xmm0, xmm7, xmm6
  vaddsd   xmm1, xmm4, xmm4
  vaddsd   xmm6, xmm6, xmm6
  vsubsd   xmm1, xmm3, xmm1
  vsubsd   xmm6, xmm7, xmm6
  vaddsd   xmm1, xmm2, xmm1
  vdivsd   xmm1, xmm5, xmm1
  vpermilpd xmm5, xmm2, 1   # xmm5 = xmm2[1,0]
  vaddsd   xmm5, xmm5, xmm6
  vdivsd   xmm0, xmm0, xmm5
  vmovsd   qword ptr [rsp - 16], xmm1
  vmovsd   qword ptr [rsp - 24], xmm0
  vxorpd   xmm5, xmm5, xmm5
  vucomisd xmm5, xmm1
  lea      rax, [rsp - 8]
  lea      rcx, [rsp - 16]
  cmova    rcx, rax
  mov      qword ptr [rsp - 8], 0
  mov      rcx, qword ptr [rcx]
  vucomisd xmm5, xmm0
  lea      rdx, [rsp - 24]
  cmova    rdx, rax
  mov      qword ptr [rsp - 16], rcx
  mov      qword ptr [rsp - 8], 0
  mov      rax, qword ptr [rdx]
  mov      qword ptr [rsp - 24], rax
  vmovsd   xmm0, qword ptr [rsp - 16] # xmm0 = mem[0],zero
  vmovq    xmm1, rcx
  vmovq    xmm5, rax
  vpunpcklqdq xmm1, xmm1, xmm5 # xmm1 = xmm1[0],xmm5[0]
  vmovapd  xmm5, xmmword ptr [rip + .LCPI6_0] # xmm5 = [1.000000e+00,1.000000e+00]
  vcmpltpd xmm1, xmm5, xmm1
  vmovhpd  xmm0, xmm0, qword ptr [rsp - 24] # xmm0 = xmm0[0],mem[0]
  vblendvpd xmm0, xmm0, xmm5, xmm1
  vsubpd   xmm1, xmm5, xmm0
  vmulpd   xmm3, xmm3, xmm1
  vmulpd   xmm5, xmm4, xmm0
  vaddpd   xmm3, xmm5, xmm3
  vmulpd   xmm4, xmm4, xmm1
  vmulpd   xmm2, xmm2, xmm0
  vaddpd   xmm2, xmm2, xmm4
  vmulpd   xmm1, xmm1, xmm3
  vmulpd   xmm0, xmm2, xmm0
  vaddpd   xmm0, xmm1, xmm0
  vminpd   xmm1, xmm0, xmm9
  vmovupd  xmmword ptr [rsi], xmm1
  vmaxpd   xmm0, xmm0, xmm8
  vmovupd  xmmword ptr [rsi + 16], xmm0
  ret

As you can see, a slight modification of the code resulted in disaster. The nicely vectorized code is gone and instead we've got a mixture of scalar and SIMD code that is still better than the scalar-only version, but much longer than the previous fully autovectorized one. The likely culprit is that std::min and std::max return a reference to one of their arguments, so the compiler materializes the values on the stack and selects an address (the lea/cmova sequence above) instead of using plain register min/max operations.

Conclusion

I initially planned to show more snippets, but I think this was enough for a single post. I'm planning to continue with these experiments to show how code that I use in production compiles, and possibly how to improve or rethink what you do to make it more likely to be optimized the way you want. I think the best compiler at the moment is Clang, as it's able to autovectorize well-written code really easily; however, as the last snippet shows, even Clang can be picky about the constructs you use and sometimes it's pure gambling.

In future posts I would like to focus mostly on GCC and Clang as these compilers are not shy about using SIMD. I also tried ICC before publishing this post, but since it produced output similar to MSVC's it doesn't deserve my focus at all - one shitty compiler that makes this post longer is enough :)

You can try it yourself in Compiler Explorer.

Update

I reported GCC bug #87105 and Clang bug #38705, as I think fixing these would help other code bases as well.