31 March, 2020

The Cost of Atomics

Introduction

Atomic operations, which are now available in many programming languages including C and C++ (introduced in C11 and C++11), make it possible to implement lock-free algorithms without resorting to hand-written assembly. The compiler translates each atomic operation into an instruction, or a sequence of instructions, that guarantees atomicity. But what is the cost of atomic operations?
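
For example, on x86-64 an atomic increment typically maps to a single lock xadd instruction, while a target without a native fetch-and-add can emulate it with a compare-exchange loop. The following is a minimal sketch of such emulation, not actual compiler output:

#include <atomic>
#include <stdint.h>

uint64_t atomicIncrement(std::atomic<uint64_t>& counter) {
  uint64_t expected = counter.load();
  // Retry until no other thread modified the counter in between; on
  // failure, `expected` is reloaded with the counter's current value.
  while (!counter.compare_exchange_weak(expected, expected + 1)) {
  }
  return expected + 1; // Same result as ++counter.
}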

This article tries to demonstrate the cost of atomic operations on the example of a unique 64-bit ID generator, which uses a global counter and produces unique IDs by incrementing that counter atomically. This guarantees that no locking is required to obtain a unique ID. However, there is still the cost of the atomic operation itself, which is not negligible even without contention and grows quickly with it.

A Naive Approach

Let's say that we want to write a function that returns a unique 64-bit identifier. I have chosen a 64-bit number as I think it's practically impossible to overflow it in today's mainstream applications: even if some process generated 4 billion (2^32) IDs per second, it would still take 2^32 seconds, roughly 136 years, to overflow the global counter, which seems unrealistic at the moment. However, if we improve the performance of the ID generator and run it on a high-end multicore CPU to generate IDs in parallel, the theoretical time to overflow the counter can be drastically reduced, so whether a 64-bit identifier is enough really depends on the nature of the application. My recommendation would be to always abstract the data type used to represent such an ID so that it can be painlessly changed in the future.
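
For example, a plain type alias is enough to keep the rest of the code independent of the underlying width (a sketch; the UniqueId name is illustrative):

#include <stdint.h>

// Hypothetical alias - widening the ID later means changing this one line.
using UniqueId = uint64_t;

UniqueId generateId();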

So, what could the simplest ID generator look like?

uint64_t generateIdGlobal() {
  // A single process-wide counter; every call pays for one atomic increment.
  static std::atomic<uint64_t> globalCounter(0);
  return ++globalCounter;
}

This was actually the initial implementation that I used in the past, considering it good enough as I didn't expect the ID generator to be called that often. In Blend2D such IDs are only needed to create unique identifiers for caching purposes, so it seemed fine. However, there is one small detail - what happens if the target architecture doesn't implement 64-bit atomics? Such code may not compile on 32-bit ARM or MIPS hardware, as the target instruction set doesn't have to provide that feature, but it compiles just fine on 32-bit x86, which offers the cmpxchg8b instruction - enough for implementing any kind of 64-bit atomic operation.
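
One way to make this assumption explicit is a compile-time check - a sketch using C++17's is_always_lock_free. Note that even where 64-bit atomics are not lock-free, the code may still compile and silently fall back to a library implementation (such as libatomic) that locks internally:

#include <atomic>
#include <stdint.h>

// Fail the build early instead of silently getting lock-based atomics.
static_assert(std::atomic<uint64_t>::is_always_lock_free,
              "64-bit atomics are not lock-free on this target");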

Thread-Local Approach

If the only requirement for the ID generator is to return unique numbers, which do not have to be monotonically increasing, we can use thread-local storage to implement a local cache and only touch the global counter atomically once we have exhausted the range of locally available IDs:

uint64_t generateIdLocal() {
  static std::atomic<uint64_t> globalCounter(0);

  static constexpr uint32_t cacheSize = 4096; // Must be a power of 2.
  static thread_local uint64_t localIdx = 0;

  // Reserve a new block of `cacheSize` IDs once the local block is
  // exhausted - this is the only place that touches the global counter.
  if ((localIdx & (cacheSize - 1)) == 0)
    localIdx = globalCounter.fetch_add(uint64_t(cacheSize));

  return ++localIdx;
}

This approach is of course longer, but it also minimizes the use of atomic operations on the global counter - each thread now performs only one atomic operation per cacheSize generated IDs (IDs reserved by a thread that terminates early are simply lost, which is fine as we only need uniqueness, not density). And since atomics have become rare, we can also think about using a mutex to guard access to globalCounter in case we run on hardware that doesn't provide 64-bit atomics:

static std::mutex globalMutex;

uint64_t generateIdLocalMutex() {
  static uint64_t globalCounter = 0;

  static constexpr uint32_t cacheSize = 4096; // Must be a power of 2.
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0) {
    // The lock is only acquired once per `cacheSize` generated IDs.
    std::lock_guard<std::mutex> guard(globalMutex);
    localIdx = globalCounter;
    globalCounter += cacheSize;
  }

  return ++localIdx;
}

Performance

So what would be your guess regarding the performance of each approach? I have written the following code to benchmark various implementations of the ID generator:

#include <stdio.h>
#include <stdint.h>
#include <atomic>
#include <mutex>
#include <thread>
#include <chrono>

typedef uint64_t (*GenerateIdFunc)(void);

static std::mutex globalMutex;

static uint64_t generateIdGlobal() {
  static std::atomic<uint64_t> globalCounter(0);
  return ++globalCounter;
}

static uint64_t generateIdGlobalMutex() {
  static uint64_t globalCounter = 0;
  std::lock_guard<std::mutex> guard(globalMutex);
  return ++globalCounter;
}

static uint64_t generateIdLocal() {
  static std::atomic<uint64_t> globalCounter(0);

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0)
    localIdx = globalCounter.fetch_add(uint64_t(cacheSize));

  return ++localIdx;
}

static uint64_t generateIdLocalMutex() {
  static uint64_t globalCounter = 0;

  static constexpr uint32_t cacheSize = 4096;
  static thread_local uint64_t localIdx = 0;

  if ((localIdx & (cacheSize - 1)) == 0) {
    std::lock_guard<std::mutex> guard(globalMutex);
    localIdx = globalCounter;
    globalCounter += cacheSize;
  }

  return ++localIdx;
}

static void testFunction(GenerateIdFunc func, const char* name) {
  std::atomic<uint64_t> result {};
  constexpr size_t numThreads = 8;   // Set to 1, 2, 4, 8, or 16 to reproduce the table below.
  constexpr size_t numIterations = 1024 * 1024 * 16;

  printf("Testing %s:\n", name);
  auto start = std::chrono::high_resolution_clock::now();

  std::thread threads[numThreads];
  for (size_t i = 0; i < numThreads; i++)
    threads[i] = std::thread([&]() {
      uint64_t localResult = 0;
      for (size_t j = 0; j < numIterations; j++)
        localResult += func();
      result.fetch_add(localResult);
    });

  for (size_t i = 0; i < numThreads; i++)
    threads[i].join();

  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed = end - start;

  printf("  Time: %0.3g s\n", elapsed.count());
  printf("  Result: %llu\n", (unsigned long long)result.load());
}

int main(int argc, char** argv) {
  testFunction(generateIdGlobal, "Global");
  testFunction(generateIdGlobalMutex, "GlobalMutex");
  testFunction(generateIdLocal, "Local");
  testFunction(generateIdLocalMutex, "LocalMutex");
  return 0;
}
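
For reference, the benchmark can be built with something like the following (an assumption on my part - any C++11 compiler with thread support should do, and the file name is illustrative); numThreads has to be changed and the benchmark rebuilt for each thread count:

g++ -O2 -std=c++11 -pthread atomics_bench.cpp -o atomics_bench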

Results on AMD Ryzen 9 3950X (Linux):

Approach          1 Thread   2 Threads   4 Threads   8 Threads   16 Threads
Global - Atomic   0.081s     0.292s      0.580s      1.070s      1.980s
Global - Mutex    0.164s     0.782s      4.440s      9.970s      19.00s
Local - Atomic    0.030s     0.039s      0.039s      0.038s      0.041s
Local - Mutex     0.038s     0.039s      0.037s      0.037s      0.056s

Conclusion

The results should be self-explanatory - atomic operations are always faster than using synchronization primitives to access a shared resource, but their cost is still not negligible, and there is a limit to how many atomic operations can be performed on the same cache line per unit of time. Accessing thread-local storage is faster than using atomics and is especially beneficial in a highly concurrent environment. However, thread-local storage is a scarce resource that should always be used wisely.

I would like to note that this is a microbenchmark that essentially stresses access to a shared resource. Such high contention should not happen in reasonable code, and the speedup offered by a different approach may be totally negligible in many real-world applications. In addition, there is a difference between accessing thread-local storage in executables and in dynamically linked libraries, so always benchmark your code to make sure that you don't increase the complexity of the design for no gain.
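
As a side note regarding the executable vs. shared library difference: in a shared library each thread_local access may have to go through a call such as __tls_get_addr. With GCC and Clang the TLS model can be constrained when the library is guaranteed to be loaded at process startup - a sketch using a compiler extension, applicable only in that scenario:

// Hypothetical: request the faster initial-exec TLS model (GCC/Clang
// extension); not valid for libraries loaded later via dlopen().
static thread_local uint64_t localIdx
  __attribute__((tls_model("initial-exec"))) = 0;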