20 April, 2016

Blend2D's Rasterizer Debugging

[Or] How to Visualize the Rasterizer's Output Differently

This is the first post about Blend2D and it will be a bit unusual. Today I implemented a feature that is only interesting for developers: it makes it possible to visually distinguish between consecutive spans produced by the rasterizer. Blend2D's rasterizer is based on AntiGrain (AGG) and also uses cells, which are used to calculate the final alpha of the resulting pixels after the whole input is processed.

There is a fundamental difference between how cells are stored in Blend2D and in AGG. AGG's approach is to append cells into cell blocks (each cell then includes its coordinate as well), and at the end of the rasterization process it traverses the cells to count the number of cells per scanline, to merge all cells that have the same coordinate, and to sort the rest. These additional passes are actually the most expensive ones. I don't want to go into much detail here, but it simply takes a lot of cycles to prepare all the cells and to generate spans that can be passed to the compositor.

Blend2D uses a different approach. It introduces a data structure called Span (don't confuse it with AGG spans here), which is simply an array of cells. When the rasterizer generates cells it always groups them into spans and then inserts these spans into the scanline container (also not to be confused with AGG's scanline), which keeps them sorted. This introduces some overhead in the rasterization phase, but avoids the additional traversal and sorting in a later phase. Additionally, this approach prevents having multiple cells that share the same coordinate: if a cell already exists, the new cell is merged with it instead of creating a new one. Blend2D's approach beats the AGG rasterizer significantly and keeps the memory requirements during rasterization low - memory use will not grow infinitely, whereas AGG fails to continue when it reaches its hard-coded cell limit.

But Nothing is Perfect

But nothing is perfect and Blend2D is no exception. Currently, when the rasterizer inserts a span into the scanline, it stays there forever. More cells can be merged into it, but the data structure itself holds the span as is and won't coalesce it with other newly inserted span(s) even if they are right next to it. This approach is beneficial for shapes that are not very complex, because it allows rendering them quickly. However, it fails when the input shape contains a lot of small line segments, which happens quite often, especially when rendering complex art and strokes.

And that's the area where I would like to improve Blend2D. Before I can do any improvement I have to see some data first. However, the rasterizer produces a lot of information and it would be highly inefficient to study it in text form. So instead of printing text, I introduced a new property called debugFlags, and hacked the pipeline generator to use a special fetcher if it's set. When enabled, it flips the color each time a new span is processed.


To demonstrate what it does, take a look at the following circle:
(note: all images were upscaled to make the details more visible)

And now look at the debug version:

Not bad, although some erratic lines are visible. All lines that are red contain at least two spans at the beginning of the scanline - the second span was inserted after the first without coalescing (these spans most likely have just one cell).

The stroke shows some erratic spans as well, although still acceptable.

Random Paths

Let's look at something more complex that can reveal a bit more:

The same shape but stroked:

Pathological Stroke

A stroke of a path that contains more than 100 quadratic curves.

Interpreting the Images

The first 3 circles didn't look as bad as the polygons that follow. However, even the circle's stroke could be improved, because small spans degrade the compositor's ability to process multiple pixels at a time. Blend2D's compositor can process more than 4 pixels at a time (up to 16), which means that if the scanline contains a lot of small spans the compositor doesn't use the full width of the available SIMD registers. The biggest overhead is caused by single-cell spans, which are commonly generated by steep (nearly vertical) lines.

Another problem is the pathological stroke. The shape itself is pathological and not common in 2D graphics; however, it shows that the rasterizer's inability to coalesce neighboring spans doesn't scale well. The more spans the rasterizer has to keep track of, the more overhead there is when inserting a new span into the list.

Some Code

The only thing I had to do to implement the debug mode was to add new flags (debugFlags) to the 2D context and to implement a new debug-fetch (including some logic to select it). The initial version of the debug-fetch looks like this:

void SWPipeFetchPart::debugFetch1(SWPipePixelARGB& p, uint32_t flags) noexcept {
  X86GpVar dpTmp0 = c->newUInt32("dpTmp0");
  X86GpVar dpTmp1 = c->newUInt32("dpTmp1");

  c->mov(dpTmp0, g->_debugVar);
  c->mov(dpTmp1, g->_debugVar);
  c->and_(dpTmp0, 0xFFFFFFFF);
  c->and_(dpTmp1, 0xFFFF5050);
  c->add(dpTmp0, dpTmp1);

  p.pc[0] = c->newXmm("dpXmm");
  c->movd(p.pc[0], dpTmp0);

  g->xSatisfyARGB32_1x(p, flags);
}

Nothing fancy or over-complicated, just straightforward asmjit :) It checks the _debugVar (it's a mask that has either all bits zero or all bits set) and generates a white or red pixel based on it. It then stores the pixel to p.pc[], which means Packed Color (a packed pixel that contains color information - ARGB). At the end it calls the pipeline function to convert the pixel into the format that was requested by the compositor.

Further Development

This tool gave me some information that I can use to improve the rasterizer's performance. The rasterizer debug mode can be enabled at runtime (no recompilation needed) and causes no overhead to the rasterizer itself or to the rendering context. I just need to find the right constants to use for span coalescing (like how far apart two spans can be and still be coalesced) and to tweak the rasterizer. The benchmark suite will then be used to verify the performance gain and to make sure the change didn't cause a regression.

I have thought many times about making the rasterizer tile-based, or more precisely, making all spans fixed-size and managing them differently. A radical change like this would make pathological cases (and very complex renderings) faster while hindering the common paths. I'm not fully convinced yet; maybe an adaptive algorithm that switches to a tile-based approach (per scanline) after the number of spans the scanline is holding exceeds some threshold would be a better fit. Switching the algorithm is not a high priority at the moment, but the rasterizer will be updated for sure before the beta release.

18 April, 2016

Tricky _mm_set_sd()

[Or] How a C++ Compiler Can Ruin Your Optimization Efforts

Compiler intrinsics targeting the SSE and SSE2 instruction sets were historically used to instruct the compiler to generate better code for floating point operations. The reason was that some CPU architectures have a global floating-point unit state (or FPU state) that contains the rounding mode, precision, exception mask, and some other flags. These basically prevented the compiler from optimizing operations like round(x), static_cast<int>(x), (x < y ? x : y), etc.

Of course this situation was not performance-friendly and many developers simply started using SSE+ intrinsics when they became available to solve their problems in a more performance-friendly way. This approach was also blessed by some CPU manufacturers as the better way of dealing with the problem. Today's compilers are much smarter than before and some constructs may be considered obsolete. However, some constructs still hold and are probably better than pure C/C++ code in some cases, especially if you need to do something the compiler won't do by default.

Scalar <-> SIMD Problem

The problem I faced is closely related to the conversion between float/double and SIMD registers. When SSE2 code generation is turned ON these types are equivalent and using the _mm_set_sd() intrinsic should generate no code at all; however, that's not always true, as demonstrated by the snippet below:

#include <emmintrin.h>

double MyMinV1(double a, double b) {
  return a < b ? a : b;
}

double MyMinV2(double a, double b) {
  __m128d x = _mm_set_sd(a);
  __m128d y = _mm_set_sd(b);
  return _mm_cvtsd_f64(_mm_min_sd(x, y));
}
One would expect that a modern compiler will generate the same assembly for both functions. Unfortunately, the reality is different. Let's compare gcc and clang.

Assembly generated by clang:

MyMinV1:
  minsd   xmm0, xmm1
  ret

MyMinV2:
  minsd   xmm0, xmm1
  ret

Assembly generated by gcc:

MyMinV1:
  minsd   xmm0, xmm1
  ret

MyMinV2:
  movsd   QWORD PTR [rsp-24], xmm0
  movsd   QWORD PTR [rsp-16], xmm1
  movsd   xmm0, QWORD PTR [rsp-24]
  movsd   xmm1, QWORD PTR [rsp-16]
  minsd   xmm0, xmm1
  ret

That's not good; gcc completely missed the point and turned the C++ code into the worst assembly you could ever imagine. You can check it yourself here if you don't believe me.

The Core of the Problem

The problem is caused by using _mm_set_sd() to convert a double into a SIMD type. The intrinsic should emit the movsd xmm0, xmm1|mem64 instruction, which ironically has two behaviors:

  • If the second operand (the source operand) is an XMM register, it preserves the high-part of the first operand (the destination operand).
  • If the second operand (the source operand) is a memory location, it clears the high-part of the first operand (the destination operand).

This is tricky, because the documentation says that the _mm_set_sd() intrinsic clears the high part of the register, so gcc is basically following the specification. However, it's unexpected if you have no use for the high part of the register, and using two movsd's is suboptimal, as they could be replaced by a single movq, which is what clang does in some other cases.

A Possible Workaround

After submitting gcc bug #70708 (which turned out to be a duplicate of #68211) I got instant feedback and a workaround was found. We can use some asm magic to tell the compiler what to do. Consider the updated code, which also works well with clang:

static inline __m128d DoubleAsXmm(double x) {
  __m128d xmm;
  asm ("" : "=x" (xmm) : "0" (x));
  return xmm;
}

double MyMinV3(double a, double b) {
  return _mm_cvtsd_f64(
    _mm_min_sd(DoubleAsXmm(a), DoubleAsXmm(b)));
}

Which leads to the desired result (both gcc and clang):

MyMinV3:
  minsd   xmm0, xmm1
  ret

Of course you need some preprocessor magic to make the conversion portable across compilers. However, it's a nice solution, as it basically tells the compiler that you don't care about the high part of the SIMD register. As a bonus, this also prevents clang from emitting a movq instruction in some other cases.

Lesson Learned

I think the time when we had to specialize min/max by using SSE and SSE2 is finally over; compilers are fine doing that for us. However, if you still need to use SSE+ intrinsics, check how the compiler converts your scalars into SIMD registers, especially if you only use them for scalar-only operations.

16 April, 2016

C++ Moving Forward

Time to move forward and require C++11 by default?

Although I don't use much of C++11 I'm becoming tired of making my code backwards compatible by using macros and other tricks. Removing all of this probably won't make much difference for me as I don't really like the STL. However, certain features like noexcept, move semantics, and constexpr are nice additions that I really like, and every time I have to write a macro instead of the keyword I feel like I'm working on a project that is 20 years old.


I already use many C++11 keywords and just define them if they are not available. I have kind of developed my own way of solving this problem without affecting other libraries and user code that is included afterwards. I briefly call it ApiBegin & ApiEnd. For example, the following is from Blend2D's b2d_api_begin.h:

#if !B2D_CC_HAS_ALIGNAS && !defined(alignas)
# define alignas(x) B2D_ALIGNAS(x)
#endif // !B2D_CC_HAS_ALIGNAS && !alignas

#if !B2D_CC_HAS_NOEXCEPT && !defined(noexcept)
# define noexcept B2D_NOEXCEPT
#endif // !B2D_CC_HAS_NOEXCEPT && !noexcept

#if !B2D_CC_HAS_NULLPTR && !defined(nullptr)
# define nullptr NULL
#endif // !B2D_CC_HAS_NULLPTR && !nullptr

And also b2d_api_end.h, which clears such definitions so they do no harm to people including Blend2D:

#if defined(B2D_UNDEF_ALIGNAS)
# undef alignas
#endif // B2D_UNDEF_ALIGNAS

#if defined(B2D_UNDEF_NOEXCEPT)
# undef noexcept
#endif // B2D_UNDEF_NOEXCEPT

#if defined(B2D_UNDEF_NULLPTR)
# undef nullptr
#endif // B2D_UNDEF_NULLPTR

The intention is clear - this small hack makes Blend2D compilable by pre-C++11 compilers and allows the Blend2D code itself to look more modern. I just use these features without making them ugly; for example, noexcept looks much nicer than B2D_NOEXCEPT, etc. However, this also requires really good compiler detection. Instead of solving this per-project I just wrote a tool to generate most of it. It's a monster, I know, but it offers a centralized way of solving the compiler problem, and when a new compiler is released or some feature is wrongly detected, a single fix fixes 'em all.

I don't even see a problem in my approach. It's clean, it makes the project backwards compatible, and it allows me to use at least some features I find useful. However, the problem I do see is that many projects have simply switched to C++11 and don't look back. So I'm asking myself, is it time to switch now, or is it better to wait? It wasn't that long ago that I was asked to make asmjit compilable by VS2003 (yeah), and it was possible (maybe it still is, I don't know). But these seem to be rare cases to base any decision on.


Okay, this was a short boring post :) Next time I should finally write something about Blend2D, but I don't yet have a clue where to start. Maybe I can reveal that I finally bought an AVX2-capable machine, so Blend2D will be released with a fully optimized AVX2 pipeline codegen?

10 April, 2016

Autovectorization Nightmare

The Opportunity of the Compiler to Break Your Code

The C++ language and compilers are moving forward fast. Today it's pretty common that a C++ compiler is able to take advantage of SIMD and automatically vectorize your plain scalar C++ code into something that looks like a monster. Sometimes this may give you some gains, but the biggest disaster is the compiler optimizing code that has already been optimized - like vectorizing lead-in loops that precede optimized loops, or vectorizing initialization loops that are executed just once, etc.

I remember reporting a clang bug where it vectorized a loop containing a sin() call by just unrolling it 8 times and calling the same sin() 8 times sequentially. This of course didn't improve performance at all, but at least it didn't break the code.

Well, it would be cool if the compiler just optimized your code without actually breaking it, but more aggressive optimizations usually lead to more opportunities to break your humble scalar C++ code. And... because the universe is not fair, and Murphy's Law applies to all of us, I hit the issue too, today.

There is a term called undefined behavior, and pretty much everybody writing C++ code knows about it and tries to avoid it. The problem is that on many targets the undefined behavior is actually well-defined undefined behavior. The reason is that you know the platform and you can take advantage of certain constructs when optimizing certain things. So what was the problem? Look at the simple code below:

static void write_unaligned_32(void* ptr, uint32_t value) noexcept {
  if (CPU_UNALIGNED_WRITE_32) { // Illustrative guard; in reality a compile-time target check.
    static_cast<uint32_t*>(ptr)[0] = value;
  }
  else {
    // NOTE: This example doesn't care about endianness (this is LE).
    static_cast<uint8_t*>(ptr)[0] = static_cast<uint8_t>((value      ) & 0xFF);
    static_cast<uint8_t*>(ptr)[1] = static_cast<uint8_t>((value >>  8) & 0xFF);
    static_cast<uint8_t*>(ptr)[2] = static_cast<uint8_t>((value >> 16) & 0xFF);
    static_cast<uint8_t*>(ptr)[3] = static_cast<uint8_t>((value >> 24));
  }
}

This is basically an abstraction that I use to make the code itself more portable and to keep non-portable code in a single specific place. Unaligned reads and writes are fine on x86 architectures and are much faster than reading or writing individual bytes. In addition, if all unaligned writes go through the write_unaligned_32 function, it's easy to assume that all other writes are aligned and to simply locate code that requires unaligned writes for future refactoring, optimization, and sanitization.

The write_unaligned_32 function looks sane. It uses an unaligned write only when the target allows it. The problem, however, is that if you use it in a loop that the compiler tries to auto-vectorize, the compiler may assume that ptr is aligned to 4 bytes, because you store a uint32_t value through it. And this is exactly what happened. The compiler auto-vectorized the loop and used the movdqa instruction to access 16 bytes at once, assuming that the buffer is 16-byte aligned, which led to a general protection fault (#GP). The trickiest part is that this only happened in release mode.

Possible solutions

  • Tell the compiler to not autovectorize? Workaround! Disabling something should never be a solution.
  • Don't use unaligned reads / writes? Workaround! Unnecessarily pessimistic.
  • Tell the compiler the pointer is not aligned? Looks like a solution to me!

There are vendor-specific extensions that can be used to tell the compiler that the pointer isn't aligned the way it thinks. However, some #ifdefs are necessary, as usual, to take advantage of them.

The first thing I did was to add a CC_ASSUME_ALIGNED template to the cxxtool. The initial implementation looks like this:

// \def @prefix@_ASSUME_ALIGNED(p, alignment)
// Assume that the pointer 'p' is aligned to at least 'alignment' bytes.
// (The @prefix@_CC_HAS_* guard names below are illustrative.)
#if @prefix@_CC_HAS_ASSUME_ALIGNED
# define @prefix@_ASSUME_ALIGNED(p, alignment) __assume_aligned(p, alignment)
#elif @prefix@_CC_HAS_BUILTIN_ASSUME_ALIGNED
# define @prefix@_ASSUME_ALIGNED(p, alignment) p = __builtin_assume_aligned(p, alignment)
#else
# define @prefix@_ASSUME_ALIGNED(p, alignment) ((void)0)
#endif

The compiler detection module detects whether __assume_aligned or __builtin_assume_aligned is provided by the compiler, and then uses the one available. If neither is provided, @prefix@_ASSUME_ALIGNED() is simply a NOP, which is fine.

Then the write_unaligned_32 function can be changed to something like this:

static void write_unaligned_32(void* ptr, uint32_t value) noexcept {
  CC_ASSUME_ALIGNED(ptr, 1); // Alignment of 1 = no alignment guarantee at all.
  if (CPU_UNALIGNED_WRITE_32) { // Illustrative guard, as before.
    static_cast<uint32_t*>(ptr)[0] = value;
  }
  else {
    // NOTE: This example doesn't care about endianness (this is LE).
    static_cast<uint8_t*>(ptr)[0] = static_cast<uint8_t>((value      ) & 0xFF);
    static_cast<uint8_t*>(ptr)[1] = static_cast<uint8_t>((value >>  8) & 0xFF);
    static_cast<uint8_t*>(ptr)[2] = static_cast<uint8_t>((value >> 16) & 0xFF);
    static_cast<uint8_t*>(ptr)[3] = static_cast<uint8_t>((value >> 24));
  }
}

What Triggered It

This bug happened in Blend2D's function that reads non-premultiplied ARGB32 pixels from one buffer, converts them to premultiplied ARGB32, and stores them into another buffer. The function has been designed for image codecs, so it must support reading from misaligned memory. This is common, for example, in PNG, which adds one BYTE to each ROW of pixels to specify the row filter, and this additional byte causes all 16-bit and 32-bit rows to become misaligned. The bug has been fixed in both Blend2D and AsmJit.


Having all non-portable code in a single place is always a good idea. Basically, one line fixed a bug that wouldn't be trivial to find for a less experienced developer and that was only triggered by the compiler's auto-vectorization feature. I fixed it by intuition: when I saw the #GP I knew there was an unaligned memory access, and because the function doesn't use SIMD by itself, it was obvious that the compiler did something I didn't ask for. I don't like this feature at all, but it's always good to have code that just works.