20 April, 2016

Blend2D's Rasterizer Debugging

[Or] How to Visualize the Rasterizer's Output Differently

This is a first post about Blend2D and it will be a bit unusual. Today I implemented one feature that is only interesting for developers. It allows to visually distinguish between consecutive rasterizer's spans. Blend2D's rasterizer is based on AntiGrain (or AGG) and also uses cells, which are used to calculate the final alpha of resulting pixels after the whole input is processed.

There is a fundamental difference between how the cells are stored in Blend2D and AGG. AGG's approach is to append cells into cell blocks (each cell then includes its coordinate as well) and at the end of the rasterization process it does cell traversals to gather the number of cells per each scanline, to merge all cells that have the same coordinate, and sort the rest. These additional passes are actually the most expensive ones. I don't want to go into much detail here, but it simply requires a lot of cycles to prepare all the cells and to generate spans that can be passed to the compositor.

Blend2D uses a different approach. It introduces a data structure called Span (don't confuse with AGG spans here), which is simply an array of cells. When the rasterizer generates cells it always groups them into spans and then inserts these spans into to the scanline container (also don't confuse with AGG's scanline here), which keeps them sorted. This way introduces some overhead at a rasterization phase, but prevents the additional traversal and sorting at a later phase. Additionally, this approach prevents to have multiple cells that share the same coordinate. If a cell already exists it's merged with it instead of creating a new one. Blend2D's approach beats the AGG rasterizer significantly and keeps the memory requirements during rasterization low - it will not grow infinitely, whereas AGG fails to continue when it reaches the hard-coded cell limit.

But Nothing is Perfect

But nothing is perfect and Blend2D is no exception. Currently, when the rasterizer inserts a span into the scanline, it will stay there forever. More cells can be merged into it, but the data structure itself will hold the span as is and won't coalesce it with other newly inserted span(s) even if they are right next to it. This approach is beneficial for shapes that are not very complex, because it allows to render them quickly. However, this approach fails when the input shape contains a lot of small line segments, and it happens quite often, especially when rendering complex art and strokes.

And that's the area where I would like to improve Blend2D. Before I can do any improvement I have to to see some data first. However, rasterizer produces a lot of information and it would be highly inefficient to study the text form of it. So instead of printing text, I introduced a new property that is called debugFlags, and hacked the pipeline generator to use a special fetcher if it's set. When enabled, it flips the color each time a new span is processed.


To demonstrate what it does take a look at the following circle:
(note: all images were upscaled to make the details more visible)

And now look at the debug version:

Not bad although some erratic lines are visible. All lines that are red contain at least two spans at the beginning of the scanline - the second span was inserted after the first without coalescing (these spans most likely have just one cell).

Stroke shows some erratic spans as well, although still acceptable.

Random Paths

Let's look at something more complex that can reveal a bit more:

The same shape but stroked:

Pathological Stroke

A stroke of a path that contains more than 100 quadratic curves.

Interpreting the Images

The first 3 circles didn't look as bad as the polygons that follow. However, even the circle's stroke could be improved, because small spans degrade the ability of the compositor to process multiple pixels at a time. Blend2D's compositor can process more than 4 pixels at a time (maximum 16), which means that if the scanline contains a lot of small spans the compositor doesn't use the full width of SIMD registers available. The biggest overhead is caused by single-cell spans, which are commonly generated by lines of vertical slope.

Another problem is the pathological stroke. The shape itself is pathological and it's not common in 2D graphics, however, it shows that the rasterizer's inability to coalesce neighboring spans doesn't scale well. More spans the rasterizer has to keep track of means more overhead to insert new span into that list.

Some Code

The only thing I had to do to implement the debug-mode was to add new flags (debugFlags) to the 2D context and to implement a new debug-fetch (including some logic to select it). The initial version of the debug-fetch looks like this:

void SWPipeFetchPart::debugFetch1(SWPipePixelARGB& p, uint32_t flags) noexcept {
  X86GpVar dpTmp0 = c->newUInt32("dpTmp0");
  X86GpVar dpTmp1 = c->newUInt32("dpTmp1");

  c->mov(dpTmp0, g->_debugVar);
  c->mov(dpTmp1, g->_debugVar);
  c->and_(dpTmp0, 0xFFFFFFFF);
  c->and_(dpTmp1, 0xFFFF5050);
  c->add(dpTmp0, dpTmp1);

  p.pc[0] = c->newXmm("dpXmm");
  c->movd(p.pc[0], dpTmp0);

  g->xSatisfyARGB32_1x(p, flags);

Nothing fancy or over-complicated, just straightforward asmjit :) It checks the _debugVar (it's a mask that has all bits zero or all bits set) and generates a white or red pixel based on it. It then stores the pixel to p.pc[], which means Packed Color (the pixel that is packed and contains a color information - ARGB). At the end it calls the pipeline function to convert the pixel into the format that was requested by the compositor.

Further Development

This tool gave me some information that I can use to improve the rasterizer's performance. The rasterizer debug-mode can can be enabled at runtime (no recompilation needed) and causes no overhead to the rasterizer itself or to the rendering context. I just need to find the right constants I can use for span coalescing (like how far two spans can be to coalesce them) and to tweak the rasterizer. Benchmark suite will be then used to verify the performance gain and to make sure the change didn't regress.

I was thinking many times of making the rasterizer tile-based, or more precisely, to make all spans fixed size and to manage them differently. A radical change like this would make pathological cases (and very complex renderings) faster while hindering the common paths. I'm not fully convinced yet, maybe an adaptive algorithm that would switch to a tile-based approach (per scanline) after the number of spans the scanline is holding exceeds some threshold. Switching algorithm is not a high priority at the moment, but the rasterizer will be updated for sure before the beta release.

No comments:

Post a Comment