29 March, 2016

MPSL's new DSP features

Intro

I spent the last three days working on some MPSL features. After I fixed the AstToIR phase to finally split a 256-bit register into two 128-bit registers (to support pre-AVX machines) and got all tests passing, I started thinking more about the DSP instructions that MPSL could provide. Initially, I kept the language design simple and limited it to four basic types: bool, int, float, and double. However, each variable can form a vector of up to 256 bits, which means that any 32-bit type can form a vector of up to 8 elements, and any 64-bit type a vector of up to 4 elements. The DSP extension allows the int type and its vectors to act temporarily as packed bytes, words, double-words, and quad-words. It doesn't change the type system, though; the inputs and outputs are defined by the DSP intrinsics themselves.
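To make the "int acting as packed data" idea concrete, here is a minimal Python sketch (not MPSL code) that counts how many elements fit into a vector and reinterprets a single 32-bit int as smaller lanes. The helper names `lanes_per_vector` and `as_packed` are made up for this illustration, and the little-endian lane order is an assumption based on x86.

```python
import struct

def lanes_per_vector(element_bits, vector_bits=256):
    """Number of elements of the given width that fit into one vector."""
    return vector_bits // element_bits

def as_packed(value, fmt):
    """Reinterpret one 32-bit int as smaller lanes (little-endian, as on x86).

    fmt is a struct code: 'B' = byte, 'H' = 16-bit word, 'I' = 32-bit dword.
    Hypothetical helper for illustration only; MPSL does this implicitly.
    """
    raw = struct.pack("<I", value & 0xFFFFFFFF)
    count = 4 // struct.calcsize("<" + fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), raw))

print(lanes_per_vector(32))          # 32-bit types: 8 elements per 256 bits
print(lanes_per_vector(64))          # 64-bit types: 4 elements per 256 bits
print(as_packed(0x01000100, "H"))    # the same int seen as two 16-bit words
print(as_packed(0x11223344, "B"))    # the same int seen as four bytes
```

The bits never change; only the intrinsic applied to them decides whether they are treated as bytes, words, or dwords.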

Intrinsics

So far, I have implemented these:

  • Packed addition and subtraction with saturation
  • Packed multiplication
  • Packed minimum and maximum
  • Packed comparison
  • Packed shift by scalar

All DSP intrinsics have names that are close to their assembler mnemonics, for example (not a full listing):

  • paddb does an addition of packed bytes (maps to paddb instruction)
  • psubusw does a subtraction with unsigned saturation of packed words (maps to psubusw instruction)
  • pmulw does a multiplication of packed words and stores the LO word of each dword result (maps to pmullw instruction)
  • pmulhuw does an unsigned multiplication of packed words and stores the HI word of each dword result (maps to pmulhuw instruction)
  • pminud selects a minimum of packed unsigned dwords (maps to pminud instruction)
  • pmaxsw selects a maximum of packed signed words (maps to pmaxsw instruction)
  • pcmpeqb does a comparison of packed bytes (maps to pcmpeqb instruction)
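The per-lane behavior of some of the intrinsics listed above can be modeled in a few lines of Python. This is only a sketch of what the underlying x86 instructions compute, not how MPSL implements them; lanes are represented as lists of unsigned integers, which is an assumption made for readability.

```python
BYTE = 0xFF
WORD = 0xFFFF

def paddb(a, b):
    """Packed byte addition; each lane wraps around modulo 256."""
    return [(x + y) & BYTE for x, y in zip(a, b)]

def psubusw(a, b):
    """Packed word subtraction with unsigned saturation (clamps at 0)."""
    return [max(x - y, 0) for x, y in zip(a, b)]

def pmulw(a, b):
    """Packed word multiplication keeping the LO word (models pmullw)."""
    return [(x * y) & WORD for x, y in zip(a, b)]

def pmulhuw(a, b):
    """Packed unsigned word multiplication keeping the HI word."""
    return [((x * y) >> 16) & WORD for x, y in zip(a, b)]

def pcmpeqb(a, b):
    """Packed byte comparison; equal lanes become 0xFF, others 0x00."""
    return [BYTE if x == y else 0 for x, y in zip(a, b)]

print(paddb([250, 1], [10, 2]))        # first lane wraps: 260 & 0xFF
print(psubusw([5, 100], [10, 30]))     # first lane saturates to 0
print(pmulw([0x1234], [0x0100]))       # low 16 bits of 0x123400
print(pmulhuw([0x1234], [0x0100]))     # high 16 bits of 0x123400
print(pcmpeqb([1, 2], [1, 3]))         # mask of equal lanes
```

Note the difference between wrapping (paddb) and saturating (psubusw) arithmetic: this is why the saturating variants are useful for pixel data, where overflow should clamp instead of wrapping around.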

Example

One can use the new DSP intrinsics to write low-level pixel processing code, for example:

// Embedder provides `bg`, `fg`, and `alpha` variables as packed 16-bit ints stored as `int4`.
int4 main() {
  // HACK: TODO: Creates a 32-bit integer holding two 16-bit values (0x0100 each).
  const int inv = 0x01000100;

  // Lerp background and foreground - scalars automatically promoted to vectors.
  int4 x = pmulw(bg, psubw(inv, alpha));
  int4 y = pmulw(fg, alpha);

  // Combine them and return.
  return psrlw(paddw(x, y), 8);
}

The example is not the best, I know, but MPSL is getting better and I'm adding new concepts every day. The advantage of such a program is that it works with packed 16-bit integers and can process twice as much data as a program working with floats. It's also very simple for embedders to use that shader: maybe 10 lines of C++ to set up the data layout and create the shader.
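To check what the shader above computes in each 16-bit lane, here is a small Python model of the same arithmetic. It assumes that `bg` and `fg` hold 8-bit channel values (0..255) and that `alpha` is in the range 0..256; the post does not state these ranges, so treat them as assumptions. The function name `lerp_lane` is made up for this sketch.

```python
WORD = 0xFFFF

def lerp_lane(bg, fg, alpha):
    """One 16-bit lane of: psrlw(paddw(pmulw(bg, psubw(inv, alpha)), pmulw(fg, alpha)), 8).

    Assumes bg, fg in 0..255 and alpha in 0..256 (not stated in the post).
    """
    inv = 0x0100                              # one lane of 0x01000100
    x = (bg * ((inv - alpha) & WORD)) & WORD  # pmulw(bg, psubw(inv, alpha))
    y = (fg * alpha) & WORD                   # pmulw(fg, alpha)
    return ((x + y) & WORD) >> 8              # psrlw(paddw(x, y), 8)

print(lerp_lane(100, 200, 0))     # alpha = 0 returns the background
print(lerp_lane(100, 200, 256))   # alpha = 256 returns the foreground
print(lerp_lane(100, 200, 64))    # one quarter of the way from bg to fg

# Lanes per 128-bit register: 16-bit integers vs. 32-bit floats.
print(128 // 16, 128 // 32)
```

Under these assumptions the sum `bg * (256 - alpha) + fg * alpha` never exceeds 255 * 256 = 65280, so it fits into a 16-bit lane and the wrapping pmulw and paddw do not lose any bits. The last line shows where the "twice as much data" comes from: a 128-bit register holds eight 16-bit lanes but only four floats.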

Conclusion

There are currently more important things to implement, like branching and loops; however, adding DSP intrinsics wasn't for profit, so why not :) The DSP implementation is not complete either: some useful intrinsics are still missing, such as packing and unpacking, some peculiarities like pmaddwd and psadbw, and also element swizzling, which is next on my list.

Update

There is now a working example, mp_dsp, that creates and executes the shader.
