Everything Started with a Bug
Yesterday I finally implemented some code paths that use the AVX and AVX2 extensions to accelerate matrix transformations in Blend2D. Of course the initial version didn't work, because I made a small mistake in vector shuffling that caused the function to render something different than expected. I'm usually done with this kind of bug very quickly, but this time the mistake was in a shuffle predicate, and it was harder to find than usual. After 15 minutes of staring at the code I disassembled the whole function, because I started to suspect a compiler bug. And I'm glad I looked at the disassembly, because I found some ridiculous code generated by the C++ compiler that I simply couldn't understand.
Don't Trust C++ Compiler
As the title suggests, the C++ compiler failed me again by generating ridiculous code around relatively simple loops. Manually vectorized C++ code usually requires at least two loops (one vectorized and one scalar), so the implementation typically looks like this:
void proc_v1(double* dst, const double* src, size_t length) {
  // Main loop - process 4 units at a time (vectorized).
  for (size_t i = length >> 2; i; i--) {
    ...
    dst += 4;
    src += 4;
  }
  // Tail loop - process 1 unit at a time (scalar, remaining input).
  for (size_t i = length & 3; i; i--) {
    ...
    dst += 1;
    src += 1;
  }
}
This is a pretty standard pattern used in many libraries. However, when you need to save registers and you write hand-optimized loops in assembly, you usually don't translate this construct to machine code directly, because it costs you two registers. Instead, the following trick can be used:
void proc_v2(double* dst, const double* src, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);
  // Main loop - process 4 units at a time (vectorized).
  while ((i -= 4) >= 0) {
    ...
    dst += 4;
    src += 4;
  }
  // Now `i` is negative, so normalize it to get the remaining count.
  i += 4;
  // Tail loop - process 1 unit at a time (scalar, remaining input).
  while (i) {
    ...
    dst += 1;
    src += 1;
    i--;
  }
}
This can be translated to assembly 1:1 and uses only a single register for the counter. I don't personally know of any case where this approach is slower than the previous one, and if the size of the item you are processing is greater than one byte it's always safe. (I would say it's safe regardless of the size, as I doubt a 32-bit OS can hand a user process 2GB of contiguous memory, but let's stay on the safe side.)
Translated to x86 asm, the code would look like this:
; RDI - destination
; RSI - source
; RCX - counter
; Main loop header.
sub rcx, 4 ; Try if we can use the vectorized loop.
js MainSkip ; If negative there are less than 4 items left.
MainLoop:
... ; Some work to do...
add rdi, 32 ; Advance destination, 4 * sizeof(double).
add rsi, 32 ; Advance source.
sub rcx, 4 ; Decrement the loop counter.
jns MainLoop ; Jump if not negative.
MainSkip:
add rcx, 4 ; Fix the loop counter.
jz TailSkip ; Exit if zero.
TailLoop:
... ; Some work to do...
add rdi, 8 ; Advance destination, 1 * sizeof(double).
add rsi, 8 ; Advance source.
sub rcx, 1 ; Decrement the loop counter.
jnz TailLoop ; Jump if not zero.
TailSkip:
The assembly is not just nice, it's also very compact and should perform well on pretty much any hardware.
Working Example
I can't use an empty loop body to demonstrate how badly C++ compilers understand this code, so I wrote a very simple function that does some actual work:
#include <stdint.h>
#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// Destination and source are arrays of points (x|y pairs); the function
// just adds each point to itself (the `matrix` argument is unused here).
// Note that it processes `length` points, i.e. `length * 2` doubles - you
// can look at it as a `struct Point { double x, y; }` array casted to
// `double*`.
void transform(double* dst, const double* src, const double* matrix, size_t length) {
intptr_t i = static_cast<intptr_t>(length);
while ((i -= 8) >= 0) {
__m256d s0 = _mm256_loadu_pd(src + 0);
__m256d s1 = _mm256_loadu_pd(src + 4);
__m256d s2 = _mm256_loadu_pd(src + 8);
__m256d s3 = _mm256_loadu_pd(src + 12);
_mm256_storeu_pd(dst + 0, _mm256_add_pd(s0, s0));
_mm256_storeu_pd(dst + 4, _mm256_add_pd(s1, s1));
_mm256_storeu_pd(dst + 8, _mm256_add_pd(s2, s2));
_mm256_storeu_pd(dst + 12, _mm256_add_pd(s3, s3));
dst += 16;
src += 16;
}
i += 8;
while ((i -= 2) >= 0) {
__m256d s0 = _mm256_loadu_pd(src);
_mm256_storeu_pd(dst, _mm256_add_pd(s0, s0));
dst += 4;
src += 4;
}
if (i & 1) {
__m128d s0 = _mm_loadu_pd(src);
_mm_storeu_pd(dst, _mm_add_pd(s0, s0));
}
}
The function performs only a very simple operation on its inputs, just enough to distinguish the loop body from everything else.
GCC (v7 -O2 -mavx -fno-exceptions -fno-tree-vectorize)
transform(double*, double const*, double const*, unsigned long):
; --- Ridiculous code ---
mov r9, rcx
mov r8, rcx
sub r9, 8
js .L6
mov rax, r9
sub rcx, 16
mov r8, r9
and rax, -8
mov rdx, rsi
sub rcx, rax
mov rax, rdi
; -----------------------
.L5:
vmovupd xmm3, XMMWORD PTR [rdx]
sub r8, 8
sub rax, -128
sub rdx, -128
vmovupd xmm2, XMMWORD PTR [rdx-96]
vinsertf128 ymm3, ymm3, XMMWORD PTR [rdx-112], 0x1
vmovupd xmm1, XMMWORD PTR [rdx-64]
vinsertf128 ymm2, ymm2, XMMWORD PTR [rdx-80], 0x1
vmovupd xmm0, XMMWORD PTR [rdx-32]
vinsertf128 ymm1, ymm1, XMMWORD PTR [rdx-48], 0x1
vaddpd ymm3, ymm3, ymm3
vinsertf128 ymm0, ymm0, XMMWORD PTR [rdx-16], 0x1
vaddpd ymm2, ymm2, ymm2
vaddpd ymm1, ymm1, ymm1
vmovups XMMWORD PTR [rax-128], xmm3
vaddpd ymm0, ymm0, ymm0
vextractf128 XMMWORD PTR [rax-112], ymm3, 0x1
vmovups XMMWORD PTR [rax-96], xmm2
vextractf128 XMMWORD PTR [rax-80], ymm2, 0x1
vmovups XMMWORD PTR [rax-64], xmm1
vextractf128 XMMWORD PTR [rax-48], ymm1, 0x1
vmovups XMMWORD PTR [rax-32], xmm0
vextractf128 XMMWORD PTR [rax-16], ymm0, 0x1
; --- Instead of using sub/jns it uses sub/cmp/jne ---
cmp r8, rcx
jne .L5
; ----------------------------------------------------
; --- Ridiculous code ---
mov rax, r9
shr rax, 3
lea rdx, [rax+1]
neg rax
lea r8, [r9+rax*8]
sal rdx, 7
add rdi, rdx
add rsi, rdx
; -----------------------
.L6:
; --- Ridiculous code ---
mov rdx, r8
sub rdx, 2
js .L4
shr rdx
xor eax, eax
lea rcx, [rdx+1]
sal rcx, 5
; -----------------------
.L7:
vmovupd xmm0, XMMWORD PTR [rsi+rax]
vinsertf128 ymm0, ymm0, XMMWORD PTR [rsi+16+rax], 0x1
vaddpd ymm0, ymm0, ymm0
vmovups XMMWORD PTR [rdi+rax], xmm0
vextractf128 XMMWORD PTR [rdi+16+rax], ymm0, 0x1
; --- Instead of using sub/jns it uses add/cmp/jne ---
add rax, 32
cmp rax, rcx
jne .L7
; ----------------------------------------------------
; --- Ridiculous code ---
neg rdx
add rdi, rax
add rsi, rax
lea rdx, [r8-4+rdx*2]
; -----------------------
.L4:
and edx, 1
je .L14
vmovupd xmm0, XMMWORD PTR [rsi]
vaddpd xmm0, xmm0, xmm0
vmovups XMMWORD PTR [rdi], xmm0
.L14:
vzeroupper
ret
No comment - GCC magically transformed the initial 9 instructions into 37, making the code both larger and slower!
Clang (3.9.1 -O2 -mavx -fno-exceptions -fno-tree-vectorize)
transform(double*, double const*, double const*, unsigned long):
; --- Ridiculous code ---
mov r10, rcx
add r10, -8
js .LBB0_7
mov r11, r10
shr r11, 3
mov r8, r11
shl r8, 4
lea r9d, [r11 + 1]
and r9d, 1
mov rdx, rdi
mov rax, rsi
test r11, r11
je .LBB0_4
lea rcx, [r9 - 1]
sub rcx, r11
mov rdx, rdi
mov rax, rsi
; -----------------------
.LBB0_3: # =>This Inner Loop Header: Depth=1
vmovupd ymm0, ymmword ptr [rax]
vmovupd ymm1, ymmword ptr [rax + 32]
vmovupd ymm2, ymmword ptr [rax + 64]
vmovupd ymm3, ymmword ptr [rax + 96]
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rdx], ymm0
vaddpd ymm0, ymm1, ymm1
vmovupd ymmword ptr [rdx + 32], ymm0
vaddpd ymm0, ymm2, ymm2
vmovupd ymmword ptr [rdx + 64], ymm0
vaddpd ymm0, ymm3, ymm3
vmovupd ymmword ptr [rdx + 96], ymm0
vmovupd ymm0, ymmword ptr [rax + 128]
vmovupd ymm1, ymmword ptr [rax + 160]
vmovupd ymm2, ymmword ptr [rax + 192]
vmovupd ymm3, ymmword ptr [rax + 224]
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rdx + 128], ymm0
vaddpd ymm0, ymm1, ymm1
vmovupd ymmword ptr [rdx + 160], ymm0
vaddpd ymm0, ymm2, ymm2
vmovupd ymmword ptr [rdx + 192], ymm0
vaddpd ymm0, ymm3, ymm3
vmovupd ymmword ptr [rdx + 224], ymm0
add rdx, 256
add rax, 256
; --- Instead of using sub/jns it uses add/jne ---
add rcx, 2
jne .LBB0_3
; ------------------------------------------------
; --- CLANG Unrolled the tail loop - OMG ---
.LBB0_4:
shl r11, 3
lea rcx, [r8 + 16]
test r9, r9
je .LBB0_6
vmovupd ymm0, ymmword ptr [rax]
vmovupd ymm1, ymmword ptr [rax + 32]
vmovupd ymm2, ymmword ptr [rax + 64]
vmovupd ymm3, ymmword ptr [rax + 96]
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rdx], ymm0
vaddpd ymm0, ymm1, ymm1
vmovupd ymmword ptr [rdx + 32], ymm0
vaddpd ymm0, ymm2, ymm2
vmovupd ymmword ptr [rdx + 64], ymm0
vaddpd ymm0, ymm3, ymm3
vmovupd ymmword ptr [rdx + 96], ymm0
.LBB0_6:
lea rdi, [rdi + 8*r8 + 128]
sub r10, r11
lea rsi, [rsi + 8*rcx]
mov rcx, r10
.LBB0_7:
mov r10, rcx
add r10, -2
js .LBB0_15
mov r8, r10
shr r8
lea r9d, [r8 + 1]
and r9d, 3
mov rax, rdi
mov rdx, rsi
cmp r10, 6
jb .LBB0_11
lea r10, [r9 - 1]
sub r10, r8
mov rax, rdi
mov rdx, rsi
.LBB0_10: # =>This Inner Loop Header: Depth=1
vmovupd ymm0, ymmword ptr [rdx]
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rax], ymm0
vmovupd ymm0, ymmword ptr [rdx + 32]
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rax + 32], ymm0
vmovupd ymm0, ymmword ptr [rdx + 64]
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rax + 64], ymm0
vmovupd ymm0, ymmword ptr [rdx + 96]
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rax + 96], ymm0
sub rax, -128
sub rdx, -128
add r10, 4
jne .LBB0_10
.LBB0_11:
lea r10, [4*r8 + 4]
lea r11, [4*r8]
add r8, r8
test r9, r9
je .LBB0_14
neg r9
.LBB0_13: # =>This Inner Loop Header: Depth=1
vmovupd ymm0, ymmword ptr [rdx]
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rax], ymm0
add rax, 32
add rdx, 32
inc r9
jne .LBB0_13
.LBB0_14:
lea rsi, [rsi + 8*r11 + 32]
lea rdi, [rdi + 8*r10]
add rcx, -4
sub rcx, r8
mov r10, rcx
; --- End of the unrolled loop ---
.LBB0_15:
test r10b, 1
je .LBB0_17
vmovupd xmm0, xmmword ptr [rsi]
vaddpd xmm0, xmm0, xmm0
vmovupd xmmword ptr [rdi], xmm0
.LBB0_17:
vzeroupper
ret
This is even worse than I would have expected - Clang unrolled the tail loop and generated optimized code for something that will never execute more than once. It completely misunderstood the code. Also, instead of following the C++ code, it inserted dozens of instructions just to help with the unrolling; I won't even count them, the code is that ridiculous.
MSVC (v19 /O2 /arch:AVX)
transform, COMDAT PROC
; --- Ridiculous code ---
add r9, -8
js SHORT $LN3@transform
lea r8, QWORD PTR [r9+8]
shr r8, 3
mov rax, r8
neg rax
lea r9, QWORD PTR [r9+rax*8]
npad 8
; -----------------------
$LL2@transform:
vmovupd ymm0, YMMWORD PTR [rdx]
vmovupd ymm1, YMMWORD PTR [rdx+32]
vmovupd ymm4, YMMWORD PTR [rdx+64]
vmovupd ymm5, YMMWORD PTR [rdx+96]
vaddpd ymm0, ymm0, ymm0
vmovupd YMMWORD PTR [rcx], ymm0
vaddpd ymm1, ymm1, ymm1
vmovupd YMMWORD PTR [rcx+32], ymm1
vaddpd ymm0, ymm4, ymm4
vmovupd YMMWORD PTR [rcx+64], ymm0
vaddpd ymm1, ymm5, ymm5
vmovupd YMMWORD PTR [rcx+96], ymm1
sub rcx, -128 ; ffffffffffffff80H
sub rdx, -128 ; ffffffffffffff80H
; --- Instead of using sub/jns it uses sub/jne ---
sub r8, 1
jne SHORT $LL2@transform
; ------------------------------------------------
$LN3@transform:
; --- Ridiculous code ---
add r9, 6
js SHORT $LN5@transform
lea r8, QWORD PTR [r9+2]
shr r8, 1
mov rax, r8
neg rax
lea r9, QWORD PTR [r9+rax*2]
npad 5
; -----------------------
$LL4@transform:
vmovupd ymm0, YMMWORD PTR [rdx]
vaddpd ymm2, ymm0, ymm0
vmovupd YMMWORD PTR [rcx], ymm2
add rcx, 32 ; 00000020H
add rdx, 32 ; 00000020H
; --- Instead of using sub/jns it uses sub/jne ---
sub r8, 1
jne SHORT $LL4@transform
; ------------------------------------------------
$LN5@transform:
test r9b, 1
je SHORT $LN6@transform
vmovupd xmm0, XMMWORD PTR [rdx]
vaddpd xmm2, xmm0, xmm0
vmovupd XMMWORD PTR [rcx], xmm2
$LN6@transform:
vzeroupper
ret 0
Actually not bad; it still does things I didn't ask for, but it's much better than GCC and Clang in this case.
ICC (v17 -O2 -mavx -fno-exceptions)
transform(double*, double const*, double const*, unsigned long):
add rcx, -8 #11.11 FOLLOWS C++ code!
js ..B1.5 # Prob 2% #11.22
..B1.3: # Preds ..B1.1 ..B1.3
vmovupd xmm0, XMMWORD PTR [rsi] #12.34
vmovupd xmm1, XMMWORD PTR [32+rsi] #13.34
vmovupd xmm2, XMMWORD PTR [64+rsi] #14.34
vmovupd xmm3, XMMWORD PTR [96+rsi] #15.34
vinsertf128 ymm6, ymm1, XMMWORD PTR [48+rsi], 1 #13.34
vinsertf128 ymm4, ymm0, XMMWORD PTR [16+rsi], 1 #12.34
vinsertf128 ymm8, ymm2, XMMWORD PTR [80+rsi], 1 #14.34
vinsertf128 ymm10, ymm3, XMMWORD PTR [112+rsi], 1 #15.34
add rsi, 128 #23.5
vaddpd ymm5, ymm4, ymm4 #17.32
vaddpd ymm7, ymm6, ymm6 #18.32
vaddpd ymm9, ymm8, ymm8 #19.32
vaddpd ymm11, ymm10, ymm10 #20.32
vmovupd XMMWORD PTR [rdi], xmm5 #17.22
vmovupd XMMWORD PTR [32+rdi], xmm7 #18.22
vmovupd XMMWORD PTR [64+rdi], xmm9 #19.22
vmovupd XMMWORD PTR [96+rdi], xmm11 #20.22
vextractf128 XMMWORD PTR [16+rdi], ymm5, 1 #17.22
vextractf128 XMMWORD PTR [48+rdi], ymm7, 1 #18.22
vextractf128 XMMWORD PTR [80+rdi], ymm9, 1 #19.22
vextractf128 XMMWORD PTR [112+rdi], ymm11, 1 #20.22
add rdi, 128 #22.5
add rcx, -8 #11.11
jns ..B1.3 # Prob 82% #11.22 FOLLOWS C++ code!
..B1.5: # Preds ..B1.3 ..B1.1
add rcx, 6 #25.3
js ..B1.9 # Prob 2% #27.22 FOLLOWS C++ code!
..B1.7: # Preds ..B1.5 ..B1.7
vmovupd xmm0, XMMWORD PTR [rsi] #28.34
vinsertf128 ymm1, ymm0, XMMWORD PTR [16+rsi], 1 #28.34
add rsi, 32 #32.5
vaddpd ymm2, ymm1, ymm1 #29.27
vmovupd XMMWORD PTR [rdi], xmm2 #29.22
vextractf128 XMMWORD PTR [16+rdi], ymm2, 1 #29.22
add rdi, 32 #31.5
add rcx, -2 #27.11 FOLLOWS C++ code!
jns ..B1.7 # Prob 82% #27.22
..B1.9: # Preds ..B1.7 ..B1.5
test rcx, 1 #35.11
je ..B1.11 # Prob 60% #35.11
vmovupd xmm0, XMMWORD PTR [rsi] #36.31
vaddpd xmm1, xmm0, xmm0 #37.24
vmovupd XMMWORD PTR [rdi], xmm1 #37.19
..B1.11: # Preds ..B1.9 ..B1.10
vzeroupper #39.1
ret #39.1
ICC is the clear winner here. I don't know why the other compilers cannot generate code like this - it's small and follows the source.
Bonus: GCC with -Os
transform(double*, double const*, double const*, unsigned long):
.L3:
mov rax, rcx
sub rax, 8
js .L2
vmovupd ymm3, YMMWORD PTR [rsi]
sub rdi, -128
sub rsi, -128
vmovupd ymm2, YMMWORD PTR [rsi-96]
mov rcx, rax
vmovupd ymm1, YMMWORD PTR [rsi-64]
vaddpd ymm3, ymm3, ymm3
vmovupd ymm0, YMMWORD PTR [rsi-32]
vaddpd ymm2, ymm2, ymm2
vaddpd ymm1, ymm1, ymm1
vmovupd YMMWORD PTR [rdi-128], ymm3
vaddpd ymm0, ymm0, ymm0
vmovupd YMMWORD PTR [rdi-96], ymm2
vmovupd YMMWORD PTR [rdi-64], ymm1
vmovupd YMMWORD PTR [rdi-32], ymm0
jmp .L3 ; Well, I asked -Os, but here sub/jns won't hurt
.L2:
sub rcx, 2
js .L4
vmovupd ymm0, YMMWORD PTR [rsi]
add rdi, 32
add rsi, 32
vaddpd ymm0, ymm0, ymm0
vmovupd YMMWORD PTR [rdi-32], ymm0
jmp .L2 ; Also here...
.L4:
and cl, 1
je .L1
vmovupd xmm0, XMMWORD PTR [rsi]
vaddpd xmm0, xmm0, xmm0
vmovups XMMWORD PTR [rdi], xmm0
.L1:
ret
Much closer to the expected code. However, since I asked for the smallest possible code, it didn't put the conditional jumps at the end of the loops, which is not good in general.
Bonus: Clang with -Oz (which should be even smaller than -Os)
transform(double*, double const*, double const*, unsigned long): # @transform(double*, double const*, double const*, unsigned long)
push 7
pop rax
sub rax, rcx
mov rdx, rcx
shr rdx, 3
xor r10d, r10d
test rax, rax
cmovle r10, rdx
mov r9, r10
shl r9, 4
lea r8, [8*r10]
shl r10, 7
add r10, rsi
mov rdx, rcx
mov rax, rdi
jmp .LBB0_1
.LBB0_8: # in Loop: Header=BB0_1 Depth=1
vmovupd ymm0, ymmword ptr [rsi]
vmovupd ymm1, ymmword ptr [rsi + 32]
vmovupd ymm2, ymmword ptr [rsi + 64]
vmovupd ymm3, ymmword ptr [rsi + 96]
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rax], ymm0
vaddpd ymm0, ymm1, ymm1
vmovupd ymmword ptr [rax + 32], ymm0
vaddpd ymm0, ymm2, ymm2
vmovupd ymmword ptr [rax + 64], ymm0
vaddpd ymm0, ymm3, ymm3
vmovupd ymmword ptr [rax + 96], ymm0
sub rsi, -128
sub rax, -128
.LBB0_1: # =>This Inner Loop Header: Depth=1
add rdx, -8
jns .LBB0_8
sub rcx, r8
lea r8, [rdi + 8*r9]
push 1
pop rax
sub rax, rcx
lea rdx, [rcx + rcx]
and rdx, -4
xor esi, esi
test rax, rax
cmovle rsi, rdx
mov rax, rcx
mov rdi, r10
mov rdx, r8
jmp .LBB0_3
.LBB0_4: # in Loop: Header=BB0_3 Depth=1
vmovupd ymm0, ymmword ptr [rdi]
add rdi, 32
vaddpd ymm0, ymm0, ymm0
vmovupd ymmword ptr [rdx], ymm0
add rdx, 32
.LBB0_3: # =>This Inner Loop Header: Depth=1
add rax, -2
jns .LBB0_4
test cl, 1
je .LBB0_7
vmovupd xmm0, xmmword ptr [r10 + 8*rsi]
vaddpd xmm0, xmm0, xmm0
vmovupd xmmword ptr [r8 + 8*rsi], xmm0
.LBB0_7:
vzeroupper
ret
No, no, no - clang just doesn't like compact code...:(
Conclusion
People who say that C++ compilers always generate better code than humans are simply wrong. I'm not saying they always generate worse code, but it seems that the newer the compiler, the more instructions it emits, and that's becoming scary. One more argument for Blend2D and its hand-optimized pipelines!
You can play with the code here.
Thanks for reading, and remember that this is not a hate post. GCC bug #79830 (with slightly different sample code) and Clang bug #32132 have already been reported. Let's hope this can be improved, as it actually affects the performance of non-JIT code in Blend2D.
Update
After reading some comments about this post on other sites, I think people really misunderstood it and tried to put it into a different context. Please note that what is inside the loops is not important here. What is important is how various C++ compilers translate a valid loop construct into assembly, and the impact that has on the size of the generated code. So instead of justifying the bloated code shown above, let's fix it - it won't hurt to generate smaller code in this case. And I really think that if the commercial ICC compiler can generate perfect code for this case, GCC and Clang should be able to do it too, because these are the compilers I use daily...
sub/jns or add/jns don't macro-fuse, so I don't really get why you want them.
Did you try hinting clang not to unroll the tail loop with #pragma nounroll?
@Harold - You can replace all sub/jns sequences with sub/jnc, which WILL macro-fuse. I have no idea how to play with the carry flag in C++ code, so checking the sign is what people usually do when they want to generate similar assembly. I have seen the same trick in the Skia library, so it's not something completely alien to do in C++.
@Hana No, I try to avoid such pragmas in my code; I usually don't want the compiler to unroll my loops anyway, and the code I don't like is not strictly related to unrolling - see Clang with -Oz, which expands to the same code, but without the loop unrolled.
So, was this actually the source of the bug? Or was the mistake elsewhere, but you found this too?
The bug was not related to this, but it led me to take a closer look at the disassembly, and I just didn't like what the compiler does with my code :)
Interesting post, thanks. Have you actually benchmarked the code? Even if the code indeed looks ridiculous, it may not have a big impact on the run time, and may even be faster.
ReplyDeleteThere should be
ReplyDeletejnz TailLoop ; Jump if not zero.
instead of
jnz MainLoop ; Jump if not zero.
in x86 asm code (listing 3)?
@Michael Good eyes, good catch! Fixed, many thanks!
Seems like the compiler is having issues with the analysis due to aliasing. This example seems to produce minimal assembly code:
https://godbolt.org/g/WH18xx
@Yalo, I know that restrict sometimes helps, but I couldn't use it, as the API allows `dst == src` as a special case; otherwise it expects non-overlapping source and destination. In addition, I don't like relying on the compiler's ability to auto-vectorize my code - it has failed many times, so I'd rather write the SIMD myself in these cases. It's just a few .cpp files where I use SSE2 and AVX/AVX2 code anyway, and AVX2 integer code is very tricky due to the separation of the 128-bit halves of YMM registers, so I always think twice before writing any AVX2 code, to avoid ending up permuting so much.
BTW, I would have to look at the standard to verify whether __restrict__ allows `src == dst`, but I'm pretty sure it must always be non-overlapping.
So, while not bad, I still think compilers can do a bit better.
I'm still thinking about the #pragma nounroll. If this pragma is supported by all mainstream compilers it could be pretty nice annotation of some loops that I know don't need unrolling, but compiler could struggle to decide so.
ReplyDeleteThe solution is to contribute your experience to both projects.
Only constructive comments please.
ReplyDeleteI had the same experience as you with various compilers. I am a developer that is coding in C and ASM since 1988. I now perfectly well the assembly language from the C64, Amiga, i386, and the "modern" pc (SSE). My day job is to develop low-level (fast!) tools for AI and Machine Learning.
From my experience, the Visual Studio compilers are the best "all-around" compilers I have found nowadays. I have tested the ICC compiler (the Intel compiler) a few times and it was indeed very good (even better than VS).
I don't understand why practically nobody (i.e. no open source C/C++ libraries) supports ICC. If I need to create good (i.e. fast) software (such as the "TIMi Suite" on which I am currently working), my first choice is ICC, but then, each time I want to use an open source library, I am just stuck with Visual Studio (or even worse: gcc) because nobody supports ICC. AAArrgh!
Actually, I also tested various flavor of the visual studio compiler (VS2008, VS2012, VS2015) and I found that, if you stick to "common" C/C++ code structures, the VS2008 compiler produces the best (i.e. fastest) executables.
Actually, my feeling is that a large quantity of work was injected into the latest Visual Studio versions to make them c++x11 compatible but this is at the expense of the final speed of the produced asm code (at least, this is what I noticed on my limited test set). So, currently, I am still using VS2008 and I will certainly "switch" to ICC soon (since VS2008 is quite old now).
The annoying thing is the following: VS2008 and ICC are 2 compilers that are focused on speed and only provide very limited (i.e. bad/slow) support for the "new" cppx11 extensions that are heavily "promoted" by gcc users! Aaaaargh! So, I will be left with a terrible choice: I can either:
1. use gcc (as everybody) and then I will have access to all the open-source libraries that are using "recent" (i.e. c++x11) code. ..but then my tools will be terribly slow.
2. use an older but better compiler (VS2008 or ICC) and then I'll be forced to re-write all the open-source libraries that I am downloading to get rid of the c++x11 garbage. ...but, at least, my executables are fast.
Up to now, I am sticking with the second choice. So, it was with great pleasure that I saw that "mathpresso" was compiling without nearly any trouble on my VS2008! (there were less than 10 lines of code filled with c++x11 garbage that I had to "convert" to old-school C/C++ code). That's great! Thank you a lot for that!
Unfortunately, I had a quick look at the new "next-wip" branch of the asmjit project and I noticed that you are now using a lot of c++x11 garbage. As you might have realized at this point (i.e. I am referring to another post from you in this forum where you realize how ridiculous these c++x11 stuffs are), the VS compilers (and ICC) have difficulties producing correct code when the c++x11 garbage is used. Furthermore, I think that the ICC compiler (i.e. the best compiler available, which I'll soon use instead of VS2008) will never fully support that c++x11 garbage (only gcc/clang support it correctly, but these are terrible compilers! Aaaaargh!).
I noticed that you are somebody who is interested in getting the best speed out of the CPU. So, my final advice to you (just keep my advice or drop it! ;-D ) is to "stick" with the "classical/old school" C/C++ code (without the c++x11 stuffs), so that the best/fastest compilers (ICC/VS2008) can compile your library. I know that it's tempting to "switch" to all the "new wave" c++x11 functionalities, but the reality is, if you want to produce the best/fastest executables (and I think that we both want that), we should (currently) avoid this c++x11 non-sense.
I am not the only one complaining about the ridiculous direction taken by the people working on compilers nowadays (especially the gcc team, which seems to "lead the way" in the exact opposite direction of speed). The best developers (e.g. John Carmack, Linus Torvalds) are saying the exact same thing as me. For example, inside the latest ID5 engine, John Carmack used very simple "old school" C/C++ and clearly states that these c++x11 stuffs are just non-sense.
Anyway, keep up the good work!
See you!
Frank
Hello, I'm not against the progress in the C++ language. I'm adapting and trying to make my code easier to maintain as everybody else. I have always stayed behind the newest C++ version for compatibility and I have always been picky about C++ features that I use, however, I think that using some of C++11 in 2019 is really just okay and that it won't hurt the future of the project.
I fully switched to clang for development, but I always try to make my code compile on all possible compilers. At the moment next-wip requires VS2017, as I had problems with VS2015 (bugs on the compiler side that were hard to work around); however, I have reorganized the code a bit, so maybe 2015 is now okay. I don't plan to support versions of Visual Studio older than VS2015, as I would like to move forward, not back.
(the deleted comment above was a first version of this comment).
Hello Petr,
I have been using your MathPresso library for a few weeks now and it works great! Many thanks for this great library! I'm wondering: would it be possible to add support for the "&&" and "||" operators inside the library?