关于 c:gcc 简单算术循环性能

gcc simple arithmetics loop performance

问题：明显多出一行代码会使程序加速近两倍。

这是一个很难表述的原始问题，它来自边界检查消除算法。所以，只是一些我无法理解的简单测试。

明显多出一行代码可以使程序加速近两倍。

有以下来源：

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
long i = 0, a = 0, x = 0;
int up = 200000000;

int *values = malloc(sizeof(int)*up);

for (i = 0; i < up ; ++i)
{
values[i]=i % 2;
}
for (i = 0; i < up ; ++i)
{
x = (a & i);
#ifdef FAST
x = 0;
#endif
a += values[x];
}
printf ("a=%ld\
", a);
return 0;
}/*main*/

在本例中，'a' 的值始终为 0。行
x = 0;
是额外的。

但是，(看——没有任何优化！)
$gcc -O0 -o 短短.c

相关讨论

简答：存储 0 可消除其中一个循环中的先读后写依赖性。

详情：

我认为这是一个有趣的问题，尽管您关注的是 O0 优化级别，但在 O3 上也看到了相同的加速。但是查看 O0 可以更轻松地关注处理器正在做什么来优化代码而不是编译器，因为正如您所指出的，生成的汇编代码仅相差 1 条指令。

感兴趣的循环的汇编代码如下所示

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

movq $0, -32(%rbp)
jmp .L4
.L5:
movq -32(%rbp), %rax
movq -24(%rbp), %rdx
andq %rdx, %rax
movq %rax, -16(%rbp)
movq $0, -16(%rbp) ;; This instruction in FAST but not SLOW
movq -16(%rbp), %rax
leaq 0(,%rax,4), %rdx
movq -8(%rbp), %rax
addq %rdx, %rax
movl (%rax), %eax
cltq
addq %rax, -24(%rbp)
addq $1, -32(%rbp)
.L4:
movl -36(%rbp), %eax
cltq
cmpq -32(%rbp), %rax
jg .L5

在我的系统上使用 perf stat 运行我得到以下结果：

慢代码的结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

Performance counter stats for './slow_o0':

1827.438670 task-clock # 0.999 CPUs utilized
155 context-switches # 0.085 K/sec
1 CPU-migrations # 0.001 K/sec
195,448 page-faults # 0.107 M/sec
6,675,246,466 cycles # 3.653 GHz
4,391,690,661 stalled-cycles-frontend # 65.79% frontend cycles idle
1,609,321,845 stalled-cycles-backend # 24.11% backend cycles idle
7,157,837,211 instructions # 1.07 insns per cycle
# 0.61 stalled cycles per insn
490,110,757 branches # 268.195 M/sec
178,287 branch-misses # 0.04% of all branches

1.829712061 seconds time elapsed

快速代码的结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

Performance counter stats for './fast_o0':

1109.451910 task-clock # 0.998 CPUs utilized
95 context-switches # 0.086 K/sec
1 CPU-migrations # 0.001 K/sec
195,448 page-faults # 0.176 M/sec
4,067,613,078 cycles # 3.666 GHz
1,784,131,209 stalled-cycles-frontend # 43.86% frontend cycles idle
438,447,105 stalled-cycles-backend # 10.78% backend cycles idle
7,356,892,998 instructions # 1.81 insns per cycle
# 0.24 stalled cycles per insn
489,945,197 branches # 441.610 M/sec
176,136 branch-misses # 0.04% of all branches

1.111398442 seconds time elapsed

因此您可以看到，即使"快速"代码执行更多指令，它的停顿也更少。当乱序 CPU(像大多数 x64 架构)正在执行代码时，它会跟踪指令之间的依赖关系。如果操作数已准备好，另一条指令可以绕过等待指令。

在这个例子中，关键点很可能是这个指令序列：

1
2
3
4
5
6

andq %rdx, %rax
movq %rax, -16(%rbp)
movq $0, -16(%rbp) ;; This instruction in FAST but not SLOW
movq -16(%rbp), %rax
leaq 0(,%rax,4), %rdx
movq -8(%rbp), %rax

在快速代码中，movq -8(%rbp), %rax 指令将把 movq $0, -16(%rbp) 的结果转发给它，它能够更快地执行。而较慢的版本将不得不等待在循环迭代之间具有更多依赖关系的 movq %rax, -16(%rbp)。

请注意，在不了解具体微架构的情况下，此分析可能过于简单化。但我怀疑根本原因是这种依赖性，并且执行 0 的存储(movq $0, -16(%rbp) 指令)允许 CPU 在执行代码序列时执行更积极的推测。