Is kcachegrind/callgrind inaccurate for dispatcher functions?

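I have a piece of model code for which kcachegrind/callgrind reports strange results. It is built around a dispatcher function. The dispatcher is called from four places, and each call says which actual do_J function to run (so first2 calls only do_1 and do_2, and so on).

Source (this is a model of the actual code):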

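#define N 1000000

int a[N];

/* Each do_J touches a different fraction of the array, so each has a distinct cost. */
void do_1(int *a) { int i; for (i = 0; i < N / 4; i++) a[i] += 1; }
void do_2(int *a) { int i; for (i = 0; i < N / 2; i++) a[i] += 2; }
void do_3(int *a) { int i; for (i = 0; i < N * 3 / 4; i++) a[i] += 3; }
void do_4(int *a) { int i; for (i = 0; i < N; i++) a[i] += 4; }

/* Runs the do_J selected by j; every call to a do_J goes through here. */
void dispatcher(int *a, int j) {
    if (j == 1) do_1(a);
    else if (j == 2) do_2(a);
    else if (j == 3) do_3(a);
    else do_4(a);
}

/* The four callers; each one requests a different pair of do_J functions. */
void first2(int *a) { dispatcher(a, 1); dispatcher(a, 2); }
void last2(int *a) { dispatcher(a, 4); dispatcher(a, 3); }
void inner2(int *a) { dispatcher(a, 2); dispatcher(a, 3); }
void outer2(int *a) { dispatcher(a, 1); dispatcher(a, 4); }

int main(void) {
    first2(a);
    last2(a);
    inner2(a);
    outer2(a);
    return 0;
}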

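Compiled with gcc -O0, profiled with valgrind --tool=callgrind, and viewed with kcachegrind and qcachegrind-0.7.

This is the full call graph of the application. All paths to the do_J functions go through dispatcher, which is correct (do_1 is hidden because it is too fast, but it really is there, just to the left of do_2).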

[Image: full call graph]

Now let's focus on do_1 and check who called it (this picture is incorrect):

[Image: callers of do_1 as shown by kcachegrind]

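This looks very strange to me: only first2 and outer2 actually call do_1, not all four callers as the picture suggests.

Is this a limitation of callgrind/kcachegrind? How can I get an accurate call graph with weights (proportional to the running time of every function, with and without its children)?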

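Yes, this is a limitation of the callgrind format. It does not store full call traces; it stores only parent-child (caller-callee) call information. Once several callers share dispatcher, the recorded data can no longer tell which caller's call to dispatcher led to do_1.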
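To make the limitation concrete, here is a toy model in C of the program from the question. It is only a sketch: the cost unit (loop iterations), the names `true_cost` and `edge_only_estimate`, and the assumption that a viewer splits cost proportionally across callers are all illustrative, and none of this is callgrind's or kcachegrind's actual code.

```c
#include <stdio.h>

#define N 1000000
#define NCALLERS 4

/* Hypothetical model: cost of do_J measured in loop iterations. */
static const char *caller_name[NCALLERS] = {"first2", "last2", "inner2", "outer2"};
static const long long do_cost[4] = {N / 4, N / 2, N * 3 / 4, N};
/* The two j values each caller passes to dispatcher(). */
static const int requests[NCALLERS][2] = {{1, 2}, {4, 3}, {2, 3}, {1, 4}};

/* Cost of do_j reached from caller c, as a full call trace would record it. */
long long true_cost(int c, int j) {
    long long sum = 0;
    for (int k = 0; k < 2; k++)
        if (requests[c][k] == j) sum += do_cost[j - 1];
    return sum;
}

/* Inclusive cost on the edge caller c -> dispatcher. */
long long edge_caller_to_dispatcher(int c) {
    return do_cost[requests[c][0] - 1] + do_cost[requests[c][1] - 1];
}

/* Inclusive cost on the edge dispatcher -> do_j, summed over all callers. */
long long edge_dispatcher_to_do(int j) {
    long long sum = 0;
    for (int c = 0; c < NCALLERS; c++) sum += true_cost(c, j);
    return sum;
}

/* Assumed heuristic: with only the two edge costs available, split the cost
   of do_j among callers in proportion to their share of dispatcher's cost. */
long long edge_only_estimate(int c, int j) {
    long long total = 0;
    for (int k = 0; k < NCALLERS; k++) total += edge_caller_to_dispatcher(k);
    return edge_dispatcher_to_do(j) * edge_caller_to_dispatcher(c) / total;
}

/* Prints, for each caller, the true and the edge-only cost of do_j. */
void print_cost_by_caller(int j) {
    for (int c = 0; c < NCALLERS; c++)
        printf("%-6s true=%lld estimate=%lld\n", caller_name[c],
               true_cost(c, j), edge_only_estimate(c, j));
}
```

Calling `print_cost_by_caller(1)` from a `main` gives a true cost of 250000 for first2 and outer2 and 0 for the other two, while the edge-only estimate assigns a non-zero share to every caller (75000, 175000, 125000, 125000). Under these assumptions, that is the same kind of picture as the incorrect do_1 view.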

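The google-perftools project provides a CPU profiler made up of pprof and libprofiler.so: http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html. libprofiler.so collects a profile with call traces, storing every trace event with its full backtrace. pprof converts libprofiler's output to graphical formats or to the callgrind format. In the full view the result will be the same as in kcachegrind, but if you focus on a particular function, e.g. do_1, using pprof's focus option, it will show an accurate call tree for that function.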
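The reason full backtraces make a focused view exact can be sketched with a small C model. This is not pprof's or libprofiler's implementation: `struct sample`, `model_samples`, `focused_cost` and the weights (loop iterations of the model program, rather than real sample counts) are hypothetical names and values chosen for illustration.

```c
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 8
#define NSAMPLES 8

/* One recorded event: the whole call stack, outermost frame first. */
struct sample {
    const char *stack[MAX_DEPTH];
    int depth;
    long weight; /* hypothetical cost units */
};

/* Hypothetical trace of the model program: one entry per do_J call. */
static const struct sample model_samples[NSAMPLES] = {
    {{"main", "first2", "dispatcher", "do_1"}, 4, 250000},
    {{"main", "first2", "dispatcher", "do_2"}, 4, 500000},
    {{"main", "last2", "dispatcher", "do_4"}, 4, 1000000},
    {{"main", "last2", "dispatcher", "do_3"}, 4, 750000},
    {{"main", "inner2", "dispatcher", "do_2"}, 4, 500000},
    {{"main", "inner2", "dispatcher", "do_3"}, 4, 750000},
    {{"main", "outer2", "dispatcher", "do_1"}, 4, 250000},
    {{"main", "outer2", "dispatcher", "do_4"}, 4, 1000000},
};

/* Returns 1 if function `name` appears anywhere in the sample's stack. */
int sample_contains(const struct sample *s, const char *name) {
    for (int i = 0; i < s->depth; i++)
        if (strcmp(s->stack[i], name) == 0) return 1;
    return 0;
}

/* Total weight of the samples whose stack contains both `focus` and `via`.
   Because each sample keeps its full stack, no proportional guess is needed. */
long focused_cost(const struct sample *samples, int n,
                  const char *focus, const char *via) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (sample_contains(&samples[i], focus) && sample_contains(&samples[i], via))
            sum += samples[i].weight;
    return sum;
}
```

With this model, `focused_cost(model_samples, NSAMPLES, "do_1", "first2")` is 250000 and the same query with `"last2"` is 0, so a view focused on do_1 lists only first2 and outer2 as callers.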