0, Introduction to the tools

What can Callgrind do:
Cachegrind collects flat profile data: event counts (data reads, cache misses, etc.) are attributed directly to the function they occurred in. This cost attribution mechanism is called self or exclusive attribution.
Callgrind extends this functionality by propagating costs across function call boundaries. If function foo calls bar, the costs from bar are added into foo's costs. When applied to the program as a whole, this builds up a picture of so called inclusive costs, that is, where the cost of each function includes the costs of all functions it called, directly or indirectly.
As an example, the inclusive cost of main should be almost 100 percent of the total program cost. Because of costs arising before main is run, such as initialization of the run time linker and construction of global C++ objects, the inclusive cost of main is not exactly 100 percent of the total program cost.
Together with the call graph, this allows you to find the specific call chains starting from main in which the majority of the program's costs occur. Caller/callee cost attribution is also useful for profiling functions called from multiple call sites, and where optimization opportunities depend on changing code in the callers, in particular by reducing the call count.
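As a minimal sketch of this attribution (not from the Valgrind docs), consider a program where nearly every instruction executes inside bar: Cachegrind attributes almost all Ir to bar alone, while Callgrind additionally reports foo and main with close to 100% inclusive cost. Compile with optimization off so the loop is not removed:

    // bar does essentially all of the work; foo and main only delegate.
    long bar(long n) {                          // self (exclusive) cost: ~100% of Ir
        long s = 0;
        for (long i = 0; i < n; ++i) s += i;
        return s;
    }
    long foo(long n) { return bar(n); }         // self cost tiny, inclusive cost ~100%
    int main() { return foo(10000000) == 0; }   // inclusive cost ~100%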
Callgrind's cache simulation is based on that of Cachegrind. Read the documentation for Cachegrind: a cache and branch-prediction profiler first. The material below describes the features supported in addition to Cachegrind's features.
Callgrind's ability to detect function calls and returns depends on the instruction set of the platform it is run on. It works best on x86 and amd64, and unfortunately currently does not work so well on PowerPC, ARM, Thumb or MIPS code. This is because there are no explicit call or return instructions in these instruction sets, so Callgrind has to rely on heuristics to detect calls and returns.

0.1 Pros and cons

  • _Precise._ Cachegrind measures the exact number of instructions executed by your program, not an approximation. Furthermore, it presents the gathered data at the file, function, and line level. This is different to many other profilers that measure approximate execution time, using sampling, and only at the function level.
  • _Reproducible._ In general, execution time is a better metric than instruction counts because it's what users perceive. However, execution time often has high variability. When running the exact same program on the exact same input multiple times, execution time might vary by several percent. Furthermore, small changes in a program can change its memory layout and have even larger effects on runtime. In contrast, instruction counts are highly reproducible; for some programs they are perfectly reproducible. This means the effects of small changes in a program can be measured with high precision.

Cachegrind's cache profiling has a number of shortcomings:

  • It doesn't account for kernel activity. The effect of system calls on the cache and branch predictor contents is ignored.
  • It doesn't account for other process activity. This is arguably desirable when considering a single program.
  • It doesn't account for virtual-to-physical address mappings. Hence the simulation is not a true representation of what's happening in the cache. Most caches and branch predictors are physically indexed, but Cachegrind simulates caches using virtual addresses.
  • It doesn't account for cache misses not visible at the instruction level, e.g. those arising from TLB misses, or speculative execution.
  • Valgrind schedules threads differently from how they would be scheduled when running natively. This could warp the results for threaded programs.
  • The x86/amd64 instructions bts, btr and btc will incorrectly be counted as doing a data read if both the arguments are registers, e.g. btsl %eax, %edx. This should only happen rarely.

0.2 How Cachegrind works

1, Forward analysis: optimizing a specific function

Such hot functions are usually identified by other profiling tools, for example perf.
1.1 Which kinds of functions seen in Intel PT can be optimized directly

  • 1.1.1 Dynamic memory allocation

  • 1.1.2 std::xxx functions (std::log10, std::pow, std::string, random_engine(), ...)

  • 1.1.3 Unexpected memset (from C++ zero initialization)

  • 1.1.4 Hot functions (a function called repeatedly within one region)

  • 1.1.5 Unused calculation (a result is computed at length but never consumed, especially around feature-log printing)

1.2 Optimization techniques

  • 1.2.1 Change data structures (avoid structures that allocate dynamically, such as std::map, std::vector, etc.)

  • 1.2.2 Time-space trade-off (replace computation with lookup tables)

  • 1.2.3 Lazy initialization (compute only at the point of use)

  • 1.2.4 Reduce padding

  • 1.2.5 Code-logic optimization (reduce function calls, use inline)

  • 1.2.6 Algorithmic optimization (e.g. can a bitset check several entries at once)

  • 1.2.7 Use generics (pass function pointers, lambda expressions, etc. through templates)

  • 1.2.8 Reduce copies (use references, reference_wrapper)

  • 1.2.9 likely(): what exactly does this macro do, and when should it be used?

  • 1.2.10 Logging (move printing into the feature log; the printed information is also available in the tti trace and bip log)

  • 1.2.11 Implicit overhead (passing std::function<> as a parameter incurs constructor/destructor overhead; so do std::optional calls, std::bitset iteration, and .at(); see the sketch after this list)

  • 1.2.12 Return early

  • 1.2.13 Compile-time computation (constexpr, std::is_same_v, ...)
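As an illustration of 1.2.11, the sketch below (hypothetical code) contrasts passing a callable as std::function, which involves type erasure, an indirect call, and possibly a heap allocation, with a template parameter the compiler can inline:

    #include <functional>

    // std::function: type-erased, indirect call, possible heap allocation
    // when the callable does not fit the small-buffer optimization.
    long sum_fn(long n, const std::function<long(long)>& f) {
        long s = 0;
        for (long i = 0; i < n; ++i) s += f(i);
        return s;
    }

    // Template parameter: the compiler sees the concrete type and can
    // inline the call entirely.
    template <typename F>
    long sum_tpl(long n, F f) {
        long s = 0;
        for (long i = 0; i < n; ++i) s += f(i);
        return s;
    }

    int main() {
        auto square = [](long x) { return x * x; };
        return sum_fn(1000, square) == sum_tpl(1000, square) ? 0 : 1;
    }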

2, Reverse analysis: system-wide performance bottlenecks (hot spots)

2.1 Find the functions consuming the most instructions (Ir)

# Sort functions by instruction count (Ir); the grep drops linker, test-framework and unresolved ("???") entries.
callgrind_annotate --show=Ir --sort=Ir --auto=no ./cb8140GJ_BasePcellAsSUT-2CC_TDD500_TDD500_SA_InterCA-0-0.callgrind.log | grep -vE "sm6-snowfish-dynamic-linker-on|tickler|glibc|generated|cpp_testsuites|\?\?\?" >ir.callgrind.txt

2.2 Find the locations with the most cache misses
2.2.1 Data cache misses

# D1mr / D1mw: L1 data cache read / write misses.
callgrind_annotate --show=Ir,D1mr,D1mw --sort=Ir,D1mr,D1mw --auto=no ./cb8140GJ_BasePcellAsSUT-2CC_TDD500_TDD500_SA_InterCA-0-0.callgrind.log >dcachemiss.callgrind.txt

2.2.2 Instruction cache misses

# I1mr: L1 instruction cache read misses; ILmr: last-level (LL) instruction cache read misses.
callgrind_annotate --show=Ir,I1mr,ILmr --sort=Ir,I1mr,ILmr --auto=no ./cb8140GJ_BasePcellAsSUT-2CC_TDD500_TDD500_SA_InterCA-0-0.callgrind.log >icachemiss.callgrind.txt

2.2.3 Branch misses

# First extract the four branch events with callgrind_annotate: Bc/Bcm (conditional branches executed/mispredicted), Bi/Bim (indirect branches executed/mispredicted). These events require profiling with --branch-sim=yes.
callgrind_annotate --show=Bc,Bcm,Bi,Bim --sort=Bc,Bcm --auto=no ./cb8140GJ_BasePcellAsSUT-2CC_TDD500_TDD500_SA_InterCA-0-0.callgrind.log >annotate.callgrind.txt
# Then keep only the functions related to L2-PS
cat annotate.callgrind.txt | grep -vE "sm6-snowfish-dynamic-linker-on|tickler|glibc|\?\?\?" > branchmiss.callgrind.txt

2.3 Find the most frequently called functions
2.4 Find the functions with the highest cost per unit of time (instructions, cycles)
2.5 Cache-line utilization statistics
Cache-line utilization is sorted by Callgrind's AcCost1 metric (the cache-use events AcCost1/SpLoss1 are recorded when Callgrind is run with --cacheuse=yes).

callgrind_annotate --show=AcCost1,SpLoss1 --sort=AcCost1,SpLoss1 --auto=no ./cb9304_CnpFdd1CellUlDlTraffic-1CC_20UE_FDD1000_CFG_SA_noCA-0-0.callgrind.log > accost1.log

cat accost1.log | grep -vE "sm6-snowfish-dynamic-linker-on|tickler|glibc|generated|cpp_testsuites|\?\?\?" > accost1.callgrind.txt


2.6 Concrete code examples (subsections numbered to match the techniques in 1.2)

2.6.1 Change data structures

Avoiding dynamic memory allocation by changing the data structure:
gerrit.ext.net.nokia.com/gerrit/c/MN/5G/NB/gnb/+/7470162
Making the xsfn validity check more efficient by removing std::optional:
[CB010482-SR-A-B18_5G_L2_PS] pucchInfo optimization. (Ibc45c2cf) · Gerrit Code Review (nokia.com)
Reducing unnecessary std::string usage:
https://gerrit.ext.net.nokia.com/gerrit/c/MN/5G/NB/gnb/+/7560807
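The Gerrit changes above are project-internal; as a generic sketch of the first idea, a std::map whose key space is small and bounded can be replaced by a fixed-size array plus a validity bitmask, eliminating per-insert heap allocation (all names below are hypothetical):

    #include <array>
    #include <cstddef>
    #include <cstdint>

    struct CellInfo { int power = 0; };

    constexpr std::size_t kMaxCells = 8;   // assumed upper bound on cell ids

    // Instead of std::map<uint8_t, CellInfo> (node-based, allocates on
    // every insert), index directly into a fixed-size array.
    class CellTable {
    public:
        void insert(std::size_t id, CellInfo info) {
            slots_[id] = info;
            valid_ |= (1u << id);              // mark slot as occupied
        }
        const CellInfo* find(std::size_t id) const {
            return (valid_ & (1u << id)) ? &slots_[id] : nullptr;
        }
    private:
        std::array<CellInfo, kMaxCells> slots_{};
        std::uint32_t valid_ = 0;              // no heap allocation anywhere
    };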

2.6.2 Time-space trade-off

Use lookup tables judiciously: not every table lookup is faster than recomputing the value.
Examples where skipping the lookup table is more efficient:
1, log10 lookup table
[CB010482-SR-A-B18_5G_L2_PS] enhancement of log10 lookup table. (I8bc239b2) · Gerrit Code Review (nokia.com)
2, the {1,1,2,...} table for computing the integer part of log2(x)
…/PdcchCandidateCalculator.hpp · Gerrit Code Review (nokia.com)
Quick C++ Benchmarks (quick-bench.com)
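As a generic illustration of why the table does not always win: integer log2 can be computed with a single count-leading-zeros instruction, avoiding the table's memory access and its potential cache miss. A sketch assuming the GCC/Clang __builtin_clz builtin:

    #include <cstdint>

    // Table version: one memory load, which may miss in the data cache.
    constexpr uint8_t kLog2Table[16] = {0, 0, 1, 1, 2, 2, 2, 2,
                                        3, 3, 3, 3, 3, 3, 3, 3};

    uint32_t log2_table(uint32_t x) {   // valid for 1 <= x < 16
        return kLog2Table[x];
    }

    // Computed version: a single clz instruction on most targets, no memory
    // access. (C++20: std::bit_width(x) - 1 does the same portably.)
    uint32_t log2_clz(uint32_t x) {     // valid for x >= 1
        return 31u - static_cast<uint32_t>(__builtin_clz(x));
    }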

2.6.3 Lazy initialization

Compute a variable only when the feature switch is turned on.
An example from the codebase measured the time needed to define the getOnAirXsfn reference variable.
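A generic sketch of the pattern, using hypothetical names: the expensive value is computed only when the feature is enabled and the value is first requested, then cached for subsequent calls.

    #include <optional>

    struct Xsfn { int frame = 0; int slot = 0; };

    Xsfn computeOnAirXsfn() {         // stand-in for an expensive computation
        return {1024, 7};
    }
    bool featureEnabled = true;       // hypothetical feature switch

    class XsfnCache {
    public:
        const Xsfn* get() {
            if (!featureEnabled) return nullptr;         // skip the work entirely
            if (!cached_) cached_ = computeOnAirXsfn();  // compute on first use
            return &*cached_;
        }
    private:
        std::optional<Xsfn> cached_;  // stays empty until first requested
    };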

2.6.4 Padding

The common understanding is that reducing padding saves memory; for performance, however, less padding does not necessarily help.

The following example shows the performance impact of reducing padding: reordering the members to remove padding ended up placing two members that are accessed together on two different cache lines. Quick C++ Benchmarks (quick-bench.com)
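A sketch of how that can happen (illustrative layouts, not the benchmarked code): in the original struct the two members accessed together share one cache line; after reordering to minimize padding they end up more than 64 bytes apart.

    #include <cstddef>
    #include <cstdint>

    struct Original {                  // 88 bytes: padding after 'id' and 'flag'
        std::uint8_t  id;              // offset 0
        std::uint64_t hot;             // offset 8  -- same cache line as 'id'
        std::uint8_t  flag;            // offset 16
        std::uint64_t bulk[8];         // offset 24, rarely accessed
    };

    struct Reordered {                 // 80 bytes: padding minimized, but 'id'
        std::uint64_t hot;             // offset 0       and 'hot' are now split
        std::uint64_t bulk[8];         // offset 8
        std::uint8_t  id;              // offset 72
        std::uint8_t  flag;            // offset 73
    };

    static_assert(offsetof(Original, hot) - offsetof(Original, id) < 64,
                  "id and hot share a cache line");
    static_assert(offsetof(Reordered, id) - offsetof(Reordered, hot) >= 64,
                  "id and hot are split across cache lines");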

2.6.8 Reduce copies

Iterate with references in for loops.
Exploit RVO (return value optimization) to avoid copies.
Use std::move for move semantics.
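A minimal sketch combining the three points:

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    std::vector<std::string> makeNames() {
        std::vector<std::string> v{"alpha", "beta", "gamma"};
        return v;                       // RVO/move: the vector is not copied
    }

    int main() {
        std::vector<std::string> names = makeNames();

        // Iterate by const reference: no per-element string copy.
        std::size_t total = 0;
        for (const auto& name : names) total += name.size();

        // std::move transfers ownership of the buffer instead of copying it.
        std::string first = std::move(names[0]);

        return total == 14 && first == "alpha" ? 0 : 1;
    }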

2.6.9 likely()
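To answer the question posed in 1.2.9: likely()/unlikely() are conventionally defined (e.g. in the Linux kernel) on top of GCC/Clang's __builtin_expect, which tells the compiler which branch direction dominates so it can keep the hot path as straight-line fall-through code; C++20 provides the [[likely]]/[[unlikely]] attributes for the same purpose. Use them only on branches that are overwhelmingly one-sided, such as error paths. A sketch:

    // Conventional GCC/Clang definitions.
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int process(int value) {
        if (unlikely(value < 0)) {   // rare error path, moved off the hot path
            return -1;
        }
        return value * 2;            // hot path laid out as straight-line code
    }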

2.6.10 Logging
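One reason moving prints into a feature log helps: formatting the arguments costs time even when the output is discarded. A generic mitigation (sketch only, not the project's actual logging macro) is to test the switch before evaluating the format arguments:

    #include <cstdio>

    bool gFeatureLogEnabled = false;  // hypothetical runtime switch

    // Arguments are only evaluated when the feature log is enabled, so an
    // expensive expression in the argument list costs nothing when it is off.
    #define FEATURE_LOG(fmt, ...) \
        do { if (gFeatureLogEnabled) std::printf(fmt, __VA_ARGS__); } while (0)

    int main() {
        FEATURE_LOG("value=%d\n", 42);  // printf not reached, 42 not formatted
        return 0;
    }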

2.6.13 Compile-time computation

Computations over fixed values can be done at compile time.
Type checks can be decided at compile time.
Use templates to generate the different code branches.
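A sketch of all three points (requires C++17): a lookup table built entirely at compile time, plus a compile-time type check that selects a branch with if constexpr instead of a runtime test:

    #include <array>
    #include <cstddef>
    #include <type_traits>

    // 1) Fixed-value computation at compile time: a table of n*n built by a
    //    constexpr function; it lives in read-only data, nothing runs at startup.
    constexpr auto makeSquares() {
        std::array<int, 16> t{};
        for (std::size_t i = 0; i < t.size(); ++i) t[i] = static_cast<int>(i * i);
        return t;
    }
    constexpr auto kSquares = makeSquares();

    // 2) + 3) Type check at compile time; each instantiation keeps only the
    //    branch matching T, the other branch is discarded, not just skipped.
    template <typename T>
    int encode(T value) {
        if constexpr (std::is_same_v<T, bool>) {
            return value ? 1 : 0;
        } else {
            return kSquares[static_cast<std::size_t>(value) & 15u];
        }
    }

    int main() { return encode(true) + encode(3) == 10 ? 0 : 1; }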