0, Introduction to the tools
What can Callgrind do:
Cachegrind collects flat profile data: event counts (data reads, cache misses, etc.) are attributed directly to the function they occurred in. This cost attribution mechanism is called self or exclusive attribution.
Callgrind extends this functionality by propagating costs across function call boundaries. If function foo calls bar, the costs from bar are added into foo's costs. When applied to the program as a whole, this builds up a picture of so called inclusive costs, that is, where the cost of each function includes the costs of all functions it called, directly or indirectly.
As an example, the inclusive cost of main should be almost 100 percent of the total program cost. Because of costs arising before main is run, such as initialization of the run time linker and construction of global C++ objects, the inclusive cost of main is not exactly 100 percent of the total program cost.
Together with the call graph, this allows you to find the specific call chains starting from main in which the majority of the program's costs occur. Caller/callee cost attribution is also useful for profiling functions called from multiple call sites, and where optimization opportunities depend on changing code in the callers, in particular by reducing the call count.
Callgrind's cache simulation is based on that of Cachegrind. Read the documentation for Cachegrind: a cache and branch-prediction profiler first. The material below describes the features supported in addition to Cachegrind's features.
Callgrind's ability to detect function calls and returns depends on the instruction set of the platform it is run on. It works best on x86 and amd64, and unfortunately currently does not work so well on PowerPC, ARM, Thumb or MIPS code. This is because there are no explicit call or return instructions in these instruction sets, so Callgrind has to rely on heuristics to detect calls and returns.
Cachegrind's cache profiling has a number of shortcomings:
bts, btr and btc will incorrectly be counted as doing a data read if both of their arguments are registers.
Hotspot functions of the kinds listed below are usually found with other profiling tools, such as perf.
1.1 Which kinds of functions found with Intel PT can be optimized directly
1.1.1 Dynamic memory allocation
1.1.2 std::xxx functions (std::log10, std::pow, std::string, random_engine(), ...)
1.1.3 Unexpected memset (from C++ zero initialization)
1.1.4 Hotspot functions (functions called repeatedly within one region)
1.1.5 Unused calculation (results computed but never used, especially in feature log printing)
1.2 Optimization methods
1.2.1 Change the data structure (avoid structures that allocate dynamically, such as std::map, std::vector, etc.)
1.2.2 Time-space trade-off (replace computation with table lookups)
1.2.3 Lazy initialization (compute a value only when it is used)
1.2.4 Reduce padding
1.2.5 Code-logic optimization (reduce function calls, use inline)
1.2.6 Algorithm optimization (e.g. can a bitset test several bits at once?)
1.2.7 Use generics (pass function pointers via templates, lambda expressions, etc.)
1.2.8 Reduce copies (use references, reference_wrapper)
1.2.9 What exactly does the likely() macro do, and when should it be used?
1.2.10 Logging (move prints into the feature log; the same information is available in the tti trace and bip log)
1.2.11 Implicit consumption (passing std::function<> as a parameter has construction/destruction overhead; std::optional calls have overhead; iterating a std::bitset has overhead; so does .at())
1.2.12 Return early
1.2.13 Compile-time computation (constexpr, std::is_same_v, ...)
2.1 Find the functions that consume the most instructions (Ir)
callgrind_annotate --show=Ir --sort=Ir --auto=no ./cb8140GJ_BasePcellAsSUT-2CC_TDD500_TDD500_SA_InterCA-0-0.callgrind.log | grep -vE "sm6-snowfish-dynamic-linker-on|tickler|glibc|generated|cpp_testsuites|\?\?\?" >ir.callgrind.txt
2.2 Find the locations with the most cache misses
2.2.1 data cache miss
callgrind_annotate --show=Ir,D1mr,D1mw --sort=Ir,D1mr,D1mw --auto=no ./cb8140GJ_BasePcellAsSUT-2CC_TDD500_TDD500_SA_InterCA-0-0.callgrind.log >dcachemiss.callgrind.txt
2.2.2 instruction cache miss
callgrind_annotate --show=Ir,I1mr,ILmr --sort=Ir,I1mr,ILmr --auto=no ./cb8140GJ_BasePcellAsSUT-2CC_TDD500_TDD500_SA_InterCA-0-0.callgrind.log >icachemiss.callgrind.txt
2.2.3 branch miss
# First, use callgrind_annotate to extract the four branch events Bc, Bcm, Bi, Bim
callgrind_annotate --show=Bc,Bcm,Bi,Bim --sort=Bc,Bcm --auto=no ./cb8140GJ_BasePcellAsSUT-2CC_TDD500_TDD500_SA_InterCA-0-0.callgrind.log >annotate.callgrind.txt
# Then keep only the functions related to L2-PS
cat annotate.callgrind.txt | grep -vE "sm6-snowfish-dynamic-linker-on|tickler|glibc|\?\?\?" > branchmiss.callgrind.txt
2.3 Find the most frequently called functions
2.4 Find the functions with the highest cost per unit time (instructions, cycles)
2.5 Cache-line utilization statistics
Sort cache-line utilization by Callgrind's AcCost1 metric:
callgrind_annotate --show=AcCost1,SpLoss1 --sort=AcCost1,SpLoss1 --auto=no ./cb9304_CnpFdd1CellUlDlTraffic-1CC_20UE_FDD1000_CFG_SA_noCA-0-0.callgrind.log > accost1.log
cat accost1.log | grep -vE "sm6-snowfish-dynamic-linker-on|tickler|glibc|generated|cpp_testsuites|\?\?\?" > accost1.callgrind.txt

Avoid dynamic memory allocation by changing the data structure:
gerrit.ext.net.nokia.com/gerrit/c/MN/5G/NB/gnb/+/7470162
Make the xsfn validity check more efficient by removing std::optional:
[CB010482-SR-A-B18_5G_L2_PS] pucchInfo optimization. (Ibc45c2cf) · Gerrit Code Review (nokia.com)
Reduce unnecessary use of std::string:
https://gerrit.ext.net.nokia.com/gerrit/c/MN/5G/NB/gnb/+/7560807
Use table lookups judiciously: not every table lookup is faster than computing the value.
Examples where computing directly beats a table lookup:
1. Example: log10 lookup table
[CB010482-SR-A-B18_5G_L2_PS] enhancement of log10 lookup table. (I8bc239b2) · Gerrit Code Review (nokia.com)
2. {1,1,2,...}: computing the integer part of log2(x)
…/PdcchCandidateCalculator.hpp · Gerrit Code Review (nokia.com)
Quick C++ Benchmarks (quick-bench.com)
1. Fetch a variable through its interface only when it is actually used
[CNI-111721-A-B1_5G_L2_PS] lazy initialization some variables. (I8ea9a83f) · Gerrit Code Review (nokia.com)
2. Compute a variable only when its feature switch is enabled
The following example shows the time needed to define the getOnAirXsfn reference variable.


The common understanding is that reducing padding saves memory; for performance, however, less padding does not always help.
The following example shows the performance impact of reducing padding: removing the padding reordered the struct members, so that accessing two of them now touches two cache lines. Quick C++ Benchmarks (quick-bench.com)
Iterate with references in for loops
Use RVO (return value optimization) to reduce copies
Use std::move for move semantics
Compute fixed values at compile time
Perform type checks at compile time
Use templates to generate different branch code