Flamegraphs March 8th 2023 1000. Title page ---------------------------------------------------------------------- 2000. -fno-omit-frame-pointer ---------------------------------------------------------------------- Fedora 38 compiles (almost) everything with frame pointers, making total system performance analysis much easier. For details of why this is better, see my blog: https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwarf-my-verdict/ Unfortunately Fedora <= 37 and RHEL do not have frame pointers unless you recompile your program and all library dependencies (including glibc!) 2100. sudo perf record -a -g -- nbdkit -U - null 1G --run 'export uri; fio nbd.fio' ---------------------------------------------------------------------- You can use perf to record the whole system while running a command. -a => record the whole system -g => use frame pointers to gather stack traces On every CPU, 100s of times a second, an interrupt will fire and the full stack trace is collected. If it's running in the kernel, it collects the kernel stack and the userspace stack. If it's running in userspace, only the userspace stack is collected. 2200/2250. flamegraph > analysis.svg ---------------------------------------------------------------------- You can post-process the perf output to get a flamegraph, an interactive SVG file. I wrote the flamegraph shell script to paper over some unnecessary complexity in the tools. 3000. << a flamegraph >> ---------------------------------------------------------------------- Opens in a web browser. Interactive, click in for more detail. Searchable. 3100. Width is time ---------------------------------------------------------------------- Shows the width as a % of total system (non-sleeping) time. 3200. Height is stack depth ---------------------------------------------------------------------- Not usually interesting. 3300. Plateaus ---------------------------------------------------------------------- Plateaus indicate functions consuming a lot of time. Remember I said that every core fires hundreds of interrupts and collects a stack trace? The stack trace is shown upside down in a flamegraph, with the inner stack frame at the top. So a plateau indicates a function that was actually running when the interrupt happened. (Show osq_lock and others) (Describe MSR problem in UKL) 3400. Left to right is NOT time ---------------------------------------------------------------------- Stack frames are ordered alphabetically 3500. Wide function with wide function on top ---------------------------------------------------------------------- Time consumed in the inner function, not the outer function. 3600. Kernel threads and other "disconnected" or "background" work ---------------------------------------------------------------------- Noisy machine, running firefox. Note kernel threads (kcryptd) taking longer than the task being measured. 3700. Sleeping / blocked time is not recorded 3750. Sleeping / blocked time is not recorded ---------------------------------------------------------------------- Show sleeptest.c Show unexpected graph. Unless you take special steps. The easiest is to use the perf --off-cpu flag, but it is still not supported in the Fedora perf builds. 4000. What kind of things can be revealed? ---------------------------------------------------------------------- - Lock contention - Inefficient algorithms like zlib inflate - MSRs / slow CPU features - Serialization through a CPU - Copy to/from user - Page faults - Excessive mmap/munmap (from memory allocations) - Opportunities to short-cut through the stack 5000. In summary ---------------------------------------------------------------------- - Use Fedora >= 38 - perf record -a -g - flamegraph script - Width is time - Look for plateaus - Watch out for disconnected / background work - perf --off-cpu (in future) - Many fascinating, revealing insights The backbone of performance work in UKL. Students spent literally months studying flamegraphs (and other perf-adjacent tools) to get insights.