How I Use pprof: A Structured 4-Stage Profiling System

As with most things, not knowing which stage you are in will hurt your efficiency. You may end up chasing the wrong bottlenecks, fixing details that do not matter yet, or redesigning components when a small refinement would have been enough. Combine that with an unstructured approach, and it can quickly cost you your sanity.

To avoid this, I follow a structured approach to profiling. I start with system entry-point benchmarks to catch the low-hanging inefficiencies, move to microbenchmarks for finer-grained insights, and then use path profiling and realistic workloads for deeper, design-level improvements. This progression ensures that each step builds on the previous one, so I tackle the right performance problems at the right time.

System Entry-Point Synthetic Benchmark

No realistic workload is simulated; the top-level call is repeatedly stressed to reveal performance hotspots.

When implementing a new system design, you inevitably make many small decisions that are too granular to specify in the design and are left to the implementation. The goal of the system entry-point benchmark is to identify and clean up these small decisions to improve overall system performance.
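In code, this stage needs nothing more than a plain Go benchmark that hammers the top-level call. Here is a minimal sketch, where HandleRequest, the request type, and newSyntheticRequest are hypothetical stand-ins for your own entry point and its input:

package mysystem

import "testing"

// Hypothetical stand-ins for your system's entry point and its input.
type request struct{ payload string }

func newSyntheticRequest() request { return request{payload: "fixed input"} }

func HandleRequest(r request) { _ = r.payload /* your real system logic */ }

func BenchmarkHandleRequest(b *testing.B) {
    req := newSyntheticRequest() // one fixed synthetic input, no realistic workload
    b.ResetTimer()               // keep setup out of the measurement
    for i := 0; i < b.N; i++ {
        HandleRequest(req) // repeatedly stress the top-level call
    }
}

Running it with go test -bench=BenchmarkHandleRequest -cpuprofile cpu.prof produces the profile used throughout this stage.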

For example, you might choose to use Sprintf instead of manual string concatenation, which can result in the following performance difference:

BenchmarkSprintfHotPath-8       42454868        30.05 ns/op      24 B/op       1 allocs/op
BenchmarkStringConcatHotPath-8  74467857        15.50 ns/op      24 B/op       1 allocs/op
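A minimal sketch of what these two benchmarks might look like; the inputs and the sink variable are illustrative, not from the original code:

package mysystem

import (
    "fmt"
    "testing"
)

// Package-level inputs stop the compiler from constant-folding the
// concatenation; the sink keeps the result from being optimized away.
var user, id = "user", "42"
var sink string

func BenchmarkSprintfHotPath(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = fmt.Sprintf("%s:%s", user, id)
    }
}

func BenchmarkStringConcatHotPath(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = user + ":" + id
    }
}

Run with go test -bench=HotPath -benchmem to get the B/op and allocs/op columns shown above.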

Or maybe you didn't consider preallocating the slice and just used append instead:

BenchmarkSliceWithPreallocation-8      8978208      115.9 ns/op       0  B/op       0 allocs/op
BenchmarkSliceWithoutPreallocation-8   353932       3385 ns/op     25208 B/op      12 allocs/op
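A sketch of the two variants; the 1024-element count is an arbitrary stand-in for whatever your code actually builds, and the benchmarks live in the same _test.go file as above:

func BenchmarkSliceWithPreallocation(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s := make([]int, 0, 1024) // capacity known up front: no regrowth
        for j := 0; j < 1024; j++ {
            s = append(s, j)
        }
        _ = s
    }
}

func BenchmarkSliceWithoutPreallocation(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var s []int // zero capacity: append reallocates and copies as it grows
        for j := 0; j < 1024; j++ {
            s = append(s, j)
        }
        _ = s
    }
}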

Individually, these decisions may seem minor, but collectively they can add up to a significant performance cost if left unaddressed. After repeatedly seeing this pattern, I now start my profiling sessions with what I call the "cleanup stage". You naturally implement a system as well as your current knowledge allows, and the cleanup stage is your opportunity to learn new things. Most of the work here is simply asking, "Can this be done differently, and would that result in better performance?" Do not assume anything; measure everything and try things out.

Preferred Method

I particularly like using the -callgrind output format so I can analyze profiles with QCacheGrind. For example:

go tool pprof -callgrind cpu.prof > profile.callgrind
qcachegrind profile.callgrind

QCacheGrind provides functionality similar to the standard graph feature in the pprof web interface, but I find it easier to understand and reason about because it feels less noisy. While I am aware of the -focus and -ignore options in pprof, I prefer being able to get the view I want with a single click. The way the information is displayed also feels more direct and simpler to interpret.

Microbenchmark of a Single Function

After identifying hotspots in your system, focus on a specific candidate for improvement and run a dedicated benchmark for that function alone. This produces a detailed "heat map" of its performance, helping you pinpoint the most costly operations.

For the most part, writing a separate benchmark for a single function is unnecessary. However, if the function is large or complex, not performing a targeted benchmark may cause you to lose valuable performance insights. In such cases, a focused microbenchmark helps capture detailed data that would otherwise be obscured.
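When I do write one, it is just a regular Go benchmark pinning down a single function. A minimal sketch, where MyFunction and the makeRepresentativeInput helper are hypothetical placeholders:

func BenchmarkMyFunction(b *testing.B) {
    input := makeRepresentativeInput() // hypothetical helper producing a typical input
    b.ResetTimer()                     // exclude setup from the measurement
    for i := 0; i < b.N; i++ {
        MyFunction(input)
    }
}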

Preferred Method

The main option I use here is -lines, which shows line-level profiling information within the function. For example:

go test -bench=BenchmarkMyFunction -cpuprofile cpu.prof
go tool pprof -lines cpu.prof

Example output:

Showing nodes accounting for 500ms, 100% of 500ms total
      flat  flat%   sum%        cum   cum%
    200ms 40.0%  40.0%      200ms 40.0%  mypackage.MyFunction:45
    150ms 30.0%  70.0%      150ms 30.0%  mypackage.MyFunction:52
    100ms 20.0%  90.0%      100ms 20.0%  mypackage.Helper:10
     50ms 10.0% 100.0%       50ms 10.0%  mypackage.AnotherHelper:23

Using the -lines option saves you from running additional commands just to figure out which lines inside those functions need optimizing.

System Path Profiling Benchmark

Feed your system real or synthetic workloads that exercise the full range of branches and cases.

System profiling goes beyond individual function hotspots to reveal how your code behaves across entire execution paths. While a System Entry-Point Synthetic Benchmark identifies which functions are expensive within a single path, path profiling captures the range of different outcomes that can result from each entry point, especially when execution varies due to branching or edge cases.

Note: In many cases, path profiling can be simplified or replaced by the system entry-point phase. By stressing each entry point, you get a good sense of overall performance, provided the entry points do not have too many branches or edge cases.

This is particularly useful in systems with multiple expensive and complex execution paths. For example:

HandleRequest
 ├─> Authenticate
 │     └─> ValidateToken
 ├─> RouteRequest
 │     ├─> ProcessOrder
 │     │     ├─> ValidateOrder
 │     │     ├─> CheckInventory
 │     │     └─> ChargePayment
 │     └─> ProcessRefund
 │           ├─> ValidateRefund
 │           └─> IssueRefund
 └─> LogRequest
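One way to make sure each expensive branch shows up in the profile is to give every path its own benchmark. A sketch, reusing the hypothetical HandleRequest from earlier; the request constructors are also hypothetical:

func BenchmarkOrderPath(b *testing.B) {
    req := newOrderRequest() // hypothetical: drives Authenticate -> RouteRequest -> ProcessOrder
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        HandleRequest(req)
    }
}

func BenchmarkRefundPath(b *testing.B) {
    req := newRefundRequest() // hypothetical: drives Authenticate -> RouteRequest -> ProcessRefund
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        HandleRequest(req)
    }
}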

When path profiling isn't useful:

  • In systems where execution is mostly linear.
  • When alternative paths exist but are cheap or rarely executed.

Preferred Method

1. -files

Shows the source files associated with each function in the profile. Useful to quickly see which files contain performance-critical code.

Example:

go tool pprof -files cpu.prof

Output snippet:

flat  flat%   sum%        cum   cum%   file
    200ms 40.0%  40.0%      200ms 40.0%  myapp/handlers.go (ProcessOrder)
    150ms 30.0%  70.0%      150ms 30.0%  myapp/payment.go (ChargePayment)

2. -focus

Restricts the profile to samples whose call stacks pass through nodes matching a given regex.

Example:

go tool pprof -focus=orderLib cpu.prof

Output snippet:

flat  flat%   sum%        cum   cum%   function
    120ms 24.0%  24.0%      120ms 24.0%  orderLib.ProcessOrder
     50ms 10.0%  34.0%       50ms 10.0%  orderLib.ValidateOrder
     30ms  6.0%  40.0%       30ms  6.0%  orderLib.CheckInventory
     50ms 10.0%  50.0%       50ms 10.0%  orderLib.ChargePayment

3. -relative_percentages

Shows each node's cost relative to the total samples in the profile after applying any filters. Makes it easier to understand the impact of functions within a focused view.

Example:

go tool pprof -focus=runtime -relative_percentages cpu.out

Output snippet:

Showing nodes accounting for 3.12s, 93.98% of 3.32s total
      flat  flat%   sum%        cum   cum%
     2.13s 64.16% 64.16%      2.13s 64.16%  sync/atomic.CompareAndSwapPointer
     0.58s 17.47% 81.63%      0.58s 17.47%  runtime.procUnpin
     0.25s  7.53% 89.16%      0.25s  7.53%  runtime.procPin
     0.05s  1.51% 90.66%      0.05s  1.51%  runtime.usleep
     0.02s   0.6% 91.27%      0.02s   0.6%  internal/runtime/atomic.(*Uint8).Load (inline)
     0.02s   0.6% 91.87%      0.02s   0.6%  runtime.(*gQueue).pushBack
     0.02s   0.6% 92.47%      0.02s   0.6%  runtime.chansend
     0.02s   0.6% 93.07%      0.02s   0.6%  runtime.pthread_cond_signal
     0.02s   0.6% 93.67%      0.02s   0.6%  runtime.pthread_cond_wait
     0.01s   0.3% 93.98%      0.04s  1.20%  runtime.(*timers).run

Without -relative_percentages:

Showing nodes accounting for 3.12s, 36.11% of 8.64s total
Dropped 46 nodes (cum <= 0.04s)
Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
     2.13s 24.65% 24.65%      2.13s 24.65%  sync/atomic.CompareAndSwapPointer
     0.58s  6.71% 31.37%      0.58s  6.71%  runtime.procUnpin
     0.25s  2.89% 34.26%      0.25s  2.89%  runtime.procPin
     0.05s  0.58% 34.84%      0.05s  0.58%  runtime.usleep
     0.02s  0.23% 35.07%      0.02s  0.23%  internal/runtime/atomic.(*Uint8).Load (inline)
     0.02s  0.23% 35.30%      0.02s  0.23%  runtime.(*gQueue).pushBack
     0.02s  0.23% 35.53%      0.02s  0.23%  runtime.chansend
     0.02s  0.23% 35.76%      0.02s  0.23%  runtime.pthread_cond_signal
     0.02s  0.23% 36.00%      0.02s  0.23%  runtime.pthread_cond_wait
     0.01s  0.12% 36.11%      0.04s  0.46%  runtime.(*timers).run

Combined usage:

go tool pprof -files -focus=ProcessOrder -relative_percentages cpu.prof

OR

go tool pprof -functions -focus=ProcessOrder -relative_percentages cpu.prof
# or -lines instead of -functions

This produces a focused, file-level view of functions matching the target, showing their relative impact within the execution profile. It lets you analyze a specific component or path of your system in isolation, comparing functions only within that scope rather than against the entire program.

Realistic Workload Benchmark

Up until now, most of your efforts should have focused on cleaning and refining your system's implementation. No significant design changes should have been made. Optimizing for a scenario or use case that is not representative of real usage can mislead your efforts. Significant design changes should only come after gaining a better understanding of how the system is actually used and the overall direction of the project. At that point, you can decide whether to specialize for a particular workload or keep a more general approach.
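One way to approximate real usage in a benchmark is to replay operations in roughly the proportions you observe in production. A sketch, again reusing the hypothetical handlers from earlier; the 80/20 split is an assumption for illustration:

func BenchmarkRealisticWorkload(b *testing.B) {
    for i := 0; i < b.N; i++ {
        if i%10 < 8 {
            HandleRequest(newOrderRequest()) // ~80% of observed traffic (assumed ratio)
        } else {
            HandleRequest(newRefundRequest()) // ~20% of observed traffic (assumed ratio)
        }
    }
}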

Preferred Method

1. -peek

This method is preferred because it allows you to focus on a single function and see both its callers and callees. This gives a clear picture of how that function interacts with the rest of the system without being distracted by unrelated parts of the profile.

Example:

go tool pprof -peek="Get" cpu.out
# Shows callers and callees of a specific function

Output snippet:

Showing nodes accounting for 8.64s, 100% of 8.64s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                             3.80s   100% |   github.com/AlexsanderHamir/GenPool/test.BenchmarkGenPool.func1
     1.40s 16.20% 16.20%      3.80s 43.98%                |   github.com/AlexsanderHamir/GenPool/pool.Get
                                             0.95s 25.00% |   sync/atomic.CompareAndSwapPointer
                                             0.71s 18.68% |   sync/atomic.(*Pointer).Load (inline)
                                             0.33s  8.68% |   github.com/AlexsanderHamir/GenPool/pool.IncrementUsage
                                             0.27s  7.11% |   runtime.procUnpin
                                             0.10s  2.63% |   runtime.procPin
                                             0.04s  1.05% |   runtime.newstack
----------------------------------------------------------+-------------

On top of -peek, I combine all the methods above, along with prof, to organize and track my progress.

Conclusion

Performance work is an investigation. The simpler the case, the more people can contribute. But as a system becomes more optimized — and the requirements more demanding — fewer people will have the expertise to push it further. Still, good organization can make the journey much easier, and curiosity will definitely take you far, even if you didn't know much when you started.
