Friday, June 24, 2011

Profiling Go Programs

By using Go's profiling tools to identify and correct specific bottlenecks, we can make the Go loop finding program run an order of magnitude faster and use 6x less memory.

Hundt's paper does not specify which versions of the C++, Go, Java, and Scala tools he used. In this blog post, we will be using the most recent weekly snapshot of the 6g Go compiler and the version of g++
that ships with the Ubuntu Natty distribution. (We will not be using
Java or Scala, because we are not skilled at writing efficient programs
in either of those languages, so the comparison would be unfair. Since
C++ was the fastest language in the paper, the comparisons here with C++
should suffice.)
The programs are run on a Lenovo X201s with a 2.13 GHz Core i7-based CPU and 4 GB of RAM running Ubuntu Natty's Linux 2.6.38-8-generic kernel. The machine is running with CPU frequency scaling disabled, so that the timings are stable across runs.
We've taken the loop finding programs in C++ and Go, combined each into a single source file, and removed all but one line of output. We'll time the programs using Linux's time utility with a format that shows user time, system time, real time, and maximum memory usage.
The C++ program runs in 27.47 seconds and uses 700 MB of memory. The Go program runs in 56.92 seconds and uses 1604 MB of memory. (These measurements are difficult to reconcile with the ones in the paper, but the point of this post is to explore how to use gopprof, not to reproduce the results from the paper.)

To start tuning the Go program, we have to enable profiling. If the code used the Go testing package's benchmarking support, we could use gotest's standard -cpuprofile and -memprofile flags. In a standalone program like this one, we have to import runtime/pprof and add a few lines of code:
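A sketch of that code, following the standard flag and runtime/pprof pattern the next paragraph describes (details such as the flag's help text may differ from the original program):

    package main

    import (
        "flag"
        "log"
        "os"
        "runtime/pprof"
    )

    var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

    func main() {
        flag.Parse()
        if *cpuprofile != "" {
            f, err := os.Create(*cpuprofile)
            if err != nil {
                log.Fatal(err)
            }
            pprof.StartCPUProfile(f)
            // StopCPUProfile flushes pending writes to the file before exit.
            defer pprof.StopCPUProfile()
        }
        // ... the loop finding program proper ...
    }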
The new code defines a flag named cpuprofile, calls the Go flag library to parse the command line flags, and then, if the cpuprofile flag has been set on the command line, starts CPU profiling redirected to that file. The profiler requires a final call to StopCPUProfile to flush any pending writes to the file before the program exits; we use defer to make sure this happens as main returns.

After adding that code, we can run the program with the new -cpuprofile flag and then run gopprof to interpret the profile.
The gopprof program is a slight variant of Google's pprof C++ profiler. The most important command is topN, which shows the top N samples in the profile:
When CPU profiling is enabled, the Go program stops about 100 times per second and records a sample consisting of the program counters on the currently executing goroutine's stack. The profile has 5758 samples, so it was running for a bit over 57 seconds. In the gopprof
output, there is a row for each function that appeared in a sample. The
first two columns show the number of samples in which the function was
running (as opposed to waiting for a called function to return), as a
raw count and as a percentage of total samples. The hash_lookup function was running during 1028 samples, or 17.9%. The top10
output is sorted by this sample count. The third column shows the
running total during the listing: the first three rows account for 39.8%
of the samples. The fourth and fifth columns show the number of samples
in which the function appeared (either running or waiting for a called
function to return). The main.FindLoops function was
running in 7.3% of the samples, but it was on the call stack (it or
functions it called were running) in 74.3% of the samples.

To sort by the fourth and fifth columns, use the -cum (for cumulative) flag.
In fact the total for main.FindLoops and main.main
should have been 100%, but each stack sample only includes the bottom
100 stack frames; during about a quarter of the samples, the recursive main.DFS function was more than 100 frames deeper than main.main, so the complete trace was truncated.

The stack trace samples contain more interesting data about function call relationships than the text listings can show. The web command writes a graph of the profile data in SVG format and opens it in a web browser. (There is also a gv command that writes PostScript and opens it in Ghostview. For either command, you need graphviz installed.)
A small fragment of the full graph looks like this:

Each box in the graph corresponds to a single function, and the boxes are sized according to the number of samples in which the function was running. An edge from box X to box Y indicates that X calls Y; the number along the edge is the number of times that call appears in a sample. If a call appears multiple times in a single sample, such as during recursive function calls, each appearance counts toward the edge weight. That explains the 69206 on the self-edge from main.DFS to itself.

Just at a glance, we can see that the program spends much of its time in hash operations, which correspond to use of Go's map values. We can tell web to use only samples that include a specific function, such as hash_lookup, which clears some of the noise from the graph.
If we squint, we can see that the calls to runtime.mapaccess1 are being made by main.FindLoops and main.DFS.

Now that we have a rough idea of the big picture, it's time to zoom in on a particular function. Let's look at main.DFS first, just because it is a shorter function.
The listing shows the source code for the DFS function (really, for every function matching the regular expression DFS).
The first three columns are the number of samples taken while running
that line, the number of samples taken while running that line or in
code called from that line, and the line number in the file. The related
command disasm shows a disassembly of the function instead
of a source listing; when there are enough samples this can help you
see which instructions are expensive. The weblist command mixes the two modes: it shows a source listing in which clicking a line shows the disassembly.

Since we already know that the time is going into map lookups implemented by the hash runtime functions, we care most about the second column. A large fraction of time is spent in recursive calls to DFS
(line 247), as would be expected from a recursive traversal. Excluding
the recursion, it looks like the time is going into the accesses to the number
map on lines 242, 246, and 250. For that particular lookup, a map is
not the most efficient choice. Just as they would be in a compiler, the
basic block structures have unique sequence numbers assigned to them.
Instead of using a map[*BasicBlock]int we can use a []int, a slice indexed by the block number. There's no reason to use a map when an array or slice will do.

Changing number from a map to a slice requires editing seven lines in the program and cuts its run time by nearly a factor of two.
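A sketch of the change (the type name matches the text; the Num field is hypothetical, standing in for whatever the program calls its per-block sequence number):

    package main

    import "fmt"

    // BasicBlock stands in for the benchmark's block type; Num (name
    // hypothetical) is the unique sequence number each block carries.
    type BasicBlock struct {
        Num int
    }

    func main() {
        blocks := []*BasicBlock{{Num: 0}, {Num: 1}, {Num: 2}}

        // Before: a map keyed by block pointer; every access pays for hashing.
        numberMap := make(map[*BasicBlock]int)
        for i, b := range blocks {
            numberMap[b] = i
        }

        // After: a slice indexed by the block's sequence number; each map
        // lookup becomes a plain array access.
        number := make([]int, len(blocks))
        for i, b := range blocks {
            number[b.Num] = i
        }

        fmt.Println(numberMap[blocks[1]], number[blocks[1].Num]) // 1 1
    }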
We can run the profiler again to confirm that main.DFS is no longer a significant part of the run time.
main.DFS
still appears in the profile, but its total time has dropped from 20.0%
to 3.6%. The rest of the program runtime has dropped too. Now the
program is spending most of its time allocating memory and garbage
collecting (runtime.mallocgc, which both allocates and runs
periodic garbage collections, accounts for 55.9% of the time). To find
out why the garbage collector is running so much, we have to find out
what is allocating memory. One way is to add memory profiling to the
program. We'll arrange that if the -memprofile flag is supplied, the program stops after one iteration of the loop finding, writes a memory profile, and exits:
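A sketch of that arrangement, reusing the flag/pprof pattern from before (findLoopsOnce is an illustrative stand-in for the benchmark's entry point):

    package main

    import (
        "flag"
        "log"
        "os"
        "runtime/pprof"
    )

    var memprofile = flag.String("memprofile", "", "write memory profile to this file")

    func findLoopsOnce() { /* one iteration of the loop finding; stub here */ }

    func main() {
        flag.Parse()
        findLoopsOnce()
        if *memprofile != "" {
            f, err := os.Create(*memprofile)
            if err != nil {
                log.Fatal(err)
            }
            pprof.WriteHeapProfile(f) // write a snapshot of the live heap
            f.Close()
            return
        }
        // ... remaining iterations of the benchmark ...
    }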
We invoke the program with the -memprofile flag to write a profile.
We use gopprof exactly the same way. Now the samples we are examining are memory allocations, not clock ticks.
Gopprof reports that FindLoops has allocated approximately 66.1 of the 118.3 MB in use; NewLoop
accounts for another 30.5 MB. To reduce overhead, the memory profiler
only records information for approximately one block per half megabyte
allocated (the “1-in-524288 sampling rate”), so these are approximations
to the actual counts.

To find the memory allocations, we can list those functions.
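As an aside, that sampling rate is adjustable: runtime.MemProfileRate holds the average number of bytes allocated per recorded sample. A minimal sketch, assuming we wanted exact counts at some cost in speed:

    package main

    import "runtime"

    func main() {
        // Default is 512 * 1024 bytes (the 1-in-524288 rate above).
        // Setting it to 1 records every allocated block; do this before
        // the program allocates, and expect it to run slower.
        runtime.MemProfileRate = 1

        // ... rest of the program ...
    }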
It looks like the current bottleneck is the same as the last one: using maps where simpler data structures suffice.
Since the 200,000 maps account for 20 MB, it looks like the initial map allocation takes about 100 bytes. That's reasonable when a map is being used to hold key-value pairs, but not when a map is being used as a stand-in for a simple set, as it is here. Instead of using a map, we can use a simple slice to list the elements. In all but one of the cases where maps are being used, it is impossible for the algorithm to insert a duplicate element. In the one remaining case, we can write a simple variant of the append built-in function:
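A sketch of that variant (the name appendUnique is illustrative):

    // appendUnique appends x to a only if a does not already contain it,
    // returning the possibly grown slice. For the short lists in this
    // program a linear scan is cheaper than maintaining a map.
    func appendUnique(a []int, x int) []int {
        for _, y := range a {
            if x == y {
                return a
            }
        }
        return append(a, x)
    }

A map being used as a set, such as a map[int]bool, then becomes a plain []int built with this helper.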
In
addition to writing that function, changing the Go program to use
slices instead of maps requires changing just a few lines of code.
The program is now 3x faster than when we started. Let's look at a CPU profile again.
Now memory allocation and the consequent garbage collection (runtime.mallocgc) dominate the run time.
It's hard to tell what's going on in that graph, because there are many
nodes with small sample numbers obscuring the big ones. We can tell
gopprof to ignore nodes that don't account for a meaningful fraction of
the samples, which drops that clutter. We can follow the thick arrows
easily now, to see that FindLoops is triggering most of the garbage
collection.
Every time FindLoops is called, it allocates some sizable bookkeeping structures. Since the benchmark calls FindLoops 50 times, these add up to a significant amount of garbage, and hence a significant amount of work for the garbage collector.

Having a garbage-collected language doesn't mean you can ignore memory allocation issues. In this case, a simple solution is to introduce a cache so that each call to FindLoops reuses the previous
call's storage when possible. (In fact, in Hundt's paper, he explains
that the Java program needed just this change to get anything like
reasonable performance, but he did not make the same change in the other
garbage-collected implementations.)

We'll add a global cache structure:
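A sketch of such a structure (field names illustrative; the real code would cache each of FindLoops's bookkeeping slices):

    // cache holds FindLoops's bookkeeping storage across calls.
    var cache struct {
        size   int   // capacity the buffers were last grown to
        number []int // one cached buffer shown; the real cache has several
    }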
and then have FindLoops consult it as a replacement for allocation:
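Again a sketch, with an illustrative signature: the cached buffers grow only when the input is larger than any seen before, and are resliced and cleared instead of reallocated.

    func FindLoops(size int) {
        if cache.size < size {
            cache.size = size
            cache.number = make([]int, size)
        }
        number := cache.number[:size]
        for i := range number {
            number[i] = 0 // reused storage must be reset by hand
        }
        _ = number // ... the algorithm proper goes on to use number ...
    }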
Such a global variable is bad engineering practice, of course: it means that concurrent calls to FindLoops
are now unsafe. For now, we are making the minimal possible changes in
order to understand what is important for the performance of our
program; this change is simple and mirrors the code in the Java
implementation. The final version of the Go program will use a separate LoopFinder instance to track this memory, restoring the possibility of concurrent use.
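A sketch of that final shape (names beyond LoopFinder are illustrative):

    // A LoopFinder owns its bookkeeping buffers, so separate instances
    // can run concurrently; the global cache above disappears.
    type LoopFinder struct {
        size   int
        number []int
    }

    func (lf *LoopFinder) FindLoops(numBlocks int) {
        if lf.size < numBlocks {
            lf.size = numBlocks
            lf.number = make([]int, numBlocks)
        }
        // ... loop finding using lf.number ...
    }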
There's more we can do to clean up the program and make it faster, but none of it requires profiling techniques that we haven't already shown. The work list used in the inner loop can be reused across iterations and across calls to FindLoops , and it can be combined with the
separate “node pool” generated during that pass. Similarly, the loop
graph storage can be reused on each iteration instead of reallocated. In
addition to these performance changes, the final version
is written using idiomatic Go style, using data structures and methods.
The stylistic changes have only a minor effect on the run time: the
algorithm and constraints are unchanged.

The final version runs in 3.84 seconds and uses 257 MB of memory.
That's nearly 15 times faster than the program we started with. Even if we disable reuse of the generated loop graph, so that the only cached memory is the loop finding bookkeeping, the program still runs 10x faster than the original and uses 2.5x less memory.
Of course, it's no longer fair to compare this Go program to the original C++ program, which used inefficient data structures like sets where vectors would be more appropriate. As a sanity check, we translated the final Go program into equivalent C++ code. Its execution time is similar to the Go program's.
The Go program runs slightly faster because the C++ program is using automatic deletes and allocation instead of an explicit cache. That makes the C++ program a bit shorter and easier to write, but not dramatically so.
Benchmarks are only as good as the programs they measure. We used gopprof
to study an inefficient Go program and then to improve its performance
by an order of magnitude and to reduce its memory usage by a factor of
six. A subsequent comparison with an equivalently optimized C++ program
shows that Go can be competitive with C++ when programmers are careful
about how much garbage is generated by inner loops.

The program sources, Linux x86-64 binaries, and profiles used to write this post are available in the benchgraffiti project on Google Code. As mentioned above, gotest includes these profiling flags already: define a benchmark function and you're all set. There is also a standard HTTP interface to profiling data.
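In an HTTP server, a side-effect import of the pprof handler package is all it takes. A minimal sketch (net/http/pprof is the package's current import path; the listen address is illustrative):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // side effect: registers the /debug/pprof/ handlers
    )

    func main() {
        // Any server using the default mux now serves profiling data.
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }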
That import installs handlers for a few URLs under /debug/pprof/. Then you can run gopprof with a single argument—the URL to your server's profiling data—and it will download and examine a live profile.
- Russ Cox