---
title: Profiling Go Programs
date: 2011-06-24
by:
- Russ Cox, July 2011; updated by Shenghou Ma, May 2013
tags:
- benchmark
- pprof
- profiling
- technical
summary: How to use Go's built-in profiler to understand and optimize your programs.
---

At Scala Days 2011, Robert Hundt presented a paper titled
[Loop Recognition in C++/Java/Go/Scala](http://research.google.com/pubs/pub37122.html).
The paper implemented a specific loop finding algorithm, such as you might use
in a flow analysis pass of a compiler, in C++, Go, Java, Scala, and then used
those programs to draw conclusions about typical performance concerns in these
languages.
The Go program presented in that paper runs quite slowly, making it
an excellent opportunity to demonstrate how to use Go's profiling tools to take
a slow program and make it faster.

_By using Go's profiling tools to identify and correct specific bottlenecks, we can make the Go loop finding program run an order of magnitude faster and use 6x less memory._
(Update: Due to recent optimizations of `libstdc++` in `gcc`, the memory reduction is now 3.7x.)

Hundt's paper does not specify which versions of the C++, Go, Java, and Scala
tools he used.
In this blog post, we will be using the most recent weekly snapshot of the `6g`
Go compiler and the version of `g++` that ships with the Ubuntu Natty
distribution.
(We will not be using Java or Scala, because we are not skilled at writing efficient
programs in either of those languages, so the comparison would be unfair.
Since C++ was the fastest language in the paper, the comparisons here with C++ should
suffice.)
(Update: In this updated post, we will be using the most recent development snapshot
of the Go compiler on amd64 and the most recent version of `g++` -- 4.8.0, which was
released in March 2013.)

    $ go version
    go version devel +08d20469cc20 Tue Mar 26 08:27:18 2013 +0100 linux/amd64
    $ g++ --version
    g++ (GCC) 4.8.0
    Copyright (C) 2013 Free Software Foundation, Inc.
    ...
    $

The programs are run on a computer with a 3.4GHz Core i7-2600 CPU and 16 GB of
RAM running Gentoo Linux's 3.8.4-gentoo kernel.
The machine is running with CPU frequency scaling disabled via

    $ sudo bash
    # for i in /sys/devices/system/cpu/cpu[0-7]
    do
        echo performance > $i/cpufreq/scaling_governor
    done
    #

We've taken [Hundt's benchmark programs](https://github.com/hundt98847/multi-language-bench)
in C++ and Go, combined each into a single source file, and removed all but one
line of output.
We'll time the program using Linux's `time` utility with a format that shows user time,
system time, real time, and maximum memory usage:

    $ cat xtime
    #!/bin/sh
    /usr/bin/time -f '%Uu %Ss %er %MkB %C' "$@"
    $

    $ make havlak1cc
    g++ -O3 -o havlak1cc havlak1.cc
    $ ./xtime ./havlak1cc
    # of loops: 76002 (total 3800100)
    loop-0, nest: 0, depth: 0
    17.70u 0.05s 17.80r 715472kB ./havlak1cc
    $

    $ make havlak1
    go build havlak1.go
    $ ./xtime ./havlak1
    # of loops: 76000 (including 1 artificial root node)
    25.05u 0.11s 25.20r 1334032kB ./havlak1
    $

The C++ program runs in 17.80 seconds and uses 700 MB of memory.
The Go program runs in 25.20 seconds and uses 1302 MB of memory.
(These measurements are difficult to reconcile with the ones in the paper, but the
point of this post is to explore how to use `go tool pprof`, not to reproduce the
results from the paper.)

To start tuning the Go program, we have to enable profiling.
If the code used the [Go testing package](/pkg/testing/)'s
benchmarking support, we could use go test's standard `-cpuprofile` and `-memprofile`
flags.
In a standalone program like this one, we have to import `runtime/pprof` and add a few
lines of code:

    var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

    func main() {
        flag.Parse()
        if *cpuprofile != "" {
            f, err := os.Create(*cpuprofile)
            if err != nil {
                log.Fatal(err)
            }
            pprof.StartCPUProfile(f)
            defer pprof.StopCPUProfile()
        }
        ...

The new code defines a flag named `cpuprofile`, calls the
[Go flag library](/pkg/flag/) to parse the command line flags,
and then, if the `cpuprofile` flag has been set on the command line,
[starts CPU profiling](/pkg/runtime/pprof/#StartCPUProfile)
redirected to that file.
The profiler requires a final call to
[`StopCPUProfile`](/pkg/runtime/pprof/#StopCPUProfile) to
flush any pending writes to the file before the program exits; we use `defer`
to make sure this happens as `main` returns.

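(The snippets in this post omit their import blocks; for the code above to compile, the program needs something like the following, matching the packages it calls into.)

    import (
        "flag"
        "log"
        "os"
        "runtime/pprof"
    )
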
After adding that code, we can run the program with the new `-cpuprofile` flag
and then run `go tool pprof` to interpret the profile.

    $ make havlak1.prof
    ./havlak1 -cpuprofile=havlak1.prof
    # of loops: 76000 (including 1 artificial root node)
    $ go tool pprof havlak1 havlak1.prof
    Welcome to pprof! For help, type 'help'.
    (pprof)

The `go tool pprof` program is a slight variant of
[Google's `pprof` C++ profiler](https://github.com/gperftools/gperftools).
The most important command is `topN`, which shows the top `N` samples in the profile:

    (pprof) top10
    Total: 2525 samples
         298  11.8%  11.8%      345  13.7% runtime.mapaccess1_fast64
         268  10.6%  22.4%     2124  84.1% main.FindLoops
         251   9.9%  32.4%      451  17.9% scanblock
         178   7.0%  39.4%      351  13.9% hash_insert
         131   5.2%  44.6%      158   6.3% sweepspan
         119   4.7%  49.3%      350  13.9% main.DFS
          96   3.8%  53.1%       98   3.9% flushptrbuf
          95   3.8%  56.9%       95   3.8% runtime.aeshash64
          95   3.8%  60.6%      101   4.0% runtime.settype_flush
          88   3.5%  64.1%      988  39.1% runtime.mallocgc

When CPU profiling is enabled, the Go program stops about 100 times per second
and records a sample consisting of the program counters on the currently executing
goroutine's stack.
The profile has 2525 samples, so it was running for a bit over 25 seconds.
In the `go tool pprof` output, there is a row for each function that appeared in
a sample.
The first two columns show the number of samples in which the function was running
(as opposed to waiting for a called function to return), as a raw count and as a
percentage of total samples.
The `runtime.mapaccess1_fast64` function was running during 298 samples, or 11.8%.
The `top10` output is sorted by this sample count.
The third column shows the running total during the listing:
the first three rows account for 32.4% of the samples.
The fourth and fifth columns show the number of samples in which the function appeared
(either running or waiting for a called function to return).
The `main.FindLoops` function was running in 10.6% of the samples, but it was on the
call stack (it or functions it called were running) in 84.1% of the samples.

To sort by the fourth and fifth columns, use the `-cum` (for cumulative) flag:

    (pprof) top5 -cum
    Total: 2525 samples
           0   0.0%   0.0%     2144  84.9% gosched0
           0   0.0%   0.0%     2144  84.9% main.main
           0   0.0%   0.0%     2144  84.9% runtime.main
           0   0.0%   0.0%     2124  84.1% main.FindHavlakLoops
         268  10.6%  10.6%     2124  84.1% main.FindLoops

In fact the total for `main.FindLoops` and `main.main` should have been 100%, but
each stack sample only includes the bottom 100 stack frames; during about a quarter
of the samples, the recursive `main.DFS` function was more than 100 frames deeper
than `main.main`, so the complete trace was truncated.

The stack trace samples contain more interesting data about function call relationships
than the text listings can show.
The `web` command writes a graph of the profile data in SVG format and opens it in a web
browser.
(There is also a `gv` command that writes PostScript and opens it in Ghostview.
For either command, you need [graphviz](http://www.graphviz.org/) installed.)

    (pprof) web

A small fragment of
[the full graph](https://rawgit.com/rsc/benchgraffiti/master/havlak/havlak1.svg) looks like:

{{image "pprof/havlak1a-75.png"}}

Each box in the graph corresponds to a single function, and the boxes are sized
according to the number of samples in which the function was running.
An edge from box X to box Y indicates that X calls Y; the number along the edge is
the number of times that call appears in a sample.
If a call appears multiple times in a single sample, such as during recursive function
calls, each appearance counts toward the edge weight.
That explains the 21342 on the self-edge from `main.DFS` to itself.

Just at a glance, we can see that the program spends much of its time in hash
operations, which correspond to use of Go's `map` values.
We can tell `web` to use only samples that include a specific function, such as
`runtime.mapaccess1_fast64`, which clears some of the noise from the graph:

    (pprof) web mapaccess1

{{image "pprof/havlak1-hash_lookup-75.png"}}

If we squint, we can see that the calls to `runtime.mapaccess1_fast64` are being
made by `main.FindLoops` and `main.DFS`.

Now that we have a rough idea of the big picture, it's time to zoom in on a particular
function.
Let's look at `main.DFS` first, just because it is a shorter function:

    (pprof) list DFS
    Total: 2525 samples
    ROUTINE ====================== main.DFS in /home/rsc/g/benchgraffiti/havlak/havlak1.go
       119    697 Total samples (flat / cumulative)
         3      3  240: func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number map[*BasicBlock]int, last []int, current int) int {
         1      1  241:     nodes[current].Init(currentNode, current)
         1     37  242:     number[currentNode] = current
         .      .  243:
         1      1  244:     lastid := current
        89     89  245:     for _, target := range currentNode.OutEdges {
         9    152  246:         if number[target] == unvisited {
         7    354  247:             lastid = DFS(target, nodes, number, last, lastid+1)
         .      .  248:         }
         .      .  249:     }
         7     59  250:     last[number[currentNode]] = lastid
         1      1  251:     return lastid
    (pprof)

The listing shows the source code for the `DFS` function (really, for every function
matching the regular expression `DFS`).
The first three columns are the number of samples taken while running that line, the
number of samples taken while running that line or in code called from that line, and
the line number in the file.
The related command `disasm` shows a disassembly of the function instead of a source
listing; when there are enough samples this can help you see which instructions are
expensive.
The `weblist` command mixes the two modes: it shows
[a source listing in which clicking a line shows the disassembly](https://rawgit.com/rsc/benchgraffiti/master/havlak/havlak1.html).

Since we already know that the time is going into map lookups implemented by the
hash runtime functions, we care most about the second column.
A large fraction of time is spent in recursive calls to `DFS` (line 247), as would be
expected from a recursive traversal.
Excluding the recursion, it looks like the time is going into the accesses to the
`number` map on lines 242, 246, and 250.
For that particular lookup, a map is not the most efficient choice.
Just as they would be in a compiler, the basic block structures have unique sequence
numbers assigned to them.
Instead of using a `map[*BasicBlock]int` we can use a `[]int`, a slice indexed by the
block number.
There's no reason to use a map when an array or slice will do.

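To make the shape of that change concrete, here is a small, self-contained sketch (not the benchmark's code) that assumes, as the `bb.Name` accesses later in the program suggest, that each block carries its sequence number in a `Name` field:

    package main

    import "fmt"

    // A trimmed-down stand-in for the benchmark's BasicBlock type:
    // Name is the block's unique sequence number.
    type BasicBlock struct {
        Name int
    }

    func main() {
        blocks := []*BasicBlock{{Name: 0}, {Name: 1}, {Name: 2}}

        // Map version: every lookup hashes the pointer key.
        numberMap := make(map[*BasicBlock]int)
        for i, b := range blocks {
            numberMap[b] = i
        }

        // Slice version: the sequence number is the index, so a lookup
        // is a single bounds-checked array access and the only
        // allocation is the slice itself.
        numberSlice := make([]int, len(blocks))
        for i, b := range blocks {
            numberSlice[b.Name] = i
        }

        fmt.Println(numberMap[blocks[2]], numberSlice[blocks[2].Name]) // 2 2
    }
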
Changing `number` from a map to a slice requires editing seven lines in the program
and cuts its run time by nearly a factor of two:

    $ make havlak2
    go build havlak2.go
    $ ./xtime ./havlak2
    # of loops: 76000 (including 1 artificial root node)
    16.55u 0.11s 16.69r 1321008kB ./havlak2
    $

(See the [diff between `havlak1` and `havlak2`](https://github.com/rsc/benchgraffiti/commit/58ac27bcac3ffb553c29d0b3fb64745c91c95948))

We can run the profiler again to confirm that `main.DFS` is no longer a significant
part of the run time:

    $ make havlak2.prof
    ./havlak2 -cpuprofile=havlak2.prof
    # of loops: 76000 (including 1 artificial root node)
    $ go tool pprof havlak2 havlak2.prof
    Welcome to pprof! For help, type 'help'.
    (pprof)
    (pprof) top5
    Total: 1652 samples
         197  11.9%  11.9%      382  23.1% scanblock
         189  11.4%  23.4%     1549  93.8% main.FindLoops
         130   7.9%  31.2%      152   9.2% sweepspan
         104   6.3%  37.5%      896  54.2% runtime.mallocgc
          98   5.9%  43.5%      100   6.1% flushptrbuf
    (pprof)

The entry `main.DFS` no longer appears in the profile, and the rest of the program
runtime has dropped too.
Now the program is spending most of its time allocating memory and garbage collecting
(`runtime.mallocgc`, which both allocates and runs periodic garbage collections,
accounts for 54.2% of the time).
To find out why the garbage collector is running so much, we have to find out what is
allocating memory.
One way is to add memory profiling to the program.
We'll arrange that if the `-memprofile` flag is supplied, the program stops after one
iteration of the loop finding, writes a memory profile, and exits:

    var memprofile = flag.String("memprofile", "", "write memory profile to this file")
    ...

        FindHavlakLoops(cfgraph, lsgraph)
        if *memprofile != "" {
            f, err := os.Create(*memprofile)
            if err != nil {
                log.Fatal(err)
            }
            pprof.WriteHeapProfile(f)
            f.Close()
            return
        }

We invoke the program with the `-memprofile` flag to write a profile:

    $ make havlak3.mprof
    go build havlak3.go
    ./havlak3 -memprofile=havlak3.mprof
    $

(See the [diff from havlak2](https://github.com/rsc/benchgraffiti/commit/b78dac106bea1eb3be6bb3ca5dba57c130268232))

We use `go tool pprof` exactly the same way. Now the samples we are examining are
memory allocations, not clock ticks.

    $ go tool pprof havlak3 havlak3.mprof
    Adjusting heap profiles for 1-in-524288 sampling rate
    Welcome to pprof! For help, type 'help'.
    (pprof) top5
    Total: 82.4 MB
        56.3  68.4%  68.4%     56.3  68.4% main.FindLoops
        17.6  21.3%  89.7%     17.6  21.3% main.(*CFG).CreateNode
         8.0   9.7%  99.4%     25.6  31.0% main.NewBasicBlockEdge
         0.5   0.6% 100.0%      0.5   0.6% itab
         0.0   0.0% 100.0%      0.5   0.6% fmt.init
    (pprof)

The command `go tool pprof` reports that `FindLoops` has allocated approximately
56.3 of the 82.4 MB in use; `CreateNode` accounts for another 17.6 MB.
To reduce overhead, the memory profiler only records information for approximately
one block per half megabyte allocated (the “1-in-524288 sampling rate”), so these
are approximations to the actual counts.

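(An aside that is not part of the original program: the profiler records, on average, one sample per `runtime.MemProfileRate` bytes allocated, 512 KB by default; setting the variable lower, early in the program, trades profiling overhead for precision. A minimal sketch of a file you could add to such a program:)

    import "runtime"

    func init() {
        // Record every allocated block instead of one block per 512 KB
        // (the default). Finer sampling makes heap profiles more exact
        // but slows the program down, so it is only worth doing for
        // debugging runs.
        runtime.MemProfileRate = 1
    }
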
To find the memory allocations, we can list those functions.

{{raw `
(pprof) list FindLoops
Total: 82.4 MB
ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
  56.3   56.3 Total MB (flat / cumulative)
...
   1.9    1.9  268:     nonBackPreds := make([]map[int]bool, size)
   5.8    5.8  269:     backPreds := make([][]int, size)
     .      .  270:
   1.9    1.9  271:     number := make([]int, size)
   1.9    1.9  272:     header := make([]int, size, size)
   1.9    1.9  273:     types := make([]int, size, size)
   1.9    1.9  274:     last := make([]int, size, size)
   1.9    1.9  275:     nodes := make([]*UnionFindNode, size, size)
     .      .  276:
     .      .  277:     for i := 0; i < size; i++ {
   9.5    9.5  278:         nodes[i] = new(UnionFindNode)
     .      .  279:     }
...
     .      .  286:     for i, bb := range cfgraph.Blocks {
     .      .  287:         number[bb.Name] = unvisited
  29.5   29.5  288:         nonBackPreds[i] = make(map[int]bool)
     .      .  289:     }
...
`}}

It looks like the current bottleneck is the same as the last one: using maps where
simpler data structures suffice.
`FindLoops` is allocating about 29.5 MB of maps.

As an aside, if we run `go tool pprof` with the `--inuse_objects` flag, it will
report allocation counts instead of sizes:

{{raw `
$ go tool pprof --inuse_objects havlak3 havlak3.mprof
Adjusting heap profiles for 1-in-524288 sampling rate
Welcome to pprof! For help, type 'help'.
(pprof) list FindLoops
Total: 1763108 objects
ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
720903 720903 Total objects (flat / cumulative)
...
     .      .  277:     for i := 0; i < size; i++ {
311296 311296  278:         nodes[i] = new(UnionFindNode)
     .      .  279:     }
     .      .  280:
     .      .  281:     // Step a:
     .      .  282:     //   - initialize all nodes as unvisited.
     .      .  283:     //   - depth-first traversal and numbering.
     .      .  284:     //   - unreached BB's are marked as dead.
     .      .  285:     //
     .      .  286:     for i, bb := range cfgraph.Blocks {
     .      .  287:         number[bb.Name] = unvisited
409600 409600  288:         nonBackPreds[i] = make(map[int]bool)
     .      .  289:     }
...
(pprof)
`}}

Since the ~200,000 maps account for 29.5 MB, it looks like the initial map allocation
takes about 150 bytes.
That's reasonable when a map is being used to hold key-value pairs, but not when a map
is being used as a stand-in for a simple set, as it is here.

Instead of using a map, we can use a simple slice to list the elements.
In all but one of the cases where maps are being used, it is impossible for the algorithm
to insert a duplicate element.
In the one remaining case, we can write a simple variant of the `append` built-in function:

    func appendUnique(a []int, x int) []int {
        for _, y := range a {
            if x == y {
                return a
            }
        }
        return append(a, x)
    }

In addition to writing that function, changing the Go program to use slices instead
of maps requires changing just a few lines of code.

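To see the two representations side by side, here is a small, self-contained demo (not code from the benchmark) of a set of ints kept as a `map[int]bool` versus a slice maintained with `appendUnique`:

    package main

    import "fmt"

    func appendUnique(a []int, x int) []int {
        for _, y := range a {
            if x == y {
                return a
            }
        }
        return append(a, x)
    }

    func main() {
        preds := []int{3, 5, 3, 7}

        // Set represented as a map: each empty map costs roughly 150 bytes
        // up front, plus hashing on every insert and lookup.
        set := make(map[int]bool)
        for _, x := range preds {
            set[x] = true
        }

        // Set represented as a slice: one backing array that grows as needed.
        var list []int
        for _, x := range preds {
            list = appendUnique(list, x)
        }

        fmt.Println(len(set), list) // 3 [3 5 7]
    }
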
    $ make havlak4
    go build havlak4.go
    $ ./xtime ./havlak4
    # of loops: 76000 (including 1 artificial root node)
    11.84u 0.08s 11.94r 810416kB ./havlak4
    $

(See the [diff from havlak3](https://github.com/rsc/benchgraffiti/commit/245d899f7b1a33b0c8148a4cd147cb3de5228c8a))

We're now at 2.11x faster than when we started. Let's look at a CPU profile again.

    $ make havlak4.prof
    ./havlak4 -cpuprofile=havlak4.prof
    # of loops: 76000 (including 1 artificial root node)
    $ go tool pprof havlak4 havlak4.prof
    Welcome to pprof! For help, type 'help'.
    (pprof) top10
    Total: 1173 samples
         205  17.5%  17.5%     1083  92.3% main.FindLoops
         138  11.8%  29.2%      215  18.3% scanblock
          88   7.5%  36.7%       96   8.2% sweepspan
          76   6.5%  43.2%      597  50.9% runtime.mallocgc
          75   6.4%  49.6%       78   6.6% runtime.settype_flush
          74   6.3%  55.9%       75   6.4% flushptrbuf
          64   5.5%  61.4%       64   5.5% runtime.memmove
          63   5.4%  66.8%      524  44.7% runtime.growslice
          51   4.3%  71.1%       51   4.3% main.DFS
          50   4.3%  75.4%      146  12.4% runtime.MCache_Alloc
    (pprof)

Now memory allocation and the consequent garbage collection (`runtime.mallocgc`)
account for 50.9% of our run time.
Another way to look at why the system is garbage collecting is to look at the
allocations that are causing the collections, the ones that spend most of the time
in `mallocgc`:

    (pprof) web mallocgc

{{image "pprof/havlak4a-mallocgc.png"}}

It's hard to tell what's going on in that graph, because there are many nodes with
small sample numbers obscuring the big ones.
We can tell `go tool pprof` to ignore nodes that don't account for at least 10% of
the samples:

    $ go tool pprof --nodefraction=0.1 havlak4 havlak4.prof
    Welcome to pprof! For help, type 'help'.
    (pprof) web mallocgc

{{image "pprof/havlak4a-mallocgc-trim.png"}}

We can follow the thick arrows easily now, to see that `FindLoops` is triggering
most of the garbage collection.
If we list `FindLoops`, we can see that much of it is right at the beginning:

{{raw `
(pprof) list FindLoops
...
     .      .  270: func FindLoops(cfgraph *CFG, lsgraph *LSG) {
     .      .  271:     if cfgraph.Start == nil {
     .      .  272:         return
     .      .  273:     }
     .      .  274:
     .      .  275:     size := cfgraph.NumNodes()
     .      .  276:
     .    145  277:     nonBackPreds := make([][]int, size)
     .      9  278:     backPreds := make([][]int, size)
     .      .  279:
     .      1  280:     number := make([]int, size)
     .     17  281:     header := make([]int, size, size)
     .      .  282:     types := make([]int, size, size)
     .      .  283:     last := make([]int, size, size)
     .      .  284:     nodes := make([]*UnionFindNode, size, size)
     .      .  285:
     .      .  286:     for i := 0; i < size; i++ {
     2     79  287:         nodes[i] = new(UnionFindNode)
     .      .  288:     }
...
(pprof)
`}}

Every time `FindLoops` is called, it allocates some sizable bookkeeping structures.
Since the benchmark calls `FindLoops` 50 times, these add up to a significant amount
of garbage, so a significant amount of work for the garbage collector.

Having a garbage-collected language doesn't mean you can ignore memory allocation
issues.
In this case, a simple solution is to introduce a cache so that each call to `FindLoops`
reuses the previous call's storage when possible.
(In fact, in Hundt's paper, he explains that the Java program needed just this change to
get anything like reasonable performance, but he did not make the same change in the
other garbage-collected implementations.)

We'll add a global `cache` structure:

    var cache struct {
        size         int
        nonBackPreds [][]int
        backPreds    [][]int
        number       []int
        header       []int
        types        []int
        last         []int
        nodes        []*UnionFindNode
    }

and then have `FindLoops` consult it as a replacement for allocation:

{{raw `
if cache.size < size {
    cache.size = size
    cache.nonBackPreds = make([][]int, size)
    cache.backPreds = make([][]int, size)
    cache.number = make([]int, size)
    cache.header = make([]int, size)
    cache.types = make([]int, size)
    cache.last = make([]int, size)
    cache.nodes = make([]*UnionFindNode, size)
    for i := range cache.nodes {
        cache.nodes[i] = new(UnionFindNode)
    }
}

nonBackPreds := cache.nonBackPreds[:size]
for i := range nonBackPreds {
    nonBackPreds[i] = nonBackPreds[i][:0]
}
backPreds := cache.backPreds[:size]
for i := range backPreds {
    backPreds[i] = backPreds[i][:0]
}
number := cache.number[:size]
header := cache.header[:size]
types := cache.types[:size]
last := cache.last[:size]
nodes := cache.nodes[:size]
`}}

Such a global variable is bad engineering practice, of course: it means that
concurrent calls to `FindLoops` are now unsafe.
For now, we are making the minimal possible changes in order to understand what
is important for the performance of our program; this change is simple and mirrors
the code in the Java implementation.
The final version of the Go program will use a separate `LoopFinder` instance to
track this memory, restoring the possibility of concurrent use.

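(As a rough sketch of that structure, and not the exact code in `havlak6.go`: the cached slices move from a package-level variable into a struct, and the entry point becomes a method, so concurrent callers can each use their own finder.)

    // Sketch only; havlak6.go differs in detail.
    type LoopFinder struct {
        nonBackPreds [][]int
        backPreds    [][]int
        number       []int
        header       []int
        types        []int
        last         []int
        nodes        []*UnionFindNode
    }

    // FindLoops reuses the receiver's slices from call to call, so two
    // goroutines can run loop finding concurrently as long as each one
    // owns its own *LoopFinder.
    func (f *LoopFinder) FindLoops(cfgraph *CFG, lsgraph *LSG) {
        // ... same algorithm as before, but growing and resetting
        // f.nonBackPreds, f.backPreds, and so on instead of the
        // global cache ...
    }
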
    $ make havlak5
    go build havlak5.go
    $ ./xtime ./havlak5
    # of loops: 76000 (including 1 artificial root node)
    8.03u 0.06s 8.11r 770352kB ./havlak5
    $

(See the [diff from havlak4](https://github.com/rsc/benchgraffiti/commit/2d41d6d16286b8146a3f697dd4074deac60d12a4))

There's more we can do to clean up the program and make it faster, but none of
it requires profiling techniques that we haven't already shown.
The work list used in the inner loop can be reused across iterations and across
calls to `FindLoops`, and it can be combined with the separate “node pool” generated
during that pass.
Similarly, the loop graph storage can be reused on each iteration instead of reallocated.
In addition to these performance changes, the
[final version](https://github.com/rsc/benchgraffiti/blob/master/havlak/havlak6.go)
is written using idiomatic Go style, using data structures and methods.
The stylistic changes have only a minor effect on the run time: the algorithm and
constraints are unchanged.

The final version runs in 2.29 seconds and uses 351 MB of memory:

    $ make havlak6
    go build havlak6.go
    $ ./xtime ./havlak6
    # of loops: 76000 (including 1 artificial root node)
    2.26u 0.02s 2.29r 360224kB ./havlak6
    $

That's 11 times faster than the program we started with.
Even if we disable reuse of the generated loop graph, so that the only cached memory
is the loop finding bookkeeping, the program still runs 6.7x faster than the original
and uses 1.5x less memory.

    $ ./xtime ./havlak6 -reuseloopgraph=false
    # of loops: 76000 (including 1 artificial root node)
    3.69u 0.06s 3.76r 797120kB ./havlak6 -reuseloopgraph=false
    $

Of course, it's no longer fair to compare this Go program to the original C++
program, which used inefficient data structures like `set`s where `vector`s would
be more appropriate.
As a sanity check, we translated the final Go program into
[equivalent C++ code](https://github.com/rsc/benchgraffiti/blob/master/havlak/havlak6.cc).
Its execution time is similar to the Go program's:

    $ make havlak6cc
    g++ -O3 -o havlak6cc havlak6.cc
    $ ./xtime ./havlak6cc
    # of loops: 76000 (including 1 artificial root node)
    1.99u 0.19s 2.19r 387936kB ./havlak6cc

The Go program runs almost as fast as the C++ program.
Because the C++ program uses automatic deletion and allocation instead of an explicit
cache, it is a bit shorter and easier to write, but not dramatically so:

    $ wc havlak6.cc; wc havlak6.go
     401 1220 9040 havlak6.cc
     461 1441 9467 havlak6.go
    $

(See [havlak6.cc](https://github.com/rsc/benchgraffiti/blob/master/havlak/havlak6.cc)
and [havlak6.go](https://github.com/rsc/benchgraffiti/blob/master/havlak/havlak6.go))

Benchmarks are only as good as the programs they measure.
We used `go tool pprof` to study an inefficient Go program and then to improve its
performance by an order of magnitude and to reduce its memory usage by a factor of 3.7.
A subsequent comparison with an equivalently optimized C++ program shows that Go can be
competitive with C++ when programmers are careful about how much garbage is generated
by inner loops.

The program sources, Linux x86-64 binaries, and profiles used to write this post
are available in the [benchgraffiti project on GitHub](https://github.com/rsc/benchgraffiti/).

As mentioned above, [`go test`](/cmd/go/#Test_packages) includes
these profiling flags already: define a
[benchmark function](/pkg/testing/) and you're all set.

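For example, if the loop finder were packaged so that a test could call it, a benchmark might look roughly like this (a hypothetical sketch: `buildTestGraph` stands for whatever setup constructs the CFG and is not a function in the benchmark sources); `go test -bench=. -cpuprofile=cpu.prof` would then write a profile for `go tool pprof` to read:

    package main

    import "testing"

    func BenchmarkFindLoops(b *testing.B) {
        cfgraph, lsgraph := buildTestGraph() // hypothetical setup helper
        b.ResetTimer()                       // exclude graph construction from the timing
        for i := 0; i < b.N; i++ {
            FindHavlakLoops(cfgraph, lsgraph)
        }
    }
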
There is also a standard HTTP interface to profiling data. In an HTTP server, adding

    import _ "net/http/pprof"

will install handlers for a few URLs under `/debug/pprof/`.
Then you can run `go tool pprof` with a single argument, the URL to your server's
profiling data, and it will download and examine a live profile.

    go tool pprof http://localhost:6060/debug/pprof/profile   # 30-second CPU profile
    go tool pprof http://localhost:6060/debug/pprof/heap      # heap profile
    go tool pprof http://localhost:6060/debug/pprof/block     # goroutine blocking profile

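(For a program that doesn't already run an HTTP server, a minimal sketch that exposes these endpoints on port 6060, matching the URLs above, is below; the port is just a convention.)

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ handlers
    )

    func main() {
        // Serve the profiling endpoints in the background while the
        // program's real work happens elsewhere.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        select {} // stand-in for the program's actual work
    }
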
The goroutine blocking profile will be explained in a future post. Stay tuned.