The heuristic was introduced in 4e2aa98e, based on the C++ Snappy
implementation, but the Go code contained a flawed optimization. The C++ code
used an explicit skip variable:
	uint32 bytes_between_hash_lookups = skip++ >> 5;
	next_ip = ip + bytes_between_hash_lookups;
whereas the Go code optimized this to be an implicit skip:
	s += 1 + (s-lit)>>5
This is equivalent for small s values (relative to lit, the last hash table
hit), but diverges for large ones. This Go program demonstrates the difference:
	package main

	import "fmt"

	// main prints the encoder skipping behavior when seeing no hash table hits.
	func main() {
		s0, s1 := 0, 0
		skip := 32
		for i := 0; i < 300; i++ {
			// This is the C++ Snappy algorithm.
			bytes_between_hash_lookups := skip >> 5
			skip++
			s0 += bytes_between_hash_lookups
			// This is the Go Snappy algorithm.
			s1 += 1 + s1>>5
			// The intention was that the Go algorithm behaves the same as
			// the C++ one, but it doesn't.
			if i%10 == 0 {
				fmt.Printf("%d\t%d\t%d\n", i, s0, s1)
			}
		}
	}
It prints:
0 1 1
10 11 11
20 21 21
30 31 31
40 50 50
50 70 73
60 90 105
70 117 149
80 147 208
90 177 288
100 212 398
110 252 548
120 292 752
130 335 1030
140 385 1408
150 435 1922
160 486 2619
170 546 3568
180 606 4861
190 666 6617
200 735 9005
210 805 12257
220 875 16681
230 952 22697
240 1032 30881
250 1112 42015
260 1197 57161
270 1287 77764
280 1377 105791
290 1470 143914
The C++ algorithm is quadratic: skip increases by one per lookup, so the step
size skip>>5 grows linearly with the number of misses, and the total distance
covered grows quadratically. The Go algorithm is exponential: s += 1 + s>>5
multiplies s by roughly 33/32 on every iteration.
This commit re-introduces the explicit skip variable, so that the Go
implementation matches the C++ implementation.
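For concreteness, here is a minimal, self-contained sketch of the
explicit-skip pattern (scanForMatch and hit are hypothetical stand-ins, not
the actual encoder code; the real encoder also hashes 4-byte windows,
maintains the hash table, and emits literals and copies):

	package main

	import "fmt"

	// scanForMatch is a stripped-down stand-in for the encoder's scan loop,
	// showing only the skipping logic. hit reports whether the (stand-in)
	// hash table has a candidate at position s.
	func scanForMatch(srcLen int, hit func(s int) bool) int {
		s := 0
		skip := 32 // skip>>5 gives an initial step of 1 byte
		for s < srcLen {
			if hit(s) {
				return s
			}
			// Explicit skip, as in the C++ code quoted above: the step
			// grows by 1 byte for every 32 missed lookups.
			bytesBetweenHashLookups := skip >> 5
			skip++
			s += bytesBetweenHashLookups
		}
		return -1 // no match found
	}

	func main() {
		// With no hits at all, the scan position advances quadratically.
		fmt.Println(scanForMatch(1<<20, func(int) bool { return false }))
	}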
For completeness, benchmark numbers are included below, but the worse numbers
merely reflect that the old Go algorithm was too aggressive in skipping ahead
on incompressible input (RandomEncode, ZFlat2 and ZFlat4): after an initial
warm-up period, it was doing little more than a memcpy. A memcpy is indeed
fast in terms of MB/s, but it doesn't compress at all, which defeats the whole
purpose of a compression format like Snappy.
benchmark old MB/s new MB/s speedup
BenchmarkWordsEncode1e1-4 3.65 3.77 1.03x
BenchmarkWordsEncode1e2-4 29.22 29.35 1.00x
BenchmarkWordsEncode1e3-4 99.46 101.20 1.02x
BenchmarkWordsEncode1e4-4 118.11 121.54 1.03x
BenchmarkWordsEncode1e5-4 90.37 91.72 1.01x
BenchmarkWordsEncode1e6-4 107.49 108.88 1.01x
BenchmarkRandomEncode-4 7679.09 4491.97 0.58x
Benchmark_ZFlat0-4 229.41 233.79 1.02x
Benchmark_ZFlat1-4 115.10 116.83 1.02x
Benchmark_ZFlat2-4 7256.88 3003.79 0.41x
Benchmark_ZFlat3-4 53.39 54.02 1.01x
Benchmark_ZFlat4-4 1873.63 289.28 0.15x
Benchmark_ZFlat5-4 233.29 234.95 1.01x
Benchmark_ZFlat6-4 101.33 102.79 1.01x
Benchmark_ZFlat7-4 95.26 96.63 1.01x
Benchmark_ZFlat8-4 105.66 106.89 1.01x
Benchmark_ZFlat9-4 92.04 93.11 1.01x
Benchmark_ZFlat10-4 265.68 265.93 1.00x
Benchmark_ZFlat11-4 149.72 151.32 1.01x
These numbers were generated on an amd64 machine, but on a different machine
than the one used for other recent commits. The raw MB/s numbers are therefore
not directly comparable, although the speedup numbers should be.

Doing s/int/int32/ in "var table [maxTableSize]int" saves 64 KiB of stack
space that needed zeroing: maxTableSize is 1<<14, or 16384 entries, so the
table shrinks from 128 KiB (with 8-byte ints, on 64-bit platforms) to 64 KiB.
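A quick, illustrative check of the arithmetic (this program is not part of
the commit; it just prints the two array sizes):

	package main

	import (
		"fmt"
		"unsafe"
	)

	const maxTableSize = 1 << 14 // 16384 entries

	func main() {
		var tableInt [maxTableSize]int     // 8 bytes per entry on 64-bit platforms
		var tableInt32 [maxTableSize]int32 // 4 bytes per entry everywhere

		fmt.Println(unsafe.Sizeof(tableInt))   // 131072 bytes (128 KiB) on amd64
		fmt.Println(unsafe.Sizeof(tableInt32)) // 65536 bytes (64 KiB)
	}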
The benchmarks show the biggest effect for small src lengths, or for
mostly incompressible data such as the JPEG file (possibly because the
multiple-byte skipping means that the src is effectively short).
On amd64:
benchmark old MB/s new MB/s speedup
BenchmarkWordsEncode1e1-8 3.05 5.71 1.87x
BenchmarkWordsEncode1e2-8 26.98 44.87 1.66x
BenchmarkWordsEncode1e3-8 130.87 156.72 1.20x
BenchmarkWordsEncode1e4-8 162.48 180.89 1.11x
BenchmarkWordsEncode1e5-8 132.35 131.27 0.99x
BenchmarkWordsEncode1e6-8 159.97 158.49 0.99x
BenchmarkRandomEncode-8 12340.86 13485.69 1.09x
Benchmark_ZFlat0-8 329.92 329.17 1.00x
Benchmark_ZFlat1-8 165.06 164.46 1.00x
Benchmark_ZFlat2-8 8955.25 10530.49 1.18x
Benchmark_ZFlat3-8 47.79 80.06 1.68x
Benchmark_ZFlat4-8 2650.55 2732.00 1.03x
Benchmark_ZFlat5-8 336.52 334.94 1.00x
Benchmark_ZFlat6-8 147.99 145.85 0.99x
Benchmark_ZFlat7-8 136.32 137.20 1.01x
Benchmark_ZFlat8-8 153.03 152.15 0.99x
Benchmark_ZFlat9-8 133.18 131.74 0.99x
Benchmark_ZFlat10-8 376.02 378.28 1.01x
Benchmark_ZFlat11-8 224.16 216.81 0.97x
Thanks to Klaus Post for the original suggestion on
https://github.com/golang/snappy/pull/23, but I hesitate to accept that pull
request in its entirety, as it makes many changes, some more complicated than
this separable, self-contained s/int/int32/ change.