Commit Graph

9 Commits

Author SHA1 Message Date
Nigel Tao d9eb7a3d35 Support the COPY_4 tag.
It is a valid encoding, even if no longer issued by most encoders.
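
For reference, a minimal pure-Go sketch of parsing a copy-4 element, following the Snappy format description: the tag byte's low two bits are 3, its upper six bits hold length-1, and the next four bytes hold a little-endian offset. The names below are illustrative, not necessarily the package's own.

    const tagCopy4 = 0x03

    // parseCopy4 reads one copy-4 element starting at src[s] and returns
    // the copy's length and offset, plus the position of the next element.
    func parseCopy4(src []byte, s int) (length, offset, next int) {
        length = 1 + int(src[s])>>2
        offset = int(uint32(src[s+1]) | uint32(src[s+2])<<8 | uint32(src[s+3])<<16 | uint32(src[s+4])<<24)
        return length, offset, s + 5
    }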

name              old speed      new speed      delta
WordsDecode1e1-8   525MB/s ± 0%   504MB/s ± 1%  -4.04%   (p=0.000 n=9+10)
WordsDecode1e2-8  1.23GB/s ± 0%  1.23GB/s ± 1%    ~      (p=0.678 n=10+9)
WordsDecode1e3-8  1.54GB/s ± 0%  1.53GB/s ± 1%  -0.75%   (p=0.000 n=10+9)
WordsDecode1e4-8  1.53GB/s ± 0%  1.51GB/s ± 3%  -1.46%   (p=0.000 n=9+10)
WordsDecode1e5-8   793MB/s ± 0%   777MB/s ± 2%  -2.01%   (p=0.017 n=9+10)
WordsDecode1e6-8   917MB/s ± 1%   917MB/s ± 1%    ~      (p=0.473 n=8+10)
WordsEncode1e1-8   641MB/s ± 2%   641MB/s ± 0%    ~      (p=0.780 n=10+9)
WordsEncode1e2-8   583MB/s ± 0%   580MB/s ± 0%  -0.41%   (p=0.001 n=10+9)
WordsEncode1e3-8   647MB/s ± 1%   648MB/s ± 0%    ~      (p=0.326 n=10+9)
WordsEncode1e4-8   442MB/s ± 1%   452MB/s ± 0%  +2.20%   (p=0.000 n=10+8)
WordsEncode1e5-8   355MB/s ± 1%   355MB/s ± 0%    ~      (p=0.880 n=10+8)
WordsEncode1e6-8   433MB/s ± 0%   434MB/s ± 0%    ~       (p=0.700 n=8+8)
RandomEncode-8    14.2GB/s ± 3%  14.2GB/s ± 3%    ~      (p=0.780 n=10+9)
_UFlat0-8         2.18GB/s ± 1%  2.19GB/s ± 0%    ~      (p=0.447 n=10+9)
_UFlat1-8         1.40GB/s ± 2%  1.41GB/s ± 0%  +0.73%   (p=0.043 n=9+10)
_UFlat2-8         23.4GB/s ± 3%  23.5GB/s ± 2%    ~      (p=0.497 n=9+10)
_UFlat3-8         1.90GB/s ± 0%  1.91GB/s ± 0%  +0.30%    (p=0.002 n=8+9)
_UFlat4-8         13.9GB/s ± 2%  14.0GB/s ± 1%    ~      (p=0.720 n=9+10)
_UFlat5-8         1.96GB/s ± 1%  1.97GB/s ± 0%  +0.81%   (p=0.000 n=10+9)
_UFlat6-8          813MB/s ± 0%   814MB/s ± 0%  +0.17%   (p=0.037 n=8+10)
_UFlat7-8          783MB/s ± 2%   785MB/s ± 0%    ~       (p=0.340 n=9+9)
_UFlat8-8          859MB/s ± 0%   857MB/s ± 0%    ~       (p=0.074 n=8+9)
_UFlat9-8          719MB/s ± 1%   719MB/s ± 1%    ~      (p=0.621 n=10+9)
_UFlat10-8        2.84GB/s ± 0%  2.84GB/s ± 0%  +0.19%   (p=0.043 n=10+9)
_UFlat11-8        1.05GB/s ± 1%  1.05GB/s ± 0%    ~       (p=0.523 n=9+8)
_ZFlat0-8         1.04GB/s ± 2%  1.04GB/s ± 0%    ~       (p=0.222 n=9+9)
_ZFlat1-8          535MB/s ± 0%   534MB/s ± 0%    ~       (p=0.059 n=9+9)
_ZFlat2-8         15.6GB/s ± 3%  15.7GB/s ± 1%    ~      (p=0.720 n=9+10)
_ZFlat3-8          723MB/s ± 0%   740MB/s ± 3%  +2.36%   (p=0.034 n=8+10)
_ZFlat4-8         9.16GB/s ± 1%  9.20GB/s ± 1%    ~       (p=0.297 n=9+9)
_ZFlat5-8          987MB/s ± 1%   991MB/s ± 0%    ~       (p=0.167 n=9+8)
_ZFlat6-8          378MB/s ± 2%   379MB/s ± 0%    ~       (p=0.334 n=9+8)
_ZFlat7-8          350MB/s ± 2%   352MB/s ± 0%  +0.60%    (p=0.014 n=9+8)
_ZFlat8-8          397MB/s ± 0%   396MB/s ± 1%    ~      (p=0.965 n=8+10)
_ZFlat9-8          328MB/s ± 0%   327MB/s ± 1%    ~       (p=0.409 n=8+9)
_ZFlat10-8        1.33GB/s ± 0%  1.33GB/s ± 1%    ~      (p=0.356 n=9+10)
_ZFlat11-8         605MB/s ± 0%   605MB/s ± 1%    ~       (p=0.743 n=9+8)
2016-05-29 15:00:41 +10:00
Nigel Tao 4f2f9a13dd Write the encoder's extendMatch in asm.
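
For context, the byte-at-a-time loop that the asm version replaces looks roughly like the pure-Go sketch below (assuming 0 <= i < j <= len(src)); hand-written asm can speed this up by comparing several bytes per iteration.

    // extendMatch returns the largest k such that k <= len(src) and that
    // src[i:i+k-j] and src[j:k] have the same contents.
    func extendMatch(src []byte, i, j int) int {
        for ; j < len(src) && src[i] == src[j]; i, j = i+1, j+1 {
        }
        return j
    }
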
name              old speed      new speed      delta
WordsEncode1e1-8   678MB/s ± 0%   690MB/s ± 0%   +1.79%  (p=0.008 n=5+5)
WordsEncode1e2-8  87.5MB/s ± 0%  83.7MB/s ± 1%   -4.26%  (p=0.008 n=5+5)
WordsEncode1e3-8   257MB/s ± 1%   230MB/s ± 1%  -10.41%  (p=0.008 n=5+5)
WordsEncode1e4-8   247MB/s ± 1%   233MB/s ± 1%   -5.56%  (p=0.008 n=5+5)
WordsEncode1e5-8   186MB/s ± 0%   212MB/s ± 0%  +14.36%  (p=0.008 n=5+5)
WordsEncode1e6-8   211MB/s ± 0%   255MB/s ± 0%  +20.82%  (p=0.008 n=5+5)
RandomEncode-8    13.1GB/s ± 2%  13.2GB/s ± 1%     ~     (p=0.222 n=5+5)
_ZFlat0-8          433MB/s ± 0%   623MB/s ± 0%  +43.92%  (p=0.008 n=5+5)
_ZFlat1-8          276MB/s ± 0%   319MB/s ± 1%  +15.42%  (p=0.008 n=5+5)
_ZFlat2-8         13.8GB/s ± 1%  13.9GB/s ± 1%     ~     (p=0.222 n=5+5)
_ZFlat3-8          170MB/s ± 0%   176MB/s ± 0%   +3.55%  (p=0.008 n=5+5)
_ZFlat4-8         3.09GB/s ± 1%  6.05GB/s ± 0%  +96.00%  (p=0.008 n=5+5)
_ZFlat5-8          427MB/s ± 1%   603MB/s ± 0%  +41.35%  (p=0.008 n=5+5)
_ZFlat6-8          190MB/s ± 0%   228MB/s ± 0%  +20.24%  (p=0.008 n=5+5)
_ZFlat7-8          182MB/s ± 0%   212MB/s ± 0%  +16.87%  (p=0.008 n=5+5)
_ZFlat8-8          200MB/s ± 0%   242MB/s ± 0%  +20.97%  (p=0.008 n=5+5)
_ZFlat9-8          175MB/s ± 0%   199MB/s ± 1%  +13.74%  (p=0.008 n=5+5)
_ZFlat10-8         507MB/s ± 0%   796MB/s ± 1%  +56.83%  (p=0.008 n=5+5)
_ZFlat11-8         278MB/s ± 0%   348MB/s ± 0%  +25.09%  (p=0.008 n=5+5)

name           old time/op  new time/op  delta
ExtendMatch-8  16.5µs ± 1%   7.8µs ± 1%  -52.93%  (p=0.008 n=5+5)
2016-04-23 11:26:04 +10:00
Nigel Tao 3588d1dd84 Add appengine and noasm build tags.
The general suggestion to use a noasm tag is by Klaus Post at
https://groups.google.com/d/msg/golang-dev/CeKX81B3WdQ/2mq-eY0VAgAJ
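
As a sketch, the constraint block at the top of the asm-backed files ends up looking something like the following, in the // +build syntax of the time (the pure-Go fallback files carry the negated constraints); treat the exact set as indicative rather than authoritative:

    // +build !appengine
    // +build gc
    // +build !noasm

    package snappy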
2016-04-14 18:47:10 +10:00
Nigel Tao 857ad66e00 Add gc build tag for the asm code.
Fixes #27.
2016-04-04 10:05:40 +10:00
Nigel Tao 5f1c01d9f6 Optimize a 16-byte load and store.
benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     528.93       528.93       1.00x
BenchmarkWordsDecode1e2-8     983.60       999.00       1.02x
BenchmarkWordsDecode1e3-8     1474.03      1513.22      1.03x
BenchmarkWordsDecode1e4-8     1523.38      1561.36      1.02x
BenchmarkWordsDecode1e5-8     792.34       800.00       1.01x
BenchmarkWordsDecode1e6-8     881.58       885.13       1.00x
Benchmark_UFlat0-8            2168.73      2224.25      1.03x
Benchmark_UFlat1-8            1431.99      1446.11      1.01x
Benchmark_UFlat2-8            15392.46     15301.72     0.99x
Benchmark_UFlat3-8            1825.26      1841.57      1.01x
Benchmark_UFlat4-8            10885.32     11384.32     1.05x
Benchmark_UFlat5-8            1955.55      2002.59      1.02x
Benchmark_UFlat6-8            833.99       829.35       0.99x
Benchmark_UFlat7-8            794.80       793.35       1.00x
Benchmark_UFlat8-8            859.01       854.84       1.00x
Benchmark_UFlat9-8            731.84       726.50       0.99x
Benchmark_UFlat10-8           2775.21      2898.57      1.04x
Benchmark_UFlat11-8           1032.75      1032.47      1.00x
2016-03-04 16:48:22 +11:00
Nigel Tao 427fb6fc07 Optimize asm for decoding copy fragments some more.
Relative to the previous commit:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     518.80       508.74       0.98x
BenchmarkWordsDecode1e2-8     871.43       962.52       1.10x
BenchmarkWordsDecode1e3-8     1411.32      1435.51      1.02x
BenchmarkWordsDecode1e4-8     1469.60      1514.02      1.03x
BenchmarkWordsDecode1e5-8     771.07       807.73       1.05x
BenchmarkWordsDecode1e6-8     872.19       892.24       1.02x
Benchmark_UFlat0-8            1129.79      2200.22      1.95x
Benchmark_UFlat1-8            1075.37      1446.09      1.34x
Benchmark_UFlat2-8            15617.45     14706.88     0.94x
Benchmark_UFlat3-8            1438.15      1787.82      1.24x
Benchmark_UFlat4-8            4838.37      10683.24     2.21x
Benchmark_UFlat5-8            1075.46      1965.33      1.83x
Benchmark_UFlat6-8            811.70       833.52       1.03x
Benchmark_UFlat7-8            781.87       792.85       1.01x
Benchmark_UFlat8-8            819.38       854.75       1.04x
Benchmark_UFlat9-8            724.43       730.21       1.01x
Benchmark_UFlat10-8           1193.70      2775.98      2.33x
Benchmark_UFlat11-8           879.15       1037.94      1.18x

As with previous recent commits, the new asm code is covered by existing
tests: TestDecode, TestDecodeLengthOffset and TestDecodeGoldenInput.
There is also a new test for the slowForwardCopy algorithm.
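
For orientation, the result that any optimized copy path, slowForwardCopy included, must reproduce is that of the plain byte-at-a-time forward copy sketched below; an overlapping copy (offset smaller than length) deliberately re-reads bytes it has just written.

    // forwardCopy copies length bytes from dst[s:] to dst[d:], one byte at
    // a time, so that an overlapping copy (d-s < length) re-reads bytes it
    // has just written.
    func forwardCopy(dst []byte, d, s, length int) {
        for end := d + length; d < end; d, s = d+1, s+1 {
            dst[d] = dst[s]
        }
    }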

As a data point, the "new MB/s" numbers are now in the same ballpark as
the benchmark numbers that I get from the C++ snappy implementation on
the same machine:

BM_UFlat/0   2.4GB/s    html
BM_UFlat/1   1.4GB/s    urls
BM_UFlat/2   21.1GB/s   jpg
BM_UFlat/3   1.5GB/s    jpg_200
BM_UFlat/4   10.2GB/s   pdf
BM_UFlat/5   2.1GB/s    html4
BM_UFlat/6   990.6MB/s  txt1
BM_UFlat/7   930.1MB/s  txt2
BM_UFlat/8   1.0GB/s    txt3
BM_UFlat/9   849.7MB/s  txt4
BM_UFlat/10  2.9GB/s    pb
BM_UFlat/11  1.2GB/s    gaviota

As another data point, here is the amd64 asm code as of this commit
compared to the most recent pure Go implementation, revision 03ee571c:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     498.83       508.74       1.02x
BenchmarkWordsDecode1e2-8     445.12       962.52       2.16x
BenchmarkWordsDecode1e3-8     530.29       1435.51      2.71x
BenchmarkWordsDecode1e4-8     361.08       1514.02      4.19x
BenchmarkWordsDecode1e5-8     270.69       807.73       2.98x
BenchmarkWordsDecode1e6-8     290.91       892.24       3.07x
Benchmark_UFlat0-8            543.87       2200.22      4.05x
Benchmark_UFlat1-8            449.84       1446.09      3.21x
Benchmark_UFlat2-8            15511.96     14706.88     0.95x
Benchmark_UFlat3-8            873.92       1787.82      2.05x
Benchmark_UFlat4-8            2978.58      10683.24     3.59x
Benchmark_UFlat5-8            536.04       1965.33      3.67x
Benchmark_UFlat6-8            278.33       833.52       2.99x
Benchmark_UFlat7-8            271.63       792.85       2.92x
Benchmark_UFlat8-8            288.86       854.75       2.96x
Benchmark_UFlat9-8            262.13       730.21       2.79x
Benchmark_UFlat10-8           640.03       2775.98      4.34x
Benchmark_UFlat11-8           356.37       1037.94      2.91x

The UFlat2 case is decoding a compressed JPEG file, a binary format, and
so Snappy is not actually getting much extra compression. Decompression
collapses to not much more than repeatedly invoking runtime.memmove, so
optimizing the snappy code per se doesn't have a huge impact on that
particular benchmark number.
2016-02-26 17:30:25 +11:00
Nigel Tao 4c1fc8e426 Optimize asm for decoding copy fragments.
Relative to the previous commit:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     518.05       518.80       1.00x
BenchmarkWordsDecode1e2-8     776.28       871.43       1.12x
BenchmarkWordsDecode1e3-8     995.41       1411.32      1.42x
BenchmarkWordsDecode1e4-8     615.92       1469.60      2.39x
BenchmarkWordsDecode1e5-8     453.95       771.07       1.70x
BenchmarkWordsDecode1e6-8     453.74       872.19       1.92x
Benchmark_UFlat0-8            863.12       1129.79      1.31x
Benchmark_UFlat1-8            766.01       1075.37      1.40x
Benchmark_UFlat2-8            15463.36     15617.45     1.01x
Benchmark_UFlat3-8            1388.63      1438.15      1.04x
Benchmark_UFlat4-8            4367.79      4838.37      1.11x
Benchmark_UFlat5-8            844.84       1075.46      1.27x
Benchmark_UFlat6-8            442.42       811.70       1.83x
Benchmark_UFlat7-8            437.68       781.87       1.79x
Benchmark_UFlat8-8            458.19       819.38       1.79x
Benchmark_UFlat9-8            423.36       724.43       1.71x
Benchmark_UFlat10-8           1023.05      1193.70      1.17x
Benchmark_UFlat11-8           507.18       879.15       1.73x
2016-02-26 17:21:48 +11:00
Nigel Tao 8c7c9dec59 Optimize asm for decoding literal fragments.
Relative to the previous commit:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     519.36       518.05       1.00x
BenchmarkWordsDecode1e2-8     691.63       776.28       1.12x
BenchmarkWordsDecode1e3-8     858.97       995.41       1.16x
BenchmarkWordsDecode1e4-8     581.86       615.92       1.06x
BenchmarkWordsDecode1e5-8     380.78       453.95       1.19x
BenchmarkWordsDecode1e6-8     403.12       453.74       1.13x
Benchmark_UFlat0-8            784.21       863.12       1.10x
Benchmark_UFlat1-8            625.49       766.01       1.22x
Benchmark_UFlat2-8            15366.67     15463.36     1.01x
Benchmark_UFlat3-8            1321.47      1388.63      1.05x
Benchmark_UFlat4-8            4338.83      4367.79      1.01x
Benchmark_UFlat5-8            770.24       844.84       1.10x
Benchmark_UFlat6-8            386.10       442.42       1.15x
Benchmark_UFlat7-8            376.79       437.68       1.16x
Benchmark_UFlat8-8            400.47       458.19       1.14x
Benchmark_UFlat9-8            362.89       423.36       1.17x
Benchmark_UFlat10-8           943.89       1023.05      1.08x
Benchmark_UFlat11-8           493.98       507.18       1.03x
2016-02-26 17:17:00 +11:00
Nigel Tao 402436317a Rewrite the core of the decoder in asm.
This is an experiment. A future commit may roll back this commit if it
turns out that the complexity and inherent unsafety of asm code
outweighs the performance benefits.

The new asm code is covered by existing tests: TestDecode,
TestDecodeLengthOffset and TestDecodeGoldenInput. These tests were
checked in by previous commits, to make it clear that they pass both
before and after this new implementation. This commit is purely an
optimization; there should be no other change in behavior.
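
For orientation, the loop being rewritten has roughly the shape of the pure-Go sketch below: read a tag byte, decode either a literal or a copy element, emit it, and repeat. Error handling, the longer literal-length encodings and the 1- and 4-byte-offset copy forms are omitted, and the names are illustrative rather than the package's own.

    const (
        tagLiteral = 0x00
        tagCopy1   = 0x01
        tagCopy2   = 0x02
        tagCopy4   = 0x03
    )

    func decodeSketch(dst, src []byte) {
        d, s := 0, 0
        for s < len(src) {
            switch src[s] & 0x03 {
            case tagLiteral:
                n := 1 + int(src[s]>>2) // short form: length-1 in the upper 6 bits
                s++
                copy(dst[d:], src[s:s+n])
                d, s = d+n, s+n
            case tagCopy2:
                length := 1 + int(src[s]>>2)
                offset := int(src[s+1]) | int(src[s+2])<<8
                s += 3
                for i := 0; i < length; i++ { // forward copy; source may overlap dst[d:]
                    dst[d+i] = dst[d-offset+i]
                }
                d += length
            default:
                return // tagCopy1 and tagCopy4 differ only in how length/offset are read
            }
        }
    }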

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     498.83       519.36       1.04x
BenchmarkWordsDecode1e2-8     445.12       691.63       1.55x
BenchmarkWordsDecode1e3-8     530.29       858.97       1.62x
BenchmarkWordsDecode1e4-8     361.08       581.86       1.61x
BenchmarkWordsDecode1e5-8     270.69       380.78       1.41x
BenchmarkWordsDecode1e6-8     290.91       403.12       1.39x
Benchmark_UFlat0-8            543.87       784.21       1.44x
Benchmark_UFlat1-8            449.84       625.49       1.39x
Benchmark_UFlat2-8            15511.96     15366.67     0.99x
Benchmark_UFlat3-8            873.92       1321.47      1.51x
Benchmark_UFlat4-8            2978.58      4338.83      1.46x
Benchmark_UFlat5-8            536.04       770.24       1.44x
Benchmark_UFlat6-8            278.33       386.10       1.39x
Benchmark_UFlat7-8            271.63       376.79       1.39x
Benchmark_UFlat8-8            288.86       400.47       1.39x
Benchmark_UFlat9-8            262.13       362.89       1.38x
Benchmark_UFlat10-8           640.03       943.89       1.47x
Benchmark_UFlat11-8           356.37       493.98       1.39x

The numbers above are pure Go vs the new asm; about a 1.4x improvement.
As a data point, the numbers below are pure Go vs pure Go with bounds
checking disabled:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     498.83       516.68       1.04x
BenchmarkWordsDecode1e2-8     445.12       495.22       1.11x
BenchmarkWordsDecode1e3-8     530.29       612.44       1.15x
BenchmarkWordsDecode1e4-8     361.08       374.12       1.04x
BenchmarkWordsDecode1e5-8     270.69       300.66       1.11x
BenchmarkWordsDecode1e6-8     290.91       325.22       1.12x
Benchmark_UFlat0-8            543.87       655.85       1.21x
Benchmark_UFlat1-8            449.84       516.04       1.15x
Benchmark_UFlat2-8            15511.96     15291.13     0.99x
Benchmark_UFlat3-8            873.92       1063.07      1.22x
Benchmark_UFlat4-8            2978.58      3615.30      1.21x
Benchmark_UFlat5-8            536.04       639.51       1.19x
Benchmark_UFlat6-8            278.33       309.44       1.11x
Benchmark_UFlat7-8            271.63       301.89       1.11x
Benchmark_UFlat8-8            288.86       322.38       1.12x
Benchmark_UFlat9-8            262.13       289.92       1.11x
Benchmark_UFlat10-8           640.03       787.34       1.23x
Benchmark_UFlat11-8           356.37       403.35       1.13x

In other words, eliminating bounds checking gets you about a 1.15x
improvement. All the other benefits of hand-written asm get you another
1.2x over and above that.
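
(As a methodological note, numbers like the bounds-check-free ones above can be produced by compiling with the gc compiler's -B flag, e.g. something along the lines of "go test -gcflags=-B -bench=.".)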

For reference, I've copy/pasted the "go tool compile -S -B -o /dev/null
main.go" output at http://play.golang.org/p/vOs4Z7Qf1X
2016-02-26 17:03:02 +11:00