Mirror of https://github.com/golang/snappy.git

9 commits

---
Nigel Tao | d9eb7a3d35

Support the COPY_4 tag.

It is a valid encoding, even if no longer issued by most encoders.

```
name              old speed      new speed      delta
WordsDecode1e1-8  525MB/s ± 0%   504MB/s ± 1%   -4.04%  (p=0.000 n=9+10)
WordsDecode1e2-8  1.23GB/s ± 0%  1.23GB/s ± 1%  ~       (p=0.678 n=10+9)
WordsDecode1e3-8  1.54GB/s ± 0%  1.53GB/s ± 1%  -0.75%  (p=0.000 n=10+9)
WordsDecode1e4-8  1.53GB/s ± 0%  1.51GB/s ± 3%  -1.46%  (p=0.000 n=9+10)
WordsDecode1e5-8  793MB/s ± 0%   777MB/s ± 2%   -2.01%  (p=0.017 n=9+10)
WordsDecode1e6-8  917MB/s ± 1%   917MB/s ± 1%   ~       (p=0.473 n=8+10)
WordsEncode1e1-8  641MB/s ± 2%   641MB/s ± 0%   ~       (p=0.780 n=10+9)
WordsEncode1e2-8  583MB/s ± 0%   580MB/s ± 0%   -0.41%  (p=0.001 n=10+9)
WordsEncode1e3-8  647MB/s ± 1%   648MB/s ± 0%   ~       (p=0.326 n=10+9)
WordsEncode1e4-8  442MB/s ± 1%   452MB/s ± 0%   +2.20%  (p=0.000 n=10+8)
WordsEncode1e5-8  355MB/s ± 1%   355MB/s ± 0%   ~       (p=0.880 n=10+8)
WordsEncode1e6-8  433MB/s ± 0%   434MB/s ± 0%   ~       (p=0.700 n=8+8)
RandomEncode-8    14.2GB/s ± 3%  14.2GB/s ± 3%  ~       (p=0.780 n=10+9)
_UFlat0-8         2.18GB/s ± 1%  2.19GB/s ± 0%  ~       (p=0.447 n=10+9)
_UFlat1-8         1.40GB/s ± 2%  1.41GB/s ± 0%  +0.73%  (p=0.043 n=9+10)
_UFlat2-8         23.4GB/s ± 3%  23.5GB/s ± 2%  ~       (p=0.497 n=9+10)
_UFlat3-8         1.90GB/s ± 0%  1.91GB/s ± 0%  +0.30%  (p=0.002 n=8+9)
_UFlat4-8         13.9GB/s ± 2%  14.0GB/s ± 1%  ~       (p=0.720 n=9+10)
_UFlat5-8         1.96GB/s ± 1%  1.97GB/s ± 0%  +0.81%  (p=0.000 n=10+9)
_UFlat6-8         813MB/s ± 0%   814MB/s ± 0%   +0.17%  (p=0.037 n=8+10)
_UFlat7-8         783MB/s ± 2%   785MB/s ± 0%   ~       (p=0.340 n=9+9)
_UFlat8-8         859MB/s ± 0%   857MB/s ± 0%   ~       (p=0.074 n=8+9)
_UFlat9-8         719MB/s ± 1%   719MB/s ± 1%   ~       (p=0.621 n=10+9)
_UFlat10-8        2.84GB/s ± 0%  2.84GB/s ± 0%  +0.19%  (p=0.043 n=10+9)
_UFlat11-8        1.05GB/s ± 1%  1.05GB/s ± 0%  ~       (p=0.523 n=9+8)
_ZFlat0-8         1.04GB/s ± 2%  1.04GB/s ± 0%  ~       (p=0.222 n=9+9)
_ZFlat1-8         535MB/s ± 0%   534MB/s ± 0%   ~       (p=0.059 n=9+9)
_ZFlat2-8         15.6GB/s ± 3%  15.7GB/s ± 1%  ~       (p=0.720 n=9+10)
_ZFlat3-8         723MB/s ± 0%   740MB/s ± 3%   +2.36%  (p=0.034 n=8+10)
_ZFlat4-8         9.16GB/s ± 1%  9.20GB/s ± 1%  ~       (p=0.297 n=9+9)
_ZFlat5-8         987MB/s ± 1%   991MB/s ± 0%   ~       (p=0.167 n=9+8)
_ZFlat6-8         378MB/s ± 2%   379MB/s ± 0%   ~       (p=0.334 n=9+8)
_ZFlat7-8         350MB/s ± 2%   352MB/s ± 0%   +0.60%  (p=0.014 n=9+8)
_ZFlat8-8         397MB/s ± 0%   396MB/s ± 1%   ~       (p=0.965 n=8+10)
_ZFlat9-8         328MB/s ± 0%   327MB/s ± 1%   ~       (p=0.409 n=8+9)
_ZFlat10-8        1.33GB/s ± 0%  1.33GB/s ± 1%  ~       (p=0.356 n=9+10)
_ZFlat11-8        605MB/s ± 0%   605MB/s ± 1%   ~       (p=0.743 n=9+8)
```

---
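For context, a COPY_4 element is a tag byte whose low two bits are 3, followed by a 4-byte little-endian offset; the copy length is 1 + (tag>>2). A minimal sketch of parsing one (the helper name is illustrative, not the package's actual identifier):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// parseCopy4 decodes one Snappy COPY_4 element at the start of src.
// It returns the copy length, the backwards offset into the output,
// and the number of input bytes consumed (always 5 for this tag).
func parseCopy4(src []byte) (length, offset, n int) {
	tag := src[0] // low two bits must be 3 for COPY_4
	length = 1 + int(tag>>2)
	offset = int(binary.LittleEndian.Uint32(src[1:5]))
	return length, offset, 5
}

func main() {
	// tag = (2<<2)|3 = 0x0b encodes length 3; the offset bytes encode 16.
	length, offset, n := parseCopy4([]byte{0x0b, 0x10, 0x00, 0x00, 0x00})
	fmt.Println(length, offset, n) // 3 16 5
}
```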
Nigel Tao | 4f2f9a13dd

Write the encoder's extendMatch in asm.

```
name              old speed      new speed      delta
WordsEncode1e1-8  678MB/s ± 0%   690MB/s ± 0%   +1.79%   (p=0.008 n=5+5)
WordsEncode1e2-8  87.5MB/s ± 0%  83.7MB/s ± 1%  -4.26%   (p=0.008 n=5+5)
WordsEncode1e3-8  257MB/s ± 1%   230MB/s ± 1%   -10.41%  (p=0.008 n=5+5)
WordsEncode1e4-8  247MB/s ± 1%   233MB/s ± 1%   -5.56%   (p=0.008 n=5+5)
WordsEncode1e5-8  186MB/s ± 0%   212MB/s ± 0%   +14.36%  (p=0.008 n=5+5)
WordsEncode1e6-8  211MB/s ± 0%   255MB/s ± 0%   +20.82%  (p=0.008 n=5+5)
RandomEncode-8    13.1GB/s ± 2%  13.2GB/s ± 1%  ~        (p=0.222 n=5+5)
_ZFlat0-8         433MB/s ± 0%   623MB/s ± 0%   +43.92%  (p=0.008 n=5+5)
_ZFlat1-8         276MB/s ± 0%   319MB/s ± 1%   +15.42%  (p=0.008 n=5+5)
_ZFlat2-8         13.8GB/s ± 1%  13.9GB/s ± 1%  ~        (p=0.222 n=5+5)
_ZFlat3-8         170MB/s ± 0%   176MB/s ± 0%   +3.55%   (p=0.008 n=5+5)
_ZFlat4-8         3.09GB/s ± 1%  6.05GB/s ± 0%  +96.00%  (p=0.008 n=5+5)
_ZFlat5-8         427MB/s ± 1%   603MB/s ± 0%   +41.35%  (p=0.008 n=5+5)
_ZFlat6-8         190MB/s ± 0%   228MB/s ± 0%   +20.24%  (p=0.008 n=5+5)
_ZFlat7-8         182MB/s ± 0%   212MB/s ± 0%   +16.87%  (p=0.008 n=5+5)
_ZFlat8-8         200MB/s ± 0%   242MB/s ± 0%   +20.97%  (p=0.008 n=5+5)
_ZFlat9-8         175MB/s ± 0%   199MB/s ± 1%   +13.74%  (p=0.008 n=5+5)
_ZFlat10-8        507MB/s ± 0%   796MB/s ± 1%   +56.83%  (p=0.008 n=5+5)
_ZFlat11-8        278MB/s ± 0%   348MB/s ± 0%   +25.09%  (p=0.008 n=5+5)

name           old time/op  new time/op  delta
ExtendMatch-8  16.5µs ± 1%  7.8µs ± 1%   -52.93%  (p=0.008 n=5+5)
```

---
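For reference, the function being ported is, in pure Go, just a byte-equality loop: given a match between positions i and j in src, it reports how far past j the match extends. A sketch along the lines of the pure Go fallback:

```go
package main

import "fmt"

// extendMatch returns the largest k such that the bytes at src[i:]
// and src[j:] continue to agree up to (but not including) k, i.e.
// how far the match starting at (i, j) can be extended.
func extendMatch(src []byte, i, j int) int {
	for ; j < len(src) && src[i] == src[j]; i, j = i+1, j+1 {
	}
	return j
}

func main() {
	src := []byte("abcdeabcdex")
	// The match between offsets 0 and 5 extends to offset 10 ("abcde").
	fmt.Println(extendMatch(src, 0, 5)) // 10
}
```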
Nigel Tao | 3588d1dd84

Add appengine and noasm build tags.

The general suggestion to use a noasm tag is by Klaus Post at https://groups.google.com/d/msg/golang-dev/CeKX81B3WdQ/2mq-eY0VAgAJ

---
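In the pre-Go-1.17 build-constraint syntax, the asm-backed files end up gated roughly like this (a sketch; the exact constraint lines in the repository may differ):

```go
// Asm fast path: only for the gc toolchain (not gccgo), not on App
// Engine, and only when the user has not opted out with -tags noasm.
// +build !appengine
// +build gc
// +build !noasm

package snappy
```

The pure Go fallback file then carries the complementary constraint, roughly `// +build !amd64 appengine !gc noasm`, so that exactly one implementation is compiled in for any combination of tags.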
Nigel Tao | 857ad66e00

Add gc build tag for the asm code.

Fixes #27.

---
Nigel Tao | 5f1c01d9f6

Optimize a 16-byte load and store.

```
benchmark                  old MB/s  new MB/s  speedup
BenchmarkWordsDecode1e1-8  528.93    528.93    1.00x
BenchmarkWordsDecode1e2-8  983.60    999.00    1.02x
BenchmarkWordsDecode1e3-8  1474.03   1513.22   1.03x
BenchmarkWordsDecode1e4-8  1523.38   1561.36   1.02x
BenchmarkWordsDecode1e5-8  792.34    800.00    1.01x
BenchmarkWordsDecode1e6-8  881.58    885.13    1.00x
Benchmark_UFlat0-8         2168.73   2224.25   1.03x
Benchmark_UFlat1-8         1431.99   1446.11   1.01x
Benchmark_UFlat2-8         15392.46  15301.72  0.99x
Benchmark_UFlat3-8         1825.26   1841.57   1.01x
Benchmark_UFlat4-8         10885.32  11384.32  1.05x
Benchmark_UFlat5-8         1955.55   2002.59   1.02x
Benchmark_UFlat6-8         833.99    829.35    0.99x
Benchmark_UFlat7-8         794.80    793.35    1.00x
Benchmark_UFlat8-8         859.01    854.84    1.00x
Benchmark_UFlat9-8         731.84    726.50    0.99x
Benchmark_UFlat10-8        2775.21   2898.57   1.04x
Benchmark_UFlat11-8        1032.75   1032.47   1.00x
```

---
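In Go's amd64 assembler, this kind of optimization amounts to replacing two 8-byte general-register moves with a single unaligned 16-byte SSE load/store pair (registers illustrative, not the exact code in this commit):

```
// Before: copy 16 bytes from SI to DI as two 8-byte MOVQ pairs.
MOVQ 0(SI), AX
MOVQ AX, 0(DI)
MOVQ 8(SI), AX
MOVQ AX, 8(DI)

// After: one unaligned 16-byte copy through an XMM register.
// MOVOU tolerates unaligned addresses.
MOVOU 0(SI), X0
MOVOU X0, 0(DI)
```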
Nigel Tao | 427fb6fc07

Optimize asm for decoding copy fragments some more.

Relative to the previous commit:

```
benchmark                  old MB/s  new MB/s  speedup
BenchmarkWordsDecode1e1-8  518.80    508.74    0.98x
BenchmarkWordsDecode1e2-8  871.43    962.52    1.10x
BenchmarkWordsDecode1e3-8  1411.32   1435.51   1.02x
BenchmarkWordsDecode1e4-8  1469.60   1514.02   1.03x
BenchmarkWordsDecode1e5-8  771.07    807.73    1.05x
BenchmarkWordsDecode1e6-8  872.19    892.24    1.02x
Benchmark_UFlat0-8         1129.79   2200.22   1.95x
Benchmark_UFlat1-8         1075.37   1446.09   1.34x
Benchmark_UFlat2-8         15617.45  14706.88  0.94x
Benchmark_UFlat3-8         1438.15   1787.82   1.24x
Benchmark_UFlat4-8         4838.37   10683.24  2.21x
Benchmark_UFlat5-8         1075.46   1965.33   1.83x
Benchmark_UFlat6-8         811.70    833.52    1.03x
Benchmark_UFlat7-8         781.87    792.85    1.01x
Benchmark_UFlat8-8         819.38    854.75    1.04x
Benchmark_UFlat9-8         724.43    730.21    1.01x
Benchmark_UFlat10-8        1193.70   2775.98   2.33x
Benchmark_UFlat11-8        879.15    1037.94   1.18x
```

As with previous recent commits, the new asm code is covered by existing tests: TestDecode, TestDecodeLengthOffset and TestDecodeGoldenInput. There is also a new test for the slowForwardCopy algorithm.

As a data point, the "new MB/s" numbers are now in the same ballpark as the benchmark numbers that I get from the C++ snappy implementation on the same machine:

```
BM_UFlat/0   2.4GB/s    html
BM_UFlat/1   1.4GB/s    urls
BM_UFlat/2   21.1GB/s   jpg
BM_UFlat/3   1.5GB/s    jpg_200
BM_UFlat/4   10.2GB/s   pdf
BM_UFlat/5   2.1GB/s    html4
BM_UFlat/6   990.6MB/s  txt1
BM_UFlat/7   930.1MB/s  txt2
BM_UFlat/8   1.0GB/s    txt3
BM_UFlat/9   849.7MB/s  txt4
BM_UFlat/10  2.9GB/s    pb
BM_UFlat/11  1.2GB/s    gaviota
```

As another data point, here is the amd64 asm code as of this commit compared to the most recent pure Go implementation, revision 03ee571c:

```
benchmark                  old MB/s  new MB/s  speedup
BenchmarkWordsDecode1e1-8  498.83    508.74    1.02x
BenchmarkWordsDecode1e2-8  445.12    962.52    2.16x
BenchmarkWordsDecode1e3-8  530.29    1435.51   2.71x
BenchmarkWordsDecode1e4-8  361.08    1514.02   4.19x
BenchmarkWordsDecode1e5-8  270.69    807.73    2.98x
BenchmarkWordsDecode1e6-8  290.91    892.24    3.07x
Benchmark_UFlat0-8         543.87    2200.22   4.05x
Benchmark_UFlat1-8         449.84    1446.09   3.21x
Benchmark_UFlat2-8         15511.96  14706.88  0.95x
Benchmark_UFlat3-8         873.92    1787.82   2.05x
Benchmark_UFlat4-8         2978.58   10683.24  3.59x
Benchmark_UFlat5-8         536.04    1965.33   3.67x
Benchmark_UFlat6-8         278.33    833.52    2.99x
Benchmark_UFlat7-8         271.63    792.85    2.92x
Benchmark_UFlat8-8         288.86    854.75    2.96x
Benchmark_UFlat9-8         262.13    730.21    2.79x
Benchmark_UFlat10-8        640.03    2775.98   4.34x
Benchmark_UFlat11-8        356.37    1037.94   2.91x
```

The UFlat2 case is decoding a compressed JPEG file, a binary format, and so Snappy is not actually getting much extra compression. Decompression collapses to not much more than repeatedly invoking runtime.memmove, so optimizing the snappy code per se doesn't have a huge impact on that particular benchmark number.

---
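The subtle case that slowForwardCopy-style code must handle is offset < length: the copy overlaps its own output, so it has to run strictly forwards, letting bytes written earlier in the copy feed later ones. A minimal Go sketch of that semantics (not the library's actual routine):

```go
package main

import "fmt"

// forwardCopy copies length bytes within dst, reading from d-offset
// and writing at d, one byte at a time so that an overlapping copy
// (offset < length) correctly reuses bytes it has just written.
func forwardCopy(dst []byte, d, offset, length int) {
	for end := d + length; d < end; d++ {
		dst[d] = dst[d-offset]
	}
}

func main() {
	dst := make([]byte, 8)
	copy(dst, "ab")
	// offset 2, length 6: repeats the two-byte pattern across the buffer.
	forwardCopy(dst, 2, 2, 6)
	fmt.Println(string(dst)) // abababab
}
```

A plain memmove would read the source region before the overlapping writes land, which is exactly why the asm needs a separate slow path here.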
Nigel Tao | 4c1fc8e426

Optimize asm for decoding copy fragments.

Relative to the previous commit:

```
benchmark                  old MB/s  new MB/s  speedup
BenchmarkWordsDecode1e1-8  518.05    518.80    1.00x
BenchmarkWordsDecode1e2-8  776.28    871.43    1.12x
BenchmarkWordsDecode1e3-8  995.41    1411.32   1.42x
BenchmarkWordsDecode1e4-8  615.92    1469.60   2.39x
BenchmarkWordsDecode1e5-8  453.95    771.07    1.70x
BenchmarkWordsDecode1e6-8  453.74    872.19    1.92x
Benchmark_UFlat0-8         863.12    1129.79   1.31x
Benchmark_UFlat1-8         766.01    1075.37   1.40x
Benchmark_UFlat2-8         15463.36  15617.45  1.01x
Benchmark_UFlat3-8         1388.63   1438.15   1.04x
Benchmark_UFlat4-8         4367.79   4838.37   1.11x
Benchmark_UFlat5-8         844.84    1075.46   1.27x
Benchmark_UFlat6-8         442.42    811.70    1.83x
Benchmark_UFlat7-8         437.68    781.87    1.79x
Benchmark_UFlat8-8         458.19    819.38    1.79x
Benchmark_UFlat9-8         423.36    724.43    1.71x
Benchmark_UFlat10-8        1023.05   1193.70   1.17x
Benchmark_UFlat11-8        507.18    879.15    1.73x
```

---
Nigel Tao | 8c7c9dec59

Optimize asm for decoding literal fragments.

Relative to the previous commit:

```
benchmark                  old MB/s  new MB/s  speedup
BenchmarkWordsDecode1e1-8  519.36    518.05    1.00x
BenchmarkWordsDecode1e2-8  691.63    776.28    1.12x
BenchmarkWordsDecode1e3-8  858.97    995.41    1.16x
BenchmarkWordsDecode1e4-8  581.86    615.92    1.06x
BenchmarkWordsDecode1e5-8  380.78    453.95    1.19x
BenchmarkWordsDecode1e6-8  403.12    453.74    1.13x
Benchmark_UFlat0-8         784.21    863.12    1.10x
Benchmark_UFlat1-8         625.49    766.01    1.22x
Benchmark_UFlat2-8         15366.67  15463.36  1.01x
Benchmark_UFlat3-8         1321.47   1388.63   1.05x
Benchmark_UFlat4-8         4338.83   4367.79   1.01x
Benchmark_UFlat5-8         770.24    844.84    1.10x
Benchmark_UFlat6-8         386.10    442.42    1.15x
Benchmark_UFlat7-8         376.79    437.68    1.16x
Benchmark_UFlat8-8         400.47    458.19    1.14x
Benchmark_UFlat9-8         362.89    423.36    1.17x
Benchmark_UFlat10-8        943.89    1023.05   1.08x
Benchmark_UFlat11-8        493.98    507.18    1.03x
```

---
Nigel Tao | 402436317a

Rewrite the core of the decoder in asm.

This is an experiment. A future commit may roll back this commit if it turns out that the complexity and inherent unsafety of asm code outweighs the performance benefits. The new asm code is covered by existing tests: TestDecode, TestDecodeLengthOffset and TestDecodeGoldenInput. These tests were checked in by previous commits, to make it clear that they pass both before and after this new implementation. This commit is purely an optimization; there should be no other change in behavior.

```
benchmark                  old MB/s  new MB/s  speedup
BenchmarkWordsDecode1e1-8  498.83    519.36    1.04x
BenchmarkWordsDecode1e2-8  445.12    691.63    1.55x
BenchmarkWordsDecode1e3-8  530.29    858.97    1.62x
BenchmarkWordsDecode1e4-8  361.08    581.86    1.61x
BenchmarkWordsDecode1e5-8  270.69    380.78    1.41x
BenchmarkWordsDecode1e6-8  290.91    403.12    1.39x
Benchmark_UFlat0-8         543.87    784.21    1.44x
Benchmark_UFlat1-8         449.84    625.49    1.39x
Benchmark_UFlat2-8         15511.96  15366.67  0.99x
Benchmark_UFlat3-8         873.92    1321.47   1.51x
Benchmark_UFlat4-8         2978.58   4338.83   1.46x
Benchmark_UFlat5-8         536.04    770.24    1.44x
Benchmark_UFlat6-8         278.33    386.10    1.39x
Benchmark_UFlat7-8         271.63    376.79    1.39x
Benchmark_UFlat8-8         288.86    400.47    1.39x
Benchmark_UFlat9-8         262.13    362.89    1.38x
Benchmark_UFlat10-8        640.03    943.89    1.47x
Benchmark_UFlat11-8        356.37    493.98    1.39x
```

The numbers above are pure Go vs the new asm; about a 1.4x improvement.

As a data point, the numbers below are pure Go vs pure Go with bounds checking disabled:

```
benchmark                  old MB/s  new MB/s  speedup
BenchmarkWordsDecode1e1-8  498.83    516.68    1.04x
BenchmarkWordsDecode1e2-8  445.12    495.22    1.11x
BenchmarkWordsDecode1e3-8  530.29    612.44    1.15x
BenchmarkWordsDecode1e4-8  361.08    374.12    1.04x
BenchmarkWordsDecode1e5-8  270.69    300.66    1.11x
BenchmarkWordsDecode1e6-8  290.91    325.22    1.12x
Benchmark_UFlat0-8         543.87    655.85    1.21x
Benchmark_UFlat1-8         449.84    516.04    1.15x
Benchmark_UFlat2-8         15511.96  15291.13  0.99x
Benchmark_UFlat3-8         873.92    1063.07   1.22x
Benchmark_UFlat4-8         2978.58   3615.30   1.21x
Benchmark_UFlat5-8         536.04    639.51    1.19x
Benchmark_UFlat6-8         278.33    309.44    1.11x
Benchmark_UFlat7-8         271.63    301.89    1.11x
Benchmark_UFlat8-8         288.86    322.38    1.12x
Benchmark_UFlat9-8         262.13    289.92    1.11x
Benchmark_UFlat10-8        640.03    787.34    1.23x
Benchmark_UFlat11-8        356.37    403.35    1.13x
```

In other words, eliminating bounds checking gets you about a 1.15x improvement. All the other benefits of hand-written asm get you another 1.2x over and above that.

For reference, I've copy/pasted the "go tool compile -S -B -o /dev/null main.go" output at http://play.golang.org/p/vOs4Z7Qf1X
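The loop being ported to asm dispatches on the low two bits of each tag byte. A heavily simplified pure Go outline, for orientation only (it handles short literals and the two most common copy tags; long literals, COPY_4, and all bounds/error checking are omitted, so this is a sketch, not the package's actual decoder):

```go
package main

import "fmt"

// decodeSketch expands src into dst and returns the bytes written.
// Tag dispatch follows the Snappy format: the low two bits select the
// element type (0 literal, 1 copy1, 2 copy2; 3 copy4 omitted here).
func decodeSketch(dst, src []byte) int {
	d, s := 0, 0
	for s < len(src) {
		tag := src[s]
		switch tag & 0x03 {
		case 0: // literal, short form only: length = (tag>>2) + 1
			length := int(tag>>2) + 1
			s++
			copy(dst[d:], src[s:s+length])
			d += length
			s += length
		case 1: // copy, 1-byte offset: length 4..11, 11-bit offset
			length := 4 + int(tag>>2)&0x07
			offset := int(tag&0xe0)<<3 | int(src[s+1])
			s += 2
			for end := d + length; d < end; d++ {
				dst[d] = dst[d-offset] // forwards: handles overlap
			}
		case 2: // copy, 2-byte little-endian offset: length = (tag>>2) + 1
			length := 1 + int(tag>>2)
			offset := int(src[s+1]) | int(src[s+2])<<8
			s += 3
			for end := d + length; d < end; d++ {
				dst[d] = dst[d-offset]
			}
		}
	}
	return d
}

func main() {
	// Literal "ab", then a copy of length 6 at offset 2.
	src := []byte{0x04, 'a', 'b', 0x16, 0x02, 0x00}
	dst := make([]byte, 8)
	n := decodeSketch(dst, src)
	fmt.Println(string(dst[:n])) // abababab
}
```

The asm version's win comes from doing the same dispatch with bulk 8- and 16-byte moves and no per-access bounds checks, which is what the surrounding benchmark comparisons quantify.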