Nigel Tao
43d5d4cd4e
Merge pull request #68 from MarkLodato/cli-readme
...
README: update instructions to install CLI
2023-12-26 09:57:46 +11:00
Mark Lodato
470b9ed42b
README: update instructions to install CLI
...
- Add instructions for using `go install` to run the binary.
- Note that `go get` is only for use as a library.
2022-11-17 07:22:32 -05:00
Nigel Tao
fa5810519d
Ensure arm64 frame sizes are 8 (mod 16)
...
Fixes #63
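The arithmetic in the title can be sketched as below. This assumes the rule exists because the arm64 toolchain stores an extra 8-byte return address next to the declared frame while keeping the stack pointer 16-byte aligned. The helper names `frameSizeOK` and `nextFrameSize` are hypothetical and do not appear in the repository.

```go
package main

import "fmt"

// frameSizeOK reports whether a declared assembly frame size satisfies the
// "8 (mod 16)" condition named in the commit title.
func frameSizeOK(size int) bool {
	return size%16 == 8
}

// nextFrameSize rounds a non-negative size up to the nearest value that is
// 8 (mod 16). Hypothetical helper, for illustration only.
func nextFrameSize(size int) int {
	for !frameSizeOK(size) {
		size++
	}
	return size
}

func main() {
	for _, size := range []int{0, 8, 16, 24, 32, 56} {
		fmt.Printf("frame size %2d: ok=%v, next valid=%d\n",
			size, frameSizeOK(size), nextFrameSize(size))
	}
}
```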
2022-01-16 12:10:46 +11:00
Nigel Tao
544b4180ac
Update AUTHORS and CONTRIBUTORS
2021-06-08 14:05:37 +10:00
Nigel Tao
0eaccd4763
Fix dangling golden_test filename link
2021-06-08 14:02:21 +10:00
Nigel Tao
3ff355f7bb
Merge pull request #51 from topos-ai/bytereader
...
Add ReadByte method, satisfies the io.ByteReader interface
2021-06-08 13:56:06 +10:00
Nigel Tao
b9440b43e5
Merge pull request #40 from EdwardBetts/spelling
...
correct spelling mistake
2021-06-08 13:46:59 +10:00
Nigel Tao
ef348818ab
Merge pull request #60 from alexlegg/master
...
Use a more inclusive text for golden input.
2021-06-08 13:46:43 +10:00
Nigel Tao
33fc3d5d8d
Merge pull request #61 from cuonglm/cuonglm/fix-wrong-arm64-scaled-register-format
...
Fix wrong arm64 scaled register format
2021-05-02 13:53:20 +10:00
Cuong Manh Le
b46926bc8a
Fix wrong arm64 scaled register format
...
Arm64 does not have a scaled register format, which causes the snappy tests to
fail on the current Go tip:
$ go version
go version devel go1.17-24875e3880 Tue Apr 20 15:14:05 2021 +0000 darwin/arm64
$ go test
# github.com/golang/snappy
./encode_arm64.s:385: arm64 doesn't support scaled register format
./encode_arm64.s:675: arm64 doesn't support scaled register format
asm: assembly of ./encode_arm64.s failed
FAIL github.com/golang/snappy [build failed]
See https://go-review.googlesource.com/c/go/+/289589
2021-04-21 00:16:25 +07:00
Alex Legg
e149cdd03f
Use a more inclusive text for golden input.
...
Replace the first chapter of Tom Sawyer with the first 400 lines of
Isaac Newton's Opticks.
The rawsnappy version was generated by cmd/snappytool in this repo.
The extendMatch test goldens were updated as per the instructions in
golden_test.go (with an update to account for the golang version of
extendMatch being inlined.)
2021-04-12 16:34:41 +10:00
Nigel Tao
674baa8c7f
Merge pull request #56 from AWSjswinney/arm64-port-pr
...
bug fix to encode_arm64.s: some registers overwritten in memmove call
ARM64 memmove clobbers R16 and R17 as of
https://go-review.googlesource.com/c/go/+/243357
2020-11-04 09:46:00 +11:00
Jonathan Swinney
f81760ec4c
bug fix to encode_arm64.s: some registers overwritten in memmove call
...
In encode_arm64.s, encodeBlock, two of the registers added during the port from
amd64 were not saved or restored for the memmove call. Instead of saving them,
just recalculate their values. Additionally, I made a few small changes to
improve things since I've learned a bit more about ARMv8 assembly.
- The CMP instruction accepts an immediate as the first argument
- use LDP/STP instead of SIMD instructions
The change to use the load-pair and store-pair instructions instead of the SIMD
instructions results in some modest performance improvements as measured on
Neoverse N1 (Graviton 2).
name old time/op new time/op delta
WordsDecode1e1-2 25.9ns ± 1% 26.1ns ± 1% +0.66% (p=0.005 n=10+10)
WordsDecode1e2-2 107ns ± 0% 105ns ± 0% -1.87% (p=0.000 n=10+10)
WordsDecode1e3-2 953ns ± 0% 901ns ± 0% -5.50% (p=0.000 n=10+10)
WordsDecode1e4-2 10.6µs ± 0% 9.9µs ± 2% -6.60% (p=0.000 n=7+10)
WordsDecode1e5-2 170µs ± 1% 164µs ± 1% -3.12% (p=0.000 n=10+9)
WordsDecode1e6-2 1.71ms ± 0% 1.66ms ± 0% -2.98% (p=0.000 n=10+10)
WordsEncode1e1-2 22.0ns ± 1% 21.9ns ± 1% -0.67% (p=0.006 n=8+10)
WordsEncode1e2-2 248ns ± 0% 245ns ± 0% -1.21% (p=0.002 n=8+10)
WordsEncode1e3-2 2.50µs ± 0% 2.49µs ± 0% ~ (p=0.103 n=10+9)
WordsEncode1e4-2 27.8µs ± 3% 28.0µs ± 2% ~ (p=0.075 n=10+10)
WordsEncode1e5-2 339µs ± 0% 343µs ± 0% +1.18% (p=0.000 n=9+10)
WordsEncode1e6-2 3.39ms ± 0% 3.42ms ± 0% +0.94% (p=0.000 n=10+10)
RandomEncode-2 74.8µs ± 1% 77.1µs ± 1% +3.16% (p=0.000 n=10+10)
_UFlat0-2 68.8µs ± 1% 66.4µs ± 2% -3.54% (p=0.000 n=10+10)
_UFlat1-2 770µs ± 0% 740µs ± 1% -3.93% (p=0.000 n=10+10)
_UFlat2-2 6.57µs ± 0% 6.55µs ± 0% -0.25% (p=0.000 n=8+10)
_UFlat3-2 183ns ± 0% 178ns ± 1% -2.84% (p=0.000 n=9+10)
_UFlat4-2 9.76µs ± 1% 9.56µs ± 0% -2.07% (p=0.000 n=10+9)
_UFlat5-2 301µs ± 0% 293µs ± 0% -2.67% (p=0.000 n=9+10)
_UFlat6-2 280µs ± 1% 267µs ± 1% -4.63% (p=0.000 n=10+10)
_UFlat7-2 241µs ± 0% 230µs ± 1% -4.68% (p=0.000 n=9+10)
_UFlat8-2 745µs ± 0% 715µs ± 1% -4.11% (p=0.000 n=10+10)
_UFlat9-2 1.01ms ± 0% 0.96ms ± 0% -4.60% (p=0.000 n=10+10)
_UFlat10-2 62.3µs ± 1% 59.3µs ± 1% -4.72% (p=0.000 n=10+9)
_UFlat11-2 258µs ± 0% 252µs ± 1% -2.56% (p=0.000 n=10+10)
_ZFlat0-2 135µs ± 1% 132µs ± 1% -1.88% (p=0.000 n=10+8)
_ZFlat1-2 1.76ms ± 0% 1.74ms ± 0% -1.00% (p=0.000 n=9+9)
_ZFlat2-2 9.54µs ± 0% 9.84µs ± 5% +3.18% (p=0.000 n=10+10)
_ZFlat3-2 449ns ± 0% 447ns ± 0% -0.38% (p=0.000 n=10+9)
_ZFlat4-2 15.6µs ± 0% 16.0µs ± 4% ~ (p=0.118 n=9+10)
_ZFlat5-2 560µs ± 1% 555µs ± 1% -0.89% (p=0.000 n=9+9)
_ZFlat6-2 531µs ± 0% 534µs ± 0% +0.64% (p=0.000 n=10+10)
_ZFlat7-2 466µs ± 0% 468µs ± 0% +0.32% (p=0.003 n=10+10)
_ZFlat8-2 1.42ms ± 0% 1.42ms ± 0% +0.43% (p=0.000 n=10+10)
_ZFlat9-2 1.93ms ± 0% 1.94ms ± 0% +0.44% (p=0.000 n=10+10)
_ZFlat10-2 120µs ± 0% 121µs ± 3% ~ (p=0.436 n=9+9)
_ZFlat11-2 433µs ± 0% 437µs ± 0% +1.03% (p=0.000 n=10+10)
ExtendMatch-2 9.77µs ± 0% 9.76µs ± 0% -0.13% (p=0.050 n=10+10)
As measured on Cortex-A53 (Raspberry Pi 3)
name old time/op new time/op delta
WordsDecode1e1-4 152ns ± 2% 151ns ± 0% ~ (p=0.536 n=10+8)
WordsDecode1e2-4 639ns ± 0% 617ns ± 0% -3.54% (p=0.000 n=9+8)
WordsDecode1e3-4 6.74µs ± 2% 6.35µs ± 0% -5.75% (p=0.000 n=10+9)
WordsDecode1e4-4 66.7µs ± 0% 63.5µs ± 0% -4.69% (p=0.000 n=9+9)
WordsDecode1e5-4 715µs ± 0% 684µs ± 0% -4.38% (p=0.000 n=8+8)
WordsDecode1e6-4 6.87ms ± 2% 6.53ms ± 1% -4.99% (p=0.000 n=10+9)
WordsEncode1e1-4 127ns ± 2% 126ns ± 0% ~ (p=0.065 n=10+9)
WordsEncode1e2-4 1.58µs ± 0% 1.57µs ± 0% -0.99% (p=0.000 n=8+8)
WordsEncode1e3-4 15.1µs ± 0% 14.9µs ± 0% -1.46% (p=0.000 n=9+8)
WordsEncode1e4-4 148µs ± 0% 148µs ± 4% ~ (p=0.497 n=9+10)
WordsEncode1e5-4 1.54ms ± 0% 1.54ms ± 0% +0.12% (p=0.012 n=10+8)
WordsEncode1e6-4 14.4ms ± 0% 14.4ms ± 1% -0.47% (p=0.015 n=9+8)
RandomEncode-4 1.13ms ± 1% 1.13ms ± 1% ~ (p=0.529 n=10+10)
_UFlat0-4 294µs ± 0% 288µs ± 1% -2.08% (p=0.000 n=9+9)
_UFlat1-4 3.05ms ± 1% 2.98ms ± 1% -2.22% (p=0.000 n=9+9)
_UFlat2-4 37.3µs ± 0% 37.4µs ± 1% ~ (p=0.093 n=8+9)
_UFlat3-4 909ns ± 0% 914ns ± 2% ~ (p=0.526 n=8+10)
_UFlat4-4 58.7µs ± 0% 58.1µs ± 0% -1.09% (p=0.000 n=8+10)
_UFlat5-4 1.22ms ± 0% 1.19ms ± 1% -2.14% (p=0.000 n=8+8)
_UFlat6-4 1.03ms ± 0% 0.99ms ± 0% -3.28% (p=0.000 n=9+8)
_UFlat7-4 895µs ± 0% 861µs ± 0% -3.79% (p=0.000 n=8+8)
_UFlat8-4 2.83ms ± 0% 2.75ms ± 0% -2.88% (p=0.000 n=7+8)
_UFlat9-4 3.85ms ± 1% 3.73ms ± 1% -3.03% (p=0.000 n=8+9)
_UFlat10-4 286µs ± 0% 282µs ± 0% -1.59% (p=0.000 n=9+9)
_UFlat11-4 1.06ms ± 0% 1.02ms ± 0% -3.58% (p=0.000 n=8+9)
_ZFlat0-4 620µs ± 0% 620µs ± 1% ~ (p=0.963 n=9+8)
_ZFlat1-4 9.49ms ± 1% 9.67ms ± 3% +1.87% (p=0.000 n=9+10)
_ZFlat2-4 61.8µs ± 0% 62.3µs ± 3% ~ (p=0.829 n=8+10)
_ZFlat3-4 2.80µs ± 1% 2.79µs ± 0% -0.55% (p=0.000 n=8+8)
_ZFlat4-4 108µs ± 0% 109µs ± 0% +0.55% (p=0.000 n=10+8)
_ZFlat5-4 2.59ms ± 2% 2.58ms ± 1% ~ (p=0.274 n=10+8)
_ZFlat6-4 2.39ms ± 3% 2.40ms ± 1% ~ (p=0.631 n=10+10)
_ZFlat7-4 2.11ms ± 0% 2.08ms ± 1% -1.23% (p=0.000 n=10+9)
_ZFlat8-4 6.86ms ± 0% 6.92ms ± 1% +0.78% (p=0.000 n=9+8)
_ZFlat9-4 9.42ms ± 0% 9.40ms ± 1% ~ (p=0.606 n=8+9)
_ZFlat10-4 620µs ± 1% 621µs ± 4% ~ (p=0.173 n=8+10)
_ZFlat11-4 1.94ms ± 0% 1.93ms ± 0% -0.52% (p=0.001 n=9+8)
ExtendMatch-4 69.3µs ± 2% 69.2µs ± 0% ~ (p=0.515 n=10+8)
2020-10-02 15:49:34 +00:00
Nigel Tao
196ae77b8a
A+C: add Jonathan Swinney <jswinney@amazon.com>.
2020-07-07 23:17:29 +10:00
Nigel Tao
1801c13ca2
Merge pull request #53 from AWSjswinney/arm64-port-pr
...
port amd64 assembly to arm64
2020-07-07 23:11:05 +10:00
Jonathan Swinney
ea060ccb72
port amd64 assembly to arm64
...
This change was produced by taking the amd64 assembly and reproducing it
as closely as possible for the arm64 arch.
The main differences:
- arm64 uses registers R1-R17 which are mapped directly onto an amd64
counterpart
- arm64 requires 8 additional bytes of stack so callee args are displaced
by 8 bytes from amd64
- operands to CMP instructions are reversed except in a few cases where
arm64 uses a BLS (branch less-same) instead of JAE (jump above-equal)
- immediates in some cases have to be split to a separate MOVD instruction
- shifts can be combined with another instruction, such as an ADD, in some
cases
- The amd64 BSFQ instruction is implemented with a bit reversal and
leading zero count instruction
- memclear on arm64 makes use of the SIMD instructions to clear 64 bytes
at a time and uses a pointer comparison instead of a counter to reduce
the number of instructions in the loop
Tested on an AWS m6g.large (ARMv8.2):
name old time/op new time/op delta
WordsDecode1e1-2 29.2ns ± 0% 26.2ns ± 1% -10.51% (p=0.000 n=9+10)
WordsDecode1e2-2 187ns ± 0% 107ns ± 0% -42.78% (p=0.000 n=7+10)
WordsDecode1e3-2 2.16µs ± 1% 0.95µs ± 0% -55.85% (p=0.000 n=10+10)
WordsDecode1e4-2 30.1µs ± 0% 10.4µs ± 2% -65.40% (p=0.000 n=10+10)
WordsDecode1e5-2 348µs ± 0% 168µs ± 0% -51.86% (p=0.000 n=10+9)
WordsDecode1e6-2 3.47ms ± 0% 1.71ms ± 0% -50.66% (p=0.000 n=10+10)
WordsEncode1e1-2 19.4ns ± 0% 21.7ns ± 1% +12.06% (p=0.000 n=8+10)
WordsEncode1e2-2 2.09µs ± 0% 0.25µs ± 0% -88.14% (p=0.000 n=9+10)
WordsEncode1e3-2 6.67µs ± 1% 2.49µs ± 0% -62.63% (p=0.000 n=10+10)
WordsEncode1e4-2 63.5µs ± 1% 29.4µs ± 1% -53.63% (p=0.000 n=10+9)
WordsEncode1e5-2 722µs ± 0% 345µs ± 0% -52.21% (p=0.000 n=10+10)
WordsEncode1e6-2 7.17ms ± 0% 3.41ms ± 0% -52.46% (p=0.000 n=10+8)
RandomEncode-2 106µs ± 2% 78µs ± 0% -26.02% (p=0.000 n=10+10)
_UFlat0-2 152µs ± 0% 69µs ± 1% -54.90% (p=0.000 n=10+9)
_UFlat1-2 1.57ms ± 0% 0.77ms ± 0% -51.10% (p=0.000 n=9+10)
_UFlat2-2 6.84µs ± 0% 6.55µs ± 0% -4.25% (p=0.000 n=10+8)
_UFlat3-2 312ns ± 0% 183ns ± 0% -41.35% (p=0.000 n=10+9)
_UFlat4-2 15.4µs ± 1% 9.7µs ± 1% -36.79% (p=0.000 n=10+10)
_UFlat5-2 625µs ± 0% 301µs ± 1% -51.88% (p=0.000 n=9+10)
_UFlat6-2 570µs ± 0% 278µs ± 0% -51.18% (p=0.000 n=10+9)
_UFlat7-2 490µs ± 0% 240µs ± 1% -50.95% (p=0.000 n=10+10)
_UFlat8-2 1.52ms ± 0% 0.74ms ± 0% -51.01% (p=0.000 n=8+7)
_UFlat9-2 2.00ms ± 0% 1.01ms ± 0% -49.49% (p=0.000 n=10+10)
_UFlat10-2 132µs ± 0% 62µs ± 2% -53.19% (p=0.000 n=10+10)
_UFlat11-2 497µs ± 0% 258µs ± 0% -48.11% (p=0.000 n=10+9)
_ZFlat0-2 346µs ± 1% 136µs ± 5% -60.70% (p=0.000 n=10+9)
_ZFlat1-2 3.63ms ± 0% 1.76ms ± 0% -51.60% (p=0.000 n=10+8)
_ZFlat2-2 13.2µs ± 0% 9.5µs ± 0% -27.62% (p=0.000 n=8+9)
_ZFlat3-2 2.49µs ± 0% 0.45µs ± 0% -81.96% (p=0.002 n=8+10)
_ZFlat4-2 50.5µs ± 0% 15.7µs ± 1% -68.96% (p=0.000 n=10+9)
_ZFlat5-2 1.40ms ± 0% 0.56ms ± 0% -60.20% (p=0.000 n=9+9)
_ZFlat6-2 1.13ms ± 0% 0.54ms ± 0% -52.39% (p=0.000 n=10+9)
_ZFlat7-2 961µs ± 0% 472µs ± 0% -50.83% (p=0.000 n=10+10)
_ZFlat8-2 3.03ms ± 0% 1.43ms ± 0% -52.90% (p=0.000 n=9+10)
_ZFlat9-2 3.88ms ± 0% 1.95ms ± 0% -49.72% (p=0.000 n=10+10)
_ZFlat10-2 339µs ± 0% 123µs ± 3% -63.82% (p=0.000 n=10+10)
_ZFlat11-2 973µs ± 0% 433µs ± 0% -55.49% (p=0.000 n=10+10)
ExtendMatch-2 22.1µs ± 1% 9.8µs ± 0% -55.63% (p=0.000 n=10+10)
name old speed new speed delta
WordsDecode1e1-2 342MB/s ± 0% 382MB/s ± 1% +11.77% (p=0.000 n=9+10)
WordsDecode1e2-2 535MB/s ± 0% 934MB/s ± 0% +74.43% (p=0.000 n=10+10)
WordsDecode1e3-2 463MB/s ± 1% 1049MB/s ± 0% +126.52% (p=0.000 n=10+10)
WordsDecode1e4-2 333MB/s ± 0% 961MB/s ± 2% +189.04% (p=0.000 n=10+10)
WordsDecode1e5-2 287MB/s ± 0% 597MB/s ± 0% +107.72% (p=0.000 n=10+9)
WordsDecode1e6-2 288MB/s ± 0% 584MB/s ± 0% +102.67% (p=0.000 n=10+10)
WordsEncode1e1-2 515MB/s ± 0% 460MB/s ± 0% -10.70% (p=0.000 n=10+10)
WordsEncode1e2-2 47.8MB/s ± 0% 403.3MB/s ± 0% +743.40% (p=0.000 n=10+10)
WordsEncode1e3-2 150MB/s ± 1% 401MB/s ± 0% +167.66% (p=0.000 n=10+9)
WordsEncode1e4-2 157MB/s ± 1% 340MB/s ± 1% +115.66% (p=0.000 n=10+9)
WordsEncode1e5-2 138MB/s ± 0% 290MB/s ± 0% +109.24% (p=0.000 n=10+10)
WordsEncode1e6-2 139MB/s ± 0% 293MB/s ± 0% +110.35% (p=0.000 n=10+8)
RandomEncode-2 9.93GB/s ± 2% 13.42GB/s ± 0% +35.15% (p=0.000 n=10+10)
_UFlat0-2 672MB/s ± 0% 1489MB/s ± 1% +121.75% (p=0.000 n=10+9)
_UFlat1-2 446MB/s ± 0% 913MB/s ± 0% +104.48% (p=0.000 n=9+10)
_UFlat2-2 18.0GB/s ± 0% 18.8GB/s ± 0% +4.44% (p=0.000 n=8+8)
_UFlat3-2 641MB/s ± 0% 1091MB/s ± 0% +70.19% (p=0.000 n=10+10)
_UFlat4-2 6.66GB/s ± 1% 10.53GB/s ± 1% +58.19% (p=0.000 n=10+10)
_UFlat5-2 655MB/s ± 0% 1362MB/s ± 1% +107.80% (p=0.000 n=9+10)
_UFlat6-2 267MB/s ± 0% 547MB/s ± 0% +104.82% (p=0.000 n=10+9)
_UFlat7-2 255MB/s ± 0% 521MB/s ± 1% +103.89% (p=0.000 n=10+10)
_UFlat8-2 281MB/s ± 0% 574MB/s ± 0% +104.14% (p=0.000 n=8+7)
_UFlat9-2 241MB/s ± 0% 478MB/s ± 0% +97.97% (p=0.000 n=10+10)
_UFlat10-2 896MB/s ± 0% 1914MB/s ± 2% +113.64% (p=0.000 n=10+10)
_UFlat11-2 371MB/s ± 0% 715MB/s ± 0% +92.72% (p=0.000 n=10+9)
_ZFlat0-2 296MB/s ± 1% 754MB/s ± 5% +154.57% (p=0.000 n=10+9)
_ZFlat1-2 194MB/s ± 0% 400MB/s ± 0% +106.63% (p=0.000 n=10+8)
_ZFlat2-2 9.35GB/s ± 0% 12.92GB/s ± 0% +38.17% (p=0.000 n=8+10)
_ZFlat3-2 80.3MB/s ± 0% 445.6MB/s ± 0% +454.64% (p=0.000 n=10+10)
_ZFlat4-2 2.03GB/s ± 0% 6.54GB/s ± 1% +222.19% (p=0.000 n=10+9)
_ZFlat5-2 292MB/s ± 0% 733MB/s ± 0% +151.25% (p=0.000 n=9+9)
_ZFlat6-2 135MB/s ± 0% 284MB/s ± 0% +110.05% (p=0.000 n=10+9)
_ZFlat7-2 130MB/s ± 0% 265MB/s ± 0% +103.38% (p=0.000 n=10+10)
_ZFlat8-2 141MB/s ± 0% 299MB/s ± 0% +112.30% (p=0.000 n=9+10)
_ZFlat9-2 124MB/s ± 0% 247MB/s ± 0% +98.90% (p=0.000 n=10+10)
_ZFlat10-2 350MB/s ± 0% 967MB/s ± 3% +176.44% (p=0.000 n=10+10)
_ZFlat11-2 189MB/s ± 0% 426MB/s ± 0% +124.65% (p=0.000 n=10+10)
2020-07-01 03:01:52 +00:00
Eric Buth
0a27eb7fa2
Add ReadByte method, satisfies the io.ByteReader interface
2020-02-17 13:39:43 -05:00
Nigel Tao
ff6b7dc882
Add comments re handling block and stream formats
2019-09-04 16:35:34 +10:00
Nigel Tao
059a9b1922
A+C: add Klaus Post <klauspost@gmail.com>.
2019-09-04 16:29:47 +10:00
Nigel Tao
c9879f99e6
Merge pull request #48 from klauspost/use-copy-for-non-overlapping
...
Use faster copy when not overlapping
2019-09-04 16:27:17 +10:00
Nigel Tao
5610373d2f
Merge pull request #49 from klauspost/faster-overlapping-copies
...
Faster overlapping copies
2019-09-04 16:26:23 +10:00
Klaus Post
f6ad6c8bb8
Faster overlapping copies
...
Eliminates bounds check on every byte copied.
Benchmark measured on AMD64 but with `-tags=noasm`:
```
>benchstat old.txt new.txt
name old time/op new time/op delta
_UFlat0-8 194µs ± 3% 150µs ± 2% -22.59% (p=0.000 n=10+10)
_UFlat1-8 1.62ms ± 1% 1.41ms ± 2% -12.70% (p=0.000 n=9+10)
_UFlat2-8 8.91µs ± 4% 8.76µs ± 2% ~ (p=0.343 n=10+10)
_UFlat3-8 222ns ± 2% 224ns ± 1% +1.00% (p=0.028 n=10+9)
_UFlat4-8 28.4µs ± 2% 20.3µs ± 3% -28.45% (p=0.000 n=10+10)
_UFlat5-8 797µs ± 5% 603µs ± 2% -24.34% (p=0.000 n=10+9)
_UFlat6-8 565µs ± 1% 531µs ± 2% -6.16% (p=0.000 n=8+9)
_UFlat7-8 494µs ± 4% 457µs ± 2% -7.61% (p=0.000 n=10+10)
_UFlat8-8 1.55ms ± 4% 1.40ms ± 2% -9.48% (p=0.000 n=10+9)
_UFlat9-8 1.93ms ± 1% 1.83ms ± 2% -5.44% (p=0.000 n=10+9)
_UFlat10-8 186µs ± 2% 138µs ± 5% -26.04% (p=0.000 n=10+10)
_UFlat11-8 524µs ± 2% 478µs ± 3% -8.68% (p=0.000 n=10+10)
name old speed new speed delta
_UFlat0-8 528MB/s ± 3% 682MB/s ± 2% +29.18% (p=0.000 n=10+10)
_UFlat1-8 434MB/s ± 1% 497MB/s ± 2% +14.56% (p=0.000 n=9+10)
_UFlat2-8 13.8GB/s ± 4% 14.1GB/s ± 2% ~ (p=0.353 n=10+10)
_UFlat3-8 901MB/s ± 1% 890MB/s ± 1% -1.18% (p=0.008 n=9+9)
_UFlat4-8 3.60GB/s ± 2% 5.03GB/s ± 3% +39.76% (p=0.000 n=10+10)
_UFlat5-8 514MB/s ± 5% 679MB/s ± 2% +32.04% (p=0.000 n=10+9)
_UFlat6-8 269MB/s ± 1% 287MB/s ± 2% +6.57% (p=0.000 n=8+9)
_UFlat7-8 253MB/s ± 4% 274MB/s ± 2% +8.23% (p=0.000 n=10+10)
_UFlat8-8 276MB/s ± 4% 305MB/s ± 2% +10.43% (p=0.000 n=10+9)
_UFlat9-8 249MB/s ± 1% 263MB/s ± 2% +5.76% (p=0.000 n=10+9)
_UFlat10-8 637MB/s ± 2% 862MB/s ± 5% +35.25% (p=0.000 n=10+10)
_UFlat11-8 352MB/s ± 2% 385MB/s ± 3% +9.51% (p=0.000 n=10+10)
```
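A minimal sketch of the idea behind this commit, assuming the speed-up comes from re-slicing both ranges to the same length so that the compiler can drop the per-byte bounds checks in the loop. The function name `forwardCopy` and its parameters are hypothetical, and whether the checks are actually removed depends on the compiler version.

```go
package main

import "fmt"

// forwardCopy copies length bytes within dst from position d-offset to
// position d, always moving forwards one byte at a time. When offset is
// smaller than length the ranges overlap, and the forward order makes the
// earlier bytes repeat as a pattern.
func forwardCopy(dst []byte, d, offset, length int) {
	a := dst[d : d+length]
	b := dst[d-offset:]
	b = b[:len(a)] // same length as a, so b[i] is in range whenever a[i] is
	for i := range a {
		a[i] = b[i]
	}
}

func main() {
	dst := make([]byte, 8)
	copy(dst, "ab")
	forwardCopy(dst, 2, 2, 6)
	fmt.Printf("%s\n", dst)
}
```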
2019-09-01 19:55:24 +02:00
Klaus Post
efb0d863a3
Use faster copy when not overlapping
...
Use the built-in copy function when the source doesn't overlap the destination.
Again, the benchmarks are a bit polarized based on how often this is the case, but this should be a solid improvement for all non-amd64 users.
Benchmark measured on AMD64 but with `-tags=noasm`:
```
>benchstat old.txt new.txt
name old time/op new time/op delta
_UFlat0-8 194µs ± 3% 130µs ± 2% -33.14% (p=0.000 n=10+10)
_UFlat1-8 1.62ms ± 1% 1.42ms ± 1% -11.98% (p=0.000 n=9+9)
_UFlat2-8 8.91µs ± 4% 8.73µs ± 1% ~ (p=0.182 n=10+9)
_UFlat3-8 222ns ± 2% 219ns ± 6% -1.36% (p=0.022 n=10+9)
_UFlat4-8 28.4µs ± 2% 11.5µs ± 1% -59.57% (p=0.000 n=10+10)
_UFlat5-8 797µs ± 5% 536µs ± 1% -32.77% (p=0.000 n=10+10)
_UFlat6-8 565µs ± 1% 571µs ± 1% +1.04% (p=0.007 n=8+10)
_UFlat7-8 494µs ± 4% 496µs ± 3% ~ (p=0.986 n=10+10)
_UFlat8-8 1.55ms ± 4% 1.53ms ± 3% ~ (p=0.280 n=10+10)
_UFlat9-8 1.93ms ± 1% 1.98ms ± 3% +2.57% (p=0.000 n=10+10)
_UFlat10-8 186µs ± 2% 102µs ± 2% -45.14% (p=0.000 n=10+10)
_UFlat11-8 524µs ± 2% 510µs ± 1% -2.56% (p=0.000 n=10+8)
name old speed new speed delta
_UFlat0-8 528MB/s ± 3% 790MB/s ± 1% +49.54% (p=0.000 n=10+10)
_UFlat1-8 434MB/s ± 1% 493MB/s ± 1% +13.61% (p=0.000 n=9+9)
_UFlat2-8 13.8GB/s ± 4% 14.1GB/s ± 2% ~ (p=0.182 n=10+9)
_UFlat3-8 901MB/s ± 1% 912MB/s ± 6% +1.18% (p=0.026 n=9+9)
_UFlat4-8 3.60GB/s ± 2% 8.91GB/s ± 1% +147.32% (p=0.000 n=10+10)
_UFlat5-8 514MB/s ± 5% 764MB/s ± 2% +48.59% (p=0.000 n=10+10)
_UFlat6-8 269MB/s ± 1% 266MB/s ± 1% -1.03% (p=0.009 n=8+10)
_UFlat7-8 253MB/s ± 4% 252MB/s ± 3% ~ (p=0.985 n=10+10)
_UFlat8-8 276MB/s ± 4% 279MB/s ± 3% ~ (p=0.288 n=10+10)
_UFlat9-8 249MB/s ± 1% 243MB/s ± 3% -2.51% (p=0.000 n=10+10)
_UFlat10-8 637MB/s ± 2% 1162MB/s ± 2% +82.29% (p=0.000 n=10+10)
_UFlat11-8 352MB/s ± 2% 361MB/s ± 1% +2.62% (p=0.000 n=10+8)
```
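The overlap test described in this commit can be sketched as follows, assuming the condition is that the copy offset is at least the copy length. The function name `expandCopy` and its parameters are hypothetical.

```go
package main

import "fmt"

// expandCopy copies length bytes within dst from position d-offset to
// position d. Hypothetical helper, for illustration only.
func expandCopy(dst []byte, d, offset, length int) {
	if offset >= length {
		// The source range ends at or before d, so it cannot overlap the
		// destination and the built-in copy is safe.
		copy(dst[d:d+length], dst[d-offset:])
		return
	}
	// Overlapping ranges: copy forwards byte by byte so that bytes written
	// earlier in this loop are read again later.
	for end := d + length; d != end; d++ {
		dst[d] = dst[d-offset]
	}
}

func main() {
	far := make([]byte, 8)
	copy(far, "abcd")
	expandCopy(far, 4, 4, 4) // non-overlapping path
	fmt.Printf("%s\n", far)

	near := make([]byte, 4)
	near[0] = 'a'
	expandCopy(near, 1, 1, 3) // overlapping path
	fmt.Printf("%s\n", near)
}
```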
Co-Authored-By: Nigel Tao <nigeltao@golang.org>
2019-09-01 19:53:02 +02:00
Nigel Tao
2a8bb927dd
Merge pull request #46 from creachadair/gomod
...
Add a go.mod file for basic Go modules support.
2019-02-19 10:22:22 +11:00
M. J. Fromberger
f05e7a5086
Add a go.mod file for basic Go modules support.
2019-02-11 13:35:28 -08:00
Nigel Tao
2e65f85255
Fix snappytool to use block, not stream, format
...
The key difference is replacing snappy.NewWriter and snappy.NewReader
with snappy.Encode and snappy.Decode.
This change restores the behavior of the previous (written in C)
snappytool program.
2018-05-18 15:45:09 +10:00
Nigel Tao
e45cd318e0
Merge pull request #38 from mattn/cmd-snappytool
...
rewrite snappytool in go
2018-05-18 15:18:59 +10:00
Edward Betts
da2bb3382a
correct spelling mistake
2017-09-01 12:38:27 +01:00
Yasuhiro Matsumoto
35a8406c21
rewrite snappytool in go
2017-03-28 21:05:51 +09:00
Nigel Tao
553a641470
Merge pull request #37 from fatedier/master
...
fix typo
2017-02-16 10:32:05 +11:00
fatedier
0d9c4c05f1
fix typo
2017-01-25 15:07:54 +08:00
Nigel Tao
7db9049039
Merge pull request #36 from sguiheux/gofmt
...
Run gofmt.
2017-01-19 12:47:23 +11:00
Steven Guiheux
5a0054d7b7
fix: gofmt
2017-01-18 11:51:53 +01:00
Nigel Tao
d9eb7a3d35
Support the COPY_4 tag.
...
It is a valid encoding, even if no longer issued by most encoders.
name old speed new speed delta
WordsDecode1e1-8 525MB/s ± 0% 504MB/s ± 1% -4.04% (p=0.000 n=9+10)
WordsDecode1e2-8 1.23GB/s ± 0% 1.23GB/s ± 1% ~ (p=0.678 n=10+9)
WordsDecode1e3-8 1.54GB/s ± 0% 1.53GB/s ± 1% -0.75% (p=0.000 n=10+9)
WordsDecode1e4-8 1.53GB/s ± 0% 1.51GB/s ± 3% -1.46% (p=0.000 n=9+10)
WordsDecode1e5-8 793MB/s ± 0% 777MB/s ± 2% -2.01% (p=0.017 n=9+10)
WordsDecode1e6-8 917MB/s ± 1% 917MB/s ± 1% ~ (p=0.473 n=8+10)
WordsEncode1e1-8 641MB/s ± 2% 641MB/s ± 0% ~ (p=0.780 n=10+9)
WordsEncode1e2-8 583MB/s ± 0% 580MB/s ± 0% -0.41% (p=0.001 n=10+9)
WordsEncode1e3-8 647MB/s ± 1% 648MB/s ± 0% ~ (p=0.326 n=10+9)
WordsEncode1e4-8 442MB/s ± 1% 452MB/s ± 0% +2.20% (p=0.000 n=10+8)
WordsEncode1e5-8 355MB/s ± 1% 355MB/s ± 0% ~ (p=0.880 n=10+8)
WordsEncode1e6-8 433MB/s ± 0% 434MB/s ± 0% ~ (p=0.700 n=8+8)
RandomEncode-8 14.2GB/s ± 3% 14.2GB/s ± 3% ~ (p=0.780 n=10+9)
_UFlat0-8 2.18GB/s ± 1% 2.19GB/s ± 0% ~ (p=0.447 n=10+9)
_UFlat1-8 1.40GB/s ± 2% 1.41GB/s ± 0% +0.73% (p=0.043 n=9+10)
_UFlat2-8 23.4GB/s ± 3% 23.5GB/s ± 2% ~ (p=0.497 n=9+10)
_UFlat3-8 1.90GB/s ± 0% 1.91GB/s ± 0% +0.30% (p=0.002 n=8+9)
_UFlat4-8 13.9GB/s ± 2% 14.0GB/s ± 1% ~ (p=0.720 n=9+10)
_UFlat5-8 1.96GB/s ± 1% 1.97GB/s ± 0% +0.81% (p=0.000 n=10+9)
_UFlat6-8 813MB/s ± 0% 814MB/s ± 0% +0.17% (p=0.037 n=8+10)
_UFlat7-8 783MB/s ± 2% 785MB/s ± 0% ~ (p=0.340 n=9+9)
_UFlat8-8 859MB/s ± 0% 857MB/s ± 0% ~ (p=0.074 n=8+9)
_UFlat9-8 719MB/s ± 1% 719MB/s ± 1% ~ (p=0.621 n=10+9)
_UFlat10-8 2.84GB/s ± 0% 2.84GB/s ± 0% +0.19% (p=0.043 n=10+9)
_UFlat11-8 1.05GB/s ± 1% 1.05GB/s ± 0% ~ (p=0.523 n=9+8)
_ZFlat0-8 1.04GB/s ± 2% 1.04GB/s ± 0% ~ (p=0.222 n=9+9)
_ZFlat1-8 535MB/s ± 0% 534MB/s ± 0% ~ (p=0.059 n=9+9)
_ZFlat2-8 15.6GB/s ± 3% 15.7GB/s ± 1% ~ (p=0.720 n=9+10)
_ZFlat3-8 723MB/s ± 0% 740MB/s ± 3% +2.36% (p=0.034 n=8+10)
_ZFlat4-8 9.16GB/s ± 1% 9.20GB/s ± 1% ~ (p=0.297 n=9+9)
_ZFlat5-8 987MB/s ± 1% 991MB/s ± 0% ~ (p=0.167 n=9+8)
_ZFlat6-8 378MB/s ± 2% 379MB/s ± 0% ~ (p=0.334 n=9+8)
_ZFlat7-8 350MB/s ± 2% 352MB/s ± 0% +0.60% (p=0.014 n=9+8)
_ZFlat8-8 397MB/s ± 0% 396MB/s ± 1% ~ (p=0.965 n=8+10)
_ZFlat9-8 328MB/s ± 0% 327MB/s ± 1% ~ (p=0.409 n=8+9)
_ZFlat10-8 1.33GB/s ± 0% 1.33GB/s ± 1% ~ (p=0.356 n=9+10)
_ZFlat11-8 605MB/s ± 0% 605MB/s ± 1% ~ (p=0.743 n=9+8)
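For reference, a COPY_4 element as laid out in the Snappy block format (a tag byte whose low two bits are 11, with length-1 in the upper six bits, followed by a 32-bit little-endian offset) can be parsed as sketched below. This is not the decoder code from this commit, and the function name `decodeCopy4` is hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// tagCopy4 is the value of the low two tag bits for a copy with a 4-byte offset.
const tagCopy4 = 0x03

// decodeCopy4 parses one COPY_4 element at the start of src and returns the
// copy length, the copy offset, and the number of input bytes consumed.
func decodeCopy4(src []byte) (length, offset, n int, err error) {
	if len(src) < 5 {
		return 0, 0, 0, errors.New("short input")
	}
	if src[0]&0x03 != tagCopy4 {
		return 0, 0, 0, errors.New("not a COPY_4 element")
	}
	length = 1 + int(src[0]>>2)
	offset = int(binary.LittleEndian.Uint32(src[1:5]))
	return length, offset, 5, nil
}

func main() {
	// Tag 0x27 encodes length 10; the next four bytes encode offset 70000.
	elem := []byte{0x27, 0x70, 0x11, 0x01, 0x00}
	length, offset, n, err := decodeCopy4(elem)
	fmt.Println(length, offset, n, err)
}
```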
2016-05-29 15:00:41 +10:00
Nigel Tao
d6668316e4
Fix BenchmarkExtendMatch to honor the testdata flag.
2016-05-19 13:34:20 +10:00
Nigel Tao
d7b1e156f5
Add a benchdataDir flag.
2016-05-05 08:17:12 +10:00
Nigel Tao
aefa7ba4ef
Re-add the testdata flag.
...
Some build environments need to specify their own testdata dir.
2016-05-05 07:48:01 +10:00
Nigel Tao
43fea289ed
Remove the snappy.test binary, inadvertently checked in.
...
Fixes #32.
2016-04-30 09:02:19 +10:00
Nigel Tao
b62d312cd2
Add some benchmark numbers to the README.
2016-04-29 15:28:03 +10:00
Nigel Tao
dfb3612ba2
Inline the extendMatch call.
...
Compared to the previous commit:
name old speed new speed delta
WordsEncode1e1-8 701MB/s ± 0% 699MB/s ± 1% ~ (p=0.123 n=10+10)
WordsEncode1e2-8 460MB/s ± 0% 583MB/s ± 1% +26.64% (p=0.000 n=10+10)
WordsEncode1e3-8 480MB/s ± 0% 647MB/s ± 2% +34.85% (p=0.000 n=10+10)
WordsEncode1e4-8 416MB/s ± 0% 451MB/s ± 0% +8.30% (p=0.000 n=10+8)
WordsEncode1e5-8 297MB/s ± 0% 355MB/s ± 2% +19.50% (p=0.000 n=10+9)
WordsEncode1e6-8 345MB/s ± 0% 433MB/s ± 2% +25.47% (p=0.000 n=10+9)
RandomEncode-8 14.4GB/s ± 2% 14.3GB/s ± 3% ~ (p=0.075 n=10+10)
_ZFlat0-8 891MB/s ± 1% 1040MB/s ± 0% +16.67% (p=0.000 n=9+9)
_ZFlat1-8 471MB/s ± 0% 535MB/s ± 1% +13.68% (p=0.000 n=9+10)
_ZFlat2-8 16.2GB/s ± 3% 16.4GB/s ± 1% ~ (p=0.122 n=10+8)
_ZFlat3-8 676MB/s ± 0% 762MB/s ± 0% +12.62% (p=0.000 n=10+9)
_ZFlat4-8 8.36GB/s ± 1% 9.47GB/s ± 1% +13.28% (p=0.000 n=10+10)
_ZFlat5-8 852MB/s ± 0% 986MB/s ± 1% +15.79% (p=0.000 n=10+9)
_ZFlat6-8 316MB/s ± 0% 380MB/s ± 1% +20.41% (p=0.000 n=8+9)
_ZFlat7-8 296MB/s ± 0% 353MB/s ± 0% +19.44% (p=0.000 n=8+10)
_ZFlat8-8 331MB/s ± 1% 399MB/s ± 0% +20.53% (p=0.000 n=9+8)
_ZFlat9-8 274MB/s ± 0% 329MB/s ± 0% +20.27% (p=0.000 n=8+9)
_ZFlat10-8 1.17GB/s ± 0% 1.35GB/s ± 1% +15.15% (p=0.000 n=9+9)
_ZFlat11-8 462MB/s ± 0% 608MB/s ± 0% +31.54% (p=0.000 n=9+9)
The net effect of the past four inlining commits, when compared to just
before c3defccc "Inline the emitCopy call":
name old speed new speed delta
WordsEncode1e1-8 701MB/s ± 1% 699MB/s ± 1% ~ (p=0.353 n=10+10)
WordsEncode1e2-8 429MB/s ± 0% 583MB/s ± 1% +35.95% (p=0.000 n=9+10)
WordsEncode1e3-8 447MB/s ± 0% 647MB/s ± 2% +44.85% (p=0.000 n=9+10)
WordsEncode1e4-8 322MB/s ± 1% 451MB/s ± 0% +40.00% (p=0.000 n=10+8)
WordsEncode1e5-8 268MB/s ± 0% 355MB/s ± 2% +32.41% (p=0.000 n=9+9)
WordsEncode1e6-8 313MB/s ± 0% 433MB/s ± 2% +38.28% (p=0.000 n=8+9)
RandomEncode-8 14.4GB/s ± 1% 14.3GB/s ± 3% ~ (p=0.897 n=8+10)
_ZFlat0-8 797MB/s ± 2% 1040MB/s ± 0% +30.53% (p=0.000 n=9+9)
_ZFlat1-8 435MB/s ± 1% 535MB/s ± 1% +22.97% (p=0.000 n=9+10)
_ZFlat2-8 16.1GB/s ± 2% 16.4GB/s ± 1% +1.47% (p=0.001 n=10+8)
_ZFlat3-8 633MB/s ± 0% 762MB/s ± 0% +20.32% (p=0.000 n=10+9)
_ZFlat4-8 7.95GB/s ± 1% 9.47GB/s ± 1% +19.11% (p=0.000 n=10+10)
_ZFlat5-8 771MB/s ± 0% 986MB/s ± 1% +27.83% (p=0.000 n=10+9)
_ZFlat6-8 283MB/s ± 0% 380MB/s ± 1% +34.46% (p=0.000 n=10+9)
_ZFlat7-8 265MB/s ± 0% 353MB/s ± 0% +33.29% (p=0.000 n=9+10)
_ZFlat8-8 299MB/s ± 0% 399MB/s ± 0% +33.36% (p=0.000 n=9+8)
_ZFlat9-8 246MB/s ± 1% 329MB/s ± 0% +33.58% (p=0.000 n=10+9)
_ZFlat10-8 1.05GB/s ± 1% 1.35GB/s ± 1% +28.35% (p=0.000 n=10+9)
_ZFlat11-8 411MB/s ± 0% 608MB/s ± 0% +47.82% (p=0.000 n=10+9)
2016-04-29 14:24:51 +10:00
Nigel Tao
c707890a47
Rearrange the extendMatch register allocation.
...
This minimizes the diff in a follow-up commit, when manually inlining.
It's not an optimization per se, but for the record:
name old speed new speed delta
WordsEncode1e1-8 700MB/s ± 1% 701MB/s ± 0% ~ (p=0.393 n=10+10)
WordsEncode1e2-8 460MB/s ± 1% 460MB/s ± 0% ~ (p=0.393 n=10+10)
WordsEncode1e3-8 478MB/s ± 2% 480MB/s ± 0% ~ (p=0.912 n=10+10)
WordsEncode1e4-8 414MB/s ± 0% 416MB/s ± 0% +0.64% (p=0.000 n=9+10)
WordsEncode1e5-8 296MB/s ± 1% 297MB/s ± 0% ~ (p=0.113 n=9+10)
WordsEncode1e6-8 345MB/s ± 0% 345MB/s ± 0% ~ (p=0.949 n=8+10)
RandomEncode-8 14.4GB/s ± 2% 14.4GB/s ± 2% ~ (p=0.278 n=9+10)
_ZFlat0-8 888MB/s ± 1% 891MB/s ± 1% +0.35% (p=0.010 n=10+9)
_ZFlat1-8 471MB/s ± 1% 471MB/s ± 0% ~ (p=0.447 n=10+9)
_ZFlat2-8 16.2GB/s ± 3% 16.2GB/s ± 3% ~ (p=0.912 n=10+10)
_ZFlat3-8 675MB/s ± 1% 676MB/s ± 0% ~ (p=0.150 n=9+10)
_ZFlat4-8 8.31GB/s ± 1% 8.36GB/s ± 1% +0.65% (p=0.035 n=10+10)
_ZFlat5-8 850MB/s ± 0% 852MB/s ± 0% ~ (p=0.182 n=9+10)
_ZFlat6-8 316MB/s ± 0% 316MB/s ± 0% ~ (p=0.762 n=10+8)
_ZFlat7-8 294MB/s ± 1% 296MB/s ± 0% +0.51% (p=0.006 n=9+8)
_ZFlat8-8 330MB/s ± 1% 331MB/s ± 1% ~ (p=0.881 n=9+9)
_ZFlat9-8 273MB/s ± 0% 274MB/s ± 0% +0.23% (p=0.043 n=10+8)
_ZFlat10-8 1.17GB/s ± 1% 1.17GB/s ± 0% ~ (p=0.922 n=10+9)
_ZFlat11-8 461MB/s ± 0% 462MB/s ± 0% ~ (p=0.219 n=10+9)
Also:
name old time/op new time/op delta
ExtendMatch-8 7.92µs ± 2% 7.80µs ± 2% -1.51% (p=0.002 n=10+9)
and note that this is time/op instead of MB/s, so negative is better,
although it's quite possibly all just noise.
2016-04-29 14:11:06 +10:00
Nigel Tao
5a44a9da21
Inline the emitLiteral call.
...
name old speed new speed delta
WordsEncode1e1-8 712MB/s ± 1% 700MB/s ± 1% -1.65% (p=0.000 n=10+10)
WordsEncode1e2-8 467MB/s ± 0% 460MB/s ± 1% -1.53% (p=0.000 n=9+10)
WordsEncode1e3-8 483MB/s ± 0% 478MB/s ± 2% -0.98% (p=0.007 n=9+10)
WordsEncode1e4-8 353MB/s ± 1% 414MB/s ± 0% +17.03% (p=0.000 n=10+9)
WordsEncode1e5-8 293MB/s ± 0% 296MB/s ± 1% +1.03% (p=0.000 n=8+9)
WordsEncode1e6-8 345MB/s ± 0% 345MB/s ± 0% ~ (p=0.332 n=9+8)
RandomEncode-8 14.4GB/s ± 2% 14.4GB/s ± 2% ~ (p=1.000 n=10+9)
_ZFlat0-8 863MB/s ± 0% 888MB/s ± 1% +2.86% (p=0.000 n=9+10)
_ZFlat1-8 471MB/s ± 0% 471MB/s ± 1% ~ (p=0.897 n=8+10)
_ZFlat2-8 16.2GB/s ± 2% 16.2GB/s ± 3% ~ (p=0.631 n=10+10)
_ZFlat3-8 659MB/s ± 1% 675MB/s ± 1% +2.32% (p=0.000 n=9+9)
_ZFlat4-8 8.29GB/s ± 1% 8.31GB/s ± 1% ~ (p=0.315 n=10+10)
_ZFlat5-8 836MB/s ± 1% 850MB/s ± 0% +1.78% (p=0.000 n=9+9)
_ZFlat6-8 315MB/s ± 0% 316MB/s ± 0% +0.39% (p=0.002 n=9+10)
_ZFlat7-8 293MB/s ± 1% 294MB/s ± 1% ~ (p=0.139 n=10+9)
_ZFlat8-8 331MB/s ± 1% 330MB/s ± 1% ~ (p=0.356 n=10+9)
_ZFlat9-8 273MB/s ± 1% 273MB/s ± 0% ~ (p=0.280 n=10+10)
_ZFlat10-8 1.12GB/s ± 1% 1.17GB/s ± 1% +4.12% (p=0.000 n=10+10)
_ZFlat11-8 460MB/s ± 0% 461MB/s ± 0% +0.34% (p=0.006 n=8+10)
2016-04-29 13:20:53 +10:00
Nigel Tao
c3defccc35
Inline the emitCopy call.
...
name old speed new speed delta
WordsEncode1e1-8 701MB/s ± 1% 712MB/s ± 1% +1.64% (p=0.000 n=10+10)
WordsEncode1e2-8 429MB/s ± 0% 467MB/s ± 0% +8.86% (p=0.000 n=9+9)
WordsEncode1e3-8 447MB/s ± 0% 483MB/s ± 0% +8.20% (p=0.000 n=9+9)
WordsEncode1e4-8 322MB/s ± 1% 353MB/s ± 1% +9.76% (p=0.000 n=10+10)
WordsEncode1e5-8 268MB/s ± 0% 293MB/s ± 0% +9.42% (p=0.000 n=9+8)
WordsEncode1e6-8 313MB/s ± 0% 345MB/s ± 0% +10.06% (p=0.000 n=8+9)
RandomEncode-8 14.4GB/s ± 1% 14.4GB/s ± 2% ~ (p=0.829 n=8+10)
_ZFlat0-8 797MB/s ± 2% 863MB/s ± 0% +8.39% (p=0.000 n=9+9)
_ZFlat1-8 435MB/s ± 1% 471MB/s ± 0% +8.34% (p=0.000 n=9+8)
_ZFlat2-8 16.1GB/s ± 2% 16.2GB/s ± 2% ~ (p=0.165 n=10+10)
_ZFlat3-8 633MB/s ± 0% 659MB/s ± 1% +4.12% (p=0.000 n=10+9)
_ZFlat4-8 7.95GB/s ± 1% 8.29GB/s ± 1% +4.22% (p=0.000 n=10+10)
_ZFlat5-8 771MB/s ± 0% 836MB/s ± 1% +8.33% (p=0.000 n=10+9)
_ZFlat6-8 283MB/s ± 0% 315MB/s ± 0% +11.19% (p=0.000 n=10+9)
_ZFlat7-8 265MB/s ± 0% 293MB/s ± 1% +10.73% (p=0.000 n=9+10)
_ZFlat8-8 299MB/s ± 0% 331MB/s ± 1% +10.74% (p=0.000 n=9+10)
_ZFlat9-8 246MB/s ± 1% 273MB/s ± 1% +10.90% (p=0.000 n=10+10)
_ZFlat10-8 1.05GB/s ± 1% 1.12GB/s ± 1% +7.02% (p=0.000 n=10+10)
_ZFlat11-8 411MB/s ± 0% 460MB/s ± 0% +11.79% (p=0.000 n=10+8)
2016-04-29 12:54:56 +10:00
Nigel Tao
598d84db77
Rearrange the emitLiteral register allocation.
...
This minimizes the diff in a follow-up commit, when manually inlining.
It's not an optimization per se, but for the record:
name old speed new speed delta
WordsEncode1e1-8 698MB/s ± 1% 701MB/s ± 1% ~ (p=0.165 n=10+10)
WordsEncode1e2-8 428MB/s ± 0% 429MB/s ± 0% ~ (p=0.489 n=9+9)
WordsEncode1e3-8 446MB/s ± 0% 447MB/s ± 0% ~ (p=0.476 n=9+9)
WordsEncode1e4-8 321MB/s ± 1% 322MB/s ± 1% ~ (p=0.593 n=10+10)
WordsEncode1e5-8 267MB/s ± 1% 268MB/s ± 0% ~ (p=0.287 n=9+9)
WordsEncode1e6-8 313MB/s ± 1% 313MB/s ± 0% ~ (p=0.190 n=9+8)
RandomEncode-8 14.4GB/s ± 1% 14.4GB/s ± 1% ~ (p=0.673 n=9+8)
_ZFlat0-8 800MB/s ± 0% 797MB/s ± 2% ~ (p=0.387 n=9+9)
_ZFlat1-8 436MB/s ± 1% 435MB/s ± 1% ~ (p=0.169 n=9+9)
_ZFlat2-8 16.2GB/s ± 1% 16.1GB/s ± 2% ~ (p=0.063 n=10+10)
_ZFlat3-8 633MB/s ± 1% 633MB/s ± 0% ~ (p=0.661 n=9+10)
_ZFlat4-8 7.96GB/s ± 1% 7.95GB/s ± 1% ~ (p=0.796 n=10+10)
_ZFlat5-8 771MB/s ± 0% 771MB/s ± 0% ~ (p=0.929 n=10+10)
_ZFlat6-8 283MB/s ± 1% 283MB/s ± 0% ~ (p=0.912 n=10+10)
_ZFlat7-8 265MB/s ± 0% 265MB/s ± 0% ~ (p=0.649 n=9+9)
_ZFlat8-8 299MB/s ± 0% 299MB/s ± 0% ~ (p=0.748 n=9+9)
_ZFlat9-8 246MB/s ± 1% 246MB/s ± 1% ~ (p=0.921 n=9+10)
_ZFlat10-8 1.05GB/s ± 1% 1.05GB/s ± 1% ~ (p=0.089 n=10+10)
_ZFlat11-8 410MB/s ± 0% 411MB/s ± 0% ~ (p=0.190 n=10+10)
2016-04-29 12:00:38 +10:00
Nigel Tao
9f7b278fd7
Rearrange the emitCopy register allocation.
...
This minimizes the diff in a follow-up commit, when manually inlining.
It's not an optimization per se, but for the record:
name old speed new speed delta
WordsEncode1e1-8 711MB/s ± 1% 700MB/s ± 1% -1.64% (p=0.000 n=9+10)
WordsEncode1e2-8 407MB/s ± 1% 430MB/s ± 0% +5.57% (p=0.000 n=10+10)
WordsEncode1e3-8 441MB/s ± 1% 447MB/s ± 0% +1.52% (p=0.000 n=8+8)
WordsEncode1e4-8 311MB/s ± 1% 322MB/s ± 0% +3.69% (p=0.000 n=9+10)
WordsEncode1e5-8 267MB/s ± 0% 267MB/s ± 1% ~ (p=0.068 n=8+10)
WordsEncode1e6-8 312MB/s ± 1% 314MB/s ± 0% +0.45% (p=0.000 n=9+10)
RandomEncode-8 14.4GB/s ± 2% 14.4GB/s ± 2% ~ (p=0.739 n=10+10)
_ZFlat0-8 792MB/s ± 1% 801MB/s ± 0% +1.11% (p=0.000 n=8+9)
_ZFlat1-8 435MB/s ± 1% 437MB/s ± 0% ~ (p=0.857 n=9+10)
_ZFlat2-8 16.0GB/s ± 4% 16.3GB/s ± 1% ~ (p=0.143 n=10+10)
_ZFlat3-8 613MB/s ± 0% 634MB/s ± 0% +3.54% (p=0.000 n=8+10)
_ZFlat4-8 7.96GB/s ± 1% 7.97GB/s ± 1% ~ (p=0.829 n=8+10)
_ZFlat5-8 770MB/s ± 0% 773MB/s ± 0% +0.33% (p=0.000 n=8+9)
_ZFlat6-8 283MB/s ± 0% 283MB/s ± 0% +0.13% (p=0.043 n=8+9)
_ZFlat7-8 264MB/s ± 2% 265MB/s ± 0% +0.61% (p=0.000 n=9+9)
_ZFlat8-8 297MB/s ± 3% 299MB/s ± 0% ~ (p=0.161 n=9+9)
_ZFlat9-8 247MB/s ± 1% 247MB/s ± 0% ~ (p=0.465 n=8+9)
_ZFlat10-8 1.03GB/s ± 0% 1.05GB/s ± 1% +1.75% (p=0.000 n=9+9)
_ZFlat11-8 409MB/s ± 0% 412MB/s ± 0% +0.64% (p=0.000 n=8+8)
2016-04-29 11:22:44 +10:00
Nigel Tao
2b29335120
Run asmfmt.
2016-04-29 11:06:33 +10:00
Nigel Tao
6ffc20e64a
Add more comments for the asm workaround.
2016-04-29 10:31:32 +10:00
Nigel Tao
ec642410cd
Workaround "table-32768(SP)(R11*2)" not assembling.
...
This asm phrase works on Go 1.4 and Go tip, but not Go 1.6. I'm not sure
why, but this workaround should make the package installable while I
investigate.
Fixes #29.
2016-04-24 10:32:34 +10:00
Nigel Tao
7dddae14f7
Fix redeclaration of "end" in the asm.
...
Multiple "end" labels, in different functions, did not work with the Go
1.4 toolchain.
Fixes #30.
2016-04-24 10:07:12 +10:00
Nigel Tao
2dbf365277
Inline extendMatch for the noasm encoder.
...
This is a partial undo of 4f2f9a13 "Write the encoder's extendMatch in
asm" but we can selectively apply the undo only to the noasm case now
that encodeBlock (the function that calls extendMatch) is itself written
in asm.
With "go test -test.bench='Encode|ZFlat' -tags=noasm":
name old speed new speed delta
WordsEncode1e1-8 676MB/s ± 1% 676MB/s ± 0% ~ (p=0.841 n=5+5)
WordsEncode1e2-8 85.3MB/s ± 0% 87.5MB/s ± 1% +2.50% (p=0.008 n=5+5)
WordsEncode1e3-8 241MB/s ± 0% 258MB/s ± 0% +7.33% (p=0.008 n=5+5)
WordsEncode1e4-8 199MB/s ± 0% 245MB/s ± 0% +23.15% (p=0.008 n=5+5)
WordsEncode1e5-8 171MB/s ± 0% 186MB/s ± 0% +8.57% (p=0.008 n=5+5)
WordsEncode1e6-8 192MB/s ± 0% 211MB/s ± 0% +9.51% (p=0.008 n=5+5)
RandomEncode-8 13.1GB/s ± 2% 13.2GB/s ± 1% ~ (p=0.690 n=5+5)
_ZFlat0-8 404MB/s ± 0% 431MB/s ± 0% +6.84% (p=0.008 n=5+5)
_ZFlat1-8 260MB/s ± 0% 277MB/s ± 0% +6.46% (p=0.008 n=5+5)
_ZFlat2-8 13.8GB/s ± 1% 13.8GB/s ± 2% ~ (p=1.000 n=5+5)
_ZFlat3-8 170MB/s ± 1% 173MB/s ± 0% +1.60% (p=0.008 n=5+5)
_ZFlat4-8 2.94GB/s ± 5% 3.10GB/s ± 0% +5.35% (p=0.008 n=5+5)
_ZFlat5-8 397MB/s ± 1% 426MB/s ± 0% +7.32% (p=0.008 n=5+5)
_ZFlat6-8 175MB/s ± 2% 190MB/s ± 0% +8.61% (p=0.008 n=5+5)
_ZFlat7-8 169MB/s ± 0% 182MB/s ± 0% +7.47% (p=0.016 n=4+5)
_ZFlat8-8 184MB/s ± 3% 200MB/s ± 0% +8.65% (p=0.008 n=5+5)
_ZFlat9-8 163MB/s ± 0% 175MB/s ± 0% +7.57% (p=0.016 n=4+5)
_ZFlat10-8 481MB/s ± 0% 509MB/s ± 0% +5.80% (p=0.016 n=4+5)
_ZFlat11-8 254MB/s ± 0% 275MB/s ± 0% +8.32% (p=0.008 n=5+5)
For the record, after this commit, the comparison between the noasm
('old') and vanilla (i.e. with asm, 'new') encoder benchmarks, summing
up the last eight or so commits, is:
name old speed new speed delta
WordsEncode1e1-8 676MB/s ± 0% 677MB/s ± 1% ~ (p=0.310 n=5+5)
WordsEncode1e2-8 87.5MB/s ± 1% 428.3MB/s ± 0% +389.71% (p=0.008 n=5+5)
WordsEncode1e3-8 258MB/s ± 0% 446MB/s ± 1% +72.67% (p=0.008 n=5+5)
WordsEncode1e4-8 245MB/s ± 0% 316MB/s ± 0% +28.94% (p=0.008 n=5+5)
WordsEncode1e5-8 186MB/s ± 0% 269MB/s ± 0% +44.86% (p=0.008 n=5+5)
WordsEncode1e6-8 211MB/s ± 0% 314MB/s ± 1% +48.84% (p=0.008 n=5+5)
RandomEncode-8 13.2GB/s ± 1% 14.4GB/s ± 1% +9.33% (p=0.008 n=5+5)
_ZFlat0-8 431MB/s ± 0% 792MB/s ± 0% +83.67% (p=0.008 n=5+5)
_ZFlat1-8 277MB/s ± 0% 436MB/s ± 1% +57.46% (p=0.008 n=5+5)
_ZFlat2-8 13.8GB/s ± 2% 16.2GB/s ± 1% +17.16% (p=0.008 n=5+5)
_ZFlat3-8 173MB/s ± 0% 632MB/s ± 1% +265.85% (p=0.008 n=5+5)
_ZFlat4-8 3.10GB/s ± 0% 8.00GB/s ± 0% +157.99% (p=0.008 n=5+5)
_ZFlat5-8 426MB/s ± 0% 768MB/s ± 0% +80.06% (p=0.008 n=5+5)
_ZFlat6-8 190MB/s ± 0% 282MB/s ± 1% +48.48% (p=0.008 n=5+5)
_ZFlat7-8 182MB/s ± 0% 264MB/s ± 1% +44.97% (p=0.008 n=5+5)
_ZFlat8-8 200MB/s ± 0% 298MB/s ± 0% +49.45% (p=0.008 n=5+5)
_ZFlat9-8 175MB/s ± 0% 247MB/s ± 0% +41.02% (p=0.008 n=5+5)
_ZFlat10-8 509MB/s ± 0% 1027MB/s ± 0% +101.72% (p=0.008 n=5+5)
_ZFlat11-8 275MB/s ± 0% 411MB/s ± 0% +49.57% (p=0.008 n=5+5)
2016-04-23 15:01:47 +10:00