Commit Graph

154 Commits

Author SHA1 Message Date
Nigel Tao 43d5d4cd4e
Merge pull request #68 from MarkLodato/cli-readme
README: update instructions to install CLI
2023-12-26 09:57:46 +11:00
Mark Lodato 470b9ed42b README: update instructions to install CLI
- Add instructions for using `go install` to run the binary.
- Note that `go get` is only for use as a library.
2022-11-17 07:22:32 -05:00
Nigel Tao fa5810519d Ensure arm64 frame sizes are 8 (mod 16)
Fixes #63
2022-01-16 12:10:46 +11:00
Nigel Tao 544b4180ac Update AUTHORS and CONTRIBUTORS 2021-06-08 14:05:37 +10:00
Nigel Tao 0eaccd4763 Fix dangling golden_test filename link 2021-06-08 14:02:21 +10:00
Nigel Tao 3ff355f7bb
Merge pull request #51 from topos-ai/bytereader
Add ReadByte method, satisfies the io.ByteReader interface
2021-06-08 13:56:06 +10:00
Nigel Tao b9440b43e5
Merge pull request #40 from EdwardBetts/spelling
correct spelling mistake
2021-06-08 13:46:59 +10:00
Nigel Tao ef348818ab
Merge pull request #60 from alexlegg/master
Use a more inclusive text for golden input.
2021-06-08 13:46:43 +10:00
Nigel Tao 33fc3d5d8d
Merge pull request #61 from cuonglm/cuonglm/fix-wrong-arm64-scaled-register-format
Fix wrong arm64 scaled register format
2021-05-02 13:53:20 +10:00
Cuong Manh Le b46926bc8a Fix wrong arm64 scaled register format
Arm64 does not have a scaled register format, which causes the snappy tests to
fail on the current Go tip:

	$ go version
	go version devel go1.17-24875e3880 Tue Apr 20 15:14:05 2021 +0000 darwin/arm64
	$ go test
	# github.com/golang/snappy
	./encode_arm64.s:385: arm64 doesn't support scaled register format
	./encode_arm64.s:675: arm64 doesn't support scaled register format
	asm: assembly of ./encode_arm64.s failed
	FAIL	github.com/golang/snappy [build failed]

See https://go-review.googlesource.com/c/go/+/289589
2021-04-21 00:16:25 +07:00
Alex Legg e149cdd03f Use a more inclusive text for golden input.
Replace the first chapter of Tom Sawyer with the first 400 lines of
Isaac Newton's Opticks.

The rawsnappy version was generated by cmd/snappytool in this repo.

The extendMatch test goldens were updated as per the instructions in
golden_test.go (with an update to account for the golang version of
extendMatch being inlined.)
2021-04-12 16:34:41 +10:00
Nigel Tao 674baa8c7f
Merge pull request #56 from AWSjswinney/arm64-port-pr
bug fix to encode_arm64.s: some registers overwritten in memmove call

ARM64 memmove clobbers R16 and R17 as of
https://go-review.googlesource.com/c/go/+/243357
2020-11-04 09:46:00 +11:00
Jonathan Swinney f81760ec4c bug fix to encode_arm64.s: some registers overwritten in memmove call
In encode_arm64.s, encodeBlock, two of the registers added during the port from
amd64 were not saved or restored for the memmove call. Instead of saving them,
just recalculate their values. Additionally, I made a few small changes to
improve things since I've learned a bit more about ARMv8 assembly.
 - The CMP instruction accepts an immediate as the first argument
 - use LDP/STP instead of SIMD instructions

The change to use the load-pair and store-pair instructions instead of the SIMD
instructions results in some modest performance improvements as measured on
Neoverse N1 (Graviton 2).

name              old time/op    new time/op    delta
WordsDecode1e1-2    25.9ns ± 1%    26.1ns ± 1%  +0.66%  (p=0.005 n=10+10)
WordsDecode1e2-2     107ns ± 0%     105ns ± 0%  -1.87%  (p=0.000 n=10+10)
WordsDecode1e3-2     953ns ± 0%     901ns ± 0%  -5.50%  (p=0.000 n=10+10)
WordsDecode1e4-2    10.6µs ± 0%     9.9µs ± 2%  -6.60%  (p=0.000 n=7+10)
WordsDecode1e5-2     170µs ± 1%     164µs ± 1%  -3.12%  (p=0.000 n=10+9)
WordsDecode1e6-2    1.71ms ± 0%    1.66ms ± 0%  -2.98%  (p=0.000 n=10+10)
WordsEncode1e1-2    22.0ns ± 1%    21.9ns ± 1%  -0.67%  (p=0.006 n=8+10)
WordsEncode1e2-2     248ns ± 0%     245ns ± 0%  -1.21%  (p=0.002 n=8+10)
WordsEncode1e3-2    2.50µs ± 0%    2.49µs ± 0%    ~     (p=0.103 n=10+9)
WordsEncode1e4-2    27.8µs ± 3%    28.0µs ± 2%    ~     (p=0.075 n=10+10)
WordsEncode1e5-2     339µs ± 0%     343µs ± 0%  +1.18%  (p=0.000 n=9+10)
WordsEncode1e6-2    3.39ms ± 0%    3.42ms ± 0%  +0.94%  (p=0.000 n=10+10)
RandomEncode-2      74.8µs ± 1%    77.1µs ± 1%  +3.16%  (p=0.000 n=10+10)
_UFlat0-2           68.8µs ± 1%    66.4µs ± 2%  -3.54%  (p=0.000 n=10+10)
_UFlat1-2            770µs ± 0%     740µs ± 1%  -3.93%  (p=0.000 n=10+10)
_UFlat2-2           6.57µs ± 0%    6.55µs ± 0%  -0.25%  (p=0.000 n=8+10)
_UFlat3-2            183ns ± 0%     178ns ± 1%  -2.84%  (p=0.000 n=9+10)
_UFlat4-2           9.76µs ± 1%    9.56µs ± 0%  -2.07%  (p=0.000 n=10+9)
_UFlat5-2            301µs ± 0%     293µs ± 0%  -2.67%  (p=0.000 n=9+10)
_UFlat6-2            280µs ± 1%     267µs ± 1%  -4.63%  (p=0.000 n=10+10)
_UFlat7-2            241µs ± 0%     230µs ± 1%  -4.68%  (p=0.000 n=9+10)
_UFlat8-2            745µs ± 0%     715µs ± 1%  -4.11%  (p=0.000 n=10+10)
_UFlat9-2           1.01ms ± 0%    0.96ms ± 0%  -4.60%  (p=0.000 n=10+10)
_UFlat10-2          62.3µs ± 1%    59.3µs ± 1%  -4.72%  (p=0.000 n=10+9)
_UFlat11-2           258µs ± 0%     252µs ± 1%  -2.56%  (p=0.000 n=10+10)
_ZFlat0-2            135µs ± 1%     132µs ± 1%  -1.88%  (p=0.000 n=10+8)
_ZFlat1-2           1.76ms ± 0%    1.74ms ± 0%  -1.00%  (p=0.000 n=9+9)
_ZFlat2-2           9.54µs ± 0%    9.84µs ± 5%  +3.18%  (p=0.000 n=10+10)
_ZFlat3-2            449ns ± 0%     447ns ± 0%  -0.38%  (p=0.000 n=10+9)
_ZFlat4-2           15.6µs ± 0%    16.0µs ± 4%    ~     (p=0.118 n=9+10)
_ZFlat5-2            560µs ± 1%     555µs ± 1%  -0.89%  (p=0.000 n=9+9)
_ZFlat6-2            531µs ± 0%     534µs ± 0%  +0.64%  (p=0.000 n=10+10)
_ZFlat7-2            466µs ± 0%     468µs ± 0%  +0.32%  (p=0.003 n=10+10)
_ZFlat8-2           1.42ms ± 0%    1.42ms ± 0%  +0.43%  (p=0.000 n=10+10)
_ZFlat9-2           1.93ms ± 0%    1.94ms ± 0%  +0.44%  (p=0.000 n=10+10)
_ZFlat10-2           120µs ± 0%     121µs ± 3%    ~     (p=0.436 n=9+9)
_ZFlat11-2           433µs ± 0%     437µs ± 0%  +1.03%  (p=0.000 n=10+10)
ExtendMatch-2       9.77µs ± 0%    9.76µs ± 0%  -0.13%  (p=0.050 n=10+10)

As measured on Cortex-A53 (Raspberry Pi 3)

name              old time/op    new time/op    delta
WordsDecode1e1-4     152ns ± 2%     151ns ± 0%    ~     (p=0.536 n=10+8)
WordsDecode1e2-4     639ns ± 0%     617ns ± 0%  -3.54%  (p=0.000 n=9+8)
WordsDecode1e3-4    6.74µs ± 2%    6.35µs ± 0%  -5.75%  (p=0.000 n=10+9)
WordsDecode1e4-4    66.7µs ± 0%    63.5µs ± 0%  -4.69%  (p=0.000 n=9+9)
WordsDecode1e5-4     715µs ± 0%     684µs ± 0%  -4.38%  (p=0.000 n=8+8)
WordsDecode1e6-4    6.87ms ± 2%    6.53ms ± 1%  -4.99%  (p=0.000 n=10+9)
WordsEncode1e1-4     127ns ± 2%     126ns ± 0%    ~     (p=0.065 n=10+9)
WordsEncode1e2-4    1.58µs ± 0%    1.57µs ± 0%  -0.99%  (p=0.000 n=8+8)
WordsEncode1e3-4    15.1µs ± 0%    14.9µs ± 0%  -1.46%  (p=0.000 n=9+8)
WordsEncode1e4-4     148µs ± 0%     148µs ± 4%    ~     (p=0.497 n=9+10)
WordsEncode1e5-4    1.54ms ± 0%    1.54ms ± 0%  +0.12%  (p=0.012 n=10+8)
WordsEncode1e6-4    14.4ms ± 0%    14.4ms ± 1%  -0.47%  (p=0.015 n=9+8)
RandomEncode-4      1.13ms ± 1%    1.13ms ± 1%    ~     (p=0.529 n=10+10)
_UFlat0-4            294µs ± 0%     288µs ± 1%  -2.08%  (p=0.000 n=9+9)
_UFlat1-4           3.05ms ± 1%    2.98ms ± 1%  -2.22%  (p=0.000 n=9+9)
_UFlat2-4           37.3µs ± 0%    37.4µs ± 1%    ~     (p=0.093 n=8+9)
_UFlat3-4            909ns ± 0%     914ns ± 2%    ~     (p=0.526 n=8+10)
_UFlat4-4           58.7µs ± 0%    58.1µs ± 0%  -1.09%  (p=0.000 n=8+10)
_UFlat5-4           1.22ms ± 0%    1.19ms ± 1%  -2.14%  (p=0.000 n=8+8)
_UFlat6-4           1.03ms ± 0%    0.99ms ± 0%  -3.28%  (p=0.000 n=9+8)
_UFlat7-4            895µs ± 0%     861µs ± 0%  -3.79%  (p=0.000 n=8+8)
_UFlat8-4           2.83ms ± 0%    2.75ms ± 0%  -2.88%  (p=0.000 n=7+8)
_UFlat9-4           3.85ms ± 1%    3.73ms ± 1%  -3.03%  (p=0.000 n=8+9)
_UFlat10-4           286µs ± 0%     282µs ± 0%  -1.59%  (p=0.000 n=9+9)
_UFlat11-4          1.06ms ± 0%    1.02ms ± 0%  -3.58%  (p=0.000 n=8+9)
_ZFlat0-4            620µs ± 0%     620µs ± 1%    ~     (p=0.963 n=9+8)
_ZFlat1-4           9.49ms ± 1%    9.67ms ± 3%  +1.87%  (p=0.000 n=9+10)
_ZFlat2-4           61.8µs ± 0%    62.3µs ± 3%    ~     (p=0.829 n=8+10)
_ZFlat3-4           2.80µs ± 1%    2.79µs ± 0%  -0.55%  (p=0.000 n=8+8)
_ZFlat4-4            108µs ± 0%     109µs ± 0%  +0.55%  (p=0.000 n=10+8)
_ZFlat5-4           2.59ms ± 2%    2.58ms ± 1%    ~     (p=0.274 n=10+8)
_ZFlat6-4           2.39ms ± 3%    2.40ms ± 1%    ~     (p=0.631 n=10+10)
_ZFlat7-4           2.11ms ± 0%    2.08ms ± 1%  -1.23%  (p=0.000 n=10+9)
_ZFlat8-4           6.86ms ± 0%    6.92ms ± 1%  +0.78%  (p=0.000 n=9+8)
_ZFlat9-4           9.42ms ± 0%    9.40ms ± 1%    ~     (p=0.606 n=8+9)
_ZFlat10-4           620µs ± 1%     621µs ± 4%    ~     (p=0.173 n=8+10)
_ZFlat11-4          1.94ms ± 0%    1.93ms ± 0%  -0.52%  (p=0.001 n=9+8)
ExtendMatch-4       69.3µs ± 2%    69.2µs ± 0%    ~     (p=0.515 n=10+8)
2020-10-02 15:49:34 +00:00
Nigel Tao 196ae77b8a A+C: add Jonathan Swinney <jswinney@amazon.com>. 2020-07-07 23:17:29 +10:00
Nigel Tao 1801c13ca2
Merge pull request #53 from AWSjswinney/arm64-port-pr
port amd64 assembly to arm64
2020-07-07 23:11:05 +10:00
Jonathan Swinney ea060ccb72 port amd64 assembly to arm64
This change was produced by taking the amd64 assembly and reproducing it
as closely as possible for the arm64 arch.

The main differences:
 - arm64 uses registers R1-R17 which are mapped directly onto an amd64
   counterpart
 - arm64 requires 8 additional bytes of stack so callee args are displaced
   by 8 bytes from amd64
 - operands to CMP instructions are reversed except in a few cases where
   arm64 uses a BLS (branch less-same) instead of JAE (jump above-equal)
 - immediates in some cases have to be split to a separate MOVD instruction
 - shifts can be combined with another instruction, such as an ADD, in some
   cases
 - The amd64 BSFQ instruction is implemented with a bit reversal and
   leading zero count instruction
 - memclear on arm64 makes use of the SIMD instructions to clear 64 bytes
   at a time and uses a pointer comparison instead of a counter to reduce
   the number of instructions in the loop
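
As an illustration of the BSFQ point above (not code from this change): a bit-scan-forward finds the index of the lowest set bit, and reversing the bit order and then counting leading zeros gives the same index. The sketch below shows that equivalence in Go with `math/bits`. The function name is hypothetical, and the zero-input behaviour noted in the comment belongs to this sketch rather than to the assembly.

```go
package main

import (
	"fmt"
	"math/bits"
)

// trailingZerosViaReverse counts trailing zero bits by reversing the bit
// order and then counting leading zeros: the two-step idea described in the
// list above. The name is hypothetical. For x == 0 this sketch returns 64.
func trailingZerosViaReverse(x uint64) int {
	return bits.LeadingZeros64(bits.Reverse64(x))
}

func main() {
	for _, x := range []uint64{1, 0x10, 0xff00, 1 << 63} {
		fmt.Printf("%#x: %d (bits.TrailingZeros64 gives %d)\n",
			x, trailingZerosViaReverse(x), bits.TrailingZeros64(x))
	}
}
```

Both columns printed by `main` should agree for every non-zero input, which is the property the port relies on.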

Tested on an AWS m6g.large (ARMv8.2):
name              old time/op    new time/op     delta
WordsDecode1e1-2    29.2ns ± 0%     26.2ns ± 1%   -10.51%  (p=0.000 n=9+10)
WordsDecode1e2-2     187ns ± 0%      107ns ± 0%   -42.78%  (p=0.000 n=7+10)
WordsDecode1e3-2    2.16µs ± 1%     0.95µs ± 0%   -55.85%  (p=0.000 n=10+10)
WordsDecode1e4-2    30.1µs ± 0%     10.4µs ± 2%   -65.40%  (p=0.000 n=10+10)
WordsDecode1e5-2     348µs ± 0%      168µs ± 0%   -51.86%  (p=0.000 n=10+9)
WordsDecode1e6-2    3.47ms ± 0%     1.71ms ± 0%   -50.66%  (p=0.000 n=10+10)
WordsEncode1e1-2    19.4ns ± 0%     21.7ns ± 1%   +12.06%  (p=0.000 n=8+10)
WordsEncode1e2-2    2.09µs ± 0%     0.25µs ± 0%   -88.14%  (p=0.000 n=9+10)
WordsEncode1e3-2    6.67µs ± 1%     2.49µs ± 0%   -62.63%  (p=0.000 n=10+10)
WordsEncode1e4-2    63.5µs ± 1%     29.4µs ± 1%   -53.63%  (p=0.000 n=10+9)
WordsEncode1e5-2     722µs ± 0%      345µs ± 0%   -52.21%  (p=0.000 n=10+10)
WordsEncode1e6-2    7.17ms ± 0%     3.41ms ± 0%   -52.46%  (p=0.000 n=10+8)
RandomEncode-2       106µs ± 2%       78µs ± 0%   -26.02%  (p=0.000 n=10+10)
_UFlat0-2            152µs ± 0%       69µs ± 1%   -54.90%  (p=0.000 n=10+9)
_UFlat1-2           1.57ms ± 0%     0.77ms ± 0%   -51.10%  (p=0.000 n=9+10)
_UFlat2-2           6.84µs ± 0%     6.55µs ± 0%    -4.25%  (p=0.000 n=10+8)
_UFlat3-2            312ns ± 0%      183ns ± 0%   -41.35%  (p=0.000 n=10+9)
_UFlat4-2           15.4µs ± 1%      9.7µs ± 1%   -36.79%  (p=0.000 n=10+10)
_UFlat5-2            625µs ± 0%      301µs ± 1%   -51.88%  (p=0.000 n=9+10)
_UFlat6-2            570µs ± 0%      278µs ± 0%   -51.18%  (p=0.000 n=10+9)
_UFlat7-2            490µs ± 0%      240µs ± 1%   -50.95%  (p=0.000 n=10+10)
_UFlat8-2           1.52ms ± 0%     0.74ms ± 0%   -51.01%  (p=0.000 n=8+7)
_UFlat9-2           2.00ms ± 0%     1.01ms ± 0%   -49.49%  (p=0.000 n=10+10)
_UFlat10-2           132µs ± 0%       62µs ± 2%   -53.19%  (p=0.000 n=10+10)
_UFlat11-2           497µs ± 0%      258µs ± 0%   -48.11%  (p=0.000 n=10+9)
_ZFlat0-2            346µs ± 1%      136µs ± 5%   -60.70%  (p=0.000 n=10+9)
_ZFlat1-2           3.63ms ± 0%     1.76ms ± 0%   -51.60%  (p=0.000 n=10+8)
_ZFlat2-2           13.2µs ± 0%      9.5µs ± 0%   -27.62%  (p=0.000 n=8+9)
_ZFlat3-2           2.49µs ± 0%     0.45µs ± 0%   -81.96%  (p=0.002 n=8+10)
_ZFlat4-2           50.5µs ± 0%     15.7µs ± 1%   -68.96%  (p=0.000 n=10+9)
_ZFlat5-2           1.40ms ± 0%     0.56ms ± 0%   -60.20%  (p=0.000 n=9+9)
_ZFlat6-2           1.13ms ± 0%     0.54ms ± 0%   -52.39%  (p=0.000 n=10+9)
_ZFlat7-2            961µs ± 0%      472µs ± 0%   -50.83%  (p=0.000 n=10+10)
_ZFlat8-2           3.03ms ± 0%     1.43ms ± 0%   -52.90%  (p=0.000 n=9+10)
_ZFlat9-2           3.88ms ± 0%     1.95ms ± 0%   -49.72%  (p=0.000 n=10+10)
_ZFlat10-2           339µs ± 0%      123µs ± 3%   -63.82%  (p=0.000 n=10+10)
_ZFlat11-2           973µs ± 0%      433µs ± 0%   -55.49%  (p=0.000 n=10+10)
ExtendMatch-2       22.1µs ± 1%      9.8µs ± 0%   -55.63%  (p=0.000 n=10+10)

name              old speed      new speed       delta
WordsDecode1e1-2   342MB/s ± 0%    382MB/s ± 1%   +11.77%  (p=0.000 n=9+10)
WordsDecode1e2-2   535MB/s ± 0%    934MB/s ± 0%   +74.43%  (p=0.000 n=10+10)
WordsDecode1e3-2   463MB/s ± 1%   1049MB/s ± 0%  +126.52%  (p=0.000 n=10+10)
WordsDecode1e4-2   333MB/s ± 0%    961MB/s ± 2%  +189.04%  (p=0.000 n=10+10)
WordsDecode1e5-2   287MB/s ± 0%    597MB/s ± 0%  +107.72%  (p=0.000 n=10+9)
WordsDecode1e6-2   288MB/s ± 0%    584MB/s ± 0%  +102.67%  (p=0.000 n=10+10)
WordsEncode1e1-2   515MB/s ± 0%    460MB/s ± 0%   -10.70%  (p=0.000 n=10+10)
WordsEncode1e2-2  47.8MB/s ± 0%  403.3MB/s ± 0%  +743.40%  (p=0.000 n=10+10)
WordsEncode1e3-2   150MB/s ± 1%    401MB/s ± 0%  +167.66%  (p=0.000 n=10+9)
WordsEncode1e4-2   157MB/s ± 1%    340MB/s ± 1%  +115.66%  (p=0.000 n=10+9)
WordsEncode1e5-2   138MB/s ± 0%    290MB/s ± 0%  +109.24%  (p=0.000 n=10+10)
WordsEncode1e6-2   139MB/s ± 0%    293MB/s ± 0%  +110.35%  (p=0.000 n=10+8)
RandomEncode-2    9.93GB/s ± 2%  13.42GB/s ± 0%   +35.15%  (p=0.000 n=10+10)
_UFlat0-2          672MB/s ± 0%   1489MB/s ± 1%  +121.75%  (p=0.000 n=10+9)
_UFlat1-2          446MB/s ± 0%    913MB/s ± 0%  +104.48%  (p=0.000 n=9+10)
_UFlat2-2         18.0GB/s ± 0%   18.8GB/s ± 0%    +4.44%  (p=0.000 n=8+8)
_UFlat3-2          641MB/s ± 0%   1091MB/s ± 0%   +70.19%  (p=0.000 n=10+10)
_UFlat4-2         6.66GB/s ± 1%  10.53GB/s ± 1%   +58.19%  (p=0.000 n=10+10)
_UFlat5-2          655MB/s ± 0%   1362MB/s ± 1%  +107.80%  (p=0.000 n=9+10)
_UFlat6-2          267MB/s ± 0%    547MB/s ± 0%  +104.82%  (p=0.000 n=10+9)
_UFlat7-2          255MB/s ± 0%    521MB/s ± 1%  +103.89%  (p=0.000 n=10+10)
_UFlat8-2          281MB/s ± 0%    574MB/s ± 0%  +104.14%  (p=0.000 n=8+7)
_UFlat9-2          241MB/s ± 0%    478MB/s ± 0%   +97.97%  (p=0.000 n=10+10)
_UFlat10-2         896MB/s ± 0%   1914MB/s ± 2%  +113.64%  (p=0.000 n=10+10)
_UFlat11-2         371MB/s ± 0%    715MB/s ± 0%   +92.72%  (p=0.000 n=10+9)
_ZFlat0-2          296MB/s ± 1%    754MB/s ± 5%  +154.57%  (p=0.000 n=10+9)
_ZFlat1-2          194MB/s ± 0%    400MB/s ± 0%  +106.63%  (p=0.000 n=10+8)
_ZFlat2-2         9.35GB/s ± 0%  12.92GB/s ± 0%   +38.17%  (p=0.000 n=8+10)
_ZFlat3-2         80.3MB/s ± 0%  445.6MB/s ± 0%  +454.64%  (p=0.000 n=10+10)
_ZFlat4-2         2.03GB/s ± 0%   6.54GB/s ± 1%  +222.19%  (p=0.000 n=10+9)
_ZFlat5-2          292MB/s ± 0%    733MB/s ± 0%  +151.25%  (p=0.000 n=9+9)
_ZFlat6-2          135MB/s ± 0%    284MB/s ± 0%  +110.05%  (p=0.000 n=10+9)
_ZFlat7-2          130MB/s ± 0%    265MB/s ± 0%  +103.38%  (p=0.000 n=10+10)
_ZFlat8-2          141MB/s ± 0%    299MB/s ± 0%  +112.30%  (p=0.000 n=9+10)
_ZFlat9-2          124MB/s ± 0%    247MB/s ± 0%   +98.90%  (p=0.000 n=10+10)
_ZFlat10-2         350MB/s ± 0%    967MB/s ± 3%  +176.44%  (p=0.000 n=10+10)
_ZFlat11-2         189MB/s ± 0%    426MB/s ± 0%  +124.65%  (p=0.000 n=10+10)
2020-07-01 03:01:52 +00:00
Eric Buth 0a27eb7fa2 Add ReadByte method, satisfies the io.ByteReader interface 2020-02-17 13:39:43 -05:00
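For context on the commit above: a type satisfies `io.ByteReader` as soon as it has a `ReadByte() (byte, error)` method, which lets it be passed to helpers such as `binary.ReadUvarint`. The following is a minimal standard-library sketch of that idea. The `byteReader` wrapper and `firstUvarint` helper are hypothetical and are not the snappy implementation.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"strings"
)

// byteReader is a hypothetical wrapper that adds ReadByte to any io.Reader.
type byteReader struct {
	r   io.Reader
	buf [1]byte
}

// Read forwards to the wrapped reader.
func (b *byteReader) Read(p []byte) (int, error) { return b.r.Read(p) }

// ReadByte reads exactly one byte, so *byteReader satisfies io.ByteReader.
func (b *byteReader) ReadByte() (byte, error) {
	if _, err := io.ReadFull(b.r, b.buf[:]); err != nil {
		return 0, err
	}
	return b.buf[0], nil
}

// Compile-time check that the interface is satisfied.
var _ io.ByteReader = (*byteReader)(nil)

// firstUvarint decodes a uvarint from the start of s through the wrapper.
// binary.ReadUvarint only accepts an io.ByteReader.
func firstUvarint(s string) (uint64, error) {
	return binary.ReadUvarint(&byteReader{r: strings.NewReader(s)})
}

func main() {
	// 0xAC 0x02 is the uvarint encoding of 300.
	v, err := firstUvarint("\xac\x02rest")
	fmt.Println(v, err)
}
```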
Nigel Tao ff6b7dc882 Add comments re handling block and stream formats 2019-09-04 16:35:34 +10:00
Nigel Tao 059a9b1922 A+C: add Klaus Post <klauspost@gmail.com>. 2019-09-04 16:29:47 +10:00
Nigel Tao c9879f99e6
Merge pull request #48 from klauspost/use-copy-for-non-overlapping
Use faster copy when not overlapping
2019-09-04 16:27:17 +10:00
Nigel Tao 5610373d2f
Merge pull request #49 from klauspost/faster-overlapping-copies
Faster overlapping copies
2019-09-04 16:26:23 +10:00
Klaus Post f6ad6c8bb8 Faster overlapping copies
Eliminates bounds check on every byte copied.
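
A minimal sketch of how such a copy can be written, assuming the usual Go idiom of re-slicing source and destination to the same length before the loop so that the compiler can typically prove every index is in range. The function name and signature are hypothetical, and this is not claimed to be the exact code in the commit.

```go
package main

import "fmt"

// forwardCopy copies length bytes within dst, from position d-offset to
// position d, always running forwards so that an overlapping source repeats
// the pattern that was just written. Name and signature are hypothetical.
// The caller must guarantee 0 < offset <= d and d+length <= len(dst).
func forwardCopy(dst []byte, d, offset, length int) {
	a := dst[d : d+length]
	b := dst[d-offset:]
	b = b[:len(a)] // a and b now have equal length; bounds are checked once, here
	for i := range a {
		a[i] = b[i]
	}
}

func main() {
	buf := make([]byte, 8)
	copy(buf, "ab")
	forwardCopy(buf, 2, 2, 6) // offset 2 < length 6, so the regions overlap
	fmt.Printf("%s\n", buf)   // the two-byte pattern "ab" is repeated
}
```

The slicing expressions still perform their own range checks, but they run once per copy instead of once per byte.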

Benchmark measured on AMD64 but with `-tags=noasm`:

```
>benchstat old.txt new.txt
name        old time/op    new time/op    delta
_UFlat0-8      194µs ± 3%     150µs ± 2%  -22.59%  (p=0.000 n=10+10)
_UFlat1-8     1.62ms ± 1%    1.41ms ± 2%  -12.70%   (p=0.000 n=9+10)
_UFlat2-8     8.91µs ± 4%    8.76µs ± 2%     ~     (p=0.343 n=10+10)
_UFlat3-8      222ns ± 2%     224ns ± 1%   +1.00%   (p=0.028 n=10+9)
_UFlat4-8     28.4µs ± 2%    20.3µs ± 3%  -28.45%  (p=0.000 n=10+10)
_UFlat5-8      797µs ± 5%     603µs ± 2%  -24.34%   (p=0.000 n=10+9)
_UFlat6-8      565µs ± 1%     531µs ± 2%   -6.16%    (p=0.000 n=8+9)
_UFlat7-8      494µs ± 4%     457µs ± 2%   -7.61%  (p=0.000 n=10+10)
_UFlat8-8     1.55ms ± 4%    1.40ms ± 2%   -9.48%   (p=0.000 n=10+9)
_UFlat9-8     1.93ms ± 1%    1.83ms ± 2%   -5.44%   (p=0.000 n=10+9)
_UFlat10-8     186µs ± 2%     138µs ± 5%  -26.04%  (p=0.000 n=10+10)
_UFlat11-8     524µs ± 2%     478µs ± 3%   -8.68%  (p=0.000 n=10+10)

name        old speed      new speed      delta
_UFlat0-8    528MB/s ± 3%   682MB/s ± 2%  +29.18%  (p=0.000 n=10+10)
_UFlat1-8    434MB/s ± 1%   497MB/s ± 2%  +14.56%   (p=0.000 n=9+10)
_UFlat2-8   13.8GB/s ± 4%  14.1GB/s ± 2%     ~     (p=0.353 n=10+10)
_UFlat3-8    901MB/s ± 1%   890MB/s ± 1%   -1.18%    (p=0.008 n=9+9)
_UFlat4-8   3.60GB/s ± 2%  5.03GB/s ± 3%  +39.76%  (p=0.000 n=10+10)
_UFlat5-8    514MB/s ± 5%   679MB/s ± 2%  +32.04%   (p=0.000 n=10+9)
_UFlat6-8    269MB/s ± 1%   287MB/s ± 2%   +6.57%    (p=0.000 n=8+9)
_UFlat7-8    253MB/s ± 4%   274MB/s ± 2%   +8.23%  (p=0.000 n=10+10)
_UFlat8-8    276MB/s ± 4%   305MB/s ± 2%  +10.43%   (p=0.000 n=10+9)
_UFlat9-8    249MB/s ± 1%   263MB/s ± 2%   +5.76%   (p=0.000 n=10+9)
_UFlat10-8   637MB/s ± 2%   862MB/s ± 5%  +35.25%  (p=0.000 n=10+10)
_UFlat11-8   352MB/s ± 2%   385MB/s ± 3%   +9.51%  (p=0.000 n=10+10)
```
2019-09-01 19:55:24 +02:00
Klaus Post efb0d863a3 Use faster copy when not overlapping
Use the built-in copy function when the source doesn't overlap the destination.

Again benchmarks are a bit polarized based on how often this is the case, but should be a solid improvement for all non-amd64 users.
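
The condition can be sketched as follows, under the assumption that a back-reference copies `length` bytes from `offset` bytes behind the write position `d`: when `offset >= length` the source ends at or before `d`, so the regions cannot overlap and the built-in `copy` is safe. The function name and its boolean return value are hypothetical and exist only to make the two paths observable.

```go
package main

import "fmt"

// copyWithin copies length bytes from dst[d-offset:] to dst[d:].
// It reports whether the built-in copy was used. Name, signature and the
// return value are hypothetical, for illustration only.
// The caller must guarantee 0 < offset <= d and d+length <= len(dst).
func copyWithin(dst []byte, d, offset, length int) (usedBuiltin bool) {
	if offset >= length {
		// Source is dst[d-offset : d-offset+length], which ends at or
		// before d: no overlap with the destination.
		copy(dst[d:d+length], dst[d-offset:])
		return true
	}
	// Overlapping case: copy forwards byte by byte so the pattern repeats.
	for end := d + length; d != end; d++ {
		dst[d] = dst[d-offset]
	}
	return false
}

func main() {
	buf := make([]byte, 10)
	copy(buf, "abcd")
	fmt.Println(copyWithin(buf, 4, 4, 3)) // non-overlapping: built-in copy
	fmt.Println(copyWithin(buf, 7, 1, 3)) // overlapping: byte-by-byte
	fmt.Printf("%s\n", buf)
}
```

The fallback loop is kept because the built-in `copy` behaves like a memmove on overlapping slices and would not reproduce the repeating pattern that an overlapping back-reference is meant to expand to.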

Benchmark measured on AMD64 but with `-tags=noasm`:

```
>benchstat old.txt new.txt
name        old time/op    new time/op    delta
_UFlat0-8      194µs ± 3%     130µs ± 2%   -33.14%  (p=0.000 n=10+10)
_UFlat1-8     1.62ms ± 1%    1.42ms ± 1%   -11.98%    (p=0.000 n=9+9)
_UFlat2-8     8.91µs ± 4%    8.73µs ± 1%      ~      (p=0.182 n=10+9)
_UFlat3-8      222ns ± 2%     219ns ± 6%    -1.36%   (p=0.022 n=10+9)
_UFlat4-8     28.4µs ± 2%    11.5µs ± 1%   -59.57%  (p=0.000 n=10+10)
_UFlat5-8      797µs ± 5%     536µs ± 1%   -32.77%  (p=0.000 n=10+10)
_UFlat6-8      565µs ± 1%     571µs ± 1%    +1.04%   (p=0.007 n=8+10)
_UFlat7-8      494µs ± 4%     496µs ± 3%      ~     (p=0.986 n=10+10)
_UFlat8-8     1.55ms ± 4%    1.53ms ± 3%      ~     (p=0.280 n=10+10)
_UFlat9-8     1.93ms ± 1%    1.98ms ± 3%    +2.57%  (p=0.000 n=10+10)
_UFlat10-8     186µs ± 2%     102µs ± 2%   -45.14%  (p=0.000 n=10+10)
_UFlat11-8     524µs ± 2%     510µs ± 1%    -2.56%   (p=0.000 n=10+8)

name        old speed      new speed      delta
_UFlat0-8    528MB/s ± 3%   790MB/s ± 1%   +49.54%  (p=0.000 n=10+10)
_UFlat1-8    434MB/s ± 1%   493MB/s ± 1%   +13.61%    (p=0.000 n=9+9)
_UFlat2-8   13.8GB/s ± 4%  14.1GB/s ± 2%      ~      (p=0.182 n=10+9)
_UFlat3-8    901MB/s ± 1%   912MB/s ± 6%    +1.18%    (p=0.026 n=9+9)
_UFlat4-8   3.60GB/s ± 2%  8.91GB/s ± 1%  +147.32%  (p=0.000 n=10+10)
_UFlat5-8    514MB/s ± 5%   764MB/s ± 2%   +48.59%  (p=0.000 n=10+10)
_UFlat6-8    269MB/s ± 1%   266MB/s ± 1%    -1.03%   (p=0.009 n=8+10)
_UFlat7-8    253MB/s ± 4%   252MB/s ± 3%      ~     (p=0.985 n=10+10)
_UFlat8-8    276MB/s ± 4%   279MB/s ± 3%      ~     (p=0.288 n=10+10)
_UFlat9-8    249MB/s ± 1%   243MB/s ± 3%    -2.51%  (p=0.000 n=10+10)
_UFlat10-8   637MB/s ± 2%  1162MB/s ± 2%   +82.29%  (p=0.000 n=10+10)
_UFlat11-8   352MB/s ± 2%   361MB/s ± 1%    +2.62%   (p=0.000 n=10+8)
```

Co-Authored-By: Nigel Tao <nigeltao@golang.org>
2019-09-01 19:53:02 +02:00
Nigel Tao 2a8bb927dd
Merge pull request #46 from creachadair/gomod
Add a go.mod file for basic Go modules support.
2019-02-19 10:22:22 +11:00
M. J. Fromberger f05e7a5086 Add a go.mod file for basic Go modules support. 2019-02-11 13:35:28 -08:00
Nigel Tao 2e65f85255 Fix snappytool to use block, not stream, format
The key difference is replacing snappy.NewWriter and snappy.NewReader
with snappy.Encode and snappy.Decode.

This change restores the behavior of the previous (written in C)
snappytool program.
2018-05-18 15:45:09 +10:00
Nigel Tao e45cd318e0
Merge pull request #38 from mattn/cmd-snappytool
rewrite snappytool in go
2018-05-18 15:18:59 +10:00
Edward Betts da2bb3382a correct spelling mistake 2017-09-01 12:38:27 +01:00
Yasuhiro Matsumoto 35a8406c21 rewrite snappytool in go 2017-03-28 21:05:51 +09:00
Nigel Tao 553a641470 Merge pull request #37 from fatedier/master
fix typo
2017-02-16 10:32:05 +11:00
fatedier 0d9c4c05f1 fix typo 2017-01-25 15:07:54 +08:00
Nigel Tao 7db9049039 Merge pull request #36 from sguiheux/gofmt
Run gofmt.
2017-01-19 12:47:23 +11:00
Steven Guiheux 5a0054d7b7 fix: gofmt 2017-01-18 11:51:53 +01:00
Nigel Tao d9eb7a3d35 Support the COPY_4 tag.
It is a valid encoding, even if no longer issued by most encoders.
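
For reference, a sketch of what decoding such an element involves, based on the Snappy format description: the low two bits of the tag byte are `11`, the upper six bits hold length minus one, and the offset follows as a 32-bit little-endian integer. The function and constant names below are hypothetical and this is not the decoder code from this commit.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// tagCopy4 is the value of the low two bits of a COPY_4 tag byte.
const tagCopy4 = 0x03

// decodeCopy4 parses one COPY_4 element at the start of src and returns the
// copy length, the back-reference offset and the number of bytes consumed.
// The name and signature are hypothetical.
func decodeCopy4(src []byte) (length, offset, n int, err error) {
	if len(src) < 5 {
		return 0, 0, 0, errors.New("short COPY_4 element")
	}
	if src[0]&0x03 != tagCopy4 {
		return 0, 0, 0, errors.New("not a COPY_4 tag")
	}
	length = 1 + int(src[0]>>2)                        // upper six bits store length-1
	offset = int(binary.LittleEndian.Uint32(src[1:5])) // 4-byte little-endian offset
	return length, offset, 5, nil
}

func main() {
	// Tag 0x0f = 0b000011_11: length-1 is 3, tag type is COPY_4; offset is 16.
	length, offset, n, err := decodeCopy4([]byte{0x0f, 0x10, 0x00, 0x00, 0x00})
	fmt.Println(length, offset, n, err)
}
```

A real decoder would also validate that the offset is non-zero and does not reach before the start of the output, which this sketch leaves out.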

name              old speed      new speed      delta
WordsDecode1e1-8   525MB/s ± 0%   504MB/s ± 1%  -4.04%   (p=0.000 n=9+10)
WordsDecode1e2-8  1.23GB/s ± 0%  1.23GB/s ± 1%    ~      (p=0.678 n=10+9)
WordsDecode1e3-8  1.54GB/s ± 0%  1.53GB/s ± 1%  -0.75%   (p=0.000 n=10+9)
WordsDecode1e4-8  1.53GB/s ± 0%  1.51GB/s ± 3%  -1.46%   (p=0.000 n=9+10)
WordsDecode1e5-8   793MB/s ± 0%   777MB/s ± 2%  -2.01%   (p=0.017 n=9+10)
WordsDecode1e6-8   917MB/s ± 1%   917MB/s ± 1%    ~      (p=0.473 n=8+10)
WordsEncode1e1-8   641MB/s ± 2%   641MB/s ± 0%    ~      (p=0.780 n=10+9)
WordsEncode1e2-8   583MB/s ± 0%   580MB/s ± 0%  -0.41%   (p=0.001 n=10+9)
WordsEncode1e3-8   647MB/s ± 1%   648MB/s ± 0%    ~      (p=0.326 n=10+9)
WordsEncode1e4-8   442MB/s ± 1%   452MB/s ± 0%  +2.20%   (p=0.000 n=10+8)
WordsEncode1e5-8   355MB/s ± 1%   355MB/s ± 0%    ~      (p=0.880 n=10+8)
WordsEncode1e6-8   433MB/s ± 0%   434MB/s ± 0%    ~       (p=0.700 n=8+8)
RandomEncode-8    14.2GB/s ± 3%  14.2GB/s ± 3%    ~      (p=0.780 n=10+9)
_UFlat0-8         2.18GB/s ± 1%  2.19GB/s ± 0%    ~      (p=0.447 n=10+9)
_UFlat1-8         1.40GB/s ± 2%  1.41GB/s ± 0%  +0.73%   (p=0.043 n=9+10)
_UFlat2-8         23.4GB/s ± 3%  23.5GB/s ± 2%    ~      (p=0.497 n=9+10)
_UFlat3-8         1.90GB/s ± 0%  1.91GB/s ± 0%  +0.30%    (p=0.002 n=8+9)
_UFlat4-8         13.9GB/s ± 2%  14.0GB/s ± 1%    ~      (p=0.720 n=9+10)
_UFlat5-8         1.96GB/s ± 1%  1.97GB/s ± 0%  +0.81%   (p=0.000 n=10+9)
_UFlat6-8          813MB/s ± 0%   814MB/s ± 0%  +0.17%   (p=0.037 n=8+10)
_UFlat7-8          783MB/s ± 2%   785MB/s ± 0%    ~       (p=0.340 n=9+9)
_UFlat8-8          859MB/s ± 0%   857MB/s ± 0%    ~       (p=0.074 n=8+9)
_UFlat9-8          719MB/s ± 1%   719MB/s ± 1%    ~      (p=0.621 n=10+9)
_UFlat10-8        2.84GB/s ± 0%  2.84GB/s ± 0%  +0.19%   (p=0.043 n=10+9)
_UFlat11-8        1.05GB/s ± 1%  1.05GB/s ± 0%    ~       (p=0.523 n=9+8)
_ZFlat0-8         1.04GB/s ± 2%  1.04GB/s ± 0%    ~       (p=0.222 n=9+9)
_ZFlat1-8          535MB/s ± 0%   534MB/s ± 0%    ~       (p=0.059 n=9+9)
_ZFlat2-8         15.6GB/s ± 3%  15.7GB/s ± 1%    ~      (p=0.720 n=9+10)
_ZFlat3-8          723MB/s ± 0%   740MB/s ± 3%  +2.36%   (p=0.034 n=8+10)
_ZFlat4-8         9.16GB/s ± 1%  9.20GB/s ± 1%    ~       (p=0.297 n=9+9)
_ZFlat5-8          987MB/s ± 1%   991MB/s ± 0%    ~       (p=0.167 n=9+8)
_ZFlat6-8          378MB/s ± 2%   379MB/s ± 0%    ~       (p=0.334 n=9+8)
_ZFlat7-8          350MB/s ± 2%   352MB/s ± 0%  +0.60%    (p=0.014 n=9+8)
_ZFlat8-8          397MB/s ± 0%   396MB/s ± 1%    ~      (p=0.965 n=8+10)
_ZFlat9-8          328MB/s ± 0%   327MB/s ± 1%    ~       (p=0.409 n=8+9)
_ZFlat10-8        1.33GB/s ± 0%  1.33GB/s ± 1%    ~      (p=0.356 n=9+10)
_ZFlat11-8         605MB/s ± 0%   605MB/s ± 1%    ~       (p=0.743 n=9+8)
2016-05-29 15:00:41 +10:00
Nigel Tao d6668316e4 Fix BenchmarkExtendMatch to honor the testdata flag. 2016-05-19 13:34:20 +10:00
Nigel Tao d7b1e156f5 Add a benchdataDir flag. 2016-05-05 08:17:12 +10:00
Nigel Tao aefa7ba4ef Re-add the testdata flag.
Some build environments need to specify their own testdata dir.
2016-05-05 07:48:01 +10:00
Nigel Tao 43fea289ed Remove the snappy.test binary, inadvertently checked in.
Fixes #32.
2016-04-30 09:02:19 +10:00
Nigel Tao b62d312cd2 Add some benchmark numbers to the README. 2016-04-29 15:28:03 +10:00
Nigel Tao dfb3612ba2 Inline the extendMatch call.
Compared to the previous commit:
name              old speed      new speed      delta
WordsEncode1e1-8   701MB/s ± 0%   699MB/s ± 1%     ~     (p=0.123 n=10+10)
WordsEncode1e2-8   460MB/s ± 0%   583MB/s ± 1%  +26.64%  (p=0.000 n=10+10)
WordsEncode1e3-8   480MB/s ± 0%   647MB/s ± 2%  +34.85%  (p=0.000 n=10+10)
WordsEncode1e4-8   416MB/s ± 0%   451MB/s ± 0%   +8.30%   (p=0.000 n=10+8)
WordsEncode1e5-8   297MB/s ± 0%   355MB/s ± 2%  +19.50%   (p=0.000 n=10+9)
WordsEncode1e6-8   345MB/s ± 0%   433MB/s ± 2%  +25.47%   (p=0.000 n=10+9)
RandomEncode-8    14.4GB/s ± 2%  14.3GB/s ± 3%     ~     (p=0.075 n=10+10)
_ZFlat0-8          891MB/s ± 1%  1040MB/s ± 0%  +16.67%    (p=0.000 n=9+9)
_ZFlat1-8          471MB/s ± 0%   535MB/s ± 1%  +13.68%   (p=0.000 n=9+10)
_ZFlat2-8         16.2GB/s ± 3%  16.4GB/s ± 1%     ~      (p=0.122 n=10+8)
_ZFlat3-8          676MB/s ± 0%   762MB/s ± 0%  +12.62%   (p=0.000 n=10+9)
_ZFlat4-8         8.36GB/s ± 1%  9.47GB/s ± 1%  +13.28%  (p=0.000 n=10+10)
_ZFlat5-8          852MB/s ± 0%   986MB/s ± 1%  +15.79%   (p=0.000 n=10+9)
_ZFlat6-8          316MB/s ± 0%   380MB/s ± 1%  +20.41%    (p=0.000 n=8+9)
_ZFlat7-8          296MB/s ± 0%   353MB/s ± 0%  +19.44%   (p=0.000 n=8+10)
_ZFlat8-8          331MB/s ± 1%   399MB/s ± 0%  +20.53%    (p=0.000 n=9+8)
_ZFlat9-8          274MB/s ± 0%   329MB/s ± 0%  +20.27%    (p=0.000 n=8+9)
_ZFlat10-8        1.17GB/s ± 0%  1.35GB/s ± 1%  +15.15%    (p=0.000 n=9+9)
_ZFlat11-8         462MB/s ± 0%   608MB/s ± 0%  +31.54%    (p=0.000 n=9+9)

The net effect of the past four inlining commits, when compared to just
before c3defccc "Inline the emitCopy call":
name              old speed      new speed      delta
WordsEncode1e1-8   701MB/s ± 1%   699MB/s ± 1%     ~     (p=0.353 n=10+10)
WordsEncode1e2-8   429MB/s ± 0%   583MB/s ± 1%  +35.95%   (p=0.000 n=9+10)
WordsEncode1e3-8   447MB/s ± 0%   647MB/s ± 2%  +44.85%   (p=0.000 n=9+10)
WordsEncode1e4-8   322MB/s ± 1%   451MB/s ± 0%  +40.00%   (p=0.000 n=10+8)
WordsEncode1e5-8   268MB/s ± 0%   355MB/s ± 2%  +32.41%    (p=0.000 n=9+9)
WordsEncode1e6-8   313MB/s ± 0%   433MB/s ± 2%  +38.28%    (p=0.000 n=8+9)
RandomEncode-8    14.4GB/s ± 1%  14.3GB/s ± 3%     ~      (p=0.897 n=8+10)
_ZFlat0-8          797MB/s ± 2%  1040MB/s ± 0%  +30.53%    (p=0.000 n=9+9)
_ZFlat1-8          435MB/s ± 1%   535MB/s ± 1%  +22.97%   (p=0.000 n=9+10)
_ZFlat2-8         16.1GB/s ± 2%  16.4GB/s ± 1%   +1.47%   (p=0.001 n=10+8)
_ZFlat3-8          633MB/s ± 0%   762MB/s ± 0%  +20.32%   (p=0.000 n=10+9)
_ZFlat4-8         7.95GB/s ± 1%  9.47GB/s ± 1%  +19.11%  (p=0.000 n=10+10)
_ZFlat5-8          771MB/s ± 0%   986MB/s ± 1%  +27.83%   (p=0.000 n=10+9)
_ZFlat6-8          283MB/s ± 0%   380MB/s ± 1%  +34.46%   (p=0.000 n=10+9)
_ZFlat7-8          265MB/s ± 0%   353MB/s ± 0%  +33.29%   (p=0.000 n=9+10)
_ZFlat8-8          299MB/s ± 0%   399MB/s ± 0%  +33.36%    (p=0.000 n=9+8)
_ZFlat9-8          246MB/s ± 1%   329MB/s ± 0%  +33.58%   (p=0.000 n=10+9)
_ZFlat10-8        1.05GB/s ± 1%  1.35GB/s ± 1%  +28.35%   (p=0.000 n=10+9)
_ZFlat11-8         411MB/s ± 0%   608MB/s ± 0%  +47.82%   (p=0.000 n=10+9)
2016-04-29 14:24:51 +10:00
Nigel Tao c707890a47 Rearrange the extendMatch register allocation.
This minimizes the diff in a follow-up commit, when manually inlining.

It's not an optimization per se, but for the record:
name              old speed      new speed      delta
WordsEncode1e1-8   700MB/s ± 1%   701MB/s ± 0%    ~     (p=0.393 n=10+10)
WordsEncode1e2-8   460MB/s ± 1%   460MB/s ± 0%    ~     (p=0.393 n=10+10)
WordsEncode1e3-8   478MB/s ± 2%   480MB/s ± 0%    ~     (p=0.912 n=10+10)
WordsEncode1e4-8   414MB/s ± 0%   416MB/s ± 0%  +0.64%   (p=0.000 n=9+10)
WordsEncode1e5-8   296MB/s ± 1%   297MB/s ± 0%    ~      (p=0.113 n=9+10)
WordsEncode1e6-8   345MB/s ± 0%   345MB/s ± 0%    ~      (p=0.949 n=8+10)
RandomEncode-8    14.4GB/s ± 2%  14.4GB/s ± 2%    ~      (p=0.278 n=9+10)
_ZFlat0-8          888MB/s ± 1%   891MB/s ± 1%  +0.35%   (p=0.010 n=10+9)
_ZFlat1-8          471MB/s ± 1%   471MB/s ± 0%    ~      (p=0.447 n=10+9)
_ZFlat2-8         16.2GB/s ± 3%  16.2GB/s ± 3%    ~     (p=0.912 n=10+10)
_ZFlat3-8          675MB/s ± 1%   676MB/s ± 0%    ~      (p=0.150 n=9+10)
_ZFlat4-8         8.31GB/s ± 1%  8.36GB/s ± 1%  +0.65%  (p=0.035 n=10+10)
_ZFlat5-8          850MB/s ± 0%   852MB/s ± 0%    ~      (p=0.182 n=9+10)
_ZFlat6-8          316MB/s ± 0%   316MB/s ± 0%    ~      (p=0.762 n=10+8)
_ZFlat7-8          294MB/s ± 1%   296MB/s ± 0%  +0.51%    (p=0.006 n=9+8)
_ZFlat8-8          330MB/s ± 1%   331MB/s ± 1%    ~       (p=0.881 n=9+9)
_ZFlat9-8          273MB/s ± 0%   274MB/s ± 0%  +0.23%   (p=0.043 n=10+8)
_ZFlat10-8        1.17GB/s ± 1%  1.17GB/s ± 0%    ~      (p=0.922 n=10+9)
_ZFlat11-8         461MB/s ± 0%   462MB/s ± 0%    ~      (p=0.219 n=10+9)

Also:
name           old time/op  new time/op  delta
ExtendMatch-8  7.92µs ± 2%  7.80µs ± 2%  -1.51%  (p=0.002 n=10+9)
and note that this is time/op instead of MB/s, so negative is better,
although it's quite possibly all just noise.
2016-04-29 14:11:06 +10:00
Nigel Tao 5a44a9da21 Inline the emitLiteral call.
name              old speed      new speed      delta
WordsEncode1e1-8   712MB/s ± 1%   700MB/s ± 1%   -1.65%  (p=0.000 n=10+10)
WordsEncode1e2-8   467MB/s ± 0%   460MB/s ± 1%   -1.53%   (p=0.000 n=9+10)
WordsEncode1e3-8   483MB/s ± 0%   478MB/s ± 2%   -0.98%   (p=0.007 n=9+10)
WordsEncode1e4-8   353MB/s ± 1%   414MB/s ± 0%  +17.03%   (p=0.000 n=10+9)
WordsEncode1e5-8   293MB/s ± 0%   296MB/s ± 1%   +1.03%    (p=0.000 n=8+9)
WordsEncode1e6-8   345MB/s ± 0%   345MB/s ± 0%     ~       (p=0.332 n=9+8)
RandomEncode-8    14.4GB/s ± 2%  14.4GB/s ± 2%     ~      (p=1.000 n=10+9)
_ZFlat0-8          863MB/s ± 0%   888MB/s ± 1%   +2.86%   (p=0.000 n=9+10)
_ZFlat1-8          471MB/s ± 0%   471MB/s ± 1%     ~      (p=0.897 n=8+10)
_ZFlat2-8         16.2GB/s ± 2%  16.2GB/s ± 3%     ~     (p=0.631 n=10+10)
_ZFlat3-8          659MB/s ± 1%   675MB/s ± 1%   +2.32%    (p=0.000 n=9+9)
_ZFlat4-8         8.29GB/s ± 1%  8.31GB/s ± 1%     ~     (p=0.315 n=10+10)
_ZFlat5-8          836MB/s ± 1%   850MB/s ± 0%   +1.78%    (p=0.000 n=9+9)
_ZFlat6-8          315MB/s ± 0%   316MB/s ± 0%   +0.39%   (p=0.002 n=9+10)
_ZFlat7-8          293MB/s ± 1%   294MB/s ± 1%     ~      (p=0.139 n=10+9)
_ZFlat8-8          331MB/s ± 1%   330MB/s ± 1%     ~      (p=0.356 n=10+9)
_ZFlat9-8          273MB/s ± 1%   273MB/s ± 0%     ~     (p=0.280 n=10+10)
_ZFlat10-8        1.12GB/s ± 1%  1.17GB/s ± 1%   +4.12%  (p=0.000 n=10+10)
_ZFlat11-8         460MB/s ± 0%   461MB/s ± 0%   +0.34%   (p=0.006 n=8+10)
2016-04-29 13:20:53 +10:00
Nigel Tao c3defccc35 Inline the emitCopy call.
name              old speed      new speed      delta
WordsEncode1e1-8   701MB/s ± 1%   712MB/s ± 1%   +1.64%  (p=0.000 n=10+10)
WordsEncode1e2-8   429MB/s ± 0%   467MB/s ± 0%   +8.86%    (p=0.000 n=9+9)
WordsEncode1e3-8   447MB/s ± 0%   483MB/s ± 0%   +8.20%    (p=0.000 n=9+9)
WordsEncode1e4-8   322MB/s ± 1%   353MB/s ± 1%   +9.76%  (p=0.000 n=10+10)
WordsEncode1e5-8   268MB/s ± 0%   293MB/s ± 0%   +9.42%    (p=0.000 n=9+8)
WordsEncode1e6-8   313MB/s ± 0%   345MB/s ± 0%  +10.06%    (p=0.000 n=8+9)
RandomEncode-8    14.4GB/s ± 1%  14.4GB/s ± 2%     ~      (p=0.829 n=8+10)
_ZFlat0-8          797MB/s ± 2%   863MB/s ± 0%   +8.39%    (p=0.000 n=9+9)
_ZFlat1-8          435MB/s ± 1%   471MB/s ± 0%   +8.34%    (p=0.000 n=9+8)
_ZFlat2-8         16.1GB/s ± 2%  16.2GB/s ± 2%     ~     (p=0.165 n=10+10)
_ZFlat3-8          633MB/s ± 0%   659MB/s ± 1%   +4.12%   (p=0.000 n=10+9)
_ZFlat4-8         7.95GB/s ± 1%  8.29GB/s ± 1%   +4.22%  (p=0.000 n=10+10)
_ZFlat5-8          771MB/s ± 0%   836MB/s ± 1%   +8.33%   (p=0.000 n=10+9)
_ZFlat6-8          283MB/s ± 0%   315MB/s ± 0%  +11.19%   (p=0.000 n=10+9)
_ZFlat7-8          265MB/s ± 0%   293MB/s ± 1%  +10.73%   (p=0.000 n=9+10)
_ZFlat8-8          299MB/s ± 0%   331MB/s ± 1%  +10.74%   (p=0.000 n=9+10)
_ZFlat9-8          246MB/s ± 1%   273MB/s ± 1%  +10.90%  (p=0.000 n=10+10)
_ZFlat10-8        1.05GB/s ± 1%  1.12GB/s ± 1%   +7.02%  (p=0.000 n=10+10)
_ZFlat11-8         411MB/s ± 0%   460MB/s ± 0%  +11.79%   (p=0.000 n=10+8)
2016-04-29 12:54:56 +10:00
Nigel Tao 598d84db77 Rearrange the emitLiteral register allocation.
This minimizes the diff in a follow-up commit, when manually inlining.

It's not an optimization per se, but for the record:
name              old speed      new speed      delta
WordsEncode1e1-8   698MB/s ± 1%   701MB/s ± 1%   ~     (p=0.165 n=10+10)
WordsEncode1e2-8   428MB/s ± 0%   429MB/s ± 0%   ~       (p=0.489 n=9+9)
WordsEncode1e3-8   446MB/s ± 0%   447MB/s ± 0%   ~       (p=0.476 n=9+9)
WordsEncode1e4-8   321MB/s ± 1%   322MB/s ± 1%   ~     (p=0.593 n=10+10)
WordsEncode1e5-8   267MB/s ± 1%   268MB/s ± 0%   ~       (p=0.287 n=9+9)
WordsEncode1e6-8   313MB/s ± 1%   313MB/s ± 0%   ~       (p=0.190 n=9+8)
RandomEncode-8    14.4GB/s ± 1%  14.4GB/s ± 1%   ~       (p=0.673 n=9+8)
_ZFlat0-8          800MB/s ± 0%   797MB/s ± 2%   ~       (p=0.387 n=9+9)
_ZFlat1-8          436MB/s ± 1%   435MB/s ± 1%   ~       (p=0.169 n=9+9)
_ZFlat2-8         16.2GB/s ± 1%  16.1GB/s ± 2%   ~     (p=0.063 n=10+10)
_ZFlat3-8          633MB/s ± 1%   633MB/s ± 0%   ~      (p=0.661 n=9+10)
_ZFlat4-8         7.96GB/s ± 1%  7.95GB/s ± 1%   ~     (p=0.796 n=10+10)
_ZFlat5-8          771MB/s ± 0%   771MB/s ± 0%   ~     (p=0.929 n=10+10)
_ZFlat6-8          283MB/s ± 1%   283MB/s ± 0%   ~     (p=0.912 n=10+10)
_ZFlat7-8          265MB/s ± 0%   265MB/s ± 0%   ~       (p=0.649 n=9+9)
_ZFlat8-8          299MB/s ± 0%   299MB/s ± 0%   ~       (p=0.748 n=9+9)
_ZFlat9-8          246MB/s ± 1%   246MB/s ± 1%   ~      (p=0.921 n=9+10)
_ZFlat10-8        1.05GB/s ± 1%  1.05GB/s ± 1%   ~     (p=0.089 n=10+10)
_ZFlat11-8         410MB/s ± 0%   411MB/s ± 0%   ~     (p=0.190 n=10+10)
2016-04-29 12:00:38 +10:00
Nigel Tao 9f7b278fd7 Rearrange the emitCopy register allocation.
This minimizes the diff in a follow-up commit, when manually inlining.

It's not an optimization per se, but for the record:
name              old speed      new speed      delta
WordsEncode1e1-8   711MB/s ± 1%   700MB/s ± 1%  -1.64%   (p=0.000 n=9+10)
WordsEncode1e2-8   407MB/s ± 1%   430MB/s ± 0%  +5.57%  (p=0.000 n=10+10)
WordsEncode1e3-8   441MB/s ± 1%   447MB/s ± 0%  +1.52%    (p=0.000 n=8+8)
WordsEncode1e4-8   311MB/s ± 1%   322MB/s ± 0%  +3.69%   (p=0.000 n=9+10)
WordsEncode1e5-8   267MB/s ± 0%   267MB/s ± 1%    ~      (p=0.068 n=8+10)
WordsEncode1e6-8   312MB/s ± 1%   314MB/s ± 0%  +0.45%   (p=0.000 n=9+10)
RandomEncode-8    14.4GB/s ± 2%  14.4GB/s ± 2%    ~     (p=0.739 n=10+10)
_ZFlat0-8          792MB/s ± 1%   801MB/s ± 0%  +1.11%    (p=0.000 n=8+9)
_ZFlat1-8          435MB/s ± 1%   437MB/s ± 0%    ~      (p=0.857 n=9+10)
_ZFlat2-8         16.0GB/s ± 4%  16.3GB/s ± 1%    ~     (p=0.143 n=10+10)
_ZFlat3-8          613MB/s ± 0%   634MB/s ± 0%  +3.54%   (p=0.000 n=8+10)
_ZFlat4-8         7.96GB/s ± 1%  7.97GB/s ± 1%    ~      (p=0.829 n=8+10)
_ZFlat5-8          770MB/s ± 0%   773MB/s ± 0%  +0.33%    (p=0.000 n=8+9)
_ZFlat6-8          283MB/s ± 0%   283MB/s ± 0%  +0.13%    (p=0.043 n=8+9)
_ZFlat7-8          264MB/s ± 2%   265MB/s ± 0%  +0.61%    (p=0.000 n=9+9)
_ZFlat8-8          297MB/s ± 3%   299MB/s ± 0%    ~       (p=0.161 n=9+9)
_ZFlat9-8          247MB/s ± 1%   247MB/s ± 0%    ~       (p=0.465 n=8+9)
_ZFlat10-8        1.03GB/s ± 0%  1.05GB/s ± 1%  +1.75%    (p=0.000 n=9+9)
_ZFlat11-8         409MB/s ± 0%   412MB/s ± 0%  +0.64%    (p=0.000 n=8+8)
2016-04-29 11:22:44 +10:00
Nigel Tao 2b29335120 Run asmfmt. 2016-04-29 11:06:33 +10:00
Nigel Tao 6ffc20e64a Add more comments for the asm workaround. 2016-04-29 10:31:32 +10:00
Nigel Tao ec642410cd Workaround "table-32768(SP)(R11*2)" not assembling.
This asm phrase works on Go 1.4 and Go tip, but not Go 1.6. I'm not sure
why, but this workaround should make the package installable while I
investigate.

Fixes #29.
2016-04-24 10:32:34 +10:00
Nigel Tao 7dddae14f7 Fix redeclaration of "end" in the asm.
Multiple "end" labels, in different functions, did not work with the Go
1.4 toolchain.

Fixes #30.
2016-04-24 10:07:12 +10:00
Nigel Tao 2dbf365277 Inline extendMatch for the noasm encoder.
This partially undoes 4f2f9a13 "Write the encoder's extendMatch in asm".
The undo can be applied selectively, to the noasm case only, now that
encodeBlock (the function that calls extendMatch) is itself written in
asm.

With "go test -test.bench='Encode|ZFlat' -tags=noasm":
name              old speed      new speed      delta
WordsEncode1e1-8   676MB/s ± 1%   676MB/s ± 0%     ~     (p=0.841 n=5+5)
WordsEncode1e2-8  85.3MB/s ± 0%  87.5MB/s ± 1%   +2.50%  (p=0.008 n=5+5)
WordsEncode1e3-8   241MB/s ± 0%   258MB/s ± 0%   +7.33%  (p=0.008 n=5+5)
WordsEncode1e4-8   199MB/s ± 0%   245MB/s ± 0%  +23.15%  (p=0.008 n=5+5)
WordsEncode1e5-8   171MB/s ± 0%   186MB/s ± 0%   +8.57%  (p=0.008 n=5+5)
WordsEncode1e6-8   192MB/s ± 0%   211MB/s ± 0%   +9.51%  (p=0.008 n=5+5)
RandomEncode-8    13.1GB/s ± 2%  13.2GB/s ± 1%     ~     (p=0.690 n=5+5)
_ZFlat0-8          404MB/s ± 0%   431MB/s ± 0%   +6.84%  (p=0.008 n=5+5)
_ZFlat1-8          260MB/s ± 0%   277MB/s ± 0%   +6.46%  (p=0.008 n=5+5)
_ZFlat2-8         13.8GB/s ± 1%  13.8GB/s ± 2%     ~     (p=1.000 n=5+5)
_ZFlat3-8          170MB/s ± 1%   173MB/s ± 0%   +1.60%  (p=0.008 n=5+5)
_ZFlat4-8         2.94GB/s ± 5%  3.10GB/s ± 0%   +5.35%  (p=0.008 n=5+5)
_ZFlat5-8          397MB/s ± 1%   426MB/s ± 0%   +7.32%  (p=0.008 n=5+5)
_ZFlat6-8          175MB/s ± 2%   190MB/s ± 0%   +8.61%  (p=0.008 n=5+5)
_ZFlat7-8          169MB/s ± 0%   182MB/s ± 0%   +7.47%  (p=0.016 n=4+5)
_ZFlat8-8          184MB/s ± 3%   200MB/s ± 0%   +8.65%  (p=0.008 n=5+5)
_ZFlat9-8          163MB/s ± 0%   175MB/s ± 0%   +7.57%  (p=0.016 n=4+5)
_ZFlat10-8         481MB/s ± 0%   509MB/s ± 0%   +5.80%  (p=0.016 n=4+5)
_ZFlat11-8         254MB/s ± 0%   275MB/s ± 0%   +8.32%  (p=0.008 n=5+5)

For the record, after this commit, the comparison between the noasm
('old') and vanilla (i.e. with asm, 'new') encoder benchmarks, summing
up the last eight or so commits, is:
name              old speed      new speed       delta
WordsEncode1e1-8   676MB/s ± 0%    677MB/s ± 1%      ~     (p=0.310 n=5+5)
WordsEncode1e2-8  87.5MB/s ± 1%  428.3MB/s ± 0%  +389.71%  (p=0.008 n=5+5)
WordsEncode1e3-8   258MB/s ± 0%    446MB/s ± 1%   +72.67%  (p=0.008 n=5+5)
WordsEncode1e4-8   245MB/s ± 0%    316MB/s ± 0%   +28.94%  (p=0.008 n=5+5)
WordsEncode1e5-8   186MB/s ± 0%    269MB/s ± 0%   +44.86%  (p=0.008 n=5+5)
WordsEncode1e6-8   211MB/s ± 0%    314MB/s ± 1%   +48.84%  (p=0.008 n=5+5)
RandomEncode-8    13.2GB/s ± 1%   14.4GB/s ± 1%    +9.33%  (p=0.008 n=5+5)
_ZFlat0-8          431MB/s ± 0%    792MB/s ± 0%   +83.67%  (p=0.008 n=5+5)
_ZFlat1-8          277MB/s ± 0%    436MB/s ± 1%   +57.46%  (p=0.008 n=5+5)
_ZFlat2-8         13.8GB/s ± 2%   16.2GB/s ± 1%   +17.16%  (p=0.008 n=5+5)
_ZFlat3-8          173MB/s ± 0%    632MB/s ± 1%  +265.85%  (p=0.008 n=5+5)
_ZFlat4-8         3.10GB/s ± 0%   8.00GB/s ± 0%  +157.99%  (p=0.008 n=5+5)
_ZFlat5-8          426MB/s ± 0%    768MB/s ± 0%   +80.06%  (p=0.008 n=5+5)
_ZFlat6-8          190MB/s ± 0%    282MB/s ± 1%   +48.48%  (p=0.008 n=5+5)
_ZFlat7-8          182MB/s ± 0%    264MB/s ± 1%   +44.97%  (p=0.008 n=5+5)
_ZFlat8-8          200MB/s ± 0%    298MB/s ± 0%   +49.45%  (p=0.008 n=5+5)
_ZFlat9-8          175MB/s ± 0%    247MB/s ± 0%   +41.02%  (p=0.008 n=5+5)
_ZFlat10-8         509MB/s ± 0%   1027MB/s ± 0%  +101.72%  (p=0.008 n=5+5)
_ZFlat11-8         275MB/s ± 0%    411MB/s ± 0%   +49.57%  (p=0.008 n=5+5)
2016-04-23 15:01:47 +10:00
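For readers unfamiliar with extendMatch, named in commit 2dbf365277 above, here is a minimal pure-Go sketch of a match-extension loop of that kind. The signature and the doc comment are assumptions made for illustration. Whether the noasm encoder inlines exactly this loop is not stated in the commit message.

```go
package main

import "fmt"

// extendMatch returns the largest k such that k <= len(src) and
// src[i:i+k-j] equals src[j:k]. It assumes 0 <= i < j <= len(src).
// Illustrative sketch; the signature is an assumption.
func extendMatch(src []byte, i, j int) int {
	// Advance both cursors while the bytes keep matching.
	for ; j < len(src) && src[i] == src[j]; i, j = i+1, j+1 {
	}
	return j
}

func main() {
	src := []byte("abcabcabd")
	// The match starting at offsets 0 and 3 runs for five bytes ("abcab").
	fmt.Println(extendMatch(src, 0, 3)) // prints 8
}
```

Inlining a loop this small at its call site avoids a function call per match, which is plausibly the per-match overhead the noasm benchmark numbers above reflect.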