When data is encoded to BERT, each encoded piece is stored in an Array-based Buffer. At the end, each piece is written sequentially to a StringIO object and the underlying String is returned. Unfortunately, these sequential writes force the internal String to grow repeatedly. By calling `#join` on the Buffer's internal Array instead, Ruby can allocate a single String large enough to hold the whole result in one step.
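A minimal sketch of the two strategies, for comparison. The `pieces` array is a made-up stand-in for the encoded fragments, not the gem's actual Buffer contents.

```ruby
require "stringio"

# Hypothetical pieces standing in for encoded BERT fragments.
pieces = ["\x83".b, "m".b, [5].pack("N"), "hello".b]

# Sequential writes: the String behind the StringIO may be resized
# several times as it grows.
io = StringIO.new(String.new(encoding: Encoding::BINARY))
pieces.each { |piece| io.write(piece) }
via_stringio = io.string

# Single join: Ruby builds the result from the Array in one call.
via_join = pieces.join

# Both approaches produce the same bytes.
puts via_stringio.bytes == via_join.bytes
```

Both paths yield identical output; the difference is only in how the result String is allocated.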
Don't reinvent this particular wheel. There's still a mention or two of
`.bundle`, which is what we have on OS X, so some of the cleanup
just did not work on anything else.
This method will buffer writes, but to an array rather than to a
StringIO. This allows us to calculate the size of the BERT packet that
we're going to send *without* copying large strings into a new buffer.
Writes might take a bit more CPU, but will take far less memory.
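As a rough illustration of the idea, here is a sketch of an Array-backed write buffer that tracks its size as chunks arrive. The class name `ArrayBuffer` and its methods are hypothetical, not the gem's actual API.

```ruby
require "stringio"

# Hypothetical Array-backed buffer; names are illustrative only.
class ArrayBuffer
  attr_reader :bytesize

  def initialize
    @chunks = []
    @bytesize = 0
  end

  # Keep a reference to the chunk instead of copying it into a
  # growing String, and add its length to the running total.
  def write(chunk)
    @chunks << chunk
    @bytesize += chunk.bytesize
    self
  end

  # Stream the chunks to any object that responds to #write.
  def write_to(io)
    @chunks.each { |chunk| io.write(chunk) }
    io
  end
end

buffer = ArrayBuffer.new
buffer.write("\x83".b).write("x" * 1_000_000)

# The packet size is known here, before any large string is copied.
packet_size = buffer.bytesize

out = buffer.write_to(StringIO.new(String.new(encoding: Encoding::BINARY)))
```

The extra bookkeeping per write is the CPU cost mentioned above; the saving is that large payloads are never duplicated just to measure them.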
This speeds up string decoding by using `rb_str_substr` rather than
`rb_str_new`. `rb_str_substr` creates shared strings, so it avoids
the string copy that `rb_str_new` performs.
Before this patch:
```
[aaron@TC bert (master)]$ ruby -I lib:test bench/decode_bench.rb
                   user     system      total        real
BERT C Extension Decoder
BERT tiny      0.000000   0.000000   0.000000 (  0.000574)
BERT small     0.010000   0.010000   0.020000 (  0.014938)
BERT large    13.990000  11.640000  25.630000 ( 25.892584)
BERT complex   0.030000   0.010000   0.040000 (  0.033596)
```
After this patch:
```
[aaron@TC bert (master)]$ ruby -I lib:test bench/decode_bench.rb
                   user     system      total        real
BERT C Extension Decoder
BERT tiny      0.000000   0.000000   0.000000 (  0.000563)
BERT small     0.010000   0.000000   0.010000 (  0.008701)
BERT large     6.180000   0.040000   6.220000 (  6.299307)
BERT complex   0.060000   0.000000   0.060000 (  0.070287)
```
It's turning out to be a pain for the client side to accept any
encoding. This commit ensures that non-UTF-8 data is transcoded to UTF-8
or converted to binary before being sent across the wire.
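One way such normalization could look is sketched below. The helper name `normalize_for_wire` and the fallback policy (drop to binary when transcoding fails or the UTF-8 is invalid) are assumptions for illustration, not necessarily what the gem does.

```ruby
# Hypothetical helper: returns a String that is either valid UTF-8
# or binary (ASCII-8BIT).
def normalize_for_wire(string)
  case string.encoding
  when Encoding::UTF_8
    # Invalid UTF-8 cannot be trusted as text; treat it as raw bytes.
    string.valid_encoding? ? string : string.b
  when Encoding::BINARY
    string
  else
    begin
      string.encode(Encoding::UTF_8)
    rescue EncodingError
      # Transcoding failed: fall back to raw bytes.
      string.b
    end
  end
end

latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
transcoded = normalize_for_wire(latin1)

broken = "\xFF".dup.force_encoding(Encoding::UTF_8)
as_binary = normalize_for_wire(broken)
```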
The two new types are extensions, so this commit adds a comment
documenting what these extensions are for (namely so that we can support
string encodings over the wire).
This commit adds two new types, one for Unicode strings and one for
other encoded strings. Unicode strings have no extra wire protocol
overhead, whereas "other" strings send the encoding name along with the
string.
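To make the overhead difference concrete, here is a sketch of how the two types might be laid out. The tag bytes, the field widths, and the `encode_string` helper are placeholders invented for this example; the real values and layout are defined in the gem.

```ruby
# Placeholder tag bytes chosen purely for illustration.
UNICODE_TAG = 0x01
ENCODED_TAG = 0x02

# Hypothetical layout:
#   Unicode: tag, 4-byte length, bytes
#   Other:   tag, 2-byte name length, encoding name, 4-byte length, bytes
def encode_string(string)
  if string.encoding == Encoding::UTF_8
    [UNICODE_TAG, string.bytesize].pack("CN") + string.b
  else
    name = string.encoding.name.b
    [ENCODED_TAG, name.bytesize].pack("Cn") + name +
      [string.bytesize].pack("N") + string.b
  end
end

utf8_packet  = encode_string("hi")
other_packet = encode_string("hi".dup.force_encoding(Encoding::ISO_8859_1))
```

In this sketch the Unicode packet carries only a tag and a length in front of the payload, while the "other" packet also pays for the encoding name.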