`String#+@` is 2-3 times faster than `String#dup` because it can
directly go through `rb_str_dup` instead of using the generic
much slower `rb_obj_dup`.
This fact led to the existance of the ugly `Performance/UnfreezeString`
rubocop performance rule that encourage users to rewrite the much
more readable and convenient `"foo".dup` into the ugly `(+"foo")`.
Let's make that rubocop rule useless.
```
compare-ruby: ruby 3.3.0dev (2023-11-20T02:02:55Z master 701b0650de) [arm64-darwin22]
last_commit=[ruby/prism] feat: add encoding for IBM865 (https://github.com/ruby/prism/pull/1884)
built-ruby: ruby 3.3.0dev (2023-11-20T12:51:45Z faster-str-lit-dup 6b745bbc5d) [arm64-darwin22]
warming up..
| |compare-ruby|built-ruby|
|:------|-----------:|---------:|
|uplus | 16.312M| 16.332M|
| | -| 1.00x|
|dup | 5.912M| 16.329M|
| | -| 2.76x|
```
When an inline cache misses, it is very likely that the stale shape_id
and the current instance shape_id have a close common ancestor.
For example if the instance variable is sometimes frozen sometimes
not, one of the two shape will be the direct parent of the other.
Another pattern that commonly cause IC misses is "memoization",
in such case the object will have a "base common shape" and then
a number of close descendants.
In addition, when we find a common ancestor, we store it in the
inline cache instead of the current shape. This help prevent the
cache from flip-flopping, ensuring the next lookup will be marginally
faster and more generally avoid writing in memory too much.
However, now that shapes have an ancestors index, we only check
for a few ancestors before falling back to use the index.
So overall this change speeds up what is assumed to be the more common
case, but makes what is assumed to be the less common case a bit slower.
```
compare-ruby: ruby 3.3.0dev (2023-10-26T05:30:17Z master 701ca070b4) [arm64-darwin22]
built-ruby: ruby 3.3.0dev (2023-10-26T09:25:09Z shapes_double_sear.. a723a85235) [arm64-darwin22]
warming up......
| |compare-ruby|built-ruby|
|:------------------------------------|-----------:|---------:|
|vm_ivar_stable_shape | 11.672M| 11.679M|
| | -| 1.00x|
|vm_ivar_memoize_unstable_shape | 7.551M| 10.506M|
| | -| 1.39x|
|vm_ivar_memoize_unstable_shape_miss | 11.591M| 11.624M|
| | -| 1.00x|
|vm_ivar_unstable_undef | 9.037M| 7.981M|
| | 1.13x| -|
|vm_ivar_divergent_shape | 8.034M| 6.657M|
| | 1.21x| -|
|vm_ivar_divergent_shape_imbalanced | 10.471M| 9.231M|
| | 1.13x| -|
```
Co-Authored-By: John Hawthorn <john@hawthorn.email>
This is an experimental commit that uses a functional red-black tree to
create an index of the ancestor shapes. It uses an Okasaki style
functional red black tree:
https://www.cs.tufts.edu/comp/150FP/archive/chris-okasaki/redblack99.pdf
This tree is advantageous because:
* It offers O(n log n) insertions and O(n log n) lookups.
* It shares memory with previous "versions" of the tree
When we insert a node in the tree, only the parts of the tree that need
to be rebalanced are newly allocated. Parts of the tree that don't need
to be rebalanced are not reallocated, so "new trees" are able to share
memory with old trees. This is in contrast to a sorted set where we
would have to duplicate the set, and also resort the set on each
insertion.
I've added a new stat to RubyVM.stat so we can understand how the red
black tree increases.
On Range#bsearch for endless ranges, we try positions at `begin + 2**i` (i = 0, 1, 2, ...)
to find a point that satisfies a given condition.
Subsequently, we perform binary searching with the interval `[begin, begin + 2**n]`.
However, the interval `[begin + 2**(n-1), begin + 2**n]` is sufficient for binary search
because `begin + 2**(n-1)` does not satisfy the condition.
The same applies to beginless ranges.
Leave callers to convert byte index to char index, as well as
`rb_str_index`, so that `rb_str_rpartition` does not need to
re-convert char index to byte index.
In most of case `sort_by` works on primitive type.
Using `qsort_r` with function pointer is much slower than compare data directly.
I implement an intro sort which compare primitive data directly for `sort_by`.
We can even afford an O(n) type check before primitive data sort.
It still go faster.
CALLER_ARG_SPLAT is not necessary for method_missing. We just need
to unshift the method name into the arguments.
This optimizes all method_missing calls:
* mm(recv) ~9%
* mm(recv, *args) ~215% for args.length == 200
* mm(recv, *args, **kw) ~55% for args.length == 200
* mm(recv, **kw) ~22%
* mm(recv, kw: 1) ~100%
Note that empty argument splats do get slower with this approach,
by about 30-40%. Other than non-empty argument splats, other
argument splats are faster, with the speedup depending on the
number of arguments.
Similar to the bmethod/send optimization, this avoids using
CALLER_ARG_SPLAT if not necessary. As long as the receiver argument
can be shifted off, other arguments are passed through as-is.
This optimizes the following types of calls:
* symproc.(recv) ~5%
* symproc.(recv, *args) ~65% for args.length == 200
* symproc.(recv, *args, **kw) ~45% for args.length == 200
* symproc.(recv, **kw) ~30%
* symproc.(recv, kw: 1) ~100%
Note that empty argument splats do get slower with this approach,
by about 2-3%. This is probably because iseq argument setup is
slower for empty argument splats than CALLER_SETUP_ARG is. Other
than non-empty argument splats, other argument splats are faster,
with the speedup depending on the number of arguments.
The following types of calls are not optimized:
* symproc.(*args)
* symproc.(*args, **kw)
This is because the you cannot shift the receiver argument off
without first splatting the arg.
Similar to the bmethod optimization, this avoids using
CALLER_ARG_SPLAT if not necessary. As long as the method argument
can be shifted off, other arguments are passed through as-is.
This optimizes the following types of calls:
* send(meth, arg) ~5%
* send(meth, *args) ~75% for args.length == 200
* send(meth, *args, **kw) ~50% for args.length == 200
* send(meth, **kw) ~25%
* send(meth, kw: 1) ~115%
Note that empty argument splats do get slower with this approach,
by about 20%. This is probably because iseq argument setup is
slower for empty argument splats than CALLER_SETUP_ARG is. Other
than non-empty argument splats, other argument splats are faster,
with the speedup depending on the number of arguments.
The following types of calls are not optimized:
* send(*args)
* send(*args, **kw)
This is because the you cannot shift the method argument off
without first splatting the arg.
This optimizes the following calls:
* ~10-15% for f(*a) when a does not end with a flagged keywords hash
* ~10-15% for f(*a) when a ends with an empty flagged keywords hash
* ~35-40% for f(*a, **kw) if kw is empty
This still copies the array contents to the VM stack, but avoids some
overhead. It would be faster to use the array pointer directly,
but that could cause problems if the array was modified during
the call to the function. You could do that optimization for frozen
arrays, but as splatting frozen arrays is uncommon, and the speedup
is minimal (<5%), it doesn't seem worth it.
The vm_send_cfunc benchmark has been updated to test additional cfunc
call types, and the numbers above were taken from the benchmark results.
Currently, bmethod arguments are copied from the VM stack to the
C stack in vm_call_bmethod, then copied from the C stack to the VM
stack later in invoke_iseq_block_from_c. This is inefficient.
This adds vm_call_iseq_bmethod and vm_call_noniseq_bmethod.
vm_call_iseq_bmethod is an optimized method that skips stack
copies (though there is one copy to remove the receiver from
the stack), and avoids calling vm_call_bmethod_body,
rb_vm_invoke_bmethod, invoke_block_from_c_proc,
invoke_iseq_block_from_c, and vm_yield_setup_args.
Th vm_call_iseq_bmethod argument handling is similar to the
way normal iseq methods are called, and allows for similar
performance optimizations when using splats or keywords.
However, even in the no argument case it's still significantly
faster.
A benchmark is added for bmethod calling. In my environment,
it improves bmethod calling performance by 38-59% for simple
bmethod calls, and up to 180% for bmethod calls passing
literal keywords on both sides.
```
./miniruby-iseq-bmethod: 18159792.6 i/s
./miniruby-m: 13174419.1 i/s - 1.38x slower
bmethod_simple_1
./miniruby-iseq-bmethod: 15890745.4 i/s
./miniruby-m: 10008972.7 i/s - 1.59x slower
bmethod_simple_0_splat
./miniruby-iseq-bmethod: 13142804.3 i/s
./miniruby-m: 11168595.2 i/s - 1.18x slower
bmethod_simple_1_splat
./miniruby-iseq-bmethod: 12375791.0 i/s
./miniruby-m: 8491140.1 i/s - 1.46x slower
bmethod_no_splat
./miniruby-iseq-bmethod: 10151258.8 i/s
./miniruby-m: 8716664.1 i/s - 1.16x slower
bmethod_0_splat
./miniruby-iseq-bmethod: 8138802.5 i/s
./miniruby-m: 7515600.2 i/s - 1.08x slower
bmethod_1_splat
./miniruby-iseq-bmethod: 8028372.7 i/s
./miniruby-m: 5947658.6 i/s - 1.35x slower
bmethod_10_splat
./miniruby-iseq-bmethod: 6953514.1 i/s
./miniruby-m: 4840132.9 i/s - 1.44x slower
bmethod_100_splat
./miniruby-iseq-bmethod: 5287288.4 i/s
./miniruby-m: 2243218.4 i/s - 2.36x slower
bmethod_kw
./miniruby-iseq-bmethod: 8931358.2 i/s
./miniruby-m: 3185818.6 i/s - 2.80x slower
bmethod_no_kw
./miniruby-iseq-bmethod: 12281287.4 i/s
./miniruby-m: 10041727.9 i/s - 1.22x slower
bmethod_kw_splat
./miniruby-iseq-bmethod: 5618956.8 i/s
./miniruby-m: 3657549.5 i/s - 1.54x slower
```
Prior to this commit the `OPTIMIZED_CMP` macro relied on a method lookup
to determine whether `<=>` was overridden. The result of the lookup was
cached, but only for the duration of the specific method that
initialized the cmp_opt_data cache structure.
With this method lookup, `[x,y].max` is slower than doing `x > y ?
x : y` even though there's an optimized instruction for "new array max".
(John noticed somebody a proposed micro-optimization based on this fact
in https://github.com/mastodon/mastodon/pull/19903.)
```rb
a, b = 1, 2
Benchmark.ips do |bm|
bm.report('conditional') { a > b ? a : b }
bm.report('method') { [a, b].max }
bm.compare!
end
```
Before:
```
Comparison:
conditional: 22603733.2 i/s
method: 19820412.7 i/s - 1.14x (± 0.00) slower
```
This commit replaces the method lookup with a new CMP basic op, which
gives the examples above equivalent performance.
After:
```
Comparison:
method: 24022466.5 i/s
conditional: 23851094.2 i/s - same-ish: difference falls within
error
```
Relevant benchmarks show an improvement to Array#max and Array#min when
not using the optimized newarray_max instruction as well. They are
noticeably faster for small arrays with the relevant types, and the same
or maybe a touch faster on larger arrays.
```
$ make benchmark COMPARE_RUBY=<master@5958c305> ITEM=array_min
$ make benchmark COMPARE_RUBY=<master@5958c305> ITEM=array_max
```
The benchmarks added in this commit also look generally improved.
Co-authored-by: John Hawthorn <jhawthorn@github.com>
Previously YARV bytecode implemented constant caching by having a pair
of instructions, opt_getinlinecache and opt_setinlinecache, wrapping a
series of getconstant calls (with putobject providing supporting
arguments).
This commit replaces that pattern with a new instruction,
opt_getconstant_path, handling both getting/setting the inline cache and
fetching the constant on a cache miss.
This is implemented by storing the full constant path as a
null-terminated array of IDs inside of the IC structure. idNULL is used
to signal an absolute constant reference.
$ ./miniruby --dump=insns -e '::Foo::Bar::Baz'
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,13)> (catch: FALSE)
0000 opt_getconstant_path <ic:0 ::Foo::Bar::Baz> ( 1)[Li]
0002 leave
The motivation for this is that we had increasingly found the need to
disassemble the instructions between the opt_getinlinecache and
opt_setinlinecache in order to determine the constant we are fetching,
or otherwise store metadata.
This disassembly was done:
* In opt_setinlinecache, to register the IC against the constant names
it is using for granular invalidation.
* In rb_iseq_free, to unregister the IC from the invalidation table.
* In YJIT to find the position of a opt_getinlinecache instruction to
invalidate it when the cache is populated
* In YJIT to register the constant names being used for invalidation.
With this change we no longe need disassemly for these (in fact
rb_iseq_each is now unused), as the list of constant names being
referenced is held in the IC. This should also make it possible to make
more optimizations in the future.
This may also reduce the size of iseqs, as previously each segment
required 32 bytes (on 64-bit platforms) for each constant segment. This
implementation only stores one ID per-segment.
There should be no significant performance change between this and the
previous implementation. Previously opt_getinlinecache was a "leaf"
instruction, but it included a jump (almost always to a separate cache
line). Now opt_getconstant_path is a non-leaf (it may
raise/autoload/call const_missing) but it does not jump. These seem to
even out.
I'm planning to introduce mjit_compiler.rb, and I want to make this
consistent with it. Consistency with compile.c doesn't seem important
for MJIT anyway.
* Optimize Marshal dump of large fixnum
Marshal's FIXNUM type only supports 31-bit fixnums, so on 64-bit
platforms the 63-bit fixnums need to be represented in Marshal's
BIGNUM.
Previously this was done by converting to a bugnum and serializing the
bignum object.
This commit avoids allocating the intermediate bignum object, instead
outputting the T_FIXNUM directly to a Marshal bignum. This maintains the
same representation as the previous implementation, including not using
LINKs for these large fixnums (an artifact of the previous
implementation always allocating a new BIGNUM).
This commit also avoids unnecessary st_lookups on immediate values,
which we know will not be in that table.
* Fastpath for loading FIXNUM from Marshal bignum
* Run update-deps
This allows them to show the effect of the previous newarray/expandarray
to swap/opt_reverse optimization. This shows an 35-83% performance
improvement in the four multiple assignment benchmarks that use this
optimization.
- The method was renamed from `get` to `get_value`
- Comparing to `String#unpack` isn't quite equivalent, `unpack1` is closer.
- Use frozen_string_literal to avoid allocating a format string every time.
- Use `N` format which is equivalent to `:U32` (`uint_32_t` big-endian).
- Disable experimental warnings to not mess up the output.
If the RHS has valid encoding, and both strings have the same
encoding, we can use the fast path.
However we need to update the LHS coderange.
```
compare-ruby: ruby 3.2.0dev (2022-07-21T14:46:32Z master cdbb9b8555) [arm64-darwin21]
built-ruby: ruby 3.2.0dev (2022-07-25T07:25:41Z string-concat-vali.. 11a2772bdd) [arm64-darwin21]
warming up...
| |compare-ruby|built-ruby|
|:-------------------|-----------:|---------:|
|binary_concat_7bit | 554.816k| 556.460k|
| | -| 1.00x|
|utf8_concat_7bit | 556.367k| 555.101k|
| | 1.00x| -|
|utf8_concat_UTF8 | 412.555k| 556.824k|
| | -| 1.35x|
```
Not having to fetch the rb_encoding save a significant
amount of time.
Additionally, even when we have to fetch it, we can do
it faster using `ENCODING_GET` rather than `rb_enc_get`.
```
compare-ruby: ruby 3.2.0dev (2022-07-19T08:41:40Z master cb9fd920a3) [arm64-darwin21]
built-ruby: ruby 3.2.0dev (2022-07-21T11:16:16Z faster-buffer-conc.. 4f001f0748) [arm64-darwin21]
warming up...
| |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|binary_concat_utf8 | 510.580k| 565.600k|
| | -| 1.11x|
|binary_concat_binary | 512.653k| 571.483k|
| | -| 1.11x|
|utf8_concat_utf8 | 511.396k| 566.879k|
| | -| 1.11x|
```
If the LHS is ASCII compatible and the RHS is 7BIT
we can directly concat without being concerned about
anything else.
Benchmark:
```
compare-ruby: ruby 3.2.0dev (2022-07-12T15:01:11Z master 71aec68566) [arm64-darwin21]
built-ruby: ruby 3.2.0dev (2022-07-13T10:13:53Z faster-buffer-conc.. a04c10476d) [arm64-darwin21]
warming up...
| |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|binary_append_utf8 | 385.315k| 573.663k|
| | -| 1.49x|
|binary_append_binary | 446.579k| 574.898k|
| | -| 1.29x|
|utf8_append_utf8 | 430.936k| 573.394k|
| | -| 1.33x|
```
Note that in the benchmark, the RHS always have a precomputed
coderange. So the benchmark never enter the slowpath of having to
scan the RHS. However it's extremly likely that we'll end
up scanning it anyway in rb_enc_cr_str_buf_cat
Prior to this change, we were measuring object allocation as well
as setting instance variables within ivar benchmarks. With this
change, we now only measure setting instance variables within
ivar benchmarks.
`rb_str_concat` does a lot of type checking we can easily bypass.
```
| |compare-ruby|built-ruby|
|:--------------|-----------:|---------:|
|string_concat | 362.007k| 398.965k|
| | -| 1.10x|
```