`rb_str_concat` does a lot of type checking we can easily bypass.
```
| |compare-ruby|built-ruby|
|:--------------|-----------:|---------:|
|string_concat | 362.007k| 398.965k|
| | -| 1.10x|
```
This commit reintroduces finer-grained constant cache invalidation.
After 8008fb7 got merged, it was causing issues on token-threaded
builds (such as on Windows).
The issue was that when you're iterating through instruction sequences
and using the translator functions to get back the instruction structs,
you're either using `rb_vm_insn_null_translator` or
`rb_vm_insn_addr2insn2` depending if it's a direct-threading build.
`rb_vm_insn_addr2insn2` does some normalization to always return to
you the non-trace version of whatever instruction you're looking at.
`rb_vm_insn_null_translator` does not do that normalization.
This means that when you're looping through the instructions if you're
trying to do an opcode comparison, it can change depending on the type
of threading that you're using. This can be very confusing. So, this
commit creates a new translator function
`rb_vm_insn_normalizing_translator` to always return the non-trace
version so that opcode comparisons don't have to worry about different
configurations.
[Feature #18589]
This reverts commits for [Feature #18589]:
* 8008fb7352
"Update formatting per feedback"
* 8f6eaca2e1
"Delete ID from constant cache table if it becomes empty on ISEQ free"
* 629908586b
"Finer-grained inline constant cache invalidation"
MSWin builds on AppVeyor have been crashing since the merger.
Current behavior - caches depend on a global counter. All constant mutations cause caches to be invalidated.
```ruby
class A
B = 1
end
def foo
A::B # inline cache depends on global counter
end
foo # populate inline cache
foo # hit inline cache
C = 1 # global counter increments, all caches are invalidated
foo # misses inline cache due to `C = 1`
```
Proposed behavior - caches depend on name components. Only constant mutations with corresponding names will invalidate the cache.
```ruby
class A
B = 1
end
def foo
A::B # inline cache depends constants named "A" and "B"
end
foo # populate inline cache
foo # hit inline cache
C = 1 # caches that depend on the name "C" are invalidated
foo # hits inline cache because IC only depends on "A" and "B"
```
Examples of breaking the new cache:
```ruby
module C
# Breaks `foo` cache because "A" constant is set and the cache in foo depends
# on "A" and "B"
class A; end
end
B = 1
```
We expect the new cache scheme to be invalidated less often because names aren't frequently reused. With the cache being invalidated less, we can rely on its stability more to keep our constant references fast and reduce the need to throw away generated code in YJIT.
Previously when checking ancestors, we would walk all the way up the
ancestry chain checking each parent for a matching class or module.
I believe this was especially unfriendly to CPU cache since for each
step we need to check two cache lines (the class and class ext).
This check is used quite often in:
* case statements
* rescue statements
* Calling protected methods
* Class#is_a?
* Module#===
* Module#<=>
I believe it's most common to check a class against a parent class, to
this commit aims to improve that (unfortunately does not help checking
for an included Module).
This is done by storing on each class the number and an array of all
parent classes, in order (BasicObject is at index 0). Using this we can
check whether a class is a subclass of another in constant time since we
know the location to expect it in the hierarchy.
Previously Time.now was switched to use Time.new as it added support for
the in: argument. Unfortunately because Class#new is a cfunc this
requires always allocating a Hash.
This commit switches Time.now back to using a builtin time_s_now. This
avoids the extra Hash allocation and is about 3x faster.
$ benchmark-driver -e './ruby;3.1::~/.rubies/ruby-3.1.0/bin/ruby;3.0::~/.rubies/ruby-3.0.2/bin/ruby' benchmark/time_now.yml
Warming up --------------------------------------
Time.now 6.704M i/s - 6.710M times in 1.000814s (149.16ns/i, 328clocks/i)
Time.now(in: "+09:00") 2.003M i/s - 2.112M times in 1.054330s (499.31ns/i)
Calculating -------------------------------------
./ruby 3.1 3.0
Time.now 7.693M 2.763M 6.394M i/s - 20.113M times in 2.614428s 7.278710s 3.145572s
Time.now(in: "+09:00") 2.030M 1.260M 1.617M i/s - 6.008M times in 2.960132s 4.769378s 3.716537s
Comparison:
Time.now
./ruby: 7693129.7 i/s
3.0: 6394109.2 i/s - 1.20x slower
3.1: 2763282.5 i/s - 2.78x slower
Time.now(in: "+09:00")
./ruby: 2029757.4 i/s
3.0: 1616652.3 i/s - 1.26x slower
3.1: 1259776.2 i/s - 1.61x slower
This provides a significant speedup for symbol, true, false,
nil, and 0-9, class/module, and a small speedup in most other cases.
Speedups (using included benchmarks):
:symbol :: 60%
0-9 :: 50%
Class/Module :: 50%
nil/true/false :: 20%
integer :: 10%
[] :: 10%
"" :: 3%
One reason this approach is faster is it reduces the number of
VM instructions for each interpolated value.
Initial idea, approach, and benchmarks from Eric Wong. I applied
the same approach against the master branch, updating it to handle
the significant internal changes since this was first proposed 4
years ago (such as CALL_INFO/CALL_CACHE -> CALL_DATA). I also
expanded it to optimize true/false/nil/0-9/class/module, and added
handling of missing methods, refined methods, and RUBY_DEBUG.
This renames the tostring insn to anytostring, and adds an
objtostring insn that implements the optimization. This requires
making a few functions non-static, and adding some non-static
functions.
This disables 4 YJIT tests. Those tests should be reenabled after
YJIT optimizes the new objtostring insn.
Implements [Feature #13715]
Co-authored-by: Eric Wong <e@80x24.org>
Co-authored-by: Alan Wu <XrXr@users.noreply.github.com>
Co-authored-by: Yusuke Endoh <mame@ruby-lang.org>
Co-authored-by: Koichi Sasada <ko1@atdot.net>
From the documentation of rb_obj_hash:
> Certain core classes such as Integer use built-in hash calculations and
> do not call the #hash method when used as a hash key.
So if you override, say, Integer#hash it won't be used from rb_hash_aref
and similar. This avoids method lookups in many common cases.
This commit uses the same optimization in rb_hash, a method used
internally and in the C API to get the hash value of an object. Usually
this is used to build the hash of an object based on its elements.
Previously it would always do a method lookup for 'hash'.
This is primarily intended to speed up hashing of Arrays and Hashes,
which call rb_hash for each element.
compare-ruby: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
built-ruby: ruby 3.1.0dev (2021-09-29T02:13:24Z fast_hash d670bf88b2) [x86_64-linux]
# Iteration per second (i/s)
| |compare-ruby|built-ruby|
|:----------------|-----------:|---------:|
|hash_aref_array | 1.008| 1.769|
| | -| 1.76x|
In vm_call_method_each_type, check for c_call and c_return events before
dispatching to vm_call_ivar and vm_call_attrset. With this approach, the
call cache will still dispatch directly to those functions, so this
change will only decrease performance for the first (uncached) call, and
even then, the performance decrease is very minimal.
This approach requires that we clear the call caches when tracing is
enabled or disabled. The approach currently switches all vm_call_ivar
and vm_call_attrset call caches to vm_call_general any time tracing is
enabled or disabled. So it could theoretically result in a slowdown for
code that constantly enables or disables tracing.
This approach does not handle targeted tracepoints, but from my testing,
c_call and c_return events are not supported for targeted tracepoints,
so that shouldn't matter.
This includes a benchmark showing the performance decrease is minimal
if detectable at all.
Fixes [Bug #16383]
Fixes [Bug #10470]
Co-authored-by: Takashi Kokubun <takashikkbn@gmail.com>
Redo of 34a2acdac788602c14bf05fb616215187badd504 and
931138b00696419945dc03e10f033b1f53cd50f3 which were reverted.
GitHub PR #4340.
This change implements a cache for class variables. Previously there was
no cache for cvars. Cvar access is slow due to needing to travel all the
way up th ancestor tree before returning the cvar value. The deeper the
ancestor tree the slower cvar access will be.
The benefits of the cache are more visible with a higher number of
included modules due to the way Ruby looks up class variables. The
benchmark here includes 26 modules and shows with the cache, this branch
is 6.5x faster when accessing class variables.
```
compare-ruby: ruby 3.1.0dev (2021-03-15T06:22:34Z master 9e5105c) [x86_64-darwin19]
built-ruby: ruby 3.1.0dev (2021-03-15T12:12:44Z add-cache-for-clas.. c6be009) [x86_64-darwin19]
| |compare-ruby|built-ruby|
|:--------|-----------:|---------:|
|vm_cvar | 5.681M| 36.980M|
| | -| 6.51x|
```
Benchmark.ips calling `ActiveRecord::Base.logger` from within a Rails
application. ActiveRecord::Base.logger has 71 ancestors. The more
ancestors a tree has, the more clear the speed increase. IE if Base had
only one ancestor we'd see no improvement. This benchmark is run on a
vanilla Rails application.
Benchmark code:
```ruby
require "benchmark/ips"
require_relative "config/environment"
Benchmark.ips do |x|
x.report "logger" do
ActiveRecord::Base.logger
end
end
```
Ruby 3.0 master / Rails 6.1:
```
Warming up --------------------------------------
logger 155.251k i/100ms
Calculating -------------------------------------
```
Ruby 3.0 with cvar cache / Rails 6.1:
```
Warming up --------------------------------------
logger 1.546M i/100ms
Calculating -------------------------------------
logger 14.857M (± 4.8%) i/s - 74.198M in 5.006202s
```
Lastly we ran a benchmark to demonstate the difference between master
and our cache when the number of modules increases. This benchmark
measures 1 ancestor, 30 ancestors, and 100 ancestors.
Ruby 3.0 master:
```
Warming up --------------------------------------
1 module 1.231M i/100ms
30 modules 432.020k i/100ms
100 modules 145.399k i/100ms
Calculating -------------------------------------
1 module 12.210M (± 2.1%) i/s - 61.553M in 5.043400s
30 modules 4.354M (± 2.7%) i/s - 22.033M in 5.063839s
100 modules 1.434M (± 2.9%) i/s - 7.270M in 5.072531s
Comparison:
1 module: 12209958.3 i/s
30 modules: 4354217.8 i/s - 2.80x (± 0.00) slower
100 modules: 1434447.3 i/s - 8.51x (± 0.00) slower
```
Ruby 3.0 with cvar cache:
```
Warming up --------------------------------------
1 module 1.641M i/100ms
30 modules 1.655M i/100ms
100 modules 1.620M i/100ms
Calculating -------------------------------------
1 module 16.279M (± 3.8%) i/s - 82.038M in 5.046923s
30 modules 15.891M (± 3.9%) i/s - 79.459M in 5.007958s
100 modules 16.087M (± 3.6%) i/s - 81.005M in 5.041931s
Comparison:
1 module: 16279458.0 i/s
100 modules: 16087484.6 i/s - same-ish: difference falls within error
30 modules: 15891406.2 i/s - same-ish: difference falls within error
```
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
* Improve perfomance for Integer#size method [Feature #17135]
* re-run ci
* Let MJIT frame skip work for Integer#size
Co-authored-by: Takashi Kokubun <takashikkbn@gmail.com>
The checkmatch instruction with VM_CHECKMATCH_TYPE_CASE calls
=== without a call cache. Emit a send instruction to make the call
instead. It includes a call cache.
The call cache improves throughput of using when statements to check the
class of a given object. This is useful for say, JSON serialization.
Use of a regular send instead of checkmatch also avoids taking the VM
lock every time, which is good for multi-ractor workloads.
Calculating -------------------------------------
master post
vm_case_classes 11.013M 16.172M i/s - 6.000M times in 0.544795s 0.371009s
vm_case_lit 2.296 2.263 i/s - 1.000 times in 0.435606s 0.441826s
vm_case 74.098M 64.338M i/s - 6.000M times in 0.080974s 0.093257s
Comparison:
vm_case_classes
post: 16172114.4 i/s
master: 11013316.9 i/s - 1.47x slower
vm_case_lit
master: 2.3 i/s
post: 2.3 i/s - 1.01x slower
vm_case
master: 74097858.6 i/s
post: 64338333.9 i/s - 1.15x slower
The vm_case benchmark is a bit slower post patch, possibily due to the
larger instruction sequence. The benchmark dispatches using
opt_case_dispatch so was not running checkmatch and does not make the
=== call post patch.
This change implements a cache for class variables. Previously there was
no cache for cvars. Cvar access is slow due to needing to travel all the
way up th ancestor tree before returning the cvar value. The deeper the
ancestor tree the slower cvar access will be.
The benefits of the cache are more visible with a higher number of
included modules due to the way Ruby looks up class variables. The
benchmark here includes 26 modules and shows with the cache, this branch
is 6.5x faster when accessing class variables.
```
compare-ruby: ruby 3.1.0dev (2021-03-15T06:22:34Z master 9e5105ca45) [x86_64-darwin19]
built-ruby: ruby 3.1.0dev (2021-03-15T12:12:44Z add-cache-for-clas.. c6be0093ae) [x86_64-darwin19]
| |compare-ruby|built-ruby|
|:--------|-----------:|---------:|
|vm_cvar | 5.681M| 36.980M|
| | -| 6.51x|
```
Benchmark.ips calling `ActiveRecord::Base.logger` from within a Rails
application. ActiveRecord::Base.logger has 71 ancestors. The more
ancestors a tree has, the more clear the speed increase. IE if Base had
only one ancestor we'd see no improvement. This benchmark is run on a
vanilla Rails application.
Benchmark code:
```ruby
require "benchmark/ips"
require_relative "config/environment"
Benchmark.ips do |x|
x.report "logger" do
ActiveRecord::Base.logger
end
end
```
Ruby 3.0 master / Rails 6.1:
```
Warming up --------------------------------------
logger 155.251k i/100ms
Calculating -------------------------------------
```
Ruby 3.0 with cvar cache / Rails 6.1:
```
Warming up --------------------------------------
logger 1.546M i/100ms
Calculating -------------------------------------
logger 14.857M (± 4.8%) i/s - 74.198M in 5.006202s
```
Lastly we ran a benchmark to demonstate the difference between master
and our cache when the number of modules increases. This benchmark
measures 1 ancestor, 30 ancestors, and 100 ancestors.
Ruby 3.0 master:
```
Warming up --------------------------------------
1 module 1.231M i/100ms
30 modules 432.020k i/100ms
100 modules 145.399k i/100ms
Calculating -------------------------------------
1 module 12.210M (± 2.1%) i/s - 61.553M in 5.043400s
30 modules 4.354M (± 2.7%) i/s - 22.033M in 5.063839s
100 modules 1.434M (± 2.9%) i/s - 7.270M in 5.072531s
Comparison:
1 module: 12209958.3 i/s
30 modules: 4354217.8 i/s - 2.80x (± 0.00) slower
100 modules: 1434447.3 i/s - 8.51x (± 0.00) slower
```
Ruby 3.0 with cvar cache:
```
Warming up --------------------------------------
1 module 1.641M i/100ms
30 modules 1.655M i/100ms
100 modules 1.620M i/100ms
Calculating -------------------------------------
1 module 16.279M (± 3.8%) i/s - 82.038M in 5.046923s
30 modules 15.891M (± 3.9%) i/s - 79.459M in 5.007958s
100 modules 16.087M (± 3.6%) i/s - 81.005M in 5.041931s
Comparison:
1 module: 16279458.0 i/s
100 modules: 16087484.6 i/s - same-ish: difference falls within error
30 modules: 15891406.2 i/s - same-ish: difference falls within error
```
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
This allows us to allocate the right size for the object in advance,
meaning that we don't have to pay the cost of ivar table extension
later. The idea is that if an object type ever became "extended" at
some point, then it is very likely it will become extended again. So we
may as well allocate the ivar table up front.
In regular assignment, Ruby evaluates the left hand side before
the right hand side. For example:
```ruby
foo[0] = bar
```
Calls `foo`, then `bar`, then `[]=` on the result of `foo`.
Previously, multiple assignment didn't work this way. If you did:
```ruby
abc.def, foo[0] = bar, baz
```
Ruby would previously call `bar`, then `baz`, then `abc`, then
`def=` on the result of `abc`, then `foo`, then `[]=` on the
result of `foo`.
This change makes multiple assignment similar to single assignment,
changing the evaluation order of the above multiple assignment code
to calling `abc`, then `foo`, then `bar`, then `baz`, then `def=` on
the result of `abc`, then `[]=` on the result of `foo`.
Implementing this is challenging with the stack-based virtual machine.
We need to keep track of all of the left hand side attribute setter
receivers and setter arguments, and then keep track of the stack level
while handling the assignment processing, so we can issue the
appropriate topn instructions to get the receiver. Here's an example
of how the multiple assignment is executed, showing the stack and
instructions:
```
self # putself
abc # send
abc, self # putself
abc, foo # send
abc, foo, 0 # putobject 0
abc, foo, 0, [bar, baz] # evaluate RHS
abc, foo, 0, [bar, baz], baz, bar # expandarray
abc, foo, 0, [bar, baz], baz, bar, abc # topn 5
abc, foo, 0, [bar, baz], baz, abc, bar # swap
abc, foo, 0, [bar, baz], baz, def= # send
abc, foo, 0, [bar, baz], baz # pop
abc, foo, 0, [bar, baz], baz, foo # topn 3
abc, foo, 0, [bar, baz], baz, foo, 0 # topn 3
abc, foo, 0, [bar, baz], baz, foo, 0, baz # topn 2
abc, foo, 0, [bar, baz], baz, []= # send
abc, foo, 0, [bar, baz], baz # pop
abc, foo, 0, [bar, baz] # pop
[bar, baz], foo, 0, [bar, baz] # setn 3
[bar, baz], foo, 0 # pop
[bar, baz], foo # pop
[bar, baz] # pop
```
As multiple assignment must deal with splats, post args, and any level
of nesting, it gets quite a bit more complex than this in non-trivial
cases. To handle this, struct masgn_state is added to keep
track of the overall state of the mass assignment, which stores a linked
list of struct masgn_attrasgn, one for each assigned attribute.
This adds a new optimization that replaces a topn 1/pop instruction
combination with a single swap instruction for multiple assignment
to non-aref attributes.
This new approach isn't compatible with one of the optimizations
previously used, in the case where the multiple assignment return value
was not needed, there was no lhs splat, and one of the left hand side
used an attribute setter. This removes that optimization. Removing
the optimization allowed for removing the POP_ELEMENT and adjust_stack
functions.
This adds a benchmark to measure how much slower multiple
assignment is with the correct evaluation order.
This benchmark shows:
* 4-9% decrease for attribute sets
* 14-23% decrease for array member sets
* Basically same speed for local variable sets
Importantly, it shows no significant difference between the popped
(where return value of the multiple assignment is not needed) and
!popped (where return value of the multiple assignment is needed)
cases for attribute and array member sets. This indicates the
previous optimization, which was dropped in the evaluation
order fix and only affected the popped case, is not important to
performance.
Fixes [Bug #4443]
The most common use case for `bind_call` is to protect from core
methods being redefined, for instance a typical use:
```ruby
UNBOUND_METHOD_MODULE_NAME = Module.instance_method(:name)
def real_mod_name(mod)
UNBOUND_METHOD_MODULE_NAME.bind_call(mod)
end
```
But it's extremely common that the method wasn't actually redefined.
In such case we can avoid creating a new callable method entry,
and simply delegate to the receiver.
This result in a 1.5-2X speed-up for the fast path, and little to
no impact on the slowpath:
```
compare-ruby: ruby 3.1.0dev (2021-02-05T06:33:00Z master b2674c1fd7) [x86_64-darwin19]
built-ruby: ruby 3.1.0dev (2021-02-15T10:35:17Z bind-call-fastpath d687e06615) [x86_64-darwin19]
| |compare-ruby|built-ruby|
|:---------|-----------:|---------:|
|fastpath | 11.325M| 16.393M|
| | -| 1.45x|
|slowpath | 10.488M| 10.242M|
| | 1.02x| -|
```
* Add a benchmark-driver runner for Ractor
* Process.clock_gettime(Process:CLOCK_MONOTONIC) could be slow
in Ruby 3.0 Ractor
* Fetching Time could also be slow
* Fix a comment
* Assert overriding a private method
because the name "MJIT" is an internal code name, it's inconsistent with
--jit while they are related to each other, and I want to discourage future
JIT implementation-specific (e.g. MJIT-specific) APIs by this rename.
[Feature #17490]
Allocating an instance of a class uses the allocator for the class. When
the class has no allocator set, Ruby looks for it in the super class
(see rb_get_alloc_func()).
It's uncommon for classes created from Ruby code to ever have an
allocator set, so it's common during the allocation process to search
all the way to BasicObject from the class with which the allocation is
being performed. This makes creating instances of classes that have
long ancestry chains more expensive than creating instances of classes
have that shorter ancestry chains.
Setting the allocator at class creation time removes the need to perform
a search for the alloctor during allocation.
This is a breaking change for C-extensions that assume that classes
created from Ruby code have no allocator set. Libraries that setup a
class hierarchy in Ruby code and then set the allocator on some parent
class, for example, can experience breakage. This seems like an unusual
use case and hopefully it is rare or non-existent in practice.
Rails has many classes that have upwards of 60 elements in the ancestry
chain and benchmark shows a significant improvement for allocating with
a class that includes 64 modules.
```
pre: ruby 3.0.0dev (2020-11-12T14:39:27Z master 6325866421)
post: ruby 3.0.0dev (2020-11-12T20:15:30Z cut-allocator-lookup)
Comparison:
allocate_8_deep
post: 10336985.6 i/s
pre: 8691873.1 i/s - 1.19x slower
allocate_32_deep
post: 10423181.2 i/s
pre: 6264879.1 i/s - 1.66x slower
allocate_64_deep
post: 10541851.2 i/s
pre: 4936321.5 i/s - 2.14x slower
allocate_128_deep
post: 10451505.0 i/s
pre: 3031313.5 i/s - 3.45x slower
```
This benchmark demonstrates the performance of setting an instance
variable when the type of object is constantly changing. This benchmark
should give us an idea of the performance of ivar setting in a
polymorphic environment
When the inline cache is written, the iv table will contain an entry for
the instance variable. If we get an inline cache hit, then we know the
iv table must contain a value for the index written to the inline cache.
If the index in the inline cache is larger than the list on the object,
but *smaller* than the iv index table on the class, then we can just
eagerly allocate the iv list to be the same size as the iv index table.
This avoids duplicate work of checking frozen as well as looking up the
index for the particular instance variable name.
This PR improves the performance of `super` calls. While working on some
Rails optimizations jhawthorn discovered that `super` calls were slower
than expected.
The changes here do the following:
1) Adds a check for whether the call frame is not equal to the method
entry iseq. This avoids the `rb_obj_is_kind_of` check on the next line
which is quite slow. If the current call frame is equal to the method
entry we know we can't have an instance eval, etc.
2) Changes `FL_TEST` to `FL_TEST_RAW`. This is safe because we've
already done the check for `T_ICLASS` above.
3) Adds a benchmark for `T_ICLASS` super calls.
4) Note: makes a chage for `method_entry_cref` to use `const`.
On master the benchmarks showed that `super` is 1.76x slower. Our
changes improved the performance so that it is now only 1.36x slower.
Benchmark IPS:
```
Warming up --------------------------------------
super 244.918k i/100ms
method call 383.007k i/100ms
Calculating -------------------------------------
super 2.280M (± 6.7%) i/s - 11.511M in 5.071758s
method call 3.834M (± 4.9%) i/s - 19.150M in 5.008444s
Comparison:
method call: 3833648.3 i/s
super: 2279837.9 i/s - 1.68x (± 0.00) slower
```
With changes:
```
Warming up --------------------------------------
super 308.777k i/100ms
method call 375.051k i/100ms
Calculating -------------------------------------
super 2.951M (± 5.4%) i/s - 14.821M in 5.039592s
method call 3.551M (± 4.9%) i/s - 18.002M in 5.081695s
Comparison:
method call: 3551372.7 i/s
super: 2950557.9 i/s - 1.20x (± 0.00) slower
```
Ruby VM benchmarks also showed an improvement:
Existing `vm_super` benchmark`.
```
$ make benchmark ITEM=vm_super
| |compare-ruby|built-ruby|
|:---------|-----------:|---------:|
|vm_super | 21.555M| 37.819M|
| | -| 1.75x|
```
New `vm_iclass_super` benchmark:
```
$ make benchmark ITEM=vm_iclass_super
| |compare-ruby|built-ruby|
|:----------------|-----------:|---------:|
|vm_iclass_super | 1.669M| 3.683M|
| | -| 2.21x|
```
This is the benchmark script used for the benchmark-ips benchmarks:
```ruby
require "benchmark/ips"
class Foo
def zuper; end
def top; end
last_method = "top"
("A".."M").each do |module_name|
eval <<-EOM
module #{module_name}
def zuper; super; end
def #{module_name.downcase}
#{last_method}
end
end
prepend #{module_name}
EOM
last_method = module_name.downcase
end
end
foo = Foo.new
Benchmark.ips do |x|
x.report "super" do
foo.zuper
end
x.report "method call" do
foo.m
end
x.compare!
end
```
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
Co-authored-by: John Hawthorn <john@hawthorn.email>
* Rewrite Kernel#tap with Ruby
This was good for VM too, but of course my intention is to unblock JIT's inlining of a block over yield
(inlining invokeyield has not been committed though).
* Fix test_settracefunc
About the :tap deletions, the :tap events are actually traced (we already have a TracePoint test for builtin methods),
but it's filtered out by tp.path == "xyzzy" (it became "<internal:kernel>"). We could trace tp.path == "<internal:kernel>"
cases too, but the lineno is impacted by kernel.rb changes and I didn't want to make it fragile for kernel.rb lineno changes.
for opt_* insns.
opt_eq handles rb_obj_equal inside opt_eq, and all other cfunc is
handled by opt_send_without_block. Therefore we can't decide which insn
should be generated by checking whether it's cfunc cc or not.
```
$ benchmark-driver -v --rbenv 'before --jit;after --jit' benchmark/mjit_opt_cc_insns.yml --repeat-count=4
before --jit: ruby 2.8.0dev (2020-06-26T05:21:43Z master 9dbc2294a6) +JIT [x86_64-linux]
after --jit: ruby 2.8.0dev (2020-06-26T06:30:18Z master 75cece1b0b) +JIT [x86_64-linux]
last_commit=Decide JIT-ed insn based on cached cfunc
Calculating -------------------------------------
before --jit after --jit
mjit_nil?(1) 73.878M 74.021M i/s - 40.000M times in 0.541432s 0.540391s
mjit_not(1) 72.635M 74.601M i/s - 40.000M times in 0.550702s 0.536187s
mjit_eq(1, nil) 7.331M 7.445M i/s - 8.000M times in 1.091211s 1.074596s
mjit_eq(nil, 1) 49.450M 64.711M i/s - 8.000M times in 0.161781s 0.123627s
Comparison:
mjit_nil?(1)
after --jit: 74020528.4 i/s
before --jit: 73878185.9 i/s - 1.00x slower
mjit_not(1)
after --jit: 74600882.0 i/s
before --jit: 72634507.6 i/s - 1.03x slower
mjit_eq(1, nil)
after --jit: 7444657.4 i/s
before --jit: 7331304.3 i/s - 1.02x slower
mjit_eq(nil, 1)
after --jit: 64710790.6 i/s
before --jit: 49449507.4 i/s - 1.31x slower
```
because opt_nil/opt_not/opt_eq populates cc even when it doesn't
fallback to opt_send_without_block because of vm_method_cfunc_is.
```
$ benchmark-driver -v --rbenv 'before --jit;after --jit' benchmark/mjit_opt_cc_insns.yml --repeat-count=4
before --jit: ruby 2.8.0dev (2020-06-22T08:11:24Z master d231b8f95b) +JIT [x86_64-linux]
after --jit: ruby 2.8.0dev (2020-06-22T08:53:27Z master e1125879ed) +JIT [x86_64-linux]
last_commit=Compile opt_send for opt_* only when cc has ISeq
Calculating -------------------------------------
before --jit after --jit
mjit_nil?(1) 54.106M 73.693M i/s - 40.000M times in 0.739288s 0.542795s
mjit_not(1) 53.398M 74.477M i/s - 40.000M times in 0.749090s 0.537075s
mjit_eq(1, nil) 7.427M 6.497M i/s - 8.000M times in 1.077136s 1.231326s
Comparison:
mjit_nil?(1)
after --jit: 73692594.3 i/s
before --jit: 54106108.4 i/s - 1.36x slower
mjit_not(1)
after --jit: 74477487.9 i/s
before --jit: 53398125.0 i/s - 1.39x slower
mjit_eq(1, nil)
before --jit: 7427105.9 i/s
after --jit: 6497063.0 i/s - 1.14x slower
```
Actually opt_eq becomes slower by this. Maybe it's indeed using
opt_send_without_block, but I'll approach that one in another commit.
These days I don't use `make benchmark`. The YAML files should be
executable with bare `benchmark-driver` CLI without passing
`RUBYOPT=-Ibenchmark/lib`.
A prerequisite to fix https://bugs.ruby-lang.org/issues/15589 with JIT.
This commit alone doesn't make a significant difference yet, but I thought
this commit should be committed independently.
This method override was discussed in [Misc #16961].
The vm1_ prefix and vm2_ had had special meaning until
820ad9cb1d and
12068aa4e9. AFAIK there's no special
meaning in vm3_ prefix.
As they have confused people (like "In `benchmark` what is difference
between `vm1_`, `vm2_` and `vm3_`"), I'd like to remove the obsoleted
prefix as we obviated that two years ago.
for VM_METHOD_TYPE_CFUNC.
This has been known to decrease optcarrot fps:
```
$ benchmark-driver -v --rbenv 'before --jit;after --jit' benchmark.yml --repeat-count=24 --output=all
before --jit: ruby 2.8.0dev (2020-04-13T16:25:13Z master fb40495cd9) +JIT [x86_64-linux]
after --jit: ruby 2.8.0dev (2020-04-13T23:23:11Z mjit-inline-c bdcd06d159) +JIT [x86_64-linux]
Calculating -------------------------------------
before --jit after --jit
Optcarrot Lan_Master.nes 66.38132676191719 67.41369177299630 fps
69.42728743772243 68.90327567263054
72.16028300263211 69.62605130880686
72.46631319102777 70.48818243767207
73.37078877002490 70.79522887347566
73.69422431217367 70.99021920193194
74.01471487018695 74.69931965402584
75.48685183295630 74.86714575949016
75.54445264507932 75.97864419721677
77.28089738169756 76.48908637569581
78.04183397891302 76.54320932488021
78.36807984096562 76.59407262898067
78.92898762543574 77.31316743361343
78.93576483233765 77.97153484180480
79.13754917503078 77.98478782102325
79.62648945850653 78.02263322726446
79.86334213878064 78.26333724045934
80.05100635898518 78.60056756355614
80.26186843769584 78.91082645644468
80.34205717020330 79.01226659142263
80.62286066044338 79.32733939423721
80.95883033058557 79.63793060542024
80.97376819251613 79.73108936622778
81.23050939202896 80.18280109433088
```
and I deleted this capability in an early stage of YARV-MJIT development:
0ab130feee
I suspect either of the following things could be the cause:
* Directly calling vm_call_cfunc requires more optimization effort in GCC,
resulting in 30ms-ish compilation time increase for such methods and
decreasing the number of methods compiled in a benchmarked period.
* Code size increase => icache miss hit
These hypotheses could be verified by some methodologies. However, I'd
like to introduce this regardless of the result because this blocks
inlining C method's definition.
I may revert this commit when I give up to implement inlining C method
definition, which requires this change.
Microbenchmark-wise, this gives slight performance improvement:
```
$ benchmark-driver -v --rbenv 'before --jit;after --jit' benchmark/mjit_send_cfunc.yml --repeat-count=4
before --jit: ruby 2.8.0dev (2020-04-13T16:25:13Z master fb40495cd9) +JIT [x86_64-linux]
after --jit: ruby 2.8.0dev (2020-04-13T23:23:11Z mjit-inline-c bdcd06d159) +JIT [x86_64-linux]
Calculating -------------------------------------
before --jit after --jit
mjit_send_cfunc 41.961M 56.489M i/s - 100.000M times in 2.383143s 1.770244s
Comparison:
mjit_send_cfunc
after --jit: 56489372.5 i/s
before --jit: 41961388.1 i/s - 1.35x slower
```
Previously, passing a keyword splat to a method always allocated
a hash on the caller side, and accepting arbitrary keywords in
a method allocated a separate hash on the callee side. Passing
explicit keywords to a method that accepted a keyword splat
did not allocate a hash on the caller side, but resulted in two
hashes allocated on the callee side.
This commit makes passing a single keyword splat to a method not
allocate a hash on the caller side. Passing multiple keyword
splats or a mix of explicit keywords and a keyword splat still
generates a hash on the caller side. On the callee side,
if arbitrary keywords are not accepted, it does not allocate a
hash. If arbitrary keywords are accepted, it will allocate a
hash, but this commit uses a callinfo flag to indicate whether
the caller already allocated a hash, and if so, the callee can
use the passed hash without duplicating it. So this commit
should make it so that a maximum of a single hash is allocated
during method calls.
To set the callinfo flag appropriately, method call argument
compilation checks if only a single keyword splat is given.
If only one keyword splat is given, the VM_CALL_KW_SPLAT_MUT
callinfo flag is not set, since in that case the keyword
splat is passed directly and not mutable. If more than one
splat is used, a new hash needs to be generated on the caller
side, and in that case the callinfo flag is set, indicating
the keyword splat is mutable by the callee.
In compile_hash, used for both hash and keyword argument
compilation, if compiling keyword arguments and only a
single keyword splat is used, pass the argument directly.
On the caller side, in vm_args.c, the callinfo flag needs to
be recognized and handled. Because the keyword splat
argument may not be a hash, it needs to be converted to a
hash first if not. Then, unless the callinfo flag is set,
the hash needs to be duplicated. The temporary copy of the
callinfo flag, kw_flag, is updated if a hash was duplicated,
to prevent the need to duplicate it again. If we are
converting to a hash or duplicating a hash, we need to update
the argument array, which can including duplicating the
positional splat array if one was passed. CALLER_SETUP_ARG
and a couple other places needs to be modified to handle
similar issues for other types of calls.
This includes fairly comprehensive tests for different ways
keywords are handled internally, checking that you get equal
results but that keyword splats on the caller side result in
distinct objects for keyword rest parameters.
Included are benchmarks for keyword argument calls.
Brief results when compiled without optimization:
def kw(a: 1) a end
def kws(**kw) kw end
h = {a: 1}
kw(a: 1) # about same
kw(**h) # 2.37x faster
kws(a: 1) # 1.30x faster
kws(**h) # 2.19x faster
kw(a: 1, **h) # 1.03x slower
kw(**h, **h) # about same
kws(a: 1, **h) # 1.16x faster
kws(**h, **h) # 1.14x faster
Instead of searching twice to extract and to delete, extract and
delete the found position at the first search.
This makes faster nearly twice, for regexps and strings.
| |compare-ruby|built-ruby|
|:-------------|-----------:|---------:|
|regexp-short | 2.143M| 3.918M|
|regexp-long | 105.162k| 205.410k|
|string-short | 3.789M| 7.964M|
|string-long | 1.301M| 2.457M|
I noticed that some files in rubygems were executable, and I could think
of no reason why they should be.
In general, I think ruby files should never have the executable bit set
unless they include a shebang, so I run the following command over the
whole repo:
```bash
find . -name '*.rb' -type f -executable -exec bash -c 'grep -L "^#!" $1 || chmod -x $1' _ {} \;
```
* Stop making a redundant hash copy in Hash#dup
It was making a copy of the hash without rehashing, then created an
extra copy of the hash to do the rehashing. Since rehashing creates
a new copy already, this change just uses that rehashing to make
the copy.
[Bug #16121]
* Remove redundant Check_Type after to_hash
* Fix freeing and clearing destination hash in Hash#initialize_copy
The code was assuming the state of the destination hash based on the
source hash for clearing any existing table on it. If these don't match,
then that can cause the old table to be leaked. This can be seen by
compiling hash.c with `#define HASH_DEBUG 1` and running the following
script, which will crash from a debug assertion.
```ruby
h = 9.times.map { |i| [i, i] }.to_h
h.send(:initialize_copy, {})
```
* Remove dead code paths in rb_hash_initialize_copy
Given that `RHASH_ST_TABLE_P(h)` is defined as `(!RHASH_AR_TABLE_P(h))`
it shouldn't be possible for a hash to be neither of these, so there
is no need for the removed `else if` blocks.
* Share implementation between Hash#replace and Hash#initialize_copy
This also fixes key rehashing for small hashes backed by an array
table for Hash#replace. This used to be done consistently in ruby
2.5.x, but stopped being done for small arrays in ruby 2.6.x.
This also bring optimization improvements that were done for
Hash#initialize_copy to Hash#replace.
* Add the Hash#dup benchmark
I noticed that in case of cache misshit, re-calculated cc->me can
be the same method entry than the pevious one. That is an okay
situation but can't we partially reuse the cache, because cc->call
should still be valid then?
One thing that has to be special-cased is when the method entry
gets amended by some refinements. That happens behind-the-scene
of call cache mechanism. We have to check if cc->me->def points to
the previously saved one.
Calculating -------------------------------------
trunk ours
vm2_poly_same_method 1.534M 2.025M i/s - 6.000M times in 3.910203s 2.962752s
Comparison:
vm2_poly_same_method
ours: 2025143.9 i/s
trunk: 1534447.2 i/s - 1.32x slower