The checkmatch instruction with VM_CHECKMATCH_TYPE_CASE calls
=== without a call cache. Emit a send instruction to make the call
instead. It includes a call cache.
The call cache improves throughput of using when statements to check the
class of a given object. This is useful for say, JSON serialization.
Use of a regular send instead of checkmatch also avoids taking the VM
lock every time, which is good for multi-ractor workloads.
Calculating -------------------------------------
master post
vm_case_classes 11.013M 16.172M i/s - 6.000M times in 0.544795s 0.371009s
vm_case_lit 2.296 2.263 i/s - 1.000 times in 0.435606s 0.441826s
vm_case 74.098M 64.338M i/s - 6.000M times in 0.080974s 0.093257s
Comparison:
vm_case_classes
post: 16172114.4 i/s
master: 11013316.9 i/s - 1.47x slower
vm_case_lit
master: 2.3 i/s
post: 2.3 i/s - 1.01x slower
vm_case
master: 74097858.6 i/s
post: 64338333.9 i/s - 1.15x slower
The vm_case benchmark is a bit slower post patch, possibily due to the
larger instruction sequence. The benchmark dispatches using
opt_case_dispatch so was not running checkmatch and does not make the
=== call post patch.
It looks for "checkmatch", when it could be applied to anything that has
"newrange".
Making the optimization target more ranges might only be fair play when
all ranges are frozen. So I'm putting a reference to the ticket that
froze all ranges.
[Feature #15504]
Before this change, CDHASH operands were built as plain hashes when
loaded from binary. Without setting up the hash with the correct
st_table type, the hash can sometimes be an ar_table. When the hash is
an ar_table, lookups can call the `eql?` method on keys of the hash,
which makes the `opt_case_dispatch` instruction not "leaf" as it
implicitly declares.
The following script trips the stack canary for checking the leaf
attribute for `opt_case_dispatch` on VM_CHECK_MODE > 0 (enabled by
default with RUBY_DEBUG).
rb_vm_iseq = RubyVM::InstructionSequence
iseq = rb_vm_iseq.compile(<<-EOF)
case Class.new(String).new("foo")
when "foo"
42
end
EOF
puts rb_vm_iseq.load_from_binary(iseq.to_binary).eval
This commit changes the binary loading logic to build CDHASH with the
right st_table type. The dumping logic and the dump format stays the
same
609de71f04 fixes the issue by using
`throw` insn if `ensure` is used. However, that patch introduce
additional `throw` even if it is not needed. This patch solves
the issue.
This issue is pointed by @mame.
Rational literals are those integers suffixed with `r`. They tend to
be a part of more complex expressions like `123/456r`, but in theory
they can live alone. When such "bare" rational literals are passed to
case-when branch, we have to take care of them. Fixes [Bug #17854]
Instead of on read. Once it's in the inline cache we never have to make
one again. We want to eventually put the value into the cache, and the
best opportunity to do that is when you write the value.
This change implements a cache for class variables. Previously there was
no cache for cvars. Cvar access is slow due to needing to travel all the
way up th ancestor tree before returning the cvar value. The deeper the
ancestor tree the slower cvar access will be.
The benefits of the cache are more visible with a higher number of
included modules due to the way Ruby looks up class variables. The
benchmark here includes 26 modules and shows with the cache, this branch
is 6.5x faster when accessing class variables.
```
compare-ruby: ruby 3.1.0dev (2021-03-15T06:22:34Z master 9e5105ca45) [x86_64-darwin19]
built-ruby: ruby 3.1.0dev (2021-03-15T12:12:44Z add-cache-for-clas.. c6be0093ae) [x86_64-darwin19]
| |compare-ruby|built-ruby|
|:--------|-----------:|---------:|
|vm_cvar | 5.681M| 36.980M|
| | -| 6.51x|
```
Benchmark.ips calling `ActiveRecord::Base.logger` from within a Rails
application. ActiveRecord::Base.logger has 71 ancestors. The more
ancestors a tree has, the more clear the speed increase. IE if Base had
only one ancestor we'd see no improvement. This benchmark is run on a
vanilla Rails application.
Benchmark code:
```ruby
require "benchmark/ips"
require_relative "config/environment"
Benchmark.ips do |x|
x.report "logger" do
ActiveRecord::Base.logger
end
end
```
Ruby 3.0 master / Rails 6.1:
```
Warming up --------------------------------------
logger 155.251k i/100ms
Calculating -------------------------------------
```
Ruby 3.0 with cvar cache / Rails 6.1:
```
Warming up --------------------------------------
logger 1.546M i/100ms
Calculating -------------------------------------
logger 14.857M (± 4.8%) i/s - 74.198M in 5.006202s
```
Lastly we ran a benchmark to demonstate the difference between master
and our cache when the number of modules increases. This benchmark
measures 1 ancestor, 30 ancestors, and 100 ancestors.
Ruby 3.0 master:
```
Warming up --------------------------------------
1 module 1.231M i/100ms
30 modules 432.020k i/100ms
100 modules 145.399k i/100ms
Calculating -------------------------------------
1 module 12.210M (± 2.1%) i/s - 61.553M in 5.043400s
30 modules 4.354M (± 2.7%) i/s - 22.033M in 5.063839s
100 modules 1.434M (± 2.9%) i/s - 7.270M in 5.072531s
Comparison:
1 module: 12209958.3 i/s
30 modules: 4354217.8 i/s - 2.80x (± 0.00) slower
100 modules: 1434447.3 i/s - 8.51x (± 0.00) slower
```
Ruby 3.0 with cvar cache:
```
Warming up --------------------------------------
1 module 1.641M i/100ms
30 modules 1.655M i/100ms
100 modules 1.620M i/100ms
Calculating -------------------------------------
1 module 16.279M (± 3.8%) i/s - 82.038M in 5.046923s
30 modules 15.891M (± 3.9%) i/s - 79.459M in 5.007958s
100 modules 16.087M (± 3.6%) i/s - 81.005M in 5.041931s
Comparison:
1 module: 16279458.0 i/s
100 modules: 16087484.6 i/s - same-ish: difference falls within error
30 modules: 15891406.2 i/s - same-ish: difference falls within error
```
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
... then, new_insn_core extracts nd_line(node).
Also, if a macro "EXPERIMENTAL_ISEQ_NODE_ID" is defined, this changeset
keeps nd_node_id(node) for each instruction. This is intended for
TypeProf to identify what AST::Node corresponds to each instruction.
This patch is originally authored by @yui-knk for showing which column a
NoMethodError occurred.
https://github.com/ruby/ruby/compare/master...yui-knk:feature/node_id
Co-Authored-By: Yuichiro Kaneko <yui-knk@ruby-lang.org>
add_ensure_iseq() adds ensure block to the end of
jump such as next/redo/return. However, if the rescue
cause are in the body, this rescue catches the exception
in ensure clause.
iter do
next
rescue
R
ensure
raise
end
In this case, R should not be executed, but executed without this patch.
Fixes [Bug #13930]
Fixes [Bug #16618]
A part of tests are written by @jeremyevans https://github.com/ruby/ruby/pull/4291
In regular assignment, Ruby evaluates the left hand side before
the right hand side. For example:
```ruby
foo[0] = bar
```
Calls `foo`, then `bar`, then `[]=` on the result of `foo`.
Previously, multiple assignment didn't work this way. If you did:
```ruby
abc.def, foo[0] = bar, baz
```
Ruby would previously call `bar`, then `baz`, then `abc`, then
`def=` on the result of `abc`, then `foo`, then `[]=` on the
result of `foo`.
This change makes multiple assignment similar to single assignment,
changing the evaluation order of the above multiple assignment code
to calling `abc`, then `foo`, then `bar`, then `baz`, then `def=` on
the result of `abc`, then `[]=` on the result of `foo`.
Implementing this is challenging with the stack-based virtual machine.
We need to keep track of all of the left hand side attribute setter
receivers and setter arguments, and then keep track of the stack level
while handling the assignment processing, so we can issue the
appropriate topn instructions to get the receiver. Here's an example
of how the multiple assignment is executed, showing the stack and
instructions:
```
self # putself
abc # send
abc, self # putself
abc, foo # send
abc, foo, 0 # putobject 0
abc, foo, 0, [bar, baz] # evaluate RHS
abc, foo, 0, [bar, baz], baz, bar # expandarray
abc, foo, 0, [bar, baz], baz, bar, abc # topn 5
abc, foo, 0, [bar, baz], baz, abc, bar # swap
abc, foo, 0, [bar, baz], baz, def= # send
abc, foo, 0, [bar, baz], baz # pop
abc, foo, 0, [bar, baz], baz, foo # topn 3
abc, foo, 0, [bar, baz], baz, foo, 0 # topn 3
abc, foo, 0, [bar, baz], baz, foo, 0, baz # topn 2
abc, foo, 0, [bar, baz], baz, []= # send
abc, foo, 0, [bar, baz], baz # pop
abc, foo, 0, [bar, baz] # pop
[bar, baz], foo, 0, [bar, baz] # setn 3
[bar, baz], foo, 0 # pop
[bar, baz], foo # pop
[bar, baz] # pop
```
As multiple assignment must deal with splats, post args, and any level
of nesting, it gets quite a bit more complex than this in non-trivial
cases. To handle this, struct masgn_state is added to keep
track of the overall state of the mass assignment, which stores a linked
list of struct masgn_attrasgn, one for each assigned attribute.
This adds a new optimization that replaces a topn 1/pop instruction
combination with a single swap instruction for multiple assignment
to non-aref attributes.
This new approach isn't compatible with one of the optimizations
previously used, in the case where the multiple assignment return value
was not needed, there was no lhs splat, and one of the left hand side
used an attribute setter. This removes that optimization. Removing
the optimization allowed for removing the POP_ELEMENT and adjust_stack
functions.
This adds a benchmark to measure how much slower multiple
assignment is with the correct evaluation order.
This benchmark shows:
* 4-9% decrease for attribute sets
* 14-23% decrease for array member sets
* Basically same speed for local variable sets
Importantly, it shows no significant difference between the popped
(where return value of the multiple assignment is not needed) and
!popped (where return value of the multiple assignment is needed)
cases for attribute and array member sets. This indicates the
previous optimization, which was dropped in the evaluation
order fix and only affected the popped case, is not important to
performance.
Fixes [Bug #4443]
Previously, defined? could result in many more method calls than
the code it was checking. `defined? a.b.c.d.e.f` generated 15 calls,
with `a` called 5 times, `b` called 4 times, etc.. This was due to
the fact that defined works in a recursive manner, but it previously
did not cache results. So for `defined? a.b.c.d.e.f`, the logic was
similar to
```ruby
return nil unless defined? a
return nil unless defined? a.b
return nil unless defined? a.b.c
return nil unless defined? a.b.c.d
return nil unless defined? a.b.c.d.e
return nil unless defined? a.b.c.d.e.f
"method"
```
With this change, the logic is similar to the following, without
the creation of a local variable:
```ruby
return nil unless defined? a
_ = a
return nil unless defined? _.b
_ = _.b
return nil unless defined? _.c
_ = _.c
return nil unless defined? _.d
_ = _.d
return nil unless defined? _.e
_ = _.e
return nil unless defined? _.f
"method"
```
In addition to eliminating redundant method calls for defined
statements, this greatly simplifies the instruction sequences by
eliminating duplication. Previously:
```
0000 putnil ( 1)[Li]
0001 putself
0002 defined func, :a, false
0006 branchunless 73
0008 putself
0009 opt_send_without_block <calldata!mid:a, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0011 defined method, :b, false
0015 branchunless 73
0017 putself
0018 opt_send_without_block <calldata!mid:a, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0020 opt_send_without_block <calldata!mid:b, argc:0, ARGS_SIMPLE>
0022 defined method, :c, false
0026 branchunless 73
0028 putself
0029 opt_send_without_block <calldata!mid:a, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0031 opt_send_without_block <calldata!mid:b, argc:0, ARGS_SIMPLE>
0033 opt_send_without_block <calldata!mid:c, argc:0, ARGS_SIMPLE>
0035 defined method, :d, false
0039 branchunless 73
0041 putself
0042 opt_send_without_block <calldata!mid:a, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0044 opt_send_without_block <calldata!mid:b, argc:0, ARGS_SIMPLE>
0046 opt_send_without_block <calldata!mid:c, argc:0, ARGS_SIMPLE>
0048 opt_send_without_block <calldata!mid:d, argc:0, ARGS_SIMPLE>
0050 defined method, :e, false
0054 branchunless 73
0056 putself
0057 opt_send_without_block <calldata!mid:a, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0059 opt_send_without_block <calldata!mid:b, argc:0, ARGS_SIMPLE>
0061 opt_send_without_block <calldata!mid:c, argc:0, ARGS_SIMPLE>
0063 opt_send_without_block <calldata!mid:d, argc:0, ARGS_SIMPLE>
0065 opt_send_without_block <calldata!mid:e, argc:0, ARGS_SIMPLE>
0067 defined method, :f, true
0071 swap
0072 pop
0073 leave
```
After change:
```
0000 putnil ( 1)[Li]
0001 putself
0002 dup
0003 defined func, :a, false
0007 branchunless 52
0009 opt_send_without_block <calldata!mid:a, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0011 dup
0012 defined method, :b, false
0016 branchunless 52
0018 opt_send_without_block <calldata!mid:b, argc:0, ARGS_SIMPLE>
0020 dup
0021 defined method, :c, false
0025 branchunless 52
0027 opt_send_without_block <calldata!mid:c, argc:0, ARGS_SIMPLE>
0029 dup
0030 defined method, :d, false
0034 branchunless 52
0036 opt_send_without_block <calldata!mid:d, argc:0, ARGS_SIMPLE>
0038 dup
0039 defined method, :e, false
0043 branchunless 52
0045 opt_send_without_block <calldata!mid:e, argc:0, ARGS_SIMPLE>
0047 defined method, :f, true
0051 swap
0052 pop
0053 leave
```
This fixes issues where for pathological small examples, Ruby would generate
huge instruction sequences.
Unfortunately, implementing this support is kind of a hack. This adds another
parameter to compile_call for whether we should assume the receiver is already
present on the stack, and has defined? set that parameter for the specific
case where it is compiling a method call where the receiver is also a method
call.
defined_expr0 also takes an additional parameter for whether it should leave
the results of the method call on the stack. If that argument is true, in
the case where the method isn't defined, we jump to the pop before the leave,
so the extra result is not left on the stack. This requires space for an
additional label, so lfinish now needs to be able to hold 3 labels.
Fixes [Bug #17649]
Fixes [Bug #13708]
Peephole optimization doesn't play well with find pattern at
least. The only case when a pattern matching could have
unreachable patterns is when we have lasgn/dasgn node, which
shouldn't happen in real-life.
Fixes https://bugs.ruby-lang.org/issues/17534
Callinfo was being written in to an array and the GC would not see the
reference on the stack. `new_insn_send` creates a new callinfo object,
then it calls `new_insn_core`. `new_insn_core` allocates a new INSN
linked list item, which can end up calling `xmalloc` which will trigger
a GC:
70cd351c7c/compile.c (L968-L969)
Since the callinfo object isn't on the stack, the GC won't see it, and
it can get collected. This patch just refactors `new_insn_send` to keep
the object on the stack
Co-authored-by: John Hawthorn <john@hawthorn.email>
We don't need nop padding when the catch tables are only for break /
next / redo, so lets avoid them. This eliminates nop padding in
many lambdas.
Co-authored-by: Alan Wu <XrXr@users.noreply.github.com>
constant cache `IC` is accessed by non-atomic manner and there are
thread-safety issues, so Ruby 3.0 disables to use const cache on
non-main ractors.
This patch enables it by introducing `imemo_constcache` and allocates
it by every re-fill of const cache like `imemo_callcache`.
[Bug #17510]
Now `IC` only has one entry `IC::entry` and it points to
`iseq_inline_constant_cache_entry`, managed by T_IMEMO object.
`IC` is atomic data structure so `rb_mjit_before_vm_ic_update()` and
`rb_mjit_after_vm_ic_update()` is not needed.
Previously we would add code to check if an ivar was defined when using
`@foo ||= 123`, which was slower than `@foo || (@foo = 123)` when `@foo`
was already defined.
Recently 01b7d5acc7 made it so that
accessing an undefined variable no longer generates a warning, making
the defined check unnecessary and both statements exactly equal.
This commit avoids emitting the defined instruction when compiling
NODE_OP_ASGN_OR with a NODE_IVAR.
Before:
$ ruby --dump=insn -e '@foo ||= 123'
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,12)> (catch: FALSE)
0000 putnil ( 1)[Li]
0001 defined instance-variable, :@foo, false
0005 branchunless 14
0007 getinstancevariable :@foo, <is:0>
0010 dup
0011 branchif 20
0013 pop
0014 putobject 123
0016 dup
0017 setinstancevariable :@foo, <is:0>
0020 leave
After:
$ ./ruby --dump=insn -e '@foo ||= 123'
== disasm: #<ISeq:<main>@-e:1 (1,0)-(1,12)> (catch: FALSE)
0000 getinstancevariable :@foo, <is:0> ( 1)[Li]
0003 dup
0004 branchif 13
0006 pop
0007 putobject 123
0009 dup
0010 setinstancevariable :@foo, <is:0>
0013 leave
This seems to be about 50% faster in this benchmark:
require "benchmark/ips"
class Foo
def initialize
@foo = nil
end
def test1
@foo ||= 123
end
def test2
@foo || (@foo = 123)
end
end
FOO = Foo.new
Benchmark.ips do |x|
x.report("test1", "FOO.test1")
x.report("test2", "FOO.test2")
end
Before:
$ ruby benchmark_ivar.rb
Warming up --------------------------------------
test1 1.957M i/100ms
test2 3.125M i/100ms
Calculating -------------------------------------
test1 20.030M (± 1.7%) i/s - 101.780M in 5.083040s
test2 31.227M (± 4.5%) i/s - 156.262M in 5.015936s
After:
$ ./ruby benchmark_ivar.rb
Warming up --------------------------------------
test1 3.205M i/100ms
test2 3.197M i/100ms
Calculating -------------------------------------
test1 32.066M (± 1.1%) i/s - 163.440M in 5.097581s
test2 31.438M (± 4.9%) i/s - 159.860M in 5.098961s
* `GC.auto_compact=`, `GC.auto_compact` can be used to control when
compaction runs. Setting `auto_compact=` to true will cause
compaction to occurr duing major collections. At the moment,
compaction adds significant overhead to major collections, so please
test first!
[Feature #17176]
iv_index_tbl manages instance variable indexes (ID -> index).
This data structure should be synchronized with other ractors
so introduce some VM locks.
This patch also introduced atomic ivar cache used by
set/getinlinecache instructions. To make updating ivar cache (IVC),
we changed iv_index_tbl data structure to manage (ID -> entry)
and an entry points serial and index. IVC points to this entry so
that cache update becomes atomically.