Some tunings.
* add `inline` for vm_sendish()
* pass enum instead of func ptr to vm_sendish()
* reorder initial order of `calling` struct.
* add ALWAYS_INLINE for vm_search_method_fastpath()
* call vm_search_method_fastpath() from vm_sendish()
add cc_found_in_ccs (renamed from cc_found_ccs), cc_not_found_in_ccs,
call0_public, call0_other debug counters to measure more details.
also it contains several modification.
`cd` is passed to method call functions to method invocation
functions, but `cd` can be manipulated by other ractors simultaneously
so it contains thread-safety issue.
To solve this issue, this patch stores `ci` and found `cc` to `calling`
and stops to pass `cd`.
This speeds up all instance variable access, even when not in
verbose mode. Uninitialized instance variable warnings were
rarely helpful, and resulted in slower code if you wanted to
avoid warnings when run in verbose mode.
Implements [Feature #17055]
C extensions can violate the ractor-safety, so only ractor-safe
C extensions (C methods) can run on non-main ractors.
rb_ext_ractor_safe(true) declares that the successive
defined methods are ractor-safe. Otherwiwze, defined methods
checked they are invoked in main ractor and raise an error
if invoked at non-main ractors.
[Feature #17307]
Performance is probably improved?
$ benchmark-driver -v --rbenv 'before --jit;after --jit' --repeat-count=12 --alternate --output=all benchmark.yml
before --jit: ruby 3.0.0dev (2020-11-27T04:37:47Z master 69e77e81dc) +JIT [x86_64-linux]
after --jit: ruby 3.0.0dev (2020-11-27T05:28:19Z master df6b05c6dd) +JIT [x86_64-linux]
last_commit=Set VM_FRAME_FLAG_FINISH at once
Calculating -------------------------------------
before --jit after --jit
Optcarrot Lan_Master.nes 80.89292998533379 82.19497327502751 fps
80.93130641142331 85.13943315260148
81.06214830270119 87.43757879797808
82.29172808453910 87.89942441487113
84.61206450455929 87.91309779491075
85.44545883567997 87.98026086648694
86.02923132404449 88.03081060383973
86.07411817365879 88.14650206137341
86.34348799602836 88.32791633649961
87.90257338977324 88.57599644892220
88.58006509876580 88.67426384743277
89.26611118140011 88.81669430874207
This should have no bad impact on VM because this function is ALWAYS_INLINE.
Allocating an instance of a class uses the allocator for the class. When
the class has no allocator set, Ruby looks for it in the super class
(see rb_get_alloc_func()).
It's uncommon for classes created from Ruby code to ever have an
allocator set, so it's common during the allocation process to search
all the way to BasicObject from the class with which the allocation is
being performed. This makes creating instances of classes that have
long ancestry chains more expensive than creating instances of classes
have that shorter ancestry chains.
Setting the allocator at class creation time removes the need to perform
a search for the alloctor during allocation.
This is a breaking change for C-extensions that assume that classes
created from Ruby code have no allocator set. Libraries that setup a
class hierarchy in Ruby code and then set the allocator on some parent
class, for example, can experience breakage. This seems like an unusual
use case and hopefully it is rare or non-existent in practice.
Rails has many classes that have upwards of 60 elements in the ancestry
chain and benchmark shows a significant improvement for allocating with
a class that includes 64 modules.
```
pre: ruby 3.0.0dev (2020-11-12T14:39:27Z master 6325866421)
post: ruby 3.0.0dev (2020-11-12T20:15:30Z cut-allocator-lookup)
Comparison:
allocate_8_deep
post: 10336985.6 i/s
pre: 8691873.1 i/s - 1.19x slower
allocate_32_deep
post: 10423181.2 i/s
pre: 6264879.1 i/s - 1.66x slower
allocate_64_deep
post: 10541851.2 i/s
pre: 4936321.5 i/s - 2.14x slower
allocate_128_deep
post: 10451505.0 i/s
pre: 3031313.5 i/s - 3.45x slower
```
This commit adds a debug counter for the case where the inline cache
*missed* but the ivar index table has an entry for that ivar. This is a
case where a polymorphic cache could help
iv tables cannot shrink. If the inline cache was ever set, then there
must be an entry for the instance variable in the iv table. Just set
the iv list on the object to be equal to the iv index table size, then
set the iv.
When the inline cache is written, the iv table will contain an entry for
the instance variable. If we get an inline cache hit, then we know the
iv table must contain a value for the index written to the inline cache.
If the index in the inline cache is larger than the list on the object,
but *smaller* than the iv index table on the class, then we can just
eagerly allocate the iv list to be the same size as the iv index table.
This avoids duplicate work of checking frozen as well as looking up the
index for the particular instance variable name.
Ractor.make_shareable() supports Proc object if
(1) a Proc only read outer local variables (no assignments)
(2) read outer local variables are shareable.
Read local variables are stored in a snapshot, so after making
shareable Proc, any assignments are not affeect like that:
```ruby
a = 1
pr = Ractor.make_shareable(Proc.new{p a})
pr.call #=> 1
a = 2
pr.call #=> 1 # `a = 2` doesn't affect
```
[Feature #17284]
iv_index_tbl manages instance variable indexes (ID -> index).
This data structure should be synchronized with other ractors
so introduce some VM locks.
This patch also introduced atomic ivar cache used by
set/getinlinecache instructions. To make updating ivar cache (IVC),
we changed iv_index_tbl data structure to manage (ID -> entry)
and an entry points serial and index. IVC points to this entry so
that cache update becomes atomically.
Buggy native extensions could have mark functions that cause stack
overflow. When a stack overflow happens during GC, Ruby used to recover
by raising an exception, which runs the interpreter. It's not safe to
run the interpreter during GC since the GC is in an inconsistent state.
This could cause object allocation during GC, for example.
Instead of running the interpreter and potentially causing a crash down
the line, fail fast and abort.
generic_ivtbl is a process global table to maintain instance variables
for non T_OBJECT/T_CLASS/... objects. So we need to protect them
for multi-Ractor exection.
Hint: we can make them Ractor local for unshareable objects, but
now it is premature optimization.
The changes here include:
* Using `FL_TEST_RAW` instead of `FL_TEST` in the first check in
`vm_search_super_method`. While the profile showed us spending a fair
amount of time here, the subsequent benchmarks didn't show much
improvement when adding this. Regardless, we know this does less work
than `FL_TEST` and we know that `FL_TEST_RAW` is safe due to the
previous check so it's a small but accurate optimization.
* Set `mid` only once. Both `vm_ci_new_runtime` and `vm_ci_mid` were
getting the `original_id` for the method entry. We can do this once
and pass the variable to the 2 callers that need it. This also doesn't
have a huge performance improvement but cleans up the code a bit.
Benchmark:
```
| |compare-ruby|built-ruby|
|:----------------|-----------:|---------:|
|vm_iclass_super | 3.540M| 3.940M|
| | -| 1.11x|
```
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
This PR improves the performance of `super` calls. While working on some
Rails optimizations jhawthorn discovered that `super` calls were slower
than expected.
The changes here do the following:
1) Adds a check for whether the call frame is not equal to the method
entry iseq. This avoids the `rb_obj_is_kind_of` check on the next line
which is quite slow. If the current call frame is equal to the method
entry we know we can't have an instance eval, etc.
2) Changes `FL_TEST` to `FL_TEST_RAW`. This is safe because we've
already done the check for `T_ICLASS` above.
3) Adds a benchmark for `T_ICLASS` super calls.
4) Note: makes a chage for `method_entry_cref` to use `const`.
On master the benchmarks showed that `super` is 1.76x slower. Our
changes improved the performance so that it is now only 1.36x slower.
Benchmark IPS:
```
Warming up --------------------------------------
super 244.918k i/100ms
method call 383.007k i/100ms
Calculating -------------------------------------
super 2.280M (± 6.7%) i/s - 11.511M in 5.071758s
method call 3.834M (± 4.9%) i/s - 19.150M in 5.008444s
Comparison:
method call: 3833648.3 i/s
super: 2279837.9 i/s - 1.68x (± 0.00) slower
```
With changes:
```
Warming up --------------------------------------
super 308.777k i/100ms
method call 375.051k i/100ms
Calculating -------------------------------------
super 2.951M (± 5.4%) i/s - 14.821M in 5.039592s
method call 3.551M (± 4.9%) i/s - 18.002M in 5.081695s
Comparison:
method call: 3551372.7 i/s
super: 2950557.9 i/s - 1.20x (± 0.00) slower
```
Ruby VM benchmarks also showed an improvement:
Existing `vm_super` benchmark`.
```
$ make benchmark ITEM=vm_super
| |compare-ruby|built-ruby|
|:---------|-----------:|---------:|
|vm_super | 21.555M| 37.819M|
| | -| 1.75x|
```
New `vm_iclass_super` benchmark:
```
$ make benchmark ITEM=vm_iclass_super
| |compare-ruby|built-ruby|
|:----------------|-----------:|---------:|
|vm_iclass_super | 1.669M| 3.683M|
| | -| 2.21x|
```
This is the benchmark script used for the benchmark-ips benchmarks:
```ruby
require "benchmark/ips"
class Foo
def zuper; end
def top; end
last_method = "top"
("A".."M").each do |module_name|
eval <<-EOM
module #{module_name}
def zuper; super; end
def #{module_name.downcase}
#{last_method}
end
end
prepend #{module_name}
EOM
last_method = module_name.downcase
end
end
foo = Foo.new
Benchmark.ips do |x|
x.report "super" do
foo.zuper
end
x.report "method call" do
foo.m
end
x.compare!
end
```
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
Co-authored-by: John Hawthorn <john@hawthorn.email>
This reverts commit eeef16e190.
This also reverts the spec change.
Preventing the SystemStackError would be nice, but there is valid
code that the fix breaks, and it is probably more common than cases
that cause the SystemStackError.
Fixes [Bug #17182]
This commit introduces Ractor mechanism to run Ruby program in
parallel. See doc/ractor.md for more details about Ractor.
See ticket [Feature #17100] to see the implementation details
and discussions.
[Feature #17100]
This commit does not complete the implementation. You can find
many bugs on using Ractor. Also the specification will be changed
so that this feature is experimental. You will see a warning when
you make the first Ractor with `Ractor.new`.
I hope this feature can help programmers from thread-safety issues.
Previously, Method#super_method looked at the called_id to
determine the method id to use, but that isn't correct for
aliased methods, because the super target depends on the
original method id, not the called_id.
Additionally, aliases can reference methods defined in other
classes and modules, and super lookup needs to start in the
super of the defined class in such cases.
This adds tests for Method#super_method for both types of
aliases, one that uses VM_METHOD_TYPE_ALIAS and another that
does not. Both check that the results for calling super
methods return the expected values.
To find the defined class for alias methods, add an rb_ prefix
to find_defined_class_by_owner in vm_insnhelper.c and make it
non-static, so that it can be called from method_super_method
in proc.c.
This bug was original discovered while researching [Bug #11189].
Fixes [Bug #17130]
Without this, if a refinement defines a method that calls super and
includes a module with a module that calls super and has a activated
refinement at the point super is called, the module method super call
will end up calling back into the refinement method, creating a loop.
Fixes [Bug #17007]
Struct assignment using a compound literal is more readable than before,
to me at least. It seems compilers reorder assignments anyways.
Neither speedup nor slowdown is observed on my machine.
Use ID instead of GENTRY for gvars.
Global variables are compiled into GENTRY (a pointer to struct
rb_global_entry). This patch replace this GENTRY to ID and
make the code simple.
We need to search GENTRY from ID every time (st_lookup), so
additional overhead will be introduced.
However, the performance of accessing global variables is not
important now a day and this simplicity helps Ractor development.
for opt_* insns.
opt_eq handles rb_obj_equal inside opt_eq, and all other cfunc is
handled by opt_send_without_block. Therefore we can't decide which insn
should be generated by checking whether it's cfunc cc or not.
```
$ benchmark-driver -v --rbenv 'before --jit;after --jit' benchmark/mjit_opt_cc_insns.yml --repeat-count=4
before --jit: ruby 2.8.0dev (2020-06-26T05:21:43Z master 9dbc2294a6) +JIT [x86_64-linux]
after --jit: ruby 2.8.0dev (2020-06-26T06:30:18Z master 75cece1b0b) +JIT [x86_64-linux]
last_commit=Decide JIT-ed insn based on cached cfunc
Calculating -------------------------------------
before --jit after --jit
mjit_nil?(1) 73.878M 74.021M i/s - 40.000M times in 0.541432s 0.540391s
mjit_not(1) 72.635M 74.601M i/s - 40.000M times in 0.550702s 0.536187s
mjit_eq(1, nil) 7.331M 7.445M i/s - 8.000M times in 1.091211s 1.074596s
mjit_eq(nil, 1) 49.450M 64.711M i/s - 8.000M times in 0.161781s 0.123627s
Comparison:
mjit_nil?(1)
after --jit: 74020528.4 i/s
before --jit: 73878185.9 i/s - 1.00x slower
mjit_not(1)
after --jit: 74600882.0 i/s
before --jit: 72634507.6 i/s - 1.03x slower
mjit_eq(1, nil)
after --jit: 7444657.4 i/s
before --jit: 7331304.3 i/s - 1.02x slower
mjit_eq(nil, 1)
after --jit: 64710790.6 i/s
before --jit: 49449507.4 i/s - 1.31x slower
```