Fix as the compiler orders:
```
warning: unused return value of `into_raw_fd` that must be used
--> ../src/yjit/src/disasm.rs:123:21
|
123 | file.into_raw_fd(); // keep the fd open
| ^^^^^^^^^^^^^^^^^^
|
= note: losing the raw file descriptor may leak resources
= note: `#[warn(unused_must_use)]` on by default
help: use `let _ = ...` to ignore the resulting value
|
123 | let _ = file.into_raw_fd(); // keep the fd open
| +++++++
warning: unused return value of `into_raw_fd` that must be used
--> ../src/yjit/src/log.rs:84:21
|
84 | file.into_raw_fd(); // keep the fd open
| ^^^^^^^^^^^^^^^^^^
|
= note: losing the raw file descriptor may leak resources
help: use `let _ = ...` to ignore the resulting value
|
84 | let _ = file.into_raw_fd(); // keep the fd open
| +++++++
```
* YJIT: Replace Array#each only when YJIT is enabled
* Add comments about BUILTIN_ATTR_C_TRACE
* Make Ruby Array#each available with --yjit as well
* Fix all paths that expect a C location
* Use method_basic_definition_p to detect patches
* Copy a comment about C_TRACE flag to compilers
* Rephrase a comment about add_yjit_hook
* Give METHOD_ENTRY_BASIC flag to Array#each
* Add --yjit-c-builtin option
* Allow inconsistent source_location in test-spec
* Refactor a check of BUILTIN_ATTR_C_TRACE
* Set METHOD_ENTRY_BASIC without touching vm->running
We got some core dumps in the wild where a PendingBranch had everything
as None, leading to a panic unwrapping in PendingBranch::into_branch().
This happened while compiling a `branchif`.
It seems that the only way this can happen is when core::gen_branch()
fails, but not due to OOM. We wouldn't have reached into_branch() when
OOM, and the only way gen_branch() can avoid leaving markers that
would've set the branch's start_addr to some value is for set_target()
to fail, causing an early return.
Unfortunately, it's hard to tell the exact sequence of events that led
to this situation, but regardless, the dumps show us that we should
check for errors in gen_branch().
Because gen_branch() is used deep in the stack during compilation (e.g.
guard_known_class() -> jit_chain_guard() -> gen_branch()), it'd be bad
for compile speed to propagate the error everywhere, not to mention the
massive patch required. Opt for a flag checked near the end of
compilation.
Type information in the context for no additional work!
This is the `if (special_object_p(obj)) return obj;` path in
rb_obj_dup() and for Numeric#dup, it's always the identity function.
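The identity behavior being specialized is observable from plain Ruby; a minimal
sketch (not the YJIT codegen itself):
```ruby
# Special constants (immediates such as Fixnums and flonums) take the
# `special_object_p(obj)` early return in rb_obj_dup, so dup is identity:
p 1.dup.equal?(1)       # => true
p 1.5.dup.equal?(1.5)   # => true (flonum on 64-bit platforms)
p nil.dup.equal?(nil)   # => true
```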
Previously, in the "Top-N most frequent C calls"
section of --yjit-stats output, we printed the class
name of the receiver, not the method owner. This meant
that calls on subclass instances that land on the same
method showed up as different entries.
Similarly, methods called through an alias showed up as
entries separate from their other aliases.
Group by the resolved method instead.
Test program:
1.itself; [].itself; true.inspect; true.to_s
Before:
Top-4 most frequent C calls (80.0% of C calls):
1 (20.0%): Integer#itself
1 (20.0%): TrueClass#to_s
1 (20.0%): TrueClass#inspect
1 (20.0%): Array#itself
After:
Top-2 most frequent C calls (80.0% of C calls):
2 (40.0%): Kernel#itself
2 (40.0%): TrueClass#to_s
* YJIT: Add `--yjit-compilation-log` flag to print out the compilation log at exit.
* YJIT: Add an option to enable the compilation log at runtime.
* YJIT: Fix a typo in the `IseqPayload` docs.
* YJIT: Add stubs for getting the YJIT compilation log in memory.
* YJIT: Add a compilation log based on a circular buffer to cap the log size.
* YJIT: Allow specifying either a file or directory name for the YJIT compilation log.
The compilation log will be populated as compilation events occur. If a directory is supplied, then a filename based on the PID will be used as the write target. If a file name is supplied instead, the log will be written to that file.
* YJIT: Add JIT compilation of C function substitutions to the compilation log.
* YJIT: Add compilation events to the circular buffer even if output is sent to a file.
Previously, the two modes were treated as mutually exclusive. However, it can be beneficial to log all events to a file while also allowing direct access to the last N events via `RubyVM::YJIT.compilation_log` (a usage sketch follows this change list).
* YJIT: Make timestamps the first element in the YJIT compilation log tuple.
* YJIT: Stream log to stderr if `--yjit-compilation-log` is supplied without an argument.
* YJIT: Eagerly compute compilation log messages to avoid hanging on to references that may be garbage collected.
* YJIT: Log all compiled blocks, not just the method entry points.
* YJIT: Remove all compilation events other than block compilation to slim down the log.
* YJIT: Replace circular buffer iterator with a consuming loop.
* YJIT: Support `--yjit-compilation-log=quiet` as a way to activate the in-memory log without printing it.
Co-authored-by: Randy Stauner <randy.stauner@shopify.com>
* YJIT: Promote the compilation log to being the one YJIT log.
Co-authored-by: Randy Stauner <randy.stauner@shopify.com>
* Update doc/yjit/yjit.md
* Update doc/yjit/yjit.md
---------
Co-authored-by: Randy Stauner <randy.stauner@shopify.com>
Co-authored-by: Maxime Chevalier-Boisvert <maximechevalierb@gmail.com>
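As a rough usage sketch of the in-memory log referenced in the change list above
(the method name and tuple layout are taken from the bullets; the final "one YJIT
log" may expose them slightly differently):
```ruby
# Illustrative only: read the most recent in-memory entries, assuming the
# process was started with a mode that fills the circular buffer (e.g. the
# "quiet" mode mentioned above). The timestamp is the first tuple element.
if defined?(RubyVM::YJIT) && RubyVM::YJIT.respond_to?(:compilation_log)
  (RubyVM::YJIT.compilation_log || []).each do |timestamp, message|
    puts "#{timestamp}  #{message}"
  end
end
```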
Module#name shows up as a top C method callee in the lobsters benchmark, so it's
probably common enough. It's also easy to substitute thanks to rb_mod_name()
already having no GC yield points.
klass = BasicObject
50_000_000.times { klass.name }
Benchmark 1: /.rubies/post/bin/ruby --yjit mod_name.rb
Time (mean ± σ): 1.433 s ± 0.010 s [User: 1.410 s, System: 0.010 s]
Range (min … max): 1.421 s … 1.449 s 10 runs
Benchmark 2: /.rubies/mstr/bin/ruby --yjit mod_name.rb
Time (mean ± σ): 1.491 s ± 0.012 s [User: 1.468 s, System: 0.010 s]
Range (min … max): 1.470 s … 1.511 s 10 runs
Summary
/.rubies/post/bin/ruby --yjit mod_name.rb ran
1.04 ± 0.01 times faster than /.rubies/mstr/bin/ruby --yjit mod_name.rb
Now that we've inlined the eden_heap into the size_pool, we should
rename the size_pool to heap, so that Ruby contains multiple heaps, each
holding different sized objects.
The term heap, meaning a collection of memory pages, is more in line with
memory management nomenclature, whereas size_pool was a name chosen out
of necessity during the development of the Variable Width Allocation
features of Ruby.
The concept of size pools was introduced in order to facilitate
different sized objects (other than the default 40 bytes). They wrapped
the eden heap and the tomb heap, and some related state, and provided a
reasonably simple way of duplicating all related concerns, to provide
multiple pools that all shared the same structure but held different
objects.
Since then various changes have happened in Ruby's memory layout:
* The concept of tomb heaps has been replaced by a global free pages list,
with each page having its slot size reconfigured at the point when it
is resurrected
* The eden heap has been inlined into the size pool itself, so that now
the size pool directly controls the free_pages list, the sweeping
page, the compaction cursor and the other state that was previously
being managed by the eden heap.
Now that there is no need for a heap wrapper, we should refer to the
collection of pages containing Ruby objects as a heap again rather than
a size pool.
If a Hash which is empty or only using literals is frozen, we detect
this as a peephole optimization and change the instructions to be
`opt_hash_freeze`.
[Feature #20684]
Co-authored-by: Jean Boussier <byroot@ruby-lang.org>
If an Array which is empty or only using literals is frozen, we detect
this as a peephole optimization and change the instructions to be
`opt_ary_freeze`.
[Feature #20684]
Co-authored-by: Jean Boussier <byroot@ruby-lang.org>
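For illustration, the rewritten instructions from the two commits above can be
observed by disassembling literal-only frozen collections (assuming a Ruby build
that includes these instructions; the exact disasm shape is only a rough
expectation):
```ruby
# Compile literal-only frozen collections and inspect the bytecode:
puts RubyVM::InstructionSequence.compile("[1, 2, 3].freeze").disasm
# expected to contain opt_ary_freeze instead of a literal array + send :freeze
puts RubyVM::InstructionSequence.compile("{ 'a' => 1 }.freeze").disasm
# expected to contain opt_hash_freeze instead of a literal hash + send :freeze
```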
* YJIT: Encode doubles to VALUE objects and move stat generation to rust
Stats that can now be generated from rust have been moved there.
* Move object_shape_count call for runtime_stats to rust
This reduces the ruby method to a single primitive.
* Change hash_aset_usize from macro to function
YJIT currently uses the YJIT root object to mark objects during GC and
update references during compaction. This object otherwise serves no
purpose.
This commit changes YJIT to do this work as a step when marking the GC
roots instead. This saves some memory from being allocated from the
system and the GC.
* Document why we need to explicitly spill registers.
* Simplify passing a byte value to `str_buf_cat`.
* YJIT: Enhance the `String#<<` method substitution to handle integer codepoint values.
* YJIT: Move runtime type check into YJIT.
Performing the check in YJIT means we can make assumptions about the type. It also improves correctness of stack traces in cases where the codepoint argument is not a String or a Fixnum.
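For context on the substituted method's plain-Ruby semantics (not the YJIT
codegen): `String#<<` with an Integer argument appends the character for that
codepoint, while a String argument is appended as-is.
```ruby
buf = +"ab"   # unfrozen String
buf << "c"    # appending a String concatenates it
buf << 100    # appending an Integer appends that codepoint's character
p buf         # => "abcd"  (100 is the codepoint of "d")
```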
* YJIT: Allow dev_nodebug to disasm release-mode code
* Revert "YJIT: Squash canary before falling back"
This reverts commit f05ad373d8.
The stray canary issue should have been solved by
def7023ee4, alleviating this codegen
accommodation.
* s/runtime_assertions/runtime_checks/
---------
Co-authored-by: Alan Wu <XrXr@users.noreply.github.com>
* YJIT: Local variable register allocation
* locals are not stack temps
* Rename RegTemps to RegMappings
* Rename RegMapping to RegOpnd
* Rename local_size to num_locals
* s/stack value/operand/
* Rename spill_temps() to spill_regs()
* Clarify when num_locals becomes None
* Mention that InsnOut uses different registers
* Rename get_reg_mapping to get_reg_opnd
* Resurrect --yjit-temp-regs capability
* Use MAX_CTX_TEMPS and MAX_CTX_LOCALS
* YJIT: increase context cache size to 1024 redux
* Move context hashing code outside of unsafe block
* Avoid allocating a large table on the stack, which would cause a stack overflow
Co-authored by Alan Wu @XrXr
* YJIT: increase context cache size to 1024
The other day I ran into a mysterious bug while increasing the
cache size to 1024. I was not able to reproduce this locally.
Opening this PR for testing/debugging.
* Add extra debug assertions
* Add more comments to context code
* Update yjit/src/core.rs
Co-authored-by: Alan Wu <XrXr@users.noreply.github.com>
* Update yjit/src/core.rs
* Comment out potentially problematic assertion
* Revert cache size to 512 so we can merge other changes
---------
Co-authored-by: Alan Wu <XrXr@users.noreply.github.com>
This change implements a fallback mode for the `--yjit-dump-disasm`
development command-line option to make it usable in release builds.
Previously, using the option with release builds of YJIT yielded only
a warning asking the user to build with `--enable-yjit=dev`.
While builds that use the `disasm` feature still give the best output,
just having the comments is useful enough for many kinds of debugging.
Having it usable in release builds is nice for new hackers, too, since
this allows for tinkering without having to learn how to build YJIT in
development mode.
Sample output on A64:
```
# regenerate_branch
# Insn: 0001 opt_send_without_block (stack_size: 1)
# guard known object with singleton class
0x11f7e0034: 4b 00 00 58 03 00 00 14 08 ce 9c 04 01 00 00
0x11f7e0043: 00 3f 00 0b eb 81 06 01 54 1f 20 03 d5
# RUBY_VM_CHECK_INTS(ec)
0x11f7e0050: 8b 02 42 b8 cb 07 01 35
# stack overflow check
0x11f7e0058: ab 62 02 91 7f 02 0b eb 69 07 01 54
# save PC to CFP
0x11f7e0064: 0b 3b 9a d2 2b 2f a0 f2 0b 00 cc f2 6b 02 00
0x11f7e0073: f8 ab 82 00 91
```
To ensure this feature doesn't incur too much cost when running without
the `--yjit-dump-disasm` option, I checked that there is no significant
impact to compile time and memory usage with the `compile_time_ns` and
`yjit_alloc_size` entries in `RubyVM::YJIT.runtime_stats`. For each
sample, I ran 3 iterations of the `lobsters` YJIT benchmark. The
statistics summary was done with the `summary` function in R.
Compile time, sample size of 60, lower is better:
```
Before After
Min. :2.054e+09 Min. :2.028e+09
1st Qu.:2.069e+09 1st Qu.:2.044e+09
Median :2.081e+09 Median :2.060e+09
Mean :2.089e+09 Mean :2.066e+09
3rd Qu.:2.109e+09 3rd Qu.:2.085e+09
Max. :2.146e+09 Max. :2.144e+09
```
Allocation size, sample size of 20, lower is better:
```
Before After
Min. :21804742 Min. :21794082
1st Qu.:21826682 1st Qu.:21816282
Median :21844042 Median :21826814
Mean :21960664 Mean :22026291
3rd Qu.:21861228 3rd Qu.:22040439
Max. :22587426 Max. :22930614
```
The `yjit_alloc_size` samples are noisy, but since the average increased
by only 0.3%, and the median is lower, I feel safe saying that there is
no significant change.
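For reference, a rough sketch of how the two counters mentioned above can be read
in-process after a run (key names are as given in the text; exact availability
depends on the build and flags):
```ruby
# Assumes a YJIT-enabled run; key availability can vary by build/flags.
stats = RubyVM::YJIT.runtime_stats || {}
p compile_time_ns: stats[:compile_time_ns],
  yjit_alloc_size: stats[:yjit_alloc_size]
```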
Use a special breakpoint address if one isn't explicitly supplied in order to support natural line stepping.
ARM64 will not increment the program counter (PC) upon hitting a breakpoint instruction. Consequently, stepping through code with a debugger ends up looping back to the breakpoint instruction. LLDB has a special breakpoint address of 0xf000 that will increment the PC and allow the debugger to work as expected. This change makes it possible to debug YJIT generated code on ARM64.
More details at: https://discourse.llvm.org/t/stepping-over-a-brk-instruction-on-arm64/69766/8
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
This commit expands inlining for simple ISeqs to accept
callees that have unused keyword parameters and callers
that specify unused keywords. The following shows 2 new
callsites that will be inlined:
```ruby
def let(a, checked: true) = a
let(1)
let(1, checked: false)
```
Co-authored-by: Kaan Ozkan <kaan.ozkan@shopify.com>
Many functions take an outlined code block but do nothing more than
passing it along; only a couple of functions actually make use of it.
So, in most cases the `ocb` parameter is just boilerplate.
Most functions that take `ocb` already also take a `JITState` and this
commit moves `ocb` into `JITState` to remove the visual noise of the
`ocb` parameter.
This commit fixes splat and block handling when calling in to a
forwarding iseq. In the case of a splat we need to avoid expanding the
array to the stack. We need to also ensure the CI write is flushed to
the SP, otherwise it's possible for a block handler to clobber the CI.
[ruby-core:118360]
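A sketch of the call shape this fix covers (illustrative names; the point is a
splat plus a block passing through a forwarding iseq):
```ruby
def target(*args, &blk)
  [args, blk.call]
end

def fwd(...) = target(...)    # forwarding iseq

list = [1, 2, 3]
p fwd(*list) { :from_block }  # the splat must not be expanded onto the stack here
# => [[1, 2, 3], :from_block]
```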
This commit adds `sendforward` and `invokesuperforward` for forwarding
parameters to calls
Co-authored-by: Matt Valentine-House <matt@eightbitraptor.com>
This patch optimizes forwarding callers and callees. It only optimizes methods that only take `...` as their parameter, and then pass `...` to other calls.
Calls it optimizes look like this:
```ruby
def bar(a) = a
def foo(...) = bar(...) # optimized
foo(123)
```
```ruby
def bar(a) = a
def foo(...) = bar(1, 2, ...) # optimized
foo(123)
```
```ruby
def bar(*a) = a
def foo(...)
list = [1, 2]
bar(*list, ...) # optimized
end
foo(123)
```
All variants of the above but using `super` are also optimized, including a bare super like this:
```ruby
def foo(...)
super
end
```
This patch eliminates intermediate allocations made when calling methods that accept `...`.
We can observe allocation elimination like this:
```ruby
def m
x = GC.stat(:total_allocated_objects)
yield
GC.stat(:total_allocated_objects) - x
end
def bar(a) = a
def foo(...) = bar(...)
def test
m { foo(123) }
end
test
p test # allocates 1 object on master, but 0 objects with this patch
```
```ruby
def bar(a, b:) = a + b
def foo(...) = bar(...)
def test
m { foo(1, b: 2) }
end
test
p test # allocates 2 objects on master, but 0 objects with this patch
```
How does it work?
-----------------
This patch works by using a dynamic stack size when passing forwarded parameters to callees.
The caller's info object (known as the "CI") contains the stack size of the
parameters, so we pass the CI object itself as a parameter to the callee.
When forwarding parameters, the forwarding ISeq uses the caller's CI to determine how much stack to copy, then copies the caller's stack before calling the callee.
The CI at the forwarded call site is adjusted using information from the caller's CI.
I think this description is kind of confusing, so let's walk through an example with code.
```ruby
def delegatee(a, b) = a + b
def delegator(...)
delegatee(...) # CI2 (FORWARDING)
end
def caller
delegator(1, 2) # CI1 (argc: 2)
end
```
Before we call the delegator method, the stack looks like this:
```
Executing Line | Code | Stack
---------------+---------------------------------------+--------
1| def delegatee(a, b) = a + b | self
2| | 1
3| def delegator(...) | 2
4| # |
5| delegatee(...) # CI2 (FORWARDING) |
6| end |
7| |
8| def caller |
-> 9| delegator(1, 2) # CI1 (argc: 2) |
10| end |
```
The ISeq for `delegator` is tagged as "forwardable", so when `caller` calls in
to `delegator`, it writes `CI1` on to the stack as a local variable for the
`delegator` method. The `delegator` method has a special local called `...`
that holds the caller's CI object.
Here is the ISeq disasm for `delegator`:
```
== disasm: #<ISeq:delegator@-e:1 (1,0)-(1,39)>
local table (size: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 1] "..."@0
0000 putself ( 1)[LiCa]
0001 getlocal_WC_0 "..."@0
0003 send <calldata!mid:delegatee, argc:0, FCALL|FORWARDING>, nil
0006 leave [Re]
```
The local called `...` will contain the caller's CI: CI1.
Here is the stack when we enter `delegator`:
```
Executing Line | Code | Stack
---------------+---------------------------------------+--------
1| def delegatee(a, b) = a + b | self
2| | 1
3| def delegator(...) | 2
-> 4| # | CI1 (argc: 2)
5| delegatee(...) # CI2 (FORWARDING) | cref_or_me
6| end | specval
7| | type
8| def caller |
9| delegator(1, 2) # CI1 (argc: 2) |
10| end |
```
The CI at `delegatee` on line 5 is tagged as "FORWARDING", so it knows to
memcopy the caller's stack before calling `delegatee`. In this case, it will
memcopy self, 1, and 2 to the stack before calling `delegatee`. It knows how much
memory to copy from the caller because `CI1` contains stack size information
(argc: 2).
Before executing the `send` instruction, we push `...` on the stack. The
`send` instruction pops `...`, and because it is tagged with `FORWARDING`, it
knows to memcopy (using the information in the CI it just popped):
```
== disasm: #<ISeq:delegator@-e:1 (1,0)-(1,39)>
local table (size: 1, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 1] "..."@0
0000 putself ( 1)[LiCa]
0001 getlocal_WC_0 "..."@0
0003 send <calldata!mid:delegatee, argc:0, FCALL|FORWARDING>, nil
0006 leave [Re]
```
Instruction 0001 puts the caller's CI on the stack. `send` is tagged with
FORWARDING, so it reads the CI and _copies_ the caller's stack to this stack:
```
Executing Line | Code | Stack
---------------+---------------------------------------+--------
1| def delegatee(a, b) = a + b | self
2| | 1
3| def delegator(...) | 2
4| # | CI1 (argc: 2)
-> 5| delegatee(...) # CI2 (FORWARDING) | cref_or_me
6| end | specval
7| | type
8| def caller | self
9| delegator(1, 2) # CI1 (argc: 2) | 1
10| end | 2
```
The "FORWARDING" call site combines information from CI1 with CI2 in order
to support passing other values in addition to the `...` value, as well as
perfectly forwarding splat args, kwargs, etc.
Since we're able to copy the stack from `caller` in to `delegator`'s stack, we
can avoid allocating objects.
I want to do this to eliminate object allocations for delegate methods.
My long-term goal is to implement `Class#new` in Ruby, and that implementation uses `...`.
I was able to implement `Class#new` in Ruby
[here](https://github.com/ruby/ruby/pull/9289).
If we adopt the technique in this patch, then we can optimize allocating
objects that take keyword parameters for `initialize`.
For example, this code will allocate 2 objects: one for `SomeObject`, and one
for the kwargs:
```ruby
SomeObject.new(foo: 1)
```
If we combine this technique, plus implement `Class#new` in Ruby, then we can
reduce allocations for this common operation.
Co-Authored-By: John Hawthorn <john@hawthorn.email>
Co-Authored-By: Alan Wu <XrXr@users.noreply.github.com>
This mainly aims to make `--yjit-dump-disasm=<relative_path>` more
usable. Previously, it crashed if the program did chdir(2), since it
re-opened the dump file by its relative path every time it appended to it.
Tested with:
./miniruby --yjit-dump-disasm=. --yjit-call-threshold=1 -e 'Dir.chdir("/") {}'
And the `lobsters` benchmark.
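The underlying failure mode is just relative paths resolving against the changed
working directory; a plain-Ruby illustration with a hypothetical file name:
```ruby
# Appending via a relative path resolves against the current working
# directory on every open, so the second write lands in a different file
# (or fails) once the process has chdir'd:
File.write("dump.log", "first\n", mode: "a")
Dir.chdir("/tmp") do
  File.write("dump.log", "second\n", mode: "a")  # now writes /tmp/dump.log
end
```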
Calls to defer_compilation() leave behind a stub and a `struct Block`
that we retain. If the block is empty, it only exists to hold the
`struct Branch` that the stub needs.
This patch transplants the branch out of the empty block into the newly
generated block when the defer_compilation() stub is hit, and deletes
the empty block to save memory.
To assist the transplantation, `Block::outgoing` is now a
`MutableBranchList`, and `Branch::Block` is now in a `Cell`. These types
don't incur a size cost.
On the `lobsters` benchmark, `yjit_alloc_size` is roughly 98% of what
it was before the change.
Co-authored-by: Kevin Menard <kevin.menard@shopify.com>
Co-authored-by: Randy Stauner <randy@r4s6.net>
Co-authored-by: Maxime Chevalier-Boisvert <maxime.chevalierboisvert@shopify.com>