Speeds up ChunkyPNG.
The interpreter is about 70% faster:
```
before: ruby 3.4.0dev (2024-07-03T15:16:17Z master 786cf9db48) [arm64-darwin23]
after: ruby 3.4.0dev (2024-07-03T15:32:25Z ruby-downto 0b8b744ce2) [arm64-darwin23]
---------- ----------- ---------- ---------- ---------- ------------- ------------
bench before (ms) stddev (%) after (ms) stddev (%) after 1st itr before/after
chunky-png 892.2 0.1 526.3 1.0 1.65 1.70
---------- ----------- ---------- ---------- ---------- ------------- ------------
Legend:
- after 1st itr: ratio of before/after time for the first benchmarking iteration.
- before/after: ratio of before/after time. Higher is better for after. Above 1 represents a speedup.
```
YJIT is 2.5x faster:
```
before: ruby 3.4.0dev (2024-07-03T15:16:17Z master 786cf9db48) +YJIT [arm64-darwin23]
after: ruby 3.4.0dev (2024-07-03T15:32:25Z ruby-downto 0b8b744ce2) +YJIT [arm64-darwin23]
---------- ----------- ---------- ---------- ---------- ------------- ------------
bench before (ms) stddev (%) after (ms) stddev (%) after 1st itr before/after
chunky-png 709.4 0.1 278.8 0.3 2.35 2.54
---------- ----------- ---------- ---------- ---------- ------------- ------------
Legend:
- after 1st itr: ratio of before/after time for the first benchmarking iteration.
- before/after: ratio of before/after time. Higher is better for after. Above 1 represents a speedup.
```
Following [Feature #20589] it can happen that we change the
capacity of a frozen array, so these assertions no longer make
sense.
Normally we don't hit them because `Array#freeze` shrinks the
array, but if somehow the Array was frozen using `Object#freeze`
then we may shrink it after it was frozen.
This commit splits gc.c into two files:
- gc.c now only contains code not specific to Ruby GC. This includes
code to mark objects (which the GC implementation may choose not to
use) and wrappers for internal APIs that the implementation may need
to use (e.g. locking the VM).
- gc_impl.c now contains the implementation of Ruby's GC. This includes
marking, sweeping, compaction, and statistics. Most importantly,
gc_impl.c only uses public APIs in Ruby and a limited set of functions
exposed in gc.c. This allows us to build gc_impl.c independently of
Ruby and plug Ruby's GC into itself.
**What does this PR do?**
This PR tweaks the `vm_push_frame` function to add an explicit compiler
fence (`atomic_signal_fence`) to ensure profilers that use signals
to interrupt applications (stackprof, vernier, pf2, Datadog profiler)
can safely sample from the signal handler.
**Motivation:**
The `vm_push_frame` was specifically tweaked in
https://github.com/ruby/ruby/pull/3296 to initialize the a frame
before updating the `cfp` pointer.
But since there's nothing stopping the compiler from reordering
the initialization of a frame (`*cfp =`) with the update of the cfp
pointer (`ec->cfp = cfp`) we've been hesitant to rely on this on
the Datadog profiler.
In practice, after some experimentation + talking to folks, this
reordering does not seem to happen.
But since modern compilers have a way for us to exactly tell them
not to do the reordering (`atomic_signal_fence`), this seems even
better.
I've actually extracted `vm_push_frame` into the "Compiler Explorer"
website, which you can use to see the assembly output of this function
across many compilers and architectures: https://godbolt.org/z/3oxd1446K
On that link you can observe two things across many compilers:
1. The compilers are not reordering the writes
2. The barrier does not change the generated assembly output
(== has no cost in practice)
**Additional Notes:**
The checks added in `configure.ac` define two new macros:
* `HAVE_STDATOMIC_H`
* `HAVE_DECL_ATOMIC_SIGNAL_FENCE`
Since Ruby generates an arch-specific `config.h` header with
these macros upon installation, this can be used by profilers
and other libraries to test if Ruby was compiled with the fence enabled.
**How to test the change?**
As I mentioned above, you can check https://godbolt.org/z/3oxd1446K
to confirm the compiled output of `vm_push_frame` does not change
in most compilers (at least all that I've checked on that site).
* Speed up chunkypng benchmark
Since d037c5196a we're seeing a slowdown
in ChunkyPNG benchmarks in YJIT bench. This patch addresses the
slowdown. Making the thread volatile speeds up the benchmark by 2 or 3%
on my machine.
```
before: ruby 3.4.0dev (2024-07-02T18:48:43Z master b2b8306b46) [x86_64-linux]
after: ruby 3.4.0dev (2024-07-02T20:07:44Z speed-chunkypng 418334dba9) [x86_64-linux]
---------- ----------- ---------- ---------- ---------- ------------- ------------
bench before (ms) stddev (%) after (ms) stddev (%) after 1st itr before/after
chunky-png 1000.2 0.1 980.6 0.1 1.02 1.02
---------- ----------- ---------- ---------- ---------- ------------- ------------
Legend:
- after 1st itr: ratio of before/after time for the first benchmarking iteration.
- before/after: ratio of before/after time. Higher is better for after. Above 1 represents a speedup.
Output:
./data/output_015.csv
```
* Update thread.c
Co-authored-by: Alan Wu <XrXr@users.noreply.github.com>
---------
Co-authored-by: Maxime Chevalier-Boisvert <maximechevalierb@gmail.com>
Co-authored-by: Alan Wu <XrXr@users.noreply.github.com>
Use a special breakpoint address if one isn't explicitly supplied in order to support natural line stepping.
ARM64 will not increment the program counter (PC) upon hitting a breakpoint instruction. Consequently, stepping through code with a debugger ends up looping back to the breakpoint instruction. LLDB has a special breakpoint address of 0xf000 that will increment the PC and allow the debugger to work as expected. This change makes it possible to debug YJIT generated code on ARM64.
More details at: https://discourse.llvm.org/t/stepping-over-a-brk-instruction-on-arm64/69766/8
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
When we forward calls to C functions if the callsite is a forwarding
site it might not always be a splat, so we can't use the fast path.
Fixes:
[ruby-core:118418]
This commit expands inlining for simple ISeqs to accept
callees that have unused keyword parameters and callers
that specify unused keywords. The following shows 2 new
callsites that will be inlined:
```ruby
def let(a, checked: true) = a
let(1)
let(1, checked: false)
```
Co-authored-by: Kaan Ozkan <kaan.ozkan@shopify.com>
While working on a separate issue we found that in some cases
`ary_heap_realloc` was being called on frozen arrays. To fix this, this
change does the following:
1) Updates `rb_ary_freeze` to assert the type is an array, return if
already frozen, and shrink the capacity if it is not embedded, shared
or a shared root.
2) Replaces `rb_obj_freeze` with `rb_ary_freeze` when the object is
always an array.
3) In `ary_heap_realloc`, ensure the new capa is set with
`ARY_SET_CAPA`. Previously the change in capa was not set.
4) Adds an assertion to `ary_heap_realloc` that the array is not frozen.
Some of this work was originally done in
https://github.com/ruby/ruby/pull/2640, referencing this issue
https://bugs.ruby-lang.org/issues/16291. There didn't appear to be any
objections to this PR, it appears to have simply lost traction.
The original PR made changes to arrays and strings at the same time,
this PR only does arrays. Also it was old enough that rather than revive
that branch I've made a new one. I added Lourens as co-author in addtion
to Aaron who helped me with this patch.
The original PR made this change for performance reasons, and while
that's still true for this PR, the goal of this PR is to avoid
calling `ary_heap_realloc` on frozen arrays. The capacity should be
shrunk _before_ the array is frozen, not after.
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
Co-Authored-By: methodmissing <lourens@methodmissing.com>
Some tests in btest uses long src for btest and it is harmful to
check the results. This patch introducing the limitation how many
lines of code is shown on failure.
I think this change fixes the following assertion failure:
```
[BUG] unexpected rb_parser_ary_data_type (2114076960) for script lines
```
It seems that `ast_value` is collected then `rb_parser_build_script_lines_from`
touches invalid memory address.
This change prevents `ast_value` from being collected by RB_GC_GUARD.
Many functions take an outlined code block but do nothing more than
passing it along; only a couple of functions actually make use of it.
So, in most cases the `ocb` parameter is just boilerplate.
Most functions that take `ocb` already also take a `JITState` and this
commit moves `ocb` into `JITState` to remove the visual noise of the
`ocb` parameter.
* Set VM_CALL_KWARG flag first and reuse it to avoid checking kw_arg twice
* Fix comment for VM_CALL_ARGS_SIMPLE
* Make VM_CALL_ARGS_SIMPLE set-site match its comment