This commit implements Objects on Variable Width Allocation. This allows
Objects with more ivars to be embedded (i.e. contents directly follow the
object header) which improves performance through better cache locality.
This commit enables Arrays to move between size pools during compaction.
This can occur if the array is mutated such that it would fit in a
different size pool when embedded.
The move is carried out in two stages:
1. The RVALUE is moved to a destination heap during object movement
phase of compaction
2. The array data is re-embedded and the original buffer free'd if
required. This happens during the update references step
In order to reliably test compaction we need to be able to move objects
between size pools.
In order for this to happen there must be pages in a size pool into
which we can allocate.
The existing implementation of `double_heap` only doubled the existing
number of pages in the heap, so if a size pool had a low number of pages
(or 0) it's not guaranteed that enough space will be created to move
objects into that size pool.
This commit deprecates the `double_heap` option and replaces it with
`expand_heap` instead.
expand heap will expand each heap by enough pages to hold a number of
slots defined by `GC_HEAP_INIT_SLOTS` or by `heap->total_pags` whichever
is larger.
If both `double_heap` and `expand_heap` are present, a deprecation
warning will be shown for `double_heap` and the `expand_heap` behaviour
will take precedence
Given that this is an API intended for debugging and testing GC
compaction I'm not concerned about the extra memory usage or time taken
to create the pages. However, for completeness:
Running the following `test.rb` and using `time` on my Macbook Pro shows
the following memory usage and time impact:
pp "RSS (kb): #{`ps -o rss #{Process.pid}`.lines.last.to_i}"
GC.verify_compaction_references(double_heap: true, toward: :empty)
pp "RSS (kb): #{`ps -o rss #{Process.pid}`.lines.last.to_i}"
❯ time make run
./miniruby -I./lib -I. -I.ext/common -r./arm64-darwin21-fake ./test.rb
"RSS (kb): 24000"
<internal:gc>:251: warning: double_heap is deprecated and will be removed
"RSS (kb): 25232"
________________________________________________________
Executed in 124.37 millis fish external
usr time 82.22 millis 0.09 millis 82.12 millis
sys time 28.76 millis 2.61 millis 26.15 millis
❯ time make run
./miniruby -I./lib -I. -I.ext/common -r./arm64-darwin21-fake ./test.rb
"RSS (kb): 24000"
"RSS (kb): 49040"
________________________________________________________
Executed in 150.13 millis fish external
usr time 103.32 millis 0.10 millis 103.22 millis
sys time 35.73 millis 2.59 millis 33.14 millis
If the page_body is a null pointer, then read_barrier_handler will
crash with an unrelated message. This commit improves the error message.
Before:
test.rb:1: [BUG] Couldn't unprotect page 0x0000000000000000, errno: Cannot allocate memory
After:
test.rb:1: [BUG] read_barrier_handler: segmentation fault at 0x14
The GC compaction mechanism implements a kind of read barrier by marking
some (OS) pages as unreadable, and installing a SIGBUS/SIGSEGV handler
to detect when they're accessed and invalidate an attempt to move the
object.
Unfortunately, when a debugger is attached to the Ruby interpreter on
Mac OS, the debugger will trap the EXC_BAD_ACCES mach exception before
the runtime can transform that into a SIGBUS signal and dispatch it.
Thus, execution gets stuck; any attempt to continue from the debugger
re-executes the line that caused the exception and no forward progress
can be made.
This makes it impossible to debug either the Ruby interpreter or a C
extension whilst compaction is in use.
To fix this, we disable the EXC_BAD_ACCESS handler when installing the
SIGBUS/SIGSEGV handlers, and re-enable them once the compaction is done.
The debugger will still trap on the attempt to read the bad page, but it
will be trapping the SIGBUS signal, rather than the EXC_BAD_ACCESS mach
exception. It's possible to continue from this in the debugger, which
invokes the signal handler and allows forward progress to be made.
Commit 0c36ba5319 changed GC compaction
methods to not be implemented when not supported. However, that commit
only does compile time checks (which currently only checks for WASM),
but there are additional compaction support checks during run time.
This commit changes it so that GC compaction methods aren't defined
during run time if the platform does not support GC compaction.
[Bug #18829]
Only growth heaps are allowed to start major GCs. Before this patch,
growth heaps are defined as size pools that freed more slots than had
empty slots (i.e. there were more dead objects that empty space).
But if the size pool is relatively stable and tightly packed with mostly
old objects and has allocatable pages, then it would be incorrectly
classified as a growth heap and trigger major GC. But since it's stable,
it would not use any of the allocatable pages and forever be classified
as a growth heap, causing major GC thrashing. This commit changes the
definition of growth heap to require that the size pool to have no
allocatable pages.
Having a while loop over `heap_prepare` makes the GC logic difficult to
understand (it is difficult to understand when and why `heap_prepare`
yields a free page). It is also a source of bugs and can cause an infinite
loop if `heap_page` never yields a free page.
Fixes [Bug #18779]
Define the following methods as `rb_f_notimplement` on unsupported
platforms:
- GC.compact
- GC.auto_compact
- GC.auto_compact=
- GC.latest_compact_info
- GC.verify_compaction_references
This change allows users to call `GC.respond_to?(:compact)` to
properly test for compaction support. Previously, it was necessary to
invoke `GC.compact` or `GC.verify_compaction_references` and check if
those methods raised `NotImplementedError` to determine if compaction
was supported.
This follows the precedent set for other platform-specific
methods. For example, in `process.c` for methods such as
`Process.fork`, `Process.setpgid`, and `Process.getpriority`.
These methods are removed from gc.rb and added to gc.c:
- GC.compact
- GC.auto_compact
- GC.auto_compact=
- GC.latest_compact_info
- GC.verify_compaction_references
This is a prefactor to allow setting these methods to
`rb_f_notimplement` in a followup commit.
Some size pools may not have any pages/slots, so total_slots is 0. This
causes a divide-by-zero in the calculation. This commit adds a special
case to catch the case when total_slots is 0 and returns the number of
pages for heap_init_slots.
If the size pool has no or few pages/slots, then min_free_slots will
be a very small number (or even 0). Then the heap won't be eligible to
grow, causing GC thrashing or infinite loops.
Size pools with no pages won't be swept so gc_sweep_finish_size_pool
will never be called on it, but gc_sweep_finish_size_pool must be called
to grow the size pool.
Depending on alignment, the last bitmap plane may not used. Then it will
appear as if all of the objects on that plane is unmarked, which will
cause a buffer overrun when we try to free the object. This commit
changes the loop to calculate the number of planes used
(bitmap_plane_count).
Since 4d8f76286b, we need to dereference
the includer field on iclasses, so we need to mark it to make sure
it's alive.
Sometimes during compaction we crash because the field is dangling,
though I have a hard time constructing such a situation. See
http://ci.rvm.jp/results/trunk@ruby-iga/3947725
We didn't update the includer field during compaction so it could become
a dangling pointer after compaction. It's only recently that we started
to dereference the field, and we were only comparing the pointer before
then, so the omission only recently started to cause crashes.
By instrumenting object.c:833 with `rp(includer);`, you can see the
includer field become `T_NONE` with the following script:
```ruby
mod = Module.new do
protected def foo = 1
end
klass = Class.new do
include Module.new
def run
foo
end
end
klass.include(mod)
GC.verify_compaction_references(double_heap: true, toward: :empty)
klass.new.run
```
I found a crash in a private application that this patch fixes, but
wasn't able to develop a small reproducer. Hence the above demo that
requires instrumentation.
During VM startup, rb_objspace_alloc sets malloc_limit
(objspace->malloc_params.limit) before ruby_gc_set_params is called, thus
nullifying the effect of RUBY_GC_MALLOC_LIMIT before the initial GC run.
The call sequence is as follows:
main.c::main()
ruby_init
ruby_setup
Init_BareVM
rb_objspace_alloc // malloc_limit = gc_params.malloc_limit_min;
ruby_options
ruby_process_options
process_options
ruby_gc_set_params // RUBY_GC_MALLOC_LIMIT => gc_params.malloc_limit_min
With ruby_gc_set_params setting malloc_limit, RUBY_GC_MALLOC_LIMIT
affects the process sooner.
[ruby-core:107170]
Commit dde164e968 decoupled incremental
marking from page sizes. This commit changes Ruby heap page sizes to
64KiB. Doing so will have several benefits:
1. We can use compaction on systems with 64KiB system page sizes (e.g.
PowerPC).
2. Larger page sizes will allow Variable Width Allocation to increase
slot sizes and embed larger objects.
3. Since commit 002fa28599, macOS has 64
KiB pages. Making page sizes 64 KiB will bring these systems to
parity.
I have attached some bechmark results below.
Discourse:
On Discourse, we saw much better p99 performance (e.g. for "categories"
it went from 214ms on master to 134ms on branch, for "home" it went
from 265ms to 251ms). We don’t see much change in p60, p75, and p90
performance. We also see a slight decrease in memory usage by 1.04x.
Branch RSS: 354.9MB
Master RSS: 368.2MB
railsbench:
On rails bench, we don’t see a big change in RPS or p99
performance. We don’t see a big difference in memory usage.
Branch RPS: 826.27
Master RPS: 824.85
Branch p99: 1.67
Master p99: 1.72
Branch RSS: 88.72MB
Master RSS: 88.48MB
liquid:
We don’t see a significant change in liquid performance.
Branch parse & render: 28.653 I/s
Master parse & render: 28.563 i/s
Currently, rb_aligned_malloc uses mmap if Ruby heap pages can be
allocated through mmap (when system heap page size <= Ruby heap page
size). If Ruby heap page sizes is increased to 64KiB, then mmap will
be used on systems with 64KiB system page sizes. However, the transient
heap also uses rb_aligned_malloc and requires 32KiB alignment. This
would break in the current implementation since it would allocate sizes
through mmap that is not a multiple of the system page size.
This commit adds heap_page_body_allocate which will use mmap when
possible and changes rb_aligned_malloc to not use mmap (and only
use posix_memalign).
This commit changes the way compaction moves objects and sweeps pages in
order to better facilitate object movement between size pools.
Previously we would move the scan cursor first until we found an empty
slot and then we'd decrement the compact cursor until we found something
to move into that slot. We would sweep the page that contained the scan
cursor before trying to fill it
In this algorithm we first move the compact cursor down until we find an
object to move - We then take a free page from the desired destination
heap (always the same heap in this current iteration of the code).
If there is no free page we sweep the page at the sweeping_page cursor,
add it to the free pages, and advance the cursor to the next page, and
try again.
We sweep one page from each size pool in this way, and then repeat that
process until all the size pools are compacted (all the cursors have
met), and then we update references and sweep the rest of the heap.
Currently, the number of incremental marking steps is calculated based
on the number of pooled pages available. This means that if we make Ruby
heap pages larger, it would run fewer incremental marking steps (which
would mean each incremental marking step takes longer).
This commit changes incremental marking to run after every
INCREMENTAL_MARK_STEP_ALLOCATIONS number of allocations. This means that
the behaviour of incremental marking remains the same regardless of the
Ruby heap page size.
I've benchmarked against discourse benchmarks and did not get a
significant change in response times beyond the margin of error. This is
expected as this new incremental marking algorithm behaves very
similarly to the previous one.
Use ISEQ_BODY macro to get the rb_iseq_constant_body of the ISeq. Using
this macro will make it easier for us to change the allocation strategy
of rb_iseq_constant_body when using Variable Width Allocation.
Previously, we would build a new `superclasses` array for each class,
even though for all immediate subclasses of a class, the array is
identical.
This avoids duplicating the arrays on leaf classes (those without
subclasses) by calculating and storing a "superclasses including self"
array on a class when it's first inherited and sharing that among all
superclasses.
An additional trick used is that the "superclass array including self"
is valid as "self"'s superclass array. It just has it's own class at the
end. We can use this to avoid an extra pointer of storage and can use
one bit of a flag to track that we've "upgraded" the array.
Previously when checking ancestors, we would walk all the way up the
ancestry chain checking each parent for a matching class or module.
I believe this was especially unfriendly to CPU cache since for each
step we need to check two cache lines (the class and class ext).
This check is used quite often in:
* case statements
* rescue statements
* Calling protected methods
* Class#is_a?
* Module#===
* Module#<=>
I believe it's most common to check a class against a parent class, to
this commit aims to improve that (unfortunately does not help checking
for an included Module).
This is done by storing on each class the number and an array of all
parent classes, in order (BasicObject is at index 0). Using this we can
check whether a class is a subclass of another in constant time since we
know the location to expect it in the hierarchy.
(1) gc_verify_internal_consistency() use barrier locking
for consistency while `during_gc == true` at the end
of the sweep on `RGENGC_CHECK_MODE >= 2`.
(2) `rb_objspace_reachable_objects_from()` is called without
VM synchronization and it checks `during_gc != true`.
So (1) and (2) causes BUG because of `during_gc == true`.
To prevent this error, wait for VM barrier on `during_gc == false`
and introduce VM locking on `rb_objspace_reachable_objects_from()`.
http://ci.rvm.jp/results/trunk-asserts@phosphorus-docker/3830088
gc_marks_continue will start sweeping when it finishes marking. However,
if the heap we are trying to allocate into is full, then the sweeping
may not yield any free slots. If we don't call gc_sweep_continue
immediate after this, then another GC will be started halfway during
lazy sweeping. gc_sweep_continue will either grow the heap or finish
sweeping.
Add a new macro BASE_SLOT_SIZE that determines the slot size.
For Variable Width Allocation (compiled with USE_RVARGC=1), all slot
sizes are powers-of-2 multiples of BASE_SLOT_SIZE.
For USE_RVARGC=0, BASE_SLOT_SIZE is set to sizeof(RVALUE).
Renames rb_id_table_foreach_with_replace to
rb_id_table_foreach_values_with_replace and passes only the value to the
callback. We can use this in GC compaction when we cannot access the
global symbol array.
NUM_IN_PAGE could return a value much larger than 64. According to the
C11 spec 6.5.7 paragraph 3 this is undefined behavior:
> If the value of the right operand is negative or is greater than or
> equal to the width of the promoted left operand, the behavior is
> undefined.
On most platforms, this is usually not a problem as the architecture
will mask off all out-of-range bits.
WebAssembly has function local infinite registers and stack values, but
there is no way to scan the values in a call stack for now.
This implementation uses Asyncify to spilling out wasm locals into
linear memory.
On 32-bit systems, VWA causes class_serial to not be aligned (it only
guarantees 4 byte alignment but class_serial is 8 bytes and requires 8
byte alignment). This commit uses a hack to allocate class_serial
through malloc. Once VWA allocates with 8 byte alignment in the future,
we will revert this commit.
This commit switches from a custom implemented bsearch algorithm to
use the one provided by the C standard library.
Because `is_pointer_to_heap` will only return true if the pointer
being searched for is a valid slot starting address within the heap
page body, we've extracted the bsearch call site into a more general
function so we can use it elsewhere.
The new function `heap_page_for_ptr` returns the heap page for any heap
page pointer, regardless of whether that is at the start of a slot or
in the middle of one.
We then use this function as the basis of `is_pointer_to_heap`.
Some callable method entries (cme) can be a key of `overloaded_cme_table`
and the keys should be pinned because the table is numtable (VALUE is a key).
Before the patch GC checks the cme is in `overloaded_cme_table` by looking up
the table, but it needs VM locking.
It works well in normal GC marking because it is protected by the VM lock,
but it doesn't work on `rb_objspace_reachable_objects_from` because it doesn't
use VM lock.
Now, the number of target cmes are small enough, I decide to pin down
all possible cmes instead of using looking up the table.
`overloaded_cme_table` keeps cme -> monly_cme pairs to manage
corresponding `monly_cme` for `cme`. The lifetime of the `monly_cme`
should be longer than `monly_cme`, but the previous patch losts the
reference to the living `monly_cme`.
Now `overloaded_cme_table` values are always root (keys are only weak
reference), it means `monly_cme` does not freed until corresponding
`cme` is invalidated.
To make managing easy, move `overloaded_cme_table` to `rb_vm_t`.
`def` (`rb_method_definition_t`) is shared by multiple callable
method entries (cme, `rb_callable_method_entry_t`).
There are two issues:
* old -> young reference: `cme1->def->mandatory_only_cme = monly_cme`
if `cme1` is young and `monly_cme` is young, there is no problem.
Howevr, another old `cme2` can refer `def`, in this case, old `cme2`
points young `monly_cme` and it violates gengc assumption.
* cme can have different `defined_class` but `monly_cme` only has
one `defined_class`. It does not make sense and `monly_cme`
should be created for a cme (not `def`).
To solve these issues, this patch allocates `monly_cme` per `cme`.
`cme` does not have another room to store a pointer to the `monly_cme`,
so this patch introduces `overloaded_cme_table`, which is weak key map
`[cme] -> [monly_cme]`.
`def::body::iseqptr::monly_cme` is deleted.
The first issue is reported by Alan Wu.
When using `rp(obj)` for debugging during development, it may be
useful to know that an object is soon to be swept. Add a new letter to
the object dump for whether the object is garbage. It's easy to forget
about lazy sweep.
Except on Windows and MinGW, we can only use compaction on systems that
use mmap (only systems that use mmap can use the read barrier that
compaction requires). We don't need to separately detect whether we can
support compaction or not.
* Lazily create singletons on instance_{exec,eval}
Previously when instance_exec or instance_eval was called on an object,
that object would be given a singleton class so that method
definitions inside the block would be added to the object rather than
its class.
This commit aims to improve performance by delaying the creation of the
singleton class unless/until one is needed for method definition. Most
of the time instance_eval is used without any method definition.
This was implemented by adding a flag to the cref indicating that it
represents a singleton of the object rather than a class itself. In this
case CREF_CLASS returns the object's existing class, but in cases that
we are defining a method (either via definemethod or
VM_SPECIAL_OBJECT_CBASE which is used for undef and alias).
This also happens to fix what I believe is a bug. Previously
instance_eval behaved differently with regards to constant access for
true/false/nil than for all other objects. I don't think this was
intentional.
String::Foo = "foo"
"".instance_eval("Foo") # => "foo"
Integer::Foo = "foo"
123.instance_eval("Foo") # => "foo"
TrueClass::Foo = "foo"
true.instance_eval("Foo") # NameError: uninitialized constant Foo
This also slightly changes the error message when trying to define a method
through instance_eval on an object which can't have a singleton class.
Before:
$ ruby -e '123.instance_eval { def foo; end }'
-e:1:in `block in <main>': no class/module to add method (TypeError)
After:
$ ./ruby -e '123.instance_eval { def foo; end }'
-e:1:in `block in <main>': can't define singleton (TypeError)
IMO this error is a small improvement on the original and better matches
the (both old and new) message when definging a method using `def self.`
$ ruby -e '123.instance_eval{ def self.foo; end }'
-e:1:in `block in <main>': can't define singleton (TypeError)
Co-authored-by: Matthew Draper <matthew@trebex.net>
* Remove "under" argument from yield_under
* Move CREF_SINGLETON_SET into vm_cref_new
* Simplify vm_get_const_base
* Fix leaf VM_SPECIAL_OBJECT_CONST_BASE
Co-authored-by: Matthew Draper <matthew@trebex.net>
suseconds_t, which is the type of tv_usec, may be defined with a longer
size type than tv_nsec's type (long). So usec to nsec conversion needs
an explicit casting.
This commit adds a Ractor cache for every size pool. Previously, all VWA
allocated objects used the slowpath and locked the VM.
On a micro-benchmark that benchmarks String allocation:
VWA turned off:
29.196591 0.889709 30.086300 ( 9.434059)
VWA before this commit:
29.279486 41.477869 70.757355 ( 12.527379)
VWA after this commit:
16.782903 0.557117 17.340020 ( 4.255603)
Updating RCLASS_PARENT_SUBCLASSES and RCLASS_MODULE_SUBCLASSES while
compacting can trigger the read barrier. This commit makes
RCLASS_SUBCLASSES a doubly linked list with a dedicated head object so
that we can add and remove entries from the list without having to touch
an object in the Ruby heap
* `GC.measure_total_time = true` enables total time measurement (default: true)
* `GC.measure_total_time` returns current flag.
* `GC.total_time` returns measured total time in nano seconds.
* `GC.stat(:time)` (and Hash) returns measured total time in milli seconds.
Compare with the C methods, A built-in methods written in Ruby is
slower if only mandatory parameters are given because it needs to
check the argumens and fill default values for optional and keyword
parameters (C methods can check the number of parameters with `argc`,
so there are no overhead). Passing mandatory arguments are common
(optional arguments are exceptional, in many cases) so it is important
to provide the fast path for such common cases.
`Primitive.mandatory_only?` is a special builtin function used with
`if` expression like that:
```ruby
def self.at(time, subsec = false, unit = :microsecond, in: nil)
if Primitive.mandatory_only?
Primitive.time_s_at1(time)
else
Primitive.time_s_at(time, subsec, unit, Primitive.arg!(:in))
end
end
```
and it makes two ISeq,
```
def self.at(time, subsec = false, unit = :microsecond, in: nil)
Primitive.time_s_at(time, subsec, unit, Primitive.arg!(:in))
end
def self.at(time)
Primitive.time_s_at1(time)
end
```
and (2) is pointed by (1). Note that `Primitive.mandatory_only?`
should be used only in a condition of an `if` statement and the
`if` statement should be equal to the methdo body (you can not
put any expression before and after the `if` statement).
A method entry with `mandatory_only?` (`Time.at` on the above case)
is marked as `iseq_overload`. When the method will be dispatch only
with mandatory arguments (`Time.at(0)` for example), make another
method entry with ISeq (2) as mandatory only method entry and it
will be cached in an inline method cache.
The idea is similar discussed in https://bugs.ruby-lang.org/issues/16254
but it only checks mandatory parameters or more, because many cases
only mandatory parameters are given. If we find other cases (optional
or keyword parameters are used frequently and it hurts performance),
we can extend the feature.
With RVARGC we always store the rb_classext_t in the same slot as the
RClass struct that refers to it. So we don't need to store the pointer
or access through the pointer anymore and can switch the RCLASS_EXT
macro to use an offset
This commit fixes a memory leak introduced in an early part of the
variable width allocation project that would prevent the rb_classext_t
struct from being free'd when the class is swept.
This commit deprecates rb_gc_force_recycle and coverts it to a no-op
function. Also removes invalidate_mark_stack_chunk since only
rb_gc_force_recycle uses it.
```
../gc.c:2342:45: warning: comparison of integers of different signs: 'short' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
GC_ASSERT(size_pools[pool_id].slot_size == slot_size);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~
```
Add cast to short, because `GC_ASSERT`s in `size_pool_for_size`
already use cast to short.
```
../gc.c:2342:25: warning: array subscript is of type 'char' [-Wchar-subscripts]
GC_ASSERT(size_pools[pool_id].slot_size == slot_size);
^~~~~~~~
```
It seems like `gc_verify_compaction_references` is not protected in case
alignment is wrong. This commit pushes the alignment check down to
`gc_start_internal` so that anyone trying to compact will check page
alignment
I think this method may be getting called on PowerPC and the alignment
might be wrong.
http://rubyci.s3.amazonaws.com/ppc64le/ruby-master/log/20211021T190006Z.fail.html.gz
I'm looking through the places where YJIT needs notifications. It looks
like these changes to gc.c and vm_callinfo.h have become unnecessary
since 84ab77ba59. This commit just makes the diff against upstream
smaller, but otherwise shouldn't change any behavior.
Added UJIT_CHECK_MODE. Set to 1 to double check method dispatch in
generated code.
It's surprising to me that we need to watch both cc and cme. There might
be opportunities to simplify there.
Runtime assertion for the argument declared as non-null.
This macro does nothing if `RBIMPL_ATTR_NONNULL` is effective,
otherwise asserts that the argument is non-null.
Commit 123eeb1c1a added incremental GC
which moved resetting malloc_increase, oldmalloc_increase to before
marking. However, during_minor_gc is not set until gc_marks_start. So
the value will be from the last GC run, rather than the current one.
Before the incremental GC commit, this code was in gc_before_sweep
which ran before sweep (after marking) so the value during_minor_gc
was correct.
When the object is moved back into the T_MOVED, the flags of the T_MOVED
is not copied, so the FL_FROM_FREELIST flag is lost. This causes
total_freed_objects to always be incremented.
When invalidating a page during compaction, the free_slots count should
be updated for the page of the object and not the page of the forwarding
address (since the object gets moved back to the forwarding address).
Force recycled objects could create a freelist for the page. At the
start of sweeping we should append to the freelist to avoid
permanently losing the slots on the freelist.
This commits implements size classes in the GC for the Variable Width
Allocation feature. Unless `USE_RVARGC` compile flag is set, only a
single size class is created, maintaining current behaviour. See the
redmine ticket for more details.
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
This commit removes T_PAYLOAD since the new VWA implementation no longer
requires T_PAYLOAD types.
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
This commits implements size classes in the GC for the Variable Width
Allocation feature. Unless `USE_RVARGC` compile flag is set, only a
single size class is created, maintaining current behaviour. See the
redmine ticket for more details.
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
This commit removes T_PAYLOAD since the new VWA implementation no longer
requires T_PAYLOAD types.
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
which have not undefined or redefined it.
When a `T_DATA` object is created whose class has not undefined or
redefined the alloc function, the alloc function now gets undefined by
Data_Wrap_Struct et al. Optionally, a future release may also warn
that this being done.
This should help developers of C extensions to meet the requirements
explained in "doc/extension.rdoc". Without a check like this, there is
no easy way for an author of a C extension to see where they have made
a mistake.
Commit c32218de1b turned during_compacting
flag to 2 bits to support the case when there is no write barrier. But
commit 32b7dcfb56 changed compaction to
always enable the write barrier. This commit cleans up some of the
leftover code.
This fixes cases where exceptions raised using Thread#raise are
swallowed by finalizers and not delivered to the running thread.
This could cause issues with finalizers that rely on pending interrupts,
but that case is expected to be rarer.
Fixes [Bug #13876]
Fixes [Bug #15507]
Co-authored-by: Koichi Sasada <ko1@atdot.net>
This commit adds an assertion has been added after `gc_page_sweep` to
verify that the freelist length is equal to the number of free slots in
the page.
When a Ractor is removed, the freelist in the Ractor cache is not
returned to the GC, leaving the freelist permanently lost. This commit
recycles the freelist when the Ractor is destroyed, preventing a memory
leak from occurring.
If we force recycle an object before the page is swept, we should clear
it in the mark bitmap. If we don't clear it in the bitmap, then during
sweeping we won't account for this free slot so the `free_slots` count
of the page will be incorrect.
When running btest there is a crash when compiled with
RGENGC_CHECK_MODE=4. The crash happens because `during_gc` is not
turned off before `gc_marks_check` is called, causing the marking to
happen on the main mark stack instead of mark stack created in
`objspace_allrefs`.
Redo of 34a2acdac788602c14bf05fb616215187badd504 and
931138b00696419945dc03e10f033b1f53cd50f3 which were reverted.
GitHub PR #4340.
This change implements a cache for class variables. Previously there was
no cache for cvars. Cvar access is slow due to needing to travel all the
way up th ancestor tree before returning the cvar value. The deeper the
ancestor tree the slower cvar access will be.
The benefits of the cache are more visible with a higher number of
included modules due to the way Ruby looks up class variables. The
benchmark here includes 26 modules and shows with the cache, this branch
is 6.5x faster when accessing class variables.
```
compare-ruby: ruby 3.1.0dev (2021-03-15T06:22:34Z master 9e5105c) [x86_64-darwin19]
built-ruby: ruby 3.1.0dev (2021-03-15T12:12:44Z add-cache-for-clas.. c6be009) [x86_64-darwin19]
| |compare-ruby|built-ruby|
|:--------|-----------:|---------:|
|vm_cvar | 5.681M| 36.980M|
| | -| 6.51x|
```
Benchmark.ips calling `ActiveRecord::Base.logger` from within a Rails
application. ActiveRecord::Base.logger has 71 ancestors. The more
ancestors a tree has, the more clear the speed increase. IE if Base had
only one ancestor we'd see no improvement. This benchmark is run on a
vanilla Rails application.
Benchmark code:
```ruby
require "benchmark/ips"
require_relative "config/environment"
Benchmark.ips do |x|
x.report "logger" do
ActiveRecord::Base.logger
end
end
```
Ruby 3.0 master / Rails 6.1:
```
Warming up --------------------------------------
logger 155.251k i/100ms
Calculating -------------------------------------
```
Ruby 3.0 with cvar cache / Rails 6.1:
```
Warming up --------------------------------------
logger 1.546M i/100ms
Calculating -------------------------------------
logger 14.857M (± 4.8%) i/s - 74.198M in 5.006202s
```
Lastly we ran a benchmark to demonstate the difference between master
and our cache when the number of modules increases. This benchmark
measures 1 ancestor, 30 ancestors, and 100 ancestors.
Ruby 3.0 master:
```
Warming up --------------------------------------
1 module 1.231M i/100ms
30 modules 432.020k i/100ms
100 modules 145.399k i/100ms
Calculating -------------------------------------
1 module 12.210M (± 2.1%) i/s - 61.553M in 5.043400s
30 modules 4.354M (± 2.7%) i/s - 22.033M in 5.063839s
100 modules 1.434M (± 2.9%) i/s - 7.270M in 5.072531s
Comparison:
1 module: 12209958.3 i/s
30 modules: 4354217.8 i/s - 2.80x (± 0.00) slower
100 modules: 1434447.3 i/s - 8.51x (± 0.00) slower
```
Ruby 3.0 with cvar cache:
```
Warming up --------------------------------------
1 module 1.641M i/100ms
30 modules 1.655M i/100ms
100 modules 1.620M i/100ms
Calculating -------------------------------------
1 module 16.279M (± 3.8%) i/s - 82.038M in 5.046923s
30 modules 15.891M (± 3.9%) i/s - 79.459M in 5.007958s
100 modules 16.087M (± 3.6%) i/s - 81.005M in 5.041931s
Comparison:
1 module: 16279458.0 i/s
100 modules: 16087484.6 i/s - same-ish: difference falls within error
30 modules: 15891406.2 i/s - same-ish: difference falls within error
```
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
heap_set_increment essentially only calls heap_allocatable_pages_set.
They only differ in behaviour when `additional_pages == 0`. However,
this is only possible because heap_extend_pages may return 0. This
commit also changes heap_extend_pages to always return at least 1.
If we are during incremental sweeping when calling gc_set_initial_pages
there is an assertion error. The following patch will artificially
produce the bug:
```
diff --git a/gc.c b/gc.c
index c3157dbe2c..d7282cf8f0 100644
--- a/gc.c
+++ b/gc.c
@@ -404,7 +404,7 @@ int ruby_rgengc_debug;
* 5: show all references
*/
#ifndef RGENGC_CHECK_MODE
-#define RGENGC_CHECK_MODE 0
+#define RGENGC_CHECK_MODE 1
#endif
// Note: using RUBY_ASSERT_WHEN() extend a macro in expr (info by nobu).
@@ -10821,6 +10821,10 @@ gc_set_initial_pages(void)
void
ruby_gc_set_params(void)
{
+ for (int i = 0; i < 10000; i++) {
+ rb_ary_new();
+ }
+
/* RUBY_GC_HEAP_FREE_SLOTS */
if (get_envparam_size("RUBY_GC_HEAP_FREE_SLOTS", &gc_params.heap_free_slots, 0)) {
/* ok */
```
The crash looks like:
```
Assertion Failed: ../gc.c:2038:heap_add_page:!(heap == heap_eden && heap->sweeping_page)
```
NUM_IN_PAGE(page->start) will sometimes return a 0 or a 1 depending on
how the alignment of the 40 byte slots work out. This commit uses the
NUM_IN_PAGE function to shift the bitmap down on the first bitmap plane.
Iterating on the first bitmap plane is "special", but this commit allows
us to align object addresses on something besides 40 bytes, and also
eliminates the need to fill guard bits.
We are only iterating over the eden heap so `heap_eden->total_pages`
contains the exact number of pages we need to allocate for.
`heap_allocated_pages` may contain pages in the tomb.
Instead of keeping track of the current bit plane, keep track of the
actual slot when compacting. This means we don't need to re-scan
objects inside the same bit plane when we continue with movement
When objects are popped from the mark stack, we check that the object is
the right type (otherwise an rb_bug happens). The problem is that when
we pop a bad object from the stack, we have no idea what pushed the bad
object on the stack.
This change makes an error happen when a bad object is pushed on the
mark stack, that way we can track down the source of the bug.
Since compaction can be concurrent, the machine stack is allowed to
change while compaction is happening. When compaction finishes, there
may be references on the machine stack that need to be reverted so that
we can remove the read barrier.
This change implements a cache for class variables. Previously there was
no cache for cvars. Cvar access is slow due to needing to travel all the
way up th ancestor tree before returning the cvar value. The deeper the
ancestor tree the slower cvar access will be.
The benefits of the cache are more visible with a higher number of
included modules due to the way Ruby looks up class variables. The
benchmark here includes 26 modules and shows with the cache, this branch
is 6.5x faster when accessing class variables.
```
compare-ruby: ruby 3.1.0dev (2021-03-15T06:22:34Z master 9e5105ca45) [x86_64-darwin19]
built-ruby: ruby 3.1.0dev (2021-03-15T12:12:44Z add-cache-for-clas.. c6be0093ae) [x86_64-darwin19]
| |compare-ruby|built-ruby|
|:--------|-----------:|---------:|
|vm_cvar | 5.681M| 36.980M|
| | -| 6.51x|
```
Benchmark.ips calling `ActiveRecord::Base.logger` from within a Rails
application. ActiveRecord::Base.logger has 71 ancestors. The more
ancestors a tree has, the more clear the speed increase. IE if Base had
only one ancestor we'd see no improvement. This benchmark is run on a
vanilla Rails application.
Benchmark code:
```ruby
require "benchmark/ips"
require_relative "config/environment"
Benchmark.ips do |x|
x.report "logger" do
ActiveRecord::Base.logger
end
end
```
Ruby 3.0 master / Rails 6.1:
```
Warming up --------------------------------------
logger 155.251k i/100ms
Calculating -------------------------------------
```
Ruby 3.0 with cvar cache / Rails 6.1:
```
Warming up --------------------------------------
logger 1.546M i/100ms
Calculating -------------------------------------
logger 14.857M (± 4.8%) i/s - 74.198M in 5.006202s
```
Lastly we ran a benchmark to demonstate the difference between master
and our cache when the number of modules increases. This benchmark
measures 1 ancestor, 30 ancestors, and 100 ancestors.
Ruby 3.0 master:
```
Warming up --------------------------------------
1 module 1.231M i/100ms
30 modules 432.020k i/100ms
100 modules 145.399k i/100ms
Calculating -------------------------------------
1 module 12.210M (± 2.1%) i/s - 61.553M in 5.043400s
30 modules 4.354M (± 2.7%) i/s - 22.033M in 5.063839s
100 modules 1.434M (± 2.9%) i/s - 7.270M in 5.072531s
Comparison:
1 module: 12209958.3 i/s
30 modules: 4354217.8 i/s - 2.80x (± 0.00) slower
100 modules: 1434447.3 i/s - 8.51x (± 0.00) slower
```
Ruby 3.0 with cvar cache:
```
Warming up --------------------------------------
1 module 1.641M i/100ms
30 modules 1.655M i/100ms
100 modules 1.620M i/100ms
Calculating -------------------------------------
1 module 16.279M (± 3.8%) i/s - 82.038M in 5.046923s
30 modules 15.891M (± 3.9%) i/s - 79.459M in 5.007958s
100 modules 16.087M (± 3.6%) i/s - 81.005M in 5.041931s
Comparison:
1 module: 16279458.0 i/s
100 modules: 16087484.6 i/s - same-ish: difference falls within error
30 modules: 15891406.2 i/s - same-ish: difference falls within error
```
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
The current fix for PAGE_SIZE macro detection in autoconf does not work
correctly. I see the following output with running configure on Linux:
```
checking PAGE_SIZE is defined... no
```
Linux has PAGE_SIZE macro. This is happening because the macro exists in
sys/user.h and not in the malloc headers.
This allows us to allocate the right size for the object in advance,
meaning that we don't have to pay the cost of ivar table extension
later. The idea is that if an object type ever became "extended" at
some point, then it is very likely it will become extended again. So we
may as well allocate the ivar table up front.
Previously imemo_ast was handled as WB-protected which caused a segfault
of the following code:
# shareable_constant_value: literal
M0 = {}
M1 = {}
...
M100000 = {}
My analysis is here: `shareable_constant_value: literal` creates many
Hash instances during parsing, and add them to node_buffer of imemo_ast.
However, the contents are missed because imemo_ast is incorrectly
WB-protected.
This changeset makes imemo_ast as WB-unprotected.
IV index tables weren't being freed. This program would leak memory:
```ruby
loop do
k = Class.new do
def initialize
@a = 1
@b = 1
@c = 1
@d = 1
@e = 1
@f = 1
@g = 1
end
end
k.new
end
```
This commit fixes the leak.
Emscripten provides "emscripten_scan_stack" to get the beginning and end
pointers of the stack for conservative GC.
Also, "emscripten_scan_registers" allows the GC to mark local variables
in WASM.
`rb_define_const` can add objects as "mark objects". This is to make
code like this work:
33d6e92e0c/ext/etc/etc.c (L1201)
```
rb_define_const(rb_cStruct, "Passwd", sPasswd); /* deprecated name */
```
sPasswd is a heap allocated object that is also a C global, so we can't
move it (it needs to be pinned). However, we have many calls to
`rb_define_const` that just pass in an integer like this:
```
rb_define_const(rb_cDBM, "WRITER", INT2FIX(O_RDWR|RUBY_DBM_RW_BIT));
```
Non heap allocated objects like integers will never move, so there is no
reason to waste time in the GC marking / pinning them.
Because of `double` in `RFloat`, `RValue` would be packed by
`sizeof(double)` by default, on platforms where `double` is wider
than `VALUE`. Size of `RValue` is multiple of 5 now.
[A previous commit](b59077eecf) removes some macro definitions that are used when RGENGC_CHECK_MODE >=4 because they were using data stored against objspace, which is not ractor safe
This commit reinstates those macro definitions, using the current ractor
Co-authored-by: peterzhu2118 <peter@peterzhu.ca>
Some objects can survive the GC before compaction, but get collected in
the second compaction. This means we could have objects reference
T_MOVED during "free" in the second, compacting GC. If that is the
case, we need to invalidate those "moved" addresses. Invalidation is
done via read barrier, so we need to make sure the read barrier is
active even during `GC.compact`.
This also means we don't actually need to do one GC before compaction,
we can just do the compaction and GC in one step.