According to the C99 specification section 7.20.3.2 paragraph 2:
> If ptr is a null pointer, no action occurs.
So we do not need to check that the pointer is a null pointer.
Because a thread calling IO#close now blocks in a native condvar wait,
it's possible for there to be _no_ threads left to actually handle
incoming signals/ubf calls/etc.
This manifested as failing tests on Solaris 10 (SPARC), because:
* One thread called IO#close, which sent a SIGVTALRM to the other
thread to interrupt it, and then waited on the condvar to be notified
that the reading thread was done.
* One thread was calling IO#read, but it hadn't yet reached the actual
call to select(2) when the SIGVTALRM arrived, so it never unblocked
itself.
This results in a deadlock.
The fix is to use a real Ruby mutex for the close lock; that way, the
closing thread goes into sigwait-sleep and can keep trying to interrupt
the select(2) thread.
See the discussion in: https://github.com/ruby/ruby/pull/7865/
When one thread is closing a file descriptor whilst another thread is
concurrently reading it, we need to wait for the reading thread to be
done with it to prevent a potential EBADF (or, worse, file descriptor
reuse).
At the moment, that is done by keeping a list of threads still using the
file descriptor in io_close_fptr. It then continually calls
rb_thread_schedule() in fptr_finalize_flush until said list is empty.
That busy-looping seems to behave rather poorly on some OS's,
particulary FreeBSD. It can cause the TestIO#test_race_gets_and_close
test to fail (even with its very long 200 second timeout) because the
closing thread starves out the using thread.
To fix that, I introduce the concept of struct rb_io_close_wait_list; a
list of threads still using a file descriptor that we want to close. We
call `rb_notify_fd_close` to let the thread scheduler know we're closing
a FD, which fills the list with threads. Then, we call
rb_notify_fd_close_wait which will block the thread until all of the
still-using threads are done.
This is implemented with a condition variable sleep, so no busy-looping
is required.
This patch fixes a potential busy-loop in the thread scheduler. If there
are two threads, the main thread (where Ruby signal handlers must run)
and a sleeping thread, it is possible for the following sequence of
events to occur:
* The sleeping thread is in native_sleep -> sigwait_sleep A signal
* arives, kicking this thread out of rb_sigwait_sleep The sleeping
* thread calls THREAD_BLOCKING_END and eventually
thread_sched_to_running_common
* the sleeping thread writes into the sigwait_fd pipe by calling
rb_thread_wakeup_timer_thread
* the sleeping thread re-loops around in native_sleep() because
the desired sleep time has not actually yet expired
* that calls rb_sigwait_sleep again the ppoll() in rb_sigwait_sleep
* immediately returns because
of the byte written into the sigwait_fd by
rb_thread_wakeup_timer_thread
* that wakes the thread up again and kicks the whole cycle off again.
Such a loop can only be broken by the main thread waking up and handling
the signal, such that ubf_threads_empty() below becomes true again;
however this loop can actually keep things so busy (and cause so much
contention on the main thread's interrupt_lock) that the main thread
doesn't deal with the signal for many seconds. This seems particuarly
likely on FreeBSD 13.
(the cycle can also be broken by the sleeping thread finally elapsing
its desired sleep time).
The fix for _this_ loop is to only wakeup the timer thrad in
thread_sched_to_running_common if the current thread is not itself the
sigwait thread.
An almost identical loop also happens in the same circumstances because
the call to check_signals_nogvl (through sigwait_timeout) in
rb_sigwait_sleep returns true if there is any pending signal for the
main thread to handle. That then causes rb_sigwait_sleep to skip over
sleeping entirely.
This is unnescessary and counterproductive, I believe; if the main
thread needs to be woken up that is done inline in check_signals_nogvl
anyway.
See https://bugs.ruby-lang.org/issues/19680
because of 9720f5ac89http://rubyci.s3.amazonaws.com/solaris11-sunc/ruby-master/log/20230403T130011Z.fail.html.gz
```
1) Failure:
TestThread#test_signal_at_join [/export/home/chkbuild/chkbuild-sunc/tmp/build/20230403T130011Z/ruby/test/ruby/test_thread.rb:1488]:
Exception raised:
<#<fatal:"No live threads left. Deadlock?\n1 threads, 1 sleeps current:0x00891288 main thread:0x00891288\n* #<Thread:0xfef89a18 sleep_forever>\n rb_thread_t:0x00891288 native:0x00000001 int:0\n \n">>
Backtrace:
-:30:in `join'
-:30:in `block (3 levels) in <main>'
-:21:in `times'
-:21:in `block (2 levels) in <main>'.
```
The mechanism:
* Main thread (M) calls `Thread#join`
* M: calls `sleep_forever()`
* M: set `th->status = THREAD_STOPPED_FOREVER`
* M: do `checkints`
* M: handle a trap handler with `th->status = THREAD_RUNNABLE`
* M: thread switch at the end of the trap handler
* Another thread (T) will process `Thread#kill` by M.
* T: `rb_threadptr_join_list_wakeup()` at the end of T tris to wakeup M,
but M's state is runnable because M is handling trap handler and
just ignore the waking up and terminate T$a
* T: switch to M.
* M: after the trap handler, reset `th->status = THREAD_STOPPED_FOREVER`
and check deadlock -> Deadlock because only M is living.
To avoid such situation, add new sleep flags `SLEEP_ALLOW_SPURIOUS`
and `SLEEP_NO_CHECKINTS` to skip any check ints.
BTW this is instentional to leave second `vm_check_ints_blocking()`
without checking `SLEEP_NO_CHECKINTS` because `SLEEP_ALLOW_SPURIOUS`
should be specified with `SLEEP_NO_CHECKINTS` and skipping this
checkints can skip any interrupts.
* Revert "Remove special handling of `SIGCHLD`. (#7482)"
This reverts commit 44a0711eab.
* Revert "Remove prototypes for functions that are no longer used. (#7497)"
This reverts commit 4dce12bead.
* Revert "Remove SIGCHLD `waidpid`. (#7476)"
This reverts commit 1658e7d966.
* Fix change to rjit variable name.
It's possible (but very rare) to have a race condition between setting
`mutex->fiber = NULL` and `thread_mutex_remove(th, mutex)` which results
in the following bug:
```
[BUG] invalid keeping_mutexes: Attempt to unlock a mutex which is not locked
```
Fixes <https://bugs.ruby-lang.org/issues/19480>.
```
1) Failure:
TestThreadInstrumentation#test_thread_instrumentation [/tmp/ruby/src/trunk-repeat20-asserts/test/-ext-/thread/test_instrumentation_api.rb:33]:
Call counters[4]: [3, 4, 4, 4, 0].
Expected 0 to be > 0.
```
We fire the EXIT hook after the call to `thread_sched_to_dead` which
mean another thread might be running before the `EXIT` hook have been
executed.
[Bug #19415]
If multiple threads attemps to load the same file concurrently
it's not a circular dependency issue.
So we check that the existing ThreadShield is owner by the current
fiber before warning about circular dependencies.
[Bug #19415]
If multiple threads attemps to load the same file concurrently
it's not a circular dependency issue.
So we check that the existing ThreadShield is owner by the current
fiber before warning about circular dependencies.
First, rb_mjit_fork should call rb_thread_atfork to stop threads after
fork in the child process. Unfortunately, we cannot use rb_fork_ruby to
prevent this kind of mistakes because MJIT needs special handling of
waiting_pid and mjit_pause/resume.
Second, mjit_waitpid_finished should be checked regardless of
trap_interrupt. It doesn't seem like the flag is not set when SIGCHLD is
handled for an MJIT child process.
GCC warns of empty format strings, perhaps because they have no
effects in printf() and there are better ways than sprintf().
However, ruby_debug_log() adds informations other than the format,
this warning is not the case.
rb_ary_tmp_new suggests that the array is temporary in some way, but
that's not true, it just creates an array that's hidden and not on the
transient heap. This commit renames it to rb_ary_hidden_new.
`rb_thread_wait_for_single_fd` needs to mutate the `waiting_fds` list
that is stored on the VM. We need to delete the FD from the list before
returning, and deleting from the list requires a VM lock (because the
list is a global).
[Bug #18816] [ruby-core:108771]
Co-Authored-By: Alan Wu <alanwu@ruby-lang.org>
[Feature #18339]
After experimenting with the initial version of the API I figured there is a need
for an exit event to cleanup instrumentation data. e.g. if you record data in a
{thread_id -> data} table, you need to free associated data when a thread goes away.
Previously, because opt_aref and opt_aset don't push a frame, when they
would call rb_hash to determine the hash value of the key, the initial
level of recursion would incorrectly use the method id at the top of the
stack instead of "hash".
This commit replaces rb_exec_recursive_outer with
rb_exec_recursive_outer_mid, which takes an explicit method id, so that
we can make the hash calculation behave consistently.
rb_exec_recursive_outer was documented as being internal, so I believe
this should be okay to change.
`NON_SCALAR_THREAD_ID` shows `pthread_t` is non-scalar (non-pointer)
and only s390x is known platform. However, the supporting code is
very complex and it is only used for deubg print information.
So this patch removes the support of `NON_SCALAR_THREAD_ID`
and make the code simple.
`rb_thread_t` contained `native_thread_data_t` to represent
thread implementation dependent data. This patch separates
them and rename it `rb_native_thread` and point it from
`rb_thraed_t`.
Now, 1 Ruby thread (`rb_thread_t`) has 1 native thread (`rb_native_thread`).
Now GVL is not process *Global* so this patch try to use
another words.
* `rb_global_vm_lock_t` -> `struct rb_thread_sched`
* `gvl->owner` -> `sched->running`
* `gvl->waitq` -> `sched->readyq`
* `rb_gvl_init` -> `rb_thread_sched_init`
* `gvl_destroy` -> `rb_thread_sched_destroy`
* `gvl_acquire` -> `thread_sched_to_running` # waiting -> ready -> running
* `gvl_release` -> `thread_sched_to_waiting` # running -> waiting
* `gvl_yield` -> `thread_sched_yield`
* `GVL_UNLOCK_BEGIN` -> `THREAD_BLOCKING_BEGIN`
* `GVL_UNLOCK_END` -> `THREAD_BLOCKING_END`
* removed
* `rb_ractor_gvl`
* `rb_vm_gvl_destroy` (not used)
There are GVL functions such as `rb_thread_call_without_gvl()` yet
but I don't have good name to replace them. Maybe GVL stands for
"Greate Valuable Lock" or something like that.
Use ISEQ_BODY macro to get the rb_iseq_constant_body of the ISeq. Using
this macro will make it easier for us to change the allocation strategy
of rb_iseq_constant_body when using Variable Width Allocation.
Currently the calculation only counts the size of the struct. This commit adds the size of the associated st tables, id tables, and linked lists.
Still missing is the size of the ractors and (potentially) the size of the object space.
If the thread termination invokes user code after `th->status` becomes
`THREAD_KILLED`, and the user unblock function causes that `th->status` to
become something else (e.g. `THREAD_RUNNING`), threads waiting in
`thread_join_sleep` will hang forever. We move the unblock function call
to before the thread status is updated, and allow threads to join as soon
as `th->value` becomes defined.
This reverts commit 6505c77501.
If the thread termination invokes user code after `th->status` becomes
`THREAD_KILLED`, and the user unblock function causes that `th->status` to
become something else (e.g. `THREAD_RUNNING`), threads waiting in
`thread_join_sleep` will hang forever. We move the unblock function call
to before the thread status is updated, and allow threads to join as soon
as `th->value` becomes defined.
* Wake up join list within thread EC context.
* Consume items from join list so that they are not re-executed.
If `rb_fiber_scheduler_unblock` raises an exception, it can result in a
segfault if `rb_threadptr_join_list_wakeup` is not within a valid EC. This
change moves `rb_threadptr_join_list_wakeup` into the thread's top level EC
which initially caused an infinite loop because on exception will retry. We
explicitly remove items from the thread's join list to avoid this situation.
* Verify the required scheduler interface.
* Test several scheduler hooks methods with broken `unblock` implementation.
The kill/terminate interrupts are internally handled not as Exception
instances, but as integers. So using Exception doesn't handle these
interrupts, but Object does. You can use Integer if you only want to
handle kill/terminate interrupts, but that's probably more of an
implementation detail, while handling Object should work regardless
of the implementation.
Fixes [Bug #15735]
Thread's are assigned a group at initialization, and no API exists
for them to unassign them from a group without assigning them to
another group.
Fixes [Bug #17505]
* Rename `rb_scheduler` to `rb_fiber_scheduler`.
* Use public interface if available.
* Use `rb_check_funcall` where possible.
* Don't use `unblock` unless the fiber was non-blocking.