* Introduction of Happy Eyeballs Version 2 (RFC 8305) in Socket.tcp
This is an implementation of Happy Eyeballs version 2 (RFC 8305) in Socket.tcp.
[Background]
Currently, `Socket.tcp` synchronously resolves names and makes connection attempts with `Addrinfo::foreach`.
This implementation has the following two problems.
1. In name resolution, the program stops until the DNS server responds to all DNS queries.
2. In a connection attempt, while one IP address is being tried and the attempt is taking time, the program blocks and the other resolved IP addresses cannot be tried.
[Proposal]
"Happy Eyeballs" ([RFC 8305](https://datatracker.ietf.org/doc/html/rfc8305)) is an algorithm to solve this kind of problem. It avoids delays to the user whenever possible and also uses IPv6 preferentially.
I implemented it in `Socket.tcp` by calling `Addrinfo.getaddrinfo` in a thread spawned per address family to resolve the hostname asynchronously, and by using `Socket#connect_nonblock` to attempt connections to multiple resolved addresses in parallel.
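Conceptually, the approach can be sketched as follows. This is only an illustration with a hypothetical `happy_eyeballs_tcp_sketch` helper, not the actual implementation; among other things, the real code also honors the RFC's resolution and connection-attempt delays and the existing timeout options.
```ruby
require 'socket'

# Simplified illustration only: resolve each address family in its own
# thread, then race non-blocking connection attempts and return the
# first socket that finishes the handshake.
def happy_eyeballs_tcp_sketch(host, port, timeout: 10)
  resolved = Queue.new
  [Socket::AF_INET6, Socket::AF_INET].each do |family|
    Thread.new do
      addrinfos = Addrinfo.getaddrinfo(host, port, family, :STREAM) rescue []
      addrinfos.each { |ai| resolved << ai }
    end
  end

  attempts = [] # sockets with a connect currently in progress
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    # Start an attempt for every address resolved so far.
    until resolved.empty?
      ai = resolved.pop
      sock = Socket.new(ai.afamily, :STREAM)
      begin
        sock.connect_nonblock(ai.to_sockaddr)
        attempts.each(&:close)
        return sock                   # connected immediately
      rescue IO::WaitWritable
        attempts << sock              # handshake in progress; poll below
      rescue SystemCallError
        sock.close                    # e.g. network unreachable; try others
      end
    end

    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if remaining <= 0
      attempts.each(&:close)
      raise Errno::ETIMEDOUT
    end

    if attempts.empty?
      sleep 0.01                      # wait for more resolution results
      next
    end

    # Wake up periodically so newly resolved addresses can also be tried.
    _, writable, = IO.select(nil, attempts, nil, [remaining, 0.25].min)
    next unless writable

    writable.each do |sock|
      if sock.getsockopt(:SOCKET, :ERROR).int.zero?
        (attempts - [sock]).each(&:close)
        return sock                   # handshake completed
      else
        attempts.delete(sock)
        sock.close                    # this attempt failed; keep racing
      end
    end
  end
end
```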
[Outcome]
This change eliminates a hang in the following cases.
Case 1. One of the A or AAAA DNS queries does not return
```ruby
require 'socket'

class Addrinfo
  class << self
    # The current Socket.tcp depends on foreach
    def foreach(nodename, service, family=nil, socktype=nil, protocol=nil, flags=nil, timeout: nil, &block)
      getaddrinfo(nodename, service, Socket::AF_INET6, socktype, protocol, flags, timeout: timeout)
        .concat(getaddrinfo(nodename, service, Socket::AF_INET, socktype, protocol, flags, timeout: timeout))
        .each(&block)
    end

    def getaddrinfo(_, _, family, *_)
      case family
      when Socket::AF_INET6 then sleep # emulate an AAAA query that never returns
      when Socket::AF_INET then [Addrinfo.tcp("127.0.0.1", 4567)]
      end
    end
  end
end

Socket.tcp("localhost", 4567)
```
Because the current `Socket.tcp` waits for the IPv6 name resolution, which never completes in this case, the program hangs and never starts a connection attempt with the IPv4 address.
In contrast, `Socket.tcp` with HEv2 can promptly start a connection attempt with the IPv4 address in this case.
Case 2. The server does not promptly return an ACK for the SYN of one address family (IPv4 or IPv6)
```ruby
require 'socket'

# IPv6 server: bound but not yet listening (sleeps before listen).
fork do
  socket = Socket.new(Socket::AF_INET6, :STREAM)
  socket.setsockopt(:SOCKET, :REUSEADDR, true)
  socket.bind(Socket.pack_sockaddr_in(4567, '::1'))
  sleep
  socket.listen(1)
  connection, _ = socket.accept
  connection.close
  socket.close
end

# IPv4 server: listens and accepts a connection promptly.
fork do
  socket = Socket.new(Socket::AF_INET, :STREAM)
  socket.setsockopt(:SOCKET, :REUSEADDR, true)
  socket.bind(Socket.pack_sockaddr_in(4567, '127.0.0.1'))
  socket.listen(1)
  connection, _ = socket.accept
  connection.close
  socket.close
end

Socket.tcp("localhost", 4567)
```
The current `Socket.tcp` tries to connect serially, so when name resolution returns an IPv6 address first and the connection attempt to the IPv6 server gets no ACK, the program hangs.
In contrast, `Socket.tcp` with HEv2 starts connection attempts one after another without waiting for earlier ones to finish, so a connection can be established promptly on the socket that attempted to connect to the IPv4 server.
In exchange, the performance of `Socket.tcp` with HEv2 will be degraded.
```ruby
100.times { Socket.tcp("www.ruby-lang.org", 80) }
```
This is due to the additional creation of IO objects, Thread objects, etc., and the calls to `IO::select` in the implementation.
* Avoid NameError of Socket::EAI_ADDRFAMILY in MinGW
* Support Windows with SO_CONNECT_TIME
* Improve performance
I have additionally implemented the following patterns:
- If the host is single-stack, name resolution is performed in the main thread. This reduces the cost of creating threads.
- If an IP address is specified, name resolution is performed in the main thread. This also reduces the cost of creating threads.
- If only one IP address is resolved, connect is executed in blocking mode. This reduces the cost of calling IO::select.
Also, I have added a `fast_fallback` option for users who do not wish to use HEv2.
Here are the results of each performance test.
```ruby
require 'socket'
require 'benchmark'

HOSTNAME = "www.ruby-lang.org"
PORT = 80

ai = Addrinfo.tcp(HOSTNAME, PORT)

Benchmark.bmbm do |x|
  x.report("Domain name") do
    30.times { Socket.tcp(HOSTNAME, PORT).close }
  end

  x.report("IP Address") do
    30.times { Socket.tcp(ai.ip_address, PORT).close }
  end

  x.report("fast_fallback: false") do
    30.times { Socket.tcp(HOSTNAME, PORT, fast_fallback: false).close }
  end
end
```
```
                           user     system      total        real
Domain name            0.015567   0.032511   0.048078 (  0.325284)
IP Address             0.004458   0.014219   0.018677 (  0.284361)
fast_fallback: false   0.005869   0.021511   0.027380 (  0.321891)
```
And this is the measurement result when executed in a single-stack environment.
```
                           user     system      total        real
Domain name            0.007062   0.019276   0.026338 (  1.905775)
IP Address             0.004527   0.012176   0.016703 (  3.051192)
fast_fallback: false   0.005546   0.019426   0.024972 (  1.775798)
```
The following are the results of running the same benchmark on Ruby 3.3.0.
(on a dual-stack environment)
```
                 user     system      total        real
Ruby 3.3.0   0.007271   0.027410   0.034681 (  0.472510)
```
(on a single-stack environment)
```
                 user     system      total        real
Ruby 3.3.0   0.005353   0.018898   0.024251 (  1.774535)
```
* Do not cache `Socket.ip_address_list`
As mentioned in the comment at https://github.com/ruby/ruby/pull/9374#discussion_r1482269186, caching `Socket.ip_address_list` does not follow changes in the network configuration.
But if we stop caching, we need to check whether the host is single-stack every time `Socket.tcp` is called, which could further degrade performance in the dual-stack case.
Therefore, I've changed the approach: when a domain name is passed, `Socket.tcp` no longer checks whether the host is single-stack and instead resolves names in parallel every time.
The performance measurement results are as follows.
```ruby
require 'socket'
require 'benchmark'

HOSTNAME = "www.ruby-lang.org"
PORT = 80

ai = Addrinfo.tcp(HOSTNAME, PORT)

Benchmark.bmbm do |x|
  x.report("Domain name") do
    30.times { Socket.tcp(HOSTNAME, PORT).close }
  end

  x.report("IP Address") do
    30.times { Socket.tcp(ai.ip_address, PORT).close }
  end

  x.report("fast_fallback: false") do
    30.times { Socket.tcp(HOSTNAME, PORT, fast_fallback: false).close }
  end
end
```
```
                           user     system      total        real
Domain name            0.004085   0.011873   0.015958 (  0.330097)
IP Address             0.000993   0.004400   0.005393 (  0.257286)
fast_fallback: false   0.001348   0.008266   0.009614 (  0.298626)
```
* Wait forever if fallback addresses are unresolved, unless resolv_timeout
Changed the behavior so that, when no fallback address is available, name resolution waits indefinitely instead of only 3 seconds, unless `resolv_timeout` is specified.
This is in accordance with the current `Socket.tcp` specification.
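For example (assuming the existing `resolv_timeout` keyword argument of `Socket.tcp`):
```ruby
require 'socket'

# Without resolv_timeout, name resolution may wait indefinitely
# when no fallback address is available.
Socket.tcp("example.com", 80).close

# With resolv_timeout, name resolution gives up after 3 seconds instead.
Socket.tcp("example.com", 80, resolv_timeout: 3).close
```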
* Use an exact pattern to match the IPv6 address format when specifying the address family
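The idea is something along these lines (an illustrative sketch using the anchored patterns from the `resolv` standard library; the actual pattern used in the patch may differ):
```ruby
require 'socket'
require 'resolv'

# Illustrative sketch: classify a literal IP address with exact,
# anchored patterns instead of a loose substring match.
def literal_address_family(host)
  if Resolv::IPv6::Regex.match?(host)
    Socket::AF_INET6
  elsif Resolv::IPv4::Regex.match?(host)
    Socket::AF_INET
  end
end

literal_address_family("::1")        # Socket::AF_INET6
literal_address_family("127.0.0.1")  # Socket::AF_INET
literal_address_family("localhost")  # nil (not an IP literal)
```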
This commit changes how stack extents are calculated for both the main
thread and other threads. Ruby uses the address of a local variable as
part of the calculation for machine stack extents:
* pthreads uses it as a lower-bound on the start of the stack, because
glibc (and maybe other libcs) can store its own data on the stack
before calling into user code on thread creation.
* win32 uses it as an argument to VirtualQuery, which gets the extent of
the memory mapping which contains the variable
However, the local being used for this is actually too low (too close to
the leaf function call) in both the main thread case and the new thread
case.
In the main thread case, we have the `INIT_STACK` macro, which is used
for pthreads to set the `native_main_thread->stack_start` value. This
value is correctly captured at the very top level of the program (in
main.c). However, this is _not_ what's used to set the execution context
machine stack (`th->ec->machine_stack.stack_start`); that gets set as
part of a call to `ruby_thread_init_stack` in `Init_BareVM`, using the
address of a local variable allocated _inside_ `Init_BareVM`. This is
too low; we need to use a local allocated closer to the top of the
program.
In the new thread case, the local is allocated inside
`native_thread_init_stack`, which is, again, too low.
In both cases, this means that we might have VALUEs lying outside the
bounds of `th->ec->machine.stack_{start,end}`, which won't be marked
correctly by the GC machinery.
To fix this,
* In the main thread case: We already have `INIT_STACK` at the right
level, so just pass that local var to `ruby_thread_init_stack`.
* In the new thread case: Allocate the local one level above the call to
`native_thread_init_stack` in `call_thread_start_func2`.
[Bug #20001]
The implementation of `native_thread_init_stack` for the various
threading models can use the address of a local variable as part of the
calculation of the machine stack extents:
* pthreads uses it as a lower-bound on the start of the stack, because
glibc (and maybe other libcs) can store its own data on the stack
before calling into user code on thread creation.
* win32 uses it as an argument to VirtualQuery, which gets the extent of
the memory mapping which contains the variable
However, the local being used for this is actually allocated _inside_
the `native_thread_init_stack` frame; that means the caller might
allocate a VALUE on the stack that actually lies outside the bounds
stored in machine.stack_{start,end}.
A local variable from one level above the topmost frame that stores
VALUEs on the stack must be passed down into the call to
`native_thread_init_stack` to be used in the calculation. This probably
doesn't _really_ matter for the win32 case (they'll be in the same
memory mapping so VirtualQuery should return the same thing), but
definitely could matter for the pthreads case.
[Bug #20001]
* Restore experimental warnings.
* Documentation and code structure improvements.
* Improved validation of flags, clarified documentation of argument handling.
* Remove inconsistent use of `Example:` and add example to `null?`.
* Expose `IO::Buffer#private?` and add test.
This is a C API for extensions to resolve and get function symbols of other extensions.
Extensions can check that the expected symbol is correctly loaded and accessible, and
use it if it is available.
Otherwise, extensions can raise their own error to guide users to set up their
environment correctly and tell them what's missing.
Our current implementation of rb_postponed_job_register suffers from
some safety issues that can lead to interpreter crashes (see bug #1991).
Essentially, the issue is that jobs can be called with the wrong
arguments.
We made two attempts to fix this whilst keeping the promised semantics,
but:
* The first one involved masking/unmasking when flushing jobs, which
was believed to be too expensive
* The second one involved a lock-free, multi-producer, single-consumer
ringbuffer, which was too complex
The critical insight behind this third solution is that essentially the
only users of these APIs are a) internal, or b) profiling gems.
For a), none of the usages actually require variable data; they will
work just fine with the preregistration interface.
For b), generally profiling gems only call a single callback with a
single piece of data (which is actually usually just zero) for the life
of the program. The ringbuffer is complex because it needs to support
multi-word inserts of job & data (which can't be atomic); but nobody
actually even needs that functionality, really.
So, this commit:
* Introduces a pre-registration API for jobs, with a GVL-requiring
  rb_postponed_job_preregister, which returns a handle which can be
  used with an async-signal-safe rb_postponed_job_trigger.
* Deprecates rb_postponed_job_register (and re-implements it on top of
  the preregister function for compatibility)
* Moves all the internal usages of postponed job register to
  pre-registration
This patch introduces thread specific storage APIs
for tools which use `rb_internal_thread_event_hook` APIs.
* `rb_internal_thread_specific_key_create()` to create a tool specific
thread local storage key and allocate the storage if not available.
* `rb_internal_thread_specific_set()` sets data in thread- and tool-specific
  storage.
* `rb_internal_thread_specific_get()` gets the data in thread- and tool-specific
  storage.
Note that `rb_internal_thread_specific_get|set(thread_val, key)`
can be called without the GVL, is async-signal-safe, and is safe for
multi-threading (native threads). So you can call it in any internal
thread event hook. Furthermore, you can call it from other native
threads. Of course, `thread_val` must be kept alive while accessing the
data through these functions.
Note that you should not forget to clean up the data you set.
This entirely changes how it is tested. Rather than using counters,
we now record the timeline of events with their associated threads, which
makes it much easier to assert that certain events are only preceded
by a specific event, and much easier to debug unexpected
timelines.
Co-Authored-By: Étienne Barrié <etienne.barrie@gmail.com>
Co-Authored-By: JP Camara <jp@jpcamara.com>
Co-Authored-By: John Hawthorn <john@hawthorn.email>
Context: https://github.com/ivoanjo/gvl-tracing/pull/4
Some hooks may want to collect data on a per thread basis.
Right now the only way to identify the concerned thread is to
use `rb_nativethread_self()` or similar, but even then because
of the thread cache or MaNy, two distinct Ruby threads may report
the same native thread id.
By passing `thread->self`, hooks can use it as a key to store
the metadata.
NB: Most hooks are executed outside the GVL, so such data collection
needs to use a thread-safe data structure, and shouldn't use the
reference in other ways from inside the hook.
They must also either pin that value or handle compaction.
This commit adds a new flag RUBY_TYPED_EMBEDDABLE that allows the data
of a TypedData object to be embedded after the object itself. This will
improve cache locality and allow us to save the 8 byte data pointer.
Co-Authored-By: Jean Boussier <byroot@ruby-lang.org>
Add a new API rb_profile_thread_frames(), which is essentially a
per-thread version of rb_profile_frames().
While the original rb_profile_frames() always returns results about the
current active thread obtained by GET_EC(), this new API takes a Thread
to be profiled as an argument.
This should come in handy when profiling I/O-bound programs such as
webapps, since this new API allows us to learn about Threads performing
I/O (which do not have the GVL).
Profiling worker threads (such as Sidekiq workers) may be another
application.
Implements [Feature #10602]
Co-authored-by: Mike Perham <mike@perham.net>
This patch introduces an M:N thread scheduler for the Ractor system.
In general, M:N thread scheduler employs N native threads (OS threads)
to manage M user-level threads (Ruby threads in this case).
On the Ruby interpreter, 1 native thread is provided for 1 Ractor
and all Ruby threads are managed by the native thread.
Since Ruby 1.9, the interpreter has used a 1:1 thread scheduler, which means
1 Ruby thread has 1 native thread. The M:N scheduler changes this strategy.
Because of compatibility issues (and stability issues in the implementation),
the main Ractor doesn't use the M:N scheduler by default. In other words,
threads on the main Ractor will be managed with the 1:1 thread scheduler.
There are additional settings via environment variables:
`RUBY_MN_THREADS=1` enables the M:N thread scheduler on the main Ractor.
Note that non-main Ractors use the M:N scheduler without this
configuration. With this configuration, single-Ractor applications
run threads on an M:1 thread scheduler (green threads, user-level threads).
`RUBY_MAX_CPU=n` specifies the maximum number of native threads for the
M:N scheduler (default: 8).
This patch will be reverted soon if non-easy issues are found.
[Bug #19842]
The current documentation for `rb_postponed_job_register_one()` is
explaining the differences with itself, where it should be explaining
the differences with `rb_postponed_job_register()`.
Existing strscan releases rely on this C API. It means that the current
Ruby master doesn't work if your Gemfile.lock has strscan unless it's
locked to 3.0.7, which is not released yet.
To fix it, let's not remove the C API we've exposed to users.
rb_reg_onig_match performs preparation, error handling, and cleanup for
matching a regex against a string. This reduces repetitive code and
removes the need for StringScanner to access the internal data of the regex.