- Rename regexp.rdoc to exclude it from "Pages". This file is meant to be
included in the "class Regexp" document, but it also appeared as a
separate, duplicate page.
- Fix links on case-sensitive filesystems.
- Fix to use rdoc-ref instead of converted HTML page names.
When matching an incompatible encoding, the Regexp needs to recompile.
If `usecnt == 0`, then we can reuse the `ptr` because nothing else is
using it. This avoids allocating another `regex_t`.
This speeds up matches that switch to incompatible encodings by 15%.
Branch:
```
Regex#match? with different encoding
1.431M (± 1.3%) i/s - 7.264M in 5.076153s
Regex#match? with same encoding
16.858M (± 1.1%) i/s - 85.347M in 5.063279s
```
Base:
```
Regex#match? with different encoding
1.248M (± 2.0%) i/s - 6.342M in 5.083151s
Regex#match? with same encoding
16.377M (± 1.1%) i/s - 82.519M in 5.039504s
```
Script:
```
require "benchmark/ips"
# One UTF-8 string and one binary string, so alternating between them
# forces the regexp to switch to an incompatible encoding.
regex = /foo/
str1 = "日本語"
str2 = "English".force_encoding("ASCII-8BIT")
Benchmark.ips do |x|
  x.report("Regex#match? with different encoding") do |times|
    i = 0
    while i < times
      regex.match?(str1)
      regex.match?(str2)
      i += 1
    end
  end
  x.report("Regex#match? with same encoding") do |times|
    i = 0
    while i < times
      regex.match?(str1)
      i += 1
    end
  end
end
```
Existing strscan releases rely on this C API. It means that the current
Ruby master doesn't work if your Gemfile.lock has strscan unless it's
locked to 3.0.7, which is not released yet.
To fix it, let's not remove the C API we've exposed to users.
rb_reg_onig_match performs preparation, error handling, and cleanup for
matching a regex against a string. This reduces repetitive code and
removes the need for StringScanner to access internal data of regex.
This was broken in ec3542229b. That commit
didn't handle cases where extended mode was turned on/off inside the
regexp. There are two ways to turn extended mode on/off:
```
/(?-x:#y)#z
/x =~ '#y'
/(?-x)#y(?x)#z
/x =~ '#y'
```
These can be nested inside the same regexp:
```
/(?-x:(?x)#x
(?-x)#y)#z
/x =~ '#y'
```
As you can probably imagine, this makes handling these regexps
somewhat complex. Due to the nesting inside portions of regexps,
the unassign_nonascii function needs to be recursive. In
recursive mode, it needs to track both opening and closing
parentheses, similar to how it already tracked opening and
closing brackets for character classes.
When scanning the regexp and coming to `(?` not followed by `#`,
scan for options, and use `x` and `i` to determine whether to
turn on or off extended mode. For `:`, indicating only the
current regexp section should have the extended mode
switched, recurse with the extended mode set or unset. For `)`,
indicating the remainder of the regexp (or current regexp portion
if already recursing) should turn extended mode on or off, just
change the extended mode flag and keep scanning.
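The scan described above can be modelled in a few lines of Ruby. This is only an illustrative sketch, not the C code of `unassign_nonascii`: the names `strip_extended_comments` and `OPTION_GROUP` are made up for this example, it recurses on every group (not only option groups), and it ignores nested character classes and escapes inside `(?#...)`. It returns the regexp source with extended-mode comments removed, which makes it easy to see which `#` starts a comment and which is literal.

```ruby
# Hypothetical helper, not part of Ruby: matches "(?", optional options,
# optional "-" plus options, then ":" (scoped) or ")" (rest of section).
OPTION_GROUP = /\G\(\?([imxadu]*)(?:-([imxadu]*))?(:|\))/

# Returns [source_without_comments, next_position].
def strip_extended_comments(src, extended, pos = 0, nested = false)
  out = +""
  in_class = false
  while pos < src.length
    c = src[pos]
    if c == "\\"                      # escaped character: copy it untouched
      out << src[pos, 2]
      pos += 2
    elsif in_class                    # inside [...], "#" is always literal
      in_class = false if c == "]"
      out << c
      pos += 1
    elsif c == "["
      in_class = true
      out << c
      pos += 1
    elsif src[pos, 3] == "(?#"        # comment group: copy through ")"
      stop = src.index(")", pos) || src.length - 1
      out << src[pos..stop]
      pos = stop + 1
    elsif c == "#" && extended        # extended-mode comment: skip to newline
      pos += 1 while pos < src.length && src[pos] != "\n"
    elsif c == "("
      m = OPTION_GROUP.match(src, pos)
      flag = extended
      if m
        flag = true if m[1].include?("x")
        flag = false if m[2].to_s.include?("x")
      end
      if m && m[3] == ")"
        # (?x) or (?-x): switch the flag for the rest of this section.
        extended = flag
        out << m[0]
        pos = m.end(0)
      else
        # (?x:...), (?-x:...) or a plain group: recurse with the scoped flag.
        head = m ? m[0] : "("
        inner, pos = strip_extended_comments(src, flag, pos + head.length, true)
        out << head << inner
      end
    elsif c == ")" && nested          # end of the section being recursed into
      out << c
      return [out, pos + 1]
    else
      out << c
      pos += 1
    end
  end
  [out, pos]
end

scoped, _  = strip_extended_comments("(?-x:#y)#z\n", true)
toggled, _ = strip_extended_comments("(?-x)#y(?x)#z\n", true)
nested, _  = strip_extended_comments("(?-x:(?x)#x\n(?-x)#y)#z\n", true)
p scoped, toggled, nested
```

In all three inputs the `#y` survives because it sits in a part of the pattern where extended mode is off, while `#x` and `#z` are dropped as comments.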
While testing this, I noticed that `a`, `d`, and `u` are accepted
as options, in addition to `i`, `m`, and `x`, but I can't see
where those options are documented. I'm not sure whether or not
handling `a`, `d`, and `u` as options is a bug.
Fixes [Bug #19379]
Previously, only certain values of the 3rd argument triggered a
deprecation warning.
This is the first step toward fixing bug #18797. Support for the 3rd
argument will be removed after the release of Ruby 3.2.
Fix minor fallout discovered by the tests.
Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
Calling `String#scan` without a block creates an incomplete MatchData
object whose `RMATCH(match)->str` is Qfalse. Usually this object is not
leaked, but it was possible to retrieve it by using ObjectSpace.each_object.
This change hides the internal MatchData object by using rb_obj_hide.
Fixes [Bug #19159]
GCC [Bug 99578] seems triggered by calling `rb_reg_last_match` before
`match_check(match)`, probably by `NIL_P(match)` in `rb_reg_nth_match`.
[Bug 99578]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578
Fix per-instance Regexp timeout
This makes it follow what was decided in [Bug #19055]:
* `Regexp.new(str, timeout: nil)` should respect the global timeout
* `Regexp.new(str, timeout: huge_val)` should use the maximum value that
can be represented in the internal representation
* `Regexp.new(str, timeout: 0 or negative value)` should raise an error
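As a rough illustration of the rules above, the following sketch exercises the `nil` and non-positive cases, assuming a Ruby that supports the `timeout:` keyword (3.2 or later). The variable names are arbitrary, and the huge-value case is left out because its exact clamped value depends on the internal representation.

```ruby
# No per-instance value: Regexp#timeout is nil and the global
# Regexp.timeout (if any) applies when matching.
default_re = Regexp.new("a+", timeout: nil)

# A positive per-instance timeout, in seconds.
timed_re = Regexp.new("a+", timeout: 1.5)

# Zero and negative values should be rejected with ArgumentError.
rejected = [0, -1].map do |bad|
  begin
    Regexp.new("a+", timeout: bad)
    false
  rescue ArgumentError
    true
  end
end

p default_re.timeout
p timed_re.timeout
p rejected
```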
Tabs were expanded because the file did not have any tab indentation in unedited lines.
Please update your editor config, and use misc/expand_tabs.rb in the pre-commit hook.
This patch speeds up setting the backref match object by avoiding some
memcopies. Take the following code for example:
```ruby
"hello world" =~ /hello/
p $~
```
When the regexp matches the string, we have to set the Match object in
the backref global. Previously we would allocate a match object[^1] and
use `rb_reg_region_copy`[^2] to make a deep copy of the stack-allocated
`re_registers` struct[^3] into the newly created Ruby object. This
could possibly trigger GC[^4], and would allocate new memory.
This patch makes a shallow copy of the `re_registers` struct onto the
Match object, allowing the match object to manage the `re_registers`
pointer and also avoiding some calls to `xmalloc` and some manual
memcopy.
Benchmark looks like this:
```ruby
require "benchmark/ips"
def test_re(thing)
  thing =~ /hello/
end
Benchmark.ips do |x|
  x.report("re hit") do
    test_re "hello world"
  end
  x.report("re miss") do
    test_re "world"
  end
end
```
Before this patch:
```
$ ruby -v test.rb
ruby 3.2.0dev (2022-07-27T22:29:00Z master 4ad69899b7) [arm64-darwin21]
Ignoring bcrypt-3.1.16 because its extensions are not built. Try: gem pristine bcrypt --version 3.1.16
Warming up --------------------------------------
re hit 345.401k i/100ms
re miss 673.584k i/100ms
Calculating -------------------------------------
re hit 3.452M (± 0.5%) i/s - 17.270M in 5.002535s
re miss 6.736M (± 0.4%) i/s - 34.353M in 5.099593s
```
After this patch:
```
$ ./ruby -v test.rb
ruby 3.2.0dev (2022-08-01T21:24:12Z less-memcpy 0ff2a56606) [arm64-darwin21]
Warming up --------------------------------------
re hit 419.578k i/100ms
re miss 673.251k i/100ms
Calculating -------------------------------------
re hit 4.201M (± 0.7%) i/s - 21.398M in 5.093593s
re miss 6.716M (± 0.4%) i/s - 33.663M in 5.012756s
```
Matches get faster, and misses maintain the same speed.
[^1]: 24204d54ab/re.c (L1737)
[^2]: 24204d54ab/re.c (L1738)
[^3]: 24204d54ab/re.c (L1686)
[^4]: 24204d54ab/re.c (L981)