Remove !USE_RVARGC code
[Feature #19579]
The Variable Width Allocation feature was turned on by default in Ruby
3.2. Since then, we haven't received bug reports or backports to the
non-Variable Width Allocation code paths, so we assume that nobody is
using it. We also don't plan on maintaining the non-Variable Width
Allocation code, so we are going to remove it.
bytesplice(index, length, str, str_index, str_length) -> string
bytesplice(range, str, str_range) -> string
In these forms, the content of +self+ is replaced by str.byteslice(str_index, str_length) or str.byteslice(str_range); however the substring of +str+ is not allocated as a new string.
In Feature #19314, we concluded that the return value of String#bytesplice
should be changed from the source string to the receiver, because the source
string is useless and confusing when extra arguments are added.
This change should be included in Ruby 3.2.1.
str_enc_copy_direct copies the string encoding over without checking the
frozen status of the string. Because we know that we're safe here (we
only use this function when interpolating strings on the stack via a
concatstrings instruction) we can safely skip this check
This commit adds str_enc_copy_direct, which is like str_enc_copy but
does not check the frozen status of str1 and does not check the validity
of the encoding of str2. This makes certain string operations ~5% faster.
```ruby
puts(Benchmark.measure do
100_000_000.times do
"a".downcase
end
end)
```
Before this patch:
```
7.587598 0.040858 7.628456 ( 7.669022)
```
After this patch:
```
7.133128 0.039809 7.172937 ( 7.183124)
```
The reference updating code for strings is not re-embedding strings
because the code is incorrectly wrapped inside of a
`if (STR_SHARED_P(obj))` clause. Shared strings can't be re-embedded
so this ends up being a no-op. This means that strings can be moved to a
large size pool during compaction, but won't be re-embedded, which would
waste the space.
The following code crashes on my machine:
```
GC.stress = true
str = "testing testing testing"
puts str.capitalize
```
We need to ensure that the object `buffer_anchor` remains on the stack
so it does not get GC'd.
It's questionable whether we want to allow rstrip to work for strings
where the broken coderange occurs before the trailing whitespace and
not after, but this approach is probably simpler, and I don't think
users should expect string operations like rstrip to work on broken
strings.
In some cases, this changes rstrip to raise
Encoding::CompatibilityError instead of ArgumentError. However, as
the problem is related to an encoding issue in the receiver, and due
not due to an issue with an argument, I think
Encoding::CompatibilityError is the more appropriate error.
Fixes [Bug #18931]
During auto-compaction, using length to determine whether or not a
string can be re-embedded may be a problem for newly created strings.
This is because usually it requires a malloc before setting the length.
If the malloc triggers compaction, then the string may be re-embedded
and can cause crashes.
Commit aa2a428 introduced a bug where non-embedded string slices copied
the encoding of the original string. If the original string had a broken
encoding but the slice has valid encoding, then the slice would be
incorrectly marked as broken encoding.
This commit extracts common code between str_substr and rb_str_subseq
into a function called str_subseq.
This commit also applies optimizations in commit 2e88bca to
rb_str_subseq.
str_new_shared already has all the necessary logic to do this
and is also smart enough to skip this step if the source string
is already a shared string itself.
This saves a useless String allocation on each call.
Leave the new coderange unknown if the original encoding is not
ASCII-compatible. Non-ASCII-compatible encoding strings with valid or
broken coderange can end up as ascii-only.
Fixes 9a8f6e392f ("Cheaply derive code range for String#b return
value", 2022-07-25).
Remove the superfluous str_modify_keep_cr() call from rb_str_update().
It ends up calling either rb_str_drop_bytes() or rb_str_splice_0(),
which already does checks if necessary.
The extra call makes the string "independent". This is not always
wanted, in other words, it can keep the same shared root when merely
removing the leading part of a shared string.
This is an inelegant hack, by manually checking for this specific
code point in rb_str_inspect. Some testing indicates that this is
the only code point affected.
It's possible a better fix would be inside of lower-level encoding
code, such that rb_enc_isprint would return false and not true for
codepoint 0x85.
Fixes [Bug #16842]
The result of String#b is a string with an ASCII_8BIT/BINARY encoding. That encoding is ASCII-compatible and has no byte sequences that are invalid for the encoding. If we know the receiver's code range, we can derive the resulting string's code range without needing to perform a full code range scan.
If the RHS has valid encoding, and both strings have the same
encoding, we can use the fast path.
However we need to update the LHS coderange.
```
compare-ruby: ruby 3.2.0dev (2022-07-21T14:46:32Z master cdbb9b8555) [arm64-darwin21]
built-ruby: ruby 3.2.0dev (2022-07-25T07:25:41Z string-concat-vali.. 11a2772bdd) [arm64-darwin21]
warming up...
| |compare-ruby|built-ruby|
|:-------------------|-----------:|---------:|
|binary_concat_7bit | 554.816k| 556.460k|
| | -| 1.00x|
|utf8_concat_7bit | 556.367k| 555.101k|
| | 1.00x| -|
|utf8_concat_UTF8 | 412.555k| 556.824k|
| | -| 1.35x|
```
Previously, it was including one newline when chomp was used,
which is inconsistent with IO#each_line behavior. This makes
behavior consistent with IO#each_line, chomping all paragraph
separators (multiple consecutive newlines), but not single
newlines.
Partially Fixes [Bug #18768]
Not having to fetch the rb_encoding save a significant
amount of time.
Additionally, even when we have to fetch it, we can do
it faster using `ENCODING_GET` rather than `rb_enc_get`.
```
compare-ruby: ruby 3.2.0dev (2022-07-19T08:41:40Z master cb9fd920a3) [arm64-darwin21]
built-ruby: ruby 3.2.0dev (2022-07-21T11:16:16Z faster-buffer-conc.. 4f001f0748) [arm64-darwin21]
warming up...
| |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|binary_concat_utf8 | 510.580k| 565.600k|
| | -| 1.11x|
|binary_concat_binary | 512.653k| 571.483k|
| | -| 1.11x|
|utf8_concat_utf8 | 511.396k| 566.879k|
| | -| 1.11x|
```
If the LHS is ASCII compatible and the RHS is 7BIT
we can directly concat without being concerned about
anything else.
Benchmark:
```
compare-ruby: ruby 3.2.0dev (2022-07-12T15:01:11Z master 71aec68566) [arm64-darwin21]
built-ruby: ruby 3.2.0dev (2022-07-13T10:13:53Z faster-buffer-conc.. a04c10476d) [arm64-darwin21]
warming up...
| |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|binary_append_utf8 | 385.315k| 573.663k|
| | -| 1.49x|
|binary_append_binary | 446.579k| 574.898k|
| | -| 1.29x|
|utf8_append_utf8 | 430.936k| 573.394k|
| | -| 1.33x|
```
Note that in the benchmark, the RHS always have a precomputed
coderange. So the benchmark never enter the slowpath of having to
scan the RHS. However it's extremly likely that we'll end
up scanning it anyway in rb_enc_cr_str_buf_cat
This function was added to a public header in [1] probably
unintentionally since it's not used anywhere, exposes implementation
details, and isn't related to the goals of that pull request.
[1]: 56cc3e99b6
* Add missing space for `String#start_with?`.
* Add missing pluses for `String#tr` and
`Methods for Converting to New String` label.
* Move quote into the tag for `Whitespace in Strings` label.
... only when the message string has a newline.
`p StandardError.new("foo\nbar")` now prints `#<StandardError: "foo\nbar">'
instead of:
#<StandardError:
bar>
[Bug #18170]
Adds to doc for String.new, also making it compliant with documentation_guide.rdoc.
Fixes some broken links in io.c (that I failed to correct yesterday).
Method references is not only able to be marked up as code, also
reflects `--show-hash` option.
The bug that prevented the old rdoc from correctly parsing these
methods was fixed last month.
Treats:
#chars
#codepoints
#each_char
#each_codepoint
#each_grapheme_cluster
#grapheme_clusters
Also, corrects a passage in #unicode_normalize that mentioned module UnicodeNormalize, whose doc (:nodoc:, actually) says not to mention it.
As @peterzhu2118 and @duerst have pointed out, putting string method's RDoc into doc/ (which allows non-ASCII in examples) makes the "click to toggle source" feature not work for that method.
This PR moves the primary method doc back into string.c, then includes RDoc from doc/string/*.rdoc, and also removes doc/string.rdoc.
The affected methods are:
::new
#bytes
#each_byte
#each_line
#split
The call-seq is in string.c because it works there; it did not work when the call-seq is in doc/string/*.rdoc.
This PR also updates the relevant guidance in doc/documentation_guide.rdoc.
* Enhanced RDoc for String#split
* Enhanced RDoc for String#split
* Enhanced RDoc for String#split
* Enhanced RDoc for String#split
* Enhanced RDoc for String#split
* String#getbyte returns `nil` if `index` is out of range.
* Add String#getbyte example with nil output.
* Modify String#getbyte example to use negative index.
I used this regex:
([A-Za-z]+)\.html#(?:class|module)-[A-Za-z]+-label-([A-Za-z0-9\-\+]+)
And performed a global find & replace for this:
rdoc-ref:$1@$2
Adds file doc/case_mapping.rdoc, which describes case mapping and provides a link target that methods doc can link to.
Revises:
String#capitalize
String#capitalize!
String#casecmp
String#casecmp?
String#downcase
String#downcase!
String#swapcase
String#swapcase!
String#upcase
String#upcase!
Symbol#capitalize
Symbol#casecmp
Symbol#casecmp?
Symbol#downcase
Symbol#swapcase
Symbol#upcase
This commit adds a Ractor cache for every size pool. Previously, all VWA
allocated objects used the slowpath and locked the VM.
On a micro-benchmark that benchmarks String allocation:
VWA turned off:
29.196591 0.889709 30.086300 ( 9.434059)
VWA before this commit:
29.279486 41.477869 70.757355 ( 12.527379)
VWA after this commit:
16.782903 0.557117 17.340020 ( 4.255603)
Also, check if a suffix is empty, to guarantee the assumption of
`onigenc_get_left_adjust_char_head` that `*s` is always accessible,
even in the case of `SHARABLE_MIDDLE_SUBSTRING`.
Each of these methods calls str_modify_keep_cr before
term filling, which should ensure the backing string
uses private memory, and therefore term filling should
not affect other strings.
Skipping the term filling was added in
a707ab4bc8.
Fixes [Bug #12540]