Commit graph

1889 commits

Author SHA1 Message Date
Jean Boussier 4e85b6b4c4 rb_str_bytesplice: skip encoding check if encodings are the same
If both strings have the same encoding, all this work is useless.
2024-08-09 22:06:44 +02:00
Jean Boussier 3bac5f6af5 string.c: add fastpath in str_ensure_byte_pos
If the string only contains single-byte characters, we can
skip all the costly checks.
2024-08-09 22:06:44 +02:00
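The two commits above both touch the byte-offset path behind `String#bytesplice`. As a rough illustration (assuming Ruby 3.2 or later, where `String#bytesplice` is available), the sketch below shows the case the fast path targets, a string made only of single-byte characters, next to the case that still needs the boundary check, a multibyte string addressed at an offset in the middle of a character. The variable names are illustrative only.

```ruby
# Single-byte content: every byte offset is a valid character boundary.
ascii = +"hello world"
ascii.bytesplice(0, 5, "HELLO")

# Multibyte content: "é" occupies bytes 3 and 4, so offset 4 is not a
# character boundary and the splice is rejected.
accented = +"caf\u00e9"
begin
  accented.bytesplice(4, 1, "e")
  boundary_error = nil
rescue IndexError => e
  boundary_error = e.class
end

puts ascii
puts boundary_error.inspect
```

The rejected splice leaves `accented` untouched, which is why the check cannot be skipped for strings that may contain multibyte characters.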
Jean Boussier a332367dad string.c: Add fastpath to single_byte_optimizable
`rb_enc_from_index` is a costly operation, so it is worth avoiding
the call for the common encodings.

Also, in the case of UTF-8, it is more efficient to scan the
coderange when it is unknown than to fall back to the slower
algorithms.
2024-08-09 22:06:44 +02:00
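The decision described above can be approximated at the Ruby level. The following is only a sketch of the idea, not the C implementation: `single_byte_optimizable?` is a hypothetical helper, `String#ascii_only?` stands in for the coderange scan, and the generic branch is a deliberately slow stand-in for the fallback algorithms.

```ruby
# Hypothetical Ruby-level approximation of the check; the real logic is C.
def single_byte_optimizable?(str)
  case str.encoding
  when Encoding::BINARY, Encoding::US_ASCII
    # Common single-byte encodings: answer without further inspection.
    true
  when Encoding::UTF_8
    # For UTF-8, scanning for ASCII-only content is cheap enough to do.
    str.ascii_only?
  else
    # Slow generic stand-in: inspect every character.
    str.each_char.all? { |c| c.bytesize == 1 }
  end
end

puts single_byte_optimizable?("hello")
puts single_byte_optimizable?("h\u00e9llo")
```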
Jean Boussier 2bd5dc47ac string.c: str_capacity don't check for immediates
`STR_EMBED_P` uses `FL_TEST_RAW` meaning we already assume `str`
isn't an immediate, so we can use `FL_TEST_RAW` here too.
2024-08-09 15:20:58 +02:00
Jean Boussier af44af238b str_independent: add a fastpath with a single flag check
If we assume that most strings we modify are not frozen and
are independent, then we can optimize this case by replacing
multiple flag checks with a single mask check.
2024-08-09 15:20:58 +02:00
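The mask technique named in this commit can be sketched in a few lines. The flag names and bit positions below are hypothetical and chosen only for illustration; the real flags are defined in Ruby's C headers. The point is that one AND against a combined mask gives the same answer as testing each flag in turn.

```ruby
# Hypothetical flag bits, for illustration only.
FLAG_FROZEN = 1 << 0
FLAG_SHARED = 1 << 1
FLAG_NOFREE = 1 << 2

# Every bit that would make a string unsafe to modify in place.
NOT_INDEPENDENT_MASK = FLAG_FROZEN | FLAG_SHARED | FLAG_NOFREE

# Slow path: test each flag separately.
def independent_slow?(flags)
  (flags & FLAG_FROZEN).zero? &&
    (flags & FLAG_SHARED).zero? &&
    (flags & FLAG_NOFREE).zero?
end

# Fast path: one AND against the combined mask.
def independent_fast?(flags)
  (flags & NOT_INDEPENDENT_MASK).zero?
end

puts independent_fast?(0)
puts independent_fast?(FLAG_SHARED)
```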
Kevin Menard 04a6165ac0
YJIT: Enhance the `String#<<` method substitution to handle integer codepoint values. (#11032)
* Document why we need to explicitly spill registers.

* Simplify passing a byte value to `str_buf_cat`.

* YJIT: Enhance the `String#<<` method substitution to handle integer codepoint values.

* YJIT: Move runtime type check into YJIT.

Performing the check in YJIT means we can make assumptions about the type. It also improves correctness of stack traces in cases where the codepoint argument is not a String or a Fixnum.
2024-08-02 15:45:22 -04:00
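For context, this is the Ruby-level behavior of `String#<<` that the substitution has to preserve: an Integer argument is treated as a codepoint in the receiver's encoding, and an argument that is neither a String nor an Integer raises. A minimal sketch, with illustrative variable names:

```ruby
s = +""            # mutable UTF-8 string
s << 0x3042        # Integer: appended as a codepoint (HIRAGANA LETTER A)
s << "!"           # String: appended as-is

begin
  s << nil         # neither String nor Integer
  error = nil
rescue TypeError => e
  error = e.class
end

puts s
puts error.inspect
```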
Jean Boussier 83f57ca3d2 String.new(capacity:) don't subtract termlen
[Bug #20585]

This was changed in 36a06efdd9f0604093dccbaf96d4e2cb17874dc8 because
`String.new(capacity: 1024)` would end up allocating `1025` bytes, but the problem
with this change is that the caller may be trying to right-size a String.

So instead, we should just better document the behavior of `capacity:`.
2024-06-19 15:11:07 +02:00
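A short sketch of how a caller might use `capacity:` to right-size a buffer. The keyword only sizes the internal buffer up front; it does not change the string's content or length. The sizes used here are arbitrary.

```ruby
# Reserve room for roughly 1 KiB up front; the string itself starts empty.
buf = String.new(capacity: 1024)
initially_empty = buf.empty?

# Appending within the reserved capacity should not need to grow the buffer.
buf << ("x" * 1000)

puts initially_empty
puts buf.bytesize
```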
Kevin Menard a119b5f879 Add a fast path implementation for appending single byte values to US-ASCII strings. 2024-06-17 09:44:48 -07:00
Kevin Menard 27e13fbc58 Add a fast path implementation for appending single byte values to binary strings.
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
2024-06-17 09:44:48 -07:00
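The two fast paths above cover appending an Integer byte value to binary and US-ASCII strings. A minimal sketch of that usage from Ruby, including the out-of-range case for a binary receiver (values above `0xFF` do not fit in one byte):

```ruby
bin = "".b                                   # binary (ASCII-8BIT) string
bin << 0xFF << 0x00                          # each Integer is one byte

ascii = String.new(encoding: Encoding::US_ASCII)
ascii << 0x41                                # "A"

begin
  bin << 0x100                               # does not fit in a single byte
  range_error = nil
rescue RangeError => e
  range_error = e.class
end

puts bin.bytes.inspect
puts ascii
```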
Alan Wu 6416ee33eb Simplify unaligned write for pre-computed string hash 2024-06-13 18:52:09 -04:00
Alan Wu a8730adb60 rb_str_hash(): Avoid UB with making misaligned pointer
Previously, on common platforms, this code made a pointer to a union of
8 byte alignment out of a char pointer that is not guaranteed to satisfy
the alignment requirement. That is undefined behavior according
to [C99 6.3.2.3p7](https://port70.net/~nsz/c/c99/n1256.html#6.3.2.3p7).

Use memcpy() to do the unaligned read instead.
2024-06-13 18:52:09 -04:00
tompng a9b8981aac Simplify rb_str_resize clear range condition 2024-06-13 18:27:02 +02:00
tompng 9c7374b0e6 Clear coderange when rb_str_resize changes size
In some encodings, such as UTF-16 and UTF-32, expanding the string with null bytes can change the coderange to either broken or valid.
2024-06-13 18:27:02 +02:00
Nobuyoshi Nakada dd8903fed7
[Bug #20566] Mention out-of-range argument cases in `String#<<`
Also [Bug #18973].
2024-06-09 10:11:06 +09:00
Jean Boussier 730e3b2ce0 Stop exposing `rb_str_chilled_p`
[Feature #20205]

Now that chilled strings no longer appear as frozen, there is no
need to offer an API to check for chilled strings.

We however need to change `rb_check_frozen_internal` to no
longer be a macro, as it needs to check for chilled strings.
2024-06-02 13:53:35 +02:00
Nobuyoshi Nakada 7d144781a9
[Bug #20512] Set coderange in `Range#each` of strings 2024-05-28 16:59:51 +09:00
Nobuyoshi Nakada 0a92c9f2b0
Set empty strings to ASCII-only 2024-05-28 16:24:21 +09:00
Jean Boussier 9e9f1d9301 Precompute embedded string literals hash code
With embedded strings we often have some space left in the slot, which
we can use to store the string Hash code.

It's probably only worth it for string literals, as they are the ones
likely to be used as hash keys.

We chose to store the Hash code right after the string terminator so as to
make it easy/fast to compute, and not require one more union in RString.

```
compare-ruby: ruby 3.4.0dev (2024-04-22T06:32:21Z main f77618c1fa) [arm64-darwin23]
built-ruby: ruby 3.4.0dev (2024-04-22T10:13:03Z interned-string-ha.. 8a1a32331b) [arm64-darwin23]
last_commit=Precompute embedded string literals hash code

|            |compare-ruby|built-ruby|
|:-----------|-----------:|---------:|
|symbol      |     39.275M|   39.753M|
|            |           -|     1.01x|
|dyn_symbol  |     37.348M|   37.704M|
|            |           -|     1.01x|
|small_lit   |     29.514M|   33.948M|
|            |           -|     1.15x|
|frozen_lit  |     27.180M|   33.056M|
|            |           -|     1.22x|
|iseq_lit    |     27.391M|   32.242M|
|            |           -|     1.18x|
```

Co-Authored-By: Étienne Barrié <etienne.barrie@gmail.com>
2024-05-28 07:32:41 +02:00
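The scenario measured above is a string literal used as a hash key. Whatever is cached alongside the literal has to agree with the hash computed for an equal string built at runtime, otherwise lookups would miss. A small sketch of that invariant, with illustrative names:

```ruby
# A literal key, the case the precomputed hash code is meant to speed up.
table = { "small_lit" => 1 }

# An equal string built at runtime must hash and compare the same way.
runtime_key = "small_" + "lit"

same_hash = "small_lit".hash == runtime_key.hash
found = table[runtime_key]

puts same_hash
puts found
```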
Étienne Barrié 1376881e9a Stop marking chilled strings as frozen
They were initially made frozen to avoid false positives for cases such
as:

    str = str.dup if str.frozen?

But this may cause bugs and is generally confusing for users.

[Feature #20205]

Co-authored-by: Jean Boussier <byroot@ruby-lang.org>
2024-05-28 07:32:33 +02:00
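The idiom quoted in this commit message copies a string only when it reports itself as frozen. A minimal sketch of it as a helper, using an explicitly frozen string and an explicitly mutable one so the result does not depend on how literals are compiled; `mutable_version` is a hypothetical name.

```ruby
# Return a string that is safe to mutate, copying only when needed.
def mutable_version(str)
  str = str.dup if str.frozen?
  str
end

frozen = "config".freeze
fresh = String.new("config")

from_frozen = mutable_version(frozen)   # a copy
from_fresh = mutable_version(fresh)     # the same object

from_frozen << "!"
from_fresh << "!"

puts from_frozen
puts frozen
```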
Jean Boussier 3a7846b1aa Add a hint of `ASCII-8BIT` being `BINARY`
[Feature #18576]

Since outright renaming `ASCII-8BIT` is deemed too backward incompatible,
the next best thing would be to only change its `#inspect`, particularly
in exception messages.
2024-04-18 10:17:26 +02:00
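`Encoding::BINARY` and `Encoding::ASCII_8BIT` already refer to the same encoding object, so the change is limited to how that object is displayed. A small sketch; the exact `#inspect` text depends on the Ruby version, so it is printed rather than compared against a fixed string.

```ruby
# Both constants refer to one and the same Encoding object.
same_object = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)

# String#b returns a copy tagged with that encoding.
binary_copy = "abc".b

puts same_object
puts Encoding::ASCII_8BIT.inspect   # wording varies by Ruby version
```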
Jean Boussier f06670c5a2 Eliminate usage of OBJ_FREEZE_RAW
Previously it would bypass the `FL_ABLE` check, but
since the introduction of shapes it has behaved differently
from `OBJ_FREEZE`, as it would only set the `FL_FREEZE`
flag, but not update the shape.

I have no indication of this causing a bug yet, but it seems
like a trap waiting to happen.
2024-04-16 17:20:35 +02:00
Étienne Barrié 49b31c7680 Document STR_CHILLED flag on RString
[Feature #20205]
2024-04-08 13:25:09 +02:00
Nobuyoshi Nakada 4dd9e5cf74 Add builtin type assertion 2024-04-08 11:13:29 +09:00
Peter Zhu e50590a541 Assert that Symbol#inspect returns a T_STRING 2024-04-05 16:15:28 -04:00
KJ Tsanaktsidis 9d0a5148ae Add missing RB_GC_GUARDs related to DATA_PTR
I discovered the problem in `compile.c` from a failing
TestIseqLoad#test_stressful_roundtrip test with ASAN enabled. The other
two changes in array.c and string.c I found by auditing similar usages
of DATA_PTR in the codebase.

[Bug #20402]
2024-03-31 20:33:38 +11:00
Étienne Barrié 2b08406cd0 Expose rb_str_chilled_p
Some extensions (like stringio) may need to differentiate between
chilled strings and frozen strings.

They can now use rb_str_chilled_p but must check for its presence since
the function will be removed when chilled strings are removed.

[Bug #20389]

[Feature #20205]

Co-authored-by: Jean Boussier <byroot@ruby-lang.org>
2024-03-26 12:54:54 +01:00
Nobuyoshi Nakada fdd7ffb70c [Bug #20389] Chilled string cannot be a shared root 2024-03-25 10:26:56 +09:00
Étienne Barrié 12be40ae6b Implement chilled strings
[Feature #20205]

As a path toward enabling frozen string literals by default in the future,
this commit introduces "chilled strings". From a user perspective, chilled
strings pretend to be frozen, but on the first attempt to mutate them,
they lose their frozen status and emit a warning rather than raising a
`FrozenError`.

Implementation wise, `rb_compile_option_struct.frozen_string_literal` is
no longer a boolean but a tri-state of `enabled/disabled/unset`.

When code is compiled with frozen string literals neither explicitly enabled
nor disabled, string literals are compiled with a new `putchilledstring`
instruction. This instruction is identical to `putstring` except that it marks
the String with the `STR_CHILLED (FL_USER3)` and `FL_FREEZE` flags.

Chilled strings have the `FL_FREEZE` flag so as to minimize the need to check
for chilled strings across the codebase, and to improve compatibility with
C extensions.

Notes:
  - `String#freeze`: clears the chilled flag.
  - `String#-@`: acts as if the string was mutable.
  - `String#+@`: acts as if the string was mutable.
  - `String#clone`: copies the chilled flag.

Co-authored-by: Jean Boussier <byroot@ruby-lang.org>
2024-03-19 09:26:49 +01:00
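The notes above say `String#+@` and `String#-@` treat a chilled string as if it were mutable. For reference, this sketch shows what those operators do for an ordinary mutable string and for a frozen one; the chilled case itself depends on the Ruby version and compile options, so it is not exercised here.

```ruby
mutable = String.new("abc")
frozen = "abc".freeze

same = +mutable        # not frozen: unary plus returns the receiver
copy = +frozen         # frozen: unary plus returns an unfrozen copy
deduped = -mutable     # unary minus returns a frozen string

puts same.equal?(mutable)
puts copy.frozen?
puts deduped.frozen?
```

Unary minus does not freeze the receiver, so `mutable` can still be modified afterwards.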
Thomas Marshall 7e4b1f8e19
[Bug #20322] Fix rb_enc_interned_str_cstr null encoding
The documentation for `rb_enc_interned_str_cstr` notes that `enc` can be
a null pointer, but this currently causes a segmentation fault when
trying to autoload the encoding. This commit fixes the issue by checking
for NULL before calling `rb_enc_autoload`.
2024-03-03 10:43:35 +00:00
Peter Zhu ce8531fed4 Stop using rb_str_locktmp_ensure publicly
rb_str_locktmp_ensure is a private API.
2024-02-23 14:08:29 -05:00
Takashi Kokubun 8a6740c70e
YJIT: Lazily push a frame for specialized C funcs (#10080)
* YJIT: Lazily push a frame for specialized C funcs

Co-authored-by: Maxime Chevalier-Boisvert <maxime.chevalierboisvert@shopify.com>

* Fix a comment on pc_to_cfunc

* Rename rb_yjit_check_pc to rb_yjit_lazy_push_frame

* Rename it to jit_prepare_lazy_frame_call

* Fix a typo

* Optimize String#getbyte as well

* Optimize String#byteslice as well

---------

Co-authored-by: Maxime Chevalier-Boisvert <maxime.chevalierboisvert@shopify.com>
2024-02-23 19:08:09 +00:00
Peter Zhu 510404f2de Stop using rb_fstring publicly
rb_fstring is a private API, so we should use rb_str_to_interned_str
instead, which is a public API.
2024-02-23 13:33:46 -05:00
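Interning can also be observed from Ruby through `String#-@`, which returns a frozen, possibly pre-existing copy of the string. This sketch is offered only as a Ruby-level illustration of what interning means; it makes no claim about which C function backs the operator.

```ruby
# Two distinct mutable strings with the same content.
first = String.new("interned example")
second = String.new("interned example")

# Interning both yields frozen strings with equal content.
a = -first
b = -second

puts a.frozen?
puts a.equal?(b)   # identity of the two interned results
```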
Peter Zhu df5b8ea4db Remove unneeded RUBY_FUNC_EXPORTED 2024-02-23 10:24:21 -05:00
Takashi Kokubun d5080f6e8b Fix -Wsign-compare on String#initialize
../string.c:1886:57: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘long int’ [-Wsign-compare]
 1886 |                 if (STR_EMBED_P(str)) RUBY_ASSERT(osize <= str_embed_capa(str));
      |                                                         ^~
2024-02-22 16:11:30 -08:00
Nobuyoshi Nakada e04146129e
[Bug #20292] Truncate embedded string to new capacity 2024-02-22 22:46:18 +09:00
Nobuyoshi Nakada b1d70e4264
[Bug #20280] Check by `rb_parser_enc_str_coderange`
Co-authored-by: Yuichiro Kaneko <spiketeika@gmail.com>
2024-02-19 16:33:26 +09:00
Nobuyoshi Nakada fcc55dc226
[Bug #20280] Raise SyntaxError on invalid encoding symbol 2024-02-19 16:33:26 +09:00
Peter Zhu 4d1b3a2bf3 Unset STR_SHARED when setting string to embed 2024-02-15 12:19:45 -05:00
Yusuke Endoh 25d74b9527 Do not include a backtick in error messages and backtraces
[Feature #16495]
2024-02-15 18:42:31 +09:00
Burdette Lamar 65f5435540
[DOC] Doc compliance (#9955) 2024-02-14 10:47:42 -05:00
Alan Wu 6261d4b4d8 Fix use-after-move in Symbol#inspect
The allocation could re-embed `orig_str` and invalidate the data
pointer from RSTRING_GETMEM() if the string is embedded.

Found on CI, where the test introduced in 7002e77694 ("Fix
Symbol#inspect for GC compaction") recently failed.

See: <https://github.com/ruby/ruby/actions/runs/7880657560/job/21503019659>
2024-02-13 14:49:54 -05:00
Aaron Patterson c35fea8509
Specialize String#byteslice(a, b) (#9939)
* Specialize String#byteslice(a, b)

This adds a specialization for String#byteslice when there are two
parameters.

This makes our protobuf parser go from 5.84x slower to 5.33x slower

```
Comparison:
decode upstream (53738 bytes):     7228.5 i/s
decode protobuff (53738 bytes):     1236.8 i/s - 5.84x  slower

Comparison:
decode upstream (53738 bytes):     7024.8 i/s
decode protobuff (53738 bytes):     1318.5 i/s - 5.33x  slower
```

* Update yjit/src/codegen.rs

---------

Co-authored-by: Maxime Chevalier-Boisvert <maximechevalierb@gmail.com>
2024-02-13 16:20:27 +00:00
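The two-argument form specialized here is the one a binary parser tends to call in a tight loop. A minimal sketch with a made-up payload (the bytes are arbitrary, not a real protobuf message):

```ruby
# Arbitrary binary data: three header bytes followed by a 5-byte field.
data = "\x08\x96\x01hello".b

first_byte = data.getbyte(0)       # Integer value of the first byte
field = data.byteslice(3, 5)       # 5 bytes starting at byte offset 3

puts first_byte
puts field
```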
Peter Zhu ac38f259aa Replace assert with RUBY_ASSERT in string.c
assert does not print the bug report, only the file and line number of
the assertion that failed. RUBY_ASSERT prints the full bug report, which
makes it much easier to debug.
2024-02-12 15:07:47 -05:00
Peter Zhu c6b391214c [DOC] Improve flags of string 2024-02-08 10:49:38 -05:00
Peter Zhu 5e0c171451 Make io_fwrite safe for compaction
[Bug #20169]

Embedded strings are not safe for system calls without the GVL because
compaction can cause pages to be locked, causing the operation to fail
with EFAULT. This commit changes io_fwrite to use rb_str_tmp_frozen_no_embed_acquire,
which guarantees that the returned string is not embedded.
2024-02-05 11:11:07 -05:00
Takashi Kokubun 51753ec7fa
Annotate Symbol#to_s as leaf (#9769) 2024-01-31 10:47:35 -05:00
Peter Zhu e17c83e02c Fix memory leak in String#tr and String#tr_s
rb_enc_codepoint_len could raise, which would cause the memory in buf
to leak.

For example:

    str1 = "\xE0\xA0\xA1#{" " * 100}".force_encoding("EUC-JP")
    str2 = ""
    str3 = "a".force_encoding("Windows-31J")

    10.times do
      1_000_000.times do
        str1.tr_s(str2, str3)
      rescue
      end

      puts `ps -o rss= -p #{$$}`
    end

Before:

    17536
    22752
    28032
    33312
    38688
    43968
    49200
    54432
    59744
    64992

After:

    12176
    12352
    12352
    12448
    12448
    12448
    12448
    12448
    12448
    12448
2024-01-17 08:54:25 -05:00
tompng ade56737e2 Fix coderange of invalid_encoding_string.<<(ord)
Appending a validly encoded character can change the coderange from invalid to valid.
Example: "\x95".force_encoding('sjis') << 0x5C becomes the valid string "\x{955C}".
2024-01-16 23:18:55 +09:00
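The example in the message can be tried directly. In Shift_JIS, `0x95` is a lead byte that is invalid on its own, while the pair `0x95 0x5C` forms one valid character. The sketch below checks both facts on freshly built strings and then performs the append; the validity reported after the append is printed rather than assumed, since it depends on whether the running Ruby includes this fix.

```ruby
sjis = Encoding::Shift_JIS

# A lone lead byte is not a valid Shift_JIS string.
lead_only = "\x95".b.force_encoding(sjis)
lead_valid = lead_only.valid_encoding?

# The two-byte sequence is valid when built in one go.
pair = "\x95\x5C".b.force_encoding(sjis)
pair_valid = pair.valid_encoding?

# Appending the trail byte as an Integer completes the character.
appended = "\x95".b.force_encoding(sjis)
appended << 0x5C

puts lead_valid
puts pair_valid
puts appended.valid_encoding?
```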
Peter Zhu b3d6128049 Fix memory leak in grapheme clusters
[Bug #20150]

String#grapheme_clusters and String#each_grapheme_cluster leak memory
because, if the string is not UTF-8, the created regex is never
freed.

For example:

    str = "hello world".encode(Encoding::UTF_32LE)

    10.times do
      1_000.times do
        str.grapheme_clusters
      end

      puts `ps -o rss= -p #{$$}`
    end

Before:

    26000
    42256
    59008
    75792
    92528
    109232
    125936
    142672
    159392
    176160

After:

    9264
    9504
    9808
    10000
    10128
    10224
    10352
    10544
    10704
    10896
2024-01-08 09:14:04 -05:00
Peter Zhu 5aba5f0454 [DOC] Add parentheses in call-seq for String#include? 2024-01-02 19:19:12 -05:00