In ASCII, 'a' is bigger than 'A'. Which means 'A' - 'a' is a negative
number (-32, to be precise). In C, the type of 'a' and 'A' are signed
int (cf: ISO/IEC 9899:1990 section 6.1.3.4). So 'A' - 'a' is also a
signed int. It is `(signed int)-32`.
The problem is, OnigCodePoint is unsigned int. Adding a negative
number to a variable of OnigCodepoint (`code` here) introduces an
unintentional cast of `(unsigned)(signed)-32`, which is
4,294,967,264. Adding this value to code then overflows, and the
result eventually becomes normal codepoint.
The series of operations are not a serious problem but because
`code >= 'a'` holds, we can `(code - 'a') + 'A'` to reroute this.
See also: https://github.com/k-takata/Onigmo/pull/107
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65752 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
- common.mk: Change Unicode version to 11.0.0
- enc/unicode/case-folding.rb, enc/unicode.c: Initial changes to deal with
Gregorian Mtavruli. This should bring us up to the same level as e.g.
Python 3.7, by following the Unicode tables exactly. But it will
produce undesirable (mixed-case) results for String#capitalize.
This will be addressed in a later commit.
- enc/unicode/11.0.0, enc/unicode/11.0.0/casefold.h, enc/unicode/name2ctype.h:
Add generated files.
- lib/unicode_normalize/tables.rb: Updated table.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65091 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode.c: moved additional Grapheme Cluster Break ranges
which depend on the Unicode version.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65087 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* regparse.c (node_extended_grapheme_cluster): as Unicode 10 has
added Grapheme_Cluster_Break properties to some characters,
remove duplicated ranges for Unicode 9.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65086 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
Also, as it's in the middle of the list of 4 arguments, 3rd and 4th arguments
(trim_mode, eoutvar) are changed to keyword arguments.
Old ways to specify arguments are deprecated and warned now.
bin/erb: deprecate -S option.
We'll remove all of deprecated ones at Ruby 2.7+.
enc/make_encmake.rb: stopped using deprecated interface
ext/etc/mkconstants.rb: ditto
ext/socket/mkconstants.rb: ditto
sample/ripper/ruby2html.rb: ditto
spec/ruby/library/erb/defmethod/def_erb_method_spec.rb: ditto
spec/ruby/library/erb/new_spec.rb: ditto
test/erb/test_erb.rb: ditto
test/erb/test_erb_command.rb: ditto
tool/generic_erb.rb: ditto
tool/ruby_vm/helpers/dumper.rb: ditto
tool/transcode-tblgen.rb: ditto
lib/rdoc/erbio.rb: ditto
lib/rdoc/generator/darkfish.rb: ditto
[Feature #14256]
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@62529 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* common.mk: enc/unicode.$(OBJEXT) depends on onigmo.h via
oniguruma.h.
* common.mk: dependencies of *prelude.$(OBJEXT) are defined for
each generated C sources.
* enc/depend: casefold.h and name2ctype.h are located under
$(UNICODE_HDR_DIR).
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@61800 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* tool/gperf.sed: comment out arguments part only, to keep the
following declarations static. [Feature #13883]
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@61282 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* tool/gperf.sed: extracted sed commands to a script. ANSI-C code
produced by gperf 3.1 declares length arguments as `size_t`. it
causes conflict with existing declarations, and needs casts for
a local variable and return statements.
[Feature #13883]
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@61076 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* common.mk: download emoji-data.txt. As emoji data files are
located in a separate directory in Unicode.org site, reearranged
Unicode data files directories same as the site.
* tool/enc-unicode.rb (get_file): search emoji data files in the
second argument path.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@60977 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
We don't need these files anymore because we upgraded to Unicode 10.0.0.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@59760 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
- In common.mk, set UNICODE_VERSION to 10.0.0
- Generate and add enc/unicode/10.0.0/casefold.h and
enc/unicode/10.0.0/name2ctype.h
- Update lib/unicode_normalize/tables.rb
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@59759 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
Replace the copyrights and explanatory texts in the data files used for mapping
GB2312/GB12345 to/from Unicode with short explanatory texts. [Bug #12598] [ci skip]
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@59715 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
In enc/utf_8.c, change maximum byte length of UTF-8 to 4 bytes (from 6)
to conform to definition of UTF-8. This closes issue #13590.
(This is a retry of r58954, after issue #13590 has been addressed.)
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@58965 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
Revert change to maximum of 4 bytes for UTF-8 characters at r58954 temporarily.
This failed spec at https://travis-ci.org/ruby/ruby/builds/237086017, but it
is totally unclear why.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@58955 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
In enc/utf_8.c, change maximum byte length of UTF-8 to 4 bytes (from 6)
to conform to definition of UTF-8. This closes issue #13590.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@58954 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/depend (enc/unicode.o): remove hardcoded Unicode versions.
this object file must be compiled by toplevel make.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@58386 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
I started writing a template for auto-generation and
let "tool/update-deps --fix" fill in the rest.
Hopefully this fixes problems with some CI builds
after r58359. Further changes to other ext/-test-/
files should probably add or update "depend" files, too.
* ext/-test-/struct/depend: new file
* enc/depend: auto-updated with unicode 9.0.0 headers (side-effect)
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@58364 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode/9.0.0/name2ctype.h: update due to merger of Onigmo
6.0.0.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@58064 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
Onigumo 6 (r57045) introduced new onigumo.h header file, which is
required from quite much everywhere. This commit adds necessary
dependencies.
Note: ruby/oniguruma.h now includes onigumo.h,
ruby/io.h includes oniguruma.h,
ruby/encoding.h also includes oniguruma.h,
and internal.h includes encoding.h.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@58054 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode.c: Remove special processing for U+03B9/U+03BC/U+A64B
(GREEK SMALL LETTERs IOTA/MU, CYRILLIC SMALL LETTER MONOGRAPH UK)
from onigenc_unicode_case_map and simplify code.
* enc/unicode/case-folding.rb: Remove check for U+03B9/U+03BC/U+A64B.
This and the previous few related commits make sure that we won't hit
the equivalent of bug #12990 anymore for future updates of Unicode versions.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56976 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode/case-folding.rb: Reorder codepoints so that the upper-case
mapping comes first.
* enc/unicode/9.0.0/casefold.h: Codepoints reordered, upper-case mapping
flag added.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56975 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* meta character \X matches Unicode 9.0.0 characters with some workarounds
for UTR #51 Unicode Emoji, Version 4.0 emoji zwj sequences.
[Feature #12831] [ruby-core:77586]
The term "character" can have many meanings bytes, codepoints, combined
characters, and so on. "grapheme cluster" is highest one of such words,
which means user-perceived characters.
Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION specifies how to
handle grapheme clusters (extended grapheme cluster).
But some specs aren't updated to current situation because Unicode Emoji
is rapidly extended without well definition.
It breaks the precondition of UTR#29 "Grapheme cluster boundaries can be
easily tested by looking at immediately adjacent characters". (the
sentence will be removed in the next version)
Though some of its detail are described in Unicode Technical Report #51
UNICODE EMOJI but it is not merged into UTR#29 yet.
http://unicode.org/reports/tr29/http://unicode.org/reports/tr51/http://unicode.org/Public/emoji/4.0/
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56949 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode.c: Add U+A64B to the special cases 03B9 and 03BC
at the end of onigenc_unicode_case_map (Bug #12990).
* enc/unicode/case-folding.rb: Add U+A64B to the special cases
03B9 and 03BC. Add a comment pointing to enc/unicode.c.
Change warnings to exceptions for unpredicted cases,
because this would have been more easily noticed
(the warning was not noticed when upgrading to Unicode 9.0.0).
* test/ruby/enc/test_case_comprehensive.rb: Remove temporary
exclusion of U+A64B from testing.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56941 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/trans/single_byte.trans (transcode_tblgen_singlebyte):
remove useless code. returned value is not used.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56513 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
removing directories/files related to Unicode version 8.0.0
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56090 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* unicode/9.0.0/casefold.h, name2ctype.h, unicode/data/9.0.0:
new directories/files for Unicode version 9.0.0
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56087 b2dd03c8-39d4-4d8f-98ff-823fe69b080e