Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
See https://github.com/ruby/ruby/pull/6451 and
https://bugs.ruby-lang.org/issues/19007.
This keeps the Unicode version at 14.0.0, so this commit
is suited for backporting where applicable.
At the time of this commit, the reason for the wrong properties
which we fix here is still not completely known, so issue 19007
should be kept open.
This was commented out because of differences between gperf 3.0 and 3.1
(see [Feature #13883]). Five years have passed, so I am fairly
confident that we can drop support for old versions here.
Ditto for uniname2ctype_p(), onig_jis_property(), and zonetab().
Unicode Version 12.1.0 adds one single character, U+32FF SQUARE ERA NAME REIWA,
for the new Japanese era starting on May 1st. 12.1.0 will be finalized only on
May 7th, so we go with the beta version because further changes in the data we
need are highly unlikely, and we want to make sure Ruby is ready for the new era.
* common.mk: change UNICODE_VERSION to 12.1.0, UNICODE_BETA to YES
* enc/unicode/12.1.0, enc/unicode/12.1.0/casefold.h, enc/unicode/12.1.0/name2ctype.h:
add directory and generated data files for new version
* lib/unicode_normalize/tables.rb: update for new character
* test/ruby/test_regexp.rb: add test for character property age=12.1
* test/test_unicode_normalize.rb: add test for NFKC decomposition of new character
This (mostly) completes issue #15195.
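The effect of the new character can be checked roughly as follows. This is a minimal sketch, assuming a Ruby built with Unicode 12.1.0 or later data; the variable names are illustrative only.

```ruby
# Assumes a Ruby whose Unicode data is 12.1.0 or later.
reiwa = "\u32FF"                       # SQUARE ERA NAME REIWA

# NFKC replaces the square character with its two component kanji.
nfkc = reiwa.unicode_normalize(:nfkc)
puts nfkc == "\u4EE4\u548C"            # U+4EE4 U+548C

# Age properties are cumulative: age=12.1 includes the new character,
# while age=12.0 does not.
puts reiwa.match?(/\p{age=12.1}/)
puts reiwa.match?(/\p{age=12.0}/)
```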
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@67441 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
- common.mk: set UNICODE_VERSION and UNICODE_EMOJI_VERSION to 12.0.0
- lib/unicode_normalize/tables.rb: update table data to Unicode version 12.0.0
- enc/unicode/12.0.0/casefold.h, enc/unicode/12.0.0/name2ctype.h: add generated
files for Unicode version 12.0.0
This is the main commit for #15321.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@67169 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
Remove directory enc/unicode/10.0.0.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66295 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
- common.mk: Change Unicode version to 11.0.0, and Emoji version to 11.0
- test/ruby/enc/test_emoji_breaks.rb: update hard-coded Emoji version
- enc/unicode/11.0.0, enc/unicode/11.0.0/casefold.h, enc/unicode/name2ctype.h:
Add generated files. Files for Unicode 10.0.0 will be removed once we are
sure 11.0.0 works.
- lib/unicode_normalize/tables.rb: Updated table.
- regparse.c: Almost completely reimplement grapheme cluster detection in
function node_extended_grapheme_cluster().
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66213 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
- enc/unicode/case-folding.rb:
- Convert unpredicted case to actual flag setting
- Eliminate an unused variable
- Change a variable name to avoid a warning
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65933 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
- common.mk: Change Unicode version to 11.0.0
- enc/unicode/case-folding.rb, enc/unicode.c: Initial changes to deal with
Georgian Mtavruli. This should bring us up to the same level as e.g.
Python 3.7, by following the Unicode tables exactly. But it will
produce undesirable (mixed-case) results for String#capitalize.
This will be addressed in a later commit.
- enc/unicode/11.0.0, enc/unicode/11.0.0/casefold.h, enc/unicode/name2ctype.h:
Add generated files.
- lib/unicode_normalize/tables.rb: Updated table.
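The Mtavruli mappings mentioned above can be observed with a short sketch like the following. It assumes Ruby 2.6 or later (Unicode 11.0.0 data or newer); the variable names are illustrative, and the result of String#capitalize is only printed because it depends on the Ruby version.

```ruby
# Assumes Ruby 2.6 or later (Unicode 11.0.0 data or newer).
mkhedruli = "\u10D0"   # GEORGIAN LETTER AN
mtavruli  = "\u1C90"   # GEORGIAN MTAVRULI CAPITAL LETTER AN

# Following the Unicode tables, upcase maps Mkhedruli to Mtavruli,
# and downcase maps it back.
puts mkhedruli.upcase == mtavruli
puts mtavruli.downcase == mkhedruli

# Inspect what capitalize does with a Georgian word (not asserted here).
p "\u10D0\u10D1\u10D2".capitalize.codepoints.map { |c| format("U+%04X", c) }
```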
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@65091 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* tool/gperf.sed: extracted sed commands to a script. ANSI-C code
produced by gperf 3.1 declares length arguments as `size_t`. This
conflicts with existing declarations, and needs casts for
a local variable and return statements.
[Feature #13883]
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@61076 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* common.mk: download emoji-data.txt. As emoji data files are
located in a separate directory on the Unicode.org site, rearranged
the Unicode data file directories to match the site.
* tool/enc-unicode.rb (get_file): search emoji data files in the
second argument path.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@60977 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
We don't need these files anymore because we upgraded to Unicode 10.0.0.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@59760 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
- In common.mk, set UNICODE_VERSION to 10.0.0
- Generate and add enc/unicode/10.0.0/casefold.h and
enc/unicode/10.0.0/name2ctype.h
- Update lib/unicode_normalize/tables.rb
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@59759 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode/9.0.0/name2ctype.h: update due to merger of Onigmo
6.0.0.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@58064 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode.c: Remove special processing for U+03B9/U+03BC/U+A64B
(GREEK SMALL LETTERs IOTA/MU, CYRILLIC SMALL LETTER MONOGRAPH UK)
from onigenc_unicode_case_map and simplify code.
* enc/unicode/case-folding.rb: Remove check for U+03B9/U+03BC/U+A64B.
This and the previous few related commits make sure that we won't hit
the equivalent of bug #12990 anymore for future updates of Unicode versions.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56976 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode/case-folding.rb: Reorder codepoints so that the upper-case
mapping comes first.
* enc/unicode/9.0.0/casefold.h: Codepoints reordered, upper-case mapping
flag added.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56975 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* meta character \X matches Unicode 9.0.0 characters with some workarounds
for UTR #51 Unicode Emoji, Version 4.0 emoji zwj sequences.
[Feature #12831] [ruby-core:77586]
The term "character" can have many meanings: bytes, codepoints, combined
characters, and so on. "Grapheme cluster" is the highest-level of these terms
and means a user-perceived character.
Unicode Standard Annex #29, UNICODE TEXT SEGMENTATION, specifies how to
handle grapheme clusters (extended grapheme clusters).
However, some specifications have not caught up with the current situation,
because Unicode Emoji is being extended rapidly without being well defined.
This breaks the precondition of UAX #29 that "Grapheme cluster boundaries can be
easily tested by looking at immediately adjacent characters" (the
sentence will be removed in the next version).
Some of the details are described in Unicode Technical Report #51,
UNICODE EMOJI, but they have not been merged into UAX #29 yet.
http://unicode.org/reports/tr29/
http://unicode.org/reports/tr51/
http://unicode.org/Public/emoji/4.0/
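The difference between bytes, codepoints, and grapheme clusters can be illustrated with a small sketch. It uses only simple cases (a combining mark and CR LF) and assumes a UTF-8 source file; emoji sequences are left out because their handling depends on the Unicode version of the Ruby in use.

```ruby
# "e" followed by COMBINING ACUTE ACCENT: one user-perceived character.
s = "e\u0301"
puts s.bytes.size        # 3 (bytes in UTF-8)
puts s.codepoints.size   # 2 (codepoints)
puts s.scan(/\X/).size   # 1 (extended grapheme cluster)

# CR LF is also treated as a single grapheme cluster.
puts "\r\n".scan(/\X/).size   # 1
```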
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56949 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode.c: Add U+A64B to the special cases 03B9 and 03BC
at the end of onigenc_unicode_case_map (Bug #12990).
* enc/unicode/case-folding.rb: Add U+A64B to the special cases
03B9 and 03BC. Add a comment pointing to enc/unicode.c.
Change warnings to exceptions for unpredicted cases,
because an exception would have been noticed more easily
(the warning was not noticed when upgrading to Unicode 9.0.0).
* test/ruby/enc/test_case_comprehensive.rb: Remove temporary
exclusion of U+A64B from testing.
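For reference, the simple case mappings of the three codepoints named above can be checked from Ruby. This is only a sketch of the expected results on Ruby 2.4 or later, taken from the Unicode character data; it does not reproduce the special-case logic in enc/unicode.c.

```ruby
# Simple uppercase mappings from the Unicode character data.
puts "\u03B9".upcase == "\u0399"   # GREEK SMALL LETTER IOTA -> CAPITAL IOTA
puts "\u03BC".upcase == "\u039C"   # GREEK SMALL LETTER MU -> CAPITAL MU
puts "\uA64B".upcase == "\uA64A"   # CYRILLIC SMALL LETTER MONOGRAPH UK -> CAPITAL

# The mapping also round-trips for U+A64A.
puts "\uA64A".downcase == "\uA64B"
```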
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56941 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
removing directories/files related to Unicode version 8.0.0
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56090 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* unicode/9.0.0/casefold.h, name2ctype.h, unicode/data/9.0.0:
new directories/files for Unicode version 9.0.0
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@56087 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* common.mk, enc/depend (casefold.h, name2ctype.h): move to
unicode data directory per version.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55701 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode/case-folding.rb, tool/enc-unicode.rb: check if
Unicode versions are consistent with each other.
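A consistency check of this kind could look roughly like the sketch below. The helper names are hypothetical and are not the ones used in the scripts; the sketch assumes each data file starts with a header line such as "# CaseFolding-9.0.0.txt".

```ruby
# Hypothetical helper: extract the version from a data file's first line,
# e.g. "# CaseFolding-9.0.0.txt".
def data_file_version(first_line)
  first_line[/-(\d+\.\d+\.\d+)\.txt/, 1]
end

# Hypothetical check: raise unless every header names the same version.
def check_versions(first_lines)
  versions = first_lines.map { |line| data_file_version(line) }.uniq
  raise "Unicode versions differ: #{versions.inspect}" unless versions.size == 1
  versions.first
end

puts check_versions(["# CaseFolding-9.0.0.txt", "# SpecialCasing-9.0.0.txt"])
```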
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55687 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* Makefile.in (enc/unicode/name2ctype.h): remove stale recipe,
which did not support Unicode age properties.
* common.mk (enc/unicode/name2ctype.h): update by --header option
of tool/enc-unicode.rb. enc/unicode/name2ctype.kwd file has not
been used.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55678 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode/case-folding.rb: check that the version numbers in all
data files match.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55545 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* enc/unicode/case-folding.rb (CaseFolding#load): read in binary
mode to deal with non-ASCII characters in CaseFolding.txt.
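Reading in binary mode can be sketched as follows. The file content is made up for illustration and only stands in for CaseFolding.txt; the point is that a binary read yields an ASCII-8BIT string, so ASCII-only patterns still match regardless of the locale encoding.

```ruby
require "tempfile"

# Made-up stand-in for CaseFolding.txt with a non-ASCII character in a comment.
sample = "# caf\u00E9\n0041; C; 0061; # LATIN CAPITAL LETTER A\n"

data = Tempfile.create("casefolding") do |f|
  f.binmode
  f.write(sample.b)
  f.flush
  # "rb" returns a binary (ASCII-8BIT) string, independent of the locale.
  File.open(f.path, "rb", &:read)
end

puts data.encoding == Encoding::BINARY
p data.scan(/^\h+; C; \h+/)
```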
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55496 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
* common.mk (lib/unicode_normalize/tables.rb): should not depend
on Unicode data files unless ALWAYS_UPDATE_UNICODE=yes, to avoid
downloading Unicode data unnecessarily. [ruby-dev:49681]
* common.mk (enc/unicode/casefold.h): update Unicode files in a
sub-make, not to let the header depend on the files always.
* enc/unicode/case-folding.rb: if gperf is not usable, assume the
existing file is OK.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@55492 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
swapcase functionality for titlecase characters. Swapcase isn't defined
by Unicode, because the purpose/usage of swapcase is unclear anyway.
The implementation follows a proposal from Nobu, swapping the case of
each component of a titlecase character individually.
This means that the titlecase characters have to be decomposed.
* enc/unicode.c: Code using the above data.
* test/ruby/enc/test_case_mapping.rb: Tests for the above.
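The swapcase behaviour for titlecase characters can be inspected with a sketch like this one. It assumes Ruby 2.4 or later, where case mapping covers non-ASCII characters; the variable names are illustrative.

```ruby
# Assumes Ruby 2.4 or later.
titlecase_dz = "\u01F2"   # LATIN CAPITAL LETTER D WITH SMALL LETTER Z

# The single titlecase codepoint is decomposed and the case of each
# component is swapped individually.
swapped = titlecase_dz.swapcase
p swapped.codepoints.map { |c| format("U+%04X", c) }
```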
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54469 b2dd03c8-39d4-4d8f-98ff-823fe69b080e
special cases in CaseUnfold_11_Table.
* enc/unicode.c: Adjustments for above.
* test/ruby/enc/test_case_mapping.rb: Tests for the above: Some tests in
test_titlecase activated; test_greek added. A test in test_cherokee fixed.
git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@54383 b2dd03c8-39d4-4d8f-98ff-823fe69b080e