ac4d08312c
Differential Revision: https://phabricator.services.mozilla.com/D62732 |
||
---|---|---|
.. | ||
source | ||
GIT-INFO | ||
README.md | ||
data_filter.json |
README.md
Introduction
Internationalization (i18n, "i" then 18 letters then "n") is the process of handling data with respect to a particular locale:
- The number 5 representing five US dollars might be formatted as
- "$5.00" in American English,
- "US$5.00" in Canadian English, or
- "5,00 $US" in French.
- A list of people's names in a phone book would sort
- in English alphabetically; but
- in German, where "ä"/"ö"/"ü" are often interchangeable with "ae"/"oe"/"ue", alphabetically but with vowels with umlauts treated as their two-vowel counterparts.
- The currency whose code is "CHF" might be formatted as
- "Swiss Franc" in English, but
- "franc suisse" in French.
- The Unix time 1590803313070 might format as the time string
- "9:48:33 PM Eastern Daylight Time" in American English, but
- "21:48:33 Nordamerikanische Ostküsten-Sommerzeit" in German.
i18n encompasses far more than this, but you get the basic idea.
Internationalization in SpiderMonkey and Gecko
SpiderMonkey implements extensive i18n capabilities through the ECMAScript Internationalization API and the global Intl
object. Gecko requires i18n capabilities to implement text shaping, sort operations in some contexts, and various other features.
SpiderMonkey and Gecko use ICU, Internationalization Components for Unicode, to implement many low-level i18n operations. (Line breaking, implemented instead in intl/lwbrk
, is a notable exception.) Gecko and SpiderMonkey also use ICU's implementations of certain i18n-adjacent operations (for example, Unicode normalization).
ICU date/time formatting functionality requires extensive knowledge of time zone names and when zone transitions occur. The IANA tzdata
database supplies this information.
A final note of caution: ICU carefully depends upon an exact Unicode version. Other parts of SpiderMonkey and Gecko have separate dependencies on an exact Unicode version. Updates to ICU and related components must be synchronized with those updates so that the entirety of SpiderMonkey, and the entirety of Gecko including SpiderMonkey within it, advance to new Unicode versions in lockstep.1
Building SpiderMonkey or Gecko with ICU
SpiderMonkey and Gecko can be built using either a periodically-updated copy of ICU in intl/icu/source
(using time zone data in intl/tzdata/source
), or using a system-provided ICU library (dependent on its own tzdata
information). Pass --with-system-icu
when configuring to use system ICU. (Using system ICU will disable some Intl
functionality, such as historically accurate time zone calculations, that can't be readily supported without a precisely-controlled ICU.) ICU version requirements advance fairly quickly as Gecko depends on features and bug fixes in newer ICU releases. You'll get a build error if you try to use an unsupported ICU.
SpiderMonkey's Intl
API may be built or disabled by configuring --with-intl-api
(the default) or --without-intl-api
. SpiderMonkey built without the Intl
API doesn't require ICU. However, if you build without the Intl
API, some non-Intl
JavaScript functionality will not exist (String.prototype.normalize
) or won't fully work (for example, String.prototype.toLocale{Lower,Upper}Case
will not respect a provided locale, and the various toLocaleString
functions have best-effort behavior).
Using ICU functionality in SpiderMonkey and Gecko
ICU headers are considered system headers by the Gecko build system, so they must be listed in config/system-headers.mozbuild
. Code that wishes to use ICU functionality may use #include "unicode/unorm.h"
or similar to do so.
Gecko and SpiderMonkey code may use ICU's stable C API (ICU4C). These functions are stable and shouldn't change as ICU updates occur. (ICU4C's enum
initializers are not always stable: while initializer values are stable, new initializers are sometimes added, perhaps behind #ifdef U_HIDE_DRAFT_API
. This may be necessary for exhaustive switch
es to add #ifdef
s around some case
s.)
Gecko and SpiderMonkey are strongly discouraged from using ICU's C++ API (unfortunately including all smart pointer classes), because the C++ API doesn't provide ICU4C's compatibility guarantees. Rarely, we tolerate C++ API use when no stable option exists. But the API has to "look" reasonably stable, and we usually want to start a discussion with upstream about adding a stable API to eventually use. Use symbols from namespace icu
to access ICU C++ functionality. Talk to the current imported-ICU owner (presently Jeff Walden) before you start doing any of this!
SpiderMonkey and Gecko's imported ICU
Build system
The system for building ICU lives in config/external/icu
and intl/icu/icu_sources_data.py
. We generate a Mozilla-compatible build system rather than using ICU's build system. The build system is shared by SpiderMonkey and Gecko both.
ICU includes functionality we never use, so we don't naively compile all of it. We extract the list of files to compile from intl/icu/source/{common,i18n}/Makefile.in
and then apply a manually-maintained blacklist (stored in intl/icu_sources_data.py
) when we update ICU.
Locale and time zone data
ICU contains a considerable amount of raw locale data: formatting characteristics for each locale, strings for things like currencies and languages for each locale, localized time zone specifiers, and so on. This data lives in human-readable files in intl/icu/source/data
. Time zone data in intl/tzdata/source
is stored in partially-compiled formats (some of them only partly human-readable).
However, a normal Gecko build never uses these files! Instead, both ICU and tzdata
data are precompiled into a large, endian-specific icudtNNE.dat
(NN
= ICU version, E
= endianness) file.2 That file is added to config/external/icu/data/
and is checked into the Mozilla tree, to be directly incorporated into Gecko/SpiderMonkey builds. For size reasons, only the little-endian version is checked into the tree.
ICU's locale data covers all ICU internationalization features, including ones we never need. We trim locale data to size with a intl/icu/data_filter.json
data filter when compiling icudtNNE.dat
. Removing too much data won't necessarily break the build, so it's important that we have automated tests for the locale data we actually use in order to detect mistakes.
Local patching of ICU
We generally don't patch our copy of ICU except for compelling need. When we do patch, we usually only apply reasonably small patches that have been reviewed and landed upstream (so that our patch will be obsolete when we next update ICU).
Local patches are stored in the intl/icu-patches
directory. They're applied when ICU is updated, so merely updating ICU files in place won't persist changes across an ICU update.
Updating imported code
The process of updating imported i18n-relevant code is semi-automated. We use a series of shell and Python scripts to do the job.
Updating ICU
New ICU versions are announced on the icu-announce mailing list. Both release candidates and actual releases are announced here. It's a good idea to attempt to update ICU when a release candidate is announced, just in case some serious problem is present (especially one that would be painful to fix through local patching).
intl/update-icu.sh
updates our ICU to a given ICU release:3
$ cd "$topsrcdir/intl"
$ # Ensure certain Python modules in the tree are accessible when updating.
$ export PYTHONPATH="$topsrcdir/python/mozbuild/"
$ # <URL to ICU Git> <release tag name>
$ ./update-icu.sh https://github.com/unicode-org/icu.git release-67-1
But usually you'll want to update to the latest commit from the corresponding ICU maintenance branch so that you pick up fixes landed post-release:
$ cd "$topsrcdir/intl"
$ # Ensure certain Python modules in the tree are accessible when updating.
$ export PYTHONPATH="$topsrcdir/python/mozbuild/"
$ # <URL to ICU Git> <maintenance name>
$ ./update-icu.sh https://github.com/unicode-org/icu.git maint/maint-67
Updating ICU will also update the language tag registry (which records language tag semantics needed to correctly implement Intl
functionality). Therefore it's likely necessary to update SpiderMonkey's language tag handling after running this4. See below where the langtags
mode of make_intl_data.py
is discussed.
update-icu.sh
is intended for replayability, not for hands-off runnability. It downloads ICU source, prunes various irrelevant files, replaces intl/icu/source
with the new files -- and then blindly applies local patches in fixed order.
Often a local patch won't apply, or new patches must be applied to successfully build. In this case you'll have to manually edit update-icu.sh
to abort after only some patches have been applied, make whatever changes are necessary by hand, generate a new/updated patch file by hand, then carefully reattempt updating. (The people who have updated ICU in the past, usually jwalden and anba, follow this awkward process and don't have good ideas on how to improve it.)
Any time ICU is updated, you'll need to fully rebuild whichever of SpiderMonkey or Gecko you're building. For SpiderMonkey, delete your object directory and reconfigure from scratch. For Gecko, change the message in the top-level CLOBBER
file.
Updating tzdata
ICU contains a copy of tzdata
, but that copy is whatever tzdata
release was current at the time the ICU release was finalized. Time zone data changes much more often than that: every time some national legislature or tinpot dictator decides to alter time zones.5 The tz-announce
mailing list announces changes as they occur. (Note that we can't immediately update when a release occurs: ICU's icu-data
repository must be updated before we can update our tzdata
.)
Therefore, either (usually) after you update ICU or when a new tzdata
release occurs, you'll need to update our imported tzdata
files. (If you do need to update time zone data, note that you'll also need to additionally update SpiderMonkey's time zone handling, described further below.) This also suitably updates config/external/icu/data/icudtNNE.data
. (If you've just run update-icu.sh
, it will warn you that you need to do this.6)
First, make sure you have a usable icupkg
on your system.7 Then run the update-tzdata.sh
script to update intl/tzdata
and icudtNNE.data
:
$ cd "$topsrcdir/intl"
$ ./update-tzdata.sh 2020a # or whatever the latest release is
If tzdata
must be updated on trunk, you'll almost certainly have to backport the update to Beta and ESR. Don't attempt to backport the literal patch; just run the appropriate commands documented here to do so.
Updating SpiderMonkey Intl
data
SpiderMonkey itself can't blindly invoke ICU to perform every i18n operation, because sometimes ICU behavior deviates from what web specifications require. Therefore, when ICU is updated, we also must update SpiderMonkey itself as well (including various generated tests). Such updating is performed using the various modes of js/src/builtin/make_intl_data.py
.
Updating SpiderMonkey time zone handling
The ECMAScript Internationalization API requires that time zone identifiers (America/New_York
, Antarctica/McMurdo
, etc.) be interpreted according to IANA
semantics. Unfortunately, ICU doesn't precisely implement those semantics. (See comments in js/src/builtin/intl/SharedIntlData.h
for details.) Therefore SpiderMonkey has to do certain pre- and post-processing based on what's in IANA but not in ICU, and what's in ICU that isn't in IANA.
Use make_intl_data.py
's tzdata
mode to update time zone information:
$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py tzdata
The tzdata
mode accepts two optional arguments that generally will not be needed:
--tz
will act using data from a localtzdata/
directory containing rawtzdata
source (note that this is not the same as what is inintl/tzdata/source
). It may be useful to help debug problems that arise during an update.--ignore-backzone
will omit time zone information before 1970. SpiderMonkey and Gecko include this information by default. However, because (by deliberate policy)tzdata
information before 1970 is not reliable to the same degree as data since 1970, and backzone data has a size cost, a SpiderMonkey embedding or custom Gecko build might decide to omit it.
Updating SpiderMonkey language tag handling
Language tags (en
, de-CH
, ar-u-ca-islamicc
, and so on) are the primary means of specifying localization characteristics. The ECMAScript Internationalization API supports certain operations that depend upon the current state of the language tag registry (stored in the Unicode Common Locale Data Repository, CLDR, a repository of all locale-specific characteristics) that specifies subtag semantics:
Intl.getCanonicalLocales
andIntl.Locale
must replace alias subtags with their preferred forms. For example,ar-u-ca-islamic-civil
uses the preferred Islamic calendar subtag, whilear-u-ca-islamicc
uses an alias.Intl.Locale.prototype.maximize
andIntl.Locale.prototype.minimize
accept a language tag and add or remove "likely" subtags from it. For example,de
most likely refers to German using Latin script in Germany, so it maximizes tode-Latn-DE
-- and in reverse,de-Latn-DE
minimizes to simplyde
.
These decisions vary over time: as countries change8, as customs change, as language prevalence in regions varies, etc.
Use make_intl_data.py
's langtags
mode to update language tag information to the same CLDR version used by ICU:
$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py langtags
The CLDR version used will be printed in the header of CLDR-sensitive generated files. For example, js/src/builtin/intl/LanguageTagGenerated.cpp
currently begins with:
// Generated by make_intl_data.py. DO NOT EDIT.
// Version: CLDR-37
// URL: https://unicode.org/Public/cldr/37/core.zip
Updating SpiderMonkey currency support
Currencies use different numbers of fractional digits in their preferred formatting. Most currencies use two decimal digits; a handful use no fractional digits or some other number. Currency fractional digit is maintained by ISO and must be updated as currencies change their preferred fractional digits or new currencies arise that don't use two decimal digits.
Currency updates are fairly uncommon, so it'll be rare to need to update currency info. A newsletter periodically sends updates about changes.
Use make_intl_data.py
's currency
mode to update currency fractional digit information:
$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py currency
Updating SpiderMonkey measurement formatting support
The Intl
API supports formatting numbers as measurement units (for example, "17 meters" or "42 meters per second"). It specifies a list of units that must be supported, that we centrally record in js/src/builtin/intl/SanctionedSimpleUnitIdentifiers.yaml
, that we verify are supported by ICU and generate supporting files from.
If Intl
's list of supported units is ever updated, two separate changes will be required.
First, intl/icu/data_filter.json
must be updated to incorporate localized strings for the new unit. These strings are stored in icudtNNE.dat
, so you'll have to re-update ICU (and likely reimport tzdata
as well, if it's been updated since the last ICU update) to rewrite that file.
Second, use make_intl_data.py
's units
mode to update unit handling and associated tests in SpiderMonkey:
$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py units
-
The steps involved in updating Gecko-in-general's Unicode version, and updating SpiderMonkey's code dependent on Unicode version, are documented on WikiMO. ↩︎
-
icudtNNE.dat
isn't compiled during a SpiderMonkey/Gecko build because it would require ICU command-line tools. And it's a pain to either compile and run them during the build, or to require them as build dependencies. ↩︎ -
The ICU Git URL argument lets you update from a local ICU clone. This can speed up work when you're updating to a new ICU release and need to adjust or add new local patches. ↩︎
-
To give a sense of how frequently
tzdata
is updated, and the irregularity of releases over time:- 2019 had three
tzdata
releases, 2019a through 2019c. - 2018 had nine
tzdata
releases, 2018a through 2018i. - 2017 had three
tzdata
releases, 2017a through 2017c.
- 2019 had three
-
For example:
↩︎WARN: Local tzdata (2020a) is newer than ICU tzdata (2019c), please run './update-tzdata.sh 2020a'
-
To install
icupkg
on your system:- On Fedora, use
sudo dnf install icu
. - On Ubuntu, use
sudo apt-get install icu-devtools
. - On Mac OS X, use
brew install icu4c
. - On Windows, you'll need to download a binary build of ICU for Windows and use the
bin/icupkg.exe
orbin64/icupkg.exe
utility inside it.
If you're on Windows, or for some reason you don't want to use the
icupkg
now in your$PATH
, you can manually specify it on the command line using the-e /path/to/icupkg
flag:$ cd "$topsrcdir/intl" $ ./update-tzdata.sh -e /path/to/icupkg 2020a # or whatever the latest release is
In principle, the
icupkg
you use should be the one from the ICU release/maintenance branch being built: if there's a mismatch, you might encounter an ICU "format version not supported" error. If you're on Windows, make sure to download a binary build for that release/branch. On other platforms, you might have to build your own ICU from source. The steps required to do this are left as an exercise for the reader. (In the somewhat longer term, the update commands might be changed to do this themselves.) ↩︎ - On Fedora, use
-
For just one relevant example, the breakup of the Soviet Union is the cause of numerous entries in the language tag registry.
ru-SU
, Russian as used in the Soviet Union, is now expressed asru-RU
, Russian as used in Russia;ab-SU
, Abkhazian as used in the Soviet Union, is now expressed asab-GE
, Abkhazian as used in Georgia; and so on for all the other satellite states. ↩︎