gecko-dev/intl/icu
Ricky Stewart 0ba6a8762d Bug 1645779 - Make icu_sources_data.py Python 3-compliant r=jwalden,anba
This removes a dependency on `pymake`, which is Python 2-only, and thoroughly unnecessary since we just use it to find assignments of the form `OBJECTS = ...`. We can replicate this logic by just isolating lines that begin with that literal string, and everything else can stay the same. This is definitionally more brittle than actually using a parser, but it works fine for now, and the original implementation wasn't significantly better (it didn't handle any form of dynamism, anything more complicated than a single unconditional assignment with a space-separated list of literal strings representing outputs, etc.)

Differential Revision: https://phabricator.services.mozilla.com/D79896
2020-06-18 21:01:49 +00:00

README.md

Introduction

Internationalization (i18n, "i" then 18 letters then "n") is the process of handling data with respect to a particular locale:

  • The number 5 representing five US dollars might be formatted as
    • "$5.00" in American English,
    • "US$5.00" in Canadian English, or
    • "5,00 $US" in French.
  • A list of people's names in a phone book would sort
    • in English alphabetically; but
    • in German, where "ä"/"ö"/"ü" are often interchangeable with "ae"/"oe"/"ue", alphabetically but with vowels with umlauts treated as their two-vowel counterparts.
  • The currency whose code is "CHF" might be formatted as
    • "Swiss Franc" in English, but
    • "franc suisse" in French.
  • The Unix time 1590803313070 might format as the time string
    • "9:48:33 PM Eastern Daylight Time" in American English, but
    • "21:48:33 Nordamerikanische Ostküsten-Sommerzeit" in German.

i18n encompasses far more than this, but you get the basic idea.
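As a rough illustration of the last example, the sketch below converts the timestamp (which is in milliseconds since the Unix epoch) to a wall-clock time. It assumes a fixed UTC-4 offset as a stand-in for Eastern Daylight Time; `wall_clock` and `EDT` are names made up for this example, and real code would take zone rules and localized zone names from ICU and tzdata.

```python
from datetime import datetime, timedelta, timezone

# The timestamp from the list above, in milliseconds since the Unix epoch.
UNIX_MILLIS = 1590803313070

# Assumption: a fixed UTC-4 offset stands in for Eastern Daylight Time.
# Real zone rules and localized zone names come from ICU and tzdata.
EDT = timezone(timedelta(hours=-4), "EDT")


def wall_clock(millis: int, tz: timezone) -> datetime:
    """Convert epoch milliseconds to a timezone-aware datetime."""
    return datetime.fromtimestamp(millis / 1000, tz)


moment = wall_clock(UNIX_MILLIS, EDT)
print(moment.strftime("%I:%M:%S %p"))  # 12-hour clock, as in American English
print(moment.strftime("%H:%M:%S"))     # 24-hour clock, as in German
```

The localized zone names ("Eastern Daylight Time", "Nordamerikanische Ostküsten-Sommerzeit") are exactly the kind of locale data that this sketch does not have and that ICU provides.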

Internationalization in SpiderMonkey and Gecko

SpiderMonkey implements extensive i18n capabilities through the ECMAScript Internationalization API and the global Intl object. Gecko requires i18n capabilities to implement text shaping, sort operations in some contexts, and various other features.

SpiderMonkey and Gecko use ICU, Internationalization Components for Unicode, to implement many low-level i18n operations. (Line breaking, implemented instead in intl/lwbrk, is a notable exception.) Gecko and SpiderMonkey also use ICU's implementations of certain i18n-adjacent operations (for example, Unicode normalization).

ICU date/time formatting functionality requires extensive knowledge of time zone names and when zone transitions occur. The IANA tzdata database supplies this information.

A final note of caution: ICU carefully depends upon an exact Unicode version. Other parts of SpiderMonkey and Gecko have separate dependencies on an exact Unicode version. Updates to ICU and related components must be synchronized with those updates so that the entirety of SpiderMonkey, and the entirety of Gecko including SpiderMonkey within it, advance to new Unicode versions in lockstep.[1]

Building SpiderMonkey or Gecko with ICU

SpiderMonkey and Gecko can be built using either a periodically-updated copy of ICU in intl/icu/source (using time zone data in intl/tzdata/source), or using a system-provided ICU library (dependent on its own tzdata information). Pass --with-system-icu when configuring to use system ICU. (Using system ICU will disable some Intl functionality, such as historically accurate time zone calculations, that can't be readily supported without a precisely-controlled ICU.) ICU version requirements advance fairly quickly as Gecko depends on features and bug fixes in newer ICU releases. You'll get a build error if you try to use an unsupported ICU.

SpiderMonkey's Intl API may be built or disabled by configuring --with-intl-api (the default) or --without-intl-api. SpiderMonkey built without the Intl API doesn't require ICU. However, if you build without the Intl API, some non-Intl JavaScript functionality will not exist (String.prototype.normalize) or won't fully work (for example, String.prototype.toLocale{Lower,Upper}Case will not respect a provided locale, and the various toLocaleString functions have best-effort behavior).

Using ICU functionality in SpiderMonkey and Gecko

ICU headers are considered system headers by the Gecko build system, so they must be listed in config/system-headers.mozbuild. Code that wishes to use ICU functionality may use #include "unicode/unorm.h" or similar to do so.

Gecko and SpiderMonkey code may use ICU's stable C API (ICU4C). These functions are stable and shouldn't change as ICU updates occur. (ICU4C's sets of enum initializers are not always stable: existing initializer values don't change, but new initializers are sometimes added, perhaps behind #ifdef U_HIDE_DRAFT_API. Exhaustive switches may therefore need #ifdefs around some cases.)

Gecko and SpiderMonkey are strongly discouraged from using ICU's C++ API (unfortunately including all smart pointer classes), because the C++ API doesn't provide ICU4C's compatibility guarantees. Rarely, we tolerate C++ API use when no stable option exists. But the API has to "look" reasonably stable, and we usually want to start a discussion with upstream about adding a stable API to eventually use. Use symbols from namespace icu to access ICU C++ functionality. Talk to the current imported-ICU owner (presently Jeff Walden) before you start doing any of this!

SpiderMonkey and Gecko's imported ICU

Build system

The system for building ICU lives in config/external/icu and intl/icu_sources_data.py. We generate a Mozilla-compatible build system rather than using ICU's build system. The build system is shared by both SpiderMonkey and Gecko.

ICU includes functionality we never use, so we don't naively compile all of it. We extract the list of files to compile from intl/icu/source/{common,i18n}/Makefile.in and then apply a manually-maintained blacklist (stored in intl/icu_sources_data.py) when we update ICU.
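The extraction step can be pictured roughly as follows. This is a minimal sketch, not the code in icu_sources_data.py: it assumes a single unconditional `OBJECTS = ...` line listing space-separated `.o` files, assumes each object file maps to a `.cpp` source of the same name, and uses a made-up Makefile fragment and exclusion set. `sources_from_makefile` is a hypothetical name.

```python
OBJECTS_PREFIX = "OBJECTS = "


def sources_from_makefile(makefile_text, excluded):
    """Return the sorted source files named by the OBJECTS line, minus exclusions."""
    for line in makefile_text.splitlines():
        if line.startswith(OBJECTS_PREFIX):
            objects = line[len(OBJECTS_PREFIX):].split()
            # Assumption: every foo.o is built from foo.cpp.
            sources = [obj[:-2] + ".cpp" for obj in objects if obj.endswith(".o")]
            return sorted(src for src in sources if src not in excluded)
    raise ValueError("no 'OBJECTS = ' assignment found")


# Made-up Makefile.in fragment and exclusion list, for illustration only.
SAMPLE_MAKEFILE = """\
# Comment lines and other assignments are ignored.
TARGET = libexample
OBJECTS = alpha.o beta.o gamma.o
"""

print(sources_from_makefile(SAMPLE_MAKEFILE, excluded={"beta.cpp"}))
```

Matching on a literal line prefix is more brittle than parsing the Makefile, but it is enough when the assignment is a plain list of literal file names.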

Locale and time zone data

ICU contains a considerable amount of raw locale data: formatting characteristics for each locale, strings for things like currencies and languages for each locale, localized time zone specifiers, and so on. This data lives in human-readable files in intl/icu/source/data. Time zone data in intl/tzdata/source is stored in partially-compiled formats (some of them only partly human-readable).

However, a normal Gecko build never uses these files! Instead, both ICU and tzdata data are precompiled into a large, endian-specific icudtNNE.dat (NN = ICU version, E = endianness) file.[2] That file is added to config/external/icu/data/ and is checked into the Mozilla tree, to be directly incorporated into Gecko/SpiderMonkey builds. For size reasons, only the little-endian version is checked into the tree.
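To make the icudtNNE.dat naming concrete, here is a small sketch that builds the file name from an ICU major version and a byte order. It assumes the conventional ICU suffix letters ("l" for little-endian, "b" for big-endian); `icu_data_filename` is a hypothetical helper, not something in the tree.

```python
import sys

# Assumption: ICU's usual one-letter endianness suffixes.
ENDIAN_SUFFIX = {"little": "l", "big": "b"}


def icu_data_filename(icu_major_version: int, byteorder: str = sys.byteorder) -> str:
    """Build the icudtNNE.dat name for an ICU major version and byte order."""
    return f"icudt{icu_major_version}{ENDIAN_SUFFIX[byteorder]}.dat"


# The little-endian file for ICU 67, the variant that is checked into the tree.
print(icu_data_filename(67, "little"))
```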

ICU's locale data covers all ICU internationalization features, including ones we never need. We trim locale data to size with an intl/icu/data_filter.json data filter when compiling icudtNNE.dat. Removing too much data won't necessarily break the build, so it's important that we have automated tests for the locale data we actually use in order to detect mistakes.

Local patching of ICU

We generally don't patch our copy of ICU except for compelling need. When we do patch, we usually only apply reasonably small patches that have been reviewed and landed upstream (so that our patch will be obsolete when we next update ICU).

Local patches are stored in the intl/icu-patches directory. They're applied when ICU is updated, so merely updating ICU files in place won't persist changes across an ICU update.

Updating imported code

The process of updating imported i18n-relevant code is semi-automated. We use a series of shell and Python scripts to do the job.

Updating ICU

New ICU versions are announced on the icu-announce mailing list. Both release candidates and actual releases are announced there. It's a good idea to attempt to update ICU when a release candidate is announced, just in case some serious problem is present (especially one that would be painful to fix through local patching).

intl/update-icu.sh updates our ICU to a given ICU release:[3]

$ cd "$topsrcdir/intl"
$ # Ensure certain Python modules in the tree are accessible when updating.
$ export PYTHONPATH="$topsrcdir/python/mozbuild/"
$ #               <URL to ICU Git>                       <release tag name>
$ ./update-icu.sh https://github.com/unicode-org/icu.git release-67-1

But usually you'll want to update to the latest commit from the corresponding ICU maintenance branch so that you pick up fixes landed post-release:

$ cd "$topsrcdir/intl"
$ # Ensure certain Python modules in the tree are accessible when updating.
$ export PYTHONPATH="$topsrcdir/python/mozbuild/"
$ #               <URL to ICU Git>                       <maintenance name>
$ ./update-icu.sh https://github.com/unicode-org/icu.git maint/maint-67

Updating ICU will also update the language tag registry (which records language tag semantics needed to correctly implement Intl functionality). Therefore it's likely necessary to update SpiderMonkey's language tag handling after running this.[4] See the discussion of make_intl_data.py's langtags mode below.

update-icu.sh is intended for replayability, not for hands-off runnability. It downloads ICU source, prunes various irrelevant files, replaces intl/icu/source with the new files -- and then blindly applies local patches in fixed order.

Often a local patch won't apply, or new patches must be applied to successfully build. In this case you'll have to manually edit update-icu.sh to abort after only some patches have been applied, make whatever changes are necessary by hand, generate a new/updated patch file by hand, then carefully reattempt updating. (The people who have updated ICU in the past, usually jwalden and anba, follow this awkward process and don't have good ideas on how to improve it.)

Any time ICU is updated, you'll need to fully rebuild whichever of SpiderMonkey or Gecko you're building. For SpiderMonkey, delete your object directory and reconfigure from scratch. For Gecko, change the message in the top-level CLOBBER file.

Updating tzdata

ICU contains a copy of tzdata, but that copy is whatever tzdata release was current at the time the ICU release was finalized. Time zone data changes much more often than that: every time some national legislature or tinpot dictator decides to alter time zones.[5] The tz-announce mailing list announces changes as they occur. (Note that we can't immediately update when a release occurs: ICU's icu-data repository must be updated before we can update our tzdata.)

Therefore, either (usually) after you update ICU or when a new tzdata release occurs, you'll need to update our imported tzdata files. (If you do need to update time zone data, note that you'll also need to update SpiderMonkey's time zone handling, described further below.) This also suitably updates config/external/icu/data/icudtNNE.dat. (If you've just run update-icu.sh, it will warn you that you need to do this.[6])

First, make sure you have a usable icupkg on your system.[7] Then run the update-tzdata.sh script to update intl/tzdata and icudtNNE.dat:

$ cd "$topsrcdir/intl"
$ ./update-tzdata.sh 2020a # or whatever the latest release is

If tzdata must be updated on trunk, you'll almost certainly have to backport the update to Beta and ESR. Don't attempt to backport the literal patch; just run the appropriate commands documented here to do so.
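tzdata release names such as 2020a combine a four-digit year with a letter suffix, so deciding whether one release is newer than another is a simple comparison. A minimal sketch, assuming that year-plus-lowercase-letters naming scheme; `parse_tzdata_release` is a hypothetical helper and is not part of update-tzdata.sh:

```python
import re

# Assumption: release names look like "2020a" (four-digit year, lowercase letters).
RELEASE_RE = re.compile(r"(\d{4})([a-z]+)")


def parse_tzdata_release(name: str):
    """Split a tzdata release name into a sortable (year, suffix) tuple."""
    match = RELEASE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized tzdata release name: {name!r}")
    return int(match.group(1)), match.group(2)


local, bundled = "2020a", "2019c"
if parse_tzdata_release(local) > parse_tzdata_release(bundled):
    print(f"{local} is newer than {bundled}")
```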

Updating SpiderMonkey Intl data

SpiderMonkey itself can't blindly invoke ICU to perform every i18n operation, because sometimes ICU behavior deviates from what web specifications require. Therefore, when ICU is updated, we must also update SpiderMonkey itself (including various generated tests). Such updating is performed using the various modes of js/src/builtin/intl/make_intl_data.py.

Updating SpiderMonkey time zone handling

The ECMAScript Internationalization API requires that time zone identifiers (America/New_York, Antarctica/McMurdo, etc.) be interpreted according to IANA semantics. Unfortunately, ICU doesn't precisely implement those semantics. (See comments in js/src/builtin/intl/SharedIntlData.h for details.) Therefore SpiderMonkey has to do certain pre- and post-processing based on what's in IANA but not in ICU, and what's in ICU that isn't in IANA.

Use make_intl_data.py's tzdata mode to update time zone information:

$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py tzdata

The tzdata mode accepts two optional arguments that generally will not be needed:

  • --tz will act using data from a local tzdata/ directory containing raw tzdata source (note that this is not the same as what is in intl/tzdata/source). It may be useful to help debug problems that arise during an update.
  • --ignore-backzone will omit time zone information before 1970. SpiderMonkey and Gecko include this information by default. However, because (by deliberate policy) tzdata information before 1970 is not reliable to the same degree as data since 1970, and backzone data has a size cost, a SpiderMonkey embedding or custom Gecko build might decide to omit it.

Updating SpiderMonkey language tag handling

Language tags (en, de-CH, ar-u-ca-islamicc, and so on) are the primary means of specifying localization characteristics. The ECMAScript Internationalization API supports certain operations that depend upon the current state of the language tag registry (stored in the Unicode Common Locale Data Repository, CLDR, a repository of all locale-specific characteristics) that specifies subtag semantics:

  • Intl.getCanonicalLocales and Intl.Locale must replace alias subtags with their preferred forms. For example, ar-u-ca-islamic-civil uses the preferred Islamic calendar subtag, while ar-u-ca-islamicc uses an alias.
  • Intl.Locale.prototype.maximize and Intl.Locale.prototype.minimize accept a language tag and add or remove "likely" subtags from it. For example, de most likely refers to German using Latin script in Germany, so it maximizes to de-Latn-DE -- and in reverse, de-Latn-DE minimizes to simply de.

These decisions vary over time: as countries change[8], as customs change, as language prevalence in regions varies, etc.
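The two operations above amount to table lookups against registry data. The toy sketch below uses only the examples from the text; the two tables and the function names are made up for illustration, and the real data is generated from CLDR and is far larger. It also assumes the calendar keyword is the last thing in the tag.

```python
# Toy tables holding only the examples from the text; real data comes from CLDR.
CALENDAR_ALIASES = {"islamicc": "islamic-civil"}
LIKELY_SUBTAGS = {"de": "de-Latn-DE"}


def canonicalize_calendar(tag: str) -> str:
    """Replace an aliased -u-ca- calendar value with its preferred form."""
    prefix, separator, calendar = tag.partition("-u-ca-")
    if not separator:
        return tag  # no calendar keyword present
    return prefix + separator + CALENDAR_ALIASES.get(calendar, calendar)


def maximize(tag: str) -> str:
    """Add likely subtags when the toy table knows the tag."""
    return LIKELY_SUBTAGS.get(tag, tag)


def minimize(tag: str) -> str:
    """Remove likely subtags by reversing the toy table."""
    for minimal, maximal in LIKELY_SUBTAGS.items():
        if maximal == tag:
            return minimal
    return tag


print(canonicalize_calendar("ar-u-ca-islamicc"))
print(maximize("de"), minimize("de-Latn-DE"))
```

Because the tables change as the registry changes, they have to be regenerated rather than maintained by hand.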

Use make_intl_data.py's langtags mode to update language tag information to the same CLDR version used by ICU:

$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py langtags

The CLDR version used will be printed in the header of CLDR-sensitive generated files. For example, js/src/builtin/intl/LanguageTagGenerated.cpp currently begins with:

// Generated by make_intl_data.py. DO NOT EDIT.
// Version: CLDR-37
// URL: https://unicode.org/Public/cldr/37/core.zip
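If you need to check that version programmatically, the header is easy to parse. A minimal sketch, assuming the three-line header format shown above; `cldr_version` is a hypothetical helper:

```python
import re

# The header lines quoted above.
HEADER = """\
// Generated by make_intl_data.py. DO NOT EDIT.
// Version: CLDR-37
// URL: https://unicode.org/Public/cldr/37/core.zip
"""


def cldr_version(text: str):
    """Return the CLDR version named in a generated file's header, or None."""
    match = re.search(r"^// Version: CLDR-(\S+)$", text, re.MULTILINE)
    return match.group(1) if match else None


print(cldr_version(HEADER))
```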

Updating SpiderMonkey currency support

Currencies use different numbers of fractional digits in their preferred formatting. Most currencies use two decimal digits; a handful use no fractional digits or some other number. Currency fractional digit information is maintained by ISO (as part of ISO 4217) and must be updated as currencies change their preferred fractional digits or new currencies arise that don't use two decimal digits.

Currency updates are fairly uncommon, so it'll be rare to need to update currency info. A newsletter periodically sends updates about changes.
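The effect of this data can be sketched as a lookup table of exceptions with a default of two digits. The table below lists just two well-known exceptions for illustration; the table and function names are assumptions, and the generated SpiderMonkey data is the authoritative list.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Illustrative exceptions only; every other currency falls back to the default.
FRACTION_DIGITS = {"JPY": 0, "KWD": 3}
DEFAULT_FRACTION_DIGITS = 2


def round_for_currency(amount: Decimal, currency_code: str) -> Decimal:
    """Round an amount to the currency's preferred number of fractional digits."""
    digits = FRACTION_DIGITS.get(currency_code, DEFAULT_FRACTION_DIGITS)
    quantum = Decimal(10) ** -digits  # e.g. 0.01 for two digits, 1 for zero
    return amount.quantize(quantum, rounding=ROUND_HALF_EVEN)


for code in ("USD", "JPY", "KWD"):
    print(code, round_for_currency(Decimal("5"), code))
```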

Use make_intl_data.py's currency mode to update currency fractional digit information:

$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py currency

Updating SpiderMonkey measurement formatting support

The Intl API supports formatting numbers as measurement units (for example, "17 meters" or "42 meters per second"). It specifies a list of units that must be supported. We record that list centrally in js/src/builtin/intl/SanctionedSimpleUnitIdentifiers.yaml, verify that ICU supports every unit in it, and generate supporting files from it.

If Intl's list of supported units is ever updated, two separate changes will be required.

First, intl/icu/data_filter.json must be updated to incorporate localized strings for the new unit. These strings are stored in icudtNNE.dat, so you'll have to re-update ICU (and likely reimport tzdata as well, if it's been updated since the last ICU update) to rewrite that file.

Second, use make_intl_data.py's units mode to update unit handling and associated tests in SpiderMonkey:

$ cd "$topsrcdir/js/src/builtin/intl"
$ # make_intl_data.py requires yaml.
$ export PYTHONPATH="$topsrcdir/third_party/python/PyYAML/lib3/"
$ python3 ./make_intl_data.py units

  1. The steps involved in updating Gecko-in-general's Unicode version, and updating SpiderMonkey's code dependent on Unicode version, are documented on WikiMO. ↩︎

  2. icudtNNE.dat isn't compiled during a SpiderMonkey/Gecko build because it would require ICU command-line tools. And it's a pain to either compile and run them during the build, or to require them as build dependencies. ↩︎

  3. The ICU Git URL argument lets you update from a local ICU clone. This can speed up work when you're updating to a new ICU release and need to adjust or add new local patches. ↩︎

  4. update-icu.sh will print a notice as a reminder of this:

    INFO: Please run 'js/src/builtin/intl/make_intl_data.py langtags' to update additional language tag files for SpiderMonkey.
    
    ↩︎
  5. To give a sense of how frequently tzdata is updated, and the irregularity of releases over time:

    • 2019 had three tzdata releases, 2019a through 2019c.
    • 2018 had nine tzdata releases, 2018a through 2018i.
    • 2017 had three tzdata releases, 2017a through 2017c.
    ↩︎
  6. For example:

    WARN: Local tzdata (2020a) is newer than ICU tzdata (2019c), please run './update-tzdata.sh 2020a'
    
    ↩︎
  7. To install icupkg on your system:

    • On Fedora, use sudo dnf install icu.
    • On Ubuntu, use sudo apt-get install icu-devtools.
    • On Mac OS X, use brew install icu4c.
    • On Windows, you'll need to download a binary build of ICU for Windows and use the bin/icupkg.exe or bin64/icupkg.exe utility inside it.

    If you're on Windows, or for some reason you don't want to use the icupkg now in your $PATH, you can manually specify it on the command line using the -e /path/to/icupkg flag:

    $ cd "$topsrcdir/intl"
    $ ./update-tzdata.sh -e /path/to/icupkg 2020a # or whatever the latest release is
    

    In principle, the icupkg you use should be the one from the ICU release/maintenance branch being built: if there's a mismatch, you might encounter an ICU "format version not supported" error. If you're on Windows, make sure to download a binary build for that release/branch. On other platforms, you might have to build your own ICU from source. The steps required to do this are left as an exercise for the reader. (In the somewhat longer term, the update commands might be changed to do this themselves.) ↩︎

  8. For just one relevant example, the breakup of the Soviet Union is the cause of numerous entries in the language tag registry. ru-SU, Russian as used in the Soviet Union, is now expressed as ru-RU, Russian as used in Russia; ab-SU, Abkhazian as used in the Soviet Union, is now expressed as ab-GE, Abkhazian as used in Georgia; and so on for all the other satellite states. ↩︎