supported tags and coverage levels.
This interface will be used by packages in go.text to indicate their
coverage.
This CL also includes a fix in maketables.go to prepare for CLDR 25.
LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/77760044
script is returned.
The algorithm previously assumed that if the IANA registry has a
SuppressScript entry for a language, this is typically the only script in
use for this language. There are a few languages for which this is not the
case, though. For example, Kazakh is typically written in Arabic (and
possibly Latin) script by native speakers in China, instead of the Cyrillic
typically used in Kazakhstan. A similar situation exists for Panjabi and
Malay.
LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/74680044
Also added tie-breaker rule to Matcher to prefer matches for which there is
a parent relation over those where this is not the case. Note that this is a
low-priority rule as the region distance is a better measure (close regions
are likely to share the same parent; this is not true for English, but this
is already special-cased in a different manner). Also script and region
equality are stronger measures than the parent relationship.
The Parent relationship may change between different versions of CLDR
(albeit usually only slightly). Added a Version constant to identify CLDR
version on which the package is based.
LGTM=r
R=r
CC=golang-codereviews, markdavis
https://golang.org/cl/74210044
- No longer takes transitive closure of languageMatch entries.
E.g. A Dane will understand Norwegian Bokmål. A writer of Bokmål will
understand Norwegian Nynorsk, but a Dane will typically not understand Nynorsk.
- Changed langMatch entries with a percentage of 90 to be High instead of Low.
(An arbitrary choice, but seems better.)
- Fixed matching of sr-Latn vs sr-Cyrl.
- Language-specific script mappings are now supported.
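The effect of dropping the transitive closure can be sketched as follows.
This is a minimal illustration, not the package's matcher: the table, the
function names and the one-way reading of the entries are assumptions made
for the example, and the data is just the Danish/Bokmål/Nynorsk case from
the text above.

```go
package main

import "fmt"

// understands lists direct, one-way intelligibility pairs. The entries are
// the illustrative ones from the description, not real CLDR data.
var understands = map[string][]string{
	"da": {"nb"}, // a Dane will understand Norwegian Bokmål
	"nb": {"nn"}, // a writer of Bokmål will understand Nynorsk
}

// directMatch consults only direct entries (no transitive closure).
func directMatch(desired, supported string) bool {
	if desired == supported {
		return true
	}
	for _, s := range understands[desired] {
		if s == supported {
			return true
		}
	}
	return false
}

// transitiveMatch follows chains of entries, which is the behavior that
// was removed.
func transitiveMatch(desired, supported string) bool {
	seen := map[string]bool{}
	var visit func(lang string) bool
	visit = func(lang string) bool {
		if lang == supported {
			return true
		}
		if seen[lang] {
			return false
		}
		seen[lang] = true
		for _, s := range understands[lang] {
			if visit(s) {
				return true
			}
		}
		return false
	}
	return visit(desired)
}

func main() {
	fmt.Println("direct da->nb:", directMatch("da", "nb"))
	fmt.Println("direct da->nn:", directMatch("da", "nn"))
	fmt.Println("transitive da->nn:", transitiveMatch("da", "nn"))
}
```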
LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/68190043
One could argue the top-level MustParse should always call Raw.Parse.
However, one can find arguments either way. We opt to be consistent and
leave it up to the user to decide if a Tag "constant" should be
canonicalized or not.
LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/65380043
The benefits of using predefined tags are:
- convenience for the user
- an additional layer of indirection for tag representations that may
change in the future (e.g. pt to mean Brazilian Portuguese or just
Portuguese).
Predefined values are only provided for a selection of tags. The selection
is based on CLDR data and augmented with languages for populous areas where
speakers do not typically also fluently speak a language that was already
included in the set (thereby increasing the likelihood of needing it) and
languages for which tags may be ambiguous.
We do not intend to provide similar constants for Regions, Scripts and Base
values. Specifically the set of Regions is much more "dynamic" than the set
of languages. The (Tag|Base|Script|Region|Currency)OrDie methods are provided
as a convenience for the user to specify values at startup.
Details:
- Went with names instead of codes to allow for a level of indirection on
top of tags.
- Names for the tags are taken from CLDR for the en locale.
- Internal tag constants are now prefixed with '_' instead of lang_, reg,
scr and cur. All tags are unique between these, so this does not pose
a problem. It makes the predefined values for tags look a lot better in
the godocs.
LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/59740047
This requires the str field to be changed from a pointer to a string. This
adds 8 bytes, a painful 50%, to each Tag. The benefits are quite noteworthy,
though. Note that this change is not reversible.
Benefits:
- makes testing for exact equality easier.
- less overhead testing for exact equality, rather than the approximate
equality of Matcher.
- Prevents bugs with inconsistent behavior when matching identical tags, as is
currently the case (comparing tags is currently not supported, but it
works under certain circumstances).
Drawbacks:
- Larger tags.
- Might lead people to use == instead of a Matcher when the latter is more
appropriate.
Details:
- If a Tag has no variants or extensions, str will now always be "". This
allows for faster and simpler processing in most cases. It potentially
avoids the need for an alloc on creation time, but it will now always
result in an alloc for String() when a tag has a region or script, but
no variants or extensions. This is an acceptable tradeoff.
- SetTypeForKey now allows erasing a key-value pair.
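The equality benefit can be illustrated with a small sketch. The struct
layouts below are hypothetical stand-ins, not the real Tag definition; they
only show why a string field makes == compare contents while a pointer
field compares addresses.

```go
package main

import "fmt"

// tag is a hypothetical stand-in for the new layout: str is a plain string
// and stays "" unless the tag has variants or extensions.
type tag struct {
	lang, script, region uint16
	str                  string
}

// ptrTag is a hypothetical stand-in for the old layout, with str held
// through a pointer.
type ptrTag struct {
	lang, script, region uint16
	str                  *string
}

func main() {
	// Two independently built tags without variants or extensions.
	a := tag{lang: 1, region: 2}
	b := tag{lang: 1, region: 2}
	fmt.Println("string field, a == b:", a == b)

	// With a pointer field, == compares addresses, so two tags holding
	// equal strings in distinct variables do not compare equal.
	s1, s2 := "de-1901", "de-1901"
	p := ptrTag{lang: 1, str: &s1}
	q := ptrTag{lang: 1, str: &s2}
	fmt.Println("pointer field, p == q:", p == q)
}
```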
R=r
CC=golang-codereviews
https://golang.org/cl/60000044
code.
The occurrence of a non-existing ISO3 code is rare enough that it
seems better to just return something.
R=r
CC=golang-codereviews
https://golang.org/cl/55230043
BACKGROUND
As this API has evolved over time, a few things don't fit well anymore.
Most notably:
- With the introduction of the Matcher, it is no longer necessary to fully
canonicalize tags.
- With parse functions for the various types now available, it doesn't
make too much sense to only have a Compose defined on a map of strings.
Also, some design aspects proved to be a bit too subtle:
- It looks like users sometimes select between Make or Parse depending on
whether they need the error, not whether they need canonicalization.
- The Base, Script and Region type had a Tag method defined on them. This
allowed the user to bypass canonicalization, possibly unexpectedly.
Also, canonicalization should ideally be done considering the combination
of all tags, as this may alter the results.
GOALS OF THE REDESIGN
- Make functionality of top-level functions orthogonal to whether the user
wants canonicalization or not.
- Make the API such that the user typically never has to think about
canonicalization. (Sensible defaults combined with the Matcher.)
- Make it such that unless the user explicitly uses a non-default CanonType
all tags will be canonicalized the same.
- Encourage canonicalization on full tags only to prevent canonicalization
of individual tags before composing them into a full tag.
OVERVIEW OF CHANGES
- Default canonicalization now omits modes where potentially useful
information is lost. Deprecated and legacy values are still canonicalized
by default.
- Changed Compose to be a varargs function taking any of the individual tag
types to compose a tag. It can also be used to update tags.
- Removed the old map[Part]string-based implementation of Compose as well as
the Part type. The new implementation is quite a bit simpler and the
removal of the Part type simplifies the API.
- Compose, Parse and Make all use the Default CanonType for canonicalization.
This means that Parse now canonicalizes, whereas before it didn't!!!
This eliminates one way in which users could inadvertently create tags
that were not normalized the default way.
- Compose, Parse, and Make are now also defined on CanonType.
- Tag.Canonicalize is removed and added to CanonType. This puts all methods
for creating tags with non-default canonicalization in one spot. It
also simplifies the API for Tag.
- The Raw CanonType is added for using any of these functions without
canonicalization. The original Parse can thus be simulated with Raw.Parse.
- (Base,Script,Region).Tag are removed. These methods can now be simulated
by passing the result of ParseX to Compose. For example:
language.Compose(language.ParseRegion("NL"))
Removing them removes one way in which the user could create tags that
were not created using default canonicalization. It also encourages
users to create fully-specified tags before invoking canonicalization.
- Added Extension type and implemented Variant type. Both types have a
ParseX function to complement the rest.
- Added Raw method to Tag to get the verbatim Base, Script and Region tags.
Using the Script or Region methods potentially changes an unspecified
region or script to respectively ZZ or Zzzz, which causes them to be
marked as IsPrivate. It was possible but tedious to distinguish between
the two cases. Raw makes this trivial, among having other uses.
- Added Deprecated(Base|Script|Region) CanonTypes (where the old Deprecated
is the union of these) as this proved to be useful in some use cases.
CAVEATS
In most cases where the API changed it will break existing code. This is
not the case for Parse, which now canonicalizes whereas before it didn't.
This could lead to subtle bugs. The damage is somewhat limited by the
fact that by default now only legacy and deprecated values are canonicalized.
R=r
CC=golang-codereviews
https://golang.org/cl/51280048
Now the data from ordinals.xml and plurals.xml are both available,
whereas formerly ordinals was overwriting the plural's data. The two can
be distinguished using the "Type" field.
R=mpvl
CC=golang-codereviews
https://golang.org/cl/48660044
The added test doesn't test much, but mainly serves the purpose of asserting
no panic occurs for any value.
R=r
CC=golang-codereviews
https://golang.org/cl/48080043
It turned out that the mapping is a bit more complex than simply using
a single table, with the following features:
- blocks of sorted arrays of UN.M49 and region pairs.
- each block is indexed by the 3 msb of the 10-bit UN.M49 code.
- each block contains a sorted list of uint16s where the 7 msb are the 7 lsb
of the UN.M49 code and the 9 lsb are the region code.
- total table size increase is 582 bytes
A more straightforward approach would have led to a table size of at least 1K,
up to 2K. However, the lookup code for these approaches is either not
substantially smaller or the table size is notably larger.
The table also includes a few more new entries from the IANA registry.
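The blocked layout described above can be sketched as follows. This is an
illustration of the bit packing and the two-level lookup only: the function
names, the index array and the region IDs in main are invented for the
example and are not the generated tables.

```go
package main

import (
	"fmt"
	"sort"
)

// pack stores the 7 lsb of a UN M.49 code in the 7 msb of a uint16 and a
// 9-bit region ID in the 9 lsb.
func pack(m49, region uint16) uint16 {
	return (m49&0x7F)<<9 | region&0x1FF
}

// buildTable groups packed entries into 8 blocks keyed by the 3 msb of the
// 10-bit code. index[b] and index[b+1] delimit block b within entries.
func buildTable(pairs map[uint16]uint16) (index [9]uint16, entries []uint16) {
	var blocks [8][]uint16
	for m49, region := range pairs {
		if m49 >= 1<<10 {
			continue // not a 10-bit code
		}
		blocks[m49>>7] = append(blocks[m49>>7], pack(m49, region))
	}
	for b := range blocks {
		blk := blocks[b]
		sort.Slice(blk, func(i, j int) bool { return blk[i] < blk[j] })
		index[b] = uint16(len(entries))
		entries = append(entries, blk...)
	}
	index[8] = uint16(len(entries))
	return index, entries
}

// lookup selects the block from the 3 msb, then binary-searches it for the
// 7 lsb of the code.
func lookup(index [9]uint16, entries []uint16, m49 uint16) (region uint16, ok bool) {
	if m49 >= 1<<10 {
		return 0, false
	}
	b := m49 >> 7
	block := entries[index[b]:index[b+1]]
	key := m49 & 0x7F
	i := sort.Search(len(block), func(i int) bool { return block[i]>>9 >= key })
	if i < len(block) && block[i]>>9 == key {
		return block[i] & 0x1FF, true
	}
	return 0, false
}

func main() {
	// M.49 codes mapped to made-up region IDs.
	index, entries := buildTable(map[uint16]uint16{528: 1, 840: 2, 276: 3, 826: 4})
	fmt.Println(lookup(index, entries, 840))
	fmt.Println(lookup(index, entries, 999))
}
```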
R=r
CC=golang-codereviews
https://golang.org/cl/44560043
Analogous to deprecated languages, deprecated regions are represented with
their own internal codes and get canonicalized when necessary. The deprecated
regions were previously not recognized.
- refactored code for writing sorted maps.
- refactored region testing code in separate components to
simplify tests.
- CLDR does not include the deprecated 3-letter ISO codes for all deprecated
codes. We added them for completeness.
- Script deprecation is hard-coded. The CLDR data only contains one remapping.
maketables.go checks that this is indeed the only one.
This adds about 250 bytes of data.
R=r
CC=golang-dev
https://golang.org/cl/19850043
Also refactored remakeString to allow for faster rebuilding code when possible.
Tag.String now uses this for faster string creation as well.
R=r
CC=golang-dev
https://golang.org/cl/14425066
This is a choice between more conformance to BCP 47 on the one hand and
Unicode and CLDR on the other hand.
The user can use the returned Confidence value to determine whether the
script was unspecified or explicitly specified as Zzzz (in the rare case the
user would care at all).
Updated comments.
R=r
CC=golang-dev
https://golang.org/cl/16020043
Correct prefixes are a SHOULD, not a MUST in BCP 47. Incorrect prefixes
should therefore either be marked with a special error so that it can be
ignored or not generate an error at all. We opt for the latter.
Too bad, as the prefix checking algorithm was kinda cool. Also, sorry for
having to go through reviewing it in the first place.
R=r
CC=golang-dev
https://golang.org/cl/14483054
[this time with the correct files:]
Measures to expose bugs:
- Added more elaborate tests for extensions.
- Return end instead of len(scan.b) at the end of parseExtension
to force exposing bugs.
- parseExtensions used to sometimes update scan.b and sometimes not,
leaving it to parse. Made this more consistent, simplifying parse
and forcing errors to be exposed.
- Removed some checks to catch errors that should have been caught
elsewhere. Again to expose bugs.
- Tightened some of the checks to expose bugs more easily.
Bugs fixed:
- Attributes in -u extension are now sorted, as per LDML spec
(even though nobody uses them, a spec is a spec).
- Fixed various bugs where invalid keys or values were not properly
removed. Merged the special case and common case to eliminate rare
code paths and simplify testing.
- Fixed some bugs where invalid empty extensions were not properly
removed.
- Fixed bug in Compose, which dropped the 'w', '9', and 'z' extensions.
Other:
- removed parsePrivate to simplify code.
R=r
CC=golang-dev
https://golang.org/cl/14669043
The package now recognizes valid variants and rejects invalid ones.
It accepts variants only if they follow the proper language or proper
prefix sequence, as outlined in the BCP 47 spec.
If variants are not in the right order, it will attempt to
create a correct sorting order out of them.
Duplicate variants are removed.
Note that BCP 47 presumes that a script is suppressed if it is marked
as such for a language, whereas we allow such scripts to still exist.
There is an additional check to handle this case.
R=r
CC=golang-dev
https://golang.org/cl/14555043
- ValueError now exported as new type. ValueError retains the problematic value,
allowing the user to inspect and correct it.
- Dynamically allocated errors returned in case of a syntax error are replaced
by an error variable.
- Fixed bug: return error if an "u" extension has a type without a value.
- Added benchmarks for parsing code.
- Renamed MissingLikelyData to ErrMissingLikelyData to be consistent with other
Go packages. This variable is not yet returned, so this change is not likely to cause
a big issue.
- Removed Set type as long as there is no demand for it.
The code is measurably faster after removing the dynamically allocated errors.
A ValueError is 8 bytes and should not require allocation when passed as an error.
Returning a fixed error variable instead of a ValueError did not significantly improve
performance.
I considered returning a syntax error with the position at which the error occurred.
The extra management needed for this slowed down the code a bit, so I opted not to
support this. This could still be implemented if there turns out to be a need for it.
R=r, mpvl
CC=golang-dev
https://golang.org/cl/14162044
on the CLDR algorithm. It incorporates some ideas from other implementations,
but it is designed from scratch.
Note that the IANA registry has been updated so this CL also adds some new
language codes as well as mappings for deprecated codes. maketables.go
has also been modified to work around a bug that was introduced in the latest
IANA update.
Note that the Match method of the Matcher interface returns an index of
the original Tag along with the Tag. Certain users of Matcher, such as
services like collation, need to associate data with each Tag.
Package collate is updated to use the new Matcher interface.
As the Matcher returns the index of the Tag as well, the tables
can be simplified to be an array instead of a map.
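Why returning the index lets the tables be an array can be shown with a
small sketch. The matcher below is a deliberately naive stand-in (exact
match, then base language, then the first supported tag) and uses plain
strings instead of Tags; its name, signature and the collation table are
assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher is a naive, hypothetical stand-in for the Matcher interface.
type matcher struct {
	supported []string
}

// Match returns the best supported tag along with its index in the
// original list. It falls back to the first supported tag.
func (m matcher) Match(desired ...string) (tag string, index int) {
	if len(m.supported) == 0 {
		return "", -1
	}
	for _, d := range desired {
		for i, s := range m.supported {
			if s == d {
				return s, i
			}
		}
	}
	for _, d := range desired {
		base := strings.SplitN(d, "-", 2)[0]
		for i, s := range m.supported {
			if s == base {
				return s, i
			}
		}
	}
	return m.supported[0], 0
}

func main() {
	m := matcher{supported: []string{"en", "de", "nl"}}

	// Per-tag data lives in a slice parallel to the supported list, so no
	// map keyed by tag is needed.
	collationTables := []string{"table for en", "table for de", "table for nl"}

	tag, i := m.Match("nl-BE")
	fmt.Println(tag, collationTables[i])
}
```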
R=r, nigeltao, mpvl
CC=golang-dev, markdavis
https://golang.org/cl/13819047
header.
It supports a few non-standard language tags that appear relatively frequently
in the Accept-Language headers.
R=r
CC=golang-dev, nigeltao
https://golang.org/cl/13974043
- Added Tag methods to Base, Script and Region types to convert them into a proper tag.
- Factored out part of Canonicalize that does not remake the string (used in upcoming matcher code).
- Added "nb" -> "no" conversion in the tables to allow more consistency for code using these tables directly.
- changed the short name used in some methods for type Base so that it consistently appears as "b" in the documentation.
R=r
CC=golang-dev
https://golang.org/cl/13647043
- renamed package locale to language
- renamed type ID to Tag (language.Tag)
- renamed type Language to Base (language.Base)
- deleting locale package
- changed occurrences of "locale identifier" in comments to "language tag".
- renamed method variable names from id or loc to t when the receiver type is Tag.
R=r, nigeltao
CC=golang-dev
https://golang.org/cl/13468043