Commit graph

79 commits

Author SHA1 Message Date
Marcel van Lohuizen 35e765ddbd go.text/language: added Coverage interface as a generic mechanism to report
supported tags and coverage levels.

This interface will be used by packages in go.text to indicate their
coverage.

This CL also includes a fix in maketables.go to prepare for CLDR 25.

LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/77760044
2014-03-20 11:35:29 +01:00
Marcel van Lohuizen da460d3524 go.text/language: fixed bug in Script, where sometimes the wrong likely
script is returned.

The algorithm previously assumed that if the IANA registry has a
SuppressScript entry for a language, this is typically the only script in
use for this language. There are a few languages for which this is not the
case, though. For example, Kazakh is typically written in Arabic (and
possibly Latin) script by native speakers in China, instead of the Cyrillic
typically used in Kazakhstan. A similar situation exists for Panjabi and
Malay.

LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/74680044
2014-03-18 12:55:58 +01:00
Marcel van Lohuizen 229aa9981c go.text/language: added Parent() method that returns the CLDR parent of a tag.
Also added tie-breaker rule to Matcher to prefer matches for which there is
a parent relation over those where this is not the case. Note that this is a
low-priority rule as the region distance is a better measure (close regions
are likely to share the same parent; this is not true for English, but this
is already special-cased in a different manner). Also script and region
equality are stronger measures than the parent relationship.

The Parent relationship may change between different versions of CLDR
(albeit usually only slightly). Added a Version constant to identify CLDR
version on which the package is based.

LGTM=r
R=r
CC=golang-codereviews, markdavis
https://golang.org/cl/74210044
2014-03-12 07:51:42 +01:00
Marcel van Lohuizen 900f6ddba6 go.text/language: tweaked language matcher and fixed some anomalies.
- No longer takes transitive closure of languageMatch entries.
  E.g. a Dane will understand Norwegian Bokmål. A writer of Bokmål will
  understand Norwegian Nynorsk, but a Dane will typically not understand Nynorsk.
- Changed langMatch entries with a percentage of 90 to be High instead of Low.
  (An arbitrary choice, but seems better.)
- Fixed matching of sr-Latn vs sr-Cyrl.
- Language-specific script mappings are now supported.

LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/68190043
2014-02-27 22:29:49 +01:00
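The non-transitive matching described in commit 900f6ddba6 above can be illustrated with a minimal sketch. It assumes a hypothetical table of direct "understands" relations holding only the two entries from the commit message; the names `understands` and `canFallBack` are invented for illustration and are not part of go.text.

```go
package main

import "fmt"

// understands lists direct, one-directional intelligibility relations.
// Hypothetical data: only the two relations named in the commit message.
var understands = map[string][]string{
	"da": {"nb"}, // a Dane will understand Norwegian Bokmål
	"nb": {"nn"}, // a writer of Bokmål will understand Nynorsk
}

// canFallBack reports whether a speaker of from can be served content in to.
// It consults direct entries only and deliberately does not follow chains,
// so da -> nb and nb -> nn do not imply da -> nn.
func canFallBack(from, to string) bool {
	if from == to {
		return true
	}
	for _, t := range understands[from] {
		if t == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canFallBack("da", "nb"))
	fmt.Println(canFallBack("nb", "nn"))
	fmt.Println(canFallBack("da", "nn"))
}
```

Taking the transitive closure would amount to replacing the loop with a graph search, which is exactly what would make Danish match Nynorsk.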
Marcel van Lohuizen 610d33a009 go.text/language: added MustParse equivalent for CanonType.
One could argue the top-level MustParse should always call Raw.Parse.
However, one can find arguments either way. We opt to be consistent and
leave it up to the user to decide if a Tag "constant" should be
canonicalized or not.

LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/65380043
2014-02-18 20:36:13 +01:00
Marcel van Lohuizen cd75f379c5 go.text/language: added predefined common tag values.
The benefits of using predefined tags are:
- convenience for the user
- an additional layer of indirection for tag representations that may
  change in the future (e.g. pt to mean Brazilian Portuguese or just
  Portuguese).

Predefined values are only provided for a selection of tags. The selection
is based on CLDR data and augmented with languages for populous areas where
speakers do not typically also fluently speak a language that was already
included in the set (thereby increasing the likelihood of needing it) and
languages for which tags may be ambiguous.

We do not intend to provide similar constants for Regions, Scripts and Base
values. Specifically the set of Regions is much more "dynamic" than the set
of languages. The (Tag|Base|Script|Region|Currency)OrDie methods are provided
as a convenience for the user to specify values at startup.

Details:
- Went with names instead of codes to allow for a level of indirection on
  top of tags.
- Names for the tags are taken from CLDR for the en locale.
- Internal tag constants are now prefixed with '_' instead of lang_, reg,
  scr and cur. All tags are unique across these four, so this does not pose
  a problem. It makes the predefined values for tags look a lot better in
  the godocs.

LGTM=r
R=r
CC=golang-codereviews
https://golang.org/cl/59740047
2014-02-18 12:38:45 +01:00
Marcel van Lohuizen 1f25524227 go.text/language: allow == operator to be used to test equality of Tags.
This requires the str field to be changed from a pointer to a string. This
adds 8 bytes, a painful 50%, to each Tag. The benefits are quite noteworthy,
though. Note that this change is not reversible.
Benefits:
- makes testing for exact equality easier.
- less overhead testing for exact equality, rather than the approximate
  equality of Matcher.
- Prevents bugs with inconsistent behavior when matching identical tags, as is
  currently the case (comparing tags is currently not supported, but it
  works under certain circumstances).

Drawbacks:
- Larger tags.
- Might lead people to use == instead of a Matcher when the latter is more
  appropriate.

Details:
- If a Tag has no variants or extensions, str will now always be "". This
  allows for faster and simpler processing in most cases. It potentially
  avoids the need for an alloc on creation time, but it will now always
  result in an alloc for String() when a tag has a region or script, but
  no variants or extensions. This is an acceptable tradeoff.
- SetTypeForKey now allows erasing a key-value pair.

R=r
CC=golang-codereviews
https://golang.org/cl/60000044
2014-02-14 10:44:28 +01:00
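The effect of changing the str field from a pointer to a string, as described in commit 1f25524227 above, can be sketched with two stand-in structs. The field layout and the names `ptrTag`, `strTag`, `newPtrTag` and `newStrTag` are assumptions made for illustration; they are not the actual Tag definition.

```go
package main

import "fmt"

// ptrTag mimics the old layout: the extra data is held through a pointer,
// so == compares pointer identity rather than string contents.
type ptrTag struct {
	lang, script, region uint16
	str                  *string
}

// strTag mimics the new layout: the extra data is a string value,
// so == compares the contents field by field.
type strTag struct {
	lang, script, region uint16
	str                  string
}

// newPtrTag stores the address of its own parameter, so every call
// yields a distinct pointer even for identical input.
func newPtrTag(s string) ptrTag { return ptrTag{lang: 1, str: &s} }

func newStrTag(s string) strTag { return strTag{lang: 1, str: s} }

func main() {
	// Two separately constructed but identical tags.
	fmt.Println(newPtrTag("en-x-a") == newPtrTag("en-x-a")) // false: different pointers
	fmt.Println(newStrTag("en-x-a") == newStrTag("en-x-a")) // true: same contents

	// Copies of one pointer-based tag still compare equal, which is the
	// "works under certain circumstances" trap.
	p := newPtrTag("en-x-a")
	q := p
	fmt.Println(p == q) // true
}
```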
Marcel van Lohuizen 73f318262f go.text/language: make Region.ISO3 return "ZZZ" in case of non-existing ISO3
code.
The occurrence of a non-existing ISO3 code is rare, to the point that it
seems better to just return something.

R=r
CC=golang-codereviews
https://golang.org/cl/55230043
2014-02-04 14:03:48 +01:00
Marcel van Lohuizen 66b4d1c3e7 go.text/language: somewhat subtle and less subtle API overhaul.
BACKGROUND
As this API has evolved over time, a few things don't fit well anymore.
Most notably:
- With the introduction of the Matcher, it is no longer necessary to fully
  canonicalize tags.
- With parse functions for the various types now available, it doesn't
  make too much sense to have (or only have) Compose defined on a map of strings.

Also, some design aspects proved to be a bit too subtle:
- It looks like users sometimes select between Make or Parse depending on
  whether they need the error, not whether they need canonicalization.
- The Base, Script and Region type had a Tag method defined on them. This
  allowed the user to bypass canonicalization, possibly unexpectedly.
  Also, canonicalization should ideally be done considering the combination
  of all tags, as this may alter the results.

GOALS OF THE REDESIGN
- Make functionality of top-level functions orthogonal to whether the user
  wants canonicalization or not.
- Make the API such that the user typically never has to think about
  canonicalization. (Sensible defaults combined with the Matcher.)
- Make it such that unless the user explicitly uses a non-default CanonType
  all tags will be canonicalized the same.
- Encourage canonicalization on full tags only to prevent canonicalization
  of individual tags before composing them into a full tag.

OVERVIEW OF CHANGES
- Default canonicalization now omits modes where potentially useful
  information is lost. Deprecated and legacy values are still canonicalized
  by default.
- Changed Compose to be a varargs function taking any of the individual tag
  types to compose a tag. It can also be used to update tags.
- Removed the old map[Part]string-based implementation of Compose as well as
  the Part type. The new implementation is quite a bit simpler and the
  removal of the Part type simplifies the API.
- Compose, Parse and Make all use the Default CanonType for canonicalization.
  This means that Parse now canonicalizes, whereas before it didn't!!!
  This eliminates one way in which users could inadvertently create tags
  that were not normalized the default way.
- Compose, Parse, and Make are now also defined on CanonType.
- Tag.Canonicalize is removed and added to CanonType. This puts all methods
  for creating tags with non-default canonicalization in one spot. It
  also simplifies the API for Tag.
- The Raw CanonType is added for using any of these functions without
  canonicalization. The original Parse can thus be simulated with Raw.Parse.
- (Base,Script,Region).Tag are removed. These methods can now be simulated
  by passing the result of ParseX to Compose. For example:
    language.Compose(language.ParseRegion("NL"))
  Removing them removes one way in which the user could create tags that
  were not created using default canonicalization. It also encourages
  users to create fully-specified tags before invoking canonicalization.
- Added Extension type and implemented Variant type. Both types have a
  ParseX function to complement the rest.
- Added Raw method to Tag to get the verbatim Base, Script and Region tags.
  Using the Script or Region methods potentially changes an unspecified
  region or script to respectively ZZ or Zzzz, which causes them to be
  marked as IsPrivate. It was possible but tedious to distinguish between
  the two cases. Raw makes this trivial, among having other uses.
- Added Deprecated(Base|Script|Region) CanonTypes (where the old Deprecated
  is the union of these) as this proved to be useful in some use cases.

CAVEATS
In most cases where the API changed it will break existing code. This is
not the case for Parse, which now canonicalizes whereas before it didn't.
This could lead to subtle bugs. The damage is somewhat limited by the
fact that by default now only legacy and deprecated values are canonicalized.

R=r
CC=golang-codereviews
https://golang.org/cl/51280048
2014-01-16 23:51:53 +01:00
Marcel van Lohuizen c4b790c79f go.text/language: factored out parsing of single extensions, which is
needed for a different CL, so that the diffs of the next CL look palatable.

R=r
CC=golang-codereviews
https://golang.org/cl/49830043
2014-01-10 08:30:18 +01:00
Dave Day e5a444a8bc go.text/cldr: support both plurals and ordinals in the Specification struct
Now the data from ordinals.xml and plurals.xml are both available,
whereas formerly ordinals was overwriting the plural's data. The two can
be distinguished using the "Type" field.

R=mpvl
CC=golang-codereviews
https://golang.org/cl/48660044
2014-01-09 09:49:07 +11:00
Marcel van Lohuizen 22a6057374 go.text/language: the 'e' is kind of phonetically redundant, but I think
it is better to include it anyway.

R=r
CC=golang-codereviews
https://golang.org/cl/48230043
2014-01-07 11:57:59 +01:00
Marcel van Lohuizen b8c3e1927b go.text/language: fixed bug that caused EncodeM49 to panic for some values.
The added test doesn't test much, but mainly serves the purpose of asserting
no panic occurs for any value.

R=r
CC=golang-codereviews
https://golang.org/cl/48080043
2014-01-06 17:59:22 +01:00
Marcel van Lohuizen a36a459697 go.text/language: added a separate table for mapping from UN.M49 to region code.
It turned out that the mapping is a bit more complex than simply using
a single table, with the following features:
- blocks of sorted array of UN.M49 and region pairs.
- each block is indexed by the 3 msb of the 10-bit UN.M49 code.
- each block contains a sorted list of uint16s where the 7 msb are the 7 lsb
  of the UN.M49 code and the 9 lsb are the region code.
- total table size increase is 582 bytes
A more straightforward approach would have led to a table size of at least 1K,
up to 2K. However, the lookup code for these approaches is either not
substantially smaller or the table size is notably larger.

The table also includes a few more new entries from the IANA registry.

R=r
CC=golang-codereviews
https://golang.org/cl/44560043
2013-12-23 09:34:53 +01:00
Volker Dobler 119f8793fe go.text: fix trivial typos.
R=mpvl
CC=golang-dev
https://golang.org/cl/33810043
2013-11-29 10:27:20 +11:00
Marcel van Lohuizen a7e91de037 go.text/language: canonicalize deprecated regions and scripts.
Analogous to deprecated languages, deprecated regions are represented with
their own internal codes and get canonicalized when necessary. The deprecated
regions were previously not recognized.
- refactored code for writing sorted maps.
- refactored region testing code in separate components to
  simplify tests.
- CLDR does not include the deprecated 3-letter ISO codes for all deprecated
  codes. We added them for completeness.
- Script deprecation is hard-coded. The CLDR data only contains one remapping.
  maketables.go checks that this is indeed the only one.
This adds about 250 bytes of data.

R=r
CC=golang-dev
https://golang.org/cl/19850043
2013-11-08 12:52:05 +01:00
Marcel van Lohuizen 3494cc8d0c go.text/language: implemented TypeForKey and SetTypeForKey.
Also refactored remakeString to allow for faster rebuilding code when possible.
Tag.String now uses this for faster string creation as well.

R=r
CC=golang-dev
https://golang.org/cl/14425066
2013-10-24 15:07:24 +02:00
Marcel van Lohuizen 240601eac0 go.text/language: change Zyyy to Zzzz as representation of undefined script.
This is a choice between more conformance to BCP 47 on the one hand and
Unicode and CLDR on the other hand.
The user can use the returned Confidence value to determine whether the
script was unspecified or explicitly specified as Zzzz (in the rare case the
user would care at all).
Updated comments.

R=r
CC=golang-dev
https://golang.org/cl/16020043
2013-10-24 15:04:51 +02:00
Marcel van Lohuizen 80a998998e go.text/language: fixed a few go vet errors.
R=r
CC=golang-dev
https://golang.org/cl/16010043
2013-10-23 16:19:10 +02:00
Marcel van Lohuizen f4a79d0559 go.text/language: corrected url in comments.
R=r
CC=golang-dev
https://golang.org/cl/15520048
2013-10-23 16:18:38 +02:00
Marcel van Lohuizen 3255f38977 go.text/language: removed prefix validation of variants.
Correct prefixes are a SHOULD, not a MUST in BCP 47. Incorrect prefixes
should therefore either be marked with a special error so that it can be
ignored or not generate an error at all. We opt for the latter.
Too bad, as the prefix checking algorithm was kinda cool. Also, sorry for
having to go through reviewing it in the first place.

R=r
CC=golang-dev
https://golang.org/cl/14483054
2013-10-23 10:20:05 +02:00
Marcel van Lohuizen 51fb595f78 go.text/language: bunch of bug fixes in extension handling:
[this time with the correct files:]
Measures to expose bugs:
  - Added more elaborate tests for extensions.
  - Return end instead of len(scan.b) at the end of parseExtension
    to force exposing bugs.
  - parseExtensions used to sometimes update scan.b and sometimes not,
    leaving it to parse. Made this more consistent, simplifying parse
    and forcing errors to be exposed.
  - Removed some checks to catch errors that should have been caught
    elsewhere. Again to expose bugs.
  - Tightened some of the checks to expose bugs more easily.

Bugs fixed:
  - Attributes in -u extension are now sorted, as per LDML spec
    (even though nobody uses them, a spec is a spec).
  - Fixed various bugs where invalid keys or values were not properly
    removed. Merged the special case and common case to eliminate rare
    code paths and simplify testing.
  - Fixed some bugs where invalid empty extensions were not properly
    removed.
  - Fixed bug in Compose, which dropped the 'w', '9', and 'z' extensions.

Other:
  - removed parsePrivate to simplify code.

R=r
CC=golang-dev
https://golang.org/cl/14669043
2013-10-16 11:12:07 +02:00
Marcel van Lohuizen 4fe0ccd82b go.text/language: added proper handling of variants.
The package now recognizes valid variants and rejects invalid ones.
It accepts variants only if they follow the proper language or proper
prefix sequence, as laid out in the BCP 47 spec.
If variants are not in the right order, it will attempt to
create a correct sorting order out of them.
Duplicate variants are removed.
Note that BCP 47 presumes that a script is suppressed if it is marked
as such for a language, whereas we allow such scripts to still exist.
There is an additional check to handle this case.

R=r
CC=golang-dev
https://golang.org/cl/14555043
2013-10-14 16:06:28 +02:00
Marcel van Lohuizen 71ab14c455 go.text/language: revamped error handling:
- ValueError now exported as new type. ValueError retains the problematic value,
  allowing the user to inspect and correct it.
- Dynamically allocated errors returned in case of a syntax error are replaced
  by an error variable.
- Fixed bug: return error if an "u" extension has a type without a value.
- Added benchmarks for parsing code.
- Renamed MissingLikelyData to ErrMissingLikelyData to be consistent with other
  Go packages. This variable is not yet returned, so this change is not likely to cause
  a big issue.
- Removed Set type as long as there is no demand for it.

The code is measurably faster after removing the dynamically allocated errors.
A ValueError is 8 bytes and should not require allocation when passed as an error.
Returning a fixed error variable instead of a ValueError did not significantly improve
performance.

I considered returning a syntax error with the position at which the error occurred.
The extra management needed for this slowed down the code a bit, so I opted not to
support this. This could still be implemented if there turns out to be a need for it.

R=r, mpvl
CC=golang-dev
https://golang.org/cl/14162044
2013-10-08 19:51:01 +02:00
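The error scheme described in commit 71ab14c455 above (a small value type that keeps the offending subtag, plus a fixed error variable for syntax errors) can be sketched like this. The lowercase names, the message texts and the toy `parseRegion` validator are assumptions for illustration; only the idea of an 8-byte value error comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// valueError keeps the problematic subtag inline in a fixed 8-byte array.
// BCP 47 subtags are at most 8 characters long, so they always fit.
type valueError struct {
	v [8]byte
}

func newValueError(subtag string) valueError {
	var e valueError
	copy(e.v[:], subtag)
	return e
}

// Subtag returns the retained value so the caller can inspect or correct it.
func (e valueError) Subtag() string {
	n := 0
	for n < len(e.v) && e.v[n] != 0 {
		n++
	}
	return string(e.v[:n])
}

func (e valueError) Error() string {
	return fmt.Sprintf("language: subtag %q is well-formed but unknown", e.Subtag())
}

// errSyntax is a single preallocated error returned for every syntax error,
// instead of allocating a new error value each time.
var errSyntax = errors.New("language: tag is not well-formed")

// parseRegion is a toy validator that knows only two regions.
func parseRegion(s string) error {
	if len(s) != 2 {
		return errSyntax
	}
	if s != "NL" && s != "US" {
		return newValueError(s)
	}
	return nil
}

func main() {
	for _, s := range []string{"NL", "XQ", "toolong"} {
		err := parseRegion(s)
		if ve, ok := err.(valueError); ok {
			fmt.Println("unknown value:", ve.Subtag())
			continue
		}
		fmt.Println(s, "->", err)
	}
}
```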
Marcel van Lohuizen d95a5f25a9 go.text/language: added tag matching algorithm. This algorithm is not based
on the CLDR algorithm. It incorporates some ideas from other implementations,
but it is designed from scratch.
Note that the IANA registry has been updated so this CL also adds some new
language codes as well as add mappings for deprecated codes. maketables.go
has also been modified to work around a bug that was introduced in the latest
IANA update.

Note that the Match method of the Matcher interface returns an index of
the original Tag along with the Tag. Certain users of Matcher, such as
services like collation, need to associate data with each Tag.

Package collate is updated to use the new Matcher interface.
As the Matcher returns the index of the Tag as well, the tables
can be simplified to be an array instead of a map.

R=r, nigeltao, mpvl
CC=golang-dev, markdavis
https://golang.org/cl/13819047
2013-10-07 13:14:45 +02:00
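The point of returning an index from Match, as described in commit d95a5f25a9 above, can be sketched with a toy matcher. The matching rule here (exact match first, then same base language, then the first supported tag) is a deliberately crude assumption and is not the algorithm added by the commit; the `matcher` type and the `collators` data are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher holds the supported tags in a fixed order.
// It assumes at least one supported tag.
type matcher struct {
	supported []string
}

// base returns the part of a tag before the first hyphen.
func base(tag string) string {
	if i := strings.Index(tag, "-"); i >= 0 {
		return tag[:i]
	}
	return tag
}

// Match returns the best supported tag together with its index in the
// supported list, so callers can keep per-tag data in a parallel slice.
func (m matcher) Match(preferred ...string) (string, int) {
	for _, p := range preferred {
		for i, s := range m.supported {
			if strings.EqualFold(p, s) {
				return s, i
			}
		}
	}
	for _, p := range preferred {
		for i, s := range m.supported {
			if strings.EqualFold(base(p), base(s)) {
				return s, i
			}
		}
	}
	return m.supported[0], 0 // fall back to the first (default) tag
}

func main() {
	m := matcher{supported: []string{"en", "de", "nl"}}
	// Data associated with each tag lives in a slice, not a map.
	collators := []string{"English rules", "German rules", "Dutch rules"}

	tag, i := m.Match("nl-BE", "fr")
	fmt.Println(tag, collators[i])
}
```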
Marcel van Lohuizen 9f86e0be98 go.text/language: make it build with Go 1.1, which does not include
sort.Stable.

Fixes golang/go#6523.

R=r
CC=golang-dev
https://golang.org/cl/14265043
2013-10-04 10:23:17 +02:00
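Commit 9f86e0be98 above does not say how the dependency on sort.Stable was removed. One possible approach, shown here purely as an assumption, is a small insertion sort over sort.Interface, which is stable because it only swaps adjacent elements when one is strictly less than the other. The names `stableSort` and `byLen` are invented for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// stableSort is an insertion sort over sort.Interface. Elements that
// compare equal are never swapped, so their relative order is preserved.
// It runs in quadratic time, which is acceptable only for small inputs.
func stableSort(data sort.Interface) {
	for i := 1; i < data.Len(); i++ {
		for j := i; j > 0 && data.Less(j, j-1); j-- {
			data.Swap(j, j-1)
		}
	}
}

// byLen orders strings by length only, so ties expose (in)stability.
type byLen []string

func (s byLen) Len() int           { return len(s) }
func (s byLen) Less(i, j int) bool { return len(s[i]) < len(s[j]) }
func (s byLen) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

func main() {
	words := byLen{"bb", "a", "cc", "d"}
	stableSort(words)
	fmt.Println(words) // [a d bb cc]
}
```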
Marcel van Lohuizen 6e2c2aaf7b go.text/language: added function to parse the value of an HTTP Accept-Language
header.
It supports a few non-standard language tags that appear relatively frequently
in the Accept-Language headers.

R=r
CC=golang-dev, nigeltao
https://golang.org/cl/13974043
2013-09-27 12:43:24 +02:00
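The kind of parsing that commit 6e2c2aaf7b above adds can be sketched with the standard library alone. This is a simplified illustration, not the function from the commit: the names `weighted` and `parseAcceptLanguage` are hypothetical, tags are kept as plain strings without validation, and the non-standard tags the commit mentions are not handled.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// weighted pairs a language tag with its quality value.
type weighted struct {
	tag string
	q   float64
}

// parseAcceptLanguage splits an Accept-Language value into tags and
// weights, ordered by descending weight. Entries with a malformed or
// out-of-range q value are skipped; a missing q defaults to 1.
func parseAcceptLanguage(header string) []weighted {
	var out []weighted
	for _, part := range strings.Split(header, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		tag, q := part, 1.0
		if i := strings.Index(part, ";"); i >= 0 {
			tag = strings.TrimSpace(part[:i])
			param := strings.TrimSpace(part[i+1:])
			if strings.HasPrefix(param, "q=") {
				v, err := strconv.ParseFloat(param[2:], 64)
				if err != nil || v < 0 || v > 1 {
					continue
				}
				q = v
			}
		}
		if tag == "" {
			continue
		}
		out = append(out, weighted{tag, q})
	}
	// A stable sort keeps the header order for entries with equal weight.
	sort.SliceStable(out, func(i, j int) bool { return out[i].q > out[j].q })
	return out
}

func main() {
	for _, w := range parseAcceptLanguage("en;q=0.7, da, en-gb;q=0.8") {
		fmt.Println(w.tag, w.q)
	}
}
```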
Marcel van Lohuizen 4a56690205 go.text/language: A few small changes:
- Added Tag methods to Base, Script and Region types to convert them into a proper tag.
- Factored out part of Canonicalize that does not remake the string (used in upcoming matcher code).
- Added "nb" -> "no" conversion in the tables to allow more consistency for code using these tables directly.
- changed the short name used in some methods for type Base so that it consistently appears as "b" in the documentation.

R=r
CC=golang-dev
https://golang.org/cl/13647043
2013-09-23 11:03:22 +02:00
Marcel van Lohuizen b38db9f15a go.text/language: renaming of locale package:
- renamed package locale to language
- renamed type ID to Tag (language.Tag)
- renamed type Language to Base (language.Base)
- deleting locale package
- changed occurrences of "locale identifier" in comments to "language tag".
- renamed method variable names from id or loc to t when the receiver type is Tag.

R=r, nigeltao
CC=golang-dev
https://golang.org/cl/13468043
2013-09-05 11:16:24 +02:00