go/text - text

Commit graph

Author	SHA1	Message	Date
Marcel van Lohuizen	35e765ddbd	go.text/language: added Coverage interface as a generic mechanism to report supported tags and coverage levels. This interface will be used by packages in go.text to indicate their coverage. This CL also includes a fix in maketables.go to prepare for CLDR 25. LGTM=r R=r CC=golang-codereviews https://golang.org/cl/77760044	2014-03-20 11:35:29 +01:00
Marcel van Lohuizen	da460d3524	go.text/language: fixed bug in Script, where sometimes the wrong likely script is returned. The algorithm previously assumed that if the IANA registry has a SuppressScript entry for a language, this is typically the only script in use for this language. There are a few languages for which this is not the case, though. For example, Kazakh is typically written in Arabic (and possibly Latin) script by native speakers in China, instead of the Cyrillic typically used in Kazakhstan. A similar situation exists for Panjabi and Malay. LGTM=r R=r CC=golang-codereviews https://golang.org/cl/74680044	2014-03-18 12:55:58 +01:00
Marcel van Lohuizen	229aa9981c	go.text/language: added Parent() method that returns the CLDR parent of a tag. Also added tie-breaker rule to Matcher to prefer matches for which there is a parent relation over those where this is not the case. Note that this is a low-priority rule as the region distance is a better measure (close regions are likely to share the same parent; this is not true for English, but this is already special-cased in a different manner). Also script and region equality are stronger measures than the parent relationship. The Parent relationship may change between different versions of CLDR (albeit usually only slightly). Added a Version constant to identify CLDR version on which the package is based. LGTM=r R=r CC=golang-codereviews, markdavis https://golang.org/cl/74210044	2014-03-12 07:51:42 +01:00
Marcel van Lohuizen	900f6ddba6	go.text/language: tweaked language matcher and fixed some anomalies. - No longer takes transitive closure of languageMatch entries. E.g. A Dane will understand Norwegian Bokmål. A writer of Bokmål will understand Norwegian Nynorsk, but a Dane will typically not understand Nynorsk. - Changed langMatch entries with a percentage of 90 to be High instead of Low. (An arbitrary choice, but seems better.) - Fixed matching of sr-Latn vs sr-Cyrl. - Language-specific script mappings are now supported. LGTM=r R=r CC=golang-codereviews https://golang.org/cl/68190043	2014-02-27 22:29:49 +01:00
Marcel van Lohuizen	610d33a009	go.text/language: added MustParse equivalent for CanonType. One could argue the top-level MustParse should always call Raw.Parse. However, one can find arguments either way. We opt to be consistent and leave it up to the user to decide if a Tag "constant" should be canonicalized or not. LGTM=r R=r CC=golang-codereviews https://golang.org/cl/65380043	2014-02-18 20:36:13 +01:00
Marcel van Lohuizen	cd75f379c5	go.text/language: added predefined common tag values. The benefits of using predefined tags are: - convenience for the user - an additional layer of indirection for tag representations that may change in the future (e.g. pt to mean Brazilian Portuguese or just Portuguese). Predefined values are only provided for a selection of tags. The selection is based on CLDR data and augmented with languages for populous areas where speakers do not typically also fluently speak a language that was already included in the set (thereby increasing the likelihood of needing it) and languages for which tags may be ambiguous. We do not intend to provide similar constants for Regions, Scripts and Base values. Specifically the set of Regions is much more "dynamic" than the set of languages. The (Tag\|Base\|Script\|Region\|Currency)OrDie methods are provided as a convenience for the user to specify values at startup. Details: - Went with names instead of codes to allow for a level of indirection on top of tags. - Names for the tags are taken from CLDR for the en locale. - Internal tag constants are now prefixed with '_' instead of lang_, reg, scr and cur. All tags are unique between these, so this does not pose a problem. It makes the predefined values for tags look a lot better in the godocs. LGTM=r R=r CC=golang-codereviews https://golang.org/cl/59740047	2014-02-18 12:38:45 +01:00
Marcel van Lohuizen	1f25524227	go.text/language: allow == operator to be used to test equality of Tags. This requires the str field to be changed from a pointer to a string. This adds 8 bytes, a painful 50%, to each Tag. The benefits are quite noteworthy, though. Note that this change is not reversible. Benefits: - makes testing exact equality easier. - less overhead testing for exact equality, rather than the approximate equality of Matcher. - Prevents bugs with inconsistent behavior matching identical tags, as is currently the case (comparing tags is currently not supported, but it works under certain circumstances). Drawbacks: - Larger tags. - Might lead people to use == instead of a Matcher when the latter is more appropriate. Details: - If a Tag has no variants or extensions, str will now always be "". This allows for faster and simpler processing in most cases. It potentially avoids the need for an alloc on creation time, but it will now always result in an alloc for String() when a tag has a region or script, but no variants or extensions. This is an acceptable tradeoff. - SetTypeForKey now allows erasing a key-value pair. R=r CC=golang-codereviews https://golang.org/cl/60000044	2014-02-14 10:44:28 +01:00
Marcel van Lohuizen	73f318262f	go.text/language: make Region.ISO3 return "ZZZ" in case of non-existing ISO3 code. The occurrence of a non-existing ISO3 code is rather rare to the point it seems better to just return something. R=r CC=golang-codereviews https://golang.org/cl/55230043	2014-02-04 14:03:48 +01:00
Marcel van Lohuizen	66b4d1c3e7	go.text/language: somewhat subtle and less subtle API overhaul. BACKGROUND As this API has evolved over time, a few things don't fit well anymore. Most notably: - With the introduction of the Matcher, it is no longer necessary to fully canonicalize tags. - With parse functions for the various types now available, it doesn't make too much sense to have (only) a Compose defined on a map of strings. Also, some design aspects proved to be a bit too subtle: - It looks like users sometimes select between Make or Parse depending on whether they need the error, not whether they need canonicalization. - The Base, Script and Region type had a Tag method defined on them. This allowed the user to possibly unexpectedly bypass canonicalization. Also, canonicalization should ideally be done considering the combination of all tags, as this may alter the results. GOALS OF THE REDESIGN - Make functionality of top-level functions orthogonal to whether the user wants canonicalization or not. - Make the API such that the user typically never has to think about canonicalization. (Sensible defaults combined with the Matcher.) - Make it such that unless the user explicitly uses a non-default CanonType all tags will be canonicalized the same. - Encourage canonicalization on full tags only to prevent canonicalization of individual tags before composing them into a full tag. OVERVIEW OF CHANGES - Default canonicalization now omits modes where potentially useful information is lost. Deprecated and legacy values are still canonicalized by default. - Changed Compose to be a varargs function taking any of the individual tag types to compose a tag. It can also be used to update tags. - Removed the old map[Part]string-based implementation of Compose as well as the Part type. The new implementation is quite a bit simpler and the removal of the Part type simplifies the API. - Compose, Parse and Make all use the Default CanonType for canonicalization. 
This means that Parse now canonicalizes, whereas before it didn't!!! This eliminates one way in which users could inadvertently create tags that were not normalized the default way. - Compose, Parse, and Make are now also defined on CanonType. - Tag.Canonicalize is removed and added to CanonType. This puts all methods for creating tags with non-default canonicalization in one spot. It also simplifies the API for Tag. - The Raw CanonType is added for using any of these functions without canonicalization. The original Parse can thus be simulated with Raw.Parse. - (Base,Script,Region).Tag are removed. These methods can now be simulated by passing the result of ParseX to Compose. For example: language.Compose(language.ParseRegion("NL")) Removing them removes one way in which the user could create tags that were not created using default canonicalization. It also encourages users to create fully-specified tags before invoking canonicalization. - Added Extension type and implemented Variant type. Both types have a ParseX function to complement the rest. - Added Raw method to Tag to get the verbatim Base, Script and Region tags. Using the Script or Region methods potentially changes an unspecified region or script to respectively ZZ or Zzzz, which causes them to be marked as IsPrivate. It was possible but tedious to distinguish between the two cases. Raw makes this trivial, among having other uses. - Added Deprecated(Base\|Script\|Region) CanonTypes (where the old Deprecated is the union of these) as this proved to be useful in some use cases. CAVEATS In most cases where the API changed it will break existing code. This is not the case for Parse, which now canonicalizes whereas before it didn't. This could lead to subtle bugs. The damage is somewhat limited by the fact that by default now only legacy and deprecated values are canonicalized. R=r CC=golang-codereviews https://golang.org/cl/51280048	2014-01-16 23:51:53 +01:00
Marcel van Lohuizen	c4b790c79f	go.text/language: factored out parsing of single extensions, which is needed for a different CL, so that the diffs of the next CL look palatable. R=r CC=golang-codereviews https://golang.org/cl/49830043	2014-01-10 08:30:18 +01:00
Dave Day	e5a444a8bc	go.text/cldr: support both plurals and ordinals in the Specification struct. Now the data from ordinals.xml and plurals.xml are both available, whereas formerly ordinals was overwriting the plural's data. The two can be distinguished using the "Type" field. R=mpvl CC=golang-codereviews https://golang.org/cl/48660044	2014-01-09 09:49:07 +11:00
Marcel van Lohuizen	22a6057374	go.text/language: the 'e' is kind of phonetically redundant, but I think it is better to include it anyway. R=r CC=golang-codereviews https://golang.org/cl/48230043	2014-01-07 11:57:59 +01:00
Marcel van Lohuizen	b8c3e1927b	go.text/language: fixed bug that caused EncodeM49 to panic for some values. The added test doesn't test much, but mainly serves the purpose of asserting no panic occurs for any value. R=r CC=golang-codereviews https://golang.org/cl/48080043	2014-01-06 17:59:22 +01:00
Marcel van Lohuizen	a36a459697	go.text/language: added a separate table for mapping from UN.M49 to region code. It turned out that the mapping is a bit more complex than simply using a single table, with the following features: - blocks of sorted array of UN.M49 and region pairs. - each block is indexed by the 3 msb of the 10-bit UN.M49 code. - each block contains a sorted list of uint16s where the 7 msb are the 7 lsb of the UN.M49 code and the 9 lsb are the region code. - total table size increase is 582 bytes A more straightforward approach would have led to a table size of at least 1K, up to 2k. However, the lookup code for these approaches is either not substantially smaller or the table size is notably larger. The table also includes a few more new entries from the IANA registry. R=r CC=golang-codereviews https://golang.org/cl/44560043	2013-12-23 09:34:53 +01:00
Volker Dobler	119f8793fe	go.text: fix trivial typos. R=mpvl CC=golang-dev https://golang.org/cl/33810043	2013-11-29 10:27:20 +11:00
Marcel van Lohuizen	a7e91de037	go.text/language: canonicalize deprecated regions and scripts. Analogous to deprecated languages, deprecated regions are represented with their own internal codes and get canonicalized when necessary. The deprecated regions were previously not recognized. - refactored code for writing sorted maps. - refactored region testing code in separate components to simplify tests. - CLDR does not include the deprecated 3-letter ISO codes for all deprecated codes. We added them for completeness. - Script deprecation is hard-coded. The CLDR data only contains one remapping. maketables.go checks that this is indeed the only one. This adds about 250 bytes of data. R=r CC=golang-dev https://golang.org/cl/19850043	2013-11-08 12:52:05 +01:00
Marcel van Lohuizen	3494cc8d0c	go.text/language: implemented TypeForKey and SetTypeForKey. Also refactored remakeString to allow for faster rebuilding code when possible. Tag.String now uses this for faster string creation as well. R=r CC=golang-dev https://golang.org/cl/14425066	2013-10-24 15:07:24 +02:00
Marcel van Lohuizen	240601eac0	go.text/language: change Zyyy to Zzzz as representation of undefined script. This is a choice between more conformance to BCP 47 on the one hand and Unicode and CLDR on the other hand. The user can use the returned Confidence value to determine whether the script was unspecified or explicitly specified as Zzzz (in the rare case the user would care at all). Updated comments. R=r CC=golang-dev https://golang.org/cl/16020043	2013-10-24 15:04:51 +02:00
Marcel van Lohuizen	80a998998e	go.text/language: fixed a few go vet errors. R=r CC=golang-dev https://golang.org/cl/16010043	2013-10-23 16:19:10 +02:00
Marcel van Lohuizen	f4a79d0559	go.text/language: corrected url in comments. R=r CC=golang-dev https://golang.org/cl/15520048	2013-10-23 16:18:38 +02:00
Marcel van Lohuizen	3255f38977	go.text/language: removed prefix validation of variants. Correct prefixes are a SHOULD, not a MUST in BCP 47. Incorrect prefixes should therefore either be marked with a special error so that it can be ignored or not generate an error at all. We opt for the latter. Too bad, as the prefix checking algorithm was kinda cool. Also, sorry for having to go through reviewing it in the first place. R=r CC=golang-dev https://golang.org/cl/14483054	2013-10-23 10:20:05 +02:00
Marcel van Lohuizen	51fb595f78	go.text/language: bunch of bug fixes in extension handling: [this time with the correct files:] Measures to expose bugs: - Added more elaborate tests for extensions. - Return end instead of len(scan.b) at the end of parseExtension to force exposing bugs. - parseExtensions used to sometimes update scan.b and sometimes not, leaving it to parse. Made this more consistent, simplifying parse and forcing errors to be exposed. - Removed some checks to catch errors that should have been caught elsewhere. Again to expose bugs. - Tightened some of the checks to expose bugs more easily. Bugs fixed: - Attributes in -u extension are now sorted, as per LDML spec (even though nobody uses them, a spec is a spec). - Fixed various bugs where invalid keys or values were not properly removed. Merged the special case and common case to eliminate rare code paths and simplify testing. - Fixed some bugs where invalid empty extensions were not properly removed. - Fixed bug in Compose, which dropped the 'w', '9', and 'z' extensions. Other: - removed parsePrivate to simplify code. R=r CC=golang-dev https://golang.org/cl/14669043	2013-10-16 11:12:07 +02:00
Marcel van Lohuizen	4fe0ccd82b	go.text/language: added proper handling of variants. The package now recognizes valid variants and rejects invalid ones. It accepts variants only if they follow the proper language or proper prefix sequence, as laid out in the BCP 47 spec. If variants are not in the right order, it will make an attempt to create a correct sorting order out of them. Duplicate variants are removed. Note that BCP 47 presumes that a script is suppressed if it is marked as such for a language, whereas we allow such scripts to still exist. There is an additional check to handle this case. R=r CC=golang-dev https://golang.org/cl/14555043	2013-10-14 16:06:28 +02:00
Marcel van Lohuizen	71ab14c455	go.text/language: revamped error handling: - ValueError now exported as new type. ValueError retains the problematic value, allowing the user to inspect and correct it. - Dynamically allocated errors returned in case of a syntax error are replaced by an error variable. - Fixed bug: return error if a "u" extension has a type without a value. - Added benchmarks for parsing code. - Renamed MissingLikelyData to ErrMissingLikelyData to be consistent with other Go packages. This variable is not yet returned, so this change is not likely to cause a big issue. - Removed Set type as long as there is no demand for it. The code is measurably faster after removing the dynamically allocated errors. A ValueError is 8 bytes and should not require allocation when passed as an error. Returning a fixed error variable instead of a ValueError did not significantly improve performance. I considered returning a syntax error with the position at which the error occurred. The extra management needed for this slowed down the code a bit, so I opted not to support this. This could still be implemented if there turns out to be a need for it. R=r, mpvl CC=golang-dev https://golang.org/cl/14162044	2013-10-08 19:51:01 +02:00
Marcel van Lohuizen	d95a5f25a9	go.text/language: added tag matching algorithm. This algorithm is not based on the CLDR algorithm. It incorporates some ideas from other implementations, but it is designed from scratch. Note that the IANA registry has been updated so this CL also adds some new language codes as well as mappings for deprecated codes. maketables.go has also been modified to work around a bug that was introduced in the latest IANA update. Note that the Match method of the Matcher interface returns an index of the original Tag along with the Tag. Certain users of Matcher, such as services like collation, need to associate data with each Tag. Package collate is updated to use the new Matcher interface. As the Matcher returns the index of the Tag as well, the tables can be simplified to be an array instead of a map. R=r, nigeltao, mpvl CC=golang-dev, markdavis https://golang.org/cl/13819047	2013-10-07 13:14:45 +02:00
Marcel van Lohuizen	9f86e0be98	go.text/language: make it build with Go 1.1, which does not include sort.Stable. Fixes golang/go#6523. R=r CC=golang-dev https://golang.org/cl/14265043	2013-10-04 10:23:17 +02:00
Marcel van Lohuizen	6e2c2aaf7b	go.text/language: added function to parse the value of an HTTP Accept-Language header. It supports a few non-standard language tags that appear relatively frequently in the Accept-Language headers. R=r CC=golang-dev, nigeltao https://golang.org/cl/13974043	2013-09-27 12:43:24 +02:00
Marcel van Lohuizen	4a56690205	go.text/language: A few small changes: - Added Tag methods to Base, Script and Region types to convert them into a proper tag. - Factored out part of Canonicalize that does not remake the string (used in upcoming matcher code). - Added "nb" -> "no" conversion in the tables to allow more consistency for code using these tables directly. - changed the short name used in some methods for type Base so that it consistently appears as "b" in the documentation. R=r CC=golang-dev https://golang.org/cl/13647043	2013-09-23 11:03:22 +02:00
Marcel van Lohuizen	b38db9f15a	go.text/language: renaming of locale package: - renamed package locale to language - renamed type ID to Tag (language.Tag) - renamed type Language to Base (language.Base) - deleting locale package - changed occurrences of "locale identifier" in comments to "language tag". - renamed method variable names from id or loc to t when the receiver type is Tag. R=r, nigeltao CC=golang-dev https://golang.org/cl/13468043	2013-09-05 11:16:24 +02:00


79 commits