Commit graph

58 commits

Author SHA1 Message Date
Marcel van Lohuizen 191b11aac8 go.text/transform: added RemoveFunc transform for removing individual
runes from the input. This corresponds to ICU's Remove transform.
For example, to remove accents from characters one could use RemoveFunc
as follows:
        nonspacingMark := func(r rune) bool {
                return unicode.Is(unicode.Mn, r)
        }
        transform.Chain(norm.NFD, transform.RemoveFunc(nonspacingMark), norm.NFC)
(Once norm.Form implements Transformer; guess what will be my next CL.)

R=r
CC=golang-dev, nigeltao
https://golang.org/cl/23220043
2013-11-26 08:29:24 +01:00
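
The rune-removal step from the RemoveFunc commit above (191b11aac8) can be sketched with the standard library alone. This is a minimal sketch, not the transform package's implementation: `removeFunc` is a hypothetical helper that works on whole strings rather than streaming buffers, and it assumes the input is already decomposed (NFD), since normalization is not part of the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// removeFunc returns s with every rune for which drop returns true removed.
// Hypothetical stand-in for the removal step only; it does not implement
// the streaming Transformer interface.
func removeFunc(s string, drop func(rune) bool) string {
	return strings.Map(func(r rune) rune {
		if drop(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

// nonspacingMark reports whether r is in Unicode category Mn.
func nonspacingMark(r rune) bool {
	return unicode.Is(unicode.Mn, r)
}

func main() {
	// Assumption: the input is already decomposed, so each accent is a
	// separate combining rune (U+0301) following its base letter.
	decomposed := "re\u0301sume\u0301"
	fmt.Println(removeFunc(decomposed, nonspacingMark)) // prints "resume"
}
```

On precomposed input such as "é" (U+00E9) this sketch removes nothing, which is why the commit message wraps the removal between norm.NFD and norm.NFC.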
Nigel Tao 089e4d2d44 go.text/encoding: add the Nop encoding.
R=mpvl, andybalholm
CC=golang-dev
https://golang.org/cl/29950045
2013-11-23 10:09:50 +11:00
Marcel van Lohuizen 9465bcffb0 go.text/unicode/norm: as quickSpan now no longer uses reorderBuffer, we can
optimize a few more calls to not create a reorderBuffer unless necessary.
This will greatly improve performance for small strings if there is no need
to normalize.

Updated Bytes, String, IsNormal* and FirstBoundary*. The latter is
especially important as it will always only inspect a small piece of text.

Also added a few benchmarks.

R=r
CC=golang-dev
https://golang.org/cl/28260043
2013-11-18 20:12:15 +01:00
Marcel van Lohuizen de84dfdf8f go.text/unicode/norm: added implementation of Transformer to norm.Form.
One fundamental design decision was made to have the norm.Form type
implement Transform directly, rather than having the user create a
Transformer instance. The advantage of this approach is that it
  1) results in a much nicer API (e.g. norm.NFC can be used as a
     transformer as is).
  2) the Transformer is stateless, thread-safe and reentrant, reducing
     the possibility of errors.
  3) is consistent with most other transformers.
The clear disadvantage is that it is impossible to reuse a reorderBuffer
between calls. The cost of initialization can be amortized when
normalizing large blocks, but can be prohibitive for small strings
(can probably get its size down to 500 bytes, but still).

However, in theory it is possible to skip the use of the reorderBuffer in
the vast majority of cases. 99.98% of HTML page content (excluding markup)
is in NFC (http://www.macchiato.com/unicode/nfc-faq). In most cases, NFC
can be converted to NFD without reordering (if it is in FCD form, which
can easily be detected). This means that, using the same techniques as for
norm.Iter, most conversions can be done without using a reorderBuffer at
all. If it is really necessary, it is probably possible to get rid of the
reorderBuffer altogether or to have an alternative that covers 99% of the
remaining cases.

As we reckon that the creation of a reorderBuffer can be avoided in most
cases, we opt for the better API. Note that this approach will be more
efficient if the creation of a reorderBuffer can be avoided.

The current implementation doesn't do any of the optimizations described
above. It only uses quickSpan to quickly cover most normalized content.
This will not avoid using a reorderBuffer when converting between different
forms, though.

Also:
- quickSpan has been modified to allow for incremental processing (passing
atEOF) and returning whether any violating runes have been found (returning
a position smaller than the input length no longer signals this if !atEOF).
QuickSpan and QuickSpanString now no longer require a reorderBuffer.

- Added benchmarks to compare different normalization methods. Also added
benchmarks to compare these implementations against running ToLower on
the same strings.

All in all this CL raises some issues that will have to be addressed in
follow-up CLs:
 - Optimization of Transform method.
 - QuickSpan* probably needs to be adjusted to allow for incremental
   calls (like the internal version).
 - Unifying some of the implementations should be considered.

R=r
CC=golang-dev, nigeltao
https://golang.org/cl/23460044
2013-11-14 23:20:39 +01:00
Marcel van Lohuizen a7e91de037 go.text/language: canonicalize deprecated regions and scripts.
Analogous to deprecated languages, deprecated regions are represented with
their own internal codes and get canonicalized when necessary. The deprecated
regions were previously not recognized.
- refactored code for writing sorted maps.
- refactored region testing code in separate components to
  simplify tests.
- CLDR does not include the deprecated 3-letter ISO codes for all deprecated
  codes. We added them for completeness.
- Script deprecation is hard-coded. The CLDR data only contains one remapping.
  maketables.go checks that this is indeed the only one.
This adds about 250 bytes of data.

R=r
CC=golang-dev
https://golang.org/cl/19850043
2013-11-08 12:52:05 +01:00
Marcel van Lohuizen 3494cc8d0c go.text/language: implemented TypeForKey and SetTypeForKey.
Also refactored remakeString to allow for faster rebuilding code when possible.
Tag.String now uses this for faster string creation as well.

R=r
CC=golang-dev
https://golang.org/cl/14425066
2013-10-24 15:07:24 +02:00
Marcel van Lohuizen 240601eac0 go.text/language: change Zyyy to Zzzz as representation of undefined script.
This is a choice between more conformance to BCP 47 on the one hand and
Unicode and CLDR on the other hand.
The user can use the returned Confidence value to determine whether the
script was unspecified or explicitly specified as Zzzz (in the rare case the
user would care at all).
Updated comments.

R=r
CC=golang-dev
https://golang.org/cl/16020043
2013-10-24 15:04:51 +02:00
Marcel van Lohuizen 80a998998e go.text/language: fixed a few go vet errors.
R=r
CC=golang-dev
https://golang.org/cl/16010043
2013-10-23 16:19:10 +02:00
Marcel van Lohuizen f4a79d0559 go.text/language: corrected url in comments.
R=r
CC=golang-dev
https://golang.org/cl/15520048
2013-10-23 16:18:38 +02:00
Marcel van Lohuizen 3255f38977 go.text/language: removed prefix validation of variants.
Correct prefixes are a SHOULD, not a MUST in BCP 47. Incorrect prefixes
should therefore either be marked with a special error so that it can be
ignored or not generate an error at all. We opt for the latter.
Too bad, as the prefix checking algorithm was kinda cool. Also, sorry for
having to go through reviewing it in the first place.

R=r
CC=golang-dev
https://golang.org/cl/14483054
2013-10-23 10:20:05 +02:00
Marcel van Lohuizen 51fb595f78 go.text/language: bunch of bug fixes in extension handling:
[this time with the correct files:]
Measures to expose bugs:
  - Added more elaborate tests for extensions.
  - Return end instead of len(scan.b) at the end of parseExtension
    to force exposing bugs.
  - parseExtensions used to sometimes update scan.b and sometimes not,
    leaving it to parse. Made this more consistent, simplifying parse
    and forcing errors to be exposed.
  - Removed some checks to catch errors that should have been caught
    elsewhere. Again to expose bugs.
  - Tightened some of the checks to expose bugs more easily.

Bugs fixed:
  - Attributes in -u extension are now sorted, as per LDML spec
    (even though nobody uses them, a spec is a spec).
  - Fixed various bugs where invalid keys or values were not properly
    removed. Merged the special case and common case to eliminate rare
    code paths and simplify testing.
  - Fixed some bugs where invalid empty extensions were not properly
    removed.
  - Fixed bug in Compose, which dropped the 'w', '9', and 'z' extensions.

Other:
  - removed parsePrivate to simplify code.

R=r
CC=golang-dev
https://golang.org/cl/14669043
2013-10-16 11:12:07 +02:00
Marcel van Lohuizen 4fe0ccd82b go.text/language: added proper handling of variants.
The package now recognizes valid variants and rejects invalid ones.
It accepts variants only if they follow the proper language or proper
prefix sequence, as outlined in the BCP 47 spec.
If variants are not in the right order, it will attempt to create a
correct sorting order out of them.
Duplicate variants are removed.
Note that BCP 47 presumes that a script is suppressed if it is marked
as such for a language, whereas we allow such scripts to still exist.
There is an additional check to handle this case.

R=r
CC=golang-dev
https://golang.org/cl/14555043
2013-10-14 16:06:28 +02:00
Marcel van Lohuizen 71ab14c455 go.text/language: revamped error handling:
- ValueError now exported as new type. ValueError retains the problematic value,
  allowing the user to inspect and correct it.
- Dynamically allocated errors returned in case of a syntax error are replaced
  by an error variable.
- Fixed bug: return error if a "u" extension has a type without a value.
- Added benchmarks for parsing code.
- Renamed MissingLikelyData to ErrMissingLikelyData to be consistent with other
  Go packages. This variable is not yet returned, so this change is not likely to cause
  a big issue.
- Removed Set type as long as there is no demand for it.

The code is measurably faster after removing the dynamically allocated errors.
A ValueError is 8 bytes and should not require allocation when passed as an error.
Returning a fixed error variable instead of a ValueError did not significantly improve
performance.

I considered returning a syntax error with the position at which the error occurred.
The extra management needed for this slowed down the code a bit, so I opted not to
support this. This could still be implemented if there turns out to be a need for it.

R=r, mpvl
CC=golang-dev
https://golang.org/cl/14162044
2013-10-08 19:51:01 +02:00
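
The error-handling commit above (71ab14c455) describes two kinds of errors: a fixed error variable for syntax errors and an 8-byte ValueError that retains the offending value. The sketch below illustrates that split under stated assumptions: the field layout, `newValueError`, `Subtag`, `errSyntax` and `checkRegion` are all hypothetical names and not the package's actual API. BCP 47 subtags are at most 8 characters, so a fixed 8-byte array can hold one. Whether converting such a value to the `error` interface allocates depends on the Go version; the sketch makes no claim about that.

```go
package main

import (
	"errors"
	"fmt"
)

// ValueError keeps the well-formed but unknown subtag inline in 8 bytes.
// Hypothetical layout for illustration.
type ValueError struct {
	v [8]byte
}

// newValueError copies at most 8 bytes of subtag into the error value.
func newValueError(subtag string) ValueError {
	var e ValueError
	copy(e.v[:], subtag)
	return e
}

// Subtag returns the retained value so callers can inspect and correct it.
func (e ValueError) Subtag() string {
	n := 0
	for n < len(e.v) && e.v[n] != 0 {
		n++
	}
	return string(e.v[:n])
}

func (e ValueError) Error() string {
	return fmt.Sprintf("unknown subtag %q", e.Subtag())
}

// errSyntax is a single shared error variable instead of a dynamically
// allocated error per failure.
var errSyntax = errors.New("language tag is not well-formed")

// checkRegion is a toy validator that knows only two regions.
func checkRegion(s string) error {
	if len(s) != 2 {
		return errSyntax
	}
	if s != "US" && s != "NL" {
		return newValueError(s)
	}
	return nil
}

func main() {
	for _, s := range []string{"US", "XX", "USA"} {
		err := checkRegion(s)
		if ve, ok := err.(ValueError); ok {
			fmt.Println("value error, offending subtag:", ve.Subtag())
			continue
		}
		fmt.Println(s, "->", err)
	}
}
```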
Marcel van Lohuizen d95a5f25a9 go.text/language: added tag matching algorithm. This algorithm is not based
on the CLDR algorithm. It incorporates some ideas from other implementations,
but it is designed from scratch.
Note that the IANA registry has been updated so this CL also adds some new
language codes as well as mappings for deprecated codes. maketables.go
has also been modified to work around a bug that was introduced in the latest
IANA update.

Note that the Match method of the Matcher interface returns an index of
the original Tag along with the Tag. Certain users of Matcher, such as
services like collation, need to associate data with each Tag.

Package collate is updated to use the new Matcher interface.
As the Matcher returns the index of the Tag as well, the tables
can be simplified to be an array instead of a map.

R=r, nigeltao, mpvl
CC=golang-dev, markdavis
https://golang.org/cl/13819047
2013-10-07 13:14:45 +02:00
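
The matcher commit above (d95a5f25a9) explains why Match returns an index: callers can keep per-tag data in a plain slice instead of a map. The sketch below shows that pattern only. The `Matcher` interface here is a simplified, hypothetical one over strings, and `exactMatcher` does naive case-insensitive matching; it is not the matching algorithm the commit introduces.

```go
package main

import (
	"fmt"
	"strings"
)

// Matcher is a simplified, hypothetical interface: it returns the best
// supported tag together with its index in the supported list.
type Matcher interface {
	Match(want ...string) (tag string, index int)
}

// exactMatcher matches case-insensitively on the full tag and falls back
// to the first supported tag. Assumes supported is non-empty.
type exactMatcher struct {
	supported []string
}

func (m exactMatcher) Match(want ...string) (string, int) {
	for _, w := range want {
		for i, s := range m.supported {
			if strings.EqualFold(w, s) {
				return s, i
			}
		}
	}
	return m.supported[0], 0
}

func main() {
	supported := []string{"en", "nl", "de"}
	// Data associated with each tag lives in a parallel slice, so the
	// returned index replaces a map lookup.
	collationNames := []string{"English rules", "Dutch rules", "German rules"}

	var m Matcher = exactMatcher{supported}
	tag, i := m.Match("fr", "NL")
	fmt.Println(tag, "->", collationNames[i])
}
```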
Marcel van Lohuizen 9f86e0be98 go.text/language: make it build with Go 1.1, which does not include
sort.Stable.

Fixes golang/go#6523.

R=r
CC=golang-dev
https://golang.org/cl/14265043
2013-10-04 10:23:17 +02:00
Marcel van Lohuizen 6e2c2aaf7b go.text/language: added function to parse the value of an HTTP Accept-Language
header.
It supports a few non-standard language tags that appear relatively frequently
in the Accept-Language headers.

R=r
CC=golang-dev, nigeltao
https://golang.org/cl/13974043
2013-09-27 12:43:24 +02:00
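
To illustrate what parsing an Accept-Language value involves for the commit above (6e2c2aaf7b), here is a minimal standard-library sketch. It is not the package's function: `parseAcceptLanguage` and `weightedTag` are hypothetical names, tags are kept as raw strings without validation, a missing or malformed q parameter is assumed to mean 1.0, and none of the non-standard tags mentioned in the commit are handled.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// weightedTag pairs a raw language tag with its quality value.
type weightedTag struct {
	Tag string
	Q   float64
}

// parseAcceptLanguage splits a header value such as
// "da, en-gb;q=0.8, en;q=0.7" and orders the tags by descending q.
func parseAcceptLanguage(header string) []weightedTag {
	var out []weightedTag
	for _, part := range strings.Split(header, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		tag, q := part, 1.0
		if i := strings.Index(part, ";"); i >= 0 {
			tag = strings.TrimSpace(part[:i])
			param := strings.TrimSpace(part[i+1:])
			if strings.HasPrefix(param, "q=") {
				// Assumption: an unparsable weight silently keeps q = 1.0.
				if v, err := strconv.ParseFloat(param[2:], 64); err == nil {
					q = v
				}
			}
		}
		if tag == "" {
			continue
		}
		out = append(out, weightedTag{Tag: tag, Q: q})
	}
	// A stable sort keeps the header order for tags with equal weight.
	sort.SliceStable(out, func(i, j int) bool { return out[i].Q > out[j].Q })
	return out
}

func main() {
	for _, wt := range parseAcceptLanguage("da, en-gb;q=0.8, en;q=0.7") {
		fmt.Printf("%s (q=%.1f)\n", wt.Tag, wt.Q)
	}
}
```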
Nigel Tao 893a30928a go.text/encoding: add Replacement and XUserDefined encodings.
R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/14022043
2013-09-27 18:12:46 +10:00
Nigel Tao 5d548ed6fc go.text/encoding/simplifiedchinese: implement HZ-GB2312.
The HZ-GB2312 encoding can only represent GBK levels 1 and 2, and
not GBK levels 3, 4 or 5, so there is a new testdata/etc-utf-8.txt
file.

The GBK levels are visualized at http://en.wikipedia.org/wiki/GBK

R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13957043
2013-09-27 10:33:31 +10:00
Nigel Tao c25434d637 go.text/encoding/simplifiedchinese: implement GB18030.
GB18030 is a superset of GBK. I'm not entirely sure why GBK decoding
got 6% faster; I'm just happy that there aren't any big regressions.

benchmark                     old MB/s     new MB/s  speedup
BenchmarkGBKDecoder             116.96       123.64    1.06x
BenchmarkGBKEncoder             179.31       176.86    0.99x

R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13761047
2013-09-24 11:45:21 +10:00
Marcel van Lohuizen 4a56690205 go.text/language: A few small changes:
- Added Tag methods to Base, Script and Region types to convert them into a proper tag.
- Factored out part of Canonicalize that does not remake the string (used in upcoming matcher code).
- Added "nb" -> "no" conversion in the tables to allow more consistency for code using these tables directly.
- changed the short name used in some methods for type Base so that it consistently appears as "b" in the documentation.

R=r
CC=golang-dev
https://golang.org/cl/13647043
2013-09-23 11:03:22 +02:00
Nigel Tao fd9ccd35d5 go.text/encoding: be consistent when converting codes outside of an
encoding.Encoding's repertoire. Specifically, they are converted:
- to the Unicode replacement character '\ufffd' when converting to UTF-8,
- to the ASCII substitute character '\x1a' when converting from UTF-8.

R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13802043
2013-09-20 16:16:00 +10:00
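
The substitution rule in the commit above (fd9ccd35d5) can be shown with a toy ASCII-only encoding. This is a sketch under stated assumptions: `decodeASCII` and `encodeASCII` are hypothetical helpers over whole strings, not the package's Transformer-based encoders, and the ASCII repertoire is chosen only because it needs no tables.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// asciiSub is the ASCII substitute character '\x1a'.
const asciiSub = 0x1a

// decodeASCII converts bytes to UTF-8. Bytes outside the ASCII repertoire
// become the Unicode replacement character U+FFFD.
func decodeASCII(src []byte) string {
	buf := make([]rune, 0, len(src))
	for _, b := range src {
		if b < utf8.RuneSelf {
			buf = append(buf, rune(b))
		} else {
			buf = append(buf, utf8.RuneError) // '\ufffd'
		}
	}
	return string(buf)
}

// encodeASCII converts UTF-8 to bytes. Runes outside the ASCII repertoire
// become the substitute character '\x1a'.
func encodeASCII(s string) []byte {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r < utf8.RuneSelf {
			out = append(out, byte(r))
		} else {
			out = append(out, asciiSub)
		}
	}
	return out
}

func main() {
	fmt.Printf("%+q\n", decodeASCII([]byte{'o', 'k', 0xff}))
	fmt.Printf("%q\n", encodeASCII("caf\u00e9"))
}
```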
Nigel Tao 46294c9806 go.text/encoding/japanese: implement ISO-2022-JP.
R=r
CC=golang-dev
https://golang.org/cl/13308048
2013-09-20 11:27:11 +10:00
Nigel Tao c603fe2b51 go.text/encoding: check that the hard-coded encode switch covers all
tables.

R=r
CC=golang-dev
https://golang.org/cl/13333056
2013-09-18 14:56:11 +10:00
Nigel Tao a60de809e6 go.text/encoding: shrink the japanese and korean encoding data tables.
The encoding.test binary size generated by "go test -c" drops by 132320
bytes.

Some benchmarks get better, others get worse (but that might just be
noise, as there are no code or data changes for Big5 or GBK).

benchmark                    old MB/s     new MB/s  speedup
BenchmarkBig5Encoder           170.12       171.82    1.01x
BenchmarkEUCJPEncoder          160.94       156.07    0.97x
BenchmarkEUCKREncoder          166.75       171.66    1.03x
BenchmarkGBKEncoder            180.07       173.59    0.96x
BenchmarkShiftJISEncoder       137.95       143.70    1.04x

R=r
CC=golang-dev
https://golang.org/cl/13321047
2013-09-18 13:42:53 +10:00
Nigel Tao d94036e178 go.text/encoding/charmap: add all the charmap encodings listed at
http://encoding.spec.whatwg.org/#legacy-single-byte-encodings

R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13252049
2013-09-13 20:53:45 +10:00
Nigel Tao 7db818c9d2 go.text/encoding/simplifiedchinese: remove redundant if check.
The improvement is barely noticeable, but it surely can't hurt.

benchmark               old MB/s     new MB/s  speedup
BenchmarkGBKEncoder       181.94       182.25    1.00x

R=r
CC=golang-dev
https://golang.org/cl/13244047
2013-09-12 11:24:54 +10:00
Nigel Tao ec69b9aa64 go.text/encoding/simplifiedchinese: shrink the encoding data table from
65536 mostly-zero uint16s to 32186 uint16s. There are still explicit
zero entries, but no long runs of zeroes.

benchmark               old MB/s     new MB/s  speedup
BenchmarkGBKEncoder       159.24       180.24    1.13x

R=mpvl
CC=andybalholm, golang-dev, r, rogpeppe
https://golang.org/cl/13253047
2013-09-12 10:36:27 +10:00
Nigel Tao ea98615240 go.text/encoding/traditionalchinese: new package.
R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13242054
2013-09-11 15:32:34 +10:00
Nigel Tao b463796019 go.text/encoding/korean: new package.
R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13639043
2013-09-11 10:05:07 +10:00
Marcel van Lohuizen b38db9f15a go.text/language: renaming of locale package:
- renamed package locale to language
- renamed type ID to Tag (language.Tag)
- renamed type Language to Base (language.Base)
- deleting locale package
- changed occurrences of "locale identifier" in comments to "language tag".
- renamed method variable names from id or loc to t when the receiver type is Tag.

R=r, nigeltao
CC=golang-dev
https://golang.org/cl/13468043
2013-09-05 11:16:24 +02:00
Nigel Tao 60b8f5ddd6 go.text/encoding: fix off-by-one error in some decoders returning
ErrShortDst even when there is sufficient dst space.

In theory, a transform.Transformer is allowed to return fewer dst bytes
than maximal, but in practice, we shouldn't be wasteful.

R=r
CC=golang-dev
https://golang.org/cl/13512045
2013-09-05 12:42:17 +10:00
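
The commit above (60b8f5ddd6) does not say where the off-by-one was, so this sketch only illustrates the kind of bounds check involved. It decodes ISO 8859-1 bytes to UTF-8 and reports a short destination only when the next rune really does not fit. `decodeLatin1` and `errShortDst` are hypothetical stand-ins, not the package's decoders or its ErrShortDst.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errShortDst is a local stand-in for the transform package's ErrShortDst.
var errShortDst = errors.New("short destination buffer")

// decodeLatin1 writes the UTF-8 form of the ISO 8859-1 bytes in src to dst.
// It returns the number of dst bytes written and src bytes consumed.
func decodeLatin1(dst, src []byte) (nDst, nSrc int, err error) {
	for _, b := range src {
		r := rune(b) // ISO 8859-1 maps each byte to the same code point
		size := utf8.RuneLen(r)
		// Compare against the exact encoded size. A conservative bound
		// (for example utf8.UTFMax) or a ">=" here would report
		// errShortDst even though the rune still fits.
		if nDst+size > len(dst) {
			return nDst, nSrc, errShortDst
		}
		nDst += utf8.EncodeRune(dst[nDst:], r)
		nSrc++
	}
	return nDst, nSrc, nil
}

func main() {
	src := []byte{'a', 0xe9} // "a" plus e-acute: 1 + 2 bytes in UTF-8

	exact := make([]byte, 3)
	nDst, nSrc, err := decodeLatin1(exact, src)
	fmt.Println(nDst, nSrc, err)

	short := make([]byte, 2)
	nDst, nSrc, err = decodeLatin1(short, src)
	fmt.Println(nDst, nSrc, err)
}
```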
Nigel Tao 550a27802b go.text/encoding/charmap: make the underlying tables global variables
instead of created-at-init-time local variables, so that they can be
initialized more efficiently as data instead of text.

charmap.a size in bytes before/after is 625236 / 278886, a reduction by a factor of 2.24.

R=r
CC=golang-dev
https://golang.org/cl/13234047
2013-09-05 08:43:09 +10:00
Nigel Tao 577d09f62f go.text/encoding: move the charmap and unicode encodings to their
own dedicated packages.

Prior to this change, encoding.a was 1201 KiB (compiled with 6g).
Manually removing one charmap from tables.go changed this by 98 KiB.
This is already a non-trivial amount of code for the compiler/linker
to process just to throw away when building e.g. encoding/japanese,
and the number of supported charmaps (currently 11) will go up.

R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13486043
2013-09-03 17:49:19 +10:00
Nigel Tao 21b2da3c5c go.text/encoding/simplifiedchinese: new package.
R=r, chaishushan
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13462043
2013-09-03 16:00:54 +10:00
Nigel Tao 1ae8c9038e go.text/encoding/japanese: new package.
R=r, andybalholm, bradfitz
CC=golang-dev, mpvl, rogpeppe
https://golang.org/cl/13179046
2013-08-29 15:38:03 +10:00
Nigel Tao 84e593ec5f encoding: enable transform.Chain example, now that it's checked in.
R=r
CC=golang-dev
https://golang.org/cl/13290046
2013-08-28 10:45:01 +10:00
Nigel Tao e79def78d8 go.text/encoding: simplify charmapEncoder now that
https://golang.org/cl/12087043/ "remove UTF-8
synchronization" has landed.

R=r
CC=golang-dev
https://golang.org/cl/13185043
2013-08-26 11:17:23 +10:00
Marcel van Lohuizen 5059ed55b5 go.text/locale: Separated Macro canonicalization into Legacy and
Macro groups, as defined by CLDR. The Legacy cases are now hard coded.
This allows us to handle sh -> sr-Latn without introducing a new data type
just for this case. The set of legacy translations is unlikely to change,
but maketables.go now checks and fails if the set changes.
Also introduced Default CanonType in preparation for adding tag maximization
and minimization.

Further changes: deviating from CLDR in a few places to not have to deal with
legacy choice. CLDR is likely to head in this direction as well, so it
prevents incompatibilities down the road.
Added CLDR option to force strict compliance to CLDR.

Mapping "mo" to "ro-MD" instead of "ro".  In cases where ID is used as a
locale, preserving this piece of information may be important. It is up
to the matching code to establish that "ro" and "ro-MD" are mutually
intelligible.

R=r
CC=golang-dev
https://golang.org/cl/12903045
2013-08-21 11:15:02 +02:00
Nigel Tao 4980de1c40 go.text/encoding: add some more Windows encodings (874, 1250, 1251,
1253, 1254, 1255, 1256, 1257, 1258).

R=r
CC=golang-dev
https://golang.org/cl/13120043
2013-08-20 18:18:12 +10:00
Marcel van Lohuizen c35e1dcc4d go.text/cldr: fix bug introduced with changeset 17795:749d02164043:
cmd/gc: &x panics if x does.

Fixes golang/go#6178.

Code worked for nil interface values before this change, but now it doesn't.
Changed check for nil so that it works again.

R=r
CC=golang-dev, iant, rsc
https://golang.org/cl/12788046
2013-08-18 09:05:32 +02:00
Marcel van Lohuizen 73b7721064 go.text/locale: Expanded API and fixed bug:
- Exposed functions for parsing Language, Script, Region and Currency.
- Exposed several of the internal methods for these types as well.
- Fixed bug where not all private use tags were registered due to a bug in inc.

R=r
CC=golang-dev
https://golang.org/cl/12987043
2013-08-16 12:13:35 +02:00
Nigel Tao bc48732fe9 go.text/encoding: remove UTF-8 synchronization during encoding. It's
not worth it.

R=r
CC=golang-dev
https://golang.org/cl/12087043
2013-07-30 17:14:54 +10:00
Nigel Tao 0e390ba84d go.text/encoding: add UTF-16 encodings.
There are some TODOs concerning the exact behavior for bad UTF-16, but
I'll address those after getting consensus on the broad-brush design.

candide-utf-16le.txt was generated by
iconv -f UTF-8 -t UTF-16LE < candide-utf-8.txt > candide-utf-16le.txt

R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/11565043
2013-07-30 11:53:06 +10:00
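
As a rough companion to the UTF-16 commit above (0e390ba84d), the conversion that the iconv command performs can be sketched with the standard library. `encodeUTF16LE` and `decodeUTF16LE` are hypothetical whole-buffer helpers, not the package's streaming encodings. For bad input, this sketch simply inherits the behavior of `unicode/utf16` (unpaired surrogates decode to U+FFFD) and ignores a trailing odd byte, which is an assumption and not a statement about the TODOs in the commit.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// encodeUTF16LE converts a UTF-8 string to UTF-16 little-endian bytes,
// without a byte order mark.
func encodeUTF16LE(s string) []byte {
	units := utf16.Encode([]rune(s))
	out := make([]byte, 2*len(units))
	for i, u := range units {
		binary.LittleEndian.PutUint16(out[2*i:], u)
	}
	return out
}

// decodeUTF16LE converts UTF-16 little-endian bytes back to a UTF-8 string.
// A trailing odd byte is ignored.
func decodeUTF16LE(b []byte) string {
	units := make([]uint16, len(b)/2)
	for i := range units {
		units[i] = binary.LittleEndian.Uint16(b[2*i:])
	}
	return string(utf16.Decode(units))
}

func main() {
	fmt.Printf("% x\n", encodeUTF16LE("Go")) // prints "47 00 6f 00"
	fmt.Println(decodeUTF16LE(encodeUTF16LE("Candide, ou l'Optimisme")))
}
```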
Nigel Tao d0bbf51710 go.text/transform: fix s/src/dst/ typo in Writer.Close.
R=r
CC=golang-dev, mpvl
https://golang.org/cl/11823043
2013-07-26 09:22:10 +10:00
Marcel van Lohuizen b67299ac79 go.text/transform: implementation of Writer, Chain, Nop and Discard.
R=nigeltao, r
CC=golang-dev
https://golang.org/cl/10964043
2013-07-24 16:26:05 +02:00
Nigel Tao 48ba322e43 go.text/encoding: new package that provides character set encodings.
Only IBM Code Page 437 and Windows 1252 encodings for now. Others will
come in follow-up CLs once the infrastructure's settled.

R=r, mpvl, andybalholm
CC=golang-dev, rogpeppe
https://golang.org/cl/11270043
2013-07-18 14:24:31 +10:00
Nigel Tao 0c7fb33750 go.text/transform: re-arrange TestReader's test cases so that they can
be re-used by other tests.

R=mpvl, r
CC=golang-dev
https://golang.org/cl/10672044
2013-07-10 10:37:15 +10:00
Nigel Tao 8a29aad8b1 go.text/transform: improve comments based on review of
https://golang.org/cl/10538043.

R=r, mpvl
CC=golang-dev
https://golang.org/cl/10996043
2013-07-09 16:03:01 +10:00
Marcel van Lohuizen beb5bf8642 go.text/locale: some semantics and API changes
- Defined 0 value to be "unspecified" id for languages, scripts and
  regions. These values are not directly exposed to the user, but
  rather are used to distinguish between the case where the
  user explicitly specifies, for example, Zzzz vs not specifying it.
- The nil-value for ID now identifies Root.
- Use Zyyy (undetermined) instead of Zzzz (uncoded, as used by CLDR) as
  the code for an unspecified script.  CLDR uses Zzzz, but BCP47 prescribes
  using Zyyy in this case.  With the new semantics this choice is somewhat
  arbitrary, so we stick with BCP47.
- Added error to Canonicalize to accommodate future canonicalization algorithms.
- Removed Parent and Written as their semantics are rather hazy.
- Added Confidence to Language method as well.
- Removed Scope methods. Instead, user should just filter pre-defined
  lists of IDs to mimic its functionality.
- Added SetTypeForKey and removed KeyValueString.  The same can be done
  with the former, but is much easier to use for the common case
  (change the type for a single key on an existing ID).
- Removed SimplifyOptions as it is unclear such functionality should
  be exposed to the user or that it belongs in ID at all.
Implemented:
- Language, Script, Region
- IsCountry

R=r
CC=golang-dev
https://golang.org/cl/10697043
2013-07-08 21:10:26 +02:00
Nigel Tao 79b045a0f2 go.text/transform: new package.
This CL only provides the Reader type; Writer will be in a follow-up.

R=mpvl, r, mpvl
CC=andybalholm, golang-dev, rogpeppe
https://golang.org/cl/10538043
2013-07-02 09:56:20 +10:00