runes from the input. This corresponds to ICU's Remove transform.
For example, to remove accents from characters one could use RemoveFunc
as follows:
	nonspacingMark := func(r rune) bool {
		return unicode.Is(unicode.Mn, r)
	}
	transform.Chain(norm.NFD, transform.RemoveFunc(nonspacingMark), norm.NFC)
(Once norm.Form implements Transformer; guess what will be my next CL.)
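The removal step on its own can be tried with just the standard library. This is a minimal sketch, assuming the input is already decomposed (which is what the NFD step in the chain would take care of) and using strings.Map as a stand-in for transform.RemoveFunc. removeMarks is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// removeMarks drops every nonspacing mark (category Mn) from s.
// Assumption: s is already decomposed, so accents are separate runes.
func removeMarks(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Mn, r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// The input spells "resume" with a combining acute accent (U+0301)
	// after each 'e'.
	fmt.Println(removeMarks("re\u0301sume\u0301"))
}
```

Precomposed characters such as U+00E9 pass through unchanged here, which is why the real chain decomposes first.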
R=r
CC=golang-dev, nigeltao
https://golang.org/cl/23220043
optimize a few more calls to not create a reorderBuffer unless necessary.
This will greatly improve performance for small strings if there is no need
to normalize.
Updated Bytes, String, IsNormal* and FirstBoundary*. The latter is
especially important as it will always only inspect a small piece of text.
Also added a few benchmarks.
R=r
CC=golang-dev
https://golang.org/cl/28260043
One fundamental design decision was to have the norm.Form type
implement Transform directly, rather than having the user create a
Transformer instance. This approach has several advantages:
1) It results in a much nicer API (e.g. norm.NFC can be used as a
transformer as is).
2) The Transformer is stateless, thread-safe and reentrant, reducing
the possibility of errors.
3) It is consistent with most other transformers.
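To illustrate the design choice, here is a minimal sketch of a value type that implements a Transform method directly, so that a constant of that type can be used as a transformer as is. The Transformer interface is declared locally to keep the sketch self-contained. The form type, upperASCII and errShortDst are hypothetical stand-ins and not the norm.Form implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errShortDst is a local stand-in for transform.ErrShortDst.
var errShortDst = errors.New("short destination buffer")

// Transformer is declared locally so the sketch needs no external package.
type Transformer interface {
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

// form is a hypothetical stateless transformer. The value carries no
// buffers, so the same constant can be shared between goroutines.
type form int

const upperASCII form = 0

// Transform upper-cases ASCII letters and copies all other bytes as is.
func (f form) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		if nDst == len(dst) {
			return nDst, nSrc, errShortDst
		}
		c := src[nSrc]
		if 'a' <= c && c <= 'z' {
			c -= 'a' - 'A'
		}
		dst[nDst] = c
		nDst++
		nSrc++
	}
	return nDst, nSrc, nil
}

func main() {
	var t Transformer = upperASCII // no constructor call needed
	dst := make([]byte, 16)
	n, _, err := t.Transform(dst, []byte("nfc"), true)
	fmt.Println(string(dst[:n]), err)
}
```

The cost of this shape is the one described below: any scratch space has to be set up inside each call.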
The clear disadvantage is that it is impossible to reuse a reorderBuffer
between calls. The cost of initialization can be amortized when
normalizing large blocks, but can be prohibitive for small strings
(can probably get its size down to 500 bytes, but still).
However, in theory it is possible to skip the use of the reorderBuffer in
the vast majority of cases. 99.98% of HTML page content (excluding markup)
is in NFC (http://www.macchiato.com/unicode/nfc-faq). In most cases, NFC
can be converted to NFD without reordering (if it is in FCD form, which
can easily be detected). This means that, using the same techniques as for
norm.Iter, most conversions can be done without using a reorderBuffer at
all. If it is really necessary, it is probably possible to get rid of the
reorderBuffer altogether or to have an alternative that covers 99% of the
remaining cases.
As we reckon that the creation of a reorderBuffer can be avoided in most
cases, we opt for the better API. Note that this approach will be more
efficient if the creation of a reorderBuffer can be avoided.
The current implementation doesn't do any of the optimizations described
above. It only uses quickSpan to quickly cover most normalized content.
This will not avoid using a reorderBuffer when converting between different
forms, though.
Also:
- quickSpan has been modified to allow for incremental processing (passing
atEOF) and to return whether any violating runes have been found (returning
a position smaller than the input length no longer signals this if !atEOF).
QuickSpan and QuickSpanString now no longer require a reorderBuffer.
- Added benchmarks to compare different normalization methods. Also added
benchmarks to compare these implementations against running ToLower on
the same strings.
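The reason a separate "ok" result is needed for incremental scanning can be shown with a toy scanner. This is a sketch under loud assumptions: span is a hypothetical function, and "violating rune" is simplified to "any nonspacing mark", which is not the rule quickSpan applies.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// span returns the number of leading bytes of src that need no work and
// whether the scan stopped without finding a violating rune.
// Assumption: a "violating rune" is any nonspacing mark (category Mn).
func span(src []byte, atEOF bool) (n int, ok bool) {
	for n < len(src) {
		if !atEOF && !utf8.FullRune(src[n:]) {
			// The trailing bytes are an incomplete rune. More input is
			// needed, but nothing has been found to be wrong so far.
			return n, true
		}
		r, size := utf8.DecodeRune(src[n:])
		if unicode.Is(unicode.Mn, r) {
			return n, false
		}
		n += size
	}
	return n, true
}

func main() {
	// A chunk ending in the first byte of a two-byte rune: n < len(src),
	// yet ok is true because the caller simply has to supply more bytes.
	fmt.Println(span([]byte("ab\xc3"), false))
	// A combining accent after 'e' stops the scan with ok == false.
	fmt.Println(span([]byte("abe\u0301"), true))
}
```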
All in all this CL raises some issues that will have to be addressed in
follow-up CLs:
- Optimization of Transform method.
- QuickSpan* probably needs to be adjusted to allow for incremental
calls (like the internal version).
- Unifying some of the implementations should be considered.
R=r
CC=golang-dev, nigeltao
https://golang.org/cl/23460044
Analogous to deprecated languages, deprecated regions are represented with
their own internal codes and get canonicalized when necessary. The deprecated
regions were previously not recognized.
- refactored code for writing sorted maps.
- refactored region testing code in separate components to
simplify tests.
- CLDR does not include the deprecated 3-letter ISO codes for all deprecated
codes. We added them for completeness.
- Script deprecation is hard-coded. The CLDR data only contains one remapping.
maketables.go checks that this is indeed the only one.
This adds about 250 bytes of data.
R=r
CC=golang-dev
https://golang.org/cl/19850043
Also refactored remakeString to allow for faster rebuilding code when possible.
Tag.String now uses this for faster string creation as well.
R=r
CC=golang-dev
https://golang.org/cl/14425066
This is a choice between more conformance to BCP 47 on the one hand and
Unicode and CLDR on the other hand.
The user can use the returned Confidence value to determine whether the
script was unspecified or explicitly specified as Zzzz (in the rare case the
user would care at all).
Updated comments.
R=r
CC=golang-dev
https://golang.org/cl/16020043
Correct prefixes are a SHOULD, not a MUST in BCP 47. Incorrect prefixes
should therefore either be marked with a special error so that it can be
ignored or not generate an error at all. We opt for the latter.
Too bad, as the prefix checking algorithm was kinda cool. Also, sorry for
having to go through reviewing it in the first place.
R=r
CC=golang-dev
https://golang.org/cl/14483054
[this time with the correct files:]
Measures to expose bugs:
- Added more elaborate tests for extensions.
- Return end instead of len(scan.b) at the end of parseExtension
to force exposing bugs.
- parseExtensions used to sometimes update scan.b and sometimes not,
leaving it to parse. Made this more consistent, simplifying parse
and forcing errors to be exposed.
- Removed some checks to catch errors that should have been caught
elsewhere. Again to expose bugs.
- Tightened some of the checks to expose bugs more easily.
Bugs fixed:
- Attributes in -u extension are now sorted, as per LDML spec
(even though nobody uses them, a spec is a spec).
- Fixed various bugs where invalid keys or values were not properly
removed. Merged the special case and common case to eliminate rare
code paths and simplify testing.
- Fixed some bugs where invalid empty extensions were not properly
removed.
- Fixed bug in Compose, which dropped the 'w', '9', and 'z' extensions.
Other:
- removed parsePrivate to simplify code.
R=r
CC=golang-dev
https://golang.org/cl/14669043
The package now recognizes valid variants and rejects invalid ones.
It accepts variants only if they follow the proper language or proper
prefix sequence, as laid out in the BCP 47 spec.
If variants are not in the right order, it will attempt to put them
into a correct order.
Duplicate variants are removed.
Note that BCP 47 presumes that a script is suppressed if it is marked
as such for a language, whereas we allow such scripts to still exist.
There is an additional check to handle this case.
R=r
CC=golang-dev
https://golang.org/cl/14555043
- ValueError now exported as new type. ValueError retains the problematic value,
allowing the user to inspect and correct it.
- Dynamically allocated errors returned in case of a syntax error are replaced
by an error variable.
- Fixed bug: return error if an "u" extension has a type without a value.
- Added benchmarks for parsing code.
- Renamed MissingLikelyData to ErrMissingLikelyData to be consistent with other
Go packages. This variable is not yet returned, so this change is not likely to cause
a big issue.
- Removed Set type as long as there is no demand for it.
The code is measurably faster after removing the dynamically allocated errors.
A ValueError is 8 bytes and should not require allocation when passed as an error.
Returning a fixed error variable instead of a ValueError did not significantly improve
performance.
I considered returning a syntax error with the position at which the error occurred.
The extra management needed for this slowed down the code a bit, so I opted not to
support this. This could still be implemented if there turns out to be a need for it.
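The two kinds of errors described above can be sketched as follows. This is an illustration under assumptions, not the package's code: valueError, errSyntax, parseBase and the list of known languages are all hypothetical, and the 8-byte array layout is simply one way to get an error value of the size mentioned.

```go
package main

import (
	"errors"
	"fmt"
)

// errSyntax is a fixed error variable, so returning it allocates nothing.
var errSyntax = errors.New("language: tag is not well-formed")

// valueError is a hypothetical 8-byte error value that retains the
// offending subtag so that the caller can inspect and correct it.
type valueError struct {
	v [8]byte
}

func newValueError(subtag string) valueError {
	var e valueError
	copy(e.v[:], subtag) // subtags are at most 8 bytes; longer input is cut off
	return e
}

// Subtag returns the retained subtag without the zero padding.
func (e valueError) Subtag() string {
	n := 0
	for n < len(e.v) && e.v[n] != 0 {
		n++
	}
	return string(e.v[:n])
}

func (e valueError) Error() string {
	return fmt.Sprintf("language: subtag %q is well-formed but unknown", e.Subtag())
}

// parseBase is a toy parser that knows only three languages.
func parseBase(s string) (string, error) {
	if len(s) < 2 || len(s) > 8 {
		return "", errSyntax
	}
	for i := 0; i < len(s); i++ {
		if s[i] < 'a' || s[i] > 'z' {
			return "", errSyntax
		}
	}
	switch s {
	case "en", "nl", "de":
		return s, nil
	}
	return "", newValueError(s)
}

func main() {
	_, err := parseBase("xx")
	var ve valueError
	if errors.As(err, &ve) {
		fmt.Println("unknown subtag:", ve.Subtag())
	}
	_, err = parseBase("e1")
	fmt.Println(err == errSyntax)
}
```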
R=r, mpvl
CC=golang-dev
https://golang.org/cl/14162044
on the CLDR algorithm. It incorporates some ideas from other implementations,
but it is designed from scratch.
Note that the IANA registry has been updated so this CL also adds some new
language codes as well as mappings for deprecated codes. maketables.go
has also been modified to work around a bug that was introduced in the latest
IANA update.
Note that the Match method of the Matcher interface returns an index of
the original Tag along with the Tag. Certain users of Matcher, such as
services like collation, need to associate data with each Tag.
Package collate is updated to use the new Matcher interface.
As the Matcher returns the index of the Tag as well, the tables
can be simplified to be an array instead of a map.
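A small sketch of why returning the index helps: the caller can keep its data in a slice parallel to the supported tags instead of in a map keyed by tag. The matcher type below is hypothetical and only does exact string comparison, so it says nothing about the actual matching algorithm.

```go
package main

import "fmt"

// matcher is a hypothetical, much simplified matcher over string tags.
type matcher struct {
	supported []string
}

// Match returns the first desired tag that is supported, together with
// its index in the supported list. It falls back to the first supported
// tag (index 0) if nothing matches.
func (m matcher) Match(want ...string) (tag string, index int) {
	for _, w := range want {
		for i, s := range m.supported {
			if s == w {
				return s, i
			}
		}
	}
	return m.supported[0], 0
}

func main() {
	m := matcher{supported: []string{"en", "nl", "de"}}
	// Per-tag data lives in a slice with the same ordering as the tags.
	rules := []string{"english rules", "dutch rules", "german rules"}

	tag, i := m.Match("fr", "de")
	fmt.Println(tag, "->", rules[i])
}
```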
R=r, nigeltao, mpvl
CC=golang-dev, markdavis
https://golang.org/cl/13819047
header.
It supports a few non-standard language tags that appear relatively frequently
in the Accept-Language headers.
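For readers unfamiliar with the header format, here is a rough standard-library sketch of splitting an Accept-Language value into tags ordered by weight. parseAcceptLanguage and the weighted type are hypothetical names, and the sketch deliberately ignores the non-standard tags mentioned above and does no validation of the tags themselves.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// weighted pairs a language tag with its q-value.
type weighted struct {
	tag string
	q   float64
}

// parseAcceptLanguage splits a header value such as
// "da, en-gb;q=0.8, en;q=0.7" and orders the entries by descending weight.
// Entries with a malformed weight or an empty tag are skipped.
func parseAcceptLanguage(h string) []weighted {
	var out []weighted
	for _, part := range strings.Split(h, ",") {
		part = strings.TrimSpace(part)
		tag, q := part, 1.0 // a missing q parameter means weight 1
		if i := strings.Index(part, ";"); i >= 0 {
			tag = strings.TrimSpace(part[:i])
			param := strings.TrimSpace(part[i+1:])
			if strings.HasPrefix(param, "q=") {
				v, err := strconv.ParseFloat(param[2:], 64)
				if err != nil {
					continue
				}
				q = v
			}
		}
		if tag == "" {
			continue
		}
		out = append(out, weighted{tag, q})
	}
	// A stable sort keeps the header order for entries with equal weight.
	sort.SliceStable(out, func(i, j int) bool { return out[i].q > out[j].q })
	return out
}

func main() {
	for _, w := range parseAcceptLanguage("en;q=0.5, nl, de;q=0.8") {
		fmt.Println(w.tag, w.q)
	}
}
```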
R=r
CC=golang-dev, nigeltao
https://golang.org/cl/13974043
The HZ-GB2312 encoding can only represent GBK levels 1 and 2, and
not GBK levels 3, 4 or 5, so there is a new testdata/etc-utf-8.txt
file.
The GBK levels are visualized at http://en.wikipedia.org/wiki/GBK
R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13957043
GB18030 is a superset of GBK. I'm not entirely sure why GBK decoding
got 6% faster; I'm just happy that there aren't any big regressions.
benchmark                 old MB/s   new MB/s   speedup
BenchmarkGBKDecoder         116.96     123.64     1.06x
BenchmarkGBKEncoder         179.31     176.86     0.99x
R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13761047
- Added Tag methods to Base, Script and Region types to convert them into a proper tag.
- Factored out part of Canonicalize that does not remake the string (used in upcoming matcher code).
- Added "nb" -> "no" conversion in the tables to allow more consistency for code using these tables directly.
- changed the short name used in some methods for type Base so that it consistently appears as "b" in the documentation.
R=r
CC=golang-dev
https://golang.org/cl/13647043
encoding.Encoding's repertoire. Specifically, they are converted:
- to the Unicode replacement character '\ufffd' when converting to UTF-8,
- to the ASCII substitute character '\x1a' when converting from UTF-8.
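The two substitution rules can be illustrated with a toy 7-bit encoding. This is a sketch only: decodeASCII and encodeASCII are hypothetical helpers that work on whole strings rather than through the Transformer interface.

```go
package main

import "fmt"

// decodeASCII converts bytes of a toy 7-bit encoding to UTF-8. Bytes
// outside the repertoire become the Unicode replacement character.
func decodeASCII(src []byte) string {
	out := make([]rune, 0, len(src))
	for _, b := range src {
		if b >= 0x80 {
			out = append(out, '\ufffd')
			continue
		}
		out = append(out, rune(b))
	}
	return string(out)
}

// encodeASCII converts UTF-8 to the toy encoding. Runes outside the
// repertoire become the ASCII substitute character.
func encodeASCII(s string) []byte {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r >= 0x80 {
			out = append(out, '\x1a')
			continue
		}
		out = append(out, byte(r))
	}
	return out
}

func main() {
	fmt.Printf("%q\n", decodeASCII([]byte{'a', 0xff}))
	fmt.Printf("%q\n", encodeASCII("a\u00e9"))
}
```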
R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13802043
The encoding.test binary size generated by "go test -c" drops by 132320
bytes.
Some benchmarks get better, others get worse (but that might just be
noise, as there are no code or data changes for Big5 or GBK).
benchmark                 old MB/s   new MB/s   speedup
BenchmarkBig5Encoder        170.12     171.82     1.01x
BenchmarkEUCJPEncoder       160.94     156.07     0.97x
BenchmarkEUCKREncoder       166.75     171.66     1.03x
BenchmarkGBKEncoder         180.07     173.59     0.96x
BenchmarkShiftJISEncoder    137.95     143.70     1.04x
R=r
CC=golang-dev
https://golang.org/cl/13321047
The improvement is barely noticeable, but it surely can't hurt.
benchmark                 old MB/s   new MB/s   speedup
BenchmarkGBKEncoder         181.94     182.25     1.00x
R=r
CC=golang-dev
https://golang.org/cl/13244047
65536 mostly-zero uint16s to 32186 uint16s. There are still explicit
zero entries, but no long runs of zeroes.
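One general way to shrink a mostly-zero lookup table is to keep only the dense sub-ranges and dispatch on range bounds. The sketch below shows that idea and may differ from what this CL does. The range bounds, table contents and the encode function are all made up for illustration.

```go
package main

import "fmt"

// Bounds of two hypothetical dense ranges within the 16-bit input space.
const (
	lo0, hi0 = 0x00a4, 0x00a8
	lo1, hi1 = 0x4e00, 0x4e03
)

// Made-up table contents. An explicit zero inside a range means "no
// mapping"; the long runs of zeroes between the ranges are not stored.
var (
	enc0 = [hi0 - lo0]uint16{0xa1e8, 0, 0, 0xa1ec}
	enc1 = [hi1 - lo1]uint16{0xd2bb, 0xb6a1, 0}
)

// encode returns the table entry for r, or 0 if r has no mapping.
func encode(r rune) uint16 {
	switch {
	case lo0 <= r && r < hi0:
		return enc0[r-lo0]
	case lo1 <= r && r < hi1:
		return enc1[r-lo1]
	}
	return 0
}

func main() {
	fmt.Printf("%#x %#x %#x\n", encode(0x00a4), encode(0x4e01), encode('x'))
}
```

The extra comparisons are cheap next to the memory saved.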
benchmark                 old MB/s   new MB/s   speedup
BenchmarkGBKEncoder         159.24     180.24     1.13x
R=mpvl
CC=andybalholm, golang-dev, r, rogpeppe
https://golang.org/cl/13253047
- renamed package locale to language
- renamed type ID to Tag (language.Tag)
- renamed type Language to Base (language.Base)
- deleting locale package
- changed occurrences of "locale identifier" in comments to "language tag".
- renamed method variable names from id or loc to t when the receiver type is Tag.
R=r, nigeltao
CC=golang-dev
https://golang.org/cl/13468043
ErrShortDst even when there is sufficient dst space.
In theory, a transform.Transformer is allowed to return fewer dst bytes
than maximal, but in practice, we shouldn't be wasteful.
R=r
CC=golang-dev
https://golang.org/cl/13512045
instead of created-at-init-time local variables, so that they can be
initialized more efficiently as data instead of text.
charmap.a size in bytes before/after is 625236 / 278886, a factor of 2.24.
R=r
CC=golang-dev
https://golang.org/cl/13234047
own dedicated packages.
Prior to this change, encoding.a was 1201 KiB (compiled with 6g).
Manually removing one charmap from tables.go changed this by 98 KiB.
This is already a non-trivial amount of code for the compiler/linker
to process just to throw away when building e.g. encoding/japanese,
and the number of supported charmaps (currently 11) will go up.
R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/13486043
Macro groups, as defined by CLDR. The Legacy cases are now hard coded.
This allows us to handle sh -> sr-Latn without introducing a new data type
just for this case. The set of legacy translations is unlikely to change,
but maketables.go now checks and fails if the set changes.
Also introduced Default CanonType in preparation for adding tag maximization
and minimization.
Further changes: deviating from CLDR in a few places to not have to deal with
legacy choice. CLDR is likely to head in this direction as well, so it
prevents incompatibilities down the road.
Added CLDR option to force strict compliance to CLDR.
Mapping "mo" to "ro-MD" instead of "ro". In cases where ID is used as a
locale, preserving this piece of information may be important. It is up
to the matching code to establish that "ro" and "ro-MD" are mutually
intelligible.
R=r
CC=golang-dev
https://golang.org/cl/12903045
cmd/gc: &x panics if x does.
Fixes golang/go#6178.
Code worked for nil interface values before this change, but now it doesn't.
Changed check for nil so that it works again.
R=r
CC=golang-dev, iant, rsc
https://golang.org/cl/12788046
- Exposed functions for parsing Language, Script, Region and Currency.
- Exposed several of the internal methods for these types as well.
- Fixed bug where not all private use tags were registered due to a bug in inc.
R=r
CC=golang-dev
https://golang.org/cl/12987043
There are some TODOs concerning the exact behavior for bad UTF-16, but
I'll address those after getting consensus on the broad-brush design.
candide-utf-16le.txt was generated by
iconv -f UTF-8 -t UTF-16LE < candide-utf-8.txt > candide-utf-16le.txt
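For checking such a test file by hand, UTF-16LE can be decoded with the standard library alone. This is a sketch, not the decoder in this CL: decodeUTF16LE is a hypothetical helper, and its handling of bad input (U+FFFD for unpaired surrogates, as done by utf16.Decode, and for a trailing odd byte) is an assumption, since the exact behavior is still a TODO here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16LE converts little-endian UTF-16 bytes to a UTF-8 string.
// Assumption: unpaired surrogates and a trailing odd byte each become
// U+FFFD.
func decodeUTF16LE(b []byte) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	runes := utf16.Decode(units) // replaces unpaired surrogates with U+FFFD
	if len(b)%2 == 1 {
		runes = append(runes, '\ufffd')
	}
	return string(runes)
}

func main() {
	fmt.Println(decodeUTF16LE([]byte{0x48, 0x00, 0x69, 0x00}))
}
```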
R=r
CC=andybalholm, golang-dev, mpvl, rogpeppe
https://golang.org/cl/11565043
Only IBM Code Page 437 and Windows 1252 encodings for now. Others will
come in follow-up CLs once the infrastructure's settled.
R=r, mpvl, andybalholm
CC=golang-dev, rogpeppe
https://golang.org/cl/11270043
- Defined 0 value to be "unspecified" id for languages, scripts and
regions. These values are not directly exposed to the user, but
rather are used to distinguish between the case where the
user explicitly specifies, for example, Zzzz vs not specifying it.
- The nil-value for ID now identifies Root.
- Use Zyyy (undetermined) instead of Zzzz (uncoded, as used by CLDR) as
the code for an unspecified script. CLDR uses Zzzz, but BCP47 prescribes
using Zyyy in this case. With the new semantics this choice is somewhat
arbitrary, so we stick with BCP47.
- Added error to Canonicalize to accommodate future canonicalization algorithms.
- Removed Parent and Written as their semantics are rather hazy.
- Added Confidence to Language method as well.
- Removed Scope methods. Instead, user should just filter pre-defined
lists of IDs to mimic its functionality.
- Added SetTypeForKey and removed KeyValueString. The same can be done
with the former, but is much easier to use for the common case
(change the type for a single key on an existing ID).
- Removed SimplifyOptions as it is unclear such functionality should
be exposed to the user or that it belongs in ID at all.
Implemented:
- Language, Script, Region
- IsCountry
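The idea of a zero value meaning "unspecified" can be sketched as follows. All names here (scriptID, id, the Script method and its boolean result) are hypothetical. The sketch reports Zyyy for an unspecified script, as described in the list above, and uses a plain bool where the package uses a confidence value.

```go
package main

import "fmt"

// scriptID is a hypothetical internal script code.
type scriptID uint8

const (
	unspecified scriptID = iota // zero value: the user gave no script
	zzzz                        // the user explicitly wrote Zzzz
	latn
)

var scriptNames = [...]string{unspecified: "Zyyy", zzzz: "Zzzz", latn: "Latn"}

// id is a hypothetical identifier holding only a script.
type id struct {
	script scriptID
}

// Script returns the script code and whether the user specified it.
// The internal zero value is never exposed; it only drives the bool.
func (t id) Script() (code string, explicit bool) {
	return scriptNames[t.script], t.script != unspecified
}

func main() {
	var none id // the zero value needs no initialization
	fmt.Println(none.Script())
	fmt.Println(id{script: zzzz}.Script())
}
```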
R=r
CC=golang-dev
https://golang.org/cl/10697043
This CL only provides the Reader type; Writer will be in a follow-up.
R=mpvl, r, mpvl
CC=andybalholm, golang-dev, rogpeppe
https://golang.org/cl/10538043