go/text - text

Commit graph

Author	SHA1	Message	Date
Marcel van Lohuizen	191b11aac8	go.text/transform: added RemoveFunc transform for removing individual runes from the input. This corresponds to ICU's Remove transform. For example, to remove accents from characters one could use RemoveFunc as follows: nonspacingMark := func(r rune) bool { return unicode.Is(unicode.Mn, r) } transform.Chain(norm.NFD, transform.RemoveFunc(nonspacingMark), norm.NFC) (Once norm.Form implements Transformer; guess what will be my next CL.) R=r CC=golang-dev, nigeltao https://golang.org/cl/23220043	2013-11-26 08:29:24 +01:00
Nigel Tao	089e4d2d44	go.text/encoding: add the Nop encoding. R=mpvl, andybalholm CC=golang-dev https://golang.org/cl/29950045	2013-11-23 10:09:50 +11:00
Marcel van Lohuizen	9465bcffb0	go.text/unicode/norm: as quickSpan now no longer uses reorderBuffer, we can optimize a few more calls to not create a reorderBuffer unless necessary. This will greatly improve performance for small strings if there is no need to normalize. Updated Bytes, String, IsNormal* and FirstBoundary*. The latter is especially important as it will always only inspect a small piece of text. Also added a few benchmarks. R=r CC=golang-dev https://golang.org/cl/28260043	2013-11-18 20:12:15 +01:00
Marcel van Lohuizen	de84dfdf8f	go.text/unicode/norm: added implementation of Transformer to norm.Form. One fundamental design decision was made to have the norm.Form type implement Transform directly, rather than having the user create a Transformer instance. The advantage of this approach is that it 1) results in a much nicer API (e.g. norm.NFC can be used as a transformer as is). 2) the Transformer is stateless, thread-safe and reentrant, reducing the possibility of errors. 3) is consistent with most other transformers. The clear disadvantage is that it is impossible to reuse a reorderBuffer between calls. The cost of initialization can be amortized when normalizing large blocks, but can be prohibitive for small strings (can probably get its size down to 500 bytes, but still). However, in theory it is possible to skip the use of the reorderBuffer in the vast majority of cases. 99.98% of HTML page content (excluding markup) is in NFC (http://www.macchiato.com/unicode/nfc-faq). In most cases, NFC can be converted to NFD without reordering (if it is in FCD form, which can easily be detected). This means that, using the same techniques as for norm.Iter, most conversions can be done without using a reorderBuffer at all. If it is really necessary, it is probably possible to get rid of the reorderBuffer altogether or to have an alternative that covers 99% of the remaining cases. As we reckon that the creation of a reorderBuffer can be avoided in most cases, we opt for the better API. Note that this approach will be more efficient if the creation of a reorderBuffer can be avoided. The current implementation doesn't do any of the optimizations described above. It only uses quickSpan to quickly cover most normalized content. This will not avoid using a reorderBuffer when converting between different forms, though. Also: - quickSpan has been modified to allow for incremental processing (passing atEOF) and returning whether any violating runes have been found (returning a position smaller than the input length no longer signals this if !atEOF). QuickSpan and QuickSpanString now no longer require a reorderBuffer. - Added benchmarks to compare different normalization methods. Also added benchmarks to compare these implementations against running ToLower on the same strings. All in all this CL raises some issues that will have to be addressed in follow-up CLs: - Optimization of Transform method. - QuickSpan* probably needs to be adjusted to allow for incremental calls (like the internal version). - Unifying some of the implementations should be considered. R=r CC=golang-dev, nigeltao https://golang.org/cl/23460044	2013-11-14 23:20:39 +01:00
Marcel van Lohuizen	a7e91de037	go.text/language: canonicalize deprecated regions and scripts. Analogous to deprecated languages, deprecated regions are represented with their own internal codes and get canonicalized when necessary. The deprecated regions were previously not recognized. - refactored code for writing sorted maps. - refactored region testing code in separate components to simplify tests. - CLDR does not include the deprecated 3-letter ISO codes for all deprecated codes. We added them for completeness. - Script deprecation is hard-coded. The CLDR data only contains one remapping. maketables.go checks that this is indeed the only one. This adds about 250 bytes of data. R=r CC=golang-dev https://golang.org/cl/19850043	2013-11-08 12:52:05 +01:00
Marcel van Lohuizen	3494cc8d0c	go.text/language: implemented TypeForKey and SetTypeForKey. Also refactored remakeString to allow for faster rebuilding code when possible. Tag.String now uses this for faster string creation as well. R=r CC=golang-dev https://golang.org/cl/14425066	2013-10-24 15:07:24 +02:00
Marcel van Lohuizen	240601eac0	go.text/language: change Zyyy to Zzzz as representation of undefined script. This is a choice between more conformance to BCP 47 on the one hand and Unicode and CLDR on the other hand. The user can use the returned Confidence value to determine whether the script was unspecified or explicitly specified as Zzzz (in the rare case the user would care at all). Updated comments. R=r CC=golang-dev https://golang.org/cl/16020043	2013-10-24 15:04:51 +02:00
Marcel van Lohuizen	80a998998e	go.text/language: fixed a few go vet errors. R=r CC=golang-dev https://golang.org/cl/16010043	2013-10-23 16:19:10 +02:00
Marcel van Lohuizen	f4a79d0559	go.text/language: corrected url in comments. R=r CC=golang-dev https://golang.org/cl/15520048	2013-10-23 16:18:38 +02:00
Marcel van Lohuizen	3255f38977	go.text/language: removed prefix validation of variants. Correct prefixes are a SHOULD, not a MUST in BCP 47. Incorrect prefixes should therefore either be marked with a special error so that it can be ignored or not generate an error at all. We opt for the latter. Too bad, as the prefix checking algorithm was kinda cool. Also, sorry for having to go through reviewing it in the first place. R=r CC=golang-dev https://golang.org/cl/14483054	2013-10-23 10:20:05 +02:00
Marcel van Lohuizen	51fb595f78	go.text/language: bunch of bug fixes in extension handling: [this time with the correct files:] Measures to expose bugs: - Added more elaborate tests for extensions. - Return end instead of len(scan.b) at the end of parseExtension to force exposing bugs. - parseExtensions used to sometimes update scan.b and sometimes not, leaving it to parse. Made this more consistent, simplifying parse and forcing errors to be exposed. - Removed some checks to catch errors that should have been caught elsewhere. Again to expose bugs. - Tightened some of the checks to expose bugs more easily. Bugs fixed: - Attributes in -u extension are now sorted, as per LDML spec (even though nobody uses them, a spec is a spec). - Fixed various bugs where invalid keys or values were not properly removed. Merged the special case and common case to eliminate rare code paths and simplify testing. - Fixed some bugs where invalid empty extensions were not properly removed. - Fixed bug in Compose, which dropped the 'w', '9', and 'z' extensions. Other: - removed parsePrivate to simplify code. R=r CC=golang-dev https://golang.org/cl/14669043	2013-10-16 11:12:07 +02:00
Marcel van Lohuizen	4fe0ccd82b	go.text/language: added proper handling of variants. The package now recognizes valid variants and rejects invalid ones. It accepts variants only if they follow the proper language or proper prefix sequence, as laid out in the BCP 47 spec. If variants are not in the right order, it will make an attempt to create a correct sorting order out of them. Duplicate variants are removed. Note that BCP 47 presumes that a script is suppressed if it is marked as such for a language, whereas we allow such scripts to still exist. There is an additional check to handle this case. R=r CC=golang-dev https://golang.org/cl/14555043	2013-10-14 16:06:28 +02:00
Marcel van Lohuizen	71ab14c455	go.text/language: revamped error handling: - ValueError now exported as new type. ValueError retains the problematic value, allowing the user to inspect and correct it. - Dynamically allocated errors returned in case of a syntax error are replaced by an error variable. - Fixed bug: return error if a "u" extension has a type without a value. - Added benchmarks for parsing code. - Renamed MissingLikelyData to ErrMissingLikelyData to be consistent with other Go packages. This variable is not yet returned, so this change is not likely to cause a big issue. - Removed Set type as long as there is no demand for it. The code is measurably faster after removing the dynamically allocated errors. A ValueError is 8 bytes and should not require allocation when passed as an error. Returning a fixed error variable instead of a ValueError did not significantly improve performance. I considered returning a syntax error with the position at which the error occurred. The extra management needed for this slowed down the code a bit, so I opted not to support this. This could still be implemented if there turns out to be a need for it. R=r, mpvl CC=golang-dev https://golang.org/cl/14162044	2013-10-08 19:51:01 +02:00
Marcel van Lohuizen	d95a5f25a9	go.text/language: added tag matching algorithm. This algorithm is not based on the CLDR algorithm. It incorporates some ideas from other implementations, but it is designed from scratch. Note that the IANA registry has been updated so this CL also adds some new language codes as well as mappings for deprecated codes. maketables.go has also been modified to work around a bug that was introduced in the latest IANA update. Note that the Match method of the Matcher interface returns an index of the original Tag along with the Tag. Certain users of Matcher, such as services like collation, need to associate data with each Tag. Package collate is updated to use the new Matcher interface. As the Matcher returns the index of the Tag as well, the tables can be simplified to be an array instead of a map. R=r, nigeltao, mpvl CC=golang-dev, markdavis https://golang.org/cl/13819047	2013-10-07 13:14:45 +02:00
Marcel van Lohuizen	9f86e0be98	go.text/language: make it build with Go 1.1, which does not include sort.Stable. Fixes golang/go#6523. R=r CC=golang-dev https://golang.org/cl/14265043	2013-10-04 10:23:17 +02:00
Marcel van Lohuizen	6e2c2aaf7b	go.text/language: added function to parse the value of an HTTP Accept-Language header. It supports a few non-standard language tags that appear relatively frequently in the Accept-Language headers. R=r CC=golang-dev, nigeltao https://golang.org/cl/13974043	2013-09-27 12:43:24 +02:00
Nigel Tao	893a30928a	go.text/encoding: add Replacement and XUserDefined encodings. R=r CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/14022043	2013-09-27 18:12:46 +10:00
Nigel Tao	5d548ed6fc	go.text/encoding/simplifiedchinese: implement HZ-GB2312. The HZ-GB2312 encoding can only represent GBK levels 1 and 2, and not GBK levels 3, 4 or 5, so there is a new testdata/etc-utf-8.txt file. The GBK levels are visualized at http://en.wikipedia.org/wiki/GBK R=r CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/13957043	2013-09-27 10:33:31 +10:00
Nigel Tao	c25434d637	go.text/encoding/simplifiedchinese: implement GB18030. GB18030 is a superset of GBK. I'm not entirely sure why GBK decoding got 6% faster; I'm just happy that there aren't any big regressions. benchmark old MB/s new MB/s speedup BenchmarkGBKDecoder 116.96 123.64 1.06x BenchmarkGBKEncoder 179.31 176.86 0.99x R=r CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/13761047	2013-09-24 11:45:21 +10:00
Marcel van Lohuizen	4a56690205	go.text/language: A few small changes: - Added Tag methods to Base, Script and Region types to convert them into a proper tag. - Factored out part of Canonicalize that does not remake the string (used in upcoming matcher code). - Added "nb" -> "no" conversion in the tables to allow more consistency for code using these tables directly. - changed the short name used in some methods for type Base so that it consistently appears as "b" in the documentation. R=r CC=golang-dev https://golang.org/cl/13647043	2013-09-23 11:03:22 +02:00
Nigel Tao	fd9ccd35d5	go.text/encoding: be consistent when converting codes outside of an encoding.Encoding's repertoire. Specifically, they are converted: - to the Unicode replacement character '\ufffd' when converting to UTF-8, - to the ASCII substitute character '\x1a' when converting from UTF-8. R=r CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/13802043	2013-09-20 16:16:00 +10:00
Nigel Tao	46294c9806	go.text/encoding/japanese: implement ISO-2022-JP. R=r CC=golang-dev https://golang.org/cl/13308048	2013-09-20 11:27:11 +10:00
Nigel Tao	c603fe2b51	go.text/encoding: check that the hard-coded encode switch covers all tables. R=r CC=golang-dev https://golang.org/cl/13333056	2013-09-18 14:56:11 +10:00
Nigel Tao	a60de809e6	go.text/encoding: shrink the japanese and korean encoding data tables. The encoding.test binary size generated by "go test -c" drops by 132320 bytes. Some benchmarks get better, others get worse (but that might just be noise, as there are no code or data changes for Big5 or GBK). benchmark old MB/s new MB/s speedup BenchmarkBig5Encoder 170.12 171.82 1.01x BenchmarkEUCJPEncoder 160.94 156.07 0.97x BenchmarkEUCKREncoder 166.75 171.66 1.03x BenchmarkGBKEncoder 180.07 173.59 0.96x BenchmarkShiftJISEncoder 137.95 143.70 1.04x R=r CC=golang-dev https://golang.org/cl/13321047	2013-09-18 13:42:53 +10:00
Nigel Tao	d94036e178	go.text/encoding/charmap: add all the charmap encodings listed at http://encoding.spec.whatwg.org/#legacy-single-byte-encodings R=r CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/13252049	2013-09-13 20:53:45 +10:00
Nigel Tao	7db818c9d2	go.text/encoding/simplifiedchinese: remove redundant if check. The improvement is barely noticeable, but it surely can't hurt. benchmark old MB/s new MB/s speedup BenchmarkGBKEncoder 181.94 182.25 1.00x R=r CC=golang-dev https://golang.org/cl/13244047	2013-09-12 11:24:54 +10:00
Nigel Tao	ec69b9aa64	go.text/encoding/simplifiedchinese: shrink the encoding data table from 65536 mostly-zero uint16s to 32186 uint16s. There are still explicit zero entries, but no long runs of zeroes. benchmark old MB/s new MB/s speedup BenchmarkGBKEncoder 159.24 180.24 1.13x R=mpvl CC=andybalholm, golang-dev, r, rogpeppe https://golang.org/cl/13253047	2013-09-12 10:36:27 +10:00
Nigel Tao	ea98615240	go.text/encoding/traditionalchinese: new package. R=r CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/13242054	2013-09-11 15:32:34 +10:00
Nigel Tao	b463796019	go.text/encoding/korean: new package. R=r CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/13639043	2013-09-11 10:05:07 +10:00
Marcel van Lohuizen	b38db9f15a	go.text/language: renaming of locale package: - renamed package locale to language - renamed type ID to Tag (language.Tag) - renamed type Language to Base (language.Base) - deleting locale package - changed occurrences of "locale identifier" in comments to "language tag". - renamed method variable names from id or loc to t when the receiver type is Tag. R=r, nigeltao CC=golang-dev https://golang.org/cl/13468043	2013-09-05 11:16:24 +02:00
Nigel Tao	60b8f5ddd6	go.text/encoding: fix off-by-one error in some decoders returning ErrShortDst even when there is sufficient dst space. In theory, a transform.Transformer is allowed to return fewer dst bytes than maximal, but in practice, we shouldn't be wasteful. R=r CC=golang-dev https://golang.org/cl/13512045	2013-09-05 12:42:17 +10:00
Nigel Tao	550a27802b	go.text/encoding/charmap: make the underlying tables global variables instead of created-at-init-time local variables, so that they can be initialized more efficiently as data instead of text. charmap.a size in bytes before/after is 625236 / 278886, or 2.24. R=r CC=golang-dev https://golang.org/cl/13234047	2013-09-05 08:43:09 +10:00
Nigel Tao	577d09f62f	go.text/encoding: move the charmap and unicode encodings to their own dedicated packages. Prior to this change, encoding.a was 1201 KiB (compiled with 6g). Manually removing one charmap from tables.go changed this by 98 KiB. This is already a non-trivial amount of code for the compiler/linker to process just to throw away when building e.g. encoding/japanese, and the number of supported charmaps (currently 11) will go up. R=r CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/13486043	2013-09-03 17:49:19 +10:00
Nigel Tao	21b2da3c5c	go.text/encoding/simplifiedchinese: new package. R=r, chaishushan CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/13462043	2013-09-03 16:00:54 +10:00
Nigel Tao	1ae8c9038e	go.text/encoding/japanese: new package. R=r, andybalholm, bradfitz CC=golang-dev, mpvl, rogpeppe https://golang.org/cl/13179046	2013-08-29 15:38:03 +10:00
Nigel Tao	84e593ec5f	encoding: enable transform.Chain example, now that it's checked in. R=r CC=golang-dev https://golang.org/cl/13290046	2013-08-28 10:45:01 +10:00
Nigel Tao	e79def78d8	go.text/encoding: simplify charmapEncoder now that https://golang.org/cl/12087043/ "remove UTF-8 synchronization" has landed. R=r CC=golang-dev https://golang.org/cl/13185043	2013-08-26 11:17:23 +10:00
Marcel van Lohuizen	5059ed55b5	go.text/locale: Separated Macro canonicalization into Legacy and Macro groups, as defined by CLDR. The Legacy cases are now hard coded. This allows us to handle sh -> sr-Latn without introducing a new data type just for this case. The set of legacy translations is unlikely to change, but maketables.go now checks and fails if the set changes. Also introduced Default CanonType in preparation for adding tag maximization and minimization. Further changes: deviating from CLDR in a few places to not have to deal with legacy choice. CLDR is likely to head in this direction as well, so it prevents incompatibilities down the road. Added CLDR option to force strict compliance to CLDR. Mapping "mo" to "ro-MD" instead of "ro". In cases where ID is used as a locale, preserving this piece of information may be important. It is up to the matching code to establish that "ro" and "ro-MD" are mutually intelligible. R=r CC=golang-dev https://golang.org/cl/12903045	2013-08-21 11:15:02 +02:00
Nigel Tao	4980de1c40	go.text/encoding: add some more Windows encodings (874, 1250, 1251, 1253, 1254, 1255, 1256, 1257, 1258). R=r CC=golang-dev https://golang.org/cl/13120043	2013-08-20 18:18:12 +10:00
Marcel van Lohuizen	c35e1dcc4d	go.text/cldr: fix bug introduced with changeset 17795:749d02164043: cmd/gc: &x panics if x does. Fixes golang/go#6178. Code worked for nil interface values before this change, but now it doesn't. Changed check for nil so that it works again. R=r CC=golang-dev, iant, rsc https://golang.org/cl/12788046	2013-08-18 09:05:32 +02:00
Marcel van Lohuizen	73b7721064	go.text/locale: Expanded API and fixed bug: - Exposed functions for parsing Language, Script, Region and Currency. - Exposed several of the internal methods for these types as well. - Fixed bug where not all private use tags were registered due to a bug in inc. R=r CC=golang-dev https://golang.org/cl/12987043	2013-08-16 12:13:35 +02:00
Nigel Tao	bc48732fe9	go.text/encoding: remove UTF-8 synchronization during encoding. It's not worth it. R=r CC=golang-dev https://golang.org/cl/12087043	2013-07-30 17:14:54 +10:00
Nigel Tao	0e390ba84d	go.text/encoding: add UTF-16 encodings. There are some TODOs concerning the exact behavior for bad UTF-16, but I'll address those after getting consensus on the broad-brush design. candide-utf-16le.txt was generated by iconv -f UTF-8 -t UTF-16LE < candide-utf-8.txt > candide-utf-16le.txt R=r CC=andybalholm, golang-dev, mpvl, rogpeppe https://golang.org/cl/11565043	2013-07-30 11:53:06 +10:00
Nigel Tao	d0bbf51710	go.text/transform: fix s/src/dst/ typo in Writer.Close. R=r CC=golang-dev, mpvl https://golang.org/cl/11823043	2013-07-26 09:22:10 +10:00
Marcel van Lohuizen	b67299ac79	go.text/transform: implementation of Writer, Chain, Nop and Discard. R=nigeltao, r CC=golang-dev https://golang.org/cl/10964043	2013-07-24 16:26:05 +02:00
Nigel Tao	48ba322e43	go.text/encoding: new package that provides character set encodings. Only IBM Code Page 437 and Windows 1252 encodings for now. Others will come in follow-up CLs once the infrastructure's settled. R=r, mpvl, andybalholm CC=golang-dev, rogpeppe https://golang.org/cl/11270043	2013-07-18 14:24:31 +10:00
Nigel Tao	0c7fb33750	go.text/transform: re-arrange TestReader's test cases so that they can be re-used by other tests. R=mpvl, r CC=golang-dev https://golang.org/cl/10672044	2013-07-10 10:37:15 +10:00
Nigel Tao	8a29aad8b1	go.text/transform: improve comments based on review of https://golang.org/cl/10538043. R=r, mpvl CC=golang-dev https://golang.org/cl/10996043	2013-07-09 16:03:01 +10:00
Marcel van Lohuizen	beb5bf8642	go.text/locale: some semantics and API changes - Defined 0 value to be "unspecified" id for languages, scripts and regions. These values are not directly exposed to the user, but rather are used to distinguish between the case where the user explicitly specifies, for example, Zzzz vs not specifying it. - The nil-value for ID now identifies Root. - Use Zyyy (undetermined) instead of Zzzz (uncoded, as used by CLDR) as the code for an unspecified script. CLDR uses Zzzz, but BCP47 prescribes using Zyyy in this case. With the new semantics this choice is somewhat arbitrary, so we stick with BCP47. - Added error to Canonicalize to accommodate future canonicalization algorithms. - Removed Parent and Written as their semantics are rather hazy. - Added Confidence to Language method as well. - Removed Scope methods. Instead, user should just filter pre-defined lists of IDs to mimic its functionality. - Added SetTypeForKey and removed KeyValueString. The same can be done with the former, but is much easier to use for the common case (change the type for a single key on an existing ID). - Removed SimplifyOptions as it is unclear such functionality should be exposed to the user or that it belongs in ID at all. Implemented: - Language, Script, Region - IsCountry R=r CC=golang-dev https://golang.org/cl/10697043	2013-07-08 21:10:26 +02:00
Nigel Tao	79b045a0f2	go.text/transform: new package. This CL only provides the Reader type; Writer will be in a follow-up. R=mpvl, r, mpvl CC=andybalholm, golang-dev, rogpeppe https://golang.org/cl/10538043	2013-07-02 09:56:20 +10:00
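
The RemoveFunc commit (191b11aac8) describes stripping accents by removing nonspacing marks from decomposed text. A minimal standard-library sketch of that filtering step follows. It is not the go.text implementation: `removeFunc` is a hypothetical helper built on `strings.Map`, and the input is assumed to be in NFD already, because normalization itself needs the norm package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// nonspacingMark is the predicate from the commit message: it reports
// whether r belongs to Unicode category Mn (nonspacing mark).
func nonspacingMark(r rune) bool { return unicode.Is(unicode.Mn, r) }

// removeFunc is a hypothetical stand-in for the filtering step only.
// It drops every rune for which f returns true; strings.Map omits a
// rune when the mapping function returns a negative value.
func removeFunc(s string, f func(rune) bool) string {
	return strings.Map(func(r rune) rune {
		if f(r) {
			return -1
		}
		return r
	}, s)
}

func main() {
	// "résumé" written out in decomposed form (NFD) by hand:
	// each 'e' is followed by U+0301 COMBINING ACUTE ACCENT.
	decomposed := "re\u0301sume\u0301"
	fmt.Println(removeFunc(decomposed, nonspacingMark)) // prints "resume"
}
```

Unlike a streaming transformer, this sketch works on whole strings and does nothing for precomposed characters such as U+00E9, which is why the commit chains the removal between norm.NFD and norm.NFC.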

58 commits, all branches