Henri Sivonen e7e2436c9b Bug 1581509 - Update encoding_rs to 0.8.20. r=m_kato Differential Revision: https://phabricator.services.mozilla.com/D46000 --HG-- extra : moz-landing-system : lando		2019-09-18 08:28:04 +00:00

encoding_rs

encoding_rs is an implementation of (the non-JavaScript parts of) the Encoding Standard, written in Rust and used in Gecko (starting with Firefox 56).

Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The mem module is a module instead of a separate crate due to internal implementation detail efficiencies.

Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t).
Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko's UTF-16 is potentially invalid.)
Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid UTF-8.
Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
Does the above in streaming (input and output split across multiple buffers) and non-streaming (whole input in a single buffer and whole output in a single buffer) variants.
Avoids copying (borrows) when possible in the non-streaming cases when decoding to or encoding from UTF-8.
Resolves textual labels that identify character encodings in protocol text into type-safe objects representing those encodings conceptually.
Maps the type-safe encoding objects onto strings suitable for returning from document.characterSet.
Validates UTF-8 (in common instruction set scenarios a bit faster for Web workloads than the standard library; hopefully will get upstreamed some day) and ASCII.
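
Two of the behaviors listed above, replacing lone surrogates with the REPLACEMENT CHARACTER and borrowing instead of copying when the input is already valid UTF-8, can be illustrated with the standard library alone. The following is a minimal sketch, not the encoding_rs API: the function names `utf16_to_string_lossy` and `decode_utf8_lossy` are hypothetical, and the `had_errors` flag is only an assumption about what a caller might want reported.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of encoding_rs): converts potentially-invalid
/// UTF-16 into a `String`, replacing each lone surrogate with U+FFFD.
pub fn utf16_to_string_lossy(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

/// Hypothetical helper (not part of encoding_rs): returns the text and a flag
/// saying whether malformed sequences were replaced. Valid input is borrowed,
/// so no copy is made in that case.
pub fn decode_utf8_lossy(bytes: &[u8]) -> (Cow<'_, str>, bool) {
    match String::from_utf8_lossy(bytes) {
        Cow::Borrowed(s) => (Cow::Borrowed(s), false),
        Cow::Owned(s) => (Cow::Owned(s), true),
    }
}
```

With this sketch, a buffer such as `[0x0061, 0xD800, 0x0062]` becomes `a`, U+FFFD, `b`, while a well-formed surrogate pair is kept as a single code point.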

Additionally, encoding_rs::mem does the following:

Checks if a byte buffer contains only ASCII.
Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer contains only Latin1 code points (below U+0100).
Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior (suitable for checking if the Unicode Bidirectional Algorithm can be optimized out).
Combined versions of the above two checks.
Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
Converts UTF-8 and UTF-16 to Latin1 (if in range).
Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
Copies ASCII from one buffer to another up to the first non-ASCII byte.
Converts ASCII to UTF-16 up to the first non-ASCII byte.
Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.
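
To make the semantics of a few of these operations concrete, here is a plain, unoptimized sketch using only the standard library. It is not the `encoding_rs::mem` implementation (which is tuned for speed), and all function names below are hypothetical.

```rust
/// True if every byte is ASCII (below 0x80).
pub fn all_ascii(buf: &[u8]) -> bool {
    buf.is_ascii()
}

/// True if every UTF-16 code unit is a Latin1 code point (below U+0100).
pub fn all_utf16_latin1(buf: &[u16]) -> bool {
    buf.iter().all(|&unit| unit < 0x100)
}

/// Latin1 to UTF-16 is a zero-extension of each byte.
pub fn latin1_to_utf16_vec(src: &[u8]) -> Vec<u16> {
    src.iter().map(|&b| u16::from(b)).collect()
}

/// Index of the first code unit that is not part of valid UTF-16,
/// or `buf.len()` if the whole buffer is valid.
pub fn first_invalid_utf16(buf: &[u16]) -> usize {
    let mut i = 0;
    while i < buf.len() {
        match buf[i] {
            // A high surrogate must be followed by a low surrogate.
            0xD800..=0xDBFF => {
                if i + 1 < buf.len() && (0xDC00..=0xDFFF).contains(&buf[i + 1]) {
                    i += 2;
                } else {
                    return i;
                }
            }
            // A low surrogate without a preceding high surrogate is invalid.
            0xDC00..=0xDFFF => return i,
            _ => i += 1,
        }
    }
    buf.len()
}

/// Overwrites each lone surrogate with U+FFFD in place.
pub fn make_utf16_valid(buf: &mut [u16]) {
    let mut offset = 0;
    loop {
        offset += first_invalid_utf16(&buf[offset..]);
        if offset == buf.len() {
            return;
        }
        buf[offset] = 0xFFFD;
        offset += 1;
    }
}

/// Copies bytes up to the first non-ASCII byte (or until `dst` is full)
/// and returns how many bytes were copied.
pub fn copy_ascii_prefix(src: &[u8], dst: &mut [u8]) -> usize {
    let n = src
        .iter()
        .zip(dst.iter())
        .take_while(|(b, _)| b.is_ascii())
        .count();
    dst[..n].copy_from_slice(&src[..n]);
    n
}
```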

Integration with `std::io`

Notably, the above feature list doesn't include the capability to wrap a std::io::Read, decode it into UTF-8 and present the result via std::io::Read. The encoding_rs_io crate provides that capability.
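
For readers unfamiliar with the pattern, the sketch below shows what "wrapping a `std::io::Read` and presenting UTF-8 via `std::io::Read`" means. It is a toy using only the standard library, not the encoding_rs_io API: the type `Latin1ToUtf8Reader` is hypothetical, and it handles only ISO-8859-1 in the strict sense (each byte maps to the code point of the same value), which keeps the decoding step trivial.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: reads ISO-8859-1 bytes from `inner` and yields UTF-8.
pub struct Latin1ToUtf8Reader<R: Read> {
    inner: R,
    pending: Vec<u8>, // UTF-8 bytes already produced but not yet handed out
    pos: usize,       // read position within `pending`
}

impl<R: Read> Latin1ToUtf8Reader<R> {
    pub fn new(inner: R) -> Self {
        Latin1ToUtf8Reader { inner, pending: Vec::new(), pos: 0 }
    }
}

impl<R: Read> Read for Latin1ToUtf8Reader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if self.pos == self.pending.len() {
            // Everything buffered has been consumed: pull more input.
            let mut raw = [0u8; 1024];
            let n = self.inner.read(&mut raw)?;
            self.pending.clear();
            self.pos = 0;
            for &b in &raw[..n] {
                // Each byte is one code point below U+0100: 1 or 2 UTF-8 bytes.
                let mut tmp = [0u8; 2];
                let encoded = char::from(b).encode_utf8(&mut tmp);
                self.pending.extend_from_slice(encoded.as_bytes());
            }
        }
        // Hand out as much buffered UTF-8 as fits; 0 signals end of input.
        let n = out.len().min(self.pending.len() - self.pos);
        out[..n].copy_from_slice(&self.pending[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

A caller can then use ordinary `Read` methods such as `read_to_string` on the wrapper. Note that a single `read` call may end in the middle of a multi-byte UTF-8 sequence when the output buffer is small; consumers that need whole characters have to account for that.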

Decoding Email

For decoding character encodings that occur in email, use the charset crate instead of using this one directly. (It wraps this crate and adds UTF-7 decoding.)

Windows Code Page Identifier Mappings

For mappings to and from Windows code page identifiers, use the codepage crate.

Preparing Text for the Encoders

Normalizing text into Unicode Normalization Form C prior to encoding text into a legacy encoding minimizes unmappable characters. Text can be normalized to Unicode Normalization Form C using the unic-normal crate.

The exception is windows-1258, which after normalizing to Unicode Normalization Form C requires tone marks to be decomposed in order to minimize unmappable characters. Vietnamese tone marks can be decomposed using the detone crate.

Licensing

Please see the file named COPYRIGHT.

Documentation

Generated API documentation is available online.

There is a long-form write-up about the design and internals of the crate.

C and C++ bindings

An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.

The bindings for the mem module are in the encoding_c_mem crate.

For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.

There's a write-up about the C++ wrappers.

Sample programs

Optional features

There are currently these optional cargo features:

`simd-accel`

Enables SIMD acceleration using the nightly-dependent packed_simd crate.

This is an opt-in feature, because enabling it opts out of Rust's guarantee that future compilers will compile old code (a.k.a. the "stability story").

Currently, this has not been tested to be an improvement except for these targets:

x86_64
i686
aarch64
thumbv7neon

If you use nightly Rust, you use targets whose first component is one of the above, and you are prepared to have to revise your configuration when updating Rust, you should enable this feature. Otherwise, please do not enable this feature.

Note! If you are compiling for a target that does not have 128-bit SIMD enabled as part of the target definition and you are enabling 128-bit SIMD using -C target-feature, you need to enable the core_arch Cargo feature for packed_simd so that it compiles a crates.io snapshot of core_arch instead of using the standard-library copy of core::arch. This is necessary because the core::arch module of the pre-compiled standard library has been compiled with the assumption that the CPU doesn't have 128-bit SIMD. At present this applies mainly to 32-bit ARM targets whose first component does not include the substring neon.

The encoding_rs side of things has not been properly set up for SIMD on POWER, PowerPC, MIPS, etc. at this time, so even if you were to follow the advice from the previous paragraph, you probably shouldn't use the simd-accel option on the less mainstream architectures at this time.

Used by Firefox.

`serde`

Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.

Not used by Firefox.

`fast-legacy-encode`

A catch-all option for enabling the fastest legacy encode options. Does not affect decode speed or UTF-8 encode speed.

At present, this option is equivalent to enabling the following options:

fast-hangul-encode
fast-hanja-encode
fast-kanji-encode
fast-gb-hanzi-encode
fast-big5-hanzi-encode

Adds 176 KB to the binary size.

Not used by Firefox.

`fast-hangul-encode`

Changes the encoding of precomposed Hangul syllables into EUC-KR from binary search over the decode-optimized tables to lookup by index, making Korean plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.
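
The trade-off behind this option can be sketched with a toy table. The code below is an illustration only: `DECODE` is a five-entry stand-in rather than the crate's actual EUC-KR data, and all names are hypothetical. A decode-optimized table maps a position (the "pointer") to a code point, so encoding without extra data means searching that table; the faster alternative spends memory on a second table indexed by code point.

```rust
/// Toy decode-optimized table: pointer (array position) -> code point,
/// in ascending code point order. Not the real EUC-KR index.
const DECODE: [u16; 5] = [0xAC00, 0xAC01, 0xAC04, 0xAC07, 0xAC08];

/// First code point covered by the encode-side index.
const BASE: u16 = 0xAC00;

/// Marker for code points that have no pointer.
const UNMAPPED: u16 = u16::MAX;

/// Encode without extra data: binary search over the decode table.
pub fn encode_by_search(code_point: u16) -> Option<usize> {
    DECODE.binary_search(&code_point).ok()
}

/// Builds the encode-optimized table: (code point - BASE) -> pointer.
/// This is the extra data that costs binary size.
pub fn build_encode_index() -> Vec<u16> {
    let span = (DECODE[DECODE.len() - 1] - BASE) as usize + 1;
    let mut index = vec![UNMAPPED; span];
    for (pointer, &code_point) in DECODE.iter().enumerate() {
        index[(code_point - BASE) as usize] = pointer as u16;
    }
    index
}

/// Encode with the extra table: a single bounds-checked array access.
pub fn encode_by_index(index: &[u16], code_point: u16) -> Option<usize> {
    let slot = *index.get(code_point.checked_sub(BASE)? as usize)?;
    if slot == UNMAPPED {
        None
    } else {
        Some(slot as usize)
    }
}
```

Both functions return the same pointer for every input; the indexed version replaces a logarithmic number of comparisons with one lookup, at the cost of storing a slot for every code point in the covered range, including unmapped ones.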