gecko-dev/third_party/rust/utf8-ranges
Bastien Orivel faf6cad78c Bug 1580908 - Part 10: Revendor dependencies. r=froydnj
Differential Revision: https://phabricator.services.mozilla.com/D45719

--HG--
rename : third_party/rust/regex-0.2.2/src/freqs.rs => third_party/rust/aho-corasick/src/byte_frequencies.rs
rename : third_party/rust/crc/Cargo.toml => third_party/rust/blake2b_simd/Cargo.toml
rename : third_party/rust/miniz_oxide_c_api/LICENSE => third_party/rust/miniz_oxide/LICENSE
rename : third_party/rust/redox_users/tests/group => third_party/rust/redox_users/tests/etc/group
rename : third_party/rust/redox_users/tests/passwd => third_party/rust/redox_users/tests/etc/passwd
rename : third_party/rust/redox_users/tests/shadow => third_party/rust/redox_users/tests/etc/shadow
rename : third_party/rust/utf8-ranges/src/lib.rs => third_party/rust/regex-syntax/src/utf8.rs
rename : third_party/rust/crossbeam-channel/LICENSE-APACHE => third_party/rust/rust-argon2/LICENSE-APACHE
rename : third_party/rust/memchr-1.0.2/COPYING => third_party/rust/winapi-util/COPYING
rename : third_party/rust/ucd-util/Cargo.toml => third_party/rust/winapi-util/Cargo.toml
rename : third_party/rust/memchr-1.0.2/LICENSE-MIT => third_party/rust/winapi-util/LICENSE-MIT
rename : third_party/rust/memchr-1.0.2/UNLICENSE => third_party/rust/winapi-util/UNLICENSE
extra : moz-landing-system : lando
2019-09-12 21:46:32 +00:00

README.md

DEPRECATED: This crate has been folded into the regex-syntax crate and is now deprecated.

utf8-ranges

This crate converts contiguous ranges of Unicode scalar values to UTF-8 byte ranges. This is useful when constructing byte-based automata from Unicode. Stated differently, it lets one embed UTF-8 decoding in the automaton itself.
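To illustrate why one scalar range turns into several byte ranges, here is a minimal standard-library sketch. It is not this crate's implementation: `split_by_utf8_len` and `utf8_bytes` are hypothetical helper names, and the sketch only performs a coarse split by encoded length and around the surrogate gap, whereas the crate splits further so that each resulting sequence is a simple product of byte ranges.

```rust
/// Split an inclusive range of code points into sub-ranges that each encode
/// to the same number of UTF-8 bytes and contain no surrogates.
/// (Hypothetical helper for illustration; not part of utf8-ranges.)
fn split_by_utf8_len(start: u32, end: u32) -> Vec<(char, char)> {
    // Inclusive blocks of scalar values that share one encoded length. The
    // gap between 0xD7FF and 0xE000 is the surrogate range, which has no
    // valid UTF-8 encoding.
    const BLOCKS: [(u32, u32); 5] = [
        (0x0000, 0x007F),       // 1 byte
        (0x0080, 0x07FF),       // 2 bytes
        (0x0800, 0xD7FF),       // 3 bytes, below the surrogates
        (0xE000, 0xFFFF),       // 3 bytes, above the surrogates
        (0x1_0000, 0x10_FFFF),  // 4 bytes
    ];
    let mut out = Vec::new();
    for &(lo, hi) in BLOCKS.iter() {
        // Intersect the requested range with this block.
        let (s, e) = (start.max(lo), end.min(hi));
        if s <= e {
            // Both endpoints lie inside a block, so they are scalar values.
            if let (Some(a), Some(b)) = (char::from_u32(s), char::from_u32(e)) {
                out.push((a, b));
            }
        }
    }
    out
}

/// UTF-8 bytes of a single scalar value.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // Print each sub-range of the basic multilingual plane together with the
    // encoded bytes of its first and last scalar value.
    for (a, b) in split_by_utf8_len(0x0, 0xFFFF) {
        println!(
            "U+{:04X}..=U+{:04X}: {:02X?} ..= {:02X?}",
            a as u32,
            b as u32,
            utf8_bytes(a),
            utf8_bytes(b)
        );
    }
}
```

For the basic multilingual plane this yields four sub-ranges. The two-byte block, for instance, runs from U+0080 (bytes C2 80) to U+07FF (bytes DF BF), which is why a byte-based automaton needs a lead byte in C2-DF followed by a continuation byte in 80-BF.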


Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/utf8-ranges

Example

This shows how to convert a scalar value range (e.g., the basic multilingual plane) to a sequence of byte-based character classes.

extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    for range in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", range);
    }
}

The output:

[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]

These ranges can then be used to build an automaton, with two guarantees:

  1. Any sequence of bytes matches exactly one of the sequences of ranges, or none of them.
  2. Every matching sequence of bytes is guaranteed to be valid UTF-8. (Erroneous encodings of surrogate code points in UTF-8 cannot match any of the byte ranges above.)
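The two properties can be checked with a small standard-library sketch. It does not use this crate: the table `BMP_SEQUENCES` is copied by hand from the output shown for U+0000 to U+FFFF, and `matches` and `matching_sequences` are hypothetical helper names introduced only for illustration.

```rust
/// One inclusive byte range, e.g. [80-BF].
type ByteRange = (u8, u8);

/// The six sequences listed for U+0000..=U+FFFF, copied by hand.
const BMP_SEQUENCES: &[&[ByteRange]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
];

/// True if `bytes` has the same length as `seq` and every byte falls inside
/// the range at the same position.
fn matches(seq: &[ByteRange], bytes: &[u8]) -> bool {
    seq.len() == bytes.len()
        && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
}

/// Indices of all sequences in the table that `bytes` matches in full.
fn matching_sequences(bytes: &[u8]) -> Vec<usize> {
    (0..BMP_SEQUENCES.len())
        .filter(|&i| matches(BMP_SEQUENCES[i], bytes))
        .collect()
}

fn main() {
    // Property 1, checked on valid input: every scalar value in the range
    // encodes to bytes that match exactly one sequence.
    for cp in 0..=0xFFFFu32 {
        if let Some(c) = char::from_u32(cp) {
            let mut buf = [0u8; 4];
            let encoded = c.encode_utf8(&mut buf).as_bytes();
            assert_eq!(matching_sequences(encoded).len(), 1);
        }
    }

    // Property 2, checked on every possible two-byte input: anything that
    // matches a sequence is valid UTF-8.
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            let bytes = [a, b];
            if !matching_sequences(&bytes).is_empty() {
                assert!(std::str::from_utf8(&bytes).is_ok());
            }
        }
    }

    // ED A0 80 would be an encoding of the surrogate U+D800; it matches nothing.
    assert!(matching_sequences(&[0xED, 0xA0, 0x80]).is_empty());
    println!("all checks passed");
}
```

A real automaton would share states between sequences with common prefixes or suffixes instead of scanning a table, but the acceptance condition is the same.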