gecko-dev/.cargo
Henri Sivonen 822fc2ac55 Bug 1702246 - Make the encoding detector tolerate extensions to legacy CJK encodings. r=emk
This patch addresses the issue that legacy CJK encodings have various
extended variants where the core of the encoding is compatible but the edges
are incompatible. Without this patch, the detector rejects e.g. Big5 if the
input has a single character from the UAO extension or a single Windows
end-user-defined character.

Likewise for the other legacy CJK encodings.

This patch tolerates:

* All Big5 extensions (the motivating part of this patch).
* Windows EUDC for EUC-KR.
* Classic Mac OS extensions to Shift_JIS, EUC-KR, GBK, and Big5 to the
  extent practical considering conflicting definitions of what constitutes
  a lead byte in the Encoding Standard but a single-byte extension in
  Classic Mac OS.
* JIS X 0213 / 2004 extensions to Shift_JIS and EUC-JP. (It's unclear if
  these have actual deployment.)

Tolerating means that the occurrence of an extension character doesn't
disqualify a candidate but only applies a penalty to the pending score.
If there is enough other convincing content, it should be able to overcome
the penalty.
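
The penalty-instead-of-disqualification idea can be sketched roughly as
follows. This is only an illustration, not the detector's actual code: the
names `Unit`, `Candidate`, `EXTENSION_PENALTY`, and the numeric values are
all hypothetical and chosen for the example.

```rust
/// Classification of one decoded unit for a candidate encoding.
/// (Hypothetical type for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq)]
enum Unit {
    /// Character in the core repertoire, with a plausibility score.
    Core(i64),
    /// Character from a tolerated extension (e.g. Big5 UAO or EUDC).
    Extension,
    /// Byte sequence that is invalid even with extensions considered.
    Invalid,
}

/// Assumed penalty value; the real detector's weights are not shown here.
const EXTENSION_PENALTY: i64 = 50;

/// Running state for one candidate encoding. (Hypothetical.)
#[derive(Debug)]
struct Candidate {
    score: i64,
    disqualified: bool,
}

impl Candidate {
    fn new() -> Self {
        Candidate { score: 0, disqualified: false }
    }

    /// Update the pending score for one unit of input.
    fn feed(&mut self, unit: Unit) {
        if self.disqualified {
            return;
        }
        match unit {
            Unit::Core(points) => self.score += points,
            // Tolerated: lower the score, but keep the candidate alive.
            Unit::Extension => self.score -= EXTENSION_PENALTY,
            // Still fatal: nothing in the tolerated set explains this.
            Unit::Invalid => self.disqualified = true,
        }
    }

    /// `None` if the candidate was disqualified, else its score.
    fn final_score(&self) -> Option<i64> {
        if self.disqualified {
            None
        } else {
            Some(self.score)
        }
    }
}
```

In this sketch, ten core characters worth 10 points each plus one extension
character leave the candidate with a positive score, so enough convincing
content outweighs the penalty, while a truly invalid sequence still removes
the candidate.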

Differential Revision: https://phabricator.services.mozilla.com/D111372
2021-04-13 13:14:35 +00:00
..
config.in Bug 1702246 - Make the encoding detector tolerate extensions to legacy CJK encodings. r=emk 2021-04-13 13:14:35 +00:00