Commit graph

41 commits

Author SHA1 Message Date
Henri Sivonen 8dca2aaa89 Bug 1712928 - Gather telemetry about encoding-unlabeled pages and about Repair Text Encoding usage situations. r=emk
In particular, gather telemetry to evaluate the impact of unlabeled UTF-8
and how detector-triggered reloads would change if content that is ASCII-only
at the time of the initial guess were treated as UTF-8.

Differential Revision: https://phabricator.services.mozilla.com/D140818
2022-03-29 08:04:25 +00:00
Henri Sivonen 649a5b63d8 Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.

The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.

Differences from WebKit and Blink:

* WebKit and Blink honor <meta charset> in <noscript>. This implementation
  does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
  foreign content. This implementation is foreign content-aware. This
  makes a difference for CDATA sections that contain a > before the meta
  as well as style and script elements within foreign content. This could
  happen if a CDATA section that has mysteriously been introduced around
  what looks like a meta tag also contains an earlier tag-looking run of
  text.
* This implementation processes rel=preload and speculative loads that are
  seen before <meta charset> has been seen. WebKit and Blink instead first
  look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
  an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
  an XML declaration, the detection from content does not depend on network
  buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
  the stream if the guess made at that point differs from the first guess.
  (See below for the definition of the input to the first guess.)

Differences from the old spec and Gecko previously:

* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* A later meta now counts as early enough: in addition to a meta within the
  first 1024 bytes (as before), a meta that starts within the first 1024
  bytes now counts as early enough. Additionally, if by then there hasn't
  been a template start tag and head hasn't ended, a meta occurring before
  the earlier of the end of the head or a template start tag counts as early
  enough.
* Meta now counts as not-late even if the encoding label has numeric
  character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
  there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
  the initial chardetng scan is potentially longer than before: the first 1024
  bytes, the token spanning the 1024-byte boundary if there is such a token,
  and, if by then head hasn't ended and there hasn't been a template start
  tag, the bytes up to the end of the template start tag or the end of the
  token that causes head to end, whichever comes first. However, if the token
  implying the end of the head is a text token, only the bytes up to the end
  of the previous non-text token are considered. (This definition avoids
  depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
  instead of expat for extracting the internal encoding label.
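
The "early enough" rule from the list above can be sketched as follows. This is a minimal illustration in Python, not Gecko's implementation: the function name `meta_is_early_enough` and its parameters (byte offsets that a hypothetical tokenizer would report) are assumptions made for the example, and the sketch ignores the element-context exclusions (script, RCDATA, noscript) described elsewhere in this message.

```python
from typing import Optional

PRESCAN_LIMIT = 1024  # bytes; the "first kilobyte" named in the commit message


def meta_is_early_enough(
    meta_start: int,
    head_end: Optional[int] = None,
    template_start: Optional[int] = None,
) -> bool:
    """Decide whether a <meta charset> starting at byte `meta_start` is honored.

    `head_end` and `template_start` are hypothetical byte offsets of the token
    that ends head and of the first template start tag. None means that no
    such token was encountered before the meta.
    """
    # A meta that starts within the first 1024 bytes always counts.
    if meta_start < PRESCAN_LIMIT:
        return True
    # Otherwise the meta must occur before the earlier of the end of head
    # and a template start tag.
    cutoffs = [offset for offset in (head_end, template_start) if offset is not None]
    if not cutoffs:
        return True
    return meta_start < min(cutoffs)
```

For example, `meta_is_early_enough(5000, head_end=6000)` is true under this sketch, while the same meta is treated as late once a template start tag at byte 3000 precedes it.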

Reftests are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.

An encoding declaration has been added to a number of old tests that weren't
intended to test the new speculation behavior, especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .

Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 11:34:20 +00:00
Norisz Fay 1d6984bc21 Backed out changeset 3dfd3c94a105 (bug 1701828) for causing mochitest failures on browser_hsts_host.js CLOSED TREE 2021-12-07 12:05:44 +02:00
Henri Sivonen 58476d7f17 Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.

The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.

Differences from WebKit and Blink:

* WebKit and Blink honor <meta charset> in <noscript>. This implementation
  does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
  foreign content. This implementation is foreign content-aware. This
  makes a difference for CDATA sections that contain a > before the meta
  as well as style and script elements within foreign content. This could
  happen if a CDATA section that has mysteriously been introduced around
  what looks like a meta tag also contains an earlier tag-looking run of
  text.
* This implementation processes rel=preload and speculative loads that are
  seen before <meta charset> has been seen. WebKit and Blink instead first
  look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
  an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
  an XML declaration, the detection from content does not depend on network
  buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
  the stream if the guess made at that point differs from the first guess.
  (See below for the definition of the input to the first guess.)

Differences from the old spec and Gecko previously:

* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* A later meta now counts as early enough: in addition to a meta within the
  first 1024 bytes (as before), a meta that starts within the first 1024
  bytes now counts as early enough. Additionally, if by then there hasn't
  been a template start tag and head hasn't ended, a meta occurring before
  the earlier of the end of the head or a template start tag counts as early
  enough.
* Meta now counts as not-late even if the encoding label has numeric
  character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
  there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
  the initial chardetng scan is potentially longer than before: the first 1024
  bytes, the token spanning the 1024-byte boundary if there is such a token,
  and, if by then head hasn't ended and there hasn't been a template start
  tag, the bytes up to the end of the template start tag or the end of the
  token that causes head to end, whichever comes first. However, if the token
  implying the end of the head is a text token, only the bytes up to the end
  of the previous non-text token are considered. (This definition avoids
  depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
  instead of expat for extracting the internal encoding label.

Reftests are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.

An encoding declaration has been added to a number of old tests that weren't
intended to test the new speculation behavior, especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .

Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-07 07:35:32 +00:00
Andi-Bogdan Postelnicu c8e0f87391 Bug 1519636 - First reformat with clang-format 13.0.0. r=firefox-build-system-reviewers,sylvestre,mhentges
Updated with clang-format version 13.0.0 (taskcluster-OgjH5lasS5K_fvefdRcJVg)

Depends on D131114

Differential Revision: https://phabricator.services.mozilla.com/D129119
2021-11-16 08:07:30 +00:00
Henri Sivonen 5397b4f0a9 Bug 1727491 - Remove support for BOMless unlabeled Latin1 Supplement-range UTF-16LE|BE. r=emk
Differential Revision: https://phabricator.services.mozilla.com/D123596
2021-09-01 09:13:29 +00:00
criss 02cf484af4 Backed out changeset dc6b9ca8f3fa (bug 1727491) for causing mochitest failures on test_bug631751be.html. CLOSED TREE 2021-08-30 11:14:38 +03:00
Henri Sivonen 4233abeb9e Bug 1727491 - Remove support for BOMless unlabeled Latin1 Supplement-range UTF-16LE|BE. r=emk
Differential Revision: https://phabricator.services.mozilla.com/D123596
2021-08-30 07:11:09 +00:00
Henri Sivonen 58e0b2946c Bug 1716290 - Remove protections against the document changing as part of kCharsetFromFinalUserForcedAutoDetection reload. r=emk,emilio
NOTE! In cases where there is no HTTP-layer encoding declaration, and CSS
parsing inherits the encoding from the HTML document, for preloads, this
changes the inherited encoding from windows-1252 to UTF-8 in order to
make the speculative encoding correct in the common `<meta charset=utf-8>`
case.

Differential Revision: https://phabricator.services.mozilla.com/D123593
2021-08-26 18:02:15 +00:00
criss 2be42eea15 Backed out changeset ab805f2926d5 (bug 1716290) for causing failures on link-header-preload.html. CLOSED TREE 2021-08-26 12:07:17 +03:00
Henri Sivonen ff85d45e69 Bug 1716290 - Remove protections against the document changing as part of kCharsetFromFinalUserForcedAutoDetection reload. r=emk
Differential Revision: https://phabricator.services.mozilla.com/D123593
2021-08-26 06:25:31 +00:00
Henri Sivonen 7df7939f77 Bug 1713627 - Remove code obsoleted by replacing the Text Encoding menu with one item. r=jaws,emk
Differential Revision: https://phabricator.services.mozilla.com/D116391
2021-06-21 12:09:01 +00:00
Dorel Luca 2118316ba4 Backed out changeset 4891a17c55e2 (bug 1713627) for Browser-chrome failures in docshell/test/browser/browser_bug673087-1.js. CLOSED TREE 2021-06-21 12:10:54 +03:00
Henri Sivonen abbbf94915 Bug 1713627 - Remove code obsoleted by replacing the Text Encoding menu with one item. r=jaws,emk
Differential Revision: https://phabricator.services.mozilla.com/D116391
2021-06-21 08:09:43 +00:00
Henri Sivonen b98488aa92 Bug 673087 - Honor encoding declared via XML declaration in text/html. r=emk
Differential Revision: https://phabricator.services.mozilla.com/D107806
2021-03-23 09:52:04 +00:00
Henri Sivonen 9b210c311e Bug 1686463 - Gather telemetry about automatic encoding detection outcomes. r=chutten,emk
Differential Revision: https://phabricator.services.mozilla.com/D102397
2021-01-24 00:11:07 +00:00
Henri Sivonen 058e02104c Bug 1648464 - Add an Autodetect item to the Text Encoding menu. r=emk,chutten,Gijs
Take a step towards replacing the encoding menu with a single menu item that
triggers the autodetection manually. However, don't remove anything for now.

* Add an autodetect item.
* Add telemetry for autodetect used in session.
* Add telemetry for non-autodetect used in session.
* Restore and revise telemetry for how the encoding that is being overridden
  was discovered.

Differential Revision: https://phabricator.services.mozilla.com/D81132
2021-01-14 07:06:53 +00:00
Henri Sivonen f0af8088e4 Bug 1647310 - Stop storing charset on cache entries. r=necko-reviewers,dragana
Storing the charset on cache entries makes the code path uselessly different
when loading from cache relative to uncached loads. Also, for future
telemetry purposes, caching the charset obscures its original source.

Differential Revision: https://phabricator.services.mozilla.com/D101570
2021-01-15 09:35:56 +00:00
Sylvestre Ledru caf785c695 Bug 1519636 - Reformat recent changes to the Google coding style r=andi
# ignore-this-changeset

Differential Revision: https://phabricator.services.mozilla.com/D82178
2020-07-04 09:38:43 +00:00
Henri Sivonen 2d63627ce0 Bug 1647728 - Unify kCharsetFromUserForced and kCharsetFromParentForced. r=m_kato
For making further changes less messy.

Differential Revision: https://phabricator.services.mozilla.com/D80813
2020-06-25 03:25:03 +00:00
Henri Sivonen 5c2bad25ab Bug 1551276 - Autodetect legacy encodings on unlabeled pages. r=emk
Differential Revision: https://phabricator.services.mozilla.com/D56362

--HG--
extra : moz-landing-system : lando
2019-12-12 17:50:19 +00:00
Oana Pop Rus df78d6011c Backed out changeset 0810ad586986 (bug 1551276) for wpt failures in ar-ISO-8859-6-late.tentative.html on a CLOSED TREE 2019-12-12 16:38:54 +02:00
Henri Sivonen 07527a83c9 Bug 1551276 - Autodetect legacy encodings on unlabeled pages. r=emk
Differential Revision: https://phabricator.services.mozilla.com/D56362

--HG--
extra : moz-landing-system : lando
2019-12-12 12:59:47 +00:00
Masatoshi Kimura 2d731ea4c9 Bug 1556746 - Remove kCharsetFromHintPrevDoc that was accidentally restored while resolving a merge conflict. r=hsivonen
Differential Revision: https://phabricator.services.mozilla.com/D33642

--HG--
extra : moz-landing-system : lando
2019-06-04 14:47:34 +00:00
Cosmin Sabou d68454d1da Bug 1543077 - Fix merge conflict bustage, add missing comma. r=bustage-fix CLOSED TREE
--HG--
extra : amend_source : 81882c10b7f3bee85f2acb3a38489b566f4684c1
2019-06-03 19:49:09 +03:00
Cosmin Sabou bcd5ff3d98 Merge mozilla-central to autoland.
--HG--
extra : rebase_source : ec8335cc4fb4f7c2594b2b95cd6d5078af2be625
2019-06-03 19:24:20 +03:00
Henri Sivonen ae34dc651a Bug 1543077 part 4 - Have only one item for Japanese in the Text Encoding menu. r=Gijs,emk.
Differential Revision: https://phabricator.services.mozilla.com/D28634
2019-06-03 15:30:41 +03:00
Masatoshi Kimura b411f04b5f Bug 1554589 - Turn nsCharsetSource constants into an enum. r=hsivonen
Since unscoped enums implicitly convert to their underlying types, we don't have to change everything at once. Introducing auto-numbering alone makes this worthwhile, IMO.

Differential Revision: https://phabricator.services.mozilla.com/D32841

--HG--
extra : moz-landing-system : lando
2019-06-03 09:13:50 +00:00
Mihai Alexandru Michis 1dd6cb6ee5 Backed out 6 changesets (bug 1543077) for causing bc failures at docshell/test/browser/browser_bug1543077.js
Backed out changeset f593045cc48f (bug 1543077)
Backed out changeset 25449ba8aceb (bug 1543077)
Backed out changeset ccc438262e29 (bug 1543077)
Backed out changeset 4573c25b1ce0 (bug 1543077)
Backed out changeset 1cbaafb9373a (bug 1543077)
Backed out changeset 1a0e7ced8e47 (bug 1543077)

--HG--
extra : rebase_source : f04bf405303fe03776f0e70b03db076c0a41ae45
2019-05-27 12:00:21 +03:00
Henri Sivonen 533527938d Bug 1543077 part 4 - Have only one item for Japanese in the Text Encoding menu. r=emk,Gijs
Differential Revision: https://phabricator.services.mozilla.com/D28634

--HG--
extra : moz-landing-system : lando
2019-05-27 07:55:27 +00:00
Henri Sivonen 69ad08c987 Bug 1071816 - Support loading unlabeled/BOMless UTF-8 text/html and text/plain files from file: URLs. r=emk. 2018-12-11 10:36:46 +02:00
Sylvestre Ledru 265e672179 Bug 1511181 - Reformat everything to the Google coding style r=ehsan a=clang-format
# ignore-this-changeset

--HG--
extra : amend_source : 4d301d3b0b8711c4692392aa76088ba7fd7d1022
2018-11-30 11:46:48 +01:00
Henri Sivonen a36fff43c5 Bug 741776 - Treat JSON, WebVTT and AppCache manifests as UTF-8 when loaded as plain text. r=Ehsan
MozReview-Commit-ID: 5UvYqJVvX0r

--HG--
extra : rebase_source : 5a6f3dfd97fb06810fde9a4b8b650a7a922a7c20
2016-06-09 14:29:30 +03:00
Henri Sivonen 7eef0de378 Bug 910211 - Guess the fallback encoding from the top-level domain when feasible. r=emk. 2014-02-06 11:08:01 +02:00
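
The idea of guessing a fallback encoding from the top-level domain can be illustrated with a small lookup. This is only a sketch under stated assumptions: the function name `fallback_from_tld`, the three-entry table, and the `windows-1252` default are hypothetical and do not reproduce Gecko's actual mapping.

```python
from urllib.parse import urlsplit

# Illustrative table only; the real mapping is not shown in this log.
TLD_FALLBACKS = {
    "jp": "Shift_JIS",
    "ru": "windows-1251",
    "kr": "EUC-KR",
}


def fallback_from_tld(url: str, locale_default: str = "windows-1252") -> str:
    """Pick a fallback encoding from the URL's top-level domain when one is known.

    Generic TLDs and URLs without a host fall through to `locale_default`,
    which stands in for a locale-derived fallback.
    """
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_FALLBACKS.get(tld, locale_default)
```

A generic domain such as `example.com` gives no hint, so the sketch returns the locale-derived default in that case.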
Henri Sivonen 5ac2a5be04 Bug 910192 non-UI part - Get rid of intl.charset.default as a localizable pref and deduce the fallback from the locale. r=bzbarsky. 2013-11-04 13:24:33 +02:00
Carsten "Tomcat" Book 6646962e49 Backed out changeset 88e0c01e2d81 (bug 910192) bustage on a CLOSED TREE 2013-11-04 13:04:02 +01:00
Henri Sivonen 7af818f242 Bug 910192 non-UI part - Get rid of intl.charset.default as a localizable pref and deduce the fallback from the locale. r=bzbarsky. 2013-11-04 13:24:33 +02:00
Henri Sivonen eb050aeb55 Bug 871161 - Stop inheriting charset where other browsers do not inherit it. r=bzbarsky. 2013-10-16 04:46:10 +03:00
Henri Sivonen 7f2f8a25e2 Bug 234628 part 1 - Make the BOM take precedence over manual encoding overrides. r=smaug. 2013-01-18 16:27:03 +02:00
Henri Sivonen 4038858f3f Bug 716579 - Let a BOM override HTTP-level charset in the HTML and XML parsers. r=smaug. 2012-11-06 13:57:51 +02:00
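
The precedence described in the two BOM commits above (a byte order mark wins over both manual overrides and an HTTP-level charset) can be sketched as below. The function names, the parameter names, the ordering of the manual override ahead of the HTTP charset, and the `windows-1252` fallback are assumptions made for illustration, not Gecko's code.

```python
from typing import Optional

# Byte order marks and the encodings they signal.
BOMS = (
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding signalled by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def choose_encoding(
    data: bytes,
    user_override: Optional[str] = None,
    http_charset: Optional[str] = None,
    fallback: str = "windows-1252",
) -> str:
    """Pick an encoding, letting a BOM take precedence over every other source."""
    bom_encoding = sniff_bom(data)
    if bom_encoding is not None:
        return bom_encoding
    # Assumed order for the remaining sources: manual override, then HTTP.
    if user_override:
        return user_override
    if http_charset:
        return http_charset
    return fallback
```

With this ordering, a document that starts with `FF FE` is decoded as UTF-16LE even if the user forced another encoding or the server sent a different `charset` parameter.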
Henri Sivonen 31192f4b01 Bug 737417 part 1 - Split charset source constants out of nsIParser.h. r=smaug. 2012-03-22 16:42:42 +02:00