2009-09-18 13:21:47 +04:00
|
|
|
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
|
2018-11-30 18:39:55 +03:00
|
|
|
/* vim: set sw=2 ts=2 et tw=80: */
|
2012-05-29 19:52:43 +04:00
|
|
|
/* This Source Code Form is subject to the terms of the Mozilla Public
|
|
|
|
* License, v. 2.0. If a copy of the MPL was not distributed with this
|
|
|
|
* file, You can obtain one at http://mozilla.org/MPL/2.0/. */
|
2009-09-18 13:21:47 +04:00
|
|
|
|
|
|
|
#include "nsHtml5StreamParser.h"
|
2017-06-17 05:54:40 +03:00
|
|
|
|
2020-12-07 17:21:43 +03:00
|
|
|
#include <stdlib.h>
|
|
|
|
#include <string.h>
|
|
|
|
#include <algorithm>
|
|
|
|
#include <new>
|
|
|
|
#include <type_traits>
|
|
|
|
#include <utility>
|
2022-04-27 10:24:34 +03:00
|
|
|
#include "ErrorList.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "GeckoProfiler.h"
|
|
|
|
#include "js/GCAPI.h"
|
|
|
|
#include "mozilla/ArrayIterator.h"
|
|
|
|
#include "mozilla/Buffer.h"
|
|
|
|
#include "mozilla/CheckedInt.h"
|
2017-06-17 05:54:40 +03:00
|
|
|
#include "mozilla/DebugOnly.h"
|
|
|
|
#include "mozilla/Encoding.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "mozilla/EncodingDetector.h"
|
|
|
|
#include "mozilla/Likely.h"
|
|
|
|
#include "mozilla/Maybe.h"
|
2020-04-07 18:16:33 +03:00
|
|
|
#include "mozilla/SchedulerGroup.h"
|
2020-12-07 17:25:59 +03:00
|
|
|
#include "mozilla/ScopeExit.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "mozilla/Services.h"
|
|
|
|
#include "mozilla/StaticPrefs_html5.h"
|
|
|
|
#include "mozilla/StaticPrefs_intl.h"
|
|
|
|
#include "mozilla/TaskCategory.h"
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
#include "mozilla/TextUtils.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "mozilla/Tuple.h"
|
|
|
|
#include "mozilla/UniquePtrExtensions.h"
|
|
|
|
#include "mozilla/Unused.h"
|
|
|
|
#include "mozilla/dom/BindingDeclarations.h"
|
|
|
|
#include "mozilla/dom/BrowsingContext.h"
|
|
|
|
#include "mozilla/dom/DebuggerUtilsBinding.h"
|
|
|
|
#include "mozilla/dom/DocGroup.h"
|
|
|
|
#include "mozilla/dom/Document.h"
|
|
|
|
#include "mozilla/mozalloc.h"
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
#include "mozilla/Vector.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "nsContentSink.h"
|
2009-09-18 13:21:47 +04:00
|
|
|
#include "nsContentUtils.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "nsCycleCollectionTraversalCallback.h"
|
2009-09-21 17:18:20 +04:00
|
|
|
#include "nsHtml5AtomTable.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "nsHtml5ByteReadable.h"
|
|
|
|
#include "nsHtml5Highlighter.h"
|
2009-09-25 21:11:02 +04:00
|
|
|
#include "nsHtml5Module.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "nsHtml5OwningUTF16Buffer.h"
|
|
|
|
#include "nsHtml5Parser.h"
|
|
|
|
#include "nsHtml5Speculation.h"
|
2017-06-27 20:38:30 +03:00
|
|
|
#include "nsHtml5StreamParserPtr.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "nsHtml5Tokenizer.h"
|
|
|
|
#include "nsHtml5TreeBuilder.h"
|
|
|
|
#include "nsHtml5TreeOpExecutor.h"
|
|
|
|
#include "nsHtml5TreeOpStage.h"
|
|
|
|
#include "nsIChannel.h"
|
|
|
|
#include "nsIContentSink.h"
|
|
|
|
#include "nsID.h"
|
|
|
|
#include "nsIDTD.h"
|
2017-12-05 00:09:15 +03:00
|
|
|
#include "nsIDocShell.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "nsIEventTarget.h"
|
|
|
|
#include "nsIHttpChannel.h"
|
|
|
|
#include "nsIInputStream.h"
|
2011-11-30 21:44:31 +04:00
|
|
|
#include "nsINestedURI.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "nsIObserverService.h"
|
|
|
|
#include "nsIRequest.h"
|
|
|
|
#include "nsIRunnable.h"
|
|
|
|
#include "nsIScriptError.h"
|
|
|
|
#include "nsIThread.h"
|
2013-07-09 05:45:26 +04:00
|
|
|
#include "nsIThreadRetargetableRequest.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "nsIThreadRetargetableStreamListener.h"
|
|
|
|
#include "nsITimer.h"
|
|
|
|
#include "nsIURI.h"
|
|
|
|
#include "nsJSEnvironment.h"
|
|
|
|
#include "nsLiteralString.h"
|
2013-09-10 21:40:50 +04:00
|
|
|
#include "nsNetUtil.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
#include "nsString.h"
|
|
|
|
#include "nsTPromiseFlatString.h"
|
|
|
|
#include "nsThreadUtils.h"
|
2014-11-03 20:49:05 +03:00
|
|
|
#include "nsXULAppAPI.h"
|
2020-12-07 17:21:43 +03:00
|
|
|
|
2021-03-23 12:52:04 +03:00
|
|
|
extern "C" {
|
|
|
|
// Defined in intl/encoding_glue/src/lib.rs
|
|
|
|
const mozilla::Encoding* xmldecl_parse(const uint8_t* buf, size_t buf_len);
|
|
|
|
};
|
|
|
|
|
2012-11-08 03:04:22 +04:00
|
|
|
using namespace mozilla;
|
2019-01-30 16:35:02 +03:00
|
|
|
using namespace mozilla::dom;
|
2009-09-18 13:21:47 +04:00
|
|
|
|
2010-04-16 14:52:06 +04:00
|
|
|
/*
|
|
|
|
* Note that nsHtml5StreamParser implements cycle collecting AddRef and
|
|
|
|
* Release. Therefore, nsHtml5StreamParser must never be refcounted from
|
|
|
|
* the parser thread!
|
|
|
|
*
|
|
|
|
* To work around this limitation, runnables posted by the main thread to the
|
|
|
|
* parser thread hold their reference to the stream parser in an
|
2018-03-16 18:26:06 +03:00
|
|
|
* nsHtml5StreamParserPtr. Upon creation, nsHtml5StreamParserPtr addrefs the
|
|
|
|
* object it holds
|
2010-04-16 14:52:06 +04:00
|
|
|
* just like a regular nsRefPtr. This is OK, since the creation of the
|
2017-06-27 20:38:30 +03:00
|
|
|
* runnable and the nsHtml5StreamParserPtr happens on the main thread.
|
2010-04-16 14:52:06 +04:00
|
|
|
*
|
|
|
|
* When the runnable is done on the parser thread, the destructor of
|
2017-06-27 20:38:30 +03:00
|
|
|
* nsHtml5StreamParserPtr runs there. It doesn't call Release on the held object
|
2010-04-16 14:52:06 +04:00
|
|
|
* directly. Instead, it posts another runnable back to the main thread where
|
|
|
|
* that runnable calls Release on the wrapped object.
|
|
|
|
*
|
|
|
|
* When posting runnables in the other direction, the runnables have to be
|
|
|
|
* created on the main thread when nsHtml5StreamParser is instantiated and
|
|
|
|
* held for the lifetime of the nsHtml5StreamParser. This works, because the
|
|
|
|
* same runnabled can be dispatched multiple times and currently runnables
|
|
|
|
* posted from the parser thread to main thread don't need to wrap any
|
|
|
|
* runnable-specific data. (In the other direction, the runnables most notably
|
|
|
|
* wrap the byte data of the stream.)
|
|
|
|
*/
|
2009-09-18 13:21:47 +04:00
|
|
|
NS_IMPL_CYCLE_COLLECTING_ADDREF(nsHtml5StreamParser)
|
|
|
|
NS_IMPL_CYCLE_COLLECTING_RELEASE(nsHtml5StreamParser)
|
|
|
|
|
|
|
|
NS_INTERFACE_TABLE_HEAD(nsHtml5StreamParser)
|
2019-12-12 20:50:19 +03:00
|
|
|
NS_INTERFACE_TABLE(nsHtml5StreamParser, nsISupports)
|
2009-09-18 13:21:47 +04:00
|
|
|
NS_INTERFACE_TABLE_TO_MAP_SEGUE_CYCLE_COLLECTION(nsHtml5StreamParser)
|
|
|
|
NS_INTERFACE_MAP_END
|
|
|
|
|
2013-08-02 05:29:05 +04:00
|
|
|
NS_IMPL_CYCLE_COLLECTION_CLASS(nsHtml5StreamParser)
|
|
|
|
|
2009-09-25 21:11:02 +04:00
|
|
|
NS_IMPL_CYCLE_COLLECTION_UNLINK_BEGIN(nsHtml5StreamParser)
|
2010-04-16 14:52:06 +04:00
|
|
|
tmp->DropTimer();
|
2012-11-15 11:32:40 +04:00
|
|
|
NS_IMPL_CYCLE_COLLECTION_UNLINK(mRequest)
|
2012-11-28 03:08:22 +04:00
|
|
|
NS_IMPL_CYCLE_COLLECTION_UNLINK(mOwner)
|
2012-07-30 18:20:58 +04:00
|
|
|
tmp->mExecutorFlusher = nullptr;
|
|
|
|
tmp->mLoadFlusher = nullptr;
|
|
|
|
tmp->mExecutor = nullptr;
|
2009-09-25 21:11:02 +04:00
|
|
|
NS_IMPL_CYCLE_COLLECTION_UNLINK_END
|
|
|
|
|
|
|
|
NS_IMPL_CYCLE_COLLECTION_TRAVERSE_BEGIN(nsHtml5StreamParser)
|
2012-11-15 11:32:40 +04:00
|
|
|
NS_IMPL_CYCLE_COLLECTION_TRAVERSE(mRequest)
|
2012-11-28 03:08:22 +04:00
|
|
|
NS_IMPL_CYCLE_COLLECTION_TRAVERSE(mOwner)
|
2009-09-25 21:11:02 +04:00
|
|
|
// hack: count the strongly owned edge wrapped in the runnable
|
|
|
|
if (tmp->mExecutorFlusher) {
|
|
|
|
NS_CYCLE_COLLECTION_NOTE_EDGE_NAME(cb, "mExecutorFlusher->mExecutor");
|
2018-03-16 18:26:06 +03:00
|
|
|
cb.NoteXPCOMChild(static_cast<nsIContentSink*>(tmp->mExecutor));
|
2009-09-25 21:11:02 +04:00
|
|
|
}
|
2010-02-26 12:18:38 +03:00
|
|
|
// hack: count the strongly owned edge wrapped in the runnable
|
|
|
|
if (tmp->mLoadFlusher) {
|
|
|
|
NS_CYCLE_COLLECTION_NOTE_EDGE_NAME(cb, "mLoadFlusher->mExecutor");
|
2018-03-16 18:26:06 +03:00
|
|
|
cb.NoteXPCOMChild(static_cast<nsIContentSink*>(tmp->mExecutor));
|
2010-02-26 12:18:38 +03:00
|
|
|
}
|
2009-09-25 21:11:02 +04:00
|
|
|
NS_IMPL_CYCLE_COLLECTION_TRAVERSE_END
|
|
|
|
|
2016-04-26 03:23:21 +03:00
|
|
|
class nsHtml5ExecutorFlusher : public Runnable {
|
2018-03-16 18:26:06 +03:00
|
|
|
private:
|
|
|
|
RefPtr<nsHtml5TreeOpExecutor> mExecutor;
|
|
|
|
|
|
|
|
public:
|
|
|
|
explicit nsHtml5ExecutorFlusher(nsHtml5TreeOpExecutor* aExecutor)
|
|
|
|
: Runnable("nsHtml5ExecutorFlusher"), mExecutor(aExecutor) {}
|
|
|
|
NS_IMETHOD Run() override {
|
|
|
|
if (!mExecutor->isInList()) {
|
2019-01-30 16:35:02 +03:00
|
|
|
Document* doc = mExecutor->GetDocument();
|
|
|
|
if (XRE_IsContentProcess() &&
|
|
|
|
nsContentUtils::
|
|
|
|
HighPriorityEventPendingForTopLevelDocumentBeforeContentfulPaint(
|
|
|
|
doc)) {
|
|
|
|
// Possible early paint pending, reuse the runnable and try to
|
|
|
|
// call RunFlushLoop later.
|
|
|
|
nsCOMPtr<nsIRunnable> flusher = this;
|
|
|
|
if (NS_SUCCEEDED(
|
|
|
|
doc->Dispatch(TaskCategory::Network, flusher.forget()))) {
|
2020-09-10 06:02:36 +03:00
|
|
|
PROFILER_MARKER_UNTYPED("HighPrio blocking parser flushing(1)", DOM);
|
2019-01-30 16:35:02 +03:00
|
|
|
return NS_OK;
|
|
|
|
}
|
|
|
|
}
|
2018-03-16 18:26:06 +03:00
|
|
|
mExecutor->RunFlushLoop();
|
2010-02-26 12:18:38 +03:00
|
|
|
}
|
2018-03-16 18:26:06 +03:00
|
|
|
return NS_OK;
|
|
|
|
}
|
2010-02-26 12:18:38 +03:00
|
|
|
};
|
|
|
|
|
2016-04-26 03:23:21 +03:00
|
|
|
class nsHtml5LoadFlusher : public Runnable {
|
2018-03-16 18:26:06 +03:00
|
|
|
private:
|
|
|
|
RefPtr<nsHtml5TreeOpExecutor> mExecutor;
|
|
|
|
|
|
|
|
public:
|
|
|
|
explicit nsHtml5LoadFlusher(nsHtml5TreeOpExecutor* aExecutor)
|
|
|
|
: Runnable("nsHtml5LoadFlusher"), mExecutor(aExecutor) {}
|
|
|
|
NS_IMETHOD Run() override {
|
|
|
|
mExecutor->FlushSpeculativeLoads();
|
|
|
|
return NS_OK;
|
|
|
|
}
|
2009-09-25 21:11:02 +04:00
|
|
|
};
|
2009-09-18 13:21:47 +04:00
|
|
|
|
2009-09-21 17:18:20 +04:00
|
|
|
nsHtml5StreamParser::nsHtml5StreamParser(nsHtml5TreeOpExecutor* aExecutor,
|
2010-07-30 14:15:38 +04:00
|
|
|
nsHtml5Parser* aOwner,
|
|
|
|
eParserMode aMode)
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
: mBomState(eBomState::BOM_SNIFFING_NOT_STARTED),
|
2017-01-24 19:11:43 +03:00
|
|
|
mCharsetSource(kCharsetUninitialized),
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mEncodingSwitchSource(kCharsetUninitialized),
|
|
|
|
mEncoding(X_USER_DEFINED_ENCODING), // Obviously bogus value to notice if
|
|
|
|
// not updated
|
|
|
|
mNeedsEncodingSwitchTo(nullptr),
|
|
|
|
mSeenEligibleMetaCharset(false),
|
|
|
|
mChardetEof(false),
|
|
|
|
#ifdef DEBUG
|
|
|
|
mStartedFeedingDetector(false),
|
|
|
|
mStartedFeedingDevTools(false),
|
|
|
|
#endif
|
2017-01-24 19:11:43 +03:00
|
|
|
mReparseForbidden(false),
|
2021-08-26 21:02:15 +03:00
|
|
|
mForceAutoDetection(false),
|
2021-03-23 12:52:04 +03:00
|
|
|
mChannelHadCharset(false),
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mLookingForMetaCharset(false),
|
|
|
|
mStartsWithLtQuestion(false),
|
|
|
|
mLookingForXmlDeclarationForXmlViewSource(false),
|
|
|
|
mTemplatePushedOrHeadPopped(false),
|
|
|
|
mGtBuffer(nullptr),
|
|
|
|
mGtPos(0),
|
2018-12-04 11:02:04 +03:00
|
|
|
mLastBuffer(nullptr), // Will be filled when starting
|
2009-09-18 13:21:47 +04:00
|
|
|
mExecutor(aExecutor),
|
2018-03-16 18:26:06 +03:00
|
|
|
mTreeBuilder(new nsHtml5TreeBuilder(
|
|
|
|
(aMode == VIEW_SOURCE_HTML || aMode == VIEW_SOURCE_XML)
|
|
|
|
? nullptr
|
|
|
|
: mExecutor->GetStage(),
|
2021-12-22 18:54:49 +03:00
|
|
|
mExecutor->GetStage(), aMode == NORMAL)),
|
2020-02-28 23:49:06 +03:00
|
|
|
mTokenizer(
|
|
|
|
new nsHtml5Tokenizer(mTreeBuilder.get(), aMode == VIEW_SOURCE_XML)),
|
2009-09-21 17:18:20 +04:00
|
|
|
mTokenizerMutex("nsHtml5StreamParser mTokenizerMutex"),
|
2009-09-18 13:21:47 +04:00
|
|
|
mOwner(aOwner),
|
2017-01-24 19:11:43 +03:00
|
|
|
mLastWasCR(false),
|
|
|
|
mStreamState(eHtml5StreamState::STREAM_NOT_STARTED),
|
|
|
|
mSpeculating(false),
|
|
|
|
mAtEOF(false),
|
2009-10-12 17:08:04 +04:00
|
|
|
mSpeculationMutex("nsHtml5StreamParser mSpeculationMutex"),
|
2017-01-24 19:11:43 +03:00
|
|
|
mSpeculationFailureCount(0),
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mNumBytesBuffered(0),
|
2017-01-24 19:11:43 +03:00
|
|
|
mTerminated(false),
|
|
|
|
mInterrupted(false),
|
2017-06-08 02:23:29 +03:00
|
|
|
mEventTarget(nsHtml5Module::GetStreamParserThread()->SerialEventTarget()),
|
2009-09-25 21:11:02 +04:00
|
|
|
mExecutorFlusher(new nsHtml5ExecutorFlusher(aExecutor)),
|
2010-02-26 12:18:38 +03:00
|
|
|
mLoadFlusher(new nsHtml5LoadFlusher(aExecutor)),
|
2017-01-24 19:11:43 +03:00
|
|
|
mInitialEncodingWasFromParentFrame(false),
|
2017-11-23 10:03:41 +03:00
|
|
|
mHasHadErrors(false),
|
2021-01-24 03:11:07 +03:00
|
|
|
mDetectorHasSeenNonAscii(false),
|
2019-12-12 20:50:19 +03:00
|
|
|
mDecodingLocalFileWithoutTokenizing(false),
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mBufferingBytes(false),
|
2018-10-18 02:57:36 +03:00
|
|
|
mFlushTimer(NS_NewTimer(mEventTarget)),
|
2017-01-26 09:45:31 +03:00
|
|
|
mFlushTimerMutex("nsHtml5StreamParser mFlushTimerMutex"),
|
2017-01-24 19:11:43 +03:00
|
|
|
mFlushTimerArmed(false),
|
|
|
|
mFlushTimerEverFired(false),
|
2019-12-16 20:40:26 +03:00
|
|
|
mMode(aMode) {
|
2009-09-21 17:18:20 +04:00
|
|
|
NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
|
2010-04-16 14:52:06 +04:00
|
|
|
#ifdef DEBUG
|
2017-06-08 02:23:29 +03:00
|
|
|
mAtomTable.SetPermittedLookupEventTarget(mEventTarget);
|
2010-04-16 14:52:06 +04:00
|
|
|
#endif
|
2009-09-21 17:18:20 +04:00
|
|
|
mTokenizer->setInterner(&mAtomTable);
|
|
|
|
mTokenizer->setEncodingDeclarationHandler(this);
|
2010-01-12 16:15:10 +03:00
|
|
|
|
2010-07-30 14:15:38 +04:00
|
|
|
if (aMode == VIEW_SOURCE_HTML || aMode == VIEW_SOURCE_XML) {
|
2011-11-01 15:33:11 +04:00
|
|
|
nsHtml5Highlighter* highlighter =
|
|
|
|
new nsHtml5Highlighter(mExecutor->GetStage());
|
2018-03-16 18:26:06 +03:00
|
|
|
mTokenizer->EnableViewSource(highlighter); // takes ownership
|
2011-11-01 15:33:11 +04:00
|
|
|
mTreeBuilder->EnableViewSource(highlighter); // doesn't own
|
2010-07-30 14:15:38 +04:00
|
|
|
}
|
|
|
|
|
2009-09-21 17:18:20 +04:00
|
|
|
// There's a zeroing operator new for everything else
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
nsHtml5StreamParser::~nsHtml5StreamParser() {
|
2009-09-21 17:18:20 +04:00
|
|
|
NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
|
|
|
|
mTokenizer->end();
|
2010-04-16 14:52:06 +04:00
|
|
|
#ifdef DEBUG
|
2017-01-26 09:45:31 +03:00
|
|
|
{
|
|
|
|
mozilla::MutexAutoLock flushTimerLock(mFlushTimerMutex);
|
|
|
|
MOZ_ASSERT(!mFlushTimer, "Flush timer was not dropped before dtor!");
|
|
|
|
}
|
2012-07-30 18:20:58 +04:00
|
|
|
mRequest = nullptr;
|
|
|
|
mUnicodeDecoder = nullptr;
|
|
|
|
mFirstBuffer = nullptr;
|
|
|
|
mExecutor = nullptr;
|
|
|
|
mTreeBuilder = nullptr;
|
|
|
|
mTokenizer = nullptr;
|
|
|
|
mOwner = nullptr;
|
2010-04-16 14:52:06 +04:00
|
|
|
#endif
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
nsresult nsHtml5StreamParser::GetChannel(nsIChannel** aChannel) {
|
2009-09-21 17:18:20 +04:00
|
|
|
NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
|
2018-03-16 18:26:06 +03:00
|
|
|
return mRequest ? CallQueryInterface(mRequest, aChannel)
|
|
|
|
: NS_ERROR_NOT_AVAILABLE;
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
std::tuple<NotNull<const Encoding*>, nsCharsetSource>
|
|
|
|
nsHtml5StreamParser::GuessEncoding(bool aInitial) {
|
2021-01-14 10:06:53 +03:00
|
|
|
MOZ_ASSERT(
|
|
|
|
mCharsetSource != kCharsetFromFinalUserForcedAutoDetection &&
|
2022-03-29 11:04:25 +03:00
|
|
|
mCharsetSource !=
|
|
|
|
kCharsetFromFinalAutoDetectionWouldHaveBeenUTF8InitialWasASCII &&
|
2021-01-24 03:11:07 +03:00
|
|
|
mCharsetSource !=
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8Generic &&
|
2022-03-29 11:04:25 +03:00
|
|
|
mCharsetSource !=
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8GenericInitialWasASCII &&
|
2021-01-24 03:11:07 +03:00
|
|
|
mCharsetSource !=
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8Content &&
|
2022-03-29 11:04:25 +03:00
|
|
|
mCharsetSource !=
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8ContentInitialWasASCII &&
|
2021-01-24 03:11:07 +03:00
|
|
|
mCharsetSource !=
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8DependedOnTLD &&
|
2022-03-29 11:04:25 +03:00
|
|
|
mCharsetSource !=
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8DependedOnTLDInitialWasASCII &&
|
2021-01-14 10:06:53 +03:00
|
|
|
mCharsetSource != kCharsetFromFinalAutoDetectionFile);
|
2021-01-24 03:11:07 +03:00
|
|
|
auto ifHadBeenForced = mDetector->Guess(EmptyCString(), true);
|
2021-01-14 10:06:53 +03:00
|
|
|
auto encoding =
|
2021-08-26 21:02:15 +03:00
|
|
|
mForceAutoDetection
|
|
|
|
? ifHadBeenForced
|
|
|
|
: mDetector->Guess(mTLD, mDecodingLocalFileWithoutTokenizing);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
nsCharsetSource source =
|
2021-01-14 10:06:53 +03:00
|
|
|
aInitial
|
2021-08-26 21:02:15 +03:00
|
|
|
? (mForceAutoDetection
|
2021-01-24 03:11:07 +03:00
|
|
|
? kCharsetFromInitialUserForcedAutoDetection
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
: (mDecodingLocalFileWithoutTokenizing
|
|
|
|
? kCharsetFromFinalAutoDetectionFile
|
|
|
|
: kCharsetFromInitialAutoDetectionWouldNotHaveBeenUTF8Generic))
|
2021-08-26 21:02:15 +03:00
|
|
|
: (mForceAutoDetection
|
2021-01-14 10:06:53 +03:00
|
|
|
? kCharsetFromFinalUserForcedAutoDetection
|
|
|
|
: (mDecodingLocalFileWithoutTokenizing
|
|
|
|
? kCharsetFromFinalAutoDetectionFile
|
2021-01-24 03:11:07 +03:00
|
|
|
: kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8Generic));
|
|
|
|
if (source == kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8Generic) {
|
|
|
|
if (encoding == ISO_2022_JP_ENCODING) {
|
|
|
|
if (EncodingDetector::TldMayAffectGuess(mTLD)) {
|
|
|
|
source = kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8Content;
|
|
|
|
}
|
|
|
|
} else if (!mDetectorHasSeenNonAscii) {
|
|
|
|
source = kCharsetFromInitialAutoDetectionASCII; // deliberately Initial
|
|
|
|
} else if (ifHadBeenForced == UTF_8_ENCODING) {
|
2022-03-29 11:04:25 +03:00
|
|
|
MOZ_ASSERT(mCharsetSource == kCharsetFromInitialAutoDetectionASCII ||
|
|
|
|
mCharsetSource ==
|
|
|
|
kCharsetFromInitialAutoDetectionWouldHaveBeenUTF8);
|
|
|
|
source = kCharsetFromFinalAutoDetectionWouldHaveBeenUTF8InitialWasASCII;
|
2021-01-24 03:11:07 +03:00
|
|
|
} else if (encoding != ifHadBeenForced) {
|
2022-03-29 11:04:25 +03:00
|
|
|
if (mCharsetSource == kCharsetFromInitialAutoDetectionASCII) {
|
|
|
|
source =
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8DependedOnTLDInitialWasASCII;
|
|
|
|
} else {
|
|
|
|
source =
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8DependedOnTLD;
|
|
|
|
}
|
2021-01-24 03:11:07 +03:00
|
|
|
} else if (EncodingDetector::TldMayAffectGuess(mTLD)) {
|
2022-03-29 11:04:25 +03:00
|
|
|
if (mCharsetSource == kCharsetFromInitialAutoDetectionASCII) {
|
|
|
|
source =
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8ContentInitialWasASCII;
|
|
|
|
} else {
|
|
|
|
source = kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8Content;
|
|
|
|
}
|
|
|
|
} else if (mCharsetSource == kCharsetFromInitialAutoDetectionASCII) {
|
|
|
|
source =
|
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8GenericInitialWasASCII;
|
2021-01-14 10:06:53 +03:00
|
|
|
}
|
2021-01-24 03:11:07 +03:00
|
|
|
} else if (source ==
|
|
|
|
kCharsetFromInitialAutoDetectionWouldNotHaveBeenUTF8Generic) {
|
|
|
|
if (encoding == ISO_2022_JP_ENCODING) {
|
|
|
|
if (EncodingDetector::TldMayAffectGuess(mTLD)) {
|
|
|
|
source = kCharsetFromInitialAutoDetectionWouldNotHaveBeenUTF8Content;
|
|
|
|
}
|
|
|
|
} else if (!mDetectorHasSeenNonAscii) {
|
|
|
|
source = kCharsetFromInitialAutoDetectionASCII;
|
|
|
|
} else if (ifHadBeenForced == UTF_8_ENCODING) {
|
2021-01-14 10:06:53 +03:00
|
|
|
source = kCharsetFromInitialAutoDetectionWouldHaveBeenUTF8;
|
2021-01-24 03:11:07 +03:00
|
|
|
} else if (encoding != ifHadBeenForced) {
|
|
|
|
source =
|
|
|
|
kCharsetFromInitialAutoDetectionWouldNotHaveBeenUTF8DependedOnTLD;
|
|
|
|
} else if (EncodingDetector::TldMayAffectGuess(mTLD)) {
|
|
|
|
source = kCharsetFromInitialAutoDetectionWouldNotHaveBeenUTF8Content;
|
2021-01-14 10:06:53 +03:00
|
|
|
}
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
return {encoding, source};
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-07 10:35:32 +03:00
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
void nsHtml5StreamParser::FeedDetector(Span<const uint8_t> aBuffer) {
|
|
|
|
#ifdef DEBUG
|
|
|
|
mStartedFeedingDetector = true;
|
|
|
|
#endif
|
|
|
|
MOZ_ASSERT(!mChardetEof);
|
|
|
|
mDetectorHasSeenNonAscii = mDetector->Feed(aBuffer, false);
|
|
|
|
}
|
|
|
|
|
|
|
|
void nsHtml5StreamParser::DetectorEof() {
|
|
|
|
#ifdef DEBUG
|
|
|
|
mStartedFeedingDetector = true;
|
|
|
|
#endif
|
|
|
|
if (mChardetEof) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
mChardetEof = true;
|
|
|
|
mDetectorHasSeenNonAscii = mDetector->Feed(Span<const uint8_t>(), true);
|
2019-04-16 22:05:26 +03:00
|
|
|
}
|
|
|
|
|
2011-11-30 21:44:31 +04:00
|
|
|
void nsHtml5StreamParser::SetViewSourceTitle(nsIURI* aURL) {
|
2019-08-05 00:33:07 +03:00
|
|
|
MOZ_ASSERT(NS_IsMainThread());
|
|
|
|
|
2020-05-12 12:18:26 +03:00
|
|
|
BrowsingContext* browsingContext =
|
|
|
|
mExecutor->GetDocument()->GetBrowsingContext();
|
|
|
|
if (browsingContext && browsingContext->WatchedByDevTools()) {
|
2019-08-05 00:33:07 +03:00
|
|
|
mURIToSendToDevtools = aURL;
|
|
|
|
|
|
|
|
nsID uuid;
|
2022-02-03 07:39:34 +03:00
|
|
|
nsresult rv = nsID::GenerateUUIDInPlace(uuid);
|
2019-08-05 00:33:07 +03:00
|
|
|
if (!NS_FAILED(rv)) {
|
|
|
|
char buffer[NSID_LENGTH];
|
|
|
|
uuid.ToProvidedString(buffer);
|
|
|
|
mUUIDForDevtools = NS_ConvertASCIItoUTF16(buffer);
|
|
|
|
}
|
2018-07-24 00:43:13 +03:00
|
|
|
}
|
|
|
|
|
2011-11-30 21:44:31 +04:00
|
|
|
if (aURL) {
|
|
|
|
nsCOMPtr<nsIURI> temp;
|
2019-08-07 22:49:40 +03:00
|
|
|
if (aURL->SchemeIs("view-source")) {
|
2011-11-30 21:44:31 +04:00
|
|
|
nsCOMPtr<nsINestedURI> nested = do_QueryInterface(aURL);
|
|
|
|
nested->GetInnerURI(getter_AddRefs(temp));
|
|
|
|
} else {
|
|
|
|
temp = aURL;
|
|
|
|
}
|
2019-08-07 22:49:40 +03:00
|
|
|
if (temp->SchemeIs("data")) {
|
2011-11-30 21:44:31 +04:00
|
|
|
// Avoid showing potentially huge data: URLs. The three last bytes are
|
|
|
|
// UTF-8 for an ellipsis.
|
|
|
|
mViewSourceTitle.AssignLiteral("data:\xE2\x80\xA6");
|
|
|
|
} else {
|
2016-08-31 05:13:59 +03:00
|
|
|
nsresult rv = temp->GetSpec(mViewSourceTitle);
|
|
|
|
if (NS_FAILED(rv)) {
|
|
|
|
mViewSourceTitle.AssignLiteral("\xE2\x80\xA6");
|
|
|
|
}
|
2011-11-30 21:44:31 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-09-18 13:21:47 +04:00
|
|
|
nsresult
|
2018-03-16 18:26:06 +03:00
|
|
|
nsHtml5StreamParser::SetupDecodingAndWriteSniffingBufferAndCurrentSegment(
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
Span<const uint8_t> aPrefix, Span<const uint8_t> aFromSegment) {
|
2009-09-25 21:11:02 +04:00
|
|
|
NS_ASSERTION(IsParserThread(), "Wrong thread!");
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mUnicodeDecoder = mEncoding->NewDecoderWithBOMRemoval();
|
|
|
|
nsresult rv = WriteStreamBytes(aPrefix);
|
|
|
|
NS_ENSURE_SUCCESS(rv, rv);
|
2018-12-06 21:59:56 +03:00
|
|
|
return WriteStreamBytes(aFromSegment);
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
2021-03-23 12:52:04 +03:00
|
|
|
void nsHtml5StreamParser::SetupDecodingFromBom(
|
2017-06-18 14:37:50 +03:00
|
|
|
NotNull<const Encoding*> aEncoding) {
|
2021-03-23 12:52:04 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
2017-06-18 14:37:50 +03:00
|
|
|
mEncoding = aEncoding;
|
2019-12-12 20:50:19 +03:00
|
|
|
mDecodingLocalFileWithoutTokenizing = false;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mLookingForMetaCharset = false;
|
|
|
|
mBufferingBytes = false;
|
2018-08-13 11:01:56 +03:00
|
|
|
mUnicodeDecoder = mEncoding->NewDecoderWithoutBOMHandling();
|
2009-09-18 13:21:47 +04:00
|
|
|
mCharsetSource = kCharsetFromByteOrderMark;
|
2021-08-26 21:02:15 +03:00
|
|
|
mForceAutoDetection = false;
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->SetDocumentCharset(mEncoding, mCharsetSource, false);
|
2009-09-18 13:21:47 +04:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
|
|
|
mTokenizer->StartViewSourceCharacters();
|
|
|
|
}
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
2021-03-23 12:52:04 +03:00
|
|
|
void nsHtml5StreamParser::SetupDecodingFromUtf16BogoXml(
|
|
|
|
NotNull<const Encoding*> aEncoding) {
|
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
|
|
|
mEncoding = aEncoding;
|
|
|
|
mDecodingLocalFileWithoutTokenizing = false;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mLookingForMetaCharset = false;
|
|
|
|
mBufferingBytes = false;
|
2021-03-23 12:52:04 +03:00
|
|
|
mUnicodeDecoder = mEncoding->NewDecoderWithoutBOMHandling();
|
|
|
|
mCharsetSource = kCharsetFromXmlDeclarationUtf16;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mForceAutoDetection = false;
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->SetDocumentCharset(mEncoding, mCharsetSource, false);
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
|
|
|
mTokenizer->StartViewSourceCharacters();
|
|
|
|
}
|
2021-03-23 12:52:04 +03:00
|
|
|
auto dst = mLastBuffer->TailAsSpan(READ_BUFFER_SIZE);
|
|
|
|
dst[0] = '<';
|
|
|
|
dst[1] = '?';
|
|
|
|
dst[2] = 'x';
|
|
|
|
mLastBuffer->AdvanceEnd(3);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(!mStartedFeedingDevTools);
|
|
|
|
OnNewContent(dst.To(3));
|
2021-03-23 12:52:04 +03:00
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
size_t nsHtml5StreamParser::LengthOfLtContainingPrefixInSecondBuffer() {
|
|
|
|
MOZ_ASSERT(mBufferedBytes.Length() <= 2);
|
|
|
|
if (mBufferedBytes.Length() < 2) {
|
|
|
|
return 0;
|
2019-12-12 20:50:19 +03:00
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
Buffer<uint8_t>& second = mBufferedBytes[1];
|
|
|
|
const uint8_t* elements = second.Elements();
|
|
|
|
const uint8_t* lt = (const uint8_t*)memchr(elements, '>', second.Length());
|
|
|
|
if (lt) {
|
|
|
|
return (lt - elements) + 1;
|
2019-04-17 14:36:52 +03:00
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
return 0;
|
2019-04-17 14:36:52 +03:00
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
nsresult nsHtml5StreamParser::SniffStreamBytes(Span<const uint8_t> aFromSegment,
|
2018-12-06 21:59:56 +03:00
|
|
|
bool aEof) {
|
2019-04-17 14:36:52 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT_IF(aEof, aFromSegment.IsEmpty());
|
2021-11-16 19:02:04 +03:00
|
|
|
|
2022-03-29 11:04:25 +03:00
|
|
|
if (mCharsetSource >=
|
|
|
|
kCharsetFromFinalAutoDetectionWouldHaveBeenUTF8InitialWasASCII &&
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mCharsetSource <= kCharsetFromFinalUserForcedAutoDetection) {
|
|
|
|
if (mMode == PLAIN_TEXT || mMode == VIEW_SOURCE_PLAIN) {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncDetectorReloadPlain", true,
|
|
|
|
0);
|
2021-03-23 12:52:04 +03:00
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncDetectorReload", true, 0);
|
2021-03-23 12:52:04 +03:00
|
|
|
}
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
2017-06-18 14:37:50 +03:00
|
|
|
// mEncoding and mCharsetSource potentially have come from channel or higher
|
2012-11-06 15:57:51 +04:00
|
|
|
// by now. If we find a BOM, SetupDecodingFromBom() will overwrite them.
|
2017-06-18 14:37:50 +03:00
|
|
|
// If we don't find a BOM, the previously set values of mEncoding and
|
2012-11-06 15:57:51 +04:00
|
|
|
// mCharsetSource are not modified by the BOM sniffing here.
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
static uint8_t utf8[] = {0xEF, 0xBB};
|
|
|
|
static uint8_t utf16le[] = {0xFF};
|
|
|
|
static uint8_t utf16be[] = {0xFE};
|
|
|
|
static uint8_t utf16leXml[] = {'<', 0x00, '?', 0x00, 'x'};
|
|
|
|
static uint8_t utf16beXml[] = {0x00, '<', 0x00, '?', 0x00};
|
|
|
|
// Buffer for replaying past bytes based on state machine state. If
|
|
|
|
// writing this from scratch, probably wouldn't do it this way, but
|
|
|
|
// let's keep the changes to a minimum.
|
|
|
|
const uint8_t* prefix = utf8;
|
|
|
|
size_t prefixLength = 0;
|
|
|
|
if (aEof && mBomState == BOM_SNIFFING_NOT_STARTED) {
|
|
|
|
// Avoid handling aEof in the BOM_SNIFFING_NOT_STARTED state below.
|
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
for (size_t i = 0;
|
|
|
|
(i < aFromSegment.Length() && mBomState != BOM_SNIFFING_OVER) || aEof;
|
|
|
|
i++) {
|
2009-09-18 13:21:47 +04:00
|
|
|
switch (mBomState) {
|
|
|
|
case BOM_SNIFFING_NOT_STARTED:
|
2021-03-23 12:52:04 +03:00
|
|
|
MOZ_ASSERT(i == 0, "Bad BOM sniffing state.");
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(!aEof, "Should have checked for aEof above!");
|
2018-12-06 21:59:56 +03:00
|
|
|
switch (aFromSegment[0]) {
|
2009-09-18 13:21:47 +04:00
|
|
|
case 0xEF:
|
|
|
|
mBomState = SEEN_UTF_8_FIRST_BYTE;
|
|
|
|
break;
|
|
|
|
case 0xFF:
|
|
|
|
mBomState = SEEN_UTF_16_LE_FIRST_BYTE;
|
|
|
|
break;
|
|
|
|
case 0xFE:
|
|
|
|
mBomState = SEEN_UTF_16_BE_FIRST_BYTE;
|
|
|
|
break;
|
2021-03-23 12:52:04 +03:00
|
|
|
case 0x00:
|
|
|
|
if (mCharsetSource < kCharsetFromXmlDeclarationUtf16 &&
|
2021-08-26 21:02:15 +03:00
|
|
|
mCharsetSource != kCharsetFromChannel) {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_BE_XML_FIRST;
|
|
|
|
} else {
|
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
case '<':
|
2021-03-23 12:52:04 +03:00
|
|
|
if (mCharsetSource < kCharsetFromXmlDeclarationUtf16 &&
|
2021-08-26 21:02:15 +03:00
|
|
|
mCharsetSource != kCharsetFromChannel) {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_LE_XML_FIRST;
|
|
|
|
} else {
|
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
2009-09-18 13:21:47 +04:00
|
|
|
default:
|
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_LE_FIRST_BYTE:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 0xFE) {
|
2021-03-23 12:52:04 +03:00
|
|
|
SetupDecodingFromBom(UTF_16LE_ENCODING);
|
2018-12-06 21:59:56 +03:00
|
|
|
return WriteStreamBytes(aFromSegment.From(i + 1));
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16le;
|
|
|
|
prefixLength = 1 - i;
|
2009-09-18 13:21:47 +04:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_BE_FIRST_BYTE:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 0xFF) {
|
2021-03-23 12:52:04 +03:00
|
|
|
SetupDecodingFromBom(UTF_16BE_ENCODING);
|
2018-12-06 21:59:56 +03:00
|
|
|
return WriteStreamBytes(aFromSegment.From(i + 1));
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16be;
|
|
|
|
prefixLength = 1 - i;
|
2009-09-18 13:21:47 +04:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_8_FIRST_BYTE:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 0xBB) {
|
2009-09-18 13:21:47 +04:00
|
|
|
mBomState = SEEN_UTF_8_SECOND_BYTE;
|
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefixLength = 1 - i;
|
2009-09-18 13:21:47 +04:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_8_SECOND_BYTE:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 0xBF) {
|
2021-03-23 12:52:04 +03:00
|
|
|
SetupDecodingFromBom(UTF_8_ENCODING);
|
|
|
|
return WriteStreamBytes(aFromSegment.From(i + 1));
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefixLength = 2 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_BE_XML_FIRST:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == '<') {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_BE_XML_SECOND;
|
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16beXml;
|
|
|
|
prefixLength = 1 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_BE_XML_SECOND:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 0x00) {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_BE_XML_THIRD;
|
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16beXml;
|
|
|
|
prefixLength = 2 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_BE_XML_THIRD:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == '?') {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_BE_XML_FOURTH;
|
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16beXml;
|
|
|
|
prefixLength = 3 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_BE_XML_FOURTH:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 0x00) {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_BE_XML_FIFTH;
|
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16beXml;
|
|
|
|
prefixLength = 4 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_BE_XML_FIFTH:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 'x') {
|
2021-03-23 12:52:04 +03:00
|
|
|
SetupDecodingFromUtf16BogoXml(UTF_16BE_ENCODING);
|
|
|
|
return WriteStreamBytes(aFromSegment.From(i + 1));
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16beXml;
|
|
|
|
prefixLength = 5 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_LE_XML_FIRST:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 0x00) {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_LE_XML_SECOND;
|
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == '?' &&
|
|
|
|
!(mMode == PLAIN_TEXT || mMode == VIEW_SOURCE_PLAIN)) {
|
|
|
|
mStartsWithLtQuestion = true;
|
|
|
|
}
|
|
|
|
prefix = utf16leXml;
|
|
|
|
prefixLength = 1 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_LE_XML_SECOND:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == '?') {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_LE_XML_THIRD;
|
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16leXml;
|
|
|
|
prefixLength = 2 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_LE_XML_THIRD:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 0x00) {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_LE_XML_FOURTH;
|
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16leXml;
|
|
|
|
prefixLength = 3 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_LE_XML_FOURTH:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 'x') {
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = SEEN_UTF_16_LE_XML_FIFTH;
|
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16leXml;
|
|
|
|
prefixLength = 4 - i;
|
2021-03-23 12:52:04 +03:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case SEEN_UTF_16_LE_XML_FIFTH:
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!aEof && aFromSegment[i] == 0x00) {
|
2021-03-23 12:52:04 +03:00
|
|
|
SetupDecodingFromUtf16BogoXml(UTF_16LE_ENCODING);
|
2018-12-06 21:59:56 +03:00
|
|
|
return WriteStreamBytes(aFromSegment.From(i + 1));
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
prefix = utf16leXml;
|
|
|
|
prefixLength = 5 - i;
|
2009-09-18 13:21:47 +04:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
break;
|
|
|
|
default:
|
2010-04-09 18:11:56 +04:00
|
|
|
mBomState = BOM_SNIFFING_OVER;
|
|
|
|
break;
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (aEof) {
|
|
|
|
break;
|
|
|
|
}
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
2012-11-06 15:57:51 +04:00
|
|
|
// if we get here, there either was no BOM or the BOM sniffing isn't complete
|
|
|
|
// yet
|
2017-04-27 13:27:03 +03:00
|
|
|
|
2013-11-25 12:06:56 +04:00
|
|
|
MOZ_ASSERT(mCharsetSource != kCharsetFromByteOrderMark,
|
|
|
|
"Should not come here if BOM was found.");
|
2021-03-23 12:52:04 +03:00
|
|
|
MOZ_ASSERT(mCharsetSource != kCharsetFromXmlDeclarationUtf16,
|
|
|
|
"Should not come here if UTF-16 bogo-XML declaration was found.");
|
2013-11-25 12:06:56 +04:00
|
|
|
MOZ_ASSERT(mCharsetSource != kCharsetFromOtherComponent,
|
|
|
|
"kCharsetFromOtherComponent is for XSLT.");
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mBomState == BOM_SNIFFING_OVER) {
|
|
|
|
if (mMode == VIEW_SOURCE_XML && mStartsWithLtQuestion &&
|
|
|
|
mCharsetSource < kCharsetFromChannel) {
|
|
|
|
// Sniff for XML declaration only.
|
|
|
|
MOZ_ASSERT(!mLookingForXmlDeclarationForXmlViewSource);
|
|
|
|
MOZ_ASSERT(!aEof);
|
|
|
|
MOZ_ASSERT(!mLookingForMetaCharset);
|
|
|
|
MOZ_ASSERT(!mDecodingLocalFileWithoutTokenizing);
|
|
|
|
// Maybe we've already buffered a '>'.
|
|
|
|
MOZ_ASSERT(!mBufferedBytes.IsEmpty(),
|
|
|
|
"How did at least <? not get buffered?");
|
|
|
|
Buffer<uint8_t>& first = mBufferedBytes[0];
|
|
|
|
const Encoding* encoding =
|
|
|
|
xmldecl_parse(first.Elements(), first.Length());
|
|
|
|
if (encoding) {
|
2021-12-07 13:05:44 +03:00
|
|
|
mEncoding = WrapNotNull(encoding);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mCharsetSource = kCharsetFromXmlDeclaration;
|
|
|
|
} else if (memchr(first.Elements(), '>', first.Length())) {
|
|
|
|
// There was a '>', but an encoding still wasn't found.
|
|
|
|
; // fall through to commit to the UTF-8 default.
|
|
|
|
} else if (size_t lengthOfPrefix =
|
|
|
|
LengthOfLtContainingPrefixInSecondBuffer()) {
|
|
|
|
// This can only happen if the first buffer was a lone '<', because
|
|
|
|
// we come here upon seeing the second byte '?' if the first two bytes
|
|
|
|
// were "<?". That is, the only way how we aren't dealing with the first
|
|
|
|
// buffer is if the first buffer only contained a single '<' and we are
|
|
|
|
// dealing with the second buffer that starts with '?'.
|
|
|
|
MOZ_ASSERT(first.Length() == 1);
|
|
|
|
MOZ_ASSERT(mBufferedBytes[1][0] == '?');
|
|
|
|
// Our scanner for XML declaration-like syntax wants to see a contiguous
|
|
|
|
// buffer, so let's linearize the data. (Ideally, the XML declaration
|
|
|
|
// scanner would be incremental, but this is the rare path anyway.)
|
|
|
|
Vector<uint8_t> contiguous;
|
|
|
|
if (!contiguous.append(first.Elements(), first.Length())) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return NS_ERROR_OUT_OF_MEMORY;
|
|
|
|
}
|
|
|
|
if (!contiguous.append(mBufferedBytes[1].Elements(), lengthOfPrefix)) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return NS_ERROR_OUT_OF_MEMORY;
|
|
|
|
}
|
|
|
|
encoding = xmldecl_parse(contiguous.begin(), contiguous.length());
|
|
|
|
if (encoding) {
|
|
|
|
mEncoding = WrapNotNull(encoding);
|
|
|
|
mCharsetSource = kCharsetFromXmlDeclaration;
|
|
|
|
}
|
|
|
|
// else no XML decl, commit to the UTF-8 default.
|
|
|
|
} else {
|
|
|
|
MOZ_ASSERT(mBufferingBytes);
|
|
|
|
mLookingForXmlDeclarationForXmlViewSource = true;
|
|
|
|
return NS_OK;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-07 10:35:32 +03:00
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
} else if (mMode != VIEW_SOURCE_XML &&
|
|
|
|
(mForceAutoDetection || mCharsetSource < kCharsetFromChannel)) {
|
|
|
|
// In order to use the buffering logic for meta with mForceAutoDetection,
|
|
|
|
// we set mLookingForMetaCharset but still actually potentially ignore the
|
|
|
|
// meta.
|
|
|
|
mFirstBufferOfMetaScan = mFirstBuffer;
|
|
|
|
MOZ_ASSERT(mLookingForMetaCharset);
|
|
|
|
|
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
return r.unwrapErr();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
auto r = mTreeBuilder->Flush();
|
|
|
|
if (r.isErr()) {
|
|
|
|
return r.unwrapErr();
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
}
|
|
|
|
// Encoding committer flushes the ops on the main thread.
|
|
|
|
|
|
|
|
mozilla::MutexAutoLock speculationAutoLock(mSpeculationMutex);
|
|
|
|
nsHtml5Speculation* speculation = new nsHtml5Speculation(
|
|
|
|
mFirstBuffer, mFirstBuffer->getStart(), mTokenizer->getLineNumber(),
|
|
|
|
mTreeBuilder->newSnapshot());
|
|
|
|
MOZ_ASSERT(!mFlushTimerArmed, "How did we end up arming the timer?");
|
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
|
|
|
mTokenizer->SetViewSourceOpSink(speculation);
|
|
|
|
mTokenizer->StartViewSourceCharacters();
|
2021-01-14 10:06:53 +03:00
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(mMode != VIEW_SOURCE_XML);
|
|
|
|
mTreeBuilder->SetOpSink(speculation);
|
|
|
|
}
|
|
|
|
mSpeculations.AppendElement(speculation); // adopts the pointer
|
|
|
|
mSpeculating = true;
|
|
|
|
} else {
|
|
|
|
mLookingForMetaCharset = false;
|
|
|
|
mBufferingBytes = false;
|
|
|
|
mDecodingLocalFileWithoutTokenizing = false;
|
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
|
|
|
mTokenizer->StartViewSourceCharacters();
|
2021-01-14 10:06:53 +03:00
|
|
|
}
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->SetDocumentCharset(mEncoding, mCharsetSource, false);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
return SetupDecodingAndWriteSniffingBufferAndCurrentSegment(
|
|
|
|
Span(prefix, prefixLength), aFromSegment);
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
return NS_OK;
|
|
|
|
}
|
|
|
|
|
2019-08-05 00:33:07 +03:00
|
|
|
class AddContentRunnable : public Runnable {
|
|
|
|
public:
|
|
|
|
AddContentRunnable(const nsAString& aParserID, nsIURI* aURI,
|
|
|
|
Span<const char16_t> aData, bool aComplete)
|
|
|
|
: Runnable("AddContent") {
|
|
|
|
nsAutoCString spec;
|
|
|
|
aURI->GetSpec(spec);
|
|
|
|
mData.mUri.Construct(NS_ConvertUTF8toUTF16(spec));
|
|
|
|
mData.mParserID.Construct(aParserID);
|
|
|
|
mData.mContents.Construct(aData.Elements(), aData.Length());
|
|
|
|
mData.mComplete.Construct(aComplete);
|
|
|
|
}
|
|
|
|
|
|
|
|
NS_IMETHOD Run() override {
|
|
|
|
nsAutoString json;
|
|
|
|
if (!mData.ToJSON(json)) {
|
|
|
|
return NS_ERROR_FAILURE;
|
|
|
|
}
|
|
|
|
|
|
|
|
nsCOMPtr<nsIObserverService> obsService = services::GetObserverService();
|
|
|
|
if (obsService) {
|
|
|
|
obsService->NotifyObservers(nullptr, "devtools-html-content",
|
|
|
|
PromiseFlatString(json).get());
|
|
|
|
}
|
|
|
|
|
|
|
|
return NS_OK;
|
|
|
|
}
|
|
|
|
|
|
|
|
HTMLContent mData;
|
|
|
|
};
|
|
|
|
|
|
|
|
inline void nsHtml5StreamParser::OnNewContent(Span<const char16_t> aData) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
#ifdef DEBUG
|
|
|
|
mStartedFeedingDevTools = true;
|
|
|
|
#endif
|
2019-08-05 00:33:07 +03:00
|
|
|
if (mURIToSendToDevtools) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (aData.IsEmpty()) {
|
|
|
|
// Optimize out the runnable.
|
|
|
|
return;
|
|
|
|
}
|
2019-08-05 00:33:07 +03:00
|
|
|
NS_DispatchToMainThread(new AddContentRunnable(mUUIDForDevtools,
|
|
|
|
mURIToSendToDevtools, aData,
|
|
|
|
/* aComplete */ false));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
inline void nsHtml5StreamParser::OnContentComplete() {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
#ifdef DEBUG
|
|
|
|
mStartedFeedingDevTools = true;
|
|
|
|
#endif
|
2019-08-05 00:33:07 +03:00
|
|
|
if (mURIToSendToDevtools) {
|
|
|
|
NS_DispatchToMainThread(new AddContentRunnable(
|
|
|
|
mUUIDForDevtools, mURIToSendToDevtools, Span<const char16_t>(),
|
|
|
|
/* aComplete */ true));
|
|
|
|
mURIToSendToDevtools = nullptr;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-12-06 21:59:56 +03:00
|
|
|
nsresult nsHtml5StreamParser::WriteStreamBytes(
|
|
|
|
Span<const uint8_t> aFromSegment) {
|
2009-09-25 21:11:02 +04:00
|
|
|
NS_ASSERTION(IsParserThread(), "Wrong thread!");
|
2022-03-22 17:06:03 +03:00
|
|
|
mTokenizerMutex.AssertCurrentThreadOwns();
|
2013-10-16 19:16:36 +04:00
|
|
|
// mLastBuffer should always point to a buffer of the size
|
2018-12-04 11:02:04 +03:00
|
|
|
// READ_BUFFER_SIZE.
|
2013-10-16 19:16:36 +04:00
|
|
|
if (!mLastBuffer) {
|
|
|
|
NS_WARNING("mLastBuffer should not be null!");
|
2014-11-05 18:44:37 +03:00
|
|
|
MarkAsBroken(NS_ERROR_NULL_POINTER);
|
2013-10-16 19:16:36 +04:00
|
|
|
return NS_ERROR_NULL_POINTER;
|
|
|
|
}
|
2017-04-27 13:27:03 +03:00
|
|
|
size_t totalRead = 0;
|
2018-12-06 21:59:56 +03:00
|
|
|
auto src = aFromSegment;
|
2009-09-18 13:21:47 +04:00
|
|
|
for (;;) {
|
2018-12-04 11:02:04 +03:00
|
|
|
auto dst = mLastBuffer->TailAsSpan(READ_BUFFER_SIZE);
|
2021-11-08 11:14:00 +03:00
|
|
|
auto [result, read, written, hadErrors] =
|
2017-04-27 13:27:03 +03:00
|
|
|
mUnicodeDecoder->DecodeToUTF16(src, dst, false);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!(mLookingForMetaCharset || mDecodingLocalFileWithoutTokenizing)) {
|
2019-08-05 00:33:07 +03:00
|
|
|
OnNewContent(dst.To(written));
|
2018-07-24 00:43:13 +03:00
|
|
|
}
|
2018-04-06 00:20:28 +03:00
|
|
|
if (hadErrors && !mHasHadErrors) {
|
|
|
|
mHasHadErrors = true;
|
|
|
|
if (mEncoding == UTF_8_ENCODING) {
|
|
|
|
mTreeBuilder->TryToEnableEncodingMenu();
|
|
|
|
}
|
|
|
|
}
|
2017-04-27 13:27:03 +03:00
|
|
|
src = src.From(read);
|
|
|
|
totalRead += read;
|
|
|
|
mLastBuffer->AdvanceEnd(written);
|
|
|
|
if (result == kOutputFull) {
|
2015-10-18 08:24:48 +03:00
|
|
|
RefPtr<nsHtml5OwningUTF16Buffer> newBuf =
|
2011-09-28 16:45:17 +04:00
|
|
|
nsHtml5OwningUTF16Buffer::FalliblyCreate(READ_BUFFER_SIZE);
|
|
|
|
if (!newBuf) {
|
2017-06-22 14:32:34 +03:00
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
2011-09-28 16:45:17 +04:00
|
|
|
return NS_ERROR_OUT_OF_MEMORY;
|
|
|
|
}
|
2020-02-13 17:38:48 +03:00
|
|
|
mLastBuffer = (mLastBuffer->next = std::move(newBuf));
|
2009-09-18 13:21:47 +04:00
|
|
|
} else {
|
2018-12-06 21:59:56 +03:00
|
|
|
MOZ_ASSERT(totalRead == aFromSegment.Length(),
|
2017-04-27 13:27:03 +03:00
|
|
|
"The Unicode decoder consumed the wrong number of bytes.");
|
2021-09-28 03:02:47 +03:00
|
|
|
(void)totalRead;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!mLookingForMetaCharset && mDecodingLocalFileWithoutTokenizing &&
|
|
|
|
mNumBytesBuffered == LOCAL_FILE_UTF_8_BUFFER_SIZE) {
|
|
|
|
MOZ_ASSERT(!mStartedFeedingDetector);
|
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
|
|
|
FeedDetector(buffer);
|
|
|
|
}
|
|
|
|
// If the file is exactly LOCAL_FILE_UTF_8_BUFFER_SIZE bytes long
|
|
|
|
// we end up not considering the EOF. That's not fatal, since we
|
|
|
|
// don't consider the EOF if the file is
|
|
|
|
// LOCAL_FILE_UTF_8_BUFFER_SIZE + 1 bytes long.
|
|
|
|
auto [encoding, source] = GuessEncoding(true);
|
|
|
|
mCharsetSource = source;
|
|
|
|
if (encoding != mEncoding) {
|
|
|
|
mEncoding = encoding;
|
2022-04-27 10:24:34 +03:00
|
|
|
nsresult rv = ReDecodeLocalFile();
|
|
|
|
if (NS_FAILED(rv)) {
|
|
|
|
return rv;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
} else {
|
|
|
|
MOZ_ASSERT(mEncoding == UTF_8_ENCODING);
|
2022-04-27 10:24:34 +03:00
|
|
|
nsresult rv = CommitLocalFileToEncoding();
|
|
|
|
if (NS_FAILED(rv)) {
|
|
|
|
return rv;
|
|
|
|
}
|
2019-12-12 20:50:19 +03:00
|
|
|
}
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
2009-09-18 13:21:47 +04:00
|
|
|
return NS_OK;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-04-27 10:24:34 +03:00
|
|
|
[[nodiscard]] nsresult nsHtml5StreamParser::ReDecodeLocalFile() {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(mDecodingLocalFileWithoutTokenizing && !mLookingForMetaCharset);
|
|
|
|
MOZ_ASSERT(mFirstBufferOfMetaScan);
|
|
|
|
MOZ_ASSERT(mCharsetSource == kCharsetFromFinalAutoDetectionFile ||
|
|
|
|
(mForceAutoDetection &&
|
|
|
|
mCharsetSource == kCharsetFromInitialUserForcedAutoDetection));
|
|
|
|
|
|
|
|
DiscardMetaSpeculation();
|
|
|
|
|
|
|
|
MOZ_ASSERT(mEncoding != UTF_8_ENCODING);
|
|
|
|
|
2019-12-12 20:50:19 +03:00
|
|
|
mDecodingLocalFileWithoutTokenizing = false;
|
2019-04-24 14:45:12 +03:00
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mEncoding->NewDecoderWithBOMRemovalInto(*mUnicodeDecoder);
|
|
|
|
mHasHadErrors = false;
|
2021-12-07 13:05:44 +03:00
|
|
|
|
2018-12-04 11:02:04 +03:00
|
|
|
// Throw away previous decoded data
|
|
|
|
mLastBuffer = mFirstBuffer;
|
|
|
|
mLastBuffer->next = nullptr;
|
|
|
|
mLastBuffer->setStart(0);
|
|
|
|
mLastBuffer->setEnd(0);
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mBufferingBytes = false;
|
|
|
|
mForceAutoDetection = false; // To stop feeding the detector
|
|
|
|
mFirstBufferOfMetaScan = nullptr;
|
|
|
|
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->SetDocumentCharset(mEncoding, mCharsetSource, true);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
|
2018-12-04 11:02:04 +03:00
|
|
|
// Decode again
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
2018-12-04 11:02:04 +03:00
|
|
|
DoDataAvailable(buffer);
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
|
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
return r.unwrapErr();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
auto r = mTreeBuilder->Flush();
|
|
|
|
if (r.isErr()) {
|
|
|
|
return r.unwrapErr();
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
}
|
2022-04-27 10:24:34 +03:00
|
|
|
return NS_OK;
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
|
|
|
|
2022-04-27 10:24:34 +03:00
|
|
|
[[nodiscard]] nsresult nsHtml5StreamParser::CommitLocalFileToEncoding() {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(mDecodingLocalFileWithoutTokenizing && !mLookingForMetaCharset);
|
|
|
|
MOZ_ASSERT(mFirstBufferOfMetaScan);
|
2019-12-12 20:50:19 +03:00
|
|
|
mDecodingLocalFileWithoutTokenizing = false;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(mCharsetSource == kCharsetFromFinalAutoDetectionFile ||
|
|
|
|
(mForceAutoDetection &&
|
|
|
|
mCharsetSource == kCharsetFromInitialUserForcedAutoDetection));
|
|
|
|
MOZ_ASSERT(mEncoding == UTF_8_ENCODING);
|
|
|
|
|
|
|
|
MOZ_ASSERT(!mStartedFeedingDevTools);
|
|
|
|
if (mURIToSendToDevtools) {
|
|
|
|
nsHtml5OwningUTF16Buffer* buffer = mFirstBufferOfMetaScan;
|
|
|
|
while (buffer) {
|
|
|
|
Span<const char16_t> data(buffer->getBuffer() + buffer->getStart(),
|
|
|
|
buffer->getLength());
|
|
|
|
OnNewContent(data);
|
|
|
|
buffer = buffer->next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
mFirstBufferOfMetaScan = nullptr;
|
|
|
|
|
|
|
|
mBufferingBytes = false;
|
|
|
|
mForceAutoDetection = false; // To stop feeding the detector
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->SetDocumentCharset(mEncoding, mCharsetSource, true);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
return r.unwrapErr();
|
|
|
|
}
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTreeBuilder->Flush();
|
|
|
|
if (r.isErr()) {
|
|
|
|
return r.unwrapErr();
|
|
|
|
}
|
|
|
|
return NS_OK;
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
|
|
|
|
2021-01-07 15:55:07 +03:00
|
|
|
class MaybeRunCollector : public Runnable {
|
|
|
|
public:
|
|
|
|
explicit MaybeRunCollector(nsIDocShell* aDocShell)
|
|
|
|
: Runnable("MaybeRunCollector"), mDocShell(aDocShell) {}
|
|
|
|
|
|
|
|
NS_IMETHOD Run() override {
|
|
|
|
nsJSContext::MaybeRunNextCollectorSlice(mDocShell,
|
|
|
|
JS::GCReason::HTML_PARSER);
|
|
|
|
return NS_OK;
|
|
|
|
}
|
|
|
|
|
|
|
|
nsCOMPtr<nsIDocShell> mDocShell;
|
|
|
|
};
|
|
|
|
|
2019-02-28 02:41:04 +03:00
|
|
|
nsresult nsHtml5StreamParser::OnStartRequest(nsIRequest* aRequest) {
|
2017-10-06 10:42:21 +03:00
|
|
|
MOZ_RELEASE_ASSERT(STREAM_NOT_STARTED == mStreamState,
|
|
|
|
"Got OnStartRequest when the stream had already started.");
|
2018-04-28 22:50:58 +03:00
|
|
|
MOZ_ASSERT(
|
2017-04-27 13:27:03 +03:00
|
|
|
!mExecutor->HasStarted(),
|
|
|
|
"Got OnStartRequest at the wrong stage in the executor life cycle.");
|
2019-12-12 20:50:19 +03:00
|
|
|
MOZ_ASSERT(NS_IsMainThread(), "Wrong thread!");
|
|
|
|
|
|
|
|
// To avoid the cost of instantiating the detector when it's not needed,
|
|
|
|
// let's instantiate only if we make it out of this method with the
|
|
|
|
// intent to use it.
|
|
|
|
auto detectorCreator = MakeScopeExit([&] {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if ((mForceAutoDetection || mCharsetSource < kCharsetFromParentFrame) ||
|
|
|
|
!(mMode == LOAD_AS_DATA || mMode == VIEW_SOURCE_XML)) {
|
2019-12-12 20:50:19 +03:00
|
|
|
mDetector = mozilla::EncodingDetector::Create();
|
|
|
|
}
|
|
|
|
});
|
|
|
|
|
2009-09-18 13:21:47 +04:00
|
|
|
mRequest = aRequest;
|
|
|
|
|
2009-09-21 17:18:20 +04:00
|
|
|
mStreamState = STREAM_BEING_READ;
|
|
|
|
|
2010-07-30 14:15:38 +04:00
|
|
|
// For View Source, the parser should run with scripts "enabled" if a normal
|
|
|
|
// load would have scripts enabled.
|
2018-03-16 18:26:06 +03:00
|
|
|
bool scriptingEnabled =
|
|
|
|
mMode == LOAD_AS_DATA ? false : mExecutor->IsScriptEnabled();
|
2009-09-21 17:18:20 +04:00
|
|
|
mOwner->StartTokenizer(scriptingEnabled);
|
2013-09-10 21:40:50 +04:00
|
|
|
|
2019-12-12 20:50:19 +03:00
|
|
|
MOZ_ASSERT(!mDecodingLocalFileWithoutTokenizing);
|
2013-09-10 21:40:50 +04:00
|
|
|
bool isSrcdoc = false;
|
|
|
|
nsCOMPtr<nsIChannel> channel;
|
|
|
|
nsresult rv = GetChannel(getter_AddRefs(channel));
|
|
|
|
if (NS_SUCCEEDED(rv)) {
|
|
|
|
isSrcdoc = NS_IsSrcdocChannel(channel);
|
2021-06-21 15:09:01 +03:00
|
|
|
if (!isSrcdoc && mCharsetSource <= kCharsetFromFallback) {
|
2018-12-04 11:02:04 +03:00
|
|
|
nsCOMPtr<nsIURI> originalURI;
|
|
|
|
rv = channel->GetOriginalURI(getter_AddRefs(originalURI));
|
|
|
|
if (NS_SUCCEEDED(rv)) {
|
2019-08-07 22:49:40 +03:00
|
|
|
if (originalURI->SchemeIs("resource")) {
|
2018-12-04 11:02:04 +03:00
|
|
|
mCharsetSource = kCharsetFromBuiltIn;
|
|
|
|
mEncoding = UTF_8_ENCODING;
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->SetDocumentCharset(mEncoding, mCharsetSource, false);
|
2018-12-04 11:02:04 +03:00
|
|
|
} else {
|
|
|
|
nsCOMPtr<nsIURI> currentURI;
|
|
|
|
rv = channel->GetURI(getter_AddRefs(currentURI));
|
|
|
|
if (NS_SUCCEEDED(rv)) {
|
|
|
|
nsCOMPtr<nsIURI> innermost = NS_GetInnermostURI(currentURI);
|
2019-12-12 20:50:19 +03:00
|
|
|
if (innermost->SchemeIs("file")) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(mEncoding == UTF_8_ENCODING);
|
|
|
|
if (!(mMode == LOAD_AS_DATA || mMode == VIEW_SOURCE_XML)) {
|
|
|
|
mDecodingLocalFileWithoutTokenizing = true;
|
|
|
|
}
|
2019-12-12 20:50:19 +03:00
|
|
|
} else {
|
|
|
|
nsAutoCString host;
|
|
|
|
innermost->GetAsciiHost(host);
|
|
|
|
if (!host.IsEmpty()) {
|
|
|
|
// First let's see if the host is DNS-absolute and ends with a
|
|
|
|
// dot and get rid of that one.
|
|
|
|
if (host.Last() == '.') {
|
|
|
|
host.SetLength(host.Length() - 1);
|
|
|
|
}
|
|
|
|
int32_t index = host.RFindChar('.');
|
|
|
|
if (index != kNotFound) {
|
|
|
|
// We tolerate an IPv4 component as generic "TLD", so don't
|
|
|
|
// bother checking.
|
|
|
|
ToLowerCase(
|
|
|
|
Substring(host, index + 1, host.Length() - (index + 1)),
|
|
|
|
mTLD);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2013-09-10 21:40:50 +04:00
|
|
|
}
|
|
|
|
mTreeBuilder->setIsSrcdocDocument(isSrcdoc);
|
2009-09-21 17:18:20 +04:00
|
|
|
mTreeBuilder->setScriptingEnabled(scriptingEnabled);
|
2018-03-16 18:26:06 +03:00
|
|
|
mTreeBuilder->SetPreventScriptExecution(
|
|
|
|
!((mMode == NORMAL) && scriptingEnabled));
|
2009-09-21 17:18:20 +04:00
|
|
|
mTokenizer->start();
|
|
|
|
mExecutor->Start();
|
|
|
|
mExecutor->StartReadingFromStage();
|
2011-11-01 19:27:36 +04:00
|
|
|
|
|
|
|
if (mMode == PLAIN_TEXT) {
|
|
|
|
mTreeBuilder->StartPlainText();
|
|
|
|
mTokenizer->StartPlainText();
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(
|
|
|
|
mTemplatePushedOrHeadPopped); // Needed to force 1024-byte sniffing
|
2021-12-22 18:54:49 +03:00
|
|
|
// Flush the ops to put them where ContinueAfterScriptsOrEncodingCommitment
|
|
|
|
// can find them.
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTreeBuilder->Flush();
|
|
|
|
if (r.isErr()) {
|
|
|
|
return mExecutor->MarkAsBroken(r.unwrapErr());
|
|
|
|
}
|
2011-11-30 21:44:31 +04:00
|
|
|
} else if (mMode == VIEW_SOURCE_PLAIN) {
|
2015-08-25 18:05:45 +03:00
|
|
|
nsAutoString viewSourceTitle;
|
|
|
|
CopyUTF8toUTF16(mViewSourceTitle, viewSourceTitle);
|
|
|
|
mTreeBuilder->EnsureBufferSpace(viewSourceTitle.Length());
|
|
|
|
mTreeBuilder->StartPlainTextViewSource(viewSourceTitle);
|
2011-11-30 21:44:31 +04:00
|
|
|
mTokenizer->StartPlainText();
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(
|
|
|
|
mTemplatePushedOrHeadPopped); // Needed to force 1024-byte sniffing
|
2021-12-22 18:54:49 +03:00
|
|
|
// Flush the ops to put them where ContinueAfterScriptsOrEncodingCommitment
|
|
|
|
// can find them.
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTreeBuilder->Flush();
|
|
|
|
if (r.isErr()) {
|
|
|
|
return mExecutor->MarkAsBroken(r.unwrapErr());
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
} else if (mMode == VIEW_SOURCE_HTML || mMode == VIEW_SOURCE_XML) {
|
|
|
|
// Generate and flush the View Source document up to and including the
|
|
|
|
// pre element start.
|
|
|
|
mTokenizer->StartViewSource(NS_ConvertUTF8toUTF16(mViewSourceTitle));
|
|
|
|
if (mMode == VIEW_SOURCE_XML) {
|
|
|
|
mTokenizer->StartViewSourceCharacters();
|
|
|
|
}
|
2021-12-22 18:54:49 +03:00
|
|
|
// Flush the ops to put them where ContinueAfterScriptsOrEncodingCommitment
|
|
|
|
// can find them.
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
return mExecutor->MarkAsBroken(r.unwrapErr());
|
|
|
|
}
|
2011-11-01 19:27:36 +04:00
|
|
|
}
|
|
|
|
|
2009-09-18 13:21:47 +04:00
|
|
|
/*
|
2017-04-27 13:27:03 +03:00
|
|
|
* If you move the following line, be very careful not to cause
|
|
|
|
* WillBuildModel to be called before the document has had its
|
2009-09-18 13:21:47 +04:00
|
|
|
* script global object set.
|
|
|
|
*/
|
2022-02-14 16:03:46 +03:00
|
|
|
rv = mExecutor->WillBuildModel();
|
2014-11-05 18:44:37 +03:00
|
|
|
NS_ENSURE_SUCCESS(rv, rv);
|
2017-04-27 13:27:03 +03:00
|
|
|
|
2015-10-18 08:24:48 +03:00
|
|
|
RefPtr<nsHtml5OwningUTF16Buffer> newBuf =
|
2011-09-28 16:45:17 +04:00
|
|
|
nsHtml5OwningUTF16Buffer::FalliblyCreate(READ_BUFFER_SIZE);
|
|
|
|
if (!newBuf) {
|
2012-03-21 16:39:25 +04:00
|
|
|
// marks this stream parser as terminated,
|
|
|
|
// which prevents entry to code paths that
|
|
|
|
// would use mFirstBuffer or mLastBuffer.
|
|
|
|
return mExecutor->MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
2011-09-28 16:45:17 +04:00
|
|
|
}
|
2018-12-04 11:02:04 +03:00
|
|
|
MOZ_ASSERT(!mFirstBuffer, "How come we have the first buffer set?");
|
|
|
|
MOZ_ASSERT(!mLastBuffer, "How come we have the last buffer set?");
|
2011-09-28 16:45:17 +04:00
|
|
|
mFirstBuffer = mLastBuffer = newBuf;
|
|
|
|
|
2013-09-10 21:40:50 +04:00
|
|
|
rv = NS_OK;
|
2009-09-21 17:18:20 +04:00
|
|
|
|
2021-03-18 22:24:50 +03:00
|
|
|
mNetworkEventTarget =
|
|
|
|
mExecutor->GetDocument()->EventTargetFor(TaskCategory::Network);
|
2017-06-27 20:38:30 +03:00
|
|
|
|
2009-09-21 17:18:20 +04:00
|
|
|
nsCOMPtr<nsIHttpChannel> httpChannel(do_QueryInterface(mRequest, &rv));
|
|
|
|
if (NS_SUCCEEDED(rv)) {
|
2017-06-27 20:38:30 +03:00
|
|
|
// Non-HTTP channels are bogus enough that we let them work with unlabeled
|
|
|
|
// runnables for now. Asserting for HTTP channels only.
|
2021-03-18 22:24:50 +03:00
|
|
|
MOZ_ASSERT(mNetworkEventTarget || mMode == LOAD_AS_DATA,
|
|
|
|
"How come the network event target is still null?");
|
2017-06-27 20:38:30 +03:00
|
|
|
|
2012-09-02 06:35:17 +04:00
|
|
|
nsAutoCString method;
|
2016-12-20 09:36:24 +03:00
|
|
|
Unused << httpChannel->GetRequestMethod(method);
|
2009-09-21 17:18:20 +04:00
|
|
|
// XXX does Necko have a way to renavigate POST, etc. without hitting
|
|
|
|
// the network?
|
|
|
|
if (!method.EqualsLiteral("GET")) {
|
|
|
|
// This is the old Gecko behavior but the HTML5 spec disagrees.
|
|
|
|
// Don't reparse on POST.
|
2011-10-17 18:59:28 +04:00
|
|
|
mReparseForbidden = true;
|
2009-09-21 17:18:20 +04:00
|
|
|
}
|
2014-06-17 20:39:19 +04:00
|
|
|
}
|
2013-07-09 05:45:26 +04:00
|
|
|
|
2014-06-17 20:39:19 +04:00
|
|
|
// Attempt to retarget delivery of data (via OnDataAvailable) to the parser
|
|
|
|
// thread, rather than through the main thread.
|
|
|
|
nsCOMPtr<nsIThreadRetargetableRequest> threadRetargetableRequest =
|
|
|
|
do_QueryInterface(mRequest, &rv);
|
|
|
|
if (threadRetargetableRequest) {
|
2017-06-08 02:23:29 +03:00
|
|
|
rv = threadRetargetableRequest->RetargetDeliveryTo(mEventTarget);
|
2021-01-07 15:55:07 +03:00
|
|
|
if (NS_SUCCEEDED(rv)) {
|
|
|
|
// Parser thread should be now ready to get data from necko and parse it
|
|
|
|
// and main thread might have a chance to process a collector slice.
|
|
|
|
// We need to do this asynchronously so that necko may continue processing
|
|
|
|
// the request.
|
|
|
|
nsCOMPtr<nsIRunnable> runnable =
|
|
|
|
new MaybeRunCollector(mExecutor->GetDocument()->GetDocShell());
|
|
|
|
mozilla::SchedulerGroup::Dispatch(
|
|
|
|
mozilla::TaskCategory::GarbageCollection, runnable.forget());
|
|
|
|
}
|
2014-06-17 20:39:19 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (NS_FAILED(rv)) {
|
2017-03-20 12:29:07 +03:00
|
|
|
NS_WARNING("Failed to retarget HTML data delivery to the parser thread.");
|
2009-09-21 17:18:20 +04:00
|
|
|
}
|
2011-02-11 13:33:32 +03:00
|
|
|
|
2012-03-24 15:34:42 +04:00
|
|
|
if (mCharsetSource == kCharsetFromParentFrame) {
|
2019-12-12 20:50:19 +03:00
|
|
|
// Remember this for error reporting.
|
2012-03-24 15:34:42 +04:00
|
|
|
mInitialEncodingWasFromParentFrame = true;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(!mDecodingLocalFileWithoutTokenizing);
|
2012-03-24 15:34:42 +04:00
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mForceAutoDetection || mCharsetSource < kCharsetFromChannel) {
|
|
|
|
mBufferingBytes = true;
|
|
|
|
if (mMode != VIEW_SOURCE_XML) {
|
|
|
|
// We need to set mLookingForMetaCharset to true here in case the first
|
|
|
|
// buffer to arrive is larger than 1024. We need the code that splits
|
|
|
|
// the buffers at 1024 bytes to work even in that case.
|
|
|
|
mLookingForMetaCharset = true;
|
|
|
|
}
|
2011-02-11 13:33:32 +03:00
|
|
|
}
|
2017-04-27 13:27:03 +03:00
|
|
|
|
2019-02-28 04:09:48 +03:00
|
|
|
if (mCharsetSource < kCharsetFromUtf8OnlyMime) {
|
2009-09-18 13:21:47 +04:00
|
|
|
// we aren't ready to commit to an encoding yet
|
|
|
|
// leave converter uninstantiated for now
|
|
|
|
return NS_OK;
|
|
|
|
}
|
2013-01-18 18:27:03 +04:00
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(!(mMode == VIEW_SOURCE_HTML || mMode == VIEW_SOURCE_XML));
|
|
|
|
|
|
|
|
MOZ_ASSERT(mEncoding == UTF_8_ENCODING,
|
|
|
|
"How come UTF-8-only MIME type didn't set encoding to UTF-8?");
|
|
|
|
|
2019-04-17 14:36:52 +03:00
|
|
|
// We are loading JSON/WebVTT/etc. into a browsing context.
|
|
|
|
// There's no need to remove the BOM manually here, because
|
|
|
|
// the UTF-8 decoder removes it.
|
2013-01-18 18:27:03 +04:00
|
|
|
mReparseForbidden = true;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mForceAutoDetection = false;
|
2013-01-18 18:27:03 +04:00
|
|
|
|
|
|
|
// Instantiate the converter here to avoid BOM sniffing.
|
2019-12-12 20:50:19 +03:00
|
|
|
mDecodingLocalFileWithoutTokenizing = false;
|
2017-06-18 14:37:50 +03:00
|
|
|
mUnicodeDecoder = mEncoding->NewDecoderWithBOMRemoval();
|
2009-09-18 13:21:47 +04:00
|
|
|
return NS_OK;
|
|
|
|
}
|
|
|
|
|
2009-09-25 21:11:02 +04:00
|
|
|
void nsHtml5StreamParser::DoStopRequest() {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
2017-10-06 10:42:21 +03:00
|
|
|
MOZ_RELEASE_ASSERT(STREAM_BEING_READ == mStreamState,
|
|
|
|
"Stream ended without being open.");
|
2009-10-12 17:08:04 +04:00
|
|
|
mTokenizerMutex.AssertCurrentThreadOwns();
|
|
|
|
|
2019-08-05 00:33:07 +03:00
|
|
|
auto guard = MakeScopeExit([&] { OnContentComplete(); });
|
|
|
|
|
2009-10-12 17:08:04 +04:00
|
|
|
if (IsTerminated()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (MOZ_UNLIKELY(mLookingForXmlDeclarationForXmlViewSource)) {
|
|
|
|
mLookingForXmlDeclarationForXmlViewSource = false;
|
|
|
|
mBufferingBytes = false;
|
|
|
|
mUnicodeDecoder = mEncoding->NewDecoderWithoutBOMHandling();
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->SetDocumentCharset(mEncoding, mCharsetSource, false);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
|
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
2022-04-27 10:24:34 +03:00
|
|
|
nsresult rv = WriteStreamBytes(buffer);
|
|
|
|
if (NS_FAILED(rv)) {
|
|
|
|
MarkAsBroken(rv);
|
|
|
|
return;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
}
|
|
|
|
} else if (!mUnicodeDecoder) {
|
2014-11-05 18:44:37 +03:00
|
|
|
nsresult rv;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (NS_FAILED(rv = SniffStreamBytes(Span<const uint8_t>(), true))) {
|
2014-11-05 18:44:37 +03:00
|
|
|
MarkAsBroken(rv);
|
2011-09-28 16:45:17 +04:00
|
|
|
return;
|
|
|
|
}
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
2009-09-21 17:18:20 +04:00
|
|
|
|
2018-03-16 18:26:06 +03:00
|
|
|
MOZ_ASSERT(mUnicodeDecoder,
|
|
|
|
"Should have a decoder after finalizing sniffing.");
|
2017-06-22 14:32:34 +03:00
|
|
|
|
|
|
|
// mLastBuffer should always point to a buffer of the size
|
2018-12-04 11:02:04 +03:00
|
|
|
// READ_BUFFER_SIZE.
|
2017-06-22 14:32:34 +03:00
|
|
|
if (!mLastBuffer) {
|
|
|
|
NS_WARNING("mLastBuffer should not be null!");
|
|
|
|
MarkAsBroken(NS_ERROR_NULL_POINTER);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
Span<uint8_t> src; // empty span
|
|
|
|
for (;;) {
|
2018-12-04 11:02:04 +03:00
|
|
|
auto dst = mLastBuffer->TailAsSpan(READ_BUFFER_SIZE);
|
2021-11-05 14:23:45 +03:00
|
|
|
uint32_t result;
|
|
|
|
size_t read;
|
|
|
|
size_t written;
|
|
|
|
bool hadErrors;
|
2021-11-08 11:14:00 +03:00
|
|
|
// Do not use structured binding lest deal with [-Werror=unused-variable]
|
|
|
|
std::tie(result, read, written, hadErrors) =
|
2017-06-22 14:32:34 +03:00
|
|
|
mUnicodeDecoder->DecodeToUTF16(src, dst, true);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!(mLookingForMetaCharset || mDecodingLocalFileWithoutTokenizing)) {
|
2019-08-05 00:33:07 +03:00
|
|
|
OnNewContent(dst.To(written));
|
2018-07-24 00:43:13 +03:00
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (hadErrors) {
|
2018-04-06 00:20:28 +03:00
|
|
|
mHasHadErrors = true;
|
|
|
|
}
|
2017-06-22 14:32:34 +03:00
|
|
|
MOZ_ASSERT(read == 0, "How come an empty span was read form?");
|
|
|
|
mLastBuffer->AdvanceEnd(written);
|
|
|
|
if (result == kOutputFull) {
|
|
|
|
RefPtr<nsHtml5OwningUTF16Buffer> newBuf =
|
|
|
|
nsHtml5OwningUTF16Buffer::FalliblyCreate(READ_BUFFER_SIZE);
|
|
|
|
if (!newBuf) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
2020-02-13 17:38:48 +03:00
|
|
|
mLastBuffer = (mLastBuffer->next = std::move(newBuf));
|
2017-06-22 14:32:34 +03:00
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!mLookingForMetaCharset && mDecodingLocalFileWithoutTokenizing) {
|
|
|
|
MOZ_ASSERT(mNumBytesBuffered < LOCAL_FILE_UTF_8_BUFFER_SIZE);
|
|
|
|
MOZ_ASSERT(!mStartedFeedingDetector);
|
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
|
|
|
FeedDetector(buffer);
|
|
|
|
}
|
|
|
|
MOZ_ASSERT(!mChardetEof);
|
|
|
|
DetectorEof();
|
|
|
|
auto [encoding, source] = GuessEncoding(true);
|
|
|
|
mCharsetSource = source;
|
|
|
|
if (encoding != mEncoding) {
|
|
|
|
mEncoding = encoding;
|
2022-04-27 10:24:34 +03:00
|
|
|
nsresult rv = ReDecodeLocalFile();
|
|
|
|
if (NS_FAILED(rv)) {
|
|
|
|
MarkAsBroken(rv);
|
|
|
|
return;
|
|
|
|
}
|
2019-12-12 20:50:19 +03:00
|
|
|
DoStopRequest();
|
|
|
|
return;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(mEncoding == UTF_8_ENCODING);
|
2022-04-27 10:24:34 +03:00
|
|
|
nsresult rv = CommitLocalFileToEncoding();
|
|
|
|
if (NS_FAILED(rv)) {
|
|
|
|
MarkAsBroken(rv);
|
|
|
|
return;
|
|
|
|
}
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
2017-06-22 14:32:34 +03:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-12-04 11:02:04 +03:00
|
|
|
mStreamState = STREAM_ENDED;
|
|
|
|
|
2009-10-12 17:08:04 +04:00
|
|
|
if (IsTerminatedOrInterrupted()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-04-27 13:27:03 +03:00
|
|
|
ParseAvailableData();
|
2009-09-25 21:11:02 +04:00
|
|
|
}
|
2009-09-21 17:18:20 +04:00
|
|
|
|
2016-04-26 03:23:21 +03:00
|
|
|
class nsHtml5RequestStopper : public Runnable {
|
2018-03-16 18:26:06 +03:00
|
|
|
private:
|
|
|
|
nsHtml5StreamParserPtr mStreamParser;
|
|
|
|
|
|
|
|
public:
|
|
|
|
explicit nsHtml5RequestStopper(nsHtml5StreamParser* aStreamParser)
|
|
|
|
: Runnable("nsHtml5RequestStopper"), mStreamParser(aStreamParser) {}
|
|
|
|
NS_IMETHOD Run() override {
|
|
|
|
mozilla::MutexAutoLock autoLock(mStreamParser->mTokenizerMutex);
|
|
|
|
mStreamParser->DoStopRequest();
|
2021-12-22 18:54:49 +03:00
|
|
|
mStreamParser->PostLoadFlusher();
|
2018-03-16 18:26:06 +03:00
|
|
|
return NS_OK;
|
|
|
|
}
|
2009-09-25 21:11:02 +04:00
|
|
|
};
|
|
|
|
|
|
|
|
nsresult nsHtml5StreamParser::OnStopRequest(nsIRequest* aRequest,
|
2018-03-16 18:26:06 +03:00
|
|
|
nsresult status) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(mRequest == aRequest, "Got Stop on wrong stream.");
|
|
|
|
MOZ_ASSERT(NS_IsMainThread(), "Wrong thread!");
|
2013-06-17 23:44:07 +04:00
|
|
|
nsCOMPtr<nsIRunnable> stopper = new nsHtml5RequestStopper(this);
|
2017-06-08 02:23:29 +03:00
|
|
|
if (NS_FAILED(mEventTarget->Dispatch(stopper, nsIThread::DISPATCH_NORMAL))) {
|
2013-06-17 23:44:07 +04:00
|
|
|
NS_WARNING("Dispatching StopRequest event failed.");
|
2009-09-25 21:11:02 +04:00
|
|
|
}
|
2009-09-18 13:21:47 +04:00
|
|
|
return NS_OK;
|
|
|
|
}
|
|
|
|
|
2018-12-04 11:02:04 +03:00
|
|
|
void nsHtml5StreamParser::DoDataAvailableBuffer(
|
|
|
|
mozilla::Buffer<uint8_t>&& aBuffer) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (MOZ_UNLIKELY(!mBufferingBytes)) {
|
2018-12-04 11:02:04 +03:00
|
|
|
DoDataAvailable(aBuffer);
|
|
|
|
return;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (MOZ_UNLIKELY(mLookingForXmlDeclarationForXmlViewSource)) {
|
|
|
|
const uint8_t* elements = aBuffer.Elements();
|
|
|
|
size_t length = aBuffer.Length();
|
|
|
|
const uint8_t* lt = (const uint8_t*)memchr(elements, '>', length);
|
|
|
|
if (!lt) {
|
|
|
|
mBufferedBytes.AppendElement(std::move(aBuffer));
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
// We found an '>'. Now there either is or isn't an XML decl.
|
|
|
|
length = (lt - elements) + 1;
|
|
|
|
Vector<uint8_t> contiguous;
|
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
|
|
|
if (!contiguous.append(buffer.Elements(), buffer.Length())) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!contiguous.append(elements, length)) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
const Encoding* encoding =
|
|
|
|
xmldecl_parse(contiguous.begin(), contiguous.length());
|
|
|
|
if (encoding) {
|
|
|
|
mEncoding = WrapNotNull(encoding);
|
|
|
|
mCharsetSource = kCharsetFromXmlDeclaration;
|
|
|
|
}
|
|
|
|
|
|
|
|
mLookingForXmlDeclarationForXmlViewSource = false;
|
|
|
|
mBufferingBytes = false;
|
|
|
|
mUnicodeDecoder = mEncoding->NewDecoderWithoutBOMHandling();
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->SetDocumentCharset(mEncoding, mCharsetSource, false);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
|
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
|
|
|
DoDataAvailable(buffer);
|
|
|
|
}
|
|
|
|
DoDataAvailable(aBuffer);
|
|
|
|
mBufferedBytes.Clear();
|
|
|
|
return;
|
|
|
|
}
|
2018-12-04 11:02:04 +03:00
|
|
|
CheckedInt<size_t> bufferedPlusLength(aBuffer.Length());
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
bufferedPlusLength += mNumBytesBuffered;
|
2018-12-04 11:02:04 +03:00
|
|
|
if (!bufferedPlusLength.isValid()) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
// Ensure that WriteStreamBytes() sees buffers ending
|
|
|
|
// exactly at the two special boundaries.
|
|
|
|
bool metaBoundaryWithinBuffer =
|
|
|
|
mLookingForMetaCharset &&
|
|
|
|
mNumBytesBuffered < UNCONDITIONAL_META_SCAN_BOUNDARY &&
|
|
|
|
bufferedPlusLength.value() > UNCONDITIONAL_META_SCAN_BOUNDARY;
|
|
|
|
bool localFileLimitWithinBuffer =
|
|
|
|
mDecodingLocalFileWithoutTokenizing &&
|
|
|
|
mNumBytesBuffered < LOCAL_FILE_UTF_8_BUFFER_SIZE &&
|
|
|
|
bufferedPlusLength.value() > LOCAL_FILE_UTF_8_BUFFER_SIZE;
|
|
|
|
if (!metaBoundaryWithinBuffer && !localFileLimitWithinBuffer) {
|
2018-12-04 11:02:04 +03:00
|
|
|
// Truncation OK, because we just checked the range.
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mNumBytesBuffered = bufferedPlusLength.value();
|
|
|
|
mBufferedBytes.AppendElement(std::move(aBuffer));
|
|
|
|
DoDataAvailable(mBufferedBytes.LastElement());
|
2018-12-04 11:02:04 +03:00
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_RELEASE_ASSERT(
|
|
|
|
!(metaBoundaryWithinBuffer && localFileLimitWithinBuffer),
|
|
|
|
"How can Necko give us a buffer this large?");
|
|
|
|
size_t boundary = metaBoundaryWithinBuffer
|
|
|
|
? UNCONDITIONAL_META_SCAN_BOUNDARY
|
|
|
|
: LOCAL_FILE_UTF_8_BUFFER_SIZE;
|
2018-12-04 11:02:04 +03:00
|
|
|
// Truncation OK, because the constant is small enough.
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
size_t overBoundary = bufferedPlusLength.value() - boundary;
|
2018-12-17 20:08:51 +03:00
|
|
|
MOZ_RELEASE_ASSERT(overBoundary < aBuffer.Length());
|
|
|
|
size_t untilBoundary = aBuffer.Length() - overBoundary;
|
2018-12-04 11:02:04 +03:00
|
|
|
auto span = aBuffer.AsSpan();
|
2018-12-17 20:08:51 +03:00
|
|
|
auto head = span.To(untilBoundary);
|
|
|
|
auto tail = span.From(untilBoundary);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_RELEASE_ASSERT(mNumBytesBuffered + untilBoundary == boundary);
|
|
|
|
// The following copies may end up being useless, but optimizing
|
|
|
|
// them away would add complexity.
|
|
|
|
Maybe<Buffer<uint8_t>> maybeHead = Buffer<uint8_t>::CopyFrom(head);
|
|
|
|
if (maybeHead.isNothing()) {
|
2018-12-04 11:02:04 +03:00
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mNumBytesBuffered = boundary;
|
|
|
|
mBufferedBytes.AppendElement(std::move(*maybeHead));
|
|
|
|
DoDataAvailable(mBufferedBytes.LastElement());
|
2021-12-07 13:05:44 +03:00
|
|
|
// Re-decode may have happened here.
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
|
|
|
|
Maybe<Buffer<uint8_t>> maybeTail = Buffer<uint8_t>::CopyFrom(tail);
|
|
|
|
if (maybeTail.isNothing()) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
mNumBytesBuffered += tail.Length();
|
|
|
|
mBufferedBytes.AppendElement(std::move(*maybeTail));
|
|
|
|
DoDataAvailable(mBufferedBytes.LastElement());
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
|
|
|
// Do this clean-up here to avoid use-after-free when
|
|
|
|
// DoDataAvailable is passed a span pointing into an
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
// element of mBufferedBytes.
|
|
|
|
if (!mBufferingBytes) {
|
|
|
|
mBufferedBytes.Clear();
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-12-06 21:59:56 +03:00
|
|
|
void nsHtml5StreamParser::DoDataAvailable(Span<const uint8_t> aBuffer) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
2017-10-06 10:42:21 +03:00
|
|
|
MOZ_RELEASE_ASSERT(STREAM_BEING_READ == mStreamState,
|
|
|
|
"DoDataAvailable called when stream not open.");
|
2009-10-12 17:08:04 +04:00
|
|
|
mTokenizerMutex.AssertCurrentThreadOwns();
|
|
|
|
|
|
|
|
if (IsTerminated()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2011-09-28 16:45:17 +04:00
|
|
|
nsresult rv;
|
2011-02-11 13:33:32 +03:00
|
|
|
if (HasDecoder()) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if ((mForceAutoDetection || mCharsetSource < kCharsetFromParentFrame) &&
|
|
|
|
!mBufferingBytes && !mReparseForbidden &&
|
|
|
|
!(mMode == LOAD_AS_DATA || mMode == VIEW_SOURCE_XML)) {
|
|
|
|
MOZ_ASSERT(!mDecodingLocalFileWithoutTokenizing,
|
|
|
|
"How is mBufferingBytes false if "
|
|
|
|
"mDecodingLocalFileWithoutTokenizing is true?");
|
|
|
|
FeedDetector(aBuffer);
|
2011-02-11 13:33:32 +03:00
|
|
|
}
|
2018-12-06 21:59:56 +03:00
|
|
|
rv = WriteStreamBytes(aBuffer);
|
2011-02-11 13:33:32 +03:00
|
|
|
} else {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
rv = SniffStreamBytes(aBuffer, false);
|
2011-09-28 16:45:17 +04:00
|
|
|
}
|
|
|
|
if (NS_FAILED(rv)) {
|
2014-11-05 18:44:37 +03:00
|
|
|
MarkAsBroken(rv);
|
2011-09-28 16:45:17 +04:00
|
|
|
return;
|
2011-02-11 13:33:32 +03:00
|
|
|
}
|
2009-10-12 17:08:04 +04:00
|
|
|
|
|
|
|
if (IsTerminatedOrInterrupted()) {
|
|
|
|
return;
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
2009-10-12 17:08:04 +04:00
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!mLookingForMetaCharset && mDecodingLocalFileWithoutTokenizing) {
|
2018-12-04 11:02:04 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2009-10-12 17:08:04 +04:00
|
|
|
ParseAvailableData();
|
2010-04-16 14:52:06 +04:00
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mBomState != BOM_SNIFFING_OVER || mFlushTimerArmed || mSpeculating) {
|
2010-04-16 14:52:06 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-01-26 09:45:31 +03:00
|
|
|
{
|
|
|
|
mozilla::MutexAutoLock flushTimerLock(mFlushTimerMutex);
|
2017-06-12 22:34:10 +03:00
|
|
|
mFlushTimer->InitWithNamedFuncCallback(
|
|
|
|
nsHtml5StreamParser::TimerCallback, static_cast<void*>(this),
|
2018-03-27 03:59:48 +03:00
|
|
|
mFlushTimerEverFired ? StaticPrefs::html5_flushtimer_initialdelay()
|
|
|
|
: StaticPrefs::html5_flushtimer_subsequentdelay(),
|
2017-06-12 22:34:10 +03:00
|
|
|
nsITimer::TYPE_ONE_SHOT, "nsHtml5StreamParser::DoDataAvailable");
|
2017-01-26 09:45:31 +03:00
|
|
|
}
|
2011-10-17 18:59:28 +04:00
|
|
|
mFlushTimerArmed = true;
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
2016-04-26 03:23:21 +03:00
|
|
|
class nsHtml5DataAvailable : public Runnable {
|
2018-03-16 18:26:06 +03:00
|
|
|
private:
|
|
|
|
nsHtml5StreamParserPtr mStreamParser;
|
2018-12-07 11:28:08 +03:00
|
|
|
Buffer<uint8_t> mData;
|
2018-03-16 18:26:06 +03:00
|
|
|
|
|
|
|
public:
|
|
|
|
nsHtml5DataAvailable(nsHtml5StreamParser* aStreamParser,
|
2018-12-07 11:28:08 +03:00
|
|
|
Buffer<uint8_t>&& aData)
|
2018-03-16 18:26:06 +03:00
|
|
|
: Runnable("nsHtml5DataAvailable"),
|
|
|
|
mStreamParser(aStreamParser),
|
2018-12-07 11:28:08 +03:00
|
|
|
mData(std::move(aData)) {}
|
2018-03-16 18:26:06 +03:00
|
|
|
NS_IMETHOD Run() override {
|
|
|
|
mozilla::MutexAutoLock autoLock(mStreamParser->mTokenizerMutex);
|
2018-12-04 11:02:04 +03:00
|
|
|
mStreamParser->DoDataAvailableBuffer(std::move(mData));
|
2021-12-22 18:54:49 +03:00
|
|
|
mStreamParser->PostLoadFlusher();
|
2018-03-16 18:26:06 +03:00
|
|
|
return NS_OK;
|
|
|
|
}
|
2009-09-25 21:11:02 +04:00
|
|
|
};
|
|
|
|
|
2009-09-18 13:21:47 +04:00
|
|
|
nsresult nsHtml5StreamParser::OnDataAvailable(nsIRequest* aRequest,
|
2014-04-16 09:41:39 +04:00
|
|
|
nsIInputStream* aInStream,
|
|
|
|
uint64_t aSourceOffset,
|
|
|
|
uint32_t aLength) {
|
2012-03-21 16:39:25 +04:00
|
|
|
nsresult rv;
|
2011-09-28 16:45:17 +04:00
|
|
|
|
2018-12-04 11:02:04 +03:00
|
|
|
MOZ_ASSERT(mRequest == aRequest, "Got data on wrong stream.");
|
2012-08-22 19:56:38 +04:00
|
|
|
uint32_t totalRead;
|
2013-07-09 05:45:26 +04:00
|
|
|
// Main thread to parser thread dispatch requires copying to buffer first.
|
2018-12-04 11:02:04 +03:00
|
|
|
if (MOZ_UNLIKELY(NS_IsMainThread())) {
|
2021-12-19 19:51:41 +03:00
|
|
|
if (NS_FAILED(rv = mExecutor->IsBroken())) {
|
|
|
|
return rv;
|
|
|
|
}
|
2018-12-07 11:28:08 +03:00
|
|
|
Maybe<Buffer<uint8_t>> maybe = Buffer<uint8_t>::Alloc(aLength);
|
|
|
|
if (maybe.isNothing()) {
|
2013-07-09 05:45:26 +04:00
|
|
|
return mExecutor->MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
}
|
2018-12-07 11:28:08 +03:00
|
|
|
Buffer<uint8_t> data(std::move(*maybe));
|
|
|
|
rv = aInStream->Read(reinterpret_cast<char*>(data.Elements()),
|
|
|
|
data.Length(), &totalRead);
|
2013-07-09 05:45:26 +04:00
|
|
|
NS_ENSURE_SUCCESS(rv, rv);
|
2018-12-07 11:28:08 +03:00
|
|
|
MOZ_ASSERT(totalRead == aLength);
|
2013-07-09 05:45:26 +04:00
|
|
|
|
2018-03-16 18:26:06 +03:00
|
|
|
nsCOMPtr<nsIRunnable> dataAvailable =
|
2018-12-07 11:28:08 +03:00
|
|
|
new nsHtml5DataAvailable(this, std::move(data));
|
2018-03-16 18:26:06 +03:00
|
|
|
if (NS_FAILED(mEventTarget->Dispatch(dataAvailable,
|
|
|
|
nsIThread::DISPATCH_NORMAL))) {
|
2013-07-09 05:45:26 +04:00
|
|
|
NS_WARNING("Dispatching DataAvailable event failed.");
|
|
|
|
}
|
|
|
|
return rv;
|
2018-12-04 11:02:04 +03:00
|
|
|
}
|
2021-12-19 19:51:41 +03:00
|
|
|
|
2018-12-04 11:02:04 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
|
|
|
mozilla::MutexAutoLock autoLock(mTokenizerMutex);
|
|
|
|
|
2021-12-19 19:51:41 +03:00
|
|
|
if (NS_FAILED(rv = mTreeBuilder->IsBroken())) {
|
|
|
|
return rv;
|
|
|
|
}
|
2021-12-22 18:54:49 +03:00
|
|
|
|
|
|
|
// Since we're getting OnDataAvailable directly on the parser thread,
|
|
|
|
// there is no nsHtml5DataAvailable that would call PostLoadFlusher.
|
|
|
|
// Hence, we need to call PostLoadFlusher() before this method returns.
|
|
|
|
// Braces for RAII clarity relative to the mutex despite not being
|
|
|
|
// strictly necessary.
|
|
|
|
{
|
|
|
|
auto speculationFlusher = MakeScopeExit([&] { PostLoadFlusher(); });
|
|
|
|
|
|
|
|
if (mBufferingBytes) {
|
|
|
|
Maybe<Buffer<uint8_t>> maybe = Buffer<uint8_t>::Alloc(aLength);
|
|
|
|
if (maybe.isNothing()) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return NS_ERROR_OUT_OF_MEMORY;
|
|
|
|
}
|
|
|
|
Buffer<uint8_t> data(std::move(*maybe));
|
|
|
|
rv = aInStream->Read(reinterpret_cast<char*>(data.Elements()),
|
|
|
|
data.Length(), &totalRead);
|
|
|
|
NS_ENSURE_SUCCESS(rv, rv);
|
|
|
|
MOZ_ASSERT(totalRead == aLength);
|
|
|
|
DoDataAvailableBuffer(std::move(data));
|
|
|
|
return rv;
|
2013-07-09 05:45:26 +04:00
|
|
|
}
|
2021-12-22 18:54:49 +03:00
|
|
|
// Read directly from response buffer.
|
|
|
|
rv = aInStream->ReadSegments(CopySegmentsToParser, this, aLength,
|
|
|
|
&totalRead);
|
2018-12-04 11:02:04 +03:00
|
|
|
NS_ENSURE_SUCCESS(rv, rv);
|
|
|
|
MOZ_ASSERT(totalRead == aLength);
|
|
|
|
return rv;
|
2013-06-17 23:44:07 +04:00
|
|
|
}
|
2013-07-09 05:45:26 +04:00
|
|
|
}
|
|
|
|
|
2021-12-19 19:51:41 +03:00
|
|
|
// Called under lock by function ptr
|
2019-02-26 01:11:34 +03:00
|
|
|
/* static */
|
|
|
|
nsresult nsHtml5StreamParser::CopySegmentsToParser(
|
2018-03-16 18:26:06 +03:00
|
|
|
nsIInputStream* aInStream, void* aClosure, const char* aFromSegment,
|
2022-03-22 17:06:03 +03:00
|
|
|
uint32_t aToOffset, uint32_t aCount,
|
|
|
|
uint32_t* aWriteCount) NO_THREAD_SAFETY_ANALYSIS {
|
2013-07-09 05:45:26 +04:00
|
|
|
nsHtml5StreamParser* parser = static_cast<nsHtml5StreamParser*>(aClosure);
|
|
|
|
|
2020-08-07 10:49:47 +03:00
|
|
|
parser->DoDataAvailable(AsBytes(Span(aFromSegment, aCount)));
|
2013-07-09 05:45:26 +04:00
|
|
|
// Assume DoDataAvailable consumed all available bytes.
|
|
|
|
*aWriteCount = aCount;
|
|
|
|
return NS_OK;
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
2018-03-16 18:26:06 +03:00
|
|
|
const Encoding* nsHtml5StreamParser::PreferredForInternalEncodingDecl(
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
const nsAString& aEncoding) {
|
2017-06-17 05:54:40 +03:00
|
|
|
const Encoding* newEncoding = Encoding::ForLabel(aEncoding);
|
|
|
|
if (!newEncoding) {
|
2012-11-08 03:04:22 +04:00
|
|
|
// the encoding name is bogus
|
2018-03-16 18:26:06 +03:00
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncMetaUnsupported", true,
|
|
|
|
mTokenizer->getLineNumber());
|
2017-06-18 14:37:50 +03:00
|
|
|
return nullptr;
|
2010-07-30 14:03:54 +04:00
|
|
|
}
|
|
|
|
|
2018-03-16 18:26:06 +03:00
|
|
|
if (newEncoding == UTF_16BE_ENCODING || newEncoding == UTF_16LE_ENCODING) {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncMetaUtf16", true,
|
|
|
|
mTokenizer->getLineNumber());
|
2017-06-17 05:54:40 +03:00
|
|
|
newEncoding = UTF_8_ENCODING;
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
2012-11-08 03:04:22 +04:00
|
|
|
|
2017-06-17 05:54:40 +03:00
|
|
|
if (newEncoding == X_USER_DEFINED_ENCODING) {
|
2014-01-02 11:18:19 +04:00
|
|
|
// WebKit/Blink hack for Indian and Armenian legacy sites
|
2018-03-16 18:26:06 +03:00
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncMetaUserDefined", true,
|
|
|
|
mTokenizer->getLineNumber());
|
2017-06-17 05:54:40 +03:00
|
|
|
newEncoding = WINDOWS_1252_ENCODING;
|
2014-01-02 11:18:19 +04:00
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (newEncoding == REPLACEMENT_ENCODING) {
|
|
|
|
// No line number, because the replacement encoding doesn't allow
|
|
|
|
// showing the lines.
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncMetaReplacement", true, 0);
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
2012-09-18 18:46:30 +04:00
|
|
|
|
2017-06-18 14:37:50 +03:00
|
|
|
return newEncoding;
|
2011-11-01 15:33:11 +04:00
|
|
|
}
|
|
|
|
|
2017-03-20 15:45:15 +03:00
|
|
|
bool nsHtml5StreamParser::internalEncodingDeclaration(nsHtml5String aEncoding) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
|
|
|
if ((mCharsetSource >= kCharsetFromMetaTag &&
|
|
|
|
mCharsetSource != kCharsetFromFinalAutoDetectionFile) ||
|
|
|
|
mSeenEligibleMetaCharset) {
|
2011-11-01 15:33:11 +04:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
nsString newEncoding; // Not Auto, because using it to hold nsStringBuffer*
|
|
|
|
aEncoding.ToString(newEncoding);
|
2017-06-18 14:37:50 +03:00
|
|
|
auto encoding = PreferredForInternalEncodingDecl(newEncoding);
|
|
|
|
if (!encoding) {
|
2011-11-01 15:33:11 +04:00
|
|
|
return false;
|
|
|
|
}
|
2010-07-30 14:03:54 +04:00
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mSeenEligibleMetaCharset = true;
|
|
|
|
|
|
|
|
if (!mLookingForMetaCharset) {
|
|
|
|
if (mInitialEncodingWasFromParentFrame) {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncMetaTooLateFrame", true,
|
|
|
|
mTokenizer->getLineNumber());
|
|
|
|
} else {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncMetaTooLate", true,
|
|
|
|
mTokenizer->getLineNumber());
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
if (mTemplatePushedOrHeadPopped) {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncMetaAfterHeadInKilobyte", false,
|
2018-03-16 18:26:06 +03:00
|
|
|
mTokenizer->getLineNumber());
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (mForceAutoDetection &&
|
|
|
|
(encoding->IsAsciiCompatible() || encoding == ISO_2022_JP_ENCODING)) {
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
mNeedsEncodingSwitchTo = encoding;
|
|
|
|
mEncodingSwitchSource = kCharsetFromMetaTag;
|
2011-10-17 18:59:28 +04:00
|
|
|
return true;
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
bool nsHtml5StreamParser::TemplatePushedOrHeadPopped() {
|
|
|
|
MOZ_ASSERT(
|
|
|
|
IsParserThread() || mMode == PLAIN_TEXT || mMode == VIEW_SOURCE_PLAIN,
|
|
|
|
"Wrong thread!");
|
|
|
|
mTemplatePushedOrHeadPopped = true;
|
|
|
|
return mNumBytesBuffered >= UNCONDITIONAL_META_SCAN_BOUNDARY;
|
|
|
|
}
|
|
|
|
|
|
|
|
void nsHtml5StreamParser::RememberGt(int32_t aPos) {
|
|
|
|
if (mLookingForMetaCharset) {
|
|
|
|
mGtBuffer = mFirstBuffer;
|
|
|
|
mGtPos = aPos;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-12-22 18:54:49 +03:00
|
|
|
void nsHtml5StreamParser::PostLoadFlusher() {
|
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
|
|
|
mTokenizerMutex.AssertCurrentThreadOwns();
|
|
|
|
|
|
|
|
mTreeBuilder->FlushLoads();
|
|
|
|
// Dispatch this runnable unconditionally, because the loads
|
|
|
|
// that need flushing may have been flushed earlier even if the
|
|
|
|
// flush right above here did nothing. (Is this still true?)
|
|
|
|
nsCOMPtr<nsIRunnable> runnable(mLoadFlusher);
|
|
|
|
if (NS_FAILED(DispatchToMain(runnable.forget()))) {
|
|
|
|
NS_WARNING("failed to dispatch load flush event");
|
|
|
|
}
|
2022-04-27 10:24:34 +03:00
|
|
|
|
|
|
|
if ((mMode == VIEW_SOURCE_HTML || mMode == VIEW_SOURCE_XML) &&
|
|
|
|
mTokenizer->ShouldFlushViewSource()) {
|
|
|
|
auto r = mTreeBuilder->Flush(); // delete useless ops
|
|
|
|
MOZ_ASSERT(r.isOk(), "Should have null sink with View Source");
|
|
|
|
r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (r.unwrap()) {
|
|
|
|
nsCOMPtr<nsIRunnable> runnable(mExecutorFlusher);
|
|
|
|
if (NS_FAILED(DispatchToMain(runnable.forget()))) {
|
|
|
|
NS_WARNING("failed to dispatch executor flush event");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2021-12-22 18:54:49 +03:00
|
|
|
}
|
|
|
|
|
2010-04-16 14:52:06 +04:00
|
|
|
void nsHtml5StreamParser::FlushTreeOpsAndDisarmTimer() {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
2010-04-16 14:52:06 +04:00
|
|
|
if (mFlushTimerArmed) {
|
|
|
|
// avoid calling Cancel if the flush timer isn't armed to avoid acquiring
|
|
|
|
// a mutex
|
2017-01-26 09:45:31 +03:00
|
|
|
{
|
|
|
|
mozilla::MutexAutoLock flushTimerLock(mFlushTimerMutex);
|
|
|
|
mFlushTimer->Cancel();
|
|
|
|
}
|
2011-10-17 18:59:28 +04:00
|
|
|
mFlushTimerArmed = false;
|
2010-04-16 14:52:06 +04:00
|
|
|
}
|
2010-07-30 14:15:38 +04:00
|
|
|
if (mMode == VIEW_SOURCE_HTML || mMode == VIEW_SOURCE_XML) {
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
auto r = mTreeBuilder->Flush();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
2010-07-30 14:15:38 +04:00
|
|
|
}
|
2016-09-19 21:59:54 +03:00
|
|
|
nsCOMPtr<nsIRunnable> runnable(mExecutorFlusher);
|
2017-07-26 11:13:35 +03:00
|
|
|
if (NS_FAILED(DispatchToMain(runnable.forget()))) {
|
2010-04-16 14:52:06 +04:00
|
|
|
NS_WARNING("failed to dispatch executor flush event");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
void nsHtml5StreamParser::SwitchDecoderIfAsciiSoFar(
|
|
|
|
NotNull<const Encoding*> aEncoding) {
|
|
|
|
if (mEncoding == aEncoding) {
|
|
|
|
MOZ_ASSERT(!mStartedFeedingDevTools);
|
|
|
|
// Report all already-decoded buffers to the dev tools if needed.
|
|
|
|
if (mURIToSendToDevtools) {
|
|
|
|
nsHtml5OwningUTF16Buffer* buffer = mFirstBufferOfMetaScan;
|
|
|
|
while (buffer) {
|
|
|
|
auto s = Span(buffer->getBuffer(), buffer->getEnd());
|
|
|
|
OnNewContent(s);
|
|
|
|
buffer = buffer->next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (!mEncoding->IsAsciiCompatible() || !aEncoding->IsAsciiCompatible()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
size_t numAscii = 0;
|
|
|
|
MOZ_ASSERT(mFirstBufferOfMetaScan,
|
|
|
|
"Why did we come here without starting meta scan?");
|
|
|
|
nsHtml5OwningUTF16Buffer* buffer = mFirstBufferOfMetaScan;
|
|
|
|
while (buffer != mFirstBuffer) {
|
|
|
|
MOZ_ASSERT(buffer, "mFirstBuffer should have acted as sentinel!");
|
|
|
|
MOZ_ASSERT(buffer->getStart() == buffer->getEnd(),
|
|
|
|
"Why wasn't an early buffer fully consumed?");
|
|
|
|
auto s = Span(buffer->getBuffer(), buffer->getStart());
|
|
|
|
if (!IsAscii(s)) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
numAscii += s.Length();
|
|
|
|
buffer = buffer->next;
|
|
|
|
}
|
|
|
|
auto s = Span(mFirstBuffer->getBuffer(), mFirstBuffer->getStart());
|
|
|
|
if (!IsAscii(s)) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
numAscii += s.Length();
|
|
|
|
|
|
|
|
MOZ_ASSERT(!mStartedFeedingDevTools);
|
|
|
|
// Report the ASCII prefix to dev tools if needed
|
|
|
|
if (mURIToSendToDevtools) {
|
|
|
|
buffer = mFirstBufferOfMetaScan;
|
|
|
|
while (buffer != mFirstBuffer) {
|
|
|
|
MOZ_ASSERT(buffer, "mFirstBuffer should have acted as sentinel!");
|
|
|
|
MOZ_ASSERT(buffer->getStart() == buffer->getEnd(),
|
|
|
|
"Why wasn't an early buffer fully consumed?");
|
|
|
|
auto s = Span(buffer->getBuffer(), buffer->getStart());
|
|
|
|
OnNewContent(s);
|
|
|
|
buffer = buffer->next;
|
|
|
|
}
|
|
|
|
auto s = Span(mFirstBuffer->getBuffer(), mFirstBuffer->getStart());
|
|
|
|
OnNewContent(s);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Success! Now let's get rid of the already-decoded but not tokenized data:
|
|
|
|
mFirstBuffer->setEnd(mFirstBuffer->getStart());
|
|
|
|
mLastBuffer = mFirstBuffer;
|
|
|
|
mFirstBuffer->next = nullptr;
|
|
|
|
|
|
|
|
// Note: We could have scanned further for ASCII, which could avoid some
|
|
|
|
// buffer deallocation and reallocation. However, chances are that if we got
|
|
|
|
// until meta without non-ASCII before, there's going to be a title with
|
|
|
|
// non-ASCII soon after anyway, so let's avoid the complexity of finding out.
|
|
|
|
|
|
|
|
MOZ_ASSERT(mUnicodeDecoder, "How come we scanned meta without a decoder?");
|
|
|
|
mEncoding = aEncoding;
|
|
|
|
mEncoding->NewDecoderWithoutBOMHandlingInto(*mUnicodeDecoder);
|
|
|
|
mHasHadErrors = false;
|
|
|
|
|
|
|
|
MOZ_ASSERT(!mDecodingLocalFileWithoutTokenizing,
|
|
|
|
"Must have set mDecodingLocalFileWithoutTokenizing to false to "
|
|
|
|
"report data to dev tools below");
|
|
|
|
MOZ_ASSERT(!mLookingForMetaCharset,
|
|
|
|
"Must have set mLookingForMetaCharset to false to report data to "
|
|
|
|
"dev tools below");
|
|
|
|
|
|
|
|
// Now skip over as many bytes and redecode the tail of the
|
|
|
|
// buffered bytes.
|
|
|
|
size_t skipped = 0;
|
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
|
|
|
size_t nextSkipped = skipped + buffer.Length();
|
|
|
|
if (nextSkipped <= numAscii) {
|
|
|
|
skipped = nextSkipped;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (skipped >= numAscii) {
|
|
|
|
WriteStreamBytes(buffer);
|
|
|
|
skipped = nextSkipped;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
size_t tailLength = nextSkipped - numAscii;
|
|
|
|
WriteStreamBytes(Span<uint8_t>(buffer).From(buffer.Length() - tailLength));
|
|
|
|
skipped = nextSkipped;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
size_t nsHtml5StreamParser::CountGts() {
|
|
|
|
if (!mGtBuffer) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
size_t gts = 0;
|
|
|
|
nsHtml5OwningUTF16Buffer* buffer = mFirstBufferOfMetaScan;
|
|
|
|
for (;;) {
|
|
|
|
MOZ_ASSERT(buffer, "How did we walk past mGtBuffer?");
|
|
|
|
char16_t* buf = buffer->getBuffer();
|
|
|
|
if (buffer == mGtBuffer) {
|
|
|
|
for (int32_t i = 0; i <= mGtPos; ++i) {
|
|
|
|
if (buf[i] == u'>') {
|
|
|
|
++gts;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
for (int32_t i = 0; i < buffer->getEnd(); ++i) {
|
|
|
|
if (buf[i] == u'>') {
|
|
|
|
++gts;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
buffer = buffer->next;
|
|
|
|
}
|
|
|
|
return gts;
|
|
|
|
}
|
|
|
|
|
|
|
|
void nsHtml5StreamParser::DiscardMetaSpeculation() {
|
|
|
|
mozilla::MutexAutoLock speculationAutoLock(mSpeculationMutex);
|
|
|
|
// Rewind the stream
|
|
|
|
MOZ_ASSERT(!mAtEOF, "How did we end up setting this?");
|
|
|
|
mTokenizer->resetToDataState();
|
|
|
|
mTokenizer->setLineNumber(1);
|
|
|
|
mLastWasCR = false;
|
|
|
|
|
2022-01-17 12:16:10 +03:00
|
|
|
if (mMode == PLAIN_TEXT || mMode == VIEW_SOURCE_PLAIN) {
|
|
|
|
// resetToDataState() above logically rewinds to the state before
|
|
|
|
// the plain text start, so we need to start plain text again to
|
|
|
|
// put the tokenizer into the plain text state.
|
|
|
|
mTokenizer->StartPlainText();
|
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mFirstBuffer = mLastBuffer;
|
|
|
|
mFirstBuffer->setStart(0);
|
|
|
|
mFirstBuffer->setEnd(0);
|
|
|
|
mFirstBuffer->next = nullptr;
|
|
|
|
|
|
|
|
mTreeBuilder->flushCharacters(); // empty the pending buffer
|
|
|
|
mTreeBuilder->ClearOps(); // now get rid of the failed ops
|
|
|
|
|
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
|
|
|
mTokenizer->RewindViewSource();
|
|
|
|
}
|
|
|
|
|
|
|
|
{
|
|
|
|
// We know that this resets the tree builder back to the start state.
|
|
|
|
// This must happen _after_ the flushCharacters() call above!
|
|
|
|
const auto& speculation = mSpeculations.ElementAt(0);
|
|
|
|
mTreeBuilder->loadState(speculation->GetSnapshot());
|
|
|
|
}
|
|
|
|
|
|
|
|
// Experimentation suggests that we don't need to do anything special
|
|
|
|
// for ignoring the leading LF in View Source here.
|
|
|
|
|
|
|
|
mSpeculations.Clear(); // potentially a huge number of destructors
|
|
|
|
// run here synchronously...
|
|
|
|
|
|
|
|
// Now set up a new speculation for the main thread to find.
|
|
|
|
// Note that we stay in the speculating state, because the main thread
|
|
|
|
// knows how to come out of that state and this thread does not.
|
|
|
|
|
|
|
|
nsHtml5Speculation* speculation = new nsHtml5Speculation(
|
|
|
|
mFirstBuffer, mFirstBuffer->getStart(), mTokenizer->getLineNumber(),
|
|
|
|
mTreeBuilder->newSnapshot());
|
|
|
|
MOZ_ASSERT(!mFlushTimerArmed, "How did we end up arming the timer?");
|
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
|
|
|
mTokenizer->SetViewSourceOpSink(speculation);
|
|
|
|
mTokenizer->StartViewSourceCharacters();
|
|
|
|
} else {
|
|
|
|
MOZ_ASSERT(mMode != VIEW_SOURCE_XML);
|
|
|
|
mTreeBuilder->SetOpSink(speculation);
|
|
|
|
}
|
|
|
|
mSpeculations.AppendElement(speculation); // adopts the pointer
|
|
|
|
MOZ_ASSERT(mSpeculating, "How did we end speculating?");
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The general idea is to match WebKit and Blink exactly for meta
|
|
|
|
* scan except:
|
|
|
|
*
|
|
|
|
* 1. WebKit and Blink look for meta as if scripting was disabled
|
|
|
|
* for `noscript` purposes. This implementation matches the
|
|
|
|
* `noscript` treatment of the observable DOM building (in order
|
|
|
|
* to be able to use the same tree builder run).
|
|
|
|
* 2. WebKit and Blink look for meta as if the foreign content
|
|
|
|
* feedback from the tree builder to the tokenizer didn't exist.
|
|
|
|
* This implementation considers the foreign content rules in
|
|
|
|
* order to be able to use the same tree builder run for meta
|
|
|
|
* and the observable DOM building. Note that since <svg> and
|
|
|
|
* <math> imply the end of head, this only matters for meta after
|
|
|
|
* head but starting within the 1024-byte zone.
|
|
|
|
*
|
|
|
|
* Template is treated specially, because that WebKit/Blink behavior
|
|
|
|
* is easy to emulate unlike the above two exceptions. In general,
|
|
|
|
* the meta scan token handler in WebKit and Blink behaves as if there
|
|
|
|
* was a scripting-disabled tree builder predating the introduction
|
|
|
|
* of foreign content and template.
|
|
|
|
*
|
|
|
|
* Meta is honored if it _starts_ within the first 1024 kilobytes or,
|
|
|
|
* if by the 1024-byte boundary head hasn't ended and a template
|
|
|
|
* element hasn't started, a meta occurs before the first of the head
|
|
|
|
* ending or a template element starting.
|
|
|
|
*
|
|
|
|
* If a meta isn't honored according to the above definition, and
|
|
|
|
* we aren't dealing with plain text, the buffered bytes, which by
|
|
|
|
* now have to contain `>` character unless we encountered EOF, are
|
|
|
|
* scanned for syntax resembling an XML declaration.
|
|
|
|
*
|
|
|
|
* If neither a meta nor syntax resembling an XML declaration has
|
|
|
|
* been honored and we aren't inheriting the encoding from a
|
|
|
|
* same-origin parent or parsing for XHR, chardetng is used.
|
|
|
|
* chardetng runs first for the part of the document that was searched
|
|
|
|
* for meta and then at EOF. The part searched for meta is defined as
|
|
|
|
* follows in order to avoid network buffer boundary-dependent
|
|
|
|
* behavior:
|
|
|
|
*
|
|
|
|
* 1. At least the first 1024 bytes. (This is what happens for plain
|
|
|
|
* text.)
|
|
|
|
* 2. If the 1024-byte boundary is within a tag, comment, doctype,
|
|
|
|
* or CDATA section, at least up to the end of that token or CDATA
|
|
|
|
* section. (Exception: If the 1024-byte boundary is in an RCDATA
|
|
|
|
* end tag that hasn't yet been decided to be an end tag, the
|
|
|
|
* token is not considered.)
|
|
|
|
* 3. If at the 1024-byte boundary, head hasn't ended and there hasn't
|
|
|
|
* been a template tag, up to the end of the first template tag
|
|
|
|
* or token ending the head, whichever comes first.
|
|
|
|
* 4. Except if head is ended by a text token, only to the end of the
|
|
|
|
* most recent tag, comment, or doctype token. (Because text is
|
|
|
|
* coalesced, so it would be harder to correlate the text to the
|
|
|
|
* bytes.)
|
|
|
|
*
|
|
|
|
* An encoding-related reload is still possible if chardetng's guess
|
|
|
|
* at EOF differs from its initial guess.
|
|
|
|
*/
|
|
|
|
bool nsHtml5StreamParser::ProcessLookingForMetaCharset(bool aEof) {
|
|
|
|
MOZ_ASSERT(mBomState == BOM_SNIFFING_OVER);
|
|
|
|
MOZ_ASSERT(mMode != VIEW_SOURCE_XML);
|
|
|
|
bool rewound = false;
|
|
|
|
MOZ_ASSERT(mForceAutoDetection ||
|
|
|
|
mCharsetSource < kCharsetFromInitialAutoDetectionASCII ||
|
|
|
|
mCharsetSource == kCharsetFromParentFrame,
|
|
|
|
"Why are we looking for meta charset if we've seen it?");
|
|
|
|
// NOTE! We may come here multiple times with
|
|
|
|
// mNumBytesBuffered == UNCONDITIONAL_META_SCAN_BOUNDARY
|
|
|
|
// if the tokenizer suspends multiple times after decoding has reached
|
|
|
|
// mNumBytesBuffered == UNCONDITIONAL_META_SCAN_BOUNDARY. That's why
|
|
|
|
// we need to also check whether the we are at the end of the last
|
|
|
|
// decoded buffer.
|
|
|
|
// Note that DoDataAvailableBuffer() ensures that the code here has
|
|
|
|
// the opportunity to run at the exact UNCONDITIONAL_META_SCAN_BOUNDARY
|
|
|
|
// even if there isn't a network buffer boundary there.
|
|
|
|
bool atKilobyte = false;
|
|
|
|
if ((mNumBytesBuffered == UNCONDITIONAL_META_SCAN_BOUNDARY &&
|
|
|
|
mFirstBuffer == mLastBuffer && !mFirstBuffer->hasMore())) {
|
|
|
|
atKilobyte = true;
|
|
|
|
mTokenizer->AtKilobyteBoundary();
|
|
|
|
}
|
|
|
|
if (!mNeedsEncodingSwitchTo &&
|
|
|
|
(aEof || (mTemplatePushedOrHeadPopped &&
|
|
|
|
!mTokenizer->IsInTokenStartedAtKilobyteBoundary() &&
|
|
|
|
(atKilobyte ||
|
|
|
|
mNumBytesBuffered > UNCONDITIONAL_META_SCAN_BOUNDARY)))) {
|
|
|
|
// meta charset was not found
|
|
|
|
mLookingForMetaCharset = false;
|
|
|
|
if (mStartsWithLtQuestion && mCharsetSource < kCharsetFromXmlDeclaration) {
|
|
|
|
// Look for bogo XML declaration.
|
|
|
|
// Search the first buffer in the hope that '>' is within it.
|
|
|
|
MOZ_ASSERT(!mBufferedBytes.IsEmpty(),
|
|
|
|
"How did at least <? not get buffered?");
|
|
|
|
Buffer<uint8_t>& first = mBufferedBytes[0];
|
|
|
|
const Encoding* encoding =
|
|
|
|
xmldecl_parse(first.Elements(), first.Length());
|
|
|
|
if (!encoding) {
|
|
|
|
// Our bogo XML declaration scanner wants to see a contiguous buffer, so
|
|
|
|
// let's linearize the data. (Ideally, the XML declaration scanner would
|
|
|
|
// be incremental, but this is the rare path anyway.)
|
|
|
|
Vector<uint8_t> contiguous;
|
|
|
|
if (!contiguous.append(first.Elements(), first.Length())) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
for (size_t i = 1; i < mBufferedBytes.Length(); ++i) {
|
|
|
|
Buffer<uint8_t>& buffer = mBufferedBytes[i];
|
|
|
|
const uint8_t* elements = buffer.Elements();
|
|
|
|
size_t length = buffer.Length();
|
|
|
|
const uint8_t* lt = (const uint8_t*)memchr(elements, '>', length);
|
|
|
|
bool stop = false;
|
|
|
|
if (lt) {
|
|
|
|
length = (lt - elements) + 1;
|
|
|
|
stop = true;
|
|
|
|
}
|
|
|
|
if (!contiguous.append(elements, length)) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
if (stop) {
|
|
|
|
// Avoid linearizing all buffered bytes unnecessarily.
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
encoding = xmldecl_parse(contiguous.begin(), contiguous.length());
|
|
|
|
}
|
|
|
|
if (encoding) {
|
|
|
|
if (!(mForceAutoDetection && (encoding->IsAsciiCompatible() ||
|
|
|
|
encoding == ISO_2022_JP_ENCODING))) {
|
|
|
|
mForceAutoDetection = false;
|
|
|
|
mNeedsEncodingSwitchTo = encoding;
|
|
|
|
mEncodingSwitchSource = kCharsetFromXmlDeclaration;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
// Check again in case we found an encoding in the bogo XML declaration.
|
|
|
|
if (!mNeedsEncodingSwitchTo &&
|
|
|
|
(mForceAutoDetection ||
|
|
|
|
mCharsetSource < kCharsetFromInitialAutoDetectionASCII) &&
|
|
|
|
!(mMode == LOAD_AS_DATA || mMode == VIEW_SOURCE_XML) &&
|
|
|
|
!(mDecodingLocalFileWithoutTokenizing && !aEof &&
|
|
|
|
mNumBytesBuffered <= LOCAL_FILE_UTF_8_BUFFER_SIZE)) {
|
|
|
|
MOZ_ASSERT(!mStartedFeedingDetector);
|
|
|
|
if (mNumBytesBuffered == UNCONDITIONAL_META_SCAN_BOUNDARY || aEof) {
|
|
|
|
// We know that all the buffered bytes have been tokenized, so feed
|
|
|
|
// them all to chardetng.
|
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
|
|
|
FeedDetector(buffer);
|
|
|
|
}
|
|
|
|
if (aEof) {
|
|
|
|
MOZ_ASSERT(!mChardetEof);
|
|
|
|
DetectorEof();
|
|
|
|
}
|
|
|
|
auto [encoding, source] = GuessEncoding(true);
|
|
|
|
mNeedsEncodingSwitchTo = encoding;
|
|
|
|
mEncodingSwitchSource = source;
|
|
|
|
} else if (mNumBytesBuffered > UNCONDITIONAL_META_SCAN_BOUNDARY) {
|
|
|
|
size_t gtsLeftToFind = CountGts();
|
|
|
|
size_t bytesSeen = 0;
|
|
|
|
// We sync the bytes to the UTF-16 code units seen to avoid depending
|
|
|
|
// on network buffer boundaries. We do the syncing by counting '>'
|
|
|
|
// bytes / code units. However, we always scan at least 1024 bytes.
|
|
|
|
// The 1024-byte boundary is guaranteed to be between buffers.
|
|
|
|
// The guarantee is implemented in DoDataAvailableBuffer().
|
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
|
|
|
if (!mNeedsEncodingSwitchTo) {
|
|
|
|
if (gtsLeftToFind) {
|
|
|
|
auto span = buffer.AsSpan();
|
|
|
|
bool feed = true;
|
|
|
|
for (size_t i = 0; i < span.Length(); ++i) {
|
|
|
|
if (span[i] == uint8_t('>')) {
|
|
|
|
--gtsLeftToFind;
|
|
|
|
if (!gtsLeftToFind) {
|
|
|
|
if (bytesSeen < UNCONDITIONAL_META_SCAN_BOUNDARY) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
++i; // Skip the gt
|
|
|
|
FeedDetector(span.To(i));
|
|
|
|
auto [encoding, source] = GuessEncoding(true);
|
|
|
|
mNeedsEncodingSwitchTo = encoding;
|
|
|
|
mEncodingSwitchSource = source;
|
|
|
|
FeedDetector(span.From(i));
|
|
|
|
bytesSeen += buffer.Length();
|
|
|
|
// No need to update bytesSeen anymore, but let's do it for
|
|
|
|
// debugging.
|
|
|
|
// We should do `continue outer;` but C++ can't.
|
|
|
|
feed = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (feed) {
|
|
|
|
FeedDetector(buffer);
|
|
|
|
bytesSeen += buffer.Length();
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (bytesSeen == UNCONDITIONAL_META_SCAN_BOUNDARY) {
|
|
|
|
auto [encoding, source] = GuessEncoding(true);
|
|
|
|
mNeedsEncodingSwitchTo = encoding;
|
|
|
|
mEncodingSwitchSource = source;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
FeedDetector(buffer);
|
|
|
|
bytesSeen += buffer.Length();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
MOZ_ASSERT(mNeedsEncodingSwitchTo,
|
|
|
|
"How come we didn't call GuessEncoding()?");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (mNeedsEncodingSwitchTo) {
|
|
|
|
mDecodingLocalFileWithoutTokenizing = false;
|
|
|
|
mLookingForMetaCharset = false;
|
|
|
|
|
|
|
|
auto needsEncodingSwitchTo = WrapNotNull(mNeedsEncodingSwitchTo);
|
|
|
|
mNeedsEncodingSwitchTo = nullptr;
|
|
|
|
|
|
|
|
SwitchDecoderIfAsciiSoFar(needsEncodingSwitchTo);
|
|
|
|
// The above line may have changed mEncoding so that mEncoding equals
|
|
|
|
// needsEncodingSwitchTo.
|
|
|
|
|
|
|
|
mCharsetSource = mEncodingSwitchSource;
|
|
|
|
|
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
auto r = mTreeBuilder->Flush();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return false;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (mEncoding != needsEncodingSwitchTo) {
|
|
|
|
// Speculation failed
|
|
|
|
rewound = true;
|
|
|
|
|
|
|
|
if (mEncoding == ISO_2022_JP_ENCODING ||
|
|
|
|
needsEncodingSwitchTo == ISO_2022_JP_ENCODING) {
|
|
|
|
// Chances are no Web author will fix anything due to this message, so
|
|
|
|
// this is here to help understanding issues when debugging sites made
|
|
|
|
// by someone else.
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncSpeculationFail2022", false,
|
|
|
|
mTokenizer->getLineNumber());
|
|
|
|
} else {
|
|
|
|
if (mCharsetSource == kCharsetFromMetaTag) {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset(
|
|
|
|
"EncSpeculationFailMeta", false, mTokenizer->getLineNumber());
|
|
|
|
} else if (mCharsetSource == kCharsetFromXmlDeclaration) {
|
|
|
|
// This intentionally refers to the line number of how far ahead
|
|
|
|
// the document was parsed even though the bogo XML decl is always
|
|
|
|
// on line 1.
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset(
|
|
|
|
"EncSpeculationFailXml", false, mTokenizer->getLineNumber());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
DiscardMetaSpeculation();
|
|
|
|
// Redecode the stream.
|
|
|
|
mEncoding = needsEncodingSwitchTo;
|
|
|
|
mUnicodeDecoder = mEncoding->NewDecoderWithBOMRemoval();
|
|
|
|
mHasHadErrors = false;
|
|
|
|
|
|
|
|
MOZ_ASSERT(!mDecodingLocalFileWithoutTokenizing,
|
|
|
|
"Must have set mDecodingLocalFileWithoutTokenizing to false "
|
|
|
|
"to report data to dev tools below");
|
|
|
|
MOZ_ASSERT(!mLookingForMetaCharset,
|
|
|
|
"Must have set mLookingForMetaCharset to false to report data "
|
|
|
|
"to dev tools below");
|
|
|
|
for (auto&& buffer : mBufferedBytes) {
|
2022-04-27 10:24:34 +03:00
|
|
|
nsresult rv = WriteStreamBytes(buffer);
|
|
|
|
if (NS_FAILED(rv)) {
|
|
|
|
MarkAsBroken(rv);
|
|
|
|
return false;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
} else if (!mLookingForMetaCharset && !mDecodingLocalFileWithoutTokenizing) {
|
|
|
|
MOZ_ASSERT(!mStartedFeedingDevTools);
|
|
|
|
// Report all already-decoded buffers to the dev tools if needed.
|
|
|
|
if (mURIToSendToDevtools) {
|
|
|
|
nsHtml5OwningUTF16Buffer* buffer = mFirstBufferOfMetaScan;
|
|
|
|
while (buffer) {
|
|
|
|
auto s = Span(buffer->getBuffer(), buffer->getEnd());
|
|
|
|
OnNewContent(s);
|
|
|
|
buffer = buffer->next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!mLookingForMetaCharset) {
|
|
|
|
mGtBuffer = nullptr;
|
|
|
|
mGtPos = 0;
|
|
|
|
|
|
|
|
if (!mDecodingLocalFileWithoutTokenizing) {
|
|
|
|
mFirstBufferOfMetaScan = nullptr;
|
|
|
|
mBufferingBytes = false;
|
|
|
|
mBufferedBytes.Clear();
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->SetDocumentCharset(mEncoding, mCharsetSource, true);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
auto r = mTreeBuilder->Flush();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return false;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return rewound;
|
|
|
|
}
|
|
|
|
|
2009-10-12 17:08:04 +04:00
|
|
|
void nsHtml5StreamParser::ParseAvailableData() {
|
2018-12-04 11:02:04 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
2009-10-12 17:08:04 +04:00
|
|
|
mTokenizerMutex.AssertCurrentThreadOwns();
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(!(mDecodingLocalFileWithoutTokenizing && !mLookingForMetaCharset));
|
2009-10-12 17:08:04 +04:00
|
|
|
|
|
|
|
if (IsTerminatedOrInterrupted()) {
|
2009-09-18 13:21:47 +04:00
|
|
|
return;
|
|
|
|
}
|
2015-04-14 02:36:33 +03:00
|
|
|
|
|
|
|
if (mSpeculating && !IsSpeculationEnabled()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
bool requestedReload = false;
|
2009-09-18 13:21:47 +04:00
|
|
|
for (;;) {
|
|
|
|
if (!mFirstBuffer->hasMore()) {
|
|
|
|
if (mFirstBuffer == mLastBuffer) {
|
2009-09-21 17:18:20 +04:00
|
|
|
switch (mStreamState) {
|
|
|
|
case STREAM_BEING_READ:
|
2009-10-12 17:08:04 +04:00
|
|
|
// never release the last buffer.
|
|
|
|
if (!mSpeculating) {
|
|
|
|
// reuse buffer space if not speculating
|
|
|
|
mFirstBuffer->setStart(0);
|
|
|
|
mFirstBuffer->setEnd(0);
|
|
|
|
}
|
2009-09-18 13:21:47 +04:00
|
|
|
return; // no more data for now but expecting more
|
2009-09-21 17:18:20 +04:00
|
|
|
case STREAM_ENDED:
|
2009-10-12 17:08:04 +04:00
|
|
|
if (mAtEOF) {
|
2010-04-09 18:11:56 +04:00
|
|
|
return;
|
2009-10-12 17:08:04 +04:00
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mLookingForMetaCharset) {
|
|
|
|
// When called with aEof=true, ProcessLookingForMetaCharset()
|
|
|
|
// is guaranteed to set mLookingForMetaCharset to false so
|
|
|
|
// that we can't come here twice.
|
|
|
|
if (ProcessLookingForMetaCharset(true)) {
|
|
|
|
if (IsTerminatedOrInterrupted()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
} else if ((mForceAutoDetection ||
|
|
|
|
mCharsetSource < kCharsetFromParentFrame) &&
|
|
|
|
!(mMode == LOAD_AS_DATA || mMode == VIEW_SOURCE_XML) &&
|
|
|
|
!mReparseForbidden) {
|
|
|
|
// An earlier DetectorEof() call is possible in which case
|
|
|
|
// the one here is a no-op.
|
|
|
|
DetectorEof();
|
|
|
|
auto [encoding, source] = GuessEncoding(false);
|
|
|
|
if (encoding != mEncoding) {
|
|
|
|
// Request a reload from the docshell.
|
|
|
|
MOZ_ASSERT(
|
|
|
|
(source >=
|
2022-03-29 11:04:25 +03:00
|
|
|
kCharsetFromFinalAutoDetectionWouldHaveBeenUTF8InitialWasASCII &&
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
source <=
|
2022-03-29 11:04:25 +03:00
|
|
|
kCharsetFromFinalAutoDetectionWouldNotHaveBeenUTF8DependedOnTLDInitialWasASCII) ||
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
source == kCharsetFromFinalUserForcedAutoDetection);
|
|
|
|
mTreeBuilder->NeedsCharsetSwitchTo(encoding, source, 0);
|
|
|
|
requestedReload = true;
|
2021-12-14 18:01:21 +03:00
|
|
|
} else if (mCharsetSource ==
|
|
|
|
kCharsetFromInitialAutoDetectionASCII &&
|
|
|
|
mDetectorHasSeenNonAscii) {
|
|
|
|
mCharsetSource = source;
|
2021-12-22 18:54:49 +03:00
|
|
|
mTreeBuilder->UpdateCharsetSource(mCharsetSource);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-10-17 18:59:28 +04:00
|
|
|
mAtEOF = true;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (!mForceAutoDetection && !requestedReload) {
|
|
|
|
if (mCharsetSource == kCharsetFromParentFrame) {
|
2018-03-16 18:26:06 +03:00
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncNoDeclarationFrame",
|
|
|
|
false, 0);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
} else if (mCharsetSource == kCharsetFromXmlDeclaration) {
|
|
|
|
// We know the bogo XML decl is always on the first line.
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncXmlDecl", false, 1);
|
|
|
|
} else if (
|
|
|
|
mCharsetSource >=
|
|
|
|
kCharsetFromInitialAutoDetectionWouldHaveBeenUTF8 &&
|
|
|
|
mCharsetSource <=
|
|
|
|
kCharsetFromInitialAutoDetectionWouldNotHaveBeenUTF8DependedOnTLD) {
|
|
|
|
if (mMode == PLAIN_TEXT || mMode == VIEW_SOURCE_PLAIN) {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncNoDeclPlain",
|
|
|
|
true, 0);
|
|
|
|
} else {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncNoDecl", true, 0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mHasHadErrors && mEncoding != REPLACEMENT_ENCODING) {
|
|
|
|
if (mEncoding == UTF_8_ENCODING) {
|
|
|
|
mTreeBuilder->TryToEnableEncodingMenu();
|
|
|
|
}
|
|
|
|
if (mCharsetSource == kCharsetFromParentFrame) {
|
|
|
|
if (mMode == PLAIN_TEXT || mMode == VIEW_SOURCE_PLAIN) {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset(
|
|
|
|
"EncErrorFramePlain", true, 0);
|
|
|
|
} else {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncErrorFrame",
|
|
|
|
true, 0);
|
|
|
|
}
|
|
|
|
} else if (
|
|
|
|
mCharsetSource >= kCharsetFromXmlDeclaration &&
|
|
|
|
!(mCharsetSource >=
|
2022-03-29 11:04:25 +03:00
|
|
|
kCharsetFromFinalAutoDetectionWouldHaveBeenUTF8InitialWasASCII &&
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mCharsetSource <=
|
|
|
|
kCharsetFromFinalUserForcedAutoDetection)) {
|
|
|
|
mTreeBuilder->MaybeComplainAboutCharset("EncError", true, 0);
|
|
|
|
}
|
2012-03-24 15:34:42 +04:00
|
|
|
}
|
|
|
|
}
|
2015-08-25 18:05:45 +03:00
|
|
|
if (NS_SUCCEEDED(mTreeBuilder->IsBroken())) {
|
|
|
|
mTokenizer->eof();
|
2015-08-25 18:05:46 +03:00
|
|
|
nsresult rv;
|
|
|
|
if (NS_FAILED((rv = mTreeBuilder->IsBroken()))) {
|
|
|
|
MarkAsBroken(rv);
|
|
|
|
} else {
|
|
|
|
mTreeBuilder->StreamEnded();
|
|
|
|
if (mMode == VIEW_SOURCE_HTML || mMode == VIEW_SOURCE_XML) {
|
2022-04-27 10:24:34 +03:00
|
|
|
if (!mTokenizer->EndViewSource()) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
}
|
2015-08-25 18:05:46 +03:00
|
|
|
}
|
2015-08-25 18:05:45 +03:00
|
|
|
}
|
2010-07-30 14:15:38 +04:00
|
|
|
}
|
2010-04-16 14:52:06 +04:00
|
|
|
FlushTreeOpsAndDisarmTimer();
|
2009-09-18 13:21:47 +04:00
|
|
|
return; // no more data and not expecting more
|
|
|
|
default:
|
2018-06-18 08:43:11 +03:00
|
|
|
MOZ_ASSERT_UNREACHABLE("It should be impossible to reach this.");
|
2009-09-18 13:21:47 +04:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
2010-04-09 18:11:56 +04:00
|
|
|
mFirstBuffer = mFirstBuffer->next;
|
|
|
|
continue;
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
// now we have a non-empty buffer
|
|
|
|
mFirstBuffer->adjust(mLastWasCR);
|
2011-10-17 18:59:28 +04:00
|
|
|
mLastWasCR = false;
|
2009-09-18 13:21:47 +04:00
|
|
|
if (mFirstBuffer->hasMore()) {
|
2015-08-25 18:05:45 +03:00
|
|
|
if (!mTokenizer->EnsureBufferSpace(mFirstBuffer->getLength())) {
|
|
|
|
MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
2009-09-18 13:21:47 +04:00
|
|
|
mLastWasCR = mTokenizer->tokenizeBuffer(mFirstBuffer);
|
2015-08-25 18:05:46 +03:00
|
|
|
nsresult rv;
|
|
|
|
if (NS_FAILED((rv = mTreeBuilder->IsBroken()))) {
|
|
|
|
MarkAsBroken(rv);
|
|
|
|
return;
|
|
|
|
}
|
2012-06-20 11:05:39 +04:00
|
|
|
if (mTreeBuilder->HasScript()) {
|
|
|
|
// HasScript() cannot return true if the tree builder is preventing
|
|
|
|
// script execution.
|
|
|
|
MOZ_ASSERT(mMode == NORMAL);
|
2009-10-12 17:08:04 +04:00
|
|
|
mozilla::MutexAutoLock speculationAutoLock(mSpeculationMutex);
|
2017-04-27 13:27:03 +03:00
|
|
|
nsHtml5Speculation* speculation = new nsHtml5Speculation(
|
2009-10-28 16:48:37 +03:00
|
|
|
mFirstBuffer, mFirstBuffer->getStart(), mTokenizer->getLineNumber(),
|
2009-10-12 17:08:04 +04:00
|
|
|
mTreeBuilder->newSnapshot());
|
2017-04-27 13:27:03 +03:00
|
|
|
mTreeBuilder->AddSnapshotToScript(speculation->GetSnapshot(),
|
2009-10-28 16:48:37 +03:00
|
|
|
speculation->GetStartLineNumber());
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mLookingForMetaCharset) {
|
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
auto r = mTreeBuilder->Flush();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return;
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
FlushTreeOpsAndDisarmTimer();
|
|
|
|
}
|
2009-10-12 17:08:04 +04:00
|
|
|
mTreeBuilder->SetOpSink(speculation);
|
|
|
|
mSpeculations.AppendElement(speculation); // adopts the pointer
|
2011-10-17 18:59:28 +04:00
|
|
|
mSpeculating = true;
|
2009-10-12 17:08:04 +04:00
|
|
|
}
|
|
|
|
if (IsTerminatedOrInterrupted()) {
|
|
|
|
return;
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mLookingForMetaCharset) {
|
|
|
|
Unused << ProcessLookingForMetaCharset(false);
|
|
|
|
}
|
2009-09-18 13:21:47 +04:00
|
|
|
}
|
|
|
|
}
|
2009-09-21 17:18:20 +04:00
|
|
|
|
2016-04-26 03:23:21 +03:00
|
|
|
class nsHtml5StreamParserContinuation : public Runnable {
|
2009-09-25 21:11:02 +04:00
|
|
|
private:
|
2017-06-27 20:38:30 +03:00
|
|
|
nsHtml5StreamParserPtr mStreamParser;
|
|
|
|
|
2009-09-25 21:11:02 +04:00
|
|
|
public:
|
2014-09-02 02:04:52 +04:00
|
|
|
explicit nsHtml5StreamParserContinuation(nsHtml5StreamParser* aStreamParser)
|
2017-05-10 02:19:37 +03:00
|
|
|
: Runnable("nsHtml5StreamParserContinuation"),
|
|
|
|
mStreamParser(aStreamParser) {}
|
2016-08-08 05:18:10 +03:00
|
|
|
NS_IMETHOD Run() override {
|
2009-10-12 17:08:04 +04:00
|
|
|
mozilla::MutexAutoLock autoLock(mStreamParser->mTokenizerMutex);
|
2010-04-16 14:52:06 +04:00
|
|
|
mStreamParser->Uninterrupt();
|
2009-10-12 17:08:04 +04:00
|
|
|
mStreamParser->ParseAvailableData();
|
2009-09-25 21:11:02 +04:00
|
|
|
return NS_OK;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
void nsHtml5StreamParser::ContinueAfterScriptsOrEncodingCommitment(
|
|
|
|
nsHtml5Tokenizer* aTokenizer, nsHtml5TreeBuilder* aTreeBuilder,
|
|
|
|
bool aLastWasCR) {
|
2021-12-22 18:54:49 +03:00
|
|
|
// nullptr for aTokenizer means encoding commitment as opposed to the "after
|
|
|
|
// scripts" case.
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(NS_IsMainThread(), "Wrong thread!");
|
|
|
|
MOZ_ASSERT(mMode != VIEW_SOURCE_XML,
|
|
|
|
"ContinueAfterScriptsOrEncodingCommitment called in XML view "
|
|
|
|
"source mode!");
|
|
|
|
MOZ_ASSERT(!(aTokenizer && mMode == VIEW_SOURCE_HTML),
|
|
|
|
"ContinueAfterScriptsOrEncodingCommitment called with non-null "
|
|
|
|
"tokenizer in HTML view "
|
|
|
|
"source mode.");
|
2012-08-14 13:10:41 +04:00
|
|
|
if (NS_FAILED(mExecutor->IsBroken())) {
|
2011-09-28 16:45:17 +04:00
|
|
|
return;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(!(aTokenizer && mMode != NORMAL),
|
|
|
|
"We should only be executing scripts in the normal mode.");
|
2021-12-22 18:54:49 +03:00
|
|
|
if (!aTokenizer && (mMode == PLAIN_TEXT || mMode == VIEW_SOURCE_PLAIN ||
|
|
|
|
mMode == VIEW_SOURCE_HTML)) {
|
|
|
|
// Take the ops that were generated from OnStartRequest for the synthetic
|
|
|
|
// head section of the document for plain text and HTML View Source.
|
|
|
|
// XML View Source never needs this kind of encoding commitment.
|
|
|
|
// We need to take the ops here so that they end up in the queue before
|
|
|
|
// the ops that we take from a speculation later in this method.
|
2022-04-27 10:24:34 +03:00
|
|
|
if (!mExecutor->TakeOpsFromStage()) {
|
|
|
|
mExecutor->MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
2021-12-22 18:54:49 +03:00
|
|
|
} else {
|
2018-03-16 18:26:06 +03:00
|
|
|
#ifdef DEBUG
|
2021-12-22 18:54:49 +03:00
|
|
|
mExecutor->AssertStageEmpty();
|
2018-03-16 18:26:06 +03:00
|
|
|
#endif
|
2021-12-22 18:54:49 +03:00
|
|
|
}
|
2011-09-29 10:19:26 +04:00
|
|
|
bool speculationFailed = false;
|
2009-10-12 17:08:04 +04:00
|
|
|
{
|
|
|
|
mozilla::MutexAutoLock speculationAutoLock(mSpeculationMutex);
|
|
|
|
if (mSpeculations.IsEmpty()) {
|
2018-06-18 08:43:11 +03:00
|
|
|
MOZ_ASSERT_UNREACHABLE(
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
"ContinueAfterScriptsOrEncodingCommitment called without "
|
2018-06-18 08:43:11 +03:00
|
|
|
"speculations.");
|
2009-10-12 17:08:04 +04:00
|
|
|
return;
|
|
|
|
}
|
2018-06-18 08:43:11 +03:00
|
|
|
|
2020-02-28 23:49:06 +03:00
|
|
|
const auto& speculation = mSpeculations.ElementAt(0);
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (aTokenizer &&
|
|
|
|
(aLastWasCR || !aTokenizer->isInDataState() ||
|
|
|
|
!aTreeBuilder->snapshotMatches(speculation->GetSnapshot()))) {
|
2011-10-17 18:59:28 +04:00
|
|
|
speculationFailed = true;
|
2009-10-12 17:08:04 +04:00
|
|
|
// We've got a failed speculation :-(
|
2015-04-14 02:36:33 +03:00
|
|
|
MaybeDisableFutureSpeculation();
|
2009-10-12 17:08:04 +04:00
|
|
|
Interrupt(); // Make the parser thread release the tokenizer mutex sooner
|
2021-12-18 10:37:15 +03:00
|
|
|
// Note that the interrupted state continues across possible intervening
|
|
|
|
// Necko events until the nsHtml5StreamParserContinuation posted at the
|
|
|
|
// end of this method runs. Therefore, this thread is guaranteed to
|
|
|
|
// acquire mTokenizerMutex soon even if an intervening Necko event grabbed
|
|
|
|
// it between now and the acquisition below.
|
|
|
|
|
2018-03-16 18:26:06 +03:00
|
|
|
// now fall out of the speculationAutoLock into the tokenizerAutoLock
|
|
|
|
// block
|
2009-10-12 17:08:04 +04:00
|
|
|
} else {
|
|
|
|
// We've got a successful speculation!
|
|
|
|
if (mSpeculations.Length() > 1) {
|
2017-04-27 13:27:03 +03:00
|
|
|
// the first speculation isn't the current speculation, so there's
|
2009-10-12 17:08:04 +04:00
|
|
|
// no need to bother the parser thread.
|
2022-04-27 10:24:34 +03:00
|
|
|
if (!speculation->FlushToSink(mExecutor)) {
|
|
|
|
mExecutor->MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(!mExecutor->IsScriptExecuting(),
|
|
|
|
"ParseUntilBlocked() was supposed to ensure we don't come "
|
|
|
|
"here when scripts are executing.");
|
2021-12-22 18:54:49 +03:00
|
|
|
MOZ_ASSERT(!aTokenizer || mExecutor->IsInFlushLoop(),
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
"How are we here if "
|
|
|
|
"RunFlushLoop() didn't call ParseUntilBlocked() or we're "
|
|
|
|
"not committing to an encoding?");
|
2009-10-12 17:08:04 +04:00
|
|
|
mSpeculations.RemoveElementAt(0);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
// else
|
|
|
|
Interrupt(); // Make the parser thread release the tokenizer mutex sooner
|
2021-12-18 10:37:15 +03:00
|
|
|
// Note that the interrupted state continues across possible intervening
|
|
|
|
// Necko events until the nsHtml5StreamParserContinuation posted at the
|
|
|
|
// end of this method runs. Therefore, this thread is guaranteed to
|
|
|
|
// acquire mTokenizerMutex soon even if an intervening Necko event grabbed
|
|
|
|
// it between now and the acquisition below.
|
2017-04-27 13:27:03 +03:00
|
|
|
|
2009-10-12 17:08:04 +04:00
|
|
|
// now fall through
|
2017-04-27 13:27:03 +03:00
|
|
|
// the first speculation is the current speculation. Need to
|
|
|
|
// release the the speculation mutex and acquire the tokenizer
|
2009-10-12 17:08:04 +04:00
|
|
|
// mutex. (Just acquiring the other mutex here would deadlock)
|
|
|
|
}
|
|
|
|
}
|
2009-09-21 17:18:20 +04:00
|
|
|
{
|
2009-10-12 17:08:04 +04:00
|
|
|
mozilla::MutexAutoLock tokenizerAutoLock(mTokenizerMutex);
|
2018-03-16 18:26:06 +03:00
|
|
|
#ifdef DEBUG
|
2010-04-09 18:11:56 +04:00
|
|
|
{
|
2018-03-16 18:26:06 +03:00
|
|
|
mAtomTable.SetPermittedLookupEventTarget(
|
|
|
|
GetMainThreadSerialEventTarget());
|
2010-04-09 18:11:56 +04:00
|
|
|
}
|
2018-03-16 18:26:06 +03:00
|
|
|
#endif
|
2009-10-12 17:08:04 +04:00
|
|
|
// In principle, the speculation mutex should be acquired here,
|
|
|
|
// but there's no point, because the parser thread only acquires it
|
|
|
|
// when it has also acquired the tokenizer mutex and we are already
|
|
|
|
// holding the tokenizer mutex.
|
|
|
|
if (speculationFailed) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(mMode == NORMAL);
|
2009-10-12 17:08:04 +04:00
|
|
|
// Rewind the stream
|
2011-10-17 18:59:28 +04:00
|
|
|
mAtEOF = false;
|
2020-02-28 23:49:06 +03:00
|
|
|
const auto& speculation = mSpeculations.ElementAt(0);
|
2009-10-12 17:08:04 +04:00
|
|
|
mFirstBuffer = speculation->GetBuffer();
|
|
|
|
mFirstBuffer->setStart(speculation->GetStart());
|
2009-10-28 16:48:37 +03:00
|
|
|
mTokenizer->setLineNumber(speculation->GetStartLineNumber());
|
2010-06-23 12:33:43 +04:00
|
|
|
|
2011-12-15 18:47:03 +04:00
|
|
|
nsContentUtils::ReportToConsole(
|
2013-08-21 23:28:26 +04:00
|
|
|
nsIScriptError::warningFlag, "DOM Events"_ns,
|
2011-12-15 18:47:03 +04:00
|
|
|
mExecutor->GetDocument(), nsContentUtils::eDOM_PROPERTIES,
|
2021-10-12 09:49:15 +03:00
|
|
|
"SpeculationFailed2", nsTArray<nsString>(), nullptr, u""_ns,
|
2011-12-15 18:47:03 +04:00
|
|
|
speculation->GetStartLineNumber());
|
2010-06-23 12:33:43 +04:00
|
|
|
|
2011-09-28 16:45:17 +04:00
|
|
|
nsHtml5OwningUTF16Buffer* buffer = mFirstBuffer->next;
|
2009-10-12 17:08:04 +04:00
|
|
|
while (buffer) {
|
|
|
|
buffer->setStart(0);
|
|
|
|
buffer = buffer->next;
|
|
|
|
}
|
2017-04-27 13:27:03 +03:00
|
|
|
|
|
|
|
mSpeculations.Clear(); // potentially a huge number of destructors
|
2009-10-12 17:08:04 +04:00
|
|
|
// run here synchronously on the main thread...
|
2009-10-30 14:47:17 +03:00
|
|
|
|
|
|
|
mTreeBuilder->flushCharacters(); // empty the pending buffer
|
2018-03-16 18:26:06 +03:00
|
|
|
mTreeBuilder->ClearOps(); // now get rid of the failed ops
|
2009-10-30 14:47:17 +03:00
|
|
|
|
2009-10-12 17:08:04 +04:00
|
|
|
mTreeBuilder->SetOpSink(mExecutor->GetStage());
|
|
|
|
mExecutor->StartReadingFromStage();
|
2011-10-17 18:59:28 +04:00
|
|
|
mSpeculating = false;
|
2010-04-16 14:52:06 +04:00
|
|
|
|
2009-10-12 17:08:04 +04:00
|
|
|
// Copy state over
|
|
|
|
mLastWasCR = aLastWasCR;
|
|
|
|
mTokenizer->loadState(aTokenizer);
|
2018-11-12 17:37:09 +03:00
|
|
|
mTreeBuilder->loadState(aTreeBuilder);
|
2017-04-27 13:27:03 +03:00
|
|
|
} else {
|
2009-10-12 17:08:04 +04:00
|
|
|
// We've got a successful speculation and at least a moment ago it was
|
|
|
|
// the current speculation
|
2022-04-27 10:24:34 +03:00
|
|
|
if (!mSpeculations.ElementAt(0)->FlushToSink(mExecutor)) {
|
|
|
|
mExecutor->MarkAsBroken(NS_ERROR_OUT_OF_MEMORY);
|
|
|
|
return;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(!mExecutor->IsScriptExecuting(),
|
|
|
|
"ParseUntilBlocked() was supposed to ensure we don't come "
|
|
|
|
"here when scripts are executing.");
|
2021-12-22 18:54:49 +03:00
|
|
|
MOZ_ASSERT(!aTokenizer || mExecutor->IsInFlushLoop(),
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
"How are we here if "
|
|
|
|
"RunFlushLoop() didn't call ParseUntilBlocked() or we're not "
|
|
|
|
"committing to an encoding?");
|
2009-10-12 17:08:04 +04:00
|
|
|
mSpeculations.RemoveElementAt(0);
|
|
|
|
if (mSpeculations.IsEmpty()) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
if (mMode == VIEW_SOURCE_HTML) {
|
|
|
|
// If we looked for meta charset in the HTML View Source case.
|
|
|
|
mTokenizer->SetViewSourceOpSink(mExecutor->GetStage());
|
|
|
|
} else {
|
|
|
|
// yes, it was still the only speculation. Now stop speculating
|
|
|
|
// However, before telling the executor to read from stage, flush
|
|
|
|
// any pending ops straight to the executor, because otherwise
|
|
|
|
// they remain unflushed until we get more data from the network.
|
|
|
|
mTreeBuilder->SetOpSink(mExecutor);
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTreeBuilder->Flush(true);
|
|
|
|
if (r.isErr()) {
|
|
|
|
mExecutor->MarkAsBroken(r.unwrapErr());
|
|
|
|
return;
|
|
|
|
}
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
mTreeBuilder->SetOpSink(mExecutor->GetStage());
|
|
|
|
}
|
2009-10-12 17:08:04 +04:00
|
|
|
mExecutor->StartReadingFromStage();
|
2011-10-17 18:59:28 +04:00
|
|
|
mSpeculating = false;
|
2009-10-12 17:08:04 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this);
|
2017-06-08 02:23:29 +03:00
|
|
|
if (NS_FAILED(mEventTarget->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
|
2010-04-16 14:52:06 +04:00
|
|
|
NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation");
|
2009-10-12 17:08:04 +04:00
|
|
|
}
|
2018-03-16 18:26:06 +03:00
|
|
|
// A stream event might run before this event runs, but that's harmless.
|
|
|
|
#ifdef DEBUG
|
|
|
|
mAtomTable.SetPermittedLookupEventTarget(mEventTarget);
|
|
|
|
#endif
|
2009-09-21 17:18:20 +04:00
|
|
|
}
|
|
|
|
}
|
2009-10-21 16:12:50 +04:00
|
|
|
|
|
|
|
void nsHtml5StreamParser::ContinueAfterFailedCharsetSwitch() {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(NS_IsMainThread(), "Wrong thread!");
|
2009-10-21 16:12:50 +04:00
|
|
|
nsCOMPtr<nsIRunnable> event = new nsHtml5StreamParserContinuation(this);
|
2017-06-08 02:23:29 +03:00
|
|
|
if (NS_FAILED(mEventTarget->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
|
2009-11-17 11:52:30 +03:00
|
|
|
NS_WARNING("Failed to dispatch nsHtml5StreamParserContinuation");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-04-26 03:23:21 +03:00
|
|
|
class nsHtml5TimerKungFu : public Runnable {
|
2010-04-16 14:52:06 +04:00
|
|
|
private:
|
2017-06-27 20:38:30 +03:00
|
|
|
nsHtml5StreamParserPtr mStreamParser;
|
|
|
|
|
2010-04-16 14:52:06 +04:00
|
|
|
public:
|
2014-09-02 02:04:52 +04:00
|
|
|
explicit nsHtml5TimerKungFu(nsHtml5StreamParser* aStreamParser)
|
2017-05-10 02:19:37 +03:00
|
|
|
: Runnable("nsHtml5TimerKungFu"), mStreamParser(aStreamParser) {}
|
2016-08-08 05:18:10 +03:00
|
|
|
NS_IMETHOD Run() override {
|
2017-01-26 09:45:31 +03:00
|
|
|
mozilla::MutexAutoLock flushTimerLock(mStreamParser->mFlushTimerMutex);
|
2010-04-16 14:52:06 +04:00
|
|
|
if (mStreamParser->mFlushTimer) {
|
|
|
|
mStreamParser->mFlushTimer->Cancel();
|
2012-07-30 18:20:58 +04:00
|
|
|
mStreamParser->mFlushTimer = nullptr;
|
2010-04-16 14:52:06 +04:00
|
|
|
}
|
|
|
|
return NS_OK;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
void nsHtml5StreamParser::DropTimer() {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(NS_IsMainThread(), "Wrong thread!");
|
2010-04-16 14:52:06 +04:00
|
|
|
/*
|
|
|
|
* Simply nulling out the timer wouldn't work, because if the timer is
|
|
|
|
* armed, it needs to be canceled first. Simply canceling it first wouldn't
|
|
|
|
* work, because nsTimerImpl::Cancel is not safe for calling from outside
|
|
|
|
* the thread where nsTimerImpl::Fire would run. It's not safe to
|
|
|
|
* dispatch a runnable to cancel the timer from the destructor of this
|
|
|
|
* class, because the timer has a weak (void*) pointer back to this instance
|
|
|
|
* of the stream parser and having the timer fire before the runnable
|
|
|
|
* cancels it would make the timer access a deleted object.
|
|
|
|
*
|
|
|
|
* This DropTimer method addresses these issues. This method must be called
|
|
|
|
* on the main thread before the destructor of this class is reached.
|
2018-03-16 18:26:06 +03:00
|
|
|
* The nsHtml5TimerKungFu object has an nsHtml5StreamParserPtr that addrefs
|
|
|
|
* this
|
2010-04-16 14:52:06 +04:00
|
|
|
* stream parser object to keep it alive until the runnable is done.
|
|
|
|
* The runnable cancels the timer on the parser thread, drops the timer
|
2017-06-27 20:38:30 +03:00
|
|
|
* and lets nsHtml5StreamParserPtr send a runnable back to the main thread to
|
2010-04-16 14:52:06 +04:00
|
|
|
* release the stream parser.
|
|
|
|
*/
|
2017-01-26 09:45:31 +03:00
|
|
|
mozilla::MutexAutoLock flushTimerLock(mFlushTimerMutex);
|
2010-04-16 14:52:06 +04:00
|
|
|
if (mFlushTimer) {
|
|
|
|
nsCOMPtr<nsIRunnable> event = new nsHtml5TimerKungFu(this);
|
2017-06-08 02:23:29 +03:00
|
|
|
if (NS_FAILED(mEventTarget->Dispatch(event, nsIThread::DISPATCH_NORMAL))) {
|
2010-04-16 14:52:06 +04:00
|
|
|
NS_WARNING("Failed to dispatch TimerKungFu event");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-04-27 13:27:03 +03:00
|
|
|
// Using a static, because the method name Notify is taken by the chardet
|
2009-11-17 11:52:30 +03:00
|
|
|
// callback.
|
|
|
|
void nsHtml5StreamParser::TimerCallback(nsITimer* aTimer, void* aClosure) {
|
2018-03-16 18:26:06 +03:00
|
|
|
(static_cast<nsHtml5StreamParser*>(aClosure))->TimerFlush();
|
2009-11-17 11:52:30 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void nsHtml5StreamParser::TimerFlush() {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
2010-04-16 14:52:06 +04:00
|
|
|
mozilla::MutexAutoLock autoLock(mTokenizerMutex);
|
|
|
|
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(!mSpeculating, "Flush timer fired while speculating.");
|
2010-04-16 14:52:06 +04:00
|
|
|
|
|
|
|
// The timer fired if we got here. No need to cancel it. Mark it as
|
|
|
|
// not armed, though.
|
2011-10-17 18:59:28 +04:00
|
|
|
mFlushTimerArmed = false;
|
2009-11-17 11:52:30 +03:00
|
|
|
|
2011-10-17 18:59:28 +04:00
|
|
|
mFlushTimerEverFired = true;
|
2010-04-16 14:52:06 +04:00
|
|
|
|
|
|
|
if (IsTerminatedOrInterrupted()) {
|
2009-11-17 11:52:30 +03:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2010-07-30 14:15:38 +04:00
|
|
|
if (mMode == VIEW_SOURCE_HTML || mMode == VIEW_SOURCE_XML) {
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTreeBuilder->Flush(); // delete useless ops
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
r = mTokenizer->FlushViewSource();
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (r.unwrap()) {
|
2017-06-27 20:38:30 +03:00
|
|
|
nsCOMPtr<nsIRunnable> runnable(mExecutorFlusher);
|
2018-03-16 18:26:06 +03:00
|
|
|
if (NS_FAILED(DispatchToMain(runnable.forget()))) {
|
2017-06-27 20:38:30 +03:00
|
|
|
NS_WARNING("failed to dispatch executor flush event");
|
|
|
|
}
|
|
|
|
}
|
2010-07-30 14:15:38 +04:00
|
|
|
} else {
|
|
|
|
// we aren't speculating and we don't know when new data is
|
|
|
|
// going to arrive. Send data to the main thread.
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTreeBuilder->Flush(true);
|
|
|
|
if (r.isErr()) {
|
|
|
|
MarkAsBroken(r.unwrapErr());
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (r.unwrap()) {
|
2017-06-27 20:38:30 +03:00
|
|
|
nsCOMPtr<nsIRunnable> runnable(mExecutorFlusher);
|
2018-03-16 18:26:06 +03:00
|
|
|
if (NS_FAILED(DispatchToMain(runnable.forget()))) {
|
2010-07-30 14:15:38 +04:00
|
|
|
NS_WARNING("failed to dispatch executor flush event");
|
|
|
|
}
|
2009-11-17 11:52:30 +03:00
|
|
|
}
|
|
|
|
}
|
2009-10-21 16:12:50 +04:00
|
|
|
}
|
2011-09-28 16:45:17 +04:00
|
|
|
|
2014-11-05 18:44:37 +03:00
|
|
|
void nsHtml5StreamParser::MarkAsBroken(nsresult aRv) {
|
Bug 1701828 - meta charset rewrite. r=smaug
Implements https://github.com/whatwg/html/issues/6962 . Improves performance
when <meta charset> occurs in head but after the first kilobyte and aligns
behavior better with WebKit and Blink.
The main change is to avoid reloads when meta appears within head but
after the first kilobyte. Prior to this change, Gecko reloaded in that
case (in compliance with the spec!) even though WebKit and Blink did not.
Differences from WebKit and Blink:
* WebKit and Blink honor <meta charset> in <noscript>. This implementation
does not.
* WebKit and Blink look for meta as if the tree builder was unaware of
foreign content. This implementation is foreign content-aware. This
makes a difference for CDATA sections that contain a > before the meta
as well as style and script elements within foreign content. This could
happen if the CDATA section that has mysteriously been introduced around
a what looks like a meta tag also contains another prior tag-looking
run of text.
* This implementation processes rel=preload and speculative loads that are
seen before <meta charset> has been seen. WebKit and Blink instead first
look for the meta and rewind before starting speculative parsing.
* Unlike WebKit, if there is neither an honored meta nor syntax resembling
an XML declaration, detection from content takes place (as in Blink).
* Unlike Blink, if there is neither an honored meta nor syntax resembling
an XML declaration, the detection from content is not dependent of network
buffer boundaries.
* Unlike Blink, detection from content can trigger a reload at the end of
the stream if the guess made at that point differs from the first guess.
(See below for the definition of the input to the first guess.)
Differences from the old spec and Gecko previously:
* Meta inside script and RCDATA elements is no longer honored.
* Late meta is now ignored and no longer triggers a reload.
* Later meta counts as early enough meta: In addition to the previous
meta within the first 1024 bytes, now a meta that started within the first
1024 bytes counts as early enough. Additionally, if by then there hasn't
been a template start tag and head hasn't ended, meta occurring before the
earlier of the end of the head or a template start tag counts as early
enough.
* Meta now counts as not-late even if the encoding label has numeric
character reference escapes.
* Syntax resembling an XML declaration longer than a kilobyte is honored if
there is no honored meta.
* If there is neither an honored meta nor syntax resembling an XML declaration,
the initial chardetng scan is potentially longer than before: the first 1024
bytes, the token spanning the 1024-byte boundary if there is such a token,
and, if by then head hasn't ended and there hasn't been a template start tag
until the end of the template start tag or the end of the token that causes
head to end, ever comes first. However, if the token implying the end of the
head is a text token, bytes only to the end of the previous non-text token is
considered. (This definition avoids depending on network buffer boundaries.)
* XML View Source now uses the code for syntax resembling an XML declaration
instead of expat for extracting the internal encoding label.
Reftest are added as both WPT and Gecko reftests in order to test both http:
and file: URL scenarios. The Gecko tests retain the WPT <link> tags in order
to use the exact same bytes.
An encoding declaration has been added to a number of old tests that didn't
intend to test the new speculation behavior especially in the context of
https://bugzilla.mozilla.org/show_bug.cgi?id=1727750 .
Differential Revision: https://phabricator.services.mozilla.com/D125808
2021-12-08 14:34:20 +03:00
|
|
|
MOZ_ASSERT(IsParserThread(), "Wrong thread!");
|
2011-09-28 16:45:17 +04:00
|
|
|
mTokenizerMutex.AssertCurrentThreadOwns();
|
|
|
|
|
|
|
|
Terminate();
|
2014-11-05 18:44:37 +03:00
|
|
|
mTreeBuilder->MarkAsBroken(aRv);
|
2022-04-27 10:24:34 +03:00
|
|
|
auto r = mTreeBuilder->Flush(false);
|
|
|
|
if (r.isOk()) {
|
|
|
|
MOZ_ASSERT(r.unwrap(), "Should have had the markAsBroken op!");
|
|
|
|
} else {
|
|
|
|
MOZ_CRASH("OOM prevents propagation of OOM state");
|
|
|
|
}
|
2017-06-27 20:38:30 +03:00
|
|
|
nsCOMPtr<nsIRunnable> runnable(mExecutorFlusher);
|
2017-07-26 11:13:35 +03:00
|
|
|
if (NS_FAILED(DispatchToMain(runnable.forget()))) {
|
2011-09-28 16:45:17 +04:00
|
|
|
NS_WARNING("failed to dispatch executor flush event");
|
|
|
|
}
|
|
|
|
}
|
2017-06-27 20:38:30 +03:00
|
|
|
|
2017-07-26 11:13:35 +03:00
|
|
|
nsresult nsHtml5StreamParser::DispatchToMain(
|
|
|
|
already_AddRefed<nsIRunnable>&& aRunnable) {
|
2021-03-18 22:24:50 +03:00
|
|
|
if (mNetworkEventTarget) {
|
|
|
|
return mNetworkEventTarget->Dispatch(std::move(aRunnable));
|
2017-06-27 20:38:30 +03:00
|
|
|
}
|
2018-03-16 18:26:06 +03:00
|
|
|
return SchedulerGroup::UnlabeledDispatch(TaskCategory::Network,
|
2018-05-30 22:15:35 +03:00
|
|
|
std::move(aRunnable));
|
2017-06-27 20:38:30 +03:00
|
|
|
}
|