We originally added this functionality in bug 1566855 because we were
seeing a significant amount of orphaned crashes not being submitted.
Notably some of these were early content process startup crashes and
missing them made us unaware of severe issues that would affect our
users. In the meantime we've addressed most of the cases where a crash
would be orphaned. Since the crashes submitted this way have a buildid
that does not match the version of Firefox that actually crashed they
can be confusing during crash triage.
Given the above we can safely remove this functionality. This patch
reverts most of the changes from bug 1566855 but leaves the test
refactorings in place.
Differential Revision: https://phabricator.services.mozilla.com/D78219
In bug 1614933 we changed the order in which the crash annotations are sent
from a crashed content process to the main process in order to prevent a
deadlock that would arise if the child process would block writing into the
pipe used to sent the annotations.
This unfortunately introduced a race that I had missed. Here's the sequence of
event when generating a crash in the child process:
1) The child process enters the exception handler
2) It requests a dump from the main process and wait
3) Once the dump is written, the child process wakes up again and writes out
the crash annotations
4) The child process quits
One the main process side it looks like this:
1) The crash generation thread receives a request to generate a dump
2) The dump is written out, the crash generation thread notifies the content
process that it can resume execution. During this step the finished
minidump is recorded in the `OnChildProcessDumpRequested()` callback.
3) The crash generation thread reads the crash annotations sent by the content
process and finalizes the crash report
4) The main process grabs the finalized crash report
The key issue here is that the main process in step 4 is woken up when the
content process dies. Notice how there's no synchronization between the crash
generation thread and the main thread in the main process: if the crash
generation thread takes too long reading the crash annotations the main thread
might see an incomplete or missing crash report; that's because the event that
wakes it up - the content process ending execution - can happen in parallel
with step 3.
This is an issue that was accidentally introduced in Windows by bug 1614933
but was already present in both Linux and macOS. It was just very unlikely to
happen.
This patch fixes the issue by splitting step 3 in the main process in two
different stages: in the main process we grab the generated minidump in
`OnChildProcessDumpRequested()`, Breakpad then unblocks the child process and
we read the annotations in `OnChildProcessDumpWritten()`. We grab the
`dumpMapLock` Mutex in the latter and release it in the former to ensure that
the main process will have to wait for the crash report to be finalized when
`TakeCrashedChildProcess()` is called. This might appear somewhat confusing
and even causes debug builds to spit a harmless warning (we don't want Mutexes
to be taken and released from different scopes if we can help it).
To implement the above behavior in Breakpad a new callback was introduced in
Windows, an existing one was used under macOS and an unused one was used under
Linux. This accounts for the different way in which minidumps are generated on
the three platforms.
Differential Revision: https://phabricator.services.mozilla.com/D74496
Raw Cr.ERROR don't get stack information, same as throwing JS literals instead
of `new Error()`s.
This was done automatically with a new eslint rule that will be introduced in
the next commit. One instance of a raw Cr.ERROR was not replaced since it is
used in a test that specifically checks the preservation of raw Cr values in
XPCJS. The rule will be disabled for that instance.
Differential Revision: https://phabricator.services.mozilla.com/D28073
This changes the way filter callbacks work in Breakpad so that the
Gecko-provided exception handler can instruct breakpad to just consider an
exception handled and terminate the process without further work. This change
is needed because when two threads enter the exception handler at the same
time the second can get stuck. The logic behind this is convoluted: when the
filter callback "accepts" an exception Breakpad will request minidump
generation and then notify Windows exception search mechanism that the
exception was handled. The process will terminate at this point. However if
a second thread enters the exception handler before this happens it will also
try to handle the exception, either by generating a minidump (prior to my fix
for bug 1434933) or by simply ignoring it (after bug 1434933). However under
both conditions Breakpad will consider the exception not handled and inform
Windows exception search logic to look for another handler. But we're already
at the top-level handler so Windows will try it again, and again, and again
... hanging the process instead of terminating it.
With this patch applied the first time we hit the exception handler we'll
request minidump generation and report the exception as handled. If we hit the
exception handler again we will not request minidump generation but we will
also report the exception as handled so that Windows can terminate the process
right away.
Differential Revision: https://phabricator.services.mozilla.com/D72493
This patch modifies exception handling in child processes so that no more than
one minidump will be requested in case of a crash - even when multiple threads
enter the exception handler almost simultaneously.
The fix is implemented by introducing an atomic flag that is set once we first
decide that the exception is a crashing one. All following threads entering
the exception handler will find the flag already set and give up.
This logic was added to the child process exception filters - callbacks that
breakpad invokes within the exception handler to decide whether to generate a
minidump or not. I factored out some of the affected code trying to make it a
bit more readable and a little bit less redundant than before.
Differential Revision: https://phabricator.services.mozilla.com/D71108
Made the files in toolkit/crashreporter flake8 compliant
And some finishing touches in the symbolstore.py file for readability
Differential Revision: https://phabricator.services.mozilla.com/D68691
Setting breakpad minimum log severity level to CRITICAL should be enough to
quiet it completely since there are no statements in the code using it, the
"highest" I could find all use the ERROR level.
Differential Revision: https://phabricator.services.mozilla.com/D70115
--HG--
extra : moz-landing-system : lando
On Windows content process minidump generation goes through the following
steps:
* The child process exception handler invokes the `filter` callback. This is the
`ChildFPEFilter()` function in nsExceptionHandler.cpp.
* If this callback returns true then the child process sends a request to the
crash generation server living in the main process and then waits.
* The main process generates a minidump in response to this message.
* Once the minidump has been generated the crash generation server calls the
`dump_callback` callback with the minidump filename. This is the
`OnChildProcessDumpRequested()` function in nsExceptionHandler.cpp.
* Once this callback returns the main process signals the child process that
it has finished generating the dump.
* The child process resumes execution in the exception handler and then
terminates.
During this process we send crash annotations via a pipe from the child
process to the main process. The function responsible for this is
`PrepareChildExceptionTimeAnnotations()` in nsExceptionHandler.cpp and it's
called within `ChildFPEFilter()`. Because pipe writes are blocking if we send
enough annotations to fill the pipe buffer then the child process will get
stuck in `ChildFPEFilter()` and never request a dump from the main process. Thus
the content process will never shut down and the main process will never see
that it has crashed, it will only appear indefinitely stuck.
To solve this issue a few changes are necessary:
* The crash annotations are not sent from the `filter` callback, they are sent
from the `minidump` callback which we did not use before. This callback is
executed in the child process **after** signaling the main process.
* This is not enough to prevent the deadlock though, because the content
process will still wait before the parent process has responded before
invoking this new callback. The parent process on the other hand will
generate the minidump and then call `OnChildProcessDumpRequest()` which will
wait to read the annotations sent by the child process. Thus both processes
would be stuck. To solve this other issue we change the order in which the
parent process responds: instead of invoking the `dump_callback` first and
then signaling the child process it does the opposite.
I've also added a new test that covers two separate issues solved by this
patch-set: that we don't deadlock when writing too much data (this issue) and
that we don't fail to send an annotation if it's too large (previously the
shared memory buffer was 16KiB in size so this annotation wouldn't fit).
Differential Revision: https://phabricator.services.mozilla.com/D69877
--HG--
extra : moz-landing-system : lando
Crash annotations in content processes are currently sent over IPC via
shared memory buffers. To pave the way for the Rust rewrite of the exception
handler we are removing this code and gathering all the crash annotations
within the content processes themselves. This patch causes annotations to be
stored in the global table of each content process. They are then streamed
out to the parent process by the exception handler together with the
exception-time annotations.
This has a number of benefits:
* we have one less channel to exchange data between content processes and
the parent process
* we save memory because we don't need to allocate the shared memory buffers
* annotations are faster because we don't stream them all out every time one
changes
* we won't truncate annotations anymore if we run out of space in the shared
segment.
* we don't need delayed annotations anymore, so we can get rid of the
associated machinery
As I refactored the code I tried to adjust all the obsolete comments,
consolidate shared code and remove the redundant steps that were sometimes
present. In many places we had two entire crash annotation tables we merged to
change just a couple; that comes from the fact that historically we loaded
them from disk. Now it doesn't matter anymore and we can just go ahead and
change the ones we care about.
Differential Revision: https://phabricator.services.mozilla.com/D62586
--HG--
extra : moz-landing-system : lando
We previously only included the contents of the /etc/lsb-release file, however
this file is not present in RedHat-based distributions which use
/etc/os-release instead. This patch will read the latter if the former is not
present.
Differential Revision: https://phabricator.services.mozilla.com/D68986
--HG--
extra : moz-landing-system : lando
Also adds missing includes in some files, these were previously only transivitely
included through mozilla/TypeTraits.h.
Differential Revision: https://phabricator.services.mozilla.com/D68561
--HG--
extra : moz-landing-system : lando
Each allocation page is now bracketed by a guard page, and allocations are put
at the end of their page so that bounds violations trigger a crash.
Various operations (realloc(), free(), malloc_usable_size()) now require that
the pointer they are given points to the start of an allocation.
Differential Revision: https://phabricator.services.mozilla.com/D43842
--HG--
rename : toolkit/crashreporter/test/unit_ipc/test_content_phc2.js => toolkit/crashreporter/test/unit_ipc/test_content_phc3.js
extra : moz-landing-system : lando
Crash annotations in content processes are currently sent over IPC via
shared memory buffers. To pave the way for the Rust rewrite of the exception
handler we are removing this code and gathering all the crash annotations
within the content processes themselves. This patch causes annotations to be
stored in the global table of each content process. They are then streamed
out to the parent process by the exception handler together with the
exception-time annotations.
This has a number of benefits:
* we have one less channel to exchange data between content processes and
the parent process
* we save memory because we don't need to allocate the shared memory buffers
* annotations are faster because we don't stream them all out every time one
changes
* we won't truncate annotations anymore if we run out of space in the shared
segment.
* we don't need delayed annotations anymore, so we can get rid of the
associated machinery
As I refactored the code I tried to adjust all the obsolete comments,
consolidate shared code and remove the redundant steps that were sometimes
present. In many places we had two entire crash annotation tables we merged to
change just a couple; that comes from the fact that historically we loaded
them from disk. Now it doesn't matter anymore and we can just go ahead and
change the ones we care about.
Differential Revision: https://phabricator.services.mozilla.com/D62586
--HG--
extra : moz-landing-system : lando
This includes several fixes required to build against musl libc.
Conflicts were resolved in 00-arm-exidx-rollup.patch and
10-json-upload.patch. 08-dont-add-sp-to-clobber-list.patch was
applied upstream and is no longer needed. The others applied cleanly.
breakpad_getcontext.S is now built conditionally based upon the
available of getcontext() from libc, rather than only on Android.
The profiler was updated to reflect this change.
Differential Revision: https://phabricator.services.mozilla.com/D67108
--HG--
rename : toolkit/crashreporter/breakpad-patches/09-gnu-alt-form-minimal-support.patch => toolkit/crashreporter/breakpad-patches/08-gnu-alt-form-minimal-support.patch
rename : toolkit/crashreporter/breakpad-patches/10-json-upload.patch => toolkit/crashreporter/breakpad-patches/09-json-upload.patch
rename : toolkit/crashreporter/google-breakpad/src/common/android/breakpad_getcontext.S => toolkit/crashreporter/google-breakpad/src/common/linux/breakpad_getcontext.S
rename : toolkit/crashreporter/google-breakpad/src/common/android/breakpad_getcontext_unittest.cc => toolkit/crashreporter/google-breakpad/src/common/linux/breakpad_getcontext_unittest.cc
extra : moz-landing-system : lando
This works on all platforms with the exception of Linux where we remove the
exception handler only if the sandbox is disabled. With the sandbox enabled we
would have to whitelist sigaltstack() which breakpad uses to remove the
alternate signal stack which is not worth the fuss.
Besides this patch refactors the code that sets and unsets the exception
handler, cutting down on the duplication:
* The XRE_UnsetRemoteExceptionHandler() call is removed from XULAppAPI.h since it
was no longer used
* The duplicate checks for the special strings used to disable the remote exception
handler have been removed from CrashReporter::UnsetRemoteExceptionHandler()
leaving them in the calling code
* The SetRemoteExceptionHandler() function was consolidated into only one
piece of code with only one non-platform-specific shared prototype
* Some additional code was factored out to improve the readability
These changes pave the way both for bug 1614933 and for the oxidation of the
exception handler code.
Differential Revision: https://phabricator.services.mozilla.com/D67213
--HG--
extra : moz-landing-system : lando