When reading Dwarf unwind info, `CallFrameInfo::RuleMap::registers_` is a
`std::map<int, Rule>` used to map (Dwarf) register numbers to the current
unwind rule for each number. These mappings are very small (typically <= 7
elems) and very short lived. Result is that `std::map` creates a lot of
overhead because it is implemented as a Red-Black tree, and hence does a lot
of malloc/freeing of nodes.
This patch replaces it with a simple vector, wrapped up in a new `class
CallFrameInfo::RuleMapLowLevel`. Comments have also been improved. The
resulting performance changes are as follows:
user time insns #malloc #megabytes
seconds million calls allocated
x86_64 before 0.42 5609 5,891,857 560.8
x86_64 after 0.27 3906 894,188 323.1
arm64 before 0.46 7697 8,469,659 680.8
arm64 after 0.24 4922 1,043,154 427.1
x86_64: Intel Core i5 1135G7, 4.2GHz, Fedora 35
arm64: Apple M1, ??? MHz, Fedora 33 running on Parallels Workstation
Differential Revision: https://phabricator.services.mozilla.com/D151261
We'll probably want to do something more accurate in the future with a
custom clang static analysis pass which validates that XPIDL interfaces
have the expected vtable and struct layout, however doing so would be
more involved than the string matching done in this patch.
In addition to checking for extra virtual methods, we'll likely also
want to check for data members on interfaces, and reject them unless the
class is marked as `[builtinclass]` in addition to some other attribute
which we'll need to add to prevent them from being implemented in Rust
(as c++ data members will not be reflected by the rust macro).
There were 2 instances of a comment which contained the word 'virtual'
within a CDATA block. These comments were moved out of the CDATA block
to avoid triggering the error.
Differential Revision: https://phabricator.services.mozilla.com/D151068
After parsing Dwarf unwind info, the resulting mExtents array must be sorted
by start address. This forms a significant part of the overall unwind-info
reading cost.
This is done by calling std::sort. Unfortunately the way it is coded, the
comparison function is not inlined, which results in 91 million calls to it
when reading info up to and including for libxul.so.
This small patch turns it into a closure, which is inlined. For (performance)
safety, Extent::offset() is marked inline too.
On an Intel Core i5 1135G7 at circa 4 GHz, this reduces the Dwarf read time
from 0.42 seconds (after bug 1754932 has landed) to 0.35 seconds. Insn count
falls from 5630 million to 4775 million.
Differential Revision: https://phabricator.services.mozilla.com/D150943
Fix for a regression in bug 1668867, which removed this important line.
Without it, the logic that handles chunk manager updates doesn't run, and the
overall buffer size limit across processes is not enforced anymore.
To catch future regressions, the ProfileBufferGlobalController now adds its
creation timestamp to the profiling log, which can be tested.
Differential Revision: https://phabricator.services.mozilla.com/D150995
This is a special log that's collected while the parent is gathering all the
JSON profiles (its own, and its children's).
This separate log is necessary, because the gathered logs have already
completed and closed their own `profilingLog` object, and it would be much more
difficult to add to them now.
Differential Revision: https://phabricator.services.mozilla.com/D150562
At profiler startup, LUL reads and preprocesses Dwarf unwind info for all
shared objects mapped into the process. That includes libxul.so and around 90
other shared objects on an x86_64-linux build. This takes too long.
Profiling just this reading phase shows a huge number of mallocs and frees
resulting from LulDwarf.cpp. The largest cause of these is the implementation
of a sum-of-products type with 7 variants (CallFrameInfo::*Rule) using C++
inheritance. The variants have different sizes and so have to be boxed (stored
on the heap and referred to by pointers). They are very short lived and this
causes a lot of heap turnover.
For an m-c build on x86_64-linux on Fedora 35 using clang-13.0.0 -g -O2, when
measuring a MOZ_PROFILER_STARTUP=1 run that is hacked so as to exit
immediately after the unwind info for libxul.so has been read, this patch has
the following perf effects:
* run time reduced from 0.50user to 0.41user (Core i5 1135G7, circa 4 GHz)
* heap allocation reduced from 8,087,404 allocs/397,177,435 bytes to
3,176,936 allocs, 328,204,042 bytes.
* instruction count reduced from 6,848,730,307 to 5,572,028,393.
Main changes are:
* `class CallFrameInfo::Rule` has been completely rewritten, so as to merge
its 7 children into it, and those children have been removed. The new
representation uses just 3 machine words to represent all 7 alternatives.
The various virtual methods of `class CallFrameInfo::Rule` (`Handle`,
`operator==`, etc) have been rewritten to use a simple switch-based scheme.
* The code that uses `class CallFrameInfo::Rule` has been changed (but not
majorly rewritten) so as to pass around instances of it by value, rather
than pointers to it. This removes the need to allocate them on the heap.
To simulate the previous use case where a NULL value had a meaning, the
revised class in fact has an 8th alternative, `INVALID`, and a routine
`isVALID`. INVALID rules are now used in place of NULL pointers to rules as
previously.
* Accessors and constructors for the revised `class CallFrameInfo::Rule` hide
the underlying 3-word representation, ensure that the correct
representational invariants are maintained, and make it impossible to access
fields that are meaningless for the given variant. So it's not any less
safe than the original.
* Additionally, Dwarf expression strings are no longer represented using
`std::string`. Instead a new two-word type `ImageSlice` has been provided.
This avoids all unnecessary (and unknown) overheads with `std::string` and
provides a significant part of the measured speedup.
Differential Revision: https://phabricator.services.mozilla.com/D139288
If the child process failed to allocate Shmem (and this isn't a debug
build, where that would cause an assertion failure) when sending
profiling data back to the parent process, it will send an invalid Shmem
and the parent process should not try to use it.
Currently it does do that, and tries to construct an nsACString with
length `(size_t)-1`, which fails a release assertion in the XPCOM string
library and crashes the browser. With this patch, we'll simply fail to
profile the affected process.
Differential Revision: https://phabricator.services.mozilla.com/D148468
The geckoProfiler.dumpProfileToFile (extension API) needed to be updated, to create the target directory if necessary. Some xpcshell tests would fail otherwise.
Also improved nsIProfiler.idl docs of dumpProfileToFile{,Async}.
***
Use ofstream, fixed xpcshell tests.
Differential Revision: https://phabricator.services.mozilla.com/D148610
This should remove the redefinition warning.
@gaston could you please test it on your system? I've tried simulating a non-MOZ_GECKO_PROFILER build locally, but I'm hitting other build errors -- but maybe I'm doing it wrong. Thanks!
Differential Revision: https://phabricator.services.mozilla.com/D148582
If the sampler records more than one counter (e.g.: memory and per-process CPU), both counters could appear right next to each other in the buffer.
But then the current streaming code goes something like this:
1. Read entry
2. If entry is a counter, go to the next entry (which should be a timestamp) and output the full counter.
3. Go to the next entry.
4: Return to step 1 (until there are no more entries).
The problem is this unconditional 3rd step:
If we've just read a counter at step 2, we're located at the entry *past* that counter's data, which could be the start of the next counter, that the 3rd step now effectively skips!
Also:
- After reading the time, we can do e.Next() to skip it before testing if it's too old, this may save one loop.
- After reading the optional "number", we can also do e.Next() to skip it, which will save one loop as well.
Differential Revision: https://phabricator.services.mozilla.com/D145056
In child processes, Gecko Profiler's profiler_start/stop are running OFF the main thread, so the calls to Ensure/ReleaseBufferForMainThreadAddMarker should be dispatched to the main thread for them to actually create/destroy the main-thread buffer.
Note that this is not done in the Base Profiler, because it should always run on the main thread, and if it wasn't there would be no way to dispatch a task to the main thread -- and it wouldn't be too much of an issue anyway, there are few markers running outside of Gecko Profiler sessions.
Differential Revision: https://phabricator.services.mozilla.com/D144387
The regular xpcshell test don't appear to fail, so this additional patch ensures that only the TV (test verify) runs of test_feature_nativeallocations.js are skipped.
Differential Revision: https://phabricator.services.mozilla.com/D141814
The native allocations feature is crashing tests on Android in the watchdog thread, possibly due to its CPU overhead.
This feature is rarely used, and mostly by Firefox developers, so it's okay to keep it available without tests.
Differential Revision: https://phabricator.services.mozilla.com/D141568
Profiler sleep/wake functions/macros/classes are used in a small number of files.
By placing most of them in mozilla/ProfilerThread{Sleep,State}.h, the dependence on GeckoProfiler.h is greatly reduced.
The remaining AUTO_PROFILER_THREAD_WAKE and AutoProfilerThreadWake depend on profiler_is_sleeping() in mozilla/ProfilerThreadState.h (but are rarely used), so it makes sense to move them there instead.
Depends on D140169
Differential Revision: https://phabricator.services.mozilla.com/D140172
The next patch would have added yet another PROFILERxx_RAII macro, so it makes sense to gather them all into a single header file, to be #included where needed.
Differential Revision: https://phabricator.services.mozilla.com/D140169
This allows multiple threads to access ThreadRegistrations through the thread registry, as long as the registry itself is not modified. So in particular, ThreadRegistrations are guaranteed to stay alive as long as a non-exclusive lock is held.
This ensures cross-thread markers are not blocked while the sampler is running.
Differential Revision: https://phabricator.services.mozilla.com/D139917
This allows multiple threads to access ThreadRegistrations through the thread registry, as long as the registry itself is not modified. So in particular, ThreadRegistrations are guaranteed to stay alive as long as a non-exclusive lock is held.
This ensures cross-thread markers are not blocked while the sampler is running.
Differential Revision: https://phabricator.services.mozilla.com/D139917
In async profiler tests, it's best to await profiler-controlling functions, to ensure that they took full effect in all processes.
It's especially important with StopProfiler, so that the profiler doesn't run anywhere when the following test starts.
Differential Revision: https://phabricator.services.mozilla.com/D139336
This profiler-controlling functions are used in very few places, so it's good to have them in a separate header to reduce dependencies.
On top of making GeckoProfiler.h lighter, this is actually needed for this bug, because a later patch adds MozPromise to profiler_start and others, which would have created a header loop (GeckoProfiler -> MozPromise -> Monitor -> CondVar -> GeckoProfiler) that makes the build fail.
Differential Revision: https://phabricator.services.mozilla.com/D139331
On an Android device, the stacks looks different during testing. It doesn't
include object directory before the test file since the tests are packaged
with the binary on Android.
Depends on D138769
Differential Revision: https://phabricator.services.mozilla.com/D138862
Now that the main cause of broken stacks (LoaderObserver) has been removed, we reinstate the path to this assertion, which should trigger much less or not at all.
Differential Revision: https://phabricator.services.mozilla.com/D138241
Once the buffers are combined, some data could be stored in a thread before its registration with the Gecko Profiler.
Without this fix, the front-end would think the thread had not started yet, which could prevent interacting with that early data.
Differential Revision: https://phabricator.services.mozilla.com/D138239
Once the buffers are combined, some data could be stored in a thread before its registration with the Gecko Profiler.
Without this fix, the front-end would think the thread had not started yet, which could prevent interacting with that early data.
Differential Revision: https://phabricator.services.mozilla.com/D138239
This is a mechanical change which was performed by a script based on the
contents of direct_call.py, and then manually checked over to fix
various rewriting bugs caused by my glorified sed script. See the
previous part for more context on the change.
Differential Revision: https://phabricator.services.mozilla.com/D137227
When trying to profile only one process, the profiler should not even start in those processes excluded by the filter, so that they don't use any resources (memory and CPU).
To do that, the filter is checked for "pid:" at the multiple locations it may appear:
- In the parent process, when starting the overall profiler, don't send the "Start" IPC to excluded processes.
- When a new process starts and initializes the profiler, don't start the profiler if the filter excludes this process.
- When a new process then registers itself with the parent, don't (re)start the profiler in excluded processes.
Child processes that don't run the profiler may still be asked for a JSON profile at the end of the profiling session. This is handled by sending an empty string, so that the parent process will correctly remove them from the pending-profile list.
Differential Revision: https://phabricator.services.mozilla.com/D135854
When trying to profile only one process, the profiler should not even start in those processes excluded by the filter, so that they don't use any resources (memory and CPU).
To do that, the filter is checked for "pid:" at the multiple locations it may appear:
- In the parent process, when starting the overall profiler, don't send the "Start" IPC to excluded processes.
- When a new process starts and initializes the profiler, don't start the profiler if the filter excludes this process.
- When a new process then registers itself with the parent, don't (re)start the profiler in excluded processes.
Child processes that don't run the profiler may still be asked for a JSON profile at the end of the profiling session. This is handled by sending an empty string, so that the parent process will correctly remove them from the pending-profile list.
Differential Revision: https://phabricator.services.mozilla.com/D135854
We don't use unions in the bindings, and rust is very picky with what
you can put on untagged unions, so disable them for now.
Differential Revision: https://phabricator.services.mozilla.com/D137837
Instead of waiting a set time guessed from how long the parent process took to do its work, after a short time the parent process requests progress updates from all still-pending child processes, and restarts the timer if any progress was made.
If processes become unresponsive, they will be the last ones pending, and after one timer cycle without any progress anywhere, the parent process won't wait for children anymore, and will output all profiles successfully gathered so far.
Added `MOZ_LOG=prof` logging in nsProfiler.cpp, to monitor profile-gathering. (And removed a spurious 'd' character in the `LOG` macro.)
Differential Revision: https://phabricator.services.mozilla.com/D135488
This code will be used again in the following patch.
Instead of destroying and re-creating a new timer, we can re-initialize the existing one.
Also add the timer name "nsProfilerGatheringTimer" when first creating it.
Differential Revision: https://phabricator.services.mozilla.com/D135487
This helper function in ProfilerParent sends a progress request to a child process. If successfully sent, the response will resolve the returned promise.
Differential Revision: https://phabricator.services.mozilla.com/D135486
A new IPC function allows the parent process to request a progress update from any child process.
If a profile generation is in progress, the shared `ProportionValue` can be atomically read and sent back in response.
Differential Revision: https://phabricator.services.mozilla.com/D135485
In order to keep the child process responsive to profile IPCs, the heavy work of generating the profile JSON is now done in a separate thread.
A `ProgressLogger` is used to keep track of the progress of this work, and the progress value is stored in a shared atomic `ProportionValue`.
When the JSON profile is ready, the final shmem allocation (used to send the profile to the parent process) is done on the original "ProfilerChild" IPC thread.
Differential Revision: https://phabricator.services.mozilla.com/D135484
The main goal is to separate the profile generation (in a JSONWriter) from the final allocation needed to output the profile in one block.
This will be needed in the next patch, where the profile generation will be done in a new worker thread, but the shmem allocation *must* be done on the original "ProfilerChild" thread that handles IPC responses.
Differential Revision: https://phabricator.services.mozilla.com/D135483
Instead of just waiting for a certain number of profiles, the parent process now waits for profiles from a predetermined list of child process ids.
When receiving a profile, or when something goes wrong with a child process, the corresponding listed id can be removed, until the list is empty.
In a later patch, this list will be used to request progress updates from slow processes.
Differential Revision: https://phabricator.services.mozilla.com/D135482
This will be useful to tie profiles to the child process id that generated them. (At the moment, the parent waits for a number of profiles, but doesn't check where received profiles actually come from.)
Differential Revision: https://phabricator.services.mozilla.com/D135481
A small optimization while working on nearby code, so avoid multiple allocations when we already know how much memory we really need.
Differential Revision: https://phabricator.services.mozilla.com/D135480
Add `ProgressLogger` parameter to most JSON-generating functions.
Each function can update the given `ProgressLogger` between 0% and 100%, and create sub-loggers when calling functions.
The main goal of this instrumentation is to notice when any progress is made by child processes (when the parent process is gathering profiles), so it needs to go deep enough so that it is not stuck on a progress value for "too long" -- During development, that meant progress was always happening when observed every 10ms; In later patches, the overall timeout for no-progress-made will be at least 1 second.
Differential Revision: https://phabricator.services.mozilla.com/D135479
As mentioned in bug 1747354, the location of the dist directory is
relied to be $topobjdir/dist, so just use that consistently rather
than getting it from a separate variable for rust build scripts.
Differential Revision: https://phabricator.services.mozilla.com/D136556
As mentioned in bug 1747354, the location of the dist directory is
relied to be $topobjdir/dist, so just use that consistently rather
than getting it from a separate variable for rust build scripts.
Differential Revision: https://phabricator.services.mozilla.com/D136556
In rare cases, a label may be left dangling, or would be removed too late (after the thread unregisters itself).
The most common cause is from Windows DLL load labels, which contain important information that we want to keep, so it's worth letting the registration data leak in this case.
Later bugs will soon fix the issue in a better way, and remove this leak.
Differential Revision: https://phabricator.services.mozilla.com/D136700
Note: The atomic variable is named `gSkipSampling`, not mentioning forks because it could later be used in other situations, so it's best to describe it by the effect it has.
Differential Revision: https://phabricator.services.mozilla.com/D136205
This helps when dealing with threads that are not registered, e.g.: the Java thread, and the upcoming whole-process thread. And it removes some object copies.
Differential Revision: https://phabricator.services.mozilla.com/D133172
`PrintUsageThenExit(code)` was supposed to exit when `code` was not zero, but:
- The name didn't reflect that, so it was confusing that `PrintUsageThenExit(0)` would *not* exit.
- The implementation in the Base Profiler exited anyway! This caused issues with some legacy code that still used the now-removed "threads" feature.
This patch renames the function to just `PrintUsage()` and never exits, leaving the caller to invoke `exit(code)` as needed -- with the added benefit that it's possible to exit with a zero code, useful in cases where an exit is not actually an error.
Differential Revision: https://phabricator.services.mozilla.com/D135666
Many revisions later... this is my best shoehorning attempt of webrtc's trace
events into the profiler.
It is not an optimal fit as the profiler wants a static schema, and the trace
event framework provides labels and values for the args dynamically.
But it allows displaying all args in the profiler which is a big step for
low-maintenance webrtc trace event integration into the profiler.
Differential Revision: https://phabricator.services.mozilla.com/D135031
This patch adds a part of TraceEvent.h that we had previously skipped. It
handles the conversion from the arg-ful TRACE_EVENTn forms to the internal
representation which carries around a dynamic number of args (num_args and
friends).
This allows us to hook into GeckoTraceEvent.h at a higher level, where all trace
events end up, with or without args.
Differential Revision: https://phabricator.services.mozilla.com/D135029
Many revisions later... this is my best shoehorning attempt of webrtc's trace
events into the profiler.
It is not an optimal fit as the profiler wants a static schema, and the trace
event framework provides labels and values for the args dynamically.
But it allows displaying all args in the profiler which is a big step for
low-maintenance webrtc trace event integration into the profiler.
Differential Revision: https://phabricator.services.mozilla.com/D135031
This patch adds a part of TraceEvent.h that we had previously skipped. It
handles the conversion from the arg-ful TRACE_EVENTn forms to the internal
representation which carries around a dynamic number of args (num_args and
friends).
This allows us to hook into GeckoTraceEvent.h at a higher level, where all trace
events end up, with or without args.
Differential Revision: https://phabricator.services.mozilla.com/D135029
The name of the main thread outside the profiler should not be set by the profiler (to "GeckoMain"), as this may affect the name of the application itself on some systems like Linux.
Added tests to ensure that the profiler doesn't set that main thread public name to "GeckoMain", and also that other threads are publicly named when first registered with the profiler.
Differential Revision: https://phabricator.services.mozilla.com/D134847
This is necessary, because the next patch will remove the "threads" feature, and some tests only add that one feature so now they will have an empty feature list, equivalent to a feature bitset of 0 (zero).
Differential Revision: https://phabricator.services.mozilla.com/D134136
On macOS, we get intermittent failures when comparing a marker start time with the start of the test, with a difference of 0.00004166666666094443, which is very close to 24kHZ! I'm not too sure of the significance of this time difference (OS time slice resolution), but an easy fix for this intermittent is to loosen the expected time from 1e-6 to 5e-5.
Differential Revision: https://phabricator.services.mozilla.com/D134258
This patch was r+ed before by Ted, but it never landed because I initially intended
to address Ted's review comment (about making it work on 10.11 and below), and
because it needed to be rebased around bug 1374888.
The rebase turned out to be really simple, and Ted's review comment no longer applies
because Firefox no longer runs on 10.11 and below.
Profile with fix: https://share.firefox.dev/3oYzvO6
Differential Revision: https://phabricator.services.mozilla.com/D134008
The main thread is the busiest, so it benefits the most from having its own chunked buffer. This removes one allocation when capturing a marker stack on the main thread of each process.
That buffer is allocated when the first profiler starts, and is destroyed when the last profiler stops.
Note: Further improvements are possible (e.g.: Pool of pre-allocated buffers, attempting to use a stack-based buffer, etc.), but they are more complex and will require more work in future bugs.
Differential Revision: https://phabricator.services.mozilla.com/D133725