The code already handles this case when the r32 is eax; this change allows it to handle the other 32-bit registers as well.
--HG--
extra : histedit_source : 1cff5b54640cc48a0574b0b4323ad909e8a7e7b2
One of the things that I've noticed in profiling startup overhead is that,
even with the startup cache, we spend about 130ms just loading and decoding
scripts from the startup cache on my machine.
I think we should be able to do better than that by doing some of that work in
the background for scripts that we know we'll need during startup. With this
change, we seem to consistently save about 3-5% on non-e10s startup overhead
on talos. But there's a lot of room for tuning, and I think we can get
considerable further improvement with a few ongoing tweaks.
Some notes about the approach:
- Setting up the off-thread compile is fairly expensive, since we need to
create a global object, and a lot of its built-in prototype objects for each
compile. So in order for there to be a performance improvement for OMT
compiles, the script has to be pretty large. Right now, the tipping point
seems to be about 20K (see the sketch after this list).
There's currently no easy way to improve the per-compile setup overhead, but
we should be able to combine the off-thread compiles for multiple smaller
scripts into a single operation without any additional per-script overhead.
- The time we spend setting up scripts for OMT compile is almost entirely
CPU-bound. That means that we have a chunk of about 20-50ms where we can
safely schedule thread-safe IO work during early startup, so if we schedule
some of our current synchronous IO operations on background threads during the
script cache setup, we basically get them for free, and can probably increase
the number of scripts we compile in the background.
- I went with an uncompressed mmap of the raw XDR data as the storage format.
That currently occupies about 5MB of disk space. Gzipped, it's ~1.2MB, so
compressing it might save some startup disk IO, but keeping it uncompressed
simplifies a lot of the OMT and even main-thread decoding process. More
importantly, though:
- We currently don't use the startup cache in content processes, for a variety
of reasons. However, with this approach, I think we can safely store the
cached script data from a content process before we load any untrusted code
into it, and then share mmapped startup cache data between all content
processes. That should speed up content process startup *a lot*, and very
likely save memory, too. And:
- If we're especially concerned about saving per-process memory, and we keep
the cache data mapped for the lifetime of the JS runtime, I think that with
some effort we can probably share the static string data from scripts between
content processes, without any copying. Right now, it looks like for the main
process, there's about 1.5MB of string-ish data in the XDR dumps. It's
probably less for content processes, but if we could save 0.5MB per process
this way, it might make it easier to increase the number of content processes
we allow.
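A minimal sketch of the size gate from the first note above. The helper
names and the constant are illustrative; the ~20K figure comes from the
measurements described above:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical wrappers around the real decode paths:
    void DecodeScriptOffThread(const uint8_t* aXDRData, size_t aLength);
    void DecodeScriptMainThread(const uint8_t* aXDRData, size_t aLength);

    // Per-compile setup (global object, built-in prototypes) only pays
    // off for large scripts, so gate on size; ~20K was the tipping point.
    static const size_t kOffThreadMinimum = 20 * 1024;

    void StartScriptDecode(const uint8_t* aXDRData, size_t aLength) {
      if (aLength >= kOffThreadMinimum) {
        DecodeScriptOffThread(aXDRData, aLength);
      } else {
        DecodeScriptMainThread(aXDRData, aLength);
      }
    }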
MozReview-Commit-ID: CVJahyNktKB
--HG--
extra : source : 1c7df945505930d2d86a076ee20807104324c8cc
extra : histedit_source : 75e193839edf727874f01b2a9f6852f6c1f087fb%2C3ce966d7dcf2bd0454a7d673d0467097456bd782
NS_SetCurrentThreadName() is added as an alternative to PR_SetCurrentThreadName()
inside libxul. The thread names are collected in the form of a crash annotation
to be processed on Socorro.
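A minimal sketch of the intended call pattern, assuming the new function
mirrors PR_SetCurrentThreadName()'s signature; the thread name below is
illustrative:

    // Assumed declaration, mirroring PR_SetCurrentThreadName():
    void NS_SetCurrentThreadName(const char* aName);

    void ThreadFunc() {
      // The name is collected into the crash annotation for Socorro.
      NS_SetCurrentThreadName("DNS Resolver");  // illustrative name
      // ...thread body...
    }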
MozReview-Commit-ID: 4RpAWzTuvPs
Separate AbstractThread::InitTLS and
AbstractThread::InitMainThread. Initialize the AbstractThread main thread
when initializing nsThreadManager. Initialize AbstractThread TLS for all
child process types, because the plugin and GMP processes do IPC without
initializing XPCOM, and for content processes initializing XPCOM requires
IPC.
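A rough sketch of the resulting call pattern (exact call sites assumed
from the description above):

    // In nsThreadManager::Init(), which runs on the main thread:
    AbstractThread::InitMainThread();

    // In child-process startup paths that may do IPC before XPCOM is up
    // (plugin, GMP, and content processes):
    AbstractThread::InitTLS();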
MozReview-Commit-ID: DhLub23oZz8
--HG--
extra : rebase_source : 6e4bfa03ec69e1eb694924903f1fa5e7259cbba3
This tracks TlsAlloc() and TlsFree() calls on Windows for diagnosing crashes when a process reaches
its limit (1088) for TLS slots. Tracking of TLS allocation is done by intercepting TlsAlloc() and
TlsFree() in kernel32.dll. After initialization, we start tracking the number of allocated TLS
slots. If the number of observed TLS allocations exceeds a high water mark, we record the stack
when TlsAlloc() is called, and the recorded stacks get serialized to a JSON string ready for
crash annotation.
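A minimal sketch of the TlsAlloc() side of the hook, assuming the usual
detour pattern; sOriginalTlsAlloc, kHighWaterMark, and
RecordAllocationStack() are illustrative names:

    #include <windows.h>
    #include <atomic>

    static void RecordAllocationStack();  // hypothetical: capture the stack

    static decltype(&TlsAlloc) sOriginalTlsAlloc;  // saved by the detour
    static std::atomic<uint32_t> sTlsSlotCount{0};
    static const uint32_t kHighWaterMark = 900;    // illustrative threshold

    static DWORD WINAPI TlsAllocHook() {
      DWORD slot = sOriginalTlsAlloc();
      // The TlsFree() hook would decrement the counter correspondingly.
      if (slot != TLS_OUT_OF_INDEXES && ++sTlsSlotCount > kHighWaterMark) {
        RecordAllocationStack();  // later serialized to JSON for the annotation
      }
      return slot;
    }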
MozReview-Commit-ID: 5fHVr0eiMy5
This change moves us away from NSPR primitives for our primary
synchronization primitives. We're still using PRMonitor for
ReentrantMonitor, however.
The benefits of this change:
* Slightly faster, as we don't have to deal with some of NSPR's overhead.
* Smaller datatypes. On POSIX platforms in particular, PRLock is
enormous. PRCondVar also has some unnecessary overhead.
* Less dynamic memory allocation. Out of necessity, Mutex and CondVar
allocated the NSPR data structures they needed, which led to
unnecessary checks for failure.
While sizeof(Mutex) and sizeof(CondVar) may get bigger, since they're
embedding structures now, the total memory usage should be less (see the
sketch below).
* Less NSPR usage. This shouldn't need any explanation.
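A minimal sketch of what embedding the platform primitive looks like,
assuming SRWLOCK on Windows and pthread_mutex_t on POSIX; the real class
names and layout in the tree may differ:

    #ifdef XP_WIN
    #include <windows.h>
    #else
    #include <pthread.h>
    #endif

    // The primitive lives inline in the object, so constructing a Mutex
    // allocates nothing and cannot fail.
    class MutexImpl {
    #ifdef XP_WIN
      SRWLOCK mLock = SRWLOCK_INIT;
    #else
      pthread_mutex_t mLock = PTHREAD_MUTEX_INITIALIZER;
    #endif
     public:
      void Lock() {
    #ifdef XP_WIN
        AcquireSRWLockExclusive(&mLock);
    #else
        pthread_mutex_lock(&mLock);
    #endif
      }
      void Unlock() {
    #ifdef XP_WIN
        ReleaseSRWLockExclusive(&mLock);
    #else
        pthread_mutex_unlock(&mLock);
    #endif
      }
    };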
PseudoStack::sampleContext() is always called immediately after
profiler_get_pseudo_stack(). This patch introduces profiler_set_js_context()
and profiler_clear_js_context(), which replace the profiler_get_pseudo_stack()
+ sampleContext() pairs. This takes us a step closer to not having to export
PseudoStack outside the profiler.
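Roughly, call sites change like this (a sketch based on the description
above):

    // Before: fetch the pseudo-stack, then hand it the JSContext.
    PseudoStack* stack = profiler_get_pseudo_stack();
    if (stack) {
      stack->sampleContext(cx);
    }

    // After: one call, and PseudoStack stays internal to the profiler.
    profiler_set_js_context(cx);
    // ...and on teardown:
    profiler_clear_js_context();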
--HG--
extra : rebase_source : 8558d1600eafd395cc696d31f3d21fb52a1a74b0
Bug 1343075 - 1a. Add TextEventDispatcherListener::GetIMEUpdatePreference; r=masayuki
Add a GetIMEUpdatePreference method to TextEventDispatcherListener to
optionally control which IME notifications are received by NotifyIME.
This patch also makes nsBaseWidget forward its GetIMEUpdatePreference
call to the widget's native TextEventDispatcherListener.
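A sketch of the new listener method; the return type is assumed to be the
nsIMEUpdatePreference previously returned by
nsIWidget::GetIMEUpdatePreference():

    class TextEventDispatcherListener /* : public nsISupports */ {
     public:
      // Lets the listener choose which IME notifications are delivered
      // to NotifyIME().
      virtual nsIMEUpdatePreference GetIMEUpdatePreference() = 0;
      // ...existing methods such as NotifyIME() elided...
    };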
Bug 1343075 - 1b. Implement GetIMEUpdatePreference for all TextEventDispatcherListener; r=masayuki
This patch implements GetIMEUpdatePreference for all
TextEventDispatcherListener implementations, by moving previous
implementations of nsIWidget::GetIMEUpdatePreference.
Bug 1343075 - 2. Allow setting a PuppetWidget's native TextEventDispatcherListener; r=masayuki
In PuppetWidget, add getter and setter for the widget's native
TextEventDispatcherListener. This allows overriding of PuppetWidget's
default IME handling. For example, on Android, the PuppetWidget's native
TextEventDispatcherListener will communicate directly with Java IME code
in the main process.
Bug 1343075 - 3. Add AIDL interface for main process; r=rbarker
Add AIDL definition and implementation for an interface for the main
process that child processes can access.
Bug 1343075 - 4. Set Gecko thread JNIEnv for child process; r=snorp
Add a JNIEnv* parameter to XRE_SetAndroidChildFds, which is used to set
the Gecko thread JNIEnv for child processes. XRE_SetAndroidChildFds is
the only Android-specific entry point for child processes, so I think
it's the most logical place to initialize JNI.
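The shape of the change, roughly; the fd parameters are assumed, and only
the new JNIEnv* argument is the point here:

    #include <jni.h>

    // Before (fd parameter names assumed):
    //   void XRE_SetAndroidChildFds(int aCrashFd, int aIPCFd);

    // After: the entry point also receives the JNIEnv, so the child
    // process can initialize the Gecko thread's JNI up front.
    void XRE_SetAndroidChildFds(JNIEnv* aEnv, int aCrashFd, int aIPCFd);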
Bug 1343075 - 5. Support multiple remote GeckoEditableChild; r=esawin
Support remote GeckoEditableChild instances that are created in the
content processes and connect to the parent process GeckoEditableParent
through binders.
Support having multiple GeckoEditableChild instances in GeckoEditable by
keeping track of which child is currently focused, and only allowing
calls to/from the focused child by using access tokens.
Bug 1343075 - 6. Add method to get GeckoEditableParent instance; r=esawin
Add IProcessManager.getEditableParent, which a content process can call
to get the GeckoEditableParent instance that corresponds to a given
content process tab, from the main process.
Bug 1343075 - 7. Support GeckoEditableSupport in content processes; r=esawin
Support creating and running GeckoEditableSupport attached to a
PuppetWidget in content processes.
Because we don't know PuppetWidget's lifetime as well as nsWindow's,
when attached to PuppetWidget, we need to attach/detach our native
object on focus/blur, respectively.
Bug 1343075 - 8. Connect GeckoEditableSupport on PuppetWidget creation; r=esawin
Listen to the "tab-child-created" notification and attach our content
process GeckoEditableSupport to the new PuppetWidget.
Bug 1343075 - 9. Update auto-generated bindings; r=me
The profiler can currently handle nested calls to profiler_{init,shutdown}() --
only the first call to profiler_init() and the last call to profiler_shutdown()
do anything. And sure enough, we have the following.
- Outer init/shutdown pairs in XRE_main()/XRE_InitChildProcess() (via
GeckoProfilerInitRAII).
- Inner init/shutdown pairs in NS_InitXPCOM2()/NS_InitMinimalXPCOM() (both shut
down in ShutdownXPCOM()).
This is a bit silly, so the patch removes the inner pairs, and adds a
now-needed pair in XRE_XPCShellMain. This will allow gInitCount -- which tracks
the nesting depth -- to be removed in a future patch.
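For reference, the nesting behaviour being relied on here looks roughly
like this; gInitCount is named in the message, the bodies are elided:

    static int gInitCount = 0;

    void profiler_init() {
      if (gInitCount++ > 0) {
        return;  // only the outermost init does anything
      }
      // ...real initialization...
    }

    void profiler_shutdown() {
      if (--gInitCount > 0) {
        return;  // only the last shutdown does anything
      }
      // ...real teardown...
    }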
--HG--
extra : rebase_source : 7e8dc6ce81ce10269d2db6a7bf32852c396dba0e
Hook this into the browser via the XREAppData. This patch does not include the changes to Chromium source code.
--HG--
extra : rebase_source : 4d5637bcdbeae605b0b99e9192598d48f371b698
Added assertions to nsWindowsDllInterceptor in case of a failed detour hook, with an exception for the RET opcode that appears in ImmReleaseContext. Added documentation about TestDllInterceptor.
--HG--
extra : rebase_source : a3c6fe0949f5503979a062bdaa5f35526ddee73b
This includes a near-jump CALL instruction in x64, which expands to a far-jump CALL with a 64-bit address as inline data. This requires us to abandon the method where we memcpy the code block into the trampoline and, instead, build the trampoline function as we go.
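A minimal sketch of emitting that expansion into the trampoline as we go
(buffer management and rel32 extraction omitted; names are illustrative):

    #include <cstdint>
    #include <cstring>

    // Append a far CALL whose 64-bit target is stored as inline data.
    // aCode is assumed to point at writable trampoline memory; returns
    // the next write position.
    static uint8_t* EmitFarCall(uint8_t* aCode, uintptr_t aTarget) {
      static const uint8_t kFarCall[] = {
        0xFF, 0x15, 0x02, 0x00, 0x00, 0x00,  // call qword ptr [rip+2]
        0xEB, 0x08,                          // jmp +8: skip the inline address
      };
      memcpy(aCode, kFarCall, sizeof(kFarCall));
      memcpy(aCode + sizeof(kFarCall), &aTarget, sizeof(aTarget));
      return aCode + sizeof(kFarCall) + sizeof(aTarget);
    }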
--HG--
extra : rebase_source : 7f90ce5ba1a82dff731aff1ac17117c684b7b2cf