Summary:
Speed up dtoa by a factor of two by removing its allocation lock.
For performance, the original dtoa library used its own allocator, with
segregated free lists. It started by allocating from a pre-defined
"private memory buffer", and if it was exhausted, allocated from the
heap. Freed blocks were added to the free list and reused, unless they
were too large, in which case they were freed immediately.
This allocator was declared in the data segment, and in a multi-thread
application it had to be protected by a lock. Presumably this was
originally done in order to avoid having to pass a "context" pointer to
dtoa functions. In practice it worked quite well, especially in
single-threaded applications. Even in multi-threaded applications
dtoa's custom allocator (with its lock) was about 2X faster than
system malloc()/free().
Nevertheless, the performance of dtoa ends up dominating pieces of code
like this:
```
function foo(array, from, to) {
while (from < to) {
if (array[from] === array[from])
++from;
}
};
foo([true], 1.541384329975526, 4077282);
```
Every iteration of the loop needs two conversions from double to string.
Somewhat surprisingly, it turned out that most of the time was spent in
mutex lock and unlock, even if the mutex is uncontested. The overhead
of the atomic operations and the rest of the mutex logic is apparently
higher than the cost of the conversion itself.
Hermes itself doesn't use threads, but it is supposed to work in
multi-threaded applications (different instances of Hermes can be used
by different threads at the same time), so unfortunately we can't just
get rid of the locks.
Instead, we move the allocator state into a new structure `dtoa_alloc`
and change most functions in dtoa to accept it as a parameter and use
it. That makes it possible to remove the allocator lock, as long as the
dtoa caller guarantees that the allocator it passes is used only by one
thread at a time.
Unfortunately that is only half of the solution, because in practice it
is not always possible to maintain a per-thread allocator for dtoa
functions. There are deep call trees, especially in the compiler, ending
with a call to dtoa, which don't have a good place to keep a special
dtoa allocator.
So, the second half of the solution is to make the dtoa allocator cheap
and easy to construct on the stack, when we need it. We give it about
a KB of stack, which contains both the allocator metadata and its
"private allocation buffer". This is sufficient to eliminate calls to
malloc()/free() in the performance sensitive cases. The assumption is
that if one is generating hundreds of digits, they can probably tolerate
the cost of a few malloc()/free(), but if they are converting "normal"
numbers, they get the good performance.
So, now we are able to insert this on-stack allocator wherever we need
it, without having to worry about finding a place to keep it and passing
it through deep call hierarchies.
Finally, the last piece of the puzzle has to do with the "powers of 5"
cache, which dtoa keeps. We want to share the cache between allocators,
since otherwise it wouldn't be a cache. So, we create a separate
allocator only for it, and use a lock. The cost of the lock is
negligible, since it is only used when adding a new power of 5 to the
cache.
Note that writing all of this explanation was actually much harder than
actually implementing it, and the result is that the example above is
now a little more than 2X faster than it was, which is a fairly big win.
Note also that other double<->string conversion libraries allocate 512B
bignums on the stack, so they have comparable, if not greater, stack
usage (this is not meant as a critique, but as justification for us
using a bit of stack).
Reviewed By: kodafb
Differential Revision: D18294564
fbshipit-source-id: acd6b8d502231992dd5c99fbc31eee8a41ccf171