hermes-windows

Граф коммитов

Автор	SHA1	Сообщение	Дата
Andres Suarez	c2cd9e385a	Update copyright headers from Facebook to Meta Reviewed By: aaronabramov Differential Revision: D33370313 fbshipit-source-id: da0d9de8a0d269141ea9f2c8d931fcfccbde7edb	2021-12-30 15:08:38 -08:00
Tzvetan Mikov	62a0ab0479	remove dtoa allocation lock Summary: Speed up dtoa by a factor of two by removing its allocation lock. For performance, the original dtoa library used its own allocator, with segregated free lists. It started by allocating from a pre-defined "private memory buffer", and if it was exhausted, allocated from the heap. Freed blocks were added to the free list and reused, unless they were too large, in which case they were freed immediately. This allocator was declared in the data segment, and in a multi-thread application it had to be protected by a lock. Presumably this was originally done in order to avoid having to pass a "context" pointer to dtoa functions. In practice it worked quite well, especially in single-threaded applications. Even in multi-threaded applications dtoa's custom allocator (with its lock) was about 2X faster than system malloc()/free(). Nevertheless, the performance of dtoa ends up dominating pieces of code like this: ``` function foo(array, from, to) { while (from < to) { if (array[from] === array[from]) ++from; } }; foo([true], 1.541384329975526, 4077282); ``` Every iteration of the loop needs two conversions from double to string. Somewhat surprisingly, it turned out that most of the time was spent in mutex lock and unlock, even if the mutex is uncontested. The overhead of the atomic operations and the rest of the mutex logic is apparently higher than the cost of the conversion itself. Hermes itself doesn't use threads, but it is supposed to work in multi-threaded applications (different instances of Hermes can be used by different threads at the same time), so unfortunately we can't just get rid of the locks. Instead, we move the allocator state into a new structure `dtoa_alloc` and change most functions in dtoa to accept it as a parameter and use it. That makes it possible to remove the allocator lock, as long as the dtoa caller guarantees that the allocator it passes is used only by one thread at a time. Unfortunately that is only half of the solution, because in practice it is not always possible to maintain a per-thread allocator for dtoa functions. There are deep call trees, especially in the compiler, ending with a call to dtoa, which don't have a good place to keep a special dtoa allocator. So, the second half of the solution is to make the dtoa allocator cheap and easy to construct on the stack, when we need it. We give it about a KB of stack, which contains both the allocator metadata and its "private allocation buffer". This is sufficient to eliminate calls to malloc()/free() in the performance sensitive cases. The assumption is that if one is generating hundreds of digits, they can probably tolerate the cost of a few malloc()/free(), but if they are converting "normal" numbers, they get the good performance. So, now we are able to insert this on-stack allocator wherever we need it, without having to worry about finding a place to keep it and passing it through deep call hierarchies. Finally, the last piece of the puzzle has to do with the "powers of 5" cache, which dtoa keeps. We want to share the cache between allocators, since otherwise it wouldn't be a cache. So, we create a separate allocator only for it, and use a lock. The cost of the lock is negligible, since it is only used when adding a new power of 5 to the cache. Note that writing all of this explanation was actually much harder than actually implementing it, and the result is that the example above is now a little more than 2X faster than it was, which is a fairly big win. Note also that other double<->string conversion libraries allocate 512B bignums on the stack, so they have comparable, if not greater, stack usage (this is not meant as a critique, but as justification for us using a bit of stack). Reviewed By: kodafb Differential Revision: D18294564 fbshipit-source-id: acd6b8d502231992dd5c99fbc31eee8a41ccf171	2019-11-06 02:33:33 -08:00

Автор

SHA1

Сообщение

Дата

Andres Suarez

c2cd9e385a

Update copyright headers from Facebook to Meta

Reviewed By: aaronabramov

Differential Revision: D33370313

fbshipit-source-id: da0d9de8a0d269141ea9f2c8d931fcfccbde7edb

2021-12-30 15:08:38 -08:00

Tzvetan Mikov

62a0ab0479

remove dtoa allocation lock

Summary:
Speed up dtoa by a factor of two by removing its allocation lock.

For performance, the original dtoa library used its own allocator, with
segregated free lists. It started by allocating from a pre-defined
"private memory buffer", and if it was exhausted, allocated from the
heap. Freed blocks were added to the free list and reused, unless they
were too large, in which case they were freed immediately.

This allocator was declared in the data segment, and in a multi-thread
application it had to be protected by a lock. Presumably this was
originally done in order to avoid having to pass a "context" pointer to
dtoa functions. In practice it worked quite well, especially in
single-threaded applications. Even in multi-threaded applications
dtoa's custom allocator (with its lock) was about 2X faster than
system malloc()/free().

Nevertheless, the performance of dtoa ends up dominating pieces of code
like this:
```
    function foo(array, from, to) {
        while (from < to) {
            if (array[from] === array[from])
                ++from;
        }
    };
    foo([true], 1.541384329975526, 4077282);
```
Every iteration of the loop needs two conversions from double to string.
Somewhat surprisingly, it turned out that most of the time was spent in
mutex lock and unlock, even if the mutex is uncontested. The overhead
of the atomic operations and the rest of the mutex logic is apparently
higher than the cost of the conversion itself.

Hermes itself doesn't use threads, but it is supposed to work in
multi-threaded applications (different instances of Hermes can be used
by different threads at the same time), so unfortunately we can't just
get rid of the locks.

Instead, we move the allocator state into a new structure `dtoa_alloc`
and change most functions in dtoa to accept it as a parameter and use
it. That makes it possible to remove the allocator lock, as long as the
dtoa caller guarantees that the allocator it passes is used only by one
thread at a time.

Unfortunately that is only half of the solution, because in practice it
is not always possible to maintain a per-thread allocator for dtoa
functions. There are deep call trees, especially in the compiler, ending
with a call to dtoa, which don't have a good place to keep a special
dtoa allocator.

So, the second half of the solution is to make the dtoa allocator cheap
and easy to construct on the stack, when we need it. We give it about
a KB of stack, which contains both the allocator metadata and its
"private allocation buffer". This is sufficient to eliminate calls to
malloc()/free() in the performance sensitive cases. The assumption is
that if one is generating hundreds of digits, they can probably tolerate
the cost of a few malloc()/free(), but if they are converting "normal"
numbers, they get the good performance.

So, now we are able to insert this on-stack allocator wherever we need
it, without having to worry about finding a place to keep it and passing
it through deep call hierarchies.

Finally, the last piece of the puzzle has to do with the "powers of 5"
cache, which dtoa keeps. We want to share the cache between allocators,
since otherwise it wouldn't be a cache. So, we create a separate
allocator only for it, and use a lock. The cost of the lock is
negligible, since it is only used when adding a new power of 5 to the
cache.

Note that writing all of this explanation was actually much harder than
actually implementing it, and the result is that the example above is
now a little more than 2X faster than it was, which is a fairly big win.

Note also that other double<->string conversion libraries allocate 512B
bignums on the stack, so they have comparable, if not greater, stack
usage (this is not meant as a critique, but as justification for us
using a bit of stack).

Reviewed By: kodafb

Differential Revision: D18294564

fbshipit-source-id: acd6b8d502231992dd5c99fbc31eee8a41ccf171

2019-11-06 02:33:33 -08:00

2 Коммитов