Also rename the shader's ImageResource to ImageSource to match the terminology on the Rust side (especially since the Rust code has a different thing named ImageResource).
Differential Revision: https://phabricator.services.mozilla.com/D106484
I chose to rename it back to RenderTaskGraph instead of the other way around to minimize code churn and because it's the name most people are already familiar with.
Differential Revision: https://phabricator.services.mozilla.com/D106411
The RenderBackend::capture_config field determines whether the render backend
logs ongoing activity as requested by `wr_api_start_capture_sequence`. This
field is set by `SceneBuilderResult::CapturedTransactions` messages, but there
is nothing that clears it. This patch adds an additional `SceneBuilderResult`
message to do so.
It would probably suffice to simply always clear `capture_config` upon receipt
of an ordinary `Transactions` message. But it seemed to me to be slightly nicer
to leave capture control to messages specific to that purpose, rather than
letting ordinary messages affect it implicitly.
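
A minimal sketch of the idea, using simplified stand-in types. The variant name
`StopCaptureSequence` and the `handle` method are assumptions made for
illustration; only `capture_config`, `Transactions` and `CapturedTransactions`
come from the description above.

```rust
// Hypothetical, simplified stand-ins for the WebRender types named above.
#[derive(Debug, Clone, PartialEq)]
pub struct CaptureConfig {
    pub root: String,
}

pub enum SceneBuilderResult {
    // An ordinary batch of transactions.
    Transactions,
    // Starts (or continues) a capture sequence.
    CapturedTransactions(CaptureConfig),
    // Assumed name for the new message that ends the sequence.
    StopCaptureSequence,
}

#[derive(Default)]
pub struct RenderBackend {
    pub capture_config: Option<CaptureConfig>,
}

impl RenderBackend {
    pub fn handle(&mut self, msg: SceneBuilderResult) {
        match msg {
            // Ordinary transactions leave the capture state untouched.
            SceneBuilderResult::Transactions => {}
            SceneBuilderResult::CapturedTransactions(config) => {
                self.capture_config = Some(config);
            }
            // Only the dedicated message clears the field.
            SceneBuilderResult::StopCaptureSequence => {
                self.capture_config = None;
            }
        }
    }
}
```

Keeping the clear in its own variant means an ordinary `Transactions` message
arriving in the middle of a capture sequence cannot end it by accident.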
Differential Revision: https://phabricator.services.mozilla.com/D106308
For now, we will revert this patch since it's not relied on
elsewhere yet, while we find a correct fix for this regression.
Revert "Bug 1687409 - Use offscreen surface for backface visibility + non-preserve3d stacking contexts r=nical"
This reverts commit 2f5002791fa9671aa5c0e6573d28b52d5c978942.
Differential Revision: https://phabricator.services.mozilla.com/D106366
It caused us substantial confusion investigating this bug under the belief that
StartRemoteDrawingInRegion may have been modifying the dirty region. None of our
existing widget code uses the API in this way anymore, so it makes sense to just
force this dirty region to be const so that we no longer support the assumption
and alleviate confusion in the future about how our widget code actually behaves.
Depends on D106246
Differential Revision: https://phabricator.services.mozilla.com/D106247
This requires us to plumb CompositorCapabilities to support the extra field.
This is complicated by the fact that since it is a Rust struct, it has no
default constructor that can pass through to C++ via bindings, so every
one of our RenderCompositors was forced to manually initialize fields. To
get around this brittle footgun, instead the structure is initialized on
the Rust side, and RenderCompositor's are encouraged to only change fields
that actually diverge from the defaults as passed in via pointer.
Finally, we can then do what we need to do, which is just to send the
ForceRedraw message that needs to happen based on what we know about
CompositorCapabilities.
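
The initialization pattern can be sketched as follows. The field names, the
`FillCapabilities` callback type and `example_compositor` are hypothetical; the
real struct and bindings differ. The point is only that the defaults live on the
Rust side and the callee overrides what diverges.

```rust
// Hypothetical, trimmed-down version of the struct; field names are assumptions.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CompositorCapabilities {
    pub virtual_surface_size: i32,
    pub redraw_on_invalidation: bool,
}

impl Default for CompositorCapabilities {
    fn default() -> Self {
        CompositorCapabilities {
            virtual_surface_size: 0,
            redraw_on_invalidation: false,
        }
    }
}

// Stand-in for a RenderCompositor callback reached through bindings.
pub type FillCapabilities = unsafe extern "C" fn(caps: *mut CompositorCapabilities);

// The Rust side builds the defaults, then lets the compositor adjust them.
pub fn query_capabilities(fill: FillCapabilities) -> CompositorCapabilities {
    let mut caps = CompositorCapabilities::default();
    unsafe { fill(&mut caps) };
    caps
}

// Example compositor that only touches the one field that differs.
pub unsafe extern "C" fn example_compositor(caps: *mut CompositorCapabilities) {
    if let Some(caps) = unsafe { caps.as_mut() } {
        caps.redraw_on_invalidation = true;
    }
}
```

Adding a field later then only requires updating the `Default` impl, rather than
every compositor implementation.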
Differential Revision: https://phabricator.services.mozilla.com/D106246
This patch only erases some of the differences between how pictures and other primitives resolve their render tasks. There is a lot more to do there, but I haven't quite figured out the next incremental step towards decoupling the picture primitive from its content. After this patch we may be close to a good place to start extracting composite modes out into their own primitives.
Differential Revision: https://phabricator.services.mozilla.com/D106142
65k render tasks is a lot more than what we need, and RenderTaskId will soon be stored in more places where size affects performance.
Differential Revision: https://phabricator.services.mozilla.com/D105986
The test render tasks used to dodge the gpu cache interactions. Rather than maintain special cases, make it so the gpu cache is usable during these tests (which mainly means having a valid frame stamp to not trigger some assertions).
Differential Revision: https://phabricator.services.mozilla.com/D105985
This is the last important change of this render task refactoring. Cached render tasks now create nodes in the frame graph so that they can be referenced via a render task ID. With this it is now possible to refer to almost any textured content via a render task ID, regardless of how it was produced and whether it is cached. It also allows any render task to read from a cached one (before, only primitives and clip sources could).
This obsoletes ImageSourceHandle, which will be removed in a subsequent patch.
Differential Revision: https://phabricator.services.mozilla.com/D105952
When compositing a filter (or any off-screen surface) into the
parent picture, we also need to assume non-opaque if the transform
is complex, so that AA gets applied along the edges (and that any
fragments outside the AA zone are discarded).
In future, we aim to improve the performance of this fairly rare
scenario by reducing which parts of the picture get the AA shader,
but for now this is a simple fix for a correctness issue.
Differential Revision: https://phabricator.services.mozilla.com/D106054
Instead of calling `std::mem::replace` with dummy values to extract fields from
the `TransactionMsg` into the `BuiltTransaction`, it's more Rustic to pass the
former by value and just move its fields out.
`SceneBuildingThread::process_transaction` seems to contribute almost no self
time to profiles, so the cost of a move instead of passing a reference is
apparently negligible.
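
A small sketch of the difference, using reduced, hypothetical versions of the
two structs (the real ones have many more fields, and the function names here
are made up):

```rust
// Hypothetical, reduced versions of the two types.
pub struct TransactionMsg {
    pub document_id: u32,
    pub notifications: Vec<String>,
    pub frame_ops: Vec<u8>,
}

pub struct BuiltTransaction {
    pub document_id: u32,
    pub notifications: Vec<String>,
    pub frame_ops: Vec<u8>,
}

// Before: borrow the message and swap dummy values into each field.
pub fn build_by_ref(txn: &mut TransactionMsg) -> BuiltTransaction {
    BuiltTransaction {
        document_id: txn.document_id,
        notifications: std::mem::replace(&mut txn.notifications, Vec::new()),
        frame_ops: std::mem::replace(&mut txn.frame_ops, Vec::new()),
    }
}

// After: take the message by value and move its fields out directly.
pub fn build_by_value(txn: TransactionMsg) -> BuiltTransaction {
    BuiltTransaction {
        document_id: txn.document_id,
        notifications: txn.notifications,
        frame_ops: txn.frame_ops,
    }
}
```

The by-value version also makes it a compile error to touch the message after
it has been consumed, instead of silently leaving emptied fields behind.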
Differential Revision: https://phabricator.services.mozilla.com/D106059
Blit render tasks have special code to read from the texture cache. This isn't necessary anymore now that texture cache items can be used as nodes of the frame graph. Blits can be simplified into reading from any render task without knowing how it was produced.
Differential Revision: https://phabricator.services.mozilla.com/D105746
With this change, image primitives become render tasks. These render tasks don't produce drawing commands; instead they trigger image requests, which lead to texture uploads. Using render tasks provides two advantages:
- It adds some expressiveness to the render task graph: render tasks can now take an image as a source directly. This will be needed to implement the Image svg filter, for example.
- Since the image render tasks resolve their texture and uv handles before batching, the batching code can simply query the ImageSourceHandle without knowing whether the image comes from a (cached) render task or the resource cache.
A large part of the diff is moving a lot of the image primitive code from the visibility pass into ImageData::update which happens during the prepare pass.
Differential Revision: https://phabricator.services.mozilla.com/D105487
This patch starts moving some of the logic to resolve source texture ids and uv rects into a separate file, and introduces ImageSourceHandle which will be used in later patches in this series.
ImageSourceHandle is a unified way to refer to "some rectangle into some texture" regardless of how the texture, the rectangle and its content are produced. Moving all primitives to using this handle will allow us to decouple how content is pre-rendered from how it is composited into the main scene. The end goals are to remove some duplication/complexity and also to allow more flexibility, for example enable filters on some primitives directly without requiring a picture.
Differential Revision: https://phabricator.services.mozilla.com/D105486
This patch enables the faster mix-blend-mode path that allows using
picture cache tiles as the backdrop source for blends where that
is appropriate (most of the underlying work is in previous patches
or the dependencies of this bug).
In addition to avoiding an extra intermediate surface for blends
that are on a picture cache surface, it also avoids constant
invalidation of picture cache tiles due to the blend container
not being part of the main content scroll root.
As an example of the typical performance improvement, the GPU time
on an AMD 5700 GPU at 4k, when browsing pages with the Firelux color
temperature addon, drops from ~1.8 ms to ~0.3 ms.
Differential Revision: https://phabricator.services.mozilla.com/D104491
Delete the method `webrender::RenderApi::set_document_view`, since it is unused
by Gecko. (The unused code suspiciously constructs a `TransactionMsg` whose
`use_scene_builder_thread` flag is false, despite the fact that it contains a
`SceneMsg`.)
Differential Revision: https://phabricator.services.mozilla.com/D105855
The call to `self.resources.update` immediately above already sets
`transaction.use_scene_builder_thread` if the transaction has any `SceneMsg`
operations.
Differential Revision: https://phabricator.services.mozilla.com/D105844
Some sites use pixelated/crisp image-rendering and/or 1x1 images as color
sources. When we hit these, we fall off the fast-path. Try to handle some
of those cases we are finding in the wild, namely nearest filtering and
repeat filtering.
There is some slight movement in the wrench fuzz due to the composite shader
being accelerated in situations it was previously not due to nearest filter.
Differential Revision: https://phabricator.services.mozilla.com/D105864
The same optimization of looking for merged linear gradients can also be
applied to radial gradients by solving the quadratic equation to check
how large a span we can process within a given merged span. This allows
us to save a bunch of table lookup and some other math in the inner loops.
Differential Revision: https://phabricator.services.mozilla.com/D105858
For linear gradients, we are currently bottlenecked by looking up a gradient
table entry, doing interpolation, and converting to pixel formats for every
sample.
We can accelerate this by instead looking for contiguous segments of gradient
within the range of entries we need to sample and then interpolating these
as a single gradient. This also enables us to convert to relevant pixel formats
only when setting up this gradient, which greatly reduces the per-pixel processing
down to essentially a shift and add.
To enable this sort of crawling of the gradient table, the output gradients have
been modified such that each entry's step value will equal an adjacent entry's
step value if and only if they are from same gradient. We can ensure this by, in
the very rare case two segments of gradient have the same step, using the equivalent
of nextafter() to imperceptibly alter the value so that the invariant is maintained.
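
A rough sketch of the invariant, reduced to a scalar table. `GradientEntry`,
`merged_span_len` and `nudge` are illustrative names, not the actual SWGL
code, which works on per-channel color values.

```rust
// Hypothetical scalar version of a gradient table entry.
pub struct GradientEntry {
    pub start: f32,
    pub step: f32,
}

// Number of consecutive entries, starting at `first`, that share the same
// step. Under the invariant above these all belong to one gradient segment
// and can be interpolated as a single span.
pub fn merged_span_len(table: &[GradientEntry], first: usize) -> usize {
    let step = table[first].step;
    table[first..].iter().take_while(|e| e.step == step).count()
}

// Stand-in for nextafter(): move a finite step to the adjacent representable
// f32 (away from zero) so that two unrelated segments never compare equal.
pub fn nudge(step: f32) -> f32 {
    f32::from_bits(step.to_bits() + 1)
}
```

For example, a table whose first two entries have step `0.5` and whose third
entry has step `nudge(0.5)` yields a merged span of length 2 followed by a span
of length 1.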
Differential Revision: https://phabricator.services.mozilla.com/D105716
On some Mali devices we have encountered driver crashes caused by
calling textureSize(samplerExternalOES) in a shader without also
potentially sampling from the texture in the shader. ARM's suggested
workaround was to trick the driver into thinking that the texture may
be sampled from (i.e. by sampling in a branch which is never dynamically
taken).
This is done by checking the value of a dummy uniform, and sampling
the texture if the value is non-default. Using a constant expression
did not work because the compiler would optimize the condition (and
therefore the sample) away.
Also re-enable webrender on Mali-72 and G76 devices, as it was blocked
due to this bug.
Differential Revision: https://phabricator.services.mozilla.com/D105493
Previously we had encountered issues when rendering partial regions of
picture cache tiles on Mali-Gxx devices. These often manifested as
patterns of black squares and rectangles. We worked around this by
ensuring that we always clear and render the entire tile. We have now
had a report of a similar looking problem on Mali-Txxx devices, so
apply the same workaround there.
Differential Revision: https://phabricator.services.mozilla.com/D105278
This patch removes from RenderTaskKind members that are independent from what the render task is drawing. The uv rect is set automatically, either to Some(handle) if the render task is not cached or to None if it is cached. This reflects what was happening implicitly before this patch. The uv rect kind defaults to Rect, which is the most common case, but can be set when creating the render task.
This is a first step toward more flexibility when deciding whether a render task is cached or not (there is still some coupling in the batching code between the type of primitive and whether their render tasks are cached).
More importantly, not having to understand what is up with presence or absence of uv handles in render tasks makes adding new ones much easier.
Differential Revision: https://phabricator.services.mozilla.com/D104840
Now that most of the complicated alpha-pass features such as clip-masking and anti-aliasing
are handled in SWGL during the blend stage, most of the fast-paths are identical and only call
swgl_commitTextureLinear in a tight loop. We can do a lot better here by just moving that loop
into SWGL, not only making it faster but removing all the redundant boiler-plate code out of
the shaders.
Differential Revision: https://phabricator.services.mozilla.com/D104536
This cleans up the WR brush shaders to not have to use its own
implementation of init_transform_fs() for anti-aliasing when SWGL
is available. To enable this, most of the details of AAing have
been moved into brush.glsl to simplify the control knobs and
allow easier modifications.
With swgl_antiAlias() used, the drawSpan fast-paths no longer have to
care about whether or not AA is enabled, so we can more easily stay
on these fast-paths without worry.
Differential Revision: https://phabricator.services.mozilla.com/D104493
We don't actually need to use brush_mix_blend or KHR_blend_equation_advanced for multiply, screen,
and exclusion modes. Screen and exclusion can be done with simple blending, and multiply can be
done with dual-source blending. Since multiply is the most common mix-blend mode, and dual-source
blending is also common on the desktop with our ANGLE driver, this should be a significant boost
for mix-blend-mode performance for us.
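
To see why screen and exclusion fit ordinary fixed-function blending, here is a
per-channel sketch that ignores alpha and premultiplication. The `Factor` enum
and the function names are made up for illustration and only mirror the usual
GL blend factors; they are not WebRender APIs.

```rust
// Illustrative subset of fixed-function blend factors.
#[derive(Clone, Copy)]
pub enum Factor {
    One,
    OneMinusSrcColor,
    OneMinusDstColor,
}

fn factor(f: Factor, src: f32, dst: f32) -> f32 {
    match f {
        Factor::One => 1.0,
        Factor::OneMinusSrcColor => 1.0 - src,
        Factor::OneMinusDstColor => 1.0 - dst,
    }
}

// result = src * src_factor + dst * dst_factor, as in additive blending.
pub fn fixed_function_blend(src: f32, dst: f32, sf: Factor, df: Factor) -> f32 {
    src * factor(sf, src, dst) + dst * factor(df, src, dst)
}

// Reference formulas for the two mix-blend modes.
pub fn screen_reference(src: f32, dst: f32) -> f32 {
    src + dst - src * dst
}

pub fn exclusion_reference(src: f32, dst: f32) -> f32 {
    src + dst - 2.0 * src * dst
}
```

Screen corresponds to the factor pair (One, OneMinusSrcColor) and exclusion to
(OneMinusDstColor, OneMinusSrcColor). Multiply is not covered by this sketch,
since, as noted above, it relies on dual-source blending.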
Differential Revision: https://phabricator.services.mozilla.com/D104614
If we have more than 8 unused/reusable staging textures and buffers for more than 120 consecutive frames, start deallocating them, spreading the deallocation over multiple frames.
The vast majority of frames require less than 4 staging textures and buffers (most don't require any), but some SVG animations can put a lot of pressure on uploads, requiring 30+ staging textures per frame. This patch avoids staying at this kind of peak memory usage for too long.
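
A sketch of this policy under stated assumptions: the `StagingPool` type, the
`end_frame` method and the per-frame limit of 2 deallocations are hypothetical;
only the thresholds of 8 items and 120 frames come from the description above.

```rust
const MAX_UNUSED: usize = 8;
const FRAMES_BEFORE_TRIM: u32 = 120;
// Assumed value: how many items may be freed in a single frame.
const MAX_FREED_PER_FRAME: usize = 2;

pub struct StagingPool {
    // Ids of staging textures/buffers that are currently reusable.
    unused: Vec<u32>,
    // Consecutive frames during which the pool was over budget.
    frames_over_budget: u32,
}

impl StagingPool {
    pub fn new(unused: Vec<u32>) -> Self {
        StagingPool { unused, frames_over_budget: 0 }
    }

    pub fn unused_count(&self) -> usize {
        self.unused.len()
    }

    // Call once per frame; returns the ids that should be deleted now.
    pub fn end_frame(&mut self) -> Vec<u32> {
        if self.unused.len() <= MAX_UNUSED {
            self.frames_over_budget = 0;
            return Vec::new();
        }
        self.frames_over_budget += 1;
        if self.frames_over_budget <= FRAMES_BEFORE_TRIM {
            return Vec::new();
        }
        // Free only a few items per frame to spread the cost.
        let excess = self.unused.len() - MAX_UNUSED;
        let count = excess.min(MAX_FREED_PER_FRAME);
        let keep = self.unused.len() - count;
        self.unused.split_off(keep)
    }
}
```

With 12 unused items, nothing is freed for the first 120 frames; the following
two frames free 2 items each, after which the pool is back at 8.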
Differential Revision: https://phabricator.services.mozilla.com/D104510
Instead of using a triple buffering scheme, tag each texture with a frame index and only reuse a texture that hasn't been used for more than two frames.
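
One way to picture the frame tagging, as a sketch with hypothetical types
(`TexturePool`, `TaggedTexture` and `acquire` are not the actual names):

```rust
pub struct TaggedTexture {
    pub id: u32,
    // Index of the last frame in which this texture was used.
    pub last_used_frame: u64,
}

pub struct TexturePool {
    textures: Vec<TaggedTexture>,
    next_id: u32,
}

impl TexturePool {
    pub fn new() -> Self {
        TexturePool { textures: Vec::new(), next_id: 0 }
    }

    // Returns a texture id that is safe to write to in `current_frame`.
    pub fn acquire(&mut self, current_frame: u64) -> u32 {
        for tex in &mut self.textures {
            // Reuse only if the texture has been idle for more than two frames.
            if current_frame > tex.last_used_frame + 2 {
                tex.last_used_frame = current_frame;
                return tex.id;
            }
        }
        // Otherwise allocate a fresh one and tag it with the current frame.
        let id = self.next_id;
        self.next_id += 1;
        self.textures.push(TaggedTexture { id, last_used_frame: current_frame });
        id
    }
}
```

Unlike a fixed triple-buffering scheme, the number of textures in flight adapts
to how many are actually requested per frame.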
Differential Revision: https://phabricator.services.mozilla.com/D104421
This fixes incorrect rendering when either the source or backdrop
tasks establish a raster root.
By design, it also changes mix-blend backdrop readbacks to work in
a way that can handle readbacks from picture cache tiles, which is
a follow up optimization being worked on.
Differential Revision: https://phabricator.services.mozilla.com/D103853
In bug 1687394, the semantics of `requested_raster_space` were
changed to only take effect when an intermediate surface was
created.
However, this causes a regression to snapping with text runs
that are animated on the root surface (such as loading spinner
glyphs).
To fix that, while also keeping the functionality of the previous
patch (removing a source of pass-through pictures), there is now
a stack of requested raster space pushed and popped for each
stacking context. This is read and stored by text runs during
scene building, ensuring that these animated glyphs select the
correct raster space to avoid snapping / jittering bugs.
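
The stack can be sketched like this. The `SceneBuilder` below is a toy
stand-in, and `add_text_run` returning the stored raster space is an
assumption made to keep the example small.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RasterSpace {
    Screen,
    Local(f32),
}

pub struct SceneBuilder {
    // One entry per open stacking context, plus a root entry.
    raster_space_stack: Vec<RasterSpace>,
}

impl SceneBuilder {
    pub fn new() -> Self {
        SceneBuilder { raster_space_stack: vec![RasterSpace::Screen] }
    }

    pub fn push_stacking_context(&mut self, requested: RasterSpace) {
        self.raster_space_stack.push(requested);
    }

    pub fn pop_stacking_context(&mut self) {
        // Never pop the root entry.
        if self.raster_space_stack.len() > 1 {
            self.raster_space_stack.pop();
        }
    }

    // A text run reads the innermost requested raster space at build time
    // and stores it, independent of whether a surface gets created.
    pub fn add_text_run(&self) -> RasterSpace {
        *self.raster_space_stack.last().expect("root entry is always present")
    }
}
```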
Differential Revision: https://phabricator.services.mozilla.com/D104345
The result of compute_aa_range depends on fwidth(local_pos). In the no-perspective case,
the derivatives of local_pos are constant across an entire primitive. SWGL fast-paths only
run in the no-perspective case anyway, so it is convenient to compute the aa_range once
for the entire span and then reuse it, factoring out this per-pixel cost.
Differential Revision: https://phabricator.services.mozilla.com/D104294
This change had previously been backed out due to causing rendering
issues on HTC 10 Android, and some Linux Radeon cards (bug 1687554).
On the HTC 10, the issue was that the extra case statement added to
the text run shader caused the glslopt optimized shader to become too
complex for the device, resulting in rendering issues. Since bug
1689316 has landed, the optimized shader output is simpler and this
issue is avoided.
On Radeon, we have established that the problem is due to the format
of the texture and that the shader is fine. Furthermore, the shader
works correctly with either R8 or RGBA8 texture data, as all of the
channels contain the alpha value in the RGBA8 textures. Therefore we
continue to use RGBA8 textures for alpha glyphs on Linux Radeon, but
switch to R8 on other platforms.
Differential Revision: https://phabricator.services.mozilla.com/D104082
Previously, we've taken the strategy of exposing any gecko specific hooks
as traits. The disadvantage of this approach is that it requires plumbing
a boxed trait through to any places that need to use it.
With this approach, we add global functions that don't do anything when
compiled without the 'gecko' feature. This makes it easier to add hooks
and avoids the plumbing which should reduce friction in the process
of moving more stuff out of gecko and into webrender.
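
A sketch of what such a hook could look like. The function name
`thread_is_being_profiled` and the external symbol
`embedder_thread_is_being_profiled` are hypothetical; only the 'gecko' feature
name comes from the text above.

```rust
// Hypothetical symbol that the embedder would provide when the feature is on.
#[cfg(feature = "gecko")]
extern "C" {
    fn embedder_thread_is_being_profiled() -> bool;
}

// With the feature enabled, forward to the embedder.
#[cfg(feature = "gecko")]
pub fn thread_is_being_profiled() -> bool {
    unsafe { embedder_thread_is_being_profiled() }
}

// Without it, the hook compiles down to a no-op.
#[cfg(not(feature = "gecko"))]
pub fn thread_is_being_profiled() -> bool {
    false
}
```

Callers anywhere in the crate can then invoke the function directly, with no
boxed trait object to thread through constructors.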
Differential Revision: https://phabricator.services.mozilla.com/D102334
Our existing batched texture upload logic works with pixel buffer objects which we don't use with ANGLE.
The motivation is to avoid expensive driver overhead from submitting many glTexSubImage2D calls (one for each texture cache item) on low-end Intel Windows configurations.
On Windows+Intel it is much faster to use batched draw calls to copy from staging textures to texture cache than using CopySubResourceRegion (when there is a high number of copies).
Differential Revision: https://phabricator.services.mozilla.com/D103333
We have encountered issues on some platforms due to a large number of
if statements in shaders. The shader optimizer previously generated
code with a large number of if statements, due to the way in which it
optimized switch statements.
Previously the optimizer output 2 if statements for every case in a
switch. First it ORs the "fallthrough" var with the case's
condition. Then sets the fallthrough var to false if the "break" var
is true. Then conditionally executes the case's instructions if
fallthrough is true. For example:
    switch (uMode) {
        case 0:
            gl_Position = vec4(0.0);
            break;
        case 1:
            gl_Position = vec4(1.0);
            break;
    }
becomes:
    bool break_var = bool(0);
    bool fallthrough_var = (0 == uMode);
    if (break_var) fallthrough_var = bool(0);
    if (fallthrough_var) {
        gl_Position = vec4(0.0, 0.0, 0.0, 0.0);
        break_var = bool(1);
    };
    fallthrough_var = (fallthrough_var || (1 == uMode));
    if (break_var) fallthrough_var = bool(0);
    if (fallthrough_var) {
        gl_Position = vec4(1.0, 1.0, 1.0, 1.0);
        break_var = bool(1);
    };
This update removes one of these ifs, by ANDing the fallthrough_var
with !break_var rather than conditionally setting it to false. eg:
    bool break_var = bool(0);
    bool fallthrough_var = (0 == uMode);
    if (fallthrough_var) {
        gl_Position = vec4(0.0, 0.0, 0.0, 0.0);
        break_var = bool(1);
    };
    fallthrough_var = (fallthrough_var || (1 == uMode));
    fallthrough_var = (fallthrough_var && !(break_var));
    if (fallthrough_var) {
        gl_Position = vec4(1.0, 1.0, 1.0, 1.0);
        break_var = bool(1);
    };
This is logically equivalent but uses half as many if statements,
which helps to avoid driver bugs on some platforms.
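
The equivalence can be checked with a small Rust transliteration of the two
lowerings for the two-case example. This is only a model of the control flow
(the function names are made up, and `gl_Position` is reduced to one
optional float), not optimizer code.

```rust
// Old lowering: two `if`s per case. Returns (value written, break_var).
pub fn lower_old(mode: i32) -> (Option<f32>, bool) {
    let mut out = None;
    let mut break_var = false;
    let mut fallthrough_var = 0 == mode;
    if break_var {
        fallthrough_var = false;
    }
    if fallthrough_var {
        out = Some(0.0);
        break_var = true;
    }
    fallthrough_var = fallthrough_var || (1 == mode);
    if break_var {
        fallthrough_var = false;
    }
    if fallthrough_var {
        out = Some(1.0);
        break_var = true;
    }
    (out, break_var)
}

// New lowering: the conditional reset is replaced by `&& !break_var`.
pub fn lower_new(mode: i32) -> (Option<f32>, bool) {
    let mut out = None;
    let mut break_var = false;
    let mut fallthrough_var = 0 == mode;
    if fallthrough_var {
        out = Some(0.0);
        break_var = true;
    }
    fallthrough_var = fallthrough_var || (1 == mode);
    fallthrough_var = fallthrough_var && !break_var;
    if fallthrough_var {
        out = Some(1.0);
        break_var = true;
    }
    (out, break_var)
}

// What the original switch statement does.
pub fn reference(mode: i32) -> (Option<f32>, bool) {
    match mode {
        0 => (Some(0.0), true),
        1 => (Some(1.0), true),
        _ => (None, false),
    }
}
```

Both lowerings agree with the reference for every mode, including values that
match no case.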
Differential Revision: https://phabricator.services.mozilla.com/D103713
We keep encountering issues on various platforms due to the usage of
switch statements, especially the optimized output produced by
glslopt. Replace all instances with if-else statements instead.
Differential Revision: https://phabricator.services.mozilla.com/D103300
Currently when the GPU cache is resized we allocate a new texture and
then copy the contents of the old texture to the new texture. This
copy requires either EXT_copy_image (for glCopyImageSubData) or
EXT_color_buffer_float (to bind the RGBAF32 texture to a framebuffer).
On devices where neither extension is supported, don't attempt to copy
the old texture. Instead mark the entire CPU-side copy of the cache as
dirty, meaning we will subsequently upload the entire contents to the
new texture. (A complete CPU-side copy is only maintained for the
PixelBuffer gpu cache bus type, not for Scatter ones. However, as the
Scatter type also requires EXT_color_buffer_float, we will only be in
this situation for PixelBuffer buses.)
Differential Revision: https://phabricator.services.mozilla.com/D103071
Instead of keeping a stacking context around for scrollbar containers,
extend and use the tile cache barrier code to create them. This
removes the final remaining code path that creates pass through
picture primitives.
The tile cache barrier changes also form the basis of how we will
make blend containers and backdrop roots work in a follow up patch.
Blend containers and backdrop roots will become redundant stacking
contexts when they exist at the start of a tile cache, which will
save an entire off-screen surface / constant invalidation.
Differential Revision: https://phabricator.services.mozilla.com/D102527
Previously, a leaf picture would be created unconditionally when
popping a stacking context during scene building. This results in
many pass-through pictures being created that are often not required.
This patch introduces a helper struct that delays creation of a
pass-through wrapping picture until it's known to be needed (and
instead adds the prim_list to a wrapping picture where possible).
In a follow up patch, the last couple of places that create pass
through pictures via pop_stacking_context will be removed.
Differential Revision: https://phabricator.services.mozilla.com/D102381
Under software WebRender, performance is substantially improved if we make the approximations in ClipItemKind::get_clip_result more accurate.
That function's job is to decide whether a primitive falls entirely inside a
clip, entirely outside it, or has regions of both. If the primitive is known to
fall entirely inside the clip, WebRender doesn't bother applying the clip to it.
This also saves WR the trouble of rendering the clip mask itself - which is what
the cs_rectangle_clip shader is spending a lot of time on when displaying these
pages.
Before this change, ClipItemKind::get_clip_result handles rounded rectangle clip
regions by computing an 'inner rect', a rectangle inset from the rounded
rectangle on each side by the relevant rounding radii. This is a correct
conservative approximation, but it means that any primitive that lies flush with
one of the flat sides of the clip is considered to only be partially within the
clip, and thus needs to have the clip mask applied - even though simple
rectangle intersection would serve.
With this change, instead of an 'inner rect', we approximate the rounded
rectangle by a rectangle with rectangular chunks taken out of each corner. This
lets us recognize more primitives as being fully within the clip, and apply the
clip mask less often.
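
The two approximations can be compared with a simplified sketch. It assumes
axis-aligned rectangles and a single uniform corner radius, and the `Rect` type
and function names are made up for illustration; the real code handles
per-corner radii.

```rust
#[derive(Clone, Copy, Debug)]
pub struct Rect {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

impl Rect {
    fn contains_rect(&self, o: &Rect) -> bool {
        o.x0 >= self.x0 && o.y0 >= self.y0 && o.x1 <= self.x1 && o.y1 <= self.y1
    }

    fn intersects(&self, o: &Rect) -> bool {
        self.x0 < o.x1 && o.x0 < self.x1 && self.y0 < o.y1 && o.y0 < self.y1
    }
}

// Old test: the primitive must fit in the clip rect inset by the radius.
pub fn inside_by_inner_rect(clip: &Rect, radius: f32, prim: &Rect) -> bool {
    let inner = Rect {
        x0: clip.x0 + radius,
        y0: clip.y0 + radius,
        x1: clip.x1 - radius,
        y1: clip.y1 - radius,
    };
    inner.contains_rect(prim)
}

// New test: the primitive must fit in the clip rect and avoid the four
// rectangular corner chunks that contain all of the rounding.
pub fn inside_by_corner_chunks(clip: &Rect, radius: f32, prim: &Rect) -> bool {
    if !clip.contains_rect(prim) {
        return false;
    }
    let corners = [
        Rect { x0: clip.x0, y0: clip.y0, x1: clip.x0 + radius, y1: clip.y0 + radius },
        Rect { x0: clip.x1 - radius, y0: clip.y0, x1: clip.x1, y1: clip.y0 + radius },
        Rect { x0: clip.x0, y0: clip.y1 - radius, x1: clip.x0 + radius, y1: clip.y1 },
        Rect { x0: clip.x1 - radius, y0: clip.y1 - radius, x1: clip.x1, y1: clip.y1 },
    ];
    !corners.iter().any(|c| c.intersects(prim))
}
```

A primitive that lies flush against the left edge of the clip, away from the
corners, fails the inner-rect test but passes the corner-chunk test, so it no
longer needs a clip mask.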
Differential Revision: https://phabricator.services.mozilla.com/D102526
The storage logic was made for, and used exclusively in, the prim_store.
Moving it into the sub-module and making it private allows for easier navigation in the code.
Differential Revision: https://phabricator.services.mozilla.com/D102494
Bug 1685563 switched to using R8 textures instead of BGRA8 for
non-subpixel AA glyphs. This caused rendering issues on certain
Android and Linux devices, so switch back temporarily until those
issues are fixed.
Differential Revision: https://phabricator.services.mozilla.com/D102465
Currently we expire old picture cache tiles at the end of the frame,
immediately before garbage collecting them. This means that new
textures have already been allocated for newly-created picture cache
tiles, so we often end up both allocating and destroying textures in
the same frame.
Instead, move the call to expire_old_picture_cache_tiles() to the
beginning of the frame. Picture cache tiles added to the cache during
the frame can then recycle these textures rather than allocate new
ones. Garbage collection still occurs at the end of the frame,
destroying freed textures that were not recycled.
Note that expire_old_picture_cache_tiles() frees picture cache tiles
which were unused in the *previous* as well as the current frame. This
is a legacy from when the function freed all types of texture cache
entries, and could be called throughout the frame. Immediately prior
to this change, it could in fact have just checked for usage during
the *current* frame, as the function was only called at the end of the
frame. However, as this change moves the call to the beginning of the
frame, we do actually now need to check for usage during
the *previous* frame.
Differential Revision: https://phabricator.services.mozilla.com/D102349
Removes another case of pass through pictures, by handling the rare
case of a stacking context with backface-visibility: false that is
_not_ part of a 3d rendering context as an offscreen surface.
Differential Revision: https://phabricator.services.mozilla.com/D102251
Previously, it was possible to request a local/screen raster
space even if the owning stacking context didn't create an
offscreen surface.
This complicates various parts of the code, and also results
in a pass-through picture primitive being created (which we
want to remove as part of the work for #1684781).
With this change, it's only possible to change the raster space
when the enclosing stacking context creates an offscreen surface
for some other reason (e.g. 3d transform, filters etc).
Differential Revision: https://phabricator.services.mozilla.com/D102244
Instead, top level tile cache pictures are stored in the scene.
Follow up tasks in this bug will be simplified by having pictures
only exist when they have Some(..) for requested_composite_mode.
This patch removes one case of a pass-through picture, and
simplifies some of the surrounding code in the process.
Differential Revision: https://phabricator.services.mozilla.com/D101539
All LRU partitions use the same freelist to store the entries, and only have
separate LRU indexes. The shared freelist makes it easier to transfer an entry
from one LRU partition to another. If instead we used different LRUCache
instances, moving entries between partitions would be cumbersome because we
would need to look up the strong handle for the entry so that we could remove it
from the old freelist.
Differential Revision: https://phabricator.services.mozilla.com/D102122
When I wrote this patch, I thought that it would simplify the next patch in this
series, but I think it didn't make much of a difference in the end.
I still think this patch improves things and is worth taking, though.
Differential Revision: https://phabricator.services.mozilla.com/D102121
The limit wasn't doing anything useful anymore, because one of the recent texture
cache refactorings made it so that we weren't actually evicting these glyphs from
the texture cache. In the future, we can implement a similar limit in the texture
cache itself, by giving it per-cache-type limits rather than a global limit.
Differential Revision: https://phabricator.services.mozilla.com/D101834
The new version contains
- A bug fix for the bucketed allocator (we don't currently use it)
- A few fixes for issues that can occur when requesting allocation sizes large enough to cause integer overflows. At the moment we never request an allocation larger than 512px so we are safe, but it's still good to stay up to date.
Differential Revision: https://phabricator.services.mozilla.com/D101608
Add a new texture type alpha8_glyphs to the texture cache and store
alpha glyphs in it. Because the OpenGL texture format is R8 but the
shader needs to read the texture's alpha channel, we must swizzle
the components. We cannot rely on texture swizzling due to driver
bugs, so add the necessary code to the shader to do so manually.
Differential Revision: https://phabricator.services.mozilla.com/D101643
ANGLE appears to truncate uploads from a PBO in cases where the
UNPACK_ROW_LENGTH is greater than the width of the upload. We
encounter this due to rounding up the stride of our data to be a
multiple of 4 bytes. Don't do that on ANGLE.
Note that we only hit this issue in wrench, as in Firefox we do not
use PBO uploads with ANGLE.
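
The stride choice amounts to something like the following sketch. The function
name and the `is_angle` flag are assumptions for illustration, not the actual
device code.

```rust
// Returns the row stride in bytes to use for a PBO upload.
pub fn upload_stride(width_px: usize, bytes_per_pixel: usize, is_angle: bool) -> usize {
    let tight = width_px * bytes_per_pixel;
    if is_angle {
        // Keep rows tightly packed so UNPACK_ROW_LENGTH matches the width.
        tight
    } else {
        // Round up to the next multiple of 4 bytes.
        (tight + 3) & !3
    }
}
```

For a 5 pixel wide, 1 byte per pixel upload this gives a stride of 8 bytes in
the general case and 5 bytes on ANGLE.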
Differential Revision: https://phabricator.services.mozilla.com/D101663
The previous traversal strategy for assigning render tasks is very
simple and works fine for normal content. However, it's possible to
create graphs with very deep levels of nesting and dependencies
that cause the pass traversal to not terminate quickly.
This patch contains two changes to fix these cases:
- Recursion in assign_render_pass will early out if a shorter
path has been found.
- Remove recursion from assign_free_pass, iterating each task once.
Differential Revision: https://phabricator.services.mozilla.com/D101541
In cases where clips are in the same coordinate space as the parent
picture surface (but different from the primitive), we can apply
a simple optimization to reduce the size of clip mask allocations.
In the linked test case, this drastically reduces the size of clip
masks that are drawn (each of the many clip masks drops from approx
1000 x 1000 pixels to approx 64 x 64 pixels).
Differential Revision: https://phabricator.services.mozilla.com/D101298
Restructure how frame building handles building of the surface
render task structures.
In future parts of this bug, a surface may have multiple "sub-passes"
where the rendering is separated into different render tasks that
write to the same target (without clear on subsequent sub-passes).
For example, this will be used to allow mix-blend-mode and
backdrop-filter to issue readbacks directly from picture cache
tiles rather than forcing draws to an intermediate surface.
The `add_child_render_task` method will be expanded to support
the case of a surface having multiple sub-passes, adding the
render task to the correct sub-pass.
Differential Revision: https://phabricator.services.mozilla.com/D101122
This is a small extract from a change I tried to land earlier.
It refactors the GL version queries, no functional changes here.
Differential Revision: https://phabricator.services.mozilla.com/D101082
render_impl() calls draw_frame(). We were drawing the debug overlays in both
functions, so the overlays were rendered twice.
This patch removes the drawing in draw_frame() and keeps the one in render_impl().
For anything that uses the debug renderer, such as the "epochs" overlay, this
means that the drawing commands were appended twice, and then all commands were
executed at the end of render_impl().
Comparing the "epochs" debug overlay with and without this change, this patch
noticeably reduces its opacity from "extra opaque" (due to double drawing) to
"normal translucency".
Also, when the native compositor was used, any debug drawing in draw_frame()
that wasn't using the debug renderer didn't actually work, because it wasn't
going into the debug overlay surface - only render_impl() correctly binds the
debug overlay surface when the native compositor is used.
Differential Revision: https://phabricator.services.mozilla.com/D100961
In bug 1676474 an issue was reported regarding partial present on
Mali-G77 devices. This was introduced in bug 1675159, which refactored
some partial present logic and shifted the order of some OpenGL calls
around. As a precaution, we disabled the feature on all Mali-Gxx
devices.
The bug seems to occur when eglSetDamageRegion is called after
rendering to an offscreen render target (in this case due to texture
cache or GPU cache updates), but without the driver being flushed in
some way. This appears to be a bug in the Mali driver.
This patch moves the eglSetDamageRegion call back to its original
location -- after all offscreen render targets have been rendered,
immediately before rendering to the main framebuffer -- which fixes
the issue. It also re-enables the feature on all Mali-Gxx devices.
Differential Revision: https://phabricator.services.mozilla.com/D101018
Most of that is *rendering* logic, so the change moves it closer to the renderer.
One tiny bit - `enum DebugItem` - is used in frame building, so I left it in its own small module at the root.
Differential Revision: https://phabricator.services.mozilla.com/D100936
This isn't very systematic as I'm not sure of the best approach for that
yet. That being said, this captures the bulk of the autoreleases
that happen without a pool.
Differential Revision: https://phabricator.services.mozilla.com/D100363
There is a driver bug on old versions of the Adreno driver which
prevents usage of persistently mapped buffers for texture
uploads. Creating and mapping the buffer works correctly, but
attempting to upload to a texture from the buffer results in an error
due to the buffer still being mapped.
This means that no texture data is uploaded, essentially meaning that
we do not render anything at all.
It appears to affect at least Adreno 4xx and 5xx devices running
Android 6. For now, simply disable persistent mapping on all Adreno
devices, until we know more specifically which are affected.
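A sketch of the resulting check, assuming the decision is made from the renderer name string. The function and parameter names are hypothetical.

```rust
/// Decide whether to use persistently mapped buffers for texture uploads.
/// `renderer_name` stands in for the driver's renderer string and
/// `has_buffer_storage` for whether persistent mapping is advertised at
/// all; both are assumptions for this sketch.
pub fn use_persistent_mapping(renderer_name: &str, has_buffer_storage: bool) -> bool {
    // Disabled on every Adreno device until the affected set is known.
    let is_adreno = renderer_name.starts_with("Adreno");
    has_buffer_storage && !is_adreno
}
```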
Differential Revision: https://phabricator.services.mozilla.com/D100391
There are probably other places that have this kind of problem, but this
keeps things simple for now and might be sufficient to get things under
control.
Further work will follow.
Differential Revision: https://phabricator.services.mozilla.com/D100294
This is probably the last of the low-hanging fruit in renderer submodules.
I think it would be very useful to try isolating the scene building stuff in a similar way.
So in the end, the root `src` should only contain things that are used by multiple stages of WR.
Differential Revision: https://phabricator.services.mozilla.com/D100263
This patch introduces the new frame graph implementation, which
allows for more advanced and efficient render task graphs.
The goal of this initial work is to achieve feature parity with
the existing render task graph.
Follow up work will take advantage of the new graph functionality
to improve the efficiency of current mix-blend-mode, backdrop-filter
and svg filter operations.
Differential Revision: https://phabricator.services.mozilla.com/D99743
Once the new graph API is in place, it becomes possible to express
an input dependency on a persistent target (for example, if wanting
to read back from a picture cache tile for a mix-blend, or marking
that a color target depends on a render task in a texture cache).
To make that simpler to express, this patch adds a specific struct
for render target locations that are persistent, and updates the
surrounding code to use it. At the same time, introduce an Unallocated
field for dynamic tasks that are not yet allocated, rather than
using an Option.
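One way to picture the shape of this change is the sketch below. Every type, variant and field name here is illustrative and not taken from the patch.

```rust
/// Hypothetical description of a target that outlives a single frame.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PersistentSurface {
    /// A texture owned by the texture cache.
    TextureCache { texture_id: u32 },
    /// A picture cache tile that may be read back from.
    PictureCacheTile { tile_id: u32 },
}

/// Where a render task draws. An explicit `Unallocated` state replaces
/// wrapping the location in an `Option`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TaskLocation {
    /// A dynamic task that has a size but no target yet.
    Unallocated { width: u32, height: u32 },
    /// A dynamic task after a target and origin have been chosen.
    Dynamic { texture_id: u32, x: u32, y: u32, width: u32, height: u32 },
    /// A task bound to a persistent target.
    Persistent { surface: PersistentSurface, width: u32, height: u32 },
}

impl TaskLocation {
    pub fn is_allocated(&self) -> bool {
        !matches!(self, TaskLocation::Unallocated { .. })
    }

    /// Place an unallocated task; other locations are left unchanged.
    pub fn allocate(self, texture_id: u32, x: u32, y: u32) -> TaskLocation {
        match self {
            TaskLocation::Unallocated { width, height } => {
                TaskLocation::Dynamic { texture_id, x, y, width, height }
            }
            other => other,
        }
    }
}
```

Compared with an `Option`, the unallocated state can still carry the requested size, and every `match` on the location has to say what happens before allocation.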
Differential Revision: https://phabricator.services.mozilla.com/D99305
This patch splits the graph building functionality into
`RenderTaskGraphBuilder` and the graph querying code into
the existing `RenderTaskGraph` struct.
The Builder struct is retained frame to frame, which means
there is no longer a need for the `RenderTaskGraphCounters`
struct. The Graph struct is constructed per-frame by calling
`end_pass` on the Builder.
Although this doesn't change much internally, it will
make integration with the new task graph changes simpler. It
also enforces, during frame building, when it is possible
to add / query render tasks.
A few unrelated tidy-ups are included in this patch - mostly
no longer passing the task graph to a few structs
and methods that don't require access to it.
Differential Revision: https://phabricator.services.mozilla.com/D99297
This patch makes picture cache tiles use normal textures instead
of array textures. With this and the previous patch, WR no longer
uses array textures at all (except when provided by the external
image handler trait).
Differential Revision: https://phabricator.services.mozilla.com/D99013
Mainly this implements a new set of SWGL intrinsics based around swgl_allowTextureNearest
and swgl_commitTextureNearest which can fairly easily provide a further fast-path above
and beyond swgl_commitTextureLinear. This requires the row be from an axis-aligned 1:1
draw so that we can do something not unlike a fast copy of the texture data straight
to the destination in cases where even the linear filter would be essentially doing
the same thing in a more expensive way. For now, only a few WR shaders that were already
using swgl_commitTextureLinear have been fast-pathed with the new intrinsics to see if
this provides significant performance benefit.
Differential Revision: https://phabricator.services.mozilla.com/D100079
It's always supplied by Gecko anyway, and being able to rely on this
will make it easier to create stable spatial node IDs that persist
across display lists.
Differential Revision: https://phabricator.services.mozilla.com/D100076
With this change, all color/alpha intermediate surfaces are individual
textures, rather than texture arrays.
This can in theory cause more batch breaks in some cases, but this
is likely to be very rare in practice.
Benefits:
- No more allocating the array at the size of the largest task / slice.
- Remove a source of many driver bugs on Android devices.
- Simplify integration of future patches with render task graph.
Much of the render target array texture code is still present, since
picture cache tiles in the Draw compositor still make use of texture
arrays. However, once these are switched to normal textures, we can
remove most of the slice layer, blit workaround functionality etc.
Remove the default feature setting for selecting the image sampler
kind. Instead, this must be explicitly specified by the shader or
a dynamic feature define, which makes sampler selection less error prone.
Differential Revision: https://phabricator.services.mozilla.com/D99006
Remove usage of the implicit prev pass alpha and color texture
samplers from batching / renderer / shader code. They are replaced
by explicit references to the texture ID for the source task.
Differential Revision: https://phabricator.services.mozilla.com/D98872
This commit moves the code that deals with allocating into a dynamic number of textures (TextureUnits) out of texture_cache.rs, renames it to AllocatorList and makes it generic.
The code also changes some of the profile counters to count pixels and number of textures instead of number of regions and size in bytes.
I had to introduce two traits, which is a bit cumbersome but not so bad. AtlasAllocator is needed to implement AllocatorList with multiple allocators, and AtlasAllocatorList is a dyn trait that lets the texture cache select between allocator lists of different type signatures.
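A condensed sketch of how the two traits can fit together. The trait and struct names follow the description above, but every method signature, the `Handle` type and the toy `RowAllocator` are assumptions made for illustration.

```rust
/// Hypothetical handle: which texture the allocation landed in, plus an
/// allocator-local id.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Handle {
    pub unit: usize,
    pub id: u32,
}

/// One allocator manages the space of one texture.
pub trait AtlasAllocator {
    fn new(size: u32) -> Self;
    fn allocate(&mut self, width: u32, height: u32) -> Option<u32>;
}

/// Object-safe view of a list, so lists with different allocator types
/// can be stored as `Box<dyn AtlasAllocatorList>`.
pub trait AtlasAllocatorList {
    fn allocate(&mut self, width: u32, height: u32) -> Handle;
    fn texture_count(&self) -> usize;
}

/// Generic list that grows by one texture whenever all existing ones are full.
pub struct AllocatorList<A: AtlasAllocator> {
    units: Vec<A>,
    texture_size: u32,
}

impl<A: AtlasAllocator> AllocatorList<A> {
    pub fn new(texture_size: u32) -> Self {
        AllocatorList { units: Vec::new(), texture_size }
    }
}

impl<A: AtlasAllocator> AtlasAllocatorList for AllocatorList<A> {
    fn allocate(&mut self, width: u32, height: u32) -> Handle {
        // Try the existing textures first.
        for (unit, allocator) in self.units.iter_mut().enumerate() {
            if let Some(id) = allocator.allocate(width, height) {
                return Handle { unit, id };
            }
        }
        // None had room: add a texture and allocate from it.
        let mut allocator = A::new(self.texture_size);
        let id = allocator
            .allocate(width, height)
            .expect("request does not fit in an empty texture");
        self.units.push(allocator);
        Handle { unit: self.units.len() - 1, id }
    }

    fn texture_count(&self) -> usize {
        self.units.len()
    }
}

/// Toy allocator used only for this sketch: hands out full-width rows.
pub struct RowAllocator {
    size: u32,
    next_y: u32,
    next_id: u32,
}

impl AtlasAllocator for RowAllocator {
    fn new(size: u32) -> Self {
        RowAllocator { size, next_y: 0, next_id: 0 }
    }

    fn allocate(&mut self, width: u32, height: u32) -> Option<u32> {
        if width > self.size || self.next_y + height > self.size {
            return None;
        }
        self.next_y += height;
        let id = self.next_id;
        self.next_id += 1;
        Some(id)
    }
}
```

The generic parameter keeps each list statically dispatched internally, while the dyn trait gives the owner a single type to hold regardless of which allocator a list uses.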
Differential Revision: https://phabricator.services.mozilla.com/D98371
This patch adds infrastructure for crash reporter annotations to
WebRender. This is used to expose the new annotation,
GraphicsCompileShader, to indicate which shader we are in the process of
compiling. We often see crash reports when compiling shaders, and it
would be useful to know which one it is compiling.
This also adds another annotation, IsWebRenderResourcePathOverridden,
which is useful to know if someone overrode the shader resource path for
testing purposes. We can likely ignore any crash reports that have this
annotation set.
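The "which shader are we compiling" annotation lends itself to a scope guard, sketched below. The `Annotations` store and guard type are hypothetical stand-ins; the real crash reporter interface is not shown here.

```rust
use std::cell::RefCell;

/// Hypothetical in-process store for the GraphicsCompileShader annotation.
#[derive(Default)]
pub struct Annotations {
    graphics_compile_shader: RefCell<Option<String>>,
}

impl Annotations {
    /// What a crash report taken right now would contain.
    pub fn current_shader(&self) -> Option<String> {
        self.graphics_compile_shader.borrow().clone()
    }
}

/// Sets the annotation on creation and clears it when dropped.
pub struct CompileShaderGuard<'a> {
    annotations: &'a Annotations,
}

impl<'a> CompileShaderGuard<'a> {
    pub fn new(annotations: &'a Annotations, shader_name: &str) -> Self {
        *annotations.graphics_compile_shader.borrow_mut() = Some(shader_name.to_owned());
        CompileShaderGuard { annotations }
    }
}

impl Drop for CompileShaderGuard<'_> {
    fn drop(&mut self) {
        *self.annotations.graphics_compile_shader.borrow_mut() = None;
    }
}

/// Stand-in for shader compilation: returns the annotation value that is
/// visible while the "compile" is in progress.
pub fn compile_shader(annotations: &Annotations, shader_name: &str) -> Option<String> {
    let _guard = CompileShaderGuard::new(annotations, shader_name);
    annotations.current_shader()
}
```

Tying the annotation to a guard means it is cleared on every exit path from the compile function, so a later unrelated crash is not attributed to a shader.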
Differential Revision: https://phabricator.services.mozilla.com/D99736
There might be some overlap with memory counted elsewhere and some of
the size calculations could be wrong but it should give us an overall
picture.
Differential Revision: https://phabricator.services.mozilla.com/D99562
This should not have a functional change because today nothing uses the
individual flags and instead always uses ClearCache::all().
Differential Revision: https://phabricator.services.mozilla.com/D99598
When deallocating the last item of the region, the region's size is changed in free and we end up returning the wrong value.
The faulty code is actually removed in a later patch from this series, but I'd like to fix the regression now and land the patch that removes the bad code during the next train.
Differential Revision: https://phabricator.services.mozilla.com/D99458
This ID allows the compositor to track per-frame information from frame
generation, through APZ sampling, to the NotifyDidRender notification.
Differential Revision: https://phabricator.services.mozilla.com/D97535
This is a follow up to the addition of a clip mask texture sampler.
With this patch, that sampler is no longer bound to the PrevPassAlpha
input, but is explicitly bound to any arbitrary texture that the
input render task was drawn into. This is a step towards enabling
the full render task graph.
There's a bit of complexity here in that it's now possible for individual
segments to break a batch, if they have clip masks that ended up on a
different texture. This should be an extremely rare case. However, it
does currently result in (even more) code duplication in some of the
batching code - which can be refactored once the render task graph
changes are in place.
Differential Revision: https://phabricator.services.mozilla.com/D98711
This is an incremental but important step to implementing render
tasks as a proper graph.
By moving the render target management to the frame building step,
we know the texture_id of all sub-passes before the batching is
done for any passes that use these as inputs. This means that
we can directly reference the texture_id during batching, rather
than the old `RenderTaskCache` and `PrevPassAlpha` / `PrevPassColor`
enum fields (although removal of all these will be done in the
next patch).
Another advantage of this is that we have much better knowledge
of which targets are required for rendering a given frame, so
these can be allocated up front at the start of a frame. This
may be a better allocation pattern for some drivers. We also
have better knowledge available on when a texture can be
invalidated, and the render target pool management is simpler since
it is the same as the way other texture cache textures are handled.
Differential Revision: https://phabricator.services.mozilla.com/D98547
The pixel-local-storage functionality was an experiment for faster
drawing of clip masks on low end tiled GPUs. However, it's never
reached a point where it was shippable and showing clear performance
wins.
This patch removes the experimental PLS support - we can always
revive it from git history if we ever want to consider it again.
Differential Revision: https://phabricator.services.mozilla.com/D98290
Another step towards abstracting out slab allocation from the rest of the texture cache and allowing multiple algorithms. All dynamic atlas allocation algorithms will use an AllocId encoded into 32 bits.
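A minimal sketch of a 32-bit allocation id. The 16/16 split between region index and slot index is an assumption chosen for illustration; the patch may encode different fields.

```rust
/// Hypothetical layout: high 16 bits hold a region index, low 16 bits
/// hold a slot index within that region.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct AllocId(pub u32);

impl AllocId {
    pub fn new(region: u16, slot: u16) -> Self {
        AllocId(((region as u32) << 16) | slot as u32)
    }

    pub fn region(self) -> u16 {
        (self.0 >> 16) as u16
    }

    pub fn slot(self) -> u16 {
        (self.0 & 0xFFFF) as u16
    }
}
```

Because every algorithm hands back the same opaque 32-bit value, the rest of the texture cache can store and free allocations without knowing which allocator produced them.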
Differential Revision: https://phabricator.services.mozilla.com/D98209
And move texture_id into a TextureUnit structure in texture_cache.rs that contains the allocator and the id. No behavior changes in this patch.
Depends on D98202
Differential Revision: https://phabricator.services.mozilla.com/D98203
A bit of cleanup and a step towards having more allocation algorithms in the texture cache. This patch mostly moves code around, and should not change the behavior of the code.
Depends on D98201
Differential Revision: https://phabricator.services.mozilla.com/D98202
I'm about to add a couple of new atlas allocation algorithms. This patch renames one of the existing ones to something less generic before it gets confusing.
Also fix outdated comments about merging and dynamic allocation, which were removed a while ago.
Differential Revision: https://phabricator.services.mozilla.com/D98201
Use a specific texture sampler for clip masks. Although this is
always currently set to the PrevPassAlpha texture source, it does
all the plumbing to allow a follow up commit that explicitly
provides a texture id for clip masks per-primitive.
Differential Revision: https://phabricator.services.mozilla.com/D98128
These structs are created and copied around many times during
batching. As PrevPassColor and PrevPassAlpha are removed, the
BatchTextures struct will contain extra fields (such as the clip
mask binding), so it's important to reduce the size of them to
avoid regressing batching time. This patch reduces the size of
`BatchTextures` from 72 bytes to 24 bytes.
Differential Revision: https://phabricator.services.mozilla.com/D98122
UploadPBOPool::end_frame() creates a GLsync object for synchronizing
PBO access, which internally may require flushing the command
stream. Previously this was being called mid-frame, and potentially
multiple times per frame. This overhead caused talos regressions in
cases with very small amounts of texture upload per frame. To fix
this, call UploadPBOPool::end_frame() only once at the end of the frame.
Differential Revision: https://phabricator.services.mozilla.com/D98003
WebRender uses an LRUCache to hold the items in the texture
cache. When texture usage is over a certain threshold we evict the
least recently used items until we are back under the
threshold. However, this runs into a problem on pages where the cache
only contains items still in use, but is still over the threshold, as
even the least recently used item is still required. In this scenario
we end up evicting items at the start of the frame, only to reupload
them later in the frame, and repeat the cycle again on the next frame.
To avoid this, tweak the eviction algorithm so that it never evicts
items that were in use in the previous frame. (The eviction step
occurs before we know which items are needed for the current frame, so
using the previous frame is the best approximation.)
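The adjusted policy can be sketched as below. This uses a plain vector in place of the real LRUCache, and the `Entry` type and `evict_lru` function are hypothetical.

```rust
/// Hypothetical cache entry: its size and the last frame that used it.
pub struct Entry {
    pub size: usize,
    pub last_used_frame: u64,
}

/// Evict least recently used entries until usage is at or under
/// `threshold`, but never evict an entry that was used in the previous
/// frame. Called at the start of `current_frame`, before that frame's
/// requests are known. Returns the number of evicted entries.
pub fn evict_lru(entries: &mut Vec<Entry>, threshold: usize, current_frame: u64) -> usize {
    // Oldest entries first.
    entries.sort_by_key(|e| e.last_used_frame);
    let mut total: usize = entries.iter().map(|e| e.size).sum();
    let previous_frame = current_frame.saturating_sub(1);

    let mut evicted = 0;
    while evicted < entries.len() && total > threshold {
        let entry = &entries[evicted];
        if entry.last_used_frame >= previous_frame {
            // Everything from here on was in use last frame: stop, even
            // though the cache is still over the threshold.
            break;
        }
        total -= entry.size;
        evicted += 1;
    }
    drop(entries.drain(..evicted));
    evicted
}
```

The cache may stay over its threshold for a while under this rule, which is the intended trade: holding extra memory is cheaper than evicting and re-uploading the same items every frame.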
Differential Revision: https://phabricator.services.mozilla.com/D98043
With follow up patches, we want to retain the existing logic and
structures in RenderTaskKind, but be able to port and use them in
a new render task graph structure. This is simpler if we move as
much logic out of `RenderTask` as possible.
- Move simple RenderTask create methods into RenderTaskKind
(complex ones will be done as a separate follow up)
- Move `write_task_data` and `write_gpu_blocks` into RenderTaskKind
- Move some static helper methods (e.g. to `BlurTask`)
Differential Revision: https://phabricator.services.mozilla.com/D97527
The dummy texture is a texture array. The bug was caused by selecting the non-array version of the shader while binding a texture array.
Differential Revision: https://phabricator.services.mozilla.com/D97991
Using GL_LINEAR was causing incorrect filtering to occur when copying
the RGBAF32 GPU cache texture on Mali, causing rendering
errors. Switching to GL_NEAREST fixes it.
This is the same bug as bug 1669960, which was believed at the time to
only affect Mali-Gxx. On further testing the bug affects Mali-Txxx
too. Bug 1669960 was worked around at the time by using
glCopyImageSubData instead of glBlitFramebuffer. However, we want to
avoid using glCopyImageSubData on Mali: on Mali-T due to performance
reasons, and on Mali-G due to indefinite hangs. Fixing this filtering
bug allows us to switch both sets of devices to always use
glBlitFramebuffer.
Differential Revision: https://phabricator.services.mozilla.com/D97558
The main framebuffer pass is now always a simplified step of
constructing the CompositeState structure, rather than any
complex alpha batching (since a recent change that meant
picture caching is always enabled).
This patch doesn't contain any functional changes. It removes
the main framebuffer render pass kind, simplifying how passes
are built and rendered. A follow up patch will further simplify
this code by moving the CompositeState creation out of the
regular batching code.
Differential Revision: https://phabricator.services.mozilla.com/D97230
In the following circumstances, WR was failing to detect a
composite was required:
- There is a picture cache slice that is smaller than a single tile.
- The position of that picture cache slice is changed.
- No other content invalidations occur.
The clip rect in the composite descriptor must include the
device_valid_rect rather than the tile device_rect. This ensures
that in the case of a picture cache slice that is smaller than a
single tile, the clip rect in the composite descriptor will change
if the position of that slice is changed. Otherwise, WR may conclude
that no composite is needed if the tile itself was not invalidated
due to changing content.
Differential Revision: https://phabricator.services.mozilla.com/D96966
This change adds a code path to avoid instancing, enabled (if supported) on non-Intel GPUs.
Side note: we still need a plan on what to do on devices that support neither base-instance nor SSBO.
Differential Revision: https://phabricator.services.mozilla.com/D87826
This is a modest win on hidpi and a larger win on lowdpi. We can improve further by introducing more granularity between 16 and 32px; this puts in place the basic infrastructure on top of which we can experiment.
Differential Revision: https://phabricator.services.mozilla.com/D95870
A couple of minor changes:
- Don't pass render task graph to resource cache (it's not used).
- Support an `initial_size` for the texture allocator on creation.
This is a convenience for when the allocation tracker is being used
to track a single surface rather than an array.
Differential Revision: https://phabricator.services.mozilla.com/D96956
It's only used by a small subset of render tasks, it makes more
sense to specialize it for those tasks. This is part of reducing
the fields in RenderTask so it's easier to port to the graph
changes being worked on.
Differential Revision: https://phabricator.services.mozilla.com/D96925
This patch simplifies the slab allocator in various ways, most importantly separating the packing logic and texture cache glue (dealing with swizzling, cache entries, etc.). The former is moved into TextureUnits/TextureUnit and the latter is mostly contained in TextureCache.
This patch should have no functional change. The goal is to make it easier to introduce custom slab sizes for glyphs in a followup patch, and later use different packing algorithms.
Differential Revision: https://phabricator.services.mozilla.com/D95869
Also tweak the visualization in various ways so that having a large amount of regions (glyphs) doesn't bring down simple SVG viewing software.
Differential Revision: https://phabricator.services.mozilla.com/D95758
Since glyphs are rarely larger than 128x128, we can reduce the amount of wasted space from partially used glyph regions by having smaller ones (and more of them).
Differential Revision: https://phabricator.services.mozilla.com/D95757