2017-06-24 00:23:12 +03:00
|
|
|
Advanced Layers
|
|
|
|
==============
|
|
|
|
|
|
|
|
Advanced Layers is a new method of compositing layers in Gecko. This document serves as a technical
|
|
|
|
overview and provides a short walk-through of its source code.
|
|
|
|
|
|
|
|
Overview
|
|
|
|
-------------
|
|
|
|
|
|
|
|
Advanced Layers attempts to group as many GPU operations as it can into a single draw call. This is
|
|
|
|
a common technique in GPU-based rendering called "batching". It is not always trivial, as a
|
|
|
|
batching algorithm can easily waste precious CPU resources trying to build optimal draw calls.
|
|
|
|
|
|
|
|
Advanced Layers reuses the existing Gecko layers system as much as possible. Huge layer trees do
|
|
|
|
not currently scale well (see the future work section), so opportunities for batching are currently
|
|
|
|
limited without expending unnecessary resources elsewhere. However, Advanced Layers has a few
|
|
|
|
benefits:
|
|
|
|
|
|
|
|
* It submits smaller GPU workloads and buffer uploads than the existing compositor.
|
|
|
|
* It needs only a single pass over the layer tree.
|
|
|
|
* It uses occlusion information more intelligently.
|
|
|
|
* It is easier to add new specialized rendering paths and new layer types.
|
|
|
|
* It separates compositing logic from device logic, unlike the existing compositor.
|
|
|
|
* It is much faster at rendering 3d scenes or complex layer trees.
|
|
|
|
* It has experimental code to use the z-buffer for occlusion culling.
|
|
|
|
|
|
|
|
Because of these benefits we hope that it provides a significant improvement over the existing
|
|
|
|
compositor.
|
|
|
|
|
|
|
|
Advanced Layers uses the acronym "MLG" and "MLGPU" in many places. This stands for "Mid-Level
|
|
|
|
Graphics", the idea being that it is optimized for Direct3D 11-style rendering systems as opposed
|
|
|
|
to Direct3D 12 or Vulkan.
|
|
|
|
|
|
|
|
LayerManagerMLGPU
|
|
|
|
------------------------------
|
|
|
|
|
|
|
|
Advanced layers does not change client-side rendering at all. Content still uses Direct2D (when
|
|
|
|
possible), and creates identical layer trees as it would with a normal Direct3D 11 compositor. In
|
|
|
|
fact, Advanced Layers re-uses all of the existing texture handling and video infrastructure as
|
|
|
|
well, replacing only the composite-side layer types.
|
|
|
|
|
|
|
|
Advanced Layers does not create a `LayerManagerComposite` - instead, it creates a
|
|
|
|
`LayerManagerMLGPU`. This layer manager does not have a `Compositor` - instead, it has an
|
|
|
|
`MLGDevice`, which roughly abstracts the Direct3D 11 API. (The hope is that this API is easily
|
|
|
|
interchangeable for something else when cross-platform or software support is needed.)
|
|
|
|
|
|
|
|
`LayerManagerMLGPU` also dispenses with the old "composite" layers for new layer types. For
|
|
|
|
example, `ColorLayerComposite` becomes `ColorLayerMLGPU`. Since these layer types implement
|
|
|
|
`HostLayer`, they integrate with `LayerTransactionParent` as normal composite layers would.
|
|
|
|
|
|
|
|
Rendering Overview
|
|
|
|
----------------------------
|
|
|
|
|
|
|
|
The steps for rendering are described in more detail below, but roughly the process is:
|
|
|
|
|
|
|
|
1. Sort layers front-to-back.
|
|
|
|
2. Create a dependency tree of render targets (called "views").
|
|
|
|
3. Accumulate draw calls for all layers in each view.
|
|
|
|
4. Upload draw call buffers to the GPU.
|
|
|
|
5. Execute draw commands for each view.
|
|
|
|
|
|
|
|
Advanced Layers divides the layer tree into "views" (`RenderViewMLGPU`), which correspond to a
|
|
|
|
render target. The root layer is represented by a view corresponding to the screen. Layers that
|
|
|
|
require intermediate surfaces have temporary views. Layers are analyzed front-to-back, and rendered
|
|
|
|
back-to-front within a view. Views themselves are rendered front-to-back, to minimize render target
|
|
|
|
switching.
|
|
|
|
|
|
|
|
Each view contains one or more rendering passes (`RenderPassMLGPU`). A pass represents a single
|
|
|
|
draw command with one or more rendering items attached to it. For example, a `SolidColorPass` item
|
|
|
|
contains a rectangle and an RGBA value, and many of these can be drawn with a single GPU call.
|
|
|
|
|
|
|
|
When considering a layer, views will first try to find an existing rendering batch that can support
|
|
|
|
it. If so, that pass will accumulate another draw item for the layer. Otherwise, a new pass will be
|
|
|
|
added.
|
|
|
|
|
|
|
|
When trying to find a matching pass for a layer, there is a tradeoff in CPU time versus the GPU
|
|
|
|
time saved by not issuing another draw commands. We generally care more about CPU time, so we do
|
|
|
|
not try too hard in matching items to an existing batch.
|
|
|
|
|
|
|
|
After all layers have been processed, there is a "prepare" step. This copies all accumulated draw
|
|
|
|
data and uploads it into vertex and constant buffers in the GPU.
|
|
|
|
|
|
|
|
Finally, we execute rendering commands. At the end of the frame, all batches and (most) constant
|
|
|
|
buffers are thrown away.
|
|
|
|
|
|
|
|
Shaders Overview
|
|
|
|
-------------------------------------
|
|
|
|
|
|
|
|
Advanced Layers currently has five layer-related shader pipelines:
|
|
|
|
|
|
|
|
- Textured (PaintedLayer, ImageLayer, CanvasLayer)
|
|
|
|
- ComponentAlpha (PaintedLayer with component-alpha)
|
|
|
|
- YCbCr (ImageLayer with YCbCr video)
|
|
|
|
- Color (ColorLayers)
|
|
|
|
- Blend (ContainerLayers with mix-blend modes)
|
|
|
|
|
|
|
|
There are also three special shader pipelines:
|
2017-07-11 10:05:30 +03:00
|
|
|
|
2017-06-24 00:23:12 +03:00
|
|
|
- MaskCombiner, which is used to combine mask layers into a single texture.
|
|
|
|
- Clear, which is used for fast region-based clears when not directly supported by the GPU.
|
|
|
|
- Diagnostic, which is used to display the diagnostic overlay texture.
|
|
|
|
|
2017-07-11 10:05:30 +03:00
|
|
|
The layer shaders follow a unified structure. Each pipeline has a vertex and pixel shader.
|
|
|
|
The vertex shader takes a layers ID, a z-buffer depth, a unit position in either a unit
|
|
|
|
square or unit triangle, and either rectangular or triangular geometry. Shaders can also
|
|
|
|
have ancillary data needed like texture coordinates or colors.
|
2017-06-24 00:23:12 +03:00
|
|
|
|
|
|
|
Most of the time, layers have simple rectangular clips with simple rectilinear transforms, and
|
2017-07-11 10:05:30 +03:00
|
|
|
pixel shaders do not need to perform masking or clipping. For these layers we use a fast-path
|
|
|
|
pipeline, using unit-quad shaders that are able to clip geometry so the pixel shader does not
|
|
|
|
have to. This type of pipeline does not support complex masks.
|
|
|
|
|
|
|
|
If a layer has a complex mask, a rotation or 3d transform, or a complex operation like blending,
|
|
|
|
then we use shaders capable of handling arbitrary geometry. Their input is a unit triangle,
|
|
|
|
and these shaders are generally more expensive.
|
2017-06-24 00:23:12 +03:00
|
|
|
|
|
|
|
All of the shader-specific data is modelled in ShaderDefinitionsMLGPU.h.
|
|
|
|
|
|
|
|
CPU Occlusion Culling
|
|
|
|
-------------------------------------
|
|
|
|
|
|
|
|
By default, Advanced Layers performs occlusion culling on the CPU. Since layers are visited
|
|
|
|
front-to-back, this is simply a matter of accumulating the visible region of opaque layers, and
|
|
|
|
subtracting it from the visible region of subsequent layers. There is a major difference
|
|
|
|
between this occlusion culling and PostProcessLayers of the old compositor: AL performs culling
|
|
|
|
after invalidation, not before. Completely valid layers will have an empty visible region.
|
|
|
|
|
|
|
|
Most layer types (with the exception of images) will intelligently split their draw calls into a
|
|
|
|
batch of individual rectangles, based on their visible region.
|
|
|
|
|
|
|
|
Z-Buffering and Occlusion
|
|
|
|
-------------------------------------
|
|
|
|
|
|
|
|
Advanced Layers also supports occlusion culling on the GPU, using a z-buffer. This is disabled by
|
|
|
|
default currently since it is significantly costly on integrated GPUs. When using the z-buffer, we
|
|
|
|
separate opaque layers into a separate list of passes. The render process then uses the following
|
|
|
|
steps:
|
|
|
|
|
|
|
|
1. The depth buffer is set to read-write.
|
|
|
|
2. Opaque batches are executed.,
|
|
|
|
3. The depth buffer is set to read-only.
|
|
|
|
4. Transparent batches are executed.
|
|
|
|
|
|
|
|
The problem we have observed is that the depth buffer increases writes to the GPU, and on
|
|
|
|
integrated GPUs this is expensive - we have seen draw call times increase by 20-30%, which is the
|
|
|
|
wrong direction we want to take on battery life. In particular on a full screen video, the call to
|
|
|
|
ClearDepthStencilView plus the actual depth buffer write of the video can double GPU time.
|
|
|
|
|
|
|
|
For now the depth-buffer is disabled until we can find a compelling case for it on non-integrated
|
|
|
|
hardware.
|
|
|
|
|
|
|
|
Clipping
|
|
|
|
-------------------------------------
|
|
|
|
|
|
|
|
Clipping is a bit tricky in Advanced Layers. We cannot use the hardware "scissor" feature, since the
|
|
|
|
clip can change from instance to instance within a batch. And if using the depth buffer, we
|
|
|
|
cannot write transparent pixels for the clipped area. As a result we always clip opaque draw rects
|
|
|
|
in the vertex shader (and sometimes even on the CPU, as is needed for sane texture coordiantes).
|
|
|
|
Only transparent items are clipped in the pixel shader. As a result, masked layers and layers with
|
|
|
|
non-rectangular transforms are always considered transparent, and use a more flexible clipping
|
|
|
|
pipeline.
|
|
|
|
|
|
|
|
Plane Splitting
|
|
|
|
---------------------
|
|
|
|
|
|
|
|
Plane splitting is when a 3D transform causes a layer to be split - for example, one transparent
|
|
|
|
layer may intersect another on a separate plane. When this happens, Gecko sorts layers using a BSP
|
|
|
|
tree and produces a list of triangles instead of draw rects.
|
|
|
|
|
|
|
|
These layers cannot use the "unit quad" shaders that support the fast clipping pipeline. Instead
|
|
|
|
they always use the full triangle-list shaders that support extended vertices and clipping.
|
|
|
|
|
|
|
|
This is the slowest path we can take when building a draw call, since we must interact with the
|
|
|
|
polygon clipping and texturing code.
|
|
|
|
|
|
|
|
Masks
|
|
|
|
---------
|
|
|
|
|
|
|
|
For each layer with a mask attached, Advanced Layers builds a `MaskOperation`. These operations
|
|
|
|
must resolve to a single mask texture, as well as a rectangular area to which the mask applies. All
|
|
|
|
batched pixel shaders will automatically clip pixels to the mask if a mask texture is bound. (Note
|
|
|
|
that we must use separate batches if the mask texture changes.)
|
|
|
|
|
|
|
|
Some layers have multiple mask textures. In this case, the MaskOperation will store the list of
|
|
|
|
masks, and right before rendering, it will invoke a shader to combine these masks into a single texture.
|
|
|
|
|
|
|
|
MaskOperations are shared across layers when possible, but are not cached across frames.
|
|
|
|
|
|
|
|
BigImage Support
|
|
|
|
--------------------------
|
|
|
|
|
|
|
|
ImageLayers and CanvasLayers can be tiled with many individual textures. This happens in rare cases
|
|
|
|
where the underlying buffer is too big for the GPU. Early on this caused problems for Advanced
|
|
|
|
Layers, since AL required one texture per layer. We implemented BigImage support by creating
|
|
|
|
temporary ImageLayers for each visible tile, and throwing those layers away at the end of the
|
|
|
|
frame.
|
|
|
|
|
|
|
|
Advanced Layers no longer has a 1:1 layer:texture restriction, but we retain the temporary layer
|
|
|
|
solution anyway. It is not much code and it means we do not have to split `TexturedLayerMLGPU`
|
|
|
|
methods into iterated and non-iterated versions.
|
|
|
|
|
|
|
|
Texture Locking
|
|
|
|
----------------------
|
|
|
|
|
|
|
|
Advanced Layers has a different texture locking scheme than the existing compositor. If a texture
|
|
|
|
needs to be locked, then it is locked by the MLGDevice automatically when bound to the current
|
|
|
|
pipeline. The MLGDevice keeps a set of the locked textures to avoid double-locking. At the end of
|
|
|
|
the frame, any textures in the locked set are unlocked.
|
|
|
|
|
|
|
|
We cannot easily replicate the locking scheme in the old compositor, since the duration of using
|
|
|
|
the texture is not scoped to when we visit the layer.
|
|
|
|
|
|
|
|
Buffer Measurements
|
|
|
|
-------------------------------
|
|
|
|
|
|
|
|
Advanced Layers uses constant buffers to send layer information and extended instance data to the
|
|
|
|
GPU. We do this by pre-allocating large constant buffers and mapping them with `MAP_DISCARD` at the
|
|
|
|
beginning of the frame. Batches may allocate into this up to the maximum bindable constant buffer
|
|
|
|
size of the device (currently, 64KB).
|
|
|
|
|
|
|
|
There are some downsides to this approach. Constant buffers are difficult to work with - they have
|
|
|
|
specific alignment requirements, and care must be taken not too run over the maximum number of
|
|
|
|
constants in a buffer. Another approach would be to store constants in a 2D texture and use vertex
|
|
|
|
shader texture fetches. Advanced Layers implemented this and benchmarked it to decide which
|
|
|
|
approach to use. Textures seemed to skew better on GPU performance, but worse on CPU, but this
|
|
|
|
varied depending on the GPU. Overall constant buffers performed best and most consistently, so we
|
|
|
|
have kept them.
|
|
|
|
|
|
|
|
Additionally, we tested different ways of performing buffer uploads. Buffer creation itself is
|
|
|
|
costly, especially on integrated GPUs, and especially so for immutable, immediate-upload buffers.
|
|
|
|
As a result we aggressively cache buffer objects and always allocate them as MAP_DISCARD unless
|
|
|
|
they are write-once and long-lived.
|
|
|
|
|
|
|
|
Buffer Types
|
|
|
|
------------
|
|
|
|
|
|
|
|
Advanced Layers has a few different classes to help build and upload buffers to the GPU. They are:
|
|
|
|
|
|
|
|
- `MLGBuffer`. This is the low-level shader resource that `MLGDevice` exposes. It is the building
|
|
|
|
block for buffer helper classes, but it can also be used to make one-off, immutable,
|
|
|
|
immediate-upload buffers. MLGBuffers, being a GPU resource, are reference counted.
|
|
|
|
- `SharedBufferMLGPU`. These are large, pre-allocated buffers that are read-only on the GPU and
|
|
|
|
write-only on the CPU. They usually exceed the maximum bindable buffer size. There are three
|
|
|
|
shared buffers created by default and they are automatically unmapped as needed: one for vertices,
|
|
|
|
one for vertex shader constants, and one for pixel shader constants. When callers allocate into a
|
|
|
|
shared buffer they get back a mapped pointer, a GPU resource, and an offset. When the underlying
|
|
|
|
device supports offsetable buffers (like `ID3D11DeviceContext1` does), this results in better GPU
|
|
|
|
utilization, as there are less resources and fewer upload commands.
|
|
|
|
- `ConstantBufferSection` and `VertexBufferSection`. These are "views" into a `SharedBufferMLGPU`.
|
|
|
|
They contain the underlying `MLGBuffer`, and when offsetting is supported, the offset
|
|
|
|
information necessary for resource binding. Sections are not reference counted.
|
|
|
|
- `StagingBuffer`. A dynamically sized CPU buffer where items can be appended in a free-form
|
|
|
|
manner. The stride of a single "item" is computed by the first item written, and successive
|
|
|
|
items must have the same stride. The buffer must be uploaded to the GPU manually. Staging buffers
|
|
|
|
are appropriate for creating general constant or vertex buffer data. They can also write items in
|
|
|
|
reverse, which is how we render back-to-front when layers are visited front-to-back. They can be
|
|
|
|
uploaded to a `SharedBufferMLGPU` or an immutabler `MLGBuffer` very easily. Staging buffers are not
|
|
|
|
reference counted.
|
|
|
|
|
|
|
|
Unsupported Features
|
|
|
|
--------------------------------
|
|
|
|
|
|
|
|
Currently, these features of the old compositor are not yet implemented.
|
|
|
|
|
|
|
|
- OpenGL and software support (currently AL only works on D3D11).
|
|
|
|
- APZ displayport overlay.
|
|
|
|
- Diagnostic/developer overlays other than the FPS/timing overlay.
|
|
|
|
- DEAA. It was never ported to the D3D11 compositor, but we would like it.
|
|
|
|
- Component alpha when used inside an opaque intermediate surface.
|
|
|
|
- Effects prefs. Possibly not needed post-B2G removal.
|
|
|
|
- Widget overlays and underlays used by macOS and Android.
|
|
|
|
- DefaultClearColor. This is Android specific, but is easy to added when needed.
|
|
|
|
- Frame uniformity info in the profiler. Possibly not needed post-B2G removal.
|
|
|
|
- LayerScope. There are no plans to make this work.
|
|
|
|
|
|
|
|
Future Work
|
|
|
|
--------------------------------
|
|
|
|
|
|
|
|
- Refactor for D3D12/Vulkan support (namely, split MLGDevice into something less stateful and something else more low-level).
|
|
|
|
- Remove "MLG" moniker and namespace everything.
|
|
|
|
- Other backends (D3D12/Vulkan, OpenGL, Software)
|
|
|
|
- Delete CompositorD3D11
|
|
|
|
- Add DEAA support
|
|
|
|
- Re-enable the depth buffer by default for fast GPUs
|
|
|
|
- Re-enable right-sizing of inaccurately sized containers
|
|
|
|
- Drop constant buffers for ancillary vertex data
|
|
|
|
- Fast shader paths for simple video/painted layer cases
|
|
|
|
|
|
|
|
History
|
|
|
|
----------
|
|
|
|
|
|
|
|
Advanced Layers has gone through four major design iterations. The initial version used tiling -
|
|
|
|
each render view divided the screen into 128x128 tiles, and layers were assigned to tiles based on
|
|
|
|
their screen-space draw area. This approach proved not to scale well to 3d transforms, and so
|
|
|
|
tiling was eliminated.
|
|
|
|
|
|
|
|
We replaced it with a simple system of accumulating draw regions to each batch, thus ensuring that
|
|
|
|
items could be assigned to batches while maintaining correct z-ordering. This second iteration also
|
|
|
|
coincided with plane-splitting support.
|
|
|
|
|
|
|
|
On large layer trees, accumulating the affected regions of batches proved to be quite expensive.
|
|
|
|
This led to a third iteration, using depth buffers and separate opaque and transparent batch lists
|
|
|
|
to achieve z-ordering and occlusion culling.
|
|
|
|
|
|
|
|
Finally, depth buffers proved to be too expensive, and we introduced a simple CPU-based occlusion
|
|
|
|
culling pass. This iteration coincided with using more precise draw rects and splitting pipelines
|
|
|
|
into unit-quad, cpu-clipped and triangle-list, gpu-clipped variants.
|
|
|
|
|