Advanced Layers
===============

Note: Advanced Layers has been deprecated in favor of WebRender.

Advanced Layers is a new method of compositing layers in Gecko. This
document serves as a technical overview and provides a short
walk-through of its source code.

Overview
--------

Advanced Layers attempts to group as many GPU operations as it can into
a single draw call. This is a common technique in GPU-based rendering
called “batching”. It is not always trivial, as a batching algorithm can
easily waste precious CPU resources trying to build optimal draw calls.

Advanced Layers reuses the existing Gecko layers system as much as
possible. Huge layer trees do not currently scale well (see the future
work section), so opportunities for batching are currently limited
without expending unnecessary resources elsewhere. However, Advanced
Layers has a few benefits:

-  It submits smaller GPU workloads and buffer uploads than the existing
   compositor.
-  It needs only a single pass over the layer tree.
-  It uses occlusion information more intelligently.
-  It is easier to add new specialized rendering paths and new layer
   types.
-  It separates compositing logic from device logic, unlike the existing
   compositor.
-  It is much faster at rendering 3D scenes or complex layer trees.
-  It has experimental code to use the z-buffer for occlusion culling.

Because of these benefits we hope that it provides a significant
improvement over the existing compositor.

Advanced Layers uses the acronyms “MLG” and “MLGPU” in many places. These
stand for “Mid-Level Graphics”, the idea being that it is optimized for
Direct3D 11-style rendering systems as opposed to Direct3D 12 or Vulkan.

LayerManagerMLGPU
-----------------

Advanced Layers does not change client-side rendering at all. Content
still uses Direct2D (when possible), and creates identical layer trees
as it would with a normal Direct3D 11 compositor. In fact, Advanced
Layers re-uses all of the existing texture handling and video
infrastructure as well, replacing only the composite-side layer types.

Advanced Layers does not create a ``LayerManagerComposite`` - instead,
it creates a ``LayerManagerMLGPU``. This layer manager does not have a
``Compositor`` - instead, it has an ``MLGDevice``, which roughly
abstracts the Direct3D 11 API. (The hope is that this API is easily
interchangeable for something else when cross-platform or software
support is needed.)

``LayerManagerMLGPU`` also dispenses with the old “composite” layers for
new layer types. For example, ``ColorLayerComposite`` becomes
``ColorLayerMLGPU``. Since these layer types implement ``HostLayer``,
they integrate with ``LayerTransactionParent`` as normal composite
layers would.

Rendering Overview
------------------

The steps for rendering are described in more detail below, but roughly
the process is:

1. Sort layers front-to-back.
2. Create a dependency tree of render targets (called “views”).
3. Accumulate draw calls for all layers in each view.
4. Upload draw call buffers to the GPU.
5. Execute draw commands for each view.

Advanced Layers divides the layer tree into “views”
(``RenderViewMLGPU``), which correspond to a render target. The root
layer is represented by a view corresponding to the screen. Layers that
require intermediate surfaces have temporary views. Layers are analyzed
front-to-back, and rendered back-to-front within a view. Views
themselves are rendered front-to-back, to minimize render target
switching.

Each view contains one or more rendering passes (``RenderPassMLGPU``). A
pass represents a single draw command with one or more rendering items
attached to it. For example, a ``SolidColorPass`` item contains a
rectangle and an RGBA value, and many of these can be drawn with a
single GPU call.

When considering a layer, views will first try to find an existing
rendering pass that can support it. If one exists, that pass will
accumulate another draw item for the layer. Otherwise, a new pass will
be added.

When trying to find a matching pass for a layer, there is a tradeoff
between the CPU time spent searching and the GPU time saved by not
issuing another draw command. We generally care more about CPU time, so
we do not try too hard to match items to an existing batch.
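
As a rough illustration of this tradeoff, the C++ sketch below checks only
the most recently added pass before starting a new one, which keeps the
search cheap and trivially preserves draw order. The types ``Item``,
``Pass`` and ``View`` and the matching rule are hypothetical, simplified
stand-ins, not the real ``RenderPassMLGPU`` logic.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins; not the real Gecko classes.
enum class PassType { SolidColor, Textured };

struct Item {
  PassType type;
  int textureId;  // ignored for SolidColor items
};

struct Pass {
  PassType type;
  int textureId;
  std::vector<Item> items;

  // Assumption: an item fits a pass when it needs the same pipeline state,
  // modelled here as the same pass type and the same bound texture.
  bool Accepts(const Item& aItem) const {
    if (type != aItem.type) {
      return false;
    }
    return type == PassType::SolidColor || textureId == aItem.textureId;
  }
};

struct View {
  std::vector<Pass> passes;

  // Only the last pass is considered: a constant-time check on the CPU,
  // at the price of sometimes issuing more draw calls than strictly needed.
  void AddItem(const Item& aItem) {
    if (!passes.empty() && passes.back().Accepts(aItem)) {
      passes.back().items.push_back(aItem);
      return;
    }
    Pass pass{aItem.type, aItem.textureId, {}};
    pass.items.push_back(aItem);
    passes.push_back(pass);
  }
};
```

With this rule, two consecutive solid-color items share one pass, while
two textured items with different textures end up in separate passes.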

After all layers have been processed, there is a “prepare” step. This
copies all accumulated draw data and uploads it into vertex and constant
buffers in the GPU.

Finally, we execute rendering commands. At the end of the frame, all
batches and (most) constant buffers are thrown away.

Shaders Overview
----------------

Advanced Layers currently has five layer-related shader pipelines:

-  Textured (PaintedLayer, ImageLayer, CanvasLayer)
-  ComponentAlpha (PaintedLayer with component-alpha)
-  YCbCr (ImageLayer with YCbCr video)
-  Color (ColorLayers)
-  Blend (ContainerLayers with mix-blend modes)

There are also three special shader pipelines:

-  MaskCombiner, which is used to combine mask layers into a single
   texture.
-  Clear, which is used for fast region-based clears when not directly
   supported by the GPU.
-  Diagnostic, which is used to display the diagnostic overlay texture.

The layer shaders follow a unified structure. Each pipeline has a vertex
and pixel shader. The vertex shader takes a layers ID, a z-buffer depth,
a unit position in either a unit square or unit triangle, and either
rectangular or triangular geometry. Shaders can also take ancillary
data, such as texture coordinates or colors.

Most of the time, layers have simple rectangular clips with simple
rectilinear transforms, and pixel shaders do not need to perform masking
or clipping. For these layers we use a fast-path pipeline, using
unit-quad shaders that are able to clip geometry so the pixel shader
does not have to. This type of pipeline does not support complex masks.

If a layer has a complex mask, a rotation or 3D transform, or a complex
operation like blending, then we use shaders capable of handling
arbitrary geometry. Their input is a unit triangle, and these shaders
are generally more expensive.
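
The choice between the two pipeline families can be summarized with a
small C++ sketch. ``LayerTraits`` and ``ChoosePipeline`` are hypothetical
names used only for illustration; the real decision is spread across the
layer and pass classes and may consider more conditions.

```cpp
#include <cassert>

// Hypothetical names, used only to illustrate the rule described above.
enum class Pipeline { UnitQuad, TriangleList };

struct LayerTraits {
  bool hasComplexMask;
  bool hasRectilinearTransform;  // false for rotations and 3D transforms
  bool hasBlendMode;
};

Pipeline ChoosePipeline(const LayerTraits& aLayer) {
  // Anything the vertex-clipped fast path cannot express falls back to the
  // more expensive shaders that handle arbitrary geometry.
  if (aLayer.hasComplexMask || !aLayer.hasRectilinearTransform ||
      aLayer.hasBlendMode) {
    return Pipeline::TriangleList;
  }
  return Pipeline::UnitQuad;
}
```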

All of the shader-specific data is modelled in ``ShaderDefinitionsMLGPU.h``.

CPU Occlusion Culling
---------------------

By default, Advanced Layers performs occlusion culling on the CPU. Since
layers are visited front-to-back, this is simply a matter of
accumulating the visible region of opaque layers, and subtracting it
from the visible region of subsequent layers. There is a major
difference between this occlusion culling and PostProcessLayers of the
old compositor: AL performs culling after invalidation, not before.
Completely valid layers will have an empty visible region.

Most layer types (with the exception of images) will intelligently split
their draw calls into a batch of individual rectangles, based on their
visible region.
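
The accumulate-and-subtract idea can be sketched in C++ as follows. This
is a minimal illustration that represents a region as a list of
rectangles and ignores invalidation entirely; ``Rect``, ``Layer``,
``Subtract`` and ``CullOccluded`` are hypothetical names, and Gecko's own
region classes are considerably more capable.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Rect {
  int x0, y0, x1, y1;  // half-open: [x0, x1) x [y0, y1)
  bool IsEmpty() const { return x0 >= x1 || y0 >= y1; }
};

// Returns the parts of aRect not covered by aHole, as up to four rectangles.
std::vector<Rect> Subtract(const Rect& aRect, const Rect& aHole) {
  if (aRect.IsEmpty()) {
    return {};
  }
  Rect i{std::max(aRect.x0, aHole.x0), std::max(aRect.y0, aHole.y0),
         std::min(aRect.x1, aHole.x1), std::min(aRect.y1, aHole.y1)};
  if (i.IsEmpty()) {
    return {aRect};  // no overlap, nothing is hidden
  }
  std::vector<Rect> out;
  auto push = [&out](Rect r) {
    if (!r.IsEmpty()) {
      out.push_back(r);
    }
  };
  push({aRect.x0, aRect.y0, aRect.x1, i.y0});  // strip above the overlap
  push({aRect.x0, i.y1, aRect.x1, aRect.y1});  // strip below the overlap
  push({aRect.x0, i.y0, i.x0, i.y1});          // strip to the left
  push({i.x1, i.y0, aRect.x1, i.y1});          // strip to the right
  return out;
}

struct Layer {
  Rect bounds;
  bool opaque;
  std::vector<Rect> visible;  // filled in by CullOccluded
};

// aLayers must be ordered front-to-back.
void CullOccluded(std::vector<Layer>& aLayers) {
  std::vector<Rect> occluders;  // opaque area accumulated so far
  for (Layer& layer : aLayers) {
    std::vector<Rect> visible{layer.bounds};
    for (const Rect& hole : occluders) {
      std::vector<Rect> next;
      for (const Rect& r : visible) {
        std::vector<Rect> parts = Subtract(r, hole);
        next.insert(next.end(), parts.begin(), parts.end());
      }
      visible.swap(next);
    }
    layer.visible = visible;
    if (layer.opaque) {
      occluders.insert(occluders.end(), visible.begin(), visible.end());
    }
  }
}
```

The rectangles left in ``visible`` are also a natural unit for splitting
a layer's draw call into individual rectangles.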

Z-Buffering and Occlusion
-------------------------

Advanced Layers also supports occlusion culling on the GPU, using a
z-buffer. This is disabled by default currently since it is
significantly costly on integrated GPUs. When using the z-buffer, we
separate opaque layers into a separate list of passes. The render
process then uses the following steps:

1. The depth buffer is set to read-write.
2. Opaque batches are executed.
3. The depth buffer is set to read-only.
4. Transparent batches are executed.

The problem we have observed is that the depth buffer increases writes
to the GPU, and on integrated GPUs this is expensive. We have seen draw
call times increase by 20-30%, which is the wrong direction for battery
life. In particular, on a full-screen video, the call to
``ClearDepthStencilView`` plus the actual depth buffer write of the video
can double GPU time.

For now the depth buffer is disabled until we can find a compelling case
for it on non-integrated hardware.

Clipping
--------

Clipping is a bit tricky in Advanced Layers. We cannot use the hardware
“scissor” feature, since the clip can change from instance to instance
within a batch. And if using the depth buffer, we cannot write
transparent pixels for the clipped area. As a result we always clip
opaque draw rects in the vertex shader (and sometimes even on the CPU,
as is needed for sane texture coordinates). Only transparent items are
clipped in the pixel shader. As a result, masked layers and layers with
non-rectangular transforms are always considered transparent, and use a
more flexible clipping pipeline.
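
To show why CPU clipping matters for texture coordinates, here is a
minimal C++ sketch that clips a draw rect and shrinks its texture
rectangle by the same proportion. ``RectF`` and ``ClipTexturedRect`` are
hypothetical names, and the sketch assumes an axis-aligned, unrotated
mapping between the draw rect and the texture rect.

```cpp
#include <algorithm>
#include <cassert>

struct RectF {
  float x, y, w, h;
};

// Clips aDraw to aClip and adjusts aUV so the remaining geometry still
// samples the same texels. Returns false if nothing is left to draw.
bool ClipTexturedRect(RectF& aDraw, RectF& aUV, const RectF& aClip) {
  float x0 = std::max(aDraw.x, aClip.x);
  float y0 = std::max(aDraw.y, aClip.y);
  float x1 = std::min(aDraw.x + aDraw.w, aClip.x + aClip.w);
  float y1 = std::min(aDraw.y + aDraw.h, aClip.y + aClip.h);
  if (x0 >= x1 || y0 >= y1) {
    return false;  // fully clipped (also covers zero-sized draw rects)
  }
  // Scale factors from draw space to texture space.
  float sx = aUV.w / aDraw.w;
  float sy = aUV.h / aDraw.h;
  aUV = {aUV.x + (x0 - aDraw.x) * sx, aUV.y + (y0 - aDraw.y) * sy,
         (x1 - x0) * sx, (y1 - y0) * sy};
  aDraw = {x0, y0, x1 - x0, y1 - y0};
  return true;
}
```

Clipping away the left half of a 128-pixel-wide rect that maps to the
full texture leaves texture coordinates starting at 0.5 rather than 0.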

Plane Splitting
---------------

Plane splitting is when a 3D transform causes a layer to be split - for
example, one transparent layer may intersect another on a separate
plane. When this happens, Gecko sorts layers using a BSP tree and
produces a list of triangles instead of draw rects.

These layers cannot use the “unit quad” shaders that support the fast
clipping pipeline. Instead they always use the full triangle-list
shaders that support extended vertices and clipping.

This is the slowest path we can take when building a draw call, since we
must interact with the polygon clipping and texturing code.

Masks
-----

For each layer with a mask attached, Advanced Layers builds a
``MaskOperation``. These operations must resolve to a single mask
texture, as well as a rectangular area to which the mask applies. All
batched pixel shaders will automatically clip pixels to the mask if a
mask texture is bound. (Note that we must use separate batches if the
mask texture changes.)

Some layers have multiple mask textures. In this case, the MaskOperation
will store the list of masks, and right before rendering, it will invoke
a shader to combine these masks into a single texture.

MaskOperations are shared across layers when possible, but are not
cached across frames.
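
One way to picture per-frame sharing is a small cache keyed by the list
of mask textures and cleared at the end of every frame. The C++ sketch
below is an assumption about how such sharing could look; ``TextureId``,
``FrameMaskCache`` and the simplified ``MaskOperation`` struct are
hypothetical and do not reflect the real class layout.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

using TextureId = int;  // hypothetical handle type

// Simplified stand-in for the real MaskOperation.
struct MaskOperation {
  std::vector<TextureId> textures;
  // More than one texture means a combine shader must run before rendering.
  bool NeedsCombine() const { return textures.size() > 1; }
};

class FrameMaskCache {
 public:
  // Layers with the same mask textures receive the same operation.
  std::shared_ptr<MaskOperation> GetOrCreate(
      const std::vector<TextureId>& aTextures) {
    std::shared_ptr<MaskOperation>& slot = mOps[aTextures];
    if (!slot) {
      slot = std::make_shared<MaskOperation>(MaskOperation{aTextures});
    }
    return slot;
  }

  // Nothing survives into the next frame.
  void EndFrame() { mOps.clear(); }

 private:
  std::map<std::vector<TextureId>, std::shared_ptr<MaskOperation>> mOps;
};
```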

BigImage Support
----------------

ImageLayers and CanvasLayers can be tiled with many individual textures.
This happens in rare cases where the underlying buffer is too big for
the GPU. Early on this caused problems for Advanced Layers, since AL
required one texture per layer. We implemented BigImage support by
creating temporary ImageLayers for each visible tile, and throwing those
layers away at the end of the frame.

Advanced Layers no longer has a 1:1 layer:texture restriction, but we
retain the temporary layer solution anyway. It is not much code and it
means we do not have to split ``TexturedLayerMLGPU`` methods into
iterated and non-iterated versions.
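
A minimal C++ sketch of the temporary-layer approach follows: each tile
that intersects the visible area yields one short-lived layer
description. ``IntRect``, ``Tile``, ``TempImageLayer`` and
``BuildTileLayers`` are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct IntRect {
  int x, y, w, h;
  bool IsEmpty() const { return w <= 0 || h <= 0; }
};

IntRect Intersect(const IntRect& a, const IntRect& b) {
  int x0 = std::max(a.x, b.x);
  int y0 = std::max(a.y, b.y);
  int x1 = std::min(a.x + a.w, b.x + b.w);
  int y1 = std::min(a.y + a.h, b.y + b.h);
  return IntRect{x0, y0, x1 - x0, y1 - y0};
}

struct Tile {
  IntRect bounds;
  int textureId;
};

// Hypothetical description of a temporary, single-texture layer.
struct TempImageLayer {
  int textureId;
  IntRect drawRect;
};

// Builds one temporary layer per visible tile. The caller is expected to
// discard the result at the end of the frame.
std::vector<TempImageLayer> BuildTileLayers(const std::vector<Tile>& aTiles,
                                            const IntRect& aVisible) {
  std::vector<TempImageLayer> layers;
  for (const Tile& tile : aTiles) {
    IntRect visible = Intersect(tile.bounds, aVisible);
    if (!visible.IsEmpty()) {
      layers.push_back(TempImageLayer{tile.textureId, visible});
    }
  }
  return layers;
}
```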

Texture Locking
---------------

Advanced Layers has a different texture locking scheme than the existing
compositor. If a texture needs to be locked, then it is locked by the
MLGDevice automatically when bound to the current pipeline. The
MLGDevice keeps a set of the locked textures to avoid double-locking. At
the end of the frame, any textures in the locked set are unlocked.

We cannot easily replicate the locking scheme in the old compositor,
since the duration of using the texture is not scoped to when we visit
the layer.
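
The lock-on-bind scheme can be sketched in C++ as below. ``Texture`` and
``Device`` are hypothetical stand-ins for the real texture source and
``MLGDevice`` classes; only the bookkeeping is modelled.

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical texture that counts outstanding locks.
struct Texture {
  bool needsLock = false;
  int lockCount = 0;
  void Lock() { ++lockCount; }
  void Unlock() { --lockCount; }
};

class Device {
 public:
  // Binding locks the texture the first time it is seen in a frame.
  void BindTexture(Texture* aTexture) {
    assert(aTexture);
    // insert() reports whether the texture was newly added, which is what
    // prevents double-locking.
    if (aTexture->needsLock && mLocked.insert(aTexture).second) {
      aTexture->Lock();
    }
  }

  // Everything locked during the frame is released together.
  void EndFrame() {
    for (Texture* texture : mLocked) {
      texture->Unlock();
    }
    mLocked.clear();
  }

 private:
  std::unordered_set<Texture*> mLocked;
};
```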

Buffer Measurements
-------------------

Advanced Layers uses constant buffers to send layer information and
extended instance data to the GPU. We do this by pre-allocating large
constant buffers and mapping them with ``MAP_DISCARD`` at the beginning
of the frame. Batches may allocate into this up to the maximum bindable
constant buffer size of the device (currently, 64KB).
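
The allocation pattern can be illustrated with a CPU-only bump allocator
in C++. This sketch makes two assumptions that are not stated above:
offsets are aligned to 256 bytes (the Direct3D 11.1 constant-buffer
offset granularity of 16 constants of 16 bytes each), and resetting the
cursor stands in for a ``MAP_DISCARD`` map. ``SharedBufferSketch`` and
``Allocation`` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kMaxBindableBytes = 64 * 1024;  // limit quoted in the text
constexpr size_t kOffsetAlignment = 256;         // assumption, see lead-in

struct Allocation {
  uint8_t* ptr;   // where the CPU writes the data
  size_t offset;  // where the GPU binding starts
  size_t size;    // rounded-up size of the binding
};

class SharedBufferSketch {
 public:
  explicit SharedBufferSketch(size_t aBytes) : mStorage(aBytes), mCursor(0) {}

  // Models mapping with discard semantics: old contents are abandoned.
  void BeginFrame() { mCursor = 0; }

  // Returns false if the request is empty, exceeds the bindable limit, or
  // does not fit in the remaining space.
  bool Alloc(size_t aBytes, Allocation* aOut) {
    assert(aOut);
    size_t size =
        (aBytes + kOffsetAlignment - 1) / kOffsetAlignment * kOffsetAlignment;
    if (aBytes == 0 || size > kMaxBindableBytes ||
        size > mStorage.size() - mCursor) {
      return false;
    }
    *aOut = Allocation{mStorage.data() + mCursor, mCursor, size};
    mCursor += size;
    return true;
  }

 private:
  std::vector<uint8_t> mStorage;
  size_t mCursor;
};
```

A caller that gets ``false`` back would need to fall back to another
buffer or split its batch; the sketch leaves that policy open.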

There are some downsides to this approach. Constant buffers are
difficult to work with: they have specific alignment requirements, and
care must be taken not to run over the maximum number of constants in a
buffer. Another approach would be to store constants in a 2D texture and
use vertex shader texture fetches. Advanced Layers implemented this and
benchmarked it to decide which approach to use. Textures tended to
perform better on the GPU but worse on the CPU, although this varied
depending on the GPU. Overall, constant buffers performed best and most
consistently, so we have kept them.

Additionally, we tested different ways of performing buffer uploads.
Buffer creation itself is costly, especially on integrated GPUs, and
especially so for immutable, immediate-upload buffers. As a result we
aggressively cache buffer objects and always allocate them as
MAP_DISCARD unless they are write-once and long-lived.

Buffer Types
------------

Advanced Layers has a few different classes to help build and upload
buffers to the GPU. They are:

-  ``MLGBuffer``. This is the low-level shader resource that
   ``MLGDevice`` exposes. It is the building block for buffer helper
   classes, but it can also be used to make one-off, immutable,
   immediate-upload buffers. MLGBuffers, being a GPU resource, are
   reference counted.
-  ``SharedBufferMLGPU``. These are large, pre-allocated buffers that
   are read-only on the GPU and write-only on the CPU. They usually
   exceed the maximum bindable buffer size. There are three shared
   buffers created by default and they are automatically unmapped as
   needed: one for vertices, one for vertex shader constants, and one
   for pixel shader constants. When callers allocate into a shared
   buffer they get back a mapped pointer, a GPU resource, and an offset.
   When the underlying device supports offsetable buffers (like
   ``ID3D11DeviceContext1`` does), this results in better GPU
   utilization, as there are fewer resources and fewer upload commands.
-  ``ConstantBufferSection`` and ``VertexBufferSection``. These are
   “views” into a ``SharedBufferMLGPU``. They contain the underlying
   ``MLGBuffer``, and when offsetting is supported, the offset
   information necessary for resource binding. Sections are not
   reference counted.
-  ``StagingBuffer``. A dynamically sized CPU buffer where items can be
   appended in a free-form manner. The stride of a single “item” is
   computed by the first item written, and successive items must have
   the same stride. The buffer must be uploaded to the GPU manually.
   Staging buffers are appropriate for creating general constant or
   vertex buffer data. They can also write items in reverse, which is
   how we render back-to-front when layers are visited front-to-back.
   They can be uploaded to a ``SharedBufferMLGPU`` or an immutable
   ``MLGBuffer`` very easily. Staging buffers are not reference counted.
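
The stride rule and reverse writing described for ``StagingBuffer`` can be
sketched in C++ as follows. ``StagingBufferSketch`` is a hypothetical,
simplified class: it inserts at the front of a ``std::vector`` for
clarity, which is linear per append and would not be how a
performance-sensitive implementation does it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

class StagingBufferSketch {
 public:
  explicit StagingBufferSketch(bool aReversed = false) : mReversed(aReversed) {}

  template <typename T>
  void Append(const T& aItem) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "items are copied as raw bytes");
    if (mStride == 0) {
      mStride = sizeof(T);  // the first item written fixes the stride
    }
    assert(sizeof(T) == mStride && "all items must share one stride");
    const uint8_t* bytes = reinterpret_cast<const uint8_t*>(&aItem);
    // In reverse mode each new item lands in front of the earlier ones, so
    // items appended front-to-back end up stored back-to-front.
    mBytes.insert(mReversed ? mBytes.begin() : mBytes.end(), bytes,
                  bytes + sizeof(T));
  }

  size_t NumItems() const { return mStride ? mBytes.size() / mStride : 0; }
  const std::vector<uint8_t>& Bytes() const { return mBytes; }

 private:
  bool mReversed;
  size_t mStride = 0;
  std::vector<uint8_t> mBytes;
};
```

Appending the values 1, 2, 3 to a reversed buffer leaves 3 as the first
stored item, which is the order needed to draw back-to-front.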

Unsupported Features
--------------------

The following features of the old compositor are not yet implemented:

-  OpenGL and software support (currently AL only works on D3D11).
-  APZ displayport overlay.
-  Diagnostic/developer overlays other than the FPS/timing overlay.
-  DEAA. It was never ported to the D3D11 compositor, but we would like
   it.
-  Component alpha when used inside an opaque intermediate surface.
-  Effects prefs. Possibly not needed post-B2G removal.
-  Widget overlays and underlays used by macOS and Android.
-  DefaultClearColor. This is Android specific, but is easy to add
   when needed.
-  Frame uniformity info in the profiler. Possibly not needed post-B2G
   removal.
-  LayerScope. There are no plans to make this work.

Future Work
-----------

-  Refactor for D3D12/Vulkan support (namely, split MLGDevice into
   something less stateful and something else more low-level).
-  Remove “MLG” moniker and namespace everything.
-  Other backends (D3D12/Vulkan, OpenGL, Software)
-  Delete CompositorD3D11
-  Add DEAA support
-  Re-enable the depth buffer by default for fast GPUs
-  Re-enable right-sizing of inaccurately sized containers
-  Drop constant buffers for ancillary vertex data
-  Fast shader paths for simple video/painted layer cases

History
-------

Advanced Layers has gone through four major design iterations. The
initial version used tiling - each render view divided the screen into
128x128 tiles, and layers were assigned to tiles based on their
screen-space draw area. This approach proved not to scale well to 3D
transforms, and so tiling was eliminated.

We replaced it with a simple system of accumulating draw regions to each
batch, thus ensuring that items could be assigned to batches while
maintaining correct z-ordering. This second iteration also coincided
with plane-splitting support.

On large layer trees, accumulating the affected regions of batches
proved to be quite expensive. This led to a third iteration, using depth
buffers and separate opaque and transparent batch lists to achieve
z-ordering and occlusion culling.

Finally, depth buffers proved to be too expensive, and we introduced a
simple CPU-based occlusion culling pass. This iteration coincided with
using more precise draw rects and splitting pipelines into unit-quad,
CPU-clipped and triangle-list, GPU-clipped variants.