171 строка
8.6 KiB
ReStructuredText
171 строка
8.6 KiB
ReStructuredText
|
=====================================
|
||
|
Performance Tips for Frontend Authors
|
||
|
=====================================
|
||
|
|
||
|
.. contents::
|
||
|
:local:
|
||
|
:depth: 2
|
||
|
|
||
|
Abstract
|
||
|
========
|
||
|
|
||
|
The intended audience of this document is developers of language frontends
|
||
|
targeting LLVM IR. This document is home to a collection of tips on how to
|
||
|
generate IR that optimizes well. As with any optimizer, LLVM has its strengths
|
||
|
and weaknesses. In some cases, surprisingly small changes in the source IR
|
||
|
can have a large effect on the generated code.
|
||
|
|
||
|
Avoid loads and stores of large aggregate type
|
||
|
================================================
|
||
|
|
||
|
LLVM currently does not optimize well loads and stores of large :ref:`aggregate
|
||
|
types <t_aggregate>` (i.e. structs and arrays). As an alternative, consider
|
||
|
loading individual fields from memory.
|
||
|
|
||
|
Aggregates that are smaller than the largest (performant) load or store
|
||
|
instruction supported by the targeted hardware are well supported. These can
|
||
|
be an effective way to represent collections of small packed fields.
|
||
|
|
||
|
Prefer zext over sext when legal
|
||
|
==================================
|
||
|
|
||
|
On some architectures (X86_64 is one), sign extension can involve an extra
|
||
|
instruction whereas zero extension can be folded into a load. LLVM will try to
|
||
|
replace a sext with a zext when it can be proven safe, but if you have
|
||
|
information in your source language about the range of a integer value, it can
|
||
|
be profitable to use a zext rather than a sext.
|
||
|
|
||
|
Alternatively, you can :ref:`specify the range of the value using metadata
|
||
|
<range-metadata>` and LLVM can do the sext to zext conversion for you.
|
||
|
|
||
|
Zext GEP indices to machine register width
|
||
|
============================================
|
||
|
|
||
|
Internally, LLVM often promotes the width of GEP indices to machine register
|
||
|
width. When it does so, it will default to using sign extension (sext)
|
||
|
operations for safety. If your source language provides information about
|
||
|
the range of the index, you may wish to manually extend indices to machine
|
||
|
register width using a zext instruction.
|
||
|
|
||
|
Other things to consider
|
||
|
=========================
|
||
|
|
||
|
#. Make sure that a DataLayout is provided (this will likely become required in
|
||
|
the near future, but is certainly important for optimization).
|
||
|
|
||
|
#. Add nsw/nuw flags as appropriate. Reasoning about overflow is
|
||
|
generally hard for an optimizer so providing these facts from the frontend
|
||
|
can be very impactful.
|
||
|
|
||
|
#. Use fast-math flags on floating point operations if legal. If you don't
|
||
|
need strict IEEE floating point semantics, there are a number of additional
|
||
|
optimizations that can be performed. This can be highly impactful for
|
||
|
floating point intensive computations.
|
||
|
|
||
|
#. Use inbounds on geps. This can help to disambiguate some aliasing queries.
|
||
|
|
||
|
#. Add noalias/align/dereferenceable/nonnull to function arguments and return
|
||
|
values as appropriate
|
||
|
|
||
|
#. Mark functions as readnone/readonly or noreturn/nounwind when known. The
|
||
|
optimizer will try to infer these flags, but may not always be able to.
|
||
|
Manual annotations are particularly important for external functions that
|
||
|
the optimizer can not analyze.
|
||
|
|
||
|
#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
|
||
|
analysis), prefer GEPs
|
||
|
|
||
|
#. Use the lifetime.start/lifetime.end and invariant.start/invariant.end
|
||
|
intrinsics where possible. Common profitable uses are for stack like data
|
||
|
structures (thus allowing dead store elimination) and for describing
|
||
|
life times of allocas (thus allowing smaller stack sizes).
|
||
|
|
||
|
#. Use pointer aliasing metadata, especially tbaa metadata, to communicate
|
||
|
otherwise-non-deducible pointer aliasing facts
|
||
|
|
||
|
#. Use the "most-private" possible linkage types for the functions being defined
|
||
|
(private, internal or linkonce_odr preferably)
|
||
|
|
||
|
#. Mark invariant locations using !invariant.load and TBAA's constant flags
|
||
|
|
||
|
#. Prefer globals over inttoptr of a constant address - this gives you
|
||
|
dereferencability information. In MCJIT, use getSymbolAddress to provide
|
||
|
actual address.
|
||
|
|
||
|
#. Be wary of ordered and atomic memory operations. They are hard to optimize
|
||
|
and may not be well optimized by the current optimizer. Depending on your
|
||
|
source language, you may consider using fences instead.
|
||
|
|
||
|
#. If calling a function which is known to throw an exception (unwind), use
|
||
|
an invoke with a normal destination which contains an unreachable
|
||
|
instruction. This form conveys to the optimizer that the call returns
|
||
|
abnormally. For an invoke which neither returns normally or requires unwind
|
||
|
code in the current function, you can use a noreturn call instruction if
|
||
|
desired. This is generally not required because the optimizer will convert
|
||
|
an invoke with an unreachable unwind destination to a call instruction.
|
||
|
|
||
|
#. If you language uses range checks, consider using the IRCE pass. It is not
|
||
|
currently part of the standard pass order.
|
||
|
|
||
|
#. For languages with numerous rarely executed guard conditions (e.g. null
|
||
|
checks, type checks, range checks) consider adding an extra execution or
|
||
|
two of LoopUnswith and LICM to your pass order. The standard pass order,
|
||
|
which is tuned for C and C++ applications, may not be sufficient to remove
|
||
|
all dischargeable checks from loops.
|
||
|
|
||
|
#. Use profile metadata to indicate statically known cold paths, even if
|
||
|
dynamic profiling information is not available. This can make a large
|
||
|
difference in code placement and thus the performance of tight loops.
|
||
|
|
||
|
#. When generating code for loops, try to avoid terminating the header block of
|
||
|
the loop earlier than necessary. If the terminator of the loop header
|
||
|
block is a loop exiting conditional branch, the effectiveness of LICM will
|
||
|
be limited for loads not in the header. (This is due to the fact that LLVM
|
||
|
may not know such a load is safe to speculatively execute and thus can't
|
||
|
lift an otherwise loop invariant load unless it can prove the exiting
|
||
|
condition is not taken.) It can be profitable, in some cases, to emit such
|
||
|
instructions into the header even if they are not used along a rarely
|
||
|
executed path that exits the loop. This guidance specifically does not
|
||
|
apply if the condition which terminates the loop header is itself invariant,
|
||
|
or can be easily discharged by inspecting the loop index variables.
|
||
|
|
||
|
#. In hot loops, consider duplicating instructions from small basic blocks
|
||
|
which end in highly predictable terminators into their successor blocks.
|
||
|
If a hot successor block contains instructions which can be vectorized
|
||
|
with the duplicated ones, this can provide a noticeable throughput
|
||
|
improvement. Note that this is not always profitable and does involve a
|
||
|
potentially large increase in code size.
|
||
|
|
||
|
#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
|
||
|
of predecessors). Among other issues, the register allocator is known to
|
||
|
perform badly with confronted with such structures. The only exception to
|
||
|
this guidance is that a unified return block with high in-degree is fine.
|
||
|
|
||
|
#. When checking a value against a constant, emit the check using a consistent
|
||
|
comparison type. The GVN pass *will* optimize redundant equalities even if
|
||
|
the type of comparison is inverted, but GVN only runs late in the pipeline.
|
||
|
As a result, you may miss the opportunity to run other important
|
||
|
optimizations. Improvements to EarlyCSE to remove this issue are tracked in
|
||
|
Bug 23333.
|
||
|
|
||
|
#. Avoid using arithmetic intrinsics unless you are *required* by your source
|
||
|
language specification to emit a particular code sequence. The optimizer
|
||
|
is quite good at reasoning about general control flow and arithmetic, it is
|
||
|
not anywhere near as strong at reasoning about the various intrinsics. If
|
||
|
profitable for code generation purposes, the optimizer will likely form the
|
||
|
intrinsics itself late in the optimization pipeline. It is *very* rarely
|
||
|
profitable to emit these directly in the language frontend. This item
|
||
|
explicitly includes the use of the :ref:`overflow intrinsics <int_overflow>`.
|
||
|
|
||
|
#. Avoid using the :ref:`assume intrinsic <int_assume>` until you've
|
||
|
established that a) there's no other way to express the given fact and b)
|
||
|
that fact is critical for optimization purposes. Assumes are a great
|
||
|
prototyping mechanism, but they can have negative effects on both compile
|
||
|
time and optimization effectiveness. The former is fixable with enough
|
||
|
effort, but the later is fairly fundamental to their designed purpose.
|
||
|
|
||
|
p.s. If you want to help improve this document, patches expanding any of the
|
||
|
above items into standalone sections of their own with a more complete
|
||
|
discussion would be very welcome.
|
||
|
|