* Simplify the C version of the warp filter to make the intent
of the code clearer
* Replace saturate_uint() in the C warp filter with an assertion
that the intermediate values are in-range. This is because they
should (provably) *never* go out-of-range.
* Add a comment describing the intended hardware architecture
* Miscellaneous comment updates
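
The saturate-to-assert change in the second item above can be sketched
as follows. This is only an illustration: saturate_uint_sketch() and
checked_uint_sketch() are hypothetical names, not the functions in the
tree, and the real bit widths are not taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the old helper: clamp v to [0, 2^bits - 1]. */
static int32_t saturate_uint_sketch(int32_t v, int bits) {
  const int32_t max = (int32_t)((1u << bits) - 1);
  if (v < 0) return 0;
  if (v > max) return max;
  return v;
}

/* Hypothetical replacement: state the range invariant and pass the
 * value through unchanged. In release builds (NDEBUG) this costs nothing. */
static int32_t checked_uint_sketch(int32_t v, int bits) {
  assert(bits > 0 && bits < 31);
  assert(v >= 0 && v <= (int32_t)((1u << bits) - 1));
  return v;
}
```

For any in-range input the two helpers return the same value, so the
swap only changes behaviour if the range proof is wrong, in which case
a debug build stops instead of silently clamping.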
Change-Id: I798736f923ece599f22d573d31c5dfccd18b2d0e
* Calculate sx4, sy4 by truncation instead of rounding
* Move some repeated calculations out of the filter loop
This is expected to have a roughly neutral effect on BDRATE.
The speedup of each filter (SSE2, lowbd SSSE3, highbd SSSE3) is
7-10%, for a total speedup of 14-18% when considered together
with patches f7a5ee5 and 14b8112.
Change-Id: I692f649202214c7ab53ecf81f81386f1503e2d20
Previously, the projected positions of chroma pixels would effectively
undergo double rounding, since we round both when calculating x4 / y4
and when calculating the filter index. Further, the two roundings
were different: x4 / y4 used ROUND_POWER_OF_TWO_SIGNED, whereas
the filter index uses ROUND_POWER_OF_TWO.
It is slightly more accurate (and faster) to replace the first
rounding by a shift; this is motivated by the fact that:
  ROUND_POWER_OF_TWO(x >> a, b) == ROUND_POWER_OF_TWO(x, a + b)
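
The identity above, and the double-rounding error it avoids, can be
checked with a small sketch. The macro definitions below are assumed
(modelled on the names used in this message, not copied from the
tree), the helper function names are hypothetical, and the shift
amounts passed in are arbitrary. Only non-negative x is tested, since
right-shifting a negative value is implementation-defined in C, and
the identity is only expected to hold for b >= 1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definitions of the two rounding macros. */
#define ROUND_POWER_OF_TWO(value, n) (((value) + (((1 << (n)) >> 1))) >> (n))
#define ROUND_POWER_OF_TWO_SIGNED(value, n)           \
  (((value) < 0) ? -ROUND_POWER_OF_TWO(-(value), (n)) \
                 : ROUND_POWER_OF_TWO((value), (n)))

/* Returns 1 if shift-then-round equals a single rounding by a + b bits
 * for every x in [0, limit). */
static int shift_then_round_matches(int32_t limit, int a, int b) {
  for (int32_t x = 0; x < limit; ++x) {
    if (ROUND_POWER_OF_TWO(x >> a, b) != ROUND_POWER_OF_TWO(x, a + b))
      return 0;
  }
  return 1;
}

/* Returns 1 if round-then-round (the old double rounding) equals a
 * single rounding by a + b bits for every x in [0, limit). */
static int round_then_round_matches(int32_t limit, int a, int b) {
  for (int32_t x = 0; x < limit; ++x) {
    const int32_t first = ROUND_POWER_OF_TWO_SIGNED(x, a);
    if (ROUND_POWER_OF_TWO(first, b) != ROUND_POWER_OF_TWO(x, a + b))
      return 0;
  }
  return 1;
}
```

With a = b = 1 and x = 1, double rounding gives 1 while a single
rounding by 2 bits gives 0, so round_then_round_matches() fails there
whereas shift_then_round_matches() holds.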
Change-Id: Ia52b05745168d0aeb05f0af4c75ff33eee791d82
This fixes a mismatch which occurs when global/warped motion and
a masked compound type are used together.
Change-Id: I08b2702cdb3b85f8d8817b9286a73951c97cf379
The SSSE3 filter is very similar to the SSE2 filter, but
the horizontal pass is sped up by using the 8-bit x 8-bit -> 16-bit
multiplies added in SSSE3.
Also apply const-correctness to all versions of the filter.
The timings of the existing filters are unchanged, and the
lowbd SSSE3 filter is ~17% faster than the lowbd SSE2 filter.
Timings per 8x8 block:
lowbd SSE2: 320ns
lowbd SSSE3: 273ns
highbd SSSE3: 300ns
Filter output is unchanged.
Change-Id: Ifb428a33b106d900cde1b080794796c0754ae182
Patch https://aomedia-review.googlesource.com/c/10901/ temporarily
disabled the SSE2 warp filter for 4x4 blocks, because of a
data race when the filter was used at the right-hand edge of a
tile in a multithreaded encode.
This patch fixes the data race and re-enables the SSE2 warp filter.
Change-Id: I7058c897ddf538cd10001c5be13b1a1bfe8320fd
This reverts commit 266db85d4a.
Reason for revert: Reverting to prevent software slowdown. Will be implemented differently in a separate patch.
Change-Id: I386a9661c87d69e22761e5c01507f2f1f968433f
When predicting a 4x4 warp block (either using ZEROMV with
global-motion, or the WARPED_CAUSAL motion mode with
warped-motion), the warp filter would previously write
4 bytes to the right of the block.
This caused encode/decode mismatches when encoding with
multiple threads and tile_cols > 1, since in that case
we could end up overwriting already-generated pixels from
the next tile across.
This patch changes the filter so that we only overwrite the
intended pixels.
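
A minimal sketch of the idea, assuming the overrun came from a filter
that always produces 8 results per row and stored all of them (this
detail is an assumption, not stated above). store_row_sketch() is a
hypothetical helper, not a function in the tree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The filter is assumed to compute 8 results per row; only the first
 * block_width of them belong to the block being predicted, so only
 * those are written to dst. */
static void store_row_sketch(uint8_t *dst, const uint8_t results[8],
                             int block_width) {
  assert(block_width == 4 || block_width == 8);
  memcpy(dst, results, (size_t)block_width);
}
```

For a 4-wide block, bytes 4..7 of the destination row are left
untouched, so pixels already written by a neighbouring tile survive.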
Change-Id: I3664b44e872e85aa5ccc0a5781f0f9ad994a5b80
Improve the speed of the warp filter itself by ~30%. This leads
to an overall decoder speedup of 5-20%, depending on bitrate,
for the global-motion experiment, and a small speedup for
warped-motion.
Also applies a very minor change to the rounding during filter
selection (ROUND_POWER_OF_TWO makes slightly more sense here
than ROUND_POWER_OF_TWO_SIGNED, and is faster).
Change-Id: I3f364221d1ec35a8aac0d2c8b0e427f527d12e43
End-to-end speed improvements (measured on tempete_cif.y4m,
20 frames for encoder and all 260 frames for decoder):
* GLOBAL_MOTION encoder: ~10% faster
* GLOBAL_MOTION decoder: 100-200% faster depending on bitrate
* WARPED_MOTION encoder: ~2.5% faster
* WARPED_MOTION decoder: ~20-40% faster depending on bitrate
The improvement in the GLOBAL_MOTION decoder is particularly
large because its runtime is dominated by calls to warp_plane().
This introduces minor changes to the output of the warp filter,
but these should be rare.
Change-Id: I5813ab9e90311e27587045153c32d400b6b9eb92