We are going to have several commits to set up the new low/high
bitdepth data path selection logic. This patch is for the inverse
transform. The ideas are summarized as follows.
- For low/high bitdepth selection, the encoder depends on the
input configuration, e.g., video sequence bitdepth and
profile. The decoder depends on the input bitstream. This has
nothing to do with compiler/build configuration.
- Typical encoder usage for sampling format 4:2:0.
1) 8-bit video sequence:
a) --profile=0
Fastest encoding/decoding pipeline.
b) --profile=2 --bit-depth=10
Image pixels are left shifted by 2 bits. It
employs 16-bit reference frame buffer and has high
calculation precision. It usually enjoys higher
compression performance.
2) 10/12-bit video sequence (HDR):
--profile=2 --bit-depth=10/12
- Transform coefficient type:
Lowbitdepth: int16_t
Highbitdepth: int32_t
- The type tran_low_t is still used in the codebase. It is
int32_t and defines the data path capacity, so it is
naturally high bitdepth.
Eventually we shall remove the configuration flags,
CONFIG_HIGHBITDEPTH/CONFIG_LOWBITDEPTH, and separate the
low and high bitdepth data paths. The two data paths will co-exist
in the same build environment.
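
A minimal sketch of the runtime selection idea above, assuming that
only the 8-bit, profile 0 case takes the low bitdepth path. The names
select_data_path_sketch and upshift_pixels_sketch are hypothetical
and are not functions in the codebase:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag for the two data paths described above. */
typedef enum { PATH_LOWBD, PATH_HIGHBD } data_path_sketch_t;

/* Assumption: the path is picked at run time from the stream
 * configuration, not from a build flag. */
static data_path_sketch_t select_data_path_sketch(int profile, int bit_depth) {
  return (profile == 0 && bit_depth == 8) ? PATH_LOWBD : PATH_HIGHBD;
}

/* Case 1b: 8-bit input coded at a higher bit depth. Each pixel is
 * left shifted by (bit_depth - 8) bits into a 16-bit buffer. */
static void upshift_pixels_sketch(const uint8_t *src, uint16_t *dst, size_t n,
                                  int bit_depth) {
  assert(bit_depth >= 8 && bit_depth <= 16);
  const int shift = bit_depth - 8;
  for (size_t i = 0; i < n; ++i) dst[i] = (uint16_t)(src[i] << shift);
}
```

With --bit-depth=10 the shift is 2 bits, so an 8-bit sample of 255
becomes 1020 in the 16-bit buffer.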
Change-Id: I35c06d4d4f19ebf80d909168fdddbae57c3cc884
- First pass encoding time reduces ~10.9% on i7-6700
at 100 frames, 1080p.
- avx2 works for coeff number >= 8 cases; coeff number < 8
case will be implemented by sse2.
- Unit tests are added for types B/FP/DC.
Change-Id: Ibe5b7807c64e6dfc2d59c470ed50a6e8ca94ef7c
They do not handle border extension correctly (interpolation and
border extension do not commute unless you upsample into the
border), nor do they handle crop dimensions that are not a multiple
of 8 (the upsampled version is not sufficiently large), in addition
to using massive amounts of memory and being a criminal waste of
cache (1 byte used for every 8 bytes fetched).
This commit reimplements use_upsampled_references by computing the
subpixel samples on the fly. This implementation not only corrects
the border handling, but is also faster, while maintaining the
same quality.
HL AWCY results are basically noise:
PSNR | PSNR HVS | SSIM | MS SSIM | CIEDE 2000
0.0188 | 0.0187 | 0.0045 | 0.0063 | 0.0228
Change-Id: I7527db9f83b87a7bb8b35342f7e6457cd0bef9cd
Earlier, intra prediction for rectangular blocks was performed by
running two steps of prediction on square sub-blocks.
With this experiment, we do proper intra prediction for rectangular
blocks. This ensures that we make use of all available neighboring
pixels especially for directional modes. For this, all the intra
predictors were updated to work with rectangular transform block sizes.
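
As an illustration of predicting a rectangular block in one step, here
is a sketch of a DC predictor that uses all bw above and bh left
neighbors. DC is chosen only because it is the simplest mode; the
function name, signature and rounding are assumptions, not the code
from this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical DC predictor for a bw x bh block. It averages every
 * available above and left neighbor in one pass, instead of
 * predicting two square sub-blocks separately. */
static void dc_predictor_rect_sketch(uint8_t *dst, ptrdiff_t stride, int bw,
                                     int bh, const uint8_t *above,
                                     const uint8_t *left) {
  assert(bw > 0 && bh > 0);
  int sum = 0;
  for (int i = 0; i < bw; ++i) sum += above[i];
  for (int i = 0; i < bh; ++i) sum += left[i];
  const int count = bw + bh;
  /* Rounded integer average of all neighbors. */
  const uint8_t dc = (uint8_t)((sum + (count >> 1)) / count);
  for (int r = 0; r < bh; ++r)
    for (int c = 0; c < bw; ++c) dst[r * stride + c] = dc;
}
```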
Performance improvements are small but free of cost:
All Intra frames:
lowres: -0.126
midres: -0.154
Video Overall:
lowres: -0.043
midres: -0.100
[Could not get AWCY results due to a backlog.]
BUG=aomedia:551
Change-Id: I7936e91b171d5c246cb0a4ea470a981a013892e6
We would expect that these new functions would be slower than
the old masked SAD/SSE functions, as they do additional work
(blending two inputs and comparing to a third, rather than
just comparing two inputs).
This is true for the SAD functions, which are about 50% slower
(depending on block size and bit depth). However, the sub-pixel
SSE functions are comparable to the old speed for the accelerated
special cases (xoffset or yoffset = 0 or 4), and are
between 40-90% faster for the generic case.
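
To make the extra work concrete, here is a sketch of a masked SAD that
blends two predictors and compares the result to the source. It
assumes a 6-bit mask in the range [0, 64]; the function name and
signature are hypothetical and do not match the real functions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical masked SAD over n samples. mask[i] == 64 selects
 * pred0 only, mask[i] == 0 selects pred1 only. */
static unsigned masked_sad_sketch(const uint8_t *src, const uint8_t *pred0,
                                  const uint8_t *pred1, const uint8_t *mask,
                                  int n) {
  unsigned sad = 0;
  for (int i = 0; i < n; ++i) {
    const int m = mask[i];
    assert(m <= 64);
    /* Blend the two inputs with rounding ... */
    const int blended = (m * pred0[i] + (64 - m) * pred1[i] + 32) >> 6;
    /* ... then compare the blend to the third input. */
    sad += (unsigned)abs(src[i] - blended);
  }
  return sad;
}
```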
Change-Id: I1a296ed8fc9e3edc313a6add516ff76b17cd3e9f
* Rename the 'masked_compound_*' functions to just 'masked_*'.
The previous names were intended to be temporary, to distinguish
the old and new masked motion search pipelines. But now that the
old pipeline has been removed, we can reuse the old names.
* Simplify the new ext-inter compound motion search pipeline
a bit.
* Harmonize names: Rename
aom_highbd_masked_compound_sub_pixel_variance* to
aom_highbd_8_masked_sub_pixel_variance*, to match the naming of
the corresponding non-masked functions.
Change-Id: I988768ffe2f42a942405b7d8e93a2757a012dca3
Add SSE2 lowbd and SSSE3 highbd versions of the filters
introduced in https://aomedia-review.googlesource.com/c/11962/ .
These filters are equivalent in speed to the SSE2 implementations
of the regular convolve filter. The average time to filter a
64x64 block is:
lowbd C: 52us
lowbd SSE2: 5.6us
highbd C: 53us
highbd SSSE3: 5.8us
Also add a correctness test based on the warp filter tests.
Change-Id: Ia0d81100e8a414bbfc2b5f664d751cf24765299e
Patches https://aomedia-review.googlesource.com/c/11987/
and https://aomedia-review.googlesource.com/c/11988/
replaced the old masked motion search pipeline with
a new one which uses different SAD/SSE functions.
This resulted in a lot of dead code.
This patch removes the now-dead code. Note that this
includes vectorized SAD/SSE functions, which will need
to be rewritten at some point for the new pipeline. It
also includes the masked_compound_variance_* functions
since these turned out not to be used by the new pipeline.
To help with the later addition of vectorized functions, the
masked_sad/variance_test.cc files are kept but are modified
to work with the new functions. The tests are then disabled
until we actually have the vectorized functions.
Change-Id: I61b686abd14bba5280bed94e1be62eb74ea23d89
Use CONFIG_AV1_{DE,EN}CODER to control decoder and
encoder support inclusion instead.
BUG=aomedia:76,aomedia:508
Change-Id: Ib150ae382b301885589f30d9b6e98d3bfdd1afce
Add functions which take both components of a masked compound and
compute the resulting SAD/SSE. Extend joint_motion_search to understand
masked compounds, and use it to evaluate NEW_NEWMV modes.
Change-Id: I782199a20d119a6c61c6567df157508125ac7ce7
Libvpx dropped armv6 support sometime after the aom fork.
We don't intend to support this platform, which is likely
too slow in any case. Remove the assembly and intrinsics
optimized routines, their tests, cpu feature detection,
and rtcd specialization for this instruction set extension.
Change-Id: If44ec28e5ddafc6af179c5d1982ac7e81fe54d5e
This experiment extends ALT_INTRA by adding two new modes:
smooth horizontal and smooth vertical.
Improvement on *intra frames* in BDRate (PSNR):
===============================================
AWCY (high latency): -0.46%
(Also, -1.0% or more on PSNR Cb,Cr and APSNR Cb,Cr).
AWCY (low latency): -0.43%
(Also, -0.88% to -0.94% on PSNR Cb,Cr and APSNR Cb,Cr).
Google sets:
lowres: -0.454
midres: -0.484
hdres: -0.525
Improvement on *video overall* in BDRate (PSNR):
================================================
AWCY (high latency): -0.15%
Google sets:
lowres: -0.085
midres: -0.079
Change-Id: I9f4e7c1b8ded1fe244c72838f336103ccc715d50
This patch removes dead code and prevents future implementations
from relying on obsolete transforms. Future optimizations and tests
should be based on the latest C functions
(av1/common/av1_inv_txfm1d.c).
Clean up the last related unit-test callers.
BUG=aomedia:442
Change-Id: I24953cc1baf30dd7b720df8a72dd91b356b74cad
- Partial inverse DCT unit tests have been enhanced.
- IDCT x86_64 assembly code has been removed.
Change-Id: Ic3bed2c0e70abdfd642a4f74fa969cc672d4795f
Directional predictors for the 45, 63 and 207 degree angles had 2 or 3
variants each, and only one of them was actually being used. So,
removed the C, sse2, ssse3 and neon versions of the unused ones.
Updates to the test:
- test_intra_pred_speed was testing the unused versions, so changed
it to use the version actually used by code. This meant updating
some golden MD5 values.
- test_intra_pred_speed was NOT filling up bottom-left and top-right
pixels randomly, so the predictors using these pixels weren't tested
properly. This was fixed.
BUG=aomedia:442
Change-Id: I09725d593408b81e0cd636e70a88c28eea5f2222
This experiment complicates DSP function dispatch without bringing
any real value (it is non-normative, arbitrary behaviour).
Moreover, it only affects obsolete transforms; the new ones
don't implement this mechanism.
Change-Id: Idaccdd0c14ed6b7008cd4f365c7f017ba8ccacf5
A similar cleanup happened before, but the empty statements have since
reappeared. I added a check in the 'specialize' subroutine to die
whenever such an empty specialize call is found, so that config+make
would fail.
Change-Id: I300ca0f0b077c0aeca8096d6460d8fb1c364d9b9
* Dering and clpf were merged into a single pass.
* 32x32 and 128x128 filter block sizes for clpf were removed.
* RDO for dering and clpf merged and improved:
- "0" no longer required to be in the strength selection
- Dering strength can now be 0, 1 or 2 bits per block
             LL     HL
PSNR:       -0.04  -0.01
PSNR HVS:   -0.27  -0.18
SSIM:       -0.15  +0.01
CIEDE 2000: -0.11  -0.03
APSNR:      -0.03  -0.00
MS SSIM:    -0.18  -0.11
Change-Id: I9f002a16ad218eab6007f90f1f176232443495f0
CLPF performance had degraded by about 0.5% over the past six months,
which isn't totally surprising since the codec is a moving target.
About half of that degradation comes from the improved 7 bit filter
coefficients. Therefore, CLPF needs to be retuned for the current
codec.
This patch makes two (normative) changes to the CLPF kernel:
* The clipping function was changed from clamp(x, -s, s) to
sign(x) * max(0, abs(x) - max(0, abs(x) - s +
(abs(x) >> (bitdepth - 3 - log2(s)))))
This adds a rampdown to 0 at -32 and 32 (for 8 bit, -128 & 128
for 10 bit, etc), so large differences are ignored.
* 8 taps instead of 6 taps:
1
4 3
13 31 -> 13 31
4 3
1
AWCY results:  low delay  high delay
PSNR:           -0.40%     -0.47%
PSNR HVS:        0.00%     -0.11%
SSIM:           -0.31%     -0.39%
CIEDE 2000:     -0.22%     -0.31%
APSNR:          -0.40%     -0.48%
MS SSIM:         0.01%     -0.12%
About 3/4 of the gains come from the new clipping function.
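
The new clipping function can be written out directly from the
formula above. This is a sketch only: it assumes the strength s is a
power of two and small enough that the shift stays non-negative, and
the helper names are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

/* Integer log2 for a positive power of two (hypothetical helper). */
static int ilog2_sketch(int s) {
  int l = 0;
  while (s > 1) {
    s >>= 1;
    ++l;
  }
  return l;
}

/* sign(x) * max(0, abs(x) - max(0, abs(x) - s +
 *                 (abs(x) >> (bitdepth - 3 - log2(s))))) */
static int clpf_constrain_sketch(int x, int s, int bitdepth) {
  assert(s > 0);
  const int shift = bitdepth - 3 - ilog2_sketch(s);
  assert(shift >= 0);
  const int a = abs(x);
  int excess = a - s + (a >> shift);
  if (excess < 0) excess = 0;
  int mag = a - excess;
  if (mag < 0) mag = 0;
  return x < 0 ? -mag : mag;
}
```

For 8 bit and s = 4, a difference of 1 passes through unchanged, 10
is reduced to 3, and 32 or more is reduced to 0, which is the
rampdown described above.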
Change-Id: Idad9dc4004e71a9c7ec81ba62ebd12fb76fb044a
When the functions were added in
https://aomedia-review.googlesource.com/6545 they were not restricted to
x86_64 builds.
Fixes "undefined reference to
`aom_highbd_convolve8_add_src_sse2'" for --target=x86-linux-gcc
Also remove SSE2 specializations from
`aom_highbd_convolve8_add_src[_horiz/_vert]`, since those functions
don't actually have SSE2 versions (this was left in by accident
in the original patch).
Change-Id: I9f7d0c11b58b6f5a0e6a1fdaed0f92175bdeab34
VS compiling for 32 bit targets does not support vector types in
structs as arguments, which makes the v256 type of the intrinsics hard
to support, so optimizations for this target are disabled.
Change-Id: I675394cf1aed0cb18a48f21216470867031b30ce
The convolve filters generated by loop_wiener_filter_tile
are not compatible with some existing convolve implementations
(they can have coefficients >128, sums of (certain subsets of)
coefficients >128, etc.)
So we implement a new variant, which takes a filter with 128
subtracted from its central element and which adds an extra copy
of the source just before clipping to a pixel (reinstating the
128 we subtracted). This should be easy to adapt from the existing
convolve functions, and this patch includes SSE2 highbd and
SSSE3 lowbd implementations.
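
A one-row lowbd sketch of the add_src idea, assuming 8 taps, 7-bit
filter precision and a centre tap at index 3. The function name and
layout are hypothetical, and this is not the code added by the patch:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TAPS 8        /* assumed filter length */
#define SKETCH_FILTER_BITS 7 /* assumed: full filter sums to 128 */

static uint8_t clip_pixel_sketch(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* src must provide w + SKETCH_TAPS - 1 readable samples. Output x is
 * centred on src[x + 3]. filter_minus_128 is the filter with 128
 * already subtracted from its central element. */
static void convolve_add_src_row_sketch(const uint8_t *src, uint8_t *dst,
                                        int w,
                                        const int16_t *filter_minus_128) {
  assert(w > 0);
  const int centre = SKETCH_TAPS / 2 - 1;
  for (int x = 0; x < w; ++x) {
    int sum = 0;
    for (int k = 0; k < SKETCH_TAPS; ++k)
      sum += src[x + k] * filter_minus_128[k];
    /* Round to pixel precision. This relies on arithmetic right
     * shift when sum is negative. */
    const int rounded =
        (sum + (1 << (SKETCH_FILTER_BITS - 1))) >> SKETCH_FILTER_BITS;
    /* Add the extra copy of the source just before clipping, which
     * reinstates the 128 subtracted from the central element. */
    dst[x] = clip_pixel_sketch(rounded + src[x + centre]);
  }
}
```

An all-zero filter_minus_128 corresponds to the identity filter, so
the output then equals the source pixel at the centre tap.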
Change-Id: I0abf4c2915f0665c49d88fe450dbc77b783f69e1
Provide primitive modules for cb4x4 mode use. This resolves compiler
warnings when both high bit-depth and cb4x4 mode are turned on.
Change-Id: If6ecac50578b3e665b602419a0701c3e047ce623
- For all blocks with width >= 16.
- Add test_count to make the unit tests harder to pass.
- Speed testing on 1080p, 100 frames, 5 Mbps, CPU, i7-6700
User level time reduction:
baseline: 3.68%
baseline + ext-partition: 36.12%
Change-Id: I78c5d9ca216f0fd91f1a360dca2190b11fd54a08