Includes reordering and other clamping changes, as well as
changes to reduce multiplier precision.
cam_lowres (60 frames): -0.092% BDRATE improvement in
--disable-cdef --disable-global-motion --disable-ext-tx
configuration.
Change-Id: I0660c45b44fcd5a193534d8dadd1aa1ae5c5e27a
We are going to have several commits to set up the new low/high
bitdepth data path selection logic. This patch is for the inverse
transform. The ideas are summarized as follows.
- For low/high bitdepth selection, encoder depends on
input configuration, e.g., video sequence bitdepth,
profile. Decoder depends on input bitstream. This has
nothing to do with compiler/build configuration.
- Typical encoder usage for sampling format 4:2:0.
1) 8-bit video sequence:
a) --profile=0
Fastest encoding/decoding pipeline.
b) --profile=2 --bit-depth=10
Image pixels are left shifted by 2 bits. It
employs 16-bit reference frame buffer and has high
calculation precision. It usually enjoys higher
compression performance.
2) 10/12-bit video sequence (HDR):
--profile=2 --bit-depth=10/12
- Transform coefficient type:
Lowbitdepth: int16_t
Highbitdepth: int32_t
- The type tran_low_t is still used in the codebase. It is
int32_t, which defines the data path capacity, so it is
naturally high bitdepth.
Eventually we shall remove the configuration flags
CONFIG_HIGHBITDEPTH/CONFIG_LOWBITDEPTH and separate the
low and high bitdepth data paths. The two data paths will
co-exist in the same build environment.
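The run-time selection described above could be sketched roughly as
follows. This is only an illustration: the names coeff_low_t,
coeff_high_t, use_highbitdepth_path and coeff_buffer_bytes are
hypothetical and not taken from the codebase, and the rule "profile 0
with 8-bit input is low bitdepth, everything else is high bitdepth" is
an assumption drawn from the usage examples in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; only the widths come from the text. */
typedef int16_t coeff_low_t;  /* low bitdepth transform coefficient */
typedef int32_t coeff_high_t; /* high bitdepth transform coefficient */

/* Pick the data path at run time from the stream configuration rather
 * than from a build flag. Assumption: profile 0 with 8-bit input takes
 * the low bitdepth path and everything else takes the high one. */
static int use_highbitdepth_path(int profile, int bit_depth) {
  assert(bit_depth == 8 || bit_depth == 10 || bit_depth == 12);
  return !(profile == 0 && bit_depth == 8);
}

/* Bytes needed to hold n coefficients on the selected path. */
static size_t coeff_buffer_bytes(int highbd, size_t n) {
  return n * (highbd ? sizeof(coeff_high_t) : sizeof(coeff_low_t));
}
```

Both paths are compiled into the same binary here; the choice is made
per stream, which is the property the message asks for.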
Change-Id: I35c06d4d4f19ebf80d909168fdddbae57c3cc884
This change makes the conversions similar to those in av1_quantize.c,
and fixes the ubsan warnings shown in nightly tests.
Change-Id: I90851a80dcb9f052a32bf22199fd9ef8ff927725
In previous ADSTs, DST-7 and DST-4 are used for length 4 and length
8/16/32, respectively. In this LGT experiment we explore transforms
between DST-4 and DST-7. When CONFIG_LGT flag is on, adst4 and adst8
are replaced by lgt4 and lgt8, the intermediate transforms with
pre-chosen parameters.
The LGTs applied here are lgt4_160 and lgt8_170, where the numbers
mean the self-loop weights times 100. The associated values for DST-7
and DST-4 are 100 and 200.
ovr_psnr:
lowres: -0.140
midres: -0.131
hdres: -0.078
These changes are not applied to the highbd scenario in the
current version.
Change-Id: I20600456da8766528b2b6b11aa28801e70af498e
- First pass encoding time reduces ~10.9% on i7-6700
at 100 frames, 1080p.
- avx2 works for coeff number >= 8 cases; coeff number < 8
case will be implemented by sse2.
- Unit tests are added for types B/FP/DC.
Change-Id: Ibe5b7807c64e6dfc2d59c470ed50a6e8ca94ef7c
Change the internal lib targets so that external apps
need to link only libaom instead of all internal library
targets and libaom.
BUG=aomedia:76,aomedia:609
Change-Id: I38862fcd90cb585300b6b23e8558f78a1934750f
This is enabled via:
$ cmake path/to/aom -DBUILD_SHARED_LIBS=1
Currently supports only Linux and MacOS targets. Symbol visibility
is handled by exports.cmake and its helpers exports_sources.cmake
and generate_exports.cmake.
Some sweeping changes were required to properly support shared libs
and control symbol visibility:
- Object libraries are always linked privately into static
libraries.
- Static libraries are always linked privately into each other
in the many cases where the CMake build merges multiple library
targets.
- aom_dsp.cmake now links all its targets into the aom_dsp static
library target, and privately links aom_dsp into the aom target.
- av1.cmake now links all its targets into the aom_av1 static library
target, and privately links in aom_dsp and aom_scale as well. It
then privately links aom_av1 into the aom target.
- The aom_mem, aom_ports, aom_scale, and aom_util targets are now
static libs that are privately linked into the aom target.
- In CMakeLists.txt libyuv and libwebm are now privately linked into
app targets.
- The ASM and intrinsic library functions in aom_optimization.cmake
now both require a dependent target argument. This facilitates the
changes noted above regarding new privately linked static library
targets for ASM and intrinsics sources.
BUG=aomedia:76,aomedia:556
Change-Id: I4892059880c5de0f479da2e9c21d8ba2fa7390c3
This reverts commit 79b78b7d47.
The transform coefficient range needs some more tuning.
Until that is finalized, directly applying clamping
causes multiple unit test failures. Hence revert
this CL temporarily.
BUG=aomedia:612
Change-Id: I1dd8680dee17289801c4a209275f05a498355c8e
They do not handle border extension correctly (interpolation and
border extension do not commute unless you upsample into the
border), nor do they handle crop dimensions that are not a multiple
of 8 (the upsampled version is not sufficiently large), in addition
to using massive amounts of memory and being a criminal waste of
cache (1 byte used for every 8 bytes fetched).
This commit reimplements use_upsampled_references by computing the
subpixel samples on the fly. This implementation not only corrects
the border handling, but is also faster, while maintaining the
same quality.
HL AWCY results are basically noise:
PSNR | PSNR HVS | SSIM | MS SSIM | CIEDE 2000
0.0188 | 0.0187 | 0.0045 | 0.0063 | 0.0228
Change-Id: I7527db9f83b87a7bb8b35342f7e6457cd0bef9cd
When --enable-coefficient-range-checking isn't specified, clamp the
coefficient at each stage.
This doesn't change the decoder behaviour for existing AV1 streams.
However, some AV1 bitstreams that would have been rejected by the
decoder as illegal (range check failure) are now legal bitstreams.
There is no impact on video quality.
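A minimal sketch of per-stage clamping, for illustration only. The
helper names clamp_coeff and clamp_stage are hypothetical, and the
signed bit width passed in is an assumption; this message does not
state the actual range used at each stage.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a coefficient to the range of a signed integer with `bits` bits.
 * The caller chooses `bits`; the real per-stage range is not given here. */
static int32_t clamp_coeff(int64_t value, int bits) {
  assert(bits >= 2 && bits <= 32);
  const int64_t max = ((int64_t)1 << (bits - 1)) - 1;
  const int64_t min = -((int64_t)1 << (bits - 1));
  if (value > max) return (int32_t)max;
  if (value < min) return (int32_t)min;
  return (int32_t)value;
}

/* Apply the clamp to every output of one transform stage. */
static void clamp_stage(int32_t *buf, int n, int bits) {
  for (int i = 0; i < n; ++i) buf[i] = clamp_coeff(buf[i], bits);
}
```

Values already inside the range pass through unchanged, which is why
existing legal streams decode identically.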
BUG=aomedia:30
Change-Id: Ifa01186bae6bfe5d7712298e33d964c20f88435e
The highbd_clip_pixel_add() function is generalized to be used in
the regular 8-bit path. Move its definition outside the highbd
experimental flag.
This resolves the compiler warning in unit tests when high bit-depth
is turned off.
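For illustration, a bit-depth-generic "add residual and clip" helper
could look like the sketch below. The names clip_pixel_bd and
clip_pixel_add_bd are hypothetical stand-ins and do not reproduce the
actual signature or body of highbd_clip_pixel_add().

```c
#include <assert.h>
#include <stdint.h>

/* Clip a value to the valid pixel range [0, 2^bd - 1]. */
static uint16_t clip_pixel_bd(int64_t value, int bd) {
  assert(bd == 8 || bd == 10 || bd == 12);
  const int64_t max = ((int64_t)1 << bd) - 1;
  if (value < 0) return 0;
  if (value > max) return (uint16_t)max;
  return (uint16_t)value;
}

/* Add a residual to a predicted pixel and clip the reconstruction.
 * With bd == 8 this covers the regular 8-bit path as well. */
static uint16_t clip_pixel_add_bd(uint16_t dest, int32_t residual, int bd) {
  return clip_pixel_bd((int64_t)dest + residual, bd);
}
```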
Change-Id: I90a744adb2381c9bf8476aa2a2bd0c87d9afdf57
The Windows calling convention pushes any __m128i type arguments
after the 3rd (4th on x86-64) onto the stack. But on x86,
stack-allocated arguments are not guaranteed to be aligned to
a multiple of their natural alignment, leading to compile errors.
We fix this by making the functions which take >3 __m128i arguments
instead take pointers. Since the functions are marked INLINE, the
extra memory operations should optimize out.
BUG=aomedia:587
Change-Id: I0cb2831fd12aded6f2821c037365386e6183ba5c
This unifies the codepath for high-bitdepth transforms and deletes
all calls to the old deprecated versions. This required reworking
the way 1d configurations are combined in order to support rectangular
transforms.
There is one remaining codepath that calls the deprecated 4x4 hbd
transform from encoder/encodemb.c. I need to take a closer look
at what is happening there and will leave that for a followup
since this change has already gotten so large.
lowres 10 bit: -0.035%
lowres 12 bit: 0.021%
BUG=aomedia:524
Change-Id: I34cdeaed2461ed7942364147cef10d7d21e3779c
Earlier, intra prediction for rectangular blocks was performed by
running two steps of prediction on square sub-blocks.
With this experiment, we do proper intra prediction for rectangular
blocks. This ensures that we make use of all available neighboring
pixels especially for directional modes. For this, all the intra
predictors were updated to work with rectangular transform block sizes.
Performance improvements are small but free of cost:
All Intra frames:
lowres: -0.126
midres: -0.154
Video Overall:
lowres: -0.043
midres: -0.100
[Could not get AWCY results due to a backlog.]
BUG=aomedia:551
Change-Id: I7936e91b171d5c246cb0a4ea470a981a013892e6
This change makes the parallel deblocking experiment work with
cb4x4. The inner loop processes every 4x4 block.
Change-Id: I86adb3d7b6d67a91ccc12aab29da9bfb8c522cf1
Implements the high precision Wiener filter with an offset
to reduce the error due to saturation without increasing
the number of bits needed for intermediate precision.
Also turns the high precision filter on.
Change-Id: I34037a5746a6a89c5fce67753c1b027749085edf
We would expect that these new functions would be slower than
the old masked SAD/SSE functions, as they do additional work
(blending two inputs and comparing to a third, rather than
just comparing two inputs).
This is true for the SAD functions, which are about 50% slower
(depending on block size and bit depth). However, the sub-pixel
SSE functions are comparable to the old speed for the accelerated
special cases (xoffset or yoffset = 0 or 4), and are
between 40% and 90% faster for the generic case.
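The extra work mentioned above (blend two inputs, then compare to a
third) can be sketched in scalar C as follows. This is an illustration
under assumptions: the function name masked_compound_sad_sketch is
hypothetical, and the 6-bit mask with weights in [0, 64] and rounding
is assumed rather than taken from this message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: mask weights are 6-bit, in [0, 64], blended with rounding. */
#define MASK_BITS 6
#define MASK_MAX (1 << MASK_BITS)

/* Blend predictors a and b with the per-pixel mask, then accumulate the
 * absolute difference against the source over n pixels. */
static unsigned int masked_compound_sad_sketch(const uint8_t *src,
                                               const uint8_t *a,
                                               const uint8_t *b,
                                               const uint8_t *mask, int n) {
  unsigned int sad = 0;
  for (int i = 0; i < n; ++i) {
    assert(mask[i] <= MASK_MAX);
    const int m = mask[i];
    const int blended =
        (a[i] * m + b[i] * (MASK_MAX - m) + (MASK_MAX >> 1)) >> MASK_BITS;
    sad += (unsigned int)abs(blended - src[i]);
  }
  return sad;
}
```

A plain SAD would skip the blend and compare two buffers directly, so
the per-pixel cost here is higher, consistent with the slowdown
reported for the SAD functions.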
Change-Id: I1a296ed8fc9e3edc313a6add516ff76b17cd3e9f
This was being worked around by forcing highbitdepth to be off when
enabling tx64x64.
With the fixes in place, remove the work-around.
Change-Id: I3102f9e17d4037af96a9eff418c5af6a97fd740c
* Rename the 'masked_compound_*' functions to just 'masked_*'.
The previous names were intended to be temporary, to distinguish
the old and new masked motion search pipelines. But now that the
old pipeline has been removed, we can reuse the old names.
* Simplify the new ext-inter compound motion search pipeline
a bit.
* Harmonize names: Rename
aom_highbd_masked_compound_sub_pixel_variance* to
aom_highbd_8_masked_sub_pixel_variance*, to match the naming of
the corresponding non-masked functions.
Change-Id: I988768ffe2f42a942405b7d8e93a2757a012dca3
Add SSE2 lowbd and SSSE3 highbd versions of the filters
introduced in https://aomedia-review.googlesource.com/c/11962/ .
These filters are equivalent in speed to the SSE2 implementations
of the regular convolve filter. The average time to filter a
64x64 block is:
lowbd C: 52us
lowbd SSE2: 5.6us
highbd C: 53us
highbd SSSE3: 5.8us
Also add a correctness test based on the warp filter tests.
Change-Id: Ia0d81100e8a414bbfc2b5f664d751cf24765299e
Patches https://aomedia-review.googlesource.com/c/11987/
and https://aomedia-review.googlesource.com/c/11988/
replaced the old masked motion search pipeline with
a new one which uses different SAD/SSE functions.
This resulted in a lot of dead code.
This patch removes the now-dead code. Note that this
includes vectorized SAD/SSE functions, which will need
to be rewritten at some point for the new pipeline. It
also includes the masked_compound_variance_* functions
since these turned out not to be used by the new pipeline.
To help with the later addition of vectorized functions, the
masked_sad/variance_test.cc files are kept but are modified
to work with the new functions. The tests are then disabled
until we actually have the vectorized functions.
Change-Id: I61b686abd14bba5280bed94e1be62eb74ea23d89
Use CONFIG_AV1_{DE,EN}CODER to control decoder and
encoder support inclusion instead.
BUG=aomedia:76,aomedia:508
Change-Id: Ib150ae382b301885589f30d9b6e98d3bfdd1afce
Add functions which take both components of a masked compound and
compute the resulting SAD/SSE. Extend joint_motion_search to understand
masked compounds, and use it to evaluate NEW_NEWMV modes.
Change-Id: I782199a20d119a6c61c6567df157508125ac7ce7
Libvpx dropped armv6 support sometime after the aom fork.
We don't intend to support this platform, which is likely
too slow in any case. Remove the assembly and intrinsics
optimized routines, their tests, cpu feature detection,
and rtcd specialization for this instruction set extension.
Change-Id: If44ec28e5ddafc6af179c5d1982ac7e81fe54d5e
As the block sizes are powers of two, we can index into the weights
array as sm_weights_array[bs] now.
This uses 2 * MAX_BLOCK_DIM memory, instead of the NUM_BLOCK_DIMS *
MAX_BLOCK_DIM used earlier.
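The indexing idea can be sketched as below. This is an illustration
only: weights_table, init_weights_table and weights_for_block are
hypothetical names, MAX_BLOCK_DIM is assumed to be 32, the smallest
block size is assumed to be 4, and the table is filled with placeholder
ramps rather than the real smooth-predictor weights.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCK_DIM 32 /* assumed largest block dimension for this sketch */

/* One flat table: the weights for block size bs occupy entries
 * [bs, 2 * bs). Because block sizes are powers of two, these ranges never
 * overlap and 2 * MAX_BLOCK_DIM entries are enough for all sizes. */
static uint8_t weights_table[2 * MAX_BLOCK_DIM];

static int is_power_of_two(int v) { return v > 0 && (v & (v - 1)) == 0; }

/* Fill with placeholder linear ramps; these are NOT the real weights. */
static void init_weights_table(void) {
  for (int bs = 4; bs <= MAX_BLOCK_DIM; bs *= 2)
    for (int i = 0; i < bs; ++i)
      weights_table[bs + i] = (uint8_t)(255 - (255 * i) / bs);
}

/* Look up the weights for a block size by offsetting into the table. */
static const uint8_t *weights_for_block(int bs) {
  assert(is_power_of_two(bs) && bs >= 4 && bs <= MAX_BLOCK_DIM);
  return weights_table + bs;
}
```

The entries below the smallest block size are simply unused, which is
the price of dropping the per-size second dimension.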
Change-Id: I55bcedc188b8ed7def719c4d002c1fe2ec5e1b7f
- Found this bug when increasing unit test number to 10000.
- Unit test is therefore also updated.
Change-Id: I938e96f6ebd35ae1bd8affebf8665e1da49a324b