Implements scan order to band map with arrays in both the encoder
and decoder to remove conditional statements.
Encoding seems to be about 1% faster at speed 0, tested on football.
Decoding seems to be about 0.5-1% faster on a set of 25 videos.
Change-Id: Idb233ca0b9e0efd790e30880642e8717e1c5c8dd
It is more logical to use dqcoeff buffer to put there *dequantized*
transform coefficients (inside inverse_transform_block and
decode_coefs functions). Dequantization happens inside WRITE_COEF_CONTINUE
macro.
qcoeff buffer should be only used in the encoder for *quantized*
transform coefficients.
Change-Id: Ifd54bef272bbf5311ced6669c4f1079f998af5d7
Removes stack-alocation of token_cache in decode_coefs function
Seems to achieve about 1% decode speed improvement as tested on
25 480p videos.
Change-Id: I8e7eb3361fa09d9654dfad0677a6d606701fdc6e
The change doesn't affect the bitstream. It changes the order or function
calls and affects how we reconstruct intra- and inter-blocks. Speed up is
about 1...1.5%.
For intra-blocks:
Before:
for each transform block read tokens
for each transform block do prediction
for each transform block do inverse transform
Now:
for each transform block
read tokens
do prediction
do inverse transform
For inter-blocks:
Before:
for each transform block read tokens
for each transform block do inverse transform
Now:
for each transform block
read tokens
do inverse transform
Change-Id: I12a79bf1aa5a18c351b8010369bd3ff1deae1570
missed one in vp9_detokenize.c in the last
+ add some asserts in vp9_decode_frame() to catch regressions
Change-Id: Ide67505114ee17efdafb13694aed0c09039e5a16
replace VP9D_COMP usage with the (slightly) more targeted
VP9_COMMON/MACROBLCKD/struct segmentation structures.
Change-Id: Iabb3616e231417b0e17b7e4b384ea63167a81745
* changes:
Use a separate MODE_INFO stream for each tile column
Get rid of "this_mi", use "mi_8x8[0]" everywhere instead
Make the static_segmentation feature work again
The only case where they were intentionally pointing to different
structures was in mbgraph, and this didn't have the expected behavior
because both of these pointers are used interchangeably through the code
Change-Id: I979251782f90885fe962305bcc845bc05907f80c
Moving code that gets band_translate array from get_scan_and_band()
function to get_band_translate() function. Renaming get_scan_and_band() to
get_scan().
Change-Id: I43047c205a1ca2a6e24be44db39dc04b7a385008
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now constists of pointers to MODE_INFO structs. The
MODE_INFO structs are now stored as a stream (decoder only),
eliminating unnecessary copies and is a little more cache
friendly.
Change-Id: I031d376284c6eb98a38ad5595b797f048a6cfc0d
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now constists of a pointer to a MODE_INFO struct and
a "in the image" flag. The MODE_INFO structs are now stored
as a stream, eliminating unnecessary copies and is a little
more cache friendly.
For the test clips used, the decoder performance improved
by ~4.3% (1080p) and ~9.7% (720p).
Patch Set 2: Re-encoded clips with latest. Now ~1.7% (1080p)
and 5.9% (720p).
Change-Id: I846f29e88610fce2523ca697a9a9ef2a182e9256
Adding set_contexts contexts function and call it instead of
set_contexts_on_border. Calling txfrm_block_to_raster_xy to get aoff and
loff.
Change-Id: I41897e344afd2cae1f923f4fdbe63daccf6fe80e
Updating all foreach_transformed_block_visitor functions to work with
plane block size instead of general block. Removing a lot of duplicated
code.
Change-Id: I6a9069e27528c611f5a648e1da0c5a5fd17f1bb4
VP9_COMMON is the right place to segmentatation struct because it has
global segmentation parameters, not something specific to macroblock
processing.
Change-Id: Ib9ada0c06c253996eb3b5f6cccf6a323fbbba708
Making foreach_transformed_block_in_plane more clear (it's not finished
yet). Using explicit tx_size variable consistently instead of
(ss_txfrm_size / 2) or (ss_txfrm_size >> 1) expression.
Change-Id: I1b9bba2c0a9f817fca72c88324bbe6004766fb7d
This commit optimizes the tokenization and detokenization operational
flow for speed-up. It makes the coding process about 0.3% faster at
speed 0.
Change-Id: I28008df7482874e4b5f237f2d418ff82a249dd56
Removing unused constants, macros, and function declarations. Using
ROUND_POWER_OF_TWO macro, vp9_zero, vp9_copy where possible. Moving
#include from *.h to *.c. Merging for loops for motion vectors.
Change-Id: Ic3bf841764a2bb177128bb3a6d7aa8f68229cd13
Adding plane type check condition because it was always used outside of
get_tx_type_{4x4, 8x8, 16x16}.
Change-Id: I02f0bbfee8063474865bd903eb25b54d26e07230
Counts are separate from frame context. We have several frame contexts but
need only one copy of all counts.
Change-Id: I5279b0321cb450bbea7049adaa9275306a7cef7d
Adding segmentation struct to vp9_seg_common.h. Struct members are from
macroblockd and VP9Common structs. Moving segmentation related constants
and enums to vp9_seg_common.h.
Change-Id: I23fabc33f11a359249f5f80d161daf569d02ec03
This should significantly speedup cost_coeffs(). Basically what the
patch does is to make the neighbour arrays padded by one item to
prevent an eob check in get_coef_context(), then it populates each
col/row scan and left/top edge coefficient with two times the same
neighbour - this prevents a single/double context branch in
get_coef_context(). Lastly, it populates neighbour arrays in pixel
order (rather than scan order), so we don't have to dereference the
scantable to get the correct neighbours.
Total encoding time of first 50 frames of bus (speed 0) at 1500kbps
goes from 2min10.1 to 2min5.3, i.e. a 2.6% overall speed increase.
Change-Id: I42bcd2210fd7bec03767ef0e2945a665b851df56
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
x86-64 only, it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.
Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
Makes cost_coeffs() a lot faster:
4x4: 236 -> 181 cycles
8x8: 888 -> 588 cycles
16x16: 3550 -> 2483 cycles
32x32: 17392 -> 12010 cycles
Total encode time of first 50 frames of bus (speed 0) @ 1500kbps goes
from 2min51.6 to 2min43.9, i.e. 4.7% overall speedup.
Change-Id: I16b8d595946393c8dc661599550b3f37f5718896
Wrong max data size (skip has no data) and use of vp9_get_segdata()
when it should be vp9_segfeature_active().
Change-Id: I1eb97d33df6e2a42cc589049f704266fe3639902
Code intra/inter, then comp/single, then the ref frame selection.
Use contextualization for all steps. Don't code two past frames
in comp pred mode.
Change-Id: I4639a78cd5cccb283023265dbcc07898c3e7cf95