Under a configuration change, where the bitrate suddenly decreases,
the buffer level may be larger than maximum allowed (for that first frame to be encoded after change_config).
This change keeps it clipped to its maximum level.
Change-Id: I4d0b5b3d1fd8148600dd39e02bd630c9464baba5
1. Made speed choices to be progressive
2. Adjusted rt speed settings to achieve better speed/quality
Overall, rt-5 gained 2.5% in compression/quality, encoding time of 720p
niklas clip goes from 137,052ms to 121,874ms
Change-Id: Ia6e7e1e15225395a868a2f1059c3db8e266e1600
This commit further optimizes SSE2 operations in the second 1-D
inverse 16x16 DCT, with (<10) non-zero coefficients. The average
runtime of this module goes down from 779 cycles -> 725 cycles.
Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f
Optimizing all SSSE3 assembly for convolution:
1. vp9_filter_block1d4_h8_sse2
2. vp9_filter_block1d8_h8_sse2
3. vp9_filter_block1d16_h8_sse2
4. vp9_filter_block1d4_v8_sse2
5. vp9_filter_block1d8_v8_sse2
6. vp9_filter_block1d16_v8_sse2
my optimization include:
-processing 2x8 elements in one 128 bit register instead of processing
8 elements in one 128 bit register.
-removing unecessary loads.
This optimization gives between 2.4% user level gain for 480p input
and 1.6% user level gain for 720p.
This Optimization done only for 64bit.
Change-Id: Icb586dc0c938b56699864fcee6c52fd43b36b969
This commit is the first patch optimizing SSE2 implementation of inverse
16x16 DCT with <10 non-zero coefficients. It focused on the first 1-D (row)
transformation. It exploits the fact that only top-left 4x4 block contains
non-zero coefficients, in a 2-D inverse 16x16 DCT with <10 coeffients.
The average runtime of idct16x16_10 unit is reduced from
883 cycles -> 779 cycles (12% faster).
For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes
down from 310651 ms -> 305910 ms. The decoding speed goes up from
80.37 fps -> 80.87 fps.
Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645
Optimizing the variance functions: vp9_variance16x16, vp9_variance32x32,
vp9_variance64x64, vp9_variance32x16, vp9_variance64x32,
vp9_mse16x16 by migrating to AVX2
some of the functions were optimized by processing 32 elements instead of 16.
some of the functions were optimized by processing 2 loop strides of 16
elements in a single 256 bit register
This optimization gives between 2.4% - 2.7% user level performance gain
and 42% function level gain.
Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d
Fix miss alignment of the frames contributing to the
error score and bit allocation for gf/arf groups.
Initial results slightly +.
Change-Id: Ie508bdcfdac52e592d48e1f13e01b3551b523deb
Some cleanups on frames_to_key, frames_since_key.
Also removes the unused fixed_q parameters in vp9.
Change-Id: If8743a32c71de30a8d17136477b53d607a7acda8
The previous implementation stops motion vector prediction test when
the zero motion vector appears for the second time. This commit fixes
it by simply skipping the second time check on zero mv and continuing
on to next mv candidate.
It slightly improves stdhd in speed 2 by 0.06% on average. Most static
sequences are not affected. A few hard ones, like jet, ped, and riverbed
were improved by 0.1 - 0.2%.
Change-Id: Ia8d4e2ffb7136669e8ad1fb24ea6e8fdd6b9a3c1
The feature undergoes prior assumption that the recursive partition
size search from 4x4 to 64x64, hence utilizing information from small
blocks to determine early termination in large block rate-distortion
optimization search. The current codebase is now going from top down.
The previous function might go with not properly initialized values,
hence removed.
Tested on pedestrian_area_1080p at 4000 kbps running under speed 2.
No visible difference in runtime observed.
Change-Id: I553df415c6191413762db7ae34e8790c71d8118e
This patch sets frame types correctly in the new
vp9_get_second_pass_params() function called prior
to encode_frame_to_data_rate() function, so that the
latter function can just work with what is passed to
it. This will allow multiple vp9_get_second_pass_params()
to be created for various encode strategies without
messing with the core encode function.
There is no difference in derf and yt. stdhd/hd are pending.
Change-Id: I70dfb97e9f497e9cee04052e0e8e0c2892eab0c3
This commit adds input/output ports for IDCT8_1D macro function to
provide more flexibility in variable use. It allows to skip several
buffer swap operations.
Change-Id: I21f3450509537322293043b3281bfd3949868677
Adding RefBuffer to simplify reference buffer management. The struct has a
pointer to image data and scale factors relative to the current frame.
Change-Id: If38eb1491ff687cc11428aee339f3e052e2c5d9e
This commit merges the initial buffer swap operations in idct8_1d_sse2
into the array transpose step, hence reducing number of instructions
therein.
Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479
This commit optimizes the SSE2 implmentation of idct8x8_10. It exploits
the fact that only top-left 4x4 block contains non-zero coefficients,
and hence reduces the instructions needed.
The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles,
estimated by averaging over 100000 runs. For pedestrian_area_1080p 300
frames coded at 4000kbps, the average decoding speed goes up from
79.3 fps to 79.7 fps.
Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180
In two pass encodes bits are allocated to each frame
according to a modified error score for the frame as a
fraction of the modified error score for the clip or section.
Previously a minimum rate per frame was reserved and
subtracted from the bits allocatable by the two pass code.
The vbr max section rate was enforced by clipping the
actual number of bits allocated.
In this patch the min and max vbr rates are enforced
instead by clipping the modified error scores for each frame
rather than the number of bits allocated.
Small gains for all test sets (psnr and SSIM) ranging from
~ +0.05 for YT psnr up to ~ +0.25 for Std-hd SSIM.
Change-Id: Iae27d70bdd3944e3f0cceaf225bad2e8802833de
vpx_codec_vp9x_cx is not used internally. Experimental flag from
vp9_extracfg is also not really used. YUV 4:4:4 just works after these
changes (you have to specify --profile=1 for the encoder).
Change-Id: Ib1c8461d0d19d159827e005efe868f891eea0140
This commit takes a preliminary attempt to refine the motion search
control. It detects the SAD associated with mv predictor per reference
frame, and based on which to determine whether the encoder wants to
reduce the motion search range (if the predicted mv provides fairly
small SAD), or to skip the current reference frame (if there exists
another ref frame that gives much smaller SAD cost).
This feature is turned on in the settings of speed 1 and above.
In speed 1, compression performance changed
derf -0.018%
yt -0.043%
hd -0.045%
stdhd -0.281%
speed-up
pedestrian_area_1080p at 4000 kbps 100 frames
199651ms -> 188846ms (5.5% speed-up)
blue_sky_1080p at 6000 kbps
443531ms -> 415239ms (6.3% speed-up)
In speed 2, compression performance changed
derf -0.026%
yt -0.090%
hd -0.055%
stdhd -0.210%
speed-up
pedstrian 113949ms -> 108855ms (4.5% speed-up)
blue_sky 271057ms -> 257322ms (5% speed-up)
Change-Id: I1b74ea28278c94fea329d971d706d573983d810d
Buffer the SSE of prediction residuals in the rate-distortion
optimization loop of a given block. This information would be used
for later encoding control.
Change-Id: If4e63f3462490513c48be9407d3327c8dd438367
Moving back to scale_factors struct. We don't need anymore x_offset_q4 and
y_offset_q4 because both values are calculated locally inside vp9_scale_mv
function.
Change-Id: I78a2122ba253c428a14558bda0e78ece738d2b5b
Before mv scaling it is required to calculate x_offset_q4/y_offset_q4
by calling set_scaled_offsets(). Now offset configuration can not be
missed because it happens just before scale_mv().
Change-Id: I7dd1a85b85811a6cc67c46c9b01e6ccbbb06ce3a
Various cleanups and streamlining of interfaces as precursor
to further advancements in rate control.
Pre-encode parameter setting for different use cases:
One-pass, first of 2-pass, second of 2-pass, and Svc
are separated out.
There is no change in output with this change.
Change-Id: Ied8ca7d84d610993776aa30ef263fe20452e0e3e
Take account of the fact that the overlay frame is usually
very cheap so distribute target bits among the other frames.
Change-Id: I120685122e8cbbe75da8d07d02932f7877059867
This will hurt metrics in some cases (particularly for static
clips at low data rates where there is extra overhead, but it
helps smooth transitions around forced key frames between
stitched kf sections.
Change-Id: I7e1026ae0de6c77bba863061e115136d7f283cc0
Slightly reduces the mean tendency to undershoot target
rate in vbr, especially when using the memory less mode
and when recodes are disabled.
The effect is primarily at low q.
Change-Id: I59a593b99522cc7da31b4134d1c8a65f5b7b7c53
This commit reworks the prediction filter rate-distortion cost update
process consistent for all block sizes.
Change-Id: I5874349ab38df380240f96c2d4ef924072bab68d
Guard against incorrect size values moving *data past data_end.
Check read length against the difference of the buffers.
Change-Id: Ie0b54e2db517fd41a0f3ceb23402ee44839a4739
lf deltas are later setup in function vp9_setup_past_independence(),
so this commit removed the redundant copy. Also renamed a function
to better align the behavior of the funciton.
Change-Id: I5d28c2f5b12b3d31817e14296ed4605c1fd5c98c
MV struct was ussed to indicate the postition of a MI_BLOCK with row
and col components. The expression was confusing, this commit added a
new stucture "POSITION" with row and col component to better describe
the position of a mi_block.
Change-Id: I59fdd4b45010fe7d85a8db22a55503265c4f5b2b
Various cleanups and refactoring.
Removes feedback of active worst qaulity and uses last_q
instead to make the interface cleaner. Active worst quality
is now decided only once for a frame being coded in the
beginning based on last_q and other stats. Also, adds other
cleaups on last_q to store also the last_q for altref frames,
and reduces the altref interval a little.
The output does change a little.
derfraw300: +0.224% (global psnr)
stdhdraw250: +0.442% (global psnr)
Change-Id: Ie634cdc032697044c472dd0fe79c109b3e7f9767
Properly handle the rd_filter_cache update, when early termination
or skip prediction filter type check is triggered.
Change-Id: Ie7b9a75fed3358f45ffd15817f2b36670c14eb2d
Increased threshold(t) for interp filter search. This sped up the
encoder with some PSNR loss.
Borg tests were ran at speed 2.
t = 100, PSNR loss:
-0.710%(derf); -0.561%(stdhd); -0.647%(youtube)
speedup:
9%(derf); 3%(stdhd); 5.7%(youtube)
t = 500, PSNR loss:
-1.687%(derf); -1.665%(stdhd); -1.664%(youtube)
speedup:
18%(derf); 10%(stdhd); 8%(youtube)
Change-Id: I180e3657c1e156aaa88dc7c437f8bcbd19f5caba
This commit enables an adaptive prediction filter type selection
for sub8x8 block sizes. In speed 1, it re-uses the filter type of
collocated 8x8 block if it is tested in the rate-distortion optimization
loop, for the sub8x8 blocks. Otherwise, it runs the normal test
over all the three filter types. In speed 2, it re-uses the 8x8
block's prediction filter type, if available. Otherwise, force it
to be EIGHTTAP.
Compression and speed performance wise:
speed 1
derf -0.266%
yt -0.138%
bus at 2000 kbps: 33766ms -> 30451ms (10% speed-up)
football at 600 kbps: 48173ms -> 43786ms (9% speed-up)
speed 2
derf -0.026%
yt +0.134%
bus at 2000 kbps: 18973ms -> 17698ms (6% speed-up)
football at 600 kbps: 26748ms -> 25096ms (6% speed-up)
Change-Id: I77e097533b969fd3472147225fa79fc98095d342
Making overall logic more clear, moving "hacked" calculation of base filter
array pointer to get_filter_base() function.
Change-Id: Ibbd38a9f937e48d35bbbfef3ad933ab36664cccb
Trying to make encode_sb() more similar to write_modes_sb() and
decode_mode_sb() because essentially all branching logic should be the
same.
Change-Id: Ib7dec7b48fce29418142abad4d1dcfdb1c770735
This commit constrains the maximal motion search range for sub8x8
blocks to be [-1023, 1023], in the unit of full pixel.
Change-Id: I955b60649364ab410f2453cafd46a496f2fcb43e
reorder the tiles based on size and their presumed complexity. this
minimizes the cases where the main thread is waiting on a worker to
complete.
Change-Id: Ie80642c6a1d64ece884f41683d23a3708ab38e0c
In evaluating partition split case, Wrong partition size is used in
calling partition_plane_context(). This commit change to use the
correct sub partition size. The incorrect partition size used were
causing an ASAN error in unit test.
Change-Id: Iab695b764bc51cc61580075f2ae4001421132362
Clean up and simplification of both estimate_max_q
variants and only call once per clip/section.
This leads to a more constrained range of Q values
across a clip / section.
Average gains across all 4 test sets:-
PSNR ~0.5% SSIM ~0.3%
Change-Id: If77d5f7bb50939a464e117724f4da5b001c62d70
In lossless coding, distortion is always 0. Early exit based on this
metric was incorrect.
This CL also changed to use best_rd instead of distortion as the metric
for easly exit as requested by Jim.
Change-Id: I8ef3e407ac03b4abc3283b273f936a68fad5c2ab
Add a full range motion search for regular block sizes. This runs
exhaustive search within the given reference area. This commit further
optimizes the search process by combining 4 points test into one
pipeline, which gives 30% speed-up as compared to run each individual
point at a time.
This full range search serves as a best possible motion search reference.
When replacing the diamond search with full range search, the speed 0
runtime of bus CIF at 2000 kbps goes from 153872ms to 623051ms. The
compression performance compared to speed 0 setting gains 0.585% for
derf set.
Change-Id: Ieef1225216b0b86b4ac4872fa7fb9e18bf2eabb3
Removed an adaptive rate correction factor that was having
a negative impact on quality in many clips. This factor
was influencing the Q range available to each frame
independently of the bits allocated to each.
Average results with DISABLE_RC_LONG_TERM_MEM.
derf +0.199, -0.059.
yt +3.957, +3.798
std hd +1.577, +2.140
yt hd +4.127, +4.513
Average results without DISABLE_RC_LONG_TERM_MEM
derf -0.628, -0.665
yt +3.432, +3.015
std hd -0.105, +0.153
yt hd +3.432, +3.015
Change-Id: I45bab6b606f49a442e7b27a6d631f3ffd843bbce
Includes various cleanups.
Streamlines the interfaces so that all rate control state
updates happen in the vp9_rc_postencode_update() function.
This will hopefully make it easier to support multiple
rate control schemes.
Removes some unnecessary code, which in rare cases can casue
a difference in the constrained quality mode output, but
other than that there is no bitstream change yet.
Change-Id: I3198cc37249932feea1e3691c0b2650e7b0c22fc
Removed calls to vp9_update_mode_info_border since
they immediately followed code that initialized the
entire buffer to 0.
Change-Id: Ife06794daa20439a0b607a83a87f88df59afac40
- Disable mode info update in case where current frame is coded
as "show existing frame".
- Should fix issue 676.
Change-Id: Ibee681850eb307f982da6528d3e31cb94f881c08
Both single frame and compound inter motion search run with luma
component only. Hence removing the block size mapping therein.
Change-Id: I217488e702432ae9fa0e95bf6f516ebb36b5c79b
The old code would start in a mixed state, where all the reference
frames were pointing to frame buffer 0, but the reference counts
were 0. This is why we needed special code for the first frame.
Change-Id: I734961012917654ff8c0c8b317aac00ab75ded1a
Using get_plane_block_size() instead of manipulation with subsampling
values, calculating all required values only once without redundant calls
to b_width_log2().
Change-Id: I00303f2a0926f9c4cb17f34591adda60615f8919
Jingning saw bitstream change with this patch. It could be true
that (mask_16x16_0 & 1) is 1, but (mask_16x16_1 & 1) is 0 in some
edge cases.
This reverts commit 8f05e70340.
Change-Id: I0a529435ce816a1e14653eb510d5090de276070a
In the decoder we don't need to save eobs, we can pass eob as an argument.
That's why removing eob arrays from VP9Decompressor and TileWorkerData,
and moving eob pointer from macroblockd_plane to macroblock_plane.
Change-Id: I8eb919acc837acfb3abdd8319af63d1bbca8217a
This commit makes the coefficient tree initialized prior to token
initialization, where the coefficient costs are filled out according
to the probabilities associated with coefficient value categories.
Change-Id: If4e89c3923058376f8382c683fe4a225a4a38af3
This commit fixes the intra prediction reference source selection
in the settings of skip_encode. Use original boundary pixels as
prediction reference, when the inverse transform and reconstruction
are skipped in the per block size rate-distortion optimization loop.
Change-Id: I36081aa30aa46e203e0e6f4e8a420fd08269469a
Its last remaining caller can be passed its results directly without any
additional work. Also, it's not non-4:2:0 safe.
Change-Id: Ia5089ba5f7f66c7617270483c619c9271aefd868
The performance gain of idct16x16_10_add_sse2 function is not
noticeable. However since both functions use the IDCT16_1D,
idct16x16_10_add_sse2 should be modified as well.
Tested with: park_joy_420_720p50.y4m
Change-Id: I02b957e36fcf997c677d15baf496533895271bff
This commit fixes the use of uv_intra_estimate by properly restoring
the mode_info struct required by rd_pick_intra_sbuv_mode.
Change-Id: I6a156d79533c4e2e60dfd3b8c5bb0a42a8eca280
The difference with the old code is that originally the whole token_cache
was initialized with zeros at the beginning of decode_coefs() function.
Now we set several zero values explicitly with "token_cache[scan[c]] = 0".
Change-Id: I88cc5031f01d13012d1a4491739c36cb44f9401e
Removing goto and using while loop instead, renaming seg_eob to max_eob,
moving eob token counter increment.
Change-Id: Idcc4b3a45e4f313596a71776aef56691a6647e5f
E.g. disable vertical partioning for 4:2:2. Until we come up with something
better to do with the chroma block size, this prevents an assert error.
Change-Id: I9394fb3f14ec1343abc3ad4769de208e6278f285
Considering a horizontal edge, if mask_16x16 is 1 for an even-
indexed 8x8 block, then mask_16x16 is 1 for next 8x8 block in
same row. Similiar to a verticle edge, if mask_16x16 is 1 for
an even-rowed 8x8 block, then mask_16x16 is 1 for the 8x8 block
right below it in next raw. Based on that, the mask_16x16 checking
can be simplified to save cycles. The corresponding 8-pixel
vp9_mb_lpf_horizontal_edge code can also be removed.
Change-Id: Ic3fe7a5674322239208cbe2731dc3216ce2084f3
We only need qcoeff buffers in the encoder. Reducing TileWorkerData struct
and VP9Decompressor struct sizes by 24K.
Change-Id: Id148868461f7ffa3d3dd634b371503ae9c57e207
Renaming treed_read() to consistent vp9_read_tree() and moving it from
deleted vp9_treereader.h to vp9_dboolhuff.h file.
Change-Id: Iedd8655acbe25e4fcf62b79e5a13bdea69b6b004