* Including El Attn optimization
* minor changes
* Update benchmark_fs.sh
* Update benchmark_fs.sh
* minor changes
* minor changes
* env variable for testing fairseq optimizer
* CI tests to gpu3
* Trigger Build
* Optimize GPT2
* Add benchmarking scripts for GPT2
* Replace the line break in the generation hypo and update the benchmarking data
* Update README and benchmark script
* Disable the transformers test_gpt2_model_att_mask_past unit test. The current cache behavior is not compatible with it because the cached key and value are updated whenever the model is called with past set to None. The unit test would pass if the order of the second and third model calls were switched.
* Add readme file for gpt2
* Minor updates
* Use a bigger dataset for ProphetNet. Reformat benchmarks.
* Use real hf baseline.
* Fix Transformers rouge metric scale.
* Support ngram in hf.
* Fix BLEU score thresholds.
* Update install_requires and enable fairseq to work with torch 1.6 and 1.7
* Improve the error message and address some warnings in torch 1.7
* Raise an error if fairseq/transformers are installed but the optimizations cannot be applied
* Move transformers/fairseq to extra_require
* Remove the outdated build files for the ngram cuda op
* Run fastseq unit tests before transformers and fairseq
* Cuda op for ngram repeat blocking
* clean up
* Unit test for cuda op
* unit test updated, minor updates in cpp/cu code
* Rebased on the new codebase and updated all benchmarks
* Update README.md
* Update README.md
* Update README.md
* minor change in kernel
* changing install order
Simplify the ngram-blocking algorithm. Bump speed from 11.0 to 14.8 samples/s.
Before the change: generate all ngram pairs, then pick the banned tokens by looking up the ngram keyed by the last n-1 tokens.
After the change: generate the banned tokens directly.
For example, if the previously generated tokens are 1 2 3 4 2 3, the token that needs to be banned is 4.
Before the change, the code generates all pairs in a dict {"1 2": 3, "2 3": 4, "3 4": 2, "4 2": 3} and looks up "2 3" to find that 4 should be banned.
After the change, it puts 4 directly into the banned list.
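A minimal Python sketch of the two approaches (the function names and plain-list inputs are illustrative, not the actual implementation):

```python
def banned_tokens_before(prev_tokens, n=3):
    # Build every n-gram as {prefix: [next tokens]}, then look up the current suffix.
    generated = {}
    for i in range(len(prev_tokens) - n + 1):
        prefix = tuple(prev_tokens[i:i + n - 1])
        generated.setdefault(prefix, []).append(prev_tokens[i + n - 1])
    return generated.get(tuple(prev_tokens[-(n - 1):]), [])

def banned_tokens_after(prev_tokens, n=3):
    # Compare each (n-1)-gram against the current suffix and collect the token
    # that followed it; no intermediate dict over all n-grams is built.
    suffix = prev_tokens[-(n - 1):]
    banned = []
    for i in range(len(prev_tokens) - n + 1):
        if prev_tokens[i:i + n - 1] == suffix:
            banned.append(prev_tokens[i + n - 1])
    return banned

assert banned_tokens_before([1, 2, 3, 4, 2, 3]) == [4]
assert banned_tokens_after([1, 2, 3, 4, 2, 3]) == [4]
```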
* Support BART model in Transformers-2.11.0
* Add the benchmarking results for transformers-2.11.0
* Directly call calc_banned_ngram_tokens_v2 instead of replacing calc_banned_ngram_tokens, because the function signature has changed and replacing it could break other places that use the original function
- Avoid frequent small data copies between GPU and CPU when computing ngrams;
- Avoid sorting the cached key and value tensors for the encoder-decoder attention;
- Reduce the cache memory of the encoder-decoder attention by a factor of beam_size so that a larger batch size can be run;
- Optimize the implementation that updates the scores for banned ngram tokens and banned bad-word tokens;
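The encoder-decoder-attention cache reduction above comes from keeping the encoder-side key/value once per input sample instead of once per beam, so the cache shrinks by beam_size and never has to be reordered when beams are re-ranked. A rough PyTorch sketch of the idea (the tensor shapes, function name, and broadcast-at-attention-time layout are assumptions for illustration, not the actual FastSeq code):

```python
import torch

def encoder_decoder_attn_scores(query, enc_key_per_sample, beam_size):
    """Attention scores with the encoder key cached once per sample.

    query:              (batch * beam_size, num_heads, 1, head_dim)
    enc_key_per_sample: (batch, num_heads, src_len, head_dim) -- stored once per
                        input sample rather than once per beam, so the cache is
                        beam_size times smaller and needs no reordering/sorting.
    """
    bsz_x_beam, num_heads, _, head_dim = query.shape
    bsz = bsz_x_beam // beam_size
    q = query.view(bsz, beam_size, num_heads, 1, head_dim)
    # Broadcast the per-sample key across beams only at attention time.
    k = enc_key_per_sample.unsqueeze(1)            # (batch, 1, heads, src_len, head_dim)
    scores = torch.matmul(q, k.transpose(-1, -2))  # (batch, beam, heads, 1, src_len)
    return scores.view(bsz_x_beam, num_heads, 1, -1)
```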