Commit Graph

6667 Commits

Author SHA1 Message Date
Patrick von Platen b972125ced
Deprecate Wav2Vec2ForMaskedLM and add Wav2Vec2ForCTC (#10089)
* add wav2vec2CTC and deprecate for maskedlm

* remove from docs
2021-02-09 03:49:02 -05:00
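For context on the commit above, the new CTC head is used roughly as follows. This is a minimal sketch assuming the public `facebook/wav2vec2-base-960h` checkpoint and the `Wav2Vec2Processor` API (which may postdate this commit); it is not code from the PR itself.

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-base-960h"  # assumed checkpoint, not named in the PR
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# One second of silent 16 kHz audio stands in for a real waveform.
speech = np.zeros(16000, dtype=np.float32)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

# Greedy CTC decoding: argmax per frame, then collapse repeats and blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```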
Lysandre ba542ffb49 Fix deployment script 2021-02-09 08:43:00 +01:00
sandip 263fac71a2
Integration test for electra model (#10073) 2021-02-08 15:42:25 -05:00
Stas Bekman 781220acab
transition to new tests dir (#10080) 2021-02-08 12:41:52 -08:00
demSd 84acf0c7bb
remove token_type_ids from TokenizerBertGeneration output (#10070) 2021-02-08 13:05:32 -05:00
Juan Cruz-Benito e4bf9910dc
Removing run_pl_glue.py from text classification docs, include run_xnli.py & run_tf_text_classification.py (#10066)
* Removing run_pl_glue.py from seq classification docs

* Adding run_tf_text_classification.py

* Using :prefix_link: to refer local files

* Applying "make style" to the branch

* Update docs/source/task_summary.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Removing last underscores

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-02-08 13:04:21 -05:00
Lysandre 0dd579c9cf Docs for v4.3.0 2021-02-08 18:53:24 +01:00
Stas Bekman 322037e842
[trainer] deepspeed bug fixes and tests (#10039)
* deepspeed bug fixes and tests

* manual wrap?
2021-02-08 09:44:02 -08:00
Anthony MOI f285e4c3ad
Update tokenizers requirement (#10077) 2021-02-08 12:27:26 -05:00
noise-field ddaafd78fb
Fix mlflow param overflow clean (#10071)
* Unify logging with f-strings

* Get limits from MLflow rather than hardcode

* Add a check for parameter length overflow

Also constants are marked as internal

* Don't stop run in on_train_end

This causes bad behaviour when there is a separate validation step:
the validation gets recorded as a separate run.

* Fix style
2021-02-08 11:58:02 -05:00
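The overflow check described above can be sketched as follows. This is an illustration built on MLflow's public validation constants, not the exact callback code from the PR; `log_params_safely` is a hypothetical helper name.

```python
import mlflow
from mlflow.utils.validation import MAX_PARAM_VAL_LENGTH, MAX_PARAMS_TAGS_PER_BATCH

def log_params_safely(params: dict) -> None:
    """Log params inside an active MLflow run, dropping values MLflow would reject."""
    kept = {k: v for k, v in params.items() if len(str(v)) <= MAX_PARAM_VAL_LENGTH}
    # MLflow also caps how many params/tags a single batch may contain.
    items = list(kept.items())
    for i in range(0, len(items), MAX_PARAMS_TAGS_PER_BATCH):
        mlflow.log_params(dict(items[i : i + MAX_PARAMS_TAGS_PER_BATCH]))
```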
Olivier ece6c51458
[s2s examples] Replace -100 token ids with the tokenizer pad_id for compute_metrics (#10046)
* replace -100 token ids with the tokenizer pad_id for compute_metrics

* fixed typo for label_ids
2021-02-08 10:08:16 -05:00
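The replacement described above boils down to one numpy call before decoding, along these lines (a sketch, not the exact example code):

```python
import numpy as np

def decode_labels(label_ids, tokenizer):
    # -100 is the loss ignore-index, but tokenizers cannot decode it,
    # so swap it for the pad token id before calling batch_decode.
    label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)
    return tokenizer.batch_decode(label_ids, skip_special_tokens=True)
```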
Lysandre Debut c9df1b1d53
Model templates (#10072) 2021-02-08 09:07:02 -05:00
demSd 3b7e612a5e
Implementing the test integration of BertGeneration (#9990)
* claiming this issue

* Integration test for BertGeneration (Encoder and Decoder)

* fix code quality
2021-02-08 08:22:19 -05:00
Julien Plu cdd8659231
Fix TF template (#10069)
* Fix template

* Fix template
2021-02-08 08:10:50 -05:00
Patrick von Platen 9e795eac88
fix bert2bert test (#10063) 2021-02-08 16:04:28 +03:00
Julien Plu 31563e056d
Restore TF embeddings and attention layers to their previous version (#9890)
* Refacto BERT

* Restore all the concerned models

* Remove print

* Update template

* Apply Sylvain's and Morgan's comments

* Fix cast

* Put the cast inside call

* Remove cond in ebds

* Fix funnel

* Restore previous dot product (attention_scores) computation

* Add ConvBERT and BART

* Make all the S2S models ONNX compliant

* Fix test

* Fix check copies
2021-02-08 14:36:30 +03:00
Julien Plu 8bb52bd240
Disable temporarily too slow tests (Longformer/LED) (#10062)
* Disable temporarily too slow tests

* Fix style

* Fix template
2021-02-08 12:32:31 +01:00
Nicolas Patry b1aa4982cd
Cleaning up `ConversationalPipeline` to support more than DialoGPT. (#10002)
* Cleaning up `ConversationalPipeline` to support more than DialoGPT.

Currently, ConversationalPipeline is heavily biased towards DialoGPT,
which is the default model for this pipeline.

This PR proposes changes to put the DialoGPT-specific modifications
back into tokenizer-specific behavior wherever possible, by creating a
`_build_conversation_input_ids` function that takes a conversation as
input and returns a list of ints corresponding to the tokens. It feels
natural to put it here because different models probably have different
strategies to build input_ids from the full conversation, and it's the
tokenizer's job to transform strings into tokens (and vice versa).

If `_build_conversation_input_ids` is missing, the previous behavior is
used, so nothing breaks so far (except for blenderbot, where it's a fix).

This PR also contains a fix for too-long inputs. There used
to be dead code trying to limit the size of the incoming input.
The introduced fix is that we limit
within `_build_conversation_input_ids` to `tokenizer.model_max_length`.
This corresponds to the intent of the removed dead code and is actually
better, because `model_max_length` is different
from `max_length` (which is a default parameter for `generate`).

- Removed the `history` logic from Conversation, as it's no longer
relevant now that tokenization logic has moved to the tokenizer.
The tokenizer cannot save any cache, and the conversation cannot know
what is relevant or not.
It's also not usable from `blenderbot`, because the input_ids are
not append-only (the EOS token is always at the end).

- Added an `iter_texts` method on `Conversation`, because the code
was littered with some form of this iteration over
past/generated_responses.

* Removing torch mention in types.

* Adding type checking to `_build_conversation_input_ids`.

* Fixing import in strings.
2021-02-08 14:29:07 +03:00
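A rough sketch of the hook described above, written here as a free function over the tokenizer for illustration (the real hook is a tokenizer method, and the joining strategy differs per model):

```python
from typing import List

def build_conversation_input_ids(tokenizer, conversation) -> List[int]:
    """DialoGPT-style strategy: every turn joined by EOS, newest turn last."""
    input_ids = []
    for is_user, text in conversation.iter_texts():
        input_ids.extend(tokenizer.encode(text, add_special_tokens=False))
        input_ids.append(tokenizer.eos_token_id)
    # Truncate from the left so the most recent turns survive; the cap is
    # tokenizer.model_max_length, not generate()'s unrelated max_length.
    if len(input_ids) > tokenizer.model_max_length:
        input_ids = input_ids[-tokenizer.model_max_length:]
    return input_ids
```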
Lysandre Debut ae37ceacbd
Fix typo (#10064) 2021-02-08 06:02:05 -05:00
Patrick von Platen 9a0399e18d
fix bart tests (#10060) 2021-02-08 13:25:09 +03:00
Sylvain Gugger b01483faa0
Truncate max length if needed in all examples (#10034) 2021-02-08 05:03:55 -05:00
Sylvain Gugger 45aaf5f7ab
A few fixes in the documentation (#10033) 2021-02-08 05:02:01 -05:00
Sylvain Gugger 04fd783cc5
Check copies match full class/function names (#10030) 2021-02-08 04:58:25 -05:00
Lysandre Debut d51302cca0
Fix slow dpr test (#10059)
* Correct cast to device

* Comment back the slow test
2021-02-08 04:43:25 -05:00
sandip 12e44af5d3
Integration test for FlauBert (#10022) 2021-02-08 04:36:50 -05:00
Stas Bekman 24db8cc329
Can't mix --fp16 and --device cpu (#10041) 2021-02-07 17:54:20 -08:00
Stas Bekman 769948fad2
json to jsonlines, and doc, and typo (#10043) 2021-02-07 17:51:34 -08:00
Stas Bekman 8ea412a86f
[examples] make run scripts executable (#10037)
* make executable

* make executable

* same for the template

* cleanup
2021-02-05 15:51:18 -08:00
Suraj Patil 1cd16512dc
[examples/seq2seq] support label smoothing (#9844)
* add prepare_decoder_input_ids_from_labels in s2s models

* support lbl smoothing and enc/emb freezing

* fix freezing

* use pad_token_id from config

* remove embed freezing and add warning

* prepare decoder_input_ids inside DataCollatorForSeq2Seq
2021-02-05 23:21:57 +05:30
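The `prepare_decoder_input_ids_from_labels` idea above is essentially the standard shift-right. A sketch under that assumption (the real per-model methods also handle model-specific start tokens):

```python
import torch

def shift_right(labels: torch.Tensor, decoder_start_token_id: int,
                pad_token_id: int) -> torch.Tensor:
    # Shift labels one position to the right and prepend the decoder start
    # token, so the decoder predicts token t from tokens < t.
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # Loss ignore-index positions (-100) cannot be fed to the embeddings.
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids
```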
Patrick von Platen b9720dd6f2
Bump minimum Jax requirement to 2.8.0 (#10027)
* Bump minimum Jax requirement to 2.8.0

* update table
2021-02-05 16:20:26 +03:00
Patrick von Platen 89be094e29
[Templates] Add template "call-for-model" markdown and "call-for-big-bird" markdown (#9921)
* add big bird

* change teacher to mentor

* add proposal template

* adapt template

* delete old template

* correct some links

* finish template

* create big bird from template

* add big bird

* improve boxes

* finish boxes

* add pointers for BigBird

* finish big bird

* up

* up

* up

* up

* apply Lysandre's and Sylvain's suggestions

* delete bogus file

* correct markdown

* try different style

* try different style

* finalize
2021-02-05 15:47:54 +03:00
Lysandre Debut 4bbad604eb
Clarify QA pipeline output based on character (#10021)
* Clarify QA pipeline output based on character

* Style
2021-02-05 05:40:30 -05:00
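The clarified behavior: `start` and `end` in the QA pipeline output are character offsets into the context string, not token indices. A minimal illustration (the default checkpoint is resolved by the pipeline itself):

```python
from transformers import pipeline

qa = pipeline("question-answering")
context = "Hugging Face is based in New York City and Paris."
result = qa(question="Where is Hugging Face based?", context=context)

# `start`/`end` index characters in `context`, so slicing recovers the answer.
assert context[result["start"]:result["end"]] == result["answer"]
```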
Lysandre ad2c431097 Update doc deployment script path 2021-02-05 11:18:59 +01:00
Lysandre 95a5f271e5 Update doc deployment script 2021-02-05 11:10:29 +01:00
Sylvain Gugger 3be965c5db
Update doc for pre-release (#10014)
* Update doc for pre-release

* Use stable as default

* Use the right commit :facepalms:
2021-02-04 16:52:27 -05:00
Sylvain Gugger ba607db180 Bump version 2021-02-04 16:23:05 -05:00
Sylvain Gugger 4cd22512de Release: 4.3.0.rc1 2021-02-04 15:41:19 -05:00
Sylvain Gugger 4739ce177d Fix test for sagemaker and TPU integrations 2021-02-04 15:06:58 -05:00
Sylvain Gugger 21b3922e35
Authorize last version of tokenizer (#9799)
* Authorize last version of tokenizer

* Update version table

* Fix conversion of spm tokenizers and fix some hub links

* Bump tokenizers version to 0.10.1rc1

* Add script to check tokenizers conversion with XNLI

* Add some more mask_token lstrip support

* Must modify mask_token in slow tokenizers too

* Keep using the old method for Pegasus

* add missing import

Co-authored-by: Anthony MOI <m.anthony.moi@gmail.com>
2021-02-04 14:18:33 -05:00
Nicolas Patry d5888ef0ab
Hotfixing tests: blenderbot decoder-only tests also need to remove
`encoder_no_repeat_ngram_size` from their config (#10003)
2021-02-04 11:41:34 -05:00
Stas Bekman 8c3b1fcb67
[trainer] a few fixes (#9993)
* trainer fixes

* don't switch the model just for deepspeed and mp

* correct the fix
2021-02-04 07:44:56 -08:00
Daniel Stancl 714855bd8f
Remove "double" assignment in TF-BART like models (#9997)
* Replace `attn_weights = attn_wegihts = tf.reshape(...)`
with `attn_weights = tf.reshape(...)`, thus removing the
unintentional "double" assignment.
2021-02-04 10:24:47 -05:00
Sylvain Gugger b72f16b3ec Fix doc for TFConverBertModel 2021-02-04 10:14:46 -05:00
Nicolas Patry aeb18b9224
Adding new `encoder_no_repeat_ngram_size` to `generate`. (#9984)
Adding new `encoder_no_repeat_ngram_size` to `generate`.

Blenderbot results seemed off compared to the original ParlAI script:
`https://parl.ai/projects/recipes/`. Notably, the model seems
to repeat a lot of what was said during the conversation.

The actual problem was that ParlAI's `no_repeat_ngram_size` applies
to the `encoder_input_ids`, while HF's `no_repeat_ngram_size` applies
to the previously generated ids (within the decoder). The conversation
history of blenderbot lives in the `encoder` part, which explains
why HF's implementation had the repetitions.

This fix focused on blenderbot (*not* blenderbot-small) and added
tests for both, because they are quite different in configuration.

This change includes:

- Adding a new EncoderNoRepeatLogitProcessor.
- Adding 1 new arg to `generate` (`encoder_no_repeat_ngram_size`)
- Adding 1 new config parameter `encoder_no_repeat_ngram_size`.
- Adding 2 tests: one for the pipeline (high level, with inputs that
exhibited the repeat behavior) and one low level for EncoderNoRepeatLogitProcessor.
- Factored NoRepeatLogitProcessor so that logic could be reused.

Further work:

- Blenderbot conversational pipeline still does not behave correctly,
as the way input is prepared within the pipeline is still incorrect
(follow-up PR)
- Blenderbot allows the bot to have personas, which is done by
prepending "your persona: XXXX" to the input; this could be explored
in a follow-up PR too.

@patrickvonplaten
@LysandreJik

* Update src/transformers/generation_logits_process.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/generation_utils.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/generation_utils.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Update src/transformers/configuration_utils.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Doc quality.

* Fixing test.

* Last fixes.

* Fixing to account for batch_size.

* Update src/transformers/configuration_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/generation_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-02-04 15:00:18 +01:00
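At the call site, the new argument is used like this. A usage sketch assuming the public `facebook/blenderbot-400M-distill` checkpoint (not named in the PR):

```python
from transformers import BlenderbotForConditionalGeneration, BlenderbotTokenizer

name = "facebook/blenderbot-400M-distill"  # assumed checkpoint
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("My friends are cool but they eat too many carbs.",
                   return_tensors="pt")
# Forbid the decoder from reusing any 3-gram that already occurs in the
# encoder input (the conversation history), on top of no_repeat_ngram_size,
# which only looks at previously *generated* ids.
reply_ids = model.generate(**inputs, encoder_no_repeat_ngram_size=3)
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True))
```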
Lysandre Debut e89c959af9
Fix model templates (#9999) 2021-02-04 07:47:26 -05:00
Daniel Hug 804cd185d8
Added integration testing for the DistilBert model from issue #9948 (#9995) 2021-02-04 04:24:59 -05:00
demSd 00031785a8
BartForCausalLM analogs to `ProphetNetForCausalLM` (#9128)
* initialize bart4causalLM

* create BartDecoderWrapper, setters/getters

* delete spaces

* forward and additional methods

* update cache function, loss function, remove ngram* params in data class.

* add bartcausallm, bartdecoder testing

* correct bart for causal lm

* remove at

* add mbart as well

* up

* fix typo

* up

* correct

* add pegasusforcausallm

* add blenderbotforcausallm

* add blenderbotsmallforcausallm

* add marianforcausallm

* add test for MarianForCausalLM

* add Pegasus test

* add BlenderbotSmall test

* add blenderbot test

* fix a fail

* fix an import fail

* a fix

* fix

* Update modeling_pegasus.py

* fix models

* fix inputs_embeds setting getter

* adapt tests

* correct repo utils check

* finish test improvement

* fix tf models as well

* make style

* make fix-copies

* fix copies

* run all tests

* last changes

* fix all tests

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-02-04 11:56:12 +03:00
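The resulting classes plug into the usual causal-LM interface. A hedged usage sketch for the BART variant (loading a seq2seq checkpoint into the decoder-only wrapper simply discards the encoder weights):

```python
from transformers import AutoTokenizer, BartForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForCausalLM.from_pretrained("facebook/bart-base")

inputs = tokenizer("Hello, my dog is", return_tensors="pt")
# Passing labels yields the standard next-token (causal LM) loss.
outputs = model(**inputs, labels=inputs.input_ids)
print(outputs.loss, outputs.logits.shape)  # LM loss and (batch, seq, vocab)
```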
Sylvain Gugger 7898fc03b1
Add `from_slow` in fast tokenizers build and fixes some bugs (#9987) 2021-02-04 03:34:23 -05:00
Stefan Schweter 6244727e05
distilbert: fix creation of sinusoidal embeddings when using PyTorch 1.8+ (#9917) 2021-02-03 11:42:16 -05:00
sandip 2f06f2bcd6
Albert model integration testing added (#9980) 2021-02-03 11:41:10 -05:00