huggingface-transformers/examples
Anthony MOI 36434220fc
[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
* Use tokenizers pre-tokenized pipeline

* failing pretrokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up requied tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleans up - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-06-15 17:12:51 -04:00
..
adversarial Make DataCollator a callable (#5015) 2020-06-15 11:58:33 -04:00
benchmarking [isort] add matplotlib to known 3rd party dependencies (#4800) 2020-06-05 17:27:31 -04:00
bertology Make DataCollator a callable (#5015) 2020-06-15 11:58:33 -04:00
contrib Kill model archive maps (#4636) 2020-06-02 09:39:33 -04:00
distillation Kill model archive maps (#4636) 2020-06-02 09:39:33 -04:00
language-modeling add DistilBERT to supported models (#4558) 2020-05-25 14:50:45 -04:00
movement-pruning update `mvmt-pruning/saving_prunebert` (updating torch to 1.5) 2020-06-11 19:42:45 +00:00
multiple-choice Remove unused arguments in Multiple Choice example (#4853) 2020-06-09 20:05:09 -04:00
question-answering Updates args in tf squad example. (#4820) 2020-06-08 05:36:09 -04:00
summarization [HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510) 2020-06-15 17:12:51 -04:00
text-classification Remove unnecessary model_type arg in example (#4771) 2020-06-04 13:41:24 -04:00
text-generation run_pplm.py bug fix (#4867) 2020-06-09 19:14:27 -04:00
token-classification NER: fix construction of input examples for RoBERTa (#4943) 2020-06-15 08:30:40 -04:00
translation/t5 [isort] add known 3rd party to setup.cfg (#4053) 2020-04-28 17:12:00 -04:00
README.md [doc] Make it clearer that `text-generation` does not involve training 2020-06-05 14:59:22 +02:00
lightning_base.py BIG Reorganize examples (#4213) 2020-05-07 13:48:44 -04:00
requirements.txt [Benchmark] Memory benchmark utils (#4198) 2020-05-27 23:22:16 +02:00
test_examples.py per_device instead of per_gpu/error thrown when argument unknown (#4618) 2020-05-27 11:36:55 -04:00
xla_spawn.py [TPU] Doc, fix xla_spawn.py, only preprocess dataset once (#4223) 2020-05-08 14:10:05 -04:00

README.md

Examples

Version 2.9 of transformers introduces a new Trainer class for PyTorch, and its equivalent TFTrainer for TF 2. Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.0+.

Here is the list of all our examples:

  • grouped by task (all official examples work for multiple models)
  • with information on whether they are built on top of Trainer/TFTrainer (if not, they still work, they might just lack some features),
  • whether they also include examples for pytorch-lightning, which is a great fully-featured, general-purpose training library for PyTorch,
  • links to Colab notebooks to walk through the scripts and run them easily,
  • links to Cloud deployments to be able to deploy large-scale trainings in the Cloud with little to no setup.

This is still a work-in-progress – in particular documentation is still sparse – so please contribute improvements/pull requests.

The Big Table of Tasks

Task Example datasets Trainer support TFTrainer support pytorch-lightning Colab
language-modeling Raw text - - Open In Colab
text-classification GLUE, XNLI Open In Colab
token-classification CoNLL NER -
multiple-choice SWAG, RACE, ARC - Open In Colab
question-answering SQuAD - - -
text-generation - n/a n/a n/a Open In Colab
distillation All - - - -
summarization CNN/Daily Mail - - - -
translation WMT - - - -
bertology - - - - -
adversarial HANS - - - -

Important note

Important To make sure you can successfully run the latest versions of the example scripts, you have to install the library from source and install some example-specific requirements. Execute the following steps in a new virtual environment:

git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r ./examples/requirements.txt

One-click Deploy to Cloud (wip)

Azure

Deploy to Azure

Running on TPUs

When using Tensorflow, TPUs are supported out of the box as a tf.distribute.Strategy.

When using PyTorch, we support TPUs thanks to pytorch/xla. For more context and information on how to setup your TPU environment refer to Google's documentation and to the very detailed pytorch/xla README.

In this repo, we provide a very simple launcher script named xla_spawn.py that lets you run our example scripts on multiple TPU cores without any boilerplate. Just pass a --num_cores flag to this script, then your regular training script with its arguments (this is similar to the torch.distributed.launch helper for torch.distributed).

For example for run_glue:

python examples/xla_spawn.py --num_cores 8 \
	examples/text-classification/run_glue.py
	--model_name_or_path bert-base-cased \
	--task_name mnli \
	--data_dir ./data/glue_data/MNLI \
	--output_dir ./models/tpu \
	--overwrite_output_dir \
	--do_train \
	--do_eval \
	--num_train_epochs 1 \
	--save_steps 20000

Feedback and more use cases and benchmarks involving TPUs are welcome, please share with the community.