Reorganize examples (#9010)
* Reorganize example folder
* Continue reorganization
* Change requirements for tests
* Final cleanup
* Finish regroup with tests all passing
* Copyright
* Requirements and readme
* Make a full link for the documentation
* Address review comments
* Apply suggestions from code review
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Add symlink
* Reorg again
* Apply suggestions from code review
  Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
* Adapt title
* Update to new structure
* Remove test
* Update READMEs

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Parent: 86896de064
Commit: 783d7d2629

@@ -309,7 +309,7 @@ jobs:
- v0.4-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[sklearn,torch,sentencepiece,testing]
- run: pip install -r examples/requirements.txt
- run: pip install -r examples/_tests_requirements.txt
- save_cache:
key: v0.4-torch_examples-{{ checksum "setup.py" }}
paths:

@@ -16,59 +16,58 @@ limitations under the License.

# Examples

Version 2.9 of 🤗 Transformers introduced a new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class for PyTorch, and its equivalent [`TFTrainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_tf.py) for TF 2.
Running the examples requires PyTorch 1.3.1+ or TensorFlow 2.2+.
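
If you have never used these classes, the sketch below shows the pattern that most of the PyTorch examples follow; the model, dataset and hyperparameters are illustrative placeholders rather than the settings of any particular script.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder task: binary sentiment classification on GLUE/SST-2.
raw = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

encoded = raw.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="sst2-bert",              # where checkpoints and logs are written
    num_train_epochs=3,
    per_device_train_batch_size=32,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
print(trainer.evaluate())
```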

Here is the list of all our examples:
- **grouped by task** (all official examples work for multiple models)
- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might
just lack some features),
- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library.
- links to **Colab notebooks** to walk through the scripts and run them easily,
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.

This folder contains actively maintained examples of use of 🤗 Transformers organized along NLP tasks. If you are looking for an example that used to
be in this folder, it may have moved to our [research projects](https://github.com/huggingface/transformers/tree/master/examples/research_projects) subfolder (which contains frozen snapshots of research projects).

## Important note

**Important**

To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements.
Execute the following steps in a new virtual environment:

To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r ./examples/requirements.txt
```

Then cd into the example folder of your choice and run
```bash
pip install -r requirements.txt
```

Alternatively, you can run the version of the examples as they were for your current version of Transformers via (for instance with v3.4.0):
Alternatively, you can run the version of the examples as they were for your current version of Transformers via (for instance with v3.5.1):
```bash
git checkout tags/v3.4.0
git checkout tags/v3.5.1
```

## The Big Table of Tasks

Here is the list of all our examples:
- with information on whether they are **built on top of `Trainer`/`TFTrainer`** (if not, they still work, they might
just lack some features),
- whether or not they leverage the [🤗 Datasets](https://github.com/huggingface/datasets) library.
- links to **Colab notebooks** to walk through the scripts and run them easily,
<!--
Coming soon!
- links to **Cloud deployments** to be able to deploy large-scale trainings in the Cloud with little to no setup.
-->

| Task | Example datasets | Trainer support | TFTrainer support | 🤗 Datasets | Colab
|---|---|:---:|:---:|:---:|:---:|
| [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
| [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | ✅ | -
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`distillation`**](https://github.com/huggingface/transformers/tree/master/examples/distillation) | All | - | - | - | -
| [**`summarization`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | CNN/Daily Mail | ✅ | - | - | -
| [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
| [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
| [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -
| [**`bertology`**](https://github.com/huggingface/transformers/tree/master/examples/bertology) | - | - | - | - | -
| [**`adversarial`**](https://github.com/huggingface/transformers/tree/master/examples/adversarial) | HANS | ✅ | - | - | -


<br>

<!--
## One-click Deploy to Cloud (wip)

**Coming soon!**
-->

## Running on TPUs

@@ -0,0 +1,20 @@
tensorboard
scikit-learn
seqeval
psutil
sacrebleu
rouge-score
tensorflow_datasets
matplotlib
git-python==1.0.3
faiss-cpu
streamlit
elasticsearch
nltk
pandas
datasets >= 1.1.3
fire
pytest
conllu
sentencepiece != 0.1.92
protobuf

@@ -1,3 +1,19 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# 🤗 Benchmark results

Here, you can find a list of the different benchmark results created by the community.

@@ -1,3 +1,17 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import csv
from collections import defaultdict
from dataclasses import dataclass, field

@@ -1,3 +1,17 @@
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# tests directory-specific settings - this file is run automatically
# by pytest before any tests are run

@@ -1,5 +0,0 @@
# Community contributed examples

This folder contains examples which are not actively maintained (mostly contributed by the community).

Using these examples together with a recent version of the library usually requires to make small (sometimes big) adaptations to get the scripts working.

@@ -1,3 +1,19 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

## Language model training

Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2,

@@ -0,0 +1,3 @@
datasets >= 1.1.3
sentencepiece != 0.1.92
protobuf

@@ -0,0 +1,21 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Legacy examples

This folder contains examples which are not actively maintained (mostly contributed by the community).

Using these examples together with a recent version of the library usually requires making small (sometimes big) adaptations to get the scripts working.

@@ -21,7 +21,7 @@ mkdir -p $OUTPUT_DIR
# Add parent directory to python path to access lightning_base.py
export PYTHONPATH="../":"${PYTHONPATH}"

python3 run_pl_glue.py --gpus 1 --data_dir $DATA_DIR \
python3 run_glue.py --gpus 1 --data_dir $DATA_DIR \
--task $TASK \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \

@@ -31,7 +31,7 @@ mkdir -p $OUTPUT_DIR
# Add parent directory to python path to access lightning_base.py
export PYTHONPATH="../":"${PYTHONPATH}"

python3 run_pl_ner.py --data_dir ./ \
python3 run_ner.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \

@@ -26,7 +26,7 @@ export SEED=1
# Add parent directory to python path to access lightning_base.py
export PYTHONPATH="../":"${PYTHONPATH}"

python3 run_pl_ner.py --data_dir ./ \
python3 run_ner.py --data_dir ./ \
--task_type POS \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \

@@ -0,0 +1,229 @@
## Token classification

Based on the script [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/token-classification/run_ner.py).

The following examples are covered in this section:

* NER on the GermEval 2014 (German NER) dataset
* Emerging and Rare Entities task: WNUT’17 (English NER) dataset

Details and results for the fine-tuning are provided by @stefan-it.

### GermEval 2014 (German NER) dataset

#### Data (Download and pre-processing steps)

Data can be obtained from the [GermEval 2014](https://sites.google.com/site/germeval2014ner/data) shared task page.

Here are the commands for downloading and pre-processing the train, dev and test datasets. The original data format has four (tab-separated) columns; in a pre-processing step, only the two relevant columns (token and outer span NER annotation) are extracted:

```bash
curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
```

The GermEval 2014 dataset contains some strange "control character" tokens like `'\x96', '\u200e', '\x95', '\xad' or '\x80'`.
One problem with these tokens is that `BertTokenizer` returns an empty token for them, resulting in misaligned `InputExample`s.
The `preprocess.py` script located in the `scripts` folder a) filters these tokens and b) splits longer sentences into smaller ones (once the max. subtoken length is reached).
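
The sketch below illustrates the core idea behind that pre-processing step. It is an illustrative re-implementation rather than the exact code shipped in `scripts/preprocess.py`, but it takes the same three positional arguments (input file, model name, max length) used in the commands further down:

```python
import sys

from transformers import AutoTokenizer

# Illustrative sketch of scripts/preprocess.py: drop tokens the tokenizer cannot
# encode and split sentences that would exceed the maximum subtoken length.
dataset_path, model_name, max_len = sys.argv[1], sys.argv[2], int(sys.argv[3])

tokenizer = AutoTokenizer.from_pretrained(model_name)
max_len -= tokenizer.num_special_tokens_to_add()  # reserve room for [CLS]/[SEP]

subword_len_counter = 0
with open(dataset_path, encoding="utf-8") as f:
    for line in f:
        line = line.rstrip()
        if not line:
            print(line)  # empty lines mark sentence boundaries, keep them
            subword_len_counter = 0
            continue

        token = line.split()[0]
        current_subwords_len = len(tokenizer.tokenize(token))

        # Control characters such as '\x96' produce an empty encoding -> filter them out.
        if current_subwords_len == 0:
            continue

        # Force a new "sentence" once the max subtoken length would be exceeded.
        if subword_len_counter + current_subwords_len > max_len:
            print("")
            print(line)
            subword_len_counter = current_subwords_len
            continue

        subword_len_counter += current_subwords_len
        print(line)
```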

Let's define some variables that we need for further pre-processing steps and training the model:

```bash
export MAX_LENGTH=128
export BERT_MODEL=bert-base-multilingual-cased
```

Run the pre-processing script on training, dev and test datasets:

```bash
python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
```

The GermEval 2014 dataset has many more labels than the CoNLL-2002/2003 datasets, so a custom set of labels must be used:

```bash
cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
```

#### Prepare the run

Additional environment variables must be set:

```bash
export OUTPUT_DIR=germeval-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1
```

#### Run the PyTorch version

To start training, just run:

```bash
python3 run_ner.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict
```

If your GPU supports half-precision training, just add the `--fp16` flag. After training, the model will be evaluated on both the development and test datasets.

#### JSON-based configuration file

Instead of passing all parameters via command-line arguments, the `run_ner.py` script also supports reading parameters from a JSON-based configuration file:

```json
{
    "data_dir": ".",
    "labels": "./labels.txt",
    "model_name_or_path": "bert-base-multilingual-cased",
    "output_dir": "germeval-model",
    "max_seq_length": 128,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 32,
    "save_steps": 750,
    "seed": 1,
    "do_train": true,
    "do_eval": true,
    "do_predict": true
}
```

It must be saved with a `.json` extension and can be used by running `python3 run_ner.py config.json`.
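
For reference, the mechanism behind this is `HfArgumentParser`, which can fill the same argument dataclasses either from the command line or from a JSON file. The snippet below is a simplified sketch of that dispatch; the `DataArguments` dataclass is a stand-in for the script's full argument set:

```python
import sys
from dataclasses import dataclass, field

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class DataArguments:
    # Stand-in for the script's own data arguments.
    data_dir: str = field(default=".")
    labels: str = field(default="./labels.txt")
    max_seq_length: int = field(default=128)


parser = HfArgumentParser((DataArguments, TrainingArguments))

if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
    # A single .json argument: read every parameter from the configuration file.
    data_args, training_args = parser.parse_json_file(json_file=sys.argv[1])
else:
    # Otherwise fall back to regular command-line parsing.
    data_args, training_args = parser.parse_args_into_dataclasses()

print(training_args.output_dir, data_args.max_seq_length)
```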

#### Evaluation

Evaluation on the development dataset outputs the following for our example:

```bash
10/04/2019 00:42:06 - INFO - __main__ - ***** Eval results *****
10/04/2019 00:42:06 - INFO - __main__ - f1 = 0.8623348017621146
10/04/2019 00:42:06 - INFO - __main__ - loss = 0.07183869666975543
10/04/2019 00:42:06 - INFO - __main__ - precision = 0.8467916366258111
10/04/2019 00:42:06 - INFO - __main__ - recall = 0.8784592370979806
```

On the test dataset, the following results could be achieved:

```bash
10/04/2019 00:42:42 - INFO - __main__ - ***** Eval results *****
10/04/2019 00:42:42 - INFO - __main__ - f1 = 0.8614389652384803
10/04/2019 00:42:42 - INFO - __main__ - loss = 0.07064602487454782
10/04/2019 00:42:42 - INFO - __main__ - precision = 0.8604651162790697
10/04/2019 00:42:42 - INFO - __main__ - recall = 0.8624150210424085
```

### Emerging and Rare Entities task: WNUT’17 (English NER) dataset

Description of the WNUT’17 task from the [shared task website](http://noisy-text.github.io/2017/index.html):

> The WNUT’17 shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions.
> Named entities form the basis of many modern approaches to other tasks (like event clustering and summarization), but recall on
> them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms.

Six labels are available in the dataset. An overview can be found on this [page](http://noisy-text.github.io/2017/files/).

#### Data (Download and pre-processing steps)

The dataset can be downloaded from the [official GitHub](https://github.com/leondz/emerging_entities_17) repository.

The following commands show how to prepare the dataset for fine-tuning:

```bash
mkdir -p data_wnut_17

curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/wnut17train.conll' | tr '\t' ' ' > data_wnut_17/train.txt.tmp
curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/emerging.dev.conll' | tr '\t' ' ' > data_wnut_17/dev.txt.tmp
curl -L 'https://raw.githubusercontent.com/leondz/emerging_entities_17/master/emerging.test.annotated' | tr '\t' ' ' > data_wnut_17/test.txt.tmp
```

Let's define some variables that we need for further pre-processing steps:

```bash
export MAX_LENGTH=128
export BERT_MODEL=bert-large-cased
```

Here we use the English BERT large model for fine-tuning.
The `preprocess.py` script splits longer sentences into smaller ones (once the max. subtoken length is reached):

```bash
python3 scripts/preprocess.py data_wnut_17/train.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/train.txt
python3 scripts/preprocess.py data_wnut_17/dev.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/dev.txt
python3 scripts/preprocess.py data_wnut_17/test.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/test.txt
```

In the last pre-processing step, the `labels.txt` file needs to be generated. This file contains all available labels:

```bash
cat data_wnut_17/train.txt data_wnut_17/dev.txt data_wnut_17/test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > data_wnut_17/labels.txt
```

#### Run the PyTorch version

Fine-tuning with the PyTorch version can be started using the `run_ner.py` script. In this example we use a JSON-based configuration file.

This configuration file looks like:

```json
{
    "data_dir": "./data_wnut_17",
    "labels": "./data_wnut_17/labels.txt",
    "model_name_or_path": "bert-large-cased",
    "output_dir": "wnut-17-model-1",
    "max_seq_length": 128,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 32,
    "save_steps": 425,
    "seed": 1,
    "do_train": true,
    "do_eval": true,
    "do_predict": true,
    "fp16": false
}
```

If your GPU supports half-precision training, please set `fp16` to `true`.

Save this JSON-based configuration under `wnut_17.json`. The fine-tuning can be started with `python3 run_ner.py wnut_17.json`.

#### Evaluation

Evaluation on the development dataset outputs the following:

```bash
05/29/2020 23:33:44 - INFO - __main__ - ***** Eval results *****
05/29/2020 23:33:44 - INFO - __main__ - eval_loss = 0.26505235286212275
05/29/2020 23:33:44 - INFO - __main__ - eval_precision = 0.7008264462809918
05/29/2020 23:33:44 - INFO - __main__ - eval_recall = 0.507177033492823
05/29/2020 23:33:44 - INFO - __main__ - eval_f1 = 0.5884802220680084
05/29/2020 23:33:44 - INFO - __main__ - epoch = 3.0
```

On the test dataset, the following results could be achieved:

```bash
05/29/2020 23:33:44 - INFO - transformers.trainer - ***** Running Prediction *****
05/29/2020 23:34:02 - INFO - __main__ - eval_loss = 0.30948806500973547
05/29/2020 23:34:02 - INFO - __main__ - eval_precision = 0.5840108401084011
05/29/2020 23:34:02 - INFO - __main__ - eval_recall = 0.3994439295644115
05/29/2020 23:34:02 - INFO - __main__ - eval_f1 = 0.47440836543753434
```

WNUT’17 is a very difficult task. Current state-of-the-art results on this dataset can be found [here](http://nlpprogress.com/english/named_entity_recognition.html).

@@ -20,7 +20,7 @@ export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1

python3 run_ner_old.py \
python3 run_ner.py \
--task_type NER \
--data_dir . \
--labels ./labels.txt \

@@ -21,7 +21,7 @@ export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1

python3 run_ner_old.py \
python3 run_ner.py \
--task_type Chunk \
--data_dir . \
--model_name_or_path $BERT_MODEL \

@@ -21,7 +21,7 @@ export NUM_EPOCHS=3
export SAVE_STEPS=750
export SEED=1

python3 run_ner_old.py \
python3 run_ner.py \
--task_type POS \
--data_dir . \
--model_name_or_path $BERT_MODEL \

@@ -1,3 +1,19 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

## Multiple Choice

Based on the script [`run_multiple_choice.py`]().

@@ -0,0 +1,2 @@
sentencepiece != 0.1.92
protobuf

@@ -1,8 +1,29 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

## SQuAD

Based on the script [`run_squad.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py).
Based on the script [`run_qa.py`](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_qa.py).

**Note:** This script only works with models that have a fast tokenizer (backed by the 🤗 Tokenizers library) as it
uses special features of those tokenizers. You can check if your favorite model has a fast tokenizer in
[this table](https://huggingface.co/transformers/index.html#bigtable); if it doesn't, you can still use the old version
of the script.
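
You can also check programmatically; the snippet below is an illustrative way to do it (not part of the example scripts), using the tokenizer's `is_fast` attribute:

```python
from transformers import AutoTokenizer

# Illustrative check: a "fast" tokenizer is backed by the 🤗 Tokenizers library.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(type(tokenizer).__name__, tokenizer.is_fast)  # e.g. BertTokenizerFast True
```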

The old version of this script can be found [here](https://github.com/huggingface/transformers/blob/master/examples/contrib/legacy/question-answering/run_squad.py).

#### Fine-tuning BERT on SQuAD1.0

@@ -0,0 +1 @@
datasets >= 1.1.3

@@ -0,0 +1,28 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Research projects

This folder contains various research projects using 🤗 Transformers. They are not maintained and require a specific
version of 🤗 Transformers that is indicated in the requirements file of each folder. Updating them to the most recent version of the library will require some work.

To use any of them, just run the command
```
pip install -r requirements.txt
```
inside the folder of your choice.

If you need help with any of those, contact the author(s), indicated at the top of the `README` of each folder.

@@ -0,0 +1 @@
transformers == 3.5.1

@@ -0,0 +1 @@
transformers == 3.5.1

@@ -1,4 +1,4 @@
transformers
transformers == 3.5.1

# For ROUGE
nltk

@@ -0,0 +1 @@
transformers == 3.5.1

@@ -0,0 +1 @@
transformers == 3.5.1

@@ -1,5 +1,7 @@
# Distil*

Author: @VictorSanh

This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT, DistilRoBERTa and DistilGPT2.

**January 20, 2020 - Bug fixing** We have recently discovered and fixed [a bug](https://github.com/huggingface/transformers/commit/48cbf267c988b56c71a2380f748a3e6092ccaed3) in the evaluation of our `run_*.py` scripts that caused the reported metrics to be over-estimated on average. We have updated all the metrics with the latest runs.

@@ -1,5 +1,7 @@
# Long Form Question Answering

Author: @yjernite

This folder contains the code for the Long Form Question Answering [demo](http://35.226.96.115:8080/) as well as methods to train and use a fully end-to-end Long Form Question Answering system using the [🤗transformers](https://github.com/huggingface/transformers) and [🤗datasets](https://github.com/huggingface/datasets) libraries.

You can use these methods to train your own system by following along the associated [notebook](https://github.com/huggingface/notebooks/blob/master/longform-qa/Long_Form_Question_Answering_with_ELI5_and_Wikipedia.ipynb) or [blog post](https://yjernite.github.io/lfqa.html).

@@ -0,0 +1,4 @@
datasets >= 1.1.3
faiss-cpu
streamlit
elasticsearch

@@ -1,5 +1,7 @@
# Movement Pruning: Adaptive Sparsity by Fine-Tuning

Author: @VictorSanh

*Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of *movement pruning*, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters:*

| Fine-pruning+Distillation<br>(Teacher=BERT-base fine-tuned) | BERT base<br>fine-tuned | Remaining<br>Weights (%) | Magnitude Pruning | L0 Regularization | Movement Pruning | Soft Movement Pruning |
Some files were not shown because too many files changed in this diff.