[fix] docs/dataprep.md improved

Sass Bálint 2020-05-14 15:18:50 +02:00
Parent 62046ed949
Commit 9263465ebc
1 changed file with 3 additions and 5 deletions


@@ -3,7 +3,7 @@ The following steps are to prepare Wikipedia corpus for pretraining. However, th
1. Download the wiki dump file from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
This is a bzip2 archive and needs to be decompressed, e.g. with `bunzip2`.
-2. Clone [Wikiextractor](https://github.com/attardi/wikiextractor), and run:
+2. Clone [Wikiextractor](https://github.com/attardi/wikiextractor), and run it:
```
git clone https://github.com/attardi/wikiextractor
python3 wikiextractor/WikiExtractor.py -o out -b 1000M enwiki-latest-pages-articles.xml
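# The dump must be decompressed to plain XML before this step (see step 1).
# -o out   : write the extracted text under the out/ directory
# -b 1000M : split the extracted text into files of at most ~1000 MB each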
@@ -32,11 +32,9 @@ The following steps are to prepare Wikipedia corpus for pretraining. However, th
```
_output:_ `data_shards` directory
6. Run:
-```python3 AzureML-BERT/pretrain/PyTorch/dataprep/create_pretraining.py --input_dir=data_shards --output_dir=pickled_pretrain_data --do_lower_case=true
+```
+python3 AzureML-BERT/pretrain/PyTorch/dataprep/create_pretraining.py --input_dir=data_shards --output_dir=pickled_pretrain_data --do_lower_case=true
```
This script will convert each file into a pickled `.bin` file.
_output:_ `pickled_pretrain_data` directory
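A quick way to sanity-check the result is to load one shard back; the sketch below assumes the `.bin` files are ordinary Python pickles (an assumption about `create_pretraining.py`'s output, not a documented guarantee):
```
import pickle
from pathlib import Path

# Hedged sanity check: load one pickled shard and report what it holds.
# Assumes each .bin under pickled_pretrain_data/ is a plain Python pickle.
shard = next(Path("pickled_pretrain_data").glob("*.bin"), None)
if shard is None:
    raise SystemExit("no .bin shards found; run create_pretraining.py first")
with shard.open("rb") as f:
    data = pickle.load(f)
size = len(data) if hasattr(data, "__len__") else "unknown"
print(f"{shard.name}: {type(data).__name__} with {size} top-level entries")
```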
---