Added new experimental results

Taku Kudo 2017-04-06 18:45:53 +09:00
Parent b23ed045b1
Commit 10f39d1417
1 changed file with 44 additions and 1 deletion


@ -155,7 +155,50 @@ You can find that the original input sentence is restored from the vocabulary id
``` ```
```<output file>``` stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
## Experiments 1 (subword vs word-based model)
### Experimental settings
* Segmentation algorithms:
  * **Unigram**: SentencePiece with language-model-based segmentation. (`--model_type=unigram`)
  * **BPE**: SentencePiece with Byte Pair Encoding. [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] (`--model_type=bpe`)
  * **Moses**: [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) for English.
  * **KyTea**: [KyTea](http://www.phontron.com/kytea/) for Japanese.
  * **MeCab**: [MeCab](http://taku910.github.io/mecab/) for Japanese.
  * **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.
* NMT parameters: ([Google's Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf) is used in all experiments.)
  * Dropout prob: 0.2
  * num nodes: 512
  * num lstms: 6
* Evaluation metrics:
  * Case-sensitive BLEU on detokenized text, computed with the NIST scorer. An in-house rule-based detokenizer was used for Moses/KyTea/MeCab/neologd.
* Data sets:
  * [KFTT](http://www.phontron.com/kftt/index.html)
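
The two SentencePiece settings above differ only in the `--model_type` flag. As a minimal sketch, the corresponding `spm_train` command lines could be assembled as follows. The corpus file name, the model prefix, and the choice of a single corpus file for the shared 8k vocabulary are assumptions made for illustration; they are not stated in these settings.

```python
# Sketch only: builds spm_train command lines for the two SentencePiece settings.
# "train.en-ja.txt" and the "kftt_" prefix are hypothetical names, and using one
# corpus file for the shared vocabulary is an assumption.

def spm_train_args(model_type, corpus="train.en-ja.txt", vocab_size=8000):
    """Return the spm_train argument list for one segmentation algorithm."""
    if model_type not in ("unigram", "bpe"):
        raise ValueError(f"unsupported model type: {model_type}")
    return [
        "spm_train",
        f"--input={corpus}",
        f"--model_prefix=kftt_{model_type}",
        f"--vocab_size={vocab_size}",
        f"--model_type={model_type}",
    ]


if __name__ == "__main__":
    for model_type in ("unigram", "bpe"):
        print(" ".join(spm_train_args(model_type)))
```

Each list can be passed to `subprocess.run` on a machine where `spm_train` is installed.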
### Results (BLEU scores)
|Setting|vocab size|BLEU(dev)|BLEU(test)|src #tokens/sent.|trg #tokens/sent.|
|---|---|---|---|---|---|
|enja (Unigram)|8k (shared)|0.2718|0.2922|30.97|25.05|
|enja (BPE)|8k (shared)|0.2695|0.2919|31.76|25.43|
|enja (Moses/KyTea)|80k/80k|0.2514|0.2804|21.25|23.21|
|enja (Moses/MeCab)|80k/80k|0.2436|0.2739|21.25|21.20|
|enja (Moses/neologd)|80k/80k|0.2102|0.2350|21.25|18.47|
|jaen (Unigram)|8k (shared)|0.1984|0.2170|25.05|30.97|
|jaen (BPE)|8k (shared)|0.1975|0.2176|25.43|31.76|
|jaen (Moses/KyTea)|80k/80k|0.1697|0.1974|23.21|21.25|
|jaen (Moses/MeCab)|80k/80k|0.1654|0.1870|21.20|21.25|
|jaen (Moses/neologd)|80k/80k|0.1583|0.1838|18.47|21.25|
* **SentencePiece (Unigram/BPE)** outperforms the word-based methods **(Moses/KyTea/MeCab/neologd)** even with a vocabulary one tenth the size (8k vs. 80k).
* The number of tokens needed to represent Japanese sentences is nearly the same for **SentencePiece (Unigram)** and **KyTea**, even though the **SentencePiece** vocabulary is much smaller. This implies that SentencePiece can effectively compress sentences with a smaller symbol set.
* **Neologd** yields poor BLEU scores. Tokenizing sentences with a large named-entity dictionary might not be effective in neural text processing.
* **Unigram** shows a slightly better text compression ratio than **BPE**, but there is no significant difference in BLEU score.
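
The comparisons in these observations can be checked directly against the table. The sketch below copies the enja rows (vocabulary size, test BLEU, and Japanese tokens per sentence) and computes the ratios the bullets refer to. The `compare` helper and the dictionary layout are illustrative and not part of SentencePiece.

```python
# Values copied from the enja rows of the results table:
# (vocabulary size, BLEU on test, Japanese tokens per sentence).
RESULTS = {
    "Unigram": (8_000, 0.2922, 25.05),
    "BPE": (8_000, 0.2919, 25.43),
    "Moses/KyTea": (80_000, 0.2804, 23.21),
    "Moses/MeCab": (80_000, 0.2739, 21.20),
    "Moses/neologd": (80_000, 0.2350, 18.47),
}


def compare(a, b):
    """Compare setting `a` against setting `b` (hypothetical helper)."""
    vocab_a, bleu_a, tokens_a = RESULTS[a]
    vocab_b, bleu_b, tokens_b = RESULTS[b]
    return {
        "vocab_ratio": vocab_a / vocab_b,             # relative vocabulary size
        "bleu_diff": round(bleu_a - bleu_b, 4),       # absolute BLEU difference
        "token_ratio": round(tokens_a / tokens_b, 3), # relative sentence length
    }


if __name__ == "__main__":
    for pair in (("Unigram", "Moses/KyTea"), ("Unigram", "BPE")):
        print(pair, compare(*pair))
```

A `token_ratio` close to 1 alongside a `vocab_ratio` of 0.1 corresponds to the compression claim in the second bullet.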
## Experiments 2 (subwording with various pre-tokenizations)
### Experimental settings
We have evaluated SentencePiece segmentation with the following configurations.