Added new experimental results
Parent: b23ed045b1
Commit: 10f39d1417
45
README.md
@@ -155,7 +155,50 @@ You can find that the original input sentence is restored from the vocabulary id
```
```<output file>``` stores the vocabulary and the emission log probability of each entry. The vocabulary id corresponds to the line number in this file.

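The correspondence between vocabulary ids and line numbers can be sketched as follows. This is a minimal illustration, not part of SentencePiece: `load_vocab` is a hypothetical helper, and it assumes that each line holds a piece and its log probability separated by a tab and that ids start at 0 on the first line. Neither detail is stated above.

```python
import io


def load_vocab(lines):
    """Return a list of (piece, log_prob); the list index is the vocabulary id."""
    vocab = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Skip blank lines (assumed to occur only at the end of the file).
            continue
        # Assumption: the piece and its log probability are tab-separated.
        piece, log_prob = line.rsplit("\t", 1)
        vocab.append((piece, float(log_prob)))
    return vocab


# Toy stand-in for <output file>; these entries are made up for illustration.
sample = io.StringIO("<unk>\t0\n▁the\t-3.2\n▁of\t-3.9\n")
vocab = load_vocab(sample)

# Under the 0-based assumption, id 1 is the second line of the file.
piece, log_prob = vocab[1]
print(piece, log_prob)
```

With a real vocabulary file, the same function can be called on an open file handle instead of the in-memory sample.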
## Experiments

## Experiments 1 (subword vs word-based model)
### Experimental settings

* Segmentation algorithms:
  * **Unigram**: SentencePiece with a language-model based segmentation. (`--model_type=unigram`)
  * **BPE**: SentencePiece with Byte Pair Encoding. [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] (`--model_type=bpe`)
  * **Moses**: [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) for English.
  * **KyTea**: [KyTea](http://www.phontron.com/kytea/) for Japanese.
  * **MeCab**: [MeCab](http://taku910.github.io/mecab/) for Japanese.
  * **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.

* NMT parameters: ([Google’s Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf) is used in all experiments.)
  * Dropout probability: 0.2
  * Number of nodes: 512
  * Number of LSTMs: 6

* Evaluation metrics:
  * Case-sensitive BLEU on detokenized text, computed with the NIST scorer. An in-house rule-based detokenizer was used for Moses/KyTea/MeCab/neologd.

* Data sets:
  * [KFTT](http://www.phontron.com/kftt/index.html)

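To make the evaluation metric concrete, here is a minimal sketch of case-sensitive corpus-level BLEU. It is not the NIST scorer used in these experiments. It assumes whitespace tokenization, a single reference per sentence, n-grams up to length 4, and no smoothing; `corpus_bleu` and `ngrams` are illustrative names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Case-sensitive corpus BLEU with one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches, per n-gram order
    totals = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # no lowercasing: case-sensitive
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
print(corpus_bleu(["the cat sat on the mat"], reference))
print(corpus_bleu(["The cat sat on the mat"], reference))
```

The second call differs from the reference only in capitalization, so it scores lower than the first. This is the behavior that "case-sensitive" refers to. The function returns scores in the 0 to 1 range, the same scale as the results table.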
### Results (BLEU scores)

|Setting|vocab size|BLEU(dev)|BLEU(test)|src #tokens/sent.|trg #tokens/sent.|
|---|---|---|---|---|---|
|enja (Unigram)|8k (shared)|0.2718|0.2922|30.97|25.05|
|enja (BPE)|8k (shared)|0.2695|0.2919|31.76|25.43|
|enja (Moses/KyTea)|80k/80k|0.2514|0.2804|21.25|23.21|
|enja (Moses/MeCab)|80k/80k|0.2436|0.2739|21.25|21.20|
|enja (Moses/neologd)|80k/80k|0.2102|0.2350|21.25|18.47|
|jaen (Unigram)|8k (shared)|0.1984|0.2170|25.05|30.97|
|jaen (BPE)|8k (shared)|0.1975|0.2176|25.43|31.76|
|jaen (Moses/KyTea)|80k/80k|0.1697|0.1974|23.21|21.25|
|jaen (Moses/MeCab)|80k/80k|0.1654|0.1870|21.20|21.25|
|jaen (Moses/neologd)|80k/80k|0.1583|0.1838|18.47|21.25|

* **SentencePiece (Unigram/BPE)** outperforms the word-based methods **(Moses/KyTea/MeCab/neologd)** even with a vocabulary one tenth the size (8k vs. 80k).
* The number of tokens needed to represent Japanese sentences is roughly comparable between **SentencePiece (Unigram)** and **KyTea**, even though the **SentencePiece** vocabulary is much smaller. This implies that SentencePiece can effectively compress sentences with a smaller symbol set.
* **neologd** yields the lowest BLEU scores. Tokenizing sentences with a large named-entity dictionary might not be effective in neural text processing.
* **Unigram** compresses text slightly better than **BPE** (fewer tokens per sentence), but the BLEU scores show no significant difference.

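The comparisons in these observations can be recomputed from the results table. The sketch below copies the table values and derives the test-BLEU gain and the tokens-per-sentence ratio between two settings. The helper names (`bleu_gain`, `token_ratio`) and the column constants are illustrative and are not part of any SentencePiece tooling.

```python
# Column indices into each result tuple.
DEV, TEST, SRC_TOKENS, TRG_TOKENS = range(4)

# (BLEU dev, BLEU test, src tokens/sent., trg tokens/sent.), copied from the table.
results = {
    "enja (Unigram)": (0.2718, 0.2922, 30.97, 25.05),
    "enja (BPE)": (0.2695, 0.2919, 31.76, 25.43),
    "enja (Moses/KyTea)": (0.2514, 0.2804, 21.25, 23.21),
    "enja (Moses/MeCab)": (0.2436, 0.2739, 21.25, 21.20),
    "enja (Moses/neologd)": (0.2102, 0.2350, 21.25, 18.47),
    "jaen (Unigram)": (0.1984, 0.2170, 25.05, 30.97),
    "jaen (BPE)": (0.1975, 0.2176, 25.43, 31.76),
    "jaen (Moses/KyTea)": (0.1697, 0.1974, 23.21, 21.25),
    "jaen (Moses/MeCab)": (0.1654, 0.1870, 21.20, 21.25),
    "jaen (Moses/neologd)": (0.1583, 0.1838, 18.47, 21.25),
}


def bleu_gain(system, baseline, column=TEST):
    """Absolute BLEU difference (system minus baseline)."""
    return results[system][column] - results[baseline][column]


def token_ratio(system, baseline, column):
    """Tokens per sentence of `system` relative to `baseline`."""
    return results[system][column] / results[baseline][column]


# Test-BLEU gain of Unigram over each word-based setting, per direction.
for direction in ("enja", "jaen"):
    for baseline in ("Moses/KyTea", "Moses/MeCab", "Moses/neologd"):
        gain = bleu_gain(f"{direction} (Unigram)", f"{direction} ({baseline})")
        print(f"{direction}: Unigram vs {baseline}: {gain:+.4f} BLEU(test)")

# Japanese tokens per sentence (target side of enja): Unigram relative to KyTea.
print(f"{token_ratio('enja (Unigram)', 'enja (Moses/KyTea)', TRG_TOKENS):.3f}")

# English tokens per sentence (source side of enja): Unigram relative to BPE.
print(f"{token_ratio('enja (Unigram)', 'enja (BPE)', SRC_TOKENS):.3f}")
```

A ratio below 1 means the first setting needs fewer tokens per sentence than the second.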
## Experiments 2 (subwording with various pre-tokenizations)
### Experimental settings
We have evaluated SentencePiece segmentation with the following configurations.
