Added new experimental results

This commit is contained in:
Taku Kudo 2017-04-06 18:45:53 +09:00
Parent b23ed045b1
Commit 10f39d1417
1 changed file with 44 additions and 1 deletion


@@ -155,7 +155,50 @@ You can find that the original input sentence is restored from the vocabulary id
```
```<output file>``` stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
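For reference, a hypothetical excerpt of such a vocabulary file, assuming the tab-separated `<piece>	<log probability>` layout; the pieces and scores below are made up purely for illustration:
```
<unk>	0
<s>	0
</s>	0
▁the	-3.4912
▁,	-3.6515
ing	-4.2087
```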
## Experiment 1 (subword vs. word-based model)
### Experimental settings
* Segmentation algorithms:
  * **Unigram**: SentencePiece with language-model-based segmentation (`--model_type=unigram`).
  * **BPE**: SentencePiece with Byte-Pair Encoding [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] (`--model_type=bpe`). A training sketch for both subword models appears after this list.
  * **Moses**: [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) for English.
  * **KyTea**: [KyTea](http://www.phontron.com/kytea/) for Japanese.
  * **MeCab**: [MeCab](http://taku910.github.io/mecab/) for Japanese.
  * **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.
* NMT parameters ([Google's Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf) is used for all experiments):
  * Dropout probability: 0.2
  * Number of nodes: 512
  * Number of LSTMs: 6
* Evaluation metric:
  * Case-sensitive BLEU on detokenized text with the NIST scorer. An in-house rule-based detokenizer was used for Moses/KyTea/MeCab/neologd.
* Data set:
  * [KFTT](http://www.phontron.com/kftt/index.html)
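As referenced in the list above, here is a minimal sketch of how the two subword models could be trained and applied with the SentencePiece command-line tools. The file names (`kftt.train.en-ja.txt`, `input.en`) are hypothetical placeholders; the 8k shared vocabulary follows the setting in the results table below.
```
# Train a shared 8k unigram model and a shared 8k BPE model.
# kftt.train.en-ja.txt is a hypothetical concatenation of the English and Japanese training text.
% spm_train --input=kftt.train.en-ja.txt --model_prefix=unigram_8k --vocab_size=8000 --model_type=unigram
% spm_train --input=kftt.train.en-ja.txt --model_prefix=bpe_8k --vocab_size=8000 --model_type=bpe

# Segment raw text into subword pieces before feeding it to the NMT system.
% spm_encode --model=unigram_8k.model --output_format=piece < input.en > input.en.seg
```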
### Results (BLEU scores)
|Setting|Vocab size|BLEU (dev)|BLEU (test)|Src #tokens/sent.|Trg #tokens/sent.|
|---|---|---|---|---|---|
|enja (Unigram)|8k (shared)|0.2718|0.2922|30.97|25.05|
|enja (BPE)|8k (shared)|0.2695|0.2919|31.76|25.43|
|enja (Moses/KyTea)|80k/80k|0.2514|0.2804|21.25|23.21|
|enja (Moses/MeCab)|80k/80k|0.2436|0.2739|21.25|21.20|
|enja (Moses/neologd)|80k/80k|0.2102|0.2350|21.25|18.47|
|jaen (Unigram)|8k (shared)|0.1984|0.2170|25.05|30.97|
|jaen (BPE)|8k (shared)|0.1975|0.2176|25.43|31.76|
|jaen (Moses/KyTea)|80k/80k|0.1697|0.1974|23.21|21.25|
|jaen (Moses/MeCab)|80k/80k|0.1654|0.1870|21.20|21.25|
|jaen (Moses/neologd)|80k/80k|0.1583|0.1838|18.47|21.25|
* **SentencePiece (Unigram/BPE)** outperforms the word-based methods **(Moses/KyTea/MeCab/neologd)** even with a much smaller vocabulary (1/10 the size of the word-based vocabularies).
* The number of tokens needed to represent Japanese sentences is almost the same for **SentencePiece (Unigram)** and **KyTea**, even though the vocabulary of **SentencePiece** is much smaller. This implies that SentencePiece can effectively compress sentences with a smaller symbol set.
* **neologd** shows the poorest BLEU scores. Tokenizing sentences with a large named-entity dictionary might not be effective in neural-based text processing.
* **Unigram** shows a slightly better text compression ratio than **BPE**, but there is no significant difference in BLEU scores.
## Experiment 2 (subword segmentation with various pre-tokenizations)
### Experimental settings
We have evaluated SentencePiece segmentation with the following configurations.