Ryuichirou 2018-05-02 14:09:02 +09:00
Parent 18fcf6be46
Commit 80411b9e3d
1 changed file with 1 addition and 1 deletion


@@ -14,7 +14,7 @@ Subword segmentation with unigram language model supports probabilistic subword
 ## Technical highlights
 - **Multiple subword algorithms**: **BPE** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)] are supported.
-- **Subword regularization**: SentencePiece implements subwrod sampling for [subword regularization](https://arxiv.org/abs/1804.10959) which helps to improve the robustness and accuracy of NMT models.
+- **Subword regularization**: SentencePiece implements subword sampling for [subword regularization](https://arxiv.org/abs/1804.10959) which helps to improve the robustness and accuracy of NMT models.
 - **Purely data driven**: SentencePiece trains tokenization and detokenization
 models from only raw sentences. No pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is required.
 - **Language independent**: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
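
The bullet touched by this typo fix describes subword sampling, i.e. drawing different segmentations of the same sentence from the trained model. As a minimal sketch of what that looks like with the `sentencepiece` Python wrapper (the `encode(..., enable_sampling=True)` interface shown here is from the current wrapper and is not part of this commit; the corpus path and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train a unigram LM model on a raw (untokenized) corpus.
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='m', vocab_size=8000, model_type='unigram')

sp = spm.SentencePieceProcessor(model_file='m.model')

# Deterministic (best) segmentation.
print(sp.encode('New York is large.', out_type=str))

# Subword sampling: with enable_sampling=True the segmentation is sampled from the
# unigram LM, so repeated calls can return different tokenizations of the same text.
for _ in range(3):
    print(sp.encode('New York is large.', out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```

Feeding such varied segmentations to an NMT model during training is the subword regularization idea referenced in the bullet.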