Subword segmentation with unigram language model supports probabilistic subword sampling for **subword regularization** [[Kudo.](http://acl2018.org/conference/accepted-papers/)], a simple technique to improve the robustness of NMT model. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
- **Multiple subword algorithms**: **BPE** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)] are supported.
- **Subword regularization**: SentencePiece implements subwrod sampling for [subword regularization](https://arxiv.org/abs/1804.10959) which helps to improve the robustness and accuracy of NMT models.
- **Purely data driven**: SentencePiece trains tokenization and detokenization
models from only raw sentences. No pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is required.
- **Language independent**: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
- **Self-contained**: The same tokenization/detokenization is obtained as long as the same model file is used.
- **Direct vocabulary id generation**: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
- **NFKC-based normalization**: SentencePiece performs NFKC-based text normalization.
SentencePiece is an unsupervised text tokenizer and detokenizer designed mainly for Neural Network-based text generation, for example Neural Network Machine Translation. SentencePiece is a re-implementation of **sub-word units** (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)]) and
**unigram language model** [[Kudo.](http://acl2018.org/conference/accepted-papers/)]). Unlike previous sub-word approaches that train tokenizers from pretokenized sentences, SentencePiece directly trains the tokenizer and detokenizer from raw sentences. SentencePiece might seem like a sort of unsupervised word segmentation, but there are several differences and constraints in SentencePiece.
SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.
Subword regularization [[Kudo.](http://acl2018.org/conference/accepted-papers/)]) is a simple regularization method
that virtually augments training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.
To enable subword regularization, you would like to use the SentencePiece library to sample one segmentation for each parameter updates, which is different from the standard off-line data preparations. Here's the example of [Python libra\
ry](python/README.md). You can find that 'New York' is segmented differently on each ``SampleEncode`` call. The details of sampling parameters are found in [sentencepiece_processor.h](src/sentencepiece_processor.h).
*`--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pretokenized when using `word` type.
By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in training phase as follows.
When setting -1 id e.g., ```bos_id=-1```, this special token is disabled. Note that the unknow id cannot be removed. In addition, these ids must start with 0 and be continous. We can define an id for padding (<pad>). Padding id is disabled by default. You can assign an id as ```--pad_id=3```.