From 721296164f1c5c6b9b16d6e91ee210e7ff6beaee Mon Sep 17 00:00:00 2001
From: Taku Kudo
Date: Mon, 9 Apr 2018 19:00:21 +0900
Subject: [PATCH] Update README.md

---
 README.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/README.md b/README.md
index 7c1b92a..8bde5a1 100644
--- a/README.md
+++ b/README.md
@@ -200,6 +200,16 @@ You can find that the original input sentence is restored from the vocabulary id
 ```
 `````` stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
 
+## Redefine special meta tokens
+By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens, which have the ids 0, 1, and 2 respectively. We can redefine these mappings in the training phase as follows:
+
+```
+% spm_train --bos_id=0 --eos_id=1 --unk_id=2 --input=... --model_prefix=...
+```
+When an id of -1 is set, e.g., ```--bos_id=-1```, that special token is ignored. Note that the unknown id cannot be removed, and these ids must start at 0 and be contiguous. In addition, we can define an id for padding (<pad>). The padding id is disabled by default; you can assign it with ```--pad_id=3```.
+
+If you want to define other special tokens, please see [Use custom symbols](doc/special_symbols.md).
+
 ## Experiments 1 (subword vs word-based model)
 ### Experimental settings
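
As a reviewer's note on the flags this patch documents, the sketch below shows one way they might be combined. The input file name, model prefix, and vocab size are hypothetical placeholders, not taken from the patch:

```shell
# Hypothetical example: remap the meta-token ids and enable padding.
# input.txt, the prefix "m", and --vocab_size=1000 are placeholders.
spm_train --input=input.txt --model_prefix=m --vocab_size=1000 \
  --unk_id=0 --bos_id=1 --eos_id=2 --pad_id=3

# Since the vocabulary id corresponds to the line number in the
# .vocab file, the first four lines should be the meta tokens in
# id order (<unk>, <s>, </s>, <pad>).
head -n 4 m.vocab
```

Disabling a token instead would use an id of -1, e.g. `--bos_id=-1`, as described in the patch.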