Taku Kudo 2018-05-01 21:11:54 +09:00 committed by GitHub
Parent a78bb705bf
Commit ff2f301a72
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
1 changed file with 4 additions and 2 deletions


@@ -34,8 +34,6 @@ Subword segmentation with unigram language model supports probabilistic subword
|Pre-segmentation required?|[No](#whitespace-is-treated-as-a-basic-symbol)|Yes|Yes|
|Customizable normalization (e.g., NFKC)|[Yes](doc/normalization.md)|No|N/A|
|Direct id generation|[Yes](#end-to-end-example)|No|N/A|
-|Training speed|N/A|N/A|N/A|
-|Segmentation speed|N/A|N/A|N/A|
Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.
@@ -50,6 +48,10 @@ vocabulary. Unlike most unsupervised word segmentation algorithms, which
assume an infinite vocabulary, SentencePiece trains the segmentation model such
that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.
+Note that SentencePiece specifies the final vocabulary size for training, unlike
+[subword-nmt](https://github.com/rsennrich/subword-nmt), which uses the number of merge operations.
+The number of merge operations is a BPE-specific parameter and is not applicable to other segmentation algorithms, including unigram, word, and character.
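
To make the distinction concrete, here is a minimal sketch using the SentencePiece Python trainer; the corpus path and model prefix are placeholder names, and the keyword-argument form assumes a reasonably recent `sentencepiece` release:

```python
import sentencepiece as spm

# SentencePiece: the target vocabulary size is given directly.
# 'corpus.txt' and the model prefix 'm' are placeholder names.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    model_type='unigram',   # also: 'bpe', 'word', 'char'
    vocab_size=8000,        # final vocabulary size is fixed up front
)

# By contrast, subword-nmt takes a merge-operation count instead
# (e.g. `subword-nmt learn-bpe -s 10000`); the resulting vocabulary
# size is only an indirect consequence of that number.
```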
#### Whitespace is treated as a basic symbol
The first step of natural language processing is text tokenization. For
example, a standard English tokenizer would segment the text "Hello world." into the following three tokens: [Hello] [world] [.]
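
As a quick illustration of how SentencePiece treats whitespace, here is a minimal sketch assuming a model file `m.model` trained as above; the escaped-whitespace meta symbol ▁ (U+2581) is part of SentencePiece's documented behavior:

```python
import sentencepiece as spm

# Load a trained model (the path is a placeholder).
sp = spm.SentencePieceProcessor(model_file='m.model')

# Whitespace is encoded as the meta symbol ▁ (U+2581), so the original
# text is exactly recoverable from the pieces alone.
pieces = sp.encode('Hello world.', out_type=str)
print(pieces)             # e.g. ['▁Hello', '▁world', '.']
print(sp.decode(pieces))  # 'Hello world.'
```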