Taku Kudo 2018-05-01 19:13:41 +09:00 committed by GitHub
Parent 2784cafd8b
Commit c18b5949cc
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
1 changed file: 12 additions and 0 deletions

@@ -54,6 +54,18 @@ processor.Decode(ids, &text);
std::cout << text << std::endl;
```
## Sampling (subword regularization)
Call the `SentencePieceProcessor::SampleEncode` method to sample one segmentation.
```C++
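// Sample one segmentation as pieces (nbest_size = -1, alpha = 0.2).
std::vector<std::string> pieces;
processor.SampleEncode("This is a test.", &pieces, -1, 0.2);
// Sample again as vocabulary ids; each call draws a new sample.
std::vector<int> ids;
processor.SampleEncode("This is a test.", &ids, -1, 0.2);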
```
`SampleEncode` has two sampling parameters, `nbest_size` and `alpha`, which correspond to `l` and `alpha` in the [original paper](https://arxiv.org/abs/1804.10959). When `nbest_size` is -1, one segmentation is sampled from all hypotheses using the forward-filtering and backward-sampling algorithm.
## SentencePieceText proto
Use the `SentencePieceText` class to obtain the pieces and ids at the same time. This proto also encodes the UTF-8 byte offset of each piece over the user input or the detokenized text.