4.7 KiB
SentencePieceProcessor C++ API
Load SentencePiece model
To start working with the SentencePiece model, you will want to include the sentencepiece_processor.h
header file.
Then instantiate sentencepiece::SentencePieceProcessor class and calls Load
method to load the model with file path or std::istream.
#include <sentencepiece_processor.h>
sentencepiece::SentencePieceProcessor processor;
const auto status = processor.Load("//path/to/model.model");
if (!status.ok()) {
std::cerr << status.ToString() << std::endl;
// error
}
// You can also load a model from std::ifstream.
// std::ifstream in("//path/to/model.model");
// auto status = processor.Load(in);
Tokenize text (preprocessing)
Calls SentencePieceProcessor::Encode
method to tokenize text.
std::vector<std::string> pieces;
processor.Encode("This is a test.", &pieces);
for (const std::string &token : pieces) {
std::cout << token << std::endl;
}
You will obtain the sequence of vocab ids as follows:
std::vector<int> ids;
processor.Encode("This is a test.", &ids);
for (const int id : ids) {
std::cout << id << std::endl;
}
Detokenize text (postprocessing)
Calls SentencePieceProcessor::Decode
method to detokenize a sequence of pieces or ids into a text. Basically it is guaranteed that the detoknization is an inverse operation of Encode, i.e., Decode(Encode(Normalize(input))) == Normalize(input)
.
std::vector<std::string> pieces = { "▁This", "▁is", "▁a", "▁", "te", "st", "." }; // sequence of pieces
std::string text
processor.Decode(pieces, &text);
std::cout << text << std::endl;
std::vector<int> ids = { 451, 26, 20, 3, 158, 128, 12 }; // sequence of ids
processor.Decode(ids, &text);
std::cout << text << std::endl;
Sampling (subword regularization)
Calls SentencePieceProcessor::SampleEncode
method to sample one segmentation.
std::vector<std::string> pieces;
processor.SampleEncode("This is a test.", &pieces, -1, 0.2);
std::vector<int> ids;
processor.SampleEncode("This is a test.", &ids, -1, 0.2);
SampleEncode has two sampling parameters, nbest_size
and alpha
, which correspond to l
and alpha
in the original paper. When nbest_size
is -1, one segmentation is sampled from all hypothesis with forward-filtering and backward sampling algorithm.
Training
Calls SentencePieceTrainer::Train
function to train sentencepiece model. You can pass the same parameters of spm_train as a single stirng.
#include <sentencepiece_trainer.h>
sentencepiece::SentencePieceTrainer::Train("--input=test/botchan.txt --model_prefix=m --vocab_size=1000");
SentencePieceText proto
You will want to use SentencePieceText
class to obtain the pieces and ids at the same time. This proto also encodes a utf8-byte offset of each piece over user input or detokenized text.
#include <sentencepiece.pb.h>
sentencepiece::SentencePieceText spt;
// Encode
processor.Encode("This is a test.", &spt);
std::cout << spt.text() << std::endl; // This is the same as the input.
for (const auto &piece : spt.pieces()) {
std::cout << piece.begin() << std::endl; // beginning of byte offset
std::cout << piece.end() << std::endl; // end of byte offset
std::cout << piece.piece() << std::endl; // internal representation.
std::cout << piece.surface() << std::endl; // external representation. spt.text().substr(begin, end - begin) == surface().
std::cout << piece.id() << std::endl; // vocab id
}
// Decode
processor.Decode({10, 20, 30}, &spt);
std::cout << spt.text() << std::endl; // This is the same as the decoded string.
for (const auto &piece : spt.pieces()) {
// the same as above.
}
Vocabulary management
You will want to use the following methods to obtain ids from/to pieces.
processor.GetPieceSize(); // returns the size of vocabs.
processor.PieceToId("foo"); // returns the vocab id of "foo"
processor.IdToPiece(10); // returns the string representation of id 10.
processor.IsUnknown(0); // returns true if the given id is an unknown token. e.g., <unk>
processor.IsControl(10); // returns true if the given id is a control token. e.g., <s>, </s>
Extra Options
Use SetEncodeExtraOptions
and SetDecodeExtraOptions
methods to set extra options for encoding and decoding respectively. These methods need to be called just after Load
methods.
processor.SetEncodeExtraOptions("bos:eos"); // add <s> and </s>.
processor.SetEncodeExtraOptions("reverse:bos:eos"); // reverse the input and then add <s> and </s>.
processor.SetDecodeExtraOptions("reverse"); // the decoder's output is reversed.