* add initial doc

* update doc

* finish all docs

Co-authored-by: Ze Tao <zetao@microsoft.com>
Mojimi 2021-09-29 07:59:09 +08:00 committed by GitHub
Parent b6e65f8780
Commit 4290400ed3
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
2 changed files: 324 additions and 13 deletions

View file

@@ -9,23 +9,28 @@
|StringToHashBucketFast|Supported|
|StringJoin | Supported |
|StringRegexReplace| Supported |
|StringRegexSplit| Supported |
|StringECMARegexReplace| Supported|
|StringSplit | Supported |
|StringUpper | Supported |
|StringLength | Supported |
|StringConcat | Supported |
|StringRegexSplitWithOffsets| Supported |
|StringECMARegexSplitWithOffsets| Supported|
|VectorToString| Supported |
|StringToVector| Supported|
|StringSlice | Under development|
### Tokenizer
|**Operator**|**Support State**|
|------------|-----------------|
|GPT2Tokenizer| Supported |
|WordpieceTokenizer| Supported |
|XLNetTokenizer| Under development |
|SentencepieceTokenizer| Supported |
|BasicTokenizer| Supported |
|BertTokenizer| Supported |
|BertTokenizerDecoder| Supported |
## Auxiliary String Operator
@@ -33,7 +38,7 @@
### <a name="StringRegexReplace"></a><a name="StringRegexReplace">**StringRegexReplace**</a>
String replacement based on regular expressions.
String replacement based on [Re2-format](https://github.com/google/re2/wiki/Syntax) regular expressions.
#### Inputs
@@ -86,7 +91,68 @@ expect(node, inputs=[text, pattern, rewrite], outputs=[y],
</details>
### <a name="StringRegexSplit"></a><a name="StringRegexSplit">**StringRegexSplit**</a>
### <a name="StringECMARegexReplace"></a><a name="StringECMARegexReplace">**StringECMARegexReplace**</a>
String replacement based on [ECMA-format](https://en.cppreference.com/w/cpp/regex/ecmascript) regular expressions.
#### Inputs
***text: tensor(string)***
String tensor on which to perform the replacement.
***pattern: tensor(string)***
Pattern of the regular expression.
***rewrite: tensor(string)***
Replacement.
#### Attributes
***global_replace: int64*** (default is 1)
Whether to replace all occurrences of the pattern (1) or only the first one (0).
***ignore_case: int64*** (default is 0)
Whether the match is case-insensitive (1) or case-sensitive (0).
#### Outputs
***output: tensor(string)***
String with replacements.
#### Examples
<details>
<summary>StringECMARegexReplace</summary>
```python
node = onnx.helper.make_node(
    'StringECMARegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
)

text = np.array([['def myfunc():'], ['def dummy():']])
pattern = np.array([r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):'])
rewrite = np.array([r'static PyObject* py_$1(void) {'])
y = [['static PyObject* py_myfunc(void) {'],
     ['static PyObject* py_dummy(void) {']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_ecma_regex_replace')
```
</details>
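The example above relies on the default attribute values. The following minimal sketch (not taken from the test suite; the test name and input values are illustrative) shows how the `global_replace` and `ignore_case` attributes described above can be set as `make_node` keyword arguments to make the match case-insensitive.
<details>
<summary>StringECMARegexReplace with ignore_case (sketch)</summary>
```python
node = onnx.helper.make_node(
    'StringECMARegexReplace',
    inputs=['text', 'pattern', 'rewrite'],
    outputs=['y'],
    global_replace=1,  # replace every match, not only the first one
    ignore_case=1,     # match case-insensitively
)

text = np.array([['Hello world, hello World']])
pattern = np.array([r'hello'])
rewrite = np.array([r'hi'])
# with ignore_case=1 both 'Hello' and 'hello' are rewritten
y = [['hi world, hi World']]

expect(node, inputs=[text, pattern, rewrite], outputs=[y],
       name='test_string_ecma_regex_replace_ignore_case')
```
</details>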
### <a name="StringRegexSplitWithOffsets"></a><a name="StringRegexSplitWithOffsets">**StringRegexSplitWithOffsets**</a>
Splits strings based on regular expressions.
@@ -719,27 +785,272 @@ expect(node, inputs=[inputs, nbest_size, alpha, add_bos, add_eos, reverse],
```
</details>
### <a name="XLNetTokenizer"></a><a name="XLNetTokenizer">**XLNetTokenizer**</a>
### <a name="BasicTokenizer"></a><a name="BasicTokenizer">**BasicTokenizer**</a>
GPT2Tokenizer that performs SentencePiece tokenization to the input tensor, based on the [hugging face version](https://huggingface.co/transformers/model_doc/xlnet.html#xlnettokenizer).
BasicTokenizer performs basic tokenization on the input string tensor, based on the [basic tokenizer in BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
#### Inputs
***data: tensor(string)***
The string tensor for tokenization
***text: tensor(string)*** The string tensor for tokenization
#### Attributes
***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)
Whether or not to lowercase the input when tokenizing.
***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)
Whether or not to tokenize Chinese characters.
***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)
Whether or not to strip all accents. If this option is not specified, it is determined by the value of `do_lower_case` (as in the original BERT).
***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)
Whether or not to split punctuation into separate tokens.
***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)
Whether or not to remove control characters (such as NUL, BEL) from the text.
#### Outputs
***output: tensor(int64)***
Tokenized result of the input
***tokens: tensor(string)*** The resulting tokens.
#### Examples
<details>
<summary>word_piece_tokenizer</summary>
<summary>example 1</summary>
```python
import transformers
tokenizer = transformers.BasicTokenizer()
node = onnx.helper.make_node(
    'BasicTokenizer',
    inputs=['text'],
    outputs=['tokens'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=np.object)
# transformers.BasicTokenizer.tokenize returns a list of string tokens
tokens = np.array(tokenizer.tokenize(text), dtype=np.object)

expect(node, inputs=[inputs],
       outputs=[tokens], name='test_basic_tokenizer')
```
</details>
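The attributes listed above can also be set as `make_node` keyword arguments. The next sketch is illustrative rather than part of the original test suite: it keeps the original casing and splits punctuation, mirroring `transformers.BasicTokenizer(do_lower_case=False)`; the test name is hypothetical.
<details>
<summary>example 2 (sketch: cased tokenization)</summary>
```python
import transformers

tokenizer = transformers.BasicTokenizer(do_lower_case=False)
node = onnx.helper.make_node(
    'BasicTokenizer',
    inputs=['text'],
    outputs=['tokens'],
    do_lower_case=0,          # keep the original casing
    tokenize_punctuation=1,   # split punctuation, as the Hugging Face tokenizer does
)

text = "Hello World, Louder!"
inputs = np.array([text], dtype=np.object)
# transformers.BasicTokenizer splits on whitespace and punctuation:
# ['Hello', 'World', ',', 'Louder', '!']
tokens = np.array(tokenizer.tokenize(text), dtype=np.object)

expect(node, inputs=[inputs],
       outputs=[tokens], name='test_basic_tokenizer_cased')
```
</details>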
### <a name="BertTokenizer"></a><a name="BertTokenizer">**BertTokenizer**</a>
BertTokenizer replicates the `encode_plus` function of [BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
#### Inputs
***text: tensor(string)*** The string tensor for tokenization
#### Attributes
***vocab_file: string***
The content of the vocabulary file, in the same format as the Hugging Face vocabulary file.
***do_lower_case: int64_t*** (default is 1, 1 represents True, 0 represents False)
Whether or not to lowercase the input when tokenizing.
***do_basic_tokenize: int64_t*** (default is 1, 1 represents True, 0 represents False)
Whether or not to do basic tokenization before WordPiece.
***unk_token: string***
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
***sep_token: string***
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
***pad_token: string***
The token used for padding, for example when batching sequences of different lengths.
***cls_token: string***
The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
***mask_token: string***
The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
***tokenize_chinese_chars: int64_t*** (default is 1, 1 represents True, 0 represents False)
Whether or not to tokenize Chinese characters.
***strip_accents: int64_t*** (default is 1, 1 represents True, 0 represents False)
Whether or not to strip all accents. If this option is not specified, it is determined by the value of `do_lower_case` (as in the original BERT).
***tokenize_punctuation: int64_t*** (default is 0, 1 represents True, 0 represents False)
Whether or not to split punctuation into separate tokens.
***remove_control_chars: int64_t*** (default is 0, 1 represents True, 0 represents False)
Whether or not to remove control characters (such as NUL, BEL) from the text.
***truncation_strategy_name: string***
The name of the truncation strategy; it can be `longest_first`, `only_first`, `only_second`, or `longest_from_back`.
#### Outputs
***input_ids: tensor(int64)***
List of token ids.
***token_type_ids: tensor(int64)***
List of token type ids.
***attention_mask: tensor(int64)***
List of indices specifying which tokens should be attended to by the model.
#### Examples
<details>
<summary>example 1</summary>
```python
import transformers
bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')
node = onnx.helper.make_node(
    'BertTokenizer',
    inputs=['text'],
    outputs=['input_ids', 'token_type_ids', 'attention_mask'],
)

text = "Hello world louder"
inputs = np.array([text], dtype=np.object)

# encode_plus returns a dict with input_ids, token_type_ids and attention_mask
encoded = bert_cased_tokenizer.encode_plus(text)
input_ids = np.array(encoded['input_ids'], dtype=np.int64)
token_type_ids = np.array(encoded['token_type_ids'], dtype=np.int64)
attention_mask = np.array(encoded['attention_mask'], dtype=np.int64)

expect(node, inputs=[inputs],
       outputs=[input_ids, token_type_ids, attention_mask], name='test_bert_tokenizer')
```
</details>
### <a name="BertTokenizerDecoder"></a><a name="BertTokenizerDecoder">**BertTokenizerDecoder**</a>
BertTokenizerDecoder replicates the `decode` function of [BertTokenizer (Hugging Face version)](https://huggingface.co/transformers/_modules/transformers/models/bert/tokenization_bert.html#BertTokenizer).
#### Inputs
***token_ids: tensor(int64)***
List of tokenized input ids.
***indices: tensor(int64)***
List of `[start_position, end_position]` pairs indicating which segments of the input ids should be decoded. This input is only enabled when the attribute `use_indices` is 1.
It is usually used to decode a slot (a sub-span) of the text.
#### Attributes
***vocab_file: string***
The content of the vocabulary file, in the same format as the Hugging Face vocabulary file.
***unk_token: string***
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
***sep_token: string***
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
***pad_token: string***
The token used for padding, for example when batching sequences of different lengths.
***cls_token: string***
The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
***mask_token: string***
The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
***suffix_indicator: string***
The indicator that marks sub-word suffix tokens (for example `##` in BERT WordPiece).
***use_indices: int64_t***
Whether or not to use the second input (`indices`).
***skip_special_tokens: int64_t***
Whether or not to remove special tokens in the decoding.
***clean_up_tokenization_spaces: int64_t***
Whether or not to clean up the tokenization spaces.
#### Outputs
***sentences: tensor(string)***
The decoded sentences.
#### Examples
<details>
<summary>example 1</summary>
```python
import transformers
def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')
bert_cased_tokenizer.save_vocabulary('.', 'bert')
node = onnx.helper.make_node(
    'BertTokenizerDecoder',
    inputs=['token_ids'],
    outputs=['sentences'],
    vocab_file=get_file_content("bert-vocab.txt")
)

text = "Hello world louder"
# the decoder expects token ids, so encode the text rather than tokenize it to strings
token_ids = np.array(bert_cased_tokenizer.encode(text), dtype=np.int64)
sentences = np.array([text], dtype=np.object)

expect(node, inputs=[token_ids],
       outputs=[sentences], name='test_bert_tokenizer_decoder')
```
</details>
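A second, purely illustrative sketch shows how the optional `indices` input might be used when `use_indices=1`. The input and output names come from this document; the span semantics (here assumed to be a half-open `[start, end)` range over `token_ids`, where position 0 is `[CLS]`) and the test name are assumptions, so the expected values are only indicative.
<details>
<summary>example 2 (sketch: decoding a span with indices)</summary>
```python
import transformers

def get_file_content(path):
    with open(path, "rb") as file:
        return file.read()

bert_cased_tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased')
bert_cased_tokenizer.save_vocabulary('.', 'bert')
node = onnx.helper.make_node(
    'BertTokenizerDecoder',
    inputs=['token_ids', 'indices'],
    outputs=['sentences'],
    vocab_file=get_file_content("bert-vocab.txt"),
    use_indices=1,  # enable the second input
)

text = "Hello world louder"
token_ids = np.array(bert_cased_tokenizer.encode(text), dtype=np.int64)
# decode only positions 1 and 2, i.e. the tokens for "Hello world"
# (assuming a half-open [start, end) span; position 0 is [CLS])
indices = np.array([[1, 3]], dtype=np.int64)
sentences = np.array(["Hello world"], dtype=np.object)

expect(node, inputs=[token_ids, indices],
       outputs=[sentences], name='test_bert_tokenizer_decoder_with_indices')
```
</details>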

View file

@@ -85,7 +85,7 @@ KernelBasicTokenizer::KernelBasicTokenizer(OrtApi api, const OrtKernelInfo* info
bool tokenize_chinese_chars = TryToGetAttributeWithDefault("tokenize_chinese_chars", true);
bool strip_accents = TryToGetAttributeWithDefault("strip_accents", false);
bool tokenize_punctuation = TryToGetAttributeWithDefault("tokenize_punctuation", false);
bool remove_control_chars = TryToGetAttributeWithDefault("strip_accents", true);
bool remove_control_chars = TryToGetAttributeWithDefault("remove_control_chars", true);
tokenizer_ = std::make_shared<BasicTokenizer>(do_lower_case, tokenize_chinese_chars, strip_accents, tokenize_punctuation, remove_control_chars);
}