machinelearning/test/Microsoft.ML.Tokenizers.Tests/Data
Tarek Mahmoud Sayed d0aa2c2461
Address the feedback on the tokenizers library (#7024)
* Fix cache when calling EncodeToIds

* Make EnglishRoberta _mergeRanks thread safe

* Delete Trainer

* Remove the setters on the Bpe properties

* Remove Roberta and Tiktoken special casing in the Tokenizer and support the cases in the Model abstraction

* Support text-embedding-3-small/large embedding

* Remove redundant TokenToId abstraction and keep the one with the extra parameters

* Enable creating Tiktoken asynchronously or directly using the tokenizer data

* Add cancellationToken support in CreateAsync APIs

* Rename sequence to text and Tokenize to Encode

* Rename skipSpecialTokens to considerSpecialTokens

* Rename TokenizerResult to EncodingResult

* Make Token publicly immutable

* Change offset tuples from (Index, End) to (Index, Length)

* Rename NormalizedString method's parameters

* Rename Model's methods to start with a verb

* Convert Model.GetVocab() method to a Vocab property

* Rename some method parameters and variables

* Remove Vocab and VocabSize from the abstraction

* Cleanup normalization support

* Minor Bpe cleanup

* Resolve rebase change

* Address the feedback
2024-02-26 10:57:18 -08:00
lib.rs.txt Introducing Tiktoken Tokenizer (#6981) 2024-02-06 11:13:28 -08:00
tokens.json Introducing Tiktoken Tokenizer (#6981) 2024-02-06 11:13:28 -08:00
tokens_gpt2.json Introducing Tiktoken Tokenizer (#6981) 2024-02-06 11:13:28 -08:00
tokens_p50k_base.json Introducing Tiktoken Tokenizer (#6981) 2024-02-06 11:13:28 -08:00
tokens_p50k_edit.json Introducing Tiktoken Tokenizer (#6981) 2024-02-06 11:13:28 -08:00
tokens_r50k_base.json Introducing Tiktoken Tokenizer (#6981) 2024-02-06 11:13:28 -08:00