machinelearning

History

Tarek Mahmoud Sayed d0aa2c2461 Address the feedback on the tokenizer's library (#7024)	2024-02-26 10:57:18 -08:00
* Fix cache when calling EncodeToIds
* Make EnglishRoberta _mergeRanks thread safe
* Delete Trainer
* Remove the setters on the Bpe properties
* Remove Roberta and Tiktoken special casing in the Tokenizer and support the cases in the Model abstraction
* Support text-embedding-3-small/large embedding
* Remove redundant TokenToId abstraction and keep the one with the extra parameters
* Enable creating Tiktoken asynchronously or directly using the tokenizer data
* Add cancellationToken support in CreateAsync APIs
* Rename sequence to text and Tokenize to Encode
* Rename skipSpecialTokens to considerSpecialTokens
* Rename TokenizerResult to EncodingResult
* Make Token publicly immutable
* Change offset tuples from (Index, End) to (Index, Length)
* Rename NormalizedString method's parameters
* Rename Model's methods to start with verb
* Convert Model.GetVocab() method to a Vocab property
* Some method's parameters and variable renaming
* Remove Vocab and VocabSize from the abstraction
* Cleanup normalization support
* Minor Bpe cleanup
* Resolve rebase change
* Address the feedback
lib.rs.txt	Introducing Tiktoken Tokenizer (#6981)	2024-02-06 11:13:28 -08:00
tokens.json	Introducing Tiktoken Tokenizer (#6981)	2024-02-06 11:13:28 -08:00
tokens_gpt2.json	Introducing Tiktoken Tokenizer (#6981)	2024-02-06 11:13:28 -08:00
tokens_p50k_base.json	Introducing Tiktoken Tokenizer (#6981)	2024-02-06 11:13:28 -08:00
tokens_p50k_edit.json	Introducing Tiktoken Tokenizer (#6981)	2024-02-06 11:13:28 -08:00
tokens_r50k_base.json	Introducing Tiktoken Tokenizer (#6981)	2024-02-06 11:13:28 -08:00