Commit graph

9 commits

Author SHA1 Message Date
Tarek Mahmoud Sayed bad82989a0
Adding needed Tokenizer's APIs (#7047)
* Adding needed Tokenizer's APIs

* Address the feedback

* Small update to the newly exposed APIs

* fix comments

* Update the APIs signatures

* More feedback addressing

* Fix the comments
2024-03-06 19:04:15 -08:00
Tarek Mahmoud Sayed 99c620ad96
Add Span support in tokenizer's Model abstraction (#7035)
* Add Span support in tokenizer's Model abstraction

* Address the feedback

* Use stackalloc instead of the ArrayPool
2024-02-29 18:01:13 -08:00
Eric StJohn f22b60aa9a
Packaging cleanup (#6939)
* Packaging cleanup

Originally I was just trying to remove mentions of snupkg, but then
things got a bit carried away. :)

This is trying to remove as much duplication and dead code related to
packaging that I can.

* Apply code review feedback

* Suppress copying indirect references

* Remove unwanted bundled files from AutoML

* Remove leading slash

* Refactor model download

* Correct the packaging path of native symbols

* Rename NoTargets projects from csproj to proj

* Fix build issues around model download and respond to feedback

* Remove NoTargets file extension enforcement

* Rename proj to CSProj, include in SLN

I'd like to ensure all our projects are included in the SLN and don't
rely on separate build steps.

VS prefers *.csproj in the sln so I renamed things back to csproj.

* Respond to PR feedback
2024-02-27 16:05:43 -08:00
Tarek Mahmoud Sayed d0aa2c2461
Address the feedback on the tokenizer's library (#7024)
* Fix cache when calling EncodeToIds

* Make EnglishRoberta _mergeRanks thread safe

* Delete Trainer

* Remove the setters on the Bpe properties

* Remove Roberta and Tiktoken special casing in the Tokenizer and support the cases in the Model abstraction

* Support text-embedding-3-small/large embedding

* Remove redundant TokenToId abstraction and keep the one with the extra parameters

* Enable creating Tiktoken asynchronously or directly using the tokenizer data

* Add cancellationToken support in CreateAsync APIs

* Rename sequence to text and Tokenize to Encode

* Rename skipSpecialTokens to considerSpecialTokens

* Rename TokenizerResult to EncodingResult

* Make Token publicly immutable

* Change offset tuples from (Index, End) to (Index, Length)

* Rename NormalizedString method's parameters

* Rename Model's methods to start with verb

* Convert Model.GetVocab() method to a Vocab property

* Some method's parameters and variable renaming

* Remove Vocab and VocabSize from the abstraction

* Cleanup normalization support

* Minor Bpe cleanup

* Resolve rebase change

* Address the feedback
2024-02-26 10:57:18 -08:00
Tarek Mahmoud Sayed 4635a862dd
Tokenizer's Interfaces Cleanup (#7001)
* Tokenizer's Interfaces Cleanup

* Address the feedback

* Optimization
2024-02-16 12:48:38 -07:00
Tarek Mahmoud Sayed 6f55525602
Introducing Tiktoken Tokenizer (#6981)
* Introducing Tiktoken Tokenizer

* Address the feedback

* file renaming
2024-02-06 11:13:28 -08:00
Michael Sharp 65c7ca9d9a
Add NameEntityRecognition and Q&A deep learning tasks. (#6760)
* NER

* QA almost done, runtime error

* QA finished

* fixes from PR comments

* fixed build

* build fixes

* perf changes

* made disposable

* fixed not disposing model

* added some disposables to TensorFlow for memory

* build testing

* fixing build

* added missing dispose

* build fixes

* build fixes

* testing macos fix
2023-07-24 13:47:24 -06:00
Tarek Mahmoud Sayed c69acbeb97
Embed the Tokenizer data files inside the assembly (#6403)
2022-10-24 10:30:48 -07:00
Tarek Mahmoud Sayed e8073ad4eb
Tokenizers Support (#6272)
2022-08-10 14:43:13 -07:00