Commit graph

61 commits

Author SHA1 Message Date
Wenbing Li d1148aea4e
Support 'added_token' attribute for BPE tokenizer and some code refactoring. (#591)
* Fix CodeGenTokenizer issues and the related code refactoring.

* refactor the trie-tree

* temp check-ins

* code complete

* correctness fixing

* Update _hf_cvt.py

* more test cases fixing

* more refinement

* linux crash fixing

* Update test_autotokenizer.py
2023-11-04 22:56:26 -07:00
Wenbing Li c71e2ae090
Refactor String and Audio operators with status-return prototype. (#576)
* Refactor String and Audio operators with status-return prototype.

* complete the whole text domain

---------

Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
2023-10-19 10:40:58 -07:00
Sayan Shaw 4d2930e35a
Fix newline and apostrophe handling for BPE (#574)
* Fix certain BPE issues

* minor changes

* change newline handling for unix/linux/windows builds

* small test case

* move apostrophe testing into test_cliptok.py

* fix CLIP inconsistency with ftfy install

* add ftfy to requirements-dev.txt

* remove HF CLIP bug testing

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-10-19 00:17:34 -07:00
Wenbing Li 68b9d1dc47
Fix the exception on invalid trie-tokenizer input (#575)
* fix the exception on invalid trie-tokenizer input

* remove unused import
2023-10-16 17:03:02 -07:00
Sam Webster b7e35a1a34
Add token indices output to sentencepiece (#566)
* Add token_indices

* Update test (not tested)

* Address comments

* Change output to optional and fix reverse

* indices test

* Switch param order

---------

Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
2023-10-03 09:56:28 -07:00
Sayan Shaw 169438999c
Add support for Fairseq models (like XLMRobertaTokenizer) (#556)
* add XLMRobertaTokenizer support

* update as per comments

* change optional dereference for macos

* typo

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-09-08 17:03:22 -07:00
Wenbing Li 69c2c3a275
Refactor BBPE based Tokenizers (#555)
* Refactor BBPE based Tokenizer

* Address the CI pipeline failure

* address the comments

* no stl unique

* trie test build fix
2023-09-05 15:45:33 -07:00
Wenbing Li 367f59c6fa
Remove the deprecated std::codecvt_utf8 from code base. (#541)
* Remove the deprecated std::codecvt_utf8 from code base.

* utest fix
2023-08-24 10:26:08 -07:00
Wenbing Li 396044310e
Add more HF tokenizer supports in gen_processing_models (#531) 2023-08-18 17:09:22 -07:00
Wenbing Li 978ada6d60
Add TrieTokenizer for RWKV-like LLM models (#509)
* Add TrieTokenizer for RWKV-like LLM models

* add more tests

* fix the windows build

* downloading file instead of check in the vocab file

* a small bug fixing
2023-08-08 16:47:38 -07:00
Scott McKay e448676a5e
Make kernel Compute method implementations const (#500)
* Nodes can be called concurrently and Compute needs to be stateless due to that.

Update the kernels to make Compute const.

* Fix test that uses ustring.h.

Would be better to not have duplicate declarations for GetTensorMutableDataString and FillTensorDataString in ustring.h and string_tensor.h.
2023-07-28 09:25:36 +10:00
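The reasoning in the commit above (a node may be called concurrently, so Compute should not mutate the kernel) can be illustrated with a minimal sketch. `TruncateKernel` and its `max_len` attribute are hypothetical names chosen for illustration; they are not classes from this repository, and the real kernels take ONNX Runtime tensor arguments rather than vectors.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical kernel for illustration; not a class from onnxruntime-extensions.
class TruncateKernel {
 public:
  explicit TruncateKernel(std::size_t max_len) : max_len_(max_len) {}

  // const: the method only reads members, and every intermediate value is a
  // local, so concurrent calls on the same instance share no mutable state.
  std::vector<std::string> Compute(const std::vector<std::string>& input) const {
    std::vector<std::string> output;
    output.reserve(input.size());
    for (const std::string& s : input) {
      output.push_back(s.substr(0, max_len_));
    }
    return output;
  }

 private:
  const std::size_t max_len_;  // set once in the constructor, never mutated
};
```

Marking the method const lets the compiler reject accidental writes to members, which is the property that makes sharing one kernel instance across threads reasonable.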
Wenbing Li bab1989644
refine audiodecoder with new api (#489)
* refine audiodecoder with new api

* update std::optional usage for macOS
2023-07-12 13:11:58 -07:00
Sayan Shaw 9774370bf3
Add perf changes for Bert, CLIP and Roberta with offset mapping (#488)
* add perf changes for CLIP and Roberta

* add perf improvement for BERT

* remove global var

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-07-11 10:29:45 -07:00
Sayan Shaw d876f7ff82
Initial BertTokenizer offset mapping implementation (#477)
* Initial BertTokenizer offset mapping implementation

* minor change

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-07-03 15:17:23 -07:00
Sayan Shaw afb3e83df2
Change default pad token from 0 to 49407 (#474)
* change default pad token from 0 to 49407

* update with GetEncoding

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-06-21 00:11:52 -07:00
Sayan Shaw 6aaf2920bf
Ignore inputs missing from vocab in CLIPTokenizer (#462)
* Ignore unknown inputs in CLIPTokenizer

* add whitespace clean and unknown token handling

* fix const issue

* small updates

* add single whitespace test case

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-06-01 19:40:43 -07:00
Tang, Cheng 8f36cf3272
Use API-lite for custom ops (#386)
* use lite custom op api for math

* add vision ops

* add cx2 ops

* remove useless code

* support register custom kernel struct

* add string tensor support

* add more text kernels

* fix issue with std string as scalar

* migrate all text ops

* initial tokenizer change

* migrate all tokenizers

* Resolve conflict with main (#433)

* resolve conflict

* resolve conflict

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>

* Update custom-op-lite PR (#440)

* add the onnxruntime 1.14 release into the CI pipeline (#387)

* add the onnxruntime 1.14 release into the CI pipeline

* torch 2.0 crashed on Linux

* Fix size_t overflow issue for RobertaTokenizer (#388)

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* Pre and Post processing example for openAI Whisper model (#380)

* add a stft-norm custom op for log-mel spectrum.

* undo the debug change

* Support ONNX standard STFT op signature.

* Add a unit test onnx STFT compatible mode.

* add whisper pre-/post- processing example

* Update dlib.cmake

* undo test code changes

* Update setup.cfg

* update the end2end example with STFT op

* Added optional outputs for GPT2, CLIP and Roberta Tokenizers (#389)

* Initial optional i/o for robertap

* Small fix

* Added working optional output functionality to RobertaTokenizer with tests

* Added optional outputs to CLIPTokenizer

* Added optional outputs to GPT2Tokenizer

* Use ternary operators

---------

Authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* ignore the unknown token id on bpe decoder (#391)

* Use dependency name 'nlohmann_json' which is the same name that ORT uses. (#393)

* Add an audio decoder custom op for whisper end-to-end processing (#385)

* evaluate the audio decoder library

* MP3 Decoder

* rename it to test_audio_codec

* add the audio decoder to whisper model

* whisper end-to-end draft

* fix the mp3 decoder

* Running with ONNX models

* Add more audio format supports

* refine the end-to-end script

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* some fixings of comments and more test cases.

* changes for review comments.

* Update audio_decoder.hpp

* Update audio_decoder.hpp

* code refinement

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

---------

Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* make tensorflow be optional for unittest (#394)

* make tensorflow be optional for unittest.

* typo

* built-in bounding box op (#382)

* built-in bounding box op
* update boundary check
* assert policy
* more boundary test and check
* XYXY--> X horizon
---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>

* a quick nuget package impl. (#396)

* Update wheels_linux.yml: change the linux machine pool name (#398)

* Add a merge step in whisper end-to-end script and fixed some issues (#399)

* add merged models in whisper model

* verify the final model

* support batch > 1 in BpeDecoder (#400)

* support batch > 1 in BpeDecoder

* update the shape in helper function

* [object detection ppp] YoLo as example (#397)

* object detection
* Unit test
add e2e fastestdet model test

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>

* some fixing for python package (#401)

* more code fixing related whisper models (#403)

* Added windows nuget work temporarily for testing (#402)

* Added windows nuget work temporarily for testing

* Cleanup

* Add back onnxruntime.lib in props file for possible future ORT need

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>

* Remove unnecessary nupkg file and update nuspec (#405)

* Add nuget pack to build.bat and small nuget changes for demo

* Temporarily adding nuget.exe to build package until we can add to CI machine

* Switch back from Release to RelWithDebInfo

* Remove unnecessary changes

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* Add initial NuGet pipeline for Windows x64 build (#406)

* initial nuget pipeline

* Update nuget.yml for Azure Pipelines

* update nuget.yml for extensions specific packaging

TODO: add certain template yml files

* added component governance template yaml

* change template yaml path

* remove RoslynAnalyzers

* Add packDestination to nuget pack task (change from default)

* fix nuspec path

* Update nuget.yml for Azure Pipelines

* Update nuget.yml for Azure Pipelines

* Update nuget.yml for Azure Pipelines

* Update 2 nuget.yml for Azure Pipelines

* Update NativeNuget.nuspec

* Update nuget.yml for Azure Pipelines

* update nuspec

* Update 3 nuget.yml for Azure Pipelines

* Update 4 nuget.yml for Azure Pipelines

* Update 7 nuget.yml for Azure Pipelines

* Remove unnecessary nupkg file and update nuspec (#405)

* Add nuget pack to build.bat and small nuget changes for demo

* Temporarily adding nuget.exe to build package until we can add to CI machine

* Switch back from Release to RelWithDebInfo

* Remove unnecessary changes

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* Update 8 nuget.yml for Azure Pipelines

* Update 9 nuget.yml for Azure Pipelines

* add DLL signing

* Update nuget.yml for Azure Pipelines

* fix indentation

* Update 11 nuget.yml for Azure Pipelines

* Update 12 nuget.yml for Azure Pipelines

* Update 12 nuget.yml for Azure Pipelines

* Revert some unnecessary changes on nuget.yml

* clean up nuget.yml and update nuspec release notes

* small changes

* update commit id and release notes

---------

Co-authored-by: Wenbing Li <wenbingl@outlook.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* Compatible with onnxruntime-gpu package (#410)

* be compatible without onnxruntime-gpu version

* some fixing

* Add nuget README and remove ort lib references from props (#409)

* Add nuget README and remove ort lib references from props

* replace commit id in nuspec dynamically

* remove $ sign for commit id token

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>

* Add a C# demo project for NuGet package (#407)

* Add a nuget test app

* remove unused file

* Compatible with onnxruntime-gpu package (#410)

* be compatible without onnxruntime-gpu version

* some fixing

* turn it as a .net demo project

---------

Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>

* Make Whisper E2E script more portable (#412)

This PR makes the Whisper E2E script more portable for other environments.

* Update macos wheel timeout to 180 min (#390)

* Update ci timeout to 120 min

* Only update WindowsPython job timeout

* Update ci timeout to 90 min

* update macos wheel timeout to 180 min

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* Fix OneBranch PR pipeline CodeQL issue (#413)

* test codeql 3000

* switch codeql from compiled to python

* switch back to compiled

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* Adding down-sampling and stereo mixing features for AudioDecoder (#420)

* initial draft

* second

* third

* polishing

* fix the M_PI name in LINUX platform

* fix bessel function issue

* add a unit test case

* fix the unit test name

* Fix Secure Supply Chain Analysis Warning in PR pipeline (#414)

* remove package sources

* remove NuGet.config

* add .sscignore for cfs0011

* change sscignore

* add CFS0013 to sscignore

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* fix onnx version to 1.13.1 (#422)

* [NuGet] All platform package pipeline (#408)

* nuget ci package
* disable macos arm64 build for err

* Get the iOS xcframework build working with the split build/pack approach. (#416)

* refine build_xcframework.py
Cleanup/clarify various things
- naming of parameters and files
- consistency
Make handling of additional build args more generic
Update the artifact download dir/extract dir to more intuitive names
Update scripts
- make usage from CI pipeline clearer (e.g. don't hide directory names inside script)
- keep comments in nuspec
- remove unused args
- make additional arg handling more
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Add new required pre/post processing ops to Android and iOS packages. (#415)

* Revert "Pin onnx version to 1.13.1" (#423)

* Revert "fix onnx version to 1.13.1 (#422)"

This reverts commit eb29d225a7.

* Update requirements.txt

* PyOp attribute supports int and float data type (#425)

* Fix Android AAR in nuget package. Requires libortextensions.so. (#429)

* build for mac M1 (#430)

* Fix the unit test failure with ONNX 1.14 package. (#428)

* Fix the unit test failure with ONNX 1.14 package.

* more tests

* Update whisper_e2e.py

* Add nuget.org publish version option (#426)

* Add nuget.org publish version option

* typo

* small fix

* typo

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* resolve conflict

* resolve conflict

* minor fix

* rename from TensorT to Tensor

* fix string tensor

* Add OrtLiteCustomOp

* switch to string view

* fix regex ops

---------

Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: JiCheng <247153481@qq.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Wenbing Li <wenbingl@outlook.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>

* Fix a build err (#442)

* resolve conflict

* resolve conflict

* minor fix

* rename from TensorT to Tensor

* fix string tensor

* Add OrtLiteCustomOp

* switch to string view

* fix regex ops

* fix build

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>

* Fix build err on ort 141 (#444)

* resolve conflict

* resolve conflict

* minor fix

* rename from TensorT to Tensor

* fix string tensor

* Add OrtLiteCustomOp

* switch to string view

* fix regex ops

* fix build

* fix a build err

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>

* Remove shape from span (#445)

* resolve conflict

* resolve conflict

* minor fix

* rename from TensorT to Tensor

* fix string tensor

* Add OrtLiteCustomOp

* switch to string view

* fix regex ops

* fix build

* fix a build err

* remove shape

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>

* Fix python tests (#446)

* resolve conflict

* resolve conflict

* minor fix

* rename from TensorT to Tensor

* fix string tensor

* Add OrtLiteCustomOp

* switch to string view

* fix regex ops

* fix build

* fix a build err

* remove shape

* fix python tests

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>

* Fix mac build (#449)

* resolve conflict

* resolve conflict

* minor fix

* rename from TensorT to Tensor

* fix string tensor

* Add OrtLiteCustomOp

* switch to string view

* fix regex ops

* fix build

* fix a build err

* remove shape

* fix python tests

* fix packaging err

* fix mac build

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>

* Fix comments (#452)

* resolve conflict

* resolve conflict

* minor fix

* rename from TensorT to Tensor

* fix string tensor

* Add OrtLiteCustomOp

* switch to string view

* fix regex ops

* fix build

* fix a build err

* remove shape

* fix python tests

* fix packaging err

* fix mac build

* fixing the universal2 python package for macOS (#448)

* Remove onnx<1.14 from requirements.txt (#447)

* remove onnx<1.14 from requirements.txt

* downgrade protobuf

* move protobuf req to requirements-dev.txt

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>

* fix comments

* comment version macro

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>

* Fix build err (#453)

* resolve conflict

* resolve conflict

* minor fix

* rename from TensorT to Tensor

* fix string tensor

* Add OrtLiteCustomOp

* switch to string view

* fix regex ops

* fix build

* fix a build err

* remove shape

* fix python tests

* fix packaging err

* fix mac build

* fix comments

* comment version macro

* define Compute for StftNormal

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>

* Merge latest main (#461)

* resolve conflict

* resolve conflict

* minor fix

* rename from TensorT to Tensor

* fix string tensor

* Add OrtLiteCustomOp

* switch to string view

* fix regex ops

* fix build

* fix a build err

* remove shape

* fix python tests

* fix packaging err

* fix mac build

* fix comments

* comment version macro

* define Compute for StftNormal

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>

* revert wanted changes in test

* revert unwanted changed

* add string_strip op

---------

Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com>
Co-authored-by: Randy Shuai <rashuai@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: JiCheng <247153481@qq.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Wenbing Li <wenbingl@outlook.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
2023-05-30 18:04:44 -07:00
Wenbing Li adb8efd62b
support batch > 1 in BpeDecoder (#400)
* support batch > 1 in BpeDecoder

* update the shape in helper function
2023-04-19 14:28:56 -07:00
Wenbing Li b5dce955f0
Add an audio decoder custom op for whisper end-to-end processing (#385)
* evaluate the audio decoder library

* MP3 Decoder

* rename it to test_audio_codec

* add the audio decoder to whisper model

* whisper end-to-end draft

* fix the mp3 decoder

* Running with ONNX models

* Add more audio format supports

* refine the end-to-end script

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* some fixings of comments and more test cases.

* changes for review comments.

* Update audio_decoder.hpp

* Update audio_decoder.hpp

* code refinement

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

---------

Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-04-11 14:47:10 -07:00
Wenbing Li 9cd2221965
ignore the unknown token id on bpe decoder (#391) 2023-04-07 15:24:25 -07:00
Sayan Shaw 460bd34183
Added optional outputs for GPT2, CLIP and Roberta Tokenizers (#389)
* Initial optional i/o for robertap

* Small fix

* Added working optional output functionality to RobertaTokenizer with tests

* Added optional outputs to CLIPTokenizer

* Added optional outputs to GPT2Tokenizer

* Use ternary operators

---------

Authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-04-06 13:28:59 -07:00
Sayan Shaw bbd645de1b
Fix size_t overflow issue for RobertaTokenizer (#388)
Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-03-27 21:07:45 -07:00
Sayan Shaw 07d060dbc0
Improve offset map algorithm for RobertaTokenizer (#383)
Authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-03-23 15:51:14 -07:00
Sayan Shaw 20e6c167d4
Fixed and optimized offset mapping algorithm for CLIPTokenizer (#377)
Authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-03-21 09:51:01 -07:00
Wenbing Li 3b0bd66e9e
Add a bbpe tokenizer decoder for Whisper model (#376)
* initial PR

* add the attributes for op

* cmake update

* add the missing symbol

* add a unit test case

* fix the unit test

* fix some corner case.

* format Python code with autopep8
2023-03-08 15:00:01 -08:00
Wenbing Li ee306dee2a
Fix the build breaks the release pipeline and some C++ warnings (#372)
* fix the break in release pipeline

* code cleanup and the warnings fixing.

* Update ci.yml for Azure Pipelines

* Update ci.yml for Azure Pipelines

* fix linux build

* one more fixing

* again?

* fixing for macOS
2023-02-28 15:45:32 -08:00
Sayan Shaw 4d051b854b
Initial RobertaTokenizer implementation (#365)
* Added initial RobertaTokenizer implementation

* Added offset mapping to output

* Updates for new custom op changes

---------

Authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-02-27 16:48:52 -08:00
Scott McKay 5e44a7c3c9
Add ability to prevent exception propagation if building as part of ORT when ORT has exceptions disabled (#368)
* Add ability to prevent exception propagation with top level try/catch handler macros.

If combined build with ORT has exceptions disabled in ORT but ort-ext has an operator that requires exceptions, we enable exceptions in ort-ext but prevent them propagating up via try/catch in the entry points that ORT can call
  - RegisterCustomOps
  - CustomOpBase constructor and Compute

Removed some places in CustomOpApi that threw if OpKernelInfo* was nullptr by standardizing all kernels to store the OpKernelInfo provided in the ctor.

Added unit tests
  - need to validate on more platforms and add CI for build where we don't want to allow exceptions to propagate

* Update pyop

* Update CMakeLists.txt

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Update includes/exceptions.h

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Update includes/exceptions.h

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Update includes/onnxruntime_customop.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Merge with main and update
Address PR comments
Fix some issues.

* Delete local file

* Fix pyop update

* Add CI
Address PR comments

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
2023-02-27 10:31:44 -08:00
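The approach described in the commit above, catching exceptions at the entry points so they never reach a caller built without exception support, can be sketched as follows. The commit mentions handler macros; this sketch uses a function template instead, and `Status`, `CallGuarded` and `MightThrow` are hypothetical names, not identifiers from the repository.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical status type; the real project may report errors differently.
struct Status {
  bool ok;
  std::string message;
};

// Runs `fn` and converts any exception into a Status, so nothing thrown by
// `fn` propagates past this boundary.
template <typename Fn>
Status CallGuarded(Fn&& fn) {
  try {
    fn();
    return {true, ""};
  } catch (const std::exception& ex) {
    return {false, ex.what()};
  } catch (...) {
    return {false, "unknown exception"};
  }
}

// Example of an operation that may throw (illustration only).
void MightThrow(bool fail) {
  if (fail) throw std::runtime_error("boom");
}
```

A caller would wrap each externally visible entry point, for example `CallGuarded([] { MightThrow(true); })`, and inspect the returned `Status` instead of relying on stack unwinding across the library boundary.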
Sayan Shaw 4ff821805a
Initial CLIP tokenizer implementation (#323)
* Initial CLIP tokenizer implementation

* Moved common code from CLIP and GPT2 tokenizers into separate file

* add the new file into cmake file list.

* Fix ustring reference issue

* merge changes from main branch

* more merge actions

* Minor changes

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
Co-authored-by: Wenbing Li <wenbingl@outlook.com>
2022-12-13 15:52:47 -08:00
Wenbing Li c599b00d07
Using the header files from the ONNXRuntime package (#322)
* Using the header files from the ONNXRuntime package

* Update includes/onnxruntime_customop.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* fix the build break.

* one more fixing

* wired top project

* ort 1.9.0 used

* switch to 1.10.0 package.

* change the vmimage to latest

* URL issue

* cmake policy

* ignore onnxruntime.dll native scan

* update the Onebranch exclusedPaths

* fixing some build tool issues

* update again

* typo

* undo of ORT dll removal

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2022-12-09 14:30:24 -08:00
Wenbing Li fec9af97aa
a naive decoder for sentencepiece tokenization (#314)
* a naive decoder for sentencepiece tokenization

* typo fixing

* add a unit test for the decoder
2022-11-21 11:10:30 -08:00
Wenbing Li 7fc0224410
Partner team's code security fixings (#300) 2022-10-05 16:10:34 -07:00
Wenbing Li d29f6d0f42
fix the C++ dangling pointer from the security check (#296)
* fix the C++ dangling pointer from the security check

* one more fixing
2022-09-28 16:23:43 -07:00
Adrian Lizarraga ae416f6aa6
Issue #288: Prevent copying of OrtApi struct (#290) 2022-09-14 09:45:33 -07:00
Wenbing Li 5320af1eea
Fix the code security issue and 0.5 C++ release preparation. (#274)
* Fix the code security issue and 0.5 C++ release preparation.

* more fixings

* vswhere
2022-08-02 10:09:35 -07:00
shaahji 0616039115 Issue #226: Functional e2e NLP example
* Implemented a new version of Kernel and the CustomOp to support
  output that matches the HuggingFace model's input without the need
  for intermediate python logic.
* Implemented an e2e tutorial for exporting and inferencing using the
  HuggingFace's QuestionAnswering model.

Known Issue: Python side doesn't have an implementation of Bert Decoder
and so the augmented model is only half-complete. At the time of
inferencing the HuggingFace tokenizer is used to decode the result back
to string.
2022-07-22 13:56:40 -07:00
shaahji 6559cf5c0f Issue#226: Prepare BertTokenizer implementation for versioning
* Moved a few variables from Kernel implementation to BertTokenizer so
  each version of the Kernel doesn't have to deal with them.
* Other decorative and code standardization changes.
2022-07-22 13:56:40 -07:00
shaahji d14abe1461
Identified bug in SentencePieceTokenizer where encoding was (#200)
not restricted to specific token. Sentencepiece.Encode itself doesn't
clear the input vector before populating the result for the input
token.

Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
2022-02-22 10:52:28 -08:00
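The bug class described in the commit above (an encode call that appends to its output vector without clearing it, so results from one input leak into the next) can be modeled with a small sketch. `AppendEncode` is a stand-in written for this illustration and is not the sentencepiece API; `EncodeAll` is likewise a hypothetical helper.

```cpp
#include <string>
#include <vector>

// Stand-in for an encoder that appends to its output argument without
// clearing it. NOT the sentencepiece API, only a model of the behavior
// described in the commit message.
void AppendEncode(const std::string& text, std::vector<int>* ids) {
  for (unsigned char c : text) {
    ids->push_back(static_cast<int>(c));
  }
}

// Encodes each input separately while reusing one scratch buffer.
std::vector<std::vector<int>> EncodeAll(const std::vector<std::string>& inputs) {
  std::vector<std::vector<int>> result;
  std::vector<int> scratch;  // reused across iterations
  for (const std::string& text : inputs) {
    scratch.clear();  // without this, ids from the previous input would remain
    AppendEncode(text, &scratch);
    result.push_back(scratch);
  }
  return result;
}
```

Clearing the buffer at the top of the loop keeps each result restricted to its own input, at the cost of one extra call per iteration.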
joburkho a9737505ca
Joburkho/change to const char star (#187)
* Correct memory reservation.

* Change output_sentences to vector of const char*.
2021-12-01 00:01:52 -08:00
Mojimi b5a8a1abd9
add test (#180)
Co-authored-by: Ze Tao <zetao@microsoft.com>
2021-11-01 10:08:09 +08:00
Zuwei Zhao 05f7ded825
Add check for empty input in StringJoin operator and fix empty string input error in BlingFire sentence breaker. (#175)
* Add test cases and fix empty string error in BlingFire sentence breaker.

* Throw error if input text to join is empty array.

* Fix scalar support and access violation.

* Resolve comments.

* Resolve comments.

Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
2021-10-27 20:21:16 +08:00
Mojimi 46d096f1af
Fix ::tolower error when locale is not 'C' (#174)
* add test and implement tolower

* fix locale

* fix locale

Co-authored-by: Ze Tao <zetao@microsoft.com>
2021-10-20 20:59:29 -07:00
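The result of `::tolower` depends on the currently installed C locale, which is the kind of problem the commit above addresses. One locale-independent alternative, shown here only as a sketch and not necessarily what the commit implemented, is to lowercase ASCII letters by hand. `AsciiLowerChar` and `AsciiLower` are hypothetical names.

```cpp
#include <string>

// Lowercases a single ASCII letter without consulting the global C locale.
// Any other byte (digits, punctuation, UTF-8 continuation bytes) is returned
// unchanged.
char AsciiLowerChar(char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Applies AsciiLowerChar to every byte of the string.
std::string AsciiLower(std::string s) {
  for (char& c : s) {
    c = AsciiLowerChar(c);
  }
  return s;
}
```

This handles only the 26 ASCII letters; lowercasing non-ASCII text needs Unicode-aware case mapping, which is outside the scope of this sketch.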
Mojimi 448518534c
Add native test for bert tokenizer (#173)
* add native test for bert tokenizer

* add python test

* fix unicode category

Co-authored-by: Ze Tao <zetao@microsoft.com>
2021-10-19 11:09:38 -07:00
Wenbing Li 70aa18e14e
add a native unit test for regex_split op (#166)
* add a native unit test for regex_split op

* fix the case of shape [1, 0]

* Update mshost.yaml

* downgrade the test model version.

* upgrade torch version on Windows CI

* disable windows python 3.7 pipeline.
2021-10-06 15:58:46 -07:00
joburkho 4d7004bf6e
Correct memory indexing issue. (#165)
* Correct memory reservation.

* Fix the vmImage version for MacOS CI pipeline.

Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
2021-10-04 16:26:34 -07:00
Mojimi 4290400ed3
Add doc for new operators (#161)
* add initial doc

* update doc

* finish all docs

Co-authored-by: Ze Tao <zetao@microsoft.com>
2021-09-29 07:59:09 +08:00
Mojimi 2d6cf0b4ea
Reduce bert tokenize memory usage (#156)
* add BertTokenizerVocab

* improve format

Co-authored-by: Ze Tao <zetao@microsoft.com>
2021-09-27 11:19:57 -07:00
Mojimi d8cdb8e042
reduce memory usage (#154)
Co-authored-by: Ze Tao <zetao@microsoft.com>
2021-09-27 13:45:47 +08:00
Wenbing Li 9f3abe20fd
Prepare for 0.4.0 release (#151)
* new CI configuration

* Set up CI with Azure Pipelines

[skip ci]

* install numpy in cibuildwheel

* add pyproject.toml

* upgrade vmImage

* update the build python versions

* remove the pytest

* move the wheel build files

* enable sdist setup.py as well.

* use git command line

* Update wheels.yml for Azure Pipelines

* disable the pypy package for macos;

* fix the external repo code tag

* fix the ctest problem

* fix the unicode 8217.

* fix the locale base test
2021-09-25 00:40:12 -07:00
Zuwei Zhao 6d7a865913
Disable c++ exceptions in onnxruntime-extensions. (#143)
* Disable c++ exceptions in onnxruntime-extensions.

* Remove cxx flags for extensions.

* Remove redundant lines.

Co-authored-by: Zuwei Zhao <zuzhao@microsoft.com>
2021-09-09 08:21:40 +08:00