Commit Graph

591 Commits

Author SHA1 Message Date
Wenbing Li c3379ecb6b
fix the build for mobile packaging (#843)
* fix the build for mobile packaging

* update the cmake file as well

* more fixing on dlib related ops

* release the iOS cmake version constraint

* upgrade cmake in Linux CUDA build

* Update Dockerfile.ubuntu_cuda11_8_tensorrt8_6 for typo

* Update ios_packaging.yml for Azure Pipelines

* update the dlib versoin

* update all cases of cmake version

* update the comment for dlb cmake
2024-11-17 20:09:36 -08:00
Wenbing Li 5104bb9897
fix the win32 macro usage (#844) 2024-11-15 11:26:37 -08:00
Wenbing Li 3da0d3c929
Load the tokenizer data from the memory (#836) 2024-11-09 10:15:21 -08:00
Kyle 14f280adf6
Change Pipeline's Service Connection Name (#841)
* change service connection name
2024-11-08 11:39:37 +08:00
Kyle ece1db2dc7
Migrate Pipeline to 1ES PT - wheels_macos (#840)
migrate pipeline.
2024-11-07 12:57:16 +08:00
Kyle aabc4030f0
Upgrade Pipeline Python Version to 3.12 (#839) 2024-11-05 09:24:43 -08:00
Kyle 31056e7d4f
Migrate Pipelines - Phase 1 - Five Pipelines and Templates (#838)
migrate pipelines
2024-11-05 11:38:23 +08:00
Sayan Shaw 5b7e3d4b8b
Fix prefast issue in image transforms (#837)
* fix prefast issue in image transforms

* Update image_transforms.hpp

* Update image_transforms.hpp

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
2024-10-31 15:10:47 -07:00
Wenbing Li be5aa773e3
Unify the image operations in extensions library (#831)
* Unify the image operations in extensions library

* fix the build configuration issue

* More build fixings

* Fix the native image codec

* fix encode_image

* Add bgr/rgb conversion for encoding image

* parity check

* build break

* update PNG encoding parameters

* build break on Linux

* using MSE to compare images

* fix the discrependency between Linux and Windows

* final code refinement

* one more change

* fix the C++ warnings

---------

Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
2024-10-30 09:17:06 -07:00
Sayan Shaw 0e6bffa201
Fix regex prefast warnings (#832)
* fix regex prefast warnings

* remove try catch

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2024-10-29 22:36:59 -07:00
Sayan Shaw f12431a211
Upgrade versions in CI matrix and fix CI issue (#835)
* upgrade ci matrix

* typo

* revert python version

* update python and ort range

* update python range

* update for macos and linux too

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2024-10-29 19:40:54 -07:00
Wenbing Li aa2c82fa67
Add the MLlama Imaging Processing Support (#823)
* initial checkins for mllama image process

* fix some tests

* some fixings

* add more image

* More test assertions

* parity test passed

* code clean up

* code refinement
2024-10-22 14:24:09 -07:00
Sayan Shaw 7ab9d24cb4
Add general regex support (#822)
* Add general regex support

* add case 5 support instead of replacing with s+

* add more test cases

* address comments

* add back gpt2 and llama regex methods for efficiency

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2024-10-21 16:29:17 -07:00
Wenbing Li 1fb87a30f7
Validate the tokenizer class name on data loading (#830) 2024-10-21 13:25:37 -07:00
Rony Fadel 8de0d6c8db
Change the framework bundle identifier to a valid one (#829)
Ref: https://github.com/microsoft/onnxruntime-extensions/issues/825

"com.microsoft.onnxruntime_extensions" is not a valid identifier. Update it to "com.microsoft.onnxruntime-extensions"
2024-10-21 10:56:41 -07:00
Akshay Sonawane 944bad6036
bump version from 0.13.0 to 0.14.0 (#827) 2024-10-17 11:55:58 -07:00
Wenbing Li e19c0894ec
Fix CUDA CI build failures (#824) 2024-10-11 16:08:44 -07:00
Wenbing Li 62c0a7bfda
fix the unigram detector for last HG tokenizer (#820) 2024-10-03 14:25:53 -07:00
Stalin Sabu Thomas f47bed4596
add(tutorials): exporting yolo world model (#803)
* add(tutorials): exporting yolo world model

This allows us to export yolo world onnx model which can be later used in mobile inference.

* add(tutorial): make classes optional

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-10-03 14:42:35 +10:00
Wenbing Li 12a9e8beb4
support sentence-piece add_dummy_prefix for all models (#819)
* add compatibility docs

continue updating the doc

updating doc 2

* support sentence-piece add_dummy_prefix for all models

* revert the flag

* initialize the add_dummy_prefx for llama model
2024-10-01 09:08:59 -07:00
Wenbing Li e710d80f71
Improve Documentation: Add Hugging Face Compatibility Docs and Refine the existing docs (#818)
* add compatibility docs

* continue updating the doc

* updating doc 2

* revert the bpe changes
2024-09-30 13:04:33 -07:00
Wenbing Li 2c3e936cfc
support the merges array in tokenizer.json (#817) 2024-09-26 11:01:13 -07:00
Chester Liu e424838708
Added support for native image decoding (#808)
This added support for native image decoding on Windows & Apple platforms.
This helps us remove libpng & libjpeg completely on these platforms, and
in the meantime support more image formats thanks to OS vendors,
2024-09-26 09:17:55 +08:00
Chester Liu f90a04606b
Fix unused result warnings (#802)
Fix several unused result warnings

---------

Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
2024-09-26 07:54:16 +08:00
Wenbing Li f204a4c791
Add a decoder for Unigram tokenizer and unify some classes among tokenizers (#816)
* rename and formalize the file names

* add the decoder impl

* fix a typo
2024-09-25 10:25:06 -07:00
Wenbing Li 6b94f4d7a5
Fix the Unicode code discrepency on CLIP model (#814)
* refine the code structure

* more fixing on unicode

* fix the codepoint 304

* add the clip tokenizer data files abck
2024-09-23 16:49:24 -07:00
Wenbing Li 176c1d0138
Support the Unigram tokenizer kind from sentencepiece library (#811)
* initial commit

* Ugm vocab loaded is good

* test passed

* fixes unit test on win32

* finish the parity check

* code refinement

* code refinement for review
2024-09-19 15:46:13 -07:00
Sayan Shaw 0d5d19f67b
fix prefast warning (#809)
Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2024-09-15 22:34:07 -07:00
Chester Liu 8d842d85e3
Rm zlib when linking ocos_operators (#807) 2024-09-13 07:07:10 +08:00
Sayan Shaw 8bc8e43da1
Add C++ regex support for Llama3, Standard Library, and Custom Cases (#804)
* add C++ standard library regex support for GPT2 case

* reorder regex handling

* try without STL

* missing case

* add llama3 regex support

* add custom regex impl

* change regex based on model

* modify tests, add docs, and code cleanup

* add regex test and const strings

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2024-09-10 23:17:49 -07:00
Scott McKay 9164f54e5d
Don't disable vision operators in a catalyst build. (#805)
* Don't disable vision operators in a catalyst build.

* Patch to exclude NSImage on Mac-catalyst as it's not supported.
2024-09-10 08:58:09 +10:00
Wenbing Li 90d8f33172 Revert "some data calc fixing"
This reverts commit dae9510dbb.
2024-09-05 09:30:19 -07:00
Wenbing Li dae9510dbb some data calc fixing
really split the images

test with sus
2024-09-05 09:26:05 -07:00
Wenbing Li 1b80794903
Remove OpenCV dependency from C_API mode (#800)
* Remove OpenCV dependency from C_API model

* fix build on Windows

* switch ci build flag

* try to fix the macOS build issue

* more fixing

* fix the macOS build issue

* list jpeg source

* verified on MacOS

* update the pp_api too

* avoid the codecs library conflicts

* Add the unit tests

* move the codec test

* add the missing dl lib for extensions test

* refine the code

* a smaller fixing for Windows Python
2024-09-04 16:50:05 -07:00
Kyle 7c3ce36af8
Add Files Signature Validation after Signed by ESRP (#801)
* vlidate sign after ERSP

* blank line

* format
2024-09-02 17:17:03 +08:00
Wenbing Li b8b2ebfb85
optimize spm tokenizer for long text (#799)
* optimize spm tokenizer for long text

* refine the split logic

* re-trigger CI pipeline.
2024-08-30 14:58:40 -07:00
Prathik Rao 6f532376c9
bump (#791)
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
2024-08-27 18:58:18 -07:00
Wenbing Li 2d02a687be
Optimize the tokenizer for efficiency (#797)
* optimize the tokenizer for efficiency

* fix the unit test failures.

* fix the api test case failures

* removed the unused code.

* More test cases fixings

* One more fixing

* fix macOS build issues

* refine the test

* add more diagnosis info.

* fix unit test in CI Linux

* fix the pp_api test failure
2024-08-27 18:57:50 -07:00
Yi Zhang 2d044adbf9
sign with the correct key code (#796)
Fixes incorrect dll singnature
2024-08-26 16:48:29 +08:00
Wenbing Li 8f2c35fad0
Add more tests for pre-processing C APIs (#793)
* initial api for tokenizer

* More fixings and test data refinement

* add a simple wrapper for pre-processing APIs

* fix the test issues

* test if the tokenizer is spm based

* fix the failed test cases

* json pointer does not work
2024-08-21 16:48:39 -07:00
Zhipeng Han 85ffb94169
Update custom_ops.md (#795)
add domain for SentencePiece Op
2024-08-21 09:52:54 -07:00
Wenbing Li 711a2cfa69
add a convert_token_string_to_an_id API for the prompt ids (#794)
* add a convert token string to an id API for the prompt ids

* fix the build issues on Linux
2024-08-19 16:44:07 -07:00
vraspar 6ce22f8ac4
Update nuget extraction path for iOS xcframework (#792)
* Update nuget extraction path for iOS xcframework

* Update nuget extraction path for iOS xcframework
2024-08-16 10:34:40 +10:00
vraspar 8b5354fb67
Update macosx framework packaging to follow apple guidelines (#776)
* Update macosx framework packaging to follow apple guidelines

* Test path fix

* Update tools/ci_build/extract_nuget_files.ps1

---------
2024-08-13 10:37:22 +10:00
Wenbing Li be29e28dd7
support tokenizers build only in C API mode (#783)
* support tokenizer build only in C API mode

* fix the python build.

* fix the selectedops build

---------

Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
2024-08-02 13:28:58 -07:00
Sayan Shaw 7851b51ee3
Add initial tiktoken and Phi3SmallTokenizer support (#729)
* add initial tiktoken support

* add vector hash and equal for bpe ranks map

* change lambda comparator

* move phi-3-small files

* final changes

* move tiktoken files from data2 to data

* add unit test

* add tokenizer module

* merge json and tiktoken impl

* fix tiktoken encoding problem

* address comments

* remove dummy tokens

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
Co-authored-by: Wenbing Li <10278425+wenbingl@users.noreply.github.com>
2024-08-02 10:24:02 -07:00
Wenbing Li 46998e96fb
Update build-package-for-windows.yml (#784) 2024-08-01 14:45:26 -07:00
Wenbing Li 4bb63dd2aa
Upgrade ESRP signing task from v2 to v5 (#780)
* Upgrade ESRP signing task from v2 to v5

* Upgrade ESRP signing task from v2 to v5 in win

---------

Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
2024-08-01 09:57:59 -07:00
Wenbing Li 8b002b86ab
Fix the case that bos_token is null (#781) 2024-07-31 17:50:20 -07:00
Wenbing Li b4ebfc9519
Fix spm converted FastTokenizer issue on non-ascii char (#778)
* Fix spm converted tokenizer issue on non-ascii char

* remove pkg_resource in python
2024-07-31 14:22:25 -07:00