Commit graph

85 commits

Author SHA1 Message Date
Wenbing Li 396044310e
Add more HF tokenizer supports in gen_processing_models (#531) 2023-08-18 17:09:22 -07:00
Wenbing Li ee14fbe48e
correct CLIP tokenizer name (#526) 2023-08-16 12:51:17 -07:00
Sayan Shaw 9ba649e134
Fix HF Fast Tokenizer cvt issue for AutoTokenizer imp (#520)
* Fix GPT2 and Falcon tokenizer cvt for AutoTokenizer imp

* fix fast tokenizer issue

* small fix

* use slow tokenizer in test script

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-08-11 13:17:56 -07:00
Wenbing Li 978ada6d60
Add TrieTokenizer for RWKV-like LLM models (#509)
* Add TrieTokenizer for RWKV-like LLM models

* add more tests

* fix the windows build

* downloading file instead of check in the vocab file

* a small bug fixing
2023-08-08 16:47:38 -07:00
Sayan Shaw 997e9ee007
Add Falcon-7b and Falcon-40b tokenizer support (#510)
* Add Falcon-7b and Falcon-40b tokenizer support

* fix alignment and add tokenizer file in test/data to speed up compute

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-08-07 14:37:57 -07:00
Wenbing Li 922b7cc387
Add Bert tokenizer in the supported model list and code refinement (#503)
* Add Bert tokenizer in the supported model list and the related code refinement

* utest fix
2023-08-02 14:01:36 -07:00
Wenbing Li b8bac85ecd
Add Llama and Llama 2 tokenization supports (#499) 2023-07-26 10:22:00 -07:00
Wenbing Li 62d8598b6b
Update whisper model test cases and e2e example (#496)
* Update whisper model test cases and e2e example

* fix unit test on windows

* more refinement

* utest fix
2023-07-21 15:27:02 -07:00
Wenbing Li 981cb049ff
Add a new API for building data processing graph from Huggingface transformers processor/tokenizer (#482)
* initial checkins

* test pass

* basic impl

* first unit test pass

* merge error

* refine a little bit

* add more unit test

* fix unit test

* Fix the unit test.

* add one more whisper audiodecoder test case

* update the docs

* More updates
2023-07-17 16:50:58 -07:00
JiCheng 5d480a8c5d
clip_image_processor (#478)
* clip_image_processor
separate clip ppp

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-07-12 17:52:17 +08:00
Sayan Shaw d876f7ff82
Initial BertTokenizer offset mapping implementation (#477)
* Initial BertTokenizer offset mapping implementation

* minor change

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-07-03 15:17:23 -07:00
Wenbing Li 93f239c143
Unit test being compatible with ONNXRuntime-GPU package, and some clean-ups. (#457) 2023-05-30 11:01:30 -07:00
Scott McKay 64f20828ce
Handle ONNX 1.14 in test scripts (#435)
* Calculate and specify ir_version so we use the oldest possible for maximum compatibility

* Don't use `ignore_unknown` in call to `find_min_ir_version_for` as it's only supported in the most recent ONNX release.
2023-05-12 07:13:37 +10:00
Vishal Jain 03b96c822c
Fix ReadMe : Example usage of the PrePostProcessor.md (#436)
- Small typo fix in "Add post-processing steps"
2023-05-11 18:36:14 +10:00
Wenbing Li 43994eb34a
Fix the unit test failure with ONNX 1.14 package. (#428)
* Fix the unit test failure with ONNX 1.14 package.

* more tests

* Update whisper_e2e.py
2023-05-08 11:37:54 -07:00
Wenbing Li 46efcb9051
PyOp attribute supports int and float data type (#425) 2023-05-05 19:35:59 -07:00
Wenbing Li 2fa0b710ea
Adding down-sampling and stereo mixing features for AudioDecoder (#420)
* initial draft

* second

* third

* polishing

* fix the M_PI name in LINUX platform

* fix bessel function issue

* add a unit test case

* fix the unit test name
2023-05-04 13:30:10 -07:00
Wenbing Li 0f45fef2d9
Compatible with onnxruntime-gpu package (#410)
* be compatible without onnxruntime-gpu version

* some fixing
2023-04-26 17:17:23 -07:00
Wenbing Li 997fa892c2
more code fixing related whisper models (#403) 2023-04-21 09:26:44 -07:00
JiCheng db87dc416d
[object detection ppp] YoLo as example (#397)
* object detection
* Unit test
add e2e fastestdet model test

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-04-20 13:34:11 +08:00
Wenbing Li adb8efd62b
support batch > 1 in BpeDecoder (#400)
* support batch > 1 in BpeDecoder

* update the shape in helper function
2023-04-19 14:28:56 -07:00
Wenbing Li 711774db6b
Add a merge step in whisper end-to-end script and fixed some issues (#399)
* add merged models in whisper model

* verify the final model
2023-04-17 16:37:06 -07:00
JiCheng 154ead35a3
built-in bounding box op (#382)
* built-in bounding box op
* update boundary check
* assert policy
* more boundary test and check
* XYXY--> X horizon
---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2023-04-12 19:35:53 +08:00
Wenbing Li b5dce955f0
Add an audio decoder custom op for whisper end-to-end processing (#385)
* evaluate the audio decoder library

* MP3 Decoder

* rename it to test_audio_codec

* add the audio decoder to whisper model

* whisper end-to-end draft

* fix the mp3 decoder

* Running with ONNX models

* Add more audio format supports

* refine the end-to-end script

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* some fixings of comments and more test cases.

* changes for review comments.

* Update audio_decoder.hpp

* Update audio_decoder.hpp

* code refinement

* Update operators/audio/audio_decoder.hpp

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

---------

Co-authored-by: Sayan Shaw <52221015+sayanshaw24@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-04-11 14:47:10 -07:00
Wenbing Li 9cd1284da8
Pre and Post processing example for openAI Whisper model (#380)
* add a stft-norm custom op for log-mel spectrum.

* undo the debug change

* Support ONNX standard STFT op signature.

* Add a unit test onnx STFT compatible mode.

* add whisper pre-/post- processing example

* Update dlib.cmake

* undo test code changes

* Update setup.cfg

* update the end2end example with STFT op
2023-03-30 13:44:50 -07:00
Sayan Shaw 8b2af20b46
Update CLIPTokenizer cvt for added offset mapping output (#384)
Authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-03-23 23:52:58 -07:00
Sayan Shaw b3420f9ca3
Added CLIPTokenizer to _cuops.py and corresponding cvt func and test (#379)
Authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-03-15 15:45:59 -07:00
Sayan Shaw 29f55ce400
Added cvt function for RobertaTokenizer (#378)
* Added roberta converter

* Added roberta to _cuops and added cvt test

---------

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
2023-03-14 15:45:31 -07:00
Wenbing Li 3b0bd66e9e
Add a bbpe tokenizer decoder for Whisper model (#376)
* initial PR

* add the attributes for op

* cmake update

* add the missing symbol

* add a unit test case

* fix the unit test

* fix some corner case.

* format Python code with autopep8
2023-03-08 15:00:01 -08:00
JiCheng b375cb57e6
support mobilebert_ppp (#354)
* support mobilebert_ppp

* renaming IOEntryValuePreserver

* generalize argmax step
---------

Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
2023-02-27 18:53:37 +08:00
Scott McKay 91d75f460b
Update tutorial and example usage to provide info on installing the nightly (#364) 2023-02-17 19:27:00 -08:00
Scott McKay f3654e5bac
Fix opset 18 issues and bug due to ORT Resize issue (#362)
* - Fix Split(18) requiring num_outputs.
- Calculate `sizes` in Resize instead of using the simpler `scales`
  - ORT implementation does not round correctly when applying scales
- Update center crop to use float so we are more accurate in choosing the crop area.
- Fix minor issue with Debug step by only adding values that are altered to the renaming graph inputs.
- Update unit tests expected output due to the change in Resize using sizes instead of scales.
- Crop e2e example input so before/after image covers same area.

* Simplify.
CenteredCrop doesn't need to use float as it's dividing by 2 (so using float + floor gives the same result).
Remove Resize impl using scales - we most likely will never go back to it.
Address PR comments
Update doc
2023-02-18 06:37:05 +10:00
Scott McKay cd5ea11aaa
Move the pre/post processing scripts into the python module. (#349)
* Move the pre/post processing scripts into the python module.
Update usage/examples.

* Use better version parsing.

* Update tests, docs,

* Address PR comments.
Remove global Settings and pass onnx opset around directly where needed. Make PrePostProcessor the owner of the checker context.
2023-01-26 08:30:21 +10:00
Wenbing Li 67c77d9fbc
align python package version with version.txt (#345)
* align python package version with version.txt

* Update setup.py

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

* remove a line

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2023-01-12 14:28:32 -08:00
Wenbing Li fec9af97aa
a naive decoder for sentencepiece tokenization (#314)
* a naive decoder for sentencepiece tokenization

* typo fixing

* add a unit test for the decoder
2022-11-21 11:10:30 -08:00
Sayan Shaw 4683276158
Fixed attribute not found issues for hf_bert_tokenizer (#311)
* Fixed do_lower_case attribute not found issue

* Added check for strip_accents

* Fixed typo

* Changed strip_accents handling

Co-authored-by: sayanshaw <sayanshaw@microsoft.com>
2022-11-01 13:42:10 -07:00
Wenbing Li 08659eae90
Initial Java API for the JAR package. (#292)
* more C++ code fixing and polish for release

* fixing for android build

* build flags for android release

* add missing exporting function

* imint

* first version

* more C++ code fixing and polish for release (#275)

* more C++ code fixing and polish for release

* fixing for android build

* build flags for android release

* add missing exporting function

* support build_id on Python package building (#281)

* support buildid in package building

* undo the change on build.sh

* build.sh issue on macos

* Add `$schema` to `cgmanifest.json` (#284)

Co-authored-by: Jamie Magee <jamie.magee@microsoft.com>

* test package with a simple java app

* demo app

* some fixing for windows platform

* refine the example app

* fix the missing symbols issue for Linux build

* fix the package build issue

* typo

* a missing change

* fix PythonOp

* fix Android test issue

* one more Android change

* replace build flags in ci pipeline

* android AAR package build

* refine the code for android package

Co-authored-by: Jamie Magee <jamie.magee@gmail.com>
Co-authored-by: Jamie Magee <jamie.magee@microsoft.com>
2022-10-04 16:22:28 -07:00
shaahji 78d8dd5705 OpenCV Image Decoder & SuperResolution CustomOps 2022-09-30 12:08:38 -07:00
Wenbing Li 134f882e64
more C++ code fixing and polish for release (#275)
* more C++ code fixing and polish for release

* fixing for android build

* build flags for android release

* add missing exporting function
2022-08-04 10:13:17 -07:00
Wenbing Li 5320af1eea
Fix the code security issue and 0.5 C++ release preparation. (#274)
* Fix the code security issue and 0.5 C++ release preparation.

* more fixings

* vswhere
2022-08-02 10:09:35 -07:00
shaahji 0616039115 Issue #226: Functional e2e NLP example
* Implemented a new version of Kernel and the CustomOp to support
  output that matches the HuggingFace model's input without the need
  for intermediate python logic.
* Implemented an e2e tutorial for exporting and inferencing using the
  HuggingFace's QuestionAnswering model.

Known Issue: Python side doesn't have an implementation of Bert Decoder
and so the augmented model is only half-complete. At the time of
inferencing the HuggingFace tokenizer is used to decode the result back
to string.
2022-07-22 13:56:40 -07:00
shaahji 3b2409d880 Issue #230: Fix argument handling in BertTokenizer
Fixed argument handling in pnp where the arguments weren't being passed
down to the tokenizer as expected.
2022-07-21 00:00:09 -07:00
shaahji 8c3713194b Issue #243: Cannot rename input and output names of generated model
When the input is a string, the logic takes a different route where the
input model is split into two and joined again. The user provided
input/output names were not respected on this code path. Fixed the
issue by renaming the input/output post join operation.
2022-07-19 12:11:33 -07:00
Wenbing Li e0952e7f2b
update the ci pipeline due to ONNX package upgrading (#256)
* update the ci pipeline due to ONNX package upgrading

* no 3.10 onnxruntime package
2022-06-27 15:04:27 -07:00
Wenbing Li 292a0297b4
reformat test code and verify the pipeline (#251)
* reformat test code and verify the pipeline

* upgrade googletest version

* fix the merge issue

* more formating
2022-06-20 12:38:06 -07:00
Wenbing Li 1a04abdf3e
Add two opencv operators as ONNX custom ops. (#249)
* Add two opencv operators as ONNX custom ops.

* update the git apply command line

* adjust the difference threshold

* do not break the build on binskim issue

* Make ImageReader be optional

* try to fix some potential build break

* undo the debug flag in setup.cfg
2022-06-15 23:22:10 -07:00
Wenbing Li da4784a2cc
update the bert end to end example with hftok (#236) 2022-06-01 10:41:42 -07:00
shaahji 49548f843d Issue #230: Add HuggingFace vocab format to Bert tokenizer
HuggingFace vocab format is newline separated (unlike GPT which is
json). Newline separated is likely to be faster and doesn't require
an external library to parse it. Instead of introducing a json based
format, added support for native HuggingFace newline separated token
format.
2022-05-26 14:17:20 -07:00
Wenbing Li 909acb7ce4
build and packaging script improvement for release (#218)
* integrate opencv

* small fixing

* Add the opencv includes and libs

* refine a little bit

* standardize the output folder.

* fix ctest on Linux

* fix setup.py on output folder change.

* more fixings for CI pipeline

* more fixing 1

* more fixing 2

* more fixing 3

* ci pipeline fixing 1

* ci pipeline fixing 2

* a silly typo...

* ci pipeline fixing 3

* fixing the file copy issue.

* last fixing.

* re-test the fullpath in build_ext.

* One more try

* extend timeout

* mshost.yml indent

* Update mshost.yaml for Azure Pipelines

* cibuild build python versions

* Update wheels.yml

* only build python 3.8/3.9

* Update wheels.yml for Azure Pipelines

* separate the ci pipeline
2022-05-11 16:51:59 -07:00
Wenbing Li bfbfa5a304
An end-to-end BERT model with pre-/post- processing. (#224)
* bert demo

* add some comments

* support multiple outputs in ONNX model

* code polishing

* encoding issue on Windows platform.
2022-04-20 16:14:46 -07:00