Commit graph

7 commits

Author SHA1 Message Date
Wenbing Li d1148aea4e
Support 'added_token' attribute for BPE tokenizer and some code refactoring. (#591)
* Fix CodeGenTokenizer issues and the related code refactoring.

* refactor the trie-tree

* temp check-ins

* code complete

* correctness fixing

* Update _hf_cvt.py

* more test cases fixing

* more refinement

* linux crash fixing

* Update test_autotokenizer.py
2023-11-04 22:56:26 -07:00
Wenbing Li 68b9d1dc47
Fix the exception on invalid trie-tokenizer input (#575)
* fix the exception on invalid trie-tokenizer input

* remove unused import
2023-10-16 17:03:02 -07:00
Baiju Meswani 46a37c3902
Ensure noexcep_operators and ocos_operators get built always (#570)
2023-10-09 18:29:02 -07:00
Wenbing Li 367f59c6fa
Remove the deprecating std::codecvt_utf8 from code base. (#541)
* Remove the deprecating std::codecvt_utf8 from code base.

* utest fix
2023-08-24 10:26:08 -07:00
Scott McKay e448676a5e
Make kernel Compute method implementations const (#500)
* Nodes can be called concurrently and Compute needs to be stateless due to that.

Update the kernels to make Compute const.

* Fix test that uses ustring.h.

Would be better to not have duplicate declarations for GetTensorMutableDataString and FillTensorDataString in ustring.h and string_tensor.h.
2023-07-28 09:25:36 +10:00
Wenbing Li 3b0bd66e9e
Add a bbpe tokenizer decoder for Whisper model (#376)
* initial PR

* add the attributes for op

* cmake update

* add the missing symbol

* add a unit test case

* fix the unit test

* fix some corner case.

* format Python code with autopep8
2023-03-08 15:00:01 -08:00
Wenbing Li ee306dee2a
Fix the build breaks the release pipeline and some C++ warnings (#372)
* fix the break in release pipeline

* code cleanup and the warnings fixing.

* Update ci.yml for Azure Pipelines

* Update ci.yml for Azure Pipelines

* fix linux build

* one more fixing

* again?

* fixing for macOS
2023-02-28 15:45:32 -08:00