Improve Documentation: Add Hugging Face Compatibility Docs and Refine the existing docs (#818)
* add compatibility docs
* continue updating the doc
* updating doc 2
* revert the bpe changes
This commit is contained in:
Parent: 2c3e936cfc
Commit: e710d80f71

README.md (18 changes)
@@ -4,29 +4,17 @@

## What's ONNXRuntime-Extensions
Introduction: ONNXRuntime-Extensions is a C/C++ library that extends the capability of ONNX models and model inference with ONNX Runtime via the ONNX Runtime Custom Operator ABIs. It includes a set of [ONNX Runtime Custom Operators](https://onnxruntime.ai/docs/reference/operators/add-custom-op.html) that support common pre- and post-processing for vision, text, and NLP models, and it supports multiple languages and platforms: Python on Windows/Linux/macOS, mobile platforms such as Android and iOS, WebAssembly, and more. The basic workflow is to first enhance an ONNX model and then run model inference with ONNX Runtime and the ONNXRuntime-Extensions package.

## Quickstart

The library can be used either as a C/C++ library or through higher-level language packages such as Python, Java, and C#. To build it as a shared library, use the `build.bat` or `build.sh` script located in the root folder. The CMake build definition is available in `CMakeLists.txt` and can be modified by appending options to `build.bat` or `build.sh`, such as `build.bat -DOCOS_BUILD_SHARED_LIB=OFF`. For more details, please refer to the [C API documentation](./docs/c_api.md).

### **Python installation**

```bash
pip install onnxruntime-extensions
```

### **Nightly Build**

#### <strong>on Windows</strong>

```cmd
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ onnxruntime-extensions
```

Please ensure that you have met the prerequisites of onnxruntime-extensions (e.g., onnx and onnxruntime) in your Python environment.

#### <strong>on Linux/macOS</strong>

Please make sure a compiler toolchain such as gcc (g++ 8.0 or later) or clang is installed before running the following command:

```bash
python -m pip install git+https://github.com/microsoft/onnxruntime-extensions.git
```

A nightly build with the latest features is also available; please refer to [nightly build](./docs/development.md#nightly-build).

## Usage

docs/c_api.md
@@ -1,26 +1,30 @@

# ONNXRuntime Extensions C ABI
## Introduction
<span style="color:red">The C APIs in onnxruntime-extensions are experimental and subject to change.</span>

ONNXRuntime Extensions provides a C-style ABI for pre-processing. It offers support for tokenization, image processing, speech feature extraction, and more. You can compile ONNXRuntime Extensions as either a static or a dynamic library to access these APIs.

The C ABI header files are named `ortx_*.h` and can be found in the include folder. There are three types of data processing APIs available:

- [`ortx_tokenizer.h`](../include/ortx_tokenizer.h): Provides tokenization for LLM models.
- [`ortx_processor.h`](../include/ortx_processor.h): Offers image processing APIs for multimodal models.
- [`ortx_extractor.h`](../include/ortx_extractor.h): Provides speech feature extraction for audio data processing to assist the Whisper model.

## ABI QuickStart

Most APIs accept raw data inputs such as audio or image data in compressed binary formats, or UTF-8 encoded text for tokenization.

**Tokenization:** You can create a tokenizer object using `OrtxCreateTokenizer` and then use the object to tokenize text or decode token IDs back into text. A C-style code snippet is available [here](../test/pp_api_test/test_tokenizer.cc#L448).

**Image processing:** `OrtxCreateProcessor` can create an image processor object from a pre-defined workflow in JSON format to process image files into a tensor-like data type. An example code snippet can be found [here](../test/pp_api_test/test_processor.cc#L16).

**Audio feature extraction:** `OrtxCreateSpeechFeatureExtractor` creates a speech feature extractor to obtain log mel spectrum data as input for the Whisper model. An example code snippet can be found [here](../test/pp_api_test/test_feature_extraction.cc#L15).

**Note:** To build onnxruntime-extensions as a shared library with full functionality, ensure the `OCOS_ENABLE_AUDIO` and `OCOS_ENABLE_GPT2_TOKENIZER` build flags are enabled. For a minimal build with selected operators, you can use the static library version and disable the shared library build by setting `-DOCOS_BUILD_SHARED_LIB=OFF`.

**Note:** A simple Python wrapper for these C APIs is available in [pp_api](../onnxruntime_extensions/pp_api.py), but it is not included in the default build. To enable it, use the extra build option `--config-settings "ortx-user-option=pp-api,no-opencv"`. For example, you can install it with the following command: `python3 -m pip install --config-settings "ortx-user-option=pp-api,no-opencv" git+https://github.com/microsoft/onnxruntime-extensions.git`. The following code demonstrates how to use the Python API to validate the tokenizer output.

```Python
from onnxruntime_extensions.pp_api import Tokenizer
pp_tok = Tokenizer('google/gemma-2-2b')
print(pp_tok.tokenize("what are you? \n 给 weiss ich, über was los ist \n"))
```
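
The token IDs can also be decoded back to text to sanity-check a round trip. This is a minimal sketch, assuming the wrapper exposes a `detokenize` counterpart to `tokenize` (mirroring the decode direction of the C ABI); consult pp_api.py for the exact method names.

```Python
from onnxruntime_extensions.pp_api import Tokenizer

pp_tok = Tokenizer('google/gemma-2-2b')
ids = pp_tok.tokenize("what are you? \n 给 weiss ich, über was los ist \n")
# Decoding the IDs back to text should reproduce the input (up to
# normalization); a mismatch usually points at tokenizer data issues.
print(pp_tok.detokenize(ids))
```
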
## Hugging Face Tokenizer Data Compatibility

In the C API build, onnxruntime-extensions can directly load Hugging Face tokenizer data. Typically, a Hugging Face tokenizer includes `tokenizer.json` and `tokenizer_config.json` files, unless the model author has fully customized the tokenizer; onnxruntime-extensions can load these files seamlessly. The following sections describe the supported fields in `tokenizer_config.json` and `tokenizer.json`.
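
For instance, a downloaded tokenizer folder could be loaded as sketched below. This assumes the pp_api wrapper accepts a local directory path in addition to a model id, and `./my_model` is a hypothetical folder holding `tokenizer.json` and `tokenizer_config.json`.

```Python
from onnxruntime_extensions.pp_api import Tokenizer

# './my_model' is a hypothetical local folder exported from a
# Hugging Face model; it contains tokenizer.json and tokenizer_config.json.
pp_tok = Tokenizer('./my_model')
print(pp_tok.tokenize("hello world"))
```
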
1) The following fields in `tokenizer_config.json` affect the results of the onnxruntime-extensions tokenizer (see the inspection snippet after this list):

- `model_max_length`: the maximum length of the tokenized sequence.
- `bos_token`: the beginning-of-sequence token; both `string` and `object` types are supported.
- `eos_token`: the end-of-sequence token; both `string` and `object` types are supported.
- `unk_token`: the unknown token; both `string` and `object` types are supported.
- `pad_token`: the padding token; both `string` and `object` types are supported.
- `clean_up_tokenization_spaces`: whether to clean up tokenization spaces.
- `tokenizer_class`: the tokenizer class.
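
As a quick check of what the tokenizer will pick up, the standard-library snippet below prints the fields listed above; the file path is hypothetical.

```Python
import json

# Hypothetical path to a Hugging Face tokenizer config file.
with open("my_model/tokenizer_config.json", encoding="utf-8") as f:
    config = json.load(f)

# The tokenizer_config.json fields listed above.
for field in ("model_max_length", "bos_token", "eos_token", "unk_token",
              "pad_token", "clean_up_tokenization_spaces", "tokenizer_class"):
    # Special tokens may be plain strings or objects with a "content" key.
    print(field, "=", config.get(field))
```
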
2) The following fields in `tokenizer.json` affect the results of the onnxruntime-extensions tokenizer (a short inspection sketch follows the list):

- `add_bos_token`: whether to add the beginning-of-sequence token.
- `add_eos_token`: whether to add the end-of-sequence token.
- `added_tokens`: the list of added tokens.
- `normalizer`: the normalizer; only two normalizers are supported, `Replace` and `precompiled_charsmap`.
- `pre_tokenizer`: not directly used, but some properties can be inferred from other fields such as `decoders`.
- `post_processor`: `add_bos_token` and `add_eos_token` can be inferred from the `post_processor` field.
- `decoder/decoders`: the decoders; only the `Replace` and `Strip` decoder steps are checked.
- `model/type`: the type of the model. If the type is missing, the tokenizer is treated as a Unigram model; otherwise, the value of `model/type` is used.
- `model/vocab`: the vocabulary of the model.
- `model/merges`: the merges of the model.
- `model/end_of_word_suffix`: the end-of-word suffix.
- `model/continuing_subword_prefix`: the continuing subword prefix.
- `model/byte_fallback`: not supported.
- `model/unk_token_id`: the ID of the unknown token.
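
Similarly, this sketch (again with a hypothetical path) shows the fallback rule above: a missing `model/type` is treated as Unigram.

```Python
import json

# Hypothetical path to a Hugging Face tokenizer.json file.
with open("my_model/tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

model = tok.get("model", {})
# Per the rule above: a missing model/type means Unigram.
print("model type:", model.get("type") or "Unigram")
print("added tokens:", [t.get("content") for t in tok.get("added_tokens", [])])
print("vocab size:", len(model.get("vocab", {})))
```
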
3) `tokenizer_module.json` is an optional file, defined by onnxruntime-extensions, that contains user-customized Python module information for the tokenizer. The following fields are supported (an illustrative file is sketched after the list):

- `tiktoken_file`: the path of a tiktoken base64-encoded vocab file, which can also be loaded by `OrtxCreateTokenizer`.
- `added_tokens`: same as in `tokenizer.json`. If `tokenizer.json` does not contain `added_tokens` or the file does not exist, this field can be supplied by the user.
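
A minimal `tokenizer_module.json`, written here via Python for illustration, might look as follows; the file name, token id, and token content are hypothetical values, and only the two fields above are meaningful to onnxruntime-extensions.

```Python
import json

# Hypothetical content; tiktoken_file points at a base64-encoded vocab file.
tokenizer_module = {
    "tiktoken_file": "cl100k_base.tiktoken",
    "added_tokens": [{"id": 100257, "content": "<|endoftext|>"}],
}

with open("tokenizer_module.json", "w", encoding="utf-8") as f:
    json.dump(tokenizer_module, f, indent=2)
```
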
@@ -1,7 +0,0 @@

### CI Build Matrix

The matrix below lists the versions of individual dependencies of onnxruntime-extensions. These are the configurations that are routinely and extensively verified by our CI.

Python | 3.8 | 3.9 | 3.10 | 3.11
---|---|---|---|---
Onnxruntime | 1.12.1 (Aug 4, 2022) | 1.13.1 (Oct 24, 2022) | 1.14.1 (Mar 2, 2023) | 1.15.0 (May 24, 2023)

docs/development.md
@@ -4,8 +4,24 @@ This project supports Python and can be built from source easily, or a simple cm

## Python package

### **Nightly Build**

#### <strong>Windows</strong>

Ensure that the prerequisite packages for onnxruntime-extensions (e.g., onnx and onnxruntime) are installed in your Python environment.

```cmd
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ onnxruntime-extensions
```

#### <strong>Linux/macOS</strong>

Ensure that a compiler toolchain such as gcc (g++ 8.0 or later) or clang, and cmake, are installed before running the following command:

```bash
python -m pip install git+https://github.com/microsoft/onnxruntime-extensions.git
```

The package contains all custom operators and some Python scripts to manipulate the ONNX models.
### Build from source

- Install Visual Studio with C++ development tools on Windows, gcc (>8.0) for Linux, or Xcode for macOS, and cmake on Unix-like platforms.
- If running on Windows, ensure that long file names are enabled, both for the [operating system](https://docs.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=cmd) and for git: `git config --system core.longpaths true`
- Make sure the Python development header/library files are installed (e.g., `apt-get install python3-dev` on Ubuntu Linux).