Reorganize Features section of Olive docs (#1323)
Parent: 09dde09376
Commit: 2050ee3256
|
@ -0,0 +1,10 @@
|
|||
Model Conversions
|
||||
=================
|
||||
|
||||
Olive supports converting a model from one form to another in various ways.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
passes/convert_pytorch
|
||||
passes/convert_onnx
|
|
@ -1,4 +1,4 @@
|
|||
# Huggingface Model Optimization
|
||||
# Huggingface Integration
|
||||
|
||||
## Introduction
|
||||
|
||||
|
|
|
@ -0,0 +1,137 @@
|
|||
# ONNX
|
||||
|
||||
[ONNX](https://onnx.ai/) is an open graph format to represent machine learning models. [ONNX Runtime](https://onnxruntime.ai/docs/) is a cross-platform machine-learning model accelerator, with a flexible interface to integrate hardware-specific libraries.
|
||||
|
||||
## Model Conversion
|
||||
The `OnnxConversion` pass converts PyTorch models to ONNX using
|
||||
[torch.onnx](https://pytorch.org/docs/stable/onnx.html).
|
||||
|
||||
Please refer to [OnnxConversion](onnx_conversion) for more details about the pass and its config parameters.
|
||||
|
||||
In addition, if you want to convert an existing ONNX model to another target opset, you can use the [OnnxOpVersionConversion](onnx_op_version_conversion) pass, which takes a similar configuration:
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "OnnxConversion",
|
||||
"target_opset": 13
|
||||
},
|
||||
{
|
||||
"type": "OnnxOpVersionConversion",
|
||||
"target_opset": 14
|
||||
}
|
||||
```
|
||||
|
||||
For generative models, the alternative conversion pass [ModelBuilder](model_builder), which integrates the
|
||||
[ONNX Runtime Generative AI](https://github.com/microsoft/onnxruntime-genai) module, can be used.
|
||||
|
||||
Please refer to [ModelBuilder](model_builder) for more details about the pass and its config parameters.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "ModelBuilder",
|
||||
"precision": "int4"
|
||||
}
|
||||
```
|
||||
|
||||
## Float16 Conversion
|
||||
|
||||
Converting a model to use Float16 instead of Float32 can decrease the model size and improve performance on some GPUs. The `OnnxFloatToFloat16` pass uses the [float16 converter from onnxruntime](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/float16.py) to convert the model to float16, which converts most nodes/operators to use Float16 instead of Float32.
|
||||
|
||||
Conversion to Float16 is often exposed at multiple stages of optimization, including model conversion and transformer optimization. This stand-alone pass is best suited for models that are not transformer architectures, where fusions may rely on specific data types in node patterns.
|
||||
|
||||
### Example Configuration
|
||||
|
||||
a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
|
||||
```json
|
||||
{
|
||||
"type": "OnnxFloatToFloat16"
|
||||
}
|
||||
```
|
||||
|
||||
b. More fine-grained control of the conversion conditions is also possible:
|
||||
```json
|
||||
{
|
||||
"type": "OnnxFloatToFloat16",
|
||||
// Don't convert input/output nodes to Float16
|
||||
"keep_io_types": true
|
||||
}
|
||||
```
|
||||
|
||||
See [Float16 Conversion](https://onnxruntime.ai/docs/performance/model-optimizations/float16.html#float16-conversion) for a more detailed description of the available configuration parameters.
|
||||
|
||||
## Inputs/Outputs Float16 to Float32 Conversion
|
||||
|
||||
Certain environments, such as ONNX Runtime WebGPU, prefer Float32 logits. The `OnnxIOFloat16ToFloat32` pass converts the inputs and outputs to use Float32 instead of Float16.
|
||||
|
||||
### Example Configuration
|
||||
|
||||
a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
|
||||
```json
|
||||
{
|
||||
"type": "OnnxIOFloat16ToFloat32"
|
||||
}
|
||||
```
|
||||
|
||||
## Mixed Precision Conversion
|
||||
This pass converts a model to mixed precision.
|
||||
|
||||
If float16 conversion is giving poor results, you can convert most of the ops to float16 but leave some in float32. The `OrtMixedPrecision` pass finds a minimal set of ops to skip while retaining a certain level of accuracy.
|
||||
|
||||
The default value for `op_block_list` is `["SimplifiedLayerNormalization", "SkipSimplifiedLayerNormalization", "Relu", "Add"]`.
|
||||
|
||||
### Example Configuration
|
||||
|
||||
a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
|
||||
```json
|
||||
{
|
||||
"type": "OrtMixedPrecision"
|
||||
}
|
||||
```
|
||||
|
||||
b. More fine-grained control of the conversion conditions is also possible:
|
||||
```json
|
||||
{
|
||||
"type": "OrtMixedPrecision",
|
||||
"op_block_list": [
|
||||
"Add",
|
||||
"LayerNormalization",
|
||||
"SkipLayerNormalization",
|
||||
"FastGelu",
|
||||
"EmbedLayerNormalization",
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Convert dynamic shape to fixed shape
|
||||
|
||||
In QNN, SNPE, and other mobile inference scenarios, the input shape of the model often needs to be fixed. The `DynamicToFixedShape` pass converts the dynamic shape of the model to a fixed shape.
|
||||
|
||||
For example, models often have a dynamic batch size so that training is more efficient. In mobile scenarios the batch generally has a size of 1. Making the batch size dimension ‘fixed’ by setting it to 1 may allow NNAPI and CoreML to run the model.
|
||||
|
||||
The pass can be used to update specific dimensions or the entire input shape.
|
||||
|
||||
### Example Configuration
|
||||
|
||||
a. Making a symbolic dimension fixed
|
||||
```json
|
||||
{
|
||||
"type": "DynamicToFixedShape",
|
||||
"input_dim": ["batch_size"],
|
||||
"dim_value": [1]
|
||||
}
|
||||
```
|
||||
|
||||
b. Making the entire input shape fixed
|
||||
```json
|
||||
{
|
||||
"type": "DynamicToFixedShape",
|
||||
"input_name": ["input"],
|
||||
"input_shape": [[1, 3, 224, 224]]
|
||||
}
|
||||
```
|
||||
|
||||
Note: `input_dim` and `dim_value` must have the same length, as must `input_name` and `input_shape`. The `input_dim`/`dim_value` and `input_name`/`input_shape` options are mutually exclusive; you cannot specify both pairs at the same time.
|
||||
|
||||
More details about the pass and its config parameters can be found [here](https://onnxruntime.ai/docs/tutorials/mobile/helpers/make-dynamic-shape-fixed.html).
|
|
@ -0,0 +1,17 @@
|
|||
# PyTorch
|
||||
|
||||
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
|
||||
|
||||
## TorchTRTConversion
|
||||
`TorchTRTConversion` converts the `torch.nn.Linear` modules in the transformer layers of a Hugging Face PyTorch model to `TRTModules` from `torch_tensorrt` with fp16 precision and sparse weights, if
|
||||
applicable. `torch_tensorrt` is an extension to `torch` where TensorRT compiled engines can be used like regular `torch.nn.Module`s. This pass can be used to accelerate inference on transformer models
|
||||
with sparse weights by taking advantage of the 2:4 structured sparsity pattern supported by TensorRT.
|
||||
|
||||
This pass only supports HfModels. Please refer to [TorchTRTConversion](torch_trt_conversion) for more details on the types of transformers models supported.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "TorchTRTConversion"
|
||||
}
|
||||
```
|
|
@ -4,39 +4,6 @@
|
|||
|
||||
Olive provides multiple ONNX-based transformations and optimizations to improve model performance.
|
||||
|
||||
## Model Conversion
|
||||
The `OnnxConversion` pass converts PyTorch models to ONNX using
|
||||
[torch.onnx](https://pytorch.org/docs/stable/onnx.html).
|
||||
|
||||
Please refer to [OnnxConversion](onnx_conversion) for more details about the pass and its config parameters.
|
||||
|
||||
In addition, if you want to convert an existing ONNX model to another target opset, you can use the [OnnxOpVersionConversion](onnx_op_version_conversion) pass, which takes a similar configuration:
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "OnnxConversion",
|
||||
"target_opset": 13
|
||||
},
|
||||
{
|
||||
"type": "OnnxOpVersionConversion",
|
||||
"target_opset": 14
|
||||
}
|
||||
```
|
||||
|
||||
For generative models, the alternative conversion pass [ModelBuilder](model_builder), which integrates the
|
||||
[ONNX Runtime Generative AI](https://github.com/microsoft/onnxruntime-genai) module, can be used.
|
||||
|
||||
Please refer to [ModelBuilder](model_builder) for more details about the pass and its config parameters.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "ModelBuilder",
|
||||
"precision": "int4"
|
||||
}
|
||||
```
|
||||
|
||||
## Model Optimizer
|
||||
`OnnxModelOptimizer` optimizes an ONNX model by fusing nodes. Fusing nodes involves merging multiple nodes in a model into a single node to
|
||||
reduce the computational cost and improve the performance of the model. The optimization process involves analyzing the structure of the ONNX model and identifying nodes that can be fused.
|
||||
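A minimal configuration sketch for this pass, relying on its default settings (shown here for illustration; see the pass reference for the full set of options):

```json
{
    "type": "OnnxModelOptimizer"
}
```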
|
@ -207,129 +174,6 @@ Here are some examples to describe the pre/post processing which is exactly same
|
|||
}
|
||||
```
|
||||
|
||||
## Quantize with onnxruntime
|
||||
[Quantization][1] is a technique to compress deep learning models by reducing the precision of the model weights from 32 bits to 8 bits. This
|
||||
technique is used to reduce the memory footprint and improve the inference performance of the model. Quantization can be applied to the
|
||||
weights of the model, the activations of the model, or both.
|
||||
|
||||
There are two ways to quantize a model in onnxruntime:
|
||||
1. [Dynamic Quantization][2]:
|
||||
Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically, which means there is no
|
||||
requirement for a calibration dataset.
|
||||
|
||||
These calculations increase the cost of inference, but usually achieve higher accuracy compared to static quantization.
|
||||
|
||||
|
||||
2. [Static Quantization][3]:
|
||||
The static quantization method runs the model using a set of inputs called calibration data. In this way, the user must provide a calibration
|
||||
dataset to calculate the quantization parameters (scale and zero point) for activations before quantizing the model.
|
||||
|
||||
Olive consolidates dynamic and static quantization into a single pass called `OnnxQuantization`, and provides the user with the ability to
|
||||
tune both quantization methods and their hyperparameters at the same time.
|
||||
If the user desires to tune only dynamic or static quantization, Olive also supports them individually through `OnnxDynamicQuantization` and
|
||||
`OnnxStaticQuantization` respectively.
|
||||
|
||||
Please refer to [OnnxQuantization](onnx_quantization), [OnnxDynamicQuantization](onnx_dynamic_quantization) and
|
||||
[OnnxStaticQuantization](onnx_static_quantization) for more details about the passes and their config parameters.
|
||||
|
||||
**Note:** If the target execution provider is QNN EP, the model might need to be preprocessed before quantization. Please refer to [QnnPreprocess](qnn_preprocess) for more details about the pass and its config parameters.
|
||||
This preprocessing step fuses operators unsupported by QNN EP and inserts necessary operators to make the model compatible with QNN EP.
|
||||
|
||||
### Example Configuration
|
||||
a. Tune the parameters of the Olive pass with pre-defined searchable values
|
||||
```json
|
||||
{
|
||||
"type": "OnnxQuantization",
|
||||
"data_config": "calib_data_config"
|
||||
}
|
||||
```
|
||||
|
||||
b. Select parameters to tune
|
||||
```json
|
||||
{
|
||||
"type": "OnnxQuantization",
|
||||
// select per_channel to tune with "SEARCHABLE_VALUES".
|
||||
// other parameters will use the default value, not to be tuned.
|
||||
"per_channel": "SEARCHABLE_VALUES",
|
||||
"data_config": "calib_data_config",
|
||||
"disable_search": true
|
||||
}
|
||||
```
|
||||
|
||||
c. Use the default values of the Olive pass (no tuning in this way)
|
||||
```json
|
||||
{
|
||||
"type": "OnnxQuantization",
|
||||
// set per_channel to "DEFAULT_VALUE"
|
||||
"per_channel": "DEFAULT_VALUE",
|
||||
"data_config": "calib_data_config",
|
||||
}
|
||||
```
|
||||
|
||||
d. Specify parameters with user defined values
|
||||
```json
|
||||
"onnx_quantization": {
|
||||
"type": "OnnxQuantization",
|
||||
// set per_channel to True.
|
||||
"per_channel": true,
|
||||
"data_config": "calib_data_config",
|
||||
"disable_search": true
|
||||
}
|
||||
```
|
||||
|
||||
Check out [this file](https://github.com/microsoft/Olive/blob/main/examples/bert/user_script.py)
|
||||
for an example implementation of `"user_script.py"` and `"calib_data_config/dataloader_config/type"`.
|
||||
|
||||
Check out [this file](https://github.com/microsoft/Olive/tree/main/examples/bert#bert-optimization-with-intel-neural-compressor-ptq-on-cpu) for an example of Intel® Neural Compressor quantization.
|
||||
|
||||
## Quantize with Intel® Neural Compressor
|
||||
In addition to the default onnxruntime quantization tool, Olive also integrates [Intel® Neural Compressor](https://github.com/intel/neural-compressor).
|
||||
|
||||
Intel® Neural Compressor is a model compression tool across popular deep learning frameworks including TensorFlow, PyTorch, ONNX Runtime (ORT) and MXNet, which supports a variety of powerful model compression techniques, e.g., quantization, pruning, and distillation. As a user-experience-driven and hardware-friendly tool, Intel® Neural Compressor focuses on providing users with an easy-to-use interface and strives to reach the “quantize once, run everywhere” goal.
|
||||
|
||||
Olive consolidates the Intel® Neural Compressor dynamic and static quantization into a single pass called `IncQuantization`, and provides the user with the ability to
|
||||
tune both quantization methods and their hyperparameters at the same time.
|
||||
If the user desires to tune only dynamic or static quantization, Olive also supports them individually through `IncDynamicQuantization` and
|
||||
`IncStaticQuantization` respectively.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
"inc_quantization": {
|
||||
"type": "IncStaticQuantization",
|
||||
"approach": "weight_only",
|
||||
"weight_only_config": {
|
||||
"bits": 4,
|
||||
"algorithm": "GPTQ"
|
||||
},
|
||||
"data_config": "calib_data_config",
|
||||
"calibration_sampling_size": [8],
|
||||
"save_as_external_data": true,
|
||||
"all_tensors_to_one_file": true
|
||||
}
|
||||
```
|
||||
|
||||
Please refer to [IncQuantization](inc_quantization), [IncDynamicQuantization](inc_dynamic_quantization) and
|
||||
[IncStaticQuantization](inc_static_quantization) for more details about the passes and their config parameters.
|
||||
|
||||
## Quantize with AMD Vitis AI Quantizer
|
||||
Olive also integrates [AMD Vitis AI Quantizer](https://github.com/microsoft/Olive/blob/main/olive/passes/onnx/vitis_ai/quantize.py) for quantization.
|
||||
|
||||
The Vitis™ AI development environment accelerates AI inference on AMD® hardware platforms. The Vitis AI quantizer can reduce the computing complexity by converting the 32-bit floating-point weights and activations to fixed-point like INT8. The fixed-point network model requires less memory bandwidth, thus providing faster speed and higher power efficiency than the floating-point model.
|
||||
Olive consolidates Vitis™ AI quantization into a single pass called `VitisAIQuantization`, which supports power-of-2 scale quantization methods and the Vitis AI Execution Provider.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
"vitis_ai_quantization": {
|
||||
"type": "VitisAIQuantization",
|
||||
"calibrate_method":"NonOverflow",
|
||||
"quant_format":"QDQ",
|
||||
"activation_type":"QUInt8",
|
||||
"weight_type":"QInt8",
|
||||
"data_config": "calib_data_config",
|
||||
}
|
||||
```
|
||||
Please refer to [VitisAIQuantization](vitis_ai_quantization) for more details about the pass and its config parameters.
|
||||
|
||||
## ORT Performance Tuning
|
||||
ONNX Runtime provides high performance across a range of hardware options through its Execution Providers interface for different execution
|
||||
environments.
|
||||
|
@ -367,107 +211,6 @@ for an example implementation of `"user_script.py"` and `"calib_data_config/data
|
|||
[2]: <https://onnxruntime.ai/docs/performance/quantization.html#dynamic-quantization> "Dynamic Quantization"
|
||||
[3]: <https://onnxruntime.ai/docs/performance/quantization.html#static-quantization> "Static Quantization"
|
||||
|
||||
## Float16 Conversion
|
||||
|
||||
Converting a model to use Float16 instead of Float32 can decrease the model size and improve performance on some GPUs. The `OnnxFloatToFloat16` pass uses the [float16 converter from onnxruntime](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/float16.py) to convert the model to float16, which converts most nodes/operators to use Float16 instead of Float32.
|
||||
|
||||
Conversion to Float16 is often exposed at multiple stages of optimization, including model conversion and transformer optimization. This stand-alone pass is best suited for models that are not transformer architectures, where fusions may rely on specific data types in node patterns.
|
||||
|
||||
### Example Configuration
|
||||
|
||||
a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
|
||||
```json
|
||||
{
|
||||
"type": "OnnxFloatToFloat16"
|
||||
}
|
||||
```
|
||||
|
||||
b. More fine-grained control of the conversion conditions is also possible:
|
||||
```json
|
||||
{
|
||||
"type": "OnnxFloatToFloat16",
|
||||
// Don't convert input/output nodes to Float16
|
||||
"keep_io_types": true
|
||||
}
|
||||
```
|
||||
|
||||
See [Float16 Conversion](https://onnxruntime.ai/docs/performance/model-optimizations/float16.html#float16-conversion) for a more detailed description of the available configuration parameters.
|
||||
|
||||
## Inputs/Outputs Float16 to Float32 Conversion
|
||||
|
||||
Certain environments, such as ONNX Runtime WebGPU, prefer Float32 logits. The `OnnxIOFloat16ToFloat32` pass converts the inputs and outputs to use Float32 instead of Float16.
|
||||
|
||||
### Example Configuration
|
||||
|
||||
a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
|
||||
```json
|
||||
{
|
||||
"type": "OnnxIOFloat16ToFloat32"
|
||||
}
|
||||
```
|
||||
|
||||
## Mixed Precision Conversion
|
||||
This pass converts a model to mixed precision.
|
||||
|
||||
If float16 conversion is giving poor results, you can convert most of the ops to float16 but leave some in float32. The `OrtMixedPrecision` pass finds a minimal set of ops to skip while retaining a certain level of accuracy.
|
||||
|
||||
The default value for `op_block_list` is `["SimplifiedLayerNormalization", "SkipSimplifiedLayerNormalization", "Relu", "Add"]`.
|
||||
|
||||
### Example Configuration
|
||||
|
||||
a. The most basic configuration, which is suitable for many models, leaves all configuration options set to their default values:
|
||||
```json
|
||||
{
|
||||
"type": "OrtMixedPrecision"
|
||||
}
|
||||
```
|
||||
|
||||
b. More fine-grained control of the conversion conditions is also possible:
|
||||
```json
|
||||
{
|
||||
"type": "OrtMixedPrecision",
|
||||
"op_block_list": [
|
||||
"Add",
|
||||
"LayerNormalization",
|
||||
"SkipLayerNormalization",
|
||||
"FastGelu",
|
||||
"EmbedLayerNormalization",
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Convert dynamic shape to fixed shape
|
||||
|
||||
In QNN, SNPE, and other mobile inference scenarios, the input shape of the model often needs to be fixed. The `DynamicToFixedShape` pass converts the dynamic shape of the model to a fixed shape.
|
||||
|
||||
For example, models often have a dynamic batch size so that training is more efficient. In mobile scenarios the batch generally has a size of 1. Making the batch size dimension ‘fixed’ by setting it to 1 may allow NNAPI and CoreML to run the model.
|
||||
|
||||
The pass can be used to update specific dimensions or the entire input shape.
|
||||
|
||||
### Example Configuration
|
||||
|
||||
a. Making a symbolic dimension fixed
|
||||
```json
|
||||
{
|
||||
"type": "DynamicToFixedShape",
|
||||
"input_dim": ["batch_size"],
|
||||
"dim_value": [1]
|
||||
}
|
||||
```
|
||||
|
||||
b. Making the entire input shape fixed
|
||||
```json
|
||||
{
|
||||
"type": "DynamicToFixedShape",
|
||||
"input_name": ["input"],
|
||||
"input_shape": [[1, 3, 224, 224]]
|
||||
}
|
||||
```
|
||||
|
||||
Note: `input_dim` and `dim_value` must have the same length, as must `input_name` and `input_shape`. The `input_dim`/`dim_value` and `input_name`/`input_shape` options are mutually exclusive; you cannot specify both pairs at the same time.
|
||||
|
||||
More details about the pass and its config parameters can be found [here](https://onnxruntime.ai/docs/tutorials/mobile/helpers/make-dynamic-shape-fixed.html).
|
||||
|
||||
## Extract Adapters
|
||||
|
||||
LoRA, QLoRA and related techniques allow us to fine-tune a pre-trained model by adding a small number of trainable matrices called adapters. The same base model can be used for multiple tasks by adding different adapters for each task. To support using multiple adapters with the same optimized ONNX model, the `ExtractAdapters` pass extracts the adapter weights from the model and saves them to a separate file. The model graph is then modified in one of the following ways:
|
||||
|
|
|
@ -123,42 +123,6 @@ c. Run QAT training with default training loop.
|
|||
Check out [this file](https://github.com/microsoft/Olive/blob/main/examples/resnet/user_script.py)
|
||||
for an example implementation of `"user_script.py"` and `"train_data_config/dataloader_config/type"`.
|
||||
|
||||
## AutoGPTQ
|
||||
Olive also integrates [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) for quantization.
|
||||
|
||||
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization). With GPTQ quantization, you can quantize your favorite language model to 8, 4, 3 or even 2 bits. This comes without a significant drop in performance and with faster inference speed, and is supported by most GPU hardware.
|
||||
|
||||
Olive consolidates GPTQ quantization into a single pass called `GptqQuantizer`, which supports tuning the GPTQ hyperparameters to trade off accuracy and speed.
|
||||
|
||||
Please refer to [GptqQuantizer](gptq_quantizer) for more details about the pass and its config parameters.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "GptqQuantizer",
|
||||
"data_config": "wikitext2_train"
|
||||
}
|
||||
```
|
||||
|
||||
Check out [this file](https://github.com/microsoft/Olive/blob/main/examples/llama2/llama2_template.json)
|
||||
for an example implementation of `"wikitext2_train"`.
|
||||
|
||||
## AutoAWQ
|
||||
AutoAWQ is an easy-to-use package for 4-bit quantized models that speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved upon from the original work from MIT.
|
||||
|
||||
Olive integrates [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) for quantization and makes it possible to convert the AWQ-quantized PyTorch model to an ONNX model. You can enable `pack_model_for_onnx_conversion` to pack the model for ONNX conversion.
|
||||
|
||||
Please refer to [AutoAWQQuantizer](awq_quantizer) for more details about the pass and its config parameters.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "AutoAWQQuantizer",
|
||||
"w_bit": 4,
|
||||
"pack_model_for_onnx_conversion": true
|
||||
}
|
||||
```
|
||||
|
||||
## MergeAdapterWeights
|
||||
Merges LoRA weights into a complete model. After running the LoRA pass, the model only has LoRA adapters. This pass merges the LoRA adapters into the original model and downloads the model's context (config/generation_config/tokenizer).
|
||||
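A minimal configuration sketch, assuming the pass is run with its default settings:

```json
{
    "type": "MergeAdapterWeights"
}
```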
|
||||
|
@ -209,36 +173,3 @@ This pass only supports HuggingFace transformer PyTorch models. Please refer to
|
|||
"calibration_data_config": "wikitext2"
|
||||
}
|
||||
```
|
||||
|
||||
## QuaRot
|
||||
`QuaRot` is a quantization technique that combines rotation with quantization to reduce the number of bits required to represent the weights of a model. It is based on the [QuaRot paper](https://arxiv.org/abs/2305.14314).
|
||||
|
||||
This pass only supports HuggingFace transformer PyTorch models. Please refer to [QuaRot](quarot) for more details on the types of transformers models supported.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "QuaRot",
|
||||
"w_rtn": true,
|
||||
"rotate": true,
|
||||
"w_bits": 4,
|
||||
"a_bits": 4,
|
||||
"k_bits": 4,
|
||||
"v_bits": 4
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
## TorchTRTConversion
|
||||
`TorchTRTConversion` converts the `torch.nn.Linear` modules in the transformer layers of a Hugging Face PyTorch model to `TRTModules` from `torch_tensorrt` with fp16 precision and sparse weights, if
|
||||
applicable. `torch_tensorrt` is an extension to `torch` where TensorRT compiled engines can be used like regular `torch.nn.Module`s. This pass can be used to accelerate inference on transformer models
|
||||
with sparse weights by taking advantage of the 2:4 structured sparsity pattern supported by TensorRT.
|
||||
|
||||
This pass only supports HfModels. Please refer to [TorchTRTConversion](torch_trt_conversion) for more details on the types of transformers models supported.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "TorchTRTConversion"
|
||||
}
|
||||
```
|
||||
|
|
|
@ -0,0 +1,126 @@
|
|||
# ONNX
|
||||
|
||||
[ONNX](https://onnx.ai/) is an open graph format to represent machine learning models. [ONNX Runtime](https://onnxruntime.ai/docs/) is a cross-platform machine-learning model accelerator, with a flexible interface to integrate hardware-specific libraries.
|
||||
|
||||
## Quantize with onnxruntime
|
||||
[Quantization][1] is a technique to compress deep learning models by reducing the precision of the model weights from 32 bits to 8 bits. This
|
||||
technique is used to reduce the memory footprint and improve the inference performance of the model. Quantization can be applied to the
|
||||
weights of the model, the activations of the model, or both.
|
||||
|
||||
There are two ways to quantize a model in onnxruntime:
|
||||
1. [Dynamic Quantization][2]:
|
||||
Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically, which means there is no
|
||||
requirement for a calibration dataset.
|
||||
|
||||
These calculations increase the cost of inference, but usually achieve higher accuracy compared to static quantization.
|
||||
|
||||
|
||||
2. [Static Quantization][3]:
|
||||
The static quantization method runs the model using a set of inputs called calibration data. In this way, the user must provide a calibration
|
||||
dataset to calculate the quantization parameters (scale and zero point) for activations before quantizing the model.
|
||||
|
||||
Olive consolidates dynamic and static quantization into a single pass called `OnnxQuantization`, and provides the user with the ability to
|
||||
tune both quantization methods and their hyperparameters at the same time.
|
||||
If the user desires to tune only dynamic or static quantization, Olive also supports them individually through `OnnxDynamicQuantization` and
|
||||
`OnnxStaticQuantization` respectively.
|
||||
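For illustration, minimal standalone configurations for these two passes might look like the following sketch; the static variant requires a calibration data config, and the parameter values shown here are illustrative rather than recommended defaults:

```json
{
    "type": "OnnxDynamicQuantization",
    "per_channel": false
},
{
    "type": "OnnxStaticQuantization",
    "data_config": "calib_data_config",
    "quant_format": "QDQ"
}
```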
|
||||
Please refer to [OnnxQuantization](onnx_quantization), [OnnxDynamicQuantization](onnx_dynamic_quantization) and
|
||||
[OnnxStaticQuantization](onnx_static_quantization) for more details about the passes and their config parameters.
|
||||
|
||||
**Note:** If the target execution provider is QNN EP, the model might need to be preprocessed before quantization. Please refer to [QnnPreprocess](qnn_preprocess) for more details about the pass and its config parameters.
|
||||
This preprocessing step fuses operators unsupported by QNN EP and inserts necessary operators to make the model compatible with QNN EP.
|
||||
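A minimal configuration sketch for this preprocessing pass, using its default settings:

```json
{
    "type": "QnnPreprocess"
}
```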
|
||||
### Example Configuration
|
||||
a. Tune the parameters of the Olive pass with pre-defined searchable values
|
||||
```json
|
||||
{
|
||||
"type": "OnnxQuantization",
|
||||
"data_config": "calib_data_config"
|
||||
}
|
||||
```
|
||||
|
||||
b. Select parameters to tune
|
||||
```json
|
||||
{
|
||||
"type": "OnnxQuantization",
|
||||
// select per_channel to tune with "SEARCHABLE_VALUES".
|
||||
// other parameters will use the default value, not to be tuned.
|
||||
"per_channel": "SEARCHABLE_VALUES",
|
||||
"data_config": "calib_data_config",
|
||||
"disable_search": true
|
||||
}
|
||||
```
|
||||
|
||||
c. Use the default values of the Olive pass (no tuning in this way)
|
||||
```json
|
||||
{
|
||||
"type": "OnnxQuantization",
|
||||
// set per_channel to "DEFAULT_VALUE"
|
||||
"per_channel": "DEFAULT_VALUE",
|
||||
"data_config": "calib_data_config",
|
||||
}
|
||||
```
|
||||
|
||||
d. Specify parameters with user defined values
|
||||
```json
|
||||
"onnx_quantization": {
|
||||
"type": "OnnxQuantization",
|
||||
// set per_channel to True.
|
||||
"per_channel": true,
|
||||
"data_config": "calib_data_config",
|
||||
"disable_search": true
|
||||
}
|
||||
```
|
||||
|
||||
Check out [this file](https://github.com/microsoft/Olive/blob/main/examples/bert/user_script.py)
|
||||
for an example implementation of `"user_script.py"` and `"calib_data_config/dataloader_config/type"`.
|
||||
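The `calib_data_config` referenced in the examples above is defined in the workflow's `data_configs` section. A rough, hypothetical sketch is shown below; the container type, dataset names, and field values are placeholders and the exact schema depends on your dataset and Olive version:

```json
"data_configs": [
    {
        "name": "calib_data_config",
        "type": "HuggingfaceContainer",
        "load_dataset_config": { "data_name": "glue", "subset": "mrpc", "split": "validation" },
        "pre_process_data_config": { "input_cols": ["sentence1", "sentence2"], "max_samples": 100 },
        "dataloader_config": { "batch_size": 1 }
    }
]
```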
|
||||
Check out [this file](https://github.com/microsoft/Olive/tree/main/examples/bert#bert-optimization-with-intel-neural-compressor-ptq-on-cpu) for an example of Intel® Neural Compressor quantization.
|
||||
|
||||
## Quantize with Intel® Neural Compressor
|
||||
In addition to the default onnxruntime quantization tool, Olive also integrates [Intel® Neural Compressor](https://github.com/intel/neural-compressor).
|
||||
|
||||
Intel® Neural Compressor is a model compression tool across popular deep learning frameworks including TensorFlow, PyTorch, ONNX Runtime (ORT) and MXNet, which supports a variety of powerful model compression techniques, e.g., quantization, pruning, and distillation. As a user-experience-driven and hardware-friendly tool, Intel® Neural Compressor focuses on providing users with an easy-to-use interface and strives to reach the “quantize once, run everywhere” goal.
|
||||
|
||||
Olive consolidates the Intel® Neural Compressor dynamic and static quantization into a single pass called `IncQuantization`, and provides the user with the ability to
|
||||
tune both quantization methods and their hyperparameters at the same time.
|
||||
If the user desires to tune only dynamic or static quantization, Olive also supports them individually through `IncDynamicQuantization` and
|
||||
`IncStaticQuantization` respectively.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
"inc_quantization": {
|
||||
"type": "IncStaticQuantization",
|
||||
"approach": "weight_only",
|
||||
"weight_only_config": {
|
||||
"bits": 4,
|
||||
"algorithm": "GPTQ"
|
||||
},
|
||||
"data_config": "calib_data_config",
|
||||
"calibration_sampling_size": [8],
|
||||
"save_as_external_data": true,
|
||||
"all_tensors_to_one_file": true
|
||||
}
|
||||
```
|
||||
|
||||
Please refer to [IncQuantization](inc_quantization), [IncDynamicQuantization](inc_dynamic_quantization) and
|
||||
[IncStaticQuantization](inc_static_quantization) for more details about the passes and their config parameters.
|
||||
|
||||
## Quantize with AMD Vitis AI Quantizer
|
||||
Olive also integrates [AMD Vitis AI Quantizer](https://github.com/microsoft/Olive/blob/main/olive/passes/onnx/vitis_ai/quantize.py) for quantization.
|
||||
|
||||
The Vitis™ AI development environment accelerates AI inference on AMD® hardware platforms. The Vitis AI quantizer can reduce the computing complexity by converting the 32-bit floating-point weights and activations to fixed-point like INT8. The fixed-point network model requires less memory bandwidth, thus providing faster speed and higher power efficiency than the floating-point model.
|
||||
Olive consolidates Vitis™ AI quantization into a single pass called `VitisAIQuantization`, which supports power-of-2 scale quantization methods and the Vitis AI Execution Provider.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
"vitis_ai_quantization": {
|
||||
"type": "VitisAIQuantization",
|
||||
"calibrate_method":"NonOverflow",
|
||||
"quant_format":"QDQ",
|
||||
"activation_type":"QUInt8",
|
||||
"weight_type":"QInt8",
|
||||
"data_config": "calib_data_config",
|
||||
}
|
||||
```
|
||||
Please refer to [VitisAIQuantization](vitis_ai_quantization) for more details about the pass and its config parameters.
|
|
@ -0,0 +1,57 @@
|
|||
# PyTorch
|
||||
|
||||
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
|
||||
|
||||
## AutoGPTQ
|
||||
Olive also integrates [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) for quantization.
|
||||
|
||||
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization). With GPTQ quantization, you can quantize your favorite language model to 8, 4, 3 or even 2 bits. This comes without a significant drop in performance and with faster inference speed, and is supported by most GPU hardware.
|
||||
|
||||
Olive consolidates GPTQ quantization into a single pass called `GptqQuantizer`, which supports tuning the GPTQ hyperparameters to trade off accuracy and speed.
|
||||
|
||||
Please refer to [GptqQuantizer](gptq_quantizer) for more details about the pass and its config parameters.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "GptqQuantizer",
|
||||
"data_config": "wikitext2_train"
|
||||
}
|
||||
```
|
||||
|
||||
Check out [this file](https://github.com/microsoft/Olive/blob/main/examples/llama2/llama2_template.json)
|
||||
for an example implementation of `"wikitext2_train"`.
|
||||
|
||||
## AutoAWQ
|
||||
AutoAWQ is an easy-to-use package for 4-bit quantized models that speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved upon from the original work from MIT.
|
||||
|
||||
Olive integrates [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) for quantization and makes it possible to convert the AWQ-quantized PyTorch model to an ONNX model. You can enable `pack_model_for_onnx_conversion` to pack the model for ONNX conversion.
|
||||
|
||||
Please refer to [AutoAWQQuantizer](awq_quantizer) for more details about the pass and its config parameters.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "AutoAWQQuantizer",
|
||||
"w_bit": 4,
|
||||
"pack_model_for_onnx_conversion": true
|
||||
}
|
||||
```
|
||||
|
||||
## QuaRot
|
||||
`QuaRot` is a quantization technique that combines rotation with quantization to reduce the number of bits required to represent the weights of a model. It is based on the [QuaRot paper](https://arxiv.org/abs/2305.14314).
|
||||
|
||||
This pass only supports HuggingFace transformer PyTorch models. Please refer to [QuaRot](quarot) for more details on the types of transformers models supported.
|
||||
|
||||
### Example Configuration
|
||||
```json
|
||||
{
|
||||
"type": "QuaRot",
|
||||
"w_rtn": true,
|
||||
"rotate": true,
|
||||
"w_bits": 4,
|
||||
"a_bits": 4,
|
||||
"k_bits": 4,
|
||||
"v_bits": 4
|
||||
}
|
||||
```
|
|
@ -0,0 +1,10 @@
|
|||
Model Quantizations
|
||||
===================
|
||||
|
||||
Olive supports multiple quantization algorithms for both PyTorch and ONNX models.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
passes/quant_pytorch
|
||||
passes/quant_onnx
|
|
@ -32,15 +32,17 @@ This document introduces Olive and provides some examples to get you started.
|
|||
:maxdepth: 1
|
||||
:caption: FEATURES
|
||||
|
||||
features/azureml_integration
|
||||
|
||||
features/cli
|
||||
features/cloud_model_cache
|
||||
features/custom_scripts
|
||||
features/azureml_integration
|
||||
features/huggingface_model_optimization
|
||||
features/lora
|
||||
features/model_transformations_and_optimizations
|
||||
features/packaging_output_models
|
||||
features/cloud_model_cache
|
||||
features/run_workflow_remotely
|
||||
features/conversion
|
||||
features/quantization
|
||||
features/model_transformations_and_optimizations
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
@ -48,6 +50,7 @@ This document introduces Olive and provides some examples to get you started.
|
|||
|
||||
extending_olive/design
|
||||
extending_olive/how_to_add_optimization_pass
|
||||
extending_olive/custom_scripts
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
|
setup.py
|
@ -56,10 +56,7 @@ CLASSIFIERS = [
|
|||
]
|
||||
|
||||
long_description = (
|
||||
"Olive is an easy-to-use hardware-aware model optimization tool that composes industry-leading techniques across"
|
||||
" model compression, optimization, and compilation. Given a model and targeted hardware, Olive composes the best"
|
||||
" suitable optimization techniques to output the most efficient model(s) for inferencing on cloud or edge, while"
|
||||
" taking a set of constraints such as accuracy and latency into consideration."
|
||||
"Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs"
|
||||
)
|
||||
|
||||
description = long_description.split(".", maxsplit=1)[0] + "."
|
||||
|
|