phi2 optimization with Olive
This folder contains an example of phi2 optimization using an Olive workflow.
- PyTorch Model -> ONNX Model -> Transformers Optimized ONNX Model -> Quantized ONNX Model -> ONNX Runtime performance tuning
Prerequisites
- einops
- PyTorch>=2.2.0
  The official PyTorch website offers packages compatible with CUDA 11.8 and 12.1. Please select the appropriate version according to your needs.
- ONNXRuntime nightly package
  On Linux, phi2 optimization requires the ONNXRuntime nightly package (>=1.18.0). On Windows, ONNXRuntime>=1.17.0 is recommended.
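A quick way to confirm that an environment meets these minimums is to compare the installed versions before running the workflow. The sketch below is not part of the example: `MINIMUMS`, `release_tuple`, `meets_minimum`, and `check_environment` are hypothetical names. It compares only the leading numeric release components, so nightly or CUDA-suffixed version strings are treated as their base release. The distribution names are also assumptions, since ONNX Runtime may be installed under a different package name (for example a GPU or nightly build).

```python
import re
from importlib import metadata

# Minimum versions from the prerequisites above (Linux values).
# Distribution names are assumptions; adjust them to what is actually installed.
MINIMUMS = {"torch": "2.2.0", "onnxruntime": "1.18.0"}


def release_tuple(version, width=3):
    """Return the leading numeric components, padded or cut to `width` items."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(part) for part in match.group(0).split(".")][:width]
    return tuple(parts + [0] * (width - len(parts)))


def meets_minimum(installed, minimum):
    """Compare base releases only; suffixes such as '.dev…' or '+cu121' are ignored."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment(minimums=MINIMUMS):
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for package, minimum in minimums.items():
        try:
            report[package] = meets_minimum(metadata.version(package), minimum)
        except metadata.PackageNotFoundError:
            report[package] = None
    return report


if __name__ == "__main__":
    for package, ok in check_environment().items():
        print(package, "missing" if ok is None else ("ok" if ok else "too old"))
```

On Windows, the ONNX Runtime minimum in `MINIMUMS` could be relaxed to 1.17.0, as noted above.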
Fine-tune phi2 Model using QLoRA
This workflow fine-tunes the phi2 model using QLoRA to generate text from a given prompt.
Install the required packages listed in requirements-qlora.txt. We also suggest using GPU devices for the fine-tuning process.
pip install -r requirements-qlora.txt
Then run the fine-tuning with the following command:
python phi2.py --finetune_method qlora
Note that, to demonstrate the fine-tuning process, we use a small number of training steps and a small dataset. For better performance, you can increase the training steps and use a larger dataset by updating phi2_optimize_template.json.
We will consider exposing more parameters in the future to make it easier to customize the training process.
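Updating the template amounts to changing a nested JSON value. The sketch below shows one way to do that from Python on a small stand-in dictionary. The key path (`passes`, `qlora`, `training_args`, `max_steps`) and the step counts are assumptions chosen for illustration, and `set_nested` is a hypothetical helper; check the actual layout of phi2_optimize_template.json before adapting it.

```python
import copy
import json


def set_nested(config, path, value):
    """Return a copy of `config` with the existing value at `path` replaced."""
    updated = copy.deepcopy(config)
    node = updated
    for key in path[:-1]:
        node = node[key]  # KeyError here means the assumed path does not exist
    if path[-1] not in node:
        raise KeyError(path[-1])
    node[path[-1]] = value
    return updated


# Stand-in for the template; the real file's keys and values may differ.
template = {"passes": {"qlora": {"training_args": {"max_steps": 150}}}}

longer_run = set_nested(
    template, ["passes", "qlora", "training_args", "max_steps"], 1500
)
print(json.dumps(longer_run, indent=2))
```

Failing on a missing key, rather than silently creating it, makes a wrong guess about the template layout visible immediately.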
Optimization Usage
In this stage, we will use the phi2.py script to generate optimized models and run inference with them.
The following model types can be used for optimization:
cpu_fp32
# optimize the fine-tuned model
python phi2.py --finetune_method qlora --model_type cpu_fp32
# optimize the original model
python phi2.py --model_type cpu_fp32
cpu_int4
python phi2.py --model_type cpu_int4
cuda_fp16
python phi2.py --model_type cuda_fp16
cuda_int4
python phi2.py --model_type cuda_int4
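When scripting several of these runs, it can help to build the command lines programmatically. The sketch below uses only the flags shown above (`--model_type` and `--finetune_method`); `MODEL_TYPES` and `build_command` are hypothetical names. Note that this README shows `--finetune_method` combined only with `cpu_fp32`, so using it with other model types is an untested assumption.

```python
import shlex

# Model types listed in this section.
MODEL_TYPES = ("cpu_fp32", "cpu_int4", "cuda_fp16", "cuda_int4")


def build_command(model_type, finetune_method=None):
    """Return the argument list for one phi2.py optimization run."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    args = ["python", "phi2.py"]
    if finetune_method is not None:
        # Only shown together with cpu_fp32 in this README.
        args += ["--finetune_method", finetune_method]
    args += ["--model_type", model_type]
    return args


if __name__ == "__main__":
    for model_type in MODEL_TYPES:
        print(shlex.join(build_command(model_type)))
```

Each argument list can be passed to `subprocess.run(args, check=True)` from the example folder to execute the run.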
GenAI Optimization
To optimize with ONNX Runtime GenAI, follow the onnxruntime-genai build and installation instructions to install the onnxruntime-genai package (>0.1.0).
Run the following command to execute the workflow:
olive run --config phi2_genai.json
The phi2_genai.json config file generates optimized models for both the cpu_int4 and cuda_int4 model types, because onnxruntime-gpu supports both the CPU and CUDA execution providers (EPs).
If you only want the CPU or the CUDA model, you can modify the config file by removing the unwanted execution providers.
# CPU
"accelerators": [
    {
        "device": "CPU",
        "execution_providers": [
            "CPUExecutionProvider"
        ]
    }
]
# CPU: equivalent to the above, because onnxruntime-gpu also supports the CPU EP
"accelerators": [
    {
        "device": "GPU",
        "execution_providers": [
            "CPUExecutionProvider"
        ]
    }
]
# CUDA
"accelerators": [
    {
        "device": "GPU",
        "execution_providers": [
            "CUDAExecutionProvider"
        ]
    }
]
Alternatively, you can use phi2.py to generate the optimized models separately by running the following commands:
python phi2.py --model_type cpu_int4 --genai_optimization
python phi2.py --model_type cuda_int4 --genai_optimization
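The config edit described above, removing unwanted execution providers from the `accelerators` list, can also be sketched in Python. The helper below works on the list itself, in the shape shown in the fragments above; `keep_execution_providers` is a hypothetical name, and the starting list (one GPU accelerator with both providers) is an assumption about what the unmodified config contains.

```python
import copy


def keep_execution_providers(accelerators, wanted):
    """Return a filtered copy; accelerators left with no providers are dropped."""
    kept = []
    for accelerator in accelerators:
        providers = [
            ep for ep in accelerator["execution_providers"] if ep in wanted
        ]
        if providers:
            entry = copy.deepcopy(accelerator)
            entry["execution_providers"] = providers
            kept.append(entry)
    return kept


# Assumed starting point: one GPU accelerator listing both providers.
accelerators = [
    {
        "device": "GPU",
        "execution_providers": ["CPUExecutionProvider", "CUDAExecutionProvider"],
    }
]

cuda_only = keep_execution_providers(accelerators, {"CUDAExecutionProvider"})
print(cuda_only)
```

The function returns a new list and leaves its input untouched, so the same loaded config can be filtered once per target.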
The snippet below shows an example run of the generated phi2 model.
import onnxruntime_genai as og
model = og.Model("model_path")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()
prompt = '''def print_prime(n):
    """
    Print all primes between 1 and n
    """'''
tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(max_length=200)
params.input_ids = tokens
output_tokens = model.generate(params)
text = tokenizer.decode(output_tokens)
print("Output:")
print(text)
You can also use the --inference argument to run inference with the optimized model.
python phi2.py --model_type cuda_int4 --genai_optimization --inference
Optimum Optimization
The commands above generate optimized models for the given model_type and save them in the phi2 folder. These optimized models can be wrapped by ONNXRuntime for inference.
In addition, for a better generation experience, this example also lets you use Optimum to generate optimized models.
You can then call model.generate to run inference with the optimized model.
# Optimum optimization. Avoid using Optimum with a fine-tuned model; this is not yet supported in Olive.
python phi2.py --model_type cpu_fp32 --optimum_optimization
Then let us use the optimized model to do inference.
Generation example of optimized model
# --prompt is optional and accepts a string or a list of strings
# if not given, the default prompts are used: "Write a function to print 1 to n" "Write a extremely long story starting with once upon a time"
python phi2.py --model_type cpu_fp32 --inference --prompt "Write a extremely long story starting with once upon a time"
This command will
- generate the optimized models if you have never run the command before,
- reuse the optimized models if you have run the command before,
- then use the optimized model to run inference with a greedy Top1 search strategy. Note that this example uses only the simplest greedy Top1 search strategy, so the generated text may not look very reasonable.
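To make the greedy Top1 strategy mentioned above concrete, here is a minimal, framework-free sketch: at every step it appends the single highest-scoring next token. It is not the code used by phi2.py. `greedy_decode` and `toy_scores` are hypothetical names, and the toy scoring function stands in for a real model's logits.

```python
def greedy_decode(next_token_scores, prompt_ids, max_new_tokens, eos_id=None):
    """Repeatedly append the highest-scoring next token (top-1)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        # Index of the maximum score; ties resolve to the lowest token id.
        best = max(range(len(scores)), key=scores.__getitem__)
        ids.append(best)
        if best == eos_id:
            break
    return ids


def toy_scores(ids):
    """Toy 4-token 'model': always favors the token after the last one."""
    favored = (ids[-1] + 1) % 4
    return [1.0 if token == favored else 0.0 for token in range(4)]


print(greedy_decode(toy_scores, [0], max_new_tokens=5))
# prints [0, 1, 2, 3, 0, 1]
```

Because the choice is fully deterministic, greedy decoding can fall into repetitive cycles, as the toy output does, which is one reason its results may look unreasonable on open-ended prompts.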
For a better generation experience, run inference with the optimized model using Optimum:
python phi2.py --model_type cpu_fp32 --inference --optimum_optimization --prompt "Write a extremely long story starting with once upon a time"
Export output models in MLFlow format
If you want to output the optimized models to a zip file in MLFlow format, add the --export_mlflow_format argument. The MLFlow model will be packaged in a zip file named mlflow_model in the output folder.
Limitations
- The latest ONNXRuntime implements specific fusion patterns for better performance, but they only work for ONNX models produced by the TorchDynamo-based ONNX Exporter, which is only available on Linux. On Windows, this example falls back to the default PyTorch ONNX Exporter, which achieves some improvement but not as much as the TorchDynamo-based ONNX Exporter. Therefore, Linux is recommended for phi2 optimization.
- For Optimum optimization, the dynamo model is not well supported, so we use the legacy PyTorch ONNX Exporter to run the optimization, as we do on Windows.
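The two limitations above amount to a small decision rule for which exporter path is taken. The sketch below restates that rule in Python for readers planning runs across platforms. `select_exporter` and the labels "dynamo" and "legacy" are hypothetical, and this is not necessarily how phi2.py makes the choice internally.

```python
import platform


def select_exporter(system=None, optimum=False):
    """Return 'dynamo' or 'legacy' following the limitations listed above."""
    system = platform.system() if system is None else system
    if optimum:
        # Optimum optimization uses the legacy PyTorch ONNX Exporter.
        return "legacy"
    # The TorchDynamo-based exporter is only available on Linux.
    return "dynamo" if system == "Linux" else "legacy"


if __name__ == "__main__":
    print(select_exporter())
```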
Transformer Compression with SliceGPT
This workflow compresses a model to improve performance and reduce memory footprint. Specific details about the algorithm can be found in the linked paper.
Prerequisites
To run the workflow:
python phi2.py --slicegpt