README.md

Llama2 optimization

Sample use cases of Olive to optimize a Llama2 model.

Optimization Workflows

Inference optimization using ONNX Runtime Tools

Performs one of the following optimization pipelines, depending on the target device and precision:

CPU, FP32: PyTorch Model -> ONNX Model -> Transformers Optimized ONNX Model fp32
CPU, INT8: PyTorch Model -> ONNX Model -> Transformers Optimized ONNX Model fp32 -> ONNX Dynamic Quantization
CPU, INT4: PyTorch Model -> ONNX Model -> Transformers Optimized ONNX Model fp32 -> ONNX Block-wise int4 Quantization
GPU, FP16: PyTorch Model -> ONNX Model -> Transformers Optimized ONNX Model fp16 + Grouped Query Attention (optional)
GPU, INT4: PyTorch Model -> ONNX Model -> Transformers Optimized ONNX Model fp16 + Grouped Query Attention (optional) -> ONNX Block-wise int4 Quantization

Note: Grouped Query Attention is optional and can be enabled by passing the --use_gqa flag to the script. It is only supported for GPU.

Requirements file: requirements.txt
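
The device and precision combinations above can be summarized as a small lookup. The sketch below is illustrative only: `pipeline_stages` is a hypothetical helper, and the stage strings are descriptive labels taken from the list above, not Olive pass names or anything that `llama2.py` exposes.

```python
def pipeline_stages(device, precision, use_gqa=False):
    """Return the ordered stages for a device/precision pair (labels only)."""
    device, precision = device.lower(), precision.lower()
    # The README states that Grouped Query Attention is GPU-only.
    if use_gqa and device != "gpu":
        raise ValueError("Grouped Query Attention is only supported for GPU")

    if device == "cpu":
        optimized = "Transformers Optimized ONNX Model fp32"
        quantization = {
            "fp32": [],
            "int8": ["ONNX Dynamic Quantization"],
            "int4": ["ONNX Block-wise int4 Quantization"],
        }
    elif device == "gpu":
        optimized = "Transformers Optimized ONNX Model fp16"
        if use_gqa:
            optimized += " + Grouped Query Attention"
        quantization = {
            "fp16": [],
            "int4": ["ONNX Block-wise int4 Quantization"],
        }
    else:
        raise ValueError(f"unknown device: {device!r}")

    if precision not in quantization:
        raise ValueError(f"{precision!r} is not listed for {device}")
    return ["PyTorch Model", "ONNX Model", optimized] + quantization[precision]


# Example: print the GPU INT4 pipeline with Grouped Query Attention enabled.
print(" -> ".join(pipeline_stages("gpu", "int4", use_gqa=True)))
```

Every pipeline shares the first two stages; only the optimized-model precision and the final quantization step differ.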

Inference optimization with ONNX Runtime with DirectML

For Llama2 inference with DirectML on GPUs, please refer to this example.

Inference optimization using ONNX Runtime GenAI

To optimize with ONNX Runtime GenAI, follow the build and installation instructions here to install the onnxruntime-genai package (>0.1.0).

Run the following command to execute the workflow:

python llama2_model_builder.py [--model_name <>] [--metadata_only]

To generate only the metadata for a pre-exported ONNX model, use the --metadata_only option.

The snippet below shows an example run of the generated Llama2 model.

import onnxruntime_genai as og

model = og.Model("model_path")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

prompt = '''def print_prime(n):
    """
    Print all primes between 1 and n
    """'''

tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=200)
params.input_ids = tokens

output_tokens = model.generate(params)

text = tokenizer.decode(output_tokens)

print("Output:")
print(text)

Quantization using GPTQ and text generation using ONNX Runtime with Optimum

This workflow quantizes the Llama2 model using GPTQ and does text generation using ONNX Runtime with Optimum.

GPU, GPTQ INT4: PyTorch Model -> GPTQ INT4 Onnx Model

Note:

This workflow is only supported on GPU and needs a GPU to run.
GPTQ quantization can be enabled by passing the --use_gptq flag to the script.

Requirements file: requirements-gptq.txt

Once finished, you can do text generation using the following code:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer, AutoConfig

quantized_model_dir = "${path_to_quantized_llama2-7b}"
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained(quantized_model_dir)
AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained(quantized_model_dir)
model = ORTModelForCausalLM.from_pretrained(
    quantized_model_dir, provider="CUDAExecutionProvider"
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
inputs = tokenizer("Hello, World", return_tensors="pt").to("cuda:0")
print(tokenizer.batch_decode(model.generate(**inputs, max_length=20), skip_special_tokens=True))

Prerequisites

Clone the repository and install Olive

Refer to the instructions in the examples README to clone the repository and install Olive.

Install onnxruntime

This example requires onnxruntime>=1.16.2. Please install the latest version of onnxruntime:

For CPU:

python -m pip install "onnxruntime>=1.17.0"

For GPU:

python -m pip install "onnxruntime-gpu>=1.17.0"

Note: The GPU package also works for CPU.
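
Before running the workflows, it can help to confirm that the installed package meets the stated minimum. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the version comparison is deliberately simple (numeric dotted parts only), so treat it as a convenience check rather than a full version parser.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.17.0' into (1, 17, 0); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum="1.16.2"):
    """Compare dotted versions numerically; 1.16.2 is the minimum stated above."""
    return parse_version(installed) >= parse_version(minimum)


def installed_onnxruntime():
    """Return (package_name, version) for whichever package is installed, or None."""
    for name in ("onnxruntime", "onnxruntime-gpu"):
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


found = installed_onnxruntime()
if found is None:
    print("onnxruntime is not installed")
else:
    name, version = found
    status = "OK" if meets_minimum(version) else "too old"
    print(f"{name} {version}: {status}")
```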

Install extra dependencies

Install the necessary python packages:

python -m pip install -r <requirements_file>.txt

Run the config to optimize the model

Optimize using ONNX Runtime Tools

To generate only the optimized config file, so that you can double-check it before running the optimization pipeline, run the following command:

python llama2.py --model_name meta-llama/Llama-2-7b-hf --only_config

Or you can run the following command to directly optimize the model:

CPU:

# run to optimize the model: FP32/INT8/INT4
python llama2.py --model_name meta-llama/Llama-2-7b-hf

GPU:

# run to optimize the model: FP16/INT4
python llama2.py --model_name meta-llama/Llama-2-7b-hf --gpu
# use gqa instead of mha
python llama2.py --model_name meta-llama/Llama-2-7b-hf --gpu --use_gqa
# use gptq quantization
python llama2.py --model_name meta-llama/Llama-2-7b-hf --gpu --use_gptq
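
When these commands are launched from another script, the flag combinations can be assembled programmatically. The sketch below uses only the flags shown above; `build_llama2_command` is a hypothetical helper, not part of Olive, and the only rule it enforces is the GPU-only restriction that this README states for `--use_gqa` and `--use_gptq`.

```python
import shlex


def build_llama2_command(model_name, gpu=False, use_gqa=False,
                         use_gptq=False, only_config=False):
    """Build the argument list for llama2.py from the flags in this README."""
    # The README describes both GQA and GPTQ as GPU-only options.
    if (use_gqa or use_gptq) and not gpu:
        raise ValueError("--use_gqa and --use_gptq require --gpu")

    cmd = ["python", "llama2.py", "--model_name", model_name]
    if gpu:
        cmd.append("--gpu")
    if use_gqa:
        cmd.append("--use_gqa")
    if use_gptq:
        cmd.append("--use_gptq")
    if only_config:
        cmd.append("--only_config")
    return cmd


# Print the command line for a GPU run with Grouped Query Attention.
print(shlex.join(build_llama2_command("meta-llama/Llama-2-7b-hf", gpu=True, use_gqa=True)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the example directory.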

Fine-tune on a code generation dataset using QLoRA and optimize using ONNX Runtime Tools

Run the following command to execute the workflow:

python llama2.py --qlora

Note: This workflow requires access to the following resource on the Hugging Face Hub. Request access on its page, then log in from the command line:

nampdn-ai/tiny-codes

huggingface-cli login

Running Workflows on the Cloud

You may notice that this workflow takes a long time to run, especially for QLoRA. Olive offers a feature that allows you to submit the workflow to the cloud, enabling it to run on the compute resources in your Azure Machine Learning workspace.

To use this feature, you will need a remote_config.json file to configure your Azure Machine Learning workspace:

{
    "subscription_id": "<subscription_id>",
    "resource_group": "<resource_group>",
    "workspace_name": "<workspace_name>",
    "keyvault_name": "<keyvault_name>",
    "compute": "<compute>"
}

More details about keyvault_name can be found here.
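
A common mistake is to submit the file with a template placeholder still in place. The sketch below checks the content of `remote_config.json` before use; `check_remote_config` is a hypothetical helper, and treating all five keys shown above as required is an assumption made for illustration.

```python
import json

# Assumption: every key shown in the template above must be filled in.
REQUIRED_KEYS = (
    "subscription_id",
    "resource_group",
    "workspace_name",
    "keyvault_name",
    "compute",
)


def check_remote_config(text):
    """Return a list of problems found in remote_config.json content."""
    config = json.loads(text)
    problems = []
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing value for {key!r}")
        elif value.startswith("<") and value.endswith(">"):
            problems.append(f"placeholder left in {key!r}")
    return problems


# Example: an unfilled template entry is reported as a placeholder.
sample = '{"subscription_id": "<subscription_id>", "compute": "my-cluster"}'
for problem in check_remote_config(sample):
    print(problem)
```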

Make sure you have installed the Olive Azure ML extra by running:

pip install "olive-ai[azureml]"

Then you can run the following command:

python llama2.py --qlora --remote_config remote_config.json

Olive will submit the workflow to the compute resources in your Azure Machine Learning workspace and execute the workflow there. The output artifacts will be automatically exported to the Datastore. For more detailed information, please refer to the official documentation.

Accelerating Workflows with Cloud Model Cache

The cloud model cache is a system where Olive stores intermediate models in Azure Blob Storage. For more detailed information, please refer to the documentation.

You will need a cloud_cache.json configuration file to set up the cloud cache configuration:

{
    "account_url": "<account_url>",
    "container_name": "<container_name>"
}
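
JSON does not permit a trailing comma after the last entry, so generating the file with `json.dumps` avoids hand-editing mistakes. This is a minimal sketch: `render_cloud_cache_config` is a hypothetical helper, and the account URL in the example is a made-up value in the usual Azure Blob Storage form.

```python
import json
from pathlib import Path


def render_cloud_cache_config(account_url, container_name):
    """Return valid JSON text for cloud_cache.json with the two keys shown above."""
    config = {"account_url": account_url, "container_name": container_name}
    return json.dumps(config, indent=4) + "\n"


def write_cloud_cache_config(account_url, container_name, path="cloud_cache.json"):
    """Write the rendered configuration to disk and return the path."""
    target = Path(path)
    target.write_text(render_cloud_cache_config(account_url, container_name), encoding="utf-8")
    return target


# Example with placeholder values; replace them with your own storage details.
print(render_cloud_cache_config("https://example.blob.core.windows.net", "olive-cache"))
```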

Then run the following command:

python llama2.py --qlora --cloud_cache cloud_cache.json

Olive will apply the cloud model cache to this workflow.

Combining Remote Workflow and Cloud Model Cache

To leverage both the remote workflow and Cloud Model Cache for faster workflow execution, simply run:

python llama2.py --qlora --remote_config remote_config.json --cloud_cache cloud_cache.json

This will submit the workflow to the Azure Machine Learning workspace and store intermediate models in Azure Blob Storage, significantly speeding up the process.
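
The four QLoRA invocations in this README (local, remote, cloud cache, and both) differ only in two optional arguments. As a sketch, and assuming nothing beyond the flags shown above, they can be generated from one hypothetical helper:

```python
def build_qlora_command(remote_config=None, cloud_cache=None):
    """Build the llama2.py QLoRA command, adding cloud options when given."""
    cmd = ["python", "llama2.py", "--qlora"]
    if remote_config:
        # Path to the Azure Machine Learning workspace configuration file.
        cmd += ["--remote_config", str(remote_config)]
    if cloud_cache:
        # Path to the cloud model cache configuration file.
        cmd += ["--cloud_cache", str(cloud_cache)]
    return cmd


# Print all four variants described in this README.
for remote in (None, "remote_config.json"):
    for cache in (None, "cloud_cache.json"):
        print(" ".join(build_qlora_command(remote, cache)))
```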

License

Please see the LICENSE file for more details, and follow the use policy of the model provider. Refer to the Responsible Use Guide for details on how to use the model responsibly.