## Describe your changes

Add stable diffusion multi lora example.

## Checklist before requesting a review

- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to update [example documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md) in a follow-up PR.

## (Optional) Issue link
Files in this example:

- assets
- notebook
- sd_utils
- .gitignore
- README.md
- config.py
- config_safety_checker.json
- config_text_encoder.json
- config_unet.json
- config_vae_decoder.json
- config_vae_encoder.json
- requirements-common.txt
- requirements-ov.txt
- requirements.txt
- stable_diffusion.py
- user_script.py
## README.md

### Stable Diffusion Optimization
This folder contains sample use cases of Olive with ONNX Runtime and OpenVINO to optimize:
- Stable Diffusion: Stable Diffusion v1-4, Stable Diffusion v2
- Stable Diffusion XL: Stable Diffusion XL Base, Stable Diffusion XL Refiner
Stable Diffusion comprises multiple PyTorch models tied together into a pipeline.

The ONNX Runtime optimization sample will convert each PyTorch model to ONNX, and then run the converted ONNX models through the `OrtTransformersOptimization` pass. The transformer optimization pass performs several time-consuming graph transformations that make the models more efficient for inference at runtime.

The OpenVINO optimization sample will convert each PyTorch model to an OpenVINO IR model with the `OpenVINOConversion` pass, and create an `OpenVINOStableDiffusionPipeline` for inference.
This sample covers:

- ONNX Runtime with
  - CUDA EP
  - DirectML EP: see the Stable Diffusion and Stable Diffusion XL DirectML examples
- OpenVINO
### Stable Diffusion and Stable Diffusion XL Optimization with ONNX Runtime CUDA EP
This sample performs the following optimization workflow for each model in the Stable Diffusion pipeline:
- PyTorch Model -> ONNX Model -> Transformers-Optimized ONNX Model (fp16)
Transformers optimization uses the following optimizations to speed up Stable Diffusion on CUDA (a quick way to verify these fusions in the optimized model is sketched after this list):

- Flash Attention for float16 precision. Flash Attention uses tiling to reduce the number of GPU memory reads/writes, and improves performance with less memory for long sequence lengths. The kernel requires GPUs of Compute Capability >= 7.5 (like T4, A100, and RTX 2060~4090). It is only available on Linux.
- Memory Efficient Attention for float32 precision or older GPUs (like V100). We use the fused multi-head attention kernel in CUTLASS, which was contributed by xFormers.
- Channels-last (NHWC) convolution. For NVIDIA GPUs with Tensor Core support, the NHWC tensor layout is recommended for convolution. See Tensor Layouts In Memory: NCHW vs NHWC.
- GroupNorm for the NHWC tensor layout, and SkipGroupNorm fusion, which fuses GroupNorm with the Add bias and residual inputs.
- SkipLayerNormalization, which fuses LayerNormalization with the Add bias and residual inputs.
- BiasSplitGelu, a fusion of Add bias with the SplitGelu activation.
- BiasAdd, which fuses Add bias and residual.
- Reduced Transpose nodes via graph transformation.
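If you want to confirm that these fusions actually landed in the exported graph, a minimal sketch like the following can count operator types in an optimized model. The file path under `models/optimized-cuda` is an assumption for illustration; adjust it to point at the optimized U-Net ONNX file produced on your machine.

```python
# Count operator types in an optimized ONNX model to confirm that fused ops
# (e.g. GroupNorm, SkipLayerNormalization, BiasSplitGelu) were inserted.
from collections import Counter

import onnx

# Hypothetical path; adjust to the optimized U-Net ONNX file on your machine.
model_path = "models/optimized-cuda/CompVis/stable-diffusion-v1-4/unet/model.onnx"
model = onnx.load(model_path)

op_counts = Counter(node.op_type for node in model.graph.node)
for op_type, count in sorted(op_counts.items()):
    print(f"{op_type}: {count}")
```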
#### Prerequisites

##### Clone the repository and install Olive
Refer to the instructions in the examples README to clone the repository and install Olive.
We use the same Olive workflow config files and scripts as the DirectML examples. The only difference is the `--provider cuda` option provided to the `stable_diffusion.py` and `stable_diffusion_xl.py` scripts.

So, `cd` into the corresponding example folder from the root of the cloned repository:
**Stable Diffusion**

```
cd examples/stable_diffusion
```

**Stable Diffusion XL**

```
cd examples/directml/stable_diffusion_xl
```
##### Install onnxruntime
This example requires the latest onnxruntime-gpu code which can either be built from source or installed from the nightly builds. The following command can be used to install the latest nightly build of onnxruntime-gpu:
```
# uninstall any pre-existing onnxruntime packages
pip uninstall -y onnxruntime onnxruntime-gpu onnxruntime-directml ort-nightly ort-nightly-gpu ort-nightly-directml

# install onnxruntime-gpu nightly build
pip install ort-nightly-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/
```
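As an optional sanity check (not part of the example scripts), a short snippet like this can confirm that the CUDA execution provider is visible to the installed onnxruntime build:

```python
# Verify that onnxruntime-gpu is installed and the CUDA EP is available.
import onnxruntime as ort

print("onnxruntime version:", ort.__version__)
providers = ort.get_available_providers()
print("available providers:", providers)
assert "CUDAExecutionProvider" in providers, "CUDA EP not found; check the onnxruntime-gpu install"
```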
##### Install other dependencies

Install the necessary Python packages:

```
python -m pip install -r requirements-common.txt
```
#### Conversion to ONNX and Latency Optimization

The easiest way to optimize the pipeline is with the `stable_diffusion.py` and `stable_diffusion_xl.py` scripts. These scripts will enumerate the `config_<model_name>.json` files and optimize each with Olive, then gather the optimized models into a directory structure suitable for testing inference.
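Under the hood, each of those configs is handed to Olive's Python API. A minimal sketch of that per-config step (assuming the `olive-ai` package is installed; the helper scripts in this folder remain the supported entry point) might look like:

```python
# Rough sketch of what the helper scripts do for each model: load a workflow
# config from this folder and let Olive run its conversion/optimization passes.
from olive.workflows import run as olive_run

# One of the workflow configs in this folder, e.g. for the U-Net model.
olive_run("config_unet.json")
```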
**Stable Diffusion**

```
# default model_id is "CompVis/stable-diffusion-v1-4"
python stable_diffusion.py --provider cuda --optimize
```

**Stable Diffusion XL**

```
# default model_id is "stabilityai/stable-diffusion-xl-base-1.0"
python stable_diffusion_xl.py --provider cuda --optimize [--use_fp16_fixed_vae]

# or specify a different model_id
python stable_diffusion_xl.py --provider cuda --model_id stabilityai/stable-diffusion-xl-refiner-1.0 --optimize [--use_fp16_fixed_vae]
```
`--use_fp16_fixed_vae` is optional. If provided, the script will use madebyollin/sdxl-vae-fp16-fix for the VAE models, and all sub-models will be entirely in fp16. Otherwise, the VAE models (vae-decoder for base, and both vae-decoder and vae-encoder for refiner) will be in fp32, and all other sub-models will be in fp16 with fp32 input/output.
Once the script successfully completes:

- The optimized ONNX pipeline will be stored under `models/optimized-cuda/[model_id]` (for example `models/optimized-cuda/CompVis/stable-diffusion-v1-4` or `models/optimized-cuda/stabilityai/stable-diffusion-xl-base-1.0`).
- The unoptimized ONNX pipeline (models converted to ONNX, but not run through the transformer optimization pass) will be stored under `models/unoptimized/[model_id]` (for example `models/unoptimized/CompVis/stable-diffusion-v1-4` or `models/unoptimized/stabilityai/stable-diffusion-xl-base-1.0`).
Re-running the script with `--optimize` will delete the output models, but it will not delete the Olive cache. Subsequent runs will complete much faster since it will simply be copying previously optimized models; you may use the `--clean_cache` option to start from scratch (not typically used unless you are modifying the scripts, for example).
#### Test Inference with CUDA

Test ONNX Runtime inference with the optimized models using `OnnxStableDiffusionPipeline`:

**Stable Diffusion**

```
python stable_diffusion.py --provider cuda --num_images 2
```
Inference will loop until the generated image passes the safety checker (otherwise you would see black images). The result will be saved as `result_<i>.png` on disk.
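If you prefer to drive the optimized model from your own code rather than the helper script, a rough sketch using the `diffusers` `OnnxStableDiffusionPipeline` could look like the following; the model directory and prompt are illustrative assumptions, and the script's own pipeline setup remains the reference.

```python
# Minimal sketch: load the Olive-optimized ONNX pipeline with diffusers and
# run it on the CUDA execution provider. Path and prompt are illustrative only.
from diffusers import OnnxStableDiffusionPipeline

model_dir = "models/optimized-cuda/CompVis/stable-diffusion-v1-4"  # output of --optimize
pipe = OnnxStableDiffusionPipeline.from_pretrained(model_dir, provider="CUDAExecutionProvider")

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=30).images[0]
image.save("result_0.png")
```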
**Stable Diffusion XL**

```
python stable_diffusion_xl.py --provider cuda --num_images 2
```

The result will be saved as `result_<i>.png` on disk.
Refer to the corresponding section in the DirectML READMEs for more details on the test inference options.
### Stable Diffusion Optimization with OpenVINO
#### Setup
Olive is currently in pre-release, with constant updates and improvements to its functions and usage. This sample code will be frequently updated as Olive evolves, so it is important to install Olive from source when checking out this code from the main branch. See the examples README for detailed instructions on how to do this.

Alternatively, you may install a stable release that we have validated. For example:
```
# Install Olive from main branch
pip install git+https://github.com/microsoft/Olive#egg=olive-ai[openvino]

# Clone Olive repo to access sample code
git clone https://github.com/microsoft/olive
```
Once you've installed Olive, install the requirements for this sample matching the version of the library you are using:

```
cd olive/examples/stable_diffusion
pip install -r requirements-ov.txt
```
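As an optional sanity check (not part of the sample scripts), you can confirm that the OpenVINO runtime is installed and list the devices it can target. Depending on your OpenVINO version, the import path may be `openvino` rather than `openvino.runtime`.

```python
# Verify the OpenVINO runtime import and list available devices (e.g. CPU, GPU).
from openvino.runtime import Core

core = Core()
print("available devices:", core.available_devices)
```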
#### Convert to OpenVINO IR model

The easiest way to optimize the pipeline is with the `stable_diffusion.py` helper script:

```
python stable_diffusion.py --optimize
```

The above command will enumerate the `config_<model_name>.json` files and optimize each with Olive, then gather the optimized models into a directory structure suitable for testing inference.
The stable diffusion models are large, and the optimization process is resource intensive. It is recommended to run optimization on a system with a minimum of 16GB of memory (preferably 32GB). Expect optimization to take several minutes (especially the U-Net model).
Once the script successfully completes:

- The converted OpenVINO IR model will be stored under `models/optimized-openvino/[model_id]` (for example `models/optimized-openvino/CompVis/stable-diffusion-v1-4`).
Re-running the script with `--optimize` will delete the output models, but it will not delete the Olive cache. Subsequent runs will complete much faster since it will simply be copying previously optimized models; you may use the `--clean_cache` option to start from scratch (not typically used unless you are modifying the scripts, for example).
#### Test Inference with OpenVINO

This sample code is primarily intended to illustrate model optimization with Olive, but it also provides a simple interface for testing inference with the OpenVINO models. Inference is done by creating an `OVStableDiffusionPipeline` from the saved models.

```
python stable_diffusion.py --inference --provider openvino
```
Inference will loop until an image is generated. The result will be saved as `result_<i>.png` on disk.
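To run the converted models from your own code instead of the helper script, a minimal sketch with `optimum-intel`'s `OVStableDiffusionPipeline` might look like the following; the model directory and prompt are illustrative assumptions, and the helper script remains the reference path.

```python
# Minimal sketch: load the converted OpenVINO IR pipeline with optimum-intel
# and generate an image. Path and prompt are illustrative only.
from optimum.intel import OVStableDiffusionPipeline

model_dir = "models/optimized-openvino/CompVis/stable-diffusion-v1-4"  # output of --optimize
pipe = OVStableDiffusionPipeline.from_pretrained(model_dir)

image = pipe("amazing watercolor painting", num_inference_steps=30).images[0]
image.save("result_0.png")
```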
Run `python stable_diffusion.py --help` for additional options. A few particularly relevant ones:

- `--image_path <str>`: the input image path for image-to-image inference.
- `--img_to_img_example`: run the image-to-image example. The default input image is `assets/dog.png`, and the default prompt is `amazing watercolor painting`.