### Description
Our nightly CPU Python package is named "ort-nightly" instead of
"onnxruntime" for historical reasons (TensorFlow did the same).
We would now prefer the nightly packages to use the same name as the release packages.
Make this change for all nightly Python packages, including CPU,
GPU (CUDA), and possibly others.
### Motivation and Context
### Description
Change the hipify step to remove the `-roc` option passed to hipify-perl. This
makes the conversion prefer hipBLAS over rocBLAS. rocBLAS can still be called
directly, such as in TunableOp.
### Motivation and Context
HIP interfaces are preferred over roc interfaces when porting from CUDA to HIP.
Calling roc interfaces directly is meant for ROCm-specific enhancements or
extensions.
Followed the ROCm example below it, which isn't the naming convention we
want to follow. Didn't fix the ROCm one because I'm not sure whether there are
consumers relying on its naming convention.
### Description
Enabling Python bindings and GCC support for AIX.
### Motivation and Context
Code changes in this PR include:
1. Python binding enablement
2. GCC build support
Below is the list of changed files with descriptions.
1. cmake/CMakeLists.txt
[gcc build support] Add the -no-unused-function compiler flag for IBMClang.
2. cmake/external/eigen.cmake
[gcc build support] Add an AIX check for applying the AIX patch.
3. cmake/onnxruntime_python.cmake
[python binding] Add a NOT AIX check for -Xlinker.
4. cmake/onnxruntime_unittests.cmake
[gcc build support] Fix for gtest behavior; see the comment in the file.
[python binding] Use -Wl,-brtl for linking
onnxruntime_providers_shared in test_execution_provider.
5. cmake/patches/eigen/eigen-aix.patch
[gcc build support] With GCC on AIX we hit __builtin_cpu_supports("mma"),
which is not supported yet, so this patch replaces that check. The patched
code checks for a Power10 processor at run time and selects the routine
accordingly.
6. onnxruntime/python/onnxruntime_validation.py
[python binding] Add an AIX check in check_distro_info().
7. onnxruntime/test/providers/cpu/generator/random_test.cc
[gcc build support] Update the previous AIX check to also require clang,
so that in the GCC case the else block is taken.
8. onnxruntime/test/python/onnxruntime_test_python.py
[python binding] Add a powerpc check on platform.processor().
9. setup.py
[python binding] Add an AIX check for the list of libraries.
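For illustration, here is a minimal sketch of the kind of platform checks described in items 6 and 8 above; the exact conditions used in the PR may differ.
```python3
import platform
import sys

# Illustrative only; the exact checks used in the PR may differ.
is_aix = sys.platform.startswith("aix")
is_powerpc = platform.processor().lower().startswith("powerpc")

if is_aix or is_powerpc:
    # Apply AIX/PowerPC-specific expectations (e.g. adjusted library lists or test behavior).
    print("AIX/PowerPC platform detected")
```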
### Description
Exclude cuDNN 9 and CUDA 12 DLLs from the manylinux wheel to reduce the Python
package size.
### Motivation and Context
The 1.20.0 ort-nightly-gpu Python wheels on Linux are suddenly > 800 MB
in size, while the wheels built on the 1.19 release branch are around
220 MB.
The size change is caused by
https://github.com/microsoft/onnxruntime/pull/19470.
### Description
Added the cuDNN frontend and used it for NHWC convolutions, with optional
activation fusion.
#### Backward compatibility
- Models that already contain FusedConv can still run.
- If ORT is built with cuDNN 8, the cuDNN frontend is not built into the
binary and the old kernels (using cuDNN backend APIs) are used.
#### Major Changes
- For cuDNN 9, the cuDNN frontend is used to fuse convolution and
bias when the provider option `fuse_conv_bias=1` is set (see the sketch after this list).
- Remove the FusedConv fusion from the graph transformer for the CUDA
provider, so FusedConv will no longer be added to the graph for the CUDA EP
in the future.
- Update the CMake files for the cuDNN settings. The cuDNN installation
search order at build time is:
  * environment variable `CUDNN_PATH`
  * `onnxruntime_CUDNN_HOME` CMake extra define. If a build starts from
build.py/build.sh, the user can pass it through the `--cudnn_home` parameter, or
via the environment variable `CUDNN_HOME` if `--cudnn_home` is not used.
  * the cuDNN Python package installation directory, like
python3.xx/site-packages/nvidia/cudnn
  * the CUDA installation path
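A minimal sketch of passing these provider options when creating a session, assuming the usual CUDA provider-options dictionary; `prefer_nhwc` is an existing option, `fuse_conv_bias` is the option introduced here, and the model path is illustrative:
```python3
import onnxruntime as ort

# Request NHWC convolutions and the cuDNN-frontend conv+bias fusion on the CUDA EP.
providers = [
    ("CUDAExecutionProvider", {"prefer_nhwc": "1", "fuse_conv_bias": "1"}),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("resnet50.onnx", providers=providers)
```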
#### Potential Issues
- If ORT is built with cuDNN 8, the FusedConv fusion is no longer applied
automatically, so some models might see a performance regression. Users who
still want the FusedConv operator for performance reasons have several
workarounds: use an older version of onnxruntime, or use an older version of
ORT to save the optimized ONNX model and then run it with the latest version
of ORT. We believe the majority of users will have moved to cuDNN 9 by the
1.20 release (the default in ORT and PyTorch has been cuDNN 9 for three months
by then), so the impact is small.
- The cuDNN graph uses TF32 by default, and the user cannot disable TF32 through
the `use_tf32` CUDA provider option. If a user encounters an accuracy issue
(for example in testing), they have to set the environment variable
`NVIDIA_TF32_OVERRIDE=0` to disable TF32. The documentation of `use_tf32`
needs to be updated later.
#### Follow ups
This is one of the PRs that aim to enable NHWC convolution in the CUDA EP by
default when the device supports it. Other changes will follow to make that
possible:
(1) Enable `prefer_nhwc` by default for devices with sm >= 70.
(2) Make `fuse_conv_bias=1` the default after more testing.
(3) Add other NHWC operators (like Resize or Upsample).
### Motivation and Context
The new cuDNN frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100 -d 1 -i
'prefer_nhwc|1' resnet50.onnx`).
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
### Description
Repeat of #21084, with the policy CMP0144 (which requires CMake 3.27.0)
removed to suppress warnings.
### Motivation and Context
Already approved PR:
https://github.com/microsoft/onnxruntime/pull/21084
Removed the policy it added, which comes from CMake 3.27.0.
### Description
This reverts commit 1d7bf56947 because it
broke the AMD GPU CI pipeline. Sorry, when I reviewed the PR I forgot to
run the AMD GPU CI pipeline.
We will revert the PR first and then ask the author to fix the issue.
### Description
As a follow-up to #20506.
### Motivation and Context
### Description
Introducing a new class, ORTPipelineModule, to handle wrapping layers for
DeepSpeed pipeline parallelism.
### Motivation and Context
To support pipeline parallelism with ORTModule.
This PR includes initial support for DeepSpeed pipeline parallelism.
- [x] Support pipeline parallelism where layers are nn.Modules in a
Sequential container.
- [ ] Support LayerSpec and TiedLayerSpec
- [ ] Enable partitioning to accept List
- [ ] Full-GPU Graph Consolidation
- [ ] Subgraph Merging for Inference
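A hypothetical usage sketch; the import path and constructor arguments below are assumptions that mirror DeepSpeed's `PipelineModule`, not the confirmed ORTPipelineModule API:
```python3
import torch.nn as nn

# Hypothetical import path and signature, mirroring DeepSpeed's PipelineModule;
# the actual ORTPipelineModule API may differ.
from onnxruntime.training.ortmodule.experimental.pipe import ORTPipelineModule

# Wrap a list of nn.Modules (as in a Sequential) for DeepSpeed pipeline parallelism.
layers = [nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10)]
model = ORTPipelineModule(layers=layers, num_stages=2, loss_fn=nn.CrossEntropyLoss())
# The wrapped model is then passed to deepspeed.initialize() as usual.
```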
### Description
Add `cann_dependencies`
### Motivation and Context
The previous [PR](https://github.com/microsoft/onnxruntime/pull/17365)
avoided using patchelf but lost `cann_dependencies`. This PR adds
`cann_dependencies` back so that CANN libraries are not required when
repairing the wheel.
The ROCm library versions changed in ROCm 6.0, so using the libraries
packaged in the wheel might cause errors.
For example, the `libamdhip64.so.6` packaged in the wheel causes compute
errors when training a GPT-2 model.
The root cause is still under investigation.
### Description
This PR adds an ONNX conversion script for Dynamo-exported Phi-2, an
optimization script, and an inference example script.
A README file is added as documentation:
https://github.com/microsoft/onnxruntime/tree/wangye/phi2_doc/onnxruntime/python/tools/transformers/models/phi2#readme
### Motivation and Context
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Refactor the VAIEP to use MSFT's standalone API
### Motivation and Context
Vitis ONNX RT VAI should switch to using the standalone API for ONNX EPs
in order to decouple the EP from onnxruntime.dll and the providers.dll.
This will help to simplify customer deployment of applications and use
cases that need to share their onnxruntime.dll with other applications.
---------
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zz002 <zhenze.wang@amd.com>
### Description
Adding Python 3.12 support to ORT.
### Motivation and Context
### Improve perf for stage3 training - first wave
Port the existing PythonOp/PythonOpGrad Python runner to C++, and introduce
an unsafe run mode (to skip on-the-fly detection of inplace usage,
save-for-backward, and materialized grads).
This reduces the overhead from XX~XXX us to X ~ lower end of XX us. In
LLaMA2 7B training with 8x 32GB V100 GPUs, we have observed a 6.7% gain over
PyTorch (1.59 vs. 1.49 it/s).
Peak memory also dropped from 31 GB to 28 GB.
### Motivation and Context
### Description
- Adds graph fusions to a preprocessing step that can be called before
creating a QDQ model for QNN EP.
- Fuse Erf sequence to Gelu (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_gelu.py)).
Required by QNN EP.
- Fuse ReduceMean sequence to LayerNormalization (adapted from
[optimizer.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/fusion_layernorm.py)).
Not required by QNN EP.
- Fuse ReduceL2 sequence to LpNormalization (new, specific to QNN EP).
Required by QNN EP.
Example use:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model

# Added by this PR:
model_updated = qnn_preprocess_model("model.fp32.onnx", "model.fp32.preprocessed.onnx", fuse_layernorm=True)
model_to_quantize = "model.fp32.preprocessed.onnx" if model_updated else "model.fp32.onnx"

# Quantize model (data_reader is a user-provided calibration data reader) ...
qnn_config = get_qnn_qdq_config(model_to_quantize, data_reader, activation_type=QuantType.QUInt16)
quantize(model_to_quantize, "model.qdq.onnx", qnn_config)
```
### Motivation and Context
Allow more models to be quantized for use with QNN EP
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
#### 1. Adds `TensorQuantOverrides` extra option
Allows specifying a dictionary of tensor-level quantization overrides:
```
TensorQuantOverrides = dictionary :
Default is {}. Set tensor quantization overrides. The key is a tensor name and the value is a
list of dictionaries. For per-tensor quantization, the list contains a single dictionary. For
per-channel quantization, the list contains a dictionary for each channel in the tensor.
Each dictionary contains optional overrides with the following keys and values.
'quant_type' = QuantType : The tensor's quantization data type.
'scale' = Float : The scale value to use. Must also specify `zero_point` if set.
'zero_point' = Int : The zero-point value to use. Must also specify `scale` if set.
'symmetric' = Bool : If the tensor should use symmetric quantization. Invalid if
`scale` or `zero_point` are also set.
'reduce_range' = Bool : If the quantization range should be reduced. Invalid if
`scale` or `zero_point` are also set.
'rmax' = Float : Override the maximum real tensor value in calibration data.
Invalid if `scale` or `zero_point` are also set.
'rmin' = Float : Override the minimum real tensor value in calibration data.
Invalid if `scale` or `zero_point` are also set.
```
- All of the options are optional.
- Some combinations are invalid.
- Ex: `rmax` and `rmin` are unnecessary if the `zero_point` and `scale`
are also specified.
Example for per-tensor quantization overrides:
```python3
extra_options = {
"TensorQuantOverrides": {
"SIG_OUT": [{"scale": 1.0, "zero_point": 127}],
"WGT": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
"BIAS": [{"quant_type": quantization.QuantType.QInt8, "symmetric": True, "reduce_range": True}],
},
}
```
Example for per-channel quantization overrides (Conv weight and bias):
```python3
extra_options = {
"TensorQuantOverrides": {
"WGT": [
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.0,
"rmax": 2.5,
"reduce_range": True,
},
{
"quant_type": quantization.QuantType.QUInt8,
"rmin": 0.2,
"rmax": 2.55,
"reduce_range": False,
},
],
"BIAS": [
{"zero_point": 0, "scale": 0.000621},
{"zero_point": 0, "scale": 0.23},
],
},
}
```
#### 2. Adds utilities to get the default QDQ configs for QNN EP
Added a `quantization.execution_providers.qnn.get_qnn_qdq_config` method
that inspects the model and returns suitable quantization
configurations.
Example usage:
```python3
from quantization import quantize, QuantType
from quantization.execution_providers.qnn import get_qnn_qdq_config
qnn_config = get_qnn_qdq_config(input_model_path,
                                data_reader,
                                activation_type=QuantType.QUInt16,
                                weight_type=QuantType.QUInt8)

quantize(input_model_path,
         output_model_path,
         qnn_config)
```
### Motivation and Context
Make it possible to create more QDQ models that run on QNN EP.
---------
Signed-off-by: adrianlizarraga <adlizarraga@microsoft.com>
### Description
The motivation for this PR is code cleanup.
1. Remove all deprecated Python code related to orttrainer, the old
checkpoint, and related tests and utils.
2. Clean up orttraining_pybind_state.cc to remove all deprecated
bindings.
Update the ROCm package excluded-libs list:
- change librocblas.so.0 to librocblas.so.3, which is used on ROCm 5.6 and
ROCm 5.7
- add librocfft.so.0, libhipfft.so.0, and libhiprtc.so.5, and sort the list.
This PR is to support efficient attention and flash attention in
ORTModule, including:
- Use ATen to call efficient attention, which requires PyTorch 2.2.0 dev
or newer. Set ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable.
- Integrate Triton flash attention, which requires
triton==2.0.0.dev20221202 and an A100 or H100.
Set ORTMODULE_USE_FLASH_ATTENTION=1 to enable.
- A Python transformer tool to match sub-graphs by config and write
transformers quickly.
The current transformers support attention masks for both efficient attention
and flash attention, and dropout for efficient attention only. To support more
training scenarios (such as the causal mask in GPT-2), more transformers need
to be added.
The feature is guarded by system environment variables and won't affect
any current behavior if not enabled. Since it requires specific
PyTorch/Triton versions, related tests are not added for now.
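A minimal sketch of enabling these paths in a training script, assuming the environment variables are read when ORTModule wraps the model; the model below is just a placeholder:
```python3
import os

# Opt in before ORTModule wraps the model; both features are off by default.
os.environ["ORTMODULE_USE_EFFICIENT_ATTENTION"] = "1"  # ATen efficient attention (PyTorch 2.2.0 dev or newer)
# os.environ["ORTMODULE_USE_FLASH_ATTENTION"] = "1"    # Triton flash attention (A100/H100, triton==2.0.0.dev20221202)

import torch
from onnxruntime.training.ortmodule import ORTModule

model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8).cuda()
model = ORTModule(model)  # forward/backward now run through ORT
```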
### Description
### Motivation and Context
- We will publish the onnxruntime-training-rocm package on ADO feeds.
The onnxruntime-training package will be solely for CUDA.
- Add a new pipeline for the onnxruntime-training-rocm ADO feeds:
https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1278. Only the
package with the latest ROCm version is published to ADO.
### Description
The files should not have the minor version number. The names were added
in #17365 by mistake.
### Motivation and Context
We did not successfully exclude them.
### Description
### Motivation and Context
Get the latest gcc 12 by default
---------
Co-authored-by: Changming Sun <chasun@microsoft.com>
### Description
This PR adds the following scripts for LLaMA:
- LLaMA conversion (support for TorchScript and Dynamo exporters)
- LLaMA parity
- LLaMA benchmark
- LLaMA quantization
- LLaMA integration with [Hugging Face
Optimum](https://github.com/huggingface/optimum)
### Motivation and Context
This PR adds scripts for using LLaMA. There is a [follow-up
PR](https://github.com/microsoft/onnxruntime/pull/17043) for adding
scripts for Whisper.
### Motivation and Context
When we handle PyTorch models' inputs in different places (ORTModule or
others), it's common to flatten structured data into a 1-D tensor list
(required by libraries such as torch.onnx.export,
torch.autograd.Function.forward, or the ORT inference session), do the
subsequent work, and then unflatten back to the original hierarchy for the
returned values.
The DeepSpeed stage 3 hooks support work also needs such a library to do
similar things, so I am proposing to extract this pair of APIs into
training/utils/, where they can be used more generally. A comprehensive set of
test data is also used for testing flatten/unflatten in unit tests.
Let me know if you have any other suggestions.
### Refactor schema extraction and output unflattening
Move `_extract_schema` and `unflatten_user_output` in
`orttraining/orttraining/python/training/ortmodule/_io.py` to
`extract_data_and_schema` and `unflatten_data_using_schema` in
`orttraining/orttraining/python/training/utils/torch_io_helper.py` as
shared utilities, which can be used later by other features (the DeepSpeed
stage 3 hook rewrite).
There is still some duplicated logic that handles flattening for different
tasks by recursively looping over the data structure; it will be changed
step by step to avoid heavy review effort.
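A conceptual sketch of the flatten/unflatten round trip these shared APIs provide; this is not the actual ORT implementation (the function names come from this PR, the bodies are illustrative, and `None` is used here to mark tensor slots in the schema):
```python3
import torch

def extract_data_and_schema(data):
    """Flatten nested lists/tuples/dicts of tensors into (flat tensor list, schema)."""
    flat = []

    def visit(node):
        if isinstance(node, torch.Tensor):
            flat.append(node)
            return None  # None in the schema marks a tensor slot
        if isinstance(node, (list, tuple)):
            return type(node)(visit(item) for item in node)
        if isinstance(node, dict):
            return {key: visit(value) for key, value in node.items()}
        return node  # other leaves (ints, strings, ...) stay in the schema
        # Note: a real implementation also needs to distinguish literal None values from tensor slots.

    return flat, visit(data)

def unflatten_data_using_schema(flat, schema):
    """Rebuild the original nested structure from a flat tensor list and its schema."""
    it = iter(flat)

    def build(node):
        if node is None:
            return next(it)  # consume the next tensor for each slot
        if isinstance(node, (list, tuple)):
            return type(node)(build(item) for item in node)
        if isinstance(node, dict):
            return {key: build(value) for key, value in node.items()}
        return node

    return build(schema)

# Round trip: a dict with a nested tuple flattens to two tensors and unflattens back.
inputs = {"x": torch.ones(2), "extras": (torch.zeros(3), 42)}
flat, schema = extract_data_and_schema(inputs)
restored = unflatten_data_using_schema(flat, schema)
assert torch.equal(restored["x"], inputs["x"]) and restored["extras"][1] == 42
```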
### Description
This PR adds `onnxruntime.transformers.models.whisper` to the wheel.
### Usage
There is a README.md document that shows sample commands. The following
command will show how to use the custom Whisper export script in more
detail.
```
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx --help
```
### Motivation and Context
This fixes an issue with adding the Whisper custom export scripts to the
wheel. The Whisper folder now appears in the wheel.
![Screenshot 2023-04-26
143705](https://user-images.githubusercontent.com/115581922/234708587-6d1b7d34-71a9-4f9f-a491-657ceb25afcb.jpg)
### Description
This PR adds VPU support to the OpenVINO Execution Provider.
- Bug fixes for GPU and CPU.
- Changes to the OpenVINO backend in the Serialized Model API for faster first
inference latency.
- Deprecation of HDDL-VADM and MYRIAD; removed the related code.
- Support for OpenVINO 2023.0.
- Dynamic shapes support for iGPU.
### Motivation and Context
- VPU is upcoming hardware that can provide AI acceleration for
client systems through OpenVINO.
---------
Signed-off-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Suryaprakash Shanmugam <suryaprakash.shanmugam@intel.com>
Co-authored-by: MaajidKhan <n.maajid.khan@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>