### Description
Several tests result in segfaults during the minimal CUDA build.
Although test failures are expected due to the limitations of the minimal
CUDA EP, failing gracefully would be much preferred.
### Motivation and Context
To reproduce:
1. Build ORT with:
```bash
./build.sh --build_shared_lib --use_full_protobuf --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --tensorrt_home /TensorRT-10.0.1.6 --parallel --skip_tests --skip_submodule_sync --allow_running_as_root --use_tensorrt --cmake_extra_defines onnxruntime_CUDA_MINIMAL=1
```
2. Run `onnxruntime_test_all`
```bash
...
[----------] 1 test from AllocationPlannerTest
[ RUN ] AllocationPlannerTest.ReusedInputCrossDifferentStreams
Segmentation fault (core dumped)
```
### Description
WebNN only supports test (inference) mode, so we don't care about the
other inputs or attributes related to training mode; use WebNN's identity
op to implement the Dropout op directly.
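A minimal sketch of why this mapping is sound (illustrative Python, not the WebNN EP code): in inference mode ONNX Dropout is a pass-through, so an identity op reproduces its output, and the optional mask is all ones.
```python
import numpy as np

# Illustrative only: ONNX Dropout in inference ("test") mode returns the input
# unchanged; the optional mask output is all ones.
def dropout_inference(x: np.ndarray):
    return x, np.ones_like(x, dtype=bool)

y, mask = dropout_inference(np.array([1.0, -2.0, 3.0], dtype=np.float32))
assert np.array_equal(y, np.array([1.0, -2.0, 3.0], dtype=np.float32))
```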
### Description
<!-- Describe your changes. -->
Adds support for setting an external data path for model weight files.
Includes additional fixes to ensure this compiles against the latest
v1.19 ONNX Runtime.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Separating out the weights used for larger models (like Stable Diffusion) is
the motivation for this change set.
---------
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Artur Wojcik <artur.wojcik@amd.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
### Description
Added the cuDNN frontend and used it for NHWC convolutions, with optional
activation fusion.
#### Backward compatible
- Models that already contain FusedConv can still run.
- If ORT is built with cuDNN 8, the cuDNN frontend will not be built into
the binary; the old kernels (using cuDNN backend APIs) are used instead.
#### Major Changes
- For cuDNN 9, we enable the cuDNN frontend to fuse convolution and
bias when the provider option `fuse_conv_bias=1` is set (see the usage
sketch after this list).
- Remove the FusedConv fusion from the graph transformer for the CUDA
provider, so FusedConv nodes will no longer be added to the graph for the
CUDA EP in the future.
- Update the cmake files regarding cuDNN settings. The search order for the
cuDNN installation at build time is:
* environment variable `CUDNN_PATH`
* the `onnxruntime_CUDNN_HOME` cmake extra define. If a build starts from
build.py/build.sh, the user can pass it through the `--cudnn_home` parameter,
or via the environment variable `CUDNN_HOME` if `--cudnn_home` is not used.
* the cudnn python package installation directory, like
python3.xx/site-packages/nvidia/cudnn
* the CUDA installation path
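As a usage sketch (not taken from this PR's tests), the new fusion could be enabled through CUDA EP provider options from Python; `prefer_nhwc` and `fuse_conv_bias` are the option names described above, and they only take effect in a cuDNN 9 build that includes the cuDNN frontend.
```python
import onnxruntime as ort

# Hedged sketch: enable the NHWC convolution path and the conv+bias fusion via
# CUDA EP provider options (effective only on builds with cuDNN frontend support).
sess = ort.InferenceSession(
    "resnet50.onnx",
    providers=[("CUDAExecutionProvider", {"prefer_nhwc": "1", "fuse_conv_bias": "1"})],
)
```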
#### Potential Issues
- If ORT is built with cuDNN 8, the FusedConv fusion is no longer done
automatically, so some models might see a performance regression. Users who
still want the FusedConv operator for performance reasons have several ways
to work around it: use an older version of onnxruntime, or use an older
version of ORT to save the optimized onnx model and then run it with the
latest version of ORT. We believe the majority of users will have moved to
cuDNN 9 by the 1.20 release (since cuDNN 9 will have been the default in ORT
and PyTorch for three months by then), so the impact is small.
- cuDNN graphs use TF32 by default, and the user cannot disable TF32 through
the `use_tf32` CUDA provider option. If a user encounters an accuracy issue
(for example in testing), they have to set the environment variable
`NVIDIA_TF32_OVERRIDE=0` to disable TF32 (see the snippet after this list).
The documentation of `use_tf32` needs to be updated later.
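For example, a minimal way to apply that workaround from Python, assuming the override is set before CUDA/cuDNN is initialized:
```python
import os

# Must be set before the first CUDA EP session is created, since cuDNN reads it
# at initialization time; this disables TF32 process-wide, not per session.
os.environ["NVIDIA_TF32_OVERRIDE"] = "0"
```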
#### Follow ups
This is one of the PRs targeting enabling NHWC convolution in the CUDA EP by
default when the device supports it. Other changes will follow to make that
possible:
(1) Enable `prefer_nhwc` by default for devices with sm >= 70.
(2) Make `fuse_conv_bias=1` the default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).
### Motivation and Context
The new cuDNN frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100 -d 1 -i
'prefer_nhwc|1' resnet50.onnx`).
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
### Description
- Supports quantized Conv + Activation on the HTP backend:
  - Translates `DQs -> Conv -> Relu/Clip -> Q` into a single QNN Conv
operator if the Relu (or Clip) is redundant (a sketch of the redundancy
condition follows this list).
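A rough sketch of the redundancy condition (illustrative Python, not the QNN EP code), assuming a positive output scale: the Relu can be folded away when the smallest value representable by the following Q node is already non-negative, so quantization clamps exactly like the Relu would.
```python
def relu_is_redundant(scale: float, zero_point: int, qmin: int = 0) -> bool:
    # Smallest dequantized value the Q node can produce; if it is >= 0, the Relu
    # adds nothing because quantization already clamps to the same range.
    dequantized_min = (qmin - zero_point) * scale
    return dequantized_min >= 0.0
```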
### Motivation and Context
Expands support for QDQ models created with tools that do not wrap Relu
or Clip with QDQ nodes.
This PR introduces the `IQnnNodeGroup` class. In the same way that a
`NodeUnit` represents a collection of `Nodes`, an `IQnnNodeGroup` can
represent one or more `NodeUnits` that are translated into a QNN
operator. QNN EP parses the ONNX graph to create a list of
`IQnnNodeGroup` objects, each representing a single `NodeUnit` or a
fusion of multiple `NodeUnits`.
### Description
model: phi-3-mini-4k-instruct
avx2 symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |49.5|70.0|-29.2%|9.6|10.8|-34.2%
32 |76.8|52.4|9.7%|15.2|14.6|4.1%
64 |78.2|71.4|9.5%|16.6|16.3|1.8%
128 |72.9|70.6|3.2%|17.1|16.8|1.7%
256 |83.7|63.6|31.6%|18.1|17.4|4%
avx2 asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |50.7|61.5|-17.5%|9.6|9.2|4.3%
32 |77.4|52.4|47.7%|14.6|13.9|5.0%
64 |78.7|63.0|24.9%|16.2|15.9|1.8%
128 |80.0|61.9|29.2%|17.2|16.9|1.7%
256 |81.5|63.3|28.7%|17.9|17.3|3.4%
avx2vnni symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |82.9|117.0|-29.0%|15.9|19.3|-17.6%
32 |133.0|100.4|32.4%|26.1|24.5|6.5%
64 |166.9|118.8|40.4%|28.3|27.1|4.4%
128 |165.9|119.6|38.7%|29.3|28.5|2.8%
256 |165.2|119.6|38.1%|30.2|29.0|4.1%
avx2vnni asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |80.2|118.9|-32.5%|15.1|16.7|-9.5%
32 |130.7|99.7|31.0%|25.0|23.8|5.0%
64 |168.7|124.9|35.0%|27.3|26.8|1.8%
128 |169.6|123.8|36.9%|29.2|27.9|4.6%
256 |175.0|125.7|39.0%|30.0|29.7|1.0%
avx512 symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |135.2|156.5|-13.6|25.5|23.8|7.1
32 |150.0|159.5|-5.9|34.9|29.6|17.9
64 |167.5|157.5|6.3|39.7|34.4|15.4
128 |177.8|158.0|12.5|40.3|35.4|13.8
256 |182.6|157.3|16.0|41.7|37.7|10.6
avx512 asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |136.1|151.4|-10.1%|26.1|19.9|31.1%
32 |150.0|157.8|-4.9%|34.3|29.3|17.0%
64 |165.7|156.6|5.8%|38.7|30.7|26.0%
128 |180.4|156.6|15.1%|40.2|34.7|15.8%
256 |181.3|158.0|14.7%|41.6|36.6|13.6%
avx512vnni symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |143.4|155.4|-7.7%|25.6|23.3|9.8%
32 |159.2|157.0|1.4%|34.1|29.8|14.4%
64 |182.0|159.5|14.1%|38.4|34.8|10.3%
128 |221.2|160.8|37.5%|41.0|36.4|12.6%
256 |250.5|162.4|54.2%|41.6|37.7|10.3%
avx512vnni asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |142.5|152.3|-6.4%|26.3|19.7|33.5%
32 |158.2|155.0|2.0%|34.3|29.2|17.4%
64 |184.1|156.6|17.5%|38.3|30.9|23.9%
128 |215.8|156.1|17.5%|41.3|35.0|17.9%
256 |249.2|155.9|59.8%|41.1|36.3|13.2%
4-bit gemm implementation with AVX using tiles.
1.
The tile size is 2 blks by 4. When the size is smaller than a tile, it
reduces to 1 blk by 4, then 2 blks by 1, and lastly 1 blk by 1.
Within the internal kernel, weights and activations are loaded based on the
SIMD register width and the blk length:
avx2 (256-bit registers): 64 weights and activations are loaded at a time.
blklen16: 4 blks are computed by the internal kernel
blklen32: 2 blks are computed by the internal kernel
blklen64: 1 blk is computed by the internal kernel
blklen128: 1 blk is computed in 2 passes by the internal kernel
blklen256: 1 blk is computed in 4 passes by the internal kernel
avx512 (512-bit registers): 128 weights and activations are loaded at a time.
blklen16: 8 blks are computed by the internal kernel
blklen32: 4 blks are computed by the internal kernel
blklen64: 2 blks are computed by the internal kernel
blklen128: 1 blk is computed by the internal kernel
blklen256: 1 blk is computed in 2 passes by the internal kernel
2.
blksum is precomputed during prepacking.
The computation is reformulated as:
Sum1(scale_a * scale_b * Sum_blk(a_i * b_i)) + Sum2(blksum_a * blksum_b)
where Sum_blk is over one blk, and Sum1 and Sum2 are each over all blks for
one output.
The sum is computed with sgemm in the current implementation. Further
improvement is possible.
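A rough numpy sketch of this reformulation (not the MLAS kernel; it assumes symmetric per-block activation quantization and asymmetric 4-bit weights, and the variable names are illustrative):
```python
import numpy as np

blklen, nblks = 32, 4
rng = np.random.default_rng(0)
qa = rng.integers(-128, 128, size=(nblks, blklen))   # quantized activations (symmetric)
qb = rng.integers(0, 16, size=(nblks, blklen))       # quantized 4-bit weights (asymmetric)
scale_a, scale_b = rng.random(nblks), rng.random(nblks)
zp_b = rng.integers(0, 16, size=nblks)

# Reference: dequantize each block, then take the full dot product.
ref = float(((scale_a[:, None] * qa) * (scale_b[:, None] * (qb - zp_b[:, None]))).sum())

# Reformulated: Sum1 over per-block quantized products + Sum2 over precomputed block sums.
sum1 = float((scale_a * scale_b * (qa * qb).sum(axis=1)).sum())
blksum_a = scale_a * qa.sum(axis=1)   # precomputable when the activations are quantized
blksum_b = -scale_b * zp_b            # folds the weight zero point; precomputed at prepack
sum2 = float((blksum_a * blksum_b).sum())
assert np.isclose(ref, sum1 + sum2)
```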
---------
Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>
Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.1 to 2.2.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/pytorch/pytorch/releases">torch's
releases</a>.</em></p>
<blockquote>
<h2>PyTorch 2.2: FlashAttention-v2, AOTInductor</h2>
<h1>PyTorch 2.2 Release Notes</h1>
<ul>
<li>Highlights</li>
<li>Backwards Incompatible Changes</li>
<li>Deprecations</li>
<li>New Features</li>
<li>Improvements</li>
<li>Bug fixes</li>
<li>Performance</li>
<li>Documentation</li>
</ul>
<h1>Highlights</h1>
<p>We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2
offers ~2x performance improvements to
<code>scaled_dot_product_attention</code> via FlashAttention-v2
integration, as well as AOTInductor, a new ahead-of-time compilation and
deployment tool built for non-python server-side deployments.</p>
<p>This release also includes improved torch.compile support for
Optimizers, a number of new inductor optimizations, and a new logging
mechanism called TORCH_LOGS.</p>
<p><strong>Please note that we are <a
href="https://redirect.github.com/pytorch/pytorch/issues/114602">deprecating
macOS x86 support</a>, and PyTorch 2.2.x will be the last version that
supports macOS x64.</strong></p>
<p>Along with 2.2, we are also releasing a series of updates to the
PyTorch domain libraries. More details can be found in the library
updates blog.</p>
<p>This release is composed of 3,628 commits and 521 contributors since
PyTorch 2.1. We want to sincerely thank our dedicated community for your
contributions. As always, we encourage you to try these out and report
any issues as we improve 2.2. More information about how to get started
with the PyTorch 2-series can be found at our <a
href="https://pytorch.org/get-started/pytorch-2.0/">Getting Started</a>
page.</p>
<p>Summary:</p>
<ul>
<li><code>scaled_dot_product_attention</code> (SDPA) now supports
FlashAttention-2, yielding around 2x speedups compared to previous
versions.</li>
<li>PyTorch 2.2 introduces a new ahead-of-time extension of
TorchInductor called AOTInductor, designed to compile and deploy PyTorch
programs for non-python server-side.</li>
<li><code>torch.distributed</code> supports a new abstraction for
initializing and representing ProcessGroups called device_mesh.</li>
<li>PyTorch 2.2 ships a standardized, configurable logging mechanism
called TORCH_LOGS.</li>
<li>A number of torch.compile improvements are included in PyTorch 2.2,
including improved support for compiling Optimizers and improved
TorchInductor fusion and layout optimizations.</li>
<li>Please note that we are deprecating macOS x86 support, and PyTorch
2.2.x will be the last version that supports macOS x64.</li>
<li><code>torch.ao.quantization</code> now offers a prototype
<code>torch.export</code> based flow</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="8ac9b20d4b"><code>8ac9b20</code></a>
Run docker release build on final tag (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117131">#117131</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/117182">#117182</a>)</li>
<li><a
href="2490352430"><code>2490352</code></a>
Fix cuInit test on Windows (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117095">#117095</a>)</li>
<li><a
href="3a44bb713f"><code>3a44bb7</code></a>
[CI] Test that cuInit is not called during import (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117043">#117043</a>)</li>
<li><a
href="1c8ba3847d"><code>1c8ba38</code></a>
[CI] Use jemalloc for CUDA builds (<a
href="https://redirect.github.com/pytorch/pytorch/issues/116900">#116900</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116988">#116988</a>)</li>
<li><a
href="96d2ddbafe"><code>96d2ddb</code></a>
Store user model to simplify
ONNXProgram.{adapt_torch_*,<strong>call</strong>} APIs (<a
href="https://redirect.github.com/pytorch/pytorch/issues/1152">#1152</a>...</li>
<li><a
href="738b4a560a"><code>738b4a5</code></a>
Update ONNX's IO Adapter to support FakeTensor with ExportedProgram (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114407">#114407</a>)...</li>
<li><a
href="4cf10bf4dc"><code>4cf10bf</code></a>
[Cherry-pick] [Quant] [PT2] Enable batchnorm in
_move_exported_model_to_eval ...</li>
<li><a
href="7e97e4b4b6"><code>7e97e4b</code></a>
[AARCH64] Fall back to GEMM if mkldnn_matmul fails (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115936">#115936</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116666">#116666</a>)</li>
<li><a
href="1a3e3c7cff"><code>1a3e3c7</code></a>
[CUDA] baddmm should fall back to addmm for batch=1 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114992">#114992</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116518">#116518</a>)</li>
<li><a
href="ab7505f78c"><code>ab7505f</code></a>
Fix broken PyYAML 6.0 on MacOS x86 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115956">#115956</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116551">#116551</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/pytorch/pytorch/compare/v1.13.1...v2.2.0">compare
view</a></li>
</ul>
</details>
<br />
[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=torch&package-manager=pip&previous-version=1.13.1&new-version=2.2.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
The header files were added in PR #16454.
Then, recently, I made PR #21464, which changed how we pack Linux tarballs.
The new tarball was missing the custom op header files.
Therefore I need to make this change.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Masks off the top 4 bits of INT4 weights, improving accuracy.
### Motivation and Context
This is a workaround as the QNN docs state masking is not required.
### Description
The original argument accepts the enum values QuantFormat.QOperator or
QuantFormat.QDQ, but the default value is QOperator.
Change the argument to str so it accepts "QOperator" or "QDQ" and converts
to QuantFormat after parsing.
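A minimal sketch of the new parsing behavior (the argument name here is illustrative):
```python
import argparse
from onnxruntime.quantization import QuantFormat

parser = argparse.ArgumentParser()
parser.add_argument("--quant_format", type=str, default="QOperator",
                    choices=["QOperator", "QDQ"])
args = parser.parse_args(["--quant_format", "QDQ"])
quant_format = QuantFormat[args.quant_format]  # convert the string back to the enum
assert quant_format is QuantFormat.QDQ
```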
### Motivation and Context
Bug fix
### Description
#### Issues Fixed
(1) **TRT cross attention not thread safe**. [Core changes like
this](6fd7aba3d4)
are used to make it thread-safe:
* Add a once_flag to CumulatedSequenceLengthCache to make sure it is only
initialized once, and make the cache read-only after initialization.
Previously the content was not read-only, so it might be changed by another
thread and potentially cause a buffer overrun.
* The kernel initialization is not guarded (although the factory for kernel
loading has a static mutex to guard multi-threading), so a mutable variable
might be set by two different threads at the same time. Add a once_flag to
avoid that.
This also requires some workspace computation changes, so I did not create a
separate pull request.
(2) **Bias for cross attention**
That scenario assumes that only the query has bias, not the key and value.
However, this assumption was not verified at runtime, there was no comment
documenting it, and there was no test case, so support for the scenario was
disabled by mistake. The scenario is actually used in the Whisper model
(TODO: we shall add Whisper tests to the CI pipeline, and also update the
fusion script to verify such assumptions if needed).
The CUDA/CPU kernels support bias for cross attention as long as the bias is
zero for key and value. I updated the check to support the scenario and
added comments wherever that assumption is made.
(3) **Fallback support**
Previously, the unfused kernel did not support the packed qkv and packed kv
formats, which means some cases could fail since there was no fallback. I
added new AddBiasTranspose CUDA kernels for them to support the fallback, so
that all supported cases will not fail.
#### Improvements
(4) **QKV workspace size**.
The logic for no_qkv_workspace could easily get out of sync since the
related code was scattered across different source files. I refactored the
code to move all related code into one file (attention_prepare_qkv.cu) and
added asserts, so that the logic stays in sync.
(5) **Remove confusing concept of pass past in kv**
parameters.pass_past_in_kv is confusing since the k/v in cross attention
are not past state. Remove it and use parameters.qkv_format ==
Q_K_V_BSNH_BNSH_BNSH instead.
The new code does not use past_key/past_value for cross attention, so the
logic is clearer.
(6) **More coverage, less workspace, and fewer transposes for flash and
efficient attention**
Previously, there was one condition that did not run flash or efficient
attention:
```
bool past_no_bias = (pass_key_value_as_past || past_key != nullptr || present_key != nullptr) && bias == nullptr;
```
After this change, we can use flash and efficient attention for that case,
and also use less workspace.
For example, for cross attention with bias, the original code used two
additional workspaces:
```
transpose: past_key (BxNxSxH) => temp_k_workspace (BxSxNxH), past_value (BxNxSxH_v) => temp_v_workspace (BxSxNxH_v)
add bias: query => q, temp_k_workspace => k, temp_v_workspace => v
```
The new logic is like:
```
if (has bias)
Add bias to query, key, value, and store in q, k, v workspace
else
Use query, key and value directly as q, k and v in kernel
```
We can see that we do not need to allocate temp_k_workspace and
temp_v_workspace, so less memory is used; the new code also saves two
transposes in this case.
Flash and efficient attention support BSNH or BNSH formats for k and v.
In the old code, k/v were always converted to BSNH format, which is not
always necessary. I changed the code to convert k/v to BSNH or BNSH case by
case, so that more cases can be covered by flash or efficient attention,
improving performance.
(7) **Debugging support**
Previously, there was little debug info. In this change, I add a flag for
debug info to AttentionData, so that we can output debug info during
processing.
Also add functions to consolidate the dumping of inputs, QKV processing
and outputs, and add an environment variable `ORT_ENABLE_GPU_DUMP` to allow
disabling dumping from the CUDA kernel.
#### Summary of changes
(1) Refactor CheckInputs and pass in the operator type.
(2) Refactor PrepareQKV to support fallback for packed qkv or packed kv
inputs.
(3) Change a few cases in PrepareQKV to allow more cases to be covered by
flash and efficient attention.
(4) Use parameters.qkv_format == Q_K_V_BSNH_BNSH_BNSH to replace
parameters.pass_past_in_kv.
(5) Allow bias input for Q_K_V_BSNH_BNSH_BNSH, and add comments about the
assumption that key/value have no bias in this case.
(6) Fix the thread-safety issue in CumulatedSequenceLengthCache handling.
(7) Add test cases to cover all supported scenarios.
Currently supported scenarios for MultiHeadAttention on CUDA/CPU:
| Q | K | V | pastK | pastV | presentK | presentV | Bias | Op desc |
| ---- | ---- | ---- | ------ | ----- | --------- | -------- | ----- | --------- |
| BSNH | BLNH | BLNH | - | - | - | - | QKV | not packed |
| BLN3H | - | - | - | - | - | - | QKV | qkv packed <br> not supported on CPU |
| BSNH | BLN2H | - | - | - | - | - | --- | kv packed <br> not supported on CPU |
| BSNH | BNLH | BNLH | - | - | - | - | Q-- | cross attention <br> bias for Q only |
| BSNH | BLNH | BLNH | - | - | BNTH | BNTH | QKV | no past <br> only present |
| BSNH | BLNH | BLNH | BNPH | BNPH | BNTH | BNTH | QKV | past and present <br> (not sharing buffer) |
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
https://github.com/microsoft/onnxruntime/issues/18854
### Description
The ETW trace logger appears to be registered prematurely because
initialized_ is marked as true before the registration is actually done,
causing a crash in the Lenovo camera application.
A prior attempt to address this was made here:
https://github.com/microsoft/onnxruntime/pull/21226
It was reverted here:
https://github.com/microsoft/onnxruntime/pull/21360
### Motivation and Context
The problem is that during TraceLoggingRegisterEx initialization, the
callback is re-invoked and attempts re-initialization, which is not allowed.
However, TraceLoggingRegisterEx can be initialized concurrently when
initialization happens on multiple threads, so it needs to be protected by a
lock; but the lock cannot naively block, because the callback's
re-invocation would cause a deadlock.
To solve this, another tracking variable, "initializing", is added; it
protects against re-initialization during the first initialization (see the
sketch below).
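A schematic sketch of that guard pattern (in Python for brevity; the actual fix is in the C++ ETW registration code and the names here are illustrative):
```python
import threading

class EtwRegistration:
    def __init__(self):
        self._lock = threading.RLock()  # re-entrant: the callback fires on the registering thread
        self._initializing = False
        self._initialized = False

    def lazy_initialize(self):
        with self._lock:
            if self._initialized or self._initializing:
                return  # re-entrant call from the registration callback: do nothing
            self._initializing = True
            try:
                self._register()          # may re-invoke lazy_initialize() via its callback
                self._initialized = True  # only marked true after registration completes
            finally:
                self._initializing = False

    def _register(self):
        pass  # stands in for TraceLoggingRegisterEx
```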
---------
Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
### Description
When the graph is quantized to QDQ format, DQ + MatMul is transformed into
MatMulNBits by the level-2 optimizer when the model is initialized in an
inference session.
In the transformation step, tensors are transposed and new tensor protos are
created. Instead of using protobuf arena-allocated memory, this PR sets the
tensor proto to use an external buffer and points the external location at
the memory that holds the tensor buffer allocated on the CPU.
Then, in the step that creates an OrtValue from the tensor proto, the memory
buffers in the tensor proto are directly assigned to the tensors, which were
originally allocated by the ORT arena.
With these two steps, the peak memory usage of the QDQ format model is the
same as that of the QOperator model. In addition, model initialization time
is significantly reduced. Take
[Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
for example:
|| QOperator Model (MatMulNBits) | QDQ Model (DQ + MatMul, original code) | QDQ Model (this PR) |
|---|---|---|---|
| peak memory consumption | 2.8 GB | ~4.8 GB | 2.8 GB |
| initialization time | 3 sec | 9 sec | 5 sec |
### Motivation and Context
When the graph is quantized to QDQ format, DQ + MatMul is converted to
MatMulNBits in the level-2 optimizer.
Originally, the newly created tensor protos used memory allocated by the
protobuf arena. This memory cannot be fully released when the tensor protos
are deleted.
Then, in the tensor proto to OrtValue step, tensors are created using the
ORT arena. Later, in the pre-pack step for MatMulNBits, new OrtValues are
created, and the tensors in the ORT arena are not fully released either.
These two arena allocation steps in the DQ + MatMul -> MatMulNBits
transformation result in almost 2x memory consumption during model
initialization.
### Description
This extends the existing pad_fusion to the AveragePool operator, i.e., fuse
a Pad if it is followed by an AveragePool operator.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add support for Split Op
### Motivation and Context
Address operator gaps in high priority model.
---------
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
<!-- Describe your changes. -->
Update TRT OSS Parser to [latest 10.2-GA
branch](f161f95883)
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change adds a check for whether the scale in the QuantizeLinear (or
DequantizeLinear) is a positive scalar, and a new selector to disallow
removing the QDQ around MaxPool if it is not.
### Motivation and Context
Currently, the DropQDQNodesRules optimization removes QuantizeLinear and
DequantizeLinear nodes from DequantizeLinear ∘ MaxPool ∘ QuantizeLinear.
However, if the x_scale/y_scale values are non-positive, the
(de-)quantization changes the ordering of the elements in the input
value, so this optimization is changing the results.
https://github.com/microsoft/onnxruntime/issues/21176
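A quick numeric illustration of why the sign of the scale matters (not ORT code): with a negative scale, quantization reverses the ordering of values, so MaxPool applied to the quantized data no longer picks the same element as MaxPool applied to the original data.
```python
import numpy as np

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax)

x = np.array([1.0, 5.0])
quantize(x, scale=0.1, zero_point=0)     # -> [10., 50.]  max stays at index 1
quantize(x, scale=-0.1, zero_point=128)  # -> [118., 78.] max moves to index 0
```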
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
The current labeling bot over-applies many of the labels (e.g., ep:CUDA and
platform:windows) and is missing some of the APIs + EPs.
We are working on migrating this workflow to GitHub policies, but would like
to use this fix in the meantime to avoid causing any issues with ORT 1.19.
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Propagates NaN values in the min and max operators so that min or max
with a NaN in either input always produces NaN.
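For reference, numpy exposes both conventions; the behavior this PR adopts matches the propagating one:
```python
import numpy as np

np.minimum(np.float32("nan"), np.float32(1.0))  # -> nan  (NaN propagates)
np.fmin(np.float32("nan"), np.float32(1.0))     # -> 1.0  (NaN ignored)
```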
### Motivation and Context
Fixes #21455
### Description
Follow-up PR for bug fixes on 1.19.
### Motivation and Context
- Handles 1.19 docker file fixes.
- Sets the default file naming of the epctx onnx model to use the _ctx.onnx
suffix.
- Creates epctx model directories if they don't exist.
---------
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
### Description
This PR adds a new option, `ort.env.wasm.wasmBinary`, which allows users to
set a buffer containing the preloaded .wasm file content.
This PR should resolve the problem from the latest discussion in #20876.
Bug: https://github.com/microsoft/onnxruntime/issues/21467
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Since the onedevice training CPU packaging has been moved to a separate
pipeline, its nuget package publishing step must be moved as well.
### Motivation and Context
Fixes the exception in Nuget Publishing Packaging Pipeline caused by
#21485
### Description
Functionality extension for the SetOutputShape method in custom op shape inference.
### Motivation and Context
- **SetOutputShape** interface enhancement: the shape inference function actually needs to set both the tensor type and the shape. Add a parameter **type** to allow users to specify the tensor type, with **ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT** as the default value to ensure compatibility.
Co-authored-by: mingyue <mingyue@amd.com>
### Description
<!-- Describe your changes. -->
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml
### Motivation and Context
This CI pipeline has been divided into 4 different pipelines.
### Description
The local CI setup for AIX reported test failures after the gtest 1.15.0
upgrade.
### Motivation and Context
These test failures were observed after the gtest upgrade:
The following tests FAILED:
1 - onnxruntime_test_all (ILLEGAL)
7 - onnxruntime_logging_apis_test (Subprocess aborted)
To fix this, I am enabling pthread support under gtest; it was disabled with
the previous version of gtest for some reason.
With pthread support enabled, the above tests pass with gtest 1.15.0.
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.8 to 2.1.9.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.9</h2>
<h2>What's Changed</h2>
<ul>
<li>[2.1] Fix overflow in MemoryAllocator.Create(options) by <a
href="https://github.com/antonfirsov"><code>@antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2732">SixLabors/ImageSharp#2732</a></li>
<li>Backport GIF LZW fix to 2.1 by <a
href="https://github.com/antonfirsov"><code>@antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2756">SixLabors/ImageSharp#2756</a></li>
<li>Backport 2759 to 2.1.x by <a
href="https://github.com/antonfirsov"><code>@antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2770">SixLabors/ImageSharp#2770</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9">https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="9816ca4501"><code>9816ca4</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2770">#2770</a>
from SixLabors/af/backport-2759-2.1.x</li>
<li><a
href="b33d666ab7"><code>b33d666</code></a>
handle DecodingMode</li>
<li><a
href="6b2030b549"><code>6b2030b</code></a>
Merge branch 'release/2.1.x' into af/backport-2759-2.1.x</li>
<li><a
href="8ffad3f480"><code>8ffad3f</code></a>
Issue2012BadMinCode should decode now</li>
<li><a
href="1f5bf23b9e"><code>1f5bf23</code></a>
skip Issue2758_DecodeWorks</li>
<li><a
href="3bf8c572a0"><code>3bf8c57</code></a>
manual port of 3.1 gif decoder</li>
<li><a
href="28c20ded87"><code>28c20de</code></a>
Clamp JPEG quality estimation results.</li>
<li><a
href="4b910e7f84"><code>4b910e7</code></a>
Decode LZW row by row</li>
<li><a
href="a1f2879771"><code>a1f2879</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2756">#2756</a>
from SixLabors/af/git-av-2.1</li>
<li><a
href="898df7f8ca"><code>898df7f</code></a>
backport <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2749">#2749</a>
to 2.1</li>
<li>Additional commits viewable in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9">compare
view</a></li>
</ul>
</details>
<br />
[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Sixlabors.ImageSharp&package-manager=nuget&previous-version=2.1.8&new-version=2.1.9)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The change in #21005 works for directly building wheels with `build.py`,
but ort-nightly-directml wheels, as well as the 1.18.1 release of the
onnxruntime-directml python wheel, still do not work with conda since
they're built from the `py-win-gpu.yml` pipeline, which uses
`install_third_party_deps.ps1` to set compile flags.
### Description
<!-- Describe your changes. -->
1. We decided to move the context node creation back to our own repo because it is more flexible to modify.
2. We found a bug related to the context node: it would change the inference order. So we fixed it in this PR as well.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This is crucial for the Microsoft release next month.
---------
Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
### Description
<!-- Describe your changes. -->
[VitisAI] 1. KernelDef supports StartVersion and EndVersion
2. CapabilityOps checks domain
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
### Description
<!-- Describe your changes. -->
Set version and other info in the Microsoft.ML.OnnxRuntime C# dll by
setting GenerateAssemblyInfo to true and passing in ORT version in the
CI.
Minor re-org of the order of properties so related things are grouped a
little better.
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
#21475
### Description
Add the QNN EP option `context_node_name_prefix` to set the EPContext node name prefix.
### Motivation and Context
To work around the QNN context PD memory limit, the user needs to split the model into pieces and generate the QNN context model for each piece separately. The EPContext nodes generated in the separate graphs can end up with the same node name, which causes issues when gluing those EPContext nodes together into a single model.
To avoid this, the user can set context_node_name_prefix for each split piece to make the node names unique (see the sketch below).
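A hedged usage sketch (not from this PR's tests): generating the EPContext model for one split piece with a unique prefix; `backend_path` and the `ep.context_enable` session config are existing QNN EP settings, and `context_node_name_prefix` is the new option described above.
```python
import onnxruntime as ort

so = ort.SessionOptions()
so.add_session_config_entry("ep.context_enable", "1")  # dump the EPContext model

sess = ort.InferenceSession(
    "model_part1.onnx",
    sess_options=so,
    providers=[("QNNExecutionProvider", {
        "backend_path": "QnnHtp.dll",          # illustrative backend library
        "context_node_name_prefix": "part1_",  # keeps node names unique across pieces
    })],
)
```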