Commit Graph

11643 Commits

Author SHA1 Message Date
vraspar 88c811b638
Restructure MacOS framework package to fix malformed Framework errors (#21536)
### Description

Refactor framework directory structure for MacOS packages

### Motivation and Context
Apple started enforcing a specific [framework
structure](https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPFrameworks/Concepts/FrameworkAnatomy.html)
for MacOS packages. We need to change how we package for MacOS to follow
the guidelines.

Fixes the following issue: [Malformed
Framework](https://github.com/microsoft/onnxruntime-swift-package-manager/issues/19)
2024-08-04 12:47:16 -07:00
Po-Wei (Vincent) 2653226ed0
Fail tests gracefully for the minimal cuda build (#21391)
### Description
Several tests result in segfaults during the minimal CUDA build.
Although test failures are expected due to the limitations of the minimal
CUDA EP, failing gracefully would be much preferred.



### Motivation and Context
To reproduce:
1. Build ORT with:
```bash
./build.sh --build_shared_lib --use_full_protobuf --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --tensorrt_home /TensorRT-10.0.1.6 --parallel --skip_tests --skip_submodule_sync --allow_running_as_root --use_tensorrt --cmake_extra_defines onnxruntime_CUDA_MINIMAL=1
```
2. Run `onnxruntime_test_all`
```bash
...
[----------] 1 test from AllocationPlannerTest
[ RUN      ] AllocationPlannerTest.ReusedInputCrossDifferentStreams
Segmentation fault (core dumped)
```
2024-08-02 18:27:36 -07:00
Wanming Lin 8c641d7182
[WebNN EP] Support Dropout op (#21586)
### Description
WebNN only supports test mode, so we don't care about the inputs or
attributes related to training mode; use WebNN's identity op to implement
the Dropout op directly.
2024-08-02 16:25:04 -07:00
Ted Themistokleous 45b7c41ef0
[MIGraphX EP] Set External Data Path (#21598)
### Description
Add support for setting the external data path for model weight files.
Additional fixes ensure this compiles against the latest v1.19
ONNX Runtime.


### Motivation and Context
Support for separate weight files used by larger models (like Stable
Diffusion) is the motivation for this change set.

---------

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Artur Wojcik <artur.wojcik@amd.com>
Co-authored-by: Ted Themistokleous <tedthemistokleous@amd.com>
2024-08-02 16:19:04 -07:00
Tianlei Wu 54d6614ad6
Security fuzz address sanitizer fix Bug (continue) (#21579)
Add a check of node.InputDefs()[2]->Exists() for the LayerNorm bias (follow-up to https://github.com/microsoft/onnxruntime/pull/21528/files#r1694026327).

Format the file: break long lines to stay within the 120-character limit.
2024-08-02 15:45:42 -07:00
Julius Tischbein 1391354265
Adding CUDNN Frontend and use for CUDA NN Convolution (#19470)
### Description
Added cuDNN Frontend and used it for NHWC convolutions, optionally
fusing the activation.

#### Backward compatibility
- Models that already contain FusedConv can still run.
- If ORT is built with cuDNN 8, the cuDNN frontend is not built into the
binary; the old kernels (using cuDNN backend APIs) are used.

#### Major Changes
- For cuDNN 9, the cuDNN frontend is enabled to fuse convolution and
bias when the provider option `fuse_conv_bias=1` is set.
- Remove the FusedConv fusion from the graph transformer for the CUDA
provider, so FusedConv will no longer be added to the graph for the CUDA EP
in the future.
- Update cmake files regarding the cuDNN settings. The search order for the
cuDNN installation at build time is the following:
  * environment variable `CUDNN_PATH`
  * `onnxruntime_CUDNN_HOME` cmake extra define. If a build starts from
build.py/build.sh, the user can pass it through the `--cudnn_home` parameter, or
via the environment variable `CUDNN_HOME` if `--cudnn_home` is not used.
  * the cuDNN python package installation directory, like
python3.xx/site-packages/nvidia/cudnn
  * the CUDA installation path

#### Potential Issues

- If ORT is built with cuDNN 8, the FusedConv fusion is no longer done
automatically, so some models might see a performance regression. Users who
still want the FusedConv operator for performance reasons have several
workarounds: use an older version of onnxruntime, or use an older version of
ORT to save the optimized onnx model and then run it with the latest
version of ORT. We believe that the majority of users will have moved to cuDNN 9
by the 1.20 release (since the default in ORT and PyTorch will have been cuDNN 9
for 3 months by then), so the impact is small.
- The cuDNN graph uses TF32 by default, and the user cannot disable TF32 through
the use_tf32 cuda provider option. If a user encounters an accuracy issue
(like in testing), they have to set the environment variable
`NVIDIA_TF32_OVERRIDE=0` to disable TF32. The documentation of
use_tf32 needs to be updated later.

#### Follow ups
This is one of the PRs that target enabling NHWC convolution in the CUDA EP by
default when the device supports it. Other changes will follow to
make it possible:
(1) Enable `prefer_nhwc` by default for devices with sm >= 70.
(2) Change `fuse_conv_bias=1` to be the default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).

### Motivation and Context

The new cuDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution with the pointwise bias operation. On the [NVIDIA
ResNet50](https://pytorch.org/hub/nvidia_deeplearningexamples_resnet50/)
we get a performance boost from 49.1144 ms to 42.4643 ms per inference
on a 2560x1440 input (`onnxruntime_perf_test -e cuda -I -q -r 100 -d 1 -i
'prefer_nhwc|1' resnet50.onnx`).
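
As a rough sketch (not part of this PR), these options could be passed through the Python API along the lines below, assuming a build where the CUDA EP exposes `prefer_nhwc` and `fuse_conv_bias` as provider options; the model path is a placeholder.

```python
import onnxruntime as ort

# Provider option names are taken from this PR; availability depends on the ORT build.
cuda_options = {
    "prefer_nhwc": "1",     # prefer NHWC convolution kernels when the device supports them
    "fuse_conv_bias": "1",  # let the cuDNN frontend fuse Conv + bias
}

# "resnet50.onnx" is a placeholder model path.
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
print(session.get_providers())
```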

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Maximilian Mueller <maximilianm@nvidia.com>
2024-08-02 15:16:42 -07:00
Adrian Lizarraga 0e708de4fc
[QNN EP] Support Conv + Clip/Relu fusion (#21537)
### Description
- Supports quantized Conv + Activation on the HTP backend:
- Translates `DQs -> Conv -> Relu/Clip -> Q` into a single QNN Conv
operator if the Relu (or Clip) is redundant.



### Motivation and Context
Expands support for QDQ models created with tools that do not wrap Relu
or Clip with QDQ nodes.

This PR introduces the `IQnnNodeGroup` class. In the same way that a
`NodeUnit` represents a collection of `Nodes`, an `IQnnNodeGroup` can
represent one or more `NodeUnits` that are translated into a QNN
operator. QNN EP parses the ONNX graph to create a list of
`IQnnNodeGroup` objects, each representing a single `NodeUnit` or a
fusion of multiple `NodeUnits`.
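
As a rough sketch of the redundancy condition (not the actual QNN EP implementation): the trailing Relu or Clip is a no-op whenever the Q output's representable dequantized range already lies inside the activation's clamp range.

```python
def dequant_range(scale: float, zero_point: int, qmin: int, qmax: int):
    """Real-valued range representable by the Q output."""
    return scale * (qmin - zero_point), scale * (qmax - zero_point)

def relu_is_redundant(scale, zero_point, qmin, qmax) -> bool:
    lo, _ = dequant_range(scale, zero_point, qmin, qmax)
    return lo >= 0.0

def clip_is_redundant(scale, zero_point, qmin, qmax, clip_min, clip_max) -> bool:
    lo, hi = dequant_range(scale, zero_point, qmin, qmax)
    return clip_min <= lo and hi <= clip_max

# A uint8 output with zero_point 0 can never encode a negative value, so Relu is redundant.
print(relu_is_redundant(scale=0.02, zero_point=0, qmin=0, qmax=255))    # True
print(clip_is_redundant(0.02, 0, 0, 255, clip_min=0.0, clip_max=6.0))   # True (range is [0, 5.1])
```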
2024-08-02 11:02:22 -07:00
liqun Fu b87e8edb98
Mlas int4 int8 with avx2/512 (#20687)
### Description
model: phi-3-mini-4k-instruct
avx2 symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |49.5|70.0|-29.2%|9.6|10.8|-34.2%
32 |76.8|52.4|9.7%|15.2|14.6|4.1%
64 |78.2|71.4|9.5%|16.6|16.3|1.8%
128 |72.9|70.6|3.2%|17.1|16.8|1.7%
256 |83.7|63.6|31.6%|18.1|17.4|4%

avx2 asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |50.7|61.5|-17.5%|9.6|9.2|4.3%
32 |77.4|52.4|47.7%|14.6|13.9|5.0%
64 |78.7|63.0|24.9%|16.2|15.9|1.8%
128 |80.0|61.9|29.2%|17.2|16.9|1.7%
256 |81.5|63.3|28.7%|17.9|17.3|3.4%

avx2vnni symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |82.9|117.0|-29.0%|15.9|19.3|-17.6%
32 |133.0|100.4|32.4%|26.1|24.5|6.5%
64 |166.9|118.8|40.4%|28.3|27.1|4.4%
128 |165.9|119.6|38.7%|29.3|28.5|2.8%
256 |165.2|119.6|38.1%|30.2|29.0|4.1%

avx2vnni asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |80.2|118.9|-32.5%|15.1|16.7|-9.5%
32 |130.7|99.7|31.0%|25.0|23.8|5.0%
64 |168.7|124.9|35.0%|27.3|26.8|1.8%
128 |169.6|123.8|36.9%|29.2|27.9|4.6%
256 |175.0|125.7|39.0%|30.0|29.7|1.0%

avx512 symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |135.2|156.5|-13.6|25.5|23.8|7.1
32 |150.0|159.5|-5.9|34.9|29.6|17.9
64 |167.5|157.5|6.3|39.7|34.4|15.4
128 |177.8|158.0|12.5|40.3|35.4|13.8
256 |182.6|157.3|16.0|41.7|37.7|10.6

avx512 asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |136.1|151.4|-10.1%|26.1|19.9|31.1%
32 |150.0|157.8|-4.9%|34.3|29.3|17.0%
64 |165.7|156.6|5.8%|38.7|30.7|26.0%
128 |180.4|156.6|15.1%|40.2|34.7|15.8%
256 |181.3|158.0|14.7%|41.6|36.6|13.6%

avx512vnni symmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |143.4|155.4|-7.7%|25.6|23.3|9.8%
32 |159.2|157.0|1.4%|34.1|29.8|14.4%
64 |182.0|159.5|14.1%|38.4|34.8|10.3%
128 |221.2|160.8|37.5%|41.0|36.4|12.6%
256 |250.5|162.4|54.2%|41.6|37.7|10.3%

avx512vnni asymmetric
blklen | updated prompt tps | baseline prompt tps | prompt tps change% | updated token gen tps | baseline token gen tps | token gen change%
-|-|-|-|-|-|-
16 |142.5|152.3|-6.4%|26.3|19.7|33.5%
32 |158.2|155.0|2.0%|34.3|29.2|17.4%
64 |184.1|156.6|17.5%|38.3|30.9|23.9%
128 |215.8|156.1|17.5%|41.3|35.0|17.9%
256 |249.2|155.9|59.8%|41.1|36.3|13.2%


4-bit gemm implementation with AVX using tiles.

1.
The tile size is 2 blks by 4. If the size is less than a tile, it reduces to
1 blk by 4, 2 blks by 1, and lastly 1 blk by 1.
In the internal kernel, weights and activations are loaded based on the SIMD
register width and blk length:
With avx2 256-bit registers, 64 weights and activations are loaded:
   blklen16: 4 blks are computed by the internal kernel
   blklen32: 2 blks are computed by the internal kernel
   blklen64: 1 blk is computed by the internal kernel
   blklen128: 1 blk is computed 2 times by the internal kernel
   blklen256: 1 blk is computed 4 times by the internal kernel

With avx512 512-bit registers, 128 weights and activations are loaded:
   blklen16: 8 blks are computed by the internal kernel
   blklen32: 4 blks are computed by the internal kernel
   blklen64: 2 blks are computed by the internal kernel
   blklen128: 1 blk is computed by the internal kernel
   blklen256: 1 blk is computed 2 times by the internal kernel

2.
blksum is precomputed during prepacking.
The computation is reformulated as:
Sum1(scale_a * scale_b * Sum_blk(a_i * b_i)) + Sum2(blksum_a * blksum_b)
  Sum_blk is over one blk
  Sum1 is over all blks for one output
  Sum2 is over all blks for one output
The sum is computed with sgemm in the current implementation. Further
improvement is possible.
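
For intuition, a small numpy sketch (not the MLAS code) of the blocked form for the symmetric case, where the first term alone reproduces the dequantized dot product; the blksum term only comes into play for asymmetric quantization with zero points.

```python
import numpy as np

rng = np.random.default_rng(0)
blklen, nblk = 32, 8

# per-block symmetric int4 quantization: a = scale_a * qa, b = scale_b * qb
qa = rng.integers(-8, 8, size=(nblk, blklen)).astype(np.int32)
qb = rng.integers(-8, 8, size=(nblk, blklen)).astype(np.int32)
scale_a = rng.random(nblk).astype(np.float32)
scale_b = rng.random(nblk).astype(np.float32)

# reference: dequantize, then take the full dot product
ref = float((scale_a[:, None] * qa).ravel() @ (scale_b[:, None] * qb).ravel())

# blocked form: Sum_blk over each block, then Sum1 over blocks
blocked = float(np.sum(scale_a * scale_b * np.einsum("bi,bi->b", qa, qb)))

assert np.isclose(ref, blocked, rtol=1e-4)
```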

 

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Signed-off-by: Liqun Fu <liqun_fu@hotmail.com>
2024-08-02 10:20:22 -07:00
Atanas Dimitrov d0a6f57d74
Add reduce kernels for bigger types (#21490) 2024-08-01 12:21:16 -07:00
Wanming Lin 8c2ee7b32e
[WebNN EP] Create MLGraphBuilder for every model builder (#21514)
Currently the WebNN spec only allows MLGraphBuilder.build() to be called
once, so we need to create a new builder for every subgraph in the WebNN EP.

Spec change: https://github.com/webmachinelearning/webnn/pull/717
2024-08-01 09:15:31 -07:00
dependabot[bot] 3b73ef2bf7
Bump torch from 1.13.1 to 2.2.0 in /tools/ci_build/github/windows/eager (#21505)
Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.1 to 2.2.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/pytorch/pytorch/releases">torch's
releases</a>.</em></p>
<blockquote>
<h2>PyTorch 2.2: FlashAttention-v2, AOTInductor</h2>
<h1>PyTorch 2.2 Release Notes</h1>
<ul>
<li>Highlights</li>
<li>Backwards Incompatible Changes</li>
<li>Deprecations</li>
<li>New Features</li>
<li>Improvements</li>
<li>Bug fixes</li>
<li>Performance</li>
<li>Documentation</li>
</ul>
<h1>Highlights</h1>
<p>We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2
offers ~2x performance improvements to
<code>scaled_dot_product_attention</code> via FlashAttention-v2
integration, as well as AOTInductor, a new ahead-of-time compilation and
deployment tool built for non-python server-side deployments.</p>
<p>This release also includes improved torch.compile support for
Optimizers, a number of new inductor optimizations, and a new logging
mechanism called TORCH_LOGS.</p>
<p><strong>Please note that we are <a
href="https://redirect.github.com/pytorch/pytorch/issues/114602">deprecating
macOS x86 support</a>, and PyTorch 2.2.x will be the last version that
supports macOS x64.</strong></p>
<p>Along with 2.2, we are also releasing a series of updates to the
PyTorch domain libraries. More details can be found in the library
updates blog.</p>
<p>This release is composed of 3,628 commits and 521 contributors since
PyTorch 2.1. We want to sincerely thank our dedicated community for your
contributions. As always, we encourage you to try these out and report
any issues as we improve 2.2. More information about how to get started
with the PyTorch 2-series can be found at our <a
href="https://pytorch.org/get-started/pytorch-2.0/">Getting Started</a>
page.</p>
<p>Summary:</p>
<ul>
<li><code>scaled_dot_product_attention</code> (SDPA) now supports
FlashAttention-2, yielding around 2x speedups compared to previous
versions.</li>
<li>PyTorch 2.2 introduces a new ahead-of-time extension of
TorchInductor called AOTInductor, designed to compile and deploy PyTorch
programs for non-python server-side.</li>
<li><code>torch.distributed</code> supports a new abstraction for
initializing and representing ProcessGroups called device_mesh.</li>
<li>PyTorch 2.2 ships a standardized, configurable logging mechanism
called TORCH_LOGS.</li>
<li>A number of torch.compile improvements are included in PyTorch 2.2,
including improved support for compiling Optimizers and improved
TorchInductor fusion and layout optimizations.</li>
<li>Please note that we are deprecating macOS x86 support, and PyTorch
2.2.x will be the last version that supports macOS x64.</li>
<li><code>torch.ao.quantization</code> now offers a prototype
<code>torch.export</code> based flow</li>
</ul>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="8ac9b20d4b"><code>8ac9b20</code></a>
Run docker release build on final tag (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117131">#117131</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/117182">#117182</a>)</li>
<li><a
href="2490352430"><code>2490352</code></a>
Fix cuInit test on Windows (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117095">#117095</a>)</li>
<li><a
href="3a44bb713f"><code>3a44bb7</code></a>
[CI] Test that cuInit is not called during import (<a
href="https://redirect.github.com/pytorch/pytorch/issues/117043">#117043</a>)</li>
<li><a
href="1c8ba3847d"><code>1c8ba38</code></a>
[CI] Use jemalloc for CUDA builds (<a
href="https://redirect.github.com/pytorch/pytorch/issues/116900">#116900</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116988">#116988</a>)</li>
<li><a
href="96d2ddbafe"><code>96d2ddb</code></a>
Store user model to simplify
ONNXProgram.{adapt_torch_*,<strong>call</strong>} APIs (<a
href="https://redirect.github.com/pytorch/pytorch/issues/1152">#1152</a>...</li>
<li><a
href="738b4a560a"><code>738b4a5</code></a>
Update ONNX's IO Adapter to support FakeTensor with ExportedProgram (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114407">#114407</a>)...</li>
<li><a
href="4cf10bf4dc"><code>4cf10bf</code></a>
[Cherry-pick] [Quant] [PT2] Enable batchnorm in
_move_exported_model_to_eval ...</li>
<li><a
href="7e97e4b4b6"><code>7e97e4b</code></a>
[AARCH64] Fall back to GEMM if mkldnn_matmul fails (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115936">#115936</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116666">#116666</a>)</li>
<li><a
href="1a3e3c7cff"><code>1a3e3c7</code></a>
[CUDA] baddmm should fall back to addmm for batch=1 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/114992">#114992</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116518">#116518</a>)</li>
<li><a
href="ab7505f78c"><code>ab7505f</code></a>
Fix broken PyYAML 6.0 on MacOS x86 (<a
href="https://redirect.github.com/pytorch/pytorch/issues/115956">#115956</a>)
(<a
href="https://redirect.github.com/pytorch/pytorch/issues/116551">#116551</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/pytorch/pytorch/compare/v1.13.1...v2.2.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=torch&package-manager=pip&previous-version=1.13.1&new-version=2.2.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-08-01 04:28:43 -07:00
Changming Sun 25722bb9e3
Add CUDA custom op header files to Linux tarball (#21551)
### Description
The header files were added in PR #16454.
Then, recently I made PR #21464, which changed how we pack Linux
tarballs.
The new tarball is missing the custom op header files.
Therefore I need to make this change.


2024-08-01 04:23:02 -07:00
Adrian Lizarraga 4b8f6dcbb6
[QNN EP] Improve INT4 accuracy (#21582)
### Description
Masks off the top 4 bits of INT4 weights, improving accuracy.



### Motivation and Context
This is a workaround as the QNN docs state masking is not required.
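
For illustration only, here is what masking the upper 4 bits of a packed INT4 buffer looks like; the packing convention (two weights per byte, low nibble first) is an assumption, not necessarily what QNN EP uses.

```python
import numpy as np

# Two INT4 weights are commonly packed into one byte.
packed = np.array([0xAB, 0x7F, 0x90], dtype=np.uint8)

low = packed & 0x0F          # keep only the low nibble
high = (packed >> 4) & 0x0F  # the other weight lives in the upper nibble

print(low)   # [11 15  0]
print(high)  # [10  7  9]
```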
2024-07-31 21:05:11 -07:00
Jing Fang 8540ac4f78
Fix quant_format argument for 4bit quantizer (#21581)
### Description
The original argument accepts the enum QuantFormat.QOperator or QuantFormat.QDQ,
but the default value is QOperator.

Change the argument to str so it accepts "QOperator" or "QDQ" and convert it to
QuantFormat after parsing.
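
A minimal sketch of such a string-to-enum conversion (not necessarily the exact code added in this PR):

```python
from onnxruntime.quantization import QuantFormat

def parse_quant_format(value: str) -> QuantFormat:
    """Map the command-line string to the QuantFormat enum."""
    formats = {"QOperator": QuantFormat.QOperator, "QDQ": QuantFormat.QDQ}
    if value not in formats:
        raise ValueError(f"quant_format must be one of {list(formats)}, got {value!r}")
    return formats[value]

print(parse_quant_format("QDQ"))
```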

### Motivation and Context
Bug fix
2024-07-31 15:30:33 -07:00
Wanming Lin a3883af7bf
[WebNN EP] Fixed bug in ConvTranspose (#21569)
The constraint for ConvTranspose was placed in the wrong place.
2024-07-31 14:39:21 -07:00
Tianlei Wu c5f8389648
[CUDA] Fix MultiHeadAttention thread safe and bias support (#21498)
### Description

#### Issues Fixed
(1) **TRT cross attention is not thread safe**. [Core changes like
this](6fd7aba3d4)
are used to make it thread-safe:
* Add a once_flag to CumulatedSequenceLengthCache to make sure it is
only initialized once, and change the cache to be read-only after
initialization. Previously, the content was not read-only, so it might be
changed by another thread and potentially cause a buffer overrun.
* The kernel initialization is not guarded (although the kernel-loading
factory has a static mutex to guard multiple threads), so the
mutable variable might be set by two different threads at the same time.
Add a once_flag to avoid that.

This requires some workspace computation changes as well, so I did
not create a separate pull request.

(2) **Bias for cross attention**

That scenario assumes that only the query has bias, but not the key
and value. However, such an assumption is not verified at runtime, there
was no comment documenting the assumption, and there was no test case, so
support for the scenario was disabled by mistake. Actually, the scenario is
used in the Whisper model (TODO: we shall add tests for Whisper to the CI
pipeline, and also update the fusion script to verify such assumptions if needed).

The CUDA/CPU kernels support bias for cross attention as long as the bias is
zero for key and value. I updated the check to support the scenario and
added comments wherever there is such an assumption.

(3) **Fallback support**

Previously, the unfused kernel did not support the packed qkv and packed kv
formats. That means some cases might fail since there is no fallback. I
added new AddBiasTranspose cuda kernels for them to support fallback, so
that all supported cases will not fail.

#### Improvements

(4) **QKV workspace size**.

The logic for no_qkv_workspace could easily get out of sync since the related
code was scattered across different source files. I refactored the code to
move all related code into one file (attention_prepare_qkv.cu) and added
asserts, so that the logic stays in sync.

(5) **Remove the confusing concept of pass past in kv**

parameters.pass_past_in_kv is confusing since the k/v in cross attention
is not past state. Remove it and use parameters.qkv_format ==
Q_K_V_BSNH_BNSH_BNSH instead.

The new code does not use past_key/past_value for cross attention, so the
logic is clearer.

(6) **More coverage, less workspace, and fewer transposes for flash and
efficient attention**
Previously, there was one condition that did not run flash or efficient
attention:
```
 bool past_no_bias = (pass_key_value_as_past || past_key != nullptr || present_key != nullptr) && bias == nullptr;
```
After this change, we can use flash and efficient attention for that
case, and also use less workspace.

For example, for cross attention with bias, the original code uses two
additional workspaces:
```
  transpose: past_key (BxNxSxH) => temp_k_workspace (BxSxNxH), past_value (BxNxSxH_v) => temp_v_workspace (BxSxNxH_v)
  add bias: query => q,   temp_k_workspace => k,   temp_v_workspace => v
```

The new logic is like:
```
   if (has bias)
      Add bias to query, key, value, and store in q, k, v workspace
   else
      Use query, key and value directly as q, k and v in kernel
```

We can see that we do not need to allocate temp_k_workspace and
temp_v_workspace, so we use less memory. The new code saves two transposes in
this case.

Flash and efficient attention support BSNH or BNSH formats for k and v.
In the old code, k/v were always converted to BSNH format, which is not
always necessary. I made changes to convert k/v to BSNH or BNSH case by
case, so that more cases can be covered by flash or efficient
attention to improve performance.
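
For reference, BSNH and BNSH differ only by a transpose of the sequence and head axes; a tiny numpy sketch (shapes are made up):

```python
import numpy as np

batch, seq, num_heads, head_size = 2, 128, 8, 64
k_bsnh = np.zeros((batch, seq, num_heads, head_size), dtype=np.float32)

# BSNH -> BNSH: swap the sequence and head axes
k_bnsh = np.transpose(k_bsnh, (0, 2, 1, 3))
assert k_bnsh.shape == (batch, num_heads, seq, head_size)
```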

(7) **Debugging support**
Previously, there was little debug info. In this change, I add a flag for
debug info in the AttentionData so that we can output debug info during
processing.

Also add functions to consolidate the dumping of inputs, QKV processing
and outputs, and add an environment variable `ORT_ENABLE_GPU_DUMP` to allow
disabling dumping from the cuda kernel.

#### Summary of changes
(1) Refactor CheckInputs and pass in the operator type.
(2) Refactor PrepareQKV to support fallback for packed qkv or
packed kv inputs.
(3) Change a few cases of PrepareQKV to allow more cases to be covered by
flash and efficient attention.
(4) Use parameters.qkv_format == Q_K_V_BSNH_BNSH_BNSH to replace
parameters.pass_past_in_kv.
(5) Allow bias input for Q_K_V_BSNH_BNSH_BNSH, and add comments about the
assumption that key/value have no bias in this case.
(6) Fix the thread-safety issue in CumulatedSequenceLengthCache handling.
(7) Add test cases to cover all supported scenarios.

Current support scenarios for MultiHeadAttention for CUDA/CPU:

| Q | K | V | pastK | pastV | presentK | presentV | Bias | Op desc |
| ---- | ---- | ---- | ------ | ----- | --------- | -------- | ----- | --------- |
| BSNH | BLNH | BLNH | - | - | - | - | QKV | not packed |
| BLN3H | - | - | - | - | - | - | QKV | qkv packed <br> not supported on CPU |
| BSNH | BLN2H | - | - | - | - | - | --- | kv packed <br> not supported on CPU |
| BSNH | BNLH | BNLH | - | - | - | - | Q-- | cross attention <br> bias for Q only |
| BSNH | BLNH | BLNH | - | - | BNTH | BNTH | QKV | no past <br> only present |
| BSNH | BLNH | BLNH | BNPH | BNPH | BNTH | BNTH | QKV | past and present <br> (not share buffer) |

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/18854
2024-07-31 09:01:05 -07:00
Sheil Kumar b341c44c20
Fix ETW trace logging crash in multithreading situations (#21566)
### Description
The ETW trace logger is falsely reported as registered because initialized_
is marked as true before the registration is done, causing a crash in the
Lenovo camera application.

A prior attempt to address this was made here:
https://github.com/microsoft/onnxruntime/pull/21226
It was reverted here:
https://github.com/microsoft/onnxruntime/pull/21360

### Motivation and Context
The problem is that during initialization of TraceLoggingRegisterEx, it
will reinvoke the callback and attempt reinitialization, which is not
allowed. TraceLoggingRegisterEx however can be initialized concurrently
when initialization happens on multiple threads. For these reasons it
needs to be protected by a lock, but the lock cannot naively block
because the callback's reinvocation will cause a deadlock.

To solve this problem, another tracking variable is added: "initializing",
which protects against reinitialization during the first
initialization.

---------

Co-authored-by: Sheil Kumar <sheilk@microsoft.com>
2024-07-31 08:59:55 -07:00
Wanming Lin 1d4b161145
[WebNN EP] Support ConvTranspose for TFLite backend (#21291)
### Description
Chromium supports ConvTranspose for TFLite in
https://chromium-review.googlesource.com/c/chromium/src/+/5635194

With the constraint that only default dilations and groups are supported.

---------

Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
2024-07-30 17:46:08 -07:00
Jing Fang e7aa11607f
Utilize ext data location to reduce qd matmul memory usage (#21451)
### Description

When the graph is quantized to qdq format, the DQ + MatMul is
transformed to MatMulNBits in the level 2 optimizer when the model is
initialized in an inference session.

In the transformation step, tensors are transposed and new tensor protos
are created. Instead of using protobuf arena-allocated memory, the PR
sets the tensor proto to use an external buffer and points the external
location to the memory location containing the tensor buffer allocated
on the CPU.

Then, in the step that creates OrtValue using the tensor proto, the
memory buffers in the tensor proto are directly assigned to the tensors
which were originally allocated by Ort Arena.

With these two steps, the peak memory usage of the QDQ format model is the
same as that of the QOperator model. In addition, the model initialization
time is significantly reduced. Take
[Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
for example:
| | QOperator Model (MatMulNBits) | QDQ Model (DQ + MatMul, original code) | QDQ Model (this PR) |
|---|---|---|---|
| peak memory consumption | 2.8 GB | ~4.8 GB | 2.8 GB |
| initialization time | 3 sec | 9 sec | 5 sec |

### Motivation and Context

When the graph is quantized to qdq format, the DQ + MatMul is converted
to MatMulNBits in the level 2 optimizer.

Originally, the newly created tensor protos use memory allocated by the
protobuf arena. That memory cannot be fully released when the
tensor protos are deleted.
Then, in the tensor proto to OrtValue step, tensors are created using the
ORT arena. Later, in the pre-pack step for MatMulNBits, new OrtValues
are created, and the tensors in the ORT arena are not fully released
either.

The two arena memory allocation steps in the DQ + MatMul -> MatMulNBits
transformation result in almost 2x memory consumption during model
initialization.
2024-07-30 15:22:46 -07:00
Sumit Agarwal 1637f22d39
Extend Pad Fusion for AveragePool (#21556)
### Description
This extends the existing pad_fusion to the AveragePool operator, i.e. fuse
Pad if it is followed by an AveragePool operator.



2024-07-30 09:35:45 -07:00
Yi-Hong Lyu 530a2d7b41
Enable FP16 Clip and Handle Bias in FP16 Depthwise Conv (#21493)
- Improved accuracy for face-detection, image-classification, and
object-detection in the GeekBench ML benchmark on ARM64.
- Fixed issue https://github.com/microsoft/onnxruntime/issues/18992
2024-07-30 03:49:14 -07:00
Changming Sun 82036b0497
Remove references to the outdated CUDA EP factory method (#21549)
The function "OrtSessionOptionsAppendExecutionProvider_CUDA" is
deprecated.
2024-07-29 21:59:16 -07:00
vraspar 07d3be5b0e
CoreML: Add ML Program Split Op (#21456)
### Description

Add support for Split Op


### Motivation and Context
Address operator gaps in high priority model.

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-07-30 14:04:47 +10:00
Yifan Li 5d78b9a17b
[TensorRT EP] Update TRT OSS Parser to 10.2 (#21552)
### Description
Update TRT OSS Parser to [latest 10.2-GA
branch](f161f95883)


2024-07-29 17:27:38 -07:00
mcollinswisc 8417c325ec
Keep QDQ nodes w/ nonpositive scale around MaxPool (#21182)
### Description
This change adds a check for whether the scale in the QuantizeLinear (or
DequantizeLinear) is a positive scalar, and a new selector to disallow
removing the QDQ around MaxPool if it is not.

### Motivation and Context
Currently, the DropQDQNodesRules optimization removes QuantizeLinear and
DequantizeLinear nodes from DequantizeLinear ∘ MaxPool ∘ QuantizeLinear.
However, if the x_scale/y_scale values are non-positive, the
(de-)quantization changes the ordering of the elements in the input
value, so this optimization is changing the results.
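
A small numpy illustration of why this matters (values are made up): with a negative scale, ordering is reversed in the quantized domain, so taking the max before versus after quantization picks different elements.

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0], dtype=np.float32)
scale, zero_point = -0.1, 0  # non-positive scale, the case this PR guards against

q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

print(np.argmax(x))  # 2 -> the max of the original float values
print(np.argmax(q))  # 0 -> negative scale reverses the ordering after quantization
```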


https://github.com/microsoft/onnxruntime/issues/21176

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-07-30 09:06:51 +10:00
Sophie Schoenmeyer d98581495f
Update labeling bot (#21548)
The current labeling bot over-applies many of the labels (e.g., ep:CUDA and
platform:windows) and is missing some of the APIs + EPs.

We are working on migrating this workflow to GitHub policies but would like
to use this fix in the meantime to avoid causing any issues with ORT 1.19.

2024-07-29 16:06:03 -07:00
Adam Reeve 7543dd040b
Propagate NaNs in the CPU min and max operators (#21492)
### Description

Propagates NaN values in the min and max operators so that min or max
with a NaN in either input always produces NaN.
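
The described behaviour matches numpy's `minimum`/`maximum` (as opposed to `fmin`/`fmax`, which ignore NaN); a quick sketch:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0], dtype=np.float32)
b = np.array([2.0, 2.0, np.nan], dtype=np.float32)

print(np.minimum(a, b))  # [ 1. nan nan] -- NaN from either input propagates
print(np.maximum(a, b))  # [ 2. nan nan]
print(np.fmin(a, b))     # [ 1.  2.  3.] -- fmin/fmax ignore NaN instead
```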

### Motivation and Context

Fixes #21455
2024-07-30 08:50:13 +10:00
Preetha Veeramalai c39f1c4fd8
ORT- OVEP 1.19 PR-follow up (#21546)
### Description
Follow-up PR for bug fixes on 1.19.


### Motivation and Context

- Handles 1.19 docker file fixes.
- Sets the default file naming of the epctx onnx model to use the _ctx.onnx
suffix.
- Creates the epctx model directories if they don't exist.

---------

Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
2024-07-29 14:12:36 -07:00
Yulong Wang b03c9496aa
[js/web] allow load WebAssembly binary from buffer (#21534)
### Description

This PR adds a new option `ort.env.wasm.wasmBinary`, which allows the user
to set a buffer containing preloaded .wasm file content.

This PR should resolve the problem from the latest discussion in #20876.
2024-07-29 13:39:38 -07:00
Xu Xing 0d7cf301a1
[js/webgpu] Add activation Tanh (#21540)
Bug: https://github.com/microsoft/onnxruntime/issues/21467

2024-07-29 11:05:34 -07:00
Jian Chen 79537d0523
Remove tools/ci_build/github/android/run_nnapi_code_coverage.sh (#21371)
### Description
Remove tools/ci_build/github/android/run_nnapi_code_coverage.sh

### Motivation and Context
This file is no longer needed
2024-07-29 10:00:52 -07:00
Jian Chen bc3713206d
Update QNN pipeline pool (#21482)
### Description
Update QNN pipeline pool 



### Motivation and Context
Ensure all our pipelines use the latest NDK version.
2024-07-29 10:00:21 -07:00
Yi Zhang 05cef469e8
Move on-device training packages publish step (#21539)
### Description
Since the on-device training CPU packaging has become a separate
pipeline, its nuget package publishing step must be moved as well.

### Motivation and Context
Fixes the exception in Nuget Publishing Packaging Pipeline caused by
#21485
2024-07-29 09:59:46 -07:00
mingyueliuh d8888136e3
Add support tensor element type for register custom op shape infer function (#21387)
### Description
Functionality extension for the SetOutputShape method in custom op shape inference.


### Motivation and Context
- **SetOutputShape** interface enhancement: the shape inference function actually needs to set the tensor type as well as the shape. Add a parameter **type** to allow users to specify the tensor type, and set **ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT** as the default value to ensure compatibility.

Co-authored-by: mingyue <mingyue@amd.com>
2024-07-29 09:45:52 -07:00
Wanming Lin 94eb70d983
[WebNN EP] Add labels for all WebNN operators (#21516)
In order to provide more diagnosable error messages for developers.

Spec change: https://github.com/webmachinelearning/webnn/pull/742
2024-07-29 08:50:14 -07:00
Xu Xing 5bc12bf209
[js/webgpu] Add activation for conv3d naive (#21466)
2024-07-29 08:47:41 -07:00
Yulong Wang dbff0cd098
[js/node] enable float16 support for Node.js binding (#20581)
### Description
Enable float16 support for the Node.js binding.

The data of a float16 tensor uses `Uint16Array`.
2024-07-28 13:03:17 -07:00
liqun Fu a4d3a1ce0c
pick changes from https://github.com/onnx/onnx/pull/6195 to fix heap-buffer-overflow in onnx::convPoolShapeInference (#21507)
### Description
onnx 1.16.2 is not available before the ort 1.19.0 code freeze, thus the
needed change is picked up as a patch.
2024-07-27 15:58:36 -07:00
Jian Chen 7e23212de9
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml (#21529)
### Description
Delete tools/ci_build/github/azure-pipelines/win-gpu-ci-pipeline.yml


### Motivation and Context
This CI pipeline has been divided into 4 different pipelines.
2024-07-27 15:58:12 -07:00
Ranjit Ranjan 82b2955268
[AIX]test failure fix using gtest-1.15.0 for AIX (#21497)
### Description
The local CI setup for AIX reported test failures after the gtest 1.15.0
upgrade.

### Motivation and Context
The following test failures were observed after the gtest upgrade.

The following tests FAILED:
	  1 - onnxruntime_test_all (ILLEGAL)
	  7 - onnxruntime_logging_apis_test (Subprocess aborted)

To fix this, I am enabling pthread support under gtest. This was
disabled with the previous version of gtest for some reason.
By enabling it, the above tests now pass with gtest 1.15.0.
2024-07-27 11:17:22 -07:00
jingyanwangms 48fb8a7e56
Security fuzz address sanitizer fix Bug #2 and #3 (#21528)
### Description
Security fuzz test with address sanitizer found several bugs
2024-07-27 11:10:52 -07:00
dependabot[bot] 1ce160883f
Bump Sixlabors.ImageSharp from 2.1.8 to 2.1.9 in /csharp/sample/Microsoft.ML.OnnxRuntime.ResNet50v2Sample (#21444)
Bumps [Sixlabors.ImageSharp](https://github.com/SixLabors/ImageSharp)
from 2.1.8 to 2.1.9.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/SixLabors/ImageSharp/releases">Sixlabors.ImageSharp's
releases</a>.</em></p>
<blockquote>
<h2>v2.1.9</h2>
<h2>What's Changed</h2>
<ul>
<li>[2.1] Fix overflow in MemoryAllocator.Create(options) by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2732">SixLabors/ImageSharp#2732</a></li>
<li>Backport GIF LZW fix to 2.1 by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2756">SixLabors/ImageSharp#2756</a></li>
<li>Backport 2759 to 2.1.x by <a
href="https://github.com/antonfirsov"><code>@​antonfirsov</code></a> in
<a
href="https://redirect.github.com/SixLabors/ImageSharp/pull/2770">SixLabors/ImageSharp#2770</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9">https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="9816ca4501"><code>9816ca4</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2770">#2770</a>
from SixLabors/af/backport-2759-2.1.x</li>
<li><a
href="b33d666ab7"><code>b33d666</code></a>
handle DecodingMode</li>
<li><a
href="6b2030b549"><code>6b2030b</code></a>
Merge branch 'release/2.1.x' into af/backport-2759-2.1.x</li>
<li><a
href="8ffad3f480"><code>8ffad3f</code></a>
Issue2012BadMinCode should decode now</li>
<li><a
href="1f5bf23b9e"><code>1f5bf23</code></a>
skip Issue2758_DecodeWorks</li>
<li><a
href="3bf8c572a0"><code>3bf8c57</code></a>
manual port of 3.1 gif decoder</li>
<li><a
href="28c20ded87"><code>28c20de</code></a>
Clamp JPEG quality estimation results.</li>
<li><a
href="4b910e7f84"><code>4b910e7</code></a>
Decode LZW row by row</li>
<li><a
href="a1f2879771"><code>a1f2879</code></a>
Merge pull request <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2756">#2756</a>
from SixLabors/af/git-av-2.1</li>
<li><a
href="898df7f8ca"><code>898df7f</code></a>
backport <a
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2749">#2749</a>
to 2.1</li>
<li>Additional commits viewable in <a
href="https://github.com/SixLabors/ImageSharp/compare/v2.1.8...v2.1.9">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Sixlabors.ImageSharp&package-manager=nuget&previous-version=2.1.8&new-version=2.1.9)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-26 22:31:16 -07:00
maggie1059 10b4a3b90b
Fix conda failure for onnxruntime-directml (#21526)
The change in #21005 works for directly building wheels with `build.py`,
but ort-nightly-directml wheels, as well as the 1.18.1 release of the
onnxruntime-directml python wheel, still do not work with conda since
they're built from the `py-win-gpu.yml` pipeline, which uses
`install_third_party_deps.ps1` to set compile flags.
2024-07-26 22:26:38 -07:00
Yueqing Zhang d01fc75ef1
[VitisAI] support vaip create ep context nodes & bug fix (#21506)
### Description
1. We decided to move the context node creation back to our own repo because it is more flexible to modify.
2. We found a bug related to the context node: it would change the inference order. So we fixed that in this PR as well.


### Motivation and Context
This is crucial for the Microsoft release next month.

---------

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
2024-07-26 22:15:57 -07:00
zz002 690d745cbf
[VitisAI] 1. KernelDef supports StartVersion and EndVersion (#21519)
### Description

[VitisAI] 1. KernelDef supports StartVersion and EndVersion
2. CapabilityOps checks domain


Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-07-26 20:28:55 -07:00
Scott McKay 5af423c7c0
Set version and other info in the C# dll (#21517)
### Description
Set version and other info in the Microsoft.ML.OnnxRuntime C# dll by
setting GenerateAssemblyInfo to true and passing in ORT version in the
CI.

Minor re-org of the order of properties so related things are grouped a
little better.

### Motivation and Context
#21475
2024-07-27 13:22:57 +10:00
Tianlei Wu 64819f6f8c
Update benchmark_mha.py to compare with PyTorch SDPA (#21449)
### Description
* Update benchmark_mha.py to compare with PyTorch SDPA api.
* Write results to csv file.
* Use sdpa_kernel cuda provider option instead of environment variables
for better control.
* Add arguments (`--use_gpu`, `--causal`, etc.) to allow testing different
scenarios.
* Update benchmark_mha.sh to add cpu benchmarks

For the Q,K,V format, torch uses the BNSH layout while ort uses BSNH, so
the comparison is not apples-to-apples. However, if the latency difference is
large, that could be a warning.
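
For context, a rough sketch of the PyTorch side of such a measurement (not the actual benchmark_mha.py code; shapes match one of the GPU rows below):

```python
import time
import torch
import torch.nn.functional as F

batch, num_heads, seq_len, head_size = 4, 32, 2048, 128
# PyTorch SDPA expects the BNSH layout.
q = torch.randn(batch, num_heads, seq_len, head_size, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

for _ in range(10):  # warm up
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()
print(f"average latency: {(time.time() - start) / 100:.4f} s")
```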

#### Example GPU results

Example results on A100-SXM4-80GB with settings (use_gpu=TRUE,
enable_cuda_graph=FALSE, causal=FALSE, past_sequence_length=0,
intra_op_num_threads=0) in Azure Linux. ORT: build from source with CUDA
12.5; PyTorch 2.3.1 for cuda 12.1.

format | batch_size | sequence_length | num_heads | head_size | latency (s) | tflops | kernel
-- | -- | -- | -- | -- | -- | -- | --
Q,KV | 4 | 2048 | 32 | 128 | 0.0015 | 179.5 | ort:flash
Q,KV | 4 | 2048 | 32 | 128 | 0.0015 | 179.0 | ort:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0016 | 170.0 | ort:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0016 | 169.5 | ort:flash
QKV | 4 | 2048 | 32 | 128 | 0.0016 | 168.5 | ort:default
QKV | 4 | 2048 | 32 | 128 | 0.0016 | 167.4 | ort:flash
Q,K,V | 4 | 2048 | 32 | 128 | 0.0017 | 159.4 | torch:default
Q,K,V | 4 | 2048 | 32 | 128 | 0.0018 | 155.0 | torch:flash
Q,KV | 4 | 2048 | 32 | 128 | 0.0030 | 92.7 | ort:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0030 | 90.9 | ort:efficient
QKV | 4 | 2048 | 32 | 128 | 0.0031 | 89.9 | ort:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0031 | 89.0 | torch:efficient
Q,K,V | 4 | 2048 | 32 | 128 | 0.0054 | 51.3 | torch:math
Q,KV | 4 | 4096 | 32 | 128 | 0.0058 | 191.0 | ort:default
Q,KV | 4 | 4096 | 32 | 128 | 0.0058 | 190.6 | ort:flash
Q,K,V | 4 | 4096 | 32 | 128 | 0.0059 | 187.8 | ort:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0059 | 186.7 | ort:flash
QKV | 4 | 4096 | 32 | 128 | 0.0059 | 185.9 | ort:flash
QKV | 4 | 4096 | 32 | 128 | 0.0059 | 185.8 | ort:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0067 | 163.4 | torch:default
Q,K,V | 4 | 4096 | 32 | 128 | 0.0070 | 157.2 | torch:flash
Q,KV | 4 | 4096 | 32 | 128 | 0.0113 | 97.6 | ort:efficient
Q,K,V | 4 | 4096 | 32 | 128 | 0.0114 | 96.4 | ort:efficient
QKV | 4 | 4096 | 32 | 128 | 0.0114 | 96.2 | ort:efficient
Q,K,V | 4 | 4096 | 32 | 128 | 0.0127 | 86.3 | torch:efficient
Q,KV | 8 | 2048 | 32 | 128 | 0.0031 | 177.8 | ort:flash
Q,KV | 8 | 2048 | 32 | 128 | 0.0031 | 177.7 | ort:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0032 | 170.8 | ort:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0032 | 170.3 | ort:flash
QKV | 8 | 2048 | 32 | 128 | 0.0032 | 169.2 | ort:default
QKV | 8 | 2048 | 32 | 128 | 0.0033 | 169.0 | ort:flash
Q,K,V | 8 | 2048 | 32 | 128 | 0.0034 | 161.9 | torch:default
Q,K,V | 8 | 2048 | 32 | 128 | 0.0036 | 152.9 | torch:flash
Q,KV | 8 | 2048 | 32 | 128 | 0.0059 | 93.5 | ort:efficient
Q,K,V | 8 | 2048 | 32 | 128 | 0.0060 | 91.3 | ort:efficient
QKV | 8 | 2048 | 32 | 128 | 0.0060 | 91.0 | ort:efficient
Q,K,V | 8 | 2048 | 32 | 128 | 0.0064 | 86.0 | torch:efficient
Q,KV | 8 | 4096 | 32 | 128 | 0.0115 | 190.8 | ort:flash
Q,KV | 8 | 4096 | 32 | 128 | 0.0115 | 190.7 | ort:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0118 | 187.1 | ort:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0118 | 187.0 | ort:flash
QKV | 8 | 4096 | 32 | 128 | 0.0118 | 185.6 | ort:default
QKV | 8 | 4096 | 32 | 128 | 0.0118 | 185.6 | ort:flash
Q,K,V | 8 | 4096 | 32 | 128 | 0.0139 | 158.7 | torch:default
Q,K,V | 8 | 4096 | 32 | 128 | 0.0139 | 158.3 | torch:flash
Q,KV | 8 | 4096 | 32 | 128 | 0.0225 | 97.7 | ort:efficient
Q,K,V | 8 | 4096 | 32 | 128 | 0.0227 | 96.8 | ort:efficient
QKV | 8 | 4096 | 32 | 128 | 0.0228 | 96.3 | ort:efficient
Q,K,V | 8 | 4096 | 32 | 128 | 0.0260 | 84.5 | torch:efficient

#### Example CPU results

Dell XPS 8960 with i9-13900 CPU (use_gpu=FALSE, causal=FALSE,
past_sequence_length=0) in Windows. ORT: build from source with CUDA
12.5; PyTorch 2.3.1 for cuda 12.1.

format | causal | batch_size | seq_len | num_heads | head_size | threads | latency (s) | kernel
-- | -- | -- | -- | -- | -- | -- | -- | --
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 8 | 0.0005 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 0 | 0.0009 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 0 | 0.0009 | ort:math
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 4 | 0.0009 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 2 | 0.0014 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 1 | 0.0025 | ort:flash
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 2 | 0.0045 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 24 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 8 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 4 | 0.0046 | torch:default
Q,K,V | FALSE | 1 | 128 | 32 | 128 | 1 | 0.0047 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 0 | 0.0019 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 8 | 0.0019 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 0 | 0.0022 | ort:math
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 4 | 0.0030 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 2 | 0.0047 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 1 | 0.0086 | ort:flash
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 2 | 0.0161 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 4 | 0.0162 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 8 | 0.0162 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 24 | 0.0165 | torch:default
Q,K,V | FALSE | 1 | 256 | 32 | 128 | 1 | 0.0166 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 8 | 0.0077 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 0 | 0.0091 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 0 | 0.0099 | ort:math
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 4 | 0.0103 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 2 | 0.0177 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 1 | 0.0328 | ort:flash
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 2 | 0.0624 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 4 | 0.0624 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 8 | 0.0625 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 24 | 0.0626 | torch:default
Q,K,V | FALSE | 1 | 512 | 32 | 128 | 1 | 0.0640 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 8 | 0.0286 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 0 | 0.0317 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 4 | 0.0367 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 0 | 0.0391 | ort:math
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 2 | 0.0656 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 1 | 0.1235 | ort:flash
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 24 | 0.2482 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 2 | 0.2483 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 4 | 0.2483 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 8 | 0.2486 | torch:default
Q,K,V | FALSE | 1 | 1024 | 32 | 128 | 1 | 0.2538 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 0 | 0.1038 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 8 | 0.1050 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 0 | 0.1368 | ort:math
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 4 | 0.1535 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 2 | 0.2461 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 1 | 0.4724 | ort:flash
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 8 | 0.9835 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 4 | 0.9841 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 24 | 0.9841 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 2 | 0.9873 | torch:default
Q,K,V | FALSE | 1 | 2048 | 32 | 128 | 1 | 0.9985 | torch:default


### Motivation and Context
To compare latency with PyTorch SDPA on CPU and CUDA.
2024-07-26 18:45:14 -07:00
Hector Li fb61e14153
Add QNN EP option context_node_name_prefix to set EPContext node name prefix (#21236)
### Description
Add QNN EP option context_node_name_prefix to set EPContext node name prefix

### Motivation and Context
To work around the QNN context PD memory limit, the user needs to split the model into pieces and generate the QNN context model for each piece separately. It can happen that the EPContext nodes generated in the separate graphs have the same node name, which will cause issues when gluing those EPContext nodes together into a single model.
To avoid this, the user can set context_node_name_prefix for each split piece to make the node names unique.
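
A minimal sketch of configuring one split piece this way through the Python API (the model path, backend library, and prefix value are placeholders; the `ep.context_*` entries are the standard EPContext session options, and `context_node_name_prefix` is the QNN provider option added in this PR):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Generate an EPContext model for this split piece.
so.add_session_config_entry("ep.context_enable", "1")
so.add_session_config_entry("ep.context_file_path", "piece1_ctx.onnx")  # placeholder output path

qnn_options = {
    "backend_path": "QnnHtp.dll",           # placeholder backend library
    "context_node_name_prefix": "piece1_",  # option added in this PR
}

sess = ort.InferenceSession(
    "piece1.onnx",  # placeholder split-model path
    sess_options=so,
    providers=[("QNNExecutionProvider", qnn_options)],
)
```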
2024-07-26 16:56:44 -07:00
Jian Chen 7db7c4e5c8
Separating all GPU stages into different Pipelines (#21521)
### Description
Separating all GPU stages into different Pipelines
2024-07-26 14:54:45 -07:00
Justin Chu bbbaef3fa6
Update text formatting in generate_cgmanifest.py (#21489)
In the only place that I fixed manually, I forgot a format string.
2024-07-26 08:46:54 -07:00