### Description
Add a QNN EP option, `context_node_name_prefix`, to set a prefix for EPContext node names.
### Motivation and Context
To work around the QNN context PD memory limit, users need to split the model into pieces and generate a QNN context model for each piece separately. The EPContext nodes generated in the separate graphs can end up with the same node name, which causes issues when those EPContext nodes are glued together into a single model.
To avoid this, users can set context_node_name_prefix for each split piece to make the node names unique (see the sketch below).
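A minimal sketch of how the option might be set from Python when generating a context model for each split piece; the file names, backend library path, and the use of the generic EPContext session config entries are illustrative.
```python
import onnxruntime as ort

# Generate a QNN context model per split piece, giving each piece's EPContext
# nodes a unique name prefix (paths and backend library are illustrative).
for i, model_path in enumerate(["piece_0.onnx", "piece_1.onnx"]):
    so = ort.SessionOptions()
    so.add_session_config_entry("ep.context_enable", "1")
    so.add_session_config_entry("ep.context_file_path", f"piece_{i}_ctx.onnx")

    qnn_options = {
        "backend_path": "QnnHtp.dll",
        "context_node_name_prefix": f"piece{i}_",  # unique prefix per piece
    }
    ort.InferenceSession(model_path, sess_options=so,
                         providers=[("QNNExecutionProvider", qnn_options)])
```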
### Description
`enable_windows_arm64_qnn` and `enable_windows_x64_qnn` are true by
default but unnecessary for training. This change explicitly sets these
parameters to false for the training pipeline.
### Motivation and Context
ORT 1.19 Release Preparation
The WebNN spec recently changed the definition of argMax/argMin:
- Removed the selectLastIndex option, letting backends decide whether to select the last index.
- Moved the axes option to an axis input.
### Description
This PR registers the ReduceMin-20 operator to the DML EP.
### Motivation and Context
### Description
The current behavior forces all L2 optimizers to loop until they hit the maximum number of iterations.
Only update `modified` if the graph was actually modified (sketched below).
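An illustrative sketch of the intended loop behavior (not the actual ORT C++ code): the level-2 pass should stop as soon as a full iteration leaves the graph unchanged.
```python
def apply_level2_optimizers(graph, transformers, max_iterations):
    for _ in range(max_iterations):
        modified = False
        for transformer in transformers:
            # Only record a modification when the transformer actually changed the graph.
            modified |= transformer.apply(graph)
        if not modified:
            break  # exit early instead of always looping to max_iterations
    return graph
```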
### Motivation and Context
Fix unnecessary loops of L2 optimizers during model loading.
### Description
Combine these changes in one PR to simplify check-in:
- Add Concat (#21423)
- Add DepthToSpace (#21426)
- Add LeakyRelu (#21453)
- Add test scripts (#21427)
- Add ability to set coreml flags from python (#21434)
Other changes:
- Updated partitioning utils to support dropping constant initializers from a ComputeCapability's inputs.
- We noticed that the list of inputs to the CoreML model was unexpectedly long because of this.
- We copy constant initializers into the CoreML model, so we don't need the originals; if they remain as inputs, ORT can't free them because they appear to be in use.
### Motivation and Context
### Description
The current failure is due to a version mismatch.
Use llvm-cov from the Android NDK instead of the system gcov so that the
version is correct.
Also comment out publishing to the Azure dashboard to simplify the
setup. The CI prints out the stats for review by developers.
### Motivation and Context
Fix CI pipeline
### Description
Right now our "Zip-Nuget-Java-Nodejs Packaging Pipeline" is too big. The OnDevice training part is independent of the others, so it can be split out. Then our NPM packaging pipeline will not depend on this training work.
### Motivation and Context
Similar to #21235
Also, this PR fixes a problem: the "NuGet_Test_Linux_Training_CPU" job downloads artifacts from "onnxruntime-linux-x64" to get the custom-op shared libraries, but it forgot to declare that it depends on "Linux_C_API_Packaging_CPU_x64", which produces that artifact. Such problems can be hard to find when a pipeline grows large.
### Description
* Swap CUDA versions 11.8/12.2 in the GPU CIs
* Set CUDA 12 as the default version in the yamls for publishing nuget/python/java GPU packages
* Suppress warnings-as-errors for flash_api.cc during the ORT Windows build
Update the Performance issue template so the "performance" label is automatically applied.
### Description
### Motivation and Context
### Description
We found that the text format could cause errors.
### Motivation and Context
Because the OS could change the string, we decided to save it as a binary file.
### Description
Add OVEP features for 1.19. This PR includes:
- Support for EpCtx (EPContext) via ORT session options for optimized performance (a usage sketch follows the list).
- Bug fixes.
- Support for OpenVINO 2024.3.
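A minimal sketch, assuming the generic ORT EPContext session config entries (`ep.context_enable`, `ep.context_file_path`, `ep.context_embed_mode`) and an illustrative OpenVINO `device_type`; the options OVEP honors may differ in detail.
```python
import onnxruntime as ort

so = ort.SessionOptions()
# Ask ORT to dump a precompiled EPContext model on the first run.
so.add_session_config_entry("ep.context_enable", "1")
so.add_session_config_entry("ep.context_file_path", "model_ctx.onnx")
so.add_session_config_entry("ep.context_embed_mode", "1")

sess = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=[("OpenVINOExecutionProvider", {"device_type": "NPU"})],
)
```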
---------
Co-authored-by: ubuntu <ubuntu@ubuntu-mtlp-118727.iind.intel.com>
Co-authored-by: vthaniel <vishnudas.thaniel.s@intel.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: saurabhkale17 <saurabh1.kale@intel.com>
Co-authored-by: Maheshkar <ankit.maheshkar@intel.com>
### Description
### Motivation and Context
### Description
- Extends the QDQPropagationTransformer to propagate DQs (forward)
across operators with multiple consumers (previously only supported 1
consumer).
- Adds Slice to the list of operators that the QDQPropagationTransformer
can propagate DQ/Q ops across.
- Supports QDQ propagation for opset 21.
- Correctly copies Q or DQ attributes when creating new nodes.
### Motivation and Context
The QDQPropagationTransformer fixes up QDQ node units for certain "data
movement" ops (e.g., Transpose) by inserting Q -> DQ sequences where
necessary. For example, the sequence `DQ -> Transpose -> Sigmoid` is
transformed to `DQ -> Transpose -> Q -> DQ -> Sigmoid`.
However, this fix-up does not currently support data movement ops with
multiple consumers, as in:
```
DQ -> Transpose --+--> Sigmoid ->
|
+--> Relu ->
|
+-> graph_output
```
With the updates in this PR, the above model can be transformed to:
```
DQ -> Transpose -> Q --+--> DQ -> Sigmoid ->
|
+--> DQ -> Relu ->
|
+--> DQ -> graph_output
```
This update allows QNN EP to support quantized models created with tools
that do not wrap data movement ops in Q/DQ ops.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
- Update pipelines to use QNN SDK 2.24 by default
- Update QNN_Nuget_Windows pipeline to build csharp solution without
mobile projects (fixes errors).
- Implement workaround for QNN 2.24 validation bug for LayerNorm ops
without an explicit bias input.
- Enable the Relu unit test, which now passes because Relu is no longer fused into QuantizeLinear for QNN EP.
- Fix bug where a negative quantization axis is not properly normalized
for per-channel int4 conv.
### Motivation and Context
Update the QNN SDK.
### Description
Before this change, copy_strip_binary.sh manually copied each file from ONNX Runtime's build folder to an artifact folder, which is hard to get right when dealing with symbolic links for shared libraries.
This PR changes the packaging pipelines to run "make install" first, before packaging the shared libraries.
### Motivation and Context
Recently, because of feature request #21281, we changed libonnxruntime.so's SONAME. Now every package that contains this shared library must also contain libonnxruntime.so.1, so we need to change the packaging scripts to include this file. Instead of manually constructing the symlink layout, using `make install` is much easier and more consistent because it is a standard way of making packages.
**Breaking change:**
After this change, our **inference** tarballs published to our GitHub release pages will no longer contain ORT **training** headers.
### Description
<!-- Describe your changes. -->
Add ML Program ConvTranspose.
- Some limitations to simplify the implementation for now.
- Some limitations due to flaky CoreML output.
Added support for non-contiguous MLMultiArray output, which we see in some unit tests when the CPU-only flag is not set (e.g., the innermost dim has a minimum size of 16 but the test output only has 8 values).
- Support only one non-contiguous dim to keep it simple.
- Manually tested, as we don't have a setup that can test Objective-C code.
- Test code is in model.mm and can be enabled via an ifdef if we need to validate any future changes.
### Motivation and Context
Address operator gaps in high priority model.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Massively improve the QNN error reporting by invoking
`QnnError_getMessage` and returning the error message.
### Motivation and Context
Example error message before this change:
```text
QNN SetupBackend failed Failed to create device. Error: 14001
```
After:
```text
QNN SetupBackend failed Failed to create device. Error: QNN_DEVICE_ERROR_INVALID_CONFIG: Invalid config values
```
This PR adds the missing pads and output shape calculation for
ConvTranspose.
Per the ONNX spec:
- If the output shape is explicitly provided, compute the pads.
- Otherwise, compute the output shape, as well as the pads if the auto_pad attribute is SAME_UPPER/SAME_LOWER.
A sketch of these calculations is shown below.
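A minimal Python sketch of these calculations for a single spatial dimension; the function name and the SAME_UPPER/SAME_LOWER pad-splitting convention follow my reading of the ONNX spec and are illustrative rather than the EP's actual code.
```python
def convtranspose_pads_and_output(in_size, kernel, stride, dilation,
                                  output_padding=0, output_shape=None,
                                  auto_pad="NOTSET"):
    """Return (pad_begin, pad_end, out_size) for one spatial dim of ConvTranspose."""
    eff_kernel = (kernel - 1) * dilation + 1

    def split(total, upper):
        # SAME_UPPER puts the extra padding at the end, SAME_LOWER at the beginning.
        begin = total // 2 if upper else total - total // 2
        return begin, total - begin

    if output_shape is not None:
        # Output shape explicitly provided: derive the total padding, then split it.
        total = stride * (in_size - 1) + output_padding + eff_kernel - output_shape
        begin, end = split(max(total, 0), auto_pad == "SAME_UPPER")
        return begin, end, output_shape

    if auto_pad in ("SAME_UPPER", "SAME_LOWER"):
        # SAME_*: output size is in_size * stride; compute pads that achieve it.
        out_size = in_size * stride
        total = stride * (in_size - 1) + output_padding + eff_kernel - out_size
        begin, end = split(max(total, 0), auto_pad == "SAME_UPPER")
        return begin, end, out_size

    # NOTSET/VALID: explicit pads (zero here for illustration) determine the output size.
    begin = end = 0
    out_size = stride * (in_size - 1) + output_padding + eff_kernel - begin - end
    return begin, end, out_size
```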
### Description
Add GridSample ML Program support.
One combination of inputs has diffs between the PyTorch-generated unit test data and CoreML. Disabling it until needed, as the investigation may take a while.
### Motivation and Context
High priority models.
* Fix fallback setting (cuda still falls back to cuda).
* Fix inconsistent CUDA provider fallback behavior with/without the CUDA_PATH environment variable.
* Add CUDA and cuDNN major version requirements to the error message.
Example result in Windows:
```
>>> import onnxruntime
>>> ort_session = onnxruntime.InferenceSession("model.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
2024-07-19 17:43:44.2260019 [E:onnxruntime:Default, provider_bridge_ort.cc:1972 onnxruntime::TryGetProviderInfo_CUDA] D:\onnxruntime\onnxruntime\core\session\provider_bridge_ort.cc:1636 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\.conda\envs\py310\lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll"
2024-07-19 17:43:44.2312351 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:970 onnxruntime::python::CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*, and the latest MSVC runtime. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
>>> ort_session
<onnxruntime.capi.onnxruntime_inference_collection.InferenceSession object at 0x0000016BB2DF7D60>
>>> ort_session.get_providers()
['CPUExecutionProvider']
```
Example result in Linux:
```
>>> import onnxruntime
>>> ort_session = onnxruntime.InferenceSession("resnet50-v2-7.onnx", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
2024-07-20 20:33:26.486974543 [E:onnxruntime:Default, provider_bridge_ort.cc:1972 TryGetProviderInfo_CUDA] /work/onnxruntime/onnxruntime/core/session/provider_bridge_ort.cc:1636 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.12: cannot open shared object file: No such file or directory
2024-07-20 20:33:26.487034646 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:961 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Require cuDNN 9.* and CUDA 12.*. Please install all dependencies as mentioned in the GPU requirements page (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements), make sure they're in the PATH, and that your GPU is supported.
>>> ort_session.get_providers()
['CPUExecutionProvider']
```
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/21424
Commit e5f18ba2c1 caused some nightly pipelines to fail. This PR fixes it.
The cause is that I recently changed our Linux library's SONAME. At runtime, onnxruntime_binding depends on libonnxruntime.so.1 instead of libonnxruntime.so.1.19.0 (the full version number), so we need to keep the libonnxruntime.so.1 symlink.
The packaging script tools/ci_build/github/js/pack-npm-packages.ps1 still needs to be updated; I will address it in another PR.
1. Update google benchmark from 1.8.3 to 1.8.5
2. Update google test from commit in main branch to tag 1.15.0
3. Update pybind11 from 2.12.0 to 2.13.1
4. Update pytorch cpuinfo to include support for Arm Neoverse V2, Cortex-X4, Cortex-A720 and Cortex-A520.
5. Update re2 from 2024-05-01 to 2024-07-02
6. Update cmake to 3.30.1
7. Update Linux docker images
8. Fix a warning in test/perftest/ort_test_session.cc:826:37: error:
implicit conversion loses integer precision: 'streamoff' (aka 'long
long') to 'const std::streamsize' (aka 'const long')
[-Werror,-Wshorten-64-to-32]
### Description
Add support for Slice
### Motivation and Context
High priority models.
### Description
Introduces an ATen fallback for
`torch.nn.functional.scaled_dot_product_attention`. This operator was
introduced in torch 2.0 and, since then, has had many updates including
the implementation of memory efficient attention for V100 machines. The
current torchscript exporter exports a subgraph for attention which does
not provide the same memory savings that PyTorch's memory efficient
attention kernel provides. Allowing fallback to the PyTorch ATen op for attention helps mitigate memory spike issues for models leveraging memory efficient attention (a minimal usage sketch follows).
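A minimal sketch of the scenario, using a hypothetical module for illustration: a model that calls `torch.nn.functional.scaled_dot_product_attention` is wrapped with ORTModule, and whether the ATen fallback is taken depends on the ORTModule export configuration.
```python
import torch
import torch.nn.functional as F
from onnxruntime.training.ortmodule import ORTModule


class TinyAttention(torch.nn.Module):  # hypothetical module for illustration
    def forward(self, q, k, v):
        # With the ATen fallback, this call can export as an ATen op instead of
        # a decomposed attention subgraph.
        return F.scaled_dot_product_attention(q, k, v)


model = ORTModule(TinyAttention().cuda())
q = k = v = torch.randn(2, 8, 128, 64, device="cuda", requires_grad=True)
out = model(q, k, v)
out.sum().backward()
```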
### Motivation and Context
Memory issues arose when integrating ONNX Runtime Training with AML
Stable Diffusion.
---------
Co-authored-by: root <prathikrao@microsoft.com>
### Description
Replace inline pip install with pip install from requirements*.txt
### Motivation and Context
So that CG (Component Governance) can recognize the dependencies.
### Dependency
- [x] https://github.com/microsoft/onnxruntime/pull/21085
### Description
The WebNN spec introduces a new option, `outputDataType`, for the `argMax` and `argMin` ops. Its default value is `int32`; we should explicitly set it to `int64` for the WebNN EP.
Spec CR: "Add outputDataType to argmin/argmax"
https://github.com/webmachinelearning/webnn/pull/730
### Description
- [x] Rewrite FusedMHARunnerFP16v2 to make it thread-safe.
- [x] Add multi-threading tests
Previously, the kernel parameters `params` were stored as a member of the MHA runner, which means different threads might change the params at the same time and impact other threads.
For example, if batch_size and seq_len were changed by another thread to larger values in setup(...), a buffer overrun might happen in run(...) because a kernel could read/write memory outside the range of the allocated buffers.
In the new implementation, I change the API and remove mutable member variables to make it thread-safe. Below is a summary of the change:
Before:
```
class FusedMHARunnerFP16v2::mhaImpl {
void setup(int seq_len, int batch_size) {
// change scalar params
}
void run(input, output) {
// change params for input and output pointers
// launch kernel using params
}
Fused_multihead_attention_params_v2 params; // mutable, not thread-safe
}
```
After:
```
class FusedMHARunnerFP16v2::FmhaImpl {
void setup(int seq_len, int batch_size, Fused_multihead_attention_params_v2& params) {
// change params
}
void run(params, input, output) {
// change params with input and output pointers
// launch kernel using params
}
}
```
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/18854
https://github.com/microsoft/onnxruntime/issues/21413
### Description
This is a partial change from
[fajin/qdqmatmulnbitstoolchain](https://github.com/microsoft/onnxruntime/pull/21180).
The original PR is blocked by Web CI failures.
MatMulNBits is a heavily optimized matmul operation. Currently a MatMul can be converted to MatMulNBits to speed up model inference. However, MatMulNBits is an ORT-only op. To make the graph compatible with ONNX ops and utilize MatMulNBits at the same time, we introduce Q/DQ support for MatMulNBits.
To convert MatMul ops in a model to MatMulNBits (a usage sketch follows the list):
1. Use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using QDQ mode.
2. In an ORT session, DQ + MatMul is fused to MatMulNBits.
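A minimal sketch of step 1, assuming the `MatMul4BitsQuantizer` API in onnxruntime.quantization and a `quant_format` switch for QDQ mode; parameter names may differ from the tool's final interface.
```python
import onnx
from onnxruntime.quantization import QuantFormat
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("model.onnx")
# quant_format=QuantFormat.QDQ (assumed parameter) emits DQ + MatMul instead of
# the ORT-only MatMulNBits op; ORT then fuses DQ + MatMul back to MatMulNBits.
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True,
                                 quant_format=QuantFormat.QDQ)
quantizer.process()
quantizer.model.save_model_to_file("model_qdq_int4.onnx", use_external_data_format=True)
```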
#### Note
MatMulNBits assumes the B weight is uint4. When no zero point (zp) is provided, zp defaults to 8, which is different from DQ: DQ defaults zp to 0 when none is provided, and DQ also supports int4. Therefore some conversions are introduced during the DQ + MatMul --> MatMulNBits step (a small illustration follows).
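A small illustration (assumed values, not the optimizer's actual code) of the int4-to-uint4 shift implied above; adding 8 to both the weights and the zero point leaves the dequantized values unchanged.
```python
import numpy as np

# int4 values as DQ sees them: range [-8, 7], zero point defaults to 0.
w_int4 = np.array([-8, -3, 0, 5], dtype=np.int8)
zp_int4 = np.int8(0)

# Shift into the uint4 domain MatMulNBits expects: range [0, 15], zero point 8.
# (w - zp) * scale is preserved because both terms shift by the same amount.
w_uint4 = (w_int4 + 8).astype(np.uint8)
zp_uint4 = np.uint8(zp_int4 + 8)
```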
#### Perf
Using the QDQ format increases model initialization time and memory consumption. With the current implementation, model init time increased from ~4s to ~9s, and memory consumption increased from ~2.8GB to ~4.8GB.
The memory increase is because:
1. In the optimizer, after transposing the B weight, an in-memory tensor proto is created using protobuf's arena.
2. In the finalize step, when saving initializers and prepacking, the ORT arena is used to create buffers for initializers.
The memory allocated by arenas cannot be fully deallocated. If ORT arena memory allocation is disabled, the memory consumption of both the QDQ format and the original format is ~2.2GB.
The time increase is mainly due to multiple memory copies, but this can be further optimized.
### Motivation and Context
Please see description for details.
### Description
Add CoreML ML Program Resize.
- Refactor existing logic to simplify and share code between the NeuralNetwork and MLProgram checks.
- Add handling for some new attributes: antialias and axes. This should have been done when setting the CoreML EP max opset to 21.
### Motivation and Context
Support priority models
### Description
* Add a CUDA provider option `sdpa_kernel` to choose which attention kernel to run, for testing purposes.
* Allow dumping which attention kernel is used per node.
* Reserve a flag for cuDNN flash attention, which will be added soon.
#### CUDA provider option sdpa_kernel
In addition to the environment variables, we also support setting it as a provider option; a sketch follows below. Note that the setting is global per session. This can help with performance testing of each kernel.
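A minimal sketch of setting the option from Python; the value shown is illustrative only, since the mapping from values to kernels is not spelled out here.
```python
import onnxruntime as ort

cuda_options = {
    "device_id": 0,
    "sdpa_kernel": "2",  # illustrative value; selects a specific attention kernel for testing
}
sess = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```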
#### Attention Kernel Debug Info
Set the environment variable `ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1`, and ORT will print the SDPA kernel used for each node. For example:
```
ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1 ./onnxruntime_test_all --gtest_filter=MultiHeadAttentionTest*
```
It will show debug information of kernel used in testing:
```
[ RUN ] MultiHeadAttentionTest.SelfAttention_Batch2_HeadSize32_NoBias_NoMask_PackedQKV
AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=0 TRT_FUSED_ATTENTION=1 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=1 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1
Operator=MultiHeadAttention Node=node1 DataType=fp16 TRT_FUSED_ATTENTION=1
AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=1 TRT_FUSED_ATTENTION=0 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=0 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1
Operator=MultiHeadAttention Node=node1 DataType=fp16 EFFICIENT_ATTENTION=1
```
In this test case, the debug info shows that one session uses TRT fused attention and another session uses efficient attention.
### Description
```
# npm audit report
socket.io 3.0.0 - 4.6.2
Severity: high
socket.io has an unhandled 'error' event - https://github.com/advisories/GHSA-25hc-qcg6-38wj
Depends on vulnerable versions of engine.io
fix available via `npm audit fix`
node_modules/socket.io
ws 8.0.0 - 8.17.0
Severity: high
ws affected by a DoS when handling a request with many HTTP headers - https://github.com/advisories/GHSA-3h5v-q93c-6h6q
fix available via `npm audit fix`
node_modules/ws
engine.io 0.7.8 - 0.7.9 || 6.0.0 - 6.5.4
Depends on vulnerable versions of ws
node_modules/engine.io
socket.io-adapter 2.5.2 - 2.5.4
Depends on vulnerable versions of ws
node_modules/socket.io-adapter
4 high severity vulnerabilities
```
### Description
Moves the `Relu -> QuantizeLinear` fusion to Level2 optimizations for
CPU EP only.
### Motivation and Context
See the related PR for motivation and context:
https://github.com/microsoft/onnxruntime/pull/20627
Update SQNBitGemm ARM NEON kernel to compute 4x2 tile of output.
Note: Also tried 2x4 and 4x4 tiles but observed the best microbenchmark results with 4x2 tiles.
### Description
* Promote the TRT version to 10.2.0.19
* EP_Perf CI: clean up the config for legacy TRT < 8.6; promote the test env to trt10.2-cu118/cu125
* Skip two tests, as Float8/BF16 are supported by TRT > 10.0 but the TRT CIs are not hardware-compatible with them:
```
1: [ FAILED ] 2 tests, listed below:
1: [ FAILED ] IsInfTest.test_isinf_bfloat16
1: [ FAILED ] IsInfTest.test_Float8E4M3FN
```
### Motivation and Context
### Description
There is a bug for kernels running on ROCm 6.0, so change the CI docker image to ROCm 6.1.
For the torch installed in the docker image, switch to the ROCm repo when its version is not 6.0.
### Motivation and Context