### Description
### Motivation and Context
1. The Python API doc needs to be merged from a fork, but the 1ES self-hosted
pool is limited to a single GitHub repo.
2. ubuntu-latest installs numpy >= 2.0 by default, and the current
Python API doc generation doesn't support it, so I pinned numpy < 2.0.0.
---------
Avoid producing presentKey/presentValue outputs if pastKey/pastValue
don't exist.
### Description
- Adds support for the GatherElements operator to QNN EP.
- Adds GatherElements to QDQ quantizer tool.
### Motivation and Context
Enable more models to run on QNN EP.
### Description
Added code in MatMul4BitsQuantizer to quantize Gather to
GatherBlockQuantized.
Only Gather with constant data is quantized.
Since the quantized data is int4, the quantized model is force-upgraded
to ONNX opset 21.
The implementation relies purely on numpy; if optimization is needed,
C++ kernels can be added later.
Only the default RTN algorithm is supported, since GatherBlockQuantized
requires zero points to have the same type as the quantized data (see the
sketch below).
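As a rough illustration of the numpy-only approach, here is a minimal sketch of block-wise RTN int4 quantization. The function name, block layout, and rounding details are illustrative assumptions, not the tool's actual code:

```python
import numpy as np

def rtn_block_quantize_int4(data: np.ndarray, block_size: int = 32):
    """Hypothetical sketch: asymmetric RTN int4 quantization along the last axis."""
    qmin, qmax = -8, 7  # signed int4 range
    pad = (-data.shape[-1]) % block_size  # pad so the axis divides into blocks
    padded = np.pad(data, [(0, 0)] * (data.ndim - 1) + [(0, pad)])
    blocks = padded.reshape(*padded.shape[:-1], -1, block_size)
    bmin = blocks.min(axis=-1, keepdims=True)
    bmax = blocks.max(axis=-1, keepdims=True)
    scale = np.where(bmax > bmin, (bmax - bmin) / (qmax - qmin), 1.0)
    # Zero points share the int4 type of the quantized data (hence RTN only).
    zero_point = np.clip(np.rint(qmin - bmin / scale), qmin, qmax)
    quant = np.clip(np.rint(blocks / scale) + zero_point, qmin, qmax)
    return quant.astype(np.int8), scale, zero_point.astype(np.int8)
```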
### Motivation and Context
Support quantizing Gather to int4 in the Web scenario.
### Description
Bug fix for the ShapeInferContext GetAttrxxxs APIs: a node attribute may
be empty.
### Motivation and Context
If the attribute value is empty, the expected result through the interface
is empty, but currently it returns a meaningless {0}.
---------
Co-authored-by: mingyue <mingyue@amd.com>
Co-authored-by: Liu Minyue <mingyue@xilinx.com>
### Description
- TensorRT 10.2.0.19 -> 10.3.0.26
### Motivation and Context
### Description
This PR fixes the `AttentionProbsSoftmax` recompilation issue when
executing the phi3 model. With this fix, phi3 performance improves
further.
### Motivation and Context
### Description
Previously, MultiHeadAttention supported a relative position bias of shape
[1, N, S, T] or [B, N, S, T], and DecoderMaskedMultiHeadAttention
supported [1, N, S, T]. This extends the support to allow [1, N, S,
T], [B, N, S, T], [B, 1, S, T] and [1, 1, S, T] for the CUDA and CPU EPs
(see the broadcast sketch after the lists below).
- [x] Rename the input of "relative position bias" to "attention bias"
because it can also be used for other types of bias, like ALiBi
(Attention with Linear Biases) or attention mask.
- [x] Update unfused kernel to support broadcasting 2nd dimension of
attention bias.
- [x] Update efficient attention to support broadcasting 2nd dimension
of attention bias.
- [x] Update operators (MultiHeadAttention,
DecoderMaskedMultiHeadAttention, Attention, PackedAttention,
PackedMultiHeadAttention) to support broadcast attention bias on CUDA
and CPU EPs.
- [x] Update ROCm, DML and WebGPU naming to be consistent. (Note that
those EPs do not support broadcasting attention_bias for now).
- [x] Add attention bias tests for MultiHeadAttention.
- [x] Update operator documents
- [x] Update benchmark script
Other changes:
* Fix some checks in multihead-attention.ts
* Add helper functions to dump tensors given dimensions.
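For reference, a small numpy check of the newly supported attention-bias shapes; the dimensions and variable names are illustrative, and numpy broadcasting here mirrors what the kernels now do over the batch and head dimensions:

```python
import numpy as np

B, N, S, T = 2, 8, 16, 16  # batch, heads, sequence, total sequence
scores = np.random.randn(B, N, S, T).astype(np.float32)
# All four supported attention_bias shapes broadcast onto the scores:
for shape in [(1, N, S, T), (B, N, S, T), (B, 1, S, T), (1, 1, S, T)]:
    bias = np.random.randn(*shape).astype(np.float32)
    assert (scores + bias).shape == (B, N, S, T)
```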
### Description
This PR modifies the run_dynamo_export function to ensure it mirrors the
behavior of run_torchscript_merged_export rather than
run_torchscript_separate_export. Additionally, I made adjustments to the
main function to ensure that run_dynamo is correctly invoked.
### Motivation and Context
The main motivation for this change is to enable successful export of
LLaMA-2 and LLaMA-3 models using the Dynamo exporter to ONNX.
Previously, the exporter saved two copies of the weights, which is
inefficient. The modified approach ensures that only one copy of the
weights is saved, and the model can support both scenarios. These
changes enhance the exporter's compatibility with LLaMA models (and
subsequently other models) and optimize the export process.
### Description
Handle targets in subdirectories for external projects. All targets
now go in a per-project folder under 'External'.
E.g. gmock and gtest are now handled correctly and are under
External/googletest, versus the existing setup where they ended up as
top-level projects.
![image](https://github.com/user-attachments/assets/99ec259c-47cd-44f3-954d-58569c941cc2)
### Motivation and Context
Improve developer experience.
Chrome Canary is helpful for testing new features. With this PR, we can
enable Chrome Canary in unit tests with a command like
`npm test -- op abs.jsonc -b=webgpu -e=chromecanary`.
### Description
Replace `memset(0)` with `std::fill(T{})`. This ensures that all
types are initialized in a portable way.
### Motivation and Context
Some platforms exhibit intermittent failures with NaN results.
Follow-up to: https://github.com/microsoft/onnxruntime/pull/21525
Cc: @ranjitshs
### Description
Exclude cuDNN 9 and CUDA 12 DLLs from the manylinux wheel to reduce the
Python package size.
### Motivation and Context
The 1.20.0 ort-nightly-gpu Python wheels on Linux are suddenly > 800 MB
in size, while the wheels built on the 1.19 release branch are around
220 MB.
The size change is caused by
https://github.com/microsoft/onnxruntime/pull/19470.
### Description
See
454996d496
for the manual changes (excluding auto-generated formatting changes).
### Why
The toolset for the old clang-format is out-of-date, which reduces
development efficiency.
- The NPM package `clang-format` is already in maintenance mode and has
not been updated for two years.
- The VSCode extension for clang-format has not been maintained for a
while, and a recent Node.js security update broke it entirely on
Windows.
No one in the community seems interested in fixing these.
Prettier was chosen as it is the most popular TS/JS formatter.
### How to merge
It's easy to break the build:
- Be careful of any new commits on main not included in this PR.
- Be careful that after this PR is merged, other PRs that have already
passed CI can still merge.
So, make sure there are no new commits on main before merging this one,
and invalidate JS PRs that already passed CI, forcing them to rebase onto
the latest main.
Bug: https://github.com/microsoft/onnxruntime/issues/21386
### Description
Pins the pytorch-lightning package to version 2.3.3, since version >=2.4.0
requires torch > 2.1.0, which is not compatible with cu118.
### Motivation and Context
ORT 1.19 Release Preparation
### Description
This change addresses the case where we multiply two matrices whose
inner dimension is 0.
numpy and Eigen, which is used in our CPU EP implementation,
correctly handle this case and output an [M, N] matrix filled with
zeros, as shown below.
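The numpy behavior that the CPU EP now matches can be seen with a quick check:

```python
import numpy as np

a = np.ones((3, 0), dtype=np.float32)  # [M, K] with K == 0
b = np.ones((0, 4), dtype=np.float32)  # [K, N]
c = a @ b
assert c.shape == (3, 4) and not c.any()  # [M, N], filled with zeros
```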
### Motivation and Context
This is required to support GenAI empty input Lora implementation.
Addresses: https://github.com/microsoft/onnxruntime/issues/21483
### Description
Replace jcenter. It's deprecated and not responding.
### Motivation and Context
Fix CIs
Fix two issues:
(1) The scale should be fp32 instead of fp16.
(2) The Softmax program does not handle normalized dispatch group values, so if the sequence length is over 65535, the result from this program is incorrect.
### Description
### Motivation and Context
We couldn't get enough A100 agent time to finish the jobs since today.
This PR makes the A100 job run only on the main branch to unblock other
PRs, in case capacity doesn't recover in a short time.
### Description
Update DML runtime binary to 1.15.1
### Motivation and Context
### Description
Fix address sanitizer and memory access bugs 1, 4, 5, 7, and 8 found in
security fuzz testing.
### Motivation and Context
We saw some models fail to run due to OOM, which can be fixed by
increasing trt_max_workspace_size.
This PR removes the size limitation by default (allowing up to the
maximum device memory), which is aligned with trtexec.
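For anyone who still wants an explicit cap, the option can be set through the TensorRT EP provider options as before; the model path and the 2 GB value below are just examples:

```python
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {"trt_max_workspace_size": 2 * 1024**3}),
    "CUDAExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)
```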
### Description
Added the eval model buffer as an optional field in Module so that you can
export for inference using the eval model stored as a buffer.
### Motivation and Context
- Resolves #21152
- The previous solution (PR #21422) produced an eval model that was
specific to the EPs used to train, because of unavoidable runtime
optimizations that changed the graph stored with the eval session.
### Description
This PR improves the range calculation for input to Relu/Clip nodes for
the symmetric quantization case.
### Motivation and Context
Currently, the issue we face is that for the common scenario of Conv
followed by Relu in the symmetric quantization config, different scales
could be assigned to the tensors corresponding to the input and output of
Relu.
The downside is that this may introduce noise due to multiple
re-quantizations, and it makes it difficult to fuse Conv-Relu nodes for
hardware accelerators that support fused Conv-Relu.
Instead, it is more efficient to assign the output range of Relu as the
input range of Relu (i.e., the output range of the upstream op) wherever
possible, as sketched below. This adjustment is currently only being done
for the asymmetric quantization case.
For the scenario where the upstream op has multiple consumers, this
assumption could be incorrect, so in that case we do not adjust the
ranges.
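A minimal sketch of the intended adjustment, assuming a helper that computes a symmetric scale from a tensor range (all names here are hypothetical, not the quantizer's actual code):

```python
def symmetric_scale(rmin: float, rmax: float, qmax: int = 127) -> float:
    # Symmetric quantization derives the scale from the max absolute value.
    return max(abs(rmin), abs(rmax)) / qmax

relu_range = (0.0, 6.0)   # computed range of Relu's output
conv_range = relu_range   # propagate it back to Relu's input (the Conv output)
# Both tensors now share one scale, so there is no re-quantization
# between Conv and Relu, and the pair stays fusable.
assert symmetric_scale(*conv_range) == symmetric_scale(*relu_range)
```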
### Description
Fixes validation of per-channel quantization overrides by not
unnecessarily loading the external weights.
### Motivation and Context
The `get_qnn_qdq_config()` explicitly loads models without external data
(i.e., `onnx.load_model(load_external_data=False)`). Afterwards,
`get_qnn_qdq_config()` calls `tensor_proto_to_array()`, which expects
that the external weights are stored in the current working directory.
If the external weights are stored in a different directory, then we get
a crash.
Loading the actual weight values is unnecessary because we only need the
weight shape. This PR removes the unnecessary
`tensor_proto_to_array()` call.
### Description
Added DequantizeLinear operator for JSEP.
### Motivation and Context
### Description
Add a null-pointer check to avoid a crash when running a session that
previously failed to generate a trt_engine.
### Motivation and Context
Reported and verified in
https://github.com/microsoft/onnxruntime/issues/21567
### Description
This fix addresses the issue of handling multiple QLinear nodes as
outputs from the target node in OVEP. Previously, the stripping logic
only supported a single Q node, leading to incorrect stripping of
additional Q nodes.
### Motivation and Context
The OVEP stripping logic was limited to handling a single Q node as an
output from the target node. As a result, additional Q nodes were being
stripped, despite the stripping rules indicating they should be
retained.
With this fix, OVEP can now properly handle multiple Q nodes according
to the specified stripping rules, ensuring that the fate of each Q node
is correctly determined.
---------
Co-authored-by: sfatimar <sahar.fatima@intel.com>
### Description
Previously, when quantizing MatMul to DQ + MatMul using the 4-bit QDQ
tool chain, the opsets of the domains were not changed.
Now, when quantizing MatMul to DQ + MatMul in QDQ format, the onnx domain
is force-upgraded to opset 21.
### Motivation and Context
In QDQ format, DQ with int4 and blocked quantization is used, which
requires DQ with opset >= 21.
When quantizing MatMul to DQ + MatMul, the onnx domain is therefore
force-upgraded to opset 21, as sketched below.
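A sketch of the kind of opset bump this implies; the exact code in the tool may differ, and the model path is illustrative:

```python
import onnx

model = onnx.load_model("qdq_model.onnx")
for opset in model.opset_import:
    if opset.domain in ("", "ai.onnx"):  # the default onnx domain
        opset.version = max(opset.version, 21)  # int4/blocked DQ needs >= 21
```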
### Description
<!-- Describe your changes. -->
Fix wrong per-tensor quantized weight type for matmul.
### Motivation and Context
Fixes the related bug described in
https://github.com/microsoft/onnxruntime/issues/21346
### Description
Add a gather that supports block-quantized input data.
### Motivation and Context
To support the Web inference scenario with quantized vocabulary embeddings.
### Description
This change enhances the existing Pad Fusion to fuse Pad even if a Cast
operator is present between Pad and Conv/MaxPool/AveragePool. It keeps
the Cast as it is.
<pre>
/*
* Before Fusion:
* Pad
* |
* Cast (Optional)
* |
* Conv/MaxPool/AveragePool
*
* After Fusion:
* Cast (Optional)
* |
* Conv/MaxPool/AveragePool
*/
</pre>
### Motivation and Context