### Description
Fix a crash by adding extra checks in ResetQnnLogLevel.
From the dump it looks like an ETW callback, while the provider is stopping, attempts to reset the QNN log level.
Even though the QNN BackEndMgr (this) is still alive at that point, logger_ is no longer valid.
### Motivation and Context
ORT should not crash
### Description
Update list of CI pipelines to trigger for external PRs.
### Motivation and Context
The pipelines triggered for external PRs are not consistent with those
triggered for internal PRs.
### Description
The current API docs workflows are scheduled to run monthly, but artifacts
expire after 30 days, which could create issues in 31-day months.
Update the schedule to regenerate artifacts every 2 weeks.
### Motivation and Context
### Description
(1) Upgrade opencv
(2) Add some comments about onnxruntime-gpu installation
### Motivation and Context
opencv-python was locked to an older version, which has security
vulnerabilities: see https://github.com/microsoft/onnxruntime/pull/22445
for more info
### Description
Related to #22282. Let the Vitis AI EP handle dynamic_options.
### Motivation and Context
---------
Co-authored-by: genmingz <genmingz@amd.com>
### Description
1. Remove the onnxruntime::OrtMutex class and replace it with
~absl::Mutex~ std::mutex.
2. After this change, most source files will not include <Windows.h>
indirectly.
### Motivation and Context
To reduce the number of deps we have, and address some Github issues
that are related to build ONNX Runtime from source.
In PR #3000, I added a custom implementation of std::mutex. It was
mainly because at that time std::mutex's default constructor was not
trivial on Windows. If you had such a mutex as a global variable, it could
not be initialized at compile time. The VC++ team has since fixed this issue,
so we don't need the custom implementation anymore.
This PR also removes nsync. I ran several model tests on Linux and
didn't see any perf difference.
This PR also reverts PR #21005, which is no longer needed since conda
has updated its msvc runtime DLL.
This PR unblocks #22173 and resolves #22092. We have a lot of open
issues with nsync; this PR can resolve all of them.
### Description
Updates the ROCm EP opsets to match the current CUDA EP opsets. Also
enables the test CApiTest.basic_cuda_graph_with_annotation.
Note that some changes are whitespace-only. These changes were made to
improve the comparison of corresponding ROCm and CUDA EP source files
when using a side-by-side diff tool.
### Motivation and Context
The ROCm EP derives from the CUDA EP. Many source files are shared
between the EPs and "hipified" during the ROCm EP build; however, quite a
few files within the ROCm EP are under source control after their
initial hipification. Over time these ROCm EP files get stale relative
to their CUDA EP counterparts. It becomes necessary to re-hipify these
otherwise static files in order to pick up important changes such as
opset differences.
Update the Python wrapper script to support the weight sharing case
### Description
Update the script to support a JSON file from the QNN converter, or one extracted from a QNN context binary file, for the weight sharing scenario.
The ONNX Runtime Release Roadmap on our website is not very easy to find
right now, so I'm adding a link here to make it more accessible.
### Description
### Motivation and Context
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
- Allow specification of iOS simulator runtime version to use.
- Pick simulator runtime version (iphonesimulator 16.4) that is supported by the Xcode version (14.3.1) that we use.
- Disable the CoreML EP's DepthToSpace op support for CoreML versions less than 7 with DCR mode and FP16 input, since it doesn't produce the correct output in this case.
- Some cleanup of iOS test infrastructure.
### Description
This change enables caching `MLTensor`s between inference runs. This is
done by keeping a reference to `MLTensor`s alive after they have been
released. `MLTensor`s are only destroyed once the session goes out of
scope.
### Motivation and Context
Creating and destroying `MLTensor`s on every run has a non-trivial
performance penalty. This penalty materializes when using
`ort.Tensor`s with `location=cpu` for inputs/outputs or when using the CPU EP
as a fallback EP for unsupported operators. The former can be mitigated by
the developer using `ort.Tensor`s with `location=ml-tensor`; the latter
cannot be mitigated by developers.
### Description
The recent PR #22223 introduced 2 bugs in the implementation of CPU
LayerNorm f16:
- possible access to a nullptr for bias:
  `const TensorShape& bias_shape = bias->Shape();` will crash when `bias`
  does not exist. (Amazingly, this one does not seem to be covered by any
  test case.)
  - fix: guard with a pointer check
- a race condition inside ComputeJob:
  `ComputeJob()` is dispatched to the thread pool and internally tries to
  modify `LayerNormImpl::scale_fp32_` and `LayerNormImpl::bias_fp32_`,
  which are `std::unique_ptr`s and are not thread-safe.
  - fix: move the modification of `LayerNormImpl::scale_fp32_` and
    `LayerNormImpl::bias_fp32_` out of `ComputeJob()` and into
    `LayerNormImpl::ComputeWithoutContext()`. It may still have a race
    condition because `ConcurrentRunSupported` is set to `true` for the
    CPU EP, so an OrtMutex was added.

This should fix the recent flaky tests as well.
### Description
`get_device()` returns a string of hyphen-connected device names, such
as "GPU-DML". This is a problem when CUDA is disabled but OpenVINO GPU
is enabled in the build, because in that case `get_device()` returns
"CPU-OPENVINO_GPU", so `supports_device("CUDA")` will return `True` for
that build.
Splitting the value of `get_device()` by "-" and checking whether the input
is in the resulting list is not an option either, because some code in the
code base stores the value of `get_device()` and uses that value to call
`supports_device()`. That implementation would cause
`supports_device("GPU-DML")` to return `False` for a build with
`get_device() == "GPU-DML"`, because `"GPU-DML" in ["GPU", "DML"]` is
`False`.
This change also helps to avoid further problems when "WebGPU" is
introduced.
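Below is a small illustration of the constraints described above, using the values from this description. It only uses the public Python API (`onnxruntime.get_device()` and the ONNX backend helper `onnxruntime.backend.supports_device()`); the device strings shown are examples from this PR, not an exhaustive list, and this is not the actual fix.

```python
# Illustration of the constraints described above (not the actual fix).
import onnxruntime as ort
import onnxruntime.backend as backend

print(ort.get_device())                  # a hyphen-joined build constant, e.g. "GPU-DML"
                                         # or "CPU-OPENVINO_GPU" in an OpenVINO GPU build
print(backend.supports_device("CUDA"))   # must be False in a "CPU-OPENVINO_GPU" build
                                         # (before this change it incorrectly returned True)

# Some call sites store get_device()'s value and later pass it back in,
# so combined names must keep working after the change:
print(backend.supports_device(ort.get_device()))  # e.g. supports_device("GPU-DML") -> True
```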
### Description
Adds QNN provider option `offload_graph_io_quantization` to offload
graph input quantization and graph output dequantization to the CPU EP.
The option is disabled by default to maintain current behavior.
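For reference, a minimal usage sketch in Python is shown below. The option name comes from this PR; the `backend_path` value and the "1"/"0" string values are assumptions based on how other QNN EP options are typically passed, so check the EP documentation for the exact spelling.

```python
# Sketch: enable the new QNN EP option when creating a session (option values are assumptions).
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("QNNExecutionProvider", {
            "backend_path": "QnnHtp.dll",             # typical HTP backend library on Windows
            "offload_graph_io_quantization": "1",     # offload graph I/O (de)quantization to the CPU EP
        }),
        "CPUExecutionProvider",                       # fallback for unsupported nodes
    ],
)
```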
### Motivation and Context
Offloading the handling of I/O quantization to the CPU EP significantly
improves inference latency for many models.
### Description
The current code that logs the profiler events "_fence_before" and
"_fence_after" seems to be useless: the measured duration of the two
events is 0.
Removed them.
### Description
This adds support for partial RotaryEmbedding to DML. Essentially,
partial RotaryEmbedding simply consists of doing the rotary embedding
calculation on a subregion of the input tensor, as if its head size
were `rotary_embedding_dim`, while leaving the second part of the tensor
(i.e. `head_size - rotary_embedding_dim`) alone.
To achieve this, all we need to do is follow the following steps:
1. Split the tensor into 2 parts
2. Run the rotary embedding algorithm on the first part, just like we
were doing before on the entire tensor
3. Join the 2 parts back together
Since we're leaving the second part intact, the RotaryEmbedding fusion
will still be done within DML. Also, the concat at the end is
essentially free because DML optimizes it out and directly allocates the
result of RotaryEmbedding in the right place. The only overhead here is
the splitting of the tensor at the beginning, which we should eventually
make part of the RotaryEmbedding fusion within DML.
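A small NumPy sketch of the idea, for reference. This is conceptual code for the split / rotate / join steps above, not the DML implementation; it assumes the non-interleaved, half-split rotation convention and an even `rotary_embedding_dim`.

```python
import numpy as np

def partial_rotary(x, cos, sin, rotary_embedding_dim):
    # x: (..., head_size); cos/sin broadcastable to (..., rotary_embedding_dim // 2)
    rot, rest = x[..., :rotary_embedding_dim], x[..., rotary_embedding_dim:]  # 1. split
    x1, x2 = np.split(rot, 2, axis=-1)                                        # 2. rotate the first part
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)                           # 3. join back together
```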
### Motivation and Context
This fix allows us to correctly run models that have a
`partial_rotary_factor` setting in huggingface, including Nvidia's
Nemotron: https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct
### Description
Our nightly CPU Python package is named "ort-nightly" instead of
"onnxruntime" for historical reasons (TensorFlow did something similar).
Now we would prefer to make the names the same.
This change is made for all nightly Python packages, including CPU,
GPU (CUDA), and maybe others.
### Motivation and Context
* Add in missing operators for llama run
* Add simplified layer norm ops
### Description
Add additional operators to the MIGraphX EP that are supported by
MIGraphX.
### Motivation and Context
Allows for more models to be run through MIGraphX EP
### Description
Today, the stable diffusion stage failed due to an upgrade in timm, which
controlnet_aux depends on.
The latest version of controlnet_aux limits the timm version to less than
0.6.7, so upgrading controlnet_aux solves the failure.
controlnet_aux also uses opencv-python-headless, so pin
opencv-python-headless to 4.8.0.74 too.
### Motivation and Context
### Description
For now, CoreML only supports running mlmodels on CPU/ALL. However, sometimes
CPU_GPU would be a lot faster.
This PR adds support for an option to select different hardware to boost
performance.
### Motivation and Context
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
Change the hipify step to remove the -roc option to hipify-perl. This
will prefer hipblas over rocblas. rocblas can still be called directly
such as in TunableOp.
### Motivation and Context
hip interfaces are preferred over roc interfaces when porting from cuda to
hip. Calling roc interfaces is meant for ROCm-specific enhancements or
extensions.
- Added a microbenchmark for the `LayerNormalization` MLFloat16 support
added in https://github.com/microsoft/onnxruntime/pull/22063.
- Updated the `LayerNormalization` MLFloat16 implementation to improve
the latency.
```
----------------------------------------------------------------------------------------------
Original MLFloat16 support Time CPU Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time 15599 us 15625 us 47
BM_LayerNormalization<MLFloat16, float>/1/real_time 14714 us 14824 us 39
BM_LayerNormalization<MLFloat16, float>/1/real_time 14634 us 14688 us 50
----------------------------------------------------------------------------------------------
Updated MLFloat16 support Time CPU Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time 7276 us 7254 us 84
BM_LayerNormalization<MLFloat16, float>/1/real_time 6820 us 6720 us 93
BM_LayerNormalization<MLFloat16, float>/1/real_time 6840 us 6882 us 84
```
1. Add python 3.13 to our python packaging pipelines
2. Because numpy 2.0.0 doesn't support free-threaded Python, this PR also
upgrades numpy to the latest version
3. Delete some unused files.
### Description
This PR further optimizes matmulnbits specifically for iGPUs. The phi3 demo
goes from ~8 tokens/second to ~12 tokens/second on iGPUs.
Some todos:
1. Make the optimization more general and remove the blockSize = 32
limitation.
2. Tune the parameters, such as workgroupSize and components size (currently
only components = 1 is supported), to see the performance change.
Bumps [cookie](https://github.com/jshttp/cookie) and
[socket.io](https://github.com/socketio/socket.io). These dependencies
needed to be updated together.
Updates `cookie` from 0.4.2 to 0.7.2
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/jshttp/cookie/releases">cookie's
releases</a>.</em></p>
<blockquote>
<h2>v0.7.2</h2>
<p><strong>Fixed</strong></p>
<ul>
<li>Fix object assignment of <code>hasOwnProperty</code> (<a
href="https://redirect.github.com/jshttp/cookie/issues/177">#177</a>)
bc38ffd</li>
</ul>
<p><a
href="https://github.com/jshttp/cookie/compare/v0.7.1...v0.7.2">https://github.com/jshttp/cookie/compare/v0.7.1...v0.7.2</a></p>
<h2>0.7.1</h2>
<p><strong>Fixed</strong></p>
<ul>
<li>Allow leading dot for domain (<a
href="https://redirect.github.com/jshttp/cookie/issues/174">#174</a>)
<ul>
<li>Although not permitted in the spec, some users expect this to work
and user agents ignore the leading dot according to spec</li>
</ul>
</li>
<li>Add fast path for <code>serialize</code> without options, use
<code>obj.hasOwnProperty</code> when parsing (<a
href="https://redirect.github.com/jshttp/cookie/issues/172">#172</a>)</li>
</ul>
<p><a
href="https://github.com/jshttp/cookie/compare/v0.7.0...v0.7.1">https://github.com/jshttp/cookie/compare/v0.7.0...v0.7.1</a></p>
<h2>0.7.0</h2>
<ul>
<li>perf: parse cookies ~10% faster (<a
href="https://redirect.github.com/jshttp/cookie/issues/144">#144</a> by
<a href="https://github.com/kurtextrem"><code>@kurtextrem</code></a>
and <a
href="https://redirect.github.com/jshttp/cookie/issues/170">#170</a>)</li>
<li>fix: narrow the validation of cookies to match RFC6265 (<a
href="https://redirect.github.com/jshttp/cookie/issues/167">#167</a> by
<a href="https://github.com/bewinsnw"><code>@bewinsnw</code></a>)</li>
<li>fix: add <code>main</code> to <code>package.json</code> for rspack
(<a href="https://redirect.github.com/jshttp/cookie/issues/166">#166</a>
by <a
href="https://github.com/proudparrot2"><code>@proudparrot2</code></a>)</li>
</ul>
<p><a
href="https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.0">https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.0</a></p>
<h2>0.6.0</h2>
<ul>
<li>Add <code>partitioned</code> option</li>
</ul>
<h2>0.5.0</h2>
<ul>
<li>Add <code>priority</code> option</li>
<li>Fix <code>expires</code> option to reject invalid dates</li>
<li>pref: improve default decode speed</li>
<li>pref: remove slow string split in parse</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="d19eaa1a2b"><code>d19eaa1</code></a>
0.7.2</li>
<li><a
href="bc38ffd0ea"><code>bc38ffd</code></a>
Fix object assignment of <code>hasOwnProperty</code> (<a
href="https://redirect.github.com/jshttp/cookie/issues/177">#177</a>)</li>
<li><a
href="cf4658f492"><code>cf4658f</code></a>
0.7.1</li>
<li><a
href="6a8b8f5a49"><code>6a8b8f5</code></a>
Allow leading dot for domain (<a
href="https://redirect.github.com/jshttp/cookie/issues/174">#174</a>)</li>
<li><a
href="58015c0b93"><code>58015c0</code></a>
Remove more code and perf wins (<a
href="https://redirect.github.com/jshttp/cookie/issues/172">#172</a>)</li>
<li><a
href="ab057d6c06"><code>ab057d6</code></a>
0.7.0</li>
<li><a
href="5f02ca8768"><code>5f02ca8</code></a>
Migrate history to GitHub releases</li>
<li><a
href="a5d591ce84"><code>a5d591c</code></a>
Migrate history to GitHub releases</li>
<li><a
href="51968f94b5"><code>51968f9</code></a>
Skip isNaN</li>
<li><a
href="9e7ca51ade"><code>9e7ca51</code></a>
perf(parse): cache length, return early (<a
href="https://redirect.github.com/jshttp/cookie/issues/144">#144</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/jshttp/cookie/compare/v0.4.2...v0.7.2">compare
view</a></li>
</ul>
</details>
<details>
<summary>Maintainer changes</summary>
<p>This version was pushed to npm by <a
href="https://www.npmjs.com/~blakeembrey">blakeembrey</a>, a new
releaser for cookie since your current version.</p>
</details>
<br />
Updates `socket.io` from 4.7.5 to 4.8.0
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/socketio/socket.io/releases">socket.io's
releases</a>.</em></p>
<blockquote>
<h2>socket.io-client@4.8.0</h2>
<h3>Features</h3>
<h4>Custom transport implementations</h4>
<p>The <code>transports</code> option now accepts an array of transport
implementations:</p>
<pre lang="js"><code>import { io } from "socket.io-client";
import { XHR, WebSocket } from "engine.io-client";
<p>const socket = io({
transports: [XHR, WebSocket]
});
</code></pre></p>
<p>Here is the list of provided implementations:</p>
<table>
<thead>
<tr>
<th>Transport</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Fetch</code></td>
<td>HTTP long-polling based on the built-in <code>fetch()</code>
method.</td>
</tr>
<tr>
<td><code>NodeXHR</code></td>
<td>HTTP long-polling based on the <code>XMLHttpRequest</code> object
provided by the <code>xmlhttprequest-ssl</code> package.</td>
</tr>
<tr>
<td><code>XHR</code></td>
<td>HTTP long-polling based on the built-in <code>XMLHttpRequest</code>
object.</td>
</tr>
<tr>
<td><code>NodeWebSocket</code></td>
<td>WebSocket transport based on the <code>WebSocket</code> object
provided by the <code>ws</code> package.</td>
</tr>
<tr>
<td><code>WebSocket</code></td>
<td>WebSocket transport based on the built-in <code>WebSocket</code>
object.</td>
</tr>
<tr>
<td><code>WebTransport</code></td>
<td>WebTransport transport based on the built-in
<code>WebTransport</code> object.</td>
</tr>
</tbody>
</table>
<p>Usage:</p>
<table>
<thead>
<tr>
<th>Transport</th>
<th>browser</th>
<th>Node.js</th>
<th>Deno</th>
<th>Bun</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Fetch</code></td>
<td>✅</td>
<td>✅ (1)</td>
<td>✅</td>
<td>✅</td>
</tr>
<tr>
<td><code>NodeXHR</code></td>
<td></td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
</tr>
<tr>
<td><code>XHR</code></td>
<td>✅</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><code>NodeWebSocket</code></td>
<td></td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
</tr>
<tr>
<td><code>WebSocket</code></td>
<td>✅</td>
<td>✅ (2)</td>
<td>✅</td>
<td>✅</td>
</tr>
<tr>
<td><code>WebTransport</code></td>
<td>✅</td>
<td>✅</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<p>(1) since <a
href="https://nodejs.org/api/globals.html#fetch">v18.0.0</a>
(2) since <a
href="https://nodejs.org/api/globals.html#websocket">v21.0.0</a></p>
<p>Added in <a
href="f4d898ee96">f4d898e</a>
and <a
href="b11763beec">b11763b</a>.</p>
<h4>Test each low-level transports</h4>
<p>When setting the <code>tryAllTransports</code> option to
<code>true</code>, if the first transport (usually, HTTP long-polling)
fails, then the other transports will be tested too:</p>
<pre lang="js"><code>import { io } from "socket.io-client";
</tr></table>
</code></pre>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="d0fc720420"><code>d0fc720</code></a>
chore(release): socket.io@4.8.0</li>
<li><a
href="4a0555c671"><code>4a0555c</code></a>
chore(release): socket.io-client@4.8.0</li>
<li><a
href="2b60df18a8"><code>2b60df1</code></a>
chore(release): engine.io@6.6.1</li>
<li><a
href="d4cb375856"><code>d4cb375</code></a>
ci: ignore tests when publishing to npm</li>
<li><a
href="c251ae7ba7"><code>c251ae7</code></a>
chore(release): engine.io-client@6.6.1</li>
<li><a
href="8a2f5a3da0"><code>8a2f5a3</code></a>
fix(eio-client): move 'offline' event listener at the top</li>
<li><a
href="b04fa64365"><code>b04fa64</code></a>
fix(sio): allow to join a room in a middleware (uws)</li>
<li><a
href="7085f0e3e4"><code>7085f0e</code></a>
refactor(sio-client): mangle private attributes</li>
<li><a
href="4f66708210"><code>4f66708</code></a>
chore(sio-client): use babel loose mode when transpiling classes</li>
<li><a
href="1a95db2145"><code>1a95db2</code></a>
chore(sio-client): add a script to compute the bundle size</li>
<li>Additional commits viewable in <a
href="https://github.com/socketio/socket.io/compare/socket.io@4.7.5...socket.io@4.8.0">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will merge this PR once CI passes on it, as requested by
@fs-eire.
---
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
### Description
With this optimization, 96 MultiHeadAttention|Transpose ops in phi3
disappear. Phi3 improves from 107 tokens to 113 tokens on my dGPUs.
The optimization mainly skips the transpose op when one of the transposed
dims is 1; a reshape is enough.
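A quick NumPy check of that observation, for illustration only: when the swapped axis has size 1, the transpose doesn't move any data, so a reshape produces the same result.

```python
import numpy as np

# (batch, num_heads=1, seq_len, head_size) -> (batch, seq_len, num_heads=1, head_size)
x = np.arange(2 * 1 * 4 * 8).reshape(2, 1, 4, 8)
transposed = np.transpose(x, (0, 2, 1, 3))   # swaps an axis of size 1
reshaped = x.reshape(2, 4, 1, 8)             # same element order, no data movement
assert np.array_equal(transposed, reshaped)
```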