Commit Graph

11958 Commits

Author SHA1 Message Date
Tianlei Wu b5ef85555a
Support onnx data types (bfloat16, float8) in python I/O binding APIs (#22306)
### Description
(1) Support onnx data types in python APIs:
* IOBinding.bind_input
* IOBinding.bind_output
* ortvalue_from_shape_and_type

(2) Add unit tests, which serve as examples of running BFloat16 or
Float8 models in Python.

Other minor changes:
(3) Replace the deprecated NP_TYPE_TO_TENSOR_TYPE with a helper API.
(4) Rename ortvalue_from_numpy_with_onnxtype to
ortvalue_from_numpy_with_onnx_type.

The integer values of the ONNX element types can be found at
https://onnx.ai/onnx/api/mapping.html. Note that FLOAT4E2M1 is not
supported yet.
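
As a rough usage sketch of the new bindings (the model path, tensor names, and CUDA device here are illustrative assumptions, not part of this change):

```python
import onnxruntime as ort
from onnx import TensorProto

# Hypothetical BFloat16 model; "X"/"Y" are assumed input/output names.
sess = ort.InferenceSession("model_bf16.onnx", providers=["CUDAExecutionProvider"])

# Allocate an OrtValue from shape + ONNX element type (TensorProto.BFLOAT16 == 16),
# without going through a numpy dtype.
x = ort.OrtValue.ortvalue_from_shape_and_type([1, 4], TensorProto.BFLOAT16, "cuda", 0)

binding = sess.io_binding()
binding.bind_input(name="X", device_type="cuda", device_id=0,
                   element_type=TensorProto.BFLOAT16, shape=[1, 4],
                   buffer_ptr=x.data_ptr())
binding.bind_output("Y", "cuda", 0)
sess.run_with_iobinding(binding)
```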

### Motivation and Context

The current Python API does not support BFloat16 and Float8 (FLOAT8E4M3FN,
FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ) types, nor other new data
types such as INT4 and UINT4.

This change removes that limitation.

https://github.com/microsoft/onnxruntime/issues/13001
https://github.com/microsoft/onnxruntime/issues/20481
https://github.com/microsoft/onnxruntime/issues/20578
2024-10-04 17:29:15 -07:00
Dmitri Smirnov 96a1ce1c04
[C#] Address Packaging pipeline failure (#22307)
### Description
Add new test data copy to 2 more test projects.
2024-10-04 17:28:09 -07:00
Dmitri Smirnov 9f3676bc31
Address leftover comments for Lora support (#22322)
### Description
Address comments


### Motivation and Context
Re: https://github.com/microsoft/onnxruntime/pull/22046
2024-10-04 16:43:26 -07:00
Dmitri Smirnov 0645ad19a4
[PyBind] Expose enable_mem_arena property for SessionOptions (#22323)
### Description
Expose enable_mem_arena property for SessionOptions

### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/22271
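
A minimal usage sketch (assuming the property name matches the PR title and behaves like the other boolean SessionOptions flags):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_mem_arena = False  # newly exposed property (assumed boolean); disables the memory arena

sess = ort.InferenceSession("model.onnx", sess_options=so)  # "model.onnx" is a placeholder
```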
2024-10-04 16:43:15 -07:00
Changming Sun 715b74d61a
Re-enable codesign for maven packages (#22308)
### Description
PR #22217 was reverted.  This PR re-enables it.


### Motivation and Context
2024-10-04 14:30:17 -07:00
Tianlei Wu f3f33bfa05
Upgrade cutlass to 3.5.1 and cudnn frontend to 1.7.0 (#22316)
### Description
Upgrade cutlass to 3.5.1
Upgrade cudnn_frontend to 1.7.0
2024-10-04 11:48:50 -07:00
Changming Sun f25f3868a7
Auto regenerate LORA's fbs files (#22313)
### Description

A left-over of PR #22046 

### Motivation and Context
Right now our VCPKG pipelines are broken.
2024-10-04 10:01:19 -07:00
Edward Chen 1df215e9bb
Update arena creation check in Environment::CreateAndRegisterAllocator() to check for 32-bit builds instead of non-x86_64 builds. (#22304) 2024-10-04 09:03:16 -07:00
jingyanwangms bb0c1f0a05
Update cuda version in release pipeline (#22305)
### Description
With the TensorRT 10.4 update, the name of the TensorRT Windows package changed.


### Motivation and Context
2024-10-03 22:28:28 -07:00
Ranjit Ranjan d0ddfa9b9e
[AIX] build fix for using system install protobuf/onnx (#22302)
### Description
Fixing a merge issue that occurred in
https://github.com/microsoft/onnxruntime/pull/22272

### Motivation and Context
To build onnxruntime using system installed protobuf/onnx.
2024-10-03 19:29:42 -07:00
Jing Fang a80bf8d158
Reduce matmulnbits UT time (#22303)
### Description
Flatten MatMulNbits UT and reduce unnecessary loops.



### Motivation and Context
Reduce matmulnbits UT time
2024-10-03 16:24:56 -07:00
Edward Chen f1be92faf0
Patch fp16 to fix Xcode 16 builds with XNNPACK EP targeting x86_64. (#22294) 2024-10-03 14:17:15 -07:00
Yi Zhang bbb54985a8
Add MaxPool FP16 in XnnPack EP (#22258)
### Description
Add support for FP16 kernels in the XnnPack execution provider for
MaxPool operations.
Fixes:
[AB#50332](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50332)

### Motivation and Context
The major purpose of this pull request is to add some common
vars/functions and set up a consistent style for adding FP16 kernels in
the XnnPack EP.

---------
2024-10-03 18:28:58 +08:00
Caroline Zhu c73e6afa6c
Migrate Android Java E2E tests from App Center to Browserstack (#22117)
### Description
- removed installing AppCenter + pipeline step that runs AppCenter
Espresso tests
- added script for running BrowserStack tests

### Motivation and Context
App Center is getting deprecated in the next year + we have upcoming
Android work that depends on working E2E testing.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-10-02 15:04:58 -07:00
Dmitri Smirnov 224f0651d0
[C#] Expose Multi-Lora support in C# (#22281)
### Description


### Motivation and Context
https://github.com/microsoft/onnxruntime/pull/22046
2024-10-02 10:00:43 -07:00
goldsteinn 4e15b229a0
ThreadPool: Spend less time busy waiting. (#21545)
The purpose of the patch is primarily to save power, but it also has
nice perf benefits (mostly from allowing the system to better distribute
power to cores doing meaningful work).

Changes are twofold:

1) Decrease WorkerLoop spin count dramatically, ~10^6 -> ~10^4. The
   reality is that after ~10^4 spins, if there hasn't been any new work
   added, it's unlikely any new work is imminent, so sleep to
   preserve power. This aligns more closely with upstream EigenV3.

2) Use exponential backoff for waiting on memory. This saves a bit
   more power and, importantly, increases the time between iterations
   in WorkerLoop to help accommodate the dramatically lowered spin
   counts.

Since the tuning for both the iteration counts / backoff counts is
dramatically different for hybrid/non-hybrid systems, this patch
templates the affected functions and dynamically chooses based on
`CPUIDInfo::IsHybrid()`. This seemed like the "lightest weight" way of
getting the change in, although it's likely we could incur less dynamic
overhead if we added the template argument to the entirety of
`ThreadPoolTempl`.
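
As an illustration only (not the ORT ThreadPool code), the overall waiting strategy looks roughly like this:

```python
import time

def wait_for_work(has_work, spin_limit=10_000, max_backoff_s=1e-3):
    """Sketch of the technique: spin briefly, then back off exponentially
    between polls before the caller falls back to a blocking wait."""
    # Phase 1: bounded spin; after ~spin_limit empty iterations,
    # new work is unlikely to be imminent.
    for _ in range(spin_limit):
        if has_work():
            return True
    # Phase 2: exponential backoff between polls to save power.
    backoff = 1e-6
    while backoff < max_backoff_s:
        if has_work():
            return True
        time.sleep(backoff)
        backoff *= 2
    return False  # caller would now park the thread / block on a condition variable
```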

Measured performance on an [Intel Meteor Lake
CPU](https://www.intel.com/content/www/us/en/products/sku/237329/intel-core-ultra-7-processor-165u-12m-cache-up-to-4-90-ghz/specifications.html)
across a range of models.

Below are the result of 3 runs with each metric being the
value-before-patch / value-after-patch (so for something like inference
time, lower is better).
<div align="center">
<table>
<tr>
<th>Session creation time cost</th>
<td>0.7179</td>
</tr>
<tr>
<th>First inference time cost</th>
<td>0.7156</td>
</tr>
<tr>
<th>Total inference time cost</th>
<td>1.0146</td>
</tr>
<tr>
<th>Total inference requests</th>
<td>0.8874</td>
</tr>
<tr>
<th>Average inference time cost</th>
<td>0.8800</td>
</tr>
<tr>
<th>Total inference run time</th>
<td>1.0146</td>
</tr>
<tr>
<th>Number of inferences per second</th>
<td>0.8955</td>
</tr>
<tr>
<th>Avg CPU usage</th>
<td>0.9462</td>
</tr>
<tr>
<th>Peak working set size</th>
<td>0.9922</td>
</tr>
<tr>
<th>Runs</th>
<td>1.1552</td>
</tr>
<tr>
<th>Min Latency</th>
<td>0.7283</td>
</tr>
<tr>
<th>Max Latency</th>
<td>0.9258</td>
</tr>
<tr>
<th>P50 Latency</th>
<td>0.9534</td>
</tr>
<tr>
<th>P90 Latency</th>
<td>0.9639</td>
</tr>
<tr>
<th>P95 Latency</th>
<td>0.9659</td>
</tr>
<tr>
<th>P99 Latency</th>
<td>0.9640</td>
</tr>
</table>
</div>

So the net result is a 1.16x improvement in throughput and between
1.08-1.37x improvement in latency.
2024-10-01 17:25:02 -07:00
Adam Pocock 14d1bfc34b
[java] Multi-LoRA support (#22280)
### Description
Java parts of Multi-LoRA support - #22046.

### Motivation and Context
API equivalence with Python & C#.

---------

Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
2024-10-01 13:54:37 -07:00
Dmitri Smirnov 1fc2b94644
Address Android warning error (#22285)
### Description

### Motivation and Context
Build issue
https://github.com/microsoft/onnxruntime/pull/22046#issuecomment-2386414899
2024-10-01 13:52:25 -07:00
Edward Chen c24e55b1f1
[Java] Add API for appending QNN EP (#22208)
- Add Java API for appending QNN EP
- Update Java unit test setup
  - Fix issues with setting system properties for tests
  - Unify Windows/non-Windows setup to simplify
2024-10-01 10:18:04 -07:00
Tianlei Wu e2b9ccc44a
Update SAM2 benchmark for testing torch compile modes and profiling (#22279)
This pull request introduces several enhancements to the benchmarking
process for the SAM2 model, including:
(1) Add profiling capabilities.
(2) Test torch compile modes (`none` disables compile and falls back to
eager mode).
(3) Update README for setting up the environment.

### Documentation Updates:
* README.md: Updated instructions to create separate conda environments
for GPU and CPU benchmarking, and detailed the parameters and outputs of
the benchmark script.

### Benchmark Script Enhancements:
* benchmark_sam2.py: Added optional parameters for enabling NVTX and
PyTorch profiling, and adjusted the initialization and execution flow to
incorporate these profiling options.

These changes enhance the flexibility and functionality of the
benchmarking process, making it easier to profile and benchmark the SAM2
model on different hardware configurations.
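
A hedged sketch of what the compile-mode switch could look like (the helper and argument names are hypothetical, not taken from benchmark_sam2.py):

```python
import torch

def maybe_compile(model, compile_mode: str):
    """Apply torch.compile unless the requested mode is "none"."""
    if compile_mode == "none":
        return model  # keep eager mode
    # e.g. "default", "reduce-overhead", "max-autotune"
    return torch.compile(model, mode=compile_mode)
```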
2024-10-01 09:51:12 -07:00
Yufeng Li 96e9c99dce
remove neural-speed (#22236)
### Description
NS is no longer developed, and ORT doesn't use it for int4 inference
either. Remove it to clean up the code.


### Motivation and Context
2024-10-01 09:50:44 -07:00
kunal-vaishnavi 50bda44a70
Fix equation in MatMulNBits op spec (#22253)
### Description
This PR fixes an equation in the MatMulNBits op spec. The old formula is
stated as

```
[CeilDiv((N * n_blocks_per_col + 1) * bits, 8)]
```

but it should be stated as

```
[N * CeilDiv(n_blocks_per_col * bits, 8)]
```

or as

```
[N * FloorDiv((n_blocks_per_col + 1) * bits, 8)]
```

### Motivation and Context
For models such as ChatGLM where the column size is odd, the division
math can be off. For example:


![image_360](https://github.com/user-attachments/assets/a5035bec-4dad-46af-9cb1-24a881eb70a0)

With the old equation, the projections are calculated as follows.

```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
 
4,096 * CeilDiv((107 + 1) * 4, 8) = 4,096 * CeilDiv(108 * 4, 8) = 4,096 * 54 = 221,184

# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
 
13,696 * CeilDiv((32 + 1) * 4, 8) = 13,696 * CeilDiv(33 * 4, 8) = 13,696 * 17 = 232,832
```

With the new equation, the projections are calculated as follows.

```
# Down projection
B = 4,096 x 107 x 64
zero_points = 221,184
N = 4,096
n_blocks_per_col = 107
 
4,096 * CeilDiv(107 * 4, 8) = 4,096 * 54 = 221,184

# Up projection
B = 13,696 x 32 x 64
zero_points = 219,136
N = 13,696
n_blocks_per_col = 32
 
13,696 * CeilDiv(32 * 4, 8) = 13,696 * 16 = 219,136
```
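
The corrected formula can be checked quickly against the shapes quoted above (numbers taken from the description):

```python
import math

def zero_points_size(N, n_blocks_per_col, bits=4):
    # Corrected MatMulNBits zero_points shape: [N * CeilDiv(n_blocks_per_col * bits, 8)]
    return N * math.ceil(n_blocks_per_col * bits / 8)

assert zero_points_size(4096, 107) == 221_184   # down projection
assert zero_points_size(13_696, 32) == 219_136  # up projection
```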
2024-10-01 09:31:56 -07:00
Mauricio A Rovira Galvez ffca096b5a
Fixes a crash on macOS 15 when using CoreML. (#22277)
### Description
In macOS 15, apps running with CoreML will crash with an error message
like this one:
```
Terminating app due to uncaught exception 'NSGenericException', reason: 'Failed to set compute_device_types_mask E5RT: Cannot provide zero compute device types. (1)'
```

This can be easily seen when building ONNXRuntime from source and
running the unit tests. The fix was suggested in [this bug
report](https://forums.developer.apple.com/forums/thread/757040).
I've ported the change to ONNXRuntime and verified that:
* The issue is resolved in macOS 15 (all unit tests pass).
* The behaviour is unchanged in macOS 14. 


### Motivation and Context
This fixes #22275 allowing apps using ONNXRuntime with CoreML to work
normally.
2024-10-01 16:06:03 +10:00
Scott McKay ee7081b828
Fix syntax for some CoreML ML Program supported operator entries (#22268)
### Description
Fix syntax so usability checker works as expected.

### Motivation and Context
2024-10-01 15:49:43 +10:00
Yang Gu 9e5153b688
[js/webgpu] Manage model download with a specific unittest option (#22214)
Currently in debug mode, the unit tests always download models to the local
file system, which is a bit annoying. This PR fixes this by adding a
specific option to enable model download.
2024-09-30 18:27:43 -07:00
Yang Gu c75f4a09b7
[js/webgpu] Remove the limitation on axis in softmax (#22231)
In the current implementation, the axis in softmax has to be the last one,
which is an obvious limitation. This PR removes this limitation and fixes
issues #20710 and #22176.
2024-09-30 18:27:11 -07:00
Dmitri Smirnov d9de054eb5
Multi-Lora support (#22046)
### Description



### Motivation and Context
2024-09-30 15:59:07 -07:00
Jian Chen 40bcb7664d
Revert "Jar Maven Signing - GnuPG and sha256" (#22273)
Reverts microsoft/onnxruntime#22217
2024-09-30 15:07:59 -07:00
Jian Chen ebcf2fcd16
Replace gradle/wrapper-validation-action with gradle/actions/wrapper-validation-action (#22224)
### Description
Replace gradle/wrapper-validation-action with
gradle/actions/wrapper-validation-action


### Motivation and Context
This is recommended by
https://github.com/gradle/wrapper-validation-action. This job uses
deprecated functionality from the 'gradle/wrapper-validation-action'
action.
2024-09-30 14:29:16 -07:00
Ranjit Ranjan 812075731c
[AIX] Build fix for using system installed protobuf/onnx (#22272)
### Description
To fix the build issues on AIX while using system-installed
protobuf/onnx.

### Motivation and Context
Code changes in this PR contain:

1. Fix for below compilation issue.
```
collect2: fatal error: library liblibprotobuf-lite not found
compilation terminated.
```
2. Adding the onnx library to the dependency list for test applications.
2024-09-30 12:36:21 -07:00
Yi Zhang d069475a63
Make A100 jobs in PR checks again (#22261)
### Description
If the variable is 1, the jobs run on A100 in PR checks.
Fixes
[AB#50333](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/50333)


### Motivation and Context
We wish that more big models which need to run on A100 could be tested in PR
checks, but Azure may sometimes decommission A100 agents without notification,
which would block merging PRs.
This PR is an improvement of the current workaround, making those jobs only
run on the main branch.
Once we find that the A100 agents are all decommissioned by Azure, we can change
the UseA100 variable to 0 to disable the A100 jobs in PR checks.
2024-09-30 08:29:30 -07:00
wejoncy 2cfe1f031d
[CoreML MLProgram] Support Float16 (1/N) (#22068)
### Description
Support Float16 for CoreML MLProgram EP.
Operations: "Add", "Mul", "Sub", "Div", "Pow", "Sqrt", "Reciprocal",
"Sigmoid", "Tanh", "Relu", "LeakyRelu", "Concat", "GridSample",
"GlobalAveragePool", "Clip", "DepthToSpace", "Resize", "Slice", "Conv",
"ConvTranspose", "GlobalMaxPool", "Gemm", "MatMul",
"AveragePool", "MaxPool", "Reshape", "Split", "Transpose"

### Motivation and Context

---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
2024-09-30 17:56:47 +08:00
Yang Gu 434f0fa536
[js/webgpu] Fix the crash issue in unsqueeze (#22264)
While axes in unsqueeze is allowed to be a scalar, its shape can't always
be accessed like a vector. This PR fixes issue #22031 so that the
original model can run well.
2024-09-30 02:28:16 -07:00
Yulong Wang 1bda91fc57
[js/webgpu] fix external buffer registration (#22254)
### Description

Fixes a failure that occurs when GPU inputs are shuffled
between iterations.
2024-09-28 10:36:40 -07:00
Enrico Galli 52a8c1cae8
[WebNN EP] Enable IO Bindings with MLTensor (#21301)
### Description
Enables using the MLTensor to pass data between models. 


### Motivation and Context
Using MLTensor instead of ArrayBuffers reduces the number of copies
between the CPU and devices as well as the renderer and GPU process in
Chromium.
2024-09-27 17:24:21 -07:00
Patrice Vignola ebda23be16
[DML EP] Fix Clip clamping (#22251)
### Description



### Motivation and Context
2024-09-27 16:24:37 -07:00
shiyi 1e3cd86d80
[WebNN EP] Support LSTM op (#20293)




2024-09-27 14:23:08 -07:00
liqun Fu f410e7c4cf
Fix mlas bench crash (#22248)
Fix mlas bench crash

---------

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2024-09-27 13:50:42 -07:00
Sumit Agarwal 529835cc46
[DML EP] Update DML to 1.15.2 (#22247)
### Description
Update DML binary to the current latest redist version
[1.15.2](https://www.nuget.org/packages/Microsoft.AI.DirectML/1.15.2).
2024-09-27 13:20:29 -07:00
Patrice Vignola 20be51525b
Support if node with sequence outputs (#22234)
`If` nodes can have sequence outputs. Those nodes are mapped to the DML
EP to be able to keep the outputs on the GPU, but they actually execute
on the CPU by selecting either the `then` subgraph or the `else`
subgraph.
2024-09-27 12:40:01 -07:00
Patrice Vignola 14ba2fb83c
[DML EP] Add intermediate tensor dumping for DML (#22246)
### Description



### Motivation and Context
2024-09-27 12:39:45 -07:00
Hector Li 6e3163faa5
Update code regarding some QNN bug fixes (#22222)
### Description
Update code regarding some QNN bug fixes:
1. QnnProfile_ExtendedEventData_t.version is not initialized in Qnn
2. Failed to finalize the graph for HardSigmoid with FP16 precision
2024-09-27 09:51:47 -07:00
Kyle b81e76b9a6
Jar Maven Signing - GnuPG and sha256 (#22217)
### Description
Jar maven signing: 
- GnuPG 
- sha256.

Jar packages artifacts: 
- onnxruntime-android-full-aar
- onnxruntime-java
- onnxruntime-java-gpu


### Motivation and Context
Previously, the signing was done manually.
Goal: make it automatic.
2024-09-27 17:50:06 +08:00
Tianlei Wu ff8a48ef3b
Update SAM2 benchmark script and doc (#22238)
(1) Fix a bug in parameter ordering.
(2) Update benchmark script:
* download the test image if it does not exist
* combine multiple csv files into one file, and remove duplicated lines
(3) Add a section on benchmarking to README.md
2024-09-26 20:57:03 -07:00
Scott McKay 3846f84218
Increase React Native E2E (#22230)
### Description
Increase the detox setup timeout to 4 minutes. 

The iOS RN E2E tests are taking around 2 minutes to set up, causing
flakiness.

### Motivation and Context
Improve RN CI pass rate
2024-09-27 08:59:36 +10:00
Tianlei Wu 2deab75d39
Add numeric_limits for float8 types (#22228)
Add std::numeric_limits for float8 data types to provide a consistent
way to access limits of those types.

Reference:
* https://onnx.ai/onnx/technical/float8.html
2024-09-26 14:42:36 -07:00
Jing Fang 1942e40e05
[ARM64] MatMulNBits: use neon instrinsics to convert between fp16 and fp32 (#22195)
### Description
For fp16 A type, the fallback operation converts the data to fp32 and
computes in fp32.
Added a NEON intrinsics version to speed up the conversion.

Store address alignment and loop unrolling have an insignificant impact on
latency, so they are omitted.

| Benchmark | Time | CPU |
|--------------|------------|------------|
| M_ConvertF16ToF32/baseline/real_time | 1076961 ns | 1083398 ns |
| M_ConvertF16ToF32/aligned:0/real_time | 46785 ns | 46516 ns |
| M_ConvertF16ToF32/aligned:1/real_time | 46631 ns | 46391 ns |
| M_ConvertF16ToF32_unroll2/aligned:0/real_time | 44074 ns | 44392 ns |
| M_ConvertF16ToF32_unroll2/aligned:1/real_time | 44726 ns | 45226 ns |
| M_ConvertF32ToF16/baseline/real_time | 520109 ns | 527329 ns |
| M_ConvertF32ToF16/aligned:0/real_time | 73610 ns | 74015 ns |
| M_ConvertF32ToF16/aligned:1/real_time | 71557 ns | 71525 ns |
| M_ConvertF32ToF16_unroll2/aligned:0/real_time | 64227 ns | 63374 ns |
| M_ConvertF32ToF16_unroll2/aligned:1/real_time | 67428 ns | 67989 ns |



### Motivation and Context
Speed up the fallback implementation of fp16 MatMulNBits.
2024-09-26 13:55:40 -07:00
jingyanwangms d0b0ecfdb9
[Running CI] Update TensorRT to 10.4 (#22049)
### Description
TensorRT 10.4 is GA now, update to 10.4



### Motivation and Context
2024-09-26 11:10:52 -07:00
Tianlei Wu 7880342e5e
Add numeric_limits for MLFloat16 and BFloat16 (#22197)
### Description
* Add std::numeric_limits for MLFloat16 and BFloat16.
* Update some comments in csharp ORTFloat16.shared.cs.
* Add unit tests (including Clip)

Note that the canonical NaN is not consistent in C++ and C#. C# uses
negative quiet NaN as canonical NaN, while C++ uses positive quiet NaN.
The choice of CSharp Float16.NaN is to be consistent with
System.Half.NaN.

FP16 data returned from CUDA might have 0x7FFF as NaN; FP16 data from the CPU
provider might have 0x7E00 as NaN. In any case, there is no consistent
canonical NaN in ORT right now. Because all these NaNs conform to the
IEEE spec, this should not be an issue downstream.
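
The fp16 NaN encodings mentioned above can be inspected with numpy (a quick check, not part of this change; numpy's canonical float16 NaN is 0x7E00 on typical platforms):

```python
import numpy as np

# Canonical quiet NaN produced by numpy: positive sign, top mantissa bit set.
print(hex(np.array([np.nan], dtype=np.float16).view(np.uint16)[0]))        # 0x7e00

# 0x7FFF (all mantissa bits set) is also a valid quiet NaN encoding.
print(np.isnan(np.array([0x7FFF], dtype=np.uint16).view(np.float16)[0]))   # True
```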

### Motivation and Context
std::numeric_limits is used in the codebase but not defined for MLFloat16
and BFloat16. This causes bugs like
https://github.com/microsoft/onnxruntime/issues/21957 introduced by
https://github.com/microsoft/onnxruntime/pull/21493.
2024-09-25 17:10:05 -07:00
liqun Fu 72b0979e8a
Fix a wrong assignment that causing mlas benchmark to crash (#22221)
### Description



### Motivation and Context

Signed-off-by: Liqun Fu <liqfu@microsoft.com>
2024-09-25 15:53:28 -07:00