Commit Graph

11662 Commits

Author SHA1 Message Date
Yi Zhang b94ba09e4f
Upgrade XNNPACK to latest version (#22012)
### Description
Update XNNPACK to the latest version (Sep 4)
- Some op outputs have changed; channel or stride params are moved into the
reshape func.
e.g.
96962a602d
- The input params of XNNPACK's resize-related functions have changed a lot
- KleidiAI is added as a dependency on ARM64
- The latest XNNPACK produces two static libs, microkernels-prod and
xnnpack.
Without microkernels-prod, linking fails with undefined-symbol errors.
- Add ORT_TARGET_PROCESSOR to get the real processor target in CMake
2024-09-17 10:12:16 -07:00
Jian Chen fa68ae2def
Update pool to MacOS-13 (#17361)
### Description
See https://github.com/microsoft/onnxruntime-extensions/pull/476
and https://github.com/actions/runner-images/issues/7671


### Current issue
- [ ] For the default Xcode 15.2 that comes with macOS-13, we need to
update the boost container header boost/container_hash/hash.hpp version
to pass the build
- [x] For Xcode 14.2, the build passed but the `Run React Native Detox
Android e2e Test` failed.
Possibly a flaky test: https://github.com/microsoft/onnxruntime/pull/21969
- [x] For Xcode 14.3.1, we encountered the following issue in `Build React
Native Detox iOS e2e Tests`:
```
ld: file not found: /Applications/Xcode_14.3.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/arc/libarclite_iphonesimulator.a
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
Applying the following code at the end of both ios/Podfile files fixed the
issue:
```
post_install do |installer|
    installer.generated_projects.each do |project|
        project.targets.each do |target|
            target.build_configurations.each do |config|
                config.build_settings['IPHONEOS_DEPLOYMENT_TARGET'] = '13.0'
            end
        end
    end
end
```


- [x] https://github.com/facebook/react-native/issues/32483

Applying the following change to ios/Podfile:
```
pre_install do |installer|
  # Custom pre-install script or commands
  puts "Running pre-install script..."

  # Recommended fix for https://github.com/facebook/react-native/issues/32483
  # from https://github.com/facebook/react-native/issues/32483#issuecomment-966784501
  system("sed -i '' 's/typedef uint8_t clockid_t;//' \"${SRCROOT}/Pods/RCT-Folly/folly/portability/Time.h\"")
end
```

- [ ] Detox environment setup exceeded the timeout of 120000 ms during the
iOS e2e test


### Dependent PRs

- [x] https://github.com/microsoft/onnxruntime/pull/21159

---------

Co-authored-by: Changming Sun <chasun@microsoft.com>
2024-09-17 10:07:30 -07:00
Chi Lo 6dcdc70aa7
[TensorRT EP] Add supportsModelV2 (#22081)
`supportsModel` is deprecated in TRT 10.1.
Add `supportsModelV2` but keep `supportsModel`, as we still need to
support TRT 8.6, where `supportsModelV2` is not
available.
2024-09-17 09:52:28 -07:00
Wanming Lin 9786909ab5
[WebNN EP] Support QuantizeLinear and DequantizeLinear ops (#22097) 2024-09-17 08:18:47 -07:00
Xu Xing afd642a194
[js/webgpu] Replace array with string in transpose perm (#21930)
Perf test data (100000 iterations):
Array: ~12.6 ms
String: ~1.6 ms

Perf test case:

```
const permFunctionBodyArray = (rank: number, input: string): string => {
  const reverseFunc: string[] = [];
  reverseFunc.push(`fn perm(i: int) -> int {
    var a: int;`);
  for (let i = 0; i < rank; ++i) {
    reverseFunc.push(input);
  }
  reverseFunc.push('return a;}');
  return reverseFunc.join('\n');
};

const permFunctionBodyString = (rank: number, input: string): string => {
  let reverseFunc = `fn perm(i: int) -> int {
    var a: int;`;
  for (let i = 0; i < rank; ++i) {
    reverseFunc += input;
  }
  reverseFunc += 'return a;}';
  return reverseFunc;
};

const count = 100000;
let start, end;

console.time('array');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyArray(3, 'input');
}
end = performance.now();
console.timeEnd('array');
console.log('Array: ' + (end - start));

console.time('string');
start = performance.now();
for (let i = 0; i < count; i++) {
  permFunctionBodyString(3, 'input');
}
end = performance.now();
console.timeEnd('string');
console.log('String: ' + (end - start));
```

2024-09-16 23:17:46 -07:00
Yang Gu 2db6b734f5
[js/webgpu] Fix issue to run model demucs (#22074)
This fixes issue #22031 so that the demucs model runs.
For ConvTranspose, outputPadding.length can be 1 while spatialRank
is 2; the fix appends enough 0s to outputPadding. Conv has a similar
issue: kernelShape.length can sometimes be 1 while
inputs[1].dims.length is 4, so the fix likewise appends enough 0s to
kernelShape.
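
A minimal C++ sketch of the padding idea (the actual fix lives in the
TypeScript WebGPU kernels; `PadToRank` is an illustrative name, not ORT code):
```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Append trailing zeros until the attribute has one entry per spatial dim,
// mirroring how the fix extends outputPadding / kernelShape.
std::vector<int64_t> PadToRank(std::vector<int64_t> attr, std::size_t spatial_rank) {
  while (attr.size() < spatial_rank) attr.push_back(0);
  return attr;
}

int main() {
  // outputPadding of length 1, spatial rank 2
  for (int64_t v : PadToRank({1}, 2)) std::cout << v << ' ';  // prints "1 0"
}
```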
2024-09-16 23:17:10 -07:00
Yulong Wang 291a5352b2
[js/web] remove training release (#22103)
### Description

Remove training from onnxruntime-web

Follow-up to #22082
2024-09-16 10:56:22 -07:00
Erick Muñoz e93f14e00d
Check partial conversion on FP16 to FP32 AVX Cast kernel (#22091)
### Description
Added checks to convert partial vectors in the early stages of the FP16
to FP32 cast using AVX NE CONVERT ISA.



### Motivation and Context
Avoids storing data outside of the output buffer; these
checks were missing in the [original
PR](https://github.com/microsoft/onnxruntime/pull/21183).
This fix prevents memory corruption when the output buffer has a size
in [n*16 + 1, n*16 + 7] for 0 < n.
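
A schematic C++ sketch of the control flow (the real kernel is hand-written
AVX_NE_CONVERT assembly; `ConvertOne` is a placeholder, not a real fp16 decode):
```cpp
#include <cstddef>
#include <cstdint>

// Placeholder for one fp16 -> fp32 conversion; the real kernel converts
// 16 elements per iteration with AVX_NE_CONVERT instructions.
static float ConvertOne(uint16_t half_bits) {
  return static_cast<float>(half_bits);  // not a real fp16 decode
}

void CastFp16ToFp32(const uint16_t* src, float* dst, std::size_t n) {
  const std::size_t kWidth = 16;  // elements per full vector store
  std::size_t full = n - (n % kWidth);
  for (std::size_t i = 0; i < full; i += kWidth)  // full-vector path
    for (std::size_t j = 0; j < kWidth; ++j)
      dst[i + j] = ConvertOne(src[i + j]);
  // Partial-vector tail: without this, sizes in [n*16 + 1, n*16 + 7]
  // would be handled by a full-width store that writes past dst + n.
  for (std::size_t i = full; i < n; ++i)
    dst[i] = ConvertOne(src[i]);
}
```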
2024-09-16 09:20:06 -07:00
George Wu 1a1669fe81
use node name in transpose optimizer when adding nodes rather than optype (#22084)
patch from @john-dance

"The main change is simple: Use the original node name rather than the
original node op_type when creating new nodes. Here are my comments on
the change:
------
The ONNX Runtime uses the op_type as the basis for a new node name, so a
node claimed by QNN EP might be named
Conv_token_1 with no relation to the original /conv1/Conv. This patch:
1. Adds OpName as a virtual function in NodeRef and implements it in
ApiNode.
2. AddNode now takes an op_name and op_type and passes them both to
CreateNodeHelper.
3. CreateNodeHelper uses the op_name rather than the op_type in
GenerateNodeName.
4. Direct calls to AddNode are modified to either use the NodeRef if
available, or just repeat the op_type if not available.
The result is that the new nodes are named something like
/conv1/Conv_token_1, allowing a straightforward mapping back to the
original model node (if it exists in the original graph)."
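
A hypothetical C++ sketch of the naming rule described above (illustrative
names; not the actual `CreateNodeHelper` signature):
```cpp
#include <iostream>
#include <string>

// Base the generated name on the original node's name when available,
// falling back to the op type (the old behavior) otherwise.
std::string GenerateNodeName(const std::string& op_name,
                             const std::string& op_type, int token) {
  const std::string& base = op_name.empty() ? op_type : op_name;
  return base + "_token_" + std::to_string(token);
}

int main() {
  std::cout << GenerateNodeName("/conv1/Conv", "Conv", 1) << "\n";  // /conv1/Conv_token_1
  std::cout << GenerateNodeName("", "Conv", 1) << "\n";             // Conv_token_1
}
```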
2024-09-16 09:12:13 -07:00
Adam Pocock 6d7235ba5a
[Java] Exposing SessionOptions.SetDeterministicCompute (#18998)
### Description
Exposes `SetDeterministicCompute` in Java, added to the C API by #18944.

### Motivation and Context
Parity between C and Java APIs.
2024-09-16 11:55:38 +10:00
Adam Pocock 02e00dc023
[java] Adding ability to load a model from a memory mapped byte buffer (#20062)
### Description
Adds support for constructing an `OrtSession` from a
`java.nio.ByteBuffer`. These buffers can be memory mapped from files
which means there doesn't need to be copies of the model protobuf held
in Java, reducing peak memory usage during session construction.

### Motivation and Context
Reduces memory usage on model construction by not requiring as many
copies on the Java side. Should help with #19599.
2024-09-16 08:31:55 +10:00
Wanming Lin c63dd0234b
[WebNN EP] Use opSupportLimits to dynamically check data type support (#22025)
- Remove hard code data type checks and use WebNN's opSupportLimits
instead
- Add HasSupportedOutputsImpl for output data type validation
- Get preferred layout info from opSupportLimits
- Move the Not op to logical_op_builder.cc because it belongs there. This
avoids inconsistent input names in `unary_op_builder.cc`.
2024-09-13 21:36:20 -07:00
liqun Fu a89bddd5c2
Matmul_nbits kernel for mlas sqnbits to support Fp16 inputs (#21807) 2024-09-13 14:55:08 -07:00
aciddelgado 7e2c722459
Add Continuous Decoding support in GQA (#21523)
### Description
This PR adds support for continuous decoding with batch_size = 1
input. From now on, GQA can take arbitrary-length input, using seqlens_k
as total_sequence_length - 1 and the sequence length of qkv as
new_sequence_length.

**This change will not affect the default behavior of GQA**



### Motivation and Context
Prior to this change it was impossible to support sequence_length > 1
inputs when past context was given. This use case is essential to making
continuous decoding work, which is one of our current efforts in
ORT-GenAI.
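
A small C++ sketch of the input convention described above (names are
illustrative, not the GQA kernel's actual API):
```cpp
#include <cstdint>
#include <iostream>

// With continuous decoding (batch_size = 1), the caller passes qkv with
// new_sequence_length tokens and sets seqlens_k to total_sequence_length - 1.
int32_t SeqlensKFor(int32_t past_sequence_length, int32_t new_sequence_length) {
  int32_t total_sequence_length = past_sequence_length + new_sequence_length;
  return total_sequence_length - 1;
}

int main() {
  // e.g. 100 cached tokens plus a 4-token chunk -> seqlens_k = 103
  std::cout << SeqlensKFor(100, 4) << "\n";
}
```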
2024-09-13 13:21:11 -07:00
Changming Sun 59b7b6bb7c
Remove training from web ci pipeline (#22082)
### Description
Remove training from web ci pipeline


2024-09-13 09:52:49 -07:00
Michael Tyler 904b850b44
Update Arm Compute Library Execution Provider (#22032)
### Description
This PR makes the following updates to the Arm Compute Library execution
provider:

- Target Arm Compute Library 24.07  
- Add support for the following operators: 
  - Conv (FP16) 
  - NhwcConv 
  - QLinearConv 
  - MatMul 
  - FusedMatMul 
  - MatMulIntegerToFloat 
- Optimize memory usage and performance
- Expose the enable_fast_math setting 
- Use the main runtime thread pool 



### Motivation and Context
These updates improve performance and memory usage, and enable use of a
more recent version of Arm Compute Library.

@microsoft-github-policy-service agree company="Arm Ltd"

---------

Signed-off-by: Michael Tyler <michael.tyler@arm.com>
2024-09-12 20:51:59 -07:00
Adam Pocock 22437b581b
[java] Fix for OnnxTensor creation when passing in a ByteBuffer containing elements of a different type (#21774)
### Description
Fixes a bug where the buffer offset and position were incorrectly
computed if the user supplied a `ByteBuffer` to `createTensor` but set
the type of the tensor to something other than `INT8`. This would be
more common if the user was trying to load the initializers from a
serialized representation and didn't want to bother with the type
information (which is the case in #21321).

### Motivation and Context
Partial fix for #21321. The remainder of the fix is to add a helper
which allows users to load initializers out of an `onnx_data` file, but
that will require adding protobuf as a dependency for the Java API to
allow the parsing of an ONNX file separately from the native code. It
might be nicer to put that functionality into ORT's C API so it can
return the lengths & offsets of the initializers when provided with an
ONNX file containing external initializers. We hit this kind of thing in
Java more often than other languages as in Java models can be supplied
as classpath resources which we can easily read, but not materialize on
disk for the ORT native library to read.
2024-09-13 12:38:17 +10:00
Adrian Lizarraga f7bf5a19ba
[QNN EP] Ensure QNN EP rejects nodes with I/O of dynamic shape (#22066)
### Description
Updates QNN EP to properly reject nodes that have inputs or outputs with
dynamic shapes.


### Motivation and Context
Currently, QNN EP does not properly offload subgraphs with dynamic
shapes to the CPU EP. This PR ensures that QNN EP rejects nodes that
consume or generate I/O with dynamic shapes.
2024-09-12 17:18:50 -07:00
mingyueliuh 55ab13e7ca
[VitisAI] support memory buffer contains the TensorProto external data (#22042)
### Description
Extend VitisAI EP `tensor_proto_as_raw` API to support memory buffer
containing the TensorProto external data


### Motivation and Context
To reduce peak memory usage, the VitisAI EP needs to support the ORT
format model and the session option
`session.use_ort_model_bytes_for_initializers`, which enables directly
using the model bytes for initializers.

Co-authored-by: mingyue <mingyue@xilinx.com>
2024-09-12 16:23:09 -07:00
0xdr3dd 5c361106e6
[Fuzzer] Add two new ORT libfuzzer (Linux clang support for now) (#22055)
### Description
This PR adds two new libfuzzer in fuzzer project.
1. Binary libfuzzer 
2. libprotobuf-fuzzer

To compile run below cmd on linux:
```
LLVM_PROFILE_FILE="%p.profraw" CFLAGS="-g -fsanitize=address,fuzzer-no-link -shared-libasan -fprofile-instr-generate -fcoverage-mapping" CXXFLAGS="-g -shared-libasan -fsanitize=address,fuzzer-no-link -fprofile-instr-generate -fcoverage-mapping" CC=clang CXX=clang++ ./build.sh --update --build --config Debug --compile_no_warning_as_error --build_shared_lib --skip_submodule_sync --use_full_protobuf  --parallel --fuzz_testing --build_dir build/
```
Run fuzzer:
```
LD_PRELOAD=$(clang -print-file-name=libclang_rt.asan-x86_64.so) build/Debug/onnxruntime_libfuzzer_fuzz  testinput -rss_limit_mb=8196 -max_total_time=472800 -fork=2 -jobs=4 -workers=4 -ignore_crashes=1 -max_len=2097152 2>&1 | grep -v "\[libprotobuf ERROR"
```


### Motivation and Context
The existing custom fuzzer is not coverage guided; it is slow, and it
works on one model mutation at a time. The new fuzzers are coverage
guided, and we can use more model files as a corpus to increase the
coverage.
2024-09-12 11:50:34 -07:00
wangshuai09 d539c27de8
Fix version check for using -mavxvnni (#21616)
### Description
Change the `CMAKE_CXX_COMPILER_VERSION` check so that `-mavxvnni` is only
used with GCC 11 or newer.


### Motivation and Context
Building `QgemmU8S8KernelAvx2.S` fails with `gcc (GCC) 10.3.1`:
`cc: error: unrecognized command-line option ‘-mavxvnni’; did you mean
‘-mavx512vnni’?`

`-mavxvnni` has been supported since the [GCC 11
release](https://gcc.gnu.org/gcc-11/changes.html), so this PR changes the
version check.
2024-09-12 11:42:17 -07:00
Clément Péron 10883d7997
Suppress GCC warning in TreeEnsembleAggregator (#22062)
### Description
When building with GCC 14.2.1, I got the following warning:

onnxruntime/core/providers/cpu/ml/tree_ensemble_aggregator.h:329:59:
error: template-id not allowed for constructor in C++20
[-Werror=template-id-cdtor]

Remove the template parameters from the constructor: the constructor
TreeAggregatorMax<InputType, ThresholdType, OutputType> has been
simplified to TreeAggregatorMax, because the compiler already knows the
template parameters from the class definition.
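
A minimal reproduction of the diagnostic and the fix (names simplified
from the actual tree_ensemble_aggregator.h):
```cpp
// Compile with: g++-14 -std=c++20 -c example.cc
template <typename InputType, typename ThresholdType, typename OutputType>
class TreeAggregatorMax {
 public:
  // GCC 14 with -std=c++20 rejects the old spelling:
  //   TreeAggregatorMax<InputType, ThresholdType, OutputType>() {}
  // error: template-id not allowed for constructor in C++20

  // Fixed: a plain constructor name; the template parameters are already
  // known from the class definition.
  TreeAggregatorMax() {}
};

int main() { TreeAggregatorMax<float, float, float> agg; (void)agg; }
```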

### Motivation and Context
Fix the build issue

Signed-off-by: Clément Péron <peron.clem@gmail.com>
2024-09-12 19:46:27 +02:00
Yulong Wang 84f73327f5
allow scalar axes for Unsqueeze for WebGPU (#22054)
### Description

Align with CPU behavior.


https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cpu/tensor/unsqueeze.cc#L60-L62
2024-09-12 10:33:37 -07:00
mindest 951b1b7160
[CI] Linux ROCm CI Pipeline: fix error, set trigger rules. (#22069)
### Description
* Correct the wrong EP name for ROCm, fix CI error.
* Update `set-trigger-rules.py`.
* Modify the .yml via `set-trigger-rules.py`
2024-09-12 09:54:32 -07:00
Yi Zhang ae39c40e5b
fix typo in iOS pipeline (#22067)
### Motivation and Context
The parameter isn't correct.
It may just not have had a negative impact so far by chance.

d8e64bb529/cmake/CMakeLists.txt (L1712-L1717)
2024-09-12 19:07:42 +08:00
Prathik Rao d495e6cf1c
adds support for Uint8ClampedArray (#21985)
Fixes https://github.com/microsoft/onnxruntime/issues/21753
2024-09-11 22:02:30 -07:00
Lennart Hannink d8e64bb529
Refactor CoreMLExecution to C++ bridge class (#21857)
Refactor Objective-C++ class `CoreMLExecution` into existing C++ bridge class `onnxruntime::coreml::Execution`.
2024-09-11 16:05:37 -07:00
sfatimar 0309c5f02f
Ovep release lnl 1.2.1 (#22027)
Error codes are added to catch compilation errors and signal a recompile.
Remote tensors are added to ensure direct memory access for NPU
inferencing.
The UMD bypass cache enabled with 2024.4 eliminates the need for disk
caching.

### Motivation and Context
The changes are needed to ensure backward compatibility.
UMD bypass caching eliminates driver caching.
Remote tensors lead to performance improvements when inferencing on NPU.

---------

Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: Srirammaswamy <srirammaswamy.s@intel.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
2024-09-11 14:55:40 -07:00
Jagadish Krishnamoorthy b800328628
[ROCm EP/ MIGraphx EP] matmul_nbits: Use GPU_WARP_SIZE_HOST for host side code (#22045)
### Description
For ROCm devices, the host-side code needs to call GPU_WARP_SIZE_HOST to
query the warpSize
of the underlying GPU device.
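
`GPU_WARP_SIZE_HOST` is an ORT-internal helper; a minimal standalone HIP
equivalent of the host-side query might look like this:
```cpp
#include <hip/hip_runtime.h>
#include <iostream>

int main() {
  // Query the warp size from device properties instead of hard-coding 64:
  // RDNA3 parts such as gfx1100/gfx1101 report a warpSize of 32.
  hipDeviceProp_t props;
  if (hipGetDeviceProperties(&props, /*deviceId=*/0) != hipSuccess) return 1;
  std::cout << "warpSize: " << props.warpSize << "\n";
}
```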

### Motivation and Context
Fixes MatMulNBits tests on gfx1100/01 which has warpSize of 32.

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
2024-09-11 14:52:18 -07:00
Bin Miao 4d82404544
[WebNN EP] Support GRU operator (#20405)
This PR adds support for the GRU operator to the WebNN EP.
@Honry, @fdwr thanks!
2024-09-11 14:16:36 -07:00
Xavier Dupré 91c916f9c6
Improve hash_function used by TreeEnsemble (#22043)
### Description
unordered_map is implemented differently on Visual Studio and gcc. It
seems that inserting consecutive keys has poor performance on Windows.



### Motivation and Context
Improve the performance of onnxruntime when initializing trees.
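
A sketch of the idea with a hypothetical mixing hash (the constants come
from MurmurHash3's finalizer; the PR's actual hash function may differ):
```cpp
#include <cstdint>
#include <unordered_map>

// Consecutive integer keys hash to scattered buckets instead of adjacent
// ones, which avoids the pathological insertion behavior seen with
// Visual Studio's std::unordered_map.
struct MixHash {
  std::size_t operator()(uint64_t x) const noexcept {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    return static_cast<std::size_t>(x);
  }
};

int main() {
  std::unordered_map<uint64_t, int, MixHash> node_index;
  for (uint64_t k = 0; k < 100000; ++k) node_index.emplace(k, 0);  // consecutive keys
}
```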
2024-09-11 10:41:04 -07:00
Yi-Hong Lyu e91ff9438b
Enable Pad->Conv(no pads) fusion (#22001)
### Motivation and Context
Some models have the pattern Pad -> Conv. If the Conv doesn't have a pads
attribute, the Pad can be fused into the Conv, as sketched below.
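
A conceptual C++ sketch of folding the Pad amounts into Conv's pads
attribute (illustrative only; the real fusion also validates pad mode,
constant value, and more):
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ONNX Pad stores [begin_0..begin_{r-1}, end_0..end_{r-1}] over all dims;
// Conv's pads covers only the spatial dims. Keep the spatial begins/ends
// and drop the batch/channel entries (which must be zero for the fusion).
std::vector<int64_t> FusePadIntoConvPads(const std::vector<int64_t>& pad_amounts,
                                         std::size_t spatial_rank) {
  std::size_t rank = pad_amounts.size() / 2;
  std::vector<int64_t> conv_pads;
  for (std::size_t i = rank - spatial_rank; i < rank; ++i)
    conv_pads.push_back(pad_amounts[i]);            // spatial begins
  for (std::size_t i = 2 * rank - spatial_rank; i < 2 * rank; ++i)
    conv_pads.push_back(pad_amounts[i]);            // spatial ends
  return conv_pads;
}
```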
2024-09-11 09:54:15 -07:00
Julius Tischbein 20d94648bb
ConvTranpose using CUDNN Frontend with NHWC support (#21752)
### Description
Added CUDNN Frontend and used it for NHWC ConvTranspose op including
option for bias fusion. Similar to this [Conv
PR](https://github.com/microsoft/onnxruntime/pull/19470)

### Backward compatible
If ORT is built with cuDNN 8, cuDNN frontend will not be built into
binary. Old kernels (using cudnn backend APIs) are used.

### Major Changes
For cuDNN 9, we will enable cudnn frontend to fuse data gradient
convolution and bias when a provider option fuse_conv_bias=1.

### Potential Issues
cuDNN frontend uses TF32 by default. It can be disabled using the
use_tf32 CUDA provider option, but if the cuDNN frontend encounters
issues building an operation graph, it will fall back to using TF32.

### Follow-ups
This is one of the PRs that target enabling NHWC by default in the CUDA
EP (here, the ConvTranspose operation) if the device supports it. Other
changes will follow to make this possible:
(1) Enable prefer_nhwc by default for device with sm >= 70.
(2) Change fuse_conv_bias=1 by default after more testing.
(3) Add other NHWC operators (like Resize or UpSample).

### Motivation and Context
The new CUDNN Frontend library provides the functionality to fuse
operations and provides new heuristics for kernel selection. Here it
fuses the convolution data gradient operation (ConvTranspose) with the
pointwise bias operation.

### Minor Change
There was a small bug in the CUDA convolution operation when
`GetCudnnConv1dPadToNc1d` was enabled.
2024-09-10 16:51:00 -07:00
PARK DongHa f633caa0b1
Create CMake option `onnxruntime_USE_VCPKG` (#21348)
### Changes

1. CMake option `onnxruntime_USE_VCPKG`. It will be used in the vcpkg
port
* Unit tests may fail because this option leads to a mixture of
unexpected external library versions.
     Especially the ONNX, Protobuf, and Flatbuffers versions can differ
2. Overhaul of `onnxruntime_external_deps.cmake`
   * Make `FetchContent_Declare` try `find_package` first.  
See
https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
* Relocated `FetchContent_Declare` and `FetchContent_MakeAvailable` (or
`onnxruntime_fetchcontent_makeavailable`) to adjacent lines.
It was too hard to navigate the entire file to find the related
sections...
* Alias `IMPORTED` targets like build targets (e.g. `ONNX::onnx` -->
`onnx`)

```cmake
# The script uses `find_package` with the changes.
# In this case, use vcpkg to search dependencies
# See https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
include(external/onnxruntime_external_deps.cmake)
```

3. Create CMakePresets.json and presets to [run vcpkg in manifest
mode](https://learn.microsoft.com/en-us/vcpkg/concepts/manifest-mode)
   * Currently, it's NOT for training build
   * Main triplets are `x64-windows` and `x64-osx`

```pwsh
Push-Location "cmake"
    cmake --preset "x64-windows-vcpkg"
    cmake --build --preset "x64-windows-vcpkg-debug"
Pop-Location
```
```bash
pushd "cmake"
    cmake --preset "x64-osx-vcpkg"
    cmake --build --preset "x64-osx-vcpkg-debug"
popd
```

4. Updated tools/ci_build/build.py
* `--use_vcpkg` option: it needs `CMAKE_TOOLCHAIN_FILE` with
[vcpkg.cmake toolchain
script](https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake)
* `--compile_no_warning_as_error` is recommended because library version
differences will cause unexpected compiler warnings

```bash
python ./tools/ci_build/build.py \
    --compile_no_warning_as_error \
    --use_vcpkg \
    --cmake_extra_defines "CMAKE_TOOLCHAIN_FILE:FILEPATH=${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" \
    --cmake_extra_defines "VCPKG_TARGET_TRIPLET=..."
```

5. Created Job `Vcpkg` for Windows and macOS
   * Show how to setup and use vcpkg.  
     Similar to the CMakePresets.json usage

### Motivation and Context

* Help #7150
* Help https://github.com/microsoft/vcpkg/pull/36850
   * https://github.com/luncliff/vcpkg-registry/pull/212
   * https://github.com/microsoft/vcpkg/pull/39881
* https://github.com/luncliff/vcpkg-registry/pull/215
   * https://github.com/luncliff/vcpkg-registry/pull/216
   * https://github.com/luncliff/vcpkg-registry/pull/227
* https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html
* https://github.com/microsoft/vcpkg/blob/master/scripts/buildsystems/vcpkg.cmake

### Future Works?

More feature coverage with the vcpkg supported libraries

* CUDA feature support
* Training feature support
2024-09-10 16:39:27 -07:00
kunal-vaishnavi c5418f35d4
Add fusions for re-designed Phi-3 vision and Phi-3.5 vision ONNX models (#22026)
### Description
This PR adds the optimizer logic to fuse the newly designed exported
ONNX models for Phi-3 vision and Phi-3.5 vision.

### Motivation and Context
After the re-designed export of Phi-3 vision and Phi-3.5 vision, the
ONNX models for the vision component and embedding component contain
`If` and `Loop` ops to handle multi-image support.
2024-09-10 16:18:05 -07:00
dependabot[bot] 19954decaf
Bump body-parser from 1.20.2 to 1.20.3 in /js/web (#22044) 2024-09-10 23:05:44 +00:00
jingyanwangms 4a5d66c15f
Default value 10.2->10.3 in linux-gpu-tensorrt-daily-perf-pipeline.yml (#21823)
### Description
Fix default value 10.2->10.3 in
linux-gpu-tensorrt-daily-perf-pipeline.yml

2024-09-10 15:26:16 -07:00
George Wu 31ae11788a
[QNN EP] Update QNN SDK to 2.26 (#22037)
* update default QNN SDK version to 2.26
* enable layernorm implicit bias workaround for QNN 2.26
* update artifact names for py win arm64 and arm64ec to re-enable
ort-qnn-nightly arm64 python packages
2024-09-10 14:03:06 -07:00
Sophie Schoenmeyer e7107f41de
Decrease API docs artifact retention days (#22003)
### Description
When API docs workflows fail, we typically don't catch the issue until
the most recently generated artifact expires. The current artifact
retention is 60 days, so by decreasing to 30 days, we can ensure that
we're resolving the workflow failures more quickly.



2024-09-10 10:44:08 -07:00
Erick Muñoz 7489bfee53
Enable AVX NE CONVERT for FP16 to FP32 cast (#21183)
### Description
Implementation of a new cast assembly kernel that uses AVX_NE_CONVERT
instructions to accelerate casting from FP16 to FP32. Added CPUID checks
to determine support of the ISA.

### Motivation and Context
Currently, FP16 models executed on systems that lack complete FP16
operator support use single precision on every node to run the model.
This means the original FP16 weights have to be cast to FP32 in order
to run the model properly. This change aims to accelerate the casting by
using upconvert instructions and thereby improve performance.
2024-09-09 21:19:31 -07:00
Jake Mathern d4d419f789
fix more dml warnings (#21980)
### Description
Fixes more warnings in DML execution provider that lead to security
issues in binskim


### Motivation and Context
OS components that include ORT must treat certain warnings as errors,
and cannot disable critical compiler warnings

https://github.com/microsoft/binskim/blob/main/src/BinSkim.Rules/PERules/BA2007.EnableCriticalCompilerWarnings.cs
2024-09-09 17:50:17 -07:00
Jian Chen 93c4c9cb6a
Using wostringstream only on Windows (#21938)
### Description
Using wostringstream only on Windows



### Motivation and Context
From line
[62](https://github.com/microsoft/onnxruntime/pull/21938/files#diff-47776d020ac08134de4059eab473550237f4999c598ab56afad3676d2f193edcR62),
`stream_` can currently be either `wostringstream` or `ostringstream`
depending on the OS; however, for Unix-like systems, `stream_` should
simply be `ostringstream`.
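
A minimal sketch of the platform-conditional type (assuming the usual
`_WIN32` guard; the PR's exact code may differ):
```cpp
#include <sstream>

// Wide streams are only needed to interoperate with Windows wide-string
// APIs; Unix-like systems use the narrow stream unconditionally.
#ifdef _WIN32
using CapturedStream = std::wostringstream;
#else
using CapturedStream = std::ostringstream;
#endif

int main() {
  CapturedStream stream_;
#ifdef _WIN32
  stream_ << L"captured";
#else
  stream_ << "captured";
#endif
}
```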
2024-09-09 13:20:17 -07:00
Adrian Lizarraga c7ae9b977a
[Quantization] Apply workaround for crash when using histogram-based calibrators (#21972)
### Description
- Applies a workaround that prevents the histogram-based calibrators
(percentile, entropy, distribution) from crashing. The workaround
involves copying inference outputs that come directly from model inputs.
A description of the bug is here:
https://github.com/microsoft/onnxruntime/issues/21922. **This PR does
not fix the root bug, but instead provides a workaround to _unblock_
users using histogram-based calibration.**
- Adds a unit test that runs all histogram-based calibrators to help
catch future regressions. We didn't have unit tests that ran these
calibration methods.

### Motivation and Context
Trying to quantize a model with the percentile, entropy, or distribution
calibration methods raises an exception:
```shell
  File "/.../site-packages/onnxruntime/quantization/quantize.py", line 691, in quantize
    quantize_static(
  File "/.../site-packages/onnxruntime/quantization/quantize.py", line 525, in quantize_static
    calibrator.collect_data(calibration_data_reader)
  File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 571, in collect_data
    self.collector.collect(clean_merged_dict)
  File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 746, in collect
    return self.collect_value(name_to_arr)
  File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 836, in collect_value
    hist, hist_edges = np.histogram(data_arr, self.num_bins, range=(-threshold, threshold))
  File "<__array_function__ internals>", line 180, in histogram
  File ".../site-packages/numpy/lib/histograms.py", line 793, in histogram
    bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
  File "/.../site-packages/numpy/lib/histograms.py", line 426, in _get_bin_edges
    first_edge, last_edge = _get_outer_edges(a, range)
  File "/.../site-packages/numpy/lib/histograms.py", line 315, in _get_outer_edges
    raise ValueError(
ValueError: supplied range of [nan, nan] is not finite
```

The calibrators create an augmented model with all tensors (including
model inputs) set as model outputs. The data for outputs that are also
model inputs is corrupted as described in
https://github.com/microsoft/onnxruntime/issues/21922. The corrupted
data sometimes contains `NaN` values that cause numpy's histogram
utilities to raise an exception.
2024-09-09 12:05:41 -07:00
Peishen Yan 2cdc05f189
Move Gelu and LayerNorm fusion to L1 optimization (#21332)
According to https://github.com/microsoft/onnxruntime/issues/20915, we
move the Gelu and LayerNorm fusion to L1 with a condition on the ONNX
opset the model imports (LayerNorm requires opset 16+ and Gelu requires
opset 20+). If the opset version doesn't meet the requirements, the
fusion is delayed to L2 optimization, since the internal contrib op
doesn't require any specific ONNX opset.
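
A sketch of the opset gate, using the thresholds stated above
(hypothetical helper, not the optimizer's real interface):
```cpp
#include <cstring>

// Run the fusion at L1 only if the model's ONNX opset already defines the
// target op; otherwise defer to L2, where the contrib op carries no ONNX
// opset requirement.
bool CanFuseAtLevel1(const char* fused_op, int onnx_opset) {
  if (std::strcmp(fused_op, "LayerNormalization") == 0) return onnx_opset >= 16;
  if (std::strcmp(fused_op, "Gelu") == 0) return onnx_opset >= 20;
  return false;
}
```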

---------

Co-authored-by: Scott McKay <Scott.McKay@microsoft.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
2024-09-09 13:27:52 +10:00
Yi Zhang de7a02beef
Add parameter for flexdownload (#22009)
### Motivation and Context
This lets us run the `Nuget_Packaging_GPU` stage directly.
2024-09-08 14:17:55 +08:00
Wanming Lin ad9afbb042
[WebNN EP] Remove workaround for CPU op supported list (#21962)
We assume all WebNN ops are supported across all backends.
2024-09-06 22:14:52 -07:00
Edward Chen f3725b9f06
Use output variable from InstallAppleProvisioningProfile task to set provisioning profile UUID. (#22018)
This is more flexible than hardcoding the provisioning profile name or UUID. The name shouldn't usually change but it is not guaranteed to remain constant.
2024-09-06 18:00:34 -07:00
zz002 28b550f091
[VitisAI] Add processing for sessionOptions.AppendExecutionProvider("VitisAI", options) (#21839)
### Description

[VitisAI] Add processing for
sessionOptions.AppendExecutionProvider("VitisAI", options)


---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
2024-09-06 14:06:33 -07:00
Arne H Juul 493159b481
near-zero negative values must convert to 0 not NAN (#18473)
for the Float8 types with unsigned zero, we must clear the sign bit when
rounding to zero;
otherwise we end up with 0x80 which is the encoding for NAN.

### Description
Handle all zero and near-zero values the same way, rounding to positive
zero.
Note that I removed one "if" level but did not re-indent the code in
this PR, to make it
easier to see what the actual changes are.

### Motivation and Context
For the two new 8-bit floating point types Float8E4M3FNUZ and
Float8E5M2FNUZ,
converting from a near-zero negative value would end up with the sign
bit set only;
this bit pattern is not negative zero but instead means NAN.
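
A simplified C++ illustration of the rule (hypothetical threshold and
encoder; only the sign-bit handling reflects the actual change):
```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

// In the FNUZ formats there is no negative zero: bit pattern 0x80 means
// NaN. So anything that rounds to zero must be encoded as +0 (0x00),
// even when the input was a tiny negative value.
uint8_t EncodeNearZeroFnuz(float v) {
  if (std::fabs(v) < 1e-5f)  // stand-in for "rounds to zero" in the format
    return 0x00;             // clear the sign bit; 0x80 would decode as NaN
  return 0x01;               // placeholder for the normal encoding path
}

int main() {
  std::cout << std::hex << int(EncodeNearZeroFnuz(-1e-9f)) << "\n";  // prints 0
}
```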
2024-09-06 11:41:48 -07:00
Arne H Juul 605a84ffc9
remove unused and confusing float16 constants (#21999)
### Description
Remove unused and confusing special constants in MLFloat16 and BFloat16
types.

### Motivation and Context
While looking at adding a specialization for std::numeric_limits for the
16-bit floating point types, I found that there are various special
constants in those types that are confusing or just wrong.

MLFloat16::Epsilon is not an epsilon at all, but approximates "e". Looks
like a copy-paste bug.
BFloat16::Epsilon does not correspond to `numeric_limits::epsilon()`,
nor even to the C# Float.Epsilon.
Instead, it corresponds to `numeric_limits::min()`, which was really
confusing to me.

The "MinValue" constant does correspond to the C# `Float.MinValue`
constant, but this is C++, so it would be better renamed to "LowestValue"
since it corresponds to `numeric_limits::lowest()`. As it was unused
except for some unit tests, I have replaced it with the equivalent
`MaxValue.Negate()` here.

There's also an unused `kSignaling_NaNBits` constant which is just wrong
(it has the same value as `kPositiveInfinityBits` instead of a NaN).
2024-09-05 22:00:48 -07:00