Commit Graph

16 Commits

Author SHA1 Message Date
Yulong Wang abdc31de40
[js] change default formatter for JavaScript/TypeScript from clang-format to Prettier (#21728)
### Description

See
454996d496
for manual changes (excluded auto-generated formatting changes)

### Why

The toolset around the old clang-format setup is out of date, which hurts
development efficiency:

- The NPM package `clang-format` is in maintenance mode and has not been
updated in two years.
- The VSCode extension for clang-format has not been maintained for a
while, and a recent Node.js security update broke it entirely on Windows.

No one in the community seems interested in fixing these issues.

Prettier was chosen because it is the most popular TS/JS formatter.

### How to merge

It's easy to break the build:
- Watch out for any new commits on main that are not included in this PR.
- Watch out for other PRs that already passed CI and could merge after
this PR lands.

So, make sure there are no new commits on main before merging this one,
and invalidate JS PRs that already passed CI, forcing them to update to
the latest main before they merge.
2024-08-14 16:51:22 -07:00
Scott McKay b0e1f7f798
CoreML: Aggregated changes to add all required ops for priority model (#21472)
### Description
Add these changes in one PR to simplify check-in:
- Add Concat (#21423)
- Add DepthToSpace (#21426)
- Add LeakyRelu (#21453)
- Add test scripts (#21427)
- Add ability to set coreml flags from python (#21434)


Other changes:
- Updated the partitioning utils to support dropping constant
initializers from a ComputeCapability's inputs (a rough sketch of the
idea follows this list).
- Noticed that the list of inputs to the CoreML model was unexpectedly
long because of these initializers.
- We copy constant initializers into the CoreML model, so the originals
are not needed; if they remain listed as inputs, ORT can't free them
because they appear to still be in use.
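
The sketch below is only a Python illustration of that trimming idea; the
real change lives in ORT's C++ partitioning utilities, and the helper name
here is hypothetical.

```python
# Hypothetical sketch: drop constant initializers that were already copied
# into the CoreML model from the partition's reported inputs, so ORT can
# release the original buffers.
def trim_compute_capability_inputs(inputs, copied_constant_initializers):
    copied = set(copied_constant_initializers)
    return [name for name in inputs if name not in copied]

# Weights/biases baked into the CoreML model no longer need to be inputs.
inputs = ["image", "conv1.weight", "conv1.bias"]
constants = ["conv1.weight", "conv1.bias"]
print(trim_compute_capability_inputs(inputs, constants))  # ['image']
```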

### Motivation and Context
2024-07-26 08:29:33 +10:00
aciddelgado ebd0368bb0
Make Flash Attention work on Windows (#21015)
### Description
Previously, Flash Attention only worked on Linux. This PR enables it to
be built and run on Windows as well.

Limitation of Flash Attention on Windows: it requires CUDA 12.

### Motivation and Context
This will significantly increase the performance of Windows-based LLMs on
hardware with SM >= 80.

To illustrate the improvement of Flash Attention over Memory Efficient
Attention, here are some average benchmark numbers for the GQA operator,
run with configurations based on several recent models (Llama, Mixtral,
Phi-3). The benchmarks were obtained on an RTX 4090 GPU using the test
script at
`onnxruntime/test/python/transformers/benchmark_gqa_windows.py`.

* Clarifying note: these benchmarks cover only the GQA operator, not the
entire model.
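
For reference, the two metrics in the tables below are the average
per-run latency ("inference interval") and samples processed per second
("throughput"). The snippet below is a generic timing sketch of how such
metrics are commonly measured, not the actual benchmark_gqa_windows.py
script; `run_once` is a placeholder for an ONNX Runtime session run of
the GQA operator.

```python
# Generic benchmarking sketch (not the actual benchmark script).
import time

def benchmark(run_once, batch_size, warmup=10, iters=100):
    """Return average per-run latency (ms) and throughput (samples/s)."""
    for _ in range(warmup):              # warm up kernels and allocators
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    interval = (time.perf_counter() - start) / iters
    return {"inference_interval_ms": interval * 1e3,
            "throughput_samples_per_s": batch_size / interval}

# Dummy stand-in for a GQA session run:
print(benchmark(lambda: sum(range(10_000)), batch_size=1))
```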

### Memory Efficient Attention Kernel Benchmarks:
| Model Name | Max Sequence Length | Inference Interval (ms) | Throughput (samples/second) |
|----------------------------------------|---------------------|-------------------------|-----------------------------|
| Llama3-8B (Average Prompt) | 8192 | 0.19790525 | 13105.63425 |
| Llama3-8B (Average Token) | 8192 | 0.207775538 | 12025.10172 |
| Llama3-70B (Average Prompt) | 8192 | 0.216049167 | 11563.31185 |
| Llama3-70B (Average Token) | 8192 | 0.209730731 | 12284.38149 |
| Mixtral-8x22B-v0.1 (Average Prompt) | 32768 | 0.371928785 | 7031.440056 |
| Mixtral-8x22B-v0.1 (Average Token) | 32768 | 0.2996659 | 7607.947159 |
| Phi-3-mini-128k (Average Prompt) | 131072 | 0.183195867 | 15542.0852 |
| Phi-3-mini-128k (Average Token) | 131072 | 0.198215688 | 12874.53494 |
| Phi-3-small-128k (Average Prompt) | 65536 | 2.9884929 | 2332.584142 |
| Phi-3-small-128k (Average Token) | 65536 | 0.845072406 | 2877.85822 |
| Phi-3-medium-128K (Average Prompt) | 32768 | 0.324974429 | 8094.909517 |
| Phi-3-medium-128K (Average Token) | 32768 | 0.263662567 | 8978.463687 |

### Flash Attention Kernel Benchmarks:
| Model Name | Max Sequence Length | Inference Interval (ms) | Throughput (samples/second) |
|--------------------------------------|---------------------|-------------------------|-----------------------------|
| Llama3-8B (Average Prompt) | 8192 | 0.163566292 | 16213.69057 |
| Llama3-8B (Average Token) | 8192 | 0.161643692 | 16196.14715 |
| Llama3-70B (Average Prompt) | 8192 | 0.160510375 | 17448.67753 |
| Llama3-70B (Average Token) | 8192 | 0.169427308 | 14702.62043 |
| Mixtral-8x22B-v0.1 (Average Prompt) | 32768 | 0.164121964 | 15618.51301 |
| Mixtral-8x22B-v0.1 (Average Token) | 32768 | 0.1715865 | 14524.32273 |
| Phi-3-mini-128k (Average Prompt) | 131072 | 0.167527167 | 14576.725 |
| Phi-3-mini-128k (Average Token) | 131072 | 0.175940594 | 15762.051 |
| Phi-3-small-128k (Average Prompt) | 65536 | 0.162719733 | 17824.494 |
| Phi-3-small-128k (Average Token) | 65536 | 0.14977525 | 16749.19858 |
| Phi-3-medium-128K (Average Prompt) | 32768 | 0.156490786 | 17679.2513 |
| Phi-3-medium-128K (Average Token) | 32768 | 0.165333833 | 14932.26079 |

Flash Attention is consistently faster for every configuration we
benchmarked, with improvements in our trials ranging from ~20% to ~650%.

In addition to these improvements in performance, Flash Attention has
better memory usage. For example, Memory Efficient Attention cannot
handle a max sequence length higher than 32,768, but Flash Attention can
handle max sequence lengths at least as high as 131,072.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2024-06-24 09:43:49 -07:00
Chen Fu 6fb09055d4
Adding a sm80 q4 gemm kernel for small tiles (#20545)
### Description

Implementation of a q4 GEMM CUDA kernel for small tiles and small
sequence_len or batch_size (<= 16).

### Performance Test Results

| Problem Shape | New Kernel | | | Current Kernel | |
| ------------------: | ----------- | ------- | -- | ------------- | ------- |
| **(M x N x K)** | **Latency (ms)** | **GFLOPS** | | **Latency (ms)** | **GFLOPS** |
| 1 x 3072 x 3072 | 0.008124 | 2310.93 | | 0.017231 | 1095.39 |
| 16 x 3072 x 3072 | 0.011263 | 26813.7 | | 0.017431 | 17325.4 |
| 32 x 3072 x 3072 | 0.018559 | 32544.3 | | 0.079493 | 7597.89 |
| 64 x 3072 x 3072 | 0.030364 | 39782.1 | | 0.079387 | 15216 |
| 1024 x 3072 x 3072 | 0.387194 | 49916.5 | | 0.080849 | 239054 |
| | | | | | |
| 1 x 3072 x 9216 | 0.015734 | 3598.77 | | 0.043404 | 1304.55 |
| 16 x 3072 x 9216 | 0.023611 | 38371.3 | | 0.043388 | 20859.1 |
| 32 x 3072 x 9216 | 0.038652 | 46878 | | 0.224353 | 8076.31 |
| 64 x 3072 x 9216 | 0.072334 | 50099.5 | | 0.224338 | 16153.6 |
| 1024 x 3072 x 9216 | 1.02872 | 56363.2 | | 0.231284 | 250696 |
| | | | | | |
| 1 x 8192 x 3072 | 0.015787 | 3188.18 | | 0.017714 | 2841.28 |
| 16 x 8192 x 3072 | 0.025933 | 31053.3 | | 0.017919 | 44942.2 |
| 32 x 8192 x 3072 | 0.042633 | 37778.9 | | 0.079407 | 20282.9 |
| 64 x 8192 x 3072 | 0.070061 | 45977.5 | | 0.079531 | 40502.8 |
| 1024 x 8192 x 3072 | 1.01264 | 50896.3 | | 0.237244 | 217243 |
| | | | | | |
| 1 x 3072 x 8192 | 0.014444 | 3484.56 | | 0.038961 | 1291.85 |
| 16 x 3072 x 8192 | 0.020433 | 39411.8 | | 0.039056 | |
| 32 x 3072 x 8192 | 0.03459 | 46563.5 | | 0.200189 | 8045.47 |
| 64 x 3072 x 8192 | 0.063319 | 50873.4 | | 0.20029 | 16082.8 |
| 1024 x 3072 x 8192 | 0.928282 | 55521.5 | | 0.205883 | 250334 |
| | | | | | |
| 1 x 5120 x 5120 | 0.014573 | 3597.79 | | 0.02604 | 2013.42 |
| 16 x 5120 x 5120 | 0.025638 | 32719.5 | | 0.026194 | 32024.4 |
| 32 x 5120 x 5120 | 0.037421 | 44834.2 | | 0.127676 | 13140.4 |
| 64 x 5120 x 5120 | 0.065593 | 51155.9 | | 0.127706 | 26274.8 |
| 1024 x 5120 x 5120 | 1.00217 | 53570.9 | | 0.256388 | 209398 |
| | | | | | |
| 1 x 17920 x 5120 | 0.053868 | 3406.49 | | 0.04715 | 3891.84 |
| 16 x 17920 x 5120 | 0.071952 | 40805.1 | | 0.049755 | 59009.3 |
| 32 x 17920 x 5120 | 0.123657 | 47486.3 | | 0.129812 | 45234.8 |
| 64 x 17920 x 5120 | 0.222113 | 52874.2 | | 0.129781 | 90491.6 |
| 1024 x 17920 x 5120 | 3.50124 | 53668.1 | | 0.770569 | 243852 |
| | | | | | |
| 1 x 1280 x 5120 | 0.007029 | 1864.66 | | 0.025954 | 505.027 |
| 16 x 1280 x 5120 | 0.008122 | 25821.6 | | 0.025953 | 8080.59 |
| 32 x 1280 x 5120 | 0.012498 | 33558.7 | | 0.127618 | 3286.62 |
| 64 x 1280 x 5120 | 0.022049 | 38044.6 | | 0.127762 | 6565.81 |
| 1024 x 1280 x 5120 | 0.258547 | 51912.4 | | 0.128425 | 104511 |
| | | | | | |
| 1 x 5120 x 17920 | 0.049096 | 3737.59 | | 0.109703 | 1672.7 |
| 16 x 5120 x 17920 | 0.073145 | 40139.7 | | 0.110608 | 26544.3 |
| 32 x 5120 x 17920 | 0.11405 | 51486.3 | | 0.430942 | 13626 |
| 64 x 5120 x 17920 | 0.210022 | 55918.1 | | 0.430948 | 27251.7 |
| 1024 x 5120 x 17920 | 4.571 | 41108 | | 0.860118 | 218464 |
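
The GFLOPS figures above appear to follow the usual 2·M·N·K / latency
convention for GEMM throughput; the quick sanity check below (not part of
the PR) reproduces one row of the table to within the rounding of the
reported latency.

```python
# Sanity check of the GFLOPS convention: an (M x N x K) GEMM performs
# roughly 2*M*N*K floating-point operations.
def gemm_gflops(m, n, k, latency_ms):
    return 2 * m * n * k / (latency_ms * 1e-3) / 1e9

# "16 x 3072 x 3072" with the new kernel: 0.011263 ms -> ~26813 GFLOPS,
# consistent with the table's 26813.7 (small differences come from rounding).
print(round(gemm_gflops(16, 3072, 3072, 0.011263), 1))
```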
2024-06-12 16:02:26 -07:00
Scott McKay 9372e9a0a3
Support >2GB of Tensor data in training checkpoint (#20077)
### Description
Add the ability to store initializer data in an external file.
Update the training checkpoint code to use an external file if the data
is larger than ~2GB.

I don't see a way for the flatbuffers 64-bit offsets to be used, as they
don't support storing 'table' types with 64-bit offsets (and our Tensor
is a 'table' type not a simple struct).


0cfb7eb80b/tests/64bit/test_64bit.fbs (L38-L39)

Allowing a Tensor to have its raw_data in an external file should
hopefully work with the least friction. Since it's an extra field, it's
backwards compatible.
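
A minimal sketch of the external-data idea, in Python for illustration
only (this is not ORT's actual checkpoint format or API): large raw_data
is appended to a side file, and the checkpoint entry keeps just an offset
and length.

```python
# Illustrative sketch of the external-data approach (not ORT's real format).
import numpy as np

def write_external(tensors, data_path, threshold=2 * 1024**3):
    """Write large tensors to a side file; return per-tensor metadata."""
    index = {}
    with open(data_path, "wb") as f:
        for name, array in tensors.items():
            raw = array.tobytes()
            if len(raw) < threshold:
                continue  # small tensors stay inline in the checkpoint
            index[name] = (f.tell(), len(raw), str(array.dtype), array.shape)
            f.write(raw)
    return index  # stored in the checkpoint in place of raw_data

def read_external(index, data_path, name):
    offset, length, dtype, shape = index[name]
    with open(data_path, "rb") as f:
        f.seek(offset)
        return np.frombuffer(f.read(length), dtype=dtype).reshape(shape)
```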

Please feel free to suggest alternative approaches. 

Side note: the diffs in the generated *.fbs.h files are unexpectedly
large. Maybe they weren't re-generated when the new flatbuffers version
was checked in. I updated them by running
`python .\compile_schema.py -f <build output
dir>\_deps\flatbuffers-build\Debug\flatc.exe`
from onnxruntime\core\flatbuffers\schema, which I thought was the correct
way, but maybe that's out of date.

I think you can ignore all the diffs in the generated files and just
focus on the changes to the .fbs files in
onnxruntime/core/flatbuffers/schema. Basically, start at the bottom of
the changed files and work up, as all the 'real' diffs are there.

### Motivation and Context

---------

Co-authored-by: carzh <wolfivyaura@gmail.com>
2024-04-22 15:17:43 -07:00
Chen Fu 06e684c9f2
Adding cuda kernel (optimized for sm80) for block-wise 4b quantized float 16 GEMM. (#18619)
### Description
Adding a CUDA kernel for block-wise 4-bit quantized float16 GEMM,
specially optimized for NVIDIA Ampere GPUs.
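
As a rough NumPy reference for what block-wise 4-bit weight quantization
means here (illustrative math only, not the CUTLASS-based kernel added in
this PR): weights are split into blocks along K, each block carries its
own scale, and the GEMM dequantizes them on the fly. The block size of 32
below is an assumption for the example.

```python
# NumPy reference for block-wise 4-bit quantization (illustration only).
import numpy as np

BLOCK = 32  # assumed quantization block size along K

def quantize_q4(w):                # w: (K, N) fp16 weights, K % BLOCK == 0
    blocks = w.reshape(-1, BLOCK, w.shape[1])        # (K/BLOCK, BLOCK, N)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def gemm_q4(a, q, scales):         # a: (M, K) fp16 activations
    w_deq = (q * scales).reshape(-1, q.shape[2]).astype(np.float16)
    return a @ w_deq               # the real kernel fuses dequant and GEMM

a = np.random.randn(16, 3072).astype(np.float16)
w = np.random.randn(3072, 3072).astype(np.float16)
q, s = quantize_q4(w)
print(np.abs(gemm_q4(a, q, s) - a @ w).max())  # error from 4-bit quantization
```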


### Motivation and Context
Trying to improve quantized LLM inference performance on Nvidia Ampere
GPUs

### Note:
This is implemented by extending CUTLASS, so it has a hard dependency on
CUTLASS. However, in the current build system, loading of the CUTLASS
dependency is guarded by:

(onnxruntime_USE_FLASH_ATTENTION OR
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION)

If both of these options are turned off, compilation will fail.

Why is the CUTLASS dependency guarded at all? It's a header-only library
that does not introduce any binary code unless instantiated. What's the
downside of removing all the guards and including CUTLASS
unconditionally?
2024-03-05 09:37:45 -08:00
pengwa 2c6b31c5aa
FP16 optimizer automatically detect DeepSpeed compatibility (#18084)
### FP16 optimizer automatically detects DeepSpeed compatibility

Optimum/Transformers use the accelerate library to prepare models, so our
FP16 optimizer wrapper has not worked for a long time: the optimizer is
wrapped as `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper`, which
underneath still calls into the DeepSpeed stage 1 and 2 optimizer.

This PR includes the following changes:
1. Add `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper` to the
modifier registry, plus a check that its contained `optimizer` property
MUST be a DeepSpeed stage 1 or 2 optimizer. (Let's cover the stage 3
optimizer later.)
2. For DeepSpeed versions > 0.9.1, we store the relevant source code in a
version list. As long as the related function in DeepSpeed remains
unchanged in a new release, we won't need to manually update the version
check any more. If some day the source code no longer matches, a warning
is raised asking users to add a new version of the source code to the
list (a rough sketch of this check follows below).

With the above changes, our FP16 optimizer works again in Optimum.
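
A rough sketch of the version-tolerance check described in item 2, with
hypothetical helper and variable names (this is not the actual
onnxruntime.training implementation): the current source of the wrapped
DeepSpeed function is compared against a list of known-good snapshots,
and a warning is raised when none of them match.

```python
# Hypothetical sketch of the "known source" compatibility check.
import inspect
import warnings

# One source snapshot per verified DeepSpeed release (placeholder content).
KNOWN_SOURCES = {
    "0.9.2": "def example_deepspeed_function(...): ...",
}

def is_function_unchanged(func):
    """Warn (instead of failing) when the DeepSpeed source has drifted."""
    current = inspect.getsource(func)
    if current not in KNOWN_SOURCES.values():
        warnings.warn(
            "DeepSpeed function source does not match any verified version; "
            "the FP16 optimizer modifier may be incompatible. Please add the "
            "new source snapshot to the known-version list."
        )
        return False
    return True
```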


![image](https://github.com/microsoft/onnxruntime/assets/10530022/d35b4aa9-b371-46f1-98ae-73114f91179b)
2023-10-25 15:11:02 +08:00
Justin Chu be7541ef4a
[Linter] Bump ruff and remove pylint (#17797)
Bump the ruff version and remove pylint from the linter list. Fix any new
errors detected by ruff.

### Motivation and Context

Ruff covers many of the pylint rules. Since pylint is not enabled in
this repo and runs slowly, we remove it from the linter list.
2023-10-05 21:07:33 -07:00
Justin Chu eeef157888
Format c++ code under `winml/` (#16660)
winml/ was previously excluded from the lintrunner config. This change
includes the directory and adds a clang-format config file specific to
winml/ that fits the existing style.

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-07-25 21:56:50 -07:00
Baiju Meswani 10ba1e270c
Minimal Build for On-Device Training (#16326)
🛠️ __Changes in this pull request:__

This pull request introduces two significant changes to the project:

- Changing the on-device training checkpoint format: The current
implementation stores the on-device training checkpoint as a sequence of
tensors in multiple files inside a checkpoint folder, which can be
inefficient in terms of storage and performance. In this PR, I have
modified the checkpoint format to use a flatbuffer table to save the
checkpoint to a single file, providing a more compact and efficient
representation (a toy sketch of the single-file idea follows this list).
The changes around this are twofold:
- Add the checkpoint flatbuffer schema that will generate the necessary
checkpoint source files.
- Update the checkpoint saving and loading functionality to use the new
format.

- Adding support for onnxruntime minimal build: To support scenarios
where binary size is a constraint, I made changes to ensure that the
training build can work well with the minimal build.
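
A toy illustration of the folder-of-files versus single-file idea, using
plain Python and a JSON index purely for readability (the real
implementation uses the flatbuffer schema mentioned above):

```python
# Toy sketch: consolidate per-tensor files into one checkpoint file with a
# small index header (not the actual flatbuffer-based format).
import json
import numpy as np

def save_single_file(tensors, path):
    index, payload = {}, bytearray()
    for name, array in tensors.items():
        index[name] = {"dtype": str(array.dtype), "shape": list(array.shape),
                       "offset": len(payload), "nbytes": array.nbytes}
        payload += array.tobytes()
    header = json.dumps(index).encode()
    with open(path, "wb") as f:
        f.write(len(header).to_bytes(8, "little"))  # header length prefix
        f.write(header)
        f.write(payload)

save_single_file({"fc.weight": np.zeros((4, 4), np.float32)}, "checkpoint.bin")
```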

🔍 __Open Issues:__
- In order to extract the optimizer type, the existing implementation
re-loaded the ONNX optimizer model and parsed it. This is no longer
possible, since the model format can be either ONNX or ORT. One idea is
to do the same for the ORT-format optimizer model. This needs some
investigation.
- Changes to the offline tooling to generate ORT-format training
artifacts.
- End-to-end training example showcasing the use of the minimal training
build.
- Add support for exporting a model for inferencing in a minimal build.
2023-06-22 12:27:23 -07:00
Justin Chu 76ddc92fbd
Enable RUFF as a formatter (#15699)
### Description

Ruff can now act as a formatter as of lintrunner-adapters v0.8. Removed
the RUFF-FIX linter.



### Motivation and Context

Better engineering
2023-04-26 14:04:07 -07:00
Justin Chu 1f7c2f724f
Fix lintrunner configurations (#15586)
### Description

- Fix lintrunner configurations to always use `python` instead of
`python3`.
- Set up dependabot
- Moved dependencies to requirements-lintrunner to allow dependabot to
update it similar to https://github.com/onnx/onnx/pull/5124
2023-04-20 08:54:26 -07:00
Justin Chu cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or are data-heavy.


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00
Justin Chu a36caba073
Bump ruff in CI (#15533)
### Description

Bump the ruff version in CI and fix the new lint errors.

- This change enables the flake8-implicit-str-concat rules, which help
detect unintended string concatenations (example after this list):
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update .gitignore to include common Python files that we want to
exclude.
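
For illustration, this is the kind of bug the implicit-str-concat rules
catch (a generic example, not code from this repo):

```python
# A missing comma silently concatenates two adjacent string literals.
providers = [
    "CPUExecutionProvider"
    "CUDAExecutionProvider",  # becomes one string, not two list entries
]
assert providers == ["CPUExecutionProviderCUDAExecutionProvider"]

# Intended version:
providers = ["CPUExecutionProvider", "CUDAExecutionProvider"]
```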


### Motivation and Context

Code quality
2023-04-17 10:11:44 -07:00
Justin Chu e754edaecf
Run rustfmt in CI (#15217)
I considered running clippy as well but ort takes too long to build
2023-03-27 08:12:59 -07:00
Justin Chu d834ec895a
Adopt linrtunner as the linting tool - take 2 (#15085)
### Description

`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and macOS.
The checks are enforced by the `Python format` workflow.

This PR adopts `lintrunner` for onnxruntime and fixes ~2000 flake8 errors
in Python code. `lintrunner` now runs all required Python lints,
including `ruff` (replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.

Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.

Lints that are more complicated to fix are suppressed with `# noqa` for
now and should be fixed in follow-up PRs.

### Notable changes

1. This PR **removed some suboptimal patterns** (see the examples after
this list):

	- `not xxx in` -> `xxx not in` membership checks
	- bare excepts (`except:` -> `except Exception`)
	- unused imports

	The follow-up PR will remove:

	- `import *`
	- mutable values as default in function definitions (`def func(a=[])`)
	- more unused imports
	- unused local variables

2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.

3. Removed the legacy flake8 CI flow and updated the docs.

4. The added workflow supports SARIF code-scanning reports on GitHub;
example snapshot:
	

![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png)

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant
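
Illustrative before/after examples of the patterns listed in item 1
(generic snippets, not actual diffs from this PR; `create_session` is a
placeholder, not a real ORT API):

```python
providers = ["CPUExecutionProvider"]

# Membership check: `not x in y` -> `x not in y`
if "CUDAExecutionProvider" not in providers:  # was: if not "..." in providers
    providers.append("CUDAExecutionProvider")

# Bare except -> except Exception
def create_session():
    raise RuntimeError("placeholder failure")

try:
    session = create_session()
except Exception:                             # was: except:
    session = None

# Mutable default argument (slated for the follow-up PR): def run(inputs=[])
def run(inputs=None):
    inputs = [] if inputs is None else inputs
    return inputs
```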

### Motivation and Context

A unified linting experience in CI and locally.

Replacing https://github.com/microsoft/onnxruntime/pull/14306

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-03-24 15:29:03 -07:00