Commit Graph

16 Commits

Author SHA1 Message Date
Yulong Wang abdc31de40
[js] change default formatter for JavaScript/TypeScript from clang-format to Prettier (#21728)
### Description

See
454996d496
for manual changes (excluded auto-generated formatting changes)

### Why

The toolset around the old clang-format setup is out of date, which hurts
development efficiency:

- The NPM package `clang-format` is in maintenance mode and has not been
updated in two years.
- The VSCode extension for clang-format has not been maintained for a
while, and a recent Node.js security update broke it entirely on Windows.

No one in the community seems interested in fixing these issues.

Prettier was chosen because it is the most popular TS/JS formatter.

### How to merge

It's easy to break the build:
- Watch out for any new commits on main that are not included in this PR.
- Watch out for other PRs that already passed CI and could merge after
this PR lands.

So, make sure there are no new commits on main before merging this one,
and invalidate JS PRs that already passed CI, forcing them to update to
the latest main before they merge.
2024-08-14 16:51:22 -07:00
Scott McKay b0e1f7f798
CoreML: Aggregated changes to add all required ops for priority model (#21472)
### Description
Add these changes in one PR to simplify check-in:
- Add Concat (#21423)
- Add DepthToSpace (#21426)
- Add LeakyRelu (#21453)
- Add test scripts (#21427)
- Add ability to set coreml flags from python (#21434)


Other changes:
- Updated the partitioning utils to support dropping constant
initializers from a ComputeCapability's inputs (a rough sketch of the
idea follows this list).
- Noticed that the list of inputs to the CoreML model was unexpectedly
long because of these initializers.
- We copy constant initializers into the CoreML model, so the originals
are not needed; if they remain listed as inputs, ORT can't free them
because they appear to still be in use.
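
The sketch below is only a Python illustration of that trimming idea; the
real change lives in ORT's C++ partitioning utilities, and the helper name
here is hypothetical.

```python
# Hypothetical sketch: drop constant initializers that were already copied
# into the CoreML model from the partition's reported inputs, so ORT can
# release the original buffers.
def trim_compute_capability_inputs(inputs, copied_constant_initializers):
    copied = set(copied_constant_initializers)
    return [name for name in inputs if name not in copied]

# Weights/biases baked into the CoreML model no longer need to be inputs.
inputs = ["image", "conv1.weight", "conv1.bias"]
constants = ["conv1.weight", "conv1.bias"]
print(trim_compute_capability_inputs(inputs, constants))  # ['image']
```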

### Motivation and Context
2024-07-26 08:29:33 +10:00
aciddelgado ebd0368bb0
Make Flash Attention work on Windows (#21015)
### Description
Previously, Flash Attention only worked on Linux. This PR enables it to
be built and run on Windows as well.

Limitation of Flash Attention on Windows: it requires CUDA 12.

### Motivation and Context
This will significantly increase the performance of Windows-based LLMs on
hardware with SM >= 80.

To illustrate the improvement of Flash Attention over Memory Efficient
Attention, here are some average benchmark numbers for the GQA operator,
run with configurations based on several recent models (Llama, Mixtral,
Phi-3). The benchmarks were obtained on an RTX 4090 GPU using the test
script at
`onnxruntime/test/python/transformers/benchmark_gqa_windows.py`.

* Clarifying note: these benchmarks cover only the GQA operator, not the
entire model.
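
For reference, the two metrics in the tables below are the average
per-run latency ("inference interval") and samples processed per second
("throughput"). The snippet below is a generic timing sketch of how such
metrics are commonly measured, not the actual benchmark_gqa_windows.py
script; `run_once` is a placeholder for an ONNX Runtime session run of
the GQA operator.

```python
# Generic benchmarking sketch (not the actual benchmark script).
import time

def benchmark(run_once, batch_size, warmup=10, iters=100):
    """Return average per-run latency (ms) and throughput (samples/s)."""
    for _ in range(warmup):              # warm up kernels and allocators
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    interval = (time.perf_counter() - start) / iters
    return {"inference_interval_ms": interval * 1e3,
            "throughput_samples_per_s": batch_size / interval}

# Dummy stand-in for a GQA session run:
print(benchmark(lambda: sum(range(10_000)), batch_size=1))
```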

### Memory Efficient Attention Kernel Benchmarks:
| Model Name | Max Sequence Length | Inference Interval (ms) | Throughput (samples/second) |
|----------------------------------------|---------------------|-------------------------|-----------------------------|
| Llama3-8B (Average Prompt) | 8192 | 0.19790525 | 13105.63425 |
| Llama3-8B (Average Token) | 8192 | 0.207775538 | 12025.10172 |
| Llama3-70B (Average Prompt) | 8192 | 0.216049167 | 11563.31185 |
| Llama3-70B (Average Token) | 8192 | 0.209730731 | 12284.38149 |
| Mixtral-8x22B-v0.1 (Average Prompt) | 32768 | 0.371928785 | 7031.440056 |
| Mixtral-8x22B-v0.1 (Average Token) | 32768 | 0.2996659 | 7607.947159 |
| Phi-3-mini-128k (Average Prompt) | 131072 | 0.183195867 | 15542.0852 |
| Phi-3-mini-128k (Average Token) | 131072 | 0.198215688 | 12874.53494 |
| Phi-3-small-128k (Average Prompt) | 65536 | 2.9884929 | 2332.584142 |
| Phi-3-small-128k (Average Token) | 65536 | 0.845072406 | 2877.85822 |
| Phi-3-medium-128K (Average Prompt) | 32768 | 0.324974429 | 8094.909517 |
| Phi-3-medium-128K (Average Token) | 32768 | 0.263662567 | 8978.463687 |

### Flash Attention Kernel Benchmarks:
| Model Name | Max Sequence Length | Inference Interval (ms) | Throughput (samples/second) |
|--------------------------------------|---------------------|-------------------------|-----------------------------|
| Llama3-8B (Average Prompt) | 8192 | 0.163566292 | 16213.69057 |
| Llama3-8B (Average Token) | 8192 | 0.161643692 | 16196.14715 |
| Llama3-70B (Average Prompt) | 8192 | 0.160510375 | 17448.67753 |
| Llama3-70B (Average Token) | 8192 | 0.169427308 | 14702.62043 |
| Mixtral-8x22B-v0.1 (Average Prompt) | 32768 | 0.164121964 | 15618.51301 |
| Mixtral-8x22B-v0.1 (Average Token) | 32768 | 0.1715865 | 14524.32273 |
| Phi-3-mini-128k (Average Prompt) | 131072 | 0.167527167 | 14576.725 |
| Phi-3-mini-128k (Average Token) | 131072 | 0.175940594 | 15762.051 |
| Phi-3-small-128k (Average Prompt) | 65536 | 0.162719733 | 17824.494 |
| Phi-3-small-128k (Average Token) | 65536 | 0.14977525 | 16749.19858 |
| Phi-3-medium-128K (Average Prompt) | 32768 | 0.156490786 | 17679.2513 |
| Phi-3-medium-128K (Average Token) | 32768 | 0.165333833 | 14932.26079 |

Flash Attention is consistently faster for every configuration we
benchmarked, with improvements in our trials ranging from ~20% to ~650%.

In addition to these improvements in performance, Flash Attention has
better memory usage. For example, Memory Efficient Attention cannot
handle a max sequence length higher than 32,768, but Flash Attention can
handle max sequence lengths at least as high as 131,072.

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
2024-06-24 09:43:49 -07:00
Chen Fu 6fb09055d4
Adding a sm80 q4 gemm kernel for small tiles (#20545)
### Description

Implementation of a q4 GEMM CUDA kernel for small tiles and small
sequence_len or batch_size (<= 16).

### Performance Test Results

| Problem Shape | New Kernel | | | Current Kernel | |
| ------------------: | ----------- | ------- | -- | ------------- | ------- |
| **(M x N x K)** | **Latency (ms)** | **GFLOPS** | | **Latency (ms)** | **GFLOPS** |
| 1 x 3072 x 3072 | 0.008124 | 2310.93 | | 0.017231 | 1095.39 |
| 16 x 3072 x 3072 | 0.011263 | 26813.7 | | 0.017431 | 17325.4 |
| 32 x 3072 x 3072 | 0.018559 | 32544.3 | | 0.079493 | 7597.89 |
| 64 x 3072 x 3072 | 0.030364 | 39782.1 | | 0.079387 | 15216 |
| 1024 x 3072 x 3072 | 0.387194 | 49916.5 | | 0.080849 | 239054 |
| | | | | | |
| 1 x 3072 x 9216 | 0.015734 | 3598.77 | | 0.043404 | 1304.55 |
| 16 x 3072 x 9216 | 0.023611 | 38371.3 | | 0.043388 | 20859.1 |
| 32 x 3072 x 9216 | 0.038652 | 46878 | | 0.224353 | 8076.31 |
| 64 x 3072 x 9216 | 0.072334 | 50099.5 | | 0.224338 | 16153.6 |
| 1024 x 3072 x 9216 | 1.02872 | 56363.2 | | 0.231284 | 250696 |
| | | | | | |
| 1 x 8192 x 3072 | 0.015787 | 3188.18 | | 0.017714 | 2841.28 |
| 16 x 8192 x 3072 | 0.025933 | 31053.3 | | 0.017919 | 44942.2 |
| 32 x 8192 x 3072 | 0.042633 | 37778.9 | | 0.079407 | 20282.9 |
| 64 x 8192 x 3072 | 0.070061 | 45977.5 | | 0.079531 | 40502.8 |
| 1024 x 8192 x 3072 | 1.01264 | 50896.3 | | 0.237244 | 217243 |
| | | | | | |
| 1 x 3072 x 8192 | 0.014444 | 3484.56 | | 0.038961 | 1291.85 |
| 16 x 3072 x 8192 | 0.020433 | 39411.8 | | 0.039056 | |
| 32 x 3072 x 8192 | 0.03459 | 46563.5 | | 0.200189 | 8045.47 |
| 64 x 3072 x 8192 | 0.063319 | 50873.4 | | 0.20029 | 16082.8 |
| 1024 x 3072 x 8192 | 0.928282 | 55521.5 | | 0.205883 | 250334 |
| | | | | | |
| 1 x 5120 x 5120 | 0.014573 | 3597.79 | | 0.02604 | 2013.42 |
| 16 x 5120 x 5120 | 0.025638 | 32719.5 | | 0.026194 | 32024.4 |
| 32 x 5120 x 5120 | 0.037421 | 44834.2 | | 0.127676 | 13140.4 |
| 64 x 5120 x 5120 | 0.065593 | 51155.9 | | 0.127706 | 26274.8 |
| 1024 x 5120 x 5120 | 1.00217 | 53570.9 | | 0.256388 | 209398 |
| | | | | | |
| 1 x 17920 x 5120 | 0.053868 | 3406.49 | | 0.04715 | 3891.84 |
| 16 x 17920 x 5120 | 0.071952 | 40805.1 | | 0.049755 | 59009.3 |
| 32 x 17920 x 5120 | 0.123657 | 47486.3 | | 0.129812 | 45234.8 |
| 64 x 17920 x 5120 | 0.222113 | 52874.2 | | 0.129781 | 90491.6 |
| 1024 x 17920 x 5120 | 3.50124 | 53668.1 | | 0.770569 | 243852 |
| | | | | | |
| 1 x 1280 x 5120 | 0.007029 | 1864.66 | | 0.025954 | 505.027 |
| 16 x 1280 x 5120 | 0.008122 | 25821.6 | | 0.025953 | 8080.59 |
| 32 x 1280 x 5120 | 0.012498 | 33558.7 | | 0.127618 | 3286.62 |
| 64 x 1280 x 5120 | 0.022049 | 38044.6 | | 0.127762 | 6565.81 |
| 1024 x 1280 x 5120 | 0.258547 | 51912.4 | | 0.128425 | 104511 |
| | | | | | |
| 1 x 5120 x 17920 | 0.049096 | 3737.59 | | 0.109703 | 1672.7 |
| 16 x 5120 x 17920 | 0.073145 | 40139.7 | | 0.110608 | 26544.3 |
| 32 x 5120 x 17920 | 0.11405 | 51486.3 | | 0.430942 | 13626 |
| 64 x 5120 x 17920 | 0.210022 | 55918.1 | | 0.430948 | 27251.7 |
| 1024 x 5120 x 17920 | 4.571 | 41108 | | 0.860118 | 218464 |
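
The GFLOPS figures above appear to follow the usual 2·M·N·K / latency
convention for GEMM throughput; the quick sanity check below (not part of
the PR) reproduces one row of the table to within the rounding of the
reported latency.

```python
# Sanity check of the GFLOPS convention: an (M x N x K) GEMM performs
# roughly 2*M*N*K floating-point operations.
def gemm_gflops(m, n, k, latency_ms):
    return 2 * m * n * k / (latency_ms * 1e-3) / 1e9

# "16 x 3072 x 3072" with the new kernel: 0.011263 ms -> ~26813 GFLOPS,
# consistent with the table's 26813.7 (small differences come from rounding).
print(round(gemm_gflops(16, 3072, 3072, 0.011263), 1))
```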
2024-06-12 16:02:26 -07:00
Scott McKay 9372e9a0a3
Support >2GB of Tensor data in training checkpoint (#20077)
### Description
Add the ability to store initializer data in an external file.
Update the training checkpoint code to use an external file if the data
is larger than ~2GB.

I don't see a way for the flatbuffers 64-bit offsets to be used, as they
don't support storing 'table' types with 64-bit offsets (and our Tensor
is a 'table' type not a simple struct).


0cfb7eb80b/tests/64bit/test_64bit.fbs (L38-L39)

Allowing a Tensor to have its raw_data in an external file should
hopefully work with the least friction. Since it's an extra field, it's
backwards compatible.
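
A minimal sketch of the external-data idea, in Python for illustration
only (this is not ORT's actual checkpoint format or API): large raw_data
is appended to a side file, and the checkpoint entry keeps just an offset
and length.

```python
# Illustrative sketch of the external-data approach (not ORT's real format).
import numpy as np

def write_external(tensors, data_path, threshold=2 * 1024**3):
    """Write large tensors to a side file; return per-tensor metadata."""
    index = {}
    with open(data_path, "wb") as f:
        for name, array in tensors.items():
            raw = array.tobytes()
            if len(raw) < threshold:
                continue  # small tensors stay inline in the checkpoint
            index[name] = (f.tell(), len(raw), str(array.dtype), array.shape)
            f.write(raw)
    return index  # stored in the checkpoint in place of raw_data

def read_external(index, data_path, name):
    offset, length, dtype, shape = index[name]
    with open(data_path, "rb") as f:
        f.seek(offset)
        return np.frombuffer(f.read(length), dtype=dtype).reshape(shape)
```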

Please feel free to suggest alternative approaches. 

Side note: the diffs in the generated *.fbs.h files are unexpectedly
large. Maybe they weren't re-generated when the new flatbuffers version
was checked in. I updated them by running
`python .\compile_schema.py -f <build output
dir>\_deps\flatbuffers-build\Debug\flatc.exe`
from onnxruntime\core\flatbuffers\schema, which I thought was the correct
way, but maybe that's out of date.

I think you can ignore all the diffs in the generated files and just
focus on the changes to the .fbs files in
onnxruntime/core/flatbuffers/schema. Basically, start at the bottom of
the changed files and work up, as all the 'real' diffs are there.

### Motivation and Context

---------

Co-authored-by: carzh <wolfivyaura@gmail.com>
2024-04-22 15:17:43 -07:00
Chen Fu 06e684c9f2
Adding cuda kernel (optimized for sm80) for block-wise 4b quantized float 16 GEMM. (#18619)
### Description
Adding a CUDA kernel for block-wise 4-bit quantized float16 GEMM,
specially optimized for NVIDIA Ampere GPUs.
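
As a rough NumPy reference for what block-wise 4-bit weight quantization
means here (illustrative math only, not the CUTLASS-based kernel added in
this PR): weights are split into blocks along K, each block carries its
own scale, and the GEMM dequantizes them on the fly. The block size of 32
below is an assumption for the example.

```python
# NumPy reference for block-wise 4-bit quantization (illustration only).
import numpy as np

BLOCK = 32  # assumed quantization block size along K

def quantize_q4(w):                # w: (K, N) fp16 weights, K % BLOCK == 0
    blocks = w.reshape(-1, BLOCK, w.shape[1])        # (K/BLOCK, BLOCK, N)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def gemm_q4(a, q, scales):         # a: (M, K) fp16 activations
    w_deq = (q * scales).reshape(-1, q.shape[2]).astype(np.float16)
    return a @ w_deq               # the real kernel fuses dequant and GEMM

a = np.random.randn(16, 3072).astype(np.float16)
w = np.random.randn(3072, 3072).astype(np.float16)
q, s = quantize_q4(w)
print(np.abs(gemm_q4(a, q, s) - a @ w).max())  # error from 4-bit quantization
```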


### Motivation and Context
Trying to improve quantized LLM inference performance on Nvidia Ampere
GPUs

### Note:
This is implemented by extending CUTLASS, so it has a hard dependency on
CUTLASS. However, in the current build system, loading of the CUTLASS
dependency is guarded by:

(onnxruntime_USE_FLASH_ATTENTION OR
onnxruntime_USE_MEMORY_EFFICIENT_ATTENTION)

If both of these options are turned off, compilation will fail.

Why is the CUTLASS dependency guarded at all? It's a header-only library
that does not introduce any binary code unless instantiated. What's the
downside of removing all the guards and including CUTLASS
unconditionally?
2024-03-05 09:37:45 -08:00
pengwa 2c6b31c5aa
FP16 optimizer automatically detect DeepSpeed compatibility (#18084)
### FP16 optimizer automatically detects DeepSpeed compatibility

Optimum/Transformers use the accelerate library to prepare models, so our
FP16 optimizer wrapper has not worked for a long time: the optimizer is
wrapped as `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper`, which
underneath still calls into the DeepSpeed stage 1 and 2 optimizer.

This PR includes the following changes:
1. Add `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper` to the
modifier registry, plus a check that its contained `optimizer` property
MUST be a DeepSpeed stage 1 or 2 optimizer. (Let's cover the stage 3
optimizer later.)
2. For DeepSpeed versions > 0.9.1, we store the relevant source code in a
version list. As long as the related function in DeepSpeed remains
unchanged in a new release, we won't need to manually update the version
check any more. If some day the source code no longer matches, a warning
is raised asking users to add a new version of the source code to the
list (a rough sketch of this check follows below).

With the above changes, our FP16 optimizer works again in Optimum.
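
A rough sketch of the version-tolerance check described in item 2, with
hypothetical helper and variable names (this is not the actual
onnxruntime.training implementation): the current source of the wrapped
DeepSpeed function is compared against a list of known-good snapshots,
and a warning is raised when none of them match.

```python
# Hypothetical sketch of the "known source" compatibility check.
import inspect
import warnings

# One source snapshot per verified DeepSpeed release (placeholder content).
KNOWN_SOURCES = {
    "0.9.2": "def example_deepspeed_function(...): ...",
}

def is_function_unchanged(func):
    """Warn (instead of failing) when the DeepSpeed source has drifted."""
    current = inspect.getsource(func)
    if current not in KNOWN_SOURCES.values():
        warnings.warn(
            "DeepSpeed function source does not match any verified version; "
            "the FP16 optimizer modifier may be incompatible. Please add the "
            "new source snapshot to the known-version list."
        )
        return False
    return True
```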


![image](https://github.com/microsoft/onnxruntime/assets/10530022/d35b4aa9-b371-46f1-98ae-73114f91179b)
2023-10-25 15:11:02 +08:00
Justin Chu be7541ef4a
[Linter] Bump ruff and remove pylint (#17797)
Bump the ruff version and remove pylint from the linter list. Fix any new
errors detected by ruff.

### Motivation and Context

Ruff covers many of the pylint rules. Since pylint is not enabled in
this repo and runs slowly, we remove it from the linter list.
2023-10-05 21:07:33 -07:00
Justin Chu eeef157888
Format c++ code under `winml/` (#16660)
winml/ was previously excluded from the lintrunner config. This change
includes the directory and adds a clang-format config file specific to
winml/ that fits the existing style.

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-07-25 21:56:50 -07:00
Baiju Meswani 10ba1e270c
Minimal Build for On-Device Training (#16326)
🛠️ __Changes in this pull request:__

This pull request introduces two significant changes to the project:

- Changing the on-device training checkpoint format: The current
implementation stores the on-device training checkpoint as a sequence of
tensors in multiple files inside a checkpoint folder, which can be
inefficient in terms of storage and performance. In this PR, I have
modified the checkpoint format to use a flatbuffer table to save the
checkpoint to a single file, providing a more compact and efficient
representation (a toy sketch of the single-file idea follows this list).
The changes around this are twofold:
- Add the checkpoint flatbuffer schema that will generate the necessary
checkpoint source files.
- Update the checkpoint saving and loading functionality to use the new
format.

- Adding support for onnxruntime minimal build: To support scenarios
where binary size is a constraint, I made changes to ensure that the
training build can work well with the minimal build.
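
A toy illustration of the folder-of-files versus single-file idea, using
plain Python and a JSON index purely for readability (the real
implementation uses the flatbuffer schema mentioned above):

```python
# Toy sketch: consolidate per-tensor files into one checkpoint file with a
# small index header (not the actual flatbuffer-based format).
import json
import numpy as np

def save_single_file(tensors, path):
    index, payload = {}, bytearray()
    for name, array in tensors.items():
        index[name] = {"dtype": str(array.dtype), "shape": list(array.shape),
                       "offset": len(payload), "nbytes": array.nbytes}
        payload += array.tobytes()
    header = json.dumps(index).encode()
    with open(path, "wb") as f:
        f.write(len(header).to_bytes(8, "little"))  # header length prefix
        f.write(header)
        f.write(payload)

save_single_file({"fc.weight": np.zeros((4, 4), np.float32)}, "checkpoint.bin")
```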

🔍 __Open Issues:__
- In order to extract the optimizer type, the existing implementation
re-loaded the ONNX optimizer model and parsed it. This is no longer
possible, since the model format can be either ONNX or ORT. One idea is
to do the same for the ORT-format optimizer model. This needs some
investigation.
- Changes to the offline tooling to generate ORT-format training
artifacts.
- End-to-end training example showcasing the use of the minimal training
build.
- Add support for exporting a model for inferencing in a minimal build.
2023-06-22 12:27:23 -07:00
Justin Chu 76ddc92fbd
Enable RUFF as a formatter (#15699)
### Description

Ruff can now act as a formatter as of lintrunner-adapters v0.8. Removed
the RUFF-FIX linter.



### Motivation and Context

Better engineering
2023-04-26 14:04:07 -07:00
Justin Chu 1f7c2f724f
Fix lintrunner configurations (#15586)
### Description

- Fix lintrunner configurations to always use `python` instead of
`python3`.
- Set up dependabot
- Moved dependencies to requirements-lintrunner to allow dependabot to
update it similar to https://github.com/onnx/onnx/pull/5124
2023-04-20 08:54:26 -07:00
Justin Chu cf19c3697d
Run clang-format in CI (#15524)
### Description

Run clang-format in CI. Formatted all c/c++, objective-c/c++ files.

Excluded

```
    'onnxruntime/core/mlas/**',
    'onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/**',
```

because they contain assembly or are data-heavy.


### Motivation and Context

Coding style consistency
2023-04-18 09:26:58 -07:00
Justin Chu a36caba073
Bump ruff in CI (#15533)
### Description

Bump the ruff version in CI and fix the new lint errors.

- This change enables the flake8-implicit-str-concat rules, which help
detect unintended string concatenations (example after this list):
https://beta.ruff.rs/docs/rules/#flake8-implicit-str-concat-isc
- Update .gitignore to include common Python files that we want to
exclude.
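
For illustration, this is the kind of bug the implicit-str-concat rules
catch (a generic example, not code from this repo):

```python
# A missing comma silently concatenates two adjacent string literals.
providers = [
    "CPUExecutionProvider"
    "CUDAExecutionProvider",  # becomes one string, not two list entries
]
assert providers == ["CPUExecutionProviderCUDAExecutionProvider"]

# Intended version:
providers = ["CPUExecutionProvider", "CUDAExecutionProvider"]
```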


### Motivation and Context

Code quality
2023-04-17 10:11:44 -07:00
Justin Chu e754edaecf
Run rustfmt in CI (#15217)
I considered running clippy as well but ort takes too long to build
2023-03-27 08:12:59 -07:00
Justin Chu d834ec895a
Adopt linrtunner as the linting tool - take 2 (#15085)
### Description

`lintrunner` is a linter runner successfully used by pytorch, onnx and
onnx-script. It provides a uniform experience running linters locally
and in CI. It supports all major dev systems: Windows, Linux and macOS.
The checks are enforced by the `Python format` workflow.

This PR adopts `lintrunner` for onnxruntime and fixes ~2000 flake8 errors
in Python code. `lintrunner` now runs all required Python lints,
including `ruff` (replacing `flake8`), `black` and `isort`. Future lints
like `clang-format` can be added.

Most errors are auto-fixed by `ruff` and the fixes should be considered
robust.

Lints that are more complicated to fix are suppressed with `# noqa` for
now and should be fixed in follow-up PRs.

### Notable changes

1. This PR **removed some suboptimal patterns** (see the examples after
this list):

	- `not xxx in` -> `xxx not in` membership checks
	- bare excepts (`except:` -> `except Exception`)
	- unused imports

	The follow-up PR will remove:

	- `import *`
	- mutable values as default in function definitions (`def func(a=[])`)
	- more unused imports
	- unused local variables

2. Use `ruff` to replace `flake8`. `ruff` is much (40x) faster than
flake8 and is more robust. We are using it successfully in onnx and
onnx-script. It also supports auto-fixing many flake8 errors.

3. Removed the legacy flake8 CI flow and updated the docs.

4. The added workflow supports SARIF code-scanning reports on GitHub;
example snapshot:
	

![image](https://user-images.githubusercontent.com/11205048/212598953-d60ce8a9-f242-4fa8-8674-8696b704604a.png)

5. Removed `onnxruntime-python-checks-ci-pipeline` as redundant
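
Illustrative before/after examples of the patterns listed in item 1
(generic snippets, not actual diffs from this PR; `create_session` is a
placeholder, not a real ORT API):

```python
providers = ["CPUExecutionProvider"]

# Membership check: `not x in y` -> `x not in y`
if "CUDAExecutionProvider" not in providers:  # was: if not "..." in providers
    providers.append("CUDAExecutionProvider")

# Bare except -> except Exception
def create_session():
    raise RuntimeError("placeholder failure")

try:
    session = create_session()
except Exception:                             # was: except:
    session = None

# Mutable default argument (slated for the follow-up PR): def run(inputs=[])
def run(inputs=None):
    inputs = [] if inputs is None else inputs
    return inputs
```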

### Motivation and Context

A unified linting experience in CI and locally.

Replacing https://github.com/microsoft/onnxruntime/pull/14306

---------

Signed-off-by: Justin Chu <justinchu@microsoft.com>
2023-03-24 15:29:03 -07:00