## Description:
+ Improve `SymCryptFdefMontgomeryReduceAsm`
+ Reduce the instruction count in the inner loop by removing a superfluous `adc` with zero
+ Special-case the first iteration of the reduction loop to further reduce the instruction count and multiplication uops
+ For ease of phrasing, use non-volatile registers in aapcs64 assembly for the first time; this required slightly extending the SymCryptAsm processor script.
+ Improve `SymCryptFdefRawSquareAsm` by tweaking it to reduce unnecessary instruction dependencies
+ There is more room for improvement in a follow-on PR, but we are checking in what we have now so the improvements land before the GE deadline.
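As a reference point for what `SymCryptFdefMontgomeryReduceAsm` computes, a textbook word-by-word Montgomery reduction (REDC) can be sketched in Python. This is a simplified model, assuming 64-bit digits, an odd modulus `n < R`, and an input `t < n * R`; the helper name `montgomery_reduce` is hypothetical, and the sketch does not model SymCrypt's data layout, its carry chains, or the special-cased first iteration described above.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def montgomery_reduce(t: int, n: int, n_words: int) -> int:
    """Return t * R^-1 mod n, where R = 2^(64 * n_words).

    Assumes n is odd, n < R, and 0 <= t < n * R.
    """
    # n0_inv = -n^-1 mod 2^64; real code computes this once per modulus.
    n0_inv = (-pow(n, -1, 1 << WORD_BITS)) & WORD_MASK
    for _ in range(n_words):
        # Choose m so that the lowest word of t + m * n becomes zero.
        m = ((t & WORD_MASK) * n0_inv) & WORD_MASK
        # The low word is now zero, so this shift is an exact division by 2^64.
        t = (t + m * n) >> WORD_BITS
    # After n_words rounds t < 2n, so one conditional subtraction suffices.
    if t >= n:
        t -= n
    return t


# Usage: mapping a into Montgomery form and reducing it back round-trips.
N = (1 << 255) - 19
R = 1 << (64 * 4)
a = 12345678901234567890
assert montgomery_reduce(a * R % N, N, 4) == a
```

Each round zeroes the lowest word of `t` and drops it, so the per-round cost is dominated by the `m * n` multiply-accumulate across all words of `n`; that multiply-accumulate is presumably the inner loop being tuned here.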
## Admin Checklist:
- [X] You have updated documentation in symcrypt.h to reflect any changes in behavior
- [X] You have updated CHANGELOG.md to reflect any changes in behavior
- [X] You have updated symcryptunittest to exercise any new functionality
- [X] If you have introduced any symbols in symcrypt.h you have updated production and test dynamic export symbols (exports.ver / exports.def / symcrypt.src) and tested the updated dynamic modules with symcryptunittest
- [X] If you have introduced functionality that varies based on CPU features, you have manually tested with and without relevant features
- [X] If you have made significant changes to a particular algorithm, you have checked that performance numbers reported by symcryptunittest are in line with expectations
- [X] If you have added new algorithms/modes, you have updated the status indicator text for the associated modules if necessary
- !10935012 added a `.gitattributes` file to try to enforce consistent Windows-style line endings, but this causes a bunch of spurious diffs to show up after checking out the latest branch (ironically, on Windows only). See [this Stack Overflow question](https://stackoverflow.com/questions/5787937/git-status-shows-files-as-changed-even-though-contents-are-the-same) which refers to a similar issue. After fighting with Git for a bit, it seems like the easiest fix is just to remove this file.
- Workaround for Python versions < 3.11 not being able to parse timestamps with the 'Z' suffix indicating UTC time (this started breaking our pipeline builds after a recent Git version update)
- Fix Python 3.12 warnings about invalid escape sequences in `symcryptasm_processor.py` (use raw strings)
- When building OpenSSL, pin to a specific tag if no branch is specified on the command line, so that we're not building against a moving target
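Both Python fixes follow well-known patterns, sketched below under the assumption that the pipeline parses timestamps with `datetime.fromisoformat()`. The helper `parse_utc_timestamp` and the `WHITESPACE` pattern are hypothetical illustrations, not code from the pipeline scripts or from `symcryptasm_processor.py`.

```python
import re
from datetime import datetime


def parse_utc_timestamp(text: str) -> datetime:
    """Parse an ISO 8601 timestamp that may end in 'Z'.

    datetime.fromisoformat() only accepts the 'Z' suffix from Python 3.11
    onward, so rewrite it as an explicit '+00:00' offset first.
    """
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


# Raw string: the backslash reaches the regex engine unchanged, so Python 3.12
# does not emit "SyntaxWarning: invalid escape sequence" for this literal.
WHITESPACE = re.compile(r"\s+")
```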
This change rewrites our Azure DevOps pipelines to be compatible with OneBranch pipelines. It also adds new scripts to help with building, testing and packaging SymCrypt. These scripts replicate some of the functionality of `scbuild` but are also compatible with Linux builds. They can be used directly on the command line by developers, but the OneBranch pipeline also uses them to move as much as possible of the "business logic" of building SymCrypt out of the YAML templates and into Python scripts.
Also includes various reorganizations and small fixes.
Add the following SHA-2 implementations:
- SHA-256 SSSE3+BMI2 intrinsics implementation with 4-way parallel message expansion
- SHA-256 SSSE3+BMI2 assembly implementation with 4-way parallel message expansion
- SHA-256 AVX2+BMI2 intrinsics implementation with 8-way parallel message expansion
- SHA-256 AVX2+BMI2 assembly implementation with 8-way parallel message expansion
- SHA-512 AVX2+BMI2 intrinsics implementation for single-block processing
- SHA-512 AVX2+BMI2 intrinsics implementation with 2-way parallel message expansion
- SHA-512 AVX2+BMI2 intrinsics implementation with 4-way parallel message expansion
- SHA-512 AVX2+BMI2 assembly implementation with 4-way parallel message expansion
- SHA-512 AVX-512 assembly implementation with 4-way parallel message expansion
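The "N-way parallel message expansion" above is assumed here to mean computing the SHA-256 message schedule for N blocks at once, one block per 32-bit SIMD lane (4 lanes in an XMM register, 8 in a YMM register). The Python sketch below models that lockstep structure with plain lists; `expand_lanes` and the sigma helpers are hypothetical names, and a real implementation would also need to arrange the block data into lanes.

```python
MASK32 = 0xFFFFFFFF


def rotr32(x: int, n: int) -> int:
    return ((x >> n) | (x << (32 - n))) & MASK32


def small_sigma0(x: int) -> int:
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3)


def small_sigma1(x: int) -> int:
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10)


def expand_lanes(blocks: list[bytes]) -> list[list[int]]:
    """Expand SHA-256 message schedules for several 64-byte blocks in lockstep.

    schedule[t][lane] is W[t] for blocks[lane]; each step t applies the same
    operation to every lane, as a single SIMD instruction would.
    """
    assert all(len(b) == 64 for b in blocks)
    # W[0..15] are the block's big-endian 32-bit words.
    schedule = [
        [int.from_bytes(b[4 * t:4 * t + 4], "big") for b in blocks]
        for t in range(16)
    ]
    # W[t] = sigma1(W[t-2]) + W[t-7] + sigma0(W[t-15]) + W[t-16]  (mod 2^32)
    for t in range(16, 64):
        schedule.append([
            (small_sigma1(schedule[t - 2][i]) + schedule[t - 7][i]
             + small_sigma0(schedule[t - 15][i]) + schedule[t - 16][i]) & MASK32
            for i in range(len(blocks))
        ])
    return schedule
```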
Other changes:
- Add INCLUDE directive to `symcryptasm_processor.py`
- Update `symcryptasm_processor.py` to support saving non-volatile Xmm registers and allocating stack space
- Update feed-forwarding step of block processing in C implementations
- Use alternative expressions for LSIGMA and CSIGMA functions in SHA-512 C implementation
- Fix updating of pcbRemaining in `SymCryptSha512AppendBlocks_ull2`
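The notes do not say which alternative LSIGMA/CSIGMA expressions were adopted. Assuming CSIGMA denotes the capital-sigma functions Σ0 and Σ1 of SHA-512, one well-known rewrite factors out the smallest rotation (rotation distributes over XOR) and nests the remaining rotations by their differences. The sketch below states both forms in Python; the function names are hypothetical and this is not necessarily the form used in SymCrypt.

```python
MASK64 = (1 << 64) - 1


def rotr64(x: int, n: int) -> int:
    return ((x >> n) | (x << (64 - n))) & MASK64


def csigma0_direct(x: int) -> int:
    # FIPS 180-4 definition of SHA-512 Sigma0.
    return rotr64(x, 28) ^ rotr64(x, 34) ^ rotr64(x, 39)


def csigma0_nested(x: int) -> int:
    # Pull ROTR28 out and nest the rest by differences: 34 - 28 = 6, 39 - 34 = 5.
    return rotr64(x ^ rotr64(x ^ rotr64(x, 5), 6), 28)


def csigma1_direct(x: int) -> int:
    # FIPS 180-4 definition of SHA-512 Sigma1.
    return rotr64(x, 14) ^ rotr64(x, 18) ^ rotr64(x, 41)


def csigma1_nested(x: int) -> int:
    # Pull ROTR14 out and nest the rest by differences: 18 - 14 = 4, 41 - 18 = 23.
    return rotr64(x ^ rotr64(x ^ rotr64(x, 23), 4), 14)
```

In the nested form each rotation depends on the previous XOR, which changes the dependency chain and register usage compared with three independent rotations of `x`.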
Related work items: #38759923, #38958807
+ Extends SymCryptAsm format and script to work in the Arm64 context
+ Now specify architecture, assembler, and calling convention in script invocation
+ Make various changes to the assembly to remove redundant instructions and generally improve performance slightly on all platforms (a couple of percent here and there)
+ Use assembly routines in Linux builds and remove asmstubs file
+ Do not enable Windows Arm64 build with CMake yet
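Extending the processor to Arm64 means it has to know which registers the AAPCS64 calling convention requires a function to preserve. Under AAPCS64, x19 to x28 are callee-saved general-purpose registers and `sp` must stay 16-byte aligned, so a common pattern is to save them in pairs with `stp`/`ldp`. The hypothetical helper below emits such sequences as text; it is not taken from `symcryptasm_processor.py`, and it ignores x29/x30 and the SIMD registers v8 to v15, whose low 64 bits are also callee-saved.

```python
CALLEE_SAVED = [f"x{i}" for i in range(19, 29)]  # x19-x28 under AAPCS64


def save_restore(regs: list[str]) -> tuple[list[str], list[str]]:
    """Return (prologue, epilogue) lines saving/restoring callee-saved GPRs.

    Registers are stored in pairs with pre-indexed stp so sp moves by 16 bytes
    each time and stays 16-byte aligned; an odd register out gets its own
    16-byte slot.
    """
    for r in regs:
        if r not in CALLEE_SAVED:
            raise ValueError(f"{r} is not an AAPCS64 callee-saved GPR")
    groups = [regs[i:i + 2] for i in range(0, len(regs), 2)]
    prologue, epilogue = [], []
    for g in groups:
        if len(g) == 2:
            prologue.append(f"stp {g[0]}, {g[1]}, [sp, #-16]!")
            epilogue.append(f"ldp {g[0]}, {g[1]}, [sp], #16")
        else:
            prologue.append(f"str {g[0]}, [sp, #-16]!")
            epilogue.append(f"ldr {g[0]}, [sp], #16")
    epilogue.reverse()  # restore in the opposite order of saving
    return prologue, epilogue
```

For example, `save_restore(["x19", "x20", "x21"])` returns the prologue `stp x19, x20, [sp, #-16]!` then `str x21, [sp, #-16]!`, and the epilogue in reverse order: `ldr x21, [sp], #16` then `ldp x19, x20, [sp], #16`.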
Related work items: #35613721
+ Resolves all issues flagged by runoacr in symcrypt\lib
+ Leaves some oacr issues in test code
+ Also includes some unrelated fixes to typos etc.
Related work items: #35052770
+ Add more optimization flags for MSVC in CMake to get closer to parity between Razzle and CMake builds
+ Make some AES-GCM tweaks for GCC/clang to avoid aggressive loop peeling, which hurts performance by unduly increasing code size
Related work items: #32785997
+ Introduce a two-stage preprocessing setup to convert .symcryptasm to either masm (msft x64 calling convention) or gas (SystemV amd64 calling convention)
  + Step 1 converts .symcryptasm to .cppasm (using `lib\symcryptasm_processor.py`)
  + Step 2 converts .cppasm to .asm using the C preprocessor
+ Updated CMakeLists.txt to invoke this preprocessing whenever any relevant file is updated
+ Also introduced makefile.inc for the razzle build
+ I have translated all of the amd64 asm files we want to preserve, and the performance of big-integer-reliant code is the same on Windows and Linux (and a bit better on Windows than before :))
+ In translation I did some tidying of the underlying assembly:
  + Removing needless work (some size-specific functions in particular had cruft from their adaptation from the generic-sized versions)
  + Reducing code size (e.g. by using inc/dec rather than add/sub 1)
  + Some micro-optimizations to remove needless instruction dependencies
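Much of what step 1 has to reconcile is that the two conventions pass integer arguments in different registers: rcx, rdx, r8, r9 under Microsoft x64, and rdi, rsi, rdx, rcx, r8, r9 under System V AMD64. The toy substitution below illustrates the idea; the `argN` placeholder syntax and the `substitute_args` helper are hypothetical and do not reflect the actual .symcryptasm syntax.

```python
import re

ARG_REGS = {
    # Integer/pointer argument registers, in order.
    "msft": ["rcx", "rdx", "r8", "r9"],                   # Microsoft x64
    "systemv": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],  # System V AMD64
}


def substitute_args(line: str, convention: str) -> str:
    """Replace hypothetical placeholders arg1, arg2, ... with real registers."""
    regs = ARG_REGS[convention]

    def repl(m: re.Match) -> str:
        index = int(m.group(1)) - 1
        if index >= len(regs):
            raise ValueError(f"arg{index + 1} is passed on the stack under {convention}")
        return regs[index]

    return re.sub(r"\barg(\d+)\b", repl, line)
```

A real translation also has to account for the differing volatile sets (for example, rdi and rsi are non-volatile under Microsoft x64 but carry arguments under System V) and for arguments passed on the stack.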
Related work items: #30621935