### Description
Windows - Fully dynamic ETW controlled logging for ORT and QNN logs
The logging support is documented here:
- https://onnxruntime.ai/docs/performance/tune-performance/logging_tracing.html#tracing---windows
- https://onnxruntime.ai/docs/performance/tune-performance/profiling-tools.html#tracelogging-etw-windows-profiling
This change also adds support for logging ORT SessionCreation on ETW CaptureState.
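The CaptureState idea can be sketched as follows. This is a minimal, platform-independent illustration, not ORT's actual implementation: the enum mirrors the control codes ETW passes to a provider's enable callback (`EVENT_CONTROL_CODE_*` in `evntprov.h`), while `Session`, `OnEtwCallback`, and the event-name string are hypothetical stand-ins. The point is that when a consumer requests a state rundown, the callback emits one `SessionCreation_CaptureState` event per already-live session, so a trace started mid-run still records session configuration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Control codes mirroring what ETW passes to a provider's enable callback.
enum class EtwControlCode : uint32_t {
  DisableProvider = 0,
  EnableProvider = 1,
  CaptureState = 2,  // consumer requested a state rundown
};

// Hypothetical stand-in for a registry entry of a live inference session.
struct Session {
  std::string name;
};

// Sketch of the callback logic: on CaptureState, emit one
// SessionCreation_CaptureState event per already-created session.
// Real code would call TraceLoggingWrite(...) instead of returning strings.
std::vector<std::string> OnEtwCallback(EtwControlCode code,
                                       const std::vector<Session>& sessions) {
  std::vector<std::string> emitted;
  if (code == EtwControlCode::CaptureState) {
    for (const auto& s : sessions) {
      emitted.push_back("SessionCreation_CaptureState:" + s.name);
    }
  }
  return emitted;
}
```

A session created before the trace started is therefore still visible to a consumer that requests capture state (as `wpr` does at the start of a recording).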
### Motivation and Context
The previous ETW support only worked if ETW was enabled before the session started. Long-lived AI inference processes commonly need to be traced and debugged, and this change enables logging fully on the fly.
Without this support, a dev would have to kill a process or stop a service in order to get tracing. We had to do this for a recent issue with QNN; it was painful to get the logs, and it ruined the repro.
### Testing
I tested the following cases:
- Leaving a default ORT run
- Enabling ETW prior to start, leaving it running for the entire session + inferences, then stopping
- Starting ORT session + inference, then enabling and stopping ETW
- Starting an ORT session with long-running inferences:
  - wpr -start [ort.wprp](e6228575e4/ort.wprp (L4)) -start [etw_provider.wprp](e6228575e4/onnxruntime/test/platform/windows/logging/etw_provider.wprp)
  - Wait a few seconds
  - wpr -stop ort.etl
  - Inferences are still running
- Verified that ONNXRuntimeLogEvent provider events are present, and that the new SessionCreation_CaptureState event appears under the Microsoft.ML.ONNXRuntime provider
Related: #18882, #19428
### Description
This PR has several combined ORT ETW changes that improve ORT log
diagnosability & performance.
- The existing log and Severity behavior configured by the dev through the ORT API at compile time remains the same
- The PR keeps the existing design, which defines 2 TraceLogging providers (though not both were in use before this PR)
- Keeps great inference (inf) and session load performance even with
dynamic logging enabled (see below)
- On Windows, when ONNXRuntimeTraceLoggingProvider is enabled, ORT dynamically _adds_ a new sink reflecting the severity level provided by ETW (e.g., Critical through Verbose), per the need at runtime
- This allows previous printf-style LOGS() statements, for both CPU and NPU cases, to flow to ETW via a local trace (if enabled)
- This DOES NOT add any new Telemetry which can optionally be sent to
Microsoft.
- Telemetry events are ETW events marked with the Measure keyword that _can_ be sampled if a machine opts in
- Existing Microsoft.ML.ONNXRuntime events have appropriate keywords and
levels added if they were missing
- If Execution Providers (EPs) can provide more detailed insight into
their HW or component, then this PR allows for those to be dynamically
logged instead of just at compile time
- In this PR, the QNN EP for QC NPUs can have basic or detailed
profiling enabled to give some insight into how the NPU is performing
- When the Microsoft.ML.ONNXRuntime ETW provider is enabled with the Profiling keyword and level 5, QC QNN basic profiling info is output to ETW
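The level-to-severity translation for the dynamically added sink can be sketched as below. The ETW level values match `TRACE_LEVEL_*` from the Windows SDK; the `OrtSeverity` names mirror `onnxruntime::logging::Severity`, but `MinSeverityForEtwLevel` is a hypothetical helper illustrating the mapping, not ORT's actual function.

```cpp
#include <cstdint>

// ETW trace levels (TRACE_LEVEL_* in evntrace.h).
enum class EtwLevel : uint8_t {
  Critical = 1,
  Error = 2,
  Warning = 3,
  Information = 4,
  Verbose = 5,
};

// ORT log severities, most to least verbose
// (names mirror onnxruntime::logging::Severity).
enum class OrtSeverity : int { kVERBOSE = 0, kINFO = 1, kWARNING = 2, kERROR = 3, kFATAL = 4 };

// Sketch: translate the level an ETW session asked for into the minimum
// ORT severity the dynamically added sink should forward. A session at
// TRACE_LEVEL_VERBOSE gets everything; TRACE_LEVEL_CRITICAL only fatals.
OrtSeverity MinSeverityForEtwLevel(EtwLevel level) {
  switch (level) {
    case EtwLevel::Verbose:     return OrtSeverity::kVERBOSE;
    case EtwLevel::Information: return OrtSeverity::kINFO;
    case EtwLevel::Warning:     return OrtSeverity::kWARNING;
    case EtwLevel::Error:       return OrtSeverity::kERROR;
    case EtwLevel::Critical:    return OrtSeverity::kFATAL;
  }
  return OrtSeverity::kFATAL;  // unreachable for valid levels
}
```

Because the sink is added (and removed) at runtime in response to the ETW enable callback, no log formatting cost is paid while tracing is off.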
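The keyword/level gating for QNN profiling described above can be sketched as follows. This is an illustrative model only: the keyword bit value, the function, and the enum are hypothetical; only the thresholds (level 5 for basic, max level 255 for detailed) come from this PR's description.

```cpp
#include <cstdint>

// Hypothetical keyword bit for the Microsoft.ML.ONNXRuntime provider;
// the actual value in ORT may differ.
constexpr uint64_t kProfilingKeyword = 0x100;

enum class QnnProfilingLevel { Off, Basic, Detailed };

// Sketch of the gating logic: Profiling keyword + level 5 (Verbose)
// enables QNN basic profiling; a max-level (255) request escalates to
// detailed profiling, at a known NPU perf cost.
QnnProfilingLevel SelectQnnProfiling(bool provider_enabled,
                                     uint64_t keywords,
                                     uint8_t level) {
  if (!provider_enabled || (keywords & kProfilingKeyword) == 0) {
    return QnnProfilingLevel::Off;
  }
  if (level >= 255) return QnnProfilingLevel::Detailed;
  if (level >= 5) return QnnProfilingLevel::Basic;
  return QnnProfilingLevel::Off;
}
```

Placing detailed profiling above level 5 means the common "all keywords, level 5" trace profiles get NPU stats without paying the detailed-profiling slowdown.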
### Motivation and Context
- This makes ORT logging and diagnosability more performant (on Windows) and available in a wider variety of runtime environments.
- The performance difference for inference times was drastic: ~300x+ faster when these logs were output to ETW vs. just stdout (Verbose Severity)
- This style of ETW dynamic tracing is the widely used standard for
Windows components, and even by some 3rd party software since the ETW
API is open and part of the Windows API
- These ETW based logs can be seamlessly combined with other ETW logs
such as an AI component/feature using ORT, OS CPU profiling, scheduling,
and more
- Before this PR, ORT logging was largely printf style, output to a sink (usually stdout) only if compiled with a certain log Severity. When enabled, that logging (to stdout) would vastly slow down inference times. Once compiled for release, the internal ORT logs were not accessible to anyone except the AI model developer in their dev inner loop. That meant logs could not be enabled on a lab machine, or on a production system where runtime behavior or performance might differ for various reasons on a wide variety of HW.
- This change was tested with performance in mind and tested with a
mobilenet small AI model with onnxruntime_perf_test
- CPU: There was no statistical difference in inference times between the baseline (main) and this PR, whether ETW was enabled or not (both ORT providers, all keywords, level 5).
- NPU (QNN on SP9 or Dev Kit 2023 QC SQ3): There was no statistical difference in inference times with this PR whether ETW (both ORT providers, all keywords) was enabled or not at Level 5 (Verbose), even with QNN Basic profiling turned on and outputting NPU stats to ETW.
- As expected and designed, there is a perf slowdown when the max level (255) is enabled, which translates to QNN Detailed profiling. This mirrors the expected NPU slowdown when individual model operations are monitored and recorded, and is similar to the QNN SDK Detailed profiling performance separate from this PR. Detailed profiling is deliberately placed above level 5 (Verbose), since 5 is commonly the max level used in many trace profiles and so won't affect inference performance.
- Other OSes such as Linux & Android are left untouched for now.
- Out of scope for this PR, but TraceLogging is available on Linux via LTTng tracing, so in the future this optional tracing could also be made available on other OSes where a TraceLogging API is available.