### Description
This PR introduces support for custom external data loader. An EP can
register a custom external data loader to override the default behavior,
making it possible to upload initializers directly to GPU.
### Motivation and Context
- In ONNX Runtime Web, WebAssembly uses 32-bit as pointer type
(`sizeof(size_t)==4`), which means there is a 4GB hard limit on the
maximum memory. As the ONNX models get larger, this becomes a blocker
for supporting medium-sized language models.
- ORT runs out of memory because the current code always loads data into
CPU memory, including the .onnx file (protobuf) and external data
file(s). However, if using GPU EP, the big data does not need to be kept
on CPU because the only thing that ORT does is to load the data into
memory, upload to GPU and then release them.
- Some platforms has offered developers way to upload data directly to
GPU. For example, webgpu allows uploading from any ArrayBuffer (it can
be a side buffer, not count into the 4GB) to GPU directly. This helps to
keep the CPU memory usage significantly.
### Design
Class `ExternalDataLoader` and `ExternalDataLoaderManager` are
introduced. They are similar to `DataTransfer` and
`DataTransferManager`. `InferenceSession` owns the manager object, and
`SessionState` keeps a reference to it.
Added a new method `GetExternalDataLoader` in `IExecutionProvider`. An
EP can override the method to register an instance of custom external
data loader.
The key function in a `ExternalDataLoader` class is method `LoadTensor`:
```c++
// the tensor is pre-created using the TensorProto info of the initializer and the MemoryInfo (from allocation plan).
virtual common::Status LoadTensor(const Env& env,
const std::filesystem::path& data_file_path,
FileOffsetType data_offset,
SafeInt<size_t> data_length,
Tensor& tensor) const;
```
This function can be registered by EP, going through a few layers and
eventually get into `DeserializeTensorProto()` in the finalizing stage
of session initialization. In this step, initializer tensors are
created. Behavior is changed to first look up for a registered external
data loader that can handle the current memory info. If any instance is
available, use the loader; otherwise respect the old code path.