Public types such as ContainerProperties currently lose any new fields added by the service. A property that is not a known field is dropped during the deserialize/serialize round trip. This can break older SDKs if the service adds new contract elements and an update is attempted: the service might throw an exception, thinking the user attempted to remove the new field, when the field was actually lost in the serialization logic.
To fix this issue, a new internal dictionary is added to all the public service contract types to hold any additional properties the C# POCO type is not aware of. That way, if the service contract evolves and the metadata changes, no information is lost.
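The round-trip behavior described above can be illustrated with a minimal sketch. It assumes System.Text.Json and its `[JsonExtensionData]` attribute, which may not be the serializer the SDK itself uses. `SampleContainerProperties`, `AdditionalProperties`, and `RoundTripExample` are hypothetical names, and the dictionary is public here only to keep the sketch short, whereas the PR describes an internal one.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical stand-in for a service contract type such as ContainerProperties.
public class SampleContainerProperties
{
    [JsonPropertyName("id")]
    public string Id { get; set; }

    // Catch-all for JSON properties this type does not declare.
    // Anything stored here is written back out on serialization.
    [JsonExtensionData]
    public Dictionary<string, JsonElement> AdditionalProperties { get; set; }
}

public static class RoundTripExample
{
    // Deserialize and serialize again, as happens on a read followed by an update.
    public static string RoundTrip(string json)
    {
        SampleContainerProperties properties =
            JsonSerializer.Deserialize<SampleContainerProperties>(json);
        return JsonSerializer.Serialize(properties);
    }
}
```

Without the extension-data dictionary, a field the type does not declare would be absent from the string returned by `RoundTrip`.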
Add tests for emulator change feed fix.
Validates Incremental Change Feed by inserting and deleting documents and verifying that nothing is reported.
Validates error message with Full Fidelity Change Feed and start from beginning.
EnableTcpConnectionEndpointRediscovery is used to enable address cache refresh on TCP connection reset notification. Refreshing the caches helps prevent future requests from going to stale addresses.
ConnectionStateListener holds a cache that maps the server physical addresses to the partitionKeyRanges (Server Key -> List[PartitionKeyRangeIdentity]). ReplicatedResourceClient updates this cache when it resolves the physical addresses for the partitions. When the listener receives a connection reset notification from the transport client, it triggers an address refresh in the GatewayAddressCache for all the partitionKeyRanges that map to the server address received in the notification.
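The mapping and refresh flow described above can be sketched roughly as follows. This is a simplified model, not the SDK's ConnectionStateListener: `ServerKeyToRangesCache`, `Register`, and `OnConnectionReset` are hypothetical names, partition key ranges are reduced to plain strings, and the refresh is modeled as a callback supplied by the caller.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical, simplified model of a Server Key -> partition key ranges cache.
public sealed class ServerKeyToRangesCache
{
    private readonly ConcurrentDictionary<string, HashSet<string>> rangesByServer =
        new ConcurrentDictionary<string, HashSet<string>>();

    // Called when physical addresses are resolved for a partition key range.
    public void Register(string serverKey, string partitionKeyRangeId)
    {
        HashSet<string> ranges =
            this.rangesByServer.GetOrAdd(serverKey, _ => new HashSet<string>());
        lock (ranges)
        {
            ranges.Add(partitionKeyRangeId);
        }
    }

    // Called on a connection reset notification for serverKey.
    // Invokes refreshAddresses once per mapped range and returns how many were refreshed.
    public int OnConnectionReset(string serverKey, Action<string> refreshAddresses)
    {
        if (!this.rangesByServer.TryGetValue(serverKey, out HashSet<string> ranges))
        {
            return 0;
        }

        // Copy under the lock so the callback runs without holding it.
        List<string> snapshot;
        lock (ranges)
        {
            snapshot = new List<string>(ranges);
        }

        foreach (string rangeId in snapshot)
        {
            refreshAddresses(rangeId);
        }

        return snapshot.Count;
    }
}
```

Ranges registered under a different server key are left untouched, so a reset on one endpoint only refreshes the addresses that could actually be stale.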
This PR adds an option for customers to enable PopulateIndexMetrics in the query request options, which lets them obtain the index utilization of their query. Once the option is enabled, the information is aggregated and shows up automatically in QueryMetrics. Enabling it incurs overhead, so set it only while debugging. The PR also adds a field called IndexMetrics to FeedResponse to expose the result to users.
This change is being made because most users are not capturing the diagnostics for these exceptions. It is not possible to root-cause or troubleshoot these issues without the diagnostics. Customers then have to make code changes to get the additional info, which causes the issue to persist for a long time.
1. This bug can cause requests to get stuck because the semaphore was not released when multiple requests were waiting for a new token. It only occurs when the background refresh has failed to get a new token.
2. This optimizes the scenario where multiple concurrent requests are waiting on the token. Each request now gets back the original task that is fetching the token, so the waiting requests no longer hit the failure one after another in serial. All the requests return the same exception.
The existing test for this scenario used .Wait, which blocked the threads and, in turn, all the other tasks meant to simulate concurrent requests. The tasks now use Task.Run so that they do not get blocked, and the wait logic was converted to async/await.
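The two points above, always releasing the semaphore and handing every waiter the same in-flight task, can be sketched as follows. This is an illustration under assumptions rather than the SDK's token cache: `SharedTokenRefresher` and its members are hypothetical names, and caching of a still-valid token is left out to keep the sketch short.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; not the SDK's token cache type.
public sealed class SharedTokenRefresher
{
    private readonly Func<Task<string>> fetchToken;
    private readonly SemaphoreSlim gate = new SemaphoreSlim(1, 1);
    private Task<string> inFlight;

    public SharedTokenRefresher(Func<Task<string>> fetchToken)
    {
        this.fetchToken = fetchToken;
    }

    public async Task<string> GetTokenAsync()
    {
        Task<string> current;
        await this.gate.WaitAsync();
        try
        {
            // Start a fetch only if none is running; otherwise reuse the running one.
            if (this.inFlight == null || this.inFlight.IsCompleted)
            {
                this.inFlight = this.fetchToken();
            }

            current = this.inFlight;
        }
        finally
        {
            // Released on every path, including when fetchToken throws synchronously.
            this.gate.Release();
        }

        // Awaited outside the semaphore, so waiters are not serialized
        // and a failure reaches all of them as the same exception.
        return await current;
    }
}
```

Because the semaphore only guards the decision of which task to await, a failed fetch cannot leave it held, and concurrent callers observe one shared outcome instead of retrying in sequence.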
1. Improved precision of all the metrics: different kinds of metrics are collected. Some use the unit conversion as the precision factor; where a value does not need to be converted, HistogramPrecisionfactor is used to preserve data precision.
2. Move telemetry handler before retry handler: telemetry should be collected from the topmost layer of the operation.
3. Test case improvements: test cases could start failing due to thread starvation or related issues, so a wait of up to 30 seconds was added.
4. Upper case Connection Mode and Consistency: just like Java.
1. If CreateContainerAsync fails, the container delete operation throws an exception, which causes the test to hide the original exception.
2. Applied several Visual Studio suggestions about code formatting
1. Fixes a query performance regression caused by doing 2 hash lookups on all the headers. The validation now runs only in debug mode.
2. Adds a mocked query benchmark to the performance tests
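One way to keep a validation out of release builds is `System.Diagnostics.ConditionalAttribute`, sketched below. The source does not say which mechanism the fix uses or what the validation checks, so `HeaderStore`, `Set`, and `ValidateNotDuplicate` are hypothetical, and the duplicate check is only a placeholder for whatever the real validation does.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper; not the SDK's header collection.
public static class HeaderStore
{
    // Hot path: in release builds this is a single dictionary write per header.
    public static void Set(Dictionary<string, string> headers, string name, string value)
    {
        ValidateNotDuplicate(headers, name);
        headers[name] = value;
    }

    // Calls to this method are omitted by the compiler unless DEBUG is defined,
    // so the extra hash lookup only happens in debug builds.
    [Conditional("DEBUG")]
    private static void ValidateNotDuplicate(Dictionary<string, string> headers, string name)
    {
        if (headers.ContainsKey(name))
        {
            throw new InvalidOperationException($"Header '{name}' was already set.");
        }
    }
}
```

A `#if DEBUG` block would give the same effect; the attribute just keeps the call site free of preprocessor directives.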
The build is currently broken since it is referencing a field only available in the preview SDK. This fixes the build issue and adds gates to prevent future issues
1. Add new system usage telemetry
2. Adds new ClientSideRequestStats interface to fix the start and end time
3. Fixes lock on client telemetry logging
4. Adds optimization to DiagnosticsHandler
5. Removes duplicate CPU collector in HA layer
* Updating Authorization header size limit for AAD and Resource tokens
Microsoft.Azure.Documents AuthorizationHelper allows Resource Tokens to be up to 8*1024 characters in length, and AAD tokens to be up to 16*1024 characters in length.
This PR carries that logic over to AuthorizationHelper for Microsoft.Azure.Cosmos.
* Dummy commit
* Update AuthorizationHelper.cs
Adding error trace
Co-authored-by: j82w <j82w@users.noreply.github.com>
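The Authorization header size limits mentioned in this change (8*1024 characters for Resource Tokens, 16*1024 for AAD tokens) could be checked along these lines. This is a sketch only: `TokenKind`, `AuthorizationLengthCheck`, and the exception type are assumptions, not the AuthorizationHelper API.

```csharp
using System;

// Hypothetical token classification for this sketch.
public enum TokenKind
{
    ResourceToken,
    AadToken,
}

public static class AuthorizationLengthCheck
{
    public const int MaxResourceTokenLength = 8 * 1024;
    public const int MaxAadTokenLength = 16 * 1024;

    // Throws if the token is longer than the limit for its kind.
    public static void Validate(string token, TokenKind kind)
    {
        if (token == null)
        {
            throw new ArgumentNullException(nameof(token));
        }

        int limit = kind == TokenKind.AadToken
            ? MaxAadTokenLength
            : MaxResourceTokenLength;

        if (token.Length > limit)
        {
            throw new ArgumentException(
                $"Authorization token length {token.Length} exceeds the limit of {limit}.",
                nameof(token));
        }
    }
}
```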
* Initial types and implementation
* Wiring through known places
* Refactor monitoring
* New tests
* public exception
* tests
* Refactoring error messaging so users get diagnostics
* Wiring through estimator
* tests
* undoing and cleaning up
* Rename of base implementation and refactor of public API
* more unit tests
* emulator tests
* docs
* adding context to exception
* contract
* undo file
* undo more
* logging always
* Addressing comments
* contract
* preview update after merge
Adds a feature flag for DCount.
Adds additional error info for unsupported cross partition queries. This is not used on the usual V3 paths that go through the pagination library. This is used for Compute.
1. DCount has special logic that doesn't block the query unless the flag is set to false. A missing flag does not block the query. This is necessary to avoid any breaking changes where the old SDKs did not set the flag.
2. The PartitionRoutingHelper is only used by compute and not by any of the v3 query code.
3. Adding the query plan info to the additional info is for backward compatibility with the old SDKs that do the try execute instead of the get query plan call.
Related PR #2385.
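The DCount flag rule above is a three-state check: only an explicit false blocks the query, while a missing flag does not. A minimal sketch, with `FeatureGate` and `ShouldBlockQuery` as hypothetical names and the missing flag modeled as null:

```csharp
public static class FeatureGate
{
    // dCountFlag is null when the caller (for example an older SDK) never sent the flag.
    public static bool ShouldBlockQuery(bool queryUsesDCount, bool? dCountFlag)
    {
        // Comparing a nullable bool to false is true only for an explicit false,
        // so null (flag missing) and true both let the query through.
        return queryUsesDCount && dCountFlag == false;
    }
}
```

Treating the missing flag like true is what keeps requests from older SDKs working unchanged.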
This PR makes the following changes:
Add Preferred Region in telemetry payload
Add Time Interval In Sec in telemetry Payload
Change Request Latency Unit to milliseconds
Currently, if an exception occurs while getting account details in GlobalEndpointManager, it is rethrown without the inner exception stack trace.
This change adds the inner exception stack trace when rethrowing the exception in GlobalEndpointManager.
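The general pattern for keeping the original stack trace when rethrowing is to pass the caught exception as the inner exception. The sketch below shows that pattern only. `AccountReader` and `GetAccountName` are hypothetical, and the exception type GlobalEndpointManager actually throws is not stated in the source.

```csharp
using System;

// Hypothetical wrapper; not the GlobalEndpointManager code.
public static class AccountReader
{
    public static string GetAccountName(Func<string> readAccount)
    {
        try
        {
            return readAccount();
        }
        catch (Exception ex)
        {
            // Passing ex as the inner exception preserves its type,
            // message, and stack trace for whoever logs the failure.
            throw new InvalidOperationException("Failed to read account details.", ex);
        }
    }
}
```

Calling `ToString()` on the outer exception then includes the inner exception's stack trace as well.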
Removes the retry on decryption failure in DeserializeAndDecryptResponseAsync(). The retry logic around decryption existed mainly to address policy changes, which the caller already handles in all cases except the ChangeFeedProcessor API set, where there are no Options through which to pass the required headers. That case will be addressed later with a better approach.
This PR collects Azure VM metadata when the client is initialized, and collects latency and request charge metrics for each type of operation (Point, Stream, Batch, Query).
How can we enable Client Telemetry?
If it is enabled through the environment variable COSMOS.CLIENT_TELEMETRY_ENABLED, it is enabled by default for the client.
It can be disabled while creating the client:
CosmosClientBuilder cosmosClientBuilder = <initialize client builder>;
cosmosClientBuilder.WithTelemetryDisabled();
Default value: false
How to configure the time to collect and send information to Kusto?
As part of this PR, a Task is scheduled to collect and calculate the metrics and send this information to Kusto.
Environment Variables
Juno endpoint: COSMOS.CLIENT_TELEMETRY_ENDPOINT
Default value = blank. If it is not configured, the client does not try to send any data, but it still keeps collecting the data in the background.
Scheduling time: COSMOS.CLIENT_TELEMETRY_SCHEDULING_IN_SECONDS
Default value = 600 seconds = 10 minutes
What happens if Telemetry is enabled but the endpoint is not configured?
It keeps collecting the data in memory but does not send it anywhere. Once the URL is configured, it starts sending that data to that URL.
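The configuration rules above can be summarized in a small sketch. `TelemetrySettings` is a hypothetical type, not the SDK's ClientTelemetryOptions. The variable names and the 600-second default come from the text above, while treating the enabled flag as a boolean string that defaults to false is an assumption. The lookup function is passed in so the sketch does not depend on the real process environment.

```csharp
using System;

// Hypothetical settings holder for this sketch.
public sealed class TelemetrySettings
{
    public bool Enabled { get; private set; }
    public string Endpoint { get; private set; }
    public TimeSpan Interval { get; private set; }

    // Data is collected whenever telemetry is enabled,
    // but it is only sent once an endpoint is configured.
    public bool CanSend => this.Enabled && !string.IsNullOrEmpty(this.Endpoint);

    public static TelemetrySettings Load(Func<string, string> getVariable)
    {
        // Assumption: the flag is a boolean string; anything unparsable counts as false.
        bool enabled = bool.TryParse(getVariable("COSMOS.CLIENT_TELEMETRY_ENABLED"), out bool parsed)
            && parsed;

        // Fall back to 600 seconds (10 minutes) when the value is missing or invalid.
        int seconds = int.TryParse(
            getVariable("COSMOS.CLIENT_TELEMETRY_SCHEDULING_IN_SECONDS"), out int value) && value > 0
            ? value
            : 600;

        return new TelemetrySettings
        {
            Enabled = enabled,
            Endpoint = getVariable("COSMOS.CLIENT_TELEMETRY_ENDPOINT") ?? string.Empty,
            Interval = TimeSpan.FromSeconds(seconds),
        };
    }
}
```

In an application the lookup would typically be `Environment.GetEnvironmentVariable`, as in `TelemetrySettings.Load(Environment.GetEnvironmentVariable)`.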
What information is collected as part of this PR?
The following information is collected:
System Information: CPU usage and memory usage.
Operation Related Information: Request Charge and Latency information for all Item/Document level operations, along with batch and query.
VM Metadata
Database Account Name
How does this change affect performance?
Check out Sheet 3 for final results.
Performance Testing Report
Maximum change in performance is 1.2%
Points helpful for code review:
ClientTelemetry.cs: This file has all collection and calculation related code.
ClientTelemetryTest.cs: It covers most of the scenarios to test client telemetry.
ClientTelemetryOptions.cs: Telemetry related options.
* Set correct substatus code for GROUP BY queries in PartitionRoutingHelper
* Use common test object for QueryPartitionProvider
* Remove extra method with default argument
Co-authored-by: Samer Boshra <sboshra@microsoft.com>
Use Diagnostics instead of Trace when creating an ItemResponse from a ResponseMessage. The other methods in the ResponseFactory already use Diagnostics; the CreateItemResponse method was the only exception.
* Add coverage for gateway mode in the emulator tests
* Add more coverage for gateway mode to SanityQueryTests
Co-authored-by: j82w <j82w@users.noreply.github.com>
* Allow user to specify a real trace object in DocumentClient.ProcessRequestAsync
* Allow PartitionKeyRangeGoneRetryPolicy to use a trace other than NoOpTrace
* Added test