[Internal] Design Docs: Adds Design Document for Client Telemetry (#3590)
* sdk design for client telemetry
* Otel design
* update optel design
* added more nformation
* updated ct design
* remove otel design
* Client Telemetry Design
* update typos
* fix typos
* fix typos
* added limitation
* updated docs
* updated doc
* updated text
* Update docs/observability.md
  Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
* Update docs/observability.md
  Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
* Update docs/observability.md
  Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
* Update docs/observability.md
  Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
* Update docs/observability.md
  Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
* move stuff here and there.

---------

Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
Parent: d58b44138e
Commit: df630928ca

@@ -34,3 +34,74 @@ flowchart TD

```
SendResponse --> OperationCall
```

## Send telemetry from SDK to service (Private Preview)

### Introduction

When opted in, the CosmosDB SDK collects the aggregated telemetry data listed below and sends it to the Azure CosmosDB service every 10 minutes.

1. Operation (CRUD APIs) latencies and Request Units (RUs).
2. Metadata cache (e.g., CollectionCache) miss statistics.
3. Client system usage (during an operation):
    * CPU usage
    * Memory usage
    * Thread starvation
    * Network connections opened (TCP connections only)
4. Top 10 slowest network interactions.

> Note: We don't collect any PII data as part of this feature.

### Benefits

Enabling this feature lets us use the collected telemetry to identify and address potential issues, which improves the support experience and allows some issues to be resolved before they impact your application. In short, customers who enable this feature can expect a smoother and more reliable experience.

### Impact of enabling this feature

* _Latency_: Customers should not see any impact on latency.
* _Total RPS_: The impact depends on, among other factors, the infrastructure hosting the application that uses the SDK, but it should not exceed 10%.
* _Any other impact_: The collectors need around 18MB of in-memory storage to hold the data. This footprint is constant: it does not grow, no matter how much data is collected.
* _Benchmark numbers_: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Performance.Tests/Contracts/BenchmarkResults.json
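
One way a collector's memory can stay constant is to aggregate datapoints into a pre-sized histogram instead of storing each one. The following is a minimal sketch in Python, not the SDK's .NET implementation: the class name `FixedBucketHistogram` and the bucket boundaries are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


class FixedBucketHistogram:
    """Counts values into a fixed set of buckets (hypothetical illustration)."""

    def __init__(self, upper_bounds_ms):
        # Bucket upper bounds are an assumption; one extra overflow bucket is added.
        self.upper_bounds_ms = sorted(upper_bounds_ms)
        self.counts = [0] * (len(self.upper_bounds_ms) + 1)
        self.total = 0

    def record(self, latency_ms):
        # bisect_left finds the first bucket whose upper bound is >= latency_ms.
        index = bisect_left(self.upper_bounds_ms, latency_ms)
        self.counts[index] += 1
        self.total += 1

    def percentile(self, p):
        """Return the upper bound of the bucket containing the p-th percentile."""
        if self.total == 0:
            return None
        target = p / 100.0 * self.total
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= target:
                if index < len(self.upper_bounds_ms):
                    return self.upper_bounds_ms[index]
                return float("inf")  # value fell in the overflow bucket
        return float("inf")


histogram = FixedBucketHistogram([1, 5, 10, 50, 100, 500, 1000])
for latency in [0.8, 3.2, 7.5, 7.9, 42.0, 2500.0]:
    histogram.record(latency)
```

However many latencies are recorded, `histogram.counts` keeps the same length, so the storage cost is set by the number of buckets rather than by the volume of traffic.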

### Components

**Telemetry Job:** A background task that collects the data and sends it to the Azure CosmosDB service every 10 minutes.

**Collectors:** In-memory storage that holds the telemetry data collected during an operation. There are 3 types of collectors:

* _Operational Data Collector_: Keeps operation-level latencies and request units.
* _Network Data Collector_: Keeps all metrics related to network (TCP) calls. It has its own sampler, which samples in only the slowest TCP calls for a particular replica.
* _Cache Data Collector_: Keeps all cache call latencies. Right now, only the collection cache is covered.
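
The sampling behaviour of the Network Data Collector can be illustrated with a bounded min-heap per replica. This is a sketch under assumptions, written in Python rather than the SDK's .NET: the class `SlowestCallSampler` and its method names are hypothetical, and applying the limit per replica (rather than overall) is assumed, since the text does not say which applies.

```python
import heapq
from collections import defaultdict


class SlowestCallSampler:
    """Keeps the N slowest calls seen for each replica (hypothetical sketch)."""

    def __init__(self, top_n=10):
        self.top_n = top_n
        # One min-heap per replica; the heap root is the fastest call retained.
        self._heaps = defaultdict(list)

    def record(self, replica, latency_ms):
        heap = self._heaps[replica]
        if len(heap) < self.top_n:
            heapq.heappush(heap, latency_ms)
        elif latency_ms > heap[0]:
            # Replace the fastest retained call with the new, slower one.
            heapq.heapreplace(heap, latency_ms)

    def sampled(self, replica):
        """Return the retained latencies for a replica, slowest first."""
        return sorted(self._heaps.get(replica, []), reverse=True)


sampler = SlowestCallSampler(top_n=3)
for latency in [12.0, 250.0, 8.0, 95.0, 40.0]:
    sampler.record("replica-1", latency)
```

With `top_n=3`, `sampler.sampled("replica-1")` retains only the three slowest of the five recorded calls, and each replica's heap never grows beyond `top_n` entries.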

**Get VM Information**:

- Azure VM: An [Azure Instance Metadata](https://learn.microsoft.com/azure/virtual-machines/instance-metadata-service?tabs=windows) call.
- Non-Azure VM: We don't collect any information except the VMID, which will be a GUID or a hashed machine name.
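
For the non-Azure case, the idea of a VMID that is either a GUID or a hashed machine name can be sketched as follows. The hash algorithm (SHA-256), the function names, and the order of preference between the two options are all assumptions made for illustration; the document does not specify them.

```python
import hashlib
import socket
import uuid


def hashed_machine_name(machine_name):
    """Return a hex digest so the raw machine name is not sent as-is."""
    # SHA-256 is an assumption; the actual algorithm is not stated in this document.
    return hashlib.sha256(machine_name.encode("utf-8")).hexdigest()


def fallback_vm_id(machine_name=None):
    """Use a hashed machine name when one is available, else a random GUID."""
    name = machine_name if machine_name is not None else socket.gethostname()
    if name:
        return hashed_machine_name(name)
    return str(uuid.uuid4())
```

Hashing is deterministic, so the same machine yields the same VMID across runs, while a generated GUID differs each time it is created.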

**Processor**: Fetches all the data from the collectors, divides it into small chunks (<2MB), and sends each chunk to the Azure CosmosDB service.
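
The chunking step can be sketched as below. This is an illustration under assumptions, in Python rather than the SDK's .NET: the payload is assumed to be a JSON array of records, and the function name `split_into_chunks` is hypothetical. Only the size limit (under 2MB per chunk) comes from the description above.

```python
import json


def split_into_chunks(records, max_bytes=2 * 1024 * 1024):
    """Group records into lists whose compact JSON encoding stays under max_bytes."""
    chunks, current, current_size = [], [], 2  # 2 bytes for the enclosing "[]"
    for record in records:
        encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")
        if 2 + len(encoded) >= max_bytes:
            raise ValueError("a single record exceeds the chunk size limit")
        # +1 accounts for the comma separating this record from the previous one.
        needed = len(encoded) + (1 if current else 0)
        if current and current_size + needed >= max_bytes:
            # Adding this record would reach the limit, so start a new chunk.
            chunks.append(current)
            current, current_size, needed = [], 2, len(encoded)
        current.append(record)
        current_size += needed
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be serialized with `json.dumps(chunk, separators=(",", ":"))` and sent as one request; record order is preserved across chunks.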

```mermaid
flowchart TD
    subgraph TelemetryJob[Telemetry Background Job]
        subgraph Storage[In Memory Storage or Collectors]
            subgraph NetworkDataCollector[Network Data Collector]
                TcpDatapoint(Network Request Datapoint) --> NetworkHistogram[(Histogram)]
                DataSampler(Sampler)
            end
            subgraph DataCollector[Operational Data Collector]
                OpsDatapoint(Operation Datapoint) --> OperationHistogram[(Histogram)]
            end
            subgraph CacheCollector[Cache Data Collector]
                CacheDatapoint(Cache Request Datapoint) --> CacheHistogram[(Histogram)]
            end
        end
        subgraph TelemetryTask[Telemetry Task Every 10 min]
            CacheAccountInfo(Cached Account Properties) --> VMInfo
            VMInfo(Get VM Information) --> CollectSystemUsage
            CollectSystemUsage(Record System Usage Information) --> GetDataFromCollector
        end
        subgraph Processor
            GetDataFromCollector(Fetch Data from Collectors) --> Serializer
            Serializer(Serialize and divide the Payload) --> SendCTOverHTTP(Send Data over HTTP to Service)
        end
        Storage --> |Get Aggregated data|GetDataFromCollector
    end
```

### Limitations

1. AAD support is not available.