[Internal] Design Docs: Adds Design Document for Client Telemetry (#3590)

* sdk design for client telemetry

* Otel design

* update optel design

* added more information

* updated ct design

* remove otel design

* Client Telemetry Design

* update typos

* fix typos

* fix typos

* added limitation

* updated docs

* updated doc

* updated text

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>

* move stuff here and there.

---------

Co-authored-by: Kiran Kumar Kolli <kirankk@microsoft.com>
Sourabh Jain 2023-06-10 01:52:17 +05:30 committed by GitHub
Parent d58b44138e
Commit df630928ca
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
1 changed file: 72 additions and 1 deletion

@@ -33,4 +33,75 @@ flowchart TD
OtherLogic --> GetResponse(Get Response for the request)
SendResponse --> OperationCall
```
## Send telemetry from SDK to service (Private Preview)
### Introduction
When opted in, the CosmosDB SDK collects the aggregated telemetry data listed below and sends it to the Azure CosmosDB service every 10 minutes.
1. Operation (CRUD APIs) latencies and Request Units (RUs).
2. Metadata cache (e.g. CollectionCache) miss statistics.
3. Client system usage (during an operation):
    * CPU usage
    * Memory usage
    * Thread starvation
    * Network connections opened (TCP connections only)
4. Top 10 slowest network interactions.
> Note: We don't collect any PII data as part of this feature.
### Benefits
Enabling this feature provides numerous benefits. The telemetry data collected will allow us to identify and address potential issues. This results in a superior support experience and ensures that some issues can even be resolved before they impact your application. In short, customers with this feature enabled can expect a smoother and more reliable experience.
### Impact of enabling this feature
* _Latency_: Customers should not see any impact on latency.
* _Total RPS_: The impact depends on, among other factors, the infrastructure hosting the application that uses the SDK, but it should not exceed 10%.
* _Any other impact_: The collectors need around 18 MB of in-memory storage to hold the data. This footprint is constant: it does not grow, no matter how much data is collected.
* Benchmark Numbers: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Performance.Tests/Contracts/BenchmarkResults.json
### Components
**Telemetry Job:** Background task which collects the data and sends it to the Azure CosmosDB service every 10 minutes.
**Collectors:** In-memory storage which keeps the telemetry data collected during an operation. There are three types of collectors:
* _Operational Data Collector_: It keeps operation level latencies and request units.
* _Network Data Collector_: It keeps all the metrics related to network or TCP calls. It has its own sampler, which samples in only the slowest TCP calls for a particular replica.
* _Cache Data Collector_: It keeps all the cache call latencies. Right now, only collection cache is covered.
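
The sampling behaviour described for the Network Data Collector can be sketched as follows. This is a minimal Python illustration, not the SDK's implementation: the class name `SlowestCallSampler`, the `limit` parameter, and the use of a per-replica min-heap are all assumptions made for the example.

```python
import heapq
from collections import defaultdict


class SlowestCallSampler:
    """Keeps only the `limit` slowest calls seen per replica (hypothetical helper)."""

    def __init__(self, limit=10):
        self.limit = limit
        # replica id -> min-heap of latencies; the heap root is the fastest call kept
        self._heaps = defaultdict(list)

    def record(self, replica, latency_ms):
        heap = self._heaps[replica]
        if len(heap) < self.limit:
            heapq.heappush(heap, latency_ms)
        elif latency_ms > heap[0]:
            # Evict the fastest retained call in favour of this slower one
            heapq.heapreplace(heap, latency_ms)

    def slowest(self, replica):
        """Return the retained latencies for a replica, slowest first."""
        return sorted(self._heaps.get(replica, []), reverse=True)
```

Because each replica retains at most `limit` entries, memory use stays bounded however many calls are recorded.
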
**Get VM Information**:
- Azure VM: [Azure Instance Metadata](https://learn.microsoft.com/azure/virtual-machines/instance-metadata-service?tabs=windows) call.
- Non-Azure VM: We don't collect any information other than the VMID, which will be a GUID or a hashed machine name.
**Processor**: Fetches all the data from the collectors, divides it into small chunks (<2MB), and sends each chunk to the Azure CosmosDB service.
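
The chunking step can be sketched roughly as below. This is an illustration only: the newline-delimited JSON serialization, the function name `chunk_records`, and the handling of oversized records are assumptions, and only the 2 MB limit comes from the description above.

```python
import json

MAX_CHUNK_BYTES = 2 * 1024 * 1024  # the "<2MB" limit from the text


def chunk_records(records, max_bytes=MAX_CHUNK_BYTES):
    """Yield newline-delimited JSON payloads, each strictly smaller than max_bytes."""
    chunk, size = [], 0
    for record in records:
        line = json.dumps(record).encode("utf-8") + b"\n"
        if len(line) >= max_bytes:
            raise ValueError("a single record exceeds the chunk limit")
        if chunk and size + len(line) >= max_bytes:
            # Adding this record would reach the limit, so flush the current chunk
            yield b"".join(chunk)
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield b"".join(chunk)
```

Each yielded payload could then be sent as the body of one HTTP request, so no single request reaches the size limit.
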
```mermaid
flowchart TD
subgraph TelemetryJob[Telemetry Background Job]
subgraph Storage[In Memory Storage or Collectors]
subgraph NetworkDataCollector[Network Data Collector]
TcpDatapoint(Network Request Datapoint) --> NetworkHistogram[(Histogram)]
DataSampler(Sampler)
end
subgraph DataCollector[Operational Data Collector]
OpsDatapoint(Operation Datapoint) --> OperationHistogram[(Histogram)]
end
subgraph CacheCollector[Cache Data Collector]
CacheDatapoint(Cache Request Datapoint) --> CacheHistogram[(Histogram)]
end
end
subgraph TelemetryTask[Telemetry Task Every 10 min]
CacheAccountInfo(Cached Account Properties) --> VMInfo
VMInfo(Get VM Information) --> CollectSystemUsage
CollectSystemUsage(Record System Usage Information) --> GetDataFromCollector
end
subgraph Processor
GetDataFromCollector(Fetch Data from Collectors) --> Serializer
Serializer(Serialize and divide the Payload) --> SendCTOverHTTP(Send Data over HTTP to Service)
end
Storage --> |Get Aggregated data|GetDataFromCollector
end
```
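
Each collector in the diagram feeds datapoints into a histogram, which is what keeps the memory footprint constant. The diagram does not say which histogram implementation is used; the fixed-bucket version below is a hypothetical Python sketch of the general idea, with made-up bucket bounds.

```python
import bisect


class FixedBucketHistogram:
    """Latency histogram whose memory use is fixed by its bucket bounds."""

    def __init__(self, upper_bounds_ms):
        self.bounds = sorted(upper_bounds_ms)
        # One counter per bound plus an overflow bucket; never grows afterwards
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def record(self, latency_ms):
        # bisect_left finds the first bucket whose upper bound is >= latency_ms
        self.counts[bisect.bisect_left(self.bounds, latency_ms)] += 1
        self.total += 1

    def percentile(self, p):
        """Return the upper bound of the bucket containing the p-th percentile."""
        if self.total == 0:
            return None
        target = p / 100 * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
```

Recording a datapoint only increments a counter, so the storage needed is the same after ten datapoints as after ten million.
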
### Limitations
1. AAD Support is not available.