Public machine learning dataset for cloud monitoring
Перейти к файлу
microsoft-github-policy-service[bot] 6a07a53c2f
Auto merge mandatory file pr
This pr is auto merged as it contains a mandatory file and is opened for more than 10 days.
2022-11-28 19:10:33 +00:00
data Initial checkin 2019-02-15 17:55:38 -08:00
.gitignore Initial commit 2019-01-29 13:00:03 -08:00
CONTRIBUTING.md Initial checkin 2019-02-15 17:55:38 -08:00
LICENSE Initial checkin 2019-02-15 17:55:38 -08:00
README.md Update readme 2019-03-12 17:09:42 -07:00
SECURITY.md Microsoft mandatory file 2022-08-31 15:20:26 +00:00

README.md

Cloud Monitoring Dataset

The Cloud Monitoring Dataset is a set of real-world time series derived from Microsoft service and client telemetry signals. The data set contains anomalous patterns manually labeled by experts. The dataset is used for development, evaluation and improvement of anomaly detection algorithms in Microsoft's cloud monitoring tools.

Data Domains

The Cloud Monitoring Dataset consists of 67 real-world time series in 8 domains ranging from online services to Windows PCs. All data was collected from production Microsoft service telemetry.

The domains for the datasets are:

  • Store purchase counts by client type
  • Database query rates by individual machines in cluster
  • Database query rates by application context
  • Incoming service API query rate
  • Outgoing service API latency
  • Windows application crash rate by OS version and country
  • Web app query rates to service API

Data granularity is per minute or per hour depending on the time series.

The dataset is labeled with anomaly time points using domain knowledge elicited from experts in each area. Anomalies are labeled either by the domain experts, or with the assitance of domain experts reviewing the time series data. We developed a labeling web tool as part of this project.

Our data includes some time series that contain no anomaly labels. These are included to test algorithms on their ability to avoid false positives.

Dataset Format

Each dataset time series is contained in a CSV format file with 3 fields.

Field Description Example
TimeStamp This is an ISO 8601 compliant date value.
It can be enclosed in double quotes.
Time zone designation is optional.
"2018-07-03 14:00:00"
Value Integer or float value for the time series metric for the time stamp. 3429.0003
Label Designates whether time stamp is an anomaly.
1 means anomaly, 0 means not anomaly.
0

The time series CSV file has a header with the field names. A sample excerpt is the following.

TimeStamp,Value,Label
"2018-07-03 14:00:00",1,0
"2018-07-03 15:00:00",0,0
"2018-07-03 16:00:00",1,1
"2018-07-03 17:00:00",1,0

Annotation

Each time series CSV file is accompanied by a JSON file with a description of the time series. The JSON file follows the CSV on the Web (CSVW) standard.