Watchdog Metrics

Last Update: 2018-06-08

Analysis

Questions we want to answer with metrics data include:

  • Overall throughput performance:
    • Consumer submission to PhotoDNA submission (time in queue)
    • Response from PhotoDNA (time waiting for reply)
    • Response to consumer API (time to reply)
    • The sum of the above to give an easy health measure
  • Throughput data for positive identifications, since they require manual intervention:
    • Number of positively flagged images
      • Breakdown of images not yet reviewed and under review
    • Number of images confirmed vs falsely identified
  • The number of items in the message queue
  • Total number of images processed
    • Breakdown of positive vs negative responses

Each of these should be available globally, as well as broken down per consumer application.

Collection

This project uses Ping Centre to collect metrics data. Pings will be sent as JSON blobs. All pings will include the following fields:

  • topic: used by Ping Centre. In this case always "watchdog-proxy": string
  • timestamp: Using UNIX epoch time in milliseconds (i.e. Date.now() in JavaScript): number
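
As a minimal sketch of how the common fields could be attached, the helper below wraps any event-specific fields with topic and timestamp. The name buildPing and the injectable clock argument are assumptions made for illustration; they are not part of the project's code.

```javascript
// Hypothetical helper (not from the project): adds the two fields that
// every ping carries to a set of event-specific fields.
function buildPing(eventFields, now = Date.now) {
  return {
    ...eventFields,
    // Set last so that event fields cannot override the common fields.
    topic: "watchdog-proxy", // constant for this project
    timestamp: now(),        // UNIX epoch time in milliseconds, as a number
  };
}

// A fixed clock makes the example reproducible.
const examplePing = buildPing(
  { event: "new_item", consumer_name: "screenshots" },
  () => 1534784298646
);
console.log(JSON.stringify(examplePing, null, 2));
```

Passing the clock as a parameter is only a convenience for testing; in normal use the default Date.now applies.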

Events

Additional fields submitted are described below.

A new item is submitted from a consumer

  • consumer_name: the name of the consumer submitting the request: string
  • event: "new_item": string
  • watchdog_id: the ID assigned to the task: string
  • type: Content-Type of the item submitted (e.g. 'image/png' or 'image/jpeg'): string

Example:

{
  "topic": "watchdog-proxy",
  "timestamp": 1534784298646,

  "consumer_name": "screenshots",
  "event": "new_item",
  "watchdog_id": "9ad08ec4-be1a-4327-b4ef-282bed37621f",
  "type": "image/png"
}
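
The field list for this event can also be expressed as a small type check. The sketch below is illustrative only: NEW_ITEM_FIELDS and validateNewItemPing are hypothetical names, and the check covers only the types and constants stated in this document.

```javascript
// Expected JavaScript type of each field in a "new_item" ping,
// taken from the field lists in this document.
const NEW_ITEM_FIELDS = {
  topic: "string",
  timestamp: "number",
  consumer_name: "string",
  event: "string",
  watchdog_id: "string",
  type: "string",
};

// Hypothetical validator: returns a list of problems, which is empty
// when the ping is well formed.
function validateNewItemPing(ping) {
  const problems = [];
  for (const [field, expected] of Object.entries(NEW_ITEM_FIELDS)) {
    const actual = typeof ping[field];
    if (actual !== expected) {
      problems.push(`${field}: expected ${expected}, got ${actual}`);
    }
  }
  if (ping.topic !== "watchdog-proxy") problems.push('topic must be "watchdog-proxy"');
  if (ping.event !== "new_item") problems.push('event must be "new_item"');
  return problems;
}
```

For instance, a ping whose timestamp is sent as a string rather than a number would be reported as "timestamp: expected number, got string".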

Queue poller periodic heartbeat

The pollQueue function repeatedly polls the queue for jobs waiting to be processed. It is invoked every 60 seconds and runs for most of that minute before exiting. (This is a hack to work around the lack of support for long-running functions in AWS Lambda.)

Metrics pings will be sent at these times while the pollQueue function is running:

  • when the function starts (every 60 seconds)
  • roughly every 20 seconds while it runs
  • when the function exits (roughly 60 seconds after start)
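
The schedule above can be sketched as a list of offsets from the start of one invocation. The function name heartbeatOffsets and its parameters are assumptions for illustration, and the real timings are only approximate ("roughly" every 20 seconds).

```javascript
// Hypothetical sketch: nominal heartbeat times, in milliseconds after the
// start of one pollQueue invocation.
function heartbeatOffsets(runMs = 60000, intervalMs = 20000) {
  if (intervalMs <= 0) throw new RangeError("intervalMs must be positive");
  const offsets = [];
  // Ping at start (offset 0), then once per interval while still running.
  for (let t = 0; t < runMs; t += intervalMs) {
    offsets.push(t);
  }
  // Final ping when the function exits.
  offsets.push(runMs);
  return offsets;
}

console.log(heartbeatOffsets()); // [0, 20000, 40000, 60000]
```

Under these nominal values, each invocation would send four heartbeat pings.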

The metrics sent in the ping will contain:

  • event: "poller_heartbeat": string
  • poller_id: UUID given by Lambda to the current invocation of the pollQueue function: string
  • items_in_queue: Number of items in the queue before the worker removes any: integer
  • items_in_progress: Number of items being processed: integer
  • items_in_waiting: Number of items waiting to be queued: integer

Example:

{
  "topic": "watchdog-proxy",
  "timestamp": 1534784298646,

  "event": "poller_heartbeat",
  "poller_id": "31417de1-b3ef-4e90-be3c-e5116d459d1d",
  "items_in_queue": 1504,
  "items_in_progress": 22,
  "items_in_waiting": 38
}

A worker processes a queue item

For each item fetched from the queue by the poller, the processQueueItem function will be invoked. That function, in turn, will send these metrics:

  • event: "worker_works": string
  • worker_id: UUID given by Lambda to the current invocation of the processQueueItem function: string
  • consumer_name: the name of the consumer submitting the request: string
  • watchdog_id: the ID assigned to the task: string
  • photodna_tracking_id: ID from PhotoDNA: string
  • is_match: Whether the response was positive or negative: boolean
  • is_error: Whether the response was an error: boolean
  • timing_sent: time (in ms) to send item to PhotoDNA: integer
  • timing_received: time (in ms) before response from PhotoDNA: integer
  • timing_submitted: time (in ms) to finish sending a response to consumer's report URL: integer

Example:

{
  "topic": "watchdog-proxy",
  "timestamp": 1534784298646,

  "event": "worker_works",
  "worker_id": "8cdb1e6b-7e15-489d-b171-e7a05781c5da",
  "consumer_name": "screenshots",
  "watchdog_id": "9ad08ec4-be1a-4327-b4ef-282bed37621f",
  "photodna_tracking_id": "1_photodna_a0e3d02b-1a0a-4b38-827f-764acd288c25",
  "is_match": false,
  "is_error": false,

  "timing_sent": 89,
  "timing_received": 161,
  "timing_submitted": 35
}
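
To illustrate how these pings might feed the per-consumer questions listed under Analysis, the sketch below groups "worker_works" pings by consumer_name. The function summarizeWorkerPings is hypothetical, and treating the sum of the three timing fields as the health measure is one plausible reading of this document rather than a confirmed definition.

```javascript
// Hypothetical aggregation: per-consumer counts and total time spent,
// computed from a list of already-parsed pings.
function summarizeWorkerPings(pings) {
  const summary = new Map();
  for (const ping of pings) {
    if (ping.event !== "worker_works") continue; // ignore other event types
    if (!summary.has(ping.consumer_name)) {
      summary.set(ping.consumer_name, { processed: 0, matches: 0, errors: 0, totalMs: 0 });
    }
    const stats = summary.get(ping.consumer_name);
    stats.processed += 1;
    if (ping.is_match) stats.matches += 1;
    if (ping.is_error) stats.errors += 1;
    // Assumed health measure: sum of the three timing fields.
    stats.totalMs += ping.timing_sent + ping.timing_received + ping.timing_submitted;
  }
  return summary;
}

const summary = summarizeWorkerPings([
  {
    event: "worker_works",
    consumer_name: "screenshots",
    is_match: false,
    is_error: false,
    timing_sent: 89,
    timing_received: 161,
    timing_submitted: 35,
  },
]);
console.log(summary.get("screenshots"));
// { processed: 1, matches: 0, errors: 0, totalMs: 285 }
```

Dividing totalMs by processed gives a mean processing time per consumer; summing across all entries of the map gives the global figures.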