You OK, Remote Settings Uptake Telemetry?

Перейти к файлу

Peter Bengtsson 0bdbd9520d Moved update		2019-03-14 10:24:54 -04:00
.circleci	consistent dockerhub env vars (#16 )	2019-03-01 14:08:24 -05:00
.dockerignore	basic unit test (#7 )	2019-02-22 10:08:31 -05:00
.gitignore	sentry events (#14 )	2019-02-26 09:16:43 -05:00
.therapist.yml	therapist (#9 )	2019-02-25 09:23:56 -05:00
Dockerfile	basic unit test (#7 )	2019-02-22 10:08:31 -05:00
README.md	Moved update	2019-03-14 10:24:54 -04:00
deploy.sh	ship to dockerhub (#15 )	2019-03-01 13:38:20 -05:00
main.py	Refactor reading good/bad status in rows (#21 )	2019-03-06 16:59:31 +01:00
pytest.ini	sentry events (#14 )	2019-02-26 09:16:43 -05:00
run.sh	therapist (#9 )	2019-02-25 09:23:56 -05:00
setup.cfg	bustamove	2019-02-11 13:50:03 -05:00
setup.py	sentry events (#14 )	2019-02-26 09:16:43 -05:00
test_main.py	Refactor reading good/bad status in rows (#21 )	2019-03-06 16:59:31 +01:00

README.md

MOVED

Moved to be a part of https://github.com/mozilla-services/remote-settings-lambdas

remote-settings-health

The objective of this script is to check in on the collected uptake from Telemetry with respect to Remote Settings. Essentially, across all statuses that we worry about, we use this script to check that the amount of bad statuses don't exceed a threshold.

The ultimately use case for this script is to run it periodically and use it to trigger alerts/notifications that the Product Delivery team can take heed of.

Architectural Overview

In Plain English

This is stateless script, written in Python, meant to be executed roughly once a day. It queries Redash for a Redash query's data that is pre-emptively run every 24 hours. The data it analyses is a list of everything stored in "Remote Content" along with a count of every possible status (e.g. up_to_date, network_error, etc.) The script sums the "good" statuses and compares it against the "bad" statuses and if that ratio (expressed as a percentage) is over a certain threshold it alerts by sending an event to Sentry which notifies the team by email.

Note: bad statuses are the ones ending with _error (eg. sync_error, network_error, ...)

Historical Strategy

As of Firefox Nightly 67, Firefox clients that use Remote Settings only send Telemetry pings in daily batches. I.e. it's not real-time. The "uptake histogram" is buffered in the browser and will send periodically instead of as soon as possible. In Firefox Nightly (67 at the time of writing), we are switching to real-time Telemetry Events.

On the Telemetry backend we're still consuming the older uptake histogram but once, the population using the new Telemetry Events is large enough, we will switch the Redash query (where appropriate) and still use this script to worry about the percentage thresholds. And the strategy for notifications should hopefully not change. There is no plan to rush this change since we'll still be doing "legacy" histogram telemetry and the new telemetry events so we can let it mature a bit before changing the source of data.

Although we will eventually switch to real-time Telemetry Events, nothing changes in terms of the surface API but the underlying data is more up-to-date and the response time of reacting to failure spikes is faster.

It is worth noting that as underlying tools change, we might decommission this solution and use something native to the Telemetry analysis tools that achives the same goal.

Redash

The query we use is mentioned as a config default in the main main.py file. Follow that link, without the API key, in your browser to study the query and/or to fork it for a better query when applicable.

Configuration

This script encodes all configuration inside main.py as defaults for things that can be overwritten by environment variables.

The only mandatory environment variable is REDASH_API_KEY. You get this by logging in https://sql.telemetry.mozilla.org and click to edit the/any relevant query and in the upper right-hand corner there's a drop-down with the label "..." and inside it "Show API Key".

Either put this into a .env file or use it as a regular sourced enviroment variable. E.g.:

    $ cat .env
    REDASH_API_KEY=MXAqPP1Q4FNLjHeT1w

For most other configuration options, the best strategy is to study all the uppercased constants in the main.py file.

To hack on

$ pip install -e ".[dev,test]"
$ REDASH_API_KEY=OCZccH...FhB4cT python main.py

The only mandatory environment variable is REDASH_API_KEY. All other variables are encoded as good defaults inside main.py and can all be overwritten with environment variables.

To not have to put the REDASH_API_KEY on the command line every time you can either run:

export REDASH_API_KEY=OCZccH...FhB4cT

or, create a .env file that looks like this:

$ cat .env
REDASH_API_KEY=OCZccH...FhB4cT

Another useful variable is DEBUG=true. If you use this, all repeated network requests are cached to disk. Note! If you want to invalidate that cache, delete the file requests_cache1.sqlite.

Running tests

The simplest form is to run:

$ pip install -e ".[dev,test]"

...if you haven't already done so. Then run:

$ pytest

You can also invoke it with:

$ ./run.sh test

Docker

To build:

docker build -t remote-settings-uptake-health .

To run:

docker run -t --env-file .env remote-settings-uptake-health

Note, this is the same as running:

docker run -t --env-file .env remote-settings-uptake-health main

Or, to see what other options are available:

docker run -t --env-file .env remote-settings-uptake-health main --help

To run the unit tests with docker use:

docker run -t remote-settings-uptake-health test

To run the linting checks with docker use:

docker run -t remote-settings-uptake-health lintcheck

And if you just want to fix any lint warnings (that can be automated):

docker run -t remote-settings-uptake-health lintfix

To get into a bash prompt inside the docker container run:

docker run -it --env-file .env remote-settings-uptake-health bash
python main.py  # for example

Code style

We enforce all Python code style with black and flake8. The best tool to test this is with therapist which is installed as a dev package. To run it;

therapist run

That will check all the files you've touched in the current branch. To test across all files, including those you haven't touched:

therapist use lint:all

If you do get warnings from this, you can either fix it manually in your editor or run:

therapist use fix

Even better, if you intend to work on this repo a lot, install therapist as a pre-commit git hook:

therapist install

Now, trying to commit, with code style warnings, will prevent you and you have to run the above (therapist use fix) fix command.