Run these tests with `sbt dockerComposeTest`. You can also start up
the Docker containers separately with `sbt dockerComposeRun`.
There is plenty of Spark weirdness in this patch. The suspected
Spark bugs are labeled as such.
We found that a handful of the HLL fields introduced to handle
session-based counts and filtered client counts blew up the size
of the output parquet files. See
https://bugzilla.mozilla.org/show_bug.cgi?id=1388351 for more context.
This new dimension allows us to distinguish between new profiles, young
profiles, old profiles and so on. The age is put into bins to reduce the
total number of aggregates. The binning logic is:
- age < 60 days: daily granularity
- age < 52 weeks: weekly granularity
- age >= 52 weeks: a single 365-day bin
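A minimal sketch of that binning rule; the function name and exact boundary handling are illustrative, not the job's actual code:
```scala
// Illustrative sketch of the age binning described above: young profiles
// keep daily resolution, older ones weekly, and anything past a year
// collapses into a single 365-day bin.
def profileAgeBin(ageInDays: Int): Int = {
  if (ageInDays < 60) ageInDays                    // daily granularity
  else if (ageInDays < 52 * 7) (ageInDays / 7) * 7 // weekly granularity
  else 365                                         // single bin for old profiles
}
```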
I also increased the schema version as this change is not backward
compatible and it will require a data backfill.
Previously, when run in streaming mode, the job would fail with:
```
org.apache.spark.sql.AnalysisException: Append output mode not
supported when there are streaming aggregations on streaming
DataFrames/DataSets without watermark
```
This change makes the watermark part of the streaming aggregation, so
append mode is supported.
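A minimal sketch of the idea, assuming a `timestamp` event-time column; the column name, window size, and watermark delay are illustrative and may differ from the job's actual values:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// Declare a watermark on the event-time column and group on a window over
// that same column, so Spark can finalize aggregates and emit them in
// append mode.
def aggregate(pings: DataFrame): DataFrame = {
  pings
    .withWatermark("timestamp", "1 minute")
    .groupBy(window(col("timestamp"), "5 minutes"), col("channel"))
    .count()
}
```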
This change adds the ability to cheaply compute user set cardinality
using HyperLogLog (HLL).
HLL is used both to count the number of distinct users for each
aggregate and to measure the impact that a certain anomaly has on users,
e.g. how many users are affected by main crashes.
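As a rough illustration of the idea only: Spark's built-in `approx_count_distinct` is also HyperLogLog-based, although the job itself stores HLL fields in the output so the sketches remain mergeable. Column names here are illustrative:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Cheap approximate distinct-user counts per aggregate, backed by
// HyperLogLog++ inside Spark. Not the exact implementation used by the job.
def distinctClients(aggregates: DataFrame): DataFrame = {
  aggregates
    .groupBy(col("channel"), col("version"))
    .agg(approx_count_distinct(col("client_id")).alias("client_count"))
}
```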
This change will make error_aggregates behave a bit differently.
Here we are multiplexing pings: for each experiment a ping is
enrolled in, we get one data entry, plus one additional entry with
NO experiment (experiment_id = NULL) for every ping. We then
aggregate across these multiplexed entries. This means you can
query for experiment_id = NULL, and that will include the entire
population. Otherwise, you need to select a single experiment_id
to query on, or clients will be double counted.
For example, if a client has the following experiments:
`{"experiment1": "control", "experiment2": "branch1"}`
they will be included in the dataset three times:
- once for experiment1
- once for experiment2
- once for NULL
That means that if you don't filter on experiment_id, the pings
for this client will be counted three times. As such, EVERY QUERY MUST NOW
INCLUDE AN EXPERIMENT_ID (either a single experiment or NULL).
This change is not backwards compatible, because selecting
experiment_id = NULL on historical entries will return only clients
with NO experiments, rather than the entire population.
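A hedged sketch of the multiplexing step, assuming the experiments arrive as a map column from experiment id to branch; all column names are illustrative:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, lit}

// Each ping is emitted once per experiment it is enrolled in, plus once with
// experiment_id = NULL representing the whole population.
def multiplexExperiments(pings: DataFrame): DataFrame = {
  // One row per (ping, experiment) pair.
  val perExperiment = pings
    .select(col("*"),
      explode(col("experiments")).as(Seq("experiment_id", "experiment_branch")))
    .drop("experiments")

  // One extra row per ping with a NULL experiment, covering the whole population.
  val wholePopulation = pings
    .withColumn("experiment_id", lit(null).cast("string"))
    .withColumn("experiment_branch", lit(null).cast("string"))
    .drop("experiments")

  perExperiment.union(wholePopulation)
}
```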
This is the first measure that requires a non-sum aggregation. I added
a map to define which columns require which aggregation. If a column is
not listed in the map, it falls back to sum().
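A minimal sketch of that lookup, with an illustrative column name and a max() stand-in for the non-sum aggregation:
```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, max, sum}

// Columns that need something other than sum(); everything else falls back
// to sum(). The entry below is illustrative, not the real configuration.
val nonSumAggregations: Map[String, Column => Column] = Map(
  "first_paint" -> (c => max(c))
)

def aggregationFor(columnName: String): Column = {
  val aggregation = nonSumAggregations.getOrElse(columnName, (c: Column) => sum(c))
  aggregation(col(columnName)).alias(columnName)
}
```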
For every histogram we can store a threshold count and a list of
process types. For each combination, the final schema will have a new
column in the format `${histogram_name}_${process_type}_${threshold}`.
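A hedged sketch of how those column names expand; the histogram, process types, and thresholds below are made up for illustration:
```scala
// Illustrative configuration: each histogram carries the process types and
// thresholds it should be expanded over.
case class ThresholdHistogram(name: String, processTypes: List[String], thresholds: List[Int])

val configured = List(
  ThresholdHistogram("gc_max_pause_ms", List("main", "content"), List(150, 2500))
)

// Expands to one column name per combination,
// e.g. gc_max_pause_ms_content_150.
val thresholdColumns: List[String] = for {
  h         <- configured
  process   <- h.processTypes
  threshold <- h.thresholds
} yield s"${h.name}_${process}_${threshold}"
```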