Commit Graph

71 Commits

Author SHA1 Message Date
Frank Bertsch ca8de72394 Add Integration Tests with Docker
Run these tests with `sbt dockerComposeTest`. You can also start up
the docker containers separately with `sbt dockerComposeRun`.

Plenty of weirdness with spark in this patch. Some of the potential
spark bugs are labeled as such.
2017-09-18 14:36:48 -05:00
Mauro Doglio c9b7a77871 Remove session-based counts and filtered client counts
We found out that a handful of the HLL fields introduced to handle
session-based counts and filtered client counts blew up the size
of the output parquet files. See
https://bugzilla.mozilla.org/show_bug.cgi?id=1388351 for more context.
2017-09-18 08:58:03 -05:00
Frank Bertsch 8ad313430b Set number of files per partition 2017-09-15 12:49:11 -05:00
Mauro Doglio 3758a849f6 Remove temporary fields from output schema
Fixes #70
2017-09-15 18:45:18 +01:00
Frank Bertsch 84e73d55df Remove duplicate client_count columns
This failed when reading the final parquet output
2017-08-30 13:20:21 -05:00
Mauro Doglio d4703e1f2f Discard non-firefox pings on ingestion 2017-08-29 19:27:27 +01:00
Mauro Doglio 1a43e70a1a Add new profile_age_days dimension
This new dimension allows us to distinguish between new profiles, young
profiles, old profiles and so on. The age has been put in bins to
reduce the number of total aggregates. The logic is:
age < 60 days : daily granularity
age < 52 weeks: weekly granularity
age > 52 weeks: bin 365

I also increased the schema version as this change is not backward
compatible and it will require a data backfill.
2017-08-28 16:51:44 +01:00
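The binning rule in the commit above can be sketched as follows. This is an illustrative Python sketch, not the project's Scala code. The function name `profile_age_bin` is hypothetical, and two details are assumptions the message does not settle: weekly granularity is modeled as rounding down to a whole number of weeks (expressed in days), and an age of exactly 52 weeks is placed in the 365 bin.

```python
def profile_age_bin(age_days: int) -> int:
    """Map a profile age in days to a coarser bin (hypothetical helper)."""
    if age_days < 0:
        raise ValueError("profile age cannot be negative")
    if age_days < 60:
        # Daily granularity: keep the exact number of days.
        return age_days
    if age_days < 52 * 7:
        # Weekly granularity (assumption): round down to a multiple of 7 days.
        return (age_days // 7) * 7
    # Everything from 52 weeks upward collapses into a single bin.
    return 365
```

Under this encoding, ages 60 to 62 land in bin 56 and so share a value with the daily bin for day 56; a real implementation might encode weekly bins differently to keep them distinct.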
Frank Bertsch 225c22b877 Fix runtime error
Previously, when run in streaming mode, the job would fail with:

```
org.apache.spark.sql.AnalysisException: Append output mode not
supported when there are streaming aggregations on streaming
DataFrames/DataSets without watermark
```

This makes the watermark part of the aggregation.
2017-08-28 16:26:23 +01:00
Mauro Doglio f3e6370624 Add subsession count
This is merely the number of main pings for each aggregate.
2017-08-21 13:17:12 +01:00
Mauro Doglio 305079084e Add session count-based statistics
Same as 8687e8e6b8, but this time for
session counts. I added client_id to the recipe for the session hll
to mitigate the risk of collisions.
2017-08-21 13:17:12 +01:00
Mauro Doglio 8687e8e6b8 Add client_count-based statistics
This change adds the ability to cheaply compute user set cardinality
using hyperloglog.
HLL is used both to count the number of distinct users for each
aggregate and to measure the impact that a certain anomaly has on users,
e.g. how many users are affected by main crashes.
2017-08-16 10:58:20 +01:00
Mauro Doglio 68c4d4fd5c Refactor aggregation code for better readability
This commit prepares the ground for the upcoming addition of
client_count-based stats.
2017-08-16 10:58:20 +01:00
Frank Bertsch 50dba3fd48 Include all experiments a ping is involved in
This change will make error_aggregates behave a bit differently.
Here we are multiplexing pings - for each experiment a ping is
involved in, we are going to get one data entry. We are then
going to add one data entry for every ping with NO experiment.

We then aggregate across these new experiment outputs. What this
means is that you can query for experiment_id = NULL, and that
will include the entire population. Otherwise, you need to select
a single experiment_id to query on, or clients will be double counted.

For example, if a client has the following:
{"experiment1": "control", "experiment2": "branch1"}

They will be included in the dataset three times:
- once for experiment1
- once for experiment2
- once for NULL

That means if you don't query on a specific experiment, the pings
for this client will be tripled. As such, EVERY QUERY MUST NOW
INCLUDE AN EXPERIMENT_ID.

This change is not backwards compatible, because selecting NULL
on historical entries will result only in clients with NO experiments,
rather than all experiments.
2017-08-08 13:33:18 +01:00
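The multiplexing described in the commit above can be sketched as follows. This is an illustrative Python sketch, not the project's Scala code. The function name `multiplex_by_experiment`, the `experiments` key, and the `experiment_branch` field are hypothetical; only `experiment_id` and the one-row-per-experiment-plus-NULL behavior come from the message.

```python
from typing import Any, Dict, List


def multiplex_by_experiment(ping: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Emit one row per experiment the ping is in, plus one row with no experiment."""
    experiments = ping.get("experiments") or {}
    # Copy every field except the experiment map itself.
    base = {key: value for key, value in ping.items() if key != "experiments"}
    rows = [
        {**base, "experiment_id": exp_id, "experiment_branch": branch}
        for exp_id, branch in experiments.items()
    ]
    # The NULL row is always added, so it covers the entire population.
    rows.append({**base, "experiment_id": None, "experiment_branch": None})
    return rows
```

A ping with `{"experiment1": "control", "experiment2": "branch1"}` yields three rows, which is why a query that does not filter on `experiment_id` counts that client three times.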
Frank Bertsch 4aeaa8d70e Use implicit classes rather than implicit functions
This is a bit more terse.
2017-08-08 13:33:18 +01:00
Rob Hudson acbbed7dc6 Add a test for new style experiments 2017-07-18 16:12:39 -05:00
Mauro Doglio f88217083e Split window column into 2 timestamp columns
This makes the dataset easier to interact with for downstream consumers.
2017-06-28 16:48:03 -07:00
Mauro Doglio 65bc79584a Add mean firstPaint
This is the first measure that requires a non-sum aggregation. I added
a map to define which columns require which aggregation. If a column is
not listed there, it falls back to a sum().
2017-06-19 13:08:35 +01:00
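The column-to-aggregation map with a sum fallback, as described in the commit above, might look like this. This is an illustrative Python sketch, not the project's Scala code. The names `AGGREGATIONS` and `aggregate` and the column names `first_paint` and `main_crashes` are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# Only columns that need a non-sum aggregation are listed here (assumed names).
AGGREGATIONS: Dict[str, Callable[[Sequence[float]], float]] = {
    "first_paint": mean,
}


def aggregate(column: str, values: Sequence[float]) -> float:
    """Apply the aggregation registered for the column, falling back to sum."""
    return AGGREGATIONS.get(column, sum)(values)
```

Calling `aggregate("first_paint", [100, 200, 300])` averages the values, while an unlisted column such as `main_crashes` is summed.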
Mauro Doglio af50955216 Add quantum_ready dimension 2017-06-14 18:13:35 +01:00
Mauro Doglio a94312430f Add threshold histogram counts for quantum release criteria
For every histogram we can store a threshold count and a list of
process types. For each combination, the final schema will have a new
column in the format ${histogram_name}_${process_type}_${threshold}
2017-06-09 16:32:03 +01:00
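The column naming scheme in the commit above can be sketched as follows. This is an illustrative Python sketch, not the project's Scala code. The function name `threshold_columns` and the example histogram name are hypothetical, and accepting several thresholds per histogram is an assumption; the message only fixes the `${histogram_name}_${process_type}_${threshold}` format.

```python
from typing import List, Sequence


def threshold_columns(
    histogram_name: str,
    process_types: Sequence[str],
    thresholds: Sequence[int],
) -> List[str]:
    """Build one column name per (process type, threshold) combination."""
    return [
        f"{histogram_name}_{process_type}_{threshold}"
        for process_type in process_types
        for threshold in thresholds
    ]
```

For example, `threshold_columns("SOME_HISTOGRAM", ["parent", "content"], [150])` produces `SOME_HISTOGRAM_parent_150` and `SOME_HISTOGRAM_content_150`.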
Mauro Doglio a90d6c830a Speed up test execution by running a single query
This test was running N queries, one per column.
2017-06-09 16:32:03 +01:00
Mauro Doglio 16e1320189 Disable scalastyle "Magic number" check 2017-06-09 16:32:03 +01:00
Mauro Doglio 5de8ada3e5 Make SchemaBuilder.merge parameters variable
It's rather handy to merge more than 2 schemas at the same time.
2017-06-09 16:32:03 +01:00
Mauro Doglio f30949268e Tweak log configuration to report streaming job info 2017-06-09 10:26:20 +01:00
Wesley Dawson 1d3d598b9d Add starting offset and checkpoint path command-line options 2017-06-08 17:29:00 +01:00
Wesley Dawson 096935ea5b Don't explicitly set cluster manager 2017-06-08 17:29:00 +01:00
Wesley Dawson 783220fa63 Modify assembly configuration for fat jar generation 2017-06-08 17:29:00 +01:00
Mauro Doglio ba394a7cac Make messageTo[Crash|Main]Ping constructors 2017-06-06 15:42:36 +01:00
Mauro Doglio 0cbdd4b23a Add support for new-style experiments 2017-06-06 15:42:36 +01:00
Mauro Doglio 54a4b5f32f Standardize case classes indentation style 2017-06-06 15:42:36 +01:00
Mauro Doglio 68d9d191ba Fix failing tests and factor out scalar values 2017-06-06 15:42:36 +01:00
Mauro Doglio ce4d8f874b Minor style fixes 2017-06-06 15:42:36 +01:00
Mauro Doglio bce53161b5 Better name Parsable[Main|Crash]Ping 2017-06-06 15:42:36 +01:00
Mauro Doglio ff44578a12 Wrap Row objects in Arrays instead of Tuple1
This gives us the ability to eventually return more than one Row per
ping.
2017-06-06 15:42:36 +01:00
Mauro Doglio 22288df53c Return None instead of zeros when values are NA 2017-06-06 15:42:36 +01:00
Mauro Doglio 18b730ddd3 Refactor Timestamp extraction 2017-06-06 15:42:36 +01:00
Mauro Doglio 4e91a9cad7 Add MPLv2 header to source code 2017-06-06 15:42:36 +01:00
Mauro Doglio 3246f7ef74 Add sbt alias command for test and linter runs 2017-06-06 15:42:36 +01:00
Mauro Doglio 8913a87b27 Add e10s, experiments and gfx dimensions to ErrorAggregator 2017-06-06 15:42:36 +01:00
Mauro Doglio 2d324c6fd4 Rename CrashPing method `isMain` to `isMainCrash` 2017-06-06 15:42:36 +01:00
Mauro Doglio bf8ee528c7 Decorate MainPings and CrashPings with a parse() method 2017-06-06 15:42:36 +01:00
Mauro Doglio 46548dcc92 Add non-main crashes to stats
This also increases the test coverage adding some tests for the
MainPing methods.
2017-06-06 15:42:36 +01:00
Mauro Doglio fe6b351b68 Use case classes to parse pings out of heka messages 2017-06-06 15:42:36 +01:00
Mauro Doglio 897bbe3593 Add case classes for main and crash pings 2017-06-06 15:42:36 +01:00
Mauro Doglio d7fc30eed4 Upgrade Spark version to 2.1.1 2017-05-22 12:38:03 +01:00
Frank Bertsch 650c6d9c12 Add options to run in batch mode 2017-05-05 11:32:03 +01:00
Mauro Doglio 8d7b4b3a39 Use moztelemetry to autogenerate protobuf classes 2017-04-27 14:47:14 +00:00
Mauro Doglio d523870990 Add SLOW_SCRIPT_PAGE_COUNT to ErrorAggregator stats
This is one of the metrics suggested by releng.
I will use it to test schema updates as per #19
2017-02-28 15:09:43 +00:00
Mauro Doglio 053d5ed00b Set root logger log level to warning 2017-02-21 14:05:26 +00:00
Mauro Doglio ee37f56f49 Add log4j configuration file
Fixes #18
2017-02-20 13:53:00 +00:00
Mauro Doglio a18475aece Add failOnDataLoss option to ErrorAggregator
This is useful when we want to recover from an expected data loss.
Fixes #25
2017-02-20 13:49:53 +00:00