Run these tests with `sbt dockerComposeTest`. You can also start up
the Docker containers separately with `sbt dockerComposeRun`.
There is plenty of Spark weirdness in this patch. The suspected
Spark bugs are labeled as such.
We found that a handful of the HLL fields introduced to handle
session-based counts and filtered client counts blew up the size
of the output parquet files. See
https://bugzilla.mozilla.org/show_bug.cgi?id=1388351 for more context.
This new dimension allows us to distinguish between new profiles, young
profiles, old profiles and so on. The age is put into bins to reduce the
total number of aggregates. The binning logic is:
- age < 60 days: daily granularity
- age < 52 weeks: weekly granularity
- age >= 52 weeks: a single 365-day bin
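A minimal sketch of that binning rule; the function name and exact boundary handling are illustrative, not the job's actual code:
```scala
// Illustrative sketch of the age binning described above: young profiles
// keep daily resolution, older ones weekly, and anything past a year
// collapses into a single 365-day bin.
def profileAgeBin(ageInDays: Int): Int = {
  if (ageInDays < 60) ageInDays                    // daily granularity
  else if (ageInDays < 52 * 7) (ageInDays / 7) * 7 // weekly granularity
  else 365                                         // single bin for old profiles
}
```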
I also increased the schema version as this change is not backward
compatible and it will require a data backfill.
Previously, when run in streaming mode, the job would fail with:
```
org.apache.spark.sql.AnalysisException: Append output mode not
supported when there are streaming aggregations on streaming
DataFrames/DataSets without watermark
```
This change makes the watermark part of the streaming aggregation, so
append mode is supported.
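A minimal sketch of the idea, assuming a `timestamp` event-time column; the column name, window size, and watermark delay are illustrative and may differ from the job's actual values:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// Declare a watermark on the event-time column and group on a window over
// that same column, so Spark can finalize aggregates and emit them in
// append mode.
def aggregate(pings: DataFrame): DataFrame = {
  pings
    .withWatermark("timestamp", "1 minute")
    .groupBy(window(col("timestamp"), "5 minutes"), col("channel"))
    .count()
}
```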
This change adds the ability to cheaply compute user set cardinality
using HyperLogLog (HLL).
HLL is used both to count the number of distinct users for each
aggregate and to measure the impact that a certain anomaly has on users,
e.g. how many users are affected by main crashes.
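As a rough illustration of the idea only: Spark's built-in `approx_count_distinct` is also HyperLogLog-based, although the job itself stores HLL fields in the output so the sketches remain mergeable. Column names here are illustrative:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Cheap approximate distinct-user counts per aggregate, backed by
// HyperLogLog++ inside Spark. Not the exact implementation used by the job.
def distinctClients(aggregates: DataFrame): DataFrame = {
  aggregates
    .groupBy(col("channel"), col("version"))
    .agg(approx_count_distinct(col("client_id")).alias("client_count"))
}
```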
This change will make error_aggregates behave a bit differently.
Here we are multiplexing pings: for each experiment a ping is
enrolled in, we get one data entry, plus one additional entry with
NO experiment (experiment_id = NULL) for every ping. We then
aggregate across these multiplexed entries. This means you can
query for experiment_id = NULL, and that will include the entire
population. Otherwise, you need to select a single experiment_id
to query on, or clients will be double counted.
For example, if a client has the following experiments:
`{"experiment1": "control", "experiment2": "branch1"}`
they will be included in the dataset three times:
- once for experiment1
- once for experiment2
- once for NULL
That means that if you don't filter on experiment_id, the pings
for this client will be counted three times. As such, EVERY QUERY MUST NOW
INCLUDE AN EXPERIMENT_ID (either a single experiment or NULL).
This change is not backwards compatible, because selecting
experiment_id = NULL on historical entries will return only clients
with NO experiments, rather than the entire population.
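A hedged sketch of the multiplexing step, assuming the experiments arrive as a map column from experiment id to branch; all column names are illustrative:
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, lit}

// Each ping is emitted once per experiment it is enrolled in, plus once with
// experiment_id = NULL representing the whole population.
def multiplexExperiments(pings: DataFrame): DataFrame = {
  // One row per (ping, experiment) pair.
  val perExperiment = pings
    .select(col("*"),
      explode(col("experiments")).as(Seq("experiment_id", "experiment_branch")))
    .drop("experiments")

  // One extra row per ping with a NULL experiment, covering the whole population.
  val wholePopulation = pings
    .withColumn("experiment_id", lit(null).cast("string"))
    .withColumn("experiment_branch", lit(null).cast("string"))
    .drop("experiments")

  perExperiment.union(wholePopulation)
}
```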
This is the first measure that requires a non-sum aggregation. I added
a map to define which columns require which aggregation. If a column is
not listed in the map, it falls back to sum().
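A minimal sketch of that lookup, with an illustrative column name and a max() stand-in for the non-sum aggregation:
```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, max, sum}

// Columns that need something other than sum(); everything else falls back
// to sum(). The entry below is illustrative, not the real configuration.
val nonSumAggregations: Map[String, Column => Column] = Map(
  "first_paint" -> (c => max(c))
)

def aggregationFor(columnName: String): Column = {
  val aggregation = nonSumAggregations.getOrElse(columnName, (c: Column) => sum(c))
  aggregation(col(columnName)).alias(columnName)
}
```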
For every histogram we can store a threshold count and a list of
process types. For each combination, the final schema will have a new
column in the format `${histogram_name}_${process_type}_${threshold}`.
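A hedged sketch of how those column names expand; the histogram, process types, and thresholds below are made up for illustration:
```scala
// Illustrative configuration: each histogram carries the process types and
// thresholds it should be expanded over.
case class ThresholdHistogram(name: String, processTypes: List[String], thresholds: List[Int])

val configured = List(
  ThresholdHistogram("gc_max_pause_ms", List("main", "content"), List(150, 2500))
)

// Expands to one column name per combination,
// e.g. gc_max_pause_ms_content_150.
val thresholdColumns: List[String] = for {
  h         <- configured
  process   <- h.processTypes
  threshold <- h.thresholds
} yield s"${h.name}_${process}_${threshold}"
```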