Commit Graph

218 Commits

Author SHA1 Message Date
Daniel Thorn 1edc654ed3
fix: bug 1579266 fix fetch_schema in socorro_import_crash_data.py (#2030)
* fix: bug 1579266 fix fetch_schema in socorro_import_crash_data.py

it should not read from s3 anymore

also fix the code for reading from github

* black format and fix ruff
2024-06-21 10:19:16 -04:00
Frank Bertsch 8402c46c7c
Remove adjust_import and android_client_history_sim dags (#1879)
Co-authored-by: Mikaël Ducharme <mducharme@mozilla.com>
2024-01-30 15:35:58 -05:00
Eduardo Filho c69e173c1d
Update orphaning to write to GCS instead of S3 (#1770) 2023-08-01 11:25:44 -04:00
akkomar 26b555236a
Fix update_orphaning job (#1703)
* Fix update_orphaning job

This uses `mozfun.norm.truncate_version` to extract the major version number and uses it in subsequent version comparisons.

* Update update_orphaning contact emails

* Address review feedback

* Reformat update_orphaning DAG

* Reformat update_orphaning job
2023-05-16 22:26:00 +02:00
Wesley Dawson ebe3aec383 Remove unused s3 script 2022-12-19 10:28:47 -08:00
mikaeld 3cc49d4090 fix deprecation warnings, clean up and update for 2.3.3 2022-12-12 13:24:03 -05:00
Mikaël Ducharme 0b4b4eb418
Revert "feat(airflow): upgrade airflow from 2.1.4 to 2.3.3 [DSRE-1039] " (#1612)
* Revert "update airflow config for 2.3.3"

This reverts commit d19cc711aa.

* Revert "fix deprecation warnings, clean up and update for 2.3.3"

This reverts commit e80472ab9a.

* Revert "update requirements, introduce constraints file and clean up for 2.3.3"

This reverts commit 8e60dba783.
2022-12-07 18:25:40 -05:00
mikaeld e80472ab9a fix deprecation warnings, clean up and update for 2.3.3 2022-12-07 13:30:42 -05:00
Harold Woo 3611c63d82 [DSRE-922] Migrate jobs off of gke derived datasets 2022-10-12 11:04:11 -07:00
Will Kahn-Greene 854572de4e Fix socorro-import schema location (#1546)
This updates the location of the telemetry socorro crash schema file so
the socorro-import job is using the correct one.
2022-08-25 11:48:14 -04:00
Evgeny Pavlov 3cde830d2f
Disable hive for taar similarity job (#1263) 2021-02-26 15:08:31 -08:00
Arkadiusz Komarzewski 6c74723990 Update orphaning: use mozdata.tmp for aggregation 2021-02-22 15:54:10 +01:00
Evgeny Pavlov d6fb763bdc
TAAR jobs GCS migration (#1252)
* Use GCS instead of S3 in TAAR jobs

* Return back previous version of telemetry extraction for taar lite job

* Fix bucket for taar guid ranking task

* Fix taar ensemble job to work with new taar package

* Add taar_etl_model_storage_bucket variable

* Add telemetry alert email to taar dags

* TAAR fixes: add todos, exceptions logging, fix formatting
2021-02-08 16:06:13 -08:00
Jeff Klukas fabb424540
Use 'main' in references to telemetry-batch-view (#1248)
Preparing this for when we change the default branch in telemetry-batch-view.
2021-02-03 16:58:20 -05:00
Arkadiusz Komarzewski e9ba66f791 Bug 1639196 - remove fx_usage_report DAG
This DAG was superseded by firefox_public_data_report.
2020-09-28 16:31:11 +02:00
Arkadiusz Komarzewski 84a0d3a946
fixup! Bug 1654038 Tolerate compact histograms in update_orphaning_dashboard job 2020-07-28 15:15:53 +02:00
Jeff Klukas 0f25662d09
Bug 1654038 Tolerate compact histograms in update_orphaning_dashboard job (#1100) 2020-07-27 17:45:49 +02:00
Victor Ng daf7d1bac4
Update URI from s3a:// to gs:// to process taar-lite (#1023) 2020-06-12 22:22:37 -04:00
Victor Ng 8d85cf6f06
Port TAAR similarity to BigQuery (#1035)
* Reduce retries to 0 for TAAR daily

Retrying the TAAR jobs is not useful: a retry reuses the cluster, but the cluster needs to be restarted between runs.
Better not to retry at all.
Instead, simply re-run the DAG, which will provision a new cluster.

* Port TAAR similarity job to Cloud BigQuery

The query has been dramatically simplified by using clients_last_seen.
GCS buckets are no longer required for reading in Parquet files.

* Update configuration for TAAR Similarity job

Note that using SSDs speeds up the similarity job substantially.
2020-06-12 21:39:19 -04:00
Frank Bertsch f5020a7b08 Remove csv comment 2020-06-10 15:33:33 -04:00
Frank Bertsch 2c9ad8cd39 Update adjust_import for external tables 2020-06-10 15:33:33 -04:00
Frank Bertsch ab2bad3da3 Use provided args 2020-05-21 14:15:57 -04:00
Frank Bertsch 7cc63eb66a Add job for hashing Adjust Ids 2020-05-21 14:15:57 -04:00
Arkadiusz Komarzewski cd641ecdcf Bug 1625940 - Remove legacy Hardware Report job 2020-05-19 15:08:34 +02:00
Anthony Miyaguchi ef0b29da4d
Bug 1600721 - Remove active profiles (#980) 2020-05-11 09:17:47 -07:00
Frank Bertsch 5e4e3d6dee Add days_seen prediction; predict p_alive
This makes two big changes:
1. Adds days_seen to predictions. We will now predict the number
   of days we will see this user over the next 28 days.
2. Adds p_alive for all metrics. We can use this to determine
   if we think the user is alive or dead. If p_alive for
   days_seen is < 0.5, we may assume the user is dead.

Small changes:
- Code fixup
- Cluster on sample id and client id
- Allow adding new fields to output table
2020-04-29 17:43:55 -04:00
Frank Bertsch 850122589a Fix writing to partition in ltv 2020-04-28 14:18:40 -04:00
Arkadiusz Komarzewski 9b1a13a2fc
Bug 1625940 - Schedule new Hardware Report ETL 2020-04-23 14:04:18 +02:00
Frank Bertsch c8c6429aa2 Fix bug in ltv script 2020-04-21 11:23:34 -04:00
Frank Bertsch 02551229dd Add partitioning of LTV predictions 2020-04-20 09:44:55 -04:00
Ben Miroglio 36fe6145ee
Add days_searched_with_ads to list of metrics (#946)
Cleaned up metric specification as well.

Preliminary analysis suggests `search_with_ads` is highly correlated with revenue.

Co-authored-by: Anthony Miyaguchi <amiyaguchi@mozilla.com>
2020-04-16 10:41:10 -07:00
Anthony Miyaguchi ebcb8da4e4
Cast date column to a date (#944) 2020-04-13 10:10:23 -07:00
Anthony Miyaguchi 3617189663
Bug 1627091 - Add partition on date for model perf (#941) 2020-04-08 10:36:25 -07:00
Victor Ng 3b61514a24
TAAR ensemble fix (#929)
* taar ensemble patches for GCP

* Added local SSD support to DataprocClusterCreateOperator
* Default is set to 0 SSDs for both master and worker nodes.
* Added master|worker num_local_ssds parameter to DataProcHelper
* Added arguments for local SSD to dataproc runners
* updated taar_weekly for taar_ensemble
* added context into load_recommenders
* dropped logging in spark job

We need to pass a context into the PySpark workers, or else they won't get
the right AWS credentials.

No logging as it causes spark issues with thread lock objects

* bumped sample rate to 0.005 for taar ensemble

* added local SSDs and bumped taar version

* renamed output file to ensemble_jobs.json

* added doc string to ensemble job for estimate on runtime

* added pip-install.sh to jobs

* switched from testing to prod gs URLs

* updated the sample rate to 0.005 in the DAG

* added better param docs to local SSD option

* Sorted imports to alphabetic order

* inlined taar_aws_conn_id

* Revert "inlined taar_aws_conn_id"

This reverts commit 4363acb77f.
2020-04-03 18:26:58 -04:00
Anthony Miyaguchi db5aa42bad
Bug 1623728 - Schedule LTV Task in Airflow for Dataset Creation (#914)
* Add initial ltv_daily script

This was created by :bmiroglio

* Black ltv_daily script

* Refactor variables for running on other projects

* Refactor into main and parameterize script

* Add dag for ltv

* Update script and dag parameters

This has been tested in a sandbox project.

* Add task sensor for search_clients_last_seen

* Update table names

* Update job prediction days to 364

* Update number of workers to 5

* Remove default value for --submission-date
2020-03-25 12:17:46 -07:00
Jeff Klukas 87f6d97b39 Move dashboard and amplitude ETL to shared-prod tables
We should be able to merge these changes immediately. They are internal
details that don't need to hit derived-datasets anymore.
2020-03-13 18:38:10 +01:00
Arkadiusz Komarzewski 43ff8cbd43
Fix TAAR similarity for reading Parquet from GCS 2020-01-25 08:48:01 +01:00
Arkadiusz Komarzewski d844995e35
Migrate TAAR Similarity job to Dataproc 2020-01-21 16:57:19 +01:00
Victor Ng a2a0b3a10f
GCP port of taar-dynamo (#815)
* GCP port of taar-dynamo

* added options for master and worker disk type and size
* dataproc should really be operated with only pd-ssd disk types and >
  1TB drives.
* added explicit 10 retry configuration for boto3+dynamodb
* added custom exponential backoff around the batch write
  as boto3 does not seem to backoff quite enough

* Enable production gs:// link to etl script

* Added missing cluster name for taar-dynamo

Also dropped EMRSparkOperator import as it's no longer required

* Changed default to None for disk type and size

Dataproc clusters have meaningful defaults defined in
DataprocClusterCreateOperator.  All APIs higher up have defaults set to
None and will only pass in values to the DataprocClusterCreateOperator
if the value has been overridden.

* Docstrings added to moz_dataproc_pyspark_runner

Added docstrings for master_disk_type, master_disk_size,
worker_disk_type and worker_disk_size.

* Revert "Changed default to None for disk type and size"

This reverts commit 24dc037094953fd932eef76b299857b4da36ffbe.

* API consistency for script runner and jar runner

The moz_dataproc_jar_runner and moz_dataproc_scriptrunner methods now
have the disk type and disk size parameters added to match the pyspark
runner.
2020-01-14 16:32:25 -05:00
William Lachance 39f4982d1d Remove some unused code and references
* events-to-amplitude
* telemetry-streaming
2020-01-08 13:46:48 -05:00
William Lachance d72dda39dd Remove reference to telemetry-streaming 2020-01-08 13:46:48 -05:00
Blake Imsland 7f53b91f31 Remove references to metastore from tbv 2019-12-17 15:01:15 -08:00
Arkadiusz Komarzewski b282ba3922 Use only main ping view in update orphaning ETL
This removes usage of the `main_1pct_backfill` table, which is no longer
needed now that the main ping table has been backfilled.

This also adds a filter for `build.version` being not null. Such pings
should not be included in this analysis (we had a single ping with
this field empty in the last year).
2019-12-09 18:16:40 +01:00
Anthony Miyaguchi b6f33e4c2a
Add missing bgbb_runner (#769) 2019-11-27 14:49:35 -08:00
Daniel Thorn 29f3f31029
drop --output flag to AddonRecommender (#747)
flag removed in https://github.com/mozilla/telemetry-batch-view/pull/555/files#diff-ebfbc196fc5b4a7f5a210a2363aa5ae1L42
2019-11-19 19:42:50 -05:00
Victor Ng 95bb4a8d86
Migrate taar_locale to Dataproc (#724)
This PR migrates the taar_locale job from python_mozetl and makes minor
changes to run under Dataproc.

Changes include:

* inlined taar_utils functions
* replaced s3_submission_date with the current clients_daily date
* removed temporary file requirement for the TAAR Locale
* updated boto code to write JSON data directly into S3
* clients_daily is now being loaded using parquet files loaded from GCS
2019-11-18 15:05:01 -05:00
Anthony Miyaguchi b9d7c135a0
Bug 1572115 - Add dataproc version aggregates job that writes to a dev database (#727)
* Bug 1572115 - Add dataproc version aggregates job that writes to a dev database

* Add bigquery source and project-id

* Expose storage_bucket and artifact_bucket for local testing

* Add GOOGLE_APPLICATION_CREDENTIALS to docker-compose for local testing

* Update prerelease aggregates to run locally

* Minimize mozaggregator runner

* Use default service account on production

* Use n1-standard-8 on mozaggregator job
2019-11-14 17:29:45 -08:00
Arkadiusz Komarzewski b872df04ba Fix Hardware Report start date parameter name 2019-11-12 03:24:59 -05:00
Arkadiusz Komarzewski 75b2b4ba9d
Bug 1582110 - Add hardware report job with BigQuery support (#686) 2019-11-08 08:19:31 -05:00
Arkadiusz Komarzewski 40b64f1d5a
Bug 1571462 - Decommission mobile_clients dataset (#678) 2019-11-07 08:48:29 -05:00