* fix: bug 1579266 fix fetch_schema in socorro_import_crash_data.py
It should no longer read from S3.
Also fix the code for reading the schema from GitHub.
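For reference, a minimal sketch of fetching the crash schema over HTTPS instead of S3; the raw GitHub URL and filename below are illustrative placeholders, not necessarily what the DAG uses.

```python
import json
import urllib.request

# Illustrative URL; the actual schema location in the socorro repo may differ.
SCHEMA_URL = (
    "https://raw.githubusercontent.com/mozilla-services/socorro/"
    "master/socorro/schemas/crash_report.json"
)

def fetch_schema():
    """Fetch the crash report JSON schema directly from GitHub."""
    with urllib.request.urlopen(SCHEMA_URL) as response:
        return json.loads(response.read().decode("utf-8"))
```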
* black format and fix ruff
* Fix update_orphaning job
This uses `mozfun.norm.truncate_version` to extract the major version number, which is then used in subsequent version comparisons.
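As a rough illustration (not the actual update_orphaning query), truncating to the major version in BigQuery looks roughly like this; the table, column, and threshold are placeholders, and the two-argument form of the UDF is an assumption.

```python
from google.cloud import bigquery

# Placeholder query: compare on the major version only via the mozfun UDF.
QUERY = """
SELECT
  client_id,
  mozfun.norm.truncate_version(app_version, "major") AS major_version
FROM
  `moz-fx-data-shared-prod.telemetry.clients_daily`  -- placeholder table
WHERE
  mozfun.norm.truncate_version(app_version, "major") >= 78  -- placeholder threshold
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(row.client_id, row.major_version)
```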
* Update update_orphaning contact emails
* Address review feedback
* Reformat update_orphaning DAG
* Reformat update_orphaning job
* Revert "update airflow config for 2.3.3"
This reverts commit d19cc711aa.
* Revert "fix deprecation warnings, clean up and update for 2.3.3"
This reverts commit e80472ab9a.
* Revert "update requirements, introduce constraints file and clean up for 2.3.3"
This reverts commit 8e60dba783.
* Use GCS instead of S3 in TAAR jobs
* Restore the previous version of telemetry extraction for the taar lite job
* Fix bucket for taar guid ranking task
* Fix taar ensemble job to work with new taar package
* Add taar_etl_model_storage_bucket variable
* Add telemetry alert email to taar dags
* TAAR fixes: add todos, exceptions logging, fix formatting
* Reduce retries to 0 for TAAR daily
Retrying the TAAR jobs is not useful: a retry reuses the existing cluster, which would need to be restarted between attempts.
Instead, simply re-run the DAG, which provisions a new cluster.
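A minimal sketch of what this looks like in the DAG's default_args; the owner, dates, and schedule are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "example@mozilla.com",   # placeholder owner
    "depends_on_past": False,
    "start_date": datetime(2020, 1, 1),
    "email_on_failure": True,
    "retries": 0,                     # a retry would reuse the old cluster, so don't retry
    "retry_delay": timedelta(minutes=30),
}

dag = DAG("taar_daily", default_args=default_args, schedule_interval="@daily")
```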
* Port TAAR similarity job to Cloud BigQuery
The query has been dramatically simplified by using clients_last_seen.
GCS buckets are no longer required for reading in Parquet files.
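A hedged sketch of reading clients_last_seen straight from BigQuery with the spark-bigquery connector instead of Parquet on GCS; the table name and filter are illustrative, and the connector must be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("taar_similarity").getOrCreate()

# Assumes the spark-bigquery connector is on the cluster (e.g. added via --jars
# or a Dataproc initialization action).
clients = (
    spark.read.format("bigquery")
    .option("table", "moz-fx-data-shared-prod.telemetry.clients_last_seen")
    .load()
    .where("days_since_seen = 0")  # illustrative filter for currently active clients
)
```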
* Update configuration for TAAR Similarity job
Note that using SSDs speeds up the similarity job substantially.
This makes two big changes:
1. Adds days_seen to predictions. We will now predict the number
of days we will see this user over the next 28 days.
2. Adds p_alive for all metrics. We can use this to determine
whether we think the user is alive or dead. If p_alive for
days_seen is < 0.5, we may assume the user is dead.
Small changes:
- Code fixup
- Cluster on sample id and client id
- Allow adding new fields to output table
Cleaned up metric specification as well.
Preliminary analysis suggests `search_with_ads` is highly correlated with revenue.
Co-authored-by: Anthony Miyaguchi <amiyaguchi@mozilla.com>
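For context, p_alive and a 28-day days_seen prediction are the kind of outputs a BG/NBD model produces; below is a rough sketch using the lifetimes package with made-up data. The choice of model and the column names are assumptions, not necessarily what the job actually does.

```python
import pandas as pd
from lifetimes import BetaGeoFitter

# Made-up per-client summaries; real inputs would be derived from days_seen history.
df = pd.DataFrame({
    "frequency": [12, 3, 0, 25],
    "recency": [26.0, 10.0, 0.0, 27.0],
    "T": [28.0, 28.0, 28.0, 28.0],
})

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(df["frequency"], df["recency"], df["T"])

# Predicted number of active days over the next 28 days.
df["days_seen_pred_28"] = bgf.conditional_expected_number_of_purchases_up_to_time(
    28, df["frequency"], df["recency"], df["T"]
)

# Probability the client is still "alive"; < 0.5 suggests the user is gone.
df["p_alive"] = bgf.conditional_probability_alive(
    df["frequency"], df["recency"], df["T"]
)
```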
* taar ensemble patches for GCP
* Added local SSD support to DataprocClusterCreateOperator
* Default is set to 0 SSDs for both master and worker nodes.
* Added master|worker num_local_ssds parameter to DataProcHelper
* Added arguments for local SSD to dataproc runners
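Hedged sketch of a DAG requesting local SSDs once these arguments are plumbed through; the master_num_local_ssds / worker_num_local_ssds names follow the commit messages above, and everything else (import path, other arguments, bucket) is illustrative.

```python
from datetime import datetime

from utils.dataproc import moz_dataproc_pyspark_runner  # repo helper, path assumed

default_args = {"owner": "example@mozilla.com", "start_date": datetime(2020, 1, 1)}

taar_ensemble = moz_dataproc_pyspark_runner(
    parent_dag_name="taar_weekly",        # illustrative
    dag_name="taar_ensemble",
    cluster_name="taar-ensemble-{{ ds_nodash }}",
    num_workers=20,
    master_num_local_ssds=0,              # default: no local SSDs on the master
    worker_num_local_ssds=2,              # attach two local SSDs per worker
    python_driver_code="gs://example-bucket/jobs/taar_ensemble.py",  # placeholder
    default_args=default_args,
)
```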
* updated taar_weekly for taar_ensemble
* added context into load_recommenders
* dropped logging in spark job
We need to pass a context into the PySpark workers, or else they won't get the right AWS credentials.
Logging was dropped because it causes Spark issues with thread lock objects (logger locks cannot be pickled and shipped to executors).
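Rough sketch of the pattern being described: serialize plain credential values into a context dict on the driver and build clients inside each partition, instead of capturing driver-side objects (loggers, clients) whose thread locks cannot be pickled. All names are illustrative.

```python
import boto3

def make_context(aws_access_key_id, aws_secret_access_key):
    # Plain strings only: this dict is pickled and shipped to the executors.
    return {
        "aws_access_key_id": aws_access_key_id,
        "aws_secret_access_key": aws_secret_access_key,
    }

def score_partition(ctx, rows):
    # Build the boto3 client on the executor, inside the partition.
    client = boto3.client(
        "s3",
        aws_access_key_id=ctx["aws_access_key_id"],
        aws_secret_access_key=ctx["aws_secret_access_key"],
    )
    for row in rows:
        yield row  # ...do per-row work that needs the client...

# Usage (driver side), with credentials pulled from an Airflow connection:
# ctx = make_context(access_key, secret_key)
# results = rdd.mapPartitions(lambda rows: score_partition(ctx, rows))
```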
* bumped sample rate to 0.005 for taar ensemble
* added local SSDs and bumped taar version
* renamed output file to ensemble_jobs.json
* added a docstring to the ensemble job with an estimate of its runtime
* added pip-install.sh to jobs
* switched from testing to prod gs URLs
* updated the sample rate to 0.005 in the DAG
* added better param docs to local SSD option
* Sorted imports into alphabetical order
* inlined taar_aws_conn_id
* Revert "inlined taar_aws_conn_id"
This reverts commit 4363acb77f.
* Add initial ltv_daily script
This was created by :bmiroglio
* Black ltv_daily script
* Refactor variables for running on other projects
* Refactor into main and parameterize script
* Add dag for ltv
* Update script and dag parameters
This has been tested in a sandbox project.
* Add task sensor for search_clients_last_seen (see the sensor sketch below)
* Update table names
* Update job prediction days to 364
* Update number of workers to 5
* Remove default value for --submission-date
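Regarding the task sensor for search_clients_last_seen above, a hedged sketch of an ExternalTaskSensor wired into the DAG; the upstream DAG id, task id, and schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task_sensor import ExternalTaskSensor

dag = DAG("ltv_daily", start_date=datetime(2020, 3, 1), schedule_interval="@daily")

wait_for_search_clients_last_seen = ExternalTaskSensor(
    task_id="wait_for_search_clients_last_seen",
    external_dag_id="bqetl_search",               # placeholder upstream DAG id
    external_task_id="search_clients_last_seen",  # placeholder upstream task id
    check_existence=True,
    mode="reschedule",
    dag=dag,
)

# The downstream LTV task (defined elsewhere) would then depend on the sensor:
# wait_for_search_clients_last_seen >> ltv_daily_task
```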
* GCP port of taar-dynamo
* added options for master and worker disk type and size
* dataproc should really be operated with only pd-ssd disk types and >1TB drives.
* added explicit 10 retry configuration for boto3+dynamodb
* added custom exponential backoff around the batch write, as boto3 does not seem to back off quite enough
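A hedged sketch of the two pieces: a botocore retry config and a manual exponential backoff around batch_write_item for any UnprocessedItems. Region, table name, and limits are placeholders, and chunking to DynamoDB's 25-item batch cap is omitted.

```python
import time

import boto3
from botocore.config import Config

# Ask botocore itself to retry up to 10 times.
dynamodb = boto3.client(
    "dynamodb",
    region_name="us-west-2",  # placeholder region
    config=Config(retries={"max_attempts": 10}),
)

def batch_write_with_backoff(table_name, items, max_tries=10):
    """Write items in a batch, backing off exponentially while UnprocessedItems remain."""
    request = {table_name: [{"PutRequest": {"Item": item}} for item in items]}
    for attempt in range(max_tries):
        response = dynamodb.batch_write_item(RequestItems=request)
        unprocessed = response.get("UnprocessedItems", {})
        if not unprocessed:
            return
        request = unprocessed
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("UnprocessedItems remained after {} tries".format(max_tries))
```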
* Enable production gs:// link to etl script
* Added missing cluster name for taar-dynamo
Also dropped EMRSparkOperator import as it's no longer required
* Changed default to None for disk type and size
Dataproc clusters have meaningful defaults defined in
DataprocClusterCreateOperator. All APIs higher up have defaults set to
None and will only pass in values to the DataprocClusterCreateOperator
if the value has been overridden.
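The pattern described here, sketched roughly: helpers default their disk arguments to None and only forward values the caller overrides, so Dataproc's own defaults apply otherwise. The wrapper and import path below are illustrative (Airflow 1.x contrib), not the repo's actual code.

```python
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

def make_cluster_operator(cluster_name, master_disk_type=None, master_disk_size=None,
                          **kwargs):
    """Only forward disk settings when the caller actually overrides them."""
    operator_kwargs = dict(kwargs)  # project_id, num_workers, zone, etc. pass through
    if master_disk_type is not None:
        operator_kwargs["master_disk_type"] = master_disk_type
    if master_disk_size is not None:
        operator_kwargs["master_disk_size"] = master_disk_size
    # Anything left unset falls through to DataprocClusterCreateOperator's defaults.
    return DataprocClusterCreateOperator(
        task_id="create_dataproc_cluster",
        cluster_name=cluster_name,
        **operator_kwargs,
    )
```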
* Docstrings added to moz_dataproc_pyspark_runner
Added docstrings for master_disk_type, master_disk_size,
worker_disk_type and worker_disk_size.
* Revert "Changed default to None for disk type and size"
This reverts commit 24dc037094953fd932eef76b299857b4da36ffbe.
* API consistency for script runner and jar runner
The moz_dataproc_jar_runner and moz_dataproc_scriptrunner methods now
have the disk type and disk size parameters added to match the pyspark
runner.
This removes usage of the `main_1pct_backfill` table, which is no longer
needed now that the main ping table has been backfilled.
This also adds a filter requiring `build.version` to be non-null. Such pings
should not be included in this analysis (we saw a single ping with this
field empty in the last year).
This PR migrates the taar_locale job from python_mozetl and makes minor
changes to run under Dataproc.
Changes include:
* inlined taar_utils functions
* replaced s3_submission_date with the current clients_daily date
* removed the temporary file requirement for the TAAR Locale job
* updated the boto code to write JSON data directly into S3 (see the sketch below)
* clients_daily is now loaded from Parquet files on GCS
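Roughly what writing the JSON output straight to S3 with boto3 looks like, with no temporary file; the bucket, key, and payload below are placeholders.

```python
import json

import boto3

def write_json_to_s3(payload, bucket, key):
    """Serialize the payload and upload it directly; no temporary file needed."""
    body = json.dumps(payload).encode("utf-8")
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)

# e.g. write_json_to_s3(top_addons, "example-taar-bucket", "taar/locale/top_addons_by_locale.json")
```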
* Bug 1572115 - Add dataproc version aggregates job that writes to a dev database
* Add bigquery source and project-id
* Expose storage_bucket and artifact_bucket for local testing
* Add GOOGLE_APPLICATION_CREDENTIALS to docker-compose for local testing
* Update prerelease aggregates to run locally
* Minimize mozaggregator runner
* Use default service account on production
* Use n1-standard-8 on mozaggregator job