* fix: bug 1579266 fix fetch_schema in socorro_import_crash_data.py
It should no longer read from S3.
Also fix the code for reading the schema from GitHub.
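For reference, a minimal sketch of fetching the crash schema over HTTPS instead of S3; the raw GitHub URL and filename below are illustrative placeholders, not necessarily what the DAG uses.

```python
import json
import urllib.request

# Illustrative URL; the actual schema location in the socorro repo may differ.
SCHEMA_URL = (
    "https://raw.githubusercontent.com/mozilla-services/socorro/"
    "master/socorro/schemas/crash_report.json"
)

def fetch_schema():
    """Fetch the crash report JSON schema directly from GitHub."""
    with urllib.request.urlopen(SCHEMA_URL) as response:
        return json.loads(response.read().decode("utf-8"))
```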
* black format and fix ruff
* Fix update_orphaning job
This uses `mozfun.norm.truncate_version` to extract the major version number, which is then used in subsequent version comparisons.
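As a rough illustration (not the actual update_orphaning query), truncating to the major version in BigQuery looks roughly like this; the table, column, and threshold are placeholders, and the two-argument form of the UDF is an assumption.

```python
from google.cloud import bigquery

# Placeholder query: compare on the major version only via the mozfun UDF.
QUERY = """
SELECT
  client_id,
  mozfun.norm.truncate_version(app_version, "major") AS major_version
FROM
  `moz-fx-data-shared-prod.telemetry.clients_daily`  -- placeholder table
WHERE
  mozfun.norm.truncate_version(app_version, "major") >= 78  -- placeholder threshold
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(row.client_id, row.major_version)
```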
* Update update_orphaning contact emails
* Address review feedback
* Reformat update_orphaning DAG
* Reformat update_orphaning job
* Revert "update airflow config for 2.3.3"
This reverts commit d19cc711aa.
* Revert "fix deprecation warnings, clean up and update for 2.3.3"
This reverts commit e80472ab9a.
* Revert "update requirements, introduce constraints file and clean up for 2.3.3"
This reverts commit 8e60dba783.
* Use GCS instead of S3 in TAAR jobs
* Restore the previous version of telemetry extraction for the taar lite job
* Fix bucket for taar guid ranking task
* Fix taar ensemble job to work with new taar package
* Add taar_etl_model_storage_bucket variable
* Add telemetry alert email to taar dags
* TAAR fixes: add todos, exceptions logging, fix formatting
* Reduce retries to 0 for TAAR daily
Retrying the TAAR jobs is not useful: a retry reuses the existing cluster, which would need to be restarted between attempts.
Instead, simply re-run the DAG, which provisions a new cluster.
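A minimal sketch of what this looks like in the DAG's default_args; the owner, dates, and schedule are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "example@mozilla.com",   # placeholder owner
    "depends_on_past": False,
    "start_date": datetime(2020, 1, 1),
    "email_on_failure": True,
    "retries": 0,                     # a retry would reuse the old cluster, so don't retry
    "retry_delay": timedelta(minutes=30),
}

dag = DAG("taar_daily", default_args=default_args, schedule_interval="@daily")
```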
* Port TAAR similarity job to Cloud BigQuery
The query has been dramatically simplified by using clients_last_seen.
GCS buckets are no longer required for reading in Parquet files.
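A hedged sketch of reading clients_last_seen straight from BigQuery with the spark-bigquery connector instead of Parquet on GCS; the table name and filter are illustrative, and the connector must be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("taar_similarity").getOrCreate()

# Assumes the spark-bigquery connector is on the cluster (e.g. added via --jars
# or a Dataproc initialization action).
clients = (
    spark.read.format("bigquery")
    .option("table", "moz-fx-data-shared-prod.telemetry.clients_last_seen")
    .load()
    .where("days_since_seen = 0")  # illustrative filter for currently active clients
)
```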
* Update configuration for TAAR Similarity job
Note that using SSDs speeds up the similarity job substantially.
This makes two big changes:
1. Adds days_seen to predictions. We will now predict the number
of days we will see this user over the next 28 days.
2. Adds p_alive for all metrics. We can use this to determine
whether we think the user is alive or dead. If p_alive for
days_seen is < 0.5, we may assume the user is dead.
Small changes:
- Code fixup
- Cluster on sample id and client id
- Allow adding new fields to output table
Cleaned up metric specification as well.
Preliminary analysis suggests `search_with_ads` is highly correlated with revenue.
Co-authored-by: Anthony Miyaguchi <amiyaguchi@mozilla.com>
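For context, p_alive and a 28-day days_seen prediction are the kind of outputs a BG/NBD model produces; below is a rough sketch using the lifetimes package with made-up data. The choice of model and the column names are assumptions, not necessarily what the job actually does.

```python
import pandas as pd
from lifetimes import BetaGeoFitter

# Made-up per-client summaries; real inputs would be derived from days_seen history.
df = pd.DataFrame({
    "frequency": [12, 3, 0, 25],
    "recency": [26.0, 10.0, 0.0, 27.0],
    "T": [28.0, 28.0, 28.0, 28.0],
})

bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(df["frequency"], df["recency"], df["T"])

# Predicted number of active days over the next 28 days.
df["days_seen_pred_28"] = bgf.conditional_expected_number_of_purchases_up_to_time(
    28, df["frequency"], df["recency"], df["T"]
)

# Probability the client is still "alive"; < 0.5 suggests the user is gone.
df["p_alive"] = bgf.conditional_probability_alive(
    df["frequency"], df["recency"], df["T"]
)
```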
* taar ensemble patches for GCP
* Added local SSD support to DataprocClusterCreateOperator
* Default is set to 0 SSDs for both master and worker nodes.
* Added master|worker num_local_ssds parameter to DataProcHelper
* Added arguments for local SSD to dataproc runners
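Hedged sketch of a DAG requesting local SSDs once these arguments are plumbed through; the master_num_local_ssds / worker_num_local_ssds names follow the commit messages above, and everything else (import path, other arguments, bucket) is illustrative.

```python
from datetime import datetime

from utils.dataproc import moz_dataproc_pyspark_runner  # repo helper, path assumed

default_args = {"owner": "example@mozilla.com", "start_date": datetime(2020, 1, 1)}

taar_ensemble = moz_dataproc_pyspark_runner(
    parent_dag_name="taar_weekly",        # illustrative
    dag_name="taar_ensemble",
    cluster_name="taar-ensemble-{{ ds_nodash }}",
    num_workers=20,
    master_num_local_ssds=0,              # default: no local SSDs on the master
    worker_num_local_ssds=2,              # attach two local SSDs per worker
    python_driver_code="gs://example-bucket/jobs/taar_ensemble.py",  # placeholder
    default_args=default_args,
)
```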
* updated taar_weekly for taar_ensemble
* added context into load_recommenders
* dropped logging in spark job
We need to pass a context into the PySpark workers, or else they won't get the right AWS credentials.
Logging was dropped because it causes Spark issues with thread lock objects (logger locks cannot be pickled and shipped to executors).
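Rough sketch of the pattern being described: serialize plain credential values into a context dict on the driver and build clients inside each partition, instead of capturing driver-side objects (loggers, clients) whose thread locks cannot be pickled. All names are illustrative.

```python
import boto3

def make_context(aws_access_key_id, aws_secret_access_key):
    # Plain strings only: this dict is pickled and shipped to the executors.
    return {
        "aws_access_key_id": aws_access_key_id,
        "aws_secret_access_key": aws_secret_access_key,
    }

def score_partition(ctx, rows):
    # Build the boto3 client on the executor, inside the partition.
    client = boto3.client(
        "s3",
        aws_access_key_id=ctx["aws_access_key_id"],
        aws_secret_access_key=ctx["aws_secret_access_key"],
    )
    for row in rows:
        yield row  # ...do per-row work that needs the client...

# Usage (driver side), with credentials pulled from an Airflow connection:
# ctx = make_context(access_key, secret_key)
# results = rdd.mapPartitions(lambda rows: score_partition(ctx, rows))
```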
* bumped sample rate to 0.005 for taar ensemble
* added local SSDs and bumped taar version
* renamed output file to ensemble_jobs.json
* added a docstring to the ensemble job with an estimate of its runtime
* added pip-install.sh to jobs
* switched from testing to prod gs URLs
* updated the sample rate to 0.005 in the DAG
* added better param docs to local SSD option
* Sorted imports into alphabetical order
* inlined taar_aws_conn_id
* Revert "inlined taar_aws_conn_id"
This reverts commit 4363acb77f.
* Add initial ltv_daily script
This was created by :bmiroglio
* Black ltv_daily script
* Refactor variables for running on other projects
* Refactor into main and parameterize script
* Add dag for ltv
* Update script and dag parameters
This has been tested in a sandbox project.
* Add task sensor for search_clients_last_seen (see the sensor sketch below)
* Update table names
* Update job prediction days to 364
* Update number of workers to 5
* Remove default value for --submission-date
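Regarding the task sensor for search_clients_last_seen above, a hedged sketch of an ExternalTaskSensor wired into the DAG; the upstream DAG id, task id, and schedule are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task_sensor import ExternalTaskSensor

dag = DAG("ltv_daily", start_date=datetime(2020, 3, 1), schedule_interval="@daily")

wait_for_search_clients_last_seen = ExternalTaskSensor(
    task_id="wait_for_search_clients_last_seen",
    external_dag_id="bqetl_search",               # placeholder upstream DAG id
    external_task_id="search_clients_last_seen",  # placeholder upstream task id
    check_existence=True,
    mode="reschedule",
    dag=dag,
)

# The downstream LTV task (defined elsewhere) would then depend on the sensor:
# wait_for_search_clients_last_seen >> ltv_daily_task
```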
* GCP port of taar-dynamo
* added options for master and worker disk type and size
* dataproc should really be operated with only pd-ssd disk types and >1TB drives.
* added explicit 10 retry configuration for boto3+dynamodb
* added custom exponential backoff around the batch write, as boto3 does not seem to back off quite enough
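A hedged sketch of the two pieces: a botocore retry config and a manual exponential backoff around batch_write_item for any UnprocessedItems. Region, table name, and limits are placeholders, and chunking to DynamoDB's 25-item batch cap is omitted.

```python
import time

import boto3
from botocore.config import Config

# Ask botocore itself to retry up to 10 times.
dynamodb = boto3.client(
    "dynamodb",
    region_name="us-west-2",  # placeholder region
    config=Config(retries={"max_attempts": 10}),
)

def batch_write_with_backoff(table_name, items, max_tries=10):
    """Write items in a batch, backing off exponentially while UnprocessedItems remain."""
    request = {table_name: [{"PutRequest": {"Item": item}} for item in items]}
    for attempt in range(max_tries):
        response = dynamodb.batch_write_item(RequestItems=request)
        unprocessed = response.get("UnprocessedItems", {})
        if not unprocessed:
            return
        request = unprocessed
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("UnprocessedItems remained after {} tries".format(max_tries))
```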
* Enable production gs:// link to etl script
* Added missing cluster name for taar-dynamo
Also dropped EMRSparkOperator import as it's no longer required
* Changed default to None for disk type and size
Dataproc clusters have meaningful defaults defined in
DataprocClusterCreateOperator. All APIs higher up have defaults set to
None and will only pass in values to the DataprocClusterCreateOperator
if the value has been overridden.
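The pattern described here, sketched roughly: helpers default their disk arguments to None and only forward values the caller overrides, so Dataproc's own defaults apply otherwise. The wrapper and import path below are illustrative (Airflow 1.x contrib), not the repo's actual code.

```python
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

def make_cluster_operator(cluster_name, master_disk_type=None, master_disk_size=None,
                          **kwargs):
    """Only forward disk settings when the caller actually overrides them."""
    operator_kwargs = dict(kwargs)  # project_id, num_workers, zone, etc. pass through
    if master_disk_type is not None:
        operator_kwargs["master_disk_type"] = master_disk_type
    if master_disk_size is not None:
        operator_kwargs["master_disk_size"] = master_disk_size
    # Anything left unset falls through to DataprocClusterCreateOperator's defaults.
    return DataprocClusterCreateOperator(
        task_id="create_dataproc_cluster",
        cluster_name=cluster_name,
        **operator_kwargs,
    )
```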
* Docstrings added to moz_dataproc_pyspark_runner
Added docstrings for master_disk_type, master_disk_size,
worker_disk_type and worker_disk_size.
* Revert "Changed default to None for disk type and size"
This reverts commit 24dc037094953fd932eef76b299857b4da36ffbe.
* API consistency for script runner and jar runner
The moz_dataproc_jar_runner and moz_dataproc_scriptrunner methods now
have the disk type and disk size parameters added to match the pyspark
runner.
This removes usage of the `main_1pct_backfill` table, which is no longer
needed now that the main ping table has been backfilled.
This also adds a filter requiring `build.version` to be non-null. Such pings
should not be included in this analysis (we saw a single ping with this
field empty in the last year).
This PR migrates the taar_locale job from python_mozetl and makes minor
changes to run under Dataproc.
Changes include:
* inlined taar_utils functions
* replaced s3_submission_date with the current clients_daily date
* removed the temporary file requirement for the TAAR Locale job
* updated the boto code to write JSON data directly into S3 (see the sketch below)
* clients_daily is now loaded from Parquet files on GCS
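Roughly what writing the JSON output straight to S3 with boto3 looks like, with no temporary file; the bucket, key, and payload below are placeholders.

```python
import json

import boto3

def write_json_to_s3(payload, bucket, key):
    """Serialize the payload and upload it directly; no temporary file needed."""
    body = json.dumps(payload).encode("utf-8")
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)

# e.g. write_json_to_s3(top_addons, "example-taar-bucket", "taar/locale/top_addons_by_locale.json")
```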
* Bug 1572115 - Add dataproc version aggregates job that writes to a dev database
* Add bigquery source and project-id
* Expose storage_bucket and artifact_bucket for local testing
* Add GOOGLE_APPLICATION_CREDENTIALS to docker-compose for local testing
* Update prerelease aggregates to run locally
* Minimize mozaggregator runner
* Use default service account on production
* Use n1-standard-8 on mozaggregator job