ActiveData-ETL/activedata_etl
Mike Hommey 90c55fd65a
Prepare for bug 1651538
Bug 1651538 is going to relabel docker images from `build-docker-image-*` to `docker-image-*`.
2020-07-31 07:38:41 +09:00
imports Prepare for bug 1651538 2020-07-31 07:38:41 +09:00
monitor contact 2020-05-05 12:50:24 -04:00
sinks contact 2020-05-05 12:50:24 -04:00
transforms catch more recoverable errors 2020-05-06 04:49:44 -04:00
README.md docs 2019-08-17 22:53:21 -04:00
__init__.py contact 2020-05-05 12:50:24 -04:00
auto_backfill.py contact 2020-05-05 12:50:24 -04:00
backfill.py contact 2020-05-05 12:50:24 -04:00
backfill_repo.py contact 2020-05-05 12:50:24 -04:00
buildbot_json_logs.py contact 2020-05-05 12:50:24 -04:00
compact_tc_logger.py contact 2020-05-05 12:50:24 -04:00
copy_index.py more hg info 2020-02-28 11:49:41 -05:00
copy_queue.py contact 2020-05-05 12:50:24 -04:00
etl.py remove need for psutil 2020-05-05 14:48:10 -04:00
find_es_oom.py contact 2020-05-05 12:50:24 -04:00
fx_test_logger.py contact 2020-05-05 12:50:24 -04:00
get_tuid.py contact 2020-05-05 12:50:24 -04:00
look_at_queue.py contact 2020-05-05 12:50:24 -04:00
pulse_logger.py contact 2020-05-05 12:50:24 -04:00
push_to_es.py contact 2020-05-05 12:50:24 -04:00
push_to_es_start.py contact 2020-05-05 12:50:24 -04:00
push_to_es_stop.py contact 2020-05-05 12:50:24 -04:00
s3_clear.py contact 2020-05-05 12:50:24 -04:00
s3_clear_bucket.py contact 2020-05-05 12:50:24 -04:00
s3_find.py contact 2020-05-05 12:50:24 -04:00
s3_look.py contact 2020-05-05 12:50:24 -04:00
s3_make_public.py contact 2020-05-05 12:50:24 -04:00
s3_summary.py use "except Exception as e" 2017-03-09 11:06:43 -05:00
update_etl.py contact 2020-05-05 12:50:24 -04:00
update_push_to_es.py contact 2020-05-05 12:50:24 -04:00

README.md

The ETL Tasks

This ETL library performs many functions, each implemented as a module in this directory. Some of them are described here, most important first:

Module pulse_logger on tc-logger branch

A stand-alone program that stays connected to Mozilla's Pulse queue, archives the messages to S3, and puts them on the work queue. This program is the start of the ETL pipeline, and it is the most important one because Pulse messages last only a couple of hours before they are lost (due to queue overflow).
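
As a rough illustration of the archive-then-enqueue idea, here is a minimal sketch. It is not the actual `pulse_logger` code: the function name `archive_and_enqueue`, the key format, and the in-memory stand-ins for the S3 bucket and the work queue are all assumptions made for the example.

```python
import json
from collections import deque


def archive_and_enqueue(message, archive, work_queue, key):
    """Store the raw message under `key`, then announce it to the workers.

    `archive` and `work_queue` are hypothetical in-memory stand-ins for
    an S3 bucket and the ETL work queue.
    """
    # Archive first, so the message survives even if enqueueing fails.
    archive[key] = json.dumps(message)
    # Workers only need the key; they read the payload from the archive.
    work_queue.append({"key": key})


archive = {}          # stand-in for an S3 bucket
work_queue = deque()  # stand-in for the work queue
archive_and_enqueue({"status": "completed"}, archive, work_queue, key="1000.0")
```

Archiving before enqueueing means a worker never receives a key whose payload has not been stored yet.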

Module etl

This contains the main routine, which takes items from the work queue and applies the matching transforms to them.
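
The general shape of such a routine can be sketched as a loop over the work queue. This is only an illustration under assumptions: `run_worker`, the `"type"` field used to pick a transform, and the in-memory `source` and `sink` are hypothetical and do not reflect the real `etl` module.

```python
from collections import deque


def run_worker(work_queue, source, transforms, sink):
    """Drain the work queue, applying the matching transform to each item.

    All four arguments are hypothetical stand-ins: `source` maps keys to
    raw text, `transforms` maps a type name to a function that yields
    records, and `sink` collects the output.
    """
    while work_queue:
        item = work_queue.popleft()
        transform = transforms[item["type"]]
        for record in transform(source[item["key"]]):
            sink.append(record)


# Example: one transform that upper-cases every line of the source text.
source = {"a.txt": "x\ny"}
transforms = {"upper": lambda text: [line.upper() for line in text.splitlines()]}
work_queue = deque([{"key": "a.txt", "type": "upper"}])
sink = []
run_worker(work_queue, source, transforms, sink)
```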

Module backfill

Given a set of conditions, this reviews S3 and fills the work queue with the items not found in ES.
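
Conceptually, this is a filtered set difference between what is in S3 and what is in ES. The sketch below shows that idea only; `find_missing`, the integer keys, and the `condition` callback are assumptions for illustration, not the module's real interface.

```python
def find_missing(s3_keys, es_keys, condition=lambda key: True):
    """Return the S3 keys that satisfy `condition` but are absent from ES.

    Keys are plain integers here for simplicity; the real key format is
    not assumed.
    """
    indexed = set(es_keys)  # set lookup keeps the scan linear
    return [key for key in s3_keys if condition(key) and key not in indexed]


# Example: only consider keys from 101 upward; 101 is already indexed.
to_enqueue = find_missing([100, 101, 102], [101], condition=lambda k: k >= 101)
```

The returned keys are the items that would be placed back on the work queue.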

Module push_to_es

Responsible for adding S3 records into ES, with little or no transform.
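
One way to load records with almost no transformation is to wrap each one in an Elasticsearch bulk request body (one action line followed by one document line, newline-delimited). Whether `push_to_es` does this is an assumption; the helper `to_bulk_body` and the `_id`/`value` record layout below are hypothetical.

```python
import json


def to_bulk_body(records):
    """Build a newline-delimited bulk body from already-prepared records.

    Each record is assumed (for this sketch) to look like
    {"_id": ..., "value": {...}}; the document is passed through unchanged.
    """
    lines = []
    for record in records:
        # Action line: index the document under the given id.
        lines.append(json.dumps({"index": {"_id": record["_id"]}}))
        # Source line: the document itself, untouched.
        lines.append(json.dumps(record["value"]))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


body = to_bulk_body([{"_id": "a", "value": {"x": 1}}])
```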

Module update_etl on branch etl

If the etl or transform code is changed, you can push those changes to the worker machines immediately. All workers use the etl branch, so be sure the changes you want are there.

python.exe activedata_etl/update_etl.py --settings=resources/settings/update_etl.json