ActiveData-ETL

The ETL code responsible for filling ActiveData.

Sounds Exciting! Can I Use This?

Probably not. Most of the code implements a high-volume, idiosyncratic data pipeline on top of AWS services, and it depends on other services running alongside it. But feel free to pillage activedata_etl/imports or activedata_etl/transforms for the transformation code.

Branches

Many branches are meant as stable versions for each of the processes involved in the ETL. Ideally they would be unified, but a library upgrade can destabilize one process without affecting the others, so a branch is not deployed until it has been (manually) tested.

Here are the important branches:

  • dev - unstable - primary branch for accepting changes
  • etl - stable - for ETL machines
  • primary - stable - for the "primary" and "coordinator" ES nodes
  • codecoverage - unstable - for Code Coverage ETL development
  • pulse-logger - stable - for the PulseLogger
  • tc-logger - stable - for the TaskCluster logger
  • push-to-es - stable - code installed on ES spot instance machines for final indexing.
  • beta - stable - of all branches for testing on the beta machines
  • manager - stable - installed on the ActiveData management machine for cron jobs
  • master - unstable - intermittently updated to track dev, eventually intended as the single stable version

Requirements

  • Python 2.7.x
  • Elasticsearch 1.7.x (the current 2.x versions are not supported yet)
  • Access to Amazon S3 bucket for ETL results
  • Access to Amazon SQS for the ETL pipeline
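The Elasticsearch version requirement can be checked before running anything else. The sketch below is not part of this repository. It assumes you have already fetched and parsed the JSON document that an Elasticsearch node returns from its root URL, which reports the version under `version.number`. The function name `es_version_supported` is hypothetical.

```python
def es_version_supported(root_info, required=(1, 7)):
    """Return True if the reported version matches the required major.minor.

    root_info is assumed to be the parsed JSON from the node's root URL,
    for example {"version": {"number": "1.7.5"}}.
    """
    number = root_info["version"]["number"]
    # Compare only the first two components, so any 1.7.x release passes
    major_minor = tuple(int(part) for part in number.split(".")[:2])
    return major_minor == tuple(required)
```

Comparing only the major and minor components matches the "1.7.x" wording above; a 2.x cluster is rejected.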

Installing Fabric

It is 2016, and Python is still hard on Windows. It would be a nice question for Stack Overflow, but apparently not.

  1. Install Python and pip.
  2. Run pip install fabric. There will be errors.
  3. Install pycrypto. Hopefully, voidspace still provides pre-compiled binaries. Knowing the internet, it probably moved by the time you read this, so I made a copy of pycrypto-2.6.win32-py2.7.exe
  4. Run pip install fabric again. This time it should succeed.
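One way to confirm the steps above worked is to try importing both packages. This is a minimal sketch, not something shipped with the repository, and the helper name `check_imports` is hypothetical. Note that pycrypto installs under the import name `Crypto`.

```python
import importlib


def check_imports(module_names):
    """Map each module name to True if it imports cleanly, else False."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # Module is not installed (or one of its own imports is missing)
            results[name] = False
    return results
```

Calling `check_imports(["Crypto", "fabric"])` after step 4 should report True for both names if the install succeeded.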

Configuration Files

The configuration files, located in resources/settings, often point to a private.json config file outside the repository tree. This file holds the credentials and access info required, and looks something like this:

{
    "email":{
        "host": "smtp.gmail.com",
        "port": 465,
        "username": "",
        "password": "",
        "use_ssl": 1
    },
    "aws_credentials":{
        "aws_access_key_id":"",
        "aws_secret_access_key" :"",
        "region":"us-west-2"
    },
    "pulse_user":{
        "user": "",
        "password": ""
    }
}

The exact properties will depend on the resources you are accessing.
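Since the sample leaves the credentials blank, it can help to check a private.json for unfilled values before starting a process. The following is a rough sketch under stated assumptions: the section names come from the sample above, the helper names `load_private` and `missing_settings` are hypothetical, and the ETL code itself may read the file quite differently.

```python
import json

# Assumed section names, copied from the sample above; the real set
# depends on which resources a given settings file refers to.
EXPECTED_SECTIONS = ("email", "aws_credentials", "pulse_user")


def load_private(path):
    """Read and parse a private.json file from the given path."""
    with open(path) as f:
        return json.load(f)


def missing_settings(private, sections=EXPECTED_SECTIONS):
    """Return names of absent sections and of properties left as ""."""
    problems = []
    for section in sections:
        values = private.get(section)
        if not isinstance(values, dict):
            # Whole section is missing or is not an object
            problems.append(section)
            continue
        for key, value in sorted(values.items()):
            if value == "":
                problems.append(section + "." + key)
    return problems
```

An empty list from `missing_settings(load_private(path))` means every expected section is present and no property is blank; it says nothing about whether the credentials are valid.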