Victor Ng
e41df7e1f1
piled on logging and increased the sample size by 100x
2018-02-06 09:41:03 -05:00
Victor Ng
9dbcecb462
added extra JSON serialization safeties
2018-02-06 09:27:51 -05:00
Victor Ng
4648c6cead
Added extra check to filter out empty string client_id values.
Drop k/v pairs in the JSON blob where keys have empty values.
2018-02-02 00:37:42 -05:00
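The empty-value filtering described in the commit above might look like the following sketch (the function name and sample record are hypothetical; the commit only states that k/v pairs with empty values are dropped from the JSON blob):

```python
import json

def drop_empty_values(blob):
    """Return a copy of `blob` without key/value pairs whose values are empty strings."""
    return {k: v for k, v in blob.items() if v != ""}

record = {"client_id": "abc123", "locale": "", "os": "Linux"}
print(json.dumps(drop_empty_values(record)))
# → {"client_id": "abc123", "os": "Linux"}
```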
Victor Ng
4eac02a6ca
added a Makefile so I don't have to remember how to upload to PyPI
2018-02-01 21:19:04 -05:00
Victor Ng
c5a8842901
removed the call to push boto3 to spark workers
2018-02-01 21:16:46 -05:00
Victor Ng
477846f9c9
added more metadata to support PyPI
2018-02-01 21:14:13 -05:00
Victor Ng
82215f1b64
added readme
2018-02-01 20:57:27 -05:00
Victor Ng
c9fee0acf7
dropped dead code module
2018-02-01 20:55:58 -05:00
Victor Ng
ab5a82c427
Added lots of docstrings to make it clear what is going on.
2018-02-01 20:53:52 -05:00
Victor Ng
4669884b51
Added the dynamo_reducer to the last stage of processing of the RDD.
2018-02-01 20:53:38 -05:00
Victor Ng
7a9948923b
Removed the unnecessary `load_parquet` closure function and just inlined the relevant template code into the etl function.
2018-02-01 20:52:04 -05:00
Victor Ng
07a9772d8a
Dropped reducer function in taar_dynamo as it's been moved into the filters submodule.
2018-02-01 20:40:18 -05:00
Victor Ng
9967f9a96f
Added a force_write optional argument to the dynamo_reducer.
Pulled out some in-function import statements to the module top level.
2018-02-01 20:37:23 -05:00
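A minimal sketch of what a `force_write` flag on a reducer might look like (all names, the batching threshold, and the `flush` callback are assumptions; the real dynamo_reducer writes batches to DynamoDB):

```python
def dynamo_reducer(acc, items, force_write=False, batch_size=25, flush=print):
    """Fold `items` into the accumulator; flush when a full batch is
    accumulated, or immediately when force_write=True."""
    acc = acc + list(items)
    if len(acc) >= batch_size or (force_write and acc):
        flush(acc)  # e.g. a batched DynamoDB write
        acc = []
    return acc
```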
Victor Ng
88fb87d6d1
updated airflow_job.py from atmo testing
2018-02-01 20:36:56 -05:00
Victor Ng
0b644246c0
more minor patches
2018-02-01 17:06:45 -05:00
Victor Ng
4f79867117
More refactoring of the dynamo loaders.
2018-02-01 11:39:23 -05:00
Victor Ng
e0d6e329fb
Refactored code so that loading package into spark nodes is possible.
2018-01-30 14:38:33 -05:00
Victor Ng
0d4d34cd6a
added wheel dependencies for argparse and boto3 so that we can safely load this code on spark nodes
2018-01-30 14:38:19 -05:00
Victor Ng
35dbeacf6e
init commit.
PySpark has some edge cases where sparkContext.addPyFile() doesn't
properly push code to the spark nodes. This package has been created so
that the entire codebase can be pushed to all nodes.
2018-01-30 12:38:39 -05:00
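The workaround described in the init commit can be sketched as follows: rather than relying on `sparkContext.addPyFile()` for individual modules, the whole package is zipped and the single archive is shipped to every executor. The helper name and paths here are hypothetical:

```python
import shutil

def bundle_package(pkg_dir, out_name):
    """Zip an entire package directory into one archive that can be
    pushed to every spark executor in a single addPyFile() call."""
    return shutil.make_archive(out_name, "zip", root_dir=pkg_dir)

# Usage on the driver (sc is an existing SparkContext):
#   archive = bundle_package("taar_loader", "taar_loader")
#   sc.addPyFile(archive)  # workers receive the whole codebase at once
```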