This is motivated by
[a discussion doc](https://docs.google.com/document/d/1NwY1iReuxcbvxCi9jRUOYiymCUOoSq3cnShCI-OiW3c/edit#heading=h.c6fp0v2ffacr)
that focused on the problem of implicit dependencies between repos.
emr-bootstrap-spark is one of the projects where changes can cause unanticipated
problems for running code that lives elsewhere, and it can be difficult to
track down the source of the problem when an emr-bootstrap-spark deploy goes
wrong.
This will be followed up by documentation changes to other repos that depend
on this one, making the dependency more explicit.
The existing comment about `set -e` no longer appears to apply.
I've been deploying clusters while iterating on these changes,
hitting many "Bootstrap failure" states. I have reliably been
able to find stdout.log.gz containing the relevant failure message
for these failed clusters.
There were a few commands failing for every cluster, so I had to
make some small cleanup changes in order for bootstrap to complete
with the new `set -eo pipefail` setting.
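
As a minimal sketch of why adding `pipefail` surfaced failures that `set -e` alone let through: the function names and the `false | cat` pipeline below are hypothetical stand-ins, not commands from the bootstrap script.

```shell
#!/bin/bash
# Hypothetical demonstration; none of these commands come from the bootstrap script.

# With only `set -e`, a pipeline's exit status is that of its LAST command,
# so the failure of `false` is masked by the succeeding `cat` and the
# script carries on to the echo.
plain_errexit() {
    bash -c 'set -e; false | cat; echo "continued"'
}

# With `set -eo pipefail`, the pipeline fails if ANY command in it fails,
# so errexit stops the script before the echo runs.
errexit_pipefail() {
    bash -c 'set -eo pipefail; false | cat; echo "continued"'
}
```

Calling `plain_errexit` prints `continued` and exits with status 0, while `errexit_pipefail` prints nothing and exits non-zero. That is the behavior that turns a silently failing bootstrap command into a visible "Bootstrap failure".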
This matches the style of the tags Datadog sends automatically
(security-group, instance-type, etc.) and
allows cleaning tag values that contain characters like ':'.
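
A rough sketch of the kind of cleaning described here, assuming every character outside letters, digits, `_`, `.`, `/` and `-` is replaced with an underscore. The function names and the `job-name` key are hypothetical, and the actual character set used by the change may differ.

```shell
#!/bin/bash
# Hypothetical helpers; the names and the replacement rule are assumptions.

# Replace every character that is not alphanumeric, '_', '.', '/' or '-'
# with an underscore, so a ':' in the value cannot be mistaken for the
# key:value separator.
clean_tag_value() {
    printf '%s' "$1" | tr -c '[:alnum:]_./-' '_'
}

# Build a "key:value" tag with a hyphenated key, in the same style as
# security-group or instance-type.
make_tag() {
    printf '%s:%s' "$1" "$(clean_tag_value "$2")"
}
```

For example, `make_tag job-name 'daily:main summary'` yields `job-name:daily_main_summary`.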
It looks like we were already pre-loading the vitillo version
of spark-hyperloglog from spark-packages.org for zeppelin,
but we've recently published the mozilla version of the package
on spark-packages.org; this PR makes that package a cluster default.
Packages specified in spark.jars.packages aren't installed at
cluster spinup time, but rather at instantiation of a SparkContext,
so this adds a slight amount of work for each new SparkContext
that's created.
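
For illustration, a package can be made a default by adding a `spark.jars.packages` line to a Spark defaults file. This is only a sketch: the helper name and the file path are hypothetical, and the coordinate in the usage note is a placeholder rather than the real spark-hyperloglog coordinate.

```shell
#!/bin/bash
# Hypothetical helper; not the code from this PR.

# Append a spark.jars.packages entry to a Spark defaults file.
#   $1: path to the defaults file (e.g. a spark-defaults.conf)
#   $2: package coordinate in groupId:artifactId:version form
# Spark resolves the listed coordinates when a SparkContext is created,
# not when this line is written.
add_default_package() {
    local conf_file="$1"
    local coordinate="$2"
    printf 'spark.jars.packages %s\n' "$coordinate" >> "$conf_file"
}
```

A call such as `add_default_package /path/to/spark-defaults.conf example-org:example-package:0.0.0` would append one line to that file. If the file already sets `spark.jars.packages`, the existing comma-separated value would need to be extended instead, which this sketch does not handle.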
Previously, executor logging would fail with:
java.io.FileNotFoundException: /mnt/var/log/spark/spark.log (Permission denied)
This prevented the executor logs from being written.
This change lets them be written, which in turn makes them available in S3.