* added default spark conf files to config/

* refactored loading and cleanup of spark conf files

* curated spark conf files and added docs
This commit is contained in:
Jacob Freck 2017-09-29 13:03:11 -07:00 committed by GitHub
Parent 5f12ff66dc
Commit c2172b5f64
5 changed files: 101 additions and 4 deletions

View File

@@ -0,0 +1,29 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# Note: Thunderbolt pre-loads the wasb jars listed below, so loading them elsewhere is not necessary
spark.jars /home/spark-current/jars/azure-storage-2.0.0.jar,/home/spark-current/jars/hadoop-azure-2.7.3.jar

config/spark-env.sh Normal file (46 additions)
View File

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is sourced when running various Spark programs.
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g)
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE Run the proposed command in the foreground. It will not output a PID file.
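
As a minimal sketch of how these options are typically used (the values below are assumptions, not defaults shipped with this file), a node's worker resources could be set by appending lines such as:

```sh
# illustrative values only; size these to the VMs in your pool
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=8g
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
```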

View File

@@ -58,3 +58,23 @@ jupyter_port: 8088
Running the command `azb spark cluster ssh --id <cluster_id>` will attempt to ssh into the cluster with the specified id as the user 'spark'. It will forward the Spark Job UI to localhost:4040, the Spark master's web UI to localhost:8080, and Jupyter to localhost:8088.
Note that all of the settings in ssh.yaml will be overridden by parameters passed on the command line.
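
For example, assuming a cluster was created with the id `my_spark_cluster` (a placeholder), a session and the forwarded UIs would look roughly like this:

```sh
# the cluster id below is a placeholder
azb spark cluster ssh --id my_spark_cluster
# while the session is open, browse locally to:
#   http://localhost:4040   (Spark Job UI)
#   http://localhost:8080   (Spark master web UI)
#   http://localhost:8088   (Jupyter)
```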
## Spark Configuration
The repository comes with default Spark configuration files which provision your Spark cluster just as you would locally. After running `azb spark init` to initialize your working environment, you can view and edit these files at `.thunderbolt/spark-defaults.conf` and `.thunderbolt/spark-env.sh`. Please note that you can bring your own Spark configuration files by copying your `spark-defaults.conf` and `spark-env.sh` into your `.thunderbolt/` directory.
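
For instance, a minimal sketch of bringing your own configuration (the source paths here are placeholders):

```sh
azb spark init
# overwrite the generated defaults with your own copies
cp ~/my-spark-conf/spark-defaults.conf .thunderbolt/spark-defaults.conf
cp ~/my-spark-conf/spark-env.sh .thunderbolt/spark-env.sh
```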
The following settings available in `spark-defaults.conf` and `spark-env.sh` are not supported in Thunderbolt:
`spark-env.sh`:
- SPARK\_LOCAL\_IP
- SPARK\_PUBLIC\_DNS
- SPARK\_MASTER\_HOST
- SPARK\_MASTER\_PORT
- SPARK\_WORKER\_PORT
- SPARK\_MASTER\_WEBUI\_PORT
- Any options related to YARN client mode or Mesos
`spark-defaults.conf`:
- spark.master
Also note that Thunderbolt pre-loads wasb jars, so loading them elsewhere is not necessary.

View File

@@ -8,6 +8,7 @@ import azure.batch.models as batch_models
from . import azure_api, constants, upload_node_scripts, util, log
from dtde.error import ClusterNotReadyError, ThunderboltError
from collections import namedtuple
import dtde.config as config
import getpass
POOL_ADMIN_USER_IDENTITY = batch_models.UserIdentity(
@@ -162,10 +163,15 @@ def create_cluster(
:param password: Optional password of the user to add to the pool when ready (requires wait to be True)
:param wait: Whether this function should wait for the cluster to be ready (master and all slaves booted)
"""
# Copy spark conf files if they exist
config.load_spark_config()
# Upload start task scripts
zip_resource_file = upload_node_scripts.zip_and_upload()
# Clean up spark conf files
config.cleanup_spark_config()
batch_client = azure_api.get_batch_client()
# vm image

View File

@@ -53,8 +53,6 @@ def execute(args: typing.NamedTuple):
ssh_key = args.ssh_key,
docker_repo = args.docker_repo)
load_spark_config()
log.info("-------------------------------------------")
log.info("spark cluster id: %s", cluster_conf.uid)
log.info("spark cluster size: %s", cluster_conf.size + cluster_conf.size_low_pri)
@@ -83,5 +81,3 @@ def execute(args: typing.NamedTuple):
cluster_conf.wait)
log.info("Cluster created successfully.")
cleanup_spark_config()