* added default spark conf files to config/

* refactored loading and cleanup of spark conf files

* curated spark conf files and added docs
This commit is contained in:
Jacob Freck 2017-09-29 13:03:11 -07:00 committed by GitHub
Parent 5f12ff66dc
Commit c2172b5f64
5 changed files: 101 additions and 4 deletions

View File

@@ -0,0 +1,29 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# Note: Thunderbolt pre-loads the wasb jars listed below, so loading them elsewhere is not necessary
spark.jars /home/spark-current/jars/azure-storage-2.0.0.jar,/home/spark-current/jars/hadoop-azure-2.7.3.jar

config/spark-env.sh Normal file (46 additions)
View File

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is sourced when running various Spark programs.
# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g)
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE Run the proposed command in the foreground. It will not output a PID file.
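
As a minimal sketch of how these options are typically used (the values below are assumptions, not defaults shipped with this file), a node's worker resources could be set by appending lines such as:

```sh
# illustrative values only; size these to the VMs in your pool
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=8g
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
```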

View File

@@ -58,3 +58,23 @@ jupyter_port: 8088
Running the command `azb spark cluster ssh --id <cluster_id>` will attempt to ssh into the cluster with the specified id as the user 'spark'. It will forward the Spark Job UI to localhost:4040, the Spark master's web UI to localhost:8080, and Jupyter to localhost:8088.
Note that all of the settings in ssh.yaml will be overridden by parameters passed on the command line.
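
For example, assuming a cluster was created with the id `my_spark_cluster` (a placeholder), a session and the forwarded UIs would look roughly like this:

```sh
# the cluster id below is a placeholder
azb spark cluster ssh --id my_spark_cluster
# while the session is open, browse locally to:
#   http://localhost:4040   (Spark Job UI)
#   http://localhost:8080   (Spark master web UI)
#   http://localhost:8088   (Jupyter)
```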
## Spark Configuration
The repository comes with default Spark configuration files which provision your Spark cluster just as you would locally. After running `azb spark init` to initialize your working environment, you can view and edit these files at `.thunderbolt/spark-defaults.conf` and `.thunderbolt/spark-env.sh`. Please note that you can bring your own Spark configuration files by copying your `spark-defaults.conf` and `spark-env.sh` into your `.thunderbolt/` directory.
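
For instance, a minimal sketch of bringing your own configuration (the source paths here are placeholders):

```sh
azb spark init
# overwrite the generated defaults with your own copies
cp ~/my-spark-conf/spark-defaults.conf .thunderbolt/spark-defaults.conf
cp ~/my-spark-conf/spark-env.sh .thunderbolt/spark-env.sh
```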
The following settings available in `spark-defaults.conf` and `spark-env.sh` are not supported in Thunderbolt:
`spark-env.sh`:
- SPARK\_LOCAL\_IP
- SPARK\_PUBLIC\_DNS
- SPARK\_MASTER\_HOST
- SPARK\_MASTER\_PORT
- SPARK\_WORKER\_PORT
- SPARK\_MASTER\_WEBUI\_PORT
- Any options related to YARN client mode or Mesos
`spark-defaults.conf`:
- spark.master
Also note that Thunderbolt pre-loads wasb jars, so loading them elsewhere is not necessary.

View File

@@ -8,6 +8,7 @@ import azure.batch.models as batch_models
from . import azure_api, constants, upload_node_scripts, util, log
from dtde.error import ClusterNotReadyError, ThunderboltError
from collections import namedtuple
import dtde.config as config
import getpass
POOL_ADMIN_USER_IDENTITY = batch_models.UserIdentity(
@@ -162,10 +163,15 @@ def create_cluster(
:param password: Optional password of the user to add to the pool when ready (requires wait to be True)
:param wait: Whether this function should wait for the cluster to be ready (master and all slaves booted)
"""
# Copy spark conf files if they exist
config.load_spark_config()
# Upload start task scripts
zip_resource_file = upload_node_scripts.zip_and_upload()
# Clean up spark conf files
config.cleanup_spark_config()
batch_client = azure_api.get_batch_client()
# vm image

View File

@@ -53,8 +53,6 @@ def execute(args: typing.NamedTuple):
ssh_key = args.ssh_key,
docker_repo = args.docker_repo)
load_spark_config()
log.info("-------------------------------------------")
log.info("spark cluster id: %s", cluster_conf.uid)
log.info("spark cluster size: %s", cluster_conf.size + cluster_conf.size_low_pri)
@@ -83,5 +81,3 @@ def execute(args: typing.NamedTuple):
cluster_conf.wait)
log.info("Cluster created successfully.")
cleanup_spark_config()