miguelgfierro 2020-06-22 10:33:14 +00:00
Parent 57b0c8a0d8
Commit c0185c1d71
1 changed file with 6 additions and 2 deletions


@@ -96,6 +96,8 @@ To set these variables every time the environment is activated, we can follow th
First, get the path where the environment `reco_pyspark` is installed and create the conda `activate.d` and `deactivate.d` directories:
RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}')
mkdir -p $RECO_ENV/etc/conda/activate.d
mkdir -p $RECO_ENV/etc/conda/deactivate.d
Then, create the file `$RECO_ENV/etc/conda/activate.d/env_vars.sh` and add:
@@ -107,8 +109,7 @@ Then, create the file `$RECO_ENV/etc/conda/activate.d/env_vars.sh` and add:
unset SPARK_HOME
This will export the variables every time we do `conda activate reco_pyspark`.
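For reference, a complete `activate.d/env_vars.sh` might look like the sketch below. Only the tail of the file (`unset SPARK_HOME`) appears in this diff, so the exported variables (`PYSPARK_PYTHON`, `PYSPARK_DRIVER_PYTHON`) and their values are assumptions, not the file's actual content.

```{shell}
#!/bin/sh
# Sketch of a possible activate.d/env_vars.sh -- the export lines below are
# assumptions; only `unset SPARK_HOME` is shown in the diff itself.
RECO_ENV=$(conda env list | grep reco_pyspark | awk '{print $NF}')
export PYSPARK_PYTHON=$RECO_ENV/bin/python
export PYSPARK_DRIVER_PYTHON=$RECO_ENV/bin/python
unset SPARK_HOME
```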
To unset these variables when we deactivate the environment,
create the file `$RECO_ENV/etc/conda/deactivate.d/env_vars.sh` and add:
To unset these variables when we deactivate the environment, create the file `$RECO_ENV/etc/conda/deactivate.d/env_vars.sh` and add:
#!/bin/sh
unset PYSPARK_PYTHON
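The diff truncates the deactivation script after `unset PYSPARK_PYTHON`. A minimal sketch of a complete `deactivate.d/env_vars.sh`, assuming it simply unsets whatever the activation script exports, would be:

```{shell}
#!/bin/sh
# Sketch only: assumes the script mirrors the exports made in activate.d/env_vars.sh
unset PYSPARK_PYTHON
unset PYSPARK_DRIVER_PYTHON
```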
@@ -180,6 +181,7 @@ If you are using the DSVM, you can [connect to JupyterHub](https://docs.microsof
### Troubleshooting for the DSVM
* We found that there can be problems if the Spark version installed on the machine is not the same as the one in the conda file. You can use the `--pyspark-version` option to address this issue (see the sketch after this list).
* When running Spark on a single local node it is possible to run out of disk space as temporary files are written to the user's home directory. To avoid this on a DSVM, we attached an additional disk to the DSVM and made modifications to the Spark configuration. This is done by including the following lines in the file at `/dsvm/tools/spark/current/conf/spark-env.sh`.
```{shell}
@@ -188,6 +190,8 @@ SPARK_WORKER_DIR="/mnt"
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=3600 -Dspark.worker.cleanup.interval=300 -Dspark.storage.cleanupFilesAfterExecutorExit=true"
```
* Another source of problems is when the variable `SPARK_HOME` is not set correctly. In the Azure DSVM, `SPARK_HOME` should be `/dsvm/tools/spark/current`.
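The version and `SPARK_HOME` checks above can be run from a shell. The sketch below is illustrative only: `spark-submit --version` and the `/dsvm/tools/spark/current` path follow from the notes above, while the `generate_conda_file.py` script name and the `2.4.5` version are assumptions not shown in this diff.

```{shell}
# Check the Spark version installed on the machine (assumes spark-submit is on the PATH)
spark-submit --version

# Regenerate the conda file with a matching PySpark version
# (the script name and the version number are illustrative assumptions)
python generate_conda_file.py --pyspark-version 2.4.5

# On the Azure DSVM, make sure SPARK_HOME points at the current Spark install
export SPARK_HOME=/dsvm/tools/spark/current
echo $SPARK_HOME
```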
## Setup guide for Azure Databricks
### Requirements of Azure Databricks