This commit is contained in:
msalvaris 2018-12-18 11:46:54 +00:00
Parent ab579367b1
Commit e01702943e
2 changed files: 0 additions and 1143 deletions

View File

@@ -1,709 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train Keras Model Distributed on Batch AI\n",
"In this notebook we will train a Keras model ([ResNet50](https://arxiv.org/abs/1512.03385)) in a distributed fashion using [Horovod](https://github.com/uber/horovod) on the Imagenet dataset. This tutorial will take you through the following steps:\n",
" * [Create Azure Resources](#azure_resources)\n",
" * [Create Fileserver(NFS)](#create_fileshare)\n",
" * [Configure Batch AI Cluster](#configure_cluster)\n",
" * [Submit and Monitor Job](#job)\n",
" * [Clean Up Resources](#clean_up)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append(\"../common\") \n",
"\n",
"from dotenv import dotenv_values, set_key, find_dotenv, get_key\n",
"from getpass import getpass\n",
"import os\n",
"import json\n",
"from utils import get_password, write_json_to_file, dotenv_for"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below are the variables that describe our experiment. By default we are using the NC24rs_v3 (Standard_NC24rs_v3) VMs which have V100 GPUs and Infiniband. By default we are using 2 nodes with each node having 4 GPUs, this equates to 8 GPUs. Feel free to increase the number of nodes but be aware what limitations your subscription may have.\n",
"\n",
"Set the USE_FAKE to True if you want to use fake data rather than the Imagenet dataset. This is often a good way to debug your models as well as checking what IO overhead is."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# Variables for Batch AI - change as necessary\n",
"ID = \"ddkeras\"\n",
"GROUP_NAME = f\"batch{ID}rg\"\n",
"STORAGE_ACCOUNT_NAME = f\"batch{ID}st\"\n",
"FILE_SHARE_NAME = f\"batch{ID}share\"\n",
"SELECTED_SUBSCRIPTION = \"Team Danielle Internal\" #\"<YOUR SUBSCRIPTION>\"\n",
"WORKSPACE = \"workspace\"\n",
"NUM_NODES = 2\n",
"CLUSTER_NAME = \"msv100\"\n",
"VM_SIZE = \"Standard_NC24rs_v3\"\n",
"GPU_TYPE = \"V100\"\n",
"PROCESSES_PER_NODE = 4\n",
"LOCATION = \"eastus\"\n",
"NFS_NAME = f\"batch{ID}nfs\"\n",
"EXPERIMENT = f\"distributed_keras_{GPU_TYPE}\"\n",
"USERNAME = \"batchai_user\"\n",
"USE_FAKE = False\n",
"DOCKERHUB = \"caia\" #\"<YOUR DOCKERHUB>\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"FAKE='-env FAKE=True' if USE_FAKE else ''\n",
"TOTAL_PROCESSES = PROCESSES_PER_NODE * NUM_NODES"
]
},
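  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The FAKE flag defined above, together with DISTRIBUTED, is passed to the training script as an environment variable via mpirun in the job definition below. The cell that follows is only a rough, hypothetical sketch of how a script might read these flags; the actual behaviour is implemented in src/imagenet_keras_horovod.py."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustration only: a minimal sketch of how a training script might read\n",
    "# the FAKE and DISTRIBUTED flags passed via -env in the job definition below.\n",
    "import os\n",
    "\n",
    "use_fake_data = os.getenv(\"FAKE\", \"False\").lower() in (\"true\", \"1\")\n",
    "distributed = os.getenv(\"DISTRIBUTED\", \"False\").lower() in (\"true\", \"1\")\n",
    "print(f\"Fake data: {use_fake_data}, distributed: {distributed}\")"
   ]
  },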
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='azure_resources'></a>\n",
"## Create Azure Resources\n",
"First we need to log in to our Azure account. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az login -o table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you have more than one Azure account you will need to select it with the command below. If you only have one account you can skip this step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az account set --subscription \"$SELECTED_SUBSCRIPTION\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az account list -o table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we create the group that will hold all our Azure resources."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az group create -n $GROUP_NAME -l $LOCATION -o table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will create the storage account that will store our fileshare where all the outputs from the jobs will be stored."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"json_data = !az storage account create -l $LOCATION -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME --sku Standard_LRS\n",
"print('Storage account {} provisioning state: {}'.format(STORAGE_ACCOUNT_NAME, \n",
" json.loads(''.join(json_data))['provisioningState']))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"json_data = !az storage account keys list -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME\n",
"storage_account_key = json.loads(''.join([i for i in json_data if 'WARNING' not in i]))[0]['value']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az storage share create --account-name $STORAGE_ACCOUNT_NAME \\\n",
"--account-key $storage_account_key --name $FILE_SHARE_NAME"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az storage directory create --share-name $FILE_SHARE_NAME --name scripts \\\n",
"--account-name $STORAGE_ACCOUNT_NAME --account-key $storage_account_key"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we are setting some defaults so we don't have to keep adding them to every command"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az configure --defaults location=$LOCATION\n",
"!az configure --defaults group=$GROUP_NAME"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%env AZURE_STORAGE_ACCOUNT $STORAGE_ACCOUNT_NAME\n",
"%env AZURE_STORAGE_KEY=$storage_account_key"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create Workspace\n",
"Batch AI has the concept of workspaces and experiments. Below we will create the workspace for our work."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai workspace create -n $WORKSPACE -g $GROUP_NAME"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='create_fileshare'></a>\n",
"## Create Fileserver\n",
"In this example we will store the data on an NFS fileshare. It is possible to use many storage solutions with Batch AI. NFS offers the best traideoff between performance and ease of use. The best performance is achieved by loading the data locally but this can be cumbersome since it requires that the data is download by the all the nodes which with the imagenet dataset can take hours. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai file-server create -n $NFS_NAME --disk-count 4 --disk-size 250 -w $WORKSPACE \\\n",
"-s Standard_DS4_v2 -u $USERNAME -p {get_password(dotenv_for())} -g $GROUP_NAME --storage-sku Premium_LRS"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai file-server list -o table -w $WORKSPACE -g $GROUP_NAME"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"json_data = !az batchai file-server list -w $WORKSPACE -g $GROUP_NAME\n",
"nfs_ip=json.loads(''.join([i for i in json_data if 'WARNING' not in i]))[0]['mountSettings']['fileServerPublicIp']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we have created the NFS share we need to copy the data to it. To do this we write the script below which will be executed on the fileserver. It installs a tool called azcopy and then downloads and extracts the data to the appropriate directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile nodeprep.sh\n",
"#!/usr/bin/env bash\n",
"wget https://gist.githubusercontent.com/msalvaris/073c28a9993d58498957294d20d74202/raw/87a78275879f7c9bb8d6fb9de8a2d2996bb66c24/install_azcopy\n",
"chmod 777 install_azcopy\n",
"sudo ./install_azcopy\n",
"\n",
"mkdir -p /data/imagenet\n",
"azcopy --source https://datasharesa.blob.core.windows.net/imagenet/validation.csv \\\n",
" --destination /data/imagenet/validation.csv\\\n",
" --source-sas \"?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=7x3rN7c/nlXbnZ0gAFywd5Er3r6MdwCq97Vwvda25WE%3D\"\\\n",
" --quiet\n",
"\n",
"azcopy --source https://datasharesa.blob.core.windows.net/imagenet/validation.tar.gz \\\n",
" --destination /data/imagenet/validation.tar.gz\\\n",
" --source-sas \"?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=zy8L4shZa3XXBe152hPnhXsyfBqCufDOz01a9ZHWU28%3D\"\\\n",
" --quiet\n",
"\n",
"azcopy --source https://datasharesa.blob.core.windows.net/imagenet/train.csv \\\n",
" --destination /data/imagenet/train.csv\\\n",
" --source-sas \"?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=EUcahDDZcefOKtHoVWDh7voAC1BoxYNM512spFmjmDU%3D\"\\\n",
" --quiet\n",
"\n",
"azcopy --source https://datasharesa.blob.core.windows.net/imagenet/train.tar.gz \\\n",
" --destination /data/imagenet/train.tar.gz\\\n",
" --source-sas \"?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=qP%2B7lQuFKHo5UhQKpHcKt6p5fHT21lPaLz1O/vv4FNU%3D\"\\\n",
" --quiet\n",
"\n",
"cd /data/imagenet\n",
"tar -xzf train.tar.gz\n",
"tar -xzf validation.tar.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we will copy the file over and run it on the NFS VM. This will install azcopy and download and prepare the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!sshpass -p {get_password(dotenv_for())} scp -o \"StrictHostKeyChecking=no\" nodeprep.sh $USERNAME@{nfs_ip}:~/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!sshpass -p {get_password(dotenv_for())} ssh -o \"StrictHostKeyChecking=no\" $USERNAME@{nfs_ip} \"sudo chmod 777 ~/nodeprep.sh && ./nodeprep.sh\""
]
},
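  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, we can run a quick sanity check over SSH to confirm that the data was extracted under /data/imagenet on the fileserver (this reuses the same sshpass pattern as above)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!sshpass -p {get_password(dotenv_for())} ssh -o \"StrictHostKeyChecking=no\" $USERNAME@{nfs_ip} \"ls -lh /data/imagenet | head\""
   ]
  },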
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we create our experiment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai experiment create -n $EXPERIMENT -g $GROUP_NAME -w $WORKSPACE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='configure_cluster'></a>\n",
"## Configure Batch AI Cluster\n",
"We then upload the scripts we wish to execute onto the fileshare. The fileshare will later be mounted by Batch AI. An alternative to uploading the scripts would be to embedd them inside the Docker container."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az storage file upload --share-name $FILE_SHARE_NAME --source src/imagenet_keras_horovod.py --path scripts\n",
"!az storage file upload --share-name $FILE_SHARE_NAME --source src/data_generator.py --path scripts\n",
"!az storage file upload --share-name $FILE_SHARE_NAME --source ../common/timer.py --path scripts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below it the command to create the cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai cluster create \\\n",
" -w $WORKSPACE \\\n",
" --name $CLUSTER_NAME \\\n",
" --image UbuntuLTS \\\n",
" --vm-size $VM_SIZE \\\n",
" --min $NUM_NODES --max $NUM_NODES \\\n",
" --afs-name $FILE_SHARE_NAME \\\n",
" --afs-mount-path extfs \\\n",
" --user-name $USERNAME \\\n",
" --password {get_password(dotenv_for())} \\\n",
" --storage-account-name $STORAGE_ACCOUNT_NAME \\\n",
" --storage-account-key $storage_account_key \\\n",
" --nfs $NFS_NAME \\\n",
" --nfs-mount-path nfs "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check that the cluster was created succesfully."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai cluster show -n $CLUSTER_NAME -w $WORKSPACE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai cluster list -w $WORKSPACE -o table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai cluster node list -c $CLUSTER_NAME -w $WORKSPACE -o table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='job'></a>\n",
"## Submit and Monitor Job\n",
"Below we specify the job we wish to execute. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"jobs_dict = {\n",
" \"$schema\": \"https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2017-09-01-preview/job.json\",\n",
" \"properties\": {\n",
" \"nodeCount\": NUM_NODES,\n",
" \"customToolkitSettings\": {\n",
" \"commandLine\": f\"source /opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/mpivars.sh; \\\n",
" echo $AZ_BATCH_HOST_LIST; \\\n",
" mpirun -n {TOTAL_PROCESSES} -ppn {PROCESSES_PER_NODE} -hosts $AZ_BATCH_HOST_LIST \\\n",
" -env I_MPI_FABRICS=dapl \\\n",
" -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 \\\n",
" -env I_MPI_DYNAMIC_CONNECTION=0 \\\n",
" -env I_MPI_DEBUG=6 \\\n",
" -env I_MPI_HYDRA_DEBUG=on \\\n",
" -env DISTRIBUTED=True \\\n",
" {FAKE} \\\n",
" python -u $AZ_BATCHAI_INPUT_SCRIPTS/imagenet_keras_horovod.py\"\n",
" },\n",
" \"stdOutErrPathPrefix\": \"$AZ_BATCHAI_MOUNT_ROOT/extfs\",\n",
" \"inputDirectories\": [{\n",
" \"id\": \"SCRIPTS\",\n",
" \"path\": \"$AZ_BATCHAI_MOUNT_ROOT/extfs/scripts\"\n",
" },\n",
" {\n",
" \"id\": \"TRAIN\",\n",
" \"path\": \"$AZ_BATCHAI_MOUNT_ROOT/nfs/imagenet\",\n",
" },\n",
" {\n",
" \"id\": \"TEST\",\n",
" \"path\": \"$AZ_BATCHAI_MOUNT_ROOT/nfs/imagenet\",\n",
" },\n",
" ],\n",
" \"outputDirectories\": [{\n",
" \"id\": \"MODEL\",\n",
" \"pathPrefix\": \"$AZ_BATCHAI_MOUNT_ROOT/extfs\",\n",
" \"pathSuffix\": \"Models\"\n",
" }],\n",
" \"containerSettings\": {\n",
" \"imageSourceRegistry\": {\n",
" \"image\": f\"{DOCKERHUB}/distributed-training.horovod-keras\"\n",
" }\n",
" }\n",
" }\n",
"}"
]
},
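  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The commandLine above launches imagenet_keras_horovod.py under MPI. The cell below is only a simplified, hypothetical sketch of the Horovod pattern such a script typically follows (initialise Horovod, pin one GPU per process, wrap the optimizer, broadcast the initial weights from rank 0); it is not the script that actually runs, and the optimizer and learning rate are illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustration only: the typical Horovod + Keras pattern that a script such as\n",
    "# imagenet_keras_horovod.py would follow when launched by mpirun.\n",
    "import keras\n",
    "import tensorflow as tf\n",
    "import horovod.keras as hvd\n",
    "from keras import backend as K\n",
    "\n",
    "hvd.init()  # one process per GPU, started by mpirun\n",
    "\n",
    "# Pin each process to a single GPU based on its local rank\n",
    "config = tf.ConfigProto()\n",
    "config.gpu_options.visible_device_list = str(hvd.local_rank())\n",
    "K.set_session(tf.Session(config=config))\n",
    "\n",
    "# Scale the learning rate by the number of workers and wrap the optimizer\n",
    "opt = hvd.DistributedOptimizer(keras.optimizers.SGD(lr=0.1 * hvd.size()))\n",
    "\n",
    "# Broadcast the initial weights from rank 0 so all workers start in sync\n",
    "callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]"
   ]
  },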
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"write_json_to_file(jobs_dict, 'job.json')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"JOB_NAME='keras-horovod-{}'.format(NUM_NODES*PROCESSES_PER_NODE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now submit the job to Batch AI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!az batchai job create -n $JOB_NAME --cluster $CLUSTER_NAME -w $WORKSPACE -e $EXPERIMENT -f job.json"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the command below we can check the status of the job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai job list -w $WORKSPACE -e $EXPERIMENT -o table"
]
},
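  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we prefer to wait until the job finishes, we can poll its state with a small loop like the one below. This is a convenience sketch only: it assumes the JSON returned by az batchai job show exposes an executionState field with values such as queued, running, succeeded and failed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sketch: poll the job until it reaches a terminal state.\n",
    "# Assumes the job JSON contains an 'executionState' field.\n",
    "import json\n",
    "import subprocess\n",
    "import time\n",
    "\n",
    "def job_state():\n",
    "    cmd = (f\"az batchai job show -n {JOB_NAME} -e {EXPERIMENT} \"\n",
    "           f\"-w {WORKSPACE} -g {GROUP_NAME} -o json\")\n",
    "    out = subprocess.run(cmd.split(), stdout=subprocess.PIPE, universal_newlines=True).stdout\n",
    "    return json.loads(out).get('executionState')\n",
    "\n",
    "state = job_state()\n",
    "while state not in ('succeeded', 'failed'):\n",
    "    print('Job state:', state)\n",
    "    time.sleep(60)\n",
    "    state = job_state()\n",
    "print('Final state:', state)"
   ]
  },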
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To view the files that the job has generated use the command below"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai job file list -w $WORKSPACE -e $EXPERIMENT --j $JOB_NAME --output-directory-id stdouterr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are also able to stream the stdout and stderr that our job produces. This is great to check the progress of our job as well as debug issues."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT --j $JOB_NAME --output-directory-id stdouterr -f stdout.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT --j $JOB_NAME --output-directory-id stdouterr -f stderr.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can either wait for the job to complete or delete it with the command below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai job delete -w $WORKSPACE -e $EXPERIMENT --name $JOB_NAME -y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='clean_up'></a>\n",
"## Clean Up Resources\n",
"Next we wish to tidy up the resource we created. \n",
"First we reset the default values we set earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az configure --defaults group=''\n",
"!az configure --defaults location=''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Next we delete the cluster"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai cluster delete -w $WORKSPACE --name $CLUSTER_NAME -g $GROUP_NAME -y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the cluster is deleted you will not incur any cost for the computation but you can still retain your experiments and workspace. If you wish to delete those as well execute the commands below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai experiment delete -w $WORKSPACE --name $EXPERIMENT -g $GROUP_NAME -y"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az batchai workspace delete -n $WORKSPACE -g $GROUP_NAME -y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally we can delete the group and we will have deleted everything created for this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!az group delete --name $GROUP_NAME -y"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@@ -1,434 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# Train Tensorflow Model Distributed on Batch AI\nIn this notebook we will train a TensorFlow model ([ResNet50](https://arxiv.org/abs/1512.03385)) in a distributed fashion using [Horovod](https://github.com/uber/horovod) on the Imagenet dataset. This tutorial will take you through the following steps:\n * [Create Azure Resources](#azure_resources)\n * [Create Fileserver(NFS)](#create_fileshare)\n * [Configure Batch AI Cluster](#configure_cluster)\n * [Submit and Monitor Job](#job)\n * [Clean Up Resources](#clean_up)"
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": "import sys\nsys.path.append(\"../common\") \n\nfrom dotenv import dotenv_values, set_key, find_dotenv, get_key\nfrom getpass import getpass\nimport os\nimport json\nfrom utils import get_password, write_json_to_file, dotenv_for"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Below are the variables that describe our experiment. By default we are using the NC24rs_v3 (Standard_NC24rs_v3) VMs which have V100 GPUs and Infiniband. By default we are using 2 nodes with each node having 4 GPUs, this equates to 8 GPUs. Feel free to increase the number of nodes but be aware what limitations your subscription may have.\n\nSet the USE_FAKE to True if you want to use fake data rather than the Imagenet dataset. This is often a good way to debug your models as well as checking what IO overhead is."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": "# Variables for Batch AI - change as necessary\nID = \"ddtf2\"\nGROUP_NAME = f\"batch{ID}rg\"\nSTORAGE_ACCOUNT_NAME = f\"batch{ID}st\"\nFILE_SHARE_NAME = f\"batch{ID}share\"\nSELECTED_SUBSCRIPTION = \"<YOUR_SUBSCRIPTION>\" #\"<YOUR SUBSCRIPTION>\"\nWORKSPACE = \"workspace\"\nNUM_NODES = 2\nCLUSTER_NAME = \"msv100\"\nVM_SIZE = \"Standard_NC24rs_v3\"\nGPU_TYPE = \"V100\"\nPROCESSES_PER_NODE = 4\nLOCATION = \"eastus\"\nNFS_NAME = f\"batch{ID}nfs\"\nEXPERIMENT = f\"distributed_tensorflow_{GPU_TYPE}\"\nUSERNAME = \"batchai_user\"\nUSE_FAKE = False\nDOCKERHUB = \"caia\" #\"<YOUR DOCKERHUB>\""
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "FAKE='-env FAKE=True' if USE_FAKE else ''\nTOTAL_PROCESSES = PROCESSES_PER_NODE * NUM_NODES"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<a id='azure_resources'></a>\n## Create Azure Resources\nFirst we need to log in to our Azure account. "
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az login -o table"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "If you have more than one Azure account you will need to select it with the command below. If you only have one account you can skip this step."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az account set --subscription \"$SELECTED_SUBSCRIPTION\""
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az account list -o table"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next we create the group that will hold all our Azure resources."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az group create -n $GROUP_NAME -l $LOCATION -o table"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We will create the storage account that will store our fileshare where all the outputs from the jobs will be stored."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "json_data = !az storage account create -l $LOCATION -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME --sku Standard_LRS\nprint('Storage account {} provisioning state: {}'.format(STORAGE_ACCOUNT_NAME, \n json.loads(''.join(json_data))['provisioningState']))"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "json_data = !az storage account keys list -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME\nstorage_account_key = json.loads(''.join([i for i in json_data if 'WARNING' not in i]))[0]['value']"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az storage share create --account-name $STORAGE_ACCOUNT_NAME \\\n--account-key $storage_account_key --name $FILE_SHARE_NAME"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az storage directory create --share-name $FILE_SHARE_NAME --name scripts \\\n--account-name $STORAGE_ACCOUNT_NAME --account-key $storage_account_key"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Here we are setting some defaults so we don't have to keep adding them to every command"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az configure --defaults location=$LOCATION\n!az configure --defaults group=$GROUP_NAME"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "%env AZURE_STORAGE_ACCOUNT $STORAGE_ACCOUNT_NAME\n%env AZURE_STORAGE_KEY=$storage_account_key"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "#### Create Workspace\nBatch AI has the concept of workspaces and experiments. Below we will create the workspace for our work."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai workspace create -n $WORKSPACE -g $GROUP_NAME"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<a id='create_fileshare'></a>\n## Create Fileserver\nIn this example we will store the data on an NFS fileshare. It is possible to use many storage solutions with Batch AI. NFS offers the best tradeoff between performance and ease of use. The best performance is achieved by loading the data locally but this can be cumbersome since it requires that the data is download by the all the nodes which with the imagenet dataset can take hours. "
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai file-server create -n $NFS_NAME --disk-count 4 --disk-size 250 -w $WORKSPACE \\\n-s Standard_DS4_v2 -u $USERNAME -p {get_password(dotenv_for())} -g $GROUP_NAME --storage-sku Premium_LRS"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai file-server list -o table -w $WORKSPACE -g $GROUP_NAME"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "json_data = !az batchai file-server list -w $WORKSPACE -g $GROUP_NAME\nnfs_ip=json.loads(''.join([i for i in json_data if 'WARNING' not in i]))[0]['mountSettings']['fileServerPublicIp']"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "After we have created the NFS share we need to copy the data to it. To do this we write the script below which will be executed on the fileserver. It installs a tool called azcopy and then downloads and extracts the data to the appropriate directory."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "%%writefile nodeprep.sh\n#!/usr/bin/env bash\nwget https://gist.githubusercontent.com/msalvaris/073c28a9993d58498957294d20d74202/raw/87a78275879f7c9bb8d6fb9de8a2d2996bb66c24/install_azcopy\nchmod 777 install_azcopy\nsudo ./install_azcopy\n\nmkdir -p /data/imagenet\nazcopy --source https://datasharesa.blob.core.windows.net/imagenet/validation.csv \\\n --destination /data/imagenet/validation.csv\\\n --source-sas \"?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=7x3rN7c/nlXbnZ0gAFywd5Er3r6MdwCq97Vwvda25WE%3D\"\\\n --quiet\n\nazcopy --source https://datasharesa.blob.core.windows.net/imagenet/validation.tar.gz \\\n --destination /data/imagenet/validation.tar.gz\\\n --source-sas \"?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=zy8L4shZa3XXBe152hPnhXsyfBqCufDOz01a9ZHWU28%3D\"\\\n --quiet\n\nazcopy --source https://datasharesa.blob.core.windows.net/imagenet/train.csv \\\n --destination /data/imagenet/train.csv\\\n --source-sas \"?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=EUcahDDZcefOKtHoVWDh7voAC1BoxYNM512spFmjmDU%3D\"\\\n --quiet\n\nazcopy --source https://datasharesa.blob.core.windows.net/imagenet/train.tar.gz \\\n --destination /data/imagenet/train.tar.gz\\\n --source-sas \"?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=qP%2B7lQuFKHo5UhQKpHcKt6p5fHT21lPaLz1O/vv4FNU%3D\"\\\n --quiet\n\ncd /data/imagenet\ntar -xzf train.tar.gz\ntar -xzf validation.tar.gz"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Next we will copy the file over and run it on the NFS VM. This will install azcopy and download and prepare the data"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!sshpass -p {get_password(dotenv_for())} scp -o \"StrictHostKeyChecking=no\" nodeprep.sh $USERNAME@{nfs_ip}:~/"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!sshpass -p {get_password(dotenv_for())} ssh -o \"StrictHostKeyChecking=no\" $USERNAME@{nfs_ip} \"sudo chmod 777 ~/nodeprep.sh && ./nodeprep.sh\""
},
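  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "Optionally, we can run a quick sanity check over SSH to confirm that the data was extracted under /data/imagenet on the fileserver (this reuses the same sshpass pattern as above)."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "!sshpass -p {get_password(dotenv_for())} ssh -o \"StrictHostKeyChecking=no\" $USERNAME@{nfs_ip} \"ls -lh /data/imagenet | head\""
  },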
{
"cell_type": "markdown",
"metadata": {},
"source": "Next we create our experiment."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai experiment create -n $EXPERIMENT -g $GROUP_NAME -w $WORKSPACE"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<a id='configure_cluster'></a>\n## Configure Batch AI Cluster\nWe then upload the scripts we wish to execute onto the fileshare. The fileshare will later be mounted by Batch AI. An alternative to uploading the scripts would be to embedd them inside the Docker container."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az storage file upload --share-name $FILE_SHARE_NAME --source src/imagenet_estimator_tf_horovod.py --path scripts\n!az storage file upload --share-name $FILE_SHARE_NAME --source src/resnet_model.py --path scripts\n!az storage file upload --share-name $FILE_SHARE_NAME --source ../common/timer.py --path scripts"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Below is the command to create the cluster."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai cluster create \\\n -w $WORKSPACE \\\n --name $CLUSTER_NAME \\\n --image UbuntuLTS \\\n --vm-size $VM_SIZE \\\n --min $NUM_NODES --max $NUM_NODES \\\n --afs-name $FILE_SHARE_NAME \\\n --afs-mount-path extfs \\\n --user-name $USERNAME \\\n --password {get_password(dotenv_for())} \\\n --storage-account-name $STORAGE_ACCOUNT_NAME \\\n --storage-account-key $storage_account_key \\\n --nfs $NFS_NAME \\\n --nfs-mount-path nfs "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Let's check that the cluster was created succesfully."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai cluster show -n $CLUSTER_NAME -w $WORKSPACE"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai cluster list -w $WORKSPACE -o table"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai cluster node list -c $CLUSTER_NAME -w $WORKSPACE -o table"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<a id='job'></a>\n## Submit and Monitor Job\nBelow we specify the job we wish to execute. "
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "jobs_dict = {\n \"$schema\": \"https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2017-09-01-preview/job.json\",\n \"properties\": {\n \"nodeCount\": NUM_NODES,\n \"customToolkitSettings\": {\n \"commandLine\": f\"source /opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/mpivars.sh; \\\n echo $AZ_BATCH_HOST_LIST; \\\n mpirun -n {TOTAL_PROCESSES} -ppn {PROCESSES_PER_NODE} -hosts $AZ_BATCH_HOST_LIST \\\n -env I_MPI_FABRICS=dapl \\\n -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 \\\n -env I_MPI_DYNAMIC_CONNECTION=0 \\\n -env I_MPI_DEBUG=6 \\\n -env I_MPI_HYDRA_DEBUG=on \\\n -env DISTRIBUTED=True \\\n {FAKE} \\\n python -u $AZ_BATCHAI_INPUT_SCRIPTS/imagenet_estimator_tf_horovod.py\"\n },\n \"stdOutErrPathPrefix\": \"$AZ_BATCHAI_MOUNT_ROOT/extfs\",\n \"inputDirectories\": [{\n \"id\": \"SCRIPTS\",\n \"path\": \"$AZ_BATCHAI_MOUNT_ROOT/extfs/scripts\"\n },\n {\n \"id\": \"TRAIN\",\n \"path\": \"$AZ_BATCHAI_MOUNT_ROOT/nfs/imagenet\",\n },\n {\n \"id\": \"TEST\",\n \"path\": \"$AZ_BATCHAI_MOUNT_ROOT/nfs/imagenet\",\n },\n ],\n \"outputDirectories\": [{\n \"id\": \"MODEL\",\n \"pathPrefix\": \"$AZ_BATCHAI_MOUNT_ROOT/extfs\",\n \"pathSuffix\": \"Models\"\n }],\n \"containerSettings\": {\n \"imageSourceRegistry\": {\n \"image\": f\"{DOCKERHUB}/distributed-training.horovod-tf\"\n }\n }\n }\n}"
},
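  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "The commandLine above launches imagenet_estimator_tf_horovod.py under MPI. The cell below is only a simplified, hypothetical sketch of the Horovod pattern such a script typically follows (initialise Horovod, pin one GPU per process, wrap the optimizer, broadcast the initial variables from rank 0); it is not the script that actually runs, and the optimizer and learning rate are illustrative."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Illustration only: the typical Horovod + TensorFlow pattern that a script such as\n# imagenet_estimator_tf_horovod.py would follow when launched by mpirun.\nimport tensorflow as tf\nimport horovod.tensorflow as hvd\n\nhvd.init()  # one process per GPU, started by mpirun\n\n# Pin each process to a single GPU based on its local rank\nconfig = tf.ConfigProto()\nconfig.gpu_options.visible_device_list = str(hvd.local_rank())\n\n# Scale the learning rate by the number of workers and wrap the optimizer\nopt = hvd.DistributedOptimizer(tf.train.MomentumOptimizer(0.1 * hvd.size(), momentum=0.9))\n\n# Broadcast the initial variables from rank 0 so all workers start in sync\nhooks = [hvd.BroadcastGlobalVariablesHook(0)]"
  },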
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "write_json_to_file(jobs_dict, 'job.json')"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "JOB_NAME='tf-horovod-{}'.format(NUM_NODES*PROCESSES_PER_NODE)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We now submit the job to Batch AI"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai job create -n $JOB_NAME --cluster $CLUSTER_NAME -w $WORKSPACE -e $EXPERIMENT -f job.json"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "With the command below we can check the status of the job"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai job list -w $WORKSPACE -e $EXPERIMENT -o table"
},
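  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "If we prefer to wait until the job finishes, we can poll its state with a small loop like the one below. This is a convenience sketch only: it assumes the JSON returned by az batchai job show exposes an executionState field with values such as queued, running, succeeded and failed."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Optional sketch: poll the job until it reaches a terminal state.\n# Assumes the job JSON contains an 'executionState' field.\nimport json\nimport subprocess\nimport time\n\ndef job_state():\n    cmd = (f\"az batchai job show -n {JOB_NAME} -e {EXPERIMENT} \"\n           f\"-w {WORKSPACE} -g {GROUP_NAME} -o json\")\n    out = subprocess.run(cmd.split(), stdout=subprocess.PIPE, universal_newlines=True).stdout\n    return json.loads(out).get('executionState')\n\nstate = job_state()\nwhile state not in ('succeeded', 'failed'):\n    print('Job state:', state)\n    time.sleep(60)\n    state = job_state()\nprint('Final state:', state)"
  },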
{
"cell_type": "markdown",
"metadata": {},
"source": "To view the files that the job has generated use the command below"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai job file list -w $WORKSPACE -e $EXPERIMENT --j $JOB_NAME --output-directory-id stdouterr"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We are also able to stream the stdout and stderr that our job produces. This is great to check the progress of our job as well as debug issues."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT --j $JOB_NAME --output-directory-id stdouterr -f stdout.txt"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT --j $JOB_NAME --output-directory-id stdouterr -f stderr.txt"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can either wait for the job to complete or delete it with the command below."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai job delete -w $WORKSPACE -e $EXPERIMENT --name $JOB_NAME -y"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "<a id='clean_up'></a>\n## Clean Up Resources\nNext we wish to tidy up the resource we created. \nFirst we reset the default values we set earlier."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az configure --defaults group=''\n!az configure --defaults location=''"
},
{
"cell_type": "markdown",
"metadata": {},
"source": " Next we delete the cluster"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai cluster delete -w $WORKSPACE --name $CLUSTER_NAME -g $GROUP_NAME -y"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Once the cluster is deleted you will not incur any cost for the computation but you can still retain your experiments and workspace. If you wish to delete those as well execute the commands below."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai experiment delete -w $WORKSPACE --name $EXPERIMENT -g $GROUP_NAME -y"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az batchai workspace delete -n $WORKSPACE -g $GROUP_NAME -y"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Finally we can delete the group and we will have deleted everything created for this tutorial."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "!az group delete --name $GROUP_NAME -y"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}