deep_bait/ExploringBatchAI.ipynb


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Batch AI\n",
"In this notebook we will go through the steps of setting up the cluster, executing the notebooks, and pulling the executed notebooks to the local machine."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have defined a setup script called setup_bait.py. Here we simply execute it, which also brings all of its variables and methods into the notebook namespace. You can also use the setup script inside an IPython environment: simply execute anaconda-project run ipython-bait."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%run setup_bait.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we set up the cluster and wait for the VMs to be allocated."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"setup_cluster()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: resizing Target: 10; Allocated: 0; Idle: 0; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 0; Unusable: 0; Running: 0; Preparing: 10; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 2; Unusable: 0; Running: 0; Preparing: 8; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 2; Unusable: 0; Running: 0; Preparing: 8; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 2; Unusable: 0; Running: 0; Preparing: 8; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 2; Unusable: 0; Running: 0; Preparing: 8; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 2; Unusable: 0; Running: 0; Preparing: 8; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 2; Unusable: 0; Running: 0; Preparing: 8; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 4; Unusable: 0; Running: 0; Preparing: 6; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 4; Unusable: 0; Running: 0; Preparing: 6; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 4; Unusable: 0; Running: 0; Preparing: 6; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 4; Unusable: 0; Running: 0; Preparing: 6; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 4; Unusable: 0; Running: 0; Preparing: 6; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 4; Unusable: 0; Running: 0; Preparing: 6; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 4; Unusable: 0; Running: 0; Preparing: 6; Leaving: 0\n",
"Cluster state: steady Target: 10; Allocated: 10; Idle: 10; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n"
]
}
],
"source": [
"wait_for_cluster()"
]
},
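{
"cell_type": "markdown",
"metadata": {},
"source": [
"The status lines printed by wait_for_cluster above have a regular shape, so they are easy to inspect programmatically. The cell below is a minimal sketch and is not part of setup_bait.py: parse_status_line and cluster_ready are hypothetical helpers that assume the exact line format shown in the output above, and they treat the cluster as ready once it is steady and every targeted node is idle."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helpers, not defined in setup_bait.py.\n",
"# They assume the 'Cluster state: ...' line format printed by wait_for_cluster.\n",
"def parse_status_line(line):\n",
"    # Split the line into the state name and the 'Name: count' fields.\n",
"    head, _, rest = line.partition(' Target:')\n",
"    state = head.replace('Cluster state:', '').strip()\n",
"    counts = {}\n",
"    for field in ('Target:' + rest).split(';'):\n",
"        name, _, value = field.partition(':')\n",
"        counts[name.strip().lower()] = int(value)\n",
"    return state, counts\n",
"\n",
"\n",
"def cluster_ready(line):\n",
"    # Assumed readiness rule: steady state and all targeted nodes idle.\n",
"    state, counts = parse_status_line(line)\n",
"    return state == 'steady' and counts['idle'] == counts['target']\n",
"\n",
"\n",
"# Last status line from the output above.\n",
"last_line = 'Cluster state: steady Target: 10; Allocated: 10; Idle: 10; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0'\n",
"print(parse_status_line(last_line))\n",
"print(cluster_ready(last_line))"
]
},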
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we print the list of clusters. For each cluster we can see many details, including its name, location, VM size and image, node counts, and mounted storage volumes."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'allocation_state': 'steady',\n",
" 'allocation_state_transition_time': '2018-05-03T17:46:52.418Z',\n",
" 'creation_time': '2018-05-03T17:44:11.653Z',\n",
" 'current_node_count': 4,\n",
" 'id': '/subscriptions/10d0b7c6-9243-4713-91a9-2730375d3a1b/resourceGroups/build18/providers/Microsoft.BatchAI/clusters/dsvmcluster',\n",
" 'location': 'westus2',\n",
" 'name': 'dsvmcluster',\n",
" 'node_setup': {'mount_volumes': {'azure_blob_file_systems': [{'account_name': 'sensordemoblob',\n",
" 'container_name': 'models',\n",
" 'credentials': {},\n",
" 'relative_mount_path': 'models'},\n",
" {'account_name': 'sensordemoblob',\n",
" 'container_name': 'scripts',\n",
" 'credentials': {},\n",
" 'relative_mount_path': 'scripts'},\n",
" {'account_name': 'sensordemoblob',\n",
" 'container_name': 'predictions',\n",
" 'credentials': {},\n",
" 'relative_mount_path': 'predictions'}]}},\n",
" 'node_state_counts': {'idle_node_count': 4,\n",
" 'leaving_node_count': 0,\n",
" 'preparing_node_count': 0,\n",
" 'running_node_count': 0,\n",
" 'unusable_node_count': 0},\n",
" 'provisioning_state': 'succeeded',\n",
" 'provisioning_state_transition_time': '2018-05-03T17:44:13.649Z',\n",
" 'scale_settings': {'manual': {'node_deallocation_option': 'requeue',\n",
" 'target_node_count': 4}},\n",
" 'type': 'Microsoft.BatchAI/Clusters',\n",
" 'user_account_settings': {'admin_user_name': 'alex',\n",
" 'admin_user_ssh_public_key': 'ssh-rsa '\n",
" 'AAAAB3NzaC1yc2EAAAABJQAAAQEAjREhGswttyUADx4nZ1eq0Vw9+NEua0ONgyNw7Pbk8w/bVj7pPoMyQyLG6CfpvfaL9nAANbz08fkXUkED6O/kTfnHQnh4LrGSMbfz2chaubz29HEa46UIezFJQ2mw4VF+Wi84sPDg9XJo2cLHdj29PoRyJlM1CuLnYoKb46DvvsBodf8QcH90GSqd6QoQVEnCCT5LfLx6ihzN1pyMvxZxPdsvZlfiKlMFPPwfTD0QTGp7YacMIjBc8ih1j4Xeezm6R6jevyS3OGxAm6ksHUtSA8TRHJhB+RKAgsWwCu799OA0RJNmSpxwQO99ac5q6hsIUB0mPlsa5Uv7a03HTN6ayQ== '\n",
" 'rsa-key-20180502'},\n",
" 'virtual_machine_configuration': {'image_reference': {'offer': 'UbuntuServer',\n",
" 'publisher': 'Canonical',\n",
" 'sku': '16.04-LTS',\n",
" 'version': 'latest'}},\n",
" 'vm_priority': 'dedicated',\n",
" 'vm_size': 'STANDARD_NC6'},\n",
" {'allocation_state': 'steady',\n",
" 'allocation_state_transition_time': '2018-05-04T03:20:41.635Z',\n",
" 'creation_time': '2018-05-04T03:18:10.532Z',\n",
" 'current_node_count': 2,\n",
" 'id': '/subscriptions/10d0b7c6-9243-4713-91a9-2730375d3a1b/resourceGroups/sensordemo/providers/Microsoft.BatchAI/clusters/sensordemocluster',\n",
" 'location': 'westus2',\n",
" 'name': 'sensordemocluster',\n",
" 'node_setup': {'mount_volumes': {'azure_blob_file_systems': [{'account_name': 'sensordemoblob',\n",
" 'container_name': 'scripts',\n",
" 'credentials': {},\n",
" 'relative_mount_path': 'bfs'}]}},\n",
" 'node_state_counts': {'idle_node_count': 2,\n",
" 'leaving_node_count': 0,\n",
" 'preparing_node_count': 0,\n",
" 'running_node_count': 0,\n",
" 'unusable_node_count': 0},\n",
" 'provisioning_state': 'succeeded',\n",
" 'provisioning_state_transition_time': '2018-05-04T03:18:11.438Z',\n",
" 'scale_settings': {'manual': {'node_deallocation_option': 'requeue',\n",
" 'target_node_count': 2}},\n",
" 'type': 'Microsoft.BatchAI/Clusters',\n",
" 'user_account_settings': {'admin_user_name': 'alex'},\n",
" 'virtual_machine_configuration': {'image_reference': {'offer': 'linux-data-science-vm-ubuntu',\n",
" 'publisher': 'microsoft-ads',\n",
" 'sku': 'linuxdsvmubuntu',\n",
" 'version': 'latest'}},\n",
" 'vm_priority': 'dedicated',\n",
" 'vm_size': 'STANDARD_NC6'},\n",
" {'allocation_state': 'steady',\n",
" 'allocation_state_transition_time': '2018-05-04T06:55:54.982Z',\n",
" 'creation_time': '2018-05-04T06:52:56.621Z',\n",
" 'current_node_count': 1,\n",
" 'id': '/subscriptions/10d0b7c6-9243-4713-91a9-2730375d3a1b/resourceGroups/pixeldemorg/providers/Microsoft.BatchAI/clusters/pixeldemo',\n",
" 'location': 'eastus2',\n",
" 'name': 'pixeldemo',\n",
" 'node_setup': {'mount_volumes': {'azure_blob_file_systems': [{'account_name': 'pixelblob',\n",
" 'container_name': 'blobfuse',\n",
" 'credentials': {},\n",
" 'relative_mount_path': 'blobfuse'}],\n",
" 'azure_file_shares': [{'account_name': 'pixelblob',\n",
" 'azure_file_url': 'https://pixelblob.file.core.windows.net/batchai',\n",
" 'credentials': {},\n",
" 'directory_mode': '0777',\n",
" 'file_mode': '0777',\n",
" 'relative_mount_path': 'afs'}]},\n",
" 'setup_task': {'command_line': 'sudo '\n",
" '/anaconda/envs/py35/bin/pip '\n",
" 'install tifffile pillow; sudo '\n",
" \"sh -c 'source \"\n",
" '/anaconda/bin/activate py35 '\n",
" \"&& conda install gdal -y'; \"\n",
" \"sudo sh -c 'source \"\n",
" '/anaconda/bin/activate py35 '\n",
" '&& conda install -c '\n",
" \"conda-forge basemap -y'\",\n",
" 'run_elevated': False,\n",
" 'std_out_err_path_prefix': '$AZ_BATCHAI_MOUNT_ROOT/afs',\n",
" 'std_out_err_path_suffix': '10d0b7c6-9243-4713-91a9-2730375d3a1b/pixeldemorg/clusters/pixeldemo'}},\n",
" 'node_state_counts': {'idle_node_count': 1,\n",
" 'leaving_node_count': 0,\n",
" 'preparing_node_count': 0,\n",
" 'running_node_count': 0,\n",
" 'unusable_node_count': 0},\n",
" 'provisioning_state': 'succeeded',\n",
" 'provisioning_state_transition_time': '2018-05-04T06:52:58.652Z',\n",
" 'scale_settings': {'manual': {'node_deallocation_option': 'requeue',\n",
" 'target_node_count': 1}},\n",
" 'type': 'Microsoft.BatchAI/Clusters',\n",
" 'user_account_settings': {'admin_user_name': 'alex'},\n",
" 'virtual_machine_configuration': {'image_reference': {'offer': 'linux-data-science-vm-ubuntu',\n",
" 'publisher': 'microsoft-ads',\n",
" 'sku': 'linuxdsvmubuntu',\n",
" 'version': 'latest'}},\n",
" 'vm_priority': 'dedicated',\n",
" 'vm_size': 'STANDARD_NC24'}]\n"
]
}
],
"source": [
"print_cluster_list()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can submit all of the jobs with the submit_all function. We also have a submit function for each of the DL frameworks if you wish to execute one separately."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:__main__:Submitting job run_cntk\n",
"INFO:__main__:Submitting job run_chainer\n",
"INFO:__main__:Submitting job run_mxnet\n",
"INFO:__main__:Submitting job run_keras_cntk\n",
"INFO:__main__:Submitting job run_keras_tf\n",
"INFO:__main__:Submitting job run_caffe2\n",
"INFO:__main__:Submitting job run_pytorch\n",
"INFO:__main__:Submitting job run_tf\n",
"INFO:__main__:Submitting job run_gluon\n"
]
}
],
"source": [
"submit_all(epochs=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can periodically execute the command below to observe the status of the jobs. Under the current subscription we only have 2 nodes, so only 2 jobs will execute in parallel. While a job is still running its exit-code is None; once it has finished, any exit-code other than 0 means there has been a problem with the job."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"run_cntk: status:running | exit-code None\n",
"run_chainer: status:running | exit-code None\n",
"run_mxnet: status:running | exit-code None\n",
"run_keras_cntk: status:running | exit-code None\n",
"run_keras_tf: status:running | exit-code None\n",
"run_caffe2: status:running | exit-code None\n",
"run_pytorch: status:running | exit-code None\n",
"run_tf: status:running | exit-code None\n",
"run_gluon: status:running | exit-code None\n"
]
}
],
"source": [
"print_jobs_summary()"
]
},
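{
"cell_type": "markdown",
"metadata": {},
"source": [
"The exit-code check described above can also be automated. The cell below is a minimal sketch, assuming the summary lines keep the format printed by print_jobs_summary; failed_jobs is a hypothetical helper that is not part of setup_bait.py, and the sample lines (including the failing run_example entry and the status names) are made up purely for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper, not defined in setup_bait.py.\n",
"# It assumes lines shaped like 'run_tf: status:running | exit-code None'.\n",
"def failed_jobs(summary_lines):\n",
"    failed = []\n",
"    for line in summary_lines:\n",
"        name, _, rest = line.partition(':')\n",
"        exit_code = rest.rsplit('exit-code', 1)[1].strip()\n",
"        # None means the job has not finished yet; 0 means success.\n",
"        if exit_code not in ('None', '0'):\n",
"            failed.append(name.strip())\n",
"    return failed\n",
"\n",
"\n",
"# Illustrative lines only; run_example and its failure are invented for this sketch.\n",
"sample = [\n",
"    'run_tf: status:succeeded | exit-code 0',\n",
"    'run_cntk: status:running | exit-code None',\n",
"    'run_example: status:failed | exit-code 1',\n",
"]\n",
"print(failed_jobs(sample))"
]
},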
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the wait_for_job function to wait for a job to complete. Once the job has completed, its stdout is printed. Let's take a look at the tf job. We can tell the name of the job from the output of print_jobs_summary as well as from the log messages emitted when we submitted the jobs."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cluster state: steady Target: 10; Allocated: 10; Idle: 10; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n",
"Job state: succeeded ExitCode: 0\n",
"Waiting for job output to become available...\n",
"OS: linux\n",
"Python: 3.5.2 (default, Nov 23 2017, 16:37:01) \n",
"[GCC 5.4.0 20160609]\n",
"Numpy: 1.14.2\n",
"Tensorflow: 1.8.0\n",
"Preparing train set...\n",
"Preparing test set...\n",
"Done.\n",
"(50000, 3, 32, 32) (10000, 3, 32, 32) (50000,) (10000,)\n",
"float32 float32 int32 int32\n",
"CPU times: user 796 ms, sys: 616 ms, total: 1.41 s\n",
"Wall time: 7.32 s\n",
"CPU times: user 148 ms, sys: 0 ns, total: 148 ms\n",
"Wall time: 149 ms\n",
"CPU times: user 464 ms, sys: 388 ms, total: 852 ms\n",
"Wall time: 892 ms\n",
"0 Train accuracy: 0.453125\n",
"1 Train accuracy: 0.59375\n",
"2 Train accuracy: 0.671875\n",
"3 Train accuracy: 0.65625\n",
"4 Train accuracy: 0.6875\n",
"5 Train accuracy: 0.734375\n",
"6 Train accuracy: 0.71875\n",
"7 Train accuracy: 0.828125\n",
"8 Train accuracy: 0.8125\n",
"9 Train accuracy: 0.875\n",
"Training took 166.665 sec.\n",
"CPU times: user 7.04 s, sys: 1.25 s, total: 8.29 s\n",
"Wall time: 8.34 s\n",
"Accuracy: 0.7779447115384616\n",
"Job state: succeeded ExitCode: 0\n"
]
}
],
"source": [
"wait_for_job('run_tf')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's download one of the notebooks we ran."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:__main__:Downloading Tensorflow_run_tf.ipynb\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading https://baitstr.file.core.windows.net/baitshare/10d0b7c6-9243-4713-91a9-2730375d3a1b/baitrg/jobs/run_tf/368943b4-0dd2-40f6-9df7-a1f85bff0a35/outputs/notebooks/Tensorflow_run_tf.ipynb?sv=2016-05-31&sr=f&sig=khQ0iEstW7xrShEEnzgcQICm4R%2FwsN74rU%2FLMdPub4o%3D&se=2018-05-04T18%3A12%3A24Z&sp=rl ...Done\n",
"All files Downloaded\n"
]
}
],
"source": [
"download_files('run_tf', 'notebooks')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the downloaded notebook and compare it with the stdout that wait_for_job printed for the job: the outputs in the cells are identical. You can download the other notebooks as well by simply supplying the name of the job."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once all the jobs are complete we can delete them and delete the cluster."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:utilities:Deleting run_cntk\n",
"INFO:utilities:Deleting run_chainer\n",
"INFO:utilities:Deleting run_mxnet\n",
"INFO:utilities:Deleting run_keras_cntk\n",
"INFO:utilities:Deleting run_keras_tf\n",
"INFO:utilities:Deleting run_caffe2\n",
"INFO:utilities:Deleting run_pytorch\n",
"INFO:utilities:Deleting run_tf\n",
"INFO:utilities:Deleting run_gluon\n"
]
}
],
"source": [
"delete_all_jobs()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<msrest.polling.poller.LROPoller at 0x7f23793c1198>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"delete_cluster()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cluster state: steady Target: 10; Allocated: 10; Idle: 10; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0\n"
]
}
],
"source": [
"print_status()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These simple methods are convenient but may not be suitable for every use case. For more details, check out the Batch AI documentation as well as the setup script."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}