{Initial Commit}

{Initial Commit - Transferring code to new repo }
2021-02-17 11:47:57 -08:00 · 2021-02-17 11:47:57 -08:00 · 7e9c2955f0
--- a/README.md
+++ b/README.md
@ -1,33 +1,50 @@
-# Project
+# Distributed training of Image segmentation on Azure ML

-> This repo has been populated by an initial template to help get you started. Please
-> make sure to update the content to build a great experience for community-building.
+This repo shows how to run distributed training of an image segmentation model on Azure ML.

-As the maintainer of this project, please make a few updates:
+## Platform

- Improving this README.MD file to provide a great experience
- Updating SUPPORT.MD with content about this project's support experience
- Understanding the security reporting process in SECURITY.MD
- Remove this section from the README
+We run the distributed training on Azure ML using multiple nodes, with multiple GPUs per node.

-## Contributing
+[Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/)

-This project welcomes contributions and suggestions.  Most contributions require you to agree to a
-Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
-the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
+[Azure ML SDK](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py)

-When you submit a pull request, a CLA bot will automatically determine whether you need to provide
-a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
-provided by the bot. You will only need to do this once across all repos using our CLA.
+To run the notebook, you need the following (create them if you do not already have them):
+1. An Azure subscription
+2. An Azure storage account
+3. An Azure ML workspace
+4. (Optional) An Azure ML compute target (4 nodes of STANDARD_NC24). This can also be created from the notebook.

-This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
-For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
-contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
+## Dataset

-## Trademarks
+We used the data from a kaggle project:

-This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft 
-trademarks or logos is subject to and must follow 
-[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
-Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
-Any use of third-party trademarks or logos are subject to those third-party's policies.
+https://www.kaggle.com/c/airbus-ship-detection
+
+The project is about segmenting ships in satellite images. We used its train_v2 data.
+
+To run the notebook, you need to:
+1. Create a container in Azure storage.
+2. Upload "train_v2" into the container under the folder name "airbus".
+
+## Package
+We used the "Fast.AI" package, which lets you create and train a deep learning model with little code. For example, the image classification takes 3 lines:
+
+>data = ImageDataBunch.from_folder(data_folder, train=".", valid_pct=0.2, ds_tfms=get_transforms(), size=sz, bs = bs, num_workers=8).normalize(imagenet_stats)
+
+>learn = cnn_learner(data, models.resnet34, metrics=dice)
+
+>learn.fit_one_cycle(5, slice(1e-5), pct_start=0.8)
+
+Fast.AI supports computer vision (CNN and U-Net) and NLP (transformer) models. Please find details on their website.
+
+https://www.fast.ai/
+
+You can install it by:
+
+>pip install fastai
+
+## Distributed training
+
+Fast.AI only supports distributed training with the NCCL backend, which is not natively supported by Azure ML. We used the script "azureml_adapter.py" to complete the NCCL initialization on Azure ML.
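
The role of `azureml_adapter.py` described in the README can be sketched as a pure function: it maps the rank and world-size variables set by the Open MPI launcher to the names that PyTorch's `env://` initialization reads. This is a minimal sketch, not the repo's script. The function name `nccl_env_from_mpi`, its `single_node` flag, and the dict-in/dict-out shape are assumptions made for illustration; the environment variable names and the fallback port are taken from the adapter in this commit.

```python
def nccl_env_from_mpi(env, single_node=False, default_port="54965"):
    """Return the variables torch.distributed expects, derived from MPI ones.

    `env` is any mapping of environment variables (for example os.environ).
    Hypothetical helper: the name and signature are illustrative only.
    """
    out = {
        # Global rank of this process and total number of processes.
        "RANK": env["OMPI_COMM_WORLD_RANK"],
        "WORLD_SIZE": env["OMPI_COMM_WORLD_SIZE"],
    }
    if single_node:
        # Single node: only the address is provided, so a fixed port is used.
        out["MASTER_ADDR"] = env["AZ_BATCHAI_MPI_MASTER_NODE"]
        out["MASTER_PORT"] = default_port
    else:
        # Multi-node: the variable holds "host:port".
        addr, port = env["AZ_BATCH_MASTER_NODE"].rsplit(":", 1)
        out["MASTER_ADDR"] = addr
        out["MASTER_PORT"] = port
    return out


multi = nccl_env_from_mpi({
    "OMPI_COMM_WORLD_RANK": "3",
    "OMPI_COMM_WORLD_SIZE": "8",
    "AZ_BATCH_MASTER_NODE": "10.0.0.4:6000",
})
single = nccl_env_from_mpi({
    "OMPI_COMM_WORLD_RANK": "0",
    "OMPI_COMM_WORLD_SIZE": "4",
    "AZ_BATCHAI_MPI_MASTER_NODE": "10.0.0.4",
}, single_node=True)
```

In a real job the returned values would be written into `os.environ` before calling `torch.distributed.init_process_group(backend='nccl', init_method='env://')`.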
--- a/Ship-segmentation-Azure-ML.ipynb
+++ b/Ship-segmentation-Azure-ML.ipynb
@ -0,0 +1,546 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "_uuid": "9380833a1a2503c5d3518f0ed8d6df8dcf05b7c2"
+   },
+   "source": [
+    "## Overview\n",
+    "The whole processing has 2 steps:\n",
+    "1. Image classification: classifying images with or without ships. \n",
+    "2. Image segmentation: segmenting ships from images.\n",
+    "We downsample images to 256 x 256. However, downsampling can shrink a ship to a single pixel, which lowers segmentation performance, so we select only images with larger ships for the segmentation step."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%reload_ext autoreload\n",
+    "%autoreload 2\n",
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "_uuid": "2a2f9181ed56a8310f6188ac1254f903574fb115"
+   },
+   "outputs": [],
+   "source": [
+    "import fastai\n",
+    "from fastai.vision import *\n",
+    "from fastai.callbacks.hooks import *\n",
+    "\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import os, glob"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from azureml.core.authentication import InteractiveLoginAuthentication\n",
+    "\n",
+    "from azureml.core import Workspace, Datastore, Dataset, Experiment, Run, Environment\n",
+    "from azureml.core.compute import ComputeTarget, AmlCompute\n",
+    "from azureml.core.compute_target import ComputeTargetException\n",
+    "from azureml.core.model import Model\n",
+    "from azureml.core.conda_dependencies import CondaDependencies\n",
+    "\n",
+    "from azureml.train.dnn import PyTorch, Mpi\n",
+    "from azureml.train.hyperdrive import GridParameterSampling\n",
+    "from azureml.data.data_reference import DataReference\n",
+    "from azureml.train.hyperdrive import HyperDriveConfig\n",
+    "from azureml.pipeline.steps import HyperDriveStep, HyperDriveStepRun\n",
+    "from azureml.pipeline.core import Pipeline, PipelineData\n",
+    "from azureml.train.hyperdrive import PrimaryMetricGoal\n",
+    "from azureml.train.hyperdrive.parameter_expressions import choice\n",
+    "\n",
+    "from azureml.core.runconfig import MpiConfiguration\n",
+    "\n",
+    "from azureml.widgets import RunDetails"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "_uuid": "8fa09c99d9f5b03e8e3a213f6d84902d5e1d59e1"
+   },
+   "source": [
+    "### Prepare Azure Resource"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Connect to the workspace\n",
+    "interactive_auth = InteractiveLoginAuthentication()\n",
+    "\n",
+    "subscription_id = '<Your Azure subscription id>'\n",
+    "resource_group = '<Your resource group in Azure>'\n",
+    "workspace_name = '<Your workspace name>'\n",
+    "\n",
+    "workspace = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name,\n",
+    "                      auth=interactive_auth)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Register storage container as datastore\n",
+    "storange_name = '<Your Azure storage name>'\n",
+    "ket_to_storage = '<Key to your storage>'\n",
+    "datastore_name = 'airbus'\n",
+    "\n",
+    "datastore = Datastore.register_azure_blob_container(workspace=workspace, \n",
+    "                                                    datastore_name=datastore_name, \n",
+    "                                                    container_name=datastore_name,\n",
+    "                                                    account_name=storange_name, \n",
+    "                                                    account_key=ket_to_storage,\n",
+    "                                                    create_if_not_exists=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Find datastore by name\n",
+    "datastore = Datastore.get(workspace, datastore_name)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Connect to or create the compute resource\n",
+    "cluster_name = 'gpu-nc24'\n",
+    "\n",
+    "try:\n",
+    "    compute_target = ComputeTarget(workspace = workspace, name = cluster_name)\n",
+    "    print('Found existing compute target')\n",
+    "except ComputeTargetException:\n",
+    "    print('Creating a new compute target...')\n",
+    "    compute_config = AmlCompute.provisioning_configuration(vm_size = 'STANDARD_NC24', min_nodes = 0, max_nodes = 4)\n",
+    "    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)\n",
+    "    compute_target.wait_for_completion(show_output = True, min_node_count = 4, timeout_in_minutes = 20)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Register dataset\n",
+    "dataset = Dataset.File.from_files(path=(datastore, 'airbus'))\n",
+    "dataset = dataset.register(workspace=workspace,\n",
+    "                           name='Airbus root',\n",
+    "                           description='Dataset for airbus images')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the script folder\n",
+    "script_folder = os.path.join(os.getcwd(), \"training_scripts\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data cleaning\n",
+    "1. Downsize images to 256 x 256\n",
+    "2. Put images into 2 folders: ship or no ship\n",
+    "3. Create the segmentation label images"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create experiment to clean data\n",
+    "exp_data = Experiment(workspace = workspace, name = 'urthecast_data_clean')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Register data reference\n",
+    "data_folder = DataReference(\n",
+    "    datastore=datastore,\n",
+    "    data_reference_name=\"airbus_root\",\n",
+    "    path_on_datastore = 'airbus',\n",
+    "    mode = 'mount')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create estimator for data cleaning\n",
+    "script_params = {\n",
+    "    '--data_folder': data_folder\n",
+    "}\n",
+    "est_data = PyTorch(source_directory = script_folder,\n",
+    "                    compute_target = compute_target,\n",
+    "                    entry_script = 'clean-data.py',  # python script for cleaning\n",
+    "                    script_params = script_params,\n",
+    "                    use_gpu = False,\n",
+    "                    node_count=1,\n",
+    "                    pip_packages = ['fastai'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Submit for running\n",
+    "data_run = exp_data.submit(est_data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Show run details\n",
+    "RunDetails(data_run).show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Ship/No ship classification"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create experiment for classification\n",
+    "exp_class = Experiment(workspace = workspace, name = 'classification')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Data reference for classification data\n",
+    "class_data_folder = DataReference(\n",
+    "    datastore=datastore,\n",
+    "    data_reference_name=\"airbus_class\",\n",
+    "    path_on_datastore = 'airbus/class',\n",
+    "    mode = 'mount')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Estimator for classification\n",
+    "from azureml.train.dnn import PyTorch, Mpi\n",
+    "\n",
+    "script_params = {\n",
+    "    '--data_folder': class_data_folder,\n",
+    "    '--num_epochs': 5\n",
+    "}\n",
+    "\n",
+    "est_class = PyTorch(source_directory = script_folder,\n",
+    "                    compute_target = compute_target,\n",
+    "                    entry_script = 'classification.py', # Classification script\n",
+    "                    script_params = script_params,\n",
+    "                    use_gpu = True,\n",
+    "                    node_count=3,                       # 3 nodes are used\n",
+    "                    distributed_training=Mpi(process_count_per_node = 4), # 4 GPU's per node\n",
+    "                    pip_packages = ['fastai'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the HyperDrive configuration for parameter tuning\n",
+    "param_sampling = GridParameterSampling({\n",
+    "    'start_learning_rate': choice(0.0001, 0.001),\n",
+    "    'end_learning_rate': choice(0.01, 0.1)})\n",
+    "\n",
+    "hyperdrive_class = HyperDriveConfig(estimator = est_class,\n",
+    "                                         hyperparameter_sampling = param_sampling,\n",
+    "                                         policy = None,\n",
+    "                                         primary_metric_name = 'dice',\n",
+    "                                         primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,\n",
+    "                                         max_total_runs = 4,\n",
+    "                                         max_concurrent_runs = 4)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Kick off running\n",
+    "classification_run = exp_class.submit(hyperdrive_class)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Show running details\n",
+    "RunDetails(classification_run).show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Get results for all runs\n",
+    "classification_run.wait_for_completion(show_output = False)\n",
+    "\n",
+    "children = list(classification_run.get_children())\n",
+    "metricslist = {}\n",
+    "i = 0\n",
+    "\n",
+    "for single_run in children:\n",
+    "    results = {k: np.min(v) for k, v in single_run.get_metrics().items() if (k in ['dice', 'loss']) and isinstance(v, float)}\n",
+    "    parameters = single_run.get_details()['runDefinition']['arguments']\n",
+    "    try:\n",
+    "        results['start_learning_rate'] = parameters[5]\n",
+    "        results['end_learning_rate'] = parameters[7]\n",
+    "        metricslist[i] = results\n",
+    "        i += 1\n",
+    "    except:\n",
+    "        pass\n",
+    "\n",
+    "rundata = pd.DataFrame(metricslist).sort_index(axis=1).T.sort_values(by = ['loss'], ascending = True)\n",
+    "rundata"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Show the best run\n",
+    "best_run = classification_run.get_best_run_by_primary_metric()\n",
+    "best_run.get_file_names()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "_uuid": "f487dd77687f4edb070bd5d2dc9da9a001d62bdb"
+   },
+   "source": [
+    "### Ship segmentation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Data reference for segmentation\n",
+    "sgmt_data_folder = DataReference(\n",
+    "    datastore=datastore,\n",
+    "    data_reference_name=\"airbus_segmentation\",\n",
+    "    path_on_datastore = 'airbus/segmentation',\n",
+    "    mode = 'mount')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Experiment for segmentation\n",
+    "exp_sgmt = Experiment(workspace = workspace, name = 'segmentation')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "_uuid": "56ed39146115a4767a257fec60a3b367284fa0d6"
+   },
+   "outputs": [],
+   "source": [
+    "# Estimator for segmentation\n",
+    "segmt_script_params = {\n",
+    "    '--data_folder': sgmt_data_folder,\n",
+    "    '--img_folder': '256-filter99',\n",
+    "    '--num_epochs': 12\n",
+    "}\n",
+    "\n",
+    "segmt_est = PyTorch(source_directory = script_folder,\n",
+    "                    compute_target = compute_target,\n",
+    "                    entry_script = 'segmentation.py', # Segmentation script\n",
+    "                    script_params = segmt_script_params,\n",
+    "                    use_gpu = True,\n",
+    "                    node_count=4,                     # 4 nodes\n",
+    "                    distributed_training=Mpi(process_count_per_node = 4), # 4 GPU's per node\n",
+    "                    pip_packages = ['fastai'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "_uuid": "25fa3283c992696575914a5fdb6ebc433a0b5d1f"
+   },
+   "outputs": [],
+   "source": [
+    "# Kick off running\n",
+    "segmentation_run = exp_sgmt.submit(config=segmt_est)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Running detail\n",
+    "RunDetails(segmentation_run).show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "_uuid": "193381699f5595c916647bfd6c51eaeba699379d"
+   },
+   "outputs": [],
+   "source": [
+    "# Results\n",
+    "segmentation_run.wait_for_completion(show_output=False)  # specify True for a verbose log\n",
+    "print(segmentation_run.get_file_names())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Register model\n",
+    "model = segmentation_run.register_model(model_name='segmentation-99',\n",
+    "                                        tags={'ship': 'min99'},\n",
+    "                                        model_path='outputs/segmentation.pkl')\n",
+    "print(model.name, model.id, model.version, sep='\\t')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "_uuid": "a5af78f6512ab4f514818ef47b3481ef67a65e46"
+   },
+   "source": [
+    "### Prediction\n",
+    "Sample code for prediction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Read image\n",
+    "size = 256\n",
+    "ifile = '<Test image>'\n",
+    "img = open_image(ifile)\n",
+    "img = img.resize(size)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Prediction\n",
+    "model_path = '<The model path>'\n",
+    "learn = load_learner(model_path)\n",
+    "pred = learn.predict(img)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
--- a/Ship-segmentation-Azure-ML.yaml
+++ b/Ship-segmentation-Azure-ML.yaml
@ -0,0 +1,8 @@
+name: project_environment
+dependencies:
+  - python>3.6.2
+  - torch>1.0
+  - pip:
+      # You must list azureml-defaults as a pip dependency
+    - fastai
+    - azureml-sdk[notebooks,automl]
--- a/training_scripts/azureml_adapter.py
+++ b/training_scripts/azureml_adapter.py
@ -0,0 +1,36 @@
+import os
+
+def set_environment_variables_for_nccl_backend(single_node=False):
+    os.environ['RANK'] = os.environ['OMPI_COMM_WORLD_RANK']
+    os.environ['WORLD_SIZE'] = os.environ['OMPI_COMM_WORLD_SIZE']
+
+    if not single_node: 
+        master_node_params = os.environ['AZ_BATCH_MASTER_NODE'].split(':')
+        os.environ['MASTER_ADDR'] = master_node_params[0]
+        os.environ['MASTER_PORT'] = master_node_params[1]
+    else:
+        os.environ['MASTER_ADDR'] = os.environ['AZ_BATCHAI_MPI_MASTER_NODE']
+        os.environ['MASTER_PORT'] = '54965'
+    # Use .get() so an unset variable prints None instead of raising KeyError
+    print('NCCL_SOCKET_IFNAME original value = {}'.format(os.environ.get('NCCL_SOCKET_IFNAME')))
+    # TODO make this parameterizable
+    os.environ['NCCL_SOCKET_IFNAME'] = '^docker0,lo'
+
+    print('RANK = {}'.format(os.environ['RANK']))
+    print('WORLD_SIZE = {}'.format(os.environ['WORLD_SIZE']))
+    print('MASTER_ADDR = {}'.format(os.environ['MASTER_ADDR']))
+    print('MASTER_PORT = {}'.format(os.environ['MASTER_PORT']))
+    # print('MASTER_NODE = {}'.format(os.environ['MASTER_NODE']))
+    print('NCCL_SOCKET_IFNAME new value = {}'.format(os.environ['NCCL_SOCKET_IFNAME']))
+
+def get_local_rank():
+    return int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
+
+def get_global_size():
+    return int(os.environ['OMPI_COMM_WORLD_SIZE'])
+
+def get_local_size():
+    return int(os.environ['OMPI_COMM_WORLD_LOCAL_SIZE'])	
+
+def get_world_size():
+    return int(os.environ['WORLD_SIZE'])
+	 
--- a/training_scripts/classification.py
+++ b/training_scripts/classification.py
@ -0,0 +1,66 @@
+import numpy as np
+import fastai
+from fastai.vision import *
+from fastai.callbacks.hooks import *
+from fastai.callbacks.mem import PeakMemMetric
+from fastai.distributed import *
+
+import os, argparse, time, random
+from azureml.core import Workspace, Run, Dataset
+
+from azureml_adapter import set_environment_variables_for_nccl_backend, get_local_rank, get_global_size, get_local_size
+
+parser = argparse.ArgumentParser()
+parser.add_argument('--data_folder', type=str, dest='data_folder', default='')
+parser.add_argument('--img_size', type=int, dest='img_size', default=256)
+parser.add_argument('--batch_size', type=int, dest='batch_size', default=64)
+parser.add_argument('--num_epochs', type=int, dest='num_epochs', default=12)
+parser.add_argument('--start_learning_rate', type=float, dest='start_learning_rate', default=0.001)
+parser.add_argument('--end_learning_rate', type=float, dest='end_learning_rate', default=0.01)
+parser.add_argument('--pct_start', type=float, dest='pct_start', default=0.9)
+args = parser.parse_args()
+
+local_rank = -1
+local_rank = get_local_rank()
+global_size = get_global_size()
+local_size = get_local_size()
+
+# TODO use logger
+print('local_rank = {}'.format(local_rank))
+print('global_size = {}'.format(global_size))
+print('local_size = {}'.format(local_size))
+
+set_environment_variables_for_nccl_backend(local_size == global_size)
+torch.cuda.set_device(local_rank)
+torch.distributed.init_process_group(backend='nccl', init_method='env://')
+rank = int(os.environ['RANK'])
+
+data_folder = args.data_folder
+sz = args.img_size
+bs = args.batch_size
+print('Data folder:', data_folder)
+
+run = Run.get_context()
+work_folder = os.getcwd()
+print('Work directory: ', work_folder)
+
+data = ImageDataBunch.from_folder(data_folder, train=".", valid_pct=0.2,
+       ds_tfms=get_transforms(), size=sz, bs = bs, num_workers=8).normalize(imagenet_stats)
+	   
+learn = cnn_learner(data, models.resnet34, metrics=dice).to_distributed(local_rank)
+learn.fit_one_cycle(args.num_epochs, slice(args.start_learning_rate,args.end_learning_rate))
+
+#learn.unfreeze()
+#learn.fit_one_cycle(5, slice(1e-5), pct_start=0.8)
+
+result = learn.validate()
+# np.float was only an alias for the builtin float and is gone from recent NumPy releases
+run.log('Worker #{:} loss'.format(rank), float(result[0]))
+run.log('Worker #{:} dice'.format(rank), float(result[1]))
+
+os.chdir(work_folder)
+if rank == 0:
+    run.log('loss', float(result[0]))
+    run.log('dice', float(result[1]))
+
+    #filename = 'outputs/classification.pkl'
+    #learn.export(outputs/)
--- a/training_scripts/clean-data.py
+++ b/training_scripts/clean-data.py
@ -0,0 +1,134 @@
+import numpy as np
+import fastai
+from fastai.vision import *
+from fastai.callbacks.hooks import *
+
+import os, glob, argparse, time, random, math
+
+from azureml.core import Workspace, Run, Dataset
+
+parser = argparse.ArgumentParser()
+parser.add_argument('--data_folder', type=str, dest='data_folder')
+parser.add_argument('--org_size', type=int, dest='org_size', default=768)
+parser.add_argument('--train_folder', type=str, dest='train_folder', default='train_v2')
+parser.add_argument('--train_sgmtfile', type=str, dest='train_sgmtfile', default='train_ship_segmentations_v2.csv')
+parser.add_argument('--class_folder', type=str, dest='class_folder', default='class')
+parser.add_argument('--img_size', type=int, dest='img_size', default=256)
+parser.add_argument('--min_area', type=int, dest='min_area', default=99)
+parser.add_argument('--sgmtimg_folder', type=str, dest='sgmtimg_folder', default='256-filter99')
+parser.add_argument('--sgmtlabel_folder', type=str, dest='sgmtlabel_folder', default='256-label')
+args = parser.parse_args()
+
+run = Run.get_context()
+
+data_folder = args.data_folder
+print('Data folder: ', data_folder)
+
+train_folder = os.path.join(data_folder, args.train_folder)
+SEGMENTATION = os.path.join(data_folder, args.train_sgmtfile)
+
+# Clean images
+print('Searching the broken images.............')
+brokenfiles = []
+for fpath in glob.glob(os.path.join(train_folder, '*.jpg')):
+    try:
+        img = open_image(fpath)
+    except Exception:
+        fn = os.path.basename(fpath)
+        print(fn, ' is broken')
+        brokenfiles.append(fn)
+print(len(brokenfiles), ' images are broken')
+
+print('Moving broken images.........')
+broken_folder = os.path.join(train_folder, 'broken')
+os.makedirs(broken_folder, exist_ok=True)
+for fn in brokenfiles:
+    orig_name = os.path.join(train_folder, fn)
+    new_name = os.path.join(broken_folder, fn)
+    os.rename(orig_name, new_name)
+	
+# Divide images into Ship and NoShip
+print('Split images to ship & no-ship folder .........')
+df_masks = pd.read_csv(SEGMENTATION, index_col='ImageId')
+
+class_folder = os.path.join(train_folder, args.class_folder)
+
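+ship_folder = os.path.join(class_folder, 'ship')
+noship_folder = os.path.join(class_folder, 'no-ship')
+# os.rename does not create missing directories, so create the targets first
+os.makedirs(ship_folder, exist_ok=True)
+os.makedirs(noship_folder, exist_ok=True)
+
+for fpath in glob.glob(os.path.join(train_folder, '*.jpg')):
+    fn = os.path.basename(fpath)
+    masks = df_masks.loc[fn, 'EncodedPixels']
+    # One ship gives a str; several ships give several rows, so .loc returns a Series.
+    # A no-ship image has a single row with NaN (a float).
+    if isinstance(masks, (str, pd.Series)):
+        tpath = os.path.join(ship_folder, fn)
+    else:
+        tpath = os.path.join(noship_folder, fn)
+
+    os.rename(fpath, tpath)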
+
+print('Generating label files............')
+sz_enc = [args.org_size, args.org_size]
+
+def enc2mask(masks, shape = sz_enc):
+    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
+
+    if(type(masks) == float): return img.reshape(shape)
+    if(type(masks) == str): masks = [masks]
+    for mask in masks:
+        s = mask.split()
+        for i in range(len(s)//2):
+            start = int(s[2*i]) - 1
+            length = int(s[2*i+1])
+            img[start:start+length] = 1
+    return img.reshape(shape).T
+
+label_folder = os.path.join(train_folder, 'label')
+os.makedirs(label_folder, exist_ok=True)  # save() does not create missing directories
+
+for fpath in glob.glob(os.path.join(ship_folder, '*.jpg')):
+	fn = os.path.basename(fpath)
+	labelpath = os.path.join(label_folder, Path(fn).stem + '.png')
+
+	mask = enc2mask(df_masks.loc[fn,'EncodedPixels'])
+	maskimg = PIL.Image.fromarray(mask)
+	maskimg.save(labelpath)
+
+
+def SummaryLabelArea(label_root):
+    min_area = 1000000
+    area_hist = np.zeros(20, int)
+    for fpath in glob.glob(os.path.join(label_root, '*.png')):
+        mask = open_mask(fpath)
+        area = mask.data.sum()
+        area_hist[int(math.log2(area))] += 1
+        if area < min_area: min_area = area
+    
+    print('Min area is ', min_area)
+    print(area_hist / np.sum(area_hist))
+    
+    return min_area, area_hist
+
+SummaryLabelArea(label_folder);
+
+print('Resizing images and labels .........')
+def ResizeTrainLabel(train_root, label_root, dest_train_root, dest_label_root, size, min_area = 0):
+           
+    for fpathstr in glob.glob(os.path.join(train_root, '*.jpg')):
+        fpath = Path(fpathstr)
+        lpath = os.path.join(label_root, fpath.stem + '.png')
+    
+        mask = open_mask(lpath)
+        mask = mask.resize(size)
+        
+        if mask.data.sum() > min_area:
+            dest_lpath = os.path.join(dest_label_root, fpath.stem + '.png')
+            mask.save(dest_lpath)
+        
+            img = open_image(fpath)
+            img = img.resize(size)
+
+            dest_fpath = os.path.join(dest_train_root, fpath.stem + '.jpg')
+            img.save(dest_fpath)
+
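+sgmtimg_folder = os.path.join(train_folder, args.sgmtimg_folder)
+sgmtlabel_folder = os.path.join(train_folder, args.sgmtlabel_folder)
+# Make sure the output folders exist before images and labels are written
+os.makedirs(sgmtimg_folder, exist_ok=True)
+os.makedirs(sgmtlabel_folder, exist_ok=True)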
+
+ResizeTrainLabel(ship_folder, label_folder, sgmtimg_folder, sgmtlabel_folder, args.img_size, args.min_area)
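
The label generation in `clean-data.py` relies on the run-length encoding used in the competition's CSV: each mask is a string of `start length` pairs, with 1-based starts. A minimal NumPy sketch of that decoding follows. `rle_to_mask` is a hypothetical name, and the column-major reading of the pixel order is inferred from the transpose in `enc2mask` rather than stated anywhere in this repo.

```python
import numpy as np


def rle_to_mask(rle, shape):
    """Decode one 'start length start length ...' string into a 0/1 mask.

    Starts are 1-based and, by assumption, count down the columns of the
    image, so the flat buffer is reshaped in column-major (Fortran) order.
    """
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    tokens = rle.split()
    # Tokens alternate between a run start and a run length.
    for start, length in zip(tokens[0::2], tokens[1::2]):
        begin = int(start) - 1  # convert the 1-based start to a 0-based index
        flat[begin:begin + int(length)] = 1
    return flat.reshape(shape, order="F")


# Pixels 2-3 fall in column 0 (rows 1 and 2); pixel 7 is column 2, row 0.
mask = rle_to_mask("2 2 7 1", (3, 3))
```

For a square image this gives the same result as the `reshape(shape).T` used in `enc2mask`; the `order="F"` form also handles non-square shapes.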
--- a/training_scripts/segmentation.py
+++ b/training_scripts/segmentation.py
@ -0,0 +1,119 @@
+import numpy as np
+import fastai
+from fastai.vision import *
+from fastai.callbacks.hooks import *
+from fastai.callbacks.mem import PeakMemMetric
+from fastai.distributed import *
+
+import os, argparse, time, random
+from azureml.core import Workspace, Run, Dataset
+
+from azureml_adapter import set_environment_variables_for_nccl_backend, get_local_rank, get_global_size, get_local_size
+
+def dice_loss(input, target):
+    #input = torch.sigmoid(input)
+    smooth = 1.0
+
+    iflat = input.flatten()
+    tflat = target.flatten()
+    intersection = (iflat * tflat).sum()
+    
+    return ((2.0 * intersection + smooth) / (iflat.sum() + tflat.sum() + smooth))
+
+class FocalLoss(nn.Module):
+    def __init__(self, gamma):
+        super().__init__()
+        self.gamma = gamma
+        
+    def forward(self, input, target):
+        if not (target.size() == input.size()):
+            raise ValueError("Target size ({}) must be the same as input size ({})"
+                             .format(target.size(), input.size()))
+
+        max_val = (-input).clamp(min=0)
+        loss = input - input * target + max_val + \
+            ((-max_val).exp() + (-input - max_val).exp()).log()
+
+        invprobs = F.logsigmoid(-input * (target * 2.0 - 1.0))
+        loss = (invprobs * self.gamma).exp() * loss
+        
+        return loss.mean()
+
+class MixedLoss(nn.Module):
+    def __init__(self, alpha, gamma):
+        super().__init__()
+        self.alpha = alpha
+        self.focal = FocalLoss(gamma)
+        
+    def forward(self, input, target):
+        input = F.softmax(input, dim=1)[:,1:,:,:]
+        input2 = torch.log((input+1e-7)/(1-input+1e-7))
+
+        loss = self.alpha*self.focal(input2, target) - torch.log(dice_loss(input, target))
+        return loss
+
+parser = argparse.ArgumentParser()
+parser.add_argument('--data_folder', type=str, dest='data_folder', default='')
+parser.add_argument('--label_folder', type=str, dest='label_folder', default='256-label')
+parser.add_argument('--img_folder', type=str, dest='img_folder', default='256-filter')
+parser.add_argument('--img_size', type=int, dest='img_size', default=256)
+parser.add_argument('--batch_size', type=int, dest='batch_size', default=16)
+parser.add_argument('--num_epochs', type=int, dest='num_epochs', default=12)
+parser.add_argument('--start_learning_rate', type=float, dest='start_learning_rate', default=0.000001)
+parser.add_argument('--end_learning_rate', type=float, dest='end_learning_rate', default=0.001)
+args = parser.parse_args()
+
+local_rank = -1
+local_rank = get_local_rank()
+global_size = get_global_size()
+local_size = get_local_size()
+
+# TODO use logger
+print('local_rank = {}'.format(local_rank))
+print('global_size = {}'.format(global_size))
+print('local_size = {}'.format(local_size))
+
+set_environment_variables_for_nccl_backend(local_size == global_size)
+torch.cuda.set_device(local_rank)
+torch.distributed.init_process_group(backend='nccl', init_method='env://')
+rank = int(os.environ['RANK'])
+
+data_folder = args.data_folder
+sz = args.img_size
+bs = args.batch_size
+print('Data folder:', data_folder)
+
+run = Run.get_context()
+work_folder = os.getcwd()
+print('Work directory: ', work_folder)
+
+label_path = Path(os.path.join(data_folder, args.label_folder))
+get_y_fn = lambda x: label_path/f'{x.stem}.png'
+tfms = get_transforms(max_rotate = 10, max_lighting = 0.05, max_warp = 0.2, flip_vert = True,
+                      p_affine = 1., p_lighting = 1)
+
+img_path = os.path.join(data_folder, args.img_folder)
+data = (SegmentationItemList.from_folder(img_path)
+        .split_by_rand_pct(0.2)
+        .label_from_func(get_y_fn, classes=['Background','Ship'])
+        .transform(tfms, size=sz, tfm_y=True)
+        .databunch(path=Path('.'), bs=bs, num_workers=0)
+        .normalize(imagenet_stats))
+
+learn = unet_learner(data, models.resnet34, loss_func=MixedLoss(10.0,2.0), metrics=dice, wd=1e-7).to_distributed(local_rank)
+learn.fit_one_cycle(args.num_epochs, slice(args.start_learning_rate,args.end_learning_rate))
+ 
+#learn.unfreeze()
+#learn.fit_one_cycle(args.num_epochs, slice(args.start_learning_rate,args.end_learning_rate))
+
+result = learn.validate()
+# np.float was only an alias for the builtin float and is gone from recent NumPy releases
+run.log('Worker #{:} loss'.format(rank), float(result[0]))
+run.log('Worker #{:} dice'.format(rank), float(result[1]))
+
+if rank == 0:
+    run.log('loss', float(result[0]))
+    run.log('dice', float(result[1]))
+
+    os.chdir(work_folder)
+    filename = 'outputs/segmentation.pkl'
+    learn.export(filename)
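
For intuition about the `dice_loss` helper in `segmentation.py`, which returns the smoothed Dice coefficient that `MixedLoss` then passes through `-log`, here is a small NumPy version. `smoothed_dice` is a hypothetical name, and plain arrays stand in for the tensors the training script uses.

```python
import numpy as np


def smoothed_dice(pred, target, smooth=1.0):
    """Smoothed Dice coefficient: (2*|A.B| + s) / (|A| + |B| + s)."""
    p = np.asarray(pred, dtype=float).ravel()
    t = np.asarray(target, dtype=float).ravel()
    intersection = (p * t).sum()
    return (2.0 * intersection + smooth) / (p.sum() + t.sum() + smooth)


perfect = smoothed_dice([1, 1, 0, 0], [1, 1, 0, 0])  # identical masks
partial = smoothed_dice([1, 1, 0, 0], [1, 0, 0, 0])  # one of two pixels overlaps
```

A perfect overlap gives a coefficient of 1, so its negative log contributes zero to the loss; smaller overlaps give values below 1 and a positive penalty.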