{Initial Commit}
{Initial Commit - Transferring code to new repo }
This commit is contained in:
Parent
75c41c4857
Commit
7e9c2955f0
65
README.md
|
@ -1,33 +1,50 @@
|
|||
# Distributed training of Image segmentation on Azure ML
|
||||
|
This repo shows how to run distributed training of an image segmentation model on Azure ML.
|
||||
|
||||
## Platform
|
||||
|
||||
We run the distributed training in Azure ML using multiple nodes and multiple GPUs per node.
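
As a rough illustration of the multi-node, multi-GPU layout, the sketch below computes the total number of worker processes and the global rank of each process from its node index and local GPU index. The function names and the one-process-per-GPU, node-major numbering are assumptions made for illustration; in a real run the ranks are assigned by the MPI launcher.

```python
def world_size(node_count: int, gpus_per_node: int) -> int:
    """Total number of worker processes, assuming one process per GPU."""
    return node_count * gpus_per_node


def global_rank(node_index: int, local_rank: int, gpus_per_node: int) -> int:
    """Global rank of a process, assuming node-major numbering (hypothetical)."""
    return node_index * gpus_per_node + local_rank


# Example: 4 nodes with 4 GPUs each
total = world_size(4, 4)
ranks = [global_rank(n, g, 4) for n in range(4) for g in range(4)]
print(total)
print(ranks)
```

With 4 nodes and 4 GPUs per node, every process gets a distinct rank between 0 and 15.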
|
||||
|
||||
[Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/)
|
||||
|
||||
[Azure ML SDK](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py)
|
||||
|
||||
To run the notebook, you need the following (create them if you do not already have them):
1. An Azure subscription
2. An Azure storage account
3. An Azure ML workspace
4. (Optional) An Azure ML compute target (4 nodes of STANDARD_NC24); this can also be created in the notebook.
|
||||
|
||||
## Dataset
|
||||
|
||||
We used the data from a Kaggle competition:
|
||||
|
||||
https://www.kaggle.com/c/airbus-ship-detection
|
||||
|
||||
The competition is about segmenting ships in satellite images. We used its train_v2 data.
|
||||
|
||||
To run the notebook, you need to:
|
||||
1. Create a container in Azure storage.
2. Upload "train_v2" into the container under a folder named "airbus".
|
||||
|
||||
## Package
|
||||
We used the "Fast.AI" package, which lets you create and train a deep learning model with very little code. For example, image classification takes 3 lines:
|
||||
|
||||
>data = ImageDataBunch.from_folder(data_folder, train=".", valid_pct=0.2, ds_tfms=get_transforms(), size=sz, bs = bs, num_workers=8).normalize(imagenet_stats)
|
||||
|
||||
>learn = cnn_learner(data, models.resnet34, metrics=dice)
|
||||
|
||||
>learn.fit_one_cycle(5, slice(1e-5), pct_start=0.8)
|
||||
|
||||
Fast.AI supports computer vision (CNN and U-Net) and NLP (transformer). Please find details on their website:
|
||||
|
||||
https://www.fast.ai/
|
||||
|
||||
You can install it by:
|
||||
|
||||
>pip install fastai
|
||||
|
||||
## Distributed training
|
||||
|
||||
Fast.AI only supports distributed training with the NCCL backend, which is not natively supported by Azure ML. We used the script "azureml_adapter.py" to complete the NCCL initialization on Azure ML.
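
The idea behind the adapter can be sketched as a pure function: the OpenMPI and Azure Batch variables are translated into the `RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` variables that `torch.distributed` reads with `init_method='env://'`. The variable names are the ones used in `azureml_adapter.py` in this commit; the function name `nccl_env_from_openmpi` and the sample values are made up for illustration, and the function works on a plain dictionary rather than on `os.environ`.

```python
def nccl_env_from_openmpi(env, single_node=False, single_node_port="54965"):
    """Return the torch.distributed variables derived from OpenMPI/Azure ones.

    `env` is any mapping of environment variable names to string values.
    """
    out = {
        "RANK": env["OMPI_COMM_WORLD_RANK"],
        "WORLD_SIZE": env["OMPI_COMM_WORLD_SIZE"],
    }
    if not single_node:
        # Multi-node case: the master is given as "<address>:<port>"
        address, port = env["AZ_BATCH_MASTER_NODE"].split(":")
    else:
        # Single-node case: only the address is given, so use a fixed port
        address, port = env["AZ_BATCHAI_MPI_MASTER_NODE"], single_node_port
    out["MASTER_ADDR"] = address
    out["MASTER_PORT"] = port
    return out


# Sample values are invented for the example
sample_env = {
    "OMPI_COMM_WORLD_RANK": "5",
    "OMPI_COMM_WORLD_SIZE": "16",
    "AZ_BATCH_MASTER_NODE": "10.0.0.4:6000",
}
nccl_env = nccl_env_from_openmpi(sample_env)
print(nccl_env)
```

Keeping the mapping free of side effects makes it easy to check on a laptop before submitting a run to the cluster.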
|
||||
|
|
|
@ -0,0 +1,546 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"_uuid": "9380833a1a2503c5d3518f0ed8d6df8dcf05b7c2"
|
||||
},
|
||||
"source": [
|
||||
"## Overview\n",
|
||||
"The whole process has 2 steps:\n",
|
||||
"1. Image classification: classifying images with or without ships. \n",
|
||||
"2. Image segmentation: segmenting ships from images.\n",
|
||||
"We downsample images to 256 x 256. However, downsampling can shrink a ship to a single pixel, which lowers segmentation performance, so we select only images with larger ships for segmentation."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%reload_ext autoreload\n",
|
||||
"%autoreload 2\n",
|
||||
"%matplotlib inline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"_uuid": "2a2f9181ed56a8310f6188ac1254f903574fb115"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import fastai\n",
|
||||
"from fastai.vision import *\n",
|
||||
"from fastai.callbacks.hooks import *\n",
|
||||
"\n",
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np\n",
|
||||
"import os, glob"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core.authentication import InteractiveLoginAuthentication\n",
|
||||
"\n",
|
||||
"from azureml.core import Workspace, Datastore, Dataset, Experiment, Run, Environment\n",
|
||||
"from azureml.core.compute import ComputeTarget, AmlCompute\n",
|
||||
"from azureml.core.compute_target import ComputeTargetException\n",
|
||||
"from azureml.core.model import Model\n",
|
||||
"from azureml.core.conda_dependencies import CondaDependencies\n",
|
||||
"\n",
|
||||
"from azureml.train.dnn import PyTorch, Mpi\n",
|
||||
"from azureml.train.hyperdrive import GridParameterSampling\n",
|
||||
"from azureml.data.data_reference import DataReference\n",
|
||||
"from azureml.train.hyperdrive import HyperDriveConfig\n",
|
||||
"from azureml.pipeline.steps import HyperDriveStep, HyperDriveStepRun\n",
|
||||
"from azureml.pipeline.core import Pipeline, PipelineData\n",
|
||||
"from azureml.train.hyperdrive import PrimaryMetricGoal\n",
|
||||
"from azureml.train.hyperdrive.parameter_expressions import choice\n",
|
||||
"\n",
|
||||
"from azureml.core.runconfig import MpiConfiguration\n",
|
||||
"\n",
|
||||
"from azureml.widgets import RunDetails"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"_uuid": "8fa09c99d9f5b03e8e3a213f6d84902d5e1d59e1"
|
||||
},
|
||||
"source": [
|
||||
"### Prepare Azure Resource"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Connect the workspace\n",
|
||||
"interactive_auth = InteractiveLoginAuthentication()\n",
|
||||
"\n",
|
||||
"subscription_id = '<Your Azure subscription id>'\n",
|
||||
"resource_group = '<Your resource group in Azure>'\n",
|
||||
"workspace_name = '<Your workspace name>'\n",
|
||||
"\n",
|
||||
"workspace = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name,\n",
|
||||
" auth=interactive_auth)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Register storage container as datastore\n",
|
||||
"storage_name = '<Your Azure storage name>'\n",
"key_to_storage = '<Key to your storage>'\n",
"datastore_name = 'airbus'\n",
"\n",
"datastore = Datastore.register_azure_blob_container(workspace=workspace, \n",
"                                                    datastore_name=datastore_name, \n",
"                                                    container_name=datastore_name,\n",
"                                                    account_name=storage_name, \n",
"                                                    account_key=key_to_storage,\n",
|
||||
" create_if_not_exists=False)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Find datastore by name\n",
|
||||
"datastore = Datastore.get(workspace, datastore_name)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Connect to or create the compute resource\n",
|
||||
"cluster_name = 'gpu-nc24'\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" compute_target = ComputeTarget(workspace = workspace, name = cluster_name)\n",
|
||||
" print('Found existing compute target')\n",
|
||||
"except ComputeTargetException:\n",
|
||||
" print('Creating a new compute target...')\n",
|
||||
" compute_config = AmlCompute.provisioning_configuration(vm_size = 'STANDARD_NC24', min_nodes = 0, max_nodes = 4)\n",
|
||||
" compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)\n",
|
||||
" compute_target.wait_for_completion(show_output = True, min_node_count = 4, timeout_in_minutes = 20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Register dataset\n",
|
||||
"dataset = Dataset.File.from_files(path=(datastore, 'airbus'))\n",
|
||||
"dataset = dataset.register(workspace=workspace,\n",
|
||||
" name='Airbus root',\n",
|
||||
" description='Dataset for airbus images')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Define the script folder\n",
|
||||
"script_folder = os.path.join(os.getcwd(), \"training_scripts\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Data Clean\n",
|
||||
"1. Downsize images to 256 x 256 \n",
|
||||
"2. Put images into 2 folders: ship or no ship\n",
|
||||
"3. Create the segmentation label images"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create experiment to clean data\n",
|
||||
"exp_data = Experiment(workspace = workspace, name = 'urthecast_data_clean')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Register data reference\n",
|
||||
"data_folder = DataReference(\n",
|
||||
" datastore=datastore,\n",
|
||||
" data_reference_name=\"airbus_root\",\n",
|
||||
" path_on_datastore = 'airbus',\n",
|
||||
" mode = 'mount')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create estimator for data clean\n",
|
||||
"script_params = {\n",
|
||||
" '--data_folder': data_folder\n",
|
||||
"}\n",
|
||||
"est_data = PyTorch(source_directory = script_folder,\n",
|
||||
" compute_target = compute_target,\n",
|
||||
" entry_script = 'clean-data.py', # python script for cleaning\n",
|
||||
" script_params = script_params,\n",
|
||||
" use_gpu = False,\n",
|
||||
" node_count=1,\n",
|
||||
" pip_packages = ['fastai'])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Submit for running\n",
|
||||
"data_run = exp_data.submit(est_data)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Show run details\n",
|
||||
"RunDetails(data_run).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Ship/No ship classification"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create experiment for classification\n",
|
||||
"exp_class = Experiment(workspace = workspace, name = 'classification')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Data reference for classification data\n",
|
||||
"class_data_folder = DataReference(\n",
|
||||
" datastore=datastore,\n",
|
||||
" data_reference_name=\"airbus_class\",\n",
|
||||
" path_on_datastore = 'airbus/class',\n",
|
||||
" mode = 'mount')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Estimator for classification\n",
|
||||
"from azureml.train.dnn import PyTorch, Mpi\n",
|
||||
"\n",
|
||||
"script_params = {\n",
|
||||
" '--data_folder': class_data_folder,\n",
|
||||
" '--num_epochs': 5\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"est_class = PyTorch(source_directory = script_folder,\n",
|
||||
" compute_target = compute_target,\n",
|
||||
" entry_script = 'classification.py', # Classification script\n",
|
||||
" script_params = script_params,\n",
|
||||
" use_gpu = True,\n",
|
||||
" node_count=3, # 3 nodes are used\n",
|
||||
" distributed_training=Mpi(process_count_per_node = 4), # 4 GPU's per node\n",
|
||||
" pip_packages = ['fastai'])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Define the hyper drive for parameter tuning\n",
|
||||
"param_sampling = GridParameterSampling({\n",
|
||||
"    '--start_learning_rate': choice(0.0001, 0.001),\n",
|
||||
"    '--end_learning_rate': choice(0.01, 0.1)})\n",
|
||||
"\n",
|
||||
"hyperdrive_class = HyperDriveConfig(estimator = est_class,\n",
|
||||
" hyperparameter_sampling = param_sampling,\n",
|
||||
" policy = None,\n",
|
||||
" primary_metric_name = 'dice',\n",
|
||||
" primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,\n",
|
||||
" max_total_runs = 4,\n",
|
||||
" max_concurrent_runs = 4)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Kick off running\n",
|
||||
"classification_run = exp_class.submit(hyperdrive_class)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Show running details\n",
|
||||
"RunDetails(classification_run).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Get results for all runs\n",
|
||||
"classification_run.wait_for_completion(show_output = False)\n",
|
||||
"\n",
|
||||
"children = list(classification_run.get_children())\n",
|
||||
"metricslist = {}\n",
|
||||
"i = 0\n",
|
||||
"\n",
|
||||
"for single_run in children:\n",
|
||||
" results = {k: np.min(v) for k, v in single_run.get_metrics().items() if (k in ['dice', 'loss']) and isinstance(v, float)}\n",
|
||||
" parameters = single_run.get_details()['runDefinition']['arguments']\n",
|
||||
" try:\n",
|
||||
" results['start_learning_rate'] = parameters[5]\n",
|
||||
" results['end_learning_rate'] = parameters[7]\n",
|
||||
" metricslist[i] = results\n",
|
||||
" i += 1\n",
|
||||
"    except IndexError:\n",
|
||||
" pass\n",
|
||||
"\n",
|
||||
"rundata = pd.DataFrame(metricslist).sort_index(axis=1).T.sort_values(by = ['loss'], ascending = True)\n",
|
||||
"rundata"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Show best run\n",
|
||||
"best_run = classification_run.get_best_run_by_primary_metric()\n",
|
||||
"best_run.get_file_names()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"_uuid": "f487dd77687f4edb070bd5d2dc9da9a001d62bdb"
|
||||
},
|
||||
"source": [
|
||||
"### Ship segmentation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Data reference for segmentation\n",
|
||||
"sgmt_data_folder = DataReference(\n",
|
||||
" datastore=datastore,\n",
|
||||
" data_reference_name=\"airbus_segmentation\",\n",
|
||||
" path_on_datastore = 'airbus/segmentation',\n",
|
||||
" mode = 'mount')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Experiment for segmentation\n",
|
||||
"exp_sgmt = Experiment(workspace = workspace, name = 'segmentation')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"_uuid": "56ed39146115a4767a257fec60a3b367284fa0d6"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Estimator for segmentation\n",
|
||||
"segmt_script_params = {\n",
|
||||
" '--data_folder': sgmt_data_folder,\n",
|
||||
" '--img_folder': '256-filter99',\n",
|
||||
" '--num_epochs': 12\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"segmt_est = PyTorch(source_directory = script_folder,\n",
|
||||
" compute_target = compute_target,\n",
|
||||
" entry_script = 'segmentation.py', # Segmentation script\n",
|
||||
" script_params = segmt_script_params,\n",
|
||||
" use_gpu = True,\n",
|
||||
" node_count=4, # 4 nodes\n",
|
||||
" distributed_training=Mpi(process_count_per_node = 4), # 4 GPU's per node\n",
|
||||
" pip_packages = ['fastai'])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"_uuid": "25fa3283c992696575914a5fdb6ebc433a0b5d1f"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Kick off running\n",
|
||||
"segmentation_run = exp_sgmt.submit(config=segmt_est)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Running detail\n",
|
||||
"RunDetails(segmentation_run).show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"_uuid": "193381699f5595c916647bfd6c51eaeba699379d"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Results\n",
|
||||
"segmentation_run.wait_for_completion(show_output=False) # specify True for a verbose log\n",
|
||||
"print(segmentation_run.get_file_names())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Register model\n",
|
||||
"model = segmentation_run.register_model(model_name='segmentation-99',\n",
|
||||
" tags={'ship': 'min99'},\n",
|
||||
" model_path='outputs/segmentation.pkl')\n",
|
||||
"print(model.name, model.id, model.version, sep='\\t')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"_uuid": "a5af78f6512ab4f514818ef47b3481ef67a65e46"
|
||||
},
|
||||
"source": [
|
||||
"### Prediction\n",
|
||||
"Sample code for prediction"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Read image\n",
|
||||
"size = 256\n",
|
||||
"ifile = '<Test image>'\n",
|
||||
"img = open_image(ifile)\n",
|
||||
"img = img.resize(size)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Prediction\n",
|
||||
"model_path = '<The model path>'\n",
|
||||
"learn = load_learner(model_path)\n",
|
||||
"pred = learn.predict(img)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.5"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
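
The hyperparameter grid defined in the notebook has two choices for each of two parameters, so a grid search yields four combinations, which matches `max_total_runs = 4`. A minimal sketch in plain Python (no Azure ML calls; the variable names are illustrative) enumerates them:

```python
from itertools import product

# The same candidate values as in the notebook's GridParameterSampling
start_rates = [0.0001, 0.001]
end_rates = [0.01, 0.1]

# Every (start, end) pair is one child run of the sweep
grid = [
    {"start_learning_rate": start, "end_learning_rate": end}
    for start, end in product(start_rates, end_rates)
]

for combo in grid:
    print(combo)
```

Adding a third value to either list would grow the grid to six runs, so `max_total_runs` would need to grow with it.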
|
|
@ -0,0 +1,8 @@
|
|||
name: project_environment
dependencies:
  - python>3.6.2
  - torch>1.0
  - pip:
    # You must list azureml-defaults as a pip dependency
    - fastai
    - azureml-sdk[notebooks,automl]
|
|
@ -0,0 +1,36 @@
|
|||
import os


def set_environment_variables_for_nccl_backend(single_node=False):
    # Map the OpenMPI variables to the names torch.distributed expects
    os.environ['RANK'] = os.environ['OMPI_COMM_WORLD_RANK']
    os.environ['WORLD_SIZE'] = os.environ['OMPI_COMM_WORLD_SIZE']

    if not single_node:
        # AZ_BATCH_MASTER_NODE has the form "<address>:<port>"
        master_node_params = os.environ['AZ_BATCH_MASTER_NODE'].split(':')
        os.environ['MASTER_ADDR'] = master_node_params[0]
        os.environ['MASTER_PORT'] = master_node_params[1]
    else:
        os.environ['MASTER_ADDR'] = os.environ['AZ_BATCHAI_MPI_MASTER_NODE']
        os.environ['MASTER_PORT'] = '54965'
    # Use .get so an unset variable is reported instead of raising KeyError
    print('NCCL_SOCKET_IFNAME original value = {}'.format(os.environ.get('NCCL_SOCKET_IFNAME')))
    # TODO make this parameterizable
    os.environ['NCCL_SOCKET_IFNAME'] = '^docker0,lo'

    print('RANK = {}'.format(os.environ['RANK']))
    print('WORLD_SIZE = {}'.format(os.environ['WORLD_SIZE']))
    print('MASTER_ADDR = {}'.format(os.environ['MASTER_ADDR']))
    print('MASTER_PORT = {}'.format(os.environ['MASTER_PORT']))
    # print('MASTER_NODE = {}'.format(os.environ['MASTER_NODE']))
    print('NCCL_SOCKET_IFNAME new value = {}'.format(os.environ['NCCL_SOCKET_IFNAME']))


def get_local_rank():
    return int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])


def get_global_size():
    return int(os.environ['OMPI_COMM_WORLD_SIZE'])


def get_local_size():
    return int(os.environ['OMPI_COMM_WORLD_LOCAL_SIZE'])


def get_world_size():
    return int(os.environ['WORLD_SIZE'])
|
||||
|
|
@ -0,0 +1,66 @@
|
|||
import numpy as np
|
||||
import fastai
|
||||
from fastai.vision import *
|
||||
from fastai.callbacks.hooks import *
|
||||
from fastai.callbacks.mem import PeakMemMetric
|
||||
from fastai.distributed import *
|
||||
|
||||
import os, argparse, time, random
|
||||
from azureml.core import Workspace, Run, Dataset
|
||||
|
||||
from azureml_adapter import set_environment_variables_for_nccl_backend, get_local_rank, get_global_size, get_local_size
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--data_folder', type=str, dest='data_folder', default='')
|
||||
parser.add_argument('--img_size', type=int, dest='img_size', default=256)
|
||||
parser.add_argument('--batch_size', type=int, dest='banch_size', default=64)
|
||||
parser.add_argument('--num_epochs', type=int, dest='num_epochs', default=12)
|
||||
parser.add_argument('--start_learning_rate', type=float, dest='start_learning_rate', default=0.001)
|
||||
parser.add_argument('--end_learning_rate', type=float, dest='end_learning_rate', default=0.01)
|
||||
parser.add_argument('--pct_start', type=float, dest='pct_start', default=0.9)
|
||||
args = parser.parse_args()
|
||||
|
||||
local_rank = -1
|
||||
local_rank = get_local_rank()
|
||||
global_size = get_global_size()
|
||||
local_size = get_local_size()
|
||||
|
||||
# TODO use logger
|
||||
print('local_rank = {}'.format(local_rank))
|
||||
print('global_size = {}'.format(global_size))
|
||||
print('local_size = {}'.format(local_size))
|
||||
|
||||
set_environment_variables_for_nccl_backend(local_size == global_size)
|
||||
torch.cuda.set_device(local_rank)
|
||||
torch.distributed.init_process_group(backend='nccl', init_method='env://')
|
||||
rank = int(os.environ['RANK'])
|
||||
|
||||
data_folder = args.data_folder
|
||||
sz = args.img_size
|
||||
bs = args.banch_size
|
||||
print('Data folder:', data_folder)
|
||||
|
||||
run = Run.get_context()
|
||||
work_folder = os.getcwd()
|
||||
print('Work directory: ', work_folder)
|
||||
|
||||
data = ImageDataBunch.from_folder(data_folder, train=".", valid_pct=0.2,
                                  ds_tfms=get_transforms(), size=sz, bs = bs, num_workers=8).normalize(imagenet_stats)
|
||||
|
||||
learn = cnn_learner(data, models.resnet34, metrics=dice).to_distributed(local_rank)
|
||||
learn.fit_one_cycle(args.num_epochs, slice(args.start_learning_rate,args.end_learning_rate))
|
||||
|
||||
#learn.unfreeze()
|
||||
#learn.fit_one_cycle(5, slice(1e-5), pct_start=0.8)
|
||||
|
||||
result = learn.validate()
|
||||
# np.float was removed from recent NumPy releases; the built-in float is equivalent here
run.log('Worker #{:} loss'.format(rank), float(result[0]))
run.log('Worker #{:} dice'.format(rank), float(result[1]))
|
||||
|
||||
os.chdir(work_folder)
|
||||
if rank == 0:
    # Only the first worker logs the metrics that HyperDrive tracks
    run.log('loss', float(result[0]))
    run.log('dice', float(result[1]))
|
||||
|
||||
#filename = 'outputs/classification.pkl'
|
||||
#learn.export(outputs/)
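
The training call above passes `slice(start_learning_rate, end_learning_rate)` to `fit_one_cycle`. The sketch below assumes, based on our reading of fastai v1, that such a slice is spread over the model's layer groups on a geometric scale, so the earliest layers train with the smallest rate and the head with the largest. The helper `even_multiples` and the choice of three layer groups are illustrative assumptions, not code taken from fastai or from this repo.

```python
def even_multiples(start: float, stop: float, n: int) -> list:
    """Return n values from start to stop, evenly spaced on a log scale."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]


# Defaults of classification.py: start 0.001, end 0.01; 3 groups is an assumption
rates = even_multiples(0.001, 0.01, 3)
for group, rate in enumerate(rates):
    print('layer group {}: lr = {:.5f}'.format(group, rate))
```

Under these assumptions the middle group receives roughly 0.00316, the geometric mean of the two end points.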
|
|
@ -0,0 +1,134 @@
|
|||
import numpy as np
|
||||
import fastai
|
||||
from fastai.vision import *
|
||||
from fastai.callbacks.hooks import *
|
||||
|
||||
import os, glob, argparse, time, random, math
|
||||
|
||||
from azureml.core import Workspace, Run, Dataset
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--data_folder', type=str, dest='data_folder')
|
||||
parser.add_argument('--org_size', type=int, dest='org_size', default=768)
|
||||
parser.add_argument('--train_folder', type=str, dest='train_folder', default='train_v2')
|
||||
parser.add_argument('--train_sgmtfile', type=str, dest='train_sgmtfile', default='train_ship_segmentations_v2.csv')
|
||||
parser.add_argument('--class_folder', type=str, dest='class_folder', default='class')
|
||||
parser.add_argument('--img_size', type=int, dest='img_size', default=256)
|
||||
parser.add_argument('--min_area', type=int, dest='min_area', default=99)
|
||||
parser.add_argument('--sgmtimg_folder', type=str, dest='sgmtimg_folder', default='256-filter99')
|
||||
parser.add_argument('--sgmtlabel_folder', type=str, dest='sgmtlabel_folder', default='256-label')
|
||||
args = parser.parse_args()
|
||||
|
||||
run = Run.get_context()
|
||||
|
||||
data_folder = args.data_folder
|
||||
print('Data folder: ', data_folder)
|
||||
|
||||
train_folder = os.path.join(data_folder, args.train_folder)
|
||||
SEGMENTATION = os.path.join(data_folder, args.train_sgmtfile)
|
||||
|
||||
# Clean images
print('Searching for broken images.............')
brokenfiles = []
for fpath in glob.glob(os.path.join(train_folder, '*.jpg')):
    try:
        img = open_image(fpath)
    except Exception:
        # Any file that cannot be opened is treated as broken
        fn = os.path.basename(fpath)
        print(fn, ' is broken')
        brokenfiles.append(fn)
print(len(brokenfiles), ' images are broken')
|
||||
|
||||
print('Moving broken images.........')
broken_folder = os.path.join(train_folder, 'broken')
os.makedirs(broken_folder, exist_ok=True)
for fn in brokenfiles:
    orig_name = os.path.join(train_folder, fn)
    new_name = os.path.join(broken_folder, fn)
    os.rename(orig_name, new_name)
|
||||
|
||||
# Divide images into Ship and NoShip
print('Split images to ship & no-ship folder .........')
df_masks = pd.read_csv(SEGMENTATION, index_col='ImageId')

class_folder = os.path.join(train_folder, args.class_folder)

ship_folder = os.path.join(class_folder, 'ship')
noship_folder = os.path.join(class_folder, 'no-ship')
# Make sure the target folders exist before moving files into them
os.makedirs(ship_folder, exist_ok=True)
os.makedirs(noship_folder, exist_ok=True)

for fpath in glob.glob(os.path.join(train_folder, '*.jpg')):
    fn = os.path.basename(fpath)
    # An image with several ships has several rows, so select a Series and
    # treat the image as "ship" if any of its rows carries an encoded mask
    if df_masks.loc[[fn], 'EncodedPixels'].notna().any():
        tpath = os.path.join(ship_folder, fn)
    else:
        tpath = os.path.join(noship_folder, fn)

    os.rename(fpath, tpath)
|
||||
|
||||
print('Generating label files............')
sz_enc = [args.org_size, args.org_size]

def enc2mask(masks, shape = sz_enc):
    # Decode run-length encoded masks ("start length start length ...",
    # with 1-based start positions) into a binary image
    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)

    if(type(masks) == float): return img.reshape(shape)
    if(type(masks) == str): masks = [masks]
    for mask in masks:
        s = mask.split()
        for i in range(len(s)//2):
            start = int(s[2*i]) - 1
            length = int(s[2*i+1])
            img[start:start+length] = 1
    # The runs are laid out column by column, hence the transpose
    return img.reshape(shape).T
|
||||
|
||||
label_folder = os.path.join(train_folder, 'label')
# Make sure the label folder exists before saving masks into it
os.makedirs(label_folder, exist_ok=True)

for fpath in glob.glob(os.path.join(ship_folder, '*.jpg')):
    fn = os.path.basename(fpath)
    labelpath = os.path.join(label_folder, Path(fn).stem + '.png')

    mask = enc2mask(df_masks.loc[fn,'EncodedPixels'])
    maskimg = PIL.Image.fromarray(mask)
    maskimg.save(labelpath)
|
||||
|
||||
|
||||
def SummaryLabelArea(label_root):
    min_area = 1000000
    area_hist = np.zeros(20, int)
    for fpath in glob.glob(os.path.join(label_root, '*.png')):
        mask = open_mask(fpath)
        area = mask.data.sum()
        area_hist[int(math.log2(area))] += 1
        if area < min_area: min_area = area

    print('Min area is ', min_area)
    print(area_hist / np.sum(area_hist))

    return min_area, area_hist

SummaryLabelArea(label_folder)
|
||||
|
||||
print('Resizing images and labels .........')
def ResizeTrainLabel(train_root, label_root, dest_train_root, dest_label_root, size, min_area = 0):
    # Make sure the destination folders exist before saving into them
    os.makedirs(dest_train_root, exist_ok=True)
    os.makedirs(dest_label_root, exist_ok=True)

    for fpathstr in glob.glob(os.path.join(train_root, '*.jpg')):
        fpath = Path(fpathstr)
        lpath = os.path.join(label_root, fpath.stem + '.png')

        mask = open_mask(lpath)
        mask = mask.resize(size)

        # Keep only images whose resized mask is larger than min_area
        if mask.data.sum() > min_area:
            dest_lpath = os.path.join(dest_label_root, fpath.stem + '.png')
            mask.save(dest_lpath)

            img = open_image(fpath)
            img = img.resize(size)

            dest_fpath = os.path.join(dest_train_root, fpath.stem + '.jpg')
            img.save(dest_fpath)

sgmtimg_folder = os.path.join(train_folder, args.sgmtimg_folder)
sgmtlabel_folder = os.path.join(train_folder, args.sgmtlabel_folder)

ResizeTrainLabel(ship_folder, label_folder, sgmtimg_folder, sgmtlabel_folder, args.img_size, args.min_area)
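
To make the run-length decoding in `enc2mask` concrete, here is a small worked example on a 4 x 4 image. It is a standalone sketch: the function name `decode_rle` and the sample string are made up for illustration, and it follows the same convention as the script (pairs of 1-based start position and run length, filled column by column).

```python
import numpy as np


def decode_rle(rle: str, shape) -> np.ndarray:
    """Decode a 'start length start length ...' string into a binary mask."""
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    tokens = [int(t) for t in rle.split()]
    for start, length in zip(tokens[0::2], tokens[1::2]):
        # Start positions are 1-based, so shift by one for NumPy indexing
        flat[start - 1:start - 1 + length] = 1
    # Runs go down the columns, so reshape and transpose
    return flat.reshape(shape).T


# Two runs: 3 pixels starting at position 2, and 2 pixels starting at position 10
mask = decode_rle("2 3 10 2", (4, 4))
print(mask)
print('ship area:', int(mask.sum()))
```

The first run fills rows 1 to 3 of the first column and the second run fills rows 1 and 2 of the third column, giving a ship area of 5 pixels.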
|
|
@ -0,0 +1,119 @@
|
|||
import numpy as np
|
||||
import fastai
|
||||
from fastai.vision import *
|
||||
from fastai.callbacks.hooks import *
|
||||
from fastai.callbacks.mem import PeakMemMetric
|
||||
from fastai.distributed import *
|
||||
|
||||
import os, argparse, time, random
|
||||
from azureml.core import Workspace, Run, Dataset
|
||||
|
||||
from azureml_adapter import set_environment_variables_for_nccl_backend, get_local_rank, get_global_size, get_local_size
|
||||
|
||||
def dice_loss(input, target):
    # Returns the smoothed Dice coefficient (higher is better);
    # MixedLoss below turns it into a loss by taking the negative log
    #input = torch.sigmoid(input)
    smooth = 1.0

    iflat = input.flatten()
    tflat = target.flatten()
    intersection = (iflat * tflat).sum()

    return ((2.0 * intersection + smooth) / (iflat.sum() + tflat.sum() + smooth))
|
||||
|
class FocalLoss(nn.Module):
    def __init__(self, gamma):
        super().__init__()
        self.gamma = gamma

    def forward(self, input, target):
        if not (target.size() == input.size()):
            raise ValueError("Target size ({}) must be the same as input size ({})"
                             .format(target.size(), input.size()))

        # Numerically stable binary cross-entropy on raw logits.
        max_val = (-input).clamp(min=0)
        loss = input - input * target + max_val + \
            ((-max_val).exp() + (-input - max_val).exp()).log()

        # Focal weight (1 - p_t) ** gamma, computed in log space.
        invprobs = F.logsigmoid(-input * (target * 2.0 - 1.0))
        loss = (invprobs * self.gamma).exp() * loss

        return loss.mean()
||||
|
class MixedLoss(nn.Module):
    def __init__(self, alpha, gamma):
        super().__init__()
        self.alpha = alpha
        self.focal = FocalLoss(gamma)

    def forward(self, input, target):
        # Keep only the foreground ('Ship') channel of the softmax output.
        input = F.softmax(input, dim=1)[:,1:,:,:]
        # Convert the probabilities back to logits for FocalLoss; 1e-7 avoids log(0).
        input2 = torch.log((input+1e-7)/(1-input+1e-7))

        loss = self.alpha*self.focal(input2, target) - torch.log(dice_loss(input, target))
        return loss
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--data_folder', type=str, dest='data_folder', default='')
|
||||
parser.add_argument('--label_folder', type=str, dest='label_folder', default='256-label')
|
||||
parser.add_argument('--img_folder', type=str, dest='img_folder', default='256-filter')
|
||||
parser.add_argument('--img_size', type=int, dest='img_size', default=256)
|
parser.add_argument('--batch_size', type=int, dest='batch_size', default=16)
parser.add_argument('--num_epochs', type=int, dest='num_epochs', default=12)
parser.add_argument('--start_learning_rate', type=float, dest='start_learning_rate', default=0.000001)
parser.add_argument('--end_learning_rate', type=float, dest='end_learning_rate', default=0.001)
args = parser.parse_args()

local_rank = get_local_rank()
global_size = get_global_size()
local_size = get_local_size()

# TODO use logger
print('local_rank = {}'.format(local_rank))
print('global_size = {}'.format(global_size))
print('local_size = {}'.format(local_size))

set_environment_variables_for_nccl_backend(local_size == global_size)
torch.cuda.set_device(local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
rank = int(os.environ['RANK'])

data_folder = args.data_folder
sz = args.img_size
bs = args.batch_size
||||
print('Data folder:', data_folder)
|
||||
|
||||
run = Run.get_context()
|
||||
work_folder = os.getcwd()
|
||||
print('Work directory: ', work_folder)
|
||||
|
||||
label_path = Path(os.path.join(data_folder, args.label_folder))
|
||||
get_y_fn = lambda x: label_path/f'{x.stem}.png'
|
||||
tfms = get_transforms(max_rotate = 10, max_lighting = 0.05, max_warp = 0.2, flip_vert = True,
|
||||
p_affine = 1., p_lighting = 1)
|
||||
|
||||
img_path = os.path.join(data_folder, args.img_folder)
|
||||
data = (SegmentationItemList.from_folder(img_path)
|
||||
.split_by_rand_pct(0.2)
|
||||
.label_from_func(get_y_fn, classes=['Background','Ship'])
|
||||
.transform(tfms, size=sz, tfm_y=True)
|
||||
.databunch(path=Path('.'), bs=bs, num_workers=0)
|
||||
.normalize(imagenet_stats))
|
||||
|
||||
learn = unet_learner(data, models.resnet34, loss_func=MixedLoss(10.0,2.0), metrics=dice, wd=1e-7).to_distributed(local_rank)
|
||||
learn.fit_one_cycle(args.num_epochs, slice(args.start_learning_rate,args.end_learning_rate))
|
||||
|
||||
#learn.unfreeze()
|
||||
#learn.fit_one_cycle(args.num_epochs, slice(args.start_learning_rate,args.end_learning_rate))
|
||||
|
||||
result = learn.validate()
|
||||
run.log('Worker #{:} loss'.format(rank), float(result[0]))
|
||||
run.log('Worker #{:} dice'.format(rank), float(result[1]))
|
||||
|
||||
if rank == 0:
|
||||
run.log('loss', float(result[0]))
|
||||
run.log('dice', float(result[1]))
|
||||
|
||||
os.chdir(work_folder)
|
||||
filename = 'outputs/segmentation.pkl'
|
||||
learn.export(filename)
|
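The training script's `MixedLoss` combines a focal loss, scaled by `alpha`, with the negative log of the smoothed Dice coefficient. The sketch below reproduces that arithmetic on flat Python lists so the terms can be checked by hand. It is an illustration under stated assumptions, not the script's implementation: it works on per-pixel foreground probabilities instead of tensors, and the names `dice_coefficient`, `log_sigmoid`, `focal_loss`, and `mixed_loss` are hypothetical.

```python
import math

def dice_coefficient(probs, targets, smooth=1.0):
    """Smoothed Dice coefficient for flat sequences of probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

def focal_loss(logits, targets, gamma):
    """Mean binary focal loss on raw logits."""
    total = 0.0
    for x, t in zip(logits, targets):
        # Stable binary cross-entropy with logits, same algebra as FocalLoss.forward.
        max_val = max(-x, 0.0)
        bce = x - x * t + max_val + math.log(math.exp(-max_val) + math.exp(-x - max_val))
        # Focal weight (1 - p_t) ** gamma, computed in log space.
        weight = math.exp(gamma * log_sigmoid(-x * (2.0 * t - 1.0)))
        total += weight * bce
    return total / len(logits)

def mixed_loss(probs, targets, alpha, gamma, eps=1e-7):
    """alpha * focal loss (on logits recovered from probs) minus log of the Dice coefficient."""
    logits = [math.log((p + eps) / (1.0 - p + eps)) for p in probs]
    return alpha * focal_loss(logits, targets, gamma) - math.log(dice_coefficient(probs, targets))

targets = [1, 1, 0, 0]
print(mixed_loss([0.9, 0.9, 0.1, 0.1], targets, alpha=10.0, gamma=2.0))  # confident and correct
print(mixed_loss([0.1, 0.1, 0.9, 0.9], targets, alpha=10.0, gamma=2.0))  # confident and wrong
```

The focal term down-weights pixels that are already classified well, while the `-log(dice)` term rewards overlap across the whole mask, which is the usual motivation for mixing the two when the foreground is small.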