Merge pull request #2 from Azure/dev

asof 2021-11-15
2021-11-15 00:56:30 +09:00 · 2021-11-15 00:56:30 +09:00 · fd96bb2d35
--- a/README.md
+++ b/README.md
@ -1,18 +1,7 @@
-# Machine Learning at Scale on Machine Learning
+# Machine Learning at Scale on Azure Machine Learning

 This repo is a collection of codes/notebooks for machine learning at scale on Azure Machine Learning. 

-
-## Getting Started
-
-### setup base environment*
-
-1. setup Azure Machine Learning Workspace in your Azure subscription.
-2. download and install Visual Studio Code in your client PC. And install Azure Machine Learning Extension.
-3. create new Azure ML Compute Instance and launch Visual Studio Code
-
-*These steps are not mandatory. There are other ways to setup environment. But this repo is assumed to follow the above steps.
-
 ## Contents

 ### Train
@ -28,6 +17,18 @@ This repo is a collection of codes/notebooks for machine learning at scale on Az

 Your contribution is welcome !

+## Getting Started
+
+### Set up base environment*
+
+1. Set up an Azure Machine Learning workspace in your Azure subscription. See [Manage Azure Machine Learning workspaces in the portal or with the Python SDK](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=python) for more details.
+2. Download and install Visual Studio Code on your client PC, then install the Azure Machine Learning extension. See [Azure Machine Learning in VS Code](https://code.visualstudio.com/docs/datascience/azure-machine-learning) for more details.
+3. Create a new Azure ML Compute Instance and launch Visual Studio Code. See [Connect to an Azure Machine Learning compute instance in Visual Studio Code (preview) - Configure a remote compute instance](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-vs-code-remote?tabs=studio#configure-a-remote-compute-instance) for more details.
+4. Install Azure Machine Learning CLI 2.0. See the [installation document](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-cli) for more details.
+
+*These steps are not mandatory, and there are other ways to set up the environment. However, this repo assumes that you followed the steps above.
+
+

 ## Events
 [2021-11-18] [DB TECH SHOWCASE 2021 (Japan)](https://www.db-tech-showcase.com/2021/schedule/) - D14 Microsoft : Getting Started with Large-Scale Machine Learning on Azure Machine Learning (Azure Machine Learning で始める大規模機械学習) @ Keita Onabuta (女部田啓太)
--- a/docs/images/dask-dashboard.png
+++ b/docs/images/dask-dashboard.png
--- a/docs/images/flaml-tb-1.png
+++ b/docs/images/flaml-tb-1.png
--- a/docs/images/flaml-tb-2.png
+++ b/docs/images/flaml-tb-2.png
--- a/docs/images/ray-studio-1.png
+++ b/docs/images/ray-studio-1.png
--- a/docs/images/ray-studio-2.png
+++ b/docs/images/ray-studio-2.png
--- a/docs/images/torch-studio-1.png
+++ b/docs/images/torch-studio-1.png
--- a/docs/images/torch-studio-2.png
+++ b/docs/images/torch-studio-2.png
--- a/docs/images/torch-studio-3.png
+++ b/docs/images/torch-studio-3.png
--- a/examples/train/dask-lightgbm/README.md
+++ b/examples/train/dask-lightgbm/README.md
@ -2,15 +2,32 @@

 This example shows how to use DASK to train LightGBM models in distributed mode on Azure Machine Learning.

+## LightGBM DASK Distributed Training
+
+LightGBM supports distributed training with DASK, a distributed computing framework for Python. See the documents in the Reference section for more details.
+
 ## Prerequisites

 - Azure Machine Learning Workspace
    - Compute Clusters for DASK
    - Compute Instance with Azure ML CLI 2.0 installed

-## LightGBM DASK Distributed Training
+## Getting Started
+
+1. Create Compute Clusters for the DASK cluster in your Azure Machine Learning Workspace.
+2. Check the training scripts in the `src` directory.
+    - DASK cluster startup script : [startDask.py](src/startDask.py)
+    - LightGBM training script : [train-lgb-dask.py](src/train-lgb-dask.py)
+3. Open [job.yml](job.yml) and start the job from the VSCode Azure ML Extension or a terminal.
+
+    ```bash
+    cd examples/train/dask-lightgbm/
+    az ml job create --file job.yml --stream
+    ```
+
+4. Open Azure ML studio and check the experiment logs.
+    - In the experiment, `dask-report.html` in the Outputs + logs tab shows the DASK Performance Report, as shown below.<img src="../../../docs/images/dask-dashboard.png">

-LightGBM supports distributed training with DASK. DASK is a distributed computing framework for Python. See the following documents in reference section for more details.


 ## Reference
--- a/examples/train/dask-lightgbm/job.yml
+++ b/examples/train/dask-lightgbm/job.yml
@ -17,15 +17,15 @@ inputs:
    mode: mount

 environment: 
-  conda_file: file:conda.yml
-  docker: 
-    image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
+  conda_file: conda.yml
+  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04

-compute:
-  # use a sku with lots of disk space and memory
-  target: azureml:daskclusters
+# use a sku with lots of disk space and memory
+compute: azureml:daskclusters
+resources:
  instance_count: 5

+
 distribution:
  # The job below is currently launched with `type: pytorch` since that 
  # gives the full flexibility of assigning the work to the
--- a/examples/train/dask-lightgbm/src/startDask.py
+++ b/examples/train/dask-lightgbm/src/startDask.py
@ -1,3 +1,7 @@
+# The original source code is from the "Azure Machine Learning examples" repo.
+# url : https://github.com/Azure/azureml-examples/tree/main/cli/jobs/single-step/dask/nyctaxi
+
+
 import os
 import argparse
 import time
--- a/examples/train/nni-hyperband/README.md
+++ b/examples/train/nni-hyperband/README.md
@ -2,6 +2,12 @@

 This example shows how to use NNI to perform hyperparameter tuning with HyperBand on Azure Machine Learning.

+
+## HPO with NNI
+
+Neural Network Intelligence (NNI) is a library that provides a unified interface for hyperparameter optimization. It includes many tuning algorithms. See the links in the Reference section for more details.
+
+
 ## Prerequisites

 - Azure Machine Learning Workspace
@ -10,26 +16,69 @@ This example shows how to use NNI to perform hyperparameter tuning with HyperBan

 ## Getting Started

-1. Install NNI library in your Compute Instance.
+1. Create a conda environment.

-```bash
-pip install nni
-```
+  ```bash
+  conda create -n nni python=3.6
+  conda init bash
+  source ~/.bashrc
+  conda activate nni
+  ```

-2. Create a train script and a configuration file.
+2. Install the NNI library on your Compute Instance.

-3. Start trial job.
+  ```bash
+  pip install nni==2.5 azureml-sdk==1.35.0
+  ```
+3. Create Compute Clusters in your Azure Machine Learning Workspace.

-```bash
-nnictl create --config config_hyperband.yml --port 8888
-```
+<!-- TODO: Azure CLI and YML to create Compute Clusters -->

-4. Access to dashboard.
+4. Create a training script, a job configuration file and a search space configuration file.
+
+- training script : train.py
+
+<!-- TODO: explain the points of the codes-->
+
+- job configuration file : config_hyperband.yml
+
+  Configure TrainingService to use Azure Machine Learning as the platform.
+
+  ```yml
+  TrainingService:
+    platform: aml
+    dockerImage: msranni/nni  # modify this if you bring your own docker image
+    subscriptionId: <your subscription ID>
+    resourceGroup: <azure machine learning workspace resource group>
+    workspaceName: <azure machine learning workspace name>
+    computeTarget: <compute cluster name>
+  ```
+
+- search space configuration file : search_space.json
+
+  Define the search space of your hyperparameters in a JSON file.
+
+  ```json
+  {
+      "dropout_rate":{"_type":"uniform","_value":[0.5,0.9]},
+      "conv_size":{"_type":"choice","_value":[2,3,5,7]},
+      "hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
+      "batch_size": {"_type":"choice","_value":[8, 16, 32, 64]},
+      "learning_rate":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]}
+  }
+  ```
+
+5. Start the trial job.
+
+  ```bash
+  nnictl create --config config_hyperband.yml --port 8088
+  ```
+
+6. Open the dashboard.
+  - The NNI Dashboard runs on the Compute Instance. You can reach it from your client PC at a URL like `https://<your compute instance name>-8088.<region>.instances.azureml.ms`. For example, if your Compute Instance name is "client" and the region is "japaneast", the URL is `https://client-8088.japaneast.instances.azureml.ms`.


-## HPO with NNI

-Neural Network Intelligence (NNI) is a library that provides a unified interface for hyperparameter optimization. Many tuning algorithm is included. See the following link for more details in reference section.

 ## Reference

--- a/examples/train/nni-hyperband/config_hyperband.yml
+++ b/examples/train/nni-hyperband/config_hyperband.yml
@ -19,4 +19,4 @@ TrainingService:
  subscriptionId: 82a5d8d3-5322-4c49-b9d6-da6e00be5d57
  resourceGroup: azureml-automl
  workspaceName: azureml-automl
-  computeTarget: cpuclusters
+  computeTarget: gpuclusters
--- a/examples/train/nni-hyperband/search_space.json
+++ b/examples/train/nni-hyperband/search_space.json
@ -0,0 +1,7 @@
+{
+    "dropout_rate":{"_type":"uniform","_value":[0.5,0.9]},
+    "conv_size":{"_type":"choice","_value":[2,3,5,7]},
+    "hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
+    "batch_size": {"_type":"choice","_value":[8, 16, 32, 64]},
+    "learning_rate":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]}
+}
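The search space above uses two sampling types: `uniform`, which draws a float from a range, and `choice`, which picks one of the listed values. As a rough illustration of how the parameters for a single trial might be drawn, here is a standalone sketch that uses only the Python standard library. It is not NNI's own tuner, and `sample_configuration` is a hypothetical helper name introduced for this example.

```python
import json
import random

# Same content as search_space.json, inlined so the sketch is self-contained.
SEARCH_SPACE_JSON = """
{
    "dropout_rate": {"_type": "uniform", "_value": [0.5, 0.9]},
    "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
    "hidden_size": {"_type": "choice", "_value": [124, 512, 1024]},
    "batch_size": {"_type": "choice", "_value": [8, 16, 32, 64]},
    "learning_rate": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]}
}
"""


def sample_configuration(search_space, rng):
    """Draw one value per hyperparameter (hypothetical helper, not an NNI API)."""
    config = {}
    for name, spec in search_space.items():
        kind, value = spec["_type"], spec["_value"]
        if kind == "uniform":
            low, high = value
            config[name] = rng.uniform(low, high)  # float in [low, high]
        elif kind == "choice":
            config[name] = rng.choice(value)  # one of the listed options
        else:
            raise ValueError(f"unsupported _type: {kind}")
    return config


search_space = json.loads(SEARCH_SPACE_JSON)
config = sample_configuration(search_space, random.Random(0))
print(config)
```

In an actual NNI trial the tuner supplies these values to the training script; the sketch only shows what shape one sampled configuration has.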
--- a/examples/train/pytorch-ddp/README.md
+++ b/examples/train/pytorch-ddp/README.md
@ -8,6 +8,25 @@ This example shows how to use Distributed Data Parallel (DDP) with PyTorch on Az
    - Compute Clusters with GPU for distributed training
    - Compute Instance with Azure ML CLI 2.0 installed

+## Getting Started
+
+1. Create the training data by running the Python script, which creates the `data` folder.
+
+    ```bash
+    python dataprep.py
+    ```
+
+2. Create a job (Azure ML CLI 2.0 + YML configuration file) from the VSCode Azure ML Extension or a terminal.
+
+    ```bash
+    cd examples/train/pytorch-ddp/
+    az ml job create --file job.yml --stream
+    ```
+
+3. Open Azure ML studio and check the experiment logs.
+    - In the experiment, parameters and metrics are logged. You can also check system performance in the Monitoring tab, as shown below.<img src="../../../docs/images/torch-studio-1.png"><img src="../../../docs/images/torch-studio-2.png"><img src="../../../docs/images/torch-studio-3.png">
+
+
 ## Reference
 - [PyTorch Distributed Data Parallel (DDP)][1]
 [1]: https://pytorch.org/docs/stable/distributed.html
--- a/examples/train/pytorch-ddp/job.yml
+++ b/examples/train/pytorch-ddp/job.yml
@ -14,13 +14,14 @@ inputs:

 environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:3

-compute:
-  target: azureml:gpuclusters2
+compute: azureml:gpuclusters2
+
+resources:
  instance_count: 2

 distribution:
  type: pytorch 
-  process_count: 4
+  process_count_per_instance: 4

 experiment_name: pytorch-cifar-distributed-example
 description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
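With the values in this job (`instance_count: 2` and `process_count_per_instance: 4`), the run should start 2 × 4 = 8 training processes in total. The following is a minimal sketch of that arithmetic, assuming the usual PyTorch DDP layout of one process per GPU and contiguous rank numbering per node. `world_size` and `global_rank` are illustrative helper names, not Azure ML or PyTorch APIs.

```python
def world_size(instance_count: int, process_count_per_instance: int) -> int:
    """Total number of training processes across all nodes."""
    return instance_count * process_count_per_instance


def global_rank(node_rank: int, local_rank: int, process_count_per_instance: int) -> int:
    """Global rank of a process, assuming ranks are numbered node by node."""
    return node_rank * process_count_per_instance + local_rank


# Values taken from job.yml in this example.
total = world_size(instance_count=2, process_count_per_instance=4)
print(f"world size: {total}")

# All (node, local) pairs and the global rank each one would get.
ranks = [
    global_rank(node, local, process_count_per_instance=4)
    for node in range(2)
    for local in range(4)
]
print(f"global ranks: {ranks}")
```

If the cluster's VM size has fewer than four GPUs per node, `process_count_per_instance` would need to be lowered to match.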
--- a/examples/train/ray-flaml/README.md
+++ b/examples/train/ray-flaml/README.md
@ -2,14 +2,47 @@

 This example shows how to use FLAML to train a model on a dataset using RAY on Azure Machine Learning.

+## FLAML with RAY
+FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently and economically. FLAML supports Ray Tune for distributed search.
+
 ## Prerequisites
 - Azure Machine Learning Workspace
    - Compute Clusters for Ray
    - Compute Instance with Azure ML CLI 2.0 installed

-## FLAML with RAY
-FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently and economically. FLAML support Ray Tune for distributed search.
+## Getting Started

+1. Create a job (Azure ML CLI 2.0 + YML configuration file) from the VSCode Azure ML Extension or a terminal.
+
+    ```bash
+    cd examples/train/ray-flaml/
+    az ml job create --file job.yml --stream
+    ```
+
+2. Open Azure ML studio and check the experiment logs.
+    - In the experiment, parameters and metrics are logged. You can also check system performance in the Monitoring tab, as shown below.<img src="../../../docs/images/ray-studio-1.png"><img src="../../../docs/images/ray-studio-2.png">
+
+3. Launch TensorBoard to see the metrics.
+ - Install the Python packages for TensorBoard with Azure Machine Learning.
+
+    ```bash
+    pip install azureml-tensorboard tensorboard
+    ```
+
+ - Enter your Run ID in the first cell of [azureml-tensorboard.ipynb](azureml-tensorboard.ipynb).
+
+    ```python
+    # if your Run ID is xxxxxxx
+    run_id = "xxxxxxx"
+    run = Run.get(workspace=ws, run_id=run_id)
+    ```
+ - Start TensorBoard in the second cell and see the list of trials, as shown below.
+
+    ```python
+    tb = Tensorboard([run], local_root="logs/azureml", port=6006)
+    tb.start()
+    ```
+    <img src="../../../docs/images/flaml-tb-1.png"><img src="../../../docs/images/flaml-tb-2.png">


 ## Reference
--- a/examples/train/ray-flaml/azureml-tensorboard.ipynb
+++ b/examples/train/ray-flaml/azureml-tensorboard.ipynb
@ -2,39 +2,22 @@
 "cells": [
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from azureml.core import Workspace, Run\n",
    "from azureml.tensorboard import Tensorboard\n",
    "ws = Workspace.from_config()\n",
-    "run = Run.get(workspace=ws, run_id=\"cd0a70d1-aa17-4991-a9a5-98dfd0142663\")"
+    "run_id = \"\" # input your run id here\n",
+    "run = Run.get(workspace=ws, run_id=run_id)"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "https://automl-client-6006.japaneast.instances.azureml.ms\n"
-     ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "'https://automl-client-6006.japaneast.instances.azureml.ms'"
-      ]
-     },
-     "execution_count": 5,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
+   "outputs": [],
   "source": [
    "tb = Tensorboard([run], local_root=\"logs/azureml\", port=6006)\n",
    "tb.start()"
@ -42,34 +25,20 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tb.stop()"
   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
  }
 ],
 "metadata": {
  "interpreter": {
-   "hash": "be3ccc1aacd5cd0ada9eab9372c3d7f901636bca88db42a566c3f238cceb324c"
+   "hash": "c10229b7cc312618cfc9408206359580296dd866b8a068feaf7c6af0ab6fe085"
  },
  "kernelspec": {
-   "display_name": "Python 3.6.13 64-bit ('ray': conda)",
+   "display_name": "Python 3.6.13 64-bit ('nni': conda)",
   "name": "python3"
  },
  "language_info": {
--- a/examples/train/ray-flaml/job.yml
+++ b/examples/train/ray-flaml/job.yml
@ -9,12 +9,11 @@ command: >-
  --script train-automl-flaml.py
  
 environment: 
-  conda_file: file:conda.yml
-  docker: 
-    image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
+  conda_file: conda.yml
+  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04

-compute:
-  target: azureml:cpuclusters
+compute: azureml:cpuclusters
+resources:
  instance_count: 3

 distribution: