Merge pull request #2 from Azure/dev

asof 2021-11-15
konabuta 2021-11-15 00:56:30 +09:00 committed by GitHub
Parent 1542288f12 95f86fd9e2
Commit fd96bb2d35
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
20 changed files: 181 additions and 82 deletions


@ -1,18 +1,7 @@
# Machine Learning at Scale on Machine Learning
# Machine Learning at Scale on Azure Machine Learning
This repo is a collection of codes/notebooks for machine learning at scale on Azure Machine Learning.
## Getting Started
### setup base environment*
1. setup Azure Machine Learning Workspace in your Azure subscription.
2. download and install Visual Studio Code in your client PC. And install Azure Machine Learning Extension.
3. create new Azure ML Compute Instance and launch Visual Studio Code
*These steps are not mandatory. There are other ways to setup environment. But this repo is assumed to follow the above steps.
## Contents
### Train
@ -28,6 +17,18 @@ This repo is a collection of codes/notebooks for machine learning at scale on Az
Your contribution is welcome !
## Getting Started
### setup base environment*
1. Set up an Azure Machine Learning Workspace in your Azure subscription. See [Manage Azure Machine Learning workspaces in the portal or with the Python SDK](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=python) for more details.
2. Download and install Visual Studio Code on your client PC, then install the Azure Machine Learning Extension. See [Azure Machine Learning in VS Code](https://code.visualstudio.com/docs/datascience/azure-machine-learning) for more details.
3. Create a new Azure ML Compute Instance and launch Visual Studio Code. See [Connect to an Azure Machine Learning compute instance in Visual Studio Code (preview) - Configure a remote compute instance](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-vs-code-remote?tabs=studio#configure-a-remote-compute-instance) for more details.
4. Install Azure Machine Learning CLI 2.0. See [installation document](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-cli) for more details.
*These steps are not mandatory. There are other ways to set up the environment, but this repo assumes you have followed the steps above.
## Events
[2021-11-18] [DB TECH SHOWCASE 2021 (Japan)](https://www.db-tech-showcase.com/2021/schedule/) - D14 Microsoft : Azure Machine Learning で始める大規模機械学習 @ 女部田啓太

Binary files (not shown):
- docs/images/dask-dashboard.png (472 KiB)
- docs/images/flaml-tb-1.png (412 KiB)
- docs/images/flaml-tb-2.png (1.5 MiB)
- docs/images/ray-studio-1.png (408 KiB)
- docs/images/ray-studio-2.png (257 KiB)
- docs/images/torch-studio-1.png (458 KiB)
- docs/images/torch-studio-2.png (299 KiB)
- docs/images/torch-studio-3.png (298 KiB)


@ -2,15 +2,32 @@
This example shows how to use DASK to train LightGBM models in distributed mode on Azure Machine Learning.
## LightGBM DASK Distributed Training
LightGBM supports distributed training with DASK. DASK is a distributed computing framework for Python. See the documents in the Reference section for more details.
## Prerequisites
- Azure Machine Learning Workspace
- Compute Clusters for DASK
- Compute Instance with Azure ML CLI 2.0 installed
## LightGBM DASK Distributed Training
## Getting Started
1. Create Compute Clusters for DASK in your Azure Machine Learning Workspace.
2. Check the training scripts in the `src` directory.
- DASK cluster startup script : [startDask.py](src/startDask.py)
- LightGBM training script : [train-lgb-dask.py](src/train-lgb-dask.py)
3. Open [job.yml](job.yml) and start the job from the VSCode Azure ML Extension or a terminal.
```bash
cd examples/train/dask-lightgbm/
az ml job create --file job.yml --stream
```
4. Open Azure ML studio and check the Experiment logs.
- In Experiment, `dask-report.html` in the Outputs+logs tab shows the DASK Performance Report, as shown below.<img src="../../../docs/images/dask-dashboard.png">
LightGBM supports distributed training with DASK. DASK is a distributed computing framework for Python. See the following documents in reference section for more details.
## Reference


@ -17,15 +17,15 @@ inputs:
mode: mount
environment:
conda_file: file:conda.yml
docker:
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
conda_file: conda.yml
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
compute:
# use a sku with lots of disk space and memory
target: azureml:daskclusters
# use a sku with lots of disk space and memory
compute: azureml:daskclusters
resources:
instance_count: 5
distribution:
# The job below is currently launched with `type: pytorch` since that
# gives the full flexibility of assigning the work to the


@ -1,3 +1,7 @@
# original source code is from "Azure Machine Learning examples" repo.
# url : https://github.com/Azure/azureml-examples/tree/main/cli/jobs/single-step/dask/nyctaxi
import os
import argparse
import time


@ -2,6 +2,12 @@
This example shows how to use NNI to perform hyperparameter tuning with HyperBand on Azure Machine Learning.
## HPO with NNI
Neural Network Intelligence (NNI) is a library that provides a unified interface for hyperparameter optimization. It includes many tuning algorithms. See the links in the Reference section for more details.
## Prerequisites
- Azure Machine Learning Workspace
@ -10,26 +16,69 @@ This example shows how to use NNI to perform hyperparameter tuning with HyperBan
## Getting Started
1. Install NNI library in your Compute Instance.
1. Create conda environment
```bash
pip install nni
```
```bash
conda create -n nni python=3.6
conda init bash
source ~/.bashrc
conda activate nni
```
2. Create a train script and a configuration file.
2. Install NNI library in your Compute Instance.
3. Start trial job.
```bash
pip install nni==2.5 azureml-sdk==1.35.0
```
3. Create Compute Clusters in your Azure Machine Learning Workspace
```bash
nnictl create --config config_hyperband.yml --port 8888
```
<!-- TODO: Azure CLI and YML to create Compute Clusters -->
4. Access to dashboard.
4. Create a train script, a job configuration file and a search space configuration file.
- train script : train.py
<!-- TODO: explain the points of the codes-->
- job configuration file : config_hyperband.yml
Configure `TrainingService` to use Azure Machine Learning.
```yml
TrainingService:
  platform: aml
  dockerImage: msranni/nni # modify this if you bring your own docker image
  subscriptionId: <your subscription ID>
  resourceGroup: <azure machine learning workspace resource group>
  workspaceName: <azure machine learning workspace name>
  computeTarget: <compute cluster name>
```
- search space configuration file : search_space.json
Define the hyperparameter search space in a JSON file.
```json
{
"dropout_rate":{"_type":"uniform","_value":[0.5,0.9]},
"conv_size":{"_type":"choice","_value":[2,3,5,7]},
"hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
"batch_size": {"_type":"choice","_value":[8, 16, 32, 64]},
"learning_rate":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]}
}
```
5. Start trial job.
```bash
nnictl create --config config_hyperband.yml --port 8088
```
6. Access the dashboard.
- The NNI dashboard runs on the Compute Instance. You can access it from your client PC at a URL like `https://<your compute instance name>-8088.<region>.instances.azureml.ms`. For example, if your Compute Instance name is "client" and the region is "japaneast", use `https://client-8088.japaneast.instances.azureml.ms`.
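The URL pattern described in the last step can be sketched as a small helper. This is only an illustration of the pattern quoted in the README; the function name `compute_instance_app_url` is hypothetical and not part of any Azure SDK.

```python
def compute_instance_app_url(instance_name: str, port: int, region: str) -> str:
    """Build the URL of a web app served from a port on a Compute Instance.

    Hypothetical helper: it only formats the pattern
    https://<instance name>-<port>.<region>.instances.azureml.ms
    """
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"https://{instance_name}-{port}.{region}.instances.azureml.ms"


# NNI dashboard started with `--port 8088` on an instance named "client"
print(compute_instance_app_url("client", 8088, "japaneast"))
# prints https://client-8088.japaneast.instances.azureml.ms
```

The same pattern appears to apply to other ports served from the Compute Instance, such as port 6006 used for TensorBoard elsewhere in this commit.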
## HPO with NNI
Neural Network Intelligence (NNI) is a library that provides a unified interface for hyperparameter optimization. Many tuning algorithm is included. See the following link for more details in reference section.
## Reference


@ -19,4 +19,4 @@ TrainingService:
subscriptionId: 82a5d8d3-5322-4c49-b9d6-da6e00be5d57
resourceGroup: azureml-automl
workspaceName: azureml-automl
computeTarget: cpuclusters
computeTarget: gpuclusters


@ -0,0 +1,7 @@
{
"dropout_rate":{"_type":"uniform","_value":[0.5,0.9]},
"conv_size":{"_type":"choice","_value":[2,3,5,7]},
"hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
"batch_size": {"_type":"choice","_value":[8, 16, 32, 64]},
"learning_rate":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]}
}
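
To make the meaning of the `_type` entries in this search space concrete, the sketch below draws one random configuration from it using only the Python standard library. It is not NNI's HyperBand tuner and does not use the NNI API; `sample_config` is a hypothetical helper that assumes `uniform` means a uniform draw between the two bounds and `choice` means picking one of the listed values.

```python
import json
import random

# Same content as search_space.json above.
SEARCH_SPACE_JSON = """
{
  "dropout_rate": {"_type": "uniform", "_value": [0.5, 0.9]},
  "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
  "hidden_size": {"_type": "choice", "_value": [124, 512, 1024]},
  "batch_size": {"_type": "choice", "_value": [8, 16, 32, 64]},
  "learning_rate": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]}
}
"""


def sample_config(space: dict, rng: random.Random) -> dict:
    """Draw one hyperparameter configuration (hypothetical helper, not NNI code)."""
    config = {}
    for name, spec in space.items():
        kind, value = spec["_type"], spec["_value"]
        if kind == "uniform":
            low, high = value
            config[name] = rng.uniform(low, high)  # float between low and high
        elif kind == "choice":
            config[name] = rng.choice(value)  # one of the listed options
        else:
            raise ValueError(f"unsupported _type: {kind}")
    return config


space = json.loads(SEARCH_SPACE_JSON)
print(sample_config(space, random.Random(0)))
```

A real tuner decides which configurations to try and how long to run each one; this sketch only shows what a single trial's parameters look like.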


@ -8,6 +8,25 @@ This example shows how to use Distributed Data Parallel (DDP) with PyTorch on Az
- Compute Clusters with GPU for distributed training
- Compute Instance with Azure ML CLI 2.0 installed
## Getting Started
1. Create the training data by running the Python script, which creates the `data` folder.
```bash
python dataprep.py
```
2. Create a job (Azure ML CLI 2.0 + YML configuration file) from VSCode Azure ML Extension.
```bash
cd examples/train/pytorch-ddp/
az ml job create --file job.yml --stream
```
3. Open Azure ML studio and check the Experiment logs.
- In Experiment, parameters and metrics are logged. You can also check system performance in the Monitoring tab, as shown below.<img src="../../../docs/images/torch-studio-1.png"><img src="../../../docs/images/torch-studio-2.png"><img src="../../../docs/images/torch-studio-3.png">
## Reference
- [PyTorch Distributed Data Parallel (DDP)][1]
[1]: https://pytorch.org/docs/stable/distributed.html


@ -14,13 +14,14 @@ inputs:
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:3
compute:
target: azureml:gpuclusters2
compute: azureml:gpuclusters2
resources:
instance_count: 2
distribution:
type: pytorch
process_count: 4
process_count_per_instance: 4
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
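
The relationship between `instance_count` and `process_count_per_instance` in this job can be sketched as simple arithmetic. The snippet assumes the usual convention for multi-node PyTorch launches, where each node hosts a contiguous block of global ranks; the class `DistributionLayout` is hypothetical and is not part of Azure ML or PyTorch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistributionLayout:
    """Hypothetical model of the `resources` and `distribution` settings above."""

    instance_count: int
    process_count_per_instance: int

    @property
    def world_size(self) -> int:
        # Total number of training processes across all nodes.
        return self.instance_count * self.process_count_per_instance

    def global_rank(self, node_rank: int, local_rank: int) -> int:
        # Assumed convention: node 0 holds ranks 0..k-1, node 1 holds k..2k-1, ...
        if not 0 <= node_rank < self.instance_count:
            raise ValueError(f"node_rank out of range: {node_rank}")
        if not 0 <= local_rank < self.process_count_per_instance:
            raise ValueError(f"local_rank out of range: {local_rank}")
        return node_rank * self.process_count_per_instance + local_rank


layout = DistributionLayout(instance_count=2, process_count_per_instance=4)
print(layout.world_size)  # prints 8
print(layout.global_rank(node_rank=1, local_rank=0))  # prints 4
```

With the values in this job file, two instances running four processes each give eight processes in total, typically one per GPU.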


@ -2,14 +2,47 @@
This example shows how to use FLAML to train a model on a dataset using RAY on Azure Machine Learning.
## FLAML with RAY
FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently and economically. FLAML supports Ray Tune for distributed search.
## Prerequisites
- Azure Machine Learning Workspace
- Compute Clusters for Ray
- Compute Instance with Azure ML CLI 2.0 installed
## FLAML with RAY
FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently and economically. FLAML support Ray Tune for distributed search.
## Getting Started
1. Create a job (Azure ML CLI 2.0 + YML configuration file) from VSCode Azure ML Extension.
```bash
cd examples/train/ray-flaml/
az ml job create --file job.yml --stream
```
2. Open Azure ML studio and check the Experiment logs.
- In Experiment, parameters and metrics are logged. You can also check system performance in the Monitoring tab, as shown below.<img src="../../../docs/images/ray-studio-1.png"><img src="../../../docs/images/ray-studio-2.png">
3. Launch TensorBoard to see the metrics.
- Install the Python packages for TensorBoard with Azure Machine Learning.
```bash
pip install azureml-tensorboard tensorboard
```
- Input Run ID in the first cell of [azureml-tensorboard.ipynb](azureml-tensorboard.ipynb).
```python
from azureml.core import Workspace, Run

ws = Workspace.from_config()
# If your Run ID is xxxxxxx
run_id = "xxxxxxx"
run = Run.get(workspace=ws, run_id=run_id)
```
- Start Tensorboard in the second cell and see the list of trials like below.
```python
from azureml.tensorboard import Tensorboard

tb = Tensorboard([run], local_root="logs/azureml", port=6006)
tb.start()
```
<img src="../../../docs/images/flaml-tb-1.png"><img src="../../../docs/images/flaml-tb-2.png">
## Reference


@ -2,39 +2,22 @@
"cells": [
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azureml.core import Workspace, Run\n",
"from azureml.tensorboard import Tensorboard\n",
"ws = Workspace.from_config()\n",
"run = Run.get(workspace=ws, run_id=\"cd0a70d1-aa17-4991-a9a5-98dfd0142663\")"
"run_id = \"\" # input your run id here\n",
"run = Run.get(workspace=ws, run_id=run_id)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://automl-client-6006.japaneast.instances.azureml.ms\n"
]
},
{
"data": {
"text/plain": [
"'https://automl-client-6006.japaneast.instances.azureml.ms'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"tb = Tensorboard([run], local_root=\"logs/azureml\", port=6006)\n",
"tb.start()"
@ -42,34 +25,20 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tb.stop()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"interpreter": {
"hash": "be3ccc1aacd5cd0ada9eab9372c3d7f901636bca88db42a566c3f238cceb324c"
"hash": "c10229b7cc312618cfc9408206359580296dd866b8a068feaf7c6af0ab6fe085"
},
"kernelspec": {
"display_name": "Python 3.6.13 64-bit ('ray': conda)",
"display_name": "Python 3.6.13 64-bit ('nni': conda)",
"name": "python3"
},
"language_info": {


@ -9,12 +9,11 @@ command: >-
--script train-automl-flaml.py
environment:
conda_file: file:conda.yml
docker:
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
conda_file: conda.yml
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
compute:
target: azureml:cpuclusters
compute: azureml:cpuclusters
resources:
instance_count: 3
distribution: