README.md
|
@ -1,18 +1,7 @@
|
|||
# Machine Learning at Scale on Machine Learning
|
||||
# Machine Learning at Scale on Azure Machine Learning
|
||||
|
||||
This repo is a collection of code and notebooks for machine learning at scale on Azure Machine Learning.
|
||||
|
||||
|
||||
## Getting Started
|
||||
|
||||
### Set up base environment*
|
||||
|
||||
1. Set up an Azure Machine Learning Workspace in your Azure subscription.
|
||||
2. Download and install Visual Studio Code on your client PC, then install the Azure Machine Learning extension.
|
||||
3. Create a new Azure ML Compute Instance and launch Visual Studio Code.
|
||||
|
||||
*These steps are not mandatory. There are other ways to set up the environment, but this repo assumes you have followed the steps above.
|
||||
|
||||
## Contents
|
||||
|
||||
### Train
|
||||
|
@ -28,6 +17,18 @@ This repo is a collection of codes/notebooks for machine learning at scale on Az
|
|||
|
||||
Your contributions are welcome!
|
||||
|
||||
## Getting Started
|
||||
|
||||
### Set up base environment*
|
||||
|
||||
1. Set up an Azure Machine Learning Workspace in your Azure subscription. See [Manage Azure Machine Learning workspaces in the portal or with the Python SDK](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=python) for more details.
|
||||
2. Download and install Visual Studio Code on your client PC, then install the Azure Machine Learning extension. See [Azure Machine Learning in VS Code](https://code.visualstudio.com/docs/datascience/azure-machine-learning) for more details.
|
||||
3. Create a new Azure ML Compute Instance and launch Visual Studio Code. See [Connect to an Azure Machine Learning compute instance in Visual Studio Code (preview) - Configure a remote compute instance](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-vs-code-remote?tabs=studio#configure-a-remote-compute-instance) for more details.
|
||||
4. Install the Azure Machine Learning CLI 2.0. See the [installation document](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-cli) for more details.
|
||||
|
||||
*These steps are not mandatory. There are other ways to set up the environment, but this repo assumes you have followed the steps above.
|
||||
|
||||
|
||||
|
||||
## Events
|
||||
[2021-11-18] [DB TECH SHOWCASE 2021 (Japan)](https://www.db-tech-showcase.com/2021/schedule/) - D14 Microsoft: Large-Scale Machine Learning Starting with Azure Machine Learning (Azure Machine Learning で始める大規模機械学習) @ 女部田啓太
|
||||
|
|
|
@ -2,15 +2,32 @@
|
|||
|
||||
This example shows how to use DASK to train LightGBM models in distributed mode on Azure Machine Learning.
|
||||
|
||||
## LightGBM DASK Distributed Training
|
||||
|
||||
LightGBM supports distributed training with DASK, a distributed computing framework for Python. See the documents in the Reference section for more details.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Azure Machine Learning Workspace
|
||||
- Compute Clusters for DASK
|
||||
- Compute Instance with Azure ML CLI 2.0 installed
|
||||
|
||||
## LightGBM DASK Distributed Training
|
||||
## Getting Started
|
||||
|
||||
1. Create Compute Clusters for DASK in your Azure Machine Learning Workspace.
|
||||
2. Check the training scripts in the `src` directory.
|
||||
- DASK cluster startup script: [startDask.py](src/startDask.py)
|
||||
- LightGBM training script: [train-lgb-dask.py](src/train-lgb-dask.py)
|
||||
3. Open [job.yml](job.yml) and start the job from the VSCode Azure ML Extension or a terminal.
|
||||
|
||||
```bash
|
||||
cd examples/train/dask-lightgbm/
|
||||
az ml job create --file job.yml --stream
|
||||
```
|
||||
|
||||
4. Open Azure ML studio and check the Experiment logs.
|
||||
- In the Experiment, `dask-report.html` in the Outputs+logs tab shows the DASK Performance Report, as in the image below.<img src="../../../docs/images/dask-dashboard.png">
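
As a rough illustration of what a DASK cluster startup script has to decide on each node, the sketch below picks a role from the process rank: rank 0 starts the scheduler and every other rank starts a worker that connects to it. This is not the repo's `startDask.py`. The function name `dask_command`, the use of the `RANK` and `MASTER_ADDR` environment variables, and the port value are assumptions made for the example.

```python
import os
from typing import List

# Assumption for this sketch: 8786 is used as the scheduler port.
SCHEDULER_PORT = 8786


def dask_command(rank: int, master_addr: str, port: int = SCHEDULER_PORT) -> List[str]:
    """Return the command line a node would run (hypothetical helper).

    Rank 0 hosts the scheduler; all other ranks start a worker that
    connects to the scheduler on the rank-0 node.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if rank == 0:
        return ["dask-scheduler", "--port", str(port)]
    return ["dask-worker", f"tcp://{master_addr}:{port}"]


if __name__ == "__main__":
    # Assumption: the launcher exports RANK and MASTER_ADDR for each process.
    rank = int(os.environ.get("RANK", "0"))
    master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
    print(" ".join(dask_command(rank, master_addr)))
```

A real startup script would also launch the training script on one node once the workers have joined, so treat this only as a sketch of the role assignment.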
|
||||
|
||||
LightGBM supports distributed training with DASK, a distributed computing framework for Python. See the documents in the Reference section for more details.
|
||||
|
||||
|
||||
## Reference
|
||||
|
|
|
@ -17,15 +17,15 @@ inputs:
|
|||
mode: mount
|
||||
|
||||
environment:
|
||||
conda_file: file:conda.yml
|
||||
docker:
|
||||
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
|
||||
conda_file: conda.yml
|
||||
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
|
||||
|
||||
compute:
|
||||
# use a sku with lots of disk space and memory
|
||||
target: azureml:daskclusters
|
||||
# use a sku with lots of disk space and memory
|
||||
compute: azureml:daskclusters
|
||||
resources:
|
||||
instance_count: 5
|
||||
|
||||
|
||||
distribution:
|
||||
# The job below is currently launched with `type: pytorch` since that
|
||||
# gives the full flexibility of assigning the work to the
|
||||
|
|
|
@ -1,3 +1,7 @@
|
|||
# original source code is from "Azure Machine Learning examples" repo.
|
||||
# url : https://github.com/Azure/azureml-examples/tree/main/cli/jobs/single-step/dask/nyctaxi
|
||||
|
||||
|
||||
import os
|
||||
import argparse
|
||||
import time
|
||||
|
|
|
@ -2,6 +2,12 @@
|
|||
|
||||
This example shows how to use NNI to perform hyperparameter tuning with HyperBand on Azure Machine Learning.
|
||||
|
||||
|
||||
## HPO with NNI
|
||||
|
||||
Neural Network Intelligence (NNI) is a library that provides a unified interface for hyperparameter optimization. It includes many tuning algorithms. See the links in the Reference section for more details.
|
||||
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Azure Machine Learning Workspace
|
||||
|
@ -10,26 +16,69 @@ This example shows how to use NNI to perform hyperparameter tuning with HyperBan
|
|||
|
||||
## Getting Started
|
||||
|
||||
1. Install NNI library in your Compute Instance.
|
||||
1. Create conda environment
|
||||
|
||||
```bash
|
||||
pip install nni
|
||||
```
|
||||
```bash
|
||||
conda create -n nni python=3.6
|
||||
conda init bash
|
||||
source ~/.bashrc
|
||||
conda activate nni
|
||||
```
|
||||
|
||||
2. Create a train script and a configuration file.
|
||||
2. Install the NNI library on your Compute Instance.
|
||||
|
||||
3. Start trial job.
|
||||
```bash
|
||||
pip install nni==2.5 azureml-sdk==1.35.0
|
||||
```
|
||||
3. Create Compute Clusters in your Azure Machine Learning Workspace.
|
||||
|
||||
```bash
|
||||
nnictl create --config config_hyperband.yml --port 8888
|
||||
```
|
||||
<!-- TODO: Azure CLI and YML to create Compute Clusters -->
|
||||
|
||||
4. Access the dashboard.
|
||||
4. Create a training script, a job configuration file, and a search space configuration file.
|
||||
|
||||
- training script: train.py
|
||||
|
||||
<!-- TODO: explain the points of the codes-->
|
||||
|
||||
- job configuration file: config_hyperband.yml
|
||||
|
||||
Configure TrainingService to use Azure Machine Learning.
|
||||
|
||||
```yml
|
||||
TrainingService:
|
||||
platform: aml
|
||||
dockerImage: msranni/nni # modify this if you bring your own docker image
|
||||
subscriptionId: <your subscription ID>
|
||||
resourceGroup: <azure machine learning workspace resource group>
|
||||
workspaceName: <azure machine learning workspace name>
|
||||
computeTarget: <compute cluster name>
|
||||
```
|
||||
|
||||
- search space configuration file: search_space.json
|
||||
|
||||
Define the hyperparameter search space in a JSON file.
|
||||
|
||||
```json
|
||||
{
|
||||
"dropout_rate":{"_type":"uniform","_value":[0.5,0.9]},
|
||||
"conv_size":{"_type":"choice","_value":[2,3,5,7]},
|
||||
"hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
|
||||
"batch_size": {"_type":"choice","_value":[8, 16, 32, 64]},
|
||||
"learning_rate":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]}
|
||||
}
|
||||
```
|
||||
|
||||
5. Start the trial job.
|
||||
|
||||
```bash
|
||||
nnictl create --config config_hyperband.yml --port 8088
|
||||
```
|
||||
|
||||
6. Access the dashboard.
|
||||
- The NNI Dashboard runs on the Compute Instance. You can access it from your client PC at a URL like `https://<your compute instance name>-8088.<region>.instances.azureml.ms`. For example, if your Compute Instance name is "client" and the region is "japaneast", the URL is `https://client-8088.japaneast.instances.azureml.ms`.
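
To make the search space from step 4 more concrete, the sketch below draws one set of hyperparameters from it using only the Python standard library. It shows what kind of values a single trial might receive: `choice` picks one of the listed values and `uniform` draws a float between the two bounds. The helper `sample_params` is hypothetical and is not part of NNI, and plain random sampling is used here only for illustration. It is not how HyperBand schedules trials. In a real training script the values would come from NNI, for example via `nni.get_next_parameter()`.

```python
import json
import random

# Same content as search_space.json in this example.
SEARCH_SPACE_JSON = """
{
    "dropout_rate": {"_type": "uniform", "_value": [0.5, 0.9]},
    "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]},
    "hidden_size": {"_type": "choice", "_value": [124, 512, 1024]},
    "batch_size": {"_type": "choice", "_value": [8, 16, 32, 64]},
    "learning_rate": {"_type": "choice", "_value": [0.0001, 0.001, 0.01, 0.1]}
}
"""


def sample_params(search_space: dict, rng: random.Random) -> dict:
    """Draw one value per hyperparameter (hypothetical helper, not an NNI API)."""
    params = {}
    for name, spec in search_space.items():
        kind, value = spec["_type"], spec["_value"]
        if kind == "choice":
            # One of the listed candidates.
            params[name] = rng.choice(value)
        elif kind == "uniform":
            # A float between the lower and upper bound.
            low, high = value
            params[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported _type in this sketch: {kind}")
    return params


search_space = json.loads(SEARCH_SPACE_JSON)
params = sample_params(search_space, random.Random(0))
print(params)
```

Only the two `_type` values used in this example are handled. Any other type raises an error rather than being guessed at.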
|
||||
|
||||
|
||||
## HPO with NNI
|
||||
|
||||
Neural Network Intelligence (NNI) is a library that provides a unified interface for hyperparameter optimization. It includes many tuning algorithms. See the links in the Reference section for more details.
|
||||
|
||||
## Reference
|
||||
|
||||
|
|
|
@ -19,4 +19,4 @@ TrainingService:
|
|||
subscriptionId: 82a5d8d3-5322-4c49-b9d6-da6e00be5d57
|
||||
resourceGroup: azureml-automl
|
||||
workspaceName: azureml-automl
|
||||
computeTarget: cpuclusters
|
||||
computeTarget: gpuclusters
|
|
@ -0,0 +1,7 @@
|
|||
{
|
||||
"dropout_rate":{"_type":"uniform","_value":[0.5,0.9]},
|
||||
"conv_size":{"_type":"choice","_value":[2,3,5,7]},
|
||||
"hidden_size":{"_type":"choice","_value":[124, 512, 1024]},
|
||||
"batch_size": {"_type":"choice","_value":[8, 16, 32, 64]},
|
||||
"learning_rate":{"_type":"choice","_value":[0.0001, 0.001, 0.01, 0.1]}
|
||||
}
|
|
@ -8,6 +8,25 @@ This example shows how to use Distributed Data Parallel (DDP) with PyTorch on Az
|
|||
- Compute Clusters with GPU for distributed training
|
||||
- Compute Instance with Azure ML CLI 2.0 installed
|
||||
|
||||
## Getting Started
|
||||
|
||||
1. Create the training data by running the Python script below, which creates the `data` folder.
|
||||
|
||||
```bash
|
||||
python dataprep.py
|
||||
```
|
||||
|
||||
2. Create a job (Azure ML CLI 2.0 + YML configuration file) from VSCode Azure ML Extension.
|
||||
|
||||
```bash
|
||||
cd examples/train/pytorch-ddp/
|
||||
az ml job create --file job.yml --stream
|
||||
```
|
||||
|
||||
3. Open Azure ML studio and check the Experiment logs.
|
||||
- In the Experiment, parameters and metrics are logged. You can also check system performance in the Monitoring tab, as in the images below.<img src="../../../docs/images/torch-studio-1.png"><img src="../../../docs/images/torch-studio-2.png"><img src="../../../docs/images/torch-studio-3.png">
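
The `job.yml` for this example requests 2 instances with 4 processes per instance. The sketch below shows how those two numbers relate to the world size and to each process's global rank in a DDP job. The helper names are hypothetical, and it assumes the launcher exports `RANK`, `LOCAL_RANK` and `WORLD_SIZE` to each process, as `torchrun` does.

```python
from typing import Mapping, NamedTuple


class DistInfo(NamedTuple):
    rank: int        # global rank across all instances
    local_rank: int  # rank within one instance (typically the GPU index)
    world_size: int  # total number of processes


def world_size(instance_count: int, process_count_per_instance: int) -> int:
    """Total number of training processes in the job."""
    return instance_count * process_count_per_instance


def global_rank(node_rank: int, local_rank: int, process_count_per_instance: int) -> int:
    """Global rank of a process, assuming ranks are assigned node by node."""
    return node_rank * process_count_per_instance + local_rank


def read_dist_env(env: Mapping[str, str]) -> DistInfo:
    """Read rank information from environment variables.

    Falls back to single-process defaults so the same script also runs locally.
    """
    return DistInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    import os

    print(world_size(instance_count=2, process_count_per_instance=4))
    print(read_dist_env(os.environ))
```

In a training script, `local_rank` would typically select the GPU, while `rank` and `world_size` are what `torch.distributed` uses to set up the process group.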
|
||||
|
||||
|
||||
## Reference
|
||||
- [PyTorch Distributed Data Parallel (DDP)][1]
|
||||
[1]: https://pytorch.org/docs/stable/distributed.html
|
||||
|
|
|
@ -14,13 +14,14 @@ inputs:
|
|||
|
||||
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:3
|
||||
|
||||
compute:
|
||||
target: azureml:gpuclusters2
|
||||
compute: azureml:gpuclusters2
|
||||
|
||||
resources:
|
||||
instance_count: 2
|
||||
|
||||
distribution:
|
||||
type: pytorch
|
||||
process_count: 4
|
||||
process_count_per_instance: 4
|
||||
|
||||
experiment_name: pytorch-cifar-distributed-example
|
||||
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
|
||||
|
|
|
@ -2,14 +2,47 @@
|
|||
|
||||
This example shows how to use FLAML to train a model on a dataset using RAY on Azure Machine Learning.
|
||||
|
||||
## FLAML with RAY
|
||||
FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently and economically. FLAML supports Ray Tune for distributed search.
|
||||
|
||||
## Prerequisites
|
||||
- Azure Machine Learning Workspace
|
||||
- Compute Clusters for Ray
|
||||
- Compute Instance with Azure ML CLI 2.0 installed
|
||||
|
||||
## FLAML with RAY
|
||||
FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently and economically. FLAML supports Ray Tune for distributed search.
|
||||
## Getting Started
|
||||
|
||||
1. Create a job (Azure ML CLI 2.0 + YML configuration file) from VSCode Azure ML Extension.
|
||||
|
||||
```bash
|
||||
cd examples/train/ray-flaml/
|
||||
az ml job create --file job.yml --stream
|
||||
```
|
||||
|
||||
3. Open Azure ML studio and check the Experiment logs.
|
||||
- In the Experiment, parameters and metrics are logged. You can also check system performance in the Monitoring tab, as in the images below.<img src="../../../docs/images/ray-studio-1.png"><img src="../../../docs/images/ray-studio-2.png">
|
||||
|
||||
4. Launch TensorBoard to see the metrics.
|
||||
- Install the Python packages for using TensorBoard with Azure Machine Learning.
|
||||
|
||||
```bash
|
||||
pip install azureml-tensorboard tensorboard
|
||||
```
|
||||
|
||||
- Enter the Run ID in the first cell of [azureml-tensorboard.ipynb](azureml-tensorboard.ipynb).
|
||||
|
||||
```python
|
||||
# if your Run ID is xxxxxxx
|
||||
run_id = "xxxxxxx"
|
||||
run = Run.get(workspace=ws, run_id=run_id)
|
||||
```
|
||||
- Start TensorBoard in the second cell and view the list of trials, as in the images below.
|
||||
|
||||
```python
|
||||
tb = Tensorboard([run], local_root="logs/azureml", port=6006)
|
||||
tb.start()
|
||||
```
|
||||
<img src="../../../docs/images/flaml-tb-1.png"><img src="../../../docs/images/flaml-tb-2.png">
|
||||
|
||||
|
||||
## Reference
|
||||
|
|
|
@ -2,39 +2,22 @@
|
|||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from azureml.core import Workspace, Run\n",
|
||||
"from azureml.tensorboard import Tensorboard\n",
|
||||
"ws = Workspace.from_config()\n",
|
||||
"run = Run.get(workspace=ws, run_id=\"cd0a70d1-aa17-4991-a9a5-98dfd0142663\")"
|
||||
"run_id = \"\" # input your run id here\n",
|
||||
"run = Run.get(workspace=ws, run_id=run_id)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"https://automl-client-6006.japaneast.instances.azureml.ms\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'https://automl-client-6006.japaneast.instances.azureml.ms'"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tb = Tensorboard([run], local_root=\"logs/azureml\", port=6006)\n",
|
||||
"tb.start()"
|
||||
|
@ -42,34 +25,20 @@
|
|||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"tb.stop()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"interpreter": {
|
||||
"hash": "be3ccc1aacd5cd0ada9eab9372c3d7f901636bca88db42a566c3f238cceb324c"
|
||||
"hash": "c10229b7cc312618cfc9408206359580296dd866b8a068feaf7c6af0ab6fe085"
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3.6.13 64-bit ('ray': conda)",
|
||||
"display_name": "Python 3.6.13 64-bit ('nni': conda)",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
|
|
|
@ -9,12 +9,11 @@ command: >-
|
|||
--script train-automl-flaml.py
|
||||
|
||||
environment:
|
||||
conda_file: file:conda.yml
|
||||
docker:
|
||||
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
|
||||
conda_file: conda.yml
|
||||
image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
|
||||
|
||||
compute:
|
||||
target: azureml:cpuclusters
|
||||
compute: azureml:cpuclusters
|
||||
resources:
|
||||
instance_count: 3
|
||||
|
||||
distribution:
|
||||
|
|