Example of using HyperDrive to tune a regular ML learner.

Author: Mario Bourgoin

# Tuning Python models on a Batch AI cluster

## Overview

This scenario shows how to tune a Frequently Asked Questions (FAQ) matching model that can be deployed as a web service to provide predictions for user questions. For this scenario, “Input Data” in the architecture diagram refers to text strings containing the user questions to match with a list of FAQs. The scenario is designed for the Scikit-Learn machine learning library for Python but can be generalized to any scenario that uses Python models to make real-time predictions.
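To illustrate the shape of the matching task (a user question in, FAQs ranked by match score out), here is a stdlib-only sketch using Jaccard token overlap. The sample questions and the scoring rule are invented for illustration; the repository's actual model is a tuned Scikit-Learn pipeline.

```python
def tokens(text):
    """Lowercase word set for a piece of text."""
    return set(text.lower().split())

def rank_faqs(question, faqs):
    """Return (faq, score) pairs sorted by Jaccard similarity to the question."""
    q = tokens(question)
    scored = []
    for faq in faqs:
        f = tokens(faq)
        score = len(q & f) / len(q | f) if q | f else 0.0
        scored.append((faq, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical FAQ list, for illustration only.
faqs = [
    "How do I sort an array in JavaScript?",
    "How do I remove a property from a JavaScript object?",
]
ranked = rank_faqs("What is the way to sort an array?", faqs)
print(ranked[0][0])  # the sort question should rank first
```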

## Design

![Design](Design.png)

The scenario uses a subset of Stack Overflow question data that includes original questions tagged as JavaScript, their duplicate questions, and their answers. It tunes a Scikit-Learn pipeline to predict the probability that a duplicate question matches each of the original questions. The application flow for this architecture is shown in the diagram above.
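The hyperparameter search notebook uses a random search; the core idea can be sketched in plain Python. The parameter space and the toy objective below are made up for illustration — in the repository, the score comes from evaluating the Scikit-Learn pipeline, and the candidate runs execute on a Batch AI cluster.

```python
import random

def random_search(space, score, n_iter, seed=0):
    """Sample n_iter settings from `space`, keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Hypothetical search space, for illustration only.
space = {"ngram_range": [1, 2, 3], "min_df": [1, 5, 10]}

def score(params):
    # Toy objective: pretend 2-grams with min_df=5 work best.
    return -abs(params["ngram_range"] - 2) - abs(params["min_df"] - 5) / 10

best, best_score = random_search(space, score, n_iter=50)
print(best, best_score)
```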

## Prerequisites

  1. Linux (Ubuntu).
  2. Anaconda Python installed.
  3. Docker installed.
  4. DockerHub account.
  5. Azure account.

The tutorial was developed on an Azure Ubuntu DSVM, which addresses the first three prerequisites.
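A quick way to confirm the software prerequisites on your machine is to check which of the expected command-line tools are on the PATH. The tool names below are assumptions implied by the list above (conda for Anaconda, az for the Azure CLI); adjust them for your setup.

```python
import shutil

def check_tools(tools):
    """Map each tool name to whether it is found on the PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

# Tools implied by the prerequisites above.
report = check_tools(["conda", "docker", "az"])
for tool, found in report.items():
    print(f"{tool}: {'found' if found else 'MISSING'}")
```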

## Setup

To set up your environment to run these notebooks, follow the steps below. They set up the notebooks to use Docker and Azure seamlessly.

  1. Create a Linux DSVM.
  2. In a bash shell on the DSVM, add your login to the `docker` group:
     ```
     sudo usermod -a -G docker <login>
     ```
  3. Log in to your DockerHub account:
     ```
     docker login
     ```
  4. Clone, fork, or download the zip file for this repository:
     ```
     git clone https://github.com/Azure/MLBatchAIHyperparameterTuning.git
     ```
  5. Create the MLBatchAIHyperparameterTuning conda environment from `environment.yml`:
     ```
     conda env create -f environment.yml
     ```
  6. Activate the environment:
     ```
     source activate MLBatchAIHyperparameterTuning
     ```
  7. Log in to Azure:
     ```
     az login
     ```
  8. If you have more than one Azure subscription, select the one to use:
     ```
     az account set --subscription "<Your Azure Subscription>"
     ```
  9. Start the Jupyter notebook server in the environment:
     ```
     jupyter notebook
     ```

## Steps

After following the setup instructions above, run the Jupyter notebooks in order, starting with the data preparation notebook, [00_Data_Prep.ipynb](00_Data_Prep.ipynb).

## Cleaning up

To remove the conda environment, run `conda env remove -n MLBatchAIHyperparameterTuning`. The last Jupyter notebook also gives details on deleting the Azure resources associated with this repository.

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.