Author: Mario Bourgoin
Tuning Python models on a Batch AI cluster
Overview
This scenario shows how to tune a Frequently Asked Questions (FAQ) matching model that can be deployed as a web service to provide predictions for user questions. For this scenario, “Input Data” in the architecture diagram refers to text strings containing the user questions to match with a list of FAQs. The example is built around the Scikit-Learn machine learning library for Python, but the approach can be generalized to any scenario that uses Python models to make real-time predictions.
Design
The scenario uses a subset of Stack Overflow question data that includes original questions tagged as JavaScript, their duplicate questions, and their answers. It tunes a Scikit-Learn pipeline to predict the match probability of a duplicate question with each of the original questions.
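As a rough sketch of the kind of model being tuned, the snippet below builds a Scikit-Learn pipeline that featurizes question text with TF-IDF and classifies it with logistic regression, then searches its hyperparameters with RandomizedSearchCV. The estimators, the parameter ranges, and the column names duplicate_text and original_id are illustrative assumptions; the notebooks in this repository define the actual pipeline and search space, and distribute the search across a Batch AI cluster rather than running it on a single machine.

```python
# Illustrative sketch only: the repository's notebooks define the real
# pipeline, features, and hyperparameter search space. Assumes a pandas
# DataFrame with hypothetical columns "duplicate_text" (question text)
# and "original_id" (the original question it duplicates).
import pandas as pd
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

train = pd.read_csv("train.csv")  # hypothetical training file
X, y = train["duplicate_text"], train["original_id"]

# Text featurization followed by a probabilistic classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each Batch AI job would evaluate one sampled configuration; locally,
# RandomizedSearchCV plays the same role on a single machine.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "tfidf__min_df": [1, 2, 5],
        "clf__C": loguniform(1e-2, 1e2),
    },
    n_iter=20,
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```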
Prerequisites
- Linux (Ubuntu).
- Anaconda Python installed.
- Docker installed.
- DockerHub account.
- Azure account.
The tutorial was developed on an Azure Ubuntu DSVM, which addresses the first three prerequisites.
Setup
To set up your environment to run these notebooks, follow the steps below. They set up the notebooks to use Docker and Azure seamlessly.
- Create a Linux DSVM.
- In a bash shell on the DSVM, add your login to the docker group:
sudo usermod -a -G docker <login>
- Log in to your DockerHub account:
docker login
- Clone, fork, or download the zip file for this repository:
git clone https://github.com/Azure/MLBatchAIHyperparameterTuning.git
- Create the MLBatchAIHyperparameterTuning Python virtual environment from environment.yml:
conda env create -f environment.yml
- Activate the virtual environment:
source activate MLBatchAIHyperparameterTuning
- Log in to Azure:
az login
- If you have more than one Azure subscription, select the one to use:
az account set --subscription <Your Azure Subscription>
- Start the Jupyter notebook server in the virtual environment:
jupyter notebook
Steps
After following the setup instructions above, run the Jupyter notebooks in order, starting with the Data Prep notebook.
Cleaning up
To remove the conda environment, run conda env remove -n MLBatchAIHyperparameterTuning. The last Jupyter notebook also gives details on deleting the Azure resources associated with this repository.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.