Class materials for the ICSE tutorial on scalable machine learning

ICSE Tutorial on Scalable Data Science with Python and R

Setup instructions

Prerequisites

You will need an Azure subscription. You can get a free trial subscription, as described below. Signing up takes about 3 minutes and requires a credit card, which is used only to verify your identity. The subscription comes with a $200 credit, and your credit card will not be charged unless you actively initiate an upgrade.

  1. Create a Microsoft account at https://outlook.com (skip this step if you already have an @outlook.com, @hotmail.com, or @live.com account)
  2. Use your Microsoft account to get the free Azure subscription at https://azure.microsoft.com/en-us/free/
  • You get $200 in Azure credits, which expire after 30 days
  • A credit card is needed for identification purposes only. You will not be charged, even after the 30 days have passed or the $200 credit has been used up
  3. Create an Ubuntu Linux DSVM (Data Science Virtual Machine): https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro
  4. Create an AzureML workspace: https://docs.microsoft.com/en-us/azure/machine-learning/service/setup-create-workspace#portal
  5. Install the x2go client on your local machine: https://wiki.x2go.org/doku.php/doc:installation:x2goclient
  6. Test connecting to your DSVM
  • Go to the Azure portal to get the public IP address of your DSVM
  • Connect to it from the x2go client, choosing XFCE as the session type
  • After connecting, open a command window and clone the GitHub repository to get the tutorial content onto your DSVM: git clone https://github.com/microsoft/ICSE2019
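
If the x2go connection in the last step fails, it can help to first confirm that the DSVM is reachable at all. The sketch below is a minimal check, assuming that x2go tunnels over SSH on the default port 22 and that this port is open on your DSVM. The function name `can_reach` and the example IP address are hypothetical, chosen only for illustration.

```python
import socket


def can_reach(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the with-block closes it again as soon as the check is done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `can_reach("203.0.113.10")` with the public IP from the Azure portal in place of the placeholder address returns `False` when the VM is stopped or the port is blocked, which narrows the problem down before you start debugging the x2go session itself.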

Alternate prerequisites to run these examples locally on your machine:

  • Your favorite git client. The original is here.
  • A distribution of Python 3: Anaconda works well. (5 minutes on good network)
  • A distribution of R: How about Microsoft R Open? (2 minutes on good network)

Introduction to Machine Learning at Scale

Introduction to Scalable R

R on Spark hands-on

  1. Launch the X2Go client and click Session | New Session
  • Host: enter the host IP address
  • Login: enter your username
  • Session type: choose XFCE
  • Click OK
  • Click the icon or the session name, e.g. “New session”
  • Enter your password
  • Click OK
  2. Open a command window on the DSVM and execute the “docker run” command:
  • sudo docker run -e PASSWORD=mypassword1 -p 8787:8787 rocker/verse:3.6.0
  3. Open a web browser on the DSVM and connect to RStudio Server on port 8787
  4. Open a terminal window in RStudio and clone the git repo
  5. In the Files pane in RStudio:
  • Open the ICSE2019 folder
  • Open the R folder
  • Click 1-Intro-Transform-Train-Score.Rmd (not the .nb.html file) to open the first hands-on script
  • When it says that certain packages are required but are not installed, click “Install”
  • Click 2-ML-Pipelines.Rmd (not the .nb.html file) to open the second hands-on script
  • When it says that certain packages are required but are not installed, click “Install”
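
The “docker run” command in the steps above packs three settings into one line: the RStudio password (`-e PASSWORD=...`), the port mapping (`-p host:container`), and the image tag. The sketch below only assembles that same command so each part is visible and easy to change. The helper name `rstudio_docker_command` and its parameters are hypothetical, and it assumes RStudio Server listens on port 8787 inside the container, as the original command implies.

```python
import shlex


def rstudio_docker_command(password, host_port=8787, image="rocker/verse:3.6.0"):
    """Build the argument list for the docker run command used in this tutorial."""
    return [
        "sudo", "docker", "run",
        "-e", f"PASSWORD={password}",  # password for the RStudio login page
        "-p", f"{host_port}:8787",     # host port : container port
        image,
    ]


cmd = rstudio_docker_command("mypassword1")
print(shlex.join(cmd))
# prints: sudo docker run -e PASSWORD=mypassword1 -p 8787:8787 rocker/verse:3.6.0
```

Passing a different `host_port` is useful if 8787 is already taken on the DSVM; the browser in the next step then has to use that port instead.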

Automated ML

AutomatedML takes care of the repetitive process of hypothesizing a model, fitting it to the data, evaluating, and repeating until alternatives or the data scientist have been exhausted. It can run locally or use the power of the cloud to try many models in parallel.
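
To make the loop described above concrete, here is a toy sketch in plain Python. It is not how AutomatedML is implemented and it does not use the AzureML SDK: the three candidate "models", the data, and every name in it are hypothetical, and a thread pool stands in for trying models in parallel on cloud compute.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_mean(xs, ys):
    """Always predict the average training target."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """Predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Hypothetical candidate set; a real search also varies hyperparameters.
CANDIDATES = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}


def mse(model, xs, ys):
    """Mean squared error of a fitted model on held-out data."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))


def search(train, valid):
    """Fit every candidate, score it on validation data, return the best."""
    def trial(item):
        name, fit = item
        return mse(fit(*train), *valid), name

    with ThreadPoolExecutor() as pool:  # trials run concurrently
        return min(pool.map(trial, CANDIDATES.items()))


train = ([0, 1, 2, 3, 4, 5], [1, 3, 5, 7, 9, 11])  # y = 2x + 1
valid = ([6, 7], [13, 15])
best_score, best_name = search(train, valid)
print(best_name, best_score)  # prints: line 0.0
```

The structure is the point: each trial is independent of the others, which is why the search parallelizes so easily once more compute is attached.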

If you brought a nice, clean dataset, please feel free to try the notebooks on it if you complete them before the guy on the podium does! If you brought a big, ugly, hairy, real one, let's talk after the tutorial.

The approximate content of this demonstration will be:

  • Configuration (notebook)
  • AzureML basics (slides, hands-on playtime)
  • Automated ML basics (slides, hands-on playtime)
  • A simple classification problem
  • Creating and attaching scalable compute, managing it in the Portal
  • AutomatedML user interface in Portal
  • AutomatedML forecasting
  • Deploying AutomatedML models

Setup steps - Automated ML

  • Clone the Machine Learning Notebooks repo.
  • Open a shell or command prompt window, go to /how-to-use-azureml/automated-machine-learning and execute the automl_setup script appropriate for your platform (Win, Linux, Mac). Many packages will be installed (10 minutes on good network).
  • A browser window with Jupyter will open. Stop it by pressing Ctrl+C in the terminal.
  • Restart Jupyter in the root directory of the repo (two folders up) by running jupyter notebook
  • Open the setup notebook configuration.ipynb.
  • Make sure to use the azure_automl kernel.
  • Transfer your subscription information and resource group name into the second code cell of the notebook.
  • Run cells according to instructions.
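
When transferring your subscription information into the notebook, you may prefer not to type the values directly into a cell that could later be shared. The sketch below is one possible approach, not part of the configuration notebook: it reads the values from environment variables and fails early if one is missing. The variable names `SUBSCRIPTION_ID` and `RESOURCE_GROUP` and the helper `load_azure_settings` are assumptions made for this example.

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
REQUIRED = ("SUBSCRIPTION_ID", "RESOURCE_GROUP")


def load_azure_settings(env=os.environ):
    """Return the required settings as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    # Lower-case keys, e.g. {"subscription_id": "...", "resource_group": "..."}
    return {name.lower(): env[name] for name in REQUIRED}
```

In a notebook cell you could then write `settings = load_azure_settings()` and paste `settings["subscription_id"]` and `settings["resource_group"]` wherever the configuration notebook asks for them.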

Reinforcement Learning

This will be a presentation interspersed with 3 notebook demos. The notebooks are available in the Rlearning subdirectory of this repository. Note that all demos were developed using Python 3.6, which is available on the DSVM via source activate py36.

  • To run the two "Maze" demos you will need to install the current pybrain from github, by running:
    git clone https://github.com/pybrain/pybrain.git
    cd pybrain
    python3 setup.py install

You may need to conda install setuptools for this to work.

The other dependencies (numpy, scipy and matplotlib) should already be installed on the Data Science VM.
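
Before opening the Maze notebooks, a quick way to confirm that the installation worked is to ask Python whether it can find each package. This is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical. Run it inside the py36 environment so that it checks the same interpreter the notebooks use.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


# Packages the Maze demos depend on, according to the notes above.
DEMO_MODULES = ["numpy", "scipy", "matplotlib", "pybrain"]
print("Missing:", missing_modules(DEMO_MODULES) or "none")
```

If `pybrain` shows up as missing, repeat the `python3 setup.py install` step with the py36 environment activated.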

  • To run the Azure Personalizer demo, you need to create a Personalizer (preview) resource in Azure. From that resource you can find the resource key and endpoint URL to incorporate in your script.
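
As an illustration of where the resource key and endpoint end up, the sketch below builds (but does not send) a Rank request with the standard library. It is not taken from the demo notebook. The URL path `/personalizer/v1.0/rank`, the `Ocp-Apim-Subscription-Key` header, and the body field names reflect our understanding of the Personalizer v1.0 REST API and should be treated as assumptions to check against the current Azure documentation. The helper name, endpoint, key, and features are all hypothetical.

```python
import json
import urllib.request
import uuid


def build_rank_request(endpoint, key, context_features, actions, event_id=None):
    """Assemble a Rank request; send it with urllib.request.urlopen(request)."""
    body = {
        "contextFeatures": context_features,  # what we know about the situation
        "actions": actions,                   # the choices to rank
        "eventId": event_id or str(uuid.uuid4()),
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/personalizer/v1.0/rank",  # assumed path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,  # the resource key from Azure
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_rank_request(
    endpoint="https://example.invalid/",  # placeholder, not a real endpoint
    key="<your-resource-key>",
    context_features=[{"timeOfDay": "morning"}],
    actions=[{"id": "left", "features": [{"direction": "left"}]},
             {"id": "right", "features": [{"direction": "right"}]}],
    event_id="demo-event-1",
)
print(request.full_url)  # prints: https://example.invalid/personalizer/v1.0/rank
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before any call is charged against your resource.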

Contributing

Contact us if you'd like to contribute to this repository.

Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.