[![Build Status](https://travis-ci.org/Azure/doAzureParallel.svg?branch=master)](https://travis-ci.org/Azure/doAzureParallel)
# doAzureParallel
## Introduction
The *doAzureParallel* package is a parallel backend for the widely used *foreach* package. With *doAzureParallel*, each iteration of a *foreach* loop runs in parallel on an Azure Virtual Machine (VM), allowing users to scale their R jobs to tens or hundreds of machines.

The *foreach* package supports parallel execution by dispatching loop iterations to a registered parallel backend. With just a few lines of code, *doAzureParallel* creates a cluster in Azure, registers it as a parallel backend, and connects it to *foreach*.

NOTE: The terms *pool* and *cluster* are used interchangeably throughout this document.
## Notable Features
- Ability to use low-priority VMs for an 80% discount [(link)](./docs/31-vm-sizes.md#low-priority-vms)
- Users can bring their own Docker Image
- AAD and VNets Support
- Built-in support for Azure Blob Storage
## Dependencies
- R (>= 3.3.1)
- httr (>= 1.2.1)
- rjson (>= 0.2.15)
- RCurl (>= 1.95-4.8)
- digest (>= 0.6.9)
- foreach (>= 1.4.3)
- iterators (>= 1.0.8)
- bitops (>= 1.0.5)
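
Before installing, it can help to check which of these minimum versions the local R library already satisfies. The following is a minimal sketch using only base R; the helper name `check_min_versions` is hypothetical and is not part of doAzureParallel.

```R
# Minimum versions copied from the dependency list above.
required <- c(
  httr = "1.2.1",
  rjson = "0.2.15",
  RCurl = "1.95-4.8",
  digest = "0.6.9",
  foreach = "1.4.3",
  iterators = "1.0.8",
  bitops = "1.0.5"
)

# Hypothetical helper (not a doAzureParallel function): returns the names of
# packages that are either not installed or older than the requested minimum.
check_min_versions <- function(required) {
  ok <- vapply(names(required), function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      return(FALSE)
    }
    utils::packageVersion(pkg) >= package_version(required[[pkg]])
  }, logical(1))
  names(required)[!ok]
}

unmet <- check_min_versions(required)
if (length(unmet) > 0) {
  message("Missing or outdated: ", paste(unmet, collapse = ", "))
}
```

Any package reported here can be installed or updated with `install.packages()` before continuing with the setup steps.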
## Setup
1) Install doAzureParallel directly from GitHub.
```R
# install the package devtools
install.packages("devtools")
# install the doAzureParallel and rAzureBatch packages
devtools::install_github("Azure/rAzureBatch")
devtools::install_github("Azure/doAzureParallel")
```
2) Create a doAzureParallel credentials file.
``` R
library(doAzureParallel)
generateCredentialsConfig("credentials.json")
```
3) Log in to or register for an Azure account, navigate to [Azure Cloud Shell](https://shell.azure.com), and run the account setup script.
``` sh
wget -q https://raw.githubusercontent.com/Azure/doAzureParallel/master/account_setup.sh &&
chmod 755 account_setup.sh &&
/bin/bash account_setup.sh
```
4) Follow the on-screen prompts to create the necessary Azure resources, then copy the output into your credentials file. For more information, see [Getting Started Scripts](./docs/02-getting-started-script.md).

To learn more:
- [Azure Account Requirements for doAzureParallel](./docs/04-azure-requirements.md)
## Getting Started
Import the package
```R
library(doAzureParallel)
```
Set up your parallel backend with Azure. This is your set of Azure VMs.
```R
# 1. Generate your credential and cluster configuration files.
generateClusterConfig("cluster.json")
generateCredentialsConfig("credentials.json")
# 2. Fill out your credential config and cluster config files.
# Enter your Azure Batch Account & Azure Storage keys/account-info into your
# credential config ("credentials.json") and configure your cluster in your
# cluster config ("cluster.json")
# 3. Set your credentials - you need to give the R session your credentials to interact with Azure
setCredentials("credentials.json")
# 4. Create your cluster. A new pool is created only if your pool hasn't already been provisioned.
cluster <- makeCluster("cluster.json")
# 5. Register the pool as your parallel backend
registerDoAzureParallel(cluster)
# 6. Check that your parallel backend has been registered
getDoParWorkers()
```
Run your parallel *foreach* loop with the *%dopar%* operator. The *foreach* function returns the results of your parallel code.
```R
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations) %dopar% {
  # This code is executed, in parallel, across your cluster.
  myAlgorithm()
}
```
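
Because `myAlgorithm()` above is a placeholder, a concrete loop may be easier to follow. The sketch below estimates pi by Monte Carlo sampling and uses the `.combine` argument of *foreach* to collapse the per-iteration results into a vector. It registers the sequential backend with `registerDoSEQ()` so that it runs locally without a cluster; this is an assumption made for testing the loop body and is not a step in the doAzureParallel workflow. The function name `estimate_pi` is hypothetical.

```R
library(foreach)

# Run %dopar% sequentially on the local machine for testing.
# With a cluster, registerDoAzureParallel(cluster) would take the place of this call.
registerDoSEQ()

# Hypothetical loop body: 4 times the fraction of random points that fall
# inside the unit quarter circle.
estimate_pi <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  4 * mean(x^2 + y^2 <= 1)
}

number_of_iterations <- 10
set.seed(42)

# .combine = c turns the default list of results into a numeric vector.
estimates <- foreach(i = 1:number_of_iterations, .combine = c) %dopar% {
  estimate_pi(10000)
}

# Average the per-iteration estimates.
mean(estimates)
```

Checking the loop locally first is one way to catch errors in the loop body before paying for VM time.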
After you finish running your R code in Azure, you may want to shut down your cluster of VMs so that you are no longer charged for them.
```R
# shut down your pool
stopCluster(cluster)
```
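
If the loop fails partway through, a `stopCluster()` call placed at the end of a script is never reached and the VMs keep running. One way to guard against this is to register the shutdown with `on.exit()`. The following is a minimal sketch in base R; `with_cluster` and its arguments are hypothetical names, and stand-in functions replace the real cluster calls so that the example runs without an Azure account.

```R
# Hypothetical helper (not a doAzureParallel function): create a cluster,
# run `work` on it, and always run `teardown`, even if `work` raises an error.
with_cluster <- function(create, teardown, work) {
  cluster <- create()
  on.exit(teardown(cluster), add = TRUE)
  work(cluster)
}

# Stand-ins for makeCluster() and stopCluster() that only record what happened.
events <- character(0)
fake_create <- function() {
  events <<- c(events, "created")
  "fake-cluster"
}
fake_teardown <- function(cluster) {
  events <<- c(events, "stopped")
}

# Simulate a job that fails; the teardown still runs.
result <- tryCatch(
  with_cluster(fake_create, fake_teardown, function(cluster) stop("job failed")),
  error = function(e) conditionMessage(e)
)

print(events)
```

With doAzureParallel, the same pattern would pass `function() makeCluster("cluster.json")` as `create`, `stopCluster` as `teardown`, and a function that calls `registerDoAzureParallel(cluster)` and runs the *foreach* loop as `work`.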
## Table of Contents
This section provides information about how Azure works, how best to take advantage of Azure, and best practices when using the doAzureParallel package.

1. **Azure Introduction** [(link)](./docs/00-azure-introduction.md)

   Using *Azure Batch*

2. **Getting Started** [(link)](./docs/01-getting-started.md)

   Using the *Getting Started* guide to create credentials

   i. **Generate Credentials Script** [(link)](./docs/02-getting-started-script.md)

   - Pre-built bash script for getting Azure credentials without the Azure Portal

   ii. **National Cloud Support** [(link)](./docs/03-national-clouds.md)

   - How to run workloads in Azure national clouds

3. **Customize Cluster** [(link)](./docs/30-customize-cluster.md)

   Setting up your cluster to fit your specific needs

   i. **Virtual Machine Sizes** [(link)](./docs/31-vm-sizes.md)

   - How do you choose the best VM type/size for your workload?

   ii. **Autoscale** [(link)](./docs/32-autoscale.md)

   - Automatically scale your cluster up or down to save time and/or money.

   iii. **Building Containers** [(link)](./docs/33-building-containers.md)

   - Creating your own Docker containers for reproducibility

4. **Managing Cluster** [(link)](./docs/40-clusters.md)

   Managing your cluster's lifespan

5. **Customize Job**

   Setting up your job to fit your specific needs

   i. **Asynchronous Jobs** [(link)](./docs/51-long-running-job.md)

   - Best practices for managing long-running jobs

   ii. **Foreach Azure Options** [(link)](./docs/52-azure-foreach-options.md)

   - Use Azure package-defined foreach options to improve performance and user experience

   iii. **Error Handling** [(link)](./docs/53-error-handling.md)

   - How Azure handles errors in your foreach loop

6. **Package Management** [(link)](./docs/20-package-management.md)

   Best practices for managing your R packages in code. This includes installation at the cluster or job level as well as how to use different package providers.

7. **Storage Management**

   i. **Distributing your Data** [(link)](./docs/71-distributing-data.md)

   - Best practices and limitations for working with distributed data.

   ii. **Persistent Storage** [(link)](./docs/72-persistent-storage.md)

   - Taking advantage of persistent storage for long-running jobs

   iii. **Accessing Azure Storage through R** [(link)](./docs/73-managing-storage.md)

   - Manage your Azure Storage files via R

8. **Performance Tuning** [(link)](./docs/80-performance-tuning.md)

   Best practices for optimizing your foreach loop

9. **Debugging and Troubleshooting** [(link)](./docs/90-troubleshooting.md)

   Best practices for diagnosing common issues

10. **Azure Limitations** [(link)](./docs/91-quota-limitations.md)

    Learn about the limitations around the size of your cluster and the number of foreach jobs you can run in Azure.

## Additional Documentation
Read our [**FAQ**](./docs/92-faq.md) for known issues and common questions.
## Next Steps
For more information, please visit [our documentation](./docs/README.md).