A R package that allows users to submit parallel workloads in Azure

azure-batch cluster dsvm foreach mran parallel r

Brian c15da58e37 Added warning on README.md		2021-05-09 23:08:29 -07:00
.github	Created an issue template (#207 )	2018-01-25 08:43:06 -08:00
.vsts	Update pipeline.yml for Azure Pipelines	2019-04-16 13:57:01 -07:00
R	Fix: Resize Cluster (#371 )	2020-10-27 09:19:21 -07:00
docker-image	Feature/container (#153 )	2017-11-03 10:06:40 -07:00
docs	Fix: Upgrading to R Batch SDK to 2018-12-01.8.0 (#354 )	2019-06-18 21:04:30 -07:00
inst/startup	Fix: Revert Cluster Setup File (#342 )	2019-02-07 12:26:20 -08:00
man	Documentation rewrite (#273 )	2018-05-15 16:22:11 -07:00
samples	Fix: Upgrading to R Batch SDK to 2018-12-01.8.0 (#354 )	2019-06-18 21:04:30 -07:00
tests	Fix: Test Coverage on Azure Pipeline CI (#348 )	2019-02-13 21:36:40 -08:00
vignettes	Created documentation	2017-02-15 21:27:55 -08:00
.Rbuildignore	Added Travis CI (#23 )	2017-06-13 18:04:08 -07:00
.gitattributes	Change True/False to TRUE/FALSE in README example (#124 )	2017-09-27 14:29:32 -07:00
.gitignore	Feature/bio conductor docs (#106 )	2017-09-07 09:39:03 -07:00
.lintr	Enable AAD and VNet Support (#252 )	2018-04-27 17:43:06 -07:00
.travis.yml	Fix/add task perf (#195 )	2018-01-09 18:15:35 -08:00
CHANGELOG.md	Release: v0.8.0 (#357 )	2019-06-20 14:08:27 -07:00
Contributing.md	Updates to documentation	2017-02-17 07:46:25 -08:00
DESCRIPTION	Release: v0.8.0 (#357 )	2019-06-20 14:08:27 -07:00
LICENSE	Initial commit	2017-02-14 16:17:56 -08:00
NAMESPACE	Feature/asynccluster (#197 )	2018-01-18 13:52:42 -08:00
README.md	Added warning on README.md	2021-05-09 23:08:29 -07:00
account_setup.py	add sharedKey to credentials related code and doc (#266 )	2018-05-03 10:49:13 -07:00
account_setup.sh	Feature/getstarted (#255 )	2018-04-27 11:43:36 -07:00

README.md

This repo is no longer maintained and no new features will be added.

doAzureParallel

Introduction

The doAzureParallel package is a parallel backend for the widely popular foreach package. With doAzureParallel, each iteration of the foreach loop runs in parallel on an Azure Virtual Machine (VM), allowing users to scale up their R jobs to tens or hundreds of machines.

doAzureParallel is built to support the foreach parallel computing package. The foreach package supports parallel execution: it can execute multiple processes across a registered parallel backend. With just a few lines of code, the doAzureParallel package helps you create a cluster in Azure, register it as a parallel backend, and connect it seamlessly to the foreach package.

NOTE: The terms pool and cluster are used interchangeably throughout this document.

Notable Features

Ability to use low-priority VMs for an 80% discount (link)
Users can bring their own Docker Image
AAD and VNet support
Built-in support for Azure Blob Storage

Dependencies

R (>= 3.3.1)
httr (>= 1.2.1)
rjson (>= 0.2.15)
RCurl (>= 1.95-4.8)
digest (>= 0.6.9)
foreach (>= 1.4.3)
iterators (>= 1.0.8)
bitops (>= 1.0.5)

Setup

Install doAzureParallel directly from GitHub.

# install the package devtools
install.packages("devtools")

# install the doAzureParallel and rAzureBatch packages
devtools::install_github("Azure/rAzureBatch")
devtools::install_github("Azure/doAzureParallel")

Create a doAzureParallel credentials file

library(doAzureParallel)
generateCredentialsConfig("credentials.json")

Then download and run the account setup script from a bash shell:

wget -q https://raw.githubusercontent.com/Azure/doAzureParallel/master/account_setup.sh &&
chmod 755 account_setup.sh &&
/bin/bash account_setup.sh

Follow the on-screen prompts to create the necessary Azure resources, then copy the output into your credentials file. For more information, see Getting Started Scripts.

To Learn More:

Azure Account Requirements for doAzureParallel

Getting Started

Import the package

library(doAzureParallel)

Set up your parallel backend with Azure. This is your set of Azure VMs.

# 1. Generate your credential and cluster configuration files.  
generateClusterConfig("cluster.json")
generateCredentialsConfig("credentials.json")

# 2. Fill out your credential config and cluster config files.
# Enter your Azure Batch Account & Azure Storage keys/account-info into your credential config ("credentials.json") and configure your cluster in your cluster config ("cluster.json")

# 3. Set your credentials - you need to give the R session your credentials to interact with Azure
setCredentials("credentials.json")

# 4. Create the pool. This provisions a new pool if yours hasn't already been provisioned.
cluster <- makeCluster("cluster.json")

# 5. Register the pool as your parallel backend
registerDoAzureParallel(cluster)

# 6. Check that your parallel backend has been registered
getDoParWorkers()

Run your parallel foreach loop with the %dopar% operator. The foreach function returns the results of your parallel code.

number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations) %dopar% {
  # This code is executed, in parallel, across your cluster.
  myAlgorithm()
}
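
As a concrete illustration of the loop above, the following sketch swaps the placeholder myAlgorithm() for a small Monte Carlo estimate of pi. The function estimate_pi is hypothetical and not part of doAzureParallel. To keep the example runnable without an Azure pool, it assumes foreach's built-in sequential backend (registerDoSEQ); with a cluster registered through registerDoAzureParallel(cluster), the same loop body would be sent to the pool instead.

```r
library(foreach)

# Assumption: use foreach's sequential backend so the sketch runs locally.
# With registerDoAzureParallel(cluster) in its place, %dopar% targets the pool.
registerDoSEQ()

# Hypothetical stand-in for myAlgorithm(): one Monte Carlo estimate of pi.
estimate_pi <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  # Fraction of points inside the unit quarter circle, scaled by 4.
  4 * mean(x^2 + y^2 <= 1)
}

number_of_iterations <- 10
set.seed(42)

# .combine = c collapses the per-iteration results into a numeric vector;
# without it, foreach returns a list with one element per iteration.
estimates <- foreach(i = 1:number_of_iterations, .combine = c) %dopar% {
  estimate_pi(10000)
}

# Average the independent estimates.
pi_estimate <- mean(estimates)
```

Each iteration is independent of the others, which is the property that lets a foreach loop be distributed across machines.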

After you finish running your R code in Azure, shut down your cluster of VMs so that you are no longer charged for them.

# shut down your pool
stopCluster(cluster)
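
Because a pool that is left running keeps accruing charges, one option is to tie the shutdown to the work itself so that it happens even when the loop fails. The sketch below is an assumption about how that could be done, not something the package provides: with_cluster is a hypothetical helper, and fake_make and fake_stop are local stand-ins so the example runs without Azure.

```r
# Hypothetical helper (not part of doAzureParallel): runs work(cluster) and
# always calls stop_fn(cluster) afterwards, even if work() raises an error.
with_cluster <- function(make_fn, stop_fn, work) {
  cluster <- make_fn()
  on.exit(stop_fn(cluster), add = TRUE)
  work(cluster)
}

# Local stand-ins so the sketch runs without an Azure account.
stopped <- FALSE
fake_make <- function() list(name = "demo-pool")
fake_stop <- function(cluster) stopped <<- TRUE

# Simulate a job that fails partway through; the error is caught here,
# and on.exit has already triggered the shutdown by then.
result <- tryCatch(
  with_cluster(fake_make, fake_stop, function(cluster) stop("job failed")),
  error = function(e) conditionMessage(e)
)
```

With doAzureParallel loaded, make_fn would be something like function() makeCluster("cluster.json") and stop_fn would be stopCluster, with the foreach loop placed inside work.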

This section will provide information about how Azure works, how best to take advantage of Azure, and best practices when using the doAzureParallel package.

1. Azure Introduction (link)
   Using Azure Batch

2. Getting Started (link)
   Using the Getting Started scripts to create credentials
   i. Generate Credentials Script (link)
      - Pre-built bash script for getting Azure credentials without the Azure Portal
   ii. National Cloud Support (link)
      - How to run workloads in Azure national clouds

3. Customize Cluster (link)
   Setting up your cluster to suit your specific needs
   i. Virtual Machine Sizes (link)
      - How do you choose the best VM type/size for your workload?
   ii. Autoscale (link)
      - Automatically scale your cluster up or down to save time and/or money
   iii. Building Containers (link)
      - Creating your own Docker containers for reproducibility

4. Managing Cluster (link)
   Managing your cluster's lifespan

5. Customize Job
   Setting up your job to suit your specific needs
   i. Asynchronous Jobs (link)
      - Best practices for managing long-running jobs
   ii. Foreach Azure Options (link)
      - Use Azure package-defined foreach options to improve performance and user experience
   iii. Error Handling (link)
      - How Azure handles errors in your foreach loop

6. Package Management (link)
   Best practices for managing your R packages in code. This includes installation at the cluster or job level as well as how to use different package providers.

7. Storage Management
   i. Distributing your Data (link)
      - Best practices and limitations for working with distributed data
   ii. Persistent Storage (link)
      - Taking advantage of persistent storage for long-running jobs
   iii. Accessing Azure Storage through R (link)
      - Manage your Azure Storage files via R

8. Performance Tuning (link)
   Best practices for optimizing your foreach loop

9. Debugging and Troubleshooting (link)
   Best practices for diagnosing common issues

10. Azure Limitations (link)
    Learn about the limitations around the size of your cluster and the number of foreach jobs you can run in Azure.

Additional Documentation

Read our FAQ for known issues and common questions.

Next Steps

For more information, please visit our documentation.