* Put images into vignettes folder

* Create 00-azure-introduction.md

* Create 10-vm-sizes.md

* Update VM sizes link in docs/README.md

* Create 20-package-management.md

* Update README.md with Azure Batch limitations

* Create 21-distributing-data.md

* Create 22-parallelizing-cores.md

* Create 23-persistent-storage.md

* Standardize the foreach and pool keywords
This commit is contained in:
JS 2017-02-15 21:27:55 -08:00 committed by GitHub
Parent 1f7e816c14
Commit 726e10b9d1
10 changed files: 250 additions and 13 deletions


@ -2,7 +2,7 @@
The *doAzureParallel* package is a parallel backend for the widely popular *foreach* package. With *doAzureParallel*, each iteration of the *foreach* loop runs in parallel on an Azure Virtual Machine (VM), allowing users to scale up their R jobs to tens or hundreds of machines.
*doAzureParallel* is built to support the *foreach* parallel computing package. The *foreach* package supports parallel execution - it can execute multiple processes across some parallel backend. With just a few lines of code, the *doAzureParallel* package helps you create a cluster in Azure, register it as a parallel backend, and seamlessly connect to the *foreach* package.
*doAzureParallel* is built to support the *foreach* parallel computing package. The *foreach* package supports parallel execution - it can execute multiple processes across some parallel backend. With just a few lines of code, the *doAzureParallel* package helps you create a pool in Azure, register it as a parallel backend, and seamlessly connect to the *foreach* package.
## Dependencies
@ -29,7 +29,7 @@ install_github(c("Azure/rAzureBatch", "Azure/doAzureParallel"))
## Azure Requirements
To run your R code across a cluster in Azure, we'll need to get keys and account information.
To run your R code across a pool in Azure, we'll need to get keys and account information.
### Setup Azure Account
First, set up your Azure Account ([Get started for free!](https://azure.microsoft.com/en-us/free/))
@ -46,7 +46,7 @@ For your Azure Batch Account, we need to get:
This information can be found in the Azure Portal inside your Batch Account:
![Azure Batch Account in the Portal](/doAzureParallel-azurebatch-instructions.PNG "Azure Batch Account in the Portal")
![Azure Batch Account in the Portal](./vignettes/doAzureParallel-azurebatch-instructions.PNG "Azure Batch Account in the Portal")
For your Azure Storage Account, we need to get:
- Storage Account Name
@ -54,7 +54,7 @@ For your Azure Storage Account, we need to get:
This information can be found in the Azure Portal inside your Azure Storage Account:
![Azure Storage Account in the Portal](/doAzureParallel-azurestorage-instructions.PNG "Azure Storage Account in the Portal")
![Azure Storage Account in the Portal](./vignettes/doAzureParallel-azurestorage-instructions.PNG "Azure Storage Account in the Portal")
Keep track of the above keys and account information as it will be used to connect your R session with Azure.
@ -88,7 +88,7 @@ Run your parallel *foreach* loop with the *%dopar%* keyword. The *foreach* funct
```R
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations) %dopar% {
# This code is executed, in parallel, across your Azure cluster
# This code is executed, in parallel, across your Azure pool.
}
```
@ -98,13 +98,13 @@ When developing at scale, it is always recommended that you test and debug your
# run your code sequentially on your local machine
results <- foreach(i = 1:number_of_iterations) %do% { ... }
# use the doAzureParallel backend to run your code in parallel across your Azure cluster
# use the doAzureParallel backend to run your code in parallel across your Azure pool
results <- foreach(i = 1:number_of_iterations) %dopar% { ... }
```
### Pool Configuration JSON
Use your pool configuration JSON file to define your cluster in Azure.
Use your pool configuration JSON file to define your pool in Azure.
```javascript
{
@ -113,10 +113,10 @@ Use your pool configuration JSON file to define your cluster in Azure.
"key": <Azure Batch Account Key>,
"url": <Azure Batch Account URL>,
"pool": {
"name": <your cluster name>, // example: "my_new_azure_cluster"
"vmsize": <your cluster VM size identifier>, // example: "Standard_A1_v2"
"name": <your pool name>, // example: "my_new_azure_pool"
"vmsize": <your pool VM size name>, // example: "Standard_A1_v2"
"poolSize": {
"targetDedicated": <number of node you want in your cluster>, // example: 10
"targetDedicated": <number of node you want in your pool>, // example: 10
}
},
"rPackages": {
@ -137,6 +137,24 @@ Use your pool configuration JSON file to define your cluster in Azure.
}
```
## Azure Pool Limitations
doAzureParallel is built on top of Azure Batch, which imposes a few default quota limitations.
### Core Count Limitation
By default, doAzureParallel users are limited to 20 cores in total. (Please refer to the [VM Size Table](./docs/10-vm-sizes.md#vm-size-table) to see how many cores are in the VM size you have selected.)
Our default VM size selection is **"Standard_A1_v2"**, which has 1 core per VM. With this VM size, users are limited to a 20-node pool.
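To make the quota arithmetic concrete, here is a small sketch (not part of the package API) that computes the maximum pool size under the default quota for a few VM sizes; the cores-per-VM figures come from the VM Size Table:

```r
# Default doAzureParallel quota: 20 cores in total.
core_quota <- 20

# Cores per VM for a few sizes (from the VM Size Table).
cores_per_vm <- c(Standard_A1_v2 = 1, Standard_A4_v2 = 4, Standard_F16 = 16)

# Maximum number of nodes your pool can have under the default quota.
max_nodes <- floor(core_quota / cores_per_vm)
max_nodes
# Standard_A1_v2: 20, Standard_A4_v2: 5, Standard_F16: 1
```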
### Number of *foreach* Loops
By default, doAzureParallel users are limited to running 20 *foreach* loops on Azure at a time. This is because each *foreach* loop generates a *job*, and users are limited to 20 jobs by default. To go beyond that limit, users need to delete their old *jobs*.
### Increasing Your Quota
To increase your default quota limitations, please visit [this page](https://docs.microsoft.com/en-us/azure/batch/batch-quota-limit#increase-a-quota) for instructions.
## Contributing
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.


@ -0,0 +1,35 @@
# Azure Introduction
doAzureParallel lets users seamlessly take advantage of the scale and elasticity of Azure to run their parallel workloads. This section will describe how the doAzureParallel package uses Azure and some of the key benefits that Azure provides.
## Azure Batch
Azure Batch is a platform service for running large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud.
### How does it work?
The doAzureParallel package is built on top of Azure Batch via the *rAzureBatch* package, which interacts with the Azure Batch service's REST API. Azure Batch schedules work across a managed collection of VMs (called a *pool*) and can automatically scale the pool to meet the needs of your R jobs.
In Azure Batch, a pool consists of a collection of VMs (in our case, a collection of DSVMs) - this pool can be configured by the config file that is generated by this package. For each *foreach* loop, the Azure Batch Job Scheduler will create a group of tasks (called an Azure Batch Job), where each iteration in the loop maps to a task. Each task is then distributed across the pool, running the code inside of each iteration in the loop.
To do this, we copy the existing R environment and store it in Azure Storage. As each VM in the Azure Batch pool becomes ready to run the job, it fetches and loads that R environment. The VM then runs the R code inside each iteration of the *foreach* loop under the loaded environment. Once the code finishes, the results are pushed back into Azure Storage, and a merge task aggregates them. Finally, the aggregated results are returned to the user within the R session.
Learn more about Azure Batch [here](https://docs.microsoft.com/en-us/azure/batch/batch-technical-overview#pricing).
### Azure Batch Pricing
Azure Batch is a free service; you aren't charged for the Batch account itself. You are charged for the underlying Azure compute resources that your Batch solutions consume, and for the resources consumed by other services when your workloads run.
## Data Science Virtual Machines (DSVM)
The doAzureParallel package uses the Data Science Virtual Machine (DSVM) for each node in the pool. The DSVM is a customized VM image that has many popular R tools pre-installed. Because these tools are pre-baked into the image, using it gives us a considerable speedup when provisioning the pool.
This package uses the Linux Edition of the DSVM which comes preinstalled with the Microsoft R Server Developer edition as well as many popular packages from Microsoft R Open (MRO). By using and extending open source R, Microsoft R Server is fully compatible with R scripts, functions and CRAN packages.
Learn more about the DSVM [here](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.standard-data-science-vm?tab=Overview).
### DSVM Pricing
Using the DSVM image is free and doesn't add to the cost of the underlying VMs.

docs/10-vm-sizes.md Normal file

@ -0,0 +1,68 @@
# Virtual Machine Sizes
The doAzureParallel package lets you choose the VMs that your code runs on, giving you full control over your infrastructure. By default, we start you on an economical, general-purpose VM size called **"Standard_A1_v2"**.
Each doAzureParallel pool can contain only one VM size, selected at pool creation. Once the pool is created, users cannot change the VM size without provisioning a new pool.
## Setting your VM size
The VM size is set in the configuration JSON file that is passed into the `registerPool()` method. To set your desired VM size, simply edit the `vmSize` key in the JSON:
```javascript
{
...
"vmSize": <Your Desired VM Size>,
...
}
```
## Choosing your VM Size
Azure has a wide variety of VMs that you can choose from.
### VM Categories
The three recommended VM categories for the doAzureParallel package are:
- Av2-Series VMs
- F-Series VMs
- Dv2-Series VMs
Each VM category also has a variety of VM sizes (see table below).
Generally speaking, the F-Series VM is ideal for compute intensive workloads, the Dv2-Series VMs are ideal for memory intensive workloads, and finally the Av2-Series VMs are economical, general-purpose VMs.
The Dv2-Series VMs and F-Series VMs use the 2.4 GHz Intel Xeon® E5-2673 v3 (Haswell) processor.
### VM Size Table
Please see the below table for a curated list of VM types:
| VM Category | VM Size | Cores | Memory (GB) |
| ----------- | ------- | ----- | ----------- |
| Av2-Series | Standard_A1_v2 | 1 | 2 |
| Av2-Series | Standard_A2_v2 | 2 | 4 |
| Av2-Series | Standard_A4_v2 | 4 | 8 |
| Av2-Series | Standard_A8_v2 | 8 | 16 |
| Av2-Series | Standard_A2m_v2 | 2 | 16 |
| Av2-Series | Standard_A4m_v2 | 4 | 32 |
| Av2-Series | Standard_A8m_v2 | 8 | 64 |
| F-Series | Standard_F1 | 1 | 2 |
| F-Series | Standard_F2 | 2 | 4 |
| F-Series | Standard_F4 | 4 | 8 |
| F-Series | Standard_F8 | 8 | 16 |
| F-Series | Standard_F16 | 16 | 32 |
| Dv2-Series | Standard_D1_v2 | 1 | 3.5 |
| Dv2-Series | Standard_D2_v2 | 2 | 7 |
| Dv2-Series | Standard_D3_v2 | 4 | 14 |
| Dv2-Series | Standard_D4_v2 | 8 | 28 |
| Dv2-Series | Standard_D5_v2 | 16 | 56 |
| Dv2-Series | Standard_D11_v2 | 2 | 14 |
| Dv2-Series | Standard_D12_v2 | 4 | 28 |
| Dv2-Series | Standard_D13_v2 | 8 | 56 |
| Dv2-Series | Standard_D14_v2 | 16 | 112 |
The list above covers most scenarios that run R jobs. For special scenarios (such as GPU accelerated R code) please see the full list of available VM sizes by visiting the Azure VM Linux Sizes page [here](https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-linux-sizes?toc=%2fazure%2fvirtual-machines%2flinux%2ftoc.json#a-series).
To get a sense of what each VM costs, please visit the Azure Virtual Machine pricing page [here](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/).


@ -0,0 +1,42 @@
# Package Management
The doAzureParallel package allows you to install packages to your pool in two ways:
- Installing on pool creation
- Installing per-*foreach* loop
## Installing Packages on Pool Creation
You can install packages by specifying the package(s) in your JSON pool configuration file. This will then install the specified packages at the time of pool creation.
```javascript
{
...
"rPackages": {
"cran": {
"source": "http://cran.us.r-project.org",
"name": ["some_cran_package_name", "some_other_cran_package_name"]
},
"github": ["github_username/github_package_name", "another_github_username/another_github_package_name"]
},
...
}
```
## Installing Packages per-*foreach* Loop
You can also install packages by using the **.packages** option in the *foreach* loop. Instead of installing packages during pool creation, packages (and their dependencies) can be installed before each iteration in the loop is run on your Azure pool.
To install a single package:
```R
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, .packages='some_package') %dopar% { ... }
```
To install multiple packages:
```R
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, .packages=c('package_1', 'package_2')) %dopar% { ... }
```
Installing packages from GitHub using this method is not yet supported.
## Uninstalling Packages
Uninstalling packages from your pool is not supported. However, you may consider rebuilding your pool.


@ -0,0 +1,27 @@
# Distributing Data
The doAzureParallel package lets you distribute the data you have in your R session across your Azure pool.
As long as the data you wish to distribute can fit in-memory on your local machine as well as in the memory of the VMs in your pool, the doAzureParallel package will be able to manage the data.
```R
my_data_set <- data_set
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations) %dopar% {
runAlgorithm(my_data_set)
}
```
## Chunking Data
A common scenario is to chunk your data so that each chunk maps to an iteration of the *foreach* loop.
```R
# split the data into 10 roughly equal chunks
# (for a vector; for a data frame, split on row indices instead)
chunks <- split(<data_set>, cut(seq_along(<data_set>), 10))
results <- foreach(chunk = chunks) %dopar% {
  runAlgorithm(chunk)
}
```
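The chunking pattern can be tried locally with the sequential `%do%` backend before running it on Azure; the sketch below assumes a plain numeric vector and uses `sum` as a stand-in for the real per-chunk computation:

```r
library(foreach)

# toy data standing in for your real data set
data_set <- 1:100

# split the vector into 10 roughly equal chunks
chunks <- split(data_set, cut(seq_along(data_set), 10))

# one iteration per chunk, run sequentially with %do%
results <- foreach(chunk = chunks) %do% {
  sum(chunk)  # stand-in for runAlgorithm(chunk)
}

length(results)  # 10
results[[1]]     # sum of the first chunk, 1:10 -> 55
```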


@ -0,0 +1,33 @@
# Parallelizing Cores
Depending on the VM size you select, you may want your R code running on all the cores in each VM. To do this, we recommend nesting a *foreach* loop that uses the *doParallel* package inside the outer *foreach* loop that uses doAzureParallel.
The *doParallel* package can detect the number of cores on a computer and parallelizes each iteration of the *foreach* loop across those cores. Pairing this with the doAzureParallel package, we can schedule work to each core of each VM in the pool.
```R
# register your Azure pool as the parallel backend
registerDoAzureParallel(pool)

# execute your outer foreach loop to schedule work to the pool
number_of_outer_iterations <- 10
results <- foreach(i = 1:number_of_outer_iterations, .packages='doParallel') %dopar% {

  # detect the number of cores on the VM
  cores <- detectCores()

  # make a local 'cluster' using the cores on the VM
  cl <- makeCluster(cores)

  # register that cluster as the parallel backend within each VM
  registerDoParallel(cl)

  # execute your inner foreach loop across all the cores in the VM
  number_of_inner_iterations <- 20
  inner_results <- foreach(j = 1:number_of_inner_iterations) %dopar% {
    runAlgorithm()
  }

  # release the worker processes before returning
  stopCluster(cl)
  return(inner_results)
}
```


@ -0,0 +1,14 @@
# Persistent Storage
When executing long-running jobs, users may not want to keep their session open to wait for results to be returned.
The doAzureParallel package automatically stores the results of the *foreach* loop in an Azure Storage account - this means that when the user exits the session, the results won't be lost. Instead, users can simply pull the results down from Azure at any time and load them into their current session.
To do so, users need to keep track of **job ids**. Each *foreach* loop is considered a *job* and is assigned a unique ID. The job id is returned to the user after the *foreach* loop is executed.
When the user returns and begins a new session, they can pull down the results from their job.
```R
my_job_id <- "job123456789"
results <- GetJobResult(my_job_id)
```


@ -5,13 +5,13 @@ This section will provide information about how Azure works, how best to take ad
Using the *Data Science Virtual Machine (DSVM)* & *Azure Batch*
2. **Virtual Machine Sizes** [(link)](./10-choosing-vm-sizes.md)
2. **Virtual Machine Sizes** [(link)](./10-vm-sizes.md)
How do you choose the best VM type/size for your workload?
3. **Package Management** [(link)](./20-package-management.md)
Best practices for managing your R packages across your Azure cluster
Best practices for managing your R packages across your Azure pool
4. **Distributing your Data** [(link)](./21-distributing-data.md)
@ -19,7 +19,7 @@ This section will provide information about how Azure works, how best to take ad
5. **Parallelizing on each VM Core** [(link)](./22-parallelizing-cores.md)
Best practices and limitations for parallelizing your R code to each core in each VM in your cluster
Best practices and limitations for parallelizing your R code to each core in each VM in your pool
6. **Persistent Storage** [(link)](./23-persistent-storage.md)

Before

Width:  |  Height:  |  Size: 46 KiB

After

Width:  |  Height:  |  Size: 46 KiB

Before

Width:  |  Height:  |  Size: 46 KiB

After

Width:  |  Height:  |  Size: 46 KiB