Merge branch 'master' of https://github.com/Azure/AzureDSVM

2017-07-10 10:18:03 +08:00 · 2017-07-10 10:18:03 +08:00 · e0f29ca8a1
--- a/3
+++ b/3
@@ -9,7 +9,8 @@ Authors@R: c(
    person("Le", "Zhang", role=c("aut"), email="zhle@microsoft.com"))
 Description: The AzureDSVM is a package that aims at providing handy methods,
    with the help of underlying AzureSMR package, for doing data science jobs on
-    Azure Data Science Virtual Machine (DSVM) efficiently and economically. Basically it can be used with the
+    Azure Data Science Virtual Machine (DSVM) efficiently and economically.
+    Basically it can be used with the
    following benefits. Easy operations such as create, start, and stop of Azure
    DSVMs. Remote execution of data analytical jobs on cloud with
    specified computing context. Monitor of data consumption and calculation of
--- a/R/deployDSVMCluster.R
+++ b/R/deployDSVMCluster.R
@@ -164,18 +164,20 @@ deployDSVMCluster <- function(context,

  for (i in 1:count)
  {
-      deployDSVM(context=context,
-                 resource.group=resource.group,
-                 location=location,
-                 hostname=hostname[i],
-                 username=username[i],
-                 size=size[i],
-                 os=os[i],
-                 authen=authen[i],
-                 pubkey=pubkey[i],
-                 password=password[i],
-                 dns.label=hostname[i],
-                 mode=ifelse(i == count, "Sync", "Async"))
+    if (i == count) cat("\n") # Tidy the progress messages.
+    
+    deployDSVM(context=context,
+               resource.group=resource.group,
+               location=location,
+               hostname=hostname[i],
+               username=username[i],
+               size=size[i],
+               os=os[i],
+               authen=authen[i],
+               pubkey=pubkey[i],
+               password=password[i],
+               dns.label=hostname[i],
+               mode=ifelse(i == count, "Sync", "Async"))
  }

  # For a cluster set up public credentials for the DSVM cluster to
@@ -203,5 +205,5 @@ deployDSVMCluster <- function(context,
                          dns.label=dns.label)
  }
  
-  return(TRUE)
+  invisible(TRUE)
 }
--- a/vignettes/00Introduction.Rmd
+++ b/vignettes/00Introduction.Rmd
@@ -9,15 +9,17 @@ vignette: >
 \usepackage[utf8]{inputenc}
 ---

-The Azure Data Science Virtual Machine (DSVM) provide a powerful
+The Azure Data Science Virtual Machine (DSVM) provides a powerful
 platform for data scientists supporting a full open source stack of
-tools that today's data scientist will regularly use.
+regularly used tools.
 	
 # Preliminaries

 1. `AzureDSVM` requires users to have access to Azure resources
 through an Azure subscription which can be obtained through a [free
-Azure account](https://azure.microsoft.com/en-us/free).
+Azure account](https://azure.microsoft.com/en-us/free) using your
+credit card (but only to identify you as it will not be charged unless
+you later decide to take out a subscription).

 2. It is highly recommended that you read through the DSVM
 documentation:
@@ -53,7 +55,7 @@ if(!require("devtools")) install.packages("devtools")
 devtools::install_github("Azure/AzureDSVM")
 ```

-Help pages can be loaded by
+Help pages can be loaded with

 ```{r help azuresvm, eval=FALSE}
 library(help=AzureDSVM)
--- a/vignettes/10Deploy.Rmd
+++ b/vignettes/10Deploy.Rmd
@@ -21,8 +21,8 @@ vignette. Once deleted consumption (cost) will cease.
 This script is best run interactively to review its operation and to
 ensure that the interaction with Azure completes.

-The R script that can be generated from this vignette can be run as a
-standalone script to setup a new resource group and single Ubuntu
+An R script can be generated from this vignette and run as
+a standalone script to set up a new resource group and a single Ubuntu
 DSVM.

 # Preparation
@@ -45,7 +45,7 @@ the users desktop/laptop machine and will be found within
 to contain this information. The contents of the credentials file will
 be something like the following and we assume the user creates such a
 file in the current working directory, naming the file
-<USER>_credentials.R, replace <USER> with the user's username.
+<USER>_credentials.R. Replace <USER> with the user's username.

 ```{r credentials, eval=FALSE}
 # Credentials come from app creation in Active Directory within Azure.
@@ -63,8 +63,8 @@ PASSWORD <- "Public%4aR3@kn"               # For Windows DSVM

 ```

-Notice we include a password (fake in this case) for account creation
-on a Windows DSVM.
+Notice we include a password (a fake password in this case) for
+account creation on a Windows DSVM.

 We can simply source the credentials file in R.

@@ -160,15 +160,14 @@ Create the resource group within which all resources we create will be
 grouped.

 ```{r create resource group}
-if (! rg_pre_exists) {
-  # Create a new resource group into which we create the VMs and
-  # related resources. Resource group name is RG. 
-  
-  # Note that to create a new resource group one needs to add access
-  # control of Active Directory application at subscription level.
+# Create a new resource group into which we create the VMs and related
+# resources. Resource group name is RG.  Note that to create a new
+# resource group one needs to add access control of Active Directory
+# application at subscription level.

+if (! rg_pre_exists)
+{
  azureCreateResourceGroup(context, RG, LOC) %>% cat("\n\n")
-
 }

 # Check that it now exists.
@@ -205,8 +204,7 @@ formals(deployDSVM)$size
 formals(deployDSVM)$os
 ```

-The following code deploys a Linux DSVM, and it will take a few
-minutes.
+The following code deploys a Linux DSVM, which will take a few minutes.

 ```{r deploy}
 # Create the required Linux DSVM - generally 4 minutes.
@@ -232,8 +230,8 @@ Prove that the deployed DSVM exists.
 # existence. Expect a single line with an indication of how long the
 # server has been up and running.

-# NOTE this must be done after a while since deployment - problem may 
-# be owing to internal processing of system setup. 
+# NOTE this must be done after a short wait: even though deployment is
+# reported complete, there is a small delay before the server is available.

 Sys.sleep(20)

@@ -252,9 +250,9 @@ system(cmd, intern=TRUE)

 We can install some useful tools on a fresh server. Note that the
 Ubuntu server will still be running some background scripts as part of
-its own setup so if there are lock file error messages wait from the
-following commands then simply try again in a short while. We also
-update the operating system here though because of a bad console
+its own setup so if there are lock error messages (could not get lock)
+from the following commands then simply try again in a short while. We
+also update the operating system here though because of a bad console
 interaction from the msodbcsql package asking about licensing we have
 to do the distupgrade through a terminal so we need to log on to the
 server through the secure shell and manually run that command.  We
@@ -267,6 +265,7 @@ system(paste(ssh, "sudo apt-get -y install wajig"))
 system(paste(ssh, "wajig install -y lsb htop"))
 system(paste(ssh, "lsb_release -idrc"))
 system(paste(ssh, "wajig update"))
+# Manually ssh to the server and then ...
 # wajig distupgrade
 # sudo reboot
 ```
--- a/vignettes/20Multi.Rmd
+++ b/vignettes/20Multi.Rmd
@@ -10,26 +10,29 @@ vignette: >

 # Use Case

-Sometimes more than one DSVMs are needed.
+Sometimes more than one DSVM is needed.

 * Multi-deployment of heterogeneous DSVMs may be required for a
-  collaborative project where each of group members work on a machine
-  with specific configuration. For instance, a powerful yet expensive
-  machine is assigned to perform computation intensive tasks while a
-  cheap one can be used for explorative or interactive tasks.
+  collaborative project where each group member works on a machine
+  with a specific configuration targeting their requirements. For
+  instance, a powerful (expensive) machine is assigned to perform
+  computationally intensive tasks while a smaller (inexpensive)
+  machine can be used for explorative or interactive tasks.

 * Another common use case is for a Data Scientist to create their R
-programs to analyse a dataset on their local compute platform (e.g., a
-laptop with 6GB RAM running Ubuntu with R installed). Development is
-performed with a subset of the full dataset (a random sample) that
-will not exceed the available memory and will return results
-quickly. When the experimental setup is complete the script can be
-sent across to a considerably more capable compute engine on Azure,
-possibly a cluster of servers to build models in parallel.
+  programs to analyse a dataset on their local compute platform (e.g.,
+  a laptop with 6GB RAM running Ubuntu with R installed). Development
+  is performed with a subset of the full dataset (a random sample)
+  that will not exceed the available memory and will return results
+  quickly. When the experimental setup is complete the script can be
+  sent across to a considerably more capable compute engine on Azure,
+  possibly a cluster of servers to build models in parallel. Such a
+  deploy/compute/destroy cycle can be a significantly more
+  cost-effective alternative to an on-premises purchase of hardware.

 This tutorial deploys a collection/cluster of Linux Data Science
 Virtual Machines (DSVMs) for the above two scenarios. In the latter
-one, user distributes a trivial compute task over those servers,
+scenario the user distributes a compute task over those servers,
 collects the results and generates a report. Code is included but not
 run to then delete the resource group if the resources are no longer
 required. Once deleted consumption will cease.
@@ -39,9 +42,8 @@ ensure that the interaction with Azure completes.

 # Setup

-To get started load our Azure credentials as well as the user's ssh
-public key. This information has been saved into a file with the name
-<USER>_credentials.R where <USER> is your username.
+Refer to
+[Deploy](https://github.com/Azure/AzureDSVM/blob/master/vignettes/10Deploy.Rmd) for an explanation of the set up of the virtual machines.

 ```{r setup}
 # Load the required subscription resources: TID, CID, and KEY.
@@ -50,18 +52,14 @@ public key. This information has been saved into a file with the name
 USER <- Sys.info()[['user']]

 source(paste0(USER, "_credentials.R"))
-```

-```{r packages}
 # Load the required packages.

 library(AzureSMR)    # Support for managing Azure resources.
-library(AzureDSVM)    # Further support for the Data Scientist.
-library(magrittr)    
-library(dplyr)
-```
+library(AzureDSVM)   # Further support for the Data Scientist.
+library(magrittr)    # Pipeline computation.
+library(dplyr)       # Data wrangling.

-```{r tuning}
 # Parameters for this script: the name for the new resource group and
 # its location across the Azure cloud. The resource name is used to
 # name the resource group that we will create transiently for the
@@ -89,10 +87,10 @@ LOC <-

 cat("\n")

-COUNT <- 4                 # Number of VMs to deploy.
-```
+# Number of VMs to deploy.
+
+COUNT <- 4

-```{r connect}
 # Connect to the Azure subscription and use this as the context for
 # all of our activities.

@@ -103,13 +101,6 @@ context <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY)

 rg_pre_exists <- existsRG(context, RG, LOC)

-```
-# Create Resource Group
-
-Create the resource group within which all resources we create will be
-grouped.
-
-```{r create resource group}
 if (! rg_pre_exists)
 {
  # Create a new resource group into which we create the VMs and
@@ -130,30 +121,29 @@ cat("Resource group", RG, "at", LOC,

 # Create a Cluster

-Multi-deployment of DSVM can be achieved by calling
-`deployDSVMCluster` function. Note the function is designed to
-implicitly switch between cluster and collection of DSVMs, according
-to the given inputs. That is, if the `hostname`, i.e., names of the
-DSVMs consists of only one character string, the function will imply
-the deployment is to create a cluster of homogeneous DSVMs (i.e., same
-size), and use the unique machine name as the base, which is appended
-with a sequential number to form a full hostname. If the `hostname`
-is a vector of character strings, the function will create machines
-with names specified in the name vector.
+Multi-deployment of DSVMs can be achieved by calling
+`deployDSVMCluster()`. This function is designed to implicitly switch
+between cluster and collection of DSVMs, according to the given
+inputs. If the `hostname` (i.e., names of the DSVMs) consists of only
+one character string then a cluster of homogeneous DSVMs (i.e., same
+size) will be deployed. Using the unique machine name as the base, a
+sequential number is appended to form each full hostname. On the other
+hand if the `hostname` is a vector of character strings the function
+will create a collection of machines with names so specified.

-It is worth mentioning that a cluster of DSVMs is useful when
-batch-based analytical job needs to be done in a desired computing
-context, especially in a distributed manner across nodes of the
-cluster. The distributed computing functionality is empowered by
-Microsoft RevoScaleR parallel computing backend. The distributed and
-parallel computing is socket-based and relies on SSH for secure
-communication. To allow this communication across nodes,
-`deployDSVMCluster` added inbound security rules into security group
-of each DSVM in the cluster, and establish public key pairs for the
-machines.
+It is worth mentioning that a cluster of DSVMs is useful when a
+batch-based analytical job needs to be performed in a desired
+computing context, especially in a distributed manner across nodes of
+the cluster. The distributed computing functionality is supported by
+the Microsoft RevoScaleR parallel computing backend. The distributed
+and parallel computing backend is socket-based and relies on SSH for
+secure communication. To allow this communication across nodes,
+`deployDSVMCluster` adds inbound security rules into the security
+group of each DSVM in the cluster, and establishes public key pairs for
+the machines.

 We can now deploy a cluster of homogeneous DSVMs. Each DSVM will be
-named based on the *name* provided and sequentially numbered.
+named based on the *hostname* provided and sequentially numbered.

 ```{r deploy a cluster of DSVMs}
 # Deploy a cluster of DSVMs.
@@ -166,13 +156,8 @@ deployDSVMCluster(context,
                  authen="Key",
                  pubkey=PUBKEY,
                  count=COUNT)
-```

-```{r check existence of deployed DSVMs}
-
-cluster <- azureListVM(context,
-                       RG,
-                       LOC)
+cluster <- azureListVM(context, RG, LOC)

 # To validate the existence of deployed DSVMs.

@@ -212,10 +197,12 @@ and size can also be configured.

 ```{r deploy a set of DSVMs, eval=FALSE}

-DSVM_NAMES <- paste0(BASE, c(1, 2, 3))
+DSVM_NAMES <- paste0(BASE, c(1, 2, 3)) %T>% print()

 # Deploy multiple DSVMs using deployDSVMCluster.

+# TODO: CURRENTLY NOT FUNCTIONAL. NEED A FULLY FUNCTIONAL EXAMPLE.
+
 deployDSVMCluster(context, 
                  resource.group=RG, 
                  location=LOC, 
@@ -258,5 +245,3 @@ for (vm in DSVM_NAMES)
 if (! rg_pre_exists)
  azureDeleteResourceGroup(context, RG)
 ```
-
-Once deleted we are consuming no more.