This commit is contained in:
yueguoguo 2017-07-10 10:18:03 +08:00
Parent b39d0d9674 e48eddded4
Commit e0f29ca8a1
5 changed files: 89 additions and 100 deletions

View file

@ -9,7 +9,8 @@ Authors@R: c(
person("Le", "Zhang", role=c("aut"), email="zhle@microsoft.com"))
Description: The AzureDSVM is a package that aims at providing handy methods,
with the help of underlying AzureSMR package, for doing data science jobs on
Azure Data Science Virtual Machine (DSVM) efficiently and economically. Basically it can be used with the
Azure Data Science Virtual Machine (DSVM) efficiently and economically.
Basically it can be used with the
following benefits. Easy operations such as create, start, and stop of Azure
DSVMs. Remote execution of data analytical jobs on cloud with
specified computing context. Monitor of data consumption and calculation of

View file

@ -164,18 +164,20 @@ deployDSVMCluster <- function(context,
for (i in 1:count)
{
deployDSVM(context=context,
resource.group=resource.group,
location=location,
hostname=hostname[i],
username=username[i],
size=size[i],
os=os[i],
authen=authen[i],
pubkey=pubkey[i],
password=password[i],
dns.label=hostname[i],
mode=ifelse(i == count, "Sync", "Async"))
if (i == count) cat("\n") # Tidy the progress messages.
deployDSVM(context=context,
resource.group=resource.group,
location=location,
hostname=hostname[i],
username=username[i],
size=size[i],
os=os[i],
authen=authen[i],
pubkey=pubkey[i],
password=password[i],
dns.label=hostname[i],
mode=ifelse(i == count, "Sync", "Async"))
}
# For a cluster set up public credentials for the DSVM cluster to
@ -203,5 +205,5 @@ deployDSVMCluster <- function(context,
dns.label=dns.label)
}
return(TRUE)
invisible(TRUE)
}
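Two behaviours touched by this hunk can be checked in isolation: the per-iteration `mode` argument (every deployment is asynchronous except the last) and the switch from `return(TRUE)` to `invisible(TRUE)`, which stops the result being auto-printed at the console while leaving the value unchanged. The following is a minimal base-R sketch; `loud()` and `quiet()` are hypothetical stand-ins and are not part of AzureDSVM.

```r
# Mode chosen for each of `count` deployments, mirroring the ifelse() in
# the loop above: only the final deployment waits for completion.
count <- 4
modes <- ifelse(seq_len(count) == count, "Sync", "Async")

# Hypothetical stand-ins for a function ending in return(TRUE) versus
# one ending in invisible(TRUE).
loud  <- function() return(TRUE)
quiet <- function() invisible(TRUE)

# withVisible() reports whether a value would be auto-printed at top level.
loud_visible  <- withVisible(loud())$visible
quiet_visible <- withVisible(quiet())$visible

# The value itself is unchanged, so callers can still test the result.
ok <- quiet()
if (ok) cat("deployment reported success\n")
```

The sketch uses `seq_len(count)` rather than `1:count` because `1:count` yields `c(1, 0)` when `count` is zero.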

View file

@ -9,15 +9,17 @@ vignette: >
\usepackage[utf8]{inputenc}
---
The Azure Data Science Virtual Machine (DSVM) provide a powerful
The Azure Data Science Virtual Machine (DSVM) provides a powerful
platform for data scientists supporting a full open source stack of
tools that today's data scientist will regularly use.
regularly used tools.
# Preliminaries
1. `AzureDSVM` requires users to have access to Azure resources
through an Azure subscription which can be obtained through a [free
Azure account](https://azure.microsoft.com/en-us/free).
Azure account](https://azure.microsoft.com/en-us/free) using your
credit card (but only to identify you as it will not be charged unless
you later decide to take out a subscription).
2. It is highly recommended that you read through the DSVM
documentation:
@ -53,7 +55,7 @@ if(!require("devtools")) install.packages("devtools")
devtools::install_github("Azure/AzureDSVM")
```
Help pages can be loaded by
Help pages can be loaded with
```{r help azuresvm, eval=FALSE}
library(help=AzureDSVM)

View file

@ -21,8 +21,8 @@ vignette. Once deleted consumption (cost) will cease.
This script is best run interactively to review its operation and to
ensure that the interaction with Azure completes.
The R script that can be generated from this vignette can be run as a
standalone script to setup a new resource group and single Ubuntu
An R script can be generated from this vignette and run as a
standalone script to set up a new resource group and a single Ubuntu
DSVM.
# Preparation
@ -45,7 +45,7 @@ the users desktop/laptop machine and will be found within
to contain this information. The contents of the credentials file will
be something like the following and we assume the user creates such a
file in the current working directory, naming the file
<USER>_credentials.R, replace <USER> with the user's username.
<USER>_credentials.R. Replace <USER> with the user's username.
```{r credentials, eval=FALSE}
# Credentials come from app creation in Active Directory within Azure.
@ -63,8 +63,8 @@ PASSWORD <- "Public%4aR3@kn" # For Windows DSVM
```
Notice we include a password (fake in this case) for account creation
on a Windows DSVM.
Notice we include a password (a fake password in this case) for
account creation on a Windows DSVM.
We can simply source the credentials file in R.
@ -160,15 +160,14 @@ Create the resource group within which all resources we create will be
grouped.
```{r create resource group}
if (! rg_pre_exists) {
# Create a new resource group into which we create the VMs and
# related resources. Resource group name is RG.
# Note that to create a new resource group one needs to add access
# control of Active Directory application at subscription level.
# Create a new resource group into which we create the VMs and related
# resources. Resource group name is RG. Note that to create a new
# resource group one needs to add access control of Active Directory
# application at subscription level.
if (! rg_pre_exists)
{
azureCreateResourceGroup(context, RG, LOC) %>% cat("\n\n")
}
# Check that it now exists.
@ -205,8 +204,7 @@ formals(deployDSVM)$size
formals(deployDSVM)$os
```
The following code deploys a Linux DSVM, and it will take a few
minutes.
The following code deploys a Linux DSVM which will take a few minutes.
```{r deploy}
# Create the required Linux DSVM - generally 4 minutes.
@ -232,8 +230,8 @@ Prove that the deployed DSVM exists.
# existence. Expect a single line with an indication of how long the
# server has been up and running.
# NOTE this must be done after a while since deployment - problem may
# be owing to internal processing of system setup.
# NOTE: wait a while before doing this. Even after deployment is
# reported there is a small delay before the server is actually available.
Sys.sleep(20)
@ -252,9 +250,9 @@ system(cmd, intern=TRUE)
We can install some useful tools on a fresh server. Note that the
Ubuntu server will still be running some background scripts as part of
its own setup so if there are lock file error messages wait from the
following commands then simply try again in a short while. We also
update the operating system here though because of a bad console
its own setup so if there are lock error messages (could not get lock)
from the following commands then simply try again in a short while. We
also update the operating system here though because of a bad console
interaction from the msodbcsql package asking about licensing we have
to do the distupgrade through a terminal so we need to log on to the
server through the secure shell and manually run that command. We
@ -267,6 +265,7 @@ system(paste(ssh, "sudo apt-get -y install wajig"))
system(paste(ssh, "wajig install -y lsb htop"))
system(paste(ssh, "lsb_release -idrc"))
system(paste(ssh, "wajig update"))
# Manually ssh to the server and then ...
# wajig distupgrade
# sudo reboot
```
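The advice above to "simply try again in a short while" when lock errors appear can be automated. The following is a minimal sketch, assuming a hypothetical helper `retry()` that is not part of AzureDSVM or AzureSMR; the failing command is simulated so the example runs without a server.

```r
# Hypothetical helper: call `run()` until it returns TRUE or the
# attempts are used up, pausing `wait` seconds between tries.
retry <- function(run, attempts = 5, wait = 30)
{
  for (i in seq_len(attempts))
  {
    if (isTRUE(run())) return(invisible(TRUE))
    if (i < attempts) Sys.sleep(wait)
  }
  invisible(FALSE)
}

# Simulated command that fails twice (as if the package lock were still
# held) and then succeeds.
calls <- 0
flaky <- function() { calls <<- calls + 1; calls >= 3 }

succeeded <- retry(flaky, attempts = 5, wait = 0)
```

Against a real server, `run` could wrap one of the commands above, for example `function() system(paste(ssh, "sudo apt-get -y install wajig")) == 0`, on the assumption that the exit status of the remote command is what `system()` returns.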

View file

@ -10,26 +10,29 @@ vignette: >
# Use Case
Sometimes more than one DSVMs are needed.
Sometimes more than one DSVM is needed.
* Multi-deployment of heterogeneous DSVMs may be required for a
collaborative project where each of group members work on a machine
with specific configuration. For instance, a powerful yet expensive
machine is assigned to perform computation intensive tasks while a
cheap one can be used for explorative or interactive tasks.
collaborative project where each group member works on a machine
with a specific configuration targeting their requirements. For
instance, a powerful (expensive) machine is assigned to perform
computationally intensive tasks while a smaller (inexpensive)
machine can be used for explorative or interactive tasks.
* Another common use case is for a Data Scientist to create their R
programs to analyse a dataset on their local compute platform (e.g., a
laptop with 6GB RAM running Ubuntu with R installed). Development is
performed with a subset of the full dataset (a random sample) that
will not exceed the available memory and will return results
quickly. When the experimental setup is complete the script can be
sent across to a considerably more capable compute engine on Azure,
possibly a cluster of servers to build models in parallel.
programs to analyse a dataset on their local compute platform (e.g.,
a laptop with 6GB RAM running Ubuntu with R installed). Development
is performed with a subset of the full dataset (a random sample)
that will not exceed the available memory and will return results
quickly. When the experimental setup is complete the script can be
sent across to a considerably more capable compute engine on Azure,
possibly a cluster of servers to build models in parallel in a
deploy/compute/destroy cycle to be a significantly more cost
effective alternative to an on-premise purchase of hardware.
This tutorial deploys a collection/cluster of Linux Data Science
Virtual Machines (DSVMs) for the above two scenarios. In the latter
one, user distributes a trivial compute task over those servers,
scenarios the user distributes a compute task over those servers,
collects the results and generates a report. Code is included but not
run to then delete the resource group if the resources are no longer
required. Once deleted consumption will cease.
@ -39,9 +42,8 @@ ensure that the interaction with Azure completes.
# Setup
To get started load our Azure credentials as well as the user's ssh
public key. This information has been saved into a file with the name
<USER>_credentials.R where <USER> is your username.
Refer to
[Deploy](https://github.com/Azure/AzureDSVM/blob/master/vignettes/10Deploy.Rmd) for an explanation of the set up of the virtual machines.
```{r setup}
# Load the required subscription resources: TID, CID, and KEY.
@ -50,18 +52,14 @@ public key. This information has been saved into a file with the name
USER <- Sys.info()[['user']]
source(paste0(USER, "_credentials.R"))
```
```{r packages}
# Load the required packages.
library(AzureSMR) # Support for managing Azure resources.
library(AzureDSVM) # Further support for the Data Scientist.
library(magrittr)
library(dplyr)
```
library(AzureDSVM) # Further support for the Data Scientist.
library(magrittr) # Pipeline computation.
library(dplyr) # Data wrangling.
```{r tuning}
# Parameters for this script: the name for the new resource group and
# its location across the Azure cloud. The resource name is used to
# name the resource group that we will create transiently for the
@ -89,10 +87,10 @@ LOC <-
cat("\n")
COUNT <- 4 # Number of VMs to deploy.
```
# Number of VMs to deploy.
COUNT <- 4
```{r connect}
# Connect to the Azure subscription and use this as the context for
# all of our activities.
@ -103,13 +101,6 @@ context <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY)
rg_pre_exists <- existsRG(context, RG, LOC)
```
# Create Resource Group
Create the resource group within which all resources we create will be
grouped.
```{r create resource group}
if (! rg_pre_exists)
{
# Create a new resource group into which we create the VMs and
@ -130,30 +121,29 @@ cat("Resource group", RG, "at", LOC,
# Create a Cluster
Multi-deployment of DSVM can be achieved by calling
`deployDSVMCluster` function. Note the function is designed to
implicitly switch between cluster and collection of DSVMs, according
to the given inputs. That is, if the `hostname`, i.e., names of the
DSVMs consists of only one character string, the function will imply
the deployment is to create a cluster of homogeneous DSVMs (i.e., same
size), and use the unique machine name as the base, which is appended
with a sequential number to form a full hostname. If the `hostname`
is a vector of character strings, the function will create machines
with names specified in the name vector.
Multi-deployment of DSVMs can be achieved by calling
`deployDSVMCluster()`. This function is designed to implicitly switch
between cluster and collection of DSVMs, according to the given
inputs. If the `hostname` (i.e., names of the DSVMs) consists of only
one character string then a cluster of homogeneous DSVMs (i.e., same
size) will be deployed. Using the unique machine name as the base a
sequential numbering is used to form a full hostname. On the other
hand if the `hostname` is a vector of character strings the function
will create a collection of machines with names so specified.
It is worth mentioning that a cluster of DSVMs is useful when
batch-based analytical job needs to be done in a desired computing
context, especially in a distributed manner across nodes of the
cluster. The distributed computing functionality is empowered by
Microsoft RevoScaleR parallel computing backend. The distributed and
parallel computing is socket-based and relies on SSH for secure
communication. To allow this communication across nodes,
`deployDSVMCluster` added inbound security rules into security group
of each DSVM in the cluster, and establish public key pairs for the
machines.
It is worth mentioning that a cluster of DSVMs is useful when a
batch-based analytical job needs to be performed in a desired
computing context, especially in a distributed manner across nodes of
the cluster. The distributed computing functionality is supported by
the Microsoft RevoScaleR parallel computing backend. The distributed
and parallel computing backend is socket-based and relies on SSH for
secure communication. To allow this communication across nodes,
`deployDSVMCluster` adds inbound security rules into the security
group of each DSVM in the cluster, and establishes public key pairs for
the machines.
We can now deploy a cluster of homogeneous DSVMs. Each DSVM will be
named based on the *name* provided and sequentially numbered.
named based on the *hostname* provided and sequentially numbered.
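The switch between a cluster and a collection described above can be illustrated with a small sketch. `expandHostnames()` is a hypothetical helper written for this note, and the plain `1..count` suffix is an assumption; the numbering format actually used by `deployDSVMCluster()` may differ.

```r
# Hypothetical helper illustrating the documented behaviour only.
expandHostnames <- function(hostname, count)
{
  if (length(hostname) == 1)
  {
    # Cluster: one base name, numbered sequentially.
    paste0(hostname, seq_len(count))
  }
  else
  {
    # Collection: use the names exactly as given.
    hostname
  }
}

cluster_names    <- expandHostnames("dsvm", 4)
collection_names <- expandHostnames(c("alice", "bob"), 2)
```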
```{r deploy a cluster of DSVMs}
# Deploy a cluster of DSVMs.
@ -166,13 +156,8 @@ deployDSVMCluster(context,
authen="Key",
pubkey=PUBKEY,
count=COUNT)
```
```{r check existence of deployed DSVMs}
cluster <- azureListVM(context,
RG,
LOC)
cluster <- azureListVM(context, RG, LOC)
# To validate the existence of deployed DSVMs.
@ -212,10 +197,12 @@ and size can also be configured.
```{r deploy a set of DSVMs, eval=FALSE}
DSVM_NAMES <- paste0(BASE, c(1, 2, 3))
DSVM_NAMES <- paste0(BASE, c(1, 2, 3)) %T>% print()
# Deploy multiple DSVMs using deployDSVMCluster.
# TODO: CURRENTLY NOT FUNCTIONAL. NEED A FULLY FUNCTIONAL EXAMPLE
deployDSVMCluster(context,
resource.group=RG,
location=LOC,
@ -258,5 +245,3 @@ for (vm in DSVM_NAMES)
if (! rg_pre_exists)
azureDeleteResourceGroup(context, RG)
```
Once deleted we are consuming no more.