mirror of https://github.com/Azure/AzureDSVM.git
Merge branch 'master' of https://github.com/Azure/AzureDSVM
Commit
e0f29ca8a1
|
@ -9,7 +9,8 @@ Authors@R: c(
|
|||
person("Le", "Zhang", role=c("aut"), email="zhle@microsoft.com"))
|
||||
Description: The AzureDSVM is a package that aims at providing handy methods,
|
||||
with the help of underlying AzureSMR package, for doing data science jobs on
|
||||
Azure Data Science Virtual Machine (DSVM) efficiently and economically. Basically it can be used with the
|
||||
Azure Data Science Virtual Machine (DSVM) efficiently and economically.
|
||||
Basically it can be used with the
|
||||
following benefits. Easy operations such as create, start, and stop of Azure
|
||||
DSVMs. Remote execution of data analytical jobs on cloud with
|
||||
specified computing context. Monitor of data consumption and calculation of
|
||||
|
|
|
@ -164,18 +164,20 @@ deployDSVMCluster <- function(context,
|
|||
|
||||
for (i in 1:count)
|
||||
{
|
||||
deployDSVM(context=context,
|
||||
resource.group=resource.group,
|
||||
location=location,
|
||||
hostname=hostname[i],
|
||||
username=username[i],
|
||||
size=size[i],
|
||||
os=os[i],
|
||||
authen=authen[i],
|
||||
pubkey=pubkey[i],
|
||||
password=password[i],
|
||||
dns.label=hostname[i],
|
||||
mode=ifelse(i == count, "Sync", "Async"))
|
||||
if (i == count) cat("\n") # Tidy the progress messages.
|
||||
|
||||
deployDSVM(context=context,
|
||||
resource.group=resource.group,
|
||||
location=location,
|
||||
hostname=hostname[i],
|
||||
username=username[i],
|
||||
size=size[i],
|
||||
os=os[i],
|
||||
authen=authen[i],
|
||||
pubkey=pubkey[i],
|
||||
password=password[i],
|
||||
dns.label=hostname[i],
|
||||
mode=ifelse(i == count, "Sync", "Async"))
|
||||
}
|
||||
|
||||
# For a cluster set up public credentials for the DSVM cluster to
|
||||
|
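The loop above requests every deployment except the last in "Async" mode and the last in "Sync" mode, apparently so that the call only blocks once all requests have been submitted. A minimal sketch of that pattern follows. `fake_deploy()` is a hypothetical stand-in that records requests instead of contacting Azure, and the host names are illustrative.

```r
# Hypothetical stand-in for deployDSVM(): records each request
# instead of contacting Azure.
fake_deploy <- function(hostname, mode) {
  data.frame(hostname = hostname, mode = mode, stringsAsFactors = FALSE)
}

hostnames <- c("node001", "node002", "node003")  # illustrative names
count <- length(hostnames)

requests <- do.call(rbind, lapply(seq_len(count), function(i) {
  # Only the final request is synchronous; earlier ones are asynchronous.
  mode <- if (i == count) "Sync" else "Async"
  fake_deploy(hostnames[i], mode)
}))

print(requests)
```

The sketch uses `seq_len(count)` and a scalar `if`/`else` rather than `1:count` and `ifelse()`. `1:count` would iterate over `c(1, 0)` if `count` were ever zero, and `ifelse()` is intended for vectors.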
@ -203,5 +205,5 @@ deployDSVMCluster <- function(context,
|
|||
dns.label=dns.label)
|
||||
}
|
||||
|
||||
return(TRUE)
|
||||
invisible(TRUE)
|
||||
}
|
||||
|
|
|
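The switch from `return(TRUE)` to `invisible(TRUE)` means the function still returns `TRUE` but no longer prints it when called at the top level. The following sketch shows the difference with base R's `withVisible()`. The two function names are hypothetical.

```r
# Two hypothetical functions that differ only in how they return TRUE.
visible_result <- function() return(TRUE)
quiet_result <- function() invisible(TRUE)

shown <- withVisible(visible_result())
hidden <- withVisible(quiet_result())

# Both return TRUE; only the visibility flag differs.
print(c(shown$value, hidden$value))
print(c(shown$visible, hidden$visible))
```

Callers can still capture the value, for example `ok <- quiet_result()`.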
@ -9,15 +9,17 @@ vignette: >
|
|||
\usepackage[utf8]{inputenc}
|
||||
---
|
||||
|
||||
The Azure Data Science Virtual Machine (DSVM) provide a powerful
|
||||
The Azure Data Science Virtual Machine (DSVM) provides a powerful
|
||||
platform for data scientists supporting a full open source stack of
|
||||
tools that today's data scientist will regularly use.
|
||||
regularly used tools.
|
||||
|
||||
# Preliminaries
|
||||
|
||||
1. `AzureDSVM` requires users to have access to Azure resources
|
||||
through an Azure subscription which can be obtained through a [free
|
||||
Azure account](https://azure.microsoft.com/en-us/free).
|
||||
Azure account](https://azure.microsoft.com/en-us/free) using your
|
||||
credit card (but only to identify you as it will not be charged unless
|
||||
you later decide to take out a subscription).
|
||||
|
||||
2. It is highly recommended that you read through the DSVM
|
||||
documentation:
|
||||
|
@ -53,7 +55,7 @@ if(!require("devtools")) install.packages("devtools")
|
|||
devtools::install_github("Azure/AzureDSVM")
|
||||
```
|
||||
|
||||
Help pages can be loaded by
|
||||
Help pages can be loaded with
|
||||
|
||||
```{r help azuresvm, eval=FALSE}
|
||||
library(help=AzureDSVM)
|
||||
|
|
|
@ -21,8 +21,8 @@ vignette. Once deleted consumption (cost) will cease.
|
|||
This script is best run interactively to review its operation and to
|
||||
ensure that the interaction with Azure completes.
|
||||
|
||||
The R script that can be generated from this vignette can be run as a
|
||||
standalone script to setup a new resource group and single Ubuntu
|
||||
An R script can be generated from this vignette and can be run as
|
||||
a standalone script to set up a new resource group and single Ubuntu
|
||||
DSVM.
|
||||
|
||||
# Preparation
|
||||
|
@ -45,7 +45,7 @@ the users desktop/laptop machine and will be found within
|
|||
to contain this information. The contents of the credentials file will
|
||||
be something like the following and we assume the user creates such a
|
||||
file in the current working directory, naming the file
|
||||
<USER>_credentials.R, replace <USER> with the user's username.
|
||||
<USER>_credentials.R. Replace <USER> with the user's username.
|
||||
|
||||
```{r credentials, eval=FALSE}
|
||||
# Credentials come from app creation in Active Directory within Azure.
|
||||
|
@ -63,8 +63,8 @@ PASSWORD <- "Public%4aR3@kn" # For Windows DSVM
|
|||
|
||||
```
|
||||
|
||||
Notice we include a password (fake in this case) for account creation
|
||||
on a Windows DSVM.
|
||||
Notice we include a password (a fake password in this case) for
|
||||
account creation on a Windows DSVM.
|
||||
|
||||
We can simply source the credentials file in R.
|
||||
|
||||
|
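One way to source the credentials file defensively is sketched below. It assumes the `<USER>_credentials.R` naming convention described above and the `TID`, `CID` and `KEY` variables the vignettes load. The helper `load_credentials()` is hypothetical and not part of AzureDSVM. Unlike a plain `source()`, it reads the file into its own environment and checks that the expected names are defined.

```r
# Hypothetical helper: source <USER>_credentials.R into a private
# environment and verify that the required variables are defined.
load_credentials <- function(user = Sys.info()[["user"]], dir = ".",
                             required = c("TID", "CID", "KEY")) {
  path <- file.path(dir, paste0(user, "_credentials.R"))
  if (!file.exists(path)) stop("Credentials file not found: ", path)
  env <- new.env()
  source(path, local = env)
  missing <- setdiff(required, ls(env))
  if (length(missing) > 0)
    stop("Missing from ", path, ": ", paste(missing, collapse = ", "))
  mget(required, envir = env)
}

# Demonstration with placeholder values in a temporary directory.
demo_dir <- tempfile("creds")
dir.create(demo_dir)
writeLines(c('TID <- "tenant-id-placeholder"',
             'CID <- "client-id-placeholder"',
             'KEY <- "auth-key-placeholder"'),
           file.path(demo_dir, "demo_credentials.R"))

creds <- load_credentials(user = "demo", dir = demo_dir)
print(names(creds))
```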
@ -160,15 +160,14 @@ Create the resource group within which all resources we create will be
|
|||
grouped.
|
||||
|
||||
```{r create resource group}
|
||||
if (! rg_pre_exists) {
|
||||
# Create a new resource group into which we create the VMs and
|
||||
# related resources. Resource group name is RG.
|
||||
|
||||
# Note that to create a new resource group one needs to add access
|
||||
# control of Active Directory application at subscription level.
|
||||
# Create a new resource group into which we create the VMs and related
|
||||
# resources. Resource group name is RG. Note that to create a new
|
||||
# resource group one needs to add access control of Active Directory
|
||||
# application at subscription level.
|
||||
|
||||
if (! rg_pre_exists)
|
||||
{
|
||||
azureCreateResourceGroup(context, RG, LOC) %>% cat("\n\n")
|
||||
|
||||
}
|
||||
|
||||
# Check that it now exists.
|
||||
|
@ -205,8 +204,7 @@ formals(deployDSVM)$size
|
|||
formals(deployDSVM)$os
|
||||
```
|
||||
|
||||
The following code deploys a Linux DSVM, and it will take a few
|
||||
minutes.
|
||||
The following code deploys a Linux DSVM which will take a few minutes.
|
||||
|
||||
```{r deploy}
|
||||
# Create the required Linux DSVM - generally 4 minutes.
|
||||
|
@ -232,8 +230,8 @@ Prove that the deployed DSVM exists.
|
|||
# existence. Expect a single line with an indication of how long the
|
||||
# server has been up and running.
|
||||
|
||||
# NOTE this must be done after a while since deployment - problem may
|
||||
# be owing to internal processing of system setup.
|
||||
# NOTE this must be done after a while since even though deployment is
|
||||
# reported there is a small delay before the server is actually available.
|
||||
|
||||
Sys.sleep(20)
|
||||
|
||||
|
@ -252,9 +250,9 @@ system(cmd, intern=TRUE)
|
|||
|
||||
We can install some useful tools on a fresh server. Note that the
|
||||
Ubuntu server will still be running some background scripts as part of
|
||||
its own setup so if there are lock file error messages wait from the
|
||||
following commands then simply try again in a short while. We also
|
||||
update the operating system here though because of a bad console
|
||||
its own setup so if there are lock error messages (could not get lock)
|
||||
from the following commands then simply try again in a short while. We
|
||||
also update the operating system here though because of a bad console
|
||||
interaction from the msodbcsql package asking about licensing we have
|
||||
to do the distupgrade through a terminal so we need to log on to the
|
||||
server through the secure shell and manually run that command. We
|
||||
|
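The advice to "try again in a short while" when a lock error appears can be automated with a small retry loop. The sketch below is an assumption about how one might do this and is not part of the vignette. The `retry()` helper is hypothetical and relies only on the fact that `system()` returns the command's exit status (0 on success) when `intern=FALSE`.

```r
# Hypothetical helper: call f() until it returns exit status 0,
# waiting `wait` seconds between attempts.
retry <- function(f, attempts = 5, wait = 10) {
  for (i in seq_len(attempts)) {
    status <- f()
    if (identical(status, 0L)) return(invisible(TRUE))
    if (i < attempts) Sys.sleep(wait)
  }
  invisible(FALSE)
}

# Stand-in for a remote command: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) 100L else 0L
}

ok <- retry(flaky, attempts = 5, wait = 0)
cat("succeeded:", ok, "after", calls, "calls\n")
```

With the `ssh` prefix used in the vignette, a call might look like `retry(function() system(paste(ssh, "sudo apt-get -y install wajig")))`.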
@ -267,6 +265,7 @@ system(paste(ssh, "sudo apt-get -y install wajig"))
|
|||
system(paste(ssh, "wajig install -y lsb htop"))
|
||||
system(paste(ssh, "lsb_release -idrc"))
|
||||
system(paste(ssh, "wajig update"))
|
||||
# Manually ssh to the server and then ...
|
||||
# wajig distupgrade
|
||||
# sudo reboot
|
||||
```
|
||||
|
|
|
@ -10,26 +10,29 @@ vignette: >
|
|||
|
||||
# Use Case
|
||||
|
||||
Sometimes more than one DSVMs are needed.
|
||||
Sometimes more than one DSVM is needed.
|
||||
|
||||
* Multi-deployment of heterogeneous DSVMs may be required for a
|
||||
collaborative project where each of group members work on a machine
|
||||
with specific configuration. For instance, a powerful yet expensive
|
||||
machine is assigned to perform computation intensive tasks while a
|
||||
cheap one can be used for explorative or interactive tasks.
|
||||
collaborative project where each group member works on a machine
|
||||
with a specific configuration targeting their requirements. For
|
||||
instance, a powerful (expensive) machine is assigned to perform
|
||||
computationally intensive tasks while a smaller (inexpensive)
|
||||
machine can be used for explorative or interactive tasks.
|
||||
|
||||
* Another common use case is for a Data Scientist to create their R
|
||||
programs to analyse a dataset on their local compute platform (e.g., a
|
||||
laptop with 6GB RAM running Ubuntu with R installed). Development is
|
||||
performed with a subset of the full dataset (a random sample) that
|
||||
will not exceed the available memory and will return results
|
||||
quickly. When the experimental setup is complete the script can be
|
||||
sent across to a considerably more capable compute engine on Azure,
|
||||
possibly a cluster of servers to build models in parallel.
|
||||
programs to analyse a dataset on their local compute platform (e.g.,
|
||||
a laptop with 6GB RAM running Ubuntu with R installed). Development
|
||||
is performed with a subset of the full dataset (a random sample)
|
||||
that will not exceed the available memory and will return results
|
||||
quickly. When the experimental setup is complete the script can be
|
||||
sent across to a considerably more capable compute engine on Azure,
|
||||
possibly a cluster of servers to build models in parallel in a
|
||||
deploy/compute/destroy cycle to be a significantly more cost
|
||||
effective alternative to an on-premise purchase of hardware.
|
||||
|
||||
This tutorial deploys a collection/cluster of Linux Data Science
|
||||
Virtual Machines (DSVMs) for the above two scenarios. In the latter
|
||||
one, user distributes a trivial compute task over those servers,
|
||||
scenario the user distributes a compute task over those servers,
|
||||
collects the results and generates a report. Code is included but not
|
||||
run to then delete the resource group if the resources are no longer
|
||||
required. Once deleted consumption will cease.
|
||||
|
@ -39,9 +42,8 @@ ensure that the interaction with Azure completes.
|
|||
|
||||
# Setup
|
||||
|
||||
To get started load our Azure credentials as well as the user's ssh
|
||||
public key. This information has been saved into a file with the name
|
||||
<USER>_credentials.R where <USER> is your username.
|
||||
Refer to
|
||||
[Deploy](https://github.com/Azure/AzureDSVM/blob/master/vignettes/10Deploy.Rmd) for an explanation of the set up of the virtual machines.
|
||||
|
||||
```{r setup}
|
||||
# Load the required subscription resources: TID, CID, and KEY.
|
||||
|
@ -50,18 +52,14 @@ public key. This information has been saved into a file with the name
|
|||
USER <- Sys.info()[['user']]
|
||||
|
||||
source(paste0(USER, "_credentials.R"))
|
||||
```
|
||||
|
||||
```{r packages}
|
||||
# Load the required packages.
|
||||
|
||||
library(AzureSMR) # Support for managing Azure resources.
|
||||
library(AzureDSVM) # Further support for the Data Scientist.
|
||||
library(magrittr)
|
||||
library(dplyr)
|
||||
```
|
||||
library(AzureDSVM) # Further support for the Data Scientist.
|
||||
library(magrittr) # Pipeline computation.
|
||||
library(dplyr) # Data wrangling.
|
||||
|
||||
```{r tuning}
|
||||
# Parameters for this script: the name for the new resource group and
|
||||
# its location across the Azure cloud. The resource name is used to
|
||||
# name the resource group that we will create transiently for the
|
||||
|
@ -89,10 +87,10 @@ LOC <-
|
|||
|
||||
cat("\n")
|
||||
|
||||
COUNT <- 4 # Number of VMs to deploy.
|
||||
```
|
||||
# Number of VMs to deploy.
|
||||
|
||||
COUNT <- 4
|
||||
|
||||
```{r connect}
|
||||
# Connect to the Azure subscription and use this as the context for
|
||||
# all of our activities.
|
||||
|
||||
|
@ -103,13 +101,6 @@ context <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY)
|
|||
|
||||
rg_pre_exists <- existsRG(context, RG, LOC)
|
||||
|
||||
```
|
||||
# Create Resource Group
|
||||
|
||||
Create the resource group within which all resources we create will be
|
||||
grouped.
|
||||
|
||||
```{r create resource group}
|
||||
if (! rg_pre_exists)
|
||||
{
|
||||
# Create a new resource group into which we create the VMs and
|
||||
|
@ -130,30 +121,29 @@ cat("Resource group", RG, "at", LOC,
|
|||
|
||||
# Create a Cluster
|
||||
|
||||
Multi-deployment of DSVM can be achieved by calling
|
||||
`deployDSVMCluster` function. Note the function is designed to
|
||||
implicitly switch between cluster and collection of DSVMs, according
|
||||
to the given inputs. That is, if the `hostname`, i.e., names of the
|
||||
DSVMs consists of only one character string, the function will imply
|
||||
the deployment is to create a cluster of homogeneous DSVMs (i.e., same
|
||||
size), and use the unique machine name as the base, which is appended
|
||||
with a sequential number to form a full hostname. If the `hostname`
|
||||
is a vector of character strings, the function will create machines
|
||||
with names specified in the name vector.
|
||||
Multi-deployment of DSVMs can be achieved by calling
|
||||
`deployDSVMCluster()`. This function is designed to implicitly switch
|
||||
between cluster and collection of DSVMs, according to the given
|
||||
inputs. If the `hostname` (i.e., names of the DSVMs) consists of only
|
||||
one character string then a cluster of homogeneous DSVMs (i.e., same
|
||||
size) will be deployed. Using the unique machine name as the base a
|
||||
sequential numbering is used to form a full hostname. On the other
|
||||
hand if the `hostname` is a vector of character strings the function
|
||||
will create a collection of machines with names so specified.
|
||||
|
||||
It is worth mentioning that a cluster of DSVMs is useful when
|
||||
batch-based analytical job needs to be done in a desired computing
|
||||
context, especially in a distributed manner across nodes of the
|
||||
cluster. The distributed computing functionality is empowered by
|
||||
Microsoft RevoScaleR parallel computing backend. The distributed and
|
||||
parallel computing is socket-based and relies on SSH for secure
|
||||
communication. To allow this communication across nodes,
|
||||
`deployDSVMCluster` added inbound security rules into security group
|
||||
of each DSVM in the cluster, and establish public key pairs for the
|
||||
machines.
|
||||
It is worth mentioning that a cluster of DSVMs is useful when a
|
||||
batch-based analytical job needs to be performed in a desired
|
||||
computing context, especially in a distributed manner across nodes of
|
||||
the cluster. The distributed computing functionality is supported by
|
||||
the Microsoft RevoScaleR parallel computing backend. The distributed
|
||||
and parallel computing backend is socket-based and relies on SSH for
|
||||
secure communication. To allow this communication across nodes,
|
||||
`deployDSVMCluster` adds inbound security rules into the security
|
||||
group of each DSVM in the cluster, and establishes public key pairs for
|
||||
the machines.
|
||||
|
||||
We can now deploy a cluster of homogeneous DSVMs. Each DSVM will be
|
||||
named based on the *name* provided and sequentially numbered.
|
||||
named based on the *hostname* provided and sequentially numbered.
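The naming rule described above can be sketched as follows. This is an illustration only: `expand_hostnames()` is a hypothetical helper, and the zero-padded three-digit suffix is an assumption, since the exact format used by `deployDSVMCluster()` is not shown here.

```r
# Hypothetical helper illustrating cluster versus collection naming.
# The "%03d" suffix format is an assumption, not the package's
# documented behaviour.
expand_hostnames <- function(hostname, count) {
  if (length(hostname) == 1 && count > 1) {
    # Cluster: one base name, sequentially numbered.
    sprintf("%s%03d", hostname, seq_len(count))
  } else {
    # Collection: use the names exactly as given.
    hostname
  }
}

cluster_names <- expand_hostnames("dsvm", 3)
collection_names <- expand_hostnames(c("alpha", "beta"), 2)

print(cluster_names)
print(collection_names)
```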
|
||||
|
||||
```{r deploy a cluster of DSVMs}
|
||||
# Deploy a cluster of DSVMs.
|
||||
|
@ -166,13 +156,8 @@ deployDSVMCluster(context,
|
|||
authen="Key",
|
||||
pubkey=PUBKEY,
|
||||
count=COUNT)
|
||||
```
|
||||
|
||||
```{r check existence of deployed DSVMs}
|
||||
|
||||
cluster <- azureListVM(context,
|
||||
RG,
|
||||
LOC)
|
||||
cluster <- azureListVM(context, RG, LOC)
|
||||
|
||||
# To validate the existence of deployed DSVMs.
|
||||
|
||||
|
@ -212,10 +197,12 @@ and size can also be configured.
|
|||
|
||||
```{r deploy a set of DSVMs, eval=FALSE}
|
||||
|
||||
DSVM_NAMES <- paste0(BASE, c(1, 2, 3))
|
||||
DSVM_NAMES <- paste0(BASE, c(1, 2, 3)) %T>% print()
|
||||
|
||||
# Deploy multiple DSVMs using deployDSVMCluster.
|
||||
|
||||
# TODO: CURRENTLY NOT FUNCTIONAL. NEED A FULLY FUNCTIONAL EXAMPLE
|
||||
|
||||
deployDSVMCluster(context,
|
||||
resource.group=RG,
|
||||
location=LOC,
|
||||
|
@ -258,5 +245,3 @@ for (vm in DSVM_NAMES)
|
|||
if (! rg_pre_exists)
|
||||
azureDeleteResourceGroup(context, RG)
|
||||
```
|
||||
|
||||
Once deleted we are consuming no more.
|
||||
|
|