Merge branch 'master' into chrisworking

This commit is contained in:
Buck Woody 2020-01-28 07:34:11 -05:00 коммит произвёл GitHub
Родитель 7aa47790c4 125ac55287
Коммит 081fd6faf6
Не найден ключ, соответствующий данной подписи
Идентификатор ключа GPG: 4AEE18F83AFDEB23
15 изменённых файлов: 169 добавлений и 40 удалений

Просмотреть файл

@ -246,7 +246,7 @@ Open the **03_WorkingWithData.py** file and enter the code you find for section
Python has many ways to read data in (*sometimes into memory, sometimes streaming as it reads it*) built right in to the standard libraries. Other Libraries, such as Pandas and NumPy, have their own way of reading in data.
In any case, the data is assigned to a data family or *structure*, which you learned about earlier. Depending on which Library you are using, you'll pick a data structure that makes the most sense for how you want to work with it. For instance, Pandas uses a dataframe as the primary data structure it works with. This is why it's important to know the data types, so that you understand what stucture you need to perform your desired operations.
In any case, the data is assigned to a data family or *structure*, which you learned about earlier. Depending on which Library you are using, you'll pick a data structure that makes the most sense for how you want to work with it. For instance, Pandas uses a dataframe as the primary data structure it works with. This is why it's important to know the data types, so that you understand what structure you need to perform your desired operations.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./graphics/checkbox.png"><b>Reading from Files</b></p>
@ -465,7 +465,7 @@ Read the [Documentation Reference here](https://docs.microsoft.com/en-us/azure/m
Read the [Documentation Reference here](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle-data)
The Data Aquisition and Understanding phase of the TDSP you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, youll need to examine it to see what it holds. All data needs cleaning, so after the inspection phase, youll replace missing values, add and change columns. Youve already seen the Libraries you'll need to work with for Data Wrangling - Pandas being the most common in use.
The Data Acquisition and Understanding phase of the TDSP you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, youll need to examine it to see what it holds. All data needs cleaning, so after the inspection phase, youll replace missing values, add and change columns. Youve already seen the Libraries you'll need to work with for Data Wrangling - Pandas being the most common in use.
<p style="border-bottom: 1px solid lightgrey;"></p>
<h4>Phase Three - Modeling</h4>

Просмотреть файл

@ -61,7 +61,7 @@ The entire repository can be [downloaded as a single ZIP file here](https://gith
### Clone all Workshops using git
You can [clone the entire respository using `git` here](https://github.com/Microsoft/sqlworkshops.git).
You can [clone the entire repository using `git` here](https://github.com/Microsoft/sqlworkshops.git).
### Get only one Workshop
You can follow the steps below to clone individual files from a git repo using a git client.

Двоичные данные
k8stobdc/KubernetesToBDC/.DS_Store поставляемый

Двоичный файл не отображается.

Просмотреть файл

@ -203,7 +203,9 @@ For Big Data systems, having lots of Containers is very advantageous to segment
You can <a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/" target="_blank">learn much more about Container Orchestration systems here</a>. We're using the Azure Kubernetes Service (AKS) in this workshop, and <a href="https://aksworkshop.io/" target="_blank">they have a great set of tutorials for you to learn more here</a>.
In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is reponsible for building and configurint the Nodes, assigns Pods to Nodes,creates and manages the Persistent Voumes (durable storage), and manages the operation of the Cluster.
In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is responsible for building and configuring the Nodes, assigns Pods to Nodes,creates and manages the Persistent Volumes (durable storage), and manages the operation of the Cluster.
> NOTE: The OpenShift Container Platform is a commercially supported Platform as a Service (PaaS) based on Kubernetes from RedHat. Many shops require a commercial vendor to implement and support Kubernetes.
(You'll cover the storage aspects of Container Orchestration in more detail in a moment.)
@ -282,7 +284,7 @@ You'll explore further operations with the Azure Data Studio in the final module
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Azure Data Studio Notebooks Overview</b></p>
<p><b>Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-notebooks?view=sql-server-2017" target="_blank">Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permist.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-notebooks?view=sql-server-2017" target="_blank">Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permits.</p>
<br>
<p style="border-bottom: 1px solid lightgrey;"></p>

Просмотреть файл

@ -78,7 +78,7 @@ We'll begin with a set of definitions. These aren't all the terms used in Kubern
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/"><i>Namespace</i></a> </td>
<td>Used to define multiple virtual clusters backed by the same physical cluster. </td>
<td>Used to define multiple virtual clusters backed by the same physical cluster. It also is the primary construct for working with Role-Based Security (RBAC).</td>
</tr>
<tr style="vertical-align:top;">
<td><a href="https://kubernetes.io/docs/concepts/architecture/master-node-communication/"><b>Kubernetes Master</b></a> </td>
@ -122,17 +122,17 @@ We'll begin with a set of definitions. These aren't all the terms used in Kubern
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/"><i>Daemon Set</i></a> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/"><i>DaemonSet</i></a> </td>
<td>A service that ensures that <i>Nodes</i> run a copy of a <i>Pod</i>. As Nodes are added to the cluster, Pods are added to them. As Nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a hrf="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/"><i>Stateful Set</i></a> </td>
<td>The workload API object used to manage stateful applications. </td>
<td><a hrf="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/"><i>StatefulSet</i></a> </td>
<td>The workload API object used used for stateful apps that are clustered in nature.</td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/"><i>Replica Set </i></a> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/"><i>ReplicaSet </i></a> </td>
<td>A service that maintains a stable set of replica Pods running at any given time. Used to guarantee the availability of a specified number of identical Pods. </td>
</tr>
<tr style="vertical-align:top;">
@ -148,7 +148,7 @@ We'll begin with a set of definitions. These aren't all the terms used in Kubern
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/"><i>etcd</i></a> </td>
<td>A high performance key value store that stores the clusters state. Since <i>etcd</i> is light-weight, each instance can generally share resources with other Nodes in the cluster. The Hardware recommendations section of the official etcd.io site provides a detailed breakdown of the hardware requirement for <i>etcd</i>. </td>
<td>A high performance key value store that stores the clusters state. Since <i>etcd</i> is light-weight, each instance can generally share resources with other Nodes in the cluster. The Hardware recommendations section of the official http://etcd.io site provides a detailed breakdown of the hardware requirement for <i>etcd</i>. </td>
</tr>
</tbody>
</table>
@ -231,6 +231,7 @@ Luckily help is at hand in the form of a tool that leverages kubeadm in order to
- Add nodes to existing clusters
Kubespray is a Cloud Native Computing Foundation project and with its own [GitHub repository](https://github.com/kubernetes-sigs/kubespray).
### 3.2.5 What Is Ansible? ###
@ -253,6 +254,9 @@ In order to carry out the deployment of the Kubernetes cluster, a basic understa
### 3.2.7 Kubespray Workflow ###
Unlike other available deployment tools, Kubespray does everything for you in “One shot”. For example, Kubeadm requires that certificates on nodes are created manually, Kubespray not only leverages Kubeadm but it also looks after everything including certificate creation for you. Kubespray works against most of the popular public cloud providers and has been tested for the deployment of clusters with thousands of nodes. The real elegance of Kubespray is the reuse it promotes. If an organization has a requirement to deploy multiple clusters, once Kubespray is setup, for every new cluster that needs to be created, the only prerequisite is to create a new inventory file for the nodes the new cluster will use.
3.2.5 High Level Kubespray Workflow
The deployment of a Kubernetes cluster via Kubespray follows this workflow:
- Preinstall step
@ -285,7 +289,6 @@ Note:
### 3.2.8 Requirements ###
Refer to the [requirements](https://github.com/kubernetes-sigs/kubespray#requirements) section in the Kubespray GitHub repo.
### 3.2.9 Post Cluster Deployment Activities ###

Просмотреть файл

@ -148,7 +148,7 @@ These components are used in the Compute Pool of the BDC:
<h3>BDC: App Pool</h3>
The App Pool is a set of Pods within a Node that hold multiple types of end-points into the system. SQL Server Integration Services lives in the App Pool, and other Job systems are possible. You could instatiate a long-running job (such as IoT streaming) or Machine Learning (ML) endpoints used for scoring a prediction or returning a classification.
The App Pool is a set of Pods within a Node that hold multiple types of end-points into the system. SQL Server Integration Services lives in the App Pool, and other Job systems are possible. You could instantiate a long-running job (such as IoT streaming) or Machine Learning (ML) endpoints used for scoring a prediction or returning a classification.
These components are used in the Compute Pool of the BDC:

Просмотреть файл

@ -70,11 +70,11 @@ This process allows not only a query to disparate systems, but also those remote
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Load and query data in an External Table</b></p>
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any Polybase target.
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any PolyBase target.
<b>Steps</b>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-load-sample-data?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This loads your data in preparattion for the next Activity.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-load-sample-data?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This loads your data in preparation for the next Activity.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-hdfs-storage-pool?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This step shows you how to create and query an External table.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-oracle?view=sqlallproducts-allversions" target="_blank">(Optional) Open this reference, and review the instructions you see there</a>. (You You must have an Oracle server that your BDC can reach to perform these steps, although you can review them if you do not)</p>
@ -107,7 +107,7 @@ In this activity, you will load the sample data into your big data cluster envir
There are three primary uses for a large cluster of data processing systems for Machine Learning and AI applications. The first is that the users will involved in the creation of the <a href="https://www.codeingschool.com/2018/09/what-are-features-and-labels-in-machine-learning.html" target="_blank">Features used in various ML and AI algorithms, and are often tasked to Label</a> the data. These users can access the Data Pool and Data Storage data stores directly to query and assist with this task.
The SQL Server Master Instance in the BDC installs with <a href="https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-ver15" target="_blank">Machine Learning Services</a>, which allow creation, training, evaluation and presisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
The SQL Server Master Instance in the BDC installs with <a href="https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-ver15" target="_blank">Machine Learning Services</a>, which allow creation, training, evaluation and persisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
<br>
<img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/bdcsolution4.png">

Просмотреть файл

@ -0,0 +1,124 @@
# SQL Server 2019 big data cluster
Single-Node Cluster on an Azure Virtual Machine (Unsupported for production - classroom only)
In this set of instructions you'll set up a SQL Server 2019 big data cluster using Ubuntu on
a single-Node using a Microsoft Azure Virtual Machine.
NOTE: This is an unsupported configuration, and should be used only for classroom purposes.
Carefully read the instructions for the parameters you need to replace for your specific
subscription and parameters.
-------------------------------------------------------------------------------------------------------------
## Running these Instructions
These instructions use shell commands, such as PowerShell, bash, or the CMD window from a
system that has the Secure Shell software installed (SSH). You can type:
ssh -h
To see if this tool is installed.
You can copy-and-paste from the lines that show the commands, or you can set your IDE to run the current line
in a Terminal window. (In Visual Studio Code or Azure Data Studio, these are called "Keybindings"):
https://code.visualstudio.com/docs/getstarted/keybindings
You can set this to any key you like:
(Preferences | Keyboard Shortcuts | Terminal: Run Selected Text in Active Terminal)
-------------------------------------------------------------------------------------------------------------
## References
This Notebook uses the script located here:
https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-script-single-node-kubeadm?view=sql-server-ver15
and that reference supersedes the information in the steps listed below.
You can also create a SQL Server Big Data Cluster on the Azure Kubernetes Service (AKS):
Those instructions are located here: https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sql-server-ver15
For a complete workshop on SQL Server 2019's big data clusters, see this reference:
https://github.com/Microsoft/sqlworkshops/tree/master/sqlserver2019bigdataclusters
-------------------------------------------------------------------------------------------------------------
### Step 1: Log in to Azure
az login
-------------------------------------------------------------------------------------------------------------
### Step 2: Set your account - show the accounts, replace <YourAccountNameHere> with your account name
az account list --output table
az account set --subscription "<YourAccountNameHere>"
-------------------------------------------------------------------------------------------------------------
### Step 3: Create a Resource Group, and a Virtual Machine - Look for values with the <Replace> characters to change to your values
#### (Note: Needs a machine large enough to run BDC and also have Nested Virtualization)
az group create -n <ResourceGroupName> -l eastus2
az vm create -n <VMName>> -g <ResourceGroupName> -l eastus2 --image UbuntuLTS --os-disk-size-gb 200 --storage-sku Premium_LRS --admin-username bdcadmin --admin-password <ReplaceWithPassword> --size Standard_D8s_v3 --public-ip-address-allocation static
ssh -X bdcadmin@<ReplaceWithIPAddressThatReturnsFromLastCommand>
### Step 4: Update and Upgrade VM
sudo apt-get update
sudo apt-get upgrade
sudo apt autoremove
-------------------------------------------------------------------------------------------------------------
### Step 5: (Optional) Install an XWindows server
sudo apt-get install xorg openbox
sudo reboot
#### After about 5 minutes:
ssh -X bdcadmin@<ReplaceWithVMIP>
sudo apt-get install gnome-core
sudo reboot
#### After about 5 minutes:
ssh -X bdcadmin@<ReplaceWithVMIP>
sudo sed -i 's/allowed_users=console/allowed_users=anybody/' /etc/X11/Xwrapper.config
mkdir /home/bdcadmin/.config/nautilus
cd ~
mkdir ./Downloads
touch  /home/bdcadmin/.gtk-bookmarks
wget https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/ads.deb
#### Check XWindows - Note, requires that you have XWindows software installed on your laptop
nautilus &
-------------------------------------------------------------------------------------------------------------
### Step 6: Install BDC Single Node - Pre-requisites (Current as of 1/31/2020)
sudo apt update && sudo apt upgrade -y
sudo reboot
#### After about 5 minutes:
ssh -X bdcadmin@<ReplaceWithVMIP>
-------------------------------------------------------------------------------------------------------------
### Step 7: Download and mark BDC Setup script
curl --output setup-bdc.sh https://raw.githubusercontent.com/microsoft/sql-server-samples/master/samples/features/sql-big-data-cluster/deployment/kubeadm/ubuntu-single-node-vm/setup-bdc.sh
chmod +x setup-bdc.sh
sudo ./setup-bdc.sh
-------------------------------------------------------------------------------------------------------------
### Step 8: Setup path and Check
source ~/.bashrc
azdata --version
kubectl get pods
#### You can now use the system.
-------------------------------------------------------------------------------------------------------------
## Cleanup - Erase everything
### Only perform this step when you are done experimenting with the system...
az group delete --name <ResourceGroupName>

Просмотреть файл

@ -8,7 +8,7 @@
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/textbubble.png?raw=true"><b> About this Workshop</b></h2>
Welcome to this Microsoft solutions workshop on *Kubernetes - From Bare Metal to SQL Server Big Data Clusters*. In this workshop, you'll learn about setting up a production grade SQL Server 2019 big data cluster environment on Kubernetes. Topics covered include: hardware, virtualization, and Kubernetes, with a full deployment of SQL Server's Big Data Cluster on the environment that you will use in the class. You'll then walk through a set of Jupyter Notebooks in Azure Data Studio to run T-SQL, Spark, and Machine Learning workloads on the cluster. You'll also receive valuable resources to learn more and go deeper on Linux, Containers, Kubernetes and SQL Server big data clusters.
Welcome to this Microsoft solutions workshop on *Kubernetes - From Bare Metal to SQL Server Big Data Clusters*. In this workshop, you'll learn about setting up a production grade SQL Server 2019 big data cluster environment on Kubernetes. Topics covered include: hardware, virtualization, and Kubernetes, with a full deployment of SQL Server's Big Data Cluster on the environment that you will use in the class. You'll then walk through a set of [Jupyter Notebooks](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html) in Microsoft's [Azure Data Studio](https://docs.microsoft.com/en-us/sql/azure-data-studio/what-is?view=sql-server-ver15) tool to run T-SQL, Spark, and Machine Learning workloads on the cluster. You'll also receive valuable resources to learn more and go deeper on Linux, Containers, Kubernetes and SQL Server big data clusters.
The focus of this workshop is to understand the hardware, software, and environment you need to work with [SQL Server 2019's big data clusters](https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15) on a Kubernetes platform.

Просмотреть файл

@ -22,7 +22,7 @@ The other requirements are:
- **The pip3 Package**: The Python package manager *pip3* is used to install various BDC deployment and configuration tools.
- **The kubectl program**: The *kubectl* program is the command-line control feature for Kubernetes.
- **The azdata utility**: The *azdata* program is the deployment and configuration tool for BDC.
- **Azure Data Studio**: The *Azure Data Studio* IDE, along with various Extensions, is used for deploying the system, and querying and management of the BDC. In addition, you will use this tool to participate in the workshop. Note: You can connect to a SQL Server 2019 Big Data Cluster using any SQL Server connection tool or applicaiton, such as SQL Server Management Studio, but this course will use Microsoft Azure Data Studio for cluster management, Jupyter Notebooks and other capabilities.
- **Azure Data Studio**: The *Azure Data Studio* IDE, along with various Extensions, is used for deploying the system, and querying and management of the BDC. In addition, you will use this tool to participate in the workshop. Note: You can connect to a SQL Server 2019 Big Data Cluster using any SQL Server connection tool or application, such as SQL Server Management Studio, but this course will use Microsoft Azure Data Studio for cluster management, Jupyter Notebooks and other capabilities.
*Note that all following activities must be completed prior to class - there will not be time to perform these operations during the workshop.*
@ -71,7 +71,7 @@ Get-WindowsUpdate
Install-WindowsUpdate
</pre>
*Note: If you get an error during this update process, evaluate it to see if it is fatal. You may recieve certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
*Note: If you get an error during this update process, evaluate it to see if it is fatal. You may receive certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Install Big Data Cluster Tools</p>
@ -97,7 +97,7 @@ Get-WindowsUpdate
Install-WindowsUpdate
</pre>
*Note 1: If you get an error during this update process, evaluate it to see if it is fatal. You may recieve certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
*Note 1: If you get an error during this update process, evaluate it to see if it is fatal. You may receive certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
**Note 2: If you are using a Virtual Machine in Azure, power off the Virtual Machine using the Azure Portal every time you are done with it. Turning off the VM using just the Windows power off in the VM only stops it running, but you are still charged for the VM if you do not stop it from the Portal. Stop the VM from the Portal unless you are actively using it.**

Просмотреть файл

@ -92,7 +92,7 @@ This solution uses an example of a retail organization that has multiple data so
<img style="height: 25;" src="../graphics/WWI-logo.png">
Wide World Importeres (WWI) is a traditional brick and mortar business with a long track record of success, generating profits through strong retail store sales of their unique offering of affordable products from around the world. They have a great training program for new employees, that focuses on connecting with their customers and providing great face-to-face customer service. This strong focus on customer relationships has helped set WWI apart from their competitors.
Wide World Importers (WWI) is a traditional brick and mortar business with a long track record of success, generating profits through strong retail store sales of their unique offering of affordable products from around the world. They have a great training program for new employees, that focuses on connecting with their customers and providing great face-to-face customer service. This strong focus on customer relationships has helped set WWI apart from their competitors.
WWI has now added web and mobile commerce to their platform, which has generated a significant amount of additional data, and data formats. These new platforms have been added without integrating into the OLTP system data or Business Intelligence infrastructures. As a result, "silos" of data stores have developed.
@ -144,7 +144,7 @@ Using the following steps, you will create a Resource Group in Azure that will h
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sqlallproducts-allversions" target="_blank"> Read the following article to install the big data cluster Tools, ensuring that you carefully follow each step</a>. Note that if you followed the pre-requisites properly, you will already have <i>Python</i>, <i>kubectl</i>, and <i>Azure Data Studio</i> installed, so those may be skipped. Follow all other instructions.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sqlallproducts-allversions" target="_blank"> Read the following article to deploy the bdc to AKS, ensuring that you carefully follow each step</a>. Stop at the section marked <b>Connect to the cluster</b>.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sqlallproducts-allversions" target="_blank"> Read the following article to deploy the Big Data Cluster to the Azure Kubernetes Service, ensuring that you carefully follow each step</a>. Stop at the section marked <b>Connect to the cluster</b>.</p>
<p style="border-bottom: 1px solid lightgrey;"></p>
@ -309,7 +309,7 @@ You can <a href="https://hackernoon.com/docker-commands-the-ultimate-cheat-sheet
<h3>Container Orchestration <i>(Kubernetes)</i></h3>
For Big Data systems, having lots of Containers is very advantageous to segment purpose and performance profiles. However, dealing with many Container Images, allowing persisted storage, and interconnecting them for network and internetwork communications is a complex task. One such Container Prchestration tool is <i>Kubernetes</i>, an open source Container orchestrator, which can scale Container deployments according to need. The following table defines some important Container Orchestration Tools (Such as Kubernetes or OpenShift) terminology:
For Big Data systems, having lots of Containers is very advantageous to segment purpose and performance profiles. However, dealing with many Container Images, allowing persisted storage, and interconnecting them for network and internetwork communications is a complex task. One such Container Orchestration tool is <i>Kubernetes</i>, an open source Container orchestrator, which can scale Container deployments according to need. The following table defines some important Container Orchestration Tools (Such as Kubernetes or OpenShift) terminology:
<table style="tr:nth-child(even) {background-color: #f2f2f2;}; text-align: left; display: table; border-collapse: collapse; border-spacing: 5px; border-color: gray;">
@ -327,7 +327,7 @@ For Big Data systems, having lots of Containers is very advantageous to segment
You can <a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/" target="_blank">learn much more about Container Orchestration systems here</a>. We're using the Azure Kubernetes Service (AKS) in this workshop, and <a href="https://aksworkshop.io/" target="_blank">they have a great set of tutorials for you to learn more here</a>.
In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is reponsible for building and configurint the Nodes, assigns Pods to Nodes,creates and manages the Persistent Voumes (durable storage), and manages the operation of the Cluster.
In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is responsible for building and configuring the Nodes, assigns Pods to Nodes,creates and manages the Persistent Volumes (durable storage), and manages the operation of the Cluster.
(You'll cover the storage aspects of Container Orchestration in more detail in a moment.)
@ -440,7 +440,7 @@ You'll explore further operations with the Azure Data Studio in the <i>Operation
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Azure Data Studio Notebooks Overview</b></p>
<p><b>Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-notebooks?view=sql-server-2017" target="_blank">Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permist.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-notebooks?view=sql-server-2017" target="_blank">Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permits.</p>
<br>
@ -486,7 +486,7 @@ Since HDFS is a file-system, data transfer is largely a matter of using it as a
<h3>Data Pipelines using <i>Azure Data Factory</i></h3>
As described earlier, you can use various methods to ingest data ad-hoc and as-needed for your two data targets (HDFS and SQL Server Tables. A more holistic archicture is to use a <i>Pipeline</i> system that can define sources, triggers and events, transforms, targets, and has logging and tracking capabilities. The Microsoft Azure Data Factory provides all of the capabilities, and often serves as the mechanism to transfer data to and from on-premises, in-cloud, and other sources and targets. <a href="https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities" target="_blank">ADF can serve as a full data pipeline system, as described here</a>.
As described earlier, you can use various methods to ingest data ad-hoc and as-needed for your two data targets (HDFS and SQL Server Tables. A more holistic architecture is to use a <i>Pipeline</i> system that can define sources, triggers and events, transforms, targets, and has logging and tracking capabilities. The Microsoft Azure Data Factory provides all of the capabilities, and often serves as the mechanism to transfer data to and from on-premises, in-cloud, and other sources and targets. <a href="https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities" target="_blank">ADF can serve as a full data pipeline system, as described here</a>.
<br>
<img style="height: 75;" src="../graphics/adf.png">

Просмотреть файл

@ -28,7 +28,7 @@ You'll cover the following topics in this Module:
SQL Server (starting with version 2019) provides three ways to work with large sets of data:
- **Data Virtualization**: Query multiple sources of data technologies using the Polybase SQL Server feature <i>(data left at source)</i>
- **Data Virtualization**: Query multiple sources of data technologies using the PolyBase SQL Server feature <i>(data left at source)</i>
- **Storage Pools**: Create sets of disparate data sources that can be queried from Distributed Data sets <i>(data ingested into sharded databases using PolyBase)</i>
- **SQL Server Big Data Clusters**: Create, manage and control clusters of SQL Server Instances that co-exist in a Kubernetes cluster with Apache Spark and other technologies to access and process large sets of data <i>(Data left in place, ingested through PolyBase, and into/through HDFS)</i>
@ -228,4 +228,4 @@ In this section you will review the solution tutorial you will perform in the <i
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/geopin.png"><b> Next Steps</b></p>
Next, Continue to <a href="03%20-%20Planning,%20Installation%20and%20Configuration.md" target="_blank"><i> Planning, Installation and Configuration</i></a>.
Next, Continue to <a href="03%20-%20Planning,%20Installation%20and%20Configuration.md" target="_blank"><i> Planning, Installation and Configuration</i></a>.

Просмотреть файл

@ -29,28 +29,28 @@ You'll cover the following topics in this Module:
<i>NOTE: The following Module is based on the Public Preview of the Microsoft SQL Server 2019 big data cluster feature. These instructions will change as the product is updated. <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-get-started?view=sqlallproducts-allversions" target="_blank">The latest installation instructions are located here</a>.</i>
A Big Data Cluster for SQL Server (BDC) is deployed onto a Cluster Orechestration system (such as Kubernetes or OpenShift) using the `azdata` utility which creates the appropriate Nodes, Pods, Containers and other constructs for the system. The installation uses various switches on the `azdata` utility, and reads from several variables contianed within <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/reference-deployment-config?view=sqlallproducts-allversions" target="_blank">an internal JSON document</a> when you run the command. Using a switch, you can change these variables. You can also dump the enitre document to a file, edit it, and then call the installation that uses that file with the `azdata` command. <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-custom-configuration?view=sqlallproducts-allversions" target="_blank">More detail on that process is located here.</a>
A Big Data Cluster for SQL Server (BDC) is deployed onto a Cluster Orchestration system (such as Kubernetes or OpenShift) using the `azdata` utility which creates the appropriate Nodes, Pods, Containers and other constructs for the system. The installation uses various switches on the `azdata` utility, and reads from several variables contained within <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/reference-deployment-config?view=sqlallproducts-allversions" target="_blank">an internal JSON document</a> when you run the command. Using a switch, you can change these variables. You can also dump the entire document to a file, edit it, and then call the installation that uses that file with the `azdata` command. <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-custom-configuration?view=sqlallproducts-allversions" target="_blank">More detail on that process is located here.</a>
For planning, it is essential that you understand the SQL Server BDC components, and have a firm understanding of Kubernetes and TCP/IP networking. You should also have an understanding of how SQL Server and Apache Spark use the "Big Four" (*CPU, I/O, Memory and Networking*).
Since the Cluster Orechestration system is often made up of Virtual Machines that host the Container Images, they must be as large as possible. For the best possible performance, large physical machines that are tuned for optimal performance is a recommended physical architecture. The least viable production system is a Minimum of 3 Linux physical machines or virtual machines. The recommended configuration per machine is 8 CPUs, 32 GB of memory and 100GB of storage. This configuration would support only one or two users with a standard workload, and you would want to increase the system for each additional user or heavier workload.
Since the Cluster Orchestration system is often made up of Virtual Machines that host the Container Images, they must be as large as possible. For the best possible performance, large physical machines that are tuned for optimal performance is a recommended physical architecture. The least viable production system is a Minimum of 3 Linux physical machines or virtual machines. The recommended configuration per machine is 8 CPUs, 32 GB of memory and 100GB of storage. This configuration would support only one or two users with a standard workload, and you would want to increase the system for each additional user or heavier workload.
You can deploy Kubernetes in a few ways:
- In a Cloud Platform such as Azure Kubernetes Service (AKS)
- In your own Cluster Orechestration system deployment using the appropirate tools such as `KubeADM`
- In your own Cluster Orchestration system deployment using the appropriate tools such as `KubeADM`
Regardless of the Cluster Orechestration system target, the general steps for setting up the system are:
Regardless of the Cluster Orchestration system target, the general steps for setting up the system are:
- Set up Cluster Orechestration system with a Cluster target
- Set up Cluster Orchestration system with a Cluster target
- Install the cluster tools on the administration machine
- Deploy the BDC onto the Cluster Orechestration system
- Deploy the BDC onto the Cluster Orchestration system
In the sections that follow, you'll cover the general process for each of these deployments. The official documentation referenced above have the specific steps for each deployment, and the *Activity* section of this Module has the steps for deploying the BDC on AKS for the classroom enviornment.
In the sections that follow, you'll cover the general process for each of these deployments. The official documentation referenced above have the specific steps for each deployment, and the *Activity* section of this Module has the steps for deploying the BDC on AKS for the classroom environment.
<p style="border-bottom: 1px solid lightgrey;"></p>
@ -94,7 +94,7 @@ With this background, you can find the <a href="https://docs.microsoft.com/en-us
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="3-2">3.2 Installing Locally Using KubeADM</h2>
If you choose Kubernetes as your Cluster Orechestration system, the <a href="https://kubernetes.io/docs/setup/independent/install-kubeadm/" target="_blank">kubeadm toolbox</a> helps you bootstrap a Kubernetes cluster that conforms to best practices. Kubeadm also supports other cluster lifecycle functions, such as upgrades, downgrade, and managing bootstrap tokens.
If you choose Kubernetes as your Cluster Orchestration system, the <a href="https://kubernetes.io/docs/setup/independent/install-kubeadm/" target="_blank">kubeadm toolbox</a> helps you bootstrap a Kubernetes cluster that conforms to best practices. Kubeadm also supports other cluster lifecycle functions, such as upgrades, downgrade, and managing bootstrap tokens.
The kubeadm toolbox can deploy a Kubernetes cluster to physical or virtual machines. It works by specifying the TCP/IP addresses of the targets.

Просмотреть файл

@ -81,11 +81,11 @@ This process allows not only a query to disparate systems, but also those remote
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Load and query data in an External Table</b></p>
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any Polybase target.
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any PolyBase target.
<b>Steps</b>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-load-sample-data?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This loads your data in preparattion for the next Activity.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-load-sample-data?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This loads your data in preparation for the next Activity.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-hdfs-storage-pool?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This step shows you how to create and query an External table.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-oracle?view=sqlallproducts-allversions" target="_blank">(Optional) Open this reference, and review the instructions you see there</a>. (You You must have an Oracle server that your BDC can reach to perform these steps, although you can review them if you do not)</p>
@ -118,7 +118,7 @@ In this activity, you will load the sample data into your big data cluster envir
There are three primary uses for a large cluster of data processing systems for Machine Learning and AI applications. The first is that the users will involved in the creation of the <a href="https://www.codeingschool.com/2018/09/what-are-features-and-labels-in-machine-learning.html" target="_blank">Features used in various ML and AI algorithms, and are often tasked to Label</a> the data. These users can access the Data Pool and Data Storage data stores directly to query and assist with this task.
The SQL Server Master Instance in the BDC installs with <a href="https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-ver15" target="_blank">Machine Learning Services</a>, which allow creation, training, evaluation and presisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
The SQL Server Master Instance in the BDC installs with <a href="https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-ver15" target="_blank">Machine Learning Services</a>, which allow creation, training, evaluation and persisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
<br>
<img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/bdcsolution4.png">

Просмотреть файл

@ -29,7 +29,7 @@ You'll cover the following topics in this Module:
There are two primary areas for monitoring your BDC deployment. The first deals with SQL Server 2019, and the second deals with the set of elements in the Cluster.
For SQL Server, <a href="https://docs.microsoft.com/en-us/sql/relational-databases/database-lifecycle-management?view=sql-server-ver15" target="_blank">management is much as you would normally perform for any SQL Server system</a>. You have the same type of services, surface points, security areas and other control vectors as in a stand-alone installation of SQL Server. The tools you have avalaible for managing the Master Instance in the BDC are the same as managing a stand-alone installation, including SQL Server Management Studio, command-line interfaces, Azure Data Studio, and third party tools.
For SQL Server, <a href="https://docs.microsoft.com/en-us/sql/relational-databases/database-lifecycle-management?view=sql-server-ver15" target="_blank">management is much as you would normally perform for any SQL Server system</a>. You have the same type of services, surface points, security areas and other control vectors as in a stand-alone installation of SQL Server. The tools you have available for managing the Master Instance in the BDC are the same as managing a stand-alone installation, including SQL Server Management Studio, command-line interfaces, Azure Data Studio, and third party tools.
For the cluster components, you have three primary interfaces to use, which you will review next.