diff --git a/.DS_Store b/.DS_Store
index d558d5d..9af338e 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/PythonForDataProfessionals/Python for Data Professionals/03 Working with Data.md b/PythonForDataProfessionals/Python for Data Professionals/03 Working with Data.md
index 35d28d4..0805863 100644
--- a/PythonForDataProfessionals/Python for Data Professionals/03 Working with Data.md
+++ b/PythonForDataProfessionals/Python for Data Professionals/03 Working with Data.md
@@ -246,7 +246,7 @@ Open the **03_WorkingWithData.py** file and enter the code you find for section
Python has many ways to read data in (*sometimes into memory, sometimes streaming as it reads it*) built right into the standard libraries. Other libraries, such as Pandas and NumPy, have their own ways of reading in data.
-In any case, the data is assigned to a data family or *structure*, which you learned about earlier. Depending on which Library you are using, you'll pick a data structure that makes the most sense for how you want to work with it. For instance, Pandas uses a dataframe as the primary data structure it works with. This is why it's important to know the data types, so that you understand what stucture you need to perform your desired operations.
+In any case, the data is assigned to a data family or *structure*, which you learned about earlier. Depending on which Library you are using, you'll pick a data structure that makes the most sense for how you want to work with it. For instance, Pandas uses a dataframe as the primary data structure it works with. This is why it's important to know the data types, so that you understand what structure you need to perform your desired operations.
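As a minimal sketch of reading data with the standard libraries described above (the inline sample data and column names are invented for illustration), the built-in `csv` module assigns each row to a familiar structure - a dictionary - much as Pandas would assign the whole file to a dataframe:

```shell
# Minimal sketch: reading data with Python's standard library (csv module).
# The sample data is inline here; a real script would open a file instead.
python3 - <<'EOF'
import csv, io

data = io.StringIO("name,qty\nwidget,3\ngadget,5\n")
rows = list(csv.DictReader(data))   # each row becomes a dict - a built-in structure
print(rows[0]["name"], rows[1]["qty"])
EOF
# prints: widget 5
```

Knowing that `DictReader` yields dictionaries (rather than, say, a dataframe) is exactly the kind of structure awareness the paragraph above describes.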
@@ -6,7 +6,7 @@
- 00 prerequisites
+ 00 prerequisites
This workshop is taught using the following components, which you will install and configure in the sections that follow.
@@ -26,37 +26,37 @@ The other requirements are:
*Note that all following activities must be completed prior to class - there will not be time to perform these operations during the workshop.*
-Activity 1: Set up a Microsoft Azure Account
+Activity 1: Set up a Microsoft Azure Account
You have multiple options for setting up a Microsoft Azure account to complete this workshop. You can use a Microsoft Developer Network (MSDN) account, a personal or corporate account, or in some cases a pass may be provided by the instructor. (Note: for most classes, the MSDN account is best)
**If you are attending this course in-person:**
Unless you are explicitly told you will be provided an account by the instructor in the invitation to this workshop, you must have your Microsoft Azure account and Data Science Virtual Machine set up before you arrive at class. There will NOT be time to configure these resources during the course.
-Option 1 - Microsoft Developer Network Account (MSDN) Account
+Option 1 - Microsoft Developer Network Account (MSDN) Account
The best way to take this workshop is to use your [Microsoft Developer Network (MSDN) benefits if you have a subscription](https://marketplace.visualstudio.com/subscriptions).
- [Open this resource and click the "Activate your monthly Azure credit" button](https://azure.microsoft.com/en-us/pricing/member-offers/credit-for-visual-studio-subscribers/)
-Option 2 - Use Your Own Account
+Option 2 - Use Your Own Account
You can also use your own account or one provided to you by your organization, but you must be able to create a resource group and create, start, and manage a Virtual Machine and an Azure Kubernetes Service (AKS) cluster.
-Option 3 - Use an account provided by your instructor
+Option 3 - Use an account provided by your instructor
Your workshop invitation may have instructed you that a Microsoft Azure account will be provided for you to use. If so, you will receive instructions for accessing it.
**Unless you received explicit instructions in your workshop invitation, you must create either an MSDN or Personal account. You must have an account prior to the workshop.**
-Activity 2: Prepare Your Workstation
+Activity 2: Prepare Your Workstation
The instructions that follow are the same for either a "bare metal" workstation or laptop, or a Virtual Machine. It's best to have at least 4GB of RAM on the management system, and these instructions assume that you are not planning to run the database server or any Containers on the workstation. It's also assumed that you are using a current version of Windows, either desktop or server.
*(You can copy and paste all of the commands that follow in a PowerShell window that you run as the system Administrator)*
-Updates
+
Updates
First, ensure all of your updates are current. You can use the following commands to do that in an Administrator-level PowerShell session:
@@ -73,12 +73,12 @@ Install-WindowsUpdate
*Note: If you get an error during this update process, evaluate it to see if it is fatal. You may receive certain driver errors if you are using a Virtual Machine; these can be safely ignored.*
-
Install Big Data Cluster Tools
+Install Big Data Cluster Tools
Next, install the tools to work with Big Data Clusters:
-Activity 3: Install BDC Tools
+Activity 3: Install BDC Tools
Open this resource, and follow all instructions for the Microsoft Windows operating system
@@ -87,7 +87,7 @@ Open this resource, and follow all instructions for the Microsoft Windows operat
- [https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sql-server-ver15](https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sql-server-ver15)
-Activity 4: Re-Update Your Workstation
+Activity 4: Re-Update Your Workstation
Once again, download the MSI and run it from there. It's always a good idea after this many installations to run Windows Update again:
@@ -101,11 +101,11 @@ Install-WindowsUpdate
**Note 2: If you are using a Virtual Machine in Azure, power off the Virtual Machine using the Azure Portal every time you are done with it. Turning off the VM using just the Windows power off in the VM only stops it running, but you are still charged for the VM if you do not stop it from the Portal. Stop the VM from the Portal unless you are actively using it.**
-For Further Study
+For Further Study
-Next Steps
+Next Steps
Next, continue to Module 1 - Introduction.
diff --git a/k8stobdc/KubernetesToBDC/01-introduction.md b/k8stobdc/KubernetesToBDC/01-introduction.md
index 1f0bd7e..b87ee8a 100644
--- a/k8stobdc/KubernetesToBDC/01-introduction.md
+++ b/k8stobdc/KubernetesToBDC/01-introduction.md
@@ -14,7 +14,7 @@ This module covers Container technologies and how they are different than Virtua
-Activity: Install Class Environment on AKS (Optional)
+Activity: Install Class Environment on AKS (Optional)
*(If you are taking this course on-line and not with an instructor-provided Kubernetes environment, you can use a Microsoft Azure subscription to deploy a Kubernetes Environment, complete with the SQL Server big data clusters feature. Your instructor may also have you use this deployment mechanism if in-class hardware is not practical or available)*
@@ -26,15 +26,15 @@ Using the following steps, you will create a Resource Group in Azure that will h
Steps
- Ensure that you have completed all prerequisites.
+ Ensure that you have completed all prerequisites.
- Read the following article to install the big data cluster Tools, ensuring that you carefully follow each step. Note that if you followed the pre-requisites properly, you will already have Python, kubectl, and Azure Data Studio installed, so those may be skipped. Follow all other instructions.
+ Read the following article to install the big data cluster Tools, ensuring that you carefully follow each step. Note that if you followed the pre-requisites properly, you will already have Python, kubectl, and Azure Data Studio installed, so those may be skipped. Follow all other instructions.
- Read the following article to deploy the bdc to AKS, ensuring that you carefully follow each step. Stop at the section marked Connect to the cluster.
+ Read the following article to deploy the bdc to AKS, ensuring that you carefully follow each step. Stop at the section marked Connect to the cluster.
-
+
In this section you will learn more about the design of the primary operating system (Linux) used with a Kubernetes Cluster.
@@ -135,21 +135,21 @@ The essential commands you should know for this workshop are below. In Linux you
A longer explanation of system administration for Linux is here.
-Activity: Work with Linux Commands
+Activity: Work with Linux Commands
Steps
-Open this link to run a Linux Emulator in a browser
-Find the mounted file systems, and then show the free space in them.
-Show the current directory.
-Show the files in the current directory.
-Create a new directory, navigate to it, and create a file called test.txt with the words This is a test in it. (hint: us the nano editor or the echo command)
-Display the contents of that file.
-Show the help for the cat command.
+Open this link to run a Linux Emulator in a browser
+Find the mounted file systems, and then show the free space in them.
+Show the current directory.
+Show the files in the current directory.
+Create a new directory, navigate to it, and create a file called test.txt with the words This is a test in it. (hint: use the nano editor or the echo command)
+Display the contents of that file.
+Show the help for the cat command.
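The steps in this activity can be sketched as the following command sequence (the directory name is just an example; the same commands apply in the browser emulator):

```shell
# Sketch of the activity steps above; run in any bash shell.
df -h                              # mounted file systems and their free space
pwd                                # show the current directory
ls -l                              # show the files in the current directory
mkdir testdir && cd testdir        # create a new directory and navigate to it
echo "This is a test" > test.txt   # create the file with echo (nano works too)
cat test.txt                       # display the contents of the file
cat --help                         # show help for the cat command (or: man cat)
```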
-
+
Bare-metal installations of an operating system such as Windows are deployed on hardware using a Kernel, and additional software that exposes the hardware to applications through a set of system calls.
@@ -158,7 +158,7 @@ Bare-metal installations of an operating system such as Windows are deployed on
One abstraction layer above installing software directly on hardware is using a Hypervisor. In essence, this layer uses the base operating system to emulate hardware. You install an operating system (called a *Guest* OS) on the Hypervisor (called the *Host*), and the Guest OS acts as if it is on bare-metal.
-
+
In this abstraction level, you have full control (and responsibility) for the entire operating system, but not the hardware. This isolates all process space and provides an entire "Virtual Machine" to applications. For scale-out systems, a Virtual Machine allows for a distribution and control of complete computer environments using only software.
@@ -174,7 +174,7 @@ A Container is provided by the Container Runtime (Such as [containerd](https://c
(NOTE: The Container Image Kernel can run on Windows or Linux, but you will focus on the Linux Kernel Containers in this workshop.)
-
+
This abstraction holds everything for an application to isolate it from other running processes. It is also completely portable - you can create an image on one system, and another system can run it so long as a Container Runtime (such as Docker) is installed. Containers also start very quickly, and are easy to create (called Composing) using a simple text file with instructions for what to install on the image. The instructions pull the base Kernel, and then any binaries you want to install. Several pre-built Containers are already available; SQL Server is one of these. You can read more about installing SQL Server on Container Runtimes (such as Docker) here.
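In Docker's case, that "simple text file" is a Dockerfile. A hypothetical minimal sketch (the base image, package, and paths are examples only, not part of this workshop):

```dockerfile
# Hypothetical sketch of a Container image definition (Dockerfile).
# The instructions pull a base image, then add the binaries you want.
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y curl
COPY ./app /opt/app
CMD ["/opt/app/start.sh"]
```

Running `docker build` against a file like this produces the portable image described above.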
@@ -198,34 +198,36 @@ For Big Data systems, having lots of Containers is very advantageous to segment
-
+
You can learn much more about Container Orchestration systems here. We're using the Azure Kubernetes Service (AKS) in this workshop, and they have a great set of tutorials for you to learn more here.
-In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is reponsible for building and configurint the Nodes, assigns Pods to Nodes,creates and manages the Persistent Voumes (durable storage), and manages the operation of the Cluster.
+In SQL Server Big Data Clusters, the Container Orchestration system (such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is responsible for building and configuring the Nodes, assigning Pods to Nodes, creating and managing the Persistent Volumes (durable storage), and managing the operation of the Cluster.
+
+> NOTE: The OpenShift Container Platform is a commercially supported Platform as a Service (PaaS) from Red Hat, based on Kubernetes. Many shops require a commercial vendor to implement and support Kubernetes.
(You'll cover the storage aspects of Container Orchestration in more detail in a moment.)
-Activity: Familiarize Yourself with Container Orchestration using minikube
+Activity: Familiarize Yourself with Container Orchestration using minikube
To practice with Kubernetes, you will use an online emulator to work with the `minikube` platform.
Steps
-Open this resource, and complete the first module. (You can return to it later to complete all exercises if you wish)
-
+
Traditional storage uses a call from the operating system to an underlying I/O system, as you learned earlier. These file systems are either directly connected to the operating system or appear to be connected directly using a Storage Area Network. The blocks of data are stored and managed by the operating system.
For large scale-out data systems, the mounting point for an I/O is another abstraction. For SQL Server BDC, the most commonly used scale-out file system is the Hadoop Distributed File System, or HDFS. HDFS is a set of Java code that gathers disparate disk subsystems into a Cluster composed of various Nodes - a NameNode, which manages the cluster's metadata, and DataNodes that physically store the data. Files and directories are represented on the NameNode by structures called inodes. Inodes record attributes such as permissions, modification and access times, and namespace and diskspace quotas.
-
+
With an abstraction such as Containers, storage becomes an issue for two reasons: The storage can disappear when the Container is removed, and other Containers and technologies can't access storage easily within a Container.
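In Kubernetes, the first issue is addressed with the Persistent Volumes mentioned earlier: a Pod asks for durable storage through a Persistent Volume Claim, which survives the removal of the Container. A hypothetical minimal claim (the name and size here are examples only):

```yaml
# Hypothetical Persistent Volume Claim: storage that outlives the Container.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data-claim
spec:
  accessModes:
    - ReadWriteOnce        # mountable read-write by a single Node
  resources:
    requests:
      storage: 10Gi        # durable capacity requested from the cluster
```

A Pod then references the claim by name in its volume definition, decoupling the application from the underlying storage.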
@@ -239,7 +241,7 @@ You
-
+
There are three primary tools and utilities you will use to control the SQL Server big data cluster:
@@ -267,28 +269,28 @@ You can learn more about Azure Data Studio here.
-
+
You'll explore further operations with the Azure Data Studio in the final module of this course.
-Activity: Practice with Notebooks
+Activity: Practice with Notebooks
Steps
-Open this reference, and review the instructions you see there. You can clone this Notebook to work with it later.
+Open this reference, and review the instructions you see there. You can clone this Notebook to work with it later.
-Activity: Azure Data Studio Notebooks Overview
+Activity: Azure Data Studio Notebooks Overview
Steps
-Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permist.
+Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permits.
-For Further Study
+For Further Study
-Next Steps
+Next Steps
Next, continue to 02 - Hardware and Virtualization Environment for Kubernetes.
diff --git a/k8stobdc/KubernetesToBDC/03-kubernetes.md b/k8stobdc/KubernetesToBDC/03-kubernetes.md
index 94782f8..f364367 100644
--- a/k8stobdc/KubernetesToBDC/03-kubernetes.md
+++ b/k8stobdc/KubernetesToBDC/03-kubernetes.md
@@ -148,7 +148,7 @@ We'll begin with a set of definitions. These aren't all the terms used in Kubern
|
etcd |
- A high performance key value store that stores the cluster’s state. Since etcd is light-weight, each instance can generally share resources with other Nodes in the cluster. The Hardware recommendations section of the official etcd.io site provides a detailed breakdown of the hardware requirement for etcd. |
+ A high performance key value store that stores the cluster’s state. Since etcd is light-weight, each instance can generally share resources with other Nodes in the cluster. The Hardware recommendations section of the official http://etcd.io site provides a detailed breakdown of the hardware requirement for etcd. |
|
@@ -246,6 +246,7 @@ Luckily help is at hand in the form of a tool that leverages kubeadm in order to
- Add nodes to existing clusters
+
Kubespray is a Cloud Native Computing Foundation project with its own [GitHub repository](https://github.com/kubernetes-sigs/kubespray).
### 3.2.5 What Is Ansible? ###
@@ -268,6 +269,9 @@ In order to carry out the deployment of the Kubernetes cluster, a basic understa
### 3.2.7 Kubespray Workflow ###
+Unlike other available deployment tools, Kubespray does everything for you in "one shot". For example, kubeadm requires that certificates on nodes be created manually; Kubespray not only leverages kubeadm but also looks after everything, including certificate creation, for you. Kubespray works against most of the popular public cloud providers and has been tested for the deployment of clusters with thousands of nodes. The real elegance of Kubespray is the reuse it promotes: if an organization has a requirement to deploy multiple clusters, once Kubespray is set up, the only prerequisite for each new cluster is to create a new inventory file for the nodes that cluster will use.
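The inventory file mentioned above is a standard Ansible inventory. A hypothetical sketch for a small three-node cluster (host names, addresses, and group names follow common Kubespray conventions but are examples only):

```ini
# Hypothetical Kubespray/Ansible inventory for a small cluster.
[all]
node1 ansible_host=10.0.0.11
node2 ansible_host=10.0.0.12
node3 ansible_host=10.0.0.13

[kube_control_plane]
node1

[etcd]
node1

[kube_node]
node2
node3
```

Pointing the Kubespray playbooks at a new file like this is all that is needed to describe an additional cluster.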
+#### High Level Kubespray Workflow ####
+
The deployment of a Kubernetes cluster via Kubespray follows this workflow:
- Preinstall step
@@ -299,7 +303,6 @@ Note:
### 3.2.8 Requirements ###
-Refer to the [requirements](https://github.com/kubernetes-sigs/kubespray#requirements) section in the Kubespray GitHub repo.
### 3.2.9 Post Cluster Deployment Activities ###
diff --git a/k8stobdc/KubernetesToBDC/04-bdc.md b/k8stobdc/KubernetesToBDC/04-bdc.md
index df27fee..221515b 100644
--- a/k8stobdc/KubernetesToBDC/04-bdc.md
+++ b/k8stobdc/KubernetesToBDC/04-bdc.md
@@ -19,7 +19,7 @@ This module covers Container technologies and how they are different than Virtua
SQL Server (starting with version 2019) provides three ways to work with large sets of data:
- - **Data Virtualization**: Query multiple sources of data technologies using the Polybase SQL Server feature (data left at source)
+ - **Data Virtualization**: Query multiple sources of data technologies using the PolyBase SQL Server feature (data left at source)
- **Storage Pools**: Create sets of disparate data sources that can be queried from Distributed Data sets (data ingested into sharded databases using PolyBase)
- **SQL Server Big Data Clusters**: Create, manage and control clusters of SQL Server Instances that co-exist in a Kubernetes cluster with Apache Spark and other technologies to access and process large sets of data (Data left in place, ingested through PolyBase, and into/through HDFS)
@@ -45,12 +45,12 @@ To leverage PolyBase, you first define the external table using a specific set o
-Activity: Review PolyBase Solution
+Activity: Review PolyBase Solution
In this section you will review a solution tutorial similar to one you will perform later. You'll see how to create a reference to an HDFS file store and query it within SQL Server as if it were a standard internal table.
-Open this reference and locate numbers 4-5 of the steps in the tutorial. This explains the two steps required to create and query an External table. *Only review this information; you will perform these steps in another Module*.
+Open this reference and locate numbers 4-5 of the steps in the tutorial. This explains the two steps required to create and query an External table. *Only review this information; you will perform these steps in another Module*.
@@ -148,7 +148,7 @@ These components are used in the Compute Pool of the BDC:
BDC: App Pool
-The App Pool is a set of Pods within a Node that hold multiple types of end-points into the system. SQL Server Integration Services lives in the App Pool, and other Job systems are possible. You could instatiate a long-running job (such as IoT streaming) or Machine Learning (ML) endpoints used for scoring a prediction or returning a classification.
+The App Pool is a set of Pods within a Node that hold multiple types of end-points into the system. SQL Server Integration Services lives in the App Pool, and other Job systems are possible. You could instantiate a long-running job (such as IoT streaming) or Machine Learning (ML) endpoints used for scoring a prediction or returning a classification.
These components are used in the Compute Pool of the BDC:
@@ -195,18 +195,18 @@ These components are used in the Storage Pool of the BDC:
-Activity: Review Data Pool Solution
+Activity: Review Data Pool Solution
In this section you will review the solution tutorial similar to the one you will perform in a future step. You'll see how to load data into the Data Pool.
-Open this reference and review the steps in the tutorial. This explains the two steps required to create and load an External table in the Data Pool. You'll perform these steps in the Operationalization Module later. *Only review this information at this time. You will perform these steps in another Module.
+Open this reference and review the steps in the tutorial. This explains the two steps required to create and load an External table in the Data Pool. You'll perform these steps in the Operationalization Module later. *Only review this information at this time. You will perform these steps in another Module.*
-For Further Study
+For Further Study
- Official Documentation for this section
- Update on 2019 Blog
diff --git a/k8stobdc/KubernetesToBDC/05-datascience.md b/k8stobdc/KubernetesToBDC/05-datascience.md
index 0467dc4..c965fd3 100644
--- a/k8stobdc/KubernetesToBDC/05-datascience.md
+++ b/k8stobdc/KubernetesToBDC/05-datascience.md
@@ -14,18 +14,18 @@ This module covers Container technologies and how they are different than Virtua
-
+
Recall from The Big Data Landscape module that you learned about the Wide World Importers company. Wide World Importers (WWI) is a traditional brick and mortar business with a long track record of success, generating profits through strong retail store sales of their unique offering of affordable products from around the world. They have a traditional N-tier application that uses a front-end (mobile, web and installed) that interacts with a scale-out middle-tier software product, which in turn stores data in a large SQL Server database that has been scaled-up to meet demand.
-
+
WWI has now added web and mobile commerce to their platform, which has generated a significant amount of additional data, and data formats. These new platforms were added without integrating into the OLTP system data or Business Intelligence infrastructures. As a result, "silos" of data stores have developed, and ingesting all of this data exceeds the scale of their current RDBMS server:
-
+
This presented the following four challenges - the IT team at WWI needs to:
@@ -49,89 +49,89 @@ To meet these challenges, the following solution is proposed. Using the BDC plat
The following diagram illustrates the complete solution that you can use to brief your audience with:
-
+
In the following sections you'll dive deeper into how this scale is used to solve the rest of the challenges.
-
+
The next challenge the IT team must solve is to enable a single data query to work across multiple disparate systems, optionally joining to internal SQL Server Tables, and also at scale.
Using the Data Virtualization capability you saw in the 02 - SQL Server BDC Components Module, the IT team creates External Tables using the PolyBase feature. These External Table definitions are stored in the database on the SQL Server Master Instance within the cluster. When queried by the user, the queries are engaged from the SQL Server Master Instance through the Compute Pool in the SQL Server BDC, which holds Kubernetes Nodes containing the Pods running SQL Server Instances. These Instances send the query to the PolyBase Connector at the target data system, which processes the query based on the type of target system. The results are processed and returned through the PolyBase Connector to the Compute Pool and then on to the Master Instance, and then on to the user.
-
+
This process allows not only a query to disparate systems, but also those remote systems can hold extremely large sets of data. Normally you are querying a subset of that data, so the results are all that are sent back over the network. These results can be joined with internal tables for a single view, and all from within the same Transact-SQL statements.
-Activity: Load and query data in an External Table
+Activity: Load and query data in an External Table
-In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any Polybase target.
+In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any PolyBase target.
Steps
-Open this reference, and perform all of the instructions you see there. This loads your data in preparattion for the next Activity.
-Open this reference, and perform all of the instructions you see there. This step shows you how to create and query an External table.
-(Optional) Open this reference, and review the instructions you see there. (You You must have an Oracle server that your BDC can reach to perform these steps, although you can review them if you do not)
+Open this reference, and perform all of the instructions you see there. This loads your data in preparation for the next Activity.
+Open this reference, and perform all of the instructions you see there. This step shows you how to create and query an External table.
+(Optional) Open this reference, and review the instructions you see there. (You must have an Oracle server that your BDC can reach to perform these steps, although you can review them if you do not)
-
+
Ad-hoc queries are very useful for many scenarios, but there are times when you would like to bring the data into storage, so that you can create denormalized representations of datasets, aggregate data, and perform other purpose-specific data tasks.
-
+
Using the Data Virtualization capability you saw in the 02 - BDC Components Module, the IT team creates External Tables using PolyBase statements. These External Table definitions are stored in the database on the SQL Server Master Instance within the cluster. When queried by the user, the queries are engaged from the SQL Server Master Instance through the Compute Pool in the SQL Server BDC, which holds Kubernetes Nodes containing the Pods running SQL Server Instances. These Instances send the query to the PolyBase Connector at the target data system, which processes the query based on the type of target system. The results are processed and returned through the PolyBase Connector to the Compute Pool and then on to the Master Instance, and the PolyBase statements can specify the target of the Data Pool. The SQL Server Instances in the Data Pool store the data in a distributed fashion across multiple databases, called Shards.
-Activity: Load and query data into the Data Pool
+Activity: Load and query data into the Data Pool
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to load data into the Data Pool.
Steps
-Open this reference, and perform the instructions you see there. This loads data into the Data Pool.
+Open this reference, and perform the instructions you see there. This loads data into the Data Pool.
-
+
There are three primary uses for a large cluster of data processing systems for Machine Learning and AI applications. The first is that the users will be involved in the creation of the Features used in various ML and AI algorithms, and are often tasked with Labeling the data. These users can access the Data Pool and Data Storage data stores directly to query and assist with this task.
-The SQL Server Master Instance in the BDC installs with Machine Learning Services, which allow creation, training, evaluation and presisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
+The SQL Server Master Instance in the BDC installs with Machine Learning Services, which allow creation, training, evaluation and persisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
-
+
The Data Scientist has another option to create and train ML and AI models. The Spark platform within the Storage Pool is accessible through the Knox gateway, using Livy to send Spark Jobs as you learned about in the 02 - SQL Server BDC Components Module. This gives access to the full Spark platform, using Jupyter Notebooks (included in Azure Data Studio) or any other standard tools that can access Spark through REST calls.
-Activity: Load data with Spark, run a Spark Notebook
+Activity: Load data with Spark, run a Spark Notebook
In this activity, you will load the sample data into your big data cluster environment using Spark, and use a Notebook in Azure Data Studio to work with it.
Steps
-Open this reference, and follow the instructions you see there. This loads the data in preparation for the Notebook operations.
-Open this reference, and follow the instructions you see there. This simple example shows you how to work with the data you ingested into the Storage Pool using Spark.
+Open this reference, and follow the instructions you see there. This loads the data in preparation for the Notebook operations.
+Open this reference, and follow the instructions you see there. This simple example shows you how to work with the data you ingested into the Storage Pool using Spark.
-For Further Study
+For Further Study
- Official Documentation for this section
- Use curl to load data into HDFS on SQL Server 2019 big data clusters
diff --git a/k8stobdc/KubernetesToBDC/scripts/SingleNodeClusterOnAzureVM.txt b/k8stobdc/KubernetesToBDC/scripts/SingleNodeClusterOnAzureVM.txt
new file mode 100644
index 0000000..17acbe2
--- /dev/null
+++ b/k8stobdc/KubernetesToBDC/scripts/SingleNodeClusterOnAzureVM.txt
@@ -0,0 +1,124 @@
+# SQL Server 2019 big data cluster
+Single-Node Cluster on an Azure Virtual Machine (Unsupported for production - classroom only)
+In this set of instructions you'll set up a SQL Server 2019 big data cluster on a single Node,
+using Ubuntu on a Microsoft Azure Virtual Machine.
+
+NOTE: This is an unsupported configuration, and should be used only for classroom purposes.
+Carefully read the instructions for the parameters you need to replace for your specific
+subscription and parameters.
+
+-------------------------------------------------------------------------------------------------------------
+## Running these Instructions
+
+These instructions use shell commands run from PowerShell, bash, or a CMD window on a
+system that has the Secure Shell (SSH) software installed. You can type:
+
+ssh -h
+
+To see if this tool is installed.
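
As a quick sketch (assuming a POSIX shell), you can also test for the SSH client without relying on its help output:

```shell
# Check whether an SSH client is available on the PATH before continuing.
if command -v ssh >/dev/null 2>&1; then
  echo "ssh found"
else
  echo "ssh missing - install an SSH client first"
fi
```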
+
+You can copy-and-paste from the lines that show the commands, or you can set your IDE to run the current line
+in a Terminal window. (In Visual Studio Code or Azure Data Studio, these are called "Keybindings"):
+https://code.visualstudio.com/docs/getstarted/keybindings
+
+You can set this to any key you like:
+(Preferences | Keyboard Shortcuts | Terminal: Run Selected Text in Active Terminal)
+
+-------------------------------------------------------------------------------------------------------------
+## References
+This Notebook uses the script located here:
+https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-script-single-node-kubeadm?view=sql-server-ver15
+and that reference supersedes the information in the steps listed below.
+
+You can also create a SQL Server Big Data Cluster on the Azure Kubernetes Service (AKS):
+Those instructions are located here: https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sql-server-ver15
+
+For a complete workshop on SQL Server 2019's big data clusters, see this reference:
+https://github.com/Microsoft/sqlworkshops/tree/master/sqlserver2019bigdataclusters
+
+-------------------------------------------------------------------------------------------------------------
+### Step 1: Log in to Azure
+az login
+
+-------------------------------------------------------------------------------------------------------------
+### Step 2: Set your account - show the accounts, replace with your account name
+az account list --output table
+
+az account set --subscription ""
+
+-------------------------------------------------------------------------------------------------------------
+### Step 3: Create a Resource Group and a Virtual Machine - Replace the placeholder values with your own values
+#### (Note: Needs a machine large enough to run BDC and also have Nested Virtualization)
+
+
+az group create -n -l eastus2
+
+az vm create -n -g -l eastus2 --image UbuntuLTS --os-disk-size-gb 200 --storage-sku Premium_LRS --admin-username bdcadmin --admin-password --size Standard_D8s_v3 --public-ip-address-allocation static
+
+ssh -X bdcadmin@
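
As a sketch, the commands above can be composed with shell variables so each placeholder only needs to be set once. The resource group name, VM name, and password below are hypothetical, not values from this workshop; the commands are echoed as a dry run so nothing is created until you paste them without `echo`:

```shell
# Hypothetical placeholder values -- replace with your own before running.
RG_NAME="bdc-class-rg"
VM_NAME="bdc-class-vm"
LOCATION="eastus2"
ADMIN_USER="bdcadmin"
ADMIN_PASSWORD='ReplaceWithAStrongPassword!'

# Dry run: print the composed commands; remove 'echo' to execute them.
echo "az group create -n $RG_NAME -l $LOCATION"
echo "az vm create -n $VM_NAME -g $RG_NAME -l $LOCATION --image UbuntuLTS --os-disk-size-gb 200 --storage-sku Premium_LRS --admin-username $ADMIN_USER --admin-password $ADMIN_PASSWORD --size Standard_D8s_v3 --public-ip-address-allocation static"
```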
+
+### Step 4: Update and Upgrade VM
+sudo apt-get update
+
+sudo apt-get upgrade
+
+sudo apt autoremove
+
+-------------------------------------------------------------------------------------------------------------
+### Step 5: (Optional) Install an XWindows server
+sudo apt-get install xorg openbox
+
+sudo reboot
+#### After about 5 minutes:
+ssh -X bdcadmin@
+
+sudo apt-get install gnome-core
+
+sudo reboot
+
+#### After about 5 minutes:
+ssh -X bdcadmin@
+
+sudo sed -i 's/allowed_users=console/allowed_users=anybody/' /etc/X11/Xwrapper.config
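
If you want to see what that `sed` edit does before touching the real file, this sketch performs the same substitution on a scratch copy instead of `/etc/X11/Xwrapper.config`:

```shell
# Demonstrate the substitution on a scratch file rather than the real config.
printf 'allowed_users=console\n' > /tmp/Xwrapper.config.test
sed -i 's/allowed_users=console/allowed_users=anybody/' /tmp/Xwrapper.config.test
cat /tmp/Xwrapper.config.test   # prints: allowed_users=anybody
```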
+
+mkdir -p /home/bdcadmin/.config/nautilus
+
+cd ~
+mkdir ./Downloads
+
+touch /home/bdcadmin/.gtk-bookmarks
+
+wget https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/ads.deb
+
+#### Check XWindows - Note, requires that you have XWindows software installed on your laptop
+nautilus &
+
+-------------------------------------------------------------------------------------------------------------
+### Step 6: Install BDC Single Node - Pre-requisites (Current as of 1/31/2020)
+sudo apt update && sudo apt upgrade -y
+sudo reboot
+#### After about 5 minutes:
+ssh -X bdcadmin@
+
+-------------------------------------------------------------------------------------------------------------
+### Step 7: Download and mark BDC Setup script
+curl --output setup-bdc.sh https://raw.githubusercontent.com/microsoft/sql-server-samples/master/samples/features/sql-big-data-cluster/deployment/kubeadm/ubuntu-single-node-vm/setup-bdc.sh
+
+chmod +x setup-bdc.sh
+
+sudo ./setup-bdc.sh
+
+-------------------------------------------------------------------------------------------------------------
+### Step 8: Setup path and Check
+source ~/.bashrc
+
+azdata --version
+
+kubectl get pods
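
If some pods show `Pending` or `ContainerCreating`, the cluster is still starting up. A small helper (hypothetical, not part of the product tooling) can count pods that are not yet in a `Running` or `Completed` state from the `kubectl get pods` output:

```shell
# Count pods whose STATUS column is not Running/Completed.
# Usage: kubectl get pods | not_ready
not_ready() {
  awk 'NR>1 && $3 != "Running" && $3 != "Completed" {n++} END {print n+0}'
}

# Example against a captured (hypothetical) listing:
printf 'NAME READY STATUS RESTARTS AGE\nmaster-0 2/2 Running 0 5m\nstorage-0 0/2 Pending 0 5m\n' | not_ready   # prints: 1
```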
+
+#### You can now use the system.
+-------------------------------------------------------------------------------------------------------------
+## Cleanup - Erase everything
+### Only perform this step when you are done experimenting with the system...
+az group delete --name
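
By default `az group delete` prompts for confirmation; the standard az CLI flags `--yes --no-wait` skip the prompt and return immediately. Shown here as a dry-run echo with a hypothetical group name:

```shell
RG_NAME="bdc-class-rg"   # hypothetical -- use the resource group you created in Step 3
echo "az group delete --name $RG_NAME --yes --no-wait"
```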
+
diff --git a/k8stobdc/README.md b/k8stobdc/README.md
index 9a90e46..9217df0 100644
--- a/k8stobdc/README.md
+++ b/k8stobdc/README.md
@@ -8,7 +8,7 @@
About this Workshop
-Welcome to this Microsoft solutions workshop on *Kubernetes - From Bare Metal to SQL Server Big Data Clusters*. In this workshop, you'll learn about setting up a production grade SQL Server 2019 big data cluster environment on Kubernetes. Topics covered include: hardware, virtualization, and Kubernetes, with a full deployment of SQL Server's Big Data Cluster on the environment that you will use in the class. You'll then walk through a set of Jupyter Notebooks in Azure Data Studio to run T-SQL, Spark, and Machine Learning workloads on the cluster. You'll also receive valuable resources to learn more and go deeper on Linux, Containers, Kubernetes and SQL Server big data clusters.
+Welcome to this Microsoft solutions workshop on *Kubernetes - From Bare Metal to SQL Server Big Data Clusters*. In this workshop, you'll learn about setting up a production grade SQL Server 2019 big data cluster environment on Kubernetes. Topics covered include: hardware, virtualization, and Kubernetes, with a full deployment of SQL Server's Big Data Cluster on the environment that you will use in the class. You'll then walk through a set of [Jupyter Notebooks](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html) in Microsoft's [Azure Data Studio](https://docs.microsoft.com/en-us/sql/azure-data-studio/what-is?view=sql-server-ver15) tool to run T-SQL, Spark, and Machine Learning workloads on the cluster. You'll also receive valuable resources to learn more and go deeper on Linux, Containers, Kubernetes and SQL Server big data clusters.
The focus of this workshop is to understand the hardware, software, and environment you need to work with [SQL Server 2019's big data clusters](https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15) on a Kubernetes platform.
@@ -86,7 +86,7 @@ or
Workshop Details
-This workshop uses , with a focus on .
+This workshop uses Kubernetes to deploy a workload, with a focus on Microsoft SQL Server's big data clusters deployment for Big Data and Data Science workloads.
@@ -128,6 +128,11 @@ This is a modular workshop, and in each section, you'll learn concepts, technolo
Next, Continue to Pre-Requisites
+**Workshop Authors and Contributors**
+
+- [The Microsoft SQL Server Team](http://microsoft.com/sql)
+- [Chris Adkin](https://www.linkedin.com/in/wollatondba/), Pure Storage
+
**Legal Notice**
-*Kubernetes and the Kubernetes logo are trademarks or registered trademarks of The Linux Foundation. in the United States and/or other countries. The Linux Foundation and other parties may also have trademark rights in other terms used herein. This course is not certified, accredited, affiliated with, nor endorsed by Kubernetes or The Linux Foundation.*
\ No newline at end of file
+*Kubernetes and the Kubernetes logo are trademarks or registered trademarks of The Linux Foundation in the United States and/or other countries. The Linux Foundation and other parties may also have trademark rights in other terms used herein. This Workshop is not certified, accredited, affiliated with, nor endorsed by Kubernetes or The Linux Foundation.*
\ No newline at end of file
diff --git a/k8stobdc/graphics/.DS_Store b/k8stobdc/graphics/.DS_Store
new file mode 100644
index 0000000..5008ddf
Binary files /dev/null and b/k8stobdc/graphics/.DS_Store differ
diff --git a/k8stobdc/graphics/KubernetesCluster.png b/k8stobdc/graphics/KubernetesCluster.png
deleted file mode 100644
index 73afe8f..0000000
Binary files a/k8stobdc/graphics/KubernetesCluster.png and /dev/null differ
diff --git a/k8stobdc/graphics/WWI-001.png b/k8stobdc/graphics/WWI-001.png
deleted file mode 100644
index 1ce6904..0000000
Binary files a/k8stobdc/graphics/WWI-001.png and /dev/null differ
diff --git a/k8stobdc/graphics/WWI-002.png b/k8stobdc/graphics/WWI-002.png
deleted file mode 100644
index 9ecd11f..0000000
Binary files a/k8stobdc/graphics/WWI-002.png and /dev/null differ
diff --git a/k8stobdc/graphics/WWI-003.png b/k8stobdc/graphics/WWI-003.png
deleted file mode 100644
index dd6d45f..0000000
Binary files a/k8stobdc/graphics/WWI-003.png and /dev/null differ
diff --git a/k8stobdc/graphics/WWI-logo.png b/k8stobdc/graphics/WWI-logo.png
deleted file mode 100644
index 6cc0c95..0000000
Binary files a/k8stobdc/graphics/WWI-logo.png and /dev/null differ
diff --git a/k8stobdc/graphics/adf.png b/k8stobdc/graphics/adf.png
deleted file mode 100644
index 60898c1..0000000
Binary files a/k8stobdc/graphics/adf.png and /dev/null differ
diff --git a/k8stobdc/graphics/ads.png b/k8stobdc/graphics/ads.png
deleted file mode 100644
index 8f3c9df..0000000
Binary files a/k8stobdc/graphics/ads.png and /dev/null differ
diff --git a/k8stobdc/graphics/bookpencil.png b/k8stobdc/graphics/bookpencil.png
deleted file mode 100644
index 4319836..0000000
Binary files a/k8stobdc/graphics/bookpencil.png and /dev/null differ
diff --git a/k8stobdc/graphics/building1.png b/k8stobdc/graphics/building1.png
deleted file mode 100644
index 08daf6f..0000000
Binary files a/k8stobdc/graphics/building1.png and /dev/null differ
diff --git a/k8stobdc/graphics/bulletlist.png b/k8stobdc/graphics/bulletlist.png
deleted file mode 100644
index a5a6942..0000000
Binary files a/k8stobdc/graphics/bulletlist.png and /dev/null differ
diff --git a/k8stobdc/graphics/checkbox.png b/k8stobdc/graphics/checkbox.png
deleted file mode 100644
index ce869f7..0000000
Binary files a/k8stobdc/graphics/checkbox.png and /dev/null differ
diff --git a/k8stobdc/graphics/checkmark.png b/k8stobdc/graphics/checkmark.png
deleted file mode 100644
index b3acbbd..0000000
Binary files a/k8stobdc/graphics/checkmark.png and /dev/null differ
diff --git a/k8stobdc/graphics/clipboardcheck.png b/k8stobdc/graphics/clipboardcheck.png
deleted file mode 100644
index 1269d5d..0000000
Binary files a/k8stobdc/graphics/clipboardcheck.png and /dev/null differ
diff --git a/k8stobdc/graphics/cloud1.png b/k8stobdc/graphics/cloud1.png
deleted file mode 100644
index 19e6434..0000000
Binary files a/k8stobdc/graphics/cloud1.png and /dev/null differ
diff --git a/k8stobdc/graphics/datamart.png b/k8stobdc/graphics/datamart.png
deleted file mode 100644
index d279d17..0000000
Binary files a/k8stobdc/graphics/datamart.png and /dev/null differ
diff --git a/k8stobdc/graphics/datavirtualization.png b/k8stobdc/graphics/datavirtualization.png
deleted file mode 100644
index 0044305..0000000
Binary files a/k8stobdc/graphics/datavirtualization.png and /dev/null differ
diff --git a/k8stobdc/graphics/education1.png b/k8stobdc/graphics/education1.png
deleted file mode 100644
index 99eb60d..0000000
Binary files a/k8stobdc/graphics/education1.png and /dev/null differ
diff --git a/k8stobdc/graphics/factory.png b/k8stobdc/graphics/factory.png
deleted file mode 100644
index a5940f6..0000000
Binary files a/k8stobdc/graphics/factory.png and /dev/null differ
diff --git a/k8stobdc/graphics/geopin.png b/k8stobdc/graphics/geopin.png
deleted file mode 100644
index f8a8b31..0000000
Binary files a/k8stobdc/graphics/geopin.png and /dev/null differ
diff --git a/k8stobdc/graphics/hdfs.png b/k8stobdc/graphics/hdfs.png
deleted file mode 100644
index 6715fea..0000000
Binary files a/k8stobdc/graphics/hdfs.png and /dev/null differ
diff --git a/k8stobdc/graphics/kubectl.png b/k8stobdc/graphics/kubectl.png
deleted file mode 100644
index 4dce94a..0000000
Binary files a/k8stobdc/graphics/kubectl.png and /dev/null differ
diff --git a/k8stobdc/graphics/listcheck.png b/k8stobdc/graphics/listcheck.png
deleted file mode 100644
index 16305ff..0000000
Binary files a/k8stobdc/graphics/listcheck.png and /dev/null differ
diff --git a/k8stobdc/graphics/microsoftlogo.png b/k8stobdc/graphics/microsoftlogo.png
deleted file mode 100644
index e0259ce..0000000
Binary files a/k8stobdc/graphics/microsoftlogo.png and /dev/null differ
diff --git a/k8stobdc/graphics/owl.png b/k8stobdc/graphics/owl.png
deleted file mode 100644
index 3dcf729..0000000
Binary files a/k8stobdc/graphics/owl.png and /dev/null differ
diff --git a/k8stobdc/graphics/paperclip1.png b/k8stobdc/graphics/paperclip1.png
deleted file mode 100644
index 2d03457..0000000
Binary files a/k8stobdc/graphics/paperclip1.png and /dev/null differ
diff --git a/k8stobdc/graphics/pencil2.png b/k8stobdc/graphics/pencil2.png
deleted file mode 100644
index e10030b..0000000
Binary files a/k8stobdc/graphics/pencil2.png and /dev/null differ
diff --git a/k8stobdc/graphics/pinmap.png b/k8stobdc/graphics/pinmap.png
deleted file mode 100644
index 630f8cf..0000000
Binary files a/k8stobdc/graphics/pinmap.png and /dev/null differ
diff --git a/k8stobdc/graphics/point1.png b/k8stobdc/graphics/point1.png
deleted file mode 100644
index 17056c3..0000000
Binary files a/k8stobdc/graphics/point1.png and /dev/null differ
diff --git a/k8stobdc/graphics/solutiondiagram.png b/k8stobdc/graphics/solutiondiagram.png
deleted file mode 100644
index b9554fd..0000000
Binary files a/k8stobdc/graphics/solutiondiagram.png and /dev/null differ
diff --git a/k8stobdc/graphics/spark.jpg b/k8stobdc/graphics/spark.jpg
deleted file mode 100644
index d347039..0000000
Binary files a/k8stobdc/graphics/spark.jpg and /dev/null differ
diff --git a/k8stobdc/graphics/spark.png b/k8stobdc/graphics/spark.png
deleted file mode 100644
index ec54176..0000000
Binary files a/k8stobdc/graphics/spark.png and /dev/null differ
diff --git a/k8stobdc/graphics/sqlbdc.png b/k8stobdc/graphics/sqlbdc.png
deleted file mode 100644
index 088a46c..0000000
Binary files a/k8stobdc/graphics/sqlbdc.png and /dev/null differ
diff --git a/k8stobdc/graphics/textbubble.png b/k8stobdc/graphics/textbubble.png
deleted file mode 100644
index 1290a54..0000000
Binary files a/k8stobdc/graphics/textbubble.png and /dev/null differ
diff --git a/sqlserver2019bigdataclusters/.DS_Store b/sqlserver2019bigdataclusters/.DS_Store
new file mode 100644
index 0000000..f9ed96c
Binary files /dev/null and b/sqlserver2019bigdataclusters/.DS_Store differ
diff --git a/sqlserver2019bigdataclusters/SQL2019BDC/00 - Prerequisites.md b/sqlserver2019bigdataclusters/SQL2019BDC/00 - Prerequisites.md
index 0ded3f7..e2d6720 100644
--- a/sqlserver2019bigdataclusters/SQL2019BDC/00 - Prerequisites.md
+++ b/sqlserver2019bigdataclusters/SQL2019BDC/00 - Prerequisites.md
@@ -22,7 +22,7 @@ The other requirements are:
- **The pip3 Package**: The Python package manager *pip3* is used to install various BDC deployment and configuration tools.
- **The kubectl program**: The *kubectl* program is the command-line control feature for Kubernetes.
- **The azdata utility**: The *azdata* program is the deployment and configuration tool for BDC.
-- **Azure Data Studio**: The *Azure Data Studio* IDE, along with various Extensions, is used for deploying the system, and querying and management of the BDC. In addition, you will use this tool to participate in the workshop. Note: You can connect to a SQL Server 2019 Big Data Cluster using any SQL Server connection tool or applicaiton, such as SQL Server Management Studio, but this course will use Microsoft Azure Data Studio for cluster management, Jupyter Notebooks and other capabilities.
+- **Azure Data Studio**: The *Azure Data Studio* IDE, along with various Extensions, is used for deploying the system, and querying and management of the BDC. In addition, you will use this tool to participate in the workshop. Note: You can connect to a SQL Server 2019 Big Data Cluster using any SQL Server connection tool or application, such as SQL Server Management Studio, but this course will use Microsoft Azure Data Studio for cluster management, Jupyter Notebooks and other capabilities.
*Note that all following activities must be completed prior to class - there will not be time to perform these operations during the workshop.*
@@ -71,7 +71,7 @@ Get-WindowsUpdate
Install-WindowsUpdate
-*Note: If you get an error during this update process, evaluate it to see if it is fatal. You may recieve certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
+*Note: If you get an error during this update process, evaluate it to see if it is fatal. You may receive certain driver errors if you are using a Virtual Machine; these can be safely ignored.*
Install Big Data Cluster Tools
@@ -97,7 +97,7 @@ Get-WindowsUpdate
Install-WindowsUpdate
-*Note 1: If you get an error during this update process, evaluate it to see if it is fatal. You may recieve certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
+*Note 1: If you get an error during this update process, evaluate it to see if it is fatal. You may receive certain driver errors if you are using a Virtual Machine; these can be safely ignored.*
**Note 2: If you are using a Virtual Machine in Azure, power off the Virtual Machine using the Azure Portal every time you are done with it. Turning off the VM using just the Windows power off in the VM only stops it running, but you are still charged for the VM if you do not stop it from the Portal. Stop the VM from the Portal unless you are actively using it.**
diff --git a/sqlserver2019bigdataclusters/SQL2019BDC/01 - The Big Data Landscape.md b/sqlserver2019bigdataclusters/SQL2019BDC/01 - The Big Data Landscape.md
index 2a60eed..2c49ed8 100644
--- a/sqlserver2019bigdataclusters/SQL2019BDC/01 - The Big Data Landscape.md
+++ b/sqlserver2019bigdataclusters/SQL2019BDC/01 - The Big Data Landscape.md
@@ -92,7 +92,7 @@ This solution uses an example of a retail organization that has multiple data so
-Wide World Importeres (WWI) is a traditional brick and mortar business with a long track record of success, generating profits through strong retail store sales of their unique offering of affordable products from around the world. They have a great training program for new employees, that focuses on connecting with their customers and providing great face-to-face customer service. This strong focus on customer relationships has helped set WWI apart from their competitors.
+Wide World Importers (WWI) is a traditional brick and mortar business with a long track record of success, generating profits through strong retail store sales of their unique offering of affordable products from around the world. They have a great training program for new employees, that focuses on connecting with their customers and providing great face-to-face customer service. This strong focus on customer relationships has helped set WWI apart from their competitors.
WWI has now added web and mobile commerce to their platform, which has generated a significant amount of additional data, and data formats. These new platforms have been added without integrating into the OLTP system data or Business Intelligence infrastructures. As a result, "silos" of data stores have developed.
@@ -144,7 +144,7 @@ Using the following steps, you will create a Resource Group in Azure that will h
Read the following article to install the big data cluster Tools, ensuring that you carefully follow each step. Note that if you followed the pre-requisites properly, you will already have Python, kubectl, and Azure Data Studio installed, so those may be skipped. Follow all other instructions.
- Read the following article to deploy the bdc to AKS, ensuring that you carefully follow each step. Stop at the section marked Connect to the cluster.
+ Read the following article to deploy the Big Data Cluster to the Azure Kubernetes Service, ensuring that you carefully follow each step. Stop at the section marked Connect to the cluster.
@@ -309,7 +309,7 @@ You can
@@ -327,7 +327,7 @@ For Big Data systems, having lots of Containers is very advantageous to segment
You can learn much more about Container Orchestration systems here. We're using the Azure Kubernetes Service (AKS) in this workshop, and they have a great set of tutorials for you to learn more here.
-In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is reponsible for building and configurint the Nodes, assigns Pods to Nodes,creates and manages the Persistent Voumes (durable storage), and manages the operation of the Cluster.
+In SQL Server Big Data Clusters, the Container Orchestration system (such as Kubernetes or OpenShift) is responsible for the state of the BDC; it builds and configures the Nodes, assigns Pods to Nodes, creates and manages the Persistent Volumes (durable storage), and manages the operation of the Cluster.
(You'll cover the storage aspects of Container Orchestration in more detail in a moment.)
@@ -440,7 +440,7 @@ You'll explore further operations with the Azure Data Studio in the Operation
Activity: Azure Data Studio Notebooks Overview
Steps
-Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permist.
+Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permits.
@@ -486,7 +486,7 @@ Since HDFS is a file-system, data transfer is largely a matter of using it as a
Data Pipelines using Azure Data Factory
-As described earlier, you can use various methods to ingest data ad-hoc and as-needed for your two data targets (HDFS and SQL Server Tables. A more holistic archicture is to use a Pipeline system that can define sources, triggers and events, transforms, targets, and has logging and tracking capabilities. The Microsoft Azure Data Factory provides all of the capabilities, and often serves as the mechanism to transfer data to and from on-premises, in-cloud, and other sources and targets. ADF can serve as a full data pipeline system, as described here.
+As described earlier, you can use various methods to ingest data ad-hoc and as-needed for your two data targets (HDFS and SQL Server Tables). A more holistic architecture is to use a Pipeline system that can define sources, triggers and events, transforms, targets, and has logging and tracking capabilities. The Microsoft Azure Data Factory provides all of these capabilities, and often serves as the mechanism to transfer data to and from on-premises, in-cloud, and other sources and targets. ADF can serve as a full data pipeline system, as described here.
diff --git a/sqlserver2019bigdataclusters/SQL2019BDC/02 - SQL Server BDC Components.md b/sqlserver2019bigdataclusters/SQL2019BDC/02 - SQL Server BDC Components.md
index 4d09a8f..d6508eb 100644
--- a/sqlserver2019bigdataclusters/SQL2019BDC/02 - SQL Server BDC Components.md
+++ b/sqlserver2019bigdataclusters/SQL2019BDC/02 - SQL Server BDC Components.md
@@ -28,7 +28,7 @@ You'll cover the following topics in this Module:
SQL Server (starting with version 2019) provides three ways to work with large sets of data:
- - **Data Virtualization**: Query multiple sources of data technologies using the Polybase SQL Server feature (data left at source)
+ - **Data Virtualization**: Query multiple sources of data technologies using the PolyBase SQL Server feature (data left at source)
- **Storage Pools**: Create sets of disparate data sources that can be queried from Distributed Data sets (data ingested into sharded databases using PolyBase)
- **SQL Server Big Data Clusters**: Create, manage and control clusters of SQL Server Instances that co-exist in a Kubernetes cluster with Apache Spark and other technologies to access and process large sets of data (Data left in place, ingested through PolyBase, and into/through HDFS)
@@ -228,4 +228,4 @@ In this section you will review the solution tutorial you will perform in the Next Steps
-Next, Continue to Planning, Installation and Configuration.
+Next, Continue to Planning, Installation and Configuration.
\ No newline at end of file
diff --git a/sqlserver2019bigdataclusters/SQL2019BDC/03 - Planning, Installation and Configuration.md b/sqlserver2019bigdataclusters/SQL2019BDC/03 - Planning, Installation and Configuration.md
index f11c7c5..d7240dd 100644
--- a/sqlserver2019bigdataclusters/SQL2019BDC/03 - Planning, Installation and Configuration.md
+++ b/sqlserver2019bigdataclusters/SQL2019BDC/03 - Planning, Installation and Configuration.md
@@ -29,28 +29,28 @@ You'll cover the following topics in this Module:
NOTE: The following Module is based on the Public Preview of the Microsoft SQL Server 2019 big data cluster feature. These instructions will change as the product is updated. The latest installation instructions are located here.
-A Big Data Cluster for SQL Server (BDC) is deployed onto a Cluster Orechestration system (such as Kubernetes or OpenShift) using the `azdata` utility which creates the appropriate Nodes, Pods, Containers and other constructs for the system. The installation uses various switches on the `azdata` utility, and reads from several variables contianed within an internal JSON document when you run the command. Using a switch, you can change these variables. You can also dump the enitre document to a file, edit it, and then call the installation that uses that file with the `azdata` command. More detail on that process is located here.
+A Big Data Cluster for SQL Server (BDC) is deployed onto a Cluster Orchestration system (such as Kubernetes or OpenShift) using the `azdata` utility which creates the appropriate Nodes, Pods, Containers and other constructs for the system. The installation uses various switches on the `azdata` utility, and reads from several variables contained within an internal JSON document when you run the command. Using a switch, you can change these variables. You can also dump the entire document to a file, edit it, and then call the installation that uses that file with the `azdata` command. More detail on that process is located here.
For planning, it is essential that you understand the SQL Server BDC components, and have a firm understanding of Kubernetes and TCP/IP networking. You should also have an understanding of how SQL Server and Apache Spark use the "Big Four" (*CPU, I/O, Memory and Networking*).
-Since the Cluster Orechestration system is often made up of Virtual Machines that host the Container Images, they must be as large as possible. For the best possible performance, large physical machines that are tuned for optimal performance is a recommended physical architecture. The least viable production system is a Minimum of 3 Linux physical machines or virtual machines. The recommended configuration per machine is 8 CPUs, 32 GB of memory and 100GB of storage. This configuration would support only one or two users with a standard workload, and you would want to increase the system for each additional user or heavier workload.
+Since the Cluster Orchestration system is often made up of Virtual Machines that host the Container Images, they must be as large as possible. For the best possible performance, large physical machines that are tuned for optimal performance are the recommended physical architecture. The least viable production system is a minimum of 3 Linux physical machines or virtual machines. The recommended configuration per machine is 8 CPUs, 32 GB of memory, and 100 GB of storage. This configuration would support only one or two users with a standard workload, and you would want to increase the system for each additional user or heavier workload.
You can deploy Kubernetes in a few ways:
- In a Cloud Platform such as Azure Kubernetes Service (AKS)
- - In your own Cluster Orechestration system deployment using the appropirate tools such as `KubeADM`
+ - In your own Cluster Orchestration system deployment using the appropriate tools such as `KubeADM`
-Regardless of the Cluster Orechestration system target, the general steps for setting up the system are:
+Regardless of the Cluster Orchestration system target, the general steps for setting up the system are:
- - Set up Cluster Orechestration system with a Cluster target
+ - Set up Cluster Orchestration system with a Cluster target
- Install the cluster tools on the administration machine
- - Deploy the BDC onto the Cluster Orechestration system
+ - Deploy the BDC onto the Cluster Orchestration system
-In the sections that follow, you'll cover the general process for each of these deployments. The official documentation referenced above have the specific steps for each deployment, and the *Activity* section of this Module has the steps for deploying the BDC on AKS for the classroom enviornment.
+In the sections that follow, you'll cover the general process for each of these deployments. The official documentation referenced above has the specific steps for each deployment, and the *Activity* section of this Module has the steps for deploying the BDC on AKS for the classroom environment.
@@ -94,7 +94,7 @@ With this background, you can find the 3.2 Installing Locally Using KubeADM
-If you choose Kubernetes as your Cluster Orechestration system, the kubeadm toolbox helps you bootstrap a Kubernetes cluster that conforms to best practices. Kubeadm also supports other cluster lifecycle functions, such as upgrades, downgrade, and managing bootstrap tokens.
+If you choose Kubernetes as your Cluster Orchestration system, the kubeadm toolbox helps you bootstrap a Kubernetes cluster that conforms to best practices. Kubeadm also supports other cluster lifecycle functions, such as upgrades, downgrades, and managing bootstrap tokens.
The kubeadm toolbox can deploy a Kubernetes cluster to physical or virtual machines. It works by specifying the TCP/IP addresses of the targets.
diff --git a/sqlserver2019bigdataclusters/SQL2019BDC/04 - Operationalization.md b/sqlserver2019bigdataclusters/SQL2019BDC/04 - Operationalization.md
index 20ec885..afdfa96 100644
--- a/sqlserver2019bigdataclusters/SQL2019BDC/04 - Operationalization.md
+++ b/sqlserver2019bigdataclusters/SQL2019BDC/04 - Operationalization.md
@@ -81,11 +81,11 @@ This process allows not only a query to disparate systems, but also those remote
Activity: Load and query data in an External Table
-In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any Polybase target.
+In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any PolyBase target.
Steps
-Open this reference, and perform all of the instructions you see there. This loads your data in preparattion for the next Activity.
+Open this reference, and perform all of the instructions you see there. This loads your data in preparation for the next Activity.
Open this reference, and perform all of the instructions you see there. This step shows you how to create and query an External table.
(Optional) Open this reference, and review the instructions you see there. (You must have an Oracle server that your BDC can reach to perform these steps, although you can review them if you do not)
@@ -118,7 +118,7 @@ In this activity, you will load the sample data into your big data cluster envir
There are three primary uses for a large cluster of data processing systems for Machine Learning and AI applications. The first is that the users will be involved in the creation of the Features used in various ML and AI algorithms, and are often tasked to Label the data. These users can access the Data Pool and Data Storage data stores directly to query and assist with this task.
-The SQL Server Master Instance in the BDC installs with Machine Learning Services, which allow creation, training, evaluation and presisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
+The SQL Server Master Instance in the BDC installs with Machine Learning Services, which allow creation, training, evaluation and persisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
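The wrapping pattern described above relies on `sp_execute_external_script`, which Machine Learning Services provides for running R or Python inside a stored procedure. A hedged sketch of training and persisting a model follows; the table and column names are assumptions for illustration, not part of the workshop's scripts.

```sql
-- Sketch: train a Python model over SQL data and return the serialized model.
-- dbo.training_data and its columns (x, y) are hypothetical.
DECLARE @trained_model VARBINARY(MAX);

EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
import pickle
from sklearn.linear_model import LinearRegression

# InputDataSet is the pandas DataFrame produced by @input_data_1
model = LinearRegression().fit(InputDataSet[["x"]], InputDataSet["y"])
trained_model = pickle.dumps(model)
',
    @input_data_1 = N'SELECT x, y FROM dbo.training_data',
    @params = N'@trained_model VARBINARY(MAX) OUTPUT',
    @trained_model = @trained_model OUTPUT;

-- The serialized model can then be stored in a table on the Master Instance
-- for later scoring, as the text above describes.
```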
diff --git a/sqlserver2019bigdataclusters/SQL2019BDC/05 - Management and Monitoring.md b/sqlserver2019bigdataclusters/SQL2019BDC/05 - Management and Monitoring.md
index 37a37ae..a2f2f30 100644
--- a/sqlserver2019bigdataclusters/SQL2019BDC/05 - Management and Monitoring.md
+++ b/sqlserver2019bigdataclusters/SQL2019BDC/05 - Management and Monitoring.md
@@ -29,7 +29,7 @@ You'll cover the following topics in this Module:
There are two primary areas for monitoring your BDC deployment. The first deals with SQL Server 2019, and the second deals with the set of elements in the Cluster.
-For SQL Server, management is much as you would normally perform for any SQL Server system. You have the same type of services, surface points, security areas and other control vectors as in a stand-alone installation of SQL Server. The tools you have avalaible for managing the Master Instance in the BDC are the same as managing a stand-alone installation, including SQL Server Management Studio, command-line interfaces, Azure Data Studio, and third party tools.
+For SQL Server, management is much as you would normally perform for any SQL Server system. You have the same type of services, surface points, security areas and other control vectors as in a stand-alone installation of SQL Server. The tools you have available for managing the Master Instance in the BDC are the same as managing a stand-alone installation, including SQL Server Management Studio, command-line interfaces, Azure Data Studio, and third party tools.
For the cluster components, you have three primary interfaces to use, which you will review next.