diff --git a/.DS_Store b/.DS_Store index d558d5d..9af338e 100644 Binary files a/.DS_Store and b/.DS_Store differ diff --git a/PythonForDataProfessionals/Python for Data Professionals/03 Working with Data.md b/PythonForDataProfessionals/Python for Data Professionals/03 Working with Data.md index 35d28d4..0805863 100644 --- a/PythonForDataProfessionals/Python for Data Professionals/03 Working with Data.md +++ b/PythonForDataProfessionals/Python for Data Professionals/03 Working with Data.md @@ -246,7 +246,7 @@ Open the **03_WorkingWithData.py** file and enter the code you find for section Python has many ways to read data in (*sometimes into memory, sometimes streaming as it reads it*) built right in to the standard libraries. Other Libraries, such as Pandas and NumPy, have their own way of reading in data. -In any case, the data is assigned to a data family or *structure*, which you learned about earlier. Depending on which Library you are using, you'll pick a data structure that makes the most sense for how you want to work with it. For instance, Pandas uses a dataframe as the primary data structure it works with. This is why it's important to know the data types, so that you understand what stucture you need to perform your desired operations. +In any case, the data is assigned to a data family or *structure*, which you learned about earlier. Depending on which Library you are using, you'll pick a data structure that makes the most sense for how you want to work with it. For instance, Pandas uses a dataframe as the primary data structure it works with. This is why it's important to know the data types, so that you understand what structure you need to perform your desired operations.

Reading from Files

@@ -465,7 +465,7 @@ Read the [Documentation Reference here](https://docs.microsoft.com/en-us/azure/m Read the [Documentation Reference here](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle-data) -The Data Aquisition and Understanding phase of the TDSP you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, you’ll need to examine it to see what it holds. All data needs cleaning, so after the inspection phase, you’ll replace missing values, add and change columns. You’ve already seen the Libraries you'll need to work with for Data Wrangling - Pandas being the most common in use. +In the Data Acquisition and Understanding phase of the TDSP, you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, you’ll need to examine it to see what it holds. All data needs cleaning, so after the inspection phase, you’ll replace missing values, add and change columns. You’ve already seen the Libraries you'll need to work with for Data Wrangling - Pandas being the most common in use.

Phase Three - Modeling

diff --git a/README.md b/README.md index 47c6135..8a0a783 100644 --- a/README.md +++ b/README.md @@ -61,7 +61,7 @@ The entire repository can be [downloaded as a single ZIP file here](https://gith ### Clone all Workshops using git -You can [clone the entire respository using `git` here](https://github.com/Microsoft/sqlworkshops.git). +You can [clone the entire repository using `git` here](https://github.com/Microsoft/sqlworkshops.git). ### Get only one Workshop You can follow the steps below to clone individual files from a git repo using a git client. diff --git a/SQLGroundToCloud/.DS_Store b/SQLGroundToCloud/.DS_Store new file mode 100644 index 0000000..e04dcdf Binary files /dev/null and b/SQLGroundToCloud/.DS_Store differ diff --git a/k8stobdc/.DS_Store b/k8stobdc/.DS_Store index 28e698e..83dcba1 100644 Binary files a/k8stobdc/.DS_Store and b/k8stobdc/.DS_Store differ diff --git a/k8stobdc/KubernetesToBDC/.DS_Store b/k8stobdc/KubernetesToBDC/.DS_Store index 6349bf2..b248eed 100644 Binary files a/k8stobdc/KubernetesToBDC/.DS_Store and b/k8stobdc/KubernetesToBDC/.DS_Store differ diff --git a/k8stobdc/KubernetesToBDC/00-prerequisites.md b/k8stobdc/KubernetesToBDC/00-prerequisites.md index e1349a3..b83eff4 100644 --- a/k8stobdc/KubernetesToBDC/00-prerequisites.md +++ b/k8stobdc/KubernetesToBDC/00-prerequisites.md @@ -1,4 +1,4 @@ -![](../graphics/microsoftlogo.png) +![](https://github.com/microsoft/sqlworkshops/blob/master/graphics/microsoftlogo.png?raw=true) # Workshop: @@ -6,7 +6,7 @@

-

00 prerequisites

+

00 prerequisites

This workshop is taught using the following components, which you will install and configure in the sections that follow. @@ -26,37 +26,37 @@ The other requirements are: *Note that all following activities must be completed prior to class - there will not be time to perform these operations during the workshop.* -

Activity 1: Set up a Microsoft Azure Account

+

Activity 1: Set up a Microsoft Azure Account

You have multiple options for setting up a Microsoft Azure account to complete this workshop. You can use a Microsoft Developer Network (MSDN) account, a personal or corporate account, or in some cases a pass may be provided by the instructor. (Note: for most classes, the MSDN account is best.) **If you are attending this course in-person:** Unless you are explicitly told you will be provided an account by the instructor in the invitation to this workshop, you must have your Microsoft Azure account and Data Science Virtual Machine set up before you arrive at class. There will NOT be time to configure these resources during the course. -

Option 1 - Microsoft Developer Network Account (MSDN) Account

+

Option 1 - Microsoft Developer Network Account (MSDN) Account

The best way to take this workshop is to use your [Microsoft Developer Network (MSDN) benefits if you have a subscription](https://marketplace.visualstudio.com/subscriptions). - [Open this resource and click the "Activate your monthly Azure credit" button](https://azure.microsoft.com/en-us/pricing/member-offers/credit-for-visual-studio-subscribers/) -

Option 2 - Use Your Own Account

+

Option 2 - Use Your Own Account

You can also use your own account or one provided to you by your organization, but you must be able to create a resource group and create, start, and manage a Virtual Machine and an Azure AKS cluster. -

Option 3 - Use an account provided by your instructor

+

Option 3 - Use an account provided by your instructor

Your workshop invitation may have stated that a Microsoft Azure account will be provided for you to use. If so, you will receive those instructions with the invitation. **Unless you received explicit instructions in your workshop invitation, you must create either an MSDN or Personal account. You must have an account prior to the workshop.** -

Activity 2: Prepare Your Workstation

+

Activity 2: Prepare Your Workstation


The instructions that follow are the same for either a "bare metal" workstation or laptop, or a Virtual Machine. It's best to have at least 4GB of RAM on the management system, and these instructions assume that you are not planning to run the database server or any Containers on the workstation. It's also assumed that you are using a current version of Windows, either desktop or server.
*(You can copy and paste all of the commands that follow in a PowerShell window that you run as the system Administrator)* -

Updates

+

Updates

First, ensure all of your updates are current. You can use the following commands to do that in an Administrator-level PowerShell session: @@ -73,12 +73,12 @@ Install-WindowsUpdate *Note: If you get an error during this update process, evaluate it to see if it is fatal. You may receive certain driver errors if you are using a Virtual Machine; these can be safely ignored.* -

Install Big Data Cluster Tools

+

Install Big Data Cluster Tools

Next, install the tools to work with Big Data Clusters: -

Activity 3: Install BDC Tools

+

Activity 3: Install BDC Tools

Open this resource, and follow all instructions for the Microsoft Windows operating system @@ -87,7 +87,7 @@ Open this resource, and follow all instructions for the Microsoft Windows operat - [https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sql-server-ver15](https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sql-server-ver15) -

Activity 4: Re-Update Your Workstation

+

Activity 4: Re-Update Your Workstation

Once again, download the MSI and run it from there. It's always a good idea after this many installations to run Windows Update again: @@ -101,11 +101,11 @@ Install-WindowsUpdate **Note 2: If you are using a Virtual Machine in Azure, power off the Virtual Machine using the Azure Portal every time you are done with it. Shutting Windows down from inside the VM only stops the operating system; you are still charged for the VM unless you stop (deallocate) it from the Portal. Stop the VM from the Portal unless you are actively using it.** -

For Further Study

+

For Further Study

-

Next Steps

+

Next Steps

Next, Continue to Module 1 - Introduction. diff --git a/k8stobdc/KubernetesToBDC/01-introduction.md b/k8stobdc/KubernetesToBDC/01-introduction.md index 1f0bd7e..b87ee8a 100644 --- a/k8stobdc/KubernetesToBDC/01-introduction.md +++ b/k8stobdc/KubernetesToBDC/01-introduction.md @@ -14,7 +14,7 @@ This module covers Container technologies and how they are different than Virtua

-

Activity: Install Class Environment on AKS (Optional)

+

Activity: Install Class Environment on AKS (Optional)

*(If you are taking this course on-line and not with an instructor-provided Kubernetes environment, you can use a Microsoft Azure subscription to deploy a Kubernetes Environment, complete with the SQL Server big data clusters feature. Your instructor may also have you use this deployment mechanism if in-class hardware is not practical or available)* @@ -26,15 +26,15 @@ Using the following steps, you will create a Resource Group in Azure that will h

Steps

-

Ensure that you have completed all prerequisites.

+

Ensure that you have completed all prerequisites.

-

Read the following article to install the big data cluster Tools, ensuring that you carefully follow each step. Note that if you followed the pre-requisites properly, you will already have Python, kubectl, and Azure Data Studio installed, so those may be skipped. Follow all other instructions.

+

Read the following article to install the big data cluster Tools, ensuring that you carefully follow each step. Note that if you followed the pre-requisites properly, you will already have Python, kubectl, and Azure Data Studio installed, so those may be skipped. Follow all other instructions.

-

Read the following article to deploy the bdc to AKS, ensuring that you carefully follow each step. Stop at the section marked Connect to the cluster.

+

Read the following article to deploy the bdc to AKS, ensuring that you carefully follow each step. Stop at the section marked Connect to the cluster.
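If it helps to see the overall shape of that deployment before you open the article, here is a minimal Azure CLI sketch. It is not a substitute for the article's steps; the resource group name, cluster name, region, node count, and VM size below are illustrative assumptions only.

```bash
# Illustrative sketch - follow the linked article for the authoritative steps.
# All names, the region, and the VM size are placeholder assumptions.
az login
az group create --name bdc-workshop-rg --location eastus
az aks create --resource-group bdc-workshop-rg --name bdc-aks \
  --node-count 3 --node-vm-size Standard_E8s_v3 --generate-ssh-keys
# Merge the cluster credentials into your local kubeconfig, then confirm the Nodes are ready.
az aks get-credentials --resource-group bdc-workshop-rg --name bdc-aks
kubectl get nodes
```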

-

1.1 Big Data Technologies: Operating Systems

+

1.1 Big Data Technologies: Operating Systems

In this section you will learn more about the design of the primary operating system (Linux) used with a Kubernetes Cluster. @@ -135,21 +135,21 @@ The essential commands you should know for this workshop are below. In Linux you A longer explanation of system administration for Linux is here. -

Activity: Work with Linux Commands

+

Activity: Work with Linux Commands

Steps

-

Open this link to run a Linux Emulator in a browser

-

Find the mounted file systems, and then show the free space in them.

-

Show the current directory.

-

Show the files in the current directory.

-

Create a new directory, navigate to it, and create a file called test.txt with the words This is a test in it. (hint: us the nano editor or the echo command)

-

Display the contents of that file.

-

Show the help for the cat command.

+

Open this link to run a Linux Emulator in a browser

+

Find the mounted file systems, and then show the free space in them.

+

Show the current directory.

+

Show the files in the current directory.

+

Create a new directory, navigate to it, and create a file called test.txt with the words This is a test in it. (hint: use the nano editor or the echo command)

+

Display the contents of that file.

+

Show the help for the cat command.
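One possible way to work through these steps is sketched below; any standard bash shell (including the browser emulator) should behave similarly, and the exact output will vary by system.

```bash
df -h                                # mounted file systems and their free space
pwd                                  # show the current directory
ls -l                                # show the files in the current directory
mkdir mytest && cd mytest            # create a new directory and navigate to it
echo "This is a test" > test.txt     # create the file (or use: nano test.txt)
cat test.txt                         # display the contents of the file
man cat                              # help for the cat command (cat --help also works)
```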

-

1.2 Big Data Technologies: Containers and Controllers

+

1.2 Big Data Technologies: Containers and Controllers

Bare-metal installations of an operating system such as Windows are deployed on hardware using a Kernel, and additional software to bring all of the hardware into a set of calls. @@ -158,7 +158,7 @@ Bare-metal installations of an operating system such as Windows are deployed on One abstraction layer above installing software directly on hardware is using a Hypervisor. In essence, this layer uses the base operating system to emulate hardware. You install an operating system (called a *Guest* OS) on the Hypervisor (called the *Host*), and the Guest OS acts as if it is on bare-metal.
- +
In this abstraction level, you have full control (and responsibility) for the entire operating system, but not the hardware. This isolates all process space and provides an entire "Virtual Machine" to applications. For scale-out systems, a Virtual Machine allows for a distribution and control of complete computer environments using only software. @@ -174,7 +174,7 @@ A Container is provided by the Container Runtime (Such as [containerd](https://c (NOTE: The Container Image Kernel can run on Windows or Linux, but you will focus on the Linux Kernel Containers in this workshop.)
- +
This abstraction holds everything for an application to isolate it from other running processes. It is also completely portable - you can create an image on one system, and another system can run it so long as the Container Runtime (such as Docker) is installed. Containers also start very quickly and are easy to create (called Composing) using a simple text file with instructions of what to install on the image. The instructions pull the base Kernel, and then any binaries you want to install. Several pre-built Containers are already available; SQL Server is one of these. You can read more about installing SQL Server on Container Runtimes (Such as Docker) here. @@ -198,34 +198,36 @@ For Big Data systems, having lots of Containers is very advantageous to segment
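As a hedged illustration of how little is involved in running that pre-built SQL Server Container, the commands below follow the pattern from the SQL Server on containers documentation; the password and image tag are placeholders, so check the current documentation for supported tags and options.

```bash
# Placeholder password and tag - consult the SQL Server container docs for current values.
docker pull mcr.microsoft.com/mssql/server:2019-latest
docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=YourStrong!Passw0rd" \
  -p 1433:1433 --name sql1 -d mcr.microsoft.com/mssql/server:2019-latest
docker ps                            # confirm the Container is running
```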
-

+


You can learn much more about Container Orchestration systems here. We're using the Azure Kubernetes Service (AKS) in this workshop, and they have a great set of tutorials for you to learn more here. -In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is reponsible for building and configurint the Nodes, assigns Pods to Nodes,creates and manages the Persistent Voumes (durable storage), and manages the operation of the Cluster. +In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is responsible for building and configuring the Nodes, assigns Pods to Nodes, creates and manages the Persistent Volumes (durable storage), and manages the operation of the Cluster. + +> NOTE: The OpenShift Container Platform is a commercially supported Platform as a Service (PaaS) based on Kubernetes from RedHat. Many shops require a commercial vendor to implement and support Kubernetes. (You'll cover the storage aspects of Container Orchestration in more detail in a moment.) -

Activity: Familiarize Yourself with Container Orchestration using minikube

+

Activity: Familiarize Yourself with Container Orchestration using minikube

To practice with Kubernetes, you will use an online emulator to work with the `minikube` platform.

Steps

-

Open this resource, and complete the first module. (You can return to it later to complete all exercises if you wish)
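A few `kubectl` commands worth trying once the tutorial cluster is up are sketched below; the Pod and Deployment names depend on what the tutorial has you create, so treat any names here as placeholders.

```bash
kubectl get nodes                    # the machines (or minikube VM) that form the cluster
kubectl get pods --all-namespaces    # every Pod the orchestrator is currently managing
kubectl describe pod <pod-name>      # detailed state and events for one Pod (placeholder name)
kubectl get services                 # how the running workloads are exposed on the network
```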



-

1.3 Big Data Technologies: Distributed Data Storage

+

1.3 Big Data Technologies: Distributed Data Storage

Traditional storage uses a call from the operating system to an underlying I/O system, as you learned earlier. These file systems are either directly connected to the operating system or appear to be connected directly using a Storage Area Network. The blocks of data are stored and managed by the operating system. For large scale-out data systems, the mounting point for an I/O is another abstraction. For SQL Server BDC, the most commonly used scale-out file system is the Hadoop Data File System, or HDFS. HDFS is a set of Java code that gathers disparate disk subsystems into a Cluster which is comprised of various Nodes - a NameNode, which manages the cluster's metadata, and DataNodes that physically store the data. Files and directories are represented on the NameNode by a structure called inodes. Inodes record attributes such as permissions, modification and access times, and namespace and diskspace quotas. -
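To make the NameNode/DataNode description concrete, here is a hedged sketch of the classic HDFS client commands; in a big data cluster you would more often browse HDFS from Azure Data Studio, and the paths shown are placeholders.

```bash
hdfs dfs -ls /                           # list the root of the distributed file system
hdfs dfs -mkdir -p /data/sample          # create a directory (metadata recorded by the NameNode)
hdfs dfs -put sample.csv /data/sample/   # copy a local file in; its blocks are stored on DataNodes
hdfs dfs -df -h                          # capacity and free space across the cluster
```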

+

With an abstraction such as Containers, storage becomes an issue for two reasons: The storage can disappear when the Container is removed, and other Containers and technologies can't access storage easily within a Container. @@ -239,7 +241,7 @@ You


-

1.4 Big Data Technologies: Command and Control

+

1.4 Big Data Technologies: Command and Control

There are three primary tools and utilities you will use to control the SQL Server big data cluster: @@ -267,28 +269,28 @@ You can learn more about Azure Data Studio here.
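A hedged sketch of what the command-line portion of that control surface looks like is below; the namespace name is a placeholder, and `azdata` options have changed between SQL Server 2019 releases, so verify the syntax against the current documentation.

```bash
kubectl get pods -n mssql-cluster    # Kubernetes view of the BDC Pods (namespace is a placeholder)
azdata login                         # prompts for the cluster namespace, user name, and password
azdata bdc status show               # health of the BDC services
azdata bdc endpoint list -o table    # external endpoints (SQL master, HDFS/Spark gateway, controller)
```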
-

+


You'll explore further operations with the Azure Data Studio in the final module of this course.
-

Activity: Practice with Notebooks

+

Activity: Practice with Notebooks

Steps

-

Open this reference, and review the instructions you see there. You can clone this Notebook to work with it later.

+

Open this reference, and review the instructions you see there. You can clone this Notebook to work with it later.


-

Activity: Azure Data Studio Notebooks Overview

+

Activity: Azure Data Studio Notebooks Overview

Steps

-

Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permist.

+

Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permits.



-

For Further Study

+

For Further Study


-

Next Steps

+

Next Steps

Next, Continue to 02 - Hardware and Virtualization environment for Kubernetes . diff --git a/k8stobdc/KubernetesToBDC/03-kubernetes.md b/k8stobdc/KubernetesToBDC/03-kubernetes.md index 94782f8..f364367 100644 --- a/k8stobdc/KubernetesToBDC/03-kubernetes.md +++ b/k8stobdc/KubernetesToBDC/03-kubernetes.md @@ -148,7 +148,7 @@ We'll begin with a set of definitions. These aren't all the terms used in Kubern etcd - A high performance key value store that stores the cluster’s state. Since etcd is light-weight, each instance can generally share resources with other Nodes in the cluster. The Hardware recommendations section of the official etcd.io site provides a detailed breakdown of the hardware requirement for etcd. + A high performance key value store that stores the cluster’s state. Since etcd is light-weight, each instance can generally share resources with other Nodes in the cluster. The Hardware recommendations section of the official http://etcd.io site provides a detailed breakdown of the hardware requirement for etcd. @@ -246,6 +246,7 @@ Luckily help is at hand in the form of a tool that leverages kubeadm in order to - Add nodes to existing clusters + Kubespray is a Cloud Native Computing Foundation project and with its own [GitHub repository](https://github.com/kubernetes-sigs/kubespray). ### 3.2.5 What Is Ansible? ### @@ -268,6 +269,9 @@ In order to carry out the deployment of the Kubernetes cluster, a basic understa ### 3.2.7 Kubespray Workflow ### +Unlike other available deployment tools, Kubespray does everything for you in “One shot”. For example, Kubeadm requires that certificates on nodes are created manually, Kubespray not only leverages Kubeadm but it also looks after everything including certificate creation for you. Kubespray works against most of the popular public cloud providers and has been tested for the deployment of clusters with thousands of nodes. The real elegance of Kubespray is the reuse it promotes. If an organization has a requirement to deploy multiple clusters, once Kubespray is setup, for every new cluster that needs to be created, the only prerequisite is to create a new inventory file for the nodes the new cluster will use. +3.2.5 High Level Kubespray Workflow + The deployment of a Kubernetes cluster via Kubespray follows this workflow: - Preinstall step @@ -299,7 +303,6 @@ Note: ### 3.2.8 Requirements ### -Refer to the [requirements](https://github.com/kubernetes-sigs/kubespray#requirements) section in the Kubespray GitHub repo. 
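To give a feel for the "one shot" workflow described in that section, the commands below follow the pattern in the Kubespray repository's README at the time of writing; the node IP addresses and inventory name are placeholder assumptions, and the repository itself remains the authoritative reference.

```bash
git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
pip3 install -r requirements.txt                  # Ansible and the other Python dependencies
cp -rfp inventory/sample inventory/mycluster      # start a new inventory for this cluster
declare -a IPS=(10.10.1.3 10.10.1.4 10.10.1.5)    # placeholder node addresses
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
```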
### 3.2.9 Post Cluster Deployment Activities ### diff --git a/k8stobdc/KubernetesToBDC/04-bdc.md b/k8stobdc/KubernetesToBDC/04-bdc.md index df27fee..221515b 100644 --- a/k8stobdc/KubernetesToBDC/04-bdc.md +++ b/k8stobdc/KubernetesToBDC/04-bdc.md @@ -19,7 +19,7 @@ This module covers Container technologies and how they are different than Virtua SQL Server (starting with version 2019) provides three ways to work with large sets of data: - - **Data Virtualization**: Query multiple sources of data technologies using the Polybase SQL Server feature (data left at source) + - **Data Virtualization**: Query multiple sources of data technologies using the PolyBase SQL Server feature (data left at source) - **Storage Pools**: Create sets of disparate data sources that can be queried from Distributed Data sets (data ingested into sharded databases using PolyBase) - **SQL Server Big Data Clusters**: Create, manage and control clusters of SQL Server Instances that co-exist in a Kubernetes cluster with Apache Spark and other technologies to access and process large sets of data (Data left in place, ingested through PolyBase, and into/through HDFS) @@ -45,12 +45,12 @@ To leverage PolyBase, you first define the external table using a specific set o
-

Activity: Review PolyBase Solution

+

Activity: Review PolyBase Solution

In this section you will review a solution tutorial similar to one you will perform later. You'll see how to create a reference to an HDFS file store and query it within SQL Server as if it were a standard internal table.
-

Open this reference and locate numbers 4-5 of the steps in the tutorial. This explains the two steps required to create and query an External table. *Only review this information; you will perform these steps in another Module*.

+

Open this reference and locate numbers 4-5 of the steps in the tutorial. This explains the two steps required to create and query an External table. *Only review this information; you will perform these steps in another Module*.


@@ -148,7 +148,7 @@ These components are used in the Compute Pool of the BDC:

BDC: App Pool

-The App Pool is a set of Pods within a Node that hold multiple types of end-points into the system. SQL Server Integration Services lives in the App Pool, and other Job systems are possible. You could instatiate a long-running job (such as IoT streaming) or Machine Learning (ML) endpoints used for scoring a prediction or returning a classification. +The App Pool is a set of Pods within a Node that hold multiple types of end-points into the system. SQL Server Integration Services lives in the App Pool, and other Job systems are possible. You could instantiate a long-running job (such as IoT streaming) or Machine Learning (ML) endpoints used for scoring a prediction or returning a classification. These components are used in the Compute Pool of the BDC: @@ -195,18 +195,18 @@ These components are used in the Storage Pool of the BDC:
-

Activity: Review Data Pool Solution

+

Activity: Review Data Pool Solution

In this section you will review the solution tutorial similar to the one you will perform in a future step. You'll see how to load data into the Data Pool.
-

Open this reference and review the steps in the tutorial. This explains the two steps required to create and load an External table in the Data Pool. You'll perform these steps in the Operationalization Module later. *Only review this information at this time. You will perform these steps in another Module.

+

Open this reference and review the steps in the tutorial. This explains the two steps required to create and load an External table in the Data Pool. You'll perform these steps in the Operationalization Module later. *Only review this information at this time. You will perform these steps in another Module.*



-

For Further Study

+

For Further Study