This commit is contained in:
Anna Hoffman 2020-02-06 07:56:54 -08:00
Parent 80ab69d159 4b4ade0a57
Commit 424998262f
95 changed files with 1381 additions and 277 deletions

Binary data
.DS_Store vendored

Binary file not shown.


@ -1,12 +1,12 @@
![](../graphics/microsoftlogo.png)
# The Azure SQL Workshop
# Module 4 - Performance
#### <i>A Microsoft workshop from the SQL team</i>
#### <i>The Azure SQL Workshop</i>
<p style="border-bottom: 1px solid lightgrey;"></p>
<img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/textbubble.png"> <h2>04 - Performance</h2>
<img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/textbubble.png?raw=true"> <h2>Overview</h2>
> You must complete the [prerequisites](../azuresqlworkshop/00-Prerequisites.md) before completing these activities. You can also choose to audit the materials if you cannot complete the prerequisites. If you were provided an environment to use for the workshop, then you **do not need** to complete the prerequisites.
@ -17,54 +17,504 @@ In each module you'll get more references, which you should follow up on to lear
(<a href="https://github.com/microsoft/sqlworkshops/blob/master/AzureSQLWorkshop/azuresqlworkshop/00-Prerequisites.md" target="_blank">Make sure you check out the <b>Prerequisites</b> page before you start</a>. You'll need all of the items loaded there before you can proceed with the workshop.)
In this module, you'll cover these topics:
[4.1](#4.1): TODO
[4.2](#4.2): TODO
[4.3](#4.3): TODO
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Activity 1](#1): TODO
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Activity 2](#2): TODO
[4.1](#4.1): Azure SQL performance capabilities and Tasks<br>
[4.2](#4.2): Monitoring performance in Azure SQL<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Activity 1](#1): How to monitor performance in Azure SQL Database
[4.3](#4.3): Improving Performance in Azure SQL<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Activity 2](#2): Scaling your workload performance in Azure SQL Database<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[Activity 3 (BONUS)](#2): Optimizing performance for index maintenance.
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="4.1">4.1 TODO: Topic Name</h2></a>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="4.1">4.1 Azure SQL performance capabilities and Tasks</h2></a>
TODO: Topic Description
In this section you will learn how to monitor the performance of a SQL workload using tools and techniques familiar to the SQL Server professional, and how these differ in Azure SQL.
<br>
**Azure SQL Performance Capabilities**
<img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="linkToPictureEndingIn.png">
**Monitoring and Troubleshooting Performance**
<br>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><a name="1"><b>Activity 1</a>: TODO: Activity Name</b></p>
TODO: Activity Description and tasks
<p><img style="margin: 0px 15px 15px 0px;" src="../graphics/checkmark.png"><b>Description</b></p>
TODO: Enter activity description with checkbox
<p><img style="margin: 0px 15px 15px 0px;" src="../graphics/checkmark.png"><b>Steps</b></p>
TODO: Enter activity steps description with checkbox
**Accelerating and Improving Performance**
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="4.2">4.2 TODO: Topic Name</h2></a>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="4.2">4.2 Monitoring performance in Azure SQL</h2></a>
TODO: Topic Description
In this section you will learn how to monitor the performance of a SQL workload using tools and techniques familiar to the SQL Server professional, and how these differ in Azure SQL.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><a name="1"><b>Activity 2</a>: TODO: Activity Name</b></p>
**Monitoring SQL queries**
TODO: Activity Description and tasks
- DMVs
- Extended Events
- Azure Portal
<p><img style="margin: 0px 15px 15px 0px;" src="../graphics/checkmark.png"><b>Description</b></p>
**Monitoring CPU usage**
TODO: Enter activity description with checkbox
- DMVs
- Azure Portal
- Query Store
<p><img style="margin: 0px 15px 15px 0px;" src="../graphics/checkmark.png"><b>Steps</b></p>
**Monitoring Waits**
TODO: Enter activity steps description with checkbox
- DMVs
sys.dm_exec_requests can be used to see wait types, duration, and wait resources for any active request. This DMV also works across Azure SQL. Some wait types are unique to Azure SQL and can be found at XXXXXX...
Some of the more common wait types new to Azure SQL are:
XXXX
XXXX
XXXX
SQL Server supports **sys.dm_os_wait_stats**. Azure SQL Database supports a database-specific DMV for this called **sys.dm_db_wait_stats**. Either sys.dm_os_wait_stats or sys.dm_db_wait_stats can be used with Azure SQL Database Managed Instance.
- Query Store
- Azure Portal
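As a minimal sketch of the DMV approach to monitoring waits, the following query lists the top waits recorded for the current database. It assumes you are connected to an Azure SQL Database, where **sys.dm_db_wait_stats** is available. The `TOP (10)` cutoff and the `avg_wait_ms` column alias are illustrative choices and are not part of the workshop scripts.

```sql
-- Top 10 wait types for the current database, ordered by total wait time.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    -- Average wait per waiting task; NULLIF avoids a divide-by-zero error
    wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms DESC;
```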
**Monitoring Memory**
**Monitoring Transaction Log Usage**
**Monitoring I/O**
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><a name="1"><b>Activity 1</a>: How to monitor performance in Azure SQL Database</b></p>
>**IMPORTANT**: This activity assumes you have completed all the activities in Module 2.
All scripts for this activity can be found in the **azuresqlworkshop\04-Performance\monitor_and_scale** folder.
>**NOTE:** This activity will work against an Azure SQL Database Managed Instance. However, you may need to make some changes to the scripts to increase the workload since the minimum number of vCores for Managed Instance General Purpose is 4 vCores.
In this activity, you will take a typical workload based on SQL queries and learn how to monitor performance for Azure SQL Database. You will learn how to identify a potential performance bottleneck using tools and techniques familiar from SQL Server. You will also learn how performance monitoring differs in Azure SQL Database.
Using the Azure SQL Database based on the AdventureWorksLT sample, you are given an example workload and need to observe its performance. You are told there appears to be a performance bottleneck. Your goal is to identify the possible bottleneck and identify solutions.
>**NOTE**: These scripts use the database name **AdventureWorks0406**. Anywhere this database name is used you should substitute in the name of the database you deployed in Module 2.
**Step 1: Setup to monitor Azure SQL Database**
>**TIP**: To open a script file in the context of a database in SSMS, click on the database in Object Explorer and then use the File/Open menu in SSMS.
- Launch SQL Server Management Studio (SSMS) and load a query *in the context of the database you deployed in Module 2* to monitor the Dynamic Management View (DMV) **sys.dm_exec_requests** from the script **sqlrequests.sql** which looks like the following:
```sql
SELECT er.session_id, er.status, er.command, er.wait_type, er.last_wait_type, er.wait_resource, er.wait_time
FROM sys.dm_exec_requests er
INNER JOIN sys.dm_exec_sessions es
ON er.session_id = es.session_id
AND es.is_user_process = 1
```
Unlike in SQL Server, where dm_exec_requests covers the entire server, in Azure SQL Database this familiar DMV shows active requests only for a specific database. Azure SQL Database Managed Instance behaves just like SQL Server.
In another SSMS session *in the context of the database you deployed in Module 2*, load a query to monitor a Dynamic Management View (DMV) unique to Azure SQL Database called **sys.dm_db_resource_stats** from a script called **azuresqlresourcestats.sql**:
```sql
SELECT * FROM sys.dm_db_resource_stats
```
This DMV will track overall resource usage of your workload against Azure SQL Database such as CPU, I/O, and memory.
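If the full `SELECT *` output is hard to scan, a narrower variant such as the following sketch may help. It selects only the CPU, I/O, and memory columns and returns the newest snapshot first. The column selection and ordering are a suggestion and are not part of **azuresqlresourcestats.sql**.

```sql
-- Most recent resource usage snapshots first, limited to the
-- CPU, data I/O, log write, and memory percentages.
SELECT end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       avg_memory_usage_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
```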
**Step 2: Run the workload and observe performance**
- Examine the workload query from the script **topcustomersales.sql**.
This database is not large, so the query to retrieve customers and their associated sales information, ordered by customers with the most sales, shouldn't generate a large result set. It is possible to tune this query by reducing the number of columns in the result set, but these are needed for the demonstration purposes of this activity.
```sql
SELECT c.*, soh.OrderDate, soh.DueDate, soh.ShipDate, soh.Status, soh.ShipToAddressID, soh.BillToAddressID, soh.ShipMethod, soh.TotalDue, soh.Comment, sod.*
FROM SalesLT.Customer c
INNER JOIN SalesLT.SalesOrderHeader soh
ON c.CustomerID = soh.CustomerID
INNER JOIN SalesLT.SalesOrderDetail sod
ON soh.SalesOrderID = sod.SalesOrderID
ORDER BY sod.LineTotal desc
GO
```
- Run the workload from the command line using ostress.
Edit the script that runs ostress, **sqlworkload.cmd**:<br><br>
Substitute your Azure SQL Database Server created in Module 2 for the **-S parameture**<br>
Substitute the login name created for the Azure SQL Database Server created in Module 2 for the **-U parameter**<br>
Substitute the database you deployed in Module 2 for the **-d parameter**<br>
Substitute the password for the login for the Azure SQL Database Server created in Module 2 for the **-P parameter**<br>
This script will use 10 concurrent users running the workload query 1500 times.
>**NOTE:** If you are not seeing CPU usage behavior with this workload for your environment you can adjust the **-n parameter** for number of users and **-r parameter** for iterations.
From a PowerShell command prompt, change to the directory for this module activity:
[vmusername] is the name of the user in your Windows Virtual Machine. Substitute in the path for c:\users\[vmusername] where you have cloned the GitHub repo.
<pre>
cd c:\users\[vmusername]\AzureSQLWorkshop\azuresqlworkshop\04-Performance\monitor_and_scale
</pre>
Run the workload with the following command
```Powershell
.\sqlworkload.cmd
```
Your screen at the command prompt should look similar to the following
<pre>[datetime] [ostress PID] Max threads setting: 10000
[datetime] [ostress PID] Arguments:
[datetime] [ostress PID] -S[server].database.windows.net
[datetime] [ostress PID] -isqlquery.sql
[datetime] [ostress PID] -U[user]
[datetime] [ostress PID] -dAdventureWorks0406
[datetime] [ostress PID] -P********
[datetime] [ostress PID] -n10
[datetime] [ostress PID] -r1500
[datetime] [ostress PID] -q
[datetime] [ostress PID] Using language id (LCID): 1024 [English_United States.1252] for character formatting with NLS: 0x0006020F and Defined: 0x0006020F
[datetime] [ostress PID] Default driver: SQL Server Native Client 11.0
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery.out]
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery_1.out]
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery_2.out]
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery_3.out]
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery_4.out]
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery_5.out]
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery_6.out]
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery_7.out]
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery_8.out]
[datetime] [ostress PID] Attempting DOD5015 removal of [directory]\sqlquery_9.out]
[datetime] [ostress PID] Starting query execution...
[datetime] [ostress PID] BETA: Custom CLR Expression support enabled.
[datetime] [ostress PID] Creating 10 thread(s) to process queries
[datetime] [ostress PID] Worker threads created, beginning execution...</pre>
- Use the query in SSMS to monitor dm_exec_requests (**sqlrequests.sql**) to observe active requests. Run this query 5 or 6 times and observe some of the results.
You should see that many of the requests have a status = RUNNABLE and last_wait_type = SOS_SCHEDULER_YIELD. Many RUNNABLE requests combined with frequent SOS_SCHEDULER_YIELD waits are an indicator of a possible lack of CPU resources for active queries.
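Rather than scanning individual rows, you could summarize the active requests by status and last wait type. The query below is a sketch of that idea built on the same DMVs as **sqlrequests.sql**. The grouping and the `request_count` alias are assumptions added for illustration.

```sql
-- Count active user requests by status and last wait type.
-- A large count of runnable requests with SOS_SCHEDULER_YIELD
-- points toward CPU pressure.
SELECT er.status,
       er.last_wait_type,
       COUNT(*) AS request_count
FROM sys.dm_exec_requests er
INNER JOIN sys.dm_exec_sessions es
    ON er.session_id = es.session_id
    AND es.is_user_process = 1
GROUP BY er.status, er.last_wait_type
ORDER BY request_count DESC;
```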
>**NOTE:** You may see one or more active requests with a command = SELECT and a wait_type = XE_LIVE_TARGET_TVF. These are queries run by services managed by Microsoft to help power capabilities like Performance Insights using Extended Events. Microsoft does not publish the details of these Extended Event sessions.
The familiar SQL DMV dm_exec_requests can be used with Azure SQL Database, but it must be run in the context of a database. This differs from SQL Server (or Azure SQL Database Managed Instance), where dm_exec_requests shows all active requests across the server instance.
- Run the query in SSMS to monitor dm_db_resource_stats (**azuresqlresourcestats.sql**). Run the query to see the results of this DMV 3 or 4 times.
This DMV records a snapshot of resource usage for the database every 15 seconds (kept for 1 hour). You should see the column **avg_cpu_percent** close to 100% (at least in the high 90% range) for several of the snapshots. This is a symptom of a workload pushing the limits of CPU resources for the database. You can read more details about this DMV at https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-db-resource-stats-azure-sql-database?view=azuresqldb-current. This DMV also works with Azure SQL Database Managed Instance.
For a SQL Server on-premises environment you would typically use a tool specific to the operating system, like Windows Performance Monitor, to track overall resource usage such as CPU. If you ran this example on an on-premises SQL Server or SQL Server in a Virtual Machine with 2 CPUs, you would see near 100% CPU utilization on the server.
>**NOTE**: Another DMV, **sys.resource_stats**, can be run in the context of the master database of the Azure SQL Database Server to see resource usage for all Azure SQL Database databases associated with the server. This view is less granular and shows resource usage every 5 minutes (kept for 14 days).
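As a sketch of how **sys.resource_stats** might be queried, the following statement assumes you are connected to the **master** database of the logical server and that your database is named **AdventureWorks0406** (substitute your own database name). The filter and column list are illustrative.

```sql
-- Run in the context of the master database of the logical server.
-- Shows the 5-minute resource usage history for one database.
SELECT database_name,
       start_time,
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent
FROM sys.resource_stats
WHERE database_name = 'AdventureWorks0406'
ORDER BY start_time DESC;
```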
- Let the workload complete and take note of its overall duration. When the workload completes you should see results like the following and a return to the command prompt
<pre>[datetime] [ostress PID] Total IO waits: 0, Total IO wait time: 0 (ms)
[datetime] [ostress PID] OSTRESS exiting normally, elapsed time: 00:01:22.637</pre>
Your duration may vary, but this typically takes at least 1 minute.
**Step 3: Use Query Store to do further performance analysis**
Query Store is a capability in SQL Server to track performance execution of queries. Performance data is stored in the user database. You can read more about Query Store at https://docs.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-ver15.
Query Store is not enabled by default for databases created in SQL Server but is on by default for Azure SQL Database (and Azure SQL Database Managed Instance). You can read more about Query Store and Azure SQL Database at https://docs.microsoft.com/en-us/azure/sql-database/sql-database-operate-query-store.
Query Store comes with a series of system catalog views to view performance data. SQL Server Management Studio (SSMS) provides reports using these system views.
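To give a feel for these catalog views, here is a minimal sketch that approximates a "top resource consuming queries" list by total duration. It is not the query SSMS runs for its reports. The aggregation, the `TOP (5)` cutoff, and the column aliases are assumptions made for illustration. Query Store reports **avg_duration** in microseconds, so the sketch converts the total to milliseconds.

```sql
-- Top 5 queries by total duration recorded in Query Store.
SELECT TOP (5)
    q.query_id,
    qt.query_sql_text,
    SUM(rs.count_executions) AS total_executions,
    -- avg_duration is in microseconds; convert the total to milliseconds
    SUM(rs.avg_duration * rs.count_executions) / 1000.0 AS total_duration_ms
FROM sys.query_store_query_text AS qt
INNER JOIN sys.query_store_query AS q
    ON qt.query_text_id = q.query_text_id
INNER JOIN sys.query_store_plan AS p
    ON q.query_id = p.query_id
INNER JOIN sys.query_store_runtime_stats AS rs
    ON p.plan_id = rs.plan_id
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_duration_ms DESC;
```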
- Look at queries consuming the most resource usage using SSMS.
Using the Object Explorer in SSMS, open the Query Store Folder to find the report for **Top Resource Consuming Queries**<br>
<img src="../graphics/SSMS_QDS_Find_Top_Queries.png" alt="SSMS_QDS_Find_Top_Queries"/>
Select the report to find out which queries have consumed the most resources on average, along with execution details for those queries. Based on the workload run to this point, your report should look something like the following:<br>
<img src="../graphics/SSMS_QDS_Top_Query_Report.png" alt="SSMS_QDS_Find_Top_Queries"/>
The query shown is the SQL query from the workload for customer sales. This report has 3 components: queries with the highest total duration (you can change the metric), the associated query plan and runtime statistics, and the associated query plan in a visual map.
If you click on the bar chart for the query (the query_id may be different for your system), your results should look like the following:<br>
<img src="../graphics/SSMS_QDS_Query_ID.png" alt="SSMS_QDS_Query_ID"/>
You can see the total duration of the query and query text.
Right of this bar chart is a chart for statistics for the query plan associated with the query. Hover over the dot associated with the plan. Your results should look like the following:<br>
<img src="../graphics/SSMS_Slow_Query_Stats.png" alt="SSMS_Slow_Query_Stats" width=350/>
Note the average duration of the query. Your times may vary but the key will be to compare this average duration to the average wait time for this query and eventually the average duration when we introduce a performance improvement.
The final component is the visual query plan. The query plan for this query looks like the following:<br>
<img src="../graphics/SSMS_Workload_Query_Plan.png" alt="SSMS_Workload_Query_Plan"/>
Given the small nature of rows in the tables in this database, this query plan is not inefficient. There could be some tuning opportunities but not much performance will be gained by tuning the query itself.
- Observe waits to see if they are affecting performance.
We know from earlier diagnostics that a high number of requests were constantly in a RUNNABLE status, along with almost 100% CPU. Query Store comes with reports to look at possible performance bottlenecks due to waits on resources.
Below the Top Resource Consuming Queries report in SSMS is a report called Query Wait Statistics. Click on this report and hover over the bar chart. Your results should look like the following:<br>
<img src="../graphics/SSMS_Top_Wait_Stats.png" alt="SSMS_Top_Wait_Stats"/>
You can see the top wait category is CPU and the average wait time. Furthermore, the top query waiting for CPU is the query from the workload we are using.
Click on the bar chart for CPU to see more about query wait details. Hover over the bar chart for the query. Your results should look like the following:<br>
<img src="../graphics/SSMS_Top_Wait_Stats_Query.png" alt="SSMS_Top_Wait_Stats_Query"/>
Notice that the average wait time for CPU for this query is a high % of the overall average duration for the query.
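The same wait information can be read from the Query Store catalog views directly. The following sketch totals wait time per query and wait category using **sys.query_store_wait_stats**. The grouping, the `TOP (5)` cutoff, and the alias names are assumptions for illustration and do not reproduce the SSMS report exactly.

```sql
-- Total wait time per query and wait category from Query Store.
-- For this workload the CPU category is expected near the top.
SELECT TOP (5)
    p.query_id,
    ws.wait_category_desc,
    SUM(ws.total_query_wait_time_ms) AS total_wait_ms
FROM sys.query_store_wait_stats AS ws
INNER JOIN sys.query_store_plan AS p
    ON ws.plan_id = p.plan_id
GROUP BY p.query_id, ws.wait_category_desc
ORDER BY total_wait_ms DESC;
```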
The DMV **sys.dm_db_wait_stats** will show a high number of SOS_SCHEDULER_YIELD waits with this scenario.
Given the evidence to this point, without any query tuning, our workload requires more CPU capacity than we have deployed for our Azure SQL Database.
**Step 4: Observe performance using the Azure Portal**
The Azure Portal provides performance information in the form of a graph. The standard default view is called **Compute Utilization** which you can see on the Overview blade for your database:<br><br>
<img src="../graphics/Azure_Portal_Compute_Slow_Query.png" alt="Azure_Portal_Compute_Slow_Query"/>
Notice in this example that the compute utilization is near 100% for a recent time range. This chart shows resource usage over the last hour and is refreshed continually. If you click on the chart, you can customize it (for example, as a bar chart) and look at other resource usage.
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="4.3">4.3 Improving Performance in Azure SQL</h2></a>
In this section you will learn how to improve the performance of a SQL workload in Azure SQL using your knowledge of SQL Server and the knowledge gained in Module 4.2.
**SQL Query Tuning**
**Azure SQL Database Auto Tuning**
**Scaling Performance**
Here is a good article to reference: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-monitor-tune-overview#troubleshoot-performance-problems and https://docs.microsoft.com/en-us/azure/sql-database/sql-database-monitor-tune-overview#improve-database-performance-with-more-resources.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><a name="2"><b>Activity 2</a>: Scaling your workload performance in Azure SQL Database</b></p>
>**IMPORTANT**: This activity assumes you have completed all the steps in Activity 1 in Module 4.
In this activity you will take the results of your monitoring in Module 4.2 and learn how to scale your workload in Azure to see improved results.
All scripts for this activity can be found in the **azuresqlworkshop\04-Performance\monitor_and_scale** folder.
**Step 1: Decide options on how to scale performance**
Since the workload is CPU-bound, one way to improve performance is to increase CPU capacity or speed. A SQL Server user would have to move to a different machine or reconfigure a VM to get more CPU capacity. In some cases, even a SQL Server administrator may not have permission to make these scaling changes, or the process could take time.
For Azure, we can use ALTER DATABASE, az cli, or the portal to increase CPU capacity.
Using the Azure Portal, you can see options for how to scale for more CPU resources. Using the Overview blade for the database, select the Pricing tier for the current deployment.<br>
<img src="../graphics/Azure_Portal_Change_Tier.png" alt="Azure_Portal_Change_Tier"/>
Here you can see options for changing or scaling compute resources. For General Purpose, you can easily scale up to something like 8 vCores.<br>
<img src="../graphics/Azure_Portal_Compute_Options.png" alt="Azure_Portal_Compute_Options"/>
Instead of using the portal, you will use a different method to scale your workload.
**Step 2: Increase capacity of your Azure SQL Database**
There are other methods to change the Pricing tier and one of them is with the T-SQL statement ALTER DATABASE.
>**NOTE**: For this demo you must first flush the query store using the following script **flushhquerystore.sql** or T-SQL statement:
```sql
EXEC sp_query_store_flush_db
```
- First, learn how to find out your current Pricing tier using T-SQL. The Pricing tier is also known as a *service objective*. Using SSMS, open the script **get_service_object.sql** or use the following T-SQL statements to find out this information:
```sql
SELECT database_name,slo_name,cpu_limit,max_db_memory, max_db_max_size_in_mb, primary_max_log_rate,primary_group_max_io, volume_local_iops,volume_pfs_iops
FROM sys.dm_user_db_resource_governance;
GO
SELECT DATABASEPROPERTYEX('AdventureWorks0406', 'ServiceObjective');
GO
```
For the current Azure SQL Database deployment, your results should look like the following:<br><br>
<img src="../graphics/service_objective_results.png" alt="service_objective_results"/>
Notice the term **slo_name** is also used for service objective. The term **slo** stands for *service level objective*.
The various slo_name values are not documented but you can see from the string value this database uses a General Purpose SKU with 2 vCores:
>**NOTE:** Testing shows that SQLDB_OP_... is the string used for Business Critical.
The documentation for ALTER DATABASE shows all the possible options for service objectives and how they match to the Azure portal: https://docs.microsoft.com/en-us/sql/t-sql/statements/alter-database-transact-sql?view=sql-server-ver15.
When you view the ALTER DATABASE documentation, notice the ability to click on your target SQL Server deployment to get the right syntax options. Click on SQL Database single database/elastic pool to see the options for Azure SQL Database. To match the compute scale you found in the portal, you need the service objective **'GP_Gen5_8'**.
Using SSMS, run the script modify_service_objective.sql or T-SQL command:
```sql
ALTER DATABASE AdventureWorks0406 MODIFY (SERVICE_OBJECTIVE = 'GP_Gen5_8');
```
This statement comes back immediately, but the scaling of the compute resources takes place in the background. A scale this small should take less than a minute, and for a short period of time the database will be offline to make the change effective. You can monitor the progress of this scaling activity using the Azure Portal.<br>
<img src="../graphics/Azure_Portal_Update_In_Progress.png" alt="Azure_Portal_Update_In_Progress"/>
Another way to monitor the progress of a change to the service objective for Azure SQL Database is to use the DMV **sys.dm_operation_status**. This DMV exposes a history of ALTER DATABASE changes to the service objective and shows the active progress of a change. Here is an example of this DMV's output after executing the above ALTER DATABASE statement:
<pre>
session_activity_id resource_type resource_type_desc major_resource_id minor_resource_id operation state state_desc percent_complete error_code error_desc error_severity error_state start_time last_modify_time
97F9474C-0334-4FC5-BFD5-337CDD1F9A21 0 Database AdventureWorks0406 ALTER DATABASE 1 IN_PROGRESS 0 0 0 0 [datetime] [datetime]</pre>
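A query along the following lines could produce output like the above. This is a sketch that assumes you run it in the context of the **master** database of the logical server and that the database is named **AdventureWorks0406**. The column subset and filter are illustrative.

```sql
-- Run in the context of the master database of the logical server.
-- Shows recent management operations, such as ALTER DATABASE,
-- for one database, newest first.
SELECT major_resource_id,
       operation,
       state_desc,
       percent_complete,
       start_time,
       last_modify_time
FROM sys.dm_operation_status
WHERE major_resource_id = 'AdventureWorks0406'
ORDER BY start_time DESC;
```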
During a change to the service objective, queries are allowed against the database until the final change is implemented, at which point an application cannot connect for a very brief period of time. For Azure SQL Database Managed Instance, a change to the tier (or SKU) allows queries and connections but prevents database operations such as creating new databases. In these cases, the operation fails with the error message "**The operation could not be completed because a service tier change is in progress for managed instance '[server]' Please wait for the operation in progress to complete and try again**".
When this is done, use the queries listed above to verify that the new service objective (pricing tier) of 8 vCores has taken effect.
**Step 3: Run the workload again**
Now that the scaling has completed, we need to see whether the workload duration is shorter and whether waits on CPU resources have decreased.
Run the workload again using the command **sqlworkload.cmd** that you executed in Section 4.2.
**Step 4: Observe new performance of the workload**
- Observe DMV results
Use the same queries from Section 4.2 Activity 1 to observe results from **dm_exec_requests** and **dm_db_resource_stats**.
You will see that more queries have a status of RUNNING (and fewer are RUNNABLE, although some will still appear) and that avg_cpu_percent drops to 40-60%.
- Observe the new workload duration.
The workload duration from **sqlworkload.cmd** should now be much shorter, at around 20 seconds.
- Observe Query Store reports
Using the same techniques as in Section 4.2 Activity 1, look at the **Top Resource Consuming Queries** report from SSMS:<br>
<img src="../graphics/SSMS_QDS_Top_Query_Faster.png" alt="Azure_Portal_Update_In_Progress"/>
You will now see two queries (query_id). These are the same query but show up as different query_id values in Query Store because the scale operation required a restart so the query had to be recompiled. You can see in the report the overall and average duration was significantly less.
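To inspect the query_id values yourself, you could look up the workload query in the Query Store catalog views. The following is a sketch only. The `LIKE` filter on the table name is an assumption chosen to match the workload query text, and the extra columns are included so you can compare the entries.

```sql
-- List Query Store entries whose text references the workload's
-- detail table, so the query_id values can be compared.
SELECT q.query_id,
       q.query_hash,
       q.context_settings_id,
       q.last_execution_time,
       qt.query_sql_text
FROM sys.query_store_query AS q
INNER JOIN sys.query_store_query_text AS qt
    ON q.query_text_id = qt.query_text_id
WHERE qt.query_sql_text LIKE '%SalesLT.SalesOrderDetail%'
ORDER BY q.query_id;
```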
Look also at the Query Wait Statistics report as you did in Section 4.2 Activity 1. You can see that the overall average wait time for the query is lower and is a smaller % of the overall duration. This is a good indication that CPU is not as much of a resource bottleneck as it was when the database had a lower number of vCores:<br>
<img src="../graphics/SSMS_Top_Wait_Stats_Query_Faster.png" alt="Azure_Portal_Update_In_Progress"/>
- Observe Azure Portal Compute Utilization
Look at the Overview blade again for the Compute Utilization. Notice the significant drop in overall CPU resource usage compared to the previous workload execution:<br>
<img src="../graphics/Azure_Portal_Compute_Query_Comparison.png" alt="Azure_Portal_Compute_Query_Comparison"/>
>**NOTE:** If you continue to increase vCores for this database, you can improve performance up to a threshold where all queries have plenty of CPU resources. This does not mean you must match the number of vCores to the number of concurrent users from your workload. In addition, you can change the Pricing Tier to use the **Serverless** *Compute Tier* instead of **Provisioned** to achieve a more "auto-scaled" approach to a workload. For example, if you chose a min vCore value of 2 and a max vCore value of 8 for this workload, it would immediately scale to 8 vCores.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><a name="2"><b>Activity 3</a>: Optimizing application performance for Azure SQL Database</b></p>
>**IMPORTANT**: This activity assumes you have completed all Activities in Module 2.
A good article to read: https://azure.microsoft.com/en-us/blog/resource-governance-in-azure-sql-database/
In some cases, migrating an existing application and SQL query workload to Azure may uncover opportunities to optimize and tune queries.
Assume that AdventureWorks is extending its order website with a customer rating system. To support it, you need to add a new table that handles a heavy volume of concurrent INSERT activity for ratings. You have tested the SQL query workload on a development computer that has a local SSD drive for the database and transaction log.
When you move your test to Azure SQL Database using the General Purpose tier (8 vCores), the INSERT workload is slower. You need to discover whether you need to change the service objective or tier to support the new workload.
All scripts for this activity can be found in the **azuresqlworkshop\04-Performance\tuning_applications** folder.
**Step 1 - Create a new table**
Run the following statement (or use the script **order_rating_ddl.sql**) to create a table in the AdventureWorks database you have used in Activity 1 and 2:
```sql
DROP TABLE IF EXISTS SalesLT.OrderRating;
GO
CREATE TABLE SalesLT.OrderRating
(OrderRatingID int identity not null,
SalesOrderID int not null,
OrderRatingDT datetime not null,
OrderRating int not null,
OrderRatingComments char(500) not null);
GO
```
**Step 2 - Load up a query to monitor query execution**
- Use the following query or script **sqlrequests.sql** to look at active SQL queries *in the context of the AdventureWorks database*:
```sql
SELECT er.session_id, er.status, er.command, er.wait_type, er.last_wait_type, er.wait_resource, er.wait_time
FROM sys.dm_exec_requests er
INNER JOIN sys.dm_exec_sessions es
ON er.session_id = es.session_id
AND es.is_user_process = 1;
```
- Use the following query or script **top_waits.sql** to look at top wait types by count *in the context of the AdventureWorks database*:
```sql
SELECT * FROM sys.dm_os_wait_stats
ORDER BY waiting_tasks_count DESC;
```
- Use the following query or script **tlog_io.sql** to observe latency for transaction log writes:
```sql
SELECT io_stall_write_ms/num_of_writes as avg_tlog_io_write_ms, *
FROM sys.dm_io_virtual_file_stats
(db_id('AdventureWorks0406'), 2);
```
**Step 3 - Run the workload**
Run the test INSERT workload using the script order_rating_insert_single.cmd. This script uses ostress to run 25 concurrent users running the following T-SQL statement (in the script **order_rating_insert_single.sql**):
```sql
DECLARE @x int;
SET @x = 0;
WHILE (@x < 100)
BEGIN
SET @x = @x + 1;
INSERT INTO SalesLT.OrderRating
(SalesOrderID, OrderRatingDT, OrderRating, OrderRatingComments)
VALUES (@x, getdate(), 5, 'This was a great order');
END
```
You can see from this script that it is not exactly a real depiction of data coming from the website but it does simulate many order ratings being ingested into the database.
**Step 4 - Observe query requests and duration**
Using the queries in Step 2 you should observe the following:
- Many requests constantly have a wait_type of WRITELOG with a wait_time value > 0.
- The WRITELOG wait type has the highest count.
- The average time to write to the transaction log is somewhere around 2ms.
The duration of this workload on a SQL Server 2019 instance with an SSD drive is somewhere around 15 seconds. The total duration of this workload on Azure SQL Database using a Gen5 8 vCore database is around 32+ seconds.
WRITELOG wait types are indicative of latency flushing to the transaction log. 2ms per write doesn't seem like much, but on a local SSD drive these waits may be < 1ms.
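
To put a number on these waits, the average wait per WRITELOG occurrence can be computed from the same DMV used in **top_waits.sql**. This is only a sketch, and the column alias `avg_wait_ms` is illustrative. Because the counters are cumulative, comparing a sample taken before the workload with one taken after it gives a cleaner picture than a single reading.

```sql
-- Average wait time per WRITELOG wait, in milliseconds.
SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms * 1.0 / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'WRITELOG';
```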
TODO: WRITELOG waits sometimes don't show up in Query Store?
**Step 5 - Decide on a resolution**
The problem is not a high percentage of log write activity. The Azure Portal and **dm_db_resource_stats** don't show any numbers higher than 20-25%. The problem is not an IOPS limit either. The issue is that the application requires low latency for transaction log writes, but the General Purpose database configuration comes with higher latency. In fact, the documentation for resource limits lists latency between 5-7ms (https://docs.microsoft.com/en-us/azure/sql-database/sql-database-vcore-resource-limits-single-databases).
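
The log write percentage can be checked from T-SQL as well as from the portal. The sketch below assumes the connection is in the context of the user database. The view keeps a short rolling history, so run it shortly after the workload finishes.

```sql
-- Recent resource usage for the current database, newest samples first.
SELECT end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
```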
If you examine the workload, you will see that each INSERT is a single-transaction commit, which requires a transaction log flush.
One commit for each insert is not efficient, but the application was not affected on a local SSD because each commit was very fast. The Business Critical pricing tier (service objective or SKU) provides local SSD drives with lower latency, but maybe there is an application optimization.
The T-SQL batch can be changed for the workload to wrap a BEGIN TRAN/COMMIT TRAN around the INSERT iterations.
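
The change is small. The batch below mirrors the script **order_rating_insert.sql**, with the loop wrapped in one explicit transaction so that the inserts commit together instead of one at a time. Treat it as a sketch of the idea rather than a tuned batch size.

```sql
DECLARE @x int;
SET @x = 0;
-- One explicit transaction around the loop: the commit, and the log flush it
-- forces, happens once per batch instead of once per INSERT.
BEGIN TRAN;
WHILE (@x < 100)
BEGIN
    SET @x = @x + 1;
    INSERT INTO SalesLT.OrderRating
    (SalesOrderID, OrderRatingDT, OrderRating, OrderRatingComments)
    VALUES (@x, getdate(), 5, 'This was a great order');
END
COMMIT TRAN;
```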
**Step 6 - Run the modified workload and observe**
The modified workload can be found in the script **order_rating_insert.sql**. Run the modified workload with ostress using the script **order_rating_insert.cmd**.
Now the workload runs in almost 5 seconds, compared to 18-19 seconds with a local SSD using singleton transactions. This is an example of tuning an application's SQL queries so that they run faster inside or outside of Azure.
The workload runs so fast it may be difficult to observe diagnostic data from queries used previously in this activity. It is important to note that sys.dm_os_wait_stats cannot be cleared using DBCC SQLPERF as it can be with SQL Server.
TODO: What does this workload look like in MI?
The concept of "batching" can help most applications, including those running on Azure. Read more at https://docs.microsoft.com/en-us/azure/sql-database/sql-database-use-batching-to-improve-performance.
>**NOTE:** Very large transactions can be affected on Azure, and the symptom will be LOG_RATE_GOVERNOR waits. In this example, the char(500) not null column pads spaces and causes large tlog records. Performance can be optimized further by making that column a variable-length column. TODO: Add more to this paragraph.
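
One possible way to act on the note above is to define the comments column as variable length. The sketch below is based on the table definition used for this activity. Switching to varchar(500) is an assumption about the intended fix, not something the workshop scripts do.

```sql
DROP TABLE IF EXISTS SalesLT.OrderRating;
GO
CREATE TABLE SalesLT.OrderRating
(OrderRatingID int identity not null,
 SalesOrderID int not null,
 OrderRatingDT datetime not null,
 OrderRating int not null,
 -- varchar stores only the characters supplied (no space padding),
 -- so each row and its transaction log record is smaller.
 OrderRatingComments varchar(500) not null);
GO
```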
<p style="border-bottom: 1px solid lightgrey;"></p>

View file

@ -0,0 +1 @@
SELECT * FROM sys.dm_db_resource_stats

View file

@ -0,0 +1,2 @@
EXEC sp_query_store_flush_db
GO

View file

@ -0,0 +1,5 @@
SELECT database_name,slo_name,cpu_limit,max_db_memory, max_db_max_size_in_mb, primary_max_log_rate,primary_group_max_io, volume_local_iops,volume_pfs_iops
FROM sys.dm_user_db_resource_governance;
GO
SELECT DATABASEPROPERTYEX('AdventureWorks0406', 'ServiceObjective');
GO

View file

@ -0,0 +1,2 @@
ALTER DATABASE AdventureWorks0406 MODIFY (SERVICE_OBJECTIVE = 'GP_Gen5_8');
GO

View file

@ -0,0 +1,6 @@
SELECT er.session_id, er.status, er.command, er.wait_type, er.last_wait_type, er.wait_resource, er.wait_time
FROM sys.dm_exec_requests er
INNER JOIN sys.dm_exec_sessions es
ON er.session_id = es.session_id
AND es.is_user_process = 1
GO

View file

@ -0,0 +1 @@
ostress.exe -Saw-server<ID>.database.windows.net -itopcustomersales.sql -Ucloudadmin -dAdventureWorks<ID> -P<password> -n10 -r1500 -q

View file

@ -0,0 +1,8 @@
SELECT c.*, soh.OrderDate, soh.DueDate, soh.ShipDate, soh.Status, soh.ShipToAddressID, soh.BillToAddressID, soh.ShipMethod, soh.TotalDue, soh.Comment, sod.*
FROM SalesLT.Customer c
INNER JOIN SalesLT.SalesOrderHeader soh
ON c.CustomerID = soh.CustomerID
INNER JOIN SalesLT.SalesOrderDetail sod
ON soh.SalesOrderID = sod.SalesOrderID
ORDER BY sod.LineTotal desc
GO

View file

@ -0,0 +1,9 @@
DROP TABLE IF EXISTS SalesLT.OrderRating;
GO
CREATE TABLE SalesLT.OrderRating
(OrderRatingID int identity not null,
SalesOrderID int not null,
OrderRatingDT datetime not null,
OrderRating int not null,
OrderRatingComments char(500) not null);
GO

View file

@ -0,0 +1 @@
ostress.exe -Sbobazuresqlserver.database.windows.net -iorder_rating_insert.sql -Uthewandog -dAdventureWorks0406 -P$cprsqlserver2019 -n25 -r100 -q

View file

@ -0,0 +1,12 @@
DECLARE @x int
SET @x = 0
BEGIN TRAN
WHILE (@x < 100)
BEGIN
SET @x = @x + 1
INSERT INTO SalesLT.OrderRating
(SalesOrderID, OrderRatingDT, OrderRating, OrderRatingComments)
VALUES (@x, getdate(), 5, 'This was a great order')
END
COMMIT TRAN
GO

View file

@ -0,0 +1 @@
ostress.exe -Sbobazuresqlserver.database.windows.net -iorder_rating_insert_single.sql -Uthewandog -dAdventureWorks0406 -P$cprsqlserver2019 -n25 -r100 -q

View file

@ -0,0 +1,10 @@
DECLARE @x int;
SET @x = 0;
WHILE (@x < 100)
BEGIN
SET @x = @x + 1;
INSERT INTO SalesLT.OrderRating
(SalesOrderID, OrderRatingDT, OrderRating, OrderRatingComments)
VALUES (@x, getdate(), 5, 'This was a great order');
END
GO

View file

@ -0,0 +1,6 @@
SELECT er.session_id, er.status, er.command, er.wait_type, er.last_wait_type, er.wait_resource, er.wait_time
FROM sys.dm_exec_requests er
INNER JOIN sys.dm_exec_sessions es
ON er.session_id = es.session_id
AND es.is_user_process = 1;
GO

View file

@ -0,0 +1,4 @@
SELECT io_stall_write_ms/num_of_writes as avg_tlog_io_write_ms, *
FROM sys.dm_io_virtual_file_stats
(db_id('AdventureWorks0406'), 2);
GO

View file

@ -0,0 +1,3 @@
SELECT * FROM sys.dm_os_wait_stats
ORDER BY waiting_tasks_count DESC;
GO

Binary file not shown: AzureSQLWorkshop/graphics/Azure_Portal_Change_Tier.png (Normal file, 131 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/Azure_Portal_Compute_Options.png (Normal file, 212 KiB)
Binary file not shown (157 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/Azure_Portal_Compute_Slow_Query.png (Normal file, 229 KiB)
Binary file not shown (16 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/Azure_Portal_Update_In_Progress.png (Normal file, 73 KiB)
Binary file not shown (180 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_QDS_Find_Top_Queries.png (Normal file, 82 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_QDS_Query_ID.png (Normal file, 334 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_QDS_Top_Query_Faster.png (Normal file, 446 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_QDS_Top_Query_Report.png (Normal file, 382 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_Slow_Query_Stats.png (Normal file, 155 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_Top_Wait_Stats.png (Normal file, 607 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_Top_Wait_Stats_Query.png (Normal file, 390 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_Top_Wait_Stats_Query_Faster.png (Normal file, 463 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_Wait_Stats_Faster_Query.png (Normal file, 181 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/SSMS_Workload_Query_Plan.png (Normal file, 33 KiB)
Binary file not shown: AzureSQLWorkshop/graphics/service_objective_results.png (Normal file, 30 KiB)

View file

@ -246,7 +246,7 @@ Open the **03_WorkingWithData.py** file and enter the code you find for section
Python has many ways to read data in (*sometimes into memory, sometimes streaming as it reads it*) built right in to the standard libraries. Other Libraries, such as Pandas and NumPy, have their own way of reading in data.
In any case, the data is assigned to a data family or *structure*, which you learned about earlier. Depending on which Library you are using, you'll pick a data structure that makes the most sense for how you want to work with it. For instance, Pandas uses a dataframe as the primary data structure it works with. This is why it's important to know the data types, so that you understand what stucture you need to perform your desired operations.
In any case, the data is assigned to a data family or *structure*, which you learned about earlier. Depending on which Library you are using, you'll pick a data structure that makes the most sense for how you want to work with it. For instance, Pandas uses a dataframe as the primary data structure it works with. This is why it's important to know the data types, so that you understand what structure you need to perform your desired operations.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./graphics/checkbox.png"><b>Reading from Files</b></p>
@ -465,7 +465,7 @@ Read the [Documentation Reference here](https://docs.microsoft.com/en-us/azure/m
Read the [Documentation Reference here](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle-data)
The Data Aquisition and Understanding phase of the TDSP you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, youll need to examine it to see what it holds. All data needs cleaning, so after the inspection phase, youll replace missing values, add and change columns. Youve already seen the Libraries you'll need to work with for Data Wrangling - Pandas being the most common in use.
The Data Acquisition and Understanding phase of the TDSP you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, youll need to examine it to see what it holds. All data needs cleaning, so after the inspection phase, youll replace missing values, add and change columns. Youve already seen the Libraries you'll need to work with for Data Wrangling - Pandas being the most common in use.
<p style="border-bottom: 1px solid lightgrey;"></p>
<h4>Phase Three - Modeling</h4>

View file

@ -61,7 +61,7 @@ The entire repository can be [downloaded as a single ZIP file here](https://gith
### Clone all Workshops using git
You can [clone the entire respository using `git` here](https://github.com/Microsoft/sqlworkshops.git).
You can [clone the entire repository using `git` here](https://github.com/Microsoft/sqlworkshops.git).
### Get only one Workshop
You can follow the steps below to clone individual files from a git repo using a git client.

Binary file not shown: SQLGroundToCloud/.DS_Store (Vendored, Normal file)
Binary file not shown: k8stobdc/.DS_Store (Vendored)
Binary file not shown: k8stobdc/KubernetesToBDC/.DS_Store (Vendored)

View file

@ -1,4 +1,4 @@
![](../graphics/microsoftlogo.png)
![](https://github.com/microsoft/sqlworkshops/blob/master/graphics/microsoftlogo.png?raw=true)
# Workshop: <TODO: Enter workshop name>
@ -6,7 +6,7 @@
<p style="border-bottom: 1px solid lightgrey;"></p>
<img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/textbubble.png"> <h2>00 prerequisites</h2>
<img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/textbubble.png?raw=true"> <h2>00 prerequisites</h2>
This workshop is taught using the following components, which you will install and configure in the sections that follow.
@ -26,37 +26,37 @@ The other requirements are:
*Note that all following activities must be completed prior to class - there will not be time to perform these operations during the workshop.*
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity 1: Set up a Microsoft Azure Account</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity 1: Set up a Microsoft Azure Account</b></p>
You have multiple options for setting up Microsoft Azure account to complete this workshop. You can use a Microsoft Developer Network (MSDN) account, a personal or corporate account, or in some cases a pass may be provided by the instructor. (Note: for most classes, the MSDN account is best)
**If you are attending this course in-person:**
Unless you are explicitly told you will be provided an account by the instructor in the invitation to this workshop, you must have your Microsoft Azure account and Data Science Virtual Machine set up before you arrive at class. There will NOT be time to configure these resources during the course.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><b>Option 1 - Microsoft Developer Network Account (MSDN) Account</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><b>Option 1 - Microsoft Developer Network Account (MSDN) Account</b></p>
The best way to take this workshop is to use your [Microsoft Developer Network (MSDN) benefits if you have a subscription](https://marketplace.visualstudio.com/subscriptions).
- [Open this resource and click the "Activate your monthly Azure credit" button](https://azure.microsoft.com/en-us/pricing/member-offers/credit-for-visual-studio-subscribers/)
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><b>Option 2 - Use Your Own Account</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><b>Option 2 - Use Your Own Account</b></p>
You can also use your own account or one provided to you by your organization, but you must be able to create a resource group and create, start, and manage a Virtual Machine and an Azure AKS cluster.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><b>Option 3 - Use an account provided by your instructor</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><b>Option 3 - Use an account provided by your instructor</b></p>
Your workshop invitation may have instructed you that they will provide a Microsoft Azure account for you to use. If so, you will receive instructions that it will be provided.
**Unless you received explicit instructions in your workshop invitations, you much create either an MSDN or Personal account. You must have an account prior to the workshop.**
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity 2: Prepare Your Workstation</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity 2: Prepare Your Workstation</b></p>
<br>
The instructions that follow are the same for either a "base metal" workstation or laptop, or a Virtual Machine. It's best to have at least 4MB of RAM on the management system, and these instructions assume that you are not planning to run the database server or any Containers on the workstation. It's also assumed that you are using a current version of Windows, either desktop or server.
<br>
*(You can copy and paste all of the commands that follow in a PowerShell window that you run as the system Administrator)*
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Updates<p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Updates<p>
First, ensure all of your updates are current. You can use the following commands to do that in an Administrator-level PowerShell session:
@ -73,12 +73,12 @@ Install-WindowsUpdate
*Note: If you get an error during this update process, evaluate it to see if it is fatal. You may recieve certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Install Big Data Cluster Tools</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Install Big Data Cluster Tools</p>
Next, install the tools to work with Big Data Clusters:
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity 3: Install BDC Tools</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity 3: Install BDC Tools</b></p>
Open this resource, and follow all instructions for the Microsoft Windows operating system
@ -87,7 +87,7 @@ Open this resource, and follow all instructions for the Microsoft Windows operat
- [https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sql-server-ver15](https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sql-server-ver15)
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity 4: Re-Update Your Workstation</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity 4: Re-Update Your Workstation</b></p>
Once again, download the MSI and run it from there. It's always a good idea after this many installations to run Windows Update again:
@ -101,11 +101,11 @@ Install-WindowsUpdate
**Note 2: If you are using a Virtual Machine in Azure, power off the Virtual Machine using the Azure Portal every time you are done with it. Turning off the VM using just the Windows power off in the VM only stops it running, but you are still charged for the VM if you do not stop it from the Portal. Stop the VM from the Portal unless you are actively using it.**
<p><img style="margin: 0px 15px 15px 0px;" src="../graphics/owl.png"><b>For Further Study</b></p>
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/owl.png?raw=true"><b>For Further Study</b></p>
<ul>
<li><a href="https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads" target="_blank">Official Documentation for this section</a></li>
</ul>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/geopin.png"><b >Next Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/geopin.png?raw=true"><b >Next Steps</b></p>
Next, Continue to <a href="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/KubernetesToBDC/01-introduction.md" target="_blank"><i> Module 1 - Introduction</i></a>.

View file

@ -14,7 +14,7 @@ This module covers Container technologies and how they are different than Virtua
<p style="border-bottom: 1px solid lightgrey;"></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b><a name="aks">Activity: Install Class Environment on AKS (Optional)</a></b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b><a name="aks">Activity: Install Class Environment on AKS (Optional)</a></b></p>
*(If you are taking this course on-line and not with an instructor-provided Kubernetes environment, you can use a Microsoft Azure subscription to deploy a Kubernetes Environment, complete with the SQL Server big data clusters feature. Your instructor may also have you use this deployment mechanism if in-class hardware is not practical or available)*
@ -26,15 +26,15 @@ Using the following steps, you will create a Resource Group in Azure that will h
<p><b>Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"> <a href="https://github.com/Microsoft/sqlworkshops/blob/master/sqlserver2019bigdataclusters/SQL2019BDC/00%20-%20Prerequisites.md" target="_blank"> Ensure that you have completed all prerequisites</a>.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"> <a href="https://github.com/Microsoft/sqlworkshops/blob/master/sqlserver2019bigdataclusters/SQL2019BDC/00%20-%20Prerequisites.md" target="_blank"> Ensure that you have completed all prerequisites</a>.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sqlallproducts-allversions" target="_blank"> Read the following article to install the big data cluster Tools, ensuring that you carefully follow each step</a>. Note that if you followed the pre-requisites properly, you will already have <i>Python</i>, <i>kubectl</i>, and <i>Azure Data Studio</i> installed, so those may be skipped. Follow all other instructions.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sqlallproducts-allversions" target="_blank"> Read the following article to install the big data cluster Tools, ensuring that you carefully follow each step</a>. Note that if you followed the pre-requisites properly, you will already have <i>Python</i>, <i>kubectl</i>, and <i>Azure Data Studio</i> installed, so those may be skipped. Follow all other instructions.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sqlallproducts-allversions" target="_blank"> Read the following article to deploy the bdc to AKS, ensuring that you carefully follow each step</a>. Stop at the section marked <b>Connect to the cluster</b>.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sqlallproducts-allversions" target="_blank"> Read the following article to deploy the bdc to AKS, ensuring that you carefully follow each step</a>. Stop at the section marked <b>Connect to the cluster</b>.</p>
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="1-3">1.1 Big Data Technologies: Operating Systems</a></h2>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true"><a name="1-3">1.1 Big Data Technologies: Operating Systems</a></h2>
In this section you will learn more about the design the primary operating system (Linux) used with a Kubernetes Cluster.
@ -135,21 +135,21 @@ The essential commands you should know for this workshop are below. In Linux you
A <a href="https://opensourceforu.com/2016/07/introduction-linux-system-administration/" target="_blank">longer explanation of system administration for Linux is here</a>.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Work with Linux Commands</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Work with Linux Commands</b></p>
<p><b>Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://bellard.org/jslinux/vm.html?url=https://bellard.org/jslinux/buildroot-x86.cfg" target="_blank">Open this link to run a Linux Emulator in a browser</a></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Find the mounted file systems, and then show the free space in them.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Show the current directory.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Show the files in the current directory. </p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Create a new directory, navigate to it, and create a file called <i>test.txt</i> with the words <i>This is a test</i> in it. (hint: us the <b>nano</b> editor or the <b>echo</b> command)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Display the contents of that file.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Show the help for the <b>cat</b> command.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://bellard.org/jslinux/vm.html?url=https://bellard.org/jslinux/buildroot-x86.cfg" target="_blank">Open this link to run a Linux Emulator in a browser</a></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Find the mounted file systems, and then show the free space in them.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Show the current directory.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Show the files in the current directory. </p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Create a new directory, navigate to it, and create a file called <i>test.txt</i> with the words <i>This is a test</i> in it. (hint: us the <b>nano</b> editor or the <b>echo</b> command)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Display the contents of that file.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Show the help for the <b>cat</b> command.</p>
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="1-4">1.2 Big Data Technologies: Containers and Controllers</a></h2>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true"><a name="1-4">1.2 Big Data Technologies: Containers and Controllers</a></h2>
Bare-metal installations of an operating system such as Windows are deployed on hardware using a <i>Kernel</i>, and additional software to bring all of the hardware into a set of calls.
@ -158,7 +158,7 @@ Bare-metal installations of an operating system such as Windows are deployed on
One abstraction layer above installing software directly on hardware is using a <i>Hypervisor</i>. In essence, this layer uses the base operating system to emulate hardware. You install an operating system (called a *Guest* OS) on the Hypervisor (called the *Host*), and the Guest OS acts as if it is on bare-metal.
<br>
<img style="height: 300;" src="https://docs.docker.com/images/VM%402x.png">
<img style="height: 300;" src="https://docs.docker.com/images/VM%402x.png?raw=true">
<br>
In this abstraction level, you have full control (and responsibility) for the entire operating system, but not the hardware. This isolates all process space and provides an entire "Virtual Machine" to applications. For scale-out systems, a Virtual Machine allows for a distribution and control of complete computer environments using only software.
@ -174,7 +174,7 @@ A Container is provided by the Container Runtime (Such as [containerd](https://c
<i>(NOTE: The Container Image Kernel can run on Windows or Linux, but you will focus on the Linux Kernel Containers in this workshop.)</i>
<br>
<img style="height: 300;" src="https://docs.docker.com/images/Container%402x.png">
<img style="height: 300;" src="https://docs.docker.com/images/Container%402x.png?raw=true">
<br>
This abstraction holds everything for an application to isolate it from other running processes. It is also completely portable - you can create an image on one system, and another system can run it so long as the Container Runtimes (Such as Docker) Runtime is installed. Containers also start very quickly, are easy to create (called <i>Composing</i>) using a simple text file with instructions of what to install on the image. The instructions pull the base Kernel, and then any binaries you want to install. Several pre-built Containers are already available, SQL Server is one of these. <a href="https://docs.microsoft.com/en-us/sql/linux/quickstart-install-connect-docker?view=sql-server-2017" target="_blank">You can read more about installing SQL Server on Container Runtimes (Such as Docker) here</a>.
@ -198,34 +198,36 @@ For Big Data systems, having lots of Containers is very advantageous to segment
</table>
<br>
<p><img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/KubernetesCluster.png"></p>
<p><img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/KubernetesCluster.png?raw=true"></p>
<br>
You can <a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/" target="_blank">learn much more about Container Orchestration systems here</a>. We're using the Azure Kubernetes Service (AKS) in this workshop, and <a href="https://aksworkshop.io/" target="_blank">they have a great set of tutorials for you to learn more here</a>.
In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is reponsible for building and configurint the Nodes, assigns Pods to Nodes,creates and manages the Persistent Voumes (durable storage), and manages the operation of the Cluster.
In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is responsible for building and configuring the Nodes, assigns Pods to Nodes,creates and manages the Persistent Volumes (durable storage), and manages the operation of the Cluster.
> NOTE: The OpenShift Container Platform is a commercially supported Platform as a Service (PaaS) based on Kubernetes from RedHat. Many shops require a commercial vendor to implement and support Kubernetes.
(You'll cover the storage aspects of Container Orchestration in more detail in a moment.)
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Familiarize Yourself with Container Orchestration using minikube</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Familiarize Yourself with Container Orchestration using minikube</b></p>
To practice with Kubernetes, you will use an online emulator to work with the `minikube` platform.
<p><b>Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/cluster-interactive/
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/cluster-interactive/
" target="_blank">Open this resource, and complete the first module</a>. (You can return to it later to complete all exercises if you wish)</p>
<br>
<p style="border-bottom: 1px solid lightgrey;"></p>
<br>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="1-5">1.3 Big Data Technologies: Distributed Data Storage</a></h2>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true"><a name="1-5">1.3 Big Data Technologies: Distributed Data Storage</a></h2>
Traditional storage uses a call from the operating system to an underlying I/O system, as you learned earlier. These file systems are either directly connected to the operating system or appear to be connected directly using a Storage Area Network. The blocks of data are stored and managed by the operating system.
For large scale-out data systems, the mounting point for an I/O is another abstraction. For SQL Server BDC, the most commonly used scale-out file system is the <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html" target="_blank">Hadoop Distributed File System</a>, or <i>HDFS</i>. HDFS is a set of Java code that gathers disparate disk subsystems into a <i>Cluster</i> made up of various <i>Nodes</i> - a <i>NameNode</i>, which manages the cluster's metadata, and <i>DataNodes</i> that physically store the data. Files and directories are represented on the NameNode by a structure called <i>inodes</i>. Inodes record attributes such as permissions, modification and access times, and namespace and diskspace quotas.
<p><img style="height: 300; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/hdfs.png"></p>
<p><img style="height: 300; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/hdfs.png?raw=true"></p>
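The split between NameNode metadata and DataNode storage described above can be sketched with a toy model. This is only an illustration, not how HDFS is implemented: the class names, field names, block identifiers, and DataNode names are all hypothetical, and the only parts taken from the text are the inode attributes and the idea that the NameNode keeps metadata while DataNodes hold the data.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Hypothetical inode record holding the attributes named in the text."""
    permissions: str
    modification_time: float
    access_time: float
    namespace_quota: int
    diskspace_quota: int
    block_ids: list = field(default_factory=list)


class NameNode:
    """Toy NameNode: stores metadata only, never the file contents."""

    def __init__(self):
        self.inodes = {}           # path -> Inode
        self.block_locations = {}  # block id -> names of DataNodes holding it

    def add_file(self, path, inode, locations):
        """Register a file and record which DataNodes hold each of its blocks."""
        self.inodes[path] = inode
        for block_id, nodes in zip(inode.block_ids, locations):
            self.block_locations[block_id] = list(nodes)

    def locate(self, path):
        """Return (block id, DataNode names) pairs a client would read from."""
        return [(b, self.block_locations[b]) for b in self.inodes[path].block_ids]


name_node = NameNode()
name_node.add_file(
    "/data/sales.csv",
    Inode("rw-r--r--", 0.0, 0.0, 1000, 10**9, ["blk_1", "blk_2"]),
    [["datanode-1", "datanode-2"], ["datanode-2", "datanode-3"]],
)
print(name_node.locate("/data/sales.csv"))
```

A client in this sketch asks the NameNode where the blocks live and would then read the bytes from the listed DataNodes directly.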
With an abstraction such as Containers, storage becomes an issue for two reasons: The storage can disappear when the Container is removed, and other Containers and technologies can't access storage easily within a Container.
@ -239,7 +241,7 @@ You <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Introduction
<p style="border-bottom: 1px solid lightgrey;"></p>
<br>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="1-6">1.4 Big Data Technologies: Command and Control</a></h2>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true"><a name="1-6">1.4 Big Data Technologies: Command and Control</a></h2>
There are three primary tools and utilities you will use to control the SQL Server big data cluster:
@ -267,28 +269,28 @@ You can <a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/what-is?
" target="_blank">learn more about Azure Data Studio here</a>.
<br>
<p><img style="height: 300; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/ads.png"></p>
<p><img style="height: 300; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/ads.png?raw=true"></p>
<br>
You'll explore further operations with the Azure Data Studio in the final module of this course.
<br>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Practice with Notebooks</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Practice with Notebooks</b></p>
<p><b>Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://notebooks.azure.com/BuckWoodyNoteBooks/projects/AzureNotebooks" target="_blank">Open this reference, and review the instructions you see there</a>. You can clone this Notebook to work with it later.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://notebooks.azure.com/BuckWoodyNoteBooks/projects/AzureNotebooks" target="_blank">Open this reference, and review the instructions you see there</a>. You can clone this Notebook to work with it later.</p>
<br>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Azure Data Studio Notebooks Overview</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Azure Data Studio Notebooks Overview</b></p>
<p><b>Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-notebooks?view=sql-server-2017" target="_blank">Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permist.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-notebooks?view=sql-server-2017" target="_blank">Open this reference, and read the tutorial</a> - you do not have to follow the steps, but you can if time permits.</p>
<br>
<p style="border-bottom: 1px solid lightgrey;"></p>
<br>
<p><img style="margin: 0px 15px 15px 0px;" src="../graphics/owl.png"><b>For Further Study</b></p>
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/owl.png?raw=true"><b>For Further Study</b></p>
<br>
<ul>
@ -302,6 +304,6 @@ You'll explore further operations with the Azure Data Studio in the final module
<li><a href="https://realpython.com/jupyter-notebook-introduction/" target="_blank">Full tutorial on Jupyter Notebooks</a></li>
</ul>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/geopin.png"><b >Next Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/geopin.png?raw=true"><b >Next Steps</b></p>
Next, Continue to <a href="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/KubernetesToBDC/02-hardware.md" target="_blank"><i> 02 - Hardware and Virtualization environment for Kubernetes </i></a>.

View file

@ -8,84 +8,241 @@
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/textbubble.png?raw=true"> 03 - Kubernetes Concepts and Implementation </h2>
In this workshop you have covered <TODO: Explain where the student is at the moment>. The end of this Module contains several helpful references you can use in these exercises and in production.
In this workshop you have covered the hardware and software environment for Kubernetes. You've learned about Linux, Containers, and a quick overview of Kubernetes. With all that in place, in the previous Module you set up your environment to install Kubernetes. We didn't cover the terms you used to define and deploy your cluster.
This module covers the concepts, terms, and tools for Kubernetes. You'll follow various exercises to ground your understanding of each topic, and the end of this Module contains several helpful references you can use in these exercises and in production.
This module covers <TODO: Explain the main topics quicly >.
Glossary:
Implementation
A Kubernetes cluster requires the following components:
- Master nodes
These form the clusters control plane
- Worker nodes
The nodes on which the applications containers run
- etcd
A high performance key value store that stores the clusters state. Since etcd is quite light weight in nature, etcd instances can generally share resources with other nodes in the cluster. The Hardware recommendations section of the official etcd.io site provides a detailed breakdown of the hardware requirement for etcd.
- Container Network Interface (CNI) Plugin
The nodes in the cluster communicate with each other via what is known as an overlay network, or more simply put, a software defined network. There are a variety of CNI plugins that Kubernetes can use, however, for the purpose of this workshop, the default CNI plugin of Calico will be used.
- Certificate Management
- Persistent Storage
Any type of data-centric application, and big data clusters fall into this category have a basic requirement to persists state. One of the key aims is ensure that if a pod is rescheduled to run on a different node, its state is not lost as it moves from its original node to a new one. In the early days of Kubernetes, most storage drivers were called as “In tree”, meaning that vendors who wanted Kubernetes to use their storage had to integrate the code for their drivers directly with the Kubernetes code base. The IT industry is now gravitating towards the Container Storage Interface specification which allows Kubernetes to seamlessly use any storage platform that supports this standard without having to touch the Kubernetes code base. Ultimately, the aim of the CSI standard is to promote storage portability.
- Ingress Management (Optional)
A key difference between Vanilla Kubernetes and Kubernetes-As-A-Service, such as Azure Kubernetes Service (AKS), is that services do not come with load balancing endpoints by default. Load balancer services for vanilla Kubernetes is enabled through the issue of ingress software such as MetalLb.
We'll begin with a set of definitions. These aren't all the terms used in Kubernetes - you'll see more as you work through the Modules - but they do form the basics for the concepts that follow. You'll work with each of these terms throughout this Module and the rest of the course, so just familiarize yourself with them, and refer back to this list as you work through each section.
<table style="tr:nth-child(even) {background-color: #dddddd;}; text-align: left; display: table; border-collapse: collapse; border-spacing: 5px; border-color: gray; ">
<tbody>
<tr style="vertical-align:top;">
<th>Category </th>
<th>Term </th>
<th>Description </th>
</tr>
<tr style="vertical-align:top;">
<td><a href="https://kubernetes.io/docs/reference/tools/"><b>Tools</b></a> </td>
<td><a href="https://kubernetes.io/docs/concepts/overview/kubernetes-api/"><i>The Kubernetes API</i></a> </td>
<td>The foundation for the declarative configuration schema calls for the entire system. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/reference/kubectl/overview/"><i>kubectl</i></a> </td>
<td>A command-line control tool for a Kubernetes cluster. Used from a client workstation or a "Jump Box" that acts as the client for your environment. This tool can be installed on Windows, Linux and Mac OS/X. Uses a set of configurations set in a text file to connect to a Kubernetes cluster. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/reference/tools/#kubeadm"><i>kubeadm</i></a> </td>
<td>A command-line tool for easily provisioning a secure Kubernetes cluster on top of physical or cloud servers or virtual machines. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/reference/tools/#dashboard"><i>The Kubernetes Dashboard</i></a> </td>
<td>A web-based Kubernetes interface that allows you to deploy containerized applications to a Kubernetes cluster, troubleshoot them, and manage the cluster and its resources. </td>
</tr>
<tr style="vertical-align:top;">
<td><a href="https://kubernetes.io/docs/concepts/#kubernetes-objects"><b>Object</b></a> </td>
<td><a href="https://www.tutorialspoint.com/kubernetes/kubernetes_node.htm"><i>Node</i></a> </td>
<td>The computers (physical or virtual) that host the rest of the Objects in a Kubernetes cluster. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/admin/kubelet/"><i>kubelet</i></a> </td>
<td>Runs on each Node, and provides communication with the Kubernetes Master. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/admin/kube-proxy/"><i>kube-proxy </i></a> </td>
<td>Runs on each Node, and provides a network proxy which reflects Kubernetes networking services. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/"><i>Pod</i></a> </td>
<td>The basic execution unit of a Kubernetes application - holds the processes running on your Cluster. It contains one or more <i>Containers</i>, the storage resources, a unique network IP, and any Container configurations. While the <i>docker daemon</i> is the most common container runtime used in a Kubernetes Pod, other container runtimes are also supported. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/services-networking/service/"><i>Service</i></a> </td>
<td>A "description" of a set of Pods and a policy to access them. This de-couples the call to an application from its physical representation, and allows the application running on the Pod to be more stateless. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/storage/volumes/"><i>Volume</i></a> </td>
<td>A pointer to a storage directory - either "ephemeral" (has the same lifetime as the Pod) or permanent. Can use various providers such as cloud storage and on-premises devices, and is set with various parameters. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://github.com/container-storage-interface/spec/blob/master/spec.md"><i>Persistent Storage </i></a> </td>
<td>A hardware and software combination used to persist state. One of the key aims is to ensure that if a Pod is rescheduled to run on a different Node, its state is not lost as it moves from its original Node to a new one. In the early days of Kubernetes, most storage drivers were "in-tree", meaning that vendors who wanted Kubernetes to use their storage had to integrate the code for their drivers directly with the Kubernetes code base. The IT industry is now gravitating towards the Container Storage Interface (CSI) specification, which allows Kubernetes to seamlessly use any storage platform that supports this standard without having to touch the Kubernetes code base. Ultimately, the aim of the CSI standard is to promote storage portability. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/"><i>Namespace</i></a> </td>
<td>Used to define multiple virtual clusters backed by the same physical cluster. Namespaces are a critical component in the Kubernetes role based access control security model.</td>
</tr>
<tr style="vertical-align:top;">
<td><a href="https://kubernetes.io/docs/concepts/architecture/master-node-communication/"><b>Kubernetes Master</b></a> </td>
<td><a href="https://kubernetes.io/docs/admin/kube-apiserver/"><i>kube-apiserver </i></a> </td>
<td>Responds to REST calls to provide the frontend to the cluster's shared state. All other components interact with the cluster through it. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/"><i>kube-controller-manager</i></a> </td>
<td>A daemon that embeds the non-terminating control loops shipped with Kubernetes. These loops watch the shared state of the cluster through the API Server and make changes to move the current state of the cluster toward the desired state. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/"><i>kube-scheduler </i></a> </td>
<td>A policy-driven scheduling service that is topology aware and specific to a workload. It is called for functions such as availability, performance, and capacity. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/"><i>Container Network Interface </i></a> </td>
<td>The Nodes in the cluster communicate with each other via what is known as an <i>overlay network</i> - a software-defined network. There are a variety of CNI plugins that Kubernetes can use. This Workshop uses the default <i>Calico</i> CNI plugin. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/security/overview/"><i>Certificate Management </i></a> </td>
<td>Security management for a Kubernetes cluster is managed through <a href="https://kubernetes.io/docs/concepts/configuration/secret/">Secrets which can use Certificates</a>. This concept deals with those layers. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://metallb.universe.tf/"><i>Ingress Management (Optional) </i></a> </td>
<td>A key difference between "Vanilla" Kubernetes and a Kubernetes-As-A-Service (such as Azure Kubernetes Service) is that in "Vanilla" Kubernetes, services do not come with load balancing endpoints by default. Load balancer services for Kubernetes are enabled using software such as <i>MetalLb</i>. </td>
</tr>
<tr style="vertical-align:top;">
<td><a href="https://kubernetes.io/docs/concepts/architecture/cloud-controller/"><b>Control </b></a> </td>
<td><a href="https://kubernetes.io/docs/concepts/architecture/controller/"><i>Controller</i></a> </td>
<td>A "plugin" mechanism that allows cloud providers to integrate with Kubernetes easily. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/"><i>Deployment</i></a> </td>
<td>A YAML file describing the state of Pods and ReplicaSets. Deployed to the <i>Kubernetes API</i> using <i>kubectl</i> or <i>REST</i> calls. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/"><i>DaemonSet</i></a> </td>
<td>A service that ensures that <i>Nodes</i> run a copy of a <i>Pod</i>. As Nodes are added to the cluster, Pods are added to them. As Nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/"><i>StatefulSet</i></a> </td>
<td>The workload API object used to manage stateful applications that are clustered by nature. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/"><i>ReplicaSet </i></a> </td>
<td>A service that maintains a stable set of replica Pods running at any given time. Used to guarantee the availability of a specified number of identical Pods. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/"><i>Job</i></a> </td>
<td>A Job creates one or more Pods and ensures that a specified number of them successfully terminate. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/#kubernetes-control-plane"><i>Control Plane</i></a> </td>
<td>Contains components such as the <i>Kubernetes Master</i> and <i>kubelet</i> processes that govern how Kubernetes communicates with your cluster. The Control Plane maintains a record of all of the Kubernetes Objects in the system, and runs continuous control loops to manage those objects' state. At any given time, the Control Plane's control loops will respond to changes in the cluster and work to make the actual state of all the objects in the system match the desired state that you provided. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/"><i>etcd</i></a> </td>
<td>A high-performance key-value store that stores the cluster's state. Since <i>etcd</i> is light-weight, each instance can generally share resources with other Nodes in the cluster. The Hardware recommendations section of the official http://etcd.io site provides a detailed breakdown of the hardware requirements for <i>etcd</i>. </td>
</tr>
<tr style="vertical-align:top;">
<td> </td>
<td><a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/"><i>operator</i></a> </td>
<td>A custom Kubernetes object implemented for the management of applications with complex life cycles. </td>
</tr>
</tbody>
</table>
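The control-loop idea in the <i>Control Plane</i> and <i>kube-controller-manager</i> entries above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Kubernetes code: `ClusterState`, `reconcile`, and the workload names are hypothetical, and the "cluster" is just a dictionary of replica counts.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Toy stand-in for cluster state: replica count per workload name."""
    replicas: dict


def reconcile(desired: ClusterState, actual: ClusterState) -> list:
    """Run one pass of a control loop and return the actions it took."""
    actions = []
    for name, want in desired.replicas.items():
        have = actual.replicas.get(name, 0)
        if have < want:
            actions.append(f"start {want - have} pod(s) for {name}")
        elif have > want:
            actions.append(f"stop {have - want} pod(s) for {name}")
        actual.replicas[name] = want
    # Workloads that are running but no longer desired are removed.
    for name in list(actual.replicas):
        if name not in desired.replicas:
            actions.append(f"stop {actual.replicas.pop(name)} pod(s) for {name}")
    return actions


desired = ClusterState({"web": 3, "db": 1})
actual = ClusterState({"web": 1, "cache": 2})
print(reconcile(desired, actual))
print(reconcile(desired, actual))  # a second pass finds nothing left to change
```

The point of the sketch is that you declare only the desired state; the loop works out the actions, and running it again once the states match does nothing.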
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true">3.1 Kubernetes Interfaces</h2>
<TODO: Content>
"North-south" traffic between a Kubernetes cluster and the outside is made via the Kubernetes API server. There are a number of standard client tools for administering and utilising a Kubernetes cluster:
kubectl
**[kubectl](https://kubernetes.io/docs/reference/kubectl/overview/)**
A command line tool for administering a Kubernetes cluster and creating / modifying Kubernetes objects via YAML files.
Dashboard
**[Dashboard](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/)**
A general purpose web-based graphical interface for Kubernetes.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: <TODO: Activity Name></b></p>
**[Helm](https://helm.sh/)**
A tool for Kubernetes application package management and deployment.
In this activity you will <TODO: Explain Activity>
Language Client Libraries
Client libraries exist for most of the popular third generation languages, such as [Python](https://github.com/kubernetes-client/python).
<p style="border-bottom: 1px solid lightgrey;"></p>
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkmark.png?raw=true"><b>Steps</b></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true">3.2 Deploying a Cluster</h2>
<TODO: Enter specific steps to perform the activity>
### 3.2.1 Control Plane ###
<p style="border-bottom: 1px solid lightgrey;"></p>
Provision must be made for the control plane to be highly available, this includes:
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true">3.2 Deployments (YAML Manifests)</h2>
<TODO: Content>
Provisions must be made for the control plane to be highly available. This includes:
- The API server
- Master nodes
- etcd instance
It is recommended that a production grade cluster has a minimum of two master nodes and three etcd instances.
3.2.2 Worker Nodes
### 3.2.2 Worker Nodes ###
A production grade SQL Server 2019 Big Data Cluster requires a minimum of three nodes each with 64 GB of RAM and 8 logical processors. However, consideration also needs to be made for upgrading a Kubernetes cluster from one version to another. There are two options:
- Upgrade each node in the cluster in-situ
This requires that a Taint is applied to a node so that it cannot accept pods and then drained of its current pod workload. The obvious inference here is that when the node is drained, the pods that are running on it need somewhere else to go, therefore this approach mandates that there are N+1 worker nodes. This approach comes with the risk that if the upgrade fails for any reason, the cluster may be left in a state with worker nodes on different versions of Kubernetes.
- **Upgrade each node in the cluster in-situ**
- Create a new cluster
Create a new cluster, deploy a big data cluster to it and then restore a backup of the data from the original cluster. This approach requires more hardware than the in-situ upgrade method. If the upgrade spans multiple versions of Kubernetes, for example the upgrade is from version 1.15 to 1.17, this method allows a 1.17 cluster to be created from scratch cleanly and then the data from 1.15 cluster restored onto the new 1.17 cluster.
This requires that a Taint is applied to a node so that it cannot accept pods. The node is then drained of its current pod workload, after which it can be upgraded. When the node is drained, the pods that are running on it need somewhere else to go; therefore this approach mandates that there are N+1 worker nodes (assuming one node is upgraded at a time). This approach runs the risk that if the upgrade fails for any reason, the cluster may be left in a state with worker nodes on different versions of Kubernetes.
3.2.1 Prerequisites
- **Create a new cluster**
Create a new cluster, deploy a big data cluster to it and then restore a backup of the data from the original cluster. This approach requires more hardware than the in-situ upgrade method. If the upgrade spans multiple versions of Kubernetes, for example the upgrade is from version 1.15 to 1.17, this method allows a 1.17 cluster to be created from scratch cleanly and then the data from 1.15 cluster restored onto the new 1.17 cluster.
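The N+1 sizing rule for the in-situ upgrade can be checked with simple arithmetic. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it treats all pods as interchangeable and every node as having the same pod capacity, which a real scheduler does not.

```python
def worker_nodes_needed(nodes_for_workload: int, nodes_upgraded_at_once: int = 1) -> int:
    """N+1 rule: nodes the workload needs, plus the nodes out of service during an upgrade."""
    return nodes_for_workload + nodes_upgraded_at_once


def survives_drain(pod_count: int, node_count: int, pods_per_node: int,
                   nodes_upgraded_at_once: int = 1) -> bool:
    """True if every pod still has a home while some nodes are tainted and drained."""
    remaining_nodes = node_count - nodes_upgraded_at_once
    return pod_count <= remaining_nodes * pods_per_node


# A workload that needs three worker nodes calls for four when upgrading one at a time.
print(worker_nodes_needed(3))
# 30 pods at 10 pods per node fit on three nodes, but not while one is drained...
print(survives_drain(pod_count=30, node_count=3, pods_per_node=10))
# ...whereas a fourth node leaves room for the drained node's pods.
print(survives_drain(pod_count=30, node_count=4, pods_per_node=10))
```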
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Create a single node big data cluster sandpit environment</b></p>
In this activity you will deploy a single node big data cluster sandpit environment on an Ubuntu virtual machine using [this script](https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-script-single-node-kubeadm?view=sql-server-ver15)
### 3.2.3 Kubernetes Production Grade Deployments ###
In the last activity, we deployed a SQL Server 2019 big data cluster running on a single node. But what if we want to:
- Deploy a Kubernetes cluster with multiple nodes, even tens of nodes
- Deploy multiple Kubernetes clusters without having to write a script for each cluster
- Automate the tasks that have to be performed in addition to running kubeadm
There is a tool that leverages kubeadm in order to achieve all of these goals.
### 3.2.4 Introducing Kubespray ###
[Kubespray](https://kubespray.io/#/) is a Kubernetes cluster life cycle management tool based on Ansible playbooks. It can:
- Create clusters
- Upgrade clusters
- Remove clusters
- Add nodes to existing clusters
Kubespray is a Cloud Native Computing Foundation project with its own [GitHub repository](https://github.com/kubernetes-sigs/kubespray).
### 3.2.5 What Is Ansible? ###
Ansible is an open source declarative tool for deploying applications and infrastructure-as-code. Components of an application or infrastructure are specified declaratively in Playbooks. Unlike other infrastructure-as-code tools, Ansible does not require that a special node is built for the purpose of deploying applications and infrastructure. All that is required is a host on which Ansible can be installed. Files known as inventory files are used to specify Ansible deployment targets. In the case of Kubespray, the deployment targets are the hosts on which nodes and etcd instances are to be created. Communication between Ansible and the deployment targets specified in an inventory file is via ssh.
### 3.2.6 Prerequisites ###
In order to carry out the deployment of the Kubernetes cluster, a basic understanding of the following tasks is required:
In order to carry out the deployment of the Kubernetes cluster, it is assumed that workshop attendees have a basic understanding of the following tasks:
- Ubuntu base operating system installation
- Ubuntu package management via apt
@ -95,23 +252,14 @@ In order to carry out the deployment of the Kubernetes cluster, it is assumed th
- Setting up remote access to Ubuntu hosts with ssh
- Basic Ubuntu firewall configuration
3.2.2 Introducing Kubespray
Kubespray is a Kubernetes cluster life cycle management tool that is based on Ansible playbooks, it can:
- Create clusters
- Upgrade clusters
### 3.2.7 Kubespray Workflow ###
- Remove clusters
- Add nodes to existing clusters
Kubespray is a Cloud Native Computing Foundation project and with its own GitHub repo that can be found here.
3.2.3 What Is Ansible?
Ansible is an open source declarative tool for deploying applications and infrastructure-as-code. Components of an application or infrastructure are specified declaratively in what are know as Runbooks. Unlike other infrastructure-as-code tools such as Puppet, Ansible does not require that a special node is built for the purpose of deploying applications and infrastructure. All that is required is a host on which Ansible can be installed. Files known as inventory files are used to specify Ansible deployment targets. In the case of Kubespray, the deployment targets are the hosts which nodes and etcd instances are to be created on. Communication between Ansible and the deployment targets specified in an inventory file is via ssh.
3.2.4 Why Use Kubeadm?
Unlike other available deployment tools, Kubespray does everything for you in “One shot”. For example, Kubeadm requires that certificates on nodes are created manually, Kubespray not only leverages Kubeadm but it also looks after everything including certificate creation for you. Kubespray works against most of the popular public cloud providers and has been tested for the deployment of clusters with thousands of nodes. The real elegance of Kubespray is the reuse it promotes. If an organisation has a requirement to deploy multiple clusters, once Kubespray is setup, for every new cluster that needs to be created, the only prerequisite is to create a new inventory file for the nodes the new cluster will use.
Unlike other available deployment tools, Kubespray does everything for you in "one shot". For example, Kubeadm requires that certificates on nodes are created manually; Kubespray not only leverages Kubeadm but also looks after everything, including certificate creation, for you. Kubespray works against most of the popular public cloud providers and has been tested for the deployment of clusters with thousands of nodes. The real elegance of Kubespray is the reuse it promotes. If an organization has a requirement to deploy multiple clusters, once Kubespray is set up, the only prerequisite for each new cluster is to create a new inventory file for the nodes the new cluster will use.
3.2.5 High Level Kubespray Workflow
The deployment of a Kubernetes cluster via Kubespray follows this workflow:
- Preinstall step
- Install Container Engine
@ -127,10 +275,11 @@ The deployment of a Kubernetes cluster via Kubespray follows this workflow:
- Configure network plugin
- Configure any add-ons
Conceptually the creation of a three-worker node cluster looks like this:
<img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/graphics/3_2_7_kubespray-flow.png?raw=true">
Note:
- The deployment is instigated from the jump server,
- The etcd instances can share nodes with the two masters and a worker node due to their minimal CPU and memory requirements,
@ -138,127 +287,426 @@ Note:
- cluster.yml contains the playbook for creating the Kubernetes cluster itself,
- The entire cluster is deployed via a single invocation of the ansible-playbook command.
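
As a rough sketch of that single invocation, the function below wraps the `ansible-playbook` call. The function name `deploy_cluster` and the inventory path are illustrative assumptions, not part of Kubespray; substitute the inventory file created for your own cluster.

```shell
# Hypothetical inventory path; adjust to wherever your inventory file lives.
INVENTORY="${INVENTORY:-inventory/mycluster/hosts.yaml}"

# deploy_cluster runs the Kubespray cluster.yml playbook against the inventory.
deploy_cluster() {
  # --become escalates privileges on the target hosts, which the
  # installation steps need in order to configure each node.
  ansible-playbook -i "$INVENTORY" --become --become-user=root cluster.yml
}
```

Run `deploy_cluster` on the jump server from the root of the Kubespray repository clone, so that `cluster.yml` resolves correctly.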
3.2.6 Requirements
Refer to the requirements section here in the Kubespray GitHub repo.
3.2.7 Post Cluster Deployment Activities
The primary tool for administering a Kubernetes cluster is kubectl. After deploying the cluster, the first step is to install this followed by installing and configuring a storage plugin.
### 3.2.8 Requirements ###
### 3.2.9 Post Cluster Deployment Activities ###
Install kubectl, the primary tool for administering a Kubernetes cluster. kubectl requires a configuration file in order to access the cluster. By default, kubectl looks for a file named config in the .kube directory under the home directory of the logged-in user:
<img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/graphics/3_2_9_kubectl.png?raw=true">
The config file specifies clusters, users and contexts, a context being a label for connection details for a cluster in terms of a user and namespace. If kubectl cannot find a config file or it has been corrupted in any way when an attempt is made to run a command against a cluster, the following error message will appear:
**The connection to the server localhost:8080 was refused - did you specify the right host or port?**
The [Kubernetes documentation](https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/) goes into detail regarding the creation of config files and contexts for accessing multiple clusters. The fastest and simplest way to create a config file is to copy the file /etc/kubernetes/admin.conf from one of the master node hosts to the client machine on which kubectl is installed.
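
A minimal sketch of that copy, assuming SSH access to a master node and that the login used can read admin.conf (on many clusters the file is readable only by root, so an extra step may be needed). The function name `fetch_kubeconfig` and the example login are hypothetical.

```shell
# fetch_kubeconfig copies admin.conf from a master node to ~/.kube/config.
# Argument: $1 = SSH login for a master node, e.g. admin@master-1 (hypothetical).
fetch_kubeconfig() {
  mkdir -p "$HOME/.kube"
  scp "$1:/etc/kubernetes/admin.conf" "$HOME/.kube/config"
  # The file contains cluster credentials, so restrict it to the current user.
  chmod 600 "$HOME/.kube/config"
  # Confirm that kubectl can now read the clusters, users and contexts.
  kubectl config get-contexts
}
```

For example, `fetch_kubeconfig admin@master-1` would leave a usable config file in place and print the contexts it defines.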
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: <TODO: Activity Name></b></p>
For this activity, workshop attendees will log onto the jump server and use a preinstalled version of kubectl that allows a production-grade Kubernetes cluster to be accessed via a read-only context.
Use the kubectl cheat sheet to familiarise yourself with various kubectl commands. One of the key commands to be aware of is kubectl get.
3.2.8 Hands on Practical Exercises
Use the kubectl cheat sheet to familiarise yourself with various kubectl commands. One of the key commands to be aware of is kubectl get.
- Use kubectl to obtain the state of each node in the cluster. All nodes in a healthy cluster should have a state of Ready.
- Apart from single node clusters that are used for the purposes of learning Kubernetes such as minikube, pods should never run on master nodes. As such a NoSchedule taint should be present on each master node, use kubectl describe to verify this.
- Ordinarily, with the exception of single node clusters that are used for learning purposes, pods should never run on master nodes. As such, a NoSchedule taint should be present on each master node. Use kubectl describe to verify this.
- Labels can be assigned to any object created in a Kubernetes cluster, and an entity known as a Selector is used to filter objects by label. Use kubectl get to display the nodes with the role of master. Labels and selectors are covered by the Kubernetes documentation in detail.
- All objects that reside in a Kubernetes cluster reside in a namespace, when a big data cluster is created, all its objects reside in a namespace dedicated to that big data cluster. Use kubectl to obtain the names of namespaces present in the workshop cluster.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: <TODO: Activity Name></b></p>
In this activity you will <TODO: Explain Activity>
- All objects that live in a Kubernetes cluster reside in a namespace. When a big data cluster is created, all its objects reside in a namespace dedicated to that big data cluster. Use kubectl to obtain the names of namespaces present in the workshop cluster.
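
The checks in the list above could be run roughly as follows. This is a sketch rather than the only way to do it: the function name `inspect_cluster` is hypothetical, and the label `node-role.kubernetes.io/master` is an assumption about how the workshop cluster labels its master nodes (newer clusters may use a different role label).

```shell
# inspect_cluster walks through the four checks from the exercise list.
inspect_cluster() {
  # Node state: every node in a healthy cluster should report Ready.
  kubectl get nodes
  # Taints on each node: master nodes should show a NoSchedule taint.
  kubectl describe nodes | grep -i taints
  # Label selector: only nodes that carry the (assumed) master role label.
  kubectl get nodes --selector node-role.kubernetes.io/master
  # Namespaces present in the cluster.
  kubectl get namespaces
}
```

Each command is read-only, so the function is safe to run against a cluster that is only reachable through a read-only context.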
<p style="border-bottom: 1px solid lightgrey;"></p>
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkmark.png?raw=true"><b>Steps</b></p>
## 3.3 OpenShift Container Platform ##
<TODO: Enter specific steps to perform the activity>
## 3.3.1 OpenShift Container Platform - Why ?
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Understanding what open-source means</b></p>
1. Use a search engine to look up the Kubernetes repository on GitHub.
2. Make a note of the licence that the Kubernetes project is available under.
3. On the GitHub page for the Kubernetes repository, click on 'Issues' (top left of the page), then by clicking on 'Sort', sort the issues in ascending date order. Note the age and severity of some of the issues.
4. Note the GitHub login of the individual who has raised issue #489.
5. In a browser navigate to the link https://landscape.cncf.io/ and make a mental note of the number of projects in the CNCF landscape.
## 3.3.2 OpenShift Container Platform - What Is It ? ##
OpenShift Container Platform from Red Hat is a platform as a service built on Kubernetes that supports
the full software development lifecycle:
<img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/graphics/3_3_1_openshift.PNG?raw=true">
## 3.3.3 OpenShift Container Platform Compared to Kubernetes ##
<table style="width:100%">
<tr>
<th><b>Feature / Aspect</b></th>
<th><b>Kubernetes</b></th>
<th><b>OpenShift Container Platform</b></th>
</tr>
<tr>
<td>Kubernetes support</td>
<td>100% compatible</td>
<td>100% compatible</td>
</tr>
<tr>
<td>Licence</td>
<td>Apache 2.0</td>
<td>Commercial</td>
</tr>
<tr>
<td>Linux distribution support</td>
<td>Most Debian-based distributions</td>
<td>Red Hat Enterprise Linux (CentOS for OKD)</td>
</tr>
<tr>
<td>Open source version</td>
<td>Kubernetes is 100% open source</td>
<td>OpenShift Community Edition (OKD)</td>
</tr>
<tr>
<td>Preferred method of chart/'Package' installation</td>
<td>Helm</td>
<td>operator</td>
</tr>
<tr>
<td>Command line interface</td>
<td>kubectl</td>
<td>kubectl and oc</td>
</tr>
<tr>
<td>Single node sand pit version</td>
<td>minikube, kind, microk8s</td>
<td>minishift</td>
</tr>
<tr>
<td>Default container engine</td>
<td>containerd</td>
<td>cri-o</td>
</tr>
<tr>
<td>Built in image registry ?</td>
<td>No</td>
<td>Yes</td>
</tr>
</table>
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true">3.3 Pods</h2>
## 3.4 Storage ##
<TODO: Content>
### 3.4.1 Kubernetes Storage Concepts ###
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: <TODO: Activity Name></b></p>
There are two storage options available when deploying a SQL Server 2019 big data cluster:
In this activity you will <TODO: Explain Activity>
- Ephemeral storage
<p style="border-bottom: 1px solid lightgrey;"></p>
- Persistent volume
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkmark.png?raw=true"><b>Steps</b></p>
Ephemeral storage, often also referred to as loopback storage, should never be used for production purposes. With ephemeral storage, the moment a pod is rescheduled to run on a different node, any data associated with that pod is lost. Ephemeral storage should only ever be used for sandpit-type environments. For all other use cases, persistent volumes should be used.
<TODO: Enter specific steps to perform the activity>
Three key entities are associated with persistent volumes:
<p style="border-bottom: 1px solid lightgrey;"></p>
1. Volume
This can be thought of in similar terms to a mount point for a Linux or Unix file system.
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true">3.4 Services (Networking)</h2>
2. Persistent Volume Claim (PVC)
A request for storage that will underpin the volume.
<TODO: Content>
3. Persistent Volume (PV)
A construct that maps directly to the underlying storage platform that persistent volume claims consume storage from.
A persistent volume claim associated with a persistent volume is said to be in a Bound state.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: <TODO: Activity Name></b></p>
The following deployment illustrates the use of a volume and a persistent volume claim for a SQL Server instance:
In this activity you will <TODO: Explain Activity>
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mssql-deployment
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: mssql
  template:
    metadata:
      labels:
        app: mssql
    spec:
      terminationGracePeriodSeconds: 1
      containers:
      - name: mssql
        image: mcr.microsoft.com/mssql/server:2017-latest
        ports:
        - containerPort: 1433
        env:
        # SQL Server will not start unless the EULA is accepted
        - name: ACCEPT_EULA
          value: "Y"
        - name: MSSQL_SA_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mssql
              key: SA_PASSWORD
        volumeMounts:
        # Mounts the volume declared below at the SQL Server data directory
        - name: mssqldb
          mountPath: /var/opt/mssql
      volumes:
      # The volume is backed by the persistent volume claim defined further down
      - name: mssqldb
        persistentVolumeClaim:
          claimName: mssql-data
---
apiVersion: v1
kind: Service
metadata:
  name: mssql-deployment
spec:
  selector:
    app: mssql
  ports:
    - protocol: TCP
      port: 1433
      targetPort: 1433
  type: LoadBalancer
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mssql-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: my-block-storage-class
```
<p style="border-bottom: 1px solid lightgrey;"></p>
### 3.4.2 Manual and Automatic Provisioning ###
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkmark.png?raw=true"><b>Steps</b></p>
Persistent Volumes can be provisioned in one of two different ways:
<TODO: Enter specific steps to perform the activity>
- **manually**
<p style="border-bottom: 1px solid lightgrey;"></p>
This requires that the cluster administrator must undertake manual activities in order to create the persistent volume.
- **automatically**
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true">3.5 Storage</h2>
Under an automatic provisioning scheme, once a persistent volume claim is created, a persistent volume is created automatically and the two are bound.
<TODO: Content>
### 3.4.3 Storage Classes ###
3.1.1 The Kubernetes Storage Sub System
The touch point for storage at the pod level is a volume. There are two critical Kubernetes objects that need to be created for storage to be made available to the volume:
When using persistent volumes, something known as a “Storage class” must be specified and a SQL Server 2019 big data cluster is no exception to this. Simply put, each storage platform that can be used to allocate storage to the cluster has its own storage class. Furthermore, the Kubernetes API allows users to create their own storage classes. There are two fundamental components in a SQL Server 2019 big data cluster that consume storage:
- A PersistentVolume (PV) is storage that has been provisioned manually or dynamically using Storage Classes. It is a resource in the cluster just like a node is a cluster resource. PVs have a lifecycle independent of any individual Pod that uses the PV.
- **The storage pool**
- A PersistentVolumeClaim (PVC) is a request for storage by a user that provides the bridge between a volume and persistent volume. For a persistent volume claim
For the storage of unstructured data in HDFS parquet format
- **The data pool**
While PersistentVolumeClaims allow a user to consume abstract storage resources, it is common that users need PersistentVolumes with varying properties, such as performance, for different problems. Cluster administrators need to be able to offer a variety of PersistentVolumes that differ in more ways than just size and access modes, without exposing users to the details of how those volumes are implemented. For these needs, there is the StorageClass resource.
For the storage of traditional SQL Server data
**Use Case: Collecting and Processing Telemetry Data**
Imagine an application that collects log files associated with telemetry data from Internet of Things (IoT) devices. The log files the application generates may account for hundreds of terabytes, even petabytes, of data. We want to avoid paying the ACID tax of storing this data in a relational database, so the best place for this data to land is the storage pool. Once the telemetry data has been collected, we may then wish to aggregate it for querying via T-SQL. There are three distinct patterns of usage here:
Containers run inside a pod, containers in a pod share the same life cycle and are always scheduled to run on the same node. Pods can either be stateless or stateful. One of the most fundamental tasks that Kubernetes carries out is to ensure that the desired state of a pod in terms of replicas and its actual state are one of the same.
- We want to use the raw horsepower of Spark to perform the bulk of the processing in the storage pool. A storage class associated with a platform optimized for high IO bandwidth is the best fit for this purpose.
Pods typically run in either a replicaset or a statefulset, if a replica dies, for example, a node might go offline, Kubernetes will schedule a new pod to run on a healthy node:
- We then want to aggregate the data in the data pool for rapid querying. A storage class associated with a storage platform optimized for low latency is ideally suited for this task.
Things get more nuanced when state is involved, when a pod that is stateful is scheduled to run on a different node, the state associated with that pod needs to Follow it from its original node to its new node. This can be achieved in one of two ways.
- Only a small subset of the data is being used for test and development purposes and because our production grade storage comes at a premium, we might want to use a storage class associated with a cheaper storage platform for this purpose.
### 3.4.4 Pod Mobility ###
Pods can be either stateless or stateful. One of the most fundamental tasks that Kubernetes carries out is to ensure that the desired state of a pod, in terms of replicas, and its actual state are one and the same. Pods typically run in either a ReplicaSet or a StatefulSet. If a replica dies, for example because a node goes offline, Kubernetes will schedule a new pod to run on a healthy node:
<img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/graphics/3_3_4_stateless.png?raw=true">
Things become more nuanced once state is involved. When a stateful pod is scheduled to run on a different node, the state associated with that pod needs to follow it from its original node to its new node. This can be achieved in one of two ways.
Storage Replication
- Storage Replication
Storage is replicated between nodes, such that if a pod needs to be rescheduled, it can be scheduled to run on a node that its state has been replicated to.
<img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/graphics/3_3_4_stateful_replicated.PNG?raw=true">
- Shared Storage
Each node in the cluster has access to the same storage. When a node fails, a pod can be re-scheduled to any other worker node in the cluster:
<img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/graphics/3_3_4_stateful_shared.PNG?raw=true">
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: <TODO: Activity Name></b></p>
Use the kubectl cheat sheet to familiarise yourself with various kubectl commands in order to carry out the following on the jump server:
- List the storage classes available to the workshop Kubernetes cluster
- List the persistent volume claims present for the workshop SQL Server 2019 big data cluster
- From the list of persistent volume claims obtained in the previous step, pick a persistent volume claim and inspect it in detail using kubectl describe.
- List the persistent volumes present for the workshop SQL Server 2019 big data cluster.
- From the list of persistent volumes obtained in the previous step, pick a persistent volume and inspect it in detail using kubectl describe.
- Create a new namespace using the following kubectl command:
```
kubectl create namespace my-namespace
```
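
The listing steps in this activity can be sketched as below. The function name `list_cluster_storage` is hypothetical, and the namespace passed to it is whatever namespace the workshop big data cluster was deployed into.

```shell
# list_cluster_storage shows storage classes, claims and volumes.
# Argument: $1 = namespace of the big data cluster (hypothetical, e.g. mssql-cluster).
list_cluster_storage() {
  # Storage classes available to the whole cluster.
  kubectl get storageclass
  # Persistent volume claims are namespaced, so name the namespace explicitly.
  kubectl get pvc --namespace "$1"
  # Persistent volumes are cluster-scoped, so no namespace is needed.
  kubectl get pv
}
```

To inspect a single object in detail, follow up with `kubectl describe pvc <claim name> --namespace <namespace>` or `kubectl describe pv <volume name>`, using names taken from the listings.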
Shared Storage
Each node in the cluster has access to the same storage. When a node fails, a pod can be re-scheduled to any other node in the cluster:
### 3.4.6 Access Modes ###
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: <TODO: Activity Name></b></p>
Microsoft database platform professionals will be familiar with the concept of CREATE DATABASE FOR ATTACH:
In this activity you will <TODO: Explain Activity>
```
CREATE DATABASE MyAdventureWorks
ON (FILENAME = 'C:\MySQLServer\AdventureWorks_Data.mdf'),
(FILENAME = 'C:\MySQLServer\AdventureWorks_Log.ldf')
FOR ATTACH;
```
This raises the question: if the persistent volumes for a Kubernetes cluster already exist, can they be attached to a big data cluster? The answer depends on what is known as the “access mode” of the persistent volume, of which there are three types:
- **ReadWriteOnce**
The volume can be mounted as read-write by a single node. This is usually associated with block storage platforms, the type of storage that SQL Server usually runs on, which is typically accessed via the iSCSI or Fibre Channel storage protocols.
- **ReadOnlyMany**
The volume can be mounted read-only by many nodes. This access mode and ReadWriteMany are usually associated with file-based storage platforms, such platforms are usually accessed via the NFS or SMB protocols.
- **ReadWriteMany**
The volume can be mounted as read-write by many nodes.
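
To see which access modes the existing persistent volumes offer, something along these lines can be used. It is only a sketch: the function name `show_access_modes` and the column headings are illustrative choices, while the field paths come from the PersistentVolume object itself.

```shell
# show_access_modes lists each persistent volume with its access modes,
# storage class and current phase (Available, Bound, Released or Failed).
show_access_modes() {
  kubectl get pv -o custom-columns='NAME:.metadata.name,ACCESS_MODES:.spec.accessModes,STORAGE_CLASS:.spec.storageClassName,STATUS:.status.phase'
}
```

A volume that lists only ReadWriteOnce can be mounted read-write by one node at a time, which is worth checking before attempting to reuse it.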
### 3.4.7 StatefulSets ###
The architecture of a SQL Server 2019 big data cluster contains components that are clustered by nature, such as storage pods in the storage pool:
<img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/graphics/3_4_7_bdc_architecture.PNG?raw=true">
Components of an application that are clustered have some special requirements which are not catered for by ReplicaSets. Per the [Kubernetes documentation](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/), clustered applications usually exhibit one or more of the following requirements:
- Stable, unique network identifiers.
- Stable, persistent storage.
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.
A key feature of a StatefulSet is that each member of a clustered application requires its own persistent volume claim. For this reason, a StatefulSet uses a persistent volume claim template:
```
volumeClaimTemplates:
- metadata:
    name: www
  spec:
    accessModes: [ "ReadWriteOnce" ]
    storageClassName: "my-storage-class"
    resources:
      requests:
        storage: 1Gi
```
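
Kubernetes names each claim generated from such a template after the template and the pod that owns it, in the form `<template name>-<StatefulSet name>-<ordinal>`. The helper below is a small sketch that prints the claim names to expect; the function name `expected_claims` and the StatefulSet name `web` in the usage note are hypothetical.

```shell
# expected_claims prints the claim names a StatefulSet will create.
# Arguments: $1 = volumeClaimTemplate name, $2 = StatefulSet name, $3 = replica count.
expected_claims() {
  claim_index=0
  while [ "$claim_index" -lt "$3" ]; do
    # One claim per replica, numbered from zero like the pods themselves.
    echo "$1-$2-$claim_index"
    claim_index=$((claim_index + 1))
  done
}
```

For the template above, `expected_claims www web 3` prints `www-web-0`, `www-web-1` and `www-web-2`, which can be compared against the output of `kubectl get pvc`.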
If the Kubernetes cluster's storage platform has a snapshot capability that can be used to refresh storage volumes, persistent volumes can be refreshed via snapshots. The basic workflow for this is:
- ```kubectl taint nodes <node name> key=value:NoSchedule```
- ```kubectl drain <node name>```
- ```kubectl scale statefulsets <stateful-set-name> --replicas=0```
- Overwrite StatefulSet persistent volumes using snapshot(s)
- ```kubectl scale statefulsets <stateful-set-name> --replicas=<original replica count>```
- ```kubectl taint nodes <node name> key=value:NoSchedule-```
### 3.4.8 Considerations for Choosing Storage ###
- **cost**
  - Is this CAPEX or OPEX, and is it priced on capacity and/or IOPS?
- **availability**
  - How available is the platform to serve IO in the event that it suffers a component failure?
  - Can the platform still serve IO if a data center or availability zone is lost?
- **durability**
  - How durable is the data once it is written?
- **performance**
  - Does the storage platform meet the latency and IO bandwidth requirements of the application?
- **security**
  - What security features does the storage platform come with?
  - If the Kubernetes cluster is to be used in a regulated industry that mandates certain security certifications, does the platform adhere to these?
- **supportability**
  - Is the platform open source or commercially supported?
  - Does the organization the Kubernetes cluster is deployed at belong to a regulated industry?
- **manageability**
  - How easy is the platform to manage?
  - What management tools does the platform come with?
  - How easy is it to add storage capacity to the platform?
  - What data protection tools does the platform come with?
  - Does the platform require any scripting or programming expertise in order to manage it?
  - Does the platform need to provide integration for any existing management frameworks and/or monitoring solutions?
- **interoperability**
  - Does the storage platform support any industry standard interfaces? Kubernetes is moving towards the container storage interface (CSI) as a standard, and platforms that support this can be seamlessly interchanged.
  - Does the platform need to provide interoperability with existing infrastructure, virtualized infrastructure for example?
  - Does the organization have a preference for storage protocol support (iSCSI, Fibre Channel, NFS, SMB etc.)? If so, does the platform support this?
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Utilising Kubernetes Persistent Storage Volumes</b></p>
1. Follow the plugin instructions in section 2 to install the Kubernetes storage plugin.
2. A test-pvc.yaml file should be present in the home directory of the sandpit environment. The contents of the file should be as follows:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: block-storage-class
```
The storage class name will vary depending on which vendor has supplied the workshop hardware.
3. Create the persistent volume claim as follows:
```
kubectl apply -f test-pvc.yaml
```
4. List all the persistent volume claims present in your sandpit environment cluster:
```
kubectl get pvc --all-namespaces
```
Note the status of the test-pvc persistent volume claim.
5. Obtain a detailed description of the test-pvc persistent volume claim:
```
kubectl describe pvc test-pvc
```
6. List all the persistent volumes present in your sandpit environment cluster:
```
kubectl get pv
```
For a storage class that provides automatic provisioning, the persistent volume is created automatically for the test-pvc claim.
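
To confirm that the claim has been bound to an automatically provisioned volume, its phase can be polled. This is a sketch only: the function names `pvc_phase` and `wait_for_bound` are hypothetical, and the 30 attempts at 2 second intervals are arbitrary values. Some storage classes delay binding until a pod actually uses the claim, in which case the claim stays Pending until then.

```shell
# pvc_phase prints the phase of a claim: Pending, Bound or Lost.
pvc_phase() {
  kubectl get pvc "$1" -o jsonpath='{.status.phase}'
}

# wait_for_bound polls until the claim is Bound, giving up after about a minute.
wait_for_bound() {
  attempts=0
  while [ "$attempts" -lt 30 ]; do
    if [ "$(pvc_phase "$1")" = "Bound" ]; then
      echo "$1 is Bound"
      return 0
    fi
    attempts=$((attempts + 1))
    sleep 2
  done
  echo "$1 did not reach Bound" >&2
  return 1
}
```

Calling `wait_for_bound test-pvc` after step 3 reports whether the claim was bound. Once finished, `kubectl delete pvc test-pvc` removes the test claim.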
<p style="border-bottom: 1px solid lightgrey;"></p>
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkmark.png?raw=true"><b>Steps</b></p>
<TODO: Enter specific steps to perform the activity>
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true">3.6 Management and Monitoring</h2>
<TODO: Content>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: <TODO: Activity Name></b></p>
In this activity you will <TODO: Explain Activity>
<p style="border-bottom: 1px solid lightgrey;"></p>
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkmark.png?raw=true"><b>Steps</b></p>
<TODO: Enter specific steps to perform the activity>
<p style="border-bottom: 1px solid lightgrey;"></p>
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/owl.png?raw=true"><b> For Further Study</b></p>
<ul>
<li><a href="<TODO: Enter Link address>" target="_blank"><TODO: Enter Name of Link></a> <TODO: Enter Explanation of Why the link is useful.</li>
</ul>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/geopin.png?raw=true"><b >Next Steps</b></p>
Next, continue to <a href="https://github.com/microsoft/sqlworkshops/blob/master/k8stobdc/KubernetesToBDC/04-bdc.md"><i> Module 4 - Big Data Clusters</i></a>.
## 3.5 Troubleshooting ##

View file

@ -19,7 +19,7 @@ This module covers Container technologies and how they are different than Virtua
SQL Server (starting with version 2019) provides three ways to work with large sets of data:
- **Data Virtualization**: Query multiple sources of data technologies using the Polybase SQL Server feature <i>(data left at source)</i>
- **Data Virtualization**: Query multiple sources of data technologies using the PolyBase SQL Server feature <i>(data left at source)</i>
- **Storage Pools**: Create sets of disparate data sources that can be queried from Distributed Data sets <i>(data ingested into sharded databases using PolyBase)</i>
- **SQL Server Big Data Clusters**: Create, manage and control clusters of SQL Server Instances that co-exist in a Kubernetes cluster with Apache Spark and other technologies to access and process large sets of data <i>(Data left in place, ingested through PolyBase, and into/through HDFS)</i>
@ -45,12 +45,12 @@ To leverage PolyBase, you first define the external table using a specific set o
</table>
<br>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png?raw=true"><b>Activity: Review PolyBase Solution</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Review PolyBase Solution</b></p>
In this section you will review a solution tutorial similar to one you will perform later. You'll see how to create a reference to an HDFS file store and query it within SQL Server as if it were a standard internal table.
<br>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Open <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-hdfs-storage-pool?view=sqlallproducts-allversions" target="_blank">this reference and locate numbers 4-5 of the steps in the tutorial</a>. This explains the two steps required to create and query an External table. *Only review this information; you will perform these steps in another Module*.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Open <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-hdfs-storage-pool?view=sqlallproducts-allversions" target="_blank">this reference and locate numbers 4-5 of the steps in the tutorial</a>. This explains the two steps required to create and query an External table. *Only review this information; you will perform these steps in another Module*.</p>
<br>
<p style="border-bottom: 1px solid lightgrey;"></p>
@ -148,7 +148,7 @@ These components are used in the Compute Pool of the BDC:
<h3>BDC: App Pool</h3>
The App Pool is a set of Pods within a Node that hold multiple types of end-points into the system. SQL Server Integration Services lives in the App Pool, and other Job systems are possible. You could instatiate a long-running job (such as IoT streaming) or Machine Learning (ML) endpoints used for scoring a prediction or returning a classification.
The App Pool is a set of Pods within a Node that hold multiple types of end-points into the system. SQL Server Integration Services lives in the App Pool, and other Job systems are possible. You could instantiate a long-running job (such as IoT streaming) or Machine Learning (ML) endpoints used for scoring a prediction or returning a classification.
These components are used in the Compute Pool of the BDC:
@ -195,18 +195,18 @@ These components are used in the Storage Pool of the BDC:
</table>
<br>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Review Data Pool Solution</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Review Data Pool Solution</b></p>
In this section you will review the solution tutorial similar to the one you will perform in a future step. You'll see how to load data into the Data Pool.
<br>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Open <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-data-pool-ingest-sql?view=sqlallproducts-allversions" target="_blank">this reference and review the steps in the tutorial</a>. This explains the two steps required to create and load an External table in the Data Pool. You'll perform these steps in the <i>Operationalization</i> Module later. *Only review this information at this time. You will perform these steps in another Module.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true">Open <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-data-pool-ingest-sql?view=sqlallproducts-allversions" target="_blank">this reference and review the steps in the tutorial</a>. This explains the two steps required to create and load an External table in the Data Pool. You'll perform these steps in the <i>Operationalization</i> Module later. *Only review this information at this time. You will perform these steps in another Module.</p>
<br>
<p style="border-bottom: 1px solid lightgrey;"></p>
<br>
<p><img style="margin: 0px 15px 15px 0px;" src="../graphics/owl.png"><b>For Further Study</b></p>
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/owl.png?raw=true"><b>For Further Study</b></p>
<ul>
<li><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sqlallproducts-allversions" target="_blank">Official Documentation for this section</a></li>
<li><a href = "https://cloudblogs.microsoft.com/sqlserver/2018/09/26/sql-server-2019-celebrating-25-years-of-sql-server-database-engine-and-the-path-forward/" target="_blank">Update on 2019 Blog</a></li>


@ -14,18 +14,18 @@ This module covers Container technologies and how they are different than Virtua
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="4-0">4.0 End-To-End Solution with BDC</a></h2>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true"><a name="4-0">4.0 End-To-End Solution with BDC</a></h2>
Recall from <i>The Big Data Landscape</i> module that you learned about the Wide World Importers company. <a href="https://azure-scenarios-experience.azurewebsites.net/big-data.html" target="_blank">Wide World Importers </a> (WWI) is a traditional brick and mortar business with a long track record of success, generating profits through strong retail store sales of their unique offering of affordable products from around the world. They have a traditional N-tier application that uses a front-end (mobile, web and installed) that interacts with a scale-out middle-tier software product, which in turn stores data in a large SQL Server database that has been scaled-up to meet demand.
<br>
<img style="height: 150; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/WWI-002.png">
<img style="height: 150; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/WWI-002.png?raw=true">
<br>
WWI has now added web and mobile commerce to their platform, which has generated a significant amount of additional data, and data formats. These new platforms were added without integrating into the OLTP system data or Business Intelligence infrastructures. As a result, "silos" of data stores have developed, and ingesting all of this data exceeds the scale of their current RDBMS server:
<br>
<img style="height: 300; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/WWI-003.png">
<img style="height: 300; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/WWI-003.png?raw=true">
<br>
This presented the following four challenges - the IT team at WWI needs to:
@ -49,89 +49,89 @@ To meet these challenges, the following solution is proposed. Using the BDC plat
The following diagram illustrates the complete solution, which you can use to brief your audience:
<br>
<img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/bdcsolution1.png">
<img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/bdcsolution1.png?raw=true">
<br>
In the following sections you'll dive deeper into how this scale is used to solve the rest of the challenges.
<p style="border-bottom: 1px solid lightgrey;"></p>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="4-1">4.1 Data Virtualization - <i>Challenge 2: Multiple Data Sources</i></a></h2>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true"><a name="4-1">4.1 Data Virtualization - <i>Challenge 2: Multiple Data Sources</i></a></h2>
The next challenge the IT team must solve is to enable a single data query to work across multiple disparate systems, optionally joining to internal SQL Server Tables, and also at scale.
Using the Data Virtualization capability you saw in the <i>02 - SQL Server BDC Components</i> Module, the IT team creates External Tables using the PolyBase feature. These External Table definitions are stored in the database on the SQL Server Master Instance within the cluster. When queried by the user, the queries are engaged from the SQL Server Master Instance through the Compute Pool in the SQL Server BDC, which holds Kubernetes Nodes containing the Pods running SQL Server Instances. These Instances send the query to the PolyBase Connector at the target data system, which processes the query based on the type of target system. The results are processed and returned through the PolyBase Connector to the Compute Pool and then on to the Master Instance, and then on to the user.
<br>
<img style="height: 250; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/bdcsolution2.png">
<img style="height: 250; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/bdcsolution2.png?raw=true">
<br>
This process not only allows a single query to reach disparate systems, it also lets those remote systems hold extremely large sets of data. You normally query only a subset of that data, so only the results are sent back over the network. These results can be joined with internal tables for a single view, all from within the same Transact-SQL statement.
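As a rough sketch of what such a joined query might look like, the shell function below sends one Transact-SQL statement to the Master Instance with the `sqlcmd` utility. Every object name in it (`web_clickstreams_hdfs`, `items`, and the column names) is a hypothetical placeholder and not taken from this workshop. The endpoint, login, and database are passed in as arguments, and the password is assumed to be supplied through the `SQLCMDPASSWORD` environment variable.

```shell
# Join a PolyBase External Table to a local table in one T-SQL statement.
# All object names (web_clickstreams_hdfs, items, the column names) are
# hypothetical placeholders; substitute the ones from your own cluster.
query_external_join() {
    # $1 = Master Instance endpoint (host,port), $2 = login, $3 = database
    # sqlcmd reads the password from the SQLCMDPASSWORD environment variable.
    sqlcmd -S "$1" -U "$2" -d "$3" -Q "
SELECT TOP (10) i.item_name, COUNT(*) AS clicks
FROM dbo.web_clickstreams_hdfs AS w   -- External Table over remote data
JOIN dbo.items AS i                   -- ordinary local table
  ON i.item_id = w.item_id
GROUP BY i.item_name
ORDER BY clicks DESC;"
}

# Example call (placeholders, not real values):
#   query_external_join "<MasterInstanceIP>,<Port>" "<Login>" "<Database>"
```

The External Table appears in the `FROM` clause like any other table, so the join itself needs no special syntax.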
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Load and query data in an External Table</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Load and query data in an External Table</b></p>
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any Polybase target.
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any PolyBase target.
<b>Steps</b>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-load-sample-data?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This loads your data in preparattion for the next Activity.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-hdfs-storage-pool?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This step shows you how to create and query an External table.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-oracle?view=sqlallproducts-allversions" target="_blank">(Optional) Open this reference, and review the instructions you see there</a>. (You must have an Oracle server that your BDC can reach to perform these steps, although you can review them if you do not.)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-load-sample-data?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This loads your data in preparation for the next Activity.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-hdfs-storage-pool?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This step shows you how to create and query an External table.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-oracle?view=sqlallproducts-allversions" target="_blank">(Optional) Open this reference, and review the instructions you see there</a>. (You must have an Oracle server that your BDC can reach to perform these steps, although you can review them if you do not.)</p>
<br>
<p style="border-bottom: 1px solid lightgrey;"></p>
<br>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="4-2">4.2 Creating a Distributed Data solution using big data cluster - <i>Challenge 3: Deep Analytics</i></a></h2>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true"><a name="4-2">4.2 Creating a Distributed Data solution using big data cluster - <i>Challenge 3: Deep Analytics</i></a></h2>
Ad-hoc queries are very useful for many scenarios. There are times when you would like to bring the data into storage, so that you can create denormalized representations of datasets, aggregated data, and other purpose-specific data tasks.
<br>
<img style="height: 250; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/bdcsolution3.png">
<img style="height: 250; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/bdcsolution3.png?raw=true">
<br>
Using the Data Virtualization capability you saw in the <i>02 - BDC Components</i> Module, the IT team creates External Tables using PolyBase statements. These External Table definitions are stored in the database on the SQL Server Master Instance within the cluster. When queried by the user, the queries are engaged from the SQL Server Master Instance through the Compute Pool in the SQL Server BDC, which holds Kubernetes Nodes containing the Pods running SQL Server Instances. These Instances send the query to the PolyBase Connector at the target data system, which processes the query based on the type of target system. The results are processed and returned through the PolyBase Connector to the Compute Pool and then on to the Master Instance, and the PolyBase statements can specify the target of the Data Pool. The SQL Server Instances in the Data Pool store the data in a distributed fashion across multiple databases, called <i>Shards</i>.
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Load and query data into the Data Pool</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Load and query data into the Data Pool</b></p>
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to load data into the Data Pool.
<b>Steps</b>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-data-pool-ingest-sql?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform the instructions you see there</a>. This loads data into the Data Pool.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-data-pool-ingest-sql?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform the instructions you see there</a>. This loads data into the Data Pool.</p>
<br>
<p style="border-bottom: 1px solid lightgrey;"></p>
<br>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="4-3">4.3 Querying HDFS Data using big data cluster - <i>Challenge 4: Enable AI</i></a></h2>
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/pencil2.png?raw=true"><a name="4-3">4.3 Querying HDFS Data using big data cluster - <i>Challenge 4: Enable AI</i></a></h2>
There are three primary uses for a large cluster of data processing systems for Machine Learning and AI applications. The first is that the users will be involved in the creation of the <a href="https://www.codeingschool.com/2018/09/what-are-features-and-labels-in-machine-learning.html" target="_blank">Features used in various ML and AI algorithms, and are often tasked to Label</a> the data. These users can access the Data Pool and Data Storage data stores directly to query and assist with this task.
The SQL Server Master Instance in the BDC installs with <a href="https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-ver15" target="_blank">Machine Learning Services</a>, which allow creation, training, evaluation and presisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
The SQL Server Master Instance in the BDC installs with <a href="https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-ver15" target="_blank">Machine Learning Services</a>, which allow creation, training, evaluation and persisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
<br>
<img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/bdcsolution4.png">
<img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/bdcsolution4.png?raw=true">
<br>
The Data Scientist has another option to create and train ML and AI models. The Spark platform within the Storage Pool is accessible through the Knox gateway, using Livy to send Spark Jobs as you learned about in the <i>02 - SQL Server BDC Components</i> Module. This gives access to the full Spark platform, using Jupyter Notebooks (included in <i>Azure Data Studio</i>) or any other standard tools that can access Spark through REST calls.
<br>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Load data with Spark, run a Spark Notebook</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/point1.png?raw=true"><b>Activity: Load data with Spark, run a Spark Notebook</b></p>
<br>
In this activity, you will load the sample data into your big data cluster environment using Spark, and use a Notebook in Azure Data Studio to work with it.
<b>Steps</b>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-data-pool-ingest-spark?view=sqlallproducts-allversions" target="_blank">Open this reference, and follow the instructions you see there</a>. This loads the data in preparation for the Notebook operations.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-notebook-spark?view=sqlallproducts-allversions" target="_blank">Open this reference, and follow the instructions you see there</a>. This simple example shows you how to work with the data you ingested into the Storage Pool using Spark.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-data-pool-ingest-spark?view=sqlallproducts-allversions" target="_blank">Open this reference, and follow the instructions you see there</a>. This loads the data in preparation for the Notebook operations.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/checkbox.png?raw=true"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-notebook-spark?view=sqlallproducts-allversions" target="_blank">Open this reference, and follow the instructions you see there</a>. This simple example shows you how to work with the data you ingested into the Storage Pool using Spark.</p>
<br>
<p style="border-bottom: 1px solid lightgrey;"></p>
<br>
<p><img style="margin: 0px 15px 15px 0px;" src="../graphics/owl.png"><b>For Further Study</b></p>
<p><img style="margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/owl.png?raw=true"><b>For Further Study</b></p>
<ul>
<li><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sqlallproducts-allversions" target="_blank">Official Documentation for this section</a></li>
<li><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/data-ingestion-curl?view=sqlallproducts-allversions" target="_blank">Use curl to load data into HDFS on SQL Server 2019 big data clusters</a></li>


@ -0,0 +1,124 @@
# SQL Server 2019 big data cluster
Single-Node Cluster on an Azure Virtual Machine (Unsupported for production - classroom only)
In this set of instructions you'll set up a single-Node SQL Server 2019 big data cluster using Ubuntu on
a Microsoft Azure Virtual Machine.
NOTE: This is an unsupported configuration, and should be used only for classroom purposes.
Carefully read the instructions for the parameters you need to replace for your specific
subscription and parameters.
-------------------------------------------------------------------------------------------------------------
## Running these Instructions
These instructions use shell commands, such as PowerShell, bash, or the CMD window from a
system that has the Secure Shell software installed (SSH). You can type:
ssh -h
To see if this tool is installed.
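If you prefer to check for several tools at once, a small helper along these lines may be useful. This is only a sketch: the function name `check_tools` is made up for illustration, and it relies on the standard `command -v` lookup, so it applies to bash or another POSIX shell and not to PowerShell or CMD.

```shell
# Report which of the named tools can be found on the PATH.
# Returns 0 when every tool is found, 1 when at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_tools ssh az
```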
You can copy-and-paste from the lines that show the commands, or you can set your IDE to run the current line
in a Terminal window. (In Visual Studio Code or Azure Data Studio, these are called "Keybindings"):
https://code.visualstudio.com/docs/getstarted/keybindings
You can set this to any key you like:
(Preferences | Keyboard Shortcuts | Terminal: Run Selected Text in Active Terminal)
-------------------------------------------------------------------------------------------------------------
## References
This Notebook uses the script located here:
https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-script-single-node-kubeadm?view=sql-server-ver15
and that reference supersedes the information in the steps listed below.
You can also create a SQL Server Big Data Cluster on the Azure Kubernetes Service (AKS):
Those instructions are located here: https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sql-server-ver15
For a complete workshop on SQL Server 2019's big data clusters, see this reference:
https://github.com/Microsoft/sqlworkshops/tree/master/sqlserver2019bigdataclusters
-------------------------------------------------------------------------------------------------------------
### Step 1: Log in to Azure
az login
-------------------------------------------------------------------------------------------------------------
### Step 2: Set your account - show the accounts, replace <YourAccountNameHere> with your account name
az account list --output table
az account set --subscription "<YourAccountNameHere>"
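To confirm that the subscription was switched, you can ask the Azure CLI which account is currently active. The helper name `active_subscription` below is illustrative only. It wraps `az account show` and prints just the subscription name.

```shell
# Print the name of the subscription the Azure CLI is currently using.
active_subscription() {
    az account show --query name --output tsv
}

# Example: active_subscription
```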
-------------------------------------------------------------------------------------------------------------
### Step 3: Create a Resource Group, and a Virtual Machine - Look for values with the <Replace> characters to change to your values
#### (Note: This needs a machine size large enough to run BDC that also supports Nested Virtualization)
az group create -n <ResourceGroupName> -l eastus2
az vm create -n <VMName> -g <ResourceGroupName> -l eastus2 --image UbuntuLTS --os-disk-size-gb 200 --storage-sku Premium_LRS --admin-username bdcadmin --admin-password <ReplaceWithPassword> --size Standard_D8s_v3 --public-ip-address-allocation static
ssh -X bdcadmin@<ReplaceWithIPAddressThatReturnsFromLastCommand>
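If the IP address has scrolled out of view, it can be looked up again from the Azure CLI. The sketch below assumes the Resource Group and Virtual Machine names you chose above; the function name `get_vm_ip` is made up for illustration. It uses `az vm show` with `--show-details`, which includes the `publicIps` field in its output.

```shell
# Print the public IP address of a Virtual Machine.
get_vm_ip() {
    # $1 = Resource Group name, $2 = Virtual Machine name
    az vm show --show-details --resource-group "$1" --name "$2" \
        --query publicIps --output tsv
}

# Example (placeholders, replace with your own names):
#   ssh -X bdcadmin@"$(get_vm_ip "<ResourceGroupName>" "<VMName>")"
```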
### Step 4: Update and Upgrade VM
sudo apt-get update
sudo apt-get upgrade
sudo apt autoremove
-------------------------------------------------------------------------------------------------------------
### Step 5: (Optional) Install an XWindows server
sudo apt-get install xorg openbox
sudo reboot
#### After about 5 minutes:
ssh -X bdcadmin@<ReplaceWithVMIP>
sudo apt-get install gnome-core
sudo reboot
#### After about 5 minutes:
ssh -X bdcadmin@<ReplaceWithVMIP>
sudo sed -i 's/allowed_users=console/allowed_users=anybody/' /etc/X11/Xwrapper.config
mkdir -p /home/bdcadmin/.config/nautilus
cd ~
mkdir -p ./Downloads
touch /home/bdcadmin/.gtk-bookmarks
wget https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/ads.deb
#### Check XWindows - Note, requires that you have XWindows software installed on your laptop
nautilus &
-------------------------------------------------------------------------------------------------------------
### Step 6: Install BDC Single Node - Pre-requisites (Current as of 1/31/2020)
sudo apt update && sudo apt upgrade -y
sudo reboot
#### After about 5 minutes:
ssh -X bdcadmin@<ReplaceWithVMIP>
-------------------------------------------------------------------------------------------------------------
### Step 7: Download the BDC Setup script, mark it as executable, and run it
curl --output setup-bdc.sh https://raw.githubusercontent.com/microsoft/sql-server-samples/master/samples/features/sql-big-data-cluster/deployment/kubeadm/ubuntu-single-node-vm/setup-bdc.sh
chmod +x setup-bdc.sh
sudo ./setup-bdc.sh
-------------------------------------------------------------------------------------------------------------
### Step 8: Set up the path and check the installation
source ~/.bashrc
azdata --version
kubectl get pods
#### You can now use the system.
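The pods can take a while to start, so it may help to count how many are not yet up before using the system. The following is a sketch only, with a made-up function name (`count_pods_not_ready`). It looks across all namespaces so that it does not have to assume the name the setup script gave the cluster, and it treats any status other than `Running` or `Completed` as not ready.

```shell
# Count pods whose STATUS column is neither Running nor Completed.
# With --all-namespaces and --no-headers the columns are:
#   NAMESPACE NAME READY STATUS RESTARTS AGE
count_pods_not_ready() {
    kubectl get pods --all-namespaces --no-headers |
        awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Example: count_pods_not_ready
```

A result of 0 suggests every pod has started; any other number means some pods are still pending or failing.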
-------------------------------------------------------------------------------------------------------------
## Cleanup - Erase everything
### Only perform this step when you are done experimenting with the system...
az group delete --name <ResourceGroupName>
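After the delete finishes, you may want to verify that nothing was left behind. The sketch below uses `az group exists`, which prints `true` or `false`. The function name `confirm_group_deleted` is illustrative only.

```shell
# Report whether a Resource Group is gone. Returns 1 if it still exists.
confirm_group_deleted() {
    # $1 = Resource Group name
    if [ "$(az group exists --name "$1")" = "false" ]; then
        echo "deleted: $1"
    else
        echo "still present: $1"
        return 1
    fi
}

# Example (placeholder): confirm_group_deleted "<ResourceGroupName>"
```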


@ -8,7 +8,7 @@
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/textbubble.png?raw=true"><b> About this Workshop</b></h2>
Welcome to this Microsoft solutions workshop on *Kubernetes - From Bare Metal to SQL Server Big Data Clusters*. In this workshop, you'll learn about setting up a production grade SQL Server 2019 big data cluster environment on Kubernetes. Topics covered include: hardware, virtualization, and Kubernetes, with a full deployment of SQL Server's Big Data Cluster on the environment that you will use in the class. You'll then walk through a set of Jupyter Notebooks in Azure Data Studio to run T-SQL, Spark, and Machine Learning workloads on the cluster. You'll also receive valuable resources to learn more and go deeper on Linux, Containers, Kubernetes and SQL Server big data clusters.
Welcome to this Microsoft solutions workshop on *Kubernetes - From Bare Metal to SQL Server Big Data Clusters*. In this workshop, you'll learn about setting up a production grade SQL Server 2019 big data cluster environment on Kubernetes. Topics covered include: hardware, virtualization, and Kubernetes, with a full deployment of SQL Server's Big Data Cluster on the environment that you will use in the class. You'll then walk through a set of [Jupyter Notebooks](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html) in Microsoft's [Azure Data Studio](https://docs.microsoft.com/en-us/sql/azure-data-studio/what-is?view=sql-server-ver15) tool to run T-SQL, Spark, and Machine Learning workloads on the cluster. You'll also receive valuable resources to learn more and go deeper on Linux, Containers, Kubernetes and SQL Server big data clusters.
The focus of this workshop is to understand the hardware, software, and environment you need to work with [SQL Server 2019's big data clusters](https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15) on a Kubernetes platform.
@ -27,7 +27,7 @@ This README.MD file explains how the workshop is laid out, what you will learn,
In this workshop you'll learn:
<br>
- How Containers and Kubernetes work and where you can use them
- How Containers and Kubernetes work and when and where you can use them
- Hardware considerations for setting up a production Kubernetes Cluster on-premises
- Considerations for Virtual and Cloud-based environments for a production Kubernetes Cluster
@ -86,7 +86,7 @@ or
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/education1.png?raw=true"><b> Workshop Details</b></h2>
This workshop uses <TODO: enter main technologies used to solve the sceanrio>, with a focus on <TODO: architecture and implementation, development and use, etc>.
This workshop uses Kubernetes to deploy a workload, with a focus on Microsoft SQL Server's big data clusters deployment for Big Data and Data Science workloads.
<table style="tr:nth-child(even) {background-color: #f2f2f2;}; text-align: left; display: table; border-collapse: collapse; border-spacing: 5px; border-color: gray;">
@ -126,4 +126,13 @@ This is a modular workshop, and in each section, you'll learn concepts, technolo
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="https://github.com/microsoft/sqlworkshops/blob/master/graphics/geopin.png?raw=true"><b> Next Steps</b></h2>
Next, Continue to <a href="00-prerequisites.md" target="_blank"><i> Pre-Requisites</i></a>
Next, Continue to <a href="00-prerequisites.md" target="_blank"><i> Pre-Requisites</i></a>
**Workshop Authors and Contributors**
- [The Microsoft SQL Server Team](http://microsoft.com/sql)
- [Chris Adkin](https://www.linkedin.com/in/wollatondba/), Pure Storage
**Legal Notice**
*Kubernetes and the Kubernetes logo are trademarks or registered trademarks of The Linux Foundation in the United States and/or other countries. The Linux Foundation and other parties may also have trademark rights in other terms used herein. This Workshop is not certified, accredited, affiliated with, nor endorsed by Kubernetes or The Linux Foundation.*

Binary file not shown: k8stobdc/graphics/.DS_Store (vendored, Normal file)
Binary file not shown: k8stobdc/graphics/3_2_7_kubespray-flow.png (Normal file; After: 62 KiB)
Binary file not shown: k8stobdc/graphics/3_2_9_kubectl.png (Normal file; After: 48 KiB)
Binary file not shown: k8stobdc/graphics/3_3_1_openshift.PNG (Normal file; After: 119 KiB)
Binary file not shown: k8stobdc/graphics/3_3_4_stateful_replicated.PNG (Normal file; After: 187 KiB)
Binary file not shown: k8stobdc/graphics/3_3_4_stateful_shared.PNG (Normal file; After: 185 KiB)
Binary file not shown: k8stobdc/graphics/3_3_4_stateless.png (Normal file; After: 191 KiB)
Binary file not shown: k8stobdc/graphics/3_4_7_bdc_architecture.PNG (Normal file; After: 418 KiB)
Binary file not shown: k8stobdc/graphics/KubernetesCluster.png (Before: 313 KiB)
Binary file not shown: k8stobdc/graphics/WWI-001.png (Before: 73 KiB)
Binary file not shown: k8stobdc/graphics/WWI-002.png (Before: 23 KiB)
Binary file not shown: k8stobdc/graphics/WWI-003.png (Before: 224 KiB)
Binary file not shown: k8stobdc/graphics/WWI-logo.png (Before: 15 KiB)
Binary file not shown: k8stobdc/graphics/adf.png (Before: 41 KiB)
Binary file not shown: k8stobdc/graphics/ads.png (Before: 239 KiB)
Binary file not shown: k8stobdc/graphics/bookpencil.png (Before: 1.8 KiB)
Binary file not shown: k8stobdc/graphics/building1.png (Before: 4.3 KiB)
Binary file not shown: k8stobdc/graphics/bulletlist.png (Before: 1.1 KiB)
Binary file not shown: k8stobdc/graphics/checkbox.png (Before: 180 B)
Binary file not shown: k8stobdc/graphics/checkmark.png (Before: 1.7 KiB)
Binary file not shown: k8stobdc/graphics/clipboardcheck.png (Before: 2.6 KiB)
Binary file not shown: k8stobdc/graphics/cloud1.png (Before: 3.2 KiB)
Binary file not shown: k8stobdc/graphics/datamart.png (Before: 79 KiB)
Binary file not shown: k8stobdc/graphics/datavirtualization.png (Before: 149 KiB)
Binary file not shown: k8stobdc/graphics/education1.png (Before: 3.5 KiB)
Binary file not shown: k8stobdc/graphics/factory.png (Before: 2.6 KiB)
Binary file not shown: k8stobdc/graphics/geopin.png (Before: 3.2 KiB)
Binary file not shown: k8stobdc/graphics/hdfs.png (Before: 240 KiB)
Binary file not shown: k8stobdc/graphics/kubectl.png (Before: 383 KiB)
Binary file not shown: k8stobdc/graphics/listcheck.png (Before: 2.0 KiB)
Binary file not shown: k8stobdc/graphics/microsoftlogo.png (Before: 2.4 KiB)
Binary file not shown: k8stobdc/graphics/owl.png (Before: 3.9 KiB)
Binary file not shown: k8stobdc/graphics/paperclip1.png (Before: 1.6 KiB)
Binary file not shown: k8stobdc/graphics/pencil2.png (Before: 2.7 KiB)
Binary file not shown: k8stobdc/graphics/pinmap.png (Before: 3.6 KiB)
Binary file not shown: k8stobdc/graphics/point1.png (Before: 2.9 KiB)
Binary file not shown: k8stobdc/graphics/solutiondiagram.png (Before: 19 KiB)
Binary file not shown: k8stobdc/graphics/spark.jpg (Before: 216 KiB)
Binary file not shown: k8stobdc/graphics/spark.png (Before: 166 KiB)
Binary file not shown: k8stobdc/graphics/sqlbdc.png (Before: 467 KiB)
Binary file not shown: k8stobdc/graphics/textbubble.png (Before: 1.5 KiB)
Binary file not shown: sqlserver2019bigdataclusters/.DS_Store (vendored, Normal file)

Просмотреть файл

@ -22,7 +22,7 @@ The other requirements are:
- **The pip3 Package**: The Python package manager *pip3* is used to install various BDC deployment and configuration tools.
- **The kubectl program**: The *kubectl* program is the command-line control feature for Kubernetes.
- **The azdata utility**: The *azdata* program is the deployment and configuration tool for BDC.
- **Azure Data Studio**: The *Azure Data Studio* IDE, along with various Extensions, is used for deploying the system, and querying and management of the BDC. In addition, you will use this tool to participate in the workshop. Note: You can connect to a SQL Server 2019 Big Data Cluster using any SQL Server connection tool or applicaiton, such as SQL Server Management Studio, but this course will use Microsoft Azure Data Studio for cluster management, Jupyter Notebooks and other capabilities.
- **Azure Data Studio**: The *Azure Data Studio* IDE, along with various Extensions, is used for deploying the system, and querying and management of the BDC. In addition, you will use this tool to participate in the workshop. Note: You can connect to a SQL Server 2019 Big Data Cluster using any SQL Server connection tool or application, such as SQL Server Management Studio, but this course will use Microsoft Azure Data Studio for cluster management, Jupyter Notebooks and other capabilities.
*Note that all following activities must be completed prior to class - there will not be time to perform these operations during the workshop.*
@ -71,7 +71,7 @@ Get-WindowsUpdate
Install-WindowsUpdate
</pre>
*Note: If you get an error during this update process, evaluate it to see if it is fatal. You may recieve certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
*Note: If you get an error during this update process, evaluate it to see if it is fatal. You may receive certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png">Install Big Data Cluster Tools</p>
@ -97,7 +97,7 @@ Get-WindowsUpdate
Install-WindowsUpdate
</pre>
*Note 1: If you get an error during this update process, evaluate it to see if it is fatal. You may recieve certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
*Note 1: If you get an error during this update process, evaluate it to see if it is fatal. You may receive certain driver errors if you are using a Virtual Machine, this can be safely ignored.*
**Note 2: If you are using a Virtual Machine in Azure, power off the Virtual Machine using the Azure Portal every time you are done with it. Turning off the VM using just the Windows power off in the VM only stops it running, but you are still charged for the VM if you do not stop it from the Portal. Stop the VM from the Portal unless you are actively using it.**

View file

@ -92,7 +92,7 @@ This solution uses an example of a retail organization that has multiple data so
<img style="height: 25;" src="../graphics/WWI-logo.png">
Wide World Importeres (WWI) is a traditional brick and mortar business with a long track record of success, generating profits through strong retail store sales of their unique offering of affordable products from around the world. They have a great training program for new employees, that focuses on connecting with their customers and providing great face-to-face customer service. This strong focus on customer relationships has helped set WWI apart from their competitors.
Wide World Importers (WWI) is a traditional brick and mortar business with a long track record of success, generating profits through strong retail store sales of their unique offering of affordable products from around the world. They have a great training program for new employees, that focuses on connecting with their customers and providing great face-to-face customer service. This strong focus on customer relationships has helped set WWI apart from their competitors.
WWI has now added web and mobile commerce to their platform, which has generated a significant amount of additional data, and data formats. These new platforms have been added without integrating into the OLTP system data or Business Intelligence infrastructures. As a result, "silos" of data stores have developed.
@ -144,7 +144,7 @@ Using the following steps, you will create a Resource Group in Azure that will h
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-big-data-tools?view=sqlallproducts-allversions" target="_blank"> Read the following article to install the big data cluster Tools, ensuring that you carefully follow each step</a>. Note that if you followed the pre-requisites properly, you will already have <i>Python</i>, <i>kubectl</i>, and <i>Azure Data Studio</i> installed, so those may be skipped. Follow all other instructions.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sqlallproducts-allversions" target="_blank"> Read the following article to deploy the bdc to AKS, ensuring that you carefully follow each step</a>. Stop at the section marked <b>Connect to the cluster</b>.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"> <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sqlallproducts-allversions" target="_blank"> Read the following article to deploy the Big Data Cluster to the Azure Kubernetes Service, ensuring that you carefully follow each step</a>. Stop at the section marked <b>Connect to the cluster</b>.</p>
<p style="border-bottom: 1px solid lightgrey;"></p>
@ -309,7 +309,7 @@ You can <a href="https://hackernoon.com/docker-commands-the-ultimate-cheat-sheet
<h3>Container Orchestration <i>(Kubernetes)</i></h3>
For Big Data systems, having lots of Containers is very advantageous to segment purpose and performance profiles. However, dealing with many Container Images, allowing persisted storage, and interconnecting them for network and internetwork communications is a complex task. One such Container Prchestration tool is <i>Kubernetes</i>, an open source Container orchestrator, which can scale Container deployments according to need. The following table defines some important Container Orchestration Tools (Such as Kubernetes or OpenShift) terminology:
For Big Data systems, having lots of Containers is very advantageous to segment purpose and performance profiles. However, dealing with many Container Images, allowing persisted storage, and interconnecting them for network and internetwork communications is a complex task. One such Container Orchestration tool is <i>Kubernetes</i>, an open source Container orchestrator, which can scale Container deployments according to need. The following table defines some important Container Orchestration Tools (Such as Kubernetes or OpenShift) terminology:
<table style="tr:nth-child(even) {background-color: #f2f2f2;}; text-align: left; display: table; border-collapse: collapse; border-spacing: 5px; border-color: gray;">
@ -327,7 +327,7 @@ For Big Data systems, having lots of Containers is very advantageous to segment
You can <a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/" target="_blank">learn much more about Container Orchestration systems here</a>. We're using the Azure Kubernetes Service (AKS) in this workshop, and <a href="https://aksworkshop.io/" target="_blank">they have a great set of tutorials for you to learn more here</a>.
In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is reponsible for building and configurint the Nodes, assigns Pods to Nodes,creates and manages the Persistent Voumes (durable storage), and manages the operation of the Cluster.
In SQL Server Big Data Clusters, the Container Orchestration system (Such as Kubernetes or OpenShift) is responsible for the state of the BDC; it is responsible for building and configuring the Nodes, assigns Pods to Nodes,creates and manages the Persistent Volumes (durable storage), and manages the operation of the Cluster.
(You'll cover the storage aspects of Container Orchestration in more detail in a moment.)
@ -440,7 +440,7 @@ You'll explore further operations with the Azure Data Studio in the <i>Operation
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Azure Data Studio Notebooks Overview</b></p>
<p><b>Steps</b></p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-notebooks?view=sql-server-2017" target="_blank">Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permist.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-notebooks?view=sql-server-2017" target="_blank">Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permits.</p>
<br>
@ -486,7 +486,7 @@ Since HDFS is a file-system, data transfer is largely a matter of using it as a
<h3>Data Pipelines using <i>Azure Data Factory</i></h3>
As described earlier, you can use various methods to ingest data ad-hoc and as-needed for your two data targets (HDFS and SQL Server Tables. A more holistic archicture is to use a <i>Pipeline</i> system that can define sources, triggers and events, transforms, targets, and has logging and tracking capabilities. The Microsoft Azure Data Factory provides all of the capabilities, and often serves as the mechanism to transfer data to and from on-premises, in-cloud, and other sources and targets. <a href="https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities" target="_blank">ADF can serve as a full data pipeline system, as described here</a>.
As described earlier, you can use various methods to ingest data ad-hoc and as-needed for your two data targets (HDFS and SQL Server Tables. A more holistic architecture is to use a <i>Pipeline</i> system that can define sources, triggers and events, transforms, targets, and has logging and tracking capabilities. The Microsoft Azure Data Factory provides all of the capabilities, and often serves as the mechanism to transfer data to and from on-premises, in-cloud, and other sources and targets. <a href="https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities" target="_blank">ADF can serve as a full data pipeline system, as described here</a>.
<br>
<img style="height: 75;" src="../graphics/adf.png">

View file

@ -28,7 +28,7 @@ You'll cover the following topics in this Module:
SQL Server (starting with version 2019) provides three ways to work with large sets of data:
- **Data Virtualization**: Query multiple sources of data technologies using the Polybase SQL Server feature <i>(data left at source)</i>
- **Data Virtualization**: Query multiple sources of data technologies using the PolyBase SQL Server feature <i>(data left at source)</i>
- **Storage Pools**: Create sets of disparate data sources that can be queried from Distributed Data sets <i>(data ingested into sharded databases using PolyBase)</i>
- **SQL Server Big Data Clusters**: Create, manage and control clusters of SQL Server Instances that co-exist in a Kubernetes cluster with Apache Spark and other technologies to access and process large sets of data <i>(Data left in place, ingested through PolyBase, and into/through HDFS)</i>
@ -228,4 +228,4 @@ In this section you will review the solution tutorial you will perform in the <i
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/geopin.png"><b> Next Steps</b></p>
Next, Continue to <a href="03%20-%20Planning,%20Installation%20and%20Configuration.md" target="_blank"><i> Planning, Installation and Configuration</i></a>.
Next, Continue to <a href="03%20-%20Planning,%20Installation%20and%20Configuration.md" target="_blank"><i> Planning, Installation and Configuration</i></a>.

View file

@ -29,28 +29,28 @@ You'll cover the following topics in this Module:
<i>NOTE: The following Module is based on the Public Preview of the Microsoft SQL Server 2019 big data cluster feature. These instructions will change as the product is updated. <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-get-started?view=sqlallproducts-allversions" target="_blank">The latest installation instructions are located here</a>.</i>
A Big Data Cluster for SQL Server (BDC) is deployed onto a Cluster Orechestration system (such as Kubernetes or OpenShift) using the `azdata` utility which creates the appropriate Nodes, Pods, Containers and other constructs for the system. The installation uses various switches on the `azdata` utility, and reads from several variables contianed within <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/reference-deployment-config?view=sqlallproducts-allversions" target="_blank">an internal JSON document</a> when you run the command. Using a switch, you can change these variables. You can also dump the enitre document to a file, edit it, and then call the installation that uses that file with the `azdata` command. <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-custom-configuration?view=sqlallproducts-allversions" target="_blank">More detail on that process is located here.</a>
A Big Data Cluster for SQL Server (BDC) is deployed onto a Cluster Orchestration system (such as Kubernetes or OpenShift) using the `azdata` utility which creates the appropriate Nodes, Pods, Containers and other constructs for the system. The installation uses various switches on the `azdata` utility, and reads from several variables contained within <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/reference-deployment-config?view=sqlallproducts-allversions" target="_blank">an internal JSON document</a> when you run the command. Using a switch, you can change these variables. You can also dump the entire document to a file, edit it, and then call the installation that uses that file with the `azdata` command. <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deployment-custom-configuration?view=sqlallproducts-allversions" target="_blank">More detail on that process is located here.</a>
For planning, it is essential that you understand the SQL Server BDC components, and have a firm understanding of Kubernetes and TCP/IP networking. You should also have an understanding of how SQL Server and Apache Spark use the "Big Four" (*CPU, I/O, Memory and Networking*).
Since the Cluster Orechestration system is often made up of Virtual Machines that host the Container Images, they must be as large as possible. For the best possible performance, large physical machines that are tuned for optimal performance is a recommended physical architecture. The least viable production system is a Minimum of 3 Linux physical machines or virtual machines. The recommended configuration per machine is 8 CPUs, 32 GB of memory and 100GB of storage. This configuration would support only one or two users with a standard workload, and you would want to increase the system for each additional user or heavier workload.
Since the Cluster Orchestration system is often made up of Virtual Machines that host the Container Images, they must be as large as possible. For the best possible performance, large physical machines that are tuned for optimal performance is a recommended physical architecture. The least viable production system is a Minimum of 3 Linux physical machines or virtual machines. The recommended configuration per machine is 8 CPUs, 32 GB of memory and 100GB of storage. This configuration would support only one or two users with a standard workload, and you would want to increase the system for each additional user or heavier workload.
You can deploy Kubernetes in a few ways:
- In a Cloud Platform such as Azure Kubernetes Service (AKS)
- In your own Cluster Orechestration system deployment using the appropirate tools such as `KubeADM`
- In your own Cluster Orchestration system deployment using the appropriate tools such as `KubeADM`
Regardless of the Cluster Orechestration system target, the general steps for setting up the system are:
Regardless of the Cluster Orchestration system target, the general steps for setting up the system are:
- Set up Cluster Orechestration system with a Cluster target
- Set up Cluster Orchestration system with a Cluster target
- Install the cluster tools on the administration machine
- Deploy the BDC onto the Cluster Orechestration system
- Deploy the BDC onto the Cluster Orchestration system
In the sections that follow, you'll cover the general process for each of these deployments. The official documentation referenced above have the specific steps for each deployment, and the *Activity* section of this Module has the steps for deploying the BDC on AKS for the classroom enviornment.
In the sections that follow, you'll cover the general process for each of these deployments. The official documentation referenced above have the specific steps for each deployment, and the *Activity* section of this Module has the steps for deploying the BDC on AKS for the classroom environment.
<p style="border-bottom: 1px solid lightgrey;"></p>
@ -94,7 +94,7 @@ With this background, you can find the <a href="https://docs.microsoft.com/en-us
<h2><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/pencil2.png"><a name="3-2">3.2 Installing Locally Using KubeADM</h2>
If you choose Kubernetes as your Cluster Orechestration system, the <a href="https://kubernetes.io/docs/setup/independent/install-kubeadm/" target="_blank">kubeadm toolbox</a> helps you bootstrap a Kubernetes cluster that conforms to best practices. Kubeadm also supports other cluster lifecycle functions, such as upgrades, downgrade, and managing bootstrap tokens.
If you choose Kubernetes as your Cluster Orchestration system, the <a href="https://kubernetes.io/docs/setup/independent/install-kubeadm/" target="_blank">kubeadm toolbox</a> helps you bootstrap a Kubernetes cluster that conforms to best practices. Kubeadm also supports other cluster lifecycle functions, such as upgrades, downgrade, and managing bootstrap tokens.
The kubeadm toolbox can deploy a Kubernetes cluster to physical or virtual machines. It works by specifying the TCP/IP addresses of the targets.

View file

@ -81,11 +81,11 @@ This process allows not only a query to disparate systems, but also those remote
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/point1.png"><b>Activity: Load and query data in an External Table</b></p>
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any Polybase target.
In this activity, you will load the sample data into your big data cluster environment, and then create and use an External table to query the data in HDFS. This process is similar to connecting to any PolyBase target.
<b>Steps</b>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-load-sample-data?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This loads your data in preparattion for the next Activity.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-load-sample-data?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This loads your data in preparation for the next Activity.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-hdfs-storage-pool?view=sqlallproducts-allversions" target="_blank">Open this reference, and perform all of the instructions you see there</a>. This step shows you how to create and query an External table.</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="../graphics/checkbox.png"><a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/tutorial-query-oracle?view=sqlallproducts-allversions" target="_blank">(Optional) Open this reference, and review the instructions you see there</a>. (You You must have an Oracle server that your BDC can reach to perform these steps, although you can review them if you do not)</p>
@ -118,7 +118,7 @@ In this activity, you will load the sample data into your big data cluster envir
There are three primary uses for a large cluster of data processing systems for Machine Learning and AI applications. The first is that the users will involved in the creation of the <a href="https://www.codeingschool.com/2018/09/what-are-features-and-labels-in-machine-learning.html" target="_blank">Features used in various ML and AI algorithms, and are often tasked to Label</a> the data. These users can access the Data Pool and Data Storage data stores directly to query and assist with this task.
The SQL Server Master Instance in the BDC installs with <a href="https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-ver15" target="_blank">Machine Learning Services</a>, which allow creation, training, evaluation and presisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
The SQL Server Master Instance in the BDC installs with <a href="https://docs.microsoft.com/en-us/sql/advanced-analytics/what-is-sql-server-machine-learning?view=sql-server-ver15" target="_blank">Machine Learning Services</a>, which allow creation, training, evaluation and persisting of Machine Learning Models. Data from all parts of the BDC are available, and Data Science oriented languages and libraries in R, Python and Java are enabled. In this scenario, the Data Scientist creates the R or Python code, and the Transact-SQL Developer wraps that code in a Stored Procedure. This code can be used to train, evaluate and create Machine Learning Models. The Models can be stored in the Master Instance for scoring, or sent on to the App Pool where the Machine Learning Server is running, waiting to accept REST-based calls from applications.
<br>
<img style="height: 400; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" src="../graphics/bdcsolution4.png">

View file

@ -29,7 +29,7 @@ You'll cover the following topics in this Module:
There are two primary areas for monitoring your BDC deployment. The first deals with SQL Server 2019, and the second deals with the set of elements in the Cluster.
For SQL Server, <a href="https://docs.microsoft.com/en-us/sql/relational-databases/database-lifecycle-management?view=sql-server-ver15" target="_blank">management is much as you would normally perform for any SQL Server system</a>. You have the same type of services, surface points, security areas and other control vectors as in a stand-alone installation of SQL Server. The tools you have avalaible for managing the Master Instance in the BDC are the same as managing a stand-alone installation, including SQL Server Management Studio, command-line interfaces, Azure Data Studio, and third party tools.
For SQL Server, <a href="https://docs.microsoft.com/en-us/sql/relational-databases/database-lifecycle-management?view=sql-server-ver15" target="_blank">management is much as you would normally perform for any SQL Server system</a>. You have the same type of services, surface points, security areas and other control vectors as in a stand-alone installation of SQL Server. The tools you have available for managing the Master Instance in the BDC are the same as managing a stand-alone installation, including SQL Server Management Studio, command-line interfaces, Azure Data Studio, and third party tools.
For the cluster components, you have three primary interfaces to use, which you will review next.