Adding labs for Azure HDInsight
Provision HDInsight Linux Hadoop cluster with Azure Management Portal
---------------------------------------------------------------------

To provision an HDInsight Hadoop cluster with the Azure Management Portal, perform the following steps.

1.  Go to the Azure Portal at portal.azure.com and sign in using your Azure account credentials.

2.  Select **NEW -> Data Analytics -> HDInsight**.

> <img src="./media/image1.png" width="592" height="180" />

3.  Enter or select the following values.

    1.  **Cluster Name:** Enter the cluster name. A green tick will appear if the cluster name is available.

    2.  **Cluster Type:** Select **Spark** as the cluster type.

    3.  **Cluster Operating System:** Select Linux as the cluster operating system.

    4.  **Version:** Select **3.6** as the cluster version.

    5.  **Cluster Tier:** Select the **Standard** cluster tier.

    > <img src="./media/image2.png" width="436" height="372" />

    6.  **Subscription:** Select the Azure subscription in which to create the cluster.

    7.  **Resource Group:** Select an existing resource group or create a new resource group.

    8.  **Credentials:** Configure the username and password for the HDInsight cluster and the SSH connection. The SSH connection is used to connect to the HDInsight cluster through an SSH client such as PuTTY.

    > <img src="./media/image3.png" width="219" height="400" />

    9.  **Data Source:** Create a new storage account and a default container.

    > <img src="./media/image4.png" width="230" height="309" />

    10. **Node Pricing Tiers:** Set the number of head nodes and worker nodes as shown below.

    > <img src="./media/image5.png" width="228" height="290" />

    **Note:** You can select the lowest pricing tier (A3 nodes) or reduce the number of worker nodes to decrease the cluster cost.

4.  Leave the other configuration options at their defaults and click **Create** to provision the HDInsight Hadoop cluster. Cluster provisioning takes 15-20 minutes.

**The HDInsight Linux Hadoop cluster is now ready to work with.**

Copy lab data to the storage account
------------------------------------

In this section, you’ll copy the files required for the lab to your storage account.

To copy the files, follow the steps below; a scripted alternative is sketched after the list.

1.  Launch Azure Storage from your cluster dashboard.

> <img src="./media/image6.png" width="624" height="854" />

2.  Select the **Blob container** for your cluster.

3.  Create a container called **sparklabs**.

4.  Navigate to **sparklabs** and create a folder called **Lab03**.

5.  Upload SalesTransactions1.csv and SalesTransactions2.csv to Lab03. These files can be found in the **data\\sparklabs\\Lab03** folder.

<img src="./media/image7.png" width="624" height="334" />
Launching a new Jupyter Notebook
--------------------------------

### Access Azure Portal

1.  Sign in to the [Azure Portal](https://ms.portal.azure.com/).

If the Spark cluster is pinned to the “StartBoard”:

1.  Click the tile for your Spark cluster.

<img src="./media/image8.png" width="273" height="169" />

If the Spark cluster is not pinned to the “StartBoard”:

1.  Click Browse and select HDInsight Clusters.

<img src="./media/image9.png" width="223" height="219" />

2.  Select your Spark cluster.

<img src="./media/image10.png" width="277" height="124" />

### Launch Jupyter Notebook

1.  Click the Cluster Dashboards tile displayed under the Quick Links of the Cluster blade.

<img src="./media/image11.png" width="267" height="134" />

2.  Locate the **Jupyter Notebook** tile on the Cluster Dashboards blade and click it.

<img src="./media/image12.png" width="102" height="251" />

3.  When prompted, enter the admin credentials for the Spark cluster. This will open the Jupyter dashboard.

<img src="./media/image13.png" width="278" height="100" />

### Upload a new notebook

1.  Click the **Upload** dropdown button at the top right of the Jupyter Notebook screen.

2.  Select a notebook file with an .ipynb extension.

3.  Upload the file, then click the notebook to launch it.
Overview
--------

Azure HDInsight is the only fully-managed cloud Apache Hadoop offering that gives you optimized open-source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and Microsoft R Server, backed by a 99.9% SLA. Deploy these big data technologies and ISV applications as managed clusters with enterprise-level security and monitoring.

This lab specifically focuses on the Spark ML component of Spark and highlights its value proposition in the Apache Spark big data processing framework.

This hands-on lab will step you through the following features:

1.  **Notebook** – Connect to a notebook and run it.

2.  **Basics of Spark** – Use Python to analyze data using Spark.

3.  **Basics of Machine Learning** – This notebook demonstrates how to use MLlib, Spark's built-in machine learning library, to perform a simple prediction on an open dataset.

Learn the basics of data science using Spark
--------------------------------------------

This notebook demonstrates how to use MLlib, Spark's built-in machine learning library, to perform a simple prediction on an open dataset.

**Launch Jupyter Notebooks**

Navigate to the following link, filling in your cluster name, and sign in with the username/password provided: https://<Fill_ME_IN>.azurehdinsight.net/jupyter/tree/PySpark

-   Username: <FILL\_ME\_IN>

-   Password: <FILL\_ME\_IN>

<img src="./media/image1.png" width="624" height="250" />

### Open [Spark Machine Learning - Predictive analysis on food inspection data using MLLib.ipynb](https://pranavsparkbuildlab.azurehdinsight.net/jupyter/notebooks/PySpark/05%20-%20Spark%20Machine%20Learning%20-%20Predictive%20analysis%20on%20food%20inspection%20data%20using%20MLLib.ipynb)

This is a sample notebook that walks you through the steps of interacting with a notebook and the basics of machine learning on Spark. You will apply these learnings in a new notebook to predict book sales.

<img src="./media/image2.png" width="624" height="226" />

### Notebook Setup

-   Read the opening paragraph to understand the scenario and the model to apply.

-   Run through all the steps in the notebook.

-   To run a cell, place the cursor in the cell and then press **SHIFT + ENTER**.

### Initializing Spark - Construct an Input DataFrame

Read the dataset from a CSV file stored in Azure Blob Storage.

```python
inspections = spark.read.csv('wasb:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv',
                             inferSchema=True)
```

#### Inspect Schema
```python
inspections.printSchema()
```

#### See a detailed record
```python
df.take(1)
```

#### Understand the dataset
Let's start to get a sense of what our dataset contains. For example, what are the different values in the `results` column?
```python
df.select('results').distinct().show()
```

#### A visualization can help us reason about the distribution of these outcomes.
```python
%%local
%matplotlib inline
import matplotlib.pyplot as plt

labels = count_results_df['results']
sizes = count_results_df['cnt']
colors = ['turquoise', 'seagreen', 'mediumslateblue', 'palegreen', 'coral']

plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors)
plt.axis('equal')
```
<img src="./media/image3.png" width="624" height="335" />

### Create a logistic regression model from the input dataframe

This will allow you to categorize the data, which you can use to predict the outcome in the next step. A sketch of one way to implement this step follows.
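One way to build this model is a small MLlib pipeline over the inspection text. The cell below is a minimal sketch, assuming `labeledData` (with `label` and `violations` columns) was prepared in an earlier cell of the notebook:

```python
# A minimal sketch, assuming labeledData has 'label' and 'violations' columns
# prepared in an earlier cell of the sample notebook.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="violations", outputCol="words")  # split violation text into tokens
hashingTF = HashingTF(inputCol="words", outputCol="features")    # hash tokens into a feature vector
lr = LogisticRegression(maxIter=10, regParam=0.01)               # binary classifier over those features

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(labeledData)
```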
### Evaluate the model on a separate test dataset

We can use the model we created earlier to predict what the results of new inspections will be, based on the violations that were observed.

```python
testData = selectInterestingColumns(spark.read.csv('wasb:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections2.csv',
                                                   inferSchema=True))

testDf = testData.where("results = 'Fail' OR results = 'Pass' OR results = 'Pass w/ Conditions'")

predictionsDf = model.transform(testDf)
predictionsDf.registerTempTable('Predictions')
predictionsDf.columns
```

#### Look at the success rate.
```python
numSuccesses = predictionsDf.where("""(prediction = 0 AND results = 'Fail') OR
                                      (prediction = 1 AND (results = 'Pass' OR
                                                           results = 'Pass w/ Conditions'))""").count()

numInspections = predictionsDf.count()

print("There were %d inspections and there were %d successful predictions" % (numInspections, numSuccesses))
print("This is a %d%% success rate" % (float(numSuccesses) / float(numInspections) * 100))
```

#### Final visualization to help us reason about the results of this test.

<img src="./media/image4.png" width="624" height="281" />
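The chart above compares predicted and actual outcomes. The cell below is a small sketch (not verbatim from the notebook) of how such a comparison can be tallied, assuming the `Predictions` temp table registered above:

```python
# A minimal sketch, assuming the 'Predictions' temp table registered above.
# Tally predicted vs. actual outcomes; counts like these back the chart shown.
confusion = spark.sql("""
    SELECT prediction, results, COUNT(*) AS cnt
    FROM Predictions
    GROUP BY prediction, results
    ORDER BY prediction, results
""")
confusion.show()
```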
Scenario 2 – Apply the basics of machine learning to predict book sales.
------------------------------------------------------------------------

In this scenario, you will apply your learnings from Scenario 1.

**Scenario**: This notebook demonstrates how to use MLlib, Spark's built-in machine learning library, to perform a simple predictive analysis on an open dataset.

**Launch Jupyter Notebooks**

https://<Fill\_ME\_IN>.azurehdinsight.net/jupyter/tree/PySpark

-   Username: <FILL\_ME\_IN>

-   Password: <FILL\_ME\_IN>

<img src="./media/image1.png" width="624" height="250" />

### Open Final Lab.ipynb

This is the same notebook you worked through in Scenario 1. In this notebook, you will apply the learnings to a different dataset.

<img src="./media/image5.png" width="624" height="238" />

### Notebook Setup

-   Read the opening paragraph to understand the scenario and the model to apply.

-   Run through all the steps in the notebook.

-   To run a cell, place the cursor in the cell and then press **SHIFT + ENTER**.

### Initializing Spark - Construct an Input DataFrame

Read the dataset, along with its headers, from a CSV file stored in Azure Blob Storage.

-   Replace <FILL\_ME\_IN\_WITH\_header=True> with **header=True** in the following statement.
```python
inspections = spark.read.csv('/sparklabs/Lab03/SaleTransactions1.csv',
                             inferSchema=True, <FILL_ME_IN_WITH_header=True>)
```

#### Inspect Schema
```python
inspections.printSchema()
```

#### See a detailed record
```python
df.take(1)
```

### Understand the dataset

Let's start to get a sense of what our dataset contains. For example, what are the different values in the `CustomerAction` column?

-   Replace **<FILL\_ME\_IN\_WITH\_ColName>** with **CustomerAction**.
```python
inspections.select('FILL_ME_IN_WITH_ColName').distinct().show()
```

#### A visualization can help us reason about the distribution of these outcomes.

-   Replace **<FILL\_ME\_IN\_WITH\_ColName>** with **CustomerAction**.
```python
%%local
%matplotlib inline
import matplotlib.pyplot as plt

labels = count_results_df['FILL_ME_IN_WITH_ColName']
sizes = count_results_df['cnt']
colors = ['turquoise', 'seagreen', 'mediumslateblue', 'palegreen', 'coral']

plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors)
plt.axis('equal')
```
<img src="./media/image6.png" width="624" height="350" />

Let us develop a model that can guess whether a book is purchased based on the customer action. From the previous visualization, a **CustomerAction** can be one of the following: **‘Purchased’**, **‘Added To Cart’**, or **‘Browsed’**.

Since logistic regression is a binary classification method, it makes sense to group our data into two categories: **Purchased** and **Not purchased**. An "Added To Cart" is not a purchase, so when we train the model, we will treat "Added To Cart" and "Browsed" as equivalent.

-   Replace **<FILL\_ME\_IN>** with the **highlighted** values.
```python
def labelForResults(s):
    if s == 'Purchased':
        return 1.0
    elif s == 'Added To Cart' or s == 'Browsed':
        return 0.0
    else:
        return -1.0

label = UserDefinedFunction(labelForResults, DoubleType())

labeledData = inspections.select(label(inspections.CustomerAction).alias('label'),
                                 inspections.Name).where('label >= 0')
```
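Before training, it can help to sanity-check the labeled data. The cell below is an optional check that is not part of the original lab:

```python
# Optional sanity check (not in the original lab): confirm the label
# distribution looks reasonable before training the model.
labeledData.groupBy('label').count().show()
labeledData.show(5, truncate=False)
```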
### Create a logistic regression model from the input dataframe

This will allow you to categorize the data, which you can use to predict the outcome in the next step. As in Scenario 1, a sketch of this step follows.
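The sketch below mirrors the Scenario 1 pipeline, assuming the text feature is the `Name` column selected into `labeledData` above:

```python
# A minimal sketch mirroring the Scenario 1 pipeline; 'Name' is the text
# column selected into labeledData above.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="Name", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)

model = Pipeline(stages=[tokenizer, hashingTF, lr]).fit(labeledData)
```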
### Evaluate the model on a separate test dataset

We can use the model we created earlier to predict outcomes for a separate set of transactions.

```python
testData = spark.read.csv('/sparklabs/Lab03/SaleTransactions2.csv',
                          inferSchema=True, header=True)

testDf = testData.where("CustomerAction = 'Purchased' OR CustomerAction = 'Added To Cart' OR CustomerAction = 'Browsed'")

predictionsDf = model.transform(testDf)
predictionsDf.registerTempTable('Predictions')
predictionsDf.columns
```

#### Look at the success rate.

-   Replace **<FILL\_ME\_IN>** with the **highlighted** values.
```python
numSuccesses = predictionsDf.where("""(prediction = 1 AND CustomerAction = 'Purchased') OR
                                      (prediction = 0 AND (CustomerAction = 'Added To Cart' OR
                                                           CustomerAction = 'Browsed'))""").count()

numInspections = predictionsDf.count()

print("There were %d user sessions and there were %d successful predictions" % (numInspections, numSuccesses))
print("This is a %d%% success rate" % (float(numSuccesses) / float(numInspections) * 100))
```

#### Final visualization to help us reason about the results of this test.

<img src="./media/image4.png" width="624" height="281" />

Learn more and get help
=======================

-   [Azure HDInsight Overview](https://azure.microsoft.com/en-us/services/hdinsight/)

-   [Getting started with Azure HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/)

-   [Use Hive on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-tutorial-get-started)

-   [Use Spark on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview)

-   [Use Interactive Hive on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-interactive-hive)

-   [Use HBase on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-overview)

-   [Use Kafka on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-introduction)

-   [Use Storm on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-overview)

-   [Use R Server on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-r-server-overview)

-   [Open Source component guide on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#hadoop-components-available-with-different-hdinsight-versions)

-   [Extend your cluster to install open source components](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#support-for-open-source-software-used-on-hdinsight-clusters)

-   [HDInsight release notes](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-release-notes)

-   [HDInsight versioning and support guidelines](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#supported-hdinsight-versions)

-   [How to upgrade HDInsight cluster to a new version](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-upgrade-cluster)

-   [Ask HDInsight questions on Stack Overflow](https://stackoverflow.com/questions/tagged/hdinsight)

-   [Ask HDInsight questions on MSDN forums](https://social.msdn.microsoft.com/forums/azure/en-us/home?forum=hdinsight)
# Azure HDInsight - Data science using Spark on Azure HDInsight

## Deployment
For this lab, an HDInsight cluster is already created for you. If you want to create this cluster on your own, please go through this [deployment guide](deployment/readme.md).

## Lab
See the hands-on lab [here](hands-on-lab.md).
Provision HDInsight Linux Hadoop cluster with Azure Management Portal
---------------------------------------------------------------------

To provision an HDInsight Hadoop cluster with the Azure Management Portal, perform the following steps.

1.  Go to the Azure Portal at portal.azure.com and sign in using your Azure account credentials.

2.  Select **NEW -> Data Analytics -> HDInsight**.

> <img src="./media/image1.png" width="592" height="180" />

3.  Enter or select the following values.

    1.  **Cluster Name:** Enter the cluster name. A green tick will appear if the cluster name is available.

    2.  **Cluster Type:** Select Hadoop as the cluster type.

    3.  **Cluster Operating System:** Select Linux as the cluster operating system.

    4.  **Version:** Select 3.6 as the cluster version.

    5.  **Cluster Tier:** Select the **Standard** cluster tier.

    > <img src="./media/image2.png" width="436" height="372" />

    6.  **Subscription:** Select the Azure subscription in which to create the cluster.

    7.  **Resource Group:** Select an existing resource group or create a new resource group.

    8.  **Credentials:** Configure the username and password for the HDInsight cluster and the SSH connection. The SSH connection is used to connect to the HDInsight cluster through an SSH client such as PuTTY.

    > <img src="./media/image3.png" width="219" height="400" />

    9.  **Data Source:** Create a new storage account and a default container.

    > <img src="./media/image4.png" width="230" height="309" />

    10. **Node Pricing Tiers:** Set the number of head nodes and worker nodes as shown below.

    > <img src="./media/image5.png" width="228" height="290" />

    **Note:** You can select the lowest pricing tier (A3 nodes) or reduce the number of worker nodes to decrease the cluster cost.

4.  Leave the other configuration options at their defaults and click **Create** to provision the HDInsight Hadoop cluster. Cluster provisioning takes 15-20 minutes.

**The HDInsight Linux Hadoop cluster is now ready to work with.**

Copy lab data to the storage account
------------------------------------

In this section, you’ll copy the files required for the lab to your storage account.

To copy the files, follow the steps below; a scripted alternative is sketched after the list.

1.  Launch Azure Storage from your cluster dashboard.

> <img src="./media/image6.png" width="624" height="854" />

2.  Select the **Blob container** for your cluster.

3.  Create a container called **hadooplabs**.

4.  Navigate to **hadooplabs** and create a folder called **Lab1**.

5.  Upload weblogs.csv to Lab1. Weblogs.csv can be found in the **data\\hadooplabs\\Lab1** folder.

<img src="./media/image7.png" width="624" height="334" />
Overview
========

Azure HDInsight is the only fully-managed cloud Apache Hadoop offering that gives you optimized open-source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and Microsoft R Server, backed by a 99.9% SLA. Deploy these big data technologies and ISV applications as managed clusters with enterprise-level security and monitoring.

Hive is a data warehousing system that simplifies analyzing large datasets stored in Hadoop clusters, using a SQL-like language known as HiveQL. Hive converts queries to MapReduce, Apache Tez, or Apache Spark jobs.

To highlight how customers can efficiently leverage HDInsight Hive to analyze big data stored in Azure Blob Storage, this document provides an end-to-end walkthrough of analyzing the web transaction log of an imaginary book store using Hive.

After completing this lab, you will have learned:

1.  Different ways to execute Hive queries on an HDInsight cluster.

2.  How to use joins, aggregates, analytic and ranking functions, GROUP BY, and ORDER BY in the Hive Query Language.

Learn the basics of querying with Hive
======================================

### Launch Hive Views in Ambari portal
Replace <FILL_ME_IN> with the cluster name and password provided or created.

-   https://<FILL\_ME\_IN>.azurehdinsight.net/\#/main/view/HIVE/auto\_hive20\_instance

-   Username: <FILL\_ME\_IN>

-   Password: <FILL\_ME\_IN>

<img src="./images/image1.png" width="624" height="363" />

### Load Data into table

-   Copy and paste the following query in the **Query Editor.** *Do not execute yet.*
```sql
DROP DATABASE IF EXISTS HDILABDB CASCADE;

CREATE DATABASE HDILABDB;

USE HDILABDB;

CREATE EXTERNAL TABLE IF NOT EXISTS weblogs(
    TransactionDate varchar(50),
    CustomerId      varchar(50),
    BookId          varchar(50),
    PurchaseType    varchar(50),
    TransactionId   varchar(50),
    OrderId         varchar(50),
    BookName        varchar(50),
    CategoryName    varchar(50),
    Quantity        varchar(50),
    ShippingAmount  varchar(50),
    InvoiceNumber   varchar(50),
    InvoiceStatus   varchar(50),
    PaymentAmount   varchar(50)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/weblogs/';

LOAD DATA INPATH 'wasb:///hadooplabs/Lab1/weblogs.csv' INTO TABLE HDILABDB.weblogs;
```
-   Click Execute to run the query. Once the query completes, the Query Process Results status will change to **SUCCEEDED**.

### Select total count

-   Create a new worksheet and execute the following query in the **Query Editor**.
```sql
SELECT COUNT(*) FROM HDILABDB.weblogs;
```

### View the data

-   Create a new worksheet and execute the following query in the **Query Editor**.
```sql
SELECT * FROM HDILABDB.weblogs LIMIT 5;
```

### Where clause

-   Create a new worksheet and execute the following query in the **Query Editor**.
```sql
SELECT * FROM HDILABDB.weblogs WHERE orderid='107';
```

<img src="./images/image2.png" width="624" height="404" />

### Find DISTINCT

-   Create a new worksheet and execute the following query in the **Query Editor**.
```sql
SELECT DISTINCT bookname FROM HDILABDB.weblogs WHERE orderid='107';
```

### GROUP BY

-   Create a new worksheet and execute the following query in the **Query Editor**.
```sql
SELECT bookname, COUNT(*) FROM HDILABDB.weblogs GROUP BY bookname;
```

### Analyze the query using “Visual Explain”

<img src="./images/image3.png" width="624" height="344" />

Scenario 2 – Apply the basics
=============================

Perform book store sales analysis
---------------------------------

In this section, you’ll run Hive queries to analyze the data in the weblogs table. The weblogs table contains transactional data of an imaginary online bookstore. You’ll analyze the sales data and prepare a sales report.

All analysis is based on the weblogs table, created earlier in the lab. The table description is given below.

| **Column**      | **Description**                                                                                                                                           |
|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| TransactionDate | The date of the transaction                                                                                                                               |
| CustomerId      | Unique Id assigned to the customer                                                                                                                        |
| BookId          | Unique Id assigned to a book in the book store                                                                                                            |
| PurchaseType    | Purchased: customer bought the book. Browsed: customer browsed but did not purchase the book. Added to Cart: customer added the book to the shopping cart |
| TransactionId   | Unique Id assigned to a transaction                                                                                                                       |
| OrderId         | Unique order Id                                                                                                                                           |
| BookName        | The name of the book accessed by the customer                                                                                                             |
| CategoryName    | The category of the book accessed by the customer                                                                                                         |
| Quantity        | Quantity of the book purchased. Valid only for PurchaseType = Purchased                                                                                   |
| ShippingAmount  | Shipping cost                                                                                                                                             |
| InvoiceNumber   | Invoice number if a customer purchased the book                                                                                                           |
| InvoiceStatus   | The status of the invoice                                                                                                                                 |
| PaymentAmount   | Total amount paid by the customer. Valid only for PurchaseType = Purchased                                                                                |

### Launch Hive Views in Ambari portal
Replace <FILL_ME_IN> with the cluster name and password provided or created.

-   https://<FILL\_ME\_IN>.azurehdinsight.net/\#/main/view/HIVE/auto\_hive20\_instance

-   Username: <FILL\_ME\_IN>

-   Password: <FILL\_ME\_IN>

### Problem Statement \#1

Write a query to return the total quantity sold and the total payment amount for each category. The output should look like this.

| **CategoryName**  | **QuantitySold** | **TotalAmount** |
|-------------------|------------------|-----------------|
| Drive\_books      | 211029           | 2064435         |
| Adventure         | 112470           | 1022195         |
| World\_History    | 112263           | 1048990         |
| Art               | 112105           | 1043190         |
| Non\_Fiction      | 111731           | 1046410         |
| Psychology        | 111555           | 1024255         |
| Romance           | 111316           | 1038265         |
| Automobile\_books | 110017           | 1030720         |
| Philosophy        | 109691           | 1042410         |
| Fiction           | 109460           | 1032795         |
| Drama             | 109246           | 1038565         |
| Management        | 108262           | 1030805         |
| Programming       | 108196           | 1013210         |
| Music             | 108121           | 998930          |
| Cook              | 108056           | 1051710         |
| Science           | 107706           | 1063445         |
| Religion          | 107513           | 999780          |
| Political         | 106000           | 1034820         |

#### Create a new worksheet and execute the following query in the Query Editor.
```sql
-- Get top selling categories
DROP TABLE IF EXISTS HDILABDB.SalesbyCategory;

CREATE TABLE HDILABDB.SalesbyCategory ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\1' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/SalesbyCategory'
AS
SELECT
    CategoryName,
    SUM(Quantity) AS QuantitySold,
    SUM(PaymentAmount) AS TotalAmount
FROM HDILABDB.weblogs
WHERE PurchaseType = 'Purchased'
GROUP BY CategoryName
ORDER BY QuantitySold DESC;

SELECT * FROM HDILABDB.SalesbyCategory LIMIT 10;
```

### Problem Statement \#2

Write a query to return the total payment amount and the total quantity sold per book. The output should look like this.

| **BookName**                         | **QuantitySold** | **TotalAmount** |
|--------------------------------------|------------------|-----------------|
| The voyages of Captain Cook          | 232414           | 2194890         |
| Advances in school psychology        | 231410           | 2193740         |
| Science in Dispute                   | 231408           | 2168425         |
| History of political economy         | 231255           | 2190040         |
| THE BOOK OF WITNESSES                | 230872           | 2145540         |
| The adventures of Arthur Conan Doyle | 230023           | 2191910         |
| Space fact and fiction               | 229908           | 2171820         |
| New Christian poetry                 | 228849           | 2185845         |
| Understanding American politics      | 228598           | 2182720         |

#### Create a new worksheet and execute the following query in the Query Editor.
```sql
-- Top selling books
DROP TABLE IF EXISTS HDILABDB.SalesbyBooks;

CREATE TABLE HDILABDB.SalesbyBooks ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\1' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/SalesbyBooks'
AS
SELECT
    BookName,
    SUM(Quantity) AS QuantitySold,
    SUM(PaymentAmount) AS TotalAmount
FROM HDILABDB.weblogs
WHERE PurchaseType = 'Purchased'
GROUP BY BookName
ORDER BY QuantitySold DESC;

SELECT * FROM HDILABDB.SalesbyBooks LIMIT 10;
```
<img src="./images/image4.png" width="624" height="373" />

### Problem Statement \#3

Write a query to return the top 3 books browsed by the customers who also browsed the book **THE BOOK OF WITNESSES**. Your output should look like this.

| **BookName**                 | **cnt** |
|------------------------------|---------|
| New Christian poetry         | 9445    |
| History of political economy | 9384    |
| Science in Dispute           | 9367    |

#### Create a new worksheet and execute the following query in the Query Editor.
```sql
DROP TABLE IF EXISTS HDILABDB.customerswhobrowsedxbook;

CREATE TABLE HDILABDB.customerswhobrowsedxbook ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\1' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/customerswhobrowsedxbook'
AS
WITH Customerwhobrowsedbookx AS
(
    SELECT DISTINCT CustomerId
    FROM HDILABDB.weblogs
    WHERE PurchaseType = 'Browsed'
      AND BookName = 'THE BOOK OF WITNESSES'
)
SELECT w.BookName, COUNT(*) AS cnt
FROM HDILABDB.weblogs w
JOIN Customerwhobrowsedbookx cte
  ON w.CustomerId = cte.CustomerId
WHERE w.PurchaseType = 'Browsed'
  AND w.BookName NOT IN ('THE BOOK OF WITNESSES')
GROUP BY w.BookName HAVING COUNT(*) > 10
ORDER BY cnt DESC
LIMIT 3;

SELECT * FROM HDILABDB.customerswhobrowsedxbook LIMIT 10;
```
Learn more and get help
=======================

-   [Azure HDInsight Overview](https://azure.microsoft.com/en-us/services/hdinsight/)

-   [Getting started with Azure HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/)

-   [Use Hive on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-tutorial-get-started)

-   [Use Spark on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview)

-   [Use Interactive Hive on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-interactive-hive)

-   [Use HBase on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-overview)

-   [Use Kafka on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-introduction)

-   [Use Storm on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-overview)

-   [Use R Server on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-r-server-overview)

-   [Open Source component guide on HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#hadoop-components-available-with-different-hdinsight-versions)

-   [Extend your cluster to install open source components](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#support-for-open-source-software-used-on-hdinsight-clusters)

-   [HDInsight release notes](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-release-notes)

-   [HDInsight versioning and support guidelines](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#supported-hdinsight-versions)

-   [How to upgrade HDInsight cluster to a new version](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-upgrade-cluster)

-   [Ask HDInsight questions on Stack Overflow](https://stackoverflow.com/questions/tagged/hdinsight)

-   [Ask HDInsight questions on MSDN forums](https://social.msdn.microsoft.com/forums/azure/en-us/home?forum=hdinsight)
# Azure HDInsight - Big data processing using Hive on Azure HDInsight

## Deployment
For this lab, an HDInsight cluster is already created for you. If you want to create this cluster on your own, please go through this [deployment guide](deployment/readme.md).

## Lab
See the hands-on lab [here](hands-on-lab.md).