Adding labs for Azure HDInsight

This commit is contained in:
rustd 2017-05-02 19:41:29 -07:00
Parent 6f702c22b0
Commit d7922b6f8c
73 changed files: 360424 additions and 0 deletions

Binary data
Labs/Azure HDInsight/DataScienceLab/.DS_Store Vendored Normal file

Binary file not shown.

Binary data
Labs/Azure HDInsight/DataScienceLab/Data/.DS_Store Vendored Normal file

Binary file not shown.

Binary data
Labs/Azure HDInsight/DataScienceLab/Data/sparklabs/.DS_Store Vendored Normal file

Binary file not shown.

Binary data
Labs/Azure HDInsight/DataScienceLab/Data/sparklabs/Lab03/.DS_Store Vendored Normal file

Binary file not shown.

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

Binary data
Labs/Azure HDInsight/DataScienceLab/README.docx Normal file

Binary file not shown.

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/.DS_Store Vendored Normal file

Binary file not shown.

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image1.png Normal file

Binary file not shown.

Size: 40 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image10.png Normal file

Binary file not shown.

Size: 19 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image11.png Normal file

Binary file not shown.

Size: 18 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image12.png Normal file

Binary file not shown.

Size: 15 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image13.png Normal file

Binary file not shown.

Size: 21 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image2.png Normal file

Binary file not shown.

Size: 37 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image3.png Normal file

Binary file not shown.

Size: 16 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image4.png Normal file

Binary file not shown.

Size: 14 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image5.png Normal file

Binary file not shown.

Size: 12 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image6.png Normal file

Binary file not shown.

Size: 198 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image7.png Normal file

Binary file not shown.

Size: 224 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image8.png Normal file

Binary file not shown.

Size: 64 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/media/image9.png Normal file

Binary file not shown.

Size: 61 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/deployment/readme.docx Normal file

Binary file not shown.

@ -0,0 +1,134 @@
Provision HDInsight Linux Hadoop cluster with Azure Management Portal
---------------------------------------------------------------------
To provision an HDInsight Hadoop cluster with the Azure Management Portal,
perform the steps below.
1. Go to the Azure Portal (portal.azure.com) and log in using your Azure
account credentials.
2. Select **NEW -> Data Analytics -> HDInsight**
> <img src="./media/image1.png" width="592" height="180" />
1. Enter or select the following values.
   1. **Cluster Name:** Enter the cluster name. A green tick will
      appear if the cluster name is available.
   2. **Cluster Type:** Select **Spark** as the cluster type.
   3. **Cluster Operating System:** Select **Linux** as the cluster
      operating system.
   4. **Version:** Select **3.6** as the cluster version.
   5. **Cluster Tier:** Select the **Standard** cluster tier.
> <img src="./media/image2.png" width="436" height="372" />
1. **Subscription:** Select the Azure subscription to create
the cluster.
2. **Resource Group:** Select an existing resource group or create a
new resource group.
3. **Credentials:** Configure the username and password for the HDInsight
cluster and the SSH connection. The SSH connection is used to connect to the
HDInsight cluster through an SSH client such as PuTTY.
> <img src="./media/image3.png" width="219" height="400" />
1. **Data Source:** Create a new storage account and a
default container.
> <img src="./media/image4.png" width="230" height="309" />
1. **Node Pricing Tiers:** Set the number of head nodes and worker nodes
as shown below.
> <img src="./media/image5.png" width="228" height="290" />
**Note:** You can select the lowest pricing tier (A3 nodes) or reduce the
number of worker nodes to decrease the cluster cost.
1. Leave the other configuration options at their defaults and click **Create** to
provision the HDInsight Hadoop cluster. It will take 15-20 minutes to
provision the cluster.
**The HDInsight Linux Hadoop cluster is now ready to work with.**
Copy lab data to the storage account
------------------------------------
In this section, you'll copy the files required for the lab to your
storage account.
To copy the files, follow the steps below.
1. Launch Azure Storage from your cluster dashboard
> <img src="./media/image6.png" width="624" height="854" />
1. Select the **Blob container** for your cluster
2. Create a container called **sparklabs**
3. Navigate to **sparklabs** and create a container called **Lab03**
4. Upload SalesTransactions1.csv and SalesTransactions2.csv to Lab03.
The files can be found in the **data\\sparklabs\\Lab03** folder.
<img src="./media/image7.png" width="624" height="334" />
Launching a new Jupyter Notebook
--------------------------------
### Access Azure Portal
1. Sign in to the [Azure Portal](https://ms.portal.azure.com/).
If Spark Cluster is pinned to the “StartBoard”:
1. Click the tile for your Spark Cluster.
<img src="./media/image8.png" width="273" height="169" />
If Spark Cluster is not pinned to the “StartBoard”:
1. Click Browse, select HDInsight Clusters.
<img src="./media/image9.png" width="223" height="219" />
1. Select your Spark Cluster.
<img src="./media/image10.png" width="277" height="124" />
### Launch Jupyter Notebook
1. Click the **Cluster Dashboards** tile displayed under Quick Links on the
cluster blade.
<img src="./media/image11.png" width="267" height="134" />
1. Locate the **Jupyter Notebook** tile under Cluster Dashboards and
click it.
<img src="./media/image12.png" width="102" height="251" />
1. When prompted, enter the admin credentials for the Spark cluster.
This will open the Jupyter dashboard.
<img src="./media/image13.png" width="278" height="100" />
### Upload a new notebook
1. Click the **Upload** button at the top right of the
Jupyter Notebook screen.
2. Select a file with an .ipynb extension.
3. Upload the notebook, then click it to launch it.

Binary data
Labs/Azure HDInsight/DataScienceLab/hands-on-lab.docx Normal file

Binary file not shown.

@ -0,0 +1,362 @@
Overview
--------
Azure HDInsight is the only fully-managed cloud Apache Hadoop offering
that gives you optimized open-source analytic clusters for Spark, Hive,
MapReduce, HBase, Storm, Kafka, and Microsoft R Server backed by a 99.9%
SLA. Deploy these big data technologies and ISV applications as managed
clusters with enterprise-level security and monitoring.
This lab focuses specifically on the Spark ML component of Spark and
highlights its value proposition in the Apache Spark Big Data processing
framework.
This hands-on lab will step you through the following features:
1. **Notebook** – Connect to a Notebook and run the notebook
2. **Basics of Spark** – Use Python to analyze data using Spark
3. **Basics of Machine Learning** – Use MLlib, Spark's built-in
machine learning library, to perform a simple prediction on an open
dataset.
Learn the basics of data science using Spark
--------------------------------------------
This notebook demonstrates how to use MLlib, Spark's built-in machine
learning library, to perform a simple prediction on an open dataset.
**Launch Jupyter Notebooks**
Navigate to the following link, filling in your cluster name, and sign in with the username and password provided: https://&lt;Fill\_ME\_IN&gt;.azurehdinsight.net/jupyter/tree/PySpark
- Username: &lt;FILL\_ME\_IN&gt;
- Password: &lt;FILL\_ME\_IN&gt;
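
The link above follows a fixed pattern with the cluster name as the only variable part. A minimal sketch of filling it in is shown below; the helper name `jupyter_url` and the cluster name `mycluster` are made up for illustration and are not part of the lab.

```python
def jupyter_url(cluster_name):
    """Build the PySpark notebook tree URL for an HDInsight cluster name."""
    # The URL pattern is copied from the lab text; only the name is substituted.
    return "https://{0}.azurehdinsight.net/jupyter/tree/PySpark".format(cluster_name)


# "mycluster" is a placeholder, not a real cluster.
print(jupyter_url("mycluster"))
```
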
<img src="./media/image1.png" width="624" height="250" />
### Open [Spark Machine Learning - Predictive analysis on food inspection data using MLLib.ipynb](https://pranavsparkbuildlab.azurehdinsight.net/jupyter/notebooks/PySpark/05%20-%20Spark%20Machine%20Learning%20-%20Predictive%20analysis%20on%20food%20inspection%20data%20using%20MLLib.ipynb)
This is a sample notebook that walks you through the steps of
interacting with a notebook and the basics of machine learning with Spark. You
will apply what you learn in a new notebook to predict book sales.
<img src="./media/image2.png" width="624" height="226" />
### Notebook Setup
- Read the opening paragraph to understand the scenario and the
model to apply.
- Run through all the steps in the notebook.
- To run a cell, place the cursor in the cell and then press
**SHIFT + ENTER**.
### Initializing Spark - Construct an Input DataFrame
Read the dataset from a csv file stored in Azure Blob Storage.
```python
inspections = spark.read.csv(
    'wasb:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv',
    inferSchema=True)
```
#### Inspect Schema
```python
inspections.printSchema()
```
#### See a detailed record
```python
df.take(1)
```
#### Understand the dataset
Let's start to get a sense of what our dataset contains. For
example, what are the different values in the `results` column?
```python
df.select('results').distinct().show()
```
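
As a rough illustration of what "distinct values" and their per-value counts look like, here is a plain-Python sketch that runs without a cluster. The sample values are invented; in the lab the real values come from the CSV file, and the counting happens in Spark rather than in local Python.

```python
from collections import Counter

# Invented sample of the `results` column; the real data comes from the CSV.
results = ["Pass", "Fail", "Pass", "Pass w/ Conditions", "Fail", "Pass"]

# Equivalent in spirit to select('results').distinct()
distinct_values = sorted(set(results))

# One count per distinct value, like a GROUP BY results / COUNT(*) query.
counts = Counter(results)

print(distinct_values)
for value, cnt in counts.most_common():
    print(value, cnt)
```
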
#### A visualization can help us reason about the distribution of these outcomes.
```python
%%local
%matplotlib inline
import matplotlib.pyplot as plt
labels = count_results_df['results']
sizes = count_results_df['cnt']
colors = ['turquoise', 'seagreen', 'mediumslateblue', 'palegreen',
          'coral']
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors)
plt.axis('equal')
```
<img src="./media/image3.png" width="624" height="335" />
### Create a logistic regression model from the input dataframe
This allows you to categorize the data, which you can then use to predict
the outcome in the next step.
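
The notebook's own model-building code is not reproduced in this guide. To give a feel for what a logistic regression classifier does with text, here is a minimal plain-Python sketch. It is not MLlib and not the notebook's code: the function names (`train`, `predict`), the bag-of-words features, and the sample violation texts are all assumptions made for illustration. Labels follow the convention used later in the lab (0 for Fail, 1 for Pass).

```python
import math


def tokenize(text):
    return text.lower().split()


def featurize(text, vocab):
    # Bag-of-words counts over a fixed vocabulary.
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, epochs=200, lr=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    vocab = {}
    for text, _ in rows:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for text, label in rows:
            x = featurize(text, vocab)
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - label  # gradient of the log loss w.r.t. the score
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return vocab, weights, bias


def predict(model, text):
    vocab, weights, bias = model
    x = featurize(text, vocab)
    p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    return 1.0 if p >= 0.5 else 0.0


# Invented training rows: (violation text, label), 0.0 = Fail, 1.0 = Pass.
rows = [
    ("rodent droppings found in kitchen", 0.0),
    ("no hot water at hand sink", 0.0),
    ("floors clean and maintained", 1.0),
    ("minor labeling issue corrected on site", 1.0),
]
model = train(rows)
print(predict(model, "rodent droppings found in kitchen"))
print(predict(model, "floors clean and maintained"))
```

In the lab itself the same idea is applied at scale by Spark, which trains the model across the cluster instead of in a single Python loop.
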
### Evaluate the model on a separate test dataset
We can use the model we created earlier to predict what the results of
new inspections will be, based on the violations that were observed.
```python
# selectInterestingColumns and model are defined earlier in the notebook.
testData = selectInterestingColumns(
    spark.read.csv(
        'wasb:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections2.csv',
        inferSchema=True))
testDf = testData.where(
    "results = 'Fail' OR results = 'Pass' OR results = 'Pass w/ Conditions'")
predictionsDf = model.transform(testDf)
predictionsDf.registerTempTable('Predictions')
predictionsDf.columns
```
#### Look at the success rate.
```python
# A prediction counts as a success when 0 matches 'Fail' or 1 matches a pass.
numSuccesses = predictionsDf.where("""(prediction = 0 AND results = 'Fail') OR
                                      (prediction = 1 AND (results = 'Pass' OR
                                                           results = 'Pass w/ Conditions'))""").count()
numInspections = predictionsDf.count()
print("There were %d inspections and there were %d successful predictions"
      % (numInspections, numSuccesses))
print("This is a %d%% success rate"
      % (float(numSuccesses) / float(numInspections) * 100))
```
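
The success-rate arithmetic can be checked by hand on a small example. The sketch below uses invented (prediction, result) pairs and plain Python rather than a Spark DataFrame; `is_success` is a hypothetical helper that mirrors the `where` condition above. Note that the `%d` format truncates the percentage to a whole number.

```python
# (prediction, actual result) pairs; values are invented for illustration.
predictions = [
    (0.0, "Fail"),
    (1.0, "Pass"),
    (1.0, "Pass w/ Conditions"),
    (1.0, "Fail"),
    (0.0, "Pass"),
    (1.0, "Pass"),
    (0.0, "Fail"),
]


def is_success(prediction, result):
    # Same rule as the where(...) filter: 0 matches Fail, 1 matches either pass.
    if prediction == 0.0:
        return result == "Fail"
    if prediction == 1.0:
        return result in ("Pass", "Pass w/ Conditions")
    return False


num_inspections = len(predictions)
num_successes = sum(1 for p, r in predictions if is_success(p, r))
rate = float(num_successes) / num_inspections * 100

print("There were %d inspections and there were %d successful predictions"
      % (num_inspections, num_successes))
print("This is a %d%% success rate" % rate)    # truncated to a whole percent
print("This is a %.1f%% success rate" % rate)  # one decimal place
```
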
#### Final visualization to help us reason about the results of this test.
<img src="./media/image4.png" width="624" height="281" />
Scenario 2 – Apply the basics of machine learning to predict book sales.
------------------------------------------------------------------------
In this scenario, you will apply your learnings from Scenario 1.
**Scenario**: This notebook demonstrates how to use MLlib, Spark's
built-in machine learning library, to perform a simple predictive
analysis on an open dataset.
**Launch Jupyter Notebooks**
https://&lt;Fill\_ME\_IN&gt;.azurehdinsight.net/jupyter/tree/PySpark
- Username: &lt;FILL\_ME\_IN&gt;
- Password: &lt;FILL\_ME\_IN&gt;
<img src="./media/image1.png" width="624" height="250" />
### Open Final Lab.ipynb
This notebook follows the same steps you worked through in Scenario 1. Here,
you will apply what you learned to a different dataset.
<img src="./media/image5.png" width="624" height="238" />
### Notebook Setup
- Read the opening paragraph to understand the scenario and the
model to apply.
- Run through all the steps in the notebook.
- To run a cell, place the cursor in the cell and then press
**SHIFT + ENTER**.
### Initializing Spark - Construct an Input DataFrame
Read the dataset, along with its headers, from a csv file stored in Azure Blob
Storage.
- Replace **&lt;FILL\_ME\_IN\_WITH\_header=True&gt;** with
**header=True** in the following statement
```python
inspections = spark.read.csv('/sparklabs/Lab03/SaleTransactions1.csv',
inferSchema=True, <FILL_ME_IN_WITH_header=True>);
```
#### Inspect Schema
```python
inspections.printSchema()
```
#### See a detailed record
```python
df.take(1)
```
### Understand the dataset
Let's start to get a sense of what our dataset contains. For example,
what are the different values in the `CustomerAction` column?
- Replace **&lt;FILL\_ME\_IN\_WITH\_ColName&gt;** with
**CustomerAction**
```python
inspections.select('FILL_ME_IN_WITH_ColName').distinct().show()
```
#### A visualization can help us reason about the distribution of these outcomes.
- Replace **&lt;FILL\_ME\_IN\_WITH\_ColName&gt;** with
**CustomerAction**
```python
%%local
%matplotlib inline
import matplotlib.pyplot as plt
labels = count_results_df['FILL_ME_IN_WITH_ColName']
sizes = count_results_df['cnt']
colors = ['turquoise', 'seagreen', 'mediumslateblue', 'palegreen',
'coral']
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors)
plt.axis('equal')
```
<img src="./media/image6.png" width="624" height="350" />
Let us develop a model that can guess whether a book is
purchased, based on the customer action. From the previous visualization, a
**CustomerAction** can be one of the following: **Purchased**,
**Added To Cart** or **Browsed**.
Since logistic regression is a binary classification method, it makes
sense to group our data into two categories: **Purchased** and
**Not purchased**. An "Added To Cart" is not a purchase, so when we
train the model, we will consider "Added To Cart" and "Browsed" equivalent.
- Replace **&lt;FILL\_ME\_IN&gt;** with the **highlighted** values
```python
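from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

def labelForResults(s):
    # Map a CustomerAction value to a binary label.
    if s == 'Purchased':
        return 1.0
    elif s == 'Added To Cart' or s == 'Browsed':
        return 0.0
    else:
        return -1.0  # unknown action; filtered out below

label = UserDefinedFunction(labelForResults, DoubleType())
labeledData = inspections.select(
    label(inspections.CustomerAction).alias('label'),
    inspections.Name).where('label >= 0')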
```
### Create a logistic regression model from the input dataframe
This allows you to categorize the data, which you can then use to predict
the outcome in the next step.
### Evaluate the model on a separate test dataset
We can use the model we created earlier to predict whether the user
sessions in the test dataset end in a purchase.
```python
testData = spark.read.csv('/sparklabs/Lab03/SaleTransactions2.csv',
                          inferSchema=True, header=True)
testDf = testData.where(
    "CustomerAction = 'Purchased' OR "
    "CustomerAction = 'Added To Cart' OR CustomerAction = 'Browsed'")
predictionsDf = model.transform(testDf)
predictionsDf.registerTempTable('Predictions')
predictionsDf.columns
```
#### Look at the success rate.
- Replace **&lt;FILL\_ME\_IN&gt;** with the **highlighted** values
```python
# A prediction counts as a success when 1 matches 'Purchased' or 0 matches a non-purchase.
numSuccesses = predictionsDf.where("""(prediction = 1 AND CustomerAction = 'Purchased') OR
                                      (prediction = 0 AND (CustomerAction = 'Added To Cart' OR
                                                           CustomerAction = 'Browsed'))""").count()
numInspections = predictionsDf.count()
print("There were %d user sessions and there were %d successful predictions"
      % (numInspections, numSuccesses))
print("This is a %d%% success rate"
      % (float(numSuccesses) / float(numInspections) * 100))
```
#### Final visualization to help us reason about the results of this test.
<img src="./media/image4.png" width="624" height="281" />
Learn more and get help
=======================
- [Azure HDInsight
Overview](https://azure.microsoft.com/en-us/services/hdinsight/)
- [Getting started with Azure
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/)
- [Use Hive on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-tutorial-get-started)
- [Use Spark on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview)
- [Use Interactive Hive on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-interactive-hive)
- [Use HBase on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-overview)
- [Use Kafka on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-introduction)
- [Use Storm on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-overview)
- [Use R Server on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-r-server-overview)
- [Open Source component guide on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#hadoop-components-available-with-different-hdinsight-versions)
- [Extend your cluster to install open source
components](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#support-for-open-source-software-used-on-hdinsight-clusters)
- [HDInsight release
notes](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-release-notes)
- [HDInsight versioning and support
guidelines](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#supported-hdinsight-versions)
- [How to upgrade HDInsight cluster to a new
version](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-upgrade-cluster)
- [Ask HDInsight questions on
Stack Overflow](https://stackoverflow.com/questions/tagged/hdinsight)
- [Ask HDInsight questions on MSDN
forums](https://social.msdn.microsoft.com/forums/azure/en-us/home?forum=hdinsight)

Binary data
Labs/Azure HDInsight/DataScienceLab/images/BookSalesPrediction.png Normal file

Binary file not shown.

Size: 166 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/images/JupyterNotebooks.png Normal file

Binary file not shown.

Size: 68 KiB

Binary file not shown.

Size: 168 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/images/Sc1DataVisualize.png Normal file

Binary file not shown.

Size: 154 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/images/Sc1ModelVisualize.png Normal file

Binary file not shown.

Size: 151 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/images/Sc2DataVisualize.png Normal file

Binary file not shown.

Size: 126 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/images/Sc2ModelVisualize.png Normal file

Binary file not shown.

Size: 135 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/images/Storage.png Normal file

Binary file not shown.

Size: 198 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/images/blob.png Normal file

Binary file not shown.

Size: 224 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/images/uploadweblogs.png Normal file

Binary file not shown.

Size: 224 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/media/image1.png Normal file

Binary file not shown.

Size: 68 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/media/image2.png Normal file

Binary file not shown.

Size: 168 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/media/image3.png Normal file

Binary file not shown.

Size: 154 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/media/image4.png Normal file

Binary file not shown.

Size: 151 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/media/image5.png Normal file

Binary file not shown.

Size: 166 KiB

Binary data
Labs/Azure HDInsight/DataScienceLab/media/image6.png Normal file

Binary file not shown.

Size: 126 KiB

@ -0,0 +1,5 @@
# Azure HDInsight - Big data processing using Hive on Azure HDInsight
## Deployment
For this lab, an HDInsight cluster is already created for you. If you want to create this cluster on your own, please go through this [deployment guide](deployment/readme.md).
## Lab
See the hands-on lab [here](hands-on-lab.md).

Binary data
Labs/Azure HDInsight/HiveLab/.DS_Store Vendored Normal file

Binary file not shown.

Binary data
Labs/Azure HDInsight/HiveLab/Data/.DS_Store Vendored Normal file

Binary file not shown.

Binary data
Labs/Azure HDInsight/HiveLab/Data/hadooplabs/.DS_Store Vendored Normal file

Binary file not shown.

File diff suppressed because it is too large

Binary data
Labs/Azure HDInsight/HiveLab/README.docx Normal file

Binary file not shown.

Binary data
Labs/Azure HDInsight/HiveLab/deployment/media/image1.png Normal file

Binary file not shown.

Size: 40 KiB

Binary data
Labs/Azure HDInsight/HiveLab/deployment/media/image2.png Normal file

Binary file not shown.

Size: 37 KiB

Binary data
Labs/Azure HDInsight/HiveLab/deployment/media/image3.png Normal file

Binary file not shown.

Size: 16 KiB

Binary data
Labs/Azure HDInsight/HiveLab/deployment/media/image4.png Normal file

Binary file not shown.

Size: 14 KiB

Binary data
Labs/Azure HDInsight/HiveLab/deployment/media/image5.png Normal file

Binary file not shown.

Size: 12 KiB

Binary data
Labs/Azure HDInsight/HiveLab/deployment/media/image6.png Normal file

Binary file not shown.

Size: 198 KiB

Binary data
Labs/Azure HDInsight/HiveLab/deployment/media/image7.png Normal file

Binary file not shown.

Size: 224 KiB

Binary data
Labs/Azure HDInsight/HiveLab/deployment/readme.docx Normal file

Binary file not shown.

@ -0,0 +1,84 @@
Provision HDInsight Linux Hadoop cluster with Azure Management Portal
---------------------------------------------------------------------
To provision an HDInsight Hadoop cluster with the Azure Management Portal,
perform the steps below.
1. Go to the Azure Portal (portal.azure.com) and log in using your Azure
account credentials.
2. Select **NEW -&gt; Data Analytics -&gt; HDInsight**
> <img src="./media/image1.png" width="592" height="180" />
1. Enter or select the following values.
   1. **Cluster Name:** Enter the cluster name. A green tick will
      appear if the cluster name is available.
   2. **Cluster Type:** Select **Hadoop** as the cluster type.
   3. **Cluster Operating System:** Select **Linux** as the cluster
      operating system.
   4. **Version:** Select **3.6** as the cluster version.
   5. **Cluster Tier:** Select the **Standard** cluster tier.
> <img src="./media/image2.png" width="436" height="372" />
1. **Subscription:** Select the Azure subscription to create
the cluster.
2. **Resource Group:** Select an existing resource group or create a
new resource group.
3. **Credentials:** Configure the username and password for the HDInsight
cluster and the SSH connection. The SSH connection is used to connect to the
HDInsight cluster through an SSH client such as PuTTY.
> <img src="./media/image3.png" width="219" height="400" />
1. **Data Source:** Create a new storage account and a
default container.
> <img src="./media/image4.png" width="230" height="309" />
1. **Node Pricing Tiers:** Set the number of head nodes and worker nodes
as shown below.
> <img src="./media/image5.png" width="228" height="290" />
**Note:** You can select the lowest pricing tier (A3 nodes) or reduce the
number of worker nodes to decrease the cluster cost.
1. Leave the other configuration options at their defaults and click **Create** to
provision the HDInsight Hadoop cluster. It will take 15-20 minutes to
provision the cluster.
**The HDInsight Linux Hadoop cluster is now ready to work with.**
Copy lab data to the storage account
------------------------------------
In this section, you'll copy the files required for the lab to your
storage account.
To copy the files, follow the steps below.
1. Launch Azure Storage from your cluster dashboard
> <img src="./media/image6.png" width="624" height="854" />
1. Select the **Blob container** for your cluster
2. Create a container called **hadooplabs**
3. Navigate to **hadooplabs** and create a container called **Lab1**
4. Upload weblogs.csv to Lab1. The file can be found in the
**data\\hadooplabs\\Lab1** folder.
<img src="./media/image7.png" width="624" height="334" />

Binary data
Labs/Azure HDInsight/HiveLab/hands-on-lab.docx Normal file

Binary file not shown.

@ -0,0 +1,390 @@
Overview
========
Azure HDInsight is the only fully-managed cloud Apache Hadoop offering
that gives you optimized open-source analytic clusters for Spark, Hive,
MapReduce, HBase, Storm, Kafka, and Microsoft R Server backed by a 99.9%
SLA. Deploy these big data technologies and ISV applications as managed
clusters with enterprise-level security and monitoring.
Hive is a data warehousing system that simplifies analyzing large
datasets stored in Hadoop clusters, using a SQL-like language known as
HiveQL. Hive converts queries to MapReduce, Apache Tez, or Apache
Spark jobs.
To highlight how customers can efficiently leverage HDInsight Hive to
analyze big data stored in Azure Blob Storage, this document provides an
end-to-end walkthrough of analyzing a web transaction log of an
imaginary book store using Hive.
<span id="about-the-code" class="anchor"></span>After completing this
lab, you will have learned:
1. Different ways to execute Hive queries on an HDInsight cluster
2. How to use joins, aggregates, analytic functions, ranking functions, GROUP
BY and ORDER BY in Hive Query Language.
Learn the basics of querying with Hive
======================================
### Launch Hive Views in Ambari portal
Replace &lt;FILL\_ME\_IN&gt; with the cluster name, and sign in with the username and password provided or created.
- [https://
&lt;FILL\_ME\_IN&gt;/\#/main/view/HIVE/auto\_hive20\_instance](https://pranavsparkbuildlab.azurehdinsight.net/#/main/view/HIVE/auto_hive20_instance)
- Username: &lt;FILL\_ME\_IN&gt;
- Password: &lt;FILL\_ME\_IN&gt;
<img src="./images/image1.png" width="624" height="363" />
### Load Data into table
- Copy and paste the following query in the **Query Editor.** *Do not
execute yet.*
```sql
DROP DATABASE IF EXISTS HDILABDB CASCADE;
CREATE DATABASE HDILABDB;
USE HDILABDB;
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs(
    TransactionDate varchar(50),
    CustomerId varchar(50),
    BookId varchar(50),
    PurchaseType varchar(50),
    TransactionId varchar(50),
    OrderId varchar(50),
    BookName varchar(50),
    CategoryName varchar(50),
    Quantity varchar(50),
    ShippingAmount varchar(50),
    InvoiceNumber varchar(50),
    InvoiceStatus varchar(50),
    PaymentAmount varchar(50)
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'wasb:///hadooplabs/Lab1/weblogs/';
LOAD DATA INPATH 'wasb:///hadooplabs/Lab1/weblogs.csv'
    INTO TABLE HDILABDB.weblogs;
```
- Click **Execute** to run the query. Once the query completes, the status
under Query Process Results will change to **SUCCEEDED**.
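
To make the table layout concrete, the sketch below splits one comma-delimited line into the 13 columns declared above. It is plain Python, not Hive; the sample line is invented and `parse_line` is a hypothetical helper. It splits on every comma without any quote handling, which is how the plain delimited row format is generally understood to behave, and every value stays a string because all columns are declared `varchar(50)`.

```python
# Column names in the order declared by the CREATE EXTERNAL TABLE statement.
COLUMNS = [
    "TransactionDate", "CustomerId", "BookId", "PurchaseType",
    "TransactionId", "OrderId", "BookName", "CategoryName", "Quantity",
    "ShippingAmount", "InvoiceNumber", "InvoiceStatus", "PaymentAmount",
]

# One invented line in the same comma-delimited layout; not real lab data.
sample = ("2015-01-03,C0001,B0042,Purchased,T0001,107,"
          "Space fact and fiction,Science,2,5,INV0001,Issued,40\n")


def parse_line(line):
    # Split on every comma, with no quoting rules applied.
    fields = line.rstrip("\n").split(",")
    return dict(zip(COLUMNS, fields))


row = parse_line(sample)
print(row["BookName"], row["Quantity"])
```
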
### Select total count
- Create a new Worksheet and execute the following query in the
**Query Editor**.
```sql
SELECT COUNT(*) FROM HDILABDB.weblogs;
```
### View the data
- Create a new Worksheet and execute the following query in the
**Query Editor**.
```sql
SELECT * FROM HDILABDB.weblogs LIMIT 5;
```
### Where clause
- Create a new Worksheet and execute the following query in the
**Query Editor**.
```sql
SELECT * FROM HDILABDB.weblogs WHERE orderid='107';
```
<img src="./images/image2.png" width="624" height="404" />
### Find DISTINCT
- Create a new Worksheet and execute the following query in the
**Query Editor**.
```sql
SELECT DISTINCT bookname FROM HDILABDB.weblogs WHERE orderid='107';
```
### GROUP BY
- Create a new Worksheet and execute the following query in the
**Query Editor**.
```sql
SELECT bookname,COUNT(*) FROM HDILABDB.weblogs GROUP BY bookname;
```
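
If you want to experiment with these query shapes without a cluster, the sketch below runs similar statements against an in-memory SQLite database using Python's standard `sqlite3` module. SQLite is only a stand-in here and is not Hive; the three-column table and its rows are invented, and only the simple `COUNT`, `DISTINCT`, `WHERE` and `GROUP BY` forms used above are assumed to behave alike.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A cut-down, invented version of the weblogs table.
conn.execute(
    "CREATE TABLE weblogs (orderid TEXT, bookname TEXT, purchasetype TEXT)")
conn.executemany(
    "INSERT INTO weblogs VALUES (?, ?, ?)",
    [
        ("107", "Science in Dispute", "Browsed"),
        ("107", "Science in Dispute", "Purchased"),
        ("107", "New Christian poetry", "Browsed"),
        ("108", "Space fact and fiction", "Purchased"),
    ],
)

# Total row count.
total = conn.execute("SELECT COUNT(*) FROM weblogs").fetchone()[0]

# Distinct book names for one order.
distinct_books = [r[0] for r in conn.execute(
    "SELECT DISTINCT bookname FROM weblogs "
    "WHERE orderid='107' ORDER BY bookname")]

# Number of rows per book.
per_book = dict(conn.execute(
    "SELECT bookname, COUNT(*) FROM weblogs GROUP BY bookname"))

print(total, distinct_books, per_book)
conn.close()
```
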
### Analyse query using “Visual Explain”
<img src="./images/image3.png" width="624" height="344" />
Scenario 2 – Apply the basics
=============================
<span id="_Toc465361432" class="anchor"><span id="_Toc465379292" class="anchor"></span></span>Perform book store sales analysis
-------------------------------------------------------------------------------------------------------------------------------
In this section, you'll run Hive queries to analyse the data in the
weblogs table. The weblogs table contains transactional data of an
imaginary online bookstore. You'll analyse the sales data and
prepare a sales report.
All analysis is based on the weblogs table, created earlier in the lab.
The table description is given below.
| **Column** | **Description** |
|-----------------|----------------------------------------------------------------------------|
| TransactionDate | The date of the transaction |
| CustomerId | Unique Id assigned to the customer |
| BookId | Unique id assigned to a book in the book store |
| PurchaseType | Purchased: customer bought the book. Browsed: customer browsed but did not purchase the book. Added to Cart: customer added the book to the shopping cart. |
| TransactionId | Unique Id assigned to a transaction |
| OrderId | Unique order id |
| BookName | The name of the book accessed by the customer |
| CategoryName | The category of the book accessed by the customer |
| Quantity | Quantity of the book purchased. Valid only for PurchaseType = Purchased |
| ShippingAmount | Shipping cost |
| InvoiceNumber | Invoice number if a customer purchased the book |
| InvoiceStatus | The status of the invoice |
| PaymentAmount | Total amount paid by the customer. Valid only for PurchaseType = Purchased |
### Launch Hive Views in Ambari portal
Replace &lt;FILL\_ME\_IN&gt; with the cluster name, and sign in with the username and password provided or created.
- [https://
&lt;FILL\_ME\_IN&gt;/\#/main/view/HIVE/auto\_hive20\_instance](https://pranavsparkbuildlab.azurehdinsight.net/#/main/view/HIVE/auto_hive20_instance)
- Username: &lt;FILL\_ME\_IN&gt;
- Password: &lt;FILL\_ME\_IN&gt;
### Problem Statement \#1
Write a query to return the total quantity sold and the total payment
amount for each category. The output should look like this.
| **CategoryName** | **QuantitySold** | **TotalAmount** |
|-------------------|------------------|-----------------|
| Drive\_books | 211029 | 2064435 |
| Adventure | 112470 | 1022195 |
| World\_History | 112263 | 1048990 |
| Art | 112105 | 1043190 |
| Non\_Fiction | 111731 | 1046410 |
| Psychology | 111555 | 1024255 |
| Romance | 111316 | 1038265 |
| Automobile\_books | 110017 | 1030720 |
| Philosophy | 109691 | 1042410 |
| Fiction | 109460 | 1032795 |
| Drama | 109246 | 1038565 |
| Management | 108262 | 1030805 |
| Programming | 108196 | 1013210 |
| Music | 108121 | 998930 |
| Cook | 108056 | 1051710 |
| Science | 107706 | 1063445 |
| Religion | 107513 | 999780 |
| Political | 106000 | 1034820 |
#### Create a new Worksheet and execute the following query in the Query Editor.
```sql
-- Get top-selling categories
DROP TABLE IF EXISTS HDILABDB.SalesbyCategory;
-- Store the aggregated result as delimited text
CREATE TABLE HDILABDB.SalesbyCategory
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\1' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/SalesbyCategory'
AS
SELECT
    CategoryName,
    SUM(Quantity) AS QuantitySold,
    SUM(PaymentAmount) AS TotalAmount
FROM HDILABDB.weblogs
WHERE PurchaseType = 'Purchased'
GROUP BY CategoryName
ORDER BY QuantitySold DESC;
-- Preview the result
SELECT * FROM HDILABDB.SalesbyCategory LIMIT 10;
```
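To see what the aggregation computes, the same filter and `GROUP BY` logic can be tried on a handful of rows locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive. The sample rows are invented, only the columns the query needs are included, and the Hive storage clauses (`ROW FORMAT`, `LOCATION`) are left out because they do not apply outside the cluster.

```python
import sqlite3

# Invented sample rows: (CategoryName, PurchaseType, Quantity, PaymentAmount)
rows = [
    ("Art", "Purchased", 2, 20),
    ("Art", "Browsed", 0, 0),
    ("Art", "Purchased", 1, 10),
    ("Drama", "Purchased", 5, 45),
    ("Drama", "Added to Cart", 0, 0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weblogs (CategoryName TEXT, PurchaseType TEXT, "
    "Quantity INTEGER, PaymentAmount INTEGER)"
)
conn.executemany("INSERT INTO weblogs VALUES (?, ?, ?, ?)", rows)

# Same shape as the lab query: keep purchases only, then total per category.
sales_by_category = conn.execute(
    """
    SELECT CategoryName,
           SUM(Quantity)      AS QuantitySold,
           SUM(PaymentAmount) AS TotalAmount
    FROM weblogs
    WHERE PurchaseType = 'Purchased'
    GROUP BY CategoryName
    ORDER BY QuantitySold DESC
    """
).fetchall()

for row in sales_by_category:
    print(row)
```

Rows with a `PurchaseType` other than `Purchased` do not contribute to either total, which is why the filter comes before the grouping.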
### Problem Statement \#2
Write a query to return the total payment amount and the total quantity
sold per book. The output should look like this.
| **BookName** | **QuantitySold** | **TotalAmount** |
|--------------------------------------|------------------|-----------------|
| The voyages of Captain Cook | 232414 | 2194890 |
| Advances in school psychology | 231410 | 2193740 |
| Science in Dispute | 231408 | 2168425 |
| History of political economy | 231255 | 2190040 |
| THE BOOK OF WITNESSES | 230872 | 2145540 |
| The adventures of Arthur Conan Doyle | 230023 | 2191910 |
| Space fact and fiction | 229908 | 2171820 |
| New Christian poetry | 228849 | 2185845 |
| Understanding American politics | 228598 | 2182720 |
#### Create a new Worksheet and execute the following query in the Query Editor.
```sql
-- Top-selling books
DROP TABLE IF EXISTS HDILABDB.SalesbyBooks;
-- Store the aggregated result as delimited text
CREATE TABLE HDILABDB.SalesbyBooks
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\1' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/SalesbyBooks'
AS
SELECT
    BookName,
    SUM(Quantity) AS QuantitySold,
    SUM(PaymentAmount) AS TotalAmount
FROM HDILABDB.weblogs
WHERE PurchaseType = 'Purchased'
GROUP BY BookName
ORDER BY QuantitySold DESC;
-- Preview the result
SELECT * FROM HDILABDB.SalesbyBooks LIMIT 10;
```
<img src="./images/image4.png" width="624" height="373" />
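The query above writes its result as delimited text under the `LOCATION` path, so the rows can also be read outside Hive. The sketch below parses one such line. It assumes that the `'\1'` field terminator is written as the Ctrl-A byte (0x01), that `QuantitySold` is an integer, and that `TotalAmount` is numeric; the helper name `parse_sales_line` is hypothetical, and the sample line reuses values from the expected output table.

```python
from typing import Tuple

# Assumption: Hive writes the '\1' field terminator as Ctrl-A (0x01).
FIELD_DELIMITER = "\x01"


def parse_sales_line(line: str) -> Tuple[str, int, float]:
    """Split one exported row into (BookName, QuantitySold, TotalAmount)."""
    name, quantity, amount = line.rstrip("\n").split(FIELD_DELIMITER)
    # float() accepts both "2168425" and "2168425.0", since the column type
    # of PaymentAmount is not stated in the lab.
    return name, int(quantity), float(amount)


sample = "Science in Dispute\x01231408\x012168425\n"
print(parse_sales_line(sample))
```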
### Problem Statement \#3
Write a query to return the top 3 books browsed by the customers who
also browsed the book **THE BOOK OF WITNESSES**. Your output should
look like this:
| **BookName** | **cnt** |
|------------------------------|---------|
| New Christian poetry | 9445 |
| History of political economy | 9384 |
| Science in Dispute | 9367 |
#### Create a new Worksheet and execute the following query in the Query Editor.
```sql
-- Books most often browsed by customers who also browsed THE BOOK OF WITNESSES
DROP TABLE IF EXISTS HDILABDB.customerswhobrowsedxbook;
CREATE TABLE HDILABDB.customerswhobrowsedxbook
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\1' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/customerswhobrowsedxbook'
AS
WITH Customerwhobrowsedbookx AS
(
    -- Customers who browsed the reference book
    SELECT DISTINCT CustomerId
    FROM HDILABDB.weblogs
    WHERE PurchaseType = 'Browsed'
      AND BookName = 'THE BOOK OF WITNESSES'
)
-- Count the other books those customers browsed
SELECT w.BookName, COUNT(*) AS cnt
FROM HDILABDB.weblogs w
JOIN Customerwhobrowsedbookx cte
  ON w.CustomerId = cte.CustomerId
WHERE w.PurchaseType = 'Browsed'
  AND w.BookName NOT IN ('THE BOOK OF WITNESSES')
GROUP BY w.BookName
HAVING COUNT(*) > 10
ORDER BY cnt DESC
LIMIT 3;
-- Preview the result
SELECT * FROM HDILABDB.customerswhobrowsedxbook LIMIT 10;
```
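The query works in two steps: a common table expression collects the customers who browsed the reference book, and a join back to weblogs counts the other books those customers browsed. The sketch below replays that logic on a few invented rows, using Python's built-in `sqlite3` module as a stand-in for Hive. The CTE name `target_customers` is made up, and the `HAVING COUNT(*) > 10` filter is dropped because the toy data is too small for it.

```python
import sqlite3

TARGET = "THE BOOK OF WITNESSES"

# Invented sample rows: (CustomerId, BookName, PurchaseType)
rows = [
    (1, TARGET, "Browsed"),
    (1, "Science in Dispute", "Browsed"),
    (1, "New Christian poetry", "Browsed"),
    (2, TARGET, "Browsed"),
    (2, "Science in Dispute", "Browsed"),
    (3, "Science in Dispute", "Browsed"),      # customer 3 never browsed TARGET
    (2, "New Christian poetry", "Purchased"),  # a purchase, not a browse event
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weblogs (CustomerId INTEGER, BookName TEXT, PurchaseType TEXT)"
)
conn.executemany("INSERT INTO weblogs VALUES (?, ?, ?)", rows)

also_browsed = conn.execute(
    """
    WITH target_customers AS (
        -- Step 1: customers who browsed the reference book
        SELECT DISTINCT CustomerId
        FROM weblogs
        WHERE PurchaseType = 'Browsed' AND BookName = :target
    )
    -- Step 2: other books browsed by those customers
    SELECT w.BookName, COUNT(*) AS cnt
    FROM weblogs w
    JOIN target_customers t ON w.CustomerId = t.CustomerId
    WHERE w.PurchaseType = 'Browsed' AND w.BookName <> :target
    GROUP BY w.BookName
    ORDER BY cnt DESC
    LIMIT 3
    """,
    {"target": TARGET},
).fetchall()

for row in also_browsed:
    print(row)
```

Customer 3's browse of "Science in Dispute" is not counted, because that customer is not in the set produced by the first step.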
Learn more and get help
=======================
- [Azure HDInsight
Overview](https://azure.microsoft.com/en-us/services/hdinsight/)
- [Getting started with Azure
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/)
- [Use Hive on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-tutorial-get-started)
- [Use Spark on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview)
- [Use Interactive Hive on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-interactive-hive)
- [Use HBase on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-overview)
- [Use Kafka on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-introduction)
- [Use Storm on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-overview)
- [Use R Server on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-r-server-overview)
- [Open Source component guide on
HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#hadoop-components-available-with-different-hdinsight-versions)
- [Extend your cluster to install open source
components](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#support-for-open-source-software-used-on-hdinsight-clusters)
- [HDInsight release
notes](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-release-notes)
- [HDInsight versioning and support
guidelines](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#supported-hdinsight-versions)
- [How to upgrade HDInsight cluster to a new
version](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-upgrade-cluster)
- [Ask HDInsight questions on
    Stack Overflow](https://stackoverflow.com/questions/tagged/hdinsight)
- [Ask HDInsight questions on MSDN
    forums](https://social.msdn.microsoft.com/forums/azure/en-us/home?forum=hdinsight)

Binary files (not shown):

Labs/Azure HDInsight/HiveLab/images/BookSalesPrediction.png (normal file, 166 KiB)

Labs/Azure HDInsight/HiveLab/images/HiveViews.png (normal file, 106 KiB)

Labs/Azure HDInsight/HiveLab/images/HiveVisualExplain.png (normal file, 129 KiB)

Labs/Azure HDInsight/HiveLab/images/Sc1WhereClauses.png (normal file, 241 KiB)

Labs/Azure HDInsight/HiveLab/images/Sc2TopSellingBooks.png (normal file, 293 KiB)

Labs/Azure HDInsight/HiveLab/images/Storage.png (normal file, 198 KiB)

Labs/Azure HDInsight/HiveLab/images/blob.png (normal file, 224 KiB)

Labs/Azure HDInsight/HiveLab/images/image1.png (normal file, 106 KiB)

Labs/Azure HDInsight/HiveLab/images/image2.png (normal file, 241 KiB)

Labs/Azure HDInsight/HiveLab/images/image3.png (normal file, 129 KiB)

Labs/Azure HDInsight/HiveLab/images/image4.png (normal file, 293 KiB)

Labs/Azure HDInsight/HiveLab/images/uploadweblogs.png (normal file, 224 KiB)


@ -0,0 +1,5 @@
# Azure HDInsight - Big data processing using Hive on Azure HDInsight
## Deployment
For this lab, an HDInsight cluster has already been created for you. If you want to create the cluster yourself, follow the [deployment guide](deployment/readme.md).
## Lab
See the hands-on lab [here](hands-on-lab.md).