Adding labs for Azure HDInsight

2017-05-02 19:41:29 -07:00 · 2017-05-02 19:41:29 -07:00 · d7922b6f8c
--- a/HDInsight/DataScienceLab/.DS_Store
+++ b/HDInsight/DataScienceLab/.DS_Store
--- a/HDInsight/DataScienceLab/Data/.DS_Store
+++ b/HDInsight/DataScienceLab/Data/.DS_Store
--- a/HDInsight/DataScienceLab/Data/sparklabs/.DS_Store
+++ b/HDInsight/DataScienceLab/Data/sparklabs/.DS_Store
--- a/HDInsight/DataScienceLab/Data/sparklabs/Lab03/.DS_Store
+++ b/HDInsight/DataScienceLab/Data/sparklabs/Lab03/.DS_Store
--- a/HDInsight/DataScienceLab/Data/sparklabs/Lab03/SaleTransactions1.csv
+++ b/HDInsight/DataScienceLab/Data/sparklabs/Lab03/SaleTransactions1.csv
--- a/HDInsight/DataScienceLab/Data/sparklabs/Lab03/SaleTransactions2.csv
+++ b/HDInsight/DataScienceLab/Data/sparklabs/Lab03/SaleTransactions2.csv
--- a/HDInsight/DataScienceLab/Notebooks/05+-+Spark+Machine+Learning+-+Predictive+analysis+on+food+inspection+data+using+MLLib.ipynb
+++ b/HDInsight/DataScienceLab/Notebooks/05+-+Spark+Machine+Learning+-+Predictive+analysis+on+food+inspection+data+using+MLLib.ipynb
--- a/HDInsight/DataScienceLab/Notebooks/Final+Lab.ipynb
+++ b/HDInsight/DataScienceLab/Notebooks/Final+Lab.ipynb
--- a/HDInsight/DataScienceLab/README.docx
+++ b/HDInsight/DataScienceLab/README.docx
--- a/HDInsight/DataScienceLab/deployment/.DS_Store
+++ b/HDInsight/DataScienceLab/deployment/.DS_Store
--- a/HDInsight/DataScienceLab/deployment/media/image1.png
+++ b/HDInsight/DataScienceLab/deployment/media/image1.png
--- a/HDInsight/DataScienceLab/deployment/media/image10.png
+++ b/HDInsight/DataScienceLab/deployment/media/image10.png
--- a/HDInsight/DataScienceLab/deployment/media/image11.png
+++ b/HDInsight/DataScienceLab/deployment/media/image11.png
--- a/HDInsight/DataScienceLab/deployment/media/image12.png
+++ b/HDInsight/DataScienceLab/deployment/media/image12.png
--- a/HDInsight/DataScienceLab/deployment/media/image13.png
+++ b/HDInsight/DataScienceLab/deployment/media/image13.png
--- a/HDInsight/DataScienceLab/deployment/media/image2.png
+++ b/HDInsight/DataScienceLab/deployment/media/image2.png
--- a/HDInsight/DataScienceLab/deployment/media/image3.png
+++ b/HDInsight/DataScienceLab/deployment/media/image3.png
--- a/HDInsight/DataScienceLab/deployment/media/image4.png
+++ b/HDInsight/DataScienceLab/deployment/media/image4.png
--- a/HDInsight/DataScienceLab/deployment/media/image5.png
+++ b/HDInsight/DataScienceLab/deployment/media/image5.png
--- a/HDInsight/DataScienceLab/deployment/media/image6.png
+++ b/HDInsight/DataScienceLab/deployment/media/image6.png
--- a/HDInsight/DataScienceLab/deployment/media/image7.png
+++ b/HDInsight/DataScienceLab/deployment/media/image7.png
--- a/HDInsight/DataScienceLab/deployment/media/image8.png
+++ b/HDInsight/DataScienceLab/deployment/media/image8.png
--- a/HDInsight/DataScienceLab/deployment/media/image9.png
+++ b/HDInsight/DataScienceLab/deployment/media/image9.png
--- a/HDInsight/DataScienceLab/deployment/readme.docx
+++ b/HDInsight/DataScienceLab/deployment/readme.docx
--- a/HDInsight/DataScienceLab/deployment/readme.md
+++ b/HDInsight/DataScienceLab/deployment/readme.md
@ -0,0 +1,134 @@
+Provision HDInsight Linux Spark cluster with Azure Management Portal
+---------------------------------------------------------------------
+
+To provision an HDInsight Spark cluster with the Azure Management Portal,
+perform the steps below.
+
+1.  Go to the Azure portal at portal.azure.com and log in with your Azure
+    account credentials.
+
+2.  Select **NEW -&gt; Data Analytics -&gt; HDInsight**
+
+> <img src="./media/image1.png" width="592" height="180" />
+
+1.  Enter or select the following values.
+
+    1.  **Cluster Name:** Enter the cluster name. A green tick will
+        appear if the cluster name is available.
+
+    2.  **Cluster Type:** Select **Spark** as the cluster type.
+
+    3.  **Cluster Operating System:** Select Linux as the cluster
+        operating system
+
+    4.  **Version:** Select **3.6** as the cluster version.
+
+    5.  **Cluster Tier:** Select the **Standard** cluster tier
+
+> <img src="./media/image2.png" width="436" height="372" />
+
+1.  **Subscription:** Select the Azure subscription to create
+    the cluster.
+
+2.  **Resource Group:** Select an existing resource group or create a
+    new resource group.
+
+3.  **Credentials:** Configure the username and password for the HDInsight
+    cluster and the SSH connection. The SSH connection is used to connect to
+    the HDInsight cluster through an SSH client such as PuTTY.
+
+> <img src="./media/image3.png" width="219" height="400" />
+
+1.  **Data Source:** Create a new storage account and a
+    default container.
+
+> <img src="./media/image4.png" width="230" height="309" />
+
+1.  **Node Pricing Tiers:** Set the number of head nodes and worker nodes
+    as shown below.
+
+> <img src="./media/image5.png" width="228" height="290" />
+
+**Note:** You can select the lowest pricing tier (A3 nodes) or reduce the
+number of worker nodes to decrease the cluster cost.
+
+1.  Leave the other configuration options at their defaults and click
+    **Create** to provision the HDInsight Spark cluster. Provisioning
+    takes 15-20 minutes.
+
+**The HDInsight Linux Spark cluster is now ready to work with.**
+
+Copy lab data to the storage account
+------------------------------------
+
+In this section, you’ll copy the files required for the lab to your
+storage account.
+
+To copy the files, follow the steps below.
+
+1.  Launch Azure Storage from your cluster dashboard
+
+> <img src="./media/image6.png" width="624" height="854" />
+
+1.  Select the **Blob container** for your cluster
+
+2.  Create a container called **sparklabs**
+
+3.  Navigate to **sparklabs** and create a folder called **Lab03**
+
+4.  Upload SaleTransactions1.csv and SaleTransactions2.csv to Lab03.
+    Both files can be found in the **Data\\sparklabs\\Lab03** folder.
+
+    <img src="./media/image7.png" width="624" height="334" />
+
+Launching a new Jupyter Notebook
+--------------------------------
+
+### Access Azure Portal
+
+1.  Sign in to the [Azure Portal](https://ms.portal.azure.com/).
+
+If Spark Cluster is pinned to the “StartBoard”:
+
+1.  Click the tile for your Spark Cluster.
+
+<img src="./media/image8.png" width="273" height="169" />
+
+If Spark Cluster is not pinned to the “StartBoard”:
+
+1.  Click Browse, select HDInsight Clusters.
+
+<img src="./media/image9.png" width="223" height="219" />
+
+1.  Select your Spark Cluster.
+
+<img src="./media/image10.png" width="277" height="124" />
+
+### Launch Jupyter Notebook
+
+1.  Click the **Cluster Dashboards** tile displayed under Quick Links on
+    the cluster blade.
+
+<img src="./media/image11.png" width="267" height="134" />
+
+1.  Locate the **Jupyter Notebook** tile on the Cluster Dashboards blade and
+    click it.
+
+<img src="./media/image12.png" width="102" height="251" />
+
+1.  When prompted, enter the admin credentials for the Spark cluster.
+
+This will open the Jupyter dashboard.
+
+<img src="./media/image13.png" width="278" height="100" />
+
+### Upload a new notebook
+
+1.  Click the **Upload** button at the top right of the
+    Jupyter Notebook screen.
+
+2.  Select a file with an .ipynb extension.
+
+3.  Upload the notebook, then click it to launch it.
+
+
--- a/HDInsight/DataScienceLab/hands-on-lab.docx
+++ b/HDInsight/DataScienceLab/hands-on-lab.docx
--- a/HDInsight/DataScienceLab/hands-on-lab.md
+++ b/HDInsight/DataScienceLab/hands-on-lab.md
@ -0,0 +1,362 @@
+Overview
+--------
+
+Azure HDInsight is the only fully-managed cloud Apache Hadoop offering
+that gives you optimized open-source analytic clusters for Spark, Hive,
+MapReduce, HBase, Storm, Kafka, and Microsoft R Server backed by a 99.9%
+SLA. Deploy these big data technologies and ISV applications as managed
+clusters with enterprise-level security and monitoring.
+
+This lab focuses on the machine learning component of Spark and
+highlights its value proposition in the Apache Spark big data processing
+framework.
+
+This hands-on lab will step you through the following features:
+
+1.  **Notebook** – Connect to a Notebook and run the notebook
+
+2.  **Basics of Spark** – Use Python to analyze data using Spark
+
+3.  **Basics of Machine Learning** – Use MLlib, Spark's built-in
+    machine learning library, to perform a simple prediction on an
+    open dataset.
+
+
+Learn the basics of data science using Spark
+--------------------------------------------
+
+This notebook demonstrates how to use MLlib, Spark's built-in machine
+learning library, to perform a simple prediction on an open dataset.
+
+**Launch Jupyter Notebooks**
+
+Navigate to the link below, filling in your cluster name, and sign in with the username and password provided: https://&lt;FILL\_ME\_IN&gt;.azurehdinsight.net/jupyter/tree/PySpark
+
+-   Username: &lt;FILL\_ME\_IN&gt;
+
+-   Password: &lt;FILL\_ME\_IN&gt;
+
+<img src="./media/image1.png" width="624" height="250" />
+
+### Open [Spark Machine Learning - Predictive analysis on food inspection data using MLLib.ipynb](https://pranavsparkbuildlab.azurehdinsight.net/jupyter/notebooks/PySpark/05%20-%20Spark%20Machine%20Learning%20-%20Predictive%20analysis%20on%20food%20inspection%20data%20using%20MLLib.ipynb)
+
+This is a sample notebook that walks you through interacting with a
+notebook and the basics of machine learning with Spark. You will apply
+what you learn here in a new notebook to predict book sales.
+
+<img src="./media/image2.png" width="624" height="226" />
+
+### Notebook Setup
+
+-   Read the opening paragraph to understand the scenario and the
+    model to apply.
+
+-   Run through all the steps in the notebook.
+
+-   To run a cell, place the cursor in the cell and then press
+    **SHIFT + ENTER**.
+
+### Initializing Spark - Construct an Input DataFrame
+
+Read the dataset from a csv file stored in Azure Blob Storage.
+
+```python
+inspections = spark.read.csv(
+    'wasb:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv',
+    inferSchema=True)
+```
+#### Inspect Schema
+```python
+inspections.printSchema()
+```
+#### See a detailed record
+```python
+inspections.take(1)
+```
+#### Understand the dataset
+Let's start to get a sense of what our dataset contains. For
+example, what are the different values in the `results` column?
+```python
+df.select('results').distinct().show()
+```
+#### A visualization can help us reason about the distribution of these outcomes.
+```python
+%%local
+%matplotlib inline
+import matplotlib.pyplot as plt
+
+labels = count_results_df['results']
+sizes = count_results_df['cnt']
+colors = ['turquoise', 'seagreen', 'mediumslateblue', 'palegreen', 'coral']
+plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors)
+plt.axis('equal')
+```
+<img src="./media/image3.png" width="624" height="335" />
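
The pie chart draws one slice per distinct outcome, sized by how often that outcome occurs. As a rough, Spark-free illustration of that aggregation, the sketch below counts outcomes with the standard library and converts the counts to the percentages that `autopct` would print. The `results` list and the `outcome_shares` helper are made up for illustration and are not taken from the lab dataset or notebook.

```python
from collections import Counter

# Hypothetical sample of inspection outcomes (not real lab data).
results = ["Pass", "Fail", "Pass", "Pass w/ Conditions", "Pass", "Fail"]


def outcome_shares(values):
    """Return {outcome: percentage of rows} for a list of outcomes."""
    counts = Counter(values)
    total = sum(counts.values())
    return {outcome: 100.0 * n / total for outcome, n in counts.items()}


shares = outcome_shares(results)
for outcome, pct in sorted(shares.items()):
    print("%s: %.1f%%" % (outcome, pct))
```
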
+
+### Create a logistic regression model from the input dataframe
+
+This will allow you to categorize the data which you can use to predict
+the outcome in the next step
+
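The notebook builds this model with Spark's machine learning library. To show what a logistic regression classifier does underneath, here is a minimal pure-Python sketch trained by gradient descent on a tiny one-feature dataset. It is not the notebook's code: the data, the learning rate, and the function names are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit weight w and bias b so that P(y=1 | x) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Threshold the predicted probability at 0.5."""
    return 1.0 if sigmoid(w * x + b) >= 0.5 else 0.0


# Hypothetical toy data: small x maps to label 0.0, large x to label 1.0.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

w, b = train_logistic(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print(predictions)
```

The fitted model is a weight and a bias; the 0.5 threshold on the predicted probability is what turns it into the binary `prediction` column used in the evaluation step.
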
+### Evaluate the model on a separate test dataset
+
+We can use the model we created earlier to predict what the results of
+new inspections will be, based on the violations that were observed.
+```python
+testData = selectInterestingColumns(spark.read.csv(
+    'wasb:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections2.csv',
+    inferSchema=True))
+testDf = testData.where(
+    "results = 'Fail' OR results = 'Pass' OR results = 'Pass w/ Conditions'")
+predictionsDf = model.transform(testDf)
+predictionsDf.registerTempTable('Predictions')
+predictionsDf.columns
+```
+#### Look at the success rate.
+```python
+numSuccesses = predictionsDf.where("""(prediction = 0 AND results = 'Fail') OR
+    (prediction = 1 AND (results = 'Pass' OR
+                         results = 'Pass w/ Conditions'))""").count()
+numInspections = predictionsDf.count()
+
+print("There were %d inspections and there were %d successful predictions"
+      % (numInspections, numSuccesses))
+print("This is a %d%% success rate"
+      % (float(numSuccesses) / float(numInspections) * 100))
+```
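
The success rate is the share of rows where the predicted class agrees with the recorded outcome: a prediction of 0 counts as correct for `Fail`, and a prediction of 1 counts as correct for `Pass` or `Pass w/ Conditions`. The sketch below applies the same rule to a plain Python list so the arithmetic can be checked by hand. The `sample` rows and the `success_rate` helper are hypothetical and are not output from the lab model.

```python
def success_rate(rows):
    """rows: iterable of (prediction, result) pairs; returns (hits, total)."""
    def is_hit(prediction, result):
        passed = result in ("Pass", "Pass w/ Conditions")
        return (prediction == 0 and result == "Fail") or (prediction == 1 and passed)

    rows = list(rows)
    hits = sum(1 for prediction, result in rows if is_hit(prediction, result))
    return hits, len(rows)


# Hypothetical predictions, invented for illustration.
sample = [(0, "Fail"), (1, "Pass"), (1, "Fail"), (1, "Pass w/ Conditions")]

hits, total = success_rate(sample)
print("There were %d inspections and there were %d successful predictions"
      % (total, hits))
print("This is a %d%% success rate" % (float(hits) / float(total) * 100))
```

Note that `%d` truncates the percentage to a whole number, exactly as in the notebook cell.
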
+#### Final visualization to help us reason about the results of this test.
+
+<img src="./media/image4.png" width="624" height="281" />
+
+Scenario 2 – Apply the basics of machine learning to predict book sales.
+------------------------------------------------------------------------
+
+In this scenario, you will apply your learnings from Scenario 1.
+
+**Scenario**: This notebook demonstrates how to use MLLib, Spark's
+built-in machine learning libraries, to perform a simple predictive
+analysis on an open dataset.
+
+**Launch Jupyter Notebooks**
+https://&lt;Fill\_ME\_IN&gt;.azurehdinsight.net/jupyter/tree/PySpark
+
+-   Username: &lt;FILL\_ME\_IN&gt;
+
+-   Password: &lt;FILL\_ME\_IN&gt;
+
+<img src="./media/image1.png" width="624" height="250" />
+
+### Open Final Lab.ipynb
+
+This notebook follows the same steps you learned in Scenario 1. In this
+notebook, you will apply them to a different dataset.
+
+<img src="./media/image5.png" width="624" height="238" />
+
+### Notebook Setup
+
+-   Read the opening paragraph to understand the scenario and the
+    model to apply.
+
+-   Run through all the steps in the notebook.
+
+-   To run a cell, place the cursor in the cell and then press
+    **SHIFT + ENTER**.
+
+### Initializing Spark - Construct an Input DataFrame
+
+Read the dataset along with headers from a csv file stored in Azure Blob
+Storage.
+
+-   Replace **&lt;FILL\_ME\_IN\_WITH\_header=True&gt;** with
+    **header=True** in the following statement.
+```python
+inspections = spark.read.csv('/sparklabs/Lab03/SaleTransactions1.csv',
+                             inferSchema=True, <FILL_ME_IN_WITH_header=True>)
+```
+#### Inspect Schema
+```python
+inspections.printSchema()
+```
+#### See a detailed record
+```python
+inspections.take(1)
+```
+### Understand the dataset
+
+Let's start to get a sense of what our dataset contains. For example,
+what are the different values in the `CustomerAction` column?
+
+-   Replace **&lt;FILL\_ME\_IN\_WITH\_ColName&gt;** with
+    **CustomerAction**
+```python
+inspections.select('FILL_ME_IN_WITH_ColName').distinct().show()
+```
+#### A visualization can help us reason about the distribution of these outcomes.
+
+-   Replace **&lt;FILL\_ME\_IN\_WITH\_ColName&gt;** with
+    **CustomerAction**
+```python
+%%local
+
+%matplotlib inline
+
+import matplotlib.pyplot as plt
+
+labels = count_results_df['FILL_ME_IN_WITH_ColName']
+
+sizes = count_results_df['cnt']
+
+colors = ['turquoise', 'seagreen', 'mediumslateblue', 'palegreen',
+'coral']
+
+plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors)
+
+plt.axis('equal')
+```
+<img src="./media/image6.png" width="624" height="350" />
+
+Let us develop a model that predicts whether a book is purchased,
+based on the customer action. From the previous visualization, a
+**CustomerAction** can be one of the following: **'Purchased'**,
+**'Added To Cart'** or **'Browsed'**.
+
+Since logistic regression is a binary classification method, it makes
+sense to group our data into two categories: **Purchased** and
+**Not purchased**. Neither "Added To Cart" nor "Browsed" is a purchase,
+so when we train the model, we will consider these two actions equivalent.
+
+-   Replace **&lt;FILL\_ME\_IN&gt;** with the **highlighted** text.
+```python
+from pyspark.sql.functions import UserDefinedFunction
+from pyspark.sql.types import DoubleType
+
+def labelForResults(s):
+    # 1.0 = purchased, 0.0 = not purchased, -1.0 = unrecognized action
+    if s == 'Purchased':
+        return 1.0
+    elif s == 'Added To Cart' or s == 'Browsed':
+        return 0.0
+    else:
+        return -1.0
+
+label = UserDefinedFunction(labelForResults, DoubleType())
+labeledData = inspections.select(
+    label(inspections.CustomerAction).alias('label'),
+    inspections.Name).where('label >= 0')
+```
+### Create a logistic regression model from the input dataframe
+
+This will allow you to categorize the data which you can use to predict
+the outcome in the next step
+
+### Evaluate the model on a separate test dataset
+
+We can use the model we created earlier to predict whether the
+transactions in a second dataset resulted in a purchase.
+```python
+testData = spark.read.csv('/sparklabs/Lab03/SaleTransactions2.csv',
+                          inferSchema=True, header=True)
+testDf = testData.where(
+    "CustomerAction = 'Purchased' OR "
+    "CustomerAction = 'Added To Cart' OR CustomerAction = 'Browsed'")
+predictionsDf = model.transform(testDf)
+predictionsDf.registerTempTable('Predictions')
+predictionsDf.columns
+```
+#### Look at the success rate.
+
+-   Replace **&lt;FILL\_ME\_IN&gt;** with the **highlighted**
+```python
+numSuccesses = predictionsDf.where("""(prediction = 1 AND CustomerAction = 'Purchased') OR
+    (prediction = 0 AND (CustomerAction = 'Added To Cart' OR
+                         CustomerAction = 'Browsed'))""").count()
+numInspections = predictionsDf.count()
+
+print("There were %d User sessions and there were %d successful predictions"
+      % (numInspections, numSuccesses))
+print("This is a %d%% success rate"
+      % (float(numSuccesses) / float(numInspections) * 100))
+```
+#### Final visualization to help us reason about the results of this test.
+
+<img src="./media/image4.png" width="624" height="281" />
+
+Learn more and get help
+=======================
+
+-   [Azure HDInsight
+    Overview](https://azure.microsoft.com/en-us/services/hdinsight/)
+
+-   [Getting started with Azure
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/)
+
+-   [Use Hive on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-tutorial-get-started)
+
+-   [Use Spark on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview)
+
+-   [Use Interactive Hive on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-interactive-hive)
+
+-   [Use HBase on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-overview)
+
+-   [Use Kafka on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-introduction)
+
+-   [Use Storm on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-overview)
+
+-   [Use R Server on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-r-server-overview)
+
+-   [Open Source component guide on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#hadoop-components-available-with-different-hdinsight-versions)
+
+-   [Extend your cluster to install open source
+    components](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#support-for-open-source-software-used-on-hdinsight-clusters)
+
+-   [HDInsight release
+    notes](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-release-notes)
+
+-   [HDInsight versioning and support
+    guidelines](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#supported-hdinsight-versions)
+
+-   [How to upgrade HDInsight cluster to a new
+    version](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-upgrade-cluster)
+
+-   [Ask HDInsight questions on
+    Stack Overflow](https://stackoverflow.com/questions/tagged/hdinsight)
+
+-   [Ask HDInsight questions on MSDN
+    forums](https://social.msdn.microsoft.com/forums/azure/en-us/home?forum=hdinsight)
+
+
--- a/HDInsight/DataScienceLab/images/BookSalesPrediction.png
+++ b/HDInsight/DataScienceLab/images/BookSalesPrediction.png
--- a/HDInsight/DataScienceLab/images/JupyterNotebooks.png
+++ b/HDInsight/DataScienceLab/images/JupyterNotebooks.png
--- a/HDInsight/DataScienceLab/images/Jupyter_FoodInspection.png
+++ b/HDInsight/DataScienceLab/images/Jupyter_FoodInspection.png
--- a/HDInsight/DataScienceLab/images/Sc1DataVisualize.png
+++ b/HDInsight/DataScienceLab/images/Sc1DataVisualize.png
--- a/HDInsight/DataScienceLab/images/Sc1ModelVisualize.png
+++ b/HDInsight/DataScienceLab/images/Sc1ModelVisualize.png
--- a/HDInsight/DataScienceLab/images/Sc2DataVisualize.png
+++ b/HDInsight/DataScienceLab/images/Sc2DataVisualize.png
--- a/HDInsight/DataScienceLab/images/Sc2ModelVisualize.png
+++ b/HDInsight/DataScienceLab/images/Sc2ModelVisualize.png
--- a/HDInsight/DataScienceLab/images/Storage.png
+++ b/HDInsight/DataScienceLab/images/Storage.png
--- a/HDInsight/DataScienceLab/images/blob.png
+++ b/HDInsight/DataScienceLab/images/blob.png
--- a/HDInsight/DataScienceLab/images/uploadweblogs.png
+++ b/HDInsight/DataScienceLab/images/uploadweblogs.png
--- a/HDInsight/DataScienceLab/media/image1.png
+++ b/HDInsight/DataScienceLab/media/image1.png
--- a/HDInsight/DataScienceLab/media/image2.png
+++ b/HDInsight/DataScienceLab/media/image2.png
--- a/HDInsight/DataScienceLab/media/image3.png
+++ b/HDInsight/DataScienceLab/media/image3.png
--- a/HDInsight/DataScienceLab/media/image4.png
+++ b/HDInsight/DataScienceLab/media/image4.png
--- a/HDInsight/DataScienceLab/media/image5.png
+++ b/HDInsight/DataScienceLab/media/image5.png
--- a/HDInsight/DataScienceLab/media/image6.png
+++ b/HDInsight/DataScienceLab/media/image6.png
--- a/HDInsight/DataScienceLab/readme.md
+++ b/HDInsight/DataScienceLab/readme.md
@ -0,0 +1,5 @@
+# Azure HDInsight - Data science using Spark on Azure HDInsight
+## Deployment
+For this lab, an HDInsight cluster is already created for you. If you want to create this cluster on your own, please go through this [deployment guide](deployment/readme.md).
+## Lab
+See the hands-on lab [here](hands-on-lab.md).
--- a/HDInsight/HiveLab/.DS_Store
+++ b/HDInsight/HiveLab/.DS_Store
--- a/HDInsight/HiveLab/Data/.DS_Store
+++ b/HDInsight/HiveLab/Data/.DS_Store
--- a/HDInsight/HiveLab/Data/hadooplabs/.DS_Store
+++ b/HDInsight/HiveLab/Data/hadooplabs/.DS_Store
--- a/HDInsight/HiveLab/Data/hadooplabs/Lab1/weblogs.csv
+++ b/HDInsight/HiveLab/Data/hadooplabs/Lab1/weblogs.csv
--- a/HDInsight/HiveLab/README.docx
+++ b/HDInsight/HiveLab/README.docx
--- a/HDInsight/HiveLab/deployment/media/image1.png
+++ b/HDInsight/HiveLab/deployment/media/image1.png
--- a/HDInsight/HiveLab/deployment/media/image2.png
+++ b/HDInsight/HiveLab/deployment/media/image2.png
--- a/HDInsight/HiveLab/deployment/media/image3.png
+++ b/HDInsight/HiveLab/deployment/media/image3.png
--- a/HDInsight/HiveLab/deployment/media/image4.png
+++ b/HDInsight/HiveLab/deployment/media/image4.png
--- a/HDInsight/HiveLab/deployment/media/image5.png
+++ b/HDInsight/HiveLab/deployment/media/image5.png
--- a/HDInsight/HiveLab/deployment/media/image6.png
+++ b/HDInsight/HiveLab/deployment/media/image6.png
--- a/HDInsight/HiveLab/deployment/media/image7.png
+++ b/HDInsight/HiveLab/deployment/media/image7.png
--- a/HDInsight/HiveLab/deployment/readme.docx
+++ b/HDInsight/HiveLab/deployment/readme.docx
--- a/HDInsight/HiveLab/deployment/readme.md
+++ b/HDInsight/HiveLab/deployment/readme.md
@ -0,0 +1,84 @@
+Provision HDInsight Linux Hadoop cluster with Azure Management Portal
+---------------------------------------------------------------------
+
+To provision an HDInsight Hadoop cluster with the Azure Management Portal,
+perform the steps below.
+
+1.  Go to the Azure portal at portal.azure.com and log in with your Azure
+    account credentials.
+
+2.  Select **NEW -&gt; Data Analytics -&gt; HDInsight**
+
+> <img src="./media/image1.png" width="592" height="180" />
+
+1.  Enter or select the following values.
+
+    1.  **Cluster Name:** Enter the cluster name. A green tick will
+        appear if the cluster name is available.
+
+    2.  **Cluster Type:** Select Hadoop as the cluster type.
+
+    3.  **Cluster Operating System:** Select Linux as the cluster
+        operating system
+
+    4.  **Version:** Select 3.6 as the cluster version.
+
+    5.  **Cluster Tier:** Select the **Standard** cluster tier
+
+> <img src="./media/image2.png" width="436" height="372" />
+
+1.  **Subscription:** Select the Azure subscription to create
+    the cluster.
+
+2.  **Resource Group:** Select an existing resource group or create a
+    new resource group.
+
+3.  **Credentials:** Configure the username and password for the HDInsight
+    cluster and the SSH connection. The SSH connection is used to connect to
+    the HDInsight cluster through an SSH client such as PuTTY.
+
+> <img src="./media/image3.png" width="219" height="400" />
+
+1.  **Data Source:** Create a new storage account and a
+    default container.
+
+> <img src="./media/image4.png" width="230" height="309" />
+
+1.  **Node Pricing Tiers:** Set the number of head nodes and worker nodes
+    as shown below.
+
+> <img src="./media/image5.png" width="228" height="290" />
+
+**Note:** You can select the lowest pricing tier (A3 nodes) or reduce the
+number of worker nodes to decrease the cluster cost.
+
+1.  Leave other configuration options as default and click **Create** to
+    provision HDInsight Hadoop cluster. It will take 15-20 minutes for
+    cluster provisioning.
+
+**The HDInsight Linux Hadoop cluster is now ready to work with.**
+
+Copy lab data to the storage account
+------------------------------------
+
+In this section, you’ll copy the files required for the lab to your
+storage account.
+
+To copy the files, follow the steps below.
+
+1.  Launch Azure Storage from your cluster dashboard
+
+> <img src="./media/image6.png" width="624" height="854" />
+
+1.  Select the **Blob container** for your cluster
+
+2.  Create a container called **hadooplabs**
+
+3.  Navigate to **hadooplabs** and create a folder called **Lab1**
+
+4.  Upload weblogs.csv to Lab1. weblogs.csv can be found in the
+    **Data\\hadooplabs\\Lab1** folder.
+
+    <img src="./media/image7.png" width="624" height="334" />
+
+
--- a/HDInsight/HiveLab/hands-on-lab.docx
+++ b/HDInsight/HiveLab/hands-on-lab.docx
--- a/HDInsight/HiveLab/hands-on-lab.md
+++ b/HDInsight/HiveLab/hands-on-lab.md
@ -0,0 +1,390 @@
+Overview
+========
+
+Azure HDInsight is the only fully-managed cloud Apache Hadoop offering
+that gives you optimized open-source analytic clusters for Spark, Hive,
+MapReduce, HBase, Storm, Kafka, and Microsoft R Server backed by a 99.9%
+SLA. Deploy these big data technologies and ISV applications as managed
+clusters with enterprise-level security and monitoring.
+
+Hive is a data warehousing system that simplifies analyzing large
+datasets stored in Hadoop clusters, using SQL-Like language known as
+HiveQL. Hive converts queries to either map/reduce, Apache Tez or Apache
+Spark jobs.
+
+To highlight how customers can efficiently leverage HDInsight Hive to
+analyze big data stored in Azure Blob Storage, this document provides an
+end-to-end walkthrough of analyzing a web transaction log of an
+imaginary book store using Hive.
+
+<span id="about-the-code" class="anchor"></span>After completing this
+lab, you will know:
+
+1.  Different ways to execute Hive queries on an HDInsight cluster.
+
+2.  How to use joins, aggregates, analytic functions, ranking functions,
+    GROUP BY and ORDER BY in the Hive Query Language.
+
+Learn the basics of querying with Hive
+======================================
+
+### Launch Hive Views in Ambari portal
+Replace &lt;FILL\_ME\_IN&gt; with the cluster name, username, and password provided or created.
+-   [https://
+    &lt;FILL\_ME\_IN&gt;/\#/main/view/HIVE/auto\_hive20\_instance](https://pranavsparkbuildlab.azurehdinsight.net/#/main/view/HIVE/auto_hive20_instance)
+
+-   Username: &lt;FILL\_ME\_IN&gt;
+
+-   Password: &lt;FILL\_ME\_IN&gt;
+
+<img src="./images/image1.png" width="624" height="363" />
+
+### Load Data into table
+
+-   Copy and paste the following query in the **Query Editor.** *Do not
+    execute yet.*
+```sql
+DROP DATABASE IF EXISTS HDILABDB CASCADE;
+CREATE DATABASE HDILABDB;
+USE HDILABDB;
+
+CREATE EXTERNAL TABLE IF NOT EXISTS weblogs(
+    TransactionDate varchar(50),
+    CustomerId varchar(50),
+    BookId varchar(50),
+    PurchaseType varchar(50),
+    TransactionId varchar(50),
+    OrderId varchar(50),
+    BookName varchar(50),
+    CategoryName varchar(50),
+    Quantity varchar(50),
+    ShippingAmount varchar(50),
+    InvoiceNumber varchar(50),
+    InvoiceStatus varchar(50),
+    PaymentAmount varchar(50)
+) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
+STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/weblogs/';
+
+LOAD DATA INPATH 'wasb:///hadooplabs/Lab1/weblogs.csv' INTO TABLE HDILABDB.weblogs;
+```
+-   Click **Execute** to run the query. Once the query completes, the status
+    under Query Process Results changes to **SUCCEEDED**.
+
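The `CREATE EXTERNAL TABLE` statement tells Hive to treat each line of the file as thirteen comma-separated text fields. As a rough local illustration of that mapping, the sketch below parses one line with Python's `csv` module. The sample values and the `parse_weblogs` helper are invented for illustration and are not rows or code from the lab.

```python
import csv
import io

# Column names in the order declared by the weblogs table.
COLUMNS = [
    "TransactionDate", "CustomerId", "BookId", "PurchaseType",
    "TransactionId", "OrderId", "BookName", "CategoryName", "Quantity",
    "ShippingAmount", "InvoiceNumber", "InvoiceStatus", "PaymentAmount",
]

# Hypothetical line in the same shape as the table definition.
sample = "2017-01-05,C1,B7,Purchased,T100,107,Some Book,Fiction,2,5,I9,Issued,40\n"


def parse_weblogs(text):
    """Yield one dict per line, keeping every field as a string (like varchar)."""
    for fields in csv.reader(io.StringIO(text)):
        yield dict(zip(COLUMNS, fields))


rows = list(parse_weblogs(sample))
print(rows[0]["BookName"], rows[0]["Quantity"])
```

Every value stays a string here, mirroring the `varchar(50)` columns. One difference to keep in mind: Python's `csv` module understands quoted fields, whereas a plain delimited text table splits on every comma.
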
+### Select total count
+
+-   Create a new Worksheet and execute the following query in the
+    **Query Editor**.
+```sql
+SELECT COUNT(*) FROM HDILABDB.weblogs;
+```
+
+### View the data
+
+-   Create a new Worksheet and execute the following query in the
+    **Query Editor**.
+```sql
+SELECT * FROM HDILABDB.weblogs LIMIT 5;
+```
+### Where clause
+
+-   Create a new Worksheet and execute the following query in the
+    **Query Editor**.
+```sql
+SELECT * FROM HDILABDB.weblogs WHERE orderid='107';
+```
+<img src="./images/image2.png" width="624" height="404" />
+
+### Find DISTINCT
+
+-   Create a new Worksheet and execute the following query in the
+    **Query Editor**.
+```sql
+SELECT DISTINCT bookname FROM HDILABDB.weblogs WHERE orderid='107';
+```
+### GROUP BY
+
+-   Create a new Worksheet and execute the following query in the
+    **Query Editor**.
+```sql
+SELECT bookname,COUNT(*) FROM HDILABDB.weblogs GROUP BY bookname;
+```
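
To experiment with `WHERE`, `DISTINCT`, and `GROUP BY` without a cluster, the same query shapes can be tried against an in-memory SQLite database from Python. This is only a sketch: SQLite stands in for Hive here, the two-column table is a cut-down version of `weblogs`, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblogs (orderid TEXT, bookname TEXT)")

# Hypothetical rows, not taken from weblogs.csv.
conn.executemany(
    "INSERT INTO weblogs VALUES (?, ?)",
    [("107", "Book A"), ("107", "Book A"), ("107", "Book B"), ("108", "Book B")],
)

# DISTINCT collapses the duplicate "Book A" rows for order 107.
distinct_books = conn.execute(
    "SELECT DISTINCT bookname FROM weblogs WHERE orderid = '107' ORDER BY bookname"
).fetchall()

# GROUP BY produces one row per book with its row count.
counts = conn.execute(
    "SELECT bookname, COUNT(*) FROM weblogs GROUP BY bookname ORDER BY bookname"
).fetchall()

print(distinct_books)
print(counts)
conn.close()
```

The `ORDER BY` clauses are added only to make the output order predictable; the lab queries above do not include them.
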
+### Analyse query using “Visual Explain”
+
+<img src="./images/image3.png" width="624" height="344" />
+
+Scenario 2 – Apply the basics
+=============================
+
+<span id="_Toc465361432" class="anchor"><span id="_Toc465379292" class="anchor"></span></span>Perform book store sales analysis
+-------------------------------------------------------------------------------------------------------------------------------
+
+In this section, you’ll run hive queries to analyse the data in the
+weblogs table. The weblogs table contains transactional data of an
+imaginary online bookstore. You’ll have to analyse the sales data and
+prepare a sales report.
+
+All analysis is based on the weblogs table, created earlier in the lab.
+The table description is given below
+
+| **Column**      | **Description**                                                            |
+|-----------------|----------------------------------------------------------------------------|
+| TransactionDate | The date of the transaction                                                |
+| CustomerId      | Unique Id assigned to the customer                                         |
+| BookId          | Unique id assigned to a book in the book store                             |
+| PurchaseType    | Purchased: Customer bought the book. Browsed: Customer browsed but did not purchase the book. Added to Cart: Customer added the book to the shopping cart. |
+| TransactionId   | Unique Id assigned to a transaction                                        |
+| OrderId         | Unique order id                                                            |
+| BookName        | The name of the book accessed by the customer                              |
+| CategoryName    | The category of the book accessed by the customer                          |
+| Quantity        | Quantity of the book purchased. Valid only for PurchaseType = Purchased    |
+| ShippingAmount  | Shipping cost                                                              |
+| InvoiceNumber   | Invoice number if a customer purchased the book                            |
+| InvoiceStatus   | The status of the invoice                                                  |
+| PaymentAmount   | Total amount paid by the customer. Valid only for PurchaseType = Purchased |
+
+### Launch Hive Views in Ambari portal
+Replace &lt;FILL\_ME\_IN&gt; with the cluster name, username, and password provided or created.
+-   [https://
+    &lt;FILL\_ME\_IN&gt;/\#/main/view/HIVE/auto\_hive20\_instance](https://pranavsparkbuildlab.azurehdinsight.net/#/main/view/HIVE/auto_hive20_instance)
+
+-   Username: &lt;FILL\_ME\_IN&gt;
+
+-   Password: &lt;FILL\_ME\_IN&gt;
+
+### Problem Statement \#1
+
+Write a query to return the total quantity sold and the total payment
+amount for each category. The output should look like this.
+
+| **CategoryName**  | **QuantitySold** | **TotalAmount** |
+|-------------------|------------------|-----------------|
+| Drive\_books      | 211029           | 2064435         |
+| Adventure         | 112470           | 1022195         |
+| World\_History    | 112263           | 1048990         |
+| Art               | 112105           | 1043190         |
+| Non\_Fiction      | 111731           | 1046410         |
+| Psychology        | 111555           | 1024255         |
+| Romance           | 111316           | 1038265         |
+| Automobile\_books | 110017           | 1030720         |
+| Philosophy        | 109691           | 1042410         |
+| Fiction           | 109460           | 1032795         |
+| Drama             | 109246           | 1038565         |
+| Management        | 108262           | 1030805         |
+| Programming       | 108196           | 1013210         |
+| Music             | 108121           | 998930          |
+| Cook              | 108056           | 1051710         |
+| Science           | 107706           | 1063445         |
+| Religion          | 107513           | 999780          |
+| Political         | 106000           | 1034820         |
+
+#### Create a new Worksheet and execute the following query in the Query Editor.
+```sql
+-- Get top-selling categories
+DROP TABLE IF EXISTS HDILABDB.SalesbyCategory;
+
+CREATE TABLE HDILABDB.SalesbyCategory
+ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '\1' LINES TERMINATED BY '\n'
+STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/SalesbyCategory'
+AS
+SELECT
+    CategoryName,
+    SUM(Quantity) AS QuantitySold,
+    SUM(PaymentAmount) AS TotalAmount
+FROM HDILABDB.weblogs
+WHERE PurchaseType = 'Purchased'
+GROUP BY CategoryName
+ORDER BY QuantitySold DESC;
+
+-- Preview the result
+SELECT * FROM HDILABDB.SalesbyCategory LIMIT 10;
+```
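The aggregation logic of the category query can be checked without a cluster. The following is a minimal sketch that runs the same `WHERE` / `GROUP BY` / `ORDER BY` on a few made-up rows using Python's built-in `sqlite3` module. The sample rows, the in-memory table, and the helper name `sales_by_category` are assumptions for illustration only. SQLite is not Hive, so the Hive storage clauses (`ROW FORMAT`, `STORED AS`, `LOCATION`) are left out.

```python
import sqlite3

# Hypothetical sample rows: (CategoryName, PurchaseType, Quantity, PaymentAmount)
CATEGORY_ROWS = [
    ("Art", "Purchased", 2, 20),
    ("Art", "Browsed", 0, 0),
    ("Art", "Purchased", 1, 10),
    ("Music", "Purchased", 5, 45),
    ("Music", "Added to Cart", 0, 0),
]


def sales_by_category(rows):
    """Return (CategoryName, QuantitySold, TotalAmount) for purchases only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weblogs ("
        "CategoryName TEXT, PurchaseType TEXT, "
        "Quantity INTEGER, PaymentAmount INTEGER)"
    )
    conn.executemany("INSERT INTO weblogs VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT CategoryName,
               SUM(Quantity)      AS QuantitySold,
               SUM(PaymentAmount) AS TotalAmount
        FROM weblogs
        WHERE PurchaseType = 'Purchased'
        GROUP BY CategoryName
        ORDER BY QuantitySold DESC
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in sales_by_category(CATEGORY_ROWS):
        print(row)
```

Only rows with `PurchaseType = 'Purchased'` contribute to the sums, which matches the table description: `Quantity` and `PaymentAmount` are valid only for purchases.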
+### Problem Statement \#2
+
+Write a query to return the total payment amount and the total quantity
+sold per book. The output should look like this.
+
+| **BookName**                         | **QuantitySold** | **TotalAmount** |
+|--------------------------------------|------------------|-----------------|
+| The voyages of Captain Cook          | 232414           | 2194890         |
+| Advances in school psychology        | 231410           | 2193740         |
+| Science in Dispute                   | 231408           | 2168425         |
+| History of political economy         | 231255           | 2190040         |
+| THE BOOK OF WITNESSES                | 230872           | 2145540         |
+| The adventures of Arthur Conan Doyle | 230023           | 2191910         |
+| Space fact and fiction               | 229908           | 2171820         |
+| New Christian poetry                 | 228849           | 2185845         |
+| Understanding American politics      | 228598           | 2182720         |
+
+#### Create a new Worksheet and execute the following query in the Query Editor.
+```sql
+-- Top-selling books
+DROP TABLE IF EXISTS HDILABDB.SalesbyBooks;
+
+CREATE TABLE HDILABDB.SalesbyBooks
+ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '\1' LINES TERMINATED BY '\n'
+STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/SalesbyBooks'
+AS
+SELECT
+    BookName,
+    SUM(Quantity) AS QuantitySold,
+    SUM(PaymentAmount) AS TotalAmount
+FROM HDILABDB.weblogs
+WHERE PurchaseType = 'Purchased'
+GROUP BY BookName
+ORDER BY QuantitySold DESC;
+
+-- Preview the result
+SELECT * FROM HDILABDB.SalesbyBooks LIMIT 10;
+```
+<img src="./images/image4.png" width="624" height="373" />
+
+### Problem Statement \#3
+
+Write a query to return the top 3 books browsed by the customers who
+also browsed the book **THE BOOK OF WITNESSES**. Your output should
+look like this.
+
+| **BookName**                 | **cnt** |
+|------------------------------|---------|
+| New Christian poetry         | 9445    |
+| History of political economy | 9384    |
+| Science in Dispute           | 9367    |
+
+#### Create a new Worksheet and execute the following query in the Query Editor.
+```sql
+-- Books most often browsed by customers who also browsed a given book
+DROP TABLE IF EXISTS HDILABDB.customerswhobrowsedxbook;
+
+CREATE TABLE HDILABDB.customerswhobrowsedxbook
+ROW FORMAT DELIMITED
+FIELDS TERMINATED BY '\1' LINES TERMINATED BY '\n'
+STORED AS TEXTFILE LOCATION 'wasb:///hadooplabs/Lab1/customerswhobrowsedxbook'
+AS
+-- Customers who browsed the target book
+WITH Customerwhobrowsedbookx AS
+(
+    SELECT DISTINCT CustomerId
+    FROM HDILABDB.weblogs
+    WHERE PurchaseType = 'Browsed'
+      AND BookName = 'THE BOOK OF WITNESSES'
+)
+-- Other books browsed by those customers, most frequent first
+SELECT w.BookName, COUNT(*) AS cnt
+FROM HDILABDB.weblogs w
+JOIN Customerwhobrowsedbookx cte
+    ON w.CustomerId = cte.CustomerId
+WHERE w.PurchaseType = 'Browsed'
+  AND w.BookName NOT IN ('THE BOOK OF WITNESSES')
+GROUP BY w.BookName
+HAVING COUNT(*) > 10
+ORDER BY cnt DESC
+LIMIT 3;
+
+-- Preview the result
+SELECT * FROM HDILABDB.customerswhobrowsedxbook LIMIT 10;
+```
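The "customers who browsed X also browsed" pattern can be tried on a tiny data set before running it on the cluster. The sketch below is an illustration under stated assumptions: it uses Python's built-in `sqlite3` module instead of Hive, the rows are invented, and the helper name `also_browsed` is hypothetical. The `HAVING COUNT(*) > 10` filter from the lab query is dropped because the sample is too small for it, and a secondary sort on the book name is added so ties come back in a stable order.

```python
import sqlite3

# Hypothetical sample rows: (CustomerId, BookName, PurchaseType)
BROWSE_ROWS = [
    (1, "THE BOOK OF WITNESSES", "Browsed"),
    (1, "Science in Dispute", "Browsed"),
    (1, "New Christian poetry", "Browsed"),
    (2, "THE BOOK OF WITNESSES", "Browsed"),
    (2, "New Christian poetry", "Browsed"),
    (2, "Science in Dispute", "Purchased"),  # a purchase, so not counted
    (3, "New Christian poetry", "Browsed"),  # customer 3 never browsed the target
]


def also_browsed(rows, target, top_n=3):
    """Return (BookName, cnt) for books browsed by customers who browsed `target`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weblogs (CustomerId INTEGER, BookName TEXT, PurchaseType TEXT)"
    )
    conn.executemany("INSERT INTO weblogs VALUES (?, ?, ?)", rows)
    query = """
        WITH browsers AS (
            -- Step 1: customers who browsed the target book
            SELECT DISTINCT CustomerId
            FROM weblogs
            WHERE PurchaseType = 'Browsed' AND BookName = ?
        )
        -- Step 2: count the other books those customers browsed
        SELECT w.BookName, COUNT(*) AS cnt
        FROM weblogs w
        JOIN browsers b ON w.CustomerId = b.CustomerId
        WHERE w.PurchaseType = 'Browsed' AND w.BookName <> ?
        GROUP BY w.BookName
        ORDER BY cnt DESC, w.BookName
        LIMIT ?
    """
    result = conn.execute(query, (target, target, top_n)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in also_browsed(BROWSE_ROWS, "THE BOOK OF WITNESSES"):
        print(row)
```

The `DISTINCT` in the first step matters: without it, a customer who browsed the target book several times would be joined several times and inflate every count.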
+Learn more and get help
+=======================
+
+-   [Azure HDInsight
+    Overview](https://azure.microsoft.com/en-us/services/hdinsight/)
+
+-   [Getting started with Azure
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/)
+
+-   [Use Hive on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-linux-tutorial-get-started)
+
+-   [Use Spark on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview)
+
+-   [Use Interactive Hive on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-interactive-hive)
+
+-   [Use HBase on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-overview)
+
+-   [Use Kafka on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-introduction)
+
+-   [Use Storm on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-overview)
+
+-   [Use R Server on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-r-server-overview)
+
+-   [Open Source component guide on
+    HDInsight](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#hadoop-components-available-with-different-hdinsight-versions)
+
+-   [Extend your cluster to install open source
+    components](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux#support-for-open-source-software-used-on-hdinsight-clusters)
+
+-   [HDInsight release
+    notes](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-release-notes)
+
+-   [HDInsight versioning and support
+    guidelines](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-component-versioning#supported-hdinsight-versions)
+
+-   [How to upgrade HDInsight cluster to a new
+    version](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-upgrade-cluster)
+
+-   [Ask HDInsight questions on
+    Stack Overflow](https://stackoverflow.com/questions/tagged/hdinsight)
+
+-   [Ask HDInsight questions on the MSDN
+    forums](https://social.msdn.microsoft.com/forums/azure/en-us/home?forum=hdinsight)
+
+
--- a/HDInsight/HiveLab/images/BookSalesPrediction.png
+++ b/HDInsight/HiveLab/images/BookSalesPrediction.png
--- a/HDInsight/HiveLab/images/HiveViews.png
+++ b/HDInsight/HiveLab/images/HiveViews.png
--- a/HDInsight/HiveLab/images/HiveVisualExplain.png
+++ b/HDInsight/HiveLab/images/HiveVisualExplain.png
--- a/HDInsight/HiveLab/images/Sc1WhereClauses.png
+++ b/HDInsight/HiveLab/images/Sc1WhereClauses.png
--- a/HDInsight/HiveLab/images/Sc2TopSellingBooks.png
+++ b/HDInsight/HiveLab/images/Sc2TopSellingBooks.png
--- a/HDInsight/HiveLab/images/Storage.png
+++ b/HDInsight/HiveLab/images/Storage.png
--- a/HDInsight/HiveLab/images/blob.png
+++ b/HDInsight/HiveLab/images/blob.png
--- a/HDInsight/HiveLab/images/image1.png
+++ b/HDInsight/HiveLab/images/image1.png
--- a/HDInsight/HiveLab/images/image2.png
+++ b/HDInsight/HiveLab/images/image2.png
--- a/HDInsight/HiveLab/images/image3.png
+++ b/HDInsight/HiveLab/images/image3.png
--- a/HDInsight/HiveLab/images/image4.png
+++ b/HDInsight/HiveLab/images/image4.png
--- a/HDInsight/HiveLab/images/uploadweblogs.png
+++ b/HDInsight/HiveLab/images/uploadweblogs.png
--- a/HDInsight/HiveLab/readme.md
+++ b/HDInsight/HiveLab/readme.md
@ -0,0 +1,5 @@
+# Azure HDInsight - Big data processing using Hive on Azure HDInsight
+## Deployment
+For this lab, an HDInsight cluster is already created for you. If you want to create this cluster on your own, please go through this [deployment guide](deployment/readme.md).
+## Lab
+See the hands-on lab [here](hands-on-lab.md).