This commit is contained in:
gokuma 2016-09-23 19:32:16 -07:00
Родитель 2cb054a6ec
Коммит 4e8827fdac
1 изменённых файлов: 157 добавлений и 0 удалений

Просмотреть файл

@ -0,0 +1,157 @@
<properties
pageTitle="Data and analytical resources for data science teams"
description="Itemizes and discusses the data and analytics resources avaiable to enterprises standardizing on the Team Data Science Process."
services="machine-learning"
documentationCenter=""
authors="bradsev"
manager="jhubbard"
editor="cgronlun" />
<tags
ms.service="machine-learning"
ms.workload="data-services"
ms.tgt_pltfrm="na"
ms.devlang="na"
ms.topic="article"
ms.date="09/21/2016"
ms.author="bradsev;hangzh;"/>
# Data and analytical resources for data science teams
Microsoft provides a full spectrum of data and analytics services/resources, on cloud or on premise, to make the execution of your data science projects efficient and scalable. Guidance for teams implementing data science projects in a trackable, version controlled, and collaborative way is provided by the [Team Data Science Process](README.md) (TDSP). For an outline of the personnel roles, and their associated tasks that are handled by a data science team standardizing on this process, see [Team Data Science Process: components, roles, and tasks](team-data-science-process-overview-components-roles-tasks.md).
The data and analytics services available to data science teams using the TDSP include:
- Data Science Virtual Machines (both Windows and Linux CentOS)
- HDInsight Spark Clusters
- SQL Data Warehouse
- Azure Data Lake
- HDInsight Hive Clusters
- Azure File Storage
- SQL Server 2016 R Services
The following table provides some guidance on how to choose the data and analysis services that are most appropriate for the needs of your project. To learn how to use the various data and analytics platforms, click the links provided on the bottom row of the table. They direct you to end-to-end walkthrough pages for each of the platforms.
||[DSVM](https://azure.microsoft.com/marketplace/partners/microsoft-ads/standard-data-science-vm)|[Azure HDInsight Spark](https://azure.microsoft.com/documentation/articles/hdinsight-apache-spark-overview)|[Azure SQL Data Warehouse](https://azure.microsoft.com/services/sql-data-warehouse)|[Azure Data Lake](https://azure.microsoft.com/blog/introducing-azure-data-lake/)|[Azure HDInsight (Hadoop)](https://azure.microsoft.com/documentation/articles/hdinsight-use-hive/)|[SQL Server 2016 R Services](https://msdn.microsoft.com/library/mt604845.aspx)|
|:-------------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
|**Structured Data**|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|
|**Semi-Structured Data**| ![](./media/team-data-science-process-resources/resources-4-green-circle.png) |![](./media/team-data-science-process-resources/resources-4-green-circle.png)||![](./media/team-data-science-process-resources/resources-4-green-circle.png)|||
|**Unstructured Data**|![](./media/team-data-science-process-resources/resources-4-green-circle.png) |![](./media/team-data-science-process-resources/resources-4-green-circle.png)||![](./media/team-data-science-process-resources/resources-4-green-circle.png)|||
|**Scalable**||![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|
|**Elastic**||![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|![](./media/team-data-science-process-resources/resources-4-green-circle.png)|||
|**Machine Learning Embedded**|![](./media/team-data-science-process-resources/resources-4-green-circle.png) |![](./media/team-data-science-process-resources/resources-4-green-circle.png) ||||![](./media/team-data-science-process-resources/resources-4-green-circle.png)|
|**Supported Data Science Languages**|**Python, R**|**Python, Microsoft R Services, Scala**|**SQL**|**U-SQL Queries**|**Hive Queries, UDF in Python**|**SQL, R**|
|**End to End Data Science Walkthrough**|[**Windows**](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-vm-do-ten-things/), [**Linux**](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-linux-dsvm-intro/)|[**Python**](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-spark-overview/), [**Scala**](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-process-scala-walkthrough/)|[**SQL+AzureML**](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-process-sqldw-walkthrough/)|[**U-SQL Queries+AzureML**](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-process-data-lake-walkthrough/)|[**NYC Taxi Data**](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-process-hive-walkthrough/), [**1TB Criteo Data**](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-process-hive-criteo-walkthrough/)|[**Developers**](https://msdn.microsoft.com/library/mt683480.aspx), [**R Programmers**](https://msdn.microsoft.com/library/mt612857.aspx)|
In this document, we briefly describe the resources itemized in this table. More information on these resources is available on their product pages.
## Data Science Virtual Machine (DSVM)
The data science virtual machine offered on both Windows and Linux by Microsoft, contains popular tools for data science modeling and development activities. It includes tools such as:
- Microsoft R Server Developer Edition
- Anaconda Python distribution
- Jupyter notebooks for Python and R
- Visual Studio Community Edition with Python and R Tools on Windows / Eclipse on Linux
- Power BI desktop for Windows
- SQL Server 2016 Developer Edition on Windows / Postgres on Linux
It also includes **ML and AI tools** like CNTK (an Open Source Deep Learning toolkit from Microsoft), xgboost, mxnet and Vowpal Wabbit.
Currently DSVM is available in **Windows** and **Linux CentOS** operating systems. You need to choose the size of your DSVM (number of CPU cores and the amount of memory) based on the needs of the data science projects that you are planning to execute on it.
For more information on Windows edition of DSVM, see [Microsoft Data Science Virtual Machine](https://azure.microsoft.com/marketplace/partners/microsoft-ads/standard-data-science-vm/) on the Azure marketplace. For the Linux edition of the DSVM, see [Linux Data Science Virtual Machine](https://azure.microsoft.com/marketplace/partners/microsoft-ads/linux-data-science-vm/).
To learn how to execute some of the common data science tasks on the DSVM efficiently, see [Ten things you can do on the Data science Virtual Machine](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-vm-do-ten-things)
## Azure HDInsight Spark clusters
Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory computation capabilities make it a good choice for iterative algorithms in machine learning and for graph computations. Spark is also compatible with Azure Blob storage (WASB), so your existing data stored in Azure can easily be processed using Spark.
When you create a Spark cluster in HDInsight, you create Azure compute resources with Spark installed and configured. It takes about ten minutes to create a Spark cluster in HDInsight. Store the data to be processed in Azure Blob storage. For information on using Azure Blob Storage with a cluster, see [Use HDFS-compatible Azure Blob storage with Hadoop in HDInsight](https://azure.microsoft.com/documentation/articles/hdinsight-hadoop-use-blob-storage).
TDSP team from Microsoft has published two end-to-end walkthroughs on how to use Azure HDInsight Spark Clusters to build data science solutions, one using Python and the other Scala. For more information on Azure HDInsight **Spark Clusters**, see [Overview: Apache Spark on HDInsight Linux](https://azure.microsoft.com/documentation/articles/hdinsight-apache-spark-overview). To learn how to build a data science solution using **Python** on an Azure HDInsight Spark Cluster, see [Overview of Data Science using Spark on Azure HDInsight](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-spark-overview). To learn how to build a data science solution using **Scala** on an Azure HDInsight Spark Cluster, see [Data Science using Scala and Spark on Azure](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-process-scala-walkthrough).
## Azure SQL Data Warehouse
Azure SQL Data Warehouse allows you to scale compute easily and in seconds, without over-provisioning or over-paying. It also offers the unique option to pause compute, giving you the freedom to better manage your cloud costs. This makes it possible to bring all your data into Azure SQL Data Warehouse, where storage costs are minimal, and then to run compute only on the parts of datasets that you want to analyze.
For more information on Azure SQL Data Warehouse, see the [SQL Data Warehouse](https://azure.microsoft.com/services/sql-data-warehouse) website. To learn how to build end-to-end advanced analytics solutions with SQL Data Warehouse, see [The Team Data Science Process in action: using SQL Data Warehouse](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-process-sqldw-walkthrough).
## Azure Data Lake
Azure data lake is as an enterprise wide repository of every type of data collected in a single location, prior to any formal requirements or schema being imposed. This allows every type of data to be kept in a data lake, regardless of its size or structure or how fast it is ingested. Organizations can then use Hadoop or advanced analytics to find patterns in these data lakes. Data lakes can also serve as a repository for lower-cost data preparation before curating the data and moving it into a data warehouse.
For more information on Azure Data Lake, see [Introducing Azure Data Lake](https://azure.microsoft.com/blog/introducing-azure-data-lake/). To learn how to build a scalable end-to-end data science solution with Azure Data Lake, see [Scalable Data Science in Azure Data Lake: An end-to-end Walkthrough](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-process-data-lake-walkthrough)
## Azure HDInsight Hive (Hadoop) clusters
Apache Hive is a data warehouse system for Hadoop, which enables data summarization, querying, and the analysis of data using HiveQL, a query language similar to SQL. Hive can be used to interactively explore your data or to create reusable batch processing jobs.
Hive allows you to project structure on largely unstructured data. After you define the structure, you can use Hive to query that data in an Hadoop cluster without having to use, or even know, Java or MapReduce. HiveQL (the Hive query language) allows you to write queries with statements that are similar to T-SQL.
For data scientists, Hive can run Python User-Defined Functions (UDFs) in Hive queries to process records. This ability extends the capability of Hive queries in data analysis considerably. Specifically, it allows data scientists to conduct scalable feature engineering in languages they are mostly familiar with: the SQL-like HiveQL and Python.
For more information on Azure HDInsight Hive Clusters, see [Use Hive and HiveQL with Hadoop in HDInsight](https://azure.microsoft.com/documentation/articles/hdinsight-use-hive). To learn how to build a scalable end to end data science solution with Azure HDInsight Hive Clusters, see [The Team Data Science Process in action: using HDInsight Hadoop clusters](https://azure.microsoft.com/documentation/articles/machine-learning-data-science-process-hive-walkthrough).
## Azure File Storage
Azure File Storage is a service that offers file shares in the cloud using the standard Server Message Block (SMB) Protocol. Both SMB 2.1 and SMB 3.0 are supported. With Azure File storage, you can migrate legacy applications that rely on file shares to Azure quickly and without costly rewrites. Applications running in Azure virtual machines or cloud services or from on-premises clients can mount a file share in the cloud, just as a desktop application mounts a typical SMB share. Any number of application components can then mount and access the File storage share simultaneously.
Especially useful for data science projects is the ability to create an Azure file store as the place to share project data with your project team members. Each of them then has access to the same copy of the data in the Azure file storage. They can also use this file storage to share feature sets generated during the execution of the project. If the project is a client engagement, your clients can create an Azure file storage under their own Azure subscription to share the project data and features with you. In this way, the client has full control of the project data assets. For more information on Azure File Storage, see [Get started with Azure File storage on Windows](https://azure.microsoft.com/documentation/articles/storage-dotnet-how-to-use-files) and [How to use Azure File Storage with Linux](https://azure.microsoft.com/documentation/articles/storage-how-to-use-files-linux).
## SQL Server 2016 R Services
R Services (In-database) provide a platform for developing and deploying intelligent applications that can uncover new insights. You can use the rich and powerful R language, including the many packages provided by the R community, to create models and generate predictions from your SQL Server data. Because R Services (In-database) integrate the R language with SQL Server, analytics are kept close to the data, which eliminates the costs and security risks associated with moving data.
R Services (In-database) support the open source R language with a comprehensive set of SQL Server tools and technologies. They offer superior performance, security, reliability, and manageability. You can deploy R solutions using convenient and familiar tools. Your production applications can call the R runtime and retrieve predictions and visuals using Transact-SQL. You also use the ScaleR libraries to improve the scale and performance of your R solutions. For more information, see [SQL Server R Services](https://msdn.microsoft.com/library/mt604845.aspx)
The TDSP team from Microsoft has published two end-to-end walkthroughs that show how to build data science solutions in SQL Server 2016 R Services: one for R programmers and one for SQL developers. For **R Programmers**, see [Data Science End-to-End Walkthrough](https://msdn.microsoft.com/library/mt612857.aspx). For **SQL Developers**, see [In-Database Advanced Analytics for SQL Developers (Tutorial)](https://msdn.microsoft.com/library/mt683480.aspx).
## Appendix: More tools for setting up TDSP and executing projects
### Install Git Credential Manager on Windows
If you are following the TDSP on **Windows**, you need to install the **Git Credential Manager (GCM)** to communicate with the Git repositories. To install GCM, you first need to install **Chocolaty**. To install Chocolaty and the GCM, run the following commands in Windows PowerShell as an **Administrator**:
iwr https://chocolatey.org/install.ps1 -UseBasicParsing | iex
choco install git-credential-manager-for-windows -y
### Install Git on Linux (CentOS) machines
Run the following bash command to install Git on Linux (CentOS) machines:
sudo yum install git
### Generate public SSH key on Linux (CentOS) machines
If you are using Linux (CentOS) machines to run the git commands, you need to add the public SSH key of your machine to your VSTS server, so that this machine is recognized by the VSTS server. First, you need to generate a public SSH key and add the key to SSH public keys in your VSTS security setting page.
- To generate the SSH key, run the following two commands:
ssh-keygen
cat .ssh/id_rsa.pub
![](./media/team-data-science-process-resources/resources-1-generate_ssh.png)
- Copy the entire ssh key including *ssh-rsa*.
- Log in to your VSTS server.
- Click **<Your Name\>** at the top right corner of the page and click **security**.
![](./media/team-data-science-process-resources/resources-2-user-setting.png)
- Click **SSH public keys**, and click **+Add**.
![](./media/team-data-science-process-resources/resources-3-add-ssh.png)
- Paste the ssh key just copied into the text box and save.