dciborow@microsoft.com 2018-02-12 02:17:20 +00:00
Commit 5790ba045b
38 changed files: 736 additions and 0 deletions

1
.azureml/project.json Normal file
@@ -0,0 +1 @@
{"Id":"573faa84-7570-4f57-8872-cfe0329ea0e8","Scope":"/subscriptions/03909a66-bef8-4d52-8e9a-a346604e0902/resourceGroups/TeamTaoDSVM/providers/Microsoft.MachineLearningExperimentation/accounts/team_tao/workspaces/dciborow/projects/blads_recommendation_rrs"}

11
.gitignore vendored Normal file

@@ -0,0 +1,11 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
*.Rproj
.vs/slnx.sqlite
.azuremlhistory_git
.ipynb_checkpoints
azureml-logs
*.dprep.user

21
LICENSE Normal file
@@ -0,0 +1,21 @@
MIT License
Copyright (c) Microsoft Corporation. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

68
Readme.md Normal file
@@ -0,0 +1,68 @@
# TDSP Project Dashboard
## Summary
**TDSP Project Dashboard**
This is the project dashboard where you put key project information (for example, a project summary with relevant links). In your actual project, replace the rest of this content with a project-specific summary.
## Team Data Science Process From Microsoft (TDSP)
This repository contains an instantiation of the [**Team Data Science Process (TDSP) from Microsoft**](https://github.com/Azure/Microsoft-TDSP) for project **Azure Machine Learning**. The TDSP is an agile, iterative, data science methodology designed to improve team collaboration and learning. It facilitates better coordinated and more productive data science enterprises by providing:
- a [lifecycle](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md) that defines the steps in project development
- a [standard project structure](https://github.com/Azure/Azure-TDSP-ProjectTemplate)
- artifact templates for reporting
- tools to assist with data science tasks and project execution
## Information About TDSP In Azure Machine Learning
When you instantiate the TDSP from Azure Machine Learning, you get the TDSP-recommended standardized directory structure and document templates for project execution and delivery. The workflow then consists of the following steps:
- modify the documentation templates provided here for your project
- execute your project (fill in with your project's code, documents, and artifact outputs)
- prepare the Data Science deliverables for your client or customer, including the ProjectReport.md report.
We provide [instructions on how to instantiate and use TDSP in Azure Machine Learning](https://aka.ms/how-to-use-tdsp-in-aml).
## The Data Science Lifecycle
TDSP uses the data science lifecycle to structure projects. The lifecycle defines the steps that a project typically executes, from start to finish. This lifecycle is valid for data science projects that build data products and intelligent applications that include predictive analytics. The goal is to incorporate machine learning or artificial intelligence (AI) models into commercial products. Exploratory data science projects and ad hoc/one-off analytics projects can also use this process, but in those cases some steps of the lifecycle may not be needed.
Here is a depiction of the [TDSP lifecycle](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md).
The TDSP data science lifecycle is composed of four major stages that are executed iteratively:
* Business Understanding
* Data Acquisition and Understanding
* Modeling
* Deployment
These stages should, ideally, be followed by customer acceptance for successful projects.
If you are using a different lifecycle schema, such as [CRISP-DM](https://wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining), KDD, or your own custom process that is working well in your organization, you can still use the TDSP in the context of those development lifecycles.
For reference, see a more [detailed description of the TDSP life-cycle](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md). That version also provides additional documentation templates that are associated with each phase of the TDSP lifecycle.
## Documenting Your Project
Refer to [TDSP documentation templates](https://github.com/Azure/Azure-TDSP-ProjectTemplate) to see how you can document your project for efficient collaboration and reproducibility. In the current Azure Machine Learning TDSP documentation template, we recommend that you include all the information in the [ProjectReport](https://github.com/amlsamples/tdsp/blob/master/docs/deliveralbe_docs/ProjectReport.md) file. This template should be filled out with information that is specific to your project.
In addition to the [ProjectReport](https://github.com/amlsamples/tdsp/blob/master/docs/deliveralbe_docs/ProjectReport.md), which serves as the primary project document, we provide another template, [ProjectLearnings](https://github.com/amlsamples/tdsp/blob/master/docs/ProjectLearnings.md), to capture any learnings and information that may not be included in the primary project document but are still useful to record.
Documents received from a customer can be stored in .\docs\customer\_docs. Documents prepared for sharing information with a customer (for example, ProjectReport, graphs, tables, etc.) can be stored in .\docs\deliverable\_docs.
## Project Folder Structure
The TDSP project template contains the following top-level folders:
1. **code**: Contains code
2. **docs**: Contains necessary documentation about the project
3. **sample_data**: Contains **SAMPLE (small)** data that can be used for early development or testing. Typically no more than a few (about 5) MB. Not for full or large data sets.
**NOTE:**
Make sure that, other than the readme.md file, all documentation-related content (text, markdown, images, other document files) that is NOT used during the project execution resides in the folder named “docs” (all lowercase). This is a special folder that Azure Machine Learning execution ignores, so contents in this folder do not get copied to the compute target unnecessarily. Objects in this folder also don't count towards the 25-MB cap on project size, so you can, for example, store large image files needed in your documentation. They are still tracked by Git through Run History.
## Project Planning And Execution
To deploy [Visual Studio Online (Team Services)](https://azure.microsoft.com/en-us/services/visual-studio-team-services/) for planning, managing and executing your data science projects, detailed instructions are provided [here](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/project-execution.md).
## Release Notes
The release of this template is associated with the preview release of Azure Machine Learning (September 2017). We are continuously improving the TDSP based on customer experience and feedback, and releasing new features. Refer to the [TDSP](https://github.com/Azure/Microsoft-TDSP) page for more information.
## Ask Questions
We would love to hear about your experience with the TDSP. Should you have any questions or suggestions, create a new discussion thread on the [Issues Tab](https://github.com/Azure/Microsoft-TDSP/issues).

@@ -0,0 +1,17 @@
# Conda environment specification. The dependencies defined in this file will be
# automatically provisioned for runs against docker, VM, and HDI cluster targets.
# Details about the Conda environment file format:
# https://conda.io/docs/using/envs.html#create-environment-file-by-hand
# For Spark packages and configuration, see spark_dependencies.yml.
name: project_environment
dependencies:
  - python=3.5.2
  - scikit-learn
  - pip:
      # The API for Azure Machine Learning Model Management Service.
      # Details: https://github.com/Azure/Machine-Learning-Operationalization
      - azure-ml-api-sdk==0.1.0a11

@@ -0,0 +1,2 @@
type: "localdocker"
baseDockerImage: "microsoft/mmlspark:plus-0.9.9"

@@ -0,0 +1,10 @@
ArgumentVector:
  - "$file"
Target: "docker"
EnvironmentVariables:
  "EXAMPLE_ENV_VAR": "Example Value"
Framework: "PySpark"
CondaDependenciesFile: "aml_config/conda_dependencies.yml"
SparkDependenciesFile: "aml_config/spark_dependencies.yml"
PrepareEnvironment: true
TrackedRun: true
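To see how these settings surface at run time, here is a minimal, hypothetical sketch of a submitted script: entries under EnvironmentVariables are exported into the run's environment, so plain os.environ is enough to read them (the printed names and any extra arguments are placeholders).

```python
import os
import sys

# Entries declared under EnvironmentVariables in the runconfig are assumed
# to be exported into the run's process environment.
print("EXAMPLE_ENV_VAR =", os.environ.get("EXAMPLE_ENV_VAR", "<not set>"))

# ArgumentVector begins with "$file" (the submitted script itself), so any
# extra arguments appended to it arrive in sys.argv as usual.
print("script:", sys.argv[0])
print("extra args:", sys.argv[1:])
```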

3
aml_config/local.compute Normal file
@@ -0,0 +1,3 @@
type: "local"
pythonLocation: "python"
sparkSubmitLocation: "spark-submit"

@@ -0,0 +1,8 @@
ArgumentVector:
  - "$file"
Target: "local"
Framework: "Python"
CondaDependenciesFile: "aml_config/conda_dependencies.yml"
SparkDependenciesFile: "aml_config/spark_dependencies.yml"
PrepareEnvironment: true
TrackedRun: true

@@ -0,0 +1,17 @@
# Spark configuration and packages specification. The dependencies defined in
# this file will be automatically provisioned for each run that uses Spark.
# For third-party python libraries, see conda_dependencies.yml.
configuration: {}
repositories:
  - "https://mmlspark.azureedge.net/maven"
packages:
  - group: "com.microsoft.ml.spark"
    artifact: "mmlspark_2.11"
    version: "0.7.91"
  # Required for SQL Server data sources.
  - group: "com.microsoft.sqlserver"
    artifact: "mssql-jdbc"
    version: "6.2.1.jre8"
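Since mssql-jdbc is listed above for SQL Server data sources, a hedged PySpark sketch of reading a table through that driver may be useful; the server, database, table, and credentials below are placeholders, not values from this project.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-server-ingest").getOrCreate()

# Read a SQL Server table through the provisioned mssql-jdbc driver.
# All connection details below are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>")
      .option("dbtable", "dbo.<table>")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

df.printSchema()
```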

@@ -0,0 +1,11 @@
# Code/01_Data_Acquisition_and_Understanding
This folder holds the source code files associated with data preparation. Data preparation includes one or more of the following steps (not exhaustive):
- Data ingestion
- Data cleanup
- Data reduction
- Data exploration
- Data visualization
For further information on data and feature definitions, you can read this document in the TDSP public GitHub repo [(link)](https://github.com/Azure/Azure-TDSP-ProjectTemplate/blob/master/Docs/DataReport/Data%20Defintion.md). For what information may be present in a data summary report, you can read this document [(link)](https://github.com/Azure/Azure-TDSP-ProjectTemplate/blob/master/Docs/DataReport/DataSummaryReport.md).
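As an illustration of the steps above, here is a minimal, hypothetical pandas sketch covering ingestion, cleanup, and exploration; the file name and column names are assumptions, not part of this template.

```python
import pandas as pd

# Data ingestion: a hypothetical sample file from sample_data.
df = pd.read_csv("sample_data/transactions_sample.csv")

# Data cleanup: drop exact duplicates and rows missing a (placeholder) key column.
df = df.drop_duplicates()
df = df.dropna(subset=["amount"])

# Data exploration: quick structural and statistical summaries.
print(df.dtypes)
print(df.describe(include="all"))
```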

@@ -0,0 +1,12 @@
# Code/02_Modeling
This folder holds the source code files associated with the predictive models. Although we haven't done it here in this template, if it makes sense to organize your project's source code further according to the stages of the Modeling phase, you should create further sub-folders for your code:
**Code\Modeling\01_FeatureEngineering**
**Code\Modeling\02_ModelCreation**
**Code\Modeling\03_ModelEvaluation**
For an example of what a modeling report could include, see this example in the TDSP public GitHub repo [(link)](https://github.com/Azure/Azure-TDSP-ProjectTemplate/blob/master/Docs/Model/FinalReport.md). You can also include the modeling report (information) in the ProjectReport file in the top-level folder for your project.
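For illustration, a minimal scikit-learn sketch spanning the three sub-stages above (scikit-learn is pinned in aml_config/conda_dependencies.yml); the data file, numeric feature columns, and "label" target are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical prepared data set with numeric features and a "label" target.
df = pd.read_csv("sample_data/prepared_sample.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 01_FeatureEngineering and 02_ModelCreation collapsed into one pipeline.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)

# 03_ModelEvaluation: report a single validation metric.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```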

@@ -0,0 +1,3 @@
# Code/Deployment
This folder holds the source code and other instructions needed to fully describe the deployment of the advanced analytics model.
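As a sketch only: web services built with the Model Management SDK pinned in aml_config/conda_dependencies.yml are typically driven by a scoring script with init() and run() entry points. The model file name, serialization format, and input schema below are assumptions for illustration.

```python
import json
import pickle

model = None

def init():
    # Load the serialized model once when the service starts.
    # "model.pkl" is a placeholder for your trained model artifact.
    global model
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

def run(input_json):
    # Score a JSON payload; the {"features": [...]} schema is an assumption.
    data = json.loads(input_json)
    prediction = model.predict([data["features"]])
    return json.dumps({"prediction": prediction.tolist()})
```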

29
code/Readme.md Normal file
@@ -0,0 +1,29 @@
# Code
This directory should contain all the source code for the project. Some structure (sub-directories) is provided to suggest where to store your code files.
It is recommended to maintain the three main folders as "01\_Data\_Acquisition\_and\_Understanding", "02\_Modeling", and "03\_Deployment" and follow the steps of the TDSP data science process as outlined [(link)](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md).
However, if this does not work for your project (such as a big data project without a model, or one where the code runs as a single process), you are free to restructure the folders within "Code" to suit your needs, as long as you number the folders according to the stages in which you approach the problem, e.g. "01\_DataPrep" and "02\_Deployment" if there is no model.
Please delete the comments above and use this ReadMe file to describe the structure of your code.
### Code/01\_data\_acquisition\_and\_understanding
[comment]: # (Include brief description of what was done here.)
### Code/02_modeling
[comment]: # (Include brief description of what was done here.)
### Code/03_deployment
[comment]: # (Include brief description of what was done here.)
[comment]: # (Coding styles of Python and R)
[comment]: # (It is good practice to follow coding conventions to facilitate better collaboration and standardization.)
[comment]: # (R Style guides:)
[comment]: # (http://adv-r.had.co.nz/Style.html Hadley Wickham's advanced R programming guide is a great resource that is accessible and a good start.)
[comment]: # (https://google.github.io/styleguide/Rguide.xml Google's R style guide is more detailed and what I would suggest we adopt.)
[comment]: # (http://handsondatascience.com/StyleO.pdf a 24 page detailed document that covers almost everything we could ever run into.)
[comment]: # (Additionally, there is the _lintr_ package, which runs a syntax style checker on your code. This is what later versions of RStudio use to issue warnings while editing R code.)
[comment]: # (Python style guides:)
[comment]: # (https://www.python.org/dev/peps/pep-0008/)

21
docs/ProjectLearnings.md Normal file
@@ -0,0 +1,21 @@
# Project Learning Notes: (Insert Customer & Use Case Title)
Information about your project that is useful to keep track of, but that does not become part of the deliverable ProjectReport.md, can be included here.
## Project Execution
[comment]: # (Learnings around the customer engagement process)
## Data science / Engineering
[comment]: # (Learnings related to data science/engineering, tips/tricks, etc)
## Domain
[comment]: # (Learnings around the business domain)
## Product
[comment]: # (Learnings around the products and services utilized in the solution, e.g. ADLA had a bug X)
## What's unique about this project, specific challenges
[comment]: # (Specific issues or setup, unique things, specific challenges that had to be addressed during the engagement and how that was accomplished)
## Reference Documents
[comment]: # (Links to additional documents for reference)

26
docs/Readme.md Normal file
@@ -0,0 +1,26 @@
# docs
## Folder for hosting all documents for a Data Science Project
Documents should contain information about the following:
1. System architecture
2. Data dictionaries
3. Reports related to data understanding, modeling
4. Project management and planning docs
5. Information obtained from a business owner or client about the project
6. Docs and presentations prepared to share information about the project
In this folder we store HTML or markdown reports:
### docs/customer_docs:
This folder gives us a single point to store documents from the customer related to this engagement. As the repo will become large if you store Word, PowerPoint, etc. files (especially if these are changing over time), please only store final documents here and utilize another location, such as a SharePoint site (or similar shared resource), for working documents of that nature.
### docs/deliverable_docs:
This folder gives us a single point to store deliverable documents related to this engagement. As the repo will become large if you store Word, PowerPoint, etc. files (especially if these are changing over time), please only store final documents here and utilize another location, such as a SharePoint site (or similar resource), for working documents of that nature.
### docs/optional_templates
This folder has optional templates which may be used to document your project. If these templates are not used in a project, this folder may be deleted from your project.
### Notes:
Any notes you want to keep about the Docs

@@ -0,0 +1,9 @@
# docs/customer_docs
This folder gives us a single point to store documents from the customer related to this engagement. As the repo will become large if you store Word, PowerPoint, etc. files (especially if these are changing over time), please only store final documents here and utilize another location, such as a SharePoint site (or similar resource), for working documents of that nature.
## Notes:
*
## Reference documents:
*

@@ -0,0 +1,62 @@
# Data Science Project Report: (Insert Customer/Client & Use Case Title)
[comment]: # (This document is intended to capture the use case summary for this engagement. An executive summary should contain a brief overview of the project, but not every detail. Only the current summary should be captured here and this should be edited over time to reflect the latest details.)
[comment]: # (Some ideas of what to include in the executive summary are detailed below. Please edit and capture the relevant information within each section)
[comment]: # (To capture more detail in the scoping phase, the optional template Scoping.md may be utilized. If more detail around the data, use case, architecture, or other aspects needs to be captured, additional markdown files can be referenced and placed into the Docs folder)
## 1. Business Understanding
### Customer & Business Problem
* Who is the client, what is their business domain?
* What is the business problem (in business terms) that the customer is looking to solve?
### Scope
* What data science or advanced analytics solutions are we building?
* What are the high level data sources we will be utilizing?
* How is it going to be consumed by the customer and how will the customer use the model results to make decisions?
* What are the deliverables?
### Plan
Phases (milestones), timeline, short description of what we'll do in each phase.
### Personnel
[comment]: # (Who is assigned to this project)
* **Data Science Group**: Project lead, Data scientist(s), Data engineer(s), Account manager
* **Customer or Client**: Data administrator, Business contact
* **Others (e.g. Partners)**: Project lead, Engineer
### Metrics
* What are the qualitative objectives?
* What is the main quantifiable metric?
* What improvement in the values of the metrics is useful for the customer scenario?
* What is the baseline value of the metric before the project?
* How will we measure the metric?
## 2. Data Acquisition and Understanding
### Data & Analytics Environment
* What are the available data sets and relative size of the data?
* What is the data ingestion method?
* Do we have data needed to answer the business problem?
* What are the data storage/analytics resources (e.g. development resources such as HDInsight)?
## 3. Modeling
### Model Techniques
* Feature Engineering steps used?
* Modeling techniques used, validation results, details of how validation conducted?
## 4. Deployment
### Architecture
* How will the analytics or machine learning model be consumed in the business workflow of the customer?
* What is the planned data movement pipeline in production?
(Insert a 1 slide diagram showing the end to end data flow and decision architecture.)
[comment]: # (If there is a substantial change in the customer's business workflow, make a before/after diagram showing the data flow.)
## Reference Documents
* Version control repository - add link
* OneNote or other locations with important documents - add link

@@ -0,0 +1,10 @@
# docs/deliverable_docs
This folder gives us a single point to store final versions of customer or conference presentations related to this engagement. As the repo will become large if you store Word, PowerPoint, etc. files (especially if these are changing over time), please only store final documents here and utilize another location, such as a SharePoint site (or similar resource), for working documents of that nature.
## Project Report
The [ProjectReport.md](ProjectReport.md) contains the primary deliverable for the client. It contains the vital information about the project: its findings, architecture, results, and a summary of outputs and artifacts.
## Notes:
*
## Reference documents:
*

@@ -0,0 +1,108 @@
---
title: Structure Projects with Team Data Science Process Template | Microsoft Docs
description: How to instantiate Team Data Science Process (TDSP) templates in Azure ML that structure projects for collaboration.
services: machine-learning
documentationcenter: ''
author: bradsev
manager: cgronlun
editor: cgronlun
ms.assetid:
ms.service: machine-learning
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 09/17/2017
ms.author: bradsev
---
<img src="./images/aml-gallery-tdsp-icon.png" width="190" height="120">
# Structure Projects with Team Data Science Process Template
This document provides instructions on how to create a data science project in Azure Machine Learning with Team Data Science Process (TDSP) templates that structure projects for collaboration and reproducibility.
## What Is Team Data Science Process?
The Team Data Science Process is an agile, iterative, data science process for executing and delivering advanced analytics solutions. It is designed to improve the collaboration and efficiency of data science teams in enterprise organizations. It supports these objectives with four key components:
1. A standard [data science lifecycle](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md) definition.
2. A standardized project structure and [project documentation and reporting templates](https://github.com/Azure/Azure-TDSP-ProjectTemplate).
3. Infrastructure and resources for project execution, such as compute and storage infrastructure, and code repositories.
4. [Tools and utilities](https://github.com/Azure/Azure-TDSP-Utilities) for data science project tasks, such as collaborative version control and code review, data exploration and modeling, and work planning.
For a more complete discussion of the TDSP, see the [Team Data Science Process overview](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/README.md).
## Why Should You Use TDSP Structure and Templates?
Standardization of the structure, lifecycle, and documentation of data science projects is key to facilitating effective collaboration on data science teams. Creating Azure Machine Learning projects with the TDSP template provides a framework for coordinated teamwork.
We had previously released a [GitHub repository for the TDSP project structure and templates](https://github.com/Azure/Azure-TDSP-ProjectTemplate) to help achieve these objectives. But it was not possible, until now, to instantiate the TDSP structure and templates within a data science tool. It is now possible to create an Azure Machine Learning project that instantiates the TDSP structure and documentation templates.
## Things To Note *Before* Creating A New Project
These are the things you should note or review *before* creating a new project:
* TDSP Azure Machine Learning [Template](https://aka.ms/tdspamlgithubrepo).
* Contents (other than what is in the 'docs' folder) are required to be less than 25 MB in size. See the **NOTE** below.
* The sample\_data folder is only for small data files (less than 5 MB) with which you can test your code or do early development.
* Storing files such as Office Word, PowerPoint, etc. can increase the size of the 'docs' folder substantially. We advise you to find a collaborative wiki, [SharePoint](https://products.office.com/en-us/sharepoint/collaboration), or another collaborative resource to store such files.
* For handling large files and outputs in Azure Machine Learning, read [this](http://aka.ms/aml-largefiles).
**NOTE:** Make sure that, other than the readme.md file, all documentation-related content (text, markdown, images, other document files) that is NOT used during the project execution resides in the folder named 'docs' (all lowercase). This is a special folder that Azure Machine Learning execution ignores, so contents in this folder do not get copied to the compute target unnecessarily. Objects in this folder also don't count towards the 25-MB cap on project size, so you may, for example, store large image files needed in your documentation. They are still tracked by Git through Run History.
## Instantiating TDSP Structure and Templates From the Azure Machine Learning Template Gallery
To create a new project with the Team Data Science Process structure and documentation templates, complete the following procedures:
### Click on "New Project"
Open Azure Machine Learning. Under **Projects** at the top left, click on **+** and select **New Project** to create a new project.
<img src="./images/instantiation-1.png" width="800" height="600">
### Creating a new TDSP-structured project
Specify the parameters and information in the relevant boxes:
- Project name
- Project directory
- Project description
- An empty Git repository path
- Workspace name
Then, in the **Search** box, type *TDSP*. When **Structure a project with TDSP** shows up, click on it to select that template, then click the **Create** button to create your new project with the TDSP structure. If you provided an empty Git repository when creating the project (in the appropriate box), that repository will be populated with the project structure and contents after the project is created.
**NOTE:** The actual icon image may change.
<img src="./images/instantiation-2.png" width="700" height="500">
## Examine The TDSP Project Structure
After your new project is created, you can examine its structure (left panel in the figure below). It contains standardized documentation for business understanding, the stages of the TDSP lifecycle, and data location, definition, and architecture. This structure is derived from the TDSP structure published [here](https://github.com/Azure/Azure-TDSP-ProjectTemplate), with some modifications. For example, several of the document templates are merged into one markdown file, namely [ProjectReport](https://aka.ms/tdspamlgithubrepoprojectreport).
### Project Folder Structure
The TDSP project template contains the following top-level folders:
1. **code**: Contains code
2. **docs**: Contains necessary documentation about the project (for example, Markdown files, related media etc.)
3. **sample_data**: Contains **SAMPLE (small)** data that can be used for early development or testing. Typically no more than a few (about 5) MB. Not for full or large data sets.
<img src="./images/instantiation-3.png" width="750" height="500">
## Using The TDSP Structure and Templates
The structure and templates need to be populated with project-specific information. You are expected to populate these with the code and information necessary for executing and delivering your project. The [ProjectReport](https://aka.ms/tdspamlgithubrepoprojectreport) file is a template that should be directly modified with information relevant to your project. It comes with a set of questions that help you fill out the information for each of the four stages of the [Team Data Science Process lifecycle](https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/lifecycle-detail.md).
An example of how a project structure can look during execution or after completion is given below (left panel in the figure below). This is from the [Team Data Science Process Sample Project: Classify incomes from US Census data in Azure Machine Learning](https://github.com/Azure/MachineLearningSamples-TDSPUCIAdultIncome).
<img src="./images/instantiation-4.png" width="900" height="800">
## Documenting Your Project
Refer to [TDSP documentation templates](https://github.com/Azure/Azure-TDSP-ProjectTemplate) for documenting your project. In the current Azure Machine Learning TDSP documentation template, we recommend that you include all the information in the [ProjectReport](https://aka.ms/tdspamlgithubrepoprojectreport) file. This template should be filled out with information that is specific to your project.
We also provide another template, [ProjectLearnings](https://aka.ms/tdspamlgithubrepoprojectlearnings), to capture any information that is not included in the primary project document but is still useful to record.
### Example Project Report
An example project report can be found [here](https://github.com/Azure/MachineLearningSamples-TDSPUCIAdultIncome/blob/master/docs/deliveralbe_docs/ProjectReport.md). This is the project report for the [US Income Classification sample project](https://github.com/Azure/MachineLearningSamples-TDSPUCIAdultIncome), which shows how the TDSP template can be instantiated and used for a data science project.
## Next Steps
To facilitate your understanding of how the TDSP structure and templates can be used in Azure Machine Learning projects, we provide several worked-out project examples in the documentation for Azure Machine Learning.
- For a sample showing how to create a TDSP project in Azure Machine Learning, see [Team Data Science Process Sample Project: Classify incomes from US Census data in Azure Machine Learning](https://github.com/Azure/MachineLearningSamples-TDSPUCIAdultIncome)
- For a sample that uses Deep Learning in NLP in a TDSP-instantiated project in Azure Machine Learning, see [Bio-medical entity recognition using Natural Language Processing with Deep Learning](https://github.com/Azure/MachineLearningSamples-BiomedicalEntityExtraction)

9
docs/images/ReadMe.md Normal file
@@ -0,0 +1,9 @@
## docs/images
This directory is where we store images used in other documents. Please use subdirectories as applicable to keep images and media organized.
The markdown syntax for including images is
`![Alt text](/path/to/img.jpg "Optional title")`
This folder typically contains images used in the documentation of your project. It is NOT, for example, a folder for storing image files used for training an algorithm.

Binary data: docs/images/aml-gallery-tdsp-icon.png Normal file (68 KiB, binary file not shown)
Binary data: docs/images/instantiation-1.png Normal file (293 KiB, binary file not shown)
Binary data: docs/images/instantiation-2.png Normal file (139 KiB, binary file not shown)
Binary data: docs/images/instantiation-3.png Normal file (211 KiB, binary file not shown)
Binary data: docs/images/instantiation-4.png Normal file (255 KiB, binary file not shown)
Binary data: docs/images/tdsp-lifecycle.jpg Normal file (621 KiB, binary file not shown)
Binary data: docs/images/tdsp-lifecycle.png Normal file (155 KiB, binary file not shown)

@@ -0,0 +1,3 @@
# Data Pipeline
Describe the data pipeline and provide a logical diagram. List how frequently the data is moved: real time/stream, near real time, batched with a given frequency, etc.

@@ -0,0 +1,4 @@
Column Index,Column Name,Type of Variable,"Values (range, levels, examples, etc)",Short Description,Joining Keys with others datasets?
1,,,,,
2,,,,,
3,,,,,
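A skeleton like the one above can be drafted automatically; the hypothetical pandas sketch below assumes a raw file named raw_data.csv and leaves the description and joining-key columns to be filled in by hand.

```python
import pandas as pd

# "raw_data.csv" is a placeholder for one of your raw data files.
df = pd.read_csv("raw_data.csv")

# Sketch the "Values" column: numeric ranges, or a few example levels.
values = []
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        values.append("{} - {}".format(df[col].min(), df[col].max()))
    else:
        values.append(", ".join(str(v) for v in df[col].dropna().unique()[:3]))

columns = ["Column Index", "Column Name", "Type of Variable",
           "Values (range, levels, examples, etc)", "Short Description",
           "Joining Keys with others datasets?"]
dictionary = pd.DataFrame({
    "Column Index": list(range(1, len(df.columns) + 1)),
    "Column Name": list(df.columns),
    "Type of Variable": [str(t) for t in df.dtypes],
    "Values (range, levels, examples, etc)": values,
    "Short Description": [""] * len(df.columns),
    "Joining Keys with others datasets?": [""] * len(df.columns),
})[columns]
dictionary.to_csv("Raw-Data-Dictionary.csv", index=False)
```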

@@ -0,0 +1,35 @@
# docs/optional_templates
## Folder hosting optional data science project templates.
These templates can be used, as necessary, for documentation of your project. Some of these may be moved to customer\_docs or deliverable\_docs. For example, project-charter may be prepared at the beginning of the project and shared with the customer as a deliverable. model-report may also be another stand-alone deliverable document, or information from this document may be included in /deliverable\_docs/ProjectReport.md.
Optional templates include:
### project-charter.md
This template helps to define business background, scope, metrics (KPIs - key performance indicators), success criteria, architecture, etc. This can be initially used with the customer to define the project as accurately as possible, and set expectations about deliverables and success criteria.
### data-dictionaries.md
This document provides the descriptions of the data that is provided by the client. As appropriate, this may be placed in customer_docs.
### Raw-Data-Dictionary.csv
This file should contain information about raw data files, and all the fields in these files. For example, column name, variable type, variable range etc. One file may be prepared for each raw data file.
### data-definition.md
Data and feature definitions.
NOTE: In this file, placeholder links need to be replaced by links to your data sets.
### DataPipelines.txt
Describe the data pipeline and provide a logical diagram. List how frequently the data is moved: real time/stream, near real time, batched with a given frequency, etc.
### data-summary-report.md
This file may be generated for each data file received or processed. The [Interactive Data Exploration, Analysis, and Reporting (IDEAR)](https://github.com/Azure/Azure-TDSP-Utilities) utility can help you explore and visualize the data in an interactive way, and generate the data report along with the process of exploration and visualization.
### model-report.md
Report describing the final model to be delivered - typically comprised of one or more of the models built during the life of the project.
### project-exit-report.md
This is a report (an alternative to /docs/deliverable_docs/ProjectReport.md) that can be used as a deliverable final report to the customer.
### Notes:
Any notes you want to keep about the Docs

@@ -0,0 +1,43 @@
# Data and Feature Definitions
This document provides a central hub for the raw data sources, the processed/transformed data, and the feature sets. More details of each dataset are provided in the data summary report.
For each dataset, an individual report is provided describing the data schema, the meaning of each data field, and other information that is helpful for understanding the data. If the dataset is the output of processing/transforming/feature-engineering existing data set(s), the names of the input data sets and the links to the scripts used to conduct the operation are also provided.
When applicable, the Interactive Data Exploration, Analysis, and Reporting (IDEAR) utility developed by Microsoft is applied to explore and visualize the data, and generate the data report. Instructions of how to use IDEAR can be found [here]().
For each dataset, the links to the sample datasets in the _**Data**_ directory are also provided.
_**For ease of modifying this report, placeholder links are included in this page, for example a link to dataset 1, but they are just placeholders pointing to a non-existent page. These should be modified to point to the actual location.**_
## Raw Data Sources
| Dataset Name | Original Location | Destination Location | Data Movement Tools / Scripts | Link to Report |
| ---:| ---: | ---: | ---: | -----: |
| Dataset 1 | Brief description of its original location | Brief description of its destination location | [script1.py](link/to/python/script/file/in/Code) | [Dataset 1 Report](link/to/report1)|
| Dataset 2 | Brief description of its original location | Brief description of its destination location | [script2.R](link/to/R/script/file/in/Code) | [Dataset 2 Report](link/to/report2)|
* Dataset1 summary. <Provide brief summary of the data, such as how to access the data. More detailed information should be in the Dataset1 Report.>
* Dataset2 summary. <Provide brief summary of the data, such as how to access the data. More detailed information should be in the Dataset2 Report.>
## Processed Data
| Processed Dataset Name | Input Dataset(s) | Data Processing Tools/Scripts | Link to Report |
| ---:| ---: | ---: | ---: |
| Processed Dataset 1 | [Dataset1](link/to/dataset1/report), [Dataset2](link/to/dataset2/report) | [Python_Script1.py](link/to/python/script/file/in/Code) | [Processed Dataset 1 Report](link/to/report1)|
| Processed Dataset 2 | [Dataset2](link/to/dataset2/report) |[script2.R](link/to/R/script/file/in/Code) | [Processed Dataset 2 Report](link/to/report2)|
* Processed Data1 summary. <Provide brief summary of the processed data, such as why you want to process data in this way. More detailed information about the processed data should be in the Processed Data1 Report.>
* Processed Data2 summary. <Provide brief summary of the processed data, such as why you want to process data in this way. More detailed information about the processed data should be in the Processed Data2 Report.>
## Feature Sets
| Feature Set Name | Input Dataset(s) | Feature Engineering Tools/Scripts | Link to Report |
| ---:| ---: | ---: | ---: |
| Feature Set 1 | [Dataset1](link/to/dataset1/report), [Processed Dataset2](link/to/dataset2/report) | [R_Script2.R](link/to/R/script/file/in/Code) | [Feature Set1 Report](link/to/report1)|
| Feature Set 2 | [Processed Dataset2](link/to/dataset2/report) |[SQL_Script2.sql](link/to/sql/script/file/in/Code) | [Feature Set2 Report](link/to/report2)|
* Feature Set1 summary. <Provide detailed description of the feature set, such as the meaning of each feature. More detailed information about the feature set should be in the Feature Set1 Report.>
* Feature Set2 summary. <Provide detailed description of the feature set, such as the meaning of each feature. More detailed information about the feature set should be in the Feature Set2 Report.>

@@ -0,0 +1,17 @@
# Data Dictionaries
_Place to put data description documents, typically received from a client_
This is typically a field-level description of data files received.
This document provides the descriptions of the data that is provided by the client. If the client is providing data dictionaries in text (in emails or text files), directly copy them here, or take a snapshot of the text and add it here as an image. If the client is providing data dictionaries in Excel worksheets, directly put the Excel files in this directory, and add a link to the Excel file.
If the client is providing you the data from a database-like data management system, you can also copy and paste the data schema (snapshot) here. If necessary, please also provide a brief description of each column after the snapshot image, if the image does not contain that information.
## <Dataset 1 name (from database)\>
_Example image of data schema when data is from a sql server_
![](data-dictionary-from-sql-table.PNG)
## <Dataset 2 name (dictionary in Excel file)\>
[dataset 2 with dictionary in Excel](./Raw-Data-Dictionary.csv)

@@ -0,0 +1,18 @@
# Data Report
This file will be generated for each data file received or processed. The Interactive Data Exploration, Analysis, and Reporting (IDEAR) utility, developed by the TDSP team at Microsoft, can help you explore and visualize the data in an interactive way and generate the data report along with the process of exploration and visualization.
IDEAR allows you to output the data summary, statistics, and charts that you want to use to tell the data story into the report. You only need to click a few buttons, and the report will be generated for you.
## General summary of the data
## Data quality summary
## Target variable
## Individual variables
## Variable ranking
## Relationship between explanatory variables and target variable
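While IDEAR is the recommended way to generate this report, a minimal pandas sketch can produce raw material for the first two sections above; the input file name is a placeholder.

```python
import pandas as pd

# Placeholder input; substitute the data file this report describes.
df = pd.read_csv("processed_data.csv")

# General summary of the data.
print(df.shape)
print(df.describe(include="all"))

# Data quality summary: missing values and duplicate rows.
print(df.isnull().sum())
print("duplicate rows:", df.duplicated().sum())
```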

@@ -0,0 +1,34 @@
# Final Model Report
_Report describing the final model to be delivered - typically comprised of one or more of the models built during the life of the project_
## Analytic Approach
* What is the target definition?
* What are the inputs (description)?
* What kind of model was built?
## Solution Description
* Simple solution architecture (Data sources, solution components, data flow)
* What is the output?
## Data
* Source
* Data Schema
* Sampling
* Selection (dates, segments)
* Stats (counts)
## Features
* List of raw and derived features
* Importance ranking.
## Algorithm
* Description or images of data flow graph
* If AzureML, link to:
* Training experiment
* Scoring workflow
* What learner(s) were used?
* Learner hyper-parameters
## Results
* ROC/Lift charts, AUC, R^2, MAPE as appropriate
* Performance graphs for parameters sweeps if applicable
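For the Results section, here is a brief, hedged sketch of computing the suggested ROC/AUC figures with scikit-learn; the labels and scores are placeholder values.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Placeholder validation labels and model scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print("AUC:", roc_auc_score(y_true, y_score))
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("ROC points:", list(zip(fpr, tpr)))
```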

@@ -0,0 +1,54 @@
# Project Charter
## Business background
* Who is the client, and what business domain is the client in?
* What business problems are we trying to address?
## Scope
* What data science solutions are we trying to build?
* What will we do?
* How is it going to be consumed by the customer?
## Personnel
* Who is on this project:
* Microsoft:
* Project lead
* PM
* Data scientist(s)
* Account manager
* Client:
* Data administrator
* Business contact
## Metrics
* What are the qualitative objectives? (e.g. reduce user churn)
* What is a quantifiable metric (e.g. reduce the fraction of users with 4-week inactivity)
* Quantify what improvement in the values of the metrics is useful for the customer scenario (e.g. reduce the fraction of users with 4-week inactivity by 20%)
* What is the baseline (current) value of the metric? (e.g. current fraction of users with 4-week inactivity = 60%)
* How will we measure the metric? (e.g. A/B test on a specified subset for a specified period; or comparison of performance after implementation to baseline)
## Plan
* Phases (milestones), timeline, short description of what we'll do in each phase.
## Architecture
* Data
* What data do we expect? Raw data in the customer data sources (e.g. on-prem files, SQL, on-prem Hadoop etc.)
* Data movement from on-prem to Azure using ADF or other data movement tools (Azcopy, EventHub etc.) to move either
* all the data,
* after some pre-aggregation on-prem,
* sampled data sufficient for modeling
* What tools and data storage/analytics resources will be used in the solution e.g.,
* ASA for stream aggregation
* HDI/Hive/R/Python for feature construction, aggregation and sampling
* AzureML for modeling and web service operationalization
* How will the score or operationalized web service(s) (RRS and/or BES) be consumed in the business workflow of the customer? If applicable, write down pseudo code for the APIs of the web service calls.
* How will the customer use the model results to make decisions
* Data movement pipeline in production
* Make a 1 slide diagram showing the end to end data flow and decision architecture
* If there is a substantial change in the customer's business workflow, make a before/after diagram showing the data flow.
## Communication
* How will we keep in touch? Weekly meetings?
* Who are the contact persons on both sides?

@@ -0,0 +1,64 @@
# Exit Report of Project <X> for Customer <Y>
Instructions: Template for exit criteria for data science projects. This is a concise document that includes an overview of the entire project, including details of each stage and learnings. If a section isn't applicable (e.g. the project didn't include an ML model), simply mark that section as "Not applicable". Suggested length is 5-20 pages. Code should mostly be within the code repository (not in this document).
Customer: <Enter Customer Name\>
Team Members: <Enter team members' names. Please also enter relevant parties' names, such as team lead, account team, business stakeholders, etc.\>
## Overview
<Executive summary of entire solution, brief non-technical overview\>
## Business Domain
<Industry, business domain of customer\>
## Business Problem
<Business problem and exact use case(s), why it matters\>
## Data Processing
<Schema of original datasets, how data was processed, final input data schema for model\>
## Modeling, Validation
<Modeling techniques used, validation results, details of how validation conducted\>
## Solution Architecture
<Architecture of the solution, describe clearly whether this was actually implemented or a proposed architecture. Include diagram and relevant details for reproducing similar architecture. Include details of why this architecture was chosen versus other architectures that were considered, if relevant\>
## Benefits
### Company Benefit (internal only. Double check if you want to share this with your customer)
<What did our company gain from this engagement? ROI, revenue, etc\>
### Customer Benefit
<What is the benefit (ROI, savings, productivity gains, etc.) for the customer? If just a POC, what is the estimated ROI? If exact metrics are not available, why does it have impact for the customer?\>
## Learnings
### Project Execution
<Learnings around the customer engagement process\>
### Data science / Engineering
<Learnings related to data science/engineering, tips/tricks, etc\>
### Domain
<Learnings around the business domain\>
### Product
<Learnings around the products and services utilized in the solution \>
### What's unique about this project, specific challenges
<Specific issues or setup, unique things, specific challenges that had to be addressed during the engagement and how that was accomplished\>
## Links
<Links to published case studies, etc.; Link to git repository where all code sits\>
## Next Steps
<Next steps. These should include milestones for follow-ups and who 'owns' this action. E.g. Post- Proof of Concept check-in on status on 12/1/2016 by X, monthly check-in meeting by Y, etc.\>
## Appendix
<Other material that seems relevant. Try to keep the non-appendix portion to <20 pages; more details can be included in the appendix if needed\>

6
sample_data/Readme.md Normal file
@@ -0,0 +1,6 @@
# Sample_Data
The **Sample_Data** directory in the project git repository is the place to store **SAMPLE** datasets, which should be of small size, **NOT** the entire datasets. If your client does not allow you to store even sample data in the GitHub repository, store, if possible, a sample dataset with all confidential fields hashed. If even that is not allowed, please do not store sample data here, but do still fill in the table in each sub-directory.
The small sample datasets can be used to make your data preprocessing, feature engineering, or modeling scripts runnable. They make it possible to quickly run the scripts that process or model the data and to understand what those scripts are doing.
In each directory there is a markdown file that lists all datasets in that directory. Please provide the link to the full dataset in case one wants to access it.
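One hedged way to produce such a hashed sample before committing it is sketched below; the input file, confidential column names, and sample size are assumptions.

```python
import hashlib

import pandas as pd

def hash_value(value):
    # One-way hash so confidential identifiers cannot be recovered from the sample.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:16]

# File and column names are placeholders for your own data.
df = pd.read_csv("full_data.csv")
for column in ["customer_id", "email"]:
    df[column] = df[column].map(hash_value)

# Keep the committed sample small, per the guidance above.
df.head(1000).to_csv("sample_data/sample_hashed.csv", index=False)
```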