Mirror of https://github.com/Azure/aztk.git
update docs with new name 'Thunderbolt' (#53)
* update docs with new name 'Thunderbolt'
* update instructions around pip for clarity
This commit is contained in:
Parent: cf6ceff6f7
Commit: 219cf412c0
README.md: 19 changed lines
````diff
@@ -1,11 +1,9 @@
-# Distributed Tools for Data Engineering (DTDE)
-A suite of distributed tools to help engineers scale their work into Azure.
+# Azure Spark Thunderbolt
+A set of tools to help engineers deploy Spark clusters at scale in Azure.
 
-# Spark on DTDE
-
 ## Setup
 1. Clone the repo
-2. Use pip to install required packages:
+2. Use pip to install required packages (requires python 3.5+ and pip 9.0.1+)
 ```bash
 pip install -r requirements.txt
 ```
````
````diff
@@ -13,16 +11,11 @@ A suite of distributed tools to help engineers scale their work into Azure.
 ```bash
 pip install -e .
 ```
-4. Rename 'secrets.cfg.template' to 'secrets.cfg' and fill in the fields for your Batch account and Storage account. These fields can be found in the Azure portal, and in the [Getting Started](./docs/00-getting-started.md) docs.
+4. Rename 'secrets.cfg.template' to 'secrets.cfg' and fill in the fields for your Batch account and Storage account.
 
-To complete this step, you will need an Azure account that has a Batch account and Storage account:
-- To create an Azure account: https://azure.microsoft.com/free/
-- To create a Batch account: https://docs.microsoft.com/en-us/azure/batch/batch-account-create-portal
-- To create a Storage account: https://docs.microsoft.com/en-us/azure/storage/storage-create-storage-account
+Thunderbolt is built on top of two core Azure services, [Azure Batch](https://azure.microsoft.com/en-us/services/batch/) and [Azure Storage](https://azure.microsoft.com/en-us/services/storage/). Create those resources via the portal (see [Getting Started](./docs/00-getting-started.md)).
 
-## Getting Started
+## Quickstart Guide
 
 The entire experience of this package is centered around a few commands.
````
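Step 4 in the hunk above references a 'secrets.cfg' filled in from 'secrets.cfg.template'. As a rough sketch only — the section and key names below are illustrative assumptions, not the template's actual fields — such a file might look like:

```ini
; Illustrative sketch; consult secrets.cfg.template for the real field names.
[batch]
batchaccountname = mybatchaccount            ; placeholder account name
batchaccountkey = <batch-account-key>        ; from the Azure portal
batchserviceurl = https://mybatchaccount.westus.batch.azure.com

[storage]
storageaccountname = mystorageaccount        ; placeholder account name
storageaccountkey = <storage-account-key>    ; from the Azure portal
storageaccountsuffix = core.windows.net
```

The keys and URLs come from the Batch and Storage account pages in the Azure portal, as the step notes.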
````diff
@@ -1,18 +1,32 @@
-# Azure Batch with Spark Documentation
-Spark on Azure Batch is a project that enables submission of single or batch spark jobs to the cloud. Azure Batch
-will elastically scale and manage the compute resources required to run the jobs, and spin them down when no longer
-needed.
+# Azure Spark Thunderbolt
+Thunderbolt is a project that enables submission of single or batch spark jobs to the cloud. Clusters will elastically scale and manage the compute resources required to run the jobs, and spin them down when no longer needed. For production workloads, data can be persisted to cloud storage services such as Azure Storage blobs.
 
 ## Getting Started
-The minimum requirements to get started with Spark on Azure Batch are:
+The minimum requirements to get started with Thunderbolt are:
+- Python 3.5+, pip 9.0.1+
 - An Azure account
 - An Azure Batch account
 - An Azure Storage account
 
+### Cloning and installing the project
+1. Clone the repo
+2. Make sure you are running python 3.5 or greater.
+   _If the default version on your machine is python 2, make sure to run the following commands with **pip3** instead of **pip**._
+3. Use pip to install required packages:
+   ```bash
+   pip install -r requirements.txt
+   ```
+4. Use setuptools to install the required binaries locally:
+   ```bash
+   pip install -e .
+   ```
+
+### Setting up your accounts
 1. Log into Azure
-If you do not already have an Azure account, go to [https://azure.microsoft.com/](https://azure.microsoft.com/) to get
-started for free today.
+If you do not already have an Azure account, go to [https://azure.microsoft.com/](https://azure.microsoft.com/) to get started for free today.
 
 Once you have one, simply log in and go to the [Azure Portal](https://portal.azure.com) to start creating the resources you'll need to get going.
````
````diff
@@ -1,5 +1,5 @@
 # Clusters
-A Cluster in Azure Batch is primarily designed to run Spark jobs. This document describes how to create a cluster
+A Cluster in Thunderbolt is primarily designed to run Spark jobs. This document describes how to create a cluster
 to use for spark job submissions. Alternatively, for getting started and debugging, you can also use the cluster
 in _interactive mode_, which will allow you to log into the master node, use Jupyter and view the Spark UI.
 
````
````diff
@@ -1,5 +1,5 @@
 # Submitting a Job
-Submitting a job to your Spark cluster in Batch mimics the experience of a typical standalone cluster. A spark job will be submitted to the system and run to completion.
+Submitting a job to your Spark cluster in Thunderbolt mimics the experience of a typical standalone cluster. A spark job will be submitted to the system and run to completion.
 
 ## Spark-Submit
 The spark-submit experience is mostly the same as any regular Spark cluster with a few minor differences. You can take a look at `azb spark app --help` for more detailed information and options.
````
````diff
@@ -8,11 +8,11 @@ Reading and writing to and from Azure blobs is easily achieved by using the wasb
 
 ```python
 # read csv data into a dataframe
-dataframe = spark.read.csv('wasb://MY_CONTAINER@MY_STORAGE_ACCOUNT.blob.core.windows.net/MY_INPUT_DATA.csv')
+dataframe = spark.read.csv('wasbs://MY_CONTAINER@MY_STORAGE_ACCOUNT.blob.core.windows.net/MY_INPUT_DATA.csv')
 
 # print the first 5 rows
 dataframe.show(5)
 
 # write the csv back to storage
-dataframe.write.csv('wasb://MY_CONTAINER@MY_STORAGE_ACCOUNT.blob.core.windows.net/MY_DATA.csv')
+dataframe.write.csv('wasbs://MY_CONTAINER@MY_STORAGE_ACCOUNT.blob.core.windows.net/MY_DATA.csv')
 ```
````
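The URLs in the hunk above follow one fixed shape: container, storage account, then blob path, with `wasbs://` being the TLS-secured variant of the `wasb://` scheme. A tiny helper — hypothetical, not part of this project — makes the pattern explicit:

```python
def wasbs_url(container: str, account: str, path: str) -> str:
    """Assemble a wasbs:// URL for a blob in Azure Storage.

    The helper and its names are illustrative; only the URL shape
    (container @ account .blob.core.windows.net / path) is taken
    from the docs above.
    """
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"

# e.g. wasbs_url("MY_CONTAINER", "MY_STORAGE_ACCOUNT", "MY_INPUT_DATA.csv")
```

Building the URL once and reusing it for both `spark.read.csv` and `dataframe.write.csv` avoids the kind of copy-paste drift the diff is correcting.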