Jobs
In the Azure Distributed Data Engineering Toolkit, a Job is an entity that runs against an automatically provisioned and managed cluster. Jobs run a collection of Spark applications and persist the outputs.
Creating a Job
Creating a Job starts with defining the necessary properties in your .aztk/job.yaml file. Jobs have one or more applications to run, as well as values that define the Cluster the applications will run on.
Job.yaml
Each Job has one or more applications given as a List in Job.yaml. Applications are defined using the following properties:
```yaml
applications:
  - name:
    application:
    application_args:
      -
    main_class:
    jars:
      -
    py_files:
      -
    files:
      -
    driver_java_options:
      -
    driver_library_path:
    driver_class_path:
    driver_memory:
    executor_memory:
    driver_cores:
    executor_cores:
```
Please note: the only required fields are name and application. All other fields may be removed or left blank.
NOTE: The application name can only contain alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters. Each application must have a unique name.
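The naming rule above can be checked before submitting. The following is an illustrative sketch only (aztk performs its own validation; this regex is an assumption based on the rule as stated, not the toolkit's actual implementation):

```python
import re

# Names may contain alphanumerics, hyphens, and underscores,
# and must be 1-64 characters long (per the note above).
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,64}$")

def is_valid_name(name: str) -> bool:
    """Return True if `name` satisfies the documented constraints."""
    return bool(NAME_PATTERN.match(name))

print(is_valid_name("pipy100"))    # True
print(is_valid_name("bad name!"))  # False: space and '!' are not allowed
print(is_valid_name("x" * 65))     # False: longer than 64 characters
```

The same constraints apply to the Job id passed to `aztk spark job submit --id`.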
Jobs also require a definition of the cluster on which the Applications will run. The following properties define a cluster:
```yaml
cluster_configuration:
  vm_size: <the Azure VM size>
  size: <the number of nodes in the Cluster>
  docker_repo: <Docker image to download on all nodes>
  subnet_id: <resource ID of a subnet to use (optional)>
  custom_scripts:
    - <list
    - of
    - paths
    - to
    - custom
    - scripts>
```
Please note: for more information about Azure VM sizes, see Azure Batch Pricing; for more information about Docker repositories, see Docker.
The only required fields are vm_size and either size or size_low_pri; all other fields can be left blank or removed.
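For example, a minimal cluster definition that uses low-priority nodes rather than dedicated ones might look like the following (the VM size and node count here are illustrative, not defaults):

```yaml
cluster_configuration:
  vm_size: standard_f2
  size_low_pri: 3   # low-priority nodes; cheaper, but can be preempted
```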
A Job definition may also include a default Spark Configuration. The following are the properties to define a Spark Configuration:
```yaml
spark_configuration:
  spark_defaults_conf: </path/to/your/spark-defaults.conf>
  spark_env_sh: </path/to/your/spark-env.sh>
  core_site_xml: </path/to/your/core-site.xml>
```
Please note: including a Spark Configuration is optional. Spark Configuration values defined as part of an application will take precedence over the values specified in these files.
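As an illustrative sketch of that precedence rule (the paths and the 2g value are assumptions for the example, not defaults):

```yaml
# Job-level defaults, applied to every application in the Job
spark_configuration:
  spark_defaults_conf: /path/to/spark-defaults.conf   # may set spark.driver.memory

applications:
  - name: pipy100
    application: /path/to/pi.py
    driver_memory: 2g   # overrides the spark-defaults.conf value for this application
```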
Below is a simple, functioning Job definition.
```yaml
# Job Configuration
job:
  id: test-job
  cluster_configuration:
    vm_size: standard_f2
    size: 3
  applications:
    - name: pipy100
      application: /path/to/pi.py
      application_args:
        - 100
    - name: pipy200
      application: /path/to/pi.py
      application_args:
        - 200
```
Once submitted, this Job will run two applications, pipy100 and pipy200, on an automatically provisioned Cluster of 3 dedicated Standard_F2 Azure VMs. Immediately after both pipy100 and pipy200 have completed, the Cluster will be destroyed. Application logs will be persisted and available.
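The pi.py referenced above is not shown here. As a rough, plain-Python stand-in for what such an application might compute (this is not Spark's actual pi.py example), the sketch below estimates pi by Monte Carlo sampling, reading the sample count from the command line the way application_args passes 100 and 200:

```python
import random
import sys

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that land inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

if __name__ == "__main__":
    # application_args entries (e.g. `- 100`) arrive as argv entries
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 100
    print(f"Pi is roughly {estimate_pi(n):f}")
```

More samples give a tighter estimate, which is why pipy200 (200 samples) would typically land closer to pi than pipy100.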
Commands
Submit a Spark Job:
aztk spark job submit --id <your_job_id> --configuration </path/to/job.yaml>
NOTE: The Job id (--id) can only contain alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters. Each Job must have a unique id.
Low priority nodes
You can create your Job with low-priority VMs at an 80% discount by using --size-low-pri instead of --size. Note that low-priority VMs are great for experimental use, but can be taken away at any time. We recommend against this option for long-running jobs or critical workloads.
Listing Jobs
You can list all Jobs currently running in your account by running
aztk spark job list
Viewing a Job
To view details about a particular Job, run:
aztk spark job get --id <your_job_id>
For example, here the Job 'pipy' has 2 applications which have already completed:
```
Job              pipy
------------------------------------------
State:           | completed
Transition Time: | 21:29PM 11/12/17

Applications     | State     | Transition Time
-----------------|-----------|-----------------
pipy100          | completed | 21:25PM 11/12/17
pipy200          | completed | 21:24PM 11/12/17
```
Deleting a Job
To delete a Job run:
aztk spark job delete --id <your_job_id>
Deleting a Job also permanently deletes any data or logs associated with that Job. If you wish to persist this data, use the --keep-logs flag.
You are only charged for the Job while it is active: Jobs handle provisioning and destroying their own infrastructure, so you are only charged for the time that your applications are running.
Stopping a Job
To stop a Job run:
aztk spark job stop --id <your_job_id>
Stopping a Job will end any currently running Applications and will prevent any new Applications from running.
Get information about a Job's Application
To get information about a Job's Application:
aztk spark job get-app --id <your_job_id> --name <your_application_name>
Getting a Job's Application's log
To get a job's application logs:
aztk spark job get-app-logs --id <your_job_id> --name <your_application_name>
Stopping a Job's Application
To stop an application that is running or scheduled to run on a Job:
aztk spark job stop-app --id <your_job_id> --name <your_application_name>