Azure CycleCloud OpenPBS project
OpenPBS is a highly configurable open source workload manager. See the OpenPBS project site for an overview and the PBSpro documentation for more information on using, configuring, and troubleshooting OpenPBS in general.
Versions
OpenPBS (formerly PBS Professional OSS) is released as part of version 20.0.0. PBSPro OSS is still available in CycleCloud by specifying the PBSPro OSS version.
[[[configuration]]]
pbspro.version = 18.1.4-0
Installing Manually
Note: When using the cluster that is shipped with CycleCloud, the autoscaler and default queues are already installed.
First, download the installer pkg from GitHub. For example, the commands below download and install the 2.0.23 release.
# Prerequisite: python3, 3.6 or newer, must be installed and in the PATH
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.23/cyclecloud-pbspro-pkg-2.0.23.tar.gz
tar xzf cyclecloud-pbspro-pkg-2.0.23.tar.gz
cd cyclecloud-pbspro
# Optional, but recommended. Adds relevant resources and enables strict placement
./initialize_pbs.sh
# Optional. Sets up workq as a colocated, MPI focused queue and creates htcq for non-MPI workloads.
./initialize_default_queues.sh
# Creates the azpbs autoscaler
./install.sh --venv /opt/cycle/pbspro/venv
# If you have jetpack available, you may use the following:
# ./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
# --username $(jetpack config cyclecloud.config.username) \
# --password $(jetpack config cyclecloud.config.password) \
# --url $(jetpack config cyclecloud.config.web_server) \
# --cluster-name $(jetpack config cyclecloud.cluster.name)
# Otherwise insert your username, password, url, and cluster name here.
./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
--username user \
--password password \
--url https://fqdn:port \
--cluster-name cluster_name
# Lastly, run this to understand any changes that may be required.
# For example, you typically have to add the ungrouped and group_id resources
# to the /var/spool/pbs/sched_priv/sched_config file and restart.
## [root@scheduler cyclecloud-pbspro]# azpbs validate
## ungrouped is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
## group_id is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
azpbs validate
Autoscale and scalesets
To ensure that the correct VMs are provisioned for different types of jobs, CycleCloud treats autoscale of MPI and serial jobs differently in OpenPBS clusters.
For serial jobs, multiple VM scalesets (VMSS) are used in order to scale as quickly as possible. For MPI jobs to use the InfiniBand fabric on instances that support it, all of the nodes allocated to the job must be deployed in the same VMSS. CycleCloud handles this by using a PlacementGroupId that groups nodes with the same id into the same VMSS. By default, the workq appends the equivalent of -l place=scatter:group=group_id by using native queue defaults.
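The grouping rule above can be illustrated with a minimal Python sketch. It is not the autoscaler's code: the node dictionaries and the helper names `group_by_placement` and `can_host_mpi_job` are hypothetical, and the only point is that an MPI job needs all of its nodes in one placement group, while nodes without a group cannot serve it.

```python
from collections import defaultdict


def group_by_placement(nodes):
    """Map each placement group id to the names of the nodes in it."""
    groups = defaultdict(list)
    for node in nodes:
        # Nodes without a placement group (serial nodes) share the key None.
        groups[node.get("placement_group")].append(node["name"])
    return dict(groups)


def can_host_mpi_job(nodes, node_count):
    """True if a single placement group holds at least node_count nodes."""
    groups = group_by_placement(nodes)
    return any(
        len(names) >= node_count
        for group_id, names in groups.items()
        if group_id is not None
    )


# Hypothetical nodes; the group names mimic the azpbs buckets output.
nodes = [
    {"name": "execute-1", "placement_group": "Standard_F2s_v2_pg0"},
    {"name": "execute-2", "placement_group": "Standard_F2s_v2_pg0"},
    {"name": "execute-3", "placement_group": "Standard_F2s_v2_pg1"},
    {"name": "execute-4", "placement_group": None},
]

print(group_by_placement(nodes))
print(can_host_mpi_job(nodes, 2))
print(can_host_mpi_job(nodes, 3))
```

In this toy data a two-node MPI job fits in `Standard_F2s_v2_pg0`, while a three-node job would require additional nodes in a single group.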
Hooks
Our PBS integration uses 3 different PBS hooks. autoscale does the bulk of the work required to scale the cluster up and down. All relevant log messages can be seen in /opt/cycle/pbspro/autoscale.log. cycle_sub_hook will validate jobs unless they use -l nodes syntax, in which case those jobs are held and later processed by our last hook, cycle_sub_hook_periodic.
Autoscale Hook
The most important is the autoscale hook, which runs by default on a 15 second interval. You can adjust this frequency by running:
qmgr -c "set hook autoscale freq=NUM_SECONDS"
Submission Hooks
cycle_sub_hook will validate that your job has the proper placement restrictions set. If it encounters a problem, it will output a detailed message on why the job was rejected and how to resolve the issue. For example:
$> echo sleep 300 | qsub -l select=2 -l place=scatter
Please do one of the following
1) Ensure this placement is set by adding group=group_id to your -l place= statement
Note: Queue workq's resource_defaults.place=group=group_id
2) Add -l skipcyclesubhook=true on this job
Note: If the resource does not exist, create it -> qmgr -c 'create resource skipcyclesubhook type=boolean'
3) Disable this hook for this queue via queue defaults -> qmgr -c 'set queue workq resources_default.skipcyclesubhook=true'
4) Disable this plugin - 'qmgr -c 'set hook cycle_sub_hook enabled=false'
Note: Disabling this plugin may prevent -l nodes= style submissions from working properly.
One important note: if you are using Torque style submissions, i.e. those that use -l nodes instead of -l select, PBS will simply convert that submission into an equivalent -l select style submission. However, the default placement defined for the queue is not respected by PBS when converting the job. To get around this, we hold the job, and our last hook, cycle_sub_hook_periodic, will periodically update the job's placement and release it.
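The placement check described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code of cycle_sub_hook: the function names `place_has_group` and `check_submission` are hypothetical, and the real hook also considers queue defaults and -l nodes style submissions.

```python
def place_has_group(place):
    """Return True if a -l place= value contains a group=... component."""
    # A place value is colon separated, e.g. "scatter:group=group_id".
    return any(part.startswith("group=") for part in place.split(":"))


def check_submission(place, skip_hook=False):
    """Return (accepted, message) for a hypothetical job submission."""
    if skip_hook:
        # Mirrors option 2 of the rejection message: -l skipcyclesubhook=true
        return True, "skipcyclesubhook=true, placement not checked"
    if place_has_group(place):
        return True, "placement includes a group"
    return False, "add group=group_id to your -l place= statement"


print(check_submission("scatter"))
print(check_submission("scatter:group=group_id"))
print(check_submission("scatter", skip_hook=True))
```

The first call corresponds to the rejected `qsub -l select=2 -l place=scatter` example, since an explicit place without a group replaces the queue default.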
Configuring Resources
The cyclecloud-pbspro application matches PBS resources to Azure cloud resources to provide rich autoscaling and cluster configuration tools. The application will be deployed automatically for clusters created via the CycleCloud UI, or it can be installed on any PBS admin host on an existing cluster. For more information on defining resources in autoscale.json, see ScaleLib's documentation.
The default resources defined in the cluster template we ship are:
{"default_resources": [
{
"select": {},
"name": "ncpus",
"value": "node.vcpu_count"
},
{
"select": {},
"name": "group_id",
"value": "node.placement_group"
},
{
"select": {},
"name": "host",
"value": "node.hostname"
},
{
"select": {},
"name": "mem",
"value": "node.memory"
},
{
"select": {},
"name": "vm_size",
"value": "node.vm_size"
},
{
"select": {},
"name": "disk",
"value": "size::20g"
}]
}
Note that disk is currently hardcoded to size::20g because of platform limitations in determining how much disk a node will have. Here is an example of handling VM size specific disk sizes:
{
"select": {"node.vm_size": "Standard_F2"},
"name": "disk",
"value": "size::20g"
},
{
"select": {"node.vm_size": "Standard_H44rs"},
"name": "disk",
"value": "size::2t"
}
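To make the role of the select field concrete, here is a minimal Python sketch of how such entries could be resolved for a node. It is a simplified model, not ScaleLib's implementation: the helpers `matches` and `resolve_resource` are hypothetical, nodes are plain dictionaries keyed by property name, and it assumes that the first matching entry wins.

```python
def matches(select, node):
    """True if every key/value pair in select equals the node's property."""
    # An empty select ({}) matches every node.
    return all(node.get(key) == value for key, value in select.items())


def resolve_resource(name, default_resources, node):
    """Return the value of the first entry for name whose select matches."""
    for entry in default_resources:
        if entry["name"] == name and matches(entry["select"], node):
            return entry["value"]
    return None


default_resources = [
    {"select": {"node.vm_size": "Standard_F2"}, "name": "disk", "value": "size::20g"},
    {"select": {"node.vm_size": "Standard_H44rs"}, "name": "disk", "value": "size::2t"},
]

# Hypothetical node descriptions, keyed the same way as the select fields.
print(resolve_resource("disk", default_resources, {"node.vm_size": "Standard_F2"}))
print(resolve_resource("disk", default_resources, {"node.vm_size": "Standard_H44rs"}))
print(resolve_resource("disk", default_resources, {"node.vm_size": "Standard_NC6"}))
```

Under this model a VM size with no matching entry gets no disk resource at all, which is why a catch-all entry with an empty select is useful as a fallback.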
azpbs cli
The azpbs cli is the main interface for all autoscaling behavior. Note that it has fairly powerful autocomplete capabilities. For example, after typing azpbs create_nodes --vm-size you can tab-complete the list of possible VM sizes. Autocomplete information is updated every azpbs autoscale cycle, but can also be refreshed manually by running azpbs refresh_autocomplete.
Command | Description |
---|---|
autoscale | End-to-end autoscale process, including creation, deletion and joining of nodes. |
buckets | Prints out autoscale bucket information, like limits etc |
config | Writes the effective autoscale config, after any preprocessing, to stdout |
create_nodes | Create a set of nodes given various constraints. A CLI version of the nodemanager interface. |
default_output_columns | Output what are the default output columns for an optional command. |
delete_nodes | Deletes node, including draining post delete handling |
demand | Dry-run version of autoscale. |
initconfig | Creates an initial autoscale config. Writes to stdout |
jobs | Writes out autoscale jobs as json. Note: Running jobs are excluded. |
join_nodes | Adds selected nodes to the scheduler |
limits | Writes a detailed set of limits for each bucket. Defaults to json due to number of fields. |
nodes | Query nodes |
refresh_autocomplete | Refreshes local autocomplete information for cluster specific resources and nodes. |
remove_nodes | Removes the node from the scheduler without terminating the actual instance. |
retry_failed_nodes | Retries all nodes in a failed state. |
shell | Interactive python shell with relevant objects in local scope. Use --script to run python scripts |
validate | Runs basic validation of the environment |
validate_constraint | Validates then outputs as json one or more constraints. |
azpbs buckets
Use the azpbs buckets command to see which buckets of compute are available, how many are available, and what resources they have.
azpbs buckets --output-columns nodearray,placement_group,vm_size,ncpus,mem,available_count
NODEARRAY PLACEMENT_GROUP VM_SIZE NCPUS MEM AVAILABLE_COUNT
execute Standard_F2s_v2 1 4.00g 50
execute Standard_D2_v4 1 8.00g 50
execute Standard_E2s_v4 1 16.00g 50
execute Standard_NC6 6 56.00g 16
execute Standard_A11 16 112.00g 6
execute Standard_F2s_v2_pg0 Standard_F2s_v2 1 4.00g 50
execute Standard_F2s_v2_pg1 Standard_F2s_v2 1 4.00g 50
execute Standard_D2_v4_pg0 Standard_D2_v4 1 8.00g 50
execute Standard_D2_v4_pg1 Standard_D2_v4 1 8.00g 50
execute Standard_E2s_v4_pg0 Standard_E2s_v4 1 16.00g 50
execute Standard_E2s_v4_pg1 Standard_E2s_v4 1 16.00g 50
execute Standard_NC6_pg0 Standard_NC6 6 56.00g 16
execute Standard_NC6_pg1 Standard_NC6 6 56.00g 16
execute Standard_A11_pg0 Standard_A11 16 112.00g 6
execute Standard_A11_pg1 Standard_A11 16 112.00g 6
azpbs demand
It is common to want to test autoscaling without actually allocating anything. azpbs demand is a dry-run version of azpbs autoscale. Here is a simple example where we allocate two machines for a simple -l select=2 submission. As you can see, job id 1 is using one ncpus on two different nodes.
azpbs demand
NAME JOB_IDS NCPUS
execute-1 1 0/1
execute-2 1 0/1
azpbs create_nodes
Manually creating nodes via azpbs create_nodes is also quite powerful. Note that it also has a --dry-run mode.
Here is an example of allocating 100 slots of mem=memory::1g, or 1gb partitions. Since our nodes have 4gb each, we expect 25 nodes to be created.
azpbs create_nodes --keep-alive --vm-size Standard_F2s_v2 --slots 100 --constraint-expr mem=memory::1g --dry-run --output-columns name,/mem
NAME MEM
execute-1 0.00g/4.00g
...
execute-25 0.00g/4.00g
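The expected node count can be checked with a short calculation. The sketch below is only the arithmetic behind the example, with a hypothetical helper `nodes_needed`; it is not how azpbs packs slots onto nodes internally.

```python
import math


def nodes_needed(slots, slot_mem_gb, node_mem_gb):
    """Number of nodes required to hold `slots` partitions of slot_mem_gb each."""
    slots_per_node = int(node_mem_gb // slot_mem_gb)
    if slots_per_node < 1:
        raise ValueError("a single slot does not fit on one node")
    return math.ceil(slots / slots_per_node)


# 100 slots of 1 GB on 4 GB nodes: 4 slots per node, so 25 nodes.
print(nodes_needed(100, 1, 4))
```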
azpbs delete_/remove_nodes
azpbs supports safely removing a node from PBS. The difference between delete_nodes and remove_nodes is simply that delete_nodes, on top of removing the node from PBS, will also delete the node. You may delete by hostname or node name. Pass in * to delete/remove all nodes.
azpbs shell
azpbs shell is a more advanced command that can be quite powerful. This command fully constructs the in-memory structures used by azpbs autoscale to allow the user to interact with them dynamically. All of the objects are passed in to the local scope, and can be listed by calling pbsprohelp(). This is a powerful debugging tool.
[root@pbsserver ~] azpbs shell
CycleCloud Autoscale Shell
>>> pbsprohelp()
config - dict representing autoscale configuration.
cli - object representing the CLI commands
pbs_env - object that contains data structures for queues, resources etc
queues - dict of queue name -> PBSProQueue object
jobs - dict of job id -> Autoscale Job
scheduler_nodes - dict of hostname -> node objects. These represent purely what the scheduler sees without additional booting nodes / information from CycleCloud
resource_definitions - dict of resource name -> PBSProResourceDefinition objects.
default_scheduler - PBSProScheduler object representing the default scheduler.
pbs_driver - PBSProDriver object that interacts directly with PBS and implements PBS specific behavior for scalelib.
demand_calc - ScaleLib DemandCalculator - pseudo-scheduler that determines the what nodes are unnecessary
node_mgr - ScaleLib NodeManager - interacts with CycleCloud for all node related activities - creation, deletion, limits, buckets etc.
pbsprohelp - This help function
>>> queues.workq.resources_default
{'place': 'scatter:group=group_id'}
>>> jobs["0"].node_count
2
azpbs shell can also take in as an argument --script path/to/python_file.py, allowing the user to have full access to the in-memory structures, again by passing in the objects through the local scope, to customize the autoscale behavior.
[root@pbsserver ~] cat example.py
for bucket in node_mgr.get_buckets():
    print(bucket.nodearray, bucket.vm_size, bucket.available_count)
[root@pbsserver ~] azpbs shell -s example.py
execute Standard_F2s_v2 50
execute Standard_D2_v4 50
execute Standard_E2s_v4 50
Timeouts
By default we set idle and boot timeouts across all nodes.
"boot_timeout": 3600
You can also set these per nodearray.
"boot_timeout": {"default": 3600, "nodearray1": 7200, "nodearray2": 900},
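The two forms of the setting can be read the same way: a plain number applies to every nodearray, while a dictionary supplies per-nodearray values with a default fallback. The following Python sketch illustrates that lookup; the helper `timeout_for` is hypothetical and is not the autoscaler's own code.

```python
def timeout_for(setting, nodearray):
    """Resolve a timeout that is either a number or a per-nodearray dict."""
    if isinstance(setting, dict):
        # Fall back to the "default" key for nodearrays that are not listed.
        return setting.get(nodearray, setting["default"])
    return setting


boot_timeout = {"default": 3600, "nodearray1": 7200, "nodearray2": 900}

print(timeout_for(boot_timeout, "nodearray1"))
print(timeout_for(boot_timeout, "execute"))
print(timeout_for(3600, "execute"))
```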
Logging
By default, azpbs will use /opt/cycle/pbspro/logging.conf, as defined in /opt/cycle/pbspro/autoscale.json. This will create the following logs.
/opt/cycle/pbspro/autoscale.log
autoscale.log is the main log for all azpbs invocations.
/opt/cycle/pbspro/qcmd.log
qcmd.log records every PBS executable invocation and the response, so you can see exactly what commands are being run.
/opt/cycle/pbspro/demand.log
Every autoscale iteration, azpbs prints out a table of all of the nodes, their resources, their assigned jobs and more. This log contains these values and nothing else.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.