Updated README
Parent: 3e62334d75
Commit: fb1d0f1ca2
README.md: 270 changed lines

# Snakepit Client

Command-line client for the [snakepit machine learning job scheduler](https://github.com/mozilla/snakepit)

## Installation

This is a preliminary installation guide, as the client is not yet mature enough to be hosted as an NPM package.

### Prerequisites

* git
* Node.js (8.0+ is tested)
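
You can verify the prerequisites with a quick check (the version output shown is illustrative; your versions will differ):
```
$ git --version
git version 2.17.1
$ node --version
v8.12.0
```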

### Installing

Follow these steps to install the client:
```
$ git clone https://github.com/mozilla/snakepit-client.git
[...]
$ cd snakepit-client
snakepit-client$ npm install
[...]
snakepit-client$ sudo npm link
[...]
snakepit-client$ pit --help
Usage: pit [options] [command]

Options:
  -V, --version                               output the version number
  -h, --help                                  output usage information

Commands:
  add <entity> [properties...]                adds an entity to the system
  remove|rm <entity>                          removes an entity from the system
  set <entity> <assignments...>               sets properties of an entity
  get <entity> <property>                     gets a property of an entity
  show <entity>                               shows info about an entity
  add-group <entity> <group>                  adds the entity to the access group
  remove-group <entity> <group>               removes the entity from the access group
  stop <jobNumber>                            stops a running job
  run|put [options] <title> [clusterRequest]  enqueues current directory as new job
  log [options] <jobNumber>                   shows job's log
  download <jobNumber>                        downloads job directory as .tar.gz archive
  ls <jobNumber> [path]                       lists contents within a job directory
  cp <jobNumber> <jobPath> <fsPath>           copies contents from job directory to local file system
  mount [options] <entity> [mountpoint]       mounts the data directory of an entity to a local mountpoint
  status                                      prints a job status report
  *
```

### First time use

The administrators of the Snakepit cluster should have provided you with a so-called `.pitconnect.txt` file.
This file should be placed either in your home directory or inside a project root (a project-level file overrules the one in your home directory).

To test your setup, change into that directory and run the following:
```
$ pit status
No user info found. Seems like a new user or first time login from this machine.
Please enter an existing or new username: tilman
Found no user of that name.
Do you want to register this username (yes|no)? yes
Full name: Tilman Kamp
E-Mail address: ...
New password: ************
Reinput the same one to confirm it: ************
JOB S SINCE UC% UM% USER TITLE RESOURCE
```
If your username had already been known, the client would have asked for your password
and registered a token for this (additional) machine.

If all went well, the following command shows your account status:
```
$ pit show me
Username: tilman
Full name: Tilman Kamp
E-Mail address: ...
```
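
Account properties can be read and changed with the generic `get` and `set` commands from the help above. A minimal sketch (the `email` property name and the `key=value` assignment syntax are assumptions for illustration, not confirmed by the help text):
```
$ pit set me email=tilman@example.com
$ pit get me email
tilman@example.com
```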

## Running jobs

Jobs are run through the `run` command:
```
$ pit run --help
Usage: run|put [options] <title> [clusterRequest]

enqueues current directory as new job

Options:
  -p, --private               prevents automatic sharing of this job
  -c, --continue <jobNumber>  continues job with provided number by copying its "keep" directory over to the new job
  -d, --direct <commands>     directly executes provided commands through bash instead of loading .compute file
  -l, --log                   waits for and prints job's log output
  -h, --help                  output usage information

Examples:

  $ pit run "My task" 2:[8:gtx1070]
  $ pit run "My command" [] -d 'hostname; env'

"title" is a short text that will later help identify the job and its purpose.
"clusterRequest" is an expression to specify resources this job requires from the cluster.
It's a comma-separated list of "process requests".
Each "process request" specifies the number of process instances and (divided by colon and in brackets) which resources to allocate for one process instance (on one node).
The first example will allocate 2 process instances. For each process, 8 "gtx1070" resources will get allocated.
You can also provide a ".pitrequest.txt" file with the same content in your project root as default value.
```
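
For example, to make a cluster request the project default, you could store it in `.pitrequest.txt` (a minimal sketch, assuming the file holds exactly one request expression as described above):
```
$ echo '2:[8:gtx1070]' > .pitrequest.txt
$ pit run "My task"
```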

### Finding and allocating resources (GPUs)

As you can see, we have to specify a so-called cluster request to allocate a set of GPUs.
But to do so, we first need to know which resources/GPUs are available in the cluster:
```
$ pit show nodes
n0
n1
$ pit show node:n0
Node name: n0
State: ONLINE
Resources:
0: "GeForce GTX 1070" aka "gtx1070" (cuda 0)
1: "GeForce GTX 1070" aka "gtx1070" (cuda 1)
```

We found at least one node with 2 "GeForce GTX 1070" GPUs.
So let's allocate both and run a test job on them:
```
$ pit run "First light" [2:gtx1070] -d 'cat /proc/driver/nvidia/gpus/**/*' -l
Job number: 190
Remote: origin <https://github.com/...>
Hash: ...
Diff LoC: 0
Resources: "[2:gtx1070]"

[2018-12-14 17:04:58] [daemon] Pit daemon started
[2018-12-14 17:05:01] [worker 0] Worker 0 started
[2018-12-14 17:05:01] [worker 0] Model:           GeForce GTX 1070
[2018-12-14 17:05:01] [worker 0] IRQ:             139
[2018-12-14 17:05:01] [worker 0] GPU UUID:        ...
[2018-12-14 17:05:01] [worker 0] Video BIOS:      86.04.26.00.80
[2018-12-14 17:05:01] [worker 0] Bus Type:        PCIe
[2018-12-14 17:05:01] [worker 0] DMA Size:        47 bits
[2018-12-14 17:05:01] [worker 0] DMA Mask:        0x7fffffffffff
[2018-12-14 17:05:01] [worker 0] Bus Location:    0000:01:00.0
[2018-12-14 17:05:01] [worker 0] Device Minor:    0
[2018-12-14 17:05:01] [worker 0] Blacklisted:     No
[2018-12-14 17:05:01] [worker 0] Binary:          ""
[2018-12-14 17:05:01] [worker 0] Model:           GeForce GTX 1070
[2018-12-14 17:05:01] [worker 0] IRQ:             142
[2018-12-14 17:05:01] [worker 0] GPU UUID:        ...
[2018-12-14 17:05:01] [worker 0] Video BIOS:      86.04.26.00.80
[2018-12-14 17:05:01] [worker 0] Bus Type:        PCIe
[2018-12-14 17:05:01] [worker 0] DMA Size:        47 bits
[2018-12-14 17:05:01] [worker 0] DMA Mask:        0x7fffffffffff
[2018-12-14 17:05:01] [worker 0] Bus Location:    0000:02:00.0
[2018-12-14 17:05:01] [worker 0] Device Minor:    1
[2018-12-14 17:05:01] [worker 0] Blacklisted:     No
[2018-12-14 17:05:01] [worker 0] Binary:          ""
[2018-12-14 17:05:01] [worker 0] Worker 0 ended with exit code 0
[2018-12-14 17:05:01] [daemon] Worker 0 requested stop. Stopping pit...
```
Both GPUs were allocated for one process.

But what if we want two processes allocating one GPU each?
Let's try:
```
$ pit run "Second light" 2:[gtx1070] -d 'cat /proc/driver/nvidia/gpus/**/*' -l
Job number: 191
Remote: origin <https://github.com/...>
Hash: ...
Diff LoC: 0
Resources: "2:[gtx1070]"

[2018-12-14 22:58:27] [daemon] Pit daemon started
[2018-12-14 22:58:28] [worker 0] Worker 0 started
[2018-12-14 22:58:28] [worker 0] Model:           GeForce GTX 1070
[2018-12-14 22:58:28] [worker 0] IRQ:             139
[2018-12-14 22:58:28] [worker 0] GPU UUID:        GPU-9009fe9c-0cca-ea59-631c-14d419efc397
[2018-12-14 22:58:28] [worker 0] Video BIOS:      86.04.26.00.80
[2018-12-14 22:58:28] [worker 0] Bus Type:        PCIe
[2018-12-14 22:58:28] [worker 0] DMA Size:        47 bits
[2018-12-14 22:58:28] [worker 0] DMA Mask:        0x7fffffffffff
[2018-12-14 22:58:28] [worker 0] Bus Location:    0000:01:00.0
[2018-12-14 22:58:28] [worker 0] Device Minor:    0
[2018-12-14 22:58:28] [worker 0] Blacklisted:     No
[2018-12-14 22:58:28] [worker 0] Binary:          ""
[2018-12-14 22:58:28] [worker 0] Worker 0 ended with exit code 0
[2018-12-14 22:58:28] [worker 1] Worker 1 started
[2018-12-14 22:58:28] [worker 1] Model:           GeForce GTX 1070
[2018-12-14 22:58:28] [worker 1] IRQ:             142
[2018-12-14 22:58:28] [worker 1] GPU UUID:        GPU-f5ee1d0f-392c-5999-a708-00eedb04a761
[2018-12-14 22:58:28] [worker 1] Video BIOS:      86.04.26.00.80
[2018-12-14 22:58:28] [worker 1] Bus Type:        PCIe
[2018-12-14 22:58:28] [worker 1] DMA Size:        47 bits
[2018-12-14 22:58:28] [worker 1] DMA Mask:        0x7fffffffffff
[2018-12-14 22:58:28] [worker 1] Bus Location:    0000:02:00.0
[2018-12-14 22:58:28] [worker 1] Device Minor:    1
[2018-12-14 22:58:28] [worker 1] Blacklisted:     No
[2018-12-14 22:58:28] [worker 1] Binary:          ""
[2018-12-14 22:58:28] [worker 1] Worker 1 ended with exit code 0
[2018-12-14 22:58:28] [daemon] Worker 0 requested stop. Stopping pit...
[2018-12-14 22:58:28] [daemon] Worker 1 requested stop. Stopping pit...
```

As you can see, the difference lies in the resource allocation format:
While
```
[2:gtx1070]
```
allocates __1__ process with __2__ GPUs,
```
2:[gtx1070]
```
allocates __2__ processes with __1__ GPU each.
The square brackets represent a process and `n:` prefixes are quantifiers. No quantifier means `1:`.

It's also possible to allocate processes without GPUs, and processes with multiple (comma-separated) GPU types:
```
$ pit run "Strange job" 2:[],4:[gtx1070,2:gtx1060] -d 'echo "Strange!"'
```
This example allocates 2 processes without GPUs and 4 processes with 1 gtx1070 and 2 gtx1060s each.
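
Putting it all together, here is a summary of request expressions and what they allocate (resource names taken from the examples above):
```
[gtx1070]           1 process with 1 gtx1070 (no quantifier means "1:")
[2:gtx1070]         1 process with 2 gtx1070s
2:[gtx1070]         2 processes with 1 gtx1070 each
2:[],4:[gtx1070]    2 GPU-less processes plus 4 processes with 1 gtx1070 each
```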

If you specify multiple processes, they may also get allocated on different machines.
Keep in mind that a single process cannot be split and scheduled across more than one machine.

### Communicating with other processes

Once you have allocated multiple processes for a job, the instances need to be able to communicate with each other.
This can be achieved through a set of environment variables that is provided to each process/script instance:

* `$NUM_GROUPS`: Number of (comma-separated) "process-groups". E.g. allocation "2:[],[gtx1060]" represents two process-groups.
* `$NUM_PROCESSES_GROUP<i>`: Number of processes in the process-group with index i. E.g. in "2:[],[gtx1060]" the value of `$NUM_PROCESSES_GROUP0` is 2.
* `$HOST_GROUP<i>_PROCESS<j>`: Hostname of process j in process-group i.
* `$GROUP_INDEX`: Group index of the current process.
* `$PROCESS_INDEX`: Process index of the current process within its process-group.

Let's imagine a job with allocation "2:[]".
To let the two processes ping each other, the first process (0) has to execute
```
ping $HOST_GROUP0_PROCESS1
```
and the other (1) has to execute
```
ping $HOST_GROUP0_PROCESS0
```
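
Since direct commands and `.compute` scripts are executed through bash, the same pattern generalizes to any group size. A minimal sketch that pings every peer in the current process-group (the loop is illustrative; only the environment variables listed above are provided by Snakepit):
```
# Resolve the size of our own process-group via bash indirect expansion.
num_var="NUM_PROCESSES_GROUP${GROUP_INDEX}"
for ((j = 0; j < ${!num_var}; j++)); do
    [ "$j" -eq "$PROCESS_INDEX" ] && continue   # skip ourselves
    host_var="HOST_GROUP${GROUP_INDEX}_PROCESS${j}"
    ping -c 1 "${!host_var}"                    # one ping per peer hostname
done
```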

### Accessing data

There are four different data domains in Snakepit.
Jobs have the same read/write rights as their owning users.
Within your `.compute` script or a direct command you can use the following environment variables to access data (see the sketch after this list):

* Shared data: `$SHARED_DIR` - Files in this directory are read-only for everyone and considered public. Only users with direct access to the head-node can change its contents.
* Group data: `$<GROUP-NAME>_GROUP_DIR` - Admins and all members of the given group have read/write access to all contents.
* User data: `$USER_DIR` - Admins and the user themselves have read/write access.
* Job data: `$JOB_DIR` and `$SRC_DIR` (where the `.compute` script is running) - Admins, the owning user and members of the groups specified in the "groups" property of the job have read access. Only the running job is allowed to write data.
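
A minimal sketch of how these variables can be combined in a `.compute` script (all file names and the training script are made up for illustration):
```
# Read a public dataset from the read-only shared area.
cp "$SHARED_DIR/dataset.tar" .
tar -xf dataset.tar
# ... run the actual work ...
./train.sh
# Persist results to the user's read/write area.
cp results.log "$USER_DIR/"
```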

## Known limitations

- No integrated support for Git LFS. Work-around: Commit/push LFS binaries to your remote/origin repository before scheduling a job.
- Problems with binaries. Work-around: Commit/push binaries to your remote/origin repository before scheduling a job.
- File diffs are only done on tracked files. Work-around: `git add <filename>` before scheduling a job, and remove it afterwards if it is not to be pushed to the repo (see the sketch after this list).
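
A minimal sketch of that work-around (the file name is made up for illustration):
```
$ git add config/untracked-settings.json    # include the file in the job diff
$ pit run "Job with extra file" [gtx1070]
$ git reset config/untracked-settings.json  # unstage it if it should not be committed
```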

## Help

1. [**IRC**](https://wiki.mozilla.org/IRC) - You can contact us on the `#machinelearning` channel on [Mozilla IRC](https://wiki.mozilla.org/IRC); people there can try to answer and help.

2. [**Issues**](https://github.com/mozilla/snakepit-client/issues) - If you think you ran into a serious problem, feel free to open an issue in our repo.

```diff
@@ -725,13 +725,13 @@ program
     .on('--help', function() {
         printIntro()
         printExample('pit run "My task" 2:[8:gtx1070]')
-        printExample('pit run "My command" [] \'hostname; env\'')
+        printExample('pit run "My command" [] -d \'hostname; env\'')
         printLine()
         printLine('"title" is a short text that will later help identifying the job and its purpose.')
         printLine('"clusterRequest" is an expression to specify resources this job requires from the cluster.')
         printLine('It\'s a comma separated list of "process requests".')
         printLine('Each "process request" specifies the number of process instances and (divided by colon and in braces) which resources to allocate for one process instances (on one node).')
-        printLine('The example above will allocate 2 process instances. For each process, 8 "gtx1070" resources will get allocated.')
+        printLine('The first example will allocate 2 process instances. For each process, 8 "gtx1070" resources will get allocated.')
         printLine('You can also provide a "' + REQUEST_FILE + '" file with the same content in your project root as default value.')
     })
     .action(function(title, clusterRequest, options) {
```