Mirror of https://github.com/microsoft/pai.git
Merge branch 'folder-refactor' into zhaoyu/port-conflict-after-refactor
Commit
36434e3a77
@@ -44,7 +44,7 @@ matrix:
node_js: 6
env: NODE_ENV=test
before_install:
- cd rest-server
- cd src/rest-server
install:
- npm install
script:

@@ -54,7 +54,7 @@ matrix:
node_js: 7
env: NODE_ENV=test
before_install:
- cd rest-server
- cd src/rest-server
install:
- npm install
script:

@@ -63,7 +63,7 @@ matrix:
- language: node_js
node_js: 6
before_install:
- cd webportal
- cd src/webportal
install:
- npm run yarn install
- npm run build

@@ -72,7 +72,7 @@ matrix:
- language: node_js
node_js: 7
before_install:
- cd webportal
- cd src/webportal
install:
- npm run yarn install
- npm run build
@@ -69,7 +69,7 @@ Before start, you need to meet the following requirements:
### Cluster administration
- [Deployment infrastructure](./docs/pai-management/doc/cluster-bootup.md)
- [Cluster maintenance](https://github.com/Microsoft/pai/wiki/Maintenance-(Service-&-Machine))
- [Monitoring](./webportal/README.md)
- [Monitoring](./docs/webportal/README.md)

## Resources

@@ -8,6 +8,6 @@
### Configuration and API
- [Configuration: customize OpenPAI via its configuration](./pai-management/doc/how-to-write-pai-configuration.md)
- [OpenPAI Programming Guides](../examples/README.md)
- [Restful API Docs](../rest-server/README.md)
- [Restful API Docs](rest-server/API.md)

### [FAQs](./faq.md)
@@ -27,16 +27,16 @@ Build image by using ```pai_build.py``` which put under ``build/``. for the conf
### Build infrastructure services <a name="Service_Build"></a>

```
sudo ./pai_build.py build -c /path/to/configuration-dir/ [ -s component-list ]
./pai_build.py build -c /path/to/configuration-dir/ [ -s component-list ]
```

- Build the corresponding component.
- If the option `-n` is added, only the specified component will be built. By default will build all components under ``src/``
- If the option `-s` is added, only the specified component will be built. By default will build all components under ``src/``

### Push infrastructure image(s) <a name="Image_Push"></a>

```
sudo ./pai_build.py push -c /path/to/configuration-dir/ [ -i image-list ]
./pai_build.py push -c /path/to/configuration-dir/ [ -i image-list ]
```

- tag and push image to the docker registry which is configured in the ```cluster-configuration```.
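For illustration only, a hypothetical end-to-end invocation that builds and then pushes a single component with the commands above; the configuration path and the `rest-server` component name are assumptions for the example, not values taken from this commit:

```
# Sketch only: run from the repository's build/ directory; the configuration
# path and the component name are assumed placeholders.
cd build/

# Build just the rest-server component instead of everything under src/.
./pai_build.py build -c /cluster-configuration -s rest-server

# Tag and push the built image to the docker registry configured in cluster-configuration.
./pai_build.py push -c /cluster-configuration -i rest-server
```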
@@ -135,4 +135,4 @@ popd > /dev/null
# TO-DO

- Incremental build implementation.
- Incremental build implementation.
@@ -28,7 +28,7 @@ User could customize [Kubernetes](https://kubernetes.io/) at OpenPAI's [folder /
User could customize Webportal at OpenPAI's [folder / file](../../webportal/README.md#Configuration)

User could customize Webportal startup configuration at OpenPAI's [folder / file](../bootstrap/webportal/webportal.yaml.template)
User could customize Webportal startup configuration at OpenPAI's [folder / file](../../../src/webportal/deploy/webportal.yaml.template)

## Configure Pylon <a name="pylon"></a>

@@ -44,7 +44,7 @@ User could customize FrameworkLauncher startup configuration at OpenPAI's [folde
## Configure Rest-server <a name="restserver"></a>

User could customize rest server at OpenPAI's [folder / file](../bootstrap/rest-server/rest-server.yaml.template)
User could customize rest server at OpenPAI's [folder / file](../../../src/rest-server/deploy/rest-server.yaml.template)

User could customize rest server startup configuration at OpenPAI's [folder / file](../../../src)
@@ -2,7 +2,7 @@
1. Job config file

Prepare a job config file as described in [examples/README.md](../docs/job_tutorial.md#json-config-file-for-job-submission), for example, `exampleJob.json`.
Prepare a job config file as described in [examples/README.md](../job_tutorial.md#json-config-file-for-job-submission), for example, `exampleJob.json`.

2. Authentication

@@ -54,7 +54,7 @@
## Root URI

Configure the rest server port in [services-configuration.yaml](../cluster-configuration/services-configuration.yaml).
Configure the rest server port in [services-configuration.yaml](../../cluster-configuration/services-configuration.yaml).

## API Details

@@ -444,7 +444,7 @@ Configure the rest server port in [services-configuration.yaml](../cluster-confi
*Parameters*

[job config json](../docs/job_tutorial.md#json-config-file-for-job-submission)
[job config json](../job_tutorial.md#json-config-file-for-job-submission)

*Response if succeeded*
```
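Since the hunks above describe the job-submission endpoint's parameter (a job config JSON) and the test changes later in this commit exercise `POST /api/v1/jobs`, a rough request sketch might look as follows; the host, port, and the bearer-token authentication scheme are assumptions for illustration, not details stated in this diff:

```
# Sketch only: host, port, and the token variable are placeholders.
curl -X POST http://restserver.example.com:9186/api/v1/jobs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d @exampleJob.json
```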
@@ -27,14 +27,14 @@ REST Server exposes a set of interface that allows you to manage jobs.
## Architecture

REST Server is a Node.js API service for PAI that deliver client requests to different upstream
services, including [FrameworkLauncher](../frameworklauncher), Apache Hadoop YARN, WebHDFS and
services, including [FrameworkLauncher](../../src/frameworklauncher), Apache Hadoop YARN, WebHDFS and
etcd, with some request transformation.

## Dependencies

To start a REST Server service, the following services should be ready and correctly configured.

* [FrameworkLauncher](../frameworklauncher)
* [FrameworkLauncher](../../src/frameworklauncher)
* Apache Hadoop YARN
* HDFS
* etcd

@@ -59,7 +59,7 @@ If REST Server is deployed by [pai management tool][pai-management], configurati
If REST Server is deployed manually, the following fields should be configured as environment
variables:

* `LAUNCHER_WEBSERVICE_URI`: URI endpoint of [Framework Launcher](../frameworklauncher)
* `LAUNCHER_WEBSERVICE_URI`: URI endpoint of [Framework Launcher](../../src/frameworklauncher)
* `HDFS_URI`: URI endpoint of HDFS
* `WEBHDFS_URI`: URI endpoint of WebHDFS
* `YARN_URI`: URI endpoint of Apache Hadoop YARN
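A rough sketch of such a manual deployment using the variables above; every endpoint value is a placeholder, and the use of `npm install`/`npm start` as the entry point is an assumption rather than something stated in this commit:

```
# Sketch only: hosts and ports below are assumed placeholders.
export LAUNCHER_WEBSERVICE_URI=http://launcher.example.com:9086
export HDFS_URI=hdfs://namenode.example.com:9000
export WEBHDFS_URI=http://namenode.example.com:50070
export YARN_URI=http://resourcemanager.example.com:8088

cd src/rest-server
npm install
npm start
```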
@@ -134,4 +134,4 @@ Read [API document](./API.md) for the details of REST API.

[pai-management]: ../pai-management
[service-configuration]: ../cluster-configuration/services-configuration.yaml
[service-configuration]: ../../cluster-configuration/services-configuration.yaml
@@ -5,8 +5,8 @@
</p>

The system architecture is illustrated above.
User submits jobs or monitors cluster status through the [Web Portal](../webportal/README.md),
which calls APIs provided by the [REST server](../rest-server/README.md).
User submits jobs or monitors cluster status through the [Web Portal](webportal/README.md),
which calls APIs provided by the [REST server](rest-server/README.md).
Third party tools can also call REST server directly for job management.
Upon receiving API calls, the REST server coordinates with [FrameworkLauncher](../frameworklauncher/README.md) (short for Launcher)
to perform job management.
@@ -10,13 +10,13 @@ An [express](https://expressjs.com/) served, [AdminLTE](https://adminlte.io/) th
## Dependencies

Since [job tutorial](../docs/job_tutorial.md) is included in the document tab, make sure the **`docs`** directory exists as a sibling of the `web-portal` directory.
Since [job tutorial](../job_tutorial.md) is included in the document tab, make sure the **`docs`** directory exists as a sibling of the `web-portal` directory.

To run web portal, the following services should be started, and url of services should be correctly configured:

* [REST Server](../rest-server)
* [Prometheus](../prometheus)
* [Grafana](../grafana)
* [REST Server](../../src/rest-server)
* [Prometheus](../../src/prometheus)
* [Grafana](../../src/grafana)
* YARN
* Kubernetes

@@ -38,7 +38,7 @@ For development
## Configuration

If web portal is deployed within PAI cluster, the following config field could be changed in the `webportal` section in [services-configuration.yaml](../cluster-configuration/services-configuration.yaml) file:
If web portal is deployed within PAI cluster, the following config field could be changed in the `webportal` section in [services-configuration.yaml](../../cluster-configuration/services-configuration.yaml) file:

* `server-port`: Integer. The network port to access the web portal. The default value is 9286.

@@ -46,10 +46,10 @@ If web portal is deployed within PAI cluster, the following config field could b
If web portal is deployed as a standalone service, the following environment variables must be configured:

* `REST_SERVER_URI`: URI of [REST Server](../rest-server)
* `PROMETHEUS_URI`: URI of [Prometheus](../prometheus)
* `REST_SERVER_URI`: URI of [REST Server](../../src/rest-server)
* `PROMETHEUS_URI`: URI of [Prometheus](../../src/prometheus)
* `YARN_WEB_PORTAL_URI`: URI of YARN's web portal
* `GRAFANA_URI`: URI of [Grafana](../grafana)
* `GRAFANA_URI`: URI of [Grafana](../../src/grafana)
* `K8S_DASHBOARD_URI`: URI of Kubernetes' dashboard
* `K8S_API_SERVER_URI`: URI of Kubernetes' api server
* `EXPORTER_PORT`: Port of node exporter
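A minimal sketch of a standalone web portal deployment using the variables above; every URI and port is an invented placeholder, while the two build commands are the ones this commit itself uses in `.travis.yml` and the web portal Dockerfile:

```
# Sketch only: all endpoints below are assumed placeholders.
export REST_SERVER_URI=http://master.example.com:9186
export PROMETHEUS_URI=http://master.example.com:9091
export GRAFANA_URI=http://master.example.com:3000
export YARN_WEB_PORTAL_URI=http://master.example.com:8088
export K8S_DASHBOARD_URI=http://master.example.com:9090
export K8S_API_SERVER_URI=http://master.example.com:8080
export EXPORTER_PORT=9100

cd src/webportal
npm run yarn install   # install dependencies through the repo's yarn wrapper
npm run build          # build the static assets
```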
@@ -101,7 +101,7 @@ To run web portal on system, a [Node.js](https://nodejs.org/) 6+ runtime is requ
### Submit a job

Click the tab "Submit Job" to show a button asking you to select a json file for the submission. The job config file must follow the format shown in [job tutorial](../docs/job_tutorial.md).
Click the tab "Submit Job" to show a button asking you to select a json file for the submission. The job config file must follow the format shown in [job tutorial](../job_tutorial.md).

### View job status
@ -1,80 +0,0 @@
|
|||
# Copyright (c) Microsoft Corporation
|
||||
# All rights reserved.
|
||||
#
|
||||
# MIT License
|
||||
#
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
|
||||
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
|
||||
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
|
||||
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
|
||||
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
|
||||
#
|
||||
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
|
||||
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
||||
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
|
||||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
||||
.git
|
||||
|
||||
# Directory for submitted jobs' json file and scripts
|
||||
frameworklauncher/
|
||||
|
||||
# Logs
|
||||
logs
|
||||
*.log
|
||||
npm-debug.log*
|
||||
yarn-debug.log*
|
||||
yarn-error.log*
|
||||
|
||||
# Runtime data
|
||||
pids
|
||||
*.pid
|
||||
*.seed
|
||||
*.pid.lock
|
||||
|
||||
# Directory for instrumented libs generated by jscoverage/JSCover
|
||||
lib-cov
|
||||
|
||||
# Coverage directory used by tools like istanbul
|
||||
coverage
|
||||
|
||||
# nyc test coverage
|
||||
.nyc_output
|
||||
|
||||
# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
|
||||
.grunt
|
||||
|
||||
# Bower dependency directory (https://bower.io/)
|
||||
bower_components
|
||||
|
||||
# node-waf configuration
|
||||
.lock-wscript
|
||||
|
||||
# Compiled binary addons (https://nodejs.org/api/addons.html)
|
||||
build/Release
|
||||
|
||||
# Dependency directories
|
||||
node_modules/
|
||||
jspm_packages/
|
||||
|
||||
# Typescript v1 declaration files
|
||||
typings/
|
||||
|
||||
# Optional npm cache directory
|
||||
.npm
|
||||
|
||||
# Optional eslint cache
|
||||
.eslintcache
|
||||
|
||||
# Optional REPL history
|
||||
.node_repl_history
|
||||
|
||||
# Output of 'npm pack'
|
||||
*.tgz
|
||||
|
||||
# Yarn Integrity file
|
||||
.yarn-integrity
|
||||
|
||||
# dotenv environment variables file
|
||||
.env
|
|
@ -1,24 +0,0 @@
|
|||
# Copyright (c) Microsoft Corporation
|
||||
# All rights reserved.
|
||||
#
|
||||
# MIT License
|
||||
#
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
|
||||
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
|
||||
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
|
||||
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
|
||||
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
|
||||
#
|
||||
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
|
||||
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
||||
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
|
||||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
||||
copy-list:
|
||||
# created by the prepare hadoop function on docker_build.py
|
||||
- src: src/hadoop-run/hadoop
|
||||
dst: src/rest-server/copied_file
|
||||
- src: ../rest-server
|
||||
dst: src/rest-server/copied_file
|
||||
|
|
@ -1,26 +0,0 @@
|
|||
# Copyright (c) Microsoft Corporation
|
||||
# All rights reserved.
|
||||
#
|
||||
# MIT License
|
||||
#
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
|
||||
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
|
||||
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
|
||||
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
|
||||
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
|
||||
#
|
||||
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
|
||||
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
||||
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
|
||||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
||||
|
||||
copy-list:
|
||||
- src: ../docs
|
||||
dst: src/webportal/copied_file
|
||||
- src: ../examples
|
||||
dst: src/webportal/copied_file
|
||||
- src: ../webportal
|
||||
dst: src/webportal/copied_file
|
||||
|
|
@ -1,165 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
# Copyright (c) Microsoft Corporation
|
||||
# All rights reserved.
|
||||
#
|
||||
# MIT License
|
||||
#
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
|
||||
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
|
||||
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
|
||||
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
|
||||
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
|
||||
#
|
||||
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
|
||||
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
||||
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
|
||||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
||||
|
||||
# Bootstrap script for docker container.
|
||||
|
||||
exec 17>/pai/log/DockerContainerDebug.log
|
||||
BASH_XTRACEFD=17
|
||||
|
||||
function exit_handler()
|
||||
{
|
||||
printf "%s %s\n" \
|
||||
"[DEBUG]" "Docker container exit handler: EXIT signal received in docker container, exiting ..."
|
||||
kill 0
|
||||
}
|
||||
|
||||
set -x
|
||||
PS4="+[\t] "
|
||||
trap exit_handler EXIT
|
||||
|
||||
|
||||
touch "/alive/docker_$PAI_CONTAINER_ID"
|
||||
while /bin/true; do
|
||||
[ $(( $(date +%s) - $(stat -c %Y /alive/yarn_$PAI_CONTAINER_ID) )) -gt 60 ] \
|
||||
&& pkill -9 --ns 1
|
||||
sleep 20
|
||||
done &
|
||||
|
||||
|
||||
export PAI_WORK_DIR="$(pwd)"
|
||||
HDFS_LAUNCHER_PREFIX=$PAI_DEFAULT_FS_URI/Container
|
||||
export CLASSPATH="$(hadoop classpath --glob)"
|
||||
|
||||
task_role_no={{{ idx }}}
|
||||
|
||||
printf "%s %s\n%s\n\n" "[INFO]" "ENV" "$(printenv | sort)"
|
||||
|
||||
mv /pai/code/* ./
|
||||
|
||||
|
||||
function prepare_ssh()
|
||||
{
|
||||
mkdir /root/.ssh
|
||||
sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
|
||||
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
|
||||
}
|
||||
|
||||
function start_ssh_service()
|
||||
{
|
||||
printf "%s %s\n" \
|
||||
"[INFO]" "start ssh service"
|
||||
cat /root/.ssh/$APP_ID.pub >> /root/.ssh/authorized_keys
|
||||
sed -i 's/Port.*/Port '$PAI_CONTAINER_SSH_PORT'/' /etc/ssh/sshd_config
|
||||
echo "sshd:ALL" >> /etc/hosts.allow
|
||||
service ssh restart
|
||||
}
|
||||
|
||||
function hdfs_upload_atomically()
|
||||
{
|
||||
printf "%s %s\n%s %s\n%s %s\n" \
|
||||
"[INFO]" "upload ssh key to hdfs" \
|
||||
"[INFO]" "destination path is ${2}" \
|
||||
"[INFO]" "source path is ${1}"
|
||||
tempFolder=${2}"_temp"
|
||||
if hdfs dfs -test -d $tempFolder ; then
|
||||
printf "%s %s\n" \
|
||||
"[WARNING]" "$tempFolder already exists, overwriting..."
|
||||
hdfs dfs -rm -r $tempFolder
|
||||
fi
|
||||
hdfs dfs -put ${1} $tempFolder
|
||||
hdfs dfs -mv $tempFolder ${2}
|
||||
}
|
||||
|
||||
# Check whether hdfs bianry and ssh exists, if not ignore ssh preparation and start part
|
||||
# Start sshd in docker container
|
||||
if which hdfs && service --status-all 2>&1 | grep -q ssh; then
|
||||
prepare_ssh
|
||||
hdfs_ssh_folder=${HDFS_LAUNCHER_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/${APP_ID}
|
||||
printf "%s %s\n%s %s\n%s %s\n" \
|
||||
"[INFO]" "hdfs_ssh_folder is ${hdfs_ssh_folder}" \
|
||||
"[INFO]" "task_role_no is ${task_role_no}" \
|
||||
"[INFO]" "PAI_TASK_INDEX is ${PAI_TASK_INDEX}"
|
||||
# Let taskRoleNumber=0 and taskindex=0 execute upload ssh files
|
||||
if [ ${task_role_no} -eq 0 ] && [ ${PAI_TASK_INDEX} -eq 0 ]; then
|
||||
printf "%s %s %s\n%s\n" \
|
||||
"[INFO]" "task_role_no:${task_role_no}" "PAI_TASK_INDEX:${PAI_TASK_INDEX}" \
|
||||
"Execute upload key pair ..."
|
||||
ssh-keygen -N '' -t rsa -f ~/.ssh/$APP_ID
|
||||
hdfs dfs -mkdir -p "${hdfs_ssh_folder}"
|
||||
hdfs_upload_atomically "/root/.ssh/" "${hdfs_ssh_folder}/.ssh"
|
||||
else
|
||||
# Waiting for ssh key-pair ready
|
||||
while ! hdfs dfs -test -d ${hdfs_ssh_folder}/.ssh ; do
|
||||
echo "[INFO] waitting for ssh key ready"
|
||||
sleep 10
|
||||
done
|
||||
printf "%s %s\n%s %s\n" \
|
||||
"[INFO]" "ssh key pair ready ..." \
|
||||
"[INFO]" "begin to download ssh key pair from hdfs ..."
|
||||
hdfs dfs -get "${hdfs_ssh_folder}/.ssh/" "/root/"
|
||||
fi
|
||||
chmod 400 ~/.ssh/$APP_ID
|
||||
# Generate ssh connect info file in "PAI_CONTAINER_ID-PAI_CURRENT_CONTAINER_IP-PAI_CONTAINER_SSH_PORT" format on hdfs
|
||||
hdfs dfs -touchz ${hdfs_ssh_folder}/$PAI_CONTAINER_ID-$PAI_CONTAINER_HOST_IP-$PAI_CONTAINER_SSH_PORT
|
||||
|
||||
# Generate ssh config
|
||||
ssh_config_path=${HDFS_LAUNCHER_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/config
|
||||
hdfs dfs -mkdir -p ${ssh_config_path}
|
||||
hdfs dfs -touchz ${ssh_config_path}/$APP_ID+$PAI_CURRENT_TASK_ROLE_NAME+$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX+$PAI_CONTAINER_HOST_IP+$PAI_CONTAINER_SSH_PORT
|
||||
while [ `hdfs dfs -ls $ssh_config_path | grep "/$PAI_JOB_NAME/ssh/config/$APP_ID+" | wc -l` -lt $PAI_JOB_TASK_COUNT ]; do
|
||||
printf "%s %s\n" "[INFO]" "Waiting for ssh service in other containers ..."
|
||||
sleep 10
|
||||
done
|
||||
NodeList=($(hdfs dfs -ls ${ssh_config_path} \
|
||||
| grep "/$PAI_JOB_NAME/ssh/config/$APP_ID+" \
|
||||
| grep -oE "[^/]+$" \
|
||||
| sed -e "s/^$APP_ID+//g" \
|
||||
| sort -n))
|
||||
if [ "${#NodeList[@]}" -ne $PAI_JOB_TASK_COUNT ]; then
|
||||
printf "%s %s\n%s\n%s\n\n" \
|
||||
"[ERROR]" "NodeList" \
|
||||
"${NodeList[@]}" \
|
||||
"ssh services in ${#NodeList[@]} containers are available, not equal to $PAI_JOB_TASK_COUNT, exit ..."
|
||||
exit 2
|
||||
fi
|
||||
for line in "${NodeList[@]}"; do
|
||||
node=(${line//+/ });
|
||||
printf "%s\n %s\n %s\n %s\n %s\n %s\n %s\n" \
|
||||
"Host ${node[0]}-${node[1]}" \
|
||||
"HostName ${node[2]}" \
|
||||
"Port ${node[3]}" \
|
||||
"User root" \
|
||||
"StrictHostKeyChecking no" \
|
||||
"UserKnownHostsFile /dev/null" \
|
||||
"IdentityFile /root/.ssh/$APP_ID" >> /root/.ssh/config
|
||||
done
|
||||
|
||||
# Start ssh service
|
||||
start_ssh_service
|
||||
fi
|
||||
|
||||
# Write env to system-wide environment
|
||||
env | grep -E "^PAI|PATH|PREFIX|JAVA|HADOOP|NVIDIA|CUDA" > /etc/environment
|
||||
|
||||
printf "%s %s\n\n" "[INFO]" "USER COMMAND START"
|
||||
{{{ taskData.command }}} || exit $?
|
||||
printf "\n%s %s\n\n" "[INFO]" "USER COMMAND END"
|
||||
|
||||
exit 0
|
|
@ -17,7 +17,12 @@
|
|||
|
||||
FROM base-image
|
||||
|
||||
RUN wget https://download.docker.com/linux/static/stable/x86_64/docker-17.06.2-ce.tgz && \
|
||||
tar xzvf docker-17.06.2-ce.tgz && \
|
||||
mv docker/* /usr/bin/ && \
|
||||
rm docker-17.06.2-ce.tgz
|
||||
|
||||
COPY build/start.sh /usr/local/start.sh
|
||||
RUN chmod a+x /usr/local/start.sh
|
||||
|
||||
CMD ["/usr/local/start.sh"]
|
||||
CMD ["/usr/local/start.sh"]
|
||||
|
|
|
@ -22,13 +22,14 @@ pushd $(dirname "$0") > /dev/null
|
|||
hadoopBinaryDir="/hadoop-binary/"
|
||||
|
||||
hadoopBinaryPath="${hadoopBinaryDir}hadoop-2.9.0.tar.gz"
|
||||
cacheVersion="${hadoopBinaryDir}12932984-12933562-done"
|
||||
cacheVersion="${hadoopBinaryDir}12932984-12933562-docker_executor-done"
|
||||
|
||||
|
||||
echo "hadoopbinarypath:${hadoopBinaryDir}"
|
||||
|
||||
[[ -f $cacheVersion ]] &&
|
||||
{
|
||||
echo "Hadoop ai with patch 12932984-12933562 has been built"
|
||||
echo "Hadoop ai with patch 12932984-12933562-docker_executor has been built"
|
||||
echo "Skip this build precess"
|
||||
exit 0
|
||||
}
|
||||
|
|
|
@ -31,9 +31,11 @@ git checkout branch-2.9.0
|
|||
|
||||
cp /hadoop-2.9.0.gpu-port.patch /hadoop
|
||||
cp /HDFS-13773.patch /hadoop
|
||||
cp /docker-executor.patch /hadoop
|
||||
|
||||
git apply hadoop-2.9.0.gpu-port.patch
|
||||
git apply HDFS-13773.patch
|
||||
git apply docker-executor.patch
|
||||
|
||||
mvn package -Pdist,native -DskipTests -Dmaven.javadoc.skip=true -Dtar
|
||||
|
||||
|
@ -44,4 +46,5 @@ echo "Successfully build hadoop 2.9.0 AI"
|
|||
|
||||
|
||||
# When Changing the patch id, please update the filename here.
|
||||
touch /hadoop-binary/12932984-12933562-done
|
||||
rm /hadoop-binary/*-done
|
||||
touch /hadoop-binary/12932984-12933562-docker_executor-done
|
||||
|
|
|
@ -0,0 +1,123 @@
|
|||
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
|
||||
index 96f6c57..1b89e90 100644
|
||||
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
|
||||
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
|
||||
@@ -1544,6 +1544,14 @@ public static boolean isAclEnabled(Configuration conf) {
|
||||
public static final String NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME =
|
||||
NM_PREFIX + "docker-container-executor.image-name";
|
||||
|
||||
+ /** The Docker run option(For DockerContainerExecutor).*/
|
||||
+ public static final String NM_DOCKER_CONTAINER_EXECUTOR_EXEC_OPTION =
|
||||
+ NM_PREFIX + "docker-container-executor.exec-option";
|
||||
+
|
||||
+ /** The command before launch script(For DockerContainerExecutor).*/
|
||||
+ public static final String NM_DOCKER_CONTAINER_EXECUTOR_SCRIPT_COMMAND =
|
||||
+ NM_PREFIX + "docker-container-executor.script-command";
|
||||
+
|
||||
/** The name of the docker executor (For DockerContainerExecutor).*/
|
||||
public static final String NM_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME =
|
||||
NM_PREFIX + "docker-container-executor.exec-name";
|
||||
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
|
||||
index a044cb6..819c496 100644
|
||||
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
|
||||
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
|
||||
@@ -98,7 +98,7 @@
|
||||
//containername:0.1 or
|
||||
//containername
|
||||
public static final String DOCKER_IMAGE_PATTERN =
|
||||
- "^(([\\w\\.-]+)(:\\d+)*\\/)?[\\w\\.:-]+$";
|
||||
+ "^(([\\w\\.-]+)(:\\d+)*\\/)?([\\w\\.-]+\\/)?[\\w\\.:-]+$";
|
||||
|
||||
private final FileContext lfs;
|
||||
private final Pattern dockerImagePattern;
|
||||
@@ -127,7 +127,12 @@ public void init() throws IOException {
|
||||
String dockerExecutor = getConf().get(
|
||||
YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME,
|
||||
YarnConfiguration.NM_DEFAULT_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME);
|
||||
- if (!new File(dockerExecutor).exists()) {
|
||||
+ // /use/bin/docker -H=tcp://0.0.0.0:xx is also a valid docker executor
|
||||
+ String[] arr = dockerExecutor.split("\\s");
|
||||
+ if (LOG.isDebugEnabled()) {
|
||||
+ LOG.debug("dockerExecutor: " + dockerExecutor);
|
||||
+ }
|
||||
+ if (!new File(arr[0]).exists()) {
|
||||
throw new IllegalStateException(
|
||||
"Invalid docker exec path: " + dockerExecutor);
|
||||
}
|
||||
@@ -181,8 +186,11 @@ public int launchContainer(ContainerStartContext ctx) throws IOException {
|
||||
|
||||
//Variables for the launch environment can be injected from the command-line
|
||||
//while submitting the application
|
||||
- String containerImageName = container.getLaunchContext().getEnvironment()
|
||||
- .get(YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME);
|
||||
+ //modify get image from configuration rather than env
|
||||
+ String containerImageName = getConf().get(
|
||||
+ YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME);
|
||||
+
|
||||
+ //
|
||||
if (LOG.isDebugEnabled()) {
|
||||
LOG.debug("containerImageName from launchContext: " + containerImageName);
|
||||
}
|
||||
@@ -240,19 +248,27 @@ public int launchContainer(ContainerStartContext ctx) throws IOException {
|
||||
//--net=host allows the container to take on the host's network stack
|
||||
//--name sets the Docker Container name to the YARN containerId string
|
||||
//-v is used to bind mount volumes for local, log and work dirs.
|
||||
+ //-w sets the work dir inside the container
|
||||
+ //add docker option
|
||||
+ String dockerOption = getConf().get(
|
||||
+ YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_EXEC_OPTION);
|
||||
String commandStr = commands.append(dockerExecutor)
|
||||
.append(" ")
|
||||
.append("run")
|
||||
.append(" ")
|
||||
- .append("--rm --net=host")
|
||||
+ .append("--rm --net=host --pid=host --privileged=true")
|
||||
+ .append(" ")
|
||||
+ .append("-w " + containerWorkDir.toUri().getPath().toString())
|
||||
+ .append(" ")
|
||||
+ .append(dockerOption)
|
||||
.append(" ")
|
||||
.append(" --name " + containerIdStr)
|
||||
- .append(localDirMount)
|
||||
- .append(logDirMount)
|
||||
- .append(containerWorkDirMount)
|
||||
.append(" ")
|
||||
.append(containerImageName)
|
||||
.toString();
|
||||
+ if (LOG.isDebugEnabled()) {
|
||||
+ LOG.debug("Docker run command: " + commandStr);
|
||||
+ }
|
||||
//Get the pid of the process which has been launched as a docker container
|
||||
//using docker inspect
|
||||
String dockerPidScript = "`" + dockerExecutor +
|
||||
@@ -597,13 +613,28 @@ private void writeSessionScript(Path launchDst, Path pidFile)
|
||||
// We need to do a move as writing to a file is not atomic
|
||||
// Process reading a file being written to may get garbled data
|
||||
// hence write pid to tmp file first followed by a mv
|
||||
+ // Move dockerpid command to backend, avoid blocking docker run command
|
||||
+ // need to improve it with publisher mode
|
||||
+ // Ref: https://issues.apache.org/jira/browse/YARN-3080
|
||||
pout.println("#!/usr/bin/env bash");
|
||||
pout.println();
|
||||
+ pout.println("{");
|
||||
+ pout.println("n=10");
|
||||
+ pout.println("while [ $n -gt 0 ]; do");
|
||||
+ pout.println("let n=$n-1");
|
||||
+ pout.println("sleep 5");
|
||||
pout.println("echo "+ dockerPidScript +" > " + pidFile.toString()
|
||||
+ ".tmp");
|
||||
+ pout.println("[ -n \"$(cat \"" + pidFile.toString()
|
||||
+ + ".tmp\")\" ] && break");
|
||||
+ pout.println("done");
|
||||
pout.println("/bin/mv -f " + pidFile.toString() + ".tmp " + pidFile);
|
||||
- pout.println(dockerCommand + " bash \"" +
|
||||
- launchDst.toUri().getPath().toString() + "\"");
|
||||
+ pout.println("} &");
|
||||
+ //Add exec command before launch_script.
|
||||
+ String scriptCommand = getConf().get(
|
||||
+ YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_SCRIPT_COMMAND);
|
||||
+ pout.println(dockerCommand + " bash -c '" + scriptCommand + " && bash \"" +
|
||||
+ launchDst.toUri().getPath().toString() + "\"'");
|
||||
} finally {
|
||||
IOUtils.cleanupWithLogger(LOG, pout, out);
|
||||
}
|
|
@ -73,8 +73,10 @@ RUN wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.
|
|||
## The build environment of hadoop has been prepared above.
|
||||
## Copy your build script here. Default script will build our hadoop-ai.
|
||||
|
||||
COPY docker-executor.patch /
|
||||
|
||||
COPY build.sh /
|
||||
|
||||
RUN chmod u+x build.sh
|
||||
|
||||
CMD ["/build.sh"]
|
||||
CMD ["/build.sh"]
|
||||
|
|
|
@ -17,6 +17,18 @@
|
|||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
||||
# Clean running job
|
||||
|
||||
if which docker > /dev/null && [ -S /var/run/docker.sock ]; then
|
||||
|
||||
echo "Clean hadoop jobs"
|
||||
|
||||
docker ps | awk '/container_\w{3}_[0-9]{13}_[0-9]{4}_[0-9]{2}_[0-9]{6}/ { print $NF}' | xargs timeout 30 docker stop || \
|
||||
docker ps | awk '/container_\w{3}_[0-9]{13}_[0-9]{4}_[0-9]{2}_[0-9]{6}/ { print $NF}' | xargs docker kill
|
||||
fi
|
||||
|
||||
|
||||
# Clean data
|
||||
|
||||
echo "Clean the hadoop node manager's data on the disk"
|
||||
|
|
@ -0,0 +1,20 @@
|
|||
<!--
|
||||
Copyright (c) Microsoft Corporation
|
||||
All rights reserved.
|
||||
|
||||
MIT License
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
|
||||
documentation files (the "Software"), to deal in the Software without restriction, including without limitation
|
||||
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
|
||||
to permit persons to whom the Software is furnished to do so, subject to the following conditions:
|
||||
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
|
||||
BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
||||
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
|
||||
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
-->
|
||||
|
||||
See [README.md](../../docs/rest-server/README.md)
|
|
@ -23,6 +23,7 @@ RUN echo "deb http://http.debian.net/debian jessie-backports main" > \
|
|||
apt-get install -y --no-install-recommends -t \
|
||||
jessie-backports \
|
||||
dos2unix \
|
||||
openssh-server \
|
||||
&& \
|
||||
apt-get clean && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
@ -32,11 +33,11 @@ WORKDIR /usr/src/app
|
|||
ENV NODE_ENV=production \
|
||||
SERVER_PORT=8080
|
||||
|
||||
COPY copied_file/rest-server/package.json .
|
||||
COPY package.json ./
|
||||
|
||||
RUN npm install
|
||||
|
||||
COPY copied_file/rest-server/ .
|
||||
COPY . .
|
||||
|
||||
RUN dos2unix src/templates/*
|
||||
|
pai-management/bootstrap/rest-server/start.sh → src/rest-server/deploy/start.sh
Executable file → Normal file
|
@ -1,6 +1,6 @@
|
|||
{
|
||||
"name": "pai-rest-server",
|
||||
"version": "0.1.0",
|
||||
"version": "0.8.0",
|
||||
"description": "RESTful api server for Microsoft Platform for AI",
|
||||
"keywords": [
|
||||
"REST",
|
||||
|
@ -49,7 +49,8 @@
|
|||
"nyc": "11.6.0",
|
||||
"statuses": "1.5.0",
|
||||
"unirest": "0.5.1",
|
||||
"winston": "2.4.0"
|
||||
"winston": "2.4.0",
|
||||
"ssh-keygen": "0.4.2"
|
||||
},
|
||||
"scripts": {
|
||||
"coveralls": "nyc report --reporter=text-lcov | coveralls ..",
|
|
@ -20,11 +20,13 @@
|
|||
const async = require('async');
|
||||
const unirest = require('unirest');
|
||||
const mustache = require('mustache');
|
||||
const keygen = require('ssh-keygen');
|
||||
const launcherConfig = require('../config/launcher');
|
||||
const userModel = require('./user');
|
||||
const yarnContainerScriptTemplate = require('../templates/yarnContainerScript');
|
||||
const dockerContainerScriptTemplate = require('../templates/dockerContainerScript');
|
||||
const createError = require('../util/error');
|
||||
const logger = require('../config/logger');
|
||||
|
||||
const Hdfs = require('../util/hdfs');
|
||||
|
||||
|
@ -232,30 +234,51 @@ class Job {
|
|||
hdfs.list(
|
||||
folderPathPrefix,
|
||||
null,
|
||||
(error, result) => {
|
||||
(error, connectInfo) => {
|
||||
if (!error) {
|
||||
let sshInfo = {
|
||||
'containers': [],
|
||||
'keyPair': {
|
||||
'folderPath': `${launcherConfig.hdfsUri}${folderPathPrefix}/.ssh/`,
|
||||
'publicKeyFileName': `${applicationId}.pub`,
|
||||
'privateKeyFileName': `${applicationId}`,
|
||||
'privateKeyDirectDownloadLink':
|
||||
`${launcherConfig.webhdfsUri}/webhdfs/v1${folderPathPrefix}/.ssh/${applicationId}?op=OPEN`,
|
||||
},
|
||||
};
|
||||
for (let x of result.content.FileStatuses.FileStatus) {
|
||||
let pattern = /^container_(.*)-(.*)-(.*)$/g;
|
||||
let arr = pattern.exec(x.pathSuffix);
|
||||
if (arr !== null) {
|
||||
sshInfo.containers.push({
|
||||
'id': 'container_' + arr[1],
|
||||
'sshIp': arr[2],
|
||||
'sshPort': arr[3],
|
||||
});
|
||||
}
|
||||
}
|
||||
next(null, sshInfo);
|
||||
let latestKeyFilePath = `/Container/${userName}/${jobName}/ssh/keyFiles`;
|
||||
let sshInfo = {};
|
||||
// Handle backward compatibility
|
||||
hdfs.list(latestKeyFilePath,
|
||||
null,
|
||||
(error, result) => {
|
||||
if (!error) {
|
||||
sshInfo = {
|
||||
'containers': [],
|
||||
'keyPair': {
|
||||
'folderPath': `${launcherConfig.hdfsUri}${latestKeyFilePath}`,
|
||||
'publicKeyFileName': `${jobName}.pub`,
|
||||
'privateKeyFileName': `${jobName}`,
|
||||
'privateKeyDirectDownloadLink':
|
||||
`${launcherConfig.webhdfsUri}/webhdfs/v1${latestKeyFilePath}/${jobName}?op=OPEN`,
|
||||
},
|
||||
};
|
||||
} else {
|
||||
// older pattern is ${launcherConfig.hdfsUri}${folderPathPrefix}/.ssh/
|
||||
sshInfo = {
|
||||
'containers': [],
|
||||
'keyPair': {
|
||||
'folderPath': `${launcherConfig.hdfsUri}${folderPathPrefix}/.ssh/`,
|
||||
'publicKeyFileName': `${applicationId}.pub`,
|
||||
'privateKeyFileName': `${applicationId}`,
|
||||
'privateKeyDirectDownloadLink':
|
||||
`${launcherConfig.webhdfsUri}/webhdfs/v1${folderPathPrefix}/.ssh/${applicationId}?op=OPEN`,
|
||||
},
|
||||
};
|
||||
}
|
||||
for (let x of connectInfo.content.FileStatuses.FileStatus) {
|
||||
let pattern = /^container_(.*)-(.*)-(.*)$/g;
|
||||
let arr = pattern.exec(x.pathSuffix);
|
||||
if (arr !== null) {
|
||||
sshInfo.containers.push({
|
||||
'id': 'container_' + arr[1],
|
||||
'sshIp': arr[2],
|
||||
'sshPort': arr[3],
|
||||
});
|
||||
}
|
||||
}
|
||||
next(null, sshInfo);
|
||||
});
|
||||
} else {
|
||||
next(error);
|
||||
}
|
||||
|
@ -367,6 +390,7 @@ class Job {
|
|||
'hdfsUri': launcherConfig.hdfsUri,
|
||||
'taskData': data.taskRoles[idx],
|
||||
'jobData': data,
|
||||
'webHdfsUri': launcherConfig.webhdfsUri,
|
||||
});
|
||||
return dockerContainerScript;
|
||||
}
|
||||
|
@ -432,6 +456,21 @@ class Job {
|
|||
return frameworkDescription;
|
||||
}
|
||||
|
||||
generateSshKeyFiles(name, next) {
|
||||
keygen({
|
||||
location: name,
|
||||
read: true,
|
||||
destroy: true,
|
||||
}, function(err, out) {
|
||||
if (err) {
|
||||
next(err);
|
||||
} else {
|
||||
let sshKeyFiles = [{'content': out.pubKey, 'fileName': name+'.pub'}, {'content': out.key, 'fileName': name}];
|
||||
next(null, sshKeyFiles);
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
_initializeJobContextRootFolders(next) {
|
||||
const hdfs = new Hdfs(launcherConfig.webhdfsUri);
|
||||
async.parallel([
|
||||
|
@ -535,6 +574,26 @@ class Job {
|
|||
}
|
||||
);
|
||||
},
|
||||
(parallelCallback) => {
|
||||
this.generateSshKeyFiles(name, (error, sshKeyFiles) => {
|
||||
if (error) {
|
||||
logger.error('Generated ssh key files failed');
|
||||
} else {
|
||||
async.each(sshKeyFiles, (file, eachCallback) => {
|
||||
hdfs.createFile(
|
||||
`/Container/${data.userName}/${name}/ssh/keyFiles/${file.fileName}`,
|
||||
file.content,
|
||||
{'user.name': data.userName, 'permission': '775', 'overwrite': 'true'},
|
||||
(error, result) => {
|
||||
eachCallback(error);
|
||||
}
|
||||
);
|
||||
}, (error) => {
|
||||
parallelCallback(error);
|
||||
});
|
||||
}
|
||||
});
|
||||
},
|
||||
], (parallelError) => {
|
||||
return next(parallelError);
|
||||
});
|
|
@ -0,0 +1,190 @@
|
|||
#!/bin/bash
|
||||
|
||||
# Copyright (c) Microsoft Corporation
|
||||
# All rights reserved.
|
||||
#
|
||||
# MIT License
|
||||
#
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
|
||||
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
|
||||
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
|
||||
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
|
||||
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
|
||||
#
|
||||
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
|
||||
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
||||
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
|
||||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
||||
|
||||
# Bootstrap script for docker container.
|
||||
|
||||
exec 17>/pai/log/DockerContainerDebug.log
|
||||
BASH_XTRACEFD=17
|
||||
|
||||
function exit_handler()
|
||||
{
|
||||
printf "%s %s\n" \
|
||||
"[DEBUG]" "Docker container exit handler: EXIT signal received in docker container, exiting ..."
|
||||
kill 0
|
||||
}
|
||||
|
||||
set -x
|
||||
PS4="+[\t] "
|
||||
trap exit_handler EXIT
|
||||
|
||||
|
||||
touch "/alive/docker_$PAI_CONTAINER_ID"
|
||||
while /bin/true; do
|
||||
[ $(( $(date +%s) - $(stat -c %Y /alive/yarn_$PAI_CONTAINER_ID) )) -gt 60 ] \
|
||||
&& pkill -9 --ns 1
|
||||
sleep 20
|
||||
done &
|
||||
|
||||
|
||||
export PAI_WORK_DIR="$(pwd)"
|
||||
PAI_WEB_HDFS_PREFIX={{{ webHdfsUri }}}/webhdfs/v1/Container
|
||||
HDFS_LAUNCHER_PREFIX=$PAI_DEFAULT_FS_URI/Container
|
||||
export CLASSPATH="$(hadoop classpath --glob)"
|
||||
|
||||
task_role_no={{{ idx }}}
|
||||
|
||||
printf "%s %s\n%s\n\n" "[INFO]" "ENV" "$(printenv | sort)"
|
||||
|
||||
mv /pai/code/* ./
|
||||
|
||||
function webhdfs_create_file()
|
||||
{
|
||||
webHdfsRequestPath=${1}"?user.name="{{{ jobData.userName }}}"&op=CREATE"
|
||||
redirectResponse=$(curl -i -X PUT ${webHdfsRequestPath} -o /dev/null -w %{redirect_url}' '%{http_code})
|
||||
redirectCode=$(cut -d ' ' -f 2 <<< ${redirectResponse})
|
||||
if [[ ${redirectCode} = "307" ]]; then
|
||||
redirectUri=$(cut -d ' ' -f 1 <<< ${redirectResponse})
|
||||
createResponse=$(curl -i -S -X PUT ${redirectUri})
|
||||
else
|
||||
printf "%s %s\n %s %s\n %s %s\n" \
|
||||
"[WARNING]" "Webhdfs creates folder failed" \
|
||||
"Folder Path:" ${webHdfsRequestPath} \
|
||||
"Response code:" ${redirectCode}
|
||||
fi
|
||||
}
|
||||
|
||||
function webhdfs_download_file()
|
||||
{
|
||||
webHdfsRequestPath=${1}"?user.name="{{{ jobData.userName }}}"&op=OPEN"
|
||||
localPath=${2}
|
||||
downloadResponse=$(curl -S -L ${webHdfsRequestPath} -o ${localPath} -w %{http_code})
|
||||
if [[ ${downloadResponse} = "200" ]]; then
|
||||
printf "%s %s\n" \
|
||||
"[INFO]" "Webhdfs downloads file succeed"
|
||||
else
|
||||
printf "%s %s\n" \
|
||||
"[WARNING]" "Webhdfs downloads file failed"
|
||||
fi
|
||||
}
|
||||
|
||||
function prepare_ssh()
|
||||
{
|
||||
mkdir /root/.ssh
|
||||
sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
|
||||
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
|
||||
}
|
||||
|
||||
function start_ssh_service()
|
||||
{
|
||||
printf "%s %s\n" \
|
||||
"[INFO]" "start ssh service"
|
||||
cat /root/.ssh/{{{ jobData.jobName }}}.pub >> /root/.ssh/authorized_keys
|
||||
sed -i 's/Port.*/Port '$PAI_CONTAINER_SSH_PORT'/' /etc/ssh/sshd_config
|
||||
echo "sshd:ALL" >> /etc/hosts.allow
|
||||
service ssh restart
|
||||
}
|
||||
|
||||
function get_ssh_key_files()
|
||||
{
|
||||
info_source="webhdfs"
|
||||
localKeyPath=/root/.ssh/{{{ jobData.jobName }}}.pub
|
||||
|
||||
if [[ -f $localKeyPath ]]; then
|
||||
rm -f $localKeyPath
|
||||
fi
|
||||
|
||||
if [[ "$info_source" = "webhdfs" ]]; then
|
||||
webHdfsKeyPath=${PAI_WEB_HDFS_PREFIX}/{{{ jobData.userName }}}/{{{ jobData.jobName }}}/ssh/keyFiles/{{{ jobData.jobName }}}.pub
|
||||
webhdfs_download_file $webHdfsKeyPath $localKeyPath
|
||||
else
|
||||
printf "%s %s\n" \
|
||||
"[WARNING]" "Get another key store way"
|
||||
fi
|
||||
}
|
||||
|
||||
function generate_ssh_connect_info()
|
||||
{
|
||||
info_source="webhdfs"
|
||||
destFileName=${1}
|
||||
|
||||
if [[ "$info_source" = "webhdfs" ]]; then
|
||||
webHdfsRequestPath=$destFileName
|
||||
webhdfs_create_file $webHdfsRequestPath
|
||||
else
|
||||
printf "%s %s\n" \
|
||||
"[WARNING]" "Get another key store way"
|
||||
fi
|
||||
}
|
||||
|
||||
# Check whether hdfs bianry and ssh exists, if not ignore ssh preparation and start part
|
||||
# Start sshd in docker container
|
||||
if service --status-all 2>&1 | grep -q ssh; then
|
||||
prepare_ssh
|
||||
get_ssh_key_files
|
||||
sshConnectInfoFolder=${PAI_WEB_HDFS_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/$APP_ID
|
||||
# Generate ssh connect info file in "PAI_CONTAINER_ID-PAI_CURRENT_CONTAINER_IP-PAI_CONTAINER_SSH_PORT" format on hdfs
|
||||
destFilePath=${sshConnectInfoFolder}/$PAI_CONTAINER_ID-$PAI_CONTAINER_HOST_IP-$PAI_CONTAINER_SSH_PORT
|
||||
generate_ssh_connect_info ${destFilePath}
|
||||
|
||||
# Generate ssh config for MPI job
|
||||
if which hdfs; then
|
||||
ssh_config_path=${HDFS_LAUNCHER_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/config
|
||||
hdfs dfs -mkdir -p ${ssh_config_path}
|
||||
hdfs dfs -touchz ${ssh_config_path}/$APP_ID+$PAI_CURRENT_TASK_ROLE_NAME+$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX+$PAI_CONTAINER_HOST_IP+$PAI_CONTAINER_SSH_PORT
|
||||
while [ `hdfs dfs -ls $ssh_config_path | grep "/$PAI_JOB_NAME/ssh/config/$APP_ID+" | wc -l` -lt $PAI_JOB_TASK_COUNT ]; do
|
||||
printf "%s %s\n" "[INFO]" "Waiting for ssh service in other containers ..."
|
||||
sleep 10
|
||||
done
|
||||
NodeList=($(hdfs dfs -ls ${ssh_config_path} \
|
||||
| grep "/$PAI_JOB_NAME/ssh/config/$APP_ID+" \
|
||||
| grep -oE "[^/]+$" \
|
||||
| sed -e "s/^$APP_ID+//g" \
|
||||
| sort -n))
|
||||
if [ "${#NodeList[@]}" -ne $PAI_JOB_TASK_COUNT ]; then
|
||||
printf "%s %s\n%s\n%s\n\n" \
|
||||
"[ERROR]" "NodeList" \
|
||||
"${NodeList[@]}" \
|
||||
"ssh services in ${#NodeList[@]} containers are available, not equal to $PAI_JOB_TASK_COUNT, exit ..."
|
||||
exit 2
|
||||
fi
|
||||
for line in "${NodeList[@]}"; do
|
||||
node=(${line//+/ });
|
||||
printf "%s\n %s\n %s\n %s\n %s\n %s\n %s\n" \
|
||||
"Host ${node[0]}-${node[1]}" \
|
||||
"HostName ${node[2]}" \
|
||||
"Port ${node[3]}" \
|
||||
"User root" \
|
||||
"StrictHostKeyChecking no" \
|
||||
"UserKnownHostsFile /dev/null" \
|
||||
"IdentityFile /root/.ssh/$APP_ID" >> /root/.ssh/config
|
||||
done
|
||||
fi
|
||||
# Start ssh service
|
||||
start_ssh_service
|
||||
fi
|
||||
|
||||
# Write env to system-wide environment
|
||||
env | grep -E "^PAI|PATH|PREFIX|JAVA|HADOOP|NVIDIA|CUDA" > /etc/environment
|
||||
|
||||
printf "%s %s\n\n" "[INFO]" "USER COMMAND START"
|
||||
{{{ taskData.command }}} || exit $?
|
||||
printf "\n%s %s\n\n" "[INFO]" "USER COMMAND END"
|
||||
|
||||
exit 0
|
|
@ -83,6 +83,20 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
|
|||
)
|
||||
);
|
||||
|
||||
nock(launcherWebserviceUri)
|
||||
.get('/v1/Frameworks/job6')
|
||||
.reply(
|
||||
200,
|
||||
mustache.render(
|
||||
frameworkDetailTemplate,
|
||||
{
|
||||
'frameworkName': 'job6',
|
||||
'userName': 'test',
|
||||
'applicationId': 'app6',
|
||||
}
|
||||
)
|
||||
);
|
||||
|
||||
//
|
||||
// Mock WebHDFS
|
||||
//
|
||||
|
@ -120,6 +134,46 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
|
|||
},
|
||||
}
|
||||
);
|
||||
|
||||
nock(webhdfsUri)
|
||||
.get('/webhdfs/v1/Container/test/job6/ssh/app6?op=LISTSTATUS')
|
||||
.reply(
|
||||
200,
|
||||
{
|
||||
'FileStatuses': {
|
||||
'FileStatus': [
|
||||
{
|
||||
'pathSuffix': 'container_1519960554030_0046_01_000002-10.240.0.15-39035',
|
||||
},
|
||||
{
|
||||
'pathSuffix': 'container_1519960554030_0046_01_000003-10.240.0.17-28730',
|
||||
},
|
||||
{
|
||||
'pathSuffix': 'container_1519960554030_0046_01_000004-10.240.0.16-30690',
|
||||
},
|
||||
],
|
||||
},
|
||||
}
|
||||
);
|
||||
|
||||
nock(webhdfsUri)
|
||||
.get('/webhdfs/v1/Container/test/job6/ssh/keyFiles?op=LISTSTATUS')
|
||||
.reply(
|
||||
200,
|
||||
{
|
||||
'FileStatuses': {
|
||||
'FileStatus': [
|
||||
{
|
||||
'pathSuffix': 'job6.pub',
|
||||
},
|
||||
{
|
||||
'pathSuffix': 'job6',
|
||||
},
|
||||
],
|
||||
},
|
||||
}
|
||||
);
|
||||
|
||||
});
|
||||
|
||||
//
|
||||
|
@ -137,11 +191,22 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
|
|||
});
|
||||
});
|
||||
|
||||
it('Case 2 (Positive): Ssh info stored in new pattern will get info succeed.', (done) => {
|
||||
chai.request(server)
|
||||
.get('/api/v1/jobs/job6/ssh')
|
||||
.end((err, res) => {
|
||||
expect(res, 'status code').to.have.status(200);
|
||||
expect(res, 'response format').be.json;
|
||||
expect(JSON.stringify(res.body), 'response body content').include('keyPair');
|
||||
done();
|
||||
});
|
||||
});
|
||||
|
||||
//
|
||||
// Negative cases
|
||||
//
|
||||
|
||||
it('Case 2 (Negative): The job does not exist at all.', (done) => {
|
||||
it('Case 3 (Negative): The job does not exist at all.', (done) => {
|
||||
chai.request(server)
|
||||
.get('/api/v1/jobs/job2/ssh')
|
||||
.end((err, res) => {
|
||||
|
@ -151,7 +216,7 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
|
|||
});
|
||||
});
|
||||
|
||||
it('Case 3 (Negative): The job exists, but does not contain SSH info.', (done) => {
|
||||
it('Case 4 (Negative): The job exists, but does not contain SSH info.', (done) => {
|
||||
chai.request(server)
|
||||
.get('/api/v1/jobs/job3/ssh')
|
||||
.end((err, res) => {
|
||||
|
@ -161,7 +226,7 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
|
|||
});
|
||||
});
|
||||
|
||||
it('Case 4 (Negative): Cannot connect to Launcher.', (done) => {
|
||||
it('Case 5 (Negative): Cannot connect to Launcher.', (done) => {
|
||||
chai.request(server)
|
||||
.get('/api/v1/jobs/job4/ssh')
|
||||
.end((err, res) => {
|
||||
|
@ -171,7 +236,7 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
|
|||
});
|
||||
});
|
||||
|
||||
it('Case 5 (Negative): Cannot connect to WebHDFS.', (done) => {
|
||||
it('Case 6 (Negative): Cannot connect to WebHDFS.', (done) => {
|
||||
chai.request(server)
|
||||
.get('/api/v1/jobs/job5/ssh')
|
||||
.end((err, res) => {
|
||||
|
@ -181,4 +246,3 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
|
|||
});
|
||||
});
|
||||
});
|
||||
|
|
@ -66,7 +66,7 @@ describe('Submit job: POST /api/v1/jobs', () => {
|
|||
);
|
||||
global.nock(global.webhdfsUri)
|
||||
.put(/op=CREATE/)
|
||||
.times(4)
|
||||
.times(6)
|
||||
.reply(
|
||||
201,
|
||||
{}
|
|
@ -19,4 +19,4 @@ FROM python:2.7
|
|||
|
||||
RUN pip install PyYAML requests paramiko prometheus_client
|
||||
|
||||
COPY copied_file/exporter/watchdog.py /
|
||||
COPY src/watchdog.py /
|
|
@ -22,5 +22,4 @@ pushd $(dirname "$0") > /dev/null
|
|||
echo "Call stop script to stop all service first"
|
||||
/bin/bash stop.sh || exit $?
|
||||
|
||||
|
||||
popd > /dev/null
|
||||
popd > /dev/null
|
|
@ -17,18 +17,16 @@
|
|||
|
||||
prerequisite:
|
||||
- cluster-configuration
|
||||
- drivers
|
||||
|
||||
template-list:
|
||||
- watchdog-configmap.yaml
|
||||
- watchdog.yaml
|
||||
- refresh.sh
|
||||
|
||||
start-script: start.sh
|
||||
stop-script: stop.sh
|
||||
delete-script: delete.sh
|
||||
refresh-script: refresh.sh
|
||||
upgraded-script: upgraded.sh
|
||||
|
||||
|
||||
deploy-rules:
|
||||
in: pai-master
|
||||
in: pai-master
|
|
@ -19,11 +19,6 @@
|
|||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
||||
INSTANCES="daemonset/watchdog
|
||||
deployment/watchdog
|
||||
configmap/watchdog
|
||||
"
|
||||
|
||||
for instance in ${INSTANCES}; do
|
||||
kubectl delete --ignore-not-found --now ${instance}
|
||||
done
|
||||
kubectl delete --ignore-not-found --now daemonset/watchdog
|
||||
kubectl delete --ignore-not-found --now deployment/watchdog
|
||||
kubectl delete --ignore-not-found --now configmap/watchdog
|
|
@ -0,0 +1 @@
|
|||
See [README.md](../../docs/webportal/README.md)
|
|
@ -1,3 +1,5 @@
|
|||
#!/bin/bash
|
||||
|
||||
# Copyright (c) Microsoft Corporation
|
||||
# All rights reserved.
|
||||
#
|
||||
|
@ -15,6 +17,9 @@
|
|||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
||||
copy-list:
|
||||
- src: ../prometheus/exporter
|
||||
dst: src/watchdog/copied_file
|
||||
pushd $(dirname "$0") > /dev/null
|
||||
|
||||
mkdir -p "../dependency"
|
||||
cp -arf "../../../docs" "../../../examples" "../dependency"
|
||||
|
||||
popd > /dev/null
|
|
@ -22,9 +22,10 @@ WORKDIR /usr/src/app
|
|||
ENV NODE_ENV=production \
|
||||
SERVER_PORT=8080
|
||||
|
||||
COPY copied_file/ /usr/src/
|
||||
COPY copied_file/webportal/ /usr/src/app/
|
||||
COPY package.json .
|
||||
RUN npm run yarn install
|
||||
COPY dependency/ ../../
|
||||
COPY . .
|
||||
RUN npm run build
|
||||
|
||||
EXPOSE ${SERVER_PORT}
|
Some files were not shown because too many files changed in this diff.