Merge branch 'folder-refactor' into zhaoyu/port-conflict-after-refactor

ZhaoYu Dong 2018-09-12 21:29:27 +08:00
Parents: 199013eb04 bef05fdb35
Commit: 36434e3a77
177 changed files with 614 additions and 401 deletions

View file

@ -44,7 +44,7 @@ matrix:
node_js: 6
env: NODE_ENV=test
before_install:
- cd rest-server
- cd src/rest-server
install:
- npm install
script:
@ -54,7 +54,7 @@ matrix:
node_js: 7
env: NODE_ENV=test
before_install:
- cd rest-server
- cd src/rest-server
install:
- npm install
script:
@ -63,7 +63,7 @@ matrix:
- language: node_js
node_js: 6
before_install:
- cd webportal
- cd src/webportal
install:
- npm run yarn install
- npm run build
@ -72,7 +72,7 @@ matrix:
- language: node_js
node_js: 7
before_install:
- cd webportal
- cd src/webportal
install:
- npm run yarn install
- npm run build

View file

@ -69,7 +69,7 @@ Before starting, you need to meet the following requirements:
### Cluster administration
- [Deployment infrastructure](./docs/pai-management/doc/cluster-bootup.md)
- [Cluster maintenance](https://github.com/Microsoft/pai/wiki/Maintenance-(Service-&-Machine))
- [Monitoring](./webportal/README.md)
- [Monitoring](./docs/webportal/README.md)
## Resources

View file

@ -8,6 +8,6 @@
### Configuration and API
- [Configuration: customize OpenPAI via its configuration](./pai-management/doc/how-to-write-pai-configuration.md)
- [OpenPAI Programming Guides](../examples/README.md)
- [Restful API Docs](../rest-server/README.md)
- [Restful API Docs](rest-server/API.md)
### [FAQs](./faq.md)

View file

@ -27,16 +27,16 @@ Build images by using ```pai_build.py```, which is put under ``build/``. For the conf
### Build infrastructure services <a name="Service_Build"></a>
```
sudo ./pai_build.py build -c /path/to/configuration-dir/ [ -s component-list ]
./pai_build.py build -c /path/to/configuration-dir/ [ -s component-list ]
```
- Build the corresponding component.
- If the option `-n` is added, only the specified component will be built. By default, all components under ``src/`` will be built.
- If the option `-s` is added, only the specified component will be built. By default, all components under ``src/`` will be built.
### Push infrastructure image(s) <a name="Image_Push"></a>
```
sudo ./pai_build.py push -c /path/to/configuration-dir/ [ -i image-list ]
./pai_build.py push -c /path/to/configuration-dir/ [ -i image-list ]
```
- Tag and push images to the docker registry which is configured in the ```cluster-configuration```.
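For example, under the refactored layout a full build-and-push round for two services might look like the sketch below. The component and image names are placeholders, and the exact list syntax accepted by `-s` and `-i` should be checked against `pai_build.py` itself.
```
# Sketch: build two components from src/ and push their images
# ("rest-server" and "webportal" are example component names).
./pai_build.py build -c /path/to/configuration-dir/ -s rest-server webportal
./pai_build.py push -c /path/to/configuration-dir/ -i rest-server webportal
```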
@ -135,4 +135,4 @@ popd > /dev/null
# TO-DO
- Incremental build implementation.
- Incremental build implementation.

View file

@ -28,7 +28,7 @@ User could customize [Kubernetes](https://kubernetes.io/) at OpenPAI's [folder /
User could customize Webportal at OpenPAI's [folder / file](../../webportal/README.md#Configuration)
User could customize Webportal startup configuration at OpenPAI's [folder / file](../bootstrap/webportal/webportal.yaml.template)
User could customize Webportal startup configuration at OpenPAI's [folder / file](../../../src/webportal/deploy/webportal.yaml.template)
## Configure Pylon <a name="pylon"></a>
@ -44,7 +44,7 @@ User could customize FrameworkLauncher startup configuration at OpenPAI's [folde
## Configure Rest-server <a name="restserver"></a>
User could customize rest server at OpenPAI's [folder / file](../bootstrap/rest-server/rest-server.yaml.template)
User could customize rest server at OpenPAI's [folder / file](../../../src/rest-server/deploy/rest-server.yaml.template)
User could customize rest server startup configuration at OpenPAI's [folder / file](../../../src)

View file

@ -2,7 +2,7 @@
1. Job config file
Prepare a job config file as described in [examples/README.md](../docs/job_tutorial.md#json-config-file-for-job-submission), for example, `exampleJob.json`.
Prepare a job config file as described in [examples/README.md](../job_tutorial.md#json-config-file-for-job-submission), for example, `exampleJob.json`.
2. Authentication
@ -54,7 +54,7 @@
## Root URI
Configure the rest server port in [services-configuration.yaml](../cluster-configuration/services-configuration.yaml).
Configure the rest server port in [services-configuration.yaml](../../cluster-configuration/services-configuration.yaml).
## API Details
@ -444,7 +444,7 @@ Configure the rest server port in [services-configuration.yaml](../cluster-confi
*Parameters*
[job config json](../docs/job_tutorial.md#json-config-file-for-job-submission)
[job config json](../job_tutorial.md#json-config-file-for-job-submission)
*Response if succeeded*
```
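As a quick end-to-end illustration of the API in this file, a prepared `exampleJob.json` can be submitted with any HTTP client. The host, port and token variable in this sketch are placeholders, not values taken from this change; a token must first be obtained as described in the Authentication section.
```
# Sketch: submit exampleJob.json to the REST server (POST /api/v1/jobs).
# rest-server-host:9186 and $PAI_TOKEN are placeholders.
curl -X POST http://rest-server-host:9186/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PAI_TOKEN" \
  -d @exampleJob.json
```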

View file

@ -27,14 +27,14 @@ REST Server exposes a set of interfaces that allow you to manage jobs.
## Architecture
REST Server is a Node.js API service for PAI that delivers client requests to different upstream
services, including [FrameworkLauncher](../frameworklauncher), Apache Hadoop YARN, WebHDFS and
services, including [FrameworkLauncher](../../src/frameworklauncher), Apache Hadoop YARN, WebHDFS and
etcd, with some request transformation.
## Dependencies
To start a REST Server service, the following services should be ready and correctly configured.
* [FrameworkLauncher](../frameworklauncher)
* [FrameworkLauncher](../../src/frameworklauncher)
* Apache Hadoop YARN
* HDFS
* etcd
@ -59,7 +59,7 @@ If REST Server is deployed by [pai management tool][pai-management], configurati
If REST Server is deployed manually, the following fields should be configured as environment
variables:
* `LAUNCHER_WEBSERVICE_URI`: URI endpoint of [Framework Launcher](../frameworklauncher)
* `LAUNCHER_WEBSERVICE_URI`: URI endpoint of [Framework Launcher](../../src/frameworklauncher)
* `HDFS_URI`: URI endpoint of HDFS
* `WEBHDFS_URI`: URI endpoint of WebHDFS
* `YARN_URI`: URI endpoint of Apache Hadoop YARN
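For a manual deployment these fields are plain environment variables; a minimal sketch, with every hostname and port below being a placeholder, could look like this:
```
# Sketch: environment for a manually deployed REST Server (placeholder endpoints).
export LAUNCHER_WEBSERVICE_URI=http://launcher-host:9086
export HDFS_URI=hdfs://namenode-host:9000
export WEBHDFS_URI=http://namenode-host:50070
export YARN_URI=http://resourcemanager-host:8088
npm install   # then launch the service with the project's start script
```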
@ -134,4 +134,4 @@ Read [API document](./API.md) for the details of REST API.
[pai-management]: ../pai-management
[service-configuration]: ../cluster-configuration/services-configuration.yaml
[service-configuration]: ../../cluster-configuration/services-configuration.yaml

View file

@ -5,8 +5,8 @@
</p>
The system architecture is illustrated above.
User submits jobs or monitors cluster status through the [Web Portal](../webportal/README.md),
which calls APIs provided by the [REST server](../rest-server/README.md).
User submits jobs or monitors cluster status through the [Web Portal](webportal/README.md),
which calls APIs provided by the [REST server](rest-server/README.md).
Third party tools can also call REST server directly for job management.
Upon receiving API calls, the REST server coordinates with [FrameworkLauncher](../frameworklauncher/README.md) (short for Launcher)
to perform job management.

View file

@ -10,13 +10,13 @@ An [express](https://expressjs.com/) served, [AdminLTE](https://adminlte.io/) th
## Dependencies
Since [job tutorial](../docs/job_tutorial.md) is included in the document tab, make sure the **`docs`** directory exists as a sibling of the `web-portal` directory.
Since [job tutorial](../job_tutorial.md) is included in the document tab, make sure the **`docs`** directory exists as a sibling of the `web-portal` directory.
To run the web portal, the following services should be started, and their URLs should be correctly configured:
* [REST Server](../rest-server)
* [Prometheus](../prometheus)
* [Grafana](../grafana)
* [REST Server](../../src/rest-server)
* [Prometheus](../../src/prometheus)
* [Grafana](../../src/grafana)
* YARN
* Kubernetes
@ -38,7 +38,7 @@ For development
## Configuration
If the web portal is deployed within a PAI cluster, the following config field can be changed in the `webportal` section of the [services-configuration.yaml](../cluster-configuration/services-configuration.yaml) file:
If the web portal is deployed within a PAI cluster, the following config field can be changed in the `webportal` section of the [services-configuration.yaml](../../cluster-configuration/services-configuration.yaml) file:
* `server-port`: Integer. The network port to access the web portal. The default value is 9286.
@ -46,10 +46,10 @@ If web portal is deployed within PAI cluster, the following config field could b
If the web portal is deployed as a standalone service, the following environment variables must be configured:
* `REST_SERVER_URI`: URI of [REST Server](../rest-server)
* `PROMETHEUS_URI`: URI of [Prometheus](../prometheus)
* `REST_SERVER_URI`: URI of [REST Server](../../src/rest-server)
* `PROMETHEUS_URI`: URI of [Prometheus](../../src/prometheus)
* `YARN_WEB_PORTAL_URI`: URI of YARN's web portal
* `GRAFANA_URI`: URI of [Grafana](../grafana)
* `GRAFANA_URI`: URI of [Grafana](../../src/grafana)
* `K8S_DASHBOARD_URI`: URI of Kubernetes' dashboard
* `K8S_API_SERVER_URI`: URI of Kubernetes' api server
* `EXPORTER_PORT`: Port of node exporter
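When the portal runs standalone these become plain environment variables; the sketch below uses placeholder endpoints throughout and builds the static assets with the scripts already used in `.travis.yml`:
```
# Sketch: standalone web portal environment (every endpoint and port is a placeholder).
export REST_SERVER_URI=http://rest-server-host:9186
export PROMETHEUS_URI=http://prometheus-host:9090
export YARN_WEB_PORTAL_URI=http://resourcemanager-host:8088
export GRAFANA_URI=http://grafana-host:3000
export K8S_DASHBOARD_URI=http://kubernetes-dashboard-host:9090
export K8S_API_SERVER_URI=http://kubernetes-api-host:8080
export EXPORTER_PORT=9100
npm run yarn install && npm run build
```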
@ -101,7 +101,7 @@ To run web portal on system, a [Node.js](https://nodejs.org/) 6+ runtime is requ
### Submit a job
Click the tab "Submit Job" to show a button asking you to select a json file for the submission. The job config file must follow the format shown in [job tutorial](../docs/job_tutorial.md).
Click the tab "Submit Job" to show a button asking you to select a json file for the submission. The job config file must follow the format shown in [job tutorial](../job_tutorial.md).
### View job status

View file

@ -1,80 +0,0 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
.git
# Directory for submitted jobs' json file and scripts
frameworklauncher/
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
# Runtime data
pids
*.pid
*.seed
*.pid.lock
# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov
# Coverage directory used by tools like istanbul
coverage
# nyc test coverage
.nyc_output
# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
.grunt
# Bower dependency directory (https://bower.io/)
bower_components
# node-waf configuration
.lock-wscript
# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release
# Dependency directories
node_modules/
jspm_packages/
# Typescript v1 declaration files
typings/
# Optional npm cache directory
.npm
# Optional eslint cache
.eslintcache
# Optional REPL history
.node_repl_history
# Output of 'npm pack'
*.tgz
# Yarn Integrity file
.yarn-integrity
# dotenv environment variables file
.env

View file

@ -1,24 +0,0 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
copy-list:
# created by the prepare hadoop function on docker_build.py
- src: src/hadoop-run/hadoop
dst: src/rest-server/copied_file
- src: ../rest-server
dst: src/rest-server/copied_file

View file

@ -1,26 +0,0 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
copy-list:
- src: ../docs
dst: src/webportal/copied_file
- src: ../examples
dst: src/webportal/copied_file
- src: ../webportal
dst: src/webportal/copied_file

View file

@ -1,165 +0,0 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
# Bootstrap script for docker container.
exec 17>/pai/log/DockerContainerDebug.log
BASH_XTRACEFD=17
function exit_handler()
{
printf "%s %s\n" \
"[DEBUG]" "Docker container exit handler: EXIT signal received in docker container, exiting ..."
kill 0
}
set -x
PS4="+[\t] "
trap exit_handler EXIT
touch "/alive/docker_$PAI_CONTAINER_ID"
while /bin/true; do
[ $(( $(date +%s) - $(stat -c %Y /alive/yarn_$PAI_CONTAINER_ID) )) -gt 60 ] \
&& pkill -9 --ns 1
sleep 20
done &
export PAI_WORK_DIR="$(pwd)"
HDFS_LAUNCHER_PREFIX=$PAI_DEFAULT_FS_URI/Container
export CLASSPATH="$(hadoop classpath --glob)"
task_role_no={{{ idx }}}
printf "%s %s\n%s\n\n" "[INFO]" "ENV" "$(printenv | sort)"
mv /pai/code/* ./
function prepare_ssh()
{
mkdir /root/.ssh
sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
}
function start_ssh_service()
{
printf "%s %s\n" \
"[INFO]" "start ssh service"
cat /root/.ssh/$APP_ID.pub >> /root/.ssh/authorized_keys
sed -i 's/Port.*/Port '$PAI_CONTAINER_SSH_PORT'/' /etc/ssh/sshd_config
echo "sshd:ALL" >> /etc/hosts.allow
service ssh restart
}
function hdfs_upload_atomically()
{
printf "%s %s\n%s %s\n%s %s\n" \
"[INFO]" "upload ssh key to hdfs" \
"[INFO]" "destination path is ${2}" \
"[INFO]" "source path is ${1}"
tempFolder=${2}"_temp"
if hdfs dfs -test -d $tempFolder ; then
printf "%s %s\n" \
"[WARNING]" "$tempFolder already exists, overwriting..."
hdfs dfs -rm -r $tempFolder
fi
hdfs dfs -put ${1} $tempFolder
hdfs dfs -mv $tempFolder ${2}
}
# Check whether the hdfs binary and ssh exist; if not, skip the ssh preparation and start part
# Start sshd in docker container
if which hdfs && service --status-all 2>&1 | grep -q ssh; then
prepare_ssh
hdfs_ssh_folder=${HDFS_LAUNCHER_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/${APP_ID}
printf "%s %s\n%s %s\n%s %s\n" \
"[INFO]" "hdfs_ssh_folder is ${hdfs_ssh_folder}" \
"[INFO]" "task_role_no is ${task_role_no}" \
"[INFO]" "PAI_TASK_INDEX is ${PAI_TASK_INDEX}"
# Let taskRoleNumber=0 and taskindex=0 execute upload ssh files
if [ ${task_role_no} -eq 0 ] && [ ${PAI_TASK_INDEX} -eq 0 ]; then
printf "%s %s %s\n%s\n" \
"[INFO]" "task_role_no:${task_role_no}" "PAI_TASK_INDEX:${PAI_TASK_INDEX}" \
"Execute upload key pair ..."
ssh-keygen -N '' -t rsa -f ~/.ssh/$APP_ID
hdfs dfs -mkdir -p "${hdfs_ssh_folder}"
hdfs_upload_atomically "/root/.ssh/" "${hdfs_ssh_folder}/.ssh"
else
# Waiting for ssh key-pair ready
while ! hdfs dfs -test -d ${hdfs_ssh_folder}/.ssh ; do
echo "[INFO] waitting for ssh key ready"
sleep 10
done
printf "%s %s\n%s %s\n" \
"[INFO]" "ssh key pair ready ..." \
"[INFO]" "begin to download ssh key pair from hdfs ..."
hdfs dfs -get "${hdfs_ssh_folder}/.ssh/" "/root/"
fi
chmod 400 ~/.ssh/$APP_ID
# Generate ssh connect info file in "PAI_CONTAINER_ID-PAI_CURRENT_CONTAINER_IP-PAI_CONTAINER_SSH_PORT" format on hdfs
hdfs dfs -touchz ${hdfs_ssh_folder}/$PAI_CONTAINER_ID-$PAI_CONTAINER_HOST_IP-$PAI_CONTAINER_SSH_PORT
# Generate ssh config
ssh_config_path=${HDFS_LAUNCHER_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/config
hdfs dfs -mkdir -p ${ssh_config_path}
hdfs dfs -touchz ${ssh_config_path}/$APP_ID+$PAI_CURRENT_TASK_ROLE_NAME+$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX+$PAI_CONTAINER_HOST_IP+$PAI_CONTAINER_SSH_PORT
while [ `hdfs dfs -ls $ssh_config_path | grep "/$PAI_JOB_NAME/ssh/config/$APP_ID+" | wc -l` -lt $PAI_JOB_TASK_COUNT ]; do
printf "%s %s\n" "[INFO]" "Waiting for ssh service in other containers ..."
sleep 10
done
NodeList=($(hdfs dfs -ls ${ssh_config_path} \
| grep "/$PAI_JOB_NAME/ssh/config/$APP_ID+" \
| grep -oE "[^/]+$" \
| sed -e "s/^$APP_ID+//g" \
| sort -n))
if [ "${#NodeList[@]}" -ne $PAI_JOB_TASK_COUNT ]; then
printf "%s %s\n%s\n%s\n\n" \
"[ERROR]" "NodeList" \
"${NodeList[@]}" \
"ssh services in ${#NodeList[@]} containers are available, not equal to $PAI_JOB_TASK_COUNT, exit ..."
exit 2
fi
for line in "${NodeList[@]}"; do
node=(${line//+/ });
printf "%s\n %s\n %s\n %s\n %s\n %s\n %s\n" \
"Host ${node[0]}-${node[1]}" \
"HostName ${node[2]}" \
"Port ${node[3]}" \
"User root" \
"StrictHostKeyChecking no" \
"UserKnownHostsFile /dev/null" \
"IdentityFile /root/.ssh/$APP_ID" >> /root/.ssh/config
done
# Start ssh service
start_ssh_service
fi
# Write env to system-wide environment
env | grep -E "^PAI|PATH|PREFIX|JAVA|HADOOP|NVIDIA|CUDA" > /etc/environment
printf "%s %s\n\n" "[INFO]" "USER COMMAND START"
{{{ taskData.command }}} || exit $?
printf "\n%s %s\n\n" "[INFO]" "USER COMMAND END"
exit 0

View file

@ -17,7 +17,12 @@
FROM base-image
RUN wget https://download.docker.com/linux/static/stable/x86_64/docker-17.06.2-ce.tgz && \
tar xzvf docker-17.06.2-ce.tgz && \
mv docker/* /usr/bin/ && \
rm docker-17.06.2-ce.tgz
COPY build/start.sh /usr/local/start.sh
RUN chmod a+x /usr/local/start.sh
CMD ["/usr/local/start.sh"]
CMD ["/usr/local/start.sh"]

src/hadoop-ai/build/build-pre.sh: Executable file → Normal file (5 changes)
View file

@ -22,13 +22,14 @@ pushd $(dirname "$0") > /dev/null
hadoopBinaryDir="/hadoop-binary/"
hadoopBinaryPath="${hadoopBinaryDir}hadoop-2.9.0.tar.gz"
cacheVersion="${hadoopBinaryDir}12932984-12933562-done"
cacheVersion="${hadoopBinaryDir}12932984-12933562-docker_executor-done"
echo "hadoopbinarypath:${hadoopBinaryDir}"
[[ -f $cacheVersion ]] &&
{
echo "Hadoop ai with patch 12932984-12933562 has been built"
echo "Hadoop ai with patch 12932984-12933562-docker_executor has been built"
echo "Skip this build precess"
exit 0
}

View file

@ -31,9 +31,11 @@ git checkout branch-2.9.0
cp /hadoop-2.9.0.gpu-port.patch /hadoop
cp /HDFS-13773.patch /hadoop
cp /docker-executor.patch /hadoop
git apply hadoop-2.9.0.gpu-port.patch
git apply HDFS-13773.patch
git apply docker-executor.patch
mvn package -Pdist,native -DskipTests -Dmaven.javadoc.skip=true -Dtar
@ -44,4 +46,5 @@ echo "Successfully build hadoop 2.9.0 AI"
# When Changing the patch id, please update the filename here.
touch /hadoop-binary/12932984-12933562-done
rm /hadoop-binary/*-done
touch /hadoop-binary/12932984-12933562-docker_executor-done

View file

@ -0,0 +1,123 @@
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
index 96f6c57..1b89e90 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
@@ -1544,6 +1544,14 @@ public static boolean isAclEnabled(Configuration conf) {
public static final String NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME =
NM_PREFIX + "docker-container-executor.image-name";
+ /** The Docker run option(For DockerContainerExecutor).*/
+ public static final String NM_DOCKER_CONTAINER_EXECUTOR_EXEC_OPTION =
+ NM_PREFIX + "docker-container-executor.exec-option";
+
+ /** The command before launch script(For DockerContainerExecutor).*/
+ public static final String NM_DOCKER_CONTAINER_EXECUTOR_SCRIPT_COMMAND =
+ NM_PREFIX + "docker-container-executor.script-command";
+
/** The name of the docker executor (For DockerContainerExecutor).*/
public static final String NM_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME =
NM_PREFIX + "docker-container-executor.exec-name";
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
index a044cb6..819c496 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java
@@ -98,7 +98,7 @@
//containername:0.1 or
//containername
public static final String DOCKER_IMAGE_PATTERN =
- "^(([\\w\\.-]+)(:\\d+)*\\/)?[\\w\\.:-]+$";
+ "^(([\\w\\.-]+)(:\\d+)*\\/)?([\\w\\.-]+\\/)?[\\w\\.:-]+$";
private final FileContext lfs;
private final Pattern dockerImagePattern;
@@ -127,7 +127,12 @@ public void init() throws IOException {
String dockerExecutor = getConf().get(
YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME,
YarnConfiguration.NM_DEFAULT_DOCKER_CONTAINER_EXECUTOR_EXEC_NAME);
- if (!new File(dockerExecutor).exists()) {
+ // /usr/bin/docker -H=tcp://0.0.0.0:xx is also a valid docker executor
+ String[] arr = dockerExecutor.split("\\s");
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("dockerExecutor: " + dockerExecutor);
+ }
+ if (!new File(arr[0]).exists()) {
throw new IllegalStateException(
"Invalid docker exec path: " + dockerExecutor);
}
@@ -181,8 +186,11 @@ public int launchContainer(ContainerStartContext ctx) throws IOException {
//Variables for the launch environment can be injected from the command-line
//while submitting the application
- String containerImageName = container.getLaunchContext().getEnvironment()
- .get(YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME);
+ // Get the image from configuration rather than from the environment
+ String containerImageName = getConf().get(
+ YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_IMAGE_NAME);
+
+ //
if (LOG.isDebugEnabled()) {
LOG.debug("containerImageName from launchContext: " + containerImageName);
}
@@ -240,19 +248,27 @@ public int launchContainer(ContainerStartContext ctx) throws IOException {
//--net=host allows the container to take on the host's network stack
//--name sets the Docker Container name to the YARN containerId string
//-v is used to bind mount volumes for local, log and work dirs.
+ //-w sets the work dir inside the container
+ //add docker option
+ String dockerOption = getConf().get(
+ YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_EXEC_OPTION);
String commandStr = commands.append(dockerExecutor)
.append(" ")
.append("run")
.append(" ")
- .append("--rm --net=host")
+ .append("--rm --net=host --pid=host --privileged=true")
+ .append(" ")
+ .append("-w " + containerWorkDir.toUri().getPath().toString())
+ .append(" ")
+ .append(dockerOption)
.append(" ")
.append(" --name " + containerIdStr)
- .append(localDirMount)
- .append(logDirMount)
- .append(containerWorkDirMount)
.append(" ")
.append(containerImageName)
.toString();
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("Docker run command: " + commandStr);
+ }
//Get the pid of the process which has been launched as a docker container
//using docker inspect
String dockerPidScript = "`" + dockerExecutor +
@@ -597,13 +613,28 @@ private void writeSessionScript(Path launchDst, Path pidFile)
// We need to do a move as writing to a file is not atomic
// Process reading a file being written to may get garbled data
// hence write pid to tmp file first followed by a mv
+ // Move dockerpid command to backend, avoid blocking docker run command
+ // need to improve it with publisher mode
+ // Ref: https://issues.apache.org/jira/browse/YARN-3080
pout.println("#!/usr/bin/env bash");
pout.println();
+ pout.println("{");
+ pout.println("n=10");
+ pout.println("while [ $n -gt 0 ]; do");
+ pout.println("let n=$n-1");
+ pout.println("sleep 5");
pout.println("echo "+ dockerPidScript +" > " + pidFile.toString()
+ ".tmp");
+ pout.println("[ -n \"$(cat \"" + pidFile.toString()
+ + ".tmp\")\" ] && break");
+ pout.println("done");
pout.println("/bin/mv -f " + pidFile.toString() + ".tmp " + pidFile);
- pout.println(dockerCommand + " bash \"" +
- launchDst.toUri().getPath().toString() + "\"");
+ pout.println("} &");
+ //Add exec command before launch_script.
+ String scriptCommand = getConf().get(
+ YarnConfiguration.NM_DOCKER_CONTAINER_EXECUTOR_SCRIPT_COMMAND);
+ pout.println(dockerCommand + " bash -c '" + scriptCommand + " && bash \"" +
+ launchDst.toUri().getPath().toString() + "\"'");
} finally {
IOUtils.cleanupWithLogger(LOG, pout, out);
}
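To make the effect of this patch concrete, the command string assembled by the patched `launchContainer` roughly takes the shape below, assuming the usual `yarn.nodemanager.` prefix for the new properties; the executor path, work dir, extra options, container id and image are illustrative placeholders.
```
# Rough shape of the docker invocation built by the patched DockerContainerExecutor:
# the image comes from yarn.nodemanager.docker-container-executor.image-name and
# $EXTRA_EXEC_OPTIONS stands for yarn.nodemanager.docker-container-executor.exec-option.
/usr/bin/docker run --rm --net=host --pid=host --privileged=true \
  -w /path/to/container/workdir \
  $EXTRA_EXEC_OPTIONS \
  --name container_1519960554030_0046_01_000002 \
  example-registry/example-image:latest
```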

View file

@ -73,8 +73,10 @@ RUN wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.
## The build environment of hadoop has been prepared above.
## Copy your build script here. Default script will build our hadoop-ai.
COPY docker-executor.patch /
COPY build.sh /
RUN chmod u+x build.sh
CMD ["/build.sh"]
CMD ["/build.sh"]

View file

@ -17,6 +17,18 @@
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
# Clean running job
if which docker > /dev/null && [ -S /var/run/docker.sock ]; then
echo "Clean hadoop jobs"
docker ps | awk '/container_\w{3}_[0-9]{13}_[0-9]{4}_[0-9]{2}_[0-9]{6}/ { print $NF}' | xargs timeout 30 docker stop || \
docker ps | awk '/container_\w{3}_[0-9]{13}_[0-9]{4}_[0-9]{2}_[0-9]{6}/ { print $NF}' | xargs docker kill
fi
# Clean data
echo "Clean the hadoop node manager's data on the disk"

src/rest-server/README.md: new file (20 lines)
View file

@ -0,0 +1,20 @@
<!--
Copyright (c) Microsoft Corporation
All rights reserved.
MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-->
See [README.md](../../docs/rest-server/README.md)

View file

@ -23,6 +23,7 @@ RUN echo "deb http://http.debian.net/debian jessie-backports main" > \
apt-get install -y --no-install-recommends -t \
jessie-backports \
dos2unix \
openssh-server \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
@ -32,11 +33,11 @@ WORKDIR /usr/src/app
ENV NODE_ENV=production \
SERVER_PORT=8080
COPY copied_file/rest-server/package.json .
COPY package.json ./
RUN npm install
COPY copied_file/rest-server/ .
COPY . .
RUN dos2unix src/templates/*

View file

@ -1,6 +1,6 @@
{
"name": "pai-rest-server",
"version": "0.1.0",
"version": "0.8.0",
"description": "RESTful api server for Microsoft Platform for AI",
"keywords": [
"REST",
@ -49,7 +49,8 @@
"nyc": "11.6.0",
"statuses": "1.5.0",
"unirest": "0.5.1",
"winston": "2.4.0"
"winston": "2.4.0",
"ssh-keygen": "0.4.2"
},
"scripts": {
"coveralls": "nyc report --reporter=text-lcov | coveralls ..",

View file

@ -20,11 +20,13 @@
const async = require('async');
const unirest = require('unirest');
const mustache = require('mustache');
const keygen = require('ssh-keygen');
const launcherConfig = require('../config/launcher');
const userModel = require('./user');
const yarnContainerScriptTemplate = require('../templates/yarnContainerScript');
const dockerContainerScriptTemplate = require('../templates/dockerContainerScript');
const createError = require('../util/error');
const logger = require('../config/logger');
const Hdfs = require('../util/hdfs');
@ -232,30 +234,51 @@ class Job {
hdfs.list(
folderPathPrefix,
null,
(error, result) => {
(error, connectInfo) => {
if (!error) {
let sshInfo = {
'containers': [],
'keyPair': {
'folderPath': `${launcherConfig.hdfsUri}${folderPathPrefix}/.ssh/`,
'publicKeyFileName': `${applicationId}.pub`,
'privateKeyFileName': `${applicationId}`,
'privateKeyDirectDownloadLink':
`${launcherConfig.webhdfsUri}/webhdfs/v1${folderPathPrefix}/.ssh/${applicationId}?op=OPEN`,
},
};
for (let x of result.content.FileStatuses.FileStatus) {
let pattern = /^container_(.*)-(.*)-(.*)$/g;
let arr = pattern.exec(x.pathSuffix);
if (arr !== null) {
sshInfo.containers.push({
'id': 'container_' + arr[1],
'sshIp': arr[2],
'sshPort': arr[3],
});
}
}
next(null, sshInfo);
let latestKeyFilePath = `/Container/${userName}/${jobName}/ssh/keyFiles`;
let sshInfo = {};
// Handle backward compatibility
hdfs.list(latestKeyFilePath,
null,
(error, result) => {
if (!error) {
sshInfo = {
'containers': [],
'keyPair': {
'folderPath': `${launcherConfig.hdfsUri}${latestKeyFilePath}`,
'publicKeyFileName': `${jobName}.pub`,
'privateKeyFileName': `${jobName}`,
'privateKeyDirectDownloadLink':
`${launcherConfig.webhdfsUri}/webhdfs/v1${latestKeyFilePath}/${jobName}?op=OPEN`,
},
};
} else {
// older pattern is ${launcherConfig.hdfsUri}${folderPathPrefix}/.ssh/
sshInfo = {
'containers': [],
'keyPair': {
'folderPath': `${launcherConfig.hdfsUri}${folderPathPrefix}/.ssh/`,
'publicKeyFileName': `${applicationId}.pub`,
'privateKeyFileName': `${applicationId}`,
'privateKeyDirectDownloadLink':
`${launcherConfig.webhdfsUri}/webhdfs/v1${folderPathPrefix}/.ssh/${applicationId}?op=OPEN`,
},
};
}
for (let x of connectInfo.content.FileStatuses.FileStatus) {
let pattern = /^container_(.*)-(.*)-(.*)$/g;
let arr = pattern.exec(x.pathSuffix);
if (arr !== null) {
sshInfo.containers.push({
'id': 'container_' + arr[1],
'sshIp': arr[2],
'sshPort': arr[3],
});
}
}
next(null, sshInfo);
});
} else {
next(error);
}
@ -367,6 +390,7 @@ class Job {
'hdfsUri': launcherConfig.hdfsUri,
'taskData': data.taskRoles[idx],
'jobData': data,
'webHdfsUri': launcherConfig.webhdfsUri,
});
return dockerContainerScript;
}
@ -432,6 +456,21 @@ class Job {
return frameworkDescription;
}
generateSshKeyFiles(name, next) {
keygen({
location: name,
read: true,
destroy: true,
}, function(err, out) {
if (err) {
next(err);
} else {
let sshKeyFiles = [{'content': out.pubKey, 'fileName': name+'.pub'}, {'content': out.key, 'fileName': name}];
next(null, sshKeyFiles);
}
});
}
_initializeJobContextRootFolders(next) {
const hdfs = new Hdfs(launcherConfig.webhdfsUri);
async.parallel([
@ -535,6 +574,26 @@ class Job {
}
);
},
(parallelCallback) => {
this.generateSshKeyFiles(name, (error, sshKeyFiles) => {
if (error) {
logger.error('Generating ssh key files failed');
} else {
async.each(sshKeyFiles, (file, eachCallback) => {
hdfs.createFile(
`/Container/${data.userName}/${name}/ssh/keyFiles/${file.fileName}`,
file.content,
{'user.name': data.userName, 'permission': '775', 'overwrite': 'true'},
(error, result) => {
eachCallback(error);
}
);
}, (error) => {
parallelCallback(error);
});
}
});
},
], (parallelError) => {
return next(parallelError);
});
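The fallback above backs the job SSH-info endpoint exercised by the tests further down; a quick manual check is a plain GET, with host, port and job name as placeholders:
```
# Sketch: fetch SSH connect info for a job (GET /api/v1/jobs/:jobName/ssh).
# The response carries "containers" (id/sshIp/sshPort) and a "keyPair" section.
curl http://rest-server-host:9186/api/v1/jobs/job6/ssh
```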

View file

@ -0,0 +1,190 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
# Bootstrap script for docker container.
exec 17>/pai/log/DockerContainerDebug.log
BASH_XTRACEFD=17
function exit_handler()
{
printf "%s %s\n" \
"[DEBUG]" "Docker container exit handler: EXIT signal received in docker container, exiting ..."
kill 0
}
set -x
PS4="+[\t] "
trap exit_handler EXIT
touch "/alive/docker_$PAI_CONTAINER_ID"
while /bin/true; do
[ $(( $(date +%s) - $(stat -c %Y /alive/yarn_$PAI_CONTAINER_ID) )) -gt 60 ] \
&& pkill -9 --ns 1
sleep 20
done &
export PAI_WORK_DIR="$(pwd)"
PAI_WEB_HDFS_PREFIX={{{ webHdfsUri }}}/webhdfs/v1/Container
HDFS_LAUNCHER_PREFIX=$PAI_DEFAULT_FS_URI/Container
export CLASSPATH="$(hadoop classpath --glob)"
task_role_no={{{ idx }}}
printf "%s %s\n%s\n\n" "[INFO]" "ENV" "$(printenv | sort)"
mv /pai/code/* ./
function webhdfs_create_file()
{
webHdfsRequestPath=${1}"?user.name="{{{ jobData.userName }}}"&op=CREATE"
redirectResponse=$(curl -i -X PUT ${webHdfsRequestPath} -o /dev/null -w %{redirect_url}' '%{http_code})
redirectCode=$(cut -d ' ' -f 2 <<< ${redirectResponse})
if [[ ${redirectCode} = "307" ]]; then
redirectUri=$(cut -d ' ' -f 1 <<< ${redirectResponse})
createResponse=$(curl -i -S -X PUT ${redirectUri})
else
printf "%s %s\n %s %s\n %s %s\n" \
"[WARNING]" "Webhdfs creates folder failed" \
"Folder Path:" ${webHdfsRequestPath} \
"Response code:" ${redirectCode}
fi
}
function webhdfs_download_file()
{
webHdfsRequestPath=${1}"?user.name="{{{ jobData.userName }}}"&op=OPEN"
localPath=${2}
downloadResponse=$(curl -S -L ${webHdfsRequestPath} -o ${localPath} -w %{http_code})
if [[ ${downloadResponse} = "200" ]]; then
printf "%s %s\n" \
"[INFO]" "Webhdfs downloads file succeed"
else
printf "%s %s\n" \
"[WARNING]" "Webhdfs downloads file failed"
fi
}
function prepare_ssh()
{
mkdir /root/.ssh
sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
}
function start_ssh_service()
{
printf "%s %s\n" \
"[INFO]" "start ssh service"
cat /root/.ssh/{{{ jobData.jobName }}}.pub >> /root/.ssh/authorized_keys
sed -i 's/Port.*/Port '$PAI_CONTAINER_SSH_PORT'/' /etc/ssh/sshd_config
echo "sshd:ALL" >> /etc/hosts.allow
service ssh restart
}
function get_ssh_key_files()
{
info_source="webhdfs"
localKeyPath=/root/.ssh/{{{ jobData.jobName }}}.pub
if [[ -f $localKeyPath ]]; then
rm -f $localKeyPath
fi
if [[ "$info_source" = "webhdfs" ]]; then
webHdfsKeyPath=${PAI_WEB_HDFS_PREFIX}/{{{ jobData.userName }}}/{{{ jobData.jobName }}}/ssh/keyFiles/{{{ jobData.jobName }}}.pub
webhdfs_download_file $webHdfsKeyPath $localKeyPath
else
printf "%s %s\n" \
"[WARNING]" "Get another key store way"
fi
}
function generate_ssh_connect_info()
{
info_source="webhdfs"
destFileName=${1}
if [[ "$info_source" = "webhdfs" ]]; then
webHdfsRequestPath=$destFileName
webhdfs_create_file $webHdfsRequestPath
else
printf "%s %s\n" \
"[WARNING]" "Get another key store way"
fi
}
# Check whether the hdfs binary and ssh exist; if not, skip the ssh preparation and start part
# Start sshd in docker container
if service --status-all 2>&1 | grep -q ssh; then
prepare_ssh
get_ssh_key_files
sshConnectInfoFolder=${PAI_WEB_HDFS_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/$APP_ID
# Generate ssh connect info file in "PAI_CONTAINER_ID-PAI_CURRENT_CONTAINER_IP-PAI_CONTAINER_SSH_PORT" format on hdfs
destFilePath=${sshConnectInfoFolder}/$PAI_CONTAINER_ID-$PAI_CONTAINER_HOST_IP-$PAI_CONTAINER_SSH_PORT
generate_ssh_connect_info ${destFilePath}
# Generate ssh config for MPI job
if which hdfs; then
ssh_config_path=${HDFS_LAUNCHER_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/config
hdfs dfs -mkdir -p ${ssh_config_path}
hdfs dfs -touchz ${ssh_config_path}/$APP_ID+$PAI_CURRENT_TASK_ROLE_NAME+$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX+$PAI_CONTAINER_HOST_IP+$PAI_CONTAINER_SSH_PORT
while [ `hdfs dfs -ls $ssh_config_path | grep "/$PAI_JOB_NAME/ssh/config/$APP_ID+" | wc -l` -lt $PAI_JOB_TASK_COUNT ]; do
printf "%s %s\n" "[INFO]" "Waiting for ssh service in other containers ..."
sleep 10
done
NodeList=($(hdfs dfs -ls ${ssh_config_path} \
| grep "/$PAI_JOB_NAME/ssh/config/$APP_ID+" \
| grep -oE "[^/]+$" \
| sed -e "s/^$APP_ID+//g" \
| sort -n))
if [ "${#NodeList[@]}" -ne $PAI_JOB_TASK_COUNT ]; then
printf "%s %s\n%s\n%s\n\n" \
"[ERROR]" "NodeList" \
"${NodeList[@]}" \
"ssh services in ${#NodeList[@]} containers are available, not equal to $PAI_JOB_TASK_COUNT, exit ..."
exit 2
fi
for line in "${NodeList[@]}"; do
node=(${line//+/ });
printf "%s\n %s\n %s\n %s\n %s\n %s\n %s\n" \
"Host ${node[0]}-${node[1]}" \
"HostName ${node[2]}" \
"Port ${node[3]}" \
"User root" \
"StrictHostKeyChecking no" \
"UserKnownHostsFile /dev/null" \
"IdentityFile /root/.ssh/$APP_ID" >> /root/.ssh/config
done
fi
# Start ssh service
start_ssh_service
fi
# Write env to system-wide environment
env | grep -E "^PAI|PATH|PREFIX|JAVA|HADOOP|NVIDIA|CUDA" > /etc/environment
printf "%s %s\n\n" "[INFO]" "USER COMMAND START"
{{{ taskData.command }}} || exit $?
printf "\n%s %s\n\n" "[INFO]" "USER COMMAND END"
exit 0

View file

@ -83,6 +83,20 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
)
);
nock(launcherWebserviceUri)
.get('/v1/Frameworks/job6')
.reply(
200,
mustache.render(
frameworkDetailTemplate,
{
'frameworkName': 'job6',
'userName': 'test',
'applicationId': 'app6',
}
)
);
//
// Mock WebHDFS
//
@ -120,6 +134,46 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
},
}
);
nock(webhdfsUri)
.get('/webhdfs/v1/Container/test/job6/ssh/app6?op=LISTSTATUS')
.reply(
200,
{
'FileStatuses': {
'FileStatus': [
{
'pathSuffix': 'container_1519960554030_0046_01_000002-10.240.0.15-39035',
},
{
'pathSuffix': 'container_1519960554030_0046_01_000003-10.240.0.17-28730',
},
{
'pathSuffix': 'container_1519960554030_0046_01_000004-10.240.0.16-30690',
},
],
},
}
);
nock(webhdfsUri)
.get('/webhdfs/v1/Container/test/job6/ssh/keyFiles?op=LISTSTATUS')
.reply(
200,
{
'FileStatuses': {
'FileStatus': [
{
'pathSuffix': 'job6.pub',
},
{
'pathSuffix': 'job6',
},
],
},
}
);
});
//
@ -137,11 +191,22 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
});
});
it('Case 2 (Positive): SSH info stored in the new pattern is retrieved successfully.', (done) => {
chai.request(server)
.get('/api/v1/jobs/job6/ssh')
.end((err, res) => {
expect(res, 'status code').to.have.status(200);
expect(res, 'response format').be.json;
expect(JSON.stringify(res.body), 'response body content').include('keyPair');
done();
});
});
//
// Negative cases
//
it('Case 2 (Negative): The job does not exist at all.', (done) => {
it('Case 3 (Negative): The job does not exist at all.', (done) => {
chai.request(server)
.get('/api/v1/jobs/job2/ssh')
.end((err, res) => {
@ -151,7 +216,7 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
});
});
it('Case 3 (Negative): The job exists, but does not contain SSH info.', (done) => {
it('Case 4 (Negative): The job exists, but does not contain SSH info.', (done) => {
chai.request(server)
.get('/api/v1/jobs/job3/ssh')
.end((err, res) => {
@ -161,7 +226,7 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
});
});
it('Case 4 (Negative): Cannot connect to Launcher.', (done) => {
it('Case 5 (Negative): Cannot connect to Launcher.', (done) => {
chai.request(server)
.get('/api/v1/jobs/job4/ssh')
.end((err, res) => {
@ -171,7 +236,7 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
});
});
it('Case 5 (Negative): Cannot connect to WebHDFS.', (done) => {
it('Case 6 (Negative): Cannot connect to WebHDFS.', (done) => {
chai.request(server)
.get('/api/v1/jobs/job5/ssh')
.end((err, res) => {
@ -181,4 +246,3 @@ describe('Get job SSH info: GET /api/v1/jobs/:jobName/ssh', () => {
});
});
});

View file

@ -66,7 +66,7 @@ describe('Submit job: POST /api/v1/jobs', () => {
);
global.nock(global.webhdfsUri)
.put(/op=CREATE/)
.times(4)
.times(6)
.reply(
201,
{}

View file

@ -19,4 +19,4 @@ FROM python:2.7
RUN pip install PyYAML requests paramiko prometheus_client
COPY copied_file/exporter/watchdog.py /
COPY src/watchdog.py /

View file

@ -22,5 +22,4 @@ pushd $(dirname "$0") > /dev/null
echo "Call stop script to stop all service first"
/bin/bash stop.sh || exit $?
popd > /dev/null
popd > /dev/null

View file

@ -17,18 +17,16 @@
prerequisite:
- cluster-configuration
- drivers
template-list:
- watchdog-configmap.yaml
- watchdog.yaml
- refresh.sh
start-script: start.sh
stop-script: stop.sh
delete-script: delete.sh
refresh-script: refresh.sh
upgraded-script: upgraded.sh
deploy-rules:
in: pai-master
in: pai-master

View file

@ -19,11 +19,6 @@
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
INSTANCES="daemonset/watchdog
deployment/watchdog
configmap/watchdog
"
for instance in ${INSTANCES}; do
kubectl delete --ignore-not-found --now ${instance}
done
kubectl delete --ignore-not-found --now daemonset/watchdog
kubectl delete --ignore-not-found --now deployment/watchdog
kubectl delete --ignore-not-found --now configmap/watchdog

src/webportal/README.md: new file (1 line)
View file

@ -0,0 +1 @@
See [README.md](../../docs/webportal/README.md)

View file

@ -1,3 +1,5 @@
#!/bin/bash
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
@ -15,6 +17,9 @@
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
copy-list:
- src: ../prometheus/exporter
dst: src/watchdog/copied_file
pushd $(dirname "$0") > /dev/null
mkdir -p "../dependency"
cp -arf "../../../docs" "../../../examples" "../dependency"
popd > /dev/null

View file

@ -22,9 +22,10 @@ WORKDIR /usr/src/app
ENV NODE_ENV=production \
SERVER_PORT=8080
COPY copied_file/ /usr/src/
COPY copied_file/webportal/ /usr/src/app/
COPY package.json .
RUN npm run yarn install
COPY dependency/ ../../
COPY . .
RUN npm run build
EXPOSE ${SERVER_PORT}

Some files were not shown because too many files have changed in this diff.