* add job-editor.dockerfile

* refine dockerfile

* add readme

* refine

* refine readme

* add screenshot

* add JUPYTER_TOKEN in jobEnvs

* update readme

* update readme

* update readme
This commit is contained in:
Hao Yuan 2018-12-05 14:21:57 +08:00 committed by GitHub
Parent 70173de7a5
Commit f7366ff6d8
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
3 changed files with 455 additions and 0 deletions


@@ -0,0 +1,81 @@
# Launch a Jupyter job-editor for interactive debugging
This example is a prototype of an online job editor. It demonstrates how to:
1. Launch a Jupyter notebook for online code editing and interactive debugging. The editor runs on PAI, so you debug in the same environment the job will run in and get feedback on the fly.
2. Leverage the PAI rest-server APIs to submit jobs to PAI, so you can try more experiments without leaving the editor.
The job-editor runs as a normal PAI job: it launches a Jupyter notebook that `prepares dependencies`, `tests code`, `submits jobs to PAI` and `does all other things`.
We haven't done any encapsulation, leaving the `raw code` to expose the implementation details. It is simple and easy to extend; any improvements are welcome.
## Setup
You will need to set the ENVs below:
- PAI_URL: your_pai_cluster_url #the URL of your PAI cluster
- PAI_USER_NAME: your_pai_user #your user name on PAI
## HDFS (optional)
To use HDFS in the job-editor, you will also need to set the ENVs below (a quick connectivity check is sketched after the list):
- HDFS_FS_DEFAULT: hdfs://ip:9000/ #hdfs://hdfs_name_node_ip:hdfs_port/
- WEBHDFS_FS_DEFAULT: http://ip:5070/ #http://hdfs_name_node_ip:webhdfs_port/
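As a quick sanity check, the sketch below uses the `hdfs` Python package that the Dockerfile installs to list your HDFS home directory over WebHDFS. It is only a minimal example and assumes the two ENVs above plus `PAI_USER_NAME` are already set in the notebook environment.
```python
import os

from hdfs import InsecureClient  # installed in the image via `pip install hdfs`

# WEBHDFS_FS_DEFAULT looks like http://hdfs_name_node_ip:webhdfs_port/
client = InsecureClient(os.environ['WEBHDFS_FS_DEFAULT'],
                        user=os.environ['PAI_USER_NAME'])

# List the user's HDFS home directory to confirm the connection works.
print(client.list(f"/user/{os.environ['PAI_USER_NAME']}"))
```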
## Run the jupyter job-editor on PAI
Using the config template below, **change** the `jobEnvs` values to match your PAI configuration.
```json
{
"jobName": "job-editor",
"image": "docker.io/openpai/job-editor",
"retryCount": 0,
"jobEnvs": {
"PAI_URL": "your_pai_cluster_url",
"PAI_USER_NAME": "your_pai_user",
"HDFS_FS_DEFAULT": "hdfs://your_hdfs_name_node_ip:9000/",
"WEBHDFS_FS_DEFAULT": "http://your_hdfs_name_node_ip:5070/",
"JUPYTER_TOKEN": "choose_your_jupyter_token"
},
"taskRoles": [
{
"name": "editor",
"taskNumber": 1,
"cpuNumber": 4,
"memoryMB": 4096,
"shmMB": 64,
"gpuNumber": 1,
"minFailedTaskCount": 1,
"command": "bash -c /root/setup_hdfs.sh && start-notebook.sh --ip $JUPYTER_HOST_IP --port=$PAI_CONTAINER_HOST_jupyter_PORT_LIST --NotebookApp.token=${JUPYTER_TOKEN}",
"portList": [
{
"label": "jupyter",
"beginAt": 0,
"portNumber": 1
}
]
}
]
}
```
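If you prefer to submit this config programmatically instead of through the web portal, a minimal sketch using the PAI rest-server API (the same flow the bundled notebook uses for the MNIST job) could look like the following; the cluster URL, user name, and the `job-editor.json` file name are placeholders you would replace.
```python
import getpass
import json

import requests

pai_url = 'http://your_pai_cluster_url'  # placeholder for your PAI cluster URL
user = 'your_pai_user'                   # placeholder for your PAI user name
password = getpass.getpass()

# The config template above, saved locally as job-editor.json (hypothetical file name)
with open('job-editor.json') as f:
    job = json.load(f)

# Get an auth token from the rest-server
token = requests.post(
    f'{pai_url}/rest-server/api/v1/token',
    json={'username': user, 'password': password, 'expiration': 3600},
).json()['token']

# Submit the job-editor job
response = requests.post(
    f'{pai_url}/rest-server/api/v1/user/{user}/jobs',
    headers={'Authorization': f'Bearer {token}', 'Content-Type': 'application/json'},
    json=job,
)
print(response.status_code, response.json())
```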
Refer to the screenshots below.
![image](https://user-images.githubusercontent.com/1547343/48335823-d0119c80-e699-11e8-960a-1e6aa97d567e.png)
![image](https://user-images.githubusercontent.com/1547343/48335887-fc2d1d80-e699-11e8-89e4-b6b15a261cc3.png)
![image](https://user-images.githubusercontent.com/1547343/48335988-3eeef580-e69a-11e8-851a-5415a9aee8a6.png)
## Testing: Run the job-editor outside PAI
Correct the ENVs according to your cluster, then launch the job-editor as below. The image's default CMD also reads `PAI_CONTAINER_HOST_jupyter_PORT_LIST` for the notebook port, so you will likely need to pass it as an extra `--env` as well:
```bash
sudo docker run \
--env "PAI_URL=your_pai_cluster_url" \
--env "PAI_USER_NAME=your_pai_user" \
--env "HDFS_FS_DEFAULT=hdfs://your_hdfs_name_node_ip:9000/" \
--env "WEBHDFS_FS_DEFAULT=http://your_hdfs_name_node_ip:5070/" \
-it \
--network=host \
openpai/job-editor
```


@@ -0,0 +1,46 @@
FROM jupyter/minimal-notebook
USER root
# install java
RUN apt-get update \
&& apt-get install default-jdk -y \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# install hadoop
RUN wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz && \
tar -xzvf hadoop-3.0.3.tar.gz && \
mv hadoop-3.0.3 /usr/local/hadoop && \
rm hadoop-3.0.3.tar.gz
# config hadoop
RUN echo 'export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::") \n\
export HADOOP_HOME=/usr/local/hadoop \n\
' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
# solve error in pydoop installation: 'pydoop.LocalModeNotSupported: ERROR: Hadoop is configured to run in local mode'
RUN echo '<configuration><property><name>mapreduce.framework.name</name><value>yarn</value></property></configuration>' > /usr/local/hadoop/etc/hadoop/mapred-site.xml
# install hdfscli
RUN pip install hdfs
# hdfs setup script
COPY setup_hdfs.sh /root/setup_hdfs.sh
ENV JUPYTER_ENABLE_LAB true
ENV GRANT_SUDO yes
ENV HADOOP_HOME /usr/local/hadoop
ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64/
ENV JUPYTER_HOST_IP 0.0.0.0
# mnist example
COPY mnist_pytorch.ipynb work
RUN fix-permissions $HOME
# The user will need to pass the ENVs below to the container:
# PAI_URL
# PAI_USER_NAME
# PAI_CONTAINER_HOST_jupyter_PORT_LIST
# HDFS_FS_DEFAULT (optional)
# WEBHDFS_FS_DEFAULT (optional)
CMD ["bash", "-c", "/root/setup_hdfs.sh && start-notebook.sh --ip $JUPYTER_HOST_IP --port=$PAI_CONTAINER_HOST_jupyter_PORT_LIST --NotebookApp.token=\"\""]


@@ -0,0 +1,328 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Install requirements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a requirements.txt and install the requirements."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"echo 'torch\n",
"torchvision\n",
"' > requirements.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install a pip package in the current Jupyter kernel\n",
"import sys\n",
"!{sys.executable} -m pip install -r requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Edit your code\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Edit and save your code file `main.py`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile main.py\n",
"from __future__ import print_function\n",
"import argparse\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import torch.optim as optim\n",
"from torchvision import datasets, transforms\n",
"\n",
"class Net(nn.Module):\n",
" def __init__(self):\n",
" super(Net, self).__init__()\n",
" self.conv1 = nn.Conv2d(1, 10, kernel_size=5)\n",
" self.conv2 = nn.Conv2d(10, 20, kernel_size=5)\n",
" self.conv2_drop = nn.Dropout2d()\n",
" self.fc1 = nn.Linear(320, 50)\n",
" self.fc2 = nn.Linear(50, 10)\n",
"\n",
" def forward(self, x):\n",
" x = F.relu(F.max_pool2d(self.conv1(x), 2))\n",
" x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))\n",
" x = x.view(-1, 320)\n",
" x = F.relu(self.fc1(x))\n",
" x = F.dropout(x, training=self.training)\n",
" x = self.fc2(x)\n",
" return F.log_softmax(x, dim=1)\n",
"\n",
"def train(args, model, device, train_loader, optimizer, epoch):\n",
" model.train()\n",
" for batch_idx, (data, target) in enumerate(train_loader):\n",
" data, target = data.to(device), target.to(device)\n",
" optimizer.zero_grad()\n",
" output = model(data)\n",
" loss = F.nll_loss(output, target)\n",
" loss.backward()\n",
" optimizer.step()\n",
" if batch_idx % args.log_interval == 0:\n",
" print('Train Epoch: {} [{}/{} ({:.0f}%)]\\tLoss: {:.6f}'.format(\n",
" epoch, batch_idx * len(data), len(train_loader.dataset),\n",
" 100. * batch_idx / len(train_loader), loss.item()))\n",
"\n",
"def test(args, model, device, test_loader):\n",
" model.eval()\n",
" test_loss = 0\n",
" correct = 0\n",
" with torch.no_grad():\n",
" for data, target in test_loader:\n",
" data, target = data.to(device), target.to(device)\n",
" output = model(data)\n",
" test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss\n",
" pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability\n",
" correct += pred.eq(target.view_as(pred)).sum().item()\n",
"\n",
" test_loss /= len(test_loader.dataset)\n",
" print('\\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\\n'.format(\n",
" test_loss, correct, len(test_loader.dataset),\n",
" 100. * correct / len(test_loader.dataset)))\n",
"\n",
"def main():\n",
" # Training settings\n",
" parser = argparse.ArgumentParser(description='PyTorch MNIST Example')\n",
" parser.add_argument('--batch-size', type=int, default=64, metavar='N',\n",
" help='input batch size for training (default: 64)')\n",
" parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',\n",
" help='input batch size for testing (default: 1000)')\n",
" parser.add_argument('--epochs', type=int, default=10, metavar='N',\n",
" help='number of epochs to train (default: 10)')\n",
" parser.add_argument('--lr', type=float, default=0.01, metavar='LR',\n",
" help='learning rate (default: 0.01)')\n",
" parser.add_argument('--momentum', type=float, default=0.5, metavar='M',\n",
" help='SGD momentum (default: 0.5)')\n",
" parser.add_argument('--no-cuda', action='store_true', default=False,\n",
" help='disables CUDA training')\n",
" parser.add_argument('--seed', type=int, default=1, metavar='S',\n",
" help='random seed (default: 1)')\n",
" parser.add_argument('--log-interval', type=int, default=10, metavar='N',\n",
" help='how many batches to wait before logging training status')\n",
" args = parser.parse_args()\n",
" use_cuda = not args.no_cuda and torch.cuda.is_available()\n",
"\n",
" torch.manual_seed(args.seed)\n",
"\n",
" device = torch.device(\"cuda\" if use_cuda else \"cpu\")\n",
"\n",
" kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}\n",
" train_loader = torch.utils.data.DataLoader(\n",
" datasets.MNIST('../data', train=True, download=True,\n",
" transform=transforms.Compose([\n",
" transforms.ToTensor(),\n",
" transforms.Normalize((0.1307,), (0.3081,))\n",
" ])),\n",
" batch_size=args.batch_size, shuffle=True, **kwargs)\n",
" test_loader = torch.utils.data.DataLoader(\n",
" datasets.MNIST('../data', train=False, transform=transforms.Compose([\n",
" transforms.ToTensor(),\n",
" transforms.Normalize((0.1307,), (0.3081,))\n",
" ])),\n",
" batch_size=args.test_batch_size, shuffle=True, **kwargs)\n",
"\n",
"\n",
" model = Net().to(device)\n",
" optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)\n",
"\n",
" for epoch in range(1, args.epochs + 1):\n",
" train(args, model, device, train_loader, optimizer, epoch)\n",
" test(args, model, device, test_loader)\n",
"\n",
"\n",
"if __name__ == '__main__':\n",
" main()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test your code"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%run main.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Submit job on PAI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Input PAI cluster auth info"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"user = os.environ['PAI_USER_NAME']\n",
"pai_url = os.environ['PAI_URL']\n",
"\n",
"print(user)\n",
"import getpass\n",
"password = getpass.getpass()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure your job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate a job id\n",
"import string\n",
"import random\n",
"def id_generator(size=6, chars=string.ascii_uppercase + string.digits):\n",
" return ''.join(random.choice(chars) for _ in range(size))\n",
"job_name = f'mnist-pytorch-{id_generator()}'\n",
"\n",
"job = {\n",
" 'jobName': job_name,\n",
" # code_dir: will automatically generate code_dir and upload code\n",
" 'image': 'docker.io/openpai/job-editor',\n",
" 'taskRoles': [{\n",
" 'name': 'main',\n",
" 'taskNumber': 1,\n",
" 'cpuNumber': 4,\n",
" 'memoryMB': 8192,\n",
" 'command': f'sh -c \"cd {job_name} && pip install -r requirements.txt && python main.py\"'\n",
" }],\n",
" # 'authFile': f'{os.environ[\"HDFS_FS_DEFAULT\"]}/user/{user}/authFile'\n",
"}\n",
"print(job)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit your job"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# upload code\n",
"target_path = f'/user/{user}/{job_name}/main.py'\n",
"!hdfscli upload --alias=dev main.py {target_path}\n",
"target_path = f'/user/{user}/{job_name}/requirements.txt'\n",
"!hdfscli upload --alias=dev requirements.txt {target_path}\n",
"code_dir = f'{os.environ[\"HDFS_FS_DEFAULT\"]}/user/{user}/{job_name}/'\n",
"job['codeDir'] = code_dir\n",
"\n",
"# display job config\n",
"print('Job:', job)\n",
"\n",
"# Get auth token\n",
"import requests\n",
"data = {\"username\": user, \"password\": password, \"expiration\": 3600}\n",
"url = f'{pai_url}/rest-server/api/v1/token'\n",
"response = requests.post(\n",
" url, headers={'Content-Type': 'application/json'}, json=data)\n",
"# print(response.status_code)\n",
"token = response.json()[\"token\"]\n",
"print('Token:', token)\n",
"\n",
"# Submit job\n",
"create_job_url = f'{pai_url}/rest-server/api/v1/user/{user}/jobs'\n",
"headers = {\n",
" 'Authorization': f'Bearer {token}',\n",
" 'Content-Type': 'application/json'\n",
"}\n",
"\n",
"response = requests.post(create_job_url, headers=headers, json=job)\n",
"print('Submit job:', response.json())\n",
"print('Job link:', f'{pai_url}/view.html?username={user}&jobName={job_name}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}