Merge pull request #28 from jinlccs/master

remove boneyard.
This commit is contained in:
jinl 2017-09-26 17:18:48 -07:00 committed by GitHub
Parents: ceaf0ef69a 20db5e9414
Commit: 06d78c9d6a
15 changed files: 2 additions and 443 deletions

@@ -1,3 +0,0 @@
# Boneyard
This folder contains code and descriptions that are being phased out during the development of DL workspace.

@@ -1,80 +0,0 @@
# Deployment of DL workspace cluster
This document describes the procedure to build and deploy a small DL workspace cluster. You will need to build and deploy the following nodes:
1. One Kubernetes master server,
2. Etcd server (or an etcd cluster with multiple nodes for redundant operation),
3. API servers,
4. Web server to host kubelet configuration files and the certificate generation service,
5. Multiple Kubernetes worker nodes for running jobs.
Basic knowledge of Linux and CoreOS will be very helpful for following the deployment instructions.
## Deploy Kubernetes master server, etcd, web servers (these servers will then deploy all other servers in the cluster)
We describe the steps to install and deploy a customized Kubernetes cluster. It is possible to install the master server, etcd, and web server on a single machine. Using multiple etcd servers provides the benefit of redundancy in case one of the servers fails.
### Base CoreOS deployment.
This section describes the process of deploying a base CoreOS image through a USB stick. In a production environment, there may be more efficient mechanisms to deploy CoreOS images to machines.
Please prepare a cloud-config file that will be used to bootstrap the deployed CoreOS machine. A sample config file is provided at /src/ClusterBootstrap/CoreOSConfig/pxe-kubemaster.yml.template. Please copy the file to pxe-kubemaster.yml, and fill in the username, password, and SSH key information according to the instructions in [the CoreOS cloud-config documentation](https://coreos.com/os/docs/latest/cloud-config.html). Then, either host pxe-kubemaster.yml on a web service that you control (you will need to download the cloud-config file during boot, [following these instructions](CoreOSBoot.md)), or put it on a [bootable USB](USBBootable.md).
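For reference, a minimal cloud-config might look like the sketch below; the hostname, username, and SSH key are placeholders, and the template file remains the authoritative source for the full set of options:
```
#cloud-config
hostname: kubemaster                  # placeholder hostname
users:
  - name: core                        # replace with your username
    ssh-authorized-keys:
      - ssh-rsa AAAA... user@host     # replace with your public SSH key
    groups:
      - sudo
      - docker
```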
You may then install CoreOS via:
```
sudo coreos-install -d /dev/sda -C stable -c [LOCAL_CLOUD_CONFIG_FILE]
```
Once installation is completed, please use 'ifconfig' to print the IP address of the machine. The IP addresses are used later to access the deployed CoreOS machine. After successful installation, you can unplug the USB drive and reboot the machine via:
```
sudo reboot
```
### [Optional] Register the machine with the DNS service.
This section is specific to the Microsoft internal setup. You may use the service at http://namesweb/ to self-generate a DNS record for your installed machine. If this machine has a prior DNS name (e.g., a prior Redmond domain machine), you may need to first delete the dynamic DNS record of the prior machine, and then add your machine. If your machine has multiple IP addresses, please be sure to add all of the IP addresses to the DNS record.
### Generate Certificates for API server & etcd server
Go to folder 'src/ClusterBootstrap/ssl', and perform the following operations:
1. Copy openssl-apiserver.cnf.template to openssl-apiserver.cnf, and edit the configuration file (see the sketch after this list):
   * Add DNS names for the Kubernetes master:
     * Add entries DNS.5, DNS.6, etc. for each new DNS name.
   * Add the IP addresses of the Kubernetes master:
     * Replace ${K8S_SERVICE_IP} with the IP of the Kubernetes service, which defaults to "10.3.0.1".
     * Replace ${MASTER_HOST} with the host IPs. If the deployed machine has multiple IP addresses, they should all be added here, e.g., via an additional entry such as "IP.3".
2. Copy openssl-etcd.cnf.template to openssl-etcd.cnf, and edit the configuration file:
   * Add DNS names for the etcd server [similar to above].
   * Add the IP addresses of the etcd server [similar to above].
3. Run 'gencerts.sh' to generate the certificates.
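For illustration, the [alt_names] section of an edited openssl-apiserver.cnf might look like the sketch below; DNS.1 through DNS.4 follow the standard Kubernetes template, while the master DNS name and host IPs here are hypothetical:
```
[alt_names]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster.local
DNS.5 = kubemaster.example.com
IP.1 = 10.3.0.1
IP.2 = 192.168.1.10
IP.3 = 10.0.0.10
```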
### Modify configuration file for the deployed docker container.
Go to directory 'src/ClusterBootstrap', and copy 'config.yaml.template' to 'config.yaml'. Edit the configuration file with the following information:
1. Replace all entries of '{{kubernete_master_dnsname_or_ip}}' with either the DNS name or one of the IP addresses of the Kubernetes master. This will be used for bootstrapping the Kubernetes master [deploying the Kubernetes configuration and certificates].
2. Replace '{{user_at_kubernete_master}}' with the user authorized during CoreOS installation.
3. Replace '{{user_at_etcd_server}}' with the user authorized during CoreOS installation.
4. Replace '{{apiserver_dnsname_or_ip}}' with either the DNS name or one of the IP addresses of the etcd server.
5. Generate an SSH key for accessing the Kubernetes cluster via the following command:
```
ssh-keygen -t rsa -b 4096
```
You may want to store the generated key under the current directory, instead of the default ('~/.ssh').
This key is used only in the Kubernetes deployment process. Therefore, you can discard the SSH key after the entire deployment procedure has been completed.
6. Replace {{apiserver_password,apiserver_username,apiserver_group}} with the password, username, and group used to administer the API server. For example, if you use "helloworld,adam,1000", then the API server can be administered with username adam and password helloworld (see the basic-auth sketch after this list).
7. Build the Kubernetes binary.
DL Workspace needs a multi-GPU-aware Kubernetes build. Currently, a validated Kubernetes binary is provided as part of the docker image released at mlcloudreg.westus.cloudapp.azure.com:5000/hyperkube:multigpu.
8. Replace {{pxe_docker_image}} and {{webserver_docker_image}} with valid docker registry entries.
These two images are the outcome of the build process in 'deploy.py', and will be used to deploy to the Kubernetes cluster.
9. Run 'python deploy.py' to deploy the Kubernetes masters, etcd servers, and API servers.
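For reference, the password/username/group triple in step 6 appears to follow the Kubernetes API server basic-auth CSV format, one record per line; a sketch using the example values above:
```
# basic-auth file (sketch): password,user,uid
helloworld,adam,1000
```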
### Deploy additional worker nodes
1. Connect the worker nodes to the private network, and boot the worker nodes using the network boot option.
2. Wait until the worker nodes shut down automatically.
3. Disconnect the worker nodes from the private network.
4. Restart the worker nodes.
Done: the worker nodes will automatically register themselves to the Kubernetes cluster and install the necessary drivers.

@@ -1,136 +0,0 @@
# Configuration file for OneNet test cluster
# An example of using bonded interfaces.
# Known issue: only one IP becomes visible after multiple interfaces get bonded.
#
id: onenet
status: test
coreos:
  version: 1010.5.0
  # Additional configuration to be passed to the write_files section.
  write_files: |
    - path: /etc/modprobe.d/bonding.conf
      content: |
        # Prevent kernel from automatically creating bond0 when the module is loaded.
        # This allows systemd-networkd to create and apply options to bond0.
        options bonding max_bonds=0
    - path: /etc/systemd/network/10-eth.network
      permissions: 0644
      owner: root
      content: |
        [Match]
        Name=ens2f*

        [Network]
        Bond=bond0
    - path: /etc/systemd/network/20-bond.netdev
      permissions: 0644
      owner: root
      content: |
        [NetDev]
        Name=bond0
        Kind=bond

        [Bond]
        # balance-rr is the default bonding mode.
        Mode=balance-rr
        MIIMonitorSec=1
    - path: /etc/systemd/network/30-bond-dhcp.network
      permissions: 0644
      owner: root
      content: |
        [Match]
        Name=bond0

        [Network]
        DHCP=ipv4
  units: |
    - name: down-interfaces.service
      command: start
      content: |
        [Service]
        Type=oneshot
        ExecStart=/usr/bin/ip link set ens2f0 down
        ExecStart=/usr/bin/ip addr flush dev ens2f0
        ExecStart=/usr/bin/ip link set ens2f1 down
        ExecStart=/usr/bin/ip addr flush dev ens2f1
    - name: systemd-networkd.service
      command: restart
# Global flag which enables the automatic failure recovery functionality.
autoRecovery: True
network:
  domain: redmond.corp.microsoft.com
  # corpnet DNS servers
  externalDnsServers:
    - 10.222.118.154
    - 157.54.14.178
    - 4.2.2.1
#ignoreAlerts:
# SKUs are optional in DL workspace operation.
skus:
  standard:
    mem: 196
    cpu:
      type: E5-2450L
      speed: 1855
      sockets: 2
      coresPerSocket: 8
      count: 16
    disk:
      sda: 400
      sdb: 6001
      sdc: 6001
      sdd: 6001
machines:
  # OneNet rack.
  # If the host name and MAC address are available, they will be used to set the host name of the machine
  # (with the network/domain entry above, if it exists).
  onenet13:
    sku: standard
    mac: 9c:b6:54:8d:01:6b
  onenet14:
    sku: standard
    mac: 9c:b6:54:8d:60:67
  onenet15:
    sku: standard
    mac: 9c:b6:54:8c:ff:2f
  onenet16:
    sku: standard
    mac: 9c:b6:54:90:35:2b
  onenet17:
    sku: standard
    mac: 9c:b6:54:8c:8f:02
  onenet18:
    sku: standard
    mac: 9c:b6:54:8c:cf:bb
  onenet19:
    sku: standard
    mac: 9c:b6:54:8c:8f:b3
  onenet20:
    sku: standard
    mac: 9c:b6:54:8d:70:c7

@@ -1,5 +0,0 @@
#!/bin/bash
# Install the NVIDIA driver by running the driver container, then register the
# driver's library path with the dynamic linker.
docker run --privileged -v /opt/nvidia-driver:/opt/nvidia-driver -v /opt/nvidia-docker:/opt/nvidia-docker -v /opt/bin:/opt/bin -v /dev:/dev mlcloudreg.westus.cloudapp.azure.com:5000/nvidia_driver:GeForce375.20 && \
sudo mkdir -p /etc/ld.so.conf.d/ && \
sudo tee /etc/ld.so.conf.d/nvidia-ml.conf <<< /opt/nvidia-driver/volumes/nvidia_driver/375.20/lib64 && \
sudo ldconfig
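# Optional sanity check (an addition for illustration, not part of the original
# script): confirm the NVIDIA ML library is now visible to the dynamic linker.
ldconfig -p | grep -i nvidia-ml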

@@ -1,27 +0,0 @@
[Unit]
Description=NVIDIA Docker plugin
After=local-fs.target network.target nvidia-driver.service
Wants=docker.service nvidia-driver.service

[Service]
Environment="SOCK_DIR=/var/lib/nvidia-docker"
Environment="SPEC_FILE=/etc/docker/plugins/nvidia-docker.spec"
Environment="NVIDIA_VERSION={{cnf["nvidiadriverversion"]}}"
Restart=on-failure
RestartSec=10
TimeoutStartSec=0
TimeoutStopSec=20
# Download the nvidia-docker binaries if they are not already installed.
ExecStartPre=/bin/bash -c 'if [ ! -f /opt/bin/nvidia-docker ]; then wget -q -O - https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1_amd64.tar.xz | sudo tar --strip-components=1 -C /opt/bin -Jxvf - ; fi'
# If an NVIDIA GPU is present, wait until the NVIDIA driver is fully loaded.
ExecStartPre=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then until [ -f /proc/driver/nvidia/version ] && grep -q $NVIDIA_VERSION /proc/driver/nvidia/version && lsmod | grep -qE "^nvidia" && [ -e /dev/nvidia0 ] && [ -e /opt/nvidia-driver/current/lib64/libnvidia-ml.so ] ; do /bin/echo "waiting for nvidia-driver..." ; /bin/sleep 2 ; done ; else exit 0 ; fi'
ExecStartPre=/bin/bash -c 'docker volume rm nvidia_driver_$NVIDIA_VERSION ; exit 0'
# Start the plugin only on machines that actually have an NVIDIA GPU.
ExecStart=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then PATH=$PATH:/opt/bin /opt/bin/nvidia-docker-plugin -s $SOCK_DIR ; else exit 0 ; fi'
ExecStartPost=/bin/bash -c '/bin/mkdir -p $( dirname $SPEC_FILE ) ; exit 0'
ExecStartPost=/bin/bash -c '/bin/echo unix://$SOCK_DIR/nvidia-docker.sock > $SPEC_FILE ; exit 0'
ExecStopPost=/bin/bash -c '/bin/rm -f $SPEC_FILE ; exit 0'
ExecStopPost=/bin/bash -c '/bin/rm /opt/nvidia-docker-plugin.log ; exit 0'
ExecStopPost=/bin/bash -c 'docker volume rm nvidia_driver_$NVIDIA_VERSION ; exit 0'

[Install]
WantedBy=multi-user.target

@@ -1,21 +0,0 @@
[Unit]
Description=Install NVIDIA driver
After=local-fs.target network.target docker.service
Wants=docker.service

[Service]
Environment=IMG={{cnf["nvidiadriverdocker"]}} CNAME=nvidia-driver
RemainAfterExit=yes
Restart=on-failure
RestartSec=10
TimeoutStartSec=1200
TimeoutStopSec=120
# Pull the driver image if an NVIDIA GPU is present and the image is not cached locally.
ExecStartPre=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then /usr/bin/docker inspect $IMG &> /dev/null || /usr/bin/docker pull $IMG ; else exit 0 ; fi'
# Remove any stale driver container left over from a previous run.
ExecStartPre=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then /usr/bin/docker rm $CNAME &> /dev/null; exit 0 ; else exit 0 ; fi'
# Run the driver installer container and register the driver libraries with the dynamic linker.
ExecStartPre=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then docker run --name $CNAME --privileged -v /opt/nvidia-driver:/opt/nvidia-driver -v /opt/bin:/opt/bin -v /dev:/dev $IMG && mkdir -p /etc/ld.so.conf.d/ && tee /etc/ld.so.conf.d/nvidia-ml.conf <<< /opt/nvidia-driver/current/lib64 && ldconfig ; else exit 0 ; fi'
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

@@ -1,7 +0,0 @@
#!/bin/sh
# Download the worker cloud-config from the PXE server, install CoreOS, then power off.
wget http://192.168.1.20/pxe-coreos-kube.yml
sudo coreos-install -d /dev/sda -c pxe-coreos-kube.yml -b http://192.168.1.20/coreos -V 1185.5.0
sudo shutdown -h now

@@ -1,5 +0,0 @@
#!/bin/sh
# Download the master cloud-config from the PXE server, install CoreOS, then power off.
wget http://192.168.1.20/pxe-kubemaster.yml
sudo coreos-install -d /dev/sda -c pxe-kubemaster.yml -b http://192.168.1.20/coreos
sudo shutdown -h now

@@ -1,32 +0,0 @@
#! /bin/bash
# Bootstrap a Kubernetes worker: fetch certificates and configuration from the
# web server, then start flannel and the kubelet.
export HostIP=$(ip route get 8.8.8.8 | awk '{print $NF; exit}')
mkdir -p /etc/kubernetes/ssl/
mkdir -p /etc/flannel
mkdir -p /etc/kubernetes/manifests
hostnamectl set-hostname $HostIP
if [ ! -f /etc/kubernetes/ssl/worker.pem ]; then
    # Request the CA and worker certificates from the certificate generation service.
    certstr=`wget -q -O - http://ccsdatarepo.westus.cloudapp.azure.com:9090/?workerId=$HOSTNAME\&workerIP=$HostIP`
    IFS=',' read -ra certs <<< "$certstr"
    echo ${certs[0]} | base64 -d > /etc/kubernetes/ssl/ca.pem
    echo ${certs[1]} | base64 -d > /etc/kubernetes/ssl/worker.pem
    echo ${certs[2]} | base64 -d > /etc/kubernetes/ssl/worker-key.pem
    # Configure flannel and its systemd drop-ins.
    echo "FLANNELD_IFACE=${HostIP}" > /etc/flannel/options.env
    echo "FLANNELD_ETCD_ENDPOINTS=http://104.42.96.204:2379" >> /etc/flannel/options.env
    mkdir -p /etc/systemd/system/flanneld.service.d/
    wget -q -O "/etc/systemd/system/flanneld.service.d/40-ExecStartPre-symlink.conf" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/40-ExecStartPre-symlink.conf
    mkdir -p /etc/systemd/system/docker.service.d
    wget -q -O "/etc/systemd/system/docker.service.d/40-flannel.conf" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/40-flannel.conf
    # Fetch the kubelet unit, kube-proxy manifest, and worker kubeconfig.
    wget -q -O "/etc/systemd/system/kubelet.service" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/kubelet.service
    wget -q -O "/etc/kubernetes/manifests/kube-proxy.yaml" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/kube-proxy.yaml
    wget -q -O "/etc/kubernetes/worker-kubeconfig.yaml" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/worker-kubeconfig.yaml
fi
systemctl daemon-reload
systemctl start flanneld
systemctl start kubelet
systemctl enable flanneld
systemctl enable kubelet
systemctl start rpc-statd

File diffs are hidden because one or more lines are too long.

@@ -1,61 +0,0 @@
#cloud-config
coreos:
  units:
    - name: fleet.service
      command: start
    - name: bootstrap.service
      command: start
      content: |
        [Unit]
        Description=Bootstrap instance
        After=network-online.target
        Requires=network-online.target

        [Service]
        Type=oneshot
        #RemainAfterExit=true
        #ExecStartPre=/bin/bash -c 'until ping -c1 192.168.1.20; do sleep 1; done;'
        ExecStart=/bin/bash /opt/init_k8s.sh

        [Install]
        WantedBy=local.target
ssh_authorized_keys:
  - {{cnf["sshkey"]}}
write_files:
  - path: "/opt/init_k8s.sh"
    permissions: "0755"
    owner: "root"
    content: |
      #! /bin/bash
      wget -q -O - http://{{cnf["webserver"]}}/kubelet.sh | sudo bash -s

@@ -1,3 +0,0 @@
# Docker images used in DL workspace.
* dev: docker image used for development

@@ -1,57 +0,0 @@
FROM ubuntu:16.04
MAINTAINER Jin Li <jinlmsft@hotmail.com>
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    wget \
    protobuf-compiler \
    python-dev \
    python-numpy \
    python-pip
# Install docker
RUN apt-get update; apt-get install -y apt-transport-https ca-certificates
RUN apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
RUN mkdir -p /etc/apt/sources.list.d
RUN echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" > /etc/apt/sources.list.d/docker.list
RUN apt-get update
RUN apt-cache policy docker-engine
RUN apt-get install -y --no-install-recommends linux-image-extra-$(uname -r) linux-image-extra-virtual
# RUN apt-get update && apt-get install -y docker-engine
# Install go 1.7.4 via gvm (for kubernetes); go 1.4 is installed first to bootstrap the build
RUN wget https://raw.githubusercontent.com/moovweb/gvm/master/binscripts/gvm-installer
RUN bash gvm-installer
RUN apt-get install -y bison curl
RUN chmod +x /root/.gvm/scripts/gvm
ENV PATH="$PATH:/root/.gvm/bin"
RUN /bin/bash -c "source /root/.gvm/scripts/gvm; gvm install go1.4; gvm use go1.4; export GOROOT_BOOTSTRAP=$GOROOT; gvm install go1.7.4; gvm use go1.7.4"
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql.list
# Install the ODBC drivers and pyodbc for Azure SQL (ACCEPT_EULA is required by the Microsoft packages)
RUN apt-get update; ACCEPT_EULA=Y apt-get install -y msodbcsql mssql-tools unixodbc-dev-utf16; pip install pyodbc==3.1.1
WORKDIR /home/code

@@ -1,5 +0,0 @@
# Docker environment for DL workspace development.
We will create a docker image for running development of DL workspace. Please follow the procedure below.
1. Build the docker image (dev:latest) with the following key components: docker-engine,
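As a sketch of the expected build-and-run workflow (the run flags here are illustrative; /home/code matches the Dockerfile's WORKDIR):
```
docker build -t dev:latest .
docker run -it --rm -v $(pwd):/home/code dev:latest bash
```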

@@ -36,6 +36,8 @@ DL Workspace uses a SQL server or SQL Azure to store user information (uid, gid,
If you are using SQL Azure, we recommend changing the database DLWorkspaceCluster-xxxxx to the S4 tier. The heaviest use of the database occurs when the Web Portal is left open to watch the execution status of a particular job, as the job status information is stored in the SQL database. An S0 instance can quickly max out during job queries.
Please note that SQL Azure attaches a pricing tier to each database. Only the database DLWorkspaceCluster-xxxxx needs to be bumped to a higher pricing tier; for most usage cases, the other databases can be left at S0.
Investigating better ways to organize the data and reduce the load on the database, or selecting a database implementation with better performance, is on the work plan.
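For example, if you manage the database with the Azure CLI, the pricing tier can be raised with something like the following sketch (the resource group and server names are placeholders):
```
az sql db update \
  --resource-group MyResourceGroup \
  --server mysqlserver \
  --name DLWorkspaceCluster-xxxxx \
  --service-objective S4
```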
4. Other databases.