Merge pull request #28 from jinlccs/master

remove boneyard.
This commit is contained in:
jinl 2017-09-26 17:18:48 -07:00 committed by GitHub
Parents: ceaf0ef69a 20db5e9414
Commit: 06d78c9d6a
15 changed files: 2 additions and 443 deletions

@@ -1,3 +0,0 @@
# Boneyard
This folder contains code and descriptions that are being phased out during the development of DL workspace.

@@ -1,80 +0,0 @@
# Deployment of DL workspace cluster
This document describes the procedure to build and deploy a small DL workspace cluster. You will need to build and deploy the following nodes:
1. One Kubernetes master server,
2. Etcd server (or an etcd cluster with multiple nodes for redundant operation),
3. API servers,
4. Web server to host kubelet configuration files and the certificate generation service,
5. Multiple Kubernetes worker nodes for running jobs.
Basic knowledge of Linux and CoreOS will be very helpful for following the deployment instructions.
## Deploy Kubernetes master server, etcd, web servers (these servers will then deploy all other servers in the cluster)
We describe the steps to install and deploy a customized Kubernetes cluster. It is possible to install the master server, etcd, and web server on a single machine. Using multiple etcd servers provides the benefit of redundancy in case one of the servers fails.
### Base CoreOS deployment.
This section describes the process of deploying a base CoreOS image through a USB stick. In a production environment, there may be more efficient mechanisms to deploy CoreOS images to machines.
Please prepare a cloud-config file that will be used to bootstrap the deployed CoreOS machine. A sample config file is provided at /src/ClusterBootstrap/CoreOSConfig/pxe-kubemaster.yml.template. Please copy the file to pxe-kubemaster.yml, and fill in the username, password, and SSH key information according to the instructions in [the CoreOS cloud-config documentation](https://coreos.com/os/docs/latest/cloud-config.html). Then, either host pxe-kubemaster.yml on a web service that you control (you will need to download the cloud-config file during boot, [following these instructions](CoreOSBoot.md)), or put it on a [bootable USB](USBBootable.md).
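For reference, a minimal cloud-config might look like the sketch below; the hostname, username, and SSH key are placeholders, and the template file remains the authoritative source for the full set of options:
```
#cloud-config
hostname: kubemaster                  # placeholder hostname
users:
  - name: core                        # replace with your username
    ssh-authorized-keys:
      - ssh-rsa AAAA... user@host     # replace with your public SSH key
    groups:
      - sudo
      - docker
```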
You may then install CoreOS via:
```
sudo coreos-install -d /dev/sda -C stable -c [LOCAL_CLOUD_CONFIG_FILE]
```
Once installation is completed, please use 'ifconfig' to print the IP address of the machine. The IP addresses are used later to access the deployed CoreOS machine. After successful installation, you can unplug the USB drive and reboot the machine via:
```
sudo reboot
```
### [Optional] Register the machine with the DNS service.
This section is specific to the Microsoft internal setup. You may use the service at http://namesweb/ to self-generate a DNS record for your installed machine. If this machine has a prior DNS name (e.g., a prior Redmond domain machine), you may need to first delete the dynamic DNS record of the prior machine, and then add your machine. If your machine has multiple IP addresses, please be sure to add all of the IP addresses to the DNS record.
### Generate Certificates for API server & etcd server
Go to folder 'src/ClusterBootstrap/ssl', and perform the following operations:
1. Copy openssl-apiserver.cnf.template to openssl-apiserver.cnf, and edit the configuration file (see the sketch after this list):
   * Add DNS names for the Kubernetes master:
     * Add entries DNS.5, DNS.6, etc. for each new DNS name.
   * Add the IP addresses of the Kubernetes master:
     * Replace ${K8S_SERVICE_IP} with the IP of the Kubernetes service, which defaults to "10.3.0.1".
     * Replace ${MASTER_HOST} with the host IPs. If the deployed machine has multiple IP addresses, they should all be added here, e.g., via an additional entry such as "IP.3".
2. Copy openssl-etcd.cnf.template to openssl-etcd.cnf, and edit the configuration file:
   * Add DNS names for the etcd server [similar to above].
   * Add the IP addresses of the etcd server [similar to above].
3. Run 'gencerts.sh' to generate the certificates.
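For illustration, the [alt_names] section of an edited openssl-apiserver.cnf might look like the sketch below; DNS.1 through DNS.4 follow the standard Kubernetes template, while the master DNS name and host IPs here are hypothetical:
```
[alt_names]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster.local
DNS.5 = kubemaster.example.com
IP.1 = 10.3.0.1
IP.2 = 192.168.1.10
IP.3 = 10.0.0.10
```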
### Modify configuration file for the deployed docker container.
Go to directory 'src/ClusterBootstrap', and copy 'config.yaml.template' to 'config.yaml'. Edit the configuration file with the following information:
1. Replace all entries of '{{kubernete_master_dnsname_or_ip}}' with either the DNS name or one of the IP addresses of the Kubernetes master. This will be used for bootstrapping the Kubernetes master [deploying the Kubernetes configuration and certificates].
2. Replace '{{user_at_kubernete_master}}' with the user authorized during CoreOS installation.
3. Replace '{{user_at_etcd_server}}' with the user authorized during CoreOS installation.
4. Replace '{{apiserver_dnsname_or_ip}}' with either the DNS name or one of the IP addresses of the etcd server.
5. Generate an SSH key for accessing the Kubernetes cluster via the following command:
```
ssh-keygen -t rsa -b 4096
```
You may want to store the generated key under the current directory, instead of the default ('~/.ssh').
This key is used only in the Kubernetes deployment process. Therefore, you can discard the SSH key after the entire deployment procedure has been completed.
6. Replace {{apiserver_password,apiserver_username,apiserver_group}} with the password, username, and group used to administer the API server. For example, if you use "helloworld,adam,1000", then the API server can be administered with username adam and password helloworld (see the basic-auth sketch after this list).
7. Build the Kubernetes binary.
DL Workspace needs a multi-GPU-aware Kubernetes build. Currently, a validated Kubernetes binary is provided as part of the docker image released at mlcloudreg.westus.cloudapp.azure.com:5000/hyperkube:multigpu.
8. Replace {{pxe_docker_image}} and {{webserver_docker_image}} with valid docker registry entries.
These two images are the outcome of the build process in 'deploy.py', and will be used to deploy to the Kubernetes cluster.
9. Run 'python deploy.py' to deploy the Kubernetes masters, etcd servers, and API servers.
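For reference, the password/username/group triple in step 6 appears to follow the Kubernetes API server basic-auth CSV format, one record per line; a sketch using the example values above:
```
# basic-auth file (sketch): password,user,uid
helloworld,adam,1000
```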
### Deploy additional worker nodes
1. Connect the worker nodes to the private network, and boot the worker nodes using the network boot option.
2. Wait until the worker nodes shut down automatically.
3. Disconnect the worker nodes from the private network.
4. Restart the worker nodes.
Done: the worker nodes will automatically register themselves to the Kubernetes cluster and install the necessary drivers.

@@ -1,136 +0,0 @@
# Configuration file for OneNet test cluster
# An example of using bonded interfaces.
# Known issue: only one IP becomes visible after multiple interfaces get bonded.
#
id: onenet
status: test
coreos:
  version: 1010.5.0
  # Additional configuration to be passed to the write_files section.
  write_files: |
    - path: /etc/modprobe.d/bonding.conf
      content: |
        # Prevent kernel from automatically creating bond0 when the module is loaded.
        # This allows systemd-networkd to create and apply options to bond0.
        options bonding max_bonds=0
    - path: /etc/systemd/network/10-eth.network
      permissions: 0644
      owner: root
      content: |
        [Match]
        Name=ens2f*

        [Network]
        Bond=bond0
    - path: /etc/systemd/network/20-bond.netdev
      permissions: 0644
      owner: root
      content: |
        [NetDev]
        Name=bond0
        Kind=bond

        [Bond]
        # balance-rr is the default bonding mode.
        Mode=balance-rr
        MIIMonitorSec=1
    - path: /etc/systemd/network/30-bond-dhcp.network
      permissions: 0644
      owner: root
      content: |
        [Match]
        Name=bond0

        [Network]
        DHCP=ipv4
  units: |
    - name: down-interfaces.service
      command: start
      content: |
        [Service]
        Type=oneshot
        ExecStart=/usr/bin/ip link set ens2f0 down
        ExecStart=/usr/bin/ip addr flush dev ens2f0
        ExecStart=/usr/bin/ip link set ens2f1 down
        ExecStart=/usr/bin/ip addr flush dev ens2f1
    - name: systemd-networkd.service
      command: restart
# Global flag which enables the automatic failure recovery functionality.
autoRecovery: True
network:
  domain: redmond.corp.microsoft.com
  # corpnet DNS servers
  externalDnsServers:
    - 10.222.118.154
    - 157.54.14.178
    - 4.2.2.1
#ignoreAlerts:
# SKUs are optional in DL workspace operation.
skus:
  standard:
    mem: 196
    cpu:
      type: E5-2450L
      speed: 1855
      sockets: 2
      coresPerSocket: 8
      count: 16
    disk:
      sda: 400
      sdb: 6001
      sdc: 6001
      sdd: 6001
machines:
  # OneNet rack.
  # If the host name and MAC address are available, they will be used to set the host name of the machine
  # (with the network/domain entry above, if it exists).
  onenet13:
    sku: standard
    mac: 9c:b6:54:8d:01:6b
  onenet14:
    sku: standard
    mac: 9c:b6:54:8d:60:67
  onenet15:
    sku: standard
    mac: 9c:b6:54:8c:ff:2f
  onenet16:
    sku: standard
    mac: 9c:b6:54:90:35:2b
  onenet17:
    sku: standard
    mac: 9c:b6:54:8c:8f:02
  onenet18:
    sku: standard
    mac: 9c:b6:54:8c:cf:bb
  onenet19:
    sku: standard
    mac: 9c:b6:54:8c:8f:b3
  onenet20:
    sku: standard
    mac: 9c:b6:54:8d:70:c7

@@ -1,5 +0,0 @@
#!/bin/bash
# Install the NVIDIA driver by running the driver container, then register the
# driver's library path with the dynamic linker.
docker run --privileged -v /opt/nvidia-driver:/opt/nvidia-driver -v /opt/nvidia-docker:/opt/nvidia-docker -v /opt/bin:/opt/bin -v /dev:/dev mlcloudreg.westus.cloudapp.azure.com:5000/nvidia_driver:GeForce375.20 && \
sudo mkdir -p /etc/ld.so.conf.d/ && \
sudo tee /etc/ld.so.conf.d/nvidia-ml.conf <<< /opt/nvidia-driver/volumes/nvidia_driver/375.20/lib64 && \
sudo ldconfig
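# Optional sanity check (an addition for illustration, not part of the original
# script): confirm the NVIDIA ML library is now visible to the dynamic linker.
ldconfig -p | grep -i nvidia-ml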

@@ -1,27 +0,0 @@
[Unit]
Description=NVIDIA Docker plugin
After=local-fs.target network.target nvidia-driver.service
Wants=docker.service nvidia-driver.service

[Service]
Environment="SOCK_DIR=/var/lib/nvidia-docker"
Environment="SPEC_FILE=/etc/docker/plugins/nvidia-docker.spec"
Environment="NVIDIA_VERSION={{cnf["nvidiadriverversion"]}}"
Restart=on-failure
RestartSec=10
TimeoutStartSec=0
TimeoutStopSec=20
# Download the nvidia-docker binaries if they are not already installed.
ExecStartPre=/bin/bash -c 'if [ ! -f /opt/bin/nvidia-docker ]; then wget -q -O - https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1_amd64.tar.xz | sudo tar --strip-components=1 -C /opt/bin -Jxvf - ; fi'
# If an NVIDIA GPU is present, wait until the NVIDIA driver is fully loaded.
ExecStartPre=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then until [ -f /proc/driver/nvidia/version ] && grep -q $NVIDIA_VERSION /proc/driver/nvidia/version && lsmod | grep -qE "^nvidia" && [ -e /dev/nvidia0 ] && [ -e /opt/nvidia-driver/current/lib64/libnvidia-ml.so ] ; do /bin/echo "waiting for nvidia-driver..." ; /bin/sleep 2 ; done ; else exit 0 ; fi'
ExecStartPre=/bin/bash -c 'docker volume rm nvidia_driver_$NVIDIA_VERSION ; exit 0'
# Start the plugin only on machines that actually have an NVIDIA GPU.
ExecStart=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then PATH=$PATH:/opt/bin /opt/bin/nvidia-docker-plugin -s $SOCK_DIR ; else exit 0 ; fi'
ExecStartPost=/bin/bash -c '/bin/mkdir -p $( dirname $SPEC_FILE ) ; exit 0'
ExecStartPost=/bin/bash -c '/bin/echo unix://$SOCK_DIR/nvidia-docker.sock > $SPEC_FILE ; exit 0'
ExecStopPost=/bin/bash -c '/bin/rm -f $SPEC_FILE ; exit 0'
ExecStopPost=/bin/bash -c '/bin/rm /opt/nvidia-docker-plugin.log ; exit 0'
ExecStopPost=/bin/bash -c 'docker volume rm nvidia_driver_$NVIDIA_VERSION ; exit 0'

[Install]
WantedBy=multi-user.target

@@ -1,21 +0,0 @@
[Unit]
Description=Install NVIDIA driver
After=local-fs.target network.target docker.service
Wants=docker.service

[Service]
Environment=IMG={{cnf["nvidiadriverdocker"]}} CNAME=nvidia-driver
RemainAfterExit=yes
Restart=on-failure
RestartSec=10
TimeoutStartSec=1200
TimeoutStopSec=120
# Pull the driver image if an NVIDIA GPU is present and the image is not cached locally.
ExecStartPre=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then /usr/bin/docker inspect $IMG &> /dev/null || /usr/bin/docker pull $IMG ; else exit 0 ; fi'
# Remove any stale driver container left over from a previous run.
ExecStartPre=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then /usr/bin/docker rm $CNAME &> /dev/null; exit 0 ; else exit 0 ; fi'
# Run the driver installer container and register the driver libraries with the dynamic linker.
ExecStartPre=/bin/bash -c 'if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*"; then docker run --name $CNAME --privileged -v /opt/nvidia-driver:/opt/nvidia-driver -v /opt/bin:/opt/bin -v /dev:/dev $IMG && mkdir -p /etc/ld.so.conf.d/ && tee /etc/ld.so.conf.d/nvidia-ml.conf <<< /opt/nvidia-driver/current/lib64 && ldconfig ; else exit 0 ; fi'
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

@@ -1,7 +0,0 @@
#!/bin/sh
# Download the worker cloud-config from the PXE server, install CoreOS, then power off.
wget http://192.168.1.20/pxe-coreos-kube.yml
sudo coreos-install -d /dev/sda -c pxe-coreos-kube.yml -b http://192.168.1.20/coreos -V 1185.5.0
sudo shutdown -h now

@@ -1,5 +0,0 @@
#!/bin/sh
# Download the master cloud-config from the PXE server, install CoreOS, then power off.
wget http://192.168.1.20/pxe-kubemaster.yml
sudo coreos-install -d /dev/sda -c pxe-kubemaster.yml -b http://192.168.1.20/coreos
sudo shutdown -h now

@@ -1,32 +0,0 @@
#! /bin/bash
# Bootstrap a Kubernetes worker: fetch certificates and configuration from the
# web server, then start flannel and the kubelet.
export HostIP=$(ip route get 8.8.8.8 | awk '{print $NF; exit}')
mkdir -p /etc/kubernetes/ssl/
mkdir -p /etc/flannel
mkdir -p /etc/kubernetes/manifests
hostnamectl set-hostname $HostIP
if [ ! -f /etc/kubernetes/ssl/worker.pem ]; then
    # Request the CA and worker certificates from the certificate generation service.
    certstr=`wget -q -O - http://ccsdatarepo.westus.cloudapp.azure.com:9090/?workerId=$HOSTNAME\&workerIP=$HostIP`
    IFS=',' read -ra certs <<< "$certstr"
    echo ${certs[0]} | base64 -d > /etc/kubernetes/ssl/ca.pem
    echo ${certs[1]} | base64 -d > /etc/kubernetes/ssl/worker.pem
    echo ${certs[2]} | base64 -d > /etc/kubernetes/ssl/worker-key.pem
    # Configure flannel and its systemd drop-ins.
    echo "FLANNELD_IFACE=${HostIP}" > /etc/flannel/options.env
    echo "FLANNELD_ETCD_ENDPOINTS=http://104.42.96.204:2379" >> /etc/flannel/options.env
    mkdir -p /etc/systemd/system/flanneld.service.d/
    wget -q -O "/etc/systemd/system/flanneld.service.d/40-ExecStartPre-symlink.conf" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/40-ExecStartPre-symlink.conf
    mkdir -p /etc/systemd/system/docker.service.d
    wget -q -O "/etc/systemd/system/docker.service.d/40-flannel.conf" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/40-flannel.conf
    # Fetch the kubelet unit, kube-proxy manifest, and worker kubeconfig.
    wget -q -O "/etc/systemd/system/kubelet.service" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/kubelet.service
    wget -q -O "/etc/kubernetes/manifests/kube-proxy.yaml" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/kube-proxy.yaml
    wget -q -O "/etc/kubernetes/worker-kubeconfig.yaml" http://ccsdatarepo.westus.cloudapp.azure.com/data/kube/kubelet/worker-kubeconfig.yaml
fi
systemctl daemon-reload
systemctl start flanneld
systemctl start kubelet
systemctl enable flanneld
systemctl enable kubelet
systemctl start rpc-statd

File diffs are hidden because one or more lines are too long.

@@ -1,61 +0,0 @@
#cloud-config
coreos:
  units:
    - name: fleet.service
      command: start
    - name: bootstrap.service
      command: start
      content: |
        [Unit]
        Description=Bootstrap instance
        After=network-online.target
        Requires=network-online.target

        [Service]
        Type=oneshot
        #RemainAfterExit=true
        #ExecStartPre=/bin/bash -c 'until ping -c1 192.168.1.20; do sleep 1; done;'
        ExecStart=/bin/bash /opt/init_k8s.sh

        [Install]
        WantedBy=local.target
ssh_authorized_keys:
  - {{cnf["sshkey"]}}
write_files:
  - path: "/opt/init_k8s.sh"
    permissions: "0755"
    owner: "root"
    content: |
      #! /bin/bash
      wget -q -O - http://{{cnf["webserver"]}}/kubelet.sh | sudo bash -s

@@ -1,3 +0,0 @@
# Docker images used in DL workspace.
* dev: docker image used for development

@@ -1,57 +0,0 @@
FROM ubuntu:16.04
MAINTAINER Jin Li <jinlmsft@hotmail.com>
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    git \
    wget \
    protobuf-compiler \
    python-dev \
    python-numpy \
    python-pip
# Install docker
RUN apt-get update; apt-get install -y apt-transport-https ca-certificates
RUN apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
RUN mkdir -p /etc/apt/sources.list.d
RUN echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" > /etc/apt/sources.list.d/docker.list
RUN apt-get update
RUN apt-cache policy docker-engine
RUN apt-get install -y --no-install-recommends linux-image-extra-$(uname -r) linux-image-extra-virtual
# RUN apt-get update && apt-get install -y docker-engine
# Install go 1.7.4 via gvm (for kubernetes); go 1.4 is installed first to bootstrap the build
RUN wget https://raw.githubusercontent.com/moovweb/gvm/master/binscripts/gvm-installer
RUN bash gvm-installer
RUN apt-get install -y bison curl
RUN chmod +x /root/.gvm/scripts/gvm
ENV PATH="$PATH:/root/.gvm/bin"
RUN /bin/bash -c "source /root/.gvm/scripts/gvm; gvm install go1.4; gvm use go1.4; export GOROOT_BOOTSTRAP=$GOROOT; gvm install go1.7.4; gvm use go1.7.4"
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql.list
# Install the ODBC drivers and pyodbc for Azure SQL (ACCEPT_EULA is required by the Microsoft packages)
RUN apt-get update; ACCEPT_EULA=Y apt-get install -y msodbcsql mssql-tools unixodbc-dev-utf16; pip install pyodbc==3.1.1
WORKDIR /home/code

@@ -1,5 +0,0 @@
# Docker environment for DL workspace development.
We will create a docker image for running development of DL workspace. Please follow the procedure below.
1. Build the docker image (dev:latest) with the following key components: docker-engine,
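As a sketch of the expected build-and-run workflow (the run flags here are illustrative; /home/code matches the Dockerfile's WORKDIR):
```
docker build -t dev:latest .
docker run -it --rm -v $(pwd):/home/code dev:latest bash
```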

@@ -36,6 +36,8 @@ DL Workspace uses a SQL server or SQL Azure to store user information (uid, gid,
If you are using SQL Azure, we recommend changing the database DLWorkspaceCluster-xxxxx to the S4 tier. The heaviest use of the database occurs when the Web Portal is left open to watch the execution status of a particular job, as the job status information is stored in the SQL database. An S0 instance can quickly max out during job queries.
Please note that SQL Azure attaches a pricing tier to each database. Only the database DLWorkspaceCluster-xxxxx needs to be bumped to a higher pricing tier; for most usage cases, the other databases can be left at S0.
Investigating better ways to organize the data and reduce the load on the database, or selecting a database implementation with better performance, is on the work plan.
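For example, if you manage the database with the Azure CLI, the pricing tier can be raised with something like the following sketch (the resource group and server names are placeholders):
```
az sql db update \
  --resource-group MyResourceGroup \
  --server mysqlserver \
  --name DLWorkspaceCluster-xxxxx \
  --service-objective S4
```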
4. Other databases.