Merge pull request #22 from jinlccs/master

Customize Job dnsPolicy, add a script block for adding nodes, update documentation.
This commit is contained in:
jinl 2017-09-22 19:39:34 -07:00 committed by GitHub
Parents 89dbcd2cde eff515c0f1
Commit ae364b39cc
13 changed files with 90 additions and 56 deletions

View file

@@ -19,7 +19,7 @@ Here are a few short video clips that can quickly explain DLWorkspace. Note the P
## [DLWorkspace Cluster Deployment](docs/deployment/Readme.md)
## [Known Issues](docs/KnownIssues/Readme.md)
## [Frequently Asked Questions](docs/KnownIssues/Readme.md)
## [Presentation](docs/presentation/1707/Readme.md)

View file

@@ -1,7 +1,11 @@
# Known Issues in DL Workspace
# Frequently Asked Questions
## [Deployment](../deployment/knownissues/Readme.md)
* [Development Environment](../DevEnvironment/FAQ.md)
* [Deployment](../deployment/knownissues/Deployment_Issue.md)
* [Authentication](../deployment/authentication/FAQ.md)
* [Azure](../deployment/Azure/FAQ.md)
* [ACS](../deployment/ACS/FAQ.md)
## Running DL Workspace
## Using DL Workspace
* [Zombie Process](zombie_process.md)

View file

@@ -24,10 +24,7 @@ Here are a few short video clips that can quickly explain DLWorkspace. Note the P
* [On prem, Ubuntu Cluster](deployment/On-Prem/Ubuntu.md)
* [Single Ubuntu Computer](deployment/On-Prem/SingleUbuntu.md)
## Known Issues
* [Deployment Issue](deployment/knownissues/Readme.md)
* Container Issue: [Zombie Process](KnownIssues/zombie_process.md)
## [Frequently Asked Questions](KnownIssues/Readme.md)
## Presentation

View file

@@ -1,4 +1,6 @@
# Common Deployment Issues of DL workspace cluster
# Frequently Asked Questions During DL Workspace Deployment
We are still prototyping the platform. Please report issues to the author so that we can improve the documentation.
1. DL workspace deployment environment.
@@ -35,10 +37,36 @@
Are you sure you want to continue connecting (yes/no)? yes
```
Issue: when machines are redeployed, they get new host keys that differ from their prior host keys, which triggers the warning above each time a remote machine is connected.
Solution: remove the file /home/<username>/.ssh/known_hosts.
Solution: remove the affected hosts from /home/&lt;username&gt;/.ssh/known_hosts; you may also delete the entire file /home/&lt;username&gt;/.ssh/known_hosts.
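For a single redeployed machine, a minimal cleanup sketch (assumes the stock OpenSSH client; the hostname is an example, substitute your own):
```
# Remove the stale key of one redeployed machine from known_hosts
ssh-keygen -R mymachine.example.com
# Or, more drastically, forget all remembered host keys
rm /home/<username>/.ssh/known_hosts
```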
5. I see the default Apache server web page instead of DL Workspace.
The Apache server may be enabled by default on the installed node. Please run "sudo service apache2 stop" to stop the server.
6. I have a deployment failure.
Sometimes there are deployment glitches during script execution. Please execute the script again to see if the issue goes away.
7. 'docker pull' fails with error "layers from manifest don't match image configuration".
Please check the docker versions of the 'pull' machine, the 'push' machine, and the docker registry. This appears to be caused by incompatible docker versions. [See](https://github.com/docker/distribution/issues/1439)
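To compare versions quickly, this check can be run on the 'pull' and 'push' machines (a sketch; it uses docker's --format templating to print just the version numbers):
```
# Print the docker client and daemon versions on this machine
docker version --format 'client: {{.Client.Version}}  server: {{.Server.Version}}'
```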
8. If you don't use a domain, please don't add domain: "" to the configuration; the script appends a "." to the hostname, which causes the scripts to fail.
9. In some Ubuntu distributions, "NetworkManager" is enabled and sets the DNS name server to "127.0.1.1". This is OK on the host machine, but may cause issues in a container. Typically, if the container is not using the host network and inherits the DNS name server from the host, domain names cannot be resolved inside the container.
If the container network is not working, please check /etc/resolv.conf. If it contains the value "127.0.1.1", run:
```
./deploy.py runscriptonall ./scripts/disable_networkmanager.sh
```
NetworkManager is the program which (via the resolvconf utility) inserts the address 127.0.1.1 into resolv.conf. It inserts that address if and only if it is configured to start an instance of the dnsmasq program to serve as a local forwarding nameserver. That dnsmasq instance listens for queries at address 127.0.1.1.
If you do not want to use a local forwarding nameserver, configure NetworkManager not to start a dnsmasq instance and not to insert that address: in /etc/NetworkManager/NetworkManager.conf, comment out the line dns=dnsmasq
```
sudo vi /etc/NetworkManager/NetworkManager.conf
[main]
plugins=ifupdown,keyfile,ofono
#dns=dnsmasq
```
and restart the NetworkManager service.
```
sudo service network-manager restart
```
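After the restart, a quick check confirms that the local forwarder is gone (nameserver values vary by site; 127.0.1.1 should no longer appear):
```
# resolv.conf should no longer point at the local dnsmasq instance
grep nameserver /etc/resolv.conf
```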

View file

@@ -1,24 +0,0 @@
# Known issues.
We are still prototyping the platform. Please report issues to the author so that we can improve the documentation.
1. 'docker pull' fails with error "layers from manifest don't match image configuration".
Please check the docker versions of the 'pull' machine, the 'push' machine, and the docker registry. This appears to be caused by incompatible docker versions. [See](https://github.com/docker/distribution/issues/1439)
2. If you don't use a domain, please don't add domain: "" to the configuration; the script appends a "." to the hostname, which causes the scripts to fail.
3. In some Ubuntu distributions, "NetworkManager" is enabled and sets the DNS name server to "127.0.1.1". This is OK on the host machine, but may cause issues in a container. Typically, if the container is not using the host network and inherits the DNS name server from the host, domain names cannot be resolved inside the container.
If the container network is not working, please check /etc/resolv.conf. If the value is "127.0.1.1", please follow the instructions below to fix it:
NetworkManager is the program which (via the resolvconf utility) inserts the address 127.0.1.1 into resolv.conf. It inserts that address if and only if it is configured to start an instance of the dnsmasq program to serve as a local forwarding nameserver. That dnsmasq instance listens for queries at address 127.0.1.1.
If you do not want to use a local forwarding nameserver, configure NetworkManager not to start a dnsmasq instance and not to insert that address: in /etc/NetworkManager/NetworkManager.conf, comment out the line dns=dnsmasq
```
sudo vi /etc/NetworkManager/NetworkManager.conf
[main]
plugins=ifupdown,keyfile,ofono
#dns=dnsmasq
```
and restart the NetworkManager service.
```
sudo service network-manager restart
```

View file

@@ -1,10 +0,0 @@
# Common/Known Issues in Deployment
# Frequently Asked Questions (FAQs)
## [Authentication](../authentication/FAQ.md)
## [Deployment](Deployment_Issue.md)
## [Development Environment](../DevEnvironment/FAQ.md)

View file

@@ -26,10 +26,7 @@ Here are a few short video clips that can quickly explain DL Workspace. Note the
* [On prem, Ubuntu Cluster](deployment/On-Prem/Ubuntu.md)
* [Single Ubuntu Computer](deployment/On-Prem/SingleUbuntu.md)
## Known Issues
* [Deployment Issue](deployment/knownissues/Readme.md)
* Container Issue: [Zombie Process](KnownIssues/zombie_process.md)
## [Frequently Asked Questions](KnownIssues/Readme.md)
## Presentation

View file

@@ -53,6 +53,7 @@ coreoschannel = "stable"
coreosbaseurl = ""
verbose = False
nocache = False
limitnodes = None
# These are the default configuration parameter
default_config_parameters = {
@@ -507,6 +508,13 @@ scriptblocks = {
"kubernetes start restfulapi",
"kubernetes start webportal",
],
"add_worker": [
"sshkey install",
"runscriptonall ./scripts/prepare_ubuntu.sh",
"-y updateworker",
"-y kubernetes labels",
"mount",
],
"bldwebui": [
"webui",
"docker push restfulapi",
@@ -1197,6 +1205,15 @@ def get_worker_nodes(clusterId):
def get_nodes(clusterId):
nodes = get_ETCD_master_nodes(clusterId) + get_worker_nodes(clusterId)
if limitnodes is not None:
matchFunc = re.compile(limitnodes, re.IGNORECASE)
usenodes = []
for node in nodes:
if ( matchFunc.search(node)):
usenodes.append(node)
nodes = usenodes
if verbose:
print "Operate on: %s" % nodes
return nodes
def check_master_ETCD_status():
@@ -3368,7 +3385,8 @@ def run_command( args, command, nargs, parser ):
sleeptime = 10 if len(nargs)<1 else int(nargs[0])
print "Sleep for %s sec ... " % sleeptime
for si in range(sleeptime):
print ".",
sys.stdout.write(".")
sys.stdout.flush()
time.sleep(1)
elif command == "connect":
@@ -3985,6 +4003,11 @@
''' ),
action="store",
default="run" )
parser.add_argument("--nodes",
help = "Specify a Python regular expression that limits the nodes to which the operation is applied.",
action="store",
default=None
)
parser.add_argument("command",
help = "See above for the list of valid command" )
@@ -3997,6 +4020,8 @@
if args.verbose:
verbose = True
utils.verbose = True
if args.nodes is not None:
limitnodes = args.nodes
config = init_config()
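Putting the deploy.py changes together: the new --nodes flag filters the node list with a Python regular expression, and the add_worker script block chains the steps for bringing a new worker online. A hypothetical session (a sketch; the node names are examples, and running the block via a scriptblocks subcommand is assumed from the scriptblocks definitions above):
```
# Prepare only the nodes whose names match worker1.*
./deploy.py --nodes "worker1.*" runscriptonall ./scripts/prepare_ubuntu.sh
# Run the whole add_worker sequence (sshkey install, prepare,
# updateworker, kubernetes labels, mount) on the matching nodes
./deploy.py --nodes "worker1.*" scriptblocks add_worker
```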

View file

@@ -48,6 +48,7 @@ if [ -f /etc/NetworkManager/NetworkManager.conf ]; then
sudo service network-manager restart
fi
sudo service apache2 stop
if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*" ; then
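To test manually whether a node will pass this GPU detection (a quick check; the pattern mirrors the grep in the script):
```
# List NVIDIA 3D/VGA controllers the way prepare_ubuntu.sh detects them
lspci | grep -E "(3D|VGA compatible) controller: NVIDIA Corporation"
```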

View file

@@ -10,7 +10,9 @@ spec:
labels:
jobmanager-node: pod
spec:
dnsPolicy: ClusterFirstWithHostNet
{% if cnf["dnsPolicy"] %}
dnsPolicy: {{cnf["dnsPolicy"]}}
{% endif %}
hostNetwork: true
nodeSelector:
jobmanager: active
@@ -55,4 +57,4 @@ spec:
path: {{cnf["storage-mount-path"]}}/jobfiles
- name: log
hostPath:
path: /var/log/clustermanager
path: /var/log/clustermanager

View file

@@ -12,7 +12,9 @@ spec:
labels:
restfulapi-node: pod
spec:
dnsPolicy: ClusterFirstWithHostNet
{% if cnf["dnsPolicy"] %}
dnsPolicy: {{cnf["dnsPolicy"]}}
{% endif %}
nodeSelector:
restfulapi: active
hostNetwork: true

View file

@@ -10,7 +10,9 @@ spec:
labels:
webportal-node: pod
spec:
dnsPolicy: ClusterFirstWithHostNet
{% if cnf["dnsPolicy"] %}
dnsPolicy: {{cnf["dnsPolicy"]}}
{% endif %}
nodeSelector:
webportal: active
hostNetwork: true

View file

@@ -11,7 +11,9 @@ spec:
nodeSelector:
FragmentGPUJob: active
{% endif %}
#dnsPolicy: ClusterFirstWithHostNet
{% if job["dnsPolicy"] %}
dnsPolicy: ClusterFirstWithHostNet
{% endif %}
containers:
- name: {{ job["jobId"] }}
image: {{ job["image"] }}
@@ -21,8 +23,11 @@ spec:
resources:
limits:
alpha.kubernetes.io/nvidia-gpu: {{ job["resourcegpu"] }}
volumeMounts:
{% if not job["dnsPolicy"] %}
- mountPath: /etc/resolv.conf
name: resolv
{% endif %}
{% for mp in job["mountPoints"] %}
- mountPath: {{ mp.containerPath }}
name: {{ mp.name }}
@@ -49,9 +54,14 @@ spec:
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
fieldPath: status.podIP
restartPolicy: Never
volumes:
{% if not job["dnsPolicy"] %}
- name: resolv
hostPath:
path: /etc/resolv.conf
{% endif %}
{% for mp in job["mountPoints"] %}
- name: {{ mp.name }}
hostPath:
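With these template changes, a pod receives an explicit dnsPolicy only when one is configured; otherwise a job pod falls back to bind-mounting the host's /etc/resolv.conf. To verify what a running pod actually got (a sketch; substitute a real pod name for my-job-pod):
```
# Show the pod's effective dnsPolicy (empty output means the cluster default)
kubectl get pod my-job-pod -o jsonpath='{.spec.dnsPolicy}'
# Inspect the resolv.conf the container ended up with
kubectl exec my-job-pod -- cat /etc/resolv.conf
```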