Merge pull request #22 from jinlccs/master
Customize Job dnsPolicy, add a script block for adding worker nodes, and update documentation.
Commit
ae364b39cc
@@ -19,7 +19,7 @@ Here is a few short video clips that can quickly explain DLWorkspace. Note the P
## [DLWorkspace Cluster Deployment](docs/deployment/Readme.md)

-## [Known Issues](docs/KnownIssues/Readme.md)
+## [Frequently Asked Questions](docs/KnownIssues/Readme.md)

## [Presentation](docs/presentation/1707/Readme.md)
@@ -1,7 +1,11 @@
-# Known Issues in DL Workspace
+# Frequently Asked Questions

-## [Deployment](../deployment/knownissues/Readme.md)
+* [Development Environment](../DevEnvironment/FAQ.md)
+* [Deployment](../deployment/knownissues/Deployment_Issue.md)
+* [Authentication](../deployment/authentication/FAQ.md)
+* [Azure](../deployment/Azure/FAQ.md)
+* [ACS](../deployment/ACS/FAQ.md)

-## Running DL Workspace
+## Using DL Workspace

* [Zombie Process](zombie_process.md)
@@ -24,10 +24,7 @@ Here is a few short video clips that can quickly explain DLWorkspace. Note the P
* [On prem, Ubuntu Cluster](deployment/On-Prem/Ubuntu.md)
* [Single Ubuntu Computer](deployment/On-Prem/SingleUbuntu.md)

-## Known Issues
-
-* [Deployment Issue](deployment/knownissues/Readme.md)
-* Container Issue: [Zombie Process](KnownIssues/zombie_process.md)
+## [Frequently Asked Questions](KnownIssues/Readme.md)

## Presentation
@@ -1,4 +1,6 @@
-# Common Deployment Issues of DL workspace cluster
+# Frequently Asked Questions During DL Workspace Deployment

We are still prototyping the platform. Please report issues to the authors so that we can complete the documentation.

1. DL Workspace deployment environment.
@@ -35,10 +37,36 @@
Are you sure you want to continue connecting (yes/no)? yes
```
Issue: when machines are redeployed they get new host keys, which differ from their prior host keys and trigger the warning above each time a remote machine is connected.
-Solution: remove the file /home/<username>/.ssh/known_hosts.
+Solution: remove the affected hosts from /home/<username>/.ssh/known_hosts; you may also delete the whole /home/<username>/.ssh/known_hosts file.
5. I see the default Apache server page instead of DL Workspace.
The Apache server may be enabled by default on the installed node. Run "sudo service apache2 stop" to disable it.
6. I have a deployment failure.
Sometimes there are transient glitches during script execution. Try executing the script again to see if the issue goes away.

+7. 'docker pull' fails with the error "layers from manifest don't match image configuration".
+Please check the Docker versions of the 'pull' machine, the 'push' machine, and the Docker registry. This appears to be caused by incompatible Docker versions. [See](https://github.com/docker/distribution/issues/1439)
+
+8. If you don't use a domain, please don't add domain: "" to the configuration; the script appends a "." to the hostname, which causes the scripts to fail.
+
+9. In some Ubuntu distributions, "NetworkManager" is enabled and sets the DNS name server to "127.0.1.1". This is fine on the host machine but may cause issues inside containers. Typically, if a container does not use the host network and inherits the DNS name server from the host, domain names cannot be resolved inside the container.
+
+If the container network is not working, check /etc/resolv.conf. If it contains "127.0.1.1", run:
+
+```
+./deploy.py runscriptonall ./scripts/disable_networkmanager.sh
+```
+
+NetworkManager is the program which (via the resolvconf utility) inserts the address 127.0.1.1 into resolv.conf. NetworkManager inserts that address if and only if it is configured to start an instance of the dnsmasq program to serve as a local forwarding nameserver. That dnsmasq instance listens for queries at address 127.0.1.1.
+If you do not want to use a local forwarding nameserver, configure NetworkManager not to start a dnsmasq instance and not to insert that address. In /etc/NetworkManager/NetworkManager.conf, comment out the line dns=dnsmasq:
+```
+sudo vi /etc/NetworkManager/NetworkManager.conf
+[main]
+plugins=ifupdown,keyfile,ofono
+#dns=dnsmasq
+```
+and restart the NetworkManager service:
+```
+sudo service network-manager restart
+```
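The per-host cleanup boils down to dropping the lines whose leading hostname field mentions the redeployed machine; a minimal sketch, where `node1` and the sample entries are placeholders (the standard CLI equivalent is `ssh-keygen -R <host>`):

```python
# Drop only the stale entries for one redeployed host instead of
# deleting the whole known_hosts file.
def drop_host(lines, host):
    # A known_hosts line starts with a comma-separated hostname list.
    return [l for l in lines if host not in l.split(" ")[0].split(",")]

known_hosts = [
    "node1 ssh-ed25519 AAAA...stale",
    "node2,node2.example.com ssh-ed25519 BBBB...current",
]
print(drop_host(known_hosts, "node1"))  # only the node2 line remains
```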
@@ -1,24 +0,0 @@
-# Known issues.
-
-We are still prototyping the platform. Please report issues to the author, so that we can complete the document.
-
-1. 'docker pull' fails with error "layers from manifest don't match image configuration".
-Please check the docker version of the 'pull' machine, the 'push' machine, and the docker register. It seems that this is caused by incompatible docker version. [See](https://github.com/docker/distribution/issues/1439)
-
-2. If you don't use domain, please don't add domain: "", this adds a "." to hostname by the script, and causes the scripts to fail.
-
-3. In some ubuntu distributions, "NetworkManager" is enable and set dns name server to be "127.0.1.1". This is Ok on the host machine, but may cause issues in the container. Typically, if the container is not using host network and inherits dns name server from the host, the domain name will not be able to be resolved inside container.
-If the container network is not working, please check /etc/resolv.conf. If the value is "127.0.1.1", please follow the below instructions to fix:
-
-NetworkManager is the program which (via the resolvconf utility) inserts address 127.0.1.1 into resolv.conf. NM inserts that address if an only if it is configured to start an instance of the dnsmasq program to serve as a local forwarding nameserver. That dnsmasq instance listens for queries at address 127.0.1.1.
-If you do not want to use a local forwarding nameserver then configure NetworkManager not to start a dnsmasq instance and not to insert that address. In /etc/NetworkManager/NetworkManager.conf comment out the line dns=dnsmasq
-```
-sudo vi /etc/NetworkManager/NetworkManager.conf
-[main]:q
-plugins=ifupdown,keyfile,ofono
-#dns=dnsmasq
-```
-and restart the NetworkManager service.
-```
-sudo service network-manager restart
-```
@@ -1,10 +0,0 @@
-# Common/Known Issues in Deployment
-
-# Frequently Asked Questions (FAQs)
-
-## [Authentication](../authentication/FAQ.md)
-
-## [Deployment](Deployment_Issue.md)
-
-## [Development Environment](../DevEnvironment/FAQ.md)
@@ -26,10 +26,7 @@ Here is a few short video clips that can quickly explain DL Workspace. Note the
* [On prem, Ubuntu Cluster](deployment/On-Prem/Ubuntu.md)
* [Single Ubuntu Computer](deployment/On-Prem/SingleUbuntu.md)

-## Known Issues
-
-* [Deployment Issue](deployment/knownissues/Readme.md)
-* Container Issue: [Zombie Process](KnownIssues/zombie_process.md)
+## [Frequently Asked Questions](KnownIssues/Readme.md)

## Presentation
@@ -53,6 +53,7 @@ coreoschannel = "stable"
coreosbaseurl = ""
verbose = False
nocache = False
+limitnodes = None

# These are the default configuration parameter
default_config_parameters = {
@@ -507,6 +508,13 @@ scriptblocks = {
    "kubernetes start restfulapi",
    "kubernetes start webportal",
    ],
+    "add_worker": [
+    "sshkey install",
+    "runscriptonall ./scripts/prepare_ubuntu.sh",
+    "-y updateworker",
+    "-y kubernetes labels",
+    "mount",
+    ],
    "bldwebui": [
    "webui",
    "docker push restfulapi",
@@ -1197,6 +1205,15 @@ def get_worker_nodes(clusterId):

def get_nodes(clusterId):
    nodes = get_ETCD_master_nodes(clusterId) + get_worker_nodes(clusterId)
+    if limitnodes is not None:
+        matchFunc = re.compile(limitnodes, re.IGNORECASE)
+        usenodes = []
+        for node in nodes:
+            if matchFunc.search(node):
+                usenodes.append(node)
+        nodes = usenodes
+    if verbose:
+        print "Operate on: %s" % nodes
    return nodes
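The `--nodes` filter added above is a case-insensitive regular-expression search over node names; a minimal standalone sketch (function and node names are illustrative, not part of deploy.py):

```python
import re

def filter_nodes(nodes, pattern=None):
    # Mirrors the logic in get_nodes(): with no pattern, return all nodes;
    # otherwise keep only nodes whose name matches, case-insensitively.
    if pattern is None:
        return nodes
    matcher = re.compile(pattern, re.IGNORECASE)
    return [n for n in nodes if matcher.search(n)]

print(filter_nodes(["worker-01", "Worker-02", "infra-01"], "worker"))
# -> ['worker-01', 'Worker-02']
```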

def check_master_ETCD_status():
@@ -3368,7 +3385,8 @@ def run_command( args, command, nargs, parser ):
        sleeptime = 10 if len(nargs)<1 else int(nargs[0])
        print "Sleep for %s sec ... " % sleeptime
        for si in range(sleeptime):
-            print ".",
+            sys.stdout.write(".")
+            sys.stdout.flush()
            time.sleep(1)

    elif command == "connect":
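The change above replaces Python 2's `print ".",` (which inserts separating spaces and can sit in the stdout buffer) with an explicit write-and-flush, so each dot appears immediately; the same idea as a small standalone sketch (function name is illustrative):

```python
import sys
import time

def sleep_with_dots(seconds):
    # Print one dot per second, flushing so progress shows immediately
    # instead of being held in the stdout buffer until the loop ends.
    for _ in range(seconds):
        sys.stdout.write(".")
        sys.stdout.flush()
        time.sleep(1)

sleep_with_dots(2)  # prints ".." over two seconds
```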
@@ -3985,6 +4003,11 @@ Command:
        ''' ),
        action="store",
        default="run" )
+    parser.add_argument("--nodes",
+        help = "A Python regular expression that limits the nodes to which the operation is applied.",
+        action="store",
+        default=None
+        )

    parser.add_argument("command",
        help = "See above for the list of valid command" )
@@ -3997,6 +4020,8 @@ Command:
    if args.verbose:
        verbose = True
        utils.verbose = True
+    if args.nodes is not None:
+        limitnodes = args.nodes

    config = init_config()
@@ -48,6 +48,7 @@ if [ -f /etc/NetworkManager/NetworkManager.conf ]; then
    sudo service network-manager restart
fi

+sudo service apache2 stop

if lspci | grep -qE "[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F].[0-9] (3D|VGA compatible) controller: NVIDIA Corporation.*" ; then
@@ -10,7 +10,9 @@ spec:
    labels:
      jobmanager-node: pod
  spec:
-    dnsPolicy: ClusterFirstWithHostNet
+    {% if cnf["dnsPolicy"] %}
+    dnsPolicy: {{cnf["dnsPolicy"]}}
+    {% endif %}
    hostNetwork: true
    nodeSelector:
      jobmanager: active
@@ -55,4 +57,4 @@ spec:
        path: {{cnf["storage-mount-path"]}}/jobfiles
    - name: log
      hostPath:
-        path: /var/log/clustermanager
+        path: /var/log/clustermanager
@@ -12,7 +12,9 @@ spec:
    labels:
      restfulapi-node: pod
  spec:
-    dnsPolicy: ClusterFirstWithHostNet
+    {% if cnf["dnsPolicy"] %}
+    dnsPolicy: {{cnf["dnsPolicy"]}}
+    {% endif %}
    nodeSelector:
      restfulapi: active
    hostNetwork: true
@@ -10,7 +10,9 @@ spec:
    labels:
      webportal-node: pod
  spec:
-    dnsPolicy: ClusterFirstWithHostNet
+    {% if cnf["dnsPolicy"] %}
+    dnsPolicy: {{cnf["dnsPolicy"]}}
+    {% endif %}
    nodeSelector:
      webportal: active
    hostNetwork: true
@@ -11,7 +11,9 @@ spec:
  nodeSelector:
    FragmentGPUJob: active
  {% endif %}
-  #dnsPolicy: ClusterFirstWithHostNet
+  {% if job["dnsPolicy"] %}
+  dnsPolicy: ClusterFirstWithHostNet
+  {% endif %}
  containers:
  - name: {{ job["jobId"] }}
    image: {{ job["image"] }}
@@ -21,8 +23,11 @@ spec:
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: {{ job["resourcegpu"] }}

    volumeMounts:
+    {% if not job["dnsPolicy"] %}
+    - mountPath: /etc/resolv.conf
+      name: resolv
+    {% endif %}
    {% for mp in job["mountPoints"] %}
    - mountPath: {{ mp.containerPath }}
      name: {{ mp.name }}
@@ -49,9 +54,14 @@ spec:
    - name: POD_IP
      valueFrom:
        fieldRef:
-          fieldPath: status.podIP
+          fieldPath: status.podIP
  restartPolicy: Never
  volumes:
+  {% if not job["dnsPolicy"] %}
+  - name: resolv
+    hostPath:
+      path: /etc/resolv.conf
+  {% endif %}
  {% for mp in job["mountPoints"] %}
  - name: {{ mp.name }}
    hostPath: