Benchmark - Fix torch.dist init issue with multiple models (#495)
Fix a potential barrier timeout in init_process_group caused by a race condition when multiple models reuse the same port. Switch to a different port for each model when running multiple models sequentially in one process. For example, when running vgg11/13/16/19, ports 29501-29504 will be used respectively.
Parent: 5a88db1601
Commit: 644b5395df
@@ -70,7 +70,8 @@ class PytorchBase(ModelBenchmark):
                 )
                 return False
             # torch >= 1.9.0a0 torch.distributed.elastic is used by default
-            port = int(os.environ['MASTER_PORT']) + 1
+            port = int(os.environ.get('MASTER_PORT', '29500')) + 1
+            os.environ['MASTER_PORT'] = str(port)
             addr = os.environ['MASTER_ADDR']
             self._global_rank = int(os.environ['RANK'])
             self._local_rank = int(os.environ['LOCAL_RANK'])
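
For context, a minimal, self-contained sketch (not part of the commit) of how the patched port-bumping behaves when several models are initialized sequentially in one process. The gloo backend and the single-rank environment defaults are assumptions for illustration only; in SuperBench these variables normally come from the launcher.

import os
import torch.distributed as dist

# Single-process defaults so the sketch runs standalone (assumption:
# a real run gets these from the launcher).
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')

for name in ('vgg11', 'vgg13', 'vgg16', 'vgg19'):
    # Bump MASTER_PORT before every init so a model never races on the
    # port still held by the previous (already destroyed) process group.
    port = int(os.environ.get('MASTER_PORT', '29500')) + 1
    os.environ['MASTER_PORT'] = str(port)
    dist.init_process_group(backend='gloo', init_method='env://')
    print('{}: rendezvous on port {}'.format(name, port))  # 29501..29504
    # ... run the benchmark for this model here ...
    dist.destroy_process_group()

Because the bumped value is written back to MASTER_PORT, each subsequent init starts from the previous model's port, which is what yields the 29501-29504 sequence described in the commit message.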