b5b1c3dac7
**Description**

Fix a bug in the duration feature for model benchmarks in distributed mode.

**Major Revision**

- Add an `all_reduce` to synchronize the result of `is_finished()` (the function that decides whether the model benchmark should stop) at each step, to avoid inconsistency between ranks when determining the end of the duration (otherwise some ranks may enter one extra step and never finish).
- Add `torch.cuda.synchronize()` before and after the step-time measurement in `train_step()` for all model benchmarks, since some operations in `train_step()` may be asynchronous and produce incorrect step-time records (for example, LSTM).
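The two fixes above can be sketched as follows. This is a minimal illustration, not the actual SuperBench implementation: the helper names `is_finished_synced` and `timed_train_step` are hypothetical, and the choice of `ReduceOp.MAX` (any rank finishing stops all ranks) is an assumption about the desired semantics.

```python
import time

import torch
import torch.distributed as dist


def is_finished_synced(local_finished: bool) -> bool:
    """Agree across all ranks on whether the benchmark should stop.

    Without this sync, ranks can disagree at the duration boundary and
    one rank may enter an extra step while the others have stopped,
    deadlocking the collective ops inside that step.
    """
    flag = torch.tensor([1 if local_finished else 0], dtype=torch.int32)
    # MAX means: if any rank has finished, every rank sees 1 and stops.
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return flag.item() == 1


def timed_train_step(model_step) -> float:
    """Run one step and return its wall time in milliseconds.

    CUDA kernels launch asynchronously, so without synchronize() the
    host-side timer can stop before the device work has completed,
    under-reporting the step time (seen e.g. with LSTM workloads).
    """
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    model_step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) * 1000
```

In the benchmark loop, each step would then check `if is_finished_synced(elapsed >= duration): break`, so every rank exits on the same iteration.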
docker_benchmarks
micro_benchmarks
model_benchmarks
__init__.py
base.py
build.sh
context.py
reducer.py
registry.py
result.py
return_code.py