Mirror of https://github.com/microsoft/msrflute.git
30e41400a4
This PR replaces MPI with torch.distributed as the main communication backbone, allowing NCCL to be used for GPU jobs and Gloo for CPU distributed jobs. The most significant changes are inside _federated.py_. Asynchronous mode is enabled when using NCCL, which means that workers are reassigned to a new Client as soon as they finish, improving overall GPU utilization and reducing the total time of the job, as shown in the figure below.

![COMPARISON (2).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/COMPARISON%20%282%29.png)

However, Gloo does not have a native non-blocking way to check whether a recv/send request has completed (see details here: https://github.com/pytorch/pytorch/issues/30723), so when using Gloo the communication works synchronously.

I've also added a fix for the CUDA OOM issues I was hitting when running the bert experiment, where GPU memory was being overloaded during training: some cleanup is now performed after the server receives the gradient. Comparison below: MPI (https://aka.ms/amlt?q=dcbbn) vs. NCCL now.

![image (14).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/image%20%2814%29.png)

There are a couple of minor changes in _server_, _client_ and _evaluation_ as well. The main reason is that the Server no longer holds the list of Clients; they live inside the Worker from the moment it is created, and the Server only passes the indexes of the Clients to the Worker. The reason behind this change is that torch.distributed does not allow sending objects P2P, only tensors. The remaining modified files only update the documentation and the testing file.

I tried to be very explicit in each new function inside _federated.py_ to explain the new flow; let me know if something is not clear enough. A rough sketch of the flow is included after the checklist below. I've already tested all experiments in the sandbox using NCCL and on my local machine (Windows) using Gloo (surprisingly, this case is not as slow as I was expecting, although I used some dummy datasets I had prepared). A pending task is to compare the new performance using CPU. So for now the only thing left is to run the sanity checks on AML, links below.

Sanity checks after cleanup:
- [x] nlg_gru: https://aka.ms/amlt?q=c9tih
- [x] mlm_bert: https://aka.ms/amlt?q=dbs8y
- [x] classif_cnn: https://aka.ms/amlt?q=da2qr
- [x] ecg: https://aka.ms/amlt?q=c9jof
- [x] cv: https://aka.ms/amlt?q=da2k4
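As a rough illustration of the communication flow described above, here is a minimal sketch, not the actual code in _federated.py_: the function names `init_backend`, `recv_gradient`, and `send_client_ids` are made up for this example, and the busy-wait stands in for the real worker-dispatch loop.

```python
# Illustrative sketch only: NCCL for GPUs / Gloo for CPU, non-blocking
# completion checks under NCCL, blocking receives under Gloo, cleanup after
# the server consumes a gradient, and index-only hand-off to workers.
import os
import torch
import torch.distributed as dist


def init_backend(rank: int, world_size: int) -> str:
    """Initialize the process group, preferring NCCL when CUDA is available."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    return backend


def recv_gradient(buffer: torch.Tensor, src: int, backend: str) -> None:
    """Receive a flattened gradient tensor from a worker.

    With NCCL the request can be polled, so the server can keep assigning
    idle workers to new Clients while waiting; Gloo has no native
    non-blocking completion check (pytorch/pytorch#30723), so we block.
    Note: under NCCL, `buffer` must be a CUDA tensor.
    """
    if backend == "nccl":
        work = dist.irecv(buffer, src=src)
        while not work.is_completed():
            pass  # in the real loop: dispatch the next Client to a free worker
    else:
        dist.recv(buffer, src=src)
    # Cleanup after the server has consumed the gradient; this is the kind of
    # housekeeping that avoided the CUDA OOM seen in the bert experiment.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def send_client_ids(client_ids, dst: int, backend: str) -> None:
    """Ship Client indexes to a worker.

    torch.distributed only sends tensors P2P, so Client objects stay on the
    worker side and only their indexes travel over the wire.
    """
    ids = torch.tensor(client_ids, dtype=torch.int64)
    if backend == "nccl":
        ids = ids.cuda()  # NCCL requires CUDA tensors
    dist.send(ids, dst=dst)
```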
Files in this folder:

- README.md
- build_vocab.py
- create_data.py
- hello_world_classif_cnn.yaml
- hello_world_ecg_cnn.yaml
- hello_world_mlm_bert.yaml
- hello_world_nlg_gru.yaml
- test_e2e_trainer.py
README.md
Information
The tests are designed to evaluate the operation of the tasks, not their performance. Therefore, we use dummy data to run all tasks. In order to get realistic results about the behaviour of each experiment, please follow the instructions provided in the README.md file inside each experiment folder to download the recommended datasets.
Setup Instructions for Pytest
1. Run `create_data.py` to download and preprocess the dummy training and testing datasets that will be used. Make sure to indicate the task name. The example below shows how to create the data for the `nlg_gru` task: `python create_data.py --task nlg_gru`
2. The script `test_e2e_trainer.py` is designed to run the test over all tasks, so you need to run Step 1 for each experiment first.
3. Run `pytest -v -s` to perform the local test.
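For convenience, the two steps above can be combined in a small helper script. This is a hypothetical example, not part of the repo, and the task names are assumed from the hello_world_*.yaml files in this folder.

```python
# Hypothetical helper: prepare the dummy data for every task, then run the
# end-to-end tests locally. Adjust TASKS to match the experiments you need.
import subprocess
import sys

TASKS = ["nlg_gru", "mlm_bert", "classif_cnn", "ecg_cnn"]

for task in TASKS:
    # Step 1 for each experiment: download and preprocess its dummy dataset.
    subprocess.run([sys.executable, "create_data.py", "--task", task], check=True)

# Steps 2-3: run the local test over all tasks.
subprocess.run(["pytest", "-v", "-s"], check=True)
```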