The old federated.py broke when additional privacy metrics were enabled in the config file. This PR allows these extra keys to be included in the client payload communicated to the Server.
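As a rough illustration of the change (the key names and helper below are made up, not the actual payload code), the payload is a plain dict, so the optional privacy-metric keys simply ride along when they're enabled:

```python
import torch

def build_client_payload(gradient, num_samples, privacy_metrics=None):
    payload = {"weights": gradient, "num_samples": num_samples}
    if privacy_metrics:                  # extra keys only when DP metrics are enabled
        payload.update(privacy_metrics)  # e.g. {"practical_epsilon": 0.8}
    return payload

payload = build_client_payload(torch.zeros(10), num_samples=32,
                               privacy_metrics={"practical_epsilon": 0.8})
```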
NLG_GRU:
[X] Apply DP metrics: https://aka.ms/amlt?q=hlimq
[X] DP metrics disabled: https://aka.ms/amlt?q=hlimz
Fix replay_server option + update requirements
Sidenote: Python 3.8 seems to cause some issues when running, so I've updated the README to use Python 3.7, since that's the version I'm using in AML and in the local sandbox.
Sanity-checks:
[X] https://aka.ms/amlt?q=e3q0b
This PR replaces MPI with torch.distributed as the main communication backbone, allowing the use of NCCL for GPU jobs and Gloo for CPU distributed jobs. The most significant changes are inside _federated.py_.
Asynchronous mode is enabled when using NCCL, which means that workers are reassigned to a new Client as soon as they finish, improving overall GPU utilization and reducing the total job time, as shown in the figure below.
![COMPARISON (2).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/COMPARISON%20%282%29.png)
However, Gloo does not have a native non-blocking way to check whether a recv/send request has completed (see details here: https://github.com/pytorch/pytorch/issues/30723), so when using Gloo the communication works synchronously.
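Below is a minimal sketch of this pattern, not the actual federated.py code (function names, tensor shapes and ranks are illustrative): with NCCL the server polls outstanding receives and hands an idle worker the next Client right away, while with Gloo it falls back to blocking waits.

```python
import torch
import torch.distributed as dist

def init_backend():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # rank/world_size/address come from the launcher env
    return backend

def gather_results(worker_ranks, payload_size, backend):
    device = "cuda" if backend == "nccl" else "cpu"   # NCCL requires CUDA tensors
    buffers = {r: torch.zeros(payload_size, device=device) for r in worker_ranks}
    requests = {r: dist.irecv(buffers[r], src=r) for r in worker_ranks}

    if backend == "nccl":
        # Asynchronous mode: reassign a worker as soon as its result arrives.
        pending = set(worker_ranks)
        while pending:
            for r in list(pending):
                if requests[r].is_completed():
                    pending.remove(r)
                    yield r, buffers[r]   # schedule the next Client on rank r here
    else:
        # Gloo: no non-blocking completion check, so wait synchronously.
        for r in worker_ranks:
            requests[r].wait()
            yield r, buffers[r]
```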
I've also added a fix for the CUDA OOM issues I was hitting when running the bert experiment, where GPU memory was being overloaded during training: some cleanup is now performed after the server receives the gradient. The figure below compares MPI (https://aka.ms/amlt?q=dcbbn) with the new NCCL run.
![image (14).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/image%20%2814%29.png)
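As a rough illustration of the cleanup idea (the function below is a sketch, not the actual fix), the pattern is to keep only a CPU copy of the received gradient, drop the GPU reference, and ask the CUDA caching allocator to release the memory:

```python
import gc
import torch

def consume_gradient(grad_gpu: torch.Tensor) -> torch.Tensor:
    """Copy a received gradient to CPU, then release the GPU memory it used."""
    grad_cpu = grad_gpu.detach().cpu()
    del grad_gpu                  # drop this function's reference to the GPU tensor
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the device
    return grad_cpu
```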
There are also a couple of minor changes in _server_, _client_ and _evaluation_. The main reason is that the Server no longer holds the list of clients; clients now live inside the worker from the moment it is created, and the Server only passes Client indices to the Worker. The reason behind this change is that torch.distributed does not allow sending objects P2P, only tensors.
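A minimal sketch of that idea, with illustrative names rather than the actual Server/Worker code:

```python
import torch
import torch.distributed as dist

SERVER_RANK = 0

def assign_client(worker_rank: int, client_index: int):
    # Server side: ship only the index, not the Client object.
    # (CPU tensor shown for brevity; with NCCL it would live on the GPU.)
    dist.send(torch.tensor([client_index], dtype=torch.int64), dst=worker_rank)

def next_client(local_clients):
    # Worker side: receive the index and resolve it against the local client list.
    idx = torch.zeros(1, dtype=torch.int64)
    dist.recv(idx, src=SERVER_RANK)
    return local_clients[int(idx.item())]
```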
The rest of the modified files only update the documentation and the testing file. I tried to be very explicit about each new function inside _federated.py_ to explain the new flow. Let me know if something is not clear enough.
I've already tested all experiments in the sandbox using NCCL and on my local machine (Windows) using Gloo (surprisingly, this case is not as slow as I was expecting, though I used some dummy datasets I had prepared). A pending task is to compare the new performance on CPU.
So for now the only thing left is to run the sanity checks on AML; links below.
Sanity checks after cleanup:
[x] nlg_gru: https://aka.ms/amlt?q=c9tih
[x] mlm_bert: https://aka.ms/amlt?q=dbs8y
[x] classif_cnn: https://aka.ms/amlt?q=da2qr
[x] ecg: https://aka.ms/amlt?q=c9jof
[x] cv: https://aka.ms/amlt?q=da2k4
The dp-accountant submodule was not working in the GitHub repo. It seems like one of the .git files that holds the pointer was broken, but I couldn't find the exact issue, so I've removed the submodule and added it again inside the _utils_ folder.
Reference: https://github.com/microsoft/msrflute/issues/9
I've already tested the functionality in the sandbox. @<Andre Manoel>, can you please test it on your local machine as well to see if it throws any error that I'm not able to reproduce?
`$ git submodule update --init --recursive`
`$ cd utils`
`$ cd dp-accountant`
`$ python setup.py install`
`$ ./bin/compute-dp-epsilon --help`
- Replace _data_dict in client.py with a dataset.
- Remove the loader_type dependency from the dataloader utilities.
- Add base classes for datasets and dataloaders (see the sketch after this list).
- Add an example in classif_cnn of instantiating a previously created dataset.
- Allow datasets to be downloaded on the fly.
- Update documentation.
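As a rough illustration of the new dataset contract (class and argument names here are assumptions, not the exact FLUTE API), an experiment dataset now just subclasses the base class and exposes the usual map-style interface instead of the old _data_dict:

```python
import torch
from torch.utils.data import Dataset

class BaseDataset(Dataset):
    """Minimal map-style interface the framework relies on."""
    def __getitem__(self, idx):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

class ToyClassifDataset(BaseDataset):
    def __init__(self, data_by_user, user_idx):
        # `data_by_user` maps a user id to a list of (features, label) pairs;
        # each federated client only sees its own shard.
        self.samples = data_by_user[user_idx]

    def __getitem__(self, idx):
        x, y = self.samples[idx]
        return torch.as_tensor(x), torch.as_tensor(y)

    def __len__(self):
        return len(self.samples)
```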
Sanity checks:
[x] nlg_gru: https://aka.ms/amlt?q=cn6vj
[x] mlm_bert: https://aka.ms/amlt?q=cppmb
[x] classif_cnn: https://aka.ms/amlt?q=cn6vw
[x] ecg: https://aka.ms/amlt?q=codet
- Include an abstract class for models in core/model.py.
- Update the model classes of each experiment accordingly.
- Remove the abstract class for metrics (it is no longer necessary); new metrics only need to be declared in the dictionary returned by `inference()`, and FLUTE will recognize them during the evaluation rounds (see the sketch after this list).
- custom_metrics.py inside each experiment folder is no longer needed.
- Update the docs for model implementation and metrics.
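As a rough illustration (the model and key names below are assumptions, not the exact FLUTE signature), declaring an extra metric is just a matter of adding another entry to the dictionary returned by `inference()`:

```python
import torch

class ToyModel(torch.nn.Module):
    def __init__(self, dim=32, num_classes=10):
        super().__init__()
        self.linear = torch.nn.Linear(dim, num_classes)
        self.loss_fn = torch.nn.CrossEntropyLoss()

    def inference(self, features, labels):
        logits = self.linear(features)
        preds = logits.argmax(dim=-1)
        return {
            "output": logits,
            "loss": self.loss_fn(logits, labels).item(),
            "acc": (preds == labels).float().mean().item(),
            # Any extra key becomes a metric reported during evaluation rounds:
            "top5_acc": (logits.topk(5, dim=-1).indices == labels.unsqueeze(-1))
                        .any(dim=-1).float().mean().item(),
        }
```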
Configs are now classes.
They're set up to be backwards compatible with the existing code; only a few adjustments were needed.
Going forward we should make an effort to reference config values as properties rather than as dictionary items; over time we can clean up the verbose code and remove the dictionary support.
In the vast majority of cases everything will just work. There may be some edge cases when distinguishing between None and not configured.
I have validated locally and in a swiftkey run. We should also validate an mlm or zcode run.
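As a rough illustration of the two access styles mentioned above (the class and field names below are made up, not the actual config classes), dictionary-style lookup is kept for backwards compatibility while new code should prefer the property form:

```python
class ServerConfig:
    def __init__(self, raw: dict):
        self.max_iteration = raw.get("max_iteration")
        self.optimizer_config = raw.get("optimizer_config", {})

    def __getitem__(self, key):
        # Legacy dictionary access; can be removed once call sites migrate.
        return getattr(self, key)

cfg = ServerConfig({"max_iteration": 1000})
print(cfg.max_iteration)      # preferred going forward
print(cfg["max_iteration"])   # still supported for existing code
```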