msrflute

Commit graph


Author

SHA1

Message

Date

Mirian Hipolito Garcia

8bfe0854ab

Merged PR 1578: Include FedProx aggregation method

Implements the FedProx aggregation method, taken from the paper "Federated Learning on Non-IID Data Silos: An Experimental Study" (https://arxiv.org/pdf/2102.02079.pdf).
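For readers unfamiliar with FedProx: it is usually described as adding a proximal term to each client's local objective, which penalizes the distance between the local weights and the current global weights. The sketch below illustrates only that standard formulation on plain Python floats. It is not the code from this PR, and the function names and the `mu` default are assumptions made for illustration.

```python
from typing import List, Sequence


def proximal_term(local_weights: Sequence[float],
                  global_weights: Sequence[float],
                  mu: float) -> float:
    """Return (mu / 2) * ||w_local - w_global||^2."""
    if len(local_weights) != len(global_weights):
        raise ValueError("weight vectors must have the same length")
    squared_distance = sum(
        (w - g) ** 2 for w, g in zip(local_weights, global_weights)
    )
    return 0.5 * mu * squared_distance


def fedprox_loss(task_loss: float,
                 local_weights: Sequence[float],
                 global_weights: Sequence[float],
                 mu: float = 0.01) -> float:
    """Local objective: the task loss plus the proximal penalty."""
    return task_loss + proximal_term(local_weights, global_weights, mu)


def proximal_gradient(local_weights: Sequence[float],
                      global_weights: Sequence[float],
                      mu: float) -> List[float]:
    """Gradient of the proximal term w.r.t. the local weights: mu * (w - g)."""
    return [mu * (w - g) for w, g in zip(local_weights, global_weights)]


if __name__ == "__main__":
    local_w = [1.0, 2.0]
    global_w = [0.0, 0.0]
    print(fedprox_loss(1.0, local_w, global_w, mu=2.0))
    print(proximal_gradient(local_w, global_w, mu=2.0))
```

With `mu = 0` the penalty vanishes and the local objective reduces to the plain task loss. Larger values keep client updates closer to the global model.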

[x] nlg_gru_fedprox: https://ml.azure.com/runs/8c052875-d053-4e70-b5b6-8f591faf5936?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

**Comparison**

- DGA (Acc 0.15, Loss 5.5)

![image.png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1578/attachments/image.png)

- FedProx (Acc 0.18, Loss 4.8)

![image (2).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1578/attachments/image%20%282%29.png)

2023-08-23 15:36:25 +00:00

Mirian Hipolito Garcia

43e15308e2

Merged PR 1546: Allow single GPU/CPU processes

Allow single GPU/CPU processes

[X] - Multi GPU_nlg: https://ml.azure.com/runs/adb32644-7ad3-425f-ac5f-8a81d2756147?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
[X] - Single GPU_nlg: https://ml.azure.com/runs/3295eb10-f0bd-478a-a7c2-8f1cac8f9be5?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47#metrics
[X] - Single GPU_ecg: https://ml.azure.com/runs/34f7da18-e230-4df8-9b2f-f4c916d4d005?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47&reloadCount=1
[X] - Single GPU_classif_cnn: https://ml.azure.com/runs/0ff278ea-0bcb-4781-bb85-fe15514edd53?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47#metrics

2023-06-28 15:45:16 +00:00

Mirian Hipolito Garcia

30e41400a4

Merged PR 1272: Replace MPI -> torch.distributed

This PR replaces MPI with torch.distributed as the main communication backbone, which allows using NCCL for GPU jobs and Gloo for CPU distributed jobs. The most significant changes are in _federated.py_.

Asynchronous mode is enabled when using NCCL, which means that workers are reassigned to a new Client as soon as they finish. This improves overall GPU utilization and reduces the total job time, as shown in the figure below.

![COMPARISON (2).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/COMPARISON%20%282%29.png)
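To make the scheduling difference concrete, here is a small framework-free simulation. It assumes a simplified model that is not taken from the PR: in synchronous mode, clients are processed in batches of `n_workers` and each batch waits for its slowest client, while in asynchronous mode a worker picks up the next client the moment it becomes free. The function names and durations are hypothetical.

```python
import heapq
from typing import List


def synchronous_makespan(durations: List[float], n_workers: int) -> float:
    """Total time when every batch of n_workers clients waits for its slowest member."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    total = 0.0
    for start in range(0, len(durations), n_workers):
        total += max(durations[start:start + n_workers])
    return total


def asynchronous_makespan(durations: List[float], n_workers: int) -> float:
    """Total time when each worker takes the next client as soon as it is free."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Min-heap holding the time at which each worker becomes free.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)


if __name__ == "__main__":
    client_durations = [4.0, 1.0, 1.0, 4.0]
    print(synchronous_makespan(client_durations, n_workers=2))   # 8.0
    print(asynchronous_makespan(client_durations, n_workers=2))  # 6.0
```

In this toy run the asynchronous schedule finishes earlier because the worker that drew the short client does not sit idle while the other one finishes the long client.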

However, Gloo has no native non-blocking way to check whether a recv/send request has completed (see details here: https://github.com/pytorch/pytorch/issues/30723). Therefore, when using Gloo, communication is synchronous.

I've added a fix for the CUDA OOM issues I was hitting when running the bert experiment, where GPU memory was being overloaded during training. Some cleanup is now performed after the server receives the gradient. The comparison below shows MPI (https://aka.ms/amlt?q=dcbbn) vs. NCCL.

![image (14).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/image%20%2814%29.png)

There are a couple of minor changes in _server_, _client_ and _evaluation_ as well. The main reason is that the Server no longer holds the list of clients: they live inside the Worker from the moment it is created, and the Server only passes the Client indexes to the Worker. The reason behind this change is that torch.distributed does not allow sending objects P2P, only tensors.
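The index-passing pattern described above can be sketched without any framework. In this illustration a fixed-size list of integers stands in for the tensor that would travel over torch.distributed, and `-1` marks unused slots. The `Worker` class, the `encode_client_ids` helper and the padding convention are all hypothetical and are not taken from _federated.py_.

```python
from typing import Dict, List, Sequence

PAD = -1  # marks an unused slot in the fixed-size message


def encode_client_ids(client_ids: Sequence[int], size: int) -> List[int]:
    """Server side: pack client indexes into a fixed-size integer message."""
    if len(client_ids) > size:
        raise ValueError("too many client ids for the message size")
    return list(client_ids) + [PAD] * (size - len(client_ids))


class Worker:
    """Owns the client data from creation, so only indexes need to be sent."""

    def __init__(self, clients: Dict[int, List[float]]) -> None:
        self.clients = clients

    def handle(self, message: Sequence[int]) -> Dict[int, float]:
        """Look up each requested client locally and return a toy statistic (the mean)."""
        results: Dict[int, float] = {}
        for idx in message:
            if idx == PAD:
                continue
            data = self.clients[idx]
            results[idx] = sum(data) / len(data)
        return results


if __name__ == "__main__":
    worker = Worker({0: [1.0, 3.0], 2: [10.0]})
    message = encode_client_ids([2, 0], size=4)
    print(message)
    print(worker.handle(message))
```

Keeping the message a flat sequence of integers is what makes it representable as a tensor. The client objects themselves never cross the process boundary.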

The rest of the modified files only update the documentation and the testing file. I tried to be very explicit in each new function inside _federated.py_ to explain the new flow. Let me know if something is not clear enough.

I've already tested all experiments in the sandbox using NCCL and on my local machine (Windows) using Gloo. Surprisingly, Gloo was not as slow as I was expecting, although I used some dummy datasets that I had prepared. Pending task: compare the new performance using CPU.

So for now, the only thing left is to run the sanity checks on AML; links below.

Sanity checks after cleanup:
[x] nlg_gru: https://aka.ms/amlt?q=c9tih
[x] mlm_bert: https://aka.ms/amlt?q=dbs8y
[x] classif_cnn: https://aka.ms/amlt?q=da2qr
[x] ecg: https://aka.ms/amlt?q=c9jof
[x] cv: https://aka.ms/amlt?q=da2k4

2022-08-26 14:54:27 +00:00

3 commits