Commit Graph

54 Commits

Author SHA1 Message Date
Mirian Hipolito Garcia 8bfe0854ab Merged PR 1578: Include FedProx aggregation method
Implementation of the FedProx aggregation method, taken from the paper "Federated Learning on Non-IID Data Silos: An Experimental Study" (https://arxiv.org/pdf/2102.02079.pdf).
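
For context, FedProx is commonly described as adding a proximal term to each client's local objective, which penalizes drift away from the current global model. The sketch below shows only that term, on flat lists of weights. The function names and the default `mu` are illustrative assumptions and do not reflect how FLUTE implements it.

```python
from typing import Sequence


def fedprox_penalty(local_weights: Sequence[float],
                    global_weights: Sequence[float],
                    mu: float) -> float:
    """Return (mu / 2) * ||w_local - w_global||^2 for flat weight vectors."""
    if len(local_weights) != len(global_weights):
        raise ValueError("weight vectors must have the same length")
    squared_distance = sum((w - g) ** 2
                           for w, g in zip(local_weights, global_weights))
    return 0.5 * mu * squared_distance


def fedprox_loss(task_loss: float,
                 local_weights: Sequence[float],
                 global_weights: Sequence[float],
                 mu: float = 0.01) -> float:
    """Local objective: task loss plus the proximal penalty (default mu is hypothetical)."""
    return task_loss + fedprox_penalty(local_weights, global_weights, mu)
```

With `mu = 0` the penalty vanishes and the client optimizes the plain task loss.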

[x] nlg_gru_fedprox: https://ml.azure.com/runs/8c052875-d053-4e70-b5b6-8f591faf5936?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

**Comparison**

- DGA (Acc 0.15, Loss 5.5)

![image.png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1578/attachments/image.png)

- FedProx (Acc 0.18, Loss 4.8)

![image (2).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1578/attachments/image%20%282%29.png)
2023-08-23 15:36:25 +00:00
Mirian Hipolito Garcia e8fe10b6a2 Merged PR 1563: Adapt federated.py for extra privacy metrics
The old federated.py broke when additional privacy metrics were enabled in the config file. This PR allows these extra keys to be included in the client payload sent to the Server.
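
The idea can be sketched as merging any optional privacy metrics into the payload before it is sent. The key names and the `build_client_payload` helper below are hypothetical and are not taken from federated.py.

```python
from typing import Any, Dict, Optional


def build_client_payload(base_payload: Dict[str, Any],
                         privacy_metrics: Optional[Dict[str, float]] = None
                         ) -> Dict[str, Any]:
    """Return a copy of the payload, extended with privacy metrics if present."""
    payload = dict(base_payload)  # copy so the caller's dict is not mutated
    if privacy_metrics:
        overlap = payload.keys() & privacy_metrics.keys()
        if overlap:
            raise KeyError(f"privacy metrics would overwrite: {sorted(overlap)}")
        payload.update(privacy_metrics)
    return payload


base = {"client_id": 3, "num_samples": 128}  # made-up payload fields
with_dp = build_client_payload(base, {"privacy_leakage": 0.2})  # hypothetical key
without_dp = build_client_payload(base)
```

Keeping the extra keys optional means the same code path works whether the privacy metrics are enabled or disabled.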

NLG_GRU:
[X] Apply DP metrics:  https://aka.ms/amlt?q=hlimq
[X] DP metrics disabled: https://aka.ms/amlt?q=hlimz
2023-08-14 18:50:16 +00:00
Mirian Hipolito Garcia a47d3ddf56 Merged PR 1557: Update readme files for dummy data creation
Simple PR to update the readmes used for dummy data creation in the testing folder.
2023-06-28 19:06:37 +00:00
Mirian Hipolito Garcia 43e15308e2 Merged PR 1546: Allow single GPU/CPU processes
Allow single GPU/CPU processes

[X] - Multi GPU_nlg: https://ml.azure.com/runs/adb32644-7ad3-425f-ac5f-8a81d2756147?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
[X] - Single GPU_nlg: https://ml.azure.com/runs/3295eb10-f0bd-478a-a7c2-8f1cac8f9be5?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47#metrics
[X] - Single GPU_ecg: https://ml.azure.com/runs/34f7da18-e230-4df8-9b2f-f4c916d4d005?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47&reloadCount=1
[X] - Single GPU_classif_cnn: https://ml.azure.com/runs/0ff278ea-0bcb-4781-bb85-fe15514edd53?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47#metrics
2023-06-28 15:45:16 +00:00
Mirian Hipolito Garcia 9df34a8bfd Merged PR 1503: Fix replay_server option
Fix replay_server option + update requirements

Sidenote: Python 3.8 seems to cause some issues at runtime, so I've updated the readme to use Python 3.7, which is the version I'm using in AML and the local sandbox.

Sanity-checks:
[X] https://aka.ms/amlt?q=e3q0b
2023-03-16 16:28:07 +00:00
Mirian Hipolito Garcia 0477a95306 Merged PR 1451: Multi-node Jobs
Allow multi-node jobs using torch.distributed

Sanity-checks:
[x] Multi-node Job: https://aka.ms/amlt?q=d68m9
[x] Single-node Job:  https://aka.ms/amlt?q=d68nu
2023-01-04 15:11:59 +00:00
Mirian Hipolito Garcia 0e8762b0e8 Merged PR 1421: Include Semi-Supervision code
Include Semi-Supervision Code.

**Yae Jee Results**
![image (2).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1421/attachments/image%20%282%29.png)

**FLUTE Results**
![image (5).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1421/attachments/image%20%285%29.png)

The final accuracy of the three models is very close to YJ's results; the only difference is how the training loss is computed. Her code uses all of the clients' training data in each round, whereas FLUTE uses only the clients sampled in that iteration. I asked Dimitrios about this a while ago and he said it shouldn't be a big issue, only that the loss would oscillate more in our case than in her results, which I believe is what we see.

I had to make some slight changes to a couple of files inside _/core_ so that FLUTE can use both aggregation methods (pseudogradients or state dictionaries) and pass the current iteration number from the server to the clients in each round, since YJ's algorithm needs to run _burnout rounds_ before the algorithm itself starts. Aside from that, there are no major changes.
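
To illustrate the two aggregation styles mentioned above, here is a minimal sketch on plain dictionaries of floats. The sign convention for the pseudo-gradient (global minus local), the unweighted average, and all function names are assumptions made for illustration; the actual code in _/core_ may differ.

```python
from typing import Dict, List

Weights = Dict[str, float]


def average(dicts: List[Weights]) -> Weights:
    """Unweighted mean of several parameter dictionaries with identical keys."""
    return {k: sum(d[k] for d in dicts) / len(dicts) for k in dicts[0]}


def aggregate_state_dicts(client_states: List[Weights]) -> Weights:
    """State-dictionary style: the new global model is the mean of client models."""
    return average(client_states)


def aggregate_pseudo_gradients(global_state: Weights,
                               client_states: List[Weights],
                               server_lr: float = 1.0) -> Weights:
    """Pseudo-gradient style: average (global - local), then take a server step."""
    pseudo_grads = [{k: global_state[k] - s[k] for k in global_state}
                    for s in client_states]
    mean_grad = average(pseudo_grads)
    return {k: global_state[k] - server_lr * mean_grad[k] for k in global_state}


global_state = {"w": 1.0}
clients = [{"w": 0.5}, {"w": 1.5}, {"w": 2.5}]  # made-up client models
by_state = aggregate_state_dicts(clients)
by_grad = aggregate_pseudo_gradients(global_state, clients)
```

With `server_lr=1.0` and an unweighted mean, both routes produce the same global model. They diverge once the server applies its own learning rate or optimizer to the averaged pseudo-gradient.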

Let me know if you have any comments!

Pending Tasks
[x] Remove amlt/git files
[x] 3rd Party notice?

Sanity-Checks:
[x] semisupervision:  https://aka.ms/amlt?q=d4h7c
[x] mlm_bert: https://aka.ms/amlt?q=d2p0d
[x] nlg_gru: https://aka.ms/amlt?q=d4fq7
2022-12-14 18:18:51 +00:00
Mirian Hipolito Garcia 405ac09606 Merged PR 1416: Benchmarking Experiments
Include the Flower/FedML Benchmarking code for FLUTE.
2022-11-03 14:07:37 +00:00
Robert Sim 6430cbf708 Merged PR 1402: Fix DP accounting bug
Fix DP accounting bug.  The accountant needed the size of the total pool of clients to sample from.
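
As a rough illustration of why the pool size matters: privacy accountants for subsampled mechanisms typically take the per-round sampling probability as input, and that probability depends on the total number of clients, not only on the number sampled per round. The helper below is a hypothetical sketch that assumes uniform sampling; it is not the dp-accountant API.

```python
def client_sampling_rate(clients_per_round: int, total_clients: int) -> float:
    """Probability that a given client takes part in one round (uniform sampling assumed)."""
    if total_clients <= 0:
        raise ValueError("total_clients must be positive")
    if not 0 <= clients_per_round <= total_clients:
        raise ValueError("clients_per_round must be between 0 and total_clients")
    return clients_per_round / total_clients


# Same round size, very different sampling rates depending on the pool size.
small_pool = client_sampling_rate(10, 100)
large_pool = client_sampling_rate(10, 100_000)
```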
2022-10-24 20:33:31 +00:00
Mirian Hipolito Garcia 1c3f556648 Merged PR 1393: Fix mlm_bert dataset
Avoid unnecessary data loading in e2e_trainer.py for the mlm_bert task. Saves about 90 GB compared with the previous version.

Sanity-check: https://ml.azure.com/allJobs/job/hai7/lab-rr1-v100-dgx2-3/bafff437ffff5468873739619d14194c?flight=itpmerge&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
2022-10-14 16:00:05 +00:00
Robert Sim 1acad74c1c Merged PR 1392: Fix scaling issue in federated.py
federated.py was accumulating gradients in one big list until all the clients were done, which prevented runs with very large numbers of clients. Adjusted to yield them as they come in.
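
The change described above can be pictured as replacing a list that holds every client's gradient with a generator that yields each one as it arrives, so the server keeps only a running sum. This is a minimal sketch with hypothetical names and plain lists in place of tensors, not the code in federated.py.

```python
from typing import Iterable, Iterator, List


def receive_gradients(client_ids: Iterable[int]) -> Iterator[List[float]]:
    """Yield one (fake) gradient per client instead of collecting them all first."""
    for client_id in client_ids:
        # Stand-in for receiving a real gradient from a worker.
        yield [float(client_id), 2.0 * client_id]


def running_mean(gradients: Iterable[List[float]]) -> List[float]:
    """Average gradients while holding only the running sum in memory."""
    total: List[float] = []
    count = 0
    for grad in gradients:
        if not total:
            total = [0.0] * len(grad)
        total = [t + g for t, g in zip(total, grad)]
        count += 1
    if count == 0:
        raise ValueError("no gradients received")
    return [t / count for t in total]


mean_grad = running_mean(receive_gradients(range(1, 1001)))
```

Memory use here is independent of the number of clients, because each gradient is discarded once it has been added to the running sum.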

sample job aggregating 1,000 MLMs: https://ml.azure.com/runs/flute_exp-6d141d91-fancy-arachnid-MLM_LAMB_1k_TEST-347c22a0?wsid=/subscriptions/46da6261-2167-4e71-8b0d-f4a45215ce61/resourcegroups/hai7-wus2/workspaces/hai7&tid=72f988bf-86f1-41af-91ab-2d7cd011db47#overview
2022-10-13 22:58:25 +00:00
Mirian Hipolito Garcia 9d5d8d45f1 Merged PR 1370: Add FedNewsRec Model
Include FedNewsRec Model from: https://github.com/simra/FedNewsRec

- MIND_Large, 1500 rounds, 6 clients per round:

|Platform|AUC|MRR|nDCG5|nDCG10|
|:----|:----|:----|:----|:----|
|FedNews|0.54|0.23|0.25|0.32|
|FLUTE|0.58|0.24|0.26|0.33|

- MIND_Large, 1500 rounds, 500 clients per round:

|Platform|AUC|MRR|nDCG5|nDCG10|
|:----|:----|:----|:----|:----|
|FLUTE|0.56|0.24|0.26|0.32|

Links:
- 6 clients per round: https://ml.azure.com/runs/mirianh-6d141d91-striking-goat-NCCL_fednewsrec_large_6-d0fa3092?wsid=/subscriptions/46da6261-2167-4e71-8b0d-f4a45215ce61/resourcegroups/hai7-wus2/workspaces/hai7&tid=72f988bf-86f1-41af-91ab-2d7cd011db47#overview
- 500 clients per round: https://aka.ms/amlt?q=dqvqj
2022-10-10 14:47:43 +00:00
Mirian Hipolito Garcia 4444b639cd Merged PR 1382: Update readme + requirements
- Update new pointers to testing files in readme
- Freeze PyTorch version
2022-10-05 21:23:50 +00:00
Mirian Hipolito Garcia 30e41400a4 Merged PR 1272: Replace MPI -> torch.distributed
This PR replaces MPI with torch.distributed as the main communication backbone, which allows NCCL to be used for GPU jobs and Gloo for distributed CPU jobs. The most significant changes are inside _federated.py_.

Asynchronous mode is enabled when using NCCL, which means that workers are reassigned to a new Client as soon as they finish. This improves overall GPU utilization and reduces the total job time, as shown in the figure below.

![COMPARISON (2).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/COMPARISON%20%282%29.png)

However, Gloo has no native non-blocking way to check whether a recv/send request has completed (see details here: https://github.com/pytorch/pytorch/issues/30723). Therefore, when using Gloo, the communication is synchronous.
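
The utilization argument can be checked with a toy scheduling model. The sketch below is a simulation under stated assumptions, not FLUTE code: it treats the synchronous mode as waiting for a whole batch of clients before starting the next one, and the asynchronous mode as handing the next client to whichever worker frees up first. The client durations are made up.

```python
import heapq
from typing import List


def synchronous_time(durations: List[float], num_workers: int) -> float:
    """Total time when each batch of clients must finish before the next starts."""
    total = 0.0
    for start in range(0, len(durations), num_workers):
        total += max(durations[start:start + num_workers])
    return total


def asynchronous_time(durations: List[float], num_workers: int) -> float:
    """Total time when a free worker immediately picks up the next client."""
    free_at = [0.0] * num_workers  # min-heap of times at which workers become free
    heapq.heapify(free_at)
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)


client_durations = [4.0, 1.0, 1.0, 1.0, 1.0, 4.0]  # made-up training times
sync_total = synchronous_time(client_durations, num_workers=2)
async_total = asynchronous_time(client_durations, num_workers=2)
```

In this toy model the gap between the two totals grows with the spread in client training times, since a batch is only as fast as its slowest client.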

I've added a fix for the CUDA OOM issues I was getting when running the BERT experiment, where GPU memory was being overloaded during training: some cleanup is now performed after the server receives the gradient. The comparison below shows MPI (https://aka.ms/amlt?q=dcbbn) vs. NCCL.

![image (14).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/image%20%2814%29.png)

There are also a couple of minor changes in _server_, _client_ and _evaluation_. The main reason is that the Server no longer holds the list of clients: they live inside the worker from the moment it is created, and the Server only passes the Client indexes to the Worker. This change was needed because torch.distributed does not allow sending objects P2P, only tensors.

The rest of the modified files only update the documentation and the testing file. I tried to be very explicit in documenting each new function inside _federated.py_ to explain the new flow. Let me know if anything is not clear enough.

I've already tested all experiments in the sandbox using NCCL and on my local machine (Windows) using Gloo. Surprisingly, Gloo was not as slow as I was expecting, although I used some dummy datasets that I had prepared. Pending task: compare the new performance using CPU.

So for now the only thing left is to run the sanity checks on AML; links below.

Sanity checks after cleanup:
[x] nlg_gru: https://aka.ms/amlt?q=c9tih
[x] mlm_bert: https://aka.ms/amlt?q=dbs8y
[x] classif_cnn: https://aka.ms/amlt?q=da2qr
[x] ecg: https://aka.ms/amlt?q=c9jof
[x] cv: https://aka.ms/amlt?q=da2k4
2022-08-26 14:54:27 +00:00
Andre Manoel 887c8ac74f
Attempting to fix CodeQL pipeline (#13) 2022-08-22 12:53:11 -03:00
Andre Manoel 47c4e588b8 Merged PR 1296: Move CodeQL to Github
Adding YAML file for CodeQL Action
2022-08-22 14:50:40 +00:00
Mirian Hipolito Garcia a2a129e5e8 Merged PR 1288: Update dp-accountant submodule
The dp-accountant submodule was not working in the GitHub repo. It seems that one of the .git files holding the pointer was broken, but I couldn't find the exact issue, so I removed the submodule and added it again inside the _utils_ folder.

Reference: https://github.com/microsoft/msrflute/issues/9

I've already tested the functionality in the sandbox. @<Andre Manoel>, can you please test it on your local machine as well to see if it throws any error that I'm not able to reproduce?

`$ git submodule update --init --recursive`
`$ cd utils`
`$ cd dp-accountant`
`$ python setup.py install`
`$ ./bin/compute-dp-epsilon --help`
2022-08-15 14:31:54 +00:00
Mirian Hipolito Garcia 647cc069ac Merged PR 1286: Add citation file
Adding a `CITATION.cff` file to display the BibTeX and APA citation widget in GitHub, as in the example below.

![image (2).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1286/attachments/image%20%282%29.png)
2022-08-10 21:46:55 +00:00
microsoft-github-policy-service[bot] 19eb02f485 Microsoft mandatory file 2022-08-03 10:07:39 -05:00
Robert Sim a6e08be4b6 Update copyright notice
Redacted the MSFT copyright boilerplate as it doesn't apply here.
2022-08-03 10:07:39 -05:00
Mirian Hipolito Garcia 4970ef31d2 Merged PR 1129: Add personalization code
Migrating personalization code from FTL.release.

Adding the **"personalization"** feature, compatible with the current code in _main_.

Latest experiment using resnet50: https://ml.azure.com/runs/f739987a-46b9-47fc-a1bc-75c59ac0c13c?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

I've also tested the backwards compatibility, using nlg_gru task: https://ml.azure.com/runs/02381916-2895-4615-b398-42ae7594a79d?wsid=/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourcegroups/gcr-singularity-octo/workspaces/msroctows&tid=72f988bf-86f1-41af-91ab-2d7cd011db47#metrics

@<Dimitrios Dimitriadis>, do we have any 3rd party notice to add?
2022-06-14 18:52:31 +00:00
Mirian Hipolito Garcia 866d0a072c Merged PR 1213: Remove file type dependency on client.py
- Replace _data_dict in client.py by a dataset.
- Remove the loader_type dependency for dataloaders utilities
- Add Base Classes for dataset and dataloaders
- Add example for a previously created dataset instantiation in classif_cnn example
- Allow datasets to be downloaded on the fly
- Update documentation

Sanity checks:
[x] nlg_gru: https://aka.ms/amlt?q=cn6vj
[x] mlm_bert: https://aka.ms/amlt?q=cppmb
[x] classif_cnn: https://aka.ms/amlt?q=cn6vw
[x] ecg: https://aka.ms/amlt?q=codet
2022-06-08 15:56:17 +00:00
Mirian Hipolito Garcia 78a401a48a Merged PR 1145: Sanity Check
- Incorporate pipeline for ADO.
- Sanity checks for all the experiments (using dummy data).
- Update documentation for _testing_ folder, so users can run the tests on their local machines as well.

**Update:** Using xfail from pytest to allow mlm_bert to fail the test in the pipeline.

![image (3).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1145/attachments/image%20%283%29.png)
2022-05-17 19:20:20 +00:00
Mirian Hipolito Garcia 299312e461 Merged PR 1139: Abstract class for models
- Include an abstract class for models in core/model.py.
- Update the model classes of each experiment accordingly.
- Remove the abstract class for metrics (it is no longer necessary). New metrics only need to be declared in the dictionary returned by `inference()`, and FLUTE will recognize them during the evaluation rounds.
- custom_metrics.py inside each experiment folder is no longer needed.
- Update the docs for model implementation and metrics.
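
A minimal sketch of what declaring an extra metric in the dictionary returned by `inference()` might look like. The class, the key names and the nested `value`/`higher_is_better` layout are assumptions made for illustration; the FLUTE documentation is the reference for the exact format.

```python
from typing import Any, Dict, List


class ToyModel:
    """Hypothetical stand-in for a model class; not FLUTE's abstract model class."""

    def inference(self, batch: Dict[str, List[int]]) -> Dict[str, Any]:
        labels, predictions = batch["y"], batch["pred"]
        correct = sum(int(p == y) for p, y in zip(predictions, labels))
        accuracy = correct / len(labels)
        return {
            "acc": accuracy,
            "batch_size": len(labels),
            # Extra metric declared inline, with no separate custom_metrics.py.
            "error_rate": {"value": 1.0 - accuracy, "higher_is_better": False},
        }


result = ToyModel().inference({"y": [1, 0, 1, 1], "pred": [1, 1, 1, 0]})
```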
2022-05-04 23:32:02 +00:00
Mirian Hipolito Garcia 08ac1bb4ed Merged PR 1128: Move model config to its matching experiments folder.
Proof of concept to define independent configuration classes for each model type.

Moving PR from FTL.release
2022-04-26 17:42:05 +00:00
Mirian Hipolito Garcia 4c22c56b06 Merged PR 1127: Update sphinx docs
Moving PR from FTL.release
2022-04-26 16:59:46 +00:00
Andre Manoel 4eda3eaccd Amending last sync 2022-04-25 14:31:45 -07:00
Robert Sim a349f89fb3
Update README.md 2022-04-15 08:01:01 -07:00
Mirian-Hipolito c28e2e6bcf
Merge pull request #7 from microsoft/sync-20220401
New examples, refactored metrics and strategies, and improved config parsing
2022-04-01 12:07:44 -06:00
Andre Manoel a7b53eef55 Fixing doc building pipeline 2022-04-01 09:25:33 -07:00
Robert Sim a8a20c2c7d Merged PR 1083: Move config validation to FLUTEConfig.validate(). Fix some bugs.
what it says on the tin
2022-04-01 08:49:29 -07:00
Robert Sim f97162fd0a Merged PR 1081: config docstrings, part 2
Documented second half of config.py
The last remaining class to document is DataSet, which is complicated.

A couple minor bug/typo fixes included.
2022-04-01 08:33:33 -07:00
Mirian Hipolito Garcia f582e43893 Merged PR 1082: Save the best model 2022-04-01 08:33:33 -07:00
Robert Sim cfd9f57049 Merged PR 1072: Docstrings for config classes, part 1
Added some documentation to the first half of the config classes.
2022-04-01 08:33:33 -07:00
Andre Manoel a5df98b469 Merged PR 1041: Refactoring FL strategy-specific code into a different component 2022-04-01 08:33:33 -07:00
Mirian Hipolito Garcia 4c59f36470 Merged PR 1056: Allowing customized metrics 2022-04-01 08:33:29 -07:00
Jakob Serlier f134c0e091 Merged PR 1037: PR: heartbeat experiment 2022-04-01 08:32:30 -07:00
Mirian-Hipolito 781fa4c096
Merge pull request #6 from microsoft/sync-20220223
Refactoring Evaluation + Documentation in Config files
2022-02-23 14:43:51 -06:00
Mirian Hipolito Garcia 9a5c6e4a3f Typo 2022-02-23 11:56:05 -06:00
Mirian Hipolito Garcia 59bd382291 Some documentation 2022-02-23 11:54:48 -06:00
Mirian Hipolito Garcia ac2ac8967e Updating ServerConfig 2022-02-23 11:54:48 -06:00
Mirian Hipolito Garcia 960884d7c4 Change in config files 2022-02-23 11:54:13 -06:00
Andre Manoel b0ee9cc995
Merge pull request #5 from microsoft/sync-20220207
Some fixes + improving config
2022-02-11 12:55:52 -03:00
Robert Sim d89c9c583d Merged PR 1022: Add classes for configuration
Configs are now classes.
They're set up to be backwards compatible with the existing code; only a few adjustments were needed.
Going forward we should make an effort to reference config values as properties rather than as dictionary items. Over time we can clean up the verbose code and remove the dictionary support.

In the vast majority of cases everything will just work. There may be some edge cases in distinguishing between a value of None and a value that was not configured.
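
One way to support both access styles is a class that exposes properties while still answering dictionary-style lookups. The sketch below uses a hypothetical `ServerSettings` dataclass with made-up field names; it is not the code in config.py.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ServerSettings:
    """Hypothetical config class that also supports dict-style access."""
    max_iteration: int = 100
    resume_from_checkpoint: Optional[bool] = None

    def __getitem__(self, key: str) -> Any:
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        return getattr(self, key)

    def __contains__(self, key: str) -> bool:
        return key in {f.name for f in fields(self)}

    def get(self, key: str, default: Any = None) -> Any:
        return self[key] if key in self else default


cfg = ServerSettings(max_iteration=10)
```

Here `cfg.max_iteration` and `cfg["max_iteration"]` return the same value. Note that `"resume_from_checkpoint" in cfg` is true even though its value is `None`, whereas a plain dictionary would not contain a key that was never configured. That is the kind of edge case mentioned above.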

I have validated locally and in a swiftkey run.  We should also validate an mlm or zcode.
2022-02-07 08:21:02 -08:00
Robert Sim 38b4537a6b Merged PR 1019: fix some profiling and testing bugs
fix some profiling and testing bugs
2022-02-07 08:19:22 -08:00
Andre Manoel d0722e8536
Merge pull request #2 from microsoft/build-docs
Create a Github action for building the docs automatically
2022-01-17 15:56:47 -03:00
Andre Manoel 4f91dcb2fe Fixed warnings from CI for building docs 2022-01-17 10:31:55 -08:00
Robert Sim 10c3182499
Merge pull request #1 from simra/main
update readme
2022-01-11 08:21:50 -08:00
Andre Manoel 1b89b8ec67 Adding files for GH action that builds the docs 2022-01-07 11:42:00 -08:00
Robert Sim 367409d42f update README 2021-12-16 11:57:06 -08:00