The old federated.py broke when additional privacy metrics were enabled in the config file. This PR allows these extra keys to be included in the client payload communicated to the Server.
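As a rough illustration of the change (the key names and helper below are made up, not the actual payload code), the payload is a plain dict, so the optional privacy-metric keys simply ride along when they're enabled:

```python
import torch

def build_client_payload(gradient, num_samples, privacy_metrics=None):
    payload = {"weights": gradient, "num_samples": num_samples}
    if privacy_metrics:                  # extra keys only when DP metrics are enabled
        payload.update(privacy_metrics)  # e.g. {"practical_epsilon": 0.8}
    return payload

payload = build_client_payload(torch.zeros(10), num_samples=32,
                               privacy_metrics={"practical_epsilon": 0.8})
```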
NLG_GRU:
[X] Apply DP metrics: https://aka.ms/amlt?q=hlimq
[X] DP metrics disabled: https://aka.ms/amlt?q=hlimz
Fix replay_server option + update requirements
Sidenote: Python 3.8 seems to cause some issues when running, so I've updated the README to use Python 3.7, since that's the version I'm using in AML and in the local sandbox.
Sanity-checks:
[X] https://aka.ms/amlt?q=e3q0b
This PR replaces MPI with torch.distributed as the main communication backbone, allowing the use of NCCL for GPU jobs and Gloo for CPU distributed jobs. The most significant changes are inside _federated.py_.
Asynchronous mode is enabled when using NCCL, which means that workers are reassigned to a new Client as soon as they finish, improving overall GPU utilization and reducing the total job time, as shown in the figure below.
![COMPARISON (2).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/COMPARISON%20%282%29.png)
However, Gloo does not have a native non-blocking way to check whether a recv/send request has completed (see details here: https://github.com/pytorch/pytorch/issues/30723), so when using Gloo the communication works synchronously.
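Below is a minimal sketch of this pattern, not the actual federated.py code (function names, tensor shapes and ranks are illustrative): with NCCL the server polls outstanding receives and hands an idle worker the next Client right away, while with Gloo it falls back to blocking waits.

```python
import torch
import torch.distributed as dist

def init_backend():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # rank/world_size/address come from the launcher env
    return backend

def gather_results(worker_ranks, payload_size, backend):
    device = "cuda" if backend == "nccl" else "cpu"   # NCCL requires CUDA tensors
    buffers = {r: torch.zeros(payload_size, device=device) for r in worker_ranks}
    requests = {r: dist.irecv(buffers[r], src=r) for r in worker_ranks}

    if backend == "nccl":
        # Asynchronous mode: reassign a worker as soon as its result arrives.
        pending = set(worker_ranks)
        while pending:
            for r in list(pending):
                if requests[r].is_completed():
                    pending.remove(r)
                    yield r, buffers[r]   # schedule the next Client on rank r here
    else:
        # Gloo: no non-blocking completion check, so wait synchronously.
        for r in worker_ranks:
            requests[r].wait()
            yield r, buffers[r]
```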
I've also added a fix for the CUDA OOM issues I was hitting when running the bert experiment, where GPU memory was being overloaded during training: some cleanup is now performed after the server receives the gradient. The figure below compares MPI (https://aka.ms/amlt?q=dcbbn) with the new NCCL run.
![image (14).png](https://msktg.visualstudio.com/c507252c-d1be-4d67-a4a1-03b0181c35c7/_apis/git/repositories/0392018c-4507-44bf-97e2-f2bb75d454f1/pullRequests/1272/attachments/image%20%2814%29.png)
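As a rough illustration of the cleanup idea (the function below is a sketch, not the actual fix), the pattern is to keep only a CPU copy of the received gradient, drop the GPU reference, and ask the CUDA caching allocator to release the memory:

```python
import gc
import torch

def consume_gradient(grad_gpu: torch.Tensor) -> torch.Tensor:
    """Copy a received gradient to CPU, then release the GPU memory it used."""
    grad_cpu = grad_gpu.detach().cpu()
    del grad_gpu                  # drop this function's reference to the GPU tensor
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached blocks back to the device
    return grad_cpu
```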
There are also a couple of minor changes in _server_, _client_ and _evaluation_. The main reason is that the Server no longer holds the list of clients; clients now live inside the worker from the moment it is created, and the Server only passes Client indices to the Worker. The reason behind this change is that torch.distributed does not allow sending objects P2P, only tensors.
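A minimal sketch of that idea, with illustrative names rather than the actual Server/Worker code:

```python
import torch
import torch.distributed as dist

SERVER_RANK = 0

def assign_client(worker_rank: int, client_index: int):
    # Server side: ship only the index, not the Client object.
    # (CPU tensor shown for brevity; with NCCL it would live on the GPU.)
    dist.send(torch.tensor([client_index], dtype=torch.int64), dst=worker_rank)

def next_client(local_clients):
    # Worker side: receive the index and resolve it against the local client list.
    idx = torch.zeros(1, dtype=torch.int64)
    dist.recv(idx, src=SERVER_RANK)
    return local_clients[int(idx.item())]
```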
The rest of the modified files only update the documentation and the testing file. I tried to be very explicit about each new function inside _federated.py_ to explain the new flow. Let me know if something is not clear enough.
I've already tested all experiments in the sandbox using NCCL and on my local machine (Windows) using Gloo (surprisingly, this case is not as slow as I was expecting, though I used some dummy datasets I had prepared). A pending task is to compare the new performance on CPU.
So for now the only thing left is to run the sanity checks on AML; links below.
Sanity checks after cleanup:
[x] nlg_gru: https://aka.ms/amlt?q=c9tih
[x] mlm_bert: https://aka.ms/amlt?q=dbs8y
[x] classif_cnn: https://aka.ms/amlt?q=da2qr
[x] ecg: https://aka.ms/amlt?q=c9jof
[x] cv: https://aka.ms/amlt?q=da2k4
The dp-accountant submodule was not working in the GitHub repo. It seems like one of the .git files that holds the pointer was broken, but I couldn't find the exact issue, so I've removed the submodule and added it again inside the _utils_ folder.
Reference: https://github.com/microsoft/msrflute/issues/9
I've already tested the functionality in the sandbox. @<Andre Manoel>, can you please test it on your local machine as well to see if it throws any error that I'm not able to reproduce?
`$ git submodule update --init --recursive`
`$ cd utils`
`$ cd dp-accountant`
`$ python setup.py install`
`$ ./bin/compute-dp-epsilon --help`
- Replace _data_dict in client.py with a dataset.
- Remove the loader_type dependency from the dataloader utilities.
- Add base classes for datasets and dataloaders (see the sketch after this list).
- Add an example in classif_cnn of instantiating a previously created dataset.
- Allow datasets to be downloaded on the fly.
- Update documentation.
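As a rough illustration of the new dataset contract (class and argument names here are assumptions, not the exact FLUTE API), an experiment dataset now just subclasses the base class and exposes the usual map-style interface instead of the old _data_dict:

```python
import torch
from torch.utils.data import Dataset

class BaseDataset(Dataset):
    """Minimal map-style interface the framework relies on."""
    def __getitem__(self, idx):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

class ToyClassifDataset(BaseDataset):
    def __init__(self, data_by_user, user_idx):
        # `data_by_user` maps a user id to a list of (features, label) pairs;
        # each federated client only sees its own shard.
        self.samples = data_by_user[user_idx]

    def __getitem__(self, idx):
        x, y = self.samples[idx]
        return torch.as_tensor(x), torch.as_tensor(y)

    def __len__(self):
        return len(self.samples)
```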
Sanity checks:
[x] nlg_gru: https://aka.ms/amlt?q=cn6vj
[x] mlm_bert: https://aka.ms/amlt?q=cppmb
[x] classif_cnn: https://aka.ms/amlt?q=cn6vw
[x] ecg: https://aka.ms/amlt?q=codet
- Include an abstract class for models in core/model.py.
- Update the model classes of each experiment accordingly.
- Remove the abstract class for metrics (it is no longer necessary); new metrics only need to be declared in the dictionary returned by `inference()`, and FLUTE will recognize them during the evaluation rounds (see the sketch after this list).
- custom_metrics.py inside each experiment folder is no longer needed.
- Update the docs for model implementation and metrics.
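As a rough illustration (the model and key names below are assumptions, not the exact FLUTE signature), declaring an extra metric is just a matter of adding another entry to the dictionary returned by `inference()`:

```python
import torch

class ToyModel(torch.nn.Module):
    def __init__(self, dim=32, num_classes=10):
        super().__init__()
        self.linear = torch.nn.Linear(dim, num_classes)
        self.loss_fn = torch.nn.CrossEntropyLoss()

    def inference(self, features, labels):
        logits = self.linear(features)
        preds = logits.argmax(dim=-1)
        return {
            "output": logits,
            "loss": self.loss_fn(logits, labels).item(),
            "acc": (preds == labels).float().mean().item(),
            # Any extra key becomes a metric reported during evaluation rounds:
            "top5_acc": (logits.topk(5, dim=-1).indices == labels.unsqueeze(-1))
                        .any(dim=-1).float().mean().item(),
        }
```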
Configs are now classes.
They're set up to be backwards compatible with the existing code; only a few adjustments were needed.
Going forward we should make an effort to reference config values as properties rather than as dictionary items; over time we can clean up the verbose code and remove the dictionary support.
In the vast majority of cases everything will just work. There may be some edge cases when distinguishing between None and not configured.
I have validated locally and in a swiftkey run. We should also validate an mlm or zcode run.
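As a rough illustration of the two access styles mentioned above (the class and field names below are made up, not the actual config classes), dictionary-style lookup is kept for backwards compatibility while new code should prefer the property form:

```python
class ServerConfig:
    def __init__(self, raw: dict):
        self.max_iteration = raw.get("max_iteration")
        self.optimizer_config = raw.get("optimizer_config", {})

    def __getitem__(self, key):
        # Legacy dictionary access; can be removed once call sites migrate.
        return getattr(self, key)

cfg = ServerConfig({"max_iteration": 1000})
print(cfg.max_iteration)      # preferred going forward
print(cfg["max_iteration"])   # still supported for existing code
```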