# Microsoft Health Intelligence Machine Learning Toolbox

[![Codecov coverage](https://codecov.io/gh/microsoft/hi-ml/branch/main/graph/badge.svg?token=kMr2pSIJ2U)](https://codecov.io/gh/microsoft/hi-ml) [![Code style: black](https://camo.githubusercontent.com/d91ed7ac7abbd5a6102cbe988dd8e9ac21bde0a73d97be7603b891ad08ce3479/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f636f64652532307374796c652d626c61636b2d3030303030302e737667)](https://github.com/psf/black)

## Overview

This toolbox aims to provide low-level and high-level building blocks for Machine Learning / AI researchers and
practitioners. It helps to simplify and streamline work on deep learning models for healthcare and life sciences
by providing tested components (data loaders, pre-processing), deep learning models, and cloud integration tools.

This repository consists of two Python packages, as well as project-specific codebases:

* PyPI package [hi-ml-azure](https://pypi.org/project/hi-ml-azure/) - providing helper functions for running in AzureML.
* PyPI package [hi-ml](https://pypi.org/project/hi-ml/) - providing ML components.
* hi-ml-cpath: Models and workflows for working with histopathology images

## Getting started

For the full toolbox (this will also install `hi-ml-azure`):

* Install from PyPI via `pip` by running `pip install hi-ml`

For just the AzureML helper functions:

* Install from PyPI via `pip` by running `pip install hi-ml-azure`

For the histopathology workflows, please follow the instructions [here](hi-ml-cpath/README.md).

If you would like to contribute to the code, please check the [developer guide](docs/source/developers.md).
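
After running one of the `pip install` commands above, it can be useful to confirm which of the two packages ended up in the current environment. The following is a minimal sketch that uses only the Python standard library (`importlib.metadata`, Python 3.8 or later). The helper name `installed_versions` is hypothetical and not part of `hi-ml` or `hi-ml-azure`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional


def installed_versions(distributions: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is not installed.

    Hypothetical helper for illustration; it is not part of the toolbox.
    """
    found: Dict[str, Optional[str]] = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names as published on PyPI (see the package list in the overview).
    for name, installed in installed_versions(["hi-ml", "hi-ml-azure"]).items():
        print(f"{name}: {installed or 'not installed'}")
```

If only `hi-ml-azure` was installed, the sketch should report `hi-ml` as not installed.
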

## Documentation

The detailed package documentation, with examples and API reference, is on
[readthedocs](https://hi-ml.readthedocs.io/en/latest/).

## Quick start: Using the Azure layer

Use case: you have a Python script that does something, such as training a model or pre-processing some data.
The `hi-ml-azure` package makes it easy to run that script on Azure Machine Learning (AML).

Here is an example script that reads images from a folder, resizes them, and saves them to an output folder:

```python
from pathlib import Path

# read_image and write_image are placeholders for your own image I/O helpers.
if __name__ == '__main__':
    input_folder = Path("/tmp/my_dataset")
    output_folder = Path("/tmp/my_output")
    for file in input_folder.glob("*.jpg"):
        contents = read_image(file)
        resized = contents.resize(0.5)
        # Write the resized image, not just the target path.
        write_image(resized, output_folder / file.name)
```

Doing that at scale can take a long time. **We'd like to run that script in AzureML, consume the data from a folder in
blob storage, and write the results back to blob storage**.

With the `hi-ml-azure` package, you can turn that script into one that runs on the cloud by adding one function call:

```python
from pathlib import Path
from health_azure import submit_to_azure_if_needed

# read_image and write_image are placeholders for your own image I/O helpers.
if __name__ == '__main__':
    current_file = Path(__file__)
    run_info = submit_to_azure_if_needed(compute_cluster_name="preprocess-ds12",
                                         input_datasets=["images123"],
                                         # Omit this line if you don't create an output dataset (for example, in
                                         # model training scripts)
                                         output_datasets=["images123_resized"],
                                         default_datastore="my_datastore")
    # When running in AzureML, run_info.input_datasets and run_info.output_datasets will be populated,
    # and point to the data coming from blob storage. For runs outside AML, the paths will be None.
    # Replace the None with a meaningful path, so that we can still run the script easily outside AML.
    input_dataset = run_info.input_datasets[0] or Path("/tmp/my_dataset")
    output_dataset = run_info.output_datasets[0] or Path("/tmp/my_output")
    files_processed = []
    for file in input_dataset.glob("*.jpg"):
        contents = read_image(file)
        resized = contents.resize(0.5)
        # Write the resized image, not just the target path.
        write_image(resized, output_dataset / file.name)
        files_processed.append(file.name)
    # Any other files that you would not consider an "output dataset", like metrics, etc, should be written to
    # a folder "./outputs". Any files written into that folder will later be visible in the AzureML UI.
    # run_info.output_folder already points to the correct folder.
    stats_file = run_info.output_folder / "processed_files.txt"
    stats_file.write_text("\n".join(files_processed))
```

Once these changes are in place, you can submit the script to AzureML by supplying an additional `--azureml` flag
on the command line, like `python myscript.py --azureml`.

That's it!

For details, please refer to the [onboarding page](docs/source/first_steps.md).

For more examples, please see [examples.md](docs/source/examples.md).
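
The script above falls back to a local folder whenever a dataset entry is `None`, which is what the comments describe for runs outside AML. The sketch below isolates that pattern so it can be tried without any Azure setup. The helper `resolve_folder` and the example mount path are hypothetical and are not part of `hi-ml-azure`; the lists merely stand in for `run_info.input_datasets`.

```python
from pathlib import Path
from typing import List, Optional


def resolve_folder(candidate: Optional[Path], fallback: Path) -> Path:
    """Return `candidate` if it is set, otherwise the local `fallback` folder.

    Hypothetical helper for illustration; it is not part of the toolbox.
    """
    return candidate if candidate is not None else fallback


# Stand-in for a run outside AML, where the README says the dataset paths are None.
local_inputs: List[Optional[Path]] = [None]
# Stand-in for a run inside AzureML. The mount path here is an assumption, made up for the example.
cloud_inputs: List[Optional[Path]] = [Path("/mnt/inputs/images123")]

# Outside AML the local fallback is chosen; inside AML the populated path wins.
print(resolve_folder(local_inputs[0], Path("/tmp/my_dataset")))
print(resolve_folder(cloud_inputs[0], Path("/tmp/my_dataset")))
```

The explicit `is not None` check spells out the intent of the `or` expression used in the script.
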

## Issues

If you've found a bug in the code, please check the [issues](https://github.com/microsoft/hi-ml/issues) page.
If no matching issue exists, please open a new one. Be sure to include:

* A descriptive title
* Expected behavior (including a code sample if possible)
* Actual behavior

## Contributing

We welcome all contributions that help us achieve our aim of speeding up ML/AI research in health and life sciences.
Examples of contributions are

* Data loaders for specific health & life sciences data
* Network architectures and components for deep learning models
* Tools to analyze and/or visualize data
* ...

Please check the [detailed page about contributions](.github/CONTRIBUTING.md).

## Licensing

[MIT License](LICENSE)

**You are responsible for the performance, the necessary testing, and if needed any regulatory clearance for
any of the models produced by this toolbox.**

## Contact

If you have any feature requests, or find issues in the code, please create an
[issue on GitHub](https://github.com/microsoft/hi-ml/issues).

## Contribution Licensing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit <https://cla.opensource.microsoft.com>.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.