Distributed AI/HPC Monitoring Framework


Yang Wang 891fc3ed84 Update Moneo Exporter for MI300 (#81 ) * update moneo for mi300 * fix comments		2024-06-04 10:41:45 +09:00
.azure-pipelines	[Lint] Fix Moneo all the lint error (#31 )	2022-10-26 10:48:24 +08:00
.github/workflows	Update Moneo Exporter (#73 )	2024-01-02 17:16:46 +08:00
deploy_managed_infra	Moneo refresh (#79 )	2024-03-25 15:58:01 +00:00
dockerfile	Update Moneo Exporter for MI300 (#81 )	2024-06-04 10:41:45 +09:00
docs	Config changes (#80 )	2024-04-25 14:48:20 -04:00
examples/slurm	slurm integartion update (#65 )	2023-08-21 11:05:38 -04:00
linux_service	Config changes (#80 )	2024-04-25 14:48:20 -04:00
src	Update Moneo Exporter for MI300 (#81 )	2024-06-04 10:41:45 +09:00
tests	Remove ansible (#39 )	2023-03-23 09:24:05 -04:00
.flake8	[Lint] Fix Moneo all the lint error (#31 )	2022-10-26 10:48:24 +08:00
.gitattributes	Update install scripts and dashboards for amdgpu (#6 )	2022-07-21 10:00:46 -04:00
.gitignore	Deprecate prom-remote-sidecar solution for Managed Prometheus Agent (#63 )	2023-08-08 22:20:48 +08:00
CITATION.bib	Add citation file (#5 )	2022-07-14 22:26:11 +08:00
CODE_OF_CONDUCT.md	CODE_OF_CONDUCT.md committed	2022-05-23 12:19:45 -07:00
LICENSE	LICENSE updated to template	2022-05-23 12:19:46 -07:00
README.md	Config changes (#80 )	2024-04-25 14:48:20 -04:00
SECURITY.md	SECURITY.md committed	2022-05-23 12:19:47 -07:00
SUPPORT.md	Update SUPPORT.md	2022-06-14 15:59:49 -04:00
moneo.py	Config changes (#80 )	2024-04-25 14:48:20 -04:00
moneo_config.json	Config changes (#80 )	2024-04-25 14:48:20 -04:00

README.md

Moneo

Description

Moneo is a distributed GPU system monitor for AI workflows. It orchestrates metric collection (DCGMI + Prometheus DB) and visualization (Grafana) across multi-GPU/node systems. This provides useful insights into workflow and system level characterization.

Moneo offers flexibility with 3 deployment methods:

The preferred method, using Azure Managed Prometheus/Grafana and Moneo Linux services for collection (Headless deployment).
Using Azure Application Insights/Azure Monitor Workspace (AMW) (Headless deployment w/ App Insights).
Using the Moneo CLI with a dedicated head node to host local Prometheus/Grafana servers (Local Grafana Deployment).

Moneo Headless Method:

Metrics

There are five categories of metrics that Moneo monitors:

GPU Counters
- Compute/Memory Utilization
- SM and Memory Clock frequency
- Temperature
- Power
- ECC Counts (Nvidia)
- GPU Throttling (Nvidia)
- XID code (Nvidia)
GPU Profiling Counters
- SM Activity
- Memory DRAM Activity
- NVLink Activity
- PCIe Rate
InfiniBand Network Counters
- IB TX/RX rate
- IB Port errors
- IB Link Flap
CPU Counters
- Utilization
- Clock frequency
Memory
- Utilization

Grafana Dashboards

Menu: List of available dashboards.

Note: When viewing GPU dashboards make sure to note whether you are using Nvidia or AMD GPU nodes and select the proper dashboard.
Cluster View: contains min, max, average across devices for GPU/IB metrics per VM.
GPU Device Counters: Detailed view of node level GPU counters.
GPU Profiling Counters: Node level profiling metrics require additional overhead which may affect workload performance. Tensor, FP16, FP32, and FP64 activity are disabled by default but can be switched on by CLI command.
InfiniBand Network Counters: Detailed view of node level IB network metrics.
Node View: Detailed view of node level CPU, Memory, and Network metrics.

Minimum Requirements

Python >= 3.7 installed
OS Support:
- Ubuntu 18.04, 20.04, 22.04
- AlmaLinux 8.6

Manager Node Requirements

Note: Not applicable if using Azure Managed Grafana/Prometheus

docker 20.10.23 (May work with other versions but this has been tested.)
parallel-ssh 2.3.1-2 (May work with other versions but this has been tested.)
Manager node must be able to ssh to itself

Worker node requirements

Nvidia Architecture supported (only for Nvidia GPU monitoring):
- Volta
- Ampere
- Hopper
Installed with install script at time of deployment (If not installed):
- DCGM 3.1.6 (For Nvidia deployments)
- Check install scripts for the various python packages installed.

Usage

Deploying Moneo

Get the code:

Clone Moneo from GitHub.

    # get the code
    git clone https://github.com/Azure/Moneo.git
    cd Moneo
    # install dependency
    sudo apt-get install pssh

Note: If you are using an Azure Ubuntu HPC-AI VM image, you can find Moneo at this path: /opt/azurehpc/tools/Moneo

Configuration File

The moneo_config.json file can be used to specify certain deployment settings prior to moneo deployment.

There are 4 groups of configurations:

exporter_conf - This applies to all deployments. See the following settings:
- gpu_sample_interval - Sample rate, in samples per minute, for the Nvidia GPU exporter. Choices are [1, 2, 30, 60, 120, 600], with 60 samples per minute as the default.
- gpu_profiling - Switches on additional profile metrics (Tensor, FP16, FP32, and FP64). Choices are true/false with false as default.
- Note: These settings may have an impact on performance. Default settings were chosen to minimize impact.
prom_config - This group of settings applies to the Headless deployment method. Refer to Headless Deployment Guide for usage.
geneva_config - Applies to the Geneva deployment. Refer to Geneva deployment for usage.
publisher_config - Applies to both the Geneva and Azure Monitor Agent deployment methods. See Geneva deployment or Azure Monitor Agent deployment for usage.
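
The exporter_conf choices described above can be sanity-checked before deployment. The following is a minimal sketch, not part of Moneo: it assumes that exporter_conf is a JSON object holding gpu_sample_interval and gpu_profiling directly, and the helper names check_exporter_conf and seconds_between_samples are hypothetical. The allowed values and defaults are the ones listed above.

```python
import json

# Values taken from the settings list above.
ALLOWED_SAMPLE_RATES = (1, 2, 30, 60, 120, 600)  # samples per minute
DEFAULT_SAMPLE_RATE = 60
DEFAULT_GPU_PROFILING = False


def check_exporter_conf(config):
    """Return (settings, problems) for the exporter_conf group.

    Assumption: exporter_conf holds gpu_sample_interval and gpu_profiling
    as direct keys. Adjust the lookups if your file is laid out differently.
    """
    exporter = config.get("exporter_conf", {})
    rate = exporter.get("gpu_sample_interval", DEFAULT_SAMPLE_RATE)
    profiling = exporter.get("gpu_profiling", DEFAULT_GPU_PROFILING)

    problems = []
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rate, bool) or rate not in ALLOWED_SAMPLE_RATES:
        problems.append(
            f"gpu_sample_interval must be one of {list(ALLOWED_SAMPLE_RATES)}, got {rate!r}"
        )
    if not isinstance(profiling, bool):
        problems.append(f"gpu_profiling must be true or false, got {profiling!r}")
    return {"gpu_sample_interval": rate, "gpu_profiling": profiling}, problems


def seconds_between_samples(samples_per_minute):
    """Convert a samples-per-minute rate into the gap between samples."""
    return 60.0 / samples_per_minute


if __name__ == "__main__":
    with open("moneo_config.json") as handle:
        settings, problems = check_exporter_conf(json.load(handle))
    for problem in problems:
        print("config problem:", problem)
    if not problems:
        gap = seconds_between_samples(settings["gpu_sample_interval"])
        print(f"one GPU sample every {gap} seconds")
```

Because the rate is expressed per minute, a value of 120 corresponds to one sample every half second, and a value of 1 to one sample per minute.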

Preferred Moneo Deployment

The preferred way to deploy Moneo is the headless method using Azure Managed Grafana and Prometheus resources.

Complete the steps listed here: Headless Deployment Guide

Alternative deployment using Moneo CLI and head node

This method requires deploying a head node to host the local Prometheus database and Grafana server.

The head node must have enough storage available to facilitate data collection.
Grafana and Prometheus are accessed via web browser. Ensure proper access from the web browser to the head node IP.

Complete the steps listed here: Local Grafana Deployment Guide

Moneo CLI

Moneo CLI provides an alternative way to deploy and update Moneo manager and worker nodes. Although Linux services are preferred, the CLI offers another way to control Moneo.

CLI Usage

    python3 moneo.py [-d/--deploy] [-c hostfile] {manager,workers,full}
    python3 moneo.py [-s/--shutdown] [-c hostfile] {manager,workers,full}
    python3 moneo.py [-j JOB_ID ] [-c hostfile]

For example:

    python3 moneo.py -d -c ./hostfile full

Note: For more options check the Moneo help menu

    python3 moneo.py --help
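
When the CLI is driven from another script (for example a scheduler hook), it can help to assemble the argument list in one place. The sketch below is an illustration only: build_moneo_command and run_moneo are hypothetical helper names, and only the flags shown in the usage lines above (--deploy, --shutdown, -c, and the manager/workers/full target) are used.

```python
import subprocess

TARGETS = ("manager", "workers", "full")
ACTION_FLAGS = {"deploy": "--deploy", "shutdown": "--shutdown"}


def build_moneo_command(action, target, hostfile=None, script="moneo.py"):
    """Build the argv list for a deploy or shutdown call to moneo.py."""
    if action not in ACTION_FLAGS:
        raise ValueError(f"action must be one of {sorted(ACTION_FLAGS)}")
    if target not in TARGETS:
        raise ValueError(f"target must be one of {list(TARGETS)}")
    command = ["python3", script, ACTION_FLAGS[action]]
    if hostfile is not None:
        command += ["-c", hostfile]
    command.append(target)
    return command


def run_moneo(action, target, hostfile=None):
    """Run the command from the Moneo checkout; raises if it exits non-zero."""
    return subprocess.run(build_moneo_command(action, target, hostfile), check=True)


if __name__ == "__main__":
    # Prints the command that matches the example above without running it.
    print(" ".join(build_moneo_command("deploy", "full", "./hostfile")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with hostfile paths that contain spaces.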

Access the Grafana Portal

For Azure Managed Grafana the dashboards can be accessed via the endpoint provided on the resource overview.
For Moneo CLI deployment with a dedicated head node the Grafana portal can be reached via browser: http://master-ip-or-domain:3000
If Azure Monitor is used, navigate to the Azure Monitor Workspace in the Azure portal.
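
For the head node deployment, a quick reachability check can confirm that the Grafana port answers before opening a browser. This is a minimal sketch using only the Python standard library; port_is_open and grafana_url are hypothetical helper names, and master-ip-or-domain is the placeholder from the URL above. It only tests that a TCP connection can be opened, not that Grafana itself is healthy.

```python
import socket

GRAFANA_PORT = 3000  # port from the URL above


def grafana_url(host, port=GRAFANA_PORT):
    """Build the Grafana portal URL for a head node."""
    return f"http://{host}:{port}"


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    host = "master-ip-or-domain"  # placeholder, replace with your head node
    state = "reachable" if port_is_open(host, GRAFANA_PORT) else "not reachable"
    print(f"{grafana_url(host)} is {state}")
```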

User Docs

Headless Deployment Guide
Local Grafana Deployment Guide
To get started with job level filtering see: Job Level Filtering
Slurm epilog/prolog integration: Slurm example
To deploy moneo-worker inside container: Moneo-exporter
To integrate Moneo with Azure Application Insights dashboard see: Azure Monitor
To expose customized metrics using a custom exporter see: Custom Exporter
For Geneva ingestion (internal Microsoft) see: Geneva

Known Issues

NVIDIA exporter may conflict with DCGMI

DCGM has two modes: embedded mode and standalone mode.

If DCGM is started in embedded mode (e.g., nv-hostengine -n, where -n is the no-daemon option), the exporter will use the DCGM agent while DCGMI may return an error.

It's recommended to start DCGM in standalone mode in a daemon, so that multiple clients like exporter and DCGMI can interact with DCGM at the same time, according to NVIDIA.

Generally, NVIDIA prefers this mode of operation, as it provides the most flexibility and lowest maintenance cost to users.
Moneo will attempt to install a tested version of DCGM if it is not present on the worker nodes. However, this step is skipped if DCGM is already installed. In some instances the installed DCGM may be too old.

This may cause the Nvidia exporter to fail. In this case it is recommended that DCGM be upgraded to at least version 2.4.4. To view which exporters are running on a worker, run ps -eaf | grep python3
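
The minimum-version recommendation above can be expressed as a small comparison. The sketch below is an assumption-laden illustration: it expects you to supply the installed DCGM version as a dotted string (how you obtain it, for example from your package manager, is left open), and the helper names parse_version and dcgm_is_new_enough are hypothetical.

```python
import re

MIN_DCGM_VERSION = (2, 4, 4)  # minimum recommended above


def parse_version(text):
    """Turn a string such as '3.1.6' or '3.1.6-1' into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def dcgm_is_new_enough(version_text, minimum=MIN_DCGM_VERSION):
    """Return True if the given DCGM version meets the minimum."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    for candidate in ("2.3.9", "2.4.4", "3.1.6"):
        verdict = "ok" if dcgm_is_new_enough(candidate) else "too old, upgrade recommended"
        print(candidate, verdict)
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "2.10.0" sorts before "2.4.4" alphabetically.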

Troubleshooting

For Managed Grafana (headless) deployment
- Verify that the user managed identity is assigned to the VM resource.
- Verify that the prerequisite configuration file (Moneo/moneo_config.json) is configured correctly on each worker node.
- On the worker nodes, verify that Prometheus agent remote write is working:
  - Check the Prometheus container logs with sudo docker logs prometheus | grep 'Done replaying WAL'. The output should look like this:
```
    ts=2023-08-07T07:25:49.636Z caller=dedupe.go:112 component=remote level=info remote_name=6ac237 url="<ingestion_endpoint>" msg="Done replaying WAL" duration=8.339998173s
```
- Check that Azure Managed Grafana is linked to the Azure Managed Prometheus workspace.
  - This can be done by accessing settings in the Grafana dashboard and ensuring the ingestion link for the Managed Prometheus is being used for the datasource URL.
  - You can also verify that the Managed Prometheus resource in the portal is linked with the Managed Grafana resource.
For deployments with a head node:
- Verify that the Grafana and Prometheus containers are running:
  - Check in a browser: http://master-ip-or-domain:3000 (Grafana), http://master-ip-or-domain:9090 (Prometheus)
  - On the manager node terminal run sudo docker container ls
All deployments:
- Verify the exporters on the worker node:
  - ps -eaf | grep python3
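
The remote-write check from the headless list above can also be scripted. The following is a minimal sketch rather than a Moneo tool: find_wal_replay and prometheus_container_logs are hypothetical helper names, and the container name prometheus is taken from the docker logs command above. The parser looks for the same "Done replaying WAL" marker and pulls out the duration field shown in the sample log line.

```python
import re
import subprocess

WAL_MARKER = "Done replaying WAL"
DURATION_RE = re.compile(r"duration=(\S+)")


def find_wal_replay(log_text):
    """Return the duration field of every 'Done replaying WAL' log line.

    A line that carries the marker but no duration field yields None.
    """
    durations = []
    for line in log_text.splitlines():
        if WAL_MARKER in line:
            match = DURATION_RE.search(line)
            durations.append(match.group(1) if match else None)
    return durations


def prometheus_container_logs(container="prometheus"):
    """Fetch container logs; docker replays both stdout and stderr."""
    result = subprocess.run(
        ["sudo", "docker", "logs", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    replays = find_wal_replay(prometheus_container_logs())
    if replays:
        print("remote write replayed the WAL, durations:", replays)
    else:
        print("no 'Done replaying WAL' line found yet")
```

An empty result does not necessarily mean a failure; the agent may simply not have finished starting, so it is worth re-running the check after a short wait.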

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.