CMnO Observability v-team

Bryan Zabchuk ed13f8216c Fixed type, updated content.		2022-11-17 22:40:36 -05:00

Alerts for Azure Landing Zone

Two of the most common questions we've faced in working with customers are "What should we monitor in Azure?" and "What thresholds should we configure our alerts for?"

We started thinking about our experiences on customer projects, where we worked with customers to establish how they would manage their Azure Landing Zones. Eventually we'd always come back to the age-old questions: "What should we monitor?" and "What thresholds should we set?"

There isn't a definitive list of what you should monitor when you deploy something to Azure because "it depends". The services you're using and how you use them dictate what you should monitor, what thresholds to set on the metrics you decide to collect, and which errors in logs you should alert on.

Microsoft has tried to address this by providing a number of 'insights or solutions' for popular services, which pull together all the things you should care about (Storage Insights, VM Insights, Container Insights). But what about everything else?

Let's reduce the list of what we should monitor to something a bit more manageable... Let's look at the Azure Landing Zone. It's a common set of Azure resources/services that are configured in a similar way across organizations. (Of course there will be exceptions but we'll get to that.)

What you should monitor first are the key components of your Azure Landing Zones: those within the Platform/Shared landing zones and the pieces which stretch into application or workload subscriptions. Then add monitoring to the resources which use those landing zone components.

Example:

- Storage Accounts
- ASR Vaults
- VNETs and subnets
- Log Analytics workspace/s
- DNS
- Azure Firewall or 3rd Party NVA (Note: we won't be providing guidance on 3rd Party NVA monitoring, NVA vendors are in the best position to provide you that)
- Key Vaults

As for what thresholds we should set?

Look to the documentation Microsoft has provided for the Azure resources; there's a wealth of information to get you started. Some of those recommendations are short, simple metric queries, some are slightly more complex log alerts, and sometimes there's a lot to read through, such as with Storage Accounts.

An important part that's often missed is Service Health alerts. Getting those can save you a lot of headaches and needless troubleshooting, because you know first that there's an issue with the Azure service itself and not with how you're using it.

We also decided to look at this using a layered approach within the Azure Landing Zone. First, identify the platform metrics we think you should care about. Next, determine which Service Health alerts are needed for the resources that are important to you. After that, decide which log alerts should be used.

The next challenge we faced was how to do this at scale in a repeatable way. To be honest, there weren't a lot of examples available on how to do this; even among people using Infrastructure-as-Code, each person had their own way of doing things. So we set out to develop a common deployment method, so that someone who is just starting out and doesn't have the experience can get up and running with a scalable method to deploy Azure Monitor alerts.

If you already have a way to deploy Azure Monitor alerts but struggle to determine what to monitor and what thresholds to set, you can use the thresholds here for your Azure Landing Zones.

Do you need to have Azure Landing Zones deployed for this to work?

No, but you will need to be using Azure Management groups. For now, our focus is on the resources frequently deployed as part of Azure Landing Zone deployments.

Do you need to use the thresholds we've defined in the metric rule alert?

They are provided as a starting point. We've based the initial thresholds on what we've seen and what Microsoft's documentation recommends. You will need to adjust the thresholds at some point.

Observe the alerts once they are in place. If an alert is too chatty, adjust the threshold up; if it's not alerting when there's a problem, adjust the threshold down a bit (or vice versa, depending on which metric or log error is being used as the monitoring source). The key thing is that you'll need to investigate: leverage the insights if they are available, or create a workbook or dashboard to help you out.
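To make the tuning described above easier, the threshold can be exposed as a parameter so it can be changed without editing the rule itself. The following is a minimal sketch, not this project's actual template: the rule name, the parameter names, and the default threshold are illustrative assumptions, and the storage account `Availability` metric is used only as an example.

```bicep
// Sketch only: resource name, parameter names, and the default threshold are
// illustrative assumptions, not values taken from this project.

@description('Resource ID of the storage account to monitor.')
param storageAccountId string

@description('Resource ID of an existing action group to notify.')
param actionGroupId string

@description('Alert when average availability (percent) drops below this value. Placeholder default; tune it.')
param availabilityThreshold int = 99

resource storageAvailabilityAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'storage-availability-alert'
  location: 'global' // metric alert rules are global resources
  properties: {
    description: 'Storage account availability is below the configured threshold.'
    severity: 2
    enabled: true
    scopes: [
      storageAccountId
    ]
    evaluationFrequency: 'PT5M' // how often the rule is evaluated
    windowSize: 'PT5M' // look-back window used for the aggregation
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'AvailabilityBelowThreshold'
          criterionType: 'StaticThresholdCriterion'
          metricNamespace: 'Microsoft.Storage/storageAccounts'
          metricName: 'Availability'
          timeAggregation: 'Average'
          operator: 'LessThan'
          threshold: availabilityThreshold
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}
```

Because this rule uses `LessThan`, it is one of the "vice versa" cases: a chatty alert is quietened by lowering `availabilityThreshold`, and a rule that misses problems is made more sensitive by raising it.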

Do we need to use these metrics or can we replace them with other ones?

The metric rules we've created are based on recommendations from Microsoft documentation and field experience. How you're using Azure resources may be different, so tailor the alerts to suit your needs. One of the other goals of this project is to give you a way to do Azure Monitor alerts at scale, so create new rules with your own thresholds. We'd love to hear about your new rules too, so feel free to share back.

Dependencies

This project uses the Bicep modules from CARML, version 0.7.0. We will work to keep this project compatible with the CARML repo, but for the moment our priority is to build the project up as much as possible, so things may break if you use a more recent version of the Bicep modules found in CARML.

Roadmap

Going back to our layered approach, we decided to first tackle Metric alerts. They are responsive and relatively inexpensive because metric data is pre-computed and stored in the system, whereas log alerts rely on data that is stored in a Log Analytics Workspace and has had some sort of logic operation performed on it. See Microsoft's documentation for more information on Metric alerts.

Next, we're going to tackle Service Health alerts: knowing when there's an outage, planned maintenance, or another health advisory for the services you're using. These types of alerts rely on information in the Activity Log. See Microsoft's documentation for more information on Service Health alerts.
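As an illustration of the kind of rule this layer involves, here is a minimal sketch of an Activity Log alert that fires on Service Health events for the current subscription. It assumes a resource group deployment and an existing action group; the rule name and the `actionGroupId` parameter are hypothetical and not taken from this project.

```bicep
// Sketch only: the rule name and parameter are illustrative assumptions.

@description('Resource ID of an existing action group to notify.')
param actionGroupId string

resource serviceHealthAlert 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
  name: 'service-health-alert'
  location: 'Global' // activity log alert rules are global resources
  properties: {
    description: 'Notifies on Service Health events in this subscription.'
    enabled: true
    scopes: [
      subscription().id // evaluate Activity Log events for the whole subscription
    ]
    condition: {
      allOf: [
        {
          // Match every Activity Log event in the ServiceHealth category.
          field: 'category'
          equals: 'ServiceHealth'
        }
      ]
    }
    actions: {
      actionGroups: [
        {
          actionGroupId: actionGroupId
        }
      ]
    }
  }
}
```

The condition here is deliberately broad. Narrowing it to particular services, regions, or event types is possible with additional conditions, but the right set depends on which resources matter to you.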

The final area we're going to tackle is log alerts. As stated earlier, the data is collected in a Log Analytics Workspace and some sort of logic operation is performed on it, which means there's a charge for using these types of alerts.

- Where possible, we're going to focus on queries which use the ScheduledQueryRules API and Metric alerts for Logs.
- Activity Log alerts: we may create additional alerts, based on data found in the Activity Log, that are not Service Health related.
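For orientation, a log alert built on the ScheduledQueryRules API might look like the sketch below. This is an assumption-laden example rather than one of the project's rules: the rule name and all parameter names are hypothetical, and the query is left as a parameter because the right query depends on the resource and the log error you care about.

```bicep
// Sketch only: names, parameters, and default values are illustrative assumptions.

@description('Resource ID of the Log Analytics workspace to query.')
param workspaceId string

@description('Resource ID of an existing action group to notify.')
param actionGroupId string

@description('Log query whose returned rows are counted. Supply your own.')
param query string

@description('Alert when the query returns more rows than this. Placeholder default; tune it.')
param rowCountThreshold int = 0

param location string = resourceGroup().location

resource logAlert 'Microsoft.Insights/scheduledQueryRules@2021-08-01' = {
  name: 'example-log-alert'
  location: location
  properties: {
    description: 'Fires when the query returns more rows than the threshold.'
    severity: 2
    enabled: true
    scopes: [
      workspaceId
    ]
    evaluationFrequency: 'PT15M' // how often the query runs; affects cost
    windowSize: 'PT15M' // time range the query looks back over
    criteria: {
      allOf: [
        {
          query: query
          timeAggregation: 'Count' // count the rows returned by the query
          operator: 'GreaterThan'
          threshold: rowCountThreshold
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [
        actionGroupId
      ]
    }
  }
}
```

Unlike the metric alert case, the rule runs a query on a schedule, which is where the charge mentioned above comes from.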

Prerequisites

- VSCode Bicep Extension
- Azure subscriptions where you want to apply alerts.
- Management Groups that manage the Azure subscriptions
...

Deployment Steps

The intention of the policies is to provide a common set of metrics and thresholds to monitor all Azure Landing Zone resources. For that reason, you will want to deploy them as high up in your Azure Management group structure as possible to ensure that all subscriptions are included, and the steps below will cover that specific scenario. If there are other scenarios you'd like to see, or you have applied the policies at a different level within your Azure Management group structure, send us feedback through GitHub Issues.
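To illustrate what "high up in the Management group structure" means in Bicep terms, the sketch below assigns a policy at management group scope so that every subscription beneath that group inherits it. It is not this project's deployment template: the assignment name and the `policyDefinitionId` parameter are hypothetical, and it assumes the policy definition already exists.

```bicep
// Sketch only: the assignment name and parameters are illustrative assumptions.
targetScope = 'managementGroup'

@description('Resource ID of an existing policy definition that deploys an alert rule (hypothetical).')
param policyDefinitionId string

@description('Region used for the assignment and its managed identity.')
param location string = deployment().location

resource alertPolicyAssignment 'Microsoft.Authorization/policyAssignments@2022-06-01' = {
  name: 'alz-alert-policy' // assignment names at management group scope are limited to 24 characters
  location: location
  identity: {
    type: 'SystemAssigned' // needed when the policy deploys or modifies resources
  }
  properties: {
    displayName: 'Deploy Azure Landing Zone alert rule'
    policyDefinitionId: policyDefinitionId
  }
}
```

A file like this could be deployed with `az deployment mg create --management-group-id <id> --location <region> --template-file <file>`. If the policy deploys resources, the assignment's managed identity also needs suitable role assignments, which this sketch leaves out.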

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.