From f4718751affab022ecde79bf02ef92b3abdeddb6 Mon Sep 17 00:00:00 2001 From: Ben Sadeghi Date: Wed, 31 Jul 2019 15:10:54 +0800 Subject: [PATCH 01/22] change links from docs.databricks.com to docs.azuredatabricks.net --- toc.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/toc.md b/toc.md index 2a3124c..593b93e 100644 --- a/toc.md +++ b/toc.md @@ -220,7 +220,7 @@ This recommendation is driven by security and data availability concerns. Every > ***This recommendation doesn't apply to Blob or ADLS folders explicitly mounted as DBFS by the end user*** **More Information:** -[Databricks File System](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) +[Databricks File System](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html) ## Always Hide Secrets in a Key Vault @@ -263,7 +263,7 @@ like notebook commands, SQL queries, Java jar jobs, etc. to this primordial app Under the covers Databricks clusters use the lightweight Spark Standalone resource allocator. -When it comes to taxonomy, ADB clusters are divided along the notions of “type”, and “mode.” There are two ***types*** of ADB clusters, according to how they are created. Clusters created using UI and [Clusters API](https://docs.databricks.com/api/latest/clusters.html) are called Interactive Clusters, whereas those created using [Jobs API](https://docs.databricks.com/api/latest/jobs.html) are called Jobs Clusters. Further, each cluster can be of two ***modes***: Standard and High Concurrency. Regardless of types or mode, all clusters in Azure Databricks can automatically scale to match the workload, using a feature known as [Autoscaling](https://docs.azuredatabricks.net/user-guide/clusters/sizing.html#cluster-size-and-autoscaling). +When it comes to taxonomy, ADB clusters are divided along the notions of “type”, and “mode.” There are two ***types*** of ADB clusters, according to how they are created. Clusters created using UI and [Clusters API](https://docs.azuredatabricks.net/api/latest/clusters.html) are called Interactive Clusters, whereas those created using [Jobs API](https://docs.azuredatabricks.net/api/latest/jobs.html) are called Jobs Clusters. Further, each cluster can be of two ***modes***: Standard and High Concurrency. Regardless of types or mode, all clusters in Azure Databricks can automatically scale to match the workload, using a feature known as [Autoscaling](https://docs.azuredatabricks.net/user-guide/clusters/sizing.html#cluster-size-and-autoscaling). *Table 2: Cluster modes and their characteristics* From 5fbb8473474ddb407813130c836e7d5139d34d3c Mon Sep 17 00:00:00 2001 From: Garren Staubli <1999830+gstaubli@users.noreply.github.com> Date: Fri, 2 Aug 2019 09:46:08 -0700 Subject: [PATCH 02/22] Spelling correction --- toc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/toc.md b/toc.md index 2a3124c..7dcdb18 100644 --- a/toc.md +++ b/toc.md @@ -214,7 +214,7 @@ With this info, we can quickly arrive at the table below, showing how many nodes *Impact: High* This recommendation is driven by security and data availability concerns. Every Workspace comes with a default DBFS, primarily designed to store libraries and other system-level configuration artifacts such as Init scripts. You should not store any production data in it, because: -1. The lifecycle of default DBFS is tied to the Workspace. Deleting the workspace will also delete the default DBFS and permanently remove its contentents. +1. 
The lifecycle of default DBFS is tied to the Workspace. Deleting the workspace will also delete the default DBFS and permanently remove its contents. 2. One can't restrict access to this default folder and its contents. > ***This recommendation doesn't apply to Blob or ADLS folders explicitly mounted as DBFS by the end user*** From 945d8c34456d1c10a23415175e52d507427377d6 Mon Sep 17 00:00:00 2001 From: Greg Galloway Date: Sat, 23 Nov 2019 13:16:16 -0600 Subject: [PATCH 03/22] Clarifying external users and fixing typos --- toc.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/toc.md b/toc.md index 6dbf5ec..c0be925 100644 --- a/toc.md +++ b/toc.md @@ -107,6 +107,7 @@ Each workspace is identified by a globally unique 53-bit number, called ***Works Example: *https://eastus2.azuredatabricks.net/?o=12345* Azure Databricks uses [Azure Active Directory (AAD)](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-whatis) as the exclusive Identity Provider and there’s a seamless out of the box integration between them. This makes ADB tightly integrated with Azure just like its other core services. Any AAD member assigned to the Owner or Contributor role can deploy Databricks and is automatically added to the ADB members list upon first login. If a user is not a member of the Active Directory tenant, they can’t login to the workspace. +Granting access to a user in another tenant (for example, if contoso.com wants to collaborate with adventure-works.com users) does work because those external users are added as guests to the tenant hosting Azure Databricks. Azure Databricks comes with its own user management interface. You can create users and groups in a workspace, assign them certain privileges, etc. While users in AAD are equivalent to Databricks users, by default AAD roles have no relationship with groups created inside ADB, unless you use [SCIM](https://docs.azuredatabricks.net/administration-guide/admin-settings/scim/aad.html) for provisioning users and groups. With SCIM, you can import both groups and users from AAD into Azure Databricks, and the synchronization is automatic after the initial import. ADB also has a special group called ***Admins***, not to be confused with AAD’s role Admin. @@ -213,7 +214,7 @@ With this info, we can quickly arrive at the table below, showing how many nodes *Impact: High* This recommendation is driven by security and data availability concerns. Every Workspace comes with a default DBFS, primarily designed to store libraries and other system-level configuration artifacts such as Init scripts. You should not store any production data in it, because: -1. The lifecycle of default DBFS is tied to the Workspace. Deleting the workspace will also delete the default DBFS and permanently remove its contentents. +1. The lifecycle of default DBFS is tied to the Workspace. Deleting the workspace will also delete the default DBFS and permanently remove its contents. 2. One can't restrict access to this default folder and its contents. 
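For production data, mount your own Blob or ADLS Gen2 storage explicitly instead. A minimal sketch, assuming a Databricks notebook (where `dbutils` is predefined), an ADLS Gen2 account, and a service principal whose credentials sit in a secret scope; the account, container, scope, and key names are all placeholders:

```python
# Illustrative only: mount your own ADLS Gen2 container rather than writing
# production data to the workspace's default DBFS. All names are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<service-principal-app-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="my-scope", key="sp-secret"),  # never hardcode
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://prod@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/prod",
    extra_configs=configs,
)
```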
> ***This recommendation doesn't apply to Blob or ADLS folders explicitly mounted as DBFS by the end user*** @@ -234,7 +235,7 @@ If using Azure Key Vault, create separate AKV-backed secret scopes and correspon [Create an Azure Key Vault-backed secret scope](https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html) -[Example of using secretin a notebook](https://docs.azuredatabricks.net/user-guide/secrets/example-secret-workflow.html) +[Example of using secret in a notebook](https://docs.azuredatabricks.net/user-guide/secrets/example-secret-workflow.html) [Best practices for creating secret scopes](https://docs.azuredatabricks.net/user-guide/secrets/secret-acl.html) From 32f5b722be6525d780f75f26d92c93913a3ae98e Mon Sep 17 00:00:00 2001 From: Teresa Fontanella De Santis Date: Mon, 2 Dec 2019 16:57:54 -0300 Subject: [PATCH 04/22] Fix typos --- toc.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/toc.md b/toc.md index 6dbf5ec..c40732e 100644 --- a/toc.md +++ b/toc.md @@ -191,7 +191,7 @@ More information: [Azure Virtual Datacenter: a network perspective](https://docs Recall the each Workspace can have multiple clusters. The total capacity of clusters in each workspace is a function of the masks used for the workspace's enclosing Vnet and the pair of subnets associated with each cluster in the workspace. The masks can be changed if you use the [Bring Your Own Vnet](https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-inject.html#vnet-inject) feature as it gives you more control over the networking layout. It is important to understand this relationship for accurate capacity planning. * Each cluster node requires 1 Public IP and 2 Private IPs - * These IPs and are logically grouped into 2 subnets named “public” and “private” + * These IPs are logically grouped into 2 subnets named “public” and “private” * For a desired cluster size of X: number of Public IPs = X, number of Private IPs = 4X * The 4X requirement for Private IPs is due to the fact that for each deployment: + Half of address space is reserved for future use @@ -262,7 +262,7 @@ like notebook commands, SQL queries, Java jar jobs, etc. to this primordial app Under the covers Databricks clusters use the lightweight Spark Standalone resource allocator. -When it comes to taxonomy, ADB clusters are divided along the notions of “type”, and “mode.” There are two ***types*** of ADB clusters, according to how they are created. Clusters created using UI and [Clusters API](https://docs.databricks.com/api/latest/clusters.html) are called Interactive Clusters, whereas those created using [Jobs API](https://docs.databricks.com/api/latest/jobs.html) are called Jobs Clusters. Further, each cluster can be of two ***modes***: Standard and High Concurrency. Regardless of types or mode, all clusters in Azure Databricks can automatically scale to match the workload, using a feature known as [Autoscaling](https://docs.azuredatabricks.net/user-guide/clusters/sizing.html#cluster-size-and-autoscaling). +When it comes to taxonomy, ADB clusters are divided along the notions of “type” and “mode”. There are two ***types*** of ADB clusters, according to how they are created. Clusters created using UI and [Clusters API](https://docs.databricks.com/api/latest/clusters.html) are called Interactive Clusters, whereas those created using [Jobs API](https://docs.databricks.com/api/latest/jobs.html) are called Jobs Clusters. 
Further, each cluster can be of two ***modes***: Standard and High Concurrency. Regardless of types or mode, all clusters in Azure Databricks can automatically scale to match the workload, using a feature known as [Autoscaling](https://docs.azuredatabricks.net/user-guide/clusters/sizing.html#cluster-size-and-autoscaling). *Table 2: Cluster modes and their characteristics* @@ -289,7 +289,7 @@ Because of these differences, supporting Interactive workloads entails minimizin 2. **Optimizing for Latency:** Only High Concurrency clusters have features which allow queries from different users share cluster resources in a fair, secure manner. HC clusters come with Query Watchdog, a process which keeps disruptive queries in check by automatically pre-empting rogue queries, limiting the maximum size of output rows returned, etc. - 3. **Security:** Table Access control feature is only available in High Concurrency mode and needs to be turned on so that users can limit access to their database objects (tables, views, functions...) created on the shared cluster. In case of ADLS, we recommend restricting access using the AAD Credential Passthrough feature instead of Table Access Controls. + 3. **Security:** Table Access control feature is only available in High Concurrency mode and needs to be turned on so that users can limit access to their database objects (tables, views, functions, etc.) created on the shared cluster. In case of ADLS, we recommend restricting access using the AAD Credential Passthrough feature instead of Table Access Controls. > ***If you’re using ADLS, we recommend AAD Credential Passthrough instead of Table Access Control for easy manageability.*** @@ -343,7 +343,7 @@ By default, Cluster logs are sent to default DBFS but you should consider sendin ## Choose Cluster VMs to Match Workload Class *Impact: High* -To allocate the right amount and type of cluster resresource for a job, we need to understand how different types of jobs demand different types of cluster resources. +To allocate the right amount and type of cluster resource for a job, we need to understand how different types of jobs demand different types of cluster resources. * **Machine Learning** - To train machine learning models it’s usually required cache all of the data in memory. Consider using memory optimized VMs so that the cluster can take advantage of the RAM cache. You can also use storage optimized instances for very large datasets. To size the cluster, take a % of the data set → cache it → see how much memory it used → extrapolate that to the rest of the data. 
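As a rough sketch of that sizing heuristic (runnable in a Databricks notebook where `spark` is predefined; the dataset path is hypothetical, and reading cached sizes this way relies on a Spark developer API, so treat the numbers as an approximation):

```python
# Cache a sample, measure its in-memory footprint, then extrapolate.
fraction = 0.10                                        # cache ~10% of the data
df = spark.read.parquet("/mnt/lake/training-data")     # hypothetical dataset
sample = df.sample(fraction=fraction, seed=42).cache()
sample.count()                                         # materialize the cache

# Sum the bytes the block manager reports for cached RDDs (developer API);
# alternatively, read the same figure off the Spark UI's Storage tab.
cached_bytes = sum(info.memSize()
                   for info in spark.sparkContext._jsc.sc().getRDDStorageInfo())
print(f"Approx. RAM to cache the full dataset: "
      f"{cached_bytes / fraction / 1024**3:.1f} GB")
```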
From 0d40c2473ad98d965fd17fc3b984297b5d24d Mon Sep 17 00:00:00 2001
From: Nick Hurt
Date: Mon, 13 Jan 2020 15:51:42 +0000
Subject: [PATCH 05/22] Update toc.md

Updated secret scope limit per workspace
---
 toc.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/toc.md b/toc.md
index 6dbf5ec..5ea40fc 100644
--- a/toc.md
+++ b/toc.md
@@ -151,7 +151,8 @@ Key workspace limits are:

   * The maximum number of jobs that a workspace can create in an hour is **1000**
   * At any time, you cannot have more than **150 jobs** simultaneously running in a workspace
- * There can be a maximum of **150 notebooks or execution contexts** attached to a cluster
+ * There can be a maximum of **150 notebooks or execution contexts** attached to a cluster
+ * The maximum number of secret scopes per workspace is **100**

 ### Azure Subscription Limits
Next, there are [Azure limits](https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits) to consider since ADB deployments are built on top of the Azure infrastructure.

From 3f3310ac09925837cd96acf22f9bf8214ba2118c Mon Sep 17 00:00:00 2001
From: Greg Galloway
Date: Wed, 26 Feb 2020 21:48:58 -0600
Subject: [PATCH 06/22] clarifying AAD member or guest

---
 toc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/toc.md b/toc.md
index c0be925..4c33617 100644
--- a/toc.md
+++ b/toc.md
@@ -106,7 +106,7 @@ Each workspace is identified by a globally unique 53-bit number, called ***Works

 Example: *https://eastus2.azuredatabricks.net/?o=12345*

-Azure Databricks uses [Azure Active Directory (AAD)](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-whatis) as the exclusive Identity Provider and there’s a seamless out of the box integration between them. This makes ADB tightly integrated with Azure just like its other core services. Any AAD member assigned to the Owner or Contributor role can deploy Databricks and is automatically added to the ADB members list upon first login. If a user is not a member of the Active Directory tenant, they can’t login to the workspace.
+Azure Databricks uses [Azure Active Directory (AAD)](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-whatis) as the exclusive Identity Provider and there’s a seamless out of the box integration between them. This makes ADB tightly integrated with Azure just like its other core services. Any AAD member assigned to the Owner or Contributor role can deploy Databricks and is automatically added to the ADB members list upon first login. If a user is not a member or guest of the Active Directory tenant, they can’t login to the workspace.
 Granting access to a user in another tenant (for example, if contoso.com wants to collaborate with adventure-works.com users) does work because those external users are added as guests to the tenant hosting Azure Databricks.
 Azure Databricks comes with its own user management interface. You can create users and groups in a workspace, assign them certain privileges, etc. While users in AAD are equivalent to Databricks users, by default AAD roles have no relationship with groups created inside ADB, unless you use [SCIM](https://docs.azuredatabricks.net/administration-guide/admin-settings/scim/aad.html) for provisioning users and groups. With SCIM, you can import both groups and users from AAD into Azure Databricks, and the synchronization is automatic after the initial import. ADB also has a special group called ***Admins***, not to be confused with AAD’s role Admin.
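To check which groups and users actually exist in a workspace (for example, to verify a SCIM sync landed), you can query the workspace's Groups API. A minimal sketch; the workspace URL and personal access token are placeholders:

```python
import requests

host = "https://eastus2.azuredatabricks.net"   # placeholder workspace URL
token = "<personal-access-token>"              # placeholder PAT

# List all groups defined in the workspace (Groups API 2.0).
resp = requests.get(f"{host}/api/2.0/groups/list",
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.json())  # e.g. {"group_names": ["admins", "data-engineers", ...]}
```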
From c4f86baf45836cc14e22d7fe5e88ef7dc0f46fa7 Mon Sep 17 00:00:00 2001 From: Nick Hurt Date: Mon, 30 Mar 2020 18:43:23 +0100 Subject: [PATCH 07/22] Update toc.md Added guidance on access patterns to ADLS Gen2 --- toc.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/toc.md b/toc.md index 6dbf5ec..7065816 100644 --- a/toc.md +++ b/toc.md @@ -46,6 +46,7 @@ Created by: + [Step 4 - Configure the Init Script](#Step-4---Configure-the-Init-script) + [Step 5 - View Collected Data via Azure Portal](#Step-5---View-Collected-Data-via-Azure-Portal) + [References](#References) + * [Access patterns with Azure Data Lake Storage Gen2](#Securing-access-to-ADLS) @@ -466,7 +467,8 @@ See [this](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-coll * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/OMS-Agent-for-Linux.md * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/Troubleshooting.md - +## Securing access to ADLS +To understand the various access patterns and approaches to securing data in ADLS see the [following guidance](https://github.com/hurtn/datalake-ADLS-access-patterns-with-Databricks/blob/master/readme.md). From f55cfcb6e87cb377cc5662c5cc708b57a242b7f8 Mon Sep 17 00:00:00 2001 From: Nick Hurt Date: Mon, 30 Mar 2020 18:48:58 +0100 Subject: [PATCH 08/22] Update toc.md Added access patterns and updated TOC link --- toc.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/toc.md b/toc.md index 7065816..df0e997 100644 --- a/toc.md +++ b/toc.md @@ -46,7 +46,7 @@ Created by: + [Step 4 - Configure the Init Script](#Step-4---Configure-the-Init-script) + [Step 5 - View Collected Data via Azure Portal](#Step-5---View-Collected-Data-via-Azure-Portal) + [References](#References) - * [Access patterns with Azure Data Lake Storage Gen2](#Securing-access-to-ADLS) + * [Access patterns with Azure Data Lake Storage Gen2](#Access-patterns-with-Azure-Data-Lake-Storage-Gen2) @@ -467,7 +467,7 @@ See [this](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-coll * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/OMS-Agent-for-Linux.md * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/Troubleshooting.md -## Securing access to ADLS +## Access patterns with Azure Data Lake Storage Gen2 To understand the various access patterns and approaches to securing data in ADLS see the [following guidance](https://github.com/hurtn/datalake-ADLS-access-patterns-with-Databricks/blob/master/readme.md). From a075c6d4659b02ef87ec3f4485010169fd582fef Mon Sep 17 00:00:00 2001 From: Kyle Hale Date: Tue, 21 Apr 2020 18:39:00 -0500 Subject: [PATCH 09/22] Updated link to Subscription hierarchy Subscription hierarchy link pointed to deprecated Azure Cloud Adoption docs, updated to revised link. --- toc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/toc.md b/toc.md index 6dbf5ec..eb4b11c 100644 --- a/toc.md +++ b/toc.md @@ -130,7 +130,7 @@ With this basic understanding let’s discuss how to plan a typical ADB deployme ## Map Workspaces to Business Divisions *Impact: Very High* -How many workspaces do you need to deploy? The answer to this question depends a lot on your organization’s structure. We recommend that you assign workspaces based on a related group of people working together collaboratively. This also helps in streamlining your access control matrix within your workspace (folders, notebooks etc.) 
and also across all your resources that the workspace interacts with (storage, related data stores like Azure SQL DB, Azure SQL DW etc.). This type of division scheme is also known as the [Business Unit Subscription](https://docs.microsoft.com/en-us/azure/architecture/cloud-adoption/appendix/azure-scaffold?wt.mc_id=itshowcase-codeapps#departments-and-accounts) design pattern and it aligns well with the Databricks chargeback model. +How many workspaces do you need to deploy? The answer to this question depends a lot on your organization’s structure. We recommend that you assign workspaces based on a related group of people working together collaboratively. This also helps in streamlining your access control matrix within your workspace (folders, notebooks etc.) and also across all your resources that the workspace interacts with (storage, related data stores like Azure SQL DB, Azure SQL DW etc.). This type of division scheme is also known as the [Business Unit Subscription](https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/decision-guides/subscriptions/) design pattern and it aligns well with the Databricks chargeback model.

From 625cff72d7bbd15cc70a801925b3b063d6a394a4 Mon Sep 17 00:00:00 2001 From: Priya Aswani Date: Tue, 14 Jul 2020 18:03:00 -0700 Subject: [PATCH 10/22] Update toc.md Written by: Priya Aswani, WW Data Engineering & AI Technical Lead --- toc.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/toc.md b/toc.md index 6dbf5ec..b0b536c 100644 --- a/toc.md +++ b/toc.md @@ -12,6 +12,8 @@ Created by: * Dhruv Kumar, Senior Solutions Architect, Databricks * Premal Shah, Azure Databricks PM, Microsoft +Written by: Priya Aswani, WW Data Engineering & AI Technical Lead + # Table of Contents - [Introduction](#Introduction) From 284e37d9ec470151eedc1bb0d269355cd8abf35b Mon Sep 17 00:00:00 2001 From: Travis LaGrone Date: Fri, 11 Sep 2020 16:43:27 -0500 Subject: [PATCH 11/22] Convert image of code to text The image converted is for the init script under /toc.md#Step-4---Configure-the-Init-script --- toc.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/toc.md b/toc.md index 6dbf5ec..f6c1599 100644 --- a/toc.md +++ b/toc.md @@ -427,8 +427,16 @@ startup time by a few minutes. You can use Log analytics directly to query the Perf data. Here is an example of a query which charts out CPU for the VM’s in question for a specific cluster ID. See log analytics overview for further documentation on log analytics and query syntax. +```python +%python +script = """ +sed -i "s/^exit 101$/exit 0/" /usr/sbin/policy-rc.d +wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh && sh onboard_agent.sh -w $YOUR_ID -s $YOUR_KEY -d opinsights.azure.com +""" -![Perfsnippet](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/PerfSnippet.PNG "PerfSnippet") +#save script to databricks file system so it can be loaded by VMs +dbutils.fs.put("/databricks/log_init_scripts/configure-omsagent.sh", script, True) +``` ![Grafana](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Grafana.PNG "Grafana") @@ -465,9 +473,3 @@ See [this](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-coll * https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/OMS-Agent-for-Linux.md * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/Troubleshooting.md - - - - - - From 769732eac9da9d9ed202848899558d55012e8a93 Mon Sep 17 00:00:00 2001 From: bhpraka <68075483+bhpraka@users.noreply.github.com> Date: Mon, 26 Oct 2020 14:11:18 -0700 Subject: [PATCH 12/22] Update toc.md Added Cost Management, Chargeback and Analysis --- toc.md | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 167 insertions(+) diff --git a/toc.md b/toc.md index 6dbf5ec..f42f968 100644 --- a/toc.md +++ b/toc.md @@ -37,6 +37,7 @@ Created by: - [Running ADB Applications Smoothly: Guidelines on Observability and Monitoring](#Running-ADB-Applications-Smoothly-Guidelines-on-Observability-and-Monitoring) * [Collect resource utilization metrics across Azure Databricks cluster in a Log Analytics workspace](#Collect-resource-utilization-metrics-across-Azure-Databricks-cluster-in-a-Log-Analytics-workspace) + [Querying VM metrics in Log Analytics once you have started the collection using the above document](#Querying-VM-metrics-in-log-analytics-once-you-have-started-the-collection-using-the-above-document) +- [Cost Management, Charegback and Analysis](#Cost-Management,-Charegback-and-Analysis) - [Appendix A](#Appendix-A) * 
[Installation for being able to capture VM metrics in Log Analytics](#Installation-for-being-able-to-capture-VM-metrics-in-Log-Analytics) + [Overview](#Overview) @@ -434,6 +435,172 @@ You can use Log analytics directly to query the Perf data. Here is an example of You can also use Grafana to visualize your data from Log Analytics. +## Cost Management, Charegback and Analysis + +This section will focus on Azure Databricks billing, tools to manage and analyze cost and how to charge back to the team. +Azure Databricks Billing: +First, it is important to understand the different workloads and tiers available with Azure Databricks. Azure Databricks is available in 2 tiers – Standard and Premium. Premium Tier offers additional features on top of what is available in Standard tier. These include Role-based access control for notebooks, jobs, and tables, Audit logs, Azure AD conditional pass-through, conditional authentication and many more. Please refer to https://azure.microsoft.com/en-us/pricing/details/databricks/ for the complete list. +Both Premium and Standard tier come with 3 types of workload +• Jobs Compute (previously called Data Engineering) +• Jobs Light Compute (previously called Data Engineering Light) +• All-purpose Compute (previously called Data Analytics) +The Jobs Compute and Jobs Light Compute make it easy for data engineers to build and execute jobs, and All-purpose make it easy for data scientists to explore, visualize, manipulate, and share data and insights interactively. Depending upon the use-case, one can also use All-purpose Compute for data engineering or automated scenarios especially if the incoming job rate is higher. +When you create an Azure Databricks workspace and spin up a cluster, below resources are consumed +• DBUs – A DBU is a unit of processing capability, billed on a per-second usage +• Virtual Machines – These represent your Databricks clusters that run the Databricks Runtime +• Public IP Addresses – These represent the IP Addresses consumed by the Virtual Machines when the cluster is running +• Blob Storage – Each workspace comes with a default storage +• Managed Disk +• Bandwidth – Bandwidth charges for any data transfer +Service/Resource Pricing +DBUs https://azure.microsoft.com/en-us/pricing/details/databricks/ +VMs https://azure.microsoft.com/en-us/pricing/details/databricks/ +Public IP Addresses https://azure.microsoft.com/en-us/pricing/details/ip-addresses/ +Blob Storage https://azure.microsoft.com/en-us/pricing/details/storage/ +Managed Disk https://azure.microsoft.com/en-us/pricing/details/managed-disks/ +Bandwidth https://azure.microsoft.com/en-us/pricing/details/bandwidth/ + +In addition, if you use additional services as part of your end-2-end solution, such as Azure CosmosDB, or Azure Event Hub, then they are charged per their pricing plan. +Per the details in Azure Databricks pricing page, there are 2 options +1. Pay as you go – Pay for the DBUs as you use: Refer to the pricing page for the DBU prices based on the SKU. Note: The DBU per hour price for different SKUs differs across Azure public cloud, Azure Gov and Azure China region. + +2. Pre-purchase or Reservations – You can get up to 37% savings over pay-as-you-go DBU when you pre-purchase Azure Databricks Units (DBU) as Databricks Commit Units (DBCU) for either 1 or 3 years. A Databricks Commit Unit (DBCU) normalizes usage from Azure Databricks workloads and tiers into to a single purchase. 
Your DBU usage across those workloads and tiers will draw down from the Databricks Commit Units (DBCU) until they are exhausted, or the purchase term expires. The draw down rate will be equivalent to the price of the DBU, as per the table above. Refer to the pricing page for the pre-purchase pricing. +Since, you are also billed for the VMs, you have both the above options for VMs as well +1. Pay as you go +2. Reservations - https://azure.microsoft.com/en-us/pricing/reserved-vm-instances/ +Below are few examples of a billing for Azure Databricks with Pay as you go +Depending on the type of workload your cluster runs, you will either be charged for Jobs Compute, Jobs Light Compute, or All-purpose Compute workload. For example, if the cluster runs workloads triggered by the Databricks jobs scheduler, you will be charged for the Jobs Compute workload. If your cluster runs interactive features such as ad-hoc commands, you will be billed for All-purpose Compute workload. +Accordingly, the pricing will be dependent on below components +1. DBU SKU – DBU price based on the workload and tier +2. VM SKU – VM price based on the VM SKU +3. DBU Count – Each VM SKU has an associated DBU count. Example – D3v2 has DBU count of 0.75 +4. Region +5. Duration +• If you run Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing would be the following for All-purpose Compute: +• VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598 +• DBU cost for All-purpose Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100 +• The total cost would therefore be $598 (VM Cost) + $1,100 (DBU Cost) = $1,698. +• If you run Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing would be the following for Jobs Compute workload: +• VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598 +• DBU cost for Jobs Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.30/DBU = $600 +• The total cost would therefore be $598 (VM Cost) + $600 (DBU Cost) = $1,198. +In addition to VM and DBU charges, there will be additional charges for managed disks, public IP address, bandwidth, or any other resource such as Azure Storage, Azure Cosmos DB depending on your application. +Azure Databricks Trial: If you are new to Azure Databricks, you can also use a Trial SKU that gives you free DBUs for Premium tier for 14 days. You will still need to pay for other resources like VM, Storage etc. that are consumed during this period. After the trial is over, you will need to start paying for the DBUs. +Chargeback scenarios +There are 2 broad scenarios we have seen with respect to chargeback internal teams for sharing Databricks resources +1. Chargeback across a single Azure Databricks workspace: In this case, a single workspace is shared across multiple teams and user would like to chargeback the individual teams. Individual teams would use their own Databricks cluster and can be charged back at cluster level. +2. Chargeback across multiple Databricks workspace: In this case, teams use their own workspace and would like to chargeback at workspace level. +To support these scenarios, Azure Databricks leverages Azure Tags so that the users can view the cost/usage for resources with tags. There are default tags that comes with the. 
Please see below the default tags that are available with the resources: +Resources Default Tags +All-purpose Compute Vendor, Creator, ClusterName, ClusterId +Jobs Compute/ Jobs Light Compute Vendor, Creator, ClusterName, ClusterId, RunName, JobId + +Pool Vendor, DatabricksInstancePoolId,DatabricksInstancePoolCreatorId + +Resources created during workspace creation (Storage, Worker VNet, NSG) +application, databricks-environment + + +In addition to the default tags, customers can add custom tags to the resources based on how they want to charge back. Both default and custom tags are displayed on Azure bills that allows one to chargeback by filtering resource usage based on tags. +1. Cluster Tags : You can create custom tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to underlying cluster resources – VMs, DBUs, Public IP Addresses, Disks. +2. Pool Tags : You can create custom tags as key-value pairs when you create a pool, and Azure Databricks applies these tags to underlying pool resources – VMs, Public IP Addresses, Disks. Pool-backed clusters inherit default and custom tags from the pool configuration. +3. Workspace Tags: You can create custom tags as key-value pairs when you create an Azure Databricks workspaces. These tags apply to underlying resources within the workspace – VMs, DBUs, and others. +Please see below on how tags propagate for DBUs and VMs +1. Clusters created from pools +a. DBU Tag = Workspace Tag + Pool Tag + Cluster Tag +b. VM Tag = Workspace Tag + Pool Tag +2. Clusters not from Pools +a. DBU Tag = Workspace Tag + Cluster Tag +b. VM Tag = Workspace Tag + Cluster Tag +These tags (default and custom) propagate to Cost Analysis Reports that you can access in the Azure Portal. The below section will explain how to do cost/usage analysis using these tags. +Cost/Usage Analysis +The Cost Analysis report is available under Cost Management within Azure Portal. Please refer to Cost Management section to get a detailed overview on how to use Cost Management. + +Below example is aimed at giving a quick start to get you going to do cost analysis for Azure Databricks. Below are the steps +1. In Azure Portal, click on Cost Management + Billing +2. In Cost Management, click on Cost Analysis Tab + + + +3. Choose the right billing scope that want report for and make sure the user has Cost Management Reader permission for the that scope. +4. Once selected, then you will see cost reports for all the Azure resources at that scope. +5. Post that you can create different reports by using the different options on the chart. For example, one of the reports you can create is +a. Chart option as Column (stacked) +b. Granularity – Daily +c. Group by – Tag – Choose clustername or clustered +You will see something like below where it will show the distribution of cost on a daily basis for different clusters in your subscription or the scope that you chose in Step 3. You also have option to save this report and share it with your team. + +To chargeback, you can filter this report by using the tag option. For example, you can use default tag: Creator or can use own custom tag – Cost Center and chargeback based on that. + +You also have option to consume this data from CSV or a native Power BI connector for Cost Management. Please see below +1. To download this data to CSV, you can set export from Cost Management + Billing -> Usage + Charges and choose Usage Details Version 2 on the right. Refer this for more details. 
Once downloaded, you can view the cost usage data and filter based on tags to chargeback. In the CSV, you can refer the Meter Name to get the Databricks workload consumed. In addition, this is how the other fields are represented for meters related to Azure Databricks +a. Quantity = Number of Virtual Machines x Number of hours x DBU count +b. Effective Price = DBU price based on the SKU +c. Cost = Quantity x Effective Price + + + +2. There is a native Cost Management Connector in Power BI that allows one to make powerful, customized visualization and cost/usage reports. + + +Once you connect, you can create various rich reports easily like below by choosing the right fields from the table. + +Tip: To filter on tags, you will need to parse the json in Power BI. To do that, follow these steps +1. Go to "Query Editor" +2. Select the "Usage Details" table +3. On the right side the "Properties" tab shows the steps as + +4. From the menu bar go to "Add column" -> "Add custom column" +5. Name the column and enter the following text in the query += "{"& [Tags] & "}" + +6. This will create a new column of "tags" in the json format. +7. Now user can transform it as expand it. You can then use the different tags as columns that you can use in a report. + + +Please see some of the common views created easily using this connector. + +How Does Cost Management and Report work for Pre-purchase Plan +Following are the key things to note about pre-purchase plan +1. Pricing/Discount: Pre-purchase plan for DBUs with discount is available in the pricing page. +2. To view the overall consumption for pre-purchase, you can find it in Azure Portal by going to Reservations page. If you have multiple Reservations, you can find all of them in under Reservations in Azure Portal. This will allow one to track to-date usage of different reservations separately. Please see Reservations page on how to access this information from various tools including REST API, PowerShell, and CLI. + +3. To get the detailed utilized and reports (like Pay as you go), the same Cost Management section above would apply with few below changes +a. Use the field Amortized Cost instead of Actual Cost in Azure Portal +b. For EA, and Modern customers, the Meter Name would reflect the exact DBU workload and tier in the cost reports. The report would also show the exact tier of reservation – as 1 year or 3 year. One would still need to download the same Usage Details Version 2 report as mentioned here or use the Power BI Cost Management connector. For Web, and Direct customers, the product and meter name would show as Azure Databricks Reservations-DBU and DBU respectively. To identify the workload SKU, you can find the MeterID under “additionalinfo” as consumption meter. +4. For Web and Direct customers, one can calculate the normalized consumption for DBCUs using the below steps: +a. Refer to this table to get the Cost Management Ratio +Key Things to Note: +1. Cost is shown as 0 for the reservation. That is because the reservation is pre-paid. To calculate cost, one needs to start by looking at "consumedquantity" +2. meterid changes from the SKU-based IDs, to a reservation specific meterid +3. Previous meterid information is now displayed in "additionalinfo" under "ConsumptionMeter" +4. To calculate DBCU drawdown, customer needs to multiply the "consumedquantity" by the respective cost ratio based on the relevant "ConsumptionMeter". + +1. The portal separates a "normalized quantity" which is the # of DBCUs deducted from the reservation +2. 
There is an "instance size flexibility ratio" which indicates the "Cost Management Ratios" detailed in the first sheet of this document. These indicate the conversion factors from PAYG SKUs to DBCUs.
3. The "reservation applied quantity" showcases the original DBU SKU quantities that were consumed

In this example, the customer uses 334 Premium All-purpose Compute DBUs. To understand how many DBCUs are deducted from the reservation, we need to divide 334 by the conversion factor ("instance size flexibility ratio") of 1.818 for that particular PAYG SKU. This results in roughly 183 DBCUs being deducted.


Pricing differences for Regions

Please refer to the Azure Databricks pricing page to get the pricing for DBU SKUs and the pricing discounts based on Reservations. There are certain differences to consider:
1. The DBU prices are different for Azure public cloud and other regions such as Azure Gov
2. The pre-purchase plan prices are different for Azure public cloud and Azure Gov


Known Issues/Limitations

1. Tag change propagation at the workspace level takes up to ~1 hour to apply to resources under the Managed resource group.
2. Tag change propagation at the workspace level requires a cluster restart for an existing running cluster, or pool expansion
3. Cost Management at the parent resource group won’t show Managed RG resource consumption
4. Cost Management role assignments are not possible at the Managed RG level. A user today must have a role assignment at the parent resource group level or above (i.e. the subscription) to see managed RG consumption
5. For clusters created from a pool, only workspace tags and pool tags are propagated to the VMs
6. Tag keys and values can contain only characters from the ISO 8859-1 set
7. A custom tag gets prefixed with x_ when it conflicts with a default tag
8. 
Max of 50 tags can be assigned to Azure resource + # Appendix A ## Installation for being able to capture VM metrics in Log Analytics From 081abc9a9275dc1386d16f8f1530584231630051 Mon Sep 17 00:00:00 2001 From: Priya Aswani Date: Mon, 26 Oct 2020 14:26:10 -0700 Subject: [PATCH 13/22] Update toc.md updated bhanu as an author --- toc.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/toc.md b/toc.md index 7efc75c..df3aeac 100644 --- a/toc.md +++ b/toc.md @@ -8,9 +8,10 @@ # Azure Databricks Best Practices -Created by: +Authors: * Dhruv Kumar, Senior Solutions Architect, Databricks * Premal Shah, Azure Databricks PM, Microsoft +* Bhanu Prakash, Azure Databricks PM, Microsoft Written by: Priya Aswani, WW Data Engineering & AI Technical Lead From 7846cf743435c84e344a865c0c87547f2379b340 Mon Sep 17 00:00:00 2001 From: Priya Aswani Date: Mon, 26 Oct 2020 14:28:22 -0700 Subject: [PATCH 14/22] Update README.md --- README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index bf4976c..1500a1b 100644 --- a/README.md +++ b/README.md @@ -2,10 +2,13 @@ ![Azure Databricks](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/ADBicon.jpg "Azure Databricks") -Created by: -Dhruv Kumar, Databricks Senior Solutions Architect -and + +Authors: +Dhruv Kumar, Senior Solutions Architect, Databricks Premal Shah, Azure Databricks PM, Microsoft +Bhanu Prakash, Azure Databricks PM, Microsoft + +Written by: Priya Aswani, WW Data Engineering & AI Technical Lead Published: June 22, 2019 From 6aeef30ad040432296cdfcdd856154f8e33ec96e Mon Sep 17 00:00:00 2001 From: Priya Aswani Date: Tue, 27 Oct 2020 10:47:49 -0700 Subject: [PATCH 15/22] Update toc.md --- toc.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/toc.md b/toc.md index 82b89a1..a739602 100644 --- a/toc.md +++ b/toc.md @@ -40,7 +40,7 @@ Written by: Priya Aswani, WW Data Engineering & AI Technical Lead - [Running ADB Applications Smoothly: Guidelines on Observability and Monitoring](#Running-ADB-Applications-Smoothly-Guidelines-on-Observability-and-Monitoring) * [Collect resource utilization metrics across Azure Databricks cluster in a Log Analytics workspace](#Collect-resource-utilization-metrics-across-Azure-Databricks-cluster-in-a-Log-Analytics-workspace) + [Querying VM metrics in Log Analytics once you have started the collection using the above document](#Querying-VM-metrics-in-log-analytics-once-you-have-started-the-collection-using-the-above-document) -- [Cost Management, Charegback and Analysis](#Cost-Management,-Charegback-and-Analysis) +- [Cost Management, Chargeback and Analysis](#Cost-Management,-Chargeback-and-Analysis) - [Appendix A](#Appendix-A) * [Installation for being able to capture VM metrics in Log Analytics](#Installation-for-being-able-to-capture-VM-metrics-in-Log-Analytics) + [Overview](#Overview) @@ -443,7 +443,7 @@ You can use Log analytics directly to query the Perf data. Here is an example of You can also use Grafana to visualize your data from Log Analytics. -## Cost Management, Charegback and Analysis +## Cost Management, Chargeback and Analysis This section will focus on Azure Databricks billing, tools to manage and analyze cost and how to charge back to the team. 
Azure Databricks Billing: From eca298ad157e69007d7fcdb6aab8fa9ab494a47a Mon Sep 17 00:00:00 2001 From: Priya Aswani Date: Tue, 27 Oct 2020 10:49:21 -0700 Subject: [PATCH 17/22] Update toc.md --- toc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/toc.md b/toc.md index a739602..6fcf3d8 100644 --- a/toc.md +++ b/toc.md @@ -40,7 +40,7 @@ Written by: Priya Aswani, WW Data Engineering & AI Technical Lead - [Running ADB Applications Smoothly: Guidelines on Observability and Monitoring](#Running-ADB-Applications-Smoothly-Guidelines-on-Observability-and-Monitoring) * [Collect resource utilization metrics across Azure Databricks cluster in a Log Analytics workspace](#Collect-resource-utilization-metrics-across-Azure-Databricks-cluster-in-a-Log-Analytics-workspace) + [Querying VM metrics in Log Analytics once you have started the collection using the above document](#Querying-VM-metrics-in-log-analytics-once-you-have-started-the-collection-using-the-above-document) -- [Cost Management, Chargeback and Analysis](#Cost-Management,-Chargeback-and-Analysis) +- [Cost Management, Chargeback and Analysis](#Cost-Management-Chargeback-and-Analysis) - [Appendix A](#Appendix-A) * [Installation for being able to capture VM metrics in Log Analytics](#Installation-for-being-able-to-capture-VM-metrics-in-Log-Analytics) + [Overview](#Overview) From f1b75c66eee41786bb889a2fc676f3d8065ca698 Mon Sep 17 00:00:00 2001 From: Priya Aswani Date: Tue, 27 Oct 2020 10:54:43 -0700 Subject: [PATCH 18/22] Update toc.md --- toc.md | 76 ++++++++++++++++++++++++++++++++++------------------------ 1 file changed, 45 insertions(+), 31 deletions(-) diff --git a/toc.md b/toc.md index 6fcf3d8..cc7be5c 100644 --- a/toc.md +++ b/toc.md @@ -446,21 +446,27 @@ You can also use Grafana to visualize your data from Log Analytics. ## Cost Management, Chargeback and Analysis This section will focus on Azure Databricks billing, tools to manage and analyze cost and how to charge back to the team. + Azure Databricks Billing: First, it is important to understand the different workloads and tiers available with Azure Databricks. Azure Databricks is available in 2 tiers – Standard and Premium. Premium Tier offers additional features on top of what is available in Standard tier. These include Role-based access control for notebooks, jobs, and tables, Audit logs, Azure AD conditional pass-through, conditional authentication and many more. Please refer to https://azure.microsoft.com/en-us/pricing/details/databricks/ for the complete list. -Both Premium and Standard tier come with 3 types of workload -• Jobs Compute (previously called Data Engineering) -• Jobs Light Compute (previously called Data Engineering Light) -• All-purpose Compute (previously called Data Analytics) + +Both Premium and Standard tier come with 3 types of workload: + +1. Jobs Compute (previously called Data Engineering) +2. Jobs Light Compute (previously called Data Engineering Light) +3. All-purpose Compute (previously called Data Analytics) The Jobs Compute and Jobs Light Compute make it easy for data engineers to build and execute jobs, and All-purpose make it easy for data scientists to explore, visualize, manipulate, and share data and insights interactively. Depending upon the use-case, one can also use All-purpose Compute for data engineering or automated scenarios especially if the incoming job rate is higher. 
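Which of these workloads you are billed for follows from how the compute is created: a job that spins up its own job cluster draws Jobs Compute DBUs, while the same task pointed at an existing interactive cluster draws All-purpose Compute DBUs. A minimal sketch against the Jobs API 2.0; the workspace URL, token, and notebook path are placeholders:

```python
import requests

host = "https://<region>.azuredatabricks.net"   # placeholder workspace URL
headers = {"Authorization": "Bearer <personal-access-token>"}

job_spec = {
    "name": "nightly-etl",
    "new_cluster": {                        # job cluster => Jobs Compute DBUs
        "spark_version": "6.4.x-scala2.11",
        "node_type_id": "Standard_DS13_v2",
        "num_workers": 10,
    },
    # "existing_cluster_id": "<cluster-id>",  # interactive cluster => All-purpose DBUs
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},
}

resp = requests.post(f"{host}/api/2.0/jobs/create",
                     headers=headers, json=job_spec)
print(resp.json())  # => {"job_id": ...}
```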
-When you create an Azure Databricks workspace and spin up a cluster, below resources are consumed -• DBUs – A DBU is a unit of processing capability, billed on a per-second usage -• Virtual Machines – These represent your Databricks clusters that run the Databricks Runtime -• Public IP Addresses – These represent the IP Addresses consumed by the Virtual Machines when the cluster is running -• Blob Storage – Each workspace comes with a default storage -• Managed Disk -• Bandwidth – Bandwidth charges for any data transfer -Service/Resource Pricing + +When you create an Azure Databricks workspace and spin up a cluster, below resources are consumed: + +1. DBUs – A DBU is a unit of processing capability, billed on a per-second usage +2. Virtual Machines – These represent your Databricks clusters that run the Databricks Runtime +3. Public IP Addresses – These represent the IP Addresses consumed by the Virtual Machines when the cluster is running +4. Blob Storage – Each workspace comes with a default storage +5. Managed Disk +6. Bandwidth – Bandwidth charges for any data transfer + +Service/Resource Pricing" DBUs https://azure.microsoft.com/en-us/pricing/details/databricks/ VMs https://azure.microsoft.com/en-us/pricing/details/databricks/ Public IP Addresses https://azure.microsoft.com/en-us/pricing/details/ip-addresses/ @@ -469,32 +475,38 @@ Managed Disk https://azure.microsoft.com/en-us/pricing/details/managed-disks/ Bandwidth https://azure.microsoft.com/en-us/pricing/details/bandwidth/ In addition, if you use additional services as part of your end-2-end solution, such as Azure CosmosDB, or Azure Event Hub, then they are charged per their pricing plan. -Per the details in Azure Databricks pricing page, there are 2 options +Per the details in Azure Databricks pricing page, there are 2 options: + 1. Pay as you go – Pay for the DBUs as you use: Refer to the pricing page for the DBU prices based on the SKU. Note: The DBU per hour price for different SKUs differs across Azure public cloud, Azure Gov and Azure China region. 2. Pre-purchase or Reservations – You can get up to 37% savings over pay-as-you-go DBU when you pre-purchase Azure Databricks Units (DBU) as Databricks Commit Units (DBCU) for either 1 or 3 years. A Databricks Commit Unit (DBCU) normalizes usage from Azure Databricks workloads and tiers into to a single purchase. Your DBU usage across those workloads and tiers will draw down from the Databricks Commit Units (DBCU) until they are exhausted, or the purchase term expires. The draw down rate will be equivalent to the price of the DBU, as per the table above. Refer to the pricing page for the pre-purchase pricing. -Since, you are also billed for the VMs, you have both the above options for VMs as well + +Since, you are also billed for the VMs, you have both the above options for VMs as well: 1. Pay as you go 2. Reservations - https://azure.microsoft.com/en-us/pricing/reserved-vm-instances/ Below are few examples of a billing for Azure Databricks with Pay as you go Depending on the type of workload your cluster runs, you will either be charged for Jobs Compute, Jobs Light Compute, or All-purpose Compute workload. For example, if the cluster runs workloads triggered by the Databricks jobs scheduler, you will be charged for the Jobs Compute workload. If your cluster runs interactive features such as ad-hoc commands, you will be billed for All-purpose Compute workload. + Accordingly, the pricing will be dependent on below components 1. 
DBU SKU – DBU price based on the workload and tier 2. VM SKU – VM price based on the VM SKU 3. DBU Count – Each VM SKU has an associated DBU count. Example – D3v2 has DBU count of 0.75 4. Region 5. Duration -• If you run Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing would be the following for All-purpose Compute: -• VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598 -• DBU cost for All-purpose Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100 -• The total cost would therefore be $598 (VM Cost) + $1,100 (DBU Cost) = $1,698. -• If you run Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing would be the following for Jobs Compute workload: -• VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598 -• DBU cost for Jobs Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.30/DBU = $600 -• The total cost would therefore be $598 (VM Cost) + $600 (DBU Cost) = $1,198. + +* If you run Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing would be the following for All-purpose Compute: +* VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598 +* DBU cost for All-purpose Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100 +* The total cost would therefore be $598 (VM Cost) + $1,100 (DBU Cost) = $1,698. +* If you run Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing would be the following for Jobs Compute workload: +* VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598 +* DBU cost for Jobs Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.30/DBU = $600 +* The total cost would therefore be $598 (VM Cost) + $600 (DBU Cost) = $1,198. In addition to VM and DBU charges, there will be additional charges for managed disks, public IP address, bandwidth, or any other resource such as Azure Storage, Azure Cosmos DB depending on your application. + Azure Databricks Trial: If you are new to Azure Databricks, you can also use a Trial SKU that gives you free DBUs for Premium tier for 14 days. You will still need to pay for other resources like VM, Storage etc. that are consumed during this period. After the trial is over, you will need to start paying for the DBUs. -Chargeback scenarios + +Chargeback scenarios: There are 2 broad scenarios we have seen with respect to chargeback internal teams for sharing Databricks resources 1. Chargeback across a single Azure Databricks workspace: In this case, a single workspace is shared across multiple teams and user would like to chargeback the individual teams. Individual teams would use their own Databricks cluster and can be charged back at cluster level. 2. Chargeback across multiple Databricks workspace: In this case, teams use their own workspace and would like to chargeback at workspace level. @@ -513,14 +525,16 @@ In addition to the default tags, customers can add custom tags to the resources 1. Cluster Tags : You can create custom tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to underlying cluster resources – VMs, DBUs, Public IP Addresses, Disks. 2. 
Pool Tags : You can create custom tags as key-value pairs when you create a pool, and Azure Databricks applies these tags to underlying pool resources – VMs, Public IP Addresses, Disks. Pool-backed clusters inherit default and custom tags from the pool configuration. 3. Workspace Tags: You can create custom tags as key-value pairs when you create an Azure Databricks workspaces. These tags apply to underlying resources within the workspace – VMs, DBUs, and others. -Please see below on how tags propagate for DBUs and VMs + +Please see below on how tags propagate for DBUs and VMs: 1. Clusters created from pools -a. DBU Tag = Workspace Tag + Pool Tag + Cluster Tag -b. VM Tag = Workspace Tag + Pool Tag + a. DBU Tag = Workspace Tag + Pool Tag + Cluster Tag + b. VM Tag = Workspace Tag + Pool Tag 2. Clusters not from Pools -a. DBU Tag = Workspace Tag + Cluster Tag -b. VM Tag = Workspace Tag + Cluster Tag + a. DBU Tag = Workspace Tag + Cluster Tag + b. VM Tag = Workspace Tag + Cluster Tag These tags (default and custom) propagate to Cost Analysis Reports that you can access in the Azure Portal. The below section will explain how to do cost/usage analysis using these tags. + Cost/Usage Analysis The Cost Analysis report is available under Cost Management within Azure Portal. Please refer to Cost Management section to get a detailed overview on how to use Cost Management. @@ -574,10 +588,10 @@ Following are the key things to note about pre-purchase plan 2. To view the overall consumption for pre-purchase, you can find it in Azure Portal by going to Reservations page. If you have multiple Reservations, you can find all of them in under Reservations in Azure Portal. This will allow one to track to-date usage of different reservations separately. Please see Reservations page on how to access this information from various tools including REST API, PowerShell, and CLI. 3. To get the detailed utilized and reports (like Pay as you go), the same Cost Management section above would apply with few below changes -a. Use the field Amortized Cost instead of Actual Cost in Azure Portal -b. For EA, and Modern customers, the Meter Name would reflect the exact DBU workload and tier in the cost reports. The report would also show the exact tier of reservation – as 1 year or 3 year. One would still need to download the same Usage Details Version 2 report as mentioned here or use the Power BI Cost Management connector. For Web, and Direct customers, the product and meter name would show as Azure Databricks Reservations-DBU and DBU respectively. To identify the workload SKU, you can find the MeterID under “additionalinfo” as consumption meter. + a. Use the field Amortized Cost instead of Actual Cost in Azure Portal + b. For EA, and Modern customers, the Meter Name would reflect the exact DBU workload and tier in the cost reports. The report would also show the exact tier of reservation – as 1 year or 3 year. One would still need to download the same Usage Details Version 2 report as mentioned here or use the Power BI Cost Management connector. For Web, and Direct customers, the product and meter name would show as Azure Databricks Reservations-DBU and DBU respectively. To identify the workload SKU, you can find the MeterID under “additionalinfo” as consumption meter. 4. For Web and Direct customers, one can calculate the normalized consumption for DBCUs using the below steps: -a. Refer to this table to get the Cost Management Ratio + a. Refer to this table to get the Cost Management Ratio Key Things to Note: 1. 
Key Things to Note:
1. Cost is shown as 0 for the reservation because the reservation is pre-paid. To calculate cost, one needs to start by looking at "consumedquantity"
2. meterid changes from the SKU-based IDs to a reservation-specific meterid

From 621262a34e83a4f6a745e4bfc95cde840372e8b8 Mon Sep 17 00:00:00 2001
From: Priya Aswani
Date: Tue, 27 Oct 2020 11:03:03 -0700
Subject: [PATCH 19/22] Update toc.md

---
 toc.md | 57 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 26 insertions(+), 31 deletions(-)

diff --git a/toc.md b/toc.md
index cc7be5c..780643e 100644
--- a/toc.md
+++ b/toc.md
@@ -451,14 +451,12 @@ Azure Databricks Billing:
 First, it is important to understand the different workloads and tiers available with Azure Databricks. Azure Databricks is available in 2 tiers – Standard and Premium. The Premium tier offers additional features on top of what is available in the Standard tier. These include role-based access control for notebooks, jobs, and tables; audit logs; Azure AD conditional pass-through; conditional authentication; and many more. Please refer to https://azure.microsoft.com/en-us/pricing/details/databricks/ for the complete list. Both the Premium and Standard tiers come with 3 types of workloads:
-
 1. Jobs Compute (previously called Data Engineering)
 2. Jobs Light Compute (previously called Data Engineering Light)
 3. All-purpose Compute (previously called Data Analytics)
 Jobs Compute and Jobs Light Compute make it easy for data engineers to build and execute jobs, and All-purpose Compute makes it easy for data scientists to explore, visualize, manipulate, and share data and insights interactively. Depending upon the use case, one can also use All-purpose Compute for data engineering or automated scenarios, especially if the incoming job rate is high.
 When you create an Azure Databricks workspace and spin up a cluster, the below resources are consumed:
-
 1. DBUs – A DBU is a unit of processing capability, billed on per-second usage
 2. Virtual Machines – These represent your Databricks clusters that run the Databricks Runtime
 3. Public IP Addresses – These represent the IP Addresses consumed by the Virtual Machines when the cluster is running
 4. Blob Storage – Each workspace comes with a default storage
 5. Managed Disk
 6. Bandwidth – Bandwidth charges for any data transfer
 Service/Resource Pricing
-DBUs https://azure.microsoft.com/en-us/pricing/details/databricks/
-VMs https://azure.microsoft.com/en-us/pricing/details/databricks/
-Public IP Addresses https://azure.microsoft.com/en-us/pricing/details/ip-addresses/
-Blob Storage https://azure.microsoft.com/en-us/pricing/details/storage/
-Managed Disk https://azure.microsoft.com/en-us/pricing/details/managed-disks/
-Bandwidth https://azure.microsoft.com/en-us/pricing/details/bandwidth/
+* DBUs https://azure.microsoft.com/en-us/pricing/details/databricks/
+* VMs https://azure.microsoft.com/en-us/pricing/details/databricks/
+* Public IP Addresses https://azure.microsoft.com/en-us/pricing/details/ip-addresses/
+* Blob Storage https://azure.microsoft.com/en-us/pricing/details/storage/
+* Managed Disk https://azure.microsoft.com/en-us/pricing/details/managed-disks/
+* Bandwidth https://azure.microsoft.com/en-us/pricing/details/bandwidth/
 In addition, if you use additional services as part of your end-to-end solution, such as Azure Cosmos DB or Azure Event Hubs, then they are charged per their pricing plans.
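The list-price arithmetic from the worked billing examples earlier can be captured in a small helper. This is a minimal sketch using the illustrative East US 2 rates quoted above (DS13v2 at $0.598/hour, 2 DBUs per node, $0.55/DBU for Premium All-purpose and $0.30/DBU for Jobs Compute); actual rates come from the pricing page.

```python
def cluster_cost(hours: float, nodes: int, vm_rate: float,
                 dbus_per_node: float, dbu_rate: float) -> dict:
    """Estimate Azure Databricks cluster cost: VM charges plus DBU charges."""
    vm_cost = hours * nodes * vm_rate                      # VM-hours x rate
    dbu_cost = hours * nodes * dbus_per_node * dbu_rate    # DBU-hours x rate
    return {"vm": vm_cost, "dbu": dbu_cost, "total": vm_cost + dbu_cost}

# Worked examples above: 10 x DS13v2 for 100 hours, Premium tier.
print(cluster_cost(100, 10, 0.598, 2, 0.55))  # All-purpose: vm 598, dbu 1100, total 1698
print(cluster_cost(100, 10, 0.598, 2, 0.30))  # Jobs Compute: vm 598, dbu 600, total 1198
```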
Per the details on the Azure Databricks pricing page, there are 2 options:

@@ -510,13 +508,13 @@ Chargeback scenarios:
 There are 2 broad scenarios we have seen with respect to charging back internal teams for shared Databricks resources:
 1. Chargeback across a single Azure Databricks workspace: In this case, a single workspace is shared across multiple teams, and the user would like to charge back the individual teams. Individual teams would use their own Databricks clusters and can be charged back at the cluster level.
 2. Chargeback across multiple Databricks workspaces: In this case, teams use their own workspaces and would like to charge back at the workspace level.
-To support these scenarios, Azure Databricks leverages Azure Tags so that the users can view the cost/usage for resources with tags. There are default tags that comes with the. Please see below the default tags that are available with the resources:
+To support these scenarios, Azure Databricks leverages Azure Tags so that users can view the cost/usage for resources with tags. There are default tags that come with the resources.
+
+Please see below the default tags that are available with the resources:
 | Resources | Default Tags |
 | --- | --- |
 | All-purpose Compute | Vendor, Creator, ClusterName, ClusterId |
 | Jobs Compute / Jobs Light Compute | Vendor, Creator, ClusterName, ClusterId, RunName, JobId |
 | Pool | Vendor, DatabricksInstancePoolId, DatabricksInstancePoolCreatorId |
 | Resources created during workspace creation (Storage, Worker VNet, NSG) | application, databricks-environment |
@@ -538,36 +536,31 @@ These tags (default and custom) propagate to Cost Analysis Reports that you can
 Cost/Usage Analysis
 The Cost Analysis report is available under Cost Management within the Azure Portal. Please refer to the Cost Management section for a detailed overview of how to use Cost Management.
-Below example is aimed at giving a quick start to get you going to do cost analysis for Azure Databricks. Below are the steps
+The below example is aimed at giving you a quick start on cost analysis for Azure Databricks. Below are the steps:
 1. In the Azure Portal, click on Cost Management + Billing
 2. In Cost Management, click on the Cost Analysis tab
-
-
-
 3. Choose the right billing scope that you want the report for, and make sure the user has Cost Management Reader permission for that scope.
 4. Once selected, you will see cost reports for all the Azure resources at that scope.
 5. After that, you can create different reports by using the different options on the chart. For example, one report you can create is:
-a. Chart option as Column (stacked)
-b. Granularity – Daily
-c. Group by – Tag – Choose clustername or clustered
+ a. Chart option as Column (stacked)
+ b. Granularity – Daily
+ c. Group by – Tag – Choose clustername or clusterid
+
 You will see something like the below, showing the distribution of cost on a daily basis for the different clusters in your subscription or the scope you chose in Step 3.
 You also have the option to save this report and share it with your team.
 To charge back, you can filter this report using the tag option. For example, you can use the default tag Creator, or your own custom tag such as Cost Center, and charge back based on that.
-You also have option to consume this data from CSV or a native Power BI connector for Cost Management. Please see below
+You also have the option to consume this data from a CSV export or the native Power BI connector for Cost Management. Please see below:
1. 
To download this data to CSV, you can set up an export from Cost Management + Billing -> Usage + Charges and choose Usage Details Version 2 on the right. Refer to this for more details. Once downloaded, you can view the cost usage data and filter based on tags to charge back. In the CSV, you can refer to the Meter Name to get the Databricks workload consumed. In addition, this is how the other fields are represented for meters related to Azure Databricks:
 a. Quantity = Number of Virtual Machines x Number of hours x DBU count
 b. Effective Price = DBU price based on the SKU
 c. Cost = Quantity x Effective Price
2. There is a native Cost Management Connector in Power BI that allows one to build powerful, customized visualizations and cost/usage reports.

Once you connect, you can easily create various rich reports like the below by choosing the right fields from the table.

Tip: To filter on tags, you will need to parse the tags JSON in Power BI. To do that, follow these steps:
1. Go to "Query Editor"
2. Select the "Usage Details" table
3. On the right side, the "Properties" tab shows the steps as

 Please see some of the common views that can be created easily using this connector.

**How Do Cost Management and Reports Work for the Pre-purchase Plan**

Following are the key things to note about the pre-purchase plan:
1. Pricing/Discount: The pre-purchase plan for DBUs, with its discount, is available on the pricing page.
2. To view the overall consumption for pre-purchase, you can find it in the Azure Portal by going to the Reservations page. If you have multiple Reservations, you can find all of them under Reservations in the Azure Portal. This will allow one to track the to-date usage of different reservations separately. Please see the Reservations page on how to access this information from various tools including the REST API, PowerShell, and the CLI.
3. To get detailed utilization and reports (as with Pay-as-you-go), the same Cost Management section above applies, with a few changes:
 a. Use the field Amortized Cost instead of Actual Cost in the Azure Portal
 b. For EA and Modern customers, the Meter Name would reflect the exact DBU workload and tier in the cost reports.
 The report would also show the exact tier of the reservation – 1 year or 3 years. One would still need to download the same Usage Details Version 2 report as mentioned here or use the Power BI Cost Management connector. For Web and Direct customers, the product and meter name would show as Azure Databricks Reservations-DBU and DBU respectively. 
To identify the workload SKU, you can find the MeterID under “additionalinfo” as the consumption meter.
4. For Web and Direct customers, one can calculate the normalized consumption for DBCUs using the below steps:
 a. Refer to this table to get the Cost Management Ratio

Key Things to Note:
1. Cost is shown as 0 for the reservation because the reservation is pre-paid. To calculate cost, one needs to start by looking at "consumedquantity"
2. meterid changes from the SKU-based IDs to a reservation-specific meterid

In this example, the customer uses 334 Premium All-purpose Compute DBUs. To understand how many DBCUs are deducted from the reservation, we need to divide 334 by the conversion factor (the "instance size flexibility ratio") of 1.818 for that particular PAYG SKU. This results in a deduction of about 183 DBCUs.

**Pricing difference for Regions**

Please refer to the Azure Databricks pricing page to get the pricing for DBU SKUs and the pricing discounts based on Reservations. There are certain differences to consider:

1. The DBU prices are different for the Azure public cloud and other regions such as Azure Gov
2. The pre-purchase plan prices are different for the Azure public cloud and Azure Gov

**Known Issues/Limitations**

1. Tag change propagation at the workspace level takes up to ~1 hour to apply to resources under the Managed resource group.
2. Tag change propagation at the workspace level requires a cluster restart for an existing running cluster, or pool expansion

From 8f6d97acd671ef908ac45e1518e64f4dea134431 Mon Sep 17 00:00:00 2001
From: Priya Aswani
Date: Tue, 27 Oct 2020 11:07:25 -0700
Subject: [PATCH 20/22] Update toc.md

---
 toc.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/toc.md b/toc.md
index 780643e..adb50e5 100644
--- a/toc.md
+++ b/toc.md
@@ -552,9 +552,9 @@ To chargeback, you can filter this report by using the tag option. For example,
 You also have the option to consume this data from a CSV export or the native Power BI connector for Cost Management. Please see below:
 1. To download this data to CSV, you can set up an export from Cost Management + Billing -> Usage + Charges and choose Usage Details Version 2 on the right. Refer to this for more details. Once downloaded, you can view the cost usage data and filter based on tags to charge back. In the CSV, you can refer to the Meter Name to get the Databricks workload consumed. In addition, this is how the other fields are represented for meters related to Azure Databricks:
- a. Quantity = Number of Virtual Machines x Number of hours x DBU count
- b. Effective Price = DBU price based on the SKU
- c. Cost = Quantity x Effective Price
+ a. Quantity = Number of Virtual Machines x Number of hours x DBU count
+ b. Effective Price = DBU price based on the SKU
+ c. Cost = Quantity x Effective Price
 2. There is a native Cost Management Connector in Power BI that allows one to build powerful, customized visualizations and cost/usage reports.
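As a quick sanity check on the field relationships spelled out above (Quantity, Effective Price, Cost), here is a minimal sketch with hypothetical numbers; the D3v2 DBU count of 0.75 comes from the pricing factors listed earlier, and the $0.55/DBU rate is the Premium All-purpose example rate.

```python
def usage_details_row(vms: int, hours: float, dbu_count: float,
                      effective_price: float) -> dict:
    """Reconstruct the Usage Details fields for an Azure Databricks DBU meter."""
    quantity = vms * hours * dbu_count  # Quantity = VMs x hours x DBU count
    return {"Quantity": quantity,
            "EffectivePrice": effective_price,
            "Cost": quantity * effective_price}  # Cost = Quantity x Effective Price

# Hypothetical: 4 x D3v2 (0.75 DBU each) running 10 hours at $0.55/DBU.
print(usage_details_row(4, 10, 0.75, 0.55))
# {'Quantity': 30.0, 'EffectivePrice': 0.55, 'Cost': 16.5}
```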
From 90e19ee49a170fcc89ee37d4334ab55103256cdf Mon Sep 17 00:00:00 2001
From: Priya Aswani
Date: Tue, 27 Oct 2020 11:09:23 -0700
Subject: [PATCH 22/22] Update toc.md

---
 toc.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/toc.md b/toc.md
index adb50e5..83ff0e5 100644
--- a/toc.md
+++ b/toc.md
@@ -526,11 +526,17 @@ In addition to the default tags, customers can add custom tags to the resources
 Please see below on how tags propagate for DBUs and VMs:
 1. Clusters created from pools
+
 a. DBU Tag = Workspace Tag + Pool Tag + Cluster Tag
+
 b. VM Tag = Workspace Tag + Pool Tag
+
 2. Clusters not from Pools
+
 a. DBU Tag = Workspace Tag + Cluster Tag
+
 b. VM Tag = Workspace Tag + Cluster Tag
+
 These tags (default and custom) propagate to the Cost Analysis Reports that you can access in the Azure Portal. The below section will explain how to do cost/usage analysis using these tags.
 Cost/Usage Analysis
@@ -542,9 +548,13 @@ Below example is aimed at giving a quick start to get you going to do cost analy
 3. Choose the right billing scope that you want the report for, and make sure the user has Cost Management Reader permission for that scope.
 4. Once selected, you will see cost reports for all the Azure resources at that scope.
 5. After that, you can create different reports by using the different options on the chart. For example, one report you can create is:
+
 a. Chart option as Column (stacked)
+
 b. Granularity – Daily
+
 c. Group by – Tag – Choose clustername or clusterid
+
 You will see something like the below, showing the distribution of cost on a daily basis for the different clusters in your subscription or the scope you chose in Step 3.
 You also have the option to save this report and share it with your team.
@@ -552,9 +562,13 @@ To chargeback, you can filter this report by using the tag option. For example,
 You also have the option to consume this data from a CSV export or the native Power BI connector for Cost Management. Please see below:
 1. To download this data to CSV, you can set up an export from Cost Management + Billing -> Usage + Charges and choose Usage Details Version 2 on the right. Refer to this for more details. Once downloaded, you can view the cost usage data and filter based on tags to charge back. In the CSV, you can refer to the Meter Name to get the Databricks workload consumed. In addition, this is how the other fields are represented for meters related to Azure Databricks:
+
 a. Quantity = Number of Virtual Machines x Number of hours x DBU count
+
 b. Effective Price = DBU price based on the SKU
+
 c. Cost = Quantity x Effective Price
+
 2. There is a native Cost Management Connector in Power BI that allows one to build powerful, customized visualizations and cost/usage reports.
@@ -581,10 +595,14 @@ Following are the key things to note about pre-purchase plan
 1. Pricing/Discount: The pre-purchase plan for DBUs, with its discount, is available on the pricing page.
 2. To view the overall consumption for pre-purchase, you can find it in the Azure Portal by going to the Reservations page. If you have multiple Reservations, you can find all of them under Reservations in the Azure Portal. This will allow one to track the to-date usage of different reservations separately. Please see the Reservations page on how to access this information from various tools including the REST API, PowerShell, and the CLI.
 3. To get detailed utilization and reports (as with Pay-as-you-go), the same Cost Management section above applies, with a few changes:
+
 a. Use the field Amortized Cost instead of Actual Cost in the Azure Portal
+
 b. For EA and Modern customers, the Meter Name would reflect the exact DBU workload and tier in the cost reports.
+
 The report would also show the exact tier of the reservation – 1 year or 3 years. One would still need to download the same Usage Details Version 2 report as mentioned here or use the Power BI Cost Management connector. For Web and Direct customers, the product and meter name would show as Azure Databricks Reservations-DBU and DBU respectively. 
To identify the workload SKU, you can find the MeterID under “additionalinfo” as the consumption meter.
 4. For Web and Direct customers, one can calculate the normalized consumption for DBCUs using the below steps:
+
 a. Refer to this table to get the Cost Management Ratio

Key Things to Note: