docs: update OpenAI notebook for acrolinx (#1999)

This commit is contained in:
JessicaXYWang 2023-06-28 13:12:26 -07:00 коммит произвёл GitHub
Родитель 9d51784af6
Коммит 7cb018a6e6
Не найден ключ, соответствующий данной подписи
Идентификатор ключа GPG: 4AEE18F83AFDEB23
1 изменённых файлов: 58 добавлений и 29 удалений

Просмотреть файл

@ -1,6 +1,7 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -17,34 +18,49 @@
"source": [
"# Azure OpenAI for Big Data\n",
"\n",
"The Azure OpenAI service can be used to solve a large number of natural language tasks through prompting the completion API. To make it easier to scale your prompting workflows from a few examples to large datasets of examples we have integrated the Azure OpenAI service with the distributed machine learning library [SynapseML](https://www.microsoft.com/en-us/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/). This integration makes it easy to use the [Apache Spark](https://spark.apache.org/) distributed computing framework to process millions of prompts with the OpenAI service. This tutorial shows how to apply large language models at a distributed scale using Azure Open AI and Azure Synapse Analytics. \n",
"\n",
"The Azure OpenAI service can be used to solve a large number of natural language tasks through prompting the completion API. To make it easier to scale your prompting workflows from a few examples to large datasets of examples, we have integrated the Azure OpenAI service with the distributed machine learning library [SynapseML](https://www.microsoft.com/en-us/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/). This integration makes it easy to use the [Apache Spark](https://spark.apache.org/) distributed computing framework to process millions of prompts with the OpenAI service. This tutorial shows how to apply large language models at a distributed scale using Azure Open AI and Azure Synapse Analytics. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"tags": [
"hide-synapse-internal"
]
},
"source": [
"## Step 1: Prerequisites\n",
"\n",
"The key prerequisites for this quickstart include a working Azure OpenAI resource, and an Apache Spark cluster with SynapseML installed. We suggest creating a Synapse workspace, but an Azure Databricks, HDInsight, or Spark on Kubernetes, or even a python environment with the `pyspark` package will work. \n",
"\n",
"1. An Azure OpenAI resource – request access [here](https://customervoice.microsoft.com/Pages/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR7en2Ais5pxKtso_Pz4b1_xUOFA5Qk1UWDRBMjg0WFhPMkIzTzhKQ1dWNyQlQCN0PWcu) before [creating a resource](https://docs.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource)\n",
"1. [Create a Synapse workspace](https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started-create-workspace)\n",
"1. [Create a serverless Apache Spark pool](https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started-analyze-spark#create-a-serverless-apache-spark-pool)\n",
"\n",
"\n",
"1. [Create a serverless Apache Spark pool](https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started-analyze-spark#create-a-serverless-apache-spark-pool)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Import this guide as a notebook\n",
"\n",
"The next step is to add this code into your Spark cluster. You can either create a notebook in your Spark platform and copy the code into this notebook to run the demo. Or download the notebook and import it into Synapse Analytics\n",
"\n",
"1.\t[Download this demo as a notebook](https://github.com/microsoft/SynapseML/blob/master/notebooks/features/cognitive_services/CognitiveServices%20-%20OpenAI.ipynb) (click Raw, then save the file)\n",
"1.\tImport the notebook [into the Synapse Workspace](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks#create-a-notebook) or if using Databricks [into the Databricks Workspace](https://docs.microsoft.com/en-us/azure/databricks/notebooks/notebooks-manage#create-a-notebook)\n",
"1. Install SynapseML on your cluster. Please see the installation instructions for Synapse at the bottom of [the SynapseML website](https://microsoft.github.io/SynapseML/). Note that this requires pasting an additional cell at the top of the notebook you just imported\n",
"3.\tConnect your notebook to a cluster and follow along, editing and rnnung the cells below.\n",
"1.\tImport the notebook [into the Synapse Workspace](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks#create-a-notebook) or if using Databricks [import into the Databricks Workspace](https://docs.microsoft.com/en-us/azure/databricks/notebooks/notebooks-manage#create-a-notebook). If using Fabric [import into the Fabric Workspace](https://learn.microsoft.com/en-us/fabric/data-engineering/how-to-use-notebook)\n",
"1. Install SynapseML on your cluster. Please see the installation instructions for Synapse at the bottom of [the SynapseML website](https://microsoft.github.io/SynapseML/). If using Fabric, please check [Installation Guide](https://learn.microsoft.com/en-us/fabric/data-science/install-synapseml). This requires pasting an extra cell at the top of the notebook you imported. \n",
"1.\tConnect your notebook to a cluster and follow along, editing and running the cells.\n",
"\n",
"## Step 3: Fill in your service information\n",
"\n",
"Next, please edit the cell in the notebook to point to your service. In particular set the `service_name`, `deployment_name`, `location`, and `key` variables to match those for your OpenAI service:"
"Next, edit the cell in the notebook to point to your service. In particular set the `service_name`, `deployment_name`, `location`, and `key` variables to match those for your OpenAI service:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -83,6 +99,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -106,7 +123,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -140,6 +157,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -156,12 +174,12 @@
"source": [
"## Step 5: Create the OpenAICompletion Apache Spark Client\n",
"\n",
"To apply the OpenAI Completion service to your dataframe you just created, create an OpenAICompletion object which serves as a distributed client. Parameters of the service can be set either with a single value, or by a column of the dataframe with the appropriate setters on the `OpenAICompletion` object. Here we are setting `maxTokens` to 200. A token is around 4 characters, and this limit applies to the sum of the prompt and the result. We are also setting the `promptCol` parameter with the name of the prompt column in the dataframe."
"To apply the OpenAI Completion service to your dataframe you created, create an OpenAICompletion object, which serves as a distributed client. Parameters of the service can be set either with a single value, or by a column of the dataframe with the appropriate setters on the `OpenAICompletion` object. Here we're setting `maxTokens` to 200. A token is around four characters, and this limit applies to the sum of the prompt and the result. We're also setting the `promptCol` parameter with the name of the prompt column in the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -200,6 +218,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -214,14 +233,14 @@
}
},
"source": [
"## Step 5: Transform the dataframe with the OpenAICompletion Client\n",
"## Step 6: Transform the dataframe with the OpenAICompletion Client\n",
"\n",
"Now that you have the dataframe and the completion client, you can transform your input dataset and add a column called `completions` with all of the information the service adds. We will select out just the text for simplicity."
"Now that you have the dataframe and the completion client, you can transform your input dataset and add a column called `completions` with all of the information the service adds. We'll select out just the text for simplicity."
]
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -249,6 +268,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -273,6 +293,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -291,6 +312,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -307,12 +329,12 @@
"source": [
"### Generating Text Embeddings\n",
"\n",
"In addition to completing text, we can also embed text for use in downstream algorithms or vector retrieval architectures. Creating embeddings allows you to search and retrieve documents from large collections and can be used when prompt engineering alo is not sufficient for the task. For more information on using `OpenAIEmbedding` see our [embedding guide](https://microsoft.github.io/SynapseML/docs/features/cognitive_services/CognitiveServices%20-%20OpenAI%20Embedding/)."
"In addition to completing text, we can also embed text for use in downstream algorithms or vector retrieval architectures. Creating embeddings allows you to search and retrieve documents from large collections and can be used when prompt engineering isn't sufficient for the task. For more information on using `OpenAIEmbedding`, see our [embedding guide](https://microsoft.github.io/SynapseML/docs/features/cognitive_services/CognitiveServices%20-%20OpenAI%20Embedding/)."
]
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -343,6 +365,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -359,12 +382,12 @@
"source": [
"### Chat Completion\n",
"\n",
"Models such as ChatGPT and GPT-4 are capable of understanding chats instead of just single prompts. The `OpenAIChatCompletion` transformer exposes this functionality at scale."
"Models such as ChatGPT and GPT-4 are capable of understanding chats instead of single prompts. The `OpenAIChatCompletion` transformer exposes this functionality at scale."
]
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -426,6 +449,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -442,15 +466,15 @@
"source": [
"### Improve throughput with request batching \n",
"\n",
"The example above makes several requests to the service, one for each prompt. To complete multiple prompts in a single request, use batch mode. First, in the OpenAICompletion object, instead of setting the Prompt column to \"Prompt\", specify \"batchPrompt\" for the BatchPrompt column.\n",
"The example makes several requests to the service, one for each prompt. To complete multiple prompts in a single request, use batch mode. First, in the OpenAICompletion object, instead of setting the Prompt column to \"Prompt\", specify \"batchPrompt\" for the BatchPrompt column.\n",
"To do so, create a dataframe with a list of prompts per row.\n",
"\n",
"**Note** that as of this writing there is currently a limit of 20 prompts in a single request, as well as a hard limit of 2048 \"tokens\", or approximately 1500 words."
"**Note** that as of this writing there is currently a limit of 20 prompts in a single request, and a hard limit of 2048 \"tokens\", or approximately 1500 words."
]
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -483,6 +507,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -502,7 +527,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -530,6 +555,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -544,12 +570,12 @@
}
},
"source": [
"In the call to transform a request will then be made per row. Since there are multiple prompts in a single row, each request will be sent with all prompts in that row. The results will contain a row for each row in the request."
"In the call to transform a request will then be made per row. Since there are multiple prompts in a single row, each is sent with all prompts in that row. The results contain a row for each row in the request."
]
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -569,6 +595,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -590,7 +617,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -631,6 +658,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -652,7 +680,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
@ -680,6 +708,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
@ -701,7 +730,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {