Distillation qanda (#3335)
* Q and A Distillation Example * Create distillation_qanda - Copy.ipynb Distillation Q and A changes * Changes for distillation Q&A * Changes for black validation and utils for download dataset
This commit is contained in:
Родитель
ce99aa8908
Коммит
a46bb8ef9b
|
@ -486,6 +486,7 @@
|
|||
" learning_rate=0.00002,\n",
|
||||
" per_device_train_batch_size=1,\n",
|
||||
" num_train_epochs=3,\n",
|
||||
" data_generation_task_type=\"NLI\",\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" return {\"output_model\": oss_distillation.outputs.output_model}"
|
||||
|
|
|
@ -0,0 +1,610 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Distillation Q&A with Large Language Models\n",
|
||||
" \n",
|
||||
"### Notebook details\n",
|
||||
" \n",
|
||||
"This sample demonstrates how to train the selected student model using the teacher model, resulting in the creation of the distilled model.\n",
|
||||
" \n",
|
||||
"We will use the Meta Llama 3.1 405B Instruct as the teacher model and the Meta Llama 3.1 8B Instruct as the student model.\n",
|
||||
" \n",
|
||||
"**Note :**\n",
|
||||
" \n",
|
||||
"- Distillation offering is only available in **West US 3** regions.\n",
|
||||
"- The Meta Llama 3.1 405B Instruct model can only be used as a teacher model.\n",
|
||||
"- The Meta Llama 3.1 8B Instruct can only be used as a student (target) model.\n",
|
||||
"- Distllation is supported for Question and Answering (QnA) task, which is a standard task in benchmarking for Natural Language Understanding.\n",
|
||||
"\n",
|
||||
"**Prerequisites :**\n",
|
||||
"- Subscribe to the Meta Llama 3.1 405B Instruct and Meta Llama 3.1 8B Instruct, see [how to subscribe your project to the model offering in MS Learn](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-serverless?tabs=azure-ai-studio#subscribe-your-project-to-the-model-offering)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Install the SDK v2"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%pip install azure-ai-ml\n",
|
||||
"%pip install azure-identity\n",
|
||||
"\n",
|
||||
"%pip install mlflow\n",
|
||||
"%pip install azureml-mlflow\n",
|
||||
"%pip install -U datasets"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Import the required libraries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# import required libraries\n",
|
||||
"\n",
|
||||
"import base64\n",
|
||||
"import json\n",
|
||||
"import os\n",
|
||||
"from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential\n",
|
||||
"from azure.ai.ml import MLClient, Input\n",
|
||||
"from azure.ai.ml.constants import AssetTypes\n",
|
||||
"from azure.ai.ml.dsl import pipeline\n",
|
||||
"from azure.ai.ml.entities import Data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Prerequisites\n",
|
||||
"\n",
|
||||
"An AI Studio project in **West US 3** is required. Please follow [this](https://learn.microsoft.com/azure/ai-studio/how-to/fine-tune-model-llama?tabs=llama-two%2Cchatcompletion#prerequisites) document to setup your AI Studio project"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## AI Studio project settings\n",
|
||||
"\n",
|
||||
"Update following cell with the information of the AI Studio project just created."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"SUBSCRIPTION_ID = \"<SUBSCRIPTION>\"\n",
|
||||
"RESOURCE_GROUP = \"<RESOURCE_GROUP>\"\n",
|
||||
"AI_PROJECT_NAME = \"<AI_PROJECT_NAME>\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Configure credential\n",
|
||||
"\n",
|
||||
"We are using `DefaultAzureCredential` to get access to workspace. \n",
|
||||
"`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"try:\n",
|
||||
" credential = DefaultAzureCredential()\n",
|
||||
" # Check if given credential can get token successfully.\n",
|
||||
" credential.get_token(\"https://management.azure.com/.default\")\n",
|
||||
"except Exception as ex:\n",
|
||||
" # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential not work\n",
|
||||
" credential = InteractiveBrowserCredential()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Get handle to AI Studio project"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ml_client = MLClient(credential, SUBSCRIPTION_ID, RESOURCE_GROUP, AI_PROJECT_NAME)\n",
|
||||
"ai_project = ml_client._workspaces.get(ml_client.workspace_name)\n",
|
||||
"ai_project._workspace_id"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Pick a teacher model\n",
|
||||
"\n",
|
||||
"We support **Meta-Llama-3.1-405B-Instruct** as the teacher model. \n",
|
||||
"### First deploy the teacher model in Azure AI Studio\n",
|
||||
"* Deploy Meta Llama 3.1 405B Instruct as a serverless API - [link](https://aka.ms/meta-llama-3.1-405B-instruct-azure-ai-studio-docs#create-a-new-deployment)\n",
|
||||
"* Once deployed successfully, you should be assigned for an API endpoint and a security key for inference.\n",
|
||||
"\n",
|
||||
"Update the following cell with the information of the deployment you just created."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Llama-3-405B Teacher model endpoint name\n",
|
||||
"# The serverless model name is the name found in ML Studio > Endpoints > Serverless endpoints > Model column\n",
|
||||
"TEACHER_MODEL_NAME = \"Meta-Llama-3.1-405B-Instruct\"\n",
|
||||
"\n",
|
||||
"# The serverless model endpoint name is the name found in ML Studio > Endpoints > Serverless endpoints > Name column\n",
|
||||
"# The endpoint URL will be resolved from this name by the MLFlow component\n",
|
||||
"TEACHER_MODEL_ENDPOINT_NAME = \"Meta-Llama-3-1-405B-Instruct-vum\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Pick a student model\n",
|
||||
"\n",
|
||||
"We will use **Meta-Llama-3.1-8B-Instruct** as student model. We only support chat completion models that are available for PayGo finetuning in Azure AI Studio."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"STUDENT_MODEL_NAME = \"Meta-Llama-3.1-8B-Instruct\"\n",
|
||||
"STUDENT_MODEL_VERSION = 1\n",
|
||||
"\n",
|
||||
"# retrieve student model from model registry\n",
|
||||
"mlclient_azureml_meta = MLClient(credential, registry_name=\"azureml-meta\")\n",
|
||||
"student_model = mlclient_azureml_meta.models.get(\n",
|
||||
" STUDENT_MODEL_NAME, version=STUDENT_MODEL_VERSION\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print(\n",
|
||||
" \"\\n\\nUsing model name: {0}, version: {1}, id: {2} for fine tuning\".format(\n",
|
||||
" student_model.name, student_model.version, student_model.id\n",
|
||||
" )\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Download the dataset from HuggingFace repo\n",
|
||||
"\n",
|
||||
"For our example, we download and use the Common Sense Q&A dataset (https://huggingface.co/datasets/tau/commonsense_qa/) from HuggingFace.\n",
|
||||
"\n",
|
||||
"The next few cells show basic data preparation for fine tuning:\n",
|
||||
"* Visualize some data rows\n",
|
||||
"* Preprocess the data and format it in required format. This is an important step for performing QandA as we add the required sequences/separators in the data. \n",
|
||||
"* While fintuning, text column is concatenated with ground_truth column to produce finetuning input. Hence, the data should be prepared such that `text + ground_truth` is your actual finetuning data.\n",
|
||||
"* bos and eos tokens are added to the data by finetuning pipeline, you do not need to add it explicitly \n",
|
||||
"\n",
|
||||
"##### Here is an example of how the data should look like\n",
|
||||
"\n",
|
||||
"text generation requires the training data to include at least 2 fields – one for ‘text’ and ‘ground_truth’ like in this example. The below examples are from Common Sense QandA dataset. \n",
|
||||
"\n",
|
||||
"Original dataset:\n",
|
||||
"\n",
|
||||
"{'id': '1afa02df02c908a558b4036e80242fac', 'question': 'A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?', 'question_concept': 'revolving door', 'choices': {'label': ['A', 'B', 'C', 'D', 'E'], 'text': ['bank', 'library', 'department store', 'mall', 'new york']}, 'answerKey': 'A'}\n",
|
||||
"\n",
|
||||
"{'id': 'a7ab086045575bb497933726e4e6ad28', 'question': 'What do people aim to do at work?', 'question_concept': 'people', 'choices': {'label': ['A', 'B', 'C', 'D', 'E'], 'text': ['complete job', 'learn from each other', 'kill animals', 'wear hats', 'talk to each other']}, 'answerKey': 'A'}\n",
|
||||
"\n",
|
||||
"Formatted dataset the user might pass:\n",
|
||||
"\n",
|
||||
"{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant. Your output should only be one of the five choices: 'A', 'B', 'C', 'D', or 'E'.\"}, {\"role\": \"user\", \"content\": \"Answer the following multiple-choice question by selecting the correct option.\\n\\nQuestion: A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?\\nAnswer Choices:\\n(A) bank\\n(B) library\\n(C) department store\\n(D) mall\\n(E) new york\"}]}\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant. Your output should only be one of the five choices: 'A', 'B', 'C', 'D', or 'E'.\"}, {\"role\": \"user\", \"content\": \"Answer the following multiple-choice question by selecting the correct option.\\n\\nQuestion: What do people aim to do at work?\\nAnswer Choices:\\n(A) complete job\\n(B) learn from each other\\n(C) kill animals\\n(D) wear hats\\n(E) talk to each other\"}]}\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# We can define train and test sample sizes here. Validation size is kept same as test sample size\n",
|
||||
"import download_dataset\n",
|
||||
"\n",
|
||||
"train_sample_size = 100\n",
|
||||
"val_sample_size = 100\n",
|
||||
"\n",
|
||||
"# Sample notebook using the dataset: https://huggingface.co/datasets/tau/commonsense_qa\n",
|
||||
"dataset_name = \"tau/commonsense_qa\"\n",
|
||||
"input_dataset = download_dataset.CQnAHuggingFaceInputDataset()\n",
|
||||
"\n",
|
||||
"# Note: train_split_name and test_split_name can vary by dataset. They are passed as arguments in load_hf_dataset.\n",
|
||||
"# If validation_split_name is None, the below function will split the train set to create the specified sized validation set.\n",
|
||||
"train, val, _ = input_dataset.load_hf_dataset(\n",
|
||||
" dataset_name=dataset_name,\n",
|
||||
" train_sample_size=train_sample_size,\n",
|
||||
" val_sample_size=val_sample_size,\n",
|
||||
" train_split_name=\"train\",\n",
|
||||
" val_split_name=\"validation\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print(\"Len of train data sample is \" + str(len(train)))\n",
|
||||
"print(\"Len of validation data sample is \" + str(len(val)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"try:\n",
|
||||
" path = \"data\"\n",
|
||||
" os.mkdir(path, 0o755)\n",
|
||||
"except FileExistsError as err:\n",
|
||||
" print(f\"Error: {err}\")\n",
|
||||
"except OSError as err:\n",
|
||||
" print(f\"Error: {err}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train_data_path = \"data/train_cqna_512.jsonl\"\n",
|
||||
"valid_data_path = \"data/valid_cqna_256.jsonl\"\n",
|
||||
"\n",
|
||||
"for row in train:\n",
|
||||
" data = {\"messages\": []}\n",
|
||||
" data[\"messages\"].append(\n",
|
||||
" {\n",
|
||||
" \"role\": \"system\",\n",
|
||||
" \"content\": \"You are a helpful assistant. Your output should only be one of the five choices: 'A', 'B', 'C', 'D', or 'E'.\",\n",
|
||||
" }\n",
|
||||
" )\n",
|
||||
" question, choices, ans_key = row[\"question\"], row[\"choices\"], row[\"answerKey\"]\n",
|
||||
" labels, choice_list = choices[\"label\"], choices[\"text\"]\n",
|
||||
" answer_choices = [\n",
|
||||
" \"({}) {}\".format(labels[i], choice_list[i]) for i in range(len(labels))\n",
|
||||
" ]\n",
|
||||
" answer_choices = \"\\n\".join(answer_choices)\n",
|
||||
" data[\"messages\"].append(\n",
|
||||
" {\n",
|
||||
" \"role\": \"user\",\n",
|
||||
" \"content\": \"Answer the following multiple-choice question by selecting the correct option.\\n\\nQuestion: \"\n",
|
||||
" + row[\"question\"]\n",
|
||||
" + \"\\nAnswer Choices:\\n\"\n",
|
||||
" + answer_choices,\n",
|
||||
" }\n",
|
||||
" )\n",
|
||||
" with open(train_data_path, \"a\") as f:\n",
|
||||
" f.write(json.dumps(data) + \"\\n\")\n",
|
||||
"\n",
|
||||
"for row in val:\n",
|
||||
" data = {\"messages\": []}\n",
|
||||
" data[\"messages\"].append(\n",
|
||||
" {\n",
|
||||
" \"role\": \"system\",\n",
|
||||
" \"content\": \"You are a helpful assistant. Your output should only be one of the five choices: 'A', 'B', 'C', 'D', or 'E'.\",\n",
|
||||
" }\n",
|
||||
" )\n",
|
||||
" question, choices, ans_key = row[\"question\"], row[\"choices\"], row[\"answerKey\"]\n",
|
||||
" labels, choice_list = choices[\"label\"], choices[\"text\"]\n",
|
||||
" answer_choices = [\n",
|
||||
" \"({}) {}\".format(labels[i], choice_list[i]) for i in range(len(labels))\n",
|
||||
" ]\n",
|
||||
" answer_choices = \"\\n\".join(answer_choices)\n",
|
||||
" data[\"messages\"].append(\n",
|
||||
" {\n",
|
||||
" \"role\": \"user\",\n",
|
||||
" \"content\": \"Answer the following multiple-choice question by selecting the correct option.\\n\\nQuestion: \"\n",
|
||||
" + row[\"question\"]\n",
|
||||
" + \"\\nAnswer Choices:\\n\"\n",
|
||||
" + answer_choices,\n",
|
||||
" }\n",
|
||||
" )\n",
|
||||
" with open(valid_data_path, \"a\") as f:\n",
|
||||
" f.write(json.dumps(data) + \"\\n\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"val_df = pd.read_json(\"./data/valid_cqna_256.jsonl\", lines=True)\n",
|
||||
"val_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Prepare data inputs\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train_data = None\n",
|
||||
"train_data_name = \"cqna_train_70\"\n",
|
||||
"\n",
|
||||
"train_data = ml_client.data.create_or_update(\n",
|
||||
" Data(\n",
|
||||
" path=train_data_path,\n",
|
||||
" type=AssetTypes.URI_FILE,\n",
|
||||
" description=\"Training dataset\",\n",
|
||||
" name=train_data_name,\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"train_data_asset_id = f\"azureml://locations/{ai_project.location}/workspaces/{ai_project._workspace_id}/data/{train_data.name}/versions/{train_data.version}\"\n",
|
||||
"train_data_asset_id"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"valid_data = None\n",
|
||||
"valid_data_name = \"cqna_valid_70\"\n",
|
||||
"\n",
|
||||
"valid_data = ml_client.data.create_or_update(\n",
|
||||
" Data(\n",
|
||||
" path=valid_data_path,\n",
|
||||
" type=AssetTypes.URI_FILE,\n",
|
||||
" description=\"validation dataset\",\n",
|
||||
" name=valid_data_name,\n",
|
||||
" )\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"valid_data_asset_id = f\"azureml://locations/{ai_project.location}/workspaces/{ai_project._workspace_id}/data/{valid_data.name}/versions/{valid_data.version}\"\n",
|
||||
"valid_data_asset_id"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Distillation strategy settings\n",
|
||||
"\n",
|
||||
"We provide the option to leverage Chain of Thought (CoT) reasoning for distillation. CoT leverages step by step reasoning ability of the teacher model to generate more accurate labels."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ENABLE_CHAIN_OF_THOUGHT = \"true\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Configure distillation\n",
|
||||
"Define finetune parameters\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Training parameters define the training aspects such as - \n",
|
||||
"1. the learning rate\n",
|
||||
"2. the number of epochs to finetune\n",
|
||||
"3. the data_generation_task_type , in this case its 'NLU_QA'\n",
|
||||
"and so on"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"mlclient_azureml = MLClient(credential, registry_name=\"azureml\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"distillation_pipeline_name = \"oss_distillation_pipeline\"\n",
|
||||
"distillation_pipeline_component = mlclient_azureml.components.get(\n",
|
||||
" name=distillation_pipeline_name, label=\"latest\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"@pipeline\n",
|
||||
"def distillation_pipeline(\n",
|
||||
" teacher_model_endpoint_name: str,\n",
|
||||
" enable_chain_of_thought: str,\n",
|
||||
" system_properties: str,\n",
|
||||
" input_finetune_model: Input,\n",
|
||||
" train_file_path: Input,\n",
|
||||
" validation_file_path: Input = None,\n",
|
||||
"):\n",
|
||||
" oss_distillation = distillation_pipeline_component(\n",
|
||||
" teacher_model_endpoint_name=teacher_model_endpoint_name,\n",
|
||||
" enable_chain_of_thought=enable_chain_of_thought,\n",
|
||||
" train_file_path=train_file_path,\n",
|
||||
" validation_file_path=validation_file_path,\n",
|
||||
" # Finetune\n",
|
||||
" mlflow_model_path=input_finetune_model,\n",
|
||||
" model_asset_id=student_model.id,\n",
|
||||
" system_properties=system_properties,\n",
|
||||
" ## hyperparams\n",
|
||||
" learning_rate=0.00002,\n",
|
||||
" per_device_train_batch_size=1,\n",
|
||||
" num_train_epochs=3,\n",
|
||||
" data_generation_task_type=\"NLU_QA\",\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" return {\"output_model\": oss_distillation.outputs.output_model}"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"system_properties = {\n",
|
||||
" \"finetune_oss\": \"True\",\n",
|
||||
" \"model_asset_id\": student_model.id,\n",
|
||||
" \"PipelineType\": \"Finetune\",\n",
|
||||
" \"azureml.PipelineType\": \"Finetune\",\n",
|
||||
" \"azureml.ModelName\": student_model.name,\n",
|
||||
" \"azureml.original_model_id\": student_model.id,\n",
|
||||
" \"azureml.trainingData.assetId\": train_data_asset_id,\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"json_str = json.dumps(system_properties).replace(\" \", \"\")\n",
|
||||
"\n",
|
||||
"system_properties_b64_encoded = base64.b64encode(json_str.encode(\"utf-8\")).decode(\n",
|
||||
" \"utf-8\"\n",
|
||||
")\n",
|
||||
"print(f\"System properties => {system_properties_b64_encoded}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train_file_path_input = Input(type=\"uri_file\", path=train_data.path)\n",
|
||||
"validation_file_path_input = Input(type=\"uri_file\", path=valid_data.path)\n",
|
||||
"input_finetune_model = Input(type=\"mlflow_model\", path=student_model.id)\n",
|
||||
"experiment_name = f\"distillation-{TEACHER_MODEL_NAME}\".replace(\".\", \"-\")\n",
|
||||
"\n",
|
||||
"finetuning_job = distillation_pipeline(\n",
|
||||
" teacher_model_endpoint_name=TEACHER_MODEL_ENDPOINT_NAME,\n",
|
||||
" enable_chain_of_thought=ENABLE_CHAIN_OF_THOUGHT,\n",
|
||||
" system_properties=system_properties_b64_encoded,\n",
|
||||
" input_finetune_model=input_finetune_model,\n",
|
||||
" train_file_path=train_file_path_input,\n",
|
||||
" validation_file_path=validation_file_path_input,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"finetuning_job.properties.update(system_properties)\n",
|
||||
"print(f\"job property: {finetuning_job.properties}\")\n",
|
||||
"\n",
|
||||
"# pipeline_job.identity = UserIdentityConfiguration()\n",
|
||||
"finetuning_job.display_name = f\"finetune-{student_model.name}\"\n",
|
||||
"finetuning_job.experiment_name = experiment_name\n",
|
||||
"finetuning_job.settings.default_compute_type = \"serverless\"\n",
|
||||
"finetuning_job.continue_on_step_failure = False\n",
|
||||
"# pipeline_job.settings.force_rerun = True"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Submit pipeline job"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Submit pipeline job to workspace\n",
|
||||
"ft_job = ml_client.jobs.create_or_update(finetuning_job)\n",
|
||||
"print(f\"Submitted job, progress available at {ft_job.studio_url}\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# wait for the pipeline job to complete\n",
|
||||
"ml_client.jobs.stream(ft_job.name)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Consuming the distilled model\n",
|
||||
"\n",
|
||||
"Once the above job completes, you should be able to deploy the model and use it for inferencing. To deploy this model, do the following:\n",
|
||||
"\n",
|
||||
"* Go to AI Studio\n",
|
||||
"* Navigate to the Fine-tuning tab on the left menu\n",
|
||||
"* In the list of models you see, click on the model which got created from the distillation\n",
|
||||
"* This should take you to the details page where you can see the model attributes and other details\n",
|
||||
"* Click on the Deploy button on top of the page\n",
|
||||
"* Follow the steps to deploy the model"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.5"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
|
@ -0,0 +1,49 @@
|
|||
from datasets import load_dataset
|
||||
from abc import ABC
|
||||
|
||||
|
||||
class InputDataset(ABC):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
(
|
||||
self.train_data_file_name,
|
||||
self.test_data_file_name,
|
||||
self.eval_data_file_name,
|
||||
) = (None, None, None)
|
||||
|
||||
|
||||
class CQnAHuggingFaceInputDataset(InputDataset):
|
||||
"""
|
||||
Loads the HuggingFace dataset
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
|
||||
def load_hf_dataset(
|
||||
self,
|
||||
dataset_name,
|
||||
train_sample_size=10,
|
||||
val_sample_size=10,
|
||||
test_sample_size=10,
|
||||
train_split_name="train",
|
||||
val_split_name="validation",
|
||||
test_split_name="test",
|
||||
):
|
||||
full_dataset = load_dataset(dataset_name)
|
||||
|
||||
if val_split_name is not None:
|
||||
train_data = full_dataset[train_split_name].select(range(train_sample_size))
|
||||
val_data = full_dataset[val_split_name].select(range(val_sample_size))
|
||||
test_data = full_dataset[test_split_name].select(range(test_sample_size))
|
||||
else:
|
||||
train_val_data = full_dataset[train_split_name].select(
|
||||
range(train_sample_size + val_sample_size)
|
||||
)
|
||||
train_data = train_val_data.select(range(train_sample_size))
|
||||
val_data = train_val_data.select(
|
||||
range(train_sample_size, train_sample_size + val_sample_size)
|
||||
)
|
||||
test_data = full_dataset[test_split_name].select(range(test_sample_size))
|
||||
|
||||
return train_data, val_data, test_data
|
Загрузка…
Ссылка в новой задаче