{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"colab_type": "text",
|
|
"id": "jBasof3bv1LB"
|
|
},
|
|
"source": [
"<h1><center>How to export 🤗 Transformers Models to ONNX?</center></h1>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"[ONNX](http://onnx.ai/) is an open format for machine learning models. It lets you save your neural network's computation graph in a framework-agnostic way, which can be particularly helpful when deploying deep learning models.\n",
|
|
"\n",
"Indeed, businesses might have other requirements _(languages, hardware, ...)_ for which the training framework is not the best fit at inference time. In that context, a representation of the actual computation graph that can be shared across the business units of an organization is a desirable component.\n",
|
|
"\n",
"Along with the serialization format, ONNX also provides a runtime library that allows efficient, hardware-specific execution of the ONNX graph. This is done through the [onnxruntime](https://microsoft.github.io/onnxruntime/) project, which already includes collaborations with many hardware vendors to seamlessly deploy models on various platforms.\n",
|
|
"\n",
"This notebook walks you through the process of converting a PyTorch or TensorFlow transformers model to the [ONNX](http://onnx.ai/) format and leveraging [onnxruntime](https://microsoft.github.io/onnxruntime/) to run inference tasks on models from 🤗 __transformers__."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"colab_type": "text",
|
|
"id": "yNnbrSg-5e1s"
|
|
},
|
|
"source": [
|
|
"## Exporting 🤗 transformers model to ONNX\n",
|
|
"\n",
|
|
"---\n",
|
|
"\n",
"Exporting models _(either PyTorch or TensorFlow)_ is easily achieved through the conversion tool provided as part of the 🤗 __transformers__ repository.\n",
|
|
"\n",
"Under the hood, the process is roughly the following:\n",
|
|
"\n",
|
|
"1. Allocate the model from transformers (**PyTorch or TensorFlow**)\n",
"2. Forward dummy inputs through the model so that **ONNX** can record the set of operations executed\n",
|
|
"3. Optionally define dynamic axes on input and output tensors\n",
|
|
"4. Save the graph along with the network parameters"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"scrolled": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pip install --upgrade git+https://github.com/huggingface/transformers"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"colab": {},
|
|
"colab_type": "code",
|
|
"id": "PwAaOchY4N2-"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!rm -rf onnx/\n",
|
|
"from transformers.convert_graph_to_onnx import convert\n",
|
|
"\n",
|
|
"# Handles all the above steps for you\n",
|
|
"convert(framework=\"pt\", model=\"bert-base-cased\", output=\"onnx/bert-base-cased.onnx\", opset=11)\n",
|
|
"\n",
"# TensorFlow\n",
|
|
"# convert(framework=\"tf\", model=\"bert-base-cased\", output=\"onnx/bert-base-cased.onnx\", opset=11)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## How to leverage runtime for inference over an ONNX graph\n",
|
|
"\n",
|
|
"---\n",
|
|
"\n",
"As mentioned in the introduction, **ONNX** is a serialization format, and many side projects can load the saved graph and run the actual computations from it. Here, we'll focus on the official [onnxruntime](https://microsoft.github.io/onnxruntime/). The runtime is implemented in C++ for performance reasons and provides APIs/bindings for C++, C, C#, Java and Python.\n",
|
|
"\n",
"In this notebook, we will use the Python API to show how to load a serialized **ONNX** graph and run inference workloads on various backends through **onnxruntime**.\n",
|
|
"\n",
"**onnxruntime** is available on PyPI:\n",
|
|
"\n",
|
|
"- onnxruntime: ONNX + MLAS (Microsoft Linear Algebra Subprograms)\n",
|
|
"- onnxruntime-gpu: ONNX + MLAS + CUDA\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pip install transformers onnxruntime-gpu onnx psutil matplotlib"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"colab_type": "text",
|
|
"id": "-gP08tHfBvgY"
|
|
},
|
|
"source": [
|
|
"## Preparing for an Inference Session\n",
|
|
"\n",
|
|
"---\n",
|
|
"\n",
"Inference is done using a specific backend definition, which turns on hardware-specific optimizations of the graph.\n",
|
|
"\n",
|
|
"Optimizations are basically of three kinds: \n",
|
|
"\n",
|
|
"- **Constant Folding**: Convert static variables to constants in the graph \n",
"- **Dead Code Elimination**: Remove nodes never accessed in the graph\n",
"- **Operator Fusing**: Merge multiple instructions into one (Linear -> ReLU can be fused to be LinearReLU)\n",
|
|
"\n",
"ONNX Runtime automatically applies most optimizations when specific `SessionOptions` are set.\n",
|
|
"\n",
"Note: Some of the latest optimizations that are not yet integrated into ONNX Runtime are available in the [optimization script](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) that tunes models for the best performance."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"outputs": [],
|
|
"source": [
"# # An optional step unless\n",
"# # you want to get a model with mixed precision for perf acceleration on newer GPUs\n",
"# # or you are working with TensorFlow (tf.keras) models or PyTorch models other than BERT\n",
"\n",
"# !pip install onnxruntime-tools\n",
"# from onnxruntime_tools import optimizer\n",
"\n",
"# # Mixed precision conversion for bert-base-cased model converted from PyTorch\n",
"# optimized_model = optimizer.optimize_model(\"onnx/bert-base-cased.onnx\", model_type='bert', num_heads=12, hidden_size=768)\n",
"# optimized_model.convert_model_float32_to_float16()\n",
"# optimized_model.save_model_to_file(\"onnx/bert-base-cased.onnx\")\n",
"\n",
"# # Optimizations for bert-base-cased model converted from TensorFlow (tf.keras)\n",
"# optimized_model = optimizer.optimize_model(\"onnx/bert-base-cased.onnx\", model_type='bert_keras', num_heads=12, hidden_size=768)\n",
"# optimized_model.save_model_to_file(\"onnx/bert-base-cased.onnx\")\n"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from os import environ\n",
|
|
"from psutil import cpu_count\n",
|
|
"\n",
|
|
"# Constants from the performance optimization available in onnxruntime\n",
|
|
"# It needs to be done before importing onnxruntime\n",
|
|
"environ[\"OMP_NUM_THREADS\"] = str(cpu_count(logical=True))\n",
|
|
"environ[\"OMP_WAIT_POLICY\"] = 'ACTIVE'\n",
|
|
"\n",
|
|
"from onnxruntime import InferenceSession, SessionOptions, get_all_providers"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {
|
|
"colab": {},
|
|
"colab_type": "code",
|
|
"id": "2k-jHLfdcTFS"
|
|
},
|
|
"outputs": [],
|
|
"source": [
"def create_model_for_provider(model_path: str, provider: str) -> InferenceSession:\n",
"\n",
"    assert provider in get_all_providers(), f\"provider {provider} not found, {get_all_providers()}\"\n",
"\n",
"    # A few properties that might have an impact on performance (provided by MS)\n",
"    options = SessionOptions()\n",
"    options.intra_op_num_threads = 1\n",
"\n",
"    # Load the model as a graph and prepare the backend for the requested provider\n",
"    return InferenceSession(model_path, options, providers=[provider])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"colab_type": "text",
|
|
"id": "teJdG3amE-hR"
|
|
},
|
|
"source": [
|
|
"## Forwarding through our optimized ONNX model running on CPU\n",
|
|
"\n",
|
|
"---\n",
|
|
"\n",
"When the model is loaded for inference over a specific provider, for instance **CPUExecutionProvider** as below, an optimized graph can be saved. This graph might include various optimizations, and you might be able to see some **higher-level** operations in the graph _(through [Netron](https://github.com/lutzroeder/Netron) for instance)_ such as:\n",
|
|
"- **EmbedLayerNormalization**\n",
|
|
"- **Attention**\n",
"- **FastGelu**\n",
|
|
"\n",
"These operations are an example of the kind of optimization **onnxruntime** performs, here gathering multiple operations into a bigger one _(Operator Fusing)_."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 34
|
|
},
|
|
"colab_type": "code",
|
|
"id": "dmC22kJfVGYe",
|
|
"outputId": "f3aba5dc-15c0-4f82-b38c-1bbae1bf112e"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Sequence output: (1, 6, 768), Pooled output: (1, 768)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from transformers import BertTokenizerFast\n",
|
|
"\n",
|
|
"tokenizer = BertTokenizerFast.from_pretrained(\"bert-base-cased\")\n",
|
|
"cpu_model = create_model_for_provider(\"onnx/bert-base-cased.onnx\", \"CPUExecutionProvider\")\n",
|
|
"\n",
"# Inputs are provided as NumPy arrays\n",
|
|
"model_inputs = tokenizer(\"My name is Bert\", return_tensors=\"pt\")\n",
|
|
"inputs_onnx = {k: v.cpu().detach().numpy() for k, v in model_inputs.items()}\n",
|
|
"\n",
|
|
"# Run the model (None = get all the outputs)\n",
|
|
"sequence, pooled = cpu_model.run(None, inputs_onnx)\n",
|
|
"\n",
|
|
"# Print information about outputs\n",
|
|
"\n",
|
|
"print(f\"Sequence output: {sequence.shape}, Pooled output: {pooled.shape}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"colab_type": "text",
|
|
"id": "Kda1e7TkEqNR"
|
|
},
|
|
"source": [
|
|
"## Benchmarking different CPU & GPU providers\n",
|
|
"\n",
"_**Disclaimer: results may vary depending on the actual hardware used to run the model**_"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 170
|
|
},
|
|
"colab_type": "code",
|
|
"id": "WcdFZCvImVig",
|
|
"outputId": "bfd779a1-0bc7-42db-8587-e52a485ec5e3"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Doing GPU inference on TITAN RTX\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Warming up: 100%|██████████| 10/10 [00:00<00:00, 333.82it/s]\n",
|
|
"Tracking inference time on CUDAExecutionProvider: 100%|██████████| 100/100 [00:00<00:00, 521.76it/s]\n",
|
|
"Warming up: 100%|██████████| 10/10 [00:00<00:00, 62.95it/s]\n",
|
|
"Tracking inference time on CPUExecutionProvider: 100%|██████████| 100/100 [00:01<00:00, 68.65it/s]\n",
|
|
"Warming up: 100%|██████████| 10/10 [00:00<00:00, 69.72it/s]\n",
|
|
"Tracking inference time on TensorrtExecutionProvider: 100%|██████████| 100/100 [00:01<00:00, 71.31it/s]\n",
|
|
"Warming up: 100%|██████████| 10/10 [00:00<00:00, 66.28it/s]\n",
|
|
"Tracking inference time on DnnlExecutionProvider: 100%|██████████| 100/100 [00:01<00:00, 72.03it/s]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from torch.cuda import get_device_name\n",
|
|
"from contextlib import contextmanager\n",
|
|
"from dataclasses import dataclass\n",
|
|
"from time import time\n",
|
|
"from tqdm import trange\n",
|
|
"\n",
|
|
"print(f\"Doing GPU inference on {get_device_name(0)}\", flush=True)\n",
|
|
"\n",
"@contextmanager\n",
"def track_infer_time(buffer: [float]):\n",
"    start = time()\n",
"    yield\n",
"    end = time()\n",
"\n",
"    # Record the elapsed wall-clock time, in seconds\n",
"    buffer.append(end - start)\n",
|
|
"\n",
|
|
"\n",
"@dataclass\n",
"class OnnxInferenceResult:\n",
"    model_inference_time: [float]\n",
"    optimized_model_path: str\n",
|
|
"\n",
|
|
"\n",
|
|
"# All the providers we'll be using in the test\n",
|
|
"results = {}\n",
"providers = [\n",
"    \"CUDAExecutionProvider\",\n",
"    \"CPUExecutionProvider\",\n",
"    \"TensorrtExecutionProvider\",\n",
"    \"DnnlExecutionProvider\",\n",
"]\n",
|
|
"\n",
|
|
"# Iterate over all the providers\n",
"for provider in providers:\n",
"\n",
"    # Create the model with the specified provider\n",
"    model = create_model_for_provider(\"onnx/bert-base-cased.onnx\", provider)\n",
"\n",
"    # Keep track of the inference time\n",
"    time_buffer = []\n",
"\n",
"    # Warm up the model\n",
"    for _ in trange(10, desc=\"Warming up\"):\n",
"        model.run(None, inputs_onnx)\n",
"\n",
"    # Measure the inference time over 100 runs\n",
"    for _ in trange(100, desc=f\"Tracking inference time on {provider}\"):\n",
"        with track_infer_time(time_buffer):\n",
"            model.run(None, inputs_onnx)\n",
"\n",
"    # Store the result\n",
"    results[provider] = OnnxInferenceResult(\n",
"        time_buffer,\n",
"        model.get_session_options().optimized_model_filepath\n",
"    )"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 51
|
|
},
|
|
"colab_type": "code",
|
|
"id": "PS_49goe197g",
|
|
"outputId": "0ef0f70c-f5a7-46a0-949a-1a93f231d193"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Warming up: 100%|██████████| 10/10 [00:00<00:00, 18.04it/s]\n",
|
|
"Tracking inference time on PyTorch: 100%|██████████| 100/100 [00:05<00:00, 18.88it/s]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from transformers import BertModel\n",
|
|
"\n",
|
|
"# Add PyTorch to the providers\n",
"model_pt = BertModel.from_pretrained(\"bert-base-cased\")\n",
"for _ in trange(10, desc=\"Warming up\"):\n",
"    model_pt(**model_inputs)\n",
"\n",
"# Measure the inference time over 100 runs\n",
"time_buffer = []\n",
"for _ in trange(100, desc=\"Tracking inference time on PyTorch\"):\n",
"    with track_infer_time(time_buffer):\n",
"        model_pt(**model_inputs)\n",
"\n",
"# Store the result\n",
"results[\"PyTorch\"] = OnnxInferenceResult(\n",
"    time_buffer,\n",
"    \"\"  # the PyTorch baseline has no optimized ONNX graph file\n",
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Show the inference performance of each provider\n",
|
|
"\n",
|
|
"_Note: PyTorch model benchmark is run on CPU_"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 24,
|
|
"metadata": {
|
|
"colab": {
|
|
"base_uri": "https://localhost:8080/",
|
|
"height": 676
|
|
},
|
|
"colab_type": "code",
|
|
"id": "dj-rS8AcqRZQ",
|
|
"outputId": "b4bf07d1-a7b4-4eff-e6bd-d5d424fd17fb"
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"image/png": "\n",
|
|
"text/plain": [
|
|
"<Figure size 1600x1200 with 1 Axes>"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"%matplotlib inline\n",
|
|
"\n",
|
|
"import matplotlib\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"import numpy as np\n",
|
|
"import os\n",
|
|
"\n",
|
|
"# Compute average inference time + std\n",
|
|
"time_results = {k: np.mean(v.model_inference_time) * 1e3 for k, v in results.items()}\n",
"time_results_std = [np.std(v.model_inference_time) * 1e3 for v in results.values()]\n",
|
|
"\n",
|
|
"plt.rcdefaults()\n",
|
|
"fig, ax = plt.subplots(figsize=(16, 12))\n",
|
|
"ax.set_ylabel(\"Avg Inference time (ms)\")\n",
|
|
"ax.set_title(\"Average inference time (ms) for each provider\")\n",
"ax.bar(list(time_results.keys()), list(time_results.values()), yerr=time_results_std)\n",
|
|
"plt.show()"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"accelerator": "GPU",
|
|
"colab": {
|
|
"collapsed_sections": [],
|
|
"name": "ONNX Overview",
|
|
"provenance": [],
|
|
"toc_visible": true
|
|
},
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.9"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}