{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
"\n",
"<i>Licensed under the MIT License.</i>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# xDeepFM : the eXtreme Deep Factorization Machine \n",
"This notebook will give you a quick example of how to train an xDeepFM model. \n",
"xDeepFM \\[1\\] is a deep learning-based model aims at capturing both lower- and higher-order feature interactions for precise recommender systems. Thus it can learn feature interactions more effectively and manual feature engineering effort can be substantially reduced. To summarize, xDeepFM has the following key properties:\n",
"* It contains a component, named CIN, that learns feature interactions in an explicit fashion and in vector-wise level;\n",
"* It contains a traditional DNN component that learns feature interactions in an implicit fashion and in bit-wise level.\n",
"* The implementation makes this model quite configurable. We can enable different subsets of components by setting hyperparameters like `use_Linear_part`, `use_FM_part`, `use_CIN_part`, and `use_DNN_part`. For example, by enabling only the `use_Linear_part` and `use_FM_part`, we can get a classical FM model.\n"
]
},
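{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of the last point, here is a minimal sketch of an FM-style configuration, using the `prepare_hparams` helper introduced later in this notebook (the exact flag combination shown is an assumption for illustration; only the Linear and FM parts are enabled):\n",
"\n",
"```python\n",
"# Enable only the Linear and FM components to recover a classical FM model\n",
"hparams_fm = prepare_hparams(yaml_file,\n",
"                             use_Linear_part=True,\n",
"                             use_FM_part=True,\n",
"                             use_CIN_part=False,\n",
"                             use_DNN_part=False)\n",
"```"
]
},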
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Global Settings and Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"System version: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43) \n",
"[GCC 7.3.0]\n",
"Tensorflow version: 1.12.0\n"
]
}
],
"source": [
"import sys\n",
"sys.path.append(\"notebooks\")\n",
"import papermill as pm\n",
"import tensorflow as tf\n",
"\n",
"from reco_utils.recommender.deeprec.deeprec_utils import *\n",
"from reco_utils.recommender.deeprec.models.xDeepFM import *\n",
"from reco_utils.recommender.deeprec.IO.iterator import *\n",
"\n",
"print(\"System version: {}\".format(sys.version))\n",
"print(\"Tensorflow version: {}\".format(tf.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parameters"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"EPOCHS_FOR_SYNTHETIC_RUN = 15\n",
"EPOCHS_FOR_CRITEO_RUN = 30\n",
"BATCH_SIZE_SYNTHETIC = 128\n",
"BATCH_SIZE_CRITEO = 4096"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download data\n",
"xDeepFM uses the FFM format as data input: `<label> <field_id>:<feature_id>:<feature_value>` \n",
"Each line represents an instance, `<label>` is a binary value with 1 meaning positive instance and 0 meaning negative instance. \n",
"Features are divided into fields. For example, user's gender is a field, it contains three possible values, i.e. male, female and unknown. Occupation can be another field, which contains many more possible values than the gender field. Both field index and feature index are starting from 1. <br>\n",
"Now let's start with a small synthetic dataset. In this dataset, there are 10 fields, 1000 fefatures, and label is generated according to the result of a set of preset pair-wise feature interactions. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data_path = '../../tests/resources/deeprec/xdeepfm'\n",
"yaml_file = os.path.join(data_path, r'xDeepFM.yaml')\n",
"train_file = os.path.join(data_path, r'synthetic_part_0')\n",
"valid_file = os.path.join(data_path, r'synthetic_part_1')\n",
"test_file = os.path.join(data_path, r'synthetic_part_2')\n",
"output_file = os.path.join(data_path, r'output.txt')\n",
"\n",
"if not os.path.exists(yaml_file):\n",
" download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')\n"
]
},
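{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the FFM format concrete, we can peek at the first instance of the downloaded training file (a quick sanity check; the exact indices and values will vary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the first training instance: a label followed by <field_id>:<feature_id>:<feature_value> triples\n",
"with open(train_file, 'r') as f:\n",
"    print(f.readline().strip())"
]
},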
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create hyper-parameters\n",
"prepare_hparams() will create a full set of hyper-parameters for model training, such as learning rate, feature number, and dropout ratio. We can put those parameters in a yaml file, or pass parameters as the function's parameters (which will overwrite yaml settings)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('DNN_FIELD_NUM', None), ('FEATURE_COUNT', 1000), ('FIELD_COUNT', 10), ('MODEL_DIR', None), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('activation', ['relu', 'relu']), ('attention_activation', None), ('attention_dropout', 0.0), ('attention_layer_sizes', None), ('batch_size', 128), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0001), ('cross_layer_sizes', [1]), ('cross_layers', None), ('data_format', 'ffm'), ('dim', 10), ('doc_size', None), ('dropout', [0.0, 0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0001), ('enable_BN', False), ('entityEmb_file', None), ('entity_dim', None), ('entity_embedding_method', None), ('entity_size', None), ('epochs', 15), ('fast_CIN_d', 0), ('filter_sizes', None), ('init_method', 'tnormal'), ('init_value', 0.3), ('is_clip_norm', 0), ('iterator_type', None), ('kg_file', None), ('kg_training_interval', 5), ('layer_l1', 0.0), ('layer_l2', 0.0001), ('layer_sizes', [100, 100]), ('learning_rate', 0.001), ('load_model_name', 'you model path'), ('load_saved_model', False), ('loss', 'log_loss'), ('lr_kg', 0.5), ('lr_rs', 1), ('max_grad_norm', 2), ('method', 'classification'), ('metrics', ['auc', 'logloss']), ('model_type', 'xDeepFM'), ('mu', None), ('n_item', None), ('n_item_attr', None), ('n_user', None), ('n_user_attr', None), ('num_filters', None), ('optimizer', 'adam'), ('reg_kg', 0.0), ('save_epoch', 2), ('save_model', False), ('show_step', 200000), ('train_ratio', None), ('transform', None), ('use_CIN_part', True), ('use_DNN_part', False), ('use_FM_part', False), ('user_Linear_part', False), ('user_clicks', None), ('user_dropout', False), ('wordEmb_file', None), ('word_size', None), ('write_tfevents', False)]\n"
]
}
],
"source": [
"hparams = prepare_hparams(yaml_file, \n",
" FEATURE_COUNT=1000, \n",
" FIELD_COUNT=10, \n",
" cross_l2=0.0001, \n",
" embed_l2=0.0001, \n",
" learning_rate=0.001, \n",
" epochs=EPOCHS_FOR_SYNTHETIC_RUN,\n",
" batch_size=BATCH_SIZE_SYNTHETIC)\n",
"print(hparams)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create data loader\n",
"Designate a data iterator for the model. xDeepFM uses FFMTextIterator. "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"input_creator = FFMTextIterator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create model\n",
"When both hyper-parameters and data iterator are ready, we can create a model:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Add CIN part.\n"
]
}
],
"source": [
"model = XDeepFMModel(hparams, input_creator)\n",
"\n",
"## sometimes we don't want to train a model from scratch\n",
"## then we can load a pre-trained model like this: \n",
"#model.load_model(r'your_model_path')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see what is the model's performance at this point (without starting training):"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'auc': 0.4995, 'logloss': 0.7267}\n"
]
}
],
"source": [
"print(model.run_eval(test_file))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"AUC=0.5 is a state of random guess. We can see that before training, the model behaves like random guessing. Next we want to train the model on a training set, and check the performance on a validation dataset. Training the model is as simple as a function call:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"at epoch 1 train info: auc:0.5299, logloss:0.6919 eval info: auc:0.4979, logloss:0.6958\n",
"at epoch 1 , train time: 4.8 eval time: 4.7\n",
"at epoch 2 train info: auc:0.5552, logloss:0.6891 eval info: auc:0.5046, logloss:0.6943\n",
"at epoch 2 , train time: 4.7 eval time: 4.9\n",
"at epoch 3 train info: auc:0.5924, logloss:0.6837 eval info: auc:0.5195, logloss:0.6933\n",
"at epoch 3 , train time: 4.6 eval time: 4.7\n",
"at epoch 4 train info: auc:0.6689, logloss:0.6609 eval info: auc:0.5731, logloss:0.6847\n",
"at epoch 4 , train time: 4.3 eval time: 4.1\n",
"at epoch 5 train info: auc:0.8033, logloss:0.5583 eval info: auc:0.719, logloss:0.6157\n",
"at epoch 5 , train time: 4.0 eval time: 4.0\n",
"at epoch 6 train info: auc:0.8952, logloss:0.4199 eval info: auc:0.8332, logloss:0.5036\n",
"at epoch 6 , train time: 4.1 eval time: 3.9\n",
"at epoch 7 train info: auc:0.9391, logloss:0.324 eval info: auc:0.8844, logloss:0.4292\n",
"at epoch 7 , train time: 4.1 eval time: 3.9\n",
"at epoch 8 train info: auc:0.964, logloss:0.2523 eval info: auc:0.9133, logloss:0.3772\n",
"at epoch 8 , train time: 4.1 eval time: 4.0\n",
"at epoch 9 train info: auc:0.9799, logloss:0.1921 eval info: auc:0.9346, logloss:0.3318\n",
"at epoch 9 , train time: 4.0 eval time: 4.0\n",
"at epoch 10 train info: auc:0.9897, logloss:0.1411 eval info: auc:0.9515, logloss:0.2884\n",
"at epoch 10 , train time: 4.0 eval time: 4.0\n",
"at epoch 11 train info: auc:0.9952, logloss:0.1004 eval info: auc:0.9641, logloss:0.2499\n",
"at epoch 11 , train time: 4.0 eval time: 4.1\n",
"at epoch 12 train info: auc:0.998, logloss:0.0696 eval info: auc:0.9729, logloss:0.2189\n",
"at epoch 12 , train time: 4.0 eval time: 4.0\n",
"at epoch 13 train info: auc:0.9993, logloss:0.0473 eval info: auc:0.9787, logloss:0.196\n",
"at epoch 13 , train time: 4.0 eval time: 4.0\n",
"at epoch 14 train info: auc:0.9998, logloss:0.0317 eval info: auc:0.9825, logloss:0.1801\n",
"at epoch 14 , train time: 4.0 eval time: 4.0\n",
"at epoch 15 train info: auc:0.9999, logloss:0.0212 eval info: auc:0.9851, logloss:0.1693\n",
"at epoch 15 , train time: 4.2 eval time: 4.0\n"
]
},
{
"data": {
"text/plain": [
"<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x2451c0c9278>"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(train_file, valid_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, let's see what is the model's performance now (after training):"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'auc': 0.9851, 'logloss': 0.1693}\n"
]
}
],
"source": [
"res_syn = model.run_eval(test_file)\n",
"print(res_syn)\n",
"pm.record(\"res_syn\", res_syn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to get the full prediction scores rather than evaluation metrics, we can do this:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<reco_utils.recommender.deeprec.models.xDeepFM.XDeepFMModel at 0x2451c0c9278>"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.predict(test_file, output_file)"
]
},
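{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output file is expected to contain one prediction score per line, aligned with the instances in `test_file` (an assumption about the output layout; check your version of the library). A minimal sketch to inspect the first few scores:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the first few prediction scores written by model.predict()\n",
"with open(output_file, 'r') as f:\n",
"    for _ in range(5):\n",
"        print(f.readline().strip())"
]
},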
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have successfully launched an experiment on a synthetic dataset. Next let's try something on a real world dataset, which is a small sample from Criteo dataset \\[2\\]. Criteo dataset is a well known industry benchmarking dataset for developing CTR prediction models and it's frequently adopted as evaluation dataset by research papers. The original dataset is too large for a lightweight demo, so we sample a small portion from it as a demo dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('demo with Criteo dataset')\n",
"hparams = prepare_hparams(yaml_file, \n",
" FEATURE_COUNT=2300000, \n",
" FIELD_COUNT=39, \n",
" cross_l2=0.01, \n",
" embed_l2=0.01, \n",
" layer_l2=0.01,\n",
" learning_rate=0.002, \n",
" batch_size=BATCH_SIZE_CRITEO, \n",
" epochs=EPOCHS_FOR_CRITEO_RUN, \n",
" cross_layer_sizes=[20, 10], \n",
" init_value=0.1, \n",
" layer_sizes=[20,20],\n",
" use_Linear_part=True, \n",
" use_CIN_part=True, \n",
" use_DNN_part=True)\n",
"\n",
"train_file = os.path.join(data_path, r'cretio_tiny_train')\n",
"valid_file = os.path.join(data_path, r'cretio_tiny_valid')\n",
"test_file = os.path.join(data_path, r'cretio_tiny_test')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"demo with Criteo dataset\n",
"Add linear part.\n",
"Add CIN part.\n",
"Add DNN part.\n",
"{'auc': 0.4827, 'logloss': 0.7696}\n",
"at epoch 1 train info: auc:0.6427, logloss:0.548 eval info: auc:0.6413, logloss:0.5458\n",
"at epoch 1 , train time: 52.5 eval time: 29.1\n",
"at epoch 2 train info: auc:0.7001, logloss:0.519 eval info: auc:0.7004, logloss:0.5182\n",
"at epoch 2 , train time: 50.8 eval time: 28.8\n",
"at epoch 3 train info: auc:0.7221, logloss:0.5071 eval info: auc:0.7199, logloss:0.5075\n",
"at epoch 3 , train time: 50.7 eval time: 29.0\n",
"at epoch 4 train info: auc:0.7337, logloss:0.5007 eval info: auc:0.7304, logloss:0.5016\n",
"at epoch 4 , train time: 50.7 eval time: 28.9\n",
"at epoch 5 train info: auc:0.7405, logloss:0.4967 eval info: auc:0.7367, logloss:0.498\n",
"at epoch 5 , train time: 50.3 eval time: 29.2\n",
"at epoch 6 train info: auc:0.7442, logloss:0.4944 eval info: auc:0.74, logloss:0.4958\n",
"at epoch 6 , train time: 50.5 eval time: 28.9\n",
"at epoch 7 train info: auc:0.7461, logloss:0.4932 eval info: auc:0.7418, logloss:0.4946\n",
"at epoch 7 , train time: 50.1 eval time: 28.6\n",
"at epoch 8 train info: auc:0.7472, logloss:0.4925 eval info: auc:0.7429, logloss:0.4939\n",
"at epoch 8 , train time: 50.5 eval time: 28.7\n",
"at epoch 9 train info: auc:0.7479, logloss:0.4921 eval info: auc:0.7435, logloss:0.4934\n",
"at epoch 9 , train time: 50.9 eval time: 28.8\n",
"at epoch 10 train info: auc:0.7484, logloss:0.4919 eval info: auc:0.744, logloss:0.4931\n",
"at epoch 10 , train time: 51.1 eval time: 29.0\n",
"at epoch 11 train info: auc:0.7488, logloss:0.4916 eval info: auc:0.7445, logloss:0.4928\n",
"at epoch 11 , train time: 50.8 eval time: 28.7\n",
"at epoch 12 train info: auc:0.7492, logloss:0.4913 eval info: auc:0.7449, logloss:0.4925\n",
"at epoch 12 , train time: 50.9 eval time: 28.6\n",
"at epoch 13 train info: auc:0.7496, logloss:0.491 eval info: auc:0.7453, logloss:0.4922\n",
"at epoch 13 , train time: 50.2 eval time: 29.3\n",
"at epoch 14 train info: auc:0.75, logloss:0.4907 eval info: auc:0.7457, logloss:0.4919\n",
"at epoch 14 , train time: 50.8 eval time: 28.9\n",
"at epoch 15 train info: auc:0.7504, logloss:0.4905 eval info: auc:0.746, logloss:0.4917\n",
"at epoch 15 , train time: 50.4 eval time: 28.9\n",
"at epoch 16 train info: auc:0.7508, logloss:0.4903 eval info: auc:0.7463, logloss:0.4915\n",
"at epoch 16 , train time: 51.0 eval time: 28.6\n",
"at epoch 17 train info: auc:0.7512, logloss:0.49 eval info: auc:0.7466, logloss:0.4914\n",
"at epoch 17 , train time: 56.1 eval time: 29.4\n",
"at epoch 18 train info: auc:0.7516, logloss:0.4897 eval info: auc:0.7468, logloss:0.4912\n",
"at epoch 18 , train time: 52.8 eval time: 28.8\n",
"at epoch 19 train info: auc:0.7522, logloss:0.4893 eval info: auc:0.7471, logloss:0.491\n",
"at epoch 19 , train time: 50.4 eval time: 28.8\n",
"at epoch 20 train info: auc:0.7529, logloss:0.4888 eval info: auc:0.7474, logloss:0.4907\n",
"at epoch 20 , train time: 50.3 eval time: 28.8\n",
"at epoch 21 train info: auc:0.7537, logloss:0.4883 eval info: auc:0.7478, logloss:0.4904\n",
"at epoch 21 , train time: 51.0 eval time: 28.8\n",
"at epoch 22 train info: auc:0.7546, logloss:0.4877 eval info: auc:0.7482, logloss:0.4901\n",
"at epoch 22 , train time: 50.6 eval time: 28.8\n",
"at epoch 23 train info: auc:0.7557, logloss:0.487 eval info: auc:0.7488, logloss:0.4897\n",
"at epoch 23 , train time: 50.4 eval time: 28.8\n",
"at epoch 24 train info: auc:0.7569, logloss:0.4862 eval info: auc:0.7493, logloss:0.4893\n",
"at epoch 24 , train time: 51.0 eval time: 28.8\n",
"at epoch 25 train info: auc:0.7582, logloss:0.4853 eval info: auc:0.75, logloss:0.4888\n",
"at epoch 25 , train time: 51.4 eval time: 29.3\n",
"at epoch 26 train info: auc:0.7598, logloss:0.4843 eval info: auc:0.7507, logloss:0.4883\n",
"at epoch 26 , train time: 50.7 eval time: 29.1\n",
"at epoch 27 train info: auc:0.7615, logloss:0.4831 eval info: auc:0.7514, logloss:0.4878\n",
"at epoch 27 , train time: 51.1 eval time: 28.9\n",
"at epoch 28 train info: auc:0.7633, logloss:0.4819 eval info: auc:0.7521, logloss:0.4873\n",
"at epoch 28 , train time: 57.5 eval time: 38.5\n",
"at epoch 29 train info: auc:0.7653, logloss:0.4806 eval info: auc:0.7528, logloss:0.4867\n",
"at epoch 29 , train time: 68.7 eval time: 38.0\n",
"at epoch 30 train info: auc:0.7675, logloss:0.4792 eval info: auc:0.7536, logloss:0.4862\n",
"at epoch 30 , train time: 68.1 eval time: 38.4\n",
"{'auc': 0.7536, 'logloss': 0.4862}\n"
]
}
],
"source": [
"model = XDeepFMModel(hparams, FFMTextIterator)\n",
"\n",
"# check the predictive performance before the model is trained\n",
"print(model.run_eval(test_file)) \n",
"model.fit(train_file, valid_file)\n",
"# check the predictive performance after the model is trained\n",
"res_real = model.run_eval(test_file)\n",
"print(res_real)\n",
"pm.record(\"res_real\", res_real)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reference\n",
"\\[1\\] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems.Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \\& Data Mining, KDD 2018, London, UK, August 19-23, 2018.<br>\n",
"\\[2\\] The Criteo datasets: http://labs.criteo.com/category/dataset/. "
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python (reco_bare)",
"language": "python",
"name": "reco_bare"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}