# Training Script
In this notebook, we create the training script whose hyperparameters will be tuned. Each notebook cell is appended in turn to the training script, so you must run the notebook's cells _in order_ for the script to run correctly.

*Note:* If you edit this notebook's cells, be sure to preserve the blank lines at the start and end of the cells, as they prevent the contents of consecutive cells from being improperly concatenated.
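
To see why those blank lines matter, here is a minimal sketch that imitates appending cells to a file. It assumes each cell's text is written verbatim with no separator added; `append_cell` and `demo_script.py` are hypothetical names used only for illustration, and the real `%%writefile --append` magic may differ in detail.

```python
import os
import tempfile


def append_cell(path, cell_text):
    """Append one cell's text to the script file, verbatim (assumed behavior)."""
    with open(path, 'a') as script:
        script.write(cell_text)


script_path = os.path.join(tempfile.mkdtemp(), 'demo_script.py')

# Cells without surrounding blank lines run together on one line.
append_cell(script_path, 'import os')
append_cell(script_path, 'import sys')
with open(script_path) as script:
    joined = script.read()

# Cells that start and end with a blank line stay separate.
os.remove(script_path)
append_cell(script_path, '\nimport os\n')
append_cell(script_path, '\nimport sys\n')
with open(script_path) as script:
    separated = script.read()

print(repr(joined))
print(repr(separated))
```

In the first case the two statements fuse into a single invalid line, while in the second each statement keeps its own line.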

The script sections are
- [import libraries](#import),
- [define utility functions and classes](#utility),
- [define the script input parameters](#parameters),
- [load and prepare the training data](#data),
- [define the training pipeline](#pipeline),
- [train the model](#train),
- [score the test data](#score), and
- [compute the test data performance](#performance).

[The final cell](#run) runs the script using the training data created by [the first notebook](00_Data_Prep.ipynb).

## Load libraries <a id='import'></a>

In [None]:
%%writefile TrainTestClassifier.py

from __future__ import print_function
import os
import warnings
import logging
import argparse
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.feature_extraction import text
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly.
import joblib
from sklearn.base import BaseEstimator, TransformerMixin


## Define utility functions and classes <a id='utility'></a>

In [None]:
%%writefile --append TrainTestClassifier.py

class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at provided
    key(s).

    The data are expected to be stored in a 2D data structure, where
    the first index is over features and the second is over samples,
    i.e.

    >>> len(data[keys]) == n_samples

    Please note that this is the opposite convention to scikit-learn
    feature matrices (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[keys]).  Examples include: a dict of lists, 2D numpy array,
    Pandas DataFrame, numpy record array, etc.

    >>> data = {'a': [1, 5, 2, 5, 2, 8],
    ...         'b': [9, 4, 1, 4, 1, 3]}
    >>> ds = ItemSelector(keys='a')
    >>> data['a'] == ds.transform(data)
    True

    ItemSelector is not designed to handle data grouped by sample
    (e.g. a list of dicts).  If your data are structured this way,
    consider a transformer along the lines of
    `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    keys : hashable or list of hashable, required
        The key(s) corresponding to the desired value(s) in a mappable.

    """

    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, *args, **kwargs):
        if type(self.keys) is list:
            assert all([key in x for key in self.keys]), 'Not all keys in data'
        else:
            assert self.keys in x, 'key not in data'
        return self

    def transform(self, data_dict, *args, **kwargs):
        return data_dict[self.keys]

    
def score_rank(scores):
    """Compute the ranks of the scores."""
    return pd.Series(scores).rank(ascending=False)


def label_index(label, label_order):
    """Compute the index of label in label_order."""
    loc = np.where(label == label_order)[0]
    if loc.shape[0] == 0:
        return None
    return loc[0]


def label_rank(label, scores, label_order):
    """Compute the rank of label using the scores."""
    loc = label_index(label, label_order)
    if loc is None:
        return len(scores) + 1
    return score_rank(scores)[loc]


warnings.filterwarnings(action='ignore', category=UserWarning, module='lightgbm')


## Define the input parameters <a id='parameters'></a>
One of the most important parameters is `estimators`, the number of estimators, which lets you trade off accuracy, modeling time, and model size. The table below shows how the number of estimators relates to these metrics. The default value is 100.

| Estimators | Run time (s) | Size (MB) | Accuracy@1 | Accuracy@2 | Accuracy@3 |
|------------|--------------|-----------|------------|------------|------------|
|        100 |           40 |  2 | 25.02% | 38.72% | 47.83% |
|       1000 |          177 |  4 | 46.79% | 60.80% | 69.11% |
|       2000 |          359 |  7 | 51.38% | 65.93% | 73.09% |
|       4000 |          628 | 12 | 53.39% | 67.40% | 74.74% |
|       8000 |          904 | 22 | 54.62% | 67.77% | 75.35% |

Other parameters that may be useful to tune include the following:
* `ngrams`: the maximum n-gram size for features, an integer of at least 1 (default 1),
* `min_child_samples`: the minimum number of samples in a leaf, an integer greater than 1, as the script asserts (default 20),
* `match`: the maximum number of training examples per duplicate question, an integer of at least 2 (default 20), and
* `unweighted`: a flag that turns off the sample weights used to compensate for unbalanced data (by default, the weights are used).
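
As a rough illustration of the sample weights that `unweighted` turns off, the sketch below computes one weight per label as `n_samples / (n_classes * label_count)`, the same formula the script applies with pandas. The helper `balanced_weights` and the example labels are hypothetical and use only the standard library.

```python
from collections import Counter


def balanced_weights(labels):
    """Map each label to n_samples / (n_classes * label_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical labels: six non-matches (0) and two matches (1).
example_labels = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_weights(example_labels)
print(weights)

# With these weights, each class contributes the same total weight.
totals = Counter()
for label in example_labels:
    totals[label] += weights[label]
print(dict(totals))
```

The rarer label receives the larger weight, so both classes carry equal total weight during training.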

The estimator's performance is measured on held-aside test data. The reported statistic is how far down the sorted list of results the correct result appears. The `rank` parameter sets the maximum distance down the list for which the statistic is reported.
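
The following sketch shows one way to compute this statistic, assuming tied scores share their average rank (the default behavior of the pandas `rank` method that the script relies on). The functions `average_ranks` and `accuracy_at` and the example scores are hypothetical, written with the standard library only.

```python
def average_ranks(scores):
    """Rank scores in descending order; tied scores share their average rank."""
    ranks = []
    for score in scores:
        higher = sum(1 for other in scores if other > score)
        tied = sum(1 for other in scores if other == score)
        ranks.append(higher + (tied + 1) / 2)
    return ranks


def accuracy_at(ranks, k):
    """Fraction of questions whose correct answer is ranked k or better."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical test questions: (scores per candidate answer, index of the
# correct answer among the candidates).
questions = [
    ([0.10, 0.70, 0.20, 0.05], 1),
    ([0.40, 0.30, 0.90, 0.10], 0),
    ([0.20, 0.50, 0.60, 0.30], 0),
]
correct_ranks = [average_ranks(scores)[correct]
                 for scores, correct in questions]

for k in range(1, 4):
    print('Accuracy @{} = {:.2%}'.format(k, accuracy_at(correct_ranks, k)))
print('Mean Rank {:.4f}'.format(sum(correct_ranks) / len(correct_ranks)))
```

Here a `rank` of 3 corresponds to the loop bound: accuracy is reported at ranks 1, 2, and 3.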

In [None]:
%%writefile --append TrainTestClassifier.py

if __name__ == '__main__':
    
    parser = argparse.ArgumentParser(description='Fit and evaluate a model'
                                     ' based on train-test datasets.')
    parser.add_argument('--data', help='the training dataset name',
                        default='balanced_pairs_train.tsv')
    parser.add_argument('--test', help='the test dataset name',
                        default='balanced_pairs_test.tsv')
    parser.add_argument('--estimators',
                        help='the number of learner estimators',
                        type=int, default=100)
    parser.add_argument('--min_child_samples',
                        help='the minimum number of samples in a child(leaf)',
                        type=int, default=20)
    parser.add_argument('--ngrams',
                        help='the maximum size of word ngrams',
                        type=int, default=1)
    parser.add_argument('--match',
                        help='the maximum number of duplicate matches',
                        type=int, default=20)
    parser.add_argument('--unweighted',
                        help='do not use instance weights',
                        action='store_true')
    parser.add_argument('--rank',
                        help='the maximum rank of correct answers',
                        type=int, default=3)
    parser.add_argument('--inputs', help='the inputs directory',
                        default='.')
    parser.add_argument('--outputs', help='the outputs directory',
                        default='.')
    parser.add_argument('--save', help='save the model',
                        action='store_true')
    parser.add_argument('--log', help='the log file',
                        default='TrainTestClassifier.log')
    parser.add_argument('--model', help='the model file',
                        default='model.pkl')
    parser.add_argument('--instances', help='the instances file',
                        default='instances.csv')
    parser.add_argument('--labels', help='the labels file',
                        default='labels.csv')
    parser.add_argument('--verbose',
                        help='the verbosity of the estimator',
                        type=int, default=-1)
    args = parser.parse_args()
    

## Load and prepare the training data <a id='data'></a>

In [None]:
%%writefile --append TrainTestClassifier.py

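    outputs_path = args.outputs
    # Create the outputs folder first so that the log file can be opened.
    os.makedirs(outputs_path, exist_ok=True)
    log_path = os.path.join(outputs_path, args.log)
    logging.basicConfig(filename=log_path, level=logging.DEBUG)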

    print('Prepare the training data.')
    logging.info('Prepare the training data.')
    
    # Paths to the input data.
    inputs_path = args.inputs
    data_path = os.path.join(inputs_path, args.data)
    test_path = os.path.join(inputs_path, args.test)

    # Paths for the output data.
    outputs_path = args.outputs
    model_path = os.path.join(outputs_path, args.model)
    instances_path = os.path.join(outputs_path, args.instances)
    labels_path = os.path.join(outputs_path, args.labels)

    # Create the outputs folder.
    os.makedirs(outputs_path, exist_ok=True)

    # Load the data.
    print('Reading {}'.format(data_path))
    logging.info('Reading {}'.format(data_path))
    train = pd.read_csv(data_path, sep='\t', encoding='latin1')

    # Limit the number of training duplicate matches.
    # Copy the filtered rows so that columns can be added to them later.
    train = train[train.n < args.match].copy()

    # Define the input data columns.
    feature_columns = ['Text_x', 'Text_y']
    label_column = 'Label'
    group_column = 'Id_x'
    answerid_column = 'AnswerId_y'
    name_columns = ['Id_x', 'Id_y']
    weight_column = 'Weight'

    # Report on the dataset.
    print('train: {:,} rows with {:.2%} matches'.format(
        train.shape[0], train[label_column].mean()))
    logging.info('train: {:,} rows with {:.2%} matches'.format(
        train.shape[0], train[label_column].mean()))
    
    # Compute instance weights.
    if args.unweighted:
        print('No sample weights.')
        logging.info('No sample weights.')
        labels = train[label_column].unique()
        weight = pd.Series([1.0] * labels.shape[0], labels)
    else:
        print('Using sample weights.')
        logging.info('Using sample weights.')
        label_counts = train[label_column].value_counts()
        weight = train.shape[0]/(label_counts.shape[0]*label_counts)
        print(weight)
        logging.info(weight)
    train[weight_column] = train[label_column].apply(lambda x: weight[x])

    # Select and format the training data.
    train_X = train[feature_columns]
    train_y = train[label_column]
    sample_weight = train[weight_column]
    groups = train[group_column]
    names = train[name_columns]
    

## Define the featurization and estimator <a id='pipeline'></a>
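
The `--ngrams` argument becomes the upper end of the `ngram_range` passed to the text vectorizer. As a simplified sketch of what such a range means, the hypothetical helper `word_ngrams` below lists the word n-grams of a sentence. It splits on whitespace only, which is an assumption made for illustration and not the tokenization that `TfidfVectorizer` actually performs.

```python
def word_ngrams(text, ngram_range):
    """List the word n-grams of text for every n in the inclusive range."""
    tokens = text.lower().split()
    low, high = ngram_range
    return [' '.join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]


# With --ngrams 2 the range is (1, 2): single words plus adjacent word pairs.
print(word_ngrams('sort an array quickly', (1, 2)))
```

Raising the upper bound adds longer phrases as features, which increases the size of the feature space.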

In [None]:
%%writefile --append TrainTestClassifier.py

    print('Define the model pipeline.')
    logging.info('Define the model pipeline.')

    # Select the training hyperparameters.
    n_estimators = args.estimators
    min_child_samples = args.min_child_samples
    if args.ngrams > 0:
        ngram_range = (1, args.ngrams)
    else:
        ngram_range = None

    # Verify that the hyperparameter values are valid.
    assert n_estimators > 0
    assert min_child_samples > 1
    assert ngram_range is not None
    assert type(ngram_range) is tuple and len(ngram_range) == 2
    assert ngram_range[0] > 0 and ngram_range[0] <= ngram_range[1]

    # Define the featurization pipeline.
    featurization = [
        (column,
         make_pipeline(ItemSelector(column),
                       text.TfidfVectorizer(ngram_range=ngram_range)))
        for column in feature_columns]
    features = FeatureUnion(featurization)

    # Define the estimator.
    estimator = lgb.LGBMClassifier(n_estimators=n_estimators,
                                   min_child_samples=min_child_samples,
                                   verbose=args.verbose)

    # Put them together into the model pipeline.
    model = Pipeline([
        ('features', features),
        ('model', estimator)
    ])
    
    # Report the featurization.
    print('Estimators={:,}'.format(n_estimators))
    print('Ngram range={}'.format(ngram_range))
    print('Min child samples={}'.format(min_child_samples))
    logging.info('Estimators={:,}'.format(n_estimators))
    logging.info('Ngram range={}'.format(ngram_range))
    logging.info('Min child samples={}'.format(min_child_samples))
    

## Train the model <a id='train'></a>

In [None]:
%%writefile --append TrainTestClassifier.py

    print('Fitting the model.')
    logging.info('Fitting the model.')

    # Fit the model.
    model.fit(train_X, train_y, model__sample_weight=sample_weight)
    
    # Collect the ordered label for computing scores.
    labels = sorted(train[answerid_column].unique())
    label_order = pd.DataFrame({'label': labels})

    # Write the model to file.
    if args.save:
        print('Saving the model to {}'.format(model_path))
        logging.info('Saving the model to {}'.format(model_path))
        joblib.dump(model, model_path)
        print('{}: {:.2f} MB'.format(
            model_path, os.path.getsize(model_path)/(2**20)))
        logging.info('{}: {:.2f} MB'.format(
            model_path, os.path.getsize(model_path)/(2**20)))
        print('Saving the labels to {}'.format(labels_path))
        logging.info('Saving the labels to {}'.format(labels_path))
        label_order.to_csv(labels_path, sep='\t', index=False)
        

## Score the test data using the model <a id='score'></a>
This produces a dataframe of scores with one row per duplicate question.
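
The grouping step can be sketched without pandas as follows: scored pairs are sorted by duplicate question and candidate answer, and the probabilities of each question are collected into one tuple. The data and the helper `group_probabilities` are hypothetical; the script itself does this with `sort_values` and `groupby`.

```python
from collections import defaultdict

# Hypothetical scored pairs:
# (duplicate question Id, candidate AnswerId, match probability).
scored_pairs = [
    ('dup2', 'a1', 0.10),
    ('dup1', 'a2', 0.80),
    ('dup1', 'a1', 0.15),
    ('dup2', 'a2', 0.30),
]


def group_probabilities(pairs):
    """Collect each question's probabilities, ordered by AnswerId."""
    grouped = defaultdict(list)
    ordered = sorted(pairs, key=lambda pair: (pair[0], pair[1]))
    for question_id, _answer_id, probability in ordered:
        grouped[question_id].append(probability)
    return {question_id: tuple(probabilities)
            for question_id, probabilities in grouped.items()}


per_question = group_probabilities(scored_pairs)
print(per_question)
```

Because every tuple follows the same AnswerId order, a position in the tuple identifies the same candidate answer for every question.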

In [None]:
%%writefile --append TrainTestClassifier.py

    print('Scoring the test data.')
    logging.info('Scoring the test data.')

    # Read the test data.
    print('Reading {}'.format(test_path))
    logging.info('Reading {}'.format(test_path))
    test = pd.read_csv(test_path, sep='\t', encoding='latin1')
    print('test: {:,} rows with {:.2%} matches'.format(
        test.shape[0], test[label_column].mean()))
    logging.info('test: {:,} rows with {:.2%} matches'.format(
        test.shape[0], test[label_column].mean()))

    # Collect the model predictions.
    test_X = test[feature_columns]
    test['probabilities'] = model.predict_proba(test_X)[:, 1]

    # Order the testing data by dupe Id and question AnswerId.
    test.sort_values([group_column, answerid_column], inplace=True)

    # Extract the ordered probabilities.
    probabilities = (
        test.probabilities
        .groupby(test[group_column], sort=False)
        .apply(lambda x: tuple(x.values)))

    # Get the individual records.
    output_columns_x = ['Id_x', 'AnswerId_x', 'Text_x']
    test_score = (test[output_columns_x]
                  .drop_duplicates()
                  .set_index(group_column))
    test_score['probabilities'] = probabilities
    test_score.reset_index(inplace=True)
    test_score.columns = ['Id', 'AnswerId', 'Text', 'probabilities']
    

## Report the model's performance statistics on the test data <a id='performance'></a>

In [None]:
%%writefile --append TrainTestClassifier.py

    print("Evaluating the model's performance.")
    logging.info("Evaluating the model's performance.")
    
    # Collect the ordered AnswerId for computing the scores.
    labels = sorted(train[answerid_column].unique())
    label_order = pd.DataFrame({'label': labels})

    # Rank the correct answers.
    test_score['Ranks'] = test_score.apply(lambda x:
                                           label_rank(x.AnswerId,
                                                      x.probabilities,
                                                      label_order.label),
                                           axis=1)

    # Compute the fraction of correct answers ranked at or above each rank.
    for i in range(1, args.rank+1):
        print('Accuracy @{} = {:.2%}'.format(
            i, (test_score['Ranks'] <= i).mean()))
        logging.info('Accuracy @{} = {:.2%}'.format(
            i, (test_score['Ranks'] <= i).mean()))
    mean_rank = test_score['Ranks'].mean()
    print('Mean Rank {:.4f}'.format(mean_rank))
    logging.info('Mean Rank {:.4f}'.format(mean_rank))

    # Write the scored instances.
    if args.save:
        print('Saving the scored instances to {}'.format(instances_path))
        logging.info('Saving the scored instances to {}'.format(instances_path))        
        test_score.to_csv(instances_path, sep='\t', index=False,
                          encoding='latin1')
        

## Run the script to see that it works <a id='run'></a>
This should take around five minutes.

In [None]:
%run -t TrainTestClassifier.py --estimators 1000 --match 5 --ngrams 2 --min_child_samples 10

In [the next notebook](02_Docker_Image.ipynb), we create the Docker image used by Batch AI to run the training script.