
LightGBM Transform Tutorial
===========================

This tutorial shows how to apply feature transformations in LightGBM with `FreeForm2Parser <../examples/freeform2_parser.cpp>`__.

A transformation converts data or features from one format to another.
LightGBM currently supports two kinds of transformation:

-   Linear. A linear transformation that can be adjusted with `slope` and `intercept`.

-   `FreeForm2 <./FreeForm2-Language.rst>`__. FreeForm2 is a more flexible transform, created by Microsoft Core Ranking and widely used in Microsoft production model training.
    As the name indicates, FreeForm2 lets users combine features freely. A transform is expressed as a formula that is applied to the model inputs.
    The surface syntax is an s-expression, delimited by parentheses in a LISP-like fashion.
    FreeForm2 has an implicit type system and evaluates a single, nested expression that returns a floating-point number.

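To make the two transform kinds concrete, the following Python sketch mimics them on a single row.
It is only an illustration under assumptions: `linear`, `tokenize`, `parse` and `evaluate` are hypothetical helpers, not part of LightGBM or `FreeForm2Parser`.
The linear form is assumed to be ``slope * x + intercept``, and the toy evaluator handles only `+`, `-` and `*`, a small subset of the real language.

.. code:: python

    import operator

    # Hypothetical helpers for illustration; this is NOT the real FreeForm2 implementation.
    OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def linear(x, slope=1.0, intercept=0.0):
        """Assumed linear transform: slope * x + intercept."""
        return slope * x + intercept

    def tokenize(expr):
        """Split an s-expression into parentheses and atoms."""
        return expr.replace("(", " ( ").replace(")", " ) ").split()

    def parse(tokens):
        """Build a nested list from the token list (consumes the tokens)."""
        token = tokens.pop(0)
        if token == "(":
            node = []
            while tokens[0] != ")":
                node.append(parse(tokens))
            tokens.pop(0)  # drop the closing ")"
            return node
        return token

    def evaluate(node, row):
        """Evaluate a parsed expression against a dict of feature values."""
        if isinstance(node, list):
            op, *args = node
            values = [evaluate(arg, row) for arg in args]
            result = values[0]
            for value in values[1:]:
                result = OPS[op](result, value)
            return float(result)
        if node in row:
            return float(row[node])  # feature name
        return float(node)  # numeric literal

    row = {"feature_1": 2.0, "feature_2": 3.0, "feature_3": 4.0}
    print(evaluate(parse(tokenize("(+ feature_1 feature_2)")), row))
    print(evaluate(parse(tokenize("(* feature_1 (+ feature_2 feature_3))")), row))
    print(linear(row["feature_1"], slope=2.0, intercept=1.0))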

How to use `FreeForm2Parser`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Installation
------------

See the `Installation Guide <./Installation-Guide.rst>`__ to install the dependencies and `FreeForm2Parser`.

Data preparation
----------------
1.  Input data. Data file used for training or prediction.

    **Note**: only TSV is currently supported.

    **Note**: a header is required. Provide it either in the input data or in `parser_config_file`.

2.  Parser config file. This JSON file should contain the `className`, `transform` and `header` key-value pairs. Below is an example.

    .. code::

        {
            "className":"FreeForm2Parser",
            "transform":"[Input:0]\nLine1=(+ feature_1 feature_2)\nTransform=FreeForm2\nSlope=1\nIntercept=0\n\n[Input:1]\nTransform=FreeForm2\nLine1=(* feature_1 feature_3)\n",
            "header":"feature_0\tfeature_1\tfeature_2\tfeature_3\tfeature_4\tfeature_5\tfeature_6\tfeature_7\tfeature_8\tfeature_9\tlabels"
        }

    **Note**: the `transform` value is the content of the transform file.
    The transform file does not supplement the raw features: only the features it defines are used for training. Use the "Linear" type for any original feature you want to keep.
    See the `FreeForm2 language spec <./FreeForm2-Language.rst>`__ to learn more about the grammar.

    **Note**: transformed feature indices range from 0 to the maximum "Input" value given in the transform file.
    By default, indices missing within that range are padded with 0 as the feature value.

    **Note**: the `query_idx` parameter is the index of the query column in the raw data. (Tip: query is just an alias of `group column <https://lightgbm.readthedocs.io/en/latest/Parameters.html?highlight=query#group_column>`_; other names are fine.)
    The query is appended at the end of the transform_str, and its index in the transformed data equals the total number of Inputs.
    Only an index number is supported for now, because selecting by name would require large changes to the LightGBM source code.

    **Reminder**: you can auto-generate the parser config file with the command below. Note that header_file and `query_idx` are optional arguments.
    Auto-generation for the query only works when header_file exists. The query feature is not included in training with the other features; LightGBM ignores it correctly.
    The script uses "Linear" to auto-generate an "Input" at the end of the transform_str.
    After generation, the script prints the new index of the query column in the transformed data.


    .. code::

        python ./scripts/generate_parser_config.py --class_name your_parser_name --transform_file path/to/transform --header_file path/to/header --parser_config_file path/to/parser_config --query_idx raw_query_id

**Note**: if no parser config file is given, the input data is used directly as features for training.

**Note**: `query_idx` has no effect if header_file doesn't exist. Make sure that `query_idx` does not exceed the largest column index in the raw data.

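As a rough sketch of the file layout described above, the example config can also be assembled in Python with the standard `json` module, which escapes the newlines and tabs in the `transform` and `header` strings.
The variable names are illustrative assumptions, and the regular expression is only a simple way to read the `[Input:N]` section headers; it is not how `FreeForm2Parser` itself parses the transform.

.. code:: python

    import json
    import re

    # Transform file content, mirroring the example config in this section.
    transform = (
        "[Input:0]\n"
        "Line1=(+ feature_1 feature_2)\n"
        "Transform=FreeForm2\n"
        "Slope=1\n"
        "Intercept=0\n"
        "\n"
        "[Input:1]\n"
        "Transform=FreeForm2\n"
        "Line1=(* feature_1 feature_3)\n"
    )

    # Tab-separated header: feature_0 .. feature_9 followed by labels.
    header = "\t".join([f"feature_{i}" for i in range(10)] + ["labels"])

    config = {
        "className": "FreeForm2Parser",
        "transform": transform,
        "header": header,
    }
    config_text = json.dumps(config, indent=4)

    # Transformed feature indices run from 0 to the largest [Input:N] value.
    input_ids = [int(n) for n in re.findall(r"^\[Input:(\d+)\]", transform, flags=re.M)]
    num_transformed = max(input_ids) + 1

    print(config_text)
    print(num_transformed)

The resulting text can be written to the file whose path is passed as `parser_config_file`.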

Run task
--------

Usage is the same as before; the interface has not changed.

.. code::

    import lightgbm as lgb

    # `params` (training parameters) and `trained_model_path` are assumed to be defined by the caller.
    # Pass the parser config file through the Dataset params.
    train_data = lgb.Dataset("path/to/train.tsv", params={"parser_config_file": "path/to/parser_config.json"})
    valid_data = lgb.Dataset("path/to/valid.tsv", params={"parser_config_file": "path/to/parser_config.json"})
    # Train and predict.
    bst = lgb.train(params, train_data, valid_sets=[valid_data])
    pred = bst.predict("path/to/test.tsv")
    # Save the model.
    bst.save_model(trained_model_path)
    # Load the model and predict again.
    bst = lgb.Booster(model_file=trained_model_path)
    pred = bst.predict("path/to/test.tsv")

**Note**: the parser config is saved at the bottom of the model file, between the section flags "parser" and "end of parser".
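
To inspect the saved parser config, one option is to pull out the text between the two flags.
The sketch below is a guess at how that could look: it assumes each flag sits on its own line, optionally followed by a colon, and `extract_parser_config` is a hypothetical helper, not a LightGBM API.
The sample text is a made-up stand-in for the tail of a model file.

.. code:: python

    def extract_parser_config(model_text):
        """Return the text between the "parser" and "end of parser" flag lines, or None."""
        lines = model_text.splitlines()
        start = end = None
        for i, line in enumerate(lines):
            flag = line.strip().rstrip(":")  # assumption: flags may end with a colon
            if flag == "parser" and start is None:
                start = i
            elif flag == "end of parser" and start is not None:
                end = i
                break
        if start is None or end is None:
            return None
        return "\n".join(lines[start + 1:end])

    # Made-up stand-in for the tail of a model file; a real file has more sections.
    sample = "\n".join([
        "tree",
        "...",
        "parser:",
        '{"className":"FreeForm2Parser"}',
        "end of parser",
    ])
    print(extract_parser_config(sample))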