Script to translate the model file to PMML

file as input and output a directory with pmml files. Each file can be given as input to
PMML consumers. You can test the files using jpmml-evaluator. Please refer to the
Data Mining Group's website for the PMML specification (http://dmg.org/pmml/pmml-v4-3.html).
Also refer to the README.md file in the pmml directory to see how to test the pmml output
with jpmml-evaluator.
Rakib Hasan 2017-01-05 03:10:44 -05:00 committed by Guolin Ke
Parent f893fbf6f2
Commit 96d08f42f8
2 changed files: 243 additions and 0 deletions

25
pmml/README.md Normal file

@@ -0,0 +1,25 @@
PMML Generator
==============
The script pmml.py translates LightGBM models, stored in LightGBM_model.txt, to Predictive Model Markup Language (PMML). The resulting models can then be imported by other analytics applications. The models that PMML can describe include decision trees. The PMML specification can be found at the Data Mining Group's [website](http://dmg.org/pmml/v4-3/GeneralStructure.html).
To generate the PMML file, run the following commands.
```
lightgbm config=train.conf
python pmml.py LightGBM_model.txt
```
The Python script creates a file called **LightGBM_pmml.xml**. Inside the file is a `MiningModel` tag, which contains `TreeModel` tags. Each `TreeModel` tag holds the PMML translation of one decision tree from the LightGBM_model.txt file. The model described by the **LightGBM_pmml.xml** file can be transferred to other analytics applications. For instance, you can use the PMML file as input to the jpmml-evaluator API. Follow the steps below to run a model described by **LightGBM_pmml.xml**.
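As a quick sanity check before running the evaluator, the number of `TreeModel` tags in the generated file should match the number of trees in the model file. The following is a minimal sketch, not part of pmml.py; the helper name `count_tree_models` is hypothetical, and it assumes the file declares the PMML 4.3 namespace that the script writes on the root element.

```python
import xml.etree.ElementTree as ET

# Namespace that pmml.py declares on the <PMML> root element.
PMML_NS = {"pmml": "http://www.dmg.org/PMML-4_3"}


def count_tree_models(pmml_source):
    """Return the number of TreeModel elements in a PMML document.

    pmml_source may be a file path or an open file object.
    (Hypothetical helper, shown for illustration only.)
    """
    root = ET.parse(pmml_source).getroot()
    # ".//" searches all descendants, so the nesting depth does not matter.
    return len(root.findall(".//pmml:TreeModel", PMML_NS))


# Example call, assuming the file is in the current directory:
# print(count_tree_models("LightGBM_pmml.xml"))
```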
##### Steps to run jpmml-evaluator
1. First, clone the repository
```
git clone https://github.com/jpmml/jpmml-evaluator.git
```
2. Build using Maven
```
mvn clean install
```
3. Run the EvaluationExample class on the model file using the following command
```
java -cp example-1.3-SNAPSHOT.jar org.jpmml.evaluator.EvaluationExample --model LightGBM_pmml.xml --input input.csv --output output.csv
```
Note that in order to run the model on the input.csv file, input.csv must have the same number of columns as specified by the `DataDictionary` field in the PMML file. Also, the column headers in input.csv must match the column names specified by the `MiningSchema` field. The output.csv file contains all the columns of input.csv plus one new column, which holds the score calculated by running each row's data through the model. More information about jpmml-evaluator can be found at its [GitHub repository](https://github.com/jpmml/jpmml-evaluator).
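The header requirement above can be checked before invoking the evaluator. The sketch below is an illustration only and is not shipped with pmml.py; the helper names `mining_field_names` and `compare_header` are hypothetical, and it assumes the top-level `MiningSchema` sits directly under `MiningModel`, as in the file the script writes.

```python
import csv
import xml.etree.ElementTree as ET

# Namespace that pmml.py declares on the <PMML> root element.
PMML_NS = {"pmml": "http://www.dmg.org/PMML-4_3"}


def mining_field_names(pmml_source):
    """Return the MiningField names of the top-level MiningSchema, in order."""
    root = ET.parse(pmml_source).getroot()
    fields = root.findall(
        "pmml:MiningModel/pmml:MiningSchema/pmml:MiningField", PMML_NS)
    return [field.get("name") for field in fields]


def compare_header(pmml_source, csv_file):
    """Compare a CSV header row against the PMML mining fields.

    csv_file is an open text file. Returns (missing, unexpected):
    fields the CSV lacks, and CSV columns the model does not know.
    """
    header = next(csv.reader(csv_file))
    expected = mining_field_names(pmml_source)
    missing = [name for name in expected if name not in header]
    unexpected = [name for name in header if name not in expected]
    return missing, unexpected


# Example call, assuming both files are in the current directory:
# with open("input.csv", newline="") as f:
#     print(compare_header("LightGBM_pmml.xml", f))
```

Both returned lists should be empty for an input file that meets the requirement.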

218
pmml/pmml.py Normal file

@@ -0,0 +1,218 @@
from __future__ import print_function
from decimal import Decimal
import sys
import os
import traceback


def unique_id():
    # hand out consecutive node ids; the counter is reset for every tree
    global unique_node_id
    nid = unique_node_id
    unique_node_id += 1
    return nid
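

def get_value_string(line):
    # text after the first '=' on a "key=value" line
    return line[line.index('=') + 1:]


def get_array_strings(line):
    return line[line.index('=') + 1:].split()


def get_array_ints(line):
    # return a list (not a lazy map object) so it can be indexed on Python 3
    return [int(x) for x in line[line.index('=') + 1:].split()]


def get_array_floats(line):
    # Decimal keeps the values exactly as written in the model file
    return [Decimal(x) for x in line[line.index('=') + 1:].split()]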


def get_field_name(node_id, prev_node_idx, is_child):
    idx = leaf_parent[node_id - 1] if is_child else prev_node_idx
    return feature_names[split_feature[idx]]


def get_threshold(node_id, prev_node_idx, is_child):
    idx = leaf_parent[node_id - 1] if is_child else prev_node_idx
    return threshold[idx]


def print_simple_predicate(
        tab_length,
        node_id,
        is_left_child,
        prev_node_idx,
        is_leaf,
        pmml_out):
    # decision_type 1 is an equality split, anything else a threshold split
    if is_left_child:
        op = 'equal' if decision_type[prev_node_idx] == 1 else 'lessOrEqual'
    else:
        op = 'notEqual' if decision_type[prev_node_idx] == 1 else 'greaterThan'
    print('\t' * (tab_length + 1) + ("<SimplePredicate field=\"{0}\" " + " operator=\"{1}\" value=\"{2}\" />").format(
        get_field_name(node_id, prev_node_idx, is_leaf), op, get_threshold(node_id, prev_node_idx, is_leaf)), file=pmml_out)


def print_nodes_pmml(**kwargs):
    node_id = kwargs['node_id']
    pmml_out = kwargs['out_file']
    tab_len = kwargs['tab_length']
    # a negative id -k marks a leaf stored at index k - 1 of the leaf arrays
    if node_id < 0:
        node_id = -1 * node_id
        score = leaf_value[node_id - 1]
        recordCount = leaf_count[node_id - 1]
        is_leaf = True
    else:
        score = internal_value[node_id]
        recordCount = internal_count[node_id]
        is_leaf = False
    print(
        '\t' *
        tab_len +
        (
            "<Node id=\"{0}\" score=\"{1}\" " +
            " recordCount=\"{2}\">").format(
            unique_id(),
            score,
            recordCount),
        file=pmml_out)
    print_simple_predicate(
        tab_len,
        node_id,
        kwargs['is_left_child'],
        kwargs['prev_node_idx'],
        is_leaf,
        pmml_out)
    # recurse into both children of an internal node
    if not is_leaf:
        print_nodes_pmml(
            node_id=left_child[node_id],
            tab_length=tab_len + 1,
            is_left_child=True,
            prev_node_idx=node_id,
            out_file=pmml_out)
        print_nodes_pmml(
            node_id=right_child[node_id],
            tab_length=tab_len + 1,
            is_left_child=False,
            prev_node_idx=node_id,
            out_file=pmml_out)
    print('\t' * tab_len + "</Node>", file=pmml_out)


# print out the pmml for a decision tree
def print_pmml(pmml_out):
    # specify the objective as function name and binarySplit for
    # splitCharacteristic because each node has 2 children
    print(
        "\t\t\t\t<TreeModel functionName=\"regression\" splitCharacteristic=\"binarySplit\">",
        file=pmml_out)
    print("\t\t\t\t\t<MiningSchema>", file=pmml_out)
    # list each feature name as a mining field, and treat all outliers as is,
    # unless specified
    for feature in feature_names:
        print(
            "\t\t\t\t\t\t<MiningField name=\"%s\"/>" %
            (feature), file=pmml_out)
    print("\t\t\t\t\t</MiningSchema>", file=pmml_out)
    # begin printing out the decision tree
    print("\t\t\t\t\t<Node id=\"%d\" score=\"%s\" recordCount=\"%d\">" %
          (unique_id(), internal_value[0], internal_count[0]), file=pmml_out)
    print("\t\t\t\t\t\t<True/>", file=pmml_out)
    print_nodes_pmml(
        node_id=left_child[0],
        tab_length=6,
        is_left_child=True,
        prev_node_idx=0,
        out_file=pmml_out)
    print_nodes_pmml(
        node_id=right_child[0],
        tab_length=6,
        is_left_child=False,
        prev_node_idx=0,
        out_file=pmml_out)
    print("\t\t\t\t\t</Node>", file=pmml_out)
    print("\t\t\t\t</TreeModel>", file=pmml_out)


if len(sys.argv) != 2:
    print('usage: pmml.py <input model file>')
    sys.exit(0)

# open the model file and then process it
try:
    with open(sys.argv[1]) as model_in:
        # drop empty lines; a list (not a lazy filter object) is needed
        # because the content is indexed below
        model_content = [
            line for line in model_in.read().strip().split('\n')
            if line != '']
        # header lines 4, 5 and 6 hold the objective, sigmoid and feature names
        objective = get_value_string(model_content[4])
        sigmoid = Decimal(get_value_string(model_content[5]))
        feature_names = get_array_strings(model_content[6])
        model_content = model_content[7:]
        line_no = 0
        segment_id = 1
        with open('LightGBM_pmml.xml', 'w') as pmml_out:
            print(
                "<PMML version=\"4.3\" \n" +
                "\t\txmlns=\"http://www.dmg.org/PMML-4_3\"\n" +
                "\t\txmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n" +
                "\t\txsi:schemaLocation=\"http://www.dmg.org/PMML-4_3 http://dmg.org/pmml/v4-3/pmml-4-3.xsd\"" +
                ">",
                file=pmml_out)
            print("\t<Header copyright=\"Microsoft\">", file=pmml_out)
            print("\t\t<Application name=\"LightGBM\"/>", file=pmml_out)
            print("\t</Header>", file=pmml_out)
            # print out data dictionary entries for each column
            print(
                "\t<DataDictionary numberOfFields=\"%d\">" %
                len(feature_names), file=pmml_out)
            # not adding any interval definition, all values are currently
            # valid
            for feature in feature_names:
                print(
                    "\t\t<DataField name=\"" +
                    feature +
                    "\" optype=\"continuous\" dataType=\"double\"/>",
                    file=pmml_out)
            print("\t</DataDictionary>", file=pmml_out)
            print("\t<MiningModel functionName=\"regression\">", file=pmml_out)
            print("\t\t<MiningSchema>", file=pmml_out)
            # list each feature name as a mining field, and treat all outliers
            # as is, unless specified
            for feature in feature_names:
                print(
                    "\t\t\t<MiningField name=\"%s\"/>" %
                    (feature), file=pmml_out)
            print("\t\t</MiningSchema>", file=pmml_out)
            print(
                "\t\t<Segmentation multipleModelMethod=\"sum\">",
                file=pmml_out)
            # read each array that contains pertinent information for the pmml;
            # these arrays will be used to traverse and recreate the decision
            # tree
            while model_content[line_no][:4] == 'Tree':
                print("\t\t\t<Segment id=\"%d\">" % segment_id, file=pmml_out)
                print("\t\t\t\t<True/>", file=pmml_out)
                tree_no = model_content[line_no][5:]
                num_leaves = int(get_value_string(model_content[line_no + 1]))
                split_feature = get_array_ints(model_content[line_no + 2])
                threshold = get_array_floats(model_content[line_no + 4])
                decision_type = get_array_ints(model_content[line_no + 5])
                left_child = get_array_ints(model_content[line_no + 6])
                right_child = get_array_ints(model_content[line_no + 7])
                leaf_parent = get_array_ints(model_content[line_no + 8])
                leaf_value = get_array_floats(model_content[line_no + 9])
                leaf_count = get_array_ints(model_content[line_no + 10])
                internal_value = get_array_floats(model_content[line_no + 11])
                internal_count = get_array_ints(model_content[line_no + 12])
                unique_node_id = 0
                print_pmml(pmml_out)
                print("\t\t\t</Segment>", file=pmml_out)
                # each tree occupies 13 non-empty lines in the model file
                line_no += 13
                segment_id += 1
            print("\t\t</Segmentation>", file=pmml_out)
            print("\t</MiningModel>", file=pmml_out)
            print("</PMML>", file=pmml_out)
except Exception as ioex:
    print(ioex)
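
Judging from how the script walks the per-tree arrays, each tree is stored as parallel arrays indexed by internal node, with a negative child index `-k` pointing at leaf `k - 1`. The sketch below is a standalone illustration of that convention and is not part of pmml.py; the function names `predict_tree` and `predict_sum` are hypothetical, rows are assumed to be plain lists indexed by feature position, and every tree is assumed to have at least one split.

```python
def predict_tree(row, split_feature, threshold, decision_type,
                 left_child, right_child, leaf_value):
    """Score one row against a single tree stored as parallel arrays.

    Mirrors the predicates pmml.py emits: the left branch is taken when
    the value equals the threshold (decision_type == 1) or is less than
    or equal to it (any other decision_type).
    """
    node = 0  # index 0 is the root internal node
    while node >= 0:
        value = row[split_feature[node]]
        if decision_type[node] == 1:
            go_left = value == threshold[node]
        else:
            go_left = value <= threshold[node]
        node = left_child[node] if go_left else right_child[node]
    # a negative index -k refers to leaf k - 1
    return leaf_value[-node - 1]


def predict_sum(row, trees):
    """Sum the per-tree scores, as multipleModelMethod="sum" specifies.

    trees is a list of dicts whose keys match predict_tree's parameters.
    """
    return sum(predict_tree(row, **tree) for tree in trees)
```

Comparing such a hand-computed score with the value jpmml-evaluator writes to output.csv is one way to spot-check a generated file.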