[python][docs] more detailed docs for trees_to_dataframe(), create_tree_digraph(), plot_tree() (#3618)

* [python] more detailed docs for trees_to_dataframe(), create_tree_digraph(), plot_tree()

* fixing warnings

* fix warnings

* undo unnecessary space

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* single line, better weight descriptions

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* column names

* Update python-package/lightgbm/plotting.py

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
This commit is contained in:
James Lamb 2020-12-07 12:04:06 +00:00 коммит произвёл GitHub
Родитель f38f118ce2
Коммит eb035013c3
Не найден ключ, соответствующий данной подписи
Идентификатор ключа GPG: 4AEE18F83AFDEB23
2 изменённых файлов: 54 добавлений и 6 удалений

Просмотреть файл

@ -2202,6 +2202,24 @@ class Booster(object):
def trees_to_dataframe(self):
"""Parse the fitted model and return in an easy-to-read pandas DataFrame.
The returned DataFrame has the following columns.
- ``tree_index`` : int64, which tree a node belongs to. 0-based, so a value of ``6``, for example, means "this node is in the 7th tree".
- ``node_depth`` : int64, how far a node is from the root of the tree. The root node has a value of ``1``, its direct children are ``2``, etc.
- ``node_index`` : string, unique identifier for a node.
- ``left_child`` : string, ``node_index`` of the child node to the left of a split. ``None`` for leaf nodes.
- ``right_child`` : string, ``node_index`` of the child node to the right of a split. ``None`` for leaf nodes.
- ``parent_index`` : string, ``node_index`` of this node's parent. ``None`` for the root node.
- ``split_feature`` : string, name of the feature used for splitting. ``None`` for leaf nodes.
- ``split_gain`` : float64, gain from adding this split to the tree. ``NaN`` for leaf nodes.
- ``threshold`` : float64, value of the feature used to decide which side of the split a record will go down. ``NaN`` for leaf nodes.
- ``decision_type`` : string, logical operator describing how to compare a value to ``threshold``. For example, ``split_feature = "Column_10", threshold = 15, decision_type = "<="`` means that records where ``Column_10 <= 15`` follow the left side of the split, otherwise follows the right side of the split. ``None`` for leaf nodes.
- ``missing_direction`` : string, split direction that missing values should go to. ``None`` for leaf nodes.
- ``missing_type`` : string, describes what types of values are treated as missing.
- ``value`` : float64, predicted value for this leaf node, multiplied by the learning rate.
- ``weight`` : float64 or int64, sum of hessian (second-order derivative of objective), summed over observations that fall in this node.
- ``count`` : int64, number of records in the training data that fall into this node.
Returns
-------
result : pandas DataFrame

Просмотреть файл

@ -474,6 +474,16 @@ def create_tree_digraph(booster, tree_index=0, show_info=None, precision=3,
orientation='horizontal', **kwargs):
"""Create a digraph representation of specified tree.
Each node in the graph represents a node in the tree.
Non-leaf nodes have labels like ``Column_10 <= 875.9``, which means
"this node splits on the feature named "Column_10", with threshold 875.9".
Leaf nodes have labels like ``leaf 2: 0.422``, which means "this node is a
leaf node, and the predicted value for records that fall into this node
is 0.422". The number (``2``) is an internal unique identifier and doesn't
have any special meaning.
.. note::
For more information please visit
@ -487,9 +497,14 @@ def create_tree_digraph(booster, tree_index=0, show_info=None, precision=3,
The index of a target tree to convert.
show_info : list of strings or None, optional (default=None)
What information should be shown in nodes.
Possible values of list items:
'split_gain', 'internal_value', 'internal_count', 'internal_weight',
'leaf_count', 'leaf_weight', 'data_percentage'.
- ``'split_gain'`` : gain from adding this split to the model
- ``'internal_value'`` : raw predicted value that would be produced by this node if it was a leaf node
- ``'internal_count'`` : number of records from the training data that fall into this non-leaf node
- ``'internal_weight'`` : total weight of all nodes that fall into this non-leaf node
- ``'leaf_count'`` : number of records from the training data that fall into this leaf node
- ``'leaf_weight'`` : total weight (sum of hessian) of all observations that fall into this leaf node
- ``'data_percentage'`` : percentage of training data that fall into this node
precision : int or None, optional (default=3)
Used to restrict the display of floating point values to a certain precision.
orientation : string, optional (default='horizontal')
@ -536,6 +551,16 @@ def plot_tree(booster, ax=None, tree_index=0, figsize=None, dpi=None,
show_info=None, precision=3, orientation='horizontal', **kwargs):
"""Plot specified tree.
Each node in the graph represents a node in the tree.
Non-leaf nodes have labels like ``Column_10 <= 875.9``, which means
"this node splits on the feature named "Column_10", with threshold 875.9".
Leaf nodes have labels like ``leaf 2: 0.422``, which means "this node is a
leaf node, and the predicted value for records that fall into this node
is 0.422". The number (``2``) is an internal unique identifier and doesn't
have any special meaning.
.. note::
It is preferable to use ``create_tree_digraph()`` because of its lossless quality
@ -556,9 +581,14 @@ def plot_tree(booster, ax=None, tree_index=0, figsize=None, dpi=None,
Resolution of the figure.
show_info : list of strings or None, optional (default=None)
What information should be shown in nodes.
Possible values of list items:
'split_gain', 'internal_value', 'internal_count', 'internal_weight',
'leaf_count', 'leaf_weight', 'data_percentage'.
- ``'split_gain'`` : gain from adding this split to the model
- ``'internal_value'`` : raw predicted value that would be produced by this node if it was a leaf node
- ``'internal_count'`` : number of records from the training data that fall into this non-leaf node
- ``'internal_weight'`` : total weight of all nodes that fall into this non-leaf node
- ``'leaf_count'`` : number of records from the training data that fall into this leaf node
- ``'leaf_weight'`` : total weight (sum of hessian) of all observations that fall into this leaf node
- ``'data_percentage'`` : percentage of training data that fall into this node
precision : int or None, optional (default=3)
Used to restrict the display of floating point values to a certain precision.
orientation : string, optional (default='horizontal')