Many boosting tools use pre-sort-based algorithms\ `[2, 3] <#references>`__ (e.g. the default algorithm in xgboost) for decision tree learning. It is a simple solution, but not easy to optimize.
LightGBM uses histogram-based algorithms\ `[4, 5, 6] <#references>`__, which bucket continuous feature (attribute) values into discrete bins, to speed up training and reduce memory usage.
- So it only needs to construct histograms for one leaf (the one with smaller ``#data`` than its neighbor), and can then get the histograms of its neighbor by histogram subtraction at small cost (``O(#bins)``)
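The idea can be illustrated with a minimal NumPy sketch (an illustration only, not LightGBM's actual C++ implementation): continuous values are bucketed into discrete bins, gradient histograms are accumulated per bin, and a sibling leaf's histogram is obtained by subtraction.

.. code-block:: python

    # A minimal NumPy illustration of the histogram-based idea
    # (not LightGBM's actual C++ implementation).
    import numpy as np

    def build_histogram(bin_indices, gradients, n_bins):
        """Sum the gradients of the samples falling into each bin: O(#data)."""
        hist = np.zeros(n_bins)
        np.add.at(hist, bin_indices, gradients)
        return hist

    rng = np.random.default_rng(0)
    feature = rng.normal(size=1000)      # continuous feature (attribute) values
    gradients = rng.normal(size=1000)    # per-sample gradients

    # Bucket the continuous values into discrete bins (quantile bin edges).
    n_bins = 16
    bin_edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_indices = np.searchsorted(bin_edges, feature)

    parent_hist = build_histogram(bin_indices, gradients, n_bins)

    # Construct the histogram only for the smaller child of a split ...
    in_left = bin_indices < 6
    left_hist = build_histogram(bin_indices[in_left], gradients[in_left], n_bins)

    # ... and get the sibling's histogram by subtraction: O(#bins), not O(#data).
    right_hist = parent_hist - left_hist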
Leaf-wise (Best-first) Tree Growth
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When growing the same number of leaves (``#leaf``), the leaf-wise algorithm can reduce more loss than the level-wise algorithm.
Leaf-wise growth may cause over-fitting when ``#data`` is small.
So LightGBM includes the additional parameter ``max_depth`` to limit tree depth and avoid over-fitting (the tree still grows leaf-wise).
.. image:: ./_static/images/leaf-wise.png
   :align: center
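The contrast with level-wise growth can be sketched as a best-first loop over a heap of candidate leaves. The ``gain`` and ``split`` callables below are toy stand-ins, not LightGBM internals; the sketch only shows the control flow of leaf-wise growth with a ``max_depth`` guard.

.. code-block:: python

    # A toy sketch of leaf-wise (best-first) growth: always split the leaf with
    # the largest loss reduction, until ``num_leaves`` is reached or the split
    # would exceed ``max_depth``.  ``gain`` and ``split`` are toy stand-ins.
    import heapq
    import itertools

    def grow_leaf_wise(root, gain, split, num_leaves=31, max_depth=6):
        tie = itertools.count()                      # heap tie-breaker
        heap = [(-gain(root), next(tie), 0, root)]   # max-heap keyed on gain
        leaves, result = 1, [root]
        while heap and leaves < num_leaves:
            neg_gain, _, depth, leaf = heapq.heappop(heap)
            if -neg_gain <= 0 or depth >= max_depth:
                continue                             # no gain, or tree too deep
            left, right = split(leaf)
            result.remove(leaf)
            result.extend([left, right])
            leaves += 1
            for child in (left, right):
                heapq.heappush(heap, (-gain(child), next(tie), depth + 1, child))
        return result

    # Toy usage: a "leaf" is a half-open interval of sample indices, and bigger
    # leaves are pretended to yield bigger gains.
    leaves = grow_leaf_wise(
        root=(0, 1024),
        gain=lambda leaf: leaf[1] - leaf[0],
        split=lambda leaf: ((leaf[0], (leaf[0] + leaf[1]) // 2),
                            ((leaf[0] + leaf[1]) // 2, leaf[1])),
    )
    print(len(leaves), "leaves")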
Optimal Split for Categorical Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We often convert categorical features into one-hot encoding.
However, this is not a good solution for tree learners.
The reason is that, for high-cardinality categorical features, the tree built on one-hot features tends to be very unbalanced and needs to grow very deep to achieve good accuracy.
Instead, the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets; for a feature with ``k`` categories, there are ``2^(k-1) - 1`` possible partitions.
The basic idea is to sort the categories according to their relevance to the training target\ `[8] <#references>`__.
More specifically, LightGBM sorts the histogram (of a categorical feature) according to each bin's accumulated values (``sum_gradient / sum_hessian``), and then finds the best split on the sorted histogram.
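A minimal NumPy sketch of this search (an illustration of the idea, not LightGBM's internal code): accumulate ``sum_gradient`` and ``sum_hessian`` per category, sort the categories by their ratio, and scan the sorted order for the best two-subset partition.

.. code-block:: python

    # A minimal NumPy sketch of the categorical split search
    # (an illustration of the idea, not LightGBM internals).
    import numpy as np

    def best_categorical_split(categories, gradients, hessians, lam=1e-3):
        k = categories.max() + 1
        sum_g = np.bincount(categories, weights=gradients, minlength=k)
        sum_h = np.bincount(categories, weights=hessians, minlength=k)

        # Reorder the categories by their accumulated sum_gradient / sum_hessian.
        order = np.argsort(sum_g / (sum_h + lam))

        total_g, total_h = sum_g.sum(), sum_h.sum()
        best_gain, best_left_subset = -np.inf, None
        left_g = left_h = 0.0
        for i, cat in enumerate(order[:-1]):
            left_g += sum_g[cat]
            left_h += sum_h[cat]
            right_g, right_h = total_g - left_g, total_h - left_h
            # Standard gradient-boosting split gain (regularization mostly omitted).
            gain = (left_g ** 2 / (left_h + lam)
                    + right_g ** 2 / (right_h + lam)
                    - total_g ** 2 / (total_h + lam))
            if gain > best_gain:
                best_gain, best_left_subset = gain, set(order[: i + 1].tolist())
        return best_left_subset, best_gain

    rng = np.random.default_rng(0)
    cats = rng.integers(0, 8, size=1000)
    grads = rng.normal(size=1000) + 0.5 * (cats % 3)   # categories carry signal
    hess = np.ones(1000)
    print(best_categorical_split(cats, grads, hess))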
Optimization in Network Communication
-------------------------------------
Parallel learning in LightGBM only needs a few collective communication algorithms, such as "All reduce", "All gather" and "Reduce scatter".
These collective communication algorithms can provide much better performance than point-to-point communication.
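For illustration, merging per-worker histograms with an "All reduce" might look like the following sketch, assuming ``mpi4py`` and an MPI runtime are available (LightGBM's own network layer is implemented in C++ and does not use this code).

.. code-block:: python

    # Merging per-worker histograms with "All reduce", assuming mpi4py and an
    # MPI runtime are available (illustration only; LightGBM's network layer is
    # implemented in C++).  Run with e.g.: mpiexec -n 4 python allreduce_hist.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rng = np.random.default_rng(comm.Get_rank())

    # Each worker builds local gradient histograms of shape (#feature, #bin).
    local_hist = rng.random((100, 255))

    # After Allreduce, every worker holds the element-wise sum, i.e. the merged
    # global histograms, without any point-to-point messages in user code.
    global_hist = np.empty_like(local_hist)
    comm.Allreduce(local_hist, global_hist, op=MPI.SUM)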
Optimization in Parallel Learning
---------------------------------
LightGBM provides the following parallel learning algorithms.
Feature Parallel
~~~~~~~~~~~~~~~~
Traditional Algorithm
^^^^^^^^^^^^^^^^^^^^^
Feature parallel aims to parallelize the "Find Best Split" step of decision tree learning. The procedure of traditional feature parallel is:
1. Partition the data vertically (different machines hold different feature sets)
2. Workers find the local best split point {feature, threshold} on their local feature sets
3. Workers communicate the local best splits with each other and pick the global best one
4. The worker with the best split performs the split, then sends the split result of the data to the other workers
5. The other workers split the data according to the received result
The shortcomings of traditional feature parallel:

- Computation overhead: it cannot speed up "split", whose time complexity is ``O(#data)``.
  Thus, feature parallel does not speed up well when ``#data`` is large.
- Communication of the split result, which costs about ``O(#data / 8)`` (one bit per data point).
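The procedure and its communication cost can be mimicked in a toy single-process simulation (purely illustrative; the scoring inside ``best_split`` is a made-up stand-in for the real gain computation).

.. code-block:: python

    # A toy single-process simulation of traditional feature parallel
    # (purely illustrative; the scoring in ``best_split`` is a made-up stand-in).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((10_000, 8))                 # full data with 8 features
    y = (X[:, 3] > 0.5).astype(float)           # target depends on feature 3

    def best_split(X_local, feature_ids, y):
        """Return (score, global_feature_id, threshold) over the given columns."""
        best = (-np.inf, -1, 0.0)
        for j, fid in enumerate(feature_ids):
            thr = np.median(X_local[:, j])
            left, right = y[X_local[:, j] <= thr], y[X_local[:, j] > thr]
            score = abs(left.mean() - right.mean())
            best = max(best, (score, int(fid), float(thr)))
        return best

    # 1-2. Vertical partition: worker 0 holds features 0-3, worker 1 holds 4-7,
    #      and each finds its local best split.
    worker_features = [np.arange(0, 4), np.arange(4, 8)]
    local_bests = [best_split(X[:, f], f, y) for f in worker_features]

    # 3. Exchange local bests and agree on the global best split.
    score, fid, thr = max(local_bests)

    # 4-5. The owning worker must then send the per-row split result (about
    #      #data / 8 bytes once packed into bits) to every other worker.
    split_result = np.packbits(X[:, fid] <= thr)
    print(f"best feature={fid}, threshold={thr:.3f}, "
          f"split message ~ {split_result.nbytes} bytes for {len(X)} rows")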
Feature Parallel in LightGBM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Since feature parallel cannot speed up well when ``#data`` is large, we make a small change: instead of partitioning the data vertically, every worker holds the full data.
Thus, LightGBM doesn't need to communicate the split result of the data, since every worker knows how to split the data.
And ``#data`` won't be larger, so it is reasonable to hold the full data on every machine.
The procedure of feature parallel in LightGBM:
1. Workers find the local best split point {feature, threshold} on their local feature sets
2. Workers communicate the local best splits with each other and pick the global best one
3. Each worker performs the best split
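Continuing the toy simulation from the traditional feature parallel sketch above (reusing its ``X``, ``y``, ``worker_features`` and ``best_split``), the LightGBM-style variant searches the same splits but only needs to exchange the small ``{feature, threshold}`` message.

.. code-block:: python

    # Continuing the toy simulation above (reusing ``X``, ``y``, ``worker_features``
    # and ``best_split`` from the traditional feature parallel sketch).  Every
    # worker holds the full data, so the search is unchanged, but only the small
    # global best {feature, threshold} has to be shared between workers.
    local_bests = [best_split(X[:, f], f, y) for f in worker_features]
    score, fid, thr = max(local_bests)     # communicate just this tiny message
    goes_left = X[:, fid] <= thr           # every worker derives the partition itself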
However, this feature parallel algorithm still suffers from computation overhead for "split" when ``#data`` is large.
So it will be better to use data parallel when ``#data`` is large.
Data Parallel
~~~~~~~~~~~~~
Traditional Algorithm
^^^^^^^^^^^^^^^^^^^^^
Data parallel aims to parallelize the whole decision tree learning. The procedure of traditional data parallel is:
1. Partition the data horizontally
2. Workers use their local data to construct local histograms
3. Merge the local histograms from all workers into global histograms
4. Find the best split from the merged global histograms, then perform the split
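A toy single-process simulation of this procedure (illustrative only): rows are partitioned across workers, each worker builds per-feature gradient histograms from its rows, and the histograms are merged by element-wise summation before the split search.

.. code-block:: python

    # A toy single-process simulation of traditional data parallel
    # (illustrative only, not LightGBM internals).
    import numpy as np

    rng = np.random.default_rng(1)
    X_bins = rng.integers(0, 16, size=(10_000, 8))   # pre-binned feature values
    grad = rng.normal(size=10_000)                   # per-sample gradients

    def local_histograms(X_bins, grad, n_bins=16):
        """Per-feature gradient histograms for one worker's rows: (#feature, #bin)."""
        hists = np.zeros((X_bins.shape[1], n_bins))
        for j in range(X_bins.shape[1]):
            np.add.at(hists[j], X_bins[:, j], grad)
        return hists

    # 1-2. Horizontal partition: each worker holds a slice of the rows and
    #      constructs local histograms over all features.
    parts = np.array_split(np.arange(len(X_bins)), 4)
    worker_hists = [local_histograms(X_bins[p], grad[p]) for p in parts]

    # 3. Merging all (#feature x #bin) histograms on every worker is what makes
    #    the traditional scheme expensive; here it is just an element-wise sum.
    global_hists = np.sum(worker_hists, axis=0)
    # 4. The best split is then searched on ``global_hists``.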
The shortcomings of traditional data parallel:

- High communication cost.
  If using a point-to-point communication algorithm, the communication cost for one machine is about ``O(#machine * #feature * #bin)``.
  If using a collective communication algorithm (e.g. "All Reduce"), the communication cost is about ``O(2 * #feature * #bin)`` (see the cost of "All Reduce" in chapter 4.5 of `[9] <#references>`__).
Data Parallel in LightGBM
^^^^^^^^^^^^^^^^^^^^^^^^^

LightGBM reduces the communication cost of data parallel:

1. Instead of "Merge global histograms from all local histograms", LightGBM uses "Reduce Scatter" to merge the histograms of different (non-overlapping) features on different workers. Workers then find the local best split on their merged histograms and sync up the global best split.
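The "Reduce Scatter" idea can be illustrated by continuing the toy data parallel sketch above (reusing its ``worker_hists``): each worker ends up with the merged histograms of only its own feature block and searches that block for its best split.

.. code-block:: python

    # A toy illustration of the "Reduce Scatter" step, reusing ``worker_hists``
    # from the traditional data parallel sketch above.
    import numpy as np

    n_features = worker_hists[0].shape[0]
    feature_blocks = np.array_split(np.arange(n_features), len(worker_hists))

    merged_block_per_worker = [
        sum(hist[block] for hist in worker_hists)   # reduce: sum over workers
        for block in feature_blocks                 # scatter: one block per worker
    ]
    # Each worker now searches only its own (len(block), #bin) block for the best
    # split, and only the per-worker best {feature, threshold} is exchanged to
    # determine the global best split.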
References
----------

[1] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. "`LightGBM\: A Highly Efficient Gradient Boosting Decision Tree`_." In Advances in Neural Information Processing Systems (NIPS), pp. 3149-3157. 2017.
[2] Mehta, Manish, Rakesh Agrawal, and Jorma Rissanen. "SLIQ: A fast scalable classifier for data mining." International Conference on Extending Database Technology. Springer Berlin Heidelberg, 1996.
[3] Shafer, John, Rakesh Agrawal, and Manish Mehta. "SPRINT: A scalable parallel classifier for data mining." Proc. 1996 Int. Conf. Very Large Data Bases. 1996.
[4] Ranka, Sanjay, and V. Singh. "CLOUDS: A decision tree classifier for large datasets." Proceedings of the 4th Knowledge Discovery and Data Mining Conference. 1998.
[6] Li, Ping, Qiang Wu, and Christopher J. Burges. "Mcrank: Learning to rank using multiple classification and gradient boosting." Advances in neural information processing systems. 2007.
[8] Walter D. Fisher. "`On Grouping for Maximum Homogeneity`_." Journal of the American Statistical Association. Vol. 53, No. 284 (Dec., 1958), pp. 789-798.
[9] Thakur, Rajeev, Rolf Rabenseifner, and William Gropp. "`Optimization of collective communication operations in MPICH`_." International Journal of High Performance Computing Applications 19.1 (2005): 49-66.