# Fast Retraining

In this repo we compare two of the fastest gradient boosted decision tree libraries: XGBoost and LightGBM. We evaluate them on datasets from several domains and of different sizes.
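
To give a flavor of the comparison, here is a minimal sketch that trains both libraries on the same data; the synthetic dataset and parameters are illustrative stand-ins for the ones used in the experiment notebooks.

```python
# Minimal sketch: train XGBoost and LightGBM on the same synthetic data.
# The real experiments use the datasets and tuned parameters in the notebooks.
import xgboost as xgb
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# XGBoost wraps the data in a DMatrix and trains through the core API.
dtrain = xgb.DMatrix(X_train, label=y_train)
xgb_model = xgb.train({'objective': 'binary:logistic'}, dtrain,
                      num_boost_round=100)

# LightGBM mirrors the pattern with its own Dataset container.
ltrain = lgb.Dataset(X_train, label=y_train)
lgb_model = lgb.train({'objective': 'binary'}, ltrain, num_boost_round=100)
```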

On July 25, 2017, we published a blog post evaluating both libraries and discussing the benchmark results: *Lessons Learned From Benchmarking Fast Machine Learning Algorithms*.

## Installation and Setup

The installation instructions can be found in [INSTALL.md](INSTALL.md).

## Project

In the folder `experiments` you can find the different experiments of the project. We developed six experiments with the CPU and GPU versions of the libraries; a sketch of the CPU/GPU parameter switches follows the list.

* Airline
* BCI
* Football
* Planet Kaggle
* Fraud Detection
* HIGGS
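
The CPU and GPU runs differ mainly in a couple of library parameters. As a rough sketch, these are the standard switches for the three benchmarked variants; the exact configurations used per experiment live in the notebooks, and the XGBoost GPU method names in particular have changed across releases.

```python
# Sketch of the parameter switches behind the benchmarked variants.
# The exact per-experiment settings are in the notebooks under experiments/;
# note that XGBoost's GPU method names have changed across releases.
xgb_cpu      = {'tree_method': 'exact'}      # "xgb" column, CPU
xgb_hist_cpu = {'tree_method': 'hist'}       # "xgb_hist" column, CPU
lgb_cpu      = {}                            # "lgb" column, CPU is the default

xgb_gpu      = {'tree_method': 'gpu_exact'}  # GPU exact split finding
xgb_hist_gpu = {'tree_method': 'gpu_hist'}   # GPU histogram method
lgb_gpu      = {'device': 'gpu'}             # LightGBM built with GPU support
```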

The common code of the project is in the folder `experiments/libs`.

## Benchmark

The following table summarizes the training times (in seconds) and the speed ratios measured in the experiments:

| Dataset | Experiment | Data size | Features | xgb time (s): CPU (GPU) | xgb_hist time (s): CPU (GPU) | lgb time (s): CPU (GPU) | ratio xgb/lgb: CPU (GPU) | ratio xgb_hist/lgb: CPU (GPU) |
|---------|------------|-----------|----------|-------------------------|------------------------------|-------------------------|--------------------------|-------------------------------|
| Football | Link CPU / Link GPU | 19,673 | 46 | 2.27 (7.09) | 2.47 (4.58) | 0.58 (0.97) | 3.90 (7.26) | 4.25 (4.69) |
| Fraud Detection | Link CPU / Link GPU | 284,807 | 30 | 4.34 (5.80) | 2.01 (1.64) | 0.66 (0.29) | 6.58 (19.74) | 3.04 (5.58) |
| BCI | Link CPU / Link GPU | 20,497 | 2,048 | 11.51 (12.93) | 41.84 (42.69) | 7.31 (2.76) | 1.57 (4.67) | 5.72 (15.43) |
| Planet Kaggle | Link CPU / Link GPU | 40,479 | 2,048 | 313.89 (-) | 2115.28 (2028.43) | 194.57 (317.68) | 1.61 (-) | 10.87 (6.38) |
| HIGGS | Link CPU / Link GPU | 11,000,000 | 28 | 2996.16 (-) | 121.21 (114.88) | 119.34 (71.87) | 25.10 (-) | 1.01 (1.59) |
| Airline | Link CPU / Link GPU | 115,069,017 | 13 | - (-) | 1242.09 (1271.91) | 1056.20 (645.40) | - (-) | 1.17 (1.97) |
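
Each ratio entry is simply the quotient of two wall-clock training times. A self-contained sketch of that measurement, on synthetic stand-in data:

```python
# Sketch: time each library's training call and compute the speed ratio
# reported in the "ratio xgb/lgb" columns. Synthetic data stands in for
# the real datasets.
import time
import xgboost as xgb
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

t0 = time.time()
xgb.train({'objective': 'binary:logistic', 'tree_method': 'exact'},
          xgb.DMatrix(X, label=y), num_boost_round=100)
xgb_time = time.time() - t0

t0 = time.time()
lgb.train({'objective': 'binary'}, lgb.Dataset(X, label=y),
          num_boost_round=100)
lgb_time = time.time() - t0

print('ratio xgb/lgb: %.2f' % (xgb_time / lgb_time))
```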

The next table summarizes the model performance, measured with the F1 score:

| Dataset | Experiment | Data size | Features | xgb F1: CPU (GPU) | xgb_hist F1: CPU (GPU) | lgb F1: CPU (GPU) |
|---------|------------|-----------|----------|-------------------|------------------------|-------------------|
| Football | Link CPU / Link GPU | 19,673 | 46 | 0.458 (0.470) | 0.460 (0.472) | 0.459 (0.470) |
| Fraud Detection | Link CPU / Link GPU | 284,807 | 30 | 0.824 (0.821) | 0.802 (0.814) | 0.813 (0.811) |
| BCI | Link CPU / Link GPU | 20,497 | 2,048 | 0.110 (0.093) | 0.142 (0.120) | 0.137 (0.138) |
| Planet Kaggle | Link CPU / Link GPU | 40,479 | 2,048 | 0.805 (-) | 0.822 (0.822) | 0.822 (0.821) |
| HIGGS | Link CPU / Link GPU | 11,000,000 | 28 | 0.763 (-) | 0.767 (0.767) | 0.768 (0.767) |
| Airline | Link CPU / Link GPU | 115,069,017 | 13 | - (-) | 0.741 (0.745) | 0.732 (0.745) |
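
The F1 scores come from thresholding each model's predicted probabilities and comparing against the test labels. A minimal sketch with scikit-learn, assuming a 0.5 decision threshold (the notebooks define the exact evaluation for each dataset):

```python
# Sketch: compute the F1 score from probability predictions.
# A 0.5 decision threshold is an assumption; the notebooks define the
# exact evaluation used for each dataset.
import numpy as np
from sklearn.metrics import f1_score

def evaluate_f1(y_true, y_prob, threshold=0.5):
    """Binarize predicted probabilities and return the F1 score."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return f1_score(y_true, y_pred)

# Usage with the models from the first sketch above:
# print(evaluate_f1(y_test, xgb_model.predict(xgb.DMatrix(X_test))))
# print(evaluate_f1(y_test, lgb_model.predict(X_test)))
```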

The experiments were run on an Azure NV24 VM with 24 cores, 224 GB of memory, and 4 NVIDIA M60 GPUs. Both the CPU and GPU experiments ran on Ubuntu 16.04.

## Contributing

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.