Show how to perform fast retraining with LightGBM in different business cases

azure benchmark boosted-trees distributed-systems gbdt gbm gbrt gpu kaggle lightgbm machine-learning xgboost

Miguel González-Fierro e43c919521 Merge pull request #71 from Azure/kaggle_bug_fix Fixes bug: Adds missing import		2017-08-29 20:37:39 +02:00
environment	updated documentation	2017-06-20 15:05:55 +01:00
experiments	Fixes bug: Adds missing import	2017-08-29 17:27:31 +00:00
.gitignore	small modification gitignore	2017-06-26 14:49:36 +00:00
INSTALL.md	updated INSTALL.md	2017-07-17 14:09:35 +01:00
LICENSE	Some testing	2017-03-28 18:26:53 +00:00
README.md	update readme with post link fix #65	2017-08-15 14:23:03 +01:00
requirements.txt	added ipykernel to requirments	2017-08-29 15:48:57 +01:00

README.md

Fast Retraining

In this repo we compare two of the fastest boosted decision tree libraries: XGBoost and LightGBM. We will evaluate them across datasets of several domains and different sizes.

On July 25, 2017, we published a blog post that evaluates both libraries and discusses the benchmark results: Lessons Learned From Benchmarking Fast Machine Learning Algorithms.

Installation and Setup

The installation instructions can be found in INSTALL.md.

Project

The folder experiments contains the experiments of the project. We developed 6 experiments with the CPU and GPU versions of the libraries:

Airline
BCI
Football
Planet Kaggle
Fraud Detection
HIGGS
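Each experiment reports how long a library takes on a dataset. As a minimal sketch of how such a measurement can be taken (not the project's actual benchmarking code), the helper below times a single call with a wall-clock timer. The names `time_call` and `dummy_train` are hypothetical and only serve the illustration; `dummy_train` is a stand-in workload, not a real model.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def dummy_train(n_rounds):
    # Hypothetical stand-in for a real training call; it only burns CPU time.
    total = 0
    for i in range(n_rounds):
        total += i * i
    return total


model, seconds = time_call(dummy_train, 100_000)
print(f"training took {seconds:.4f} s")
```

In a real run, the callable passed to `time_call` would wrap the library's own training function, for example `xgboost.train` or `lightgbm.train`, with the data already loaded so that only training is timed.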

The folder experiments/libs contains the common code for the project.

Benchmark

The following table summarizes the timing results (in seconds) and the time ratios between libraries for the benchmarks performed in the experiments:

Dataset	Experiment	Data size	Features	xgb time: CPU (GPU)	xgb_hist time: CPU (GPU)	lgb time: CPU (GPU)	ratio xgb/lgb: CPU (GPU)	ratio xgb_hist/lgb: CPU (GPU)
Football	Link CPU Link GPU	19673	46	2.27 (7.09)	2.47 (4.58)	0.58 (0.97)	3.90 (7.26)	4.25 (4.69)
Fraud Detection	Link CPU Link GPU	284807	30	4.34 (5.80)	2.01 (1.64)	0.66 (0.29)	6.58 (19.74)	3.04 (5.58)
BCI	Link CPU Link GPU	20497	2048	11.51 (12.93)	41.84 (42.69)	7.31 (2.76)	1.57 (4.67)	5.72 (15.43)
Planet Kaggle	Link CPU Link GPU	40479	2048	313.89 (-)	2115.28 (2028.43)	194.57 (317.68)	1.61 (-)	10.87 (6.38)
HIGGS	Link CPU Link GPU	11000000	28	2996.16 (-)	121.21 (114.88)	119.34 (71.87)	25.10 (-)	1.01 (1.59)
Airline	Link CPU Link GPU	115069017	13	- (-)	1242.09 (1271.91)	1056.20 (645.40)	- (-)	1.17 (1.97)
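The ratio columns divide the XGBoost time by the LightGBM time, so a value above 1 means LightGBM was faster in that run. The sketch below recomputes the CPU ratios for the Planet Kaggle row from the times in the table. The function name `speed_ratio` is hypothetical. Because the table times are rounded to two decimals, ratios recomputed this way can differ from the published ones in the last digit.

```python
def speed_ratio(baseline_seconds, candidate_seconds):
    """Return how many times longer the baseline took than the candidate."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


# CPU times in seconds, copied from the Planet Kaggle row of the table.
xgb_cpu = 313.89
xgb_hist_cpu = 2115.28
lgb_cpu = 194.57

ratio_xgb_lgb = speed_ratio(xgb_cpu, lgb_cpu)
ratio_xgb_hist_lgb = speed_ratio(xgb_hist_cpu, lgb_cpu)

print(f"xgb/lgb (CPU):      {ratio_xgb_lgb:.2f}")
print(f"xgb_hist/lgb (CPU): {ratio_xgb_hist_lgb:.2f}")
```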

In the next table we summarize the performance results using the F1-Score.

Dataset	Experiment	Data size	Features	xgb F1: CPU (GPU)	xgb_hist F1: CPU (GPU)	lgb F1: CPU (GPU)
Football	Link Link	19673	46	0.458 (0.470)	0.460 (0.472)	0.459 (0.470)
Fraud Detection	Link Link	284807	30	0.824 (0.821)	0.802 (0.814)	0.813 (0.811)
BCI	Link Link	20497	2048	0.110 (0.093)	0.142 (0.120)	0.137 (0.138)
Planet Kaggle	Link Link	40479	2048	0.805 (-)	0.822 (0.822)	0.822 (0.821)
HIGGS	Link Link	11000000	28	0.763 (-)	0.767 (0.767)	0.768 (0.767)
Airline	Link Link	115069017	13	- (-)	0.741 (0.745)	0.732 (0.745)
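For reference, the F1-Score is the harmonic mean of precision and recall. The sketch below computes it for binary labels in plain Python. It assumes a binary task with 1 as the positive class; this README does not state how the score is averaged for each dataset, so the function illustrates the metric rather than the exact evaluation code. The name `f1_score_binary` and the sample labels are made up for the example.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

f1 = f1_score_binary(y_true, y_pred)
print(f"F1 = {f1:.3f}")
```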

The experiments were run on an Azure NV24 VM with 24 cores and 224 GB of memory. The machine has 4 NVIDIA M60 GPUs. For both the CPU and GPU runs we used Ubuntu 16.04.

Contributing

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.