commit

Guolin Ke 2016-10-11 14:40:43 +08:00
Commit 0ad02cd209
11 changed files with 788 additions and 0 deletions

193
Configuration.md Normal file

@@ -0,0 +1,193 @@
This page lists all the parameters of LightGBM.
## Parameters format
The parameter format is ```key1=value1 key2=value2 ...```. Parameters can be set both in a config file and on the command line. On the command line, there must be no spaces before or after ```=```. In a config file, each line may contain only one parameter, and ```#``` starts a comment. If a parameter appears both on the command line and in the config file, LightGBM will use the value from the command line.
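For illustration, a minimal, hypothetical config file could look like the following (the file name and values are placeholders, not recommendations):
```
# train.conf -- hypothetical example
task = train
application = regression
data = train.txt
num_iterations = 100
learning_rate = 0.05
```
The same run could also be launched purely from the command line (note: no spaces around ```=```):
```
./lightgbm task=train application=regression data=train.txt num_iterations=100 learning_rate=0.05
```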
## Core Parameters
* ```config```, default=```""```, type=string, alias=```config_file```
* path of config file
* ```task```, default=```train```, type=enum, options=```train```,```prediction```
* ```train``` for training
* ```prediction``` for prediction.
* ```application```, default=```regression```, type=enum, options=```regression```,```binary```,```lambdarank```, alias=```objective```,```app```
* ```regression```, regression application
* ```binary```, binary classification application
* ```lambdarank```, lambdarank application
* ```data```, default=```""```, type=string, alias=```train```,```train_data```
* training data, LightGBM will train from this data
* ```valid```, default=```""```, type=multi-string, alias=```test```,```valid_data```,```test_data```
* validation/test data, LightGBM will output metrics for these data
* supports multiple validation data sets, separated by ```,```
* ```num_iterations```, default=```10```, type=int, alias=```num_iteration```,```num_tree```,```num_trees```,```num_round```,```num_rounds```
* number of boosting iterations/trees
* ```learning_rate```, default=```0.1```, type=double, alias=```shrinkage_rate```
* shrinkage rate
* ```num_leaves```, default=```127```, type=int, alias=```num_leaf```
* number of leaves for one tree
* ```tree_learner```, default=```serial```, type=enum, options=```serial```,```feature```,```data```
* ```serial```, single machine tree learner
* ```feature```, feature parallel tree learner
* ```data```, data parallel tree learner
* Refer to [Parallel Learning Guide](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide) to get more details.
* ```num_threads```, default=OpenMP_default, type=int, alias=```num_thread```,```nthread```
* Number of threads for LightGBM.
* For the best speed, set this to the number of **real CPU cores**, not the number of threads (most CPUs use [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading) to provide 2 threads per physical core).
* For parallel learning, do not use all CPU cores, since this can hurt the performance of the network communication.
## Learning control parameters
* ```min_data_in_leaf```, default=```100```, type=int, alias=```min_data_per_leaf```,```min_data```
* Minimal number of data in one leaf. Can be used to deal with over-fitting.
* ```min_sum_hessian_in_leaf```, default=```10.0```, type=double, alias=```min_sum_hessian_per_leaf```,```min_sum_hessian```,```min_hessian```
* Minimal sum of Hessians in one leaf. Like ```min_data_in_leaf```, it can be used to deal with over-fitting.
* ```feature_fraction```, default=```1.0```, type=double, ```0.0 < feature_fraction <= 1.0```, alias=```sub_feature```
* LightGBM will randomly select a subset of the features on each iteration if ```feature_fraction``` is smaller than ```1.0```. For example, if it is set to ```0.8```, 80% of the features will be selected before training each tree.
* Can be used to speed up training
* Can be used to deal with over-fitting
* ```feature_fraction_seed```, default=```2```, type=int
* Random seed for feature fraction.
* ```bagging_fraction```, default=```1.0```, type=double, ```0.0 < bagging_fraction <= 1.0```, alias=```sub_row```
* Like ```feature_fraction```, but this randomly selects part of the data
* Can be used to speed up training
* Can be used to deal with over-fitting
* Note: to enable bagging, ```bagging_freq``` should also be set to a non-zero value (see the sketch after this list)
* ```bagging_freq```, default=```0```, type=int
* Frequency for bagging: ```0``` means bagging is disabled, and ```k``` means bagging is performed at every ```k``` iterations.
* Note: to enable bagging, ```bagging_fraction``` should be set to a value smaller than ```1.0``` as well
* ```bagging_seed``` , default=```3```, type=int
* Random seed for bagging.
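As an illustrative sketch, the following config fragment enables bagging of 80% of the data every 5 iterations together with feature sub-sampling; the values are examples, not recommendations:
```
bagging_fraction = 0.8
bagging_freq = 5
bagging_seed = 3
feature_fraction = 0.9
feature_fraction_seed = 2
```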
## IO parameters
* ```max_bin```, default=```255```, type=int
* max number of bins that feature values will be bucketed into. A small number of bins may reduce training accuracy but may improve generalization (deal with over-fitting).
* LightGBM will automatically compress memory according to ```max_bin```. For example, it will use ```uint8_t``` for feature values if ```max_bin=255```.
* ```data_random_seed```, default=```1```, type=int
* random seed for data partitioning in parallel learning (not including feature parallel).
* ```data_has_label```, default=```true```, type=bool
* Must be ```true``` for the training task. For the prediction task, set this according to whether the data file contains labels.
* ```output_model```, default=```LightGBM_model.txt```, type=string, alias=```model_output```,```model_out```
* file name of output model in training.
* ```input_model```, default=```""```, type=string, alias=```model_input```,```model_in```
* file name of input model.
* for the prediction task, the data will be predicted using this model.
* for the train task, training will be continued from this model.
* ```output_result```, default=```LightGBM_predict_result.txt```, type=string, alias=```predict_result```,```prediction_result```
* file name of prediction result in prediction task.
* ```is_sigmoid```, default=```true```, type=bool
* Setting this to ```true``` will apply a sigmoid transform to the prediction result (if needed; currently this only affects ```binary```).
* Setting this to ```false``` will output only the raw scores.
* ```init_score```, default=```""```, type=string, alias=```input_init_score```
* file name of initial score file. LightGBM will use this score to start training.
* only supported for the train task.
* each line contains one score corresponding to one data row
* ```is_pre_partition```, default=```false```, type=bool
* used for parallel learning (not including feature parallel).
* ```true``` if the training data is pre-partitioned and different machines use different partitions.
* ```is_sparse```, default=```true```, type=bool, alias=```is_enable_sparse```
* used to enable/disable sparse optimization. Set to ```false``` to disable sparse optimization.
* ```two_round```, default=```false```, type=bool, alias=```two_round_loading```,```use_two_round_loading```
* by default, LightGBM maps the data file into memory and loads features from memory. This gives faster data loading, but it may run out of memory when the data file is very big.
* set this to ```true``` if the data file is too big to fit in memory.
* ```save_binary```, default=```false```, type=bool, alias=```is_save_binary```,```is_save_binary_file```
* setting this to ```true``` will save the data set (including validation data) to a binary file, which speeds up data loading the next time (a combined example follows this list).
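A hypothetical IO fragment of a config file tying several of these parameters together (the file names are placeholders):
```
data = train.txt
valid = valid_1.txt,valid_2.txt
output_model = LightGBM_model.txt
two_round = true
save_binary = true
```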
## Objective parameters
* ```sigmoid```, default=```1.0```, type=double
* parameter of the sigmoid function. Used in binary classification and lambdarank.
* ```is_unbalance```, default=```false```, type=bool
* used in binary classification. Set this to ```true``` if the training data is unbalanced.
* ```max_position```, default=```20```, type=int
* used in lambdarank; NDCG will be optimized at this position.
* ```label_gain```, default=```{0,1,3,7,15,31,63,...}```, type=multi-double
* used in lambdarank; relevance gain for labels. For example, the gain of label ```2``` is ```3``` when using the default label gains.
* separated by ```,``` (see the example after this list)
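As an illustrative sketch, the objective part of a lambdarank configuration might look like the following (the values simply echo the defaults and examples above):
```
application = lambdarank
sigmoid = 1.0
max_position = 20
label_gain = 0,1,3,7,15,31
```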
## Metric parameters
* ```metric```, default={```l2``` for regression}, {```binary_logloss``` for binary classification},{```ndcg``` for lambdarank}, type=multi-enum, options=```l1```,```l2```,```ndcg```,```auc```,```binary_logloss```,```binary_error```
* ```l1```, absolute loss
* ```l2```, square loss
* ```ndcg```, [NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG)
* ```auc```, [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve)
* ```binary_logloss```, [log loss](https://www.kaggle.com/wiki/LogarithmicLoss)
* ```binary_error```. For one sample, ```0``` for a correct classification, ```1``` for a wrong one.
* Multiple metrics are supported, separated by ```,```
* ```metric_freq```, default=```1```, type=int
* frequency for metric output
* ```is_training_metric```, default=```false```, type=bool
* set this to ```true``` if metric results on the training data should be output
* ```ndcg_at```, default=```{1,2,3,4,5}```, type=multi-int, alias=```ndcg_eval_at```
* NDCG evaluation positions, separated by ```,``` (see the example after this list)
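For example, a hedged metric configuration that reports log loss and AUC every 10 iterations, including on the training data (the values are illustrative):
```
# for a binary classification run
metric = binary_logloss,auc
metric_freq = 10
is_training_metric = true
# for a lambdarank run, one might instead use
# metric = ndcg
# ndcg_at = 1,3,5
```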
## Network parameters
The following parameters are used for parallel learning, and only for the socket version. They do not need to be set for the MPI version.
* ```num_machines```, default=```1```, type=int, alias=```num_machine```
* The number of machines used in the parallel learning application.
* ```local_listen_port```, default=```12400```, type=int, alias=```local_port```
* TCP listen port for the local machines.
* This port should be allowed in the firewall settings before training.
* ```time_out```, default=```120```, type=int
* Socket time-out in minutes.
* ```machine_list_file```, default=```""```, type=string
* File that lists the machines for this parallel learning application.
* Each line contains one IP and one port for one machine, in the format ```ip port``` separated by a space (see the sketch after this list).
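A hedged sketch of the network part of a config file for two machines (the machine names and port are placeholders):
```
num_machines = 2
local_listen_port = 12400
machine_list_file = mlist.txt
```
where ```mlist.txt``` would contain one ```ip port``` pair per line, for example:
```
machine1_ip 12400
machine2_ip 12400
```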
## Tuning Parameters
### For faster speed
* Use bagging by setting ```bagging_fraction``` and ```bagging_freq```
* Use feature sub-sampling by setting ```feature_fraction```
* Use a small ```max_bin```
* Use ```save_binary``` to speed up data loading in future learning
* Use parallel learning, refer to [parallel learning guide](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide).
### For better accuracy
* Use a large ```max_bin``` (may be slower)
* Use a small ```learning_rate``` with a large ```num_iterations```
* Use a large ```num_leaves``` (may cause over-fitting)
* Use more training data
### Deal with over-fitting
* Use a small ```max_bin```
* Use a small ```num_leaves```
* Use ```min_data_in_leaf``` and ```min_sum_hessian_in_leaf```
* Use bagging by setting ```bagging_fraction``` and ```bagging_freq```
* Use feature sub-sampling by setting ```feature_fraction```
* Use more training data (a combined sketch follows this list)
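As a hedged illustration, an over-fitting-oriented config fragment combining several of the suggestions above might look like this (the values are arbitrary starting points, not tuned recommendations):
```
max_bin = 127
num_leaves = 63
min_data_in_leaf = 100
min_sum_hessian_in_leaf = 10.0
bagging_fraction = 0.8
bagging_freq = 5
feature_fraction = 0.8
```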
## Others
### Weight data
LightGBM supports weighted training. It uses an additional file to store the weight data, like the following:
```
1.0
0.5
0.8
...
```
It means the weight of the first data row is ```1.0```, the weight of the second is ```0.5```, and so on. The weight file corresponds to the training data file line by line, with one weight per line. If the data file is named "train.txt", the weight file should be named "train.txt.weight" and placed in the same folder as the data file. LightGBM will automatically load the weight file if it exists.
### Query data
LambdaRank learning needs query information for the training data. LightGBM uses an additional file to store the query data. The following is an example:
```
27
18
67
...
```
It means the first ```27``` data rows belong to one query, the next ```18``` belong to another query, and so on (**note: the data must be ordered by query**). If the data file is named "train.txt", the query file should be named "train.txt.query" and placed in the same folder as the data file. LightGBM will automatically load the query file if it exists.

168
Experiments.md Normal file

@@ -0,0 +1,168 @@
## Comparison Experiment
### Experiment Data
We use 3 data sets to conduct our comparison experiments. Details of the data are listed in the following table:
| Data | Task | Link | #Data | #Feature| Comments|
|----------|---------------|-------|-------|---------|---------|
| Higgs | Binary classification | [link](https://archive.ics.uci.edu/ml/datasets/HIGGS) |10,000,000|28| use last 500,000 samples as test set |
| Yahoo LTR| Learning to rank | [link](https://webscope.sandbox.yahoo.com/catalog.php?datatype=c) |~2,000,000|~140| set1.train as train, set1.test as test |
| MS LTR | Learning to rank | [link](http://research.microsoft.com/en-us/projects/mslr/)|~473,000|~700| {S1,S2,S3} as train set, {S5} as test set |
### Environment
We use one Linux server as the experiment platform; details are listed in the following table:
| OS | CPU | Memory |
|--------|--------------|---------|
| Ubuntu 14.04 LTS | 2 * E5-2680 v2 | DDR3 1600 MHz, 256GB|
### Baseline
We use [xgboost](https://github.com/dmlc/xgboost) as the baseline; the build is the latest version as of 8 Oct 2016 ([f9648ac](https://github.com/dmlc/xgboost/tree/f9648ac320ba9d9fb77c1b9bf091406b9b6b4086)).
Both xgboost and LightGBM are built with OpenMP support.
### Settings
We set up a total of 3 settings for the experiments; their parameters are listed below:
1. xgboost:
```
eta = 0.1
max_depth = 8
num_round = 500
nthread=16
tree_method=exact
```
2. xgboost_approx (using histogram based algorithm):
```
eta = 0.1
max_depth = 8
num_round = 500
nthread=16
# num_bins = (1/sketch_eps)
sketch_eps=0.004
tree_method=approx
```
3. LightGBM:
```
learning_rate = 0.1
num_leaves = 255
num_trees = 500
num_threads = 16
```
xgboost with ```max_depth=8``` has at most 255 leaves, which gives the same model complexity as LightGBM with ```num_leaves=255```. xgboost_approx with ```sketch_eps=0.004``` uses about ```1/0.004 = 250``` bins, which is similar to the LightGBM default (255).
Other parameters are default values.
### Result
#### Speed
For the speed comparison, we only run the training task, without any test or metric output, and we do not count the time for IO.
The following table compares the training time:
| Data | xgboost| xgboost_approx | LightGBM|
|-----------|---------|----------------|----------|
| Higgs | 4445s | 2206s | **386s** |
| Yahoo LTR | 844s | 591s | **176s** |
| MS LTR | 1374s | 1233s | **268s** |
[[image/time_cost.png]]
We found that LightGBM is faster than xgboost on all experiment data sets.
#### Accuracy
For the accuracy comparison, we evaluate on the test data sets to ensure a fair comparison.
Higgs's AUC:
| Metric | xgboost | xgboost_approx | LightGBM|
|---------|---------|----------------|---------|
| AUC | 0.8393 | 0.8402 | **0.8450** |
NDCG at Yahoo LTR:
| Metric | xgboost | xgboost_approx| LightGBM|
|-----------|---------|-------------- |---------|
| NDCG@1 | 0.7247 | 0.7272 | **0.7323** |
| NDCG@3 | 0.7282 | 0.7278 | **0.7368** |
| NDCG@5 | 0.7463 | 0.7465 | **0.7560** |
| NDCG@10 | 0.7878 | 0.7879 | **0.7969** |
NDCG at MS LTR:
| Metric | xgboost | xgboost_approx| LightGBM|
|-----------|---------|-------------- |---------|
| NDCG@1 | 0.4994 | 0.4959 | **0.5182** |
| NDCG@3 | 0.4817 | 0.4813 | **0.5042** |
| NDCG@5 | 0.4860 | 0.4861 | **0.5085** |
| NDCG@10 | 0.5051 | 0.5037 | **0.5265** |
We found that LightGBM achieves better accuracy than xgboost on all experiment data sets.
#### Memory consumption
We monitor ```RES``` while running the training task, and we set ```two_round=true``` in LightGBM to reduce peak memory usage.
| Data | xgboost | xgboost_approx| LightGBM|
|-----------|---------|-------------- |---------|
| Higgs | 4.853GB | 4.875GB | **0.822GB** |
| Yahoo LTR | 1.907GB | 2.221GB | **0.831GB** |
| MS LTR | 5.469GB | 5.600GB | **0.745GB** |
LightGBM benefits from its histogram-based algorithm and therefore consumes much less memory.
## Parallel Experiment
### Data
We use a terabyte-scale click log dataset to conduct the parallel experiments. Details are listed in the following table:
| Data | Task | Link | #Data | #Feature|
|----------|---------------|-------|-------|---------|
| Criteo | Binary classification | [link](http://labs.criteo.com/downloads/download-terabyte-click-logs/) |1,700,000,000|67|
This data contains 13 integer features and 26 categorical features from 24 days of click logs. We compute the CTR and count statistics for these 26 categorical features over the first ten days, then use the next ten days as training data after replacing the categorical features with the corresponding CTR and count. The processed training data has a total of 1.7 billion records and 67 features.
### Environment
We use 16 Windows servers as the experiment platform; details are listed in the following table:
| OS | CPU | Memory | Network Adapter |
|--------|--------------|---------|-----------------|
| Windows Server 2012 | 2 * E5-2670 v2 | DDR3 1600 MHz, 256GB| Mellanox ConnectX-3, 54Gbps, RDMA support |
### Settings
```
learning_rate = 0.1
num_leaves = 255
num_trees=100
num_thread=16
tree_learner=data
```
We use data parallel here, since this data is large in #data but small in #feature.
Other parameters are default values.
### Result
| #machine | time per tree | memory usage(per machine) |
|-----------|---------|--------------|
| 1 | 627.8s | 176GB |
| 2 | 311s | 87GB |
| 4 | 156s | 43GB |
| 8 | 80s | 22GB |
| 16 | 42s | 11GB |
From the results, we find that LightGBM achieves a linear speed-up in parallel learning.

151
Features.md Normal file

@@ -0,0 +1,151 @@
This is a short introduction to the features and algorithms used in LightGBM.
This page does not describe the algorithms in detail; please refer to the cited papers or the source code if you are interested.
## Optimization in speed and memory usage
Many boosting tools use pre-sorted based algorithms<sup>[1][2]</sup> (e.g. the default algorithm in xgboost) for decision tree learning. This is a simple solution, but it is not easy to optimize.
LightGBM uses histogram based algorithms<sup>[3][4][5]</sup>, which bucket continuous feature (attribute) values into discrete bins, to speed up the training procedure and reduce memory usage. The advantages of histogram based algorithms are the following:
* **Reduce calculation cost of split gain**
* Pre-sorted based algorithms need O(#data) computations for each split
* Histogram based algorithms only need O(#bins) computations, and #bins is far smaller than #data
* They still need O(#data) operations to construct the histogram, but these are only sum-up operations
* **Only need to split the data once after finding the best split point**
* Pre-sorted based algorithms need to split the data O(#features) times (since different features access the data in different orders)
* **Use histogram subtraction for further speed-up**
* To get the histograms of one leaf in a binary tree, the histograms of its parent and its neighbor can be subtracted
* So histograms only need to be constructed for one leaf (the one with the smaller #data), and the histograms of its neighbor are then obtained by subtraction at a small cost (O(#bins))
* **Reduce Memory usage**
* Continuous values can be replaced by discrete bins. If #bins is small, a small data type, e.g. uint8_t, can be used to store the training data
* No need to store additional information for pre-sorting feature values
* **Reduce communication cost for parallel learning**
* **Easy to optimize for cache hit rate**, since all features access the data in the same order
* Pre-sorted based algorithms access the data in a different order for each feature, which is not easy to optimize for the cache.
## Sparse optimization
* Only O(#non_zero_data) operations are needed to construct the histogram for sparse features
## Optimization in accuracy
Most decision tree learning algorithms grow the tree level (depth)-wise, like the following:
[[image/level_wise.png]]
LightGBM grows the tree leaf-wise: it chooses the leaf with the maximum delta loss to grow.
When growing the same number of leaves, the leaf-wise algorithm can reduce the loss more than the level-wise algorithm.
[[image/leaf_wise.png]]
## Optimization in network communication
Parallel learning in LightGBM only needs a few collective communication algorithms, such as "All reduce", "All gather" and "Reduce scatter". LightGBM implements the state-of-the-art algorithms described in this [paper](http://wwwi10.lrr.in.tum.de/~gerndt/home/Teaching/HPCSeminar/mpich_multi_coll.pdf)<sup>[6]</sup>. These collective communication algorithms can provide much better performance than point-to-point communication.
## Optimization in parallel learning
LightGBM provides the following parallel learning algorithms.
### Feature Parallel
#### Traditional algorithm
Feature parallel aims to parallelize the "find best split" step of decision tree learning. The procedure of traditional feature parallel is:
1. Partition the data vertically (different machines hold different feature sets)
2. Workers find the local best split point {feature, threshold} on their local feature set
3. Communicate the local best splits with each other and pick the global best one
4. The worker that owns the best split performs the split, then sends the split result of the data to the other workers
5. The other workers split the data according to the received result
The shortcomings of traditional feature parallel:
* It has computation overhead, since it cannot speed up the "split" step, whose time complexity is O(#data). Thus, feature parallel does not scale well when #data is large.
* It needs to communicate the split result, which costs about O(#data / 8) (one bit per data row).
#### Feature parallel in LightGBM
Since feature parallel does not scale well when #data is large, we make a small change here: instead of partitioning the data vertically, every worker holds the full data. Thus, LightGBM does not need to communicate the split result of the data, since every worker knows how to split the data. And #data will not become larger, so it is reasonable to hold the full data on every machine.
The procedure of feature parallel in LightGBM:
1. Workers find the local best split point {feature, threshold} on their local feature set
2. Communicate the local best splits with each other and pick the global best one
3. Perform the best split
However, this feature parallel algorithm still suffers from the computation overhead of the "split" step when #data is large. So it is better to use data parallel when #data is large.
### Data Parallel
#### Traditional algorithm
Data parallel aims to parallelize the whole decision tree learning. The procedure of data parallel is:
1. Partition the data horizontally
2. Workers use their local data to construct local histograms
3. Merge global histograms from all local histograms
4. Find the best split from the merged global histograms, then perform the split
The shortcoming of traditional data parallel:
* High communication cost. When using a point-to-point communication algorithm, the communication cost for one machine is about O(#machine * #feature * #bin). When using a collective communication algorithm (e.g. "All Reduce"), the communication cost is about O(2 * #feature * #bin) (see the cost of "All Reduce" in chapter 4.5 of this [paper](http://wwwi10.lrr.in.tum.de/~gerndt/home/Teaching/HPCSeminar/mpich_multi_coll.pdf)).
#### Data parallel in LightGBM
We reduce the communication cost of data parallel in LightGBM:
1. Instead of "merging global histograms from all local histograms", LightGBM uses "Reduce Scatter" to merge the histograms of different (non-overlapping) features on different workers. Workers then find the local best split on the locally merged histograms and sync up the global best split.
2. As mentioned above, LightGBM uses histogram subtraction to speed up training. Based on this, we communicate the histograms of only one leaf and obtain the histograms of its neighbor by subtraction as well.
Overall, we reduce the communication cost to O(0.5 * #feature * #bin) for data parallel in LightGBM. For example, with ```#machine = 2```, point-to-point communication costs about 2 * #feature * #bin per machine, while the scheme above costs about 0.5 * #feature * #bin, and the gap widens as #machine grows.
## Applications and metrics
LightGBM supports the following applications:
* regression, objective function is L2 loss
* binary classification, objective function is logloss
* lambdarank, objective is lambdarank with NDCG
The following metrics are supported:
* L1 Loss
* L2 Loss
* Log loss
* Classification Error rate
* AUC
* NDCG
For more details, please refer to [Configuration](https://github.com/Microsoft/LightGBM/wiki/Configuration).
## Other features
* Bagging (data sub-sampling)
* Column (feature) sub-sampling
* Continued training from an input GBDT model
* Continued training from an input score file
* Weighted training
* Validation metric output during training
* Multiple validation data sets
* Multiple metrics
For more details, please refer to [Configuration](https://github.com/Microsoft/LightGBM/wiki/Configuration).
## Future Plan
* Support for more languages (e.g. Python, R)
* Support for more platforms (e.g. Hadoop, Spark)
* Direct support for categorical features
## References
[1] Mehta, Manish, Rakesh Agrawal, and Jorma Rissanen. "SLIQ: A fast scalable classifier for data mining." International Conference on Extending Database Technology. Springer Berlin Heidelberg, 1996.
[2] Shafer, John, Rakesh Agrawal, and Manish Mehta. "SPRINT: A scalable parallel classifier for data mining." Proc. 1996 Int. Conf. Very Large Data Bases. 1996.
[3] Ranka, Sanjay, and V. Singh. "CLOUDS: A decision tree classifier for large datasets." Proceedings of the 4th Knowledge Discovery and Data Mining Conference. 1998.
[4] Gehrke, Johannes, et al. "BOAT—optimistic decision tree construction." ACM SIGMOD Record. Vol. 28. No. 2. ACM, 1999.
[5] Machado, F. P. "Communication and memory efficient parallel decision tree construction." (2003).
[6] Thakur, Rajeev, Rolf Rabenseifner, and William Gropp. "Optimization of collective communication operations in MPICH." International Journal of High Performance Computing Applications 19.1 (2005): 49-66.

25
Home.md Normal file

@@ -0,0 +1,25 @@
LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient, with the following advantages:
- Fast training speed
- Low memory usage
- Better accuracy
- Support for parallel learning
- Capability to handle large-scale data
For the details, please refer to [Features](https://github.com/Microsoft/LightGBM/wiki/Features).
The [experiments](https://github.com/Microsoft/LightGBM/wiki/Experiments#comparison-experiment) on public data sets also show that LightGBM can outperform other existing boosting tools in both learning efficiency and accuracy, with significantly lower memory consumption. Moreover, the [experiments](https://github.com/Microsoft/LightGBM/wiki/Experiments#parallel-experiment) show that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.
Get Started
------------
For a quick start, please follow the [Installation Guide](https://github.com/Microsoft/LightGBM/wiki/Installation-Guide) and [Quick Start](https://github.com/Microsoft/LightGBM/wiki/Quick-Start).
Documents
------------
* [**Wiki**](https://github.com/Microsoft/LightGBM/wiki)
* [**Installation Guide**](https://github.com/Microsoft/LightGBM/wiki/Installation-Guide)
* [**Quick Start**](https://github.com/Microsoft/LightGBM/wiki/Quick-Start)
* [**Examples**](https://github.com/Microsoft/LightGBM/tree/master/examples)
* [**Features**](https://github.com/Microsoft/LightGBM/wiki/Features)
* [**Parallel Learning Guide**](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide)
* [**Configuration**](https://github.com/Microsoft/LightGBM/wiki/Configuration)

54
Installation-Guide.md Normal file

@@ -0,0 +1,54 @@
LightGBM is implemented in standard C++ 11. It does not need additional packages to build.
## Windows
LightGBM uses Visual Studio (2013 or higher) to build on Windows.
1. Clone or download the latest source code.
2. Open ```./windows/LightGBM.sln``` with Visual Studio.
3. Set the configuration to ```Release``` and the platform to ```x64```.
4. Press ```Ctrl+Shift+B``` to build.
5. The exe file is in ```./windows/x64/Release/``` after the build.
## Unix
LightGBM uses ***cmake*** to build on Unix. Run the following:
```
git clone --recursive https://github.com/Microsoft/LightGBM ; cd LightGBM
mkdir build ; cd build
cmake ..
make -j
```
## Build MPI Version
The default build of LightGBM is based on sockets. LightGBM also supports [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface). MPI is a high performance communication approach with [RDMA](https://en.wikipedia.org/wiki/Remote_direct_memory_access) support.
If you need to run a parallel learning application with high performance communication, you can build LightGBM with MPI support.
### Windows
You need to install [MSMPI](https://www.microsoft.com/en-us/download/details.aspx?id=49926) first. Both ```msmpisdk.msi``` and ```MSMpiSetup.exe``` are needed.
Then:
1. Clone or download the latest source code.
2. Open ```./windows/LightGBM.sln``` with Visual Studio.
3. Set the configuration to ```Release_mpi``` and the platform to ```x64```.
4. Press ```Ctrl+Shift+B``` to build.
5. The exe file is in ```./windows/x64/Release_mpi/``` after the build.
### Unix
You need to install [OpenMPI](https://www.open-mpi.org/) first.
Then run the following:
```
git clone --recursive https://github.com/Microsoft/LightGBM ; cd LightGBM
mkdir build ; cd build
cmake -DUSE_MPI=ON ..
make -j
```

89
Parallel-Learning-Guide.md Normal file

@@ -0,0 +1,89 @@
This is a guide to parallel learning in LightGBM.
Follow the [Quick Start](https://github.com/Microsoft/LightGBM/wiki/Quick-Start) first to learn how to use LightGBM.
## Choose appropriate parallel algorithm
LightGBM currently provides 2 parallel learning algorithms.
|Parallel algorithm| How to use |
|----------------|---------------------|
|Data parallel | tree_learner=data |
|Feature parallel| tree_learner=feature|
These algorithms are suited to different scenarios, as listed in the following table:
| | #data is small| #data is large|
|---------------------|------------------|-----------------|
|**#feature is small**| Feature Parallel | Data Parallel |
|**#feature is large**| Feature Parallel | Will be released soon |
More details about these parallel algorithms can be found in [optimization in parallel learning](https://github.com/Microsoft/LightGBM/wiki/Features#optimization-in-parallel-learning).
## Build parallel version
The default build supports parallel learning based on sockets.
If you need to build a parallel version with MPI support, please refer to [this guide](https://github.com/Microsoft/LightGBM/wiki/Installation-Guide#build-mpi-version).
## Preparation
### socket version
You need to collect the IPs of all machines that will take part in parallel learning, allocate one TCP port (assume 12345 here) on all machines, and change the firewall rules to allow incoming traffic on this port (12345). Then write these IPs and ports into one file (assume mlist.txt), like the following:
```
machine1_ip 12345
machine2_ip 12345
```
### MPI version
You need to collect the IPs (or hostnames) of all machines that will take part in parallel learning. Then write these IPs into one file (assume mlist.txt), like the following:
```
machine1_ip
machine2_ip
```
Note: Windows users need to start "smpd" to start the MPI service. More details can be found [here](https://blogs.technet.microsoft.com/windowshpc/2015/02/02/how-to-compile-and-run-a-simple-ms-mpi-program/).
## Run parallel learning
### Socket version
1. Edit the following parameters in the config file (a complete sketch follows this list):
```tree_learner=your_parallel_algorithm```, replace "your_parallel_algorithm" (e.g. feature/data) here.
```num_machines=your_num_machines```, replace "your_num_machines" (e.g. 4) here.
```machine_list_file=mlist.txt```, mlist.txt is created in the Preparation section.
```local_listen_port=12345```, 12345 is the port allocated in the Preparation section.
2. Copy the data file, executable file, config file and "mlist.txt" to all machines.
3. Run the following command on all machines, replacing "your_config_file" with the real config file:
For Windows: ```lightgbm.exe config=your_config_file```
For Linux: ```./lightgbm config=your_config_file```
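For reference, a hedged sketch of a complete config file for the socket version, combining the parallel parameters above with ordinary training parameters (the file names and values are placeholders):
```
task = train
application = binary
data = train.txt
num_trees = 100
num_leaves = 255
tree_learner = data
num_machines = 2
machine_list_file = mlist.txt
local_listen_port = 12345
```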
### MPI version
1. Edit the following parameter in the config file:
```tree_learner=your_parallel_algorithm```, replace "your_parallel_algorithm" (e.g. feature/data) here.
2. Copy the data file, executable file, config file and "mlist.txt" to all machines. Note: MPI needs to run in the **same path on all machines**.
3. Run the following command on one machine (it does not need to be run on all machines), replacing "your_config_file" with the real config file:
For Windows: ```mpiexec.exe /machinefile mlist.txt lightgbm.exe config=your_config_file```
For Linux: ```mpiexec --machinefile mlist.txt ./lightgbm config=your_config_file```
### Example
* [A simple parallel example](https://github.com/Microsoft/lightgbm/tree/master/examples/parallel_learning).

76
Quick-Start.md Normal file

@@ -0,0 +1,76 @@
This is a quick start guide for LightGBM.
Follow the [Installation Guide](https://github.com/Microsoft/LightGBM/wiki/Installation-Guide) to install LightGBM first.
## Training data format
LightGBM supports input data files in [CSV](https://en.wikipedia.org/wiki/Comma-separated_values), [TSV](https://en.wikipedia.org/wiki/Tab-separated_values) and [LibSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) formats.
The label is the first column, and there is no header in the file.
LightGBM also supports weighted training; it needs an additional [weight data](https://github.com/Microsoft/LightGBM/wiki/Configuration#weight-data) file. It also needs an additional [query data](https://github.com/Microsoft/LightGBM/wiki/Configuration#query-data) file for the ranking task.
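As a small illustration (the numbers are made up), a CSV training file with the label in the first column and no header could look like:
```
1,0.12,3.4,5.6
0,0.37,1.1,0.4
1,0.05,2.2,7.8
```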
## Parameter quick look
The parameter format is ```key1=value1 key2=value2 ...```. Parameters can be set both in the config file and on the command line.
Some important parameters:
* ```config```, default=```""```, type=string, alias=```config_file```
* path of config file
* ```task```, default=```train```, type=enum, options=```train```,```prediction```
* ```train``` for training
* ```prediction``` for prediction.
* ```application```, default=```regression```, type=enum, options=```regression```,```binary```,```lambdarank```, alias=```objective```,```app```
* ```regression```, regression application
* ```binary```, binary classification application
* ```lambdarank```, lambdarank application
* ```data```, default=```""```, type=string, alias=```train```,```train_data```
* training data, LightGBM will train from this data
* ```valid```, default=```""```, type=multi-string, alias=```test```,```valid_data```,```test_data```
* validation/test data, LightGBM will output metrics for these data
* supports multiple validation data sets, separated by ```,```
* ```num_iterations```, default=```10```, type=int, alias=```num_iteration```,```num_tree```,```num_trees```,```num_round```,```num_rounds```
* number of boosting iterations/trees
* ```learning_rate```, default=```0.1```, type=double, alias=```shrinkage_rate```
* shrinkage rate
* ```num_leaves```, default=```127```, type=int, alias=```num_leaf```
* number of leaves for one tree
* ```tree_learner```, default=```serial```, type=enum, options=```serial```,```feature```,```data```
* ```serial```, single machine tree learner
* ```feature```, feature parallel tree learner
* ```data```, data parallel tree learner
* Refer to [Parallel Learning Guide](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide) to get more details.
* ```num_threads```, default=OpenMP_default, type=int, alias=```num_thread```,```nthread```
* Number of threads for LightGBM.
* For the best speed, set this to the number of **real CPU cores**, not the number of threads (most CPUs use [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading) to provide 2 threads per physical core).
* ```min_data_in_leaf```, default=```100```, type=int
* minimal number of data in one leaf; an important parameter for avoiding over-fitting
For complete parameters, please refer to [Parameters](https://github.com/Microsoft/LightGBM/wiki/Configuration).
## Run LightGBM
For Windows:
```
lightgbm.exe config=your_config_file other_args ...
```
For Unix:
```
./lightgbm config=your_config_file other_args ...
```
Parameters can be set both in the config file and on the command line, and command-line parameters have higher priority than those in the config file.
For example, the following command line will keep ```num_trees=10``` and ignore the same parameter in the config file.
```
./lightgbm config=train.conf num_trees=10
```
## Examples
* [Binary Classification](https://github.com/Microsoft/LightGBM/tree/master/examples/binary_classification)
* [Regression](https://github.com/Microsoft/LightGBM/tree/master/examples/regression)
* [Lambdarank](https://github.com/Microsoft/LightGBM/tree/master/examples/lambdarank)
* [Parallel Learning](https://github.com/Microsoft/LightGBM/tree/master/examples/parallel_learning)

32
_Sidebar.md Normal file

@@ -0,0 +1,32 @@
**DMTK**
* [Overview](https://github.com/Microsoft/DMTK/wiki)
* [News](https://github.com/Microsoft/DMTK/wiki/News)
**Multiverso**
* [Overview](https://github.com/Microsoft/multiverso/wiki/Overview)
* [Multiverso setup](https://github.com/Microsoft/multiverso/wiki/Setup-Multiverso)
* [Multiverso document](https://github.com/Microsoft/multiverso/wiki/Multiverso-document)
* [Multiverso API document](https://github.com/Microsoft/multiverso/wiki/API-document)
* Multiverso applications
* [Logistic Regression](https://github.com/Microsoft/multiverso/wiki/Logistic-Regression)
* [Word Embedding](https://github.com/Microsoft/multiverso/wiki/Word-Embedding)
* [LightLDA]()
* Deep Learning
* [Torch](https://github.com/Microsoft/multiverso/wiki/Multiverso-Torch-Binding-Benchmark)
* [Theano](https://github.com/Microsoft/multiverso/wiki/Multiverso-Python-Binding-Benchmark)
* Multiverso binding
* [lua](https://github.com/Microsoft/multiverso/wiki/Multiverso-Torch-Lua-Binding)
* [python](https://github.com/Microsoft/multiverso/wiki/Multiverso-Python-Theano-Lasagne-Binding)
* [Run in docker](https://github.com/Microsoft/DMTK/wiki/Run-in-docker)
**LightGBM**
* [Overview](https://github.com/Microsoft/LightGBM/wiki)
* [Installation Guide](https://github.com/Microsoft/LightGBM/wiki/Installation-Guide)
* [Quick Start](https://github.com/Microsoft/LightGBM/wiki/Quick-Start)
* [Parallel Learning Guide](https://github.com/Microsoft/LightGBM/wiki/Parallel-Learning-Guide)
* [Features](https://github.com/Microsoft/LightGBM/wiki/Features)
* [Configuration](https://github.com/Microsoft/LightGBM/wiki/Configuration)
* [Experiments](https://github.com/Microsoft/LightGBM/wiki/Experiments)
* [Comparison Experiment](https://github.com/Microsoft/LightGBM/wiki/Experiments#comparison-experiment)
* [Parallel Experiment](https://github.com/Microsoft/LightGBM/wiki/Experiments#parallel-experiment)

Binary data
image/leaf_wise.png Normal file

Binary file not shown. (Size: 18 KiB)

Binary data
image/level_wise.png Normal file

Binary file not shown. (Size: 14 KiB)

Binary data
image/time_cost.png Normal file

Binary file not shown. (Size: 12 KiB)