
GPU Tuning Guide and Performance Comparison
===========================================
How It Works
-------------
In LightGBM, the main computation cost during training is building the feature histograms. We use an efficient algorithm on GPU to accelerate this process.
The implementation is highly modular and works for all learning tasks (classification, ranking, regression, etc.). GPU acceleration also works in distributed learning settings.
The GPU algorithm implementation is based on OpenCL and can work with a wide range of GPUs.
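
Enabling the GPU tree learner from the Python package only requires setting ``device`` to ``gpu``. The snippet below is a minimal sketch, assuming LightGBM was built with GPU support; ``X`` and ``y`` here are placeholder data, not part of the benchmarks described later:

::

    import numpy as np
    import lightgbm as lgb

    # Placeholder data; any reasonably large, dense dataset works the same way.
    X = np.random.rand(100000, 28)
    y = np.random.randint(0, 2, size=100000)

    params = {"objective": "binary", "device": "gpu"}
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
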
Supported Hardware
------------------
We target AMD Graphics Core Next (GCN) architecture and NVIDIA Maxwell and Pascal architectures.
Most AMD GPUs released after 2012 and NVIDIA GPUs released after 2014 should be supported. We have tested the GPU implementation on the following GPUs:

- AMD RX 480 with AMDGPU-pro driver 16.60 on Ubuntu 16.10
- AMD R9 280X (aka Radeon HD 7970) with fglrx driver 15.302.2301 on Ubuntu 16.10
- NVIDIA GTX 1080 with driver 375.39 and CUDA 8.0 on Ubuntu 16.10
- NVIDIA Titan X (Pascal) with driver 367.48 and CUDA 8.0 on Ubuntu 16.04
- NVIDIA Tesla M40 with driver 375.39 and CUDA 7.5 on Ubuntu 16.04

Using the following hardware is discouraged:

- NVIDIA Kepler (K80, K40, K20, most GeForce GTX 700 series GPUs) or earlier NVIDIA GPUs. They don't support hardware atomic operations in local memory space and thus histogram construction will be slow.
- AMD VLIW4-based GPUs, including Radeon HD 6xxx series and earlier GPUs. These GPUs have been discontinued for years and are rarely seen nowadays.
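
If you are unsure which OpenCL platforms and devices are visible on your machine (and therefore which ``gpu_platform_id`` and ``gpu_device_id`` values to pass later), one way to check is the third-party ``pyopencl`` package. This is only an illustrative sketch; ``pyopencl`` is not required by LightGBM:

::

    import pyopencl as cl

    # Enumerate OpenCL platforms and devices; the indices printed here map
    # directly to LightGBM's gpu_platform_id and gpu_device_id parameters.
    for p_id, platform in enumerate(cl.get_platforms()):
        for d_id, device in enumerate(platform.get_devices()):
            print(f"gpu_platform_id={p_id} gpu_device_id={d_id}: {platform.name} / {device.name}")
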
How to Achieve Good Speedup on GPU
----------------------------------
#. You want to run a few datasets that we have verified with good speedup (including Higgs, epsilon, Bosch, etc.) to ensure your setup is correct.
   If you have multiple GPUs, make sure to set ``gpu_platform_id`` and ``gpu_device_id`` to use the desired GPU.
   Also make sure your system is idle (especially when using a shared computer) to get accurate performance measurements.

#. GPU works best on large-scale and dense datasets. If the dataset is too small, computing it on GPU is inefficient as the data transfer overhead can be significant.
   If you have categorical features, use the ``categorical_column`` option and input them into LightGBM directly; do not convert them into one-hot variables.

#. To get good speedup with GPU, it is suggested to use a smaller number of bins.
   Setting ``max_bin=63`` is recommended, as it usually does not noticeably affect training accuracy on large datasets, but GPU training can be significantly faster than with the default bin size of 255.
   For some datasets, even using 15 bins is enough (``max_bin=15``); using 15 bins will maximize GPU performance. Make sure to check the run log and verify that the desired number of bins is used.

#. Try to use single precision training (``gpu_use_dp=false``) when possible, because most GPUs (especially NVIDIA consumer GPUs) have poor double-precision performance.
   A combined example of these settings is sketched below.
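
Putting these recommendations together, the following is a minimal sketch of GPU training with the Python package (not the exact configuration used in the benchmarks below); ``X`` and ``y`` are placeholders for a large, dense feature matrix and its labels:

::

    import lightgbm as lgb

    params = {
        "objective": "binary",
        "device": "gpu",
        "gpu_platform_id": 0,  # OpenCL platform of the desired GPU
        "gpu_device_id": 0,    # desired GPU on that platform
        "max_bin": 63,         # fewer bins -> faster histogram construction on GPU
        "gpu_use_dp": False,   # single precision; most consumer GPUs are slow in double precision
        "num_leaves": 255,
        "learning_rate": 0.1,
    }
    # X and y stand in for your own (large, dense) training data.
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=500)
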
Performance Comparison
----------------------
We evaluate the training performance of GPU acceleration on the following datasets:

+-----------+----------------+----------+------------+-----------+------------+
| Data | Task | Link | #Examples | #Features | Comments |
+===========+================+==========+============+===========+============+
| Higgs | Binary | `link1`_ | 10,500,000 | 28 | use last |
| | classification | | | | 500,000 |
| | | | | | samples |
| | | | | | as test |
| | | | | | set |
+-----------+----------------+----------+------------+-----------+------------+
| Epsilon | Binary | `link2`_ | 400,000 | 2,000 | use the |
| | classification | | | | provided |
| | | | | | test set |
+-----------+----------------+----------+------------+-----------+------------+
| Bosch | Binary | `link3`_ | 1,000,000 | 968 | use the |
| | classification | | | | provided |
| | | | | | test set |
+-----------+----------------+----------+------------+-----------+------------+
| Yahoo LTR | Learning to | `link4`_ | 473,134 | 700 | set1.train |
| | rank | | | | as train, |
| | | | | | set1.test |
| | | | | | as test |
+-----------+----------------+----------+------------+-----------+------------+
| MS LTR | Learning to | `link5`_ | 2,270,296 | 137 | {S1,S2,S3} |
| | rank | | | | as train |
| | | | | | set, {S5} |
| | | | | | as test |
| | | | | | set |
+-----------+----------------+----------+------------+-----------+------------+
| Expo | Binary | `link6`_ | 11,000,000 | 700 | use last |
| | classification | | | | 1,000,000 |
| | (Categorical) | | | | as test |
| | | | | | set |
+-----------+----------------+----------+------------+-----------+------------+

We used the following hardware to evaluate the performance of LightGBM GPU training.
Our CPU reference is **a high-end dual socket Haswell-EP Xeon server with 28 cores**;
the GPUs are a budget GPU (RX 480) and a mainstream GPU (GTX 1080) installed on the same server.
It is worth mentioning that **the GPUs used are not the best GPUs on the market**;
if you are using a better GPU (like AMD RX 580, NVIDIA GTX 1080 Ti, Titan X Pascal, Titan Xp, Tesla P100, etc.), you are likely to get a better speedup.

+--------------------------------+----------------+------------------+---------------+
| Hardware | Peak FLOPS | Peak Memory BW | Cost (MSRP) |
+================================+================+==================+===============+
| AMD Radeon RX 480 | 5,161 GFLOPS | 256 GB/s | $199 |
+--------------------------------+----------------+------------------+---------------+
| NVIDIA GTX 1080 | 8,228 GFLOPS | 320 GB/s | $499 |
+--------------------------------+----------------+------------------+---------------+
| 2x Xeon E5-2683v3 (28 cores) | 1,792 GFLOPS | 133 GB/s | $3,692 |
+--------------------------------+----------------+------------------+---------------+

During benchmarking on CPU we used only the 28 physical cores of the CPU, and did not use hyper-threading cores,
because we found that using too many threads actually makes performance worse.

The following shows the training configuration we used:

::

    max_bin = 63
    num_leaves = 255
    num_iterations = 500
    learning_rate = 0.1
    tree_learner = serial
    task = train
    is_training_metric = false
    min_data_in_leaf = 1
    min_sum_hessian_in_leaf = 100
    ndcg_eval_at = 1,3,5,10
    device = gpu
    gpu_platform_id = 0
    gpu_device_id = 0
    num_thread = 28

We use the configuration shown above, except for the Bosch dataset, where we use a smaller ``learning_rate=0.015`` and set ``min_sum_hessian_in_leaf=5``.
For all GPU training we vary the max number of bins (255, 63 and 15).
The GPU implementation is from commit `0bb4a82`_ of LightGBM, when the GPU support was just merged in.
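
For readers using the Python package instead of the CLI, the configuration above translates roughly into the following parameter dictionary (a sketch using LightGBM's parameter aliases, not an exact reproduction of the benchmark scripts):

::

    params = {
        "max_bin": 63,
        "num_leaves": 255,
        "learning_rate": 0.1,
        "tree_learner": "serial",
        "is_training_metric": False,
        "min_data_in_leaf": 1,
        "min_sum_hessian_in_leaf": 100,
        "ndcg_eval_at": [1, 3, 5, 10],
        "device": "gpu",
        "gpu_platform_id": 0,
        "gpu_device_id": 0,
        "num_threads": 28,
    }
    # The objective (binary or lambdarank) and num_boost_round=500 are set per dataset.
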
The following table lists the accuracy on the test set that the CPU and GPU learners can achieve after 500 iterations.
The GPU learner can achieve a similar level of accuracy as the CPU learner with the same number of bins, despite using single-precision arithmetic.
For most datasets, using 63 bins is sufficient.

+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| | CPU 255 bins | CPU 63 bins | CPU 15 bins | GPU 255 bins | GPU 63 bins | GPU 15 bins |
+===========================+================+===============+===============+================+===============+===============+
| Higgs AUC | 0.845612 | 0.845239 | 0.841066 | 0.845612 | 0.845209 | 0.840748 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Epsilon AUC | 0.950243 | 0.949952 | 0.948365 | 0.950057 | 0.949876 | 0.948365 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Yahoo-LTR NDCG\ :sub:`1` | 0.730824 | 0.730165 | 0.729647 | 0.730936 | 0.732257 | 0.73114 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Yahoo-LTR NDCG\ :sub:`3` | 0.738687 | 0.737243 | 0.736445 | 0.73698 | 0.739474 | 0.735868 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Yahoo-LTR NDCG\ :sub:`5` | 0.756609 | 0.755729 | 0.754607 | 0.756206 | 0.757007 | 0.754203 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Yahoo-LTR NDCG\ :sub:`10` | 0.79655 | 0.795827 | 0.795273 | 0.795894 | 0.797302 | 0.795584 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Expo AUC | 0.776217 | 0.771566 | 0.743329 | 0.776285 | 0.77098 | 0.744078 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| MS-LTR NDCG\ :sub:`1` | 0.521265 | 0.521392 | 0.518653 | 0.521789 | 0.522163 | 0.516388 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| MS-LTR NDCG\ :sub:`3` | 0.503153 | 0.505753 | 0.501697 | 0.503886 | 0.504089 | 0.501691 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| MS-LTR NDCG\ :sub:`5` | 0.509236 | 0.510391 | 0.507193 | 0.509861 | 0.510095 | 0.50663 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| MS-LTR NDCG\ :sub:`10` | 0.527835 | 0.527304 | 0.524603 | 0.528009 | 0.527059 | 0.524722 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+
| Bosch AUC | 0.718115 | 0.721791 | 0.716677 | 0.717184 | 0.724761 | 0.717005 |
+---------------------------+----------------+---------------+---------------+----------------+---------------+---------------+

We record the wall clock time after 500 iterations, as shown in the figure below:

.. image:: ./_static/images/gpu-performance-comparison.png
   :align: center
   :target: ./_static/images/gpu-performance-comparison.png
   :alt: A performance chart which is a record of the wall clock time after 500 iterations on G P U for Higgs, epsilon, Bosch, Microsoft L T R, Expo and Yahoo L T R and bin size of 63 performs comparatively better.

When using a GPU, it is advisable to use a bin size of 63 rather than 255, because it can speed up training significantly without noticeably affecting accuracy.
On CPU, using a smaller bin size only marginally improves performance, and sometimes even slows down training,
as on Higgs (we can reproduce the same slowdown on two different machines, with different GCC versions).
We found that GPU can achieve impressive acceleration on large and dense datasets like Higgs and Epsilon.
Even on smaller and sparser datasets, a *budget* GPU can still compete with, and be faster than, a 28-core Haswell server.

Memory Usage
------------
The next table shows GPU memory usage reported by ``nvidia-smi`` during training with 63 bins.
We can see that even the largest dataset just uses about 1 GB of GPU memory,
indicating that our GPU implementation can scale to huge datasets over 10x larger than Bosch or Epsilon.
Also, we can observe that generally a larger dataset (using more GPU memory, like Epsilon or Bosch) has better speedup,
because the overhead of invoking GPU functions becomes significant when the dataset is small.

+-------------------------+---------+-----------+---------+----------+--------+-------------+
| Datasets | Higgs | Epsilon | Bosch | MS-LTR | Expo | Yahoo-LTR |
+=========================+=========+===========+=========+==========+========+=============+
| GPU Memory Usage (MB) | 611 | 901 | 1067 | 413 | 405 | 291 |
+-------------------------+---------+-----------+---------+----------+--------+-------------+

Further Reading
---------------
You can find more details about the GPU algorithm and benchmarks in the
following article:

Huan Zhang, Si Si and Cho-Jui Hsieh. `GPU Acceleration for Large-scale Tree Boosting`_. SysML Conference, 2018.

.. _link1: https://archive.ics.uci.edu/dataset/280/higgs
.. _link2: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
.. _link3: https://www.kaggle.com/c/bosch-production-line-performance/data
.. _link4: https://webscope.sandbox.yahoo.com/catalog.php?datatype=c
.. _link5: https://www.microsoft.com/en-us/research/project/mslr/
.. _link6: https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2009
.. _0bb4a82: https://github.com/microsoft/LightGBM/commit/0bb4a82
.. _GPU Acceleration for Large-scale Tree Boosting: https://arxiv.org/abs/1706.08359