* [dask] raise more informative error for duplicates in 'machines'
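  A minimal sketch of the kind of check this commit refers to, assuming `machines` is the comma-separated `host:port` list accepted by the Dask estimators (the helper name is illustrative, not the actual implementation):

  ```python
  def _check_machines(machines: str) -> None:
      # 'machines' is a comma-separated list of host:port entries;
      # duplicates would make workers collide on the same address.
      addresses = machines.split(",")
      if len(set(addresses)) != len(addresses):
          raise ValueError(
              f"Found duplicates in 'machines' ({machines}). "
              "Each entry must be a unique host:port."
          )
  ```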
* uncomment
* avoid test failure
* Revert "avoid test failure"
This reverts commit 9442bdf00f.
* include multiclass-classification task and task_to_model_factory dicts
* define the centers' coordinates. flatten init_scores within each partition for multiclass-classification
* include issue comment and fix linting error
* include support for init_score
* use dataframe from init_score and test difference with and without init_score in local model
* revert refactoring
* initial docs. test between distributed models with and without init_score
* remove ranker from tests
* test value for root node and change docs
* comma
* re-include parametrize
* fix incorrect merge
* use single init_score and the booster_ attribute
* use np.float64 instead of float
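  The commits above add `init_score` support to the Dask estimators. A rough sketch of the per-partition handling they describe, assuming a multiclass `init_score` arrives as a 2-D `(n_rows, n_classes)` array that must be flattened to a single 1-D array (the helper name and the flattening order are assumptions for illustration):

  ```python
  import numpy as np

  def _flatten_multiclass_init_score(init_score: np.ndarray) -> np.ndarray:
      # Regression/binary init_score is already 1-D; multiclass init_score
      # comes in as (n_rows, n_classes) and is flattened per partition.
      # NOTE: row-major order is assumed here for illustration only.
      if init_score.ndim == 2:
          return init_score.ravel()
      return init_score
  ```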
* [docs] Add alt text to image in Parameters-Tuning.rst
Add alt text to Leaf-wise growth image, as part of #4028
* Update docs/Parameters-Tuning.rst
Co-authored-by: James Lamb <jaylamb20@gmail.com>
* [dask] [ci] add support for scikit-learn 0.24+ in tests (fixes #4031)
* Update tests/python_package_test/test_dask.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
* try upgrading miktexsetup
* they changed the executable name UGH
* more changes for executable name
* another path change
* changing package mirrors
* undo experiments
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
* include support for column array as label
* remove nested ifs
* fix linting errors
* include tests for sklearn regressors
* include docstring for numpy_1d_array_to_dtype
* include . at end of docstring
* remove pandas import and test for regression, classification and ranking
* check predictions of sklearn models as well
* test training only in dask. drop pandas series tests
* use PANDAS_INSTALLED and pd_Series
* inline imports
* use col array in fit for test_dask
* include review comments
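  These commits let the scikit-learn wrappers accept a column array (shape `(n, 1)`) as the label. A hedged sketch of that coercion, loosely modeled on the `numpy_1d_array_to_dtype` helper mentioned above (exact name and behavior may differ):

  ```python
  import numpy as np

  def _label_to_1d_float64(y) -> np.ndarray:
      # Accept either a 1-D array or a column vector of shape (n, 1)
      # and return a 1-D float64 array suitable for fit().
      y = np.asarray(y)
      if y.ndim == 2 and y.shape[1] == 1:
          y = y.ravel()
      return y.astype(np.float64, copy=False)
  ```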
* use socket.bind with port 0 and client.run to find random open ports
* include test for found ports
* find random open ports as default
* parametrize local_listen_port. add type hint to _find_random_open_port. find open ports only on workers with data.
* make indentation consistent and pass list of workers to client.run
* remove socket import
* change random port implementation
* fix test
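  A minimal sketch of the port-discovery approach these commits describe: bind to port 0 so the OS assigns a free port, then use `client.run` to do that only on the workers of interest (the second helper name is illustrative):

  ```python
  import socket

  from dask.distributed import Client

  def _find_random_open_port() -> int:
      # Binding to port 0 lets the operating system choose a free port.
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
          s.bind(("", 0))
          return s.getsockname()[1]

  def _find_ports_for_workers(client: Client, worker_addresses) -> dict:
      # Run the search on the given workers only (e.g. those holding data);
      # returns {worker_address: open_port}.
      return client.run(_find_random_open_port, workers=list(worker_addresses))
  ```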
* Fix index out-of-range exception generated by BaggingHelper on small datasets.
Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.
* Update goss.hpp
* Update goss.hpp
* Add API method LGBM_BoosterPredictForMats, which runs prediction on a data set given as an array of pointers to rows (as opposed to the existing method LGBM_BoosterPredictForMat, which requires data given as a contiguous array)
* Fix incorrect upstream merge
* Add link to LightGBM.NET
* Fix indenting to 2 spaces
* Dummy edit to trigger CI
* Dummy edit to trigger CI
* remove duplicate functions from merge
* Fix evaluation of linear trees with a single leaf.
Note that trees without linear models at the leaves always handle num_leaves = 1 as a special case and directly output the leaf value. Linear trees were missing this special-case handling, and hence would have the following issues:
* Calling Tree::Predict or Tree::PredictByMap would cause an access violation exception attempting to access the first value of the empty split_feature_ array in GetLeaf.
* PredictionFunLinear would either cause an access violation or go into an infinite loop when attempting to do the equivalent of GetLeaf.
Note also that PredictionFun does not need the same changes as PredictionFunLinear, since both are only called by Tree::AddPredictionToScore, which has a special case for (!is_linear_ && num_leaves_ <= 1) that precludes calling PredictionFun.
Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>
* In the Tree::ToString() method, print double values for linear tree models with high precision, so that the tree can be accurately reproduced elsewhere (LightGBM.Net in particular)
* Need to use the more precise StringToArray instead of StringToArrayFast when parsing double-valued arrays for linear trees, to ensure models round-trip via string or file correctly.
Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>
* Fix for the CreatePredictor function: in VS2017 Debug builds, the previous version would end up with an uninitialised prediction function that threw access violation exceptions when invoked.
Co-authored-by: matthew-peacock <matthew.peacock@whiteoakam.com>
Co-authored-by: Guolin Ke <guolin.ke@outlook.com>
Approximately 80% of the runtime when loading "low column count, high row
count" DataFrames into Datasets is consumed in `np.fromiter`, called
as part of the `Dataset.get_field` method.
This is a particularly pernicious hotspot: unlike other ctypes-based
methods, it is a hot loop over a Python iterator and causes
significant GIL contention in multi-threaded applications.
Replace `np.fromiter` with a direct call to `np.ctypeslib.as_array`,
which allows a single-shot `copy` of the underlying array.
This reduces the load time of a ~35 million row categorical dataframe
with 1 column from ~5 seconds to ~1 second, and allows multi-threaded
execution.
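A sketch of the change being described, assuming `cptr` is a ctypes pointer to `length` doubles obtained from the C API (simplified; the function names here are illustrative and the real code also handles other dtypes):

```python
import numpy as np

# Before: np.fromiter walks the ctypes pointer element by element in a
# Python-level loop, holding the GIL for the entire copy.
def _double_pointer_to_numpy_slow(cptr, length):
    return np.fromiter(cptr, dtype=np.float64, count=length)

# After: view the C buffer as an ndarray and copy it in a single shot.
def _double_pointer_to_numpy_fast(cptr, length):
    return np.ctypeslib.as_array(cptr, shape=(length,)).copy()
```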