* [python-package] create Dataset from sampled data.
* [python-package] create Dataset from List[Sequence].
1. Use random access for data sampling
2. Support reading data from multiple input files
3. Read data in batches, so the full data set never has to be held in memory
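The three points above can be illustrated with a rough sketch in plain Python. The names `ListSequence`, `sample_rows`, and `iter_batches` are hypothetical and invented for this note; they are not the lightgbm API, whose actual interface lives in python-package/lightgbm/basic.py. The sketch only assumes a data source that supports `len()`, integer indexing, slicing, and a `batch_size` attribute.

```python
import random
from typing import Iterator, List, Union

Row = List[float]


class ListSequence:
    """Hypothetical random-access data source backed by an in-memory list.

    A real source could be backed by a file instead (for example HDF5).
    """

    def __init__(self, rows: List[Row], batch_size: int = 2) -> None:
        self._rows = rows
        self.batch_size = batch_size

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, idx: Union[int, slice]) -> Union[Row, List[Row]]:
        # An int returns a single row; a slice returns a batch of rows.
        return self._rows[idx]


def sample_rows(seqs: List[ListSequence], sample_cnt: int, seed: int = 0) -> List[Row]:
    """Sample rows across several sources using random access only."""
    total = sum(len(s) for s in seqs)
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(total), min(sample_cnt, total)))
    sampled = []
    for global_idx in picked:
        # Translate the global row index into (source, local index).
        local_idx = global_idx
        for s in seqs:
            if local_idx < len(s):
                sampled.append(s[local_idx])
                break
            local_idx -= len(s)
    return sampled


def iter_batches(seq: ListSequence) -> Iterator[List[Row]]:
    """Yield the rows of one source batch by batch."""
    for start in range(0, len(seq), seq.batch_size):
        yield seq[start:start + seq.batch_size]
```

Sampling touches only the selected rows, and the batch iterator holds at most `batch_size` rows at a time, which is the property the list above describes.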
* [python-package] example: create Dataset from multiple HDF5 files.
* fix: revert is_class implementation for seq
* fix: unwanted memory view reference for seq
* fix: seq is_class accepts sklearn matrices
* fix: requirements for example
* fix: pycode
* feat: print static code linting stage
* fix: linting: avoid shell str regex conversion
* code style: doc style
* code style: isort
* fix ci dependency: h5py on windows
* [py] remove rm files in test seq
https://github.com/microsoft/LightGBM/pull/4089#discussion_r612929623
* docs(python): init_from_sample summary
https://github.com/microsoft/LightGBM/pull/4089#discussion_r612903389
* remove dataset dump sample data debugging code.
* remove typo fix.
Create separate PR for this.
* fix typo in src/c_api.cpp
Co-authored-by: James Lamb <jaylamb20@gmail.com>
* style(linting): py3 type hint for seq
* test(basic): os.path style path handling
* Revert "feat: print static code linting stage"
This reverts commit 10bd79f7f8.
* feat(python): sequence on validation set
* minor(python): comment
* minor(python): test option hint
* style(python): fix code linting
* style(python): add pydoc for ref_dataset
* doc(python): sequence
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
* revert(python): sequence class abc
* chore(python): remove rm_files
* Remove useless static_assert.
* refactor: test_basic test for sequence.
* fix lint complaint.
* remove dataset._dump_text in sequence test.
* Fix reverting typo fix.
* Apply suggestions from code review
Co-authored-by: James Lamb <jaylamb20@gmail.com>
* Fix type hint, code and doc style.
* fix failing test_basic.
* Remove TODO about keep constant in sync with cpp.
* Install h5py only when running python-examples.
* Fix lint complaint.
* Apply suggestions from code review
Co-authored-by: James Lamb <jaylamb20@gmail.com>
* Doc fixes, remove unused params_str in __init_from_seqs.
* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
* Remove unnecessary conda install in windows ci script.
* Keep param as example in dataset_from_multi_hdf5.py
* Add _get_sample_count function to remove code duplication.
* Use batch_size parameter in generate_hdf.
* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
* Fix after applying suggestions.
* Fix test, check idx is instance of numbers.Integral.
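A minimal sketch of the kind of index check this entry refers to, assuming an accessor that takes either a single integer index or a slice. The function name `get_rows` is hypothetical and is not taken from the lightgbm code. Checking against `numbers.Integral` rather than `int` also accepts integer types that register themselves with that abstract class, such as NumPy integers.

```python
import numbers
from typing import Any, List


def get_rows(rows: List[Any], idx: Any) -> Any:
    """Return one row for an integer index, or a list of rows for a slice."""
    if isinstance(idx, numbers.Integral):
        # Covers int and any other type registered as Integral.
        return rows[int(idx)]
    if isinstance(idx, slice):
        return rows[idx]
    raise TypeError(
        f"index must be an integer or a slice, got {type(idx).__name__}"
    )
```

A float index such as `1.0` is rejected with `TypeError` instead of being silently truncated.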
* Update python-package/lightgbm/basic.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
* Expose Sequence class in Python-API doc.
* Handle Sequence object not having batch_size.
* Fix isort lint complaint.
* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
* Update docstring to mention Sequence as data input.
* Remove get_one_line in test_basic.py
* Make Sequence an abstract class.
* Reduce number of tests for test_sequence.
* Add c_api: LGBM_SampleCount, fix potential bug in LGBMSampleIndices.
* empty commit to trigger ci
* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
* Rename to LGBM_GetSampleCount, change LGBM_SampleIndices out_len to int32_t.
Also rename total_nrow to num_total_row in c_api.h for consistency.
* Doc about Sequence in docs/Python-Intro.rst.
* Fix: basic.py change LGBM_SampleIndices out_len to int32.
* Add create_valid test case with Dataset from Sequence.
* Apply suggestions from code review
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
* Apply suggestions from code review
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
* Remove no longer used DEFAULT_BIN_CONSTRUCT_SAMPLE_CNT.
* Update python-package/lightgbm/basic.py
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Willian Zhang <willian@willian.email>
Co-authored-by: Willian Z <Willian@Willian-Zhang.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: shiyu1994 <shiyu_k1994@qq.com>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
* Correct spelling
Most changes were in comments, and there were a few changes to literals for log output.
There were no changes to variable names, function names, IDs, or functionality.
* Clarify a phrase in a comment
Co-authored-by: James Lamb <jaylamb20@gmail.com>
* Clarify a phrase in a comment
Co-authored-by: James Lamb <jaylamb20@gmail.com>
* Clarify a phrase in a comment
Co-authored-by: James Lamb <jaylamb20@gmail.com>
* Correct spelling
Most are code comments, but one case is a literal in a logging message.
There are a few grammar fixes too.
Co-authored-by: James Lamb <jaylamb20@gmail.com>
* Revert "specify the last supported version of scikit-learn (#2637)"
This reverts commit d100277649.
* ban scikit-learn 0.22.0 and skip broken test
* fix updated test
* fix lint test
* Revert "fix lint test"
This reverts commit 8b4db0805f.
* Use first_metric_only flag for early_stopping function.
To allow early stopping based on the first metric only, apply the first_metric_only flag in the early_stopping function.
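The idea behind the flag can be sketched as follows. This is an illustration only, not the LightGBM callback: the function name `find_stopping_iteration` and the shape of `history` are assumptions made for this note, and every metric is assumed to be lower-is-better.

```python
from typing import Dict, List, Optional


def find_stopping_iteration(
    history: List[Dict[str, float]],
    stopping_rounds: int,
    first_metric_only: bool = False,
) -> Optional[int]:
    """Return the best iteration of the metric that triggers stopping.

    history[i] maps metric name -> score at iteration i (lower is better).
    Returns None if no tracked metric stalls for `stopping_rounds` rounds.
    """
    if not history:
        return None
    names = list(history[0])
    if first_metric_only:
        # Track only the first metric; the others are ignored entirely.
        names = names[:1]
    best = {name: (float("inf"), 0) for name in names}
    for i, scores in enumerate(history):
        for name in names:
            best_score, best_iter = best[name]
            if scores[name] < best_score:
                best[name] = (scores[name], i)
            elif i - best_iter >= stopping_rounds:
                return best_iter
    return None
```

With two metrics where the second one stops improving early, the default behaviour stops on the second metric, while `first_metric_only=True` keeps training as long as the first metric improves.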
* update comment
* Revert "update comment"
This reverts commit 1e75a1a415.
* added test
* fixed docstring
* cut comment and save one line
* document new feature
* Naming validation data `test_data` is confusing, since train, validation, and test splits are common terms in ML. Change the variable name in the Python quick start.
* added links to corresponding params in Quick-Start guide
* updated description of possible input types in python
* clarify list of numpy arrays input type in docs
* bring consistency and clarity to early_stopping_rounds desc, metric desc and implementation
* hotfix
* hotfix
* used NDCG as default metric for lambdarank task
* fixed missing methods at ReadTheDocs and changed default eval_metric
* left only unique metrics
* fixed comment
* added missing description of plot_example in python_guide folder and made package naming consistent
* more reliable OS detection
* fixed grammar
* made pylint happy
* A nitpicky grammar edit with minor clarifications added.
* fix link
* strike s
* try a different optimal-split link, clarify experimental details
* smoothing the FAQ
* edit Features.rst
* several minor edits throughout docs
histogram-based
The documentation for `early_stopping_rounds` says it will check all of
eval_set, but this is not true: it does not check the dataset
specified as the training data.
This change appends the phrase "except the training data" to every
occurrence of the sentence "If there's more than one, will check all of
them" in the documentation.
* add info on adaptive learning rate in the sklearn API
* adjust learning rate documentation following the PR discussion
* fix early stopping documentation
* improve wording
* fixing trailing spaces