[python-package] limit when num_boost_round warnings are emitted (fixes #6324) (#6579)

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
James Lamb 2024-09-02 22:46:24 -05:00, committed by GitHub
Parent d5150394b8
Commit 3ccdea1a08
No key found matching this signature
GPG key ID: B5690EEEBB952194
5 changed files with 313 additions and 28 deletions

View file

@@ -17,18 +17,63 @@ This page contains descriptions of all parameters in LightGBM.
Parameters Format
-----------------
Parameters are merged together in the following order (later items overwrite earlier ones):
1. LightGBM's default values
2. special files for ``weight``, ``init_score``, ``query``, and ``positions`` (see `Others <#others>`__)
3. (CLI only) configuration in a file passed like ``config=train.conf``
4. (CLI only) configuration passed via the command line
5. (Python, R) special keyword arguments to some functions (e.g. ``num_boost_round`` in ``train()``)
6. (Python, R) ``params`` function argument (including ``**kwargs`` in Python and ``...`` in R)
7. (C API) ``parameters`` or ``params`` function argument
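For example, in Python, a value given via ``params`` (item 6) overrides the corresponding keyword argument (item 5). A minimal sketch of that precedence, assuming ``dtrain`` is an existing ``Dataset``:

.. code-block:: python

    # trains 5 rounds: 'num_iterations' passed through params
    # takes precedence over the num_boost_round keyword argument
    lgb.train(
        params={
            "num_iterations": 5
        },
        train_set=dtrain,
        num_boost_round=100
    )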
Many parameters have "aliases", alternative names which refer to the same configuration.
Where a mix of the primary parameter name and aliases is given, the primary parameter name is always preferred to any aliases.
For example, in Python:
.. code-block:: python
# use learning rate of 0.07, because 'learning_rate'
# is the primary parameter name
lgb.train(
params={
"learning_rate": 0.07,
"shrinkage_rate": 0.12
},
train_set=dtrain
)
Where multiple aliases are given, and the primary parameter name is not, the first alias
appearing in the lists returned by ``Config::parameter2aliases()`` in the C++ library is used.
Those lists are hard-coded in a fairly arbitrary way... wherever possible, avoid relying on this behavior.
For example, in Python:
.. code-block:: python
# use learning rate of 0.12, because LightGBM has a hard-coded preference for 'shrinkage_rate'
# over other aliases like 'eta', and 'learning_rate' is not provided
lgb.train(
params={
"eta": 0.19,
"shrinkage_rate": 0.12
},
train_set=dtrain
)
**CLI**
The parameters format is ``key1=value1 key2=value2 ...``.
Parameters can be set both in config file and command line.
When using the command line, parameters should not have spaces before or after ``=``.
When using a config file, each line may contain only one parameter. You can use ``#`` for comments.
If one parameter appears in both command line and config file, LightGBM will use the parameter from the command line.
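For example, a hypothetical minimal sketch of the two styles described above (the parameter names are real, but the file contents are only illustrative):

::

    # on the command line (no spaces around '=')
    "./lightgbm" config=train.conf learning_rate=0.1

    # train.conf (one parameter per line; '#' starts a comment)
    task = train
    data = train.txt
    num_iterations = 50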
For the Python and R packages, any parameters that accept a list of values (usually they have ``multi-xxx`` type, e.g. ``multi-int`` or ``multi-double``) can be specified in those languages' default array types.
For example, ``monotone_constraints`` can be specified as follows.
**Python**
Any parameters that accept multiple values should be passed as a Python list.
.. code-block:: python
params = {
@@ -38,6 +83,8 @@ For example, ``monotone_constraints`` can be specified as follows.
**R**
Any parameters that accept multiple values should be passed as an R list.
.. code-block:: r
params <- list(
@@ -1340,7 +1387,8 @@ Others
Continued Training with Input Score
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LightGBM supports continued training with initial scores. It uses an additional file to store these initial scores, like the following:
LightGBM supports continued training with initial scores.
It uses an additional file to store these initial scores, like the following:
::
@@ -1352,7 +1400,7 @@ LightGBM supports continued training with initial scores. It uses an additional
It means the initial score of the first data row is ``0.5``, second is ``-0.1``, and so on.
The initial score file corresponds with the data file line by line, and has one score per line.
And if the name of data file is ``train.txt``, the initial score file should be named as ``train.txt.init`` and placed in the same folder as the data file.
If the name of data file is ``train.txt``, the initial score file should be named as ``train.txt.init`` and placed in the same folder as the data file.
In this case, LightGBM will auto load initial score file if it exists.
If binary data files exist for raw data file ``train.txt``, for example in the name ``train.txt.bin``, then the initial score file should be named as ``train.txt.bin.init``.
@@ -1360,7 +1408,8 @@ If binary data files exist for raw data file ``train.txt``, for example in the n
Weight Data
~~~~~~~~~~~
LightGBM supports weighted training. It uses an additional file to store weight data, like the following:
LightGBM supports weighted training.
It uses an additional file to store weight data, like the following:
::
@@ -1376,7 +1425,8 @@ The weight file corresponds with data file line by line, and has per weight per
And if the name of data file is ``train.txt``, the weight file should be named as ``train.txt.weight`` and placed in the same folder as the data file.
In this case, LightGBM will load the weight file automatically if it exists.
Also, you can include weight column in your data file. Please refer to the ``weight_column`` `parameter <#weight_column>`__ in above.
Also, you can include weight column in your data file.
Please refer to the ``weight_column`` `parameter <#weight_column>`__ in above.
Query Data
~~~~~~~~~~
@@ -1405,4 +1455,5 @@ For example, if you have a 112-document dataset with ``group = [27, 18, 67]``, t
If the name of data file is ``train.txt``, the query file should be named as ``train.txt.query`` and placed in the same folder as the data file.
In this case, LightGBM will load the query file automatically if it exists.
Also, you can include query/group id column in your data file. Please refer to the ``group_column`` `parameter <#group_column>`__ in above.
Also, you can include query/group id column in your data file.
Please refer to the ``group_column`` `parameter <#group_column>`__ in above.

View file

@@ -63,6 +63,62 @@ def _emit_dataset_kwarg_warning(calling_function: str, argname: str) -> None:
warnings.warn(msg, category=LGBMDeprecationWarning, stacklevel=2)
def _choose_num_iterations(num_boost_round_kwarg: int, params: Dict[str, Any]) -> Dict[str, Any]:
"""Choose number of boosting rounds.
In ``train()`` and ``cv()``, there are multiple ways to provide configuration for
the number of boosting rounds to perform:
* the ``num_boost_round`` keyword argument
* ``num_iterations`` or any of its aliases via the ``params`` dictionary
These should be preferred in the following order (first one found wins):
1. ``num_iterations`` provided via ``params`` (because it's the main parameter name)
2. any other aliases of ``num_iterations`` provided via ``params``
3. the ``num_boost_round`` keyword argument
This function handles that choice and issues helpful warnings in cases where the
result might be surprising.
Returns
-------
params : dict
Parameters, with ``"num_iterations"`` set to the preferred value and all other
aliases of ``num_iterations`` removed.
"""
num_iteration_configs_provided = {
alias: params[alias] for alias in _ConfigAliases.get("num_iterations") if alias in params
}
# now that the relevant information has been pulled out of params, it's safe to overwrite it
# with the content that should be used for training (i.e. with aliases resolved)
params = _choose_param_value(
main_param_name="num_iterations",
params=params,
default_value=num_boost_round_kwarg,
)
# if at most one boosting-rounds configuration was provided in params,
# then by definition there cannot be conflicting values... no need to warn
if len(num_iteration_configs_provided) <= 1:
return params
# if all the aliases have the same value, no need to warn
if len(set(num_iteration_configs_provided.values())) <= 1:
return params
# if this line is reached, lightgbm should warn
value_string = ", ".join(f"{alias}={val}" for alias, val in num_iteration_configs_provided.items())
_log_warning(
f"Found conflicting values for num_iterations provided via 'params': {value_string}. "
f"LightGBM will perform up to {params['num_iterations']} boosting rounds. "
"To be confident in the maximum number of boosting rounds LightGBM will perform and to "
"suppress this warning, modify 'params' so that only one of those is present."
)
return params
def train(
params: Dict[str, Any],
train_set: Dataset,
@@ -169,9 +225,6 @@ def train(
if not isinstance(train_set, Dataset):
raise TypeError(f"train() only accepts Dataset object, train_set has type '{type(train_set).__name__}'.")
if num_boost_round <= 0:
raise ValueError(f"num_boost_round must be greater than 0. Got {num_boost_round}.")
if isinstance(valid_sets, list):
for i, valid_item in enumerate(valid_sets):
if not isinstance(valid_item, Dataset):
@@ -198,11 +251,12 @@ def train(
if callable(params["objective"]):
fobj = params["objective"]
params["objective"] = "none"
for alias in _ConfigAliases.get("num_iterations"):
if alias in params:
num_boost_round = params.pop(alias)
_log_warning(f"Found `{alias}` in params. Will use it instead of argument")
params["num_iterations"] = num_boost_round
params = _choose_num_iterations(num_boost_round_kwarg=num_boost_round, params=params)
num_boost_round = params["num_iterations"]
if num_boost_round <= 0:
raise ValueError(f"Number of boosting rounds must be greater than 0. Got {num_boost_round}.")
# setting early stopping via global params should be possible
params = _choose_param_value(
main_param_name="early_stopping_round",
@@ -713,9 +767,6 @@ def cv(
if not isinstance(train_set, Dataset):
raise TypeError(f"cv() only accepts Dataset object, train_set has type '{type(train_set).__name__}'.")
if num_boost_round <= 0:
raise ValueError(f"num_boost_round must be greater than 0. Got {num_boost_round}.")
# raise deprecation warnings if necessary
# ref: https://github.com/microsoft/LightGBM/issues/6435
if categorical_feature != "auto":
@@ -733,11 +784,12 @@ def cv(
if callable(params["objective"]):
fobj = params["objective"]
params["objective"] = "none"
for alias in _ConfigAliases.get("num_iterations"):
if alias in params:
_log_warning(f"Found '{alias}' in params. Will use it instead of 'num_boost_round' argument")
num_boost_round = params.pop(alias)
params["num_iterations"] = num_boost_round
params = _choose_num_iterations(num_boost_round_kwarg=num_boost_round, params=params)
num_boost_round = params["num_iterations"]
if num_boost_round <= 0:
raise ValueError(f"Number of boosting rounds must be greater than 0. Got {num_boost_round}.")
# setting early stopping via global params should be possible
params = _choose_param_value(
main_param_name="early_stopping_round",
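To illustrate the resolution logic this diff introduces, here is a rough sketch (``_choose_num_iterations`` is an internal helper in ``lightgbm.engine``, not public API, so details may change):

# params contains two conflicting aliases of num_iterations
params = {"objective": "regression", "n_iter": 6, "num_iterations": 5}
params = _choose_num_iterations(num_boost_round_kwarg=100, params=params)
# "num_iterations" wins because it is the main parameter name:
#   params["num_iterations"] == 5, and the "n_iter" alias has been removed
# because two different values were provided, a single warning is logged
# listing the conflicting aliases and the number of rounds LightGBM will use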

View file

@@ -24,6 +24,7 @@ from lightgbm.compat import PANDAS_INSTALLED, pd_DataFrame, pd_Series
from .utils import (
SERIALIZERS,
assert_silent,
dummy_obj,
load_breast_cancer,
load_digits,
@@ -4291,7 +4292,7 @@ def test_verbosity_is_respected_when_using_custom_objective(capsys):
"num_leaves": 3,
}
lgb.train({**params, "verbosity": -1}, ds, num_boost_round=1)
assert capsys.readouterr().out == ""
assert_silent(capsys)
lgb.train({**params, "verbosity": 0}, ds, num_boost_round=1)
assert "[LightGBM] [Warning] Unknown parameter: nonsense" in capsys.readouterr().out
@@ -4320,6 +4321,115 @@ def test_verbosity_can_suppress_alias_warnings(capsys, verbosity_param, verbosit
assert re.search(r"\[LightGBM\]", stdout) is None
def test_cv_only_raises_num_rounds_warning_when_expected(capsys):
X, y = make_synthetic_regression()
ds = lgb.Dataset(X, y)
base_params = {
"num_leaves": 5,
"objective": "regression",
"verbosity": -1,
}
additional_kwargs = {"return_cvbooster": True, "stratified": False}
# no warning: no aliases, all defaults
cv_bst = lgb.cv({**base_params}, ds, **additional_kwargs)
assert all(t == 100 for t in cv_bst["cvbooster"].num_trees())
assert_silent(capsys)
# no warning: no aliases, just num_boost_round
cv_bst = lgb.cv({**base_params}, ds, num_boost_round=2, **additional_kwargs)
assert all(t == 2 for t in cv_bst["cvbooster"].num_trees())
assert_silent(capsys)
# no warning: 1 alias + num_boost_round (both same value)
cv_bst = lgb.cv({**base_params, "n_iter": 3}, ds, num_boost_round=3, **additional_kwargs)
assert all(t == 3 for t in cv_bst["cvbooster"].num_trees())
assert_silent(capsys)
# no warning: 1 alias + num_boost_round (different values... value from params should win)
cv_bst = lgb.cv({**base_params, "n_iter": 4}, ds, num_boost_round=3, **additional_kwargs)
assert all(t == 4 for t in cv_bst["cvbooster"].num_trees())
assert_silent(capsys)
# no warning: 2 aliases (both same value)
cv_bst = lgb.cv({**base_params, "n_iter": 3, "num_iterations": 3}, ds, **additional_kwargs)
assert all(t == 3 for t in cv_bst["cvbooster"].num_trees())
assert_silent(capsys)
# no warning: 4 aliases (all same value)
cv_bst = lgb.cv({**base_params, "n_iter": 3, "num_trees": 3, "nrounds": 3, "max_iter": 3}, ds, **additional_kwargs)
assert all(t == 3 for t in cv_bst["cvbooster"].num_trees())
assert_silent(capsys)
# warning: 2 aliases (different values... "num_iterations" wins because it's the main param name)
with pytest.warns(UserWarning, match="LightGBM will perform up to 5 boosting rounds"):
cv_bst = lgb.cv({**base_params, "n_iter": 6, "num_iterations": 5}, ds, **additional_kwargs)
assert all(t == 5 for t in cv_bst["cvbooster"].num_trees())
# should not be any other logs (except the warning, intercepted by pytest)
assert_silent(capsys)
# warning: 2 aliases (different values... first one in the order from Config::parameter2aliases() wins)
with pytest.warns(UserWarning, match="LightGBM will perform up to 4 boosting rounds"):
cv_bst = lgb.cv({**base_params, "n_iter": 4, "max_iter": 5}, ds, **additional_kwargs)["cvbooster"]
assert all(t == 4 for t in cv_bst.num_trees())
# should not be any other logs (except the warning, intercepted by pytest)
assert_silent(capsys)
def test_train_only_raises_num_rounds_warning_when_expected(capsys):
X, y = make_synthetic_regression()
ds = lgb.Dataset(X, y)
base_params = {
"num_leaves": 5,
"objective": "regression",
"verbosity": -1,
}
# no warning: no aliases, all defaults
bst = lgb.train({**base_params}, ds)
assert bst.num_trees() == 100
assert_silent(capsys)
# no warning: no aliases, just num_boost_round
bst = lgb.train({**base_params}, ds, num_boost_round=2)
assert bst.num_trees() == 2
assert_silent(capsys)
# no warning: 1 alias + num_boost_round (both same value)
bst = lgb.train({**base_params, "n_iter": 3}, ds, num_boost_round=3)
assert bst.num_trees() == 3
assert_silent(capsys)
# no warning: 1 alias + num_boost_round (different values... value from params should win)
bst = lgb.train({**base_params, "n_iter": 4}, ds, num_boost_round=3)
assert bst.num_trees() == 4
assert_silent(capsys)
# no warning: 2 aliases (both same value)
bst = lgb.train({**base_params, "n_iter": 3, "num_iterations": 3}, ds)
assert bst.num_trees() == 3
assert_silent(capsys)
# no warning: 4 aliases (all same value)
bst = lgb.train({**base_params, "n_iter": 3, "num_trees": 3, "nrounds": 3, "max_iter": 3}, ds)
assert bst.num_trees() == 3
assert_silent(capsys)
# warning: 2 aliases (different values... "num_iterations" wins because it's the main param name)
with pytest.warns(UserWarning, match="LightGBM will perform up to 5 boosting rounds"):
bst = lgb.train({**base_params, "n_iter": 6, "num_iterations": 5}, ds)
assert bst.num_trees() == 5
# should not be any other logs (except the warning, intercepted by pytest)
assert_silent(capsys)
# warning: 2 aliases (different values... first one in the order from Config::parameter2aliases() wins)
with pytest.warns(UserWarning, match="LightGBM will perform up to 4 boosting rounds"):
bst = lgb.train({**base_params, "n_iter": 4, "max_iter": 5}, ds)
assert bst.num_trees() == 4
# should not be any other logs (except the warning, intercepted by pytest)
assert_silent(capsys)
@pytest.mark.skipif(not PANDAS_INSTALLED, reason="pandas is not installed")
def test_validate_features():
X, y = make_synthetic_regression()
@@ -4355,7 +4465,7 @@ def test_train_and_cv_raise_informative_error_for_train_set_of_wrong_type():
@pytest.mark.parametrize("num_boost_round", [-7, -1, 0])
def test_train_and_cv_raise_informative_error_for_impossible_num_boost_round(num_boost_round):
X, y = make_synthetic_regression(n_samples=100)
error_msg = rf"num_boost_round must be greater than 0\. Got {num_boost_round}\."
error_msg = rf"Number of boosting rounds must be greater than 0\. Got {num_boost_round}\."
with pytest.raises(ValueError, match=error_msg):
lgb.train({}, train_set=lgb.Dataset(X, y), num_boost_round=num_boost_round)
with pytest.raises(ValueError, match=error_msg):

View file

@@ -24,6 +24,7 @@ import lightgbm as lgb
from lightgbm.compat import DATATABLE_INSTALLED, PANDAS_INSTALLED, dt_DataTable, pd_DataFrame, pd_Series
from .utils import (
assert_silent,
load_breast_cancer,
load_digits,
load_iris,
@@ -1337,6 +1338,58 @@ def test_verbosity_is_respected_when_using_custom_objective(capsys):
assert "[LightGBM] [Warning] Unknown parameter: nonsense" in capsys.readouterr().out
def test_fit_only_raises_num_rounds_warning_when_expected(capsys):
X, y = make_synthetic_regression()
base_kwargs = {
"num_leaves": 5,
"verbosity": -1,
}
# no warning: no aliases, all defaults
reg = lgb.LGBMRegressor(**base_kwargs).fit(X, y)
assert reg.n_estimators_ == 100
assert_silent(capsys)
# no warning: no aliases, just n_estimators
reg = lgb.LGBMRegressor(**base_kwargs, n_estimators=2).fit(X, y)
assert reg.n_estimators_ == 2
assert_silent(capsys)
# no warning: 1 alias + n_estimators (both same value)
reg = lgb.LGBMRegressor(**base_kwargs, n_estimators=3, n_iter=3).fit(X, y)
assert reg.n_estimators_ == 3
assert_silent(capsys)
# no warning: 1 alias + n_estimators (different values... value from params should win)
reg = lgb.LGBMRegressor(**base_kwargs, n_estimators=3, n_iter=4).fit(X, y)
assert reg.n_estimators_ == 4
assert_silent(capsys)
# no warning: 2 aliases (both same value)
reg = lgb.LGBMRegressor(**base_kwargs, n_iter=3, num_iterations=3).fit(X, y)
assert reg.n_estimators_ == 3
assert_silent(capsys)
# no warning: 4 aliases (all same value)
reg = lgb.LGBMRegressor(**base_kwargs, n_iter=3, num_trees=3, nrounds=3, max_iter=3).fit(X, y)
assert reg.n_estimators_ == 3
assert_silent(capsys)
# warning: 2 aliases (different values... "num_iterations" wins because it's the main param name)
with pytest.warns(UserWarning, match="LightGBM will perform up to 5 boosting rounds"):
reg = lgb.LGBMRegressor(**base_kwargs, num_iterations=5, n_iter=6).fit(X, y)
assert reg.n_estimators_ == 5
# should not be any other logs (except the warning, intercepted by pytest)
assert_silent(capsys)
# warning: 2 aliases (different values... first one in the order from Config::parameter2aliases() wins)
with pytest.warns(UserWarning, match="LightGBM will perform up to 4 boosting rounds"):
reg = lgb.LGBMRegressor(**base_kwargs, n_iter=4, max_iter=5).fit(X, y)
assert reg.n_estimators_ == 4
# should not be any other logs (except the warning, intercepted by pytest)
assert_silent(capsys)
@pytest.mark.parametrize("estimator_class", [lgb.LGBMModel, lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker])
def test_getting_feature_names_in_np_input(estimator_class):
# input is a numpy array, which doesn't have feature names. LightGBM adds

View file

@@ -191,6 +191,25 @@ def pickle_and_unpickle_object(obj, serializer):
return obj_from_disk # noqa: RET504
def assert_silent(capsys) -> None:
"""
Given a ``CaptureFixture`` instance (from the ``pytest`` built-in ``capsys`` fixture),
read the recently-captured data into a variable and assert that nothing was written
to stdout or stderr.
This is just here to turn 3 lines of repetitive code into 1.
Note that this does have a side effect... ``capsys.readouterr()`` copies
from a buffer then frees it. So it will only store into ``.out`` and ``.err`` the
captured output since the last time that ``.readouterr()`` was called.
ref: https://docs.pytest.org/en/stable/how-to/capture-stdout-stderr.html
"""
captured = capsys.readouterr()
assert captured.out == "", captured.out
assert captured.err == "", captured.err
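# A minimal sketch (hypothetical test, not part of this module) of the side
# effect described in the docstring above: each ``capsys.readouterr()`` call
# consumes whatever was captured since the previous call.
def _example_readouterr_side_effect(capsys) -> None:
    print("hello")
    assert capsys.readouterr().out == "hello\n"  # first read consumes the captured output
    assert capsys.readouterr().out == ""  # so an immediate second read sees nothing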
# doing this here, at import time, to ensure it only runs once per import
# instead of once per assertion
_numpy_testing_supports_strict_kwarg = "strict" in getfullargspec(np.testing.assert_array_equal).kwonlyargs