Merge branch 'master' of github.com:microsoft/LightGBM into fix/dockerfiles
jameslamb committed Sep 3, 2024
2 parents 9afebc9 + 3ccdea1 commit dca1333
Showing 5 changed files with 313 additions and 28 deletions.
71 changes: 61 additions & 10 deletions docs/Parameters.rst
Expand Up @@ -17,18 +17,63 @@ This page contains descriptions of all parameters in LightGBM.
Parameters Format
-----------------

Parameters are merged together in the following order (later items overwrite earlier ones):

1. LightGBM's default values
2. special files for ``weight``, ``init_score``, ``query``, and ``positions`` (see `Others <#others>`__)
3. (CLI only) configuration in a file passed like ``config=train.conf``
4. (CLI only) configuration passed via the command line
5. (Python, R) special keyword arguments to some functions (e.g. ``num_boost_round`` in ``train()``)
6. (Python, R) ``params`` function argument (including ``**kwargs`` in Python and ``...`` in R)
7. (C API) ``parameters`` or ``params`` function argument
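The ordering above behaves like a series of dictionary updates, where each later source overwrites keys set by earlier ones. A minimal sketch, assuming plain Python dicts stand in for each source; ``merge_parameter_sources`` and the sample values are hypothetical and not part of LightGBM's API:

```python
def merge_parameter_sources(*sources):
    """Merge parameter dicts in order; later sources overwrite earlier ones."""
    merged = {}
    for source in sources:
        merged.update(source)
    return merged


# hypothetical stand-ins for three of the sources listed above
defaults = {"learning_rate": 0.1, "num_leaves": 31}
config_file = {"learning_rate": 0.05}
command_line = {"num_leaves": 15}

resolved = merge_parameter_sources(defaults, config_file, command_line)
print(resolved)
```

Each key ends up with the value from the last source that set it, so here ``learning_rate`` comes from the config file and ``num_leaves`` from the command line.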

Many parameters have "aliases", alternative names which refer to the same configuration.

Where a mix of the primary parameter name and aliases are given, the primary parameter name is always preferred to any aliases.

For example, in Python:

.. code-block:: python

    # use learning rate of 0.07, because 'learning_rate'
    # is the primary parameter name
    lgb.train(
        params={
            "learning_rate": 0.07,
            "shrinkage_rate": 0.12
        },
        train_set=dtrain
    )

Where multiple aliases are given, and the primary parameter name is not, the first alias
appearing in the lists returned by ``Config::parameter2aliases()`` in the C++ library is used.
Those lists are hard-coded in a fairly arbitrary way... wherever possible, avoid relying on this behavior.

For example, in Python:

.. code-block:: python

    # use learning rate of 0.12, because LightGBM has a hard-coded preference for
    # 'shrinkage_rate' over any other aliases, and 'learning_rate' is not provided
    lgb.train(
        params={
            "eta": 0.19,
            "shrinkage_rate": 0.12
        },
        train_set=dtrain
    )

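The two rules above (the primary name always wins, otherwise the first alias in a fixed order wins) can be sketched in plain Python. This is an illustration only: ``resolve_alias`` is a hypothetical helper, not LightGBM's implementation, and the alias order is passed in explicitly rather than read from the C++ library.

```python
def resolve_alias(params, main_name, aliases):
    """Return a copy of params with main_name resolved and its aliases removed.

    `aliases` is assumed to be ordered by preference (hypothetical input).
    """
    resolved = {k: v for k, v in params.items() if k != main_name and k not in aliases}
    if main_name in params:
        # rule 1: the primary parameter name is preferred to any alias
        resolved[main_name] = params[main_name]
        return resolved
    for alias in aliases:
        if alias in params:
            # rule 2: otherwise, the first alias in the preference order is used
            resolved[main_name] = params[alias]
            break
    return resolved


learning_rate_aliases = ["shrinkage_rate", "eta"]
with_primary = resolve_alias(
    {"learning_rate": 0.07, "shrinkage_rate": 0.12}, "learning_rate", learning_rate_aliases
)
aliases_only = resolve_alias(
    {"eta": 0.19, "shrinkage_rate": 0.12}, "learning_rate", learning_rate_aliases
)
```

With these inputs, ``with_primary`` keeps ``0.07`` and ``aliases_only`` keeps ``0.12``, matching the two documentation examples.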
**CLI**

The parameters format is ``key1=value1 key2=value2 ...``.
Parameters can be set both in a config file and on the command line.
On the command line, parameters should not have spaces before or after ``=``.
In config files, each line can contain only one parameter. You can use ``#`` to comment.

If one parameter appears in both command line and config file, LightGBM will use the parameter from the command line.
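The CLI rules above can be illustrated with a small parser. This is a sketch under stated assumptions, not LightGBM's actual parser: ``parse_config_text`` and ``parse_cli_args`` are hypothetical helpers, values are kept as strings, and the sketch assumes that everything after ``#`` on a line is a comment and that whitespace around ``=`` is tolerated in config files.

```python
def parse_config_text(text):
    """Parse config-file text: one `key=value` per line, `#` starts a comment."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


def parse_cli_args(args):
    """Parse command-line style `key=value` tokens (no spaces around `=`)."""
    return dict(arg.split("=", 1) for arg in args)


config_text = "learning_rate = 0.05  # slower learning\n# a full-line comment\nnum_leaves=31\n"
config_params = parse_config_text(config_text)
cli_params = parse_cli_args(["num_leaves=15"])

# command-line values overwrite values from the config file
merged = {**config_params, **cli_params}
```

Here ``num_leaves`` appears in both places, so the command-line value is the one that survives the merge.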

For the Python and R packages, any parameters that accept a list of values (usually they have ``multi-xxx`` type, e.g. ``multi-int`` or ``multi-double``) can be specified in those languages' default array types.
For example, ``monotone_constraints`` can be specified as follows.

**Python**

Any parameters that accept multiple values should be passed as a Python list.

.. code-block:: python

    params = {
Expand All @@ -38,6 +83,8 @@ For example, ``monotone_constraints`` can be specified as follows.
**R**

Any parameters that accept multiple values should be passed as an R list.

.. code-block:: r

    params <- list(
Expand Down Expand Up @@ -1340,7 +1387,8 @@ Others
Continued Training with Input Score
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

LightGBM supports continued training with initial scores. It uses an additional file to store these initial scores, like the following:
LightGBM supports continued training with initial scores.
It uses an additional file to store these initial scores, like the following:

::

Expand All @@ -1352,15 +1400,16 @@ LightGBM supports continued training with initial scores. It uses an additional
It means the initial score of the first data row is ``0.5``, second is ``-0.1``, and so on.
The initial score file corresponds with the data file line by line, and has one score per line.

And if the name of data file is ``train.txt``, the initial score file should be named as ``train.txt.init`` and placed in the same folder as the data file.
If the name of data file is ``train.txt``, the initial score file should be named as ``train.txt.init`` and placed in the same folder as the data file.
In this case, LightGBM will automatically load the initial score file if it exists.

If binary data files exist for raw data file ``train.txt``, for example in the name ``train.txt.bin``, then the initial score file should be named as ``train.txt.bin.init``.
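The naming convention above can be sketched with the standard library. ``write_side_file``, the data rows, and the third score are hypothetical and used only for illustration; the sketch writes a temporary ``train.txt`` and a matching ``train.txt.init`` with one score per line.

```python
import tempfile
from pathlib import Path


def write_side_file(data_path, suffix, values):
    """Write one value per line to `<data_path><suffix>`, next to the data file."""
    side_path = Path(f"{data_path}{suffix}")
    side_path.write_text("".join(f"{value}\n" for value in values))
    return side_path


with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = Path(tmp_dir) / "train.txt"
    # three hypothetical data rows, so three initial scores are needed
    data_path.write_text("1 0.2 0.3\n0 0.1 0.9\n1 0.4 0.7\n")
    init_path = write_side_file(data_path, ".init", [0.5, -0.1, 0.9])
    init_name = init_path.name
    init_lines = init_path.read_text().splitlines()
```

The same helper would cover the other side files described in this section by changing the suffix, for example ``.weight`` or ``.query``.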

Weight Data
~~~~~~~~~~~

LightGBM supports weighted training. It uses an additional file to store weight data, like the following:
LightGBM supports weighted training.
It uses an additional file to store weight data, like the following:

::

Expand All @@ -1376,7 +1425,8 @@ The weight file corresponds with data file line by line, and has per weight per
And if the name of data file is ``train.txt``, the weight file should be named as ``train.txt.weight`` and placed in the same folder as the data file.
In this case, LightGBM will load the weight file automatically if it exists.

Also, you can include a weight column in your data file. Please refer to the ``weight_column`` `parameter <#weight_column>`__ above.
Also, you can include a weight column in your data file.
Please refer to the ``weight_column`` `parameter <#weight_column>`__ above.

Query Data
~~~~~~~~~~
Expand Down Expand Up @@ -1405,4 +1455,5 @@ For example, if you have a 112-document dataset with ``group = [27, 18, 67]``, t
If the name of data file is ``train.txt``, the query file should be named as ``train.txt.query`` and placed in the same folder as the data file.
In this case, LightGBM will load the query file automatically if it exists.

Also, you can include a query/group id column in your data file. Please refer to the ``group_column`` `parameter <#group_column>`__ above.
Also, you can include a query/group id column in your data file.
Please refer to the ``group_column`` `parameter <#group_column>`__ above.
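To make the ``group = [27, 18, 67]`` example concrete, the group sizes can be converted into row ranges. A minimal sketch: ``group_row_ranges`` is a hypothetical helper, and the ranges it returns are 0-based and half-open (start inclusive, end exclusive).

```python
from itertools import accumulate


def group_row_ranges(group_sizes):
    """Convert consecutive group sizes into (start, end) row ranges."""
    ends = list(accumulate(group_sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


ranges = group_row_ranges([27, 18, 67])
print(ranges)
```

The final end index equals the total number of rows (112 here), which is a quick sanity check that the group sizes cover the whole dataset.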
84 changes: 68 additions & 16 deletions python-package/lightgbm/engine.py
Expand Up @@ -63,6 +63,62 @@ def _emit_dataset_kwarg_warning(calling_function: str, argname: str) -> None:
    warnings.warn(msg, category=LGBMDeprecationWarning, stacklevel=2)


def _choose_num_iterations(num_boost_round_kwarg: int, params: Dict[str, Any]) -> Dict[str, Any]:
    """Choose number of boosting rounds.

    In ``train()`` and ``cv()``, there are multiple ways to provide configuration for
    the number of boosting rounds to perform:

      * the ``num_boost_round`` keyword argument
      * any of the ``num_iterations`` or its aliases via the ``params`` dictionary

    These should be preferred in the following order (first one found wins):

      1. ``num_iterations`` provided via ``params`` (because it's the main parameter name)
      2. any other aliases of ``num_iterations`` provided via ``params``
      3. the ``num_boost_round`` keyword argument

    This function handles that choice, and issuing helpful warnings in the cases where the
    result might be surprising.

    Returns
    -------
    params : dict
        Parameters, with ``"num_iterations"`` set to the preferred value and all other
        aliases of ``num_iterations`` removed.
    """
    num_iteration_configs_provided = {
        alias: params[alias] for alias in _ConfigAliases.get("num_iterations") if alias in params
    }

    # now that the relevant information has been pulled out of params, it's safe to overwrite it
    # with the content that should be used for training (i.e. with aliases resolved)
    params = _choose_param_value(
        main_param_name="num_iterations",
        params=params,
        default_value=num_boost_round_kwarg,
    )

    # if there were not multiple boosting rounds configurations provided in params,
    # then by definition they cannot have conflicting values... no need to warn
    if len(num_iteration_configs_provided) <= 1:
        return params

    # if all the aliases have the same value, no need to warn
    if len(set(num_iteration_configs_provided.values())) <= 1:
        return params

    # if this line is reached, lightgbm should warn
    value_string = ", ".join(f"{alias}={val}" for alias, val in num_iteration_configs_provided.items())
    _log_warning(
        f"Found conflicting values for num_iterations provided via 'params': {value_string}. "
        f"LightGBM will perform up to {params['num_iterations']} boosting rounds. "
        "To be confident in the maximum number of boosting rounds LightGBM will perform and to "
        "suppress this warning, modify 'params' so that only one of those is present."
    )
    return params
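
The warning condition used above (warn only when more than one name is present and the values differ) can be tried on its own. This is a standalone sketch, not the library code: ``conflicting_values`` is a hypothetical helper, and ``NUM_ITERATIONS_NAMES`` is only the subset of names that appears in this commit's tests.

```python
def conflicting_values(params, names):
    """Return the provided name/value pairs if they disagree, else an empty dict."""
    provided = {name: params[name] for name in names if name in params}
    # zero or one distinct value means there is nothing to warn about
    if len(set(provided.values())) <= 1:
        return {}
    return provided


# illustrative subset of names, taken from the tests in this commit
NUM_ITERATIONS_NAMES = ["num_iterations", "n_iter", "num_trees", "nrounds", "max_iter"]

same = conflicting_values({"n_iter": 3, "num_iterations": 3}, NUM_ITERATIONS_NAMES)
different = conflicting_values({"n_iter": 6, "num_iterations": 5}, NUM_ITERATIONS_NAMES)
```

An empty result corresponds to the silent cases in the tests; a non-empty result corresponds to the cases where a warning is expected.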


def train(
params: Dict[str, Any],
train_set: Dataset,
Expand Down Expand Up @@ -169,9 +225,6 @@ def train(
    if not isinstance(train_set, Dataset):
        raise TypeError(f"train() only accepts Dataset object, train_set has type '{type(train_set).__name__}'.")

    if num_boost_round <= 0:
        raise ValueError(f"num_boost_round must be greater than 0. Got {num_boost_round}.")

    if isinstance(valid_sets, list):
        for i, valid_item in enumerate(valid_sets):
            if not isinstance(valid_item, Dataset):
Expand All @@ -198,11 +251,12 @@ def train(
    if callable(params["objective"]):
        fobj = params["objective"]
        params["objective"] = "none"
    for alias in _ConfigAliases.get("num_iterations"):
        if alias in params:
            num_boost_round = params.pop(alias)
            _log_warning(f"Found `{alias}` in params. Will use it instead of argument")
    params["num_iterations"] = num_boost_round

    params = _choose_num_iterations(num_boost_round_kwarg=num_boost_round, params=params)
    num_boost_round = params["num_iterations"]
    if num_boost_round <= 0:
        raise ValueError(f"Number of boosting rounds must be greater than 0. Got {num_boost_round}.")

    # setting early stopping via global params should be possible
    params = _choose_param_value(
        main_param_name="early_stopping_round",
Expand Down Expand Up @@ -713,9 +767,6 @@ def cv(
    if not isinstance(train_set, Dataset):
        raise TypeError(f"cv() only accepts Dataset object, train_set has type '{type(train_set).__name__}'.")

    if num_boost_round <= 0:
        raise ValueError(f"num_boost_round must be greater than 0. Got {num_boost_round}.")

    # raise deprecation warnings if necessary
    # ref: https://github.com/microsoft/LightGBM/issues/6435
    if categorical_feature != "auto":
Expand All @@ -733,11 +784,12 @@ def cv(
    if callable(params["objective"]):
        fobj = params["objective"]
        params["objective"] = "none"
    for alias in _ConfigAliases.get("num_iterations"):
        if alias in params:
            _log_warning(f"Found '{alias}' in params. Will use it instead of 'num_boost_round' argument")
            num_boost_round = params.pop(alias)
    params["num_iterations"] = num_boost_round

    params = _choose_num_iterations(num_boost_round_kwarg=num_boost_round, params=params)
    num_boost_round = params["num_iterations"]
    if num_boost_round <= 0:
        raise ValueError(f"Number of boosting rounds must be greater than 0. Got {num_boost_round}.")

    # setting early stopping via global params should be possible
    params = _choose_param_value(
        main_param_name="early_stopping_round",
114 changes: 112 additions & 2 deletions tests/python_package_test/test_engine.py
Expand Up @@ -24,6 +24,7 @@

from .utils import (
    SERIALIZERS,
    assert_silent,
    dummy_obj,
    load_breast_cancer,
    load_digits,
Expand Down Expand Up @@ -4291,7 +4292,7 @@ def test_verbosity_is_respected_when_using_custom_objective(capsys):
"num_leaves": 3,
    }
    lgb.train({**params, "verbosity": -1}, ds, num_boost_round=1)
    assert capsys.readouterr().out == ""
    assert_silent(capsys)
    lgb.train({**params, "verbosity": 0}, ds, num_boost_round=1)
    assert "[LightGBM] [Warning] Unknown parameter: nonsense" in capsys.readouterr().out

Expand Down Expand Up @@ -4320,6 +4321,115 @@ def test_verbosity_can_suppress_alias_warnings(capsys, verbosity_param, verbosit
    assert re.search(r"\[LightGBM\]", stdout) is None


def test_cv_only_raises_num_rounds_warning_when_expected(capsys):
    X, y = make_synthetic_regression()
    ds = lgb.Dataset(X, y)
    base_params = {
        "num_leaves": 5,
        "objective": "regression",
        "verbosity": -1,
    }
    additional_kwargs = {"return_cvbooster": True, "stratified": False}

    # no warning: no aliases, all defaults
    cv_bst = lgb.cv({**base_params}, ds, **additional_kwargs)
    assert all(t == 100 for t in cv_bst["cvbooster"].num_trees())
    assert_silent(capsys)

    # no warning: no aliases, just num_boost_round
    cv_bst = lgb.cv({**base_params}, ds, num_boost_round=2, **additional_kwargs)
    assert all(t == 2 for t in cv_bst["cvbooster"].num_trees())
    assert_silent(capsys)

    # no warning: 1 alias + num_boost_round (both same value)
    cv_bst = lgb.cv({**base_params, "n_iter": 3}, ds, num_boost_round=3, **additional_kwargs)
    assert all(t == 3 for t in cv_bst["cvbooster"].num_trees())
    assert_silent(capsys)

    # no warning: 1 alias + num_boost_round (different values... value from params should win)
    cv_bst = lgb.cv({**base_params, "n_iter": 4}, ds, num_boost_round=3, **additional_kwargs)
    assert all(t == 4 for t in cv_bst["cvbooster"].num_trees())
    assert_silent(capsys)

    # no warning: 2 aliases (both same value)
    cv_bst = lgb.cv({**base_params, "n_iter": 3, "num_iterations": 3}, ds, **additional_kwargs)
    assert all(t == 3 for t in cv_bst["cvbooster"].num_trees())
    assert_silent(capsys)

    # no warning: 4 aliases (all same value)
    cv_bst = lgb.cv({**base_params, "n_iter": 3, "num_trees": 3, "nrounds": 3, "max_iter": 3}, ds, **additional_kwargs)
    assert all(t == 3 for t in cv_bst["cvbooster"].num_trees())
    assert_silent(capsys)

    # warning: 2 aliases (different values... "num_iterations" wins because it's the main param name)
    with pytest.warns(UserWarning, match="LightGBM will perform up to 5 boosting rounds"):
        cv_bst = lgb.cv({**base_params, "n_iter": 6, "num_iterations": 5}, ds, **additional_kwargs)
    assert all(t == 5 for t in cv_bst["cvbooster"].num_trees())
    # should not be any other logs (except the warning, intercepted by pytest)
    assert_silent(capsys)

    # warning: 2 aliases (different values... first one in the order from Config::parameter2aliases() wins)
    with pytest.warns(UserWarning, match="LightGBM will perform up to 4 boosting rounds"):
        cv_bst = lgb.cv({**base_params, "n_iter": 4, "max_iter": 5}, ds, **additional_kwargs)["cvbooster"]
    assert all(t == 4 for t in cv_bst.num_trees())
    # should not be any other logs (except the warning, intercepted by pytest)
    assert_silent(capsys)


def test_train_only_raises_num_rounds_warning_when_expected(capsys):
    X, y = make_synthetic_regression()
    ds = lgb.Dataset(X, y)
    base_params = {
        "num_leaves": 5,
        "objective": "regression",
        "verbosity": -1,
    }

    # no warning: no aliases, all defaults
    bst = lgb.train({**base_params}, ds)
    assert bst.num_trees() == 100
    assert_silent(capsys)

    # no warning: no aliases, just num_boost_round
    bst = lgb.train({**base_params}, ds, num_boost_round=2)
    assert bst.num_trees() == 2
    assert_silent(capsys)

    # no warning: 1 alias + num_boost_round (both same value)
    bst = lgb.train({**base_params, "n_iter": 3}, ds, num_boost_round=3)
    assert bst.num_trees() == 3
    assert_silent(capsys)

    # no warning: 1 alias + num_boost_round (different values... value from params should win)
    bst = lgb.train({**base_params, "n_iter": 4}, ds, num_boost_round=3)
    assert bst.num_trees() == 4
    assert_silent(capsys)

    # no warning: 2 aliases (both same value)
    bst = lgb.train({**base_params, "n_iter": 3, "num_iterations": 3}, ds)
    assert bst.num_trees() == 3
    assert_silent(capsys)

    # no warning: 4 aliases (all same value)
    bst = lgb.train({**base_params, "n_iter": 3, "num_trees": 3, "nrounds": 3, "max_iter": 3}, ds)
    assert bst.num_trees() == 3
    assert_silent(capsys)

    # warning: 2 aliases (different values... "num_iterations" wins because it's the main param name)
    with pytest.warns(UserWarning, match="LightGBM will perform up to 5 boosting rounds"):
        bst = lgb.train({**base_params, "n_iter": 6, "num_iterations": 5}, ds)
    assert bst.num_trees() == 5
    # should not be any other logs (except the warning, intercepted by pytest)
    assert_silent(capsys)

    # warning: 2 aliases (different values... first one in the order from Config::parameter2aliases() wins)
    with pytest.warns(UserWarning, match="LightGBM will perform up to 4 boosting rounds"):
        bst = lgb.train({**base_params, "n_iter": 4, "max_iter": 5}, ds)
    assert bst.num_trees() == 4
    # should not be any other logs (except the warning, intercepted by pytest)
    assert_silent(capsys)


@pytest.mark.skipif(not PANDAS_INSTALLED, reason="pandas is not installed")
def test_validate_features():
    X, y = make_synthetic_regression()
Expand Down Expand Up @@ -4355,7 +4465,7 @@ def test_train_and_cv_raise_informative_error_for_train_set_of_wrong_type():
@pytest.mark.parametrize("num_boost_round", [-7, -1, 0])
def test_train_and_cv_raise_informative_error_for_impossible_num_boost_round(num_boost_round):
    X, y = make_synthetic_regression(n_samples=100)
    error_msg = rf"num_boost_round must be greater than 0\. Got {num_boost_round}\."
    error_msg = rf"Number of boosting rounds must be greater than 0\. Got {num_boost_round}\."
    with pytest.raises(ValueError, match=error_msg):
        lgb.train({}, train_set=lgb.Dataset(X, y), num_boost_round=num_boost_round)
    with pytest.raises(ValueError, match=error_msg):