[docs] expand documentation on 'group' for ranking task (#3772)

* [python-package] expand documentation on 'group' for ranking task

* add R package

* update Query Data section

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* fix typo in group example

* regenerate parameters

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* regenerate R docs

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
This commit is contained in:
James Lamb 2021-01-18 11:58:04 -06:00 коммит произвёл GitHub
Родитель 356126330c
Коммит 0e5eb9e372
Не найден ключ, соответствующий данной подписи
Идентификатор ключа GPG: 4AEE18F83AFDEB23
7 изменённых файлов: 62 добавлений и 18 удалений

Просмотреть файл

@ -998,7 +998,11 @@ slice.lgb.Dataset <- function(dataset, idxset, ...) {
#' \itemize{
#' \item \code{label}: label lightgbm learn from ;
#' \item \code{weight}: to do a weight rescale ;
#' \item \code{group}: group size ;
#' \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
#' group rows together as ordered results from the same set of candidate results to be ranked.
#' For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
#' that means that you have 6 groups, where the first 10 records are in the first group,
#' records 11-30 are in the second group, etc.}
#' \item \code{init_score}: initial score is the base prediction lightgbm will boost from.
#' }
#'
@ -1052,8 +1056,9 @@ getinfo.lgb.Dataset <- function(dataset, name, ...) {
#' \item{\code{init_score}: initial score is the base prediction lightgbm will boost from}
#' \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
#' group rows together as ordered results from the same set of candidate results to be ranked.
#' For example, if you have a 1000-row dataset that contains 250 4-document query results,
#' set this to \code{rep(4L, 250L)}}
#' For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
#' that means that you have 6 groups, where the first 10 records are in the first group,
#' records 11-30 are in the second group, etc.}
#' }
#'
#' @examples

Просмотреть файл

@ -30,7 +30,11 @@ The \code{name} field can be one of the following:
\itemize{
\item \code{label}: label lightgbm learn from ;
\item \code{weight}: to do a weight rescale ;
\item \code{group}: group size ;
\item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
group rows together as ordered results from the same set of candidate results to be ranked.
For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
that means that you have 6 groups, where the first 10 records are in the first group,
records 11-30 are in the second group, etc.}
\item \code{init_score}: initial score is the base prediction lightgbm will boost from.
}
}

Просмотреть файл

@ -35,8 +35,9 @@ The \code{name} field can be one of the following:
\item{\code{init_score}: initial score is the base prediction lightgbm will boost from}
\item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
group rows together as ordered results from the same set of candidate results to be ranked.
For example, if you have a 1000-row dataset that contains 250 4-document query results,
set this to \code{rep(4L, 250L)}}
For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
that means that you have 6 groups, where the first 10 records are in the first group,
records 11-30 are in the second group, etc.}
}
}
\examples{

Просмотреть файл

@ -760,7 +760,7 @@ Dataset Parameters
- **Note**: works only in case of loading data directly from file
- **Note**: data should be grouped by query\_id
- **Note**: data should be grouped by query\_id, for more information, see `Query Data <#query-data>`__
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0``
@ -1229,6 +1229,7 @@ Query Data
~~~~~~~~~~
For learning to rank, it needs query information for training data.
LightGBM uses an additional file to store query data, like the following:
::
@ -1238,7 +1239,13 @@ LightGBM uses an additional file to store query data, like the following:
67
...
It means first ``27`` lines samples belong to one query and next ``18`` lines belong to another, and so on.
For wrapper libraries like in Python and R, this information can also be provided as an array-like via the Dataset parameter ``group``.
::
[27, 18, 67, ...]
For example, if you have a 112-document dataset with ``group = [27, 18, 67]``, that means that you have 3 groups, where the first 27 records are in the first group, records 28-45 are in the second group, and records 46-112 are in the third group.
**Note**: data should be ordered by the query.

Просмотреть файл

@ -670,7 +670,7 @@ struct Config {
// desc = use number for index, e.g. ``query=0`` means column\_0 is the query id
// desc = add a prefix ``name:`` for column name, e.g. ``query=name:query_id``
// desc = **Note**: works only in case of loading data directly from file
// desc = **Note**: data should be grouped by query\_id
// desc = **Note**: data should be grouped by query\_id, for more information, see `Query Data <#query-data>`__
// desc = **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0``
std::string group_column = "";

Просмотреть файл

@ -941,7 +941,10 @@ class Dataset:
weight : list, numpy 1-D array, pandas Series or None, optional (default=None)
Weight for each instance.
group : list, numpy 1-D array, pandas Series or None, optional (default=None)
Group/query size for Dataset.
Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
init_score : list, numpy 1-D array, pandas Series or None, optional (default=None)
Init score for Dataset.
silent : bool, optional (default=False)
@ -1356,7 +1359,10 @@ class Dataset:
weight : list, numpy 1-D array, pandas Series or None, optional (default=None)
Weight for each instance.
group : list, numpy 1-D array, pandas Series or None, optional (default=None)
Group/query size for Dataset.
Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
init_score : list, numpy 1-D array, pandas Series or None, optional (default=None)
Init score for Dataset.
silent : bool, optional (default=False)
@ -1715,7 +1721,10 @@ class Dataset:
Parameters
----------
group : list, numpy 1-D array, pandas Series or None
Group size of each group.
Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
Returns
-------
@ -1830,7 +1839,10 @@ class Dataset:
Returns
-------
group : numpy array or None
Group size of each group.
Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
"""
if self.group is None:
self.group = self.get_field('group')

Просмотреть файл

@ -36,7 +36,10 @@ class _ObjectiveFunctionWrapper:
y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The predicted values.
group : array-like
Group/query data, used for ranking task.
Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The value of the first order derivative (gradient) for each sample point.
hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
@ -122,7 +125,10 @@ class _EvalFunctionWrapper:
weight : array-like of shape = [n_samples]
The weight of samples.
group : array-like
Group/query data, used for ranking task.
Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
eval_name : string
The name of evaluation function (without whitespaces).
eval_result : float
@ -266,7 +272,10 @@ class LGBMModel(_LGBMModelBase):
y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The predicted values.
group : array-like
Group/query data, used for ranking task.
Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The value of the first order derivative (gradient) for each sample point.
hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
@ -384,7 +393,10 @@ class LGBMModel(_LGBMModelBase):
init_score : array-like of shape = [n_samples] or None, optional (default=None)
Init score of training data.
group : array-like or None, optional (default=None)
Group data of training data.
Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
eval_set : list or None, optional (default=None)
A list of (X, y) tuple pairs to use as validation sets.
eval_names : list of strings or None, optional (default=None)
@ -460,7 +472,10 @@ class LGBMModel(_LGBMModelBase):
weight : array-like of shape = [n_samples]
The weight of samples.
group : array-like
Group/query data, used for ranking task.
Group/query data.
Only used in the learning-to-rank task.
sum(group) = n_samples.
For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
eval_name : string
The name of evaluation function (without whitespaces).
eval_result : float