mirror of https://github.com/microsoft/LightGBM.git
[docs] expand documentation on 'group' for ranking task (#3772)
* [python-package] expand documentation on 'group' for ranking task
* add R package
* update Query Data section
* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* fix typo in group example
* regenerate parameters
* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* regenerate R docs

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Parent: 356126330c
Commit: 0e5eb9e372
@@ -998,7 +998,11 @@ slice.lgb.Dataset <- function(dataset, idxset, ...) {
 #' \itemize{
 #' \item \code{label}: label lightgbm learn from ;
 #' \item \code{weight}: to do a weight rescale ;
-#' \item \code{group}: group size ;
+#' \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
+#' group rows together as ordered results from the same set of candidate results to be ranked.
+#' For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
+#' that means that you have 6 groups, where the first 10 records are in the first group,
+#' records 11-30 are in the second group, etc.}
 #' \item \code{init_score}: initial score is the base prediction lightgbm will boost from.
 #' }
 #'
@@ -1052,8 +1056,9 @@ getinfo.lgb.Dataset <- function(dataset, name, ...) {
 #' \item{\code{init_score}: initial score is the base prediction lightgbm will boost from}
 #' \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
 #' group rows together as ordered results from the same set of candidate results to be ranked.
-#' For example, if you have a 1000-row dataset that contains 250 4-document query results,
-#' set this to \code{rep(4L, 250L)}}
+#' For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
+#' that means that you have 6 groups, where the first 10 records are in the first group,
+#' records 11-30 are in the second group, etc.}
 #' }
 #'
 #' @examples
@@ -30,7 +30,11 @@ The \code{name} field can be one of the following:
 \itemize{
 \item \code{label}: label lightgbm learn from ;
 \item \code{weight}: to do a weight rescale ;
-\item \code{group}: group size ;
+\item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
+group rows together as ordered results from the same set of candidate results to be ranked.
+For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
+that means that you have 6 groups, where the first 10 records are in the first group,
+records 11-30 are in the second group, etc.}
 \item \code{init_score}: initial score is the base prediction lightgbm will boost from.
 }
 }
@@ -35,8 +35,9 @@ The \code{name} field can be one of the following:
 \item{\code{init_score}: initial score is the base prediction lightgbm will boost from}
 \item{\code{group}: used for learning-to-rank tasks. An integer vector describing how to
 group rows together as ordered results from the same set of candidate results to be ranked.
-For example, if you have a 1000-row dataset that contains 250 4-document query results,
-set this to \code{rep(4L, 250L)}}
+For example, if you have a 100-document dataset with \code{group = c(10, 20, 40, 10, 10, 10)},
+that means that you have 6 groups, where the first 10 records are in the first group,
+records 11-30 are in the second group, etc.}
 }
 }
 \examples{
@@ -760,7 +760,7 @@ Dataset Parameters

 - **Note**: works only in case of loading data directly from file

-- **Note**: data should be grouped by query\_id
+- **Note**: data should be grouped by query\_id, for more information, see `Query Data <#query-data>`__

 - **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0``

@@ -1229,6 +1229,7 @@ Query Data
 ~~~~~~~~~~

 For learning to rank, it needs query information for training data.
+
 LightGBM uses an additional file to store query data, like the following:

 ::
@@ -1238,7 +1239,13 @@ LightGBM uses an additional file to store query data, like the following:
     67
     ...

-It means first ``27`` lines samples belong to one query and next ``18`` lines belong to another, and so on.
+For wrapper libraries like in Python and R, this information can also be provided as an array-like via the Dataset parameter ``group``.
+
+::
+
+    [27, 18, 67, ...]
+
+For example, if you have a 112-document dataset with ``group = [27, 18, 67]``, that means that you have 3 groups, where the first 27 records are in the first group, records 28-45 are in the second group, and records 46-112 are in the third group.

 **Note**: data should be ordered by the query.

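The Query Data text above requires rows to be ordered by query and describes ``group`` as a list of consecutive run lengths. As a minimal sketch of that relationship, the snippet below derives the sizes from a per-row query id column. The helper name ``group_sizes_from_query_ids`` and the ids ``q1``/``q2``/``q3`` are hypothetical and not part of LightGBM; only the standard library is used.

```python
from itertools import groupby


def group_sizes_from_query_ids(query_ids):
    """Return the length of each run of equal query ids, in row order.

    Hypothetical helper for illustration: it raises if a query id shows up
    in two separate runs, i.e. if the rows are not ordered by query.
    """
    sizes = []
    seen = set()
    for qid, run in groupby(query_ids):
        if qid in seen:
            raise ValueError(f"rows for query {qid!r} are not contiguous")
        seen.add(qid)
        sizes.append(sum(1 for _ in run))
    return sizes


# 112 rows: 27 for the first query, 18 for the second, 67 for the third.
query_ids = ["q1"] * 27 + ["q2"] * 18 + ["q3"] * 67
group = group_sizes_from_query_ids(query_ids)
print(group)  # [27, 18, 67]
```

The resulting list has the same shape as the ``[27, 18, 67]`` example in the documentation and sums to the number of rows.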
@@ -670,7 +670,7 @@ struct Config {
 // desc = use number for index, e.g. ``query=0`` means column\_0 is the query id
 // desc = add a prefix ``name:`` for column name, e.g. ``query=name:query_id``
 // desc = **Note**: works only in case of loading data directly from file
-// desc = **Note**: data should be grouped by query\_id
+// desc = **Note**: data should be grouped by query\_id, for more information, see `Query Data <#query-data>`__
 // desc = **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0``
 std::string group_column = "";

@@ -941,7 +941,10 @@ class Dataset:
 weight : list, numpy 1-D array, pandas Series or None, optional (default=None)
 Weight for each instance.
 group : list, numpy 1-D array, pandas Series or None, optional (default=None)
-Group/query size for Dataset.
+Group/query data.
+Only used in the learning-to-rank task.
+sum(group) = n_samples.
+For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
 init_score : list, numpy 1-D array, pandas Series or None, optional (default=None)
 Init score for Dataset.
 silent : bool, optional (default=False)
@@ -1356,7 +1359,10 @@ class Dataset:
 weight : list, numpy 1-D array, pandas Series or None, optional (default=None)
 Weight for each instance.
 group : list, numpy 1-D array, pandas Series or None, optional (default=None)
-Group/query size for Dataset.
+Group/query data.
+Only used in the learning-to-rank task.
+sum(group) = n_samples.
+For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
 init_score : list, numpy 1-D array, pandas Series or None, optional (default=None)
 Init score for Dataset.
 silent : bool, optional (default=False)
@@ -1715,7 +1721,10 @@ class Dataset:
 Parameters
 ----------
 group : list, numpy 1-D array, pandas Series or None
-Group size of each group.
+Group/query data.
+Only used in the learning-to-rank task.
+sum(group) = n_samples.
+For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.

 Returns
 -------
@@ -1830,7 +1839,10 @@ class Dataset:
 Returns
 -------
 group : numpy array or None
-Group size of each group.
+Group/query data.
+Only used in the learning-to-rank task.
+sum(group) = n_samples.
+For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
 """
 if self.group is None:
     self.group = self.get_field('group')
@@ -36,7 +36,10 @@ class _ObjectiveFunctionWrapper:
 y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
 The predicted values.
 group : array-like
-Group/query data, used for ranking task.
+Group/query data.
+Only used in the learning-to-rank task.
+sum(group) = n_samples.
+For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
 grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
 The value of the first order derivative (gradient) for each sample point.
 hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
@@ -122,7 +125,10 @@ class _EvalFunctionWrapper:
 weight : array-like of shape = [n_samples]
 The weight of samples.
 group : array-like
-Group/query data, used for ranking task.
+Group/query data.
+Only used in the learning-to-rank task.
+sum(group) = n_samples.
+For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
 eval_name : string
 The name of evaluation function (without whitespaces).
 eval_result : float
@@ -266,7 +272,10 @@ class LGBMModel(_LGBMModelBase):
 y_pred : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
 The predicted values.
 group : array-like
-Group/query data, used for ranking task.
+Group/query data.
+Only used in the learning-to-rank task.
+sum(group) = n_samples.
+For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
 grad : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
 The value of the first order derivative (gradient) for each sample point.
 hess : array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
@@ -384,7 +393,10 @@ class LGBMModel(_LGBMModelBase):
 init_score : array-like of shape = [n_samples] or None, optional (default=None)
 Init score of training data.
 group : array-like or None, optional (default=None)
-Group data of training data.
+Group/query data.
+Only used in the learning-to-rank task.
+sum(group) = n_samples.
+For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
 eval_set : list or None, optional (default=None)
 A list of (X, y) tuple pairs to use as validation sets.
 eval_names : list of strings or None, optional (default=None)
@@ -460,7 +472,10 @@ class LGBMModel(_LGBMModelBase):
 weight : array-like of shape = [n_samples]
 The weight of samples.
 group : array-like
-Group/query data, used for ranking task.
+Group/query data.
+Only used in the learning-to-rank task.
+sum(group) = n_samples.
+For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups, where the first 10 records are in the first group, records 11-30 are in the second group, etc.
 eval_name : string
 The name of evaluation function (without whitespaces).
 eval_result : float
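The docstrings in this commit state that ``sum(group) = n_samples`` and that ``group = [10, 20, 40, 10, 10, 10]`` puts records 1-10 in the first group and records 11-30 in the second. The sketch below spells out that arithmetic and checks the sum constraint. ``group_to_ranges`` is a hypothetical helper written for illustration, not a LightGBM function, and it uses only the standard library.

```python
from itertools import accumulate


def group_to_ranges(group, n_samples=None):
    """Map group sizes to 1-based, inclusive (first, last) record ranges.

    Hypothetical helper: if n_samples is given, it also enforces the
    documented constraint sum(group) == n_samples.
    """
    if n_samples is not None and sum(group) != n_samples:
        raise ValueError(
            f"sum(group) is {sum(group)}, expected n_samples={n_samples}"
        )
    ends = list(accumulate(group))  # running totals give the last record of each group
    starts = [end - size + 1 for end, size in zip(ends, group)]
    return list(zip(starts, ends))


ranges = group_to_ranges([10, 20, 40, 10, 10, 10], n_samples=100)
print(ranges[:3])  # [(1, 10), (11, 30), (31, 70)]
```

A check like this can be run before passing the same list as the ``group`` argument described in the docstrings above, so that a mismatch between the group sizes and the number of rows surfaces early.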