makeCPO creates a Feature Operation CPOConstructor, i.e. a constructor for a CPO that will operate on feature columns. makeCPOTargetOp creates a Target Operation CPOConstructor, which creates CPOs that operate on the target column. makeCPORetrafoless creates a Retrafoless CPOConstructor, which creates CPOs that may operate on both feature and target columns, but have no retrafo operation. See OperatingType for further details on the distinction between these.
makeCPOExtendedTrafo creates a Feature Operation CPOConstructor that has slightly more flexibility in its data transformation behaviour than makeCPO (but is otherwise identical). makeCPOExtendedTargetOp creates a Target Operation CPOConstructor that has slightly more flexibility in its data transformation behaviour than makeCPOTargetOp (but is otherwise identical).
See the example section for some simple custom CPOs.
makeCPO(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor", "ordered", "numeric"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = c("cluster", "classif", "multilabel", "regr", "surv", "oneclass", "twoclass", "multiclass"),
  packages = character(0),
  cpo.train,
  cpo.retrafo
)

makeCPOExtendedTrafo(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor", "ordered", "numeric"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = c("cluster", "classif", "multilabel", "regr", "surv", "oneclass", "twoclass", "multiclass"),
  packages = character(0),
  cpo.trafo,
  cpo.retrafo
)

makeCPORetrafoless(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.all", "task"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = c("cluster", "classif", "multilabel", "regr", "surv", "oneclass", "twoclass", "multiclass"),
  packages = character(0),
  cpo.trafo
)

makeCPOTargetOp(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor", "ordered", "numeric"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = "cluster",
  task.type.out = NULL,
  predict.type.map = c(response = "response"),
  packages = character(0),
  constant.invert = FALSE,
  cpo.train,
  cpo.retrafo,
  cpo.train.invert,
  cpo.invert
)

makeCPOExtendedTargetOp(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor", "ordered", "numeric"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = "cluster",
  task.type.out = NULL,
  predict.type.map = c(response = "response"),
  packages = character(0),
  constant.invert = FALSE,
  cpo.trafo,
  cpo.retrafo,
  cpo.invert
)
cpo.name | The name of the resulting CPO. It is also used as the default ID of the CPO. |
par.set | The ParamSet that specifies the hyperparameters of the CPO. |
par.vals | Default values of the hyperparameters, as an alternative to giving the defaults in par.set. |
dataformat | The format in which the data is presented to the cpo.* functions. The value can also be any one of “factor”, “numeric”, “ordered”; if these are given, only a subset of the total data present is seen by the CPO. Default is “df.features” for all functions except makeCPORetrafoless, where it is “df.all”. |
dataformat.factor.with.ordered | Used together with dataformat to request the format of the data. |
export.params | Which hyperparameters are exported by default, making them available for setHyperPars. |
fix.factors | Whether the framework should make sure that factor levels are the same during training and prediction. |
properties.data | The kind of data that the CPO handles. |
properties.adding | The data handling capabilities that the CPO adds to a following CPO or Learner. Property names may be postfixed with “.sometimes”, to indicate that adherence should not be checked internally. Default is character(0). |
properties.needed | The kind of data that a CPO or Learner coming after this CPO must be able to handle. Property names may be postfixed with “.sometimes”, to indicate that adherence should not be checked internally. Default is character(0). |
properties.target | The task types that the CPO handles. For Target Operation CPOs, this must contain exactly one of “cluster”, “classif”, “multilabel”, “regr”, “surv”. |
packages | Packages that need to be installed for the CPO to work. |
cpo.train | The function that collects all relevant information from the training data. The behaviour of this function differs slightly in Feature Operation and Target Operation CPOs. |
cpo.retrafo | This function gets called during the “retransformation” step, where prediction data is given to the CPORetrafo. In Feature Operation CPOs, this function receives the data to be transformed and must return the transformed data in the same format as it received them. |
cpo.trafo | This function's primary task is to transform the given data when the CPO is applied. For CPOs that are not Retrafoless, a unit of information to be carried over to the retrafo step needs to be created inside the cpo.trafo function. |
task.type.out | The Task type that a Target Operation CPO converts to, if it is used for conversion between Tasks. |
predict.type.map | The mapping of predict.type values; see PredictType. |
constant.invert | Whether the Target Operation CPO does not require information from the retrafo step for inversion. Default is FALSE. |
cpo.train.invert | This function receives the feature columns given for prediction, and must return a control object that will be passed on to the cpo.invert function. |
cpo.invert | This function performs the inversion for a Target Operation CPO. It takes a control object, which summarizes information from the training and retrafo step, and the prediction as returned by a machine learning model, and undoes the operation done to the target column in the trafo step. For example, if the trafo step consisted of taking the logarithm of a regression target, the cpo.invert function would exponentiate the predictions. As a more elaborate example, a CPO could train a model on the training data and set the target values to the residuals of that trained model. |
[CPOConstructor]. A constructor for CPOs.
The mlrCPO package offers a powerful framework for handling the tasks necessary for preprocessing, so that the user, when creating custom CPOs, can focus on the actual data transformations to perform. It is, however, useful to understand what it is that the framework does, and how the process can be influenced by the user during CPO definition or application. Aspects of preprocessing that the user needs to influence are:
The core of preprocessing is the actual transformation being performed. In the most general sense, there are three points in a machine learning pipeline that preprocessing can influence.
1. Transformation of training data before model fitting, done in mlr using train. In the CPO framework (when not using a CPOLearner, which makes all of these steps transparent to the user), this is done by a CPO.
2. Transformation of new validation or prediction data that is given to the fitted model for prediction, done using predict. This is done by a CPORetrafo retrieved using retrafo from the result of step 1.
3. Transformation of the predictions made, to invert the transformation of the target values done in step 1. This is done using the CPOInverter retrieved using inverter from the result of step 2.
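The three steps can be sketched as follows. This is only an illustration: the regression task built from iris, the odd/even train and test split, the "regr.lm" learner, and the choice of the built-in cpoLogTrafoRegr CPO are assumptions made for this sketch, not part of the text above.

```r
library("mlrCPO")  # also attaches mlr

# Illustrative regression task and train / test split (assumptions of this sketch)
task = makeRegrTask(data = iris[1:4], target = "Petal.Width")
train.task = subsetTask(task, seq(1, 150, by = 2))
test.task = subsetTask(task, seq(2, 150, by = 2))

# Step 1: a CPO transforms the training data before model fitting
train.trafd = train.task %>>% cpoLogTrafoRegr()
model = train("regr.lm", train.trafd)

# Step 2: the CPORetrafo from step 1 transforms new data before prediction
test.trafd = test.task %>>% retrafo(train.trafd)
pred = predict(model, test.trafd)

# Step 3: the CPOInverter from step 2 maps the predictions back
pred.inv = invert(inverter(test.trafd), pred)
```

When a CPOLearner is used instead, these steps happen inside train and predict and do not need to be written out.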
The framework poses restrictions on primitive (i.e. not compound using composeCPO) CPOs to simplify internal operation: A CPO may be one of three OperatingTypes (see there). The Feature Operation CPO does not transform target columns and hence only needs to be involved in steps 1 and 2. The Target Operation CPO only transforms target columns, and therefore mostly concerns itself with steps 1 and 3. A Retrafoless CPO may change both feature and target columns, but may not perform a retrafo or inverter operation (and is therefore only concerned with step 1). Note that this is effectively a restriction on what kind of transformation a Retrafoless CPO may perform: it must not be a transformation of the data or target space, it may only add or subtract points within this space. The Operating Type of a CPO is ultimately dependent on the function that was used to create the CPOConstructor: makeCPO / makeCPOExtendedTrafo, makeCPOTargetOp / makeCPOExtendedTargetOp, or makeCPORetrafoless.

At the core of a CPO is the modification of data it performs. For Feature Operation CPOs, the transformation of each row,
during training and prediction, should happen in the same way, and it may only depend on the entirety of the training data, i.e. the value of a data row in a prediction data set may not influence the transformation of a different prediction data row. Furthermore, if a data row occurs in both training and prediction data, its transformation result should ideally be the same. This property is ensured by makeCPO by splitting the transformation into two functions: one function that collects all relevant information from the training data (called cpo.train), and one that transforms given data, using this collected information and (potentially new, unseen) data to be transformed (called cpo.retrafo). The cpo.retrafo function should handle all data as if it were prediction data and unrelated to the data given to cpo.train.

Internally, when a CPO gets applied to a data set using applyCPO, the cpo.train function is called, and the resulting control object is used for a subsequent cpo.retrafo call which transforms the data. Before the result is given back from the applyCPO call, the control object is used to create a CPORetrafo object, which is attached to the result as an attribute. Target Operating CPOs additionally create and add a CPOInverter object. When a CPORetrafo is then applied to new prediction data, the control object previously returned by cpo.train is given, combined with this new data, to another cpo.retrafo call that performs the new transformation.

makeCPOExtendedTrafo gives more flexibility by calling only the cpo.trafo function in the training step, which both creates a control object and modifies the data. This can increase performance if the underlying operation creates a control object and the transformed data in one step, as for example PCA does. Note that the requirement that the same row in training and prediction data should result in the same transformation result still stands. The cpo.trafo function returns the transformed data and creates a local variable with the control information, which the CPO framework will access.
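The following sketch shows the same column-centering transformation written both ways. The CPO names, the choice of dataformat = "numeric", and the odd/even split of iris are assumptions made for illustration; the local variable holding the control information in cpo.trafo is assumed to be named control.

```r
library("mlrCPO")

# Helper shared by both variants: subtract stored means from matching columns
subtractMeans = function(data, means) {
  for (col in names(means)) {
    data[[col]] = data[[col]] - means[[col]]
  }
  data
}

# Variant 1: makeCPO with separate cpo.train and cpo.retrafo functions
cpoCenterXmp = makeCPO("centerxmp",
  dataformat = "numeric",  # only numeric feature columns are passed as 'data'
  cpo.train = function(data, target) {
    vapply(data, mean, numeric(1))  # control object: training column means
  },
  cpo.retrafo = function(data, control) {
    # treats 'data' as unseen data, using only the information in 'control'
    subtractMeans(data, control)
  })

# Variant 2: makeCPOExtendedTrafo, where cpo.trafo transforms the training
# data itself and leaves the control information in a local variable
cpoCenterExtXmp = makeCPOExtendedTrafo("centerextxmp",
  dataformat = "numeric",
  cpo.trafo = function(data, target) {
    control = vapply(data, mean, numeric(1))
    subtractMeans(data, control)
  },
  cpo.retrafo = function(data, control) {
    subtractMeans(data, control)
  })

train.data = iris[seq(1, 150, by = 2), ]
new.data = iris[seq(2, 150, by = 2), ]
rownames(train.data) = NULL
rownames(new.data) = NULL

trafd = train.data %>>% cpoCenterXmp()    # cpo.train, then cpo.retrafo
new.trafd = new.data %>>% retrafo(trafd)  # cpo.retrafo with the stored control

trafd.ext = train.data %>>% cpoCenterExtXmp()
new.trafd.ext = new.data %>>% retrafo(trafd.ext)
```

Both variants center the new data with the means of the training data, so a row that occurs in both data sets is transformed identically.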
If a CPO performs transformations of the target column, the predictions made by a following machine learning process should ideally have this transformation undone, so that if the process makes a prediction that coincides with a target value after the transformation, the whole pipeline returns a prediction that equals the target value before this transformation. This is done by the cpo.invert function given to makeCPOTargetOp. It has access to information from both the preceding training and prediction steps. During the training step, cpo.train creates a control object that is not only given to cpo.retrafo, but also to cpo.train.invert. This latter function is called before the prediction step, whenever new data is fed to the machine learning process. It takes the new data and the old control object and transforms them into a new control.invert object that includes information about the prediction data. This object is then given to cpo.invert.

It is possible to have Target Operation CPOs that do not require information from the retrafo step. This is specified by setting constant.invert to TRUE. It has the advantage that the same CPOInverter can be used for inversion of predictions made with any new data. Otherwise, a new CPOInverter object must be obtained for each new data set after the retrafo step (using the inverter function on the retrafo result). Having constant.invert set to TRUE results in hybrid retrafo / inverter objects: The CPORetrafo object can then also be used for inversions. When defining a constant.invert Target Operating CPO, no cpo.train.invert function is given, and the same control object is given to both cpo.retrafo and cpo.invert.

makeCPOExtendedTargetOp gives more flexibility and allows more efficient implementation of Target Operating CPOs at the cost of more complexity. With this method, a cpo.trafo function is given that is executed during the first training step; it must return the transformed target column, as well as a control and control.invert object. The cpo.retrafo function not only transforms the target, but must also create a new control.invert object (unless constant.invert is TRUE). The semantics of cpo.invert are identical to those of the basic makeCPOTargetOp.
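A minimal sketch of a constant.invert Target Operation CPO is given below. The CPO name, the iris-based regression task and the "regr.lm" learner are assumptions made for illustration. The sketch also assumes that cpo.train may be NULL when no information from the training data is needed, and it gives the function bodies in headless form (body only, beginning with ‘{’).

```r
library("mlrCPO")

# Hypothetical example CPO: log-transform a regression target.
# No training information is needed, so cpo.train is NULL (an assumption of
# this sketch); with constant.invert = TRUE, no cpo.train.invert is given.
cpoLogTargetXmp = makeCPOTargetOp("logtargetxmp",
  constant.invert = TRUE,
  properties.target = "regr",
  cpo.train = NULL,
  cpo.train.invert = NULL,
  cpo.retrafo = {
    # 'target' is a data.frame with one numeric column for "regr" tasks
    target[[1]] = log(target[[1]])
    target
  },
  cpo.invert = {
    # 'target' is the numeric prediction for predict.type "response"
    exp(target)
  })

task = makeRegrTask(data = iris[1:4], target = "Petal.Width")
trafd = task %>>% cpoLogTargetXmp()

model = train("regr.lm", trafd)
pred = predict(model, trafd)

# With constant.invert = TRUE, the CPORetrafo can also be used for inversion
pred.inv = invert(retrafo(trafd), pred)
```

Without constant.invert = TRUE, the inversion would instead need a CPOInverter obtained with inverter from the data that was fed through the retrafo step.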
cpo.train-cpo.retrafo information transfer

One possibility to transfer information from cpo.train to cpo.retrafo is to have cpo.train return a control object (a list) that is then given to cpo.retrafo. The CPO is then called an object based CPO. Another possibility is to not give the cpo.retrafo argument (set it to NULL in the makeCPO call) and have cpo.train return a function instead. This function is then used as the cpo.retrafo function, and should have access to all relevant information about the training data as a closure. This is called a functional CPO. To save memory, the actual data (including target) given to cpo.train is removed from the environment of its return value in this case (i.e. the environment of the cpo.retrafo function). This means the cpo.retrafo function may not reference a “data” variable.

There are similar possibilities of functional information transfer for other types of CPOs: cpo.trafo in makeCPOExtendedTrafo may create a cpo.retrafo function instead of a control object. cpo.train in makeCPOTargetOp has the option of creating a cpo.retrafo and cpo.train.invert (cpo.invert if constant.invert is TRUE) function (and returning NULL) instead of returning a control object. Similarly, cpo.train.invert may return a cpo.invert function instead of a control.invert object. In makeCPOExtendedTargetOp, cpo.trafo may create a cpo.retrafo or a cpo.invert function, each optionally instead of a control or control.invert object (one or both may be functional). cpo.retrafo similarly may create a cpo.invert function instead of giving a control.invert object. Functional information transfer may be more parsimonious and elegant than control object information transfer.
The action performed by a CPO may be influenced using hyperparameters, during its construction as well as afterwards (then using setHyperPars). Hyperparameters must be specified as a ParamSet and given as argument par.set. Default values for each parameter may be specified in this ParamSet or optionally as another argument par.vals.

Hyperparameters given are made part of the CPOConstructor function and can thus be given during construction. Parameter default values function as the default values for the CPOConstructor function parameters (which are thus made optional function parameters of the CPOConstructor function). The CPO framework handles storage and changing of hyperparameter values. When the cpo.train and cpo.retrafo functions are called to transform data, the hyperparameter values are given to them as arguments, so cpo.train and cpo.retrafo functions must be able to accept these parameters, either directly, or with a ... argument. Note that with functional CPOs, the cpo.retrafo function does not take hyperparameter arguments (and instead can usually refer to them by its environment).

Hyperparameters may be exported (or not), thus making them available for setHyperPars. Not exporting a parameter has the advantage that it does not clutter the ParamSet of a big CPO or CPOLearner pipeline with many hyperparameters. Which hyperparameters are exported is chosen during the constructing call of a CPOConstructor, but the default exported hyperparameters can be chosen with the export.params parameter.
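As a sketch of how hyperparameters reach the cpo.* functions, the hypothetical CPO below caps numeric columns at a quantile of the training data. The CPO name, the prob parameter and its default are assumptions made for illustration.

```r
library("mlrCPO")

# Hypothetical example CPO: cap numeric columns at a training-data quantile
cpoCapXmp = makeCPO("capxmp",
  par.set = makeParamSet(
    makeNumericLearnerParam("prob", lower = 0, upper = 1, default = 0.9)),
  dataformat = "numeric",
  cpo.train = function(data, target, prob) {
    # the hyperparameter 'prob' is accepted directly as an argument
    vapply(data, quantile, numeric(1), probs = prob, names = FALSE)
  },
  cpo.retrafo = function(data, control, ...) {
    # hyperparameters are not needed here, so they are accepted through '...'
    for (col in names(control)) {
      data[[col]] = pmin(data[[col]], control[[col]])
    }
    data
  })

cpo = cpoCapXmp(prob = 0.5)                   # value given during construction
cpo2 = setHyperPars(cpo, capxmp.prob = 0.75)  # exported name is prefixed with the ID
getHyperPars(cpo2)

capped = iris %>>% cpo
```

Because the ID defaults to the CPO name, the exported hyperparameter is addressed as capxmp.prob in setHyperPars.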
Similarly to Learners, CPOs may specify what kind of data they are and are not able to handle. This is done by specifying properties.* arguments. The names of possible properties are the same as possible LearnerProperties, but since CPOs mostly concern themselves with data, only the properties indicating column and task types are relevant. For each CPO one must specify
1. which kind of data the CPO handles,
2. which kind of data the CPO or Learner that comes after the given CPO must be able to handle, and
3. which kind of data handling capability the given CPO adds to a following CPO or Learner if coming before it in a pipeline.
(1) is specified using properties.data and properties.target, (2) is specified using properties.needed, and (3) is specified using properties.adding. Internally, properties.data and properties.target are concatenated and treated as one vector; they are specified separately in makeCPO etc. for convenience reasons. See CPOProperties for details.

The CPO framework checks the cpo.retrafo etc. functions for adherence to these properties, so it e.g. throws an error if a cpo.retrafo function adds missing values to some data but didn't declare “missings” in properties.needed. It may be desirable to have this internal checking happen to a laxer standard than the property checking when composing CPOs (e.g. when a CPO adds missings only with certain hyperparameters, one may still want to compose this CPO to another one that can't handle missings). Therefore it is possible to postfix listed properties with “.sometimes”. The internal CPO checking will ignore these when listed in properties.adding (it uses the ‘minimal’ set of adding properties, adding.min), and it will not declare them externally when listed in properties.needed (but keeps them internally in the ‘maximal’ set of needed properties, needed.max). The adding.min and needed.max can be retrieved using getCPOProperties with get.internal = TRUE.

Different CPOs may want to change different aspects of the data, e.g. they may only care about numeric columns, they may or may not care about
the target column values, sometimes they might need the actual task used as input. The CPO framework offers to present the data in a specified format to the cpo.train, cpo.retrafo and other functions, to reduce the need for boilerplate data subsetting on the user's part. The format is requested using the dataformat and dataformat.factor.with.ordered parameters. A cpo.retrafo function is expected to return data in the same format as it requested, so if it requested a Task, it must return one, while if it only requested the feature data.frame, a data.frame must be returned.
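As an illustration of a data format that includes the target, the sketch below defines a hypothetical Retrafoless CPO with dataformat = "df.all" that keeps every step-th row. The CPO name and the step parameter are made up for this example, and it is assumed that with “df.all” the data argument is a data.frame containing the target column(s) while target holds their names.

```r
library("mlrCPO")

# Hypothetical Retrafoless CPO that keeps every 'step'-th row
cpoThinXmp = makeCPORetrafoless("thinxmp",
  par.set = makeParamSet(
    makeIntegerLearnerParam("step", lower = 1L, default = 2L)),
  dataformat = "df.all",
  cpo.trafo = function(data, target, step) {
    # rows are removed, but the columns (and thus the format) stay the same
    kept = data[seq(1, nrow(data), by = step), , drop = FALSE]
    rownames(kept) = NULL
    kept
  })

thinned = iris.task %>>% cpoThinXmp(step = 10L)
getTaskSize(thinned)
```

The function returns a data.frame with the same columns it received, which matches the rule that data must be returned in the format that was requested.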
Target Operation CPOs can be used for conversion between Tasks. For this, the task.type.out value must be given. Task conversion works with all values of dataformat and is handled by the CPO framework. The cpo.trafo function must take care to return the target data in a proper format (see below). Note that for conversion, not only does the Task type need to be changed during cpo.trafo, but also the prediction format (see below) needs to change.

Some preprocessing for factorial columns needs the factor levels to be the same during training and prediction. This is usually not guaranteed by mlr, so the framework offers to do this if the fix.factors flag is set.

To prevent parameter name clashes when CPOs are concatenated, the parameters are prefixed with the CPO's ID. The ID can be set during CPO construction, but will default to the CPO's name if not given. The name is set using the cpo.name parameter.

Whenever a CPO needs certain packages to be installed to work, it can specify these in the packages parameter. The framework will check for the availability of the packages and throw an error if not found during construction. This means that loading a CPO from a savefile will omit this check, but in most cases it is a sufficient measure to make the user aware of missing packages in time.
Different Task types have the target in different formats. They are listed here for reference. Target data is in this format when given to the target argument of some functions, and must be returned in this format by cpo.trafo in Target Operation CPOs. Target values are always in the format of a data.frame, even when there is only one column.
Task type | target format |
“classif” | one column of factor |
“cluster” | data.frame with zero columns. |
“multilabel” | several columns of logical |
“regr” | one column of numeric |
“surv” | two columns of numeric |
The format of the target argument of, as well as the return value of, the cpo.invert function depends on the Task type as well as the predict.type. The requested return value predict.type is given to the cpo.invert function as a parameter; the predict.type of the target parameter depends on this and the predict.type.map (see PredictType).
The format of the prediction, depending on the task type and predict.type, is:
Task type | predict.type | target format |
“classif” | “response” | factor |
“classif” | “prob” | matrix with nclass cols |
“cluster” | “response” | integer cluster index |
“cluster” | “prob” | matrix with nclustr cols |
“multilabel” | “response” | logical matrix |
“multilabel” | “prob” | matrix with nclass cols |
“regr” | “response” | numeric |
“regr” | “se” | 2-col matrix |
“surv” | “response” | numeric |
“surv” | “prob” | [NOT YET SUPPORTED] |
Matrix formats are numeric, unless otherwise stated.

In the place of all cpo.* arguments, it is possible to make a headless function definition, consisting only of the function body. This function body must always begin with a ‘{’. For example, instead of cpo.retrafo = function(data, control) data[-1], it is possible to use cpo.retrafo = { data[-1] }. The necessary function head is then added automatically by the CPO framework. This will always contain the necessary parameters (e.g. “data”, “target”, hyperparameters as defined in par.set) in the names as required. This can declutter the definition of a CPOConstructor and is recommended if the CPO consists of few lines.
Note that if this is used when writing an R package, inside a function, this may lead the automatic R correctness checker to print warnings.
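As a sketch, the constant feature remover from the example section can be written with headless definitions as follows. The constructor name and the small data frames are made up for this illustration.

```r
library("mlrCPO")

# Headless definitions: only the function bodies are given. The framework
# adds the function heads, so 'data' and 'control' can be used directly.
constFeatRemHeadless = makeCPO("constFeatRemHeadless",
  dataformat = "df.features",
  cpo.train = {
    names(Filter(function(x) length(unique(x)) > 1, data))  # columns to keep
  },
  cpo.retrafo = {
    data[control]
  })

train.data = data.frame(a = c(1, 2, 3), b = c(1, 1, 1))
new.data = data.frame(a = c(4, 5, 6), b = c(7, 8, 9))

trafd = train.data %>>% constFeatRemHeadless()  # drops the constant column 'b'
new.trafd = new.data %>>% retrafo(trafd)        # drops 'b' from new data as well
```

Column b is removed from the new data even though it is not constant there, because the decision is based on the training data only.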
Other CPOConstructor related: CPOConstructor, getCPOClass(), getCPOConstructor(), getCPOName(), identicalCPO(), print.CPOConstructor()
Other CPO lifecycle related: CPOConstructor, CPOLearner, CPOTrained, CPO, NULLCPO, %>>%(), attachCPO(), composeCPO(), getCPOClass(), getCPOConstructor(), getCPOTrainedCPO(), identicalCPO()
# an example constant feature remover CPO
constFeatRem = makeCPO("constFeatRem",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    names(Filter(function(x) {  # names of columns to keep
      length(unique(x)) > 1
    }, data))
  },
  cpo.retrafo = function(data, control) {
    data[control]
  })

# alternatively:
constFeatRem = makeCPO("constFeatRem",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    cols.keep = names(Filter(function(x) {
      length(unique(x)) > 1
    }, data))
    # the following function will do both the trafo and retrafo
    result = function(data) {
      data[cols.keep]
    }
    result
  },
  cpo.retrafo = NULL)