This vignette is supposed to be a short reference of the primitives and tools supplied by the
CPOs are first-class objects in R that represent data manipulation. They can be combined to form networks of operations, they can be attached to
Learners, and they have tunable hyperparameters that influence their behaviour.
CPOs go through a lifecycle from construction to
CPO to a
CPOTrained “retrafo” or “inverter” object. The different stages of a
CPO related object can be distinguished using
getCPOClass(), which takes one of five values:
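The first stages of this lifecycle can be sketched as follows. This is a minimal illustration, assuming that `cpoScale` is used as the example constructor and that `iris.task` from mlr serves as data; neither choice comes from the text above.

```r
library(mlrCPO)  # also attaches mlr, which provides iris.task

# Stage 1: the constructor, a function that creates CPOs
getCPOClass(cpoScale)

# Stage 2: a CPO, created by calling the constructor
cpo = cpoScale()
getCPOClass(cpo)

# Stage 3: applying the CPO to data trains it; the trained object is
# attached to the result and can be retrieved with retrafo()
transformed = iris.task %>>% cpo
trained = retrafo(transformed)
getCPOClass(trained)
```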
CPOs are created using
CPOConstructors. These are R functions with a print function and many parameters in common.
The function parameters of a
id (defaults to the
CPO’s hyperparameters are “exported”, i.e. can later be manipulated using
The inner “state” of a
CPO can be inspected and manipulated using various getters and setters.
cpo = cpoScale()  # assumed definition of cpo, inferred from the scale.center hyperparameter used below
getParamSet(cpo)
getHyperPars(cpo)
setHyperPars(cpo, scale.center = FALSE)
getCPOId(cpo)
setCPOId(cpo, "MYID")
getCPOName(cpo)
getCPOAffect(cpo)  # empty, since no affect set
getCPOAffect(cpoPca(affect.pattern = "Width$"))
getCPOConstructor(cpo)  # the constructor used to create the CPO
getCPOProperties(cpo)  # see properties explanation below
getCPOPredictType(cpo)
getCPOClass(cpo)
getCPOOperatingType(cpo)  # Operating on feature, target, retrafoless?
Compare the predict type and operating type of a TOCPO or ROCPO:
The identicalCPO() function checks whether the underlying operation of two
CPOs is identical. Under this notion of identity,
CPOs with different hyperparameters can still be “identical”.
CPOs can be applied to
Task objects using
CPO composition can be done using
composeCPO. It results in a new CPO which mostly behaves like a primitive CPO. Exceptions are:
getCPOOperatingType() always considers the operating type of the whole
CPO chain and may return multiple values:
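As a hedged sketch of composition, the following chains a feature-operating CPO with a target-operating one. The choice of `cpoScale()`, `cpoLogTrafoRegr()` and mlr's `bh.task` is an assumption made for illustration only.

```r
library(mlrCPO)  # also attaches mlr, which provides bh.task

# Two equivalent ways of composing CPOs
pipeline = cpoScale() %>>% cpoLogTrafoRegr()
pipeline2 = composeCPO(cpoScale(), cpoLogTrafoRegr())

getCPOOperatingType(cpoScale())  # operating type of a primitive CPO
getCPOOperatingType(pipeline)    # operating types of the whole chain

# Applying the compound CPO applies its constituents in order
transformed = bh.task %>>% pipeline
```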
CPO-Learner attachment works using
The new object is a
CPOLearner, which performs the operation given by the
CPO before training the
The work performed by a
CPOLearner can also be performed manually:
is equivalent to
It is possible to obtain both the underlying
Learner and the attached
CPO from a
CPOLearner. Note that if a
CPOLearner is wrapped by some method (e.g. a
TuneWrapper), this does not work, since
CPO can not probe below the first wrapping layer.
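The attachment and its manual equivalent could look like the sketch below. The learner `"classif.rpart"` and the task `iris.task` are assumptions chosen for illustration; any mlr Learner that fits the task should work the same way.

```r
library(mlrCPO)  # also attaches mlr

# Attach a CPO to a Learner; the result is a CPOLearner
lrn = cpoScale() %>>% makeLearner("classif.rpart")
model = train(lrn, iris.task)
pred = predict(model, iris.task)

# The same work, performed manually
trained.data = iris.task %>>% cpoScale()
manual.model = train(makeLearner("classif.rpart"), trained.data)
manual.pred = predict(manual.model, iris.task %>>% retrafo(trained.data))

# Retrieve the two parts of the CPOLearner
getLearnerCPO(lrn)   # the attached CPO
getLearnerBare(lrn)  # the underlying Learner
```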
CPOs perform data-dependent operations. However, when such an operation becomes part of a machine-learning process, the operation on predict-data must depend only on the training data. A
CPORetrafo object represents the re-application of a trained CPO. A
CPOInverter object represents the transformation of a prediction made on a transformed task back to the form of the original data.
Retrafos and inverters are stored as attributes:
It is possible to set the
"inverter" attributes of an object using
inverter(). This can be useful for writing elegant scripts, especially since CPOTrained objects are automatically chained. To delete the
CPOTrained attribute of an object, set it to
NULLCPO, or use
General methods that work on
CPOTrained object to inspect its object properties. Many methods that work on a
CPO also work on a
CPOTrained and give the same result.
CPOTrained has information about whether it can be used as a
CPORetrafo object (and be applied to new data using
%>>%), or as a
CPOInverter object (and used by
invert()), or possibly both. This is given by
getCPOTrainedCapability(), which returns a
1 if the object has an effect in the given role,
0 if the object has no effect (but can be used), or
-1 if the object can not be used in the role.
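A short sketch of querying capabilities, assuming `cpoScale()` as an example of a feature-operating CPO and `cpoLogTrafoRegr()` as an example of a target-operating CPO (both choices are illustrative):

```r
library(mlrCPO)  # also attaches mlr, which provides iris.task and bh.task

# Trained object of a feature-operating CPO
trained = retrafo(iris.task %>>% cpoScale())
getCPOTrainedCapability(trained)

# Inverter of a target-operating CPO
inv = inverter(bh.task %>>% cpoLogTrafoRegr())
getCPOTrainedCapability(inv)
```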
The “CPO class” of a
CPOTrained is determined by this as well. A pure inverter is
CPOInverter, an object that can be used for retrafo is a
CPO and the
CPOConstructor used to create the CPOTrained can be queried.
CPOTrained objects can be inspected using
getCPOTrainedState(). The state contains the hyperparameters, the
control object (CPO dependent data representing the data information needed to re-apply the operation), and information about the
data.frame layout used for training (column names, column types) in
The state can be manipulated and used to create new
data %>>% CPO, the result has an associated
CPOInverter object. When applying another
CPOInverter will be chained automatically. This is to make
(data %>>% CPO1) %>>% CPO2 work the same as
data %>>% (CPO1 %>>% CPO2).
To interrupt this chain, set retrafo to
NULL either explicitly, or using
this is equivalent to
CPOTrained can be composed using
pipeCPO(), just like
CPOs. They can also be split apart into primitive parts using
as.list. It is recommended to only chain
CPOTrained objects if they were created in the given order by preprocessing operations, since
CPOTraineds are very dependent on their position within a preprocessing pipeline.
CPOTrained objects can be applied to data using
predict. This only works with objects that have the
"retrafo" capability and hence the
Should in general give the same as
head(transformed), since the same data was used:
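A minimal sketch of re-applying a trained CPO, assuming `cpoScale()` and the `iris` data.frame as illustrative inputs:

```r
library(mlrCPO)

# Train by applying a CPO, then retrieve the trained object
transformed = iris %>>% cpoScale()
trained = retrafo(transformed)

# Re-apply the trained operation; both forms perform the same operation
again = iris %>>% trained
again2 = applyCPO(trained, iris)

head(again)
```

Because the same data was used for training and re-application, the transformed columns agree with those of `transformed`.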
To use CPOTrained objects for inversion, the
invert() function is used. Besides the
CPOTrained, it takes the data to invert, and optionally the
CPOTrained objects that were retrieved using
inverter() from a transformed dataset should be used for inversion. Retrafo
CPOTrained objects retrieved from a transformed data set using
retrafo() sometimes have both the
"retrafo" as well as the
"invert" capability (precisely when all TOCPOs used had the
constant.invert flag set, see Building Custom CPOs) and can then also be used for inversion. In that case, however, the
"truth" column of an inverted prediction is dropped.
Inversion can be done on both predictions given by
Learners, as well as plain vectors,
Note that the prediction being inverted must have the form of a prediction done with the
predict.type that an inverter expects as input for the
predict.type given to
invert() as an argument. This can be queried using the
getCPOPredictType() function. If
invert() is called with
predict.type = p, then the prediction must be one made with a
Learner that has
predict.type set to
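The inversion workflow can be sketched as follows. The target-operating CPO `cpoLogTrafoRegr()`, the learner `"regr.lm"` and the task `bh.task` are assumptions chosen for illustration.

```r
library(mlrCPO)  # also attaches mlr, which provides bh.task

# Train a model on a log-transformed regression target
transformed = bh.task %>>% cpoLogTrafoRegr()
model = train(makeLearner("regr.lm"), transformed)

# Re-apply the transformation to the prediction data;
# the result carries an inverter
newdata = bh.task %>>% retrafo(transformed)
pred = predict(model, newdata)

# Map the prediction back to the scale of the original target
inv = inverter(newdata)
pred.inverted = invert(inv, pred)
```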
NULLCPO is the neutral element of
%>>% and the operations it represents (
attachCPO()), i.e. when it is used as an argument of these functions, the data,
CPO is not changed.
NULLCPO is also the result of
pipeCPO() called with the empty list, and of
inverter() when they are called for objects with no
CPOTrained objects attached.
Many getters give characteristic results for
getCPOClass(NULLCPO)
getCPOName(NULLCPO)
getCPOId(NULLCPO)
getHyperPars(NULLCPO)
getParamSet(NULLCPO)
getCPOAffect(NULLCPO)
getCPOOperatingType(NULLCPO)  # operates neither on features nor on targets.
getCPOProperties(NULLCPO)
# applying NULLCPO leads to a retrafo() of NULLCPO, so it is its own CPOTrainedCPO
getCPOTrainedCPO(NULLCPO)
# NULLCPO has no effect on applyCPO and invert, so NULLCPO's capabilities are 0.
getCPOTrainedCapability(NULLCPO)
getCPOTrainedState(NULLCPO)
Some helper functions convert
NULL and back, while leaving other values as they are.
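A brief sketch of how NULLCPO behaves as a neutral element, together with the conversion helpers `nullToNullcpo()` and `nullcpoToNull()`; the example inputs are assumptions chosen for illustration.

```r
library(mlrCPO)

# NULLCPO leaves data and CPOs unchanged
unchanged = iris %>>% NULLCPO
composed = cpoScale() %>>% NULLCPO

# Composing an empty list of CPOs gives NULLCPO
empty = pipeCPO(list())
is.nullcpo(empty)

# Conversion helpers; other values are passed through as they are
nullToNullcpo(NULL)
nullcpoToNull(NULLCPO)
nullToNullcpo(10)
```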
CPO has a “name” which identifies the general operation done by this
CPO. For example, it is
"pca" for a
CPO created using
cpoPca(). Furthermore, a
CPO has an “ID” which is associated with the particular
CPO object at hand. For primitive
CPOs, it can be queried and set using
setCPOId(), and it can be set during construction, but it defaults to the
CPO’s name. The ID will also be prefixed to the
CPO’s hyperparameters after construction, if they are exported. This can help prevent hyperparameter name clashes when composing
CPOs with otherwise identical hyperparameter names. It is possible to set the ID to
NULL to have no prefix for hyperparameter names.
In the following (silly) example an error is thrown because of hyperparameter name clash. This can be avoided by setting the ID of one of the constituents to a different value.
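A minimal sketch of such a clash and its resolution, assuming two `cpoScale()` instances as the constituents and the hypothetical IDs `"first"` and `"second"`:

```r
library(mlrCPO)

# Both constituents export "scale.center" and "scale.scale",
# so composing them fails; the error message is captured here
clash = tryCatch(cpoScale() %>>% cpoScale(),
  error = function(e) conditionMessage(e))

# Distinct IDs give distinct hyperparameter prefixes
ok = cpoScale(id = "first") %>>% cpoScale(id = "second")
names(getParamSet(ok)$pars)
```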
CPOs contain information about the kind of data they can work with, and what kind of data they produce.
getCPOProperties returns a list with the slots
properties$handling indicates the kind of data a CPO can handle,
properties$needed indicates the kind of data that the data receiver (e.g. the attached learner) must be able to handle, and
properties$adding lists the properties it adds to a given learner. An example is
cpoDummyEncode(), a CPO that converts factors to numerics: The receiving learner needs to handle numerics, so
properties$needed == "numerics", but it adds the ability to handle factors (since they are converted), so
properties$adding = c("factors", "ordered").
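These slots can be inspected directly. A short sketch, using the variable name `props` as an arbitrary choice:

```r
library(mlrCPO)

props = getCPOProperties(cpoDummyEncode())
props$handling  # kinds of data the CPO can handle
props$needed    # what the data receiver must be able to handle
props$adding    # what the CPO adds to a receiving Learner
```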
As a result,
cpoDummyEncode endows a
Learner with the ability to train on data with factor variables:
As described in more detail in the Building Custom CPOs vignette,
CPOs can have properties that are considered only when composing
CPOs, or only when checking data returned by
CPOs. In short, consider a
CPO that does imputation, but only for factorial features. This
CPO would need to have
"missings" in its
$adding properties slot, since it enables
Learner to handle (some)
Tasks that have missing values. However, this
CPO may under certain circumstances still return data that has missing values. This discrepancy is recorded internally by having two “hidden” sets of properties that can be retrieved with
get.internal set to
TRUE. These properties are
adding.min, the minimal set of properties added, and
needed.max, the maximal set of properties needed by consecutive operators. These can be understood as a description of the “worst case” behaviour of the
CPO, since behaviour that is out of bounds of these sets causes an error by the
An example is the
CPO: When it is constructed, it is not known what kind of properties will be added or needed, so
adding.min is empty while
needed.max is the set of all data properties. When composing
CPO is handled as if it magically does exactly the data conversion necessary to make the
Learner coming after it work with the data. If this ends up not being the case, an error is thrown during application or training by the following
When constructing a
CPO, it is possible to restrict the columns on which the
CPO operates using the
affect.* parameters of the
CPOConstructor. These parameters are:
affect.index: Identify affected columns by a vector of column indices.
affect.names: Identify affected columns by a vector of column names.
affect.pattern: Match column names against a
grep()-style regex pattern.
affect.pattern.ignore.case: Ignore case when matching by pattern.
affect.pattern.perl: Use “perl” syntax in
affect.pattern.fixed: Use fixed pattern instead of regex in
affect.invert: Invert the columns to affect: Only columns not matched by any of the other
affect.* parameters are affected.
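The following sketch shows two of these parameters in use, assuming `cpoScale()` applied to the `iris` data.frame as an illustrative example:

```r
library(mlrCPO)

# Scale only the columns whose names end in "Width"
cpo = cpoScale(affect.pattern = "Width$")
getCPOAffect(cpo)
res = iris %>>% cpo

# Scale every column except Sepal.Length
cpo.inv = cpoScale(affect.names = "Sepal.Length", affect.invert = TRUE)
res.inv = iris %>>% cpo.inv
```

In both results the `Sepal.Length` column is left as it was in the input, while the affected numeric columns are centered and scaled.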
Sometimes when using many CPOs, their hyperparameters may get messy.
mlrCPO enables the user to control which hyperparameters get exported. The parameter “export” can be one of
"export.none". “all” and “none” do what one expects; “default” exports the “recommended” parameters; “set” exports only the values that were set (and are not left at their default), while “unset” exports the values that have not been set. “default.set” and “default.unset” work as “set” and “unset”, but restricted to the default exported parameters.
There are some
%>>%-related operators that perform similar operations but may be more concise in certain applications. In general these operators are left-associative, i.e. they are evaluated after the expressions to their left were evaluated. Therefore, for example,
a %>>% b %<<% c is equivalent to
(a %>>% b) %<<% c. Exceptions are the assignment operators,
%<>>% and %<<<%, as well as the
%>|% operator, see below.
The operators are:
%>>%: The application, composition or attachment operator.
%<<%: The above with exchanged arguments. a %<<% b is equivalent to b %>>% a.
%<>>%: %>>%, followed by assignment to the left. This operator evaluates the arguments to its right before being evaluated itself. a %<>>% b %>>% c is equivalent to a = (a %>>% b %>>% c).
%<<<%: %<<%, followed by assignment to the left. Note this is not the %<>>% operator with its arguments flipped. This operator evaluates the arguments to its right before being evaluated itself. a %<<<% b %>>% c is equivalent to a = (a %<<% (b %>>% c)).
%>|%: %>>%, followed by application of retrafo(). This operator evaluates the arguments to its right before being evaluated itself. a %>|% b %<<% c is equivalent to retrafo(a %>>% (b %<<% c)).
%|<%: The above with exchanged arguments. Like most R operators, this one evaluates arguments to its left before being evaluated itself. a %>>% b %|<% c is equivalent to retrafo((a %>>% b) %<<% c).
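A short sketch of some of these operators in use, assuming `cpoScale()` and the `iris` data.frame as illustrative operands:

```r
library(mlrCPO)

# Application, and the same with exchanged arguments
forward = iris %>>% cpoScale()
backward = cpoScale() %<<% iris

# Application followed by retrafo(): gives the trained object directly
trained = iris %>|% cpoScale()

# Application followed by assignment to the left-hand variable
data = iris
data %<>>% cpoScale()
```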