vignettes/z_2_mlrCPO_core_terse.Rmd
This vignette is a short reference to the primitives and tools supplied by the mlrCPO package.
CPOs are first-class objects in R that represent data manipulation. They can be combined to form networks of operations, they can be attached to mlr Learners, and they have tunable hyperparameters that influence their behaviour. CPOs go through a lifecycle from construction to CPO to a CPOTrained “retrafo” or “inverter” object. The different stages of a CPO-related object can be distinguished using getCPOClass(), which takes one of five values:
getCPOClass(cpoPca)
getCPOClass(cpoPca())
getCPOClass(pid.task %>|% cpoPca())
getCPOClass(inverter(bh.task %>>% cpoLogTrafoRegr()))
getCPOClass(NULLCPO)
CPOs are created using CPOConstructors. These are R functions with a print function and many parameters in common.
print(cpoAsNumeric) # example CPOConstructor
print(cpoAsNumeric, verbose = TRUE) # alternative: !cpoAsNumeric
class(cpoAsNumeric)
getCPOName(cpoPca) # same as getCPOName() of the *constructed* CPO
getCPOClass(cpoPca)
The function parameters of a CPOConstructor are the CPO hyperparameters, the CPO id (which defaults to the CPO’s name), the affect.* parameters, and export, which determines which of the CPO’s hyperparameters are “exported”, i.e. can later be manipulated using setHyperPars().
(cpo = cpoScale()) # construct CPO with default Hyperparameter values
print(cpo, verbose = TRUE) # detailed printing. Alternative: !cpo
class(cpo) # CPOs that are not compound are "CPOPrimitive"
getCPOClass(cpo)
The inner “state” of a CPO can be inspected and manipulated using various getters and setters.
getParamSet(cpo)
getHyperPars(cpo)
setHyperPars(cpo, scale.center = FALSE)
getCPOId(cpo)
setCPOId(cpo, "MYID")
getCPOName(cpo)
getCPOAffect(cpo) # empty, since no affect set
getCPOAffect(cpoPca(affect.pattern = "Width$"))
getCPOConstructor(cpo) # the constructor used to create the CPO
getCPOProperties(cpo) # see properties explanation below
getCPOPredictType(cpo)
getCPOClass(cpo)
getCPOOperatingType(cpo) # Operating on feature, target, retrafoless?
Compare the predict type and operating type of a TOCPO or ROCPO:
getCPOPredictType(cpoResponseFromSE())
getCPOOperatingType(cpoResponseFromSE())
getCPOOperatingType(cpoSample())
The identicalCPO() function checks whether the underlying operation of two CPOs is identical. In this sense, CPOs with different hyperparameters can still be “identical”.
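A minimal sketch of this comparison, assuming mlrCPO is installed and using the cpoScale() and cpoPca() constructors that appear elsewhere in this vignette:

```r
library(mlrCPO)

# Same constructor, different hyperparameter values: same underlying operation
same.op = identicalCPO(cpoScale(scale = TRUE), cpoScale(scale = FALSE))

# Different constructors: different underlying operations
diff.op = identicalCPO(cpoScale(), cpoPca())

print(c(same.op = same.op, diff.op = diff.op))
```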
CPOs can be applied to data.frame and Task objects using %>>% or applyCPO.
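As an illustration, the following sketch applies cpoPca() once to a data.frame and once to a Task. The choice of iris and iris.task (shipped with R and mlr, respectively) is an assumption made for this example only.

```r
library(mlrCPO)

# Applying a CPO to a data.frame gives a transformed data.frame
pca.df = head(iris) %>>% cpoPca()

# applyCPO() performs the same kind of operation, here on a Task
pca.task = applyCPO(cpoPca(), iris.task)

print(names(pca.df))
print(head(getTaskData(pca.task)))
```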
CPO composition can be done using %>>% or composeCPO. It results in a new CPO which mostly behaves like a primitive CPO. Exceptions are that a compound CPO has no id and that its affect cannot be retrieved.
scale = cpoScale()
pca = cpoPca()
compound = scale %>>% pca
composeCPO(scale, pca) # same
class(compound)
!compound
getCPOName(compound)
getHyperPars(compound)
setHyperPars(compound, scale.center = TRUE, pca.center = FALSE)
getCPOId(compound) # error: no ID for compound CPOs
getCPOAffect(compound) # error: no affect for compound CPOs
getCPOOperatingType() always considers the operating type of the whole CPO chain and may return multiple values:
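A sketch of such a chain, built from three constructors mentioned in this vignette (cpoScale(), cpoLogTrafoRegr() and cpoSample()); the particular combination is chosen only for illustration:

```r
library(mlrCPO)

# One feature operating, one target operating and one retrafoless CPO
chain = cpoScale() %>>% cpoLogTrafoRegr() %>>% cpoSample()

# The operating type of the compound CPO covers all of its constituents
op.types = getCPOOperatingType(chain)
print(op.types)
```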
CPO-Learner attachment works using %>>% or attachCPO.
lrn = makeLearner("classif.logreg")
(cpolrn = cpo %>>% lrn) # the new learner has the CPO hyperparameters
attachCPO(compound, lrn) # attaching compound CPO
The new object is a CPOLearner, which performs the operation given by the CPO before training the Learner.
The work performed by a CPOLearner can also be performed manually:
lrn = cpoLogTrafoRegr() %>>% makeLearner("regr.lm")
model = train(lrn, subsetTask(bh.task, 1:300))
predict(model, subsetTask(bh.task, 301:500))
is equivalent to
trafo = subsetTask(bh.task, 1:300) %>>% cpoLogTrafoRegr()
model = train("regr.lm", trafo)
newdata = subsetTask(bh.task, 301:500) %>>% retrafo(trafo)
pred = predict(model, newdata)
invert(inverter(newdata), pred)
CPOs perform data-dependent operations. However, when such an operation becomes part of a machine-learning process, the operation on prediction data must depend only on the training data. A CPORetrafo object represents the re-application of a trained CPO. A CPOInverter object represents the transformation of a prediction made on a transformed task back to the form of the original data.
The CPOTrained objects generated by application of a CPO (or application of another CPOTrained) can be retrieved using the retrafo() or the inverter() function.
head(getTaskTargets(bh.task))
transformed = bh.task %>>% cpoLogTrafoRegr()
head(getTaskTargets(transformed))
(ret = retrafo(transformed))
(inv = inverter(transformed))
head(invert(inv, getTaskTargets(transformed)))
Retrafos and inverters are stored as attributes:
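A short sketch that lists the attribute names of a transformed Task. It assumes the attributes carry the names "retrafo" and "inverter" used in this vignette.

```r
library(mlrCPO)

transformed = bh.task %>>% cpoLogTrafoRegr()

# The CPOTrained objects travel with the data as ordinary R attributes
attr.names = names(attributes(transformed))
print(attr.names)

# The accessor functions retrieve them
ret = retrafo(transformed)
inv = inverter(transformed)
```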
It is possible to set the "retrafo" and "inverter" attributes of an object using retrafo() and inverter(). This can be useful for writing elegant scripts, especially since CPOTrained objects are automatically chained. To delete the CPOTrained attribute of an object, set it to NULL or NULLCPO, or use clearRI().
# clearRI returns the object without retrafo or inverter attributes
bh3 = clearRI(transformed)
attributes(bh3)
General methods work on a CPOTrained object to inspect its properties. Many methods that work on a CPO also work on a CPOTrained and give the same result.
getCPOName(ret)
getParamSet(ret)
getHyperPars(ret)
getCPOProperties(ret)
getCPOPredictType(ret)
getCPOOperatingType(ret) # Operating on feature, target, both?
getCPOOperatingType(inv)
A CPOTrained has information about whether it can be used as a CPORetrafo object (and be applied to new data using %>>%), or as a CPOInverter object (and used by invert()), or possibly both. This is given by getCPOTrainedCapability(), which returns a 1 if the object has an effect in the given role, 0 if the object has no effect (but can be used), or -1 if the object cannot be used in the role.
The “CPO class” of a CPOTrained is determined by this as well. A pure inverter is a CPOInverter, an object that can be used for retrafo is a CPORetrafo.
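The following sketch queries capabilities and classes for a retrafo, an inverter and NULLCPO. It assumes that the capability vector has the entries "retrafo" and "invert", matching the capability names used in this vignette.

```r
library(mlrCPO)

transformed = bh.task %>>% cpoLogTrafoRegr()
ret = retrafo(transformed)
inv = inverter(transformed)

# One capability value per role: 1 (has an effect), 0 (no effect), -1 (not usable)
cap.ret = getCPOTrainedCapability(ret)
cap.inv = getCPOTrainedCapability(inv)
cap.null = getCPOTrainedCapability(NULLCPO)
print(rbind(ret = cap.ret, inv = cap.inv, null = cap.null))

# The CPO class follows from the capabilities
print(c(ret = getCPOClass(ret), inv = getCPOClass(inv)))
```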
The CPO and the CPOConstructor used to create the CPOTrained can be queried.
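A sketch of these two queries, using a retrafo created from cpoScale(); the non-default center = FALSE setting is only there to show that hyperparameter values are carried along:

```r
library(mlrCPO)

ret = retrafo(iris %>>% cpoScale(center = FALSE))

# The CPO that was applied, including its hyperparameter settings
orig.cpo = getCPOTrainedCPO(ret)
print(getHyperPars(orig.cpo))

# The constructor that created that CPO
orig.constructor = getCPOConstructor(ret)
print(getCPOClass(orig.constructor))
```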
CPOTrained objects can be inspected using getCPOTrainedState(). The state contains the hyperparameters, the control object (CPO dependent data representing the data information needed to re-apply the operation), and information about the Task / data.frame layout used for training (column names, column types) in data$shapeinfo.input and data$shapeinfo.output.
The state can be manipulated and used to create new CPOTrained objects using makeCPOTrainedFromState().
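A sketch of this round trip for cpoScale(). It assumes that the control object of cpoScale() is a list with a center vector holding the per-column centering values; that detail belongs to the particular CPO and is not stated in this vignette.

```r
library(mlrCPO)

state = getCPOTrainedState(retrafo(iris %>>% cpoScale()))
print(names(state))

# ASSUMPTION: state$control$center holds the centering value of each numeric column
state$control$center[1] = 1000  # subtract 1000 from the first column instead

# Rebuild a CPOTrained from the modified state and apply it to data
new.retrafo = makeCPOTrainedFromState(cpoScale, state)
shifted = head(iris) %>>% new.retrafo
print(shifted)
```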
When executing data %>>% CPO, the result has an associated CPORetrafo and CPOInverter object. When applying another CPO, the CPORetrafo and CPOInverter will be chained automatically. This is to make (data %>>% CPO1) %>>% CPO2 work the same as data %>>% (CPO1 %>>% CPO2).
retrafo(data2) is the same as retrafo(data %>>% pca %>>% scale):
To interrupt this chain, set retrafo to NULL either explicitly, or using clearRI().
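Automatic chaining and its interruption can be sketched as follows. The variable names are chosen for this example, and as.list() is used only to count the primitive parts of each retrafo.

```r
library(mlrCPO)

data = iris %>>% cpoPca()

# The retrafo of the second step is chained onto the one already attached
chained = retrafo(data %>>% cpoScale())

# After clearRI(), the second step starts a fresh chain
unchained = retrafo(clearRI(data) %>>% cpoScale())

print(length(as.list(chained)))
print(length(as.list(unchained)))
```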
this is equivalent to
CPOTrained objects can be composed using %>>% and pipeCPO(), just like CPOs. They can also be split apart into primitive parts using as.list. It is recommended to only chain CPOTrained objects if they were created in the given order by preprocessing operations, since CPOTrained objects are very dependent on their position within a preprocessing pipeline.
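A sketch that splits a compound retrafo into its parts and joins them again in their original order; the scale-then-PCA pipeline is an arbitrary choice for this example:

```r
library(mlrCPO)

compound.retrafo = retrafo(iris %>>% (cpoScale() %>>% cpoPca()))

# Split into primitive CPOTrained objects ...
parts = as.list(compound.retrafo)

# ... and recombine them in the order in which they were created
rebuilt = pipeCPO(parts)
also.rebuilt = parts[[1]] %>>% parts[[2]]

print(length(parts))
print(getCPOClass(rebuilt))
```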
Similarly to CPOs, CPOTrained objects can be applied to data using %>>%, applyCPO, or predict. This only works with objects that have the "retrafo" capability and hence the CPORetrafo class.
This should in general give the same result as head(transformed), since the same data was used:
applyCPO() and predict() are synonyms of %>>% when used for CPORetrafo objects:
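A sketch comparing the three ways of applying a CPORetrafo, using cpoScale() on iris as an arbitrary example. clearRI() is called before comparing so that only the data itself is compared.

```r
library(mlrCPO)

transformed = iris %>>% cpoScale()
ret = retrafo(transformed)

# Three equivalent ways of re-applying the trained operation
by.operator = head(iris) %>>% ret
by.apply = applyCPO(ret, head(iris))
by.predict = predict(ret, head(iris))

print(clearRI(by.operator))
```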
To use CPOTrained objects for inversion, the invert() function is used. Besides the CPOTrained, it takes the data to invert, and optionally the predict.type. Typically CPOTrained objects that were retrieved using inverter() from a transformed dataset should be used for inversion. Retrafo CPOTrained objects retrieved from a transformed data set using retrafo() sometimes have both the "retrafo" as well as the "invert" capability (precisely when all TOCPOs used had the constant.invert flag set, see Building Custom CPOs) and can then also be used for inversion. In that case, however, the "truth" column of an inverted prediction is dropped.
transformed = bh.task %>>% cpoLogTrafoRegr()
prediction = predict(train("regr.lm", transformed), transformed)
inv = inverter(transformed)
invert(inv, prediction)
Inversion can be done on predictions given by mlr Learners, as well as on plain vectors, data.frames, and matrix objects.
Note that the prediction being inverted must have the form of a prediction done with the predict.type that an inverter expects as input for the predict.type given to invert() as an argument. This can be queried using the getCPOPredictType() function. If invert() is called with predict.type = p, then the prediction must be one made with a Learner that has predict.type set to getCPOPredictType(cpo)[p].
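The lookup described above can be sketched with cpoResponseFromSE(), which appears earlier in this vignette. The example assumes that the value returned by getCPOPredictType() is a character vector named by the predict.type requested from invert().

```r
library(mlrCPO)

cpo = cpoResponseFromSE()

# Maps the predict.type requested from invert() to the one the Learner must use
type.map = getCPOPredictType(cpo)
print(type.map)

# predict.type the underlying Learner needs for invert(..., predict.type = "response")
required.type = type.map[["response"]]
print(required.type)
```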
NULLCPO is the neutral element of %>>% and the operations it represents (composeCPO(), applyCPO(), and attachCPO()), i.e. when it is used as an argument of these functions, the data, Learner or CPO is not changed. NULLCPO is also the result of pipeCPO() called with the empty list, and of retrafo() and inverter() when they are called for objects with no CPOTrained objects attached.
pipeCPO(list())
as.list(NULLCPO) # the inverse of pipeCPO
retrafo(bh.task)
inverter(bh.task %>>% cpoPca()) # cpoPca operates on features only and is not a TOCPO, so no inverter is created
Many getters give characteristic results for NULLCPO.
getCPOClass(NULLCPO)
getCPOName(NULLCPO)
getCPOId(NULLCPO)
getHyperPars(NULLCPO)
getParamSet(NULLCPO)
getCPOAffect(NULLCPO)
getCPOOperatingType(NULLCPO) # operates neither on features nor on targets.
getCPOProperties(NULLCPO)
# applying NULLCPO leads to a retrafo() of NULLCPO, so it is its own CPOTrainedCPO
getCPOTrainedCPO(NULLCPO)
# NULLCPO has no effect on applyCPO and invert, so NULLCPO's capabilities are 0.
getCPOTrainedCapability(NULLCPO)
getCPOTrainedState(NULLCPO)
Some helper functions convert NULLCPO to NULL and back, while leaving other values as they are.
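A sketch of these helpers, assuming they are the mlrCPO functions is.nullcpo(), nullcpoToNull() and nullToNullcpo(), which are not named in the text above:

```r
library(mlrCPO)

# Test for NULLCPO
print(is.nullcpo(NULLCPO))

# Convert NULLCPO to NULL and NULL to NULLCPO
as.null = nullcpoToNull(NULLCPO)
as.nullcpo = nullToNullcpo(NULL)

# Other values pass through unchanged
unchanged = nullcpoToNull(10)
print(unchanged)
```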
A CPO has a “name” which identifies the general operation done by this CPO. For example, it is "pca" for a CPO created using cpoPca(). Furthermore, a CPO has an “ID” which is associated with the particular CPO object at hand. For primitive CPOs, it can be queried and set using getCPOId() and setCPOId(), and it can be set during construction, but it defaults to the CPO’s name. The ID will also be prefixed to the CPO’s hyperparameters after construction, if they are exported. This can help prevent hyperparameter name clashes when composing CPOs with otherwise identical hyperparameter names. It is possible to set the ID to NULL to have no prefix for hyperparameter names.
In the following (silly) example an error is thrown because of hyperparameter name clash. This can be avoided by setting the ID of one of the constituents to a different value.
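A sketch of such a clash and its resolution. The error is captured with try() so that the script keeps running, and the ID "two" is an arbitrary choice.

```r
library(mlrCPO)

cpo = cpoPca()

# Composing a CPO with itself: both parts export identically named hyperparameters
clash = try(cpo %>>% cpo, silent = TRUE)
print(inherits(clash, "try-error"))

# Giving one constituent a different ID changes its hyperparameter prefix
ok = cpo %>>% setCPOId(cpo, "two")
print(names(getParamSet(ok)$pars))
```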
CPOs contain information about the kind of data they can work with, and what kind of data they produce. getCPOProperties returns a list with the slots handling, adding, needed. properties$handling indicates the kind of data a CPO can handle, properties$needed indicates the kind of data that the data receiver (e.g. an attached learner) must be able to handle, and properties$adding lists the properties it adds to a given learner. An example is cpoDummyEncode(), a CPO that converts factors to numerics: The receiving learner needs to handle numerics, so properties$needed == "numerics", but it adds the ability to handle factors (since they are converted), so properties$adding = c("factors", "ordered").
As a result, cpoDummyEncode endows a Learner with the ability to train on data with factor variables:
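A sketch of this effect on the properties of a learner. The choice of "classif.knn" is an assumption made for illustration: it is taken to be a learner that handles only numeric features, and it requires the class package to be installed.

```r
library(mlrCPO)

props = getCPOProperties(cpoDummyEncode())
print(props$needed)
print(props$adding)

# ASSUMPTION: classif.knn handles numeric features only
lrn = makeLearner("classif.knn")
plain.props = getLearnerProperties(lrn)
wrapped.props = getLearnerProperties(cpoDummyEncode() %>>% lrn)

print("factors" %in% plain.props)
print("factors" %in% wrapped.props)
```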
.sometimes-Properties
As described in more detail in the Building Custom CPOs vignette, CPOs can have properties that are considered only when composing CPOs, or only when checking data returned by CPOs. In short, consider a CPO that does imputation, but only for factorial features. This CPO would need to have "missings" in its $adding properties slot, since it enables a Learner to handle (some) Tasks that have missing values. However, this CPO may under certain circumstances still return data that has missing values. This discrepancy is recorded internally by having two “hidden” sets of properties that can be retrieved with getCPOProperties() with get.internal set to TRUE. These properties are adding.min, the minimal set of properties added, and needed.max, the maximal set of properties needed by consecutive operators. These can be understood as a description of the “worst case” behaviour of the CPO, since behaviour that is out of bounds of these sets causes an error by the mlrCPO framework.
An example is the cpoApplyFun CPO: When it is constructed, it is not known what kind of properties will be added or needed, so adding.min is empty while needed.max is the set of all data properties. When composing CPOs, this CPO is handled as if it magically does exactly the data conversion necessary to make the CPOs or Learner coming after it work with the data. If this ends up not being the case, an error is thrown during application or training by the following CPO or Learner.
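The hidden property sets can be inspected as sketched below. The example assumes that cpoApplyFun() can be constructed without supplying a function when all of its hyperparameters are exported via export = "export.all".

```r
library(mlrCPO)

# ASSUMPTION: exporting all hyperparameters allows construction without a function
props = getCPOProperties(cpoApplyFun(export = "export.all"), get.internal = TRUE)

print(names(props))
print(props$adding.min)   # minimal set of properties added
print(props$needed.max)   # maximal set of properties needed
```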
When constructing a CPO, it is possible to restrict the columns on which the CPO operates using the affect.* parameters of the CPOConstructor. These parameters are:
affect.index: Identify affected columns by a vector of column indices.
affect.names: Identify affected columns by a vector of column names.
affect.pattern: Match column names against a grep() style regex pattern.
affect.pattern.ignore.case: Ignore case when matching by pattern.
affect.pattern.perl: Use “perl” syntax in affect.pattern.
affect.pattern.fixed: Use fixed pattern instead of regex in affect.pattern.
affect.invert: Invert the columns to affect: Only columns not matched by any of the other affect.* parameters are affected.
Sometimes when using many CPOs, their hyperparameters may get messy. mlrCPO enables the user to control which hyperparameters get exported. The parameter “export” can be one of "export.default", "export.set", "export.unset", "export.default.set", "export.default.unset", "export.all", "export.none". “all” and “none” do what one expects; “default” exports the “recommended” parameters; “set” exports only the parameters that were set during construction (and are not left as default), and “unset” exports only the parameters that were not set. “default.set” and “default.unset” work as “set” and “unset”, but restricted to the default exported parameters.
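Both mechanisms, column restriction through affect.* and export control, can be sketched with cpoScale(). The pattern "Width$" and the hyperparameter settings are arbitrary choices for this example.

```r
library(mlrCPO)

# Only scale the columns whose names end in "Width"
scaled = iris %>>% cpoScale(affect.pattern = "Width$")
print(head(scaled))

# Compare the exported hyperparameters under different export settings
default.pars = names(getParamSet(cpoScale())$pars)
no.pars = names(getParamSet(cpoScale(export = "export.none"))$pars)
unset.pars = names(getParamSet(cpoScale(scale = FALSE, export = "export.unset"))$pars)

print(default.pars)
print(no.pars)
print(unset.pars)   # "scale" was set during construction, so only "center" remains
```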
There are some %>>%-related operators that perform similar operations but may be more concise in certain applications. In general these operators are left-associative, i.e. they are evaluated after the expressions to their left have been evaluated. Therefore, for example, a %>>% b %<<% c is equivalent to (a %>>% b) %<<% c. Exceptions are the assignment operators, %<>>% and %<<<%, as well as the %>|% operator, see below.
The operators are:
%>>%: The application, composition or attachment operator.
%<<%: The above with exchanged arguments. a %<<% b is equivalent to b %>>% a.
%<>>%: %>>%, followed by assignment to the left. This operator evaluates the arguments to its right before being evaluated itself. a %<>>% b %>>% c is equivalent to a = (a %>>% b %>>% c).
%<<<%: %<<%, followed by assignment to the left. Note this is not the %<>>% operator with its arguments flipped. This operator evaluates the arguments to its right before being evaluated itself. a %<<<% b %>>% c is equivalent to a = (a %<<% (b %>>% c)).
%>|%: %>>%, followed by application of retrafo(). This operator evaluates the arguments to its right before being evaluated itself. a %>|% b %<<% c is equivalent to retrafo(a %>>% (b %<<% c)).
%|<%: The above with exchanged arguments. Like most R operators, this one evaluates arguments to its left before being evaluated itself. a %>>% b %|<% c is equivalent to retrafo((a %>>% b) %<<% c).
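A few of these operators in use, as a sketch; the variable names and the use of cpoScale() and cpoPca() are arbitrary choices for this example:

```r
library(mlrCPO)

# %<<% is %>>% with exchanged arguments
forward = head(iris) %>>% cpoScale()
backward = cpoScale() %<<% head(iris)

# %>|% applies the CPO and returns the retrafo of the result
ret = iris %>|% cpoScale()
print(getCPOClass(ret))

# %<>>% composes and assigns the result to the variable on the left
pipeline = cpoScale()
pipeline %<>>% cpoPca()
print(length(as.list(pipeline)))
```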