vignettes/z_1_getting_started_terse.Rmd
Since mlrCPO is a package with some depth to it, it comes with a few vignettes that each explain different aspects of its operation. These are the current document (“First Steps”), offering a short introduction and information on where to get started; “mlrCPO Core”, describing all the functions and tools offered by mlrCPO that are independent from specific CPOs; “CPOs Built Into mlrCPO”, listing all CPOs included in the mlrCPO package; and “Building Custom CPOs”, describing the process of creating new CPOs that offer new functionality.
All vignettes also have a “compact version” with the R output suppressed for readability. They are linked in the navigation section at the top.
All vignettes assume that mlrCPO (and therefore its requirement mlr) is installed successfully and loaded using library("mlrCPO"). Help with installation is provided on the project’s GitHub page.
“Composable Preprocessing Operators” (“CPOs”) are an extension for the mlr (“Machine Learning in R”) project that represents preprocessing operations in the form of R objects. These CPO objects can be composed to form complex operations, they can be applied to data sets, and they can be attached to mlr Learner objects to generate machine learning pipelines that combine preprocessing and model fitting.
“Preprocessing”, as understood by mlrCPO, is any manipulation of data used in a machine learning process to get it from its form as found in the wild into a form more fitting for the machine learning algorithm (“Learner”) used for model fitting. It is important to keep track of the exact method of preprocessing, so that the same method can be performed when the resulting model is used to make predictions on new data. It is also important, when evaluating preprocessing methods (e.g. using resampling), that the parameters of these methods are independent of the validation data set and depend only on the training data set.
mlrCPO tries to support the user in all these aspects of preprocessing by offering:
- CPOs that can perform many different operations. Operations that go beyond the provided toolset can be implemented in custom CPOs.
- “CPOTrained” objects that represent the preprocessing done on training data that should, in that way, be re-applied to new prediction data.
- mlr “Learner” objects that represent the entire machine learning pipeline to be tuned and evaluated.
At the centre of mlrCPO are “CPO” objects. To get a CPO object, it is necessary to call a CPO Constructor. A CPO Constructor sets up the parameters of a CPO and provides further options for its behaviour. Internally, CPO Constructors are functions that have a common interface and a friendly printer method.
CPOs exist first to be applied to data. Every CPO represents a certain data transformation, and this transformation is performed when the CPO is applied. This can be done using the applyCPO function, or the %>>% operator. CPOs can be applied to data.frame objects, and to mlr “Task” objects.
iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
tail(iris.demo %>>% cpoQuantileBinNumerics()) # bin the data in below & above median
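As a minimal sketch of the two ways of applying a CPO described above, the following compares the operator form with the applyCPO() function form. The variable names binned.op and binned.fn are only illustrative.

```r
library("mlrCPO")

# the demo subset used throughout this document
iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

cpo = cpoQuantileBinNumerics()  # default settings

binned.op = iris.demo %>>% cpo        # operator form
binned.fn = applyCPO(cpo, iris.demo)  # function form

# compare the resulting columns, ignoring attached attributes
all.equal(binned.op, binned.fn, check.attributes = FALSE)
```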
A useful feature of CPOs is that they can be concatenated to form new operations. Two CPOs can be combined using the composeCPO function or, as before, the %>>% operator. When two CPOs are combined, the product is a new CPO that can itself be composed or applied. The result of a composition represents the operation of first applying the first CPO and then the second CPO. Therefore, data %>>% (cpo1 %>>% cpo2) is the same as (data %>>% cpo1) %>>% cpo2.
# first create three quantile bins, then as.numeric() all columns to
# get 1, 2 or 3 as the bin number
quantilenum = cpoQuantileBinNumerics(numsplits = 3) %>>% cpoAsNumeric()
iris.demo %>>% quantilenum
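The equivalence of "compose first" and "apply first" stated above can be checked directly. This is a small sketch; the names cpo1, cpo2 and the res.* variables are chosen for illustration only.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

cpo1 = cpoQuantileBinNumerics(numsplits = 3)
cpo2 = cpoAsNumeric()

res.compose.first = iris.demo %>>% (cpo1 %>>% cpo2)  # compose, then apply
res.apply.first = (iris.demo %>>% cpo1) %>>% cpo2    # apply one after the other
res.function = applyCPO(composeCPO(cpo1, cpo2), iris.demo)  # function forms

# compare the data, ignoring attached retrafo attributes
all.equal(res.compose.first, res.apply.first, check.attributes = FALSE)
all.equal(res.compose.first, res.function, check.attributes = FALSE)
```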
The last example shows that it is sometimes not a good idea to have a CPO affect the whole dataset. Therefore, when a CPO is created, it is possible to choose what columns the CPO should affect. The CPO Constructor has a variety of parameters, with names starting with "affect.", that can be used to choose what columns the CPO operates on. To prevent cpoAsNumeric from influencing the Species column, we can thus do
quantilenum.restricted = cpoQuantileBinNumerics(numsplits = 3) %>>%
  cpoAsNumeric(affect.names = "Species", affect.invert = TRUE)
iris.demo %>>% quantilenum.restricted
A more convenient method in this case, however, is to use an mlr “Task”, which keeps track of the target column. “Feature Operation” CPOs (as all the ones shown) do not influence the target column.
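A short sketch of the Task-based approach: the classification task below declares Species as target, so the feature operation CPOs should leave it alone. The name demo.task is only illustrative.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
quantilenum = cpoQuantileBinNumerics(numsplits = 3) %>>% cpoAsNumeric()

# a Task knows its target column
demo.task = makeClassifTask(data = iris.demo, target = "Species")

result = demo.task %>>% quantilenum
getTaskData(result)  # features are bin numbers, Species stays a factor
```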
When performing preprocessing, it is sometimes necessary to change a small aspect of a long preprocessing pipeline. Instead of having to re-construct the whole pipeline, mlrCPO offers the possibility to change hyperparameters of a CPO. This makes it very easy, for example, to tune preprocessing in combination with a machine learning algorithm.
Hyperparameters of CPOs can be manipulated in the same way as they are manipulated for Learners in mlr, using getParamSet (to list the parameters), getHyperPars (to list the parameter values), and setHyperPars (to change these values). To get the parameter set of a CPO, it is also possible to use verbose printing using the ! (exclamation mark) operator.
CPOs use copy semantics, therefore setHyperPars creates a copy of a CPO that has the changed hyperparameters.
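The following sketch illustrates these functions with cpoScale. It assumes that the default id of cpoScale is "scale", so that its hyperparameters are named scale.center and scale.scale.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

cpo = cpoScale()
getParamSet(cpo)   # parameters with type and range
getHyperPars(cpo)  # current parameter values

# copy semantics: 'cpo' itself is not modified
cpo2 = setHyperPars(cpo, scale.scale = FALSE)
getHyperPars(cpo)$scale.scale
getHyperPars(cpo2)$scale.scale

iris.demo %>>% cpo2  # centering only, no scaling
```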
When chaining many CPOs, it is possible for the many hyperparameters to lead to very cluttered ParamSets, or even for hyperparameter names to clash. mlrCPO has two remedies for that.
First, any CPO also has an id that is always prepended to the hyperparameter names. It can be set during construction, using the id parameter, or changed later using setCPOId. The latter only works on primitive, i.e. not compound, CPOs. Set the id to NULL to use the CPO’s hyperparameters without a prefix.
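A brief sketch of how ids keep hyperparameter names apart. The ids "a", "b" and "first" are arbitrary illustrative choices.

```r
library("mlrCPO")

# two cpoScale steps would clash without distinct ids
cpo = cpoScale(id = "a") %>>% cpoScale(id = "b")
names(getHyperPars(cpo))

# id = NULL: hyperparameter names without prefix
plain = cpoScale(id = NULL)
names(getHyperPars(plain))

# change the id of a primitive CPO after construction
renamed = setCPOId(cpoScale(), "first")
names(getHyperPars(renamed))
```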
The second remedy against hyperparameter clashes is different “exports” of hyperparameters: The hyperparameters that can be changed using setHyperPars, i.e. that are exported by a CPO, are a subset of the parameters of the CPOConstructor. For each kind of CPO, there is a standard set of parameters that are exported, but during construction, it is possible to influence the parameters that actually get exported via the export parameter. export can be one of a set of standard export settings (among them “export.all” and “export.none”) or a character vector of the parameters to export.
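As an illustration of the export parameter, the sketch below restricts cpoPca to two exported parameters and, separately, exports none. It assumes that cpoPca has constructor parameters named center and rank.

```r
library("mlrCPO")

# export only two of the constructor's parameters
cpo = cpoPca(export = c("center", "rank"))
getParamSet(cpo)

# export nothing: no hyperparameters can be changed after construction
none = cpoPca(export = "export.none")
length(getHyperPars(none))
```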
Manipulating data for preprocessing itself is relatively easy. A challenge comes when one wants to integrate preprocessing into a machine-learning pipeline: The same preprocessing steps that are performed on the training data need to be performed on the new prediction data. However, the transformation performed for prediction often needs information from the training step. For example, if training entails performing PCA, then for prediction, the data must not undergo another PCA; instead, it needs to be rotated by the rotation matrix found by the training PCA. The process of obtaining the rotation matrix will be called “training” the CPO, and the object that contains the trained information is called CPOTrained. For preprocessing operations that operate only on features of a task (as opposed to the target column), the CPOTrained will always be applied to new incoming data, and hence be of class CPORetrafo and called a “retrafo” object. To obtain this retrafo object, one can use retrafo(). Retrafo objects can be applied to data just as CPOs can, by using the %>>% operator.
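A minimal sketch of obtaining and re-applying a retrafo object, using a PCA on the demo data. The names transformed, ret and first.again are illustrative.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

# "training": the PCA rotation is computed from iris.demo
transformed = iris.demo %>>% cpoPca(rank = 3)

# the retrafo object holds the trained rotation
ret = retrafo(transformed)
ret

# re-apply the trained transformation to the first row only
first.again = iris.demo[1, ] %>>% ret
first.again
transformed[1, ]  # for comparison
```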
To show that a retrafo object, here called ret, actually represents the exact same preprocessing operation, we can feed the first line of iris.demo back to it and verify that the transformation is the same.
We obviously would not have gotten there by feeding the first line to cpoPca directly.
CPOTrained objects associated with an object are automatically chained when another CPO is applied. To prevent this from happening, it is necessary to “clear” the retrafos and inverters associated with the object using clearRI().
Note that clearRI has no influence on the CPO operations themselves, and the resulting data is the same.
It is also possible to chain CPOTrained objects using composeCPO() or %>>%. This can be useful if the trafo chain loses access to the retrafo attribute for some reason. In general, it is only recommended to compose CPOTrained objects that were created in the same process and in correct order, since they are usually closely associated with the training data in a particular place within the preprocessing chain.
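The automatic chaining, the effect of clearRI(), and the manual composition of retrafo objects can be sketched as follows. The variable names t2, t3 and manual are illustrative.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
transformed = iris.demo %>>% cpoPca(rank = 3)

# retrafo is chained automatically: PCA, then scaling
t2 = transformed %>>% cpoScale()
length(as.list(retrafo(t2)))

# after clearRI(), only the scaling step is recorded
t3 = clearRI(transformed) %>>% cpoScale()
length(as.list(retrafo(t3)))

# the data itself is not affected by clearRI()
all.equal(t2, t3, check.attributes = FALSE)

# manually compose the two retrafo objects, in the order they were created
manual = retrafo(transformed) %>>% retrafo(t3)
all.equal(iris.demo %>>% manual, iris.demo %>>% retrafo(t2),
  check.attributes = FALSE)
```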
So far only CPOs were introduced that change the feature columns of a Task (“Feature Operation CPOs”, FOCPOs). There is another class of CPOs, “Target Operation CPOs” or TOCPOs, that can change a Task’s target columns.
This comes at the cost of some complexity when performing prediction: Since the training data that was ultimately fed into a Learner had a transformed target column, the predictions made by the resulting model will not be directly comparable to the original target values. Consider cpoLogTrafoRegr, a CPO that log-transforms the target variable of a regression Task. The predictions made with a Learner on a log-transformed target variable will be in log-space and need to be exponentiated (or otherwise re-transformed). This inversion operation is represented by an “inverter” object that is attached to a transformation result similarly to a retrafo object, and can be obtained using the inverter() function. It is of class CPOInverter, a subclass of CPOTrained.
iris.regr = makeRegrTask(data = iris.demo, target = "Petal.Width")
iris.logd = iris.regr %>>% cpoLogTrafoRegr()
getTaskData(iris.logd) # log-transformed target 'Petal.Width'
The inverter object is used by the invert() function that inverts the prediction made by a model trained on the transformed task, and re-transforms this prediction to fit the space of the original target data. The inverter object caches the “truth” of the data being inverted (iris.logd, in the example), so invert can give information on the truth of the inverted data.
logmodel = train("regr.lm", iris.logd)
pred = predict(logmodel, iris.logd) # prediction on the task itself
pred
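A sketch of the inversion step for the prediction on the training task: the inverter is taken from the transformed task and handed to invert() together with the prediction. The names inv and inv.pred are illustrative.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
iris.regr = makeRegrTask(data = iris.demo, target = "Petal.Width")
iris.logd = iris.regr %>>% cpoLogTrafoRegr()

logmodel = train("regr.lm", iris.logd)
pred = predict(logmodel, iris.logd)  # prediction in log-space

inv = inverter(iris.logd)    # inverter attached to the transformed task
inv.pred = invert(inv, pred) # prediction re-transformed to the original space
inv.pred
```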
This procedure can also be done with new incoming data. In general, more than just the cpoLogTrafoRegr operation could be done on the iris.regr task in the example, so to perform the complete preprocessing and inversion, one needs to use the retrafo object as well. When applying the retrafo object, a new inverter object is generated, which is specific to the exact new data that was being retransformed:
newdata = makeRegrTask("newiris", iris[7:9, ], target = "Petal.Width",
fixup.data = "no", check.data = FALSE)
# the retrafo does the same transformation(s) on newdata that were
# done on the training data of the model, iris.logd. In general, this
# could be more than just the target log transformation.
newdata.transformed = newdata %>>% retrafo(iris.logd)
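getTaskData(newdata.transformed)
# predict on the transformed new data; this prediction is still in log-space
pred = predict(logmodel, newdata.transformed)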
# the inverter of the newly transformed data contains information specific
# to the newly transformed data. In the current case, that is just the
# new "truth" column for the new data.
inv.newdata = inverter(newdata.transformed)
invert(inv.newdata, pred)
cpoLogTrafoRegr is a special case of a TOCPO in that its inversion operation is constant: It does not depend on the new incoming data, so in theory it is not necessary to get a new inverter object for every piece of data that is being transformed. Therefore, it is possible to use the retrafo object for inversion in this case. However, the “truth” column will not be available in this case.
Whether a retrafo object is capable of performing inversion can be checked with the getCPOTrainedCapability() function. It returns a vector with named elements "retrafo" and "invert", indicating whether a CPOTrained is capable of performing retrafo or inversion. A 1 indicates that the object can perform the action and has an effect, a 0 indicates that the action would have no effect (but also throws no error), and a -1 means that the object is not capable of performing the action.
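The sketch below inspects the capabilities of the retrafo and inverter objects of the log-transformed task and uses the retrafo object for inversion. The specific capability values are not asserted in the comments; they are printed for inspection.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
iris.regr = makeRegrTask(data = iris.demo, target = "Petal.Width")
iris.logd = iris.regr %>>% cpoLogTrafoRegr()

logmodel = train("regr.lm", iris.logd)
pred = predict(logmodel, iris.logd)

ret = retrafo(iris.logd)

# named vector with elements "retrafo" and "invert"
getCPOTrainedCapability(ret)
getCPOTrainedCapability(inverter(iris.logd))

# constant inversion: the retrafo object can be used in place of an inverter
inv.by.retrafo = invert(ret, pred)
inv.by.retrafo
```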
As an example of a CPO that does not have a constant inverter, consider cpoRegrResiduals, which fits a regression model on training data and returns the residuals of this fit. When performing prediction, the invert action is to add predictions by the CPO’s model to the incoming predictions made by a model trained on the residuals.
set.seed(123) # for reproducibility
iris.resid = iris.regr %>>% cpoRegrResiduals("regr.lm")
getTaskData(iris.resid)
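A sketch of the full residual pipeline on new data, under the assumption that a second-stage model is fitted on the residual task. The choice of "regr.rpart" for that model is arbitrary and made only for this illustration, as are the variable names.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
iris.regr = makeRegrTask(data = iris.demo, target = "Petal.Width")

set.seed(123)  # for reproducibility
iris.resid = iris.regr %>>% cpoRegrResiduals("regr.lm")

# second-stage model, trained on the residuals (arbitrary learner choice)
model.resid = train("regr.rpart", iris.resid)

newdata = makeRegrTask("newiris", iris[7:9, ], target = "Petal.Width",
  fixup.data = "no", check.data = FALSE)

# the retrafo turns the new target values into residuals and
# creates an inverter that is specific to the new data
newdata.resid = newdata %>>% retrafo(iris.resid)
pred.resid = predict(model.resid, newdata.resid)

# inversion adds the linear model's prediction back to the residual prediction
inv.pred = invert(inverter(newdata.resid), pred.resid)
inv.pred
```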
Besides FOCPOs and TOCPOs, there are also “Retrafoless” CPOs (ROCPOs). These only perform an operation in the training part of a machine learning pipeline, but in turn are the only CPOs that may change the number of rows in a dataset. The goal of ROCPOs is to change the number of data samples, but not to transform the data or target values themselves. Examples of ROCPOs are cpoUndersample, cpoSmote, and cpoSample.
There is no retrafo or inverter associated with the result of applying a ROCPO. Instead, both of them are NULLCPO.
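A small sketch with cpoSample, assuming its size parameter gives the number of rows to draw:

```r
library("mlrCPO")

set.seed(123)  # sampling is random
sampled = iris %>>% cpoSample(size = 3)
nrow(sampled)

# no trained objects are attached to the result
retrafo(sampled)
inverter(sampled)
```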
Until now, the CPOs have been invoked explicitly to manipulate data and get retrafo and inverter objects. It is good to be aware of the data flows in a machine learning process involving preprocessing, but mlrCPO makes it very easy to automate this. It is possible to attach a CPO to a Learner using attachCPO or the %>>% operator. When a CPO is attached to a Learner, a CPOLearner is created. The CPOLearner performs the preprocessing operation dictated by the CPO before training the underlying model, and stores and uses the retrafo and inverter objects necessary during prediction. It is possible to attach compound CPOs, and it is possible to attach further CPOs to a CPOLearner to extend the preprocessing pipeline. Exported hyperparameters of a CPO are also present in a CPOLearner and can be changed using setHyperPars, as usual with other Learner objects.
Recreating the pipeline from General Inverters with a CPOLearner looks like the following. Note that the prediction pred made in the end is identical with the one made above.
set.seed(123) # for reproducibility
lrn = cpoRegrResiduals("regr.lm") %>>% makeLearner("regr.randomForest")
lrn
It is possible to get the retrafo object from a model trained with a CPOLearner using the retrafo() function. In this example, it is identical with the retrafo(iris.resid) gotten in the example in General Inverters.
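To illustrate the automation on a simpler case than the residual pipeline, the sketch below attaches cpoLogTrafoRegr to a linear model (a swapped-in example, not the pipeline above) and compares the CPOLearner's prediction with the manual transform, predict and exponentiate steps.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]
iris.regr = makeRegrTask(data = iris.demo, target = "Petal.Width")

# preprocessing and model in one Learner
lrn = cpoLogTrafoRegr() %>>% makeLearner("regr.lm")
model = train(lrn, iris.regr)

# prediction is inverted automatically, i.e. given in the original space
pred = predict(model, newdata = iris[7:9, ])
pred

retrafo(model)  # the retrafo object stored in the trained model

# the same steps done by hand, for comparison
iris.logd = iris.regr %>>% cpoLogTrafoRegr()
logmodel = train("regr.lm", iris.logd)
manual = exp(predict(logmodel, newdata = iris[7:9, ])$data$response)
all.equal(pred$data$response, manual)
```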
Since the hyperparameters of a CPO are present in a CPOLearner, it is possible to tune hyperparameters of preprocessing operations. This can be done using mlr’s tuneParams() function and works identically to tuning common Learner parameters.
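A minimal tuning sketch, assuming the hyperparameter names scale.center and scale.scale of cpoScale and using a k-nearest-neighbour classifier ("classif.knn", which needs the class package) as an arbitrary example learner:

```r
library("mlrCPO")

set.seed(123)  # resampling is random

lrn = cpoScale() %>>% makeLearner("classif.knn")

# search space over the CPO's exported hyperparameters
ps = makeParamSet(
  makeLogicalParam("scale.center"),
  makeLogicalParam("scale.scale"))

res = tuneParams(lrn, iris.task, cv3, par.set = ps,
  control = makeTuneControlGrid(), show.info = FALSE)
res$x  # best configuration found
```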
Besides the %>>% operator, there are a few related operators which are short forms of operations that otherwise take more typing.
- %<<% is similar to %>>% but works in the other direction. a %>>% b is the same as b %<<% a.
- %<>>% and %<<<% are the %>>% and %<<% operators, combined with assignment. a %<>>% b is the same as a = a %>>% b. These operators perform the operations on their right before they do the assignment, so it is not necessary to use parentheses when writing a = a %>>% b %>>% c as a %<>>% b %>>% c.
- %>|% and %|<% feed data into a CPO and get the retrafo(). data %>|% a is the same as retrafo(data %>>% a). The %>|% operator performs the operation on its right before getting the retrafo, so it is not necessary to use parentheses when writing retrafo(data %>>% a %>>% b) as data %>|% a %>>% b.
As described before, it is possible to compose CPOs to create relatively complex preprocessing pipelines. It is therefore necessary to have tools to inspect a CPO pipeline or related objects.
The first line of attack when inspecting a CPO is always the print function. print(x, verbose = TRUE) will often print more information about a CPO than the ordinary print function. A shorthand alias for this is the exclamation point “!”. When verbosely printing a CPOConstructor, the transformation functions are shown. When verbosely printing a CPO, the constituent elements are separately printed, each showing their parameter sets.
When working with compound CPOs, it is sometimes necessary to manipulate a CPO inside a compound CPO pipeline. For this purpose, the as.list() generic is implemented for both CPO and CPOTrained for splitting a pipeline into a list of the primitive elements. The inverse is pipeCPO(), which takes a list of CPO or CPOTrained and concatenates them using composeCPO().
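A sketch of verbose printing and of changing one element inside a compound CPO by splitting and re-joining the pipeline. It assumes the hyperparameter name scale.center of cpoScale.

```r
library("mlrCPO")

pipeline = cpoScale() %>>% cpoPca()
print(pipeline, verbose = TRUE)  # long form of !pipeline

# split into primitive CPOs, change the first one, re-join
parts = as.list(pipeline)
length(parts)
parts[[1]] = setHyperPars(parts[[1]], scale.center = FALSE)
rebuilt = pipeCPO(parts)

getHyperPars(rebuilt)$scale.center
```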
CPOTrained objects contain information about the retrafo or inversion to be performed for a CPO. It is possible to access this information using getCPOTrainedState(). The “state” of a CPOTrained object often contains a $data slot with information about the expected input and output format (“ShapeInfo”) of incoming data, a slot for each of its hyperparameters, and a $control slot that is specific to the CPO in question. The cpoPca state, for example, contains the PCA rotation matrix and a vector for scaling and centering. The contents of a state’s $control object are described in a CPO’s help page.
It is even possible to change the “state” of a CPOTrained and construct a new CPOTrained using makeCPOTrainedFromState(). This is fairly advanced usage and only recommended for users familiar with the inner workings of the particular CPO. If we get familiar with the cpoPca CPO using the !-print (i.e. !cpoPca) to look at the retrafo function, we notice that the control$center and control$scale values are given to a call of scale(). If we want to create a new CPOTrained that does not perform centering or scaling before applying the rotation matrix, we can change these values.
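# retrafo of a PCA on the demo data, and its state
repca = retrafo(iris.demo %>>% cpoPca())
state = getCPOTrainedState(repca)
state$control$center = FALSE
state$control$scale = FALSE
nosc.repca = makeCPOTrainedFromState(cpoPca, state)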
Comparing this to the original “repca” retrafo shows that the result of applying repca has generally smaller values because of the centering.
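The comparison can be sketched end to end as follows, including a look at the state's slots. The names orig.result and nosc.result are illustrative.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

repca = retrafo(iris.demo %>>% cpoPca())
state = getCPOTrainedState(repca)
names(state)                         # hyperparameters, $control, $data
str(state$control, max.level = 1)    # content specific to cpoPca

# switch off centering and scaling in the trained state
state$control$center = FALSE
state$control$scale = FALSE
nosc.repca = makeCPOTrainedFromState(cpoPca, state)

orig.result = iris.demo %>>% repca
nosc.result = iris.demo %>>% nosc.repca
head(orig.result, 3)
head(nosc.result, 3)
```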
There is a large and growing variety of CPOs that perform many different operations. It is advisable to browse through “CPOs Built Into mlrCPO” for an overview. To get a list of all built-in CPOs, use listCPO(). A few important or “meta” CPOs that can be used to influence the behaviour of other CPOs are described here.
The value associated with “no operation” is the NULLCPO value. It is the neutral element of the %>>% operations, and the value of retrafo() and inverter() when there are otherwise no associated retrafo or inverter values.
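A short sketch of NULLCPO acting as the neutral element and as the "nothing attached" value:

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

# composing with NULLCPO leaves the other CPO's parameters as they are
cpo = NULLCPO %>>% cpoScale()
getHyperPars(cpo)

# applying NULLCPO to data performs no operation
all.equal(iris.demo %>>% NULLCPO, iris.demo, check.attributes = FALSE)

# plain data has no retrafo attached
is.nullcpo(retrafo(iris.demo))
```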
The multiplexer makes it possible to combine many CPOs into one, with an extra selected.cpo parameter that chooses between them. This makes it possible to tune over many different CPO configurations at once.
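A sketch of a multiplexer over two CPOs. It assumes that the selectable values of selected.cpo are the names of the contained CPOs, here "scale" and "pca".

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

cpm = cpoMultiplex(list(cpoScale, cpoPca))
getHyperPars(cpm)$selected.cpo  # the currently selected CPO

# choose the operation through a hyperparameter
scaled = iris.demo %>>% setHyperPars(cpm, selected.cpo = "scale")
rotated = iris.demo %>>% setHyperPars(cpm, selected.cpo = "pca")
scaled
rotated
```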
The CPO wrapper (cpoWrap) is a simple CPO with one parameter, which gets applied to the data as a CPO. This is different from a multiplexer in that its parameter is free and can take any value that behaves like a CPO. On the downside, this does not expose the argument’s parameters to the outside.
Attaching the CPO wrapper to a learner gives this learner a “cpo” hyperparameter that can be set to any CPO.
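A sketch of the wrapper in use. It assumes that the default id of cpoWrap is "wrap", so that its single hyperparameter is named wrap.cpo.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

cpa = cpoWrap()

# the CPO to apply is itself a hyperparameter value
wrapped.scale = iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoScale())
wrapped.pca = iris.demo %>>% setHyperPars(cpa, wrap.cpo = cpoPca())
wrapped.scale
wrapped.pca
```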
cpoCbind is a CPO that cbinds the results of other CPOs. The cbinder makes it possible to build DAGs of CPOs that perform different operations on data and paste the results next to each other. It is often useful to combine cpoCbind with cpoSelect to filter out columns that would otherwise be duplicated.
scale = cpoSelect(pattern = "Sepal", id = "first") %>>% cpoScale(id = "scale")
scale.pca = scale %>>% cpoPca()
cbinder = cpoCbind(scale, scale.pca, cpoSelect(pattern = "Petal", id = "second"))
cpoCbind recognises that "scale" happens before "pca", but is also fed to the result directly. The verbose print draws a (crude) ASCII-art graph.
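A sketch that prints the graph and applies the cbinder to the demo data. The variable result is illustrative.

```r
library("mlrCPO")

iris.demo = iris[c(1, 2, 3, 51, 52, 102, 103), ]

scale = cpoSelect(pattern = "Sepal", id = "first") %>>% cpoScale(id = "scale")
scale.pca = scale %>>% cpoPca()
cbinder = cpoCbind(scale, scale.pca, cpoSelect(pattern = "Petal", id = "second"))

print(cbinder, verbose = TRUE)  # long form of !cbinder

# scaled Sepal columns, their principal components, and the Petal columns
result = iris.demo %>>% cbinder
names(result)
```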
Even though CPOs are very flexible and can be combined in many ways, it may be necessary to create completely custom CPOs. Custom CPOs can be created using makeCPO() and related functions. “Building Custom CPOs” is a wide topic which has its own vignette.
- CPOs are built using CPOConstructors by calling them like functions.
- CPOConstructors can be found by using listCPO() or consulting the relevant vignette.
- Verbose printing of CPOs and many related objects is available using the ! (exclamation mark) operator.
- CPOs export hyperparameters that are accessible using getParamSet() and getHyperPars(), and mutable using setHyperPars(). Which parameters are exported can be controlled using the export parameter during construction.
- CPOs can be composed (composeCPO()), applied to data (applyCPO()) and attached to Learners (attachCPO()) using special functions for each of these operations, or using the general %>>% operator.
- There are three kinds of CPO: FOCPO (Feature Operation CPOs), TOCPO (Target Operation CPOs) and ROCPO (Retrafoless CPOs). The first may only change feature columns, the second only target columns. While the last one may change both feature and target values and even the number of rows of a dataset, it does so with the understanding that new “prediction” data will not be transformed by it and is thus mainly useful for subsampling.
- Data transformed by a CPO has a retrafo CPOTrained object associated with it that can be retrieved using retrafo() and used to transform new prediction data in a similar way as the original training data.
- CPOTrained objects can themselves be composed using composeCPO or %>>%, although it is only recommended to compose CPOTrained objects in the same order as they were created, and only if they were created in the same preprocessing pipeline.
- CPOTrained objects can be inspected using getCPOTrainedState(), and re-built with changed state using makeCPOTrainedFromState().
- Data transformed by a TOCPO also has an inverter CPOTrained object associated with it, which can be retrieved using inverter(). An inverter is also created during application of a retrafo CPOTrained.
- While retrafo CPOTrained are created during training and used on every prediction data set, inverter CPOTrained are created anew during each CPO and retrafo CPOTrained application and are closely associated with the data that they were created with.
- CPOTrained objects associated with data are stored in their “attributes” and are automatically chained when more CPOs are applied. clearRI() is used to remove the associated CPOTrained objects and prevent this chaining.
- CPOs can be attached to Learners to get CPOLearners which automatically transform training and prediction data and perform prediction inversion.
- CPOLearners have the Learner’s and the CPO’s hyperparameters and can thus be manipulated using setHyperPars(), and can be tuned using tuneParams().
- Important “meta” CPOs are NULLCPO (the neutral element of %>>%), cpoMultiplex, cpoWrap, and cpoCbind.
- Custom CPOs can be created using makeCPO and similar functions. These are described in their own vignette.