mlrCPO is a toolset that enriches mlr
with a diverse set of preprocessing operators.
Composable Preprocessing Operators (“CPO”s) are first-class R objects. They can be applied to data.frames and
mlr Tasks to modify data, attached to mlr Learners to add preprocessing to machine learning algorithms,
and composed to form preprocessing pipelines.
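
The three uses named above can be sketched as follows. This is a minimal illustration, not taken from the package documentation: it assumes that mlr, mlrCPO and rpart are installed, and the choice of cpoScale(), cpoPca(), the iris data and the "classif.rpart" learner is arbitrary.

```r
library(mlr)
library(mlrCPO)

# 1. Apply a CPO directly to a data.frame (numeric columns of iris only,
#    an assumption made here to keep the example simple).
scaled <- iris[1:4] %>>% cpoScale()

# 2. Compose two CPOs into a pipeline and apply it to an mlr Task.
pipeline <- cpoScale() %>>% cpoPca()
iris.pca <- iris.task %>>% pipeline

# 3. Attach the pipeline to a Learner; the result is again a Learner,
#    so it can be trained and used for prediction as usual.
lrn <- pipeline %>>% makeLearner("classif.rpart")
model <- train(lrn, iris.task)
pred <- predict(model, iris.task)
```

Because the preprocessing is part of the Learner in step 3, it is repeated inside every call to train() and predict() rather than being applied once to the whole dataset.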
mlrCPO focuses on preprocessing as part of automated machine learning pipelines. This means that
it is designed with the expectation that the same preprocessing is applied to both incoming training data
and test data. A common mistake in machine learning is to evaluate a method (e.g. using resampling)
on a dataset after that dataset has been cleaned up and preprocessed in one go. A proper evaluation
must ensure that the preprocessing of training data is not influenced by any information contained
in the test data set. mlrCPO takes this duality into account by providing CPO objects that
run on training data and then create CPOTrained objects that can be used on test data (or entirely new
prediction data).
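
A short sketch of this train/test separation follows. The split into 100 training rows, the seed, and the variable names are assumptions made for illustration; cpoScale() and retrafo() are used as the CPO and the accessor for the resulting CPOTrained object.

```r
library(mlr)
library(mlrCPO)

set.seed(1)
train.idx <- sample(nrow(iris), 100)
train.df <- iris[train.idx, 1:4]   # numeric features only
test.df <- iris[-train.idx, 1:4]

# Running the CPO on training data estimates centering and scaling
# parameters from the training rows only.
train.scaled <- train.df %>>% cpoScale()

# The CPOTrained object stores those parameters ...
trained <- retrafo(train.scaled)

# ... and re-applies them to test data without re-estimating anything.
test.scaled <- test.df %>>% trained
```

The test rows are therefore shifted and scaled by the training means and standard deviations, so no information from the test set enters the preprocessing.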
This focus on preprocessing is the reason for a strict separation between “Feature Operation” CPOs,
“Target Operation” CPOs, and “Retrafoless” CPOs (see OperatingType).
The first class only changes (predictor) features of a dataset, and does so in a way that is reproducible on
test data. The second class only changes (outcome) target data of a dataset, and is then able to invert
a prediction made by a learner on new data back to the space of the original target data. A “Retrafoless” CPO
only operates during training and may only add or remove data rows (e.g. for SMOTE-ing or subsampling),
without transforming the space of either predictor or outcome variables.
CPO's design is meant to help its users avoid bugs and errors. It therefore often avoids doing things
implicitly and relies on explicit commands, e.g. for removing data or converting between data types. It has certain
restrictions in place (e.g. CPOProperties, CPOTrainedCapability) that try to make it hard to do the
wrong thing without getting in the way of the right thing.
Other packages with similar, partially overlapping functionality are recipes, dplyr, and caret.