forml.pipeline.ensemble

Advanced operators for aggregating multiple models into an ensemble.

Model ensembling is a powerful technique for improving overall accuracy by combining multiple weak learners.

Ensembling comes in a number of flavors, each with its own strengths and trade-offs. This module provides implementations of some of the major ones.

Classes

class forml.pipeline.ensemble.FullStack(*bases: flow.Composable, crossvalidator: payload.CrossValidable, splitter: type[payload.CVFoldable] = paymod.PandasCVFolds, appender: collections.abc.Callable[..., flow.Features] | flow.Builder = paymod.PandasConcat.builder(axis='columns', ignore_index=False), stacker: collections.abc.Callable[..., flow.Features] | flow.Builder = paymod.PandasConcat.builder(axis='index', ignore_index=True), reducer: collections.abc.Callable[..., flow.Features] | flow.Builder = pandas_mean)
class forml.pipeline.ensemble.FullStack(*bases: flow.Composable, splitter: flow.Builder[payload.CVFoldable], nsplits: int, appender: collections.abc.Callable[..., flow.Features] | flow.Builder = paymod.PandasConcat.builder(axis='columns', ignore_index=False), stacker: collections.abc.Callable[..., flow.Features] | flow.Builder = paymod.PandasConcat.builder(axis='index', ignore_index=True), reducer: collections.abc.Callable[..., flow.Features] | flow.Builder = pandas_mean)

Bases: Ensembler

Stacking ensembler with N cross-validated instances of each base model - all of them also kept for serving.

This operator actually only represents the first layer of the stacked ensembling topology - providing a derived training dataset as the stack of cross-validated predictions of the base models. This dataset is simply passed down to the next composed operator which should be the actual final stacking model constituting the second ensembling layer.

The cross-validation splitter is prepended to the entire composition scope, which is then expanded separately for every fold, creating N parallel branches cloned from the original segment.
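The train-mode data flow described above can be illustrated with a minimal plain-Python sketch. It does not use the ForML API: MeanModel, kfold_indices and stack_train are hypothetical names invented for this illustration, and the toy learner simply predicts the mean of its training labels plus a fixed bias. The sketch only shows how the out-of-fold predictions of the per-fold clones are placed side by side (the appender role) and then concatenated row-wise (the stacker role) to form the derived training dataset.

```python
from statistics import mean


class MeanModel:
    """Hypothetical toy learner: predicts the mean training label plus a fixed bias."""

    def __init__(self, bias):
        self.bias = bias
        self.value = None

    def fit(self, rows, labels):
        self.value = mean(labels) + self.bias
        return self

    def predict(self, rows):
        return [self.value for _ in rows]


def kfold_indices(n_rows, n_splits):
    """Yield (train_idx, test_idx) pairs using simple interleaved folds."""
    for fold in range(n_splits):
        test = [i for i in range(n_rows) if i % n_splits == fold]
        train = [i for i in range(n_rows) if i % n_splits != fold]
        yield train, test


def stack_train(factories, rows, labels, n_splits):
    """Return (derived_features, derived_labels, fold_models)."""
    derived, derived_labels, fold_models = [], [], []
    for train_idx, test_idx in kfold_indices(len(rows), n_splits):
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # One fresh clone of every base model per fold, trained on that fold only.
        clones = [make().fit(train_rows, train_labels) for make in factories]
        held_out = [rows[i] for i in test_idx]
        columns = [clone.predict(held_out) for clone in clones]
        # Appender role: one column per base model, side by side.
        # Stacker role: the per-fold blocks are concatenated row-wise.
        derived.extend(list(row) for row in zip(*columns))
        derived_labels.extend(labels[i] for i in test_idx)
        # All clones are retained, mirroring the "kept for serving" behaviour.
        fold_models.append(clones)
    return derived, derived_labels, fold_models


factories = [lambda: MeanModel(0.0), lambda: MeanModel(1.0)]
rows = [[0], [1], [2], [3]]
labels = [0.0, 2.0, 4.0, 6.0]
derived, derived_labels, fold_models = stack_train(factories, rows, labels, n_splits=2)
print(derived)
```

The derived features have one row per original training row and one column per base model. This is the dataset that the next composed operator (the final stacking model) would be trained on.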

Instances of all stateful actors, including the clones of the same logical entity within the parallel folds, are kept for serving, where their individual predictions are combined using the reducer function (e.g. an arithmetic mean). This is unlike the other possible techniques, where each actor gets retrained as just a single instance. The result is computationally more expensive serving but potentially better accuracy.
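The serving side can be sketched in the same spirit. This is a plain-Python illustration rather than ForML code: constant, reduce_mean and stack_apply are hypothetical helpers, and the sketch assumes the reducer averages the predictions of the N fold clones belonging to the same base model, as the description above suggests.

```python
from statistics import mean


def constant(value):
    """Hypothetical stand-in for a trained clone: always predicts `value`."""
    return lambda rows: [value for _ in rows]


def reduce_mean(*fold_columns):
    """Average one base model's predictions across its fold clones, row by row."""
    return [mean(values) for values in zip(*fold_columns)]


def stack_apply(fold_models, rows):
    """fold_models[f][b] is the clone of base model b trained in fold f."""
    n_bases = len(fold_models[0])
    reduced = []
    for base in range(n_bases):
        # Every retained clone of this base model predicts on the same input.
        per_fold = [clones[base](rows) for clones in fold_models]
        reduced.append(reduce_mean(*per_fold))
    # One column per base model, matching the layout seen in train-mode.
    return [list(row) for row in zip(*reduced)]


fold_models = [
    [constant(4.0), constant(5.0)],  # clones trained in fold 0
    [constant(2.0), constant(3.0)],  # clones trained in fold 1
]
served = stack_apply(fold_models, [[10], [11]])
print(served)
```

Every request is evaluated by all N clones of every base model, which is where the extra serving cost mentioned above comes from.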

Parameters:
*bases: flow.Composable

Sequence of the base model operators (or compositions) to be ensembled.

crossvalidator: payload.CrossValidable

Implementation of the split-selection logic.

splitter: type[payload.CVFoldable] = paymod.PandasCVFolds
splitter: flow.Builder[payload.CVFoldable]

Depending on the constructor version:

  1. Folding actor type that is expected to take crossvalidator as its parameter. Defaults to payload.PandasCVFolds.

  2. Actor builder instance defining the folding splitter.

nsplits: int

The number of splits the splitter is going to generate (needs to be explicit as there is no generic way to extract it from the actor builder).

appender: collections.abc.Callable[..., flow.Features] | flow.Builder = paymod.PandasConcat.builder(axis='columns', ignore_index=False)

Horizontal column concatenator (combining base model predictions in train-mode) provided either as a function or as an actor builder.

stacker: collections.abc.Callable[..., flow.Features] | flow.Builder = paymod.PandasConcat.builder(axis='index', ignore_index=True)

Vertical (row-wise) concatenator (combining the per-fold predictions in train-mode) provided either as a function or as an actor builder.

reducer: collections.abc.Callable[..., flow.Features] | flow.Builder = pandas_mean

Horizontal column merger (combining base model predictions in apply-mode) provided either as a function or as an actor builder.

Examples

>>> PIPELINE = (
...     preprocessing.FooBar()
...     >> ensemble.FullStack(
...         sklearn.ensemble.GradientBoostingClassifier(),
...         sklearn.ensemble.RandomForestClassifier(),
...         crossvalidator=sklearn.model_selection.StratifiedKFold(n_splits=2))
...     >> sklearn.linear_model.LogisticRegression()
... )