As part of the project life cycle management, ForML makes it possible to evaluate model performance by quantifying the quality of its predictions using a number of different methods.

There are two different evaluation concepts, each relating to one of the possible life cycles: the development train-test evaluation and the production performance tracking.

All configuration specific to the evaluation setup is defined on the project level using the evaluation component via the evaluation descriptor:

class forml.project.Evaluation(metric: evaluation.Metric, method: evaluation.Method)[source]

Evaluation component descriptor representing the evaluation configuration.

metric: evaluation.Metric

Loss/Score function to be used to quantify the prediction quality.

method: evaluation.Method

Strategy for generating data for the development train-test evaluation (e.g. holdout or cross-validation).


>>> EVALUATION = project.Evaluation(
...     evaluation.Function(sklearn.metrics.log_loss),
...     evaluation.HoldOut(test_size=0.2, stratify=True, random_state=42),
... )

With this information, ForML can assemble the particular evaluation workflow around the main pipeline component once the given life cycle action is triggered.


All the evaluation primitives described in this chapter deal with flow topology rather than direct data values. Their purpose is not to perform any calculations themselves but to construct the workflow that carries out the evaluation when launched. For curiosity’s sake, the Graphviz runner can be used to explore the particular DAGs composed in the scope of an evaluation.

The evaluation process is in principle based on comparing predicted and true outcomes using some metric function. The evaluation API uses the evaluation.Outcome structure to hold pointers to the relevant DAG ports publishing these predicted and true outcomes:

class forml.evaluation.Outcome(true: flow.Publishable, pred: flow.Publishable)[source]

True and predicted outcome pair of output ports.

These two ports provide all the data required to calculate the evaluation metric.

true : flow.Publishable

True outcomes port.

pred : flow.Publishable

Predicted outcomes port.

Metric Function

The heart of the evaluation process is a specific metric function quantifying the quality of the predicted versus true outcomes. There are dozens of standard metrics, each suited to different scenarios (and bespoke ones can easily be implemented for specific purposes).
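To illustrate, a bespoke metric is essentially just a callable comparing true and predicted outcome values - a minimal standalone sketch (plain Python, not part of the ForML API):

```python
def mean_absolute_error(true, pred):
    """Bespoke metric: average absolute deviation of predictions from truth."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

# perfect predictions score 0.0; worse predictions score higher
print(mean_absolute_error([1.0, 0.0, 1.0], [0.8, 0.1, 0.6]))
```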

The ForML evaluation API uses the following abstraction for its metric implementations:

class forml.evaluation.Metric[source]

Evaluation metric interface.

abstract score(*outcomes: evaluation.Outcome) → flow.Node[source]

Compose the metric evaluation task on top of the given outcomes DAG ports.

Return the tail node of the new DAG that’s expected to have a single output apply-port delivering the calculated metric.

*outcomes: evaluation.Outcome

Individual outcomes partitions to be scored.


The input is (potentially) a sequence of outcome partitions as the metrics might need to be calculated from separate chunks (e.g. individual cross-validation folds).


Returns: Single node with a single output apply-port providing the metric output.

A notable implementation of this Metric interface is the following Function class:

class forml.evaluation.Function(metric: collections.abc.Callable[[Any, Any], float], reducer: collections.abc.Callable[..., float] = mean)[source]

Bases: Metric

Basic metric implementation wrapping a plain scoring function.


As with any ForML task, the implementer is responsible for engaging a function that is compatible with the particular payload.

metric: collections.abc.Callable[[Any, Any], float]

Actual metric function implementation.

reducer: collections.abc.Callable[..., float] = mean

Callable reducing the individual metric partition values into a single final value. It must accept as many positional arguments as there are outcome partitions. The default reducer is statistics.mean().


>>> LOG_LOSS = evaluation.Function(sklearn.metrics.log_loss)
>>> ACCURACY = evaluation.Function(
...     lambda t, p: sklearn.metrics.accuracy_score(t, numpy.round(p))
... )

Development Train-Test Evaluation

Continuous evaluation provides essential feedback during the iterative development process indicating a relative change in the solution quality induced by the particular change in its implementation (code).

This type of evaluation is also referred to as backtesting since it involves training and testing the solution on historical data with known outcomes. In other words, the true outcomes are already known when producing the evaluation prediction outcomes.

There are a number of methods for correctly using the historical data within the evaluated solution to essentially make predictions about the past. To generalize this concept for the sake of workflow assembly, ForML uses the following abstraction:

class forml.evaluation.Method[source]

Interface for extending the pipeline DAG with the logic for producing true and predicted outcome columns from historical data.


The method only produces the true/predicted outcome pairs - not an evaluation result. The outcomes are expected to be passed to some evaluation.Metric implementation for the actual scoring.

Implementations of this interface can deliver different evaluation techniques using strategies like holdout or cross-validation etc.

abstract produce(pipeline: flow.Composable, features: flow.Publishable, labels: flow.Publishable) → Iterable[evaluation.Outcome][source]

Compose the DAG producing the true/predicted outcomes according to the given method.

pipeline: flow.Composable

Evaluation subject - the solution pipeline to be backtested.

features: flow.Publishable

Source port producing the historical features.

labels: flow.Publishable

Source port producing the historical outcomes matching the features.


Returns: A sequence of true/predicted outcome port pairs.


The following are the available evaluation.Method implementations:

class forml.evaluation.CrossVal(*, crossvalidator: payload.CrossValidable, splitter: type[payload.CVFoldable] = payload.PandasCVFolds)[source]
class forml.evaluation.CrossVal(*, splitter: flow.Builder[payload.CVFoldable], nsplits: int)

Bases: Method

Evaluation method based on a number of independent train-test trials using different parts of the same training dataset.

The training dataset gets split into multiple (possibly overlapping) train-test pairs (folds) used to train a vanilla instance of the pipeline and to pass down predictions along with true outcomes independently for each fold.

crossvalidator: payload.CrossValidable

Implementation of the split-selection logic.

splitter: type[payload.CVFoldable] = payload.PandasCVFolds
splitter: flow.Builder[payload.CVFoldable]

Depending on the constructor version:

  1. Folding actor type that is expected to take the cross-validator as its parameter. Defaults to payload.PandasCVFolds.

  2. Actor builder instance defining the folding splitter.

nsplits: int

The number of splits the splitter is going to generate (needs to be explicit as there is no generic way to infer it from the Builder).


>>> CROSSVAL = evaluation.CrossVal(
...     crossvalidator=sklearn.model_selection.StratifiedKFold(
...         n_splits=3, shuffle=True, random_state=42
...     )
... )
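The fold-splitting idea behind CrossVal can be illustrated with a self-contained sketch (a hypothetical helper, not the ForML splitter implementation):

```python
def kfold_splits(nrows, nsplits):
    """Yield (train_indices, test_indices) pairs - one per fold."""
    indices = list(range(nrows))
    size, extra = divmod(nrows, nsplits)
    start = 0
    for k in range(nsplits):
        stop = start + size + (1 if k < extra else 0)
        test = indices[start:stop]                # distinct test slice per fold
        train = indices[:start] + indices[stop:]  # everything else trains
        yield train, test
        start = stop

for train, test in kfold_splits(7, 3):
    print(train, test)
```

Each row serves as test data in exactly one fold, so the pipeline gets trained and scored three times on different partitions of the same dataset.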
class forml.evaluation.HoldOut(*, test_size: float | int | None = None, train_size: float | int | None = None, random_state: int | None = None, stratify: bool = False, splitter: type[payload.CVFoldable] = payload.PandasCVFolds)[source]
class forml.evaluation.HoldOut(*, crossvalidator: payload.CrossValidable, splitter: type[payload.CVFoldable] = payload.PandasCVFolds)
class forml.evaluation.HoldOut(*, splitter: flow.Builder[payload.CVFoldable])

Bases: CrossVal

Evaluation method based on part of a training dataset being withheld for testing the predictions.

The historical dataset available for evaluation is first split into two parts, one is used for training the pipeline, and the second for making actual predictions which are then exposed together with the true outcomes for eventual scoring.


This is implemented on top of the evaluation.CrossVal method simply by forcing the number of folds to 1.

test_size: float | int | None = None

Absolute (if int) or relative (if float) size of the test split (defaults to train_size complement or 0.1).

train_size: float | int | None = None

Absolute (if int) or relative (if float) size of the train split (defaults to test_size complement).

random_state: int | None = None

Controls the randomness of the training and testing indices produced.

stratify: bool = False

Use StratifiedShuffleSplit if True otherwise use ShuffleSplit.

crossvalidator: payload.CrossValidable

Implementation of the split-selection logic.

splitter: type[payload.CVFoldable] = payload.PandasCVFolds
splitter: flow.Builder[payload.CVFoldable]

Depending on the constructor version:

  1. The folding actor type that is expected to take the cross-validator as its parameter. Defaults to payload.PandasCVFolds.

  2. Actor builder instance defining the train-test splitter.


>>> HOLDOUT = evaluation.HoldOut(test_size=0.2, stratify=True, random_state=42)

Production Performance Tracking

After transitioning to the production life cycle, it becomes an operational necessity to monitor the predictive performance of the deployed solution ensuring it maintains its expected quality.

Every model naturally tends to drift over time - a gradual or sharp divergence between its learned generalization and the observed phenomena. This can have a number of different causes, but the key measure is to detect the drift and keep it under control by refreshing or re-implementing the model.

Continuous monitoring of the evaluation metric is the best way to spot these anomalies. This process can also be referred to as the serving evaluation since its goal is to measure objective success while making actual production decisions.
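As a conceptual sketch (not part of ForML), drift detection over a stream of periodic metric scores can be as simple as comparing a rolling mean against an initial baseline - the helper name, window size, and tolerance here are all illustrative:

```python
from collections import deque

def detect_drift(scores, window=3, tolerance=0.1):
    """Return the index where the rolling mean first degrades beyond tolerance.

    Assumes a higher-is-better metric; returns None if no drift is detected.
    """
    baseline = None
    recent = deque(maxlen=window)
    for i, score in enumerate(scores):
        recent.append(score)
        if len(recent) < window:
            continue
        mean = sum(recent) / window
        if baseline is None:
            baseline = mean          # first full window sets the baseline
        elif baseline - mean > tolerance:
            return i                 # quality dropped too far below baseline
    return None

print(detect_drift([0.90, 0.91, 0.89, 0.90, 0.60, 0.50]))  # 4
```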

Process-wise, the performance tracking differs from the development evaluation use case in two key aspects:

  1. It does not involve any training - the point is to evaluate predictions made by an existing model generation running in production. The concept of the different methods known from the development evaluation does not apply - the metric function directly scores the genuinely served predictions against the eventual true outcomes.

  2. The predictions that are to be evaluated are in principle made before the true outcomes are known (real future predictions). This entails a dependency on an external reconciliation path, a.k.a. a feedback loop, within the particular business application delivering the eventual true outcomes, which ForML simply plugs into using its feed system. The key attribute of this feedback loop is its latency, which determines the turnaround time for the performance measurement (ranging from seconds to possibly months or more depending on the application).
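The feedback-loop reconciliation can be pictured as a simple join of served predictions with the later-arriving true outcomes keyed by some request identifier (a purely illustrative sketch - in practice ForML sources the true outcomes via its feed system):

```python
def reconcile(predictions, outcomes):
    """Join served predictions with true outcomes sharing the same request id.

    predictions: iterable of (request_id, predicted_value)
    outcomes: iterable of (request_id, true_value) arriving with some latency
    """
    truth = dict(outcomes)
    # only predictions whose true outcome has already arrived can be scored
    return [(truth[rid], pred) for rid, pred in predictions if rid in truth]

served = [("a", 0.9), ("b", 0.2), ("c", 0.7)]
arrived = [("a", 1), ("c", 1)]  # the outcome for "b" is not yet known
print(reconcile(served, arrived))  # [(1, 0.9), (1, 0.7)]
```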

ForML allows reporting the serving evaluation metric based on the project configuration by performing the relevant life cycle action.