Initial Setup

Before diving into the actual implementation, we need to go through a couple of setup procedures (in addition to those common to all tutorials).

Starting a New Project

Run the following shell command to create the initial project structure:

$ forml project init --requirements=openschema,pandas,scikit-learn,numpy --package titanic forml-tutorial-titanic

You should see a directory structure like this:

$ tree forml-tutorial-titanic
forml-tutorial-titanic
├── titanic
│   ├── __init__.py
│   ├── evaluation.py
│   ├── pipeline.py
│   └── source.py
└── setup.py

Source Definition

Let’s edit the source.py component to supply the project data source descriptor with a DSL query against the Titanic schema from the Openschema catalog. Note the essential call to project.setup() at the end, which registers the component within the framework.

titanic/source.py
from openschema import kaggle as schema

from forml import project
from forml.pipeline import payload

# Using the ForML DSL to specify the data source:
FEATURES = schema.Titanic.select(
    schema.Titanic.Pclass,
    schema.Titanic.Name,
    schema.Titanic.Sex,
    schema.Titanic.Age,
    schema.Titanic.SibSp,
    schema.Titanic.Parch,
    schema.Titanic.Fare,
    schema.Titanic.Embarked,
).orderby(schema.Titanic.PassengerId)

# Setting up the source descriptor:
SOURCE = project.Source.query(
    FEATURES, schema.Titanic.Survived
) >> payload.ToPandas(  # pylint: disable=no-value-for-parameter
    columns=[f.name for f in FEATURES.schema]
)

# Registering the descriptor
project.setup(SOURCE)
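For intuition, the DSL query above is roughly equivalent to the following pandas operations: select the feature columns and order the rows by PassengerId. This is an illustration only (the `titanic` DataFrame below is a made-up in-memory stand-in; ForML executes the actual query through its own runtime against the configured feed):

```python
import pandas as pd

# Hypothetical in-memory stand-in for the Titanic dataset (illustration only)
titanic = pd.DataFrame(
    {
        "PassengerId": [3, 1, 2],
        "Pclass": [3, 1, 2],
        "Name": ["C", "A", "B"],
        "Sex": ["male", "female", "female"],
        "Age": [22.0, 38.0, 26.0],
        "SibSp": [0, 1, 0],
        "Parch": [0, 0, 0],
        "Fare": [7.25, 71.28, 7.92],
        "Embarked": ["S", "C", "S"],
        "Survived": [0, 1, 1],
    }
)

COLUMNS = ["Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# SELECT <features> ... ORDER BY PassengerId - the pandas analogy of the DSL query
ordered = titanic.sort_values("PassengerId").reset_index(drop=True)
features = ordered[COLUMNS]  # the FEATURES part of the query
labels = ordered["Survived"]  # the Survived column passed as the label
```

The deterministic ordering by PassengerId matters: it keeps the feature rows and labels aligned and makes the train/evaluation splits reproducible across runs.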

Evaluation Definition

Finally, we fill in the evaluation descriptor within evaluation.py, which involves specifying the evaluation strategy, including the particular metric. The file again ends with a call to project.setup() to register the component within the framework.

titanic/evaluation.py
import numpy
from sklearn import metrics

from forml import evaluation, project

# Setting up the evaluation descriptor needs the following input:
# 1) Evaluation metric for the actual assessment of the prediction error
# 2) Evaluation method for out-of-sample evaluation (backtesting) - hold-out or cross-validation
EVALUATION = project.Evaluation(
    evaluation.Function(
        lambda t, p: metrics.accuracy_score(t, numpy.round(p))  # using accuracy as the metric for our project
    ),
    # alternatively we could simply switch to logloss:
    # evaluation.Function(metrics.log_loss),
    evaluation.HoldOut(test_size=0.2, stratify=True, random_state=42),  # hold-out as the backtesting method
    # alternatively we could switch to the cross-validation method instead of hold-out:
    # evaluation.CrossVal(crossvalidator=model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=42)),
)

# Registering the descriptor
project.setup(EVALUATION)
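To see what the metric lambda does, here is a standalone sketch using plain scikit-learn outside of ForML (the truth labels and predicted probabilities are illustrative values): since the pipeline emits probabilistic predictions, numpy.round first turns them into hard 0/1 labels before the accuracy is computed.

```python
import numpy
from sklearn import metrics

# Illustrative ground-truth labels and raw probabilistic predictions
truth = numpy.array([0, 1, 1, 0, 1])
proba = numpy.array([0.2, 0.8, 0.4, 0.1, 0.9])

# numpy.round converts probabilities to hard 0/1 labels; accuracy then
# measures the fraction of matching predictions
score = metrics.accuracy_score(truth, numpy.round(proba))
print(score)  # 4 of the 5 rounded predictions match the truth -> 0.8
```

The hold-out method configured above reserves 20% of the data for testing, with stratify=True preserving the class ratio in both splits and random_state=42 making the split reproducible.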