Initial Setup
Before diving into the actual implementation, we need to go through a couple of setup procedures
(in addition to those common to all tutorials).
Starting a New Project
Run the following shell command to create the initial project structure:
$ forml project init --requirements=openschema,pandas,scikit-learn,numpy --package titanic forml-tutorial-titanic
You should see a directory structure like this:
$ tree forml-tutorial-titanic
forml-tutorial-titanic
├── pyproject.toml
├── tests
│   └── __init__.py
└── titanic
    ├── __init__.py
    ├── evaluation.py
    ├── pipeline.py
    └── source.py
Source Definition
Let’s edit the source.py component, supplying the project data source descriptor with a DSL query against the particular Titanic schema from the Openschema catalog. Note the essential call to project.setup() at the end, which registers the component within the framework.
titanic/source.py
from openschema import kaggle as schema
from forml import project
from forml.pipeline import payload

# Using the ForML DSL to specify the data source:
FEATURES = schema.Titanic.select(
    schema.Titanic.Pclass,
    schema.Titanic.Name,
    schema.Titanic.Sex,
    schema.Titanic.Age,
    schema.Titanic.SibSp,
    schema.Titanic.Parch,
    schema.Titanic.Fare,
    schema.Titanic.Embarked,
).orderby(schema.Titanic.PassengerId)

# Setting up the source descriptor:
SOURCE = project.Source.query(
    FEATURES, schema.Titanic.Survived
) >> payload.ToPandas(  # pylint: disable=no-value-for-parameter
    columns=[f.name for f in FEATURES.schema]
)

# Registering the descriptor
project.setup(SOURCE)
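For intuition about what the SOURCE descriptor eventually hands to the pipeline, the following rough pandas equivalent of the query above may help. It is purely illustrative and not part of the project; the titanic.csv path and the load_trainset helper are made-up assumptions, not ForML or Openschema API:

import pandas

# Illustrative sketch only: roughly what the DSL query plus payload.ToPandas
# produce, assuming a local titanic.csv with the standard Kaggle Titanic columns.
COLUMNS = ["Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

def load_trainset(path: str = "titanic.csv") -> tuple[pandas.DataFrame, pandas.Series]:
    """Return the feature frame and the Survived labels, ordered by PassengerId."""
    data = pandas.read_csv(path).sort_values("PassengerId")
    return data[COLUMNS], data["Survived"]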
Evaluation Definition
Finally, we fill in the evaluation descriptor within evaluation.py, which involves specifying the evaluation strategy, including the particular metric. The file again ends with a call to project.setup() to register the component within the framework.
titanic/evaluation.py
import numpy
from sklearn import metrics, model_selection
from forml import evaluation, project

# Setting up the evaluation descriptor needs the following input:
# 1) Evaluation metric for the actual assessment of the prediction error
# 2) Evaluation method for out-of-sample evaluation (backtesting) - hold-out or cross-validation
EVALUATION = project.Evaluation(
    evaluation.Function(
        lambda t, p: metrics.accuracy_score(t, numpy.round(p))  # using accuracy as the metric for our project
    ),
    # alternatively we could simply switch to log-loss:
    # evaluation.Function(metrics.log_loss),
    evaluation.HoldOut(test_size=0.2, stratify=True, random_state=42),  # hold-out as the backtesting method
    # alternatively we could switch to the cross-validation method instead of hold-out:
    # evaluation.CrossVal(
    #     crossvalidator=model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    # ),
)

# Registering the descriptor
project.setup(EVALUATION)
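Since the pipeline's predictions come as probabilities (hence the numpy.round in the metric above), the wrapped function first converts them into hard 0/1 labels before scoring. Below is a tiny standalone check of that behaviour, using made-up values purely for illustration:

import numpy
from sklearn import metrics

# Illustrative only: t is the ground truth, p the predicted survival probabilities;
# rounding turns the probabilities into hard 0/1 class labels before scoring.
truth = numpy.array([1, 0, 1, 1])
proba = numpy.array([0.8, 0.3, 0.4, 0.9])
print(metrics.accuracy_score(truth, numpy.round(proba)))  # 3 of 4 correct -> 0.75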