Titanic Exploration using ForML

The ForML framework allows implementing an ML solution using formalized project components and their structured packaging. This is great for automated project life cycle management but less suitable for the interactive exploration typical of Jupyter. This notebook demonstrates the ForML-Jupyter interoperability designed specifically for this purpose.

Let’s assume our project now has the following structure, with this notebook under its notebooks folder:

$ tree forml-tutorial-titanic
forml-tutorial-titanic
├── notebooks
│   └── exploration.ipynb  # this notebook!
├── pyproject.toml
├── tests
│   └── __init__.py
└── titanic
    ├── __init__.py
    ├── evaluation.py
    ├── pipeline.py
    └── source.py

Obtaining the Project Handle

Given the existing project structure, we can now grab the programmatic handle to our project using the project.open() function:

[1]:
%cd ..
/tmp/forml-tutorial-titanic
[2]:
import io
import sys

import numpy as np
import pandas as pd

from forml import project
from forml.pipeline import wrap

PROJECT = project.open(package='titanic')
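
The returned handle exposes the individual formalized components of the project for interactive use. As a quick illustration only (a sketch; the attribute names follow the standard ForML project layout, output omitted):

PROJECT.components.source      # data source descriptor (titanic/source.py)
PROJECT.components.pipeline    # pipeline expression (titanic/pipeline.py)
PROJECT.components.evaluation  # evaluation descriptor (titanic/evaluation.py)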

Exploring the Source Data

We start by reusing the source component defined in our project and running it through a custom stateless transformer operator that returns the pandas.DataFrame.info() summary of the particular dataset. We interactively execute the .apply() action using the .launcher property of the bound pipeline instance, retrieving its result:

[3]:
SOURCE = PROJECT.components.source


@wrap.Operator.mapper
@wrap.Actor.apply
def info(df: pd.DataFrame) -> str:
    """Custom operator returning the DataFrame.info()"""
    with io.StringIO() as buf:
        df.info(buf=buf)
        return buf.getvalue()


print(SOURCE.bind(info()).launcher.apply())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Pclass    418 non-null    int64
 1   Name      418 non-null    object
 2   Sex       418 non-null    object
 3   Age       332 non-null    float64
 4   SibSp     418 non-null    int64
 5   Parch     418 non-null    int64
 6   Fare      417 non-null    float64
 7   Embarked  418 non-null    object
dtypes: float64(2), int64(3), object(3)
memory usage: 26.2+ KB
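
The same pattern works for any ad-hoc inspection. For instance, here is a sketch of an analogous stateless operator returning the DataFrame.describe() summary statistics (a hypothetical helper mirroring the info() example above; output omitted):

@wrap.Operator.mapper
@wrap.Actor.apply
def describe(df: pd.DataFrame) -> str:
    """Custom operator returning the DataFrame.describe() statistics."""
    return df.describe().to_string()


print(SOURCE.bind(describe()).launcher.apply())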

Exploring the labeled trainset is slightly more involved, as the train-mode by design normally does not produce any output. The interactive launcher, however, captures both the features and the outcomes segment outputs, allowing access to their values:

[4]:
print(SOURCE.bind(info()).launcher.train().features)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Pclass    891 non-null    int64
 1   Name      891 non-null    object
 2   Sex       891 non-null    object
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64
 5   Parch     891 non-null    int64
 6   Fare      891 non-null    float64
 7   Embarked  889 non-null    object
dtypes: float64(2), int64(3), object(3)
memory usage: 55.8+ KB
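
The captured outcomes segment from the same train-mode execution can be inspected analogously. Assuming the trained result exposes the outcomes under a labels attribute alongside features (an assumption for illustration; output omitted):

# Hypothetical: inspect the captured training outcomes, e.g. the class balance
print(SOURCE.bind(info()).launcher.train().labels.value_counts())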

Imputing Missing Values

Given the insights above, let’s implement a stateful operator imputing the missing values in the Age, Fare, and Embarked columns:

[5]:
@wrap.Actor.train
def Impute(state, X, y, random_state=None) -> dict[str, float]:
    """Train part of a stateful transformer for missing values imputation."""
    return {'age_mean': X['Age'].mean(), 'age_std': X['Age'].std()}


@wrap.Operator.mapper
@Impute.apply
def Impute(state: dict[str, float], X, random_state=None) -> pd.DataFrame:
    """Apply part of a stateful transformer for missing values imputation."""
    na_slice = X['Age'].isna()
    if na_slice.any():
        rand_age = np.random.default_rng(random_state).integers(
            state['age_mean'] - state['age_std'], state['age_mean'] + state['age_std'], size=na_slice.sum()
        )
        X.loc[na_slice, 'Age'] = rand_age  # random age with same distribution
    X.loc[:, 'Embarked'] = X['Embarked'].fillna('S')  # assuming Southampton
    X.loc[:, 'Fare'] = X['Fare'].fillna(X['Fare'].mean())  # mean fare
    return X.drop(columns='Name')


PIPELINE = Impute(random_state=42)

EXPERIMENT1 = SOURCE.bind(PIPELINE)
EXPERIMENT1.launcher.train()
EXPERIMENT1.launcher.apply()
[5]:
     Pclass     Sex   Age  SibSp  Parch      Fare Embarked
0         3    male  34.5      0      0    7.8292        Q
1         3  female  47.0      1      0    7.0000        S
2         2    male  62.0      0      0    9.6875        Q
3         3    male  27.0      0      0    8.6625        S
4         3  female  22.0      1      1   12.2875        S
..      ...     ...   ...    ...    ...       ...      ...
413       3    male  39.0      0      0    8.0500        S
414       1  female  39.0      0      0  108.9000        C
415       3    male  38.5      0      0    7.2500        S
416       3    male  33.0      0      0    8.0500        S
417       3    male  35.0      1      1   22.3583        C

[418 rows x 7 columns]
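
As a quick sanity check (a sketch reusing the ad-hoc info() operator from above; output omitted), we can compose it on top of the imputing pipeline and confirm that no null values remain in the Age, Fare, and Embarked columns:

CHECK = SOURCE.bind(PIPELINE >> info())
CHECK.launcher.train()         # fit the Impute state first
print(CHECK.launcher.apply())  # then inspect the imputed testset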

Baseline Workflow

Let’s now create a simple baseline model workflow as a reference for further improvements in the scope of the follow-up implementation. We import the OneHotEncoder and RandomForestClassifier from Scikit-learn under the wrap.importer context to auto-turn them into ForML operators and attach them to the initial pipeline.

To explore the runtime task graph, we execute it using the graphviz runner (accessed here via the visual launcher):

[6]:
with wrap.importer():
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OneHotEncoder

PIPELINE >>= OneHotEncoder(handle_unknown='infrequent_if_exist', sparse_output=False) >> RandomForestClassifier(
    random_state=42
)

SOURCE.bind(PIPELINE).launcher['visual'].train()
[6]:
[Image: rendered task graph (tutorials_titanic_exploration_10_0.svg)]

Evaluation

Finally, let’s run our simple pipeline through the project-defined evaluation method to get the baseline accuracy score:

[7]:
SOURCE.bind(PIPELINE, evaluation=PROJECT.components.evaluation).launcher.eval()
[7]:
0.7821229050279329

That’s it for our brief exploration. Based on these initial results, we will continue implementing the actual solution as a native ForML project and see how much we can improve on this baseline.