Project

Starting New Project

A ForML project can be created either manually from scratch by defining the component structure, or simply by using the init subcommand of the forml CLI:

$ forml init myproject

Component Structure

A ForML project is defined as a set of specific components wrapped into a Python package with the usual Setuptools layout. The framework follows the Convention over Configuration approach for organizing the internal package structure: it automatically discovers the relevant project components as long as the author follows the convention. (It is still possible to ignore the convention, but the author is then responsible for configuring all the otherwise automatic steps.)

The convention is based on implementing specific Python modules (or packages) within the project namespace root. ForML doesn’t care whether a component is defined as a module (a file with the .py suffix) or a package (a subdirectory containing an __init__.py file), since both are imported the same way.
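To illustrate why the module-versus-package choice is invisible to the importer, here is a minimal standard-library sketch. The `demoproject` namespace and the `KIND` attribute are hypothetical names used only for this demonstration; nothing here touches ForML itself.

```python
import importlib
import pathlib
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    base = pathlib.Path(root) / "demoproject"  # hypothetical project namespace
    (base / "pipeline").mkdir(parents=True)
    (base / "__init__.py").write_text("")
    # 'pipeline' component implemented as a package (directory with __init__.py)
    (base / "pipeline" / "__init__.py").write_text("KIND = 'package'\n")
    # 'source' component implemented as a plain module (single .py file)
    (base / "source.py").write_text("KIND = 'module'\n")

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        # The import statement is identical in both cases.
        pipeline = importlib.import_module("demoproject.pipeline")
        source = importlib.import_module("demoproject.source")
    finally:
        sys.path.remove(root)

print(pipeline.KIND, source.KIND)  # package module
```

Either layout can therefore be swapped for the other later on without changing any code that imports the component.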

The naming conventions for the different project components are described in the following subsections. The general project component structure wrapped within the Python application layout might look similar to this:

<project_name>
  ├── setup.py
  ├── <optional_project_namespace>
  │     └── <project_name>
  │          ├── __init__.py
  │          ├── pipeline  # here the component is a package
  │          │    ├── __init__.py
  │          │    ├── <moduleX>.py  # arbitrary user defined module
  │          │    └── <moduleY>.py
  │          ├── source.py
  │          └── evaluation.py  # here the component is just a module
  ├── tests
  │    ├── __init__.py
  │    ├── test_pipeline.py
  │    └── ...
  ├── README.md
  └── ...

The individual project components defined in the specific modules described below need to be hooked up into the ForML framework using component.setup(), as shown in the examples below.

forml.project.component.setup(instance)

Dummy component setup representing the API signature of the fake module injected by load.Component.setup.

Parameters

instance (Any) – Component instance to be registered.

Return type

None

Setup.py

This is the standard Setuptools module with a few extra features added to allow customizing the project structure and integrating the Research lifecycle as described in the Lifecycle sections (i.e. the eval or upload commands).

To hook in this extra functionality, the setup.py just needs to import forml.project.setuptools instead of the original setuptools. The rest is the usual setup.py content:

from forml.project import setuptools

setuptools.setup(name='forml-example-titanic',
                 version='0.1.dev0',
                 packages=setuptools.find_packages(include=['titanic*']),
                 install_requires=['scikit-learn', 'pandas', 'numpy', 'category_encoders==2.0.0'])

Note

The specified version value becomes the lineage identifier upon uploading (as part of the Research lifecycle) and therefore needs to be a valid PEP 440 version.

The project should carefully specify all of its dependencies using the install_requires parameter as these will be included in the released .4ml package.
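As a quick sanity check for the version requirement, the following sketch tests a string against the canonical-form pattern published in the appendix of PEP 440. Note the assumptions: the helper name `is_canonical` is chosen here for illustration, and the canonical form is stricter than "valid" (PEP 440 also accepts spellings that normalize to a canonical version). Whether ForML performs its own normalization is not stated in this document; the third-party `packaging` library offers a more complete parser.

```python
import re

# Canonical-form regular expression from the appendix of PEP 440.
CANONICAL = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?"
    r"(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical(version: str) -> bool:
    """Return True if the string is a PEP 440 version in canonical form."""
    return CANONICAL.match(version) is not None


print(is_canonical("0.1.dev0"))      # True
print(is_canonical("0.1-SNAPSHOT"))  # False
```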

The only addition on top of the original setuptools functionality is the ability to customize the conventional project component layout. If for some reason the user wants to deviate from this convention, the custom locations of the project components can be specified using the component parameter as follows:

setuptools.setup(...,
                 component={'pipeline': 'path.to.my.custom.pipeline.module'})
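The idea of "convention unless overridden" can be pictured with a small sketch. This is not ForML's actual lookup code: the helper `resolve_components` and the `CONVENTIONAL` tuple are hypothetical, and the sketch only assumes what the text above states, namely that components conventionally live directly under the project package and that the component mapping replaces individual locations.

```python
# Component names taken from the conventional layout described in this document.
CONVENTIONAL = ("pipeline", "source", "evaluation")


def resolve_components(package, overrides=None):
    """Map each component name to the dotted module path it would be loaded from.

    Hypothetical helper: an entry in ``overrides`` wins, otherwise the
    conventional location ``<package>.<component>`` is used.
    """
    overrides = overrides or {}
    return {name: overrides.get(name, f"{package}.{name}") for name in CONVENTIONAL}


paths = resolve_components(
    "titanic", {"pipeline": "path.to.my.custom.pipeline.module"}
)
for name, path in paths.items():
    print(f"{name}: {path}")
```

Only the overridden component changes its location; the remaining ones still resolve by convention.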

Pipeline (pipeline.py)

The pipeline definition is the heart of the project component structure. The framework needs to understand the pipeline as a Directed Acyclic Task Dependency Graph. For this purpose, it comes with the concept of Operators, which the user supplies with actual functionality (e.g. a feature transformer or a classifier) and composes together to define the final flow.

The pipeline is specified in terms of the workflow expression API, which is described in detail in the Workflow sections.
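For readers new to the >> notation, the toy sketch below shows how overloading the right-shift operator can turn an expression into a chain of dependent steps (a linear chain being the simplest kind of DAG). The `Operator` and `Flow` classes are hypothetical stand-ins written for this illustration; they are not ForML classes and say nothing about how ForML builds its graph internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical stand-in for a pipeline operator (not a ForML class)."""

    name: str

    def __rshift__(self, other):
        # Start a flow from this operator and delegate the composition to it.
        return Flow((self,)) >> other


@dataclass(frozen=True)
class Flow:
    """Ordered chain of operators produced by the >> composition."""

    steps: tuple

    def __rshift__(self, other):
        tail = other.steps if isinstance(other, Flow) else (other,)
        return Flow(self.steps + tail)

    def edges(self):
        """Return the dependency edges between consecutive steps."""
        return [(a.name, b.name) for a, b in zip(self.steps, self.steps[1:])]


flow = Operator("NaNImputer") >> Operator("LR")
print(flow.edges())  # [('NaNImputer', 'LR')]
```

Because composition returns a new immutable object each time, sub-expressions can be named and reused without side effects.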

As with the other project components, the final pipeline expression defined in pipeline.py needs to be exposed to the framework via the component.setup() handler:

from forml.project import component
from titanic.pipeline import preprocessing, model

FLOW = preprocessing.NaNImputer() >> model.LR(random_state=42, solver='lbfgs')
component.setup(FLOW)

Evaluation (evaluation.py)

This component defines the model evaluation strategy for both the development and production lifecycles.

Note

The whole evaluation implementation is interim; a more robust concept with a different API is on the roadmap.

The evaluation strategy again needs to be submitted to the framework using the component.setup() handler:

from sklearn import model_selection, metrics
from forml.project import component
from forml.lib.flow.operator.folding import evaluation

EVAL = evaluation.MergingScorer(
    crossvalidator=model_selection.StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
    metric=metrics.log_loss)
component.setup(EVAL)

Source (source.py)

This component is a fundamental part of the IO concept. A project can define the ETL process of sourcing data into the pipeline using the DSL, referring to cataloged schemas that are resolved at runtime via the available feeds.

The source component is provided in the form of a descriptor that’s created using the .query() method, as shown in the example below and documented in the Source Descriptor Reference.

Note

The descriptor can be further composed with other operators using the usual >> syntax. The source composition domain is separate from the main pipeline, so adding an operator to the source composition rather than the pipeline composition might have a different effect.

The Source descriptor again needs to be submitted to the framework using the component.setup() handler:

from forml.lib.flow.operator import cast
from forml.lib.schema.kaggle import titanic as schema
from forml.project import component

FEATURES = schema.Passenger.select(
    schema.Passenger.Pclass,
    schema.Passenger.Name,
    schema.Passenger.Sex,
    schema.Passenger.Age,
    schema.Passenger.SibSp,
    schema.Passenger.Parch,
    schema.Passenger.Ticket,
    schema.Passenger.Fare,
    schema.Passenger.Cabin,
    schema.Passenger.Embarked,
)

ETL = component.Source.query(FEATURES, schema.Passenger.Survived) >> cast.ndframe(FEATURES.schema)
component.setup(ETL)

Tests

ForML has a rich operator unit testing facility (see the Operator Unit Testing sections) which can be integrated into the usual tests/ project structure.