Project Organization

A project built on ForML is essentially a source-code collection consisting of a set of defined components organized as a Python package. Its ultimate purpose is to enable effective development that leads to delivering (i.e. releasing a version of) a solution in the form of a deployable artifact.

During development, ForML can execute the working copy of the project source code either by triggering its development life cycle actions or by visiting it in the interactive mode.

Attention

Although not in the scope of this documentation, all the general source-code management best practices (version control, continuous integration/delivery, etc.) are applicable to ForML projects and should be integrated into the development process.

To discover the structure of some real ForML projects, it is worth exploring the available tutorials.

Starting a New Project

A ForML project can be initialized either manually, by implementing the component structure from scratch, or simply via the init subcommand of the forml command-line interface:

$ forml project init myproject

Component Structure

ForML projects are organized as usual Python projects accompanied by a PEP 621 compliant pyproject.toml. They are structured in a way that allows ForML to identify their principal components and to operate their life cycle.

The framework adopts the Convention over Configuration approach for organizing the internal project structure so that the relevant components are discovered automatically. It is still possible to ignore the convention and organize the project in an arbitrary way, but the author is then responsible for explicitly configuring all the otherwise automatic steps.

A typical project structure matching the ForML convention might look like the following tree:

<project_name>
  ├── pyproject.toml
  ├── <optional_project_namespace_package>
  │     └── <project_root_package>
  │          ├── __init__.py
  │          ├── pipeline  # principal component as a package
  │          │    ├── __init__.py
  │          │    └── <moduleX>.py  # arbitrary module not part of the convention
  │          ├── source.py  # principal component as a module
  │          ├── evaluation.py
  │          ├── <moduleY>.py  # another module not part of the convention
  │          └── tuning.py
  ├── tests
  │    ├── __init__.py
  │    ├── test_<pipeline>.py  # actual name not part of the convention
  │    └── ...
  ├── README.md  # not part of the convention
  ├── notebooks  # not part of the convention
  │    └── ...
  └── ...

Clearly, the overall structure is nothing special: it is a fairly usual Python project layout (plus some additional content). What makes it a ForML project are the particular modules and/or packages within that structure and the specific metadata provided in pyproject.toml. The following sections focus on each of these components.
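To make the convention more concrete, the following is a minimal sketch of how the conventional component files could be located on disk. The helper locate_components and the COMPONENTS tuple are hypothetical names introduced only for this illustration; they are not part of ForML, whose own discovery logic may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Component names taken from the sections of this chapter (assumption:
# only these three are looked up in this illustration).
COMPONENTS = ("pipeline", "source", "evaluation")


def locate_components(package_dir: Path) -> dict[str, Optional[Path]]:
    """Map each conventional component name to its file, or None if absent."""
    found: dict[str, Optional[Path]] = {}
    for name in COMPONENTS:
        module = package_dir / f"{name}.py"  # component as a module
        package = package_dir / name / "__init__.py"  # component as a package
        if module.is_file():
            found[name] = module
        elif package.is_file():
            found[name] = package
        else:
            found[name] = None
    return found


# Build a throwaway project package with two of the three components.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "titanic"
    (root / "pipeline").mkdir(parents=True)
    (root / "__init__.py").touch()
    (root / "pipeline" / "__init__.py").touch()
    (root / "source.py").touch()
    layout = {
        name: path.relative_to(root).as_posix() if path else None
        for name, path in locate_components(root).items()
    }

print(layout)
```

In this throwaway layout the pipeline is found as a package, the source as a module, and the evaluation component is reported as missing.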

Project Descriptor

This is a standard pyproject.toml metadata descriptor with a specific ForML tool section helping to integrate the ForML principal component structure. It’s placed directly in the project root directory.

The minimal content looks as follows:

[project]
name = "forml-tutorial-titanic"
version = "0.1.dev1"
dependencies = [
    "openschema",
    "scikit-learn",
    "pandas",
    "numpy",
]

[tool.forml]
package = "titanic"

The [project] section can contain any additional metadata supported by the PEP 621 specification.

Note

Upon publishing (in the scope of the development life cycle), the specified [project.version] value will become the release identifier and thus needs to be a valid PEP 440 version.

The project should carefully specify all of its dependencies using the [project.dependencies] list as these will be included in the released .4ml package artifact.
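As a rough illustration of the version requirement above, the sketch below tests a version string against the canonical-form pattern published in the appendix of PEP 440. This check is stricter than full PEP 440 parsing (it rejects spellings that are valid only after normalization), and the helper name is_canonical is just a local choice, not a ForML function.

```python
import re

# Canonical-form pattern from the PEP 440 appendix. Versions that are valid
# only after normalization (e.g. "1.0-beta") are intentionally rejected.
CANONICAL = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?"
    r"(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical(version: str) -> bool:
    """Return True if the version is in canonical PEP 440 form."""
    return CANONICAL.match(version) is not None


checks = {v: is_canonical(v) for v in ("0.1.dev1", "1.0.0rc1", "1.0-beta", "latest")}
print(checks)
```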

The custom [tool.forml] section supports the following options:

  • the package string referring to the python package containing the principal components

  • the optional components map, which overrides the conventional modules representing the individual principal components with submodule paths relative to the package:

    [tool.forml.components]
    evaluation = "relative.path.to.my.custom.evaluation.module"
    pipeline = "relative.path.to.my.custom.pipeline.module"
    source = "relative.path.to.my.custom.source.module"
    

Principal Components

These are the actual high-level blocks of the particular ForML solution provided as Python modules (or packages) within the project package root.

Hint

ForML does not care whether a principal component is defined as a module (a file with the .py suffix) or a package (a subdirectory containing an __init__.py file), since both have the same import syntax.
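The equivalence mentioned in the hint is a general Python property and can be checked with the standard library alone. The sketch below creates one component as a module and one as a package in a temporary directory and imports both the same way; the names demo_source and demo_pipeline are made up for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A component provided as a plain module.
    (root / "demo_source.py").write_text("KIND = 'module'\n")
    # A component provided as a package.
    (root / "demo_pipeline").mkdir()
    (root / "demo_pipeline" / "__init__.py").write_text("KIND = 'package'\n")

    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        # The import call is identical for both flavours.
        source = importlib.import_module("demo_source")
        pipeline = importlib.import_module("demo_pipeline")
    finally:
        sys.path.remove(str(root))

print(source.KIND, pipeline.KIND)
```

The only visible difference after import is that the package carries a __path__ attribute, which the importing code does not need to care about.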

To load each of the principal components, ForML relies on the project.setup() function as the expected component registration interface:

forml.project.setup(source: project.Source) -> None
forml.project.setup(pipeline: flow.Composable, schema: dsl.Source.Schema | None = None) -> None
forml.project.setup(evaluation: project.Evaluation) -> None

Interface for registering principal component instances.

This function is expected to be called exactly once from within every component module passing the component instance.

The actual implementation of this function is provided only when it is imported within the component loader context (outside that context, it is effectively a no-op).

Parameters:
source: project.Source

Source descriptor.

pipeline: flow.Composable

Workflow expression.

schema: dsl.Source.Schema | None = None

Optional schema of the pipeline output.

evaluation: project.Evaluation

Evaluation descriptor.

Pipeline Expression

The pipeline definition is the heart of the entire solution. It is provided in the form of a workflow expression.

ForML expects this component to be provided as a pipeline.py module or pipeline package under the project package root.

pipeline.py or pipeline/__init__.py
 from forml import project
 from . import preprocessing, model  # project-specific implementations

 PIPELINE = preprocessing.Imputer() >> model.Classifier(random_state=42)
 project.setup(PIPELINE)

Dataset Definition

The source component provides the project with a definite yet portable dataset description. It is specified using project.Source.query as a DSL expression against a particular schema catalog.

source.py or source/__init__.py
 from forml import project
 from forml.pipeline import payload
 from openschema import kaggle as schema

 FEATURES = schema.Titanic.select(
     schema.Titanic.Pclass,
     schema.Titanic.Name,
     schema.Titanic.Sex,
     schema.Titanic.Age,
     schema.Titanic.SibSp,
     schema.Titanic.Parch,
     schema.Titanic.Fare,
     schema.Titanic.Embarked,
 ).orderby(schema.Titanic.PassengerId)

 SOURCE = (
     project.Source.query(FEATURES, schema.Titanic.Survived)
     >> payload.ToPandas(columns=[f.name for f in FEATURES.schema])
 )
 project.setup(SOURCE)

Evaluation Strategy

This component defines the model evaluation strategy for both the development and production life cycles. It is provided as a project.Evaluation descriptor.

evaluation.py or evaluation/__init__.py
 from sklearn import metrics
 from forml import evaluation, project

 EVALUATION = project.Evaluation(
     evaluation.Function(metrics.log_loss),
     evaluation.HoldOut(test_size=0.2, stratify=True, random_state=42),
 )
 project.setup(EVALUATION)

Tests

ForML has a rich operator unit testing facility that can be integrated into the usual tests/ project structure. This topic is extensively covered in the separate Unit Testing chapter.