Input & Output

ForML takes a distinctive approach to data access. For projects to be portable, they must not be coupled directly with any specific data storage or format - which, for a data-processing framework, might sound a bit self-contradictory.

In the ForML architecture, all runtime dependencies, including the pipeline I/O, are handled by the platform, while projects themselves remain independent in this regard, referring only to the intermediary catalogized schemas described below.

The two main subsystems a platform uses to handle the pipeline I/O are described further in standalone sections:

  • Source Feed - resolving the project-defined ETL and supplying the requested data

  • Output Sink - consuming the pipeline output

Catalogized Schemas

To achieve data access independence, ForML introduces the concept of catalogized schemas. Instead of operating directly on specific data source instances, the Data Source DSL used to define the input data ETL refers only to abstract data schemas. It is then the responsibility of the platform to resolve the requested schemas (and the whole ETL queries specified on top of them), mapping them to the actual data sources hosted in the particular runtime environment.

This approach can also be seen as data source virtualization: ForML projects work with datasets regardless of their particular physical format or storage technology.

[Figure: schema mapping - catalogized schemas resolved to platform-specific data sources]

A schema catalog is a logical group of schemas that both projects and platforms can use as a mutual data proxy. It is not a service or a system - rather a namespaced descriptor implemented simply as a python module (or modules) that must be available both to the project expecting the particular data and to the platform supposed to serve that project. When a project pipeline is submitted to a platform, the platform attempts to resolve the schemas referred to in the project DSL using its configured schema-datasource mappings; only when all of these schema dependencies can be satisfied with the available data sources is the platform able to launch that pipeline.
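
For illustration, a minimal catalog could be just a plain python module defining the schemas through the Data Source DSL. The following sketch assumes the struct.Schema and struct.Field API described in the Data Source DSL chapter; the catalog package name and the particular schema fields are hypothetical:

    # hypothetical catalog module, e.g. distributed as the `foobar.schemas` package
    from forml.io.dsl import struct
    from forml.io.dsl.struct import kind

    class Titanic(struct.Schema):
        """Schema of the (hypothetical) Titanic passenger dataset."""
        PassengerId = struct.Field(kind.Integer())
        Survived = struct.Field(kind.Integer())
        Pclass = struct.Field(kind.Integer())
        Sex = struct.Field(kind.String())
        Age = struct.Field(kind.Float())

Both the project (in its DSL queries) and the platform (in its feed mappings) would then refer to this schema rather than to any physical dataset.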

An obvious aspect of schema catalogs is their decentralization. Currently, there is no naming convention for the schema definition namespaces. Ideally, schemas should be published and maintained in the namespaces of the original data producers. For private first-party datasets (i.e. internal company data) this is easy - the owner (motivated to use ForML) simply maintains a (private) package with the schemas of their data sources. For public datasets (whose authors don't endorse ForML yet), this is left to community-maintained schema catalogs which are yet to be established.

See the Data Source DSL for a schema implementation guide.

Source Descriptor

ForML projects specify their input data requirements (mainly the ETL DSL query, optionally composed with other transforming operators) in the form of a source descriptor (supplied within the project structure using the source.py component).

This descriptor is created using the forml.project.component.Source.query() class method:

classmethod Source.query(features, *labels, apply=None, ordinal=None)

Create a new source descriptor with the given parameters. All parameters are DSL objects - either queries or columns.

Parameters
  • features (forml.io.dsl.struct.frame.Queryable) – Query defining the train-mode features (also used for the apply mode unless apply is provided).

  • labels (forml.io.dsl.struct.series.Column) – Sequence of training label columns.

  • apply (Optional[forml.io.dsl.struct.frame.Queryable]) – Optional query defining the apply-mode features (if different from the train ones). If provided, it must result in the same schema as the main query provided via features.

  • ordinal (Optional[forml.io.dsl.struct.series.Operable]) – Optional specification of an ordinal column.

Returns

Source descriptor instance.

Return type

forml.project.component.Source
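
For illustration, a project's source.py component built with this method could look roughly like the following sketch. It reuses the hypothetical Titanic schema from the catalog example above; the component.setup() registration call and the particular query are assumptions following the project component conventions, not a definitive recipe:

    # hypothetical source.py project component
    from forml.project import component
    from foobar.schemas import Titanic  # catalogized schema - resolved by the platform at runtime

    # train-mode features and the label expressed purely against the abstract schema
    FEATURES = Titanic.select(Titanic.Pclass, Titanic.Sex, Titanic.Age)
    SOURCE = component.Source.query(FEATURES, Titanic.Survived)

    # expose the descriptor as the project source component
    component.setup(SOURCE)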