Data Source DSL

To allow projects to specify their data requirements in a portable way, ForML comes with its custom DSL (domain-specific language) that is interpreted at runtime by the feeds subsystem to deliver the requested data.

The two main features of the DSL grammar are the schema declaration and the query syntax.


A schema is the abstract description of a particular datasource structure in terms of its column attributes (currently a name and a kind). A schema is simply declared by extending the schema base class and defining its fields as class attributes holding Field instances. For example:

# assuming `struct` and `kind` are imported from the ForML DSL package
class Person(struct.Schema):
    """Base schema."""

    surname = struct.Field(kind.String())
    dob = struct.Field(kind.Date(), 'birthday')

class Student(Person):
    """Extended schema."""

    level = struct.Field(kind.Integer())
    score = struct.Field(kind.Float())

Here we defined schemas of two potential datasources: a generic Person with a string field surname and a date field dob (aliased as birthday), plus its extended version Student with two more fields - an integer level and a float score. The schema declaration API is based on the following rules:

  • the default field name is the class attribute name unless explicitly defined as the Field parameter

  • a field must be associated with one of the supported kinds

  • schemas can be extended

  • extended fields can override same-name fields from parents

  • field ordering is based on the in-class definition order, fields from parent classes come before fields of child classes, overriding a field doesn’t change its position
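The declaration rules above can be illustrated with a minimal self-contained sketch. This is plain Python mimicking the described behaviour, not the actual ForML implementation (the real classes live in the ForML DSL package):

```python
class Field:
    """Toy stand-in for struct.Field: a kind plus an optional explicit name."""
    def __init__(self, kind, name=None):
        self.kind = kind
        self.name = name


class Schema:
    """Toy stand-in for struct.Schema implementing the declaration rules."""

    @classmethod
    def fields(cls):
        collected = {}  # a dict preserves insertion order
        for klass in reversed(cls.__mro__):  # parent classes come first
            for attr, value in vars(klass).items():  # in-class definition order
                if isinstance(value, Field):
                    # re-assigning an existing key keeps its original position,
                    # so overriding a field doesn't move it
                    collected[attr] = value
        return [(field.name or attr, field.kind) for attr, field in collected.items()]


class Person(Schema):
    surname = Field('String')
    dob = Field('Date', 'birthday')  # explicit name instead of the attribute name


class Student(Person):
    level = Field('Integer')
    score = Field('Float')


class Graduate(Student):
    level = Field('String')  # override: new kind, same position


print(Student.fields())
# [('surname', 'String'), ('birthday', 'Date'), ('level', 'Integer'), ('score', 'Float')]
print(Graduate.fields())
# [('surname', 'String'), ('birthday', 'Date'), ('level', 'String'), ('score', 'Float')]
```

Note how the explicit name ('birthday') replaces the attribute name, parent fields precede child fields, and the Graduate override changes the kind of level without changing its position.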

Schemas are expected to be published in the form of catalogs that can be imported by both projects and platforms, making them the mapping intermediaries.

In project sources, schemas can be used for specifying actual DSL queries. Any declared schema is a fully queryable object so you can use all the query features as described below.

When referring to a schema field, one can use either an attribute-getter like <Schema>.<field_name> or alternatively (for example when the field name is not a valid Python identifier) an item-getter like <Schema>['<field_name>'].
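The dual access forms can be mimicked with a small hypothetical sketch (plain Python, not the actual ForML classes) using a metaclass that adds the item-getter on top of the ordinary attribute access:

```python
class SchemaMeta(type):
    """Toy metaclass adding item-getter access to schema classes."""
    def __getitem__(cls, name):
        return getattr(cls, name)


class Schema(metaclass=SchemaMeta):
    pass


class Person(Schema):
    surname = 'the-surname-field'  # a real schema would hold a Field instance here


# both access forms resolve to the same field object:
assert Person.surname is Person['surname']
```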


Following is the list of types (aka kinds) that can be used in schema field definitions:

  • Boolean – boolean data type.

  • Integer – integer data type.

  • Float – float data type.

  • Decimal – decimal data type.

  • String – string data type.

  • Date – date data type.

  • Timestamp – timestamp data type.

  • Array(element) – array data type.

  • Map(key, value) – map data type.

  • Struct(**element) – struct data type.


The DSL makes it possible to specify a rich ETL procedure retrieving the data in any required shape or form. This is achieved through the query API available on top of any schema object. An important feature of the query syntax is also its support for column expressions.

An example query might look like this:

ETL = (
    student.join(person, student.surname == person.surname)
    .join(school_ref, student.school == school_ref.sid)
    .select(
        student.surname.alias('student'),
        school_ref['name'],
        function.Cast(student.score, kind.String()),
    )
    .where(student.score < 2)
    .orderby(student.level, student.score)
)

Following is the list of the query API methods:


Query.columns

Get the list of columns supplied by this query.

Returns

A sequence of the supplying columns.

Query.select(*columns)

Specify the output columns to be provided (projection).

Parameters

columns – Sequence of column expressions.

Returns

Query instance.

Return type

Query

Query.join(other, condition=None, kind=None)

Join with another datasource.

Parameters

  • other – Source to join with.

  • condition (Optional[series.Expression]) – Column expression as the join condition.

  • kind – Type of the join operation (INNER, LEFT, RIGHT, FULL, CROSS).

Returns

Query instance.

Return type

Query


Query.groupby(*columns)

Aggregation specifiers.

Parameters

columns – Sequence of column expressions.

Returns

Query instance.

Return type

Query


Query.having(condition)

Add a row-filtering condition that’s applied to the evaluated aggregations.

Repeated calls to .having combine all the conditions (logical AND).

Parameters

condition – Boolean column expression.

Returns

Query instance.

Return type

Query


Query.where(condition)

Add a row-filtering condition that’s evaluated before any aggregations.

Repeated calls to .where combine all the conditions (logical AND).

Parameters

condition – Boolean column expression.

Returns

Query instance.

Return type

Query

Query.limit(count, offset=0)

Restrict the result rows by their maximum count with an optional offset.

Parameters

  • count (int) – Number of rows to return.

  • offset (int) – Skip the given number of rows.

Returns

Query instance.

Return type

Query
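The count/offset semantics can be sketched with plain Python list slicing (an illustration of the behaviour only, not the actual implementation):

```python
def limit(rows, count, offset=0):
    """Semantics of Query.limit sketched on a plain list of rows."""
    # skip `offset` rows, then return at most `count` of the remaining ones
    return rows[offset:offset + count]


rows = ['a', 'b', 'c', 'd', 'e']
print(limit(rows, 2))            # ['a', 'b']
print(limit(rows, 2, offset=3))  # ['d', 'e']
```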


Query.orderby(*columns)

Ordering specifiers.

Parameters

columns (Union[series.Operable, series.Ordering.Direction, str, Tuple[series.Operable, Union[series.Ordering.Direction, str]]]) – Sequence of column expressions and/or column-direction tuples.

Returns

Query instance.

Return type

Query
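How such mixed ordering specifiers might be normalized can be sketched as follows (a hypothetical helper for illustration, not part of the ForML API):

```python
def orderby_spec(*columns):
    """Normalize ordering specifiers: a bare column implies ascending order,
    an explicit direction comes as a (column, direction) tuple."""
    return [item if isinstance(item, tuple) else (item, 'ascending')
            for item in columns]


print(orderby_spec('level', ('score', 'descending')))
# [('level', 'ascending'), ('score', 'descending')]
```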


Any schema field representing a data column can be involved in a column expression. All schema field objects implement a number of native operators that can be used to directly form an expression. Furthermore, separate function modules can be imported to build more complex expressions.

The native operators available directly on the field instances are:




  • comparison operators: ==, !=, <, <=, >, >=

  • logical operators: &, |, ~

  • arithmetic operators: +, -, *, /, %
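The way such operators produce expression objects rather than immediate values can be sketched with a toy column type (an illustration only, unrelated to the real series implementation):

```python
class Operable:
    """Toy column type showing how native operators can build an
    expression tree instead of evaluating immediately."""

    def __init__(self, symbol):
        self.symbol = symbol

    def _binary(self, operator, other):
        # represent the combined expression as a new column-like object
        right = other.symbol if isinstance(other, Operable) else repr(other)
        return Operable(f'({self.symbol} {operator} {right})')

    def __eq__(self, other):
        return self._binary('==', other)

    def __lt__(self, other):
        return self._binary('<', other)

    def __and__(self, other):
        return self._binary('&', other)


score = Operable('student.score')
level = Operable('student.level')
expression = (score < 2) & (level == 3)
print(expression.symbol)  # ((student.score < 2) & (student.level == 3))
```

This is why a query like .where(student.score < 2) works: the comparison doesn't return a boolean but a column expression that the feeds subsystem can later interpret.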



alias(alias)

Use an alias for this column.

Parameters

alias (str) – Aliased column name.

Returns

New column instance with the given alias.

There is also a number of functions available to be used within the query expressions. They are grouped into the following categories:

  • Aggregation functions.

  • Conversion functions.

  • Date and time manipulating functions.

  • Mathematical functions.