Data Source DSL

To allow projects to specify their data requirements in a portable way, ForML comes with its own DSL (domain-specific language), which the feeds subsystem interprets at runtime to deliver the requested data.

The two main features of the DSL grammar are the schema declaration and the query syntax.

Schema

A schema is an abstract description of a particular datasource structure in terms of its column attributes (currently a name and a kind). A schema is declared by extending the schema base class forml.io.dsl.struct.Schema and defining its fields as class attributes with values represented by forml.io.dsl.struct.Field. For example:

from forml.io.dsl import struct
from forml.io.dsl.struct import kind

class Person(struct.Schema):
    """Base schema."""

    surname = struct.Field(kind.String())
    dob = struct.Field(kind.Date(), 'birthday')  # explicit field name

class Student(Person):
    """Extended schema."""

    level = struct.Field(kind.Integer())
    score = struct.Field(kind.Float())

Here we defined the schemas of two potential datasources: a generic Person with a string field called surname and a date field dob (aliased as birthday), plus its extended version Student with two more fields, an integer level and a float score. The schema declaration API is based on the following rules:

  • the default field name is the class attribute name unless explicitly defined as the Field parameter

  • a field must be associated with one of the supported kinds

  • schemas can be extended

  • extended fields can override same name fields from parents

  • field ordering is based on the in-class definition order; fields from parent classes come before fields of child classes; overriding a field doesn’t change its position
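The naming, overriding and ordering rules above can be illustrated with a small stand-alone sketch. It does not use ForML itself: the Field class and the fields() helper are hypothetical stand-ins, kinds are represented as plain strings, and overriding is assumed to be keyed by the class attribute name.

```python
class Field:
    """Toy stand-in for a schema field (hypothetical, not ForML's struct.Field)."""

    def __init__(self, kind, name=None):
        self.kind = kind
        self.name = name  # explicit field name, or None to use the attribute name


def fields(schema):
    """Collect the fields of a schema class as {attribute: (name, kind)}, parents first."""
    collected = {}
    # Walk the inheritance chain from the most generic base down to the schema itself.
    for base in reversed(schema.__mro__):
        for attr, value in vars(base).items():
            if isinstance(value, Field):
                # Re-assigning an existing key keeps its original position in the dict.
                collected[attr] = (value.name or attr, value.kind)
    return collected


class Person:
    surname = Field('string')
    dob = Field('date', 'birthday')


class Student(Person):
    level = Field('integer')
    score = Field('float')
    surname = Field('text')  # overrides the parent field without moving it


result = fields(Student)
```

In this sketch, list(result) keeps surname in first position even though Student redefines it last, and dob is reported under its explicit name birthday.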

Schemas are expected to be published in the form of catalogs, which can be imported by both projects and platforms, making them the mapping intermediaries.

In project sources, schemas can be used for specifying actual DSL queries. Any declared schema is a fully queryable object so you can use all the query features as described below.

When referring to a schema field, one can use either the attribute-getter form <Schema>.<field_name> or, alternatively (for example if the field name is not a valid Python identifier), the item-getter form <Schema>['<field_name>'].
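The two access forms can be sketched without ForML as follows. The SchemaMeta metaclass, the Field class and the Measurement schema are all hypothetical; the sketch only assumes that item access looks a field up by its field name, which is what makes it usable for names that are not valid identifiers.

```python
class Field:
    """Toy stand-in for a schema field (hypothetical)."""

    def __init__(self, kind, name=None):
        self.kind = kind
        self.name = name


class SchemaMeta(type):
    """Toy metaclass adding item access by field name to schema classes."""

    def __getitem__(cls, name):
        for base in cls.__mro__:
            for attr, value in vars(base).items():
                # A field answers to its explicit name, or to its attribute name by default.
                if isinstance(value, Field) and (value.name or attr) == name:
                    return value
        raise KeyError(name)


class Measurement(metaclass=SchemaMeta):
    sensor = Field('string')
    reading = Field('float', 'reading-value')  # name is not a valid Python identifier


by_attribute = Measurement.sensor
by_item = Measurement['sensor']
awkward = Measurement['reading-value']
```

Here by_attribute and by_item refer to the same object, while the field named reading-value is only reachable by name through the item-getter.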

Kinds

The following types (aka kinds) can be used in schema field definitions:

forml.io.dsl.struct.kind.Boolean()

Boolean data type class.

forml.io.dsl.struct.kind.Integer()

Integer data type class.

forml.io.dsl.struct.kind.Float()

Float data type class.

forml.io.dsl.struct.kind.Decimal()

Decimal data type class.

forml.io.dsl.struct.kind.String()

String data type class.

forml.io.dsl.struct.kind.Date()

Date data type class.

forml.io.dsl.struct.kind.Timestamp()

Timestamp data type class.

forml.io.dsl.struct.kind.Array(element)

Array data type class.

forml.io.dsl.struct.kind.Map(key, value)

Map data type class.

forml.io.dsl.struct.kind.Struct(**element)

Struct data type class.

Query

The DSL allows specifying a rich ETL procedure for retrieving the data in any required shape or form. This is achieved through the query API that’s available on top of any schema object. An important feature of the query syntax is its support for column expressions.

An example query might look like:

ETL = (
    student.join(person, student.surname == person.surname)
    .join(school_ref, student.school == school_ref.sid)
    .select(student.surname.alias('student'), school_ref['name'], function.Cast(student.score, kind.String()))
    .where(student.score < 2)
    .orderby(student.level, student.score)
    .limit(10)
)
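For readers more familiar with SQL, the example query can be read as roughly the statement below. This is only one plausible relational reading, run here against an in-memory SQLite database with made-up tables and rows; how a particular feed actually interprets the DSL query is not specified here.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical tables and sample rows, used only to make the statement runnable.
conn.executescript("""
    CREATE TABLE person (surname TEXT, dob TEXT);
    CREATE TABLE school_ref (sid INTEGER, name TEXT);
    CREATE TABLE student (surname TEXT, level INTEGER, score REAL, school INTEGER);
    INSERT INTO person VALUES ('Smith', '2001-04-12'), ('Jones', '2000-11-03');
    INSERT INTO school_ref VALUES (1, 'Oxbridge');
    INSERT INTO student VALUES ('Smith', 2, 1.5, 1), ('Jones', 1, 3.0, 1);
""")

SQL = """
    SELECT student.surname AS student, school_ref.name, CAST(student.score AS TEXT)
    FROM student
    JOIN person ON student.surname = person.surname
    JOIN school_ref ON student.school = school_ref.sid
    WHERE student.score < 2
    ORDER BY student.level, student.score
    LIMIT 10
"""
rows = conn.execute(SQL).fetchall()
```

With the sample data above, only the Smith row passes the score filter, and its score comes back as a string because of the cast.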

The query API provides the following methods:

Query.columns

Get the list of columns supplied by this query.

Returns

A sequence of supplying columns.

Query.select(*columns)

Specify the output columns to be provided (projection).

Parameters

columns (forml.io.dsl.struct.series.Column) – Sequence of column expressions.

Returns

Query instance.

Return type

forml.io.dsl.struct.frame.Query

Query.join(other, condition=None, kind=None)

Join with another datasource.

Parameters
  • other (forml.io.dsl.struct.frame.Queryable) – Source to join with.

  • condition (Optional[series.Expression]) – Column expression as the join condition.

  • kind (Optional[Union[forml.io.dsl.struct.frame.Join.Kind, str]]) – Type of the join operation (INNER, LEFT, RIGHT, FULL, CROSS).

Returns

Query instance.

Return type

forml.io.dsl.struct.frame.Query

Query.groupby(*columns)

Aggregation specifiers.

Parameters

columns (forml.io.dsl.struct.series.Operable) – Sequence of column expressions.

Returns

Query instance.

Return type

forml.io.dsl.struct.frame.Query

Query.having(condition)

Add a row-filtering condition that’s applied to the evaluated aggregations.

Repeated calls to .having combine all the conditions (logical AND).

Parameters

condition (forml.io.dsl.struct.series.Expression) – Boolean column expression.

Returns

Query instance.

Return type

forml.io.dsl.struct.frame.Query

Query.where(condition)

Add a row-filtering condition that’s evaluated before any aggregations.

Repeated calls to .where combine all the conditions (logical AND).

Parameters

condition (forml.io.dsl.struct.series.Expression) – Boolean column expression.

Returns

Query instance.

Return type

forml.io.dsl.struct.frame.Query
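The way repeated .where and .having calls accumulate can be sketched with a toy builder. ToyQuery is hypothetical and stores conditions as plain strings; it also assumes that each call returns a new query object rather than modifying the existing one, which the text above does not state.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToyQuery:
    """Toy immutable query (hypothetical, not ForML's Query)."""

    prefilter: Optional[str] = None   # evaluated before aggregation (like .where)
    postfilter: Optional[str] = None  # evaluated on aggregated rows (like .having)

    @staticmethod
    def _both(current, condition):
        # The first condition is stored as is; later ones are AND-ed to it.
        return condition if current is None else f'({current}) AND ({condition})'

    def where(self, condition):
        return replace(self, prefilter=self._both(self.prefilter, condition))

    def having(self, condition):
        return replace(self, postfilter=self._both(self.postfilter, condition))


base = ToyQuery()
query = base.where('score < 2').where('level > 1').having('count > 5')
```

The two .where conditions end up combined in query.prefilter, the .having condition is kept separately in query.postfilter, and base itself stays unfiltered.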

Query.limit(count, offset=0)

Restrict the result to at most count rows, optionally skipping the first offset rows.

Parameters
  • count (int) – Number of rows to return.

  • offset (int) – Skip the given number of rows.

Returns

Query instance.

Return type

forml.io.dsl.struct.frame.Query
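The effect of count and offset can be pictured on a plain Python sequence. The limit() helper below is a hypothetical illustration, assuming the conventional semantics where the offset rows are skipped first and count applies to what remains.

```python
from itertools import islice


def limit(rows, count, offset=0):
    """Skip ``offset`` rows, then yield at most ``count`` rows."""
    return islice(rows, offset, offset + count)


rows = ['a', 'b', 'c', 'd', 'e']
first = list(limit(rows, 2))             # the first two rows
paged = list(limit(rows, 2, offset=3))   # two rows after skipping three
short = list(limit(rows, 10, offset=4))  # fewer rows than requested are available
```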

Query.orderby(*columns)

Ordering specifiers.

Parameters

columns (Union[series.Operable, series.Ordering.Direction, str, Tuple[series.Operable, Union[series.Ordering.Direction, str]]]) – Sequence of column expressions and direction tuples.

Returns

Query instance.

Return type

forml.io.dsl.struct.frame.Query
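The parameter type suggests that ordering specifiers may be given as bare columns, as (column, direction) tuples, or as a column followed by a separate direction. The sketch below normalizes such a mixed sequence into pairs. Everything in it is an assumption made for illustration: the Column class, the normalize() helper, the 'asc'/'desc' direction strings, the ascending default, and the rule that a standalone direction applies to the column just before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Toy column reference (hypothetical)."""

    name: str


def normalize(*specs, default='asc'):
    """Turn mixed ordering specifiers into a list of (column, direction) pairs."""
    pairs = []
    for spec in specs:
        if isinstance(spec, Column):
            # A bare column gets the default direction.
            pairs.append((spec, default))
        elif isinstance(spec, str):
            # A standalone direction modifies the column that precedes it.
            if not pairs:
                raise ValueError('direction given without a preceding column')
            column, _ = pairs[-1]
            pairs[-1] = (column, spec.lower())
        else:
            # An explicit (column, direction) tuple.
            column, direction = spec
            pairs.append((column, direction.lower()))
    return pairs


level, score, surname = Column('level'), Column('score'), Column('surname')
ordering = normalize(level, score, 'desc', (surname, 'ASC'))
```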

Expressions

Any schema field representing a data column can be involved in a column expression. All the schema field objects implement a number of native operators that can be used to directly form an expression. Furthermore, there are separate function modules that can be imported to build more complex expressions.

The native operators available directly on the field instances are:

Type            Syntax
Comparison      ==, !=, <, <=, >, >=
Logical         &, |, ~
Arithmetical    +, -, *, /, %
Alias           Operable.alias(alias), described below

Operable.alias(alias)

Use an alias for this column.

Parameters

alias (str) – Aliased column name.

Returns

New column instance with given alias.

Return type

forml.io.dsl.struct.series.Aliased
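The general technique behind such native operators is Python operator overloading: instead of computing a value, each operator returns an object describing the operation. The Expr class below is a hypothetical toy that renders expressions as strings; it is not how ForML's series classes are implemented. It also shows why the logical operators are spelled &, | and ~: Python's and, or and not keywords cannot be overloaded.

```python
class Expr:
    """Toy expression node rendering itself as text (hypothetical)."""

    def __init__(self, text):
        self.text = text

    def _binary(self, symbol, other):
        right = other.text if isinstance(other, Expr) else repr(other)
        return Expr(f'({self.text} {symbol} {right})')

    # Comparison operators build expressions instead of returning booleans.
    # (Defining __eq__ this way makes instances unhashable, which is fine for a toy.)
    def __eq__(self, other):
        return self._binary('=', other)

    def __lt__(self, other):
        return self._binary('<', other)

    # Logical operators use the bitwise hooks.
    def __and__(self, other):
        return self._binary('AND', other)

    def __or__(self, other):
        return self._binary('OR', other)

    def __invert__(self):
        return Expr(f'(NOT {self.text})')

    # Arithmetic operators.
    def __add__(self, other):
        return self._binary('+', other)

    def __mod__(self, other):
        return self._binary('%', other)


score, level = Expr('score'), Expr('level')
condition = (score < 2) & ~(level == 1)
```

Note the parentheses around each comparison: in Python, & binds more tightly than < and ==, so score < 2 & level == 1 would not group as intended.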

A number of functions are also available for use within query expressions. They are grouped into the following categories:

forml.io.dsl.function.aggregate

Aggregation functions.

forml.io.dsl.function.conversion

Conversion functions.

forml.io.dsl.function.datetime

Date and time manipulating functions.

forml.io.dsl.function.math

Mathematical functions.