Schema Definition

The schema definition API is the core part of the DSL. A schema is a virtual intermediary that decouples data solutions (ForML projects) from physical data instances, with the two being linked only at runtime through the selected feed providers.

To become available to both projects and platforms, schemas need to be published in the form of schema catalogs. Once declared, schemas can be used to formulate complex query statements, most notably as formal descriptions of the project data source requirements.

Schema API

The schema definition API is based on the following structures:

class forml.io.dsl.Schema(name: str, bases: tuple[type], namespace: dict[str, Any])
class forml.io.dsl.Schema(schema: dsl.Source.Schema)

DSL frontend for table (schema) definitions.

A Schema is a logical representation of a particular dataset. Together with the dsl.Field, this class provides the schema definition frontend API which can be used in two operational modes:

Declarative Mode

The primary approach to schema definition is based on the class inheritance syntax, with individual class attributes declared as dsl.Field instances representing the schema fields.

This concept is based on the following rules:

  • the default field name is the class attribute name unless explicitly defined using the dsl.Field.name parameter

  • schemas can be hierarchically extended further down

  • extended fields can override same-name fields from parents

  • field ordering is based on the in-class definition order, fields from parent classes come before fields of child classes; overriding a field does not change its position
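
The naming, ordering, and overriding rules above can be modelled in plain Python. The following is a minimal stdlib-only sketch, not ForML's actual implementation: the Field tuple and the collect_fields helper are hypothetical stand-ins, and keying overrides by the class attribute name is an assumption made for illustration.

```python
from typing import NamedTuple, Optional


class Field(NamedTuple):
    """Hypothetical stand-in for dsl.Field: a kind plus an optional name."""

    kind: str
    name: Optional[str] = None


def collect_fields(schema: type) -> dict:
    """Collect fields parent-first; an override keeps the parent's position."""
    fields = {}
    for klass in reversed(schema.__mro__):  # walk from the root base downwards
        for attr, value in vars(klass).items():
            if isinstance(value, Field):
                # The field name defaults to the class attribute name.
                named = value if value.name else value._replace(name=attr)
                # Re-assigning an existing dict key keeps its original slot.
                fields[attr] = named
    return fields


class Person:
    surname = Field('string')
    dob = Field('date', 'birthday')


class Student(Person):
    level = Field('integer')
    surname = Field('text')  # overrides Person.surname without moving it


FIELDS = collect_fields(Student)
```

In this sketch, list(FIELDS) keeps surname in its original leading position even though Student redefines it, while level is appended after the inherited fields.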

Attention

To transparently provide the query statement interface on top of the defined schemas, the internal class handler turns every class inheriting from dsl.Schema into an instance of dsl.Table (which itself has a .schema attribute derived from this class) rather than the intuitively expected subclass of the dsl.Schema parent.
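
The class-to-instance switch described above can be reproduced with a Python metaclass whose __new__ returns something other than a class. This is a simplified sketch under assumed names (Table, SchemaMeta, and Schema here are hypothetical stand-ins, not the ForML classes), and it does not cover extending Person further, which needs additional handling.

```python
class Table:
    """Hypothetical stand-in for dsl.Table: an object wrapping a schema."""

    def __init__(self, schema: type):
        self.schema = schema


class SchemaMeta(type):
    """Metaclass that hands back a Table instance for every schema subclass."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # The root Schema stays a regular class; its children become tables.
        return Table(cls) if bases else cls


class Schema(metaclass=SchemaMeta):
    """Root class used only as the declaration anchor."""


class Person(Schema):
    surname = 'string'  # placeholder for a field definition
```

After these declarations, Person is bound to a Table instance rather than to a class, and the declared class remains reachable as Person.schema.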

Functional Mode

Additionally, schemas can be obtained in a number of alternative ways implemented by the factory methods from_fields(), from_record(), and from_path() documented below.

Schema fields can be referenced either using the Pythonic attribute-getter syntax like <Schema>.<field_name> or alternatively (e.g. if the field name is not a valid Python identifier) using the item-getter syntax as <Schema>[<field_name>].
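
The two access styles map onto Python's __getattr__ and __getitem__ hooks. The sketch below uses a hypothetical Table class holding plain strings instead of real fields; it only illustrates why the item-getter form is needed for names that are not valid identifiers.

```python
class Table:
    """Hypothetical table exposing its fields by attribute or by item."""

    def __init__(self, fields: dict):
        self._fields = dict(fields)

    def __getitem__(self, name: str):
        return self._fields[name]

    def __getattr__(self, name: str):
        # Called only when the regular attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


PERSON = Table({'surname': 'string', 'date of birth': 'date'})
```

Here PERSON.surname and PERSON['surname'] return the same value, whereas 'date of birth' is reachable only as PERSON['date of birth'].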

Examples

The following is an example of the declarative syntax:

class Person(dsl.Schema):
    '''Base schema.'''

    surname = dsl.Field(dsl.String())
    dob = dsl.Field(dsl.Date(), 'birthday')

class Student(Person):
    '''Extended schema.'''

    level = dsl.Field(dsl.Integer())
    score = dsl.Field(dsl.Float())

This declares two data sources: a generic Person with a string field called surname and a date field held in the dob attribute but explicitly named birthday, plus its extended version Student with two more fields, an integer level and a float score.

This schema can be used to formulate a query statement as shown:

>>> ETL = (
...     Student
...     .select(Student.surname.alias('name'), Student.dob)
...     .where(Student.score > 80)
... )
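
A common way to support statements such as Student.score > 80 is to overload operators so that they build expression objects instead of evaluating to booleans. The following stdlib-only sketch shows that general technique with hypothetical Column, GreaterThan, and Query classes; it is an illustration of the idea and makes no claim about how ForML implements its queries internally.

```python
import dataclasses
from typing import Any, Optional


@dataclasses.dataclass(frozen=True)
class Column:
    name: str

    def __gt__(self, other: Any) -> 'GreaterThan':
        # Build an expression node rather than comparing values.
        return GreaterThan(self, other)


@dataclasses.dataclass(frozen=True)
class GreaterThan:
    left: Column
    right: Any


@dataclasses.dataclass(frozen=True)
class Query:
    source: str
    columns: tuple = ()
    condition: Optional[GreaterThan] = None

    def select(self, *columns: Column) -> 'Query':
        # Each step returns a new immutable query, which enables chaining.
        return dataclasses.replace(self, columns=columns)

    def where(self, condition: GreaterThan) -> 'Query':
        return dataclasses.replace(self, condition=condition)


QUERY = (
    Query('Student')
    .select(Column('surname'), Column('birthday'))
    .where(Column('score') > 80)
)
```

The resulting QUERY is a purely descriptive object that some separate component could later translate for a concrete backend.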

static from_fields(*fields: dsl.Field, title: str | None = None) -> dsl.Source.Schema

Utility for functional schema assembly.

Parameters:
*fields: dsl.Field

Schema field list.

title: str | None = None

Optional schema name.

Returns:

Assembled schema.

Examples

>>> SCHEMA = dsl.Schema.from_fields(
...     dsl.Field(dsl.Integer(), name='A'),
...     dsl.Field(dsl.String(), name='B'),
... )
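
Functional assembly of this kind can be pictured with Python's three-argument type(name, bases, namespace) call, which has the same shape as the Schema(name, bases, namespace) signature listed earlier. The sketch below is an assumption-laden illustration: the Field tuple, the uniqueness check, and the default title are hypothetical and not taken from ForML.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for dsl.Field."""

    kind: str
    name: str


def from_fields(*fields: Field, title: str = 'Schema') -> type:
    """Assemble a class from fields, keyed by their explicit names."""
    names = [field.name for field in fields]
    if len(set(names)) != len(names):
        raise ValueError('Field names must be unique')
    # Equivalent to a class statement with the given name and attributes.
    return type(title, (), {field.name: field for field in fields})


SCHEMA = from_fields(Field('integer', 'A'), Field('string', 'B'))
```

Because no class attribute names exist in this mode, the sketch relies on every field carrying an explicit name.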

classmethod from_record(record: layout.Native, *names: str, title: str | None = None) -> dsl.Source.Schema

Utility for functional schema inference.

Parameters:
record: layout.Native

Scalar or vector representing a single record from which the schema should be inferred.

*names: str

Optional field names.

title: str | None = None

Optional schema name.

Returns:

Inferred schema.

Examples

>>> SCHEMA = dsl.Schema.from_record(
...     ['foobar', 37], 'name', 'age', title='Person'
... )
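
Schema inference from a record essentially maps native value types to DSL kinds. The following is a rough stdlib-only sketch of that idea; the KINDS table, the infer helper, and the positional fallback names are all hypothetical, and the real inference rules may differ.

```python
import datetime
import decimal

# Hypothetical mapping from native Python types to DSL kind names.
KINDS = {
    bool: 'Boolean',
    int: 'Integer',
    float: 'Float',
    decimal.Decimal: 'Decimal',
    str: 'String',
    datetime.date: 'Date',
    datetime.datetime: 'Timestamp',
}


def infer(record, *names: str) -> dict:
    """Infer a name-to-kind mapping from a scalar or vector record."""
    values = list(record) if isinstance(record, (list, tuple)) else [record]
    if names and len(names) != len(values):
        raise ValueError('Expecting one name per record value')
    # Fall back to positional placeholder names (an arbitrary choice here).
    names = names or tuple(f'c{index}' for index in range(len(values)))
    # Exact type lookup keeps bool apart from int and datetime apart from date.
    return {name: KINDS[type(value)] for name, value in zip(names, values)}


SCHEMA = infer(['foobar', 37], 'name', 'age')
```

The exact type lookup is a deliberate choice: bool is a subclass of int and datetime.datetime is a subclass of datetime.date, so isinstance checks would need careful ordering.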

classmethod from_path(path: str) -> dsl.Table

Utility for importing a schema table from the given path.

Parameters:
path: str

Schema path in the form full.module.path:schema.qualified.ClassName.

Returns:

Imported schema table.

Examples

>>> SCHEMA = dsl.Schema.from_path('foo.bar:Baz')
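
Resolving a path in the module:qualified.name form can be done with importlib alone. The load helper below is a hypothetical sketch of such a resolver, demonstrated against a standard library module rather than a real schema catalog; it performs none of the type checks a real implementation would likely add.

```python
import importlib
import operator


def load(path: str):
    """Resolve a 'full.module.path:qualified.Name' reference to an object."""
    module_name, separator, qualname = path.partition(':')
    if not (module_name and separator and qualname):
        raise ValueError(f'Invalid path: {path}')
    module = importlib.import_module(module_name)
    # attrgetter follows dotted names, so nested attributes resolve as well.
    return operator.attrgetter(qualname)(module)


DECODER = load('json.decoder:JSONDecoder')
```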

class forml.io.dsl.Field(kind: dsl.Any, name: str | None = None)

Schema field class.

When defined as class attributes on a particular dsl.Schema object, these instances represent the individual fields of the logical data source.

Parameters:
kind: dsl.Any

Mandatory field data type.

The value must be one of the dsl.Any data type instances.

name: str | None = None

Explicit field name.

kind : dsl.Any

Field data type.

name : str | None

Optional explicit field name.

Implicitly defaults to the name of the schema class attribute holding this field.

Type System

The DSL uses its own type system for the schema Field definitions, which is propagated into the query Feature instances.

The type system is based on the following hierarchy:

classDiagram
    Any <|-- Primitive
    Primitive <|-- Numeric
    Primitive <|-- Boolean
    Numeric <|-- Integer
    Numeric <|-- Float
    Numeric <|-- Decimal
    Primitive <|-- String
    Primitive <|-- Date
    Date <|-- Timestamp
    Any <|-- Compound
    Compound <|-- Array
    Compound <|-- Map
    Compound <|-- Struct

    <<abstract>> Any
    <<abstract>> Primitive
    <<abstract>> Numeric
    <<abstract>> Compound

    class Array {
        +Any element
    }

    class Map {
        +Any key
        +Any value
    }

    class Struct {
        +list[Element]
    }
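
The hierarchy in the diagram can be mirrored with ordinary Python classes to show how subtype checks and nested compound kinds behave. This is a standalone model with the same class names, written for illustration only; it is not the forml.io.dsl implementation, and the attribute layout of Struct is an assumption.

```python
import dataclasses


class Any:
    """Root of the sketched type hierarchy."""


class Primitive(Any): ...
class Numeric(Primitive): ...
class Boolean(Primitive): ...
class Integer(Numeric): ...
class Float(Numeric): ...
class Decimal(Numeric): ...
class String(Primitive): ...
class Date(Primitive): ...
class Timestamp(Date): ...
class Compound(Any): ...


@dataclasses.dataclass(frozen=True)
class Array(Compound):
    element: Any


@dataclasses.dataclass(frozen=True)
class Map(Compound):
    key: Any
    value: Any


class Struct(Compound):
    def __init__(self, **element: Any):
        # Attribute names mapped to their kinds (assumed layout).
        self.element = dict(element)


# A map of strings to arrays of integers, plus a two-attribute structure.
KIND = Map(String(), Array(Integer()))
RECORD = Struct(name=String(), born=Timestamp())
```

With this model, isinstance(KIND.value.element, Numeric) holds because Integer derives from Numeric, and a Timestamp is accepted wherever a Date is expected.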

The following is a description of the main types:

class forml.io.dsl.Any(*args, **kwargs)

Base class of all types.

class forml.io.dsl.Boolean

Boolean data type class.

class forml.io.dsl.Integer

Integer data type class.

class forml.io.dsl.Float

Float data type class.

class forml.io.dsl.Decimal

Decimal data type class.

class forml.io.dsl.String

String data type class.

class forml.io.dsl.Date

Date data type class.

class forml.io.dsl.Timestamp

Timestamp data type class.

class forml.io.dsl.Array(element: dsl.Any)

Array data type class.

Parameters:
element: dsl.Any

Array element kind.

class forml.io.dsl.Map(key: dsl.Any, value: dsl.Any)

Map data type class.

Parameters:
key: dsl.Any

Map keys kind.

value: dsl.Any

Map values kind.

class forml.io.dsl.Struct(**element: dsl.Any)

Structure data type class.

Parameters:
**element: dsl.Any

Mapping of attribute name strings and their kinds.