Schema Definition
The schema definition API is the core part of the DSL. A schema is a virtual intermediary that decouples data solutions (ForML projects) from physical data instances; the two are linked directly only at runtime, using selected feed providers.
To become available to both projects and platforms, schemas need to be published in the form of schema catalogs. Once declared, schemas can be used to formulate complex query statements, most notably as formal descriptions of the project data source requirements.
Schema API
The schema definition API is based on the following structures:
- class forml.io.dsl.Schema(name: str, bases: tuple[type], namespace: dict[str, Any])
- class forml.io.dsl.Schema(schema: dsl.Source.Schema)
DSL frontend for table (schema) definitions.
A Schema is a logical representation of a particular dataset. Together with the dsl.Field, this class provides the schema definition frontend API which can be used in two operational modes:
- Declarative Mode
The primary approach for schema definition is based on the class inheritance syntax with individual class attributes declared as the dsl.Field instances representing the schema fields.
This concept is based on the following rules:
- the default field name is the class attribute name unless explicitly defined using the dsl.Field.name parameter
- schemas can be hierarchically extended further down
- extended fields can override same-name fields from parents
- field ordering is based on the in-class definition order, fields from parent classes come before fields of child classes; overriding a field does not change its position
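The naming, overriding and ordering rules above can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not ForML's implementation: `Field` and `collect_fields` are hypothetical stand-ins, kinds are plain strings, and overrides are matched by the resolved field name.

```python
class Field:
    """Hypothetical stand-in for dsl.Field: a kind plus an optional explicit name."""

    def __init__(self, kind, name=None):
        self.kind = kind
        self.name = name


def collect_fields(cls):
    """Return {field_name: Field} for a class, following the rules listed above."""
    fields = {}
    # Walk from the most generic parent down to the class itself,
    # so that parent fields come before child fields.
    for klass in reversed(cls.__mro__):
        for attr, value in vars(klass).items():
            if isinstance(value, Field):
                # The attribute name is the default field name.
                name = value.name or attr
                # Re-assigning an existing dict key keeps its position, so an
                # override replaces the field without moving it.
                fields[name] = value
    return fields


class Person:
    surname = Field("string")
    dob = Field("date", "birthday")


class Student(Person):
    level = Field("integer")
    surname = Field("text")  # overrides the parent's field of the same name


ORDER = list(collect_fields(Student))
print(ORDER)  # ['surname', 'birthday', 'level']
```

The overridden `surname` keeps its original first position even though it is redefined after `level` in the child class.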
Attention
To transparently provide the query statement interface on top of the defined schemas, the internal class handler magically turns all children inherited from dsl.Schema into instances of dsl.Table (which itself has a .schema attribute derived from this class) instead of the intuitively expected subclass of the dsl.Schema parent.
- Functional Mode
Additionally, schemas can be retrieved in a number of alternative ways implemented by factory methods such as from_fields() and from_record(), both documented below.
Schema fields can either be referenced using the Pythonic attribute-getter syntax like <Schema>.<field_name> or alternatively (e.g. if the field name is not a valid Python identifier) using the item-getter syntax as <Schema>[<field_name>].
Examples
Following is an example of the declarative syntax:
from forml.io import dsl


class Person(dsl.Schema):
    '''Base schema.'''

    surname = dsl.Field(dsl.String())
    dob = dsl.Field(dsl.Date(), 'birthday')


class Student(Person):
    '''Extended schema.'''

    level = dsl.Field(dsl.Integer())
    score = dsl.Field(dsl.Float())
That’s a declaration of two data sources: a generic Person with a string field called surname and a date field dob aliased as birthday, plus its extended version Student with two more fields, integer level and float score.
This schema can be used to formulate a query statement as shown:
>>> ETL = (
...     Student
...     .select(Student.surname.alias('name'), Student.dob)
...     .where(Student.score > 80)
... )
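The behaviour described in the Attention note, together with the two field reference styles, can be modelled in plain Python. The following is a simplified sketch rather than ForML's actual implementation: `Field`, `Table`, `Schema` and `SchemaMeta` are hypothetical stand-ins, only direct children of the root class are handled, and the resolution of aliased fields under attribute access is not modelled.

```python
class Field:
    """Hypothetical stand-in for dsl.Field."""

    def __init__(self, kind, name=None):
        self.kind = kind
        self.name = name


class Table:
    """Hypothetical stand-in for dsl.Table holding the collected fields."""

    def __init__(self, title, schema):
        self.title = title
        self.schema = schema  # mapping of field name -> Field

    def __getattr__(self, name):
        # Invoked only when regular attribute lookup fails.
        schema = self.__dict__.get("schema", {})
        if name in schema:
            return schema[name]
        raise AttributeError(name)

    def __getitem__(self, name):
        # Item access also works for names that are not valid identifiers.
        return self.schema[name]


class SchemaMeta(type):
    """Metaclass returning a Table instance instead of a new class."""

    def __new__(mcs, name, bases, namespace):
        if not bases:
            # The root Schema itself is built as an ordinary class.
            return super().__new__(mcs, name, bases, namespace)
        fields = {
            (value.name or attr): value
            for attr, value in namespace.items()
            if isinstance(value, Field)
        }
        return Table(name, fields)


class Schema(metaclass=SchemaMeta):
    """Hypothetical root schema class."""


class Person(Schema):
    surname = Field("string")
    born = Field("date", name="birth date")


print(type(Person).__name__)        # Table
print(Person.surname.kind)          # string
print(Person["birth date"].kind)    # date
```

Because the metaclass returns an object that is not a class, `Person` ends up bound to a `Table` instance, which is why the query methods are available directly on the declared schema name.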
- static from_fields(*fields: dsl.Field, title: str | None = None) -> dsl.Source.Schema
Utility for functional schema assembly.
- Parameters:
- Returns:
Assembled schema.
Examples
>>> SCHEMA = dsl.Schema.from_fields(
...     dsl.Field(dsl.Integer(), name='A'),
...     dsl.Field(dsl.String(), name='B'),
... )
- classmethod from_record(record: layout.Native, *names: str, title: str | None = None) -> dsl.Source.Schema
Utility for functional schema inference.
- Parameters:
- Returns:
Inferred schema.
Examples
>>> SCHEMA = dsl.Schema.from_record(
...     ['foobar', 37], 'name', 'age', title='Person'
... )
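To give an intuition for what inferring a schema from a record involves, here is a minimal pure-Python sketch. It is an illustration only: the `KINDS` mapping and the `infer_schema` helper are assumptions made for this example, and the rules ForML actually applies may differ.

```python
import datetime
import decimal

# Assumed mapping from Python value types to the names of the DSL kinds.
# Exact types are used for the lookup because bool subclasses int and
# datetime subclasses date.
KINDS = {
    bool: "Boolean",
    int: "Integer",
    float: "Float",
    decimal.Decimal: "Decimal",
    str: "String",
    datetime.date: "Date",
    datetime.datetime: "Timestamp",
}


def infer_schema(record, *names):
    """Pair each field name with the kind name inferred from the record value."""
    if len(names) != len(record):
        raise ValueError("Expecting exactly one name per record value")
    return {name: KINDS[type(value)] for name, value in zip(names, record)}


INFERRED = infer_schema(["foobar", 37], "name", "age")
print(INFERRED)  # {'name': 'String', 'age': 'Integer'}
```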
- class forml.io.dsl.Field(kind: dsl.Any, name: str | None = None)
Schema field class.
When defined as class attributes on a particular dsl.Schema object, these instances represent the individual fields of the logical data source.
- Parameters:
Type System
The DSL uses its own type system for its schema Field definitions, which is propagated into the query Feature instances.
The type system is based on the following hierarchy:
classDiagram
Any <|-- Primitive
Primitive <|-- Numeric
Primitive <|-- Boolean
Numeric <|-- Integer
Numeric <|-- Float
Numeric <|-- Decimal
Primitive <|-- String
Primitive <|-- Date
Date <|-- Timestamp
Any <|-- Compound
Compound <|-- Array
Compound <|-- Map
Compound <|-- Struct
<<abstract>> Any
<<abstract>> Primitive
<<abstract>> Numeric
<<abstract>> Compound
class Array {
+Any element
}
class Map {
+Any key
+Any value
}
class Struct {
+list[Element]
}
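The hierarchy in the diagram can be mirrored with ordinary Python classes to make the subtype relations concrete. This is a sketch for illustration, not the library's own classes: the names are copied from the diagram, abstractness is noted only in comments, and Struct is left out because its Element type is not described in this section.

```python
class Any:
    """Root of the hierarchy (abstract in the diagram)."""


class Primitive(Any):
    """Abstract in the diagram."""


class Numeric(Primitive):
    """Abstract in the diagram."""


class Boolean(Primitive): ...
class Integer(Numeric): ...
class Float(Numeric): ...
class Decimal(Numeric): ...
class String(Primitive): ...
class Date(Primitive): ...
class Timestamp(Date): ...


class Compound(Any):
    """Abstract in the diagram."""


class Array(Compound):
    def __init__(self, element: Any):
        self.element = element  # kind of every item in the array


class Map(Compound):
    def __init__(self, key: Any, value: Any):
        self.key = key      # kind of the map keys
        self.value = value  # kind of the map values


# Compound kinds nest other kinds: a map from strings to arrays of integers.
NESTED = Map(String(), Array(Integer()))

print(issubclass(Timestamp, Date))   # True
print(issubclass(Boolean, Numeric))  # False
```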
The following is a description of the main types: