Monolite Feed

Bases: Feed

Lightweight feed for pulling data from multiple simple origins.

The feed can resolve queries across all of its combined data sources.

All origins must be declared using a content resolver mapping. Each key is the fully qualified schema name formatted as <full.module.path>:<qualified.Class.Name>, and each value holds the origin-specific configuration options.
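The key format can be illustrated with a small sketch. The helper names split_schema_key and load_schema are hypothetical and are not part of the feed's API; they only show how a key of this shape could be split and resolved with the standard importlib module. The import step succeeds only if the package providing the module is installed.

```python
import importlib
from typing import Any


def split_schema_key(key: str) -> tuple[str, str]:
    """Split '<full.module.path>:<qualified.Class.Name>' into its two parts."""
    module_path, sep, qualname = key.partition(":")
    if not sep or not module_path or not qualname:
        raise ValueError(f"Invalid schema key: {key!r}")
    return module_path, qualname


def load_schema(key: str) -> Any:
    """Import the module, then walk the dotted class name to the target object."""
    module_path, qualname = split_schema_key(key)
    target: Any = importlib.import_module(module_path)
    for part in qualname.split("."):
        target = getattr(target, part)
    return target


# A key of the documented shape splits at the colon:
print(split_schema_key("foobar.schemas:Foo.Baz"))  # ('foobar.schemas', 'Foo.Baz')
```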

Attention

All the referenced schema catalogs must be installed.

Supported origins:

Inline data provided as a row-oriented array.
CSV files parsed using pandas.read_csv().
Parquet files parsed using pandas.read_parquet().

Parameters:

inline: Mapping[dsl.Source | str, layout.RowMajor] | None = None

Schema mapping of datasets provided inline as native row-oriented arrays.
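As a rough illustration of what "row-oriented" means here, the sketch below uses plain Python lists with made-up sample values. It is not tied to the layout.RowMajor type itself; it only contrasts the row-wise shape with its column-wise transpose.

```python
# Illustrative inline dataset: each inner list is one row (record).
rows = [
    ["alpha", 27, 0.314],
    ["beta", 11, -1.12],
]

# Transposing yields the column-oriented view of the same data.
columns = [list(column) for column in zip(*rows)]

print(columns)  # [['alpha', 'beta'], [27, 11], [0.314, -1.12]]
```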

csv: Mapping[dsl.Source | str, Path | str | Mapping[str, Any]] | None = None

Schema mapping of datasets accessible using a CSV reader. Values can either be direct file system paths or a mapping with two keys:

path pointing to the CSV file
kwargs containing additional options to be passed to the underlying pandas.read_csv()
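The two accepted value forms can be sketched as follows. The helpers resolve_origin and read_csv_origin are hypothetical names used for illustration, not the feed's actual implementation, and treating a missing kwargs key as an empty option set is an assumption of this sketch.

```python
from pathlib import Path
from typing import Any, Mapping, Union

Origin = Union[Path, str, Mapping[str, Any]]


def resolve_origin(origin: Origin) -> tuple[Path, dict[str, Any]]:
    """Normalize either accepted value form into a (path, kwargs) pair."""
    if isinstance(origin, Mapping):
        # Assumption: a missing "kwargs" key means no extra reader options.
        return Path(origin["path"]), dict(origin.get("kwargs", {}))
    return Path(origin), {}


def read_csv_origin(origin: Origin):
    """Load the origin with pandas, forwarding any extra reader options."""
    import pandas  # imported lazily so resolve_origin works without pandas

    path, kwargs = resolve_origin(origin)
    return pandas.read_csv(path, **kwargs)


print(resolve_origin("/tmp/titanic.csv"))
print(resolve_origin({"path": "/tmp/iris.csv", "kwargs": {"sep": ";"}}))
```

Keeping the normalization separate from the reader call means the same helper could serve any reader that takes a path plus keyword options.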

parquet: Mapping[dsl.Source | str, Path | str | Mapping[str, Any]] | None = None

Schema mapping of datasets accessible using a Parquet reader. Values can either be direct file system paths or a mapping with two keys:

path pointing to the Parquet file
kwargs containing additional options to be passed to the underlying pandas.read_parquet()

The provider can be enabled using the following platform configuration:

config.toml

 [FEED.mono]
 provider = "monolite"
 [FEED.mono.inline]
 "foobar.schemas:Foo.Baz" = [
     ["alpha", 27, 0.314, 2021-05-11T17:12:24],
     ["beta", 11, -1.12, 2020-11-03T01:24:56],
 ]
 [FEED.mono.csv]
 "openschema.kaggle:Titanic" = "/tmp/titanic.csv"
 [FEED.mono.csv."openschema.sklearn:Iris"]
 path = "/tmp/iris.csv"
 kwargs = {sep = ";", engine = "pyarrow"}
 [FEED.mono.parquet]
 "openschema.kaggle:Avazu" = "/tmp/avazu.parquet"

Important

Install ForML with the sql extras to include SQLAlchemy support.

Todo

More file types (json)
Multi-file data sources (partitions)