Workflow is the backbone of the ML solution, responsible for consistently holding all of its pieces together. At a low level, it is a task dependency graph whose edges represent data flows and whose vertices represent data transformations. This particular type of graph is called a Directed Acyclic Graph (DAG), meaning the flows are oriented and cannot form cycles. Representing workflows as task graphs is crucial for robust scheduling, scalable execution, and runtime portability.
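To make the DAG property concrete, here is a minimal sketch using Python's standard-library graphlib: because the graph is acyclic, its tasks always admit a topological order in which every task runs after its dependencies. The task names are purely illustrative, not part of any ForML API.

```python
from graphlib import TopologicalSorter

# A toy task dependency graph: each key depends on the tasks in its value set.
# The names ("extract", "impute", "train_model") are hypothetical examples.
graph = {
    "impute": {"extract"},
    "train_model": {"impute"},
}

# A DAG admits a topological order: dependencies always come first.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['extract', 'impute', 'train_model']
```

Had the graph contained a cycle, TopologicalSorter would raise a CycleError instead of producing an order, which is exactly why workflows must be acyclic to be schedulable.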
At its core, the workflow internals explained in the following chapters build on graph theory and on software and ML engineering principles, which might feel too involved from a general data science perspective. Fortunately, this level of detail is not required for the usual day-to-day work with the existing high-level ForML operators.
ForML provides a convenient API for defining complex workflows using a simple notation based on the following concepts:
Operators are high-level pipeline macro-instructions that can be composed together and eventually expand into the task graph.
Actors are the low-level task primitives representing the graph vertices.
Topology is the particular interconnection of the individual actors determining their dependencies.
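The three concepts above can be illustrated with a small sketch in plain Python. The class and task names here are hypothetical stand-ins, not the actual ForML classes: actors are the task vertices, and the topology wires them together into a dependency structure.

```python
class Actor:
    """A toy task primitive: a named unit of data transformation
    (an illustrative stand-in, not the ForML Actor API)."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def apply(self, data):
        return self.fn(data)


# Two toy actors - the graph vertices...
strip = Actor("strip", str.strip)
lower = Actor("lower", str.lower)

# ...and their topology - the graph edges: strip feeds into lower.
topology = [(strip, lower)]

# Executing the chain by following the edges:
result = lower.apply(strip.apply("  Hello  "))
print(result)  # 'hello'
```

An operator, in this picture, would be a reusable macro that emits such actors together with their wiring, so pipelines can be composed without handling individual vertices and edges.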
ForML integrates the two-fold concept typical of supervised learning, where the stateful components of a particular solution are operated in two distinct modes:
The Train-mode (a.k.a. fit) allowing the relevant components to acquire an internal state generalizing the processed data.
The Apply-mode (a.k.a. predict) where the previously trained components are applied to unseen data to predict the estimated outcome.
ForML uniquely builds this duality directly into its workflow architecture, so the modality extends from the individual components to the entire workflow. Thus, each workflow is operated either in train-mode or in apply-mode.
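The two modes of a stateful component can be sketched as follows. The mean-imputer below is a hypothetical example (not the ForML API): in train-mode it acquires internal state generalizing the observed data, and in apply-mode it uses that state on unseen data.

```python
class MeanImputer:
    """Toy stateful component with a train/apply duality (illustrative only)."""

    def __init__(self):
        self._mean = None  # internal state acquired during training

    def train(self, values):
        """Train-mode (fit): generalize the processed data into state."""
        known = [v for v in values if v is not None]
        self._mean = sum(known) / len(known)

    def apply(self, values):
        """Apply-mode (predict): use the trained state on unseen data."""
        return [self._mean if v is None else v for v in values]


imputer = MeanImputer()
imputer.train([1.0, 3.0, None])          # state becomes the mean: 2.0
predicted = imputer.apply([None, 5.0])   # missing value filled from state
print(predicted)  # [2.0, 5.0]
```

Note how apply-mode is only meaningful after train-mode has produced the state - the same dependency the dotted "state" edges express in the task-graph diagram later in this chapter.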
While other ML frameworks and platforms are typically model-centric (their discrete train process produces model(s) that get deployed separately to serve the predict phase), ForML, in contrast, is workflow-centric - ensuring that all the steps (i.e. the workflow) applied during the apply-mode consistently reflect the original train process. This is achieved by inseparably integrating both the train and the apply (predict) representations of the specific ML scenario into a single ForML expression. Essentially, every ForML workflow expands into one of two related task graphs depending on its particular mode.
The high-level API for describing a workflow allows composing operator expressions using the following syntax:
flow = LabelExtractor(column='foo') >> NaNImputer() >> RFC(max_depth=3)
A typically counter-intuitive feature of any DAG-based framework is that evaluating these expressions merely builds a DAG rather than performing the actual processing functions (which happens separately, in a completely different context).
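A toy sketch of how such deferred composition can work (illustrative only, not the ForML internals): by overloading the ``>>`` operator, evaluating the expression records graph edges instead of running anything.

```python
class Node:
    """Toy graph vertex whose ``>>`` records topology instead of executing."""

    def __init__(self, name):
        self.name = name
        self.downstream = []  # edges to dependent nodes

    def __rshift__(self, other):
        # Called for ``self >> other``: register the edge, run nothing.
        self.downstream.append(other)
        return other  # returning ``other`` lets the chaining continue


# Hypothetical stand-ins for the operators from the example expression:
a, b, c = Node("extract"), Node("impute"), Node("model")

# Evaluating the chained expression only wires up extract -> impute -> model:
a >> b >> c

print([n.name for n in a.downstream])  # ['impute']
print([n.name for n in b.downstream])  # ['model']
```

The actual data processing would then happen later, when a runner walks the recorded graph in dependency order - a completely separate context from the expression that built it.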
Given the implementation of the particular operators used in the previous example, this single expression might render a workflow with the train and apply task graphs visualized as follows:
flowchart TD
    subgraph Train Mode
        ft((Future)) --> xta(["LabelExtractor.apply()"]) -- L --> itt["NaNImputer.train()"] & ctt["RFC.train()"]
        xta --> itt & ita(["NaNImputer.apply()"])
        ita --> ctt
        itt -. state .-> ita
    end
    subgraph Apply Mode
        fa((Future)) --> iaa(["NaNImputer.apply()"]) --> caa(["RFC.apply()"])
        itt -. state .-> iaa
        ctt -. state .-> caa
    end