Serving Engine

In addition to the basic CLI-driven, project-level batch-mode execution mechanism, ForML allows operating the encompassing applications within an interactive loop performing the apply action of the production life cycle - essentially providing online predictions (a.k.a. ML inference) based on the underlying models.

Process Control

The core component driving the serving loop is the Engine. To facilitate the end-to-end prediction serving, it interacts with all the different platform sub-systems as shown in the following sequence diagram:

    actor Client
    participant Engine as Engine/Gateway
    Client ->> Engine: query(Application, Request)
    opt if not in cache
        Engine ->> Inventory: get_descriptor(Application)
        Inventory --) Engine: Descriptor
    end
    Engine ->> Engine: Entry, Scope = Descriptor.receive(Request)
    Engine ->> Engine: ModelHandle = Descriptor.select(Scope)
    opt if needed for model selection
        Engine ->> Registry: inspect()
        Registry --) Engine: Metadata
    end
    Engine ->> Engine: Runner = get_or_spawn()
    Engine ->> Runner: apply(ModelHandle, Entry)
    opt if not loaded
        Runner ->> Registry: load(ModelHandle)
        Registry --) Runner: Model
    end
    opt if needs augmenting
        Runner ->> Feed: get_features(Entry)
        Feed --) Runner: Features
    end
    Runner ->> Runner: Outcome = Model.predict(Features)
    Runner --) Engine: Outcome
    Engine ->> Engine: Response = Descriptor.respond(Outcome)
    Engine --) Client: Response

This diagram illustrates the following steps:

  1. Receiving a request containing the query payload and the target application reference.

  2. Upon the very first request for any given application, the engine fetches the particular application descriptor from the configured inventory. The descriptor remains cached for all follow-up requests to that application.

  3. The engine uses the descriptor of the selected application to dispatch the request by:

    1. Interpreting the query payload.

    2. Selecting a particular model generation to serve the given request (depending on the model-selection strategy used by that application, this step might involve interaction with the model registry).

  4. Unless already running, the engine spawns a dedicated runner which loads the selected model artifacts, providing an isolated environment that does not collide with (the dependencies of) other models served by the same engine.

  5. The runner may engage the configured feed system to augment the provided data points using a feature store.

  6. With the complete feature set matching the project-defined schema, the runner executes the pipeline in apply-mode, obtaining the prediction outcomes.

  7. Finally, the engine again uses the application descriptor to produce the response which is then returned to the original caller.
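The dispatch flow described by these steps can be condensed into a toy sketch. Every name in it (ToyDescriptor, ToyEngine, the in-memory inventory/registry dictionaries) is an illustrative assumption, not the actual ForML API:

```python
# Toy model of the serving dispatch - illustrative only, NOT the ForML API.

class ToyDescriptor:
    """Application descriptor interpreting requests and producing responses."""

    def receive(self, request: bytes) -> list[float]:
        # Step 3.1: interpret the query payload (here: comma-separated floats).
        return [float(value) for value in request.decode().split(',')]

    def select(self, registry: dict) -> int:
        # Step 3.2: pick a model generation (here simply the latest one).
        return max(registry)

    def respond(self, outcome: float) -> bytes:
        # Step 7: turn the raw prediction into the response payload.
        return f'{outcome:.2f}'.encode()


class ToyEngine:
    """Minimal engine multiplexing applications from an inventory."""

    def __init__(self, inventory: dict, registry: dict):
        self._inventory = inventory  # application name -> descriptor
        self._registry = registry    # model generation -> model (a callable)
        self._cache = {}             # step 2: descriptor cache

    def query(self, application: str, request: bytes) -> bytes:
        if application not in self._cache:  # fetched upon the first request only
            self._cache[application] = self._inventory[application]
        descriptor = self._cache[application]
        entry = descriptor.receive(request)             # step 3.1
        generation = descriptor.select(self._registry)  # step 3.2
        model = self._registry[generation]              # step 4 (no isolation here)
        outcome = model(entry)                          # step 6: apply-mode execution
        return descriptor.respond(outcome)              # step 7


engine = ToyEngine(
    inventory={'demo': ToyDescriptor()},
    registry={1: sum, 2: lambda features: sum(features) / len(features)},
)
print(engine.query('demo', b'1,2,3'))  # b'2.00' (mean produced by generation 2)
```

The sketch deliberately skips the runner spawning, model isolation, and feature augmentation steps, which the real engine delegates to the runner and feed sub-systems.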


An engine can serve any application available in its linked inventory in a multiplexed fashion. Since the released project packages contain all the declared dependencies, the engine itself remains generic. To avoid collisions between the dependencies of different models, the engine isolates each of them in a separate context.

Frontend Gateway

While the engine is full-featured in terms of the end-to-end application serving, it can only be engaged using its raw Python API. That’s suitable for products natively embedding the engine as an integrated component, but a truly decoupled client-server architecture needs an extra layer providing some sort of transport protocol.

For this purpose, ForML comes with the concept of serving frontend gateways. They also follow the provider pattern, allowing a number of different interchangeable implementations to be delivered and plugged in at launch time.
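As a minimal sketch of that provider pattern (the decorator, registry dictionary, and gateway classes below are hypothetical stand-ins, not ForML's actual plugin machinery):

```python
# Hypothetical illustration of the provider pattern - not the ForML internals.
GATEWAYS: dict[str, type] = {}  # provider key -> implementation


def provider(name: str):
    """Register a gateway implementation under the given provider key."""
    def register(cls: type) -> type:
        GATEWAYS[name] = cls
        return cls
    return register


@provider('rest')
class RestGateway:
    def serve(self) -> str:
        return 'serving REST'


@provider('grpc')
class GrpcGateway:
    def serve(self) -> str:
        return 'serving gRPC'


# The concrete implementation is only resolved at launch time from the config:
config = {'gateway': 'rest'}
print(GATEWAYS[config['gateway']]().serve())  # serving REST
```

The point of the pattern is that the serving code depends only on the abstract contract, while the concrete transport is selected by configuration.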

Frontend gateways represent the outermost layer in the logical hierarchy of the ForML architecture:

    Layer                                     Problem question     Deliverable
    ----------------------------------------  -------------------  -------------------------------------------
    ML solution                               How to solve?        Prediction outcomes (e.g. probabilities)
    Domain interpretation, model selection    How to utilize?      Domain response (e.g. recommended products)
    Serving control                           How to operate?      Interactive processing loop
    Client-server transport                   How to integrate?    ML service API

class forml.runtime.Gateway(inventory: asset.Inventory | None = None, registry: asset.Registry | None = None, feeds: io.Importer | None = None, processes: int | None = None, loop: asyncio.AbstractEventLoop | None = None, **kwargs)

Top-level serving gateway abstraction.

inventory: asset.Inventory | None = None

Inventory of applications to be served (default as per the platform configuration).

registry: asset.Registry | None = None

Model registry of project artifacts to be served (default as per the platform configuration).

feeds: io.Importer | None = None

Feeds to be used for potential feature augmentation (default as per the platform configuration).

processes: int | None = None

Process pool size for each model sandbox.

loop: asyncio.AbstractEventLoop | None = None

Explicit event loop instance.


**kwargs

Additional serving loop keyword arguments passed to the run() method.

abstract classmethod run(apply: Callable[[str, layout.Request], Awaitable[layout.Response]], stats: Callable[[], Awaitable[runtime.Stats]], **kwargs) -> None

Serving loop implementation.

apply: Callable[[str, layout.Request], Awaitable[layout.Response]]

Prediction request handler provided by the engine. The handler expects two parameters - the target application name and the prediction request.

stats: Callable[[], Awaitable[runtime.Stats]]

Stats producer callback provided by the engine.


**kwargs

Additional keyword arguments provided via the constructor.
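A concrete provider might implement this contract along the following lines. This is a hedged sketch: the layout/runtime types are replaced by simple stand-ins, the stats callback is omitted for brevity, and the transport (an asyncio queue instead of a real network protocol) is purely illustrative:

```python
# Sketch of a gateway provider honoring the documented run() contract.
# Request/Response and the queue transport are assumptions standing in for
# the real forml.io.layout types and an actual network protocol.
import asyncio
import typing

Request = bytes
Response = bytes


async def echo_apply(application: str, request: Request) -> Response:
    # Stand-in for the engine-provided prediction handler.
    return b'%s:%s' % (application.encode(), request)


class QueueGateway:
    """Toy frontend transporting requests over an asyncio queue."""

    @classmethod
    async def run(
        cls,
        apply: typing.Callable[[str, Request], typing.Awaitable[Response]],
        queue: asyncio.Queue,
    ) -> None:
        # Serving loop: pull (application, request, future) items until None.
        while (item := await queue.get()) is not None:
            application, request, future = item
            future.set_result(await apply(application, request))


async def main() -> Response:
    queue: asyncio.Queue = asyncio.Queue()
    server = asyncio.create_task(QueueGateway.run(echo_apply, queue))
    reply: asyncio.Future = asyncio.get_running_loop().create_future()
    await queue.put(('demo', b'payload', reply))
    response = await reply
    await queue.put(None)  # shut the serving loop down
    await server
    return response


print(asyncio.run(main()))  # b'demo:payload'
```

A real provider would instead bind a network listener inside run() and translate the wire protocol to/from the apply handler, while the stats callback would back a monitoring endpoint.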

Service Management

The gateway service can be managed using the CLI as follows (see the integrated help for full synopsis):

Use case                      Command
----------------------------  ---------------------------
Launch the gateway service    $ forml application serve

Gateway Providers

Gateway providers can be configured within the runtime platform setup using the [GATEWAY.*] sections.
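For illustration, such a section might look like the following fragment, assuming a TOML-based platform configuration; the section name and keys shown (provider, port) are assumptions to be checked against the particular provider's documented options:

```toml
# Hypothetical [GATEWAY.*] section - keys shown are assumptions, not verified.
[GATEWAY.rest]
provider = "rest"   # gateway provider implementation reference
port = 8080         # transport-specific option
```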

The available implementations are:


Serving gateway implemented as a RESTful API.