Cognitive Engine Developer Guide

Document Overview

This document is aimed at a technical audience and gives a detailed description of the underlying Cognitive Engine architecture, enabling developers to:

  1. Integrate the Cognitive Engine as a Software Project Dependency
  2. Extend the functionality of the Cognitive Engine with Plugins

Before reading this document, it is essential to read the User Guide, which covers all the base functionality and operation of the Cognitive Engine.

Architecture

General Architecture

The engine is built around the concept of a training graph. Each training graph creates a number of artifacts (typically one or more trained classifiers or a single regressor in a package file, plus supporting artifacts such as label files and test and measurement results).

Conceptually, a training graph captures and models the relationships between the different categories the data to be classified falls into. Each category – or label – corresponds to a node. Nodes are connected with directed edges, which specify a relationship between nodes. This allows for complex category topologies that can model a very large number of business scenarios.

Besides modelling the high-level business representation, the training graph also drives the training of the actual classifiers or the regressor. Typically, each node in a graph will correspond to a classifier which is trained to predict that node's label.

The process of training each individual classifier is also modelled by another graph – a pipeline subgraph – that is attached to each node. Each of these subgraphs normally has the shape of a sequence of steps, each step corresponding to a node in the subgraph. Typically, a pipeline subgraph is composed of one data source step, one or more data transformation steps and a training step (and possibly a test step).

Specifying Training Graphs

A training graph is built according to the configuration specified in the YAML configuration file. Almost everything, including the actual structure of the training graph, is either specified in or derived from this file.
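Purely as an illustration of the idea, such a configuration might describe the nodes, the edges connecting them and the pipeline attached to each node along the following lines (every key and value here is a hypothetical sketch, not the engine's actual schema):

# Hypothetical sketch only; key names do not reflect the engine's actual schema.
nodes:
  - label: complaint
    pipeline: text-pipeline
  - label: refund-request
    pipeline: text-pipeline
edges:
  - from: complaint
    to: refund-request
pipelines:
  text-pipeline:
    steps:
      - csv-source
      - tokenise
      - train-classifier
      - test-classifier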

Typically, training graphs are constructed by the train function of the engine. This function will prepare the configuration, initialise the context with it, and build the training graph and pipeline subgraphs, one of the latter for each node of the former. Once it has built the training graph, it will traverse it in topological order; for each node visited, the attached pipeline subgraph is traversed in the same way and each step function is executed.

The output of the train function is the training graph, with the training context dictionary modified by all the steps and sub-pipelines. File artifacts (including the package file) are persisted to disk during pipeline execution, in a timestamped directory inside the output directory.

Pipeline Subgraphs

A pipeline subgraph is a graph attached to each node of the training graph. It models the process of training an individual classifier or regressor. This process is normally a linear sequence of steps – a pipeline – each of which modifies a context map and passes it down the pipeline. In other words, a pipeline is a sequence of functions – the steps – applied to a data structure – the context map.

A step is a function that takes a map data structure as input (called the training session context), modifies it and returns it. The context contains the state of the training session, including data, parameters, results and the configuration necessary for each step.
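To make the execution model concrete, the following self-contained Java sketch threads a context map through a sequence of steps. Everything in it is invented for illustration; the engine itself is written in Clojure and its internal step implementations differ.

import java.util.*;

// Illustrative sketch of the pipeline execution model described above.
// The Step interface, step bodies and context keys are all hypothetical.
public class PipelineSketch {

    // A step takes its parameters and the session context, and returns the
    // (possibly) modified context.
    interface Step {
        Map<String, Object> run(Map<String, Object> params, Map<String, Object> context);
    }

    public static void main(String[] args) {
        // One node's pipeline: a data source step, a transformation step and a training step.
        List<Step> pipeline = List.of(
            (params, ctx) -> { ctx.put("data", "rows loaded from source"); return ctx; },
            (params, ctx) -> { ctx.put("features", "transformed data"); return ctx; },
            (params, ctx) -> { ctx.put("model", "trained classifier"); return ctx; }
        );

        // Thread a single context map through every step, in order.
        Map<String, Object> context = new HashMap<>();
        for (Step step : pipeline) {
            context = step.run(Collections.emptyMap(), context);
        }
        System.out.println(context); // state accumulated by all the steps
    }
}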

The structure of each pipeline subgraph, as well as the steps composing it, can be configured in the YAML configuration file.

Pipeline Step

A pipeline step function is the basic building block for composing pipeline subgraphs.

The primary input of a step is a map (the training session context). The output of a step is the (possibly) modified context map.

Configuring the Environment

The Cognitive Engine is available as a Maven-style dependency that any Java build system can use. All that is required is to add the dependency string to your build file:

Gradle

'co.chatterbox:engine:x.x.x'

Maven

<dependency>
    <groupId>co.chatterbox</groupId>
    <artifactId>engine</artifactId>
    <version>x.x.x</version>
</dependency>

Leiningen

[co.chatterbox/engine "x.x.x"]

Add the Chatterbox Labs repo and authentication to your build file to have access to all the Cognitive Engine interfaces and functions.

Extending the Cognitive Engine

The Cognitive Engine is programmed in Clojure –a Lisp-like functional language that compiles to JVM bytecode – which allows it to take advantage of the portability of the Java platform and the extensive ecosystem around it.

This also means the Cognitive Engine can be extended in any JVM-based language such as Clojure, Groovy, Kotlin and, of course, Java.

There are two main ways the Engine can be extended:

  1. As a dependency in an external software project
  2. With plugins

Both ways can be used individually or in combination.

The Cognitive Engine as a Software Project Dependency

The Cognitive Engine can be included as a dependency in an external software project and accessed directly.

Training

This allows the train function to be called programmatically.

Depending on the language you're developing in, you can either directly require the Clojure namespace that contains the train function, or import the co.chatterbox.engine.Engine class, which is a Java-friendly fluent wrapper object around it.

In Clojure:

(ns com.mycompany.myproject.myns
  (:require [co.chatterbox.engine.core :refer [train]]))

(def params {...}) ; map of train params

(train params) ; call train function

Importing the Engine class from Java:

import co.chatterbox.engine.Engine;
import java.util.Map;

public class App {

    public static void main(String[] args) {
        String yaml = "path/to/YAML/file";          // YAML training configuration
        String t_home = "path/to/output/directory"; // where artifacts will be written
        String csv = "path/to/csv/data/file";       // CSV training data
        Engine engine = new Engine(t_home);
        // Configure the engine and train; the final training context is returned.
        Map context = engine.withFileConfig(yaml).withLimit(20).trainCSV(csv);
        System.out.println(context);
    }
}

Prediction

The completed package file is ready for use in a live deployment environment.

The code for running prediction is decoupled from the package file – this is so that package files can be trained, stored and deployed independent of an organisation’s development schedule.

Within your software stack, follow these steps to classify new texts in real time against your models:

  1. Import the Engine class, held in the co.chatterbox.engine Java package.
  2. Instantiate the Engine class, and use its various configuration methods. You'll need to pass the location of the package file. This can be a local file, an HTTP(S) address, an FTP address or an Amazon S3 location. For example:
Engine engine = new Engine().withPackage("file:/C:/Users/User1/vendor-product-en-100.package");
  3. Get a Predictor instance from the Engine object, and call the predict() method with the text to produce a PredictionResult:
Predictor predictor = engine.predictor();
PredictionResult res = predictor.predict("your text");
  4. Call getClassifications() on the PredictionResult to produce a classifications map. The returned map has the classifier_id of each classifier as keys, mapped to that classifier's confidence score, where a positive score represents an association with labels_a and a negative score represents an association with labels_b.
  5. Call getPrediction() on the PredictionResult to traverse the hierarchy and produce a list of class labels, based on the classification confidence scores, for each level of the hierarchy.

For a classification example with a single text input field:

import co.chatterbox.engine.Engine;
// Predictor and PredictionResult are assumed here to live in the same
// co.chatterbox.engine package as Engine.
import co.chatterbox.engine.Predictor;
import co.chatterbox.engine.PredictionResult;

public class App {

    public void myFunction() {
        Engine engine = new Engine().withPackage("file:/C:/Users/User1/vendor-product-en-100.package");
        Predictor predictor = engine.predictor();
        PredictionResult res = predictor.predict("your text");
        System.out.println(res.getClassifications());
    }
}

The equivalent procedure for a hypothetical regression package is almost the same, except that multiple input field values are typically provided as a java.util.Map in step 3 and getPredictions() should be called instead of getClassifications() in step 4.

In the example code below, suppose the regression model has three predictor fields (Sales-Month, Price-Range and Months-Since-Launch):

import co.chatterbox.engine.Engine;
// Predictor and PredictionResult are assumed here to live in the same
// co.chatterbox.engine package as Engine.
import co.chatterbox.engine.Predictor;
import co.chatterbox.engine.PredictionResult;
import java.util.*;

public class App {

    public void myFunction() {
        Engine engine = new Engine().withPackage("file:/C:/Users/User1/sales-forecast-en-100.package");
        Predictor predictor = engine.predictor();
        // One value per input field of the regression model.
        Map<String,Object> inputs = new HashMap<>();
        inputs.put("Sales-Month", "10");
        inputs.put("Price-Range", "HIGH");
        inputs.put("Months-Since-Launch", "3");
        PredictionResult res = predictor.predict(inputs);
        System.out.println(res.getPredictions());
    }
}

In the above code, note that the call to PredictionResult.getPredictions() returns a java.util.Map containing a single entry: its key is the pipeline name specified in the YAML file used to train the regression model, and its value is the value predicted by the model.

Forecasting

To forecast with a time series package, the recent history of each time series needs to be provided, specifying at least as many values as the train_window parameter set when the package was trained. In the following example, the package contains a time series model for the river-level field, trained with a train_window of 4. We can obtain and print a forecast for future river levels given the recent values [11.1, 10.7, 9.9, 9.8] using the following code:

import co.chatterbox.engine.Engine;
// Predictor and PredictionResult are assumed here to live in the same
// co.chatterbox.engine package as Engine.
import co.chatterbox.engine.Predictor;
import co.chatterbox.engine.PredictionResult;
import java.util.*;

public class App {

    public void myFunction() {
        Engine engine = new Engine().withPackage("file:/C:/Users/User1/river-levels-en-100.package");
        Predictor predictor = engine.predictor();
        // Construct the time series of recent values for river-level,
        // providing at least train_window (here 4) values.
        List<Map<String,Double>> data = new ArrayList<>();
        for (double level : new double[] {11.1, 10.7, 9.9, 9.8}) {
            Map<String,Double> point = new HashMap<>();
            point.put("river-level", level);
            data.add(point);
        }
        // Obtain a forecast for future values and print them.
        PredictionResult res = predictor.forecast(data);
        List<Map<String,Double>> forecasts = res.getForecasts();
        for (int i = 0; i < forecasts.size(); i++) {
            System.out.println("Forecast (" + i + ") " + forecasts.get(i).get("river-level"));
        }
    }
}

The number of values printed when this code is run will depend on the value of the forecast_window parameter defined when the package was trained.

Plugins

The plugin architecture allows partners to build bespoke functionality into the engine and take advantage of existing in-house development.

Each time it is executed, the Cognitive Engine will look for plugins on both the Java classpath and in a local file-system directory that can be specified.

The plugin mechanism rests on Java’s Service Provider architecture. Plugins are Service Providers that implement a Service interface defined by the Cognitive Engine.

The Cognitive Engine Service interface is the co.chatterbox.engine.PipelineStep abstract class, which defines the step and getKey methods:

public abstract Map<Keyword, Object> step(Map<String, Object> params,
                                          Map<Keyword, Object> context);
public abstract String getKey();

The step method takes a String-keyed Map of step parameters and the training session context as a Keyword-keyed Map, in that order. It returns a Keyword-keyed Map, which is the modified context.

The getKey method returns a string identifier for the plugin step. This identifier will be used to invoke the plugin as one of the steps specified in the training YAML configuration file.
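For example, if a plugin's getKey method returns "my-custom-step", a pipeline definition in the YAML file might reference it along these lines (the surrounding keys are a hypothetical sketch, not the engine's actual schema):

pipelines:
  text-pipeline:
    steps:
      - csv-source
      - my-custom-step   # the identifier returned by the plugin's getKey method
      - train-classifier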

A Cognitive Engine plugin, then, is composed of one or more classes – implemented in any JVM language – that extend PipelineStep and provide implementations for the step and getKey methods.

These classes should then be packaged as a JAR file following some simple requirements according to the Java Service Provider architecture; namely, there must be a file named co.chatterbox.engine.PipelineStep in the META-INF/services directory of the plugin JAR. This file should contain the fully-qualified class names of the classes extending PipelineStep.

For example, suppose you have a class called MyPipelineStep in the com.mycompany.myproject package that extends PipelineStep.
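As a concrete sketch, such a class might look like the following. The "threshold" parameter and the context key are invented for illustration, and the incoming context is copied defensively on the assumption that the map handed to the step may be immutable:

package com.mycompany.myproject;

import clojure.lang.Keyword;
import java.util.HashMap;
import java.util.Map;

import co.chatterbox.engine.PipelineStep;

public class MyPipelineStep extends PipelineStep {

    @Override
    public Map<Keyword, Object> step(Map<String, Object> params,
                                     Map<Keyword, Object> context) {
        // Copy the context in case the map passed in is immutable.
        Map<Keyword, Object> out = new HashMap<>(context);
        // Read a (hypothetical) step parameter configured in the YAML file
        // and record it in the training session context under a Keyword key.
        out.put(Keyword.intern("my-plugin", "threshold"), params.get("threshold"));
        return out;
    }

    @Override
    public String getKey() {
        // The identifier used to reference this step in the YAML configuration.
        return "my-custom-step";
    }
}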

This is how the expanded internal JAR structure should look:

  • my-plugin.x.x.x
    • com
      • mycompany
        • myproject
          • MyPipelineStep.class
    • META-INF
      • MANIFEST.MF
      • services
        • co.chatterbox.engine.PipelineStep

The co.chatterbox.engine.PipelineStep file should contain the string:

com.mycompany.myproject.MyPipelineStep

When the Cognitive Engine is run, the Java Service Provider mechanism will automatically detect any plugin JARs on the classpath that conform to the Service Provider convention, and you'll be able to use the custom steps in the YAML file. You can also specify a directory for the Cognitive Engine to scan for plugin JARs using the --plugin-dir command-line option.

Get in Touch