Cognitive Engine User Guide

Document Overview

This document gives a detailed description of the configuration and operation of Chatterbox Labs’ Cognitive Engine, including YAML file configuration.  It should be read in combination with the Gold Standard Training Data documentation.  If you wish to extend the Cognitive Engine or integrate it into another software stack, please read the Developer Documentation.

Installation & Environment Configuration

The Cognitive Engine is programmed to run on the Java Virtual Machine (JVM) platform. 

The only software requirement of the Cognitive Engine is that a Java Runtime Environment (Java SE 8) be available on the machine.

Hardware Requirements

As the Cognitive Engine is programmed using Clojure and Java, it can run on any operating system that supports a Java Virtual Machine. 64-bit operating systems are recommended for optimal memory usage.

It is recommended that the Cognitive Engine is installed on a centrally accessible server. Each organization will have its own standard configuration, but as a benchmark, a comparable server setup is:

  • CPU: Multi-core (4 minimum, 16 recommended) Xeon/EPYC/i7 @ 3GHz 
  • Memory: 128GB Recommended (minimum of 16GB) 
  • OS: Ubuntu Linux (64 bit) or Microsoft Windows (64 bit), running Java SE8 (64 bit)
  • Disk: 200GB (it is recommended that the Cognitive Engine is installed in a directory that is accessible and writable by all users, rather than in each home directory)
  • Connectivity: Remote access (ssh / MS Remote Desktop) and access to the outside web

General Operation

The usual process of operating the Cognitive Engine is as follows:

  1. Prepare Data. Read the Gold Standard Training Data document and ensure that your training data is of good quality and accessible by the Cognitive Engine
  2. YAML. Create a YAML file that gives configuration for the engine, and specifies each model to be built
  3. Train. Run the Cognitive Engine jar at the command line (or call the train function inside your software stack) with your YAML file to train your models against your training data. The output will be written into a time-stamped directory inside the output directory
  4. Check. Check the training session log (in the output directory) for any warnings that have been logged during training
  5. Results. Check the cross-validation results for each model within the results directory (in the output directory)
  6. Prediction/Forecasting. Run the Cognitive Engine jar at the command line with the package file (found in the payload directory) to predict values for the target field with new, unseen data.

Training

The Cognitive Engine is distributed as an application packaged in an executable JAR file. It can be run like any other JAR file, through the Java binary with the -jar option. Using the Windows Command Prompt or a Linux/Mac shell, a common invocation for training is:

java -jar engine-x.x.x-standalone.jar train --csv <training csv path> <yaml path>

The Cognitive Engine executable accepts the following commonly used command-line options:

  1. --home – Optional path to the directory for time-stamped output
  2. --csv – Optional path to the location of a CSV training file
  3. --json – Optional path to the location of a JSON training file
  4. --sql – Optional URL to specify a SQL training database
  5. --limit – Optional limit on the number of training data points per class
  6. --label-limit – Optional limit on the number of training data points per label
  7. --plugin-dir – Optional path to the location of a directory containing plugins
  8. --test – Optional flag to train with hold-out testing
  9. --dry – Optional flag to run the Engine without training any models (useful for configuration file sanity testing)
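
These options can be combined in a single invocation. For example, the following command (the paths and values are illustrative) trains from a CSV file, writes the time-stamped output to a shared directory, caps the training data per class and enables hold-out testing:

java -jar engine-x.x.x-standalone.jar train --csv /data/training.csv --home /srv/cogengine/output --limit 5000 --test my_models.yaml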

Run the engine with the --help option to see the complete set of training options:

java -jar engine-x.x.x-standalone.jar train --help

After successful execution, the output directory of the training session (created automatically) will have the following structure:

  • output
    • timestamp
      • payload
        • *.package (payload package ready for deployment)
      • production
        • full.yaml (complete YAML file with all configuration)
        • cb.params.* (parameters required for prediction)
        • cb.models (model ids used in prediction)
        • cb.features (production features)
        • *.model (production models)
      • results (HTML and CSV reports with precision, recall and f-score, for each model trained)
        • train_report.html (overview comparison metrics for the best models trained as tested using cross validation)
        • train_report.csv (detailed overview comparison metrics for the best models trained with corresponding parameters)
        • *_model_report.csv (per classifier cross-validation results for each parameter set)
        • *_xv_wrongs.csv (texts written to file for error analysis)
        • Optional: test_report.html (overview comparison metrics for the best classifiers trained as tested using a hold-out test set, defined by the test_train field)
      • full.yaml (complete YAML file with all configuration)
      • full.log (training session log file)

Parameter Selection

The Cognitive Engine will carry out automatic parameter selection based on cross-validation testing (which avoids overfitting). The detailed results for each set of parameters are written into the results directory. The best set of parameters to build the final model is selected using the eval_method parameter defined in the YAML file.

Pipelines for training Classification, Regression and Time Series Models

At training time, the Cognitive Engine will build one or more models using the range of algorithms at its disposal.  For classification problems, a separate binary classifier model is usually trained to discriminate between a specific label value and all other labels which are present in the target field.  For regression problems, a single model is trained to be able to estimate numeric values of the target field. For time series problems, a separate model is trained on each time series in the data.

Each of these models (either a regression model, a time series model or a binary classifier) is trained using a pipeline.  The pipeline can specify how the training data is retrieved, how the data is prepared for machine learning and which machine learning algorithms should be used. Default settings suitable for classification problems are provided for the pipeline and each of its steps, but these can be overridden.

Assessing Performance for Classification

When assessing the performance of the classifiers trained, the most common file to load is train_report.html which will report, for each classifier:

  • Accuracy – Calculated as the proportion of data points correctly classified
  • Precision – Considered a measure of quality (‘of the data points classified as X, how many were correct’)
  • Recall – Considered a measure of coverage (‘of the available Xs in the dataset, how many were correctly classified as X’)
  • F-Score – A mixture of Precision and Recall, calculated as 2 × (Precision × Recall) / (Precision + Recall)

Accuracy does not account for imbalanced datasets and often presents over-inflated performance measures, whereas the other three measures do account for this imbalance. We recommend using F-Score as the metric with which to evaluate performance.
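
As an illustration of these measures (with invented numbers): suppose a dataset contains 40 data points that are truly X, and a classifier marks 50 points as X, of which 35 are correct. Precision is then 35/50 = 0.70, Recall is 35/40 = 0.875, and the F-Score is 2 × (0.70 × 0.875) / (0.70 + 0.875) ≈ 0.78.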

Assessing Performance for Regression and Time Series

Several metrics are provided for assessing the performance of regression and time series models:

  • Mean Absolute Error – Calculated from the mean of the absolute difference between the predicted and the actual values
  • Mean Squared Error – Calculated from the mean of the square of the absolute difference between the predicted and the actual values. This metric favours models which avoid making large errors
  • For time series models, the metrics are averaged over the forecast window.
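
As a small invented illustration: if a model predicts the values 10 and 12 where the actual values are 11 and 15, the absolute errors are 1 and 3, giving a Mean Absolute Error of (1 + 3) / 2 = 2 and a Mean Squared Error of (1 + 9) / 2 = 5. Note how the single larger error dominates the Mean Squared Error, which is why that metric favours models that avoid large errors.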

Configuration with YAML files

YAML files are used to configure the Cognitive Engine as they are human readable, key/value text files.  They are very important for:

  1. Formalizing model design
  2. Maintaining consistency through staff changes
  3. Repeatability of services across clients

The first key in any YAML file is the engine key which contains general configuration settings for running the Cognitive Engine and producing a package.  A sample engine configuration is as follows:

engine:
    version: 2
    release:
        pkg_name: my_package
        pkg_version: 4.9.1
        pkg_filename: my_sample_package.package

The release configuration manages the creation of the final package file.  This package file is the output of the training process of the Cognitive Engine and contains all configuration, models and feature files ready for deployment and hence prediction on new data.  It is passed back to the prediction process of the Cognitive Engine.  Each of these values should be updated to reflect your package.  The meta parameters are optional and are present in case you have a use case which requires additional metadata to be written to the package that cannot be captured using the standard YAML keys.

Following on from the engine configuration, one must specify the models to be built by declaring pipelines, which depend upon whether the problem at hand is one of classification, regression or time series.  See the following sections for more details.
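
For instance, optional metadata can be carried through to the package via meta keys under release. A sketch (the comment and notes keys are illustrative values, following Appendix A):

engine:
    version: 2
    release:
        pkg_name: my_package
        pkg_version: 4.9.1
        pkg_filename: my_sample_package.package
        meta:
            comment: "risk dept model"
            notes: "not for production use"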

Cognitive Engine Pipeline Structure

Cognitive Engine pipelines are used to train either a binary classification model, a regression model or a time series model. In each case the pipeline is structured as a linear sequence of steps:

  • cb.pipeline.retrieve_data
    • Retrieve data from a data source, defining the field types and roles.  The default values for this step are designed for classification problems and assume that there is an input field called text to use to build input features to the model, a field called label which provides the target category, a field called id which provides a unique identifier for each row and a field called test_train which is assigned either “test” or “train” and describes whether each row belongs to the training or test set.
  • cb.pipeline.tokenise (optional, but use if any fields contain text that needs to be parsed)
    • Parse any text fields into a stream of tokens.
  • cb.pipeline.missingvals (optional, but use if any numeric fields need to have missing values filled)
    • Replace missing values in particular fields with some substitute (for example the mean value).
  • cb.pipeline.features
    • From the retrieved (and, if necessary, tokenised) data, create the features ready to be used to train the machine learning models. As an alternative, when dealing with text input that results in a large number of features, consider using cb.pipeline.vectorise.lda instead, which performs dimensionality reduction.
  • cb.pipeline.train.* (train a model, supplying the algorithm name from the set (linear, svm, naivebayes, regression, neural or randomforest) or auto to select the best classification algorithm).
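
For example, a classifier whose text fields need tokenising and whose numeric fields need missing values filled might override the default steps as follows (a sketch only – include the optional steps only when your data requires them):

sport:
    params:
        pipeline:
            steps:
            - cb.pipeline.retrieve_data
            - cb.pipeline.tokenise
            - cb.pipeline.missingvals
            - cb.pipeline.features
            - cb.pipeline.train.linear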

YAML Configuration for Classification

When working on a classification problem, a minimal YAML file should provide the labels for each classifier.  These can be hierarchical (for example, if building a topic-based classifier, football and basketball are types of sport). For example:

sport:
    football:
    basketball:
music:
    jazz:
    rock:
business:

Using this type of configuration, 7 classifiers will be built using the default pipeline configuration for classification. 

If a YAML file contains multiple classes for a classification task, these will be modelled as multiple binary classifiers but operate in unison for prediction, creating a multi-class classifier.  For classification models there is no need to add additional configuration.  However, if you wish to modify some of the defaults for each classifier you can provide custom configuration.  This is placed under the params key of each classifier.  For example, to specify the evaluation metric to be precision and set the data source to be a CSV file, one could modify the sport classifier to the following:

sport:
    params:
        pipeline:
            eval_method: precision
        cb.pipeline.retrieve_data:
            data_source_type: csv
            data_source_config:
                csv:
                    filepath: my_training_file.csv        
    football:
    basketball:

At runtime the configuration that is contained in your YAML file is merged with the default configuration contained within the Cognitive Engine.  Therefore, you only need to specify the parameters that you wish to change from the default setting in your YAML file.

A full explanation of each YAML parameter is given in Appendix A.  Some example YAML configurations are provided in Appendix B.

YAML file generation.  As the configuration of your classifiers becomes the merge of your YAML file and the default configuration, it is helpful to have an explicit definition of this in a file.  On each training run, the Cognitive Engine will write the complete configuration to a new YAML file (stored in the output directory).  This explicit definition of your classifiers becomes the blueprint and can be used for reference, debugging or in further training sessions.

Training Data. Gold standard training data of good quality is essential for training with the Cognitive Engine.  Please read the Gold Standard Training Data document alongside this one.

Configuring the retrieve_data Step

The Cognitive Engine also supports the classification of categorical and numerical data that does not feature any text component.  For this to take place, the input file must feature one column for each categorical or numerical variable to be passed to the learning algorithm.  There is no requirement to scale, smooth or normalize the data – the engine will handle all of this.  It is necessary, however, to ensure that all the values in these columns contain correctly formatted data.

By default, the Cognitive Engine will expect to find in the data:

  • a field containing unique identifiers called id
  • a single target field called label
  • a field called test_train which contains the value “train” if the row is to be used for training and “test” if the row is to be placed in a holdout set and used for computing test statistics
  • a single predictor field called text containing text data.  If this is not present all remaining fields are used as predictors
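
A minimal training CSV using these default field names might therefore begin as follows (the rows are illustrative):

id,label,test_train,text
1,sport,train,"United won the match 2-1 last night"
2,music,train,"The quartet's new album is excellent"
3,business,test,"Shares fell sharply after the earnings call"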

These default field names can be overridden in the YAML file under the cb.pipeline.retrieve_data step. Consider the trivial example, in which there are 4 predictor variables to model: age, profession, income and comments, and a target field called risk.  The YAML configuration of the retrieve_data step will look like this:

cb.pipeline.retrieve_data:
   target: risk
   data_fields: [age, profession, income, comments]

Alternatively (this is useful when there are a large number of predictors but also a few fields that should not be used for modelling) a set of fields to ignore can be specified – all remaining fields will be used as predictors:

cb.pipeline.retrieve_data:
   target: risk
   ignore_fields: [star_sign, favourite_colour]

The data types of fields will be automatically inferred as one of “c” (categorical), “n” (numeric) or “t” (text) but these can also be overridden using the “data_types” key:

cb.pipeline.retrieve_data:
   target: risk
   data_fields: [age, profession, income, comments]
   data_types: { profession: c, income: n, comments: t }

Classifier Design and Hierarchies

Each classifier built by the Cognitive Engine is a binary (two-class) classifier.  The Cognitive Engine will automatically select the data required to train each side of this binary classifier.  If you want to override the defaults you may do so by specifying the labels for each classifier in YAML – you should provide a list of labels under the labels_a and labels_b keys in YAML.

The automated approach that the Cognitive Engine takes is to build a 1 vs everything else binary classifier for each class in your data. This is useful because, when you come to prediction, the engine will provide you with a confidence score for every classifier in your package file.

The 1 vs everything else approach is applied on a hierarchical basis.  Taking our earlier example, the Cognitive Engine will define the classifier for sport to have:

labels_a: ['sport','sport.football','sport.basketball']
labels_b: ['music','music.jazz','music.rock','business']

As the Cognitive Engine progresses down the hierarchy, the 1 vs everything else notion is applied locally to that branch. For example, the sport.football classifier will be defined as:

labels_a: ['sport.football']
labels_b: ['sport.basketball']

This enables the training process to focus on the local differentiating features. 
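
To override these defaults explicitly, the label lists can be supplied in your YAML. A sketch (assuming, per Appendix A, that labels_a and labels_b sit under the cb.pipeline.retrieve_data step of the classifier’s params):

sport:
    params:
        cb.pipeline.retrieve_data:
            labels_a: ['sport', 'sport.football', 'sport.basketball']
            labels_b: ['music', 'music.jazz', 'music.rock', 'business']
    football:
    basketball: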

When executed in prediction, the Cognitive Engine will output the full path to the final node in the hierarchy (for example, sport.football).

YAML Configuration for Regression

To address a regression problem the YAML file should be configured with a single pipeline that will train a regression model. In the minimal YAML pipeline configuration below the name p0 is used, but any descriptive name can be chosen.  This YAML file configures the engine to train a regression model to predict the target field Resale Value, given input data fields Age, Mileage and Manufacturer.

p0:
  params: &params
    pipeline: &pipeline
      auto_labels: False
      steps:
      - cb.pipeline.retrieve_data
      - cb.pipeline.features
      - cb.pipeline.train.regression

    cb.pipeline.retrieve_data:
      target: "Resale Value"
      data_fields: ["Age","Mileage","Manufacturer"]
      data_types: { "Resale Value": n }

YAML Configuration for Time Series

To tackle a time series problem the YAML file should be configured with a single pipeline that will train time series models on one or more target fields, where each target field contains a numeric time series. In the minimal YAML pipeline configuration below the name p1 is used, but any descriptive name can be chosen.  This YAML file configures the engine to train time series models to forecast the target fields Temperature, Flow and Vibration.

p1:
  params: &params
    pipeline: &pipeline
      auto_labels: False
      steps:
      - cb.pipeline.retrieve_data
      - cb.pipeline.features
      - cb.pipeline.train.timeseries

    cb.pipeline.retrieve_data:
      target: ["Temperature","Flow","Vibration"]
      meta_fields: ["id", "test_train"]
      time_step_field: "step"

    cb.pipeline.train.timeseries:
      train_window: 72
      forecast_window: 12
      evaluation_metric: "rmse"

Time series modelling requires the data to contain a field specifying an integer-valued time step for each row. There should be no gaps in the sequence of time step values. This field should be specified in the YAML using the “time_step_field” setting within the “retrieve_data” step – if it is not specified, the engine will attempt to use the “id” field for this purpose. An optional test_train field may be included to divide the data into a single test set and train set, but the ranges of time step values in these sets cannot overlap. The names of the fields containing the time series to be modelled are specified in the cb.pipeline.retrieve_data.target setting. If the target list is set to [“*”] then all fields not present in “meta_fields” or the “time_step_field” will be used as targets.
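
For the example above, suitable input data might begin as follows (illustrative rows; note the gap-free integer step field):

step,Temperature,Flow,Vibration,test_train
1,20.1,3.4,0.02,train
2,20.4,3.3,0.02,train
3,20.3,3.5,0.03,train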

The cb.pipeline.train.timeseries step needs to be configured with three important parameters.

  • train_window: the number of steps that the model uses to fit the time series model. When forecasting from a trained model, the input data will need to contain this many steps.
  • forecast_window: the number of steps ahead that the model will be required to forecast. Models are chosen which minimise the mean value of the evaluation metric over this forecast window.
  • evaluation_metric: the metric to use when evaluating and comparing different time series models to select the best performer. The value should be one of “rmse” (root mean squared error), “mse” (mean squared error) or “mae” (mean absolute error).

Using the Trained Package

The complete package file, having been trained for a classification, regression or time series task, is ready for use in a live deployment environment.  This can be integrated into a custom software stack with a bespoke UI (see the Developer Guide) or executed from the same jar used for training.

Prediction for classification and regression models

Prediction is carried out in the same way as training; however, instead of the train task the predict task is used. The predict task requires as input the package file created during training and the data containing the predictor value(s), thus:

java -jar engine-x.x.x-standalone.jar predict <package file path> <new data csv path>

Once complete, this process will write a CSV of classified data to the output directory.  This CSV will contain all the original input data but also:

  • For classification packages:
    • The confidence scores for classification against each classifier in the yaml file
      • This is important so that these raw scores can be interrogated for your use case (such as sorting, or comparing the final data).
    • A final class label. This is calculated by selecting the class with the highest confidence score.
    • You may wish to modify this approach, which can be done by interrogating the raw confidence scores.
  • For regression packages:
    • The predicted numeric value produced by the model.

This CSV is in a standard format and so can be loaded into other software (such as visualisation tools) without the need to integrate the Cognitive Engine codebase.
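
As an illustration only (the exact column layout depends on your package and classifiers), the prediction output for the sport/music/business example might resemble:

id,text,sport,music,business,label
101,"The final went to penalties",0.91,0.05,0.04,sport
102,"Quarterly profits beat expectations",0.08,0.11,0.81,business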

Forecasting for time series models

Forecasting is carried out in the same way as training; however, instead of the train task the forecast task is used. The forecast task requires as input the package file created during training and the data containing recent values of the time series upon which the forecast will be based:

java -jar engine-x.x.x-standalone.jar forecast <package file path> <recent data csv path>

Once complete, this process will write a CSV of the forecast data to the output directory.  This CSV will consist of columns containing the forecast values for each of the time series in the trained package and will contain one row for each step in the forecast_window defined when the model was trained.

Extensibility

The Cognitive Engine supports a plugin architecture for easy extension. The Cognitive Engine will look for plugins in the following locations:

  1. On the Java classpath
  2. In a .cogengine directory under the user’s home directory
  3. In a folder specified by the --plugin-dir option to the Cognitive Engine
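
For example, to make plugins in a shared folder available during training (an illustrative invocation):

java -jar engine-x.x.x-standalone.jar train --plugin-dir /opt/cogengine/plugins --csv /data/training.csv my_models.yaml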

Refer to the Cognitive Engine Developer Guide on how to create and apply custom plugins.

Appendix A. YAML Configuration Reference

Each key in this table can be placed underneath a classifier params specification in YAML.  Keys are grouped by the pipeline step they belong to; each entry gives the key name, its description and an example value.

pipeline

  • classifier_id – A string name for the classifier. This is used to identify the classifier later in prediction. Example: sport
  • test – A Boolean value – if True the classifiers produced are tested against any manually annotated data which has ‘test’ in the ‘test_train’ field. The default value is False. Example: True
  • eval_method – The method that the engine should use in order to select the best model for the final deployment package (after iterating over multiple cost values). Options are ‘accuracy’, ‘precision’, ‘recall’ or ‘fscore’. The default value is fscore. Example: fscore
  • log_xv_text – A Boolean value – if True the texts of wrong cross-validation predictions are logged in the output. Example: True
  • auto_labels – A Boolean value – if True, will infer the labels from the classifier_id values. Defaults to True. Example: True
  • steps – The list of steps used to train a model. The default list is: cb.pipeline.retrieve_data, cb.pipeline.tokenise, cb.pipeline.features, cb.pipeline.train.linear. Example: cb.pipeline.retrieve_data, cb.pipeline.features, cb.pipeline.train.bayes
cb.pipeline.retrieve_data

  • target – The field containing the target values. The default is “label”. Example: “Resale Value”
  • data_fields – The fields to use as predictors. The default is [text]. Example: [Age, Mileage, Manufacturer]
  • id_field – A field which will contain a unique value for each row. The default is “id”. Example: customer_id
  • test_train_field – The field containing values “train” (use the row for training) or “test” (do not use for training, keep as a hold-out for testing). The default is “test_train”. Example: partition_name
  • meta_fields – Fields containing metadata. Example: [id, test_train]
  • ignore_fields – A list of the fields to ignore. Example: [address, ”mobile number”]
  • time_step_field – An incrementing integer field specifying the time step associated with each row, used in time series data. Example: tstep
  • data_types – Override the automatically-selected types chosen for each field, where: n – field is numeric/continuous; c – field is categorical; t – field is text. Example: { Age: n, Mileage: n, Manufacturer: c, “Service Log”: t }
  • data_source_type – The chosen data source type, either ‘csv’, ‘json’ or ‘sql’. The default is ‘csv’. Example: csv
  • data_source_config.csv.filepath – The path to the CSV file which contains the data to be used for training. Example: /User1/training.csv
  • data_source_config.csv.encoding – The encoding of the CSV file. Example: “UTF-8”
  • data_source_config.json.filepath – The path to the JSON file which contains the data to be used for training. Example: /User1/training.json
  • data_source_config.json.encoding – The encoding of the JSON file. Example: “UTF-8”
  • data_source_config.sql.url – The connection string for the database. All (text, label, test_train) triples used for training must be in a table called ‘training’. The URL includes the DB type, username, password, host, and db. Example: mysql://user1:password@localhost/data
  • data_source_config.sql.table – The table name to read from. If this option is not specified, a table named training is read. Example: training_data
  • labels_a – This defines the data to use to train one side of the binary classifier. It should be a list of class labels. This implicitly defines a decision tree. Example: [music]
  • labels_b – This defines the data to use to train the other side of the classifier. It should be a list of class labels. This implicitly defines the decision tree. Example: [technology, sport, politics]
  • limit – A limit on the amount of training data to collect for each of stage a and stage b. By default no limit is applied. Example: 1000
  • label_limit – A limit on the amount of training data to collect for each label defined in labels_a and labels_b. By default no limit is applied. Example: 100
  • balance_method – A numerical reference to the desired method for dealing with uneven training data sets (that is, stage a has more data than stage b, or vice versa): 0 = No Balancing; 1 = Downsample; 2 = SVM Cost. The default is 2. Example: 2
  • balance_weight – Only required for balance_method 2. Value, or list of values, between 0 and 1. If supplied this will override the default values for parameter selection. It is not needed by default. Example: 0.25
  • min_data_count – Specifies the minimum number of training data points required per class. Note: if there are 0 data points in a class, training will always stop. Example: 500
  • min_data_raise – A Boolean variable. Determines what to do if the minimum number of training data points are not supplied. True raises an exception to stop execution; False carries on training but writes a warning to the log file. Default is True. Example: True
cb.pipeline.missingvals

  • fill_method – The method to use to replace missing values with valid ones. One of (mean, mode, median, delete, random, interpolate). Example: mean

cb.pipeline.tokenise

  • lowercase – Convert text to lowercase before tokenising. Defaults to True. Example: True
  • tokenizer_type – Whether to tokenise by character or by words. Defaults to “words”. Example: “words”
  • save_hashtags – Example: True
  • strip_suffixes – Example: True
  • replace_usernames – Boolean value. If True all usernames are stripped from the input text. Defaults to False. Example: True
  • remove_punct – Boolean value. If True all punctuation is stripped from the input text. Defaults to False. Example: True
  • replace_money – Boolean value. If True monetary values in the text are replaced by a flag. Defaults to True. Example: True
  • replace_numbers – Boolean value. If True numbers in the text are replaced by a flag. Defaults to True. Example: True
  • replace_urls – Boolean value. If True urls in the text are replaced by a flag. Defaults to False. Example: True
cb.pipeline.features

  • extractors – Override the default method of feature extraction for each field. One of:
    • ngrams – extract ngrams. Default for text fields.
    • one-hot – create a new feature for each unique value in a categorical field. Default for categorical fields.
    • target-mean – encode a categorical field using the mean of a numeric target value for a numeric field.
    • min-max – replace numeric field value: (value-min)/(max-min)
    • mean – replace numeric field value: (value-mean)/(max-min)
    • standardization – replace numeric field value: (value-mean)/sd. Default for numeric fields.
    Example: { “my_textfield”: ngrams }
  • exclude – Optional. A list of terms to exclude from training, usually to avoid training bias. A common use case is to remove words used to generate the training dataset (if any) to avoid them biasing the classifiers. Example: [audi, vw, nissan]
  • exclude_chance – Optional. Used to keep a sample of the exclude labels in the training data. If, e.g. 0.8 is entered then exclude labels will be removed 80% of the time. Example: 0.8
  • export_labels – Boolean value. When True labels (that is, the features used for the classifier) are written to file. Example: True
  • labels_filename – Name of the file to write the feature labels to. Example: /tmp/labels.txt
  • max_n – The maximum length of n_gram to include in the feature space. Defaults to 3. Example: 2
  • min_n – The minimum length of n_gram to include in the feature space (often a unigram). Defaults to 1. Example: 1
  • separator – The character used for token splitting, defaults to whitespace. Example: “:”
  • startend – Boolean value. When True, flags marking the start and end of a vector are added to the feature vector. Set to True by default. Example: True
  • min_feature_freq – Discard features having fewer than this many occurrences in the training data. Defaults to 1. Example: 5
  • keep_duplicates – Whether to retain duplicate feature vectors. Defaults to False. Example: True
cb.pipeline.vectorise.lda

  • n_iter – The max number of iterations used to build the topic model. Example: 1000
  • n_topics – The max number of topics generated in the topic model. Example: 500

cb.pipeline.train.linear

  • cost_range – A list of cost values to input to the Support Vector Machine. A classifier is trained for each value, and the best is selected for deployment using the eval_method key described earlier. Example: [1.0, 3.0, 10.0]
  • neg_class – 0 or 1. Which side of the classifier should be classed as positive or negative. Used to generate precision and recall values. Defaults to 0. Example: 0
  • pos_class – 0 or 1. Which side of the classifier should be classed as positive or negative. Used to generate precision and recall values. Defaults to 1. Example: 1
  • suffix – The file extension for data file artefacts. Example: .dat

cb.pipeline.train.svm (all parameters from linear are also available)

  • gamma – A range of gamma values to cross validate over. Example: [0.01, 1.0, 100.0]

cb.pipeline.train.bayes (no configuration required)

cb.pipeline.train.neural

  • hidden_layers – A list of integers representing the size of hidden layers. The default is to use no hidden layers. Example: [10, 20]
  • max_epochs – The number of training epochs to use. The default is 1000. Example: 500
  • stop_on_error – Boolean value to enable early stopping during training, which is triggered when the network error increases. Example: False

cb.pipeline.train.regression

  • alpha – When alpha is set to 0, the lasso variant of regression is performed. When alpha is set to 1, the ridge variant of regression is performed. When alpha is between 0 and 1, a weighted hybrid of the two methods is used. The default value is 0.5. Example: 0.3
  • lambda_type – “1se” or “min”. Set to 1se instead of min to reduce the tendency of the model to overfit, at the expense of overall accuracy on the training data. The default value is min. Example: “min”

cb.pipeline.train.randomforest

  • number_of_trees – The number of trees to build. Defaults to 40. Example: 100
  • partition_ratio – The fraction of training data to use to train each tree. Defaults to 0.7. Example: 0.5

cb.pipeline.train.timeseries

  • train_window – The number of steps that the model uses to fit the time series model. Example: 24
  • forecast_window – The number of steps that the model will forecast ahead. Example: 6
  • evaluation_metric – The metric to use when evaluating and comparing different time series models to select the best performer. Example: rmse
  • algorithms.triple_exponential_smoothing.seasonality – Defines the number of steps over which a seasonal pattern repeats. Example: 12

Each key in this table can be placed under the engine key at the top level of your YAML file.

release

  • pkg_name – The name of the package (used as an identifier for all the classifiers together). Example: co.chatterbox.example
  • pkg_version – The version of the current build. Example: “1.9.0”
  • pkg_filename – The file name for the package file (found in the payload subdirectory of the output directory). Example: vendor-product-language-version.package
  • meta – Optional keys to be used to carry through meta parameters into the package that can’t be represented elsewhere. Example: comment: “risk dept model”; notes: “not for production use”

version – Version of the engine. Example: 2

Appendix B. Sample YAML files

Sample YAML Configuration for Hierarchical Classification with Default Pipelines

engine:
    version: 2
    release:
        pkg_name: my_package
        pkg_version: 4.9.1
        pkg_filename: my_sample_package.package

sport:
    football:
    basketball:
music:
    jazz:
    rock:
business:

Sample YAML Configuration for Classification with Custom Pipelines

engine:
  release:
    pkg_filename: randomforest.package
    pkg_name: co.chatterbox.iris
    pkg_version: 0.0.1
  version: 2

Iris-setosa:
    params: &params
      pipeline:
          eval_method: fscore
          steps:
            - cb.pipeline.retrieve_data
            - cb.pipeline.vectorise.numbers
            - cb.pipeline.train.randomforest

      cb.pipeline.retrieve_data:
        data_fields: [sepal_length, sepal_width, petal_length, petal_width]

Iris-versicolor:
    params: *params

Iris-virginica:
    params: *params

Sample YAML Configuration for Regression

engine:
  release:
    meta:
      purpose: "a sample YAML for regression modelling"
    pkg_name: co.chatterbox.sample
    pkg_version: 0.0.1
    pkg_filename: cb-chatterbox-sample-001.package
  version: 2

p0:
  params: &params
    pipeline: &pipeline
      auto_labels: False
      steps:
      - cb.pipeline.retrieve_data
      - cb.pipeline.features
      - cb.pipeline.train.regression

    cb.pipeline.retrieve_data:
      target: PT08_S4_NO2
      meta_fields: [id,test_train]
      data_types: { "PT08_S4_NO2": n }

    cb.pipeline.train.regression:
      alpha: 0.5
      lambda_type: min

Get in Touch