Document Overview
This document gives a detailed description of the configuration and operation of Chatterbox Labs’ Cognitive Engine, including YAML file configuration. It should be read in combination with the Gold Standard Training Data documentation. If you wish to extend the Cognitive Engine or integrate it into another software stack, please read the Developer Documentation.
Installation & Environment Configuration
The Cognitive Engine is programmed to run on the Java Virtual Machine (JVM) platform.

The only software requirement of the Cognitive Engine is that a Java Runtime Environment (Java SE 8) be available on the machine.
Hardware Requirements
As the Cognitive Engine is programmed using Clojure and Java, it can run on any operating system that supports a Java Virtual Machine. 64-bit operating systems are recommended for optimal memory usage.
It is recommended that the Cognitive Engine is installed on a centrally accessed server. Each organization will have its own standard configurations, but as a benchmark a comparable server setup is:
- CPU: Multi-core (4 minimum, 16 recommended) Xeon/EPYC/i7 @ 3GHz
- Memory: 128GB Recommended (minimum of 16GB)
- OS: Ubuntu Linux (64-bit) or Microsoft Windows (64-bit), running Java SE 8 (64-bit)
- Disk: 200GB (it is recommended that the Cognitive Engine is installed in a directory that is accessible and writable by all users, rather than in each home directory)
- Connectivity: Remote access (ssh / MS Remote Desktop) and access to the outside web
General Operation
The usual process of operating the Cognitive Engine is as follows:
- Prepare Data. Read the Gold Standard Training Data document and ensure that your training data is of good quality and accessible by the Cognitive Engine
- YAML. Create a YAML file that gives configuration for the engine, and specifies each model to be built
- Train. Run the Cognitive Engine jar at the command line (or call the train function inside your software stack) with your yaml file to train your models against your training data. The output will be written into a time-stamped directory inside the output directory
- Check. Check the training session log (in the output directory) for any warnings that have been logged during training
- Results. Check the cross-validation results for each model within the results directory (in the output directory)
- Prediction/Forecasting. Run the Cognitive Engine jar at the command line with the package file (found in the payload directory) to predict values for the target field with new, unseen data.
Training
The Cognitive Engine is distributed as an application packaged in an executable JAR file. It can be run like any other JAR file, through the Java binary with the -jar option. Using the Windows Command Prompt or a Linux/Mac shell, a common invocation for training is:
java -jar engine-x.x.x-standalone.jar train --csv <training csv path> <yaml path>
The Cognitive Engine executable accepts the following commonly used command-line options:
- --home – Optional path to the directory for time-stamped output
- --csv – Optional path to the location of a CSV training file
- --json – Optional path to the location of a JSON training file
- --sql – Optional URL to specify a SQL training database
- --limit – Optional limit to the number of training data points per class
- --label-limit – Optional limit to the number of training data points per label
- --plugin-dir – Optional path to the location of a directory containing plugins
- --test – Optional flag to train with hold-out testing
- --dry – Optional flag to run the Engine without training any models (useful for configuration file sanity testing)
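For example, a training run that caps each class at 5,000 data points and holds out test data might look like this (the file names here are illustrative):
java -jar engine-x.x.x-standalone.jar train --csv training.csv --limit 5000 --test config.yaml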
Run the engine with the --help option to see the complete set of training options:
java -jar engine-x.x.x-standalone.jar train --help
Once executed successfully, the automatically created output directory of the training session will have the following structure:
- output
  - timestamp
    - payload
      - *.package (payload package ready for deployment)
    - production
      - full.yaml (complete YAML file with all configuration)
      - cb.params.* (parameters required for prediction)
      - cb.models (model ids used in prediction)
      - cb.features (production features)
      - *.model (production models)
    - results (HTML and CSV reports with precision, recall and f-score for each model trained)
      - train_report.html (overview comparison metrics for the best models trained, as tested using cross-validation)
      - train_report.csv (detailed overview comparison metrics for the best models trained, with corresponding parameters)
      - *_model_report.csv (per-classifier cross-validation results for each parameter set)
      - *_xv_wrongs.csv (texts written to file for error analysis)
      - Optional: test_report.html (overview comparison metrics for the best classifiers trained, as tested using a hold-out test set defined by the test_train field)
    - full.yaml (complete YAML file with all configuration)
    - full.log (training session log file)
Parameter Selection
The Cognitive Engine will carry out automatic parameter selection based on cross-validation testing (which avoids over fitting). The detailed results for each set of parameters are written into the results directory. The best set of parameters to build the final model are selected using the eval_method parameter defined in the YAML file.
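For example, to select the final model by recall rather than the default f-score, eval_method can be set under the pipeline key of a classifier's params (a minimal sketch):
sport:
  params:
    pipeline:
      eval_method: recall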
Pipelines for training Classification, Regression and Time Series Models
At training time, the Cognitive Engine will build one or more models using the range of algorithms at its disposal. For classification problems, a separate binary classifier model is usually trained to discriminate between a specific label value and all other labels which are present in the target field. For regression problems, a single model is trained to be able to estimate numeric values of the target field. For time series problems, a separate model is trained on each time series in the data.
Each of these models (either a regression model, a time series model or a binary classifier) is trained using a pipeline. The pipeline can specify how the training data is retrieved, how the data is prepared for machine learning and which machine learning algorithms should be used. Default settings suitable for classification problems are provided for the pipeline and each of its steps, but these can be overridden.
Assessing Performance for Classification
When assessing the performance of the classifiers trained, the most common file to load is train_report.html which will report, for each classifier:
- Accuracy – Calculated as the proportion of data points correctly classified
- Precision – Considered a measure of quality (‘of the data points classified as X, how many were correct’)
- Recall – Considered a measure of coverage (‘of the available Xs in the dataset, how many were correctly classified as X’)
- F-Score – The harmonic mean of Precision and Recall, calculated using the formula F = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy does not account for imbalanced datasets and often presents over-inflated performance measures, whereas the other three measures do account for this imbalance. We recommend using f-score as the metric with which to evaluate performance.
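For example, a classifier with Precision of 0.80 and Recall of 0.60 has an F-Score of 2 × (0.80 × 0.60) / (0.80 + 0.60) ≈ 0.69.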
Assessing Performance for Regression and Time Series
Several metrics are provided for assessing the performance of regression and time series models:
- Mean Absolute Error – Calculated from the mean of the absolute difference between the predicted and the actual values
- Mean Squared Error – Calculated from the mean of the squared difference between the predicted and the actual values. This metric favours models which avoid making large errors (see the worked example after this list)
- For time series models, the metrics are averaged over the forecast window.
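As a worked example (with illustrative numbers): for actual values [10, 12, 14] and predictions [11, 15, 14], the Mean Absolute Error is (1 + 3 + 0) / 3 ≈ 1.33, while the Mean Squared Error is (1 + 9 + 0) / 3 ≈ 3.33; the single large error dominates the squared metric.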
Configuration with YAML files
YAML files are used to configure the Cognitive Engine as they are human readable, key/value text files. They are very important for:
- Formalizing model design
- Maintaining consistency through staff changes
- Repeatability of services across clients
The first key in any YAML file is the engine key which contains general configuration settings for running the Cognitive Engine and producing a package. A sample engine configuration is as follows:
engine:
  version: 2
  release:
    pkg_name: my_package
    pkg_version: 4.9.1
    pkg_filename: my_sample_package.package
The release configuration manages the creation of the final package file. This package file is the output of the Cognitive Engine's training process and contains all the configuration, model and feature files ready for deployment, and hence prediction on new data; it is passed back to the Cognitive Engine's prediction process. Each of these values should be updated to reflect your package. The meta parameters are optional and are present in case you have a use case which requires additional metadata to be written to the package that cannot be captured using the standard YAML keys. Following the engine configuration, you must specify the models to be built by declaring pipelines, which depend upon whether the problem at hand is one of classification, regression or time series. See the following sections for more details.
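For instance, to carry extra metadata into the package, optional keys can be placed under a meta key inside release (a sketch reusing the example keys from Appendix A):
engine:
  version: 2
  release:
    pkg_name: my_package
    pkg_version: 4.9.1
    pkg_filename: my_sample_package.package
    meta:
      comment: "risk dept model"
      notes: "not for production use"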
Cognitive Engine Pipeline Structure
Cognitive Engine pipelines are used to train a binary classification model, a regression model or a time series model. In each case they are structured as a linear sequence of steps:
- cb.pipeline.retrieve_data
- Retrieve data from a data source, defining the field types and roles. The default values for this step are designed for classification problems and assume that there is an input field called text to use to build input features to the model, a field called label which provides the target category, a field called id which provides a unique identifier for each row and a field called test_train which is assigned either “test” or “train” and describes whether each row belongs to the training or test set.
- cb.pipeline.tokenise (optional, but use if any fields contain text that needs to be parsed)
- Parse any text fields into a stream of tokens.
- cb.pipeline.missingvals (optional, but use if any numeric fields need to have missing values filled)
- Replace missing values in particular fields with some substitute (for example the mean value).
- cb.pipeline.features
- From the retrieved (and, if necessary, tokenised) data, create the features ready to be used to train the machine learning models. As an alternative, when dealing with text input that results in a large number of features, consider using cb.pipeline.vectorise.lda instead, which performs dimensionality reduction.
- cb.pipeline.train.* (train a model, supplying the algorithm name from the set (linear, svm, naivebayes, regression, neural or randomforest) or auto to select the best classification algorithm).
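As a sketch, a classifier that parses a text field and fills missing numeric values before training with the linear algorithm might declare its steps as follows (the classifier name here is illustrative):
my_classifier:
  params:
    pipeline:
      steps:
        - cb.pipeline.retrieve_data
        - cb.pipeline.tokenise
        - cb.pipeline.missingvals
        - cb.pipeline.features
        - cb.pipeline.train.linear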
YAML Configuration for Classification
When working on a classification problem, a minimal YAML file should provide the labels for each classifier. These can be hierarchical (for example, if building a topic-based classifier, football & basketball are types of sport). For example:
sport:
  football:
  basketball:
music:
  jazz:
  rock:
business:
Using this type of configuration, 7 classifiers will be built using the default pipeline configuration for classification.
If a yaml file contains multiple classes for a classification task, these will be modelled as multiple binary classifiers that operate in unison for prediction, creating a multi-class classifier. For classification models there is no need to add additional configuration. However, if you wish to modify some of the defaults for each classifier you can provide custom configuration, placed under the params key of each classifier. For example, to specify the evaluation metric to be precision and set the data source to be a CSV file, one could modify the sport classifier to the following:
sport:
  params:
    pipeline:
      eval_method: precision
    cb.pipeline.retrieve_data:
      data_source_type: csv
      data_source_config:
        csv:
          filepath: my_training_file.csv
  football:
  basketball:
At runtime the configuration that is contained in your YAML file is merged with the default configuration contained within the Cognitive Engine. Therefore, you only need to specify the parameters that you wish to change from the default setting in your YAML file.
A full explanation of each YAML parameter is given in Appendix A. Some example YAML configurations are provided in Appendix B.
YAML file generation. As the configuration of your classifiers becomes the merge of your YAML file and the default configuration, it is helpful to have an explicit definition of this in a file. On each training run, the Cognitive Engine will write the complete configuration to a new YAML file (stored in the output directory). This explicit definition of your classifiers becomes the blueprint and can be used for reference, debugging or further training sessions.
Training Data. Gold standard training data of good quality is essential for training with the Cognitive Engine. Please read the Gold Standard Training Data document alongside this one.
Configuring the retrieve_data Step
The Cognitive Engine also supports the classification of categorical and numerical data that does not feature any text component. For this to take place, the input file must feature one column for each categorical or numerical variable to be passed to the learning algorithm. There is no requirement to scale, smooth or normalize the data – the engine will handle all of this. It is, however, necessary to ensure that all the values in these columns contain correctly formatted data.
By default, the Cognitive Engine will expect to find in the data:
- a field containing unique identifiers called id
- a single target field called label
- a field called test_train which contains the value “train” if the row is to be used for training and “test” if the row is to be placed in a holdout set and used for computing test statistics
- a single predictor field called text containing text data. If this is not present all remaining fields are used as predictors
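With these defaults, a minimal training CSV might therefore begin with a header row such as the following (illustrative):
id,label,test_train,text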
These default field names can be overridden in the YAML file under the cb.pipeline.retrieve_data step. Consider the trivial example in which there are 4 predictor variables to model: age, profession, income and comments, and a target field called risk. The YAML configuration of the retrieve_data step will look like this:
cb.pipeline.retrieve_data:
  target: risk
  data_fields: [age, profession, income, comments]
Alternatively (this is useful when there are a large number of predictors but also a few fields that should not be used for modelling) a set of fields to ignore can be specified – all remaining fields will be used as predictors:
cb.pipeline.retrieve_data:
  target: risk
  ignore_fields: [star_sign, favourite_colour]
The data types of fields will be automatically inferred as one of “c” (categorical), “n” (numeric) or “t” (text) but these can also be overridden using the “data_types” key:
cb.pipeline.retrieve_data:
  target: risk
  data_fields: [age, profession, income, comments]
  data_types: { profession: c, income: n, comments: t }
Classifier Design and Hierarchies
Each classifier built by the Cognitive Engine is a binary (two-class) classifier. The Cognitive Engine will automatically select the data required to train each side of this binary classifier. If you want to override the defaults you may do so by specifying the labels for each classifier: provide a list of labels under the labels_a and labels_b keys in YAML.
The automated approach that the Cognitive Engine takes is to build a 1 vs everything else binary classifier for each class in your data. This is useful because, when you come to prediction, the engine will provide you with a confidence score for every classifier in your package file.
The 1 vs everything else approach is applied on a hierarchical basis. Taking our earlier example, the Cognitive Engine will define the classifier for sport to have:
labels_a: ['sport','sport.football','sport.basketball']
labels_b: ['music','music.jazz','music.rock','business']
As the Cognitive Engine progresses down the hierarchy, the 1 vs everything else notion is applied locally to that branch. For example, the sport.football classifier will be defined as:
labels_a: ['sport.football']
labels_b: ['sport.basketball']
This enables the training process to focus on the local differentiating features.
When executed in prediction, the Cognitive Engine will output the full path to the final node in the hierarchy.
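To make such an override explicit, the label lists can be supplied by hand. Based on the reference in Appendix A, labels_a and labels_b belong under the cb.pipeline.retrieve_data step; a sketch for the sport classifier:
sport:
  params:
    cb.pipeline.retrieve_data:
      labels_a: ['sport','sport.football','sport.basketball']
      labels_b: ['music','music.jazz','music.rock','business']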
YAML Configuration for Regression
To address a regression problem the YAML file should be configured with a single pipeline that will train a regression model. In the minimal YAML pipeline configuration below, p0 is used, but any descriptive name can be used. This YAML file configures the engine to train a regression model to predict the target field Resale Value, given the input data fields Age, Mileage and Manufacturer.
p0:
  params: &params
    pipeline: &pipeline
      auto_labels: False
      steps:
        - cb.pipeline.retrieve_data
        - cb.pipeline.features
        - cb.pipeline.train.regression
    cb.pipeline.retrieve_data:
      target: "Resale Value"
      data_fields: ["Age","Mileage","Manufacturer"]
      data_types: { "Resale Value": n }
YAML Configuration for Time Series
To tackle a time series problem the YAML file should be configured with a single pipeline that will train time series models on one or more target fields, where each target field contains a numeric time series. In the minimal YAML pipeline configuration below, p1 is used, but any descriptive name can be used. This YAML file configures the engine to train time series models to forecast the target fields Temperature, Flow and Vibration.
p1:
  params: &params
    pipeline: &pipeline
      auto_labels: False
      steps:
        - cb.pipeline.retrieve_data
        - cb.pipeline.features
        - cb.pipeline.train.timeseries
    cb.pipeline.retrieve_data:
      target: ["Temperature","Flow","Vibration"]
      meta_fields: ["id", "test_train"]
      time_step_field: "step"
    cb.pipeline.train.timeseries:
      train_window: 72
      forecast_window: 12
      evaluation_metric: "rmse"
Time series modelling requires the data to contain a field specifying an integer-valued time step for each row. There should be no gaps in the sequence of time step values. This field should be specified in the YAML using the “time_step_field” setting within the “retrieve_data” step; if it is not specified, the engine will attempt to use the “id” field for this purpose. An optional test_train field may be included to divide the data into a single test set and train set, but the ranges of time step values in these sets cannot overlap. The names of the fields containing the time series to be modelled are specified in the cb.pipeline.retrieve_data.target setting. If the target list is set to [“*”] then all fields not present in the “meta_fields” or “time_step_field” settings will be used as targets.
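For example, to model every field that is not metadata or the time step as a time series, the wildcard form can be used (a minimal sketch):
cb.pipeline.retrieve_data:
  target: ["*"]
  meta_fields: ["id", "test_train"]
  time_step_field: "step"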
The cb.pipeline.train.timeseries step needs to be configured with three important parameters.
- train_window: the number of steps that the model uses to fit the time series model. When forecasting from a trained model, the input data will need to contain this many steps for the model.
- forecast_window: the number of steps ahead that the model will be required to forecast. Models are chosen which minimise the mean value of the evaluation metric over this forecast window.
- evaluation_metric: the metric to use when evaluating and comparing different time series models to select the best performer. The value should be one of “rmse” (root mean squared error), “mse” (mean squared error) or “mae” (mean absolute error).
Using the Trained Package
The complete package file, having been trained for a classification, regression or time series task, is ready for use in a live deployment environment. This can be integrated into a custom software stack with a bespoke UI (see the Developer Guide) or executed from the same jar used for training.
Prediction for classification and regression models
Prediction is carried out in the same way as training; however, instead of the train task the predict task is used. The predict task requires as input the package file created during training and the data containing the predictor value(s), thus:
java -jar engine-x.x.x-standalone.jar predict <package file path> <input csv path>
Once complete, this process will write a CSV of classified data to the output directory. This CSV will contain all the original input data and also:
- For classification packages:
  - The confidence scores for classification against each classifier in the yaml file. This is important so that these raw scores can be interrogated for your use case (such as sorting, or comparing the final data).
  - A final class label, calculated by selecting the class with the highest confidence score. You may wish to modify this approach, which can be done by interrogating the raw confidence scores.
- For regression packages:
  - The predicted numeric value produced by the model.
This CSV is in a standard format and so can be loaded into other software (such as visualisation tools) without the need to integrate the Cognitive Engine codebase.
Forecasting for time series models
Forecasting is carried out in the same way as training; however, instead of the train task the forecast task is used. The forecast task requires as input the package file created during training and the data containing recent values of the time series upon which the forecast will be based:
java -jar engine-x.x.x-standalone.jar forecast <package file path> <input csv path>
Once complete, this process will write a CSV of the forecast data to the output directory. This CSV will consist of columns containing the forecast values for each of the time series in the trained package and will contain one row for each step in the forecast_window defined when the model was trained.
Extensibility
The Cognitive Engine supports a plugin architecture for easy extension. The Cognitive Engine will look for plugins in various places:
- On the Java classpath
- In a .cogengine directory under the user’s home directory
- In a folder specified by the --plugin-dir option to the Cognitive Engine
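For example, a training run that loads plugins from a local directory might be invoked as follows (the paths here are illustrative):
java -jar engine-x.x.x-standalone.jar train --plugin-dir ./plugins --csv training.csv config.yaml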
Refer to the Cognitive Engine Developer Guide on how to create and apply custom plugins.
Appendix A. YAML Configuration Reference
Each key in this table can be placed underneath a classifier params specification in YAML.
Key | Description | Example |
pipeline | ||
classifier_id | A string name for the classifier. This is used to identify the classifier later in prediction. | sport |
test | A Boolean value – if True the classifiers produced are tested against any manually annotated data which has ‘test’ in the ‘test_train’ field. The default value is False. | True |
eval_method | The method that the engine should use in order to select the best model for the final deployment package (after iterating over multiple cost values). Options are ‘accuracy’, ‘precision’, ‘recall’ or ‘fscore’. The default value is fscore. | fscore |
log_xv_text | A Boolean value – If True the texts of wrong cross-validation predictions are logged in the output | True |
auto_labels | A Boolean value – if True, will infer the labels from the classifier_id values. Defaults to True | True |
steps | The list of steps used to train a model. The default list is: cb.pipeline.retrieve_data, cb.pipeline.tokenise, cb.pipeline.features, cb.pipeline.train.linear | cb.pipeline.retrieve_data, cb.pipeline.features, cb.pipeline.train.bayes |
cb.pipeline.retrieve_data | ||
target | The field containing the target values. The default is “label”. | “Resale Value” |
data_fields | The fields to use as predictors. The default is [text]. | [Age,Mileage,Manufacturer] |
id_field | A field which will contain a unique value for each row. The default is “id”. | customer_id |
test_train_field | The field containing values “train” (use the row for training) or “test” (do not use for training, keep a hold-out for testing). The default is “test_train”. | partition_name |
meta_fields | Fields containing metadata | [id,test_train] |
ignore_fields | A list of the fields to ignore | [address,”mobile number”] |
time_step_field | An incrementing integer field specifying the time step associated with each row, used in time series data | tstep |
data_types | Override the automatically-selected types chosen for each field, where: n – field is numeric/continuous c – field is categorical t – field is text | Age: n Mileage: n Manufacturer: c “Service Log”: t |
data_source_type | The chosen data source type, either ‘csv’, ‘json’ or ‘sql’. The default is ‘csv’. | csv |
data_source_config.csv.filepath | The path to the csv file which contains the data to be used for training. | /User1/training.csv |
data_source_config.csv.encoding | The encoding of the CSV file | “UTF-8” |
data_source_config.json.filepath | The path to the json file which contains the data to be used for training. | /User1/training.json |
data_source_config.json.encoding | The encoding of the JSON file | “UTF-8” |
data_source_config.sql.url | The connection string for the database. All (text, label, test_train) triples used for training must be in a table called ‘training’. The URL includes the DB type, username, password, host, and db. | mysql://user1:<password>@localhost/data |
data_source_config.sql.table | The table name to read from. If this option is not specified, a table named training is read. | training_data |
labels_a | This defines data to use to train one side of the binary classifier. It should be a list of class labels. This implicitly defines a decision tree. | [music] |
labels_b | This defines the data to use to train the other side of the classifier. It should be a list of class labels. This implicitly defines the decision tree. | [technology, sport, politics] |
limit | A limit on the amount of training data to collect for each of stages a and b. By default no limit is applied. | 1000 |
label_limit | A limit on the amount of training data to collect for each label defined in labels_a and labels_b. By default no limit is applied. | 100 |
balance_method | A numerical reference to the desired method for dealing with uneven training data sets (that is, stage a has more data than stage b, or vice versa): 0 = No Balancing, 1 = Downsample, 2 = SVM Cost. The default is 2. | 0 |
balance_weight | Only required for balance_method 2. Value, or list of values, between 0 and 1. If supplied this will override the default values for parameter selection. It is not needed by default. | 0.25 |
min_data_count | Specifies the minimum number of training data points required per class. Note: if there are 0 data points in a class, training will always stop. | 500 |
min_data_raise | A Boolean variable. Determines what to do if the minimum number of training data points are not supplied. True raises an exception to stop execution; False carries on training but writes a warning to the log file. Default is True. | True |
cb.pipeline.missingvals | ||
fill_method | The method to use to replace missing values with valid ones. One of (mean, mode, median, delete, random, interpolate). | mean |
cb.pipeline.tokenise | ||
lowercase | Convert text to lowercase before tokenising. Defaults to True. | True |
tokenizer_type | Whether to tokenise by character or by words. Defaults to “words”. | “words” |
save_hashtags | | True |
strip_suffixes | | True |
replace_usernames | Boolean value. If True all usernames are stripped from the input text. Defaults to False. | True |
remove_punct | Boolean value. If True all punctuation is stripped from the input text. Defaults to False. | True |
replace_money | Boolean value. If True monetary values in the text are replaced by a flag. Defaults to True. | True |
replace_numbers | Boolean value. If True numbers in the text are replaced by a flag. Defaults to True. | True |
replace_urls | Boolean value. If True urls in the text are replaced by a flag. Defaults to False. | True |
cb.pipeline.features | ||
extractors | Override the default method of feature extraction for each field. One of: ngrams – extract ngrams (default for text fields); one-hot – create a new feature for each unique value in a categorical field (default for categorical fields); target-mean – encode a categorical field using the mean of a numeric target value; min-max – replace a numeric field value with (value-min)/(max-min); mean – replace a numeric field value with (value-mean)/(max-min); standardization – replace a numeric field value with (value-mean)/sd (default for numeric fields). | { “my_textfield”: ngrams } |
exclude | Optional. A list of terms to exclude from training, usually to avoid training bias. A common use case is to remove words used to generate the training dataset (if any) to avoid them biasing the classifiers. | [audi, vw, nissan] |
exclude_chance | Optional. Used to keep a sample of the exclude labels in the training data. If, e.g. 0.8 is entered then exclude labels will be removed 80% of the time. | 0.8 |
export_labels | Boolean value. When True labels (that is, the features used for the classifier) are written to file. | True |
labels_filename | Name of the file to write the feature labels to | /tmp/labels.txt |
max_n | The maximum length of n_gram to include in the feature space. Defaults to 3. | 2 |
min_n | The minimum length of n_gram to include in the feature_space (often a unigram). Defaults to 1. | 1 |
separator | The character used for token splitting, defaults to whitespace. | “:” |
startend | Boolean value. When True, flags marking the start and end of a vector are added to the feature vector. Defaults to True. | True |
min_feature_freq | Discard features having fewer than this many occurrences in the training data. Defaults to 1. | 5 |
keep_duplicates | Whether to retain duplicate feature vectors. Defaults to False. | True |
cb.pipeline.vectorise.lda | ||
n_iter | The max number of iterations used to build the topic model | 1000 |
n_topics | The max number of topics generated in the topic model | 500 |
cb.pipeline.train.linear | ||
cost_range | A list of cost values to input to the Support Vector Machine. A classifier is trained for each value, and the best is selected for deployment using the eval_method key described earlier. | [1.0,3.0,10.0] |
neg_class | 0 or 1. Which side of the classifier should be classed as positive or negative. Used to generate precision and recall values. Defaults to 0. | 0 |
pos_class | 0 or 1. Which side of the classifier should be classed as positive or negative. Used to generate precision and recall values. Defaults to 1. | 1 |
suffix | The file extension for data file artefacts. | .dat |
cb.pipeline.train.svm | All parameters from linear are available | |
gamma | A range of gamma values to cross validate over | [0.01,1.0,100.0] |
cb.pipeline.train.bayes | No configuration required | |
cb.pipeline.train.neural | ||
hidden_layers | A list of integers representing the size of hidden layers. The default is to use no hidden layers. | [10,20] |
max_epochs | The number of training epochs to use. The default is 1000. | 500 |
stop_on_error | Boolean value to enable early stopping during training, which is triggered when the network error increases. | False |
cb.pipeline.train.regression | ||
alpha | When alpha is set to 0, the lasso variant of regression is performed. When alpha is set to 1, the ridge variant of regression is performed. When alpha is between 0 and 1, a weighted hybrid of the two methods is used. The default value is 0.5. | 0.3 |
lambda_type | “1se” or “min”. Set to 1se instead of min to reduce the tendency of the model to overfit, at the expense of overall accuracy on the training data. The default value is min. | “min” |
cb.pipeline.train.randomforest | ||
number_of_trees | The number of trees to build. Defaults to 40. | 100 |
partition_ratio | The fraction of training data to use to train each tree. Defaults to 0.7. | 0.5 |
cb.pipeline.train.timeseries | ||
train_window | The number of steps that the model uses to fit the time series model. | 24 |
forecast_window | The number of steps that the model will forecast ahead. | 6 |
evaluation_metric | The metric to use when evaluating and comparing different time series models to select the best performer. | rmse |
algorithms.triple_exponential_smoothing.seasonality | Defines the number of steps over which a seasonal pattern repeats. | 12 |
Each key in this table can be placed under the engine key at the top level of your yaml file.
Key | Description | Example |
release | ||
pkg_name | The name of the package (used as an identifier for all the classifiers together) | co.chatterbox.example |
pkg_version | The version of the current build. | “1.9.0” |
pkg_filename | The file name for the package file (found in the payload subdirectory of the output directory) | vendor-product-language-version.package |
meta | Optional keys to be used to carry through meta parameters into the package that can’t be represented elsewhere | comment: “risk dept model” notes: “not for production use” |
version | Version of the engine | 2 |
Appendix B. Sample YAML files
Sample YAML Configuration for Hierarchical Classification with Default Pipelines
engine:
  version: 2
  release:
    pkg_name: my_package
    pkg_version: 4.9.1
    pkg_filename: my_sample_package.package
sport:
  football:
  basketball:
music:
  jazz:
  rock:
business:
Sample YAML Configuration for Classification with Custom Pipelines
engine:
  release:
    pkg_filename: randomforest.package
    pkg_name: co.chatterbox.iris
    pkg_version: 0.0.1
  version: 2
Iris-setosa:
  params: &params
    pipeline:
      eval_method: fscore
      steps:
        - cb.pipeline.retrieve_data
        - cb.pipeline.vectorise.numbers
        - cb.pipeline.train.randomforest
    cb.pipeline.retrieve_data:
      data_fields: [sepal_length, sepal_width, petal_length, petal_width]
Iris-versicolor:
  params: *params
Iris-virginica:
  params: *params
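In this sample, &params on Iris-setosa and *params on the other two classifiers are standard YAML anchor and alias syntax: Iris-versicolor and Iris-virginica reuse the Iris-setosa pipeline configuration without repeating it.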
Sample YAML Configuration for Regression
engine:
  release:
    meta:
      purpose: "a sample YAML for regression modelling"
    pkg_name: co.chatterbox.sample
    pkg_version: 0.0.1
    pkg_filename: cb-chatterbox-sample-001.package
  version: 2
p0:
  params: &params
    pipeline: &pipeline
      auto_labels: False
      steps:
        - cb.pipeline.retrieve_data
        - cb.pipeline.features
        - cb.pipeline.train.regression
    cb.pipeline.retrieve_data:
      target: PT08_S4_NO2
      meta_fields: [id,test_train]
      data_types: { "PT08_S4_NO2": n }
    cb.pipeline.train.regression:
      alpha: 0.5
      lambda_type: min