This document is aimed at a technical audience and gives a detailed description of the configuration and operation of the Boost feature of Chatterbox Labs’ Synthetic Data Generator (SDG).
SDG Boost Architecture
Installation & Environment Configuration
The Synthetic Data Generator Boost function runs on the Java Virtual Machine (JVM) platform.
The only requirement of Boost is that a Java Runtime Environment (Java SE 8) be available on the machine.
As Boost is written in Clojure and Java, it can run on any operating system that supports a Java Virtual Machine. A 64-bit operating system is recommended for optimal memory usage.
It is recommended that SDG Boost be installed on a centrally accessible server. Each organization will have its own standard configurations, but as a benchmark a comparable server setup is:
- CPU: Multi-core (4 minimum, 16 recommended) Xeon/EPYC/i7/i9 @ 3GHz
- Memory: 128GB Recommended (minimum of 16GB)
- OS: Ubuntu Linux (64-bit) or Microsoft Windows (64-bit), running Java SE 8 (64-bit)
- Connectivity: Remote access (SSH / MS Remote Desktop) and outbound internet access
The usual process of operating the Boost function is as follows:
- Identify the required labelled and unlabelled datasets for your use case.
- Create a YAML file that specifies the desired classes for which synthesized pairs will be created.
- Run the SDG Boost jar at the command line. The output will be written into the output directory.
- Check the log (in the output directory) for any warnings that have been logged during synthesis.
- If final classifiers have been trained (by passing the --train-final parameter to SDG Boost), assess the performance of these boosted classifiers.
Running SDG Boost
SDG Boost is distributed as an application packaged in an executable JAR file. It can be run, like any other JAR file, through the Java binary with the -jar option.
It is strongly recommended that you increase the default heap size (memory) when you invoke the jar. For more information see the -Xmx parameter in Java’s standard documentation.
Running with the Java binary:
java -Xmx8G -jar sdg-x.x.x-standalone.jar boost
The Synthetic Data Generator executable accepts the following command-line options:
- --cfg PATH Path to configuration files in YAML or JSON format
- --home PATH Path that specifies the working directory of the SDG
- --csv FILE Path to data CSV files
- --json FILE Path to data JSON files
- --sql URL URL of a SQL database
- --xval-type TYPE Type of cross-validation to perform – ‘random’, ‘synth’ or ‘synth-reversed’
- --drive DRIVE Metrics to drive boosting – ‘test-metrics’ or ‘xval-metrics’
- --iters ITERATIONS Maximum number of iterations
- --time-out MINUTES Maximum time to run, in minutes
- --train-final Train a final classifier with the synthesized data
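As an illustration, the options can be combined in a single invocation. The following sketch drives boosting with cross-validation metrics, caps the run at 50 iterations or 120 minutes, and trains final classifiers; the file names, heap size, and option values are placeholders and will vary per installation:

```
java -Xmx8G -jar sdg-x.x.x-standalone.jar boost \
     --cfg boost-config.yaml \
     --csv training-data.csv \
     --home . \
     --xval-type synth \
     --drive xval-metrics \
     --iters 50 \
     --time-out 120 \
     --train-final
```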
Once executed, the working directory will contain:
- The complete synthesized CSV file
- A final set of classifiers trained with the Cognitive Engine, using the data synthesized by the Boost function
Specifying the input datasets
SDG Boost takes two datasets as input:
- A large unlabelled data export
- This data does not need to be labelled by a Subject Matter Expert
- It should contain as much, varied information about the business case as possible
- A small dataset labelled by a Subject Matter Expert
- Minimum of 200 labelled points per class
- Extreme care must be taken to ensure that the labels are correct
Both the unlabelled data and the labelled data should be stored together in one file. Training data should therefore be structured with the following fields:
- Field: id – An identifier for the data row
- Field: text – All text data that is used for input
- Field: label
- Labelled Set: This should be the class label
- Unlabelled Set: This should be set to ‘nolabel’
- Field: test_train – Denotes whether these data points are for testing or training. All ‘nolabel’ data points should be marked as available for training
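A minimal input file following this layout might look like the following (the rows and class labels are illustrative only):

```
id,text,label,test_train
1,"Customer praised the fast delivery",positive,train
2,"Package arrived damaged and late",negative,test
3,"Order 1182 was dispatched on Monday",nolabel,train
```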
If SDG Boost is set to train final Cognitive Engine classifiers using the synthesized data (the --train-final option), this classifier set will be evaluated using two mechanisms:
- Strict Cross Validation – This is an extension of traditional cross validation, which divides the training dataset into 10 folds, allowing 10 unique hold-out sets to be created and tested on. It is important that synthetic data is not used for testing, as this would over-inflate the performance metrics by testing against artificial gold-standard labels. This modified version of cross validation therefore has an additional, strict step which ensures that the folds used for testing contain only real data; a mixture of synthetic and real data is used only in the training folds.
- Hold-out test set – This tests classifiers using an independent test set. Any data in the original dataset that is marked as ‘test’ in the ‘test_train’ field is used here.
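The strict step can be illustrated with a short sketch. This is not SDG's implementation; it only demonstrates the principle that synthetic rows are excluded from every test fold while remaining available for training:

```python
import random

def strict_cv_folds(rows, k=10, seed=0):
    """Split rows into k folds where test folds contain only real data.

    Each row is a dict with a 'synthetic' flag. Real rows are divided
    into k disjoint hold-out sets; synthetic rows are only ever added
    to the training side of each split.
    """
    real = [r for r in rows if not r["synthetic"]]
    synth = [r for r in rows if r["synthetic"]]
    rng = random.Random(seed)
    rng.shuffle(real)
    folds = [real[i::k] for i in range(k)]  # k real-only test folds
    for i in range(k):
        test = folds[i]
        # Training data: all other real folds plus every synthetic row.
        train = [r for j, f in enumerate(folds) if j != i for r in f] + synth
        yield train, test

# Example: 40 real rows and 20 synthetic rows.
rows = [{"id": n, "synthetic": n >= 40} for n in range(60)]
for train, test in strict_cv_folds(rows, k=10):
    assert all(not r["synthetic"] for r in test)     # test folds are real only
    assert sum(r["synthetic"] for r in train) == 20  # all synth rows train
```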