SDG: Observe & Synth User Guide

Document Overview

This document describes the configuration and operation of Chatterbox Labs’ Observe & Synth function (part of the Synthetic Data Generator). It explains how the process of observing and synthesizing a dataset works, how to operate the product, and how to integrate it into a bespoke Java software stack.

Observe & Synth Process

  • Observe & Synth works by first observing an original dataset
  • The relationships between the variables in this dataset are learned during this phase
  • They are stored in a Recipe File
  • This recipe file can be moved to another location
  • The synthesize phase is independent from observation
  • During this phase
    • The user specifies how many rows of data to produce
    • The relationships in the recipe file are referenced
    • New datapoints are synthesized

Hardware Requirements

All technology runs on the Java Virtual Machine (v11; we recommend using OpenJDK). Any operating system that supports this can be used; our recommendation is Ubuntu Linux. 64-bit operating systems are strongly recommended for optimal memory usage.

It is recommended that the software is installed on a centrally accessed server. Each organization will have its own standard configurations, but as a benchmark, a suitable server setup is:

  • CPU: Multi-core (4 minimum, 16 recommended) Xeon/EPYC/i7/i9 @ 3GHz
    • Note: Fast clock speeds are preferable over multiple slower cores
  • Memory: 128GB Recommended (minimum of 16GB)
  • OS: Ubuntu Linux (64 bit) or Microsoft Windows (64 bit), running Java 11 (64 bit)

Usage

Observe & Synth can be executed from the standalone executable jar at the command line, or integrated as a Java dependency.

To execute from the command line, Observe & Synth is called as follows:

java -jar observe_synth.jar <<task>> <<options>>

Observe & Synth accepts two tasks (which map to the two phases of the process described above):

  • observe
  • synth

Note: you can always pass the parameter --help to display the available parameter options.

Supported Data Types

Observe & Synth covers the following data types:

  • Numerical (continuous numerical fields)
  • Categorical (unordered, category fields of 3 or more values)
  • Binary (unordered, category fields of 2 values)
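
When drafting a configuration for a new dataset, it can help to get a first suggestion for each column's type. The following is a minimal sketch of one possible heuristic, assuming the column values are available as strings. The class and method names (FieldTypeGuesser, guessType) are hypothetical and are not part of Observe & Synth. Integer-coded categories will be suggested as numerical, so the result should always be reviewed by hand.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FieldTypeGuesser {

    /**
     * Suggests a type for a column: "n" (numerical), "b" (binary) or "c" (categorical).
     * This is a drafting heuristic only (an assumption), not the product's own logic.
     */
    public static String guessType(List<String> values) {
        Set<String> distinct = new LinkedHashSet<>(values);
        if (distinct.size() == 2) {
            return "b"; // exactly two distinct values
        }
        boolean allNumeric = !values.isEmpty()
                && values.stream().allMatch(FieldTypeGuesser::isNumeric);
        return allNumeric ? "n" : "c";
    }

    // true if the string parses as a floating point number
    private static boolean isNumeric(String s) {
        try {
            Double.parseDouble(s.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(guessType(List.of("0", "1", "1", "0")));
        System.out.println(guessType(List.of("1200.5", "88.0", "430.25")));
        System.out.println(guessType(List.of("single", "married", "other")));
    }
}
```

A column of numeric category codes (for example a payment status coded 0, 1, 2) would be suggested as n here and should be changed to c manually.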

Observe

The observe function requires as input the initial dataset to be observed and a yaml configuration file. The full invocation would therefore be:

java -jar observe_synth.jar observe /path/to/data.csv /path/to/cfg.yaml

Once the observe step is complete the system will output a recipe file.

Synth

The synth function requires as input the recipe file and the number of datapoints you wish to create. For example:

java -jar observe_synth.jar synth /path/to/drugs_data.recipe 100000

This invocation will reference the recipe for drugs data and create 100,000 new synthetic rows.
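
If the jar needs to be driven from another Java process rather than a shell, the two command lines can be assembled and launched with the standard ProcessBuilder class. This is a sketch only: the ObserveSynthCommand class is hypothetical, and treating the process exit code as a success indicator is an assumption that this guide does not confirm.

```java
import java.io.IOException;
import java.util.List;

public class ObserveSynthCommand {

    /** Builds: java -jar <jar> observe <csv> <yaml> */
    public static List<String> observe(String jarPath, String csvPath, String yamlPath) {
        return List.of("java", "-jar", jarPath, "observe", csvPath, yamlPath);
    }

    /** Builds: java -jar <jar> synth <recipe> <rows> */
    public static List<String> synth(String jarPath, String recipePath, int rows) {
        if (rows <= 0) {
            throw new IllegalArgumentException("rows must be positive");
        }
        return List.of("java", "-jar", jarPath, "synth", recipePath, Integer.toString(rows));
    }

    /** Runs a command, forwarding its output to this process, and returns the exit code. */
    public static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Print the commands rather than running them, since the jar location is site specific
        System.out.println(String.join(" ",
                observe("observe_synth.jar", "/path/to/data.csv", "/path/to/cfg.yaml")));
        System.out.println(String.join(" ",
                synth("observe_synth.jar", "/path/to/drugs_data.recipe", 100000)));
    }
}
```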

General Operation

  1. Identify the dataset(s) that you wish to observe and store each one in a UTF-8 encoded csv file.
  2. Within each dataset, determine any logical smaller datasets that it can be broken down into (multiple smaller datasets are more efficient to observe than one large dataset).
  3. Determine which fields should be observed.
  4. For each dataset of interest, create a yaml file that specifies the fields of interest.
  5. Pass each dataset and its yaml file to the Observe & Synth jar. This will produce a recipe file.
  6. Store this recipe file in a centralized location.
  7. When new data is required, pass the recipe file to the Observe & Synth jar along with the number of rows required to be synthesized. This will output a new data file.
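
Step 1 requires the csv file to be UTF-8 encoded. One way to check this before observing is to decode the file strictly and see whether any malformed byte sequence is reported. The sketch below uses only the standard java.nio.charset classes; the Utf8Check class name is hypothetical and the check is not part of Observe & Synth.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {

    /** Returns true if the bytes decode as strict UTF-8 (malformed input is reported, not replaced). */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java Utf8Check /path/to/data.csv");
            return;
        }
        // Reads the whole file into memory, so this suits files that fit in the heap
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "not valid UTF-8");
    }
}
```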

Yaml Configuration Files

Yaml files are required for the configuration of the Observe phase. They are human readable text files. Observe & Synth requires configuration for each field to be synthesized. At a minimum, each variable should be defined with its name, type (either n for numerical, b for binary or c for categorical) and op (synth or sample).

A full example yaml (example.yaml) is as follows:

cb/obsynth:
  obs/normrank: true
  obs/input:
  - {name: AGE, type: c, op: sample}
  - {name: PAY_0, type: c, op: synth}
  - {name: PAY_2, type: c, op: synth}
  - {name: PAY_3, type: c, op: synth}
  - {name: PAY_4, type: c, op: synth}
  - {name: PAY_5, type: c, op: synth}
  - {name: PAY_6, type: c, op: synth}
  - {name: BILL_AMT1, type: n, op: synth}
  - {name: BILL_AMT2, type: n, op: synth}
  - {name: BILL_AMT3, type: n, op: synth}
  - {name: BILL_AMT4, type: n, op: synth}
  - {name: BILL_AMT5, type: n, op: synth}
  - {name: BILL_AMT6, type: n, op: synth}
  - {name: PAY_AMT1, type: n, op: synth}
  - {name: PAY_AMT2, type: n, op: synth}
  - {name: PAY_AMT3, type: n, op: synth}
  - {name: PAY_AMT4, type: n, op: synth}
  - {name: PAY_AMT5, type: n, op: synth}
  - {name: PAY_AMT6, type: n, op: synth}
  - {name: LIMIT_BAL, type: c, op: synth}
  - {name: SEX, type: c, op: synth}
  - {name: EDUCATION, type: c, op: synth}
  - {name: MARRIAGE, type: c, op: synth}

Setting obs/normrank to true selects a different method of synthesis, which adds noise to the final output.
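
For datasets with many fields, a configuration in the format of example.yaml can be generated rather than typed by hand. The following sketch writes the same keys that appear in the example (cb/obsynth, obs/normrank, obs/input) and nothing else. The YamlConfigWriter and Field classes are hypothetical helpers, not part of Observe & Synth, and the sketch assumes field names that need no yaml quoting.

```java
import java.util.List;
import java.util.Set;

public class YamlConfigWriter {

    /** One field entry: name, type (n, b or c) and op (synth or sample). */
    public static final class Field {
        final String name;
        final String type;
        final String op;

        public Field(String name, String type, String op) {
            if (!Set.of("n", "b", "c").contains(type)) {
                throw new IllegalArgumentException("type must be n, b or c: " + type);
            }
            if (!Set.of("synth", "sample").contains(op)) {
                throw new IllegalArgumentException("op must be synth or sample: " + op);
            }
            this.name = name;
            this.type = type;
            this.op = op;
        }
    }

    /** Renders the configuration text in the same layout as example.yaml. */
    public static String render(boolean normrank, List<Field> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("cb/obsynth:\n");
        sb.append("  obs/normrank: ").append(normrank).append('\n');
        sb.append("  obs/input:\n");
        for (Field f : fields) {
            // Assumes plain names (letters, digits, underscores) that need no quoting
            sb.append("  - {name: ").append(f.name)
              .append(", type: ").append(f.type)
              .append(", op: ").append(f.op).append("}\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(true, List.of(
                new Field("AGE", "c", "sample"),
                new Field("BILL_AMT1", "n", "synth"))));
    }
}
```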

Recipe Files

The recipe file is the output of the Observe phase. You will get one recipe file for each execution of observe. It’s recommended that these are stored in a central location so that they can be called upon for synthesis in the future.

It is important that users can transparently see what is inside the recipe file. Therefore it is a JSON file that can be loaded with standard tools.

The recipe file contains everything needed to synthesize a new dataset; nothing else (except the Observe & Synth software) is required. It holds:

  • Configuration, similar to the input yaml file but with more detail (such as variable dependencies, definition of categorical variables, etc)
  • Statistical properties learned about each variable
  • The underlying machine learning model for each variable
  • Sampled values

Scaling the Observe Function

Internal to Observe & Synth

  • Category Rationalization: Traditional one-hot encoding suffers from an exploding feature space. Large categories are compressed, then distributions across categories are learnt and stored in the recipe file. This reduces the dimensionality and execution time.
  • Parallelization: During the observe phase, each regression (targeting a unique variable in the dataset) can be executed independently. Observe & Synth therefore distributes these tasks so that they execute in parallel.
  • Native Execution: For large datasets (and hence large underlying matrices), critical computation is moved outside the JVM to exploit native CPU acceleration.
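
The parallelization idea can be illustrated with standard Java concurrency utilities: because each per-variable task is independent, the tasks can be submitted to a thread pool and their results collected afterwards. This is an illustrative sketch only, not the product's implementation. The ParallelPerVariable class is hypothetical, and the task passed in main (the length of the variable name) is a stand-in for a real per-variable computation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelPerVariable {

    /** Runs one independent task per variable on a thread pool and returns results in input order. */
    public static <R> Map<String, R> runPerVariable(List<String> variables, Function<String, R> task) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Math.max(1, Runtime.getRuntime().availableProcessors()));
        try {
            // Submit every task first so that they can run concurrently
            Map<String, Future<R>> futures = new LinkedHashMap<>();
            for (String variable : variables) {
                Callable<R> job = () -> task.apply(variable);
                futures.put(variable, pool.submit(job));
            }
            // Then wait for each result in turn
            Map<String, R> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<R>> entry : futures.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a per-variable task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in task: a real task would fit a model for the given variable
        System.out.println(runPerVariable(List.of("AGE", "BILL_AMT1", "PAY_0"), String::length));
    }
}
```

One consequence of this design is that the number of cores bounds the useful degree of parallelism, which is consistent with the multi-core recommendation in the hardware section.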

Recommendations for structuring your data

  • Only observe necessary fields
    • Data such as row IDs, or other data fields that are not statistically linked, should not be synthesized
    • They can be added post-hoc
  • Multiple smaller observations are more efficient than one large observation, so consider segmenting your input dataset
    • For example, in the scenario of pharmaceutical pricing data there may be a field for drug supplier
    • Rather than observing the entire pricing dataset in one go, observe (and subsequently synthesize) on a per drug supplier basis
  • Maintain the tables in your DB
    • Most datasets are stored in a relational DB
    • It may be tempting to flatten this DB into a large file ignoring the original table structure
    • However, these tables have meaning (by their original design) and therefore make good datasets for synthesis
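
Segmenting a dataset by a field such as drug supplier can be done before the observe step. The sketch below groups csv lines by the value of one column, keeping the header in each group so that every segment can be written out as its own csv file. It assumes a simple csv with no quoted fields or embedded commas. The CsvSegmenter class and the SUPPLIER and PRICE column names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CsvSegmenter {

    /**
     * Groups data rows by the value in the named column.
     * Each group starts with the header line, followed by its rows in original order.
     * Assumes no quoted fields or embedded commas.
     */
    public static Map<String, List<String>> segment(List<String> lines, String column) {
        String header = lines.get(0);
        int index = Arrays.asList(header.split(",", -1)).indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String row : lines.subList(1, lines.size())) {
            String key = row.split(",", -1)[index];
            groups.computeIfAbsent(key, k -> new ArrayList<>(List.of(header))).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "SUPPLIER,PRICE", // hypothetical column names
                "acme,10",
                "globex,12",
                "acme,11");
        segment(lines, "SUPPLIER").forEach((supplier, rows) ->
                System.out.println(supplier + " -> " + rows));
    }
}
```

Each group can then be written to its own csv file and observed separately, giving one recipe file per supplier.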

Integrating Observe & Synth

Observe & Synth can be included as a dependency in a JVM software project. We provide here examples using Maven & Java, however any JVM language can be used.

Configuring Maven

Your organization will have credentials for the Chatterbox Labs code repositories. It is first necessary to add these credentials to your settings.xml file. You can then add the necessary repositories and dependencies to your Maven project:

    <repositories>
        <repository>
            <id>chatterbox</id>
            <name>Chatterbox Labs Repo</name>
            <url>REPO URL HERE</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>co.chatterbox</groupId>
            <artifactId>obsynth</artifactId>
            <version>VERSION HERE</version>
        </dependency>
    </dependencies>

Calling the observe and synth functions

The following sample application shows the integration points into Observe & Synth:

import co.chatterbox.obsynth.ObSynth;

public class ExampleObSynthProject {

    public void runObservePhase(String recipePath) {

        //load the paths of the yaml and csv from the resources directory of this project
        ClassLoader classLoader = getClass().getClassLoader();
        String yaml = classLoader.getResource("example.yaml").getPath();
        String csv = classLoader.getResource("example.csv").getPath();

        //run the observe phase of Observe & Synth, writing the recipe to recipePath
        ObSynth.observe(yaml, csv, recipePath);
    }

    public void runSynthesizePhase(String recipeFilePath, Integer numRows, String outputPath) {

        //run the synth phase of Observe & Synth, using the recipe file.
        //write numRows of new data to a csv in outputPath
        ObSynth.synth(recipeFilePath, numRows, outputPath);
    }

    public static void main(String[] args) {
        System.out.println("### Sample Observe & Synth Integration Application ###");

        //define where to write the recipe file
        String recipePath = "myRecipeFile.recipe";

        ExampleObSynthProject project = new ExampleObSynthProject();

        //observe the example dataset, producing the recipe file
        project.runObservePhase(recipePath);

        //synthesize 150 new rows from the recipe into synthedData.csv
        project.runSynthesizePhase(recipePath, 150, "synthedData.csv");
    }
}

Get in Touch