This document is a guide to using the Chatterbox Labs Data Validator (UI) tool, with examples of how to run it. The Data Validator is an easy-to-use tool for analysing a dataset and generating a YAML configuration file suitable for building a classification, regression or time-series model, as appropriate.
The Data Validator runs on the Java Virtual Machine (JVM) platform.
For Microsoft Windows, download the DataValidator-X-Y-WIN.zip file and unzip it to your filesystem. Start the application by clicking on the file DataValidator/DataValidator.exe. The required Java Runtime files are included in the distribution.
For macOS, download the DataValidator-X-Y-OSX.zip file and unzip it to your filesystem. Click on the DataValidator-X-Y-OSX application to open the Data Validator. The required Java Runtime files are included in the distribution.
For Linux, the Java Runtime Environment (Java SE 11) must be installed and available on the machine. Download the DataValidator-X-Y-STANDALONE.jar file and launch it using:
java -jar DataValidator-X-Y-STANDALONE.jar
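Before launching the JAR on Linux, it can help to confirm which Java runtime is on the PATH. The following is a minimal sketch, not part of the Data Validator distribution. The function names parse_java_major and installed_java_major are hypothetical, and the script only reports the version it finds; this guide specifies Java SE 11 and does not say whether other versions are supported.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Releases before Java 9 report themselves as 1.x (for example 1.8.0_292).
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Run `java -version` and return the major version, or None if unavailable."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("No Java runtime found on PATH")
    elif major != 11:
        print(f"Java {major} found; this guide specifies Java SE 11")
    else:
        print("Java 11 found")
```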
Each organization will have its own standard configuration, but as a benchmark, a comparable setup is:
- CPU: Multi-core (2 minimum, 4 recommended) i7/i9/Ryzen @ > 2GHz
- Memory: 16GB Recommended (minimum of 8GB)
- OS: Microsoft Windows (64-bit), macOS or Ubuntu Linux with Java SE 11 (64-bit)
The usual process of operating the Data Validator is as follows:
- Obtain data in a file in CSV or JSON format or in a database table.
- Run the Data Validator, choosing the appropriate machine learning roles and data types for columns in the dataset as you move through the Data Source, Task and Validation screens.
- Review the information presented by the Data Validator. Any errors will need to be fixed, and all warnings should be carefully considered.
- If necessary, make changes to the dataset and re-run the Data Validator. You may wish to use the Report function to create an HTML (user-friendly) or JSON (machine-readable) report.
- You may use the Data Validator Export function to produce a YAML configuration file for training a package using the Chatterbox Labs Cognitive Engine, or creating a recipe using the Chatterbox Labs Observe & Synth tool.
Loading the data
The Data Validator opens at the Data Source screen. Select the type of data source for the data you wish to load into the Data Validator. For CSV data sources, you will be prompted to select a CSV file in the filesystem.
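Before loading a CSV file, a quick structural check can save a round trip through the Data Source screen. The sketch below is independent of the Data Validator and makes no claim about which checks the tool itself performs. The function name check_csv and the wording of its messages are hypothetical.

```python
import csv
import io


def check_csv(text):
    """Return a list of human-readable structural problems found in CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    if len(set(header)) != len(header):
        problems.append("duplicate column names in header")
    if any(name.strip() == "" for name in header):
        problems.append("blank column name in header")
    # Records are counted from 1, with the header as record 1.
    for record_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"record {record_no}: expected {len(header)} fields, found {len(row)}"
            )
    return problems
```

To check a file on disk, read it with open(path, newline="") and pass the contents to check_csv; an empty list means none of these particular problems were found.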
For data sources based on a database table, you will be prompted to enter the JDBC information for your database and table into a form. Consult your database administrator for advice on filling out the form. The Data Validator includes JDBC drivers for the following DBMS:
- SQL Server
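For orientation, a SQL Server JDBC connection URL generally takes the form shown in the sketch below. This guide does not describe the fields of the Data Validator's form, so whether it expects a full URL or the individual parts (host, port, database) is an assumption to confirm with your database administrator. The function name sqlserver_jdbc_url and the example host and database names are hypothetical.

```python
def sqlserver_jdbc_url(host, database, port=1433):
    """Build a SQL Server JDBC URL; 1433 is SQL Server's default TCP port."""
    return f"jdbc:sqlserver://{host}:{port};databaseName={database}"


# Example with placeholder values (not a real server).
print(sqlserver_jdbc_url("db.example.com", "sales"))
```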
Choosing the target column
Once the data is loaded the Task screen is displayed. Click on the column to be used as the target to continue to the Validation screen.
Validating column types and roles
On the Validation screen, review any problems (errors or warnings) and make any appropriate adjustments to the data or metadata. If the underlying data needs to be modified, return to the Data Source screen and reload the modified data.
If you wish to generate a YAML file for the Cognitive Engine, ensure that one column is marked as the ID field and one is marked as the Test/Train field. Click on the column headings and select the role from the menu.
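If your dataset does not yet contain columns suitable for the ID and Test/Train roles, you can add them before loading the file. The following is a minimal sketch under stated assumptions: the column names row_id and split, the labels "train" and "test", and the 80/20 split are all hypothetical, since this guide does not specify which values the Test/Train field expects. Check the values against what the Validation screen accepts.

```python
import csv
import io
import random


def add_id_and_split(text, id_column="row_id", split_column="split",
                     test_fraction=0.2, seed=0):
    """Return CSV text with a sequential ID column and a train/test label column."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("CSV text has no header row")
    header, records = rows[0], rows[1:]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_column] + header + [split_column])
    for index, record in enumerate(records, start=1):
        # Assumed labels: the guide does not state the expected values.
        label = "test" if rng.random() < test_fraction else "train"
        writer.writerow([index] + record + [label])
    return out.getvalue()
```

Because each row is assigned independently at random, the actual proportion of test rows only approximates test_fraction, particularly for small datasets.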
You can also reassign the type (Categorical, Numeric or Text) of each column if the type automatically chosen by the Data Validator and displayed under the column heading is not appropriate.
To define a Time Series task, configure multiple numeric targets. To add a column to the set of targets, click on the column heading and select the Add to the targets option.
Exporting YAML files
Select the Cognitive Engine yaml option from the Export menu in the top right-hand corner of the screen to export a YAML file for training the Cognitive Engine. Before saving the YAML file for a regression or classification task, select at least one additional column to be used as a predictor from the list on the left-hand side of the screen.
You can also select the Observe & Synth yaml option from the Export menu to save a YAML file that configures the Observe & Synth tool to construct a recipe for synthesizing data for a subset of the columns. Select any additional columns you wish to synthesize (the column selected as the target is synthesized automatically).