This document is a guide to using the Chatterbox Labs Data Validator, with examples of how to run it. The Data Validator is an easy-to-use tool for analysing a dataset and generating a YAML configuration file suitable for building a classification, regression or time-series model, as appropriate.
Installation & Environment Configuration
The Data Validator runs on the Java Virtual Machine (JVM) platform. It requires that the Java Runtime Environment (Java SE 8) be installed and available on the machine.
Because the Data Validator is written in Clojure and Java, it can run on any operating system that supports a Java Virtual Machine. 64-bit operating systems are recommended for optimal memory usage.
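Before installing, it can help to confirm which Java version is available on the machine. The following is a minimal Python sketch and is not part of the Data Validator; the function names are hypothetical. It parses the text that `java -version` prints (to standard error) and extracts the major version, treating the legacy "1.8" numbering as version 8.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy scheme ("1.8.0_281" -> 8) and the newer
    scheme ("11.0.2" -> 11). Returns None if nothing matches.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        # Legacy numbering: "1.8" means Java 8.
        return int(match.group(2))
    return major


def installed_java_major_version():
    """Run `java -version` and parse its output (written to stderr)."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return java_major_version(result.stderr)
```

This guide names Java SE 8 as the requirement; whether later versions are supported is not covered here, so the sketch only reports the version and leaves the decision to the reader.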
It is recommended that the Data Validator is installed on a centrally accessed server. Each organization will have its own standard configuration, but as a benchmark, a suitable server setup is:
- CPU: Multi-core (4 minimum, 16 recommended) Xeon/EPYC/i7 @ 3GHz
- Memory: 32GB Recommended (minimum of 8GB)
- OS: Ubuntu Linux (64-bit) or Microsoft Windows (64-bit), running Java SE 8 (64-bit)
- Disk: 50GB (it is recommended that the Data Validator is installed in a directory that is accessible and writable by all users, rather than in each home directory)
- Connectivity: Remote access (ssh / MS Remote Desktop) and access to the outside web
The usual process of operating the Data Validator is as follows:
- Obtain data in a file in CSV or JSON format.
- Run the Data Validator to produce an HTML report on your dataset.
- Review the report. It highlights potential warnings and problems in the data that may need to be addressed. If necessary, make changes to the dataset and re-run the Data Validator to refresh the report.
- Run the Data Validator again to produce a YAML configuration file for training a package using the Chatterbox Labs Cognitive Engine.
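The process above can be scripted. The following is a minimal Python sketch, assuming the jar sits in the current directory; the helper names (`report_command`, `yaml_command`, `run`) are hypothetical and not part of the Data Validator. It only uses options described in this guide, and keeps the "x.y" version placeholder as written here.

```python
import subprocess

# Placeholder jar name as written in this guide; substitute the real version.
JAR = "data-validator-x.y-standalone.jar"


def report_command(data_file, report_path="result.html"):
    """Command for the report step: write an HTML report on the dataset."""
    return ["java", "-jar", JAR, "validate",
            "--output-file", report_path, data_file]


def yaml_command(data_file, yaml_path):
    """Command for the final step: also write a YAML configuration file."""
    return ["java", "-jar", JAR, "validate",
            "--output-yaml", yaml_path, data_file]


def run(command):
    """Run one command; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(command, check=True)
```

For example, call `run(report_command("mydata.csv"))`, review the report, and then call `run(yaml_command("mydata.csv", "engine.yaml"))` once the dataset is in good shape.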
Generating a report
Use the following command to run the Data Validator to produce an HTML report:
java -jar data-validator-x.y-standalone.jar validate [options] <file-path/db-connection-url>
The set of available options is:
- --target-column column – specify the name of the target column(s) (will default to “label” if not specified). If there are multiple target columns, pass in a comma separated list.
- --id column – specify the name of the id column (will default to “id” if not specified).
- --test-train column – specify the name of the test/train column (will default to “test_train” if not specified).
- --encoding name – specify the name of the file encoding used (will default to “utf-8” if not specified).
- --input-format csv|json|sql – specify the format of the input: csv or json for a file, or sql to read from a database table. If not specified, reading from a file is assumed and an attempt will be made to infer the format from the input file suffix.
- --sql-table tablename – used in conjunction with --input-format sql, provide the name of the database table to read.
- --output-file path – specify the path of the output file (will default to “result.html” if not specified).
- --output-format html|json|yaml – specify the format for the output file as one of html, json or yaml.
- --output-yaml path – specify the path of a YAML formatted file to generate for training a package using the Chatterbox Labs Cognitive Engine (if not specified, no YAML file will be generated).
- --anonymize true|false – obfuscate values and column names with generic terms (defaults to false).
- --log-level trace|debug|info – set the level of logging statements.
- --throw true|false – if set to true, exceptions will not be caught (defaults to false).
- --column-type column=n|c|t – override the column type for a named column as one of n (numeric), c (categorical) or t (text). This option may be passed multiple times to specify the type of multiple columns.
- --help – display information on the set of available options.
Run the Data Validator with the --help option to see the complete set of options:
java -jar data-validator-x.y-standalone.jar validate --help
Data Validator report HTML format
By default the Data Validator will generate an HTML formatted report saved to the file result.html.
java -jar data-validator-x.y-standalone.jar validate --id retailer_id --target-column sales mydata.csv
The report written by the Data Validator provides several sections:
- At the top of the page a brief summary is displayed.
- On the left-hand side of the page the columns and their data types are listed in a sidebar. Select a column to display more details.
- The remainder of the page displays details of the column selected in the sidebar.
Overriding column types
Use the --column-type option to override the column types that the validator will automatically infer from your data. For example, to override the column types for “age” to numeric, “notes” to text and “education_years” to categorical, use the following options:
java -jar data-validator-x.y-standalone.jar validate --target-column sales --column-type age=n --column-type notes=t --column-type education_years=c mydata.csv
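When many columns need overriding, the repeated --column-type arguments can be generated from a mapping. This is a minimal Python sketch; the function name `column_type_args` is hypothetical, and the only type codes it accepts are the three described in this guide (n, c, t).

```python
# Type codes as described in this guide.
VALID_TYPES = {"n": "numeric", "c": "categorical", "t": "text"}


def column_type_args(overrides):
    """Turn {"age": "n", ...} into repeated --column-type arguments."""
    args = []
    for column, type_code in overrides.items():
        if type_code not in VALID_TYPES:
            raise ValueError(
                f"unknown type {type_code!r} for column {column!r}; "
                f"expected one of {sorted(VALID_TYPES)}"
            )
        args += ["--column-type", f"{column}={type_code}"]
    return args
```

The returned list can be spliced into the argument list of a `subprocess.run` call, between `validate` and the input file path. Checking the type code up front catches a typo such as `age=x` before the JVM is started.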
Specifying multiple target columns
Specify multiple targets (typical for time-series data) using the --target-column option and a comma separated list of names. For example, if the data contains target fields stock_a, stock_b and stock_h, use the following options:
java -jar data-validator-x.y-standalone.jar validate --target-column stock_a,stock_b,stock_h mydata.csv
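If the target names come from a list in a script, the comma separated value can be assembled as follows. This is a minimal Python sketch with a hypothetical function name. It rejects names that contain a comma, on the assumption that the tool would read such a comma as a separator; this guide does not describe any escaping mechanism.

```python
def target_column_args(targets):
    """Build the --target-column argument from a list of column names."""
    if not targets:
        raise ValueError("at least one target column is required")
    for name in targets:
        if "," in name:
            # Assumption: a comma inside a name would be treated as a separator.
            raise ValueError(f"column name contains a comma: {name!r}")
    return ["--target-column", ",".join(targets)]
```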
Other report formats
To consume structured output from the Data Validator (typically, to pass into another software system) use the --output-format option to choose output in json or yaml format. For example:
java -jar data-validator-x.y-standalone.jar validate mydata.csv --output-format json --output-file results.json
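A downstream program can then read the JSON file with any standard JSON parser. The following is a minimal Python sketch with hypothetical function names. The structure of the report is not described in this guide, so the sketch makes no assumption about its fields and only lists what is present at the top level; UTF-8 is assumed as the file encoding.

```python
import json


def load_report(path, encoding="utf-8"):
    """Read a JSON report written with --output-format json."""
    with open(path, encoding=encoding) as handle:
        return json.load(handle)


def top_level_summary(report):
    """Describe the top level of the report without assuming a schema."""
    if isinstance(report, dict):
        return sorted(report.keys())
    if isinstance(report, list):
        return f"list of {len(report)} items"
    return type(report).__name__
```

Calling `top_level_summary(load_report("results.json"))` is a quick way to discover which sections the report contains before writing code that depends on them.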
Generating YAML files
Use the --output-yaml option if you would like the Data Validator to output a YAML file in the correct format for training a package using the Cognitive Engine. You may use this as a starting point and customise the YAML file. See the Cognitive Engine User Guide for more details.
java -jar data-validator-x.y-standalone.jar validate mydata.csv --output-yaml engine.yaml
Reading from a Database
Use --input-format sql to read input data from a database table rather than a file. Also specify the database table name using the --sql-table option, and pass a JDBC connection URL rather than a file path as the main argument.
java -jar data-validator-x.y-standalone.jar validate --input-format sql --sql-table loans "jdbc:postgresql://dbhost:5432/customerdata?user=postgres&password=XXXX"
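Credentials that contain characters such as `&`, `=` or spaces would break a connection URL of the form shown above unless they are percent-encoded. The following is a minimal Python sketch, with a hypothetical function name, that builds a PostgreSQL URL in that form. It assumes the JDBC driver percent-decodes query parameter values, which is worth confirming against the driver's own documentation.

```python
from urllib.parse import quote, urlencode


def postgres_jdbc_url(host, port, database, user, password):
    """Build a PostgreSQL JDBC URL with percent-encoded credentials."""
    # quote_via=quote encodes a space as %20 rather than "+".
    query = urlencode({"user": user, "password": password}, quote_via=quote)
    return f"jdbc:postgresql://{host}:{port}/{quote(database)}?{query}"
```

When the URL is typed on a command line, keep it in quotes as in the example above so that the shell does not interpret the `&`. When it is passed as a single list element to `subprocess.run`, no shell quoting is needed.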
The Data Validator includes JDBC drivers for the following DBMS (but will also work with third party JDBC drivers):
- SQL Server