Data Validator Developer Guide

Document Overview

This document is aimed at a technical audience and describes the Data Validator API in detail, so that developers can integrate the Data Validator as a software project dependency.

Before reading this document, read the Data Validator User Guide, which covers the base functionality, options and operation of the Data Validator.

Configuring the Environment

The Data Validator is available as a Maven-style dependency that any Java build system can use. Add the dependency string to your build file:

Gradle

'co.chatterbox:data_validator:xxxx.yy'

Maven

<dependency>
    <groupId>co.chatterbox</groupId>
    <artifactId>data_validator</artifactId>
    <version>xxxx.yy</version>
</dependency>

Leiningen

[co.chatterbox/data_validator "xxxx.yy"]

Add the Chatterbox Labs repository and your authentication details to your build file to gain access to all the Data Validator interfaces and functions.

The Data Validator as a Software Project Dependency

The Data Validator can be included as a dependency in an external software project and accessed directly.

Validate

The validate function can be called programmatically: pass in the data and options, and it returns the resulting artifacts as strings.

Depending on the language you are developing in, you can either require the Clojure namespace that contains the validate function directly, or import the co.chatterbox.validator.api.Validator class.

Importing and running the Validator class from Java:

import clojure.lang.Keyword;
import co.chatterbox.validator.api.Validator;
import java.util.*;

public class Main {

    public static void main(String[] args) {
        // Options are passed as a map keyed by Clojure keywords.
        Map params = new HashMap();
        List<String> targets = Arrays.asList("label");
        List<String> predictors = Arrays.asList("text");
        params.put(Keyword.intern("id-column"),"id");
        params.put(Keyword.intern("target-column"),targets);
        params.put(Keyword.intern("test-train-column"),"test_train");
        params.put(Keyword.intern("input-format"),Keyword.intern("csv"));
        params.put(Keyword.intern("encoding"),"utf-8");
        params.put(Keyword.intern("output-file"),"");

        // Per-column type declarations, keyed by column name.
        Map<String,Keyword> m1 = new HashMap<>();
        m1.put("text",Keyword.intern("t"));
        params.put(Keyword.intern("column-type"),m1);

        // Run the validation on the input file.
        Map map = Validator.validateDatasetComplete("path/to/file.csv",params);

        // Export the report as YAML, HTML and JSON strings.
        String reportYaml = Validator.exportReport(map,"yaml");
        String reportHtml = Validator.exportReport(map,"html");
        String reportJson = Validator.exportReport(map,"json");

        // Export YAML for the given predictor columns.
        String engineYaml = Validator.exportYAML(map, predictors);
    }
}
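Because the export functions return strings rather than writing files, the calling application decides where the artifacts go. The following is a minimal sketch of one way to persist them, assuming the strings have already been obtained from Validator.exportReport and Validator.exportYAML. The ReportWriter class, the file names and the output directory are hypothetical and not part of the Data Validator API; placeholder strings stand in for the real exports so that the example runs without the library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Data Validator API.
public class ReportWriter {

    // Writes each entry to <outputDir>/<file name> as UTF-8.
    // Keys are file names, values are the exported strings.
    public static void writeReports(Path outputDir, Map<String, String> reports) {
        try {
            Files.createDirectories(outputDir);
            for (Map.Entry<String, String> e : reports.entrySet()) {
                byte[] bytes = e.getValue().getBytes(StandardCharsets.UTF_8);
                Files.write(outputDir.resolve(e.getKey()), bytes);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        // Placeholders: in a real project these would be the values returned by
        // Validator.exportReport(map, "json") and Validator.exportYAML(map, predictors).
        Map<String, String> reports = new LinkedHashMap<>();
        reports.put("report.json", "{}");
        reports.put("engine.yaml", "placeholder: true\n");
        writeReports(Paths.get("validator-output"), reports);
    }
}
```

Writing the bytes with an explicit UTF-8 charset keeps the files consistent with the "utf-8" encoding option used for the input in the example above.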

Reading from databases

The following Java code shows how to read input data from a database table.

import clojure.lang.Keyword;
import co.chatterbox.validator.api.Validator;
import java.util.*;

public class Main {

    public static void main(String[] args) {
        Map params = new HashMap();
        List<String> targets = Arrays.asList("label");
        List<String> predictors = Arrays.asList("sepal_length","sepal_width","petal_length","petal_width");
        params.put(Keyword.intern("id-column"),"id");
        params.put(Keyword.intern("target-column"),targets);
        params.put(Keyword.intern("test-train-column"),"test_train");
        // Read from a SQL table instead of a file.
        params.put(Keyword.intern("input-format"),Keyword.intern("sql"));
        params.put(Keyword.intern("sql-table"),"iris");
        params.put(Keyword.intern("output-file"),"");

        // The JDBC URL takes the place of the file path.
        String dburl = "jdbc:postgresql://dbhost:5432/mydb?user=alice&password=PASSWORD";
        Map map = Validator.validateDatasetComplete(dburl,params);

        String reportJson = Validator.exportReport(map,"json");
        String engineYaml = Validator.exportYAML(map, predictors);
    }
}
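The example above embeds the user name and password directly in the JDBC URL string. As a hedged sketch of an alternative, the URL can be assembled at runtime from environment variables, with the credentials URL-encoded so that characters such as "&" or "@" do not break the query string. The DbUrl class and the DB_USER and DB_PASSWORD variable names are assumptions made for illustration; they are not part of the Data Validator API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the Data Validator API.
public class DbUrl {

    // URL-encodes a single query parameter value as UTF-8.
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(ex);
        }
    }

    // Builds a PostgreSQL JDBC URL in the same shape as the example above.
    public static String build(String host, int port, String database,
                               String user, String password) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database
                + "?user=" + enc(user) + "&password=" + enc(password);
    }

    public static void main(String[] args) {
        // Assumed environment variable names; adjust to your deployment.
        String user = System.getenv("DB_USER");
        String password = System.getenv("DB_PASSWORD");
        if (user == null || password == null) {
            System.err.println("Set DB_USER and DB_PASSWORD before running.");
            return;
        }
        String dburl = build("dbhost", 5432, "mydb", user, password);
        // dburl can now be passed to Validator.validateDatasetComplete(dburl, params).
        System.out.println("Built JDBC URL for user " + user);
    }
}
```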

Working with the validator output map

The call to Validator.validateDatasetComplete returns the results of the validator’s analysis in a Java Map object. This object can be inspected to retrieve the collected information, including:

  • Global statistics for each column
  • Per-split statistics for each column
    • Here, a “split” is a subset of the data with a specific combination of values in the target column and the test_train column.
  • A list of potential problems identified by the validator and categorised as warnings or errors.

In the following example, the problems reported by the validator on the Iris dataset are printed to the system console.

Map map = Validator.validateDatasetComplete(...);

List<Map> problems = (List<Map>)map.get(Keyword.intern("validator/problems"));

for(int i=0; i<problems.size(); i++) {
    Map problem = problems.get(i);
    String type = ((Keyword)problem.get(Keyword.intern("problem/type"))).getName();
    String msg = problem.get(Keyword.intern("problem/msg")).toString();
    String split = problem.get(Keyword.intern("problem/split")).toString();
    System.out.println(String.format("%s: %s (split: %s)",type,msg,split));
}

This will print:

warning: Test train column contains 1 values (split: :split/global)
error: There are not enough data points for this label. We recommend at least 1000 data points per label. (split: {:split/target "Iris-versicolor", :split/test-train "train"})
error: There are not enough data points for this label. We recommend at least 1000 data points per label. (split: {:split/target "Iris-setosa", :split/test-train "train"})
error: There are not enough data points for this label. We recommend at least 1000 data points per label. (split: {:split/target "Iris-virginica", :split/test-train "train"})
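When the validator runs as part of an automated pipeline, a summary of the problem list is often more useful than the individual messages. The following is a minimal sketch that counts problems per type. ProblemSummary is a hypothetical helper, not part of the Data Validator API. So that the example runs without Clojure on the classpath, the key under which the type is stored is passed in by the caller, and the stand-in data uses plain strings. With the real output map, the key would be Keyword.intern("problem/type") and the labels would be printed in Clojure keyword form (for example ":error").

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Data Validator API.
public class ProblemSummary {

    // Counts problems per type. typeKey is the key under which each
    // problem map stores its type.
    @SuppressWarnings("rawtypes")
    public static Map<String, Integer> countByType(List<? extends Map> problems, Object typeKey) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map problem : problems) {
            String type = String.valueOf(problem.get(typeKey));
            counts.merge(type, 1, Integer::sum);
        }
        return counts;
    }

    // Builds a stand-in problem map with a string key and a string type.
    private static Map<String, Object> standIn(String type) {
        Map<String, Object> p = new HashMap<>();
        p.put("problem/type", type);
        return p;
    }

    public static void main(String[] args) {
        // Stand-in data mirroring the shape of the output shown above.
        List<Map<String, Object>> problems = new ArrayList<>();
        problems.add(standIn("warning"));
        problems.add(standIn("error"));
        problems.add(standIn("error"));
        problems.add(standIn("error"));

        Map<String, Integer> counts = countByType(problems, "problem/type");
        System.out.println(counts);
    }
}
```

A caller could then, for example, fail a build when the count for the error type is greater than zero while only logging warnings.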

In the following example, global and per-split statistics for the sepal_length column are printed.

Map map = Validator.validateDatasetComplete(...);

Map data_stats = (Map)map.get(Keyword.intern("data/stats"));

for(Object key: data_stats.keySet()) {
    // iterate through global and per-split statistics
    Map value = (Map)data_stats.get(key);
    String label = "global";
    if (key instanceof Map) {
        Map<Keyword, String> m = (Map<Keyword, String>) key;
        String target = m.getOrDefault(Keyword.intern("split/target"), "");
        String test_train = m.getOrDefault(Keyword.intern("split/test-train"), "");
        label = String.format("split (target=%s, test_train=%s)",target,test_train);
    }
    Map sl_stats = (Map)value.get("sepal_length");
    Number sl_mean = (Number)sl_stats.get(Keyword.intern("mean"));
    System.out.println(String.format("sepal_length mean=%f (%s)",sl_mean.doubleValue(),label));
}

This produces the following output:

sepal_length mean=5.843333 (global)
sepal_length mean=5.006000 (split (target=Iris-setosa, test_train=))
sepal_length mean=5.936000 (split (target=Iris-versicolor, test_train=))
sepal_length mean=6.588000 (split (target=Iris-virginica, test_train=))
sepal_length mean=5.006000 (split (target=Iris-setosa, test_train=train))
sepal_length mean=5.936000 (split (target=Iris-versicolor, test_train=train))
sepal_length mean=6.588000 (split (target=Iris-virginica, test_train=train))

The set of statistics collected depends upon the column type (text, categorical or numeric) and can include the following information:

  • min, max
  • mean, sd, q1, median, q3, iqr (univariate statistics)
  • zero-count (count of zero values)
  • empty-count (number of missing values)
  • distinct (frequencies for each category)
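Because the available statistics vary by column type, code that reads a particular statistic should allow for it being absent (a text column may have no mean, for instance). The following is a minimal sketch of a defensive lookup. StatLookup is a hypothetical helper, not part of the Data Validator API, and the stand-in map uses string keys so that the example runs without the library. With the real output, the key would be a keyword such as Keyword.intern("mean"), as in the earlier example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;

// Hypothetical helper, not part of the Data Validator API.
public class StatLookup {

    // Returns the statistic as a double, or an empty OptionalDouble when
    // the column has no statistics map or the value is missing or non-numeric.
    public static OptionalDouble numericStat(Map<?, ?> columnStats, Object statKey) {
        if (columnStats == null) {
            return OptionalDouble.empty();
        }
        Object value = columnStats.get(statKey);
        if (value instanceof Number) {
            return OptionalDouble.of(((Number) value).doubleValue());
        }
        return OptionalDouble.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the statistics map of one numeric column.
        Map<String, Object> columnStats = new HashMap<>();
        columnStats.put("mean", 5.843333);

        OptionalDouble mean = numericStat(columnStats, "mean");
        OptionalDouble median = numericStat(columnStats, "median");
        System.out.println("mean present: " + mean.isPresent());
        System.out.println("median present: " + median.isPresent());
    }
}
```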

Get in Touch