TensorFlow Data Validation

Auto Data Drift and Anomaly Detection

Alparslan Mesri
Jun 13, 2023

This article is written by Alparslan Mesri and Eren Kızılırmak.

After deployment, a machine learning model needs to be monitored. Model performance may degrade over time due to data drift and anomalies in incoming data. These issues can be detected with Google’s TensorFlow Data Validation (TFDV) library.

Understanding Data Drift

Models may decay over time due to external or internal factors. One of the main causes of model decay is data drift (also called feature drift), which happens when the data distribution changes gradually. This can lead to mispredictions because the model was never trained on the drifted data. For example, a recommendation model for online shopping may perform poorly when a new trending product’s sales increase, because the frequency of that product in the data changes gradually. It is therefore important to monitor incoming data before evaluation.

Installation Guide

Before we dive into the details, it’s important to note that the TensorFlow Data Validation library is currently unavailable for new Mac models with M-series chips. However, we can overcome this limitation by using a cloud environment. The library can be installed with pip, but make sure to install the necessary dependencies first. Here’s the installation command:
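A minimal sketch of the installation, assuming a Colab-style cloud environment where pip is available (TFDV pulls in TensorFlow and Apache Beam as dependencies):

```shell
# Upgrade pip first so dependency resolution succeeds, then install TFDV
pip install --upgrade pip
pip install tensorflow-data-validation
```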

Dataset

For the purpose of this article, we will be using the stroke dataset obtained from Kaggle. This dataset contains 5110 observations with 12 attributes, making it suitable for demonstrating the capabilities of TFDV. You can access the dataset and the code from here.

Loading the data

Let’s load our data and check the first five rows.

Interface for Numerical Features

TFDV gives us a useful interface to explore features and warns us about missing values. For numerical features, the following statistics are computed: count, missing, mean, std dev, zeros, min, median, and max.

Stats for Numerical Features

In this case we don’t need to worry much about the warnings about zeros, because the stroke data documentation states that 0 means the patient does not have the disease and 1 means the patient does. So it is actually a categorical value, and the warning can be ignored. The only thing to worry about here is the class imbalance.

If we look closely at the age feature, its minimum value is 0.08, which is odd. This row needs to be clarified later. There are also missing values in bmi (body mass index), which may affect our predictions.

Stats for Categorical Features

Interface for Categorical Features

For categorical features, the count, missing, unique, top, freq top, and avg str len statistics are computed. We can see that the female population is almost one third larger than the male population. Such imbalanced data may cause wrong predictions if the model is not made aware of the imbalance.

Acquiring Schema From Statistics

The schema briefly describes our statistics, including data type, valency, and domain. This is useful when comparing the training data statistics to new data, so saving the schema is highly recommended.

To get the schema, the infer_schema() method will help us:

Schema

Editing Schema

The presence of bmi (body mass index) is automatically inferred as ‘optional’ because the bmi feature had 3.93% missing data. We can change the schema to make it ‘required’. The same goes for the domain, which we can redefine:

Edited Schema

As we can see, the domain and presence have changed.

Detection of Anomalies

Let’s compare our current data with the updated schema. To do this we need two methods:

validate_statistics(): compares our schema to the given data statistics and returns an anomalies buffer.

display_anomalies(): displays the anomalies.

Anomaly

It is up to the user to decide what to do with anomalies. Usually, when an anomaly is detected, the data is not fed to the model.

Drift Detection

The current data is not drifted, so I will simulate drift by increasing each person’s bmi by 30%. Notice that the parameters of visualize_statistics() can be used to compare two statistics sets at a time; unfortunately, two is the maximum.

We can obviously detect the drift with the naked eye, but can validate_statistics() do the same?

Wrong anomaly detection

TFDV didn’t detect the drift because it lacks information about the drift threshold. This can be fixed by changing the attributes of the selected feature in the schema.

Accurate Drift detection

The metric used for drift detection is the Jensen-Shannon divergence, which measures the similarity between two distributions: 0 indicates the distributions are identical, and 1 indicates they do not overlap at all. If we want to detect even minimal drift, we need to set the threshold close to zero (about 0.05).
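The 0-to-1 bounds are easy to verify with a tiny pure-Python sketch of the divergence (using base-2 logarithms, which is what bounds it to [0, 1]):

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence: average KL to the midpoint distribution
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd([0.5, 0.5], [0.5, 0.5]))  # identical distributions -> 0.0
print(jsd([1.0, 0.0], [0.0, 1.0]))  # non-overlapping distributions -> 1.0
```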

Conclusion

Data drift and anomalies can lead to model mispredictions. To mitigate this, models can be made more robust with ensemble learning, hyperparameter tuning, and monitoring for signs of drift and anomalies. While other problems can be overcome with different libraries, the TFDV library can help us with the monitoring part.

