Chapter 5: Making Data Quality Monitoring Models Work in the Real World
May 1, 2025
Welcome to “Use AI to modernize your data quality strategy,” a series spotlighting insights from our O’Reilly book, Automating Data Quality Monitoring. This post corresponds to Chapter 5: Building a Model That Works on Real-World Data.
It’s one thing to develop a theoretical model to detect data quality issues, and quite another to implement that model successfully on arbitrary, real-world datasets.
The datasets you encounter in real life will have nuances like seasonality, time-based features, and column correlations. If you don’t have strategies to account for these nuances, your model will probably send too many alerts or not enough alerts—either way, that’s not very helpful. Chapter 5 first explains the main types of real-world nuances you’ll encounter, then how to use synthetic chaos to optimize your model. This post gives a conceptual summary; the full book chapter (get it for free!) goes into quite a bit more detail.
Common data phenomena can trip up an ML model
As we explained in Chapter 4, Anomalo’s approach to using machine learning (ML) for data quality monitoring is to have a model guess whether a given piece of data is from today or not. To summarize, we take a sample of today’s data for training along with samples from previous days, and then run the rest of today’s data through the model. If the model can’t discern whether the data came from today, versus last week or last year, then there’s a very good chance all is well; if it can, then there’s probably something wonky.
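To make the framing concrete, here is a minimal sketch of that “is this from today?” check using scikit-learn. The created_date column name and the numeric-features-only simplification are assumptions for illustration; Anomalo’s production model is considerably more sophisticated.

```python
# Minimal sketch (not Anomalo's production code): label rows by whether they
# were written today, train a classifier, and see if it can tell the difference.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def today_vs_history_auc(df: pd.DataFrame, date_col: str = "created_date") -> float:
    """ROC AUC of a model that tries to separate today's rows from historical samples.
    AUC near 0.5: the model can't tell them apart, so the data looks normal.
    AUC near 1.0: something about today's data is distinctive and worth a look."""
    today = df[date_col].max()
    labels = (df[date_col] == today).astype(int)
    features = df.drop(columns=[date_col]).select_dtypes("number")  # numeric only, for brevity
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.5, stratify=labels, random_state=0
    )
    model = HistGradientBoostingClassifier().fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```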
Many datasets have expected variations over time that, if not properly accounted for, would likely lead the model to over-predict that data came from today. Others present inherent challenges based on their structure.
Let’s walk through some challenges you’re likely to encounter:
Seasonality
Sales and other business data often fluctuates seasonally. For example, your ice cream store probably does better business in July than in December. If your model doesn’t account for this, it might alert you to a perceived dip in December when business was as usual for that time of year. Same goes for time-of-day, day-of-week, or other cyclical changes.
Time-based features
Any timestamp columns or autoincrementing IDs, like sequential customer numbers, are a dead giveaway to the model. If it sees IDs that are higher than any it has seen before, it’s very easy to guess that those entries are from today. Directly analyzing these values will result in many false positives.
Chaotic tables
Not all tables follow regular update schedules. For instance, ad hoc processes like spontaneous inventory reconciliation in a warehouse can create a much higher volume of changes on one day versus another. Similarly, when you’re working on datasets, APIs, etc., you’ll likely be sending data through in irregular bursts as you’re testing.
These changes are purely a reflection of operational processes, rather than issues in the data itself, but without any mitigation the pattern will look anomalous to a model. If your model does not account for the level of inherent chaos in the table, it will over-alert on chaotic tables and under-alert on more stable tables.
Niche update types
Static tables and updated-in-place tables are not straightforward time-series tables and can trip up the system.
- Static tables aren’t based on a moment in time. Rather, they serve as references. They’re either dimension/lookup tables that have all the known information about something (say, the attributes of every item in your catalog) or summary tables (which generally pull in data from various sources, such as ad performance from various networks).
- Updated-in-place tables at first appear to be log tables, where records are added once and never changed. However, as the name suggests, they can change after they’re initially written. A common use for this kind of table is for ecommerce orders, where some columns such as shipping_date are generally initially NULL and then updated once a value is known.
Because the model is looking for a time series, you have to adapt these tables into a structure it can make sense of. Otherwise, these tables may misleadingly appear to have an anomaly every single day—for instance, today’s order entries are a lot more likely than last week’s to have a value of NULL for shipping_date.
Column correlations
An anomaly that begins in one column can affect multiple correlated columns. If anything, this makes anomalies easier for the system to detect; the issue is how the model flags them to end users.
For example, if an order_id is missing, then order_transaction_amount and other columns will also be NULL. Without specific attention, these would each throw a separate alert, leaving it up to a human to find the correlation.
Precise solutions for these common data characteristics
The aforementioned challenges come up frequently enough, and can cause enough woe, that they merit special attention. The good news is that there are reliable ways to account for them.
Seasonality
As described in Chapter 4, our model already accounts for patterns over time by sampling multiple periods. Another helpful tool is metadata. Generate lots of statistics when you monitor data, such as number of rows or average order_transaction_amount per day. Then use time series models on that metadata over time to identify seasonality trends. Explicitly highlighting trends will greatly help the model understand cyclical patterns as a feature of the dataset.
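One way to surface those cyclical patterns, sketched below, is to decompose a daily metadata series (such as rows per day) with a seasonal-trend decomposition from statsmodels. The weekly period and the 3-sigma residual cutoff are illustrative assumptions, not recommendations from the book.

```python
# Sketch: decompose a daily metadata series (e.g., rows per day) so that seasonal
# swings are modeled explicitly and only the residual is judged for anomalies.
import pandas as pd
from statsmodels.tsa.seasonal import STL

def seasonal_residuals(daily_metric: pd.Series, period: int = 7) -> pd.Series:
    """Split a daily metric into trend + seasonal + residual; return the residuals."""
    decomposition = STL(daily_metric, period=period, robust=True).fit()
    return decomposition.resid

# Usage idea: flag days whose residual is far outside typical residual noise.
# resid = seasonal_residuals(rows_per_day)
# flagged_days = resid[resid.abs() > 3 * resid.std()]
```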
Time-based features
It’s trivial to find and remove explicitly time-based columns, but others aren’t so obviously correlated with time, like autoincrementing IDs. Fortunately, there’s a systematic way to identify these features.
Build an additional, simple version of your model using the complete dataset. Look for features that are consistently significant to the model’s predictions over time, not just once; in other words, columns that are closely correlated with time even on a normal day. Then take those features out of the dataset that your real model uses. Just be careful to only remove the ones that will really throw off the model, because any column you remove can no longer be monitored.
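A rough sketch of that auxiliary-model idea, assuming a pandas DataFrame with a date column and using scikit-learn feature importances; the 0.5 importance cutoff, the crude NaN handling, and the created_date column name are illustrative choices.

```python
# Sketch of the auxiliary model: on several normal days, train a quick classifier
# to separate that day from the rest, and flag features that dominate every time.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def consistently_dominant_features(df, date_col="created_date", days=(), threshold=0.5):
    """Return columns whose feature importance exceeds `threshold` on every tested day."""
    dominant = None
    for day in days:
        labels = (df[date_col] == day).astype(int)
        features = df.drop(columns=[date_col]).select_dtypes("number").fillna(0)  # crude NaN handling
        model = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, labels)
        importances = pd.Series(model.feature_importances_, index=features.columns)
        dominant_today = set(importances[importances > threshold].index)
        dominant = dominant_today if dominant is None else dominant & dominant_today
    return sorted(dominant or [])

# Columns returned here (timestamps, autoincrementing IDs, etc.) are candidates to
# drop from the real model's features; remember that dropped columns are no longer monitored.
```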
Chaotic tables
Build a time series based on the average magnitude of the SHAP values, a calculation of how much each row-column cell contributed to the model’s prediction. These averaged SHAP values indicate how chaotic a table is over time. You can then set a threshold for alerting that can move up and down dynamically; for instance, if a table changes more frequently on the last Thursday of the month, you’ll know to look out for that. We recommend setting the threshold very high initially and lowering it over the next few weeks to fit the level of chaos inherent in the dataset.
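Here’s a minimal sketch of that idea using the shap library: compute the mean absolute SHAP value for each day’s sample to build the chaos series, then compare today’s value against a dynamic threshold derived from the table’s own recent history. The 28-day window and the 3-sigma rule are illustrative starting points, not prescriptions.

```python
# Sketch: average absolute SHAP value per day as a "chaos" score, compared
# against a dynamic threshold built from the table's recent history.
import numpy as np
import shap

def daily_chaos_score(tree_model, X_day):
    """Mean absolute SHAP value across all cells in one day's sample,
    for a tree-based model (e.g., gradient boosting or a random forest)."""
    shap_values = shap.TreeExplainer(tree_model).shap_values(X_day)
    return float(np.mean(np.abs(shap_values)))

def exceeds_dynamic_threshold(today_score, history_scores, window=28, n_sigmas=3.0):
    """Alert only if today's chaos score clears the table's own recent baseline."""
    recent = np.asarray(history_scores[-window:])
    return today_score > recent.mean() + n_sigmas * recent.std()
```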
Niche update types
There are two parts to this, and they’re both based on taking snapshots of the table over time.
First, identify if your table is static or updated in place. You can do this by monitoring the table’s historical drift in values, which is the rate at which data for previous days changes after it’s been checked the first time. Compare snapshots of the table over various days to check for these changes. (Check out pages 95–97 of the book for more detail on what to look for.)
Then, use these snapshots instead of live data to monitor these tables. In essence, you’re creating an accurate historical record of how the table appeared at periodic points in time, which the model can then evaluate.
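A simple sketch of a historical-drift check on two pandas snapshots of the same table, keyed by a hypothetical order_id column; a persistently high change rate suggests the table is updated in place (or is a static reference table) and should be monitored via snapshots rather than as an append-only log.

```python
# Sketch: measure how much previously seen rows changed between two snapshots.
# The order_id key is a hypothetical example; use whatever uniquely identifies rows.
import pandas as pd

def historical_drift_rate(snapshot_old: pd.DataFrame,
                          snapshot_new: pd.DataFrame,
                          key: str = "order_id") -> float:
    """Fraction of previously seen rows whose values differ between the snapshots."""
    merged = snapshot_old.merge(snapshot_new, on=key, suffixes=("_old", "_new"))
    shared_cols = [c[:-4] for c in merged.columns if c.endswith("_old")]
    changed = pd.Series(False, index=merged.index)
    for col in shared_cols:
        # Note: ne() counts NULL -> value updates (e.g., shipping_date filled in) as changes.
        changed |= merged[f"{col}_old"].ne(merged[f"{col}_new"])
    return float(changed.mean()) if len(merged) else 0.0
```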
Column correlations
The same SHAP values that quantify chaos are also the key to clustering correlated issues and preventing over-alerting. Look for multiple columns sharing a pattern and magnitude of SHAP values. The dead giveaway that changes are correlated: the same rows will show similarly anomalous values in exactly the same columns.
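One way to group those correlated alerts, sketched below, is to look for columns whose per-row SHAP contributions rise and fall on the same rows; the 0.8 correlation cutoff and the greedy grouping are illustrative choices, not a prescription.

```python
# Sketch: columns whose per-row SHAP contributions spike on the same rows are
# grouped into a single consolidated alert. The 0.8 cutoff is illustrative.
import pandas as pd

def correlated_column_groups(shap_matrix: pd.DataFrame, threshold: float = 0.8):
    """shap_matrix: rows x columns of SHAP values for today's flagged sample.
    Greedily group columns whose absolute SHAP patterns are highly correlated."""
    corr = shap_matrix.abs().corr()
    remaining = list(shap_matrix.columns)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        group = [seed] + [c for c in remaining if corr.loc[seed, c] >= threshold]
        remaining = [c for c in remaining if c not in group]
        groups.append(group)
    return groups  # report each group as one alert instead of one alert per column
```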
Benevolent chaos: testing your model with invented errors
OK, so you’ve made all these adjustments. How do you know if your model’s any good, and how do you improve it?
A common approach across AI/ML is to test models on human-labeled data. However, manually labeling data for your model is difficult and expensive at enterprise scale, and can unintentionally introduce more human error.
An easier and better approach is to generate synthetic errors and see what the model catches. Since you know which errors you’ve introduced, you can systematically judge the false positives and negatives. If you can catch synthetic errors, you can catch real ones. (After all, most real errors are created by a computer at some point anyway!)
You do this by collecting a representative sample of tabular datasets—including the funky kinds discussed above—and programmatically injecting bad values (or “chaos”).
Some example changes (a few are sketched in code after the list):
- Multiply a column by a random value
- Change 15% of a column’s values to NULL
- Drop the rows with values equal to the mode of a given column
- Replace all of the values in a column with random floats
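For illustration, here are generic pandas versions of a few of the chaos functions above; they are sketches, not Anomalo’s actual library.

```python
# Generic pandas sketches of chaos functions like those listed above.
import numpy as np
import pandas as pd

def null_out_fraction(df, column, frac=0.15, seed=0):
    """Set a random fraction of one column's values to NULL."""
    out = df.copy()
    idx = out.sample(frac=frac, random_state=seed).index
    out.loc[idx, column] = np.nan
    return out

def scale_column(df, column, seed=0):
    """Multiply one numeric column by a random factor."""
    out = df.copy()
    out[column] = out[column] * np.random.default_rng(seed).uniform(0.1, 10)
    return out

def drop_mode_rows(df, column):
    """Drop every row whose value equals the column's most common value."""
    return df[df[column] != df[column].mode().iloc[0]]
```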
Companies that do a lot of model testing, like us at Anomalo, often develop libraries of chaos agents for inserting into datasets. For more on how we build robust real-world models for automated anomaly detection at scale, check out our 30-minute talk from the Databricks Data + AI Summit last June, which covers our chaos library.
But be careful with how you apply synthetic issues. Since real data issues typically only affect part of the data, you should usually apply synthetic issues to just one segment, a random column, or a random percentage of the data.
Testing and benchmarking performance on synthetic chaos
Now that you’ve generated synthetic data, it’s time to iterate the design of your models by making adjustments and seeing how they improve. Here’s a sense of how to do it and what you’ll be looking for, using a technique we call backtesting.
Backtesting
Since we’re working on time-series models, simulate analyzing a table that has new records arriving daily. Use real historical sample data—not infected by chaos!—to generate daily snapshots in time. Then run the model in sequential order. This gives you a baseline.
Then introduce your synthetic anomalies and run the model on each day again, generating statistics about how well it’s detecting issues. These will help you understand how well the model performs on errors you’re likely to encounter in the real world.
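Here’s a bare-bones sketch of such a backtesting harness. The callables run_quality_model and inject_chaos are hypothetical stand-ins for your own monitoring model and your chaos library.

```python
# Bare-bones backtesting harness. `run_quality_model` and `inject_chaos` are
# hypothetical stand-ins for your monitoring model and chaos library.
def backtest(daily_snapshots, run_quality_model, inject_chaos, chaos_days):
    """daily_snapshots: date -> DataFrame, in chronological order.
    chaos_days: the dates on which synthetic issues are injected.
    Returns one record per day noting whether it was corrupted and whether it was flagged."""
    results = []
    for date, snapshot in daily_snapshots.items():
        corrupted = date in chaos_days
        data = inject_chaos(snapshot) if corrupted else snapshot
        flagged = run_quality_model(data, as_of=date)  # True if the model raises an alert
        results.append({"date": date, "corrupted": corrupted, "flagged": flagged})
    return results
```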
Analyzing
You can evaluate your model on any number of metrics; some that we recommend are listed below (with a small scoring sketch after the list):
- Duration. Every minute a model takes to run costs money. You’ll be running this model every day, so it adds up. If a change leads to better results but takes a lot more time, you’ll have to do the math on the tradeoff.
- Precision and recall. These are, arguably, the two most important metrics. Ideally you want both to be at the top of a 0-to-1 scale.
  - Precision measures what percentage of alerts are actually anomalies. A higher number means fewer false positives.
  - Recall measures how many of the anomalies caused alerts. A higher number means fewer false negatives.
- F1-Score. Very often, changes that improve precision worsen recall, and vice versa. F1 balances both; the higher, the better overall.
- Area under the curve (AUC). The higher the number, the better the model is able to find the chaotic data points. 0.5 represents random guessing, and 1 means perfect detection.
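Given the per-day records from a harness like the one above, these metrics can be computed with scikit-learn; this sketch assumes the backtest includes both corrupted and clean days.

```python
# Score a backtest run with scikit-learn, using the per-day records produced above:
# `corrupted` is the ground truth, `flagged` is the model's call.
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def score_backtest(results):
    y_true = [int(r["corrupted"]) for r in results]   # did we actually inject chaos?
    y_pred = [int(r["flagged"]) for r in results]     # did the model alert?
    return {
        "precision": precision_score(y_true, y_pred),  # higher = fewer false positives
        "recall": recall_score(y_true, y_pred),        # higher = fewer false negatives
        "f1": f1_score(y_true, y_pred),                # balance of the two
        "auc": roc_auc_score(y_true, y_pred),          # 0.5 = random guessing, 1.0 = perfect
    }
```

If your model produces a continuous anomaly score, pass that score (rather than the binary flag) to roc_auc_score for a more informative AUC.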
Benchmarking
Once you have a baseline from the initial run of your model, you can fine-tune, run the model again on the same original and chaotic datasets, and see what changes.
Often, some metrics will improve while others decline. For example, a change might raise both recall and precision while also increasing compute costs and making the model slower.
Now that you have a robust model that works with real-world data, how can you make it useful? The answer is notifications. How do they work to facilitate data issue response, and how can you manage them so they work for you and not against you? How do you avoid alert fatigue without missing important issues? That’s the subject of Chapter 6.