Chapter 4: How to Build a Machine Learning Model for Data Quality Monitoring
April 1, 2025
Welcome to “Use AI to modernize your data quality strategy,” a series spotlighting insights from our O’Reilly book, Automating Data Quality Monitoring. This post corresponds to Chapter 4: Automating Data Quality Monitoring with Machine Learning.
Because it finds unknown unknowns, unsupervised machine learning (ML) is the most important of the four pillars of data quality. But how does it work?
We’ll give away the secret up front: ML models say “today’s data may be anomalous!” when they predict with some degree of certainty that it came from today.
At first, this might be a mind-bender: of course today’s data came from today, why is it a bad sign when the model figures that out? By the end of this article, we predict you’ll agree that it’s an effective and efficient way of monitoring data quality.
One important note before we begin: the AI/ML we’re talking about is not generative AI. ChatGPT, for instance, wouldn’t be very reliable for finding anomalies in a table. Just as you’d want a factory’s quality inspector to be intimately familiar with the desired final state and potential defects of your product line, you need a purpose-built ML model to learn from your data and flag when things look off.
A machine learning wishlist
In our years of work with data quality monitoring algorithms, we’ve come up with four key characteristics for a successful model: sensitivity, specificity, transparency, and scalability. Let’s walk through them:
1. Sensitivity: avoiding false negatives
How often will the model correctly detect a new issue?
You want a model that’s sensitive enough to pick up on the issues you care about, but not so sensitive that it’s reacting to natural fluctuations or noise. The sweet spot is detecting an issue affecting 1% or more of the data. If we considered smaller issues, we’d get flooded with them and the model would become too noisy. In our experience, “1%+” issues are more indicative of structural concerns.
If you’re worried about specific issues affecting smaller but highly important pieces of data, validation rules are a better solution.
2. Specificity: avoiding false positives
Alert fatigue can be disastrous for data quality monitoring—you don’t want your system crying wolf. So, avoiding false alarms is really important. A good example is seasonal volatility. For instance, your ML model should not alert you to a sharp increase in sales on Cyber Monday or a plunge in retail revenue on Christmas Day.
- Callout: Seasonality comes up in discussions about both ML checks and key metrics, and these two pillars can seem similar when you’re starting out in data quality monitoring. This topic is both important and complex, and some data quality monitoring services erroneously blur the lines. As a data person, you might be the type to appreciate the differences and minutiae in our guide here.
3. Transparency: describing the issue
You’re not a computer, so you need human-readable analysis. This means your model should give useful details about any issues it finds, telling you the severity of a data anomaly and ideally even where the root cause might be.
Generic alerts are about as useful as an alarm saying something’s wrong in the factory, without telling you what machine’s got the issue or what it’s doing wrong. Without enough detail, the model could do more harm than good by contributing to alert fatigue.
4. Scalability: simple to apply broadly
Remember how setting up customized validation rules on each of your 10,000 tables isn’t a good plan? Well, neither is building a model that needs customization for each use case. In practical terms, this means the ML model should work right out of the box; any configuration of your data quality monitoring should happen at a higher level, such as whom to alert about what kinds of errors.
Non-requirements
To keep the focus on getting the important things right, it’s useful to specify what you don’t need your model to do. For some of these non-requirements, other data quality pillars are more appropriate; for others, we don’t find them worthwhile.
| Non-requirement | Reason |
| --- | --- |
| Identify individual bad records | Instead, use validation rules on the most important tables or columns |
| Process data in real time | Hourly or daily runs are fine; anything more frequent is hard to scale and uses more compute than it's worth |
| Identify existing problems | ML models evaluate data going forward; use validation rules to check for specific problems (e.g., date format errors) in past data |
| Monitor tables without timestamps | ML detects changes over time, so it needs to know when each piece of data is from; use table observability or validation rules to monitor static information |
| Identify outliers | Outliers are value-neutral: a value can be very large or very small without the data being wrong. ML models should look for sudden structural changes in new data |
Building an ML model that finds anomalies in time series
As we said at the top, the core idea of our model is predicting whether the data it’s given is from today.
The tl;dr is that you train the model on a portion of today's data alongside samples from previous days. Then you ask it to evaluate data that wasn't part of the training set and see whether it can guess which day each record came from.
If it can guess that today’s data is from today, that means today’s data is noticeably different from the patterns it learned from previous days—and there’s likely an issue. If it can’t, there are no structural anomalies.
As with any summary, we've oversimplified; we'll get into some details as we go along, and there's much more in the book. We're also not claiming this is the only way to use ML to satisfy our requirements, but it continues to be the best we've seen.
We seem to really like our concepts in fours. We have four data quality pillars in Chapter 2, four characteristics of big data in Chapter 3, and four requirements in the ML wishlist above. Here’s one more, the four concerns when building Anomalo’s ML model: data sampling, feature encoding, model architecture, and model explainability.
Data sampling: choosing the records to evaluate
Here’s how we suggest sampling data in a manner that’s both efficient and representative:
- What data to sample? Use random rows from several lookback dates. These should include today, yesterday, a day from last week, and any other days necessary to account for seasonal trends. Label today’s data “1” and all other days “0.” (Sticklers may object that this makes it, technically, supervised ML. See page XX in the book for our thoughts.)
- How much data to sample? Roughly 10,000 rows. Yes, even for very large tables. It’s the same statistical principle that lets pollsters capture the opinion of hundreds of millions of Americans in a survey of about 2,000 people. But note that it’s of the utmost importance that these rows be random.
- How do you tell your model to sample the data? Don't just ask your data warehouse for 10,000 random rows: to select individual rows at random, it will write the entire table to memory, which is inefficient and expensive. Instead, pull out individual days, use TABLESAMPLE or an equivalent to grab a few times more data than you need, and then draw a random sample from that chunk, as sketched below. (If you plan to do this, please look at pages XX-XX in the book for some important details, suggestions, and potential pitfalls.)
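As a concrete illustration, here's a minimal Python sketch of that sampling strategy. The `run_query` helper, table and column names, and dates are hypothetical placeholders, and TABLESAMPLE syntax and semantics differ by warehouse, so treat this as a starting point rather than a drop-in implementation.

```python
import pandas as pd

def run_query(sql: str) -> pd.DataFrame:
    """Hypothetical helper: send SQL to your warehouse and return a DataFrame."""
    raise NotImplementedError("wire this up to your warehouse client")

def sample_day(table: str, date_col: str, day: str, n_rows: int = 10_000) -> pd.DataFrame:
    # Filter to a single day, let the warehouse do cheap approximate sampling
    # (exact TABLESAMPLE syntax varies by warehouse), then take the final
    # random sample client-side so we end up with exactly n_rows rows.
    chunk = run_query(f"""
        SELECT * FROM (
            SELECT * FROM {table} WHERE {date_col} = '{day}'
        ) TABLESAMPLE BERNOULLI (1)
    """)
    if len(chunk) < n_rows:
        # The approximate sample came back too small; fall back to the whole day.
        chunk = run_query(f"SELECT * FROM {table} WHERE {date_col} = '{day}'")
    return chunk.sample(n=min(n_rows, len(chunk)), random_state=0)

# Label today's rows 1 and the lookback days 0, per the scheme above.
# today = sample_day("orders", "order_date", "2025-04-01").assign(is_today=1)
# baseline = pd.concat(
#     sample_day("orders", "order_date", d).assign(is_today=0)
#     for d in ("2025-03-31", "2025-03-25", "2025-03-18")
# )
```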
Feature encoding: how do you convert New York City to a number?
ML works by comparing values. That means strings of text—including digits with no inherent value, such as ZIP codes or phone numbers—need to be converted into numbers. In the world of AI/ML, this is known as feature encoding.
The numbers have to carry some meaning, but you need to assign that meaning at scale, so automate it with a process that picks the best encoding type for each column.
Common options include:
- numeric, such as a count of items or a dollar amount
- frequency, which replaces a string with the count of how often that string appears in the column
- isNull: 1 if there's any data in that column for that entry, 0 if not
- secondOfDay and timeDelta: when something happened, or the time between two things
- OneHot: converts a categorical variable into several binary columns, one per category
For example, a "rewards" column could be encoded with isNull so that Yes becomes 1 and empty entries become 0. A "store_id" column could be encoded with frequency, so each row shows the number of times a given store ID appears in the data; if US-476 appears in three rows, its value is 3.
There are many other kinds of encoders out there, but be careful about getting too fancy with them. Complex encoders can increase the number of issues a model identifies, but often at the expense of interpretability if the observations are less intuitive.
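To make that concrete, here's a small pandas sketch of a few of these encoders. The column names and values are made up for illustration; an automated system would also need a heuristic for picking the right encoder per column.

```python
import pandas as pd

def frequency_encode(col: pd.Series) -> pd.Series:
    # Replace each string with the count of how often it appears in the column.
    return col.map(col.value_counts())

def is_null_encode(col: pd.Series) -> pd.Series:
    # 1 when the entry has data, 0 when it's missing (mirroring the example above).
    return col.notna().astype(int)

def second_of_day_encode(col: pd.Series) -> pd.Series:
    # Seconds since midnight, for timestamp columns.
    ts = pd.to_datetime(col)
    return ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second

# Illustrative data: these column names are made up for this example.
df = pd.DataFrame({
    "rewards": ["Yes", None, "Yes"],
    "store_id": ["US-476", "US-476", "US-102"],
    "created_at": ["2025-04-01 08:30:00", "2025-04-01 12:00:05", "2025-04-01 23:59:59"],
})

encoded = pd.DataFrame({
    "rewards_isnull": is_null_encode(df["rewards"]),
    "store_id_freq": frequency_encode(df["store_id"]),
    "created_at_second": second_of_day_encode(df["created_at"]),
})
print(encoded)
```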
Model architecture: choosing how to learn
There are many ways an ML model can learn, but we’ll avoid that discussion and cut to the chase: we like gradient-boosted decision trees for data quality monitoring. Reasons include:
- Can be trained on relatively small samples, but can handle even millions of records very quickly
- Generalizable to any kind of tabular data when properly feature-encoded
- Fast at inference
- Very few parameters really matter for tuning, mostly the learning rate and the complexity of each tree
XGBoost is a great library to start with when developing your model.
Gradient-boosted decision trees are an ensemble model: the final prediction combines the results of many trees, trained iteratively:
- Decision tree A makes initial predictions based on the training data
- Decision tree B sees the predictions tree A made and how they turned out; it keeps what A got right and focuses on correcting what it got wrong
- Tree C takes B's corrections into account, and so on.
Gradient-boosted decision trees sit in the sweet spot of complexity. Something like a linear model is too simple to learn the complex patterns in most structured datasets, while something like a neural network is too complex and requires far too much data and computing resources. That said, you could theoretically keep adding trees forever, so you'll need a stopping criterion to decide when your model has learned as much as it can.
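Here's a minimal sketch of the core loop using XGBoost and scikit-learn, assuming `X` holds the encoded features from the sampling and encoding steps above and `y` is the is-today label. The hyperparameters and the AUC threshold are illustrative, not tuned recommendations.

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_today_vs_baseline(X: np.ndarray, y: np.ndarray):
    """Train a classifier to distinguish today's rows (y=1) from lookback rows (y=0)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    model = xgb.XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,  # one of the few knobs that really matters...
        max_depth=4,        # ...along with per-tree complexity
        eval_metric="auc",
    )
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, X_test, auc

# AUC near 0.5: the model can't tell today from the past -- no structural change.
# AUC well above 0.5: today's data is distinguishable -- likely worth an alert.
# model, X_test, auc = train_today_vs_baseline(X, y)
# if auc > 0.6:  # illustrative threshold
#     print(f"Today's data looks anomalous (AUC = {auc:.2f})")
```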
Model explainability: where’s the problem and how bad is it?
It’s no use to have a model that can catch issues if it’s unclear how to address them. That’s why you want the model to report how severe an issue is, and where that issue is coming from. We’ll get into the benefits, such as reducing alert fatigue and easier triage, in our next posts. For now, here’s an overview of how to generate this crucial diagnostic data.
You need a method to decide how much each row-column cell contributed to the model’s prediction—that is, a guess whether a given datapoint is from today. We like Shapley Additive Explanations (SHAP), although there are other valid options too.
SHAP values let us create visualizations showing relative severity scores from “minimal” to “extreme.” A minimal anomaly might be of vanishing concern, while an extreme anomaly would be a very significant structural change affecting almost all of the day's data. Combined across thousands of datapoints and then clearly visualized, these scores give you an easily understood indication of what to investigate; say, if one value shows up a lot less today and another shows up a lot more, there's a good chance the affected data was mislabeled.
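Continuing the sketch above, here's roughly how you could pull SHAP values out of that model with the shap library; `model` and `X_test` come from the training sketch, and `feature_names` is an assumed list of the encoded column names.

```python
import numpy as np
import shap

# Explain which cells drove the "from today" predictions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # one attribution per row-column cell

# Average absolute attribution per column: which features most distinguish
# today's data from the baseline days?
column_impact = np.abs(shap_values).mean(axis=0)
for name, impact in sorted(zip(feature_names, column_impact), key=lambda t: -t[1]):
    print(f"{name}: {impact:.4f}")
```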
Finally, anyone eager for more detail should read Chapter 4 of the book, which goes into far more depth, including a walkthrough of how models assign attribution to each cell and how to use this same approach to compare two tables with the same column schema. We hope you now agree that asking ML to guess whether today's data is from today is a sensible way to figure out whether your data has suddenly developed an important issue. But like just about everything in technology, implementation has to contend with all sorts of curveballs.
Next month in this series, Chapter 5 is where the rubber meets the road, talking through real-world challenges like unusual table update structures and column correlation. Stay tuned!