
Why Data Quality Without Unsupervised Machine Learning Leaves Results on the Table

Traditional monitoring approaches (like setting up rules or tracking a few metrics over time) can catch some issues, but they often miss the deeper, unexpected problems. Some tools that advertise a “machine learning” approach offer much less coverage than others, because they simply use time series forecasting. 

This post explores how unsupervised machine learning (ML) enhances data quality monitoring beyond simple time series forecasts. We’ll compare time series forecasting vs. unsupervised ML, show the advantages of each, and explain why having both is crucial for trustworthy data.

The four pillars of data quality monitoring

A robust data quality monitoring system should cover four core areas of competency: data observability, validation rules, key metrics, and unsupervised machine learning checks. This last area can be a source of confusion because of how some data quality monitoring tools are marketed.

To understand why not all “machine learning” approaches to data quality monitoring are the same, we’ll investigate two common monitoring mechanisms: time series forecasting and unsupervised machine learning. We’ll explore why time series forecasting should be part of a multi-faceted approach, including unsupervised machine learning, rather than a standalone solution.

What is time series forecasting?

Briefly, time series forecasting is an approach that uses the historical data in your table to make time-aware predictions for future values. This process is necessary for key metrics, which are a valuable component of any data quality monitoring system.

In practice, this means tracking key metrics (record counts, averages, sums, etc.) over time and using statistical models to forecast their expected values. If an actual value deviates significantly from the forecast, an alert is triggered. This approach is intuitive and a natural place to start for monitoring data quality. 
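As a toy sketch of this idea (not any particular tool's implementation), a forecast-based check can be as simple as a rolling baseline with a deviation threshold: learn the mean and spread of a metric's history, then alert when a new value falls outside the expected band.

```python
from statistics import mean, stdev

def forecast_alert(history, latest, k=3.0):
    """Flag `latest` if it falls outside mean +/- k * std of `history`."""
    mu, sigma = mean(history), stdev(history)
    return abs(latest - mu) > k * sigma

# Daily row counts for a table, followed by one new observation.
row_counts = [10_120, 9_980, 10_050, 10_200, 9_900, 10_075, 10_010]

print(forecast_alert(row_counts, 9_950))  # within the normal band -> False
print(forecast_alert(row_counts, 4_200))  # drastic drop -> True, alert
```

Production forecasting models are far more sophisticated (they account for trend, seasonality, and holidays), but the core loop is the same: predict an expected value, compare, alert on deviation.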

For example, Uber’s data team built a Data Quality Monitor that uses historical data patterns to flag anomalies. If the number of rows in a table drops drastically compared to previous periods, it signals a potential data pipeline issue. Such time series modeling of data metrics can automatically catch big spikes or dips that indicate something went wrong in the data feed.

However, forecasting only catches the “known unknowns,” the issues you anticipate, such as volume drops. As Uber observed, with tens of thousands of tables in play, it’s impossible to manually define and maintain metrics for everything that could go wrong. We need a different approach for the “unknown unknowns” – those unexpected data errors that don’t yet have a metric or rule. This is where unsupervised learning comes in.

What are unsupervised machine learning checks?

Unsupervised machine learning takes a very different (and far more comprehensive) approach to monitoring. Instead of looking at one predefined metric at a time, an unsupervised ML model learns the normal patterns of your data on its own, across many dimensions. It builds a picture of what “normal” looks like: the typical ranges, distributions, frequencies, and relationships in the dataset, without the user having to specify rules upfront. Then it flags anything that looks sufficiently different from that learned baseline. 

In other words, the system can independently learn the data’s structure and detect deviations. This is done without any labeled examples of bad data; the algorithms are essentially finding outliers or unusual patterns in real time.
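To make this concrete, here is a minimal, hypothetical sketch of that loop in Python: it learns per-column baselines (numeric ranges, category frequencies) from historical rows, then scores new rows by how many columns deviate. Real unsupervised checks use far richer models than per-column statistics; this only illustrates the "learn normal, flag deviations" pattern without any labeled examples.

```python
import math
from collections import Counter

def learn_baseline(rows):
    """Learn what 'normal' looks like per column: numeric mean/std, or category frequencies."""
    baseline = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            mu = sum(values) / len(values)
            var = sum((v - mu) ** 2 for v in values) / len(values)
            baseline[col] = ("numeric", mu, math.sqrt(var))
        else:
            freq = Counter(values)
            baseline[col] = ("categorical", {v: c / len(values) for v, c in freq.items()})
    return baseline

def anomaly_score(row, baseline, rare=0.01):
    """Count how many columns in `row` deviate from the learned baseline."""
    score = 0
    for col, spec in baseline.items():
        if spec[0] == "numeric":
            _, mu, sigma = spec
            if sigma and abs(row[col] - mu) > 3 * sigma:
                score += 1
        elif spec[1].get(row[col], 0.0) < rare:  # unseen or rare category
            score += 1
    return score

# Made-up history: amounts near 100, country always "US".
history = [{"amount": 100 + i % 5, "country": "US"} for i in range(500)]
baseline = learn_baseline(history)

print(anomaly_score({"amount": 102, "country": "US"}, baseline))   # 0: looks normal
print(anomaly_score({"amount": 102, "country": "USA"}, baseline))  # 1: new category flagged
```

Notice that nobody wrote a rule saying “country must be US”; the check flagged “USA” purely because it never appeared in the learned baseline.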

Additionally, unsupervised ML checks will generate a human-readable root cause analysis when they discover an anomaly. This analysis helps speed up the triage process by identifying the probable source of the issue.

What unsupervised ML catches that forecasting might miss 

Here are a few types of data quality issues that unsupervised ML is well-equipped to detect, often out-of-the-box, which pure time series monitoring could overlook: 

  • Sudden spikes in missing or invalid data: For example, a sudden influx of NULL or zero values in a field that usually has none. This could indicate part of a pipeline failed or a default value overwrote real data. A forecast on total row count might not notice if the overall row count stayed the same, but an unsupervised model will catch the rise in missing entries.
  • Unexpected categorical changes: If a common category value disappears or a new, anomalous category shows up (e.g. country code “US” vs “USA” mix-up), unsupervised monitoring will flag the unusual distribution change. Traditional monitors typically don’t track the distribution of individual categorical values unless you explicitly set up a metric for each, which is impractical at scale.
  • Broken relationships or consistency issues: When the correlation or logical relationship between columns changes, it often won’t manifest as a single-column metric anomaly. Unsupervised techniques can notice inter-column inconsistencies, such as column A no longer predicting column B like it used to, or a calculated field no longer matching its components. Even if sums and averages remain in line, these subtler issues get surfaced.

By casting a wide net, unsupervised models can find the needle-in-the-haystack issues that a dashboard metric or predefined test would miss.

Not everything that “uses machine learning” is leveraging unsupervised ML

It’s one thing to decide that you want a monitoring solution that leverages unsupervised ML checks, and quite another to find a solution that actually incorporates those types of checks.

The problem is that the words “machine learning” can apply to both time series forecasting and unsupervised ML checks. Time series forecasting tools can incorporate machine learning techniques like recurrent neural networks and convolutional neural networks. These techniques are sophisticated, but much narrower in scope than the models used for unsupervised ML checks.

This ambiguity around “machine learning” for data quality monitoring sometimes implies that a solution punches above its weight class. And that makes a real difference: if you’re expecting the benefits of uncovering unknown unknowns with unsupervised ML checks, you don’t want to end up with access to only time series forecasting.

If you’re in the market for a tool that incorporates unsupervised ML checks, it’s important to ask questions about what kind of “machine learning” is being used under the hood.

Choose your data quality monitoring tool carefully to sleep soundly at night 

When choosing a data quality monitoring tool, look for solutions that incorporate time series forecasting models and unsupervised machine learning checks. 

Each technique covers the other’s blind spots:

  • Time series forecasting allows you to easily detect when important business metrics are out of line with specific predictions, efficiently querying just the data you know you want to check. Key metrics that use time series forecasting will always look at the values, rows, and columns you’re interested in, even if a stronger anomaly exists somewhere else.
  • Unsupervised ML checks offer a broader scope for greater coverage of unspecified anomalies (or “unknown unknowns”). These checks will take in structured data of any kind and monitor it for unknown unknowns, providing a root cause analysis when they find an issue.

To summarize: time series forecasting looks for issues with metrics you know you care about, while unsupervised ML finds problems you didn’t anticipate. Anomalo leverages each of these strategies, and more, to ensure the most comprehensive coverage of your data quality.

Business leaders want to trust their numbers, and data teams want to sleep easier, not worrying about hidden data errors. By combining time series forecasting with unsupervised machine learning, enterprise organizations can get a monitoring solution that is both broad and deep: one that can alert you to an outage in a key metric, and also identify anomalous data errors lurking under the hood. This ensures that no results are left on the table and that everyone, from engineers to executives, can rely on the data when it matters most.

If your data quality monitoring tool does not already include unsupervised machine learning checks, it may be time to explore more coverage. 

Anomalo leverages both time series forecasting and unsupervised machine learning (and more!) to deliver a profoundly useful data quality monitoring solution. Get in touch and request a demo to see how Anomalo can help you find and triage more issues, faster.
