To compete in a data-driven world, organizations must consolidate data into centralized warehouses and use it to enhance products and inform decisions.
Data is now a strategic asset. But how can organizations ensure they can trust the data underpinning these products and decisions?
Most data teams conclude that they need to begin testing their data — using a carefully maintained library of rules.
But monitoring all of the data in an enterprise warehouse can be daunting. It is common for such warehouses to contain tens of thousands or even hundreds of thousands of tables.
- Top 10 tables
Drive board-level metrics and company goals
- Top 100 tables
Cross-functional data that inform product and operations
- Top 1,000 tables
Detailed data owned by single teams driving process and products
- Remaining 1,000+ tables
Special purpose project or feature data
The critical tables in an organization should be thoroughly tested and monitored. See our Airbnb-quality data for all post for details on how that can be achieved. But what about the rest?
If the average table contains 50 columns, and each column requires 5 rules or metrics to be well monitored, then a warehouse with 1,000 important tables requires managing 250,000 rules!
Even if each rule requires just 10 minutes to maintain each year, that would require a dedicated team of 20 highly trained data professionals (most of whom would quickly quit in protest of the drudgery):
Yet, an organization still depends upon the quality of that data. This is where unsupervised data monitoring can be critical.
Uber summarized this well in their post on Monitoring Data Quality at Scale:
Conventional wisdom says to use some variant of statistical modeling to explain away anomalies in large amounts of data. However, with Uber facilitating 14 million trips per day, the scale of the associated data defies this conventional wisdom. Hosting tens of thousands of tables, it is not possible for us to manually assess the quality of each piece of back-end data in our pipelines.
This first post in the series will demonstrate how Anomalo uses unsupervised learning to monitor data quality at scale. In subsequent posts, we will cover:
- The key requirements of our system and why traditional time series and outlier detection approaches do not work
- The architecture of the Anomalo unsupervised learning system — from modeling the data to explaining root causes of issues
- How we define and minimize false positives, and how we benchmark our algorithm using our data chaos library
The role of unsupervised data monitoring
To illustrate how our unsupervised monitoring works, we will use a simple demo environment with just one table, <code>public.fact_listing<code>:
This is a demo dataset of concert and sporting event ticket sales data. You can see that we have 3 checks passing on this table and that the checks run daily when the data is fresh.
Clicking into the table presents the table homepage:
As you can see, <bolderbold>Data freshness<bolderbold> has already run for this table, which determines when the data is complete each day, and automatically kicks off all other checks.
At the bottom are <bolderbold>Key metrics<bolderbold> and <bolderbold>Validation rules<bolderbold> sections where the user would leverage our time series models or library of custom validations to check their data.
In-between is the <bolderbold>Table anomalies<bolderbold> section, which contains two checks that are automatically configured for any monitored table:
Our machine learning model learns a representation of the typical data in the table. As new data arrives, it detects if that data is meaningfully different from what appeared in the table before.
We run two variants of this algorithm:
- no increase in NULL values
A constrained model looking for significant increases in NULL values
- no anomalous records
Our full machine learning algorithm, which identifies changes in continuous distributions, categorical values, time durations, or even relationships between columns
The first variant runs at a high priority level and notifies users when a sudden spike in NULL values is observed, as this may indicate missing data that should be fixed quickly. The second variant, no anomalous records, produces a log of meaningful changes in each table.
For example, in this Oxford COVID-19 Government Response dataset in BigQuery, in the column <code>public_information_campaigns_flag<code>, the value 1 almost entirely disappeared and was replaced with NULL values on October 30th.
Of course, not all columns are important, and so the user can control which tables and columns they wish to see notifications for.
Finding and characterizing data chaos with unsupervised learning
To illustrate how this works in practice, let’s introduce an anomaly into this dataset. We will use our command-line client to trigger one of our chaos operations: <code>TimeColumnZero<code>. This introduces zero values into a column at a specific point in time.
The column <code>numtickets<code> in <code>fact_listing<code> will now contain 30% zero values on 2021–01–17 whenever the <code>venuestate<code> is equal to “NY.” This illustrates a common data quality issue — an invalid value suddenly appears for a fraction of rows in a key data segment.
Next, we can re-run the table anomaly checks, and we find that the <code>no anomalous records<code> variant fails (as it is looking for any meaningful change):
We can click into view details to see the explanation:
The table anomaly check has correctly identified that the column <code>numtickets<code> has a sudden increase in 0 values. Note that we never told it to look at this column or to look for zero values.
It also scores the anomaly's severity (this one is strong) and compares that to a learned threshold for this table, which accounts for how much background noise there is in each column. Because the severity exceeds the threshold, this check fails and is highlighted in the user interface (and could notify the user in Slack or Teams).
Scrolling down, we see a custom visualization selected based on the anomaly type and data distribution:
In this case, a top values chart shows the most common <code>numtickets<code> values and compares the distribution on 2021–01–16 (the left bars) to the distribution on 2021–01-17 (the right bars). You can see that the value 0 was not there before and is now suddenly 10.4% of records.
Scrolling further, we can see the Root Cause Explanation, which analyzes the raw data underlying our unsupervised model to identify if there are segments of the data where the issue is most prominent:
As you can see, the algorithm correctly identifies that NY is where the anomaly occurred. 100% of the anomalous rows are in that state, but only 29.8% of the population rows are in NY.
This automatic identification of where an issue occurs is powerful. Without it, a user would need to examine records, trace lineages, or repeatedly query and visualize the data to isolate the issue.
The algorithm can find even more complex issues, such as when the relationship changes between columns. For example, consider this chaos operation:
Here we are shuffling the <code>priceperticket<code> column so that the values no longer correspond to the correct rows. The actual values remain the same (and have the same distribution and mean), but the relationship between that column and other columns in the table has been broken.
In this case, the algorithm identifies that <code>priceperticket<code> is the most problematic column, and that the issue is related to <code>listid<code> and <code>totalprice<code> as well.
Examining the distribution of <code>priceperticket<code> values on 1/16 and 1/17 shows that the anomaly is the strongest in the lowest and highest <code>priceperticket<code> values:
We can examine a sample of individual rows, where we can see that the algorithm is scoring every value in the table for how anomalous it is:
In this example, the $1,960 price per ticket doesn’t make any sense in the context of 28 tickets for $1,568. Plus, the price per ticket is unusually high for the Vampire Weekend show.
Zooming out, we can see this effect across many rows:
Now you can more clearly see which specific values of the table the algorithm believes are contributing the most to the anomaly, as indicated by the <code>severity<code> color scale on the right-hand side.
This granular allocation of the anomaly into the table's specific values is key. It allows our system to visualize and explain the underlying issue clearly.
When embarking on a journey to monitor and test data quality in your warehouse, it makes sense to start simple. You can use open source libraries, write your own tests, or leverage a platform like ours at Anomalo to thoroughly test your data.
But as warehouses, organizations, and testing ambitions scale, simple rule and time series based approaches fall over. They cannot effectively cover the long tail of data quality issues that commonly occur.
That is where unsupervised data monitoring comes in. You leverage a machine to learn the structure of your data and monitor for significant unexpected changes. It notifies you when a meaningful negative change occurs and presents you with visual summaries and explanations that dramatically accelerate your triage and resolution times.
Stay tuned for our next post in this series, which will explain how our unsupervised algorithm works behind the scenes.
To get started with Anomalo, and begin monitoring your data at scale using our algorithms, be sure to request a demo.
Thanks to Vicky Andonova.