When we think of data quality, the first issues that come to mind are visible problems like duplicate rows, NULL values or corrupted records. But in fact the most common data quality issue is that data has simply disappeared.
In this post we will describe how data disappears, what the common causes are and what data teams can do to identify these issues.
Consider how companies process data into their warehouses:
Raw data is captured through logging systems or from external sources, then data loading systems pre-process the raw data and load it into a data warehouse. Then complex SQL pipelines filter data for important records, join multiple sources together and perform complex aggregations.
The resulting tables, often referred to as “fact tables”, are the golden datasets of an organization. Cross-functional teams leading strategic initiatives rely upon them, product managers make decisions using them and operations and sales teams are managed based on them.
It is hard to underestimate the importance of ensuring these tables are reliable.
But important fact tables can be the result of many transformations across disparate systems linking varied source datasets. This complexity increases the likelihood of incomplete data:
However, unlike some other data quality issues, there isn’t a single SQL query to validate that data is complete. This is because incomplete data can take many forms.
The simplest issue is that there is no recent data at all.
Even this can be fraught with challenges, as we need to know when there should be data there. Ideally systems track how long a dataset typically takes to update, and alerts when new data is significantly delayed.
A more nuanced issue that can go undetected is that there are fewer records than expected, or that data disappeared for a small period of time.
In this case, you need a time series model to predict what range of row counts are expected. Such models need to control for trend, weekday seasonality, annual seasonality and holidays. Doing this reliably at scale can be challenging.
Even more difficult to handle is when data disappears for an important segment of the data.
The missing segment might be small enough to not affect the overall row counts, but still mission critical for your business.
So what happens when data disappears? The consequences can be widespread:
- Dashboards fail to update, or show misleading charts based on limited data
- Machine learning models miss training on the latest observations or are biased by incomplete data
- Data products deliver stale or unrepresentative data to internal or external consumers
Usually, a data organization’s first instinct is to rely upon the monitoring of the systems that produce the data. Infrastructure engineering is monitoring the logging system with production metrics. And data engineering is monitoring the data coordination and loading systems for outages or missing data.
But as an organization matures the way data is produced increases in complexity, and it becomes dangerous to rely upon monitoring of individual components:
At Anomalo, we’ve found the only way to be certain your data is available is to test it independently from the systems producing it.
To ensure that data is available and complete, we run the following sequence of tests:
- Are there any rows from yesterday?
- Is the row count above a predicted lower bound?
- Is there any missing data at the very end of the day?
- Are there any key segments with far fewer than the expected number of rows?
To set this up for a new table only requires a few pieces of information:
Then, if we ever discover an issue with incomplete data we send a notification to the relevant teams’ Slack, Teams, PagerDuty or e-mail:
For example, the above Slack message is for a Google BigQuery Public dataset of San Francisco transit stops. We saw a big decrease in records on 2020–10–29. Our model predicted that there should be at least 10,053 rows, but only 3,672 rows exist. What happened to the other 6,500 rows?
We also note that this was the first time this table had failed to load data on time in the last 34 runs, providing users with a sense of just how unusual or extreme this behavior is.
Digging deeper, we show exactly what happened on the 29th:
It appears that the data disappeared by 8am, and never returned that day.
When it is relevant, we show a breakdown of exactly which segments are appearing less frequently than expected:
This chart is from a COVID-19 online news dataset, and is showing that on November 4th there were fewer articles online about Cases, Quarantine, Prices and Ventilators than expected.
These details help data teams rapidly triage issues, identify root causes, and communicate to the affected teams internally. In many cases, they even help accelerate the development and deployment of resolutions. Moving quickly and confidently to identify and resolve such issues greatly reduces the chances they negatively impact the rest of the company.
Before any dataset is used for mission critical decisions or products, data-driven organizations should validate the quality of the data using an automated system. The first step to get right is to ensure the data hasn’t disappeared.
To learn more about Anomalo, request a demo.