Airbnb-quality data for all

How to build and maintain high quality data without raising billions
December 2, 2020
Ideas

Airbnb has always been a data driven company.

Back in 2015, they were laying the foundation to ensure that data science was democratized at Airbnb. Since then, they have grown to more than 6,000 people and raised more than $6B in venture funding.

To stay data driven through this massive change has required making big investments in data quality, as outlined by their recent Data Quality at Airbnb series: Part 1 — Rebuilding at Scale and Part 2 — A New Gold Standard.

The first two Data Quality at Airbnb posts: Part 1 — Rebuilding at Scale and Part 2 — A New Gold Standard

Companies aspiring to be as data driven and successful as Airbnb will also need to prioritize data quality.

It does not matter how much data is collected, how fast it can be queried, how insightful analyses are or how intelligent a machine learning model is. If the underlying data is unreliable and of poor quality, then everything that depends upon it will also be suspect.

Fortunately, companies no longer need to reinvent the wheel or make massive investments to improve and maintain high quality data. New startups, such as ours at Anomalo, are building the technology needed to monitor, triage and root cause data quality issues efficiently at scale.

In their first post, Part 1 — Rebuilding at Scale, Airbnb set the following goals for themselves.

The five data quality goals set by Airbnb in Part 1 — Rebuilding at scale, and where Anomalo can help.

All of these investments are critical. Projects like dbt are making it easier to build high quality data pipelines, and many open source data discovery tools are under development.

But in this article I want to focus on two of their goals in particular: ensuring important data meets SLAs and is trustworthy and routinely validated.

In their latest post, Part 2 — A New Gold Standard, Airbnb outlines the following automated data quality checks they run to validate important datasets:

The data quality checks Airbnb runs on important datasets, from Part 2 — A New Gold Standard.

To summarize, Airbnb has the following key requirements:

  1. Row count time series analysis
    Has the table row count dropped below a predicted range? Have the row counts plummeted for any key segments? Did fresh data arrive in the table on time?
  2. Time series anomaly detection for key business metrics
    Did a metric suddenly move outside of a predicted range? Is the movement unusual given seasonality, trend and holidays?
  3. Standard run-time data quality checks
    Are basic data constraints satisfied: no unexpected NULL or blank values, no violations of unique constraints, strings match expected patterns, timestamps follow a defined order, etc.?
  4. Perfect metric consistency checks
    Do columns and metrics satisfy critical relationships that can be expressed as simple equations?

For the rest of this post, we will use open source data in BigQuery to illustrate how each of these requirements is supported in Anomalo.

1. Row count time series analysis
Has the table row count dropped below a predicted range? Have the row counts plummeted for any key segments? Did fresh data arrive in the table on time?

The first question to answer for any production dataset is “has it been updated?” In particular, are there any records from the most recent date and is the row count within an expected range?

If not, then something probably broke upstream in the collection, loading or processing of the table. This must be fixed before anyone can trust the latest data.

In Anomalo, every table is automatically configured to monitor if the most recent period of data is complete.

For example, for the NY 311 service request data, here is the time series of row counts by day, and the predicted interval used to determine if the row counts are complete:

Monitoring the row count for NY 311 service request data by day, the grey points are the actual row counts, and the green band is the predicted interval.

For November 23rd, we expected at least 5,223 rows, and in fact there were 7,056.
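The core idea behind this kind of completeness check can be sketched in a few lines of Python. This is a hypothetical simplification, not Anomalo's actual model: it derives a lower bound from recent daily row counts and flags the latest day if it falls below that bound.

```python
from statistics import mean, stdev

def row_count_lower_bound(history, z=3.0):
    """Lower bound of a simple predicted interval over recent daily row counts."""
    return mean(history) - z * stdev(history)

def is_complete(history, latest_count, z=3.0):
    """Flag the latest day as incomplete if its row count falls below the bound."""
    return latest_count >= row_count_lower_bound(history, z)

# Synthetic recent daily row counts for a table
recent = [6800, 7100, 6950, 7200, 6900, 7050, 7000]
print(is_complete(recent, 7056))  # within range -> True
print(is_complete(recent, 1200))  # plummeted  -> False
```

A production system would replace the simple mean-and-standard-deviation interval with a forecast that accounts for trend and seasonality, but the alerting logic is the same shape.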

You can also pick any column and check that the row counts haven’t plummeted for any important segments in that column. For example, we can evaluate the row counts by borough (neighborhood) in the NY 311 data:

Predicted row counts for November 23rd in the NY 311 service request data for every borough (neighborhood) value. If any borough falls below the 99% predicted interval, a notification would be sent.

Anomalo customers use this feature to ensure their data is complete across a wide variety of segments, ranging from geography (country, state) to event or API call type to device platform (iOS vs. Android).

Finally, we can also tell if the data was significantly delayed. This can indicate that an upstream processing stage is taking longer than normal, and may eventually cause data to be incomplete when users query for it.

Here is how long it took for each of the last 8 days of data to arrive for the New York 311 dataset:

How long it typically takes NY 311 service request data to appear in BigQuery. Such delays are common for third-party datasets, and can increase suddenly, causing downstream query issues.

On average, it takes around 25 hours for the New York 311 data to arrive in BigQuery, and you can easily set a threshold of when you would like to be notified for delayed data:

You control when you are notified if a given table isn’t complete yet.
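A freshness check like this reduces to comparing the arrival delay against a configured threshold. Here is a minimal sketch (hypothetical helper names, not Anomalo's API), using a 30-hour threshold for data that typically arrives in about 25 hours:

```python
from datetime import datetime, timedelta, timezone

def hours_delayed(data_date, arrived_at):
    """Hours between the start of the data's date and when it landed in the warehouse."""
    return (arrived_at - data_date) / timedelta(hours=1)

def is_delayed(data_date, arrived_at, threshold_hours=30):
    """Notify if the data took longer than the configured threshold to arrive."""
    return hours_delayed(data_date, arrived_at) > threshold_hours

day = datetime(2020, 11, 23, tzinfo=timezone.utc)
on_time = datetime(2020, 11, 24, 1, 0, tzinfo=timezone.utc)   # ~25h, typical
late = datetime(2020, 11, 24, 18, 0, tzinfo=timezone.utc)     # 42h, delayed
print(is_delayed(day, on_time))  # False
print(is_delayed(day, late))     # True
```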

For more on data completeness, why it happens so often and how tools like Anomalo can be used to monitor it check out our post on When Data Disappears.

2. Time series anomaly detection for key business metrics
Did a metric suddenly move outside of a predicted range? Is the movement unusual given seasonality, trend and holidays?

Once we know that data is complete in a timely manner, the next step is to ensure that any metrics we compute from the data are within expected ranges.

In Anomalo, you can easily configure new metrics. For example, we can monitor the mean <code>score</code> for posts in the Hacker News dataset in BigQuery:

It takes just a few clicks to begin monitoring a metric — in this case the mean of the score column in the Hacker News dataset in BigQuery

If the score ever moves sharply outside of the predicted interval, a notification is sent in Slack. This saves a lot of time spent watching metrics for suspicious changes.

Under the hood, Anomalo builds a sophisticated time series model, which decomposes the metric into overall trend (blue), holiday spikes (orange), season of year (red) and day of week (teal) components:


Behind the scenes of how Anomalo monitors a time series, it is decomposed into overall trend (blue), holiday spikes (orange), season of year (red) and day of week (teal) components

It looks like Hacker News scores have been trending up over time, are sensitive to holidays (Thanksgiving and Christmas look good, others bad), and are much higher on weekends. Hopefully someone posts this article on Hacker News on a Sunday near Christmas 🎄.
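Anomalo's actual model is more sophisticated, but a toy additive decomposition conveys the idea: estimate the trend with a moving average, then estimate a day-of-week component from the detrended residuals (no holiday handling in this sketch).

```python
from statistics import mean

def decompose(values):
    """Toy additive decomposition: trend (7-day centered moving average)
    plus a day-of-week component estimated from the detrended residuals."""
    n = len(values)
    trend = [mean(values[max(0, i - 3):min(n, i + 4)]) for i in range(n)]
    resid = [v - t for v, t in zip(values, trend)]
    dow = [mean(resid[d::7]) for d in range(7)]  # one offset per weekday
    return trend, dow

# Two weeks of a synthetic daily metric with a weekend bump
series = [10, 10, 10, 10, 10, 14, 14] * 2
trend, dow = decompose(series)
print([round(x, 2) for x in dow])  # weekend components stand out as positive
```

Once the components are estimated, an anomaly is simply an observation that lands far from the trend-plus-seasonality prediction.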

In Anomalo, metrics can be defined using a variety of pre-defined aggregates:

The types of metrics that can be easily defined and monitored in Anomalo.

Or you can define a custom metric using any SQL aggregate statement available in your warehouse. For example, in the SF Fire Department service calls dataset in BigQuery, we can measure how many minutes on average it takes for the fire department to reach the scene of a call, and be alerted whenever it takes longer than 15 minutes:

Adding a metric that uses a custom SQL aggregate statement to define the average time between when a SF Fire department call was received, and when they arrived on the scene.

It looks like the average time for the SF Fire Department to respond to calls increased dramatically on October 19th:

It usually takes only 10 minutes to arrive at a scene, but that spiked to more than an hour on October 19th.
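A custom metric like this pairs a SQL aggregate with a threshold rule. The sketch below is illustrative only: the table and column names in the SQL string are assumptions about BigQuery's public San Francisco dataset, and the threshold helper is hypothetical, not Anomalo's API.

```python
# Hypothetical SQL for the metric; table and column names are assumed.
RESPONSE_TIME_SQL = """
SELECT AVG(TIMESTAMP_DIFF(response_timestamp, received_timestamp, MINUTE))
FROM `bigquery-public-data.san_francisco.sffd_service_calls`
WHERE DATE(received_timestamp) = @day
"""

def breaches_threshold(avg_minutes, threshold=15.0):
    """Alert when the daily average response time exceeds the threshold."""
    return avg_minutes > threshold

print(breaches_threshold(10.2))  # typical day       -> False
print(breaches_threshold(61.5))  # October 19 spike  -> True
```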

To learn more about testing your data with time series and unsupervised learning models, check out our post on dynamic data testing.

3. Standard run-time data quality checks
Are basic data constraints satisfied: no unexpected NULL or blank values, no violations of unique constraints, strings match expected patterns, timestamps follow a defined order, etc.?

Ensuring that metrics are within expected ranges is important, but ultimately only tests the surface of your data. Even if all of your metrics are within expected tolerances, there could be cracks appearing in your data foundation.

That is where rule based data quality checks come in. These checks typically test that a condition is always (or never) satisfied.

In Anomalo, we can easily set up a number of foundational data quality checks on the Hacker News dataset in BigQuery. For example, to test that <code>id</code> is always unique:

Adding a new validation rule in Anomalo can be done in just a few clicks.
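The logic of a uniqueness rule is straightforward; what matters is surfacing not just pass/fail but which values violate the constraint. A minimal sketch (hypothetical helpers, not Anomalo's implementation):

```python
from collections import Counter

def duplicate_values(values):
    """Return the values that violate a uniqueness constraint, with their counts."""
    return {v: c for v, c in Counter(values).items() if c > 1}

def is_unique(values):
    """A uniqueness rule passes only when no value appears more than once."""
    return not duplicate_values(values)

ids = [101, 102, 103, 103, 104]
print(is_unique(ids))         # False
print(duplicate_values(ids))  # {103: 2}
```

In practice the same check runs as a SQL aggregate in the warehouse (e.g. comparing `COUNT(*)` to `COUNT(DISTINCT id)`) rather than pulling rows into memory.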

We can then see at a glance which rules have passed, and which have failed:

Every table has a dashboard showing the latest status for all of the configured validation rules.

For failing checks, such as “<code>timestamp</code> is never NULL”, we can see a summary of the issue:

When validation rules fail, we show exactly how many records are affected.

A sample of bad and good records we can scan through:

You can scan through good and bad records to better understand the context of the data quality issue.

And a statistical analysis identifying the segments of data where the issue is most prominent:

Anomalo performs automated statistical analyses to identify if the issue is more common in any subset of the table. This saves a lot of time in triaging and root causing data quality issues.
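The essence of that segment analysis is computing the failure rate per segment and ranking segments worst-first, so the issue can be localized quickly. A simplified sketch, assuming rows as dictionaries (hypothetical helper, not Anomalo's statistical model):

```python
def null_rate_by_segment(rows, segment_key, value_key):
    """Share of NULL values per segment, sorted worst-first, to localize an issue."""
    totals, nulls = {}, {}
    for row in rows:
        seg = row[segment_key]
        totals[seg] = totals.get(seg, 0) + 1
        if row[value_key] is None:
            nulls[seg] = nulls.get(seg, 0) + 1
    rates = {s: nulls.get(s, 0) / t for s, t in totals.items()}
    return sorted(rates.items(), key=lambda kv: -kv[1])

rows = [
    {"type": "story", "timestamp": None},
    {"type": "story", "timestamp": "2020-11-23"},
    {"type": "comment", "timestamp": "2020-11-23"},
    {"type": "comment", "timestamp": "2020-11-23"},
]
print(null_rate_by_segment(rows, "type", "timestamp"))
```

A real implementation would also test whether the difference between segments is statistically significant before flagging one, to avoid noisy alerts on small segments.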

Depending upon the nature of the check, the summary visualization changes to provide the most useful context. In this case, it appears there are many stories with duplicate titles:

The most common duplicate Hacker News titles.

Knowing not just that the data is broken, but exactly how, where and when the issue occurs is critical to quickly triaging, root causing and fixing it.

4. Perfect metric consistency checks
Do columns and metrics satisfy critical relationships that can be expressed as simple equations?

Real world data is complicated. While key metric and fixed rule based checks may capture 95% of the common data quality use cases, the last 5% often require complex logic specific to an organization and dataset.

For example, in the New York Police Department dataset of motor vehicle collisions on BigQuery, there are multiple fields tracking how many people were injured or killed in an accident.

We can validate that the fatality counts are consistent with the following:


Setting up a custom validation to compare metrics — in this case fatalities for people killed in NYPD motor vehicle collisions.

When run, we find that 0.003% of records (45 of 1,734,004 as of 2020-11-27) have inconsistent fatality counts.
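A consistency check like this encodes an equation over the columns: the total killed should equal the sum of the per-role fatality columns. The column names below are assumptions based on the public NYPD collisions dataset, and the helper is illustrative:

```python
def fatality_counts_consistent(record):
    """Check that the total killed equals the sum of the per-role fatality columns.
    Column names are assumed from the public NYPD collisions dataset."""
    parts = (
        record["number_of_pedestrians_killed"]
        + record["number_of_cyclist_killed"]
        + record["number_of_motorist_killed"]
    )
    return record["number_of_persons_killed"] == parts

good = {"number_of_persons_killed": 1, "number_of_pedestrians_killed": 1,
        "number_of_cyclist_killed": 0, "number_of_motorist_killed": 0}
bad = {"number_of_persons_killed": 0, "number_of_pedestrians_killed": 0,
       "number_of_cyclist_killed": 0, "number_of_motorist_killed": 1}
print(fatality_counts_consistent(good))  # True
print(fatality_counts_consistent(bad))   # False
```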

When such a validation rule fails, we can also see sample bad and good records:

Specific examples of good and bad records often make it clear why an issue is occurring.

In this case, it appears that there are records where motorists were killed, and yet they are not appearing in the total. Again, we also show any segments of the data that indicate where the issue is occurring most often (limited in this case to just records with some non-zero fatality rows).

When the 3rd vehicle is unspecified this data quality issue happens more frequently.

In this case, when at least one of the fatality columns is non-zero, this issue is most likely to occur when the <code>contributing_factor_vehicle</code> number <code>3</code> column is unspecified. This intelligence could be a meaningful hint toward identifying where and how this data quality issue arose.

Today, it is easier than ever to become a data driven organization.

The volume and diversity of data keeps growing. Warehouses like Snowflake and BigQuery make storing and querying it all easy. From BI and visualization through to machine learning, there are a plethora of tools that we can leverage to generate insights and create value.

But most companies have only just begun the journey to ensure the data they depend upon can be trusted. As demonstrated by Airbnb, investing in data quality is critical to staying data driven as data organizations mature.

At Anomalo, we have built a platform that will allow any company to achieve and sustain the same vision of high quality data. If you are interested in starting a trial, head to our site to learn more or request a demo.

Written By
Jeremy Stanley