Chapter 3: Is Automated Data Quality Monitoring Right for Your Business?
March 24, 2025
Welcome to “Use AI to modernize your data quality strategy,” a series spotlighting insights from our O’Reilly book, Automating Data Quality Monitoring. This post corresponds to Chapter 3: Assessing the Business Impact of Automated Data Quality Monitoring.
Automated data quality monitoring is a powerful solution for organizations struggling with data trust and accuracy at scale. But just because something is innovative doesn’t mean it’s the right investment. Like a high-end ski helmet, its value depends on your level of exposure: if you’re shredding black diamonds every weekend, the extra protection is a smart move, but if you’re just sledding with your kid a few times a year, it might be overkill.
So how can you determine whether automated data quality monitoring is worth the investment for your organization? The answer lies in examining the characteristics of your data, the demands of your industry, the maturity of your data infrastructure, and the benefits it could provide to your stakeholders. Ultimately, assessing these factors will help you estimate the potential return on investment (ROI) and decide whether automation is a practical step forward.
Assessing Your Data: Volume, Type, Cadence, and Risk
Not all data requires the same level of scrutiny, and not all organizations generate the kind of complex, fast-moving data that benefits from an automated approach.
Years ago, IBM came up with the “four Vs of big data”: volume, variety, velocity, and veracity. It’s a useful framework for sizing up your data, but we’ll use terminology we think is clearer.
| | Good candidates for automated DQ monitoring | Low reward for automated DQ monitoring |
| --- | --- | --- |
| Data volume | • Large datasets approaching billions of rows per table • Segmented data | • Small datasets, like a factory’s manufacturing output records |
| Data types (Variety) | • Large variety of data types • Unstructured data | • Data that’s difficult to correct later (customer addresses) • Data from one large, static data dump (pharmaceutical trials) • Some semistructured data |
| Update cadence (Velocity) | • Tables with an update cadence of weekly or more | • Yearly or quarterly tables |
| Risk profile (Veracity) | • Third-party data • Data from very complex systems interacting • Data from systems undergoing continuous changes • Data generated by legacy systems | • Data that is almost entirely static and “hermetically sealed” |
Data Volume
Organizations handling massive datasets, especially those with millions or billions of rows, are strong candidates for automation. At this scale, manual reviews are impossible, and even traditional rule-based monitoring systems may struggle to keep up. It’s not just about raw volume, either: when datasets are naturally segmented, such as sales data broken down by store location or customer cohort, automated data quality monitoring becomes particularly appealing. Traditional monitoring can obscure issues in specific segments, because writing and maintaining checks for each segment individually is difficult to scale.
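To make the segmentation point concrete, here’s a minimal sketch of a per-segment check. It is not how Anomalo works under the hood, and the column names (store_id, date, sales) are made up for illustration; the point is that an aggregate check can look healthy while one store quietly breaks.

```python
import pandas as pd


def flag_segment_anomalies(daily_sales: pd.DataFrame, z_threshold: float = 3.0) -> pd.DataFrame:
    """Flag store/day rows whose sales deviate sharply from that store's own history.

    Expects hypothetical columns: 'store_id', 'date', 'sales'.
    """
    # Per-segment baseline: each store's historical mean and standard deviation.
    stats = (
        daily_sales.groupby("store_id")["sales"]
        .agg(seg_mean="mean", seg_std="std")
        .reset_index()
    )
    scored = daily_sales.merge(stats, on="store_id")
    # Score each day against the segment's own baseline, not the global one, so a
    # single store collapsing to zero is visible even when company-wide totals look fine.
    scored["z_score"] = (scored["sales"] - scored["seg_mean"]) / scored["seg_std"]
    return scored[scored["z_score"].abs() > z_threshold]
```

With thousands of stores or cohorts, hand-maintaining a threshold per segment quickly becomes untenable, which is exactly the gap automated, learned monitoring is meant to fill.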
Data Types
You likely have a variety of data types. The three main ones are structured, semi-structured, and unstructured.
- Structured Data: Relational databases and fact tables are best suited for machine-learning based monitoring, as automated systems can effectively track trends, outliers, and anomalies across well-defined schemas.
- Semi-Structured Data: Formats like JSON can benefit from automation but may require additional pre-processing depending on how often the data’s structure changes.
- Unstructured Data: This includes images, audio, or free-text documents. Although challenging due to a lack of inherent structure, automation can help by ensuring data is properly labeled or by monitoring metadata.
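To illustrate the pre-processing point for semi-structured data, here’s a rough sketch that flattens nested JSON events into a tabular shape whose columns can then be monitored like any other structured table. The event fields are invented for the example.

```python
import pandas as pd

# Hypothetical semi-structured events, e.g. API payloads or application logs.
events = [
    {"user": {"id": 1, "country": "US"}, "order": {"total": 42.50, "items": 3}},
    {"user": {"id": 2, "country": "DE"}, "order": {"total": 17.00, "items": 1}},
    {"user": {"id": 3, "country": "US"}, "order": {"total": None, "items": 2}},
]

# json_normalize turns nested keys into dotted column names, giving the data a
# well-defined schema that automated checks can track over time.
flat = pd.json_normalize(events)
print(flat.columns.tolist())              # ['user.id', 'user.country', 'order.total', 'order.items']
print(flat["order.total"].isna().mean())  # share of rows missing an order total
```

If the event structure changes often, this flattening step itself becomes something to monitor, which is part of why some semistructured data lands in the lower-reward column of the table above.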
Update Cadence
Data that updates frequently—such as financial transactions or real-time analytics logs—needs continuous monitoring to catch quality issues before bad data propagates through the system. Conversely, datasets that update less often might not need the same level of automation.
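As a rough illustration of what “continuous” means in practice, the sketch below checks whether today’s slice of a frequently updated table has arrived and looks roughly the right size. The table and column names are hypothetical, and a real system would learn these thresholds rather than hard-code them.

```python
from datetime import date

import pandas as pd


def check_todays_partition(table: pd.DataFrame, expected_min_rows: int = 1_000) -> list[str]:
    """Return plain-language issues for today's slice of a table with a 'created_at' column."""
    issues = []
    today = date.today()
    todays_rows = table[pd.to_datetime(table["created_at"]).dt.date == today]
    if todays_rows.empty:
        issues.append(f"No rows arrived for {today}; the load may be late or failed.")
    elif len(todays_rows) < expected_min_rows:
        issues.append(f"Only {len(todays_rows):,} rows for {today}; expected at least {expected_min_rows:,}.")
    return issues

# A check like this would run on a daily (or hourly) schedule; for a table that only
# changes quarterly, that cadence, and the automation behind it, buys you little.
```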
Risk Profile
Some data is inherently more susceptible to issues. Be mindful of third-party data, data from complex or legacy systems, or data undergoing continuous changes. Even though nearly all data can face issues, the more dynamic your data, the more you may benefit from automated monitoring.
Understanding Your Industry
Many of our customers come from sectors such as financial services, ecommerce, media, technology, and real estate. More recently, Anomalo has expanded into industries like CPG and manufacturing. Even businesses not traditionally viewed as data-heavy are now recognizing the critical need to integrate data into key processes.
- Regulatory Pressure: Industries like financial services and healthcare have stringent compliance requirements, making reliable data quality essential.
- AI/ML Risks: For companies leveraging AI, high-quality data is critical to avoid issues like feature shocks or overfitting.
- Data as a Product: If you package or sell data, maintaining high quality is similar to quality control in manufacturing—vital for customer trust.
Assessing Your Data Maturity
As your organization matures in its data strategy, automated data quality monitoring becomes increasingly valuable. Consider these indicators:
- Following the Data Science Hierarchy of Needs: Early-stage companies might focus on data collection and storage, but as they move into analysis, automated monitoring becomes essential.
- Modern Data Stack Readiness: Organizations using modern data warehouses (like Snowflake or BigQuery) and transformation tools (such as Airflow or dbt) stand to benefit significantly from automation.
- Rolling Out Gradually: Start with mission-critical tables, like key financial reports or customer databases, and expand monitoring based on observed benefits.
Assessing Benefits to Stakeholders
Different groups in your organization will experience the benefits of automation in different ways:
- Technical Users: Engineers appreciate tools that are easy to configure and integrate, often with robust APIs.
- Data Team Managers: They benefit from high-level analytics that offer a bird’s-eye view of data health.
- Less Technical Users: Data analysts and scientists value intuitive UIs and rich visualizations to quickly identify root causes.
- Other Stakeholders: Teams like product, operations, marketing, and compliance need clear, actionable alerts from a single source of truth.
Conducting an ROI Analysis
Okay, so you think automated DQ monitoring will be useful for your business. But is it worth the expense and time?
Your manager and finance team will probably appreciate a quantitative ROI, which in this case can be built around the direct and indirect costs of data quality errors. Here’s a sample back-of-the-envelope calculation (turn to pages 61 and 62 of our O’Reilly book for a more comprehensive evaluation, including how to break down tables by their varying levels of importance):
- Say you detect 12 incidents a year with your current monitoring plan.
- Each incident costs on average $500,000 in business interruption, investigation and resolution time, and other adverse customer and operational impacts.
- You know your current strategy is not comprehensive, so you suspect there are another 6 or so incidents going undetected, for a total of 18 incidents at half a million dollars each.
- On the whole, that’s $9 million in incident-related costs annually.
- Add to this the purchase and maintenance costs incurred by current data quality systems that would be replaced by an automated one.
That total is your current annual cost under your existing monitoring approach. Your ROI is your expected savings: the current cost minus the future cost with automated monitoring in place. To estimate the future cost, add up the following:
- The new expected frequency of incidents, times their cost
- The cost to set up and maintain an automated monitoring system (including labor)
Subtract this second number from your first number, and you’ve got your quantitative ROI.
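If it helps to see the arithmetic laid out, here’s the same back-of-the-envelope calculation as a small script. Only the 18-incidents-at-$500K figure comes from the example above; the remaining inputs are placeholders you’d replace with your own estimates.

```python
def monitoring_roi(
    detected_incidents: int = 12,          # incidents caught per year today
    suspected_undetected: int = 6,         # incidents you believe slip through undetected
    cost_per_incident: float = 500_000,    # average business impact per incident
    current_tooling_cost: float = 0,       # placeholder: cost of systems the new one would replace
    expected_incidents_after: int = 6,     # placeholder: residual incidents with automation
    new_system_cost: float = 0,            # placeholder: setup, maintenance, and labor
) -> float:
    """Quantitative ROI as annual savings: current cost minus expected future cost."""
    current_cost = (detected_incidents + suspected_undetected) * cost_per_incident + current_tooling_cost
    future_cost = expected_incidents_after * cost_per_incident + new_system_cost
    return current_cost - future_cost

# The chapter's example: (12 + 6) * $500K = $9M per year in incident-related costs today.
print(f"${monitoring_roi():,.0f}")  # $6,000,000 with the placeholder inputs above
```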
There are non-quantitative benefits and risks to consider, too:
- Benefits: Faster development cycles; an “audit trail” of documentation; improved internal and external trust in your data.
- Risks: Lowered morale due to new training needs and resistance to change; potential security risks; alert fatigue due to incorrectly configured alerting systems (Ed. note: we think Anomalo addresses all of these pretty well.)
If you’ve determined that automated data quality monitoring is the right fit for your business, the next step is to understand how machine learning can be applied effectively. In Chapter 4, we’ll explore key ML techniques, including anomaly detection, time-series forecasting, and clustering, to power scalable data quality monitoring solutions.
Can’t wait? Get the whole book now, for free.