5 Strategies For Stopping Bad Data In Its Tracks
For data teams, bad data, broken data pipelines, stale dashboards, and 5 a.m. fire drills are par for the course, particularly as data workflows ingest more and more data from disparate sources. Drawing inspiration from software development, we call this phenomenon data downtime. But how can data teams proactively prevent bad data from striking in the first place?
In this article, I share five key strategies some of the best data organizations in the industry are leveraging to restore trust in their data.
Recently, a customer posed this question: “How do you prevent data downtime?”
He was a data leader at a global logistics company, and his team was responsible for serving terabytes of data to hundreds of stakeholders per day. Given the scale and speed at which they were moving, poor data quality was an all-too-common occurrence. We call this data downtime: periods of time when data is fully or partially missing, erroneous, or otherwise inaccurate.
Time and again, someone in marketing (or operations or sales or any other business function that uses data) noticed the metrics in their Tableau dashboard looked off and reached out to alert him, and his team then stopped whatever they were doing to troubleshoot what had happened to their data pipeline. In the process, his stakeholders lost trust in the data, and valuable time and resources were diverted from actually building data pipelines to firefighting the incident.
Perhaps you can relate?
Preventing downtime is standard practice across many industries that rely on functioning systems to run their business, from preventative maintenance in manufacturing to error monitoring in software engineering (cue the dreaded 404 page…).
Yet many of the same companies that tout their data-driven credentials aren’t investing in data pipeline monitoring to detect bad data before it moves downstream. Rather than being proactive about data downtime, they’re reactive, playing whack-a-mole with bad data instead of preventing it in the first place.
Fortunately, there’s hope. Some of the most forward-thinking data teams have developed best practices for preventing data downtime and stopping broken pipelines and inaccurate dashboards in their tracks, before your CEO has a chance to ask the dreaded question: “what happened here?!”
Below, I share five key strategies you can use to prevent bad data from corrupting your otherwise good pipelines:
Data testing, whether through hardcoded checks, dbt tests, or other types of unit tests, has been the primary mechanism to improve data quality for many data teams.
The problem is that you can’t write a test anticipating every single way data can break, and even if you could, that can’t scale across every pipeline your data team supports. I’ve seen teams with more than a hundred tests on a single data pipeline throw their hands up in frustration as bad data still finds a way in.
Data pipeline monitoring must be powered by machine learning metamonitors that can understand the way your data pipelines typically behave, and then send alerts when anomalies in data freshness, volume (row count), or schema occur. This should happen automatically and broadly across all of your tables the minute they are created.
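To make the idea of a volume monitor concrete, here is a minimal sketch. It uses a simple z-score over recent row counts as a stand-in for a learned model, so it is an illustration of the concept, not a machine learning monitor. The function name, the threshold, and the sample row counts are all hypothetical.

```python
from statistics import mean, stdev


def is_volume_anomaly(history, latest, z_threshold=3.0):
    """Flag `latest` when it sits more than `z_threshold` standard
    deviations away from the mean of the historical row counts."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # any change from a perfectly flat history is unusual
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical daily row counts for one table.
daily_row_counts = [10_120, 9_980, 10_050, 10_200, 9_900, 10_075, 10_010]

print(is_volume_anomaly(daily_row_counts, 10_100))  # an ordinary day
print(is_volume_anomaly(daily_row_counts, 2_300))   # a sharp drop in volume
```

A real monitor would also need to account for seasonality and growth trends, which a fixed z-score does not capture.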
It should also be paired with machine learning monitors that can understand when anomalies occur in the data itself, such as shifts in NULL rates, percent uniques, or value distribution.
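The field-level metrics mentioned above can be sketched in a few lines. This example computes a NULL rate and percent-unique value for one column and compares the NULL rate against a baseline with a fixed tolerance. The helper names, the tolerance, and the sample values are assumptions for illustration; a production monitor would learn the baseline instead of hardcoding it.

```python
def field_profile(values):
    """Compute the NULL rate and percent of unique non-null values for a column."""
    total = len(values)
    if total == 0:
        return {"null_rate": 0.0, "pct_unique": 0.0}
    non_null = [v for v in values if v is not None]
    null_rate = (total - len(non_null)) / total
    pct_unique = len(set(non_null)) / len(non_null) if non_null else 0.0
    return {"null_rate": null_rate, "pct_unique": pct_unique}


def null_rate_drifted(baseline_rate, current_rate, tolerance=0.05):
    """Return True when the NULL rate moved more than `tolerance` from its baseline."""
    return abs(current_rate - baseline_rate) > tolerance


# Hypothetical snapshots of the same column on two consecutive days.
yesterday = ["a", "b", None, "c", "d", "e", "f", "g", "h", "i"]
today = ["a", None, None, None, "b", None, "c", None, "d", None]

baseline = field_profile(yesterday)
current = field_profile(today)
print(baseline, current)
print(null_rate_drifted(baseline["null_rate"], current["null_rate"]))
```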
For most data teams, testing is the first line of defense against bad data. Courtesy of Arnold Francisca on Unsplash.
In the same way that software engineers unit test their code, data teams should validate their data across every stage of the pipeline through end-to-end testing. At its core, data testing helps you measure whether your data and code are performing as you assume they should.
Schema tests and custom-fixed data tests are both common methods, and can help confirm your data pipelines are working correctly in expected scenarios. These tests look for warning signs like null values and broken referential integrity, and allow you to set manual thresholds and identify outliers that may indicate a problem. When applied programmatically across every stage of your pipeline, data testing can help you detect and identify issues before they become data disasters.
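As a rough illustration of the kinds of tests described above, the sketch below implements a schema check, a not-null check, and a referential integrity check over rows represented as plain dictionaries. The function names, table names, and sample rows are hypothetical; in practice these checks are often expressed declaratively in a tool such as dbt.

```python
def check_schema(rows, expected_columns):
    """Return indexes of rows whose columns differ from the expected set."""
    expected = set(expected_columns)
    return [i for i, row in enumerate(rows) if set(row) != expected]


def check_not_null(rows, column):
    """Return indexes of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_referential_integrity(child_rows, fk_column, parent_rows, pk_column):
    """Return foreign key values in the child table with no matching parent key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [
        row[fk_column]
        for row in child_rows
        if row.get(fk_column) is not None and row[fk_column] not in parent_keys
    ]


# Hypothetical sample data.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 3, "amount": None},
    {"order_id": 12, "customer_id": 2, "amount": 12.5},
]

print(check_schema(orders, ["order_id", "customer_id", "amount"]))
print(check_not_null(orders, "amount"))
print(check_referential_integrity(orders, "customer_id", customers, "customer_id"))
```

Each function returns the offending rows or keys instead of a bare pass/fail, which makes a failing test easier to investigate.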
Data testing supplements data pipeline monitoring in two key ways. The first is by setting more granular thresholds or data SLAs. If data is loaded into your data warehouse a few minutes late, that might not be anomalous, but it may be critical to the executive who accesses their dashboard at 8:00 a.m. every day.
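A data SLA like the 8:00 a.m. example can be expressed as an explicit freshness test. The sketch below assumes a hypothetical rule: the table must have been loaded before the deadline, and no more than a fixed window before it. The function name, the two-hour window, and the dates are illustrative, and the timestamps are naive datetimes assumed to share one time zone.

```python
from datetime import datetime, timedelta


def meets_freshness_sla(last_loaded_at, deadline, max_age):
    """Return True when the load landed before the deadline and is no
    older than `max_age` at that deadline."""
    earliest_acceptable = deadline - max_age
    return earliest_acceptable <= last_loaded_at <= deadline


# Hypothetical SLA: data no more than two hours old by 8:00 a.m.
deadline = datetime(2024, 1, 15, 8, 0)
max_age = timedelta(hours=2)

on_time = datetime(2024, 1, 15, 7, 40)
late = datetime(2024, 1, 15, 8, 5)
stale = datetime(2024, 1, 14, 23, 0)

print(meets_freshness_sla(on_time, deadline, max_age))
print(meets_freshness_sla(late, deadline, max_age))
print(meets_freshness_sla(stale, deadline, max_age))
```

A load that arrives five minutes late would look normal to an anomaly monitor, but it fails this test, which is the distinction the SLA is meant to capture.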
The second is by stopping bad data in its tracks before it ever enters the data warehouse in the first place. This can be done through data circuit breakers using the Airflow ShortCircuitOperator, but caveat emptor: with great power comes great responsibility. You want to reserve this capability for the most well-defined tests on the most high-value operations; otherwise, it may add to your data downtime rather than reduce it.
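A circuit breaker of this kind comes down to a function that returns a truthy value to let the pipeline continue and a falsy value to halt it. The sketch below shows one possible condition function in plain Python, with the Airflow wiring left as a comment so the example runs without Airflow installed. The function name, thresholds, task ID, and sample values are hypothetical, and the import path for ShortCircuitOperator varies by Airflow version.

```python
def should_continue(row_count, null_rate, min_rows=1, max_null_rate=0.01):
    """Return True to let downstream tasks run, False to trip the breaker."""
    return row_count >= min_rows and null_rate <= max_null_rate


# Hypothetical wiring inside an Airflow 2.x DAG definition (not executed here).
# ShortCircuitOperator skips downstream tasks when its python_callable
# returns a falsy value. In practice the arguments would come from a
# query against the staging table, not from literals.
#
# from airflow.operators.python import ShortCircuitOperator
#
# circuit_breaker = ShortCircuitOperator(
#     task_id="orders_circuit_breaker",
#     python_callable=should_continue,
#     op_kwargs={"row_count": 10_000, "null_rate": 0.002},
# )

print(should_continue(row_count=10_000, null_rate=0.002))  # healthy batch
print(should_continue(row_count=0, null_rate=0.0))         # empty batch
print(should_continue(row_count=10_000, null_rate=0.35))   # too many NULLs
```

Keeping the condition this narrow reflects the caution above: a breaker that trips on loosely defined tests will block good data as often as bad.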
Field and table-level lineage can help data engineers and analysts understand which teams are using data assets affected by data incidents upstream. Image courtesy of Barr Moses.
Often, bad data is the unintended consequence of an innocent change made far upstream of an end consumer, who relies on a data asset that no member of the data team even knew was in use. This is a direct result of having your data pipeline monitoring solution separated from data lineage. I’ve called it the “You’re Using THAT Table?!” problem.
Data lineage, simply put, is the end-to-end mapping of upstream and downstream dependencies of your data, from ingestion to analytics. Data lineage empowers data teams to understand every dependency, including which reports and dashboards rely on which data sources, and what specific transformations and modeling take place at every stage.
When data lineage is incorporated into your data pipeline monitoring strategy, especially at the field and table level, all potential impacts of any changes can be forecasted and communicated to users at every stage of the data lifecycle to offset any unexpected impacts.
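To show how lineage supports impact analysis, the sketch below models table-level lineage as a simple adjacency map and walks it breadth-first to list everything downstream of a changed asset. The asset names and the `downstream_assets` helper are hypothetical; real lineage is usually derived from query logs or a metadata service, not maintained by hand.

```python
from collections import deque


def downstream_assets(lineage, changed_asset):
    """Return every asset reachable downstream of `changed_asset`, sorted by name."""
    seen = {changed_asset}
    queue = deque([changed_asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen - {changed_asset})


# Hypothetical lineage: each key feeds the assets in its list.
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["analytics.revenue", "analytics.customer_ltv"],
    "analytics.revenue": ["dashboard.exec_revenue"],
    "analytics.customer_ltv": [],
}

print(downstream_assets(lineage, "raw.orders"))
print(downstream_assets(lineage, "analytics.revenue"))
```

The same traversal works at the field level if the keys are column identifiers instead of table names.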
While downstream lineage and its associated business use cases are important, don’t neglect understanding which data scientists or engineers are accessing data at the warehouse and lake levels, too. Pushing a change without their knowledge could disrupt time-intensive modeling projects or infrastructure development.
When applied to a specific data pipeline monitoring use case, metadata can be a powerful tool for data incident resolution. Image courtesy of Barr Moses.
Lineage and metadata go hand-in-hand when it comes to data pipeline monitoring and preventing data downtime. Tagging data as part of your lineage practice allows you to specify how the data is being used and by whom, reducing the likelihood of misapplied or broken data.
Until all too recently, however, metadata was treated like those empty Amazon boxes you SWEAR you’re going to use one day – hoarded and soon forgotten.
As companies invest in more data solutions like data observability, more and more organizations are realizing that metadata serves as a seamless connection point throughout an increasingly complex tech stack, helping keep data reliable and up-to-date across every solution and stage of the pipeline. Metadata is crucial not just to understanding which consumers are affected by data downtime, but also to showing how data assets are connected, so data engineers can resolve incidents more quickly and collaboratively should they occur.
When metadata is applied according to business applications, you unlock a powerful understanding of how your data drives insights and decision making for the rest of your company.
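One way to picture metadata tied to business applications is a small tag registry that records who owns each asset and what it is used for, so an incident can be routed to the right people. The sketch below is illustrative only: the tag fields (`owner`, `use_case`), the asset names, and the `stakeholders_to_notify` helper are assumptions, not a standard schema.

```python
def stakeholders_to_notify(metadata, affected_assets):
    """Group affected assets by owner so each owner gets a single notification."""
    notify = {}
    for asset in affected_assets:
        tags = metadata.get(asset, {})
        owner = tags.get("owner", "unassigned")
        notify.setdefault(owner, []).append(asset)
    return notify


# Hypothetical metadata tags keyed by asset name.
metadata = {
    "analytics.revenue": {"owner": "finance-analytics", "use_case": "board reporting"},
    "dashboard.exec_revenue": {"owner": "finance-analytics", "use_case": "executive dashboard"},
    "analytics.customer_ltv": {"owner": "data-science", "use_case": "churn modeling"},
}

affected = ["analytics.revenue", "dashboard.exec_revenue", "analytics.customer_ltv", "staging.orders"]
print(stakeholders_to_notify(metadata, affected))
```

Assets without tags fall into an `unassigned` bucket, which makes gaps in metadata coverage visible instead of silently dropping them.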
End-to-end lineage powered by metadata gives you the necessary information to not just troubleshoot bad data and broken pipelines, but also understand the business applications of your data at every stage in its life cycle. Image courtesy of Barr Moses.
So, where does this leave us when it comes to realizing our dream of a world of data pipeline monitoring that ends data downtime?
Well, like death and taxes, data errors are unavoidable. But when metadata is prioritized, lineage is understood, and both are mapped to testing and data pipeline monitoring, the negative impacts on your business (the true cost of bad data and data downtime) are largely preventable.
I’m predicting that the future of broken data pipelines and data downtime is dark. And that’s a good thing. The more we can prevent data downtime from causing headaches and fire drills, the more our data teams can focus on projects that drive results and move the business forward with trusted, reliable, and powerful data.