Helpful strategies for improving data quality in data lakes
Ingesting large volumes of disparate data can yield a rich source of information — but it's also a recipe for data chaos. Use these tips to improve data quality as your data lake grows.
For as long as there’s been data, enterprises have tried to store it and make it useful. Unfortunately, sometimes the way enterprises store data does not directly correlate with making it useful. Yes, I’m talking about data lakes.
The promise of data lakes is clear: A central place for an enterprise to push its data. In some ways, data lakes could be seen as the next generation of data warehouses. Unlike the warehouse, however, data lakes allow companies to dump data into the lake without cleansing and preparing it beforehand.
This approach simply delays the inevitable need to make sense of that data. However, properly applied data quality initiatives can simplify and standardize the way data lakes are used. In this guide, learn useful ways to make all that data accessible to the business analysts, data scientists and others in your company who get paid to make sense of it.
A data lake is a central repository for storing data of any source or nature: structured, semi-structured or unstructured. Unlike a data warehouse, which organizes data hierarchically in files and folders, a data lake keeps data in a flat structure and uses object storage, where each object is tagged with metadata for easier, faster retrieval.
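The flat, tag-based retrieval model described above can be sketched in a few lines. This is a toy in-memory model, not how a real object store such as Amazon S3 works; the class and method names are invented for illustration.

```python
# Toy model of object storage with metadata tags (illustrative only;
# real data lakes sit on services such as Amazon S3 or Azure Data Lake Storage).
class ObjectStore:
    """Flat namespace: every object lives at a key and carries metadata tags."""

    def __init__(self):
        self._objects = {}  # key -> (data, tags)

    def put(self, key, data, **tags):
        # No folders or hierarchy: just a key plus descriptive tags.
        self._objects[key] = (data, tags)

    def find(self, **wanted):
        # Retrieval filters on tags rather than walking a directory tree.
        return [
            key for key, (_, tags) in self._objects.items()
            if all(tags.get(k) == v for k, v in wanted.items())
        ]

store = ObjectStore()
store.put("events-2023-01.json", b"{}", source="clickstream", format="json")
store.put("sales.csv", b"id,amt", source="erp", format="csv")

print(store.find(source="clickstream"))  # ['events-2023-01.json']
```

The point of the sketch is that discovery depends entirely on the quality of the tags — which is why untagged or inconsistently tagged objects are a major source of data-lake chaos.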
A data warehouse requires incoming data to conform to a common schema before it is stored (schema-on-write), which allows for easier processing; a data lake lets enterprises store data in its raw format and apply a schema only when the data is read (schema-on-read). Data warehouses tend to store data in relational formats, pulling structured data from line-of-business applications and transactional systems. They allow for fast SQL queries but tend to be expensive and proprietary.
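The schema-on-read idea can be made concrete with a small sketch: raw records land in the lake exactly as they arrive, and a schema (field selection and type coercion) is applied only at query time. The record shapes and function name here are invented for the example.

```python
import json

# Schema-on-read sketch: raw JSON lines are stored untouched, in whatever
# shape each source emits them (field names and values are made up).
raw_records = [
    '{"user": "ana", "amount": "19.99"}',
    '{"user": "bo", "amount": 5, "coupon": "X1"}',  # extra field, different type
]

def read_with_schema(lines):
    """Apply a schema only at read time: select fields and coerce types."""
    for line in lines:
        rec = json.loads(line)
        yield {"user": str(rec["user"]), "amount": float(rec["amount"])}

print(list(read_with_schema(raw_records)))
```

Note the trade-off: ingestion never rejects a record, so every inconsistency (the string-typed amount, the stray coupon field) must be handled by every reader instead of once at write time.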
Data warehouses are also often misused, as Decodable CEO Eric Sammer has argued, putting expensive, slow batch-oriented ETL processes between applications to move data. Data lakes, by contrast, tend to store data in open formats and allow for a broader range of analytical queries.
That is, if you can first make sense of the data.
This is the first and most pressing problem of data lakes: Learning how to make sense of that wildly disparate data.
In an interview, David Meyer, SVP of Product Management at Databricks, a leading provider of data lake and data warehousing solutions, called the benefits of data lakes “great in a lot of ways” because “you can stuff all your data in them.”
The problem, however, is that “they don’t have a lot of characteristics that you’d want to do data [analytics] and AI at scale.” He went on to say that “they weren’t transactional or ACID compliant. They weren’t fast.”
Databricks has addressed many of those problems by layering capabilities such as governance on top and then open sourcing them. As an example, it developed the Delta Lake format, for which Google Cloud recently announced support. Delta Lake effectively turns a data lake into a data warehouse, a combination often called a “lakehouse.”
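The transactional behavior Meyer says plain data lakes lack can be illustrated with a toy append-only commit log. This is a drastic simplification loosely inspired by Delta Lake's transaction-log idea, not its actual protocol; all class and method names are invented.

```python
# Toy append-only transaction log over a pile of data files
# (a drastic simplification; names are invented for illustration).
class TableLog:
    def __init__(self):
        self.commits = []  # each entry records the files added and removed

    def commit(self, adds, removes=()):
        # The commit is appended as a single unit: readers replaying the log
        # see all of its changes or none of them, never a partial write.
        self.commits.append({"add": list(adds), "remove": list(removes)})

    def live_files(self):
        # Replay the log from the start to compute the current table snapshot.
        files = set()
        for c in self.commits:
            files -= set(c["remove"])
            files |= set(c["add"])
        return sorted(files)

log = TableLog()
log.commit(adds=["part-000.parquet"])
# Compaction: swap the old file for a rewritten one in a single commit.
log.commit(adds=["part-001.parquet"], removes=["part-000.parquet"])
print(log.live_files())  # ['part-001.parquet']
```

Because the table's state is whatever the log replays to, concurrent readers never observe a half-finished rewrite — which is the kind of guarantee a raw pile of files in a lake cannot give.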
Though data lakes avoid some of the problems of data warehouses, they can be expensive to implement and maintain — in part because even skilled practitioners may find them difficult to manage.
The lack of structure may seem liberating when data is being ingested, but it can be burdensome when an enterprise hopes to make sense of the data. Absent something like the Databricks governance overlay, data lakes are often plagued by poor governance and security.
Even so, there’s enough promise in data lakes that enterprises will continue to invest in them for their data management needs. So how can enterprises use data lakes wisely?
One answer to the traditional data lake is to turn it into something else.