What does the data fabric architecture look like? - The EE
Data powers everything we create, share, and experience. To deliver high-quality engagements, organisations must be able to manage the mammoth volume of data streaming in every moment, says Yash Mehta, an IoT and big data science specialist.
Data fabrics continuously retrieve diverse data sets from varied sources and filter for the data assets that are relevant to the business. Apart from facilitating data integration, a fabric eliminates data silos, supports data compliance and aids automated data governance. Ultimately, it accelerates the digital transformation initiatives of organisations of all types.
At the current growth rate, the data fabric market could reach USD 6.97 billion by 2029. That corresponds to a CAGR of 22.3% and is a testament to data fabric’s mission to build smarter data processes.
At the core of the fabric, there’s an underlying architecture consisting of many components.
In this post, we discuss these building blocks.
There are many data fabric vendors in the industry, such as IBM, Atlan, K2view, Talend, and NetApp, and each may vary slightly in its architecture. However, I have outlined the standard component structure that all products follow. Among the many classifications, I liked the one done by Gartner in their insightful article.
The ingestion component captures data from multiple sources, unifies it and then pipelines it to target systems, preserving data integrity along the way. This is why the component is also known as the data integration layer.
The data ingestion component optimises real-time, batch and stream processing and works with a variety of sources. These include on-premise databases, cloud systems, data sources at the application layer and others. The ingestion component should be able to work with data in all formats, both structured and unstructured.
After ingestion, the virtualisation layer provides a logical abstraction over all underlying systems for easier access to trusted data.
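As a rough illustration of what "capture and unify" can mean in practice, here is a minimal Python sketch. It assumes two hypothetical sources (a CRM export in CSV and application events as JSON lines) and a hypothetical field mapping; none of these names come from a specific data fabric product.

```python
import csv
import io
import json


def ingest_csv(text, source):
    """Parse CSV text into records tagged with the source they came from."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"source": source, **row} for row in reader]


def ingest_json_lines(text, source):
    """Parse newline-delimited JSON into records tagged with their source."""
    records = []
    for line in text.splitlines():
        if line.strip():
            records.append({"source": source, **json.loads(line)})
    return records


def unify(records, field_map):
    """Rename source-specific field names to one shared schema."""
    return [{field_map.get(k, k): v for k, v in r.items()} for r in records]


# Hypothetical sample data standing in for two different source systems.
crm_csv = "cust_id,email\n1,ann@example.com\n"
app_events = '{"customerId": "1", "event": "login"}\n'
field_map = {"cust_id": "customer_id", "customerId": "customer_id"}

records = unify(
    ingest_csv(crm_csv, "crm") + ingest_json_lines(app_events, "app"),
    field_map,
)
```

After unification, both records expose the same `customer_id` key, which is what lets later components treat them as one data set.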
As the name suggests, the data delivery component retrieves data from multiple sources and delivers it to the target system by any one of several methods. These include ETL (bulk), messaging, CDC (change data capture), virtualisation, APIs, and others.
It also prepares the data lakes and warehouses for analytics activities. Data transformation makes the data analytics-ready for BI systems. When evaluating your data fabric choices, I recommend treating data delivery as an important differentiator.
While IBM continues to be the pioneering product, I find K2view a promising competitor. Its data fabric solution follows a micro-database approach, wherein every database holds the data for one business partner only while the fabric maintains millions of such databases.
This enables enterprises to create and deliver data products quickly. The approach handles operational and analytical workloads effectively for different architecture types, in the cloud as well as on-premise.
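To make one of the delivery methods mentioned above more concrete, the sketch below shows a simplified form of CDC in Python. It is an assumption-laden toy: production CDC tools typically read database transaction logs, whereas this version merely compares two snapshots keyed by primary key. The function names are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two snapshots (dicts keyed by primary key) and emit change events."""
    events = []
    for key, row in current.items():
        if key not in previous:
            events.append({"op": "insert", "key": key, "after": row})
        elif previous[key] != row:
            events.append(
                {"op": "update", "key": key, "before": previous[key], "after": row}
            )
    for key, row in previous.items():
        if key not in current:
            events.append({"op": "delete", "key": key, "before": row})
    return events


def apply_changes(target, events):
    """Replay change events on a target copy so it catches up with the source."""
    for event in events:
        if event["op"] == "delete":
            target.pop(event["key"], None)
        else:
            target[event["key"]] = event["after"]
    return target


# Hypothetical snapshots of a source table at two points in time.
previous = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
current = {1: {"name": "Ann B."}, 3: {"name": "Cy"}}

events = capture_changes(previous, current)
target = apply_changes(dict(previous), events)
```

Only the three changed rows travel to the target, rather than the full table, which is the main appeal of CDC over bulk ETL.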
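The idea of one small database per business entity can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the concept under my own assumptions (in-memory databases, a single hypothetical `orders` table); it is not K2view's implementation or API.

```python
import sqlite3


class MicroDatabaseStore:
    """Keeps one small in-memory SQLite database per business entity."""

    def __init__(self):
        self._dbs = {}

    def _db_for(self, entity_id):
        # Each call to connect(":memory:") creates a separate, isolated database.
        if entity_id not in self._dbs:
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
            self._dbs[entity_id] = conn
        return self._dbs[entity_id]

    def add_order(self, entity_id, order_id, amount):
        conn = self._db_for(entity_id)
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.commit()

    def total_spend(self, entity_id):
        query = "SELECT COALESCE(SUM(amount), 0) FROM orders"
        return self._db_for(entity_id).execute(query).fetchone()[0]


store = MicroDatabaseStore()
store.add_order("customer-1", "A100", 10.0)
store.add_order("customer-1", "A101", 5.5)
store.add_order("customer-2", "B200", 7.0)
```

Because each entity has its own database, a query for `customer-1` never scans `customer-2`'s rows, and access to a single entity's data can be controlled in isolation.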
Orchestration is the process of capturing data sets from multiple, siloed sources, combining them and then organising them for data analytics. As an important component, orchestration provides an end-to-end view of the pipeline.
The complete data workflow needs orchestration for accurate coordination. Orchestration determines how pipelines function and exactly when they are required to run, and then controls the data they generate.
In a nutshell, this component provides code-free control of the data flow and of its transformation from source to target systems.
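Fabric products expose orchestration through code-free tooling, but the underlying idea, running pipeline steps in dependency order and passing data between them, can be sketched in a few lines of Python. The task names and the `run_pipeline` helper below are hypothetical; `graphlib.TopologicalSorter` is part of the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order, feeding each task its upstream results."""
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        inputs = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*inputs)
    return order, results


# Hypothetical steps: capture from two silos, combine, then organise.
tasks = {
    "extract_crm": lambda: [3, 1],
    "extract_app": lambda: [2],
    "combine": lambda crm, app: crm + app,
    "organise": lambda rows: sorted(rows),
}

# Each key lists the steps that must finish before it can run.
dependencies = {
    "combine": ["extract_crm", "extract_app"],
    "organise": ["combine"],
}

order, results = run_pipeline(tasks, dependencies)
```

The returned `order` is the end-to-end view of the pipeline: it records which step ran when, while `results` holds the data each step produced.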
It uses a connected knowledge graph to provision access to all metadata types. The graphical representation of metadata makes it easier to understand and build unique relationships. This layer is responsible for connecting data assets with the
Knowledge graphs are an integral unit of fabrics. They help in visualising the overall data landscape using identifiers, schemas, and data points. Through a graphical representation, they simplify the concepts and make them more searchable.
Analysts can use this component to determine whether or not data from multiple sources could be part of the same dataset. Such insights are valuable for the fabric’s overall performance. This is the core of the fabric and is responsible for addressing data silo issues. In this component, analysts can carry out data modelling, preparation and curation.
Next, the Persistent Layer stores data dynamically across a wide range of relational and non-relational models. The storage model is chosen based on the use case.
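One simple way to picture how a metadata graph helps decide whether two sources belong together is to link each dataset to its fields and look for shared identifiers. The Python sketch below assumes a hypothetical `MetadataGraph` class and made-up dataset names; real knowledge graphs carry far richer relationships.

```python
from collections import defaultdict


class MetadataGraph:
    """A tiny undirected graph linking datasets to the fields they contain."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_dataset(self, dataset, fields):
        for field in fields:
            self._edges[("dataset", dataset)].add(("field", field))
            self._edges[("field", field)].add(("dataset", dataset))

    def fields_of(self, dataset):
        return {name for _, name in self._edges.get(("dataset", dataset), set())}

    def shared_fields(self, a, b):
        """Fields that appear in both datasets, i.e. candidate join keys."""
        return self.fields_of(a) & self.fields_of(b)

    def datasets_with(self, field):
        """Search the graph for every dataset that contains a given field."""
        return {name for _, name in self._edges.get(("field", field), set())}


graph = MetadataGraph()
graph.add_dataset("crm.customers", ["customer_id", "email"])
graph.add_dataset("billing.invoices", ["invoice_id", "customer_id", "amount"])
graph.add_dataset("web.sessions", ["session_id", "page"])

joinable = graph.shared_fields("crm.customers", "billing.invoices")
```

Here `joinable` contains `customer_id`, which suggests the two sources could be part of the same dataset, while `web.sessions` shares no field with either and remains a silo until a linking identifier is found.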
Also known as the intelligence layer, the governance component enforces policies and regulations to determine data visibility and authorisation. The feature centralises the governance process for optimal metadata monitoring as per local and global compliance policies.
It configures the rules and controls that cover synchronisation, integrity and security. In a nutshell, the component is responsible for the end-to-end security of data.
The Active Metadata component enables a data fabric to receive, exchange and analyse all types of metadata. Active metadata covers those metadata sets that record the real-time data usage by the systems and the users.
This is different from passive metadata, which usually includes design-based and run-time metadata.
As the name suggests, data masking protects data at rest, in use and in transit across environments such as production, testing, and analytics.
And finally, there’s data service automation, which creates, debugs and deploys web services using an easy no-code/low-code framework.
So far, we have discussed the different components of data fabric architecture. These cover the entire journey from identifying data sources through collecting and processing data to ultimately provisioning it to data environments. We also discussed leading fabric products and why it’s time to embrace automation. Which data fabric product are you using?
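Centralised rules for visibility and authorisation can be pictured as a single policy table consulted on every read. The following Python sketch is an assumption for illustration only: the roles, field names and the `authorised_view` helper are hypothetical.

```python
# Hypothetical central policy table: which fields each role may see.
POLICIES = {
    "analyst": {"customer_id", "country"},
    "support": {"customer_id", "email", "country"},
}


def authorised_view(record, role, policies=POLICIES):
    """Return only the fields the role may see; unknown roles see nothing."""
    allowed = policies.get(role, set())
    return {key: value for key, value in record.items() if key in allowed}


record = {"customer_id": "1", "email": "ann@example.com", "country": "DE"}

analyst_view = authorised_view(record, "analyst")
support_view = authorised_view(record, "support")
guest_view = authorised_view(record, "guest")
```

Keeping the policy in one place means a change in regulation is applied once, rather than in every consuming system.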
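To give a feel for usage-oriented metadata, the Python sketch below records who touched which dataset and then analyses the log. The `UsageLog` class and the dataset names are hypothetical, and a real fabric would collect these events automatically rather than through explicit calls.

```python
from collections import Counter
from datetime import datetime, timezone


class UsageLog:
    """Collects usage events as they happen and answers simple questions."""

    def __init__(self):
        self.events = []

    def record(self, user, dataset, action):
        self.events.append(
            {
                "user": user,
                "dataset": dataset,
                "action": action,
                "at": datetime.now(timezone.utc),
            }
        )

    def most_used(self, n=1):
        """Datasets ranked by how often they were accessed."""
        counts = Counter(event["dataset"] for event in self.events)
        return counts.most_common(n)

    def users_of(self, dataset):
        return {e["user"] for e in self.events if e["dataset"] == dataset}


log = UsageLog()
log.record("ann", "sales.orders", "read")
log.record("bob", "sales.orders", "read")
log.record("ann", "hr.salaries", "read")

top = log.most_used(1)
```

Signals such as `top` can then feed recommendations, for example which datasets deserve caching or closer governance.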
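Two common masking styles can be sketched in Python: partial masking, which keeps a value readable but hides most of it, and pseudonymisation, which replaces a value with a stable token. The helper names and the salt below are hypothetical, and a production system would manage the salt as a secret.

```python
import hashlib


def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not domain:
        # Not a well-formed address, so hide everything.
        return "*" * len(email)
    return local[:1] + "*" * (len(local) - 1) + "@" + domain


def pseudonymise(value, salt):
    """Replace a value with a stable token so masked data sets can still be joined."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


masked = mask_email("ann@example.com")
token_a = pseudonymise("ann@example.com", salt="demo-salt")
token_b = pseudonymise("ann@example.com", salt="demo-salt")
```

Because the same input and salt always yield the same token, a test or analytics environment can still link records without ever seeing the original value.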
The author is Yash Mehta, an IoT and big data science specialist.