Managing vast quantities of data can be a significant challenge. It requires an entire organizational shift towards federated data management and a product-oriented culture. Complex architectural concepts like data mesh can make these cultural and organizational changes seem overwhelming. Hence, critics frequently regard these concepts as overly theoretical and impractical. To combat this perception, it's crucial to approach these transformations with a clear, practical strategy.
In this blog post, I aim to demonstrate a practical data transformation using Microsoft Fabric and Microsoft Purview. The target audience includes executives, architects, analysts, and compliance and governance personnel who are interested in creating a comprehensive data platform.
For this exercise, I'll use a hypothetical airline organization called Oceanic Airlines as an example, but the narrative is grounded in the real-world experiences of contemporary enterprises already utilizing these Microsoft services. It's important to note that embarking on a data journey to identify domains and construct data products at a large scale is intricate. This process entails nuances and comes with organizational and cultural shifts. Moreover, it's essential to understand that there's no universal approach that fits all scenarios.
Getting started
During the initial phase of the transition, it's crucial to focus on comprehending the concepts and their practical application. To achieve this, I recommend the following steps:
- It's crucial to have a well-defined data strategy in place, one that vividly communicates your ambitions and outlines the road ahead. Your strategy should include concrete use cases, along with clear objectives and key results (OKRs). By clearly defining these elements, you ensure that everyone in your organization understands and is committed to your goals. This alignment is key to driving successful data transformations.
- Assemble a dedicated data team to oversee the entire process. This team will assist other business domain teams responsible for constructing the first data pipelines for generating data products.
- Start with a limited number of business domains and identify a simple use case that uses a few source systems as a baseline. These sources will provide the groundwork for developing data products. Preferably, business domains have all the engineering skills necessary to develop the first deliverables.
- Establish a small program board with senior representatives to provide top-down support and coordination.
In the sections that follow, we'll explore how these steps play out in practice. We'll begin by implementing our data platform and distributing the first data products to consumers. However, before consumers can use these products, we need to define them in a data catalog. It's important to note that setting up domains, building data products, and establishing a data catalog don't follow a strict order. These tasks occur simultaneously, but for the purpose of this blog post, I've chosen to start with the data platform.
Initiating data domains
The first step for the initial phase involves reaching out to the head of data at Oceanic Airlines. Their responsibilities include establishing the data platform for the initial organizational setup. They are in charge of defining the initial domain boundaries, assigning administrators for Microsoft Fabric and Microsoft Purview, and identifying suitable use cases for onboarding. While we assume that we already have Microsoft Fabric and Microsoft Purview operational, these services still need to be configured.
To facilitate a data architecture, we will organize our tasks and data using domains. This approach not only helps consumers discover data by domain but also enables federated governance. To set up these domains in Microsoft Fabric, we'll instruct the Fabric administrator to define them. We'll use the Admin portal, navigate to the Domains tab, and begin setting up our domains.
For our hypothetical airline company, I have created four domains, each representing a distinct business concern. We assign domain administrators for each domain, enabling each business unit, department or data team to establish its own rules and restrictions based on its specific needs. In this way, we delegate some governance responsibilities to local domains.
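Setting up domains is a portal task, but it can also be scripted. Below is a minimal sketch using the Fabric Admin REST API, assuming the caller holds the Fabric administrator role; apart from the domains discussed later in this post, the domain names are illustrative.

```python
# Hedged sketch: creating Fabric domains via the Admin REST API instead of
# the Admin portal. Requires a Fabric administrator identity; domain names
# are illustrative examples for Oceanic Airlines.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token(
    "https://api.fabric.microsoft.com/.default"
).token
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

domains = [
    "Flight Operations",
    "Baggage Handling Management",
    "Airport and Lounge Management",
    "Customer Loyalty",
]

for name in domains:
    resp = requests.post(
        "https://api.fabric.microsoft.com/v1/admin/domains",
        headers=headers,
        json={"displayName": name, "description": f"Analytical data domain for {name}"},
    )
    resp.raise_for_status()
    print(f"Created domain '{name}' with id {resp.json()['id']}")
```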
It's crucial to understand that these Fabric domains represent only a portion of the overall business domain. Fabric is an analytical data platform used for extracting data from our source systems and transforming it for data product creation and analysis. The actual source systems, however, are not included in Fabric; they reside elsewhere. Thus, if we consider horizontal teams owning their source systems as well as the data products and other artifacts created in Fabric, a Fabric domain is simply a subset. In that respect, I would argue that a Fabric domain symbolizes a data domain, which is a segment of the broader business domain. If you would like to learn more about this, I encourage you to read my other article on data domains.
Setting up workspaces
After we have set up domains in Fabric, the next step is to define our workspaces. A workspace is a vital component of Microsoft Fabric, providing a collaborative environment for domain users to engage in data ingestion, machine learning, real-time analytics, lakehouses, warehouses, and report generation. Workspaces connect to capacities and utilize version control for managing code and artifacts. Hence, it's advisable to establish at least a Development, Test, and Production workspace for each domain. If we apply this to all four domains, we'll end up creating a total of twelve workspaces.
The following step in the configuration process involves setting up capacity. In Fabric, capacity determines the speed and performance of your workloads. For the purposes of this exercise, we'll start small by purchasing a single capacity and sharing it across all our workspaces. Later, you can reserve or manage capacity for each workspace individually.
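If you prefer to script this rather than click through the portal, the sketch below creates the Development, Test, and Production workspaces for each domain through the Fabric REST API and attaches them to the shared capacity. The capacity id is a placeholder, and the token is obtained as in the previous snippet.

```python
# Hedged sketch: scripting the twelve workspaces (Dev/Test/Prod per domain)
# against the Fabric REST API. CAPACITY_ID is a placeholder for the single
# shared capacity mentioned above.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token(
    "https://api.fabric.microsoft.com/.default"
).token
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

CAPACITY_ID = "<your-capacity-guid>"  # placeholder
domains = ["Flight Operations", "Baggage Handling Management",
           "Airport and Lounge Management", "Customer Loyalty"]

for domain in domains:
    for stage in ("Development", "Test", "Production"):
        resp = requests.post(
            "https://api.fabric.microsoft.com/v1/workspaces",
            headers=headers,
            json={
                "displayName": f"{domain} - {stage}",
                "capacityId": CAPACITY_ID,  # assign every workspace to the shared capacity
            },
        )
        resp.raise_for_status()
        print(f"Created workspace '{domain} - {stage}'")
```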
For data teams to be successful, they need to get familiar with the iterative process of developing, building, testing, deploying, operating, and monitoring. This is where a good deployment process comes in. Within Fabric, Microsoft offers deployment pipelines allowing creators to move artifacts safely between workspaces. In the context of managing a data platform, the deployment pipeline moves artifacts, such as schema information, and data pipeline code from one environment to another. Between teams, I recommend standardization in terms of how teams work and how artifacts are deployed into production. A common best practice for managing data engineering workloads is to give each team its own Git code repository for keeping track of changes.
In practical terms, for your domains, this means you must establish deployment pipelines, ensuring that Development links to Development, Test links to Test, and Production links to Production. This process will need to be replicated across all domains.
Using Medallion architectures for all of the workspaces
For the initial implementation of a new use case, I suggest embracing a Medallion architecture using Lakehouses. This design offers numerous advantages: it is a well-tested, popular, and easily comprehensible model; it is simple to establish; and it streamlines data management.
The diagram below provides a reference design illustrating how each architecture might initially appear. This compact reference architecture is designed for the onboarding and transformation of data across all individual domains.
In Microsoft Fabric, the general best practice is to assign each layer to a distinct Lakehouse entity. Each Lakehouse manages its own data within a separate layer and also features a built-in SQL endpoint, which enables data warehousing capabilities without requiring data movement between layers. Therefore, to implement a Medallion architecture in Microsoft Fabric, we'll create three Lakehouse items: a Lakehouse for Bronze, a Lakehouse for Silver, and a Lakehouse for Gold.
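As a sketch, the three Lakehouses can also be created per workspace through the Fabric REST API; the `lakehouses` endpoint and the `LH_` naming convention shown here are my assumptions for illustration.

```python
# Hedged sketch: creating the Bronze/Silver/Gold Lakehouses in a workspace
# through the Fabric REST API. workspace_id is a placeholder.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token(
    "https://api.fabric.microsoft.com/.default"
).token
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

workspace_id = "<workspace-guid>"  # placeholder, e.g. a Development workspace

for layer in ("Bronze", "Silver", "Gold"):
    resp = requests.post(
        f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/lakehouses",
        headers=headers,
        json={"displayName": f"LH_{layer}"},
    )
    resp.raise_for_status()
    print(f"Created Lakehouse LH_{layer}")
```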
It's essential to understand that the guidelines for layering data using Lakehouses are general, and there may be valid reasons for deviating from the three-layered design. For instance, you could assign each onboarded source its own entity for more precise management or security, and you could do the same for the gold layer. You can also use additional Lakehouse entities to separate the concerns of cross-use-case harmonization, use cases, and data that is shared with other teams. Ultimately, the configuration between domains may vary based on domain requirements or principles established by the (enterprise) architect. For more best practices, consider reading this Medallion architecture blog post.
With the initial design in place, the next step is to encourage each domain to begin constructing data pipelines for ingestion and data transformations. Typically, teams will import raw data copies from their transactional sources into the Bronze Lakehouse. For batch processing, I recommend that all teams partition this data using interval partitioned tables, for example, with a YYYYMMDD or datetime folder structure. Once more, it's crucial to keep in mind that the data ingestion approaches may differ across domains, as these decisions can heavily impact performance, maintainability, and security.
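To make the partitioning advice concrete, here is a minimal sketch of a daily batch ingest into Bronze. It assumes a Fabric notebook with the Bronze Lakehouse attached (so the `spark` session and relative `Files/` paths are available); the source location and dataset names are illustrative.

```python
# Hedged sketch: daily batch ingest into the Bronze Lakehouse, partitioned
# by load date (YYYYMMDD). Assumes a Fabric notebook with the Bronze
# Lakehouse attached; the source path and names are illustrative.
from datetime import datetime, timezone

load_date = datetime.now(timezone.utc).strftime("%Y%m%d")

raw_df = (
    spark.read
    .option("header", True)
    .csv("abfss://sources@<storage-account>.dfs.core.windows.net/bag-scans/")
)

(
    raw_df
    .write
    .mode("overwrite")
    .parquet(f"Files/bronze/bag_scans/{load_date}/")  # one folder per daily load
)
```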
After ingestion, it's important to validate data against the original schema of the source system. These validations can be written in Python, for example, or a small metadata-driven framework can be designed for this purpose.
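A validation step can stay very small. The sketch below compares the schema of the ingested frame (the `raw_df` from the previous snippet) against the expected source schema before the data moves on; the hard-coded dictionary stands in for the metadata-driven framework mentioned above.

```python
# Hedged sketch: a tiny metadata-driven schema check. The expected schema
# would normally come from a metadata store; it is hard-coded here for
# illustration.
expected_schema = {
    "bag_tag_id": "string",
    "scan_timestamp": "string",  # raw CSV columns arrive as strings
    "airport_code": "string",
}

def validate_schema(df, expected):
    """Return a list of human-readable schema violations (empty means valid)."""
    actual = {f.name: f.dataType.simpleString() for f in df.schema.fields}
    errors = [f"missing column '{col}'" for col in expected if col not in actual]
    errors += [
        f"column '{col}' is {actual[col]}, expected {dtype}"
        for col, dtype in expected.items()
        if col in actual and actual[col] != dtype
    ]
    return errors

issues = validate_schema(raw_df, expected_schema)
if issues:
    raise ValueError("Schema validation failed: " + "; ".join(issues))
```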
Subsequently, the data will be transferred from the Bronze to the Silver layer. Here, the data is refined, corrected, and historized, typically using slowly changing dimensions. I also suggest that domain teams group data around specific subject areas to enhance read performance. In the Silver layer, the data remains source-oriented, making it suitable for operational reporting or analytics.
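Historization with slowly changing dimensions can be implemented on the Delta tables underneath each Lakehouse. Below is a hedged sketch of an SCD type 2 pattern: close off current rows whose attributes changed, then append the new versions. It assumes an existing `bag_scans_silver` table and re-uses `load_date` from the ingest snippet; table, column, and tracked-attribute names are illustrative.

```python
# Hedged sketch: SCD type 2 into a Silver Delta table. Assumes a Fabric
# notebook with the Silver Lakehouse attached and an existing
# `bag_scans_silver` table; names are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

updates = spark.read.parquet(f"Files/bronze/bag_scans/{load_date}/")
current = spark.table("bag_scans_silver").where("is_current = true")

# Keep only new keys or rows whose tracked attribute changed.
changed = (
    updates.alias("u")
    .join(current.alias("s"), F.expr("u.bag_tag_id = s.bag_tag_id"), "left")
    .where("s.bag_tag_id IS NULL OR s.airport_code <> u.airport_code")
    .select("u.*")
)

# Step 1: close off the superseded current rows.
(
    DeltaTable.forName(spark, "bag_scans_silver").alias("s")
    .merge(changed.alias("u"),
           "s.bag_tag_id = u.bag_tag_id AND s.is_current = true")
    .whenMatchedUpdate(set={
        "is_current": F.lit(False),
        "valid_to": F.current_timestamp(),
    })
    .execute()
)

# Step 2: append the new row versions as the current records.
(
    changed
    .withColumn("valid_from", F.current_timestamp())
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .withColumn("is_current", F.lit(True))
    .write.mode("append").saveAsTable("bag_scans_silver")
)
```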
The final step involves moving data to the Gold layer. Here, data is structured according to specific project use cases. Thus, the data is integrated, combined, and transformed for consumption, resulting in fewer required joins.
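As a brief illustration, a Gold table can pre-join the Silver subject areas once so that consumers don't repeat those joins. Table and column names are again illustrative; in Fabric, tables in another Lakehouse of the same workspace can be referenced with qualified names.

```python
# Hedged sketch: shaping a consumption-ready Gold table by joining Silver
# subject areas once. Table and column names are illustrative.
bag_scans = spark.table("bag_scans_silver")
flights = spark.table("flights_silver")

baggage_tracking = (
    bag_scans.where("is_current = true")
    .join(flights, "flight_id")  # integrate once, so consumers need fewer joins
    .select("bag_tag_id", "scan_timestamp", "airport_code",
            "flight_number", "scheduled_departure")
)

baggage_tracking.write.mode("overwrite").saveAsTable("baggage_tracking_gold")
```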
As you can see in the screenshot below, I am managing all artifacts within a single workspace. This approach is effective when all team members share a similar trust level, capacity, and geographical region. However, if you believe there's a need for added boundaries, don't hesitate to create extra workspaces for enhanced control. For more insights, check out this other insightful blog post.
Data as a product
Data products are designed to facilitate interoperability and provide easily accessible data via clear, transparent agreements. They should empower other teams to quickly access and utilize the data, while also offering flexibility to adapt to internal team changes. This balance ensures a smooth, efficient workflow that benefits all stakeholders.
Your Gold layer stands out as the ideal choice for efficient data distribution across different domains. Within this layer, you can instruct teams to distinguish stable, consumer-ready datasets from other datasets. They can do this by bundling and registering these datasets as "data products" in the catalog, making them easily accessible for use. More on this later.
However, this approach of using datasets from Gold comes with many nuances. For example, some organizations offer flexibility in the way data products are classified. They allow domains to categorize data products within their Silver layer for operational purposes. Additionally, some organizations even permit reports or artifacts to be classified as data products, broadening the scope of what can be considered as such.
Conversely, other organizations uphold strict principles, such as prohibiting direct distribution from the Gold layer. They insist on decoupling through an additional layer, essentially introducing another Lakehouse entity specifically for domain-to-domain distribution. This rule prevents domains from taking shortcuts to use-case-specific data.
The overarching goal of data product development is to establish a universally accepted standard throughout all domains. This goal finds its culmination at the enterprise level, where elements such as governance, organizational structure, and a self-service platform act as facilitators. As we transition to the next stage, the role of a data catalog becomes crucial. It ensures that data products are well-organized, easily accessible, and immediately ready for use across all domains. In essence, the catalog serves as a language that guarantees data products are reliable, trustworthy, and safely accessible.
Governing domains using Microsoft Purview
For effective data governance within a large-scale data architecture, we adopt Microsoft Purview, a comprehensive data governance solution. This step necessitates collaboration with the head of data at Oceanic Airlines. Their task is to establish the initial catalog structure and align it with your architecture and organization.
During the first phase of adoption, I recommend starting on a small scale by adding only a few domains. Avoid scanning all your source systems and domains at once; instead, focus on scanning the domains that are being integrated into your new architecture. Onboard domains sequentially, one at a time, and enrich them with OKRs, glossary terms, data products, critical data elements, access policies, DQ rules, and the like. This phase is about learning in preparation for scaling in the phases to come.
Business domains
To establish a well-structured catalog, it's beneficial to logically align your business domains, application domains, and data domains. Let's delve deeper into what this means.
The domains we've configured in Microsoft Fabric only constitute a portion of the overarching business domain. In reality, a typical business domain is far more expansive. It encompasses a specific area of organizational challenge, involving people, processes, applications, and data across the full business spectrum. Consider, for example, Airport and Lounge management. This sector of the business is committed to ensuring efficient airport operations and passenger comfort. To accomplish this, we not only need operational systems but also a data platform for reporting and analytics. In the case of Oceanic Airlines, Microsoft Fabric is used for only this last part.
Microsoft Purview acknowledges the concept of business boundaries as well. It employs the concept of business domains to manage business concepts and define data products. Think of a business domain as a framework for managing your business-related information in the catalog. Therefore, an ideal starting point would be to set up all your business domains in Microsoft Purview. Each business domain features a set of roles, offering enhanced flexibility and control over the management of specific components. Data products serve as one of these crucial components.
Data products
Data Products is a new experience within Microsoft Purview. It's a step forward from the existing experience, enabling the bundling of data assets at the business level. Suppose you wish to offer a mix of tables, files, or even reports to your users — Data Products allows for exactly this kind of bundling. Your data products thus become a one-stop shop where users can discover information and request access to the underlying data or artifacts.
For instance, let's consider a real-world representation of a data product. In this case, the domain 'Baggage Handling Management' opted for a 'Baggage Tracking' data product. This logical container holds two physical data assets: 'Bag Scans' and 'Luggage Moments.' We identified both these resource sets after performing a scan using the collection structures. Let's learn more about this in the next sections.
Collection structure
Furthermore, Microsoft Purview employs (technical) domains for tasks like technical data discovery, data scanning, and classifications. It utilizes a Collection structure for management and logical groupings. Consider this as the solution space where applications and data platforms collaborate to achieve business objectives. We can use the Collection structure for modeling and defining clear boundaries within our solution architecture. This establishes who assumes ownership of a specific set of applications or data managed within a platform like Microsoft Fabric.
The screenshot above demonstrates how the (technical) domains of Oceanic Airlines are represented using collection structures. Let's pause for a moment to examine what we've observed so far.
Firstly, you'll notice Microsoft Fabric, our data and analytics platform. Since multiple teams utilize this single SaaS platform, I've added it at the root of our collection structure.
Secondly, the application domains are visible. These collections group together applications and source systems that work closely to deliver specific business value. I've separated these because it's likely that different stakeholders will manage, scan and describe these systems.
Lastly, there are data domains where data is ingested, utilized, and developed into data products. These data domains will align with the workspaces from the domains we've implemented within Microsoft Fabric.
Note that, at the time of writing, the distribution logic for shared services such as Microsoft Fabric is still absent. So, for the time being, moving Fabric's workspace metadata into Purview collection structures requires workarounds using the API or scripts. A tool named Purview Bulk Collection Mover, which I developed personally and which contains scripts for Microsoft Fabric, is available on my GitHub repository.
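For a flavor of what such a workaround involves, here is a condensed sketch that moves scanned assets into a Purview collection via the Data Map REST API. The account, collection, and asset GUIDs are placeholders, and the endpoint reflects the collections API at the time of writing, so verify it against the current documentation; the Bulk Collection Mover scripts also automate discovering the GUIDs.

```python
# Hedged sketch: moving scanned Fabric assets into a Purview collection with
# the Data Map REST API. Account name, collection id, and GUIDs are
# placeholders; verify the endpoint against current Purview documentation.
import requests
from azure.identity import DefaultAzureCredential

PURVIEW = "https://<account-name>.purview.azure.com"
token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

resp = requests.post(
    f"{PURVIEW}/catalog/api/collections/<collection-id>/entity/moveHere",
    params={"api-version": "2022-03-01-preview"},
    headers={"Authorization": f"Bearer {token}"},
    json={"entityGuids": ["<guid-of-fabric-asset-1>", "<guid-of-fabric-asset-2>"]},
)
resp.raise_for_status()
```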
In the example of Oceanic Airlines, there is a clear alignment between business and technical domains. Thus, application and data domains neatly correspond to business domains. Consequently, I've included an additional collection structure that holds the names of the business domains. However, this may not necessarily be the case for your organization. For instance, if teams are not organized horizontally and are more reliant on a central IT department, these collection structures might be more consolidated. Or, if the application ownership is more fine-grained, you could implement additional collections for individual application domains. Alternatively, if applications are shared and used by multiple business domains, these collection domains might be separate from the logical collection structure representing the business domains.
Glossaries for capturing business context
Contextually, the application domain and data domain intersect, as they both rely on the same semantic models from the business domain. However, their physical data structures differ, as they are individually tailored to meet different needs: supporting transactional processes versus focusing on intensive data reading for reporting and analytics.
The business glossary, as part of your business domain, with business terms linked to data attributes, can effectively highlight the overlap and differences in how the application domain and data domain (i.e., data products) relate to the larger business domain. By requiring teams to define business terms and link them to specific data attributes, your catalog offers a good understanding of how domains overlap and diverge.
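To make this tangible: business terms can be created and linked to data attributes programmatically through Purview's Atlas-compatible glossary API. The sketch below creates one term and assigns it to an existing asset; the glossary GUID, asset GUID, and term content are placeholders.

```python
# Hedged sketch: creating a business term and linking it to a data asset via
# Purview's Atlas-compatible glossary API. GUIDs are placeholders.
import requests
from azure.identity import DefaultAzureCredential

PURVIEW = "https://<account-name>.purview.azure.com"
token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

# Create the term inside an existing glossary.
term = requests.post(
    f"{PURVIEW}/catalog/api/atlas/v2/glossary/term",
    headers=headers,
    json={
        "name": "Bag Tag",
        "shortDescription": "Unique identifier printed on a checked bag.",
        "anchor": {"glossaryGuid": "<glossary-guid>"},
    },
).json()

# Link the term to a concrete data attribute (an existing catalog asset).
requests.post(
    f"{PURVIEW}/catalog/api/atlas/v2/glossary/terms/{term['guid']}/assignedEntities",
    headers=headers,
    json=[{"guid": "<asset-guid-of-bag_tag_id-column>"}],
).raise_for_status()
```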
Connecting the dots
Let's take a moment to breathe and reflect on our progress so far. The image below brings together all the components we've worked on.
We kicked things off with Oceanic Airlines' journey of building a scalable data platform. We envisioned horizontal business teams taking charge of their systems, applications, data ingestion, and data product creation, and also becoming consumers of this data. These teams embody our business domains, which form part of the solution space and are represented by the light grey boxes in the image.
Inside these light grey boxes, you'll notice applications and systems. These denote the day-to-day operations of our business teams and are represented as application domains.
Moving on, we've established a robust data platform — Microsoft Fabric. Within this platform, we've created a specific domain for each business domain. These domains house workspaces, represented in green. This setup ensures that each business team has its dedicated workspaces to streamline its operations and leverage data effectively.
When it comes to data governance, we've leveraged Microsoft Purview. It acts as an overseeing layer, providing a holistic view of everything within our solution space and beyond that. Within Purview, we've defined our business domains using business terms and represented data products as logical entities for easier understanding.
Subsequently, we established a collection structure that scans and captures technical metadata from the solution space, linking back to our tangible solutions.
In this setup, relationships naturally form between business domains and collections. For instance, in our example, application domains align with source systems, while data domains correspond to Fabric workspaces. This mapping ensures seamless integration and efficient data management across all domains.
With different domains properly aligned, Oceanic Airlines is primed for efficient operation. Utilizing Microsoft Fabric for data development equips data teams with a platform for managing their data products. Simultaneously, Microsoft Purview ensures data governance, promoting the responsible and secure use of data and data products. This combination of services creates an ideal environment for the initial phase of your implementation, paving the way forward for future business domains.
Reflection: Data management is a collaborative effort
After the initial business domains are onboarded into Microsoft Fabric and Microsoft Purview, it's important to reflect on the lessons learned from this preliminary phase. In the proposed operating model, each domain is expected to take ownership of its data, both within its applications and for the data it generates and distributes within the data platform. Guiding domains through this transition is crucial.
During this initial stage, the central data team provides oversight and guidance to the other domain teams. This team not only mentors others but also offers expertise in creating efficient data pipelines, debugging issues, and improving data quality at the source. Collaboration is key in this phase, as collective decisions about each team's responsibilities need to be made.
The same level of oversight and guidance applies to all data governance tasks and activities. For instance, your teams need workflows that guide them in using data safely. To help your teams become more self-sufficient, it's crucial to provide them with the necessary education.
Additionally, monitoring the quality and overall progress of your data strategy is crucial, which includes keeping track of Objectives and Key Results (OKRs). Therefore, consider utilizing monitoring and progress dashboards like "Data estate health" from Microsoft Purview. This application enables Chief Data Officers to oversee the progress and overall health of their data strategy.
A critical part of this stage is assigning roles and permissions within Microsoft Fabric and Microsoft Purview. This task can be challenging because the pre-set roles and responsibilities in Microsoft Fabric and Purview might not align perfectly with those of your organization. Therefore, several workshops may be needed to clarify the roles and responsibilities each member will have.
Again, it's crucial to envision the big picture while scaling in stages. Onboard one domain at a time, progressively. Simultaneously, work on enhancing governance, skills, architecture, and processes. This balanced approach will ensure steady progress towards your data management goals.
Next stages
Once you've successfully implemented your initial set of use cases into production, the focus shifts to scaling up, incorporating more data domains, and refining the overarching architecture. In doing so, it's vital to uphold your fundamental architectural principles.
In the suggested operating model, each domain takes full responsibility for the data it produces and distributes. Additionally, domains are isolated from each other, meaning all items within a workspace are associated with a specific domain. Each domain effectively utilizes the data catalog to fulfill its responsibilities.
Unlike Databricks or Synapse, which usually supply each team with a uniform blueprint regardless of their unique needs, Microsoft Fabric doesn't rely on these infrastructure blueprints. This approach eliminates the need to overprovision resources, offering greater flexibility to address a business domain's specific requirements. As such, the domain architecture can vary significantly based on a domain's scope or requirements. For example, while source-system domains may use three data layers, a domain on the consuming side might only require one or two. The approach mentioned above is visualized through the abstract image provided below.
In all implementation scenarios, it's essential to provide careful guidance for these Medallion architectures. This ensures that the architecture is customized to each business domain's distinct needs, while still adhering to your established architectural principles for the purposes of these different layers.
Introducing additional data domains
As you expand and incorporate more domains, teams may express concerns about the repetitive effort involved in merging and integrating data. Architecture concepts such as data mesh recognize this issue and introduce the notion of aggregates, or aggregate domains. Suppose there's a significant overlap in data consumption patterns on the consumer side. In such cases, we can choose to consolidate, integrate, and harmonize all overlapping data by creating a new data domain using Microsoft Fabric.
But how does this relate to Microsoft Purview's business domains and collections? Do we create a new business domain in Purview as well? And, crucially, who takes ownership?
If a business domain is consuming data and initially creating an aggregate, we could categorize these data products under existing business domains. However, if the central data management team facilitates this, it could result in a new business domain representing the data management team. This scenario is quite common among the customers I work with. The same principle applies to Master Data Management, which could constitute a business domain in itself. In all cases, the decomposition of domains and their alignment within the data platform and catalog should be handled with care. This process should be guided by the central architecture and/or data management team.
Data products
The same level of attention should be applied when guiding teams in developing data products. If Microsoft Purview becomes the go-to platform for data discovery, it's crucial to educate teams on how to register data products effectively in Microsoft Purview. It's important to understand that data products are essentially logical business concepts. These concepts should have unique and clear names, descriptions, owners, and, most importantly, lists of associated data assets.
Microsoft Purview has transitioned its focus from applying policies on physical data to managing logical concepts crucial to business domains. This means that teams need to be cognizant of this shift and should be properly guided. Additionally, all data products should have controls to ensure they are initially invisible to other users. The central data governance team should carefully oversee this transition by carefully describing the process of data product development and onboarding.
One note on managing data products within Microsoft Purview: The current offering provides limited control over which data assets can be promoted to data products, whether data assets can be shared between data products, and the linking of business domains to technical domains. For example, as the Chief Data Officer of Oceanic Airlines, I would want the power to ensure that only tables from "Data Products Lakehouses" are selected for promotion into data products in Purview. To achieve this, a custom solution that deeply integrates with Microsoft Purview's APIs would be necessary.
Standardization and clear principles are vital for successful implementation. Data products should not become a subject of experimentation, nor should they trigger a proliferation of data modeling methods. Ultimately, it's about making data consumable. For practical guidance on this topic, I recommend the following article.
Other improvements
Once the groundwork is set, the next step is to concentrate on prioritized business scenarios and bolster your capabilities. A crucial objective at this juncture is to shift all ancillary activities from the central data team to your specific domain teams. Pinpoint any inefficiencies and tackle them through automation and self-service options.
In order to strengthen your organization, guide your teams towards enhancing their data management skills and the efficiency of their corresponding data pipelines. Promote self-onboarding and self-subscription to data products. Develop services that enable effortless self-registration and metadata upkeep. As an example, encourage your domain teams to use APIs for a seamless integration of data product registration within your domain's CI/CD processes. Furthermore, establish best practices concerning data quality, ensuring alignment with reference or master data.
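To illustrate the CI/CD idea, here is a deliberately hypothetical sketch: a pipeline step that registers or updates a data product definition kept as YAML in the domain's repository. The `register_data_product` endpoint and the payload shape are stand-ins, not a real Purview API; you would wrap whichever Purview API surface your tenant exposes for data products.

```python
# Deliberately hypothetical sketch: a CI/CD step that pushes a data product
# definition from source control to the catalog. The endpoint and payload
# below are stand-ins, not a real Purview API.
import requests
import yaml  # pip install pyyaml
from azure.identity import DefaultAzureCredential

with open("data-products/baggage-tracking.yaml") as f:
    product = yaml.safe_load(f)  # name, description, owner, data assets

token = DefaultAzureCredential().get_token("https://purview.azure.net/.default").token

resp = requests.put(
    "https://<internal-registration-service>/api/register_data_product",  # hypothetical
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": product["name"],
        "description": product["description"],
        "owner": product["owner"],
        "assets": product["assets"],  # list of catalog asset GUIDs
    },
)
resp.raise_for_status()
```

Keeping the definition in source control means registration runs on every merge to main, which makes the catalog entry as reviewable and versioned as the pipeline code itself.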
As you progress with future iterations, you may want to explore real-time data processing and additional consumption patterns, such as self-service, facilitated by extra workspaces. To accommodate a variety of query and self-service patterns, it's a common best practice to establish a physical serving layer. Here, data is replicated from, say, the Gold layer into another service, making it more accessible for end-users. This could involve services like KQL databases, Warehouses, or even PowerBI datasets, among others.
This strategy caters to various business lines with differing and often contrasting data usage needs. It eliminates the department's need to prepare the data, freeing up time for deeper data analysis and insight generation. Remember, regardless of the path you choose, maintaining standardization across all dimensions in your operations should always be a top priority.
Conclusion
In conclusion, managing vast data quantities necessitates strategic approaches and robust solutions, which Microsoft Fabric and Microsoft Purview provide. These tools facilitate a shift towards federated data management and a product-oriented culture. Such complex architectural changes are effectively navigated with clear, practical strategies and a deep understanding of the process nuances.
From the initial transition phase through to the scaling up process, organizations must maintain a clear vision and implement changes progressively. Key to this are a well-defined data strategy, a dedicated data team, and clear OKRs. With these foundational elements, organizations can smoothly implement their data platform, distribute data products to consumers, and progressively onboard one domain at a time.
Microsoft Fabric and Microsoft Purview offer a robust basis for managing large-scale data, aligning with various data mesh principles such as domain-oriented data ownership, a data-as-a-product mindset, self-service, and federated computational governance. With these tools, organizations can scale their data management efforts efficiently, driving informed decision-making and business growth.
Thanks to Effie Kilmer, Daan Humble and Remy Ursem for sharing their insights.