Data Products Reliability: The Power of Metadata

September 10, 2024

Achieving Reliable Data Products: Insights from Metadata and Collaboration.

Photo by Kaleidico on Unsplash

In today’s data-driven landscape, ensuring the reliability of data products is critical for organizations striving to make informed and impactful business and technical decisions. At Miro, we have embarked on a journey to deploy data products across diverse analytics domains, employing a variety of strategies and technologies to uphold robustness and trustworthiness within our data ecosystem.

Several years ago, we adopted Airflow as our central metadata hub for validating SLAs (see post here), marking a significant shift in our approach to data orchestration and quality assurance. This pivotal decision greatly enhanced our ability to track and ensure the successful delivery of pipelines. Initially, our measurement of pipeline quality revealed challenges in maintaining consistent data reliability. However, through dedicated efforts over multiple quarters, we achieved remarkable improvements, reducing our key pipeline’s downtime from 50% to nearly 1% today. This transformation underscores the crucial role of defining and validating data contracts, which have proven indispensable in fostering trust in our data assets.

Despite these advancements, we continued to encounter challenges in aligning on quality with all data consumers across the organization. This realization led us to a new approach in which all our data products, contracts, and expectations are created and accessed through a holistic metadata strategy. By embracing this comprehensive approach, we aim to further enhance the reliability and transparency of our data products, ensuring they meet the evolving needs of our stakeholders. This strategy also enables a faster collaboration process in which both producers and consumers can contribute.

Assessing the Initial Approach

While the existing data product pipelines provided crucial visibility, they also revealed significant challenges:

Technical Complexity of Data Contracts Implementation: A data contract, which serves as an agreement on expectations between the producer and the consumer, must be easily accessible to its target audience. Traditionally, these contracts were defined in the engineering-owned Airflow repository as task-level technical files, making them too technical for many analytics data consumers to engage with effectively. For example, consumers had to know the name of the Airflow task producing their data. In the example below, a data engineer immediately understands that “dbt_build_shared_models_p3” is the relevant task of the pipeline, but for a business user in the finance scope it conveys no meaningful information.

Example extract from the Previous Airflow Task Expectation.
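Purely as an illustration of how technical these files were, a task-level expectation might have looked roughly like the sketch below; the structure and field names are assumptions, not the exact format used in our Airflow repository.

```yaml
# Hypothetical sketch of a task-level SLA expectation living in the Airflow repo.
# Consumers had to know internal task names such as "dbt_build_shared_models_p3".
expectations:
  - dag_id: analytics_daily                # engineering-facing pipeline identifier
    task_id: dbt_build_shared_models_p3    # meaningless to a finance stakeholder
    sla: "09:00 UTC"                       # expected completion time
    on_miss: notify-data-engineering       # alert routed to engineers, not consumers
```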

User Expectations vs. Notification Understanding: Despite users’ expectations of reliable data assets, notifications based on pipeline statuses often lacked clarity, focusing on technical production details rather than on what the issue meant for data consumption or what action to take.

Evaluating Dataset Relationships (Lineage): The relationships between the data assets promised as part of the expectations (for example, the revenue metric requires the user entity, attribution, and other inputs to be reliable) were difficult to evaluate without substantial technical assistance, as no explicit format existed to define them.

Naive Uptime Measurement: Uptime metrics based primarily on Airflow task statuses provided freshness indicators but fell short of comprehensively assessing overall data quality, particularly with respect to complex dependencies within pipelines.

Incomplete Abstraction with External Tools: External tools such as Looker and Events operated independently of Airflow, so the data product abstraction covered only pipelines. This fragmented approach hindered our goal of delivering comprehensive information across all data components.

Introducing New Capabilities in Our Data Stack

Recognizing these challenges, we have integrated advanced capabilities into our data stack, which were absent during the initial design phase of our data contracts mechanism:

Data Integrity Checks: Tools such as dbt tests, data quality frameworks, and event validations now play a pivotal role. They send detailed check results as metadata to a centralized platform, enhancing our ability to ensure data integrity across the board.
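As a simplified example of one such check, a dbt schema file can declare column-level tests whose results are then emitted as metadata; the model and column names below are invented for illustration.

```yaml
# models/schema.yml - simplified dbt test declarations (illustrative names only)
version: 2

models:
  - name: fct_revenue            # hypothetical revenue model
    columns:
      - name: user_id
        tests:
          - not_null             # every revenue row must reference a user
          - relationships:
              to: ref('dim_users')
              field: user_id     # referential integrity against the user entity
```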

Metadata Management Platform: We have adopted a robust metadata management platform (DataHub Cloud by Acryl Data) that offers comprehensive insights into all assets within our ecosystem. From the initial events that capture user activity to the final Looker dashboards that make insights consumable for the business, this platform provides detailed lineage and quality information critical for maintaining data reliability.

First Steps Towards Holistic Metadata for Data Products

To achieve holistic data product reliability, which considers metadata from all the data quality sources, we have initiated a concerted effort to align key stakeholders within our data scope. Collaborating closely with our analytics teams, we have defined essential elements to underpin this endeavor:

The Concept of Analytics Data Products

At its core, a data product serves as an abstraction representing a cohesive set of assets within a specific analytics domain, aimed at generating tangible business value. Key attributes of our approach include:

  • Domain-Specific Ownership: Each data product is anchored within a distinct business unit at Miro, ensuring clear accountability and ownership.
  • Analytics Team Ownership: Empowering analytics teams with ownership ensures they have dedicated oversight and responsibility for maintaining high standards of data product quality. This approach also ensures that those who understand the business meaning of the data (business owner) collaborate closely with the technical experts (technical owner) who are accountable for its implementation.
  • Defined Data Contracts: Agreements between data producers and consumers clearly define quality expectations, promoting transparency and alignment across teams.
  • Types of Data Products: In analytics, there are two types of data products: Metrics — business-oriented data products, and Building Blocks — shared assets used by multiple metrics. These data products serve as interfaces with the consumers. Internally, they also encompass code and infrastructure components necessary for transforming data from source to target. Having this classification enables us to differentiate data products used by the business from those used as building blocks, making them discoverable.

Development Lifecycle and Incident Management

Central to our strategy is the structured development lifecycle (PDLC) for data products, alongside a robust incident management framework. These frameworks ensure systematic creation, monitoring, and accountability in addressing any deviations from service level agreements (SLAs). We defined, along with the data consumers:

  • The main standard stages to follow while creating data products.
  • Accountability and ownership at each stage.
  • Artifacts to be delivered at each stage.
  • Mechanisms to define priorities and resolve incidents.

The goal is to create consistency in our ways of working without a heavyweight bureaucratic process, so that we are all on the same page about improving product quality and stakeholder satisfaction.

Signed-off Key Data Products with Defined Owners

Critical to our approach is the identification and assignment of clear ownership for key data products across our organization. This step not only enhances accountability but also streamlines processes for effective management and maintenance.

Ownership is defined at the asset level for each entity, always with a technical owner accountable for the technical artifacts and a business owner responsible for the business expectations.

Example of ownership at the asset level (within a dbt model configuration).
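A minimal sketch of what this can look like in a dbt schema file, using the meta block (the owner keys are our convention, not a dbt standard, and the names are invented):

```yaml
# models/schema.yml - ownership declared on the asset itself (illustrative)
version: 2

models:
  - name: fct_revenue                       # hypothetical model name
    meta:
      business_owner: finance-analytics     # accountable for business expectations
      technical_owner: data-engineering     # accountable for the technical artifacts
```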

Advancing Towards Holistic Data Product Reliability

In collaboration with our data consumers, primarily analysts and leads, we have introduced transformative changes aimed at optimizing our data product reliability:

User-Centric Data Discovery: Shifting focus from technical intermediaries to final stakeholders, we prioritize delivering relevant data products and their associated documentation. This approach ensures that users access precisely what they need without navigating through technical complexities.

“Instead of 10+ ARR intermediate tables/events/dashboards, I want to discover the revenue data product and all its documentation.”

Simplified Notification Framework: Users now receive concise daily notifications tailored to domain-specific channels, outlining the quality status of their data products. These notifications are accompanied by clear runbooks per product, enhancing transparency and ease of understanding.

Example of data product quality notification.
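The notification itself is a formatted message posted to a domain channel; the information it carries is roughly of the following shape (all field names and values here are hypothetical, not our actual payload).

```yaml
# Hypothetical content of a daily data product quality notification
data_product: revenue
status: unhealthy
failed_checks:
  - asset: user_entity
    check: freshness
    detail: not refreshed within the agreed window
runbook: link to the product-specific runbook
```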

Centralized Metadata Accessibility: Leveraging our DataHub catalog, we have centralized metadata access for all data products. This integration eliminates the dependency on Airflow metadata alone for defining contracts, enabling flexible definitions for both building blocks and business metrics.

Collaborative Product Development: By relocating data product and contract definitions closer to the analytics domain, within our dbt repository, we empower analysts to actively contribute to product creation and quality expectations. This collaborative approach ensures alignment with evolving business needs while maintaining robust technical integrity: analysts already familiar with the dbt repo define each product and its expectations in a YAML file based on the DataHub definition:

Data product YAML definition.
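A simplified, hypothetical example of such a definition is shown below; the field names approximate DataHub's data product YAML, and all identifiers and URNs are invented.

```yaml
# Simplified data product definition (illustrative; names and URNs are invented)
id: revenue
display_name: Revenue
description: Company revenue metric owned by the finance analytics team.
domain: urn:li:domain:finance
owners:
  - id: urn:li:corpuser:finance.analytics    # business owner
    type: BUSINESS_OWNER
  - id: urn:li:corpuser:data.engineering     # technical owner
    type: TECHNICAL_OWNER
assets:
  - urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.fct_revenue,PROD)
  - urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.dim_users,PROD)
```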

The data product is then available and discoverable in the catalog UI, including information about the data contract:

Metadata available in the catalog.

All the required entities and the human-readable SLAs are now available to everyone.

Focused Communication on Data Quality: End-users and business partners receive notifications focused solely on data product quality issues, sparing them from technical intricacies. For instance, actionable alerts specify issues such as data staleness in critical metrics, directing users to pertinent data checks within our catalog for immediate resolution.

Example scenario:

  • A data pipeline fails due to incomplete user data.
  • Users receive notifications indicating that the revenue data product is unhealthy because the user entity input is not fresh. They can then access the freshness check on the Snowflake table within the catalog (see the sketch after this list).
  • Technical details are accessible exclusively to the data team for root cause analysis.
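One way such a freshness expectation can be expressed is as a dbt source freshness rule on the underlying Snowflake table; the names and thresholds below are illustrative, not our actual configuration.

```yaml
# models/sources.yml - freshness rule on the user entity source (illustrative)
version: 2

sources:
  - name: analytics              # hypothetical Snowflake schema
    database: analytics_db       # hypothetical database
    tables:
      - name: dim_users          # the "user entity" input of the revenue product
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```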

Conclusion: Driving Impact Through Reliable Data Products

Our journey towards enhancing data product reliability at Miro exemplifies our commitment to leveraging metadata-driven strategies for transformative impact. By addressing technical complexities, enhancing transparency, and promoting collaborative ownership, we deliver great value through our data ecosystem. These initiatives not only build trust in our data but also empower stakeholders to make data-driven decisions with confidence, driving long-term business success in the dynamic data landscape.

Moreover, by granting greater autonomy to data consumers in defining their priorities, we are shifting decision-making power from technical data engineers to those who directly interact with and benefit from the data. This transition necessitates making our processes and approaches more accessible to non-technical users, ensuring that our data ecosystem continues to evolve in alignment with the needs of all stakeholders. As we move forward, our focus will remain on creating a more inclusive and user-friendly environment, enabling everyone to contribute to and benefit from reliable data products.

Interested in joining the Miro Engineering team? Check out our open positions.

______________________________________

Image Credits:

  1. Photo by Kaleidico on Unsplash
