Governing the ML lifecycle at scale, Part 3: Setting up data governance at scale

This post is part of an ongoing series about governing the machine learning (ML) lifecycle at scale. To view this series from the beginning, start with Part 1. This post dives deep into how to set up data governance at scale using Amazon DataZone for the data mesh. The data mesh is a modern approach to data management that decentralizes data ownership and treats data as a product. It enables different business units within an organization to create, share, and govern their own data assets, promoting self-service analytics and reducing the time required to convert data experiments into production-ready applications. The data mesh architecture aims to increase the return on investments in data teams, processes, and technology, ultimately driving business value through innovative analytics and ML projects across the enterprise.

Organizations across industries are increasingly using data and ML to drive innovation, improve decision-making, and gain a competitive advantage. However, as data volumes and complexity grow, effective data governance becomes a critical challenge. Organizations must make sure their data assets are properly managed, secured, and compliant with regulatory requirements, while also enabling seamless access and collaboration among teams and stakeholders.

This post explores the role of Amazon DataZone, a comprehensive data management and governance service, in addressing these challenges at scale. We dive into a real-world use case from the financial services industry, where effective marketing campaigns are crucial for acquiring and retaining customers, as well as cross-selling products. By taking advantage of the data governance capabilities of Amazon DataZone, financial institutions like banks can securely access and use their comprehensive customer datasets to design and implement targeted marketing campaigns tailored to individual customer needs and preferences.

We explore the following key aspects:

  • Traditional challenges in data management and governance across multiple systems and accounts
  • The benefits of Amazon DataZone in simplifying data governance and enabling seamless data sharing
  • A detailed use case on using governed customer data for effective marketing campaigns in the banking and financial services industry
  • The reference architecture for a multi-account ML platform, highlighting the role of Amazon DataZone in the data management and governance layer
  • Step-by-step guidance on setting up Amazon DataZone in a multi-account environment, including account setup, blueprint enablement, user management, and project configuration for data publishers and subscribers

By the end of this post, you will have a comprehensive understanding of how Amazon DataZone can empower organizations to establish centralized data governance, enforce consistent policies, and facilitate secure data sharing across teams and accounts, ultimately unlocking the full potential of your data assets while maintaining compliance and security.

Challenges in data management

Traditionally, managing and governing data across multiple systems involved tedious manual processes, custom scripts, and disconnected tools. This approach was not only time-consuming but also prone to errors and difficult to scale. Organizations often struggled with the following challenges:

  • Discovering data assets scattered across multiple systems and accounts
  • Enforcing consistent data policies and access controls
  • Understanding data lineage and dependencies
  • A lack of centralized data governance, leading to data silos, compliance issues, and inefficient data utilization

Amazon DataZone solves these problems by providing a comprehensive solution for data management and governance:

  • You can automatically discover and catalog data assets across multiple AWS accounts and virtual private clouds (VPCs)
  • You can define and enforce consistent governance policies, track data lineage, and securely share data with fine-grained access controls, all from a single platform
  • Integration with AWS Identity and Access Management (IAM) provides secure access management, making sure only authorized users and applications can access data assets based on their roles and permissions
  • Organizations gain better visibility, control, and governance over their data, which supports informed decision-making and regulatory compliance

Use case

In the competitive banking and financial services industry, effective marketing campaigns are crucial for acquiring new customers, retaining existing ones, and cross-selling products. With the data governance capabilities of Amazon DataZone, banks can securely access and use their own comprehensive customer datasets to design and implement targeted marketing campaigns for financial products, such as certificates of deposit, investment portfolios, and loan offerings. In this post, we discuss how banks can establish a centralized data catalog, enabling data publishers to share customer datasets and marketing teams to subscribe to relevant data using Amazon DataZone.

The following diagram gives a high-level illustration of the use case.

The diagram shows several accounts and personas as part of the overall infrastructure. In the bank marketing use case, the accounts serve the following functions:

  • Management account – This account manages organization-level functions, such as defining the organizational structure, provisioning new accounts, managing identities and access (identity management), implementing security and governance best practices, and orchestrating the creation of the landing zone (a secure and compliant environment for workloads). For example, in the bank marketing use case, the management account would be responsible for setting up the organizational structure for the bank’s data and analytics teams, provisioning separate accounts for data governance, data lakes, and data science teams, and maintaining compliance with relevant financial regulations.
  • Data governance account – This account hosts the central data governance services provided by Amazon DataZone. It serves as the hub for defining and enforcing data governance policies, data cataloging, data lineage tracking, and managing data access controls across the organization. For instance, for our use case, the data governance account would be used to define and enforce policies around customer data privacy, data quality rules for customer datasets, and access controls for sharing customer data with the marketing team.
  • Data lake account (producer) – There can be one or more data lake accounts within the organization. We discuss this in more detail later in this post.
  • Data science team account (consumer) – There can be one or more data science team accounts or data consumer accounts within the organization. We provide additional information later in this post.

By separating these accounts and their responsibilities, the organization can maintain a clear separation of duties, enforce appropriate access controls, and make sure data governance policies are consistently applied across the entire data lifecycle. The data governance account, acting as the central hub, enables seamless data sharing and collaboration between the data producers (data lake accounts) and data consumers (data science team accounts), while meeting data privacy, security, and compliance requirements.
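The account layout described above can be written down as data and checked. The following is a minimal Python sketch, not an AWS API: the `Account` and `AccountRole` types, the `validate_topology` helper, and the account names are all hypothetical, and the rules it checks (one management account, one central governance account, one or more producer and consumer accounts) are our reading of the use case rather than limits enforced by any AWS service.

```python
from dataclasses import dataclass
from enum import Enum


class AccountRole(Enum):
    MANAGEMENT = "management"
    DATA_GOVERNANCE = "data-governance"
    DATA_LAKE_PRODUCER = "data-lake-producer"
    DATA_SCIENCE_CONSUMER = "data-science-consumer"


@dataclass(frozen=True)
class Account:
    name: str  # hypothetical label, not a real AWS account ID
    role: AccountRole


def validate_topology(accounts):
    """Return a list of problems; an empty list means the layout matches the use case."""
    counts = {role: 0 for role in AccountRole}
    for account in accounts:
        counts[account.role] += 1

    problems = []
    # Assumption: exactly one management account and one central governance hub.
    for role in (AccountRole.MANAGEMENT, AccountRole.DATA_GOVERNANCE):
        if counts[role] != 1:
            problems.append(f"expected exactly one {role.value} account, found {counts[role]}")
    # The use case allows one or more producer and consumer accounts.
    for role in (AccountRole.DATA_LAKE_PRODUCER, AccountRole.DATA_SCIENCE_CONSUMER):
        if counts[role] < 1:
            problems.append(f"expected at least one {role.value} account, found 0")
    return problems


# Hypothetical layout for the bank marketing use case.
bank_accounts = [
    Account("org-management", AccountRole.MANAGEMENT),
    Account("central-governance", AccountRole.DATA_GOVERNANCE),
    Account("customer-data-lake", AccountRole.DATA_LAKE_PRODUCER),
    Account("marketing-data-science", AccountRole.DATA_SCIENCE_CONSUMER),
]
problems = validate_topology(bank_accounts)
```

With the layout shown, `problems` is an empty list. Removing the producer or consumer account yields a message naming the missing role.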

Solution overview

The following diagram illustrates the ML platform reference architecture. Its functional capabilities are implemented using a number of AWS services, including AWS Organizations, Amazon SageMaker, AWS DevOps services, and a data lake. For details about the architecture, refer to Part 1 of this series. In this post, we focus on the highlighted Amazon DataZone section.

The data management services function is organized through the data lake accounts (producers) and data science team accounts (consumers).

The data lake accounts are responsible for storing and managing the enterprise’s raw, curated, and aggregated datasets. Data engineers and data publishers work within these accounts to ingest, process, and publish data assets that can be consumed by other teams, such as the marketing team or data science teams. In the bank marketing use case, the data lake accounts would store and manage the bank’s customer data, including raw data from various sources, curated datasets with customer profiles, and aggregated datasets for marketing segmentation.

As producers, data engineers in these accounts are responsible for creating, transforming, and managing data assets that will be cataloged and governed by Amazon DataZone. They make sure data is produced consistently and reliably, adhering to the organization’s data governance rules and standards set up in the data governance account. Data engineers contribute to the data lineage process by providing the necessary information and metadata about the data transformations they perform.

Amazon DataZone maintains the data lineage information itself, using the metadata provided by data engineers to build and update the lineage. This enables traceability and impact analysis of data transformations across the organization.
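To make traceability and impact analysis concrete, the following is a minimal Python sketch. It is not the Amazon DataZone API: the `LineageGraph` class and the dataset names are hypothetical. It only illustrates how transformation metadata reported by data engineers could be traversed to find what a dataset was derived from and what depends on it.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy in-memory lineage graph (hypothetical; not the Amazon DataZone API)."""

    def __init__(self):
        self._downstream = defaultdict(set)  # source -> datasets derived from it
        self._upstream = defaultdict(set)    # dataset -> sources it was derived from

    def record_transformation(self, sources, target):
        """Record that `target` was produced from every dataset in `sources`."""
        for source in sources:
            self._downstream[source].add(target)
            self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal returning every node reachable from `start`."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def impacted_by(self, dataset):
        """Impact analysis: every dataset derived, directly or not, from `dataset`."""
        return self._walk(dataset, self._downstream)

    def traced_from(self, dataset):
        """Traceability: every dataset that `dataset` was derived from."""
        return self._walk(dataset, self._upstream)


# Hypothetical datasets from the bank marketing use case.
lineage = LineageGraph()
lineage.record_transformation(["raw_crm_export", "raw_transactions"], "customer_profiles")
lineage.record_transformation(["customer_profiles"], "marketing_segments")

print(sorted(lineage.impacted_by("raw_crm_export")))
# ['customer_profiles', 'marketing_segments']
print(sorted(lineage.traced_from("marketing_segments")))
# ['customer_profiles', 'raw_crm_export', 'raw_transactions']
```

In this sketch, a schema change to `raw_crm_export` would flag both the curated profiles and the marketing segments for review, which is the kind of question impact analysis is meant to answer.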

The data science team accounts are used by data analysts, data scientists, or marketing teams to access and consume the published data assets from the data lake accounts. Within these accounts, they can perform analyses, build models, or design targeted marketing campaigns by using the governed and curated datasets made available through the data sharing and access control mechanisms of Amazon DataZone. For example, in the bank marketing use case, the data science team accounts would be used by the bank’s marketing teams to access and analyze customer datasets, build predictive models for targeted marketing campaigns, and design personalized financial product offerings based on the shared customer data.
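The publish and subscribe flow between producers and consumers can be sketched in a few lines of Python. This is a simplified model, not the Amazon DataZone API: the `Catalog`, `SubscriptionRequest`, and `Status` names, the project names, and the rule that only the publishing project reviews a request are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class SubscriptionRequest:
    asset: str
    consumer_project: str
    reason: str
    status: Status = Status.PENDING


@dataclass
class Catalog:
    """Toy central catalog (hypothetical; not the Amazon DataZone API)."""
    assets: dict = field(default_factory=dict)    # asset name -> publishing project
    requests: list = field(default_factory=list)  # all subscription requests

    def publish(self, asset, producer_project):
        """A producer project lists an asset in the central catalog."""
        self.assets[asset] = producer_project

    def request_subscription(self, asset, consumer_project, reason):
        """A consumer project asks for access to a published asset."""
        if asset not in self.assets:
            raise KeyError(f"{asset} is not published in the catalog")
        request = SubscriptionRequest(asset, consumer_project, reason)
        self.requests.append(request)
        return request

    def review(self, request, reviewer_project, approve):
        """Assumption: only the project that published the asset decides on access."""
        if self.assets[request.asset] != reviewer_project:
            raise PermissionError("only the publishing project can review this request")
        request.status = Status.APPROVED if approve else Status.REJECTED

    def can_read(self, asset, project):
        """Access is granted only through an approved subscription."""
        return any(
            r.asset == asset and r.consumer_project == project and r.status is Status.APPROVED
            for r in self.requests
        )


# Hypothetical projects from the bank marketing use case.
catalog = Catalog()
catalog.publish("customer_profiles", "data-lake-publishers")
request = catalog.request_subscription(
    "customer_profiles", "marketing-analytics", "Targeting for a certificate of deposit campaign"
)
before = catalog.can_read("customer_profiles", "marketing-analytics")  # False: still pending
catalog.review(request, "data-lake-publishers", approve=True)
after = catalog.can_read("customer_profiles", "marketing-analytics")   # True: approved
```

The point of the model is that consumers never gain access by default: every read path goes through a request that the data owner has explicitly approved.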

Using Amazon DataZone in a multi-account ML platform

You can find practical, step-by-step instructions for implementing this setup in module 2 of the AWS Multi-Account Data & ML Governance Workshop. The workshop provides detailed guidance on setting up Amazon DataZone in the central governance account.

Conclusion

Effective governance is crucial for organizations to unlock their data’s potential while maintaining compliance and security. Amazon DataZone provides a comprehensive solution for data management and governance at scale, automating complex tasks like data cataloging, policy enforcement, lineage tracking, and secure data sharing.

As demonstrated in the financial services use case, Amazon DataZone empowers organizations to establish a centralized data catalog, enforce consistent governance policies, and facilitate secure data sharing between data producers and consumers. Financial institutions can use Amazon DataZone to gain a competitive edge by designing and implementing effective, tailored marketing campaigns while adhering to data privacy and compliance regulations.

The multi-account ML platform architecture, combined with Amazon DataZone and other AWS services, provides a scalable and secure foundation for governing data and ML workflows effectively. By following the steps in the workshop, you can streamline the setup and management of Amazon DataZone, enabling seamless collaboration between stakeholders involved in the data and ML lifecycle.

As data generation and utilization continue to grow, robust data governance solutions become paramount. Amazon DataZone offers a powerful approach to data management and governance, empowering organizations to unlock their data’s true value while maintaining the highest standards of security, compliance, and data privacy.


About the Authors

Ajit Mungale is a Senior Solutions Architect at Amazon Web Services specializing in AI/ML/Generative AI, IoT, and .NET technologies. At AWS, he helps customers build, migrate, and create new cost-effective cloud solutions. He has extensive experience in developing distributed applications and has worked with multiple cloud platforms. With his deep technical knowledge and business understanding, Ajit guides organizations in leveraging the full capabilities of the cloud.

Ram Vittal is a Principal Generative AI Solutions Architect at AWS. He has over three decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his sheep-a-doodle!
