Research trends in privacy, security and cryptography

Trust is essential for people and organizations to use technology with confidence. At Microsoft, we strive to earn the trust of our customers, employees, communities, and partners by committing to privacy, security, the responsible use of AI, and transparency.

At Microsoft Research, we take on this challenge by creating and using state-of-the-art tools and technologies that support a proactive, integrated approach to security across all layers of the digital estate.

Threats to cybersecurity are constant and they continue to grow, impacting organizations and individuals everywhere. Attack tools are readily available and well-funded adversaries now have the capability to cause unprecedented harm. These threats help explain why U.S. President Joe Biden issued an executive order in 2021 calling for cybersecurity improvements. Similarly, the European Union recently called for stronger protection of its information and communication technology (ICT) supply chains.

Against that backdrop, Microsoft Research is focused on what comes next in security and privacy. New and emerging computing frontiers, like the metaverse and web3, will require consistent advances in identity, transparency and other security principles, in order to learn from the past and unlock these technologies’ potential. Developments in quantum computing and advances in machine learning and artificial intelligence offer great potential to advance science and the human condition. Our research aims to ensure that future breakthroughs come with robust safety and privacy protections, even as they accelerate profound changes and new business opportunities.

At Microsoft Research, we pursue ambitious projects to improve the privacy and security of everyone on the planet. This is the first blog post in a series exploring the work we do in privacy, security and cryptography. In future installments, we will dive deeper into the research challenges we are addressing, and the opportunities we see.

Digital identities

While the internet was not originally built with an identity layer, digital identities have grown to become foundational elements of today’s web, impacting people’s lives even beyond the digital world. Our research is aimed at modernizing digital identities and building more robust, usable, private and secure user-centric identity systems, putting each of us in control of our own digital identities.

This work includes researching cryptographic algorithms that enable privacy-preserving open-source user-centric identity systems. Such systems would let people present cryptographically signed electronic claims and selectively choose which information they wish to disclose, while preventing tracking of people between presentations of the claim. Our approach would preserve an individual’s privacy and work with existing web protocols to provide easy and safe access to a wide range of resources and activities.

Our research also includes investigating innovative ways for people to manage their identity secrets reliably and safely without having to provide any centralized party with full access to them. Success in this area will also require scalable and verifiable methods to distribute identity public keys, so people can know who exactly they are interacting with.

Media provenance and authenticity 

Advances in graphics and machine learning algorithms have enabled the creation of easy-to-use tools for editing digital media. While useful in many ways, this technology has also enabled fraud and manipulation of digital images and media – or deepfakes. Early fakes were easy to spot, but current versions are becoming nearly impossible for machines or people to detect. The potential proliferation of fakes that are indistinguishable from reality undermines society’s trust in everything we see and hear.

Rather than trying to detect fakes, Microsoft Research has developed technology to determine the source of any digital media and whether it has been altered. We do this by adding digitally signed manifests to video, audio or images. The source of these media objects might be well-known news organizations, governments or even individuals using apps on mobile devices. 
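
To make the signed-manifest idea concrete, here is a minimal sketch of content-bound signing using the Python cryptography package and Ed25519. The manifest fields, function names, and flow are illustrative assumptions for this post, not the actual provenance standard or Microsoft’s implementation.

    # Illustrative only: a toy provenance manifest signed with Ed25519.
    # Field names and flow are hypothetical, not the published standard.
    import hashlib
    import json
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def sign_manifest(media_bytes, creator, key):
        manifest = {
            "creator": creator,
            "sha256": hashlib.sha256(media_bytes).hexdigest(),  # binds the manifest to the content
        }
        payload = json.dumps(manifest, sort_keys=True).encode()
        return {"manifest": manifest, "signature": key.sign(payload).hex()}

    def verify_manifest(media_bytes, signed, public_key):
        manifest = signed["manifest"]
        if manifest["sha256"] != hashlib.sha256(media_bytes).hexdigest():
            return False  # the media bytes were altered after signing
        payload = json.dumps(manifest, sort_keys=True).encode()
        try:
            public_key.verify(bytes.fromhex(signed["signature"]), payload)
            return True
        except Exception:
            return False

    key = Ed25519PrivateKey.generate()
    media = b"example image bytes"
    signed = sign_manifest(media, "Example News Org", key)
    print(verify_manifest(media, signed, key.public_key()))                # True
    print(verify_manifest(media + b"tampered", signed, key.public_key()))  # False

A verifier that trusts the signer’s public key can then confirm both the source of the media and that the bytes have not changed since signing.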

Since media creation, distribution, and consumption are complex and involve many industries, Microsoft has helped form a standards organization to stipulate how these signatures are added to media objects. We are also working with news organizations such as the BBC, New York Times, and CBC to promote media provenance as a mitigation for misinformation on social media networks. 

Hardware security foundations 

To promote cyber-resilience, we are developing systems that can detect a cyberattack and safely shut down, protecting data and blocking the attacker. These systems are designed to be repaired quickly and securely if compromised. They are built with simple hardware features that provide very high levels of protection for repair and recovery modules. To enable reliable detection of compromised systems, we are also developing storage features that can be used to protect security event logs, making it harder for attackers to cover their tracks.
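
As a simple illustration of tamper-evident logging, the sketch below hash-chains security events so that each entry commits to everything before it. This is a generic example of the technique, assuming a SHA-256 chain, not the specific storage feature described above.

    # A generic hash-chained event log: each digest covers the previous digest,
    # so rewriting history invalidates every later entry. Illustrative only.
    import hashlib
    import json

    def append_event(log, event):
        prev = log[-1]["digest"] if log else "0" * 64
        payload = prev + json.dumps(event, sort_keys=True)
        log.append({"event": event, "prev": prev,
                    "digest": hashlib.sha256(payload.encode()).hexdigest()})

    def verify_log(log):
        prev = "0" * 64
        for entry in log:
            payload = prev + json.dumps(entry["event"], sort_keys=True)
            if entry["prev"] != prev or entry["digest"] != hashlib.sha256(payload.encode()).hexdigest():
                return False
            prev = entry["digest"]
        return True

    log = []
    append_event(log, {"type": "login_failure", "user": "admin"})
    append_event(log, {"type": "service_stopped", "name": "endpoint_protection"})
    print(verify_log(log))              # True
    log[0]["event"]["user"] = "guest"   # an attacker tries to cover their tracks
    print(verify_log(log))              # False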

Security analytics 

Modern-day computers and networks are under constant attack by hackers of all kinds. In this seemingly never-ending cat-and-mouse contest, securing and defending today’s global systems is a multi-billion-dollar enterprise. Managing the massive quantities of security data collected is increasingly challenging, which creates an urgent need for disruptive innovation in security analytics. 

We are investigating a transformer-based approach to modeling and analyzing large-scale security data. Applying and tuning such models is a novel field of study that could change the game for security analytics.

Privacy-preserving machine learning

A privacy-preserving AI system should generalize so well that its behavior reveals no personal or sensitive details that may have been contained in the original data on which it was trained.

How close can we get to this ideal? Differential privacy can enable analysts to extract useful insights from datasets containing personal information while strengthening privacy protections. The method introduces “statistical noise” that is significant enough to prevent an AI model from compromising the privacy of any individual, yet small enough that the model still yields accurate, useful findings. Our recent results show that large language models can be particularly effective differentially private learners.
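
As a minimal illustration of the mechanism, the sketch below adds Laplace noise to a simple counting query. The epsilon value and the data are arbitrary choices for this example; systems such as differentially private model training instead add carefully calibrated noise to gradients rather than to a single count.

    # Laplace mechanism for a counting query with sensitivity 1. Illustrative only;
    # epsilon and the data below are arbitrary choices for this sketch.
    import numpy as np

    def dp_count(values, predicate, epsilon):
        true_count = sum(1 for v in values if predicate(v))
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)  # scale = sensitivity / epsilon
        return true_count + noise

    ages = [34, 29, 41, 52, 38, 27, 45]
    print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))  # noisy count of people aged 40 and over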

Another approach, federated learning, enables large models to be trained and fine-tuned on customers’ own devices to protect the privacy of their data, and to respect data boundaries and data-handling policies. At Microsoft Research, we are creating an orchestration infrastructure for developers to deploy cross-platform, cross-device federated learning solutions.
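
The core aggregation step behind this approach can be sketched as federated averaging, in which a server combines locally trained weights in proportion to each client’s data size. The toy snippet below illustrates only that aggregation step and is not Microsoft’s orchestration infrastructure.

    # Toy federated averaging: weight each client's update by its share of the data.
    # Illustrative only; real deployments typically add secure aggregation, privacy
    # protections, and device/scheduling logic on top of this step.
    import numpy as np

    def federated_average(client_weights, client_sizes):
        total = sum(client_sizes)
        return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

    updates = [np.array([0.10, 0.20]), np.array([0.30, 0.00]), np.array([0.20, 0.40])]
    sizes = [100, 300, 600]
    print(federated_average(updates, sizes))  # the new global weights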

Protecting data in training or fine-tuning is just one piece of the puzzle. Whenever AI is used in a personalized context, it may unintentionally leak information about the target of the personalization. Therefore, we must be able to describe the threat model for a complete deployment of a system with AI components, rather than just a single part of it.

Read more about our work on these and other related topics in an earlier blog post.

Confidential computing

Confidential computing has emerged as a practical solution to securing compute workloads in cloud environments, even from malicious cloud administrators. Azure already offers confidential computing environments in multiple regions, leveraging Trusted Execution Environments (TEEs) available in multiple hardware platforms.

Imagine if all computation were taking place in TEEs, where services would be able to access sensitive data only after they had been attested to perform specific tasks. This is not practical today and much research remains to be done. For example, there are no formal standards to even describe what a TEE is, what kind of programming interface a TEE cloud should have, or how different TEEs should interact.

Additionally, it is important to continuously improve the security guarantees of TEEs. For instance, understanding which side-channel attacks are truly realistic and developing countermeasures remains a major topic for research. Furthermore, we need to continue researching designs for confidential databases, confidential ledgers and confidential storage. Finally, even if we build both confidential computing and storage environments, how can we establish trust in the code that we want to run? As a cloud provider, our customers expect us to work continuously on improving the security of our infrastructure and the services that run on it.

Secure-by-design cloud

In the future, we can imagine Azure customers compiling their software for special hardware with memory tagging capabilities, eliminating problems like buffer overflows for good. To detect compromise, VM memory snapshots could be inspected and studied with AI-powered tools. In the worst case, system security could always be bootstrapped from a minimal hardware root of trust. At Microsoft Research, we are taking a step further and asking how we can build the cloud from the ground up, with security in mind.

New cryptography

The advance of quantum computing presents many exciting potential opportunities. As a leader in both quantum computing development and cryptographic research, Microsoft has a responsibility to ensure that the groundbreaking innovations on the horizon don’t compromise classical (non-quantum) computing systems and information. Working across Microsoft, we are learning more about the weaknesses of classical cryptography and how to build new cryptographic systems strong enough to resist future attacks.

Our active participation in the National Institute of Standards and Technology (NIST) Post-Quantum Cryptography project has allowed Microsoft Research to examine deeply how the change to quantum-resistant algorithms will impact Microsoft services and Microsoft customers. With over seven years of work in this area, Microsoft Research’s leadership in post-quantum cryptography will help customers prepare for the upcoming change of cryptographic algorithms.

We’ve joined with the University of Waterloo and others to build a platform for experimenting with the newly proposed cryptographic systems and applying them to real-world protocols and scenarios. We’ve implemented real-world tests of post-quantum cryptography to learn how these new systems will work at scale and how we can deploy them quickly to protect network tunnels. Our specialized hardware implementations and cryptanalysis provide feedback on the new cryptosystems, which improves their performance, making post-quantum cryptosystems smaller and stronger.

ElectionGuard

ElectionGuard is an open-source software development kit (SDK) that makes voting more secure, transparent and accessible.

Advances in cryptography are enabling end-to-end verifiable elections and risk-limiting audits for elections. Our open-source ElectionGuard project uses cryptography to confirm all votes have been correctly counted. Individual voters can see that their vote has been accurately recorded and anyone can check that all votes have been correctly tallied—yet individual ballots are kept secret. Risk-limiting audits use advanced statistical methods that can determine when an election audit has hit a pre-determined level of confidence with greater efficiency than traditional audits.

The cryptography tools that enable verifiable voting are Shamir Secret Sharing, Threshold Encryption, and additive Homomorphic Encryption. The math is interesting, and we will explore that in future blog posts, but there’s much more than math to ElectionGuard.
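
As a small taste of that math, here is a sketch of Shamir secret sharing over a prime field, the kind of threshold technique that lets a group of election guardians jointly hold a decryption key so that no single party has full access. The field size and threshold below are arbitrary choices for the sketch; this is not ElectionGuard’s implementation.

    # Minimal Shamir secret sharing over a prime field (illustrative parameters).
    # Any k of the n shares reconstruct the secret; fewer than k reveal nothing.
    import random

    PRIME = 2**127 - 1  # a Mersenne prime, chosen arbitrarily for this sketch

    def split_secret(secret, n, k):
        coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
        evaluate = lambda x: sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
        return [(x, evaluate(x)) for x in range(1, n + 1)]

    def reconstruct(shares):
        # Lagrange interpolation at x = 0 (requires Python 3.8+ for pow(x, -1, p))
        secret = 0
        for j, (xj, yj) in enumerate(shares):
            num, den = 1, 1
            for m, (xm, _) in enumerate(shares):
                if m != j:
                    num = num * (-xm) % PRIME
                    den = den * (xj - xm) % PRIME
            secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
        return secret

    shares = split_secret(secret=123456789, n=5, k=3)
    print(reconstruct(shares[:3]))  # 123456789; any three of the five shares suffice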

Securing the future

Through our work, we aim to continue to earn customer trust, striving to ensure that Microsoft’s products and services and our customers’ information will remain safe and secure for years to come.

Forthcoming entries in this blog series will include more details on the areas covered in this post and more. Much of our work is open-source and published, so we will be highlighting our GitHub projects and other ways you can interact directly with our work.

Have a question or topic that you would like to see us address in a future post? Please contact us!

The post Research trends in privacy, security and cryptography appeared first on Microsoft Research.


MoMA Installation Marks Breakthrough for AI Art

AI-generated art has arrived.

With a presentation making its debut this week at The Museum of Modern Art in New York City — perhaps the world’s premier institution devoted to modern and contemporary art — the AI technologies that have upended trillion-dollar industries worldwide over the past decade will get a formal introduction.

Created by pioneering artist Refik Anadol, the installation in the museum’s soaring Gund Lobby uses a sophisticated machine-learning model to interpret the publicly available visual and informational data of MoMA’s collection.

“Right now, we are in a renaissance,” Anadol said of the presentation “Refik Anadol: Unsupervised.” “Having AI in the medium is completely and profoundly changing the profession.”

Anadol is a digital media pioneer. Throughout his career, he’s been intrigued by the intersection between art and AI. His first encounter with AI as an artistic tool was at Google, where he used deep learning — and an NVIDIA GeForce GTX 1080 Ti — to create dynamic digital artworks.

In 2017, he started working with one of the first generative AI tools, StyleGAN, created at NVIDIA Research, which was able to generate synthetic images of faces that are incredibly realistic.

Anadol was more intrigued by the ability to use the tool to explore more abstract images, training StyleGAN not on images of faces, but of modern art, and guiding the AI’s synthesis using data streaming in from optical, temperature and acoustic sensors.

Digging Deep With MoMA

Those ideas led him to an online collaboration with The Museum of Modern Art in 2021, which was exhibited by Feral File, using more than 138,000 records from the museum’s publicly available archive. The Feral File exhibit caused an online sensation, reimagining art in real time and inspiring the wave of AI-generated art that’s spread quickly through social media communities on Instagram, Twitter, Discord and Reddit this year.

This year, he returned to MoMA to dig even deeper, collaborating again with MoMA curators Michelle Kuo and Paola Antonelli on a new major installation. On view from Nov. 19 through March 5, 2023, “Refik Anadol: Unsupervised” will use AI to interpret and transform more than 200 years of art from MoMA’s collection.

It’s an exploration not just of the world’s foremost collection of modern art — pretty much every pioneering sculptor, painter and even game designer of the past two centuries — but also a look inside the mind of AI, “allowing us to see results of the algorithm processing data from MoMA’s collection, as well as ambient sound, temperature and light, and ‘dreaming,’” Anadol said.

Powering the system is a full suite of NVIDIA technologies. He relies on an NVIDIA DGX server equipped with NVIDIA A100 Tensor Core GPUs to train the model in real time. Another machine equipped with an NVIDIA RTX 4090 GPU translates the model into computer graphics, driving the exhibit’s display.

‘Bending Data’

“Refik is bending data — which we normally associate with rational systems — into a realm of surrealism and irrationality,” Michelle Kuo, the exhibit’s curator at the museum, told the New York Times. “His interpretation of MoMA’s dataset is essentially a transformation of the history of modern art.”

The installation comes amid a wave of excitement around generative AI, a technology that’s been put at the fingertips of amateur and professional artists alike with new tools such as Midjourney, OpenAI’s Dall·E, and DreamStudio.

And while Anadol’s work intersects with the surge of interest in NFT art that had the world buzzing in 2021, it, like AI-generated art itself, goes far beyond that trend.

Inspired by Cutting-Edge Research

Anadol’s work digs deep into MoMA’s archives and cutting-edge AI, relying on a technology developed at NVIDIA Research called StyleGAN. David Luebke, vice president of graphics research at NVIDIA, said he first got excited about generative AI’s artistic and creative possibilities when he saw NVIDIA researcher Janne Hellsten’s demo of StyleGAN2 trained on stylized artistic portraits.

“Suddenly, one could fluidly explore the content and style of a generated image or have it react to ambient effects like sound or even weather,” Luebke said.

NVIDIA Research has been pushing forward the state of the art in generative AI since at least 2017 when NVIDIA developed “Progressive GANs,” which used AI to synthesize highly realistic, high-resolution images of human faces for the first time. This was followed by StyleGAN, which achieved even higher quality results.

Each year after that, NVIDIA released a paper that advanced the state of the art. StyleGAN has proved to be a versatile platform, Luebke explained, enabling countless other researchers and artists like Anadol to bring their ideas to life.

Democratizing Content Creation

Much more is coming. Modern generative AI models have shown the capability to generalize beyond particular subjects, such as images of human faces or cats or cars, and encompass language models that let users specify the image they want in natural language, or other intuitive ways, such as inpainting, Luebke explains.

“This is exciting because it democratizes content creation,” Luebke said. “Ultimately, generative AI has the potential to unlock the creativity of everybody from professional artists, like Refik, to hobbyists and casual artists, to school kids.”

Anadol’s work at MoMA offers a taste of what’s possible. “Refik Anadol: Unsupervised,” the artist’s first U.S. solo museum presentation, features three new digital artworks by the Los Angeles-based artist that use AI to dynamically explore MoMA’s collection on a vast 24-by-24-foot digital display. It’s as much a work of architecture as it is one of art.

“Often, AI is used to classify, process and generate realistic representations of the world,” the exhibition’s organizer, Michelle Kuo, told Archinect, a leading publication covering contemporary art and architecture. “Anadol’s work, by contrast, is visionary: it explores dreams, hallucination and irrationality, posing an alternate understanding of modern art — and of artmaking itself.”

“Refik Anadol: Unsupervised” also hints at how AI will transform our future, and Anadol thinks it will be for the better. “This will just enhance our imagination,” Anadol said. “I’m seeing this as an extension of our minds.”

For more, see our exploration of Refik Anadol’s work in NVIDIA’s AI Art Gallery.

The post MoMA Installation Marks Breakthrough for AI Art appeared first on NVIDIA Blog.


Research Focus: Week of November 17, 2022


Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Microsoft Research at NeurIPS 2022

Microsoft is a proud platinum sponsor of the 36th annual conference on Neural Information Processing Systems, running from November 28 to December 9. More than 150 of our researchers are involved in presentations, poster sessions, accepted papers, and workshops at this prestigious conference.

We are thrilled to have had more than 100 papers accepted at NeurIPS 2022. Our full roster includes cutting-edge research ranging from differential privacy and reinforcement learning to extreme compression, motion capture and large language models.

This hybrid conference includes an in-person component at the New Orleans Convention Center during the first week, and a virtual component the second week. Check out our list of sessions and plan your schedule. 

If you will be attending in person, we hope you will stop by our booth (#202) to chat with our experts, see demos of our latest research and find out about career opportunities with Microsoft.


Project Eclipse shows neighborhood level air pollution inequities in real time and during short-term air quality events

Precious Esie, Madeleine I. G. Daepp, Asta Roseway, and Scott Counts

Air pollution kills over 7 million people globally every year. In U.S. cities, pollution sources are more likely to affect communities of color – but the most impacted communities rarely have the data they need to understand and address this invisible hazard.

Through Project Eclipse, a team from Microsoft Research has worked with the Chicago Department of Public Health and JCDecaux – an advertising agency and the world’s largest provider of outdoor street furniture – to deploy low-cost air quality sensors across the city’s bus shelters. In a new paper published this week in the American Journal of Public Health, the team showed how the citywide network of sensors enabled them to capture and visualize differences in exposures over time and space. The work was led by Precious Esie, a PhD student in epidemiology at Columbia’s Mailman School of Public Health, during her summer internship at Microsoft Research.

Over the month of July 2021, the research team found, pollution disproportionately affected Hispanic and Latinx residents on the West side of the city. But 4th of July spikes disproportionately affected the South side—including places where asthma rates are highest. In late July, wildfire smoke increased exposures across the city as a whole. This work shows how next-generation environmental sensing can help public health agencies target interventions when and where they are most needed.


Microsoft Research contributes to industry supply chain standards

Kiran Karunakaran, Principal Product Manager, Azure Security Engineering

Supply Chain Integrity, Transparency and Trust (SCITT) is an industry-standards initiative for managing the compliance of goods and services across end-to-end supply chains, allowing organizations to build, deploy and consume interoperable and automated solutions to manage end-to-end supply chain compliance. SCITT was initiated in response to the United States Executive Order on Improving the Nation’s Cybersecurity (EO 14028) and was formally adopted into the Internet Engineering Task Force (IETF) in March 2022. Over the last eight months, SCITT has been one of the fastest growing initiatives within the IETF and became a formal working group in October 2022.

Microsoft Research is actively contributing to the IETF SCITT Architecture and SCITT Receipt Format Internet Drafts and will be providing and collaborating on several SCITT-related open-source software offerings. This includes the donation of a SCITT Emulator that allows SCITT implementers to experiment with SCITT APIs and message formats. Microsoft is also open-sourcing our SCITT implementation prototype that runs on the Confidential Consortium Framework, providing visibility into one of the many possible implementations of SCITT.

A SCITT IETF Working Group session was held at IETF115 on Nov 10th. To learn more about the community efforts, blogs, and implementations around SCITT, please visit SCITT.io.


2010 paper lands “Test of Time” award for Microsoft researcher Mike Chieh-Jan Liang and co-authors

Mike Chieh-Jan Liang, a principal researcher in the Systems and Networking Research Group (Asia), is part of a team that won a “Test of Time” award for the 2010 paper: Design and Evaluation of a Versatile and Efficient Receiver-Initiated Link Layer for Low-Power Wireless.

The research introduced A-MAC, a receiver-initiated link layer for low-power wireless networks that supported several services under a unified architecture, more efficiently and scalably than prior approaches.

Co-authors on the paper include Prabal Dutta (University of Michigan), Stephen Dawson-Haggerty (University of California, Berkeley), Yin Chen (Johns Hopkins University), and Andreas Terzis (Johns Hopkins University). 

The Test of Time award is given by ACM SenSys in recognition of papers that are at least 10 years old and have had long lasting impact on networked embedded sensing system science and engineering. 

The post Research Focus: Week of November 17, 2022 appeared first on Microsoft Research.


Large-scale feature engineering with sensitive data protection using AWS Glue interactive sessions and Amazon SageMaker Studio

Organizations are using machine learning (ML) and AI services to enhance customer experience, reduce operational cost, and unlock new possibilities to improve business outcomes. Data underpins ML and AI use cases and is a strategic asset to an organization. As data is growing at an exponential rate, organizations are looking to set up an integrated, cost-effective, and performant data platform in order to preprocess data, perform feature engineering, and build, train, and operationalize ML models at scale. To achieve that, AWS offers a unified modern data platform that is powered by Amazon Simple Storage Service (Amazon S3) as the data lake with purpose-built tools and processing engines to support analytics and ML workloads. For a unified ML experience, you can use Amazon SageMaker Studio, which offers native integration with AWS Glue interactive sessions to perform feature engineering at scale with sensitive data protection. In this post, we demonstrate how to implement this solution.

Amazon SageMaker is a fully managed ML service that enables you to build, train, and deploy models at scale for a wide range of use cases. For model training, you can use any of the built-in algorithms within SageMaker to get started on training and deploying ML models quickly.

A key component of the model building and development process is feature engineering. AWS Glue is one of the recommended options to achieve feature engineering at scale. AWS Glue enables you to run data integration and transformation in a distributed fashion on a serverless Apache Spark infrastructure, and makes it easy to use the popular Spark ML library for feature engineering and model development. In addition, you can use AWS Glue for incremental data processing through job bookmarks, ingest data from over 100 sources using connectors, and run spiky or unpredictable workloads using auto scaling.

Another important requirement for ML-based applications is data security and access control. Organizations commonly need tighter control over who can access the most sensitive data during feature engineering and model building, following the principle of least privilege. To achieve this, you can utilize the AWS Glue integration with AWS Lake Formation for increased governance and management of data lake assets. With Lake Formation, you can configure fine-grained data access control and security policies on top of your Amazon S3 data lake. The policies are defined in a central location, allowing multiple analytics and ML services, such as AWS Glue, Amazon Athena, and SageMaker, to interact with data stored in Amazon S3.

AWS Glue includes a personally identifiable information (PII) detection transform that provides the ability to detect, mask, or remove entities as required, for increased compliance and governance. With the PII transform, you can detect PII data in datasets and automatically apply fine-grained access control using Lake Formation to restrict sensitive data for different user groups.

Use case

We focus on a propensity model use case that includes a customer marketing dataset and involves two user personas: a data engineer and data scientist. The dataset contains per-customer information, including lead source, contact notes, job role, some flags, page views per visit, and more. The dataset also includes sensitive information like personal phone numbers.

The data engineer is responsible for building the end-to-end data processing pipeline, including data preparation, preprocessing, and access control. The data scientist is responsible for feature engineering, and training and deploying the ML model. Note that the data scientist is not allowed to access any PII sensitive data for feature engineering or training the ML model.

As part of this use case, the data engineer builds a data pipeline to preprocess the dataset, scans the dataset for any PII information, and restricts the access of the PII column to the data scientist user. As a result, when a data scientist uses the dataset to perform feature engineering and build ML models, they don’t have access to the PII sensitive column (phone numbers, in this case). The feature engineering process involves converting columns of type string to a format that is optimal for ML models. As an advanced use case, you can extend this access pattern to implement row-level and cell-level security using Lake Formation.
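
As a rough sketch of what that extension could look like, the snippet below defines a Lake Formation data cells filter that restricts rows as well as columns. The filter name and row filter expression are hypothetical examples and are not part of the stack deployed in this post.

    # Hypothetical example: a data cells filter that restricts rows and columns together.
    # The filter name and expression are illustrative, not used elsewhere in this post.
    import boto3

    lakeformation = boto3.client('lakeformation')
    account_id = boto3.client('sts').get_caller_identity()['Account']

    lakeformation.create_data_cells_filter(
        TableData={
            'TableCatalogId': account_id,
            'DatabaseName': 'demo',
            'TableName': 'web_marketing',
            'Name': 'pii_rows_east_region',                            # hypothetical filter name
            'RowFilter': {'FilterExpression': "region = 'us-east'"},   # hypothetical row predicate
            'ColumnWildcard': {'ExcludedColumnNames': ['phone_number']}
        }
    )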

Solution overview

The solution contains the following high-level steps:

  1. Set up resources with AWS CloudFormation.
  2. Preprocess the dataset, including PII detection and fine-grained access control, on an AWS Glue interactive session.
  3. Perform feature engineering on an AWS Glue interactive session.
  4. Train and deploy an ML model using the SageMaker built-in XGBoost algorithm.
  5. Evaluate the ML model.

The following diagram illustrates the solution architecture.

Architecture diagram

Prerequisites

To complete this tutorial, you must have the following prerequisites:

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs. If you prefer setting up resources on the AWS Management Console and the AWS CLI rather than AWS CloudFormation, see the instructions in the appendix at the end of this post.

The CloudFormation template generates the following resources:

  • S3 buckets with a sample dataset
  • An AWS Lambda function to load the dataset
  • AWS Identity and Access Management (IAM) group, users, roles, and policies
  • Lake Formation data lake settings and permissions
  • SageMaker user profiles

To create your resources, complete the following steps:

  1. Sign in to the console.
  2. Choose Launch Stack.
  3. Choose Next.
  4. For DataEngineerPwd and DataScientistPwd, enter your own password for the data engineer and data scientist users.
  5. For GlueDatabaseName, enter demo.
  6. For GlueTableName, enter web_marketing.
  7. For S3BucketNameForInput, enter blog-studio-pii-dataset-<your-aws-account-id>.
  8. For S3BucketNameForOutput, enter blog-studio-output-<your-aws-account-id>.
  9. For SageMakerDomainId, enter your SageMaker domain ID that you prepared in the prerequisite steps.
  10. Choose Next.
  11. On the next page, choose Next.
  12. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  13. Choose Create.

Stack creation can take up to 10 minutes. The stack creates IAM roles and SageMaker user profiles for two personas: data engineer and data scientist. It also creates a database demo and table web_marketing with a sample dataset.

At the time of stack creation, the data engineer persona has complete access to the table, but the data scientist persona doesn’t have any access to the table yet.

Preprocess the dataset

Let’s start preprocessing data on an AWS Glue interactive session. The data engineer persona wants to verify whether the data contains sensitive information, and grant minimal access permission to the data scientist persona. You can download the notebook from this location.

  1. Sign in to the console using the data-engineer user.
  2. On the SageMaker console, choose Users.
  3. Select the data-engineer user and choose Open Studio.
  4. Create a new notebook and choose SparkAnalytics 1.0 for Image and Glue PySpark for Kernel.
  5. Start an interactive session with the following magic to install the newer version of Boto3 (this is required for using the create_data_cells_filter method):
    %additional_python_modules boto3==1.24.82

  6. Initialize the session:
    import boto3
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)

  7. Create an AWS Glue DynamicFrame from the newly created table, and resolve choice types based on catalog schema, because we want to use the schema defined in the catalog instead of the automatically inferred schema based on data:
    dyf_marketing = glueContext.create_dynamic_frame.from_catalog(
    database="demo",
    table_name="web_marketing"
    )
    
    dyf_marketing_resolved = dyf_marketing.resolveChoice(
    choice="match_catalog",
    database="demo",
    table_name="web_marketing"
    )
    
    dyf_marketing_resolved.printSchema()

  8. Validate in the table whether there is any PII data using AWS Glue PII detection:
    from awsglueml.transforms import EntityDetector
    
    entities_filter = [
    "EMAIL",
    "CREDIT_CARD",
    "IP_ADDRESS",
    "MAC_ADDRESS",
    "PHONE_NUMBER"
    ]
    entity_detector = EntityDetector()
    classified_map = entity_detector.classify_columns(dyf_marketing_resolved, entities_filter, 1.0, 0.1)
    print(classified_map)

  9. Verify whether the columns classified as PII contain sensitive data or not (if not, update classified_map to drop the non-sensitive columns):
    from pyspark.sql.functions import col
    dyf_marketing_resolved.toDF().select(*[col(c) for c in classified_map.keys()]).show()

  10. Set up Lake Formation permissions using a data cell filter for automatically detected columns, and restrict the columns to the data scientist persona:
    lakeformation = boto3.client('lakeformation')
    sts = boto3.client('sts')
    
    account_id = sts.get_caller_identity().get('Account')
    
    # Create a data cell filter for excluding phone_number column
    lakeformation.create_data_cells_filter(
    TableData={
    'TableCatalogId': account_id,
    'DatabaseName': 'demo',
    'TableName': 'web_marketing',
    'Name': 'pii',
    'RowFilter': {
    'AllRowsWildcard': {}
    
    },
    'ColumnWildcard': {
    'ExcludedColumnNames': list(classified_map.keys())
    }
    }
    )
    
    # Grant permission on the data cell filter
    lakeformation.grant_permissions(
    Principal={
    'DataLakePrincipalIdentifier': f'arn:aws:iam::{account_id}:role/SageMakerStudioExecutionRole_data-scientist'
    },
    Resource={
    'DataCellsFilter': {
    'TableCatalogId': account_id,
    'DatabaseName': 'demo',
    'TableName': 'web_marketing',
    'Name': 'pii'
    }
    },
    Permissions=[
    'SELECT'
    ]
    )

  11. Log in to Studio as data-scientist to see that the PII columns are not visible. You can download the notebook from this location.
  12. Create a new notebook and choose SparkAnalytics 1.0 for Image and Glue PySpark for Kernel:
    import boto3
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    
    dyf_marketing = glueContext.create_dynamic_frame.from_catalog(
    database="demo",
    table_name="web_marketing"
    )
    
    dyf_marketing.printSchema()

Perform feature engineering

We use the Apache Spark ML library to perform feature engineering as the data-scientist user and then write back the output to Amazon S3.

  1. In the following cell, we apply features from the Apache Spark ML library:
    • StringIndexer maps a string column of labels to a column of label indexes.
    • OneHotEncoder maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value that indicates the presence of a specific categorical feature. This transform is used for ML algorithms that expect continuous features.
    • VectorAssembler is a transformer that combines a given list of columns into a single vector column, which is then used in training ML models for algorithms such as logistic regression and decision trees.
    #feature engineering by using string indexer and one hot encoding from spark ML library
    from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml import Pipeline
    
    cols = ['lastcampaignactivity','region','viewedadvertisement','usedpromo','jobrole']
    
    int_cols = ['pageviewspervisit','totaltimeonwebsite','totalwebvisits',]
    
    indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in cols
    ]
    
    encoders = [
    OneHotEncoder(
    inputCol=indexer.getOutputCol(),
    outputCol="{0}_encoded".format(indexer.getOutputCol()))
    for indexer in indexers
    ]
    
    assembler = VectorAssembler(
    inputCols=[encoder.getOutputCol() for encoder in encoders]+int_cols,
    outputCol="features"
    )

  2. The final transformed DataFrame can be created using the Pipeline library. A pipeline is specified as a sequence of stages. These stages are run in order and the input DataFrame is transformed as it passes through each stage.
    df_marketing = dyf_marketing.toDF()
    pipeline = Pipeline(stages=indexers + encoders + [assembler])
    df_tfm=pipeline.fit(df_marketing).transform(df_marketing)
    

  3. Next, we split the dataset into train, validation, and test DataFrames and save them in the S3 bucket to train the ML model (provide your AWS account ID in the following code):
    from pyspark.ml.functions import vector_to_array
    
    #set s3 output location for feature engineering output
    bucket='blog-studio-output-<your-aws-account-id>'
    
    #convert sparse to dense vector
    df_tfm=df_tfm.select('converted',vector_to_array("features").alias("features_array"))
    
    #split features array into individual columns
    df_tfm=df_tfm.select([df_tfm.converted] + [df_tfm.features_array[i] for i in range(17)])
    
    #split the overall dataset into 70-20-10 training , validation and test
    (train_df, validation_df, test_df) = df_tfm.randomSplit([0.7,0.2,0.1])
    
    #write back train, validation and test datasets to S3
    train_df.write.option("header","false").csv('s3://{}/web-marketing/processed/training/'.format(bucket))
    
    validation_df.write.option("header","false").csv('s3://{}/web-marketing/processed/validation/'.format(bucket))
    
    test_df.write.option("header","false").csv('s3://{}/web-marketing/processed/test/'.format(bucket))

Train and deploy an ML model

In the previous section, we completed feature engineering, which included converting string columns such as region, jobrole, and usedpromo into a format that is optimal for ML models. We also included columns such as pageviewspervisit and totalwebvisits, which will help us predict a customer’s propensity to buy a product.

We now train an ML model by reading the train and validation datasets using the SageMaker built-in XGBoost algorithm. Then we deploy the model and run an accuracy check. You can download the notebook from this location.

In the following cell, we’re reading data from the second S3 bucket, which includes the output from our feature engineering operations. Then we use the built-in algorithm XGBoost to train the model.

  1. Open a new notebook. Choose Data Science for Image and Python 3 for Kernel (provide your AWS account ID in the following code):
    #set s3 bucket location for training data
    import sagemaker
    import boto3
    from sagemaker import get_execution_role
    
    container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name,
    framework='xgboost', version='latest')
    bucket='blog-studio-output-<your-aws-account-id>'
    prefix='web-marketing/processed'
    
    #read train and validation input datasets
    s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/training/'
    .format(bucket, prefix), content_type='csv')
    s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'
    .format(bucket, prefix), content_type='csv')
    
    #train xgb model
    sess = sagemaker.Session()
    from sagemaker import get_execution_role
    
    xgb = sagemaker.estimator.Estimator(
    container,
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path='s3://{}/{}/output'
    .format(bucket, prefix),
    sagemaker_session=sess
    )
    
    xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    silent=0,
    objective='binary:logistic',
    num_round=100
    )
    
    xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

  2. When training is complete, we can deploy the model using SageMaker hosting services:
    #deploy ml model
    xgb_predictor = xgb.deploy(initial_instance_count=1,
    instance_type='ml.m4.xlarge')

Evaluate the ML model

We use the test dataset to evaluate the model and delete the inference endpoint when we’re done to avoid any ongoing charges.

  1. Evaluate the model with the following code:
    #create csv serialiser to run accuracy on test dataset
    xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()
    
    #read test dataset
    import io
    import pandas as pd
    
    s3 = boto3.resource('s3')
    bucket_obj = s3.Bucket(bucket)
    
    test_line = []
    test_objs = bucket_obj.objects.filter(Prefix="web-marketing/processed/test")
    for obj in test_objs:
        try:
            key = obj.key
            body = obj.get()['Body'].read()
            temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
            test_line.append(temp)
        except:
            continue
    
    test_df = pd.concat(test_line)
    
    #predict results using deployed model
    import numpy as np
    
    def predict(data, predictor, rows=500):
        split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
        predictions = ''
        for array in split_array:
            predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])
        return np.fromstring(predictions[1:], sep=',')
    
    #drop the target variable in test_df and make prediction
    predictions = predict(test_df.drop(test_df.columns[0], axis=1).to_numpy(), xgb_predictor)
    
    #calculate accuracy using sklearn library
    from sklearn.metrics import accuracy_score, confusion_matrix
    y_pred=np.round(predictions)
    y_true=test_df.iloc[:,0].values.tolist()
    print('Accuracy score: ',accuracy_score(y_true, y_pred))
    print('Confusion matrix: \n', confusion_matrix(y_true, y_pred))

    The accuracy result for the sample run was 84.6%. This could be slightly different for your run due to the random split of the dataset.

  2. We can delete the inference endpoint with the following code:
    xgb_predictor.delete_endpoint(delete_endpoint_config=True)

Clean up

Now to the final step, cleaning up the resources.

  1. Empty the two buckets created through the CloudFormation stack.
  2. Delete the apps associated with user profiles data-scientist and data-engineer within Studio.
  3. Delete the CloudFormation stack.

Conclusion

In this post, we demonstrated a solution that enables personas such as data engineers and data scientists to perform feature engineering at scale. With AWS Glue interactive sessions, you can easily achieve feature engineering at scale with automatic PII detection and fine-grained access control without needing to manage any underlying infrastructure. By using Studio as the single entry point, you can get a simplified and integrated experience to build an end-to-end ML workflow: from preparing and securing data to building, training, tuning, and deploying ML models. To learn more, visit Getting started with AWS Glue interactive sessions and Amazon SageMaker Studio.

We are very excited about this new capability and keen to see what you’re going to build with it!


Appendix: Set up resources via the console and the AWS CLI

Complete the instructions in this section to set up resources using the console and AWS CLI instead of the CloudFormation template.

Prerequisites

To complete this tutorial, you must have access to the AWS CLI (see Getting started with the AWS CLI) or use command line access from AWS CloudShell.

Configure IAM group, users, roles, and policies

In this section, we create two IAM users: data-engineer and data-scientist, which belong to the IAM group data-platform-group. Then we add a single IAM policy to the IAM group.

  1. On the IAM console, create a policy on the JSON tab to create a new IAM managed policy named DataPlatformGroupPolicy. The policy allows users in the group to access Studio, but only using a SageMaker user profile with a tag that matches their IAM user name. Use the following JSON policy document to provide permissions:
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Action":[
                "sagemaker:DescribeDomain",
                "sagemaker:ListDomains",
                "sagemaker:ListUserProfiles",
                "sagemaker:ListApps"
             ],
             "Resource":"*",
             "Effect":"Allow",
             "Sid":"AmazonSageMakerStudioReadOnly"
          },
          {
             "Action":"sagemaker:AddTags",
             "Resource":"*",
             "Effect":"Allow",
             "Sid":"AmazonSageMakerAddTags"
          },
          {
             "Condition":{
                "StringEquals":{
                   "sagemaker:ResourceTag/studiouserid":"${aws:username}"
                }
             },
             "Action":[
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeUserProfile"
             ],
             "Resource":"*",
             "Effect":"Allow",
             "Sid":"AmazonSageMakerAllowedUserProfile"
          },
          {
             "Condition":{
                "StringNotEquals":{
                   "sagemaker:ResourceTag/studiouserid":"${aws:username}"
                }
             },
             "Action":[
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeUserProfile"
             ],
             "Resource":"*",
             "Effect":"Deny",
             "Sid":"AmazonSageMakerDeniedUserProfiles"
          }
       ]
    }

  2. Create an IAM group called data-platform-group.
  3. Search for and attach the managed policy named DataPlatformGroupPolicy that you created to the group.
  4. Create IAM users called data-engineer and data-scientist under the IAM group data-platform-group.
  5. Create a new managed policy named SageMakerExecutionPolicy (provide your Region and account ID in the following code):
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Action":[
                "sagemaker:DescribeDomain",
                "sagemaker:ListDomains",
                "sagemaker:ListUserProfiles",
                "sagemaker:ListApps"
             ],
             "Resource":"*",
             "Effect":"Allow",
             "Sid":"AmazonSageMakerStudioReadOnly"
          },
          {
             "Action":"sagemaker:AddTags",
             "Resource":"*",
             "Effect":"Allow",
             "Sid":"AmazonSageMakerAddTags"
          },
          {
             "Action":[
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
                "logs:DescribeLogStreams",
                "sagemaker:CreateModel",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:CreateEndpoint",
                "sagemaker:DescribeEndpoint",
                "sagemaker:InvokeEndpoint",
                "sagemaker:DeleteEndpointConfig",
                "sagemaker:DeleteEndpoint"
             ],
             "Resource":"*",
             "Effect":"Allow",
             "Sid":"AmazonSageMakerTrainingAndDeploy"
          },
          {
             "Action":"sagemaker:*App",
             "Resource":"arn:aws:sagemaker:<aws region>:<account id>:app/*/${aws:PrincipalTag/userprofilename}/*",
             "Effect":"Allow",
             "Sid":"AmazonSageMakerAllowedApp"
          },
          {
             "Action":"sagemaker:*App",
             "Effect":"Deny",
             "NotResource":"arn:aws:sagemaker:<aws region>:<account id>:app/*/${aws:PrincipalTag/userprofilename}/*",
             "Sid":"AmazonSageMakerDeniedApps"
          },
          {
             "Action":[
                "glue:GetTable",
                "glue:GetTables",
                "glue:SearchTables",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartition",
                "glue:GetPartitions"
             ],
             "Resource":[
                "arn:aws:glue:<aws region>:<account id>:table/demo/*",
                "arn:aws:glue:<aws region>:<account id>:database/demo",
                "arn:aws:glue:<aws region>:<account id>:catalog"
             ],
             "Effect":"Allow",
             "Sid":"GlueCatalogPermissions"
          },
          {
             "Action":[
                "lakeformation:GetDataAccess",
                "lakeformation:StartQueryPlanning",
                "lakeformation:GetQueryState",
                "lakeformation:GetWorkUnits",
                "lakeformation:GetWorkUnitResults"
             ],
             "Resource":"*",
             "Effect":"Allow",
             "Sid":"LakeFormationPermissions"
          },
          {
             "Effect":"Allow",
             "Action":[
                "s3:CreateBucket",
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket",
                "s3:DeleteObject"
             ],
             "Resource":[
                "arn:aws:s3:::blog-studio-output-<account id>",
                "arn:aws:s3:::blog-studio-output-<account id>/*"
             ]
          },
          {
             "Action":[
                "iam:PassRole",
                "iam:GetRole",
                "sts:GetCallerIdentity"
             ],
             "Resource":"*",
             "Effect":"Allow",
             "Sid":"AmazonSageMakerStudioIAMPassRole"
          },
          {
             "Action":"sts:AssumeRole",
             "Resource":"*",
             "Effect":"Deny",
             "Sid":"DenyAssummingOtherIAMRoles"
          }
       ]
    }

  6. Create a new managed policy named SageMakerAdminPolicy:
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Action":[
                "lakeformation:GrantPermissions",
                "lakeformation:RevokePermissions",
                "lakeformation:ListPermissions",
                "lakeformation:BatchGrantPermissions",
                "lakeformation:BatchRevokePermissions",
                "lakeformation:CreateDataCellsFilter",
                "lakeformation:DeleteDataCellsFilter",
                "lakeformation:ListDataCellsFilter",
                "glue:GetUserDefinedFunctions",
                "glue:BatchGetCustomEntityTypes"
             ],
             "Resource":"*",
             "Effect":"Allow",
             "Sid":"GlueLakeFormationPermissions"
          }
       ]
    }

  7. Create an IAM role for SageMaker for the data engineer (data-engineer), which is used as the corresponding user profile’s execution role. On the Attach permissions policy page, AmazonSageMakerFullAccess (AWS managed policy) is attached by default. You remove this policy later to maintain minimum privilege.
    1. For Role name, use the naming convention introduced at the beginning of this section to name the role SageMakerStudioExecutionRole_data-engineer.
    2. For Tags, add the key userprofilename and the value data-engineer.
    3. Choose Create role.
    4. To add the remaining policies, on the Roles page, choose the role name you just created.
    5. Under Permissions, remove the policy AmazonSageMakerFullAccess.
    6. On the Attach permissions policy page, select the AWS managed policy AwsGlueSessionUserRestrictedServiceRole, and the customer managed policies SageMakerExecutionPolicy and SageMakerAdminPolicy that you created.
    7. Choose Attach policies.
    8. Modify your role’s trust relationship:
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Effect":"Allow",
             "Principal":{
                "Service":[
                   "glue.amazonaws.com",
                   "sagemaker.amazonaws.com"
                ]
             },
             "Action":"sts:AssumeRole"
          }
       ]
    }

  8. Create an IAM role for SageMaker for the data scientist (data-scientist), which is used as the corresponding user profile’s execution role.
    1. For Role name, name the role SageMakerStudioExecutionRole_data-scientist.
    2. For Tags, add the key userprofilename and the value data-scientist.
    3. Choose Create role.
    4. To add the remaining policies, on the Roles page, choose the role name you just created.
    5. Under Permissions, remove the policy AmazonSageMakerFullAccess.
    6. On the Attach permissions policy page, select the AWS managed policy AwsGlueSessionUserRestrictedServiceRole, and the customer managed policy SageMakerExecutionPolicy that you created.
    7. Choose Attach policies.
    8. Modify your role’s trust relationship:
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Effect":"Allow",
             "Principal":{
                "Service":[
                   "glue.amazonaws.com",
                   "sagemaker.amazonaws.com"
                ]
             },
             "Action":"sts:AssumeRole"
          }
       ]
    }

Configure SageMaker user profiles

To create your SageMaker user profiles with the studiouserid tag, complete the following steps:

  1. Use the AWS CLI or CloudShell to create the Studio user profile for the data engineer (provide your account ID and Studio domain ID in the following code):
    aws sagemaker create-user-profile --domain-id <domain id> --user-profile-name data-engineer --tags Key=studiouserid,Value=data-engineer --user-settings ExecutionRole=arn:aws:iam::<account id>:role/SageMakerStudioExecutionRole_data-engineer

  2. Repeat the step to create a user profile for the data scientist, replacing the account ID and Studio domain ID:
    aws sagemaker create-user-profile --domain-id <domain id> --user-profile-name data-scientist --tags Key=studiouserid,Value=data-scientist --user-settings ExecutionRole=arn:aws:iam::<account id>:role/SageMakerStudioExecutionRole_data-scientist

Create S3 buckets and upload the sample dataset

In this section, you create two S3 buckets. The first bucket has a sample dataset related to web marketing. The second bucket is used by the data scientist to store output from feature engineering tasks, and this output dataset is used to train the ML model.

First, create the S3 bucket for the input data:

  1. Download the dataset.
  2. On the Amazon S3 console, choose Buckets in the navigation pane.
  3. Choose Create bucket.
  4. For Region, choose the Region with the SageMaker domain that includes the user profiles you created.
  5. For Bucket name, enter blog-studio-pii-dataset-<your-aws-account-id>.
  6. Choose Create bucket.
  7. Select the bucket you created and choose Upload.
  8. In the Select files section, choose Add files and upload the dataset you downloaded.
    Now you create the bucket for the output data:
  9. On the Buckets page, choose Create bucket.
  10. For Region, choose the Region with the SageMaker domain that includes the user profiles you created.
  11. For Bucket name, enter blog-studio-output-<your-aws-account-id>.
  12. Choose Create bucket.
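
The preceding console steps can also be scripted. The following is a minimal boto3 sketch that creates both buckets and uploads the dataset; the local file name web_marketing.csv is an assumption, and buckets outside us-east-1 need a CreateBucketConfiguration.

import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
s3 = boto3.client("s3")

input_bucket = f"blog-studio-pii-dataset-{account_id}"
output_bucket = f"blog-studio-output-{account_id}"

for bucket in (input_bucket, output_bucket):
    # For Regions other than us-east-1, also pass
    # CreateBucketConfiguration={"LocationConstraint": "<region>"}
    s3.create_bucket(Bucket=bucket)

# Upload the downloaded dataset to the input bucket (file name is an assumption)
s3.upload_file("web_marketing.csv", input_bucket, "web_marketing.csv")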

Create an AWS Glue database and table

In this section, you create an AWS Glue database and table for the dataset.

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Choose Add database.
  3. For Name, enter demo.
  4. Choose Create database.
  5. Under Data catalog, choose Tables.
  6. For Name, enter web_marketing.
  7. For Database, select demo.
  8. For Include path, enter the path of your S3 bucket for input data.
  9. For Classification, choose CSV.
  10. Under Schema, choose Upload Schema.
  11. Enter the following JSON array into the text box:
    [
       {
          "Name":"lastcampaignactivity",
          "Type":"string"
       },
       {
          "Name":"pageviewspervisit",
          "Type":"double"
       },
       {
          "Name":"totaltimeonwebsite",
          "Type":"bigint"
       },
       {
          "Name":"totalwebvisits",
          "Type":"bigint"
       },
       {
          "Name":"attendedmarketingevent",
          "Type":"string"
       },
       {
          "Name":"organicsearch",
          "Type":"string"
       },
       {
          "Name":"viewedadvertisement",
          "Type":"string"
       },
       {
          "Name":"leadsource",
          "Type":"string"
       },
       {
          "Name":"jobrole",
          "Type":"string"
       },
       {
          "Name":"contactnotes",
          "Type":"string"
       },
       {
          "Name":"leadprofile",
          "Type":"string"
       },
       {
          "Name":"usedpromo",
          "Type":"string"
       },
       {
          "Name":"donotreachout",
          "Type":"boolean"
       },
       {
          "Name":"city",
          "Type":"string"
       },
       {
          "Name":"converted",
          "Type":"bigint"
       },
       {
          "Name":"region",
          "Type":"string"
       },
       {
          "Name":"phone_number",
          "Type":"string"
       }
    ]

  12. Choose Upload.
  13. Choose Submit.
  14. Under Table details, choose Edit table.
  15. Under Table properties, choose Add.
  16. For Key, enter skip.header.line.count, and for Value, enter 1.
  17. Choose Save.
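
If you prefer the AWS Glue API over the console, the following boto3 sketch performs the equivalent steps. It is an illustration only: the executing principal still needs the appropriate Lake Formation permissions, and the column list is abbreviated to the schema shown above.

import boto3

glue = boto3.client("glue")

# Columns as defined in the JSON schema above (abbreviated; include all 17 in practice)
columns = [
    {"Name": "lastcampaignactivity", "Type": "string"},
    {"Name": "pageviewspervisit", "Type": "double"},
    {"Name": "totaltimeonwebsite", "Type": "bigint"},
    # ... remaining columns from the schema above ...
    {"Name": "phone_number", "Type": "string"},
]

glue.create_database(DatabaseInput={"Name": "demo"})
glue.create_table(
    DatabaseName="demo",
    TableInput={
        "Name": "web_marketing",
        "Parameters": {"classification": "csv", "skip.header.line.count": "1"},
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://blog-studio-pii-dataset-<your-aws-account-id>/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)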

Configure Lake Formation permissions

In this section, you set up Lake Formation permissions to allow IAM role SageMakerStudioExecutionRole_data-engineer to create a database and register the S3 location within Lake Formation.

First, register the data lake location to manage tables under the location in Lake Formation permissions:

  1. Choose Data lake locations.
  2. Choose Register location.
  3. For Amazon S3 path, enter s3://blog-studio-pii-dataset-<your-aws-account-id>/ (the bucket that contains the dataset).
  4. Choose Register location.
    Now you grant Lake Formation database and table permissions to the IAM roles SageMakerStudioExecutionRole_data-engineer and SageMakerStudioExecutionRole_data-scientist. First, grant database permission for SageMakerStudioExecutionRole_data-engineer:
  5. Under Permissions, choose Data lake permissions.
  6. Under Data permission, choose Grant.
  7. For Principals, choose IAM users and roles, and select the role SageMakerStudioExecutionRole_data-engineer.
  8. For Policy tags or catalog resources, choose Named data catalog resources.
  9. For Databases, choose demo.
  10. For Database permissions, select Super.
  11. Choose Grant.
    Next, grant table permission for SageMakerStudioExecutionRole_data-engineer:
  12. Under Data permission, choose Grant.
  13. For Principals, choose IAM users and roles, and select the role SageMakerStudioExecutionRole_data-engineer.
  14. For Policy tags or catalog resources, choose Named data catalog resources.
  15. For Databases, choose demo.
  16. For Tables, choose web_marketing.
  17. For Table permissions, select Super.
  18. For Grantable permissions, select Super.
  19. Choose Grant.
    Finally, grant database permission for SageMakerStudioExecutionRole_data-scientist:
  20. Under Data permission, choose Grant.
  21. For Principals, choose IAM users and roles, and select the role SageMakerStudioExecutionRole_data-scientist.
  22. For Policy tags or catalog resources, choose Named data catalog resources.
  23. For Databases, choose demo.
  24. For Database permissions, select Describe.
  25. Choose Grant.
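
These grants can also be made programmatically. The following boto3 sketch registers the location and issues the same grants; it assumes the role ARNs follow the naming used earlier, and that the console's Super permission corresponds to ALL in the API.

import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
lf = boto3.client("lakeformation")

data_engineer_role = f"arn:aws:iam::{account_id}:role/SageMakerStudioExecutionRole_data-engineer"
data_scientist_role = f"arn:aws:iam::{account_id}:role/SageMakerStudioExecutionRole_data-scientist"

# Register the data lake location (the input dataset bucket)
lf.register_resource(
    ResourceArn=f"arn:aws:s3:::blog-studio-pii-dataset-{account_id}",
    UseServiceLinkedRole=True,
)

# Database and table grants for the data engineer
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": data_engineer_role},
    Resource={"Database": {"Name": "demo"}},
    Permissions=["ALL"],
)
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": data_engineer_role},
    Resource={"Table": {"DatabaseName": "demo", "Name": "web_marketing"}},
    Permissions=["ALL"],
    PermissionsWithGrantOption=["ALL"],
)

# Describe-only access on the database for the data scientist
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": data_scientist_role},
    Resource={"Database": {"Name": "demo"}},
    Permissions=["DESCRIBE"],
)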

About the Authors

Praveen Kumar is an Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-native services. His areas of interests are serverless technology, modern cloud data warehouses, streaming, and ML applications.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys collaborating with different teams to deliver results like this post. In his spare time, he enjoys playing video games with his family.

Read More

Get the Big Picture: Stream GeForce NOW in 4K Resolution on Samsung Smart TVs

Get the Big Picture: Stream GeForce NOW in 4K Resolution on Samsung Smart TVs

Gaming in the living room is getting an upgrade with GeForce NOW.

This GFN Thursday, kick off the weekend streaming GeForce NOW on Samsung TVs, with upcoming support for 4K resolution.

Get started with the 10 new titles streaming this week.

Plus, Yes by YTL Communications, a leading 5G provider in Malaysia, today announced it will soon bring GeForce NOW powered by Yes to gamers across the country. Stay tuned for more updates.

Go Big, Go Bold With 4K on Samsung Smart TVs

GeForce NOW is making its way to 2021 Samsung Smart TV models, and is already available through the Samsung Gaming Hub on 2022 Samsung TVs, so more players than ever can stream from GeForce NOW — no downloads, storage limits or console required.

Samsung Gaming Hub
Get tuned in to the cloud just in time for these TV streaming updates.

Even better, gaming on Samsung Smart TVs will look pixel perfect in 4K resolution. 2022 Samsung TVs and select 2021 Samsung TVs will be capable of streaming in 4K, as Samsung’s leadership in game-streaming technology and AI upscaling optimizes picture quality and the entire gaming experience.

The new TV firmware will start rolling out at the end of the month, enabling 4K resolution for Samsung Smart TV streamers with an RTX 3080 membership. RTX 3080 members will be able to stream up to 4K natively on Samsung Smart TVs for the first time, as well as get maximized eight-hour gaming sessions and dedicated RTX 3080 servers.

Here to Play Today

GFN Thursday delivers new games to the cloud every week. Jump into 10 new additions streaming today.

Warhammer 40000 Darktide
Delve deep into the industrial city of Tertium to combat the forces of Chaos that lurk.

Gamers who’ve preordered Warhammer 40,000: Darktide can leap thousands of years into the future a little early. Take back the city of Tertium from hordes of bloodthirsty foes in this intense, brutal action shooter streaming the Pre-Order Beta on Steam.

Members can also look for the following titles:

  • Ballads of Hongye (New release on Steam)
  • Bravery and Greed (New release on Steam)
  • TERRACOTTA (New release on Steam and Epic Games)
  • Warhammer 40,000: Darktide (New release pre-order beta access on Steam)
  • Frozen Flame (New release on Steam, Nov. 17)
  • Goat Simulator 3 (New release on Epic Games, Nov. 17)
  • Nobody — The Turnaround (New release on Steam, Nov. 17)
  • Caveblazers (Steam)
  • The Darkest Tales (Epic Games)
  • The Tenants (Epic Games)

Then jump into the new season of Rumbleverse, the play-for-free, 40-person Brawler Royale where anyone can be a champion. Take a trip on the expanded map to Low Key Key Island, master new power moves like “Jagged Edge” and earn new gear to show off your style.

And from now until Sunday, Nov. 20, snag a special upgrade to a six-month Priority Membership for just $29.99 — 40% off the standard price of $49.99. Bring a buddy to battle with you by getting them a GeForce NOW gift card.

Before you power up to play this weekend, we’ve got a question for you. Let us know your answer on Twitter or in the comments below.

The post Get the Big Picture: Stream GeForce NOW in 4K Resolution on Samsung Smart TVs appeared first on NVIDIA Blog.

Read More

Lockheed Martin, NVIDIA to Help US Speed Climate Data to Researchers

Lockheed Martin, NVIDIA to Help US Speed Climate Data to Researchers

The U.S. National Oceanic and Atmospheric Administration has selected Lockheed Martin and NVIDIA to build a prototype system to accelerate outputs of Earth Environment Monitoring and their corresponding visualizations.

Using AI techniques, such a system has the potential to reduce, by an order of magnitude, the time needed to generate complex weather visualizations.

The first-of-its-kind project for a U.S. federal agency, the Global Earth Observation Digital Twin, or EODT, will provide a prototype to visualize terabytes of geophysical data from the land, ocean, cryosphere, atmosphere and space.

By fusing data from a broad variety of sensor sources, the system will be able to deliver information that’s not just up to date, but that decision-makers have confidence in, explained Lockheed Martin Space Senior Research Scientist Lynn Montgomery.

“We’re providing a one-stop shop for researchers, and for next-generation systems, not only for current, but for recent past environmental data,” Montgomery said. “Our collaboration with NVIDIA will provide NOAA a timely, global visualization of their massive datasets.”

Building on NVIDIA Omniverse

Building on NVIDIA Omniverse, the system has the potential to serve as a clearinghouse for scientists and researchers from a broad range of government agencies, one that can be extended over time to support a wide range of applications.

The support for the EODT pilot project is one of several initiatives at NVIDIA to develop tools and technologies for large-scale, even planetary simulations.

Last November, NVIDIA announced it will build a supercomputer, called Earth-2, devoted to predicting climate change by creating a digital twin of the planet.

NVIDIA and Lockheed Martin announced last year that they are working with the U.S. Department of Agriculture Forest Service and Colorado Division of Fire Prevention & Control to use AI and digital-twin simulation to better understand wildfires and stop their spread.

And in March, NVIDIA announced an accelerated digital twins platform for scientific computing consisting of the NVIDIA Modulus AI framework for developing physics-ML neural network models and the NVIDIA Omniverse 3D virtual-world simulation platform.

The EODT project builds on these initiatives, relying on NVIDIA Omniverse Nucleus to allow different applications to quickly import and export custom, visualizable assets to and from the effort’s central data store.

“This is a blueprint for a complex system using Omniverse, where we will have a fusion of sensor data, architectural data and AI inferred data all combined with various visualization capacities deployed to the cloud and various workstations,” said Peter Messmer, senior manager in the HPC Developer Technology group at NVIDIA. “It’s a fantastic opportunity to highlight all these components with a real-world example.”

A Fast-Moving Effort

The effort will move fast, with a demonstration of the system’s ability to visualize sea surface temperature data slated for next September. The system will take advantage of GPU computing instances from Amazon Web Services and NVIDIA DGX and OVX servers on premises.

The fast, flexible system will provide a prototype to visualize geophysical variables from NOAA satellite and ground data sources from a broad range of sources.

These include temperature and moisture profiles, sea surface temperatures, sea ice concentrations and solar wind data, among other sources.

That data will be collected by Lockheed Martin’s OpenRosetta3D software, which is widely used for sophisticated large-scale image analysis, workflow orchestration and sensor fusion by government agencies, such as NASA, and private industry.

NVIDIA will support the development of one-way connectors to import “snapshots” of processed geospatial datasets from Lockheed’s OpenRosetta3D technology into NVIDIA Omniverse Nucleus as Universal Scene Description inputs.

USD is an open source and extensible ecosystem for describing, composing, simulating and collaborating within 3D worlds, initially invented by Pixar Animation Studios.

Omniverse Nucleus will be vital to making the data available fast, in part because of Nucleus’s ability to relay just what’s changed in a dataset, Montgomery explained.

Nucleus will, in turn, deliver those USD datasets to Lockheed’s Agatha 3D viewer, based on Unity, allowing users to quickly see data from multiple sensors on an interactive 3D earth and space platform.

The result is a system that will help researchers at NOAA, and, eventually, elsewhere, make decisions faster based on the latest available data.

The post Lockheed Martin, NVIDIA to Help US Speed Climate Data to Researchers appeared first on NVIDIA Blog.

Read More

Introducing TorchMultimodal – a library for accelerating exploration in Multimodal AI

We are announcing TorchMultimodal Beta, a PyTorch domain library for training SoTA multi-task multimodal models at scale. The library provides composable building blocks (modules, transforms, loss functions) to accelerate model development, SoTA model architectures (FLAVA, MDETR, Omnivore) from published research, training and evaluation scripts, as well as notebooks for exploring these models. The library is under active development, and we’d love to hear your feedback! You can find more details on how to get started here.

Why TorchMultimodal?

Interest is rising around AI models that understand multiple input types (text, images, videos and audio signals), and optionally use this understanding to generate different forms of outputs (sentences, pictures, videos). Recent work from FAIR such as FLAVA, Omnivore and data2vec has shown that multimodal models for understanding are competitive with unimodal counterparts, and in some cases are establishing the new state of the art. Generative models such as Make-a-video and Make-a-scene are redefining what modern AI systems can do.

As interest in multimodal AI has grown, researchers are looking for tools and libraries to quickly experiment with ideas, and build on top of the latest research in the field. While the PyTorch ecosystem has a rich repository of libraries and frameworks, it’s not always obvious how components from these interoperate with each other, or how they can be stitched together to build SoTA multimodal models.

TorchMultimodal solves this problem by providing:

  • Composable and easy-to-use building blocks which researchers can use to accelerate model development and experimentation in their own workflows. These are designed to be modular, and can be easily extended to handle new modalities.

  • End-to-end examples for training and evaluating the latest models from research. These should serve as starting points for ongoing/future research, as well as examples for using advanced features such as integrating with FSDP and activation checkpointing for scaling up model and batch sizes.

Introducing TorchMultimodal

TorchMultimodal is a PyTorch domain library for training multi-task multimodal models at scale. In the repository, we provide:

  • Building Blocks. A collection of modular and composable building blocks like models, fusion layers, loss functions, datasets and utilities. Some examples include:

    • Contrastive Loss with Temperature. Commonly used function for training models like CLIP and FLAVA. We also include variants such as ImageTextContrastiveLoss used in models like ALBEF.

    • Codebook layers, which compress high-dimensional data by nearest-neighbor lookup in an embedding space and are a vital component of VQVAEs (provided as a model in the repository).

    • Shifted-window Attention. Window-based multi-head self-attention, which is a vital component of encoders like Swin 3D Transformers.

    • Components for CLIP. A popular model published by OpenAI which has proven to be extremely effective at learning text and image representations.

    • Multimodal GPT. An abstraction that extends OpenAI’s GPT architecture for multimodal generation when combined with the generation utility.

    • MultiHeadAttention. A critical component for attention-based models with support for fast auto-regressive decoding.

  • Examples. A collection of examples that show how to combine these building blocks with components and common infrastructure (Lightning, TorchMetrics) from across the PyTorch Ecosystem to replicate state-of-the-art models published in the literature. We currently provide five examples, which include:

    • FLAVA [paper]. Official code for the paper accepted at CVPR, including a tutorial on finetuning FLAVA.

    • MDETR [paper]. Collaboration with authors from NYU to provide an example which alleviates interoperability pain points in the PyTorch ecosystem, including a notebook on using MDETR for phrase grounding and visual question answering.

    • Omnivore [paper]. First example in TorchMultimodal of a model which deals with Video and 3D data, including a notebook for exploring the model.

    • MUGEN [paper]. Foundational work for auto-regressive generation and retrieval, including demos for text-video generation and retrieval with a large-scale synthetic dataset enriched from OpenAI coinrun.

    • ALBEF [paper]. Code for the model, including a notebook for using this model for visual question answering.

The following code snippet showcases an example usage of several TorchMultimodal components related to CLIP:


# imports: CLIPTransform, clip_vit_l14 and ContrastiveLossWithTemperature come from the TorchMultimodal library;
# the dataset and optimizer come from torchvision and PyTorch
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CocoCaptions

# instantiate clip transform
clip_transform = CLIPTransform()

# pass the transform to your dataset. Here we use coco captions
dataset = CocoCaptions(root=..., annFile=..., transforms=clip_transform)
dataloader = DataLoader(dataset, batch_size=16)

# instantiate model. Here we use clip with vit-L as the image encoder
model = clip_vit_l14()

# define loss and other things needed for training
clip_loss = ContrastiveLossWithTemperature()
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
epochs = 1

# write your train loop
for _ in range(epochs):
	for batch_idx, batch in enumerate(dataloader):
		image, text = batch
		image_embeddings, text_embeddings = model(image, text)
		loss = clip_loss(image_embeddings, text_embeddings)
		optim.zero_grad()
		loss.backward()
		optim.step()

Apart from the code, we are also releasing a tutorial for fine-tuning multimodal foundation models, and a blog post (with code pointers) on how to scale up such models using techniques from PyTorch Distributed (FSDP and activation checkpointing). We hope such examples and tutorials will serve to demystify a number of advanced features available in the PyTorch ecosystem.

What’s Next?

While this is an exciting launch, there’s a lot more to come. The library is under development and we are working on adding some of the exciting developments in the space of diffusion models, and examples to showcase common trends from research. As you explore and use the library, we’d love to hear any feedback you might have! You can find more details on how to get started here.

Team

The primary contributors and developers of TorchMultimodal include Ankita De, Evan Smothers, Kartikay Khandelwal, Lan Gong, Laurence Rouesnel, Nahiyan Malik, Rafi Ayub and Yosua Michael Maranatha.

Read More

Build a cross-account MLOps workflow using the Amazon SageMaker model registry

Build a cross-account MLOps workflow using the Amazon SageMaker model registry

A well-designed CI/CD pipeline is essential to scale any software development workflow effectively. When designing production CI/CD pipelines, AWS recommends leveraging multiple accounts to isolate resources, contain security threats, and simplify billing; data science pipelines are no different. At AWS, we’re continuing to innovate to simplify the MLOps workflow.

In this post, we discuss some of the newer cross-account features of Amazon SageMaker that allow you to better share and manage model groups, as well as manage model versions. For an example account structure to follow organizational unit best practices to host models using SageMaker endpoints across accounts, refer to MLOps Workload Orchestrator.

Solution overview

The following diagram illustrates our shared model registry architecture.

Architecture diagram reflecting the cross account MLOps process

The following steps correspond to the diagram:

  1. A data scientist registers a model from the data science account into the shared services SageMaker model registry in a PendingManualApproval state. The model artifact is created in the shared services account Amazon Simple Storage Service (Amazon S3) bucket.
  2. Upon a new model version registration, someone with the authority to approve the model based on the metrics should approve or reject the model.
  3. After the model is approved, the CI/CD pipeline in the deployment account is triggered to deploy the updated model details in the QA account and update the stage as QA.
  4. Upon passing the testing process, you can either choose to have a manual approval step within your CI/CD process or have your CI/CD pipeline directly deploy the model to production and update the stage as Prod.
  5. The production environment references the approved model and code, perhaps doing an A/B test in production. In case of an audit or any issue with the model, you can use Amazon SageMaker ML Lineage Tracking. It creates and stores information about the steps of a machine learning (ML) workflow from data preparation to model deployment. With the tracking information, you can reproduce the workflow steps, track the model and dataset lineage, and establish model governance and audit standards.

Throughout the whole process, the shared model registry retains the older model versions. This allows the team to roll back changes, or even host production variants.

Prerequisites

Make sure you have the following prerequisites:

  • A provisioned multi-account structure – For instructions, see Best Practices for Organizational Units with AWS Organizations. For the purposes of this blog we are leveraging the following accounts:
    • Data science account – An account where data scientists have access to the training data and create the models.
    • Shared services account – A central account for storing the model artifacts (as shown in the architecture diagram) to be accessed across the different workload accounts.
    • Deployment account – An account responsible for deploying changes to the various accounts.
    • Workload accounts – These are commonly QA and prod environments where software engineers are able to build applications to consume the ML model.
  • A deployment account with appropriate permissions – For more information about best practices with a multi-account OU structure, refer to Deployments OU. This account is responsible for pointing the workload accounts to the desired model in the shared services account’s model registry.

Define cross-account policies

In following the principle of least privilege, first we need to add cross-account resource policies to the shared services resources to grant access from the other accounts.

Because the model artifacts are stored in the shared services account’s S3 bucket, the data science account needs Amazon S3 read/write access to push trained models to Amazon S3. The following code illustrates this policy, but don’t add it to the shared services account yet:

#Data Science account's policy to access Shared Services' S3 bucket
 {
    'Version': '2012-10-17',
    'Statement': [{
        'Sid': 'AddPerm',
        'Effect': 'Allow',
        'Principal': {
            'AWS': 'arn:aws:iam::<data_science_account_id>:root'
        }, 
        "Action": [ 
            's3:PutObject', 
            's3:PutObjectAcl',
            's3:GetObject', 
            's3:GetObjectVersion'
        ], #read/write
        'Resource': 'arn:aws:s3:::<shared_bucket>/*'
    }]
}

The deployment account only needs to be granted read access to the S3 bucket, so that it can use the model artifacts to deploy to SageMaker endpoints. We also need to attach the following policy to the shared services S3 bucket:

#Deployment account's policy to access Shared Services' S3 bucket
 {
    'Version': '2012-10-17',
    'Statement': [{
        'Sid': 'AddPerm',
        'Effect': 'Allow',
        'Principal': {
            'AWS': 'arn:aws:iam::<deployment_account_id>:root'
        },
        'Action': [ 
            's3:GetObject', 
            's3:GetObjectVersion'
        ], #read
        'Resource': 'arn:aws:s3:::<shared_bucket>/*'
    }]
}

We combine both policies to get the following final policy. Create this policy in the shared services account after replacing the appropriate account IDs:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AddPerm",
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::<data_science_account_id>:root"    
    },
    "Action": [
      "s3:PutObject",
      "s3:PutObjectAcl",
      "s3:GetObject",
      "s3:GetObjectVersion"    ],
    "Resource": "arn:aws:s3:::<shared_bucket>/*"  
    },
    {
      "Sid": "AddPermDeployment",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<deployment_account_id>:root"      
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"      ], 
      "Resource": "arn:aws:s3:::<shared_bucket>/*"    
    }
  ]
}
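
To attach this combined policy to the shared services bucket programmatically, you can use a call like the following. This is a minimal boto3 sketch, assuming the policy above (with account IDs and bucket name substituted) is saved locally as bucket_policy.json.

import json

import boto3

s3 = boto3.client("s3")

# Load the combined resource policy defined above
with open("bucket_policy.json") as f:
    bucket_policy = json.load(f)

# Attach the policy to the shared services bucket
s3.put_bucket_policy(Bucket="<shared_bucket>", Policy=json.dumps(bucket_policy))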

To be able to deploy a model created in a different account, the user must have a role that has access to SageMaker actions, such as a role with the AmazonSageMakerFullAccess managed policy. Refer to Deploy a Model Version from a Different Account for additional details.

We need to define the model group that contains the model versions we want to deploy. Also, we want to grant permissions to the data science account. This can be accomplished in the following steps. We refer to the accounts as follows:

  • shared_services_account_id – The account where the model registry is and where we want the model to be
  • data_science_account_id – The account where we will be training and therefore creating the actual model artifact
  • deployment_account_id – The account where we want to host the endpoint for this model

First, we need to ensure the model package group exists. You can use Boto3 APIs as shown in the following example, or you can use the AWS Management Console to create the model package group. Refer to Create Model Package Group for more details. This assumes you have Boto3 installed.

import boto3

model_package_group_name = "cross-account-example-model"
sm_client = boto3.Session().client("sagemaker")

create_model_package_group_response = sm_client.create_model_package_group(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageGroupDescription="Cross account model package group",
    Tags=[
          {
              'Key': 'Name',
              'Value': 'cross_account_test'
          },
      ]

)

print('ModelPackageGroup Arn : {}'.format(create_model_package_group_response['ModelPackageGroupArn']))

For the permissions for this model package group, you can create a JSON document resembling the following code. Replace the actual account IDs and model package group name with your own values.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AddPermModelPackageGroupCrossAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<data_science_account_id>:root"      
      },
      "Action": [
        "sagemaker:DescribeModelPackageGroup"      
        ],
      "Resource": "arn:aws:sagemaker:<region>:<shared_services_account_id>:model-package-group/<model_package_group_name>"    
    },
    {
      "Sid": "AddPermModelPackageVersionCrossAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<data_science_account_id>:root"      
      },
      "Action": [
        "sagemaker:DescribeModelPackage",
        "sagemaker:ListModelPackages",
        "sagemaker:UpdateModelPackage",
        "sagemaker:CreateModelPackage",
        "sagemaker:CreateModel"      
      ],
      "Resource": "arn:aws:sagemaker:<region>:<shared_services_account_id>:model-package/<model_package_group_name>/*"    
    }
  ]
}

Finally, apply the policy to the model package group. You can’t associate this policy with the package group via the console; you need SDK or AWS Command Line Interface (AWS CLI) access. For example, the following code uses Boto3:

import json

import boto3

# Convert the policy from a Python dict to a JSON string
model_package_group_policy = dict(<put-above-json-policy-after-substitute>)
model_package_group_policy = json.dumps(model_package_group_policy)

# Set the new policy
sm_client = boto3.Session().client("sagemaker")
response = sm_client.put_model_package_group_policy(
    ModelPackageGroupName = model_package_group_name,
    ResourcePolicy = model_package_group_policy)

We also need a custom AWS Key Management Service (AWS KMS) key to encrypt the model while storing it in Amazon S3. This needs to be done using the data science account. On the AWS KMS console, navigate to the Define key usage permissions page. In the Other AWS accounts section, choose Add another AWS account. Enter the AWS account number for the deployment account. You use this KMS key for the SageMaker training job. If you don’t specify a KMS key for the training job, SageMaker defaults to an Amazon S3 server-side encryption key. A default Amazon S3 server-side encryption key can’t be shared with or used by another AWS account.
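
As a rough equivalent to those console steps, the following boto3 sketch creates such a key in the data science account and grants the deployment account use of it. This is an illustration only: the alias name and the exact set of allowed KMS actions are assumptions, and the key policy must keep the owning account as administrator so the key stays manageable.

import json

import boto3

kms = boto3.client("kms")

key_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Keep the owning (data science) account as key administrator
            "Sid": "EnableIAMUserPermissions",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<data_science_account_id>:root"},
            "Action": "kms:*",
            "Resource": "*",
        },
        {
            # Allow the deployment account to use the key for decryption
            "Sid": "AllowDeploymentAccountUse",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<deployment_account_id>:root"},
            "Action": ["kms:Decrypt", "kms:DescribeKey"],
            "Resource": "*",
        },
    ],
}

key = kms.create_key(
    Description="Key for encrypting model artifacts in Amazon S3",
    Policy=json.dumps(key_policy),
)
# Alias name is an assumption; the training code later looks up alias/sagemaker/outkey
kms.create_alias(AliasName="alias/sagemaker/outkey", TargetKeyId=key["KeyMetadata"]["KeyId"])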

The policy and permissions follow this pattern:

  • The Amazon S3 policy specified in shared_services_account gives permissions to the data science account and deployments account
  • The KMS key policy specified in shared_services_account gives permissions to the data science account and deployments account

We need to ensure that the shared services account and deployment account have access to the Docker images that were used for training the model. These images are generally hosted in AWS accounts, and your account admin can help you get access, if you don’t have access already. For this post, we don’t create any custom Docker images after training the model and therefore we don’t need any specific Amazon ECR policies for the images.

In the workload accounts (QA or prod), we need to create two AWS Identity and Access Management (IAM) policies similar to the following. These are inline policies, which means that they’re embedded in an IAM identity. This gives these accounts access to the model registry.

The first inline policy allows a role to access the Amazon S3 resource in the shared services account that contains the model artifact. Provide the name of the S3 bucket and your model:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<bucket-name>/sagemaker/<cross-account-example-model>/output/model.tar.gz"
        }
    ]
}

The second inline policy allows a role, which we create later, to use the KMS key in the shared services account. Specify the account ID for the shared services account and KMS key ID:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUseOfTheKey",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:us-east-1:<data_science_account_id>:key/{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}"
            ]
        }
    ]
}

Finally, we need to create an IAM role for SageMaker. This role has the AmazonSageMakerFullAccess policy attached. We then attach these two inline policies to the role we created. If you’re using an existing SageMaker execution role, attach these two policies to that role. For instructions, refer to Creating roles and attaching policies (console).
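
A minimal boto3 sketch of that role setup follows. The role name is hypothetical, and s3_inline_policy and kms_inline_policy stand for the two JSON documents shown above.

import json

import boto3

iam = boto3.client("iam")

# Trust policy so SageMaker can assume the role
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

role_name = "SageMakerWorkloadExecutionRole"  # hypothetical name
iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(assume_role_policy))

# Attach the AWS managed policy plus the two inline policies defined above
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="SharedServicesModelArtifactRead",
    PolicyDocument=json.dumps(s3_inline_policy),
)
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="SharedServicesKmsDecrypt",
    PolicyDocument=json.dumps(kms_inline_policy),
)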

Now that we have defined the policies of each account, let’s use an example to see it in action.

Build and train a model using a SageMaker pipeline

We first create a SageMaker pipeline in the data science account for carrying out data processing, model training, and evaluation. We use the California housing dataset obtained from the StatLib library. In the following code snippet, we use a custom preprocessing script, preprocess.py (which can be generated using the following notebook), to perform some simple feature transformations such as feature scaling. This script also splits the dataset into training and test datasets.

We create a SKLearnProcessor object to run this preprocessing script. In the SageMaker pipeline, we create a processing step (ProcessingStep) to run the processing code using SKLearnProcessor. This processing code is called when the SageMaker pipeline is initialized. The code that creates the SKLearnProcessor and ProcessingStep is shown in the following snippet. Note that all the code in this section runs in the data science account.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

# Useful SageMaker variables - Create a Pipeline session which will lazy init resources
session = PipelineSession()

framework_version = "0.23-1"

# Create SKlearn processor object,
# The object contains information about what instance type to use, the IAM role to use etc.
# A managed processor comes with a preconfigured container, so only specifying version is required.
sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    role=role,
    instance_type=processing_instance_type,
    instance_count=1,
    base_job_name="tf2-california-housing-processing-job",
    sagemaker_session=session
)

# Use the sklearn_processor in a SageMaker pipelines ProcessingStep
step_preprocess_data = ProcessingStep(
    name="Preprocess-California-Housing-Data",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="preprocess.py",
)

We need a custom KMS key to encrypt the model while storing it in Amazon S3. See the following code:

import boto3

kms_client = boto3.client('kms')
response = kms_client.describe_key(
    KeyId='alias/sagemaker/outkey',
)
key_id = response['KeyMetadata']['KeyId']

To train the model, we create a TensorFlow estimator object. We pass it the KMS key ID along with our training script train.py, training instance type, and count. We also create a TrainingStep to be added to our pipeline, and add the TensorFlow estimator to it. See the following code:

model_path = f"s3://{bucket}/{prefix}/model/"

hyperparameters = {"epochs": training_epochs}
tensorflow_version = "2.4.1"
python_version = "py37"

tf2_estimator = TensorFlow(
    source_dir="code",
    entry_point="train.py",
    instance_type=training_instance_type,
    instance_count=1,
    framework_version=tensorflow_version,
    role=role,
    base_job_name="tf2-california-housing-train",
    output_path=model_path,
    output_kms_key=key_id,
    hyperparameters=hyperparameters,
    py_version=python_version,
    sagemaker_session=session
)

# Use the tf2_estimator in a SageMaker pipelines ProcessingStep.
# NOTE how the input to the training job directly references the output of the previous step.
step_train_model = TrainingStep(
    name="Train-California-Housing-Model",
    estimator=tf2_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "test": TrainingInput(
            s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)

In addition to training, we need to carry out model evaluation, for which we use mean squared error (MSE) as the metric in this example. The earlier notebook also generates evaluate.py, which we use to evaluate our model using MSE. We also create a ProcessingStep to initialize the model evaluation script using a SKLearnProcessor object. The following code creates this step:

from sagemaker.workflow.properties import PropertyFile

# Create SKLearnProcessor object.
# The object contains information about what container to use, what instance type etc.
evaluate_model_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=1,
    base_job_name="tf2-california-housing-evaluate",
    role=role,
    sagemaker_session=session
)

# Create a PropertyFile
# A PropertyFile is used to be able to reference outputs from a processing step, for instance to use in a condition step.
# For more information, visit https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-propertyfile.html
evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)

# Use the evaluate_model_processor in a SageMaker pipelines ProcessingStep.
step_evaluate_model = ProcessingStep(
    name="Evaluate-California-Housing-Model",
    processor=evaluate_model_processor,
    inputs=[
        ProcessingInput(
            source=step_train_model.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="evaluate.py",
    property_files=[evaluation_report],
)

After model evaluation, we also need a step to register our model with the model registry, if the model performance meets the requirements. This is shown in the following code, which registers the model using a ModelStep. Here we need to specify the model package group that we declared in the shared services account. Replace the Region, account, and model package group with your values. The model name used here is modeltest, but you can use any name of your choice.

from sagemaker.model import Model
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.model_step import ModelStep

# Create ModelMetrics object using the evaluation report from the evaluation step
# A ModelMetrics object contains metrics captured from a model.
model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri=evaluation_s3_uri,
        content_type="application/json",
    )
)

# Create a RegisterModel step, which registers the model with SageMaker Model Registry.
model = Model(
    image_uri=tf2_estimator.training_image_uri(),
    model_data=step_train_model.properties.ModelArtifacts.S3ModelArtifacts,
    source_dir=tf2_estimator.source_dir,
    entry_point=tf2_estimator.entry_point,
    role=role_arn,
    sagemaker_session=session
)

model_registry_args = model.register(
    content_types=['text/csv'],
    response_types=['application/json'],
    inference_instances=['ml.t2.medium', 'ml.m5.xlarge'],
    transform_instances=['ml.m5.xlarge'],
    model_package_group_name=model_package_group_name,
    approval_status='PendingManualApproval',
    model_metrics=model_metrics
)

step_register_model = ModelStep(
    name='RegisterModel',
    step_args=model_registry_args
)

We also need to create the model artifacts so that they can be deployed (using the other account). For creating the model, we add a model creation step using ModelStep, as shown in the following code:

from sagemaker.workflow.model_step import ModelStep

step_create_model = ModelStep(
    name="Create-California-Housing-Model",
    step_args=model.create(instance_type="ml.m5.large", accelerator_type="ml.eia1.medium"),
)

Adding conditions to the pipeline is done with a ConditionStep. In this case, we only want to register the new model version with the model registry if the new model meets an accuracy condition. See the following code:

from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import (
    ConditionStep,
    JsonGet,
)

# Create accuracy condition to ensure the model meets performance requirements.
# Models with a test accuracy lower than the condition will not be registered with the model registry.
cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step=step_evaluate_model,
        property_file=evaluation_report,
        json_path="regression_metrics.mse.value",
    ),
    right=accuracy_mse_threshold,
)

# Create a SageMaker Pipelines ConditionStep, using the preceding condition.
# Enter the steps to perform if the condition returns True / False.
step_cond = ConditionStep(
    name="MSE-Lower-Than-Threshold-Condition",
    conditions=[cond_lte],
    if_steps=[step_register_model, step_create_model],
    else_steps=[step_higher_mse_send_email_lambda],
)

Finally, we want to orchestrate all the pipeline steps so that the pipeline can be initialized:

from sagemaker.workflow.pipeline import Pipeline

# Create a SageMaker Pipeline.
# Each parameter for the pipeline must be set as a parameter explicitly when the pipeline is created.
# Also pass in each of the preceding steps.
# Note that the order of execution is determined from each step's dependencies on other steps,
# not on the order they are passed in.
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type,
        training_instance_type,
        input_data,
        training_epochs,
        accuracy_mse_threshold,
        endpoint_instance_type,
    ],
    steps=[step_preprocess_data, step_train_model, step_evaluate_model, step_cond],
)
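
With the pipeline defined, a typical way to create (or update) it in SageMaker and run it looks like the following. This is a short sketch that continues from the preceding snippet; role is the SageMaker execution role used in the data science account.

# Create or update the pipeline definition in SageMaker, then start an execution
pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.wait()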

Deploy a model version from a different account

Now that the model has been registered in the shared services account, we need to deploy it into our workload accounts using the CI/CD pipeline in the deployment account. We have already configured the role and the policy in an earlier step. We use the model package ARN to deploy the model from the model registry. The following code runs in the deployment account and is used to deploy approved models to QA and prod:

import sagemaker
from sagemaker import ModelPackage
from time import gmtime, strftime

# sess is a boto3.Session configured with credentials for the deployment account
sagemaker_session = sagemaker.Session(boto_session=sess)

model_package_arn = 'arn:aws:sagemaker:<region>:<shared_services_account>:<model_group_package>/modeltest/version_number'
model = ModelPackage(role=role, 
                     model_package_arn=model_package_arn, 
                     sagemaker_session=sagemaker_session)
model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

Conclusion

In this post, we demonstrated how to set up the policies needed for a multi-account setup for ML based on the principle of least privilege. Then we showed the process of building and training the models in the data science account. Finally, we used the CI/CD pipeline in the deployment account to deploy the latest version of approved models to QA and production accounts. Additionally, you can view the deployment history of models and build triggers in AWS CodeBuild.

You can scale the concepts in this post to host models in Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Kubernetes Service (Amazon EKS), as well as build out a batch inference pipeline.

To learn more about having separate accounts that build ML models in AWS, see Best Practices for Organizational Units with AWS Organizations and Safely update models in production.


About the Authors

Sandeep Verma is a Sr. Prototyping Architect with AWS. He enjoys diving deep into customer challenges and building prototypes for customers to accelerate innovation. He has a background in AI/ML, is a founder of New Knowledge, and is generally passionate about tech. In his free time, he loves traveling and skiing with his family.

Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers use machine learning to solve their business challenges on AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at the edge and has created her own lab with a self-driving kit and a prototype manufacturing production line, where she spends a lot of her free time.

Saumitra Vikram is a Software Developer on the Amazon SageMaker team and is based in Chennai, India. Outside of work, he loves spending time running, trekking and motor bike riding through the Himalayas.

Sreedevi Srinivasan is an engineering leader in AWS SageMaker. She is passionate and excited about enabling ML as a platform that is set to transform everyday lives. She currently focuses on SageMaker Feature Store. In her free time, she likes to spend time with her family.

Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from The University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization and related domains. He has over 16 years of work experience and is also an adjunct faculty member at The University of Texas at Dallas, where he teaches a graduate course on Applied Machine Learning. Based in Dallas, Texas, he and his family love to travel and make long road trips.

Read More

Mixture-of-Experts with Expert Choice Routing

Mixture-of-Experts with Expert Choice Routing

The capacity of a neural network to absorb information is limited by the number of its parameters, and as a consequence, finding more effective ways to increase model parameters has become a trend in deep learning research. Mixture-of-experts (MoE), a type of conditional computation where parts of the network are activated on a per-example basis, has been proposed as a way of dramatically increasing model capacity without a proportional increase in computation. In sparsely-activated variants of MoE models (e.g., Switch Transformer, GLaM, V-MoE), a subset of experts is selected on a per-token or per-example basis, thus creating sparsity in the network. Such models have demonstrated better scaling in multiple domains and better retention capability in a continual learning setting (e.g., Expert Gate). However, a poor expert routing strategy can cause certain experts to be under-trained, leading to an expert being under- or over-specialized.

In “Mixture-of-Experts with Expert Choice Routing”, presented at NeurIPS 2022, we introduce a novel MoE routing algorithm called Expert Choice (EC). We discuss how this approach can achieve optimal load balancing in an MoE system while allowing heterogeneity in token-to-expert mapping. Compared to token-based routing and other routing methods in traditional MoE networks, EC demonstrates very strong training efficiency and downstream task scores. Our method resonates with one of the visions for Pathways, which is to enable heterogeneous mixture-of-experts via Pathways MPMD (multi program, multi data) support.

Overview of MoE Routing

MoE operates by adopting a number of experts, each as a sub-network, and activating only one or a few experts for each input token. A gating network must be chosen and optimized in order to route each token to the most suited expert(s). Depending on how tokens are mapped to experts, MoE can be sparse or dense. Sparse MoE only selects a subset of experts when routing each token, reducing computational cost as compared to a dense MoE. For example, recent work has implemented sparse routing via k-means clustering, linear assignment to maximize token-expert affinities, or hashing. Google also recently announced GLaM and V-MoE, both of which advance the state of the art in natural language processing and computer vision via sparsely gated MoE with top-k token routing, demonstrating better performance scaling with sparsely activated MoE layers. Many of these prior works used a token choice routing strategy in which the routing algorithm picks the best one or two experts for each token.

Token Choice Routing. The routing algorithm picks the top-1 or top-2 experts with highest affinity scores for each token. The affinity scores can be trained together with model parameters.

The independent token choice approach often leads to an imbalanced load of experts and under-utilization. In order to mitigate this, previous sparsely gated networks introduced additional auxiliary losses as regularization to prevent too many tokens being routed to a single expert, but the effectiveness was limited. As a result, token choice routing needs to overprovision expert capacity by a significant margin (2x–8x of the calculated capacity) to avoid dropping tokens when there is a buffer overflow.

In addition to load imbalance, most prior works allocate a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens. We argue that different tokens should be received by a variable number of experts, conditioned on token importance or difficulty.

Expert Choice Routing

To address the above issues, we propose a heterogeneous MoE that employs the expert choice routing method illustrated below. Instead of having tokens select the top-k experts, the experts with predetermined buffer capacity are assigned to the top-k tokens. This method guarantees even load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance. EC routing speeds up training convergence by over 2x in an 8B/64E (8 billion activated parameters, 64 experts) model, compared to the top-1 and top-2 gating counterparts in Switch Transformer, GShard, and GLaM.

Expert Choice Routing. Experts with predetermined buffer capacity are assigned top-k tokens, thus guaranteeing even load balancing. Each token can be received by a variable number of experts.

In EC routing, we set expert capacity k as the average tokens per expert in a batch of input sequences multiplied by a capacity factor, which determines the average number of experts that can be received by each token. To learn the token-to-expert affinity, our method produces a token-to-expert score matrix that is used to make routing decisions. The score matrix indicates the likelihood of a given token in a batch of input sequences being routed to a given expert.

Similar to Switch Transformer and GShard, we apply an MoE and gating function in the dense feedforward (FFN) layer, as it is the most computationally expensive part of a Transformer-based network. After producing the token-to-expert score matrix, a top-k function is applied along the token dimension for each expert to pick the most relevant tokens. A permutation function is then applied based on the generated token indexes to create a hidden value with an additional expert dimension. The data is split across multiple experts such that all experts can execute the same computational kernel concurrently on a subset of tokens. Because a fixed expert capacity can be determined, we no longer overprovision expert capacity due to load imbalance, thus significantly reducing training and inference step time by around 20% compared to GLaM.
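
To make the mechanics concrete, here is a minimal PyTorch sketch of the expert choice selection described above. It is an illustration only, assuming a single device and omitting the permutation back to the original token order.

import torch
import torch.nn.functional as F

def expert_choice_route(tokens, gate_weights, num_experts, capacity_factor=2.0):
    """Minimal sketch of Expert Choice routing.

    tokens: [n, d] token representations for one batch of sequences.
    gate_weights: [d, e] learned gating projection (e = num_experts).
    """
    n, d = tokens.shape
    e = num_experts
    # Expert capacity k: average tokens per expert times the capacity factor.
    k = int(capacity_factor * n / e)
    # Token-to-expert affinity scores, normalized over experts.
    scores = F.softmax(tokens @ gate_weights, dim=-1)      # [n, e]
    # Each expert picks its top-k tokens (top-k along the token dimension).
    gating, indices = torch.topk(scores.t(), k, dim=-1)    # both [e, k]
    # Gather the selected tokens into an [e, k, d] tensor so every expert
    # can run the same FFN kernel on its own slice of tokens.
    dispatched = tokens[indices]                            # [e, k, d]
    return dispatched, gating, indices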

Evaluation

To illustrate the effectiveness of Expert Choice routing, we first look at training efficiency and convergence. We use EC with a capacity factor of 2 (EC-CF2) to match the activated parameter size and computational cost on a per-token basis to GShard top-2 gating and run both for a fixed number of steps. EC-CF2 reaches the same perplexity as GShard top-2 in less than half the steps and, in addition, we find that each GShard top-2 step is 20% slower than our method.

We also scale the number of experts while fixing the expert size to 100M parameters for both EC and GShard top-2 methods. We find that both work well in terms of perplexity on the evaluation dataset during pre-training — having more experts consistently improves training perplexity.

Evaluation results on training convergence: EC routing yields 2x faster convergence at 8B/64E scale compared to top-2 gating used in GShard and GLaM (top). EC training perplexity scales better with the scaling of number of experts (bottom).

To validate whether improved perplexity directly translates to better performance in downstream tasks, we perform fine-tuning on 11 selected tasks from GLUE and SuperGLUE. We compare three MoE methods including Switch Transformer top-1 gating (ST Top-1), GShard top-2 gating (GS Top-2) and a version of our method (EC-CF2) that matches the activated parameters and computational cost of GS Top-2. The EC-CF2 method consistently outperforms the related methods and yields an average accuracy increase of more than 2% in a large 8B/64E setting. Comparing our 8B/64E model against its dense counterpart, our method achieves better fine-tuning results, increasing the average score by 3.4 points.

Our empirical results indicate that capping the number of experts for each token hurts the fine-tuning score by 1 point on average. This study confirms that allowing a variable number of experts per token is indeed helpful. We also compute statistics on token-to-expert routing, particularly on the ratio of tokens that have been routed to a certain number of experts. We find that a majority of tokens have been routed to one or two experts while 23% have been routed to three or four experts and only about 3% of tokens have been routed to more than four experts, thus verifying our hypothesis that expert choice routing learns to allocate a variable number of experts to tokens.

Final Thoughts

We propose a new routing method for sparsely activated mixture-of-experts models. This method addresses load imbalance and under-utilization of experts in conventional MoE methods, and enables the selection of different numbers of experts for each token. Our model demonstrates more than 2x training efficiency improvement when compared to the state-of-the-art GShard and Switch Transformer models, and achieves strong gains when fine-tuning on 11 datasets in the GLUE and SuperGLUE benchmark.

Our approach for expert choice routing enables heterogeneous MoE with straightforward algorithmic innovations. We hope that this may lead to more advances in this space at both the application and system levels.

Acknowledgements

Many collaborators across Google Research supported this work. We particularly thank Nan Du, Andrew Dai, Yanping Huang, and Zhifeng Chen for the initial ground work on MoE infrastructure and Tarzan datasets. We greatly appreciate Hanxiao Liu and Quoc Le for contributing the initial ideas and discussions. Tao Lei, Vincent Zhao, Da Huang, Chang Lan, Daiyi Peng, and Yifeng Lu contributed significantly on implementations and evaluations. Claire Cui, James Laudon, Martin Abadi, and Jeff Dean provided invaluable feedback and resource support.

Read More