Implementing a custom labeling GUI with built-in processing logic in Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. It offers easy access to Amazon Mechanical Turk and private human labelers, and provides them with built-in workflows and interfaces for common labeling tasks.

A labeling team may wish to use the powerful customization features in Ground Truth to modify:

  • The look and feel of the workers’ graphical user interface (GUI)
  • The backend AWS Lambda functions that perform the preprocessing and postprocessing logic

Depending on the nature of your labeling job and your use case, your customization requirements may vary.

In this post, I show you how to use a custom workflow to implement a text classification labeling job consisting of a custom GUI, built-in preprocessing and postprocessing logic, and encrypted output. (For our example, workers are tasked to determine whether a sentence references a person, animal, or plant.) I also provide an overview of the prerequisites, the code, and the estimated costs of implementing the solution.

Understanding task types and processing logic

In this section, I’ll discuss the use cases surrounding built-in vs custom task types and processing logic.

Built-in task types that implement built-in GUIs and built-in processing logic

Ground Truth provides several built-in task types that cover many image, text, video, video frame, and 3D point cloud labeling use cases.

If you want to implement one of these built-in task types, along with a default labeling GUI, creating a labeling job requires no customization steps.

Custom task types that implement custom GUIs and custom processing logic

If the built-in task types don’t satisfy your labeling job requirements, the options for customizing the GUI as well as the preprocessing and postprocessing logic are nearly endless by way of the custom labeling workflow feature.

With this feature, instead of choosing a built-in task type, you define the preprocessing and postprocessing logic via your own Lambda functions. You also have full control over the labeling GUI using HTML elements and the Liquid-based template system. This enables you to do some really cool customization, including Angular framework integration. For more information, see Building a custom Angular application for labeling jobs with Amazon SageMaker Ground Truth.

For more details on custom workflows, see Creating Custom Labeling Workflows and Creating custom labeling jobs with AWS Lambda and Amazon SageMaker Ground Truth.

Built-in task types that implement custom GUIs and built-in processing logic

So far, I’ve discussed the built-in (100% out-of-the-box) option and the custom workflow (100% custom GUI and logic) option for running a job.

What if you want to implement a custom GUI while keeping the built-in preprocessing and postprocessing logic that the built-in task types provide? That way, you can adjust the GUI exactly the way you want while still relying on the latest AWS-managed preprocessing and postprocessing logic (and avoiding the need to maintain another codebase).

You can, and I’ll show you how, step-by-step.

Prerequisites

To complete this solution, you need to set up the following prerequisites:

Setting up an AWS account

In this post, you work directly with IAM, SageMaker, AWS KMS, and Amazon S3, so if you haven’t already, create an AWS account. Following along with this post incurs AWS usage charges, so be sure to shut down and delete resources when you’re finished.

Setting up the AWS CLI

Because we use some parameters not available (as of this writing) on the AWS Management Console, you need access to the AWS CLI. For more information, see Installing, updating, and uninstalling the AWS CLI.

All Ground Truth, Amazon S3, and Lambda configurations for this post must be set up within the same Region. This post assumes you’re operating all services out of the us-west-2 region. If you’re operating within another Region, be sure to modify your setup accordingly for a same-Region setup.

Setting up IAM permissions

If you created labeling jobs in the past with Ground Truth, you may already have the permissions needed to implement this solution. Those permissions include the following policies:

  • SageMakerFullAccess – To have access to the SageMaker GUI and S3 buckets to perform the steps outlined in this post, you need the SageMakerFullAccess policy applied to the user, group, or role assumed for this post.
  • AmazonSageMakerGroundTruthExecution – The Ground Truth labeling jobs you create in this post need to run with an execution role that has the AmazonSageMakerGroundTruthExecution policy attached.

If you have the permissions required to create these roles yourself, the SageMaker GUI walks you through a wizard to set them up. If you don’t have access to create these roles, ask your administrator to create them for you to use during job creation and management.

Setting up an S3 bucket

You need an S3 bucket in the us-west-2 Region to host the SageMaker manifest and categories files for the labeling job. By default, the SageMakerFullAccess and AmazonSageMakerGroundTruthExecution policies only grant access to S3 buckets containing sagemaker or groundtruth in their name (for example, buckets named my-awesome-bucket-sagemaker or marketing-groundtruth-datasets).

Be sure to name your buckets accordingly, or modify the policy accordingly to provide the appropriate access.

For more information on creating a bucket, see Step 1: Create an Amazon S3 Bucket. There is no need for public access to this bucket, so don’t grant it.
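For example, a quick AWS CLI sketch that creates a suitably named bucket in us-west-2 (the bucket name here is only a placeholder; bucket names must be globally unique):

aws s3 mb s3://my-labeling-demo-sagemaker --region us-west-2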

As mentioned earlier, all the Ground Truth, Amazon S3, and Lambda configurations for this solution must be in the same Region. For this post, we use us-west-2.

Setting up the Ground Truth work team

When you create a labeling job, you need to assign it to a predefined work team that works on it. If you haven’t created a work team already (or want to create a specific one just for this post), see Create and Manage Workforces.

Setting up AWS KMS

With security as job zero, make sure to encrypt the output manifest file that the job creates. To do this, at job creation time, you need to reference a KMS key ID to encrypt the output of the custom Ground Truth job in your S3 bucket.

By default, each account has an AWS managed key (aws/s3) created automatically. For this post, you can use the key ID of the AWS managed key, or you can create and use your own customer managed key.

For more information about creating and using keys with AWS KMS, see Getting started.
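For example, you can look up the key ID of the AWS managed aws/s3 key with the AWS CLI. This is a quick sketch; the returned value is what you later use as your KMS key ID:

aws kms describe-key --key-id alias/aws/s3 --query 'KeyMetadata.KeyId' --output text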

Estimated costs

Running this solution incurs costs for the following:

  • Ground Truth labeling – Labeling costs for each job are $0.56 when using your own private workforce (other workforce types, including Mechanical Turk, may have additional costs). For more information, see Amazon SageMaker Ground Truth pricing.
  • Amazon S3 storage, retrieval, and data transfer – These costs are less than $0.05 (this assumes you delete all files when you’re finished, and operate the solution for a day or less). For more information, see Amazon S3 pricing.
  • Key usage – The cost of an AWS managed KMS key is less than $0.02 for a day’s worth of usage. Storage and usage costs for a customer managed key may be higher. For more information, see AWS Key Management Service pricing.

Setting up the manifest, category, and GUI files

Now that you have met the prerequisites, you can create the manifest, categories, and GUI files.

Creating the files

We first create the dataset.manifest file, which we use as the input dataset for the labeling job.

Each object in dataset.manifest contains a line of text describing a person, animal, or plant. One or more of these lines of text is presented as tasks to your workers; they’re responsible for correctly identifying which of the three classifications the line of text best fits.

For this post, dataset.manifest only has seven lines (workers can label up to seven objects), but this input dataset file could have up to 100,000 entries.

Create a file locally named dataset.manifest that contains the following text:

{"source":"His nose could detect over 1 trillion odors!"}
{"source":"Why do fish live in salt water? Because pepper makes them sneeze!"}
{"source":"What did the buffalo say to his son when he went away on a trip? Bison!"}
{"source":"Why do plants go to therapy? To get to the roots of their problems!"}
{"source":"What do you call a nervous tree? A sweaty palm!"}
{"source":"Some kids in my family really like birthday cakes and stars!"}
{"source":"A small portion of the human population carries a fabella bone."}

Next, we create the categories.json file. This file is used by Ground Truth to define the categories used to label the data objects.

Create a file locally named categories.json that contains the following code:

{
    "document-version": "2018-11-28",
    "labels": [{
            "label": "person"
        },
        {
            "label": "animal"
        },
        {
            "label": "plant"
        }
    ]
}

Finally, we create the worker_gui.html file. This file, when rendered, provides the GUI for the workers’ labeling tasks. The options are endless, but for this post, we create a custom GUI that adds the following custom features:

  • An additional Submit button that is styled larger than the default.
  • Shortcut keys for submitting and resetting the form.
  • JavaScript logic to programmatically modify a CSS style (break-all) on the task text output.

Make this custom GUI by creating a file locally named worker_gui.html containing the following code:

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
  <crowd-classifier
    name="crowd-classifier"
    categories="{{ task.input.labels | to_json | escape }}"
    header="Please classify"
  >

    <classification-target>
      <strong>{{ task.input.taskObject }}</strong>
    </classification-target>
    <full-instructions header="Full Instructions">
      <div>
        <p>Based on the general subject or topic of each sentence presented, please classify it as only one of the following: person, animal, or plant. </p>
      </div>
    </full-instructions>

    <short-instructions>
      Complete tasks
    </short-instructions>
  </crowd-classifier>
</crowd-form>

<script>

  document.addEventListener('all-crowd-elements-ready', () => {
    // Creating new button to inject in label pane
    const button = document.createElement('button');
    button.textContent = 'Submit';
    button.classList.add('awsui-button', 'awsui-button-variant-primary', 'awsui-hover-child-icons');

    // Editing styling to make it larger
    button.style.height = '60px';
    button.style.width = '100px';
    button.style.margin = '15px';

    // Adding onclick for submission
    const crowdForm = document.querySelector('crowd-form');
    button.onclick = () => crowdForm.submit();

    // Injecting
    const crowdClassifier = document.querySelector('crowd-classifier').shadowRoot;
    const labelPane = crowdClassifier.querySelector('.category-picker-wrapper');
    labelPane.appendChild(button);

    // Adding Enter (submit) and r (reset) hotkeys
    document.addEventListener('keydown', e => {
      if (e.key === 'Enter') {
        crowdForm.submit();
      }
      if (e.key === 'r') {
        crowdForm.reset();
      }

    })

    // Implement break-all style in the layout to handle long text tasks
    const annotationTarget = crowdClassifier.querySelector('.annotation-area.target');
    annotationTarget.style.wordBreak = 'break-all';
  });
</script>

Previewing the GUI in your web browser

While working on the worker_gui.html file, you may find it useful to preview what you’re building.

At any time, you can open the worker_gui.html file from your local file system in your browser for a limited preview of the GUI you’re creating. Some dynamic data, such as that provided by the Lambda preprocessing functions, may not be visible until you run the job from the job status preview page or worker portal.

To preview with real data, you can create a custom job with Lambda functions. For instructions, see Creating custom labeling jobs with AWS Lambda and Amazon SageMaker Ground Truth. You can preview live from the Ground Truth console’s Create labeling job flow.

For more information about the Liquid-based template system, see Step 2: Creating your custom labeling task template.

Uploading the files to Amazon S3

You can now upload all three files to the root directory of your S3 bucket. When uploading these files to Amazon S3, accept all defaults. For more information, see How Do I Upload Files and Folders to an S3 Bucket?
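If you prefer the AWS CLI over the console, the upload looks something like the following (replace YOUR_BUCKET_NAME with your bucket name):

aws s3 cp dataset.manifest s3://YOUR_BUCKET_NAME/
aws s3 cp categories.json s3://YOUR_BUCKET_NAME/
aws s3 cp worker_gui.html s3://YOUR_BUCKET_NAME/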

Creating the custom labeling job

After you upload the files to Amazon S3, you can create your labeling job. For some use cases, the SageMaker console provides the needed interface for creating both built-in and custom workflows. In our use case, we use the AWS CLI because it provides additional options not yet available (as of this writing) on the SageMaker console.

The following scripting instructions assume you’re on macOS or Linux. If you’re on Windows, you may need to modify the extension and contents of the script for it to work, depending on your environment.

Create a file called createCustom.sh (provide your bucket name, execution role ARN, KMS key ID, and work team ARN):

aws sagemaker create-labeling-job \
--labeling-job-name $1 \
--label-attribute-name "aws-blog-demo" \
--label-category-config-s3-uri "s3://YOUR_BUCKET_NAME/categories.json" \
--role-arn "YOUR_SAGEMAKER_GROUNDTRUTH_EXECUTION_ROLE_ARN" \
--input-config '{
  "DataSource": {
    "S3DataSource": {
      "ManifestS3Uri": "s3://YOUR_BUCKET_NAME/dataset.manifest"
    }
  }
}' \
--output-config '{
        "KmsKeyId": "YOUR_KMS_KEY_ID",
        "S3OutputPath": "s3://YOUR_BUCKET_NAME/output"
}' \
--human-task-config '{
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-west-2:081040173940:function:ACS-TextMultiClass"
        },
        "TaskAvailabilityLifetimeInSeconds": 21600,
        "TaskTimeLimitInSeconds": 3600,
        "NumberOfHumanWorkersPerDataObject": 1,
        "PreHumanTaskLambdaArn":  "arn:aws:lambda:us-west-2:081040173940:function:PRE-TextMultiClass",
        "WorkteamArn": "YOUR_WORKTEAM_ARN",
        "TaskDescription": "Select all labels that apply",
        "MaxConcurrentTaskCount": 1000,
        "TaskTitle": "Text classification task",
        "UiConfig": {
            "UiTemplateS3Uri": "s3://YOUR_BUCKET_NAME/worker_gui.html"
        }
    }'

Make sure to use your work team ARN, not your workforce ARN. For your KMS key, use the key ID of the AWS managed or customer managed key you want to encrypt the output with. For instructions on retrieving your key, see Finding the key ID and ARN. For more information about types of KMS keys, see Customer master keys (CMKs).

Make the file executable via the command chmod 700 createCustom.sh.

Almost done! But before we run the script, let’s step through what this script is doing in more detail. The script runs the aws sagemaker create-labeling-job CLI command with the following parameters:

  • --labeling-job-name – We set this value to $1, which translates to the argument we pass on the command line when we run it.
  • --label-attribute-name – The attribute name to use for the label in the output manifest file.
  • --label-category-config-s3-uri – The path to the categories.json file we previously uploaded to Amazon S3.
  • --role-arn – The ARN of the IAM role SageMaker runs the job under. If you aren’t sure what this value is, your administrator should be able to provide it to you.
  • --input-config – Points to the location of the input dataset manifest file.
  • --output-config – Points to a KMS key ID and the job’s output path.
  • --human-task-config – Provides the following parameters:
    • PreHumanTaskLambdaArn – The built-in AWS-provided Lambda function that performs the same preprocessing logic as that found in the built-in text classification job type. It handles reading the dataset manifest file in Amazon S3, parsing it, and providing the GUI with the appropriate task data.
    • AnnotationConsolidationLambdaArn – The built-in AWS-provided Lambda function that performs the same postprocessing logic as that found in the built-in text classification job type. It handles postprocessing of the data after each labeler submits an answer. As a reminder, all Ground Truth, Amazon S3, and Lambda configurations for this post must be set up within the same Region (for this post, us-west-2). For non us-west-2 Lambda ARN options, see create-labeling-job.
    • TaskAvailabilityLifetimeInSeconds – The length of time that a task remains available for labeling by human workers.
    • TaskTimeLimitInSeconds – The amount of time that a worker has to complete a task.
    • NumberOfHumanWorkersPerDataObject – The number of human workers that label an object.
    • WorkteamArn – The ARN of the work team assigned to complete the tasks. Make sure to use your work team ARN and not your workforce ARN in the script.
    • TaskDescription – A description of the task for your human workers.
    • MaxConcurrentTaskCount – Defines the maximum number of data objects that can be labeled by human workers at the same time.
    • TaskTitle – A title for the task for your human workers.
    • UiTemplateS3Uri – The S3 bucket location of the GUI template that we uploaded earlier. This is the HTML template used to render the worker GUI for labeling job tasks.

For more information about the options available when creating a labeling job from the AWS CLI, see create-labeling-job.

Running the job

Now that you’ve created the script with all the proper parameters, it’s time to run it! To run the script, enter ./createCustom.sh JOBNAME from the command line, providing a unique name for the job.

In my example, I named the job gec-custom-template-300, and my command line looked like the following:

gcohen $: ./createCustom.sh gec-custom-template-300

{
"LabelingJobArn": "arn:aws:sagemaker:us-west-2:xxyyzz:labeling-job/gec-custom-template-300"
}

Checking the job status and previewing the GUI

Now that we’ve submitted the job, we can easily check its status on the console.

  1. On the SageMaker console, under Ground Truth, choose Labeling jobs.

You should see the job we just submitted.

  2. Choose the job to get more details.
  3. Choose View labeling tool to preview what our labeling workers see when they take the job.
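You can also check the job status from the AWS CLI with describe-labeling-job, as in this sketch using the example job name from this post:

aws sagemaker describe-labeling-job --labeling-job-name gec-custom-template-300 --query 'LabelingJobStatus'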

In addition, by using AWS KMS encryption, you can specify authorized users who can decrypt the output manifest file. Who exactly is authorized to decrypt this file varies depending on whether the key is customer managed or AWS managed. For specifics on access permissions for a given key, review the key’s key policy.

Conclusion

In this post, I demonstrated how to implement a custom labeling GUI with built-in preprocessing and postprocessing logic by way of a custom workflow. I also demonstrated how to encrypt the output with AWS KMS. The prerequisites, code, and estimated costs of running it all were also provided.

The code was provided to get you running quickly, but don’t stop there! Try experimenting by adding additional functionality to your workers’ labeling GUIs, either with your own custom libraries or third-party logic. If you get stuck, don’t hesitate to reach out directly, or post an issue on our GitHub repo issues page.


About the Author

Geremy Cohen is a Solutions Architect with AWS where he helps customers build cutting-edge, cloud-based solutions. In his spare time, he enjoys short walks on the beach, exploring the bay area with his family, fixing things around the house, breaking things around the house, and BBQing.

Read More

Building a secure search application with access controls using Amazon Kendra

For many enterprises, critical business information is often stored as unstructured data scattered across multiple content repositories. Not only is it challenging for organizations to make this information available to employees when they need it, but it’s also difficult to do so securely so relevant information is available to the right employees or employee groups.

Amazon Kendra is a highly accurate and easy-to-use intelligent search service powered by machine learning (ML). Amazon Kendra delivers secure search for enterprise applications and can make sure the results of a user’s search query only include documents the user is authorized to read. In this post, we illustrate how to build an Amazon Kendra-powered search application supporting access controls that reflect the security model of an example organization.

Amazon Kendra supports search filtering based on user access tokens that are provided by your search application, as well as document access control lists (ACLs) collected by the Amazon Kendra connectors. When user access tokens are applied, search results return links to the original document repositories and include a short description. Access control to the full document is still enforced by the original repository.

In this post, we demonstrate token-based user access control in Amazon Kendra with Open ID. We use Amazon Cognito user pools to authenticate users and provide Open ID tokens. You can use a similar approach with other Open ID providers.

Application overview

This application is designed for guests and registered users to make search queries to a document repository, and results are returned only from those documents that are authorized for access by the user. Users are grouped based on their roles, and access control is at a group level. The following table outlines which documents each user is authorized to access for our use case. The documents being used in this example are a subset of AWS public documents.

User     | Role                | Group               | Document Types Authorized for Access
Guest    |                     |                     | Blogs
Patricia | IT Architect        | Customer            | Blogs, user guides
James    | Sales Rep           | Sales               | Blogs, user guides, case studies
John     | Marketing Exec      | Marketing           | Blogs, user guides, case studies, analyst reports
Mary     | Solutions Architect | Solutions Architect | Blogs, user guides, case studies, analyst reports, whitepapers

Architecture

The following diagram illustrates our solution architecture.


The documents being queried are stored in an Amazon Simple Storage Service (Amazon S3) bucket. Each document type has a separate folder: blogs, case-studies, analyst-reports, user-guides, and white-papers. This folder structure is contained in a folder named Data. Metadata files including the ACLs are included in a folder named Meta.

We use the Amazon Kendra S3 connector to configure this S3 bucket as the data source. When the data source is synced with the Amazon Kendra index, it crawls and indexes all documents as well as collects the ACLs and document attributes from the metadata files. For this example, we use a custom attribute DocumentType to denote the type of the document.

We use an Amazon Cognito user pool to authenticate registered users, and use an identity pool to authorize the application to use Amazon Kendra and Amazon S3. The user pool is configured as an Open ID provider in the Amazon Kendra index by configuring the signing URL of the user pool.

When a registered user authenticates and logs in to the application to perform a query, the application sends the user’s access token provided by the user pool to the Amazon Kendra index as a parameter in the query API call. For guest users, there is no authentication and therefore no access token is sent as a parameter to the query API. The results of a query API call without the access token parameter only return the documents without access control restrictions.

When an Amazon Kendra index receives a query API call with a user access token, it decrypts the access token using the user pool signing URL and gets parameters such as cognito:username and cognito:groups associated with the user. The Amazon Kendra index filters the search results based on the stored ACLs and the information received in the user access token. These filtered results are returned in response to the query API call made by the application.
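To make this concrete, here is a rough AWS CLI sketch of a query call that passes the token. The index ID and token are placeholders, and the --user-context parameter shape is an assumption you should confirm against the current Query API reference:

aws kendra query \
    --index-id YOUR_KENDRA_INDEX_ID \
    --query-text "what is serverless?" \
    --user-context Token=YOUR_COGNITO_OPEN_ID_TOKEN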

The application, which the users can download with its source, is written in ReactJS using components from the AWS Amplify framework. We use the AWS Amplify console to implement the continuous integration and continuous deployment pipelines. We use an AWS CloudFormation template to deploy the AWS infrastructure, which includes the Amazon Kendra index and data source, the Amazon Cognito user pool and identity pool, and the AWS Amplify app backed by an AWS CodeCommit repository.

In this post, we provide a step-by-step walkthrough to configure the backend infrastructure, build and deploy the application code, and use the application.

Prerequisites

To complete the steps in this post, make sure you have an AWS account and access to a terminal with the AWS CLI or AWS CloudShell.

Preparing your S3 bucket as a data source

To prepare an S3 bucket as a data source, create an S3 bucket. In the terminal with the AWS CLI or AWS CloudShell, run the following commands to upload the documents and the metadata to the data source bucket:

aws s3 cp s3://aws-ml-blog/artifacts/building-a-secure-search-application-with-access-controls-kendra/docs.zip .
unzip docs.zip
aws s3 cp Data/ s3://<REPLACE-WITH-NAME-OF-S3-BUCKET>/Data/ --recursive
aws s3 cp Meta/ s3://<REPLACE-WITH-NAME-OF-S3-BUCKET>/Meta/ --recursive

Deploying the infrastructure as a CloudFormation stack

In a separate browser tab open the AWS Management Console, and make sure that you are logged in to your AWS account. Click the button below to launch the CloudFormation stack to deploy the infrastructure.

You should see a page similar to the image below:

For S3DataSourceBucket, enter your data source bucket name without the s3:// prefix, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack.

Stack creation can take 30–45 minutes to complete. While you wait, you can look at the different tabs, such as Events, Resources, and Template. You can monitor the stack creation status on the Stack info tab.


When stack creation is complete, keep the Outputs tab open. We need values from the Outputs and Resources tabs in subsequent steps.

Reviewing Amazon Kendra configuration and starting the data source sync

In the following steps, we configure Amazon Kendra to enable secure token access and start the data source sync to begin crawling and indexing documents.

  1. On the Amazon Kendra console, choose the index AuthKendraIndex, which was created as part of the CloudFormation stack.


Under User access control, token-based user access control is enabled, the signing key object is set to the Open ID provider URL of the Amazon Cognito user pool, and the user name and group are set to cognito:username and cognito:groups, respectively.


  1. In the navigation pane, choose Data sources.
  2. On the Settings tab, you can see the data source bucket being configured.
  3. Select the radio button for the data source and choose Sync now.


The data source sync can take 10–15 minutes to complete, but you don’t have to wait to move to the next step.

Creating users and groups in the Amazon Cognito user pool

In the terminal with the AWS CLI or AWS CloudShell, run the following commands to create users and groups in the Amazon Cognito user pool to use for our application. You need to copy the contents of the Physical ID column in the UserPool row from the Resources tab of the CloudFormation stack. This is the user pool ID to use in the following steps. We set AmazonKendra@2020 as the temporary password for all the users. This password is required when logging in for the first time, and Amazon Cognito enforces a password reset.

USER_POOL_ID=<PASTE-USER-POOL-ID-HERE>
aws cognito-idp create-group --group-name customer --user-pool-id ${USER_POOL_ID}
aws cognito-idp create-group --group-name AWS-Sales --user-pool-id ${USER_POOL_ID}
aws cognito-idp create-group --group-name AWS-Marketing --user-pool-id ${USER_POOL_ID}
aws cognito-idp create-group --group-name AWS-SA --user-pool-id ${USER_POOL_ID}
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username patricia --temporary-password AmazonKendra@2020
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username james  --temporary-password AmazonKendra@2020
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username john  --temporary-password AmazonKendra@2020
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username mary  --temporary-password AmazonKendra@2020
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username patricia --group-name customer
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username james --group-name AWS-Sales
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username john --group-name AWS-Marketing
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username mary --group-name AWS-SA

Building and deploying the app

Now we build and deploy the app using the following steps:

  1. On the AWS Amplify console, choose the app AWSKendraAuthApp.
  2. Choose Run build.


You can monitor the build progress on the console.


Let the build continue and complete the steps: Provision, Build, Deploy, and Verify. After this, the application is deployed and ready to use.

You can browse through the source code by opening up the CodeCommit repository. The important file to look at is src/App.tsx.

  3. Choose the link on the left to start the application in a new browser tab.


Trial run

We can now take a trial run of our app.

  1. On the login page, sign in with the username patricia and the temporary password AmazonKendra@2020.


Amazon Cognito requires you to reset your password the first time you log in. After you log in, you can see the search field.


  2. In the search field, enter a query, such as what is serverless?
  3. Expand Filter search results to see different document types.

You can select different document types to filter the search results.


  4. Sign out and repeat this process for other users that are created in the Cognito user pool, namely, james, john, and mary.

You can also choose Continue as Guest to use the app without authenticating. However, this option only shows results from blogs.


You can return to the login screen by choosing Welcome Guest! Click here to sign up or sign in.

Using the application

You can use the application we developed by making a few search queries logged in as different users. To experience how access control works, issue the same query from different user accounts and observe the difference in the search results. The following users get results from different sources:

  • Guests and anonymous users – Only blogs
  • Patricia (Customer) – Blogs and user guides
  • James (Sales) – Blogs, user guides, and case studies
  • John (Marketing) – Blogs, user guides, case studies, and analyst reports
  • Mary (Solutions Architect) – Blogs, user guides, case studies, analyst reports, and whitepapers

We can make additional queries and observe the results. Some suggested queries include “What is machine learning?”, “What is serverless?”, and “Databases”.

Cleaning up

To delete the infrastructure that was deployed as part of the CloudFormation stack, delete the stack from the AWS CloudFormation console. Stack deletion can take 20–30 minutes.

When the stack status shows as Delete Complete, go to the Events tab and confirm that each of the resources has been removed. You can also cross-verify by checking the respective management consoles for Amazon Kendra, AWS Amplify, and the Amazon Cognito user pool and identity pool.

You must delete your data source bucket separately, because it was not created as part of the CloudFormation stack.
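For example, the following AWS CLI sketch removes the bucket and everything in it (destructive, so double-check the bucket name first):

aws s3 rb s3://<REPLACE-WITH-NAME-OF-S3-BUCKET> --force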

Conclusion

In this post, we demonstrated how you can create a secure search application using Amazon Kendra. Organizations who use an Open ID-compliant identity management system with a new or pre-existing Amazon Kendra index can now enable secure token access to make sure your intelligent search applications are aligned with your organizational security model. For more information about access control in Amazon Kendra, see Controlling access to documents in an index.


About the Author

Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS partners to help them in their cloud journey.

Read More

Extracting buildings and roads from AWS Open Data using Amazon SageMaker


Sharing data and computing in the cloud allows data users to focus on data analysis rather than data access. Open Data on AWS helps you discover and share public open datasets in the cloud. The Registry of Open Data on AWS hosts a large amount of public open data. The datasets range from genomics to climate to transportation information. They are well structured and easily accessible. Additionally, you can use these datasets in machine learning (ML) model development in the cloud.

In this post, we demonstrate how to extract buildings and roads from two large-scale geospatial datasets: SpaceNet satellite images and USGS 3DEP LiDAR data. Both datasets are hosted on the Registry of Open Data on AWS. We show you how to launch an Amazon SageMaker notebook instance and walk you through the tutorial notebooks at a high level. The notebooks reproduce winning algorithms from the SpaceNet challenges (which only use satellite images). In addition to the SpaceNet satellite images, we compare and combine the USGS 3D Elevation Program (3DEP) LiDAR data to extract the same features.

This post demonstrates running ML services on AWS to extract features from large-scale geospatial data in the cloud. By following our examples, you can train the ML models on AWS, apply the models to other regions where satellite or LiDAR data is available, and experiment with new ideas to improve the performances. For the complete code and notebooks of this tutorial, see our GitHub repo.

Datasets

In this section, we provide more detail about the datasets we use in this post.

SpaceNet dataset

SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. It’s a large corpus of labeled satellite imagery. The project has also launched a series of competitions ranging from automatic building extraction, road extraction, to recently published multi-temporal urban development analysis. The dataset covers 11 areas of interest (AOIs), including Rio de Janeiro, Las Vegas, and Paris. For this post, we use Las Vegas; the images in this AOI cover a 216 km² area with 151,367 building polygon labels and 3,685 km of labeled roads.

The following image is from DigitalGlobe’s SpaceNet Challenge Concludes First Round, Moves to Higher Resolution Challenges.

USGS 3DEP LiDAR dataset

Our second dataset comes from the USGS 3D Elevation Program (3DEP) in the form of LiDAR (Light Detection and Ranging) data. The program’s goal is to complete the acquisition of nationwide LiDAR to provide the first-ever national baseline of consistent high-resolution topographic elevation data, collected in a timeframe of less than a decade. LiDAR is a remote sensing method that emits hundreds of thousands of near-infrared light pulses each second to measure distances to the Earth. These light pulses generate precise, 3D information about the shape of the Earth and its surface characteristics.

The USGS 3DEP LiDAR is presented in two formats. The first is a public repository in Entwine Point Tiles (EPT) format, which is a lossless, full resolution, streamable octree structure. This format is suitable for online visualization. The following image shows an example of LiDAR visualization in Las Vegas.

The other format is in LAZ (compressed LAS) with requester-pays access. In this post, we use LiDAR data in the second format.

Data registration

For this post, we select the Las Vegas AOI where both SpaceNet satellite images and USGS LiDAR data are available. Among SpaceNet data categories, we use the 30cm resolution pan-sharpened 3-band RGB geotiff and corresponding building and road labels. To improve the visual feature extraction performance, we process the data by white balancing and converting it to 8-bit (0–255) values for ease of postprocessing. The following graph shows the RGB value aggregated histogram of all images after processing.

Satellite images are 2D images, whereas the USGS LiDAR data are 3D point clouds and therefore require conversion and projection to align with 2D satellite images. We use Matlab and LAStools to map each 3D LiDAR point to a pixel-wise location corresponding to SpaceNet tiles, and generate two sets of attribute images: elevation and reflectivity intensity. The elevation ranges from approximately 2,000–3,000 feet, and the intensity ranges from 0–5,000 units. The following graphs show the aggregated histograms of all images for elevation and reflectivity intensity values.

Finally, we take either one of the LiDAR attributes and merge it with the RGB images. The images are saved in 16-bit because LiDAR attribute values can be larger than 255, the 8-bit upper limit. We make this processed and merged data available via a publicly accessible Amazon Simple Storage Service (Amazon S3) bucket for this tutorial. The following are three samples of merged RGB+LiDAR images. From left to right, the columns are RGB image, LiDAR elevation attribute, and LiDAR reflectivity intensity attribute.
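As a rough illustration of the merge step (not the exact Matlab/LAStools pipeline used to produce the hosted data, and with hypothetical file names), a single tile can be stacked into a 4-channel, 16-bit image like this:

import numpy as np
import skimage.io

# Read a 3-band RGB tile and the matching single-band LiDAR attribute image
# (elevation or reflectivity intensity), already projected to the same grid.
rgb = skimage.io.imread('tile_rgb.tif').astype(np.uint16)            # (H, W, 3)
lidar = skimage.io.imread('tile_lidar_elev.tif').astype(np.uint16)   # (H, W)

# Stack into a 4-channel image; 16-bit because LiDAR values can exceed 255.
merged = np.dstack([rgb, lidar])                                      # (H, W, 4)
skimage.io.imsave('tile_rgb_elev.tif', merged)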

Creating a notebook instance

SageMaker is a fully managed service that allows you to build, train, deploy, and monitor ML models. Its modular design allows you to pick and choose features that suit your use cases at different stages of the ML lifecycle. SageMaker offers capabilities that abstract the heavy lifting of infrastructure management and provide the agility and scalability you desire for large-scale ML activities with different features and a pay-as-you-use pricing model.

The SageMaker on-demand notebook instance is a fully managed compute instance running the Jupyter Notebook app. SageMaker manages creating instances and related resources. Notebooks contain everything needed to run or recreate an ML workflow. You can use Jupyter notebooks in your notebook instance to prepare and process data, write code to train models, deploy models to SageMaker hosting, and test or validate your models. For different problems, you can select the type of instance to best fit each scenario (such as high throughput, high memory usage, or real-time inference).

Although training the deep learning model can take a long time, you can reproduce the inference part of this post with a reasonable computing cost. It’s recommended to run the notebooks inside a SageMaker notebook instance of type ml.p3.8xlarge (4 x V100 GPUs) or larger. Network training and inference is a memory-intensive process; if you run into out of memory or out of RAM errors, consider decreasing the batch_size in the configuration files (.yml format).

To create a notebook instance, complete the following steps:

  1. On the SageMaker console, choose Notebook instances.
  2. Choose Create notebook instance.
  3. Enter the name of your notebook instance, such as open-data-tutorial.
  4. Set the instance type to ml.p3.8xlarge.
  5. Choose Additional configuration.
  6. Set the volume size to 60 GB.
  7. Choose Create notebook instance.
  8. When the instance is ready, choose Open in JupyterLab.
  9. From the launcher, you can open a terminal and run the provided code.

Deploying the environment and downloading the datasets

At the JupyterLab terminal, run the following commands:

$ cd ~/SageMaker/
$ git clone https://github.com/aws-samples/aws-open-data-satellite-lidar-tutorial.git
$ cd aws-open-data-satellite-lidar-tutorial

This downloads the tutorial repository from GitHub and takes you to the tutorial directory.

Next, set up a Conda environment by running setup-env.sh (see the following code). You can change the environment name from tutorial_env to any other name.

$ ./setup-env.sh tutorial_env

This may take 10–15 minutes to complete, after which you have a new Jupyter kernel called conda_tutorial_env, or conda_[name] if you change the environment name. You may need to wait a few minutes after conda completion and refresh the Jupyter page.

Next, download the necessary data from the public S3 bucket hosting the tutorial files:

$ ./download-from-s3.sh

This may take up to 5 minutes to complete and requires at least 23 GB of notebook instance storage.

Building extraction

Launch the notebook Building-Footprint.ipynb to reproduce this section.

The first and second SpaceNet challenges aimed to extract building footprints from satellite images at various AOIs. The fourth SpaceNet challenge posed a similar task with more challenging off-nadir (oblique-looking angles) imagery. We reproduce a winning algorithm and evaluate its performance with both RGB images and LiDAR data.

Training data

In the Las Vegas AOI, SpaceNet data is tiled to size 200m x 200m. We select 3,084 tiles in which both SpaceNet imagery and LiDAR data are available and merge them together. Unfortunately, the labels of test data for scoring in the SpaceNet challenges are not published, so we split the merged data into 70% and 30% for training and evaluation. Between LiDAR elevation and intensity, we choose elevation for building extractions. See the following code:


import numpy as np
import pandas as pd

# Create Pandas data frame, containing columns 'image' and 'label'.
total_df = pd.DataFrame({'image': img_path_list,
                         'label': mask_path_list})
# Split this data frame into training data and blind test data.
split_mask = np.random.rand(len(total_df)) < 0.7
train_df = total_df[split_mask]
test_df = total_df[~split_mask]

Model

We reproduce the winning algorithm from SpaceNet challenge 4 by XD_XD. The model has a U-net architecture with skip-connections between encoder and decoder, and a modified VGG16 as backbone encoder. The model takes three different types of input:

  • Three-channel RGB image, same as the original contest
  • One-channel LiDAR elevation image
  • Four-channel RGB+LiDAR merged image

We train three models based on the three types of inputs described in this post and compare their performances.

The label for training is a binary mask converted from the polygon GeoJSON by Solaris, an ML pipeline library developed by CosmiQ Works. We select a combined loss of binary cross-entropy and Jaccard loss with a weight factor alpha = 0.8:

\mathcal{L} = \alpha \, \mathcal{L}_{\mathrm{BCE}} + (1 - \alpha) \, \mathcal{L}_{\mathrm{Jaccard}}
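For intuition, here is a minimal PyTorch-style sketch of that combined loss (not the Solaris implementation; it assumes pred already contains sigmoid probabilities with the same shape as target):

import torch
import torch.nn.functional as F

def combined_loss(pred, target, alpha=0.8, eps=1e-7):
    # Binary cross-entropy term on per-pixel probabilities.
    bce = F.binary_cross_entropy(pred, target)
    # Soft Jaccard (IoU) term computed on the same probabilities.
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    jaccard = 1.0 - (intersection + eps) / (union + eps)
    # Weighted combination with alpha = 0.8, as in the training setup below.
    return alpha * bce + (1.0 - alpha) * jaccard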

We train the models with batch size 20, the Adam optimizer, and a learning rate of 1e-4 for 100 epochs. The training takes approximately 100 minutes to finish on an ml.p3.8xlarge SageMaker notebook instance. See the following code:

import solaris as sol

# Select config file and link training datasets.
config = sol.utils.config.parse('./configs/buildings/RGB+ELEV.yml')
config['training_data_csv'] = train_csv_path

# Load customized multi-channel input VGG16-Unet model.
from networks.vgg16_unet import get_modified_vgg16_unet

custom_model = get_modified_vgg16_unet(
    in_channels=config['data_specs']['channels'])
custom_model_dict = {
    'model_name': 'modified_vgg16_unet',
    'arch': custom_model}

# Create solaris trainer, and train with configuration.
trainer = sol.nets.train.Trainer(config, custom_model_dict=custom_model_dict)
trainer.train()

The following images show examples of building extraction inputs and outputs. From left to right, the columns are RGB image, LiDAR elevation image, model prediction trained with RGB and LiDAR data, and ground truth building footprint mask.

Evaluation

Use the trained model to perform model inference on the test dataset (30% hold-out):

custom_model_dict = {
    'model_name': 'modified_vgg16_unet',
    'arch': custom_model,
    'weight_path': config['training']['model_dest_path']}
config['train'] = False

# Create solaris inferer, and do inference on test data.
inferer = sol.nets.infer.Inferer(config, custom_model_dict=custom_model_dict)
inferer(test_df)

After model inference, we evaluate the model performance using the same metric as in the original contest: an aggregated F-1 score with an intersection over union (IoU) ≥ 0.5 criterion. There are two steps to compute this score. First, convert the building footprint binary masks to proposed polygons:

# Convert these probability maps to building polygons.
def pred_to_prop(pred_file, img_path):
    pred_path = os.path.join(pred_dir, pred_file)
    pred = skimage.io.imread(pred_path)[..., 0]
    prop_file = \
        pred_file.replace('RGB+ELEV', 'geojson_buildings').replace('tif', 'geojson')
    prop_path = os.path.join(prop_dir, prop_file)
    prop = sol.vector.mask.mask_to_poly_geojson(
        pred_arr=pred,
        reference_im=img_path,
        do_transform=True,
        min_area=1e-10,
        output_path=prop_path)

Next, compare the proposed polygons against the ground truth polygons (SpaceNet building labels), and count the aggregated F-1 scores:

# Evaluate aggregated F-1 scores.
def compute_score(prop_path, bldg_path):
    evaluator = sol.eval.base.Evaluator(bldg_path)
    evaluator.load_proposal(prop_path, conf_field_list=[])
    score = evaluator.eval_iou(miniou=0.5, calculate_class_scores=False)
    # score_list.append(score[0]) # skip because single-class
    return score[0] # single-class

The following table shows the F-1 scores from the three models trained with RGB images, LiDAR elevation images, and RGB+LiDAR merged images. Compared to using RGB only as in the original SpaceNet competition, the model trained using only LiDAR elevation images achieves a score only a few percent worse. When combining both RGB and LiDAR elevation in training, the model outperforms the RGB-only model. For reference, the F-1 scores of the top three teams from SpaceNet challenge 2 in this AOI are 0.885, 0.829, and 0.787 (we don’t compare them directly because they use a different test set for scoring).

Training data type | Aggregated F-1 score
RGB images         | 0.8268
LiDAR elevation    | 0.80676
RGB+LiDAR merged   | 0.85312

Road extraction

To reproduce this section, launch the notebook Road-Network.ipynb.

The third SpaceNet challenge aimed to extract road networks from satellite images. The fifth SpaceNet challenge added predicting road speed along with the road network extraction in order to minimize travel time and plan optimal routing. Similar to building extraction, we reproduce a top winning algorithm, train different models with either RGB images, LiDAR attributes, or both of them, and evaluate their performance.

Training data

The road network extraction uses larger tiles with size 400m x 400m. We generate 918 merged tiles, and split by 70%/30% for training and evaluation. In this case, we select reflectivity intensity for road extraction because road surfaces often consist of materials that have distinctive reflectivity among backgrounds, such as a paved surface, dirt road, or asphalt.

Model

We reproduce the CRESI algorithm for road networks extraction. It also has a U-net architecture but uses ResNet as the backbone encoder. Again, we train the model with three different types of input:

  • Three-channel RGB image
  • One-channel LiDAR intensity image
  • Four-channel RGB+LiDAR merged image

To extract road location and speed together, a binary road mask doesn’t provide enough information for training. As mentioned in the CRESI paper, we can convert the speed metadata to either a continuous mask (0–1 values) or a multi-class binary mask. Because their test results show that the multi-class binary mask performs better, we use the latter conversion scheme. The following images break down the eight-class road masks. The first seven binary masks represent roads corresponding to seven speed bins within 0–65 mph. The eighth mask (bottom right) represents the aggregation of all previous masks.

The following images show the visualization of multi-class road masks. The left is the RGB image tile. The right is the road mask with color coding in which the yellow-to-red colormap represents speed values from low to high speed (0–65 mph).

We train the model with the same setup as in the building extraction. The following images show examples of road extraction inputs and outputs. From left to right, the columns are RGB image, LiDAR reflectivity intensity image, model prediction trained with RGB and LiDAR data, and ground truth road mask.

Evaluation

We implement the average path length similarity (APLS) score to evaluate the road extraction performance. This metric is used in SpaceNet road challenges because APLS considers both logical topology (connections within road network) and physical topology (location of the road edges and nodes). The APLS can be weighted by either length or travel time; a higher score means better performance. See the following code:

# Skeletonize the prediction mask into non-geo road network graph.
!python ./libs/apls/skeletonize.py --results_dir={results_dir}
# Match geospatial info and create geo-projected graph.
!python ./libs/apls/wkt_to_G.py --imgs_dir={img_dir} --results_dir={results_dir}
# Infer road speed on each graph edge based on speed bins.
!python ./libs/apls/infer_speed.py --results_dir={results_dir} \
    --speed_conversion_csv_file='./data/roads/speed_conversion_binned7.csv'

# Compute length-based APLS score.
!python ./libs/apls/apls.py --output_dir={results_dir} \
    --truth_dir={os.path.join(data_dir, 'geojson_roads_speed')} \
    --im_dir={img_dir} \
    --prop_dir={os.path.join(results_dir, 'graph_speed_gpickle')} \
    --weight='length'

# Compute time-based APLS score.
!python ./libs/apls/apls.py --output_dir={results_dir} \
    --truth_dir={os.path.join(data_dir, 'geojson_roads_speed')} \
    --im_dir={img_dir} \
    --prop_dir={os.path.join(results_dir, 'graph_speed_gpickle')} \
    --weight='travel_time_s'

We convert multi-class road mask predictions to skeleton and speed-weighted graph and compute APLS scores. The following table shows the APLS scores of the three models. Similar to the building extraction results, the LiDAR-only result achieves scores close to the RGB-only result, whereas RGB+LiDAR gives the best performance.

Training data type | APLS (length) | APLS (time)
RGB images         | 0.59624       | 0.54298
LiDAR intensity    | 0.57811       | 0.52697
RGB+LiDAR merged   | 0.63651       | 0.58518

Conclusion

We demonstrate how to extract buildings and roads from two large-scale geospatial datasets hosted on the Registry of Open Data on AWS using a SageMaker notebook instance. The SageMaker notebook instance contains everything needed to run or recreate an ML workflow. It’s easy to use and customize to best fit different scenarios.

By using the LiDAR dataset from the Registry of Open Data on AWS and reproducing winning algorithms from SpaceNet building and road challenges, we show that you can use LiDAR data to perform the same task with similar accuracy, and even outperform the RGB models when combined.

With the full code and notebooks shared on GitHub and the necessary data hosted in the public S3 bucket, you can reproduce the map feature extraction tasks, apply the models to any other area of interest, and innovate with new ideas to improve model performance. For the complete code and notebooks of this tutorial, see our GitHub repo.


About the Authors

Yunzhi Shi is a data scientist at the Amazon ML Solutions Lab where he helps AWS customers address business problems with AI and cloud capabilities. Recently, he has been building computer vision, search, and forecast solutions for various customers.

 

 

Xin Chen is a senior manager at Amazon ML Solutions Lab, where he leads the Automotive Vertical and helps AWS customers across different industries identify and build machine learning solutions to address their organization’s highest return-on-investment machine learning opportunities. Xin obtained his Ph.D. in Computer Science and Engineering from the University of Notre Dame.

 

 

Tianyu Zhang is a data scientist at the Amazon ML Solutions Lab. He helps AWS customers solve business problems by applying ML and AI techniques. Most recently, he has built NLP and predictive models for procurement and sports.

Read More

How an important change in web standards impacts your image annotation jobs


Earlier in 2020, widely used browsers like Chrome and Firefox changed their default behavior for rotating images based on image metadata, referred to as EXIF data. Previously, images always displayed in browsers exactly as they were stored on disk, which is typically unrotated. After the change, images now rotate according to a piece of image metadata called the orientation value. This has important implications for the entire machine learning (ML) community. For example, if the EXIF orientation isn’t considered, applications that you use to annotate images may display images in unexpected orientations and result in confusing or incorrect labels.

For example, before the change, by default images would display in the orientation stored on the device, as shown in the following image. After the change, by default, images display according to the orientation value in EXIF data, as shown in the second image.

Here, the image was stored in portrait mode, with EXIF data attached to indicate it should be displayed with a landscape orientation.

To ensure images are predictably oriented, ML annotation services need to be able to view image EXIF data. The recent change to global web standards requires you to grant explicit permission to image annotation services to view your image EXIF data.

To guarantee data consistency between workers and across datasets, the annotation tools used by Amazon SageMaker Ground Truth, Amazon Augmented AI (Amazon A2I), and Amazon Mechanical Turk need to understand and control orientations of input images that are shown to workers. Therefore, from January 12, 2021, onward, AWS requires that you add a cross-origin resource sharing (CORS) header configuration to Amazon Simple Storage Service (Amazon S3) buckets that contain labeling job or human review task input data. This policy allows these AWS services to view EXIF data and verify that images are predictably oriented in labeling and human review tasks.

This post provides details on the image metadata change, how it can impact labeling jobs and human review tasks, and how you can update your S3 buckets with these new, required permissions.

What is EXIF data?

EXIF data is metadata that tells us things about the image. EXIF data typically includes the height and width of an image but can also include things like the date a photo was taken, what kind of camera was used, and even GPS coordinates where the image was captured. For the image annotation web application community, the orientation property of EXIF is about to become very important.

When you take a photo, whether it’s landscape or portrait, the data is written to storage in the landscape orientation. Instead of storing a portrait photo in the portrait orientation, the camera writes a piece of metadata to the image to explain to applications how that image should be rotated when it’s shown to humans. To learn more, see Exif.
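If you want to inspect this yourself, a small Pillow sketch (the file name is a placeholder) prints the orientation value stored in an image’s EXIF data:

from PIL import Image

# EXIF tag 274 is Orientation: a value of 1 means "display as stored";
# other values (for example 3, 6, 8) tell viewers to rotate the image.
with Image.open('photo.jpg') as img:
    orientation = img.getexif().get(274)
    print('EXIF orientation value:', orientation)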

A big change to browsers: Why EXIF data is important

Until recently, popular web browsers such as Chrome and Firefox didn’t use EXIF orientation values, so the images that users annotated were never rotated: the annotation data matched how the image was stored, and the orientation value didn’t matter.

Earlier in 2020, Chrome and Firefox changed their default behavior to begin using EXIF data by default. To make sure image annotating tasks weren’t impacted, AWS mitigated this change by preventing rotation so that users continued to annotate images in their unrotated form. However, AWS can no longer automatically prevent the rotation of images because the web standards group W3C has decided that the ability to control image rotation violates the web’s Same Origin Policy.

Starting with Chrome 88, estimated for release on January 19th, 2021, annotation services like the ones offered by AWS require additional permissions to control the orientation of your images when they’re displayed to human workers.

When using AWS services, you can grant these permissions by adding a CORS header policy to the S3 buckets that contain your input images.

Upcoming change to AWS image annotation job security requirements

We recommend that you add a CORS configuration to all S3 buckets that contain input data used for active and future labeling jobs as soon as possible. Starting January 12th, 2021, to ensure human workers annotate your input images in a predictable orientation when you submit requests to create Ground Truth labeling jobs, Amazon A2I human review tasks, or Mechanical Turk tasks, you must add a CORS header policy to the S3 buckets that contain your input images.

If you have pre-existing active resources like Ground Truth streaming labeling jobs, you must add a CORS header policy to the S3 bucket used to create those resources. For Ground Truth, this is the input data S3 bucket identified when you created the streaming labeling job.

Additionally, if you reuse resources, such as cloning a Ground Truth labeling job, make sure the input data S3 bucket you use has a CORS header policy attached.

In the context of input image data, AWS services use CORS headers to view EXIF orientation data to control image rotation.

If you don’t add a CORS header policy to an S3 bucket that contains input data by January 12th, 2021, Ground Truth, Amazon A2I, and Mechanical Turk tasks created using this S3 bucket will fail.

Adding a CORS header policy to an S3 bucket

If you’re creating an Amazon A2I human loop or Mechanical Turk job, or you’re using the CreateLabelingJob API to create a Ground Truth labeling job, you can add a CORS policy to an S3 bucket that contains input data on the Amazon S3 console.

If you create your job through the Ground Truth console, under Enable enhanced image access, a check box is selected by default to enable a CORS configuration on the S3 bucket that contains your input manifest file, as shown in the following image. Keep this check box selected. If all of your input data is not located in the same S3 bucket as your input manifest file, you must manually add a CORS configuration to all S3 buckets that contain input data using the following instructions.

For instructions on setting the required CORS headers on the S3 bucket that hosts your images, see How do I add cross-domain resource sharing with CORS? Use the following CORS configuration code for the buckets that host your images.

The following is the code in JSON format:

[{
   "AllowedHeaders": [],
   "AllowedMethods": ["GET"],
   "AllowedOrigins": ["*"],
   "ExposeHeaders": []
}]

The following is the code in XML format:

<CORSConfiguration>
 <CORSRule>
   <AllowedOrigin>*</AllowedOrigin>
   <AllowedMethod>GET</AllowedMethod>
 </CORSRule>
</CORSConfiguration>
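
If you prefer to apply the configuration programmatically instead of through the Amazon S3 console, the following is a minimal sketch using the AWS SDK for Python (Boto3); the bucket name is a placeholder for your own input data bucket:

import boto3

s3 = boto3.client('s3')

# Apply the same GET-only CORS rule shown above to the input data bucket
s3.put_bucket_cors(
    Bucket='your-input-data-bucket',  # placeholder: replace with your bucket name
    CORSConfiguration={
        'CORSRules': [{
            'AllowedHeaders': [],
            'AllowedMethods': ['GET'],
            'AllowedOrigins': ['*'],
            'ExposeHeaders': []
        }]
    }
)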

The following GIF demonstrates the instructions found in the Amazon S3 documentation to add a CORS header policy using the Amazon S3 console.

Conclusion

In this post, we explained how a recent decision made by the web standards group W3C will impact the ML community. AWS image annotation service providers will now require you to grant permission to view orientation values of your input images, which are stored in image EXIF data.

Make sure you enable CORS headers on the S3 buckets that contain your input images before creating Ground Truth labeling jobs, Amazon A2I human review jobs, and Mechanical Turk tasks on or after January 12th, 2021.

 


About the Authors

Talia Chopra is a Technical Writer in AWS specializing in machine learning and artificial intelligence. She works with multiple teams in AWS to create technical documentation and tutorials for customers using Amazon SageMaker, MXNet, and AutoGluon.

 

 

Phil Cunliffe is an engineer turned Software Development Manager for Amazon Human in the Loop services. He is a JavaScript fanboy with an obsession for creating great user experiences.

Read More

How Foxconn built an end-to-end forecasting solution in two months with Amazon Forecast

This is a guest post by Foxconn. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post. 

In their own words, “Established in Taiwan in 1974, Hon Hai Technology Group (Foxconn) is the world’s largest electronics manufacturer. Foxconn is also the leading technological solution provider and it continuously leverages its expertise in software and hardware to integrate its unique manufacturing systems with emerging technologies.” 

At Foxconn, we manufacture some of the most widely used electronics worldwide. Our effectiveness comes from our ability to plan our production and staffing levels weeks in advance, while maintaining the ability to respond to short-term changes. For years, Foxconn has relied on predictable demand in order to properly plan and allocate resources within our factories. However, as the COVID-19 pandemic began, the demand for our products became more volatile. This increased uncertainty impacted our ability to forecast demand and estimate our future staffing needs.

This highlighted a crucial need for us to develop an improved forecasting solution that could be implemented right away. With Amazon Forecast and AWS, our team was able to build a custom forecasting application in only two months. With limited data science experience internally, we collaborated with the Machine Learning Solutions Lab at AWS to identify a solution using Forecast. The service makes AI-powered forecasting algorithms available to non-expert practitioners. Now we have a state-of-the-art solution that has improved demand forecasting accuracy by 8%, saving an estimated $553,000 annually. In this post, I show you how easy it was to use AWS services to build an application that fit our needs.

Forecasting challenges at Foxconn

Our factory in Mexico assembles and ships electronics equipment to all regions in North and South America. Each product has its own seasonal variations and requires different levels of complexity and skill to build. Having individual forecasts for each product is important to understand the mix of skills we need in our workforce. Forecasting short-term demand allows us to staff for daily and weekly production requirements. Long-term forecasts are used to inform hiring decisions aimed at meeting demand in the upcoming months.

If demand forecasts are inaccurate, it can impact our business in several ways, but the most critical impact for us is staffing our factories. Underestimating demand can result in understaffing and require overtime to meet production targets. Overestimating can lead to overstaffing, which is very costly because workers are underutilized. Both over and underestimating present different costs, and balancing these costs is crucial to our business.

Prior to this effort, we relied on forecasts provided by our customers in order to make these staffing decisions. With the COVID-19 pandemic, our demand became more erratic. This unpredictability caused over- and underestimating demand to become more common and staffing-related costs to increase. It became clear that we needed to find a better approach to forecasting.

Processing and modeling

Initially, we explored traditional forecasting methods such as ARIMA on our local machines. However, these approaches took a long time to develop, test, and tune for each product. It also required us to maintain a model for each individual product. From this experience, we learned that the new forecasting solution had to be fast, accurate, easy to manage, and scalable. Our team reached out to data scientists at the Amazon Machine Learning (ML) Solutions Lab, who advised and guided us through the process of building our solution around Forecast.

For this solution, we used a 3-year history of daily sales across nine different product categories. We chose these nine categories because they had a long history for the model to train on and exhibited different seasonal buying patterns. To begin, we uploaded the data from our on-premises servers into an Amazon Simple Storage Service (Amazon S3) bucket. After that, we preprocessed the data by removing known anomalies and organizing the data in a format compatible with Forecast. Our final dataset consisted of three columns: timestamp, item_id, and demand.
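
As an illustration only, a minimal preprocessing sketch in Python might look like the following; the file and column names are hypothetical, not our actual data:

import boto3
import pandas as pd

# Hypothetical export of daily sales from the on-premises database
df = pd.read_csv('daily_sales_export.csv')

# Keep only the three columns Forecast expects for the target time series
df = df.rename(columns={'sale_date': 'timestamp', 'product_category': 'item_id', 'units_sold': 'demand'})
df = df[['timestamp', 'item_id', 'demand']]

# Forecast target time series files are typically uploaded without a header row
df.to_csv('target_time_series.csv', index=False, header=False)
boto3.client('s3').upload_file('target_time_series.csv', 'your-forecast-bucket', 'target_time_series.csv')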

For model training, we decided to use the AutoML functionality in Forecast. The AutoML tool tries to fit several different algorithms to the data and tunes each one to obtain the highest accuracy. The AutoML feature was vital for a team like ours with limited background in time-series modeling. It only took a few hours for Forecast to train a predictor. After the service identifies the most effective algorithm, it further tunes that algorithm through hyperparameter optimization (HPO) to get the final predictor. This AutoML capability eliminated weeks of development time that the team would have spent researching, training, and evaluating various algorithms.
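
For readers who want a sense of what invoking AutoML looks like, the following is a minimal sketch using Boto3; the predictor name, forecast horizon, and dataset group ARN are illustrative assumptions, not the values we used:

import boto3

forecast = boto3.client('forecast')

# Train a predictor with AutoML: Forecast tries several algorithms and keeps the best one
forecast.create_predictor(
    PredictorName='demand-predictor-automl',                       # illustrative name
    ForecastHorizon=28,                                            # illustrative horizon (28 days)
    PerformAutoML=True,
    InputDataConfig={'DatasetGroupArn': 'arn:aws:forecast:...'},   # placeholder: your dataset group ARN
    FeaturizationConfig={'ForecastFrequency': 'D'}                 # daily data
)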

Forecast evaluation

After the AutoML finished training, it output results for a number of key performance metrics, including root mean squared error (RMSE) and weighted quantile loss (wQL). We chose to focus on wQL, which provides probabilistic estimates by evaluating the accuracy of the model’s predictions for different quantiles. A model with low wQL scores was important for our business because we face different costs associated with underestimating and overestimating demand. Based on our evaluations, the best model for our use case was CNN-QR.

We applied an additional evaluation step using a held-out test set. We combined the estimated forecast with internal business logic to evaluate how we would have planned staffing using the new forecast. The results were a resounding success. The new solution improved our forecast accuracy by 8%, saving an estimated $553,000 per year.

Application architecture

At Foxconn, much of our data resides on premises, so our application is a hybrid solution. The application loads the data to AWS from the on-premises server, builds the forecasts, and allows our team to evaluate the output on a client-side GUI.

To ingest the data into AWS, we have a program running on premises that queries the latest data from the on-premises database on a weekly basis. It uploads the data to an S3 bucket via an SFTP server managed by AWS Transfer Family. This upload triggers an AWS Lambda function that performs the data preprocessing and loads the prepared data back into Amazon S3. The preprocessed data being written to the S3 bucket triggers two Lambda functions. The first loads the data from Amazon S3 into an OLTP database. The second starts the Forecast training on the processed data. After the forecast is trained, the results are loaded into a separate S3 bucket and also into the OLTP database. The following diagram illustrates this architecture.


Finally, we wanted a way for customers to review the forecast outputs and provide their own feedback into the system. The team put together a GUI that uses Amazon API Gateway to allow users to visualize and interact with the forecast results in the database. The GUI allows the customer to review the latest forecast and choose a target production for upcoming weeks. The targets are uploaded back to the OLTP database and used in further planning efforts.
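
Going back to the ingestion step, the following is a minimal sketch of an S3-triggered preprocessing Lambda function; the bucket layout and prefixes are illustrative assumptions, not our actual implementation:

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # Invoked by the S3 event generated when the weekly raw-data file is uploaded
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        raw = s3.get_object(Bucket=bucket, Key=key)['Body'].read()

        prepared = raw  # anomaly removal and reformatting would go here

        # Write the prepared file to a separate prefix, which triggers the downstream functions
        s3.put_object(Bucket=bucket, Key='processed/' + key.split('/')[-1], Body=prepared)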

Summary and next steps

In this post, we showed how a team new to AWS and data science built a custom forecasting solution with Forecast in 2 months. The application improved our forecast accuracy by 8%, saving an estimated $553,000 annually for our Mexico facility alone. Using Forecast also gave us the flexibility to scale out if we add new product categories in the future.

We’re thrilled to see the high performance of the Forecast solution using only the historical demand data. This is the first step in a larger plan to expand our use of ML for supply chain management and production planning.

Over the coming months, the team will migrate other planning data and workloads to the cloud. We’ll use the demand forecast in conjunction with inventory, backlog, and worker data to create an optimization solution for labor planning and allocation. These solutions will make the improved forecast even more impactful by allowing us to better plan production levels and resource needs.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab program. To learn more about how to use Amazon Forecast, check out the service documentation.


About the Authors

Azim Siddique serves as Technical Advisor and CoE Architect at Foxconn. He provides architectural direction for the Digital Transformation program, conducts PoCs with emerging technologies, and guides engineering teams to deliver business value by leveraging digital technologies at scale.

 

 

Felice Chuang is a Data Architect at Foxconn. She uses her diverse skillset to implement end-to-end architecture and design for big data, data governance, and business intelligence applications. She supports analytic workloads and conducts PoCs for Digital Transformation programs.

 

 

Yash Shah is a data scientist in the Amazon ML Solutions Lab, where he works on a range of machine learning use cases from healthcare to manufacturing and retail. He has a formal background in Human Factors and Statistics, and was previously part of the Amazon SCOT team designing products to guide 3P sellers with efficient inventory management.

 

 

Dan Volk is a Data Scientist at Amazon ML Solutions Lab, where he helps AWS customers across various industries accelerate their AI and cloud adoption. Dan has worked in several fields including manufacturing, aerospace, and sports and holds a Masters in Data Science from UC Berkeley.

 

 

 

Xin Chen is a senior manager at Amazon ML Solutions Lab, where he leads the Automotive Vertical and helps AWS customers across different industries identify and build machine learning solutions to address their organization’s highest return-on-investment machine learning opportunities. Xin obtained his Ph.D. in Computer Science and Engineering from the University of Notre Dame.

Read More

Controlling and auditing data exploration activities with Amazon SageMaker Studio and AWS Lake Formation

Highly-regulated industries, such as financial services, are often required to audit all access to their data. This includes auditing exploratory activities performed by data scientists, who usually query data from within machine learning (ML) notebooks.

This post walks you through the steps to implement access control and auditing capabilities on a per-user basis, using Amazon SageMaker Studio notebooks and AWS Lake Formation access control policies. This is a how-to guide based on the Machine Learning Lens for the AWS Well-Architected Framework, following the design principles described in the Security Pillar:

  • Restrict access to ML systems
  • Ensure data governance
  • Enforce data lineage
  • Enforce regulatory compliance

Additional ML governance practices for experiments and models using Amazon SageMaker are described in the whitepaper Machine Learning Best Practices in Financial Services.

Overview of solution

This implementation uses Amazon Athena and the PyAthena client on a Studio notebook to query data on a data lake registered with Lake Formation.

SageMaker Studio is the first fully integrated development environment (IDE) for ML. Studio provides a single, web-based visual interface where you can perform all the steps required to build, train, and deploy ML models. Studio notebooks are collaborative notebooks that you can launch quickly, without setting up compute instances or file storage beforehand.

Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run.

Lake Formation is a fully managed service that makes it easier for you to build, secure, and manage data lakes. Lake Formation simplifies and automates many of the complex manual steps that are usually required to create data lakes, including securely making that data available for analytics and ML.

For an existing data lake registered with Lake Formation, the following diagram illustrates the proposed implementation.


The workflow includes the following steps:

  1. Data scientists access the AWS Management Console using their AWS Identity and Access Management (IAM) user accounts and open Studio using individual user profiles. Each user profile has an associated execution role, which the user assumes while working on a Studio notebook. The diagram depicts two data scientists that require different permissions over data in the data lake. For example, in a data lake containing personally identifiable information (PII), user Data Scientist 1 has full access to every table in the Data Catalog, whereas Data Scientist 2 has limited access to a subset of tables (or columns) containing non-PII data.
  2. The Studio notebook is associated with a Python kernel. The PyAthena client allows you to run exploratory ANSI SQL queries on the data lake through Athena, using the execution role assumed by the user while working with Studio.
  3. Athena sends a data access request to Lake Formation, with the user profile execution role as principal. Data permissions in Lake Formation offer database-, table-, and column-level access control, restricting access to metadata and the corresponding data stored in Amazon S3. Lake Formation generates short-term credentials to be used for data access, and informs Athena what columns the principal is allowed to access.
  4. Athena uses the short-term credential provided by Lake Formation to access the data lake storage in Amazon S3, and retrieves the data matching the SQL query. Before returning the query result, Athena filters out columns that aren’t included in the data permissions informed by Lake Formation.
  5. Athena returns the SQL query result to the Studio notebook.
  6. Lake Formation records data access requests and other activity history for the registered data lake locations. AWS CloudTrail also records these and other API calls made to AWS during the entire flow, including Athena query requests.

Walkthrough overview

In this walkthrough, I show you how to implement access control and audit using a Studio notebook and Lake Formation. You perform the following activities:

  1. Register a new database in Lake Formation.
  2. Create the required IAM policies, roles, group, and users.
  3. Grant data permissions with Lake Formation.
  4. Set up Studio.
  5. Test Lake Formation access control policies using a Studio notebook.
  6. Audit data access activity with Lake Formation and CloudTrail.

If you prefer to skip the initial setup activities and jump directly to testing and auditing, you can deploy the provided AWS CloudFormation template in a Region that supports Studio and Lake Formation.

You can also deploy the template by downloading the CloudFormation template. When deploying the CloudFormation template, you provide the following parameters:

  • User name and password for a data scientist with full access to the dataset. The default user name is data-scientist-full.
  • User name and password for a data scientist with limited access to the dataset. The default user name is data-scientist-limited.
  • Names for the database and table to be created for the dataset. The default names are amazon_reviews_db and amazon_reviews_parquet, respectively.
  • VPC and subnets that are used by Studio to communicate with the Amazon Elastic File System (Amazon EFS) volume associated to Studio.

If you decide to deploy the CloudFormation template, after the CloudFormation stack is complete, you can go directly to the section Testing Lake Formation access control policies in this post.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • A data lake set up in Lake Formation with a Lake Formation Admin. For general guidance on how to set up Lake Formation, see Getting started with AWS Lake Formation.
  • Basic knowledge on creating IAM policies, roles, users, and groups.

Registering a new database in Lake Formation

For this post, I use the Amazon Customer Reviews Dataset to demonstrate how to provide granular access to the data lake for different data scientists. If you already have a dataset registered with Lake Formation that you want to use, you can skip this section and go to Creating required IAM roles and users for data scientists.

To register the Amazon Customer Reviews Dataset in Lake Formation, complete the following steps:

  1. Sign in to the console with the IAM user configured as Lake Formation Admin.
  2. On the Lake Formation console, in the navigation pane, under Data catalog, choose Databases.
  3. Choose Create Database.
  4. In Database details, select Database to create the database in your own account.
  5. For Name, enter a name for the database, such as amazon_reviews_db.
  6. For Location, enter s3://amazon-reviews-pds.
  7. Under Default permissions for newly created tables, make sure to clear the option Use only IAM access control for new tables in this database.


  8. Choose Create database.

The Amazon Customer Reviews Dataset is currently available in TSV and Parquet formats. The Parquet dataset is partitioned on Amazon S3 by product_category. To create a table in the data lake for the Parquet dataset, you can use an AWS Glue crawler or manually create the table using Athena, as described in the Amazon Customer Reviews Dataset README file.

  9. On the Athena console, create the table.

If you haven’t specified a query result location before, follow the instructions in Specifying a Query Result Location.

  10. Choose the data source AwsDataCatalog.
  11. Choose the database created in the previous step.
  12. In the Query Editor, enter the following query:
    CREATE EXTERNAL TABLE amazon_reviews_parquet(
      marketplace string, 
      customer_id string, 
      review_id string, 
      product_id string, 
      product_parent string, 
      product_title string, 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine string, 
      verified_purchase string, 
      review_headline string, 
      review_body string, 
      review_date bigint, 
      year int)
    PARTITIONED BY (product_category string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://amazon-reviews-pds/parquet/'

  13. Choose Run query.

You should receive a Query successful response when the table is created.

  14. Enter the following query to load the table partitions:
    MSCK REPAIR TABLE amazon_reviews_parquet

  15. Choose Run query.
  16. On the Lake Formation console, in the navigation pane, under Data catalog, choose Tables.
  17. For Table name, enter a table name.
  18. Verify that you can see the table details.


  19. Scroll down to see the table schema and partitions.

Finally, you register the database location with Lake Formation so the service can start enforcing data permissions on the database.

  20. On the Lake Formation console, in the navigation pane, under Register and ingest, choose Data lake locations.
  21. On the Data lake locations page, choose Register location.
  22. For Amazon S3 path, enter s3://amazon-reviews-pds/.
  23. For IAM role, you can keep the default role.
  24. Choose Register location.

Creating required IAM roles and users for data scientists

To demonstrate how you can provide differentiated access to the dataset registered in the previous step, you first need to create IAM policies, roles, a group, and users. The following diagram illustrates the resources you configure in this section.


In this section, you complete the following high-level steps:

  1. Create an IAM group named DataScientists containing two users: data-scientist-full and data-scientist-limited, to control their access to the console and to Studio.
  2. Create a managed policy named DataScientistGroupPolicy and assign it to the group.

The policy allows users in the group to access Studio, but only using a SageMaker user profile that matches their IAM user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.

  3. For each IAM user, create individual IAM roles, which are used as user profile execution roles in Studio later.

The naming convention for these roles consists of a common prefix followed by the corresponding IAM user name. This allows you to audit activities on Studio notebooks—which are logged using Studio’s execution roles—and trace them back to the individual IAM users who performed the activities. For this post, I use the prefix SageMakerStudioExecutionRole_.

  4. Create a managed policy named SageMakerUserProfileExecutionPolicy and assign it to each of the IAM roles.

The policy establishes coarse-grained access permissions to the data lake.

Follow the remainder of this section to create the IAM resources described. The permissions configured in this section grant common, coarse-grained access to data lake resources for all the IAM roles. In a later section, you use Lake Formation to establish fine-grained access permissions to Data Catalog resources and Amazon S3 locations for individual roles.

Creating the required IAM group and users

To create your group and users, complete the following steps:

  1. Sign in to the console using an IAM user with permissions to create groups, users, roles, and policies.
  2. On the IAM console, use the JSON tab to create a new IAM managed policy named DataScientistGroupPolicy.
    1. Use the following JSON policy document to provide permissions, providing your AWS Region and AWS account ID:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "sagemaker:DescribeDomain",
                      "sagemaker:ListDomains",
                      "sagemaker:ListUserProfiles",
                      "sagemaker:ListApps"
                  ],
                  "Resource": "*",
                  "Effect": "Allow"
              },
              {
                  "Action": [
                      "sagemaker:CreatePresignedDomainUrl",
                      "sagemaker:DescribeUserProfile"
                  ],
                  "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:user-profile/*/${aws:username}",
                  "Effect": "Allow"
              },
              {
                  "Action": [
                      "sagemaker:CreatePresignedDomainUrl",
                      "sagemaker:DescribeUserProfile"
                  ],
                  "Effect": "Deny",
                  "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:user-profile/*/${aws:username}"
              },
              {
                  "Action": "sagemaker:*App",
                  "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/${aws:username}/*",
                  "Effect": "Allow"
              },
              {
                  "Action": "sagemaker:*App",
                  "Effect": "Deny",
                  "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/${aws:username}/*"
              },
              {
                  "Action": [
                      "sagemaker:CreatePresignedNotebookInstanceUrl",
                      "sagemaker:*NotebookInstance",
                      "sagemaker:*NotebookInstanceLifecycleConfig",
                      "sagemaker:CreateUserProfile",
                      "sagemaker:DeleteDomain",
                      "sagemaker:DeleteUserProfile"
                  ],
                  "Resource": "*",
                  "Effect": "Deny"
              }
          ]
      }

This policy forces an IAM user to open Studio using a SageMaker user profile with the same name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.

  3. Create an IAM group.
    1. For Group name, enter DataScientists.
    2. Search and attach the AWS managed policy named DataScientist and the IAM policy created in the previous step.
  4. Create two IAM users named data-scientist-full and data-scientist-limited.

Alternatively, you can provide names of your choice, as long as they’re a combination of lowercase letters, numbers, and hyphen (-). Later, you also give these names to their corresponding SageMaker user profiles, which at the time of writing only support those characters.
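
If you prefer scripting these steps over the console, a minimal sketch with Boto3 follows; it assumes the DataScientistGroupPolicy created above already exists in your account, and it omits console passwords and the AWS managed DataScientist policy for brevity:

import boto3

iam = boto3.client('iam')

# Create the group and attach the customer managed policy created earlier
iam.create_group(GroupName='DataScientists')
iam.attach_group_policy(
    GroupName='DataScientists',
    PolicyArn='arn:aws:iam::<AWSACCOUNT>:policy/DataScientistGroupPolicy'
)

# Create both data scientist users and add them to the group
for user_name in ['data-scientist-full', 'data-scientist-limited']:
    iam.create_user(UserName=user_name)
    iam.add_user_to_group(GroupName='DataScientists', UserName=user_name)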

Creating the required IAM roles

To create your roles, complete the following steps:

  1. On the IAM console, create a new managed policy named SageMakerUserProfileExecutionPolicy.
    1. Use the following policy code:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "lakeformation:GetDataAccess",
                      "glue:GetTable",
                      "glue:GetTables",
                      "glue:SearchTables",
                      "glue:GetDatabase",
                      "glue:GetDatabases",
                      "glue:GetPartitions"
                  ],
                  "Resource": "*",
                  "Effect": "Allow"
              },
              {
                  "Action": "sts:AssumeRole",
                  "Resource": "*",
                  "Effect": "Deny"
              }
          ]
      }

This policy provides common coarse-grained IAM permissions to the data lake, leaving Lake Formation permissions to control access to Data Catalog resources and Amazon S3 locations for individual users and roles. This is the recommended method for granting access to data in Lake Formation. For more information, see Methods for Fine-Grained Access Control.

  2. Create an IAM role for the first data scientist (data-scientist-full), which is used as the corresponding user profile’s execution role.
    1. On the Attach permissions policy page, search and attach the AWS managed policy AmazonSageMakerFullAccess.
    2. For Role name, use the naming convention introduced at the beginning of this section to name the role SageMakerStudioExecutionRole_data-scientist-full.
  3. To add the remaining policies, on the Roles page, choose the role name you just created.
  4. Under Permissions, choose Attach policies.
  5. Search and select the SageMakerUserProfileExecutionPolicy and AmazonAthenaFullAccess policies.
  6. Choose Attach policy.
  7. To restrict the Studio resources that can be created within Studio (such as image, kernel, or instance type) to only those belonging to the user profile associated with the first IAM role, embed an inline policy in the IAM role.
    1. Use the following JSON policy document to scope down permissions for the user profile, providing the Region, account ID, and IAM user name associated to the first data scientist (data-scientist-full). You can name the inline policy DataScientist1IAMRoleInlinePolicy.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": "sagemaker:*App",
                  "Resource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/<IAMUSERNAME>/*",
                  "Effect": "Allow"
              },
              {
                  "Action": "sagemaker:*App",
                  "Effect": "Deny",
                  "NotResource": "arn:aws:sagemaker:<AWSREGION>:<AWSACCOUNT>:app/*/<IAMUSERNAME>/*"
              }
          ]
      }

  8. Repeat the previous steps to create an IAM role for the second data scientist (data-scientist-limited).
    1. Name the role SageMakerStudioExecutionRole_data-scientist-limited and the second inline policy DataScientist2IAMRoleInlinePolicy.

Granting data permissions with Lake Formation

Before data scientists are able to work on a Studio notebook, you grant the individual execution roles created in the previous section access to the Amazon Customer Reviews Dataset (or your own dataset). For this post, we implement different data permission policies for each data scientist to demonstrate how to grant granular access using Lake Formation.

  1. Sign in to the console with the IAM user configured as Lake Formation Admin.
  2. On the Lake Formation console, in the navigation pane, choose Tables.
  3. On the Tables page, select the table you created earlier, such as amazon_reviews_parquet.
  4. On the Actions menu, under Permissions, choose Grant.
  5. Provide the following information to grant full access to the Amazon Customer Reviews Dataset table for the first data scientist:
  6. Select My account.
  7. For IAM users and roles, choose the execution role associated to the first data scientist, such as SageMakerStudioExecutionRole_data-scientist-full.
  8. For Table permissions and Grantable permissions, select Select.
  9. Choose Grant.
  10. Repeat the first step to grant limited access to the dataset for the second data scientist, providing the following information:
  11. Select My account.
  12. For IAM users and roles, choose the execution role associated to the second data scientist, such as SageMakerStudioExecutionRole_data-scientist-limited.
  13. For Columns, choose Include columns.
  14. Choose a subset of columns, such as: product_category, product_id, product_parent, product_title, star_rating, review_headline, review_body, and review_date.
  15. For Table permissions and Grantable permissions, select Select.
  16. Choose Grant.
  17. To verify the data permissions you have granted, on the Lake Formation console, in the navigation pane, choose Tables.
  18. On the Tables page, select the table you created earlier, such as amazon_reviews_parquet.
  19. On the Actions menu, under Permissions, choose View permissions to open the Data permissions menu.

You see a list of permissions granted for the table, including the permissions you just granted and permissions for the Lake Formation Admin.


If you see the principal IAMAllowedPrincipals listed on the Data permissions menu for the table, you must remove it. Select the principal and choose Revoke. On the Revoke permissions page, choose Revoke.
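
You can also grant these permissions programmatically. The following is a minimal sketch using Boto3 that mirrors the column-limited grant for the second data scientist; it assumes the role, database, and table names used earlier in this post:

import boto3

lakeformation = boto3.client('lakeformation')

# Grant SELECT on a subset of columns to the limited data scientist's execution role
lakeformation.grant_permissions(
    Principal={
        'DataLakePrincipalIdentifier': 'arn:aws:iam::<AWSACCOUNT>:role/SageMakerStudioExecutionRole_data-scientist-limited'
    },
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'amazon_reviews_db',
            'Name': 'amazon_reviews_parquet',
            'ColumnNames': [
                'product_category', 'product_id', 'product_parent', 'product_title',
                'star_rating', 'review_headline', 'review_body', 'review_date'
            ]
        }
    },
    Permissions=['SELECT']
)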

Setting up SageMaker Studio

You now onboard to Studio and create two user profiles, one for each data scientist.

When you onboard to Studio using IAM authentication, Studio creates a domain for your account. A domain consists of a list of authorized users, configuration settings, and an Amazon EFS volume, which contains data for the users, including notebooks, resources, and artifacts.

Each user receives a private home directory within Amazon EFS for notebooks, Git repositories, and data files. All traffic between the domain and the Amazon EFS volume is communicated through specified subnet IDs. By default, all other traffic goes over the internet through a SageMaker system Amazon Virtual Private Cloud (Amazon VPC).

Alternatively, instead of using the default SageMaker internet access, you could secure how Studio accesses resources by assigning a private VPC to the domain. This is beyond the scope of this post, but you can find additional details in Securing Amazon SageMaker Studio connectivity using a private VPC.

If you already have a Studio domain running, you can skip the onboarding process and follow the steps to create the SageMaker user profiles.

Onboarding to Studio

To onboard to Studio, complete the following steps:

  1. Sign in to the console using an IAM user with service administrator permissions for SageMaker.
  2. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  3. On the Studio menu, under Get started, choose Standard setup.
  4. For Authentication method, choose AWS Identity and Access Management (IAM).
  5. Under Permission, for Execution role for all users, choose an option from the role selector.

You’re not using this execution role for the SageMaker user profiles that you create later. If you choose Create a new role, the Create an IAM role dialog opens.

  6. For S3 buckets you specify, choose None.
  7. Choose Create role.

SageMaker creates a new IAM role named AmazonSageMaker-ExecutionPolicy with the AmazonSageMakerFullAccess policy attached.

  8. Under Network and storage, for VPC, choose the private VPC that is used for communication with the Amazon EFS volume.
  9. For Subnet(s), choose multiple subnets in the VPC from different Availability Zones.
  10. Choose Submit.
  11. On the Studio Control Panel, under Studio Summary, wait for the status to change to Ready and the Add user button to be enabled.

Creating the SageMaker user profiles

To create your SageMaker user profiles, complete the following steps:

  1. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  2. On the Studio Control Panel, choose Add user.
  3. For User name, enter data-scientist-full.
  4. For Execution role, choose Enter a custom IAM role ARN.
  5. Enter arn:aws:iam::<AWSACCOUNT>:role/SageMakerStudioExecutionRole_data-scientist-full, providing your AWS account ID.
  6. After creating the first user profile, repeat the previous steps to create a second user profile.
    1. For User name, enter data-scientist-limited.
    2. For Execution role, enter the associated IAM role ARN.

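Alternatively, you can create the user profiles with the SageMaker API. The following is a minimal sketch using Boto3; the domain ID is a placeholder for your own Studio domain:

import boto3

sagemaker = boto3.client('sagemaker')

# Create a user profile whose execution role matches the data scientist's IAM user name
sagemaker.create_user_profile(
    DomainId='d-xxxxxxxxxxxx',  # placeholder: your Studio domain ID
    UserProfileName='data-scientist-limited',
    UserSettings={
        'ExecutionRole': 'arn:aws:iam::<AWSACCOUNT>:role/SageMakerStudioExecutionRole_data-scientist-limited'
    }
)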

Testing Lake Formation access control policies

You now test the implemented Lake Formation access control policies by opening Studio using both user profiles. For each user profile, you run the same Studio notebook containing Athena queries. You should see different query outputs for each user profile, matching the data permissions implemented earlier.

  1. Sign in to the console with IAM user data-scientist-full.
  2. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  3. On the Studio Control Panel, choose user name data-scientist-full.
  4. Choose Open Studio.
  5. Wait for SageMaker Studio to load.

Due to the IAM policies attached to the IAM user, you can only open Studio with a user profile matching the IAM user name.

  6. In Studio, on the top menu, under File, under New, choose Terminal.
  7. At the command prompt, run the following command to import a sample notebook to test Lake Formation data permissions:
    git clone https://github.com/aws-samples/amazon-sagemaker-studio-audit.git

  8. In the left sidebar, choose the file browser icon.
  9. Navigate to amazon-sagemaker-studio-audit.
  10. Open the notebook folder.
  11. Choose sagemaker-studio-audit-control.ipynb to open the notebook.
  12. In the Select Kernel dialog, choose Python 3 (Data Science).
  13. Choose Select.
  14. Wait for the kernel to load.


  15. Starting from the first code cell in the notebook, press Shift + Enter to run the code cell.
  16. Continue running all the code cells, waiting for the previous cell to finish before running the following cell.

After running the last SELECT query, because the user has full SELECT permissions for the table, the query output includes all the columns in the amazon_reviews_parquet table.


  17. On the top menu, under File, choose Shut Down.
  18. Choose Shutdown All to shut down all the Studio apps.
  19. Close the Studio browser tab.
  20. Repeat the previous steps in this section, this time signing in as the user data-scientist-limited and opening Studio with this user.
  21. Don’t run the code cell in the section Create S3 bucket for query output files.

For this user, after running the same SELECT query in the Studio notebook, the query output only includes a subset of columns for the amazon_reviews_parquet table.

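For reference, the queries in the sample notebook go through the PyAthena client, which uses the Studio user profile’s execution role as the principal. A minimal sketch of such a query cell follows; the staging bucket, Region, and query are illustrative, and the actual cells are in the notebook from the GitHub repo:

import pandas as pd
from pyathena import connect

# Athena writes query results to this staging location (bucket name is a placeholder)
conn = connect(
    s3_staging_dir='s3://sagemaker-audit-control-query-results-<AWSACCOUNT>/athena/',
    region_name='<AWSREGION>'
)

# Lake Formation filters the returned columns based on the execution role's permissions
df = pd.read_sql(
    'SELECT * FROM amazon_reviews_db.amazon_reviews_parquet LIMIT 10',
    conn
)
df.head()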

Auditing data access activity with Lake Formation and CloudTrail

In this section, we explore the events associated with the queries performed in the previous section. The Lake Formation console includes a dashboard that centralizes all CloudTrail logs specific to the service, such as GetDataAccess events. These events can be correlated with other CloudTrail events, such as Athena query requests, to get a complete view of the queries users are running on the data lake.

Alternatively, instead of filtering individual events in Lake Formation and CloudTrail, you could run SQL queries to correlate CloudTrail logs using Athena. Such integration is beyond the scope of this post, but you can find additional details in Using the CloudTrail Console to Create an Athena Table for CloudTrail Logs and Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena.

Auditing data access activity with Lake Formation

To review activity in Lake Formation, complete the following steps:

  1. Sign out of the AWS account.
  2. Sign in to the console with the IAM user configured as Lake Formation Admin.
  3. On the Lake Formation console, in the navigation pane, choose Dashboard.

Under Recent access activity, you can find the events associated to the data access for both users.

  1. Choose the most recent event with event name GetDataAccess.
  2. Choose View event.

Among other attributes, each event includes the following:

  • Event date and time
  • Event source (Lake Formation)
  • Athena query ID
  • Table being queried
  • IAM user embedded in the Lake Formation principal, based on the chosen role name convention


Auditing data access activity with CloudTrail

To review activity in CloudTrail, complete the following steps:

  1. On the CloudTrail console, in the navigation pane, choose Event history.
  2. In the Event history menu, for Filter, choose Event name.
  3. Enter StartQueryExecution.
  4. Expand the most recent event, then choose View event.

This event includes additional parameters that are useful to complete the audit analysis, such as the following:

  • Event source (Athena).
  • Athena query ID, matching the query ID from Lake Formation’s GetDataAccess event.
  • Query string.
  • Output location. The query output is stored in CSV format in this Amazon S3 location. Files for each query are named using the query ID.

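If you want to retrieve these events programmatically rather than through the console, a minimal sketch with Boto3 follows; the attribute filter mirrors the console filter used above:

import boto3

cloudtrail = boto3.client('cloudtrail')

# Look up recent Athena StartQueryExecution events
response = cloudtrail.lookup_events(
    LookupAttributes=[{'AttributeKey': 'EventName', 'AttributeValue': 'StartQueryExecution'}],
    MaxResults=10
)

for event in response['Events']:
    print(event['EventTime'], event.get('Username'), event['EventName'])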

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

If you followed this walkthrough using the CloudFormation template, after shutting down the Studio apps for each user profile, deleting the stack deletes the remaining resources.

If you encounter any errors, open the Studio Control Panel and verify that all the apps for every user profile are in Deleted state before deleting the stack.

If you didn’t use the CloudFormation template, you can manually delete the resources you created:

  1. On the Studio Control Panel, for each user profile, choose User Details.
  2. Choose Delete user.
  3. When all users are deleted, choose Delete Studio.
  4. On the Amazon EFS console, delete the volume that was automatically created for Studio.
  5. On the Lake Formation console, delete the table and the database created for the Amazon Customer Reviews Dataset.
  6. Remove the data lake location for the dataset.
  7. On the IAM console, delete the IAM users, group, and roles created for this walkthrough.
  8. Delete the policies you created for these principals.
  9. On the Amazon S3 console, empty and delete the bucket created for storing Athena query results (starting with sagemaker-audit-control-query-results-), and the bucket created by Studio to share notebooks (starting with sagemaker-studio-).

Conclusion

This post described how to implement access control and auditing capabilities on a per-user basis in ML projects, using Studio notebooks, Athena, and Lake Formation to enforce access control policies when performing exploratory activities in a data lake.

Thank you for following this walkthrough. I invite you to implement the solution using the associated CloudFormation template, and to visit the GitHub repo for the project.


About the Author

Rodrigo Alarcon is a Sr. Solutions Architect with AWS based out of Santiago, Chile. Rodrigo has over 10 years of experience in IT security and network infrastructure. His interests include machine learning and cybersecurity.

Read More

Building and deploying an object detection computer vision application at the edge with AWS Panorama

Computer vision (CV) is a sought-after technology among companies looking to take advantage of machine learning (ML) to improve their business processes. Enterprises have access to large amounts of video assets from their existing cameras, but the data remains largely untapped without the right tools to gain insights from it. CV provides the tools to unlock opportunities with this data, so you can automate processes that typically require visual inspection, such as evaluating manufacturing quality or identifying bottlenecks in industrial processes. You can take advantage of CV models running in the cloud to automate these inspection tasks, but there are circumstances when relying exclusively on the cloud isn’t optimal due to latency requirements or intermittent connectivity that make a round trip to the cloud infeasible.

AWS Panorama enables you to bring CV to on-premises cameras and make predictions locally with high accuracy and low latency. On the AWS Panorama console, you can easily bring custom trained models to the edge and build applications that integrate with custom business logic. You can then deploy these applications on the AWS Panorama Appliance, which auto-discovers existing IP cameras and runs the applications on video streams to make real-time predictions. You can easily integrate the inference results with other AWS services such as Amazon QuickSight to derive ML-powered business intelligence (BI) or route the results to your on-premises systems to trigger an immediate action.

Sign up for the preview to learn more and start building your own CV applications.

In this post, we look at how you can use AWS Panorama to build and deploy a parking lot car counter application.

Parking lot car counter application

Parking facilities, like the one in the image below, need to know how many cars are parked in a given facility at any point of time, to assess vacancy and intake more customers. You also want to keep track of the number of cars that enter and exit your facility during any given time. You can use this information to improve operations, such as adding more parking payment centers, optimizing price, directing cars to different floors, and more. Parking center owners typically operate more than one facility and are looking for real-time aggregate details of vacancy in order to direct traffic to less-populated facilities and offer real-time discounts.

To achieve these goals, parking centers sometimes manually count the cars to provide a tally. This inspection can be error prone and isn’t optimal for capturing real-time data. Some parking facilities install sensors that give the number of cars in a particular lot, but these sensors are typically not integrated with analytics systems to derive actionable insights.

With the AWS Panorama Appliance, you can get a real-time count of the number of cars, collect metrics across sites, and correlate them to improve your operations. Let’s see how we can solve this once-manual (and expensive) problem using CV at the edge. We go through the details of the trained model and the business logic code, and walk through the steps to create and deploy an application on your AWS Panorama Appliance Developer Kit so you can view the inferences on a connected HDMI screen.

Computer vision model

A CV model helps us extract useful information from images and video frames. We can detect and localize objects in a scene, identify and classify images, and recognize actions. You can choose from a variety of frameworks such as TensorFlow, MXNet, and PyTorch to build your CV models, or you can choose from a variety of pre-trained models available from AWS or from third parties such as ISVs.

For this example, we use a pre-trained GluonCV model downloaded from the GluonCV model zoo.

The model we use is the ssd_512_resnet50_v1_voc model. It’s trained on the very popular PASCAL VOC dataset. It has 20 classes of objects annotated and labeled for a model to be trained on. The following code shows the classes and their indexes.

voc_classes = {
	'aeroplane'		: 0,
	'bicycle'		: 1,
	'bird'			: 2,
	'boat'			: 3,
	'bottle'		: 4,
	'bus'			: 5,
	'car'			: 6,
	'cat'			: 7,
	'chair'			: 8,
	'cow'			: 9,
	'diningtable'	: 10,
	'dog'			: 11,
	'horse'			: 12,
	'motorbike'		: 13,
	'person'		: 14,
	'pottedplant'	: 15,
	'sheep'			: 16,
	'sofa'			: 17,
	'train'			: 18,
	'tvmonitor'		: 19
}


For our use case, we’re detecting and counting cars. Because we’re talking about cars, we use class 6 as the index in our business logic later in this post.

Our input image shape is [1, 3, 512, 512]. These are the dimensions of the input image the model expects to be given:

  • Batch size – 1
  • Number of channels – 3
  • Width and height of the input image – 512, 512
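
If you want to produce the model artifacts yourself instead of using the pre-packaged download, a minimal sketch with GluonCV and MXNet might look like the following; you would then package the exported files into a .tar.gz archive:

import mxnet as mx
from gluoncv import model_zoo

# Download the pre-trained SSD model from the GluonCV model zoo
net = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained=True)

# Hybridize and run one forward pass with the expected input shape so the graph can be exported
net.hybridize()
net(mx.nd.zeros((1, 3, 512, 512)))

# Writes ssd_512_resnet50_v1_voc-symbol.json and ssd_512_resnet50_v1_voc-0000.params
net.export('ssd_512_resnet50_v1_voc')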

Uploading the model artifacts

We need to upload the model artifacts to an Amazon Simple Storage Service (Amazon S3) bucket. The bucket name must begin with aws-panorama-. After downloading the model artifacts, we upload the ssd_512_resnet50_v1_voc.tar.gz file to the S3 bucket. To create your bucket, complete the following steps:

  1. Download the model artifacts.
  2. On the Amazon S3 console, choose Create bucket.
  3. For Bucket name, enter a name starting with aws-panorama-.

  4. Choose Create bucket.

You can view the object details in the Object overview section. The model URI is s3://aws-panorama-models-bucket/ssd_512_resnet50_v1_voc.tar.gz.
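
You can also upload the artifact programmatically. The following is a minimal sketch using Boto3, assuming the bucket created above:

import boto3

s3 = boto3.client('s3')

# Upload the packaged model artifacts to the aws-panorama- prefixed bucket
s3.upload_file(
    'ssd_512_resnet50_v1_voc.tar.gz',
    'aws-panorama-models-bucket',
    'ssd_512_resnet50_v1_voc.tar.gz'
)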

The business logic code

After we upload the model artifacts to an S3 bucket, let’s turn our attention to the business logic code. For more information about the sample developer code, see Sample application code. For a comparative example of code samples, see AWS Panorama People Counter Example on GitHub.

Before we look at the full code, let’s look at a skeleton of the business logic code we use:

### Lambda skeleton

class car_counter(object):
    def interface(self):
        # defines the parameters that interface with other services from Panorama
        return

    def init(self, parameters, inputs, outputs):
        # defines the attributes such as arrays and model objects that will be used in the application
        return

    def entry(self, inputs, outputs):
        # defines the application logic responsible for predicting using the inputs and handles what to do
        # with the outputs
        return

The business logic code in the AWS Lambda function must implement at least the interface, init, and entry methods.

Let's go through the Python business logic code next.

import panoramasdk
import cv2
import numpy as np
import time
import boto3

# Global Variables 

HEIGHT = 512
WIDTH = 512

class car_counter(panoramasdk.base):
    
    def interface(self):
        return {
                "parameters":
                (
                    ("float", "threshold", "Detection threshold", 0.10),
                    ("model", "car_counter", "Model for car counting", "ssd_512_resnet50_v1_voc"), 
                    ("int", "batch_size", "Model batch size", 1),
                    ("float", "car_index", "car index based on dataset used", 6),
                ),
                "inputs":
                (
                    ("media[]", "video_in", "Camera input stream"),
                ),
                "outputs":
                (
                    ("media[video_in]", "video_out", "Camera output stream"),
                    
                ) 
            }
    
            
    def init(self, parameters, inputs, outputs):  
        try:  
            
            print('Loading Model')
            self.model = panoramasdk.model()
            self.model.open(parameters.car_counter, 1)
            print('Model Loaded')
            
            # Detection probability threshold.
            self.threshold = parameters.threshold
            # Frame Number Initialization
            self.frame_num = 0
            # Number of cars
            self.number_cars = 0
            # Bounding Box Colors
            self.colours = np.random.rand(32, 3)
            # Car Index for Model from parameters
            self.car_index = parameters.car_index
            # Set threshold for model from parameters 
            self.threshold = parameters.threshold
                        
            class_info = self.model.get_output(0)
            prob_info = self.model.get_output(1)
            rect_info = self.model.get_output(2)

            self.class_array = np.empty(class_info.get_dims(), dtype=class_info.get_type())
            self.prob_array = np.empty(prob_info.get_dims(), dtype=prob_info.get_type())
            self.rect_array = np.empty(rect_info.get_dims(), dtype=rect_info.get_type())

            return True
        
        except Exception as e:
            print("Exception: {}".format(e))
            return False

    def preprocess(self, img, size):
        
        resized = cv2.resize(img, (size, size))
        mean = [0.485, 0.456, 0.406]  # RGB
        std = [0.229, 0.224, 0.225]  # RGB
        
        # converting array of ints to floats
        img = resized.astype(np.float32) / 255. 
        img_a = img[:, :, 0]
        img_b = img[:, :, 1]
        img_c = img[:, :, 2]
        
        # Extracting single channels from 3 channel image
        # The above code could also be replaced with cv2.split(img)
        # normalizing per channel data:
        
        img_a = (img_a - mean[0]) / std[0]
        img_b = (img_b - mean[1]) / std[1]
        img_c = (img_c - mean[2]) / std[2]
        
        # putting the 3 channels back together:
        x1 = [[[], [], []]]
        x1[0][0] = img_a
        x1[0][1] = img_b
        x1[0][2] = img_c
        x1 = np.asarray(x1)
        
        return x1
    
    def get_number_cars(self, class_data, prob_data):
        
        # get indices of car detections in class data
        car_indices = [i for i in range(len(class_data)) if int(class_data[i]) == self.car_index]
        # use these indices to filter out anything that is less than self.threshold
        prob_car_indices = [i for i in car_indices if prob_data[i] >= self.threshold]
        return prob_car_indices

    
    def entry(self, inputs, outputs):        
        for i in range(len(inputs.video_in)):
            stream = inputs.video_in[i]
            car_image = stream.image

            # Pre Process Frame
            x1 = self.preprocess(car_image, 512)
                                    
            # Do inference on the new frame.
            
            self.model.batch(0, x1)        
            self.model.flush()
            
            # Get the results.            
            resultBatchSet = self.model.get_result()
            class_batch = resultBatchSet.get(0)
            prob_batch = resultBatchSet.get(1)
            rect_batch = resultBatchSet.get(2)

            class_batch.get(0, self.class_array)
            prob_batch.get(1, self.prob_array)
            rect_batch.get(2, self.rect_array)

            class_data = self.class_array[0]
            prob_data = self.prob_array[0]
            rect_data = self.rect_array[0]
            
            
            # Get Indices of classes that correspond to Cars
            car_indices = self.get_number_cars(class_data, prob_data)
            
            self.number_cars = len(car_indices)
            
            # Visualize with Opencv or stream.(media) 
            
            # Draw Bounding boxes on HDMI output
            if self.number_cars > 0:
                for index in car_indices:
                    
                    # rect_data holds [xmin, ymin, xmax, ymax]; normalize x by frame width and y by frame height
                    left = np.clip(rect_data[index][0] / float(WIDTH), 0, 1)
                    top = np.clip(rect_data[index][1] / float(HEIGHT), 0, 1)
                    right = np.clip(rect_data[index][2] / float(WIDTH), 0, 1)
                    bottom = np.clip(rect_data[index][3] / float(HEIGHT), 0, 1)
                    
                    stream.add_rect(left, top, right, bottom)
                    stream.add_label(str(prob_data[index][0]), right, bottom) 
                    
            stream.add_label('Number of Cars : {}'.format(self.number_cars), 0.8, 0.05)
        
            self.model.release_result(resultBatchSet)            
            outputs.video_out[i] = stream
        return True


def main():
    car_counter().run()


main()

For a full explanation of the code and the methods used, see the AWS Panorama Developer Guide.

The code has the following notable features:

  • car_index – 6, the index of the car class in the dataset the model was trained on (Pascal VOC)
  • model_used – ssd_512_resnet50_v1_voc, referenced in the code as parameters.car_counter
  • add_label – Adds text to the HDMI output
  • add_rect – Adds bounding boxes around the object of interest
  • image – Gets the NumPy array of the frame read from the camera
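
Before packaging the code, it can help to sanity-check the preprocessing logic on your workstation, without the appliance or the panoramasdk module. The following is a minimal sketch; the standalone preprocess copy and the dummy frame are for illustration only. It confirms that the normalized output matches the (1, 3, 512, 512) shape the model expects.

import cv2
import numpy as np

def preprocess(img, size=512):
    # Standalone copy of the resize-and-normalize logic from car_counter.preprocess
    resized = cv2.resize(img, (size, size))
    mean = [0.485, 0.456, 0.406]
    std = [0.229, 0.224, 0.225]
    img = resized.astype(np.float32) / 255.
    # Normalize each channel and stack into NCHW order
    channels = [(img[:, :, c] - mean[c]) / std[c] for c in range(3)]
    return np.asarray([channels])

if __name__ == "__main__":
    # Dummy frame standing in for a camera image
    dummy_frame = np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8)
    x1 = preprocess(dummy_frame)
    assert x1.shape == (1, 3, 512, 512), x1.shape
    print("preprocess output shape:", x1.shape)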

Now that the code is ready, we need to create a Lambda function that contains it. You can do this on the Lambda console with the following steps (a scripted alternative is sketched after the list).

  1. On the Lambda console, choose Functions.
  2. Choose Create function.
  3. For Function name, enter a name.
  4. Choose Create function.
  5. Rename the Python file to car_counter.py.
  6. Change the handler to car_counter.main.
  7. In the Basic settings section, confirm that the memory is 2048 MB and the timeout is 2 minutes.
  8. On the Actions menu, choose Publish new version.
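
If you prefer to script the function creation instead of clicking through the console, a rough boto3 sketch follows. The role ARN, zip file name, and runtime are placeholder assumptions; substitute values that match your account and packaging.

import boto3

lambda_client = boto3.client("lambda")

# Placeholder role ARN and deployment package -- replace with your own
ROLE_ARN = "arn:aws:iam::123456789012:role/my-panorama-lambda-role"

with open("car_counter.zip", "rb") as f:  # zip archive containing car_counter.py
    code_bytes = f.read()

lambda_client.create_function(
    FunctionName="CarCounter",
    Runtime="python3.7",          # assumed runtime
    Role=ROLE_ARN,
    Handler="car_counter.main",
    Code={"ZipFile": code_bytes},
    MemorySize=2048,
    Timeout=120,
)

# Equivalent of choosing Publish new version on the Actions menu
lambda_client.publish_version(FunctionName="CarCounter")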

We’re now ready to create our application and deploy it to the device. In the subsequent steps, we use the model we uploaded and the Lambda function we created.

Creating the application

To create your application, complete the following steps:

  1. On the AWS Panorama console, choose My applications.
  2. Choose Create application.
  3. Choose Begin creation.
  4. For Name, enter car_counter.
  5. For Description, enter an optional description.
  6. Choose Next.
  7. Choose Choose model.
  8. For Model artifact path, enter the model S3 URI.
  9. For Model name, enter the same name that you used in the business logic code.
  10. In the Input configuration section, choose Add input.
  11. For Input name, enter the input tensor name (for this post, data).
  12. For Shape, enter the frame shape (for this post, 1, 3, 512, 512).
  13. Choose Next.
  14. Under Lambda functions, select your function (CarCounter).
  15. Choose Next.
  16. Choose Proceed to deployment.

Deploying your application

To deploy your new application, complete the following steps:

  1. Choose Choose appliance.
  2. Choose the appliance you created.
  3. Choose Choose camera streams.
  4. Select your camera stream.
  5. Choose Deploy.

Checking the output

After we deploy the application, we can check the output on the HDMI port of the device or use Amazon CloudWatch Logs. For more information, see Setting up the AWS Panorama Appliance Developer Kit or Viewing AWS Panorama event logs in CloudWatch Logs, respectively.
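
As a quick alternative to watching the HDMI output, you can pull recent log events with boto3 and look for messages the code prints (for example, "Model Loaded" from init, or any print statements you add to entry). The log group name below is a placeholder; use the log group your Panorama application actually writes to.

import boto3

logs = boto3.client("logs")

# Placeholder log group -- substitute the group your application writes to
LOG_GROUP = "/aws/panorama_device/my-appliance/applications"

response = logs.filter_log_events(
    logGroupName=LOG_GROUP,
    filterPattern='"Model Loaded"',
    limit=20,
)

for event in response["events"]:
    print(event["timestamp"], event["message"])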

If we have an HDMI output connected to the device, we should see the output from the device on the HDMI screen, as in the following screenshot.

And that’s it. We have successfully deployed a car counting use case to the AWS Panorama Appliance.

Extending the solution

We can do so much more with this application and extend it to other parking-related use cases, such as the following:

  • Parking lot routing – Where are the vacant parking spots?
  • Parking lot monitoring – Are cars parked in appropriate spots? Are they too close to each other?

You can integrate these use cases with other AWS services such as Amazon QuickSight, Amazon S3, and AWS IoT Core (for MQTT messaging), just to name a few, and get real-time inference data for monitoring cars in a parking lot. A sketch of one such integration follows.
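
As a sketch of one such integration (an assumption on our part, not something the application above already does), the entry method could publish the running car count to an MQTT topic through AWS IoT Core; the topic name and region below are hypothetical.

import json
import boto3

# Hypothetical topic and region -- adjust for your environment
iot_client = boto3.client("iot-data", region_name="us-east-1")

def publish_car_count(number_cars):
    # Downstream consumers (dashboards, rules that write to S3, etc.) can subscribe to this topic
    iot_client.publish(
        topic="parking/car_count",
        qos=1,
        payload=json.dumps({"number_cars": number_cars}),
    )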

You can adapt this example and build other object detection applications for your use case. We will also continue to share more examples with you so you can build, develop, and test with the AWS Panorama Appliance Developer Kit.

Conclusion

The applications of computer vision at the edge are only now being imagined and built out. As a data scientist, I’m very excited to be innovating in lockstep with AWS Panorama customers to help you ideate and build CV models that are uniquely tailored to solve your problems.

And we’re just scratching the surface of what’s possible with CV at the edge and the AWS Panorama ecosystem.

Resources

For more information about using AWS Panorama, see the AWS Panorama Developer Guide.

About the Author

Surya Kari is a Data Scientist who works on AI devices within AWS. His interests lie in computer vision and autonomous systems.
