Secure access to Amazon SageMaker Studio with AWS SSO and a SAML application

Cloud security at AWS is the highest priority. Amazon SageMaker Studio offers various mechanisms to protect your data and code using integration with AWS security services like AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), or network isolation with Amazon Virtual Private Cloud (Amazon VPC).

Customers in highly regulated industries, like financial services, can set up Studio in VPC only mode to enable network isolation and disable internet access from Studio notebooks. You can use IAM integration with Studio to control which users have access to resources like Studio notebooks, the Studio IDE, or Amazon SageMaker training jobs.

A popular use case is to restrict access to the Studio IDE to only users from inside a specified network CIDR range or a designated VPC. You can achieve this by implementing IAM identity-based SageMaker policies and attaching those policies to the IAM users or groups that require those permissions. However, the SageMaker domain must be configured with IAM authentication mode, because the IAM identity-based policies aren’t supported in AWS Single Sign-On (SSO) authentication mode.

Many customers use AWS SSO to enable centralized workforce identity control and provide a consistent user sign-in experience. This post shows how to implement this use case while keeping AWS SSO capabilities to access Studio.

Solution overview

When you set up a SageMaker domain in VPC-only mode and specify the subnets and security groups, SageMaker creates elastic network interfaces (ENIs) that are associated with your security groups in the specified subnets. ENIs allow your training containers to connect to resources in your VPC.

In this mode, direct internet access from notebooks is completely disabled, and all traffic is routed through an ENI in your private VPC. This also includes traffic from Studio UI widgets and interfaces, such as experiment management, Autopilot, and Model Monitor, to their respective backend SageMaker APIs. AWS recommends using VPC only mode to exercise fine-grained control over network access to Studio.

The first challenge is that even though Studio is deployed with no internet connectivity, the Studio IDE can still be accessed from anywhere, as long as access to the AWS Management Console and Studio is granted to an IAM principal. This situation isn't acceptable if you want to fully isolate Studio from a public network and contain all communication within a tightly controlled private VPC.

To address this challenge and disable any access to Studio IDE except from a designated VPC or a CIDR range, you can use the CreatePresignedDomainUrl SageMaker API. The IAM role or user used to call this API defines the permissions to access Studio. Now you can use IAM identity-based policies to implement the desired access configuration. For example, to enable access only from a designated VPC, add the following condition to the IAM policy, associated with an IAM principal, which is used to generate a presigned domain URL:

"Condition": {
                "StringEquals": {
                    "aws:SourceVpc": "vpc-111bbaaa"
                }
            }

To enable access only from a designated VPC endpoint or endpoints, specify the following condition:

"Condition": {
                "ForAnyValue:StringEquals": {
                    "aws:sourceVpce": [
                        "vpce-111bbccc",
                        "vpce-111bbddd"
                    ]
                }
            }

Use the following condition to restrict access from a designated CIDR range:

"Condition": {
                "IpAddress": {
                    "aws:SourceIp": [
                        "192.0.2.0/24",
                        "203.0.113.0/24"
                    ]
                }
            }

The second challenge is that IAM-based access control works only when the SageMaker domain is configured in IAM authentication mode; you can't use it when the SageMaker domain is deployed in AWS SSO mode. The next section shows how to address these challenges and implement IAM-based access control with AWS SSO access to Studio.

Architecture overview

Studio is published as a SAML application, which is assigned to a specific SageMaker Studio user profile. Users can conveniently access Studio directly from the AWS SSO portal, as shown in the following screenshot.

The solution integrates a custom SAML 2.0 application as the mechanism to trigger user authentication for Studio. It requires the custom SAML application to be configured with the Amazon API Gateway endpoint URL as its Assertion Consumer Service (ACS) URL, and it needs attribute mappings containing the AWS SSO user ID as well as the SageMaker domain ID.

The API Gateway endpoint calls an AWS Lambda function that parses the SAML response to extract the domain ID and user ID and uses them to generate a Studio presigned URL. The Lambda function finally performs a redirection via an HTTP 302 response to sign the user in to Studio.

An IAM policy controls the network environment that Studio users are allowed to log in from, using restricting conditions as described in the previous section. This IAM policy is attached to the Lambda function's execution role. The policy contains a permission to call the sagemaker:CreatePresignedDomainUrl API for a specific user profile only:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sagemaker:CreatePresignedDomainUrl"
            ],
            "Resource": "arn:aws:sagemaker:<Region>:<Account_id>:user-profile/*/*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "NotIpAddress": {
                    "aws:VpcSourceIp": "10.100.10.0/24"
                }
            },
            "Action": [
                "sagemaker:*"
            ],
            "Resource": "arn:aws:sagemaker:<Region>:<Account_id>:user-profile/*/*",
            "Effect": "Deny"
        }
    ]
}

The following diagram shows the solution architecture.

The solution deploys a SageMaker domain into your private VPC, along with VPC endpoints to access Studio, the SageMaker runtime, and the SageMaker API via a private connection, without the need for an internet gateway. The VPC endpoints are configured with private DNS enabled (PrivateDnsEnabled=True) to associate a private hosted zone with your VPC. This enables Studio to access the SageMaker API using the default public DNS name api.sagemaker.<Region>.amazonaws.com, which resolves to the private IP address of the endpoint, rather than using the endpoint-specific URL.
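
For illustration only, the following boto3 sketch shows how one of these interface endpoints could be created with private DNS enabled. The VPC, subnet, and security group IDs are placeholders, and the solution itself provisions the endpoints for you, so this is just to make the PrivateDnsEnabled setting concrete.

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed Region

# Interface endpoint for the SageMaker API with private DNS enabled, so that
# api.sagemaker.<Region>.amazonaws.com resolves to private IPs inside the VPC.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-111bbaaa",                              # placeholder VPC ID
    ServiceName="com.amazonaws.eu-west-1.sagemaker.api",
    SubnetIds=["subnet-0aaa1111"],                     # placeholder private subnet
    SecurityGroupIds=["sg-0bbb2222"],                  # placeholder security group
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])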

You need to add VPC endpoints to your VPC if you want to access any other AWS services like Amazon Simple Storage Service (Amazon S3), Amazon Elastic Container Registry (Amazon ECR), AWS Security Token Service (AWS STS), AWS CloudFormation, or AWS CodeCommit.

You can fully control permissions used to generate the presigned URL and any other API calls with IAM policies attached to the Lambda function execution role or control access to any used AWS service via VPC endpoint policies. For examples of using IAM policies to control access to Studio and SageMaker API, refer to Control Access to the SageMaker API by Using Identity-based Policies.

Although the solution requires the Studio domain to be deployed in IAM mode, it does allow for AWS SSO to be used as the mechanism for end users to log in to Studio.

The following subsections contain detailed descriptions of the main solution components.

API Gateway

The API Gateway endpoint acts as the target for the application ACS URL configured in the custom SAML 2.0 application. The endpoint is private, and has a resource called /saml and a POST method with integration request configured as Lambda proxy. The solution uses a VPC endpoint with a configured com.amazonaws.<region>.execute-api DNS name to call this API endpoint from within the VPC.

AWS SSO

A custom SAML 2.0 application is configured with the API Gateway endpoint URL https://{restapi-id}.execute-api.<Region>.amazonaws.com/saml as its application ACS URL, and uses attribute mappings with the following requirements:

  • User identifier:
    • User attribute in the application – user name
    • Maps to user attribute in AWS SSO – ${user:AD_GUID}
  • SageMaker domain ID identifier:
    • User attribute in the application – domain-id
    • Maps to user attribute in AWS SSO – Domain ID for the Studio instance

The application implements the access control for an AWS SSO user by provisioning a Studio user profile with the name equal to the AWS SSO user ID.

Lambda function

The solution configures a Lambda function as the invocation point for the API Gateway /saml resource. The function parses the SAMLResponse sent by AWS SSO, extracts the domain-id and the user name, calls the CreatePresignedDomainUrl SageMaker API to retrieve the Studio URL and token, and redirects the user to log in using an HTTP 302 response. The Lambda function has a specific IAM policy attached to its execution role that allows the sagemaker:CreatePresignedDomainUrl action only when it's requested from a specific network CIDR range, using the VpcSourceIp condition.
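
The following is a minimal sketch of what such a handler could look like, not the exact code from the repository. It assumes the API Gateway proxy integration delivers the form-encoded SAMLResponse in the event body (not base64 encoded), and that the attributes carrying the SageMaker domain ID and the Studio user profile name are called domain-id and user name, as configured in the custom SAML application above.

import base64
import urllib.parse
import xml.etree.ElementTree as ET

import boto3

sm_client = boto3.client("sagemaker")
SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}


def get_attribute(assertion_root, name):
    # Return the first value of the SAML attribute with the given name.
    for attr in assertion_root.iter("{urn:oasis:names:tc:SAML:2.0:assertion}Attribute"):
        if attr.get("Name") == name:
            value = attr.find("saml:AttributeValue", SAML_NS)
            return value.text if value is not None else None
    return None


def lambda_handler(event, context):
    # API Gateway proxy integration: the ACS POST body is form encoded.
    body = urllib.parse.parse_qs(event["body"])
    saml_response = base64.b64decode(body["SAMLResponse"][0])
    root = ET.fromstring(saml_response)

    domain_id = get_attribute(root, "domain-id")       # assumed attribute name
    user_profile = get_attribute(root, "user name")    # assumed attribute name

    presigned = sm_client.create_presigned_domain_url(
        DomainId=domain_id,
        UserProfileName=user_profile,
    )

    # Redirect the browser straight into Studio.
    return {
        "statusCode": 302,
        "headers": {"Location": presigned["AuthorizedUrl"]},
    }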

The Lambda function doesn't have any logic to validate the SAML response, for example checking its signature. However, because the API Gateway endpoint serving as the ACS is private and internal only, this isn't mandatory for this proof of concept environment.

Deploy the solution

The GitHub repository provides the full source code for the end-to-end solution.

To deploy the solution, you must have administrator (or power user) permissions for an AWS account, and have the AWS Command Line Interface (AWS CLI), the AWS SAM CLI, and Python 3.8 or later installed.

The solution supports deployment to three AWS Regions: eu-west-1, eu-central-1, and us-east-1. Make sure you select one of these Regions for deployment.

To start testing the solution, you must complete the following deployment steps from the solution’s GitHub README file:

  1. Set up AWS SSO if you don’t have it configured.
  2. Deploy the solution using the SAM application.
  3. Create a new custom SAML 2.0 application.

After you complete the deployment steps, you can proceed with the solution test.

Test the solution

The solution simulates two use cases to demonstrate the usage of AWS SSO and SageMaker identity-based policies:

  • Positive use case – A user accesses Studio from within a designated CIDR range through a VPC endpoint
  • Negative use case – A user accesses Studio from a public IP address

To test these use cases, the solution creates three Amazon Elastic Compute Cloud (Amazon EC2) instances:

  • Private host – An EC2 Windows instance in a private subnet that is able to access Studio (your on-premises secured environment)
  • Bastion host – An EC2 Linux instance in the public subnet used to establish an SSH tunnel into the private host on the private network
  • Public host – An EC2 Windows instance in a public subnet to demonstrate that the user can’t access Studio from an unauthorized IP address

Test Studio access from an authorized network

Follow these steps to perform the test:

  1. To access the EC2 Windows instance on the private network, run the command provided as the value of the SAM output key TunnelCommand. Make sure that the private key of the key pair specified in the parameter is in the directory where the SSH tunnel command runs from. The command creates an SSH tunnel from the local computer on localhost:3389 to the EC2 Windows instance on the private network. See the following example code:
    ssh -i sso-username.pem -A -N -L localhost:3389:10.100.10.187:3389 ec2-user@3.250.93.113

  2. On your local desktop or notebook, open a new RDP connection (for example using Microsoft Remote Desktop) using localhost as the target remote host. This connection is tunneled via the bastion host to the private EC2 Windows instance. Use the user name Administrator and password from the stack output SageMakerWindowsPassword.
  3. Open the Firefox web browser from the remote desktop.
  4. Navigate and log in to the AWS SSO portal using the credentials associated with the user name that you specified as the ssoUserName parameter.
  5. Choose the SageMaker Secure Demo AWS SSO application from the AWS SSO portal.

You’re redirected to the Studio IDE in a new browser window.

Test Studio access from an unauthorized network

Now follow these steps to simulate access from an unauthorized network:

  1. Open a new RDP connection to the IP address provided in the SageMakerWindowsPublicHost SAM output.
  2. Open the Firefox web browser from the remote desktop.
  3. Navigate and log in to the AWS SSO portal using the credentials associated with the user name that was specified as the ssoUserName parameter.
  4. Choose the SageMaker Secure Demo AWS SSO application from the AWS SSO portal.

This time you receive an unauthorized access message.

Clean up

To avoid charges, you must remove all solution-provisioned and manually created resources from your AWS account. Follow the instructions in the solution’s README file.

Conclusion

We demonstrated that by introducing a middleware authentication layer between the end user and Studio, we can control the environment that a user is allowed to access Studio from and explicitly block every other unauthorized environment.

To further tighten security, you can add an IAM policy to a user role to prevent access to Studio from the console. If you use AWS Organizations, you can implement the following service control policy for the organizational units or accounts that need access to Studio:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "sagemaker:*"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Condition": {
        "NotIpAddress": {
          "aws:VpcSourceIp": "<Authorized CIDR>"
        }
      },
      "Action": [
        "sagemaker:CreatePresignedDomainUrl"
      ],
      "Resource": "*",
      "Effect": "Deny"
    }
  ]
}

Although the solution described in this post uses API Gateway and Lambda, you can explore other approaches, such as an EC2 instance with an instance role that uses the same permission validation workflow, or even an independent system that handles user authentication and authorization and generates a Studio presigned URL.

Further reading

Securing access to Studio is an active research topic, and there are other relevant posts on similar approaches. Refer to the following posts on the AWS Machine Learning Blog to learn more about other services and architectures you can use:


About the Authors

Jerome Bachelet is a Solutions Architect at Amazon Web Services. He thrives on helping customers get the most value out of AWS to achieve their business objectives. Jerome has over 10 years of experience working with data protection and data security solutions. Besides being in the cloud, Jerome enjoys travels and quality time with his wife and 2 daughters in the Geneva, Switzerland area.

Yevgeniy Ilyin is a Solutions Architect at AWS. He has over 20 years of experience working at all levels of software development and solutions architecture and has used programming languages from COBOL and Assembler to .NET, Java, and Python. He develops and codes cloud native solutions with a focus on big data, analytics, and data engineering.

Read More

Industrial automation at Tyson with computer vision, AWS Panorama, and Amazon SageMaker

This is the first in a two-part blog series on how Tyson Foods, Inc., is utilizing Amazon SageMaker and AWS Panorama to automate industrial processes at their meat packing plants by bringing the benefits of artificial intelligence applications at the edge. In part one, we discuss an inventory counting application for packaging lines. In part two, we discuss a vision-based anomaly detection solution at the edge for predictive maintenance of industrial equipment.

As one of the largest processors and marketers of chicken, beef, and pork in the world, Tyson Foods, Inc., is known for bringing innovative solutions to its production and packing plants. In February 2020, Tyson announced its plan to bring computer vision (CV) to its chicken plants and launched a pilot with AWS to pioneer efforts on inventory management. Tyson collaborated with the Amazon ML Solutions Lab to create a state-of-the-art chicken tray counting CV solution that provides real-time insights into packed inventory levels. In this post, we provide an overview of the AWS architecture and a complete walkthrough of the solution to demonstrate the key components in the tray counting pipeline set up at Tyson's plant. We focus on data collection and labeling, and on training and deploying CV models at the edge using Amazon SageMaker, Apache MXNet Gluon, and AWS Panorama.

Operational excellence is a key priority at Tyson Foods. Tyson employs strict quality assurance (QA) measures in its packaging lines, ensuring that only packaged products that pass its quality control protocols are shipped to customers. In order to meet customer demand and to stay ahead of any production issue, Tyson closely monitors packed chicken tray counts. However, current manual techniques to count chicken trays that pass QA are not accurate and do not present a clear picture of over/under production levels. Alternate strategies, such as monitoring the hourly total weight of production per rack, do not provide immediate feedback to the plant employees. With a chicken processing capacity of 45,000,000 head per week, production accuracy and efficiency are critical to Tyson's business. CV can be effectively used in such scenarios to accurately estimate the amount of chicken processed in real time, empowering employees to identify potential bottlenecks in packaging and production lines as they occur. This enables implementation of corrective measures and improves production efficiency.

Streaming and processing on-premises video streams in the cloud for CV applications requires high network bandwidth and the provisioning of relevant infrastructure, which can be cost prohibitive. AWS Panorama removes these requirements and enables Tyson to process video streams at the edge on the AWS Panorama Appliance. It reduces latency to and from the cloud and bandwidth costs, while providing an easy-to-use interface for managing devices and applications at the edge.

Object detection is one of the most commonly used CV algorithms; it localizes the position of objects in images and videos. This technology is currently used in various real-life applications such as pedestrian spotting in autonomous vehicles, detecting tumors in medical scans, and people counting systems that monitor footfall in retail spaces. It is also crucial for inventory management use cases, such as meat tray counting at Tyson, where it helps reduce waste by creating a feedback loop with production processes, cut costs, and deliver customer shipments on time.

The following sections of this blog post outline how we use live-stream videos from one of the Tyson Foods plants to train an object detection model using Amazon SageMaker. We then deploy it at the edge with the AWS Panorama device.

AWS Panorama

AWS Panorama is a machine learning (ML) appliance that allows organizations to bring CV to on-premises cameras to make predictions locally with high accuracy and low latency. The AWS Panorama Appliance is a hardware device that allows you to run applications that use ML to collect data from video streams, output video with text and graphical overlays, and interact with other AWS services. The appliance can run multiple CV models against multiple video streams in parallel and output the results in real time. It is designed for use in commercial and industrial settings.

The AWS Panorama Appliance enables you to run self-contained CV applications at the edge, without sending images to the AWS Cloud. You can also use the AWS SDK on the AWS Panorama Appliance to integrate with other AWS services and use them to track data from the application over time. To build and deploy applications, you use the AWS Panorama Application CLI. The CLI is a command line tool that generates default application folders and configuration files, builds containers with Docker, and uploads assets.

AWS Panorama supports models built with Apache MXNet, DarkNet, GluonCV, Keras, ONNX, PyTorch, TensorFlow, and TensorFlow Lite. Refer to this blog post to learn more about building applications on AWS Panorama. During the deployment process AWS Panorama takes care of compiling the model specific to the edge platform through Amazon SageMaker Neo compilation. The inference results can be routed to AWS services such as Amazon S3, Amazon CloudWatch or integrated with on-premise line-of-business applications. The deployment logs are stored in Amazon CloudWatch.

To track any change in inference script logic or trained model, one can create a new version of the application. Application versions are immutable snapshots of an application’s configuration. AWS Panorama saves previous versions of your applications so that you can roll back updates that aren’t successful, or run different versions on different appliances.

For more information, refer to the AWS Panorama page. To learn more about building sample applications, refer to AWS Panorama Samples.

Approach

A plant employee continuously places packed chicken trays into plastic bins and stacks the bins over time, as shown in the preceding figure. We want to be able to detect and count the total number of trays across all the bins stacked vertically.

A trained object detection model can predict bounding boxes of all the trays placed in a bin at every video frame. This can be used to gauge tray counts in a bin at a given instance. We also know that at any point in time, only one bin is being filled with packed trays; the tray counts continuously oscillate from high (during filling) to low (when a new bin obstructs the view of filled bin).

With this knowledge, we adopt the following strategy to count total number of chicken trays:

  1. Maintain two counters – local and global. The global counter maintains the total number of trays binned, and the local counter stores the maximum number of trays placed in the current bin.
  2. Update the local counter as new trays are placed in the bin.
  3. Detect a new bin event in one of the following ways:
    1. The tray count in a given frame goes to zero. (or)
    2. The stream of tray numbers in the last n frames drops continuously.
  4. Once the new bin event is detected, add the local counter value to the global counter.
  5. Reset the local counter to zero.

We tested this algorithm on several hours of video and got consistent results.
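
The following is a simplified sketch of this counting logic in plain Python; frame_counts stands for the per-frame tray counts produced by the object detection model, and the new-bin heuristics are deliberately simpler than the production logic.

def count_trays(frame_counts, drop_window=5):
    """Accumulate total trays from per-frame tray counts.

    frame_counts : iterable of int, number of trays detected in each frame.
    drop_window  : number of consecutive decreasing frames that signals a new bin.
    """
    global_count = 0      # total trays binned so far
    local_count = 0       # max trays seen in the current bin
    recent = []           # last few counts, used to detect a continuous drop

    for count in frame_counts:
        recent = (recent + [count])[-drop_window:]

        # New-bin event: the count drops to zero, or counts fall continuously.
        continuous_drop = len(recent) == drop_window and all(
            a > b for a, b in zip(recent, recent[1:])
        )
        if count == 0 or continuous_drop:
            global_count += local_count   # bank the filled bin
            local_count = 0
            recent = []
        else:
            local_count = max(local_count, count)

    return global_count + local_count


# Example: two bins, the first holding 4 trays and the second 3.
print(count_trays([1, 2, 3, 4, 0, 1, 2, 3]))  # -> 7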

Training an object detection model with Amazon SageMaker

Dataset creation:

Capturing new images for labeling jobs

We collected image samples from the packaging line using the AWS Panorama Appliance. The script to process images and save them was packaged as an application and deployed on AWS Panorama. The application collects video frames from an on-premises camera set up near the packaging zone and saves them at 60-second intervals to an Amazon S3 bucket; this prevents capturing similar images in the video sequence that are a few seconds apart. We also mask out adjacent regions in the image that are not relevant for the use case.

We labeled the chicken trays with bounding boxes using Amazon SageMaker Ground Truth’s streaming labeling job. We also set up an Amazon S3 Event notification that publishes object-created events to an Amazon Simple Notification Service (SNS) topic, which acts as the input source for the labeling job. When the AWS Panorama application script saves an image to an S3 bucket, an event notification is published to the SNS topic, which then sends this image to the labeling job. As the annotators label every incoming image, Ground Truth saves the labels into a manifest file, which contains S3 path of the image as well as coordinates of chicken tray bounding boxes.
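
As an illustration of the S3-to-SNS wiring described above (the bucket name, topic ARN, and suffix filter are placeholders, not our production configuration), the event notification could be configured like this with boto3:

import boto3

s3 = boto3.client("s3")

# Publish ObjectCreated events for new .jpg frames to the SNS topic that
# feeds the SageMaker Ground Truth streaming labeling job.
s3.put_bucket_notification_configuration(
    Bucket="my-frame-capture-bucket",                      # placeholder bucket
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:111122223333:gt-streaming-input",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".jpg"}]}
                },
            }
        ]
    },
)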

We perform several data augmentations (for example: random noise, random contrast and brightness, channel shuffle) on the labeled images to make the model robust to variations in real-life. The original and augmented images were combined to form a unified dataset.

Model Training:

Once the labeling job is completed, we manually trigger an AWS Lambda function. This Lambda function bundles images and their corresponding labels from the output manifest into an LST file. Our training and test files had images collected from different packaging lines to prevent any data leak in evaluation. The Lambda function then triggers an Amazon SageMaker training job.

We use SageMaker Script Mode, which allows you to bring your own training algorithms and directly train models while staying within the user-friendly confines of Amazon SageMaker. We train object detection models such as SSD and YOLOv3 (for real-time inference latency) with various backbone network combinations from the GluonCV Model Zoo in Script Mode. Neural networks have a tendency to overfit training data, leading to poor out-of-sample results. GluonCV provides image normalization and image augmentations, such as randomized image flipping and cropping, to help reduce overfitting during training. The model training code is containerized and uses a Docker image in Amazon Elastic Container Registry (Amazon ECR). The training job takes the S3 image folder and LST file paths as inputs and saves the best model artifacts (.params and .json) to S3 upon completion.
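
As a rough sketch (not our exact configuration; the entry point, hyperparameters, and S3 paths are placeholders), a Script Mode training job with the SageMaker Python SDK looks roughly like this:

import sagemaker
from sagemaker.mxnet import MXNet

role = sagemaker.get_execution_role()

# Script Mode: our own GluonCV training script runs inside the managed MXNet container.
estimator = MXNet(
    entry_point="train_ssd.py",          # hypothetical training script
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.8.0",
    py_version="py37",
    hyperparameters={
        "network": "resnet50",           # backbone from the GluonCV model zoo
        "epochs": 30,
        "lr": 0.001,
    },
)

# Channels point at the S3 image folder and LST files produced by the Lambda function.
estimator.fit({
    "train": "s3://my-bucket/trays/train",              # placeholder S3 paths
    "validation": "s3://my-bucket/trays/validation",
})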

Model Evaluation Pipeline

The top two models based on our test set were SSD-ResNet50 and YOLOv3-darknet53, each with a mAP score of 0.91. We also performed real-world testing by deploying an inference application on the AWS Panorama device along with the trained model. The inference script saves the predictions and video frames to an Amazon S3 bucket. We created another SageMaker Ground Truth job for annotating ground truth and then performed additional quantitative model evaluation. The ground truth and predicted bounding box labels on images were saved in S3 for qualitative evaluation. The models were able to generalize on the real-world data and yielded consistent performance, similar to that on our test set.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, implementing Hyperparameter Optimization (HPO), and model deployment on Amazon SageMaker on the AWS Labs GitHub repo.

Deploying meat-tray counting application

Production Architecture

Before deployment, we package all our assets (model, inference script, and the camera and global variable configuration) into a single container, as mentioned in this blog post. Our continuous integration and continuous deployment (CI/CD) pipeline updates any change in the inference script as a new application version. Once the new application version is published, we deploy it programmatically using the Boto3 SDK for Python.

Upon application deployment, AWS Panorama first creates an Amazon SageMaker Neo compilation job to compile the model for the AWS Panorama device. The inference application script imports the compiled model on the device and performs chicken tray detection at every frame. In addition to the SageMaker Neo compilation, we enabled post-training quantization by adding an os.environ['TVM_TENSORRT_USE_FP16'] = '1' flag in the script. This reduces the model weights from float32 to float16, decreasing the model size by half and improving latency without degrading performance. The inference results are captured in AWS IoT SiteWise Monitor through MQTT messages from the AWS Panorama device via AWS IoT Core. The results are then pushed to Amazon S3 and visualized in Amazon QuickSight dashboards. The plant managers and employees can directly view these dashboards to understand the throughput of every packaging line in real time.

Conclusion

By combining AWS Cloud services like Amazon SageMaker and Amazon S3 with edge services like AWS Panorama, Tyson Foods, Inc., is using artificial intelligence to automate human-intensive industrial processes like inventory counting in its manufacturing plants. Real-time edge inference capabilities enable Tyson to identify over/under production and dynamically adjust its production flow to maximize efficiency. Furthermore, by operating the AWS Panorama device at the edge, Tyson is also able to save costs associated with expensive network bandwidth to transfer video files to the cloud, and can now process all of its video/image assets locally in its network.

This blog post provides you with an end-to-end edge application overview and reference architectures for developing a CV application with AWS Panorama. We discussed three different aspects of building an edge CV application:

  1. Data: Data collection, processing and labeling using AWS Panorama and Amazon SageMaker Ground Truth.
  2. Model: Model training and evaluation using Amazon SageMaker and AWS Lambda
  3. Application Package: Bundling trained model, scripts and configuration files for AWS Panorama.

Stay tuned for part two of this series on how Tyson is using AWS Panorama for CV based predictive maintenance of industrial machines.

Click here to start your journey with AWS Panorama. To learn more about collaborating with ML Solutions Lab, see Amazon Machine Learning Solutions Lab.


About the Authors

Divya Bhargavi is a data scientist at the Amazon ML Solutions Lab where she works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.

Dilip Subramaniam is a Senior Developer with the Emerging Technologies team at Tyson Foods. He is passionate about building large-scale distributed applications to solve business problems and simplify processes using his knowledge in Software Development, Machine Learning, and Big Data.

Read More

Develop an automatic review image inspection service with Amazon SageMaker

This is a guest post by Jihye Park, a Data Scientist at MUSINSA. 

MUSINSA is one of the largest online fashion platforms in South Korea, serving 8.4M customers and selling 6,000 fashion brands. Our monthly user traffic reaches 4M, and over 90% of our demographics consist of teens and young adults who are sensitive to fashion trends. MUSINSA is a trend-setting platform leader in the country, leading with massive amounts of data.

The MUSINSA Data Solution Team engages in everything related to data collected from the MUSINSA Store. We do full stack development from log collection to data modeling and model serving. We develop various data-based products, including the Live Product Recommendation Service on our app’s main page and the Keyword Highlighting Service that detects and highlights words such as ‘size’ or ‘satisfaction level’ from text reviews.

Challenges in Automating the Review Image Inspection Process

The quality and quantity of customer reviews are critical for ecommerce businesses, as customers make purchase decisions without seeing the products in person. We give credits to those who write image reviews on the products they purchased (that is, reviews with photos of the products or photos of them wearing/using the products) to enhance customer experience and increase the purchase conversion rate. To determine if the submitted photos met our criteria for credits, all of the photos are inspected individually by humans. For example, our criteria states that a “Style Review” should contain photos featuring the whole body of a person wearing/using the product while a “Product Review” should provide a full shot of the product. The following images show examples of a Product Review and a Style Review. Uploaders’ consent has been granted for use of the photos.

Examples of Product Review. 

Examples of Style Review. 

Over 20,000 photos that require inspection are uploaded daily to the MUSINSA Store platform. The inspection process classifies images as ‘package’, ‘product’, ‘full-length’, or ‘half-length’. The image inspection process was completely manual, so it was extremely time consuming, and classifications were often done differently by different individuals, even with guidelines in place. Faced with this challenge, we used Amazon SageMaker to automate this task.

Amazon SageMaker is a fully managed service for building, training, and deploying machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. It let us quickly implement the automated image inspection service with good results.

In the following sections, we go into detail about how we addressed our problems using ML models and how we used Amazon SageMaker along the way.

Automation of the Review Image Inspection Process

The first step toward automating the Image Review Inspection process was to manually label images, thereby matching them to the appropriate categories and inspection criteria. For example, we classified images as a “full body shot,” “upper body shot,” “packaging shot,” “product shot,” etc. In the case of a Product Review, credits were given only for a product shot image. Likewise, in the case of a Style Review, credits were given for a full body shot.

As for image classification, we largely depended on a pre-trained convolutional neural network (CNN) model due to the sheer volume of input images required to train a model from scratch. While defining and categorizing meaningful features from images are both critical to training a model, an image can have a limitless number of features. Therefore, using a CNN model made the most sense: we started from a model pre-trained on 10,000+ ImageNet images and then used transfer learning. This meant that our model could be trained more effectively with our image labels later.

Image Collection with Amazon SageMaker Ground Truth

However, transfer learning has its own limitations: because the higher layers of the model must be newly trained, it still constantly requires labeled input images. On the other hand, because the lower layers have already been trained with a massive amount of data and easily identify features from images, the method performs well with fewer input images than training all layers from scratch. At MUSINSA, our entire infrastructure runs on AWS, and we store customer-uploaded photos in Amazon Simple Storage Service (Amazon S3). We categorized these images into different folders based on the labels we defined, and we used Amazon SageMaker Ground Truth for the following reasons:

  1. More consistent results – In manual processes, a single inspector’s mistake could be fed into model training without any intervention. With SageMaker Ground Truth, we could have several inspectors review the same image and make sure that the inputs from the most trustworthy inspector were rated higher for image labeling, thus leading to more reliable results.
  2. Less manual work – SageMaker Ground Truth automated data labeling can be applied with a confidence score threshold so that any images that cannot be confidently machine-labelled are sent for human labeling. This ensures the best balance of cost and accuracy. More information is available in the Amazon SageMaker Ground Truth Developer Guide.
    Using this method, we reduced the number of manually classified images by 43%. The following table shows the number of images processed per iteration after we adopted Ground Truth (note that the training and validation data are accumulated data, while the other metrics are on a per-iteration basis).
    SageMaker Ground Truth performance results
  3. Directly load results – When building models in SageMaker, we could load the resulting manifest files generated by SageMaker Ground Truth and use them for training.

In summary, categorizing 10,000 images required 22 inspectors, took five days, and cost $980.

Development of Image Classification Model with Amazon SageMaker Studio

We needed to classify review images as full body shots, upper body shots, package shots, or product shots, and sort products into applicable categories. To accomplish our goals, we considered two models: the ResNet-based SageMaker built-in model and the TensorFlow-based MobileNet. We tested both on the same test datasets and found that the SageMaker built-in model was more accurate, with a 0.98 F1 score vs. 0.88 from the TensorFlow model. Therefore, we decided on the SageMaker built-in model.

The SageMaker Studio-based model training process was as follows:

  1. Import labeled images from SageMaker Ground Truth
  2. Preprocess images – image resizing and augmenting
  3. Load the Amazon SageMaker built-in model as a Docker image
  4. Tune hyperparameters through grid search
  5. Apply transfer learning
  6. Re-tune parameters based on training metrics
  7. Save the model

SageMaker made it straightforward to train the model with just one click and without worrying about provisioning and managing a fleet of servers for training.
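
To make the grid search code below easier to follow, here is a rough sketch of how the built-in image classification Estimator (the ic object) and its data_channels could be set up; the bucket, instance type, and channel layout are placeholders rather than our exact configuration.

import sagemaker
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = session.boto_region_name

# Built-in image classification algorithm, loaded as a Docker image.
training_image = image_uris.retrieve("image-classification", region)

ic = sagemaker.estimator.Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/review-images/output",   # placeholder bucket
    sagemaker_session=session,
)

# Image and .lst channels exported from the SageMaker Ground Truth labeling job.
data_channels = {
    "train": TrainingInput("s3://my-bucket/review-images/train",
                           content_type="application/x-image"),
    "validation": TrainingInput("s3://my-bucket/review-images/validation",
                                content_type="application/x-image"),
    "train_lst": TrainingInput("s3://my-bucket/review-images/train.lst",
                               content_type="application/x-image"),
    "validation_lst": TrainingInput("s3://my-bucket/review-images/validation.lst",
                                    content_type="application/x-image"),
}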

For hyperparameter tuning, we employed grid search to determine the optimal hyperparameter values, because the number of training layers (num_layers) and training cycles (epochs) during transfer learning affected our classification model accuracy.

epochs_list = [5, 10, 15]
num_layers_list = [18, 34, 50]

import pandas as pd
from sagemaker.analytics import TrainingJobAnalytics

metric_df = pd.DataFrame()
 
for i in range(len(epochs_list)):
    for j in range(len(num_layers_list)):
        # hyperparameter settings
        ic.set_hyperparameters(num_layers=num_layers_list[j],
                                 use_pretrained_model=1,
                                 image_shape = "3,256,256",
                                 num_classes=9,
                                 num_training_samples=50399,
                                 mini_batch_size=128,
                                 epochs=epochs_list[i],
                                 learning_rate=0.01,
                                 precision_dtype='float32')
         
        ic.fit(inputs=data_channels, logs=True)
         
        latest_job_name = ic.latest_training_job.job_name
        job_metric = TrainingJobAnalytics(training_job_name=latest_job_name).dataframe()
        job_metric['epochs'] = epochs_list[i]
        job_metric['num_layers'] = num_layers_list[j]
         
        metric_df = pd.concat([metric_df, job_metric])

Model Serving with SageMaker Batch Transform and Apache Airflow

The image classification model we built required ML workflows to determine if a review image was qualified for credits. We established workflows with the following four steps.

  1. Import review images and metadata that must be automatically reviewed
  2. Infer the labels of the images (inference)
  3. Determine if credits should be given based on the inferred labels
  4. Store the results table in the production database

We are using Apache Airflow to manage data product workflows. It is a workflow scheduling and monitoring platform originally developed by Airbnb, known for its simple and intuitive web UI. It supports Amazon SageMaker, so it was easy to migrate the code developed in SageMaker Studio to Apache Airflow. There are two ways to run SageMaker jobs on Apache Airflow:

  1. Using Amazon SageMaker Operators
  2. Using Python Operators: Write a Python function with the Amazon SageMaker Python SDK on Apache Airflow and pass it as a Python callable, as in the following example:

import sagemaker

def transform(dt, bucket, training_job, **kwargs):
    estimator = sagemaker.estimator.Estimator.attach(training_job)
    transformer = estimator.transformer(instance_count=1,
                                        instance_type='ml.m4.xlarge',
                                        output_path=f's3://{bucket}/.../dt={dt}',
                                        max_payload=1)
    transformer.transform(data=f's3://{bucket}/.../dt={dt}',
                          data_type='S3Prefix',
                          content_type='application/x-image',
                          split_type='None')
    transformer.wait()

… 

transform_op = PythonOperator(
        task_id='transform',
        dag=dag,
        provide_context=True,
        python_callable=transform,
        op_kwargs={"dt": dt,
                   "bucket": bucket,
                   "training_job": training_job})

The second option let us keep the existing Python code that we already had in SageMaker Studio, and it didn't require us to learn the new syntax for Amazon SageMaker Operators.

However, we went through some trial and error, as it was our first time integrating Apache Airflow with Amazon SageMaker. The lessons we learned were:

  1. Boto3 update: Amazon SageMaker Python SDK version 2 required Boto3 1.14.12 or newer. Therefore, we needed to update the Boto3 version of our existing Apache Airflow environment, which was at 1.13.4.
  2. IAM Role and permission inheritance: AWS IAM roles used by Apache Airflow needed to inherit roles that could run Amazon SageMaker.
  3. Network configuration: To run SageMaker codes with Apache Airflow, its endpoints needed to be configured for network connections. The following endpoints were based on the AWS Regions and services that we were using. For more information, see the AWS website.

    1. api.sagemaker.ap-northeast-2.amazonaws.com
    2. runtime.sagemaker.ap-northeast-2.amazonaws.com
    3. aws.sagemaker.ap-northeast-2.studio

Outcomes

By automating review image inspection processes, we gained the following business outcomes:

  1. Increased work efficiency – Currently, 76% of images in the categories where the service is applied are inspected automatically, with 98% inspection accuracy.
  2. Consistency in giving credits – Credits are given based on clear criteria. However, there were occasions where credits were given differently for similar cases due to differences in inspectors’ judgments. The ML model applies the rules more consistently, giving higher consistency in how our credit policies are applied.
  3. Reduced human errors – Every human engagement carries a risk of human errors. For example, we had cases where Style Review criteria were used for Product Reviews. Our automatic inspection model dramatically reduced the risks of these human errors.

We gained the following benefits specifically by using Amazon SageMaker to automate the image inspection process:

  1. Established an environment where we can build and test models through modular processes – What we liked most about Amazon SageMaker is that it consists of modules. This lets us build and test services easily and quickly. We needed some time to learn Amazon SageMaker at first, but once we learned it, we could easily apply it in our operations. We believe that Amazon SageMaker is ideal for businesses requiring rapid service development, as in the case of the MUSINSA Store.
  2. Collected reliable input data with Amazon SageMaker Ground Truth – Collecting input data is becoming even more important than modeling itself in the area of ML. With the rapid advancement of ML, pre-trained models can perform much better than before, and without additional tuning. AutoML has also removed the need to write code for ML modeling. Therefore, the ability to collect quality input data is more important than ever, and using labeling services such as Amazon SageMaker Ground Truth is critical.

Conclusion

Going forward, we plan to automate not only model serving but also model training through automatic batches. We want our model to identify the optimal hyperparameters automatically when new labels or images are added. In addition, we will continue improving the performance of our model, namely recall and precision, based on the previously mentioned automated training method. We will increase our model coverage so that it can inspect more review images, reduce more costs, and achieve higher accuracy, which will all lead to higher customer satisfaction.

For more information about how to use Amazon SageMaker to solve your business problems using ML, visit the product webpage. And, as always, stay up to date with the latest AWS Machine Learning News here.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Jihye Park is a Data Scientist at MUSINSA who is responsible for data analysis and modeling. She loves working with ubiquitous data such as ecommerce. Her main role is data modeling but she has interests in data engineering too.

Sungmin Kim is a Sr. Solutions Architect at Amazon Web Services. He works with startups to architect, design, automate, and build solutions on AWS for their business needs. He specializes in AI/ML and Analytics.

Read More

How ReliaQuest uses Amazon SageMaker to accelerate its AI innovation by 35x 

Cybersecurity continues to be a top concern for enterprises. Yet the constantly evolving threat landscape that they face makes it harder than ever to be confident in their cybersecurity protections.

To address this, ReliaQuest built GreyMatter, an Open XDR-as-a-Service platform that brings together telemetry from any security and business solution, whether on-premises or in one or multiple clouds, to unify detection, investigation, response, and resilience.

In 2021, ReliaQuest turned to AWS to help it enhance its artificial intelligence (AI) capabilities and build new features faster.

Using Amazon SageMaker, Amazon Elastic Container Registry (ECR), and AWS Step Functions, ReliaQuest reduced the time needed to deploy and test critical new AI capabilities for its GreyMatter platform from eighteen months to two weeks. This increased the speed of its AI innovation by 35x.

“This innovative architecture has dramatically decreased the time to value of ReliaQuest’s data science initiatives.

Now, we can truly focus on what’s most important – developing powerful solutions to further improve the security of our customer’s environments in an ever-changing threat landscape.”

Lauren Jenkins, Snr Product Manager, Data Science, ReliaQuest

Using AI to enhance the performance of human analysts

GreyMatter takes a fundamentally new approach to cybersecurity, pairing advanced software with a team of highly-trained security analysts to deliver drastically improved security effectiveness and efficiency.

Although ReliaQuest’s security analysts are some of the best-trained security talent in the industry, a single analyst may receive hundreds of new security incidents on any given day. These analysts must review each incident to determine the threat level and the optimal response method.

To streamline this process, and reduce time to resolution, ReliaQuest set out to develop an AI-driven recommendation system that automatically matches new security incidents to similar previous occurrences. This enhanced the speed with which human analysts can identify the incident type as well as the best next action.

Using Amazon SageMaker to put AI to work faster

ReliaQuest had developed an initial machine learning (ML) model, but it was missing the supporting infrastructure to utilize it.

To solve this, ReliaQuest’s Data Scientist, Mattie Langford, and ML Ops Engineer, Riley Rohloff, turned to Amazon SageMaker. SageMaker is an end-to-end ML platform that helps developers and data scientists quickly and easily build, train, and deploy ML models.

Amazon SageMaker accelerates the deployment of ML workloads by simplifying the ML build process. It provides a broad set of ML capabilities on top of fully managed infrastructure. This removes the undifferentiated heavy lifting that too often hinders ML development.

ReliaQuest chose SageMaker because of its built-in hosting feature, a key capability that enabled ReliaQuest to quickly deploy its initial pre-trained model onto fully-managed infrastructure.

ReliaQuest also used Amazon ECR to store its pre-trained model images, taking advantage of Amazon ECR's fully managed container registry, which makes it easy to store, manage, share, and deploy container images and artifacts, such as pre-trained ML models, anywhere.

ReliaQuest chose Amazon ECR because of its native integration with Amazon SageMaker. This enabled it to serve custom model images for both training and predictions, the latter via a custom Flask application it had built.

Using Amazon SageMaker and Amazon ECR, a single ReliaQuest team developed, tested, and deployed its pre-trained model behind a managed endpoint quickly and efficiently, without needing to hand off to or depend on other teams for support.

Using AWS Step Functions to automatically retrain and improve model performance

In addition, ReliaQuest was able to build an entire orchestration layer for their ML workflow using AWS Step Functions, a low-code visual workflow service that can orchestrate AWS services, automate business processes, and enable serverless applications.

ReliaQuest chose AWS Step Functions because of its deep functionality and integration with other AWS services. This enabled ReliaQuest to build a fully automated learning loop for its model, including:

  • A trigger that looked for updated data in an S3 bucket
  • A full retraining process that created a new training job with the updated data
  • A performance assessment of that training job
  • Pre-defined accuracy thresholds to determine whether to update the deployed model through a new endpoint configuration (see the sketch after this list)
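
As an illustration only (ReliaQuest's actual state machine isn't public), a loop like this can be expressed in Amazon States Language, shown here as a Python dictionary; the state names, Lambda ARNs, and accuracy threshold are placeholders.

import json

# Amazon States Language definition for a simplified retraining loop.
retraining_loop = {
    "StartAt": "StartTrainingJob",
    "States": {
        "StartTrainingJob": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:start-training",
            "Next": "EvaluateModel",
        },
        "EvaluateModel": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:evaluate-model",
            "Next": "AccuracyCheck",
        },
        "AccuracyCheck": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.accuracy",
                    "NumericGreaterThanEquals": 0.90,   # placeholder threshold
                    "Next": "UpdateEndpoint",
                }
            ],
            "Default": "KeepCurrentModel",
        },
        "UpdateEndpoint": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:update-endpoint",
            "End": True,
        },
        "KeepCurrentModel": {"Type": "Succeed"},
    },
}

print(json.dumps(retraining_loop, indent=2))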

Using AWS to increase innovation and reimagine cybersecurity protection

By combining Amazon SageMaker, Amazon ECR, and AWS Step Functions, ReliaQuest was able to improve the speed with which it deployed and tested valuable new AI capabilities from eighteen months to two weeks, an acceleration of 35x in its new feature deployment.

Not only do these new capabilities continue to enhance GreyMatter's continuous threat detection, threat hunting, and remediation capabilities for its customers, but they also give ReliaQuest a step-change improvement in its ability to test and deploy new capabilities in the future.

In the complex landscape of cybersecurity threats, ReliaQuest’s use of AI to enhance its human analysts will continue to improve their effectiveness. Furthermore, its accelerated innovation capabilities will enable it to continue helping its customers stay ahead of the rapidly evolving threats that they face.

Learn more about how you can accelerate your ability to innovate with AI by visiting Getting Started with Amazon SageMaker or reviewing the Amazon SageMaker Developer Resources today.


About the Author

Daniel Burke is the European lead for AI and ML in the Private Equity group at AWS. In this role, Daniel works directly with Private Equity funds and their portfolio companies to design and implement AI and ML solutions that accelerate innovation and generate additional enterprise value.

Read More

Blur faces in videos automatically with Amazon Rekognition Video

With the advent of artificial intelligence (AI) and machine learning (ML), customers and the general public have become increasingly aware of their privacy, as well as the value that it holds in today’s data-driven world. Enterprises are actively seeking out and marketing privacy-first solutions, especially in the Computer Vision (CV) domain. They need to reassure their customers that personal information such as faces are anonymized and generally kept safe.

Face blurring is one of the best-known practices when anonymizing both images and videos. It usually involves first detecting the face in an image/video, then applying a blob of pixels or other distortion effects on it. This workload can be considered a CV task. First, we analyze the pixels of the image/video until a face is recognized, then we extract the area where the face is in every frame, and finally we apply a mask on the previously found pixels. The first part of this can be achieved with ML and Deep Learning tools, such as Amazon Rekognition, while the second part is standard pixel manipulation.

In this post, we demonstrate how AWS Step Functions can be used to orchestrate AWS Lambda functions that call Amazon Rekognition Video to detect faces in videos, and use an open source CV and ML software library called OpenCV to blur them.

Solution overview

In our solution, AWS Step Functions, a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications, is used to orchestrate the calls and manage the flow of data between AWS Lambda functions. When an object is created in an Amazon Simple Storage Service (S3) bucket, for example by a video file upload, an ObjectCreated event is detected and a first Lambda function is triggered. This Lambda function makes an asynchronous call to the Amazon Rekognition Video face detection API and starts the execution of the AWS Step Functions workflow.

Inside the workflow, we use a Lambda function and a Wait State until the Amazon Rekognition Video asynchronous analysis started earlier finishes execution. Afterward, another Lambda function retrieves the result of the completed process from Amazon Rekognition and passes it to another Lambda function that uses OpenCV to blur the detected faces. To easily use OpenCV with our Lambda function, we built a Docker image hosted on Amazon Elastic Container Registry (ECR), and then deployed on AWS Lambda thanks to Container Image Support.

The architecture is entirely serverless, so we don’t need to provision, scale, or maintain our infrastructure. We also use Amazon Rekognition, a highly scalable and managed AWS AI service that requires no deep learning expertise.

Moreover, we have built our application with the AWS Cloud Development Kit (AWS CDK), an open-source software development framework. This lets us write Infrastructure as Code (IaC) using Python, thereby making the application easy to deploy, modify, and maintain.

Let’s look closer at the suggested architecture:

  1. The event flow starts at the moment of the video ingestion into Amazon S3. Amazon Rekognition Video supports MPEG-4 and MOV file formats, encoded using the H.264 codec.
  2. After the video file has been stored in Amazon S3, it automatically kicks off an event that triggers a Lambda function.
  3. The Lambda function uses the video’s attributes (name and location on Amazon S3) to start the face detection job on Amazon Rekognition through an API call.
  4. The same Lambda function then starts the Step Functions state machine, forwarding the video’s attributes and the Amazon Rekognition job ID.
  5. The Step Functions workflow starts with a Lambda function waiting for the Amazon Rekognition job to finish. Once it's done, another Lambda function gets the results from Amazon Rekognition (see the sketch after this list).
  6. Finally, a Lambda function with Container Image Support fetches its Docker image (which includes OpenCV) from Amazon ECR, blurs the faces detected by Amazon Rekognition, and temporarily stores the output video locally.
  7. Then, the blurred video is put into the output S3 bucket and removed from local files.
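
Steps 3 and 5 come down to two Amazon Rekognition Video API calls. The following is a minimal sketch of their general shape; the bucket and object names are placeholders, and the actual Lambda functions in the repository also handle the Step Functions wait loop and error cases.

import boto3

rekognition = boto3.client("rekognition")

# Step 3: start the asynchronous face detection job for the uploaded video.
start_response = rekognition.start_face_detection(
    Video={"S3Object": {"Bucket": "my-input-bucket", "Name": "video.mp4"}},  # placeholders
)
job_id = start_response["JobId"]

# Step 5: once the Step Functions Wait state sees the job has finished,
# fetch the (paginated) results.
faces = []
next_token = None
while True:
    kwargs = {"JobId": job_id}
    if next_token:
        kwargs["NextToken"] = next_token
    result = rekognition.get_face_detection(**kwargs)
    faces.extend(result["Faces"])  # each item: Timestamp in ms plus Face.BoundingBox
    next_token = result.get("NextToken")
    if not next_token:
        break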

Providing a serverless function access to OpenCV is easier than ever with Container Image Support. Instead of uploading a code package to AWS Lambda, the function’s code resides in a Docker image that is hosted in Amazon Elastic Container Registry.

FROM public.ecr.aws/lambda/python:3.7
# Install the function's dependencies
# Copy file requirements.txt from your project folder and install
# the requirements in the app directory.
COPY requirements.txt  .
RUN  pip install -r requirements.txt
# Copy helper functions
COPY video_processor.py video_processor.py
# Copy handler function (from the local app directory)
COPY  app.py  .
# Overwrite the command by providing a different command directly in the template.
CMD ["app.lambda_function"]

If you want to build your own application using Amazon Rekognition face detection for videos and OpenCV to process videos with Python, consider the following:

  • Amazon Rekognition API responses for videos contain faces-detected timestamps in milliseconds
  • OpenCV works on frames and uses the video’s frame rate to combine frames into a video

Therefore, you must convert Amazon Rekognition information to make it usable with OpenCV. You may find our implementation in the apply_faces_to_video function, in /rekopoc-apply-faces-to-video-docker/video_processor.py.
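
To make the conversion concrete, here is a simplified sketch of the general idea, not the repository's apply_faces_to_video implementation; it assumes faces is the list of items returned by the Amazon Rekognition GetFaceDetection API.

import cv2


def blur_faces(input_path, output_path, faces):
    cap = cv2.VideoCapture(input_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))

    # Group Rekognition bounding boxes by the frame index they belong to.
    boxes_per_frame = {}
    for item in faces:
        frame_index = int(item["Timestamp"] / 1000 * fps)  # ms -> frame number
        boxes_per_frame.setdefault(frame_index, []).append(item["Face"]["BoundingBox"])

    frame_index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for box in boxes_per_frame.get(frame_index, []):
            # Bounding boxes are expressed as ratios of the frame dimensions.
            x, y = int(box["Left"] * width), int(box["Top"] * height)
            w, h = int(box["Width"] * width), int(box["Height"] * height)
            frame[y:y + h, x:x + w] = cv2.GaussianBlur(frame[y:y + h, x:x + w],
                                                       (99, 99), 30)
        writer.write(frame)
        frame_index += 1

    cap.release()
    writer.release()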

Deploy the application

If you want to deploy the sample application to your own account, go to this GitHub repository. Clone it to your local environment (you can also use tools such as AWS Cloud9) and deploy it via cdk deploy. Find more details in the later section “Deploy the AWS CDK application”. First, let’s look at the repository project structure.

Project structure

This project contains source code and supporting files for a serverless application that you can deploy with the AWS CDK. It includes the following files and folders.

  • rekognition_video_face_blurring_cdk/ – CDK Python code for deploying the application.
  • rekopoc-apply-faces-to-video-docker/ – Code for Lambda function: uses OpenCV to blur faces per frame in video, uploads final result to output S3 bucket.
  • rekopoc-check-status/ – Code for Lambda function: Gets face detection results for the Amazon Rekognition Video analysis.
  • rekopoc-get-timestamps-faces/ – Code for Lambda function: Gets bounding boxes of detected faces and associated timestamps.
  • rekopoc-start-face-detect/ – Code for Lambda function: is triggered by an S3 event when a new .mp4 or .mov video file is uploaded, starts asynchronous detection of faces in a stored video, and starts the execution of AWS Step Functions’ State Machine.
  • requirements.txt – Required packages for deploying the AWS CDK application.

The application uses several AWS resources, including AWS Step Functions, Lambda functions, and S3 buckets. These resources are defined in the rekognition_video_face_blurring_cdk/rekognition_video_face_blurring_cdk_stack.py of this project. Update the Python code to add AWS resources through the same deployment process that updates your application code. Depending on the size of the video that you want to anonymize, you might need to update the configuration of the Lambda functions and adjust memory and timeout. You can provision a maximum of 10,240 MB (10 GB) of memory, and configure your AWS Lambda functions to run up to 15 minutes per execution.
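
For example, a Docker image-based Lambda function's memory and timeout might be declared in the CDK stack along the following lines; the class and construct names are illustrative, not the repository's exact code, and the snippet assumes AWS CDK v2.

from aws_cdk import Duration, Stack, aws_lambda as _lambda
from constructs import Construct


class FaceBlurringStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Docker image-based Lambda function with increased memory and timeout
        # for processing larger videos with OpenCV.
        _lambda.DockerImageFunction(
            self, "ApplyFacesToVideo",                   # illustrative construct ID
            code=_lambda.DockerImageCode.from_image_asset(
                "rekopoc-apply-faces-to-video-docker"),  # folder with the Dockerfile
            memory_size=10240,                           # up to 10,240 MB (10 GB)
            timeout=Duration.minutes(15),                # up to 15 minutes
        )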

Deploy the AWS CDK application

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework to define your cloud application resources using familiar programming languages. This project uses the AWS CDK in Python.

To build and deploy your application for the first time, you must:

Step 1: Ensure you have Docker running.
You will need Docker running to build the image before pushing it to Amazon ECR.

Step 2: Configure your AWS credentials.
The easiest way to satisfy this requirement is to issue the following command in your shell:

aws configure

For additional guidance on how to set up your AWS CLI installation, follow the Quick configuration with aws configure from the AWS CLI user guide.

Step 3: Install the AWS CDK and the requirements.
Simply run the following in your shell:

npm install -g aws-cdk
pip install -r requirements.txt
  • The first command will install the AWS CDK Toolkit globally using Node Package Manager.
  • The second command will install all of the Python packages needed by the AWS CDK using pip package manager. This command should be issued from the root folder of the cloned GitHub repository.

Step 4: Bootstrap your AWS environment for the CDK and deploy the application.

cdk bootstrap
cdk deploy
  • The first command will provision initial resources that the AWS CDK needs to perform the deployment. These resources include an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments.
  • Finally, cdk deploy will deploy the stack.

Step 5: Test the application.
Upload a video to the input S3 bucket through the AWS Management Console, the AWS CLI, or the SDK, and find the result in the output bucket.

Cleanup

To delete the sample application that you created, use the AWS CDK:

cdk destroy

Conclusion

In this post, we showed you how to deploy a solution that automatically blurs faces in videos without provisioning or managing any servers. We used the Amazon Rekognition Video face detection feature, Container Image Support for AWS Lambda functions to easily work with OpenCV, and we orchestrated the whole workflow with AWS Step Functions. Finally, we made our solution comprehensive and reusable with the AWS CDK to make it easier to deploy and adapt.

Next Steps

If you have feedback about this post, submit it in the Comments section below. For more information, visit the following links about the tools and services that we used and follow the code in GitHub. We look forward to your feedback and contributions!


About the Authors

Anastasia Pachni Tsitiridou is a Solutions Architect at AWS. She is based in Amsterdam and supports ISVs across the Benelux in their cloud journey. She studied Electrical and Computer Engineering before being introduced to Computer Vision. What she enjoys most nowadays is working at the intersection of CV and ML.

Olivier Sutter is a Solutions Architect in France. He is based in Paris and always sets his customers’ best interests as his top priority. With a strong academic background in applied mathematics, he started developing his AI/ML passion at university, and now thrives applying this knowledge on real-world use-cases with his customers.

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout the Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML in his last years of university and has been in love with it ever since.

Read More