Easily customize your notifications while using Amazon Lookout for Metrics

We are excited to announce that you can now add filters to alerts and edit existing alerts in Amazon Lookout for Metrics. With this launch, you can add filters to your alert configuration so that you're notified only about the anomalies that matter most to you. You can also modify existing alerts as your notification needs evolve.

Lookout for Metrics uses machine learning (ML) to automatically monitor the metrics that are most important to businesses with greater speed and accuracy. The service also makes it easier to diagnose the root cause of anomalies like unexpected dips in revenue, high rates of abandoned shopping carts, spikes in payment transaction failures, increases in new user signups, and many more. Lookout for Metrics goes beyond simple anomaly detection. It allows developers to set up autonomous monitoring for important metrics to detect anomalies and identify their root cause in a matter of few clicks, using the same technology used by Amazon internally to detect anomalies in its metrics—all with no ML experience required.

Alerts are an optional feature that lets you set up notifications for anomalies in your datasets, which are sent through Amazon Simple Notification Service (Amazon SNS) and AWS Lambda functions. Previously, when you set up an alert, you were notified of all detected anomalies above the severity score you selected, which made it challenging to quickly identify the anomalies most relevant to your business. Now, with filters and editable alerts, different business units within your organization can specify the types of alerts they receive. Your developers can receive alerts on anomalies related to the services they build, while your business analysts and business managers can track anomalies related to the state of the business, such as an underperforming location. For example, you may set up an alert to be notified when there is a spike or drop in your revenue, but you may only be interested in a specific store location and a particular product. The filtering capability lets you get alerted only when a revenue anomaly fits the criteria you have set.

Solution overview

In this post, we demonstrate how to create an alert with filters and how the configured filters publish notifications only for anomalies matching the filter criteria. The alert filters are based on metrics and dimensions that are present in the dataset definition for the anomaly detector. The solution enables you to use alert filters to get targeted notifications for anomalies detected in your data. The following diagram illustrates the solution architecture.

Provision resources with AWS CloudFormation

You can use the provided AWS CloudFormation stack to set up resources for the walkthrough. It contains resources that continuously generate live data and publish it to Amazon S3, create a detector (named TestAlertFilters), and add a dataset (named AlertFiltersDataset) to the detector. Complete the following steps:

  1. Choose Launch Stack:
  2. Choose Next.
  3. Enter a stack name (for example, L4MAlertFiltersStack).
  4. Enter the values for the detector (TestAlertFilters) and dataset (AlertFiltersDataset).
  5. Choose Next.
  6. Leave the settings for Configure stack options at their defaults and choose Next.
  7. Select the acknowledgement check box and choose Create stack.

Activate the detector created by the CloudFormation template

To set up your detector, complete the following steps:

  1. On the Lookout for Metrics console, choose Detectors in the navigation pane.
  2. Select the detector TestAlertFilters and choose View details.
  3. To activate the detector, you can either choose Activate at the top or choose Activate detector under How it works.
  4. Choose Activate to confirm if you want to activate the detector for continuous detection.

A confirmation message shows that the detector is activating. Activation can take up to 1 hour to complete. In the meantime, we can proceed with alert configuration.
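
If you prefer to script this step, you can also activate the detector programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3); the detector ARN is a placeholder that you need to replace with the ARN of TestAlertFilters (available on the detector details page or in the CloudFormation stack outputs).

import boto3

lookoutmetrics = boto3.client('lookoutmetrics')

# Placeholder: replace with the ARN of the TestAlertFilters detector
detector_arn = 'arn:aws:lookoutmetrics:us-west-2:111122223333:AnomalyDetector:TestAlertFilters'

# Start continuous detection; activation can take up to an hour
lookoutmetrics.activate_anomaly_detector(AnomalyDetectorArn=detector_arn)

# Optionally check the detector status while it activates
status = lookoutmetrics.describe_anomaly_detector(AnomalyDetectorArn=detector_arn)['Status']
print(status)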

Configure your alert

We now configure an alert to get notifications for anomalies detected by the detector. Alert filters are optional configurations, and you can select up to 5 measures and 5 dimensions while adding filters. In this post, we walk through creating an alert with filters. Complete the following steps:

  1. On your detector details page, choose Add alerts.
  2. Confirm your alert name.
    Lookout for Metrics populates the configuration fields with the metrics and dimensions supplied during dataset creation. In this release, the Severity score field is optional, whereas it was previously required. By default, we start with a severity score of 70, which you can change or remove.
  3. To add a measure, choose Add criteria and choose Measure.
  4. For Measure EQUALS, choose the revenue measure.
  5. Choose Add criteria again and choose Dimension.

    You can choose up to 5 dimension filters. For this post, we configure two.
  6. For Dimension, choose the marketplace dimension.
  7. For Equals, add the values US and CA.
  8. Add category as your second dimension with the values fashion and jewellery.
  9. For Severity score, enter 20.
  10. For Channel, choose Amazon SNS.
  11. Choose your SNS topic (for this post, we use the SNS topic to which we already subscribed our email to receive the alert notifications).
  12. Choose your format (for this post, we choose Long Text).
  13. Under Service access, select Use an existing service role and choose your role.
  14. Choose Add alert.

    A message appears when the alert is created successfully.
  15. Select the alert and choose View details.

You can review the alert filters and other details. The Filter criteria explains how the configured filters are used to filter anomalies before publishing alert notifications.
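
If you prefer to automate alert creation, the same configuration can be expressed through the CreateAlert API. The following is a hedged Boto3 sketch; the detector ARN, service role ARN, and SNS topic ARN are placeholders, and the filter structure mirrors the measure, dimension, and severity settings configured in the console above.

import boto3

lookoutmetrics = boto3.client('lookoutmetrics')

# Placeholders: replace the ARNs with your detector, service role, and SNS topic
response = lookoutmetrics.create_alert(
    AlertName='testRevenueForFashionOrJewelleryInUSOrCA',
    AlertSensitivityThreshold=20,  # severity score threshold
    AnomalyDetectorArn='arn:aws:lookoutmetrics:us-west-2:111122223333:AnomalyDetector:TestAlertFilters',
    Action={
        'SNSConfiguration': {
            'RoleArn': 'arn:aws:iam::111122223333:role/L4MAlertRole',
            'SnsTopicArn': 'arn:aws:sns:us-west-2:111122223333:filterAlertsDemoTopic',
            'SnsFormat': 'LONG_TEXT'
        }
    },
    AlertFilters={
        'MetricList': ['revenue'],
        'DimensionFilterList': [
            {'DimensionName': 'marketplace', 'DimensionValueList': ['US', 'CA']},
            {'DimensionName': 'category', 'DimensionValueList': ['fashion', 'jewellery']}
        ]
    }
)
print(response['AlertArn'])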

If you want to modify the alert configuration, select the alert on the Alerts page and choose Edit.

Alternatively, you can open the alert details page and choose Edit.

You’re redirected to the Edit page, where you can modify the alert configuration as required. You can modify the same configurations you set when you created the alert, but you can’t change the alert name while editing.

Review and analyze the results

When Lookout for Metrics detects anomalies in your data, it sends a notification if alerts were configured on that detector. If the anomaly group details match the filter criteria (measure filter, dimension filter, and severity score) of the alert, a notification is published.

For this example, we created two alerts on the detector, testAlertWithNoFilters and testRevenueForFashionOrJewelleryInUSOrCA, and injected anomalies in our data. We also enabled email subscription on the SNS topic used for alert notification publishing. The following screenshots show the details for each alert.

The following is an example of an anomaly notification for testRevenueForFashionOrJewelleryInUSOrCA:

{
"Type" : "Notification",
 "MessageId" : "0b0a7bfe-d029-5f4f-b706-20f644793c3d",
 "TopicArn" : "arn:aws:sns:us-west-2:488415817882:filterAlertsDemoTopic",
 "Message" : "[Amazon LookoutForMetrics] The anomaly detector TestAlertFilters detected 
             an anomaly in revenue with a severity score of 77.3 on May 25, 2022 at 8:05 PM.
             nAnomalous graphs were detected for the following:n
             nrevenue for: jewellery, thirdParty, CA, regular, priorityn
             nrevenue for: electronics, self, MX, premium, overnightn
             nrevenue for: electronics, self, US, regular, overnightn
             nTo view the anomaly, visit the Lookout for Metrics console at: 
             https://us-west-2.console.aws.amazon.com/lookoutmetrics/home?region=us-west-2#arn:aws:lookoutmetrics:us-west-2:488415817882:AnomalyDetector:TestAlertFilters/anomalies/anomaly/bd0a07e1-c520-46bd-aaa3-dcc00583d707
             nTo modify settings for this alert: https://us-west-2.console.aws.amazon.com/lookoutmetrics/home?region=us-west-2#arn:aws:lookoutmetrics:us-west-2:488415817882:AnomalyDetector:TestAlertFilters/alerts/alertDetails/arn:aws:lookoutmetrics:us-west-2:488415817882:Alert:testRevenueForFashionOrJewelleryInUSOrCA",
 "Timestamp" : "2022-05-25T20:31:12.330Z",
 "SignatureVersion" : "1",
 "Signature" : "pFDZj3TwLrL9rqjkRiVgbWjcrPhxz5PDV485d6NroLXWhrviX7sUEQqOIL5j8YYd0SFBjFEkrZKZ27RSbd+33sRhJ52mmd1eR23cZQP68+iIVdpeWubcPgGnqxoOa3APE1WZr4SmVK/bgJAjX1RXn0rKZvPzwDkxPD2fZB4gnbqPJ8GBw/1dxU5qfJzRpkqc87d1gpvQIwMpb5uUROuPZEQVyaR/By0BTsflkE2Sz2mOeZQkMaXz3q9dwX/qDxyR9q6gNviMagGtOLwtb6StN8/PUYlvK9fCBcJnJxg0bdmMtnXiXWdl1O7J50Wqj4Tkl8amph97UlVAnComoe649g==",
 "SigningCertURL" : "https://sns.us-west-2.amazonaws.com/SimpleNotificationService-7ff5318490ec183fbaddaa2a969abfda.pem",
 "UnsubscribeURL" : "https://sns.us-west-2.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-west-2:488415817882:filterAlertsDemoTopic:8f24ae74-b160-44c7-8bc9-96a30e27d365"
}

The following is an example of an anomaly notification for testAlertWithNoFilters:

{
 "Type" : "Notification",
 "MessageId" : "fcc70263-f2c1-52ed-81ec-596b8c399b67",
 "TopicArn" : "arn:aws:sns:us-west-2:488415817882:filterAlertsDemoTopic",
 "Message" : "[Amazon LookoutForMetrics] The anomaly detector TestAlertFilters detected 
             an anomaly in revenue with a severity score of 77.59 on May 25, 2022 at 6:35 PM.
             nAnomalous graphs were detected for the following:n
             nrevenue for: jewellery, self, UK, regular, overnightn
             nrevenue for: jewellery, thirdParty, JP, premium, overnightn
             nrevenue for: electronics, thirdParty, DE, premium, priorityn
             nTo view the anomaly, visit the Lookout for Metrics console at: 
             https://us-west-2.console.aws.amazon.com/lookoutmetrics/home?region=us-west-2#arn:aws:lookoutmetrics:us-west-2:488415817882:AnomalyDetector:TestAlertFilters/anomalies/anomaly/194c87f4-3312-420c-8920-12fbfc9b1700
             nTo modify settings for this alert: https://us-west-2.console.aws.amazon.com/lookoutmetrics/home?region=us-west-2#arn:aws:lookoutmetrics:us-west-2:488415817882:AnomalyDetector:TestAlertFilters/alerts/alertDetails/arn:aws:lookoutmetrics:us-west-2:488415817882:Alert:testAlertWithNoFilters",
 "Timestamp" : "2022-05-25T19:00:08.374Z",
 "SignatureVersion" : "1",
 "Signature" : "e4+BHo4eh8wNbfQMaR3L8MWY2wkpqxoxKKrj2h/QROQHvhcnYfucYchjfppgjM8LNIF7Oo4QfuP6qcLj9DlghiMZ80qpzHyAH6vmIDfSjK7Bz23i8rnIMyKJIVRFN8z69YlC9vfsp3MayWyyMJcskeVJ1bzsdkDIeA5gkT1le8yh/9nhbsgwm+bowNjsnl+/sFwk6QZJlplYB27sOqegrm73nH/CrmTe4FcPtekCRysSECwMLKazPJqR1uiGagnWfUeyTptRg9rVQVQJJdmOUwlv8vodR96s52btAegpY4iZZLUJ87vs1PwOwVfTTIHf+pdnwPUuFupzejUEudP7sQ==",
 "SigningCertURL" : "https://sns.us-west-2.amazonaws.com/SimpleNotificationService-7ff5318490ec183fbaddaa2a969abfda.pem",
 "UnsubscribeURL" : "https://sns.us-west-2.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-west-2:488415817882:filterAlertsDemoTopic:8f24ae74-b160-44c7-8bc9-96a30e27d365"
}

We didn’t receive the notification for this anomaly through the testRevenueForFashionOrJewelleryInUSOrCA alert because the anomaly group details don’t match the filter criteria for dimension marketplace. For our filter criteria on the measure revenue, the dimension marketplace must equal US or CA, and the dimension category must equal fashion or jewellery, with a severity threshold of 20.

Although the anomaly detected matches the filter criteria for the measure, severity score, and category dimension, it doesn’t match the criteria for the marketplace dimension, so the alert wasn’t published.

Based on the notifications we received, we can confirm that Lookout for Metrics detected anomalies and verified the alert filter-based notifications.

Clean up

After you complete the testing, you can delete the CloudFormation stack created by the template. Deleting the stack cleans up all the resources created for this test. To delete the stack, open the AWS CloudFormation console, select the stack L4MAlertFiltersStack, and choose Delete.

Deletion of the stack doesn’t delete the S3 bucket created by the template because it’s not empty; you have to delete it manually.

Conclusion

You can now easily customize your notification experience by adding filters and editing existing alerts to reduce noise and focus on the metrics that matter the most to your business.

To learn more about this capability, see Working with Alerts. You can use this capability in all Regions where Lookout for Metrics is publicly available. For more information about Region availability, see AWS Regional Services.


About the Authors

Alex Kim is a Sr. Product Manager for AWS AI Services. His mission is to deliver AI/ML solutions to all customers who can benefit from it. In his free time, he enjoys all types of sports and discovering new places to eat.

Utkarsh Dubey is a Software Development Engineer in the Lookout for Metrics team. His interests lie in building scalable distributed systems. In his spare time, he enjoys traveling and catching up with friends.

Read More

Use a pre-signed URL to provide your business analysts with secure access to Amazon SageMaker Canvas

Agility and security have historically been two aspects of IT of paramount importance for any company. With the simplification of access to advanced IT technologies thanks to low-code and no-code (LCNC) tools, an even bigger number of people must be enabled to access resources, without impacting security. For many companies, the solution has been to develop a company web portal, which simplifies access to cloud applications and resources, by redirecting to or embedding applications, so that employees can have a single point of access to the services they use most.

In this post, we suggest an architecture for a company with an existing web portal to generate a pre-signed URL that redirects to Amazon SageMaker Canvas, a visual point-and-click interface for business analysts to build machine learning (ML) models and generate accurate predictions without writing code or having any previous ML experience. This way, business analysts can access Canvas without having to log in via the AWS Management Console.

Solution overview

The solution architecture is composed of three main parts:

  • The company web portal, with its own system for authentication of users and other resources.
  • An AWS Lambda function, responsible for calling the Amazon SageMaker SDK. This function is directly called via its function URL, a simple way to assign an HTTP(S) endpoint to the Lambda function directly, without the need for a REST API.
  • The Canvas app.

The following diagram illustrates the solution workflow.

The flow has four steps:

  1. The business analyst accesses the company portal, (optionally) authenticates, then chooses to generate a Canvas URL.
  2. The Lambda function receives information about the user from the company portal, and uses it to call SageMaker via an AWS SDK to generate a presigned Canvas URL. For this post, we use the AWS SDK for Python (Boto3).
  3. The generated URL is sent back to the business analyst through the company portal.
  4. The business analyst can then choose that link to access Canvas directly, without having to access the console.

Prerequisites

Before you implement the solution architecture, make sure that you have correctly onboarded to an Amazon SageMaker Studio domain using AWS Identity and Access Management (IAM). For instructions, refer to Onboard to Amazon SageMaker Domain Using IAM. Using IAM as the authentication method is a strict requirement, because the CreatePresignedDomainUrl API requires it and doesn't work when your domain uses AWS Single Sign-On authentication. Also, make sure you have created at least one user profile for your Studio domain.

Deploy the solution

The first step is to create the Lambda function.

  1. On the Lambda console, choose Create function.
  2. For Name, enter a name (for this post, canvas-presignedURL).
  3. For Runtime, choose Python 3.9.
  4. For Architecture, select your preferred architecture (for this post, we select arm64).
  5. Under Permissions, expand Change default execution role.
  6. Select Create a new role with basic Lambda permissions.
    We change the Lambda permissions in a later step.
  7. Under Advanced settings, select Enable function URL.
  8. For Auth type, select NONE.
    For this post, we don’t provide authentication details to our requests. However, this isn’t a best practice and it’s not advised for production workloads. We suggest using IAM authentication for your Lambda function, or another method for authentication and authorization such as Amazon Cognito.
  9. If your domain runs in a VPC, select Enable VPC to access those private resources.
  10. Choose Create function.
    Function creation takes a few seconds to complete. You can now set up the permissions to run SageMaker calls.
  11. On the Configuration tab, choose Permissions in the left pane.
  12. Choose your role name.

    You’re redirected to the IAM console.
  13. Choose Add permissions.
  14. Choose Create inline policy.
  15. For Service, choose SageMaker.
  16. For Actions, choose CreatePresignedDomainUrl.
  17. For Resources, select Any in this account.
  18. Choose Review.
  19. Enter a name for the policy (for this post, CanvasPresignedURLsFromLambda).
  20. Choose Create policy.
    The policy is now created and assigned to the role. You can close the IAM console tab and return to the Lambda console. Now it's time to change our code base to call SageMaker. We use the Boto3 call create_presigned_domain_url.
  21. On the Code tab, replace the code inside the lambda_function.py file with the following:
    import json
    import boto3
    
    sagemaker = boto3.client('sagemaker')
    SESSION_EXPIRATION_IN_SECONDS = 8*60*60 # the session will be valid for 8 hours
    URL_TIME_TO_LIVE_IN_SECONDS = 60 # the URL is only valid for 60 seconds
    
    def lambda_handler(event, context):
        
        # Parse the event body
        body = json.loads(event['body'])
        
        # Pass the domain ID and user profile name as part of the request
        domain_id = body['domain_id']
        user_profile_name = body['user_profile_name']
        
        # Call the service to create the URL
        response = sagemaker.create_presigned_domain_url(
            DomainId=domain_id,
            UserProfileName=user_profile_name,
            SessionExpirationDurationInSeconds=SESSION_EXPIRATION_IN_SECONDS,
            ExpiresInSeconds=URL_TIME_TO_LIVE_IN_SECONDS
        )
        studio_url = response['AuthorizedUrl']
        
        # Add the redirect to Canvas
        canvas_url = studio_url + '&redirect=Canvas'
        
        # Return to the app
        return {
            'statusCode': 200,
            'body': json.dumps(canvas_url)
        }

    The preceding code consists of three main steps:

    • Parsing the body of the request and retrieving the Studio domain ID and user profile name
    • Calling the API with this information
    • Adding the redirection to Canvas and returning the result

    Now that the function is ready, let’s test it.

  22. Choose Deploy, then choose Test.
  23. In the test event configuration, provide the following event JSON, substituting the correct values:
    {
      "body": "{"domain_id": "<YOUR-DOMAIN-ID>","user_profile_name": "<YOUR-USER-PROFILE>"}"
    }

  24. Save the test event and choose Test again.

Your result should now be available in the body of your response.

You can now test this with your HTTP request tool of choice, such as curl or Postman, to integrate it into your existing company web portal. The following screenshot shows a Postman POST request to the Lambda function URL created in the previous steps, and the response payload containing the pre-signed URL.
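
For reference, the following is a minimal Python sketch of the same call using the requests library; the function URL, domain ID, and user profile name are placeholders for your own values.

import requests

# Placeholders: your Lambda function URL, Studio domain ID, and user profile name
function_url = 'https://<url-id>.lambda-url.<region>.on.aws/'
payload = {'domain_id': '<YOUR-DOMAIN-ID>', 'user_profile_name': '<YOUR-USER-PROFILE>'}

# POST the request; the response body contains the pre-signed Canvas URL
response = requests.post(function_url, json=payload)
print(response.json())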

The following screenshot shows an example of a (simplified) company web portal that, upon login, generates a pre-signed URL to access Amazon SageMaker Canvas.

Conclusion

In this post, we discussed a solution to help business analysts experience no-code ML via Canvas in a secured and unified way through their company web portal, without the need to allow access via the console. We used a Lambda function to generate a presigned URL, which the business analyst can use directly in their browser.

To make this solution production-ready, we suggest considering how to implement authentication and authorization, either via IAM authentication of Lambda functions with function URLs, or more advanced solutions based on Amazon API Gateway, such as API Gateway Lambda authorizers. For more information, refer to Security and auth model for Lambda function URLs.

If you haven’t built your company web portal yet, you might want to check out AWS Amplify Studio, a visual development environment that lets developers easily build and ship complete web and mobile apps in hours instead of weeks. With Amplify Studio, you can quickly build an app backend, create rich user interface (UI) components, and connect a UI to the backend with minimal coding.

To learn more about Canvas, check out Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts.


About the Author

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since very young, starting to code at the age of 7. He started learning AI/ML in his later years of university, and has fallen in love with it since then.

Read More

Enable business analysts to access Amazon SageMaker Canvas without using the AWS Management Console with AWS SSO

IT has evolved in recent years: thanks to low-code and no-code (LCNC) technologies, an increasing number of people with varying backgrounds require access to tools and platforms that were previously a prerogative of more tech-savvy individuals in the company, such as engineers or developers.

Out of those LCNC technologies, we have recently announced Amazon SageMaker Canvas, a visual point-and-click interface for business analysts to build machine learning (ML) models and generate accurate predictions without writing code or having any previous ML experience.

To enable agility for those new users while ensuring security of the environments, many companies have chosen to adopt single sign-on technology, such as AWS Single Sign-On. AWS SSO is a cloud-based single sign-on service that makes it easy to centrally manage SSO access to all your AWS accounts and cloud applications. It includes a user portal where end-users can find and access all their assigned AWS accounts and cloud applications in one place, including custom applications that support Security Assertion Markup Language (SAML) 2.0.

In this post, we walk you through the necessary steps to configure Canvas as a custom SAML 2.0 application in AWS SSO, so that your business analysts can seamlessly access Canvas with their credentials from AWS SSO or other existing identity providers (IdPs), without the need to do so via the AWS Management Console.

Solution overview

To establish a connection from AWS SSO to the Amazon SageMaker Studio domain app, you must complete the following steps:

  1. Create a user profile in Studio for every AWS SSO user that should access Canvas.
  2. Create a custom SAML 2.0 application in AWS SSO and assign it to the users.
  3. Create the necessary AWS Identity and Access Management (IAM) SAML provider and AWS SSO role.
  4. Map the necessary information from AWS SSO to the SageMaker domain via attribute mappings.
  5. Access the Canvas application from AWS SSO.

Prerequisites

To connect Canvas to AWS SSO, you must have the following prerequisites set up:

Create a Studio domain user profile

In a Studio domain, every user has their own user profile. Studio apps like Studio IDE, RStudio, and Canvas can be created by these user profiles, and are bound to the user profile that has created them.

For AWS SSO to access the Canvas app for a given user profile, you have to map the user profile name to the user name in AWS SSO. This way, the AWS SSO user name—and therefore the user profile name—can be passed automatically by AWS SSO to Canvas.

In this post, we assume that AWS SSO users are already available, created during the prerequisites of onboarding to AWS SSO. You need a user profile for each AWS SSO user that you want to onboard to your Studio domain and therefore to Canvas.

To retrieve this information, navigate to the Users page on the AWS SSO console. Here you can see the user name of your user, in our case davide-gallitelli.

With this information, you can now go to your Studio domain and create a new user profile called exactly davide-gallitelli.

If you have another IdP, you can use any information provided by it to name your user profile, as long as it’s unique for your domain. Just make sure you map it correctly according to AWS SSO attribute mapping.
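
If you have many AWS SSO users to onboard, you can also create the Studio user profiles programmatically instead of through the console. The following is a minimal sketch using Boto3; the domain ID and the list of user names are placeholders.

import boto3

sagemaker = boto3.client('sagemaker')

# Placeholders: your Studio domain ID and the AWS SSO user names to onboard
domain_id = '<YOUR-DOMAIN-ID>'
sso_user_names = ['davide-gallitelli']

# Create one Studio user profile per AWS SSO user, named exactly after the SSO user name
for user_name in sso_user_names:
    sagemaker.create_user_profile(
        DomainId=domain_id,
        UserProfileName=user_name
    )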

Create the custom SAML 2.0 application in AWS SSO

The next step is to create a custom SAML 2.0 application in AWS SSO.

  1. On the AWS SSO console, choose Applications in the navigation pane.
  2. Choose Add a new application.
  3. Choose Add a custom SAML 2.0 application.
  4. Download the AWS SSO SAML metadata file, which you use during IAM configuration.
  5. For Display name, enter a name, such as SageMaker Canvas followed by your Region.
  6. For Description, enter an optional description.
  7. For Application start URL, leave as is.
  8. For Relay state, enter https://YOUR-REGION.console.aws.amazon.com/sagemaker/home?region=YOUR-REGION#/studio/canvas/open/YOUR-STUDIO-DOMAIN-ID.
  9. For Session duration, choose your session duration. We suggest 8 hours.
    The Session duration value represents the amount of time you want the user session to last before authentication is required again. One hour is the most secure, whereas more time means less need for interaction. We choose 8 hours in this case, equivalent to one work day.
  10. For Application ACS URL, enter https://signin.aws.amazon.com/saml.
  11. For Application SAML audience, enter urn:amazon:webservices.
    After your settings are saved, your application configuration should look similar to the following screenshot.
    You can now assign your users to this application, so that the application appears in their AWS SSO portal after login.
  12. On the Assigned users tab, choose Assign users.
  13. Choose your users.

Optionally, if you want to enable a lot of data scientists and business analysts in your company to use Canvas, the fastest and easiest way is to use AWS SSO groups. To do so, we create two AWS SSO groups: business-analysts and data-scientists. We assign the users to these groups according to their roles, and then give access to the application to both groups.

Configure your IAM SAML provider and AWS SSO role

To configure your IAM SAML provider, complete the following steps:

  1. On the IAM console, choose Identity providers in the navigation pane.
  2. Choose Add provider.
  3. For Provider type, select SAML.
  4. For Provider name, enter a name, such as AWS_SSO_Canvas.
  5. Upload the metadata document you downloaded earlier.
  6. Note the ARN to use in a later step.

    We also need to create a new role for AWS SSO to use to access the application.
  7. On the IAM console, choose Roles in the navigation pane.
  8. Choose Create role.
  9. For Trusted entity type, select SAML 2.0 federation.
  10. For SAML 2.0-based provider, choose the provider you created (AWS_SSO_Canvas).
  11. Don’t select either of the two SAML 2.0 access methods.
  12. For Attribute, choose SAML:sub_type.
  13. For Value, enter persistent.
  14. Choose Next.

    We need to give AWS SSO the permission to create a Studio domain presigned URL, which we need to perform the redirect to Canvas.
  15. On the Permissions policies page, choose Create policy.
  16. On the Create policy tab, choose JSON and enter the following code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:CreatePresignedDomainUrlWithPrincipalTag",
                    "sagemaker:CreatePresignedDomainUrl"
                ],
                "Resource": "*"
            }
        ]
    }

  17. Choose Next:Tags and provide tags if needed.
  18. Choose Next:Review.
  19. Name the policy, for example CanvasSSOPresignedURL.
  20. Choose Create policy.
  21. Return to the Add permissions page and search for the policy you created.
  22. Select the policy, then choose Next.
  23. Name the role, for example AWS_SSO_Canvas_Role, and provide an optional description.
  24. On the review page, edit the trust policy to match the following code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": "<ARN OF THE SAML PROVIDER FROM IAM>"
                },
                "Action": [
                    "sts:AssumeRoleWithSAML",
                    "sts:SetSourceIdentity",
                    "sts:TagSession"
                ],
                "Condition": {
                    "StringEquals": {
                        "SAML:sub_type": "persistent",
                        "SAML:aud": "https://signin.aws.amazon.com/saml"
                    }
                }
            }
        ]
    }

  25. Save the changes, then choose Create role.
  26. Note the ARN of this role as well, to use in the following section.

Configure the attribute mappings in AWS SSO

The final step is to configure the attribute mappings. The attributes you map here become part of the SAML assertion that is sent to the application. You can choose which user attributes in your application map to corresponding user attributes in your connected directory. For more information, refer to Attribute mappings.

  1. On the AWS SSO console, navigate to the application you created.
  2. On the Attribute mappings tab, configure the following mappings of user attributes in the application to string values or user attributes in AWS SSO:
    • Subject maps to ${user:email}
    • https://aws.amazon.com/SAML/Attributes/RoleSessionName maps to ${user:email}
    • https://aws.amazon.com/SAML/Attributes/PrincipalTag:SageMakerStudioUserProfileName maps to ${user:subject}
    • https://aws.amazon.com/SAML/Attributes/Role maps to <ARN OF THE SAML PROVIDER FROM IAM>, <ARN OF THE CANVAS SSO ROLE FROM IAM>
  3. Choose Save changes.

You’re done!

Access the Canvas application from AWS SSO

On the AWS SSO console, note down the user portal URL. We suggest you log out of your AWS account first, or open an incognito browser window. Navigate to the user portal URL, log in with the credentials you set for the AWS SSO user, then choose your Canvas application.

You’re automatically redirected to the Canvas application.

Conclusion

In this post, we discussed a solution to enable business analysts to experience no-code ML via Canvas in a secured and unified way through a single sign-on portal. To do this, we configured Canvas as a custom SAML 2.0 application within AWS SSO. Business analysts are now one click away from using Canvas and solving new challenges with no-code ML. This provides the security needed by cloud engineering and security teams, while allowing for the agility and independence of business analyst teams. A similar process can be replicated with any IdP by reproducing these steps and adapting them to the specific SSO solution.

To learn more about Canvas, check out Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts. Canvas also enables easy collaboration with data science teams. To learn more, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas. For IT administrators, we suggest checking out Setting up and managing Amazon SageMaker Canvas (for IT administrators).


About the Author

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since very young, starting to code at the age of 7. He started learning AI/ML in his later years of university, and has fallen in love with it since then.

Read More

Create, train, and deploy a billion-parameter language model on terabytes of data with TensorFlow and Amazon SageMaker

The increasing size of language models has been one of the biggest trends in natural language processing (NLP) in recent years. Since 2018, we’ve seen unprecedented development and deployment of ever-larger language models, including BERT and its variants, GPT-2, T-NLG, and GPT-3 (175 billion parameters).

These models have pushed the boundaries of architectural innovation, but training such large-scale deep learning models, especially the new wave of generative pre-trained transformers, brings several challenges, including hardware limitations and trade-offs between computation and efficiency. To overcome these challenges, AWS offers a wide range of capabilities for model and data parallelism.

In this post, we introduce two main approaches: data parallelization and model parallelization using Amazon SageMaker, and discuss their pros and cons.

The model

For the language model, we use Transformers, introduced in the paper Attention Is All You Need. Transformers are deep learning models designed to deliberately avoid the pitfalls of RNNs by relying on a self-attention mechanism to draw global dependencies between input and output. The Transformer model architecture allows for significantly better parallelization and can achieve high performance in relatively short training time. Built on the success of Transformers, BERT, introduced in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, added bidirectional pre-training for language representation. Inspired by the Cloze task, BERT is pre-trained with masked language modeling (MLM), in which the model learns to recover the original words for randomly masked tokens. The BERT model is also pretrained on the next sentence prediction (NSP) task to predict if two sentences are in correct reading order. Since its advent in 2018, BERT and its variations have been widely used in language models.

We begin by creating two embedding layers for token and positional embedding. The input embeddings are the sum of the token embeddings and position embeddings.

class TokenAndPositionEmbedding(tf.keras.layers.Layer):
    """
    Creates two separate embedding layers: one for tokens and one for token index (positions).
    """
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = tf.keras.layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]

        # positions are represented by a token's index
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)

        # token embedding
        x = self.token_emb(x)

        # return sum as input 
        return x + positions

Then we define a transformer decoder block with two sub-layers: a multi-head self-attention layer, and a simple fully connected feed-forward network followed by layer normalization and dropout:

class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        # self attention layer
        super(TransformerBlock, self).__init__()
        self.att = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        
        # feed forward layer
        self.ffn = [tf.keras.layers.Dense(ff_dim, activation="relu"), tf.keras.layers.Dense(embed_dim)]

        # layer normalization 
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # dropout 
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, inputs):
        # getting batch size and seq len from input shape
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]

        # decoder causal mask (causal_attention_mask is a helper, not shown here,
        # that builds a lower-triangular boolean mask of shape [batch, seq, seq])
        causal_mask = causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)

        # self attention forward pass
        attention_output = self.att(inputs, inputs, attention_mask=causal_mask)

        # dropout and first residual connection with layer normalization
        attention_output = self.dropout1(attention_output)
        out1 = self.layernorm1(inputs + attention_output)

        # feed-forward layers, dropout, and second residual connection
        ffn_output = self.ffn[0](out1)
        ffn_output = self.ffn[1](ffn_output)
        out2 = self.dropout2(ffn_output)

        return self.layernorm2(out1 + out2)

Finally, we create our language model with the preceding embedding layer and transformer blocks:

class MyModel(tf.keras.Model):
    def __init__(self, maxlen, vocab_size, embed_dim, num_heads, feed_forward_dim, num_layers, learning_rate):
        super(MyModel, self).__init__()

        # embedding layer
        self.embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)

        # transformer blocks
        self.transformer_blocks = [
            TransformerBlock(embed_dim, num_heads, feed_forward_dim)
            for i in range(num_layers)
        ]

        # last dense layer
        self.dense = tf.keras.layers.Dense(vocab_size)
        
    def call(self, inputs, training=None):
        x_emb = self.embedding_layer(inputs)
        x = x_emb        
        for transformer_block in self.transformer_blocks:
            x = transformer_block(x)
        outputs = self.dense(x)
        return [outputs, x_emb]


def init_train_settings(maxlen, vocab_size, embed_dim, num_heads, feed_forward_dim, num_layers, learning_rate):
    """
    Creates model, optimizer and loss function 
    """
    model = MyModel(maxlen, vocab_size, embed_dim, num_heads, feed_forward_dim, num_layers, learning_rate) 
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    return model, optimizer, loss_fn

Depending on your hyperparameters, you can scale this model from thousands of parameters to billions of parameters. The primary challenge with billion-parameter models is that you can’t host the model in one instance and need to distribute the model over several nodes for training and inference.
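
To make that scaling concrete, the following is a rough back-of-the-envelope sketch (not part of the training code) that estimates the parameter count of this architecture from its hyperparameters; bias and layer-norm terms are approximated, and the example hyperparameter values are purely illustrative.

def approx_param_count(maxlen, vocab_size, embed_dim, num_heads, ff_dim, num_layers):
    """Rough parameter estimate for the model above (biases and norms approximated)."""
    embeddings = vocab_size * embed_dim + maxlen * embed_dim
    # Keras MultiHeadAttention with key_dim=embed_dim: Q, K, V and output projections
    attention = 4 * num_heads * embed_dim * embed_dim
    ffn = 2 * embed_dim * ff_dim
    layer_norms = 4 * embed_dim
    per_block = attention + ffn + layer_norms
    final_dense = embed_dim * vocab_size
    return embeddings + num_layers * per_block + final_dense

# Illustrative configuration in the billion-parameter range (roughly 2.4 billion)
print(approx_param_count(maxlen=512, vocab_size=50000, embed_dim=1536,
                         num_heads=8, ff_dim=6144, num_layers=24))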

The dataset

In our experiments, we used the Pile dataset. The Pile is an 800 GiB English text dataset designed for training large-scale language models. It is created from 22 diverse and high-quality datasets, including both established NLP datasets and newly introduced ones.

The dataset is created from a variety of data sources, including books; GitHub repositories; webpages; chat logs; and medical, physics, math, computer science, and philosophy papers. Specifically, it uses the following sources: Pile-CC, PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, Hacker News, YouTube Subtitles, PhilPapers, Books3, Project Gutenberg (PG-19), OpenSubtitles, English Wikipedia, DM Mathematics, EuroParl, the Enron Emails corpus, and NIH ExPorter. It also includes OpenWebText2 and BookCorpus2, which are extensions of the original OpenWebText and BookCorpus datasets, respectively. The diversity in data sources can improve general cross-domain knowledge and consequently improve downstream generalization capabilities.

The primary challenge with this dataset is the sheer size; the dataset has 825 GiB of text, which translates into 4.2 TiB of preprocessed and compressed datapoints. Similar to the challenges we face with training and hosting the models, training a model with this dataset on a single instance will take a lot of time and isn’t practical.

Our solution is to break down the dataset into approximately 1 GiB chunks of data, load and preprocess the features in TensorFlow Dataset objects, and store them in Amazon Elastic File Service (Amazon EFS). TensorFlow datasets provide an easy-to-use and high-performance data pipeline that integrates well with our models. Amazon EFS is an easy-to-use service that enables us to build a shared file system that scales automatically as files are added and deleted. In addition, Amazon EFS is capable of bursting to higher throughput levels when needed, which is critical in our data and model training pipeline.
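
The following is a minimal sketch of this pattern, assuming TensorFlow 2.5 or later and an EFS file system mounted at /mnt/efs; the chunk paths and the preprocessing function are placeholders for your own pipeline (earlier TensorFlow versions require passing element_spec to tf.data.experimental.load).

import tensorflow as tf

# Placeholders: raw text chunk and output directory on the EFS mount
raw_chunk_path = '/mnt/efs/pile/raw/chunk-0001.txt'
efs_chunk_dir = '/mnt/efs/pile/preprocessed/chunk-0001'

def preprocess(record):
    # Placeholder for tokenization / feature preparation of one raw text record
    return record

# Build a dataset for one ~1 GiB chunk of raw text and preprocess it in parallel
raw_ds = tf.data.TextLineDataset(raw_chunk_path)
ds = raw_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)

# Persist the preprocessed dataset to EFS so training workers can load it directly
tf.data.experimental.save(ds, efs_chunk_dir)

# At training time, load the chunk back and stream it into the input pipeline
train_ds = tf.data.experimental.load(efs_chunk_dir)
train_ds = train_ds.batch(16).prefetch(tf.data.AUTOTUNE)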

Next, we look into distributed training strategies to tackle these challenges.

Distributed training

In this project, we faced two challenges: scaling model size and data volume. Increasing the model size and number of trainable parameters may result in better accuracy, but there’s a limit to the model you can fit into a single GPU memory or even multiple GPUs in a single instance. In addition, bigger model sizes take more time to train.

You can tackle these challenges in two different ways: data parallelism and model parallelism. With data parallelism, we perform stochastic gradient descent (SGD) by distributing the records of a mini-batch over different devices to speed up the training. However, parallel data training comes with the extra complexity of computing the mini-batch gradient average with gradients from all devices, a step called AllReduce, which becomes harder as the training cluster grows. When using data parallelism, we must be able to fit the model and a single datapoint in a device (CPU or GPU), which is a limiting factor in our experiments because the size of such a large model is much larger than a single GPU's memory.

Another solution is to use model parallelism, which splits the model over multiple devices. Model parallelism is the process of splitting a model up between multiple devices or nodes (such as GPU-equipped instances) and creating an efficient pipeline to train the model across these devices to maximize GPU utilization.

Data parallelization

Parallelizing the data is the most common approach to multiple GPUs or distributed training. You can batch your data, send it to multiple devices (each hosting a replicated model), then aggregate the results. We experimented with two packages for data parallelization: Horovod and the SageMaker distributed data parallel library.

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. To use Horovod, we went through the following process:

  1. Initialize by running hvd.init().
  2. Associate each device with a single process. The first process or worker is associated with the first device, the second process is associated with the second device, and so on.
  3. Adjust the learning rate based on the number of devices.
  4. Wrap the optimizer in hvd.DistributedOptimizer.
  5. Broadcast the initial variable states from the first worker with rank 0 to all other processes. This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.
  6. Make sure that only device 0 can save checkpoints to prevent other workers from corrupting them.

The following is the training script:

import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Build model
...

@tf.function
def training_step(texts, labels, first_batch):
    with tf.GradientTape() as tape:
        predictions = model(texts, training=True)
        loss = loss_fn(labels, predictions[0])

    # Horovod: add Horovod Distributed GradientTape.
    tape = hvd.DistributedGradientTape(tape)

    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))

    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    #
    # Note: broadcast should be done after the first gradient step to ensure optimizer
    # initialization.
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)

    return loss

# Horovod: adjust number of steps based on number of GPUs.
for batch, (texts, labels) in enumerate(dataset.take(10000 // hvd.size())):
    loss = training_step(texts, labels, batch == 0)

    if batch % 10 == 0 and hvd.local_rank() == 0:
        print('Step #%d\tLoss: %.6f' % (batch, loss))

# Horovod: save checkpoints only on worker 0 to prevent other workers from
# corrupting it.
if hvd.rank() == 0:
    checkpoint.save(checkpoint_dir)

The SageMaker data parallel library enables us to scale our training with near-linear efficiency, speeding up our training with minimal code changes. The library performs a custom AllReduce operation and optimizes device-to-device communication by fully utilizing AWS’s network infrastructure and Amazon Elastic Compute Cloud (Amazon EC2) instance topology. To use the SageMaker data parallel library, we went through the following process:

  1. Import and initialize sdp.init().
  2. Associate each device with a single smdistributed.dataparallel process with local_rank. sdp.tensorflow.local_rank() gives us the local rank of devices. The leader is rank 0, and workers are rank 1, 2, 3, and so on.
  3. Adjust the learning rate based on the number of devices.
  4. Wrap tf.GradientTape with DistributedGradientTape to perform AllReduce.
  5. Broadcast the initial model variables from the leader node to all the worker nodes.
  6. Make sure that only device 0 can save checkpoints.
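
The following is a condensed sketch of these changes, mirroring the structure of the Horovod script above (model, optimizer, loss function, and checkpoint are defined as before); it assumes the smdistributed.dataparallel package that is available on SageMaker training instances.

import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

# SMD: Initialize the library
sdp.init()

# SMD: Pin each process to a single GPU based on its local rank
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

# Build model, optimizer, and loss as before, scaling the learning rate by sdp.size()
...

@tf.function
def training_step(texts, labels, first_batch):
    with tf.GradientTape() as tape:
        predictions = model(texts, training=True)
        loss = loss_fn(labels, predictions[0])

    # SMD: wrap the tape to perform the library's optimized AllReduce
    tape = sdp.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))

    # SMD: broadcast initial variables from the leader (rank 0) after the first step
    if first_batch:
        sdp.broadcast_variables(model.variables, root_rank=0)
        sdp.broadcast_variables(opt.variables(), root_rank=0)

    return loss

# SMD: save checkpoints only on the leader to prevent other workers from corrupting them
if sdp.rank() == 0:
    checkpoint.save(checkpoint_dir)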

Model parallelization

We can adjust the hyperparameters to keep the model small enough to train using a single GPU, or we can use model parallelism to split the model between multiple GPUs across multiple instances. Increasing a model’s number of trainable parameters can result in better accuracy, but there’s a limit to the maximum model size you can fit in a single GPU memory. We used the SageMaker distributed model parallel library to train our larger models. The steps are as follows:

  1. Import and initialize the library with smp.init().
  2. The Keras model needs to inherit from smp.DistributedModel instead of the Keras Model class.
  3. Set drop_remainder=True in the tf.Dataset.batch() method to ensure that the batch size is always divisible by the number of microbatches.
  4. Random operations in the data pipeline all need to use the same seed: smp.dp_rank(), for example, shuffle(ds, seed=smp.dp_rank()). This ensures consistency of data samples across devices that hold different model partitions.
  5. Forward and backward logic needs to be in a step function with smp.step decoration.
  6. Perform postprocessing on the outputs across microbatches using StepOutput methods such as reduce_mean. The smp.step function must have a return value that depends on the output of smp.DistributedModel.

The training script is as follows:

import smdistributed.modelparallel.tensorflow as smp

# SMP: Initialize
smp.init()

# SMP: Define smp.DistributedModel the same way as Keras sub-classing API
class MyModel(smp.DistributedModel):
    def __init__(self, maxlen, vocab_size, embed_dim, num_heads, feed_forward_dim, num_layers, learning_rate):
        super(MyModel, self).__init__()
        
        self.embedding_layer = gpt_model.TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
        self.transformer_blocks = [
            gpt_model.TransformerBlock(embed_dim, num_heads, feed_forward_dim)
            for i in range(num_layers)
        ]
        self.dense = tf.keras.layers.Dense(vocab_size)
        
    def call(self, inputs, training=None):
        x_emb = self.embedding_layer(inputs)
        x = x_emb

        for transformer_block in self.transformer_blocks:
            x = transformer_block(x)
        outputs = self.dense(x)
        return [outputs, x_emb]


# SMP: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(texts, labels):
    predictions = model(texts, training=True)
    loss = loss_fn(labels, predictions[0])
    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions[0]

@tf.function
def train_step(texts, labels, first_batch):
    gradients, loss, predictions = get_grads(texts, labels)
    # SMP: Accumulate the gradients across microbatches
    gradients = [g.accumulate() for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    # SMP: Average the loss across microbatches
    train_loss(loss.reduce_mean())
    # SMP: Merge predictions across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()

histories = []

for _ in range(epochs):
    train_loss.reset_states()
    train_accuracy.reset_states()

    for texts, labels in text_ds:
        for i in range(128):
            text = tf.expand_dims(texts[0][i], axis=0)
            label = tf.expand_dims(labels[0][i], axis=0)
            train_step(text, label)  

For a detailed guide to enable the TensorFlow training script for the SageMaker distributed model parallel library, refer to Modify a TensorFlow Training Script. For PyTorch, refer to Modify a PyTorch Training Script.

SageMaker Debugger

In the previous sections, we discussed how to optimize the training using model and data parallelization techniques. With Amazon SageMaker Debugger, we can now capture performance profiling information from our training runs to determine how much the training has improved. By default, Debugger captures system metrics for each SageMaker training job such as GPU, CPU utilization, memory, network, and I/O at a sampling interval of 500 milliseconds. We can access the data as follows:

from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

tj = TrainingJob('SMD-MP-demo-2022-01-21-06-43-23-841', "us-east-1")
tj.wait_for_sys_profiling_data_to_be_available()
system_metrics_reader = tj.get_systems_metrics_reader()

# The framework metrics reader is used by the timeline charts in the next snippet
tj.wait_for_framework_profiling_data_to_be_available()
framework_metrics_reader = tj.get_framework_metrics_reader()

Debugger provides utilities to visualize the profiling data in different ways. In the following example, we see the total GPU and CPU utilization as well as the I/O wait time for the multi-GPU training job using Horovod. To generate these graphs, we run the following code:

from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

view_timeline_charts = TimelineCharts(
    system_metrics_reader, 
    framework_metrics_reader,
    select_dimensions=["CPU", "GPU", "I/O"], 
    select_events=["total"],
    show_workers=False           
)

The GPU utilization frequently fluctuates between 0–100%, and high I/O wait times with low GPU utilization are an indicator of an I/O bottleneck. Furthermore, the total CPU utilization never exceeds 70%, which means that we can improve data preprocessing by increasing the number of worker processes.

We can improve performance by switching from Horovod to the SageMaker distributed data parallel library. In the following graphs, we can see that GPUs are utilized more efficiently and only dropping to low utilization for short periods of time.

Training infrastructure

For training the models, we used 10 ml.p3.16xlarge instances with a SageMaker training job. SageMaker reduces the time and cost to train and tune machine learning (ML) models without the need to manage infrastructure. With SageMaker, you can easily train and tune ML models using built-in tools to manage and track training experiments, automatically choose optimal hyperparameters, debug training jobs, and monitor the utilization of system resources such as GPUs, CPUs, and network bandwidth. The data was hosted in Amazon EFS, which enabled storage to grow and shrink automatically as we added and removed files, with no need for management or provisioning. Our primary objectives were to improve training speed and reduce costs.
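
For reference, the following is a hedged sketch of how such a job could be launched with the SageMaker Python SDK for the data parallel case; the script name, source directory, role, framework version, and input location are placeholders, and the distribution settings depend on which library (data parallel or model parallel) you enable.

from sagemaker.tensorflow import TensorFlow

# Placeholders: your training script, source directory, and execution role
estimator = TensorFlow(
    entry_point='train.py',
    source_dir='src',
    role='<YOUR-SAGEMAKER-EXECUTION-ROLE>',
    instance_count=10,
    instance_type='ml.p3.16xlarge',
    framework_version='2.6',
    py_version='py38',
    # Enable the SageMaker distributed data parallel library for this job
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
)

# Start the training job; the input can be an S3 prefix or an Amazon EFS file system input
estimator.fit('s3://<YOUR-BUCKET>/<YOUR-PREFIX>')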

Model scalability

Although this infrastructure is primarily used for language generation, with the GPT architecture and Pile dataset, you can use these techniques to train large-scale transformer models, which are useful in many domains beyond NLP. In machine learning itself, many computer vision tasks are now solved with large-parameter transformer architectures, which have been shown to outperform traditional convolutional neural networks (CNNs) on tasks like representation learning (see Advancing the state of the art in computer vision with self-supervised Transformers and 10x more efficient training) and large-scale mapping of images to text (such as CLIP). Large-parameter models are also breaking new ground in life sciences in fields like protein structure analysis and analysis of medical image data.

The solutions we detail in this post for distributed training and managing large models should apply to models in any of these domains as well.

Trade-offs

There has been an ongoing discussion in the research community regarding the risks of training large-scale language models, and whether enough thought has been put into the potential risks associated with developing them and strategies to mitigate these risks, some of which include the financial and environmental costs. According to a paper published in ACM, training a single BERT base model (without hyperparameter tuning) on GPUs was estimated to require as much energy as a trans-American flight. The environmental impacts scale with model size, and being able to efficiently fine-tune such models can potentially curtail the emissions significantly. AWS recently launched a new Customer Carbon Footprint Tool, available to all AWS customers at no cost, as part of Amazon’s efforts to increase sustainability and reduce carbon emissions. Running applications on the AWS Cloud can potentially decrease the carbon footprint (when compared to enterprise data centers that were surveyed in a 2019 report).

Conclusion

This post demonstrated a solution that facilitates the fine-tuning of language models with a billion parameters on the AWS Cloud using SageMaker.

For more information about model parallelism with SageMaker, refer to Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker and How Latent Space used the Amazon SageMaker model parallelism library to push the frontiers of large-scale transformers.

If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.


About the Authors

Sia Gholami is a Senior Data Scientist at the Amazon ML Solutions Lab, where he builds AI/ML solutions for customers across various industries. He is passionate about natural language processing (NLP) and deep learning. Outside of work, Sia enjoys spending time in nature and playing tennis.

Mehdi Noori is a Manager and a Senior Applied Scientist at the Amazon ML Solutions Lab, where he works with customers across various industries and helps them accelerate their cloud migration journey and solve their ML problems using state-of-the-art solutions and technologies.

Muhyun Kim is a data scientist at Amazon Machine Learning Solutions Lab. He solves customers’ various business problems by applying machine learning and deep learning, and also helps them get skilled.

Danny Byrd is an Applied Scientist at the Amazon ML Solutions Lab. At the lab he’s helped customers develop advanced ML solutions, in ML specialties from computer vision to reinforcement learning. He’s passionate about pushing technology forward and unlocking new potential from AWS products along the way.

Francisco Calderon Rodriguez is a Data Scientist in the Amazon ML Solutions Lab. As a member of the ML Solutions Lab, he helps solve critical business problems for AWS customers using deep learning. In his spare time, Francisco likes to play music and guitar, play soccer with his daughters, and enjoy time with his family.

Yohei Nakayama is a Deep Learning Architect at the Amazon ML Solutions Lab. He works with customers across different verticals to accelerate their use of artificial intelligence and AWS Cloud services to solve their business challenges. He is interested in applying ML/AI technologies to the space industry.

Nathalie Rauschmayr is a Senior Applied Scientist at AWS, where she helps customers develop deep learning applications.

Read More

Identify potential root cause in business-critical anomalies using Amazon Lookout for Metrics

We are excited to launch a causal contribution analysis capability in Amazon Lookout for Metrics that helps you understand the potential root causes of business-critical anomalies in your data. Previously, you were only given the root causes for a single anomaly per measure, and you had to analyze the results yourself to determine whether causal relationships existed between the detected anomalies in different measures. When focusing on a single anomaly, you can easily miss its downstream (or upstream) impact. For example, you may see a spike in your checkout cart abandonment and know that your revenue will decrease. However, you may not know what caused the checkout carts to be abandoned at a higher rate. The causal contribution analysis feature can tell you that the spike in checkout cart abandonment may be due to spikes in transaction failures or sudden changes in prices due to promotion expiration.

Lookout for Metrics uses machine learning (ML) to automatically detect and diagnose anomalies in large datasets where deviations from normal are hard to detect and missed anomalies have business-critical impact. Lookout for Metrics reduces the time to implement AI/ML services for business-critical problems.

In this post, we discuss the new causal contribution analysis capability and its benefits.

Challenges in anomaly detection

Anomaly detection has two parts: detecting anomalies and identifying the root cause that triggered them so that teams can take action to mitigate the problem.

Traditional business intelligence (BI) systems that use static threshold-based or rule-based anomaly detection have three problems. First, you might have millions of metrics to track across multiple data sources. Take digital advertising, for example: you want to track metrics like impressions, clicks, revenue, and shopping cart activity across campaign IDs, product categories, geographies, and more. And it’s the same for any domain, be it retail, telecom, gaming, or financial services. With traditional BI tools, managing data across multiple sources, creating dashboards and reports, and adding alerts at a granular level requires a lot of manual work and isn’t scalable.

Second, these traditional BI tools work by setting up rules. You set up a range; anything outside the range is an anomaly, and you’re alerted on it. If the range is too broad, you miss important alerts, and if it’s too narrow, you receive too many false alerts.

These ranges (the upper and lower bounds in the preceding image) are also static; they don’t change based on the time of day, day of the week, or season, and they need to be manually updated. You’re likely to miss important anomalies and receive too many false alarms, or you lose trust in the tool and start ignoring these alerts altogether.

Lastly, BI reports and dashboards are often generated at the end of the hour, day, or week, when it’s too late for you to act on a problem. And even when these results arrive, they don’t answer the why. So developers, analysts, and business owners can spend weeks trying to identify the root cause of an anomaly, delaying meaningful action even further.

Causal inference in Lookout for Metrics

Although asking for the root cause of an unexpected event seems to be at the heart of the human way of understanding the world, statistical associations are often misinterpreted as a causal influence. That is, correlation doesn’t imply causation, and discerning the causes of events from observational data requires specialized causal inference methods.

The root cause analysis in Lookout for Metrics uses causal inference techniques to increase the visibility and interpretability of anomalies across measures. Lookout for Metrics is capable of not only identifying causal drivers, but also quantitatively attributing the anomalous events to them, providing a percentage score of likelihood among the probable causal drivers of an anomalous event. For example, Lookout for Metrics can now draw a causal link from fewer clicks on your website, iOS, and Android apps (causation) to a drop in advertisement views (anomaly), leading to a decline in revenue (downstream impact). Suppose one or more potential root causes occur (website, iOS, Android). In that case, Lookout for Metrics can identify the most likely cause (for example, the website with a 90% likelihood) that led to the drop in advertisement views.

The scientific approach relies on a two-step procedure:

  1. Infer the causal relations between the measures.
  2. Based on the inferred causal structure, attribute the anomalies of the affected measure to the causing measures.

To infer the causal relations between the measures, we use a Granger causality method that takes the panel data structure of Lookout for Metrics into account. Existing Granger causality methods for panel data can’t deal with dependencies across dimension value combinations (for instance, dependencies of revenue across different countries, which we typically have in real data). For example, events such as Black Friday increase the revenue of multiple countries, and therefore an external source renders the revenues of different countries dependent. We therefore had to develop our own Granger causality method [1] on panel data that can deal with these types of dependencies.

Once the causal structure is available, we attribute the anomalies of the affected measure to its causing measures to quantify the cause-effect relationships.
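
The panel-data method referenced above isn’t something you interact with directly, but the basic idea behind Granger causality is easy to illustrate. The following sketch is a simplified illustration only: it uses a standard pairwise test from statsmodels on synthetic data, not the cross-sectionally dependent panel method described in [1], and checks whether a clicks series helps predict an orders series beyond what the orders history alone explains.

# Illustrative only: a standard pairwise Granger causality test on synthetic
# data, not the panel-data method with cross-sectional dependencies from [1].
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(42)

# Hypothetical hourly metrics: orders follows clicks with a one-period lag.
clicks = rng.normal(1000, 50, size=200)
orders = 0.1 * np.roll(clicks, 1) + rng.normal(0, 2, size=200)
data = pd.DataFrame({"orders": orders, "clicks": clicks}).iloc[1:]

# Tests whether the second column (clicks) Granger-causes the first (orders).
results = grangercausalitytests(data[["orders", "clicks"]], maxlag=2)
for lag, res in results.items():
    f_stat, p_value = res[0]["ssr_ftest"][:2]
    print(f"lag={lag}: F={f_stat:.2f}, p={p_value:.4f}")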

Analyze anomalies on the Lookout for Metrics console

After Lookout for Metrics starts anomaly detection, you can look for the detected anomalies on the Anomalies page for the detector. When you choose an anomaly, you’re redirected to the details page for the observed anomaly.

The anomaly details page includes a Root cause analysis section. This section tries to explain this observed anomaly with respect to the other anomalies for the anomaly detector configured measures.

In the following example, “Revenue impacted” is the observed anomaly, and the potential causes include orders and non-configured measures. Orders contributes approximately 81.84% to the current anomaly in revenue, which in turn has a downstream impact on profit.

Choosing the potential cause orders takes us to the details of its observed anomaly. In this case, the possible causes for this anomaly are clicks and non-configured measures. Clicks could be one of the potential causes of this anomaly, but it gets a relatively low contribution score of 8.37%, and the detector doesn’t observe anything anomalous for it. In this case, Lookout for Metrics concludes that the orders anomaly is caused by external factors or measures that weren’t configured for monitoring during the detector setup phase. This anomaly in orders has a potential downstream impact on profit and revenue.

Choosing the potential downstream impact profit takes us to the details of its observed anomaly. In this case, the potential causes seem to be a mix of anomalies in revenue, orders, and non-configured measures, with respective contribution scores of 33%, 14%, and 53%. No downstream measures are affected by this anomaly.

For this example, the anomaly in profit can be partially explained by the anomalies in revenue and orders. The anomaly in revenue, in turn, can be explained by the anomaly in orders with high certainty.

Conclusion

The new causal contribution analysis capability in Lookout for Metrics detects the causal interactions between anomalies in your measures. To achieve this, the detector learns the causal relations between measures in your data in a fully self-supervised manner and uses this causal information to trace anomalies back to their root causes. This feature can help you causally connect anomalies across measures and provides you with a tool to quickly diagnose and subsequently fix any issues in your system.

[1] L. Minorics, C. Turkmen, P. Bloebaum, D. Kernert, L. Callot, and D. Janzing. Testing Granger Non-Causality in Panels with Cross-Sectional Dependencies. AISTATS, 2022.

About the Authors

Lenon Minorics is an Applied Scientist focusing on causal inference and anomaly detection. Prior to Amazon, Lenon was an academic researcher in mathematics. His personal research interests include machine learning, causal inference, stochastics, and fractal geometry. In his free time, Lenon enjoys practicing all kinds of sports, especially Brazilian Jiu-Jitsu.

Shashank Srivastava is Senior Product Manager for Amazon AI vertical services. He is passionate about solving problems in AI in NLP, novelty detection, and data scarcity. In his free time, Shashank enjoys playing tennis and golf.

Caner Türkmen is an Applied Scientist at Amazon Web Services, where he works on problems at the intersection of machine learning, forecasting, and anomaly detection. Before joining AWS, he worked in the management consulting industry as a data scientist, serving the financial services and telecommunications industries on projects across the globe. Caner’s personal research interests span a range of topics, including probabilistic and Bayesian ML, stochastic processes, and their practical applications.

Alex Kim is a Sr. Product Manager for AWS AI Services. His mission is to deliver AI/ML solutions to all customers who can benefit from it. In his free time, he enjoys all types of sports and discovering new places to eat.

Read More

Use AWS AI and ML services to foster accessibility and inclusion of people with a visual or communication impairment

AWS offers a broad set of artificial intelligence (AI) and machine learning (ML) services, including a suite of pre-trained, ready-to-use services for developers with no prior ML experience. In this post, we demonstrate how to use such services to build an application that fosters the inclusion of people with a visual or communication impairment, which includes difficulties in seeing, reading, hearing, speaking, or having a conversation in a foreign language. With services such as Amazon Transcribe, Amazon Polly, Amazon Translate, Amazon Rekognition and Amazon Textract, you can add features to your projects such as live transcription, text to speech, translation, object detection, and text extraction from images.

Screenshots of the web app showcasing five features of AWS AugmentAbility.

According to the World Health Organization, over 1 billion people—about 15% of the global population—live with some form of disability, and this number is likely to grow because of population ageing and an increase in the prevalence of some chronic diseases. For people with a speech, hearing, or visual impairment, everyday tasks such as listening to a speech or a TV program, expressing a feeling or a need, looking around, or reading a book can feel like impossible challenges. A wide body of research highlights the importance of assistive technologies for the inclusion of people with disabilities in society. According to research by the European Parliamentary Research Service, mainstream technologies such as smartphones provide more and more capabilities suitable for addressing the needs of people with disabilities. In addition, when you design for people with disabilities, you tend to build features that improve the experience for everyone; this is known as the curb-cut effect.

This post demonstrates how you can use the AWS SDK for JavaScript to integrate capabilities provided by AWS AI services into your own solutions. To do that, a sample web application showcases how to use Amazon Transcribe, Amazon Polly, Amazon Translate, Amazon Rekognition, and Amazon Textract to easily implement accessibility features. The source code of this application, AWS AugmentAbility, is available on GitHub to use as a starting point for your own projects.

Solution overview

AWS AugmentAbility is powered by five AWS AI services: Amazon Transcribe, Amazon Translate, Amazon Polly, Amazon Rekognition, and Amazon Textract. It also uses Amazon Cognito user pools and identity pools for managing authentication and authorization of users.

After deploying the web app, you will be able to access the following features:

  • Live transcription and text to speech – The app transcribes conversations and speeches for you in real time using Amazon Transcribe, an automatic speech recognition service. Type what you want to say, and the app says it for you by using Amazon Polly text-to-speech capabilities. This feature also integrates with Amazon Transcribe automatic language identification for streaming transcriptions—with a minimum of 3 seconds of audio, the service can automatically detect the dominant language and generate a transcript without you having to specify the spoken language.
  • Live transcription and text to speech with translation – The app transcribes and translates conversations and speeches for you, in real time. Type what you want to say, and the app translates and says it for you. Translation is available in the over 75 languages currently supported by Amazon Translate.
  • Real-time conversation translation – Select a target language, speak in your language, and the app translates what you said in your target language by combining Amazon Transcribe, Amazon Translate, and Amazon Polly capabilities.
  • Object detection – Take a picture with your smartphone, and the app describes the objects around you by using Amazon Rekognition label detection features.
  • Text recognition for labels, signs, and documents – Take a picture with your smartphone of any label, sign, or document, and the app reads it out loud for you. This feature is powered by Amazon Rekognition and Amazon Textract text extraction capabilities. AugmentAbility can also translate the text into over 75 languages, or make it more readable for users with dyslexia by using the OpenDyslexic font.

Live transcription, text to speech, and real-time conversation translation features are currently available in Chinese, English, French, German, Italian, Japanese, Korean, Brazilian Portuguese, and Spanish. Text recognition features are currently available in Arabic, English, French, German, Italian, Portuguese, Russian, and Spanish. An updated list of the languages supported by each feature is available on the AugmentAbility GitHub repo.

You can build and deploy AugmentAbility locally on your computer or in your AWS account by using AWS Amplify Hosting, a fully managed CI/CD and static web hosting service for fast, secure, and reliable static and server-side rendered apps.

The following diagram illustrates the architecture of the application, assuming that it’s deployed in the cloud using AWS Amplify Hosting.

Architecture diagram including AWS Amplify, Amazon Cognito, Transcribe, Translate, Polly, Rekognition, Textract.

The solution workflow includes the following steps:

  1. A mobile browser is used to access the web app—an HTML, CSS, and JavaScript application hosted by AWS Amplify Hosting. The application has been implemented using the SDK for JavaScript and the AWS Amplify JavaScript library.
  2. The user signs in by entering a user name and a password. Authentication is performed against the Amazon Cognito user pool. After a successful login, the Amazon Cognito identity pool is used to provide the user with the temporary AWS credentials required to access app features.
  3. While the user explores the different features of the app, the mobile browser interacts with Amazon Transcribe (StartStreamTranscriptionWebSocket operation), Amazon Translate (TranslateText operation), Amazon Polly (SynthesizeSpeech operation), Amazon Rekognition (DetectLabels and DetectText operations) and Amazon Textract (DetectDocumentText operation).
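
As a rough server-side illustration of the operations listed above (the web app itself calls them from the browser, as described next), the following Python sketch with boto3 runs label detection, document text extraction, and text to speech; the image file name and the Polly voice are placeholder choices.

# Hypothetical server-side equivalent of some of the operations listed above,
# using boto3 instead of the browser SDK for JavaScript.
import boto3

rekognition = boto3.client("rekognition")
textract = boto3.client("textract")
polly = boto3.client("polly")

with open("sign.jpg", "rb") as f:  # placeholder image file
    image_bytes = f.read()

# Object detection (DetectLabels)
labels = rekognition.detect_labels(Image={"Bytes": image_bytes}, MaxLabels=5)
print([label["Name"] for label in labels["Labels"]])

# Text extraction from a photographed document (DetectDocumentText)
text = textract.detect_document_text(Document={"Bytes": image_bytes})
lines = [b["Text"] for b in text["Blocks"] if b["BlockType"] == "LINE"]

# Text to speech (SynthesizeSpeech); the audio stream is written to a file
speech = polly.synthesize_speech(Text=" ".join(lines) or "Hello",
                                 OutputFormat="mp3", VoiceId="Joanna")
with open("speech.mp3", "wb") as out:
    out.write(speech["AudioStream"].read())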

AWS services have been integrated in the mobile web app by using the SDK for JavaScript. Generally speaking, the SDK for JavaScript provides access to AWS services in either browser scripts or Node.js; for this sample project, the SDK is used in browser scripts. For additional information about how to access AWS services from a browser script, refer to Getting Started in a Browser Script. The SDK for JavaScript is provided as a JavaScript file supporting a default set of AWS services. This file is typically loaded into browser scripts using a <script> tag that references the hosted SDK package. A custom browser SDK was built with a specified set of services (for instructions, refer to Building the SDK for Browser).

Each service was integrated in the mobile web app following the guidelines and code samples available in the AWS SDK for JavaScript Developer Guide. The implementation of live transcription features required some additional steps because Amazon Transcribe Streaming WebSocket requires developers to encode the audio with event stream encoding and use the Signature Version 4 signing process for adding authentication information to AWS API requests sent by HTTP. For more information about this approach, refer to Transcribe speech to text in real time using Amazon Transcribe with WebSocket.

The user sign-in webpage has been implemented using authentication features of the AWS Amplify JavaScript library. For more details about the authentication and authorization flow, refer to Accessing AWS services using an identity pool after sign-in.

The following walkthrough shows how to deploy AugmentAbility by using AWS Amplify Hosting; it includes the following steps:

  1. Create the Amazon Cognito user pool and identity pool, and grant permissions for accessing AWS AI services.
  2. Clone the GitHub repository and edit the configuration file.
  3. Deploy the mobile web app to the AWS Amplify console.
  4. Use the mobile web app.

Create the Amazon Cognito user pool and identity pool, and grant permissions for accessing AWS AI services

The first step required for deploying the app consists of creating an Amazon Cognito user pool with the Hosted UI enabled, creating an Amazon Cognito identity pool, integrating the two pools, and finally granting permissions for accessing AWS services to the AWS Identity and Access Management (IAM) role associated with the identity pool. You can either complete this step by manually working on each task, or by deploying an AWS CloudFormation template.

The CloudFormation template automatically provisions and configures the necessary resources, including the Amazon Cognito pools, IAM roles, and IAM policies.

  1. Sign in to the AWS Management Console and launch the CloudFormation template by choosing Launch Stack:

    The template launches in the EU West (Ireland) AWS Region by default. To launch the solution in a different Region, use the Region selector in the console navigation bar. Make sure to select a Region in which the AWS services in scope (Amazon Cognito, AWS Amplify, Amazon Transcribe, Amazon Polly, Amazon Translate, Amazon Rekognition, and Amazon Textract) are available (us-east-2, us-east-1, us-west-1, us-west-2, ap-south-1, ap-northeast-2, ap-southeast-1, ap-southeast-2, ca-central-1, eu-central-1, eu-west-1, eu-west-2).
  2. Choose Next.
  3. For Region, enter the identifier of the Region you want to use (among the supported ones).
  4. For Username, enter the user name you want to use to access the app.
  5. For Email, enter the email address to which the temporary password for your first sign-in should be sent.
  6. Choose Next.
  7. On the Configure stack options page, choose Next.
  8. On the Review page, review and confirm the settings.
  9. Select the check box acknowledging that the template will create IAM resources and may require an AWS CloudFormation capability.
  10. Choose Create stack to deploy the stack.

You can view the status of the stack on the AWS CloudFormation console in the Status column. You should receive a CREATE_COMPLETE status in a couple of minutes.

As part of the template deployment, the following permissions are granted to the IAM role that is assumed by the authenticated user:

  • transcribe:StartStreamTranscriptionWebSocket
  • translate:TranslateText
  • comprehend:DetectDominantLanguage
  • polly:SynthesizeSpeech
  • rekognition:DetectText
  • rekognition:DetectLabels
  • textract:DetectDocumentText

Even though Amazon Comprehend is not explicitly used in this web application, permissions are granted for the action comprehend:DetectDominantLanguage. Amazon Translate may automatically invoke Amazon Comprehend to determine the language of the text to be translated if a language code isn’t specified.
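
For example, a call like the following minimal sketch (the input text and target language are arbitrary) sets SourceLanguageCode to auto, which is when the DetectDominantLanguage permission comes into play.

import boto3

translate = boto3.client("translate")

# SourceLanguageCode="auto" lets Amazon Translate detect the input language,
# which internally relies on Amazon Comprehend DetectDominantLanguage.
result = translate.translate_text(
    Text="Bonjour, où est la gare ?",
    SourceLanguageCode="auto",
    TargetLanguageCode="en",
)
print(result["SourceLanguageCode"], "->", result["TranslatedText"])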

Clone the GitHub repository and edit the configuration file

Now that access to AWS AI services has been configured, you’re ready to clone the GitHub repository and edit the configuration file.

  1. In the AWS AugmentAbility GitHub repo, choose Code and Download ZIP.
    You’re either prompted to choose a location on your computer where the ZIP file should be downloaded to, or it will automatically be saved in your Downloads folder.
  2. After you download the file, unzip it and delete the ZIP file.
    You should have obtained a folder named aws-augmentability-main with some files and subfolders in it.
  3. Create a file named config.js with any text editor, and enter the following content in it:
    var appConfig = {
        "IdentityPoolId": "INSERT_COGNITO_IDENTITY_POOL_ID"
    }
    
    var amplifyConfig = {
        "Auth": {
            "region": "INSERT_AWS_REGION_ID",
            "userPoolId": "INSERT_COGNITO_USER_POOL_ID",
            "userPoolWebClientId": "INSERT_COGNITO_USER_POOL_CLIENT_ID",
            "mandatorySignIn": true,
            "cookieStorage": {
                "domain": window.location.hostname,
                "path": "/",
                "expires": 30,
                "secure": true
          }
        }
    }

  4. In the config.js file you created, replace the four INSERT_ strings with the Amazon Cognito identity pool ID, identifier of your Region of choice, Amazon Cognito user pool ID, and user pool client ID.
    You can retrieve such values by opening the AWS CloudFormation console, choosing the stack named augmentability-stack, and choosing the Outputs tab.
    Screenshot of the CloudFormation stack Outputs tab.
  5. Save the config.js file in the aws-augmentability-main folder, and zip the folder to obtain a new aws-augmentability-main.zip file.

Deploy the mobile web app to the Amplify console

Now that you have downloaded and edited the AugmentAbility project files, you’re ready to build and deploy the mobile web app using the Amplify console.

  1. On the Get started with Amplify Hosting page, choose Deploy without Git provider.
  2. Choose Continue.
  3. In the Start a manual deployment section, for App name, enter the name of your app.
  4. For Environment name, enter a meaningful name for the environment, such as development or production.
  5. For Method, choose Drag and drop.
  6. Either drag and drop the aws-augmentability-main.zip file from your computer onto the drop zone or use Choose files to select the aws-augmentability-main.zip file from your computer.
  7. Choose Save and deploy, and wait for the message Deployment successfully completed.

Use the mobile web app

The mobile web app should now be deployed. Before accessing the app for the first time, you have to set a new password for the user that has been automatically created during Step 1. You can find the link to the temporary login screen in the Outputs tab for the CloudFormation stack (field UserPoolLoginUrl). For this first sign-in, you use the user name you set up and the temporary password you received via email.

After you set your new password, you’re ready to test the mobile web app.

In the General section of the Amplify console, you should be able to find a link to the app under the Production branch URL label. Open it or send it to your smartphone, then sign in with your new credentials, and start playing with AugmentAbility.

Animated screenshot showcasing the “Live transcription and text to speech” feature of AWS AugmentAbility.

Animated screenshot showcasing the “Object detection” feature of AWS AugmentAbility.

Animated screenshot showcasing the “Text recognition” feature of AWS AugmentAbility.

Next steps

If you want to make changes to the mobile web app, you can work on the files cloned from the repository, locally build the mobile web app (as explained in the README file), and then redeploy the app by uploading the updated ZIP file via the Amplify console. As an alternative, you can create a GitHub, Bitbucket, GitLab, or AWS CodeCommit repository to store your project files, and connect it to Amplify to benefit from automatic builds on every code commit. To learn more about this approach, refer to Getting started with existing code. If you follow this tutorial, make sure to replace the command npm run build with npm run-script build at Step 2a.

To create additional users on the Amazon Cognito console, refer to Creating a new user in the AWS Management Console. In case you need to recover the password for a user, you should use the temporary login screen you used for changing the temporary password. You can find the link on the Outputs tab of the CloudFormation stack (field UserPoolLoginUrl).
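
If you prefer to create users programmatically instead of through the console, a minimal sketch with boto3 might look like the following; the user pool ID, user name, and email address are placeholders.

import boto3

cognito = boto3.client("cognito-idp")

# Creates a user and sends a temporary password to the given email address.
cognito.admin_create_user(
    UserPoolId="eu-west-1_EXAMPLE",  # placeholder user pool ID
    Username="new-user",
    UserAttributes=[
        {"Name": "email", "Value": "new-user@example.com"},
        {"Name": "email_verified", "Value": "true"},
    ],
    DesiredDeliveryMediums=["EMAIL"],
)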

Clean up

When you’re done with your tests, to avoid incurring future charges, delete the resources created during this walkthrough.

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the stack augmentability-stack.
  3. Choose Delete and confirm deletion when prompted.
  4. On the Amplify console, select the app you created.
  5. On the Actions menu, choose Delete app and confirm deletion when prompted.

Conclusion

In this post, I showed you how to deploy a code sample that uses AWS AI and ML services to put features such as live transcription, text to speech, object detection, or text recognition in the hands of everyone. Knowing how to build applications that can be used by people with a wide range of abilities and disabilities is key for creating more inclusive and accessible products.

To get started with AugmentAbility, clone or fork the GitHub repository and start experimenting with the mobile web app. If you want to experiment with AugmentAbility before deploying resources in your AWS account, you can check out the live demo (credentials: demo-user, Demo-password-1).


About the Author

Luca Guida is a Solutions Architect at AWS; he is based in Milan and supports Italian ISVs in their cloud journey. With an academic background in computer science and engineering, he started developing his AI/ML passion at university; as a member of the natural language processing (NLP) community within AWS, Luca helps customers be successful while adopting AI/ML services.

Read More

How service providers can use natural language processing to gain insights from customer tickets with Amazon Comprehend

Today, customers can raise support tickets through multiple channels, such as web, mobile, chatbots, email, or phone calls. When a support ticket is raised by a customer, it is processed and assigned to a category based on the information provided in the ticket. It is then routed to the support group for resolution according to the category of the ticket. A significant number of support tickets are not routed to the right group because of incorrect ticket categorization. Incorrectly assigned tickets delay overall resolution, often resulting in severe customer dissatisfaction. Misrouting can also have wider financial, operational, or other business repercussions. Hence, ticket classification is an essential task for every organization. Although you can classify tickets manually, doing so is error prone, not cost-effective, and doesn’t scale.

AWS Managed Services (AMS) uses Amazon Comprehend custom classification to categorize inbound requests by resource and operation type based on how the customer described their issue. Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to uncover valuable insights and connections in text. AMS uses custom classifiers to label customer requests with the appropriate issue type, resource type, and resource action, thereby routing customer tickets to the right subject matter experts (SMEs). Amazon Comprehend classification is also used to find opportunities for new internal automation tools that AMS engineers can use to fulfill customer requirements, reducing manual effort and the chance of manual errors. The classification data is stored in an Amazon Redshift cluster and used to analyze customer requests and find new automation tool candidates. This automation results in increased operational efficiency and reduced cost.

In this post, we show how managed service providers can use Amazon Comprehend to classify and route the tickets, provide suggestions based on the classification, and utilize the classification data.

Solution overview

The following diagram shows the solution architecture.

The workflow is as follows:

  1. A customer submits the ticket.
  2. The ticket system receives the ticket from the customer, and invokes the ticket classifier AWS Lambda function with the ticket details. Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. Lambda is chosen for the solution to reduce cost and maintenance effort.
  3. The ticket classifier Lambda function classifies the ticket with Amazon Comprehend using the ticket title and description. With Amazon Comprehend, you can train the NLP model and provide both batch and real-time classifiers without provisioning and maintaining infrastructure.
  4. The ticket classifier Lambda function pushes the ticket classification data to the Amazon Redshift cluster via Amazon Kinesis Data Firehose. Kinesis Data Firehose is an extract, transform, and load (ETL) service that captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services. Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and ML to deliver the best price performance at any scale. Kinesis Data Firehose delivers data to an Amazon Simple Storage Service (Amazon S3) bucket first and then issues an Amazon Redshift COPY command to load the data into an Amazon Redshift cluster.
  5. The ticket classifier Lambda function invokes the ticket handler Lambda function.
  6. The ticket handler Lambda function runs code to help the ticket handling. In this example, it returns the recommended materials for handling the ticket based on the classification.
  7. Ticket analysis can be done with Amazon QuickSight. From ticket analysis, you can find out the top requested ticket type. Based on the analysis, you can discover ticket trends and opportunities to automate top ticket types. QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people who you work with, wherever they are.
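
The following sketch outlines, under simplifying assumptions, what the ticket classifier Lambda function in steps 3–5 might look like; the endpoint ARNs, delivery stream name, and handler function name are placeholders, and the actual code shipped in lambda_code.zip (used later in the walkthrough) may differ.

import json
import boto3

comprehend = boto3.client("comprehend")
firehose = boto3.client("firehose")
lambda_client = boto3.client("lambda")

# Placeholder identifiers; the real ones come from the CloudFormation stack.
OPERATION_ENDPOINT_ARN = "arn:aws:comprehend:REGION:ACCOUNT:document-classifier-endpoint/ticket-classification-operation"
RESOURCE_ENDPOINT_ARN = "arn:aws:comprehend:REGION:ACCOUNT:document-classifier-endpoint/ticket-classification-resource"
DELIVERY_STREAM_NAME = "ClassificationDeliveryStream"
TICKET_HANDLER_FUNCTION = "TicketHandlerLambdaFunction"


def classify(text, endpoint_arn):
    """Return the class with the highest confidence score for the given text."""
    response = comprehend.classify_document(Text=text, EndpointArn=endpoint_arn)
    top = max(response["Classes"], key=lambda c: c["Score"])
    return top["Name"]


def handler(event, context):
    text = f"{event['TicketTitle']} {event['TicketDescription']}"
    operation = classify(text, OPERATION_ENDPOINT_ARN)
    resource = classify(text, RESOURCE_ENDPOINT_ARN)

    # Stream the classification to Amazon Redshift via Kinesis Data Firehose.
    record = {
        "id": event["TicketId"],
        "title": event["TicketTitle"],
        "description": event["TicketDescription"],
        "creation_time": event["TicketCreationTime"],
        "operation": operation,
        "resource": resource,
    }
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

    # Hand the classified ticket to the ticket handler function.
    lambda_client.invoke(
        FunctionName=TICKET_HANDLER_FUNCTION,
        InvocationType="Event",
        Payload=json.dumps(record),
    )
    return {"operation": operation, "resource": resource}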

In the following sections, we walk you through the steps to implement the solution, integrate the ticket classification infrastructure with your ticketing system, and use the classification data with QuickSight.

Implement the solution

In this section, we walk through the steps to provision your solution resources and create the necessary infrastructure.

Configure Amazon Comprehend

In this step, we train two new Amazon Comprehend custom classification models: Operation and Resource, and create a real-time analysis endpoint for each model.

Upload the training data

To upload the training data, complete the following steps:

  1. Download ticket_training_data.zip and unzip the file.
    This folder contains the following two files:

    • training_data_operations.csv – This file is a two-column CSV file that we use to train the Operation classification model. The first column contains class, and the second column contains document.
    • training_data_resources.csv – This file is a two-column CSV file that we use to train the Resource classification model. Like the training_data_operations.csv file, the first column contains class, and the second column contains document.
  2. On the Amazon S3 console, create a new bucket for Amazon Comprehend. Because S3 bucket names are globally unique, you need to create a unique name for the bucket. For this post, we call it comprehend-ticket-training-data. Enable server-side encryption and block public access when creating the bucket.
  3. Upload training_data_operations.csv and training_data_resources.csv to the new S3 bucket.

Create two new models

To create your models, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom classification in the navigation pane.
  2. Choose Create new model.
  3. Provide the following information:
    1. For Model name, enter ticket-classification-operation.
    2. For Language, choose English.
    3. For Classifier mode, select Using Single-label mode.
    4. For Data format, select CSV file.
    5. For Training dataset, enter the S3 path for training_data_operations.csv.
    6. For Test data source, select Autosplit.
      Autosplit automatically selects 10% of your provided training data to use as testing data.
    7. For IAM Role, select Create an IAM role.
    8. For Permissions to access, choose Training, test, and output data (if specified) in your S3 buckets.
    9. For Name suffix, enter ticket-classification.
  4. Choose Create.
  5. Choose Create new model again to create your resource classification model.
  6. Provide the following information:
    1. For Model name, enter ticket-classification-resource.
    2. For Language, choose English.
    3. For Classifier mode, select Using Single-label mode.
    4. For Data format, select CSV file.
    5. For Training dataset, enter the S3 path for training_data_resources.csv.
    6. For Test data source, select Autosplit.
    7. For IAM Role, select Use an existing IAM role.
    8. For Role name, choose AmazonComprehendServiceRole-ticket-classification.
  7. Choose Create.

Amazon Comprehend is now processing the CSV files and using them to train custom classifiers. We then use these to help classify customer tickets. The larger and more accurate our training data is, the more accurate the classifier will be.

Wait for the version status to show as Trained, as shown below. It may take up to 1 hour to complete, depending on the size of the training data.
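
If you’d rather create the classifiers programmatically, a roughly equivalent call looks like the following sketch; the bucket name matches the one created earlier, the account ID in the role ARN is a placeholder, and console options such as Autosplit are omitted for brevity.

import boto3

comprehend = boto3.client("comprehend")

# Equivalent of the console steps above for the Operation model; repeat with
# training_data_resources.csv for the Resource model.
comprehend.create_document_classifier(
    DocumentClassifierName="ticket-classification-operation",
    LanguageCode="en",
    Mode="MULTI_CLASS",  # the API name for single-label mode
    DataAccessRoleArn="arn:aws:iam::ACCOUNT:role/AmazonComprehendServiceRole-ticket-classification",
    InputDataConfig={
        "DataFormat": "COMPREHEND_CSV",
        "S3Uri": "s3://comprehend-ticket-training-data/training_data_operations.csv",
    },
)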

Create Amazon Comprehend endpoints

Amazon Comprehend endpoints are billed in 1-second increments, with a minimum of 60 seconds. Charges continue to incur from the time you start the endpoint until it’s deleted, even if no documents are analyzed. For more information, see Amazon Comprehend Pricing. To create your endpoints, complete the following steps:

  1. On the Amazon Comprehend console, choose Endpoints in the navigation pane.
  2. Choose Create endpoint to create your operation classification endpoint.
  3. Provide the following information:
    1. For Endpoint name, enter ticket-classification-operation.
    2. For Custom model type, select Custom classification.
    3. For Classifier model, choose ticket-classification-operation.
    4. For Version, choose No Version Name.
    5. For Number of inference units (IUs), enter 1.
  4. Choose Create endpoint.
  5. Choose Create endpoint again to create the resource classification endpoint.
  6. Provide the following information:
    1. For Endpoint name, enter ticket-classification-resource.
    2. For Custom model type, select Custom classification.
    3. For Classifier model, choose ticket-classification-resource.
    4. For Version, choose No Version Name.
    5. For Number of inference units (IUs), enter 1.
  7. Choose Create endpoint.

After you create both endpoints, wait until the status for both shows as Active.
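
The endpoints can also be created through the API. The following sketch creates the operation endpoint and polls until it’s usable (shown as Active on the console, IN_SERVICE in the API); the model ARN is a placeholder, and the same steps apply to the resource endpoint.

import time
import boto3

comprehend = boto3.client("comprehend")

# Placeholder ARN of the trained ticket-classification-operation model.
MODEL_ARN = "arn:aws:comprehend:REGION:ACCOUNT:document-classifier/ticket-classification-operation"

endpoint = comprehend.create_endpoint(
    EndpointName="ticket-classification-operation",
    ModelArn=MODEL_ARN,
    DesiredInferenceUnits=1,
)

# Wait for the endpoint to become usable.
while True:
    status = comprehend.describe_endpoint(
        EndpointArn=endpoint["EndpointArn"])["EndpointProperties"]["Status"]
    if status in ("IN_SERVICE", "FAILED"):
        break
    time.sleep(30)
print(status)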

Test the Amazon Comprehend endpoints with real-time analysis

To test your endpoints, complete the following steps:

  1. On the Amazon Comprehend console, choose Real-time analysis in the navigation pane.
  2. For Analysis type, select Custom.
  3. For Endpoint, choose ticket-classification-operation.
  4. For Input text, enter the following:
    Hi support,
    Please update the timezone to UTC on i-abcd1234.
    Thanks.

  5. Choose Analyze.
    The results show that the Update class has the highest confidence score.
  6. Change Endpoint to ticket-classification-resource and choose Analyze again.

The results show that the EC2 class has the highest confidence score.

Create a secret for the Amazon Redshift cluster password

In this step, we create an AWS Secrets Manager secret for your Amazon Redshift cluster password. Secrets Manager helps you protect secrets needed to access your applications, services, and IT resources. The service enables you to easily rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle. In this post, we store the Amazon Redshift cluster password in a Secrets Manager secret.

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Choose Store a new secret.
  3. For Secret type, select Other type of secret.
  4. Under Key/value pairs, set your key as password and value as your Amazon Redshift cluster password.
    The password must be 8–64 characters in length and contain at least one uppercase letter, one lowercase letter, and one number. It can contain any printable ASCII character except ‘ (single quote), “ (double quote), \ (backslash), /, @, or space.
  5. Choose Next.
  6. For Secret name, enter ClassificationRedshiftClusterPassword.
  7. Choose Next.
  8. In the Secret rotation section, choose Next.
  9. Review your secret configuration and choose Store.
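
If you prefer to script this step, a minimal boto3 equivalent of the console procedure follows; the password value is a placeholder and must satisfy the rules described above.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Store the Amazon Redshift cluster password under the expected secret name.
secretsmanager.create_secret(
    Name="ClassificationRedshiftClusterPassword",
    SecretString=json.dumps({"password": "Replace-with-your-own-Passw0rd"}),
)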

Provision your infrastructure with AWS CloudFormation

In this step, we provision the infrastructure for the solution using an AWS CloudFormation stack.

Upload the Lambda function code

Before launching the CloudFormation stack, upload your Lambda function code:

  1. Download lambda_code.zip
  2. On the Amazon S3 console, open the bucket that you created.
  3. Upload lambda_code.zip.

Create your CloudFormation stack

To provision resources with AWS CloudFormation, complete the following steps:

  1. Download cloudformation_template.json.
  2. On the AWS CloudFormation console, choose Create stack.
  3. Select With new resources (standard).
  4. For Template source, choose Upload a template file.
  5. Choose the downloaded CloudFormation template.
  6. Choose Next.
  7. For Stack name, enter Ticket-Classification-Infrastructure.
  8. In the Parameters section, enter the following values:
    1. For ClassificationRedshiftClusterNodeType, enter the Amazon Redshift cluster node type. dc2.large is the default.
    2. For ClassificationRedshiftClusterPasswordSecretName, enter the Secrets Manager secret name that stores the Amazon Redshift cluster password.
    3. For ClassificationRedshiftClusterSubnetId, enter the subnet ID where the Amazon Redshift cluster is hosted. The subnet must be within the VPC that you specify in the ClassificationRedshiftClusterVpcId parameter.
    4. For ClassificationRedshiftClusterUsername, enter the Amazon Redshift cluster user name.
    5. For ClassificationRedshiftClusterVpcId, enter the VPC ID where the Amazon Redshift cluster is hosted.
    6. For LambdaCodeS3Bucket, enter the S3 bucket name where you uploaded the Lambda code.
    7. For LambdaCodeS3Key, enter the Amazon S3 key of the deployment package.
    8. For QuickSightRegion, enter the Region for QuickSight. The Region for QuickSight should be consistent with the Region you’re using for Amazon Comprehend and the S3 bucket.
  9. Choose Next.
  10. In the Configure stack options section, choose Next.
  11. In the Review section, select I acknowledge that AWS CloudFormation might create IAM resources.
  12. Choose Create stack.

Configure your Amazon Redshift cluster

In this step, you enable audit logging and add the new table to the Amazon Redshift cluster created through the CloudFormation template.

Audit logging is not turned on by default in Amazon Redshift. When you turn on logging on your cluster, Amazon Redshift exports logs to Amazon CloudWatch; the logs capture data from the time audit logging is enabled onward, and each logging update is a continuation of the previous logs.

Enable audit logging

You can skip this step if you don’t need audit logging for your Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose the Amazon Redshift cluster starting with classificationredshiftcluster-.
  3. On the Properties tab, choose Edit.
  4. Choose Edit audit logging.
  5. For Configure audit logging, choose Turn on.
  6. For Log export type, choose CloudWatch.
  7. Select all log types.
  8. Choose Save changes.
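
If you prefer to enable audit logging programmatically, the following sketch shows a possible equivalent call; the cluster identifier is a placeholder, and the LogDestinationType and LogExports parameters assume a recent SDK version.

import boto3

redshift = boto3.client("redshift")

# Export connection, user, and user activity logs to Amazon CloudWatch.
redshift.enable_logging(
    ClusterIdentifier="classificationredshiftcluster-example",  # placeholder
    LogDestinationType="cloudwatch",
    LogExports=["connectionlog", "userlog", "useractivitylog"],
)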

Create new table

To create a new table, complete the following steps:

  1. On the Amazon Redshift console, choose Query data.
  2. Choose Query in query editor v2.
  3. On the Database page, choose your cluster.
  4. For Database, enter ticketclassification.
  5. Enter the user name and password you configured in the CloudFormation stack parameters.
  6. Choose Create connection.
  7. When the connection is made, choose the plus sign and open a new query window.
  8. Enter the following query:
    CREATE TABLE tickets(
      id             VARCHAR(50)   NOT NULL,
      title          VARCHAR(1000) NOT NULL,
      description    VARCHAR(5000) NOT NULL,
      creation_time  TIMESTAMPTZ   NOT NULL,
      operation      VARCHAR(5000) NULL,
      resource       VARCHAR(5000) NULL
    );

  9. Choose Run.

Test the classification infrastructure

Now the infrastructure for ticket classification is ready. Before integrating with your ticket system, let’s test the classification infrastructure.

Run the test

To run the test, complete the following steps:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose the function that starts with Ticket-Classification-Inf-TicketClassifier.
  3. On the Test tab, choose Test event.
  4. For Name, enter TestTicket.
  5. Enter the following test data:
    {
      "TicketId": "00000001",
      "TicketTitle": "Update the timezone",
      "TicketDescription": "Hi support, Please update the timezone to UTC on i-abcd1234. Thanks.",
      "TicketCreationTime": "2020-12-04 03:09:00-08"
    }

  6. Choose Test.

The ticket is classified, and the classification data is stored in the Amazon Redshift cluster. After the classification, the ticket handler Lambda function runs, which handles the ticket based on the classification, including recommending materials to support engineers.

Check the ticket classifier test log

To check the test log, complete the following steps:

  1. In the result section of the test, choose Logs, or choose View logs in CloudWatch on the Monitor tab.
  2. Choose the log stream.

You can view the logs in the following screenshot, which shows the output from Amazon Comprehend and the final top classification of the ticket. In this example, the test ticket is classified as Resource=EC2, Operation=Update.

Check the ticket classification output in the Amazon Redshift cluster

To validate the output in your cluster, complete the following steps:

  1. On the Amazon Redshift query editor v2 console, choose the plus sign to open a new query window.
  2. Enter the following query:
    SELECT * FROM "tickets";

  3. Choose Run.

The following screenshot shows the ticket classification. If it’s not available yet, wait for a few minutes and retry (Kinesis Data Firehose needs some time to push the data). We can now use this data in QuickSight.

Check the ticket handler test log

After the ticket classifier pushes the classification data to the Amazon Redshift cluster, the ticket handler Lambda function runs and handles the ticket based on the classification, including recommending materials to support engineers. In this example, the ticket handler returns recommended materials, including the runbook, AWS documentation, and SSM documents, so support engineers can refer to them when handling the ticket. You can integrate the output with your ticket handling system, and you can customize the handling processes in the Lambda function code. In this step, we check what recommendations were made.

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose the Lambda function that starts with Ticket-Classification-Inf-TicketHandlerLambdaFunct.
  3. On the Monitor tab, choose View logs in CloudWatch.
  4. Choose the log stream.

The following screenshot shows the logs. You can see the output from Amazon Comprehend and the list of recommended AWS documents and SSM documents for the ticket classified as Update EC2. You can add your own runbooks, documents, SSM documents, or any other materials in the Lambda function code.

Integrate the ticket classification infrastructure with your ticketing system

In this section, we walk through the steps to integrate your ticketing classification infrastructure with your ticketing system and customize your configuration.

Most ticketing systems have a trigger feature, which allows you to run code when a ticket is submitted. Set up your ticketing system to invoke the ticket classifier Lambda function with the following formatted input:

{
  "TicketId": "{{ Ticket ID }}",
  "TicketTitle": "{{ Ticket Title }}",
  "TicketDescription": "{{ Ticket Description }}",
  "TicketCreationTime": "{{ Ticket Creation Time. e.g. 2020-12-04 03:09:00-08 }}"
}
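
For example, a trigger implemented in Python could invoke the classifier asynchronously as in the following sketch; the function name is a placeholder for the one created by the CloudFormation stack.

import json
import boto3

lambda_client = boto3.client("lambda")

# Asynchronously invoke the ticket classifier when a ticket is submitted.
lambda_client.invoke(
    FunctionName="Ticket-Classification-Inf-TicketClassifier-EXAMPLE",  # placeholder
    InvocationType="Event",
    Payload=json.dumps({
        "TicketId": "00000002",
        "TicketTitle": "Create a new S3 bucket",
        "TicketDescription": "Hi support, please create a new S3 bucket named example-bucket. Thanks.",
        "TicketCreationTime": "2020-12-05 10:00:00-08",
    }),
)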

If you want to customize the input, modify the ticket classifier Lambda function code. You need to add or remove parameters (lines 90–105) and customize the input for Amazon Comprehend (lines 15–17).

You can customize the ticket handler Lambda function to run automation or edit the recommendations. For example, you can add the internal comment to the ticket with the recommendations. To customize, open the ticket handler Lambda code, and edit lines 68–70 and 75–81.

Use classification data with QuickSight

After you integrate the ticket classification infrastructure with your ticket system, the ticket classification data is stored in the Amazon Redshift cluster. You can use QuickSight to check this data and generate reports. In this example, we generate a QuickSight analysis with the classification data.

Sign up for QuickSight

If you don’t already have QuickSight, sign up with the following steps:

  1. On the QuickSight console, choose Sign up for QuickSight.
  2. Choose Standard.
  3. Under QuickSight region, choose the Region you configured in the CloudFormation parameter QuickSightRegion.
  4. Under Account info, enter your QuickSight account name and notification email address.
  5. Under QuickSight access to AWS services, select Amazon Redshift.
  6. If you want to allow access and autodiscovery for other resources, select them as well.
  7. Choose Finish.
  8. Choose Go to Amazon QuickSight after you’re signed up.

Connect your Amazon Redshift cluster to QuickSight

To connect your cluster to QuickSight as a data source, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Redshift Auto-discovered.
  4. Provide the following information:
    1. For Data source name, enter ticketclassification.
    2. For Instance ID, choose the Amazon Redshift cluster starting with classificationredshiftcluster-.
    3. For Connection type, choose Public network.
    4. For Database name, enter ticketclassification.
    5. Enter the Amazon Redshift cluster user name and password you configured in the CloudFormation stack parameters.
  5. Choose Validate connection to see if the connection works.
    If it doesn’t work, the likely cause is a wrong user name or password, or a QuickSight Region that differs from the one you specified in the CloudFormation stack parameter QuickSightRegion.
  6. Choose Create data source.
  7. In the Choose your table section, select the tickets table.
  8. Choose Select.
  9. Select Import to SPICE for quicker analytics.
    SPICE is the QuickSight Super-fast, Parallel, In-memory Calculation Engine. It’s engineered to rapidly perform advanced calculations and serve data. Importing (also called ingesting) your data into SPICE can save time and money. For more information on SPICE, refer to Importing Data into SPICE. If you get the error “Not enough SPICE capacity,” purchase more SPICE capacity. For more information, refer to Purchasing SPICE capacity in an AWS Region.
  10. Choose Visualize.

Create a ticket classification analysis report

Once you finish dataset creation, you can see the new QuickSight analysis. In this section, we walk through the steps to create a ticket classification analysis report, including a pivot table, pie charts, and line charts.

  1. Choose AutoGraph.
  2. Under Visual types, choose the pivot table.
  3. Drag operation from Fields list to Rows.
  4. Drag resource from Fields list to Columns.
  5. On the Add menu, choose Add visual.
  6. Under Visual types, choose the pie chart.
  7. Drag operation from Fields list to Group/Color.
  8. On the Add menu, choose Add visual again.
  9. Under Visual types, choose the pie chart again.
  10. Drag resource from Fields list to Group/Color.
  11. On the Add menu, choose Add visual again.
  12. Under Visual types, choose the line chart.
  13. Drag creation_time from Fields list to X axis.
  14. Drag operation from Fields list to Color.
  15. On the Add menu, choose Add visual again.
  16. Under Visual types, choose the line chart again.
  17. Drag creation_time from Fields list to X axis.
  18. Drag resource from Fields list to Color.
  19. Resize and reorder the charts as needed.
  20. Choose Save as.
  21. Enter a name for your analysis and choose Save.

Congratulations! Your first ticket analysis is ready. Once you have more data, the analysis will look like the following screenshot.

Clean up

In this step, we clean up the resources we created with various services.

Amazon Comprehend

To delete your endpoints, complete the following steps:

  1. On the Amazon Comprehend console, choose Endpoints in the navigation pane.
  2. Select the endpoint ticket-classification-operation.
  3. Choose Delete and follow the prompts.
  4. Repeat these steps to delete the ticket-classification-resource endpoint.
    Next, delete the custom classifications you created.
  5. Choose Custom classification in the navigation pane.
  6. Select the classification ticket-classification-operation.
  7. Select No Version Name.
  8. Choose Delete and follow the prompts.
  9. Repeat these steps to delete the ticket-classification-resource classification.

Amazon S3

Next, clean up the S3 bucket you created.

  1. On the Amazon S3 console, select the bucket you created.
  2. Delete all the objects in the bucket.
  3. Delete the bucket.

Amazon QuickSight

Delete the QuickSight analyses and dataset you created.

  1. On the QuickSight console, choose Analyses in the navigation pane.
  2. Choose the options icon (three dots) on the analysis you created.
  3. Choose Delete and follow the prompts.
  4. Choose Datasets in the navigation pane.
  5. Choose the tickets dataset.
  6. Choose Delete dataset and follow the prompts.

AWS CloudFormation

Clean up the resources you created as part of the CloudFormation stack.

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the Ticket-Classification-Infrastructure stack.
  3. On the Resources tab, choose the physical ID of ClassificationDeliveryStreamS3Bucket.
    The Amazon S3 console opens.
  4. Delete any objects in this bucket.
  5. Return to the AWS CloudFormation console, choose Delete, and follow the prompts.

AWS Secrets Manager

Lastly, delete the Secrets Manager secret.

  1. On the Secrets Manager console, select the secret ClassificationRedshiftClusterPassword.
  2. On the Actions menu, choose Delete secret.
  3. Set the waiting period as 7 days and choose Schedule Delete.

Your secret will be automatically deleted after 7 days.

Conclusion

In this post, you learned how to use AWS services to create an automatic classification and recommendation system. This solution helps your organization build the following workflow:

  1. Classify customer requests.
  2. Recommend automated solutions.
  3. Analyze customer request classifications and discover top customer requests.
  4. Release a new automated solution and increase the automation rate.

For more information about Amazon Comprehend, see Amazon Comprehend Documentation. You can also discover other Amazon Comprehend features and get inspiration from other AWS blog posts about using Amazon Comprehend beyond classification.


About the Authors

Seongyeol Jerry Cho is a Senior Systems Development Engineer at AWS Managed Services based in Sydney, Australia. He focuses on building highly scalable and automated cloud operations software using a variety of technologies, including machine learning. Outside of work, he enjoys travel, camping, reading, cooking, and running.

Manu Sasikumar is a Sr. Systems Engineer Manager with AWS Managed Services. Manu and his team focus on building powerful and easy-to-use automations to reduce manual effort, and build AI and ML-based solutions for managing customer requests. Outside of work, he loves spending his spare time with his family, as well as being part of various humanitarian and volunteer activities.

Read More