Amazon Transcribe streaming adds support for Japanese, Korean, and Brazilian Portuguese

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy to add speech-to-text capabilities to your applications. Today, we’re excited to launch Japanese, Korean, and Brazilian Portuguese language support for Amazon Transcribe streaming. To deliver streaming transcriptions with low latency for these languages, we’re also announcing availability of Amazon Transcribe streaming in the Asia Pacific (Seoul), Asia Pacific (Tokyo), and South America (São Paulo) Regions.

Amazon Transcribe added support for Italian and German languages earlier in November 2020, and this launch continues to expand the service’s streaming footprint. Now you can automatically generate live streaming transcriptions for a diverse set of use cases within your contact centers and media production workflows.

Customer stories

Our customers are using Amazon Transcribe to capture customer calls in real time to better assist agents, improve call resolution times, generate content highlights on the fly, and more. Here are a few stories of their streaming audio use cases.

PLUS Corporation

PLUS Corporation is a global manufacturer and retailer of office supplies. Mr. Yoshiki Yamaguchi, the Deputy Director of the IT Department at PLUS, says, “Every day, our contact center processes millions of minutes of calls. We are excited by Amazon Transcribe’s new addition of Japanese language support for streaming audio. This will provide our agents and supervisors with a robust, streaming transcription service that will enable them to better analyze and monitor incoming calls to improve call handling times and customer service quality.”

Nomura Research Institute

Nomura Research Institute (NRI) is one of the largest economic research consulting firms in Japan and an AWS premier consulting partner specializing in contact center intelligence solutions. Mr. Kakinoki, the deputy division manager of NRI’s Digital Workplace Solutions says, “For decades, we have been bringing intelligence into contact centers by providing several Natural Language Processing (NLP) based solutions in Japanese. We are pleased to welcome Amazon Transcribe’s latest addition of Japanese support for streaming audio. This enables us to automatically provide agents with real-time call transcripts as part of our solution. Now we are able to offer an integrated AWS contact center solution to our customers that addresses their need for real-time FAQ knowledge recommendations and text summarization of inbound calls. This will help our customers such as large financial firms reduce call resolution times, increase agent productivity and improve quality management capabilities.”

Accenture

Dr. Gakuse Hoshina, a Managing Director and Lead for Japan Applied Intelligence at Accenture Strategy & Consulting, says, “Accenture provides services such as AI Powered Contact Center and AI Powered Concierge, which synergize automated AI responses and human interactions to help our clients create highly satisfying customer experiences. I expect that the latest addition of Japanese for Amazon Transcribe streaming support will lead to a further improved customer experience, improving the accuracy of AI’s information retrieval and automated responses.”

Audioburst

Audioburst is a technology provider that is transforming the discovery, distribution, and personalization of talk audio. Gal Klein, the co-founder and CTO of Audioburst, says, “Every day, we analyze 225,000 minutes of live talk radio to create thousands of short, topical segments of information for playlists and search. We chose Amazon Transcribe because it is a remarkable speech recognition engine that helps us transcribe live audio content for our downstream content production work streams. Transcribe provides a robust system that can simultaneously convert a hundred audio streams into text for a reasonable cost. With this high-quality output text, we are then able to quickly process live talk radio episodes into consumable segments that provide next-gen listening experiences and drive higher engagement.”

Getting started

You can try Amazon Transcribe streaming for any of these new languages on the Amazon Transcribe console. The following is a quick demo of how to use it in Japanese.

You can take advantage of streaming transcription within your own applications with the Amazon Transcribe API for HTTP/2 or WebSockets implementations.
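
If you work in Python, one way to call the streaming API is through the open-source Amazon Transcribe streaming SDK for Python (the amazon-transcribe package), which wraps the HTTP/2 interface. The following is a minimal sketch rather than production code: the Region, the 16 kHz PCM encoding, the chunk size, and the audio file name are assumptions you would adapt to your own audio source.

import asyncio

import aiofile
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent


class PrintTranscriptHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        # Print partial and final transcripts as they arrive
        for result in transcript_event.transcript.results:
            for alternative in result.alternatives:
                print(alternative.transcript)


async def transcribe_file(path):
    # Japanese transcription against the streaming endpoint in the Tokyo Region
    client = TranscribeStreamingClient(region="ap-northeast-1")
    stream = await client.start_stream_transcription(
        language_code="ja-JP",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )

    async def send_audio():
        # Stream raw 16-bit, 16 kHz PCM audio in small chunks
        async with aiofile.AIOFile(path, "rb") as afp:
            reader = aiofile.Reader(afp, chunk_size=1024 * 16)
            async for chunk in reader:
                await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = PrintTranscriptHandler(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())


asyncio.run(transcribe_file("sample-ja.pcm"))  # hypothetical local audio file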

For the full list of supported languages and Regions for Amazon Transcribe streaming, see Streaming Transcription and AWS Regional Services.

Summary

Amazon Transcribe is a powerful ASR service for accurately converting your real-time speech into text. Try using it to help streamline your content production needs and equip your agents with effective tools to improve your overall customer experience. See all the ways in which other customers are using Amazon Transcribe.

 


About the Author

Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of technology and education. Out of the office, Esther enjoys long walks along the beach, dinners with friends and friendly rounds of Mahjong.

Real-time anomaly detection for Amazon Connect call quality using Amazon ES

If your contact center is serving calls over the internet, network metrics like packet loss, jitter, and round-trip time are key to understanding call quality. In the post Easily monitor call quality with Amazon Connect, we introduced a solution that captures real-time metrics from the Amazon Connect softphone, stores them in Amazon Elasticsearch Service (Amazon ES), and creates easily understandable dashboards using Kibana. Examining these metrics and implementing rule-based alerting can be valuable. However, as your contact center scales, outliers become harder to detect across broad aggregations. For example, average packet loss for each of your sites may be below a rule threshold, but individual agent issues might go undetected.

The high cardinality anomaly detection feature of Amazon ES is a machine learning (ML) approach that can solve this problem. You can streamline your operational workflows by detecting anomalies for individual agents across multiple metrics. Alerts allow you to proactively reach out to the agent to help them resolve issues as they’re detected or use the historical data in Amazon ES to understand the issue. In this post, we give a four-step guide to detecting an individual agent’s call quality anomalies.

Anomaly detection in four stages

If you have an Amazon Connect instance, you can deploy the solution and follow along with this post. There are four steps to getting started with anomaly detection to proactively monitor your data:

  1. Create an anomaly detector.
  2. Observe the results.
  3. Configure alerts.
  4. Tune the detector.

Creating an anomaly detector

A detector is the component used to track anomalies in a data source in real time. Features are the metrics the detector monitors in that data.

Our solution creates a call quality metric detector when deployed. This detector finds potential call quality issues by monitoring network metrics for agents’ calls, with features for round trip time, packet loss, and jitter. We use high cardinality anomaly detection to monitor anomalies across these features and identify agents who are having issues. Getting these granular results means that you can effectively isolate trends or individual issues in your call center.
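
The deployed solution creates this detector for you, so you don’t need to define it by hand. For illustration only, the sketch below shows roughly what a high cardinality detector definition looks like when submitted to the Open Distro anomaly detection REST API; the endpoint, index pattern, field names, credentials, and intervals are placeholders, not values from the solution.

import json

import requests

es_endpoint = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder domain endpoint

detector = {
    "name": "call-quality-detector",
    "description": "Per-agent anomalies in softphone network metrics",
    "time_field": "timestamp",
    "indices": ["softphone-metrics*"],         # placeholder index pattern
    "category_field": ["agentName.keyword"],   # high cardinality: one model per agent
    "feature_attributes": [
        {"feature_name": "round_trip_time", "feature_enabled": True,
         "aggregation_query": {"rtt": {"avg": {"field": "roundTripTime"}}}},
        {"feature_name": "packet_loss", "feature_enabled": True,
         "aggregation_query": {"loss": {"avg": {"field": "packetsLost"}}}},
        {"feature_name": "jitter", "feature_enabled": True,
         "aggregation_query": {"jit": {"avg": {"field": "jitter"}}}},
    ],
    "detection_interval": {"period": {"interval": 1, "unit": "Minutes"}},
    "window_delay": {"period": {"interval": 1, "unit": "Minutes"}},
}

response = requests.post(
    f"{es_endpoint}/_opendistro/_anomaly_detection/detectors",
    auth=("master-user", "master-password"),  # or sign requests with SigV4
    headers={"Content-Type": "application/json"},
    data=json.dumps(detector),
)
print(response.json())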

Observing results

We can observe active detectors from Kibana’s anomaly detection dashboard. The graphs in Kibana show live anomalies, historical anomalies, and your feature data. You can determine the severity of an anomaly by the anomaly grade and the confidence of the detector. The following screenshot shows a heat map of anomalies detected, including the anomaly grade, confidence level, and the agents impacted.

We can choose any of the tiles in the heat map to drill down further and use the feature breakdown tab to see more details. For example, the following screenshot shows the feature breakdown of one of the tiles from the agent john-doe. The feature breakdown shows us that the round trip time, jitter, and packet loss values spiked during this time period.

To see if this is an intermittent issue for this agent, we have two approaches. First, we can check the anomaly occurrences for the agent to see if there is a trend. In the following screenshot, John Doe had 12 anomalies over the last day, starting around 3:25 PM. This is a clear sign to investigate.

Alternatively, referencing the heat map, we can look to see if there is a visual trend. Our preceding heat map shows a block of anomalies around 9:40 PM on November 23—this is a reason to look at other broader issues impacting these agents. For instance, they might all be at the same site, which suggests we should look at the site’s network health.

Configuring alerts

We can also configure each detector with alerts to send a message to Amazon Chime or Slack. Alternatively, you can send alerts to an Amazon Simple Notification Service (Amazon SNS) topic for emails or SMS. Alerting an operations engineer or operations team allows you to investigate these issues in real time. For instructions on configuring these alerts, see Alerting for Amazon Elasticsearch Service.

Tuning the detector

It’s simple to start monitoring new features. You can add them to your model with a few clicks. To get the best results from your data, you can tailor the time window to your business. This time interval is referred to as the window size. The window size affects how quickly the algorithm adjusts to your data. Tuning the window size to match your data allows you to improve detector performance. For more information, see Anomaly Detection for Amazon Elasticsearch Service.

Conclusion

In this post, we’ve shown the value of high cardinality anomaly detection for Amazon ES. To get actionable insights and preconfigured anomaly detection for Amazon Connect metrics, you can deploy the call quality monitoring solution.

The high cardinality anomaly detection feature is available on all Amazon ES domains running Elasticsearch 7.9 or greater. For more information, see Anomaly Detection for Amazon Elasticsearch Service.

 


About the Authors

Kun Qian is a Specialist Solutions Architect on the Amazon Connect Customer Solutions Acceleration team. He enjoys solving complex problems with technology. In his spare time, he loves hiking, experimenting in the kitchen, and meeting new people.

Rob Taylor is a Solutions Architect helping customers in Australia design and build solutions on AWS. He has a background in theoretical computer science and modeling. Originally from the US, in his spare time he enjoys exploring Australia and spending time with his family.

Analyzing data stored in Amazon DocumentDB (with MongoDB compatibility) using Amazon SageMaker

One of the challenges in data science is getting access to operational or real-time data, which is often stored in operational database systems. Being able to connect data science tools to operational data easily and efficiently unleashes enormous potential for gaining insights from real-time data. In this post, we explore using Amazon SageMaker to analyze data stored in Amazon DocumentDB (with MongoDB compatibility).

For illustrative purposes, we use public event data from the GitHub API, which has a complex nested JSON format, and is well-suited for a document database such as Amazon DocumentDB. We use SageMaker to analyze this data, conduct descriptive analysis, and build a simple machine learning (ML) model to predict whether a pull request will close within 24 hours, before writing prediction results back into the database.

SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Amazon DocumentDB is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. You can use the same MongoDB 3.6 application code, drivers, and tools to run, manage, and scale workloads on Amazon DocumentDB without having to worry about managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.

Solution overview

In this post, we analyze GitHub events, examples of which include issues, forks, and pull requests. Each event is represented by the GitHub API as a complex, nested JSON object, which is a format well-suited for Amazon DocumentDB. The following code is an example of the output from a pull request event:

{
  "id": "13469392114",
  "type": "PullRequestEvent",
  "actor": {
    "id": 33526713,
    "login": "arjkesh",
    "display_login": "arjkesh",
    "gravatar_id": "",
    "url": "https://api.github.com/users/arjkesh",
    "avatar_url": "https://avatars.githubusercontent.com/u/33526713?"
  },
  "repo": {
    "id": 234634164,
    "name": "aws/deep-learning-containers",
    "url": "https://api.github.com/repos/aws/deep-learning-containers"
  },
  "payload": {
    "action": "closed",
    "number": 570,
    "pull_request": {
      "url": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570",
      "id": 480316742,
      "node_id": "MDExOlB1bGxSZXF1ZXN0NDgwMzE2NzQy",
      "html_url": "https://github.com/aws/deep-learning-containers/pull/570",
      "diff_url": "https://github.com/aws/deep-learning-containers/pull/570.diff",
      "patch_url": "https://github.com/aws/deep-learning-containers/pull/570.patch",
      "issue_url": "https://api.github.com/repos/aws/deep-learning-containers/issues/570",
      "number": 570,
      "state": "closed",
      "locked": false,
      "title": "[test][tensorflow][ec2] Add timeout to Data Service test setup",
      "user": {
        "login": "arjkesh",
        "id": 33526713,
        "node_id": "MDQ6VXNlcjMzNTI2NzEz",
        "avatar_url": "https://avatars3.githubusercontent.com/u/33526713?v=4",
        "gravatar_id": "",
        "url": "https://api.github.com/users/arjkesh",
        "html_url": "https://github.com/arjkesh",
        "followers_url": "https://api.github.com/users/arjkesh/followers",
        "following_url": "https://api.github.com/users/arjkesh/following{/other_user}",
        "gists_url": "https://api.github.com/users/arjkesh/gists{/gist_id}",
        "starred_url": "https://api.github.com/users/arjkesh/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/arjkesh/subscriptions",
        "organizations_url": "https://api.github.com/users/arjkesh/orgs",
        "repos_url": "https://api.github.com/users/arjkesh/repos",
        "events_url": "https://api.github.com/users/arjkesh/events{/privacy}",
        "received_events_url": "https://api.github.com/users/arjkesh/received_events",
        "type": "User",
        "site_admin": false
      },
      "body": "*Issue #, if available:*\r\n\r\n## Checklist\r\n- [x] I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]\r\n\r\n*Description:*\r\nCurrently, this test does not timeout in the same manner as execute_ec2_training_test or other ec2 training tests. As such, a timeout should be added here to avoid hanging instances. A separate PR will be opened to address why the global timeout does not catch this.\r\n\r\n*Tests run:*\r\nPR tests\r\n\r\n\r\nBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.\r\n\r\n",
      "created_at": "2020-09-05T00:29:22Z",
      "updated_at": "2020-09-10T06:16:53Z",
      "closed_at": "2020-09-10T06:16:53Z",
      "merged_at": null,
      "merge_commit_sha": "4144152ac0129a68c9c6f9e45042ecf1d89d3e1a",
      "assignee": null,
      "assignees": [

      ],
      "requested_reviewers": [

      ],
      "requested_teams": [

      ],
      "labels": [

      ],
      "milestone": null,
      "draft": false,
      "commits_url": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570/commits",
      "review_comments_url": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570/comments",
      "review_comment_url": "https://api.github.com/repos/aws/deep-learning-containers/pulls/comments{/number}",
      "comments_url": "https://api.github.com/repos/aws/deep-learning-containers/issues/570/comments",
      "statuses_url": "https://api.github.com/repos/aws/deep-learning-containers/statuses/99bb5a14993ceb29c16641bd54865db46ee6bf59",
      "head": {
        "label": "arjkesh:add_timeouts",
        "ref": "add_timeouts",
        "sha": "99bb5a14993ceb29c16641bd54865db46ee6bf59",
        "user": {
          "login": "arjkesh",
          "id": 33526713,
          "node_id": "MDQ6VXNlcjMzNTI2NzEz",
          "avatar_url": "https://avatars3.githubusercontent.com/u/33526713?v=4",
          "gravatar_id": "",
          "url": "https://api.github.com/users/arjkesh",
          "html_url": "https://github.com/arjkesh",
          "followers_url": "https://api.github.com/users/arjkesh/followers",
          "following_url": "https://api.github.com/users/arjkesh/following{/other_user}",
          "gists_url": "https://api.github.com/users/arjkesh/gists{/gist_id}",
          "starred_url": "https://api.github.com/users/arjkesh/starred{/owner}{/repo}",
          "subscriptions_url": "https://api.github.com/users/arjkesh/subscriptions",
          "organizations_url": "https://api.github.com/users/arjkesh/orgs",
          "repos_url": "https://api.github.com/users/arjkesh/repos",
          "events_url": "https://api.github.com/users/arjkesh/events{/privacy}",
          "received_events_url": "https://api.github.com/users/arjkesh/received_events",
          "type": "User",
          "site_admin": false
        },
        "repo": {
          "id": 265346646,
          "node_id": "MDEwOlJlcG9zaXRvcnkyNjUzNDY2NDY=",
          "name": "deep-learning-containers-1",
          "full_name": "arjkesh/deep-learning-containers-1",
          "private": false,
          "owner": {
            "login": "arjkesh",
            "id": 33526713,
            "node_id": "MDQ6VXNlcjMzNTI2NzEz",
            "avatar_url": "https://avatars3.githubusercontent.com/u/33526713?v=4",
            "gravatar_id": "",
            "url": "https://api.github.com/users/arjkesh",
            "html_url": "https://github.com/arjkesh",
            "followers_url": "https://api.github.com/users/arjkesh/followers",
            "following_url": "https://api.github.com/users/arjkesh/following{/other_user}",
            "gists_url": "https://api.github.com/users/arjkesh/gists{/gist_id}",
            "starred_url": "https://api.github.com/users/arjkesh/starred{/owner}{/repo}",
            "subscriptions_url": "https://api.github.com/users/arjkesh/subscriptions",
            "organizations_url": "https://api.github.com/users/arjkesh/orgs",
            "repos_url": "https://api.github.com/users/arjkesh/repos",
            "events_url": "https://api.github.com/users/arjkesh/events{/privacy}",
            "received_events_url": "https://api.github.com/users/arjkesh/received_events",
            "type": "User",
            "site_admin": false
          },
          "html_url": "https://github.com/arjkesh/deep-learning-containers-1",
          "description": "AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.",
          "fork": true,
          "url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1",
          "forks_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/forks",
          "keys_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/keys{/key_id}",
          "collaborators_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/collaborators{/collaborator}",
          "teams_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/teams",
          "hooks_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/hooks",
          "issue_events_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/issues/events{/number}",
          "events_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/events",
          "assignees_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/assignees{/user}",
          "branches_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/branches{/branch}",
          "tags_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/tags",
          "blobs_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/blobs{/sha}",
          "git_tags_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/tags{/sha}",
          "git_refs_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/refs{/sha}",
          "trees_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/trees{/sha}",
          "statuses_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/statuses/{sha}",
          "languages_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/languages",
          "stargazers_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/stargazers",
          "contributors_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/contributors",
          "subscribers_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/subscribers",
          "subscription_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/subscription",
          "commits_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/commits{/sha}",
          "git_commits_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/commits{/sha}",
          "comments_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/comments{/number}",
          "issue_comment_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/issues/comments{/number}",
          "contents_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/contents/{+path}",
          "compare_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/compare/{base}...{head}",
          "merges_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/merges",
          "archive_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/{archive_format}{/ref}",
          "downloads_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/downloads",
          "issues_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/issues{/number}",
          "pulls_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/pulls{/number}",
          "milestones_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/milestones{/number}",
          "notifications_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/notifications{?since,all,participating}",
          "labels_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/labels{/name}",
          "releases_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/releases{/id}",
          "deployments_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/deployments",
          "created_at": "2020-05-19T19:38:21Z",
          "updated_at": "2020-06-23T04:18:45Z",
          "pushed_at": "2020-09-10T02:04:27Z",
          "git_url": "git://github.com/arjkesh/deep-learning-containers-1.git",
          "ssh_url": "git@github.com:arjkesh/deep-learning-containers-1.git",
          "clone_url": "https://github.com/arjkesh/deep-learning-containers-1.git",
          "svn_url": "https://github.com/arjkesh/deep-learning-containers-1",
          "homepage": "https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html",
          "size": 68734,
          "stargazers_count": 0,
          "watchers_count": 0,
          "language": "Python",
          "has_issues": false,
          "has_projects": true,
          "has_downloads": true,
          "has_wiki": true,
          "has_pages": true,
          "forks_count": 0,
          "mirror_url": null,
          "archived": false,
          "disabled": false,
          "open_issues_count": 0,
          "license": {
            "key": "apache-2.0",
            "name": "Apache License 2.0",
            "spdx_id": "Apache-2.0",
            "url": "https://api.github.com/licenses/apache-2.0",
            "node_id": "MDc6TGljZW5zZTI="
          },
          "forks": 0,
          "open_issues": 0,
          "watchers": 0,
          "default_branch": "master"
        }
      },
      "base": {
        "label": "aws:master",
        "ref": "master",
        "sha": "9514fde23ae9eeffb9dfba13ce901fafacef30b5",
        "user": {
          "login": "aws",
          "id": 2232217,
          "node_id": "MDEyOk9yZ2FuaXphdGlvbjIyMzIyMTc=",
          "avatar_url": "https://avatars3.githubusercontent.com/u/2232217?v=4",
          "gravatar_id": "",
          "url": "https://api.github.com/users/aws",
          "html_url": "https://github.com/aws",
          "followers_url": "https://api.github.com/users/aws/followers",
          "following_url": "https://api.github.com/users/aws/following{/other_user}",
          "gists_url": "https://api.github.com/users/aws/gists{/gist_id}",
          "starred_url": "https://api.github.com/users/aws/starred{/owner}{/repo}",
          "subscriptions_url": "https://api.github.com/users/aws/subscriptions",
          "organizations_url": "https://api.github.com/users/aws/orgs",
          "repos_url": "https://api.github.com/users/aws/repos",
          "events_url": "https://api.github.com/users/aws/events{/privacy}",
          "received_events_url": "https://api.github.com/users/aws/received_events",
          "type": "Organization",
          "site_admin": false
        },
        "repo": {
          "id": 234634164,
          "node_id": "MDEwOlJlcG9zaXRvcnkyMzQ2MzQxNjQ=",
          "name": "deep-learning-containers",
          "full_name": "aws/deep-learning-containers",
          "private": false,
          "owner": {
            "login": "aws",
            "id": 2232217,
            "node_id": "MDEyOk9yZ2FuaXphdGlvbjIyMzIyMTc=",
            "avatar_url": "https://avatars3.githubusercontent.com/u/2232217?v=4",
            "gravatar_id": "",
            "url": "https://api.github.com/users/aws",
            "html_url": "https://github.com/aws",
            "followers_url": "https://api.github.com/users/aws/followers",
            "following_url": "https://api.github.com/users/aws/following{/other_user}",
            "gists_url": "https://api.github.com/users/aws/gists{/gist_id}",
            "starred_url": "https://api.github.com/users/aws/starred{/owner}{/repo}",
            "subscriptions_url": "https://api.github.com/users/aws/subscriptions",
            "organizations_url": "https://api.github.com/users/aws/orgs",
            "repos_url": "https://api.github.com/users/aws/repos",
            "events_url": "https://api.github.com/users/aws/events{/privacy}",
            "received_events_url": "https://api.github.com/users/aws/received_events",
            "type": "Organization",
            "site_admin": false
          },
          "html_url": "https://github.com/aws/deep-learning-containers",
          "description": "AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.",
          "fork": false,
          "url": "https://api.github.com/repos/aws/deep-learning-containers",
          "forks_url": "https://api.github.com/repos/aws/deep-learning-containers/forks",
          "keys_url": "https://api.github.com/repos/aws/deep-learning-containers/keys{/key_id}",
          "collaborators_url": "https://api.github.com/repos/aws/deep-learning-containers/collaborators{/collaborator}",
          "teams_url": "https://api.github.com/repos/aws/deep-learning-containers/teams",
          "hooks_url": "https://api.github.com/repos/aws/deep-learning-containers/hooks",
          "issue_events_url": "https://api.github.com/repos/aws/deep-learning-containers/issues/events{/number}",
          "events_url": "https://api.github.com/repos/aws/deep-learning-containers/events",
          "assignees_url": "https://api.github.com/repos/aws/deep-learning-containers/assignees{/user}",
          "branches_url": "https://api.github.com/repos/aws/deep-learning-containers/branches{/branch}",
          "tags_url": "https://api.github.com/repos/aws/deep-learning-containers/tags",
          "blobs_url": "https://api.github.com/repos/aws/deep-learning-containers/git/blobs{/sha}",
          "git_tags_url": "https://api.github.com/repos/aws/deep-learning-containers/git/tags{/sha}",
          "git_refs_url": "https://api.github.com/repos/aws/deep-learning-containers/git/refs{/sha}",
          "trees_url": "https://api.github.com/repos/aws/deep-learning-containers/git/trees{/sha}",
          "statuses_url": "https://api.github.com/repos/aws/deep-learning-containers/statuses/{sha}",
          "languages_url": "https://api.github.com/repos/aws/deep-learning-containers/languages",
          "stargazers_url": "https://api.github.com/repos/aws/deep-learning-containers/stargazers",
          "contributors_url": "https://api.github.com/repos/aws/deep-learning-containers/contributors",
          "subscribers_url": "https://api.github.com/repos/aws/deep-learning-containers/subscribers",
          "subscription_url": "https://api.github.com/repos/aws/deep-learning-containers/subscription",
          "commits_url": "https://api.github.com/repos/aws/deep-learning-containers/commits{/sha}",
          "git_commits_url": "https://api.github.com/repos/aws/deep-learning-containers/git/commits{/sha}",
          "comments_url": "https://api.github.com/repos/aws/deep-learning-containers/comments{/number}",
          "issue_comment_url": "https://api.github.com/repos/aws/deep-learning-containers/issues/comments{/number}",
          "contents_url": "https://api.github.com/repos/aws/deep-learning-containers/contents/{+path}",
          "compare_url": "https://api.github.com/repos/aws/deep-learning-containers/compare/{base}...{head}",
          "merges_url": "https://api.github.com/repos/aws/deep-learning-containers/merges",
          "archive_url": "https://api.github.com/repos/aws/deep-learning-containers/{archive_format}{/ref}",
          "downloads_url": "https://api.github.com/repos/aws/deep-learning-containers/downloads",
          "issues_url": "https://api.github.com/repos/aws/deep-learning-containers/issues{/number}",
          "pulls_url": "https://api.github.com/repos/aws/deep-learning-containers/pulls{/number}",
          "milestones_url": "https://api.github.com/repos/aws/deep-learning-containers/milestones{/number}",
          "notifications_url": "https://api.github.com/repos/aws/deep-learning-containers/notifications{?since,all,participating}",
          "labels_url": "https://api.github.com/repos/aws/deep-learning-containers/labels{/name}",
          "releases_url": "https://api.github.com/repos/aws/deep-learning-containers/releases{/id}",
          "deployments_url": "https://api.github.com/repos/aws/deep-learning-containers/deployments",
          "created_at": "2020-01-17T20:52:43Z",
          "updated_at": "2020-09-09T22:57:46Z",
          "pushed_at": "2020-09-10T04:01:22Z",
          "git_url": "git://github.com/aws/deep-learning-containers.git",
          "ssh_url": "git@github.com:aws/deep-learning-containers.git",
          "clone_url": "https://github.com/aws/deep-learning-containers.git",
          "svn_url": "https://github.com/aws/deep-learning-containers",
          "homepage": "https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html",
          "size": 68322,
          "stargazers_count": 61,
          "watchers_count": 61,
          "language": "Python",
          "has_issues": true,
          "has_projects": true,
          "has_downloads": true,
          "has_wiki": true,
          "has_pages": false,
          "forks_count": 49,
          "mirror_url": null,
          "archived": false,
          "disabled": false,
          "open_issues_count": 28,
          "license": {
            "key": "apache-2.0",
            "name": "Apache License 2.0",
            "spdx_id": "Apache-2.0",
            "url": "https://api.github.com/licenses/apache-2.0",
            "node_id": "MDc6TGljZW5zZTI="
          },
          "forks": 49,
          "open_issues": 28,
          "watchers": 61,
          "default_branch": "master"
        }
      },
      "_links": {
        "self": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570"
        },
        "html": {
          "href": "https://github.com/aws/deep-learning-containers/pull/570"
        },
        "issue": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/issues/570"
        },
        "comments": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/issues/570/comments"
        },
        "review_comments": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570/comments"
        },
        "review_comment": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/pulls/comments{/number}"
        },
        "commits": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570/commits"
        },
        "statuses": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/statuses/99bb5a14993ceb29c16641bd54865db46ee6bf59"
        }
      },
      "author_association": "CONTRIBUTOR",
      "active_lock_reason": null,
      "merged": false,
      "mergeable": false,
      "rebaseable": false,
      "mergeable_state": "dirty",
      "merged_by": null,
      "comments": 1,
      "review_comments": 0,
      "maintainer_can_modify": false,
      "commits": 6,
      "additions": 26,
      "deletions": 18,
      "changed_files": 1
    }
  },
  "public": true,
  "created_at": "2020-09-10T06:16:53Z",
  "org": {
    "id": 2232217,
    "login": "aws",
    "gravatar_id": "",
    "url": "https://api.github.com/orgs/aws",
    "avatar_url": "https://avatars.githubusercontent.com/u/2232217?"
  }
}

Amazon DocumentDB stores each JSON event as a document. Multiple documents are stored in a collection, and multiple collections are stored in a database. Borrowing terminology from relational databases, documents are analogous to rows, and collections are analogous to tables. The following table summarizes these terms.

Document Database Concepts    SQL Concepts
Document                      Row
Collection                    Table
Database                      Database
Field                         Column

We now implement the following Amazon DocumentDB tasks using SageMaker:

  1. Connect to an Amazon DocumentDB cluster.
  2. Ingest GitHub event data stored in the database.
  3. Generate descriptive statistics.
  4. Conduct feature selection and engineering.
  5. Generate predictions.
  6. Store prediction results.

Creating resources

We have prepared the following AWS CloudFormation template to create the required AWS resources for this post. For instructions on creating a CloudFormation stack, see the video Simplify your Infrastructure Management using AWS CloudFormation.

The CloudFormation stack provisions the following:

  • A VPC with three private subnets and one public subnet.
  • An Amazon DocumentDB cluster with three nodes, one in each private subnet. When creating an Amazon DocumentDB cluster in a VPC, its subnet group should have subnets in at least two Availability Zones in a given Region.
  • An AWS Secrets Manager secret to store login credentials for Amazon DocumentDB. This allows us to avoid storing plaintext credentials in our SageMaker instance.
  • A SageMaker role to retrieve the Amazon DocumentDB login credentials, allowing connections to the Amazon DocumentDB cluster from a SageMaker notebook.
  • A SageMaker instance to run queries and analysis.
  • A SageMaker instance lifecycle configuration to run a bash script every time the instance boots up, downloading a certificate bundle to create TLS connections to Amazon DocumentDB, as well as a Jupyter Notebook containing the code for this tutorial. The script also installs required Python libraries (such as pymongo for database methods and xgboost for ML modeling), so that we don’t need to install these libraries from the notebook. See the following code:
    #!/bin/bash
    sudo -u ec2-user -i <<'EOF'
    source /home/ec2-user/anaconda3/bin/activate python3
    pip install --upgrade pymongo
    pip install --upgrade xgboost
    source /home/ec2-user/anaconda3/bin/deactivate
    cd /home/ec2-user/SageMaker
    wget https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem
    # Use the raw content URL so that wget fetches the notebook file itself, not the GitHub HTML page
    wget https://raw.githubusercontent.com/aws-samples/documentdb-sagemaker-example/main/script.ipynb
    EOF

In creating the CloudFormation stack, you need to specify the following:

  • Name for your CloudFormation stack
  • Amazon DocumentDB username and password (to be stored in Secrets Manager)
  • Amazon DocumentDB instance type (default db.r5.large)
  • SageMaker instance type (default ml.t3.xlarge)

It should take about 15 minutes to create the CloudFormation stack. The following diagram shows the resource architecture.

Running this tutorial for an hour should cost no more than US$2.00.

Connecting to an Amazon DocumentDB cluster

All the subsequent code in this tutorial is in the Jupyter Notebook in the SageMaker instance created in your CloudFormation stack.

  1. To connect to your Amazon DocumentDB cluster from a SageMaker notebook, you have to first specify the following code:
    stack_name = "docdb-sm" # name of CloudFormation stack

The stack_name refers to the name you specified for your CloudFormation stack upon its creation.

  2. Use this parameter in the following method to get your Amazon DocumentDB credentials stored in Secrets Manager:
    def get_secret(stack_name):
    
        # Create a Secrets Manager client
        session = boto3.session.Session()
        client = session.client(
            service_name='secretsmanager',
            region_name=session.region_name
        )
        
        secret_name = f'{stack_name}-DocDBSecret'
        get_secret_value_response = client.get_secret_value(SecretId=secret_name)
        secret = get_secret_value_response['SecretString']
        
        return json.loads(secret)

  3. Next, we extract the login parameters from the stored secret:
    secret = get_secret(stack_name)
    
    db_username = secret['username']
    db_password = secret['password']
    db_port = secret['port']
    db_host = secret['host']

  4. Using the extracted parameters, we create a MongoClient from the pymongo library to establish a connection to the Amazon DocumentDB cluster:
    uri_str = f"mongodb://{db_username}:{db_password}@{db_host}:{db_port}/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
    client = MongoClient(uri_str)
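
At this point you can, for example, run a quick sanity check against the cluster (the commands below are just one way to confirm connectivity; they are not part of the original notebook):

# Confirm that the client can reach the Amazon DocumentDB cluster
print(client.admin.command("ismaster"))
print(client.list_database_names())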

Ingesting data

After we establish the connection to our Amazon DocumentDB cluster, we create a database and collection to store our GitHub event data. For this post, we name our database gharchive, and our collection events:

db_name = "gharchive" # name the database
collection_name = "events" # name the collection

db = client[db_name] # create a database
events = db[collection_name] # create a collection

Next, we need to download the data from gharchive.org, which has been aggregated into hourly archives with the following naming format:

https://data.gharchive.org/YYYY-MM-DD-H.json.gz

The aim of this analysis is to predict whether a pull request closes within 24 hours. For simplicity, we limit the analysis to two days: February 10–11, 2015. Across these two days, there were over 1 million GitHub events.

The following code downloads the relevant hourly archives, then formats and ingests the data into your Amazon DocumentDB database. It takes about 7 minutes to run on an ml.t3.xlarge instance.

# Specify target date and time range for GitHub events
year = 2015
month = 2
days = [10, 11]
hours = range(0, 24)

# Download data from gharchive.org and insert into Amazon DocumentDB

for day in days:
    for hour in hours:
        
        print(f"Processing events for {year}-{month}-{day}, {hour} hr.")
        
        # zeropad values
        month_ = str(month).zfill(2)
        day_ = str(day).zfill(2)
        
        # download data
        url = f"https://data.gharchive.org/{year}-{month_}-{day_}-{hour}.json.gz"
        response = requests.get(url, stream=True)
        
        # decompress data
        respdata = zlib.decompress(response.content, zlib.MAX_WBITS|32)
        
        # format data
        stringdata = respdata.split(b'\n')
        data = [json.loads(x) for x in stringdata if 0 < len(x)]
        
        # ingest data
        events.insert_many(data, ordered=False, bypass_document_validation=True)

The ordered=False option allows the data to be ingested out of order. The bypass_document_validation=True option allows the write to skip validating the JSON input, which is safe to do because we validated the JSON structure when we issued the json.loads() command prior to inserting.

Both options expedite the data ingestion process.

Generating descriptive statistics

As a common first step in data science, we explore the data to get some general descriptive statistics. We can use database operations to calculate some of these basic statistics.

To get a count of the number of GitHub events, we use the count_documents() command:

events.count_documents({})
> 1174157

The count_documents() command gets the number of documents in a collection. Each GitHub event is recorded as a document, and events is what we had named our collection earlier.

The 1,174,157 documents comprise different types of GitHub events. To see the frequency of each type of event occurring in the dataset, we query the database using the aggregate command:

# Frequency of event types
event_types_query = events.aggregate([
    # Group by the type attribute and count
    {"$group" : {"_id": "$type", "count": {"$sum": 1}}}, 
    # Reformat the data
    {"$project": {"_id": 0, "Type": "$_id", "count": "$count"}},
    # Sort by the count in descending order
    {"$sort": {"count": -1} }  
])
df_event_types = pd.DataFrame(event_types_query)

The preceding query groups the events by type, runs a count, and sorts the results in descending order of count. Finally, we wrap the output in pd.DataFrame() to convert the results to a DataFrame. This allows us to generate visualizations such as the following.
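
The original post displays the resulting chart as an image. As a rough sketch (the plotting code below is an assumption, not part of the original notebook), a similar bar chart can be drawn directly from the DataFrame:

import matplotlib.pyplot as plt

# Bar chart of GitHub event types, most frequent first
ax = df_event_types.plot.bar(x="Type", y="count", legend=False, figsize=(10, 5))
ax.set_xlabel("Event type")
ax.set_ylabel("Number of events")
plt.tight_layout()
plt.show()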

From the plot, we can see that push events were the most frequent, numbering close to 600,000.

Returning to our goal to predict if a pull request closes within 24 hours, we implement another query to include only pull request events, using the database match operation, and then count the number of such events per pull request URL:

# Frequency of PullRequestEvent actions by URL
action_query = events.aggregate([
    # Keep only PullRequestEvent types
    {"$match" : {"type": "PullRequestEvent"} }, 
    # Group by HTML URL and count
    {"$group": {"_id": "$payload.pull_request.html_url", "count": {"$sum": 1}}}, 
    # Reformat the data
    {"$project": {"_id": 0, "url": "$_id", "count": "$count"}},
    # Sort by the count in descending order
    {"$sort": {"count": -1} }  
])
df_action = pd.DataFrame(action_query)

From the result, we can see that a single URL could have multiple pull request events, such as those shown in the following screenshot.


One of the attributes of a pull request event is the state of the pull request after the event. Therefore, we’re interested in the latest event by the end of 24 hours in determining whether the pull request was open or closed in that window of time. We show how to run this query later in this post, but continue now with a discussion of descriptive statistics.

Apart from counts, we can also have the database calculate the mean, maximum, and minimum values for us. In the following query, we do this for potential predictors of a pull request open/close status, specifically the number of stars, forks, and open issues, as well as repository size. We also calculate the time elapsed (in milliseconds) of a pull request event since its creation. For each pull request, there could be multiple pull request events (comments), and this descriptive query spans across all these events:

# Descriptive statistics (mean, max, min) of repo size, stars, forks, open issues, elapsed time
descriptives = list(events.aggregate([
    # Keep only PullRequestEvents
    {"$match": {"type": "PullRequestEvent"} }, 
    # Project out attributes of interest
    {"$project": {
        "_id": 0, 
        "repo_size": "$payload.pull_request.base.repo.size", 
        "stars": "$payload.pull_request.base.repo.stargazers_count", 
        "forks": "$payload.pull_request.base.repo.forks_count", 
        "open_issues": "$payload.pull_request.base.repo.open_issues_count",
        "time_since_created": {"$subtract": [{"$dateFromString": {"dateString": "$payload.pull_request.updated_at"}}, 
                                  {"$dateFromString": {"dateString": "$payload.pull_request.created_at"}}]} 
    }}, 
    # Calculate min/max/avg for various metrics grouped over full data set
    {"$group": {
        "_id": "descriptives", 
        "mean_repo_size": {"$avg": "$repo_size"}, 
        "mean_stars": {"$avg": "$stars"}, 
        "mean_forks": {"$avg": "$forks"}, 
        "mean_open_issues": {"$avg": "$open_issues" },
        "mean_time_since_created": {"$avg": "$time_since_created"},
        
        "min_repo_size": {"$min": "$repo_size"}, 
        "min_stars": {"$min": "$stars"}, 
        "min_forks": {"$min": "$forks"}, 
        "min_open_issues": {"$min": "$open_issues" },
        "min_time_since_created": {"$min": "$time_since_created"},
        
        "max_repo_size": {"$max": "$repo_size"}, 
        "max_stars": {"$max": "$stars"}, 
        "max_forks": {"$max": "$forks"}, 
        "max_open_issues": {"$max": "$open_issues" },
        "max_time_since_created": {"$max": "$time_since_created"}
    }},
    # Reformat results
    {"$project": {
        "_id": 0, 
        "repo_size": {"mean": "$mean_repo_size", 
                      "min": "$min_repo_size",
                      "max": "$max_repo_size"},
        "stars": {"mean": "$mean_stars", 
                  "min": "$min_stars",
                  "max": "$max_stars"},
        "forks": {"mean": "$mean_forks", 
                  "min": "$min_forks",
                  "max": "$max_forks"},
        "open_issues": {"mean": "$mean_open_issues", 
                        "min": "$min_open_issues",
                        "max": "$max_open_issues"},
        "time_since_created": {"mean": "$mean_time_since_created", 
                               "min": "$min_time_since_created",
                               "max": "$max_time_since_created"},
    }}
]))

pd.DataFrame(descriptives[0])

The query results in the following output.

For supported methods of aggregations in Amazon DocumentDB, refer to Aggregation Pipeline Operators.

Conducting feature selection and engineering

Before we can begin building our prediction model, we need to select relevant features and engineer new ones. In the following query, we select pull request events from non-empty repositories with more than 50 forks. We select possible predictors, including the number of forks (forks_count) and the number of open issues (open_issues_count), and engineer new predictors by normalizing those counts by the size of the repository (repo.size). Finally, we shortlist the pull request events that fall within our evaluation period and record the latest pull request status (open or closed), which is the outcome our predictive model learns to predict.

df = list(events.aggregate([
    # Filter on just PullRequestEvents
    {"$match": {
        "type": "PullRequestEvent",                                 # focus on pull requests
        "payload.pull_request.base.repo.forks_count": {"$gt": 50},  # focus on popular repos
        "payload.pull_request.base.repo.size": {"$gt": 0}           # exclude empty repos
    }},
    # Project only features of interest
    {"$project": {
        "type": 1,
        "payload.pull_request.base.repo.size": 1, 
        "payload.pull_request.base.repo.stargazers_count": 1,
        "payload.pull_request.base.repo.has_downloads": 1,
        "payload.pull_request.base.repo.has_wiki": 1,
        "payload.pull_request.base.repo.has_pages" : 1,
        "payload.pull_request.base.repo.forks_count": 1,
        "payload.pull_request.base.repo.open_issues_count": 1,
        "payload.pull_request.html_url": 1,
        "payload.pull_request.created_at": 1,
        "payload.pull_request.updated_at": 1,
        "payload.pull_request.state": 1,
        
        # calculate no. of open issues normalized by repo size
        "issues_per_repo_size": {"$divide": ["$payload.pull_request.base.repo.open_issues_count",
                                             "$payload.pull_request.base.repo.size"]},
        
        # calculate no. of forks normalized by repo size
        "forks_per_repo_size": {"$divide": ["$payload.pull_request.base.repo.forks_count",
                                            "$payload.pull_request.base.repo.size"]},
        
        # format datetime variables
        "created_time": {"$dateFromString": {"dateString": "$payload.pull_request.created_at"}},
        "updated_time": {"$dateFromString": {"dateString": "$payload.pull_request.updated_at"}},
        
        # calculate time elapsed since PR creation
        "time_since_created": {"$subtract": [{"$dateFromString": {"dateString": "$payload.pull_request.updated_at"}}, 
                                             {"$dateFromString": {"dateString": "$payload.pull_request.created_at"}} ]}
    }},
    # Keep only events within the window (24hrs) since the pull request was created
    # Keep only pull requests that were created on or after the start and before the end period
    {"$match": {
        "time_since_created": {"$lte": prediction_window},
        "created_time": {"$gte": date_start, "$lt": date_end}
    }},
    # Sort by the html_url and then by the updated_time
    {"$sort": {
        "payload.pull_request.html_url": 1,
        "updated_time": 1
    }},
    # keep the information from the first event in each group, plus the state from the last event in each group
    # grouping by html_url
    {"$group": {
        "_id": "$payload.pull_request.html_url",
        "repo_size": {"$first": "$payload.pull_request.base.repo.size"},
        "stargazers_count": {"$first": "$payload.pull_request.base.repo.stargazers_count"},
        "has_downloads": {"$first": "$payload.pull_request.base.repo.has_downloads"},
        "has_wiki": {"$first": "$payload.pull_request.base.repo.has_wiki"},
        "has_pages" : {"$first": "$payload.pull_request.base.repo.has_pages"},
        "forks_count": {"$first": "$payload.pull_request.base.repo.forks_count"},
        "open_issues_count": {"$first": "$payload.pull_request.base.repo.open_issues_count"},
        "issues_per_repo_size": {"$first": "$issues_per_repo_size"},
        "forks_per_repo_size": {"$first": "$forks_per_repo_size"},
        "state": {"$last": "$payload.pull_request.state"}
    }}
]))

df = pd.DataFrame(df)
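
The modeling code that follows expects a binary state_open column and uses the pull request URL as the DataFrame index, so the notebook performs some light preprocessing between the query and the train/test split. A minimal sketch of what that step could look like (the exact column handling here is an assumption, not the original notebook code):

# Use the pull request URL (the group _id) as the index so that predictions
# can later be matched back to their pull requests
df = df.set_index("_id")

# Encode the outcome: 1 if the latest event left the pull request open, 0 if closed
df["state_open"] = (df["state"] == "open").astype(int)
df = df.drop("state", axis=1)

# Cast the boolean repository flags to integers for XGBoost
for col in ["has_downloads", "has_wiki", "has_pages"]:
    df[col] = df[col].astype(int)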

Generating predictions

Before building our model, we split our data into two sets for training and testing:

X = df.drop(['state_open'], axis=1)
y = df['state_open']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    stratify=y,
                                                    random_state=42,
                                                   )

For this post, we use 70% of the documents for training the model, and the remaining 30% for testing the model’s predictions against the actual pull request status. We use the XGBoost algorithm to train a binary:logistic model evaluated with area under the curve (AUC) over 20 iterations. The seed is specified to enable reproducibility of results. The other parameters are left as default values. See the following code:

# Format data
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Specify model parameters
param = {
    'objective':'binary:logistic',
    'eval_metric':'auc',
    'seed': 42,
        }

# Train model
num_round = 20
bst = xgb.train(param, dtrain, num_round)

Next, we use the trained model to generate predictions for the test dataset and to calculate and plot the AUC:

preds = bst.predict(dtest)
roc_auc_score(y_test, preds)
> 0.609441068887402
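
The AUC plot in the post is shown as an image; the following is a minimal sketch of how the ROC curve could be drawn (assuming scikit-learn and matplotlib, whose imports the original snippets don’t show):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Plot the ROC curve for the test set predictions
fpr, tpr, _ = roc_curve(y_test, preds)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, preds):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()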

The following plot shows our results.

We can also rank the predictors of a pull request’s state by importance:

xgb.plot_importance(bst, importance_type='weight')

The following plot shows our results.

Predictor importance can be defined in several ways. For this post, we use weight, which is the number of times a predictor appears in the XGBoost trees. The top predictor is the number of open issues normalized by the repository size. Using a box plot, we compare the spread of values for this predictor between closed and still-open pull requests.
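
A sketch of how such a box plot could be produced with pandas (the column names follow the earlier feature engineering; the plotting code itself is an assumption, not from the original post):

import matplotlib.pyplot as plt

# Compare the top predictor for closed (0) vs. still-open (1) pull requests
ax = df.boxplot(column="issues_per_repo_size", by="state_open", figsize=(6, 5))
ax.set_xlabel("Pull request still open after 24 hours")
ax.set_ylabel("Open issues normalized by repository size")
plt.suptitle("")  # drop pandas' automatic grouping title
plt.show()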

After we examine the results and are satisfied with the model performance, we can write predictions back into Amazon DocumentDB.

Storing prediction results

The final step is to store the model predictions back into Amazon DocumentDB. First, we create a new Amazon DocumentDB collection to hold our results, called predictions:

predictions = db['predictions']

Then we change the generated predictions to type float, to be accepted by Amazon DocumentDB:

preds = preds.astype(float)

We need to associate these predictions with their respective pull request events. Therefore, we use the pull request URL as each document’s ID. We match each prediction to its respective pull request URL and consolidate them in a list:

urls = y_test.index

def gen_preds(url, pred):
    """
    Generate document with prediction of whether pull request will close in 24 hours.
    ID is pull request URL.
    """
    doc = {
        "_id": url, 
        "close_24hr_prediction": pred}
    
    return doc

documents = [gen_preds(url, pred) for url, pred in zip(urls, preds)]

Finally, we use the insert_many command to write the documents to Amazon DocumentDB:

predictions.insert_many(documents, ordered=False)

We can query a sample of five documents in the predictions collection to verify that the results have been inserted correctly:

pd.DataFrame(predictions.find({}).limit(5))

The following screenshot shows our results.

Cleaning up resources

To save cost, delete the CloudFormation stack you created. This removes all the resources you provisioned using the CloudFormation template, including the VPC, Amazon DocumentDB cluster, and SageMaker instance. For instructions, see Deleting a stack on the AWS CloudFormation console.

Summary

We used SageMaker to analyze data stored in Amazon DocumentDB, conduct descriptive analysis, and build a simple ML model to make predictions, before writing prediction results back into the database.

Amazon DocumentDB provides you with a number of capabilities that help you back up and restore your data based on your use case. For more information, see Best Practices for Amazon DocumentDB. If you’re new to Amazon DocumentDB, see Getting Started with Amazon DocumentDB. If you’re planning to migrate to Amazon DocumentDB, see Migrating to Amazon DocumentDB.

 


About the Authors

Annalyn Ng is a Senior Data Scientist with AWS Professional Services, where she develops and deploys machine learning solutions for customers. Annalyn graduated with an MPhil from the University of Cambridge, and blogs about machine learning at algobeans.com. Her book, Numsense! Data Science for the Layman, has been translated into over five languages and is used as a reference text in top universities.

 

 

Brian Hess is a Senior Solution Architect Specialist for Amazon DocumentDB (with MongoDB compatibility) at AWS. He has been in the data and analytics space for over 20 years and has extensive experience with relational and NoSQL databases.

Read More

Creating Amazon SageMaker Studio domains and user profiles using AWS CloudFormation

Creating Amazon SageMaker Studio domains and user profiles using AWS CloudFormation

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all ML development steps required to build, train, tune, debug, deploy, and monitor models. In this post, we demonstrate how you can create a SageMaker Studio domain and user profile using AWS CloudFormation. AWS CloudFormation gives you an easy way to model a collection of related AWS and third-party resources, provision them quickly and consistently, and manage them throughout their lifecycle by treating infrastructure as code.

Because AWS CloudFormation isn’t natively integrated with SageMaker Studio at the time of this writing, we use AWS CloudFormation to provision two AWS Lambda functions and then invoke these functions to create, delete, and update the Studio domain and user profile. In the rest of this post, we walk through the Lambda function to create the Studio domain (the code for creating a Studio user profile works similarly) and then the CloudFormation template. All the code is available in the GitHub repo.

Lambda function for creating, deleting, and updating a Studio domain

In the Lambda function, the lambda_handler calls one of three functions, handle_create, handle_update, and handle_delete, to create, update, and delete the Studio domain, respectively. Because we invoke this function through an AWS CloudFormation custom resource, the custom resource request type is sent in the RequestType field from AWS CloudFormation. RequestType determines which function is called inside the lambda_handler function. For example, when AWS CloudFormation detects any changes in the Custom::StudioDomain section of our CloudFormation template, the RequestType is set to Update by AWS CloudFormation, and the handle_update function is called. The following is the lambda_handler code:

import logging

import cfnresponse  # helper module provided through the cfn-response Lambda layer
from botocore.exceptions import ClientError


def lambda_handler(event, context):
    try:
        # RequestType is set by AWS CloudFormation for the custom resource
        if event['RequestType'] == 'Create':
            handle_create(event, context)
        elif event['RequestType'] == 'Update':
            handle_update(event, context)
        elif event['RequestType'] == 'Delete':
            handle_delete(event, context)
    except ClientError as exception:
        logging.error(exception)
        cfnresponse.send(event, context, cfnresponse.FAILED,
                         {}, error=str(exception))

The three functions for creating, updating, and deleting the domain work similarly. For this post, we walk through the code responsible for creating a domain. When invoking the Lambda function through an AWS CloudFormation custom resource, we pass key parameters that help define our Studio domain via the custom resource Properties. We extract these parameters from the AWS CloudFormation event source in the Lambda function. In the handle_create function, parameters are read in from the event and passed on to the create_studio_domain function. See the following code for handle_create:

def handle_create(event, context):
    print("**Starting running the SageMaker workshop setup code")
    resource_config = event['ResourceProperties']
    print("**Creating studio domain")
    response_data = create_studio_domain(resource_config)
    cfnresponse.send(event, context, cfnresponse.SUCCESS,
                     {}, physicalResourceId=response_data['DomainArn'])

We use a boto3 SageMaker client to create Studio domains. For this post, we set the domain name, the VPC and subnets that Studio uses, and the SageMaker execution role for the Studio domain. After the create_domain API call is made, we check the creation status every 5 seconds. When creation is complete, we return the Amazon Resource Name (ARN) and the URL of the created domain. By default, Lambda allows a function to run for only 3 seconds before stopping it, so make sure that the timeout limit of your Lambda function is set appropriately. We set the timeout limit to 900 seconds. The following is the create_studio_domain code (the functions for deleting and updating domains are also implemented using boto3 and constructed in a similar fashion):

import logging
import time

import boto3

client = boto3.client('sagemaker')

def create_studio_domain(config):
    # Parameters are passed in through the custom resource Properties
    vpc_id = config['VPC']
    subnet_ids = config['SubnetIds']
    default_user_settings = config['DefaultUserSettings']
    domain_name = config['DomainName']

    response = client.create_domain(
        DomainName=domain_name,
        AuthMode='IAM',
        DefaultUserSettings=default_user_settings,
        SubnetIds=subnet_ids.split(','),
        VpcId=vpc_id
    )

    domain_id = response['DomainArn'].split('/')[-1]
    created = False
    while not created:
        # Poll every 5 seconds until the domain is in service
        response = client.describe_domain(DomainId=domain_id)
        time.sleep(5)
        if response['Status'] == 'InService':
            created = True

    logging.info("**SageMaker domain created successfully: %s", domain_id)
    return response

Finally, we zip the Python script, save it as domain_function.zip, and upload it to Amazon Simple Storage Service (Amazon S3).
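If you want to script the packaging step, a rough sketch with the zipfile module and boto3 could look like the following; the bucket name is a placeholder, and the file name inside the archive must match the handler configured on the Lambda function (lambda_function.py for the lambda_function.lambda_handler handler shown later):

import zipfile

import boto3

# Package the function code; the module name must match the configured handler
with zipfile.ZipFile('domain_function.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write('lambda_function.py')

# 'my-deployment-bucket' is a placeholder; use your own S3 bucket
boto3.client('s3').upload_file('domain_function.zip', 'my-deployment-bucket', 'domain_function.zip')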

The Lambda function used for creating a user profile is constructed similarly. For more information, see the UserProfile_function.py script in the GitHub repo.
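Conceptually, the core boto3 call in that script resembles the following sketch; this is not the exact code from the repo, and the configuration values are passed in through the custom resource Properties, as with the domain function:

import boto3

client = boto3.client('sagemaker')

def create_user_profile(config):
    # Conceptual sketch only; see UserProfile_function.py in the GitHub repo
    response = client.create_user_profile(
        DomainId=config['DomainId'],
        UserProfileName=config['UserProfileName'],
        UserSettings=config['UserSettings']
    )
    return response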

CloudFormation template

In the CloudFormation template, we create an execution role for Lambda, an execution role for SageMaker Studio, and the Lambda function using the code explained in the previous section. We invoke this function by specifying it as the target of a custom resource. For more information about invoking a Lambda function with AWS CloudFormation, see Using AWS Lambda with AWS CloudFormation.

Lambda execution role

This role gives our Lambda function the permission to create an Amazon CloudWatch Logs stream and write logs to CloudWatch. Because our functions create, delete, and update Studio domains and user profiles, we also grant this role the permissions to do so. See the following code:

  LambdaExecutionRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - "sts:AssumeRole"
      Path: /

  LambdaExecutionPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Path: /
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Sid: CloudWatchLogsPermissions
            Effect: Allow
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: !Sub "arn:${AWS::Partition}:logs:*:*:*"
          - Sid: SageMakerDomainPermission
            Effect: Allow
            Action:
              - sagemaker:CreateDomain
              - sagemaker:DescribeDomain
              - sagemaker:DeleteDomain
              - sagemaker:UpdateDomain
              - sagemaker:CreateUserProfile
              - sagemaker:UpdateUserProfile
              - sagemaker:DeleteUserProfile
              - sagemaker:DescribeUserProfile
            Resource:
              - !Sub "arn:${AWS::Partition}:sagemaker:*:*:domain/*"
              - !Sub "arn:${AWS::Partition}:sagemaker:*:*:user-profile/*"
          - Sid: SageMakerExecPassRole
            Effect: Allow
            Action:
              - iam:PassRole
            Resource: !GetAtt  SageMakerExecutionRole.Arn
      Roles:
        - !Ref LambdaExecutionRole

SageMaker execution role

The following SageMaker execution role is attached to Studio (for demonstration purposes, we grant this role the AmazonSageMakerFullAccess managed policy):

SageMakerExecutionRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - sagemaker.amazonaws.com
            Action:
              - "sts:AssumeRole"
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

Lambda function

The AWS::Lambda::Function resource creates a Lambda function. To create a function, we need a deployment package and an execution role. The deployment package contains our function code (function.zip). The execution role, which is the LambdaExecutionRole created in the previous step, grants the function permission to manage Studio domains and user profiles and to write logs to CloudWatch. We also add the CfnResponseLayer to our function’s execution environment. CfnResponseLayer enables the function to interact with an AWS CloudFormation custom resource; it contains a send method to send responses from Lambda to AWS CloudFormation. See the following code:

Resources:
...
  StudioDomainFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: lambda_function.lambda_handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        S3Bucket: !Ref S3Bucket
        S3Key: function.zip
        S3ObjectVersion: !Ref S3ObjectVersion
      Runtime: python3.8
      Timeout: 900
      Layers:
        - !Ref CfnResponseLayer
       
  CfnResponseLayer:
    Type: AWS::Lambda::LayerVersion
    Properties:
      CompatibleRuntimes:
        - python3.8
      Content:
        S3Bucket: !Ref S3Bucket
        S3Key: cfnResponse-layer.zip
      Description: cfn-response layer
      LayerName: cfn-response

Invoking the Lambda function using an AWS CloudFormation custom resource

Custom resources provide a way for you to write custom provisioning logic in a CloudFormation template and have AWS CloudFormation run it during a stack operation, such as when you create, update, or delete a stack. For more information, see Custom resources. We get the ARN of the Lambda function created in the previous step and pass it to AWS CloudFormation as our service token. This allows AWS CloudFormation to invoke the Lambda function. We pass the parameters required for creating, updating, and deleting our domain under Properties. See the following code:

StudioDomain:
    Type: Custom::StudioDomain
    Properties:
      ServiceToken: !GetAtt StudioDomainFunction.Arn
      VPC: !Ref VPCId
      SubnetIds: !Ref SubnetIds
      DomainName: "MyDomainName"
      DefaultUserSettings:
        ExecutionRole: !GetAtt SageMakerExecutionRole.Arn

In the same fashion, we invoke the Lambda function for creating a user profile:

UserProfile:
    Type: Custom::UserProfile
    Properties:
      ServiceToken: !GetAtt UserProfileFunction.Arn
      DomainId: !GetAtt StudioDomain.DomainId
      UserProfileName: !Ref UserProfileName
      UserSettings:
        ExecutionRole: !GetAtt SageMakerExecutionRole.Arn

Conclusion

In this post, we walked through the steps of creating, deleting, and updating SageMaker Studio domains using AWS CloudFormation and Lambda. The sample files are available in the GitHub repo. For information about creating a Studio domain inside a VPC, see Securing Amazon SageMaker Studio connectivity using a private VPC. For more information about SageMaker Studio, see Get Started with Amazon SageMaker Studio.


About the Authors

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

 

 

Joseph Jegan is a Cloud Application Architect at Amazon Web Services. He helps AWS customers use AWS services to design scalable and secure applications. He has over 20 years of software development experience prior to AWS, developing e-commerce platforms for large retail customers. He is based in the New York metro area and enjoys learning emerging cloud-native technologies.

 

 

David Ping is a Principal Machine Learning Solutions Architect and Sr. Manager of AI/ML Solutions Architecture at Amazon Web Services. He helps enterprise customers build and operate machine learning solutions on AWS. In his spare time, David enjoys hiking and reading the latest machine learning articles.

 

Read More