Introducing the AWS Panorama Device SDK: Scaling computer vision at the edge with AWS Panorama-enabled devices

Yesterday, at AWS re:Invent, we announced AWS Panorama, a new Appliance and Device SDK that allows organizations to bring computer vision to their on-premises cameras to make automated predictions with high accuracy and low latency. With AWS Panorama, companies can use compute power at the edge (without requiring video to be streamed to the cloud) to improve their operations by automating monitoring and visual inspection tasks like evaluating manufacturing quality, finding bottlenecks in industrial processes, and assessing worker safety within their facilities. The AWS Panorama Appliance is a hardware device that customers can install on their network to connect to existing cameras within their facilities and run computer vision models on multiple concurrent video streams.

This post covers how the AWS Panorama Device SDK helps device manufacturers build a broad portfolio of AWS Panorama-enabled devices. Scaling computer vision at the edge requires purpose-built devices that cater to specific customer needs without compromising security and performance. However, creating such a wide variety of computer vision edge devices is hard for device manufacturers because they need to do the following:

  • Integrate various standalone cloud services to create an end-to-end computer vision service that works with their edge device and provides a scalable ecosystem of applications for customers.
  • Invest in choosing and enabling silicon according to customers’ performance and cost requirements.

The AWS Panorama Device SDK addresses these challenges.

AWS Panorama Device SDK

The AWS Panorama Device SDK powers the AWS Panorama Appliance and allows device manufacturers to build AWS Panorama-enabled edge appliances and smart cameras. With the AWS Panorama Device SDK, device manufacturers can build edge computer vision devices for a wide array of use cases across industrial, manufacturing, worker safety, logistics, transportation, retail analytics, smart building, smart city, and other segments. In turn, customers have the flexibility to choose the AWS Panorama-enabled devices that meet their specific performance, design, and cost requirements.

The following diagram shows how the AWS Panorama Device SDK allows device manufacturers to build AWS Panorama-enabled edge appliances and smart cameras.

The AWS Panorama Device SDK includes the following:

  • Core controller – Manages AWS Panorama service orchestration between the cloud and edge device. The core controller provides integration to media processing and hardware accelerator on-device along with integration to AWS Panorama applications.
  • Silicon abstraction layer – Provides device manufacturers the ability to enable AWS Panorama across various silicon platforms and devices.

Edge devices integrated with the AWS Panorama Device SDK can offer all AWS Panorama service features, including camera stream onboarding and management, application management, application deployment, fleet management, and integration with event management business logic for real-time predictions, via the AWS Management Console. For example, a device integrated with the AWS Panorama Device SDK can automatically discover camera streams on the network, and organizations can review the discovered video feeds and name or group them on the console. Organizations can use the console to create applications by choosing a model and pairing it with business logic. After the application is deployed on the target device through the console, the AWS Panorama-enabled device runs the machine learning (ML) inference locally to enable high-accuracy and low-latency predictions.

To get device manufacturers started, the AWS Panorama Device SDK provides them with a device software stack for computer vision, sample code, APIs, and tools to enable and test their respective device for the AWS Panorama service. When ready, device manufacturers can work with the AWS Panorama team to finalize the integration of the AWS Panorama Device SDK and run certification tests ahead of their commercial launch.

Partnering with NVIDIA and Ambarella to enable the AWS Panorama Device SDK on their leading edge AI platforms

The AWS Panorama Device SDK will initially support the NVIDIA® Jetson product family and the Ambarella CV 2x product line; NVIDIA and Ambarella are the first partners building the ecosystem of hardware-accelerated edge AI/ML devices with AWS Panorama.

“Ambarella is in mass production today with CVflow AI vision processors for the enterprise, embedded, and automotive markets. We’re excited to partner with AWS to enable the AWS Panorama service on next-generation smart cameras and embedded systems for our customers. The ability to effortlessly deploy computer vision applications to Ambarella SoC-powered devices in a secure, optimized fashion is a powerful tool that makes it possible for our customers to rapidly bring the next generation of AI-enabled products to market.”

– Fermi Wang, CEO of Ambarella

“The world’s first computer created for AI, robotics, and edge computing, NVIDIA® Jetson AGX Xavier™ delivers massive computing performance to handle demanding vision and perception workloads at the edge. Our collaboration with AWS on the AWS Panorama Appliance powered by the NVIDIA Jetson platform accelerates time to market for enterprises and developers by providing a fully managed service to deploy computer vision from cloud to edge in an easily extensible and programmable manner.”

– Deepu Talla, Vice President and General Manager of Edge Computing at NVIDIA

Enabling edge appliances and smart cameras with the AWS Panorama Device SDK

Axis Communications, ADLINK Technology, Basler AG, Lenovo, STANLEY Security, and Vivotek will be using the AWS Panorama Device SDK to build AWS Panorama-enabled devices in 2021.

“We’re excited to collaborate to accelerate computer vision innovation with AWS Panorama and explore the advantages of the Axis Camera Application Platform (ACAP), our open application platform that offers users an expanded ecosystem, an accelerated development process, and ultimately more innovative, scalable, and reliable network solutions.”

– Johan Paulsson, CTO of Axis Communications AB

“The integration of AWS Panorama on ADLINK’s industrial vision systems makes for truly plug-and-play computer vision at the edge. In 2021, we will be making AWS Panorama-powered ADLINK NEON cameras powered by NVIDIA Jetson NX Xavier available to customers to drive high-quality computer vision powered outcomes much, much faster. This allows ADLINK to deliver AI/ML digital experiments and time to value for our customers more rapidly across logistics, manufacturing, energy, and utilities use cases.”

– Elizabeth Campbell, CEO of ADLINK USA

“Basler is looking forward to continuing our technology collaborations in machine learning with AWS in 2021. We will be expanding our solution portfolio to include AWS Panorama to allow customers to develop AI-based IoT applications on an optimized vision system from the edge to the cloud. We will be integrating AWS Panorama with our AI Vision Solution Kit, reducing the complexity and need for additional expertise in embedded hardware and software components, providing developers with a new and efficient approach to rapid prototyping, and enabling them to leverage the ecosystem of AWS Panorama computer vision application providers and systems integrators.”

– Arndt Bake, Chief Marketing Officer at Basler AG

“Going beyond traditional security applications, VIVOTEK developed its AI-driven video analytics from smart edge cameras to software applications. We are excited that we will be able to offer enterprises advanced flexibility and functionality through the seamless integration with AWS Panorama. What makes this joint force more powerful is the sufficient machine learning models that our solutions can benefit from AWS Cloud. We look forward to a long-term collaboration with AWS.”

– Joe Wu, Chief Technology Officer at VIVOTEK Inc.

Next steps

Join now to become a device partner and build edge computer vision devices with the AWS Panorama Device SDK.

About the Authors

As a Product Manager on the AWS AI Devices team, Shardul Brahmbhatt currently focuses on AWS Panorama. He is deeply passionate about building products that drive adoption of AI at the Edge.

Kamal Garg leads strategic hardware partnerships for AWS AI Devices. He is deeply passionate about incubating technology ecosystems that optimize the customer and developer experience. Over the last 5+ years, Kamal has developed strategic relationships with leading silicon and connected device manufacturers for next-generation services like Alexa, A9 Visual Search, Prime Video, AWS IoT, SageMaker Neo, SageMaker Edge, and AWS Panorama.

Configuring autoscaling inference endpoints in Amazon SageMaker

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to quickly build, train, and deploy machine learning (ML) models at scale. Amazon SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models. You can deploy your ML models with one click for low-latency, real-time inference on fully managed inference endpoints. Autoscaling is an out-of-the-box feature that monitors your workloads and dynamically adjusts capacity to maintain steady and predictable performance at the lowest possible cost. When the workload increases, autoscaling brings more instances online. When the workload decreases, autoscaling removes unnecessary instances, helping you reduce your compute cost.

The following diagram is a sample architecture that showcases how a model is invoked for inference using an Amazon SageMaker endpoint.

Amazon SageMaker automatically attempts to distribute your instances across Availability Zones. So, we strongly recommend that you deploy multiple instances for each production endpoint for high availability. If you’re using a VPC, configure at least two subnets in different Availability Zones so Amazon SageMaker can distribute your instances across those Availability Zones.

Amazon SageMaker supports four different ways to implement horizontal scaling of Amazon SageMaker endpoints. You can configure some of these policies using the Amazon SageMaker console, the AWS Command Line Interface (AWS CLI), or the AWS SDK’s Application Auto Scaling API for the advanced options. In this post, we showcase how to configure using the boto3 SDK for Python and outline different scaling policies and patterns. 

Prerequisites

This post assumes that you have a functional Amazon SageMaker endpoint deployed. Models are hosted within an Amazon SageMaker endpoint; you can have multiple model versions being served via the same endpoint. Each model is referred to as a production variant.

If you’re new to Amazon SageMaker and have not created an endpoint yet, complete the steps in Identifying bird species on the edge using the Amazon SageMaker built-in Object Detection algorithm and AWS DeepLens until the section Testing the model to develop and host an object detection model.

If you want to get started directly with this post, you can also fetch a model from the MXNet model zoo. For example, if you plan to use ResidualNet152, you need the model definition and the model weights inside a tarball. You can also create custom models that can be hosted as an Amazon SageMaker endpoint. For instructions on building a tarball with Gluon and Apache MXNet, see Deploying custom models built with Gluon and Apache MXNet on Amazon SageMaker.
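As a minimal sketch of packaging such a pretrained model yourself (the artifact names are hypothetical placeholders for your model definition and weights), you can build the tarball with Python and upload it to the bucket referenced later in this post:

import tarfile
import boto3

# Hypothetical artifact names; replace with your model definition and weights
model_files = ['resnet152-symbol.json', 'resnet152-0000.params']

with tarfile.open('model.tar.gz', 'w:gz') as tar:
    for artifact in model_files:
        tar.add(artifact)

# Upload the tarball so it can be passed as model_data when creating the MXNetModel
boto3.client('s3').upload_file('model.tar.gz', 'my_bucket', 'pretrained_model/model.tar.gz')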

Configuring autoscaling

The following are the high-level steps for creating a model and applying a scaling policy:

  1. Use Amazon SageMaker to create a model or bring a custom model.
  2. Deploy the model.

If you use the MXNet estimator to train the model, you can call deploy to create an Amazon SageMaker endpoint:

# Train my estimator
mxnet_estimator = MXNet('train.py',
                        framework_version='1.6.0',
                        py_version='py3',
                        instance_type='ml.p2.xlarge',
                        instance_count=1)

mxnet_estimator.fit('s3://my_bucket/my_training_data/')

# Deploy my estimator to an Amazon SageMaker endpoint and get a Predictor
# initial_instance_count=1 is not recommended for production use; use it only for experimentation.
predictor = mxnet_estimator.deploy(instance_type='ml.m5.xlarge',
                                   initial_instance_count=1)

If you use a pretrained model like ResidualNet152, you can create an MXNetModel object and call deploy to create the Amazon SageMaker endpoint:

mxnet_model = MXNetModel(model_data='s3://my_bucket/pretrained_model/model.tar.gz',
                         role=role,
                         entry_point='inference.py',
                         framework_version='1.6.0',
                         py_version='py3')
predictor = mxnet_model.deploy(instance_type='ml.m5.xlarge',
                               initial_instance_count=1)

  3. Create a scaling policy and apply the scaling policy to the endpoint. The following section discusses your scaling policy options.

Scaling options

You can define minimum, desired, and maximum number of instances per endpoint and, based on the autoscaling configurations, instances are managed dynamically. The following diagram illustrates this architecture.

To scale the deployed Amazon SageMaker endpoint, first fetch its details:

import pprint
import boto3
from sagemaker import get_execution_role
import sagemaker
import json

pp = pprint.PrettyPrinter(indent=4, depth=4)
role = get_execution_role()
sagemaker_client = boto3.Session().client(service_name='sagemaker')
endpoint_name = 'name-of-the-endpoint'
response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
pp.pprint(response)

#Let us define a client to play with autoscaling options
client = boto3.client('application-autoscaling') # Common class representing Application Auto Scaling for SageMaker amongst other services

Simple scaling or TargetTrackingScaling

Use this option when you want to scale based on a specific Amazon CloudWatch metric. You can do this by choosing a specific metric and setting threshold values. The recommended metrics for this option are average CPUUtilization or SageMakerVariantInvocationsPerInstance.

SageMakerVariantInvocationsPerInstance is the average number of times per minute that each instance for a variant is invoked. CPUUtilization is the sum of each individual CPU core’s utilization, so its value can exceed 100% on multi-core instances.

The following code snippets show how to scale using these metrics. You can also push custom metrics to CloudWatch or use other metrics. For more information, see Monitor Amazon SageMaker with Amazon CloudWatch.

resource_id='endpoint/' + endpoint_name + '/variant/' + 'AllTraffic' # This is the format in which application autoscaling references the endpoint

response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=2
)

#Example 1 - SageMakerVariantInvocationsPerInstance Metric
response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker', # The namespace of the AWS service that provides the resource. 
    ResourceId=resource_id, # Endpoint name 
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', # SageMaker supports only Instance Count
    PolicyType='TargetTrackingScaling', # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 10.0, # The target value for the metric. - here the metric is - SageMakerVariantInvocationsPerInstance
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance', # is the average number of times per minute that each instance for a variant is invoked. 
        },
        'ScaleInCooldown': 600, # The cooldown period helps you prevent your Auto Scaling group from launching or terminating 
                                # additional instances before the effects of previous activities are visible. 
                                # You can configure the length of time based on your instance startup time or other application needs.
                                # ScaleInCooldown - The amount of time, in seconds, after a scale in activity completes before another scale in activity can start. 
        'ScaleOutCooldown': 300 # ScaleOutCooldown - The amount of time, in seconds, after a scale out activity completes before another scale out activity can start.
        
        # 'DisableScaleIn': True|False - Indicates whether scale in by the target tracking policy is disabled. 
                            # If the value is true , scale in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    }
)

#Example 2 - CPUUtilization metric
response = client.put_scaling_policy(
    PolicyName='CPUUtil-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 90.0,
        'CustomizedMetricSpecification':
        {
            'MetricName': 'CPUUtilization',
            'Namespace': '/aws/sagemaker/Endpoints',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': endpoint_name },
                {'Name': 'VariantName','Value': 'AllTraffic'}
            ],
            'Statistic': 'Average', # Possible - 'Statistic': 'Average'|'Minimum'|'Maximum'|'SampleCount'|'Sum'
            'Unit': 'Percent'
        },
        'ScaleInCooldown': 600,
        'ScaleOutCooldown': 300
    }
)

With the scale-in cooldown period, the intention is to scale in conservatively to protect your application’s availability, so scale-in activities are blocked until the cooldown period has expired. With the scale-out cooldown period, the intention is to scale out continuously (but not excessively). After Application Auto Scaling successfully scales out using a target tracking scaling policy, it starts to calculate the cooldown time.

Step scaling

This is an advanced type of scaling where you define additional policies to dynamically adjust the number of instances to scale based on size of the alarm breach. This helps you configure a more aggressive response when demand reaches a certain level. The following code is an example of a step scaling policy based on the OverheadLatency metric:

#Example 3 - OverheadLatency metric and StepScaling Policy
response = client.put_scaling_policy(
    PolicyName='OverheadLatency-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='StepScaling', 
    StepScalingPolicyConfiguration={
        'AdjustmentType': 'ChangeInCapacity', # 'PercentChangeInCapacity'|'ExactCapacity' Specifies whether the ScalingAdjustment value in a StepAdjustment 
                                              # is an absolute number or a percentage of the current capacity.
        'StepAdjustments': [ # A set of adjustments that enable you to scale based on the size of the alarm breach.
            {
                'MetricIntervalLowerBound': 0.0, # The lower bound for the difference between the alarm threshold and the CloudWatch metric.
                 # 'MetricIntervalUpperBound': 100.0, # The upper bound for the difference between the alarm threshold and the CloudWatch metric.
                'ScalingAdjustment': 1 # The amount by which to scale, based on the specified adjustment type. 
                                       # A positive value adds to the current capacity while a negative number removes from the current capacity.
            },
        ],
        # 'MinAdjustmentMagnitude': 1, # The minimum number of instances to scale. - only for 'PercentChangeInCapacity'
        'Cooldown': 120,
        'MetricAggregationType': 'Average', # 'Minimum'|'Maximum'
    }
)

Scheduled scaling

You can use this option when you know that the demand follows a particular schedule in the day, week, month, or year. You can specify a one-time schedule, a recurring schedule, or a cron expression, along with start and end times, which form the boundaries of when the autoscaling action starts and stops. See the following code:

#Example 4 - Scaling based on a certain schedule.
response = client.put_scheduled_action(
    ServiceNamespace='sagemaker',
    Schedule='at(2020-10-07T06:20:00)', # yyyy-mm-ddThh:mm:ss You can use one-time schedule, cron, or rate
    ScheduledActionName='ScheduledScalingTest',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    #StartTime=datetime(2020, 10, 7), #Start date and time for when the schedule should begin
    #EndTime=datetime(2020, 10, 8), #End date and time for when the recurring schedule should end
    ScalableTargetAction={
        'MinCapacity': 2,
        'MaxCapacity': 3
    }
)

On-demand scaling

Use this option only when you want to increase or decrease the number of instances manually. This updates the endpoint weights and capacities without defining a trigger. See the following code:

# UpdateEndpointWeightsAndCapacities is an Amazon SageMaker API, so use the SageMaker client
response = sagemaker_client.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            'VariantName': 'string',        # name of the production variant to update
            'DesiredWeight': ...,           # relative traffic weight for this variant
            'DesiredInstanceCount': 123     # desired number of instances for this variant
        }
    ]
)

Comparing scaling methods

Each of these methods, when successfully applied, results in instances being added to (or removed from) an already deployed Amazon SageMaker endpoint. When you make a request to update your endpoint with autoscaling configurations, the status of the endpoint moves to Updating. While the endpoint is in this state, other update operations on this endpoint fail. You can monitor the state by using the DescribeEndpoint API. There is no traffic interruption while instances are being added to or removed from an endpoint.
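For example, a minimal sketch that waits for an update to finish, using the sagemaker_client created earlier, might look like the following:

import time

# Poll the endpoint until it leaves the Updating state
status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
while status == 'Updating':
    time.sleep(30)
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
print('Endpoint status:', status)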

When creating an endpoint, we specify initial_instance_count; this value is only used at endpoint creation time. It’s ignored afterward, and autoscaling or on-demand scaling uses the change in DesiredInstanceCount to set the instance count behind an endpoint.

Finally, if you do use UpdateEndpoint to deploy a new EndpointConfig to an endpoint, to retain the current number of instances, you should set RetainAllVariantProperties to true.
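For example, a hedged sketch of such a call, assuming a hypothetical new endpoint configuration named new-endpoint-config, looks like this:

# Deploy a new endpoint configuration while keeping variant properties
# (such as the instance count already set by autoscaling)
response = sagemaker_client.update_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName='new-endpoint-config',   # hypothetical new EndpointConfig
    RetainAllVariantProperties=True
)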

Considerations for designing an autoscaling policy to scale your ML workload

You should consider the following when designing an efficient autoscaling policy to minimize traffic interruptions and be cost-efficient:

  • Traffic patterns and metrics – Consider the traffic patterns that drive your inference requests, then determine which metrics those patterns affect the most, or which metric the inference logic is most sensitive to (such as GPUUtilization, CPUUtilization, MemoryUtilization, or Invocations per instance). Is the inference logic GPU bound, memory bound, or CPU bound?
  • Custom metrics – If a custom metric needs to be defined based on the problem domain, you can deploy a custom metrics collector. With a custom metrics collector, you also have the option of fine-tuning the granularity of metric collection and publishing (a sketch of publishing a custom metric follows this list).
  • Threshold – After we decide on our metrics, we need to decide on the threshold. In other words, how to detect the increase in load, based on the preceding metric, within a time window that allows for the addition of an instance and for your inference logic to be ready to serve inference. This consideration also governs the measure of the scale-in and scale-out cooldown period.
  • Autoscaling – Depending on the application logic’s tolerance to autoscaling, strike a balance between over-provisioning and autoscaling. If you select a specialized instance such as AWS Inferentia for your workload, the throughput gains might reduce the need to autoscale to a certain degree.
  • Horizontal scaling – When we have these estimations, it’s time to consider one or more strategies that we enlist in this post to deploy for horizontal scaling. Some work particularly well in certain situations. For example, we strongly recommend that you use a target tracking scaling policy to scale on a metric such as average CPU utilization or the SageMakerVariantInvocationsPerInstance metric. But a good guideline is to empirically derive an apt scaling policy based on your particular workload and above factors. You can start with a simple target tracking scaling policy, and you still have the option to use step scaling as an additional policy for a more advanced configuration. For example, you can configure a more aggressive response when demand reaches a certain level.
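As referenced in the custom metrics consideration above, the following is a minimal sketch of publishing a custom metric to Amazon CloudWatch with boto3, reusing the endpoint_name variable from earlier; the namespace and metric name shown are hypothetical:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish a hypothetical custom metric that a target tracking policy could use
cloudwatch.put_metric_data(
    Namespace='MyApp/Inference',                 # hypothetical namespace
    MetricData=[{
        'MetricName': 'QueueDepthPerInstance',   # hypothetical metric name
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': 'AllTraffic'}
        ],
        'Value': 42.0,
        'Unit': 'Count'
    }]
)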

Retrieving your scaling activity log

When you want to see all the scaling policies attached to your Amazon SageMaker endpoint, you can use describe_scaling_policies, which helps you understand and debug the different scaling configurations’ behavior:

response = client.describe_scaling_policies(
    ServiceNamespace='sagemaker'
)

for i in response['ScalingPolicies']:
    print('')
    pp.pprint(i['PolicyName'])
    print('')
    if('TargetTrackingScalingPolicyConfiguration' in i):
        pp.pprint(i['TargetTrackingScalingPolicyConfiguration']) 
    else:
        pp.pprint(i['StepScalingPolicyConfiguration'])
    print('')
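To retrieve the scaling activity history itself (the individual scale-out and scale-in events), you can also call describe_scaling_activities for the same resource:

# List recent scaling activities for the endpoint variant
response = client.describe_scaling_activities(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id
)

for activity in response['ScalingActivities']:
    print(activity['StartTime'], activity['StatusCode'], activity['Description'])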

Conclusion

For models facing unpredictable traffic, Amazon SageMaker autoscaling helps economically respond to the demand and removes the undifferentiated heavy lifting of managing the inference infrastructure. One of the best practices of model deployment is to perform load testing. Determine the appropriate thresholds for your scaling policies and choose metrics based on load testing. For more information about load testing, see Amazon EC2 Testing Policy and Load test and optimize an Amazon SageMaker endpoint using automatic scaling.

About the Authors

Chaitanya Hazarey is a Machine Learning Solutions Architect with the Amazon SageMaker Product Management team. He focuses on helping customers design and deploy end-to-end ML pipelines in production on AWS. He has set up multiple such workflows around problems in the areas of NLP, Computer Vision, Recommender Systems, and AutoML Pipelines.

Pavan Kumar Sunder is a Senior R&D Engineer with Amazon Web Services. He provides technical guidance and helps customers accelerate their ability to innovate through showing the art of the possible on AWS. He has built multiple prototypes around AI/ML, IoT, and Robotics for our customers.

Rama Thamman is a Software Development Manager with the AI Platforms team, leading the ML Migrations team.

What’s around the turn in 2021? AWS DeepRacer League announces new divisions, rewards, and community leagues

AWS DeepRacer allows you to get hands on with machine learning (ML) through a fully autonomous 1/18th scale race car driven by reinforcement learning, a 3D racing simulator on the AWS DeepRacer console, a global racing league, and hundreds of customer-initiated community races.

The action is already underway at the Championship Cup at AWS re:Invent 2020, with the Wildcard round of racing coming to a close and Round 1 Knockouts officially underway, streaming live on Twitch (updated schedule coming soon). Check out the Championship Cup blog for up-to-date news and schedule information.

We’ve got some exciting announcements about the upcoming 2021 racing season. The AWS DeepRacer League is introducing new skill-based Open and Pro racing divisions in March 2021, with five times as many opportunities for racers to win prizes, and recognition for participation and performance. Another exciting new feature coming in 2021 is the expansion of community races into community leagues, enabling organizations and racing enthusiasts to set up their own racing leagues and race with their friends over multiple races. But first, we’ve got a great deal for racers in December!

December ‘tis the season for racing

Start your engines and hit the tracks in December for less! We’re excited to announce that from December 1 through December 31, 2020, we’re reducing the cost of training and evaluation for AWS DeepRacer by over 70% (from $3.50 to $1 per hour), covering the duration of AWS re:Invent 2020. It’s a great time to learn, compete, and get excited for what’s coming up next year.

Introducing new racing divisions and digital rewards in 2021

The AWS DeepRacer League’s 2021 season will introduce new skill-based Open and Pro racing divisions. The new racing divisions include five times as many opportunities for racers of all skill levels to win prizes and recognition for participation and performance. AWS DeepRacer League’s new Open and Pro divisions enable all participants to win prizes by splitting the current Virtual Circuit monthly leaderboard into two skill-based divisions to level the competition, each with their own prizes, while maintaining a high level of competitiveness in the League.

The Open division is where all racers begin their ML journey; it recognizes participation each month with new vehicle customizations and other rewards. Racers can earn their way into the Pro division each month by finishing in the top 10 percent of time trial results. The Pro division celebrates competition and accomplishment with DeepRacer merchandise, DeepRacer Evo cars, exclusive prizes, and qualification to the Championship Cup at re:Invent 2021.

Similar to previous seasons, winners of the Pro division’s monthly race automatically qualify for the Championship Cup with an all-expenses paid trip to re:Invent 2021 for a chance to lift the 2021 Cup, receive a $10,000 machine learning education scholarship, and an F1 Grand Prix experience for two. Starting with the 2021 Virtual Circuit Pre-Season, racers can win multiple prizes each month—including dozens of customizations to personalize your car in the Virtual Circuit (such as car bodies, paint colors, and wraps), several official AWS DeepRacer branded items (such as racing jackets, hats, and shirts), and hundreds of DeepRacer Evo cars, complete with stereo cameras and LiDAR.

Participating in Open and Pro Division races can earn you new digital rewards, like this new racing skin for your virtual racing fun!

“The DeepRacer League has been a fantastic way for thousands of people to test out their newly learnt machine learning skills.“ says AWS Hero and AWS Machine Learning Community Founder, Lyndon Leggate. “Everyone’s competitive spirit quickly shows through and the DeepRacer community has seen tremendous engagement from members keen to learn from each other, refine their skills, and move up the ranks. The new 2021 league format looks incredible and the Open and Pro divisions bring an interesting new dimension to racing! It’s even more fantastic that everyone will get more chances for their efforts to be rewarded, regardless of how long they’ve been racing. This will make it much more engaging for everyone and I can’t wait to take part!”

Empowering organizations to create their own leagues with multi-race seasons and admin tools

With events flourishing around the globe virtually, we’ll soon offer racers the ability to not only create their own race, but also create multiple races, similar to the monthly Virtual Circuit, with community leagues. Race organizers will be able to set up a whole season, decide on qualification rounds, and use bracket elimination for head-to-head race finalists.

Organizations and individuals can race with their friends and colleagues with Community races.

The new series of tools enables race organizers to run their own events, including access to racer information, model training details, and logs to engage with their audience and develop learnings from participants. Over the past year, organizations such as Capital One, JPMC, Accenture, and Moody’s have already successfully managed internal AWS DeepRacer leagues. Now, even more organizations, schools, companies, and private groups of friends can use AWS DeepRacer and the community leagues as a fun learning tool to actively develop their ML skills.

“We have observed huge participation and interest in AWS DeepRacer events,” says Chris Thomas, Managing Director of Technology & Innovation at Moody’s Accelerator. “They create opportunities for employees to challenge themselves, collaborate with colleagues, and enhance ML skills. We view this as a success from the tech side with the additional benefit of growing our innovation culture.“

AWS re:Invent is a great time to learn more about ML

As you get ready for what’s in store for 2021, don’t forget that registration is still open for re:Invent 2020. Be sure to check out our three informative virtual sessions to help you along your ML journey with AWS DeepRacer: “Get rolling with Machine Learning on AWS DeepRacer,” “Shift your Machine Learning model into overdrive with AWS DeepRacer analysis tools,” and “Replicate AWS DeepRacer architecture to master the track with SageMaker Notebooks.” You can take all the courses during re:Invent or learn at your own speed. It’s up to you. Register for re:Invent today and find out more about when these sessions are available to watch live or on-demand.

Start training today and get ready for the 2021 season

Remember that the cost of training and evaluation for AWS DeepRacer is reduced by over 70% (from $3.50 to $1 per hour) for the duration of AWS re:Invent 2020. Take advantage of these low rates today and get ready for the AWS DeepRacer League 2021 season. Now is a great time to get rolling with ML and AWS DeepRacer. Watch this page for schedule and video updates all through AWS re:Invent 2020. Let’s get ready to race!


About the Author

Dan McCorriston is a Senior Product Marketing Manager for AWS Machine Learning. He is passionate about technology, collaborating with developers, and creating new methods of expanding technology education. Out of the office he likes to hike, cook and spend time with his family.

Private package installation in Amazon SageMaker running in internet-free mode

Amazon SageMaker Studio notebooks and Amazon SageMaker notebook instances are internet-enabled by default. However, many regulated industries, such as financial services, healthcare, telecommunications, and others, require that network traffic traverses their own Amazon Virtual Private Cloud (Amazon VPC) to restrict and control which traffic can go through the public internet. Although you can disable direct internet access to SageMaker Studio notebooks and notebook instances, you need to ensure that your data scientists can still gain access to popular packages. Therefore, you may choose to build your own isolated dev environments that contain your choice of packages and kernels.

In this post, we learn how to set up such an environment for Amazon SageMaker notebook instances and SageMaker Studio. We also describe how to integrate this environment with AWS CodeArtifact, which is a fully managed artifact repository that makes it easy for organizations of any size to securely store, publish, and share software packages used in your software development process.

Solution overview

In this post, we cover the following steps: 

  1. Set up Amazon SageMaker for internet-free mode.
  2. Set up the Conda repository using Amazon Simple Storage Service (Amazon S3). You create a bucket that hosts your Conda channels.
  3. Set up the Python Package Index (PyPI) repository using CodeArtifact. You create a repository and set up AWS PrivateLink endpoints for CodeArtifact.
  4. Build an isolated dev environment with Amazon SageMaker notebook instances. In this step, you use the lifecycle configuration feature to build a custom Conda environment and configure your PyPI client.
  5. Install packages in SageMaker Studio notebooks. In this last step, you can create a custom Amazon SageMaker image and install the packages through Conda or pip client.

Setting up Amazon SageMaker for internet-free mode

We assume that you have already set up a VPC that lets you provision a private, isolated section of the AWS Cloud where you can launch AWS resources in a virtual network. You use it to host Amazon SageMaker and other components of your data science environment. For more information about building secure environments or well-architected pillars, see the following whitepaper, Financial Services Industry Lens: AWS Well-Architected Framework.

Creating an Amazon SageMaker notebook instance

You can disable internet access for Amazon SageMaker notebooks, and also associate them to your secure VPC environment, which allows you to apply network-level control, such as access to resources through security groups, or to control ingress and egress traffic of data.

  1. On the Amazon SageMaker console, choose Notebook instances in the navigation pane.
  2. Choose Create notebook instance.
  3. For IAM role, choose your role.
  4. For VPC, choose your VPC.
  5. For Subnet, choose your subnet.
  6. For Security group(s), choose your security group.
  7. For Direct internet access, select Disable — use VPC only.
  8. Choose Create notebook instance.

  9. Connect to your notebook instance from your VPC instead of connecting over the public internet.

Amazon SageMaker notebook instances support VPC interface endpoints. When you use a VPC interface endpoint, communication between your VPC and the notebook instance is conducted entirely and securely within the AWS network instead of the public internet. For instructions, see Creating an interface endpoint.
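The following is a minimal boto3 sketch of creating such an interface endpoint; the VPC, subnet, and security group IDs are placeholders, and you should adjust the Region in the service name as needed:

import boto3

ec2 = boto3.client('ec2')

# Interface endpoint for reaching notebook instances without traversing the public internet
ec2.create_vpc_endpoint(
    VpcId='vpc-xxxxxxxx',                            # placeholder VPC ID
    VpcEndpointType='Interface',
    ServiceName='aws.sagemaker.us-east-1.notebook',  # adjust the Region as needed
    SubnetIds=['subnet-xxxxxxxx'],                   # placeholder subnet ID
    SecurityGroupIds=['sg-xxxxxxxx'],                # placeholder security group ID
    PrivateDnsEnabled=True
)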

Setting up SageMaker Studio

Similar to Amazon SageMaker notebook instances, you can launch SageMaker Studio in a VPC of your choice, and also disable direct internet access to add an additional layer of security.

  1. On the Amazon SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
  2. Choose Standard setup.
  3. To disable direct internet access, in the Network section, select the VPC only network access type for when you onboard to Studio or call the CreateDomain API.

Doing so prevents Amazon SageMaker from providing internet access to your SageMaker Studio notebooks.

  1. Create interface endpoints (via AWS PrivateLink) to access the following (and other AWS services you may require):
    1. Amazon SageMaker API
    2. Amazon SageMaker runtime
    3. Amazon S3
    4. AWS Security Token Service (AWS STS)
    5. Amazon CloudWatch

Setting up a custom Conda repository using Amazon S3

Amazon SageMaker notebooks come with multiple environments already installed. The different Jupyter kernels in Amazon SageMaker notebooks are separate Conda environments. If you want to use an external library in a specific kernel, you can install the library in the environment for that kernel. This is typically done using conda install. When you use a conda command to install a package, Conda environment searches a set of default channels, which are usually online or remote channels (URLs) that host the Conda packages. However, because we assume the notebook instances don’t have internet access, we modify those Conda channel paths to a private repository where our packages are stored.

  1. To build such a custom channel, create a bucket in Amazon S3.
  2. Copy the packages into the bucket.

These packages can be either approved packages within the organization or custom packages built using conda build. They need to be indexed periodically or as soon as there is an update. The methods for indexing packages are out of the scope of this post.

Because we set up the notebook to not allow direct internet access, the notebook can’t connect to the S3 buckets that contain the channels unless you create a VPC endpoint.

  1. Create an Amazon S3 VPC endpoint to send the traffic through the VPC instead of the public internet.

By creating a VPC endpoint, you allow your notebook instance to access the bucket where you stored the channels and its packages.

  1. We recommend that you also create a custom resource-based bucket policy that allows only requests from your private VPC to access your S3 buckets (a sketch of such a policy follows this list). For instructions, see Endpoints for Amazon S3.
  2. Replace the default channels of the Conda environment in your Amazon SageMaker notebooks with your custom channel (we do that in the next step when we build the isolated dev environment):
    # remove default channel from the .condarc
    conda config --remove channels 'defaults'
    # add the conda channels to the .condarc file
    conda config --add channels 's3://user-conda-repository/main/'
    conda config --add channels 's3://user-conda-repository/condaforge/'
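As mentioned in the first step above, the following is a minimal sketch of such a bucket policy applied with boto3. The bucket name user-conda-repository reuses the channel examples above and the VPC endpoint ID is a placeholder; be careful with Deny policies like this one, because they also block access from anywhere outside the VPC endpoint, including the console.

import json
import boto3

bucket_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Sid': 'AllowAccessOnlyFromMyVpcEndpoint',
        'Effect': 'Deny',
        'Principal': '*',
        'Action': 's3:*',
        'Resource': ['arn:aws:s3:::user-conda-repository',
                     'arn:aws:s3:::user-conda-repository/*'],
        'Condition': {'StringNotEquals': {'aws:sourceVpce': 'vpce-xxxxxxxx'}}  # placeholder endpoint ID
    }]
}

boto3.client('s3').put_bucket_policy(
    Bucket='user-conda-repository',
    Policy=json.dumps(bucket_policy)
)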

Setting up a custom PyPI repository using CodeArtifact

Data scientists typically use package managers such as pip, maven, npm, and others to install packages into their environments. By default, when you use pip to install a package, it downloads the package from the public PyPI repository. To secure your environment, you can use private package management tools either on premises, such as Artifactory or Nexus, or on AWS, such as CodeArtifact. This lets you allow access only to approved packages and perform safety checks. Alternatively, you may choose to use a private PyPI mirror set up on Amazon Elastic Container Service (Amazon ECS) or AWS Fargate to mirror the public PyPI repository in your private environment. For more information on this approach, see Building Secure Environments.

If you want to use pip to install Python packages, you can use CodeArtifact to control access to and validate the safety of the Python packages. CodeArtifact is a managed artifact repository service to help developers and organizations securely store and share the software packages used in your development, build, and deployment processes. The CodeArtifact integration with AWS Identity and Access Management (IAM), support for AWS CloudTrail, and encryption with AWS Key Management Service (AWS KMS) gives you visibility and the ability to control who has access to the packages.

You can configure CodeArtifact to fetch software packages from public repositories such as PyPI. PyPI helps you find and install software developed and shared by the Python community. When you pull a package from PyPI, CodeArtifact automatically downloads and stores application dependencies from the public repositories, so recent versions are always available to you.

Creating a repository for PyPI

You can create a repository using the CodeArtifact console or the AWS Command Line Interface (AWS CLI). Each repository is associated with the AWS account that you use when you create it. The following screenshot shows the view of choosing your AWS account on the CodeArtifact console.

A repository can have one or more CodeArtifact repositories associated with it as upstream repositories. This serves two needs.

First, it allows a package manager client to access the packages contained in more than one repository using a single URL endpoint.

Second, when you create a repository, it doesn’t contain any packages. If an upstream repository has an external connection to a public repository, the repositories that are downstream from it can pull packages from that public repository. For example, the repository my-shared-python-repository has an upstream repository named pypi-store, which acts as an intermediate repository that connects your repository to an external connection (the public PyPI repository). In this case, a package manager that is connected to my-shared-python-repository can pull packages from the public PyPI repository. The following screenshot shows this package flow.

For instructions on creating a CodeArtifact repository, see Software Package Management with AWS CodeArtifact.
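If you prefer to script this setup, the following is a minimal boto3 sketch that reuses the example names my-org, pypi-store, and my-shared-python-repository from above; adjust them for your environment:

import boto3

codeartifact = boto3.client('codeartifact')

# Create the domain that owns the repositories
codeartifact.create_domain(domain='my-org')

# Intermediate repository with an external connection to the public PyPI repository
codeartifact.create_repository(domain='my-org', repository='pypi-store')
codeartifact.associate_external_connection(
    domain='my-org',
    repository='pypi-store',
    externalConnection='public:pypi'
)

# Shared repository that pulls packages through pypi-store as its upstream
codeartifact.create_repository(
    domain='my-org',
    repository='my-shared-python-repository',
    upstreams=[{'repositoryName': 'pypi-store'}]
)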

Because we disable internet access for the Amazon SageMaker notebooks, in the next section, we set up AWS PrivateLink endpoints to make sure all the traffic for installing the package in the notebooks traverses through the VPC.

Setting up AWS PrivateLink endpoints for CodeArtifact

You can configure CodeArtifact to use an interface VPC endpoint to improve the security of your VPC. When you use an interface VPC endpoint, you don’t need an internet gateway, NAT device, or virtual private gateway. To create VPC endpoints for CodeArtifact, you can use the AWS CLI or Amazon VPC console. For this post, we use the Amazon Elastic Compute Cloud (Amazon EC2) create-vpc-endpoint AWS CLI command. The following two VPC endpoints are required so that all requests to CodeArtifact are in the AWS network.

The following command creates an endpoint to access CodeArtifact repositories:

aws ec2 create-vpc-endpoint --vpc-id vpcid --vpc-endpoint-type Interface 
  --service-name com.amazonaws.region.codeartifact.api --subnet-ids subnetid 
  --security-group-ids groupid

The following command creates an endpoint to access package managers and build tools:

aws ec2 create-vpc-endpoint --vpc-id vpcid --vpc-endpoint-type Interface 
  --service-name com.amazonaws.region.codeartifact.repositories --subnet-ids subnetid 
  --security-group-ids groupid --private-dns-enabled

CodeArtifact uses Amazon S3 to store package assets. To pull packages from CodeArtifact, you must create a gateway endpoint for Amazon S3. See the following code:

aws ec2 create-vpc-endpoint --vpc-id vpcid --service-name com.amazonaws.region.s3 
  --route-table-ids routetableid

Building your dev environment

Amazon SageMaker periodically updates the Python and dependency versions in the environments installed on the Amazon SageMaker notebook instances (when you stop and start) or in the images launched in SageMaker Studio. This might cause some incompatibility if you have your own managed package repositories and dependencies. You can freeze your dependencies in internet-free mode so that:

  • You’re not affected by periodic updates from Amazon SageMaker to the base environment
  • You have better control over the dependencies in your environments and can get ample time to update or upgrade your dependencies

Using Amazon SageMaker notebook instances

To create your own dev environment with specific versions of Python and dependencies, you can use lifecycle configuration scripts. A lifecycle configuration provides shell scripts that run only when you create the notebook instance or whenever you start one. When you create a notebook instance, you can create a new lifecycle configuration and the scripts it uses or apply one that you already have. Amazon SageMaker has a lifecycle config script sample that you can use and modify to create isolated dependencies as described earlier. With this script, you can do the following:

  • Build an isolated installation of Conda
  • Create a Conda environment with it
  • Make the environment available as a kernel in Jupyter

This makes sure that dependencies in that kernel aren’t affected by the upgrades that Amazon SageMaker periodically rolls out to the underlying AMI. This script installs a custom, persistent installation of Conda on the notebook instance’s EBS volume, and ensures that these custom environments are available as kernels in Jupyter. We add Conda and CodeArtifact configuration to this script.

The on-create script downloads and installs a custom Conda installation to the EBS volume via Miniconda. Any relevant packages can be installed here.

  1. Set up CodeArtifact.
  2. Set up your Conda channels.
  3. Install ipykernel to make sure that the custom environment can be used as a Jupyter kernel.
  4. Make sure the notebook instance has internet connectivity to download the Miniconda installer.

The on-create script installs the ipykernel library so that the custom environments can be used as Jupyter kernels, and uses pip install and conda install to install libraries. You can adapt the script to create custom environments and install the libraries that you want. Amazon SageMaker doesn’t update these libraries when you stop and restart the notebook instance, so you can make sure that your custom environment has the specific versions of libraries that you want. See the following code:

#!/bin/bash

 set -e
 sudo -u ec2-user -i <<'EOF'
 unset SUDO_UID

 # Configure common package managers to use CodeArtifact
 aws codeartifact login --tool pip --domain my-org --domain-owner <000000000000> --repository  my-shared-python-repository  --endpoint-url https://vpce-xxxxx.api.codeartifact.us-east-1.vpce.amazonaws.com 

 # Install a separate conda installation via Miniconda
 WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda
 mkdir -p "$WORKING_DIR"
 wget https://repo.anaconda.com/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh -O "$WORKING_DIR/miniconda.sh"
 bash "$WORKING_DIR/miniconda.sh" -b -u -p "$WORKING_DIR/miniconda" 
 rm -rf "$WORKING_DIR/miniconda.sh"

 # Create a custom conda environment
 source "$WORKING_DIR/miniconda/bin/activate"

 # remove default channel from the .condarc 
 conda config --remove channels 'defaults'
 # add the conda channels to the .condarc file
 conda config --add channels 's3://user-conda-repository/main/'
 conda config --add channels 's3://user-conda-repository/condaforge/'

 KERNEL_NAME="custom_python"
 PYTHON="3.6"

 conda create --yes --name "$KERNEL_NAME" python="$PYTHON"
 conda activate "$KERNEL_NAME"

 pip install --quiet ipykernel

 # Customize these lines as necessary to install the required packages
 conda install --yes numpy
 pip install --quiet boto3

 EOF

The on-start script uses the custom Conda environment created in the on-create script, and uses the ipykernel package to add that as a kernel in Jupyter, so that they appear in the drop-down list in the Jupyter New menu. It also logs in to CodeArtifact to enable installing the packages from the custom repository. See the following code:

#!/bin/bash

set -e

sudo -u ec2-user -i <<'EOF'
unset SUDO_UID

# Get pip artifact
/home/ec2-user/SageMaker/aws/aws codeartifact login --tool pip --domain <my-org> --domain-owner <xxxxxxxxx> --repository <my-shared-python-repository> --endpoint-url <https://vpce-xxxxxxxx.api.codeartifact.us-east-1.vpce.amazonaws.com>

WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda/
source "$WORKING_DIR/miniconda/bin/activate"

for env in $WORKING_DIR/miniconda/envs/*; do
    BASENAME=$(basename "$env")
    source activate "$BASENAME"
    python -m ipykernel install --user --name "$BASENAME" --display-name "Custom ($BASENAME)"
done


EOF

echo "Restarting the Jupyter server.."
restart jupyter-server

CodeArtifact authorization tokens are valid for a default period of 12 hours. You can add a cron job to the on-start script to refresh the token automatically, or log in to CodeArtifact again in the Jupyter notebook terminal.
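As an alternative to a cron job, the following is a minimal boto3 sketch (not part of the original scripts) that fetches a fresh token and builds the pip index URL; the domain, owner account ID, and repository names reuse the placeholders above:

import boto3

codeartifact = boto3.client('codeartifact')

# Fetch a fresh authorization token (valid for up to 12 hours)
token = codeartifact.get_authorization_token(
    domain='my-org',
    domainOwner='000000000000',        # placeholder account ID
    durationSeconds=43200
)['authorizationToken']

# Look up the PyPI endpoint of the shared repository
endpoint = codeartifact.get_repository_endpoint(
    domain='my-org',
    domainOwner='000000000000',
    repository='my-shared-python-repository',
    format='pypi'
)['repositoryEndpoint']

# Compose the authenticated index URL and point pip at it, for example:
#   pip config set global.index-url <index_url>
index_url = endpoint.replace('https://', f'https://aws:{token}@').rstrip('/') + '/simple/'
print(index_url)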

Using SageMaker Studio notebooks

You can create your own custom Amazon SageMaker images in your private dev environment in SageMaker Studio. You can add the custom kernels, packages, and any other files required to run a Jupyter notebook in your image. It gives you the control and flexibility to do the following:

  • Install your own custom packages in the image
  • Configure the images to be integrated with your custom repositories for package installation by users

For example, you can install a selection of R or Python packages when building the image:

# Dockerfile
RUN conda install --quiet --yes \
    'r-base=4.0.0' \
    'r-caret=6.*' \
    'r-crayon=1.3*' \
    'r-devtools=2.3*' \
    'r-forecast=8.12*' \
    'r-hexbin=1.28*'

Or you can set up the Conda in the image to just use your own custom channels in Amazon S3 to install packages by changing the configuration of Conda channels:

# Dockerfile
RUN \
    # add the conda channels to the .condarc file
    conda config --add channels 's3://my-conda-repository/_conda-forge/' && \
    conda config --add channels 's3://my-conda-repository/main/' && \
    # remove defaults from the .condarc
    conda config --remove channels 'defaults'

You should use the CodeArtifact login command in SageMaker Studio to fetch credentials for use with pip:

# PyPIconfig.sh
# Configure common package managers to use CodeArtifact
aws codeartifact login --tool pip --domain my-org --domain-owner <000000000000> --repository my-shared-python-repository --endpoint-url https://vpce-xxxxx.api.codeartifact.us-east-1.vpce.amazonaws.com

CodeArtifact needs authorization tokens. You can add a cron job to the image to run the preceding command periodically, or run it manually when a notebook starts. To make it simple for your users, you can add the command to a shell script (such as PyPIconfig.sh) and copy the file into the image to be loaded in SageMaker Studio. In your Dockerfile, add the following command:

# Dockerfile
COPY PyPIconfig.sh /home/PyPIconfig.sh

For ease of use, PyPIconfig.sh is available in /home on SageMaker Studio. You can run it to configure your pip client in SageMaker Studio and fetch an authorization token from CodeArtifact using your AWS credentials.

Now, you can build and push your image into Amazon Elastic Container Registry (Amazon ECR). Finally, attach the image to multiple users (by attaching it to a domain) or a single user (by attaching it to the user’s profile) in SageMaker Studio. The following screenshot shows the configuration on the SageMaker Studio control panel.

For more information about building a custom image and attaching it to SageMaker Studio, see Bring your own custom SageMaker image tutorial.
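As a hedged sketch of those final steps with boto3 (the image name, ECR URI, role ARN, and domain ID are placeholders; the tutorial above covers the equivalent console workflow):

import boto3

sm = boto3.client('sagemaker')

# Register the image and its first version (placeholder ECR URI and execution role)
sm.create_image(ImageName='custom-conda-image',
                RoleArn='arn:aws:iam::111122223333:role/SageMakerImageRole')
sm.create_image_version(ImageName='custom-conda-image',
                        BaseImage='111122223333.dkr.ecr.us-east-1.amazonaws.com/custom-conda-image:latest')

# Describe the kernel that Studio should launch from the image
# ('custom_python' assumes the kernel name installed inside the image)
sm.create_app_image_config(
    AppImageConfigName='custom-conda-image-config',
    KernelGatewayImageConfig={'KernelSpecs': [{'Name': 'custom_python', 'DisplayName': 'Custom Python'}]}
)

# Attach the image to all users in the Studio domain (placeholder domain ID)
sm.update_domain(
    DomainId='d-xxxxxxxxxxxx',
    DefaultUserSettings={'KernelGatewayAppSettings': {'CustomImages': [{
        'ImageName': 'custom-conda-image',
        'AppImageConfigName': 'custom-conda-image-config'
    }]}}
)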

Installing the packages

In Amazon SageMaker notebook instances, as soon as you start the Jupyter notebook, you see a new kernel in Jupyter in the drop-down list of kernels (see the following screenshot). This environment is isolated from other default Conda environments.

In your notebook, when you use pip install <package name>, the Python package manager client connects to your custom repository instead of the public repositories. Also, if you use conda install <package name>, the notebook instance uses the packages in your Amazon S3 channels to install it. See the following screenshot of this code.

In SageMaker Studio, the custom images appear in the image selector dialog box of the SageMaker Studio Launcher. As soon as you select your own custom image, the kernel you installed in the image appears in the kernel selector dialog box. See the following screenshot.

As mentioned before, CodeArtifact authorization tokens are valid for a default period of 12 hours. If you’re using CodeArtifact, you can open a terminal or notebook in SageMaker Studio and run the PyPIconfig.sh file to configure your client or refresh your expired token:

# Configure PyPI package managers to use CodeArtifact
/home/PyPIconfig.sh

The following screenshot shows your view in SageMaker Studio.

Conclusion

This post demonstrated how to build a private environment for Amazon SageMaker notebook instances and SageMaker Studio to have better control over the dependencies in your environments. To build the private environment, we used the lifecycle configuration feature in notebook instances. The sample lifecycle config scripts are available on the GitHub repo. To install custom packages in SageMaker Studio, we built a custom image and attached it to SageMaker Studio. For more information about this feature, see Bringing your own custom container image to Amazon SageMaker Studio notebooks. For this solution, we used CodeArtifact, which makes it easy to build a PyPI repository for approved Python packages across the organization. For more information, see Software Package Management with AWS CodeArtifact.

Give CodeArtifact a try, and share your feedback and questions in the comments.


About the Author

Saeed Aghabozorgi, PhD, is a Senior ML Specialist at AWS, with a track record of developing enterprise-level solutions that substantially increase customers’ ability to turn their data into actionable knowledge. He is also a researcher in the artificial intelligence and machine learning field.


Stefan Natu is a Sr. Machine Learning Specialist at AWS. He is focused on helping financial services customers build end-to-end machine learning solutions on AWS. In his spare time, he enjoys reading machine learning blogs, playing the guitar, and exploring the food scene in New York City.

Securing data analytics with an Amazon SageMaker notebook instance and Kerberized Amazon EMR cluster

Ever since Amazon SageMaker was introduced at AWS re:Invent 2017, customers have used the service to quickly and easily build and train machine learning (ML) models and directly deploy them into a production-ready hosted environment. SageMaker notebook instances provide a powerful, integrated Jupyter notebook interface for easy access to data sources for exploration and analysis. You can enhance the SageMaker capabilities by connecting the notebook instance to an Apache Spark cluster running on Amazon EMR. This gives data scientists and engineers a common environment where they can collaborate on AI/ML and data analytics tasks.

If you’re using a SageMaker notebook instance, you may need a way to allow different personas (such as data scientists and engineers) to do different tasks on Amazon EMR with a secure authentication mechanism. For example, you might use the Jupyter notebook environment to build pipelines in Amazon EMR to transform datasets in the data lake, and later switch personas and use the Jupyter notebook environment to query the prepared data and perform advanced analytics on it. Each of these personas and actions may require their own distinct set of permissions to the data.

To address this requirement, you can deploy a Kerberized EMR cluster. Amazon EMR release version 5.10.0 and later supports MIT Kerberos, which is a network authentication protocol created by the Massachusetts Institute of Technology (MIT). Kerberos uses secret-key cryptography to provide strong authentication so passwords or other credentials aren’t sent over the network in an unencrypted format.

This post walks you through connecting a SageMaker notebook instance to a Kerberized EMR cluster using SparkMagic and Apache Livy. Users are authenticated with Kerberos Key Distribution Center (KDC), where they obtain temporary tokens to impersonate themselves as different personas before interacting with the EMR cluster with appropriately assigned privileges.

This post also demonstrates how a Jupyter notebook uses PySpark to download the COVID-19 database in CSV format from the Johns Hopkins GitHub repository. The data is transformed and processed by Pandas and saved to an S3 bucket in columnar Parquet format referenced by an Apache Hive external table hosted on Amazon EMR.

Solution walkthrough

The following diagram depicts the overall architecture of the proposed solution. A VPC with two subnets is created: one public and one private. For security reasons, a Kerberized EMR cluster is created inside the private subnet. It needs access to the internet to access data from the public GitHub repo, so a NAT gateway is attached to the public subnet to allow for internet access.

The Kerberized EMR cluster is configured with a bootstrap action in which three Linux users are created and Python libraries are installed (Pandas, requests, and Matplotlib).

You can set up Kerberos authentication a few different ways (for more information, see Kerberos Architecture Options):

  • Cluster dedicated KDC
  • Cluster dedicated KDC with Active Directory cross-realm trust
  • External KDC
  • External KDC integrated with Active Directory

The KDC can have its own user database, or it can use cross-realm trust with an Active Directory that holds the identity store. For this post, we use a cluster dedicated KDC that holds its own user database. First, the EMR cluster has a security configuration enabled to support Kerberos and is launched with a bootstrap action that creates Linux users on all nodes and installs the necessary libraries. A bash step runs right after the cluster is ready to create HDFS directories for the Linux users. The users have default credentials, which they must change the first time they log in to the EMR cluster.

A SageMaker notebook instance is spun up, which comes with SparkMagic support. The Kerberos client library is installed and the Livy host endpoint is configured to allow for the connection between the notebook instance and the EMR cluster. This is done through configuring the SageMaker notebook instance’s lifecycle configuration feature. We provide sample scripts later in this post to illustrate this process.

Fine-grained user access control for EMR File System

The EMR File System (EMRFS) is an implementation of HDFS that all EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon Simple Storage Service (Amazon S3). The Amazon EMR security configuration enables you to specify the AWS Identity and Access Management (IAM) role to assume when a user or group uses EMRFS to access Amazon S3. Choosing the IAM role for each user or group enables fine-grained access control for EMRFS on multi-tenant EMR clusters. This allows different personas to be associated with different IAM roles to access Amazon S3.
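
To make this concrete, the relevant portions of an Amazon EMR security configuration that combines a cluster-dedicated KDC with EMRFS role mappings look roughly like the following sketch (the account ID is a placeholder, and you should confirm the exact field names against the Amazon EMR documentation for your release):

{
    "AuthenticationConfiguration": {
        "KerberosConfiguration": {
            "Provider": "ClusterDedicatedKdc",
            "ClusterDedicatedKdcConfiguration": {
                "TicketLifetimeInHours": 24
            }
        }
    },
    "AuthorizationConfiguration": {
        "EmrFsConfiguration": {
            "RoleMappings": [
                {
                    "Role": "arn:aws:iam::111122223333:role/blog-allowEMRFSAccessForUser1",
                    "IdentifierType": "User",
                    "Identifiers": ["user1"]
                },
                {
                    "Role": "arn:aws:iam::111122223333:role/blog-allowEMRFSAccessForUser2",
                    "IdentifierType": "User",
                    "Identifiers": ["user2"]
                }
            ]
        }
    }
}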

Deploying the resources with AWS CloudFormation

You can use the provided AWS CloudFormation template to set up this architecture’s building blocks, including the EMR cluster, SageMaker notebook instance, and other required resources. The template has been successfully tested in the us-east-1 Region.

Complete the following steps to deploy the environment:

  1. Sign in to the AWS Management Console as an IAM power user, preferably an admin user.
  2. Choose Launch Stack to launch the CloudFormation template:

  1. Choose Next.

  1. For Stack name, enter a name for the stack (for example, blog).
  2. Leave the other values as default.
  3. Continue to choose Next and leave other parameters at their default.
  4. On the review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Create stack.

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE. The process usually takes about 10–15 minutes.

After the environment is complete, we can investigate what the template provisioned.

Notebook instance lifecycle configuration

Lifecycle configurations perform the following tasks to ensure a successful Kerberized authentication between the notebook and EMR cluster:

  • Configure the Kerberos client on the notebook instance
  • Configure SparkMagic to use Kerberos authentication

You can view your provisioned lifecycle configuration, SageEMRConfig, on the Lifecycle configuration page on the SageMaker console.

The template provisioned two scripts to start and create your notebook: start.sh and create.sh, respectively. The scripts replace {EMRDNSName} with your own EMR cluster primary node’s DNS hostname during the CloudFormation deployment.
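
As a rough sketch of what the start script does (simplified, not the exact script from the template), it installs the Kerberos client, points it at the cluster-dedicated KDC on the EMR primary node, and configures SparkMagic to use the Livy endpoint with Kerberos authentication. {EMRDNSName} is the same placeholder the template substitutes, and EC2.INTERNAL is the default realm for a cluster-dedicated KDC:

#!/bin/bash
# Illustrative lifecycle configuration start script (simplified).

# Install the Kerberos client on the notebook instance.
sudo yum install -y krb5-workstation

# Point the Kerberos client at the cluster-dedicated KDC on the EMR primary node.
sudo tee /etc/krb5.conf > /dev/null <<'EOF'
[libdefaults]
    default_realm = EC2.INTERNAL

[realms]
    EC2.INTERNAL = {
        kdc = {EMRDNSName}
        admin_server = {EMRDNSName}
    }
EOF

# Configure SparkMagic to reach Livy on the EMR primary node using Kerberos authentication.
mkdir -p /home/ec2-user/.sparkmagic
cat > /home/ec2-user/.sparkmagic/config.json <<'EOF'
{
  "kernel_python_credentials": {
    "url": "http://{EMRDNSName}:8998",
    "auth": "Kerberos"
  }
}
EOF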

EMR cluster security configuration

Security configurations in Amazon EMR are templates for different security setups. You can create a security configuration to conveniently reuse a security setup whenever you create a cluster. For more information, see Use Security Configurations to Set Up Cluster Security.

To view the EMR security configuration created, complete the following steps:

  1. On the Amazon EMR console, choose Security configurations.
  2. Expand the security configuration created. Its name begins with blog-securityConfiguration.

Two IAM roles are created as part of the solution in the CloudFormation template, one for each EMR user (user1 and user2). The two users are created during the Amazon EMR bootstrap action.

  1. Choose the role for user1, blog-allowEMRFSAccessForUser1.

The IAM console opens and shows the summary for the IAM role.

  1. Expand the policy attached to the role blog-emrFS-user1.

This is the S3 bucket the CloudFormation template created to store the COVID-19 datasets.

  1. Choose {} JSON.

You can see the policy definition and permissions to the bucket named blog-s3bucket-xxxxx.

  1. Return to the EMR security configuration.
  2. Choose the role for user2, blog-allowEMRFSAccessForUser2.
  3. Expand the policy attached to the role, blog-emrFS-user2.
  4. Choose {} JSON.

You can see the policy definition and permissions to the bucket named my-other-bucket.

Authenticating with Kerberos

To use these IAM roles, you authenticate via Kerberos from the notebook instance to the EMR cluster KDC. The authenticated user inherits the permissions associated with the policy of the IAM role defined in the Amazon EMR security configuration.

To authenticate with Kerberos in the SageMaker notebook instance, complete the following steps:

  1. On the SageMaker console, under Notebook, choose Notebook instances.
  2. Locate the instance named SageEMR.
  3. Choose Open JupyterLab.

  1. On the File menu, choose New.
  2. Choose Terminal.

  1. Enter kinit followed by the username user2.
  2. Enter the user’s password.

The initial default password is pwd2. The first time you log in, you’re prompted to change the password.

  1. Enter a new password.

  1. Enter klist to view the Kerberos ticket for user2.

After your user is authenticated, they can access the resources associated with the IAM role defined in Amazon EMR.
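
For reference, the whole sequence in the terminal amounts to two commands:

# Obtain a Kerberos ticket for user2 (you're prompted for the password)
kinit user2

# Verify that the ticket was granted and view its expiration
klist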

Running the example notebook

To run your notebook, complete the following steps:

  1. Choose the Covid19-Pandas-Spark example notebook.

  1. Choose the Run (▶) icon to progressively run the cells in the example notebook.

 

When you reach the cell in the notebook that saves the Spark DataFrame (sdf) to an internal Hive table, you get an Access Denied error.

This step fails because the IAM role associated with Amazon EMR user2 doesn’t have permissions to write to the S3 bucket blog-s3bucket-xxxxx.

  1. Navigate back to the terminal.
  2. Enter kinit followed by the username user1.
  3. Enter the user’s password.

The initial default password is pwd1. The first time you log in, you’re prompted to change the password.

  1. Enter a new password.

  1. Restart the kernel and run all cells in the notebook by choosing Kernel, then choosing Restart Kernel and Run All Cells.

This re-establishes a new Livy session with the EMR cluster using the new Kerberos token for user1.

The notebook uses Pandas and Matplotlib to process and transform the raw COVID-19 dataset into a consumable format and visualize it.

The notebook also demonstrates the creation of a native Hive table in HDFS and an external table hosted on Amazon S3, which are queried by SparkSQL. The notebook is self-explanatory; you can follow the steps to complete the demo.
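
As an example of the kind of code involved, writing a processed DataFrame out as an external, Parquet-backed Hive table on Amazon S3 takes only a few lines of PySpark similar to the following sketch (the S3 prefix and table name are illustrative, not the exact code from the sample notebook):

# Save the Spark DataFrame as an external Hive table backed by Parquet files in S3.
(sdf.write
    .mode("overwrite")
    .format("parquet")
    .option("path", "s3://blog-s3bucket-xxxxx/covid19/")
    .saveAsTable("covid19_external"))

# Query the external table with SparkSQL.
spark.sql("SELECT COUNT(*) AS records FROM covid19_external").show()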

Restricting principal access to Amazon S3 resources in the IAM role

The CloudFormation template by default allows any principal to assume the role blog-allowEMRFSAccessForUser1. This is overly permissive, so we need to further restrict the principals that can assume the role.

  1. On the IAM console, under Access management, choose Roles.
  2. Search for and choose the role blog-allowEMRFSAccessForUser1.

  1. On the Trust relationship tab, choose Edit trust relationship.

  1. Open a second browser window to look up your EMR cluster’s instance profile role name.

You can find the instance profile name on the IAM console by searching for the keyword instanceProfileRole. Typically, the name looks like <stack name>-EMRClusterinstanceProfileRole-xxxxx.

  1. Modify the policy document using the following JSON file, providing your own AWS account ID and the instance profile role name:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        "arn:aws:iam::<account id>:role/<EMR Cluster instanceProfile Role name>"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  1. Return to the first browser window.
  2. Choose Update Trust Policy.

This makes sure that only the EMR cluster’s users are allowed to access their own S3 buckets.

Cleaning up

You can complete the following steps to clean up resources deployed for this solution. This also deletes the S3 bucket, so you should copy the contents in the bucket to a backup location if you want to retain the data for later use.

  1. On the CloudFormation console, choose Stacks.
  2. Select the stack deployed for this solution.
  3. Choose Delete.

Summary

We walked through a solution in which a SageMaker notebook instance authenticates to a Kerberized EMR cluster via Apache Livy, and processed a public COVID-19 dataset with Pandas before saving it in Parquet format in an external Hive table. The table references the data hosted in an Amazon S3 bucket. We provided a CloudFormation template to automate the deployment of the necessary AWS services for the demo. We strongly encourage you to use managed and serverless services such as Amazon Athena and Amazon QuickSight for your specific use cases in production.


About the Authors

James Sun is a Senior Solutions Architect with Amazon Web Services. James has over 15 years of experience in information technology. Prior to AWS, he held several senior technical positions at MapR, HP, NetApp, Yahoo, and EMC. He holds a PhD from Stanford University.

 

Graham Zulauf is a Senior Solutions Architect. Graham is focused on helping AWS’ strategic customers solve important problems at scale.

Customization, automation and scalability in customer service: Integrating Genesys Cloud and AWS Contact Center Intelligence

This is a guest post authored by Rebecca Owens and Julian Hernandez, who work at Genesys Cloud. 

Legacy technology limits organizations in their ability to offer excellent customer service to users. Organizations must design, establish, and implement their customer relationship strategies while balancing against operational efficiency concerns.

Another factor to consider is the constant evolution of the relationship with the customer. External drivers, such as those recently imposed by COVID-19, can radically change how we interact in a matter of days. Customers have been forced to change the way they usually interact with brands, which has resulted in an increase in the volume of interactions hitting those communication channels that remain open, such as contact centers. Organizations have seen a significant increase in the overall number of interactions they receive, in some cases as much as triple the pre-pandemic volumes. This is further compounded by issues that restrict the number of agents available to serve customers.

The customer experience (CX) is becoming increasingly relevant and is considered by most organizations as a key differentiator.

In recent years, there has been a sharp increase in the usage of artificial intelligence (AI) in many different areas and operations within organizations. AI has evolved from being a mere concept to a tangible technology that can be incorporated in our day-to-day lives. The issue is that organizations are starting down this path only to find limitations due to language availability. Technologies are often only available in English or require a redesign or specialized development to handle multiple languages, which creates a barrier to entry.

Organizations face a range of challenges when formulating a CX strategy that offers a differentiated experience and can rapidly respond to changing business needs. To minimize the risk of adoption, you should aim to deploy solutions that provide greater flexibility, scalability, services, and automation possibilities.

Solution

Genesys Cloud (an omni-channel orchestration and customer relationship platform) provides all of the above as part of a public cloud model that enables quick and simple integration of AWS Contact Center Intelligence (AWS CCI) to transform the modern contact center from a cost center into a profit center. With AWS CCI, AWS and Genesys are committed to offering a variety of ways organizations can quickly and cost-effectively add functionalities such as conversational interfaces based on Amazon Lex, Amazon Polly, and Amazon Kendra.

In less than 10 minutes, you can integrate Genesys Cloud with the AWS CCI self-service solution powered by Amazon Lex and Amazon Polly in English-US, Spanish-US, Spanish-SP, French-FR, French-CA, or Italian-IT (recently released). This enables you to configure automated self-service channels that your customers can use to communicate naturally with bots powered by AI, which can understand their needs and provide quick and timely responses. Amazon Kendra (Amazon’s intelligent search service) “turbocharges” Amazon Lex with the ability to query FAQs and articles contained in a variety of knowledge bases to address the long tail of questions. You don’t have to explicitly program all these questions and corresponding answers in Amazon Lex. For more information, see AWS announces AWS Contact Center Intelligence solutions.

This is complemented by allowing for graceful escalation of conversations to live agents in situations where the bot can’t fully respond to a customer’s request, or when the company’s CX strategy requires it. The conversation context is passed to the agent so they know the messages that the user has previously exchanged with the bot, optimizing handle time, reducing effort, and increasing overall customer satisfaction.

With Amazon Lex, Amazon Polly, Amazon Kendra, and Genesys Cloud, you can easily create a bot and deploy it to different channels: voice, chat, SMS, and social messaging apps.

Enabling the integration

The integration between Amazon Lex (from which the addition of Amazon Polly and Amazon Kendra easily follows) and Genesys Cloud is available out of the box. It’s designed so that you can employ it quickly and easily.

You should first configure an Amazon Lex bot in one of the supported languages (for this post, we use Spanish-US). In the following use case, the bot is designed to provide a conversational interface that allows users to check information and availability for certain products and to purchase them. It also allows them to manage the order, including tracking, modification, and cancellation. All of these are implemented as intents configured in the bot.

The following screenshot shows a view of Genesys Cloud Resource Center, where you can get started.

Integration consists of three simple steps (for full instructions, see About the Amazon Lex integration):

  1. Install the Amazon Lex integration from Genesys AppFoundry (Genesys Marketplace).
  2. Configure the IAM role with permissions for Amazon Lex (a sample policy sketch follows this list).
  3. Set up and activate the Lex integration into Genesys Cloud.
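
As an illustration of step 2, the role you configure needs permission to call the Amazon Lex runtime; a minimal policy could look like the following sketch (the exact permissions Genesys requires are listed in the AppFoundry installation instructions, so treat this only as an example):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lex:PostText",
                "lex:PostContent"
            ],
            "Resource": "*"
        }
    ]
}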

After completing these steps, you can use any bots that you configured in Amazon Lex within Genesys Cloud flows, regardless of whether the flow is for voice (IVR type) or for digital channels like web chat, social networks, and messaging channels. The following screenshot shows a view of available bots for our use case on the Amazon Lex console.

To use our sample retail management bot, go into Architect (a Genesys Cloud flow configuration product) and choose the type of flow to configure (voice, chat, or messaging) so you can use the tools available for that channel.

In the flow toolbox, you can add the Call Lex Bot action anywhere in the flow by adding it via drag-and-drop.

This is how you can call any of your existing Amazon Lex bots from a Genesys Cloud Architect flow. In this voice flow example, we first identify the customer through a query to the CRM before passing them to the bot.

The Call Lex Bot action allows you to select one of your existing bots and configure information to pass (input variables). It outputs the intent identified in Amazon Lex and the slot information collected by the bot (output variables). Genesys Cloud can use the outputs to continue processing the interaction and provide context to the human agent if the interaction is transferred.

Going back to our example, we use the bot JH_Retail_Spa and configure two variables to pass to Amazon Lex that we collected from the CRM earlier in the flow: Task.UserName and Task.UserAccount. We then configure the track an order intent and its associated output variables.

The output information is played back to the customer, who can choose to finish the interaction or, if necessary, seek the support of a human agent. The agent is presented with a script that provides them with the context so they can seamlessly pick up the conversation at the point where the bot left off. This means the customer avoids having to repeat themselves, removing friction and improving customer experience.

You can enable the same functionality on digital channels, such as web chat, social networks, or messaging applications like WhatsApp or Line. In this case, all you need to do is use the same Genesys Cloud Architect action (Call Lex Bot) in digital flows.

The following screenshot shows an example of interacting with a bot on an online shopping website.

As with voice calls, if the customer needs additional support in digital interactions, these interactions are transferred to agents according to the defined routing strategy. Again, context is provided and the transcription of the conversation between the client and the bot is displayed to the agent.

In addition to these use cases, you can use the Genesys Cloud REST API to generate additional interaction types, providing differentiated customer service. For example, with the release of Amazon Lex in Spanish, some of our customers and partners are building Alexa Skills, delivering an additional personalized communication channel to their users.

Conclusion

Customer experience operations are constantly coming up against new challenges, especially in the days of COVID-19. Genesys Cloud provides a solution that can manage all the changes we’re facing daily. It natively provides a flexible, agile, and resilient omni-channel solution that enables scalability on demand.

With the release of Amazon Lex in Spanish, you can quickly incorporate bots within your voice or digital channels, improving efficiency and customer service. These interactions can be transferred when needed to human agents with the proper context so they can continue the conversation seamlessly and focus on more complex cases where they can add more value.

If you have Genesys Cloud, check out the integration with Amazon Lex in Spanish and US Spanish to see how simple and beneficial it can be. If you’re not a customer, this is an additional reason to migrate and take full advantage of the benefits Genesys and AWS CCI can offer you. Differentiate your organization by personalizing every customer service interaction, improving agent satisfaction, and enhancing visibility into important business metrics with a more intelligent contact center.


About the Authors

Rebecca Owens is a Senior Product Manager at Genesys and is based out of Raleigh, North Carolina.

Julian Hernandez is Senior Cloud Business Development – LATAM for Genesys and is based out of Bogota D.C. Area, Colombia.

 

Amazon Transcribe streaming adds support for Japanese, Korean, and Brazilian Portuguese

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy to add speech-to-text capabilities to your applications. Today, we’re excited to launch Japanese, Korean, and Brazilian Portuguese language support for Amazon Transcribe streaming. To deliver streaming transcriptions with low latency for these languages, we’re also announcing availability of Amazon Transcribe streaming in the Asia Pacific (Seoul), Asia Pacific (Tokyo), and South America (São Paulo) Regions.

Amazon Transcribe added support for Italian and German languages earlier in November 2020, and this launch continues to expand the service’s streaming footprint. Now you can automatically generate live streaming transcriptions for a diverse set of use cases within your contact centers and media production workflows.

Customer stories

Our customers are using Amazon Transcribe to capture customer calls in real time to better assist agents, improve call resolution times, generate content highlights on the fly, and more. Here are a few stories of their streaming audio use cases.

PLUS Corporation

PLUS Corporation is a global manufacturer and retailer of office supplies. Mr. Yoshiki Yamaguchi, the Deputy Director of the IT Department at PLUS, says, “Every day, our contact center processes millions of minutes of calls. We are excited by Amazon Transcribe’s new addition of Japanese language support for streaming audio. This will provide our agents and supervisors with a robust, streaming transcription service that will enable them to better analyze and monitor incoming calls to improve call handling times and customer service quality.”

Nomura Research Institute

Nomura Research Institute (NRI) is one of the largest economic research consulting firms in Japan and an AWS premier consulting partner specializing in contact center intelligence solutions. Mr. Kakinoki, the deputy division manager of NRI’s Digital Workplace Solutions says, “For decades, we have been bringing intelligence into contact centers by providing several Natural Language Processing (NLP) based solutions in Japanese. We are pleased to welcome Amazon Transcribe’s latest addition of Japanese support for streaming audio. This enables us to automatically provide agents with real-time call transcripts as part of our solution. Now we are able to offer an integrated AWS contact center solution to our customers that addresses their need for real-time FAQ knowledge recommendations and text summarization of inbound calls. This will help our customers such as large financial firms reduce call resolution times, increase agent productivity and improve quality management capabilities.”

Accenture

Dr. Gakuse Hoshina, a Managing Director and Lead for Japan Applied Intelligence at Accenture Strategy & Consulting, says, “Accenture provides services such as AI Powered Contact Center and AI Powered Concierge, which synergize automated AI responses and human interactions to help our clients create highly satisfying customer experiences. I expect that the latest addition of Japanese for Amazon Transcribe streaming support will lead to a further improved customer experience, improving the accuracy of AI’s information retrieval and automated responses.”

Audioburst

Audioburst is a technology provider that is transforming the discovery, distribution, and personalization of talk audio. Gal Klein, the co-founder and CTO of Audioburst, says, “Every day, we analyze 225,000 minutes of live talk radio to create thousands of short, topical segments of information for playlists and search. We chose Amazon Transcribe because it is a remarkable speech recognition engine that helps us transcribe live audio content for our downstream content production work streams. Transcribe provides a robust system that can simultaneously convert a hundred audio streams into text for a reasonable cost. With this high-quality output text, we are then able to quickly process live talk radio episodes into consumable segments that provide next-gen listening experiences and drive higher engagement.”

Getting started

You can try Amazon Transcribe streaming for any of these new languages on the Amazon Transcribe console. The following is a quick demo of how to use it in Japanese.

You can take advantage of streaming transcription within your own applications with the Amazon Transcribe API for HTTP/2 or WebSockets implementations.
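
The following sketch shows one way to stream a local Japanese audio file from Python using the open-source amazon-transcribe SDK; the file name, chunk size, and Region are placeholders, and the SDK details shown here should be verified against the library's current documentation:

# Minimal streaming transcription sketch (pip install amazon-transcribe).
import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent


class PrintTranscriptHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        # Print partial and final transcripts as they arrive.
        for result in transcript_event.transcript.results:
            for alternative in result.alternatives:
                print(alternative.transcript)


async def transcribe_file(path: str):
    client = TranscribeStreamingClient(region="ap-northeast-1")  # Asia Pacific (Tokyo)
    stream = await client.start_stream_transcription(
        language_code="ja-JP",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )

    async def send_audio():
        # Send the audio to Amazon Transcribe in small chunks, then close the stream.
        with open(path, "rb") as audio:
            while chunk := audio.read(1024 * 16):
                await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = PrintTranscriptHandler(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())


asyncio.run(transcribe_file("sample_ja.pcm"))  # 16 kHz, 16-bit PCM audio (placeholder file name)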

For the full list of supported languages and Regions for Amazon Transcribe streaming, see Streaming Transcription and AWS Regional Services.

Summary

Amazon Transcribe is a powerful ASR service for accurately converting your real-time speech into text. Try using it to help streamline your content production needs and equip your agents with effective tools to improve your overall customer experience. See all the ways in which other customers are using Amazon Transcribe.

 


About the Author

Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of technology and education. Out of the office, Esther enjoys long walks along the beach, dinners with friends and friendly rounds of Mahjong.

Real-time anomaly detection for Amazon Connect call quality using Amazon ES

If your contact center is serving calls over the internet, network metrics like packet loss, jitter, and round-trip time are key to understanding call quality. In the post Easily monitor call quality with Amazon Connect, we introduced a solution that captures real-time metrics from the Amazon Connect softphone, stores them in Amazon Elasticsearch Service (Amazon ES), and creates easily understandable dashboards using Kibana. Examining these metrics and implementing rule-based alerting can be valuable. However, as your contact center scales, outliers become harder to detect across broad aggregations. For example, average packet loss for each of your sites may be below a rule threshold, but individual agent issues might go undetected.

The high cardinality anomaly detection feature of Amazon ES is a machine learning (ML) approach that can solve this problem. You can streamline your operational workflows by detecting anomalies for individual agents across multiple metrics. Alerts allow you to proactively reach out to the agent to help them resolve issues as they’re detected or use the historical data in Amazon ES to understand the issue. In this post, we give a four-step guide to detecting an individual agent’s call quality anomalies.

Anomaly detection in four stages

If you have an Amazon Connect instance, you can deploy the solution and follow along with this post. There are four steps to getting started with using anomaly detection to proactively monitor your data:

  1. Create an anomaly detector.
  2. Observe the results.
  3. Configure alerts.
  4. Tune the detector.

Creating an anomaly detector

A detector is the component used to track anomalies in a data source in real time. Features are the metrics the detector monitors in that data.

Our solution creates a call quality metric detector when deployed. This detector finds potential call quality issues by monitoring network metrics for agents’ calls, with features for round trip time, packet loss, and jitter. We use high cardinality anomaly detection to monitor anomalies across these features and identify agents who are having issues. Getting these granular results means that you can effectively isolate trends or individual issues in your call center.
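
If you want to create a similar detector yourself rather than rely on the preconfigured one, the anomaly detection plugin also exposes a REST API. The following request is a hedged sketch with placeholder index, field, and interval values, so verify the exact schema against the Open Distro anomaly detection documentation for your Elasticsearch version:

POST _opendistro/_anomaly_detection/detectors
{
  "name": "call-quality-detector",
  "description": "Per-agent call quality anomalies",
  "time_field": "timestamp",
  "indices": ["softphone-metrics-*"],
  "category_field": ["agent_name"],
  "feature_attributes": [
    {
      "feature_name": "avg_packet_loss",
      "feature_enabled": true,
      "aggregation_query": {
        "avg_packet_loss": { "avg": { "field": "packet_loss" } }
      }
    }
  ],
  "detection_interval": { "period": { "interval": 10, "unit": "Minutes" } },
  "window_delay": { "period": { "interval": 1, "unit": "Minutes" } }
}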

Observing results

We can observe active detectors from Kibana’s anomaly detection dashboard. The graphs in Kibana show live anomalies, historical anomalies, and your feature data. You can determine the severity of an anomaly by the anomaly grade and the confidence of the detector. The following screenshot shows a heat map of anomalies detected, including the anomaly grade, confidence level, and the agents impacted.

We can choose any of the tiles in the heat map to drill down further and use the feature breakdown tab to see more details. For example, the following screenshot shows the feature breakdown of one of the tiles from the agent john-doe. The feature breakdown shows us that the round trip time, jitter, and packet loss values spiked during this time period.

To see if this is an intermittent issue for this agent, we have two approaches. First, we can check the anomaly occurrences for the agent to see if there is a trend. In the following screenshot, John Doe had 12 anomalies over the last day, starting around 3:25 PM. This is a clear sign to investigate.

Alternatively, referencing the heat map, we can look to see if there is a visual trend. Our preceding heat map shows a block of anomalies around 9:40 PM on November 23—this is a reason to look at other broader issues impacting these agents. For instance, they might all be at the same site, which suggests we should look at the site’s network health.

Configuring alerts

We can also configure each detector with alerts to send a message to Amazon Chime or Slack. Alternatively, you can send alerts to an Amazon Simple Notification Service (Amazon SNS) topic for emails or SMS. Alerting an operations engineer or operations team allows you to investigate these issues in real time. For instructions on configuring these alerts, see Alerting for Amazon Elasticsearch Service.

Tuning the detector

It’s simple to start monitoring new features; you can add them to your model with a few clicks. To get the best results from your data, you can tailor the time window to your business. These time intervals are referred to as the window size. The window size affects how quickly the algorithm adjusts to your data. Tuning the window size to match your data allows you to improve detector performance. For more information, see Anomaly Detection for Amazon Elasticsearch Service.

Conclusion

In this post, we’ve shown the value of high cardinality anomaly detection for Amazon ES. To get actionable insights and preconfigured anomaly detection for Amazon Connect metrics, you can deploy the call quality monitoring solution.

The high cardinality anomaly detection feature is available on all Amazon ES domains running Elasticsearch 7.9 or greater. For more information, see Anomaly Detection for Amazon Elasticsearch Service.

 


About the Authors

Kun Qian is a Specialist Solutions Architect on the Amazon Connect Customer Solutions Acceleration team. He enjoys solving complex problems with technology. In his spare time he loves hiking, experimenting in the kitchen, and meeting new people.


Rob Taylor is a Solutions Architect helping customers in Australia design and build solutions on AWS. He has a background in theoretical computer science and modeling. Originally from the US, in his spare time he enjoys exploring Australia and spending time with his family.

Analyzing data stored in Amazon DocumentDB (with MongoDB compatibility) using Amazon SageMaker

One of the challenges in data science is getting access to operational or real-time data, which is often stored in operational database systems. Being able to connect data science tools to operational data easily and efficiently unleashes enormous potential for gaining insights from real-time data. In this post, we explore using Amazon SageMaker to analyze data stored in Amazon DocumentDB (with MongoDB compatibility).

For illustrative purposes, we use public event data from the GitHub API, which has a complex nested JSON format, and is well-suited for a document database such as Amazon DocumentDB. We use SageMaker to analyze this data, conduct descriptive analysis, and build a simple machine learning (ML) model to predict whether a pull request will close within 24 hours, before writing prediction results back into the database.

SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Amazon DocumentDB is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. You can use the same MongoDB 3.6 application code, drivers, and tools to run, manage, and scale workloads on Amazon DocumentDB without having to worry about managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.

Solution overview

In this post, we analyze GitHub events, examples of which include issues, forks, and pull requests. Each event is represented by the GitHub API as a complex, nested JSON object, which is a format well-suited for Amazon DocumentDB. The following code is an example of the output from a pull request event:

{
  "id": "13469392114",
  "type": "PullRequestEvent",
  "actor": {
    "id": 33526713,
    "login": "arjkesh",
    "display_login": "arjkesh",
    "gravatar_id": "",
    "url": "https://api.github.com/users/arjkesh",
    "avatar_url": "https://avatars.githubusercontent.com/u/33526713?"
  },
  "repo": {
    "id": 234634164,
    "name": "aws/deep-learning-containers",
    "url": "https://api.github.com/repos/aws/deep-learning-containers"
  },
  "payload": {
    "action": "closed",
    "number": 570,
    "pull_request": {
      "url": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570",
      "id": 480316742,
      "node_id": "MDExOlB1bGxSZXF1ZXN0NDgwMzE2NzQy",
      "html_url": "https://github.com/aws/deep-learning-containers/pull/570",
      "diff_url": "https://github.com/aws/deep-learning-containers/pull/570.diff",
      "patch_url": "https://github.com/aws/deep-learning-containers/pull/570.patch",
      "issue_url": "https://api.github.com/repos/aws/deep-learning-containers/issues/570",
      "number": 570,
      "state": "closed",
      "locked": false,
      "title": "[test][tensorflow][ec2] Add timeout to Data Service test setup",
      "user": {
        "login": "arjkesh",
        "id": 33526713,
        "node_id": "MDQ6VXNlcjMzNTI2NzEz",
        "avatar_url": "https://avatars3.githubusercontent.com/u/33526713?v=4",
        "gravatar_id": "",
        "url": "https://api.github.com/users/arjkesh",
        "html_url": "https://github.com/arjkesh",
        "followers_url": "https://api.github.com/users/arjkesh/followers",
        "following_url": "https://api.github.com/users/arjkesh/following{/other_user}",
        "gists_url": "https://api.github.com/users/arjkesh/gists{/gist_id}",
        "starred_url": "https://api.github.com/users/arjkesh/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/arjkesh/subscriptions",
        "organizations_url": "https://api.github.com/users/arjkesh/orgs",
        "repos_url": "https://api.github.com/users/arjkesh/repos",
        "events_url": "https://api.github.com/users/arjkesh/events{/privacy}",
        "received_events_url": "https://api.github.com/users/arjkesh/received_events",
        "type": "User",
        "site_admin": false
      },
      "body": "*Issue #, if available:*rnrn## Checklistrn- [x] I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]rnrn*Description:*rnCurrently, this test does not timeout in the same manner as execute_ec2_training_test or other ec2 training tests. As such, a timeout should be added here to avoid hanging instances. A separate PR will be opened to address why the global timeout does not catch this.rnrn*Tests run:*rnPR testsrnrnrnBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.rnrn",
      "created_at": "2020-09-05T00:29:22Z",
      "updated_at": "2020-09-10T06:16:53Z",
      "closed_at": "2020-09-10T06:16:53Z",
      "merged_at": null,
      "merge_commit_sha": "4144152ac0129a68c9c6f9e45042ecf1d89d3e1a",
      "assignee": null,
      "assignees": [

      ],
      "requested_reviewers": [

      ],
      "requested_teams": [

      ],
      "labels": [

      ],
      "milestone": null,
      "draft": false,
      "commits_url": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570/commits",
      "review_comments_url": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570/comments",
      "review_comment_url": "https://api.github.com/repos/aws/deep-learning-containers/pulls/comments{/number}",
      "comments_url": "https://api.github.com/repos/aws/deep-learning-containers/issues/570/comments",
      "statuses_url": "https://api.github.com/repos/aws/deep-learning-containers/statuses/99bb5a14993ceb29c16641bd54865db46ee6bf59",
      "head": {
        "label": "arjkesh:add_timeouts",
        "ref": "add_timeouts",
        "sha": "99bb5a14993ceb29c16641bd54865db46ee6bf59",
        "user": {
          "login": "arjkesh",
          "id": 33526713,
          "node_id": "MDQ6VXNlcjMzNTI2NzEz",
          "avatar_url": "https://avatars3.githubusercontent.com/u/33526713?v=4",
          "gravatar_id": "",
          "url": "https://api.github.com/users/arjkesh",
          "html_url": "https://github.com/arjkesh",
          "followers_url": "https://api.github.com/users/arjkesh/followers",
          "following_url": "https://api.github.com/users/arjkesh/following{/other_user}",
          "gists_url": "https://api.github.com/users/arjkesh/gists{/gist_id}",
          "starred_url": "https://api.github.com/users/arjkesh/starred{/owner}{/repo}",
          "subscriptions_url": "https://api.github.com/users/arjkesh/subscriptions",
          "organizations_url": "https://api.github.com/users/arjkesh/orgs",
          "repos_url": "https://api.github.com/users/arjkesh/repos",
          "events_url": "https://api.github.com/users/arjkesh/events{/privacy}",
          "received_events_url": "https://api.github.com/users/arjkesh/received_events",
          "type": "User",
          "site_admin": false
        },
        "repo": {
          "id": 265346646,
          "node_id": "MDEwOlJlcG9zaXRvcnkyNjUzNDY2NDY=",
          "name": "deep-learning-containers-1",
          "full_name": "arjkesh/deep-learning-containers-1",
          "private": false,
          "owner": {
            "login": "arjkesh",
            "id": 33526713,
            "node_id": "MDQ6VXNlcjMzNTI2NzEz",
            "avatar_url": "https://avatars3.githubusercontent.com/u/33526713?v=4",
            "gravatar_id": "",
            "url": "https://api.github.com/users/arjkesh",
            "html_url": "https://github.com/arjkesh",
            "followers_url": "https://api.github.com/users/arjkesh/followers",
            "following_url": "https://api.github.com/users/arjkesh/following{/other_user}",
            "gists_url": "https://api.github.com/users/arjkesh/gists{/gist_id}",
            "starred_url": "https://api.github.com/users/arjkesh/starred{/owner}{/repo}",
            "subscriptions_url": "https://api.github.com/users/arjkesh/subscriptions",
            "organizations_url": "https://api.github.com/users/arjkesh/orgs",
            "repos_url": "https://api.github.com/users/arjkesh/repos",
            "events_url": "https://api.github.com/users/arjkesh/events{/privacy}",
            "received_events_url": "https://api.github.com/users/arjkesh/received_events",
            "type": "User",
            "site_admin": false
          },
          "html_url": "https://github.com/arjkesh/deep-learning-containers-1",
          "description": "AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.",
          "fork": true,
          "url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1",
          "forks_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/forks",
          "keys_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/keys{/key_id}",
          "collaborators_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/collaborators{/collaborator}",
          "teams_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/teams",
          "hooks_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/hooks",
          "issue_events_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/issues/events{/number}",
          "events_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/events",
          "assignees_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/assignees{/user}",
          "branches_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/branches{/branch}",
          "tags_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/tags",
          "blobs_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/blobs{/sha}",
          "git_tags_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/tags{/sha}",
          "git_refs_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/refs{/sha}",
          "trees_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/trees{/sha}",
          "statuses_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/statuses/{sha}",
          "languages_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/languages",
          "stargazers_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/stargazers",
          "contributors_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/contributors",
          "subscribers_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/subscribers",
          "subscription_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/subscription",
          "commits_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/commits{/sha}",
          "git_commits_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/git/commits{/sha}",
          "comments_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/comments{/number}",
          "issue_comment_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/issues/comments{/number}",
          "contents_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/contents/{+path}",
          "compare_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/compare/{base}...{head}",
          "merges_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/merges",
          "archive_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/{archive_format}{/ref}",
          "downloads_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/downloads",
          "issues_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/issues{/number}",
          "pulls_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/pulls{/number}",
          "milestones_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/milestones{/number}",
          "notifications_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/notifications{?since,all,participating}",
          "labels_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/labels{/name}",
          "releases_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/releases{/id}",
          "deployments_url": "https://api.github.com/repos/arjkesh/deep-learning-containers-1/deployments",
          "created_at": "2020-05-19T19:38:21Z",
          "updated_at": "2020-06-23T04:18:45Z",
          "pushed_at": "2020-09-10T02:04:27Z",
          "git_url": "git://github.com/arjkesh/deep-learning-containers-1.git",
          "ssh_url": "git@github.com:arjkesh/deep-learning-containers-1.git",
          "clone_url": "https://github.com/arjkesh/deep-learning-containers-1.git",
          "svn_url": "https://github.com/arjkesh/deep-learning-containers-1",
          "homepage": "https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html",
          "size": 68734,
          "stargazers_count": 0,
          "watchers_count": 0,
          "language": "Python",
          "has_issues": false,
          "has_projects": true,
          "has_downloads": true,
          "has_wiki": true,
          "has_pages": true,
          "forks_count": 0,
          "mirror_url": null,
          "archived": false,
          "disabled": false,
          "open_issues_count": 0,
          "license": {
            "key": "apache-2.0",
            "name": "Apache License 2.0",
            "spdx_id": "Apache-2.0",
            "url": "https://api.github.com/licenses/apache-2.0",
            "node_id": "MDc6TGljZW5zZTI="
          },
          "forks": 0,
          "open_issues": 0,
          "watchers": 0,
          "default_branch": "master"
        }
      },
      "base": {
        "label": "aws:master",
        "ref": "master",
        "sha": "9514fde23ae9eeffb9dfba13ce901fafacef30b5",
        "user": {
          "login": "aws",
          "id": 2232217,
          "node_id": "MDEyOk9yZ2FuaXphdGlvbjIyMzIyMTc=",
          "avatar_url": "https://avatars3.githubusercontent.com/u/2232217?v=4",
          "gravatar_id": "",
          "url": "https://api.github.com/users/aws",
          "html_url": "https://github.com/aws",
          "followers_url": "https://api.github.com/users/aws/followers",
          "following_url": "https://api.github.com/users/aws/following{/other_user}",
          "gists_url": "https://api.github.com/users/aws/gists{/gist_id}",
          "starred_url": "https://api.github.com/users/aws/starred{/owner}{/repo}",
          "subscriptions_url": "https://api.github.com/users/aws/subscriptions",
          "organizations_url": "https://api.github.com/users/aws/orgs",
          "repos_url": "https://api.github.com/users/aws/repos",
          "events_url": "https://api.github.com/users/aws/events{/privacy}",
          "received_events_url": "https://api.github.com/users/aws/received_events",
          "type": "Organization",
          "site_admin": false
        },
        "repo": {
          "id": 234634164,
          "node_id": "MDEwOlJlcG9zaXRvcnkyMzQ2MzQxNjQ=",
          "name": "deep-learning-containers",
          "full_name": "aws/deep-learning-containers",
          "private": false,
          "owner": {
            "login": "aws",
            "id": 2232217,
            "node_id": "MDEyOk9yZ2FuaXphdGlvbjIyMzIyMTc=",
            "avatar_url": "https://avatars3.githubusercontent.com/u/2232217?v=4",
            "gravatar_id": "",
            "url": "https://api.github.com/users/aws",
            "html_url": "https://github.com/aws",
            "followers_url": "https://api.github.com/users/aws/followers",
            "following_url": "https://api.github.com/users/aws/following{/other_user}",
            "gists_url": "https://api.github.com/users/aws/gists{/gist_id}",
            "starred_url": "https://api.github.com/users/aws/starred{/owner}{/repo}",
            "subscriptions_url": "https://api.github.com/users/aws/subscriptions",
            "organizations_url": "https://api.github.com/users/aws/orgs",
            "repos_url": "https://api.github.com/users/aws/repos",
            "events_url": "https://api.github.com/users/aws/events{/privacy}",
            "received_events_url": "https://api.github.com/users/aws/received_events",
            "type": "Organization",
            "site_admin": false
          },
          "html_url": "https://github.com/aws/deep-learning-containers",
          "description": "AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.",
          "fork": false,
          "url": "https://api.github.com/repos/aws/deep-learning-containers",
          "forks_url": "https://api.github.com/repos/aws/deep-learning-containers/forks",
          "keys_url": "https://api.github.com/repos/aws/deep-learning-containers/keys{/key_id}",
          "collaborators_url": "https://api.github.com/repos/aws/deep-learning-containers/collaborators{/collaborator}",
          "teams_url": "https://api.github.com/repos/aws/deep-learning-containers/teams",
          "hooks_url": "https://api.github.com/repos/aws/deep-learning-containers/hooks",
          "issue_events_url": "https://api.github.com/repos/aws/deep-learning-containers/issues/events{/number}",
          "events_url": "https://api.github.com/repos/aws/deep-learning-containers/events",
          "assignees_url": "https://api.github.com/repos/aws/deep-learning-containers/assignees{/user}",
          "branches_url": "https://api.github.com/repos/aws/deep-learning-containers/branches{/branch}",
          "tags_url": "https://api.github.com/repos/aws/deep-learning-containers/tags",
          "blobs_url": "https://api.github.com/repos/aws/deep-learning-containers/git/blobs{/sha}",
          "git_tags_url": "https://api.github.com/repos/aws/deep-learning-containers/git/tags{/sha}",
          "git_refs_url": "https://api.github.com/repos/aws/deep-learning-containers/git/refs{/sha}",
          "trees_url": "https://api.github.com/repos/aws/deep-learning-containers/git/trees{/sha}",
          "statuses_url": "https://api.github.com/repos/aws/deep-learning-containers/statuses/{sha}",
          "languages_url": "https://api.github.com/repos/aws/deep-learning-containers/languages",
          "stargazers_url": "https://api.github.com/repos/aws/deep-learning-containers/stargazers",
          "contributors_url": "https://api.github.com/repos/aws/deep-learning-containers/contributors",
          "subscribers_url": "https://api.github.com/repos/aws/deep-learning-containers/subscribers",
          "subscription_url": "https://api.github.com/repos/aws/deep-learning-containers/subscription",
          "commits_url": "https://api.github.com/repos/aws/deep-learning-containers/commits{/sha}",
          "git_commits_url": "https://api.github.com/repos/aws/deep-learning-containers/git/commits{/sha}",
          "comments_url": "https://api.github.com/repos/aws/deep-learning-containers/comments{/number}",
          "issue_comment_url": "https://api.github.com/repos/aws/deep-learning-containers/issues/comments{/number}",
          "contents_url": "https://api.github.com/repos/aws/deep-learning-containers/contents/{+path}",
          "compare_url": "https://api.github.com/repos/aws/deep-learning-containers/compare/{base}...{head}",
          "merges_url": "https://api.github.com/repos/aws/deep-learning-containers/merges",
          "archive_url": "https://api.github.com/repos/aws/deep-learning-containers/{archive_format}{/ref}",
          "downloads_url": "https://api.github.com/repos/aws/deep-learning-containers/downloads",
          "issues_url": "https://api.github.com/repos/aws/deep-learning-containers/issues{/number}",
          "pulls_url": "https://api.github.com/repos/aws/deep-learning-containers/pulls{/number}",
          "milestones_url": "https://api.github.com/repos/aws/deep-learning-containers/milestones{/number}",
          "notifications_url": "https://api.github.com/repos/aws/deep-learning-containers/notifications{?since,all,participating}",
          "labels_url": "https://api.github.com/repos/aws/deep-learning-containers/labels{/name}",
          "releases_url": "https://api.github.com/repos/aws/deep-learning-containers/releases{/id}",
          "deployments_url": "https://api.github.com/repos/aws/deep-learning-containers/deployments",
          "created_at": "2020-01-17T20:52:43Z",
          "updated_at": "2020-09-09T22:57:46Z",
          "pushed_at": "2020-09-10T04:01:22Z",
          "git_url": "git://github.com/aws/deep-learning-containers.git",
          "ssh_url": "git@github.com:aws/deep-learning-containers.git",
          "clone_url": "https://github.com/aws/deep-learning-containers.git",
          "svn_url": "https://github.com/aws/deep-learning-containers",
          "homepage": "https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html",
          "size": 68322,
          "stargazers_count": 61,
          "watchers_count": 61,
          "language": "Python",
          "has_issues": true,
          "has_projects": true,
          "has_downloads": true,
          "has_wiki": true,
          "has_pages": false,
          "forks_count": 49,
          "mirror_url": null,
          "archived": false,
          "disabled": false,
          "open_issues_count": 28,
          "license": {
            "key": "apache-2.0",
            "name": "Apache License 2.0",
            "spdx_id": "Apache-2.0",
            "url": "https://api.github.com/licenses/apache-2.0",
            "node_id": "MDc6TGljZW5zZTI="
          },
          "forks": 49,
          "open_issues": 28,
          "watchers": 61,
          "default_branch": "master"
        }
      },
      "_links": {
        "self": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570"
        },
        "html": {
          "href": "https://github.com/aws/deep-learning-containers/pull/570"
        },
        "issue": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/issues/570"
        },
        "comments": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/issues/570/comments"
        },
        "review_comments": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570/comments"
        },
        "review_comment": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/pulls/comments{/number}"
        },
        "commits": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/pulls/570/commits"
        },
        "statuses": {
          "href": "https://api.github.com/repos/aws/deep-learning-containers/statuses/99bb5a14993ceb29c16641bd54865db46ee6bf59"
        }
      },
      "author_association": "CONTRIBUTOR",
      "active_lock_reason": null,
      "merged": false,
      "mergeable": false,
      "rebaseable": false,
      "mergeable_state": "dirty",
      "merged_by": null,
      "comments": 1,
      "review_comments": 0,
      "maintainer_can_modify": false,
      "commits": 6,
      "additions": 26,
      "deletions": 18,
      "changed_files": 1
    }
  },
  "public": true,
  "created_at": "2020-09-10T06:16:53Z",
  "org": {
    "id": 2232217,
    "login": "aws",
    "gravatar_id": "",
    "url": "https://api.github.com/orgs/aws",
    "avatar_url": "https://avatars.githubusercontent.com/u/2232217?"
  }
}

Amazon DocumentDB stores each JSON event as a document. Multiple documents are stored in a collection, and multiple collections are stored in a database. Borrowing terminology from relational databases, documents are analogous to rows, and collections are analogous to tables. The following table summarizes these terms.

Document Database Concepts    SQL Concepts
Document                      Row
Collection                    Table
Database                      Database
Field                         Column

We now implement the following Amazon DocumentDB tasks using SageMaker:

  1. Connect to an Amazon DocumentDB cluster.
  2. Ingest GitHub event data stored in the database.
  3. Generate descriptive statistics.
  4. Conduct feature selection and engineering.
  5. Generate predictions.
  6. Store prediction results.

Creating resources

We have prepared the following AWS CloudFormation template to create the required AWS resources for this post. For instructions on creating a CloudFormation stack, see the video Simplify your Infrastructure Management using AWS CloudFormation.

The CloudFormation stack provisions the following:

  • A VPC with three private subnets and one public subnet.
  • An Amazon DocumentDB cluster with three nodes, one in each private subnet. When creating an Amazon DocumentDB cluster in a VPC, its subnet group should have subnets in at least two Availability Zones in a given Region.
  • An AWS Secrets Manager secret to store login credentials for Amazon DocumentDB. This allows us to avoid storing plaintext credentials in our SageMaker instance.
  • A SageMaker role to retrieve the Amazon DocumentDB login credentials, allowing connections to the Amazon DocumentDB cluster from a SageMaker notebook.
  • A SageMaker instance to run queries and analysis.
  • A SageMaker instance lifecycle configuration that runs a bash script every time the instance boots up. The script downloads the certificate bundle needed for TLS connections to Amazon DocumentDB, along with a Jupyter notebook containing the code for this tutorial, and installs the required Python libraries (such as pymongo for database methods and xgboost for ML modeling) so that we don’t need to install them from the notebook. See the following code:
    #!/bin/bash
    sudo -u ec2-user -i <<'EOF'
    source /home/ec2-user/anaconda3/bin/activate python3
    pip install --upgrade pymongo
    pip install --upgrade xgboost
    source /home/ec2-user/anaconda3/bin/deactivate
    cd /home/ec2-user/SageMaker
    wget https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem
    wget https://raw.githubusercontent.com/aws-samples/documentdb-sagemaker-example/main/script.ipynb
    EOF

In creating the CloudFormation stack, you need to specify the following:

  • Name for your CloudFormation stack
  • Amazon DocumentDB username and password (to be stored in Secrets Manager)
  • Amazon DocumentDB instance type (default db.r5.large)
  • SageMaker instance type (default ml.t3.xlarge)

It should take about 15 minutes to create the CloudFormation stack. The following diagram shows the resource architecture.

Running this tutorial for an hour should cost no more than US$2.00.

Connecting to an Amazon DocumentDB cluster

All the subsequent code in this tutorial is in the Jupyter Notebook in the SageMaker instance created in your CloudFormation stack.

  1. To connect to your Amazon DocumentDB cluster from a SageMaker notebook, you have to first specify the following code:
    stack_name = "docdb-sm" # name of CloudFormation stack

The stack_name refers to the name you specified for your CloudFormation stack upon its creation.

  2. Use this parameter in the following method to get your Amazon DocumentDB credentials stored in Secrets Manager:
    import json

    import boto3

    def get_secret(stack_name):
    
        # Create a Secrets Manager client
        session = boto3.session.Session()
        client = session.client(
            service_name='secretsmanager',
            region_name=session.region_name
        )
        
        secret_name = f'{stack_name}-DocDBSecret'
        get_secret_value_response = client.get_secret_value(SecretId=secret_name)
        secret = get_secret_value_response['SecretString']
        
        return json.loads(secret)

  3. Next, we extract the login parameters from the stored secret:
    secret = get_secret(stack_name)
    
    db_username = secret['username']
    db_password = secret['password']
    db_port = secret['port']
    db_host = secret['host']

  4. Using the extracted parameters, we create a MongoClient from the pymongo library to establish a connection to the Amazon DocumentDB cluster:
    from pymongo import MongoClient

    uri_str = f"mongodb://{db_username}:{db_password}@{db_host}:{db_port}/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
    client = MongoClient(uri_str)
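
To confirm the connection before moving on, you can issue a lightweight command against the cluster. The following is a minimal sketch; the output shown is illustrative:

client.admin.command('ping')
> {'ok': 1.0}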

Ingesting data

After we establish the connection to our Amazon DocumentDB cluster, we create a database and collection to store our GitHub event data. For this post, we name our database gharchive, and our collection events:

db_name = "gharchive" # name the database
collection_name = "events" # name the collection

db = client[db_name] # create a database
events = db[collection_name] # create a collection

Next, we need to download the data from gharchive.org, which has been aggregated into hourly archives with the following naming format:

https://data.gharchive.org/YYYY-MM-DD-H.json.gz

The aim of this analysis is to predict whether a pull request closes within 24 hours. For simplicity, we limit the analysis to two days: February 10–11, 2015. Across these two days, there were over 1 million GitHub events.

The following code downloads the relevant hourly archives, then formats and ingests the data into your Amazon DocumentDB database. It takes about 7 minutes to run on an ml.t3.xlarge instance.

import json
import zlib

import requests

# Specify target date and time range for GitHub events
year = 2015
month = 2
days = [10, 11]
hours = range(0, 24)

# Download data from gharchive.org and insert into Amazon DocumentDB

for day in days:
    for hour in hours:
        
        print(f"Processing events for {year}-{month}-{day}, {hour} hr.")
        
        # zeropad values
        month_ = str(month).zfill(2)
        day_ = str(day).zfill(2)
        
        # download data
        url = f"https://data.gharchive.org/{year}-{month_}-{day_}-{hour}.json.gz"
        response = requests.get(url, stream=True)
        
        # decompress data
        respdata = zlib.decompress(response.content, zlib.MAX_WBITS|32)
        
        # format data
        stringdata = respdata.split(b'\n')
        data = [json.loads(x) for x in stringdata if 0 < len(x)]
        
        # ingest data
        events.insert_many(data, ordered=False, bypass_document_validation=True)

The ordered=False option allows the documents to be written out of order, and bypass_document_validation=True skips validating the JSON input on write, which is safe here because we already validated the JSON structure with json.loads() before inserting.

Both options expedite the data ingestion process.

Generating descriptive statistics

As is a common first step in data science, we want to explore the data to get some general descriptive statistics. We can use database operations to calculate some of these basic descriptive statistics.

To get a count of the number of GitHub events, we use the count_documents() command:

events.count_documents({})
> 1174157

The count_documents() command gets the number of documents in a collection. Each GitHub event is recorded as a document, and events is what we had named our collection earlier.
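
The count_documents() command also accepts a filter document. For example, to count only the pull request events that the rest of this post focuses on, you could run the following (a minimal sketch; the resulting count isn’t reported in the post):

# Count only pull request events
events.count_documents({"type": "PullRequestEvent"})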

The 1,174,157 documents comprise different types of GitHub events. To see the frequency of each type of event occurring in the dataset, we query the database using the aggregate command:

import pandas as pd

# Frequency of event types
event_types_query = events.aggregate([
    # Group by the type attribute and count
    {"$group" : {"_id": "$type", "count": {"$sum": 1}}}, 
    # Reformat the data
    {"$project": {"_id": 0, "Type": "$_id", "count": "$count"}},
    # Sort by the count in descending order
    {"$sort": {"count": -1} }  
])
df_event_types = pd.DataFrame(event_types_query)

The preceding query groups the events by type, runs a count, and sorts the results in descending order of count. Finally, we wrap the output in pd.DataFrame() to convert the results to a DataFrame. This allows us to generate visualizations such as the following.
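
The chart itself appears as an image in the original post. A minimal sketch of how such a bar chart might be produced from df_event_types, assuming matplotlib is available in the notebook environment:

import matplotlib.pyplot as plt

# Horizontal bar chart of event counts by type, most frequent at the top
df_event_types.sort_values('count').plot.barh(x='Type', y='count', legend=False)
plt.xlabel('Number of events')
plt.tight_layout()
plt.show()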

From the plot, we can see that push events were the most frequent, numbering close to 600,000.

Returning to our goal to predict if a pull request closes within 24 hours, we implement another query to include only pull request events, using the database match operation, and then count the number of such events per pull request URL:

# Frequency of PullRequestEvent actions by URL
action_query = events.aggregate([
    # Keep only PullRequestEvent types
    {"$match" : {"type": "PullRequestEvent"} }, 
    # Group by HTML URL and count
    {"$group": {"_id": "$payload.pull_request.html_url", "count": {"$sum": 1}}}, 
    # Reformat the data
    {"$project": {"_id": 0, "url": "$_id", "count": "$count"}},
    # Sort by the count in descending order
    {"$sort": {"count": -1} }  
])
df_action = pd.DataFrame(action_query)

From the result, we can see that a single URL could have multiple pull request events, such as those shown in the following screenshot.


One of the attributes of a pull request event is the state of the pull request after the event. Therefore, to determine whether a pull request was open or closed at the end of the 24-hour window, we look at its latest event within that window. We show how to run this query later in this post, but continue now with a discussion of descriptive statistics.

Apart from counts, we can also have the database calculate the mean, maximum, and minimum values for us. In the following query, we do this for potential predictors of a pull request open/close status, specifically the number of stars, forks, and open issues, as well as repository size. We also calculate the time elapsed (in milliseconds) of a pull request event since its creation. For each pull request, there could be multiple pull request events (comments), and this descriptive query spans across all these events:

# Descriptive statistics (mean, max, min) of repo size, stars, forks, open issues, elapsed time
descriptives = list(events.aggregate([
    # Keep only PullRequestEvents
    {"$match": {"type": "PullRequestEvent"} }, 
    # Project out attributes of interest
    {"$project": {
        "_id": 0, 
        "repo_size": "$payload.pull_request.base.repo.size", 
        "stars": "$payload.pull_request.base.repo.stargazers_count", 
        "forks": "$payload.pull_request.base.repo.forks_count", 
        "open_issues": "$payload.pull_request.base.repo.open_issues_count",
        "time_since_created": {"$subtract": [{"$dateFromString": {"dateString": "$payload.pull_request.updated_at"}}, 
                                  {"$dateFromString": {"dateString": "$payload.pull_request.created_at"}}]} 
    }}, 
    # Calculate min/max/avg for various metrics grouped over full data set
    {"$group": {
        "_id": "descriptives", 
        "mean_repo_size": {"$avg": "$repo_size"}, 
        "mean_stars": {"$avg": "$stars"}, 
        "mean_forks": {"$avg": "$forks"}, 
        "mean_open_issues": {"$avg": "$open_issues" },
        "mean_time_since_created": {"$avg": "$time_since_created"},
        
        "min_repo_size": {"$min": "$repo_size"}, 
        "min_stars": {"$min": "$stars"}, 
        "min_forks": {"$min": "$forks"}, 
        "min_open_issues": {"$min": "$open_issues" },
        "min_time_since_created": {"$min": "$time_since_created"},
        
        "max_repo_size": {"$max": "$repo_size"}, 
        "max_stars": {"$max": "$stars"}, 
        "max_forks": {"$max": "$forks"}, 
        "max_open_issues": {"$max": "$open_issues" },
        "max_time_since_created": {"$max": "$time_since_created"}
    }},
    # Reformat results
    {"$project": {
        "_id": 0, 
        "repo_size": {"mean": "$mean_repo_size", 
                      "min": "$min_repo_size",
                      "max": "$max_repo_size"},
        "stars": {"mean": "$mean_stars", 
                  "min": "$min_stars",
                  "max": "$max_stars"},
        "forks": {"mean": "$mean_forks", 
                  "min": "$min_forks",
                  "max": "$max_forks"},
        "open_issues": {"mean": "$mean_open_issues", 
                        "min": "$min_open_issues",
                        "max": "$max_open_issues"},
        "time_since_created": {"mean": "$mean_time_since_created", 
                               "min": "$min_time_since_created",
                               "max": "$max_time_since_created"},
    }}
]))

pd.DataFrame(descriptives[0])

The query results in the following output.

For supported methods of aggregations in Amazon DocumentDB, refer to Aggregation Pipeline Operators.

Conducting feature selection and engineering

Before we can begin building our prediction model, we need to select relevant features to include, and also engineer new features. In the following query, we select pull request events from non-empty repositories with more than 50 forks. We select possible predictors including number of forks (forks_count) and number of open issues (open_issues_count), and engineer new predictors by normalizing those counts by the size of the repository (repo.size). Finally, we shortlist the pull request events that fall within our period of evaluation, and record the latest pull request status (open or close), which is the outcome of our predictive model.
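
The query below references prediction_window, date_start, and date_end, which aren’t defined in the code shown so far. A minimal sketch of how they might be set, assuming the window is expressed in milliseconds (to match the result of $subtract on dates) and the two-day period described above:

from datetime import datetime

# 24-hour prediction window, in milliseconds (subtracting two dates yields milliseconds)
prediction_window = 24 * 60 * 60 * 1000

# Consider pull requests created on February 10-11, 2015 (the two days of ingested events)
date_start = datetime(2015, 2, 10)
date_end = datetime(2015, 2, 12)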

df = list(events.aggregate([
    # Filter on just PullRequestEvents
    {"$match": {
        "type": "PullRequestEvent",                                 # focus on pull requests
        "payload.pull_request.base.repo.forks_count": {"$gt": 50},  # focus on popular repos
        "payload.pull_request.base.repo.size": {"$gt": 0}           # exclude empty repos
    }},
    # Project only features of interest
    {"$project": {
        "type": 1,
        "payload.pull_request.base.repo.size": 1, 
        "payload.pull_request.base.repo.stargazers_count": 1,
        "payload.pull_request.base.repo.has_downloads": 1,
        "payload.pull_request.base.repo.has_wiki": 1,
        "payload.pull_request.base.repo.has_pages" : 1,
        "payload.pull_request.base.repo.forks_count": 1,
        "payload.pull_request.base.repo.open_issues_count": 1,
        "payload.pull_request.html_url": 1,
        "payload.pull_request.created_at": 1,
        "payload.pull_request.updated_at": 1,
        "payload.pull_request.state": 1,
        
        # calculate no. of open issues normalized by repo size
        "issues_per_repo_size": {"$divide": ["$payload.pull_request.base.repo.open_issues_count",
                                             "$payload.pull_request.base.repo.size"]},
        
        # calculate no. of forks normalized by repo size
        "forks_per_repo_size": {"$divide": ["$payload.pull_request.base.repo.forks_count",
                                            "$payload.pull_request.base.repo.size"]},
        
        # format datetime variables
        "created_time": {"$dateFromString": {"dateString": "$payload.pull_request.created_at"}},
        "updated_time": {"$dateFromString": {"dateString": "$payload.pull_request.updated_at"}},
        
        # calculate time elapsed since PR creation
        "time_since_created": {"$subtract": [{"$dateFromString": {"dateString": "$payload.pull_request.updated_at"}}, 
                                             {"$dateFromString": {"dateString": "$payload.pull_request.created_at"}} ]}
    }},
    # Keep only events within the window (24hrs) since the pull request was created
    # Keep only pull requests that were created on or after the start and before the end period
    {"$match": {
        "time_since_created": {"$lte": prediction_window},
        "created_time": {"$gte": date_start, "$lt": date_end}
    }},
    # Sort by the html_url and then by the updated_time
    {"$sort": {
        "payload.pull_request.html_url": 1,
        "updated_time": 1
    }},
    # keep the information from the first event in each group, plus the state from the last event in each group
    # grouping by html_url
    {"$group": {
        "_id": "$payload.pull_request.html_url",
        "repo_size": {"$first": "$payload.pull_request.base.repo.size"},
        "stargazers_count": {"$first": "$payload.pull_request.base.repo.stargazers_count"},
        "has_downloads": {"$first": "$payload.pull_request.base.repo.has_downloads"},
        "has_wiki": {"$first": "$payload.pull_request.base.repo.has_wiki"},
        "has_pages" : {"$first": "$payload.pull_request.base.repo.has_pages"},
        "forks_count": {"$first": "$payload.pull_request.base.repo.forks_count"},
        "open_issues_count": {"$first": "$payload.pull_request.base.repo.open_issues_count"},
        "issues_per_repo_size": {"$first": "$issues_per_repo_size"},
        "forks_per_repo_size": {"$first": "$forks_per_repo_size"},
        "state": {"$last": "$payload.pull_request.state"}
    }}
]))

df = pd.DataFrame(df)
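
The split in the next section drops a state_open column and later uses the pull request URL as the DataFrame index, so an intermediate preprocessing step is implied but not shown. A minimal sketch of what it might look like, assuming the URL (_id) becomes the index and the state field is one-hot encoded into a binary target:

# Index by pull request URL and convert the open/closed state into a binary target column
df = df.set_index('_id')
df = pd.get_dummies(df, columns=['state'])      # creates state_closed and state_open columns
df = df.drop(['state_closed'], axis=1)          # keep a single binary indicator
df['state_open'] = df['state_open'].astype(int)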

Generating predictions

Before building our model, we split our data into two sets for training and testing:

from sklearn.model_selection import train_test_split

X = df.drop(['state_open'], axis=1)
y = df['state_open']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    stratify=y,
                                                    random_state=42,
                                                   )

For this post, we use 70% of the documents for training the model, and the remaining 30% for testing the model’s predictions against the actual pull request status. We use the XGBoost algorithm to train a binary:logistic model evaluated with area under the curve (AUC) over 20 iterations. The seed is specified to enable reproducibility of results. The other parameters are left as default values. See the following code:

import xgboost as xgb

# Format data
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Specify model parameters
param = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'seed': 42,
}

# Train model
num_round = 20
bst = xgb.train(param, dtrain, num_round)

Next, we use the trained model to generate predictions for the test dataset and to calculate and plot the AUC:

from sklearn.metrics import roc_auc_score

preds = bst.predict(dtest)
roc_auc_score(y_test, preds)
> 0.609441068887402

The following plot shows our results.
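
The plot appears as an image in the original post. A minimal sketch of how such an ROC curve might be drawn from the test-set predictions, assuming scikit-learn and matplotlib:

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# Plot the ROC curve for the test-set predictions
fpr, tpr, _ = roc_curve(y_test, preds)
plt.plot(fpr, tpr, label='XGBoost (AUC = 0.61)')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()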

We can also rank the predictors of a pull request’s state by importance:

xgb.plot_importance(bst, importance_type='weight')

The following plot shows our results.

Predictor importance can be defined in different ways. For this post, we use weight, which is the number of times a predictor appears in the XGBoost trees. The top predictor is the number of open issues normalized by the repository size. Using a box plot, we compare the spread of values for this predictor between closed and still-open pull requests.
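
The box plot is shown as an image in the post. A minimal sketch of how it might be generated, assuming the state_open column created during preprocessing:

import matplotlib.pyplot as plt

# Compare the spread of the top predictor between closed (0) and still-open (1) pull requests
df.boxplot(column='issues_per_repo_size', by='state_open')
plt.suptitle('')  # remove the automatic "Boxplot grouped by state_open" title
plt.xlabel('Pull request still open after 24 hours (1 = open)')
plt.ylabel('Open issues / repository size')
plt.show()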

After we examine the results and are satisfied with the model performance, we can write predictions back into Amazon DocumentDB.

Storing prediction results

The final step is to store the model predictions back into Amazon DocumentDB. First, we create a new Amazon DocumentDB collection to hold our results, called predictions:

predictions = db['predictions']

Then we change the generated predictions to type float, to be accepted by Amazon DocumentDB:

preds = preds.astype(float)

We need to associate these predictions with their respective pull request events. Therefore, we use the pull request URL as each document’s ID. We match each prediction to its respective pull request URL and consolidate them in a list:

urls = y_test.index

def gen_preds(url, pred):
    """
    Generate document with prediction of whether pull request will close in 24 hours.
    ID is pull request URL.
    """
    doc = {
        "_id": url, 
        "close_24hr_prediction": pred}
    
    return doc

documents = [gen_preds(url, pred) for url, pred in zip(urls, preds)]

Finally, we use the insert_many command to write the documents to Amazon DocumentDB:

predictions.insert_many(documents, ordered=False)

We can query a sample of five documents in the predictions collection to verify that the results have been inserted correctly:

pd.DataFrame(predictions.find({}).limit(5))

The following screenshot shows our results.

Cleaning up resources

To save cost, delete the CloudFormation stack you created. This removes all the resources you provisioned using the CloudFormation template, including the VPC, Amazon DocumentDB cluster, and SageMaker instance. For instructions, see Deleting a stack on the AWS CloudFormation console.

Summary

We used SageMaker to analyze data stored in Amazon DocumentDB, conduct descriptive analysis, and build a simple ML model to make predictions, before writing prediction results back into the database.

Amazon DocumentDB provides you with a number of capabilities that help you back up and restore your data based on your use case. For more information, see Best Practices for Amazon DocumentDB. If you’re new to Amazon DocumentDB, see Getting Started with Amazon DocumentDB. If you’re planning to migrate to Amazon DocumentDB, see Migrating to Amazon DocumentDB.

About the Authors

Annalyn Ng is a Senior Data Scientist with AWS Professional Services, where she develops and deploys machine learning solutions for customers. Annalyn graduated with an MPhil from the University of Cambridge, and blogs about machine learning at algobeans.com. Her book, Numsense! Data Science for the Layman, has been translated into over five languages and is used as a reference text at top universities.

Brian Hess is a Senior Solution Architect Specialist for Amazon DocumentDB (with MongoDB compatibility) at AWS. He has been in the data and analytics space for over 20 years and has extensive experience with relational and NoSQL databases.
