Amazon EC2 Inf1 instances featuring AWS Inferentia chips now available in five new Regions and with improved performance

Amazon EC2 Inf1 instances featuring AWS Inferentia chips now available in five new Regions and with improved performance

Following strong customer demand, AWS has expanded the availability of Amazon EC2 Inf1 instances to five new Regions: US East (Ohio), Asia Pacific (Sydney, Tokyo), and Europe (Frankfurt, Ireland). Inf1 instances are powered by AWS Inferentia chips, which Amazon custom-designed to provide you with the lowest cost per inference in the cloud and lower barriers for everyday developers to use machine learning (ML) at scale.

As you scale your use of deep learning across new applications, you may be bound by the high cost of running trained ML models in production. In many cases, up to 90% of the infrastructure spent on developing and running an ML application is on inference, making the need for high-performance, cost-effective ML inference infrastructure critical. Inf1 instances are built from the ground up to support ML inference applications and deliver up to 30% higher throughput and up to 45% lower cost per inference than comparable GPU-based instances. This gives you the performance and cost structure you need to confidently deploy your deep learning models across a broad set of applications.

Customers and Amazon services adopting Inf1instances

Since the launch of Inf1 instances, a broad spectrum of customers, such as large enterprises and startups, as well as Amazon services, have begun using them to run production workloads. Amazon’s Alexa team is in the process of migrating their Text-To-Speech workload from running on GPUs to Inf1 instances. INGA Technology, a startup focused on advanced text summarization, got started with Inf1 instances quickly and saw immediate gains.

“We quickly ramped up on AWS Inferentia-based Amazon EC2 Inf1 instances and integrated them in our development pipeline,” says Yaroslav Shakula, Chief Business Development Officer at INGA Technologies. “The impact was immediate and significant. The Inf1 instances provide high performance, which enables us to improve the efficiency and effectiveness of our inference model pipelines. Out of the box, we have experienced four times higher throughput, and 30% lower overall pipeline costs compared to our previous GPU-based pipeline.”

SkyWatch provides you with the tools you need to cost-effectively add Earth observation data into your applications. They use deep learning to process hundreds of trillions of pixels of Earth observation data captured from space every day.

“Adopting the new AWS Inferentia-based Inf1 instances using Amazon SageMaker for real-time cloud detection and image quality scoring was quick and easy,” says Adler Santos, Engineering Manager at SkyWatch. “It was all a matter of switching the instance type in our deployment configuration. By switching instance types to AWS Inferentia-based Inf1, we improved performance by 40% and decreased overall costs by 23%. This is a big win. It has enabled us to lower our overall operational costs while continuing to deliver high-quality satellite imagery to our customers, with minimal engineering overhead.”

AWS Neuron SDK performance and support for new ML models

You can deploy your ML models to Inf1 instances using the AWS Neuron SDK, which is integrated with popular ML frameworks such as TensorFlow, PyTorch, and MXNet. Because Neuron is integrated with ML frameworks, you can deploy your existing models to Amazon EC2 Inf1 instances with minimal code changes. This gives you the freedom to maintain hardware portability and take advantage of the latest technologies without being tied to vendor-specific software libraries.

Since its launch, the Neuron SDK has seen dramatic improvement in performance, delivering throughput up to two times higher for image classification models and up to 60% improvement for natural language processing models. The most recent launch of Neuron added support for OpenPose, a model for multi-person keypoint detection, providing 72% lower cost per inference than GPU instances.

Getting started

The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service for building, training, and deploying ML models. If you prefer to manage your own ML application development platforms, you can get started by either launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or use Inf1 instances via Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS) for containerized ML applications.

For more information, see Amazon EC2 Inf1 Instances.


About the Author

Michal Skiba is a Senior Product Manager at AWS and passionate about enabling developers to leverage innovative hardware. Over the past ten years he has managed various cloud computing infrastructure products at Silicon Valley companies, large and small.

 

 

 

 

 

Read More

Expanding scientific portfolios and adapting to a changing world with Amazon Personalize

Expanding scientific portfolios and adapting to a changing world with Amazon Personalize

This is a guest blog post by David A. Smith at Thermo Fisher. In their own words, “Thermo Fisher Scientific is the world leader in serving science. Our Mission is to enable our customers to make the world healthier, cleaner, and safer. Whether our customers are accelerating life sciences research, solving complex analytical challenges, improving patient diagnostics and therapies, or increasing productivity in their laboratories, we are here to support them”

Researchers in Life Sciences perform increasingly complex work in an industry that’s changing at an accelerated pace. With the recent focus on the COVID-19 pandemic, scientists around the world are under the microscope as they work to deliver a cure. At Thermo Fisher, our driving principle is to provide these researchers, and others like them, the tools and materials they need to study the world’s most pressing problems.

The specialized products we sell have always necessitated personalized customer experiences. We sell nearly every type of product related to scientific work, from everyday essentials like labware and chemical reagents to specialized instrumentation for genetic sequencing. Our goal is to let our customers know that they can get everything they need at Thermo Fisher. Traditionally, we approached this problem via dedicated commercial sales teams trained to handle specific products. In today’s world, customer data comes from many different touchpoints, which makes it increasingly difficult for our sales teams to understand which products their customers need to do their research.

Over the last three years, my team has maintained a custom portal for these sales teams where they can see data for every part of their customers’ journey. This fast-moving environment presents a unique opportunity for us to use data science to deliver personalized product recommendations that target the right products for the right customers at the right time.

In this post, we discuss how and why we decided to use Amazon Personalize and how that decision has empowered our team to deliver highly personalized, multi-channel content in an ever-developing ecosystem.

First-generation recommendations

Our team initially developed a rules-based recommendation system based on content curated by in-house scientists and run using SQL queries within our Amazon Redshift cluster.

We had this system in place for a year, and it worked well, but as our data volume grew, our team was spending more and more time maintaining the system. We felt that our current infrastructure wasn’t keeping up, and we wanted to migrate to a completely serverless infrastructure for improved scalability and fault tolerance. The following diagram illustrates our existing recommendations infrastructure.

Another risk we identified was that these recommendations relied on an internal content creation process to understand where products fit in the customer journey. Although this was a powerful tool, we struggled to provide high-quality recommendations for new or recently introduced products. This is a classic “cold-start” problem for recommender systems, and one of our requirements for any new system was that it could surface new items without additional maintenance.

Custom recommendations

Our team initially looked at third-party vendors to help improve our recommendations. However, we found that purchasing a solution would be costly to implement and force us to sacrifice some of the flexibility required to operate in a commercial organization. We quickly decided against buying an off the-shelf solution.

The consensus was that we would build a custom machine learning (ML)-based system from scratch. We explored a few different options, including hierarchical recurrent neural network (HRNN) models. Eventually, we settled on a factorization machine model as the best combination of performance, ease of implementation, and scalability.

Personalized recommendations

About 8 weeks later, we were wrapping up the initial phases of model development and validation. The new system was performing well. We had significantly improved our predictions, and we were getting good feedback from some sample recommendations we had sent out.

We were gearing up to productionize our new solution when our team learned about Amazon Personalize. It was immediately apparent to us that Amazon Personalize had the ideal balance of flexibility, scalability, and measurability we were looking for when we had evaluated off-the-shelf solutions 2 months prior.

We decided to run some initial tests with Amazon Personalize to see how it performed on real data and get a feel for how much effort would be required to implement it. It took 2 days to prepare the data, train a model, and begin generating high-quality recommendations.

Bringing the test together

As a team that had recently planned for 4–6 weeks spent deploying our custom model into production, this was very attractive. For me, the data scientist responsible for successfully designing, building, and evaluating a completely homegrown solution, it was less attractive. I was excited about finally deploying our custom solution, and I was proud of its performance. We eventually decided to put the two models head-to-head, with the winner determined by the best combination of model performance, scalability, and flexibility.

Like any proud parent, I immediately set out to prove the custom model was better. I designed 32 tests for each model, and, over the next week, I ran each test on over 100 different slices of data to see which performed better on a holdout dataset. The deeper and more expressive neural network models provided by Amazon Personalize did a better job of predicting user behavior over roughly 80% of the testing criteria.

If you’re a data scientist, this story might make you cringe, but it has a happy ending. Designing this testing process forced me to examine our data even more deeply and creatively than I had while building our custom recommendation system. I was able to rapidly test all the different hypotheses and use the results to develop a deep understanding of each model’s relative strengths and weaknesses related to the business problem we initially set out to solve.

Our team couldn’t have performed such a thorough analysis if we were also managing the infrastructure required for deep learning models. As a team, we had the choice to either spend 6–8 weeks deploying our custom model or 2 weeks implementing a recommender system using Amazon Personalize.

Serverless infrastructure

Scalability and fault tolerance were our main priorities when designing the infrastructure for our scientific product recommendations. We also wanted a system that would allow us to visually monitor progress and track errors.

We opted to use AWS Step Functions to build the backbone of our recommendations inference pipeline with customized AWS Lambda functions to pull data from our Amazon Redshift cluster, prepare the datasets for ingestion by Amazon Personalize, and trigger and monitor Amazon Personalize jobs. The following graph illustrates this inference pipeline.

Flexibility in a changing world

Like many companies, our customers changed their habits significantly when the COVID-19 pandemic struck and businesses around the world shifted to work-from-home policies. There was a new demand to increase multi-channel targeting using email advertising campaigns.

Our team received a request to use the recommendation system we built with Amazon Personalize for targeted product email recommendations. Although we had never planned for this, it only took us a week to take our existing serverless inference pipeline and modify it to build, test, and validate an entirely new inference pipeline tuned specifically to email recommendations. Pivoting quickly is always challenging, but our commitment to building scalable and flexible infrastructure allowed us to overcome many of the challenges traditionally faced by teams when managing ML deployments and infrastructure. The following diagram illustrates the architecture of the email inference pipeline.

Despite the short turnaround time, the emails we’ve sent out following these recommendations have performed significantly better than previous baselines.

Looking back, it’s clear to me that we would have had significantly more difficulty meeting this request if we had opted to deploy our custom factorization machine model instead of using Amazon Personalize.

Conclusion

Thermo Fisher is constantly striving to help scientists around the world solve some of our greatest challenges. With Amazon Personalize, we’ve dramatically improved our ability to understand the work our customers do and serve them personalized experiences via multiple channels. Using Amazon Personalize has allowed us to focus on solving difficult problems instead of managing ML infrastructure.


About the Author

David A. Smith is a data scientist for Thermo Fisher Scientific based out of Carlsbad, California. He works with cross-organizational teams to design, build, and deploy automated models to drive customer intelligence and create business value. His interests include NLP, serverless ML, and blockchain technology. Outside of work, you can find David rock climbing, playing tennis, or swimming with his dog.

Read More

Amazon Textract now available in Asia Pacific (Mumbai) and EU (Frankfurt) Regions 

Amazon Textract now available in Asia Pacific (Mumbai) and EU (Frankfurt) Regions 

You can now use Amazon Textract, a machine learning (ML) service that quickly and easily extracts text and data from forms and tables in scanned documents, for workloads in the AWS Asia Pacific (Mumbai) and EU (Frankfurt) Regions.

Amazon Textract goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms, information stored in tables, and the context in which the information is presented. The Amazon Textract API supports multiple image formats like scans, PDFs, and photos, and you can use it with other AWS ML services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Augmented AI, and Amazon Translate to derive deeper meaning from the extracted text and data. You can also use this text and data to build smart searches on large archives of documents, or load it into a database for use by applications, such as accounting, auditing, and compliance software.

An in-country infrastructure is critical for customers with data residency requirements and regulations, such as those operating in government, insurance, healthcare, and financial services. With this launch, customers in the AWS Asia Pacific (Mumbai) and EU (Frankfurt) Regions can benefit from Amazon Textract while complying with data residency requirements and integrating with other services and applications available in these Regions.

Perfios is a leading product technology company enabling businesses to aggregate, curate, and analyze structured and unstructured data to help in decision-making. “We have been testing Amazon Textract since its early days and are very excited to see it launch in India to help us address data sovereignty requirements for the region, which now unblocks us to use it at scale,” says Ramgopal Cillanki, Vice President, Head of Engineering at Perfios Software. “We believe that the service will help to transform the banking, financial services, and insurance (BFSI) industry from operations-heavy, human-in-the-loop processes to machine learning-powered API automation with minimal manual operations. Textract will not only help us reduce lenders’ decision-making turnaround time but also create business impact for our end-users in the long run.”

For more information about Amazon Textract and its Region availability, see Amazon Textract FAQs. To get started with Amazon Textract, see the Amazon Textract Developer Guide.


About the Author

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

 

 

 

 

Read More

Accessing data sources from Amazon SageMaker R kernels

Accessing data sources from Amazon SageMaker R kernels

Amazon SageMaker notebooks now support R out-of-the-box, without needing you to manually install R kernels on the instances. Also, the notebooks come pre-installed with the reticulate library, which offers an R interface for the Amazon SageMaker Python SDK and enables you to invoke Python modules from within an R script. You can easily run machine learning (ML) models in R using the Amazon SageMaker R kernel to access the data from multiple data sources. The R kernel is available by default in all Regions that Amazon SageMaker is available in.

R is a programming language built for statistical analysis and is very popular in data science communities. In this post, we will show you how to connect to the following data sources from the Amazon SageMaker R kernel using Java Database Connectivity (JDBC):

For more information about using Amazon SageMaker features using R, see R User Guide to Amazon SageMaker.

Solution overview

To build this solution, we first need to create a VPC with public and private subnets. This will allow us to securely communicate with different resources and data sources inside an isolated network. Next, we create the data sources in the custom VPC and the notebook instance with all necessary configuration and access to connect various data sources using R.

To make sure that the data sources are not reachable from the Internet, we create them inside a private subnet of the VPC. For this post, we create the following:

Connect to the Amazon EMR cluster inside the private subnet using AWS Systems Manager Session Manager to create Hive tables.

To run the code using the R kernel in Amazon SageMaker, create an Amazon SageMaker notebook. Download the JDBC drivers for the data sources. Create a lifecycle configuration for the notebook containing the setup script for R packages, and attach the lifecycle configuration to the notebook on create and on start to make sure the setup is complete.

Finally, we can use the AWS Management Console to navigate to the notebook to run code using the R kernel and access the data from various sources. The entire solution is also available in the GitHub repository.

Solution architecture

The following architecture diagram shows how you can use Amazon SageMaker to run code using the R kernel by establishing connectivity to various sources. You can also use the Amazon Redshift query editor or Amazon Athena query editor to create data resources. You need to use the Session Manager in AWS Systems Manager to SSH to the Amazon EMR cluster to create Hive resources.

Launching the AWS CloudFormation template

To automate resource creation, you run an AWS CloudFormation template. The template gives you the option to create an Amazon EMR cluster, Amazon Redshift cluster, or Amazon Aurora MySQL-compatible cluster automatically, as opposed to executing each step manually. It will take a few minutes to create all the resources.

  1. Choose the following link to launch the CloudFormation stack, which creates the required AWS resources to implement this solution:
  2. On the Create stack page, choose Next.
  3. Enter a stack name.
  4. You can change the default values for the following stack details:
Stack Details Default Values
Choose Second Octet for Class B VPC Address (10.xxx.0.0/16) 0
SageMaker Jupyter Notebook Instance Type ml.t2.medium
Create EMR Cluster Automatically? “Yes”
Create Redshift Cluster Automatically? “Yes”
Create Aurora MySQL DB Cluster Automatically? “Yes”
  1. Choose Next.
  2. On the Configure stack options page, choose Next.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack.

You can now see the stack being created, as in the following screenshot.

When stack creation is complete, the status shows as CREATE_COMPLETE.

  1. On the Outputs tab, record the keys and their corresponding values.

You use the following keys later in this post:

  • AuroraClusterDBName – Aurora cluster database name
  • AuroraClusterEndpointWithPort – Aurora cluster endpoint address with port number
  • AuroraClusterSecret – Aurora cluster credentials secret ARN
  • EMRClusterDNSAddress – EMR cluster DNS name
  • EMRMasterInstanceId – EMR cluster primary instance ID
  • PrivateSubnets – Private subnets
  • PublicSubnets – Public subnets
  • RedshiftClusterDBName – Amazon Redshift cluster database name
  • RedshiftClusterEndpointWithPort – Amazon Redshift cluster endpoint address with port number
  • RedshiftClusterSecret – Amazon Redshift cluster credentials secret ARN
  • SageMakerNotebookName – Amazon SageMaker notebook instance name
  • SageMakerRS3BucketName – Amazon SageMaker S3 data bucket
  • VPCandCIDR – VPC ID and CIDR block

Creating your notebook with necessary R packages and JAR files

JDBC is an application programming interface (API) for the programming language Java, which defines how you can access a database. RJDBC is a package in R that allows you to connect to various data sources using the JDBC interface. The notebook instance that the CloudFormation template created ensures that the necessary JAR files for Hive, Presto, Amazon Athena, Amazon Redshift and MySQL are present in order to establish a JDBC connection.

  1. In the Amazon SageMaker Console, under Notebook, choose Notebook instances.
  2. Search for the notebook that matches the SageMakerNotebookName key you recorded earlier.
  3. Select the notebook instance.
  4. Click on “Open Jupyter” under “Actions” to locate the “jdbc” directory.

The CloudFormation template downloads the JAR files for Hive, Presto, Athena, Amazon Redshift, and Amazon Aurora MySQL-compatible inside the “jdbc” directory.

  1. Locate the lifecycle configuration attached.

A lifecycle configuration allows you to install packages or sample notebooks on your notebook instance, configure networking and security for it, or otherwise use a shell script for customization. A lifecycle configuration provides shell scripts that run when you create the notebook instance or when you start the notebook.

  1. Inside the Lifecycle configuration section, choose View script to see the lifecycle configuration script that sets up the R kernel in Amazon SageMaker to make JDBC connections to data sources using R.

It installs the RJDBC package and dependencies in the Anaconda environment of the Amazon SageMaker notebook.

Connecting to Hive and Presto

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.

You can create a test table in Hive by logging in to the EMR master node from the AWS console using the Session Manager capability in Systems Manager. Systems Manager gives you visibility and control of your infrastructure on AWS. Systems Manager also provides a unified user interface so you can view operational data from multiple AWS services and allows you to automate operational tasks across your AWS resources. Session Manager is a fully managed Systems Manager capability that lets you manage your Amazon Elastic Compute Cloud (Amazon EC2) instances, on-premises instances, and virtual machines (VMs) through an interactive, one-click browser-based shell or through the AWS Command Line Interface (AWS CLI).

You use the following values from the AWS CloudFormation Outputs tab in this step:

  • EMRClusterDNSAddress – EMR cluster DNS name
  • EMRMasterInstanceId – EMR cluster primary instance ID
  • SageMakerNotebookName – Amazon SageMaker notebook instance name
  1. On the Systems Manager Console, under Instances & Nodes, choose Session Manager.
  2. Choose Start Session.
  3. Start an SSH session with the EMR primary node by locating the instance ID as specified by the value of the key EMRMasterInstanceId.

This starts the browser-based shell.

  1. Run the following SSH commands:
    # change user to hadoop 
    whoami
    sudo su - hadoop

  2. Create a test table in Hive from the EMR master node as you have already logged in using SSH:
    # Run on the EMR master node to create a table called students in Hive
    hive -e "CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));"
    
    # Run on the EMR master node to insert data to students created above
    hive -e "INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);"
    
    # Verify 
    hive -e "SELECT * from students;"
    exit
    exit

The following screenshot shows the view in the browser-based shell.

  1. Close the browser after exiting the shell.

To query the data from Amazon EMR using the Amazon SageMaker R kernel, you open the notebook the CloudFormation template created.

  1. On the Amazon SageMaker Console, under Notebook, chose Notebook instances.
  2. Find the notebook as specified by the value of the key SageMakerNotebookName.
  3. Choose Open Jupyter.
  4. To demonstrate connectivity from the Amazon SageMaker R kernel, choose Upload and upload the ipynb notebook.

    1. Alternatively, from the New drop-down menu, choose R to open a new notebook.
    2. Enter the code as mentioned in “hive_connect.ipynb”, replacing the emr_dns value with the value from key EMRClusterDNSAddress:
  5. Run all the cells in the notebook to connect to Hive on Amazon EMR using the Amazon SageMaker R console.

You follow similar steps to connect Presto:

  1. On the Amazon SageMaker Console, open the notebook you created.
  2. Choose Open Jupyter.
  3. Choose Upload to upload the ipynb notebook.
    1. Alternatively, from the New drop-down menu, choose R to open a new notebook.
    2. Enter the code as mentioned in “presto_connect.ipynb”, replacing the emr_dns value with the value from key EMRClusterDNSAddress:
  4. Run all the cells in the notebook to connect to PrestoDB on Amazon EMR using the Amazon SageMaker R console.

Connecting to Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Amazon Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. To connect to Amazon Athena from the Amazon SageMaker R kernel using RJDBC, we use the Amazon Athena JDBC driver, which is already downloaded to the notebook instance via the lifecycle configuration script.

You also need to set the query result location in Amazon S3. For more information, see Working with Query Results, Output Files, and Query History.

  1. On the Amazon Athena Console, choose Get Started.
  2. Choose Set up a query result location in Amazon S3.
  3. For Query result location, enter the Amazon S3 location as specified by the value of the key SageMakerRS3BucketName.
  4. Optionally, add a prefix, such as results.
  5. Choose Save.
  6. Create a database or schema and table in Athena with the example Amazon S3 data.
  7. Similar to connecting to Hive and Presto, to establish a connection from Athena to Amazon SageMaker using the R kernel, you can upload the ipynb notebook.
    1. Alternatively, open a new notebook and enter the code in “athena_connect.ipynb”, replacing the s3_bucket value with the value from key SageMakerRS3BucketName:
  8. Run all the cells in the notebook to connect to Amazon Athena from the Amazon SageMaker R console.

Connecting to Amazon Redshift

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. To connect to Amazon Redshift from the Amazon SageMaker R kernel using RJDBC, we use the Amazon Redshift JDBC driver, which is already downloaded to the notebook instance via the lifecycle configuration script.

You need the following keys and their values from the AWS CloudFormation Outputs tab:

  • RedshiftClusterDBName – Amazon Redshift cluster database name
  • RedshiftClusterEndpointWithPort – Amazon Redshift cluster endpoint address with port number
  • RedshiftClusterSecret – Amazon Redshift cluster credentials secret ARN

The CloudFormation template creates a secret for the Amazon Redshift cluster in AWS Secrets Manager, which is a service that helps you protect secrets needed to access your applications, services, and IT resources. Secrets Manager lets you easily rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle.

  1. On the AWS Secrets Manager Console, choose Secrets.
  2. Choose the secret denoted by the RedshiftClusterSecret key value.
  3. In the Secret value section, choose Retrieve secret value to get the user name and password for the Amazon Redshift cluster.
  4. On the Amazon Redshift Console, choose Editor (which is essentially the Amazon Redshift query editor).
  5. For Database name, enter redshiftdb.
  6. For Database password, enter your password.
  7. Choose Connect to database.
  8. Run the following SQL statements to create a table and insert a couple of records:
    CREATE TABLE public.students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
    INSERT INTO public.students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
    

  9. On the Amazon SageMaker Console, open your notebook.
  10. Choose Open Jupyter.
  11. Upload the ipynb notebook.
    1. Alternatively, open a new notebook and enter the code as mentioned in “redshift_connect.ipynb”, replacing the values for RedshiftClusterEndpointWithPort, RedshiftClusterDBName, and RedshiftClusterSecret:
  12. Run all the cells in the notebook to connect to Amazon Redshift on the Amazon SageMaker R console.

Connecting to Amazon Aurora MySQL-compatible

Amazon Aurora is a MySQL-compatible relational database built for the cloud, which combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases. To connect to Amazon Aurora from the Amazon SageMaker R kernel using RJDBC, we use the MariaDB JDBC driver, which is already downloaded to the notebook instance via the lifecycle configuration script.

You need the following keys and their values from the AWS CloudFormation Outputs tab:

  • AuroraClusterDBName – Aurora cluster database name
  • AuroraClusterEndpointWithPort – Aurora cluster endpoint address with port number
  • AuroraClusterSecret – Aurora cluster credentials secret ARN

The CloudFormation template creates a secret for the Aurora cluster in Secrets Manager.

  1. On the AWS Secrets Manager Console, locate the secret as denoted by the AuroraClusterSecret key value.
  2. In the Secret value section, choose Retrieve secret value to get the user name and password for the Aurora cluster.

To connect to the cluster, you follow similar steps as with other services.

  1. On the Amazon SageMaker Console, open your notebook.
  2. Choose Open Jupyter.
  3. Upload the ipynb notebook.
    1. Alternatively, open a new notebook and enter the code as mentioned in “aurora_connect.ipynb”, replacing the values for AuroraClusterEndpointWithPort, AuroraClusterDBName, and AuroraClusterSecret:
  4. Run all the cells in the notebook to connect Amazon Aurora on the Amazon SageMaker R console.

Conclusion

In this post, we demonstrated how to connect to various data sources, such as Hive and PrestoDB on Amazon EMR, Amazon Athena, Amazon Redshift, and Amazon Aurora MySQL-compatible cluster, in your environment to analyze, profile, run statistical computions using R from Amazon SageMaker. You can extend this method to other data sources via JDBC.


Author Bio

Kunal Ghosh is a Solutions Architect at AWS. His passion is building efficient and effective solutions on the cloud, especially involving analytics, AI, data science, and machine learning. Besides family time, he likes reading, swimming, biking, and watching movies, and he is a foodie.

 

 

 

Gagan Brahmi is a Specialist Solutions Architect focused on Big Data & Analytics at Amazon Web Services. Gagan has over 15 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

 

 

 

 

Read More

Training a custom single class object detection model with Amazon Rekognition Custom Labels

Training a custom single class object detection model with Amazon Rekognition Custom Labels

Customers often need to identify single objects in images; for example, to identify their company’s logo, find a specific industrial or agricultural defect, or locate a specific event, like hurricanes, in satellite scans. In this post, we showcase how to train a custom model to detect a single object using Amazon Rekognition Custom Labels.

Amazon Rekognition is a fully managed service that provides computer vision (CV) capabilities for analyzing images and video at scale, using deep learning technology without requiring machine learning (ML) expertise. Amazon Rekognition Custom Labels lets you extend the detection and classification capabilities of the Amazon Rekognition pre-trained APIs by using data to train a custom CV model specific to your business needs. With the latest update to support single object training, Amazon Rekognition Custom Labels now lets you create a custom object detection model with single object classes.

Solution overview

To show you how the single class object detection feature works, we create a custom model to detect pizza in images. Because we only care about finding pizza in our images, we don’t want to create labels for other food types or create a “not pizza” label.

To create our custom model, we follow these steps:

  1. Create a project in Amazon Rekognition Custom Labels.
  2. Create a dataset with images containing one or more pizzas.
  3. Label the images by applying bounding boxes on all pizzas in the images using the user interface provided by Amazon Rekognition Custom Labels.
  4. Train the model and evaluate the performance.
  5. Test the new custom model using the automatically generated API endpoint.

Amazon Rekognition Custom Labels lets you manage the ML model training process on the Amazon Rekognition console, which simplifies the end-to-end process.

Creating your project

To create your pizza-detection project, complete the following steps:

  1. On the Amazon Rekognition console, choose Custom Labels.
  2. Choose Get Started.
  3. For Project name, enter PizzaDetection.
  4. Choose Create project

You can also create a project on the Projects page. You can access the Projects page via the left navigation pane.

Creating your dataset

To create your pizza model, you first need to create a dataset to train the model with. For this post, our dataset is composed of 39 images that contain pizza. We sourced our images from pexels.com.

To create your dataset:

  1. Choose Create dataset.
  2. Select Upload images from your computer.

  1. Choose Add Images.
  2. Upload your images. You can always add more images later.

Labeling the images with bounding boxes

You’re now ready to label the images by applying bounding boxes on all images with pizza.

  1. Add Pizza as a label to your dataset via the labels list on the left side of the gallery.
  2. Apply the label to the pizzas in the images by selecting all the images with pizza and choosing Draw Bounding Box.

You can use the Shift key to automatically select multiple images between the first and last selected images.

Make sure to draw a bounding box that covers the pizza as tightly as possible.

Training your model

After you label your images, you’re ready to train your model.

  1. Choose Train Model.
  2. For Choose project, choose your PizzaDetection project.
  3. For Choose training dataset, choose your PizzaImages dataset.

As part of the training, Amazon Rekognition Custom Labels requires a labeled test dataset. You use the text dataset to verify how well the trained model predicts the correct labels and generate evaluation metrics. You don’t use the images in the test dataset to train your model; they should represent the types of images you want your model to analyze.

  1. For Create test set, choose how you want to provide your test dataset.

Amazon Rekognition Custom Labels provides three options:

  • Choose an existing test dataset
  • Create a new test dataset
  • Split training dataset

For this post, we select Split training dataset and let Amazon Rekognition hold back 20% of the images for testing and use the remaining 80% of the images to train the model.

Our model took approximately 1 hour to train. The training time required for your model depends on many factors, including the number of images provided in the dataset and the complexity of the model.

When training is complete, Amazon Rekognition Custom Labels outputs key metrics with every training, including F1 score, precision, recall, and the assumed threshold for each label. For more information about metrics, see Metrics for Evaluating Your Model.

Looking at our evaluation results, our model has a precision of 1.0, which means that no objects were mistakenly identified as pizza (false positives) in our test set. Our model did miss some pizzas in our test set (false negatives), which is reflected in our recall score of 0.81. You can often use the F1 score as an overall quality score because it takes both precision and recall into account. Finally, we see that our assumed threshold to generate the F1 score, precision, and recall metrics for Pizza is 0.61. By default, our model returns predictions above this assumed threshold. We can increase the recall for this model if we lower the confidence threshold. However, this would most likely cause a drop in precision.

We can also choose View Test Results to see each test image and how our model performed. The following screenshot shows an example of a correctly identified image of pizza during the model testing (true positive).

Testing your model

Your custom pizza detection model is now ready for use. Amazon Rekogntion Custom Labels provides the API calls for starting and using the model; you don’t need to deploy, provision, or manage any infrastructure. The following screenshot shows the API calls for using the model.

By using the API, we tried our model on a new test set of images from pexels.com.

For example, the following image shows a pizza on a table with other objects.

The model detects the pizza with a confidence of 91.72% and a correct bounding box. The following code is the JSON response received by the API call:

{
    "CustomLabels": [
        {
            "Name": "Pizza",
            "Confidence": 91.7249984741211,
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.7824199795722961,
                    "Height": 0.3644999861717224,
                    "Left": 0.11868999898433685,
                    "Top": 0.37672001123428345
                }
            }
        }
    ]
}

The following image has a confidence score of 98.40.

The following image has a confidence score of 96.51.

The following image has an empty JSON result, as expected, because the image doesn’t contain pizza.

The following image also has an empty JSON result.

In addition to using the API, you can also use the Custom Labels Demonstration. This AWS CloudFormation template enables you to set up a custom, password-protected UI where you can start and stop your models and run demonstration inferences.

Conclusion

In this post, we showed you how to create a single class object detection model with Amazon Rekognition Custom Labels. This feature makes it easy to train a custom model that can detect an object class without needing to specify other objects or losing accuracy in its results.

For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?


About the Author

Woody Borraccino is a Senior AI Solutions Architect at AWS.

 

 

 

 

Read More

Increasing the relevance of your Amazon Personalize recommendations by leveraging contextual information

Increasing the relevance of your Amazon Personalize recommendations by leveraging contextual information

Getting relevant recommendations in front of your users at the right time is a crucial step for the success of your personalization strategy. However, your customer’s decision-making process shifts depending on the context at the time when they’re interacting with your recommendations. In this post, I show you how to set up and query a context-aware Amazon Personalize deployment.

Amazon Personalize allows you to easily add sophisticated personalization capabilities to your applications by using the same machine learning (ML) technology used on Amazon.com for over 20 years. No ML experience is required. Amazon Personalize supports the automatic adjustment of recommendations based on contextual information about your user, such as device type, location, time of day, or other information you provide.

The Harvard study How Context Affects Choice defines context as factors that can influence the choice outcome by altering the process by which a decision is made. As a business owner, you can identify this context by analyzing how your customers shop differently when accessing your catalog from a phone vs. a computer, or seeing the shift in your customer’s content consumption on rainy vs. sunny days.

Leveraging your user’s context allows you to provide a more personalized experience for existing users and helps decrease the cold-start phase for new or unidentified users. The cold-start phase refers to the period when your recommendation engine provides non-personalized recommendations due to the lack of historical information regarding that user.

Adding context to Amazon Personalize

You can set up and use context in Amazon Personalize in four simple steps:

  1. Include your user’s context in the historical user-item interactions dataset.
  2. Train a context aware solution with a User Personalization or Personalized Ranking recipe. A recipe refers to the algorithm your recommender is trained on using the behavioral data specified in your interactions dataset plus any user or items metadata.
  3. Specify the user’s context when querying for real-time recommendations using the GetRecommendations or GetPersonalizedRanking
  4. Include your user’s context when recording events using the event tracker.

The following diagram illustrates the architecture of these steps.

You want to be explicit about the context to consider when constructing datasets. A common example of context customers actively use is device type, such as a phone, tablet, or desktop. The study The Effect of Device Type on Buying Behavior in Ecommerce: An Exploratory Study from the University of Twente in the Netherlands has proven that device type has an influence on buying behavior and people might postpone a buying decision if they’re online with the wrong device type. Embedding device type context in your datasets allows Amazon Personalize to learn this pattern and, at inference time, recommend the most appropriate content with awareness of the user’s context.

Recommendations use case

For this use case, a travel enthusiast is our potential customer. They look at a few things when deciding which airline to travel with to their given destination. For example, is it a short or a long flight? Will the trip be booked with cash or with miles? Are they traveling alone? Where are they be departing and returning to? After they answer these initial questions, the next big decision is picking the cabin type to fly in. If our travel enthusiast is flying in a high-end cabin type, we can assume they’re looking at which airline provides the best experience possible. Now that we have a good idea on what our user is looking for, it’s shopping time!

Consider some of the variables that go into the decision-making process of this use case. We can’t control many of these factors, but we can use some to tailor our recommendations. First, identify common denominators that might affect a user’s behavior. In this case, flight duration and cabin type are good candidates to use as context, and traveler type and traveler residence are good candidates for user metadata when building our recommendation datasets. Metadata is information you know about your users and items that stays somewhat constant over a period of time, whereas context is environmental information that can shift rapidly across time, influencing your customer’s perception and behavior.

Selecting the most relevant metadata fields in your training datasets and enriching your interactions datasets with context is important for generating relevant user recommendations. In this post, we build an Amazon Personalize deployment that returns a list of airline recommendations for a customer. We add cabin type as the context and traveler residence as the metadata field and observe how recommendations shift based on context and metadata.

Prerequisites

We first need to set up the following Amazon Personalize resources. For full instructions, see Getting Started (Console) to complete the following steps:

  1. Create a dataset group. In this post, we name it airlines-blog-example.
  2. Create an Interactions dataset using the following schema and import data using the interactions_dataset.csv file:
    {
      "type": "record",
      "name": "Interactions",
      "namespace": "com.amazonaws.personalize.schema",
      "fields": [
    {
              "name": "ITEM_ID",
              "type": "string"
          },
          {
              "name": "USER_ID",
              "type": "string"
          },
          {
              "name": "TIMESTAMP",
              "type": "long"
          },
          {
              "name":"CABIN_TYPE",
              "type": "string",
              "categorical": true
          },
          {
            "name": "EVENT_TYPE",
            "type": "string"
          },
          {
            "name": "EVENT_VALUE",
            "type": "float"
          }
      ],
      "version": "1.0"
    }

  3. Create a Users dataset using the following schema and import data using the users_dataset.csv file:
    {
      "type": "record",
      "name": "Users",
      "namespace": "com.amazonaws.personalize.schema",
      "fields": [
        {
          "name": "USER_ID",
          "type": "string"
        },
        {
          "name": "USER_RESIDENCE",
          "type": "string",
          "categorical": true
        }
      ],
      "version": "1.0"
    }

  4. Create a solution. In this post, we use the default solution configurations, except for the following:
    1. Recipeaws-hrnn-metadata
    2. Event type – RATING
    3. Perform HPO – True

Hyperparameter optimization (HPO) is recommended if you want Amazon Personalize to run parallel trainings and experiments to identify the most performant hyperparameters. For more information, see Hyperparameters and HPO.

  1. Create a campaign.

You can set up the preceding resources on the Amazon Personalize console or by following the Jupyter notebook personalize_hrnn_metadata_contextual_example.ipynb example on the GitHub repo.

Exploring your Amazon Personalize resources

We have now created several Amazon Personalize resources, including a dataset group called airlines-blog-example. The dataset group contains two datasets: interactions and users, which contain the data used to train your Amazon Personalize model (also known as a solution). We also created a campaign to provide real-time recommendations.

We can now explore how the interactions and users dataset schemas help our model learn from the context and metadata embedded in the datasets.

Interactions dataset

We provide Amazon Personalize an interactions dataset with a numeric rating (combination of EVENT_TYPE + EVENT_VALUE) that a user (USER_ID) has given an airline (ITEM_ID) when flying in a certain cabin type (CABIN_TYPE) at a given time (TIMESTAMP). By providing this information to Amazon Personalize in the dataset and schema, we can add CABIN_TYPE as the context when querying the recommendations for a user and recording new interactions through the event tracker. At training time, the model automatically identifies important features from this data (for our use case, the highest rated airlines across cabin types).

The following screenshot showcases a small portion of the interactions_dataset.csv file.

User dataset

We also provide Amazon Personalize a user dataset with the users (USER_ID) who provided the ratings in the interactions dataset, assuming that they gave the rating from their country of residence (USER_RESIDENCE). In this use case, USER_RESIDENCE is the metadata we picked for these users. By providing USER_RESIDENCE as user metadata, the model can learn which airlines are interacted with the most by users across countries and regions, so when we query for recommendations, it takes USER_RESIDENCE in consideration. For example, users in Asia see different airline options compared to users in South America or Europe.

The following screenshot shows a small portion of the user_dataset.csv file.

The raw dataset of user airlines ratings from Skytrax contains 20 columns with over 40,000 records. In this post, we use a modified version of this dataset and split the most relevant columns of the raw dataset into two datasets (users and interactions). For more information about splitting the data in a Jupyter notebook, see personalize_hrnn_metadata_contextual_example.ipynb on the GitHub repo.

The next section shows how context and metadata influence the real-time recommendations provided by your Amazon Personalize campaign.

Applying context to your Amazon Personalize real-time recommendations queries

During this test, we observe the effect that context has on the recommendations provided to users. In our use case, we have an interactions dataset of numerical airline ratings from multiple users. In our schemas, the cabin type is included as a categorical value for the interactions dataset and the user residence as a metadata field in the users dataset. Our theory is that by adding the cabin type as context, the airline recommendations will shift to account for it.

  1. On your Amazon Personalize dataset group dashboard, choose View campaigns.
  2. Choose your newly created campaign.
  3. For User ID, enter JDowns.
  4. Choose Get recommendations.

You should see a Test campaign results page similar to the following screenshot.

We initially queried a list of airlines for our user without any context. We now focus on the top 10 recommendations and verify that they shift based on the context. We can add the context via the console by providing a key and value pair. In our use case, the key is CABIN_TYPE and the value can be one of the following:

  • Economy
  • Premium Economy
  • Business Class
  • First Class

The following two screenshots show our results for querying recommendations for the same user with Economy and First Class as values for the CABIN_TYPE context. The economy context doesn’t shift the top 10 list, but the first class context does have an effect—bumping Alaska Airlines to first place on the list.

You can explore your users_dataset.csv file for additional users to test your recommendations API, and a very similar shift of recommendations based on the context you include in the API call. You can also find that the airlines list shifts based on the User Residency metadata field. For example, the following screenshots show the top 10 recommendations for our JDowns user, who has United States as the value for User Residency, compared to the PhillipHarris user, who has France as the value for User Residency.

Conclusion

As shown in this post, adding context to your recommendation strategy is a very powerful and easy-to-implement exercise when using Amazon Personalize. The benefits of enriching your recommendations with context can result in an increase in your user engagement, which eventually leads to an increase in the revenue influenced by your recommendations.

This post showed you how to create an Amazon Personalize context-aware deployment and an end-to-end test of getting real-time recommendations applying context via the Amazon Personalize console. For instructions on using a Jupyter environment to set up the Amazon Personalize infrastructure and get recommendations using the Boto3 Python SDK, see personalize_hrnn_metadata_contextual_example.ipynb on the GitHub repo.

There’s even more that you can do with Amazon Personalize. For more information about core use cases and automation examples, see the GitHub repo.

If this post helps you or inspires you to solve a problem, share your thoughts and questions in the comments.


About the Author

Luis Lopez Soria is an AI/ML specialist solutions architect working with the AWS machine learning team. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys playing sports, traveling around the world, and exploring new foods and cultures.

 

 

 

Read More

Amazon Forecast can now use Convolutional Neural Networks (CNNs) to train forecasting models up to 2X faster with up to 30% higher accuracy

Amazon Forecast can now use Convolutional Neural Networks (CNNs) to train forecasting models up to 2X faster with up to 30% higher accuracy

We’re excited to announce that Amazon Forecast can now use Convolutional Neural Networks (CNNs) to train forecasting models up to 2X faster with up to 30% higher accuracy. CNN algorithms are a class of neural network-based machine learning (ML) algorithms that play a vital role in Amazon.com’s demand forecasting system and enable Amazon.com to predict demand for over 400 million products every day. For more information about Amazon.com’s journey building demand forecasting technology using CNN models, watch the re:MARS 2019 keynote video. Forecast brings the same technology used at Amazon.com into the hands of everyday developers as a fully managed service. Anyone can start using Forecast, without any prior ML experience, by using the Forecast console or the API.

Forecasting is the science of predicting the future. By examining historical trends, businesses can make a call on what might happen and when, and build that into their future plans for everything from product demand to inventory to staffing. Given the consequences of forecasting, accuracy matters. If a forecast is too high, businesses over-invest in products and staff, which ends up as wasted investment. If the forecast is too low, they under-invest, which leads to a shortfall in inventory and a poor customer experience. Today, businesses try to use everything from simple spreadsheets to complex financial planning software to generate forecasts, but high accuracy remains elusive for two reasons:

  • Traditional forecasts struggle to incorporate very large volumes of historical data, missing out on important signals from the past that are lost in the noise.
  • Traditional forecasts rarely incorporate related but independent data, which can offer important context (such as sales, holidays, locations, and marketing promotions). Without the full history and the broader context, most forecasts fail to predict the future accurately.

At Amazon, we have learned over the years that no one algorithm delivers the most accurate forecast for all types of data. Traditional statistical models have been useful in predicting demand for products that have regular demand patterns, such as sunscreen lotions in the summer and woolen clothes in the winter. However, statistical models can’t deliver accurate forecasts for more complex scenarios, such as frequent price changes, differences between regional versus national demand, products with different selling velocities, and the addition of new products. Sophisticated deep learning models can provide higher accuracy in these use cases. Forecast automatically examines your data and selects the best algorithm across a set of statistical and deep learning algorithms to train the more accurate forecasting model for your data. With the addition of the CNN-based deep learning algorithm, Forecast can now further improve accuracy by up to 30% and train models up to 2X faster compared to the currently supported algorithms. This new algorithm can more accurately detect leading indicators of demand, such as pre-order information, product page visits, price changes, and promotional spikes, to build more accurate forecasts.

More Retail, a market leader in the fresh food and grocery category in India, participated in a beta test of the new CNN algorithm, with the help of Ganit, an analytics partner. Supratim Banerjee, Chief Transformation Officer at More Retail Limited, says, “At More, we rapidly innovate to sustain our business and beat competition. We have been looking for opportunities to reduce wastage due to over stocking, while continuing to meet customer demand. In our experiments for the fresh produce category, we found the new CNN algorithm in Amazon Forecast to be 1.7X more accurate compared to our existing forecasting system. This translates into massive cost savings for our business.”

Training a CNN predictor and creating forecasts

You can start using CNNs in Forecast through the CreatePredictor API or on the Forecast console. In this section, we walk through a series of steps required to train a CNN predictor and create forecasts within Forecast.

  1. On the Forecast console, create a dataset group.

  1. Upload your dataset.

  1. Choose Predictors from the navigation pane.
  2. Choose Train predictor.

  1. For Algorithm selection, select Manual.
  2. For Algorithm, choose CNN-QR.

To manually select CNN-QR through the CreatePredictor API, use arn:aws:forecast:::algorithm/CNN-QR for the AlgorithmArn.

When you choose CNN-QR from the drop-down menu, the Advanced Configuration section auto-expands.

  1. To let Forecast train the most optimized and accurate CNN model for your data, select Perform hyperparameter optimization (HPO).
  2. After you enter all your details on the Predictors page, choose Train predictor.

After your predictor is trained, you can view its details by choosing your predictor on the Predictors page. On the predictor’s details page, you can view the accuracy metrics and optimized hyperparameters for your model.

  1. Now that your model is trained, choose Forecasts from the navigation name.
  2. Choose Create a forecast.
  3. Create a forecast using your trained predictor.

You can generate forecasts at any quantile to balance your under-forecasting and over-forecasting costs.

Choosing the most accurate model with Forecast

With this launch, Forecast now supports one proprietary CNN model, one proprietary RNN model, and four other statistical models: Prophet, NPTS (Amazon proprietary), ARIMA, and ETS. The new CNN model is part of AutoML. We recommend always starting your experimentation with AutoML, in which Forecast finds the most optimized and accurate model for your dataset.

  1. On the Train predictor page, for Algorithm selection, select Automatic (AutoML).

  1. After your predictor is trained using AutoML, choose the predictor to see more details on the chosen algorithm.
  2. On the predictor’s details page, in the Algorithm metrics section, choose different algorithms from the drop-down menu to view their accuracy for comparison.

Tips and best practices

As you begin to experiment with CNNs and build your demand planning solutions on top of Forecast, consider the following tips and best practices:

  • For experimentation, start by identifying the most important item IDs for your business that you are looking to improve your forecasting accuracy. Measure the accuracy of your existing forecasting methodology as a baseline.
  • Use Forecast with only your target time series and assess the wQuantileLoss accuracy metric. We recommend selecting AutoML in Forecast to find the most optimized and accurate model for your data. For more information, see Evaluating Predictor Accuracy.
  • AutoML optimizes for accuracy and not training time, so AutoML may take longer to optimize your model. If training time is a concern for you, we recommend manually selecting CNN-QR and assessing its accuracy and training time. A slight degradation in accuracy may be an acceptable trade-off for considerable gains in training time.
  • After you see an increase in accuracy over your baseline, we recommend experimenting to find the right forecasting quantile that balances your under-forecasting and over-forecasting costs to your business.
  • We recommend deploying your model as a continuous workload within your systems to start reaping the benefits of more accurate forecasts. You can continue to experiment by adding related time series and item metadata to further improve the accuracy.
  • Incrementally add related time series or item metadata to train your model to assess whether additional information improves accuracy. Different combinations of related time series and item metadata can give you different results.

Conclusion

The new CNN algorithm is available in all Regions where Forecast is publicly available. For more information about Region availability, see Region Table. For more information about the CNN algorithm, see CNN-QR algorithm documentation.


About the authors

Namita Das is a Sr. Product Manager for Amazon Forecast. Her current focus is to democratize machine learning by building no-code/low-code ML services. She frequently advises startups and has started dabbling in baking.

 

 

 

Danielle Robinson is an Applied Scientist on the Amazon Forecast team. Her research is in time series forecasting and in particular how we can apply new neural network-based algorithms within Amazon Forecast. Her thesis research was focused on developing new, robust, and physically accurate numerical models for computational fluid dynamics. Her hobbies include cooking, swimming, and hiking.

 

 

Aaron Spieler is a working student in the Amazon Forecast team. He is starting his masters degree at the University of Tuebingen, and studied Data Engineering at Hasso Plattner Institute after obtaining a BS in Computer Science from University of Potsdam. His research interests span time series forecasting (especially using neural network models), machine learning, and computational neuroscience.

 

 

 

Gunjan Garg: Gunjan Garg is a Sr. Software Development Engineer in the AWS Vertical AI team. In her current role at Amazon Forecast, she focuses on engineering problems and enjoys building scalable systems that provide the most value to end-users. In her free time, she enjoys playing Sudoku and Minesweeper.

 

 

 

Chinmay Bapat is a Software Development Engineer in the Amazon Forecast team. His interests lie in the applications of machine learning and building scalable distributed systems. Outside of work, he enjoys playing board games and cooking.

 

 

 

 

 

 

Read More

Securing Amazon Comprehend API calls with AWS PrivateLink

Securing Amazon Comprehend API calls with AWS PrivateLink

Amazon Comprehend now supports Amazon Virtual Private Cloud (Amazon VPC) endpoints via AWS PrivateLink so you can securely initiate API calls to Amazon Comprehend from within your VPC and avoid using the public internet.

Amazon Comprehend is a fully managed natural language processing (NLP) service that uses machine learning (ML) to find meaning and insights in text. You can use Amazon Comprehend to analyze text documents and identify insights such as sentiment, people, brands, places, and topics in text. No ML expertise required.

Using AWS PrivateLink, you can access Amazon Comprehend easily and securely by keeping your network traffic within the AWS network, while significantly simplifying your internal network architecture. It enables you to privately access Amazon Comprehend APIs from your VPC in a scalable manner by using interface VPC endpoints. A VPC endpoint is an elastic network interface in your subnet with a private IP address that serves as the entry point for all Amazon Comprehend API calls.

In this post, we show you how to set up a VPC endpoint and enforce the use of this private connectivity for all requests to Amazon Comprehend using AWS Identity and Access Management (IAM) policies.

Prerequisites

For this example, you should have an AWS account and sufficient access to create resources in the following services:

Solution overview

The walkthrough includes the following high-level steps:

  1. Deploy your resources.
  2. Create VPC endpoints.
  3. Enforce private connectivity with IAM.
  4. Use Amazon Comprehend via AWS PrivateLink.

Deploying your resources

For your convenience, we have supplied an AWS CloudFormation template to automate the creation of all prerequisite AWS resources. We use the us-east-2 Region in this post, so the console and URLs may differ depending on the Region you select. To use this template, complete the following steps:

  1. Choose Launch Stack:
  2. Confirm the following parameters, which you can leave at the default values:
    1. SubnetCidrBlock1 – The primary IPv4 CIDR block assigned to the first subnet. The default value is 10.0.1.0/24.
    2. SubnetCidrBlock2 – The primary IPv4 CIDR block assigned to the second subnet. The default value is 10.0.2.0/24.
  3. Acknowledge that AWS CloudFormation may create additional IAM resources.
  4. Choose Create stack.

The creation process should take roughly 10 minutes to complete.

The CloudFormation template creates the following resources on your behalf:

  • A VPC with two private subnets in separate Availability Zones
  • VPC endpoints for private Amazon S3 and Amazon Comprehend API access
  • IAM roles for use by Lambda and Amazon Comprehend
  • An IAM policy to enforce the use of VPC endpoints to interact with Amazon Comprehend
  • An IAM policy for Amazon Comprehend to access data in Amazon S3
  • An S3 bucket for storing open-source data

The next two sections detail how to manually create a VPC endpoint for Amazon Comprehend and enforce usage with an IAM policy. If you deployed the CloudFormation template and prefer to skip to testing the API calls, you can advance to the Using Amazon Comprehend via AWS PrivateLink section.

Creating VPC endpoints

To create a VPC endpoint, complete the following steps:

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create Endpoint.
  3. For Service category, select AWS services.
  4. For Service Name, choose amazonaws.us-east-2.comprehend.
  5. For VPC, enter the VPC you want to use.
  6. For Availability Zone, select your preferred Availability Zones.
  7. For Enable DNS name, select Enable for this endpoint.

This creates a private hosted zone that enables you to access the resources in your VPC using custom DNS domain names, such as example.com, instead of using private IPv4 addresses or private DNS hostnames provided by AWS. The Amazon Comprehend DNS hostname that the AWS Command Line Interface (CLI) and Amazon Comprehend SDKs use by default (https://comprehend.Region.amazonaws.com) resolves to your VPC endpoint.

  1. For Security group, choose the security group to associate with the endpoint network interface.

If you don’t specify a security group, the default security group for your VPC is associated.

  1. Choose Create Endpoint.

When the Status changes to available, your VPC endpoint is ready for use.

  1. Choose the Policy tab to apply more restrictive access control to the VPC endpoint.

The following example policy limits VPC endpoint access to an IAM role used by a Lambda function in our deployment. You should apply the principle of least privilege when defining your own policy. For more information, see Controlling access to services with VPC endpoints.

{                        
    "Version": "2012-10-17",
    "Statement": [    
        {                
            "Action": [
                "comprehend:DetectEntities",
                "comprehend:CreateDocumentClassifier"
            ],           
            "Resource": [
                "*"  
            ],           
            "Effect": "Allow",
            "Principal": {
                "AWS": [
"arn:aws:iam::#########:role/ComprehendPrivateLink-LambdaExecutionRole"                                         
                ]    
            }            
        }                
    ]                    
}

Enforcing private connectivity with IAM

To allow or deny access to Amazon Comprehend based on the use of a VPC endpoint, we include an aws:sourceVpce condition in the IAM policy. The following example policy provides access specifically to the DetectEntities and CreateDocumentClassifier APIs only when the request utilizes your VPC endpoint. You can include additional Amazon Comprehend APIs in the “Action” section of the policy or use “comprehend:*” to include them all. You can attach this policy to an IAM role to enable compute resources hosted within your VPC to interact with Amazon Comprehend.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ComprehendEnforceVpce",
            "Effect": "Allow",
            "Action": [
                "comprehend:CreateDocumentClassifier",
                "comprehend:DetectEntities"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceVpce": "vpce-xxxxxxxx"
                }
            }
        },
        {
            "Sid": "PassRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::#########:role/ComprehendDataAccessRole"
        }
    ]
}

You should replace the VPC endpoint ID with the endpoint ID you created earlier. Permission to invoke the PassRole API is required for asynchronous operations in Amazon Comprehend like CreateDocumentClassifer and should be scoped to your specific data access role.

Using Amazon Comprehend via AWS PrivateLink

To start using Amazon Comprehend with AWS PrivateLink, you perform the following high-level steps:

  1. Review the Lambda function for API testing.
  2. Create the DetectEntities test event.
  3. Train a custom classifier.

Reviewing the Lambda function

To review your Lambda function, on the Lambda console, choose the Lambda function that contains ComprehendPrivateLink in its name.

The VPC section of the Lambda console provides links to the various networking components automatically created for you during the CloudFormation deployment.

The function code includes a sample program that takes user input to invoke the specific Amazon Comprehend APIs supported by our example IAM policy.

Creating a test event

In this section, we create an event to detect entities within sample text using a pretrained model.

  1. From the Test drop-down menu, choose Create new test event.
  2. For Event name, enter a name (for example, DetectEntities).
  3. Replace the event JSON with the following code:
    {
      "comprehend_api": "DetectEntities",
      "language_code": "en",
      "text": "Amazon.com, Inc. is located in Seattle, WA and was founded July 5th, 1994 by Jeff Bezos, allowing customers to buy everything from books to blenders."
    }

  4. Choose Save to store the test event.
  5. Choose Save to update the Lambda function.
  6. Choose Test to invoke the DetectEntities API.

The response should include results similar to the following code:

{
    "Entities": [
        {
            "Score": 0.9266431927680969,
            "Type": "ORGANIZATION",
            "Text": "Amazon.com, Inc.",
            "BeginOffset": 0,
            "EndOffset": 16
        },
        {
            "Score": 0.9952651262283325,
            "Type": "LOCATION",
            "Text": "Seattle, WA",
            "BeginOffset": 31,
            "EndOffset": 42
        },
        {
            "Score": 0.9998188018798828,
            "Type": "DATE",
            "Text": "July 5th, 1994",
            "BeginOffset": 59,
            "EndOffset": 73
        },
        {
            "Score": 0.9999810457229614,
            "Type": "PERSON",
            "Text": "Jeff Bezos",
            "BeginOffset": 77,
            "EndOffset": 87
        }
    ]
}

You can update the test event to identify entities from your own text.

Training a custom classifier

We now demonstrate how to build a custom classifier. For training data, we use a version of the Yahoo answers corpus that is preprocessed into the format expected by Amazon Comprehend. This corpus, available on the AWS Open Data Registry, is cited in the paper Text Understanding from Scratch by Xiang Zhang and Yann LeCun. It is also used in the post Building a custom classifier using Amazon Comprehend.

  1. Retrieve the training data from Amazon S3.
  2. On the Amazon S3 console, choose the example S3 bucket created for you.
  3. Choose Upload and add the file you retrieved.
  4. Choose the uploaded object and note the Key.
  5. Return to the test function on the Lambda console.
  6. From the Test drop-down menu, choose Create new test event.
  7. For Event name, enter a name (for example, TrainCustomClassifier).
  8. Replace the event input with the following code:
    {
      "comprehend_api": "CreateDocumentClassifier",
      "custom_classifier_name": "custom-classifier-example",
      "language_code": "en",
      "training_data_s3_key": "comprehend-train.csv"
    }

  9. If you changed the default file name, update the training_data_s3_key to match.
  10. Choose Save to store the test event.
  11. Choose Save to update the Lambda function.
  12. Choose Test to invoke the CreateDocumentClassifier API.

The response should include results similar to the following code:

{
"DocumentClassifierArn": "arn:aws:comprehend:us-east-2:0123456789:document-classifier/custom-classifier-example"
}
  1. On the Amazon Comprehend console, choose Custom classification to check the status of the document classifier training.

After approximately 20 minutes, the document classifier is trained and available for use.

Cleaning Up

To avoid incurring future charges, delete the resources you created during this walkthrough after concluding your testing.

  1. On the Amazon Comprehend console, delete the custom classifier.
  2. On the Amazon S3 console, empty the bucket created for you.
  3. If you launched the automated deployment, on the AWS CloudFormation console, delete the appropriate stack.

The deletion process takes approximately 10 minutes.

Conclusion

You have now successfully invoked Amazon Comprehend APIs using AWS PrivateLink. The use of IAM policies prevents requests from leaving your VPC and further improves your security posture. You can extend this solution to securely test additional features like Amazon Comprehend custom entity recognition real-time endpoints.

All Amazon Comprehend API calls are now supported via AWS PrivateLink. This feature exists in all commercial Regions where AWS PrivateLink and Amazon Comprehend are available. To learn more about securing Amazon Comprehend, see Security in Amazon Comprehend.


About the Authors

Dave Williams is a Cloud Consultant for AWS Professional Services. He works with public sector customers to securely adopt AI/ML services. In his free time, he enjoys spending time with his family, traveling, and watching college football.

 

 

 

Adarsha Subick is a Cloud Consultant for AWS Professional Services based out of Virginia. He works with public sector customers to help solve their AI/ML-focused business problems. In his free time, he enjoys archery and hobby electronics.

 

 

 

Saman Zarandioon is a Sr. Software Development Engineer for Amazon Comprehend. He earned a PhD in Computer Science from Rutgers University.

 

 

 

 

Read More