Configure a custom Amazon S3 query output location and data retention policy for Amazon Athena data sources in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler reduces the time that it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. You can import data from multiple data sources such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Snowflake, and 26 federated query data sources supported by Amazon Athena.

Starting today, when importing data from Athena data sources, you can configure the S3 query output location and data retention period to import data in Data Wrangler to control where and how long Athena stores the intermediary data. In this post, we walk you through this new feature.

Solution overview

Athena is an interactive query service that makes it easy to browse the AWS Glue Data Catalog, and analyze data in Amazon S3 and 26 federated query data sources using standard SQL. When you use Athena to import data, you can use Data Wrangler’s default S3 location for the Athena query output, or specify an Athena workgroup to enforce a custom S3 location. Previously, you had to implement cleanup workflows to remove this intermediary data, or manually set up S3 lifecycle configurations to control storage costs and meet your organization’s data security requirements. This created significant operational overhead and didn’t scale.

Data Wrangler now supports custom S3 locations and data retention periods for your Athena query output. With this new feature, you can change the Athena query output location to a custom S3 bucket. You now have a default data retention policy of 5 days for the Athena query output, and you can change this to meet your organization’s data security requirements. Based on the retention period, the Athena query output in the S3 bucket gets cleaned up automatically. After you import the data, you can perform exploratory data analysis on this dataset and store the clean data back to Amazon S3.

The following diagram illustrates this architecture.

For our use case, we use a sample bank dataset to walk through the solution. The workflow consists of the following steps:

  1. Download the sample dataset and upload it to an S3 bucket.
  2. Set up an AWS Glue crawler to crawl the schema and store the metadata schema in the AWS Glue Data Catalog.
  3. Use Athena to access the Data Catalog to query data from the S3 bucket.
  4. Create a new Data Wrangler flow to connect to Athena.
  5. When creating the connection, set the retention TTL for the dataset.
  6. Use this connection in the workflow and store the clean data in another S3 bucket.

For simplicity, we assume that you have already set up the Athena environment (steps 1–3). We detail the subsequent steps in this post.

Prerequisites

To set up the Athena environment, refer to the User Guide for step-by-step instructions, and complete steps 1–3 as outlined in the previous section.

Import your data from Athena to Data Wrangler

To import your data, complete the following steps:

  1. On the Studio console, choose the Resources icon in the navigation pane.
  2. Choose Data Wrangler on the drop-down menu.
  3. Choose New flow.
  4. On the Import tab, choose Amazon Athena.

    A detail page opens where you can connect to Athena and write a SQL query to import from the database.
  5. Enter a name for your connection.
  6. Expand Advanced configuration.
    When connecting to Athena, Data Wrangler uses Amazon S3 to stage the queried data. By default, this data is staged at the S3 location s3://sagemaker-{region}-{account_id}/athena/ with a retention period of 5 days.
  7. For Amazon S3 location of query results, enter your S3 location.
  8. Select Data retention period and set the data retention period (for this post, 1 day).
    If you deselect this option, the data persists indefinitely. Behind the scenes, Data Wrangler attaches an S3 lifecycle configuration policy to that S3 location to clean it up automatically. See the following example policy:

     "Rules": [
            {
                "Expiration": {
                    "Days": 1
                },
                "ID": "sm-data-wrangler-retention-policy-xxxxxxx",
                "Filter": {
                    "Prefix": "athena/test"
                },
                "Status": "Enabled"
            }
        ]

    Your SageMaker execution role needs the s3:GetLifecycleConfiguration and s3:PutLifecycleConfiguration permissions to correctly apply the lifecycle configuration policies (a sample policy sketch follows this procedure). Without these permissions, you get error messages when you try to import the data.

    The following error message is an example of missing the GetLifecycleConfiguration permission.

    The following error message is an example of missing the PutLifecycleConfiguration permission.

  9. Optionally, for Workgroup, you can specify an Athena workgroup.
    An Athena workgroup isolates users, teams, applications, or workloads into groups, each with its own permissions and configuration settings. When you specify a workgroup, Data Wrangler inherits the workgroup settings defined in Athena. For example, if a workgroup has an S3 location defined to store query results and Override client side settings enabled, you can’t edit the S3 query result location. By default, Data Wrangler also saves the Athena connection for you. This is displayed as a new Athena tile on the Import tab. You can always reopen that connection to query and bring different data into Data Wrangler.
  10. Deselect Save connection if you don’t want to save the connection.
  11. To configure the Athena connection, choose None for Sampling to import the entire dataset.

    For large datasets, Data Wrangler allows you to import a subset of your data to build out your transformation workflow, and only process the entire dataset when you’re ready. This speeds up the iteration cycle and saves processing time and cost. To learn more about the different data sampling options available, visit Amazon SageMaker Data Wrangler now supports random sampling and stratified sampling.
  12. For Data catalog, choose AwsDataCatalog.
  13. For Database, choose your database.

    Data Wrangler displays the available tables. You can choose each table to check the schema and preview the data.
  14. Enter the following code in the query field:
    Select *
    From bank_additional_full

  15. Choose Run to preview the data.
  16. If everything looks good, choose Import.
  17. Enter a dataset name and choose Add to import the data into your Data Wrangler workspace.
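
As noted earlier, your SageMaker execution role needs permission to read and write S3 lifecycle configurations on the query output location. The following is a minimal sketch of attaching such an inline policy with boto3; the role name, policy name, and bucket ARN are hypothetical placeholders, and you can instead add an equivalent statement to the role through the IAM console.

import json

import boto3

iam = boto3.client("iam")

# Hypothetical names; replace with your SageMaker execution role and query output bucket
role_name = "AmazonSageMaker-ExecutionRole-Example"
bucket_arn = "arn:aws:s3:::my-athena-query-output-bucket"

lifecycle_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetLifecycleConfiguration",
                "s3:PutLifecycleConfiguration",
            ],
            "Resource": bucket_arn,
        }
    ],
}

# Attach the policy as an inline policy on the execution role
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="DataWranglerAthenaLifecyclePermissions",
    PolicyDocument=json.dumps(lifecycle_policy),
)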

Analyze and process data with Data Wrangler

After you load the data into Data Wrangler, you can perform exploratory data analysis (EDA) and prepare the data for machine learning.

  1. Choose the plus sign next to the bank-data dataset in the data flow, and choose Add analysis.
    Data Wrangler provides built-in analyses, including a Data Quality and Insights Report, data correlation, a pre-training bias report, a summary of your dataset, and visualizations (such as histograms and scatter plots). Additionally, you can create your own custom visualization.
  2. For Analysis type, choose Data Quality and Insights Report.
    This automatically generates visualizations, analyses to identify data quality issues, and recommendations for the right transformations required for your dataset.
  3. For Target column, choose Y.
  4. Because this is a classification problem, for Problem type, select Classification.
  5. Choose Create.

    Data Wrangler creates a detailed report on your dataset. You can also download the report to your local machine.
  6. For data preparation, choose the plus sign next to the bank-data dataset in the data flow, and choose Add transform.
  7. Choose Add step to start building your transformations.

At the time of this writing, Data Wrangler provides over 300 built-in transformations. You can also write your own transformations using Pandas or PySpark.
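
For example, the following is a minimal sketch of a custom transform written in the Python (Pandas) mode; inside a custom transform step, Data Wrangler makes the current dataset available as a DataFrame named df, and the column names used here come from the bank marketing dataset.

# Custom transform (Python/Pandas): the current dataset is available as df
# Keep rows with a plausible age and add the call duration in minutes
df = df[df["age"].between(18, 95)]
df["duration_min"] = df["duration"] / 60.0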

You can now start building your transforms and analyses based on your business requirements.

Clean up

To avoid ongoing costs, delete the Data Wrangler resources using the steps below when you’re finished.

  1. Choose the Running Instances and Kernels icon.
  2. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  3. Choose Shut down all to confirm.

Conclusion

In this post, we provided an overview of customizing your S3 location and enabling S3 lifecycle configurations for importing data from Athena to Data Wrangler. With this feature, you can store intermediary data in a secured S3 location and automatically remove the data copy after the retention period, reducing the risk of unauthorized access to the data. We encourage you to try out this new feature. Happy building!

To learn more about Athena and SageMaker, visit the Athena User Guide and Amazon SageMaker Documentation.


About the authors

 Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Harish Rajagopalan is a Senior Solutions Architect at Amazon Web Services. Harish works with enterprise customers and helps them with their cloud journey.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Read More

Use RStudio on Amazon SageMaker to create regulatory submissions for the life sciences industry

Pharmaceutical companies seeking approval from regulatory agencies such as the US Food & Drug Administration (FDA) or Japanese Pharmaceuticals and Medical Devices Agency (PMDA) to sell their drugs on the market must submit evidence to prove that their drug is safe and effective for its intended use. A team of physicians, statisticians, chemists, pharmacologists, and other clinical scientists review the clinical trial submission data and proposed labeling. If the review establishes that there is sufficient statistical evidence to prove that the health benefits of the drug outweigh the risks, the drug is approved for sale.

The clinical trial submission package consists of tabulated data, analysis data, trial metadata, and statistical reports consisting of statistical tables, listings, and figures. In the case of the US FDA, the electronic common technical document (eCTD) is the standard format for submitting applications, amendments, supplements, and reports to the FDA’s Center for Biologics Evaluation and Research (CBER) and Center for Drug Evaluation and Research (CDER). For the FDA and Japanese PMDA, it’s a regulatory requirement to submit tabulated data in CDISC Standard Data Tabulation Model (SDTM), analysis data in CDISC Analysis Dataset Model (ADaM), and trial metadata in CDISC Define-XML (based on Operational Data Model (ODM)).

In this post, we demonstrate how we can use RStudio on Amazon SageMaker to create such regulatory submission deliverables. This post describes the clinical trial submission process, how we can ingest clinical trial research data, tabulate and analyze the data, and then create statistical reports—summary tables, data listings, and figures (TLF). This method can enable pharmaceutical customers to seamlessly connect to clinical data stored in their AWS environment, process it using R, and help accelerate the clinical trial research process.

Drug development process

The drug development process can broadly be divided into five major steps, as illustrated in the following figure.

Drug Development Process

It takes on average 10–15 years and approximately USD 1–3 billion for one drug out of around 10,000 potential molecules to receive a successful approval. During the early phases of research (the drug discovery phase), promising drug candidates are identified, which move further to preclinical research. During the preclinical phase, researchers try to find out the toxicity of the drug by performing in vitro experiments in the lab and in vivo experiments on animals. After preclinical testing, drugs move on to the clinical trial research phase, where they must be tested on humans to ascertain their safety and efficacy. The researchers design clinical trials and detail the study plan in the clinical trial protocol. They define the different clinical research phases: from small Phase 1 studies to determine drug safety and dosage, to bigger Phase 2 trials to determine drug efficacy and side effects, to even bigger Phase 3 and 4 trials to determine drug efficacy, safety, and adverse reaction monitoring. After successful human clinical trials, the drug sponsor files a New Drug Application (NDA) to market the drug. The regulatory agencies review all the data, work with the sponsor on prescription labeling information, and approve the drug. After the drug’s approval, the regulatory agencies review post-market safety reports to ensure the product’s safety.

In 1997, the Clinical Data Interchange Standards Consortium (CDISC), a global non-profit organization comprising pharmaceutical companies, CROs, biotech companies, academic institutions, healthcare providers, and government agencies, was started as a volunteer group. CDISC has published data standards to streamline the flow of data from collection through submission, and to facilitate data interchange between partners and providers. CDISC has published the following standards:

  • CDASH (Clinical Data Acquisition Standards Harmonization) – Standards for collected data
  • SDTM (Study Data Tabulation Model) – Standards for submitting tabulated data
  • ADaM (Analysis Data Model) – Standards for analysis data
  • SEND (Standard for Exchange of Nonclinical Data) – Standards for nonclinical data
  • PRM (Protocol Representation Model) – Standards for protocol

These standards can help trained reviewers analyze data more effectively and quickly using standard tools, thereby reducing drug approval times. It’s a regulatory requirement from the US FDA and Japanese PMDA to submit all tabulated data using the SDTM format.

R for clinical trial research submissions

SAS and R are two of the most widely used statistical analysis software packages in the pharmaceutical industry. When CDISC started developing the SDTM standards, SAS was in almost universal use in the pharmaceutical industry and at the FDA. However, R is gaining tremendous popularity because it’s open source, and new packages and libraries are continuously added. Students primarily use R during their academic studies and research, and they take this familiarity with R to their jobs. R also offers support for emerging technologies such as advanced deep learning integrations.

Cloud providers such as AWS have now become the platform of choice for pharmaceutical customers to host their infrastructure. AWS also provides managed services such as SageMaker, which makes it effortless to create, train, and deploy machine learning (ML) models in the cloud. SageMaker also allows access to the RStudio IDE from anywhere via a web browser. This post details how statistical programmers and biostatisticians can ingest their clinical data into the R environment, how R code can be run, and how results are stored. We provide snippets of code that allow clinical trial data scientists to ingest XPT files into the R environment, create R data frames for SDTM and ADaM, and finally create TLF that can be stored in an Amazon Simple Storage Service (Amazon S3) object storage bucket.

RStudio on SageMaker

On November 2, 2021, AWS in collaboration with RStudio PBC announced the general availability of RStudio on SageMaker, the industry’s first fully managed RStudio Workbench IDE in the cloud. You can now bring your current RStudio license to easily migrate your self-managed RStudio environments to SageMaker in just a few simple steps. To learn more about this exciting collaboration, check out Announcing RStudio on Amazon SageMaker.

Along with the RStudio Workbench, the RStudio suite for R developers also offers RStudio Connect and RStudio Package Manager. RStudio Connect is designed to allow data scientists to publish insights, dashboards, and web applications. It makes it easy to share ML and data science insights from data scientists’ complicated work and put it in the hands of decision-makers. RStudio Connect also makes hosting and managing content simple and scalable for wide consumption.

Solution overview

In the following sections, we discuss how we can import raw data from a remote repository or S3 bucket in RStudio on SageMaker. It’s also possible to connect directly to Amazon Relational Database Service (Amazon RDS) and data warehouses like Amazon Redshift (see Connecting R with Amazon Redshift) directly from RStudio; however, this is outside the scope of this post. After data has been ingested from a couple of different sources, we process it and create R data frames for a table. Then we convert the table data frame into an RTF file and store the results back in an S3 bucket. These outputs can then potentially be used for regulatory submission purposes, provided the R packages used in the post have been validated for use for regulatory submissions by the customer.

Set up RStudio on SageMaker

For instructions on setting up RStudio on SageMaker in your environment, refer to Get started with RStudio on SageMaker. Make sure that the execution role of RStudio on SageMaker has access to download and upload data to the S3 bucket in which data is stored. To learn more about how to manage R packages and publish your analysis using RStudio on SageMaker, refer to Announcing Fully Managed RStudio on SageMaker for Data Scientists.

Ingest data into RStudio

In this step, we ingest data from various sources to make it available for our R session. We import data in SAS XPT format; however, the process is similar if you want to ingest data in other formats. One of the advantages of using RStudio on SageMaker is that if the source data is stored in your AWS accounts, then SageMaker can natively access the data using AWS Identity and Access Management (IAM) roles.

Access data stored in a remote repository

In this step, we import SDTM data from the FDA’s GitHub repository. We create a local directory called data in the RStudio environment to store the data and download the demographics data (dm.xpt) from the remote repository. In this context, the local directory refers to a directory created on your private Amazon EFS storage that is attached by default to your R session environment. See the following code:

######################################################
# Step 1.1 – Ingest Data from Remote Data Repository #
######################################################

# Remote data path
raw_data_url = "https://github.com/FDA/PKView/raw/master/Installation%20Package/OCP/data/clinical/DRUG000/0000/m5/datasets/test001/tabulations/sdtm"
raw_data_name = "dm.xpt"

# Create a local directory to store the downloaded file
dir.create("data")
local_file_location <- paste0(getwd(), "/data/")

# Download dm.xpt from the remote repository into the local data directory
download.file(paste0(raw_data_url, "/", raw_data_name), paste0(local_file_location, raw_data_name))

When this step is complete, you can see dm.xpt being downloaded by navigating to Files, data, dm.xpt.

Access data stored in Amazon S3

In this step, we download data stored in an S3 bucket in our account. We have copied contents from the FDA’s GitHub repository to the S3 bucket named aws-sagemaker-rstudio for this example. See the following code:

#####################################################
# Step 1.2 - Ingest Data from S3 Bucket             #
#####################################################
library("reticulate")

SageMaker = import('sagemaker')
session <- SageMaker$Session()

s3_bucket = "aws-sagemaker-rstudio"
s3_key = "DRUG000/test001/tabulations/sdtm/pp.xpt"

session$download_data(local_file_location, s3_bucket, s3_key)

When the step is complete, you can see pp.xpt being downloaded by navigating to Files, data, pp.xpt.

Process XPT data

Now that we have SAS XPT files available in the R environment, we need to convert them into R data frames and process them. We use the haven library to read the XPT files. We merge the CDISC SDTM datasets dm and pp to create the ADPP analysis dataset. Then we create a summary statistics table using the ADPP data frame. The summary table is then exported in RTF format.

First, XPT files are read using the read_xpt function of the haven library. Then an analysis dataset is created using the sqldf function of the sqldf library. See the following code:

########################################################
# Step 2.1 - Read XPT files. Create Analysis dataset.  #
########################################################

library(haven)
library(sqldf)


# Read XPT Files, convert them to R data frame
dm = read_xpt("data/dm.xpt")
pp = read_xpt("data/pp.xpt")

# Create ADaM dataset
adpp = sqldf("select a.USUBJID
                    ,a.PPCAT as ACAT
                    ,a.PPTESTCD
                    ,a.PPTEST
                    ,a.PPDTC
                    ,a.PPSTRESN as AVAL
                    ,a.VISIT as AVISIT
                    ,a.VISITNUM as AVISITN
                    ,b.sex
                from pp a 
           left join dm b 
                  on a.usubjid = b.usubjid
             ")

Then, an output data frame is created using functions from the Tplyr and dplyr libraries:

########################################################
# Step 2.2 - Create output table                       #
########################################################

library(Tplyr)
library(dplyr)

t = tplyr_table(adpp, SEX) %>% 
  add_layer(
    group_desc(AVAL, by = "Area under the concentration-time curve", where= PPTESTCD=="AUC") %>% 
      set_format_strings(
        "n"        = f_str("xx", n),
        "Mean (SD)"= f_str("xx.x (xx.xx)", mean, sd),
        "Median"   = f_str("xx.x", median),
        "Q1, Q3"   = f_str("xx, xx", q1, q3),
        "Min, Max" = f_str("xx, xx", min, max),
        "Missing"  = f_str("xx", missing)
      )
  )  %>% 
  build()

output = t %>% 
  rename(Variable = row_label1,Statistic = row_label2,Female =var1_F, Male = var1_M) %>% 
  select(Variable,Statistic,Female, Male)

The output data frame is then stored as an RTF file in the output folder in the RStudio environment:

#####################################################
# Step 3 - Save the Results as RTF                  #
#####################################################
library(rtf)

dir.create("output")
rtf = RTF("output/tab_adpp.rtf")  
addHeader(rtf,title="Section 1 - Tables", subtitle="This Section contains all tables")
addParagraph(rtf, "Table 1 - Pharmacokinetic Parameters by Sex:\n")
addTable(rtf, output)
done(rtf)

Upload outputs to Amazon S3

After the output has been generated, we put the data back in an S3 bucket. We can achieve this by creating a SageMaker session again, if a session isn’t active already, and uploading the contents of the output folder to an S3 bucket using the session$upload_data function:

#####################################################
# Step 4 - Upload outputs to S3                     #
#####################################################
library("reticulate")

SageMaker = import('sagemaker')
session <- SageMaker$Session()
s3_bucket = "aws-sagemaker-rstudio"
output_location = "output/"
s3_folder_name = "output"
session$upload_data(output_location, s3_bucket, s3_folder_name)

With these steps, we have ingested data, processed it, and uploaded the results to be made available for submission to regulatory authorities.

Clean up

To avoid incurring any unintended costs, quit your current session when you’re done. Choose the power icon in the top-right corner of the page. This automatically stops the underlying instance, so you stop incurring unintended compute costs.

Challenges

The post has outlined steps for ingesting raw data stored in an S3 bucket or from a remote repository. However, there are many other sources of raw data for a clinical trial, primarily eCRF (electronic case report forms) data stored in EDC (electronic data capture) systems such as Oracle Clinical, Medidata Rave, OpenClinica, or Snowflake; lab data; data from eCOA (clinical outcome assessment) and ePRO (electronic Patient-Reported Outcomes); real-world data from apps and medical devices; and electronic health records (EHRs) at the hospitals. Significant preprocessing is involved before this data can be made usable for regulatory submissions. Building connectors to various data sources and collecting them in a centralized data repository (CDR) or a clinical data lake, while maintaining proper access controls, poses significant challenges.

Another key challenge to overcome is that of regulatory compliance. The computer system used for creating regulatory submission outputs must be compliant with appropriate regulations, such as 21 CFR Part 11, HIPAA, GDPR, or any other GxP requirements or ICH guidelines. This translates to working in a validated and qualified environment with controls for access, security, backup, and auditability in place. This also means that any R packages that are used to create regulatory submission outputs must be validated before use.

Conclusion

In this post, we saw that some of the key deliverables for an eCTD submission are CDISC SDTM and ADaM datasets and TLF. This post outlined the steps needed to create these regulatory submission deliverables by first ingesting data from a couple of sources into RStudio on SageMaker. We then saw how we can process the ingested data in XPT format; convert it into R data frames to create SDTM, ADaM, and TLF; and finally upload the results to an S3 bucket.

We hope that with the broad ideas laid out in this post, statistical programmers and biostatisticians can easily visualize the end-to-end process of loading, processing, and analyzing clinical trial research data in RStudio on SageMaker, and use these learnings to define a custom workflow suited to their regulatory submissions.

Can you think of any other applications of using RStudio to help researchers, statisticians, and R programmers to make their lives easier? We would love to hear about your ideas! And if you have any questions, please share them in the comments section.


About the authors

Rohit Banga is a Global Clinical Development Industry Specialist based out of London, UK. He is a biostatistician by training and helps Healthcare and LifeScience customers deploy innovative clinical development solutions on AWS. He is passionate about how data science, AI/ML, and emerging technologies can be used to solve real business problems within the Healthcare and LifeScience industry. In his spare time, Rohit enjoys skiing, BBQing, and spending time with family and friends.

Georgios Schinas is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking and spending time with friends and family.

Read More

Churn prediction using Amazon SageMaker built-in tabular algorithms LightGBM, CatBoost, TabTransformer, and AutoGluon-Tabular

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. These algorithms and models can be used for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

Customer churn is a problem faced by a wide range of companies, from telecommunications to banking, where customers are typically lost to competitors. It’s in a company’s best interest to retain existing customers rather than acquire new customers because it usually costs significantly more to attract new customers. Mobile operators have historical records in which customers continued using the service or ultimately ended up churning. We can use this historical information of a mobile operator’s churn to train an ML model. After training this model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have it predict whether this customer is going to churn or not.

In this post, we train and deploy four recently released SageMaker algorithms—LightGBM, CatBoost, TabTransformer, and AutoGluon-Tabular—on a churn prediction dataset. We use SageMaker Automatic Model Tuning (a tool for hyperparameter optimization) to find the best hyperparameters for each model, and compare their performance on a holdout test dataset to select the optimal one.

You can also use this solution as a template to search over a collection of state-of-the-art tabular algorithms and use hyperparameter optimization to find the best overall model. You can easily replace the example dataset with your own to solve real business problems you’re interested in. If you want to jump straight into the SageMaker SDK code we go through in this post, you can refer to the following sample Jupyter notebook.

Benefits of SageMaker built-in algorithms

When selecting an algorithm for your particular type of problem and data, using a SageMaker built-in algorithm is the easiest option, because doing so comes with the following major benefits:

  • Low coding – The built-in algorithms require little coding to start running experiments. The only inputs you need to provide are the data, hyperparameters, and compute resources. This allows you to run experiments more quickly, with less overhead for tracking results and code changes.
  • Efficient and scalable algorithm implementations – The built-in algorithms come with parallelization across multiple compute instances and GPU support right out of the box for all applicable algorithms. If you have a lot of data with which to train your model, most built-in algorithms can easily scale to meet the demand. Even if you already have a pre-trained model, it may still be easier to use its corollary in SageMaker and input the hyperparameters you already know rather than port it over and write a training script yourself.
  • Transparency – You’re the owner of the resulting model artifacts. You can take that model and deploy it on SageMaker for several different inference patterns (check out all the available deployment types) and easy endpoint scaling and management, or you can deploy it wherever else you need it.

Data visualization and preprocessing

First, we gather our customer churn dataset. It’s a relatively small dataset with 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes range from the US state where the customer resides, to the number of calls they placed to customer service, to the cost they are billed for daytime calls. We’re trying to predict whether the customer will churn or not, which is a binary classification problem. The following is what a subset of those features looks like, with the label as the last column.

The following are some insights for each column, specifically the summary statistics and histogram of selected features.

We then preprocess the data, split it into training, validation, and test sets, and upload the data to Amazon Simple Storage Service (Amazon S3).
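
The following is a minimal sketch of that step, assuming the dataset has been saved locally as churn.csv; the file names, prefixes, and 80/10/10 split ratios are illustrative, and the exact CSV layout expected by the built-in algorithms (for example, target column position and headers) is described in the sample notebook.

import pandas as pd
import sagemaker
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")  # hypothetical local copy of the churn dataset

# 80/10/10 split into train, validation, and test sets
train_df, holdout_df = train_test_split(df, test_size=0.2, random_state=42)
validation_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)

for name, split in [("train", train_df), ("validation", validation_df), ("test", test_df)]:
    split.to_csv(f"{name}.csv", index=False, header=False)

# Upload each split to the default SageMaker bucket under its own prefix
session = sagemaker.Session()
bucket = session.default_bucket()
for name in ["train", "validation", "test"]:
    session.upload_data(f"{name}.csv", bucket=bucket, key_prefix=f"churn/{name}")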

Automatic model tuning of tabular algorithms

Hyperparameters control how our underlying algorithms operate and influence the performance of the model. Those hyperparameters can be the number of layers, learning rate, weight decay rate, and dropout for neural network-based models, or the number of leaves, iterations, and maximum tree depth for tree ensemble models. To select the best model, we apply SageMaker automatic model tuning to each of the four trained SageMaker tabular algorithms. You need only select the hyperparameters to tune and a range for each parameter to explore. For more information about automatic model tuning, refer to Amazon SageMaker Automatic Model Tuning: Using Machine Learning for Machine Learning or Amazon SageMaker automatic model tuning: Scalable gradient-free optimization.

Let’s see how this works in practice.

LightGBM

We start by running automatic model tuning with LightGBM, and adapt that process to the other algorithms. As is explained in the post Amazon SageMaker JumpStart models and algorithms now available via API, the following artifacts are required to train a pre-built algorithm via the SageMaker SDK:

  • Its framework-specific container image, containing all the required dependencies for training and inference
  • The training and inference scripts for the selected model or algorithm

We first retrieve these artifacts, which depend on the model_id (lightgbm-classification-model in this case) and version:

from sagemaker import image_uris, model_uris, script_uris
train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
training_instance_type = "ml.m5.4xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(region=None,
                                      framework=None,
                                      model_id=train_model_id,
                                      model_version=train_model_version,
                                      image_scope=train_scope,
                                      instance_type=training_instance_type,
                                      )                                      
# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=train_model_id,
                                        model_version=train_model_version,
                                        script_scope=train_scope
                                        )
# Retrieve the pre-trained model tarball (in the case of tabular modeling, it is a dummy file)
train_model_uri = model_uris.retrieve(model_id=train_model_id,
                                      model_version=train_model_version,
                                      model_scope=train_scope)

We then get the default hyperparameters for LightGBM, set some of them to selected fixed values such as number of boosting rounds and evaluation metric on the validation data, and define the value ranges we want to search over for others. We use the SageMaker parameters ContinuousParameter and IntegerParameter for this:

from sagemaker import hyperparameters
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=train_model_id,
                                                   model_version=train_model_version
                                                   )
# [Optional] Override default hyperparameters with custom values
hyperparameters["num_boost_round"] = "500"
hyperparameters["metric"] = "auc"

# Define search ranges for other hyperparameters
hyperparameter_ranges_lgb = {
    "learning_rate": ContinuousParameter(1e-4, 1, scaling_type="Logarithmic"),
    "num_boost_round": IntegerParameter(2, 30),
    "num_leaves": IntegerParameter(10, 50),
    "feature_fraction": ContinuousParameter(0, 1),
    "bagging_fraction": ContinuousParameter(0, 1),
    "bagging_freq": IntegerParameter(1, 10),
    "max_depth": IntegerParameter(5, 30),
    "min_data_in_leaf": IntegerParameter(5, 50),
}

Finally, we create a SageMaker Estimator, feed it into a HyperparameterTuner, and start the hyperparameter tuning job with tuner.fit():

from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperParameterTuner

# Create SageMaker Estimator instance
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
)

tuner = HyperparameterTuner(
            tabular_estimator,
            "auc",
            hyperparameter_ranges_lgb,
            [{"Name": "auc", "Regex": "auc: ([0-9\.]+)"}],
            max_jobs=10,
            max_parallel_jobs=5,
            objective_type="Maximize",
            base_tuning_job_name="some_name",
        )

tuner.fit({"training": training_dataset_s3_path}, logs=True)

The max_jobs parameter defines how many total jobs will be run in the automatic model tuning job, and max_parallel_jobs defines how many concurrent training jobs should be started. We also define the objective to “Maximize” the model’s AUC (area under the curve). To dive deeper into the available parameters exposed by HyperParameterTuner, refer to HyperparameterTuner.

Check out the sample notebook to see how we proceed to deploy and evaluate this model on the test set.
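
As a rough sketch of the evaluation part of that step, assuming you have already deployed the best model and collected its predicted churn probabilities for the holdout test set (deployment of the JumpStart containers is covered in the sample notebook), the reported metrics can be computed with scikit-learn; the function and variable names here are illustrative.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    # Compute accuracy, F1, and AUC from true labels and predicted churn probabilities
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
    }

# Example with dummy values; in practice y_prob comes from invoking the deployed endpoint
print(evaluate(y_true=[0, 1, 1, 0], y_prob=[0.2, 0.8, 0.6, 0.4]))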

CatBoost

The process for hyperparameter tuning on the CatBoost algorithm is the same as before, although we need to retrieve model artifacts under the ID catboost-classification-model and change the range selection of hyperparameters:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)
# [Optional] Override default hyperparameters with custom values
hyperparameters["iterations"] = "500"
hyperparameters["eval_metric"] = "AUC"

# Define search ranges for other hyperparameters
hyperparameter_ranges_cat = {
    "learning_rate": ContinuousParameter(0.00001, 0.1, scaling_type="Logarithmic"),
    "iterations": IntegerParameter(50, 1000),
    "depth": IntegerParameter(1, 10),
    "l2_leaf_reg": IntegerParameter(1, 10),
    "random_strength": ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
}

TabTransformer

The process for hyperparameter tuning on the TabTransformer model is the same as before, although we need to retrieve model artifacts under the ID pytorch-tabtransformerclassification-model and change the range selection of hyperparameters.

We also change the training instance_type to ml.p3.2xlarge. TabTransformer is a model recently developed by Amazon research, which brings the power of deep learning to tabular data using Transformer models. To train this model efficiently, we need a GPU-backed instance. For more information, refer to Bringing the power of deep learning to data in tables.

from sagemaker import hyperparameters
from sagemaker.tuner import CategoricalParameter

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)
# [Optional] Override default hyperparameters with custom values
hyperparameters["n_epochs"] = 40  # The same hyperparameter is named as "iterations" for CatBoost
hyperparameters["patience"] = 10

# Define search ranges for other hyperparameters
hyperparameter_ranges_tab = {
    "learning_rate": ContinuousParameter(0.001, 0.01, scaling_type="Auto"),
    "batch_size": CategoricalParameter([64, 128, 256, 512]),
    "attn_dropout": ContinuousParameter(0.0, 0.8, scaling_type="Auto"),
    "mlp_dropout": ContinuousParameter(0.0, 0.8, scaling_type="Auto"),
    "input_dim": CategoricalParameter(["16", "32", "64", "128", "256"]),
    "frac_shared_embed": ContinuousParameter(0.0, 0.5, scaling_type="Auto"),
}

AutoGluon-Tabular

In the case of AutoGluon, we don’t run hyperparameter tuning. This is by design, because AutoGluon focuses on ensembling multiple models with sane choices of hyperparameters and stacking them in multiple layers. This ends up being more performant than training one model with the perfect selection of hyperparameters and is also computationally cheaper. For details, check out AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.

Therefore, we switch the model_id to autogluon-classification-ensemble, and only fix the evaluation metric hyperparameter to our desired AUC score:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

hyperparameters["eval_metric"] = "roc_auc"

Instead of calling tuner.fit(), we call estimator.fit() to start a single training job.
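
The following is a minimal sketch of that single training job, assuming the container image, training script, and model artifact URIs have been re-retrieved for the autogluon-classification-ensemble model ID exactly as in the LightGBM example (the variable names carry over from that section).

from sagemaker.estimator import Estimator

# Estimator construction mirrors the LightGBM example, but no HyperparameterTuner is used
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
)

# A single training job; AutoGluon builds and stacks the model ensemble internally
tabular_estimator.fit({"training": training_dataset_s3_path}, logs=True)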

Benchmarking the trained models

After we deploy all four models, we send the full test set to each endpoint for prediction and calculate accuracy, F1, and AUC metrics for each (see code in the sample notebook). We present the results in the following table, with an important disclaimer: results and relative performance between these models will depend on the dataset you use for training. These results are representative, and even though the tendency for certain algorithms to perform better is based on relevant factors (for example, AutoGluon intelligently ensembles the predictions of both LightGBM and CatBoost models behind the scenes), the balance in performance might change given a different data distribution.

Algorithm                                     Accuracy  F1      AUC
LightGBM with Automatic Model Tuning          0.8977    0.8986  0.9629
CatBoost with Automatic Model Tuning          0.9622    0.9624  0.9907
TabTransformer with Automatic Model Tuning    0.9511    0.9517  0.989
AutoGluon-Tabular                             0.98      0.98    0.9979

Conclusion

In this post, we trained four different SageMaker built-in algorithms to solve the customer churn prediction problem with low coding effort. We used SageMaker automatic model tuning to find the best hyperparameters to train these algorithms with, and compared their performance on a selected churn prediction dataset. You can use the related sample notebook as a template, replacing the dataset with your own to solve your desired tabular data-based problem.

Make sure to try these algorithms on SageMaker, and check out sample notebooks on how to use other built-in algorithms available on GitHub.


About the authors

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use-cases and helping customers optimize Deep Learning model training and deployment. He is also an active proponent of low-code ML solutions and ML-specialized hardware.

Read More

Parallel data processing with RStudio on Amazon SageMaker

Last year, we announced the general availability of RStudio on Amazon SageMaker, the industry’s first fully managed RStudio Workbench integrated development environment (IDE) in the cloud. You can quickly launch the familiar RStudio IDE, and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale.

With ever-increasing data volumes being generated, datasets used for ML and statistical analysis are growing in tandem. This brings the challenges of increased development time and compute infrastructure management. To solve these challenges, data scientists have looked to implement parallel data processing techniques. Parallel data processing, or data parallelization, takes large existing datasets and distributes them across multiple processors or nodes to operate on the data simultaneously. This can allow for faster processing time of larger datasets, along with optimized usage of compute. This can help ML practitioners create reusable patterns for dataset generation, and also help reduce compute infrastructure load and cost.

Solution overview

Within Amazon SageMaker, many customers use SageMaker Processing to help implement parallel data processing. With SageMaker Processing, you can use a simplified, managed experience on SageMaker to run your data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation. This brings many benefits because there’s no long-running infrastructure to manage—processing instances spin down when jobs are complete, environments can be standardized via containers, data within Amazon Simple Storage Service (Amazon S3) is natively distributed across instances, and infrastructure settings are flexible in terms of memory, compute, and storage.

SageMaker Processing offers options for how to distribute data. For parallel data processing, you must use the ShardedByS3Key option for the S3DataDistributionType. When this parameter is selected, SageMaker Processing takes the provided n instances and distributes 1/n of the objects from the input data source to each instance. For example, if two instances are provided with four data objects, each instance receives two objects.

SageMaker Processing requires three components to run processing jobs:

  • A container image that has your code and dependencies to run your data processing workloads
  • A path to an input data source within Amazon S3
  • A path to an output data source within Amazon S3

The process is depicted in the following diagram.

In this post, we show you how to use RStudio on SageMaker to interface with a series of SageMaker Processing jobs to create a parallel data processing pipeline using the R programming language.

The solution consists of the following steps:

  1. Set up the RStudio project.
  2. Build and register the processing container image.
  3. Run the two-step processing pipeline:
    1. The first step takes multiple data files and processes them across a series of processing jobs.
    2. The second step concatenates the output files and splits them into train, test, and validation datasets.

Prerequisites

Complete the following prerequisites:

  1. Set up the RStudio on SageMaker Workbench. For more information, refer to Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists.
  2. Create a user with RStudio on SageMaker with appropriate access permissions.

Set up the RStudio project

To set up the RStudio project, complete the following steps:

  1. Navigate to your Amazon SageMaker Studio control panel on the SageMaker console.
  2. Launch your app in the RStudio environment.
  3. Start a new RStudio session.
  4. For Session Name, enter a name.
  5. For Instance Type and Image, use the default settings.
  6. Choose Start Session.
  7. Navigate into the session.
  8. Choose New Project, Version Control, and then Git.
  9. For Repository URL, enter https://github.com/aws-samples/aws-parallel-data-processing-r.git
  10. Leave the remaining options as default and choose Create Project.

You can navigate to the aws-parallel-data-processing-R directory on the Files tab to view the repository. The repository contains the following files:

  • Container_Build.rmd
  • /dataset

    • bank-additional-full-data1.csv
    • bank-additional-full-data2.csv
    • bank-additional-full-data3.csv
    • bank-additional-full-data4.csv
  • /docker
  • Dockerfile-Processing
  • Parallel_Data_Processing.rmd
  • /preprocessing

    • filter.R
    • process.R

Build the container

In this step, we build our processing container image and push it to Amazon Elastic Container Registry (Amazon ECR). Complete the following steps:

  1. Navigate to the Container_Build.rmd file.
  2. Install the SageMaker Studio Image Build CLI by running the following cell. Make sure you have the required permissions prior to completing this step; this CLI is designed to push and register container images within Studio.
    pip install sagemaker-studio-image-build

  3. Run the next cell to build and register our processing container:
    /home/sagemaker-user/.local/bin/sm-docker build . --file ./docker/Dockerfile-Processing --repository sagemaker-rstudio-parallel-processing:1.0

After the job has successfully run, you receive an output that looks like the following:

Image URI: <Account_Number>.dkr.ecr.<Region>.amazonaws.com/sagemaker-rstudio-parallel-processing:1.0

Run the processing pipeline

After you build the container, navigate to the Parallel_Data_Processing.rmd file. This file contains a series of steps that helps us create our parallel data processing pipeline using SageMaker Processing. The following diagram depicts the steps of the pipeline that we complete.

Start by running the package import step. Import the required RStudio packages along with the SageMaker SDK:

suppressWarnings(library(dplyr))
suppressWarnings(library(reticulate))
suppressWarnings(library(readr))
path_to_python <- system('which python', intern = TRUE)

use_python(path_to_python)
sagemaker <- import('sagemaker')

Now set up your SageMaker execution role and environment details:

role = sagemaker$get_execution_role()
session = sagemaker$Session()
bucket = session$default_bucket()
account_id <- session$account_id()
region <- session$boto_region_name
local_path <- dirname(rstudioapi::getSourceEditorContext()$path)

Initialize the container that we built and registered in the earlier step:

container_uri <- paste(account_id, "dkr.ecr", region, "amazonaws.com/sagemaker-rstudio-parallel-processing:1.0", sep=".")
print(container_uri)

From here we dive into each of the processing steps in more detail.

Upload the dataset

For our example, we use the Bank Marketing dataset from UCI. We have already split the dataset into multiple smaller files. Run the following code to upload the files to Amazon S3:

local_dataset_path <- paste0(local_path,"/dataset/")

dataset_files <- list.files(path=local_dataset_path, pattern="\\.csv$", full.names=TRUE)
for (file in dataset_files){
  session$upload_data(file, bucket=bucket, key_prefix="sagemaker-rstudio-example/split")
}

input_s3_split_location <- paste0("s3://", bucket, "/sagemaker-rstudio-example/split")

After the files are uploaded, move to the next step.

Perform parallel data processing

In this step, we take the data files and perform feature engineering to filter out certain columns. This job is distributed across a series of processing instances (for our example, we use two).

We use the filter.R file to process the data, and configure the job as follows:

filter_processor <- sagemaker$processing$ScriptProcessor(command=list("Rscript"),
                                                        image_uri=container_uri,
                                                        role=role,
                                                        instance_count=2L,
                                                        instance_type="ml.m5.large")

output_s3_filter_location <- paste0("s3://", bucket, "/sagemaker-rstudio-example/filtered")
s3_filter_input <- sagemaker$processing$ProcessingInput(source=input_s3_split_location,
                                                        destination="/opt/ml/processing/input",
                                                        s3_data_distribution_type="ShardedByS3Key",
                                                        s3_data_type="S3Prefix")
s3_filter_output <- sagemaker$processing$ProcessingOutput(output_name="bank-additional-full-filtered",
                                                         destination=output_s3_filter_location,
                                                         source="/opt/ml/processing/output")

filtering_step <- sagemaker$workflow$steps$ProcessingStep(name="FilterProcessingStep",
                                                      code=paste0(local_path, "/preprocessing/filter.R"),
                                                      processor=filter_processor,
                                                      inputs=list(s3_filter_input),
                                                      outputs=list(s3_filter_output))

As mentioned earlier, when running a parallel data processing job, you must adjust the input parameters to specify how the data is sharded and the type of the data. Therefore, we set the S3 data distribution type to ShardedByS3Key and the S3 data type to S3Prefix:

s3_data_distribution_type="ShardedByS3Key",
s3_data_type="S3Prefix")

After you insert these parameters, SageMaker Processing will equally distribute the data across the number of instances selected.

Adjust the parameters as necessary, and then run the cell to instantiate the job.

Generate training, test, and validation datasets

In this step, we take the processed data files, combine them, and split them into test, train, and validation datasets. This allows us to use the data for building our model.

We use the process.R file to process the data, and configure the job as follows:

script_processor <- sagemaker$processing$ScriptProcessor(command=list("Rscript"),
                                                         image_uri=container_uri,
                                                         role=role,
                                                         instance_count=1L,
                                                         instance_type="ml.m5.large")

output_s3_processed_location <- paste0("s3://", bucket, "/sagemaker-rstudio-example/processed")
s3_processed_input <- sagemaker$processing$ProcessingInput(source=output_s3_filter_location,
                                                         destination="/opt/ml/processing/input",
                                                         s3_data_type="S3Prefix")
s3_processed_output <- sagemaker$processing$ProcessingOutput(output_name="bank-additional-full-processed",
                                                         destination=output_s3_processed_location,
                                                         source="/opt/ml/processing/output")

processing_step <- sagemaker$workflow$steps$ProcessingStep(name="ProcessingStep",
                                                      code=paste0(local_path, "/preprocessing/process.R"),
                                                      processor=script_processor,
                                                      inputs=list(s3_processed_input),
                                                      outputs=list(s3_processed_output),
                                                      depends_on=list(filtering_step))

Adjust the parameters as necessary, and then run the cell to instantiate the job.

Run the pipeline

After all the steps are instantiated, start the processing pipeline to run each step by running the following cell:

pipeline = sagemaker$workflow$pipeline$Pipeline(
  name="BankAdditionalPipelineUsingR",
  steps=list(filtering_step, processing_step)
)

upserted <- pipeline$upsert(role_arn=role)
execution <- pipeline$start()

execution$describe()
execution$wait()

The time each of these jobs takes will vary based on the instance size and count selected.

Navigate to the SageMaker console to see all your processing jobs.

We start with the filtering job, as shown in the following screenshot.

When that’s complete, the pipeline moves to the data processing job.

When both jobs are complete, navigate to your S3 bucket. Look within the sagemaker-rstudio-example folder, under processed. You can see the files for the train, test and validation datasets.

Conclusion

With the ever-increasing amount of data required to build more and more sophisticated models, we need to change our approach to how we process data. Parallel data processing is an efficient method for accelerating dataset generation, and when coupled with modern cloud environments and tooling such as RStudio on SageMaker and SageMaker Processing, it can remove much of the undifferentiated heavy lifting of infrastructure management, boilerplate code generation, and environment management. In this post, we walked through how you can implement parallel data processing within RStudio on SageMaker. We encourage you to try it out by cloning the GitHub repository, and if you have suggestions on how to make the experience better, please submit an issue or a pull request.

To learn more about the features and services used in this solution, refer to RStudio on Amazon SageMaker and Amazon SageMaker Processing.


About the authors

Raj Pathak is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.

Jake Wen is a Solutions Architect at AWS with a passion for ML training and natural language processing. Jake helps small and medium-sized business customers with design and thought leadership to build and deploy applications at scale. Outside of work, he enjoys hiking.

Aditi Rajnish is a first-year software engineering student at the University of Waterloo. Her interests include computer vision, natural language processing, and edge computing. She is also passionate about community-based STEM outreach and advocacy. In her spare time, she can be found rock climbing, playing the piano, or learning how to bake the perfect scone.

Sean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Add-ons.

Paul Wu is a Solutions Architect working in AWS’ Greenfield Business in Texas. His areas of expertise include containers and migrations.

Read More

Discover insights from Zendesk with Amazon Kendra intelligent search

Customer relationship management (CRM) is a critical tool that organizations maintain to manage customer interactions and build business relationships. Zendesk is a CRM tool that makes it easy for customers and businesses to keep in sync. Zendesk captures a wealth of customer data, such as support tickets created and updated by customers and service agents, community discussions, and helpful guides. With such a wealth of complex data, simple keyword searches don’t suffice when it comes to discovering meaningful, accurate customer information.

Now you can use the Amazon Kendra Zendesk connector to index your Zendesk service tickets, help guides, and community posts, and perform intelligent search powered by machine learning (ML). Amazon Kendra smartly and efficiently answers natural language-based queries using advanced natural language processing (NLP) techniques. It can learn effectively from your Zendesk data, extracting meaning and context.

This post shows how to configure the Amazon Kendra Zendesk connector to index your Zendesk domain and take advantage of Amazon Kendra intelligent search. As our example, we use an illustrative Zendesk domain that covers technical topics related to AWS services.

Overview of solution

Amazon Kendra was built for intelligent search using NLP. You can use Amazon Kendra to ask factoid questions and descriptive questions, and to perform keyword searches. You can use the Amazon Kendra connector for Zendesk to crawl your Zendesk domain and index service tickets, guides, and community posts to discover answers for your questions faster.

In this post, we show how to use the Amazon Kendra connector for Zendesk to index data from your Zendesk domain for intelligent search.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Administrator level access to your Zendesk domain
  • Privileges to create an Amazon Kendra index, AWS resources, and AWS Identity and Access Management (IAM) roles and policies
  • Basic knowledge of AWS services and working knowledge of Zendesk

Configure your Zendesk domain

Your Zendesk domain has a domain owner or administrator, service group administrators, and a customer. Sample service tickets, community posts, and guides have been created for the purpose of this walkthrough. A Zendesk API client with the unique identifier amazon_kendra is registered to create an OAuth token for accessing your Zendesk domain from Amazon Kendra for crawling and indexing. The following screenshot shows the details of the OAuth configuration for the Zendesk API client.

Configure the data source using the Amazon Kendra connector for Zendesk

You can add the Zendesk connector data source to an existing Amazon Kendra index or create a new index. Then complete the following steps to configure the Zendesk connector:

  1. On the Amazon Kendra console, open the index and choose Data sources in the navigation pane.
  2. Under Zendesk, choose Add connector.
  3. Choose Add connector.
  4. In the Specify data source details section, enter the name and description of your data source and choose Next.
  5. In the Define access and security section, for Zendesk URL, enter the URL to your Zendesk domain. Use the URL format https://<domain>.zendesk.com/.
  6. Under Authentication, you can either choose Create to add a new secret using the user OAuth token created for the amazon_kendra API client, or use an existing AWS Secrets Manager secret that has the user OAuth token for the Zendesk domain that you want the connector to access.
  7. Optionally, configure a new AWS secret for Zendesk API access.
  8. For IAM role, you can choose Create a new role or choose an existing IAM role configured with appropriate IAM policies to access the Secrets Manager secret, Amazon Kendra index, and data source.
  9. Choose Next.
  10. In the Configure sync settings section, provide information regarding your sync scope and run schedule.
  11. Choose Next.
  12. In the Set field mappings section, you can optionally configure the field mappings, or how the Zendesk field names are mapped to Amazon Kendra attributes or facets.
  13. Choose Next.
  14. Review your settings and confirm to add the data source.
  15. After the data source is created, select the data source and choose Sync Now. (You can also start the sync programmatically, as shown in the sketch after these steps.)
  16. Choose Facet definition in the navigation pane.
  17. Select the check box in the Facetable column for the facet _category.
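
If you prefer to script step 15, the following boto3 sketch starts the sync (the equivalent of choosing Sync Now) and checks its status. The index ID and data source ID are placeholders that you can copy from the Amazon Kendra console:

import boto3

kendra = boto3.client("kendra")

index_id = "<your-index-id>"              # placeholder
data_source_id = "<your-data-source-id>"  # placeholder

# Start a sync of the Zendesk data source
kendra.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)

# Review the status of recent sync jobs for this data source
history = kendra.list_data_source_sync_jobs(Id=data_source_id, IndexId=index_id)
for job in history["History"]:
    print(job["ExecutionId"], job["Status"])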

Run queries with the Amazon Kendra search console

Now that the data is synced, we can run a few search queries on the Amazon Kendra search console by navigating to the Search indexed content page.

For the first query, we ask Amazon Kendra a general question related to AWS service durability. The following screenshot shows the response. The suggested answer provides the correct answer to the query by applying natural language comprehension.

For our second query, let’s query Amazon Kendra to search for product issues from Zendesk service tickets. The following screenshot shows the response for the search, along with facets showing various categories of documents included in the result.

Notice the search result includes the URL to the source document as well. Choosing the URL takes us directly to the Zendesk service ticket page, as shown in the following screenshot.
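
You can issue the same kinds of queries programmatically through the Amazon Kendra Query API. The following is a minimal boto3 sketch; the index ID is a placeholder:

import boto3

kendra = boto3.client("kendra")
index_id = "<your-index-id>"  # placeholder

response = kendra.query(IndexId=index_id, QueryText="How durable is Amazon S3?")

# Print the result type, document title, excerpt, and source URL for each result
for item in response["ResultItems"]:
    print(item["Type"])
    print(item.get("DocumentTitle", {}).get("Text", ""))
    print(item.get("DocumentExcerpt", {}).get("Text", ""))
    print(item.get("DocumentURI", ""))
    print("---")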

Clean up

To avoid incurring future charges, clean up any resources created as part of this solution. Delete the Zendesk connector data source so any data indexed from the source is removed from the index. If you created a new Amazon Kendra index, delete the index as well.
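
If you prefer to clean up with the AWS SDK, a minimal boto3 sketch looks like the following; both IDs are placeholders:

import boto3

kendra = boto3.client("kendra")

index_id = "<your-index-id>"              # placeholder
data_source_id = "<your-data-source-id>"  # placeholder

# Deleting the data source removes the documents it indexed from the index
kendra.delete_data_source(Id=data_source_id, IndexId=index_id)

# Delete the index only if you created it solely for this walkthrough
kendra.delete_index(Id=index_id)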

Conclusion

In this post, we discussed how to configure the Amazon Kendra connector for Zendesk to crawl and index service tickets, community posts, and help guides. We showed how Amazon Kendra ML-based search enables your business leaders and agents to discover insights from your Zendesk content quicker and respond to customer needs faster.

To learn more about the Amazon Kendra connector for Zendesk, refer to the Amazon Kendra Developer Guide.


About the author

Rajesh Kumar Ravi is a Senior AI Services Solution Architect at Amazon Web Services specializing in intelligent document search with Amazon Kendra. He is a builder and problem solver, and contributes to the development of new ideas. He enjoys walking and loves to go on short hiking trips outside of work.

Read More

Amazon SageMaker Automatic Model Tuning now provides up to three times faster hyperparameter tuning with Hyperband

Amazon SageMaker Automatic Model Tuning introduces Hyperband, a multi-fidelity technique to tune hyperparameters as a faster and more efficient way to find an optimal model. In this post, we show how automatic model tuning with Hyperband can provide faster hyperparameter tuning—up to three times as fast.

The benefits of Hyperband

Hyperband presents two advantages over existing black-box tuning strategies: efficient resource utilization and a better time-to-convergence.

Machine learning (ML) training is increasingly intensive, involving complex models and large datasets, and finding the optimal hyperparameters requires a lot of effort and resources. Traditional black-box search strategies, such as Bayesian optimization, random search, and grid search, tend to scale linearly with the complexity of the ML problem at hand, requiring longer training time.

To speed up hyperparameter tuning and optimize training cost, Hyperband uses the Asynchronous Successive Halving Algorithm (ASHA), a strategy that massively parallelizes hyperparameter tuning and automatically stops training jobs early by using previously evaluated configurations to predict whether a specific candidate is promising or not.

As we demonstrate in this post, Hyperband converges to the optimal objective metric faster than most black-box strategies and therefore saves training time. This allows you to tune larger-scale models where evaluating each hyperparameter configuration requires running an expensive training loop, such as in computer vision and natural language processing (NLP) applications. If you’re interested in finding the most accurate models, Hyperband also allows you to run your tuning jobs with more resources and converge to a better solution.

Hyperband with SageMaker

The new Hyperband strategy for hyperparameter tuning introduces a few new data elements, which you configure through AWS API calls. Implementation via the AWS Management Console is not available at this time. Let's look at the data elements introduced for Hyperband:

  • Strategy – This defines the hyperparameter approach you want to choose. A new value for Hyperband is introduced with this change. Valid values are Bayesian, random, and Hyperband.
  • MinResource – Defines the minimum number of epochs or iterations to be used for a training job before a decision is made to stop the training.
  • MaxResource – Specifies the maximum number of epochs or iterations to be used for a training job to achieve the objective. This parameter is not required if you have the number of training epochs defined as a hyperparameter in the tuning job.

Implementation

The following sample code shows the tuning job configuration:

tuning_job_config = {
    "Strategy": "Hyperband",
    "StrategyConfig": {
        "HyperbandStrategyConfig": {
            "MinResource": 10,
            "MaxResource": 100
        }
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 5
    },
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:auc",
        "Type": "Maximize"
    },
    "ParameterRanges": {
        "CategoricalParameterRanges": [],
        "ContinuousParameterRanges": [
            {"Name": "eta", "MinValue": "0", "MaxValue": "1"},
            {"Name": "min_child_weight", "MinValue": "1", "MaxValue": "10"},
            {"Name": "alpha", "MinValue": "0", "MaxValue": "2"}
        ],
        "IntegerParameterRanges": [
            {"Name": "max_depth", "MinValue": "1", "MaxValue": "10"}
        ]
    }
}

The preceding code defines strategy as Hyperband and also defines the lower and upper bound resource limits inside the strategy configuration using HyperbandStrategyConfig, which serves as a lever to control the training runtime. For more details on how to configure and run automatic model tuning, refer to Specify the Hyperparameter Tuning Job Settings.
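
To launch a tuning job with this configuration, you pass it to the CreateHyperParameterTuningJob API together with a training job definition. The following boto3 sketch shows the shape of that call; the training image URI, role ARN, and S3 paths are placeholders that you would replace with your own values:

import boto3

sm = boto3.client("sagemaker")

# Placeholder training job definition for an XGBoost-style training image
training_job_definition = {
    "AlgorithmSpecification": {
        "TrainingImage": "<xgboost-training-image-uri>",
        "TrainingInputMode": "File",
    },
    "RoleArn": "<sagemaker-execution-role-arn>",
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<bucket>/train/",
                "S3DataDistributionType": "FullyReplicated"}},
        },
        {
            "ChannelName": "validation",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<bucket>/validation/",
                "S3DataDistributionType": "FullyReplicated"}},
        },
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://<bucket>/output/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    "StaticHyperParameters": {"objective": "binary:logistic"},
}

# Launch the Hyperband tuning job using the tuning_job_config defined earlier
sm.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="hyperband-demo",
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition,
)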

Hyperband compared to black-box search

In this section, we perform two experiments to compare Hyperband to a black-box search.

First experiment

In the first experiment, given a binary classification task, we aim to optimize a three-layer fully connected network on a synthetic dataset that contains 5,000 data points. The hyperparameters for the network include the number of units per layer, the learning rate of the Adam optimizer, the L2 regularization parameter, and the batch size. The ranges of these parameters are as follows:

  • Number of units per layer – 10 to 1e3
  • Learning rate – 1e-4 to 0.1
  • L2 regularization parameter – 1e-6 to 2
  • Batch size – 10 to 200

The following graph shows our findings.

Given a target accuracy of 96%, Hyperband achieves it in 357 seconds, while Bayesian optimization needs 560 seconds and random search needs 614 seconds. Hyperband consistently finds a more optimal solution at any given wall-clock time budget. This shows a clear advantage of a multi-fidelity optimization algorithm.

Second experiment

In this second experiment, we consider the CIFAR-10 dataset and train ResNet-20, a popular network architecture for computer vision tasks. We run all experiments on ml.g4dn.xlarge instances and optimize the neural network through SGD. We also apply standard image augmentation, including random flip and random crop. The search space contains the following parameters:

  • Mini-batch size: an integer from 4 to 512
  • Learning rate of SGD: a float from 1e-6 to 1e-1
  • Momentum of SGD: a float from 0.01 to 0.99
  • Weight decay: a float from 1e-5 to 1

The following graph illustrates our findings.

Given a target validation accuracy of 0.87, Hyperband reaches it in fewer than 2,000 seconds, while random search and Bayesian optimization require about 10,000 and almost 9,000 seconds, respectively. This amounts to a speed-up factor of roughly 5x and 4.5x for Hyperband on this task compared to the random and Bayesian strategies. This shows a clear advantage for the multi-fidelity optimization algorithm, which significantly reduces the wall-clock time it takes to tune your deep learning models.

Conclusion

SageMaker Automatic Model Tuning allows you to reduce the time to tune a model by automatically searching for the best hyperparameter configuration within the ranges that you specify. You can find the best version of your model by running training jobs on your dataset with several search strategies, such as black-box or multi-fidelity strategies.

In this post, we discussed how you can now use a multi-fidelity strategy called Hyperband in SageMaker to find the best model. The support for Hyperband makes it possible for SageMaker Automatic Model Tuning to tune larger-scale models where evaluating each hyperparameter configuration requires running an expensive training loop, such as in computer vision and NLP applications.

Finally, we saw how Hyperband further optimizes runtime compared to black-box strategies with early stopping by using previously evaluated configurations to predict whether a specific candidate is promising and, if not, stop the evaluation to reduce the overall time and compute cost. Using Hyperband in SageMaker also allows you to specify the minimum and maximum resource in the HyperbandStrategyConfig parameter for further runtime controls.

To learn more, visit Perform Automatic Model Tuning with SageMaker.


About the authors

Doug Mbaya is a Senior Partner Solution architect with a focus in data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solutions in the cloud.

Gopi Mudiyala is a Senior Technical Account Manager at AWS. He helps customers in the Financial Services industry with their operations in AWS. As a machine learning specialist, Gopi works to help customers succeed in their ML journey.

Xingchen Ma is an Applied Scientist at AWS. He works in the team owning the service for SageMaker Automatic Model Tuning.

Read More

Read webpages and highlight content using Amazon Polly

In this post, we demonstrate how to use Amazon Polly—a leading cloud service that converts text into lifelike speech—to read the content of a webpage and highlight the content as it’s being read. Adding audio playback to a webpage improves the accessibility and visitor experience of the page. Audio-enhanced content is more impactful and memorable, draws more traffic to the page, and taps into the spending power of visitors. It also improves the brand of the company or organization that publishes the page. Text-to-speech technology makes these business benefits attainable. We accelerate that journey by demonstrating how to achieve this goal using Amazon Polly.

This capability improves accessibility for visitors with disabilities, and could be adopted as part of your organization’s accessibility strategy. Just as importantly, it enhances the page experience for visitors without disabilities. Both groups have significant spending power and spend more freely from pages that use audio enhancement to grab their attention.

Overview of solution

PollyReadsThePage (PRTP)—as we refer to the solution—allows a webpage publisher to drop an audio control onto their webpage. When the visitor chooses Play on the control, the control reads the page and highlights the content. PRTP uses the general capability of Amazon Polly to synthesize speech from text. It invokes Amazon Polly to generate two artifacts for each page:

  • The audio content in a format playable by the browser: MP3
  • A speech marks file that indicates for each sentence of text:
    • The time during playback that the sentence is read
    • The location on the page where the sentence appears

When the visitor chooses Play, the browser plays the MP3 file. As the audio is read, the browser checks the time, finds in the marks file which sentence to read at that time, locates it on the page, and highlights it.

PRTP allows the visitor to listen to the page in different voices and languages. Each voice requires its own pair of files. PRTP uses neural voices. For a list of supported neural voices and languages, see Neural Voices. For a full list of standard and neural voices in Amazon Polly, see Voices in Amazon Polly.
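
If you want to check which neural voices are available in your Region programmatically, a quick boto3 sketch is:

import boto3

polly = boto3.client("polly")

# List the neural voices available in the current Region
response = polly.describe_voices(Engine="neural")
for voice in response["Voices"]:
    print(voice["Id"], voice["LanguageCode"], voice["LanguageName"])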

We consider two types of webpages: static and dynamic pages. In a static page, the content is contained within the page and changes only when a new version of the page is published. The company might update the page daily or weekly as part of its web build process. For this type of page, it’s possible to pre-generate the audio files at build time and place them on the web server for playback. As the following figure shows, the script PRTP Pre-Gen invokes Amazon Polly to generate the audio. It takes as input the HTML page itself and, optionally, a configuration file that specifies which text from the page to extract (Text Extract Config). If the extract config is omitted, the pre-gen script makes a sensible choice of text to extract from the body of the page. Amazon Polly outputs the files in an Amazon Simple Storage Service (Amazon S3) bucket; the script copies them to your web server. When the visitor plays the audio, the browser downloads the MP3 directly from the web server. For highlights, a drop-in library, PRTP.js, uses the marks file to highlight the text being read.

The content of a dynamic page changes in response to the visitor interaction, so audio can’t be pre-generated but must be synthesized dynamically. As the following figure shows, when the visitor plays the audio, the page uses PRTP.js to generate the audio in Amazon Polly, and it highlights the synthesized audio using the same approach as with static pages. To access AWS services from the browser, the visitor requires an AWS identity. We show how to use an Amazon Cognito identity pool to allow the visitor just enough access to Amazon Polly and the S3 bucket to render the audio.

Dynamic Content

Generating both the MP3 audio and the speech marks requires Amazon Polly to synthesize the same input twice. Refer to the Amazon Polly Pricing Page to understand the cost implications. Pre-generation saves costs because synthesis is performed once at build time rather than on demand for each visitor interaction.

The code accompanying this post is available as an open-source repository on GitHub.

To explore the solution, we follow these steps:

  1. Set up the resources, including the pre-gen build server, S3 bucket, web server, and Amazon Cognito identity.
  2. Run the static pre-gen build and test static pages.
  3. Test dynamic pages.

Prerequisites

To run this example, you need an AWS account with permission to use Amazon Polly, Amazon S3, Amazon Cognito, and (for demo purposes) AWS Cloud9.

Provision resources

We share an AWS CloudFormation template to create in your account a self-contained demo environment to help you follow along with the post. If you prefer to set up PRTP in your own environment, refer to instructions in README.md.

To provision the demo environment using CloudFormation, first download a copy of the CloudFormation template. Then complete the following steps:

  1. On the AWS CloudFormation console, choose Create stack.
  2. Choose With new resources (standard).
  3. Select Upload a template file.
  4. Choose Choose file to upload the local copy of the template that you downloaded. The name of the file is prtp.yml.
  5. Choose Next.
  6. Enter a stack name of your choosing. Later you enter this again as a replacement for <StackName>.
  7. You may keep default values in the Parameters section.
  8. Choose Next.
  9. Continue through the remaining sections.
  10. Read and select the check boxes in the Capabilities section.
  11. Choose Create stack.
  12. When the stack is complete, find the value for BucketName in the stack outputs.

We encourage you to review the stack with your security team prior to using it in a production environment.

Set up the web server and pre-gen server in an AWS Cloud9 IDE

Next, on the AWS Cloud9 console, locate the environment PRTPDemoCloud9 created by the CloudFormation stack. Choose Open IDE to open the AWS Cloud9 environment. Open a terminal window and run the following commands, which clone the PRTP code, set up the pre-gen dependencies, and start a web server to test with:

#Obtain PRTP code
cd /home/ec2-user/environment
git clone https://github.com/aws-samples/amazon-polly-reads-the-page.git

# Navigate to that code
cd amazon-polly-reads-the-page/setup

# Install Saxon and html5 Python lib. For pre-gen.
sh ./setup.sh <StackName>

# Run Python simple HTTP server
cd ..
./runwebserver.sh <IngressCIDR> 

For <StackName>, use the name you gave the CloudFormation stack. For <IngressCIDR>, specify a range of IP addresses allowed to access the web server. To restrict access to the browser on your local machine, find your IP address using https://whatismyipaddress.com/ and append /32 to specify the range. For example, if your IP is 10.2.3.4, use 10.2.3.4/32. The server listens on port 8080. The public IP address on which the server listens is given in the output. For example:

Public IP is

3.92.33.223

Test static pages

In your browser, navigate to PRTPStaticDefault.html. (If you’re using the demo, the URL is http://<cloud9host>:8080/web/PRTPStaticDefault.html, where <cloud9host> is the public IP address that you discovered in setting up the IDE.) Choose Play on the audio control at the top. Listen to the audio and watch the highlights. Explore the control by changing speeds, changing voices, pausing, fast-forwarding, and rewinding. The following screenshot shows the page; the text “Skips hidden paragraph” is highlighted because it is currently being read.

Try the same for PRTPStaticConfig.html and PRTPStaticCustom.html. The results are similar. For example, all three read the alt text for the photo of the cat (“Random picture of a cat”). All three read NE, NW, SE, and SW as full words (“northeast,” “northwest,” “southeast,” “southwest”), taking advantage of Amazon Polly lexicons.

Notice the main differences in audio:

  • PRTPStaticDefault.html reads all the text in the body of the page, including the wrapup portion at the bottom with “Your thoughts in one word,” “Submit Query,” “Last updated April 1, 2020,” and “Questions for the dev team.” PRTPStaticConfig.html and PRTPStaticCustom.html don’t read these because they explicitly exclude the wrapup from speech synthesis.
  • PRTPStaticCustom.html reads the QB Best Sellers table differently from the others. It reads the first three rows only, and reads the row number for each row. It repeats the columns for each row. PRTPStaticCustom.html uses a custom transformation to tailor the readout of the table. The other pages use default table rendering.
  • PRTPStaticCustom.html reads “Tom Brady” at a louder volume than the rest of the text. It uses the speech synthesis markup language (SSML) prosody tag to tailor the reading of Tom Brady. The other pages don’t tailor in this way.
  • PRTPStaticCustom.html, thanks to a custom transformation, reads the main tiles in NW, SW, NE, SE order; that is, it reads “Today’s Articles,” “Quote of the Day,” “Photo of the Day,” “Jokes of the Day.” The other pages read the tiles in the natural NW, NE, SW, SE order in which they appear in the HTML: “Today’s Articles,” “Photo of the Day,” “Quote of the Day,” “Jokes of the Day.”

Let’s dig deeper into how the audio is generated, and how the page highlights the text.

Static pre-generator

Our GitHub repo includes pre-generated audio files for the PRPTStatic pages, but if you want to generate them yourself, from the bash shell in the AWS Cloud9 IDE, run the following commands:

# navigate to examples
cd /home/ec2-user/environment/amazon-polly-reads-the-page/pregen/examples

# Set env var for my S3 bucket. Example, I called mine prtp-output
S3_BUCKET=prtp-output # Use output BucketName from CloudFormation

#Add lexicon for pronunciation of NE, NW, SE, SW
#Script invokes aws polly put-lexicon
./addlexicon.sh

#Gen each variant
./gen_default.sh
./gen_config.sh
./gen_custom.sh

Now let’s look at how those scripts work.

Default case

We begin with gen_default.sh:

cd ..
python FixHTML.py ../web/PRTPStaticDefault.html  
   example/tmp_wff.html
./gen_ssml.sh example/tmp_wff.html generic.xslt example/tmp.ssml
./run_polly.sh example/tmp.ssml en-US Joanna 
   ../web/polly/PRTPStaticDefault compass
./run_polly.sh example/tmp.ssml en-US Matthew 
   ../web/polly/PRTPStaticDefault compass

The script begins by running the Python program FixHTML.py to make the source HTML file PRTPStaticDefault.html well-formed. It writes the well-formed version of the file to example/tmp_wff.html. This step is crucial for two reasons:

  • Most source HTML is not well formed. This step repairs the source HTML to be well formed. For example, many HTML pages don’t close P elements. This step closes them.
  • We keep track of where in the HTML page we find text. We need to track locations using the same document object model (DOM) structure that the browser uses. For example, the browser automatically adds a TBODY to a TABLE. The Python program follows the same well-formed repairs as the browser.
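
FixHTML.py itself is in the GitHub repo; the following is only an illustrative sketch of the idea, assuming the HTML5 parsing library installed by setup.sh is html5lib:

# fix_html_sketch.py - illustrative only, not the exact script from the repo
import sys
import xml.etree.ElementTree as ET

import html5lib

src, dst = sys.argv[1], sys.argv[2]

with open(src, encoding="utf-8") as f:
    # html5lib repairs the markup the same way a browser does
    # (closes unclosed P elements, inserts TBODY into TABLE, and so on)
    doc = html5lib.parse(f.read(), treebuilder="etree", namespaceHTMLElements=False)

# Write the well-formed document so the XSLT step can consume it
ET.ElementTree(doc).write(dst, encoding="utf-8", method="xml")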

gen_ssml.sh takes the well-formed HTML as input, applies an Extensible Stylesheet Language Transformations (XSLT) transform to it, and outputs an SSML file. (SSML is the markup language that Amazon Polly uses to control how audio is rendered from text.) In the current example, the input is example/tmp_wff.html. The output is example/tmp.ssml. The transform’s job is to decide what text to extract from the HTML and feed to Amazon Polly. generic.xslt is a sensible default XSLT transform for most webpages. In the following example code snippet, it excludes the audio control, the HTML header, as well as HTML elements like script and form. It also excludes elements with the hidden attribute. It includes elements that typically contain text, such as P, H1, and SPAN. For these, it renders both a mark, including the full XPath expression of the element, and the value of the element.

<!-- skip the header -->
<xsl:template match="html/head">
</xsl:template>

<!-- skip the audio itself -->
<xsl:template match="html/body/table[@id='prtp-audio']">
</xsl:template>

<!-- For the body, work through it by applying its templates. This is the default. -->
<xsl:template match="html/body">
<speak>
      <xsl:apply-templates />
</speak>
</xsl:template>

<!-- skip these -->
<xsl:template match="audio|option|script|form|input|*[@hidden='']">
</xsl:template>

<!-- include these -->
<xsl:template match="p|h1|h2|h3|h4|li|pre|span|a|th/text()|td/text()">
<xsl:for-each select=".">
<p>
      <mark>
          <xsl:attribute name="name">
          <xsl:value-of select="prtp:getMark(.)"/>
          </xsl:attribute>
      </mark>
      <xsl:value-of select="normalize-space(.)"/>
</p>
</xsl:for-each>
</xsl:template>

The following is a snippet of the SSML that is rendered. This is fed as input to Amazon Polly. Notice, for example, that the text “Skips hidden paragraph” is to be read in the audio, and we associate it with a mark, which tells us that this text occurs in the location on the page given by the XPath expression /html/body[1]/div[2]/ul[1]/li[1].

<speak>
<p><mark name="/html/body[1]/div[1]/h1[1]"/>PollyReadsThePage Normal Test Page</p>
<p><mark name="/html/body[1]/div[2]/p[1]"/>PollyReadsThePage is a test page for audio readout with highlights.</p>
<p><mark name="/html/body[1]/div[2]/p[2]"/>Here are some features:</p>
<p><mark name="/html/body[1]/div[2]/ul[1]/li[1]"/>Skips hidden paragraph</p>
<p><mark name="/html/body[1]/div[2]/ul[1]/li[2]"/>Speaks but does not highlight collapsed content</p>
…
</speak>

To generate audio in Amazon Polly, we call the script run_polly.sh. It runs the AWS Command Line Interface (AWS CLI) command aws polly start-speech-synthesis-task twice: once to generate MP3 audio, and again to generate the marks file. Because the generation is asynchronous, the script polls until it finds the output in the specified S3 bucket. When it finds the output, it downloads the files to the build server and copies them to the web/polly folder. The following is a listing of the web folders:

  • PRTPStaticDefault.html
  • PRTPStaticConfig.html
  • PRTPStaticCustom.html
  • PRTP.js
  • polly/PRTPStaticDefault/Joanna.mp3, Joanna.marks, Matthew.mp3, Matthew.marks
  • polly/PRTPStaticConfig/Joanna.mp3, Joanna.marks, Matthew.mp3, Matthew.marks
  • polly/PRTPStaticCustom/Joanna.mp3, Joanna.marks, Matthew.mp3, Matthew.marks

Each page has its own set of voice-specific MP3 and marks files. These files are the pre-generated files. The page doesn’t need to invoke Amazon Polly at runtime; the files are part of the web build.
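
If you prefer Python to the AWS CLI, the two calls that run_polly.sh makes map roughly to the following boto3 sketch. The bucket name and key prefix are placeholders, and we assume the last argument to run_polly.sh (compass) is the lexicon name registered by addlexicon.sh:

import boto3

polly = boto3.client("polly")

with open("example/tmp.ssml") as f:
    ssml = f.read()

common = dict(
    Engine="neural",
    LanguageCode="en-US",
    VoiceId="Joanna",
    Text=ssml,
    TextType="ssml",
    LexiconNames=["compass"],                # assumed lexicon name
    OutputS3BucketName="prtp-output",        # placeholder bucket
    OutputS3KeyPrefix="PRTPStaticDefault/",  # placeholder prefix
)

# Task 1: generate the MP3 audio
audio_task = polly.start_speech_synthesis_task(OutputFormat="mp3", **common)

# Task 2: generate the sentence and SSML speech marks
marks_task = polly.start_speech_synthesis_task(
    OutputFormat="json", SpeechMarkTypes=["sentence", "ssml"], **common
)

# Both tasks run asynchronously; run_polly.sh polls Amazon S3 for their outputs
print(audio_task["SynthesisTask"]["TaskId"])
print(marks_task["SynthesisTask"]["TaskId"])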

Config-driven case

Next, consider gen_config.sh:

cd ..
python FixHTML.py ../web/PRTPStaticConfig.html 
  example/tmp_wff.html
python ModGenericXSLT.py example/transform_config.json 
  example/tmp.xslt
./gen_ssml.sh example/tmp_wff.html example/tmp.xslt 
  example/tmp.ssml
./run_polly.sh example/tmp.ssml en-US Joanna 
  ../web/polly/PRTPStaticConfig compass
./run_polly.sh example/tmp.ssml en-US Matthew 
  ../web/polly/PRTPStaticConfig compass

The script is similar to the script in the default case; the main difference is the additional ModGenericXSLT.py step. Our approach is config-driven: we tailor the content to be extracted from the page by specifying what to extract through configuration, not code. In particular, we use the JSON file transform_config.json, which specifies that the content to be included is the set of elements with IDs title, main, maintable, and qbtable. The element with ID wrapup should be excluded. See the following code:

{
 "inclusions": [ 
 	{"id" : "title"} , 
 	{"id": "main"}, 
 	{"id": "maintable"}, 
 	{"id": "qbtable" }
 ],
 "exclusions": [
 	{"id": "wrapup"}
 ]
}

We run the Python program ModGenericXSLT.py to modify generic.xslt, used in the default case, to use the inclusions and exclusions that we specify in transform_config.json. The program writes the results to a temp file (example/tmp.xslt), which it passes to gen_ssml.sh as its XSLT transform.

This is a low-code option. The web publisher doesn’t need to know how to write XSLT. But they do need to understand the structure of the HTML page and the IDs used in its main organizing elements.

Customization case

Finally, consider gen_custom.sh:

cd ..
python FixHTML.py ../web/PRTPStaticCustom.html 
   example/tmp_wff.html
./gen_ssml.sh example/tmp_wff.html example/custom.xslt  
   example/tmp.ssml
./run_polly.sh example/tmp.ssml en-US Joanna 
   ../web/polly/PRTPStaticCustom compass
./run_polly.sh example/tmp.ssml en-US Matthew 
   ../web/polly/PRTPStaticCustom compass

This script is nearly identical to the default script, except it uses its own XSLT—example/custom.xslt—rather than the generic XSLT. The following is a snippet of the XSLT:

<!-- Use NW, SW, NE, SE order for main tiles! -->
<xsl:template match="*[@id='maintable']">
    <mark>
        <xsl:attribute name="name">
        <xsl:value-of select="stats:getMark(.)"/>
        </xsl:attribute>
    </mark>
    <xsl:variable name="tiles" select="./tbody"/>
    <xsl:variable name="tiles-nw" select="$tiles/tr[1]/td[1]"/>
    <xsl:variable name="tiles-ne" select="$tiles/tr[1]/td[2]"/>
    <xsl:variable name="tiles-sw" select="$tiles/tr[2]/td[1]"/>
    <xsl:variable name="tiles-se" select="$tiles/tr[2]/td[2]"/>
    <xsl:variable name="tiles-seq" select="($tiles-nw,  $tiles-sw, $tiles-ne, $tiles-se)"/>
    <xsl:for-each select="$tiles-seq">
         <xsl:apply-templates />  
    </xsl:for-each>
</xsl:template>   

<!-- Say Tom Brady loud! -->
<xsl:template match="span[@style = 'color:blue']" >
<p>
      <mark>
          <xsl:attribute name="name">
          <xsl:value-of select="prtp:getMark(.)"/>
          </xsl:attribute>
      </mark>
      <prosody volume="x-loud">Tom Brady</prosody>
</p>
</xsl:template>

If you want to study the code in detail, refer to the scripts and programs in the GitHub repo.

Browser setup and highlights

The static pages include an HTML5 audio control, which takes as its audio source the MP3 file generated by Amazon Polly and residing on the web server:

<audio id="audio" controls>
  <source src="polly/PRTPStaticDefault/en/Joanna.mp3" type="audio/mpeg">
</audio>

At load time, the page also loads the Amazon Polly-generated marks file. This occurs in the PRTP.js file, which the HTML page includes. The following is a snippet of the marks file for PRTPStaticDefault:

{"time":11747,"type":"sentence","start":289,"end":356,"value":"PollyReadsThePage is a test page for audio readout with highlights."}
{"time":15784,"type":"ssml","start":363,"end":403,"value":"/html/body[1]/div[2]/p[2]"}
{"time":16427,"type":"sentence","start":403,"end":426,"value":"Here are some features:"}
{"time":17677,"type":"ssml","start":433,"end":480,"value":"/html/body[1]/div[2]/ul[1]/li[1]"}
{"time":18344,"type":"sentence","start":480,"end":502,"value":"Skips hidden paragraph"}
{"time":19894,"type":"ssml","start":509,"end":556,"value":"/html/body[1]/div[2]/ul[1]/li[2]"}
{"time":20537,"type":"sentence","start":556,"end":603,"value":"Speaks but does not highlight collapsed content"}

During audio playback, an audio timer event handler in PRTP.js checks the audio’s current time, finds the text to highlight, finds its location on the page, and highlights it. The text to be highlighted is an entry of type sentence in the marks file. The location is the XPath expression in the name attribute of the entry of type ssml that precedes the sentence. For example, if the time is 18400, according to the marks file, the sentence to be highlighted is “Skips hidden paragraph,” which starts at 18344. The location is the ssml entry at time 17677: /html/body[1]/div[2]/ul[1]/li[1].

Test dynamic pages

The page PRTPDynamic.html demonstrates dynamic audio readback using default, configuration-driven, and custom audio extraction approaches.

Default case

In your browser, navigate to PRTPDynamic.html. The page has one query parameter, dynOption, which accepts values default, config, and custom. It defaults to default, so you may omit it in this case. The page has two sections with dynamic content:

  • Latest Articles – Changes frequently throughout the day
  • Greek Philosophers Search By Date – Allows the visitor to search for Greek philosophers by date and shows the results in a table

Create some content in the Greek Philosopher section by entering a date range of -800 to 0, as shown in the example. Then choose Find.

Now play the audio by choosing Play in the audio control.

Behind the scenes, the page runs the following code to render and play the audio:

   buildSSMLFromDefault();
   chooseRenderAudio();
   setVoice();

First it calls the function buildSSMLFromDefault in PRTP.js to extract most of the text from the HTML page body. That function walks the DOM tree, looking for text in common elements such as p, h1, pre, span, and td. It ignores text in elements that usually don’t contain text to be read aloud, such as audio, option, and script. It builds SSML markup to be input to Amazon Polly. The following is a snippet showing extraction of the first row from the philosopher table:

<speak>
...
  <p><mark name="/HTML[1]/BODY[1]/DIV[3]/DIV[1]/DIV[1]/TABLE[1]/TBODY[1]/TR[2]/TD[1]"/>Thales</p>
  <p><mark name="/HTML[1]/BODY[1]/DIV[3]/DIV[1]/DIV[1]/TABLE[1]/TBODY[1]/TR[2]/TD[2]"/>-624 to -546</p>
  <p><mark name="/HTML[1]/BODY[1]/DIV[3]/DIV[1]/DIV[1]/TABLE[1]/TBODY[1]/TR[2]/TD[3]"/>Miletus</p>
  <p><mark name="/HTML[1]/BODY[1]/DIV[3]/DIV[1]/DIV[1]/TABLE[1]/TBODY[1]/TR[2]/TD[4]"/>presocratic</p>
...
</speak>

The chooseRenderAudio function in PRTP.js begins by initializing the AWS SDK for Amazon Cognito, Amazon S3, and Amazon Polly. This initialization occurs only once. If chooseRenderAudio is invoked again because the content of the page has changed, the initialization is skipped. See the following code:

AWS.config.region = env.REGION
AWS.config.credentials = new AWS.CognitoIdentityCredentials({
            IdentityPoolId: env.IDP});
audioTracker.sdk.connection = {
   polly: new AWS.Polly({apiVersion: '2016-06-10'}),
   s3: new AWS.S3()
};

It generates MP3 audio from Amazon Polly. The generation is synchronous for small SSML inputs and asynchronous (with output sent to the S3 bucket) for large SSML inputs (greater than 6,000 characters). In the synchronous case, we ask Amazon Polly to provide the MP3 file using a presigned URL. When the synthesized output is ready, we set the src attribute of the audio control to that URL and load the control. We then request the marks file and load it the same way as in the static case. See the following code:

// create signed URL
const signer = new AWS.Polly.Presigner(pollyAudioInput, audioTracker.sdk.connection.polly);

// call Polly to get MP3 into signed URL
signer.getSynthesizeSpeechUrl(pollyAudioInput, function(error, url) {
  // Audio control uses signed URL
  audioTracker.audioControl.src =
    audioTracker.sdk.audio[audioTracker.voice];
  audioTracker.audioControl.load();

  // call Polly to get marks
  audioTracker.sdk.connection.polly.synthesizeSpeech(
    pollyMarksInput, function(markError, markData) {
    const marksStr = new
      TextDecoder().decode(markData.AudioStream);
    // load marks into page the same as with static
    doLoadMarks(marksStr);
  });
});

Config-driven case

In your browser, navigate to PRTPDynamic.html?dynOption=config. Play the audio. The audio playback is similar to the default case, but there are minor differences. In particular, some content is skipped.

Behind the scenes, when using the config option, the page extracts content differently than in the default case. In the default case, the page uses buildSSMLFromDefault. In the config-driven case, the page specifies the sections it wants to include and exclude:

const ssml = buildSSMLFromConfig({
	 "inclusions": [ 
	 	{"id": "title"}, 
	 	{"id": "main"}, 
	 	{"id": "maintable"}, 
	 	{"id": "phil-result"},
	 	{"id": "qbtable"}, 
	 ],
	 "exclusions": [
	 	{"id": "wrapup"}
	 ]
	});

The buildSSMLFromConfig function, defined in PRTP.js, walks the DOM tree in each of the sections whose ID is provided under inclusions. It extracts content from each and combines them together, in the order specified, to form an SSML document. It excludes the sections specified under exclusions. It extracts content from each section in the same way buildSSMLFromDefault extracts content from the page body.

Customization case

In your browser, navigate to PRTPDynamic.html?dynOption=custom. Play the audio. There are three noticeable differences. Let’s note these and consider the custom code that runs behind the scenes:

  • It reads the main tiles in NW, SW, NE, SE order. The custom code gets each of these cell blocks from maintable and adds them to the SSML in NW, SW, NE, SE order:
const nw = getElementByXpath("//*[@id='maintable']//tr[1]/td[1]");
const sw = getElementByXpath("//*[@id='maintable']//tr[2]/td[1]");
const ne = getElementByXpath("//*[@id='maintable']//tr[1]/td[2]");
const se = getElementByXpath("//*[@id='maintable']//tr[2]/td[2]");
[nw, sw, ne, se].forEach(dir => buildSSMLSection(dir, []));
  • “Tom Brady” is spoken loudly. The custom code puts “Tom Brady” text inside an SSML prosody tag:
if (cellText == "Tom Brady") {
   addSSMLMark(getXpathOfNode( node.childNodes[tdi]));
   startSSMLParagraph();
   startSSMLTag("prosody", {"volume": "x-loud"});
   addSSMLText(cellText);
   endSSMLTag();
   endSSMLParagraph();
}
  • It reads only the first three rows of the quarterback table. It reads the column headers for each row. Check the code in the GitHub repo to discover how this is implemented.

Clean up

To avoid incurring future charges, delete the CloudFormation stack.

Conclusion

In this post, we demonstrated a technical solution to a high-value business problem: how to use Amazon Polly to read the content of a webpage and highlight the content as it’s being read. We showed this using both static and dynamic pages. To extract content from the page, we used DOM traversal and XSLT. To facilitate highlighting, we used the speech marks capability in Amazon Polly.

Learn more about Amazon Polly by visiting its service page.

Feel free to ask questions in the comments.


About the authors

Mike Havey is a Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. Visit his Amazon author page to read more.

Vineet Kachhawaha is a Solutions Architect at AWS with expertise in machine learning. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS.

Read More

Use Amazon SageMaker Data Wrangler for data preparation and Studio Labs to learn and experiment with ML

Amazon SageMaker Studio Lab is a free machine learning (ML) development environment based on open-source JupyterLab for anyone to learn and experiment with ML using AWS ML compute resources. It’s based on the same architecture and user interface as Amazon SageMaker Studio, but with a subset of Studio capabilities.

When you begin working on ML initiatives, you need to perform exploratory data analysis (EDA) or data preparation before proceeding with model building. Amazon SageMaker Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for ML applications via a visual interface. Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes.

A key accelerator of feature preparation in Data Wrangler is the Data Quality and Insights Report. This report checks data quality and helps detect abnormalities in your data, so that you can perform the required data engineering to fix your dataset. You can use the Data Quality and Insights Report to perform an analysis of your data to gain insights into your dataset such as the number of missing values and number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention and help you identify the data preparation steps you need to perform.

Studio Lab users can benefit from Data Wrangler because data quality and feature engineering are critical for the predictive performance of your model. Data Wrangler helps with data quality and feature engineering by giving insights into data quality issues and easily enabling rapid feature iteration and engineering using a low-code UI.

In this post, we show you how to perform exploratory data analysis, prepare and transform data using Data Wrangler, and export the transformed and prepared data to Studio Lab to carry out model building.

Solution overview

The solution includes the following high-level steps:

  1. Create an AWS account and an admin user (this is a prerequisite).
  2. Download the dataset churn.csv.
  3. Load the dataset to Amazon Simple Storage Service (Amazon S3).
  4. Create a SageMaker Studio domain and launch Data Wrangler.
  5. Import the dataset into the Data Wrangler flow from Amazon S3.
  6. Create the Data Quality and Insights Report and draw conclusions on necessary feature engineering.
  7. Perform the necessary data transforms in Data Wrangler.
  8. Download the Data Quality and Insights Report and the transformed dataset.
  9. Upload the data to a Studio Lab project for model training.

The following diagram illustrates this workflow.

Prerequisites

To use Data Wrangler and Studio Lab, you need the following prerequisites:

  • An AWS account with access to SageMaker Studio and Data Wrangler
  • A Studio Lab account

Build a data preparation workflow with Data Wrangler

To get started, complete the following steps:

  1. Upload your dataset to Amazon S3.
  2. On the SageMaker console, under Control panel in the navigation pane, choose Studio.
  3. On the Launch app menu next to your user profile, choose Studio.

    After you successfully log in to Studio, you should see a development environment like the following screenshot.
  4. To create a new Data Wrangler workflow, on the File menu, choose New, then choose Data Wrangler Flow.

    The first step in Data Wrangler is to import your data. You can import data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, and Databricks. In this example, we use Amazon S3. If you just want to see how Data Wrangler works, you can always choose Use sample dataset.
  5. Choose Import data.
  6. Choose Amazon S3.
  7. Choose the dataset you uploaded and choose Import.

    Data Wrangler enables you to either import the entire dataset or sample a portion of it.
  8. To quickly get insights on the dataset, choose First K for Sampling and enter 50000 for Sample size.

Understand data quality and get insights

Let’s use the Data Quality and Insights Report to perform an analysis of the data that we imported into Data Wrangler, and to understand what steps we need to take to clean and process the data.

  1. Choose the plus sign next to Data types and choose Get data insights.
  2. For Analysis type, choose Data Quality and Insights Report.
  3. For Target column, choose Churn?.
  4. For Problem type, select Classification.
  5. Choose Create.

You’re presented with a detailed report that you can review and download. The report includes several sections such as quick model, feature summary, feature correlation, and data insights. The following screenshots provide examples of these sections.

Observations from the report

From the report, we can make the following observations:

  • No duplicate rows were found.
  • The State column appears to be quite evenly distributed, so the data is balanced in terms of state population.
  • The Phone column presents too many unique values to be of any practical use, so we can drop it in our transformation.
  • Based on the feature correlation section of the report, Mins and Charge are highly correlated. We can remove one of them.

Transformation

Based on our observations, we want to make the following transformations:

  • Remove the Phone column because it has many unique values.
  • We also see several features that essentially have 100% correlation with one another. Including these feature pairs in some ML algorithms can create undesired problems, whereas in others it will only introduce minor redundancy and bias. Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins.
  • Convert True or False in the Churn? column to be a numerical value of 1 or 0.

To apply these transformations, complete the following steps:

  1. Return to the data flow and choose the plus sign next to Data types.
  2. Choose Add transform.
  3. Choose Add step.
  4. You can search for the transform you're looking for (in our case, manage columns).
  5. Choose Manage columns.
  6. For Transform, choose Drop column.
  7. For Columns to drop, choose Phone, Day Charge, Eve Charge, Night Charge, and Intl Charge.
  8. Choose Preview, then choose Update.

    Let’s add another transform to perform a categorical encode on the Churn? column.
  9. Choose the transform Encode categorical.
  10. For Transform, choose Ordinal encode.
  11. For Input columns, choose the Churn? column.
  12. For Invalid handling strategy, choose Replace with NaN.
  13. Choose Preview, then choose Update.

Now True and False are converted to 1 and 0, respectively.
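
For reference, the same two transforms expressed in pandas look roughly like the following sketch. The column names match the churn dataset used in this post; the exact ordinal mapping that Data Wrangler produces may differ:

import pandas as pd

churn = pd.read_csv("churn.csv")

# Drop the Phone column and one feature from each highly correlated pair
churn = churn.drop(columns=["Phone", "Day Charge", "Eve Charge", "Night Charge", "Intl Charge"])

# Ordinal-encode the Churn? column (False -> 0, True -> 1)
churn["Churn?"] = churn["Churn?"].astype("category").cat.codes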

Now that we have a good understanding of the data and have prepared and transformed it, we can move the data to Studio Lab for model building.

Upload the data to Studio Lab

To start using the data in Studio Lab, complete the following steps:

  1. Choose Export data to export to an S3 bucket.
  2. For Amazon S3 location, enter your S3 path.
  3. Specify the file type.
  4. Choose Export data.
  5. After you export the data, you can download the data from the S3 bucket to your local computer.
  6. Now you can go to Studio Lab and upload the file to Studio Lab.

    Alternatively, you can connect to Amazon S3 from Studio Lab. For more information, refer to Use external resources in Amazon SageMaker Studio Lab.
  7. Let’s install SageMaker and import Pandas.
  8. Import all libraries as required.
  9. Now we can read the CSV file.
  10. Let's print churn to confirm the dataset is correct, as shown in the sketch following these steps.
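
The following is a minimal sketch of steps 7–10 as notebook cells; the file name churn_transformed.csv is a placeholder for whatever you named the exported file:

# Step 7 (in its own notebook cell): %pip install sagemaker

# Steps 8-10: import the required libraries, read the CSV, and confirm the dataset
import pandas as pd

churn = pd.read_csv("churn_transformed.csv")  # placeholder file name
print(churn.shape)
print(churn.head())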

Now that you have the processed dataset in Studio Lab, you can carry out further steps required for model building.

Data Wrangler pricing

You can perform all the steps in this post for EDA or data preparation within Data Wrangler and pay simple instance, job, and storage pricing based on usage or consumption. No upfront costs or licensing fees are required.

Clean up

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees. To avoid losing work, save your data flow before shutting Data Wrangler down.

  1. To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow.
    Data Wrangler automatically saves your data flow every 60 seconds.
  2. To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
  3. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  4. Choose Shut down all to confirm.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we saw how you can gain insights into your dataset, perform exploratory data analysis, prepare and transform data using Data Wrangler within Studio, and export the transformed and prepared data to Studio Lab and carry out model building and other steps.

With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.


About the authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about the cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with a passion for designing, creating, and promoting human-centered data and analytics experiences. He supports AWS strategic customers on their transformation toward becoming data-driven organizations.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James's work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Read More