Amazon AWS – Page 212

Drive efficiencies with CI/CD best practices on Amazon Lex

July 7, 2022

by Swapandeep Singh Amazon AWS

Let’s say you have identified a use case in your organization that you would like to handle via a chatbot. You familiarized yourself with Amazon Lex, built a prototype, and did a few trial interactions with the bot. You liked the overall experience and now want to deploy the bot in your production environment, but aren’t sure about best practices for Amazon Lex. In this post, we review the best practices for developing and deploying Amazon Lex bots, enabling you to streamline the end-to-end bot lifecycle and optimize your operations.

We have covered the planning, design, and configuration phases in previous blog posts. We suggest reviewing these posts to help you build engaging conversations with your bot before you proceed. After you’ve initially configured the bot, you should test it internally and iterate on the bot definition. You’re now ready to deploy it in your production environment (such as a call center), where the bot will process live conversations. Once in production, you should monitor it continuously to make sure it’s meeting your desired business goals. This cycle repeats as you add new use cases and enhancements.

Let’s review the best practices for development, testing, deployment, and monitoring bots.

Development

Consider the following best practices when developing your bot:

Manage bot schema via code – The Amazon Lex console provides an easy-to-use interface as you design and configure the bot, but relies on manual actions to replicate the setup. We recommend converting the bot schema into code after finishing the design to simplify this step. You can use APIs or AWS CloudFormation (see Creating Amazon Lex V2 resources with AWS CloudFormation) to manage the bot programmatically.
Checkpoint bot schema with bot versioning – Checkpointing is a common approach often used to revert an application to a last-known stable state. Amazon Lex offers this functionality via bot versioning. We recommend using a new version at each milestone in your development process. This allows you to make incremental changes to your bot definition, with an easy way to revert them in case they don’t work as expected.
Identify data handling requirements and configure appropriate controls – Amazon Lex follows the AWS shared responsibility model, which includes guidelines for data protection to comply with industry regulations and with your company’s own data privacy standards. Additionally, Amazon Lex adheres to compliance programs such as SOC, PCI, and FedRAMP. Amazon Lex provides the ability to obfuscate slots that are considered sensitive. You should identify your data privacy requirements and configure the appropriate controls in your bot.

Testing

After you have a bot definition, you should test the bot to ensure that it works as intended and is configured correctly. For example, it should have permissions to trigger other services, such as AWS Lambda functions. In addition, you should also test the bot to confirm it’s able to interpret different types of user requests. Consider the following best practices for testing:

Identify test data – You should gather relevant test data to test bot performance. The test data should include a comprehensive representation of expected user conversations with the bot, especially for IVR use cases where the bot will need to understand voice inputs. The test data should cover different speaking styles and accents. Such test data can provide experience validation for your target customer base.
Identify user experience metrics – Defining the conversational experience can be hard. You have to anticipate and plan for all the different ways users might engage with the bot. How do you guide the caller without sounding too prescriptive? How do you recover if the caller provides incorrect or incomplete information? To manage the dialog through many different scenarios, you should set a clear goal that covers different speaking styles, acoustic conditions, and modality, and identify objective metrics that you can track. For example, an objective indicator would be “90% of conversations should have less than two re-prompts played to the user,” versus a subjective indicator such as “the majority of conversations should not ask users to repeat their input.”
Evaluate user experience along the way – In some cases, seemingly small changes can have a big impact on the user experience. For example, consider a situation where you inadvertently introduce a typo in the regular expression used for an account ID slot type, which leads to bot re-prompting the user to provide input again. You should evaluate the user experience, and invest in an automated testing to generate key metrics. You can refer to Evaluating an automatic speech recognition service and Testing accuracy and regression with Amazon Connect and Amazon Lex for examples of how to test and generate key metrics.

Deployment

Once you’re satisfied with the bot’s performance, you’ll want to deploy the bot to start serving your production traffic. As you iterate the bot over the course of its lifecycle, you repeat the deployments, making it a continuous process, so it’s critical to have a streamlined, automated deployment to reduce the chance of errors. Consider the following best practices for deployment:

Use a multi-account environment – You should follow the AWS recommended multi-account environment setup in your organization and use separate AWS accounts for your development stage and production stage. If you have a multi-Region presence, then you should also use a separate AWS account per Region for production. Using separate AWS accounts per stage offers you security, access, and billing boundaries for your AWS resources.
Automate promoting a bot from development through to production – When replicating the bot setup in your development stage to your production stage, you should use automated solutions and minimize manual touch points. You should use CloudFormation templates to create your bots. Alternatively, you can use Amazon Lex export and import APIs to provide an automated means to copy a bot schema across accounts.
Roll out changes in a phased manner – You should deploy changes to your production environment in a phased manner, so that changes are released to a subset of your production traffic before being released to all users. Such an approach gives you the chance to limit the blast radius in case there are any issues with the change. One way you can achieve this is by having a two-phased deployment approach: you create two aliases for a bot (for example, prod-05 and prod-95). You first associate the new bot version with one alias (prod-05 in this example). After you validate the key metrics meet the success criteria, you associate the second alias (prod-95) with the new bot version.

Note that you need to control the distribution of traffic on the client application used to integrate with Amazon Lex bots. For example, if you’re using Amazon Connect to integrate with your bots, you can use a Distribute by percentage contact block in conjunction with two or more Get customer input blocks.

It’s important to note that Amazon Lex provides a test alias out of the box. The test alias is meant to be used for ad hoc manual testing via the Amazon Lex console only, and is not meant to handle production-scale loads. We recommend using a dedicated alias for your production traffic.

Monitoring

Monitoring is important for maintaining reliability, availability, and an effective end-user experience. You should analyze your bot’s metrics and use the learnings as a feedback mechanism to improve the bot schema as well your development, testing, and deployment practices. Amazon Lex supports multiple mechanisms to monitor bots. Consider the following best practices for monitoring your Lex bots:

Monitor constantly and iterate – Amazon Lex integrates with Amazon CloudWatch to provide near-real-time metrics that can provide you with key insights into your users’ interactions with the bot. These insights can help you gain perspective on the end-user experience. To learn more about the different types of metrics that Amazon Lex emits, see Monitoring Amazon Lex V2 with Amazon CloudWatch. We recommend setting up thresholds to trigger alarms. Similarly, Amazon Lex gives you visibility into the raw input utterances from your users’ interactions with the bot. You should use utterance statistics or conversation logs to gain insights to identify communication patterns and make appropriate changes to your bot as necessary. To learn how to create a personalized analytics dashboard for your bots, refer to Monitor operational metrics for your Amazon Lex chatbot.

The best practices discussed in this post focus primarily on Amazon Lex-specific use cases. In addition to these, you should review and adhere to best practices when managing your cloud infrastructure in AWS. Make sure that your cloud infrastructure is secure and only accessible by authorized users. You should also review and adopt the appropriate AWS security best practices within your organization. Lastly, you should proactively review the AWS quotas for individual AWS services (including Amazon Lex quotas) and request appropriate changes if necessary.

Conclusion

You can use Amazon Lex to enable sophisticated natural language conversations and drive customer service efficiencies. In this post, we reviewed the best practices for the development, testing, deployment, and monitoring phases of a bot lifecycle. With these guidelines, you can improve the end-user experience and achieve better customer engagement. Start building your Amazon Lex conversational experience today!

About the Author

Swapandeep Singh is an engineer with the Amazon Lex team. He works on making interactions with bots smoother and more human-like. Outside of work, he likes to travel and learn about different cultures.

Anwar Walid receives 2022 IEEE INFOCOM Test of Time Paper Award

July 7, 2022

by Amazon AWS

Walid’s 2010 paper on distributed caching algorithms for content distribution networks cited for its “significant impact on the research community”.Read More

Feature engineering at scale for healthcare and life sciences with Amazon SageMaker Data Wrangler

July 7, 2022

by Forrest Sun Amazon AWS

Machine learning (ML) is disrupting a lot of industries at an unprecedented pace. The healthcare and life sciences (HCLS) industry has been going through a rapid evolution in recent years embracing ML across a multitude of use cases for delivering quality care and improving patient outcomes.

In a typical ML lifecycle, data engineers and scientists spend the majority of their time on the data preparation and feature engineering steps before even getting started with the process of model building and training. Having a tool that can lower the barrier to entry for data preparation, thereby improving productivity, is a highly desirable ask for these personas. Amazon SageMaker Data Wrangler is purpose built by AWS to reduce the learning curve and enable data practitioners to accomplish data preparation, cleaning, and feature engineering tasks in less effort and time. It offers a GUI interface with many built-in functions and integrations with other AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon SageMaker Feature Store, as well as partner data sources including Snowflake and Databricks.

In this post, we demonstrate how to use Data Wrangler to prepare healthcare data for training a model to predict heart failure, given a patient’s demographics, prior medical conditions, and lab test result history.

Solution overview

The solution consists of the following steps:

Acquire a healthcare dataset as input to Data Wrangler.
Use Data Wrangler’s built-in transformation functions to transform the dataset. This includes drop columns, featurize data/time, join datasets, impute missing values, encode categorical variables, scale numeric values, balance the dataset, and more.
Use Data Wrangler’s custom transform function (Pandas or PySpark code) to supplement additional transformations required beyond the built-in transformations and demonstrate the extensibility of Data Wrangler. This includes filter rows, group data, form new dataframes based on conditions, and more.
Use Data Wrangler’s built-in visualization functions to perform visual analysis. This includes target leakage, feature correlation, quick model, and more.
Use Data Wrangler’s built-in export options to export the transformed dataset to Amazon S3.
Launch a Jupyter notebook to use the transformed dataset in Amazon S3 as input to train a model.

Generate a dataset

Now that we have settled on the ML problem statement, we first set our sights on acquiring the data we need. Research studies such as Heart Failure Prediction may provide data that’s already in good shape. However, we often encounter scenarios where the data is quite messy and requires joining, cleansing, and several other transformations that are very specific to the healthcare domain before it can be used for ML training. We want to find or generate data that is messy enough and walk you through the steps of preparing it using Data Wrangler. With that in mind, we picked Synthea as a tool to generate synthetic data that fits our goal. Synthea is an open-source synthetic patient generator that models the medical history of synthetic patients. To generate your dataset, complete the following steps:

Follow the instructions as per the quick start documentation to create an Amazon SageMaker Studio domain and launch Studio.
This is a prerequisite step. It is optional if Studio is already set up in your account.
After Studio is launched, on the Launcher tab, choose System terminal.
This launches a terminal session that gives you a command line interface to work with.

To install Synthea and generate the dataset in CSV format, run the following commands in the launched terminal session:

$ sudo yum install -y java-1.8.0-openjdk-devel
$ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
$ export PATH=$JAVA_HOME/bin:$PATH
$ git clone https://github.com/synthetichealth/synthea
$ git checkout v3.0.0
$ cd synthea
$ ./run_synthea --exporter.csv.export=true -p 10000

We supply a parameter to generate the datasets with a population size of 10,000. Note the size parameter denotes the number of alive members of the population. Additionally, Synthea also generates data for dead members of the population which might add a few extra data points on top of the specified sample size.

Wait until the data generation is complete. This step usually takes around an hour or less. Synthea generates multiple datasets, including patients, medications, allergies, conditions, and more. For this post, we use three of the resulting datasets:

patients.csv – This dataset is about 3.2 MB and contains approximately 11,000 rows of patient data (25 columns including patient ID, birthdate, gender, address, and more)
conditions.csv – This dataset is about 47 MB and contains approximately 370,000 rows of medical condition data (six columns including patient ID, condition start date, condition code, and more)
observations.csv – This dataset is about 830 MB and contains approximately 5 million rows of observation data (eight columns including patient ID, observation date, observation code, value, and more)

There is a one-to-many relationship between the patients and conditions datasets. There is also a one-to-many relationship between the patients and observations datasets. For a detailed data dictionary, refer to CSV File Data Dictionary.

To upload the generated datasets to a source bucket in Amazon S3, run the following commands in the terminal session:
```
$ cd ./output/csv
$ aws s3 sync . s3://<source bucket name>/
```

Launch Data Wrangler

Choose SageMaker resources in the navigation page in Studio and on the Projects menu, choose Data Wrangler to create a Data Wrangler data flow. For detailed steps how to launch Data Wrangler from within Studio, refer to Get Started with Data Wrangler.

Import data

To import your data, complete the following steps:

Choose Amazon S3 and locate the patients.csv file in the S3 bucket.
In the Details pane, choose First K for Sampling.
Enter 1100 for Sample size.
In the preview pane, Data Wrangler pulls the first 100 rows from the dataset and lists them as a preview.
Choose Import.
Data Wrangler selects the first 1,100 patients from the total patients (11,000 rows) generated by Synthea and imports the data. The sampling approach lets Data Wrangler only process the sample data. It enables us to develop our data flow with a smaller dataset, which results in quicker processing and a shorter feedback loop. After we create the data flow, we can submit the developed recipe to a SageMaker processing job to horizontally scale out the processing for the full or larger dataset in a distributed fashion.
Repeat this process for the conditions and observations datasets.
1. For the conditions dataset, enter 37000 for Sample size, which is 1/10 of the total 370,000 rows generated by Synthea.
2. For the observations dataset, enter 500000 for Sample size, which is 1/10 of the total observations 5 million rows generated by Synthea.

You should see three datasets as shown in the following screenshot.

Transform the data

Data transformation is the process of changing the structure, value, or format of one or more columns in the dataset. The process is usually developed by a data engineer and can be challenging for people with a smaller data engineering skillset to decipher the logic proposed for the transformation. Data transformation is part of the broader feature engineering process, and the correct sequence of steps is another important criterion to keep in mind while devising such recipes.

Data Wrangler is designed to be a low-code tool to reduce the barrier of entry for effective data preparation. It comes with over 300 preconfigured data transformations for you to choose from without writing a single line of code. In the following sections, we see how to transform the imported datasets in Data Wrangler.

Drop columns in patients.csv

We first drop some columns from the patients dataset. Dropping redundant columns removes non-relevant information from the dataset and helps us reduce the amount of computing resources required to process the dataset and train a model. In this section, we drop columns such as SSN or passport number based on common sense that these columns have no predictive value. In other words, they don’t help our model predict heart failure. Our study is also not concerned about other columns such as birthplace or healthcare expenses’ influence to a patient’s heart failure, so we drop them as well. Redundant columns can also be identified by running the built-in analyses like target leakage, feature correlation, multicollinearity, and more, which are built into Data Wrangler. For more details on the supported analyses types, refer to Analyze and Visualize. Additionally, you can use the Data Quality and Insights Report to perform automated analyses on the datasets to arrive at a list of redundant columns to eliminate.

Choose the plus sign next to Data types for the patients.csv dataset and choose Add transform.
Choose Add step and choose Manage columns.
For Transform¸ choose Drop column.
For Columns to drop, choose the following columns:
1. SSN
2. DRIVERS
3. PASSPORT
4. PREFIX
5. FIRST
6. LAST
7. SUFFIX
8. MAIDEN
9. RACE
10. ETHNICITY
11. BIRTHPLACE
12. ADDRESS
13. CITY
14. STATE
15. COUNTY
16. ZIP
17. LAT
18. LON
19. HEALTHCARE_EXPENSES
20. HEALTHCARE_COVERAGE
Choose Preview to review the transformed dataset, then choose Add.

You should see the step Drop column in your list of transforms.

Featurize date/time in patients.csv

Now we use the Featurize date/time function to generate the new feature Year from the BIRTHDATE column in the patients dataset. We use the new feature in a subsequent step to calculate a patient’s age at the time of observation takes place.

In the Transforms pane of your Drop column page for the patients dataset, choose Add step.
Choose the Featurize date/time transform.
Choose Extract columns.
For Input columns, add the column BIRTHDATE.
Select Year and deselect Month, Day, Hour, Minute, Second.
Choose Preview, then choose Add.

Add transforms in observations.csv

Data Wrangler supports custom transforms using Python (user-defined functions), PySpark, Pandas, or PySpark (SQL). You can choose your transform type based on your familiarity with each option and preference. For the latter three options, Data Wrangler exposes the variable df for you to access the dataframe and apply transformations on it. For a detailed explanation and examples, refer to Custom Transforms. In this section, we add three custom transforms to the observations dataset.

Add a transform to observations.csv and drop the DESCRIPTION column.
Choose Preview, then choose Add.
In the Transforms pane, choose Add step and choose Custom transform.
On the drop-down menu, choose Python (Pandas).

Enter the following code:

df = df[df["CODE"].isin(['8867-4','8480-6','8462-4','39156-5','777-3'])]

These are LONIC codes that correspond to the following observations we’re interested in using as features for predicting heart failure:

heart rate: 8867-4
systolic blood pressure: 8480-6
diastolic blood pressure: 8462-4
body mass index (BMI): 39156-5
platelets [#/volume] in Blood: 777-3

Choose Preview, then choose Add.
Add a transform to extract Year and Quarter from the DATE column.
Choose Preview, then choose Add.
Choose Add step and choose Custom transform.
On the drop-down menu, choose Python (PySpark).

The five types of observations may not always be recorded on the same date. For example, a patient may visit their family doctor on January 21 and have their systolic blood pressure, diastolic blood pressure, heart rate, and body mass index measured and recorded. However, a lab test that includes platelets may be done at a later date on February 2. Therefore, it’s not always possible to join dataframes by the observation date. Here we join dataframes on a coarse granularity at the quarter basis.

Enter the following code:

from pyspark.sql.functions import col

systolic_df = (
    df.select("patient", "DATE_year", "DATE_quarter", "value")
                   .withColumnRenamed("value", "systolic")
                   .filter((col("code") == "8480-6"))
  )

diastolic_df = (
    df.select("patient", "DATE_year", "DATE_quarter", "value")
                   .withColumnRenamed('value', 'diastolic')
                   .filter((col("code") == "8462-4"))
    )

hr_df = (
    df.select("patient", "DATE_year", "DATE_quarter", "value")
                   .withColumnRenamed('value', 'hr')
                   .filter((col("code") == "8867-4"))
    )

bmi_df = (
    df.select("patient", "DATE_year", "DATE_quarter", "value")
                   .withColumnRenamed('value', 'bmi')
                   .filter((col("code") == "39156-5"))
    )

platelets_df = (
    df.select("patient", "DATE_year", "DATE_quarter", "value")
                   .withColumnRenamed('value', 'platelets')
                   .filter((col("code") == "777-3"))
    )

df = (
    systolic_df.join(diastolic_df, ["patient", "DATE_year", "DATE_quarter"])
                            .join(hr_df, ["patient", "DATE_year", "DATE_quarter"])
                            .join(bmi_df, ["patient", "DATE_year", "DATE_quarter"])
                            .join(platelets_df, ["patient", "DATE_year", "DATE_quarter"])
)

Choose Preview, then choose Add.
Choose Add step, then choose Manage rows.
For Transform, choose Drop duplicates.
Choose Preview, then choose Add.
Choose Add step and choose Custom transform.
On the drop-down menu, choose Python (Pandas).

Enter the following code to take an average of data points that share the same time value:

import pandas as pd
df.loc[:, df.columns != 'patient']=df.loc[:, df.columns != 'patient'].apply(pd.to_numeric)
df = df.groupby(['patient','DATE_year','DATE_quarter']).mean().round(0).reset_index()

Choose Preview, then choose Add.

Join patients.csv and observations.csv

In this step, we showcase how to effectively and easily perform complex joins on datasets without writing any code via Data Wrangler’s powerful UI. To learn more about the supported types of joins, refer to Transform Data.

To the right of Transform: patients.csv, choose the plus sign next to Steps and choose Join.
You can see the transformed patients.csv file listed under Datasets in the left pane.
To the right of Transform: observations.csv, click on the Steps to initiate the join operation.
The transformed observations.csv file is now listed under Datasets in the left pane.
Choose Configure.
For Join Type, choose Inner.
For Left, choose Id.
For Right, choose patient.
Choose Preview, then choose Add.

Add a custom transform to the joined datasets

In this step, we calculate a patient’s age at the time of observation. We also drop columns that are no longer needed.

Choose the plus sign next to 1st Join and choose Add transform.

Add a custom transform in Pandas:

df['age'] = df['DATE_year'] - df['BIRTHDATE_year']
df = df.drop(columns=['BIRTHDATE','DEATHDATE','BIRTHDATE_year','patient'])

Choose Preview, then choose Add.

Add custom transforms to conditions.csv

Choose the plus sign next to Transform: conditions.csv and choose Add transform.

Add a custom transform in Pandas:

df = df[df["CODE"].isin(['84114007', '88805009', '59621000', '44054006', '53741008', '449868002', '49436004'])]
df = df.drop(columns=['DESCRIPTION','ENCOUNTER','STOP'])

Note: As we demonstrated earlier, you can drop columns either using custom code or using the built-in transformations provided by Data Wrangler. Custom transformations within Data Wrangler provides the flexibility to bring your own transformation logic in the form of code snippets in the supported frameworks. These snippets can later be searched and applied if needed.

The codes in the preceding transform are SNOMED-CT codes that correspond to the following conditions. The heart failure or chronic congestive heart failure condition becomes the label. We use the remaining conditions as features for predicting heart failure. We also drop a few columns that are no longer needed.

Heart failure: 84114007
Chronic congestive heart failure: 88805009
Hypertension: 59621000
Diabetes: 44054006
Coronary Heart Disease: 53741008
Smokes tobacco daily: 449868002
Atrial Fibrillation: 49436004

Next, let’s add a custom transform in PySpark:

from pyspark.sql.functions import col, when

heartfailure_df = (
    df.select("patient", "start")
                      .withColumnRenamed("start", "heartfailure")
                   .filter((col("code") == "84114007") | (col("code") == "88805009"))
  )

hypertension_df = (
    df.select("patient", "start")
                   .withColumnRenamed("start", "hypertension")
                   .filter((col("code") == "59621000"))
  )

diabetes_df = (
    df.select("patient", "start")
                   .withColumnRenamed("start", "diabetes")
                   .filter((col("code") == "44054006"))
  )

coronary_df = (
    df.select("patient", "start")
                   .withColumnRenamed("start", "coronary")
                   .filter((col("code") == "53741008"))
  )

smoke_df = (
    df.select("patient", "start")
                   .withColumnRenamed("start", "smoke")
                   .filter((col("code") == "449868002"))
  )

atrial_df = (
    df.select("patient", "start")
                   .withColumnRenamed("start", "atrial")
                   .filter((col("code") == "49436004"))
  )

df = (
    heartfailure_df.join(hypertension_df, ["patient"], "leftouter").withColumn("has_hypertension", when(col("hypertension") < col("heartfailure"), 1).otherwise(0))
    .join(diabetes_df, ["patient"], "leftouter").withColumn("has_diabetes", when(col("diabetes") < col("heartfailure"), 1).otherwise(0))
    .join(coronary_df, ["patient"], "leftouter").withColumn("has_coronary", when(col("coronary") < col("heartfailure"), 1).otherwise(0))
    .join(smoke_df, ["patient"], "leftouter").withColumn("has_smoke", when(col("smoke") < col("heartfailure"), 1).otherwise(0))
    .join(atrial_df, ["patient"], "leftouter").withColumn("has_atrial", when(col("atrial") < col("heartfailure"), 1).otherwise(0))
)

We perform a left outer join to keep all entries in the heart failure dataframe. A new column has_xxx is calculated for each condition other than heart failure based on the condition’s start date. We’re only interested in medical conditions that were recorded prior to the heart failure and use them as features for predicting heart failure.

Add a built-in Manage columns transform to drop the redundant columns that are no longer needed:
1. hypertension
2. diabetes
3. coronary
4. smoke
5. atrial
Extract Year and Quarter from the heartfailure column.
This matches the granularity we used earlier in the transformation of the observations dataset.
We should have a total of 6 steps for conditions.csv.

Join conditions.csv to the joined dataset

We now perform a new join to join the conditions dataset to the joined patients and observations dataset.

Choose Transform: 1st Join.
Choose the plus sign and choose Join.
Choose Steps next to Transform: conditions.csv.
Choose Configure.
For Join Type, choose Left outer.
For Left, choose Id.
For Right, choose patient.
Choose Preview, then choose Add.

Add transforms to the joined datasets

Now that we have all three datasets joined, let’s apply some additional transformations.

Add the following custom transform in PySpark so has_heartfailure becomes our label column:

from pyspark.sql.functions import col, when
df = (
    df.withColumn("has_heartfailure", when(col("heartfailure").isNotNull(), 1).otherwise(0))
)

Add the following custom transformation in PySpark:
```
from pyspark.sql.functions import col

df = (
    df.filter(
      (col("has_heartfailure") == 0) | 
      ((col("has_heartfailure") == 1) & ((col("date_year") <= col("heartfailure_year")) | ((col("date_year") == col("heartfailure_year")) & (col("date_quarter") <= col("heartfailure_quarter")))))
    )
)
```
We’re only interested in observations recorded prior to when the heart failure condition is diagnosed and use them as features for predicting heart failure. Observations taken after heart failure is diagnosed may be affected by the medication a patient takes, so we want to exclude those ones.
Drop the redundant columns that are no longer needed:
1. Id
2. DATE_year
3. DATE_quarter
4. patient
5. heartfailure
6. heartfailure_year
7. heartfailure_quarter
On the Analysis tab, for Analysis type¸ choose Table summary.
A quick scan through the summary shows that the MARITAL column has missing data.
Choose the Data tab and add a step.
Choose Handle Missing.
For Transform, choose Fill missing.
For Input columns, choose MARITAL.
For Fill value, enter S.
Our strategy here is to assume the patient is single if the marital status has missing value. You can have a different strategy.
Choose Preview, then choose Add.
Fill the missing value as 0 for has_hypertension, has_diabetes, has_coronary, has_smoke, has_atrial.

Marital and Gender are categorial variables. Data Wrangler has a built-in function to encode categorial variables.

Add a step and choose Encode categorial.
For Transform, choose One-hot encode.
For Input columns, choose MARITAL.
For Output style, choose Column.
This output style produces encoded values in separate columns.
Choose Preview, then choose Add.
Repeat these steps for the Gender column.

The one-hot encoding splits the Marital column into Marital_M (married) and Marital_S (single), and splits the Gender column into Gender_M (male) and Gender_F (female). Because Marital_M and Marital_S are mutually exclusive (as are Gender_M and Gender_F), we can drop one column to avoid redundant features.

Drop Marital_S and Gender_F.

Numeric features such as systolic, heart rate, and age have different unit standards. For a linear regression-based model, we need to normalize these numeric features first. Otherwise, some features with higher absolute values may have an unwarranted advantage over other features with lower absolute values and result in poor model performance. Data Wrangler has the built-in transform Min-max scaler to normalize the data. For a decision tree-based classification model, normalization isn’t required. Our study is a classification problem so we don’t need to apply normalization. Imbalanced classes are a common problem in classification. Imbalance happens when the training dataset contains severely skewed class distribution. For example, when our dataset contains disproportionally more patients without heart failure than patients with heart failure, it can cause the model to be biased toward predicting no heart failure and perform poorly. Data Wrangler has a built-in function to tackle the problem.

Add a custom transform in Pandas to convert data type of columns from “object” type to numeric type:
```
import pandas as pd
df=df.apply(pd.to_numeric)
```
Choose the Analysis tab.
For Analysis type¸ choose Histogram.
For X axis, choose has_heartfailure.
Choose Preview.

It’s obvious that we have an imbalanced class (more data points labeled as no heart failure than data points labeled as heart failure).
Go back to the Data tab. Choose Add step and choose Balance data.
For Target column, choose has_heartfailure.
For Desired ratio, enter 1.
For Transform, choose SMOTE.

SMOTE stands for Synthetic Minority Over-sampling Technique. It’s a technique to create new minority instances and add to the dataset to reach class balance. For detailed information, refer to SMOTE: Synthetic Minority Over-sampling Technique.
Choose Preview, then choose Add.
Repeat the histogram analysis in step 20-23. The result is a balanced class.

Visualize target leakage and feature correlation

Next, we’re going to perform a few visual analyses using Data Wrangler’s rich toolset of advanced ML-supported analysis types. First, we look at target leakage. Target leakage occurs when data in the training dataset is strongly correlated with the target label, but isn’t available in real-world data at inference time.

On the Analysis tab, for Analysis type¸ choose Target Leakage.
For Problem Type, choose classification.
For Target, choose has_heartfailure.
Choose Preview.

Based on the analysis, hr is a target leakage. We’ll drop it in a subsequent step. age is flagged a target leakage. It’s reasonable to say that a patient’s age will be available during inference time, so we keep age as a feature. Systolic and diastolic are also flagged as likely target leakage. We expect to have the two measurements during inference time, so we keep them as features.
Choose Add to add the analysis.

Then, we look at feature correlation. We want to select features that are correlated with the target but are uncorrelated among themselves.

On the Analysis tab, for Analysis type¸ choose Feature Correlation.
For Correlation Type¸ choose linear.
Choose Preview.

The coefficient scores indicate strong correlations between the following pairs:

systolic and diastolic
bmi and age
has_hypertension and has_heartfailure (label)

For features that are strongly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To mitigate the correlation, we can simply remove one from the pair. We drop diastolic and bmi and keep systolic and age in a subsequent step.

Drop diastolic and bmi columns

Add additional transform steps to drop the hr, diastolic and bmi columns using the built-in transform.

Generate the Data Quality and Insights Report

AWS recently announced the new Data Quality and Insights Report feature in Data Wrangler. This report automatically verifies data quality and detects abnormalities in your data. Data scientists and data engineers can use this tool to efficiently and quickly apply domain knowledge to process datasets for ML model training. This step is optional. To generate this report on our datasets, complete the following steps:

On the Analysis tab, for Analysis type, choose Data Quality and Insights Report.
For Target column, choose has_heartfailure.
For Problem type, select Classification.
Choose Create.

In a few minutes, it generates a report with a summary, visuals, and recommendations.

Generate a Quick Model analysis

We have completed our data preparation, cleaning, and feature engineering. Data Wrangler has a built-in function that provides a rough estimate of the expected predicted quality and the predictive power of features in our dataset.

On the Analysis tab, for Analysis type¸ choose Quick Model.
For Label, choose has_heartfailure.
Choose Preview.

As per our Quick Model analysis, we can see the feature has_hypertension has the highest feature importance score among all features.

Export the data and train the model

Now let’s export the transformed ML-ready features to a destination S3 bucket and scale the entire feature engineering pipeline we have created so far using the samples into the entire dataset in a distributed fashion.

Choose the plus sign next to the last box in the data flow and choose Add destination.
Choose Amazon S3.
Enter a Dataset name. For Amazon S3 location, choose a S3 bucket, then choose Add Destination.
Choose Create job to launch a distributed PySpark processing job to perform the transformation and output the data to the destination S3 bucket.

Depending on the size of the datasets, this option lets us easily configure the cluster and horizontally scale in a no-code fashion. We don’t have to worry about partitioning the datasets or managing the cluster and Spark internals. All of this is automatically taken care for us by Data Wrangler.
On the left pane, choose Next, 2. Configure job.
Then choose Run.

Alternatively, we can also export the transformed output to S3 via a Jupyter Notebook. With this approach, Data Wrangler automatically generates a Jupyter notebook with all the code needed to kick-off a processing job to apply the data flow steps (created using a sample) on the larger full dataset and use the transformed dataset as features to kick-off a training job later. The notebook code can be readily run with or without making changes. Let’s now walk through the steps on how to get this done via Data Wrangler’s UI.

Choose the plus sign next to the last step in the data flow and choose Export to.
Choose Amazon S3 (via Jupyter Notebook).
It automatically opens a new tab with a Jupyter notebook.
In the Jupyter notebook, locate the cell in the (Optional) Next Steps section and change run_optional_steps from False to True.
The enabled optional steps in the notebook perform the following:
- Train a model using XGBoost
Go back to the top of the notebook and on the Run menu, choose Run All Cells.

If you use the generated notebook as is, it launches a SageMaker processing job that scales out the processing across two m5.4xlarge instances to processes the full dataset on the S3 bucket. You can adjust the number of instances and instance types based on the dataset size and time you need to complete the job.

Wait until the training job from the last cell is complete. It generates a model in the SageMaker default S3 bucket.

The trained model is ready for deployment for either real-time inference or batch transformation. Note that we used synthetic data to demonstrate functionalities in Data Wrangler and used processed data for training model. Given that the data we used is synthetic, the inference result from the trained model is not meant for real-world medical condition diagnosis or substitution of judgment from medical practitioners.

You can also directly export your transformed dataset into Amazon S3 by choosing Export on top of the transform preview page. The direct export option only exports the transformed sample if sampling was enabled during the import. This option is best suited if you’re dealing with smaller datasets. The transformed data can also be ingested directly into a feature store. For more information, refer to Amazon SageMaker Feature Store. The data flow can also be exported as a SageMaker pipeline that can be orchestrated and scheduled as per your requirements. For more information, see Amazon SageMaker Pipelines.

Conclusion

In this post, we showed how to use Data Wrangler to process healthcare data and perform scalable feature engineering in a tool-driven, low-code fashion. We learned how to apply the built-in transformations and analyses aptly wherever needed, combining it with custom transformations to add even more flexibility to our data preparation workflow. We also walked through the different options for scaling out the data flow recipe via distributed processing jobs. We also learned how the transformed data can be easily used for training a model to predict heart failure.

There are many other features in Data Wrangler we haven’t covered in this post. Explore what’s possible in Prepare ML Data with Amazon SageMaker Data Wrangler and learn how to leverage Data Wrangler for your next data science or machine learning project.

About the Authors

Forrest Sun is a Senior Solution Architect with the AWS Public Sector team in Toronto, Canada. He has worked in the healthcare and finance industries for the past two decades. Outside of work, he enjoys camping with his family.

Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.

A quick guide to Amazon’s 45-plus NAACL papers

July 7, 2022

by Amazon AWS

The breadth and originality of Amazon’s natural-language-processing research are on display at the annual meeting of the North American chapter of the Association for Computational Linguistics.Read More

Break through language barriers with Amazon Transcribe, Amazon Translate, and Amazon Polly

July 6, 2022

by Michael Tran Amazon AWS

Imagine a surgeon taking video calls with patients across the globe without the need of a human translator. What if a fledgling startup could easily expand their product across borders and into new geographical markets by offering fluid, accurate, multilingual customer support and sales, all without the need of a live human translator? What happens to your business when you’re no longer bound by language?

It’s common today to have virtual meetings with international teams and customers that speak many different languages. Whether they’re internal or external meetings, meaning often gets lost in complex discussions and you may encounter language barriers that prevent you from being as effective as you could be.

In this post, you will learn how to use three fully managed AWS services (Amazon Transcribe, Amazon Translate, and Amazon Polly) to produce a near-real-time speech-to-speech translator solution that can quickly translate a source speaker’s live voice input into a spoken, accurate, translated target language, all with zero machine learning (ML) experience.

Overview of solution

Our translator consists of three fully managed AWS ML services working together in a single Python script by using the AWS SDK for Python (Boto3) for our text translation and text-to-speech portions, and an asynchronous streaming SDK for audio input transcription.

Amazon Transcribe: Streaming speech to text

The first service you use in our stack is Amazon Transcribe, a fully managed speech-to-text service that takes input speech and transcribes it to text. Amazon Transcribe has flexible ingestion methods, batch or streaming, because it accepts either stored audio files or streaming audio data. In this post, you use the asynchronous Amazon Transcribe streaming SDK for Python, which uses the HTTP/2 streaming protocol to stream live audio and receive live transcriptions.

When we first built this prototype, Amazon Transcribe streaming ingestion didn’t support automatic language detection, but this is no longer the case as of November 2021. Both batch and streaming ingestion now support automatic language detection for all supported languages. In this post, we show how a parameter-based solution though a seamless multi-language parameterless design is possible through the use of streaming automatic language detection. After our transcribed speech segment is returned as text, you send a request to Amazon Translate to translate and return the results in our Amazon Transcribe EventHandler method.

Amazon Translate: State-of-the-art, fully managed translation API

Next in our stack is Amazon Translate, a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. As of June of 2022, Amazon Translate supports translation across 75 languages, with new language pairs and improvements being made constantly. Amazon Translate uses deep learning models hosted on a highly scalable and resilient AWS Cloud architecture to quickly deliver accurate translations either in real time or batched, depending on your use case. Using Amazon Translate is straightforward and requires no management of underlying architecture or ML skills. Amazon Translate has several features, like creating and using a custom terminology to handle mapping between industry-specific terms. For more information on Amazon Translate service limits, refer to Guidelines and limits. After the application receives the translated text in our target language, it sends the translated text to Amazon Polly for immediate translated audio playback.

Amazon Polly: Fully managed text-to-speech API

Finally, you send the translated text to Amazon Polly, a fully managed text-to-speech service that can either send back lifelike audio clip responses for immediate streaming playback or batched and saved in Amazon Simple Storage Service (Amazon S3) for later use. You can control various aspects of speech such as pronunciation, volume, pitch, speech rate, and more using standardized Speech Synthesis Markup Language (SSML).

You can synthesize speech for certain Amazon Polly Neural voices using the Newscaster style to make them sound like a TV or radio newscaster. You can also detect when specific words or sentences in the text are being spoken based on the metadata included in the audio stream. This allows the developer to synchronize graphical highlighting and animations, such as the lip movements of an avatar, with the synthesized speech.

You can modify the pronunciation of particular words, such as company names, acronyms, foreign words, or neologisms, for example “P!nk,” “ROTFL,” or “C’est la vie” (when spoken in a non-French voice), using custom lexicons.

Architecture overview

The following diagram illustrates our solution architecture.

This diagram shows the data flow from the client device to Amazon Transcribe, Amazon Translate, and Amazon Polly

The workflow is as follows:

Audio is ingested by the Python SDK.
Amazon Polly converts the speech to text, in 39 possible languages.
Amazon Translate converts the languages.
Amazon Live Transcribe converts text to speech.
Audio is outputted to speakers.

Prerequisites

You need a host machine set up with a microphone, speakers, and reliable internet connection. A modern laptop should work fine for this because no additional hardware is needed. Next, you need to set up the machine with some software tools.

You must have Python 3.7+ installed to use the asynchronous Amazon Transcribe streaming SDK and for a Python module called pyaudio, which you use to control the machine’s microphone and speakers. This module depends on a C library called portaudio.h. If you encounter issues with pyaudio errors, we suggest checking your OS to see if you have the portaudio.h library installed.

For authorization and authentication of service calls, you create an AWS Identity and Access Management (IAM) service role with permissions to call the necessary AWS services. By configuring the AWS Command Line Interface (AWS CLI) with this IAM service role, you can run our script on your machine without having to pass in keys or passwords, because the AWS libraries are written to use the configured AWS CLI user’s credentials. This is a convenient method for rapid prototyping and ensures our services are being called by an authorized identity. As always, follow the principle of least privilege when assigning IAM policies when creating an IAM user or role.

To summarize, you need the following prerequisites:

A PC, Mac, or Linux machine with microphone, speakers, and internet connection
The portaudio.h C library for your OS (brew, apt get, wget), which is needed for pyaudio to work
AWS CLI 2.0 with properly authorized IAM user configured by running aws configure in the AWS CLI
Python 3.7+
The asynchronous Amazon Transcribe Python SDK
The following Python libraries:
- boto3
- amazon-transcribe
- pyaudio
- asyncio
- concurrent

Implement the solution

You will be relying heavily on the asynchronous Amazon Transcribe streaming SDK for Python as a starting point, and are going to build on top of that specific SDK. After you have experimented with the streaming SDK for Python, you add streaming microphone input by using pyaudio, a commonly used Python open-source library used for manipulating audio data. Then you add Boto3 calls to Amazon Translate and Amazon Polly for our translation and text-to-speech functionality. Finally, you stream out translated speech through the computer’s speakers again with pyaudio. The Python module concurrent gives you the ability to run blocking code in its own asynchronous thread to play back your returned Amazon Polly speech in a seamless, non-blocking way.

Let’s import all our necessary modules, transcribe streaming classes, and instantiate some globals:

import boto3
 import asyncio
 import pyaudio
 import concurrent
 from amazon_transcribe.client import TranscribeStreamingClient
 from amazon_transcribe.handlers import TranscriptResultStreamHandler
 from amazon_transcribe.model import TranscriptEvent


 polly = boto3.client('polly', region_name = 'us-west-2')
 translate = boto3.client(service_name='translate', region_name='us-west-2', use_ssl=True)
 pa = pyaudio.PyAudio()

 #for mic stream, 1024 should work fine
 default_frames = 1024

 #current params are set up for English to Mandarin, modify to your liking
 params['source_language'] = "en"
 params['target_language'] = "zh"
 params['lang_code_for_polly'] = "cmn-CN"
 params['voice_id'] = "Zhiyu"
 params['lang_code_for_transcribe'] = "en-US"

First, you use pyaudio to obtain the input device’s sampling rate, device index, and channel count:

#try grabbing the default input device and see if we get lucky
 default_indput_device = pa.get_default_input_device_info()

 # verify this is your microphone device 
 print(default_input_device)

 #if correct then set it as your input device and define some globals
 input_device = default_input_device

 input_channel_count = input_device["maxInputChannels"]
 input_sample_rate = input_device["defaultSampleRate"]
 input_dev_index = input_device["index"]

If this isn’t working, you can also loop through and print your devices as shown in the following code, and then use the device index to retrieve the device information with pyaudio:

print ("Available devices:n")
 for i in range(0, pa.get_device_count()):
     info = pa.get_device_info_by_index(i)
     print (str(info["index"])  + ": t %s n t %s n" % (info["name"], p.get_host_api_info_by_index(info["hostApi"])["name"]))

 # select the correct index from the above returned list of devices, for example zero
 dev_index = 0 
 input_device = pa.get_device_info_by_index(dev_index)

 #set globals for microphone stream
 input_channel_count = input_device["maxInputChannels"]
 input_sample_rate = input_device["defaultSampleRate"]
 input_dev_index = input_device["index"]

You use channel_count, sample_rate, and dev_index as parameters in a mic stream. In that stream’s callback function, you use an asyncio nonblocking thread-safe callback to put the input bytes of the mic stream into an asyncio input queue. Take note of the loop and input_queue objects created with asyncio and how they’re used in the following code:

async def mic_stream():
     # This function wraps the raw input stream from the microphone forwarding
     # the blocks to an asyncio.Queue.
     
     loop = asyncio.get_event_loop()
     input_queue = asyncio.Queue()
     
     def callback(indata, frame_count, time_info, status):
         loop.call_soon_threadsafe(input_queue.put_nowait, indata)
         return (indata, pyaudio.paContinue)
         
     # Be sure to use the correct parameters for the audio stream that matches
     # the audio formats described for the source language you'll be using:
     # https://docs.aws.amazon.com/transcribe/latest/dg/streaming.html
     
     print(input_device)
     
     #Open stream
     stream = pa.open(format = pyaudio.paInt16,
                 channels = input_channel_count,
                 rate = int(input_sample_rate),
                 input = True,
                 frames_per_buffer = default_frames,
                 input_device_index = input_dev_index,
                 stream_callback=callback)
     # Initiate the audio stream and asynchronously yield the audio chunks
     # as they become available.
     stream.start_stream()
     print("started stream")
     while True:
         indata = await input_queue.get()
         yield indata

Now when the generator function mic_stream() is called, it continually yields input bytes as long as there is microphone input data in the input queue.

Now that you know how to get input bytes from the microphone, let’s look at how to write Amazon Polly output audio bytes to a speaker output stream:

#text will come from MyEventsHandler
 def aws_polly_tts(text):

     response = polly.synthesize_speech(
         Engine = 'standard',
         LanguageCode = params['lang_code_for_polly'],
         Text=text,
         VoiceId = params['voice_id'],
         OutputFormat = "pcm",
     )
     output_bytes = response['AudioStream']
     
     #play to the speakers
     write_to_speaker_stream(output_bytes)
     
 #how to write audio bytes to speakers

 def write_to_speaker_stream(output_bytes):
     """Consumes bytes in chunks to produce the response's output'"""
     print("Streaming started...")
     chunk_len = 1024
     channels = 1
     sample_rate = 16000
     
     if output_bytes:
         polly_stream = pa.open(
                     format = pyaudio.paInt16,
                     channels = channels,
                     rate = sample_rate,
                     output = True,
                     )
         #this is a blocking call - will sort this out with concurrent later
         while True:
             data = output_bytes.read(chunk_len)
             polly_stream.write(data)
             
         #If there's no more data to read, stop streaming
             if not data:
                 output_bytes.close()
                 polly_stream.stop_stream()
                 polly_stream.close()
                 break
         print("Streaming completed.")
     else:
         print("Nothing to stream.")

Now let’s expand on what you built in the post Asynchronous Amazon Transcribe Streaming SDK for Python. In the following code, you create an executor object using the ThreadPoolExecutor subclass with three workers with concurrent. You then add an Amazon Translate call on the finalized returned transcript in the EventHandler and pass that translated text, the executor object, and our aws_polly_tts() function into an asyncio loop with loop.run_in_executor(), which runs our Amazon Polly function (with translated input text) asynchronously at the start of next iteration of the asyncio loop.

#use concurrent package to create an executor object with 3 workers ie threads
 executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)

 class MyEventHandler(TranscriptResultStreamHandler):
     async def handle_transcript_event(self, transcript_event: TranscriptEvent):

         #If the transcription is finalized, send it to translate
 
         results = transcript_event.transcript.results
         if len(results) > 0:
             if len(results[0].alternatives) > 0:
                 transcript = results[0].alternatives[0].transcript
                 print("transcript:", transcript)

                 print(results[0].channel_id)
                 if hasattr(results[0], "is_partial") and results[0].is_partial == False:
                     
                     #translate only 1 channel. the other channel is a duplicate
                     if results[0].channel_id == "ch_0":
                         trans_result = translate.translate_text(
                             Text = transcript,
                             SourceLanguageCode = params['source_language'],
                             TargetLanguageCode = params['target_language']
                         )
                         print("translated text:" + trans_result.get("TranslatedText"))
                         text = trans_result.get("TranslatedText")

                         #we run aws_polly_tts with a non-blocking executor at every loop iteration
                         await loop.run_in_executor(executor, aws_polly_tts, text)

Finally, we have the loop_me() function. In it, you define write_chunks(), which takes an Amazon Transcribe stream as an argument and asynchronously writes chunks of streaming mic input to it. You then use MyEventHandler() with the output transcription stream as its argument and create a handler object. Then you use await with asyncio.gather() and pass in the write_chunks() and handler with the handle_events() method to handle the eventual futures of these coroutines. Lastly, you gather all event loops and loop the loop_me() function with run_until_complete(). See the following code:

async def loop_me():
 # Setup up our client with our chosen AWS region

     client = TranscribeStreamingClient(region="us-west-2")
     stream = await client.start_stream_transcription(
         language_code=params['lang_code_for_transcribe'],
         media_sample_rate_hz=int(device_info["defaultSampleRate"]),
         number_of_channels = 2,
         enable_channel_identification=True,
         media_encoding="pcm",
     )
     recorded_frames = []
     async def write_chunks(stream):
         
         # This connects the raw audio chunks generator coming from the microphone
         # and passes them along to the transcription stream.
         print("getting mic stream")
         async for chunk in mic_stream():
             t.tic()
             recorded_frames.append(chunk)
             await stream.input_stream.send_audio_event(audio_chunk=chunk)
             t.toc("chunks passed to transcribe: ")
         await stream.input_stream.end_stream()

     handler = MyEventHandler(stream.output_stream)
     await asyncio.gather(write_chunks(stream), handler.handle_events())

 #write a proper while loop here
 loop = asyncio.get_event_loop()
 loop.run_until_complete(loop_me())
 loop.close()

When the preceding code is run together without errors, you can speak into the microphone and quickly hear your voice translated to Mandarin Chinese. The automatic language detection feature for Amazon Transcribe and Amazon Translate translates any supported input language into the target language. You can speak for quite some time and because of the non-blocking nature of the function calls, all your speech input is translated and spoken, making this an excellent tool for translating live speeches.

Conclusion

Although this post demonstrated how these three fully managed AWS APIs can function seamlessly together, we encourage you to think about how you could use these services in other ways to deliver multilingual support for services or media like multilingual closed captioning for a fraction of the current cost. Medicine, business, and even diplomatic relations could all benefit from an ever-improving, low-cost, low-maintenance translation service.

For more information about the proof of concept code base for this use case check out our Github.

About the Authors

Michael Tran is a Solutions Architect with Envision Engineering team at Amazon Web Services. He provides technical guidance and helps customers accelerate their ability to innovate through showing the art of the possible on AWS. He has built multiple prototypes around AI/ML, and IoT for our customers. You can contact me @Mike_Trann on Twitter.

Cameron Wilkes is a Prototyping Architect on the AWS Industry Accelerator team. While on the team he delivered several ML based prototypes to customers to demonstrate the “Art of the Possible” of ML on AWS. He enjoys music production, off-roading and design.

Second annual Machine Learning Summer School launches in India

July 6, 2022

by Amazon AWS

Expanded program aimed at engineering undergraduate and graduate students builds off the success of inaugural program.Read More

Use Amazon SageMaker Data Wrangler in Amazon SageMaker Studio with a default lifecycle configuration

July 5, 2022

by Rajakumar Sampathkumar Amazon AWS

If you use the default lifecycle configuration for your domain or user profile in Amazon SageMaker Studio and use Amazon SageMaker Data Wrangler for data preparation, then this post is for you. In this post, we show how you can create a Data Wrangler flow and use it for data preparation in a Studio environment with a default lifecycle configuration.

Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare data for machine learning (ML) applications via a visual interface. Data preparation is a crucial step of the ML lifecycle, and Data Wrangler provides an end-to-end solution to import, explore, transform, featurize, and process data for ML in a visual, low-code experience. It lets you easily and quickly connect to AWS components like Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and AWS Lake Formation, and external sources like Snowflake and DataBricks DeltaLake. Data Wrangler supports standard data types such as CSV, JSON, ORC, and Parquet.

Studio apps are interactive applications that enable Studio’s visual interface, code authoring, and run experience. App types can be either Jupyter Server or Kernel Gateway:

Jupyter Server – Enables access to the visual interface for Studio. Every user in Studio gets their own Jupyter Server app.
Kernel Gateway – Enables access to the code run environment and kernels for your Studio notebooks and terminals. For more information, see Jupyter Kernel Gateway.

Lifecycle configurations (LCCs) are shell scripts to automate customization for your Studio environments, such as installing JupyterLab extensions, preloading datasets, and setting up source code repositories. LCC scripts are triggered by Studio lifecycle events, such as starting a new Studio notebook. To set a lifecycle configuration as the default for your domain or user profile programmatically, you can create a new resource or update an existing resource. To associate a lifecycle configuration as a default, you first need to create a lifecycle configuration following the steps in Creating and Associating a Lifecycle Configuration

Note: Default lifecycle configurations set up at the domain level are inherited by all users, whereas those set up at the user level are scoped to a specific user. If you apply both domain-level and user profile-level lifecycle configurations at the same time, the user profile-level lifecycle configuration takes precedence and is applied to the application irrespective of what lifecycle configuration is applied at the domain level. For more information, see Setting Default Lifecycle Configurations.

Data Wrangler accepts the default Kernel Gateway lifecycle configuration, but some of the commands defined in the default Kernel Gateway lifecycle configuration aren’t applicable to Data Wrangler, which can cause Data Wrangler to fail to start. The following screenshot shows an example of an error message you might get when launching the Data Wrangler flow. This may happen only with default lifecycle configurations and not with lifecycle configurations.

Solution overview

Customers using the default lifecycle configuration in Studio can follow this post and use the supplied code block within the lifecycle configuration script to launch a Data Wrangler app without any errors.

Set up the default lifecycle configuration

To set up a default lifecycle configuration, you must add it to the DefaultResourceSpec of the appropriate app type. The behavior of your lifecycle configuration depends on whether it’s added to the DefaultResourceSpec of a Jupyter Server or Kernel Gateway app:

Jupyter Server apps – When added to the DefaultResourceSpec of a Jupyter Server app, the default lifecycle configuration script runs automatically when the user logs in to Studio for the first time or restarts Studio. You can use this to automate one-time setup actions for the Studio developer environment, such as installing notebook extensions or setting up a GitHub repo. For an example of this, see Customize Amazon SageMaker Studio using Lifecycle Configurations.
Kernel Gateway apps – When added to the DefaultResourceSpec of a Kernel Gateway app, Studio defaults to selecting the lifecycle configuration script from the Studio launcher. You can launch a notebook or terminal with the default script or choose a different one from the list of lifecycle configurations.

A default Kernel Gateway lifecycle configuration specified in DefaultResourceSpec applies to all Kernel Gateway images in the Studio domain unless you choose a different script from the list presented in the Studio launcher.

When you work with lifecycle configurations for Studio, you create a lifecycle configuration and attach it to either your Studio domain or user profile. You can then launch a Jupyter Server or Kernel Gateway application to use the lifecycle configuration.

The following table summarizes these errors you may encounter when launching a Data Wrangler application with default lifecycle configurations.

Level at Which the Lifecycle Configuration Is Applied	Create Data Wrangler Flow Works (or) Error	Workaround
Domain	Bad Request Error	Apply the script (see below)
User Profile	Bad Request Error	Apply the script (see below)
Application	Works—No issue	Not required

When you use the default lifecycle configuration associated with Studio and Data Wrangler (Kernel Gateway app), you might encounter Kernel Gateway app failure. In this post, we demonstrate how to set the default lifecycle configuration properly to exclude running commands in a Data Wrangler application so you don’t encounter Kernel Gateway app failure.

Let’s say you want to install a git-clone-repo script as the default lifecycle configuration that checks out a Git repository under the user’s home folder automatically when the Jupyter server starts. Let’s look at each scenario of applying a lifecycle configuration (Studio domain, user profile, or application level).

Apply lifecycle configuration at the Studio domain or user profile level

To apply the default Kernel Gateway lifecycle configuration at the Studio domain or user profile level, complete the steps in this section. We start with instructions for the user profile level.

In your lifecycle configuration script, you have to include the following code block that checks and skips the Data Wrangler Kernel Gateway app:

#!/bin/bash
set -eux
STATUS=$(
python3 -c "import sagemaker_dataprep"
echo $?
)
if [ "$STATUS" -eq 0 ]; then
echo 'Instance is of Type Data Wrangler'
else
echo 'Instance is not of Type Data Wrangler'
<remainder of LCC here within in else block – this contains some pip install, etc>
fi

For example, let’s use the following script as our original (note that the folder to clone the repo is changed to /root from /home/sagemaker-user):

# Clones a git repository into the user's home folder
#!/bin/bash

set -eux

# Replace this with the URL of your git repository
export REPOSITORY_URL="https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples.git"

git -C /root clone $REPOSITORY_URL

The new modified script looks like the following:

#!/bin/bash
set -eux
STATUS=$(
python3 -c "import sagemaker_dataprep"
echo $?
)
if [ "$STATUS" -eq 0 ]; then
echo 'Instance is of Type Data Wrangler'
else
echo 'Instance is not of Type Data Wrangler'

# Replace this with the URL of your git repository
export REPOSITORY_URL="https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples.git"

git -C /root clone $REPOSITORY_URL

fi

You can save this script as git_command_test.sh.

Now you run a series of commands in your terminal or command prompt. You should configure the AWS Command Line Interface (AWS CLI) to interact with AWS. If you haven’t set up the AWS CLI, refer to Configuring the AWS CLI.

Convert your git_command_test.sh file into Base64 format. This requirement prevents errors due to the encoding of spacing and line breaks.
```
LCC_GIT=openssl base64 -A -in /Users/abcde/Downloads/git_command_test.sh
```

Create a Studio lifecycle configuration. The following command creates a lifecycle configuration that runs on launch of an associated Kernel Gateway app:

aws sagemaker create-studio-lifecycle-config —region us-east-2 —studio-lifecycle-config-name lcc-git —studio-lifecycle-config-content $LCC_GIT —studio-lifecycle-config-app-type KernelGateway

Use the following API call to create a new user profile with an associated lifecycle configuration:

aws sagemaker create-user-profile --domain-id d-vqc14vvvvvvv 
--user-profile-name test 
--region us-east-2 
--user-settings '{
"KernelGatewayAppSettings": {
"LifecycleConfigArns" : ["arn:aws:sagemaker:us-east-2:000000000000:studio-lifecycle-config/lcc-git"],
"DefaultResourceSpec": {
"InstanceType": "ml.m5.xlarge",
"LifecycleConfigArn": "arn:aws:sagemaker:us-east-2:00000000000:studio-lifecycle-config/lcc-git"
}
}
}'

Alternatively, if you want to create a Studio domain to associate your lifecycle configuration at the domain level, or update the user profile or domain, you can follow the steps in Setting Default Lifecycle Configurations.

Now you can launch your Studio app from the SageMaker Control Panel.
In your Studio environment, on the File menu, choose New and Data Wrangler Flow.The new Data Wrangler flow should open without any issues.
To validate the Git clone, you can open a new Launcher in Studio.
Under Notebooks and compute resources, choose the Python 3 notebook and the Data Science SageMaker image to start your script as your default lifecycle configuration script.

You can see the Git cloned to /root in the following screenshot.

We have successfully applied the default Kernel lifecycle configuration at the user profile level and created a Data Wrangler flow. To configure at the Studio domain level, the only change is instead of creating a user profile, you pass the ARN of the lifecycle configuration in a create-domain call.

Apply lifecycle configuration at the application level

If you apply the default Kernel Gateway lifecycle configuration at the application level, you won’t have any issues because Data Wrangler skips the lifecycle configuration applied at the application level.

Conclusion

In this post, we showed how to configure your default lifecycle configuration properly for Studio when you use Data Wrangler for data preparation and visualization requirements.

To summarize, if you need to use the default lifecycle configuration for Studio to automate customization for your Studio environments and use Data Wrangler for data preparation, you can apply the default Kernel Gateway lifecycle configuration at the user profile or Studio domain level with the appropriate code block included in your lifecycle configuration so that the default lifecycle configuration checks it and skips the Data Wrangler Kernel Gateway app.

For more information, see the following resources:

About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Vicky Zhang is a Software Development Engineer at Amazon SageMaker. She is passionate about problem solving. In her spare time, she enjoys watching detective movies and playing badminton.

Rahul Nabera is a Data Analytics Consultant in AWS Professional Services. His current work focuses on enabling customers build their data and machine learning workloads on AWS. In his spare time, he enjoys playing cricket and volleyball.

My experience at Amazon while teaching at Stanford

July 5, 2022

by Amazon AWS

Co-mingling industry experience and academic teaching.Read More

Ten stories from the first half of 2022 that captivated readers

July 4, 2022

by Amazon AWS

From Josh Miele’s passion for making the world more accessible to improving forecasting by learning quantile functions, these stories resonated with our audience.Read More

Better joint representations of image and text

July 1, 2022

by Amazon AWS

Two methods presented at CVPR achieve state-of-the-art results by imposing additional structure on the representational space.Read More

Development

Testing

Deployment

Monitoring

Conclusion

About the Author

Solution overview

Generate a dataset

Launch Data Wrangler

Import data

Transform the data

Drop columns in patients.csv

Featurize date/time in patients.csv

Add transforms in observations.csv

Join patients.csv and observations.csv

Add a custom transform to the joined datasets

Add custom transforms to conditions.csv

Join conditions.csv to the joined dataset

Add transforms to the joined datasets

Visualize target leakage and feature correlation

Drop diastolic and bmi columns

Generate the Data Quality and Insights Report

Generate a Quick Model analysis

Export the data and train the model

Conclusion

About the Authors

Overview of solution

Amazon Transcribe: Streaming speech to text

Amazon Translate: State-of-the-art, fully managed translation API

Amazon Polly: Fully managed text-to-speech API

Architecture overview

Prerequisites

Implement the solution

Conclusion

About the Authors

Solution overview

Set up the default lifecycle configuration

Apply lifecycle configuration at the Studio domain or user profile level

Apply lifecycle configuration at the application level

Conclusion

About the Authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.