How to redact PII data in conversation transcripts
Customer service interactions often contain personally identifiable information (PII) such as names, phone numbers, and dates of birth. As organizations incorporate machine learning (ML) and analytics into their applications, using this data can provide insights on how to create more seamless customer experiences. However, the presence of PII often restricts the use of this data. In this blog post, we will review a solution to automatically redact PII data from a customer service conversation transcript.
Let’s take an example conversation between a customer and a call center agent.
Agent: Hi, thank you for calling us today. Whom do I have the pleasure of speaking with today?
Caller: Hello, my name is John Stiles.
Agent: Hi John, how may I help you?
Caller: I haven’t received my W2 statement yet and wanted to check on its status.
Agent: Sure, I can help you with that. Can you please confirm the last four digits of your Social Security number?
Caller: Yes, it’s 1111.
Agent: Ok. I’m pulling up the status now. I see that it was sent out yesterday, and the estimated arrival is early next week. Would you like me to turn on automated alerts so you can be notified of any delays?
Caller: Yes, please.
Agent: The number we have on file for you is 555-456-7890. Is that still correct?
Caller: Yes, it is.
Agent: Great. I have turned on automated notifications. Is there anything else I can assist you with John?
Caller: No, that’s all. Thank you.
Agent: Thank you, John. Have a great day.
In this brief interaction, there are several pieces of data that would generally be considered PII, including the caller’s name, the last four digits of their Social Security number, and the phone number. Let’s review how we can redact this PII data in the transcript.
Solution overview
We will create an AWS Step Functions state machine, which orchestrates an Amazon Comprehend PII redaction job. Amazon Comprehend is a natural-language processing (NLP) service that uses machine learning to uncover valuable insights and connections in text, including the ability to detect and redact PII data.
You will provide the transcripts in the input Amazon S3 bucket. The transcripts are in the format used by Contact Lens for Amazon Connect. You will also specify an output S3 bucket, which stores the redaction output as well as intermediate data. The intermediate data are micro-batched versions of the input data. For example, if there are 10,000 conversations to be redacted, the workflow will split them into 10 batches of 1000 conversations each. Each batch is stored using a unique prefix, which is then used as the input source for Comprehend. The Step Functions map state is used to execute these redaction jobs in parallel by calling the StartPIIEntitiesDetectionJob API. This approach allows you to run multiple jobs in parallel rather than individual jobs in sequence. Since the job is implemented as a Step Functions state machine, it can be triggered to run manually or automatically as part of a daily process.
You can learn more about how Comprehend detects and redacts PII data in this blog post.
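For each micro-batch, the workflow calls Comprehend's asynchronous PII detection API in redaction mode. The following boto3 sketch illustrates what that call looks like; the bucket names, prefixes, and role ARN are placeholders rather than values from the deployed solution:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Start an asynchronous PII redaction job for one micro-batch of transcripts.
# Bucket names, prefixes, and the role ARN below are placeholders.
response = comprehend.start_pii_entities_detection_job(
    JobName="redact-transcripts-batch-0",
    Mode="ONLY_REDACTION",
    RedactionConfig={
        "PiiEntityTypes": ["NAME", "SSN", "PHONE"],  # or ["ALL"] to redact every PII type
        "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
    },
    InputDataConfig={
        "S3Uri": "s3://text-redaction-data-123456789012/batch-0/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://my-output-bucket/redaction-output/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    LanguageCode="en",
)
print(response["JobId"], response["JobStatus"])
```

The REPLACE_WITH_PII_ENTITY_TYPE mask mode produces output like the redacted transcript shown later in this post, with tokens such as [NAME] and [SSN] in place of the original values.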
Deploy the sample solution
First, sign in to the AWS Management Console in your AWS account.
You will need an S3 bucket with some sample transcript data to redact and another bucket for output. If you don’t have existing sample transcript data, follow these steps:
- Navigate to the Amazon S3 console.
- Choose Create bucket.
- Enter a bucket name, such as text-redaction-data-<your-account-number>.
- Accept the defaults, and choose Create bucket.
- Open the bucket you created, and choose Create folder.
- Enter a folder name, such as “sample-data” and choose Create folder.
- Click on your new folder name to open it.
- Download the SampleData.zip file.
- Open the .zip file on your local computer and then drag the folder to the S3 bucket you created.
- Choose Upload.
Now click the following link to deploy the sample solution to US East (N. Virginia):
This will create a new AWS CloudFormation stack.
Enter the Stack name (e.g., pii-redaction-workflow), the name of the S3 input bucket containing the input transcript data, and the name of the S3 output bucket. Choose Next and add any tags that you want for your stack (optional). Choose Next again and review the stack details. Select the checkbox to acknowledge that AWS Identity and Access Management (IAM) resources will be created, and then choose Create stack.
The CloudFormation stack will create an IAM role with the ability to list and read the objects from the bucket. You can further customize the role per your requirements. It will also create a Step Functions state machine, several AWS Lambda functions used by the state machine, and an S3 bucket for storing the redacted output versions of the transcripts.
After a few minutes, your stack will be complete, and then you can examine the Step Functions state machine that was created as part of the CloudFormation template.
Run a redaction job
To run a job, navigate to Step Functions in the AWS console, select the state machine, and choose Start execution.
Next, provide the input arguments to run the job. For the job input, provide the name of your input S3 bucket as the S3InputDataBucket value, the folder name as the S3InputDataPrefix value, the name of your output S3 bucket as the S3OutputDataBucket value, and the folder to store the results as the S3OutputDataPrefix value, then choose Start execution.
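You can also start the execution programmatically instead of from the console. The following boto3 sketch uses the same input parameters described above; the state machine ARN and bucket names are placeholders:

```python
import json
import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")

# Start the redaction workflow with the four input arguments described above.
# The state machine ARN and bucket/folder names are placeholders for your own values.
execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:pii-redaction-workflow",
    input=json.dumps({
        "S3InputDataBucket": "text-redaction-data-123456789012",
        "S3InputDataPrefix": "sample-data",
        "S3OutputDataBucket": "my-redaction-output-bucket",
        "S3OutputDataPrefix": "redacted-transcripts",
    }),
)
print(execution["executionArn"])
```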
As the job executes, you can monitor its status in the Step Functions graph view. It will take a few minutes to run the job. Once the job is complete, you will see the output for each of the jobs in the Execution input and output section of the console. You can use the output URI to retrieve the output of a job. If multiple jobs were executed, you can copy the results of all jobs to a destination bucket for further analysis.
Let’s take a look at the redacted version of the conversation that we started with.
Agent: Hi, thank you for calling us today. Whom do I have the pleasure of speaking with today?
Caller: Hello, my name is [NAME].
Agent: Hi [NAME], how may I help you?
Caller: I haven’t received my W2 statement yet and wanted to check on its status.
Agent: Sure, I can help you with that. Can you please confirm the last four digits of your Social Security number?
Caller: Yes, it’s [SSN].
Agent: Ok. I’m pulling up the status now. I see that it was sent out yesterday, and the estimated arrival is early next week. Would you like me to turn on automated alerts so you can be notified of any delays?
Caller: Yes, please.
Agent: The number we have on file for you is [PHONE]. Is that still correct?
Caller: Yes, it is.
Agent: Great. I have turned on automated notifications. Is there anything else I can assist you with, [NAME]?
Caller: No, that’s all. Thank you.
Agent: Thank you, [NAME]. Have a great day.
Clean up
You may want to clean up the resources created as part of the CloudFormation template after you are finished, to avoid ongoing charges. To do so, delete the deployed CloudFormation stack, and delete the S3 bucket with the sample transcript data if you created one.
Conclusion
With customers demanding seamless experiences across channels and expecting security to be embedded at every point, using Step Functions and Amazon Comprehend to redact PII data in conversation transcripts is a powerful tool at your disposal. Organizations can speed time to value by using the redacted transcripts to analyze customer service interactions and glean insights to improve the customer experience.
Try using this workflow to redact your data and leave us a comment!
About the author
Alex Emilcar is a Senior Solutions Architect in the Amazon Machine Learning Solutions Lab, where he helps customers build digital experiences with AWS AI technologies. Alex has over 10 years of technology experience working in roles ranging from developer to infrastructure engineer to solutions architect. In his spare time, Alex likes to spend time reading and doing yard work.
Get to production-grade data faster by using new built-in interfaces with Amazon SageMaker Ground Truth Plus
Launched at AWS re:Invent 2021, Amazon SageMaker Ground Truth Plus helps you create high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow based on these requirements. From there, an expert workforce that is trained on a variety of machine learning (ML) tasks labels your data. You don’t even need deep ML expertise or knowledge of workflow design and quality management to use Ground Truth Plus.
Today, we are excited to announce the launch of new built-in interfaces on Ground Truth Plus. With this new capability, multiple Ground Truth Plus users can now create a new project and batch, share data, and receive data using the same AWS account through self-serve interfaces. This enables you to accelerate the development of high-quality training datasets by reducing project set up time. Additionally, you can control fine-grained access to your data by scoping your AWS Identity and Access Management (IAM) role permissions to match your individual level of Amazon Simple Storage Service (Amazon S3) access, and you always have the option to revoke access to certain buckets.
Until now, you had to reach out to your Ground Truth Plus operations program manager (OPM) to create new data labeling projects and batches. This process had some restrictions because it allowed only one user to request a new project and batch: if multiple users within the organization were using the same AWS account, then only one user could request a new data labeling project and batch using the Ground Truth Plus console. Additionally, the process created artificial delays in kicking off the labeling process due to multiple manual touchpoints and the troubleshooting required in case of issues. Separately, all the projects used the same IAM role for accessing data. Therefore, to run projects and batches that needed access to different data sources, such as different Amazon S3 buckets, you had to rely on your Ground Truth Plus OPM to provide your account-specific S3 policies, which you had to manually apply to your S3 buckets. This entire operation was manually intensive, resulting in operational overhead.
This post walks you through steps to create a new project and batch, share data, and receive data using the new self-serve interfaces to efficiently kickstart the labeling process. This post assumes that you are familiar with Ground Truth Plus. For more information, see Amazon SageMaker Ground Truth Plus – Create Training Datasets Without Code or In-house Resources.
Solution overview
We demonstrate how to do the following:
- Update existing projects
- Request a new project
- Set up a project team
- Create a batch
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account
- An IAM user with access to create IAM roles
- The Amazon S3 URI of the bucket where your labeling objects are stored
Update existing projects
If you have a Ground Truth Plus project that was created before the launch (December 9, 2022) of the new features described in this post, then you need to create and share an IAM role so that you can use these features with your existing Ground Truth Plus project. If you’re a new user of Ground Truth Plus, you can skip this section.
To create an IAM role, complete the following steps:
- On the IAM console, choose Create role.
- Select Custom trust policy.
- Specify the following trust relationship for the role:
- Choose Next.
- Choose Create policy.
- On the JSON tab, specify the following policy. Update the Resource property by specifying two entries for each bucket: one with just the bucket ARN, and another with the bucket ARN followed by /*. For example, replace <your-input-s3-arn> with arn:aws:s3:::my-bucket/myprefix/ and <your-input-s3-arn>/* with arn:aws:s3:::my-bucket/myprefix/* (for a scripted sketch of this role setup, see the example after these steps).
- Choose Next: Tags and Next: Review.
- Enter the name of the policy and an optional description.
- Choose Create policy.
- Close this tab and go back to the previous tab to create your role.
On the Add permissions tab, you should see the new policy you created (refresh the page if you don’t see it).
- Select the newly created policy and choose Next.
- Enter a name (for example, GTPlusExecutionRole) and optionally a description of the role.
- Choose Create role.
- Provide the role ARN to your Ground Truth Plus OPM, who will then update your existing project with this newly created role.
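If you prefer to script the role setup above, the following boto3 sketch shows the general shape. The trust principal matches the service principal named later in this post, while the S3 actions and bucket names are illustrative assumptions rather than the exact policy used by Ground Truth Plus:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Ground Truth Plus service principal to assume the role.
# Treat this as an illustrative sketch, not the exact policy from the console walkthrough.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker-ground-truth-plus.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# S3 permissions scoped to your bucket; the action list and bucket name are assumptions.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-bucket",
            "arn:aws:s3:::my-bucket/myprefix/*",
        ],
    }],
}

role = iam.create_role(
    RoleName="GTPlusExecutionRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Execution role shared with SageMaker Ground Truth Plus",
)
iam.put_role_policy(
    RoleName="GTPlusExecutionRole",
    PolicyName="GTPlusS3Access",
    PolicyDocument=json.dumps(s3_policy),
)
print(role["Role"]["Arn"])  # share this ARN with your Ground Truth Plus OPM
```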
Request a new project
To request a new project, complete the following steps:
- On the Ground Truth Plus console, navigate to the Projects section.
This is where all your projects are listed.
- Choose Request project.
The Request project page is your opportunity to provide details that will help us schedule an initial consultation call and set up your project.
- In addition to specifying general information like the project name and description, you must specify the project’s task type and whether it contains personally identifiable information (PII).
To label your data, Ground Truth Plus needs temporary access to your raw data in an S3 bucket. When the labeling process is complete, Ground Truth Plus delivers the labeling output back to your S3 bucket. This is done through an IAM role. You can either use the built-in tool to create a new role, or you can navigate to the IAM console and create the role yourself (refer to the previous section for instructions).
- If you created your own role, choose Enter a custom IAM role ARN and enter your IAM role ARN, which is in the format of arn:aws:iam::<YourAccountNumber>:role/<RoleName>.
- To use the built-in tool, on the drop-down menu under IAM Role, choose Create a new role.
- Specify the bucket location of your labeling data. If you don’t know the location of your labeling data or if you don’t have any labeling data uploaded, select Any S3 bucket, which will give Ground Truth Plus access to all your account’s buckets.
- Choose Create to create the role.
Your IAM role will allow Ground Truth Plus, identified as sagemaker-ground-truth-plus.amazonaws.com in the role’s trust policy, to run the required read and write actions on your S3 buckets.
- Choose Request project to complete the request.
A Ground Truth Plus OPM will schedule an initial consultation call with you to discuss your data labeling project requirements and pricing.
Set up a project team
After you request a project, you need to create a project team to log in to your project portal. A project team provides access to the members from your organization or team to track projects, view metrics, and review labels. You can use the option Invite new members by email or Import members from existing Amazon Cognito user groups. In this post, we show how to import members from existing Amazon Cognito user groups to add users to your project team.
- On the Ground Truth Plus console, navigate to the Project team section.
- Choose Create project team.
- Choose Import members from existing Amazon Cognito user groups.
- Choose an Amazon Cognito user pool.
User pools require a domain and an existing user group.
- Choose an app client.
We recommend using a client generated by Amazon SageMaker.
- Choose a user group from your pool to import members.
- Choose Create project team.
You can add more team members after creating the project team by choosing Invite new members on the Members page of the Ground Truth Plus console.
Create a batch
After you have successfully submitted the project request and created a project team, you can access the Ground Truth Plus project portal by clicking Open project portal on the Ground Truth Plus console.
You can use the project portal to create batches for a project, but only after the project’s status has changed to Request approved.
- View a project’s details and batches by choosing the project name. A page titled with the project name opens.
- In the Batches section, choose Create batch.
- Enter a batch name and optional description.
- Enter the S3 locations of the input and output datasets.
To ensure the batch is created successfully, you must meet the following requirements:
- The S3 bucket and prefix should exist, and the total number of files should be greater than 0
- The total number of objects should be less than 10,000
- The size of each object should be less than 2 GB
- The total size of all objects combined is less than 100 GB
- The IAM role provided to create a project has permission to access the input bucket, output bucket, and S3 files that are used to create the batch
- The files under the provided S3 location for the input datasets should not be encrypted by AWS Key Management Service (AWS KMS)
- Choose Submit.
Your batch status will show as Request submitted. After Ground Truth Plus has temporary access to your data, AWS experts will set up data labeling workflows and operate them on your behalf, which will change the batch status to In-progress. When the labeling is complete, the batch status changes from In-progress to Ready for review. If you want to review your labels before receiving them, choose Review batch. From there, you have the option to choose Accept batch to receive your labeled data.
Conclusion
This post showed you how multiple Ground Truth Plus users can now create a new project and batch, share data, and receive data using the same AWS account through new self-serve interfaces. This new capability allows you to kickstart your labeling projects faster and reduces operational overhead. We also demonstrated how you can control fine-grained access to data by scoping your IAM role permissions to match your individual level of access.
We encourage you to try out this new functionality, and connect with the Machine Learning & AI community if you have any questions or feedback!
About the authors
Manish Goel is the Product Manager for Amazon SageMaker Ground Truth Plus. He is focused on building products that make it easier for customers to adopt machine learning. In his spare time, he enjoys road trips and reading books.
Karthik Ganduri is a Software Development Engineer at Amazon AWS, where he works on building ML tools for customers and internal solutions. Outside of work, he enjoys clicking pictures.
Zhuling Bai is a Software Development Engineer at Amazon AWS. She works on developing large scale distributed systems to solve machine learning problems.
Aatef Baransy is a Frontend engineer at Amazon AWS. He writes fast, reliable, and thoroughly tested software to nurture and grow the industry’s most cutting-edge AI applications.
Mohammad Adnan is a Senior Engineer for AI and ML at AWS. He has been part of many AWS service launches, notably Amazon Lookout for Metrics and AWS Panorama. Currently, he is focusing on AWS human-in-the-loop offerings (Amazon SageMaker Ground Truth, Ground Truth Plus, and Amazon Augmented AI). He is a clean code advocate and a subject-matter expert on serverless and event-driven architecture. You can follow him on LinkedIn, mohammad-adnan-6a99a829.
Announcing the updated Salesforce connector (V2) for Amazon Kendra
Amazon Kendra is a highly accurate and simple-to-use intelligent search service powered by machine learning (ML). Amazon Kendra offers a suite of data source connectors to simplify the process of ingesting and indexing your content, wherever it resides.
Valuable data in organizations is stored in both structured and unstructured repositories. An enterprise search solution should be able to pull together data across several structured and unstructured repositories to index and search on.
One such data repository is Salesforce. Salesforce is a comprehensive CRM tool for managing support, sales, and marketing teams. It’s an intelligent, proactive, AI-powered platform that empowers employees with the information they need to make the best decisions for every customer. It’s the backbone of the world’s most customer-centered organizations and helps companies put the customer at the center of everything they do.
We’re excited to announce that we have updated the Salesforce connector for Amazon Kendra to add even more capabilities. In this version (V2), we have added support for Salesforce Lightning in addition to Classic. You can now choose to crawl attachments and also bring in identity/ACL information to make your searches more granular. We now support 20 standard entities, and you can choose to index more fields.
You can import the following entities (and attachments for those marked with *):
- Accounts*
- Campaign*
- Partner
- Pricebook
- Case*
- Contact*
- Contract*
- Document
- Group
- Idea
- Lead*
- Opportunity*
- Product
- Profile
- Solution*
- Task*
- User*
- Chatter*
- Knowledge Articles
- Custom Objects*
Solution overview
With Amazon Kendra, you can configure multiple data sources to provide a central place to search across your document repository. For our solution, we demonstrate how to index a Salesforce repository or folder using the Amazon Kendra connector for Salesforce. The solution consists of the following steps:
- Create and configure an app on Salesforce and get the connection details.
- Create a Salesforce data source via the Amazon Kendra console.
- Index the data in the Salesforce repository.
- Run a sample query to get the information.
- Filter the query by users or groups.
Prerequisites
To try out the Amazon Kendra connector for Salesforce, you need the following:
- A Salesforce Enterprise account with enough access permissions to set up an OAuth data source.
- An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
- Basic knowledge of AWS.
Configure a Salesforce app and gather connection details
Before we set up the Salesforce data source, we need a few details about your Salesforce repository. Let’s gather those in advance (refer to Authorization Through Connected Apps and OAuth 2.0 for more details).
- Go to https://login.salesforce.com/ and log in with your credentials.
- In the navigation pane, choose Setup Home.
- Under Apps, choose App Manager.
This refreshes the right pane.
- Choose New Connected App.
- Select Enable OAuth Settings to expand the API (Enable OAuth Settings) section.
- For Callback URL, enter https://login.salesforce.com/services/oauth2/token.
- For Selected OAuth Scopes, choose eclair_api and choose the right arrow icon.
- Select Introspect All Tokens.
- Choose Save. A warning appears that says, “Changes can take up to 10 minutes to take effect.”
- Choose Continue to acknowledge.
- On the confirmation page, choose Manage Consumer Details.
- Copy and save the values for Consumer Key and Consumer Secret to use later when setting up your Amazon Kendra data source.
Next, we generate a security token.
- On the home page, choose the View Profile icon and choose Settings.
- In the navigation pane, expand My Personal Information and choose Reset My Security Token.
The security token is sent to the email you used when configuring the app. The following screenshot shows an example email.
- Save the security token to use when you configure the Salesforce connector to Amazon Kendra.
Configure the Amazon Kendra connector for Salesforce
To configure the Amazon Kendra connector, complete the following steps:
- On the Amazon Kendra console, choose Create an Index.
- For Index name, enter a name for the index (for example, my-salesforce-index).
- Enter an optional description.
- Choose Create a new role.
- For Role name, enter an IAM role name.
- Configure optional encryption settings and tags.
- Choose Next.
- In the Configure user access control section, leave the settings at their defaults and choose Next.
- Select Developer edition and choose Create.
This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.
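For reference, the index can also be created programmatically. The following boto3 sketch is an illustrative equivalent of the console steps above, with a placeholder role ARN:

```python
import boto3

kendra = boto3.client("kendra", region_name="us-east-1")

# Programmatic equivalent of the console steps above; the index name, role ARN,
# and edition are placeholders for your own values.
index = kendra.create_index(
    Name="my-salesforce-index",
    Edition="DEVELOPER_EDITION",
    RoleArn="arn:aws:iam::123456789012:role/AmazonKendra-index-role",
    Description="Index for Salesforce content",
)
print(index["Id"])  # index creation can take up to 30 minutes to complete
```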
- Return to the Amazon Kendra console and choose Data sources in the navigation pane.
- Scroll down and locate Salesforce Online connector V2.0, and choose Add connector.
- For Data source name, enter a name (for example, my-salesforce-datasourcev2).
- Enter an optional description.
- Choose Next.
- For Salesforce URL, enter the URL at the top of the browser when you log in to Salesforce.
- For Configure VPC and security group, leave the default (No VPC).
- Keep Identity crawler is on selected. This imports identity/ACL information into the index.
- For IAM role, choose Create a new role.
- Enter a role name, such as AmazonKendra-salesforce-datasourcev2.
- Choose Next.
- In the Authentication section, choose Create and add new secret.
- Enter the details you gathered while setting up the Salesforce app:
- Secret name – The name you gave your secret.
- Username – The user name you use to log in to Salesforce.
- Password – The password you use to log in to Salesforce.
- Security token – The security token you received in your email while going through the setup in Salesforce.
- Consumer key – The key generated while going through the setup in Salesforce.
- Consumer secret – The secret generated while going through the setup in Salesforce.
- Authentication URL – Enter https://login.salesforce.com/services/oauth2/token.
- Choose Save.
The next page is prefilled with the name of the secret.
- Choose Next.
- Select All standard objects and Include all attachments.
- For Sync run schedule, choose Run on demand.
- Choose Next.
- Keep all the defaults in the Field Mappings section and choose Next.
- On the review page, choose Add data source.
- Choose Sync now.
This indexes all the content in Salesforce as per your configuration. You will see a success message at the top of the page and also in the sync history.
Test the solution
Now that you have ingested the content from your Salesforce account into your Amazon Kendra index, you can test some queries.
- Go to your index and choose Search indexed content in the navigation pane.
- Enter a search term and press Enter.
One of the features of the data source is that it brings in the ACL information along with the contents of Salesforce. You can use this to narrow down your queries by users or groups.
- Return to the search page and expand Test query with user name or groups. Choose Apply user name or groups.
- For Username, enter your user name and choose Apply.
A message appears saying Attributes applied.
- Enter a new test query and press Enter.
Congratulations! You have successfully used Amazon Kendra to surface answers and insights based on the content indexed from your Salesforce account.
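You can run the same kind of ACL-filtered query programmatically. The following boto3 sketch is illustrative; the index ID, query text, and user and group names are placeholders:

```python
import boto3

kendra = boto3.client("kendra", region_name="us-east-1")

# Run a search with a user context so results are filtered by the ACLs
# imported from Salesforce. All identifiers below are placeholders.
response = kendra.query(
    IndexId="00000000-0000-0000-0000-000000000000",
    QueryText="What is the status of case 1234?",
    UserContext={
        "UserId": "jdoe@example.com",
        "Groups": ["SupportTeam"],
    },
)
for item in response["ResultItems"]:
    print(item["Type"], item.get("DocumentTitle", {}).get("Text"))
```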
Conclusion
With the Salesforce connector for Amazon Kendra, organizations can tap into the repository of information stored in their account securely using intelligent search powered by Amazon Kendra.
In this post, we introduced you to the basics, but there are many additional features that we didn’t cover. For example:
- You can enable user-based access control for your Amazon Kendra index and restrict access to users and groups that you configure
- You can map additional fields to Amazon Kendra index attributes and enable them for faceting, search, and display in the search results
- You can integrate the Salesforce data source with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion
To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide.
About the author
Ashish Lagwankar is a Senior Enterprise Solutions Architect at AWS. His core interests include AI/ML, serverless, and container technologies. Ashish is based in the Boston, MA, area and enjoys reading, outdoors, and spending time with his family.
Speed ML development using SageMaker Feature Store and Apache Iceberg offline store compaction
Today, companies are establishing feature stores to provide a central repository to scale ML development across business units and data science teams. As feature data grows in size and complexity, data scientists need to be able to efficiently query these feature stores to extract datasets for experimentation, model training, and batch scoring.
Amazon SageMaker Feature Store is a purpose-built feature management solution that helps data scientists and ML engineers securely store, discover, and share curated data used in training and prediction workflows. SageMaker Feature Store now supports Apache Iceberg as a table format for storing features. This accelerates model development by enabling faster query performance when extracting ML training datasets, taking advantage of Iceberg table compaction. Depending on the design of your feature groups and their scale, you can experience training query performance improvements of 10x to 100x by using this new capability.
By the end of this post, you will know how to create feature groups using the Iceberg format, execute Iceberg’s table management procedures using Amazon Athena, and schedule these tasks to run autonomously. If you are a Spark user, you’ll also learn how to execute the same procedures using Spark and incorporate them into your own Spark environment and automation.
SageMaker Feature Store and Apache Iceberg
Amazon SageMaker Feature Store is a centralized store for features and associated metadata, allowing features to be easily discovered and reused by data scientist teams working on different projects or ML models.
SageMaker Feature Store consists of an online and an offline mode for managing features. The online store is used for low-latency real-time inference use cases. The offline store is primarily used for batch predictions and model training. The offline store is an append-only store and can be used to store and access historical feature data. With the offline store, users can store and serve features for exploration and batch scoring and extract point-in-time correct datasets for model training.
The offline store data is stored in an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account. SageMaker Feature Store automatically builds an AWS Glue Data Catalog during feature group creation. Customers can also access offline store data using a Spark runtime and perform big data processing for ML feature analysis and feature engineering use cases.
Table formats provide a way to abstract data files as a table. Over the years, many table formats have emerged to support ACID transaction, governance, and catalog use cases. Apache Iceberg is an open table format for very large analytic datasets. It manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Iceberg tracks individual data files in a table instead of in directories. This allows writers to create data files in place (files are not moved or changed) and only add files to the table in an explicit commit. The table state is maintained in metadata files. All changes to the table state create a new metadata file version that atomically replaces the older metadata. The table metadata file tracks the table schema, partitioning configuration, and other properties.
Iceberg has integrations with AWS services. For example, you can use the AWS Glue Data Catalog as the metastore for Iceberg tables, and Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore.
With SageMaker Feature Store, you can now create feature groups with Iceberg table format as an alternative to the default standard Glue format. With that, customers can leverage the new table format to use Iceberg’s file compaction and data pruning features to meet their use case and optimization requirements. Iceberg also lets customers perform deletion, time-travel queries, high-concurrency transactions, and higher-performance queries.
By combining Iceberg as a table format and table maintenance operations such as compaction, customers get faster query performance when working with offline feature groups at scale, letting them more quickly build ML training datasets.
The following diagram shows the structure of the offline store using Iceberg as a table format.
In the next sections, you will learn how to create feature groups using the Iceberg format, execute Iceberg’s table management procedures using Amazon Athena, and use AWS services to schedule these tasks to run on demand or on a schedule. If you are a Spark user, you will also learn how to execute the same procedures using Spark.
For step-by-step instructions, we also provide a sample notebook, which can be found in GitHub. In this post, we will highlight the most important parts.
Creating feature groups using Iceberg table format
You first need to select Iceberg as a table format when creating new feature groups. A new optional parameter TableFormat can be set either interactively using Amazon SageMaker Studio or through code using the API or the SDK. This parameter accepts the values ICEBERG or GLUE (for the current AWS Glue format). The following code snippet shows you how to create a feature group using the Iceberg format and FeatureGroup.create API of the SageMaker SDK.
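The snippet itself isn’t reproduced in this excerpt, so here is a minimal sketch of what such a call can look like with a recent SageMaker Python SDK (the DataFrame columns, bucket, and role resolution are illustrative assumptions):

```python
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.inputs import TableFormatEnum

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Illustrative orders data; column names and types are assumptions.
orders_df = pd.DataFrame({
    "order_id": ["1", "2"],
    "customer_id": ["c-10", "c-11"],
    "order_total": [42.5, 19.9],
    "event_time": [1670076334.0, 1670076335.0],
})
orders_df["order_id"] = orders_df["order_id"].astype("string")
orders_df["customer_id"] = orders_df["customer_id"].astype("string")

feature_group = FeatureGroup(name="orders_feature_group_iceberg", sagemaker_session=session)
feature_group.load_feature_definitions(data_frame=orders_df)

# table_format=TableFormatEnum.ICEBERG selects the Iceberg offline store format.
feature_group.create(
    s3_uri="s3://my-offline-store-bucket/feature-store",  # placeholder bucket
    record_identifier_name="order_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,
    table_format=TableFormatEnum.ICEBERG,
)
```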
The table will be created and registered automatically in the AWS Glue Data Catalog.
Now that the orders_feature_group_iceberg is created, you can ingest features using your ingestion pipeline of choice. In this example, we ingest records using the FeatureGroup.ingest() API, which ingests records from a Pandas DataFrame. You can also use the FeatureGroup().put_record API to ingest individual records or to handle streaming sources. Spark users can also ingest Spark dataframes using our Spark Connector.
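Continuing the sketch above, ingestion from the same Pandas DataFrame, or of an individual record through the feature store runtime, could look like this (values are illustrative):

```python
import boto3

# Batch ingestion from the DataFrame created in the previous sketch.
feature_group.ingest(data_frame=orders_df, max_workers=3, wait=True)

# Single-record ingestion through the feature store runtime API.
featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")
featurestore_runtime.put_record(
    FeatureGroupName="orders_feature_group_iceberg",
    Record=[
        {"FeatureName": "order_id", "ValueAsString": "3"},
        {"FeatureName": "customer_id", "ValueAsString": "c-12"},
        {"FeatureName": "order_total", "ValueAsString": "7.5"},
        {"FeatureName": "event_time", "ValueAsString": "1670076400.0"},
    ],
)
```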
You can verify that the records have been ingested successfully by running a query against the offline feature store. You can also navigate to the S3 location and see the new folder structure.
Executing Iceberg table management procedures
Amazon Athena is a serverless SQL query engine that natively supports Iceberg management procedures. In this section, you will use Athena to manually compact the offline feature group you created. Note you will need to use Athena engine version 3. For this, you can create a new workgroup, or configure an existing workgroup, and select the recommended Athena engine version 3. For more information and instructions for changing your Athena engine version, refer to Changing Athena engine versions.
As data accumulates into an Iceberg table, queries may gradually become less efficient because of the increased processing time required to open additional files. Compaction optimizes the structural layout of the table without altering table content.
To perform compaction, you use the OPTIMIZE table REWRITE DATA compaction table maintenance command in Athena. The following syntax shows how to optimize the data layout of a feature group stored using the Iceberg table format. The sagemaker_featurestore represents the name of the SageMaker Feature Store database, and orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334 is our feature group table name.
After running the optimize command, you use the VACUUM procedure, which performs snapshot expiration and removes orphan files. These actions reduce metadata size and remove files that are not in the current table state and are also older than the retention period specified for the table.
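The exact statements aren’t reproduced in this excerpt; the following boto3 sketch shows how the compaction and cleanup statements could be submitted to Athena (the workgroup and results location are placeholders, and the hyphenated table name must be double-quoted):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

DATABASE = "sagemaker_featurestore"
TABLE = '"orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334"'
WORKGROUP = "feature-store-maintenance"  # must use Athena engine version 3
OUTPUT = "s3://my-athena-results-bucket/feature-store-maintenance/"

def run_query(query_string):
    """Submit a query to Athena and return its execution ID."""
    result = athena.start_query_execution(
        QueryString=query_string,
        QueryExecutionContext={"Database": DATABASE},
        WorkGroup=WORKGROUP,
        ResultConfiguration={"OutputLocation": OUTPUT},
    )
    return result["QueryExecutionId"]

# Compact small files, then expire old snapshots and remove orphan files.
run_query(f"OPTIMIZE {TABLE} REWRITE DATA USING BIN_PACK")
run_query(f"VACUUM {TABLE}")
```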
Note that table properties are configurable using Athena’s ALTER TABLE. For an example of how to do this, see the Athena documentation. For VACUUM, vacuum_min_snapshots_to_keep and vacuum_max_snapshot_age_seconds can be used to configure snapshot pruning parameters.
Let’s have a look at the performance impact of running compaction on a sample feature group table. For testing purposes, we ingested the same orders feature records into two feature groups, orders-feature-group-iceberg-pre-comp-02-11-03-06-1669979003 and orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334, using a parallelized SageMaker processing job with Scikit-Learn, which results in 49,908,135 objects stored in Amazon S3 and a total size of 106.5 GiB.
We run a query to select the latest snapshot without duplicates and without deleted records on the feature group orders-feature-group-iceberg-pre-comp-02-11-03-06-1669979003. Prior to compaction, the query took 1hr 27mins.
We then run compaction on orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334 using the Athena OPTIMIZE query, which compacted the feature group table to 109,851 objects in Amazon S3 and a total size of 2.5 GiB. If we then run the same query after compaction, its runtime decreased to 1min 13sec.
With Iceberg file compaction, the query execution time improved significantly. For the same query, the run time decreased from 1h 27mins to 1min 13sec, which is 71 times faster.
Scheduling Iceberg compaction with AWS services
In this section, you will learn how to automate the table management procedures to compact your offline feature store. The following diagram illustrates the architecture for creating feature groups in Iceberg table format and a fully automated table management solution, which includes file compaction and cleanup operations.
At a high level, you create a feature group using the Iceberg table format and ingest records into the online feature store. Feature values are automatically replicated from the online store to the historical offline store. Athena is used to run the Iceberg management procedures. To schedule the procedures, you set up an AWS Glue job using a Python shell script and create an AWS Glue job schedule.
AWS Glue Job setup
You use an AWS Glue job to execute the Iceberg table maintenance operations on a schedule. First, you need to create an IAM role for AWS Glue to have permissions to access Amazon Athena, Amazon S3, and CloudWatch.
Next, you need to create a Python script to run the Iceberg procedures. You can find the sample script in GitHub. The script will execute the OPTIMIZE query using boto3.
The script has been parametrized using the AWS Glue getResolvedOptions(args, options) utility function that gives you access to the arguments that are passed to your script when you run a job. In this example, the AWS Region, the Iceberg database and table for your feature group, the Athena workgroup, and the Athena output location results folder can be passed as parameters to the job, making this script reusable in your environment.
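As a rough illustration of that parametrization (not the actual script from the repository), a Glue Python shell job could resolve its arguments and submit the OPTIMIZE query like this:

```python
import sys
import boto3
from awsglue.utils import getResolvedOptions

# Resolve the job parameters passed to the Glue Python shell job (parameter
# names are illustrative; use whichever names you configure on the job).
args = getResolvedOptions(
    sys.argv,
    ["region", "database", "table", "workgroup", "output_location"],
)

athena = boto3.client("athena", region_name=args["region"])
athena.start_query_execution(
    QueryString=f'OPTIMIZE "{args["table"]}" REWRITE DATA USING BIN_PACK',
    QueryExecutionContext={"Database": args["database"]},
    WorkGroup=args["workgroup"],
    ResultConfiguration={"OutputLocation": args["output_location"]},
)
```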
Finally, you create the actual AWS Glue job to run the script as a shell in AWS Glue.
- Navigate to the AWS Glue console.
- Choose the Jobs tab under AWS Glue Studio.
- Select Python Shell script editor.
- Choose Upload and edit an existing script. Click Create.
- On the Job details tab, you configure the AWS Glue job. You need to select the IAM role you created earlier. Select Python 3.9 or the latest available Python version.
- In the same tab, you can also define a number of other configuration options, such as Number of retries or Job timeout. In Advanced properties, you can add job parameters to execute the script, as shown in the example screenshot below.
- Click Save.
In the Schedules tab, you can define the schedule to run the feature store maintenance procedures. For example, the following screenshot shows you how to run the job on a schedule of every 6 hours.
You can monitor job runs to understand runtime metrics such as completion status, duration, and start time. You can also check the CloudWatch Logs for the AWS Glue job to check that the procedures run successfully.
Executing Iceberg table management tasks with Spark
Customers can also use Spark to manage the compaction jobs and maintenance methods. For more detail on the Spark procedures, see the Spark documentation.
You first need to configure some of the common properties.
The following code can be used to optimize the feature groups via Spark.
You can then execute the next two table maintenance procedures to remove older snapshots and orphan files that are no longer needed.
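The original Spark snippets aren’t reproduced in this excerpt; assuming an Iceberg-enabled Spark session backed by the AWS Glue Data Catalog, the configuration and the maintenance procedures could be invoked roughly as follows (catalog name, warehouse path, and table quoting are illustrative):

```python
from pyspark.sql import SparkSession

# Illustrative Spark session configured for Iceberg with the AWS Glue Data Catalog;
# catalog name, warehouse location, and the Iceberg runtime package are assumptions.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-offline-store-bucket/")
    .getOrCreate()
)

# Hyphenated feature group table names need backtick quoting in the identifier.
table = "sagemaker_featurestore.`orders-feature-group-iceberg-post-comp-03-14-05-17-1670076334`"

# Compact small data files.
spark.sql(f"CALL glue_catalog.system.rewrite_data_files(table => '{table}')")

# Expire old snapshots, then remove files no longer referenced by any snapshot.
spark.sql(f"CALL glue_catalog.system.expire_snapshots(table => '{table}')")
spark.sql(f"CALL glue_catalog.system.remove_orphan_files(table => '{table}')")
```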
You can then incorporate the above Spark commands into your Spark environment. For example, you can create a job that performs the optimization above on a desired schedule or in a pipeline after ingestion.
To explore the complete code example, and try it out in your own account, see the GitHub repo.
Conclusion
SageMaker Feature Store provides a purpose-built feature management solution to help organizations scale ML development across data science teams. In this post, we explained how you can leverage Apache Iceberg as a table format and table maintenance operations such as compaction to benefit from significantly faster queries when working with offline feature groups at scale and, as a result, build training datasets faster. Give it a try, and let us know what you think in the comments.
About the authors
Arnaud Lauer is a Senior Partner Solutions Architect in the Public Sector team at AWS. He enables partners and customers to understand how best to use AWS technologies to translate business needs into solutions. He brings more than 17 years of experience in delivering and architecting digital transformation projects across a range of industries, including public sector, energy, and consumer goods. Arnaud holds 12 AWS certifications, including the ML Specialty Certification.
Ioan Catana is an Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He helps customers develop and scale their ML solutions in the AWS Cloud. Ioan has over 20 years of experience mostly in software architecture design and cloud engineering.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.
Brandon Chatham is a software engineer with the SageMaker Feature Store team. He’s deeply passionate about building elegant systems that bring big data and machine learning to people’s fingertips.