Announcing enhanced table extractions with Amazon Textract

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. Amazon Textract has a Tables feature within the AnalyzeDocument API that offers the ability to automatically extract tabular structures from any document. In this post, we discuss the improvements made to the Tables feature and how it makes it easier to extract information in tabular structures from a wide variety of documents.

Tabular structures in documents such as financial reports, paystubs, and certificate of analysis files are often formatted in a way that enables easy interpretation of information. They often also include information such as table title, table footer, section title, and summary rows within the tabular structure for better readability and organization. Prior to this enhancement, the Tables feature within AnalyzeDocument would have identified those elements as ordinary cells, and it didn’t extract titles and footers located outside the bounds of the table. In such cases, custom postprocessing logic was necessary to identify this information or extract it separately from the API’s JSON output. With the enhancements to the Tables feature, extracting these aspects of tabular data becomes much simpler.

In April 2023, Amazon Textract introduced the ability to automatically detect titles, footers, section titles, and summary rows present in documents via the Tables feature. In this post, we discuss these enhancements and give examples to help you understand and use them in your document processing workflows. We walk through code examples that call the API and process the response with the Amazon Textract Textractor library.

Overview of solution

The following image shows that the updated model identifies not only the table in the document but also all corresponding table headers and footers. This sample financial report document contains a table title, footer, section title, and summary rows.

Financial Report with table

The Tables feature enhancement adds support for four new elements in the API response that allow you to extract each of these table elements with ease, and adds the ability to distinguish the type of table.

Table elements

Amazon Textract can identify several components of a table such as table cells and merged cells. These components, known as Block objects, encapsulate the details related to the component, such as the bounding geometry, relationships, and confidence score. A Block represents items that are recognized in a document within a group of pixels close to each other. The following are the new Table Blocks introduced in this enhancement:

  • Table title – A new Block type called TABLE_TITLE that enables you to identify the title of a given table. Titles can be one or more lines, which are typically above a table or embedded as a cell within the table.
  • Table footers – A new Block type called TABLE_FOOTER that enables you to identify the footers associated with a given table. Footers can be one or more lines that are typically below the table or embedded as a cell within the table.
  • Section title – A new Block type called TABLE_SECTION_TITLE that enables you to identify if the cell detected is a section title.
  • Summary cells – A new Block type called TABLE_SUMMARY that enables you to identify if the cell is a summary cell, such as a cell for totals on a paystub.
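
As a rough illustration of how these elements might be located in the raw JSON, the following sketch tallies table-related labels in a response. The `sample_response` dictionary is a hypothetical, heavily trimmed stand-in rather than real API output, and the helper name `count_table_elements` is ours. Because a label may surface either as a `BlockType` or inside a block's `EntityTypes` list, the sketch counts both fields instead of assuming one layout.

```python
from collections import Counter

# Hypothetical, heavily trimmed stand-in for an AnalyzeDocument response.
# Real responses also carry geometry, confidence scores, and relationships.
sample_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE", "EntityTypes": ["STRUCTURED_TABLE"]},
        {"Id": "tt1", "BlockType": "TABLE_TITLE"},
        {"Id": "tf1", "BlockType": "TABLE_FOOTER"},
        {"Id": "tf2", "BlockType": "TABLE_FOOTER"},
        {"Id": "c1", "BlockType": "CELL", "EntityTypes": ["COLUMN_HEADER"]},
        {"Id": "c2", "BlockType": "CELL", "EntityTypes": ["TABLE_SECTION_TITLE"]},
        {"Id": "c3", "BlockType": "CELL", "EntityTypes": ["TABLE_SUMMARY"]},
        {"Id": "c4", "BlockType": "CELL"},
    ]
}


def count_table_elements(response):
    """Count every BlockType and EntityTypes label found in the response."""
    counts = Counter()
    for block in response.get("Blocks", []):
        counts[block["BlockType"]] += 1
        # EntityTypes is optional, so default to an empty list.
        for entity_type in block.get("EntityTypes", []):
            counts[entity_type] += 1
    return counts


element_counts = count_table_elements(sample_response)
for label in ("TABLE_TITLE", "TABLE_FOOTER", "TABLE_SECTION_TITLE", "TABLE_SUMMARY"):
    print(label, element_counts[label])
```

In practice, the Textractor library shown later in this post hides this bookkeeping; the sketch is only meant to make the response structure more concrete.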

Financial Report with table elements

Types of tables

When Amazon Textract identifies a table in a document, it extracts all the details of the table into a top-level Block type of TABLE. Tables can come in various shapes and sizes. For example, documents often contain tables that may or may not have a discernible table header. To help distinguish these types of tables, we added two new entity types for a TABLE Block: SEMI_STRUCTURED_TABLE and STRUCTURED_TABLE. These entity types help you distinguish between a structured versus a semistructured table.

Structured tables are tables that have clearly defined column headers. But with semi-structured tables, data might not follow a strict structure. For example, data may appear in tabular structure that isn’t a table with defined headers. The new entity types offer the flexibility to choose which tables to keep or remove during post-processing. The following image shows an example of STRUCTURED_TABLE and SEMI_STRUCTURED_TABLE.

Table types
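
To make the post-processing choice concrete, here is a minimal sketch that separates TABLE blocks by entity type. It assumes the entity type appears in the block's `EntityTypes` list; the `sample_blocks` data and the helper name `split_tables_by_type` are hypothetical and exist only for illustration.

```python
def split_tables_by_type(blocks):
    """Return (structured_ids, semi_structured_ids) for the TABLE blocks."""
    structured, semi_structured = [], []
    for block in blocks:
        if block.get("BlockType") != "TABLE":
            continue  # Only top-level TABLE blocks carry the table type.
        entity_types = block.get("EntityTypes", [])
        if "STRUCTURED_TABLE" in entity_types:
            structured.append(block["Id"])
        elif "SEMI_STRUCTURED_TABLE" in entity_types:
            semi_structured.append(block["Id"])
    return structured, semi_structured


# Hypothetical blocks standing in for the "Blocks" list of an API response.
sample_blocks = [
    {"Id": "table-1", "BlockType": "TABLE", "EntityTypes": ["STRUCTURED_TABLE"]},
    {"Id": "table-2", "BlockType": "TABLE", "EntityTypes": ["SEMI_STRUCTURED_TABLE"]},
    {"Id": "line-1", "BlockType": "LINE"},
]

structured_ids, semi_structured_ids = split_tables_by_type(sample_blocks)
print("structured:", structured_ids)
print("semi-structured:", semi_structured_ids)
```

A pipeline that only wants tables with clear column headers could then keep `structured_ids` and drop the rest.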

Analyzing the API output

In this section, we explore how you can use the Amazon Textract Textractor library to postprocess the API output of AnalyzeDocument with the Tables feature enhancements. This allows you to extract relevant information from tables.

Textractor is a library created to work seamlessly with Amazon Textract APIs and utilities to subsequently convert the JSON responses returned by the APIs into programmable objects. You can also use it to visualize entities on the document and export the data in formats such as comma-separated values (CSV) files. It’s intended to aid Amazon Textract customers in setting up their postprocessing pipelines.

In our examples, we use the following sample page from a 10-K SEC filing document.

10-K SEC filing document

The following code can be found within our GitHub repository. To process this document, we first install the Textractor library, which we use to postprocess the API output and visualize the data:

pip install amazon-textract-textractor

The first step is to call Amazon Textract AnalyzeDocument with the Tables feature, denoted by the features=[TextractFeatures.TABLES] parameter, to extract the table information. Note that this method invokes the real-time (or synchronous) AnalyzeDocument API, which supports single-page documents. However, you can use the asynchronous StartDocumentAnalysis API to process multi-page documents (with up to 3,000 pages).

from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures, Direction, DirectionalFinderType
image = Image.open("sec_filing.png") # loads the document image with Pillow
extractor = Textractor(region_name="us-east-1") # Initialize textractor client, modify region if required
document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.TABLES],
    save_image=True
)

The document object contains metadata about the document that can be reviewed. Notice that it recognizes one table in the document along with other entities in the document:

This document holds the following data:
Pages - 1
Words - 658
Lines - 122
Key-values - 0
Checkboxes - 0
Tables - 1
Queries - 0
Signatures - 0
Identity Documents - 0
Expense Documents - 0
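
For multi-page documents, the asynchronous path mentioned earlier can be sketched roughly as follows. This is not part of the original walkthrough: it assumes you call the service directly through a Boto3 `textract` client, and the helper name `collect_analysis_blocks`, the bucket, and the object key are placeholders. The helper polls `get_document_analysis` until the job finishes, then follows `NextToken` to gather the Blocks from every result page.

```python
import time


def collect_analysis_blocks(client, job_id, poll_seconds=5.0):
    """Poll an asynchronous analysis job and gather Blocks from all result pages."""
    # Wait until the job leaves the IN_PROGRESS state.
    while True:
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # The first finished response holds page one of the results.
    blocks = list(response.get("Blocks", []))
    next_token = response.get("NextToken")
    while next_token:
        response = client.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
    return blocks


# Usage sketch (requires AWS credentials; bucket and key are placeholders):
# import boto3
# client = boto3.client("textract", region_name="us-east-1")
# job = client.start_document_analysis(
#     DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "filing.pdf"}},
#     FeatureTypes=["TABLES"],
# )
# blocks = collect_analysis_blocks(client, job["JobId"])
```

A production setup would more likely rely on a completion notification than on polling; the loop here just keeps the sketch short.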

Now that we have the API output containing the table information, we visualize the different elements of the table using the response structure discussed previously:

table = EntityList(document.tables[0])
document.tables[0].visualize()

10-K SEC filing document table highlighted

The Textractor library highlights the various entities within the detected table with a different color code for each table element. Let’s dive deeper into how we can extract each element. The following code snippet demonstrates extracting the title of the table:

table_title = table[0].title.text
table_title

'The following table summarizes, by major security type, our cash, cash equivalents, restricted cash, and marketable securities that are measured at fair value on a recurring basis and are categorized using the fair value hierarchy (in millions):'

Similarly, we can use the following code to extract the footers of the table. Notice that table_footers is a list, which means that there can be one or more footers associated with the table. We can iterate over this list to see all the footers present, and as shown in the following code snippet, the output displays three footers:

table_footers = table[0].footers
for footer in table_footers:
    print(footer.text)

(1) The related unrealized gain (loss) recorded in "Other income (expense), net" was $(116) million and $1.0 billion in Q3 2021 and Q3 2022, and $6 million and $(11.3) billion for the nine months ended September 30, 2021 and 2022.

(2) We are required to pledge or otherwise restrict a portion of our cash, cash equivalents, and marketable fixed income securities primarily as collateral for real estate, amounts due to third-party sellers in certain jurisdictions, debt, and standby and trade letters of credit. We classify cash, cash equivalents, and marketable fixed income securities with use restrictions of less than twelve months as "Accounts receivable, net and other" and of twelve months or longer as non-current "Other assets" on our consolidated balance sheets. See "Note 4 - Commitments and Contingencies."

(3) Our equity investment in Rivian had a fair value of $15.6 billion and $5.2 billion as of December 31, 2021 and September 30, 2022, respectively. The investment was subject to regulatory sales restrictions resulting in a discount for lack of marketability of approximately $800 million as of December 31, 2021, which expired in Q1 2022.

Generating data for downstream ingestion

The Textractor library also helps you simplify the ingestion of table data into downstream systems or other workflows. For example, you can export the extracted table data into a human readable Microsoft Excel file. At the time of this writing, this is the only format that supports merged tables.

table[0].to_excel(filepath="sec_filing.xlsx")

Table to Excel

We can also convert it to a Pandas DataFrame. DataFrame is a popular choice for data manipulation, analysis, and visualization in programming languages such as Python and R.

In Python, DataFrame is a primary data structure in the Pandas library. It’s flexible and powerful, and is often the first choice for data analysis professionals for various data analysis and ML tasks. The following code snippet shows how to convert the extracted table information into a DataFrame with a single line of code:

df=table[0].to_pandas()
df

Table to DataFrame

Lastly, we can convert the table data into a CSV file. CSV files are often used to ingest data into relational databases or data warehouses. See the following code:

table[0].to_csv()

',0,1,2,3,4,5\n0,,"December 31, 2021",,September,"30, 2022",\n1,,Total Estimated Fair Value,Cost or Amortized Cost,Gross Unrealized Gains,Gross Unrealized Losses,Total Estimated Fair Value\n2,Cash,"$ 10,942","$ 10,720",$ -,$ -,"$ 10,720"\n3,Level 1 securities:,,,,,\n4,Money market funds,"20,312","16,697",-,-,"16,697"\n5,Equity securities (1)(3),"1,646",,,,"5,988"\n6,Level 2 securities:,,,,,\n7,Foreign government and agency securities,181,141,-,(2),139\n8,U.S. government and agency securities,"4,300","2,301",-,(169),"2,132"\n9,Corporate debt securities,"35,764","20,229",-,(799),"19,430"\n10,Asset-backed securities,"6,738","3,578",-,(191),"3,387"\n11,Other fixed income securities,686,403,-,(22),381\n12,Equity securities (1)(3),"15,740",,,,19\n13,,"$ 96,309","$ 54,069",$ -,"$ (1,183)","$ 58,893"\n14,"Less: Restricted cash, cash equivalents, and marketable securities (2)",(260),,,,(231)\n15,"Total cash, cash equivalents, and marketable securities","$ 96,049",,,,"$ 58,662"\n'
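
To illustrate the ingestion step, the following sketch loads a CSV string into an in-memory SQLite table using only the Python standard library. It assumes the string has the same shape as the to_csv() output above (a header row whose first column is an unnamed row index); `csv_text` is a short hypothetical stand-in, and the helper name, table name, and column names are our own choices.

```python
import csv
import io
import sqlite3

# Short stand-in for the string returned by table[0].to_csv().
csv_text = (
    ',0,1\n'
    '0,Cash,"$ 10,720"\n'
    '1,Money market funds,"16,697"\n'
)


def load_csv_into_sqlite(text, connection, table_name="extracted_table"):
    """Create a text-only table from the CSV string and return the row count."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    # The leading unnamed column holds the row index; the rest become col_<n>.
    columns = ["row_index"] + [f"col_{name}" for name in header[1:]]
    column_sql = ", ".join(f'"{column}" TEXT' for column in columns)
    connection.execute(f'CREATE TABLE "{table_name}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    connection.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', data
    )
    return len(data)


connection = sqlite3.connect(":memory:")
inserted = load_csv_into_sqlite(csv_text, connection)
cash_value = connection.execute(
    "SELECT col_1 FROM extracted_table WHERE col_0 = ?", ("Cash",)
).fetchone()[0]
print(inserted, cash_value)
```

Values are stored as text on purpose: amounts such as "$ 10,720" or "(169)" would need cleaning before they can be treated as numbers.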

Conclusion

The introduction of these new block and entity types (TABLE_TITLE, TABLE_FOOTER, STRUCTURED_TABLE, SEMI_STRUCTURED_TABLE, TABLE_SECTION_TITLE, and TABLE_SUMMARY) marks a significant advancement in extraction of tabular structures from documents with Amazon Textract.

These tools provide a more nuanced and flexible approach, catering to both structured and semistructured tables and making sure that no important data is overlooked, regardless of its location in a document.

This means we can now handle diverse data types and table structures with enhanced efficiency and accuracy. As we continue to embrace the power of automation in document processing workflows, these enhancements will no doubt pave the way for more streamlined workflows, higher productivity, and more insightful data analysis. For more information on AnalyzeDocument and the Tables feature, refer to AnalyzeDocument.


About the authors

Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM) and Machine Learning infrastructure and operations projects (MLOps).

Anjan Biswas is a Senior AI Services Solutions Architect with focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand, and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations and is actively helping customers get started and scale on AWS AI services.

Lalita Reddi is a Senior Technical Product Manager with the Amazon Textract team. She is focused on building machine learning-based services for AWS customers. In her spare time, Lalita likes to play board games and go on hikes.

Technology Innovation Institute trains the state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker

This blog post is co-written with Dr. Ebtesam Almazrouei, Executive Director–Acting Chief AI Researcher of the AI-Cross Center Unit and Project Lead for LLM Projects at TII.

The United Arab Emirates’ (UAE) Technology Innovation Institute (TII), the applied research pillar of Abu Dhabi’s Advanced Technology Research Council, has launched Falcon LLM, a foundational large language model (LLM) with 40 billion parameters. TII is a leading global research center dedicated to pushing the frontiers of knowledge. TII’s team of scientists, researchers, and engineers works to deliver discovery science and transformative technologies. TII’s work focuses on breakthroughs that will future-proof our society. Trained on 1 trillion tokens, TII Falcon LLM boasts top-notch performance while remaining incredibly cost-effective. Falcon-40B matches the performance of other high-performing LLMs, and is the top-ranked open-source model in the public Hugging Face Open LLM leaderboard. It’s available as open source in two different sizes, Falcon-40B and Falcon-7B, and was built from scratch using data preprocessing and model training jobs built on Amazon SageMaker. Open-sourcing Falcon 40B enables users to construct and customize AI tools that cater to unique user needs, facilitating seamless integration and ensuring the long-term preservation of data assets. The model weights are available to download, inspect, and deploy anywhere.

Starting June 7th, both Falcon LLMs will also be available in Amazon SageMaker JumpStart, SageMaker’s machine learning (ML) hub that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get started with ML. You can deploy and use the Falcon LLMs with a few clicks in SageMaker Studio or programmatically through the SageMaker Python SDK. To deploy and run inference against Falcon LLMs, refer to the Introduction to SageMaker JumpStart – Text Generation with Falcon LLMs example notebook.

Dr. Ebtesam Almazrouei, Executive Director–Acting Chief AI Researcher of the AI-Cross Center Unit and Project Lead for LLM Projects at TII, shares:

“We proudly announce the official open-source release of Falcon-40B, the world’s top-ranking open-source language model, developed by TII. Falcon-40B has surpassed renowned models like LLaMA-65B, StableLM, RedPajama, and MPT on the public leaderboard maintained by Hugging Face, demonstrating its exceptional performance without specialized fine-tuning.”

“This impressive achievement reflects the UAE’s dedication to push the boundaries of AI innovation,” continues Dr. Almazrouei. “By releasing Falcon-40B as an open-source model, we provide researchers, businesses, and organizations with the opportunity to leverage its powerful capabilities across various sectors. Falcon-40B’s open-source release empowers organizations to harness its exceptional capabilities and drive advancements in AI-driven solutions. It represents a significant milestone in our commitment to fostering AI innovation and exemplifies the profound scientific contributions of the UAE. To explore Falcon-40B’s remarkable potential, please visit FalconLLM.tii.ae. Join us in leveraging the power of Falcon-40B to shape the future of AI and revolutionize industries.”

In this post, we dive deep with Dr. Almazrouei about Falcon LLM training on SageMaker, data curation, optimization, performance, and next steps.

A new generation of LLMs

LLMs are software algorithms trained to complete natural text sequences. Due to their size and the volume of training data they interact with, LLMs have impressive text processing abilities, including summarization, question answering, in-context learning, and more.

In early 2020, research organizations across the world placed the emphasis on model size, observing that accuracy correlated with the number of parameters. For example, GPT-3 (2020) and BLOOM (2022) feature around 175 billion parameters, Gopher (2021) has 280 billion parameters, and MT-NLG (2021) has 530 billion parameters. In 2022, Hoffmann et al. observed that the prevailing balance of compute between model parameters and dataset size was suboptimal, and published empirical scaling laws suggesting that shifting the compute budget towards smaller models trained on more data could lead to better-performing models. They implemented their guidance in the 70B-parameter Chinchilla (2022) model, which outperformed much bigger models.

LLM training on SageMaker

SageMaker is a collection of managed APIs for developing, training, tuning, and hosting machine learning (ML) models, including LLMs. Numerous customers rely on SageMaker for their LLM workloads, such as Stability AI, AI21 Labs, and LG AI. SageMaker Training provisions compute clusters with user-defined hardware configuration and code. Compute jobs are billed per run, pro-rated to the second, meaning that users are not charged for GPU capacity when not using the service. TII used transient clusters provided by the SageMaker Training API to train the Falcon LLM, using up to 48 ml.p4d.24xlarge instances for a total of 384 NVIDIA A100 GPUs. TII is now training the next Falcon LLM and has scaled its training to 3,136 A100 GPUs (392 ml.p4d instances).

An unprecedented number of custom innovations went into all layers of the project in order to raise the bar of science quality and training speed. In the next sections, we describe the optimizations TII conducted at all layers of the deep learning (DL) training system.

Scalable data curation

Latest-generation LLMs get their strength from the size and quality of training data. The team put specific care into crafting a high-quality trillion-token dataset. Several SageMaker Training CPU jobs transformed petabytes of cheap, scalable web data into a curated, safe training dataset. Automated systems filtered and deduplicated the data; for example, ML classifiers were used to filter profanity. CPU jobs running on ml.c5.18xlarge (72 vCPUs, 144 GB RAM) were instantiated in a few API calls via SageMaker Training to run data transformation tasks. The team used both single-instance and multi-instance CPU jobs for different use cases. Some of these jobs used hundreds of parallel share-nothing architecture (SNA) jobs, each on a single machine, and for tasks requiring inter-worker synchronization, the team launched multi-instance jobs, totaling dozens of instances and thousands of vCPUs. Anecdotally, on a downstream dataset preparation task, the team went up to 257 ml.c5.18xlarge instances in a single SageMaker Training job, totaling 18,504 vCPUs and 37 TB of memory.
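
The cluster totals quoted in this post follow directly from the per-instance specifications, as the short sketch below shows. The per-instance figures are the ones stated in the text (72 vCPUs and 144 GB of RAM for ml.c5.18xlarge) plus the eight A100 GPUs of an ml.p4d.24xlarge; the helper name `cluster_totals` is ours.

```python
def cluster_totals(instance_count, per_instance):
    """Multiply each per-instance resource by the number of instances."""
    return {
        resource: instance_count * amount
        for resource, amount in per_instance.items()
    }


C5_18XLARGE = {"vcpus": 72, "memory_gb": 144}
P4D_24XLARGE = {"gpus": 8}  # Eight NVIDIA A100 GPUs per instance.

data_prep = cluster_totals(257, C5_18XLARGE)      # Largest data preparation job.
falcon_40b = cluster_totals(48, P4D_24XLARGE)     # Falcon-40B training cluster.
next_falcon = cluster_totals(392, P4D_24XLARGE)   # Next-generation training run.

print(data_prep)    # {'vcpus': 18504, 'memory_gb': 37008}
print(falcon_40b)   # {'gpus': 384}
print(next_falcon)  # {'gpus': 3136}
```

The 37,008 GB result corresponds to the roughly 37 TB of memory cited above.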

Maximizing training throughput

To minimize both training costs and time-to-market, the team pursued several directions of optimization to accelerate training speed, which is proportional to the number of training tokens processed per second and is measured in TFLOPs/GPU. The team used a fully custom 3D-parallel LLM training framework, featuring custom optimized layers written in compiled GPU code. The team went as far as writing their own custom matrix multiplication implementation to gain further speed! The team also developed logic that adapts parallel communication to the underlying network topology. During their initial scaling experiments, TII was able to reach 166 TFLOPs/GPU on a 147B model on 256 GPUs, and 173 TFLOPs/GPU on a 13B model on 16 GPUs, to our knowledge the fastest-known model TFLOPs achieved in the cloud at the time of the test in late 2022.

Serverless storage

LLM training is storage intensive; several terabytes of training data need to be channeled to the training cluster, and several terabytes of model checkpoints regularly travel back from the cluster to the permanent storage. Checkpoints also need to reach the training cluster as fast as possible in the event of job restart. In traditional high-performance computing (HPC), computing nodes are connected to distributed file systems, which provide high-performance I/O and throughput via a POSIX-like interface. In AWS, customers regularly use the Amazon FSx for Lustre file system for this purpose (for more details, refer to Speed up training on Amazon SageMaker using Amazon FSx for Lustre and Amazon EFS file systems), and we also documented the self-managed use of BeeGFS in a distributed computer vision case study. Due to their focus on costs and operational simplicity, the team decided not to implement and operate file system servers, but instead took up the challenge of building exclusively on top of serverless object storage Amazon Simple Storage Service (Amazon S3). A custom S3 dataset class was built using the AWS SDK for Python (Boto3), and provided satisfactory performance while enabling the scientists to iterate autonomously on I/O engineering and model science within the same codebase.
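
TII's dataset class is not public in this post, so the following is only a minimal sketch of the general idea: an iterable that streams text lines from objects under an S3 prefix through Boto3. The class name `S3TextDataset`, the constructor arguments, and the line-per-sample layout are all assumptions. The client can be injected so the class can be exercised without AWS access.

```python
class S3TextDataset:
    """Iterate over the lines of text objects stored under an S3 prefix."""

    def __init__(self, bucket, prefix, client=None):
        self.bucket = bucket
        self.prefix = prefix
        self._client = client

    @property
    def client(self):
        if self._client is None:
            # Imported lazily so a stub client can be used in tests.
            import boto3
            self._client = boto3.client("s3")
        return self._client

    def _list_keys(self):
        """Yield every object key under the prefix, following pagination."""
        kwargs = {"Bucket": self.bucket, "Prefix": self.prefix}
        while True:
            page = self.client.list_objects_v2(**kwargs)
            for obj in page.get("Contents", []):
                yield obj["Key"]
            if not page.get("IsTruncated"):
                break
            kwargs["ContinuationToken"] = page["NextContinuationToken"]

    def __iter__(self):
        for key in self._list_keys():
            body = self.client.get_object(Bucket=self.bucket, Key=key)["Body"]
            # Stream the object line by line instead of loading it whole.
            for raw_line in body.iter_lines():
                yield raw_line.decode("utf-8")


# Usage sketch (bucket and prefix are placeholders):
# for line in S3TextDataset("my-training-data", "tokenized/shard-000/"):
#     ...
```

A real training loader would add sharding across workers, prefetching, and retries, which this sketch leaves out.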

Client-side innovation

An LLM project rarely consists of a single training job; numerous jobs are needed to conduct initial tests and experiments. Over the course of the main production training, several jobs may be chained, for example to update configuration or software versions, deploy patches, or recover from failures. Scientists from TII conducted significant engineering to build custom clients adapted to LLM training. A launcher client was built on top of the SageMaker Training SDK in order to pack together multiple functionalities in one command, for example code versioning, Docker image building, and job launch. Additionally, an AWS Lambda serverless compute function was designed to watch, monitor, and intervene on jobs as needed.

Using Slack bots for inference quality audits

Towards the end of training, the team deployed the model on an internal SageMaker Hosting GPU endpoint for real-time interaction. The team went as far as creating a Slack bot to converse with the model, gather realistic feedback, and run qualitative audits of its output.

Training and performance monitoring

Training an LLM requires large amounts of computational resources, including CPU, GPU, and memory resources. Therefore, TII needed to monitor the performance and idle time of the training job to ensure optimal utilization of the computational resources and their cost-effectiveness.

To build an automated monitoring solution, TII used Amazon CloudWatch alarms to monitor the utilization of GPU, CPU, and memory for the training jobs. CloudWatch collects raw data and processes it into readable, near-real-time metrics from the underlying container instances being used in the SageMaker Training job. TII then set thresholds for each of these metrics, and if any metric falls below its threshold, an alarm is triggered. This alarm notifies TII’s team of the low resource utilization, allowing them to take corrective action.
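
A low-utilization alarm of this kind could be defined along the following lines. This is a sketch under stated assumptions, not TII's configuration: the metric namespace `/aws/sagemaker/TrainingJobs`, the `Host` dimension format, the threshold, the evaluation window, and the notification topic ARN are all illustrative and should be checked against the metrics your training jobs actually emit. The helper only builds the keyword arguments for CloudWatch's `put_metric_alarm` call.

```python
def build_low_utilization_alarm(job_name, metric_name, threshold_percent,
                                topic_arn, host="algo-1"):
    """Build put_metric_alarm arguments for a 'utilization too low' alarm."""
    return {
        "AlarmName": f"{job_name}-{host}-{metric_name}-low",
        # Assumed namespace and dimension for SageMaker training job metrics.
        "Namespace": "/aws/sagemaker/TrainingJobs",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "Host", "Value": f"{job_name}/{host}"}],
        "Statistic": "Average",
        "Period": 300,               # Aggregate over five-minute windows.
        "EvaluationPeriods": 3,      # Require three consecutive low windows.
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # Missing data also counts as a problem.
        "AlarmActions": [topic_arn],
    }


alarm = build_low_utilization_alarm(
    job_name="falcon-training-job",   # Placeholder job name.
    metric_name="GPUUtilization",
    threshold_percent=50.0,
    topic_arn="arn:aws:sns:us-east-1:123456789012:training-alerts",  # Placeholder.
)

# To create the alarm (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Keeping the alarm definition in a plain function makes it easy to generate one alarm per metric and per host in a multi-instance job.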

In addition to monitoring resource utilization, TII could also monitor the idle time of the training job resources. If the training job resources were idle for a prolonged period of time, it could indicate a bottleneck at any stage of the training cycle and require manual investigation. In some instances, the resource utilization was still relatively optimal, but the training process itself wasn’t progressing. For these cases, TII integrated CloudWatch alarms with Lambda functions to query and read the generated training logs, then take automatic actions based on either the generated error or the idleness of the log generation process (cluster is halted). The alarm triggers an action to stop the training job, which ensures that TII doesn’t incur unnecessary costs when the resources are not being utilized.
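
The stall check described above can be sketched as a small decision function. The names `is_stalled` and `check_training_job`, the 30-minute idle limit, and the way the last log timestamp is obtained are assumptions made for illustration; the stop action is passed in as a callable so the logic stays independent of AWS.

```python
from datetime import datetime, timedelta, timezone


def is_stalled(last_log_time, now, max_idle):
    """Return True when no log line has been written for longer than max_idle."""
    return now - last_log_time > max_idle


def check_training_job(job_name, last_log_time, now, stop_job,
                       max_idle=timedelta(minutes=30)):
    """Stop the job through the supplied callable if its logs have gone quiet."""
    if is_stalled(last_log_time, now, max_idle):
        stop_job(job_name)
        return "stopped"
    return "running"


# Example with a stand-in stop action that only records the request.
stopped_jobs = []
now = datetime(2023, 6, 1, 12, 0, tzinfo=timezone.utc)
quiet_result = check_training_job(
    "falcon-training-job", now - timedelta(hours=2), now, stopped_jobs.append
)
active_result = check_training_job(
    "falcon-eval-job", now - timedelta(minutes=5), now, stopped_jobs.append
)
print(quiet_result, active_result, stopped_jobs)

# Inside a Lambda function, the stop action could be:
# import boto3
# stop_job = lambda name: boto3.client("sagemaker").stop_training_job(
#     TrainingJobName=name
# )
```

Separating the decision from the side effect also makes it straightforward to add other reactions, such as restarting from the latest checkpoint.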

Conclusion

Using SageMaker paired with proprietary, custom innovation, TII was able to train a model that is state-of-the-art in multiple dimensions: technological breakthrough, science quality, training speed, and also operational simplicity.

“Our Falcon LLM illustrates the technology leadership of the UAE, and paves the way for AI-powered innovation in the region. In line with the UAE National AI Strategy 2031, the UAE’s participation in global technological advancements like Falcon LLM is a critical component in our journey towards a knowledge-based economy. The UAE chooses to actively involve itself in the broader conversation by investing in and developing AI solutions that will help create new economic, social, and educational opportunities. As part of this commitment, the open-source release of Falcon LLM showcases the UAE’s dedication to fostering collaboration, promoting transparency, and supporting innovation and research in the field of AI. By making Falcon LLM open source, we aim to enable widespread access to its advanced tech capabilities and empower researchers and organizations worldwide. This significant step exemplifies the UAE’s commitment to driving advancements in AI and solidifies its position as a leader in the global AI community. Next steps include contributing to further advancements in the field of AI and advanced technologies, with new models on the horizon, and promoting the utilization of advanced AI tech within UAE organizations and businesses.”

– Dr. Almazrouei

To learn more about Falcon LLM, check out the website FalconLLM.tii.ae and the model card on Hugging Face!


About the Authors

Dr. Ebtesam Almazrouei is Executive Director–Acting Chief AI Researcher of the AI-Cross Center Unit and Project Lead for LLM Projects at TII. Her work focuses on delivering AI and advanced tech solutions across multiple industries, including healthcare, telecommunication, education, energy, and security. Dr. Almazrouei plays a pivotal role in building LLMs and stepping up the UAE’s capability in this space, leading the team behind building Falcon LLM. In addition, she led the development of Noor, the world’s largest Arabic LLM to date.

Will Badr is a Sr. Manager AI/ML Solutions Architects based in Dubai – UAE who works as part of the global Amazon Machine Learning team. Will is passionate about using technology in innovative ways to positively impact the community. In his spare time, he likes to go diving, play soccer and explore the Pacific Islands.

Olivier Cruchant is a Machine Learning Specialist Solutions Architect at AWS, based in France. Olivier helps AWS customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.

Build high-performance ML models using PyTorch 2.0 on AWS – Part 1

PyTorch is a machine learning (ML) framework that is widely used by AWS customers for a variety of applications, such as computer vision, natural language processing, content creation, and more. With the recent PyTorch 2.0 release, AWS customers can now do the same things as they could with PyTorch 1.x but faster and at scale, with improved training speeds, lower memory usage, and enhanced distributed capabilities. Several new technologies including torch.compile, TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor have been included in the PyTorch 2.0 release. Refer to PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever for details.

This post demonstrates the performance and ease of running large-scale, high-performance distributed ML model training and deployment using PyTorch 2.0 on AWS. This post further walks through a step-by-step implementation of fine-tuning a RoBERTa (Robustly Optimized BERT Pretraining Approach) model for sentiment analysis using AWS Deep Learning AMIs (AWS DLAMI) and AWS Deep Learning Containers (DLCs) on Amazon Elastic Compute Cloud (Amazon EC2 p4d.24xlarge) with an observed 42% speedup when used with PyTorch 2.0 torch.compile + bf16 + fused AdamW. The fine-tuned model is then deployed on an AWS Graviton-based C7g EC2 instance on Amazon SageMaker with an observed 10% speedup compared to PyTorch 1.13.

The following figure shows a performance benchmark of fine-tuning a RoBERTa model on Amazon EC2 p4d.24xlarge with AWS PyTorch 2.0 DLAMI + DLC.

Refer to Optimized PyTorch 2.0 inference with AWS Graviton processors for details on AWS Graviton-based instance inference performance benchmarks for PyTorch 2.0.

Support for PyTorch 2.0 on AWS

PyTorch 2.0 support is not limited to the services and compute shown in the example use case in this post; it extends to many others on AWS, which we discuss in this section.

Business requirement

Many AWS customers, across a diverse set of industries, are transforming their businesses by using artificial intelligence (AI), specifically in the area of generative AI and large language models (LLMs) that are designed to generate human-like text. These are large deep learning models with hundreds of billions of parameters. The growth in model sizes is increasing training time from days to weeks, and even months in some cases. This is driving an exponential increase in training and inference costs, which requires, more than ever, a framework such as PyTorch 2.0 with built-in support for accelerated model training and the optimized infrastructure of AWS tailored to the specific workloads and performance needs.

Choice of compute

AWS provides PyTorch 2.0 support on the broadest choice of powerful compute, high-speed networking, and scalable high-performance storage options that you can use for any ML project or application and customize to fit your performance and budget requirements. This is illustrated in the diagram in the next section; in the bottom tier, we provide a broad selection of compute instances powered by AWS Graviton, NVIDIA, AMD, and Intel processors.

For model deployments, you can use Arm-based processors such as the recently announced AWS Graviton-based instances, which provide inference performance for PyTorch 2.0 with up to 3.5 times the speed for ResNet50 compared to the previous PyTorch release, and up to 1.4 times the speed for BERT, making AWS Graviton-based instances the fastest compute-optimized instances on AWS for CPU-based model inference solutions.

Choice of ML services

To use AWS compute, you can select from a broad set of global cloud-based services for ML development, compute, and workflow orchestration. This choice allows you to align with your business and cloud strategies and run PyTorch 2.0 jobs on the platform of your choice. For instance, if you have on-premises restrictions or existing investments in open-source products, you can use Amazon EC2, AWS ParallelCluster, or AWS UltraCluster to run distributed training workloads based on a self-managed approach. You could also use a fully managed service like SageMaker for a cost-optimized, fully managed, and production-scale training infrastructure. SageMaker also integrates with various MLOps tools, which allows you to scale your model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden.

Similarly, if you have existing Kubernetes investments, you can also use Amazon Elastic Kubernetes Service (Amazon EKS) and Kubeflow on AWS to implement an ML pipeline for distributed training or use an AWS-native container orchestration service like Amazon Elastic Container Service (Amazon ECS) for model training and deployments. Options to build your ML platform are not limited to these services; you can pick and choose depending on your organizational requirements for your PyTorch 2.0 jobs.

stack

Enabling PyTorch 2.0 with AWS DLAMI and AWS DLC

To use the aforementioned stack of AWS services and powerful compute, you have to install an optimized, compiled version of the PyTorch 2.0 framework and its required dependencies, many of which are independent projects, and test them end to end. You may also need CPU-specific libraries for accelerated math routines, GPU-specific libraries for accelerated math and inter-GPU communication routines, and GPU drivers that are aligned with the GPU compiler used to compile the GPU libraries. If your jobs require large-scale multi-node training, you need an optimized network that can provide the lowest latency and highest throughput. After you build your stack, you need to regularly scan and patch it for security vulnerabilities, and rebuild and retest it after every framework version upgrade.

AWS helps reduce this heavy lifting by offering a curated and secure set of frameworks, dependencies, and tools to accelerate deep learning in the cloud through AWS DLAMIs and AWS DLCs. These pre-built and tested machine images and containers are optimized for deep learning on EC2 Accelerated Computing Instance types, allowing you to scale out to multiple nodes for distributed workloads more efficiently and easily. They include a pre-built Elastic Fabric Adapter (EFA), the NVIDIA GPU stack, and many deep learning frameworks (TensorFlow, MXNet, and PyTorch, including the latest 2.0 release) for high-performance distributed deep learning training. You don’t need to spend time installing and troubleshooting deep learning software and drivers or building ML infrastructure, nor do you have to incur the recurring cost of patching these images for security vulnerabilities or recreating the images after every new framework version upgrade. Instead, you can focus on the higher value-added effort of training jobs at scale in a shorter amount of time and iterating on your ML models faster.

Solution overview

Considering that training on GPU and inference on CPU is a popular use case for AWS customers, this post includes a step-by-step implementation of a hybrid architecture. We explore the art of the possible and use a P4 EC2 instance with BF16 support, initialized with the Base GPU DLAMI (including NVIDIA drivers, CUDA, NCCL, and the EFA stack) and the PyTorch 2.0 DLC, to fine-tune a RoBERTa sentiment analysis model. This self-managed setup gives you the control and flexibility to use any open-source or proprietary libraries. We then use SageMaker as a fully managed hosting infrastructure to host our model on AWS Graviton3-based C7g instances. We picked C7g on SageMaker because it has been shown to reduce inference costs by up to 50% relative to comparable EC2 instances for real-time inference on SageMaker. The following diagram illustrates this architecture.

sagemaker_final

The model training and hosting in this use case consists of the following steps:

  1. Launch a GPU DLAMI-based EC2 Ubuntu instance in your VPC and connect to your instance using SSH.
  2. After you log in to your EC2 instance, download the AWS PyTorch 2.0 DLC.
  3. Run your DLC container with a model training script to fine-tune the RoBERTa model.
  4. After model training is complete, package the saved model, inference scripts, and a few metadata files into a tar file that SageMaker inference can use and upload the model package to an Amazon Simple Storage Service (Amazon S3) bucket.
  5. Deploy the model using SageMaker and create an HTTPS inference endpoint. The SageMaker inference endpoint holds a load balancer and one or more instances of your inference container in different Availability Zones. You can deploy either multiple versions of the same model or entirely different models behind this single endpoint. In this example, we host a single model.
  6. Invoke your model endpoint by sending it test data and verify the inference output.

In the following sections, we showcase fine-tuning a RoBERTa model for sentiment analysis. RoBERTa is developed by Facebook AI, improving on the popular BERT model by modifying key hyperparameters and pre-training on a larger corpus. This leads to improved performance compared to vanilla BERT.

We use the transformers library by Hugging Face to get the RoBERTa model pre-trained on approximately 124 million tweets, and we fine-tune it on the Twitter dataset for sentiment analysis.

Prerequisites

Make sure you meet the following prerequisites:

  • You have an AWS account.
  • Make sure you’re in the us-west-2 Region to run this example. (This example is tested in us-west-2; however, you can run in any other Region.)
  • Create a role with the name sagemakerrole. Add managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess to give SageMaker access to S3 buckets.
  • Create an EC2 role with the name ec2_role. Use the following permission policy:
#Refer - Make sure EC2 role has following policies
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability",
        "ecr:CompleteLayerUpload",
        "ecr:GetDownloadUrlForLayer",
        "ecr:InitiateLayerUpload",
        "ecr:PutImage",
        "ecr:UploadLayerPart",
        "ecr:GetAuthorizationToken",
        "s3:*",
        "s3-object-lambda:*",
        "iam:Get*",
        "iam:PassRole",
        "sagemaker:*"
      ],
      "Resource": "*"
    }
  ]
}

1. Launch your development instance

We create a p4d.24xlarge instance that offers 8 NVIDIA A100 Tensor Core GPUs in us-west-2:

#STEP 1.1
For a short guide on launching your instance, read the Getting Started with Amazon EC2 documentation.

When selecting the AMI, follow the release notes to run this command using the AWS Command Line Interface (AWS CLI) to find the AMI ID to use in us-west-2:

#STEP 1.2 - This requires AWS CLI credentials to call ec2 describe-images api (ec2:DescribeImages).
aws ec2 describe-images --region us-west-2 --owners amazon --filters 'Name=name,Values=Deep Learning Base GPU AMI (Ubuntu 20.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text 
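
To make the --query expression in the command above less opaque, here is a minimal Python sketch of the same selection logic: sort the returned images by CreationDate, reverse the order, and keep the ID of the first (newest) entry. The image records below are hypothetical stand-ins for the "Images" list in the describe-images response, not real AMI IDs.

```python
# Hypothetical records shaped like the "Images" list returned by describe-images.
images = [
    {"ImageId": "ami-0aaa", "CreationDate": "2023-03-01T10:00:00.000Z"},
    {"ImageId": "ami-0ccc", "CreationDate": "2023-05-12T08:30:00.000Z"},
    {"ImageId": "ami-0bbb", "CreationDate": "2023-04-20T17:45:00.000Z"},
]


def newest_image_id(records):
    """Mirror reverse(sort_by(Images, &CreationDate))[:1].ImageId."""
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    ordered = sorted(records, key=lambda r: r["CreationDate"], reverse=True)
    return [r["ImageId"] for r in ordered[:1]]


latest = newest_image_id(images)
print(latest)
```

The slice keeps a one-element list, which is why the CLI command pairs the query with --output text to print a bare AMI ID.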

Make sure the size of the gp3 root volume is 200 GiB.

EBS volume encryption is not enabled by default. Consider changing this when moving this solution to production.

2. Download a Deep Learning Container

AWS DLCs are available as Docker images in Amazon Elastic Container Registry Public, a managed AWS container image registry service that is secure, scalable, and reliable. Each Docker image is built for training or inference on a specific deep learning framework version, Python version, with CPU or GPU support. Select the PyTorch 2.0 framework from the list of available Deep Learning Containers images.

Complete the following steps to download your DLC:

a. SSH to the instance. By default, the security group used with Amazon EC2 opens the SSH port to all IP addresses. Consider restricting this if you move this solution to production:

#STEP 2.1 - Use Public IP
ssh -i ~/.ssh/<private_key> ubuntu@<IP_ADDR>

#Refer - Output: Notice python3.9 package that we will use to run and install Inference scripts

__| __|_ )
_| ( / Deep Learning Base GPU AMI (Ubuntu 20.04)
___|___|___|

Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1035-aws x86_64v)

* Please note that Amazon EC2 P2 Instance is not supported on current DLAMI.
* Supported EC2 instances: G3, P3, P3dn, P4d, P4de, G5, G4dn.
NVIDIA driver version: 525.85.12
Default CUDA version: 11.2

Utility libraries are installed in /usr/bin/python3.9.
To access them, use /usr/bin/python3.9.

b. Set the environment variables required to run the remaining steps of this implementation:

#STEP 2.2
Attach the role “ec2_role” to your EC2 instance from the AWS console.

#STEP 2.3
Follow the Amazon S3 documentation to create an S3 bucket in the us-west-2 Region.

#STEP 2.4 - Set Environment variables
#Bucket created in step 2.3
export S3_BUCKET=<your-s3-bucket>
export PYTHON_V=python3.9
export SAGEMAKER_ROLE=$(aws iam get-role --role-name sagemakerrole --output text --query 'Role.Arn')
aws configure set default.region 'us-west-2'

Amazon ECR supports public image repositories with resource-based permissions using AWS Identity and Access Management (IAM) so that specific users or services can access images.

c. Log in to the DLC registry:

#STEP 2.5 - login
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

#Refer - Output
Login Succeeded

d. Pull the latest PyTorch 2.0 container with GPU support in us-west-2:

#STEP 2.6 - pull the latest DLC PyTorch image
docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-ec2

#Refer - Output
7608715873ec: Pull complete
a0bad51e1731: Pull complete
f7778ea3b9cc: Pull complete
....

Digest: sha256:1ab0d477345a11970d811cc252bc461dd70859f15caa19a65198e7941953e6b8
Status: Downloaded newer image for 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-ec2
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-ec2

If you get the error “no space left on device”, make sure you increase the EC2 EBS volume to 200 GiB and then extend the Linux file system.

3. Clone the latest scripts adapted to PyTorch 2.0

Clone the scripts with the following code:

#STEP 3.1
cd $HOME
git clone https://github.com/aws-samples/aws-deeplearning-labs.git
cd aws-deeplearning-labs/workshop/twitter_lm/scripts/
export ml_working_dir=$PWD

The Hugging Face transformers library, which we use at version 4.28.1 (the latest at the time of writing), already supports PyTorch 2.0. We added the following arguments to the trainer API in train_sentiment.py to enable new PyTorch 2.0 features:

  • Torch compile – Experience an average 43% speedup on NVIDIA A100 GPUs with a single-line change.
  • BF16 datatype – New data type support (Brain Floating Point) for Ampere or newer GPUs.
  • Fused AdamW optimizer – Fused AdamW implementation to further speed up training. This stochastic optimization method modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update.
#Refer - updated training config
training_args = TrainingArguments(
    do_eval=True,
    evaluation_strategy='epoch',
    output_dir='test_trainer',
    logging_dir='test_trainer',
    logging_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=10,
    learning_rate=1e-05,
    # pytorch 2.0.0 specific args
    torch_compile=True,
    bf16=True,
    optim='adamw_torch_fused',
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    load_best_model_at_end=True,
    metric_for_best_model='recall',
)

4. Build a new Docker image with dependencies

We extend the pre-built PyTorch 2.0 DLC image to install the Hugging Face transformer and other libraries that we need to fine-tune our model. This allows you to use the included tested and optimized deep learning libraries and settings without having to create an image from scratch. See the following code:

#STEP 4.1 - Create Dockerfile with following content
printf 'FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-ec2
RUN pip install scikit-learn evaluate transformers xformers
' > Dockerfile

#STEP 4.2 - Build new docker file
docker build -f Dockerfile -t pytorch2.0:roberta-sentiment-analysis .

5. Start training using the container

Run the following Docker command to begin fine-tuning the model on the tweet_eval sentiment dataset. We’re using the Docker container arguments (shared memory size, max locked memory, and stack size) recommended by NVIDIA for deep learning workloads.

#STEP 5.1 - run docker container for model training
docker run --net=host --uts=host --ipc=host --shm-size=1g --ulimit stack=67108864 --ulimit memlock=-1 --gpus all -v "/home/ubuntu:/workspace" pytorch2.0:roberta-sentiment-analysis python /workspace/aws-deeplearning-labs/workshop/twitter_lm/scripts/train_sentiment.py

The script first downloads the TweetEval dataset, which consists of seven heterogeneous tasks on Twitter data, all framed as multi-class tweet classification. The tasks include irony, hate, offensive, stance, emoji, emotion, and sentiment.

The script then downloads the base model and starts the fine-tuning process. Training and evaluation metrics are reported at the end of each epoch, so you should expect output similar to the following:

#Refer - Output
{'loss': 0.6927, 'learning_rate': 9e-06, 'epoch': 1.0}
{'eval_loss': 0.6144512295722961, 'eval_recall': 0.7129473901625799, 'eval_runtime': 3.2694, 'eval_samples_per_second': 611.74, 'eval_steps_per_second': 4.894, 'epoch': 1.0}
{'loss': 0.5554, 'learning_rate': 8.000000000000001e-06, 'epoch': 2.0}
{'eval_loss': 0.5860999822616577, 'eval_recall': 0.7312511094156663, 'eval_runtime': 3.3918, 'eval_samples_per_second': 589.655, 'eval_steps_per_second': 4.717, 'epoch': 2.0}
{'loss': 0.5084, 'learning_rate': 7e-06, 'epoch': 3.0}
{'eval_loss': 0.6119785308837891, 'eval_recall': 0.730757638985487, 'eval_runtime': 3.592, 'eval_samples_per_second': 556.791, 'eval_steps_per_second': 4.454, 'epoch': 3.0}
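
The learning_rate values in the log above (9e-06, 8e-06, 7e-06) are consistent with a linear decay from the configured 1e-05 to zero over 10 epochs with no warmup, which is the default scheduler of the Hugging Face Trainer. The following sketch reproduces those values under that assumption; it is an inference from the logged numbers, not a statement of what train_sentiment.py configures explicitly.

```python
INITIAL_LR = 1e-05   # learning_rate from the training config
NUM_EPOCHS = 10      # num_train_epochs from the training config


def linear_lr(progress):
    """Learning rate after a fraction `progress` (0..1) of training, assuming no warmup."""
    return INITIAL_LR * (1.0 - progress)


# Learning rate at the end of epochs 1, 2, and 3.
lrs = [linear_lr(epoch / NUM_EPOCHS) for epoch in range(1, 4)]
for epoch, lr in enumerate(lrs, start=1):
    print(f"epoch {epoch}: learning_rate ~ {lr:.1e}")
```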

Performance statistics

With PyTorch 2.0 and the Hugging Face transformers library 4.28.1, we observed a 42% reduction in training time (train_runtime dropped from about 1,892 seconds to about 1,095 seconds) on a single p4d.24xlarge instance with 8 A100 40GB GPUs. The improvement comes from a combination of torch.compile, the BF16 data type, and the fused AdamW optimizer. The following output shows the final results of two training runs, without and with the new features:

#Refer performance statistics
without torch.compile + bf16 + fused adamw:
{'eval_loss': 0.7532123327255249, 'eval_recall': 0.7315191840508296, 'eval_runtime': 3.7641, 'eval_samples_per_second': 531.341, 'eval_steps_per_second': 4.251, 'epoch': 10.0}
{'train_runtime': 1891.5635, 'train_samples_per_second': 241.15, 'train_steps_per_second': 1.887, 'train_loss': 0.4372138784713104, 'epoch': 10.0}

with torch.compile + bf16 + fused adamw
{'eval_loss': 0.7548801898956299, 'eval_recall': 0.7251081080195005, 'eval_runtime': 3.5685, 'eval_samples_per_second': 560.453, 'eval_steps_per_second': 4.484, 'epoch': 10.0}
{'train_runtime': 1095.388, 'train_samples_per_second': 416.428, 'train_steps_per_second': 3.259, 'train_loss': 0.44210514314368327, 'epoch': 10.0}
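
As a quick check on how the improvement can be expressed, this short sketch derives two figures from the train_runtime and train_samples_per_second values in the output above: the relative reduction in training time and the ratio of training throughput. Only the four numbers are taken from the logs; the variable names are illustrative.

```python
# Values copied from the two training summaries above.
baseline = {"train_runtime": 1891.5635, "train_samples_per_second": 241.15}
optimized = {"train_runtime": 1095.388, "train_samples_per_second": 416.428}

# Fraction of wall-clock training time saved by the PyTorch 2.0 features.
time_reduction = 1 - optimized["train_runtime"] / baseline["train_runtime"]

# How many times more samples per second the optimized run processes.
throughput_ratio = (
    optimized["train_samples_per_second"] / baseline["train_samples_per_second"]
)

print(f"training time reduced by {time_reduction:.1%}")
print(f"throughput ratio: {throughput_ratio:.2f}x")
```

The same runs can therefore be described either as roughly 42% less training time or as roughly 1.73 times the throughput, depending on which metric you report.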

6. Test the trained model locally before preparing for SageMaker inference

You can find the following files under $ml_working_dir/saved_model/ after training:

#Refer - model training artifacts
config.json
merges.txt
pytorch_model.bin
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.json

Let’s make sure we can run inference locally before preparing for SageMaker inference. We can load the saved model and run inference locally using the test_trained_model.py script:

#STEP 6.1 - run docker container to test model inference
docker run --net=host --uts=host --ipc=host --ulimit stack=67108864 --ulimit memlock=-1 --gpus all -v "/home/ubuntu:/workspace" pytorch2.0:roberta-sentiment-analysis python /workspace/aws-deeplearning-labs/workshop/twitter_lm/scripts/test_trained_model.py

You should expect the following output with the input “Covid cases are increasing fast!”:

#Refer - Output
[{'label': 'negative', 'score': 0.854185163974762}]

7. Prepare the model tarball for SageMaker inference

Under the directory where the model is located, make a new directory called code:

#STEP 7.1 - set permissions
cd $ml_working_dir
sudo chown ubuntu:ubuntu saved_model
cd saved_model
mkdir code

In the new directory, create the file inference.py and add the following to it:

#STEP 7.2 - write inference.py
printf 'import json
from transformers import pipeline

REQUEST_CONTENT_TYPE = "application/x-text"
STR_DECODE_CODE = "utf-8"
RESULT_CLASS = "sentiment"
RESULT_SCORE = "score"

def model_fn(model_dir):
    sentiment_analysis = pipeline(
        "sentiment-analysis",
        model=model_dir,
        tokenizer=model_dir,
        return_all_scores=True
    )
    return sentiment_analysis


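def input_fn(request_body, request_content_type):
    if request_content_type == REQUEST_CONTENT_TYPE:
        input_data = request_body.decode(STR_DECODE_CODE)
        return input_data
    # Reject unexpected content types instead of silently returning None
    raise ValueError("Unsupported content type: " + str(request_content_type))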

def predict_fn(input_data, model):
    return model(input_data)

def output_fn(prediction, accept):
    class_label = None
    score = -1
    for _pred in prediction[0]:
        if _pred["score"] > score:
            score = _pred["score"]
            class_label = _pred["label"]
    return json.dumps({RESULT_CLASS: class_label, RESULT_SCORE: score})' > code/inference.py
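
To illustrate what the output_fn handler above does with the pipeline result, the following standalone sketch applies the same highest-score selection to a hand-written prediction. The scores are hypothetical, and the shape (a list containing one list of label/score dictionaries) is assumed from the way output_fn iterates over prediction[0].

```python
import json

# Hypothetical pipeline output for a single input: one list of per-label scores.
prediction = [[
    {"label": "negative", "score": 0.05},
    {"label": "neutral", "score": 0.85},
    {"label": "positive", "score": 0.10},
]]


def top_label(prediction):
    """Pick the highest-scoring label and serialize it like output_fn does."""
    best = max(prediction[0], key=lambda p: p["score"])
    return json.dumps({"sentiment": best["label"], "score": best["score"]})


body = top_label(prediction)
print(body)
```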

Make another file in the same directory called requirements.txt and put transformers in it. SageMaker installs the dependencies in requirements.txt in the inference container for you.

#STEP 7.3 - write requirements.txt
printf 'transformers' > code/requirements.txt

In the end, you should have the following folder structure:

#Refer - inference package folder structure
code/
code/inference.py
code/requirements.txt
config.json
merges.txt
pytorch_model.bin
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.json

The model is ready to be packaged and uploaded to Amazon S3 for use with SageMaker inference:

#STEP 7.4 - Create inference package tar file and upload it to S3
sudo tar -cvpzf ./personal-roberta-base-sentiment.tar.gz -C ./ .
aws s3 cp ./personal-roberta-base-sentiment.tar.gz s3://$S3_BUCKET
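
If you prefer to build the package from Python, the following sketch uses the standard tarfile module to produce an archive with the same root-relative layout. It is an alternative illustration rather than the method used in this post; the placeholder files and the model.tar.gz name are assumptions made for the demo.

```python
import os
import tarfile
import tempfile


def build_model_archive(model_dir, archive_path):
    """Add every file under model_dir to a gzip tarball using root-relative names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(model_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                tar.add(full, arcname=os.path.relpath(full, model_dir))


# Demo with placeholder files standing in for the real training artifacts.
with tempfile.TemporaryDirectory() as work:
    model_dir = os.path.join(work, "saved_model")
    os.makedirs(os.path.join(model_dir, "code"))
    for rel in ("config.json", "pytorch_model.bin",
                "code/inference.py", "code/requirements.txt"):
        with open(os.path.join(model_dir, rel), "w") as fh:
            fh.write("placeholder")
    archive = os.path.join(work, "model.tar.gz")  # written outside model_dir
    build_model_archive(model_dir, archive)
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())

print(members)
```

Writing the archive outside the directory being packaged is a deliberate choice in this sketch, so the tarball cannot end up containing itself.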

8. Deploy the model on a SageMaker AWS Graviton instance

New generations of CPUs offer a significant performance improvement in ML inference due to specialized built-in instructions. In this use case, we use the SageMaker fully managed hosting infrastructure with AWS Graviton3-based C7g instances. AWS has also measured up to a 50% cost savings for PyTorch inference with AWS Graviton3-based EC2 C7g instances across Torch Hub ResNet50, and multiple Hugging Face models relative to comparable EC2 instances.

To deploy the models to AWS Graviton instances, we use AWS DLCs that provide support for PyTorch 2.0 and TorchServe 0.8.0, or you can bring your own containers that are compatible with the ARMv8.2 architecture.

We use the model we trained and uploaded earlier: s3://<your-s3-bucket>/personal-roberta-base-sentiment.tar.gz. If you haven’t used SageMaker before, review Get Started with Amazon SageMaker.

To start, make sure the SageMaker package is up to date:

#STEP 8.1 - Install SageMaker library
cd $ml_working_dir
$PYTHON_V -m pip install -U sagemaker

For this example, create a file called start_endpoint.py and add the following code. This is the Python script that starts a SageMaker inference endpoint with the model:

#STEP 8.2 - write start_endpoint.py
printf '# Import some needed modules
from sagemaker import get_execution_role, Session, image_uris
from sagemaker.model import Model
import boto3
import os

model_name = "pytorch-roberta-model"

# Setup SageMaker session
region = boto3.Session().region_name
role = os.environ.get("SAGEMAKER_ROLE")
sm_client = boto3.client("sagemaker", region_name=region)
sagemaker_session = Session()
bucket = os.environ.get("S3_BUCKET")

# Select container. In our case, it is the Graviton inference image
container_uri = image_uris.retrieve(
    region="us-west-2",
    framework="pytorch",
    version="2.0.0",
    image_scope="inference_graviton")

# Set model parameters
model = Model(
    image_uri=container_uri,
    model_data=f"s3://{bucket}/personal-roberta-base-sentiment.tar.gz",
    role=role,
    name=model_name,
    sagemaker_session=sagemaker_session
)

# Deploy model
endpoint = model.deploy(
    initial_instance_count=1,
    instance_type="ml.c7g.4xlarge",
    endpoint_name="sm-endpoint-" + model_name
)' > start_endpoint.py

We use the ml.c7g.4xlarge instance type, which is AWS Graviton3-based, and retrieve the PyTorch 2.0 container image with the inference_graviton image scope.

Next, we create the file that runs the prediction. We do these as separate scripts so we can run the predictions as many times as we want. Create predict.py with the following code:

#STEP 8.3 - write predict.py
printf 'import boto3

model_name = "pytorch-roberta-model"
data = "Writing data to analyze sentiments and see how the data is viewed"

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")
endpoint_name="sm-endpoint-" + model_name
print("Calling model:" + endpoint_name)
response = sagemaker_runtime.invoke_endpoint(
EndpointName=endpoint_name,
Body=bytes(data, "utf-8"),
ContentType="application/x-text",
)
print(response["Body"].read().decode("utf-8"))' > predict.py

With the scripts generated, we can now start an endpoint, do predictions against the endpoint, and clean up when we’re done:

#Step 8.4 - Start the SageMaker Inference endpoint
$PYTHON_V start_endpoint.py

#Step 8.5 - Run a prediction; this can be repeated as many times as we like
$PYTHON_V predict.py

#Refer - Prediction Output
Calling model:sm-endpoint-pytorch-roberta-model
{"sentiment": "neutral", "score": 0.9342969059944153}
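
On the client side, the response body is a small JSON document that can be parsed directly. The sketch below parses the sample output shown above and applies an optional confidence cutoff; the min_score parameter and its default are hypothetical and not part of the endpoint contract.

```python
import json

# Response body copied from the sample prediction output above.
raw_body = '{"sentiment": "neutral", "score": 0.9342969059944153}'


def parse_prediction(body, min_score=0.5):
    """Return (label, score); the label is None when the score is below min_score."""
    result = json.loads(body)
    label, score = result["sentiment"], float(result["score"])
    return (label if score >= min_score else None, score)


label, score = parse_prediction(raw_body)
print(label, round(score, 3))
```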

9. Clean up

Lastly, we want to clean up from this example. Create cleanup.py and add the following code:

#STEP 9.1 CleanUp Script
printf 'from boto3 import client

model_name = "pytorch-roberta-model"
endpoint_name="sm-endpoint-" + model_name

sagemaker_client = client("sagemaker", region_name="us-west-2")
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
sagemaker_client.delete_model(ModelName=model_name)' > cleanup.py

#Step 9.2 Cleanup
$PYTHON_V cleanup.py

Conclusion

AWS DLAMIs and DLCs have become the go-to standard for running deep learning workloads on a broad selection of compute and ML services on AWS. Along with using framework-specific DLCs on AWS ML services, you can also use a single framework on Amazon EC2, which removes the heavy lifting necessary for developers to build and maintain deep learning applications. Refer to Release Notes for DLAMI and Available Deep Learning Containers Images to get started.

This post showed one of many possibilities to train and serve your next model on AWS and discussed several formats that you can adopt to meet your business objectives. Give this example a try or use our other AWS ML services to expand the data productivity for your business. We have included a simple sentiment analysis problem so that customers new to ML can understand how simple it is to get started with PyTorch 2.0 on AWS. We will be covering more advanced use cases, models, and AWS technologies in upcoming blog posts.


About the authors

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with the AWS customers to provide guidance and technical assistance helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Mike Schneider is a Systems Developer, based in Phoenix AZ. He is a member of Deep Learning containers, supporting various Framework container images, to include Graviton Inference. He is dedicated to infrastructure efficiency and stability.

Lai Wei is a Senior Software Engineer at Amazon Web Services. He is focusing on building easy to use, high-performance and scalable deep learning frameworks for accelerating distributed model training. Outside of work, he enjoys spending time with his family, hiking, and skiing.

Arrange your transcripts into paragraphs with Amazon Transcribe

Amazon Transcribe is a speech recognition service that generates transcripts from video and audio files in multiple supported languages and accents. It comes with a rich set of features, including automatic language identification, multi-channel and multi-speaker support, custom vocabularies, and transcript redaction.

Amazon Transcribe supports two modes of operation: batch and streaming. In batch mode, a transcription job is created to process files residing in an Amazon Simple Storage Service (Amazon S3) bucket; in streaming mode, the audio source is integrated in real time with Amazon Transcribe through HTTP/2 calls or Web Sockets.

Intro

In this post, we explore how to automatically arrange the generated transcript into paragraphs while in batch mode, increasing the readability of the generated transcript.

Transcription output

Amazon Transcribe uses JSON representation for its output. It provides the transcription result in two different formats: text format and itemized format.

Text format provides the transcript as a single block of text, whereas itemized format provides the transcript as a time-ordered list of transcribed items, along with additional metadata per item. Both formats exist in parallel in the output file.

Depending on the features selected during transcription job creation, Amazon Transcribe creates additional and enriched views of the transcription result. See the following example code:

{
    "jobName": "2x-speakers_2x-channels",
    "accountId": "************",
    "results": {
        "transcripts": [
            {
                "transcript": "Hi, welcome."
            }
        ],
        "speaker_labels": [
            {
                "channel_label": "ch_0",
                "speakers": 2,
                "segments": [
                ]
            },
            {
                "channel_label": "ch_1",
                "speakers": 2,
                "segments": [
                ]
            }
        ],
        "channel_labels": {
            "channels": [
            ],
            "number_of_channels": 2
        },
        "items": [
            
        ],
        "segments": [
        ]
    },
    "status": "COMPLETED"
}

The views are as follows:

  • Transcripts – Represented by the transcripts element, it contains only the text format of the transcript. In multi-speaker, multi-channel scenarios, concatenation of all transcripts is provided as a single block.
  • Speakers – Represented by the speaker_labels element, it contains both the text and itemized formats of the transcript grouped by speaker. It’s available only when the multi-speakers feature is enabled.
  • Channels – Represented by the channel_labels element, it contains both the text and itemized formats of the transcript, grouped by channel. It’s available only when the multi-channels feature is enabled.
  • Items – Represented by the items element, it contains only the itemized format of the transcript. In multi-speaker, multi-channel scenarios, items are enriched with additional properties, indicating speaker and channel.
  • Segments – Represented by the segments element, it contains both the text and itemized formats of the transcript, grouped by alternative transcription. It’s available only when the alternative results feature is enabled.

Transcription metadata in the items view

In the items view, items are provided as a time-ordered list, with every item containing additional metadata:

{
    "results": {
        "items": [
            {
                "channel_label": "ch_0",
                "start_time": "1.509",
                "speaker_label": "spk_0",
                "end_time": "2.21",
                "alternatives": [
                    {
                        "confidence": "0.999",
                        "content": "Hi"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "channel_label": "ch_0",
                "speaker_label": "spk_0",
                "alternatives": [
                    {
                        "confidence": "0.0",
                        "content": ","
                    }
                ],
                "type": "punctuation"
            },
            {
                "channel_label": "ch_0",
                "start_time": "2.22",
                "speaker_label": "spk_0",
                "end_time": "2.9",
                "alternatives": [
                    {
                        "confidence": "0.999",
                        "content": "welcome"
                    }
                ],
                "type": "pronunciation"
            },
            {
                "channel_label": "ch_0",
                "speaker_label": "spk_0",
                "alternatives": [
                    {
                        "confidence": "0.0",
                        "content": "."
                    }
                ],
                "type": "punctuation"
            }
        ]
    }
}

The metadata is as follows:

  • Type – The type value indicates whether the item is punctuation or a pronunciation. Examples of supported punctuation are the comma, full stop, and question mark.
  • Alternatives – An array of objects containing the actual transcription along with its confidence level, ordered by confidence level. When the alternative results feature is not enabled, this list always has exactly one item.
    • Confidence – An indication of how confident Amazon Transcribe is about the correctness of transcription. It uses values from 0–1, with 1 indicating 100% confidence.
    • Content – The transcribed word.
  • Start time – A time pointer of the audio or video file indicating the start of the item in ss.SSS format.
  • End time – A time pointer of the audio or video file indicating the end of the item in ss.SSS format.
  • Channel label – The channel identifier, which is present in the item only when the channel identification feature was enabled in the job configuration.
  • Speaker label – The speaker identifier, which is present in the item only when the speaker partitioning feature was enabled in the job configuration.
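
To show how these fields fit together, the following minimal sketch rebuilds the text "Hi, welcome." from the items in the example above, using only the type field and the content of the first alternative. The helper name is illustrative, and the joining rule (a space before words, none before punctuation) is an assumption that suits this example.

```python
# Items copied from the example above, trimmed to the fields used here.
items = [
    {"type": "pronunciation", "start_time": "1.509", "end_time": "2.21",
     "alternatives": [{"confidence": "0.999", "content": "Hi"}]},
    {"type": "punctuation",
     "alternatives": [{"confidence": "0.0", "content": ","}]},
    {"type": "pronunciation", "start_time": "2.22", "end_time": "2.9",
     "alternatives": [{"confidence": "0.999", "content": "welcome"}]},
    {"type": "punctuation",
     "alternatives": [{"confidence": "0.0", "content": "."}]},
]


def items_to_text(items):
    """Join item contents: words are space-separated, punctuation attaches to the previous word."""
    text = ""
    for item in items:
        content = item["alternatives"][0]["content"]  # highest-confidence alternative
        if item["type"] == "punctuation" or not text:
            text += content
        else:
            text += " " + content
    return text


transcript = items_to_text(items)
print(transcript)
```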

Identifying paragraphs

Identification of paragraphs relies on metadata information in the items view. In particular, we utilize start and end time information along with transcription type and content to identify sentences and then decide which sentences are the best candidates for paragraph entry points.

A sentence is considered to be the list of transcription items that exist between two punctuation items that indicate a full stop. Exceptions to this are the start and end of the transcript, which are sentence boundaries by default. The following figure shows an example of these items.

Sentence

Sentence identification is straightforward with Amazon Transcribe because punctuation is an out-of-the-box feature that covers the comma, full stop, and question mark. In this approach, we use the full stop as the sentence boundary.
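
The sentence rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the post's implementation: the word and punct helpers and the sample items are hypothetical, and only the full stop closes a sentence, with the end of the transcript treated as an implicit boundary.

```python
def split_sentences(items):
    """Group items into sentences, closing a sentence at every full stop."""
    sentences, current = [], []
    for item in items:
        current.append(item)
        is_full_stop = (item["type"] == "punctuation"
                        and item["alternatives"][0]["content"] == ".")
        if is_full_stop:
            sentences.append(current)
            current = []
    if current:  # the end of the transcript is a sentence boundary by default
        sentences.append(current)
    return sentences


def word(content, start, end):
    """Hypothetical helper that builds a pronunciation item."""
    return {"type": "pronunciation", "start_time": start, "end_time": end,
            "alternatives": [{"content": content}]}


def punct(content):
    """Hypothetical helper that builds a punctuation item."""
    return {"type": "punctuation", "alternatives": [{"content": content}]}


# Two sentences; the second has no closing full stop.
items = [word("Hi", "1.509", "2.21"), punct(","),
         word("welcome", "2.22", "2.9"), punct("."),
         word("Thanks", "4.10", "4.60")]
sentences = split_sentences(items)
print([len(s) for s in sentences])
```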

Not every sentence should be a paragraph point. To identify paragraphs, we introduce a new insight at the sentence level called a start delay, as illustrated in the following figure. We use a start delay to define the time delay the speaker introduces to the pronunciation of the current sentence in comparison to the previous one.

Start Delay

Calculation of the start delay requires the start time of the current sentence and end time of the previous one per speaker. Because Amazon Transcribe provides start and end times per item, the calculation requires the usage of the first and last items of the current and previous sentences, respectively.
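
A minimal sketch of the start delay calculation follows. It assumes each sentence has already been reduced to a speaker label plus the start time of its first item and the end time of its last item; the dictionary keys and sample values are hypothetical. Treating a speaker's first sentence as having a delay of 0.0 is also an assumption of this sketch.

```python
def start_delays(sentences):
    """Start delay per sentence: its start minus the end of the same speaker's previous sentence."""
    last_end = {}  # speaker -> end time of that speaker's previous sentence
    delays = []
    for sentence in sentences:
        previous_end = last_end.get(sentence["speaker"])
        if previous_end is None:
            delays.append(0.0)  # assumption: no delay for a speaker's first sentence
        else:
            delays.append(round(sentence["start"] - previous_end, 3))
        last_end[sentence["speaker"]] = sentence["end"]
    return delays


# Hypothetical sentences; times are in seconds.
sentences = [
    {"speaker": "spk_0", "start": 1.5, "end": 2.9},
    {"speaker": "spk_0", "start": 3.1, "end": 6.0},
    {"speaker": "spk_1", "start": 6.5, "end": 8.0},
    {"speaker": "spk_0", "start": 9.0, "end": 11.0},
]
delays = start_delays(sentences)
print(delays)
```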

Knowing the start delays of every sentence, we can apply statistical analysis and figure out the significance of every delay in comparison to the total population of delays. In our context, significant delays are those that are over the population’s typical duration. The following graph shows an example.

Start Delay Box Plot

For this concept, we decide to accept the sentences with start delays greater than the mean value as significant, and introduce a paragraph point at the beginning of every such sentence. Apart from the mean value, there are other options, like accepting all start delays greater than the median, or third quantile or upper fence value of the population.
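The mean-based selection can be sketched in a few lines. The example values are made up, and treating the first sentence as a paragraph start regardless of its delay is an assumption of this sketch.

```python
from statistics import mean


def paragraph_starts(delays):
    """Indices of sentences that open a paragraph: the first one, plus any above-mean delay."""
    threshold = mean(delays)
    return [i for i, delay in enumerate(delays) if i == 0 or delay > threshold]


# Hypothetical start delays in seconds, one per sentence.
delays = [0.0, 0.3, 1.6, 0.2, 0.4, 2.1, 0.3]
starts = paragraph_starts(delays)
print(starts)
```

Swapping mean for the median or another quantile only changes how threshold is computed.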

We add one more step to the paragraph identification process that takes into account the number of words in each paragraph. When a paragraph contains a significant number of words, we run a split operation, which adds one more paragraph to the final result.

In the context of word counts, we define as significant the word counts that exceed the upper fence value. We make this decision deliberately, so that we restrict split operations to the paragraphs that truly behave as outliers in our results. The following graph shows an example.

Word Count Box Plot
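The post does not spell out the upper fence formula. A minimal sketch, assuming the usual box plot definition (third quartile plus 1.5 times the interquartile range) and made-up word counts:

```python
from statistics import quantiles


def upper_fence(values):
    """Upper fence of a box plot: Q3 + 1.5 * IQR (assumed definition)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


# Hypothetical word counts, one per paragraph.
word_counts = [40, 55, 60, 48, 52, 45, 210]
fence = upper_fence(word_counts)
oversized = [i for i, count in enumerate(word_counts) if count > fence]
print(fence, oversized)
```

Different quartile methods give slightly different fences on small samples, so the exact boundary depends on that choice.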

The split operation selects the new paragraph entry point by considering the maximum sentence start delay insight. This way, the new paragraph is introduced at the sentence that exhibits the max start delay inside the current paragraph. Splits can be repeated until no word count exceeds the selected boundary, in our case the upper fence value. The following figure shows an example.

Paragraph Split
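The repeated split can be sketched recursively. Here each sentence is a dictionary with a word count and a start delay; the field names and the fixed max_words boundary are assumptions made for illustration.

```python
def split_paragraph(paragraph, max_words):
    """Split a paragraph (a list of sentences) at its largest start delay until it fits."""
    total = sum(sentence["words"] for sentence in paragraph)
    if total <= max_words or len(paragraph) < 2:
        return [paragraph]
    # The first sentence already opens the paragraph, so search after it.
    split_at = max(range(1, len(paragraph)), key=lambda i: paragraph[i]["delay"])
    return (split_paragraph(paragraph[:split_at], max_words)
            + split_paragraph(paragraph[split_at:], max_words))


paragraph = [
    {"words": 30, "delay": 2.0},
    {"words": 25, "delay": 0.2},
    {"words": 35, "delay": 0.9},
    {"words": 20, "delay": 0.4},
]
parts = split_paragraph(paragraph, max_words=74)
sizes = [sum(sentence["words"] for sentence in part) for part in parts]
print(sizes)
```

A paragraph consisting of a single long sentence is returned unchanged, because there is no sentence boundary at which to split it.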

Conclusion

In this post, we presented a concept to automatically introduce paragraphs to your transcripts, without manual intervention, based on the metadata Amazon Transcribe provides along with the actual transcript.

This concept is not language or accent specific, because it relies on non-linguistic metadata to suggest paragraph entry points. Future variations can include grammatical or semantic information on a per-language case, further enhancing the paragraph identification logic.

If you have feedback about this post, submit your comments in the comments section. We look forward to hearing from you. Check out Amazon Transcribe Features for additional features that will help you get the most value out of your transcripts.


About the Authors

Kostas Tzouvanas is an Enterprise Solution Architect at Amazon Web Services. He helps customers architect cloud-based solutions to achieve their business potential. His main focus is trading platforms and high performance computing systems. He is also passionate about genomics and bioinformatics.

Pavlos Kaimakis is an Enterprise Solutions Architect looking after Enterprise customers in GR/CY/MT supporting them with his experience to design and implement solutions that drive value to them. Pavlos has spent the biggest amount of time in his career in the product and customer support sector – both from an engineering and a management perspective. Pavlos loves travelling and he’s always up for exploring new places in the world.

Build machine learning-ready datasets from the Amazon SageMaker offline Feature Store using the Amazon SageMaker Python SDK

Amazon SageMaker Feature Store is a purpose-built service to store and retrieve feature data for use by machine learning (ML) models. Feature Store provides an online store capable of low-latency, high-throughput reads and writes, and an offline store that provides bulk access to all historical record data. Feature Store handles the synchronization of data between the online and offline stores.

Because model development is an iterative process, customers will frequently query the offline store and build various datasets for model training. Currently, there are several ways to access features in the offline store, including running SQL queries with Amazon Athena or using Spark SQL in Apache Spark. However, these patterns require writing ad hoc (and sometimes complex) SQL statements, which isn’t always suitable for the data scientist persona.

Feature Store recently extended the SageMaker Python SDK to make it easier to create datasets from the offline store. With this release, you can use a new set of methods in the SDK to create datasets without writing SQL queries. These new methods support common operations such as time travel, filtering duplicate records, and joining multiple feature groups while ensuring point-in-time accuracy.

In this post, we demonstrate how to use the SageMaker Python SDK to build ML-ready datasets without writing any SQL statements.

Solution overview

To demonstrate the new functionality, we work with two datasets: leads and web marketing metrics. These datasets can be used to build a model that predicts if a lead will convert into a sale given marketing activities and metrics captured for that lead.

The leads data contains information on prospective customers who are identified using Lead_ProspectID. The features for a lead (for example, LeadSource) can be updated over time, which results in a new record for that lead. The Lead_EventTime represents the time in which each record is created. The following screenshot shows an example of this data.

The web marketing metrics data tracks the engagement metrics for a lead, where each lead is identified using the Web_ProspectID. The Web_EventTime represents the time in which the record was created. Unlike the leads feature group, there is only one record per lead in this feature group. The following screenshot shows an example of this data.

We walk through the key parts of the sagemaker-feature-store-offline-sdk.ipynb notebook, which demonstrates the following steps:

  1. Create a dataset from a feature group.
  2. Join multiple feature groups.
  3. Create a point-in-time join between a feature group and a dataset based on a set of events at specific timestamps.
  4. Retrieve feature history within a specific time range.
  5. Retrieve features as of a specific timestamp.

Prerequisites

You need the following prerequisites:

git clone https://github.com/aws-samples/amazon-sagemaker-feature-store-offline-queries.git

We assume a feature group for the leads data has been created using the existing FeatureGroup.create method, and can be referenced using the variable base_fg. For more information on feature groups, refer to Create Feature Groups.

Create a dataset from a feature group

To create a dataset using the SageMaker SDK, we use the new FeatureStore class, which contains the create_dataset method. This method accepts a base feature group that may be joined with other feature groups or DataFrames. We start by providing the leads feature group as the base and an Amazon Simple Storage Service (Amazon S3) path to store the dataset:

from sagemaker.feature_store.feature_store import FeatureStore

feature_store = FeatureStore(sagemaker_session=feature_store_session)
ds1_builder = feature_store.create_dataset(
    base=base_fg,
    output_path=f"s3://{s3_bucket_name}/dataset_query_results",
)

The create_dataset method returns a DatasetBuilder object, which can be used to generate a dataset from one or multiple feature groups (which we demonstrate in the next section). To create a simple dataset consisting of only the leads features, we invoke the to_csv_file method. This runs a query in Athena to retrieve the features from the offline store, and saves the results to the specified S3 path.

csv, query = ds1_builder.to_csv_file()
# Show S3 location of CSV file
print(f'CSV file: {csv}')

Join multiple feature groups

With the SageMaker SDK, you can easily join multiple feature groups to build a dataset. You can also perform join operations between an existing Pandas DataFrame and one or more feature groups. The base feature group is an important concept for joins: it is the feature group to which other feature groups or the Pandas DataFrame are joined.

While creating the dataset using the create_dataset function, we use the with_feature_group method, which performs an inner join between the base feature group and another feature group using the record identifier and the target feature name in the base feature group. In our example, the base feature group is the leads feature group, and the target feature group is the web marketing feature group. The with_feature_group method accepts the following arguments:

  • feature_group – This is the feature group we are joining with. In our code sample, the target feature group is created by using the web marketing dataset.
  • target_feature_name_in_base – The name of the feature in the base feature group that we’re using as a key in the join. We use Lead_ProspectID as the record identifier for the base feature group.
  • included_feature_names – This is the list of feature names from the feature group we are joining with. We use this field to specify the features that we want to include in the dataset.

The following code shows an example of creating a dataset by joining the base feature group with the target feature group:

join_builder = feature_store.create_dataset(base=base_fg, 
output_path=f"s3://{s3_bucket_name}/dataset_query_results").with_feature_group(
feature_group=target_fg,
target_feature_name_in_base="Lead_ProspectID",
included_feature_names=["Web_ProspectID",
"LastCampaignActivity","PageViewsPerVisit",
"TotalTimeOnWebsite","TotalWebVisits",
"AttendedMarketingEvent","OrganicSearch",
"ViewedAdvertisement",],)
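To make the join semantics concrete, the following plain-Python sketch mimics an inner join on the record identifier. It is only a conceptual illustration with made-up rows, not how the SDK or Athena runs the query, and it assumes one target record per lead, as in the web marketing data.

```python
def inner_join(base_rows, target_rows, base_key, target_key, included):
    """Keep base rows that have a match in the target and add the selected target features."""
    target_by_key = {row[target_key]: row for row in target_rows}
    joined = []
    for row in base_rows:
        match = target_by_key.get(row[base_key])
        if match is not None:  # inner join: base rows without a match are dropped
            joined.append({**row, **{name: match[name] for name in included}})
    return joined


# Hypothetical rows standing in for the leads and web marketing feature groups.
leads = [
    {"Lead_ProspectID": "a1", "LeadSource": "Referral"},
    {"Lead_ProspectID": "b2", "LeadSource": "Web"},
]
web = [{"Web_ProspectID": "a1", "TotalWebVisits": 7}]
rows = inner_join(leads, web, "Lead_ProspectID", "Web_ProspectID",
                  ["Web_ProspectID", "TotalWebVisits"])
print(rows)
```

Lead b2 is absent from the result because it has no matching web marketing record.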

You can extend the join operations to include multiple feature groups by adding the with_feature_group method at the end of the preceding code example and defining the required arguments for the new feature group. You can also perform join operations with an existing DataFrame by defining the base to be your existing Pandas DataFrame and joining it with the feature groups of interest. The following code sample shows how to create a dataset with an existing Pandas DataFrame and an existing feature group:

ds2_builder = feature_store.create_dataset(
base=new_records_df2, # Pandas DataFrame
event_time_identifier_feature_name="Lead_EventTime",
record_identifier_feature_name="Lead_ProspectID",
output_path=f"s3://{s3_bucket_name}/dataset_query_results",).with_feature_group(
base_fg, "Lead_ProspectID", ["LeadSource"])

For more examples on these various configurations, refer to Create a Dataset from your Feature Groups.

Create a point-in-time join

One of the most powerful capabilities of this enhancement is the ability to perform point-in-time joins simply, without writing complex SQL code. When building ML models, data scientists need to avoid data leakage or target leakage, which is accidentally using data during model training that wouldn’t be available at the time of prediction. For instance, if we’re trying to predict credit card fraud, we should exclude transactions that arrive after the fraudulent charge we’re trying to predict. Otherwise, the model could learn from this post-fraud information during training and generalize less well.

Retrieval of point-in-time accurate feature data requires you to supply an entity DataFrame that provides a set of record IDs (or primary keys) and corresponding event times that serve as the cutoff time for the event. This retrieval mechanism is sometimes referred to as row-level time travel, because it allows a different time constraint to be applied for each row key. To perform point-in-time joins with the SageMaker SDK, we use the DatasetBuilder class and provide the entity DataFrame as the base argument to create_dataset.

In the following code, we create a simple entity DataFrame with two records. We set the event times, used to indicate the cutoff time, near the middle of the time series data (mid-January 2023):

# Create Events (entity table) dataframe to pass Timestamp for Point-in-Time Join
events = [['2023-01-20T00:00:00Z', record_id1],
['2023-01-15T00:00:00Z', record_id2]]
df_events = pd.DataFrame(events, columns=['Event_Time', 'Lead_ProspectID'])

When we use the point_in_time_accurate_join functionality with the create_dataset call, the internal query excludes all records with timestamps later than the cutoff times supplied, returning the latest feature values that would have been available at the time of the event:

# Create Dataset Builder using point-in-time-accurate-join function

pit_builder = feature_store.create_dataset(
base=df_events,
event_time_identifier_feature_name='Event_Time',
record_identifier_feature_name='Lead_ProspectID',
output_path=f"s3://{s3_bucket_name}/{s3_prefix}/dataset_query_results"
).with_feature_group(base_fg, "Lead_ProspectID"
).point_in_time_accurate_join(
).with_number_of_recent_records_by_record_identifier(1)

Notice that there are only two records in the DataFrame returned by the point-in-time join. This is because we submitted only two record IDs in the entity DataFrame, one for each Lead_ProspectID we want to retrieve. The point-in-time criterion specifies that a record’s event time (stored in the Lead_EventTime field) must contain a value that is less than the cutoff time.

Additionally, we instruct the query to retrieve only the latest record that meets this criterion because we have applied the with_number_of_recent_records_by_record_identifier method. When used in conjunction with the point_in_time_accurate_join method, this allows the caller to specify how many records to return from those that meet the point-in-time join criterion.
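The combined effect of the cutoff and the single most recent record can be illustrated with a small plain-Python sketch. The records and IDs are made up, the code is not the query the SDK generates, and treating a record exactly at the cutoff as eligible is an assumption of this sketch.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"


def event_dt(record):
    return datetime.strptime(record["Lead_EventTime"], FMT)


def point_in_time_latest(records, events):
    """For each (cutoff, record_id) event, return the latest record not later than the cutoff."""
    results = []
    for cutoff, record_id in events:
        cutoff_dt = datetime.strptime(cutoff, FMT)
        eligible = [r for r in records
                    if r["Lead_ProspectID"] == record_id and event_dt(r) <= cutoff_dt]
        if eligible:
            results.append(max(eligible, key=event_dt))
    return results


# Hypothetical feature history for two leads.
records = [
    {"Lead_ProspectID": "a1", "Lead_EventTime": "2023-01-05T00:00:00Z", "LeadSource": "Web"},
    {"Lead_ProspectID": "a1", "Lead_EventTime": "2023-01-18T00:00:00Z", "LeadSource": "Referral"},
    {"Lead_ProspectID": "a1", "Lead_EventTime": "2023-01-25T00:00:00Z", "LeadSource": "Email"},
    {"Lead_ProspectID": "b2", "Lead_EventTime": "2023-01-10T00:00:00Z", "LeadSource": "Ads"},
    {"Lead_ProspectID": "b2", "Lead_EventTime": "2023-01-16T00:00:00Z", "LeadSource": "Web"},
]
events = [("2023-01-20T00:00:00Z", "a1"), ("2023-01-15T00:00:00Z", "b2")]
latest = point_in_time_latest(records, events)
print([r["LeadSource"] for r in latest])
```

Each lead gets its own cutoff, which is why this pattern is described as row-level time travel.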

Compare point-in-time join results with Athena query results

To verify the output returned by the SageMaker SDK point_in_time_accurate_join function, we compare it to the result of an Athena query. First, we create a standard Athena query using a SELECT statement tied to the specific table created by the Feature Store runtime. This table name can be found by referencing the table_name field after instantiating the athena_query from the FeatureGroup API:

SELECT * FROM "sagemaker_featurestore"."off_sdk_fg_lead_1682348629" 
WHERE "off_sdk_fg_lead_1682348629"."Lead_ProspectID" = '5e84c78f-6438-4d91-aa96-b492f7e91029'

The Athena query doesn’t contain any point-in-time join semantics, so it returns all records that match the specified record_id (Lead_ProspectID).

Next, we use the Pandas library to sort the Athena results by event times for easy comparison. The records with timestamps later than the event times specified in the entity DataFrame (for example, 2023-01-15T00:00:00Z) submitted to the point_in_time_accurate_join don’t show up in the point-in-time results. Because we additionally specified that we only want a single record from the preceding create_dataset code, we only get the latest record prior to the cutoff time. By comparing the SageMaker SDK results with the Athena query results, we see that the point-in-time join function returned the proper records.

Therefore, we have confidence that we can use the SageMaker SDK to perform row-level time travel and avoid target leakage. Furthermore, this capability works across multiple feature groups that may be refreshed on completely different timelines.

Retrieve feature history within a specific time range

We also want to demonstrate the use of specifying a time range window when joining the feature groups to form a dataset. The time window is defined using with_event_time_range, which accepts two inputs, starting_timestamp and ending_timestamp, and returns a dataset builder object. In our code sample, we set the retrieval time window for 1 full day from 2022-07-01 00:00:00 until 2022-07-02 00:00:00.

The following code shows how to create a dataset with the specified event time window while joining the base feature group with the target feature group:

from datetime import datetime

# Set up the event time window, in seconds of Unix epoch time
# Start at 07/01/2022 and set the time window to one day
start_ts = 1656633600
time_window = 86400
# Using hard-coded timestamps from dataset, then adding time window
datetime_start = datetime.fromtimestamp(start_ts)
datetime_end = datetime.fromtimestamp(start_ts+time_window)
print(f'Setting retrieval time window: {datetime_start} until {datetime_end}')
time_window_builder = (feature_store.create_dataset(
base=base_fg, output_path=f"s3://{s3_bucket_name}/dataset_query_results").with_feature_group(
feature_group=target_fg,
target_feature_name_in_base="Lead_ProspectID",
included_feature_names=["Web_ProspectID","LastCampaignActivity","PageViewsPerVisit",
"TotalTimeOnWebsite","TotalWebVisits","AttendedMarketingEvent",
"OrganicSearch","ViewedAdvertisement",],)
.with_event_time_range(starting_timestamp=datetime_start,ending_timestamp=datetime_end))

To confirm the effect of with_event_time_range on the size of the dataset, we export it to a Pandas DataFrame with the to_dataframe() method and display the data. Notice how the result set has only a fraction of the original 10,020 records, because it only retrieves records whose event_time is within the 1-day time period.
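One detail worth noting about the window setup: datetime.fromtimestamp without a tz argument converts to the local time zone of the machine, so the printed window depends on where the code runs. The following sketch shows a time zone aware variant and a plain-Python filter over made-up event times. Whether the SDK treats the window boundaries as inclusive is not stated in this post, so the inclusive comparison here is an assumption.

```python
from datetime import datetime, timedelta, timezone

start_ts = 1656633600  # 2022-07-01 00:00:00 UTC
start = datetime.fromtimestamp(start_ts, tz=timezone.utc)
end = start + timedelta(seconds=86400)

# Hypothetical event times around the window boundaries.
event_times = [
    datetime(2022, 6, 30, 23, 59, tzinfo=timezone.utc),
    datetime(2022, 7, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2022, 7, 2, 0, 1, tzinfo=timezone.utc),
]
kept = [t for t in event_times if start <= t <= end]
print(start.isoformat(), end.isoformat(), len(kept))
```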

Retrieve features as of a specific timestamp

The DatasetBuilder as_of method retrieves features from a dataset that meet a timestamp-based constraint, which the caller provides as an argument to the function. This mechanism is useful for scenarios such as rerunning experiments on previously collected data, backtesting time series models, or building a dataset from a previous state of the offline store for data auditing purposes. This functionality is sometimes referred to as time travel because it essentially rolls back the data store to an earlier date and time. This time constraint is also referred to as the cutoff timestamp.

In our sample code, we first create the cutoff timestamp by reading the write_time value for the last record written to the Feature Store, the one written with put_record. Then we provide this cutoff timestamp to the DatasetBuilder as an argument to the as_of method:

# Create dataset using as-of timestamp
print(f'using cut-off time: {asof_cutoff_datetime}')
as_of_builder = feature_store.create_dataset(
base=base_fg,
output_path=f"s3://{s3_bucket_name}/{s3_prefix}/dataset_query_results").with_feature_group(
feature_group=target_fg,
target_feature_name_in_base='Lead_ProspectID',
included_feature_names=['Web_ProspectID','Web_EventTime',
'TotalWebVisits']).as_of(asof_cutoff_datetime)

It’s important to note that the as_of method applies the time constraint to the internal write_time field, which is automatically generated by Feature Store. The write_time field represents the actual timestamp when the record is written to the data store. This is different from other methods like point_in_time_accurate_join and with_event_time_range, which use the client-provided event_time field as a comparator.
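The difference between the two comparators shows up with late-arriving data. The following sketch uses made-up records and plain Python filters rather than the SDK, and the field names id, event_time, and write_time are simplified stand-ins.

```python
def as_of(records, cutoff):
    """Time travel: keep records written to the store no later than the cutoff."""
    return [r for r in records if r["write_time"] <= cutoff]


def event_time_range(records, start, end):
    """Keep records whose client-provided event time falls inside the window."""
    return [r for r in records if start <= r["event_time"] <= end]


# ISO 8601 UTC strings in one fixed format compare correctly as plain strings.
records = [
    {"id": "a1", "event_time": "2023-01-05T00:00:00Z", "write_time": "2023-02-01T10:00:00Z"},
    {"id": "a1", "event_time": "2023-01-18T00:00:00Z", "write_time": "2023-02-01T10:00:05Z"},
    # A backfilled record: old event_time, recent write_time.
    {"id": "b2", "event_time": "2023-01-10T00:00:00Z", "write_time": "2023-03-15T08:00:00Z"},
]
written_by_feb = as_of(records, "2023-02-15T00:00:00Z")
january_events = event_time_range(records, "2023-01-01T00:00:00Z", "2023-01-31T23:59:59Z")
print(len(written_by_feb), len(january_events))
```

The backfilled record counts as a January event but is invisible to an as-of query dated mid-February, because it had not yet been written at that time.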

Clean up

Be sure to delete all the resources created as part of this example to avoid incurring ongoing charges. This includes the feature groups and the S3 bucket containing the offline store data.

SageMaker Python SDK experience vs. writing SQL

The new methods in the SageMaker Python SDK allow you to create datasets quickly and move on to the training step of the ML lifecycle. To show the time and effort that can be saved, let’s examine a use case where we need to join two feature groups while retrieving the features within a specified time frame. The following figure compares the Python queries on the offline Feature Store with the SQL used to create the dataset behind a Python query.

As you can see, the same operation of joining two feature groups requires you to create a long, complex SQL query, whereas it can be accomplished using just the with_feature_group and with_event_time_range methods in the SageMaker Python SDK.

Conclusion

The new offline store methods in the SageMaker Python SDK allow you to query your offline features without having to write complex SQL statements. This provides a seamless experience for customers who are accustomed to writing Python code during model development. For more information about feature groups, refer to Create a Dataset From Your Feature Groups and Feature Store APIs: Feature Group.

The full example in this post can be found in the GitHub repository. Give it a try and let us know your feedback in the comments.


About the Authors

Paul Hargis has focused his efforts on machine learning at several companies, including AWS, Amazon, and Hortonworks. He enjoys building technology solutions and teaching people how to leverage them. Paul likes to help customers expand their machine learning initiatives to solve real-world problems. Prior to his role at AWS, he was lead architect for Amazon Exports and Expansions, helping amazon.com improve the experience for international shoppers.

Mecit Gungor is an AI/ML Specialist Solution Architect at AWS helping customers design and build AI/ML solutions at scale. He covers a wide range of AI/ML use cases for Telecommunication customers and currently focuses on Generative AI, LLMs, and training and inference optimization. He can often be found hiking in the wilderness or playing board games with his friends in his free time.

Tony Chen is a Machine Learning Solutions Architect at AWS, helping customers design scalable and robust machine learning capabilities in the cloud. As a former data scientist and data engineer, he leverages his experience to help tackle some of the most challenging problems organizations face with operationalizing machine learning.

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience in end-to-end designs and solutions for machine learning; business analytics within financial, operational, and marketing analytics; healthcare; supply chain; and IoT. Outside work, Sovik enjoys traveling and watching movies.

Use Amazon SageMaker Canvas to build machine learning models using Parquet data from Amazon Athena and AWS Lake Formation

Data is the foundation for machine learning (ML) algorithms. One of the most common formats for storing large amounts of data is Apache Parquet due to its compact and highly efficient format. This means that business analysts who want to extract insights from the large volumes of data in their data warehouse must frequently use data stored in Parquet.

To simplify access to Parquet files, Amazon SageMaker Canvas has added data import capabilities from over 40 data sources, including Amazon Athena, which supports Apache Parquet.

Canvas provides connectors to AWS data sources such as Amazon Simple Storage Service (Amazon S3), Athena, and Amazon Redshift. In this post, we describe how to query Parquet files with Athena using AWS Lake Formation and use the output in Canvas to train a model.

Solution overview

Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open table and file formats. Many teams are turning to Athena to enable interactive querying and analyze their data in the respective data stores without creating multiple data copies.

Athena allows applications to use standard SQL to query massive amounts of data on an S3 data lake. Athena supports various data formats, including:

  • CSV
  • TSV
  • JSON
  • text files
  • Open-source columnar formats, such as ORC and Parquet
  • Compressed data in Snappy, Zlib, LZO, and GZIP formats

Parquet files organize the data into columns and use efficient data compression and encoding schemes for fast data storage and retrieval. You can reduce the import time in Canvas by using Parquet files for bulk data imports and with specific columns.

Lake Formation is an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and ML. Lake Formation automatically manages access to the registered data in Amazon S3 through services including AWS Glue, Athena, Amazon Redshift, Amazon QuickSight, and Amazon EMR using Zeppelin notebooks with Apache Spark to ensure compliance with your defined policies.

In this post, we show you how to import Parquet data to Canvas from Athena, where Lake Formation enables data governance.

To illustrate, we use the operations data of a consumer electronics business. We create a model to estimate the demand for electronic products using their historical time series data.

This solution is illustrated in four steps:

  1. Set up the Lake Formation database.
  2. Grant Lake Formation access permissions to Canvas.
  3. Import the Parquet data to Canvas using Athena.
  4. Use the imported Parquet data to build ML models with Canvas.

The following diagram illustrates the solution architecture.

Set up the Lake Formation database

The steps listed here form a one-time setup of the data lake hosting the Parquet data, which your analysts can then consume to gain insights using Canvas. These prerequisites are best performed by cloud engineers or administrators. Analysts can go directly to Canvas and import the data from Athena.

The data used in this post consist of two datasets sourced from Amazon S3. These datasets have been generated synthetically for this post.

  • Consumer Electronics Target Time Series (TTS) – The historical data of the quantity to forecast is called the Target Time Series (TTS). In this case, it’s the demand for an item.
  • Consumer Electronics Related Time Series (RTS) – Other historical data that is known at exactly the same time as every sales transaction is called the Related Time Series (RTS). In our use case, it’s the price of an item. An RTS dataset includes time series data that isn’t included in a TTS dataset and might improve the accuracy of your predictor.

  1. Upload data to Amazon S3 as Parquet files from these two folders:
    1. ce-rts – Contains Consumer Electronics Related Time Series (RTS).
    2. ce-tts – Contains Consumer Electronics Target Time Series (TTS).
  2. Create a data lake with Lake Formation.
  3. On the Lake Formation console, create a database called consumer-electronics.
  4. Create two tables for the consumer electronics dataset with the names ce-rts-Parquet and ce-tts-Parquet with the data sourced from the S3 bucket.

We use the database we created in this step in a later step to import the Parquet data into Canvas using Athena.

Grant Lake Formation access permissions to Canvas

This is a one-time setup to be done by either cloud engineers or administrators.

  1. Grant data lake permissions so that Canvas can access the consumer-electronics Parquet data.
  2. In the SageMaker Studio domain, view the Canvas user’s details.
  3. Copy the execution role name.
  4. Make sure the execution role has enough permissions to access the following services:
    • Canvas.
    • The S3 bucket where Parquet data is stored.
    • Athena to connect from Canvas.
    • AWS Glue to access the Parquet data using the Athena connector.
  5. In Lake Formation, choose Data Lake permissions in the navigation pane.
  6. Choose Grant.
  7. For Principals, select IAM users and roles to provide Canvas access to your data artifacts.
  8. Specify your SageMaker Studio domain user’s execution role.
  9. Specify the database and tables.
  10. Choose Grant.

You can grant granular actions on the tables, columns, and data. This option provides granular access configuration of your sensitive data by the segregation of roles you have defined.
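If you prefer to script the grant instead of using the console, the same permissions can be expressed as a request to the Lake Formation GrantPermissions API. The following is a minimal sketch: the role ARN, the choice of SELECT and DESCRIBE, and the helper name build_grant_request are assumptions for illustration, and the table names are taken from the setup above and must match the names registered in your catalog.

```python
def build_grant_request(role_arn, database, table, permissions=("SELECT", "DESCRIBE")):
    """Build keyword arguments for the Lake Formation GrantPermissions API."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


# Hypothetical account ID and role name; use the Canvas user's execution role.
role_arn = "arn:aws:iam::111122223333:role/canvas-execution-role"
grant_requests = [
    build_grant_request(role_arn, "consumer-electronics", table)
    for table in ("ce-rts-Parquet", "ce-tts-Parquet")
]

# To apply the grants (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("lakeformation")
#   for request in grant_requests:
#       client.grant_permissions(**request)
print(grant_requests[0]["Resource"])
```

Keeping the request construction separate from the API call makes it easy to review the exact permissions before anything is applied.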

After you set up the required environment for the Canvas and Athena integration, proceed to the next step to import the data into Canvas using Athena.

Import data using Athena

Complete the following steps to import the Lake Formation-managed Parquet files:

  1. In Canvas, choose Datasets in the navigation pane.
  2. Choose + Import to import the Parquet datasets managed by Lake Formation.
  3. Choose Athena as the data source.
  4. Choose the consumer-electronics dataset in Parquet format from the Athena data catalog and table details in the menu.
  5. Import the two datasets. Drag and drop the data source to select the first one.

When you drag and drop the dataset, the data preview appears in the bottom frame of the page.

  6. Choose Import data.
  7. Enter consumer-electronics-rts as the name for the dataset you’re importing.

Data import takes time based on the data size. The dataset in this example is small, so the import takes a few seconds. When the data import is completed, the status turns from Processing to Ready.

  8. Repeat the import process for the second dataset (ce-tts).

When the ce-tts Parquet data is imported, the Datasets page shows both datasets.

The imported datasets contain target and related time series data. The RTS dataset can help deep learning models improve forecast accuracy.

Let’s join the datasets to prepare for our analysis.

  1. Select the datasets.
  2. Choose Join data.
  3. Select and drag both the datasets to the center pane, which applies an inner join.
  4. Choose the Join icon to see the join conditions applied and to make sure the inner join is applied and the right columns are joined.
  5. Choose Save & close to apply the join condition.
  6. Provide a name for the joined dataset.
  7. Choose Import data.

Joined data is imported and created as a new dataset. The joined dataset source is shown as Join.

Use the Parquet data to build ML models with Canvas

The Parquet data from Lake Formation is now available on Canvas. Now you can run your ML analysis on the data.

  1. After successfully importing the data, choose Create a custom model in Ready-to-use models from Canvas.
  2. Enter a name for the model.
  3. Select your problem type (for this post, Predictive analysis).
  4. Choose Create.
  5. Select the consumer-electronic-joined dataset to train the model to predict the demand for electronic items.
  6. Select demand as the target column to forecast demand for consumer electronic items.

Based on the data provided to Canvas, the Model type is automatically derived as Time series forecasting and provides a Configure time series model option.

  7. Choose the Configure time series model link to provide time series model options.
  8. Enter forecasting configurations as shown in the following screenshot.
  9. Exclude the group column because no logical grouping is applied to the dataset.

For building the model, Canvas offers two build options. Choose the option as per your preference. Quick build generally takes around 15–20 minutes, whereas Standard takes around 4 hours.

    • Quick build – Builds a model in a fraction of the time compared to a standard build; potential accuracy is exchanged for speed
    • Standard build – Builds the best model from an optimized process powered by AutoML; speed is exchanged for greatest accuracy

  10. For this post, we choose Quick build for illustrative purposes.

When the quick build is completed, the model evaluation metrics are presented in the Analyze section.

  11. Choose Predict to run a single prediction or batch prediction.

Clean up

Log out from Canvas to avoid future charges.

Conclusion

Enterprises have data in data lakes in various formats, including the highly efficient Parquet format. Canvas supports more than 40 data sources, including Athena, from which you can easily pull data in various formats from data lakes. To learn more, refer to Import data from over 40 data sources for no-code machine learning with Amazon SageMaker Canvas.

In this post, we took Lake Formation-managed Parquet files and imported them into Canvas using Athena. The Canvas ML model forecasted the demand of consumer electronics using historical demand and price data. Thanks to a user-friendly interface and vivid visualizations, we completed this without writing a single line of code. Canvas now allows business analysts to use Parquet files from data engineering teams and build ML models, conduct analysis, and extract insights independently of data science teams.

To learn more about Canvas, refer to Predict types of machine failures with no-code machine learning using Canvas. Refer to Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts for more information on creating ML models with a no-code solution.


About the authors

Gopi Mudiyala is a Senior Technical Account Manager at AWS. He helps customers in the Financial Services industry with their operations in AWS. As a machine learning enthusiast, Gopi works to help customers succeed in their ML journey. In his spare time, he likes to play badminton, spend time with family, and travel.

Hariharan Suresh is a Senior Solutions Architect at AWS. He is passionate about databases, machine learning, and designing innovative solutions. Prior to joining AWS, Hariharan was a product architect, core banking implementation specialist, and developer, and worked with BFSI organizations for over 11 years. Outside of technology, he enjoys paragliding and cycling.

Amazon SageMaker Automatic Model Tuning now automatically chooses tuning configurations to improve usability and cost efficiency

Amazon SageMaker Automatic Model Tuning has introduced Autotune, a new feature to automatically choose hyperparameters on your behalf. This provides an accelerated and more efficient way to find hyperparameter ranges, and can provide significant optimized budget and time management for your automatic model tuning jobs.

In this post, we discuss this new capability and some of the benefits it brings.

Hyperparameter overview

When training any machine learning (ML) model, you are generally dealing with three types of data: input data (also called the training data), model parameters, and hyperparameters. You use the input data to train your model, which in effect learns your model parameters. During the training process, your ML algorithms are trying to find the optimal model parameters based on data while meeting the goals of your objective function. For example, when a neural network is trained, the weight of the network nodes is learned from the training, and indicates how much impact it has on the final prediction. These weights are the model parameters.

Hyperparameters, on the other hand, are parameters of a learning algorithm and not the model itself. The number of hidden layers and the number of nodes are some of the examples of hyperparameters you can set for a neural network. The difference between model parameters and hyperparameters is that model parameters are learned during the training process, whereas hyperparameters are set prior to the training and remain constant during the training process.
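To make the distinction concrete, the following is a minimal sketch in plain Python (not SageMaker code) that fits a one-weight linear model with gradient descent. The function name train and the toy data are illustrative assumptions. Here, learning_rate and epochs are hyperparameters fixed before training starts, whereas weight is a model parameter learned from the data:

```python
def train(xs, ys, learning_rate, epochs):
    """Fit y = weight * x by gradient descent on mean squared error."""
    weight = 0.0  # model parameter: learned during training
    n = len(xs)
    for _ in range(epochs):
        # Gradient of mean((weight * x - y) ** 2) with respect to weight
        grad = sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / n
        weight -= learning_rate * grad
    return weight


# Toy data generated with a true weight of 2.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# learning_rate and epochs are hyperparameters: chosen up front, constant during training
learned_weight = train(xs, ys, learning_rate=0.05, epochs=200)
print(round(learned_weight, 3))
```

Changing learning_rate or epochs changes how (and whether) the weight converges, which is why these values are the ones a tuning job searches over.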

Pain points

SageMaker automatic model tuning, also called hyperparameter tuning, runs many training jobs on your dataset using a range of hyperparameters that you specify. It can accelerate your productivity by trying many variations of a model. It looks for the best model automatically by focusing on the most promising combinations of hyperparameter values within the ranges that you specify. However, to get good results, you must choose the right ranges to explore.

But how do you know what the right range is to begin with? With hyperparameter tuning jobs, we are assuming that the optimal set of hyperparameters lies within the range that we specified. What happens if the chosen range is not right, and the optimal hyperparameter actually falls outside of the range?
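The effect of a poorly chosen range can be illustrated with a small sketch. It uses a made-up objective function whose minimum sits at lr = 0.5 (an assumption for illustration only) and a simple random search, which is a stand-in and not the SageMaker tuning algorithm. When the search range stops at 0.1, the search can never do better than the edge of the range:

```python
import random


def validation_loss(lr):
    """Toy objective with its minimum at lr = 0.5 (illustrative assumption)."""
    return (lr - 0.5) ** 2


def random_search(low, high, trials, seed=0):
    """Sample `trials` values uniformly from [low, high] and return the best one."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(trials)]
    return min(candidates, key=validation_loss)


# The narrow range excludes the optimum; the wide range contains it
best_narrow = random_search(0.001, 0.1, trials=50)
best_wide = random_search(0.001, 1.0, trials=50)

print(f"narrow range: lr={best_narrow:.3f}, loss={validation_loss(best_narrow):.4f}")
print(f"wide range:   lr={best_wide:.3f}, loss={validation_loss(best_wide):.4f}")
```

No number of extra trials fixes the narrow search, because every candidate it can sample is at least 0.4 away from the optimum.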

Choosing the right hyperparameters requires experience with the ML technique you are using and understanding how its hyperparameters behave. It’s important to understand the hyperparameter implications because every hyperparameter that you choose to tune has the potential to increase the number of trials required for a successful tuning job. You need to strike an optimal trade-off between resources allocated to the tuning job and achieving the goals you’ve set.

The SageMaker Automatic Model Tuning team is constantly innovating on behalf of our customers to optimize their ML workloads. AWS recently announced support for new completion criteria for hyperparameter optimization. The max runtime criterion is a budget control that can be used to bound cost and runtime. The desired target metric, improvement monitoring, and convergence detection criteria monitor the performance of the model and assist with early stopping if the models don’t improve after a defined number of training jobs. Autotune is a new feature of automatic model tuning that helps save you time and reduce the resources wasted on finding optimal hyperparameter ranges.
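As a rough illustration of the improvement monitoring idea, the following sketch checks whether the best objective value has stopped improving over the most recent training jobs. The function should_stop and its patience parameter are hypothetical names; this is a simplified stand-in, not the logic SageMaker uses internally:

```python
def should_stop(objective_values, patience, minimize=True):
    """Return True if the best objective value has not improved
    over the last `patience` completed training jobs."""
    if len(objective_values) <= patience:
        return False  # not enough history to judge yet
    earlier = objective_values[:-patience]
    recent = objective_values[-patience:]
    if minimize:
        return min(recent) >= min(earlier)
    return max(recent) <= max(earlier)


# Objective values (for example, validation RMSE) in job completion order
history = [0.90, 0.70, 0.65, 0.66, 0.67, 0.65]

print(should_stop(history, patience=3))      # last three jobs did not beat 0.65
print(should_stop(history[:4], patience=3))  # recent jobs still improved on 0.90
```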

Benefits of Autotune and how automatic model tuning alleviates those pain points

Autotune is a new configuration in the CreateHyperParameterTuningJob API and in the HyperparameterTuner class of the SageMaker Python SDK that removes the need to specify the hyperparameter ranges, tuning strategy, objective metrics, or number of jobs that were previously required as part of the job definition. Autotune automatically chooses the optimal configurations for your tuning job, helps prevent wasted resources, and accelerates productivity.

The following two examples show how many of the parameters become unnecessary when you use Autotune.

The following code creates a hyperparameter tuner using the SageMaker Python SDK without Autotune:

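from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

estimator = PyTorch(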
    entry_point="mnist.py",
    instance_type="ml.p4d.24xlarge",
    hyperparameters={
        "epochs": 1, "backend": "gloo"
    },
)

tuner = HyperparameterTuner(
    estimator, 
    objective_metric_name='validation:rmse',
    objective_type='Minimize',
    hyperparameter_ranges = {
        "lr": ContinuousParameter(0.001, 0.1),
        "batch-size": CategoricalParameter([32, 64, 128, 256, 512])
    },
    metric_definitions=[{...}],
    max_jobs=10,
    strategy="Random"
)

tuner.fit(...)

The following code creates a tuner with Autotune enabled. The hyperparameter ranges, strategy, and number of jobs no longer appear:

estimator = PyTorch(
    entry_point="mnist.py",
    instance_type="ml.p4d.24xlarge",
    hyperparameters={
        "epochs": 1, "backend": "gloo", "lr": 0.01, "batch-size": 32
    },
)
tuner = HyperparameterTuner(
    estimator, 
    objective_metric_name='validation:rmse',
    objective_type='Minimize', 
    autotune=True
)

If you are calling the API directly, the equivalent code is as follows:

create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name,
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition,
    Autotune={'Mode': 'Enabled'},
)

The code example illustrates some of the key benefits of Autotune:

  • A key choice for a tuning job is which hyperparameters to tune and their ranges. Autotune makes this choice for you based on a list of hyperparameters that you provide. Using the previous example, the hyperparameters that Autotune can choose to be tunable are lr and batch-size.
  • Autotune will automatically select the hyperparameter ranges on your behalf. Autotune uses best practices as well as internal benchmarks for selecting the appropriate ranges.
  • Autotune automatically selects the strategy on how to choose the combinations of hyperparameter values to use for the training job.
  • Early stopping is enabled by default when using Autotune. When using early stopping, SageMaker stops training jobs launched by the hyperparameter tuning job when they are unlikely to perform better than previously completed training jobs to avoid additional resource utilization.
  • Maximum expected resources to be consumed by the tuning job (parallel jobs, max runtime, and so on) will be calculated and set in the tuning job record as soon as the tuning job is created. Such reserved resources will not increase during the course of the tuning job; this will maintain an upper bound of cost and duration of the tuning job that is easily predictable by the user. A max runtime of 48 hours will be used by default.

You can override any settings chosen automatically by Autotune. As an example, if you specify your own hyperparameter ranges, those will be used alongside the inferred ranges. Any user-specified hyperparameter range will take precedence over the same named inferred ranges:

estimator = PyTorch(
    ...
    hyperparameters={
        "epochs": 100, "backend": "gloo", "lr": 0.01, "beta1": 0.8
    },
)

tuner = HyperparameterTuner(
    ...
    hyperparameter_ranges={
        "lr": ContinuousParameter(0.001, 0.01)  # takes precedence over inferred "lr"
    },
)

Autotune generates a set of settings as part of the tuning job. Any customer-specified setting that has the same name as an Autotune-selected setting overrides it. Customer-provided settings whose names don’t match any Autotune-selected setting are added alongside the Autotune-selected settings.
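The precedence rule described above behaves like a dictionary merge in which your values win on name clashes. The following sketch illustrates that rule only; merge_settings is a hypothetical helper, and the ranges are invented (min, max) tuples rather than values Autotune would actually choose:

```python
def merge_settings(autotune_selected, user_specified):
    """Combine settings: user values win on name clashes; everything else is kept."""
    merged = dict(autotune_selected)
    merged.update(user_specified)
    return merged


# Hypothetical ranges, written as (min, max) tuples for illustration only
inferred = {"lr": (0.0001, 0.1), "beta1": (0.5, 0.99)}
user = {"lr": (0.001, 0.01), "weight-decay": (0.0, 0.1)}

final = merge_settings(inferred, user)
for name, bounds in sorted(final.items()):
    print(name, bounds)
```

In this sketch, the user-specified lr range replaces the inferred one, the inferred beta1 range is kept, and the extra weight-decay range is added.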

Inspecting parameters chosen by Autotune

Autotune reduces the time you would normally have spent in deciding on the initial set of hyperparameters to tune. But how do you get insights into what hyperparameter values Autotune ended up choosing? You can get information about decisions made for you in the description of the running tuning job (in the response of the DescribeHyperParameterTuningJob operation). After you submit a request to create a tuning job, the request is processed, and all missing fields are set by Autotune. All set fields are reported in the DescribeHyperParameterTuningJob operation.

Alternatively, you can inspect HyperparameterTuner class fields to see the settings chosen by Autotune.

The following XGBoost example shows how you can use the DescribeHyperParameterTuningJob operation to inspect the hyperparameters chosen by Autotune.

First, we create a tuning job with Autotune enabled:

hyperparameters = {
    "objective": "reg:squarederror",
    "num_round": "50",
    "verbosity": "2",
    "max_depth": "5",  # overlap with ranges is ok when Autotune is enabled
}
estimator = XGBoost(hyperparameters=hyperparameters, ...)

hp_tuner = HyperparameterTuner(estimator, autotune=True)
hp_tuner.fit(wait=False)

After the tuning job is created successfully, we can discover what settings Autotune chose. For example, we can describe the tuning job using the job name available from hp_tuner:

import boto3 
sm = boto3.client('sagemaker')

response = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=hp_tuner.latest_tuning_job.name
)

print(response)

Then we can inspect the generated response to review the settings chosen by Autotune on our behalf.
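Because the full response is verbose, a small helper can pull out the fields of interest. The following is a minimal sketch: summarize_tuning_config is a hypothetical helper, and sample_response is made-up data shaped like a DescribeHyperParameterTuningJob response, not actual Autotune output. It assumes the ranges are reported under HyperParameterTuningJobConfig in the ParameterRanges field:

```python
def summarize_tuning_config(response):
    """Collect the strategy, resource limits, and parameter ranges
    from a DescribeHyperParameterTuningJob-style response dict."""
    config = response.get("HyperParameterTuningJobConfig", {})
    ranges = config.get("ParameterRanges", {})
    summary = {
        "strategy": config.get("Strategy"),
        "resource_limits": config.get("ResourceLimits", {}),
        "ranges": {},
    }
    # Numeric ranges carry a minimum and maximum value
    for key in ("IntegerParameterRanges", "ContinuousParameterRanges"):
        for entry in ranges.get(key, []):
            summary["ranges"][entry["Name"]] = (entry["MinValue"], entry["MaxValue"])
    # Categorical ranges carry a list of allowed values
    for entry in ranges.get("CategoricalParameterRanges", []):
        summary["ranges"][entry["Name"]] = list(entry["Values"])
    return summary


# Made-up sample for illustration; real values come from the API response
sample_response = {
    "HyperParameterTuningJobConfig": {
        "Strategy": "Bayesian",
        "ResourceLimits": {"MaxParallelTrainingJobs": 5},
        "ParameterRanges": {
            "IntegerParameterRanges": [
                {"Name": "max_depth", "MinValue": "1", "MaxValue": "10"}
            ],
            "ContinuousParameterRanges": [
                {"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"}
            ],
            "CategoricalParameterRanges": [],
        },
    }
}

summary = summarize_tuning_config(sample_response)
for name, value in sorted(summary["ranges"].items()):
    print(name, value)
```

With a real job, you would pass the response returned by describe_hyper_parameter_tuning_job instead of sample_response.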

If the current tuning job settings are not satisfactory, you can stop the tuning job:

hp_tuner.stop()

Conclusion

SageMaker Automatic Model Tuning allows you to reduce the time to tune a model by automatically searching for the best hyperparameter configuration within the ranges that you specify. However, choosing the right hyperparameter ranges can be a time-consuming process and can have direct implications on your training cost and duration.

In this post, we discussed how you can now use Autotune, a new feature introduced as part of automatic model tuning, to automatically pick an initial set of hyperparameter ranges on your behalf. This can reduce the time it takes for you to get started with your model tuning process. Additionally, you can evaluate the ranges picked by Autotune and adjust them according to your needs.

We also showed how Autotune can automatically pick the optimal parameter settings on your behalf, such as the number of training jobs, the strategy to choose the hyperparameter combinations, and enabling early stopping by default. This can result in significantly optimized budget and time bounds that are easily predictable.

To learn more, refer to Perform Automatic Model Tuning with SageMaker.


About the Authors

Jas Singh is a Senior Solutions Architect helping public sector customers achieve their business outcomes through architecting and implementing innovative and resilient solutions at scale. Jas has over 20 years of experience in designing and implementing mission-critical applications and holds a master’s degree in computer science from Baylor University.

Gopi Mudiyala is a Senior Technical Account Manager at AWS. He helps customers in the Financial Services industry with their operations in AWS. As a machine learning enthusiast, Gopi works to help customers succeed in their ML journey. In his spare time, he likes to play badminton, spend time with family, and travel.

Raviteja Yelamanchili is an Enterprise Solutions Architect with Amazon Web Services based in New York. He works with large financial services enterprise customers to design and deploy highly secure, scalable, reliable, and cost-effective applications on the cloud. He brings over 11 years of risk management, technology consulting, data analytics, and machine learning experience. When he is not helping customers, he enjoys traveling and playing PS5.

Iaroslav Shcherbatyi is a Machine Learning Engineer at AWS. He works mainly on improvements to the Amazon SageMaker platform and helping customers best use its features. In his spare time, he likes to go to the gym, do outdoor sports such as ice skating or hiking, and catch up on new AI research.
