How Mantium achieves low-latency GPT-J inference with DeepSpeed on Amazon SageMaker

Mantium is a global cloud platform provider for building AI applications and managing them at scale. Mantium’s end-to-end development platform enables enterprises and businesses of all sizes to build AI applications and automation faster and more easily than has traditionally been possible. With Mantium, technical and non-technical teams can prototype, develop, test, and deploy AI applications, all with a low-code approach. Through automatic logging, monitoring, and safety features, Mantium also frees software and DevOps engineers from reinventing the wheel. At a high level, Mantium delivers:

  • State-of-the-art AI – Experiment and develop with an extensive selection of open-source and private large language models with a simple UI or API.
  • AI process automation – Easily build AI-driven applications with a growing library of integrations and Mantium’s graphical AI Builder.
  • Rapid deployment – Shorten the production timeline from months to weeks or even days with one-click deployment. This feature turns AI applications into shareable web apps with one click.
  • Safety and regulation – Ensure safety and compliance with governance policies and support for human-in-the-loop processes.

With the Mantium AI Builder, you can develop sophisticated workflows that integrate external APIs, logic operations, and AI models. The following screenshot shows an example of the Mantium AI app, which chains together a Twilio input, a governance policy, an AI block (which can rely on an open-source model like GPT-J), and a Twilio output.

To support this app, Mantium provides comprehensive and uniform access to not only model APIs from AI providers like OpenAI, Co:here, and AI21, but also state-of-the-art open-source models. At Mantium, we believe that anyone should be able to build modern AI applications that they own, end-to-end, and we support this by providing no-code and low-code access to performance-optimized open-source models.

For example, one of Mantium’s core open-source models is GPT-J, a state-of-the-art natural language processing (NLP) model developed by EleutherAI. With 6 billion parameters, GPT-J is one of the largest and best-performing open-source text generation models. Mantium users can integrate GPT-J into their AI applications via Mantium’s AI Builder. In the case of GPT-J, this involves specifying a prompt (a natural language representation of what the model should do) and configuring some optional parameters.

For example, the following screenshot shows an abbreviated demonstration of a sentiment analysis prompt that produces explanations and sentiment predictions. In this example, the author wrote that the “food was wonderful” and that their “service was extraordinary.” Therefore, this text expresses positive sentiment.

However, one challenge with open-source models is that they’re rarely designed for production-grade performance. In the case of large models like GPT-J, this can make production deployment impractical and even infeasible, depending on the use case.

To ensure that our users have access to best-in-class performance, we’re always looking for ways to decrease the latency of our core models. In this post, we describe the results of an inference optimization experiment in which we use DeepSpeed’s inference engine to increase GPT-J’s inference speed by approximately 116%. We also describe how we have deployed the Hugging Face Transformers implementation of GPT-J with DeepSpeed in our Amazon SageMaker inference endpoints.

Overview of the GPT-J model

GPT-J is a generative pretrained transformer (GPT) language model and, in terms of its architecture, it’s comparable to popular, private, large language models like OpenAI’s GPT-3. As noted earlier, it consists of approximately 6 billion parameters and 28 layers, each of which consists of a feedforward block and a self-attention block. When it was first released, GPT-J was one of the first large language models to use rotary embeddings, a new position encoding strategy that unifies absolute and relative position encoders. It also employs an innovative parallelization strategy in which the attention and feedforward blocks are computed in parallel within each layer, which minimizes communication overhead.

Although GPT-J might not quite qualify as large by today’s standards—large models typically consist of more than 100 billion parameters—it’s still impressively performant, and with some prompt engineering or minimal fine-tuning, you can use it to solve many problems. Furthermore, its relatively modest size means that you can deploy it more rapidly and at a much lower cost than larger models.

That said, GPT-J is still pretty big. For example, training GPT-J in FP32 with full weight updates and the Adam optimizer requires over 200 GB of memory: 24 GB for the model parameters, 24 GB for the gradients, 24 GB for Adam’s squared gradients, 24 GB for the optimizer states, plus the additional memory required for loading training batches and storing activations. Of course, training in FP16 reduces these memory requirements by almost half, but a memory footprint of over 100 GB still necessitates innovative training strategies. For instance, in collaboration with SageMaker, Mantium’s NLP team developed a workflow for training (fine-tuning) GPT-J using the SageMaker distributed model parallel library.

In contrast, serving GPT-J for inference has much lower memory requirements—in FP16, model weights occupy less than 13 GB, which means that inference can easily be conducted on a single 16 GB GPU. However, inference with out-of-the-box implementations of GPT-J, such as the Hugging Face Transformers implementation that we use, is relatively slow. To support use cases that require highly responsive text-generation, we’ve focused on reducing GPT-J’s inference latency.
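
For reference, the following is a minimal sketch of loading GPT-J in FP16 with the Hugging Face Transformers library, the baseline implementation referred to in this post. The model ID, prompt, and generation settings are illustrative, not our production configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 6-billion-parameter GPT-J checkpoint in half precision (roughly 12-13 GB of GPU memory)
model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

# Run a single text-generation request
inputs = tokenizer("I need an umbrella because it's", return_tensors="pt").to("cuda")
output_ids = model.generate(inputs.input_ids, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))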

Response latency challenges of GPT-J

Response latency is a core obstacle for the generative pretrained transformers (GPTs) such as GPT-J that power modern text generation. GPT models generate text through sequences of inference steps. At each inference step, the model is given text as input, and, conditional on this input, it samples a word from its vocabulary to append to the text. For example, given the sequence of tokens “I need an umbrella because it’s,” a high-likelihood next token might be “raining.” However, it could also be “sunny” or “bound,” which could be the first step toward a text sequence like “I need an umbrella because it’s bound to start raining.”

Scenarios like this raise some interesting challenges for deploying GPT models because real-world use cases might involve tens, hundreds, or even thousands of inference steps. For example, generating a 1,000-token response requires 1,000 inference steps! Accordingly, although a model might offer inference speeds that seem fast enough in isolation, it’s easy for latency to reach untenable levels when long texts are generated. We observed an average latency of 280 milliseconds per inference step on a V100 GPU. This might seem fast for a 6-billion-parameter model, but with such latencies, it takes approximately 30 seconds to generate a 500-token response, which isn’t ideal from a user experience perspective.

Optimizing inference speeds with DeepSpeed Inference

DeepSpeed is an open-source deep-learning optimization library developed by Microsoft. Although it primarily focuses on optimizing the training of large models, DeepSpeed also provides an inference optimization framework that supports a select set of models, including BERT, Megatron, GPT-Neo, GPT2, and GPT-J. DeepSpeed Inference facilitates high-performance inference with large Transformer-based architectures through a combination of model parallelism, inference-optimized CUDA kernels, and quantization.

To boost inference speed with GPT-J, we use DeepSpeed’s inference engine to inject optimized CUDA kernels into the Hugging Face Transformers GPT-J implementation.
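
As a rough sketch, kernel injection with DeepSpeed’s inference engine looks like the following. The exact arguments depend on the DeepSpeed version, and this is not our production code; it assumes a GPT-J model already loaded in FP16 with Hugging Face Transformers, as in the earlier example.

import torch
import deepspeed

# Wrap the FP16 Hugging Face GPT-J model with DeepSpeed's inference engine.
# replace_with_kernel_inject=True swaps in the inference-optimized CUDA kernels.
ds_model = deepspeed.init_inference(
    model,                           # the Transformers GPT-J model loaded earlier
    mp_size=1,                       # no model parallelism on a single GPU
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# ds_model.module is still a Hugging Face model, so generate() works as before
output_ids = ds_model.module.generate(inputs.input_ids, max_new_tokens=50)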

To evaluate the speed benefits of DeepSpeed’s inference engine, we conducted a series of latency tests in which we timed GPT-J under various configurations. Specifically, we varied whether or not DeepSpeed was used, hardware, output sequence length, and input sequence length. We focused on both output and input sequence length, because they both affect inference speed. To generate an output sequence of 50 tokens, the model must perform 50 inference steps. Furthermore, the time required to perform an inference step depends on the size of the input sequence—larger inputs require more processing time. Although the effect of output sequence size is much larger than the effect of input sequence size, it’s still necessary to account for both factors.

In our experiment, we used the following design:

  • DeepSpeed inference engine – On, off
  • Hardware – T4 (ml.g4dn.2xlarge), V100 (ml.p3.2xlarge)
  • Input sequence length – 50, 200, 500, 1000
  • Output sequence length – 50, 100, 150, 200

In total, this design has 64 combinations of these four factors, and for each combination, we ran 20 latency tests. Each test was run on a pre-initialized SageMaker inference endpoint, ensuring that our latency tests reflect production times, including API exchanges and preprocessing.
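
The test harness itself is straightforward. A simplified, hypothetical version that times repeated calls to an already-deployed SageMaker endpoint might look like the following; the endpoint name and payload schema are assumptions, not our actual interface.

import json
import time
import boto3

runtime = boto3.client("sagemaker-runtime")

def time_generation(endpoint_name, prompt, max_new_tokens, runs=20):
    """Return per-request latencies (in seconds) for a deployed text-generation endpoint."""
    payload = json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}})
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=payload,
        )
        latencies.append(time.perf_counter() - start)
    return latencies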

Our tests demonstrate that DeepSpeed’s GPT-J inference engine is substantially faster than the baseline Hugging Face Transformers PyTorch implementation. The following figure illustrates the mean text generation latencies for GPT-J with and without DeepSpeed acceleration on ml.g4dn.2xlarge and ml.p3.2xlarge SageMaker inference endpoints.

On the ml.g4dn.2xlarge instance, which is equipped with a 16 GB NVIDIA T4 GPU, we observed a mean latency reduction of approximately 24% [Standard Deviation (SD) = 0.05]. This corresponded to an increase from a mean 12.5 (SD = 0.91) tokens per second to a mean 16.5 (SD = 2.13) tokens per second. Notably, DeepSpeed’s acceleration effect was even stronger on the ml.p3.2xlarge instance, which is equipped with an NVIDIA V100 GPU. On that hardware, we observed a 53% (SD = 0.07) mean latency reduction. In terms of tokens per second, this corresponded to an increase from a mean 21.9 (SD = 1.97) tokens per second to a mean 47.5 (SD = 5.8) tokens per second.

We also observed that the acceleration offered by DeepSpeed attenuated slightly on both hardware configurations as the size of the input sequences grew. However, across all conditions, inference with DeepSpeed’s GPT-J optimizations was still substantially faster than the baseline. For example, on the g4dn instance, the maximum and minimum latency reductions were 31% (input sequence size = 50) and 15% (input sequence size = 1000), respectively. And on the p3 instance, the maximum and minimum latency reductions were 62% (input sequence size = 50) and 40% (input sequence size = 1000), respectively.

Deploying GPT-J with DeepSpeed on a SageMaker inference endpoint

In addition to dramatically increasing text generation speeds for GPT-J, DeepSpeed’s inference engine is simple to integrate into a SageMaker inference endpoint. Before adding DeepSpeed to our inference stack, our endpoints were running on a custom Docker image based on an official PyTorch image. SageMaker makes it very easy to deploy custom inference endpoints, and integrating DeepSpeed was as simple as including the dependency and writing a few lines of code. An open-source guide to the workflow for deploying GPT-J with DeepSpeed is available on GitHub.

Conclusion

Mantium is dedicated to leading innovation so that everyone can quickly build with AI. From AI-driven process automation to stringent safety and compliance settings, our complete platform provides all the tools necessary to develop and manage robust, responsible AI applications at scale and lowers the barrier to entry. SageMaker helps companies like Mantium get to market quickly.

To learn how Mantium can help you build complex AI-driven workflows for your organization, visit www.mantiumai.com.


About the authors

Joe Hoover is a Senior Applied Scientist on Mantium’s AI R&D team. He is passionate about developing models, methods, and infrastructure that help people solve real-world problems with cutting-edge NLP systems. In his spare time, he enjoys backpacking, gardening, cooking, and hanging out with his family.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Sunil Padmanabhan is a Startup Solutions Architect at AWS. As a former startup founder and CTO, he is passionate about machine learning and focuses on helping startups leverage AI/ML for their business outcomes and design and deploy ML/AI solutions at scale.

Read More

Prepare data faster with PySpark and Altair code snippets in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a purpose-built data aggregation and preparation tool for machine learning (ML). It allows you to use a visual interface to access data and perform exploratory data analysis (EDA) and feature engineering. The EDA feature comes with built-in data analysis capabilities for charts (such as scatter plots or histograms) and time-saving model analysis capabilities such as feature importance, target leakage, and model explainability. The feature engineering capability has over 300 built-in transforms and can perform custom transformations using the Python, PySpark, or Spark SQL runtimes.

For custom visualizations and transforms, Data Wrangler now provides example code snippets for common types of visualizations and transforms. In this post, we demonstrate how to use these code snippets to quickstart your EDA in Data Wrangler.

Solution overview

At the time of this writing, you can import datasets into Data Wrangler from Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Databricks, and Snowflake. For this post, we use Amazon S3 to store the 2014 Amazon reviews dataset. The following is a sample of the dataset:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is sometimes hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

In this post, we perform EDA using three columns—asin, reviewTime, and overall—which map to the product ID, review time date, and the overall review score, respectively. We use this data to visualize dynamics for the number of reviews across months and years.

Using example code snippets for EDA in Data Wrangler

To start performing EDA in Data Wrangler, complete the following steps:

  1. Download the Digital Music reviews dataset JSON and upload it to Amazon S3.
    We use this as the raw dataset for the EDA.
  2. Open Amazon SageMaker Studio and create a new Data Wrangler flow and import the dataset from Amazon S3.

    This dataset has nine columns, but we only use three: asin, reviewTime, and overall. We need to drop the other six columns.

  3. Create a custom transform and choose Python (PySpark).
  4. Expand Search example snippets and choose Drop all columns except several.
  5. Enter the provided snippet into your custom transform and follow the directions to modify the code.
    # Specify the subset of columns to keep
    cols = ["asin", "reviewTime", "overall"]
    
    cols_to_drop = set(df.columns).difference(cols) 
    df = df.drop(*cols_to_drop)

    Now that we have all the columns we need, let’s filter the data down to only keep reviews between 2000–2020.

  6. Use the Filter timestamp outside range snippet to drop the data before year 2000 and after 2020:
    from pyspark.sql.functions import col
    from datetime import datetime
    
    # specify the start and the stop timestamp
    timestamp_start = datetime.strptime("2000-01-01 12:00:00", "%Y-%m-%d %H:%M:%S")
    timestamp_stop = datetime.strptime("2020-01-01 12:00:00", "%Y-%m-%d %H:%M:%S")
    
    df = df.filter(col("reviewTime").between(timestamp_start, timestamp_stop))

    Next, we extract the year and month from the reviewTime column.

  7. Use the Featurize date/time transform.
  8. For Extract columns, choose year and month.

    Next, we want to aggregate the number of reviews by year and month that we created in the previous step.

  9. Use the Compute statistics in groups snippet:
    # Table is available as variable `df`
    from pyspark.sql.functions import sum, avg, max, min, mean, count
    
    # Provide the list of columns defining groups
    groupby_cols = ["reviewTime_year", "reviewTime_month"]
    
    # Specify the map of aggregate function to the list of columns
    # aggregates to use: sum, avg, max, min, mean, count
    aggregate_map = {count: ["overall"]}
    
    all_aggregates = []
    for a, cols in aggregate_map.items():
        all_aggregates += [a(col) for col in cols]
    
    df = df.groupBy(groupby_cols).agg(*all_aggregates)

  10. Rename the aggregation of the previous step from count(overall) to reviews_num by choosing Manage Columns and the Rename column transform.
    Finally, we want to create a heatmap to visualize the distribution of reviews by year and by month.
  11. On the analysis tab, choose Custom visualization.
  12. Expand Search for snippet and choose Heatmap on the drop-down menu.
  13. Enter the provided snippet into your custom visualization:
    # Table is available as variable `df`
    import altair as alt
    
    # Takes first 1000 records of the Dataframe
    df = df.head(1000)  
    
    chart = (
        alt.Chart(df)
        .mark_rect()
        .encode(
            # Specify the column names for X and Y axis,
            # Both should have discrete values: ordinal (:O) or nominal (:N)
            x= "reviewTime_year:O",
            y="reviewTime_month:O",
            # Color can be both discrete (:O, :N) and quantitative (:Q)
            color="reviews_num:Q",
        )
        .interactive()
    )

    We get the following visualization.


    If you want to enhance the heatmap further, you can slice the data to only show reviews prior to 2011. These are hard to identify in the heatmap we just created due to large volumes of reviews since 2012.

  14. Add one line of code to your custom visualization:
    # Table is available as variable `df`
    import altair as alt
    
    df = df[df.reviewTime_year < 2011]
    # Takes first 1000 records of the Dataframe
    df = df.head(1000)  
    
    chart = (
        alt.Chart(df)
        .mark_rect()
        .encode(
            # Specify the column names for X and Y axis,
            # Both should have discrete values: ordinal (:O) or nominal (:N)
            x= "reviewTime_year:O",
            y="reviewTime_month:O",
            # Color can be both discrete (:O, :N) and quantitative (:Q)
            color="reviews_num:Q",
        )
        .interactive()
    )

We get the following heatmap.

Now the heatmap reflects the reviews prior to 2011 more visibly: we can observe the seasonal effects (the end of the year brings more purchases and therefore more reviews) and can identify anomalous months, such as October 2003 and March 2005. It’s worth investigating further to determine the cause of those anomalies.

Conclusion

Data Wrangler is a purpose-built data aggregation and preparation tool for ML. In this post, we demonstrated how to perform EDA and transform your data quickly using code snippets provided by Data Wrangler. You just need to find a snippet, enter the code, and adjust the parameters to match your dataset. You can continue to iterate on your script to create more complex visualizations and transforms.
To learn more about Data Wrangler, refer to Create and Use a Data Wrangler Flow.


About the Authors

Nikita Ivkin is an Applied Scientist, Amazon SageMaker Data Wrangler.

Haider Naqvi is a Solutions Architect at AWS. He has extensive software development and enterprise architecture experience. He focuses on enabling customers to achieve business outcomes with AWS. He is based out of New York.

Harish Rajagopalan is a Senior Solutions Architect at Amazon Web Services. Harish works with enterprise customers and helps them with their cloud journey.

James Wu is a Senior Customer Solutions Manager at AWS, based in Dallas, TX. He works with customers to accelerate their cloud journey and fast-track their business value realization. In addition to that, James is also passionate about developing and scaling large AI/ ML solutions across various domains. Prior to joining AWS, he led a multi-discipline innovation technology team with ML engineers and software developers for a top global firm in the market and advertising industry.

Read More

Extract insights from SAP ERP with no-code ML solutions with Amazon AppFlow and Amazon SageMaker Canvas

Customers in industries like consumer packaged goods, manufacturing, and retail are always looking for ways to empower their operational processes by enriching them with insights and analytics generated from data. Tasks like sales forecasting directly affect operations such as raw material planning, procurement, manufacturing, distribution, and inbound/outbound logistics, and they can have impacts at many levels, from a single warehouse all the way to large-scale production facilities.

Sales representatives and managers use historical sales data to make informed predictions about future sales trends. Customers use SAP ERP Central Component (ECC) to manage planning for the manufacturing, sale, and distribution of goods. The sales and distribution (SD) module within SAP ECC helps manage sales orders. SAP systems are the primary source of historical sales data.

Sales representatives and managers have the domain knowledge and in-depth understanding of their sales data. However, they lack data science and programming skills to create machine learning (ML) models that can generate sales forecasts. They seek intuitive, simple-to-use tools to create ML models without writing a single line of code.

To help organizations achieve the agility and effectiveness that business analysts seek, we introduced Amazon SageMaker Canvas, a no-code ML solution that helps you accelerate delivery of ML solutions down to hours or days. Canvas enables analysts to easily use available data in data lakes, data warehouses, and operational data stores; build ML models; and use them to make predictions interactively and for batch scoring on bulk datasets—all without writing a single line of code.

In this post, we show how to bring sales order data from SAP ECC to generate sales forecasts using an ML model built using Canvas.

Solution overview

To generate sales forecasts using SAP sales data, we need the collaboration of two personas: data engineers and business analysts (sales representatives and managers). Data engineers are responsible for configuring the data export from the SAP system to Amazon Simple Storage Service (Amazon S3) using Amazon AppFlow, which business analysts can then run either on-demand or automatically (schedule-based) to refresh SAP data in the S3 bucket. Business analysts are then responsible for generating forecasts with the exported data using Canvas. The following diagram illustrates this workflow.

For this post, we use SAP NetWeaver Enterprise Procurement Model (EPM) for the sample data. EPM is generally used for demonstration and testing purposes in SAP. It uses a common business process model and follows the business object (BO) paradigm to support well-defined business logic. We used the SAP transaction SEPM_DG (data generator) to generate around 80,000 historical sales orders and created a HANA CDS view to aggregate the data by product ID, sales date, and city, as shown in the following code:

@AbapCatalog.sqlViewName: 'ZCDS_EPM_VIEW'
@AbapCatalog.compiler.compareFilter: true
@AbapCatalog.preserveKey: true
@AccessControl.authorizationCheck: #CHECK
@EndUserText.label: 'Sagemaker canvas sales order'
@OData.publish: true 
define view ZCDS_EPM as select from epm_v_sales_data as sd
inner join epm_v_bp as bp
    on sd.bp_id = bp.bp_id  {
    key sd.product_id as productid,
    bp.city,
    concat(
      cast(
        Concat(
          Concat(
            Concat(substring(cast (sd.created_at as abap.char( 30 )), 1, 4), '-'),
            Concat(substring(cast (sd.created_at as abap.char( 30 )), 5, 2), '-')
          ),
          Substring(cast (sd.created_at as abap.char( 30 )), 7, 2)
        ) as char10 preserving type
      ),
      ' 00:00:00'
    ) as saledate,
    cast(sum(sd.gross_amount) as abap.dec( 15, 3 )) as totalsales 
}
group by sd.product_id,sd.created_at, bp.city

In the next section, we expose this view as an SAP OData service, which allows us to extract the data with Amazon AppFlow.

The following table shows the representative historical sales data from SAP, which we use in this post.

productid    saledate                city             totalsales
P-4          2013-01-02 00:00:00     Quito            1922.00
P-5          2013-01-02 00:00:00     Santo Domingo    1903.00

The data file is daily frequency historical data. It has four columns (productid, saledate, city, and totalsales). We use Canvas to build an ML model that is used to forecast totalsales for productid in a particular city.

This post has been organized to show the activities and responsibilities for both data engineers and business analysts to generate product sales forecasts.

Data engineer: Extract, transform, and load the dataset from SAP to Amazon S3 with Amazon AppFlow

The first task you perform as a data engineer is to run an extract, transform, and load (ETL) job on historical sales data from SAP ECC to an S3 bucket, which the business analyst uses as the source dataset for their forecasting model. For this, we use Amazon AppFlow, because it provides an out-of-the-box SAP OData Connector for ETL (as shown in the following diagram), with a simple UI to set up everything needed to configure the connection from the SAP ECC to the S3 bucket.

Prerequisites

The following are requirements to integrate Amazon AppFlow with SAP:

  • SAP NetWeaver Stack version 7.40 SP02 or above
  • Catalog service (OData v2.0/v4.0) enabled in SAP Gateway for service discovery
  • Support for client-side pagination and query options for SAP OData Service
  • HTTPS enabled connection to SAP

Authentication

Amazon AppFlow supports two authentication mechanisms to connect to SAP:

  • Basic – Authenticates using SAP OData user name and password.
  • OAuth 2.0 – Uses OAuth 2.0 configuration with an identity provider. OAuth 2.0 must be enabled for OData v2.0/v4.0 services.

Connection

Amazon AppFlow can connect to SAP ECC using a public SAP OData interface or a private connection. A private connection improves data privacy and security by transferring data through the private AWS network instead of the public internet. A private connection uses the VPC endpoint service for the SAP OData instance running in a VPC. The VPC endpoint service must have the Amazon AppFlow service principal appflow.amazonaws.com as an allowed principal and must be available in more than 50% of the Availability Zones in an AWS Region.

Set up a flow in Amazon AppFlow

We configure a new flow in Amazon AppFlow to run an ETL job on data from SAP to an S3 bucket. This flow allows for configuration of the SAP OData Connector as source, S3 bucket as destination, OData object selection, data mapping, data validation, and data filtering.

  1. Configure the SAP OData Connector as a data source by providing the following information:
    1. Application host URL
    2. Application service path (catalog path)
    3. Port number
    4. Client number
    5. Logon language
    6. Connection type (private link or public)
    7. Authentication mode
    8. Connection name for the configuration
  2. After you configure the source, choose the OData object and subobject for the sales orders.
    Generally, sales data is exported from SAP at a certain frequency, such as monthly or quarterly, as a full-size export. For this post, choose the subobject option for the full-size export.
  3. Choose the S3 bucket as the destination.
    The flow exports data to this bucket.
  4. For Data format preference, select CSV format.
  5. For Data transfer preference, select Aggregate all records.
  6. For Filename preference, select Add a timestamp to the file name.
  7. For Folder structure preference, select No timestamped folder.
    The record aggregation configuration exports the full-size sales data from SAP combined in a single file. The file name ends with a timestamp in the YYYY-MM-DDTHH:mm:ss format in a single folder (flow name) within the S3 bucket. Canvas imports data from this single file for model training and forecasting.
  8. Configure data mapping and validations to map the source data fields to destination data fields, and enable data validation rules as required.
  9. You can also configure data filtering conditions to filter out specific records if your requirements demand it.
  10. Configure your flow trigger to decide whether the flow runs manually on-demand or automatically based on a schedule.
    When configured for a schedule, the frequency is based on how frequently the forecast needs to be generated (generally monthly, quarterly, or half-yearly).
    After the flow is configured, business analysts can run it on demand or based on the schedule to perform an ETL job on the sales order data from SAP to an S3 bucket (a minimal programmatic example follows this list).
  11. In addition to the Amazon AppFlow configuration, the data engineers also need to configure an AWS Identity and Access Management (IAM) role for Canvas so that it can access other AWS services. For instructions, refer to Give your users permissions to perform time series forecasting.
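
The on-demand run mentioned in step 10 can also be started programmatically with the Amazon AppFlow StartFlow API. The following is a hedged sketch using the AWS SDK for Python; the flow name is a placeholder for whatever name you gave the flow during configuration.

import boto3

appflow = boto3.client("appflow")

# Start an on-demand run of the flow configured above (the flow name is hypothetical)
response = appflow.start_flow(flowName="sap-sales-orders-to-s3")
print(response["flowStatus"], response["executionId"])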

Business analyst: Use the historical sales data to train a forecasting model

Let’s switch gears and move to the business analyst side. As a business analyst, we’re looking for a visual, point-and-click service that makes it easy to build ML models and generate accurate predictions without writing a single line of code or having ML expertise. Canvas fits this requirement as a no-code ML solution.

First, make sure that your IAM role is configured in such a way that Canvas can access other AWS services. For more information, refer to Give your users permissions to perform time series forecasting, or you can ask your cloud engineering team for help.

When the data engineer is done setting up the Amazon AppFlow-based ETL configuration, the historical sales data is available for you in an S3 bucket.

You’re now ready to train a model with Canvas! This typically involves four steps: importing data into the service, configuring the model training by selecting the appropriate model type, training the model, and finally generating forecasts using the model.

Import data in Canvas

First, launch the Canvas app from the Amazon SageMaker console or from your single sign-on access. If you don’t know how to do that, contact your administrator so that they can guide you through the process of setting up Canvas. Make sure that you access the service in the same Region as the S3 bucket containing the historical dataset from SAP. You should see a screen like the following.

Then complete the following steps:

  1. In Canvas, choose Datasets in the navigation pane.
  2. Choose Import to start importing data from the S3 bucket.
  3. On the import screen, choose the data file or object from the S3 bucket to import the training data.

You can import multiple datasets in Canvas. It also supports creating joins between the datasets by choosing Join data, which is particularly useful when the training data is spread across multiple files.

Configure and train the model

After you import the data, complete the following steps:

  1. Choose Models in the navigation pane.
  2. Choose New model to start configuration for training the forecast model.
  3. For the new model, give it a suitable name, such as product_sales_forecast_model.
  4. Select the sales dataset and choose Select dataset.

    After the dataset is selected, you can see data statistics and configure the model training on the Build tab.
  5. Select totalsales as the target column for the prediction.
    You can see Time series forecasting is automatically selected as the model type.
  6. Choose Configure.
  7. In the Time series forecasting configuration section, choose productid for Item ID column.
  8. Choose city for Group column.
  9. Choose saledate for Time stamp column.
  10. For Days, enter 120.
  11. Choose Save.
    This configures the model to make forecasts for totalsales for 120 days using saledate based on historical data, which can be queried for productid and city.
  12. When the model training configuration is complete, choose Standard Build to start the model training.

The Preview model option is not available for time series forecasting model type. You can review the estimated time for the model training on the Analyze tab.

Model training might take 1–4 hours to complete, depending on the data size. When the model is ready, you can use it to generate the forecast.

Generate a forecast

When the model training is complete, it shows prediction accuracy of the model on the Analyze tab. For instance, in this example, it shows prediction accuracy as 92.87%.

The forecast is generated on the Predict tab. You can generate forecasts for all the items or a selected single item. It also shows the date range for which the forecast can be generated.

As an example, choose the Single item option. Select P-2 for Item and Quito for Group to generate a prediction for product P-2 for city Quito for the date range 2017-08-15 00:00:00 through 2017-12-13 00:00:00.

The generated forecast shows the average forecast as well as the upper and lower bound of the forecast. The forecast bounds help configure an aggressive or balanced approach for the forecast handling.

You can also download the generated forecast as a CSV file or image. The generated forecast CSV file is generally used to work offline with the forecast data.

The forecast is now generated for the time series data. When a new baseline of data becomes available for the forecast, you can change the dataset in Canvas to retrain the forecast model using the new baseline.

You can retrain the model multiple times as and when the training data changes.

Conclusion

In this post, you learned how the Amazon AppFlow SAP OData Connector exports sales order data from the SAP system into an S3 bucket and then how to use Canvas to build a model for forecasting.

You can use Canvas for any SAP time series data scenarios, such as expense or revenue prediction. The entire forecast generation process is configuration driven. Sales managers and representatives can generate sales forecasts repeatedly per month or per quarter with a refreshed set of data in a fast, straightforward, and intuitive way without writing a single line of code. This helps improve productivity and enables quick planning and decisions.

To get started, learn more about Canvas and Amazon AppFlow using the following resources:


About the Authors

Brajendra Singh is a solutions architect at Amazon Web Services, working with enterprise customers. He has a strong developer background and is a keen enthusiast of data and machine learning solutions.

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Read More

Customize pronunciations using Amazon Polly

Amazon Polly breathes life into text by converting it into lifelike speech. This empowers developers and businesses to create applications that can converse in real time, thereby offering an enhanced interactive experience. Text-to-speech (TTS) in Amazon Polly supports a variety of languages and locales, which enables you to perform TTS conversion according to your preferences. Multiple factors guide this choice, such as geographic location and language locales.

Amazon Polly uses advanced deep learning technologies to synthesize text to speech in real time in various output formats, such as MP3, Ogg Vorbis, JSON, or PCM, across standard and neural engines. The Speech Synthesis Markup Language (SSML) support for Amazon Polly further bolsters the service’s capability to customize speech with a plethora of options, including controlling speech rate and volume, adding pauses, emphasizing certain words or phrases, and more.

In today’s world, businesses continue to expand across multiple geographic locations, and they’re continuously looking for mechanisms to improve personalized end-user engagement. For instance, you may require accurate pronunciation of certain words in a specific style pertaining to different geographical locations. Your business may also need to pronounce certain words and phrases in certain ways depending on their intended meaning. You can achieve this with the help of SSML tags provided by Amazon Polly.

This post aims to assist you in customizing pronunciation when dealing with a truly global customer base.
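
All of the SSML snippets in this post can be passed to the SynthesizeSpeech API. The following is a minimal sketch with the AWS SDK for Python; the voice, engine, and output file name are just examples, and the SSML string is one of the snippets discussed later in this post.

import boto3

polly = boto3.client("polly")

ssml = "<speak>You say, <phoneme alphabet='ipa' ph='pɪˈkɑːn'>pecan</phoneme>.</speak>"

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",        # tell Polly the input is SSML, not plain text
    VoiceId="Joanna",
    OutputFormat="mp3",
    Engine="neural",
)

with open("pecan.mp3", "wb") as f:
    f.write(response["AudioStream"].read())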

Modify pronunciation using phonemes

A phoneme can be considered the smallest unit of speech. The <phoneme> SSML tag in Amazon Polly helps customize pronunciation based on phonemes using the International Phonetic Alphabet (IPA) or the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA). X-SAMPA is a representation of the IPA in ASCII encoding. Phoneme tags are available and fully supported in both the standard and neural TTS engines. For example, the word “lead” can be pronounced as the present tense verb, or it can refer to the chemical element lead. We discuss this with an example later in this post.

International Phonetic Alphabet

The IPA is used to represent sounds across different languages. For a list of phonemes Amazon Polly supports, refer to Phoneme and Viseme Tables for Supported Languages.

By default, Amazon Polly determines the pronunciation of a word for you. Let’s use the example of the word “lead,” which is pronounced differently when it refers to the chemical element rather than the verb. When we provide the word “lead” as input without any customizing SSML tags, Amazon Polly speaks it in its present tense verb form, as the following example demonstrates.

<speak>
The default pronunciation by Amazon Polly for L E A D is <break time = "300ms"/> lead,
which is the present tense form.
</speak>

To return the pronunciation of the chemical element lead (which can also be the verb in past tense), we can use phonemes along with IPA or X-SAMPA. IPA is generally used to customize the pronunciation of a word in a given language using phonemes:

<speak>
This is the pronunciation using the
<say-as interpret-as="characters">IPA</say-as> attribute
in the <say-as interpret-as="characters">SSML</say-as> tag. 
The verb form for L E A D is <break time="150ms"/> lead.
The chemical element <break time="150ms"/><phoneme alphabet="ipa" ph="lɛd">lead</phoneme> 
<break time="300ms"/>also has an identical spelling.
</speak>

Modify pronunciation by specifying parts of speech

If we consider the same example of pronouncing “lead,” we can also differentiate between the chemical element and the verb by specifying the parts of speech using the <w> SSML tag.

The <w> tag allows us to customize pronunciation by specifying parts of speech. You can configure the pronunciation in terms of verb (present simple or past tense), noun, adjective, preposition, and determiner. See the following example:

<speak>
The word<p> <say-as interpret-as="characters">lead</say-as></p> 
may be interpreted as either the present simple form <w role="amazon:VB">lead</w>, 
or the chemical element <w role="amazon:SENSE_1">lead</w>.
</speak>

Additionally, you can use the <sub> tag to indicate the pronunciation of acronyms and abbreviations:

<speak>
Polly is an <sub alias="Amazon Web Services">AWS</sub> 
offering providing text-to-Speech service. 
</speak>

Extended Speech Assessment Methods Phonetic Alphabet

The X-SAMPA transcription scheme is an extension of the various language-specific SAMPA phoneme sets.

The following snippet shows how you can use X-SAMPA to pronounce different variations of the word “lead”:

<speak>
This is the pronunciation using the X-SAMPA attribute, 
in the verb form <break time="1s"/> lead.
The chemical element <break time="1s"/> 
<phoneme alphabet='x-sampa' ph='lEd'>lead</phoneme> <break time="0.5s"/>
also has an identical spelling.
</speak>

The stress mark in IPA is usually represented by ˈ. We often encounter scenarios in which an apostrophe is used instead, which might give a different output than expected. In X-SAMPA, the stress mark is a double quotation mark, so the ph attribute value should be enclosed in single quotation marks when you specify the phonemic alphabet. See the following example:

<speak>
You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. 
</speak>

In the preceding example, the character ˈ is used to stress the word. Similarly, the X-SAMPA stress mark appears as a double quotation mark in the following example:

<speak>
You say, <phoneme alphabet='x-sampa' ph='pI"kA:n'>pecan</phoneme>.
</speak>

Modify pronunciations using other SSML tags

You can use the <say-as> tag to modify pronunciation by enabling the spell-out or character feature. It also controls how digits, fractions, units, dates, times, addresses, telephone numbers, and cardinal and ordinal numbers are spoken, and it can censor the text enclosed within the tag. For more information, refer to Controlling How Special Types of Words Are Spoken. Let’s look at examples of these attributes.

Date

By default, Amazon Polly reads a date as it interprets the text. To handle dates explicitly, you can use the date attribute to customize pronunciation in the required format, such as month-day-year or day-month-year.

Without the date attribute, Amazon Polly provides the following output when speaking out dates:

<speak>
The default pronunciation when using date is 01-11-1996
</speak>

However, if you want the dates spoken in a specific format, the date attribute in the <say-as> tags helps customize the pronunciation:

<speak>
We will see the examples of different date formats using the date SSML tag.
The following date is written in the day-month-year format.
<say-as interpret-as="date" format="dmy">01-11-1995</say-as><break time="500ms"/>
The following date is written in the month-day-year format.
<say-as interpret-as="date" format="mdy">09-24-1995</say-as>
</speak>

Cardinal

This attribute represents a number in its cardinal format. For example, 124456 is pronounced “one hundred twenty four thousand four hundred fifty six”:

<speak> 
The following number is pronounced in its cardinal form.
<say-as interpret-as="cardinal">124456</say-as>
</speak>

Ordinal

This attribute represents a number in its ordinal format. Without the ordinal attribute, the number is pronounced in its numerical form:

<speak>
The following number is pronounced in its ordinal form
without the use of any SSML attribute in the say as tag - 1242 
</speak>

If we want to pronounce 1242 as “one thousand two hundred forty second,” we can use the ordinal attribute:

<speak>
The following number is pronounced in its ordinal form.
<say-as interpret-as="ordinal">1242</say-as>
</speak>

Digits

The digits attribute is used to speak a number as individual digits. For example, “1242” is pronounced as “one two four two”:

<speak>
The following number is pronounced as individual digits.
<say-as interpret-as="digits">1242</say-as>
</speak>

Fraction

The fraction attribute is used to customize the pronunciations in the fractional form:

<speak> 
The following are examples of pronunciations when 
<prosody volume="loud"> fraction</prosody>
is used as an attribute in the say -as tag. 
<break time="500ms"/>Seven one by two is pronounced as
<say-as interpret-as="fraction">7 ½ </say-as>
whereas three by twenty is pronounced as <say-as interpret-as="fraction">3/20</say-as>
</speak>

Time

The time attribute is used to speak a duration in minutes and seconds:

<speak>
Polly also supports customizing pronunciation in terms of minutes and seconds. 
For example, <say-as interpret-as="time">2'42"</say-as>
</speak>

Expletive

The expletive attribute censors the text enclosed within the tags:

<speak> 
The value that is going to be censored is
<say-as interpret-as="expletive">this is not good</say-as>
You should have heard the beep sound.
</speak>

Telephone

To pronounce telephone numbers, you can use the telephone attribute to speak out telephone numbers instead of pronouncing them as standalone digits or as a cardinal number:

<speak>
The telephone number is 
<say-as interpret-as="telephone">1800 3000 9009</say-as>
</speak>

Address

The address attribute is used to customize the pronunciation of an address aligning to a specific format:

<speak> 
The address is<break time="1s"/>
<say-as interpret-as="address">440 Terry Avenue North, Seattle
WA 98109 USA</say-as>
</speak>

Lexicons

We’ve looked at some of the SSML tags readily available in Amazon Polly. Other use cases might require a higher degree of control for customized pronunciations. Lexicons help achieve this requirement. You can use lexicons when certain words need to be pronounced in a certain form that is uncommon to that specific language.

Another use case for lexicons is with the use of numeronyms, which are abbreviations formed with the help of numbers. For example, Y2K is pronounced as the “year 2000.” You can use lexicons to customize these pronunciations.

Amazon Polly supports lexicon files in .pls and .xml formats. For more information, see Managing Lexicons.
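
As an illustrative sketch (the lexicon name and entry are made up), a lexicon can be uploaded and then referenced during synthesis with the PutLexicon and SynthesizeSpeech APIs:

import boto3

polly = boto3.client("polly")

# A minimal PLS lexicon that expands the numeronym Y2K
pls = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Y2K</grapheme>
    <alias>year two thousand</alias>
  </lexeme>
</lexicon>"""

polly.put_lexicon(Name="numeronyms", Content=pls)

# Reference the lexicon by name when synthesizing speech
response = polly.synthesize_speech(
    Text="The Y2K bug was a date formatting problem.",
    VoiceId="Joanna",
    OutputFormat="mp3",
    LexiconNames=["numeronyms"],
)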

Conclusion

Amazon Polly SSML tags can help you customize pronunciation in a variety of ways. We hope that this post gives you a head start into the world of speech synthesis and powers your applications to provide more lifelike human interactions.


About the Authors

Abilashkumar P C is a Cloud Support Engineer at AWS. He works with customers providing technical troubleshooting guidance, helping them achieve their workloads at scale. Outside of work, he loves driving, following cricket, and reading.

Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.

Read More

Demystifying machine learning at the edge through real use cases

Edge is a term that refers to a location, far from the cloud or a big data center, where you have a computer device (edge device) capable of running (edge) applications. Edge computing is the act of running workloads on these edge devices. Machine learning at the edge (ML@Edge) is a concept that brings the capability of running ML models locally to edge devices. These ML models can then be invoked by the edge application. ML@Edge is important for many scenarios where raw data is collected from sources far from the cloud. These scenarios may also have specific requirements or restrictions:

  • Low-latency, real-time predictions
  • Poor or non-existing connectivity to the cloud
  • Legal restrictions that don’t allow sending data to external services
  • Large datasets that need to be preprocessed locally before sending responses to the cloud

The following are some of many use cases that can benefit from ML models running close to the equipment that generates the data used for the predictions:

  • Security and safety – A restricted area where heavy machines operate in an automated port is monitored by a camera. If a person enters this area by mistake, a safety mechanism is activated to stop the machines and protect the human.
  • Predictive maintenance – Vibration and audio sensors collect data from a gearbox of a wind turbine. An anomaly detection model processes the sensor data and identifies anomalies in the equipment. If an anomaly is detected, the edge device can start a contingency measure in real time to avoid damaging the equipment, such as engaging the brakes or disconnecting the generator from the grid.
  • Defect detection in production lines – A camera captures images of products on a conveyor belt and processes the frames with an image classification model. If a defect is detected, the product can be discarded automatically without manual intervention.

Although ML@Edge can address many use cases, there are complex architectural challenges that need to be solved in order to have a secure, robust, and reliable design. In this post, you learn some details about ML@Edge, related topics, and how to use AWS services to overcome these challenges and implement a complete solution for your ML at the edge workload.

ML@Edge overview

There is common confusion between ML@Edge and the Internet of Things (IoT), so it’s important to clarify how ML@Edge differs from IoT and how the two can come together to provide a powerful solution in certain cases.

An edge solution that uses ML@Edge has two main components: an edge application and an ML model (invoked by the application) running on the edge device. ML@Edge is about controlling the lifecycle of one or more ML models deployed to a fleet of edge devices. The ML model lifecycle can start on the cloud side (on Amazon SageMaker, for instance) but normally ends on a standalone deployment of the model on the edge device. Each scenario demands different ML model lifecycles that can be composed of many stages, such as data collection; data preparation; model building, compilation, and deployment to the edge device; model loading and running; and repeating the lifecycle.

The ML@Edge mechanism is not responsible for the application lifecycle. A different approach should be adopted for that purpose. Decoupling the ML model lifecycle and application lifecycle gives you the freedom and flexibility to keep evolving them at different paces. Imagine a mobile application that embeds an ML model as a resource like an image or XML file. In this case, each time you train a new model and want to deploy it to the mobile phones, you need to redeploy the whole application. This consumes time and money, and can introduce bugs to your application. By decoupling the ML model lifecycle, you publish the mobile app one time and deploy as many versions of the ML model as you need.

But how does IoT correlate to ML@Edge? IoT relates to physical objects embedded with technologies like sensors, processing ability, and software. These objects are connected to other devices and systems over the internet or other communication networks, in order to exchange data. The following figure illustrates this architecture. The concept was initially created when thinking of simple devices that just collect data from the edge, perform simple local processing, and send the result to a more powerful computing unit that runs analytics processes that help people and companies in their decision-making. The IoT solution is responsible for controlling the edge application lifecycle. For more information about IoT, refer to Internet of things.

If you already have an IoT application, you can add ML@Edge capabilities to make the product more efficient, as shown in the following figure. Keep in mind that ML@Edge doesn’t depend on IoT, but you can combine them to create a more powerful solution. When you do that, you improve the potential of your simple device to generate real-time insights for your business faster than just sending data to the cloud for later processing.

If you’re creating a new edge solution from scratch with ML@Edge capabilities, it’s important to design a flexible architecture that supports both the application and ML model lifecycles. We provide some reference architectures for edge applications with ML@Edge later in this post. But first, let’s dive deeper into edge computing and learn how to choose the correct edge device for your solution, based on the restrictions of the environment.

Edge computing

Depending on how far the device is from the cloud or a big data center (base), three main characteristics of the edge devices need to be considered to maximize performance and longevity of the system: computing and storage capacity, connectivity, and power consumption. The following diagram shows three groups of edge devices that combine different specifications of these characteristics, depending on how far they are from the base.

The groups are as follows:

  • MECs (Multi-access Edge Computing) – MECs or small data centers, characterized by low or ultra-low latency and high bandwidth, are common environments where ML@Edge can bring benefits without big restrictions when compared to cloud workloads. 5G antennas and servers at factories, warehouses, laboratories, and so on with minimal energy constraints and with good internet connectivity offer different ways to run ML models on GPUs and CPUs, virtual machines, containers, and bare-metal servers.
  • Near edge – This is when mobility or data aggregation are requirements and the devices have some constraints regarding power consumption and processing power, but still have reliable connectivity, although with higher latency, limited throughput, and higher connectivity costs than MECs. Mobile applications, specific boards to accelerate ML models, or simple devices with the capacity to run ML models, covered by wireless networks, are included in this group.
  • Far edge – In this extreme scenario, edge devices have severe power consumption or connectivity constraints. Consequently, processing power is also restricted in many far edge scenarios. Agriculture, mining, surveillance and security, and maritime transportation are some areas where far edge devices play an important role. Simple boards, normally without GPUs or other AI accelerators, are common. They are designed to load and run simple ML models, save the predictions in a local database, and sleep until the next prediction cycle. The devices that need to process real-time data can have big local storages to avoid losing data.

Challenges

It’s common to have ML@Edge scenarios where you have hundreds or thousands (maybe even millions) of devices running the same models and edge applications. When you scale your system, it’s important to have a robust solution that can manage the number of devices that you need to support. This is a complex task and for these scenarios, you need to ask many questions:

  • How do I operate ML models on a fleet of devices at the edge?
  • How do I build, optimize, and deploy ML models to multiple edge devices?
  • How do I secure my model while deploying and running it at the edge?
  • How do I monitor my model’s performance and retrain it, if needed?
  • How do I eliminate the need to install a big framework like TensorFlow or PyTorch on my restricted device?
  • How do I expose one or multiple models with my edge application as a simple API?
  • How do I create a new dataset with the payloads and predictions captured by the edge devices?
  • How do I do all these tasks automatically (MLOps plus ML@Edge)?

In the next section, we provide answers to all these questions through example use cases and reference architectures. We also discuss which AWS services you can combine to build complete solutions for each of the explored scenarios. However, if you want to start with a very simple flow that describes how to use some of the services provided by AWS to create your ML@Edge solution, this is an example:

With SageMaker, you can easily prepare a dataset and build the ML models that are deployed to the edge devices. With Amazon SageMaker Neo, you can compile and optimize the model you trained for the specific edge device you chose. After compiling the model, you only need a light runtime to run it (provided by the service). Amazon SageMaker Edge Manager is responsible for managing the lifecycle of all ML models deployed to your fleet of edge devices. Edge Manager can manage fleets of millions of devices. An agent, installed on each of the edge devices, exposes the deployed ML models as an API to the application. The agent is also responsible for collecting metrics, payloads, and predictions that you can use for monitoring or building a new dataset to retrain the model if needed. Finally, with Amazon SageMaker Pipelines, you can create an automated pipeline with all the steps required to build, optimize, and deploy ML models to your fleet of devices. This automated pipeline can then be triggered by simple events you define, without human intervention.
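
As a small, hedged illustration of the Neo step in that flow, a compilation job can be created with the AWS SDK for Python. The job name, role, S3 locations, framework, input shape, and target device below are placeholders, not a prescription for your workload.

import boto3

sm = boto3.client("sagemaker")

# Compile a trained model for a specific edge target with SageMaker Neo
sm.create_compilation_job(
    CompilationJobName="gearbox-anomaly-jetson-v1",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerNeoRole",   # placeholder role
    InputConfig={
        "S3Uri": "s3://my-bucket/models/gearbox/model.tar.gz",   # trained model artifact
        "DataInputConfig": '{"input0": [1, 3, 224, 224]}',       # input name and shape
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "jetson_nano",                           # one of Neo's supported edge targets
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)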

Use case 1

Let’s say an airplane manufacturer wants to detect and track parts and tools in the production hangar. To improve productivity, all the required parts and correct tools need to be available for the engineers at each stage of production. We want to be able to answer questions like: Where is part A? or Where is tool B? We have multiple IP cameras already installed and connected to a local network. The cameras cover the entire hangar and can stream real-time HD video through the network.

AWS Panorama fits nicely in this case. AWS Panorama provides an ML appliance and managed service that lets you add computer vision (CV) to your existing fleet of Internet Protocol (IP) cameras and automate tasks that traditionally require human inspection and monitoring.

In the following reference architecture, we show the major components of the application running on an AWS Panorama Appliance. The Panorama Application SDK makes it easy to capture video from camera streams, perform inference with a pipeline of multiple ML models, and process the results using Python code running inside a container. You can run models from any popular ML library such as TensorFlow, PyTorch, or TensorRT. The results from the model can be integrated with business systems on your local area network, allowing you to respond to events in real time.

The solution consists of the following steps:

  1. Connect and configure an AWS Panorama device to the same local network.
  2. Train an ML model (object detection) to identify parts and tools in each frame.
  3. Build an AWS Panorama Application that gets the predictions from the ML model, applies a tracking mechanism to each object, and sends the results to a real-time database.
  4. The operators can send queries to the database to locate the parts and tools.

Use case 2

For our next use case, imagine we’re creating a dashcam for vehicles capable of supporting the driver in many situations, such as avoiding pedestrians, based on a CV25 board from Ambarella. Hosting ML models on a device with limited system resources can be difficult. In this case, let’s assume we already have a well-established over-the-air (OTA) delivery mechanism in place to deploy the application components needed on to the edge device. However, we would still benefit from the ability to do OTA deployment of the model itself, thereby decoupling the application lifecycle from the model lifecycle.

Amazon SageMaker Edge Manager and Amazon SageMaker Neo fit well for this use case.

Edge Manager makes it easy for ML edge developers to use the same familiar tools in the cloud or on edge devices. It reduces the time and effort required to get models to production, while allowing you to continuously monitor and improve model quality across your device fleet. SageMaker Edge includes an OTA deployment mechanism that helps you deploy models on the fleet independent of the application or device firmware. The Edge Manager agent allows you to run multiple models on the same device. The agent collects prediction data based on the logic that you control, such as intervals, and uploads it to the cloud so that you can periodically retrain your models over time. SageMaker Edge cryptographically signs your models so you can verify that they weren’t tampered with as they move from the cloud to the edge device.
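
To give a feel for how such a signed deployment package is produced, the following is a minimal sketch of a SageMaker Edge packaging job via boto3. The job name, model name, role, and bucket paths are hypothetical placeholders, and the compilation job name refers to a Neo compilation job like the one sketched earlier.

import boto3

sm_client = boto3.client("sagemaker")

# Package the Neo-compiled model for the SageMaker Edge agent.
# All names and paths are illustrative placeholders.
sm_client.create_edge_packaging_job(
    EdgePackagingJobName="dashcam-model-packaging-job",
    CompilationJobName="edge-model-compilation-job",  # the Neo job that produced the compiled model
    ModelName="dashcam-pedestrian-detector",
    ModelVersion="1.0",
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    OutputConfig={"S3OutputLocation": "s3://my-ml-at-edge-bucket/edge-packages/"},
)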

Neo is a compiler as a service and an especially good fit in this use case. Neo automatically optimizes ML models for inference on cloud instances and edge devices to run faster with no loss in accuracy. You start with an ML model built with one of the supported frameworks and trained in SageMaker or anywhere else. Then you choose your target hardware platform (refer to the list of supported devices). With a single click, Neo optimizes the trained model and compiles it into a package that can be run using the lightweight SageMaker Edge runtime. The compiler uses an ML model to apply the performance optimizations that extract the best available performance for your model on the cloud instance or edge device. You then deploy the model as a SageMaker endpoint or on supported edge devices and start making predictions.

The following diagram illustrates this architecture.

The solution workflow consists of the following steps:

  1. The developer builds, trains, validates, and creates the final model artifact that needs to be deployed to the dashcam.
  2. Invoke Neo to compile the trained model.
  3. The SageMaker Edge agent is installed and configured on the Edge device, in this case the dashcam.
  4. Create a deployment package with a signed model and the runtime used by the SageMaker Edge agent to load and invoke the optimized model.
  5. Deploy the package using the existing OTA deployment mechanism.
  6. The edge application interacts with the SageMaker Edge agent to do inference.
  7. The agent can be configured (if required) to send real-time sample input data from the application for model monitoring and refinement purposes.

Use case 3

Suppose your customer is developing an application that detects anomalies in the mechanisms of a wind turbine (like the gearbox, generator, or rotor). The goal is to minimize damage to the equipment by running local protection procedures on the fly. These turbines are very expensive and located in places that aren’t easily accessible. Each turbine can be outfitted with an NVIDIA Jetson device to monitor sensor data from the turbine. We then need a solution to capture the data and use an ML algorithm to detect anomalies. We also need an OTA mechanism to keep the software and ML models on the device up to date.

AWS IoT Greengrass V2 along with Edge Manager fit well in this use case. AWS IoT Greengrass is an open-source IoT edge runtime and cloud service that helps you build, deploy, and manage IoT applications on your devices. You can use AWS IoT Greengrass to build edge applications using pre-built software modules, called components, that can connect your edge devices to AWS services or third-party services. This ability of AWS IoT Greengrass makes it easy to deploy assets to devices, including a SageMaker Edge agent. AWS IoT Greengrass is responsible for managing the application lifecycle, while Edge Manager decouples the ML model lifecycle. This gives you the flexibility to keep evolving the whole solution by deploying new versions of the edge application and ML models independently. The following diagram illustrates this architecture.

The solution consists of the following steps:

  1. The developer builds, trains, validates, and creates the final model artifact that needs to be deployed to the wind turbine.
  2. Invoke Neo to compile the trained model.
  3. Create a model component using Edge Manager with AWS IoT Greengrass V2 integration.
  4. Set up AWS IoT Greengrass V2.
  5. Create an inference component using AWS IoT Greengrass V2 (a deployment sketch follows this list).
  6. The edge application interacts with the SageMaker Edge agent to do inference.
  7. The agent can be configured (if required) to send real-time sample input data from the application for model monitoring and refinement purposes.
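
As a rough sketch of steps 4 and 5, the following hypothetical boto3 call creates a Greengrass V2 deployment that pushes the model and inference components, plus the SageMaker Edge Manager component, to the turbine’s core device. The target ARN, component names, and component versions are placeholders for illustration only.

import boto3

gg_client = boto3.client("greengrassv2")

# Deploy the application and model components to the Jetson-based core device.
# The target ARN, component names, and versions are illustrative placeholders.
gg_client.create_deployment(
    targetArn="arn:aws:iot:us-east-1:111122223333:thing/wind-turbine-001",
    deploymentName="wind-turbine-anomaly-detection",
    components={
        "com.example.AnomalyInference": {"componentVersion": "1.0.0"},
        "com.example.WindTurbineModel": {"componentVersion": "1.0.0"},
        "aws.greengrass.SageMakerEdgeManager": {"componentVersion": "1.1.0"},
    },
)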

Use case 4

For our final use case, let’s look at a vessel transporting containers, where each container has a couple of sensors and streams a signal to the compute and storage infrastructure deployed locally. The challenge is that we want to know the content of each container, and the condition of the goods based on temperature, humidity, and gases inside each container. We also want to track all the goods in each one of the containers. There is no internet connectivity throughout the voyage, and the voyage can take months. The ML models running on this infrastructure should preprocess the data and generate information to answer all our questions. The data generated needs to be stored locally for months. The edge application stores all the inferences in a local database and then synchronizes the results with the cloud when the vessel approaches the port.

AWS Snowcone and AWS Snowball from the AWS Snow Family could fit very well in this use case.

AWS Snowcone is a small, rugged, and secure edge computing and data migration device. Snowcone is designed to the OSHA standard for a one-person liftable device. Snowcone enables you to run edge workloads using Amazon Elastic Compute Cloud (Amazon EC2) computing, and local storage in harsh, disconnected field environments such as oil rigs, search and rescue vehicles, military sites, or factory floors, as well as remote offices, hospitals, and movie theaters.

Snowball offers more compute than Snowcone and therefore may be a great fit for more demanding applications. The Snowball Edge Compute Optimized option provides an optional NVIDIA Tesla V100 GPU along with EC2 instances to accelerate an application’s performance in disconnected environments. With the GPU option, you can run applications such as advanced ML and full motion video analysis in environments with little or no connectivity.

On top of the EC2 instance, you have the freedom to build and deploy any type of edge solution. For instance, you can use Amazon ECS or another container manager to deploy the edge application, the Edge Manager agent, and the ML model as individual containers. This architecture would be similar to use case 2 (except that it works offline most of the time), with the addition of a container manager tool.

The following diagram illustrates this solution architecture.

To implement this solution, simply order your Snow device from the AWS Management Console and launch your resources.

Conclusion

In this post, we discussed the different aspects of edge that you may choose to work with based on your use case. We also discussed some of the key concepts around ML@Edge and how decoupling the application lifecycle and the ML model lifecycle gives you the freedom to evolve them without any dependency on each other. We emphasized how choosing the right edge device for your workload and asking the right questions during the solution process can help you work backward and narrow down the right AWS services. We also presented different use cases along with reference architectures to inspire you to create your own solutions that will work for your workload.


About the Authors

Dinesh Kumar Subramani is a Senior Solutions Architect with the UKIR SMB team, based in Edinburgh, Scotland. He specializes in artificial intelligence and machine learning. Dinesh enjoys working with customers across industries to help them solve their problems with AWS services. Outside of work, he loves spending time with his family, playing chess and enjoying music across genres.

Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers create AI/ML solutions that solve their business challenges using AWS. He has been working on several AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. He likes playing with hardware and automation projects in his free time, and he has a particular interest in robotics.

Read More

Text summarization with Amazon SageMaker and Hugging Face

In this post, we show you how to implement one of the most downloaded Hugging Face pre-trained models used for text summarization, DistilBART-CNN-12-6, within a Jupyter notebook using Amazon SageMaker and the SageMaker Hugging Face Inference Toolkit. Based on the steps shown in this post, you can try summarizing text from the WikiText-2 dataset managed by fast.ai, available at the Registry of Open Data on AWS.

Global data volumes are growing at zettabyte scale as companies and consumers expand their use of digital products and online services. To better understand this growing data, machine learning (ML) natural language processing (NLP) techniques for text analysis have evolved to address use cases involving text summarization, entity recognition, classification, translation, and more. AWS offers pre-trained AWS AI services that can be integrated into applications using API calls and require no ML experience. For example, Amazon Comprehend can perform NLP tasks such as custom entity recognition, sentiment analysis, key phrase extraction, topic modeling, and more to gather insights from text. It can perform text analysis on a wide variety of languages for its various features.

Text summarization is a helpful technique in understanding large amounts of text data because it creates a subset of contextually meaningful information from source documents. You can apply this NLP technique to longer-form text documents and articles, enabling quicker consumption and more effective document indexing, for example to summarize call notes from meetings.

Hugging Face is a popular open-source library for NLP, with over 49,000 pre-trained models in more than 185 languages with support for different frameworks. AWS and Hugging Face have a partnership that allows a seamless integration through SageMaker with a set of AWS Deep Learning Containers (DLCs) for training and inference in PyTorch or TensorFlow, and Hugging Face estimators and predictors for the SageMaker Python SDK. These capabilities in SageMaker help developers and data scientists get started with NLP on AWS more easily. Processing texts with transformers in deep learning frameworks such as PyTorch is typically a complex and time-consuming task for data scientists, often leading to frustration and lack of efficiency when developing NLP projects. The rise of AI communities like Hugging Face, combined with the power of ML services in the cloud like SageMaker, accelerates and simplifies the development of these text processing tasks. SageMaker helps you build, train, deploy, and operationalize Hugging Face models.

Text summarization overview

You can apply text summarization to identify key sentences within a document or identify key sentences across multiple documents. Text summarization can produce two types of summaries: extractive and abstractive. Extractive summaries don’t contain any machine-generated text and are a collection of important sentences selected from the input document. Abstractive summaries contain new human-readable phrases and sentences generated by the text summarization model. Most text summarization systems are based on extractive summarization because accurate abstractive text summarization is difficult to achieve.

Hugging Face has over 400 pre-trained state-of-the-art text summarization models available, implementing different combinations of NLP techniques. These models are trained on different datasets, uploaded and maintained by technology companies and members of the Hugging Face community. You can filter the models by most downloaded or most liked, and directly load them when using the summarization pipeline Hugging Face transformer API. The Hugging Face transformer simplifies the NLP implementation process so that high-performance NLP models can be fine-tuned to deliver text summaries, without requiring extensive ML operation knowledge.

Hugging Face text summarization models on AWS

SageMaker offers business analysts, data scientists, and MLOps engineers a choice of tools to design and operate ML workloads on AWS. These tools provide you with faster implementation and testing of ML models to achieve your optimal outcomes.

Using the SageMaker Hugging Face Inference Toolkit, an open-source library, we outline three different ways to implement and host Hugging Face text summarization models from a Jupyter notebook:

  • Hugging Face summarization pipeline – Create a Hugging Face summarization pipeline using the “summarization” task identifier to use a default text summarization model for inference within your Jupyter notebook. These pipelines abstract away the complex code, offering novice ML practitioners a simple API to quickly implement text summarization without configuring an inference endpoint. The pipeline also allows the ML practitioner to select a specific pre-trained model and its associated tokenizer (a short sketch of this follows the list). Tokenizers prepare text as input for the model by splitting it into words or subwords, which are then converted to IDs through a lookup table. For simplicity, the following code snippet shows the default case when using pipelines. The DistilBART-CNN-12-6 model is one of the most downloaded summarization models on Hugging Face and is the default model for the summarization pipeline. The last line calls the pre-trained model to get a summary for the passed text given the two provided arguments.

    from transformers import pipeline
    
    # Create a summarization pipeline with the default model (DistilBART-CNN-12-6)
    summarizer = pipeline("summarization")
    # Summarize the input text, constraining the summary length
    summarizer("An apple a day, keeps the doctor away", min_length=5, max_length=20)

  • SageMaker endpoint with pre-trained model – Deploy a pre-trained model from the Hugging Face Model Hub to a SageMaker inference endpoint, such as the ml.m5.xlarge instance in the following code snippet. This method allows experienced ML practitioners to quickly select specific open-source models, fine-tune them, and deploy the models onto high-performing inference instances.

    from sagemaker.huggingface import HuggingFaceModel
    from sagemaker import get_execution_role
    
    role = get_execution_role()
    
    # Hub Model configuration. https://huggingface.co/models
    hub = {
      'HF_MODEL_ID':'sshleifer/distilbart-cnn-12-6',
      'HF_TASK':'summarization'
    }
    
    # create Hugging Face Model Class
    huggingface_model = HuggingFaceModel(
        transformers_version='4.17.0',
        pytorch_version='1.10.2',
        py_version='py38',
        env=hub,
        role=role,
    )
    
    # deploy model to SageMaker Inference
    predictor = huggingface_model.deploy(initial_instance_count=1,instance_type="ml.m5.xlarge")

  • SageMaker endpoint with a trained model – Deploy a trained model stored in an Amazon Simple Storage Service (Amazon S3) bucket to a SageMaker inference endpoint. This method allows experienced ML practitioners to quickly deploy their own models stored on Amazon S3 onto high-performing inference instances. The model itself is downloaded from Hugging Face and compressed, and then can be uploaded to Amazon S3. This step is demonstrated in the following code snippet:

    from sagemaker.huggingface import HuggingFaceModel
    from sagemaker import get_execution_role
    
    role = get_execution_role()
    
    # create Hugging Face Model Class
    huggingface_model = HuggingFaceModel(
        transformers_version='4.17.0',
        pytorch_version='1.10.2',
        py_version='py38',
        model_data='s3://my-trained-model/artifacts/model.tar.gz',
        role=role,
    )
    
    # deploy model to SageMaker Inference
    predictor = huggingface_model.deploy(initial_instance_count=1,instance_type="ml.m5.xlarge")
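
Relating back to the first option (the Hugging Face summarization pipeline), the following short sketch shows how to select a specific pre-trained model and its tokenizer explicitly instead of relying on the pipeline default; the model name simply mirrors the DistilBART-CNN-12-6 example used in this post.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# Explicitly choose a pre-trained summarization model and its tokenizer
# instead of relying on the pipeline default.
model_name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
print(summarizer("An apple a day, keeps the doctor away", min_length=5, max_length=20))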

AWS has several resources available to assist you in deploying your ML workloads. The Machine Learning Lens of the AWS Well-Architected Framework recommends best practices for ML workloads, including optimizing resources and reducing cost. These recommended design principles help ensure that well-architected ML workloads on AWS are deployed to production. Amazon SageMaker Inference Recommender helps you select the right instance to deploy your ML models at optimal inference performance and cost. Inference Recommender speeds up model deployment and reduces time to market by automating load testing and optimizing model performance across ML instances.

In the next sections, we demonstrate how to load a trained model from an S3 bucket and deploy it to a suitable inference instance.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Load the Hugging Face model to SageMaker for text summarization inference

Use the following code to download the Hugging Face pre-trained text summarization model DistilBART-CNN-12-6 and its tokenizer, and save them locally in SageMaker to your Jupyter notebook directory:

from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

PRE_TRAINED_MODEL_NAME='sshleifer/distilbart-cnn-12-6'
hf_cache_dir='./hf_cache'  # local cache directory for the downloaded model weights

model = BartForConditionalGeneration.from_pretrained(PRE_TRAINED_MODEL_NAME, cache_dir=hf_cache_dir)
model.save_pretrained('./models/bart_model/')

tokenizer = BartTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
tokenizer.save_pretrained('./models/bart_tokenizer/')

Compress the saved text summarization model and its tokenizer into tar.gz format and upload the compressed model artifact to an S3 bucket:

! tar -C models/ -czf model.tar.gz code/ bart_tokenizer/ bart_model/
from sagemaker.s3 import S3Uploader

file_key = 'model.tar.gz'
model_artifact = S3Uploader.upload(file_key,'s3://my-trained-model/artifacts')

Select an inference Docker container image to perform the text summarization inference. Specify the Hugging Face Transformers version, the underlying PyTorch framework version, and the Amazon Elastic Compute Cloud (Amazon EC2) instance type that will run the container.

The Docker image is hosted in Amazon Elastic Container Registry (Amazon ECR), and the retrieve call returns that container image as a URI.

import boto3
from sagemaker.image_uris import retrieve

region = boto3.Session().region_name  # AWS Region used to look up the container image

deploy_instance_type = 'ml.m5.xlarge'

pytorch_inference_image_uri = retrieve('huggingface',
                                       region=region,
                                       version='4.6.1',
                                       instance_type=deploy_instance_type,
                                       base_framework_version='pytorch1.8.1',
                                       image_scope='inference')

Define the text summarization model to be deployed by the selected container image performing inference. In the following code snippet, the compressed model uploaded to Amazon S3 is deployed:

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker import get_execution_role

role = get_execution_role()

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data="s3://my-trained-model/artifacts/model.tar.gz", # path to your trained sagemaker model
   image_uri=pytorch_inference_image_uri,
   role=role, # iam role with permissions to create an Endpoint
   transformers_version="4.6.1", # transformers version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1, 
   instance_type="ml.m5.xlarge"
)

Test the deployed text summarization model on a sample input:

# example request; the payload format must match what the deployed model expects
data = {
   "text": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days."
}

# request
predictor.predict(data)

Use Inference Recommender to evaluate the optimal EC2 instance for the inference task

Next, create multiple payload samples of input text in JSON format and compress them into a single payload file. These payload samples are used by the Inference Recommender to compare inference performance between different EC2 instance types. Each of the sample payloads must match the JSON format shown earlier. You can get examples from the WikiText-2 dataset managed by fast.ai, available at the Registry of Open Data on AWS.
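
The following is a minimal sketch of this step, assuming the same JSON payload format shown earlier; the sample sentences, directory, and file names are placeholders.

import json
import os
import tarfile

# Write a few sample payloads matching the JSON format shown earlier.
# The sample sentences are placeholders; in practice you could use text
# from the WikiText-2 dataset.
samples = [
    "Global data volumes are growing at zettabyte scale as companies expand their use of digital products.",
    "Text summarization creates a subset of contextually meaningful information from source documents.",
]

os.makedirs("sample-payload", exist_ok=True)
for i, text in enumerate(samples):
    with open(f"sample-payload/payload_{i}.json", "w") as f:
        json.dump({"text": text}, f)

# Compress all payload samples into a single archive for Inference Recommender.
with tarfile.open("payload.tar.gz", "w:gz") as tar:
    for name in os.listdir("sample-payload"):
        tar.add(os.path.join("sample-payload", name), arcname=name)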

Upload the compressed text summarization model artifact and the compressed sample payload file to the S3 bucket. We uploaded the model in an earlier step, but for clarity we include the code to upload it again:

import sagemaker

bucket = sagemaker.Session().default_bucket()

prefix = "sagemaker/inference-recommender"

model_archive_name = "model.tar.gz"
payload_archive_name = "payload.tar.gz"

sample_payload_url = sagemaker.Session().upload_data(
    payload_archive_name, bucket=bucket, key_prefix=prefix + "/inference"
)
model_url = sagemaker.Session().upload_data(
    model_archive_name, bucket=bucket, key_prefix=prefix + "/model"
)

Review the list of standard ML models available on SageMaker across common model zoos, such as NLP and computer vision. Select an NLP model to perform the text summarization inference:

import boto3
import pandas as pd

inference_client = boto3.client("sagemaker", region)

list_model_metadata_response = inference_client.list_model_metadata()

domains = []
frameworks = []
framework_versions = []
tasks = []
models = []

for model_summary in list_model_metadata_response["ModelMetadataSummaries"]:
    domains.append(model_summary["Domain"])
    tasks.append(model_summary["Task"])
    models.append(model_summary["Model"])
    frameworks.append(model_summary["Framework"])
    framework_versions.append(model_summary["FrameworkVersion"])

data = {
    "Domain": domains,
    "Task": tasks,
    "Framework": frameworks,
    "FrameworkVersion": framework_versions,
    "Model": models,
}

df = pd.DataFrame(data)

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)
pd.set_option("display.colheader_justify", "center")
pd.set_option("display.precision", 3)

display(df.sort_values(by=["Domain", "Task", "Framework", "FrameworkVersion"]))

The following example uses the bert-base-cased NLP model. Register the text summarization model into the SageMaker model registry with the correctly identified domain, framework, and task from the previous step. The parameters for this example are shown at the beginning of the following code snippet.

Note the range of EC2 instance types to be evaluated by Inference Recommender under SupportedRealtimeInferenceInstanceTypes in the following code. Make sure that the service limits for the AWS account allow the deployment of these types of inference nodes.

ml_domain = "NATURAL_LANGUAGE_PROCESSING"
ml_task = "FILL_MASK"
model_name = "bert-base-cased"
dlc_uri = pytorch_inference_image_uri
framework = 'PYTORCH'
framework_version='1.6.0'

import uuid

inference_client = boto3.client("sagemaker", region)

model_package_group_name = uuid.uuid1()

model_package_group_response = inference_client.create_model_package_group(
    ModelPackageGroupName=str(model_package_group_name), ModelPackageGroupDescription="description"
)

model_package_version_response = inference_client.create_model_package(
    ModelPackageGroupName=str(model_package_group_name),
    ModelPackageDescription="InferenceRecommenderDemo",
    Domain=ml_domain,
    Task=ml_task,
    SamplePayloadUrl=sample_payload_url,
    InferenceSpecification={
        "Containers": [
            {
                "ContainerHostname": "huggingface-pytorch",
                "Image": dlc_uri,
                "ModelDataUrl": model_url,
                "Framework": framework,
                "FrameworkVersion": framework_version,
                "NearestModelName": model_name,
                "Environment": {
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                    "SAGEMAKER_PROGRAM": "inference.py",
                    "SAGEMAKER_REGION": region,
                    "SAGEMAKER_SUBMIT_DIRECTORY": model_url,
                },
            },
        ],
        "SupportedRealtimeInferenceInstanceTypes": [
            "ml.t2.xlarge",
            "ml.c5.xlarge",
            "ml.m5.xlarge",
            "ml.m5d.xlarge",
            "ml.r5.xlarge",
            "ml.inf1.xlarge",
        ],
        "SupportedContentTypes": [
            "application/json",
        ],
        "SupportedResponseMIMETypes": ["application/json"],
    },
)

Create an Inference Recommender default job using the ModelPackageVersion resulting from the previous step. The uuid Python library is used to generate a unique name for the job.

from sagemaker import get_execution_role

client = boto3.client("sagemaker", region)

role = get_execution_role()
default_job = uuid.uuid1()
default_response = client.create_inference_recommendations_job(
    JobName=str(default_job),
    JobDescription="Job Description",
    JobType="Default",
    RoleArn=role,
    InputConfig={"ModelPackageVersionArn": model_package_version_response["ModelPackageArn"]},
)

You can get the status of the Inference Recommender job by running the following code:

inference_recommender_job = client.describe_inference_recommendations_job(
        JobName=str(default_job)
)
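
If you want to wait for the job programmatically, the following simple polling sketch builds on the same describe call; the set of terminal status values is an assumption based on the typical SageMaker job lifecycle.

import time

# Poll until the Inference Recommender job reaches a terminal state.
while True:
    inference_recommender_job = client.describe_inference_recommendations_job(
        JobName=str(default_job)
    )
    status = inference_recommender_job["Status"]
    if status in ("COMPLETED", "FAILED", "STOPPED"):
        print(f"Job ended with status: {status}")
        break
    print(f"Job status: {status}, waiting...")
    time.sleep(60)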

When the job status is COMPLETED, compare the inference latency, runtime, and other metrics of the EC2 instance types evaluated by the Inference Recommender default job. Select the suitable node type based on your use case requirements.

data = [
    {**x["EndpointConfiguration"], **x["ModelConfiguration"], **x["Metrics"]}
    for x in inference_recommender_job["InferenceRecommendations"]
]
df = pd.DataFrame(data)
df.drop("VariantName", inplace=True, axis=1)
pd.set_option("max_colwidth", 400)
df.head()

Conclusion

SageMaker offers multiple ways to use Hugging Face models; for more examples, check out the AWS Samples GitHub. Depending on the complexity of the use case and the need to fine-tune the model, you can select the optimal way to use these models. The Hugging Face pipelines can be a good starting point to quickly experiment and select suitable models. When you need to customize and parameterize the selected models, you can download the models and deploy them to customized inference endpoints. To fine-tune the model more for a specific use case, you’ll need to train the model after downloading it.

NLP models in general, including text summarization models, perform better after being trained on a dataset that is specific to the use case. The MLOps and model monitoring features of SageMaker help make sure that the deployed model continues to perform within expectations. In this post, we used Inference Recommender to evaluate the best suited instance type to deploy the text summarization model. These recommendations can optimize performance and cost for your ML use case.


About the Authors

Dr. Nidal AlBeiruti is a Senior Solutions Architect at Amazon Web Services, with a passion for machine learning solutions. Nidal has over 25 years of experience working in a variety of global IT roles at different levels and verticals. Nidal acts as a trusted advisor for many AWS customers to support and accelerate their cloud adoption journey.

Darren Ko is a Solutions Architect based in London. He advises UK and Ireland SMB customers on rearchitecting and innovating on the cloud. Darren is interested in applications built with serverless architectures and he is passionate about solving sustainability challenges with machine learning.

Read More

Profiling XNNPACK with TFLite

Posted by Alan Kelly, Software Engineer

We are happy to share that detailed profiling information for XNNPACK is now available in TensorFlow 2.9.1 and later. XNNPACK is a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 platforms, and it is the default TensorFlow Lite CPU inference engine for floating-point models.

The most common and expensive neural network operators, such as fully connected layers and convolutions, are executed by XNNPACK so that you get the best performance possible from your model. Historically, the profiler measured the runtime of the entire delegated section of the graph, meaning that the runtime of all delegated operators was accumulated in one result, making it difficult to identify which individual operators were slow.

Previous TFLite profiling results when XNNPACK was used. The runtime of all delegated operators was accumulated in one row.

If you are using TensorFlow Lite 2.9.1 or later, you get a per-operator profile even for the section that is delegated to XNNPACK, so you no longer need to choose between fast inference and detailed performance information. The operator name, data layout (for example, NHWC), data type (for example, FP32), and microkernel type (if applicable) are shown.

New detailed per-operator profiling information is now shown. The operator name, data layout, data type, and microkernel type are visible.

Now you get lots of helpful information, such as the runtime per operator and the percentage of the total runtime that each accounts for. The runtime of each node is given in the order in which the nodes were executed. The most expensive operators are also listed.

The most expensive operators are listed. In this example, you can see that a deconvolution accounted for 33.91% of the total runtime.

XNNPACK can also perform inference in half-precision (16-bit) floating-point format when three conditions are met: the hardware supports these operations natively, IEEE FP16 inference is supported for every floating-point operator in the model, and the model’s `reduced_precision_support` metadata indicates that it is compatible with FP16 inference. FP16 inference can also be forced. More information is available here. If half precision has been used, then F16 will be present in the Name column:

FP16 inference has been used.

Here, unsigned quantized inference has been used (QU8).

QU8 indicates that unsigned quantized inference has been used

And finally, sparse inference has been used. Sparse operators require that the data layout change from NHWC to NCHW as this is more efficient. This can be seen in the operator name.

SPMM microkernel indicates that the operator is evaluated via SParse matrix-dense Matrix Multiplication. Note that sparse inference use NCHW layout (vs the typical NHWC) for the operators.

Note that when some operators are delegated to XNNPACK, and others aren’t, two sets of profile information are shown. This happens when not all operators in the model are supported by XNNPACK. The next step in this project is to merge profile information from XNNPACK operators and TensorFlow Lite into one profile.

Next Steps

You can learn more about performance measurement and profiling in TensorFlow Lite by visiting this guide. Thanks for reading!

Read More

Take your intelligent search experience to the next level with Amazon Kendra hierarchical facets

Unstructured data continues to grow in many organizations, making it a challenge for users to get the information they need. Amazon Kendra is a highly accurate, intelligent search service powered by machine learning (ML). Amazon Kendra uses deep learning and reading comprehension to deliver precise answers, and returns a list of ranked documents that match the search query for you to choose from. To help users interactively narrow down the list of relevant documents, you can assign metadata at the time of document ingestion to provide filtering and faceting capabilities.

In a search solution with a growing number of documents, simple faceting or filtering isn’t always sufficient to enable users to really pinpoint documents with the information they’re looking for. Amazon Kendra now features hierarchical facets, which provide a more granular view of the scope of the search results. Hierarchical facets offer filtering options with more detail about the number of results expected for each option, and allow users to further narrow their search and pinpoint their documents of interest quickly.

In this post, we demonstrate what hierarchical facets in Amazon Kendra can do. We first ingest a set of documents, along with their metadata, into an Amazon Kendra index. We then make search queries using both simple and hierarchical facets, and add filtering to get straight to the documents of interest.

Solution overview

Instead of presenting each facet individually as a list, hierarchical facets enable defining a parent-child relationship between facets to shape the scope of the search results. With this, you see the number of results that not only have a particular facet, but also have each of the sub-facets. Let’s take the example of a repository of AWS documents of types User_Guides, Reference_Guides and Release_Notes, regarding compute, storage, and database technologies.

First let’s look at non-hierarchical facets from the response to a search query:

Technology
  Databases:23
  Storage:22
  Compute:15
Document_Type
  User_Guides:37
  Reference_Guides:18
  Release_Notes:5

Here we know the number of search results in each of the technologies, as well as each of the document types. However, we don’t know, for example, how many results to expect from User_Guides related to Storage, except that it’s at most 22, the smaller of the two counts (User_Guides:37 and Storage:22).

Now let’s look at hierarchical facets from the response to the same search query:

Technology
  Databases:23
    Document_Type
      User_Guides:12
      Reference_Guides:7
      Release_Notes:4
  Storage:22
    Document_Type
      User_Guides:16
      Reference_Guides:6
  Compute:15
    Document_Type
      User_Guides:9
      Reference_Guides:5
      Release_Notes:1

With hierarchical facets, we get more information in terms of the number of results from each document type within each technology. With this additional information, we know that there are 16 results from User_Guides related to Storage.

In the subsequent sections, we use this example to demonstrate the use of hierarchical facets to narrow down search results along with step-by-step instructions you can follow to try this out in your own AWS account. If you just want to read about this feature without running it yourself, you can refer to the Python script facet-search-query.py used in this post, and its output output.txt, and then jump to the section Search and filtering with facets without hierarchy.

Prerequisites

To deploy and experiment with the solution in this post, make sure that you have the following:

Set up the infrastructure and run the Python script to query the Amazon Kendra index

To set up the solution, complete the following steps:

  1. Use the AWS Management Console for Amazon S3 to create an S3 bucket to use as a data source to store the sample documents.
  2. On the AWS Management Console, start CloudShell by choosing the shell icon on the navigation bar.
    Alternatively, you can run the Python script from any computer that has the AWS SDK for Python (Boto3) installed and an AWS account with access to the Amazon Kendra index. Make sure to update Boto3 on your computer. For simplicity, the step-by-step instructions in this post focus on CloudShell.
  3. After CloudShell starts, download facet-search-query.py to your local machine.
  4. Upload the script to your CloudShell by switching to the CloudShell tab, choosing the Actions menu, and choosing Upload file.
  5. Download hierarchical-facets-data.zip to your local machine, unzip it, and upload the entire directory structure to your S3 bucket.
  6. If you’re not using an existing Amazon Kendra index, create a new Amazon Kendra index.
  7. On the Amazon Kendra console, open your index.
  8. In the navigation pane, choose Facet definition.
  9. Choose Add field.
  10. Configure the field Document_Type and choose Add.
  11. Configure the field Technology and choose Add.
  12. Configure your S3 bucket as a data source to the Amazon Kendra index you just created.
  13. Sync the data source and wait for the sync to complete.
  14. Switch to the CloudShell tab.
  15. Update Boto3 by running pip3 install boto3==1.23.1 --upgrade.
    This ensures that CloudShell has a version of Boto3 that supports hierarchical facets.
  16. Edit facet-search-query.py and replace REPLACE-WITH-YOUR-AMAZON-KENDRA-INDEX-ID with your Amazon Kendra index ID.
    You can get the index ID by opening your index details on the Amazon Kendra console.
  17. In the CloudShell prompt, run facet-search-query.py using the command python3 facet-search-query.py | tee output.txt.

If this step fails with the error Unknown parameter in Facets[0]: “Facets”, must be one of: DocumentAttributeKey, CloudShell is still using an older version of Boto3 that doesn’t support hierarchical facets. In that case, choose the Actions menu, then choose Delete AWS CloudShell home directory, and repeat the steps to download facet-search-query.py, update Boto3, edit facet-search-query.py, and run it again. If you have any other data in the CloudShell home directory, back it up before running this step.

For convenience, all the steps are included in one Python script. You can read facet-search-query.py and experiment by copying parts of this script and making your own scripts. Edit output.txt to observe the search results.

Search and filtering with facets without hierarchy

Let’s start by querying with facets having no hierarchy. In this case, the facets parameter used in the query only provides the information that the results in the response should be faceted using two attributes: Technology and Document_Type. See the following code:

fac0 = [
    { "DocumentAttributeKey":"Technology" },
    { "DocumentAttributeKey":"Document_Type" }
]

This is used as a parameter to the query API call:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac0)

The formatted version of the response is as follows:

Query:  How to encrypt data?
Number of results: 62
Document Title:  developerguide
Document Attributes:
  Document_Type: User_Guides
  Technology: Databases
Document Excerpt:
  4. Choose the option that you want for encryption at rest. Whichever
  option you choose, you can't   change it after the cluster is
  created. • To encrypt data at rest in this cluster, choose Enable
  encryption. • If you don't want to encrypt data at rest in this
  cluster, choose Disable encryption.
----------------------------------------------------------------------
Facets:
  Technology
    Databases:23
    Storage:22
    Compute:16
  Document_Type
    User_Guides:37
    Reference_Guides:19
    Release_Notes:5
======================================================================

The first result from the response is from a User_Guide about Databases. The facets below the result show the number of results for Technology and Document_Type present in the response.
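
A formatted facet listing like the one above can be produced by walking the FacetResults structure in the query response; the following is a minimal sketch of such a traversal. The kclient, indexid, kquery, and fac0 variables follow the script's earlier definitions, and the helper function itself is illustrative.

def print_facets(facet_results, indent=2):
    # Recursively print facet counts, including nested (hierarchical) facets.
    for facet in facet_results:
        print(" " * indent + facet["DocumentAttributeKey"])
        for pair in facet.get("DocumentAttributeValueCountPairs", []):
            value = pair["DocumentAttributeValue"]["StringValue"]
            print(" " * (indent + 2) + f"{value}:{pair['Count']}")
            # Hierarchical facets appear as nested FacetResults on each value.
            if "FacetResults" in pair:
                print_facets(pair["FacetResults"], indent + 4)

response = kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac0)
print("Facets:")
print_facets(response["FacetResults"])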

Let’s narrow down these results to be only from User_Guides and Storage by setting the filter as follows:

att_filter0 = {
    "AndAllFilters": [
        {
            "EqualsTo":{
                "Key": "Technology",
                "Value": {
                    "StringValue": "Storage"
                }
            }
        },
        {
            "EqualsTo":{
                "Key": "Document_Type",
                "Value": {
                    "StringValue": "User_Guides"
                }
            }
        }
    ]
}

Now let’s make a query call using the facets without hierarchy and the preceding filter:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac0, AttributeFilter=att_filter0)

A formatted version of the response is as follows:

Query:  How to encrypt data?
Query Filter: Technology: Storage AND Document_Type: User_Guides
Number of results: 18
Document Title:  efs-ug
Document Attributes:
  Document_Type: User_Guides
  Technology: Storage
Document Excerpt:
  ,             "Action": [                 "kms:Describe*",
  "kms:Get*",                 "kms:List*",
  "kms:RevokeGrant"             ],             "Resource": "*"
  }     ] }   Encrypting data in transit You can encrypt data in
  transit using an Amazon EFS file sys
----------------------------------------------------------------------
Facets:
  Technology
    Storage:16
  Document_Type
    User_Guides:16

The response contains 16 results from User_Guides on Storage. Based on the non-hierarchical facets in the response without filters, we only knew to expect fewer than 22 results.

Search and filtering with hierarchical facets with Document_Type as a sub-facet of Technology

Now let’s run a query using hierarchical facets, with Document_Type defined as a sub-facet of Technology. This hierarchical relationship is important for a Technology-focused user such as an engineer. Note the nested facets in the following definition. The MaxResults parameter limits the response to the top MaxResults facet values. For our example, there are only three values each for Technology and Document_Type, so this parameter isn’t particularly useful here; it becomes useful when the number of facet values is high.

fac1 = [{
    "DocumentAttributeKey":"Technology",
    "Facets":[{
        "DocumentAttributeKey":"Document_Type",
        "MaxResults": max_results
    }],
}]

The query API call is made as follows:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac1)

The formatted version of the response is as follows:

Document Attributes:
  Document_Type: User_Guides
  Technology: Databases
Document Excerpt:
  4. Choose the option that you want for encryption at rest. Whichever
  option you choose, you can't   change it after the cluster is
  created. • To encrypt data at rest in this cluster, choose Enable
  encryption. • If you don't want to encrypt data at rest in this
  cluster, choose Disable encryption.
----------------------------------------------------------------------
Facets:
  Technology
    Databases:23
      Document_Type
        User_Guides:12
        Reference_Guides:7
        Release_Notes:4
    Storage:22
      Document_Type
        User_Guides:16
        Reference_Guides:6
    Compute:16
      Document_Type
        User_Guides:9
        Reference_Guides:6
        Release_Notes:1
======================================================================

The results are classified as per the Technology facet followed by Document_Type. In this case, looking at the facets, we know that 16 results are from User_Guides about Storage and 7 are from Reference_Guides related to Databases.

Let’s narrow down these results to be only from Reference_Guides related to Databases using the following filter:

att_filter1 = {
    "AndAllFilters": [
        {
            "EqualsTo":{
                "Key": "Technology",
                "Value": {
                    "StringValue": "Databases"
                }
            }
        },
        {
            "EqualsTo":{
                "Key": "Document_Type",
                "Value": {
                    "StringValue": "Reference_Guides"
                }
            }
        }
    ]
}

Now let’s make a query API call using the hierarchical facets with this filter:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac1, AttributeFilter=att_filter1)

The formatted response to this is as follows:

Query:  How to encrypt data?
Query Filter: Technology: Databases AND Document_Type: Reference_Guides
Number of results: 7
Document Title:  redshift-api
Document Attributes:
  Document_Type: Reference_Guides
  Technology: Databases
Document Excerpt:
  ...Constraints: Maximum length of 2147483647.   Required: No
  KmsKeyId   The AWS Key Management Service (KMS) key ID of the
  encryption key that you want to use to encrypt data in the cluster.
  Type: String   Length Constraints: Maximum length of 2147483647.
  Required: No LoadSampleData   A flag...
----------------------------------------------------------------------
Facets:
  Technology
    Databases:7
      Document_Type
        Reference_Guides:7
======================================================================

From the facets of this response, there are seven results, all from Reference_Guides related to Databases, exactly as we knew before making the query.

Search and filtering with hierarchical facets with Technology as a sub-facet of Document_Type

You can choose the hierarchical relationship between different facets at the time of querying. Let’s define Technology as the sub-facet of Document_Type, as shown in the following code. This hierarchical relationship would be important for a Document_Type-focused user such as a technical writer.

fac2 = [{
    "DocumentAttributeKey":"Document_Type",
    "Facets":[{
        "DocumentAttributeKey":"Technology",
        "MaxResults": max_results
    }]
}]

The query API call is made as follows:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac2)

The formatted response to this is as follows:

Query:  How to encrypt data?
Number of results: 62
Document Title:  developerguide
Document Attributes:
  Document_Type: User_Guides
  Technology: Databases
Document Excerpt:
  4. Choose the option that you want for encryption at rest. Whichever
  option you choose, you can't   change it after the cluster is
  created. • To encrypt data at rest in this cluster, choose Enable
  encryption. • If you don't want to encrypt data at rest in this
  cluster, choose Disable encryption.
----------------------------------------------------------------------
Facets:
  Document_Type
    User_Guides:37
      Technology
        Storage:16
        Databases:12
        Compute:9
    Reference_Guides:19
      Technology
        Databases:7
        Compute:6
        Storage:6
    Release_Notes:5
      Technology
        Databases:4
        Compute:1
======================================================================

The results are classified as per their Document_Type followed by Technology. In other words, reversing the hierarchical relationship results in transposing the matrix of scope of results as shown by the preceding facets. Six results are from Reference_Guides related to Compute. Let’s define the filter as follows:

att_filter2 = {
    "AndAllFilters": [
        {
            "EqualsTo":{
                "Key": "Document_Type",
                "Value": {
                    "StringValue": "Reference_Guides"
                }
            }
        },
        {
            "EqualsTo":{
                "Key": "Technology",
                "Value": {
                    "StringValue": "Compute"
                }
            }
        }
    ]
}

We use this filter to make the query API call:

kclient.query(IndexId=indexid, QueryText=kquery, Facets=fac2, AttributeFilter=att_filter2)

The formatted response to this is as follows:

Query:  How to encrypt data?
Query Filter: Document_Type: Reference_Guides AND Technology:Compute
Number of results: 7
Document Title:  ecr-api
Document Attributes:
  Document_Type: Reference_Guides
  Technology: Compute
Document Excerpt:
  When you use AWS KMS to encrypt your data, you can either use the
  default AWS managed AWS KMS key for Amazon ECR, or specify your own
  AWS KMS key, which you already created. For more information, see
  Protecting data using server-side encryption with an AWS KMS key
  stored in AWS Key Management Service
----------------------------------------------------------------------
Facets:
  Document_Type
    Reference_Guides:6
      Technology
        Compute:6
======================================================================

The results contain six Reference_Guides related to Compute, exactly as we knew before running the query.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Amazon Kendra connector for Amazon S3, delete that data source. If you created an Amazon S3 bucket to store the data used, delete that as well.

Conclusion

You can use Amazon Kendra hierarchical facets to define a hierarchical relationship between attributes to provide granular information about the scope of the results in the response to a query. This enables you to make an informed filtering choice to narrow down the search results and find the documents you’re looking for quickly.

To learn more about facets and filters in Amazon Kendra, refer to Filtering queries.

For more information on how you can automatically create, modify, or delete metadata, which you can use for faceting the search results, refer to Customizing document metadata during the ingestion process and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.


About the Authors

Abhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.

Ji Kim is a Software Development Engineer at Amazon Web Services and is a member of the Amazon Kendra team.

Read More

All-In-One Financial Services? Vietnam’s MoMo Has a Super-App for That

For younger generations, paper bills, loan forms and even cash might as well be in a museum. Smartphones in hand, their financial services largely take place online.

The financial-technology companies that serve them are in a race to develop AI that can make sense of the vast amount of data the companies collect — both to provide better customer service and to improve their own backend operations.

Vietnam-based fintech company MoMo has developed a super-app that includes payment and financial transaction processing in one self-contained online commerce platform. The convenience of this all-in-one mobile platform has already attracted over 30 million users in Vietnam.

To improve the efficiency of the platform’s chatbots, know-your-customer (eKYC) systems and recommendation engines, MoMo uses NVIDIA GPUs running in Google Cloud. It uses NVIDIA DGX systems for training and batch processing.

In just a few months, MoMo has achieved impressive results in speeding development of solutions that are more robust and easy to scale. Using NVIDIA GPUs for eKYC inference tasks has resulted in a 10x speedup compared to using CPU, the company says. For the MoMo Face Payment service, using TensorRT has reduced training and inference time by 10x.

AI Offers a Different Perspective

Tuan Trinh, director of data science at MoMo, describes his company’s use of AI as a way to get a different perspective on its business. One such project processes vast amounts of data and turns it into computerized visuals or graphs that can then be analyzed to improve connectivity between users in the app.

MoMo developed its own AI algorithm that uses over a billion data points to direct recommendations of additional services and products to its customers. These offerings help maintain a line of communication with the company’s user base that helps boost engagement and conversion.

The company also deploys a recommendation box on the home screen of its super-app. This caused its click-through rate to improve dramatically as the AI prompts customers with useful recommendations and keeps them engaged.

With AI, MoMo says it can process the habits of 10 million active users over the course of the last 30-60 days to train its predictive models. In addition, NVIDIA Triton Inference Server helps unify the serving flows for recommendation engines, which significantly reduces the effort to deploy AI applications in production environments. TensorRT has also contributed to a 3x performance improvement in MoMo’s payment services AI model inference, boosting the customer experience.

Chatbots Advance the Conversation

MoMo will use AI-powered chatbots to scale up faster when accommodating and engaging with users. Chatbot services are especially effective on mobile device apps, which tend to be popular with younger users, who often prefer them over making phone calls to customer service.

Chatbot users can inquire about a product and get the support they need to evaluate it before purchasing — all from one interface — which is essential for a super-app like MoMo’s that functions as a one-stop-shop.

The chatbots are also an effective vehicle for upselling or suggesting additional services, MoMo says. When combined with machine learning, it’s possible to categorize target audiences for different products or services to customize their experience with the app.

AI chatbots have the additional benefit of freeing up MoMo’s customer service team to handle other important tasks.

Better Credit Scoring

Credit history data from all of MoMo’s 30 million-plus users can be applied to models used for risk control of financial services by using AI algorithms. MoMo has applied credit scoring to the lending services within its super-app. Because the company doesn’t depend solely on traditional deep learning for less complex tasks, MoMo’s development team has been able to achieve higher accuracy with shorter processing times.

The MoMo app takes less than 2 seconds to make a lending decision but is still able to reduce its exposure to risky borrowers thanks to more accurate AI predictions. This helps keep customers from taking on too much debt and helps MoMo avoid missing out on potential revenue.

Since AI is capable of processing both structured and unstructured data, it’s able to incorporate information beyond traditional credit scores, like whether customers spend their money on necessities or luxuries, to assess a borrower’s risk more accurately.

Future of AI in Fintech

With fintechs increasingly applying AI to their massive data stores, MoMo’s team predicts the industry will need to evaluate how to do so in a way that keeps user data safe — or risk losing customer loyalty. MoMo already plans to expand its use of graph neural networks and models based on its proven ability to dramatically improve its operations.

The MoMo team also believes that AI could one day make credit scores obsolete. Since AI is able to make decisions based on broader unstructured data, it’s possible to determine loan approval by considering other risks besides a credit score. This would help open up the pool of potential users on fintech apps like MoMo’s to people in underserved and underbanked communities, who may not have credit scores, let alone “good” ones.

With around one in four American adults “underbanked,” which makes it more difficult for them to get a loan or credit card, and more than half of Africa’s population completely “credit invisible,” which refers to people without a bank or a credit score, MoMo believes AI could bring banking access to communities like these and open up a new user base for fintech apps at the same time.

Explore NVIDIA’s AI solutions and enterprise-level AI platforms driving innovation in financial services. 

The post All-In-One Financial Services? Vietnam’s MoMo Has a Super-App for That appeared first on NVIDIA Blog.

Read More