Get started with the Amazon Kendra Amazon WorkDocs connector

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra reimagines enterprise search for your websites and applications so your employees and customers can easily find the content they’re looking for, even when it’s scattered across multiple locations and content repositories within your organization.

With Amazon Kendra, you can search through troves of unstructured data and discover the right answers to your questions, when you need them. Amazon Kendra is a fully managed service, so there are no servers to provision, and no ML models to build, train, or deploy.

Amazon WorkDocs is a fully managed and secure content creation, storage, and collaboration service. With Amazon WorkDocs, you can easily create, edit, and share content. Moreover, because it’s stored centrally on AWS, you can access it from anywhere on any device.

In this post, we show how Amazon Kendra allows your users to search documents stored in Amazon WorkDocs.

Use case

For this post, we created a specific folder in Amazon WorkDocs containing a set of PDF and Microsoft Word documents whose content we want to search. The Amazon WorkDocs connector also allows you to ingest the comments on those documents.

The following screenshot shows the contents of a fictional WorkDocs folder called WorkdocsBlogpostDataset.

Create an Amazon WorkDocs connector

To create an Amazon WorkDocs connector, complete the following steps:

  1. On the Amazon Kendra console, choose Data sources.
  2. Choose Add data source.
  3. Under WorkDocs, choose Add connector.
  4. For Data source name, enter a name for your data source.
  5. Enter an optional description.
  6. Choose Next.
  7. In the Source section, choose the organization ID for your Amazon WorkDocs site.
  8. Create a new AWS Identity and Access Management (IAM) role for the data source.
  9. For Sync scope, select Crawl document comments and Use change logs.

For this post, we want Amazon Kendra to ingest the documents in the WorkdocsBlogpostDataset folder.

  1. In the Additional configuration section, enter WorkdocsBlogpostDataset as a path on the Include patterns tab.
  2. Choose Add.
  3. For Sync run schedule, choose Run on demand.
  4. Choose Next.
  5. In the WorkDocs field mapping section, use the default field mapping.
  6. Choose Next.
  7. Review the settings and choose Create.
  8. When the creation process is complete, choose Sync.

When the sync process is complete, you can see how many documents were ingested.

Now your documents are ready to be searched by Amazon Kendra.

  1. In the navigation pane, choose Search console.

You can now submit some test queries, as shown in the following screenshots.
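
If you prefer to query the index programmatically, you can also call the Amazon Kendra Query API. The following is a minimal sketch using the AWS SDK for Python (Boto3); the index ID and query text are placeholders.

import boto3

kendra = boto3.client("kendra")

# Placeholder index ID; you can find yours on the Amazon Kendra console
index_id = "YOUR-KENDRA-INDEX-ID"

response = kendra.query(
    IndexId=index_id,
    QueryText="vacation policy",
)

# Print the title and excerpt of each result
for item in response["ResultItems"]:
    title = item.get("DocumentTitle", {}).get("Text", "")
    excerpt = item.get("DocumentExcerpt", {}).get("Text", "")
    print(title, "-", excerpt)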

Also, with the Amazon WorkDocs connector, you can ingest feedback (comments) on your documents. For example, the following screenshot shows that this document has feedback.

The following screenshot shows what the feedback search experience looks like.

Conclusion

In this post, you created a data source and ingested your Amazon WorkDocs documents into your Amazon Kendra index. As a next step, you can try some more queries and see what kind of results you obtain. You can also dive deep into Amazon Kendra with the Amazon Kendra Essentials workshop or try the multilingual chatbot experience.


About the Author

Juan Bustos is an AI Services Specialist Solutions Architect at Amazon Web Services, based in Dallas, TX. Outside of work, he loves spending time writing and playing music as well as trying random restaurants with his family.

Vijai Gandikota is a Senior Product Manager at Amazon Web Services for Amazon Kendra.

Orchestrate XGBoost ML Pipelines with Amazon Managed Workflows for Apache Airflow

The ability to scale machine learning operations (MLOps) at an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and integrates with the latest tooling to best complete those tasks.

This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Why Airflow for orchestration

Customers choose Apache Airflow and specifically Amazon MWAA for several reasons, but three stand out:

  • Airflow is Python-based – Airflow, as a Python-based tool, enjoys the benefits of an imperative programming paradigm. This enables developers to programmatically define how tasks are to be done. Tools that are declarative, such as AWS Step Functions, only allow you to define what is to be done. When orchestrating ML pipelines, the ability to directly define the control flow is often required to navigate complex workflows.
  • Directed Acyclic Graph (DAG) workflow management – Airflow provides a DAG interface as a simple mechanism for defining and running complex workflows with dependencies. These DAG workflows are visualized through a GUI for operations management.
  • Extensibility – Airflow operators provide a structured way to perform common tasks using reusable modules. This capability is extensible and providers are free to develop custom Airflow operators that integrate with their tools and services. Many cloud-based services are supported. These operators provide useful abstraction, repeatability, and an API. In the context of big data and ML, these operators are especially valuable because they provide a way to orchestrate sometimes very long-running data pipelines or asynchronous ML processes such as model training.

Set up an Amazon MWAA environment

To create your Amazon MWAA environment, complete the following steps:

  1. On the Amazon MWAA console, choose Create environment.
  2. For Name, enter a unique name.
  3. For Airflow version, choose the version to use. For this post, we use Airflow v2.0.2. We also include code for Airflow v1.10.12.

  4. In the DAG code in Amazon S3 section, specify the Amazon Simple Storage Service (Amazon S3) bucket where Amazon MWAA can find the DAGs, plugins.zip file, and requirements.txt file.

Airflow configuration for XGBoost

An XGBoost model requires a specific configuration in the Amazon MWAA environment: the core.enable_xcom_pickling Airflow configuration option must be set to True. The reason is that the trained XGBoost model needs to be serialized so it can be saved as a file in Amazon S3, and certain Python objects (such as the model object or datetime values) can only be passed between tasks through XCom after being converted into a byte stream through a process called pickling.
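
You can set this option on the Amazon MWAA console under Airflow configuration options, or programmatically. The following is a minimal sketch using Boto3 against an existing environment; the environment name is a placeholder.

import boto3

mwaa = boto3.client("mwaa")

# Placeholder environment name; replace with your Amazon MWAA environment
mwaa.update_environment(
    Name="my-mwaa-environment",
    AirflowConfigurationOptions={
        "core.enable_xcom_pickling": "True",
    },
)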

Requirements.txt file

Upload a requirements.txt file to the Amazon S3 location you specified in the Amazon MWAA setup. To support this demonstration, the requirements.txt file should have the following entries:

boto3==1.17.49
sagemaker==1.72.0
s3fs==0.5.1

Orchestrate an XGBoost ML pipeline

Our ML pipeline is a simplified three-step pipeline:

  1. Data preprocessing using AWS Glue. Real pipelines could require numerous processing steps for data cleaning and feature engineering. Although Amazon SageMaker Pipelines provides similar functionality, we use AWS Glue to illustrate how different AWS services or third-party tools and services can be orchestrated in a single pipeline.
  2. Train an XGBoost model using a SageMaker training job.
  3. Deploy the trained model as a real-time inference endpoint.

Solution architecture

Our ML pipeline is pictured in the following diagram. We use an AWS Lambda function to invoke the DAG, and Amazon EventBridge to trigger the Lambda function on a schedule. For more information, see Tutorial: Schedule AWS Lambda functions using EventBridge.
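
The following is a minimal sketch of such a Lambda function, which triggers the DAG through the Amazon MWAA CLI endpoint. The environment name and DAG ID are placeholders, and error handling is omitted.

import base64
import json

import boto3
import urllib3

# Placeholder names; replace with your environment name and DAG ID
MWAA_ENV_NAME = "my-mwaa-environment"
DAG_ID = "my-xgboost-pipeline-dag"

http = urllib3.PoolManager()

def lambda_handler(event, context):
    # Request a short-lived CLI token for the Amazon MWAA environment
    mwaa = boto3.client("mwaa")
    token = mwaa.create_cli_token(Name=MWAA_ENV_NAME)

    # Call the Airflow CLI through the MWAA web server to trigger the DAG
    # (Airflow 2.x syntax; for Airflow v1.10.12 use "trigger_dag <dag_id>")
    resp = http.request(
        "POST",
        f"https://{token['WebServerHostname']}/aws_mwaa/cli",
        headers={
            "Authorization": f"Bearer {token['CliToken']}",
            "Content-Type": "text/plain",
        },
        body=f"dags trigger {DAG_ID}",
    )

    # The CLI output comes back base64 encoded
    payload = json.loads(resp.data.decode("utf-8"))
    output = base64.b64decode(payload["stdout"]).decode("utf-8")
    return {"statusCode": resp.status, "output": output}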

Stage the AWS Glue script

In our demo, we create the AWS Glue job dynamically using a PySpark script saved in Amazon S3. Copy the glue_etl.py file provided in the source code repo to an Amazon S3 location.
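
For example, you could stage the script with Boto3. The local path, bucket, and key shown here are placeholders, and should match the values you reference in your DAG configuration (described in the next section).

import boto3

s3 = boto3.client("s3")

# Placeholder locations; keep the bucket and key consistent with config.py
s3.upload_file(
    Filename="glue_etl.py",
    Bucket="my-mlops-artifacts-bucket",
    Key="scripts/glue_etl.py",
)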

Set DAG configuration values

To keep things simple, we use a config.py file to import any environment-specific configurations rather than define them in the main DAG script. You can view the config.py file in its entirety on GitHub. A best practice is to use AWS Secrets Manager to store configuration and secrets information (as of this writing, AWS Systems Manager Parameter Store isn’t a supported backend on Amazon MWAA). For detailed documentation on how to securely store secrets in AWS Secrets Manager for Amazon MWAA, see the Amazon MWAA documentation.
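
For reference, the following is an illustrative excerpt of the kind of values config.py holds. The names match those used in the DAG code later in this post, but the values shown here are placeholders; see the GitHub repo for the actual file.

# config.py (illustrative excerpt; all values are placeholders)
REGION_NAME = "us-east-1"

# AWS Glue preprocessing job
GLUE_JOB_SCRIPT_S3_BUCKET = "my-mlops-artifacts-bucket"
GLUE_JOB_SCRIPT_S3_KEY = "scripts/glue_etl.py"
GLUE_ROLE_NAME = "my-glue-service-role"
DATA_S3_SOURCE = "s3://my-mlops-data-bucket/raw/customer-churn.csv"
DATA_S3_DEST = "s3://my-mlops-data-bucket/prepared"

# SageMaker training and deployment
SAGEMAKER_ROLE_NAME = "my-sagemaker-execution-role"
SAGEMAKER_TRAINING_JOB_NAME_PREFIX = "churn-xgboost"
SAGEMAKER_TRAINING_DATA_S3_SOURCE = "s3://my-mlops-data-bucket/prepared/train/"
SAGEMAKER_VALIDATION_DATA_S3_SOURCE = "s3://my-mlops-data-bucket/prepared/validation/"
SAGEMAKER_CONTENT_TYPE = "text/csv"
SAGEMAKER_MODEL_S3_DEST = "s3://my-mlops-data-bucket/models/"
SAGEMAKER_MODEL_NAME_PREFIX = "churn-xgboost-model"
SAGEMAKER_ENDPOINT_NAME_PREFIX = "churn-xgboost-endpoint"

# Airflow
AIRFLOW_DAG_ID = "xgboost-ml-pipeline"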

Upload the updated config.py file to the DAG directory.

Stage the customer churn training data

The customer churn dataset is mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. It’s attributed by the author to the University of California Irvine Repository of Machine Learning Datasets. The dataset is publicly available and provided in the GitHub repo.

Upload the customer-churn.csv file to the Amazon S3 location you specified in the config.py file.

Construct the DAG

For our demonstration, the DAG consists of four primary sections:

  • Import statements
  • DAG operator configuration
  • DAG task definitions
  • DAG task dependency definition

Import statements

Because Airflow is Python-based, the DAG file is a simple Python file and the modules for Airflow are imported just as they would be for any Python application.

Some services have native Airflow operators available that manage asynchronous API calls and polling to determine success or failure of orchestrated tasks. We recommend using native operators wherever possible. AWS services that don’t have native Airflow operators, like AWS Glue, can still be orchestrated in Airflow using AWS SDKs called from the general PythonOperator.

For nearly all AWS services, the AWS SDK for Python (Boto3) provides service-level access to the APIs. This SDK provides a high degree of control, but also a lower level of abstraction. For ML pipelines using SageMaker, you can use the SageMaker Python SDK. This is a streamlined SDK abstracted specifically for ML experimentation.

The following import statements include general Airflow modules and operators, native Airflow operators for SageMaker, and the Boto3 and SageMaker SDKs:

# Airflow Operators
import airflow
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator

# Airflow Sagemaker Operators
from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator
from airflow.providers.amazon.aws.operators.sagemaker_endpoint import SageMakerEndpointOperator
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook

# AWS SDK for Python
import boto3

# Amazon SageMaker SDK
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
from sagemaker.session import s3_input

# Airflow SageMaker Configuration
from sagemaker.workflow.airflow import training_config
from sagemaker.workflow.airflow import model_config_from_estimator
from sagemaker.workflow.airflow import deploy_config_from_estimator

# Configuration variables
import config

Other import statements are needed to support this demonstration; refer to the GitHub repo for the full code.

DAG operator configuration

The DAG and DAG tasks are defined based on the operators invoked to run each task.

For the AWS Glue task, we invoke the PythonOperator using the SDK for Python to create a client for AWS Glue. To keep the DAG code tidy, we abstract the AWS Glue client code in a helper function called preprocess_glue. We stage the glue_etl.py (referenced in the GitHub repo) in Amazon S3 so it can be loaded when the AWS Glue job is created. See the following code:

def preprocess_glue():
  """preprocess data using glue for etl"""

  # not best practice to hard code location 
  glue_script_location = 's3://{}/{}'.format(config.GLUE_JOB_SCRIPT_S3_BUCKET, config.GLUE_JOB_SCRIPT_S3_KEY)
  glue_client = boto3.client('glue')

  # instantiate the Glue ETL job
  response = glue_client.create_job(
    Name=glue_job_name,
    Description='PySpark job to extract the data and split in to training and validation data sets',
    Role=config.GLUE_ROLE_NAME,
    ExecutionProperty={
      'MaxConcurrentRuns': 2
    },
    Command={
      'Name': 'glueetl',
      'ScriptLocation': glue_script_location,
      'PythonVersion': '3'
    },
    DefaultArguments={
      '--job-language': 'python'
    },
    GlueVersion='1.0',
    WorkerType='Standard',
    NumberOfWorkers=2,
    Timeout=60
    )
  
  # execute the previously instantiated Glue ETL job
  response = glue_client.start_job_run(
    JobName=response['Name'],
    Arguments={
      '--S3_SOURCE': config.DATA_S3_SOURCE,
      '--S3_DEST': config.DATA_S3_DEST,
      '--TRAIN_KEY': 'train/',
      '--VAL_KEY': 'validation/' 
    }
  )

We create a helper function that returns the ARN of the SageMaker role:

def get_sagemaker_role_arn(role_name, region_name):
    iam = boto3.client("iam", region_name=region_name)
    response = iam.get_role(RoleName=role_name)
    return response["Role"]["Arn"]

The XGBoost estimator requires the SageMaker role, container image, and hyperparameters, which we collect using a hook into SageMaker:

hook = AwsBaseHook(aws_conn_id="airflow-sagemaker", client_type="sagemaker")
sess = hook.get_session(region_name=config.REGION_NAME)
sagemaker_role = get_sagemaker_role_arn(config.SAGEMAKER_ROLE_NAME, config.REGION_NAME)
container = get_image_uri(sess.region_name, "xgboost")
hyperparameters = {
    "max_depth":"5",
    "eta":"0.2",
    "gamma":"4",
    "min_child_weight":"6",
    "subsample":"0.8",
    "objective":"binary:logistic",
    "num_round":"100"
}

With the parameters defined, we can create the estimator object:

xgb_estimator = Estimator(
    image_name=container, 
    hyperparameters=hyperparameters,
    role=sagemaker_role,
    sagemaker_session=sagemaker.session.Session(sess),
    train_instance_count=1, 
    train_instance_type='ml.m5.4xlarge', 
    train_volume_size=5,
    output_path=config.SAGEMAKER_MODEL_S3_DEST
)

This estimator object is an input parameter into the training configuration. We need to define other training parameters:

# create unique name with guid
sagemaker_taining_job_name=config.SAGEMAKER_TRAINING_JOB_NAME_PREFIX+'-{}'.format(guid)

# define S3 locations for training & validation data processed using Glue
sagemaker_training_data = s3_input(config.SAGEMAKER_TRAINING_DATA_S3_SOURCE, content_type=config.SAGEMAKER_CONTENT_TYPE)
sagemaker_validation_data = s3_input(config.SAGEMAKER_VALIDATION_DATA_S3_SOURCE, content_type=config.SAGEMAKER_CONTENT_TYPE)

sagemaker_training_inputs = {
  'train': sagemaker_training_data,
  'validation': sagemaker_validation_data
  }

Let’s take a closer look at the arguments for sagemaker_training_inputs. The XGBoost algorithm supports both LIBSVM and CSV text formats for training and validation datasets, but it assumes LIBSVM by default. This means we must specify CSV explicitly so XGBoost interprets our data correctly. The content type is set to text/csv in our custom DAG configuration file. We use CSV because it’s the most common data file format and familiar to all ML practitioners.

With these parameters defined, we can create the training config object:

training_config = training_config(
  estimator=xgb_estimator,
  inputs=sagemaker_training_inputs,
  job_name=sagemaker_taining_job_name
)

For native Airflow SageMaker operators, you can construct and reference well-defined configuration objects when invoking the operators.

The next configuration definition is for the SageMaker endpoint:

# create unique name using guid
sagemaker_model_name=config.SAGEMAKER_MODEL_NAME_PREFIX+'-{}'.format(guid)
sagemaker_endpoint_name=config.SAGEMAKER_ENDPOINT_NAME_PREFIX+'-{}'.format(guid)

For this simple pipeline, we use the deploy_config_from_estimator API option in the SageMaker SDK to export an Airflow deploy config directly from the SageMaker XGBoost estimator (the endpoint_name parameter must be 63 characters or less):

endpoint_config = deploy_config_from_estimator(
  estimator=xgb_estimator, 
  task_id="train", 
  task_type="training", 
  initial_instance_count=1, 
  instance_type="ml.m4.xlarge",
  model_name=sagemaker_model_name,
  endpoint_name=sagemaker_endpoint_name
)

For more information about how we set up the model training and deployment configuration, including how we used the SageMaker SDK sagemaker.workflow.airflow APIs, see the GitHub repo.

With the operator configuration complete, we’re ready to put it all together to define our DAG.

DAG task definitions

For the XGBoost model training task, we invoke the SageMakerTrainingOperator. For the endpoint deployment task, we invoke the SageMakerEndpointOperator. Note that the native operators also support a clean separation of concerns: you can create a model with the SageMakerModelOperator and configure the SageMaker endpoint with the SageMakerEndpointConfigOperator, which gives you more granular control over the creation and deployment of the model. See the following code:

args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(2), 'depends_on_past': False}

with DAG(
    dag_id=config.AIRFLOW_DAG_ID,
    default_args=args,
    start_date=days_ago(2),
    schedule_interval=None,
    concurrency=1,
    max_active_runs=1,
) as dag:
    process_task = PythonOperator(
      task_id="process",
      dag=dag,
      #provide_context=False,
      python_callable=preprocess_glue,
    )

    train_task = SageMakerTrainingOperator(
      task_id = "train",
      config = training_config,
      aws_conn_id = "airflow-sagemaker",
      wait_for_completion = True,
      check_interval = 60, #check status of the job every minute
      max_ingestion_time = None, #allow training job to run as long as it needs, change for early stop
    )

    endpoint_deploy_task = SageMakerEndpointOperator(
      task_id = "endpoint-deploy",
      config = endpoint_config,
      aws_conn_id = "sagemaker-airflow",
      wait_for_completion = True,
      check_interval = 60, #check status of endpoint deployment every minute
      max_ingestion_time = None,
      operation = 'create', #change to update if you are updating rather than creating an endpoint
    )

DAG task dependency definition

After we define the tasks, we set the dependencies between them. Airflow uses the right shift operator (>>) to define downstream dependencies and the left shift operator (<<) to define upstream dependencies. In our example, we only define downstream dependencies:

# set the dependencies between tasks
process_task >> train_task >> endpoint_deploy_task

When the completed DAG is uploaded to the designated Amazon S3 location, Amazon MWAA automatically ingests the DAG. The graph view visually shows the task dependencies. You can trigger the DAG manually from the console during iterative testing, or as we described earlier, from an external source such as EventBridge and a Lambda function. Each task is highlighted depending on the stage of completion, as shown in the following screenshot. Dark green indicates successful completion of the task.

Test the deployed endpoint

After the endpoint-deploy task is complete, we can view the endpoint on the SageMaker console. The SageMaker endpoint is a real-time inference endpoint. SageMaker takes care of deploying, hosting, and exposing the HTTPS endpoint.

We can test the deployed endpoint with a SageMaker notebook.

Follow these steps to set up a SageMaker notebook environment:

  1. Launch a SageMaker notebook instance.
  2. On the Notebook instances page, open your notebook instance by choosing either Open JupyterLab for the JupyterLab interface or Open Jupyter for the classic Jupyter view.
  3. Choose Upload to import the test notebook available in the GitHub repo.

Prepare a test sample

We use Pandas DataFrames to create a test dataset out of the customer churn dataset that was used for training. For the test dataset, we must drop the label column, which is the first column. We also take a random sample of the dataset using the Pandas DataFrame sample method.
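
The following is a minimal sketch of this preparation. It assumes a local, headerless copy of the prepared training CSV; the path and sample size are placeholders.

import pandas as pd

# Placeholder path to a headerless copy of the prepared customer churn data
df = pd.read_csv("customer-churn-prepared.csv", header=None)

# Drop the label, which is the first column, so the payload matches what the model expects
test_features = df.drop(df.columns[0], axis=1)

# Take a small random sample of rows to send to the endpoint
test_sample = test_features.sample(n=5, random_state=42)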

Request inferences

Now that we have our sampled test data, we use the Boto3 library to create a SageMaker runtime client. We use the client when we invoke our endpoint, pass it test data, and receive an inference value.
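
A minimal sketch of that call might look like the following. The endpoint name is a placeholder (check the SageMaker console for the name generated by the DAG), and test_sample is the DataFrame prepared in the previous step.

import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholder endpoint name
endpoint_name = "churn-xgboost-endpoint-example"

# Serialize the sampled rows as CSV, the content type the XGBoost model expects
payload = test_sample.to_csv(header=False, index=False)

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=payload,
)

# Each line of the response body is a churn prediction for one input row
print(response["Body"].read().decode("utf-8"))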

Conclusion

You can use Amazon MWAA to orchestrate and automate complex ML pipelines from the data processing stage through model training and endpoint deployment. You can set special configuration options in the Amazon MWAA environment to support popular ML frameworks like XGBoost.

In this post, we demonstrated how to dynamically create and run an AWS Glue job to preprocess training and validation data. We showed how to construct the DAG to support this ML pipeline, including the import statements, the DAG operator configuration, the DAG task definitions, and the DAG dependency definition. We demonstrated the difference between using native Airflow operators vs. invoking AWS SDK API calls from a generic PythonOperator.

Amazon MWAA is a highly versatile orchestration tool that enterprises can use to operationalize and scale their ML capabilities.


About the authors

Justin Leto is a Sr. Solutions Architect at Amazon Web Services with specialization in big data analytics and machine learning. His passion is helping customers achieve better cloud adoption. In his spare time, he enjoys offshore sailing and playing jazz piano. He lives in Manhattan with his wife Veera.

David Ehrlich is a Machine Learning Specialist at Amazon Web Services. He is passionate about helping customers unlock the true potential of their data. In his spare time, he enjoys exploring the different neighborhoods in New York City, going to comedy clubs, and traveling.

Shreyas Subramanian is an AI/ML Specialist Solutions Architect, and helps customers solve their business challenges using machine learning on AWS.

Announcing specialized support for extracting data from invoices and receipts using Amazon Textract

Receipts and invoices are documents that are critical to small and medium businesses (SMBs), startups, and enterprises for managing their accounts payable processes. These types of documents are difficult to process at scale because they follow no set design rules, yet any individual customer encounters thousands of distinct types of these documents.

In this post, we show how you can use Amazon Textract’s new Analyze Expense API to extract line item details in addition to key-value pairs from invoices and receipts, which is a frequent request we hear from customers. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts, and automatically extracts specific information like vendor name, price, and payment terms. In this post, we walk you through processing an invoice/receipt using Amazon Textract and extracting a set of fields and line-item details. While AWS takes care of building, training, and deploying advanced ML models in a highly available and scalable environment, you take advantage of these models with simple-to-use API actions.

We cover the following topics in this post:

  • How Amazon Textract processes invoices and receipts
  • A walkthrough of the Amazon Textract console
  • Anatomy of the Amazon Textract AnalyzeExpense API response
  • How to process the response with the Amazon Textract parser library
  • Sample solution architecture to automate invoice and receipts processing
  • How to deploy and use the solution

Invoice and receipt processing using Amazon Textract

SMBs, startups, and enterprises process paper-based invoices and receipts as part of their accounts payable process to reconcile their goods received and for auditing purposes. Employees who submit expense reports also submit scans or images of the associated receipts. Companies try to standardize electronic invoicing, but some vendors only offer paper invoices, and some countries legally require paper invoices.

The peculiarities of invoices and receipts mean it’s also a difficult problem to solve at scale—invoices and receipts all look different, because each vendor designs its own documents independently. The labels are imperfect and inconsistent. Vendor name is often not explicitly labeled and has to be interpreted based on context. Other important information such as customer number, customer ID, or account ID are labeled differently from document to document.

To solve this problem, you can use Amazon Textract to process invoices and receipts at scale. Amazon Textract works with any style of invoice or receipt, no templates or configuration required, and extracts relevant data that can be tricky to extract such as contact information, items purchased, and vendor name from those documents. That includes the line-item details, not just the headline amounts.

Amazon Textract also identifies vendor names that are critical for your workflows but may not be explicitly labeled. For example, Amazon Textract can find the vendor name on a receipt even if it’s only indicated within a logo at the top of the page without an explicit key-value pair combination.

Amazon Textract also makes it easy to consolidate input from diverse receipts and invoices. Different documents use different words for the same concept. For example, Amazon Textract maps relationships between field names in different documents such as customer no., customer number, and account ID, and outputs standard taxonomy (in this case, INVOICE_RECEIPT_ID), thereby representing data consistently across document types.

Amazon Textract console walkthrough

Before we get started with the API and code samples, let’s review the Amazon Textract console. The following images show examples of both an invoice and a receipt document on the Analyze Expense output tab of the Amazon Textract console.

Amazon Textract automatically detects the vendor name, invoice number, ship to address, and more from the sample invoice and displays them on the Summary Fields tab. It also represents the standard taxonomy of fields in brackets next to the actual value on the document. For example, it identifies “INVOICE #” as the standard field INVOICE_RECEIPT_ID.

Additionally, Amazon Textract detects the items purchased and displays them on the Line Item Fields tab.

The following is a similar example of a receipt. Amazon Textract detects “Whole Foods Market” as VENDOR_NAME even though the receipt doesn’t explicitly mention it as the vendor name.

The Amazon Textract AnalyzeExpense API response

In this section, we explain the AnalyzeExpense API response structure using sample images. The following is a sample receipt and the corresponding AnalyzeExpense response JSON.

Sample store receipt image

AnalyzeExpense JSON response for SummaryFields:

{
    "DocumentMetadata": {
        "Pages": 1
    },
    "ExpenseDocuments": [
        {
            "ExpenseIndex": 1,
            "SummaryFields": [
                {
                    "Type": {
                        "Text": "VENDOR_NAME",
                        "Confidence": 97.0633544921875
                    },
                    "ValueDetection": {
                        "Text": "New Store X1",
                        "Geometry": {
                            …
                        },
                        "Confidence": 96.65239715576172
                    },
                    "PageNumber": 1
                },
                {
                    "Type": {
                        "Text": "OTHER",
                        "Confidence": 81.0
                    },
                    "LabelDetection": {
                        "Text": "Order type:",
                        "Geometry": {
                            …
                        },
                        "Confidence": 80.8591079711914
                    },
                    "ValueDetection": {
                        "Text": "Quick Sale",
                        "Geometry": {
                            …
                        },
                        "Confidence": 80.82302856445312
                    },
                    "PageNumber": 1
                }
…

AnalyzeExpense JSON response for LineItemGroups:

"LineItemGroups": [
                {
                    "LineItemGroupIndex": 1,
                    "LineItems": [
                        {
                            "LineItemExpenseFields": [
                                {
                                    "Type": {
                                        "Text": "ITEM",
                                        "Confidence": 99.95216369628906
                                    },
                                    "ValueDetection": {
                                        "Text": "Red Banana is innbusiness ",
                                        "Geometry": {
                                            …
                                        },
                                        "Confidence": 99.81525421142578
                                    },
                                    "PageNumber": 1
                                },
                                {
                                    "Type": {
                                        "Text": "PRICE",
                                        "Confidence": 99.95216369628906
                                    },
                                    "ValueDetection": {
                                        "Text": "$66.96",
                                        "Geometry": {
                                            …
                                        },

The AnalyzeExpense JSON output contains ExpenseDocuments, and each ExpenseDocument contains SummaryFields and LineItemGroups. The ExpenseIndex field uniquely identifies the expense, and associates the appropriate SummaryFields or LineItemGroups detected to that expense.

The most granular level of data in the AnalyzeExpense response consists of Type, ValueDetection, and LabelDetection (optional). Let’s call this set of data an AnalyzeExpense element. The preceding example illustrates an AnalyzeExpense element that contains Type, ValueDetection, and LabelDetection.

In the preceding example, Amazon Textract detected 16 SummaryField key-value pairs, including VENDOR_NAME: New Store X1 and Order type: Quick Sale. AnalyzeExpense detects these key-value pairs and displays them as shown in the preceding example. The individual entities are as follows:

  • LabelDetection – The optional key of the key-value pair. In the Order type: Quick Sale example, it’s Order type:. For implied values such as Vendor Name, where the key isn’t explicitly shown in the receipt, LabelDetection isn’t included in the AnalyzeExpense element. In the preceding example, “New Store X1” at the top of the receipt is the vendor name without an explicit key. The AnalyzeExpense element for “New Store X1” has a type of VENDOR_NAME and ValueDetection of New Store X1, but doesn’t have a LabelDetection.
  • Type – This is the normalized type of the key-value pair. Because Order type isn’t a normalized taxonomy value, it’s classified as OTHER. Examples of normalized values are Vendor Name, Receiver Address, and Payment Terms. For a full list of normalized taxonomy values, see the Amazon Textract Developer Guide.
  • ValueDetection – The value of the key-value pair. In the example of Order type: Quick Sale, it’s Quick Sale.

The AnalyzeExpense API also detects ITEM, QUANTITY, and PRICE within line items as normalized fields. If other text is in a line item on the receipt image, such as SKU or a detailed description, it’s included in the JSON as EXPENSE_ROW, as shown in the following example:

{
                                    "Type": {
                                        "Text": "EXPENSE_ROW",
                                        "Confidence": 99.95216369628906
                                    },
                                    "ValueDetection": {
                                        "Text": "Red Banana is in x3 $66.96nbusiness ",
                                        "Geometry": {
                                          …
                                        },
                                        "Confidence": 98.11214447021484
                                    }

In addition to the detected content, the AnalyzeExpense API provides information like confidence scores and bounded boxes for detected elements. It gives you control of how you consume extracted content and integrate it into your applications. For example, you can flag any elements that have a confidence score under a certain threshold for manual review.
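
For example, the following sketch flags summary fields whose value confidence falls below a chosen threshold, using the response structure shown earlier (the threshold value is arbitrary).

CONFIDENCE_THRESHOLD = 90.0

# 'response' is the AnalyzeExpense API response shown earlier
for expense_doc in response["ExpenseDocuments"]:
    for field in expense_doc["SummaryFields"]:
        confidence = field["ValueDetection"]["Confidence"]
        if confidence < CONFIDENCE_THRESHOLD:
            print(
                f"Review needed: {field['Type']['Text']} = "
                f"{field['ValueDetection']['Text']} ({confidence:.1f}%)"
            )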

The input document is either bytes or an Amazon Simple Storage Service (Amazon S3) object. You pass image bytes to an Amazon Textract API operation by using the Bytes property. For example, you use the Bytes property to pass a document loaded from a local file system.

Image bytes passed by using the Bytes property must be base64 encoded. Your code might not need to encode document file bytes if you’re using an AWS SDK to call Amazon Textract API operations. Alternatively, you can pass images stored in an S3 bucket to an Amazon Textract API operation by using the S3Object property. Documents stored in an S3 bucket don’t need to be base64 encoded.

You can call the AnalyzeExpense API using the AWS Command Line Interface (AWS CLI), as shown in the following code. Make sure you have the latest AWS CLI version installed.

aws textract analyze-expense --document '{"S3Object": {"Bucket": "<Your Bucket>","Name": "Invoice/Receipts S3 Objects"}}'

Process the response with the Amazon Textract parser library

Apart from working with the JSON output as-is, you can use the Amazon Textract response parser library to parse the JSON returned by the AnalyzeExpense API. The library provides programming language-specific constructs to work with different parts of the document; for more details, refer to the GitHub repo. The response parser makes it easier to deserialize the JSON response and use it in your application, and the Amazon Textract PrettyPrinter library lets you print the parsed response in different formats. The GitHub repository shows examples for parsing Amazon Textract responses. You can parse the SummaryFields and LineItemGroups for every ExpenseDocument in the AnalyzeExpense response JSON using the AnalyzeExpense parser, as shown in the following code:

# Install the latest Boto3 Python SDK
python3 -m pip install boto3 --upgrade

# Install the latest version of the Amazon Textract response parser
python3 -m pip install amazon-textract-response-parser --upgrade

import boto3

# documentName is the local path to your invoice or receipt image (placeholder value)
documentName = 'receipt.png'

client = boto3.client(
    service_name='textract',
    region_name='us-east-1',
    endpoint_url='https://textract.us-east-1.amazonaws.com',
)

with open(documentName, 'rb') as file:
    img_test = file.read()
    bytes_test = bytearray(img_test)
    print('Image loaded', documentName)

# process using image bytes
response = client.analyze_expense(Document={'Bytes': bytes_test})

You can further use the serializer and deserializer to validate the response JSON and convert it into the Python object representation, and vice versa.

The following code deserializes the response JSON:

# j holds the Textract JSON
from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema
t_doc = TAnalyzeExpenseDocumentSchema().load(json.loads(j))

The following code serializes the response JSON:

from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema
t_doc = TAnalyzeExpenseDocumentSchema().dump(t_doc)

You can also convert the output to formats like CSV, Presto, TSV, HTML, LaTeX, and more by using the Amazon Textract PrettyPrinter library.

Install the PrettyPrinter library with the following code:

python3 -m pip install amazon-textract-prettyprinter --upgrade

Call the get_string method of textractprettyprinter.t_pretty_print_expense with the output_type as SUMMARY or LINEITEMGROUPS and table_format set to whichever format you want to output. The following example code outputs both SUMMARY and LINEITEMGROUPS in the fancy grid format:

import os
import boto3
from textractprettyprinter.t_pretty_print_expense import get_string
from textractprettyprinter.t_pretty_print_expense import Textract_Expense_Pretty_Print, Pretty_Print_Table_Format

"""
boto3 client for Amazon Textract
"""
textract = boto3.client(service_name='textract')

"""
Set the S3 Bucket Name and File name 
Please set the below variables to your S3 Location
"""
s3_source_bucket_name = "YOUR S3 BUCKET NAME"
s3_request_file_name = "YOUR S3 EXPENSE IMAGE FILENAME "
    
"""
Call the Textract AnalyzeExpense API with the input Expense Image in Amazon S3
"""
try:
    response = textract.analyze_expense(
        Document={
            'S3Object': {
                'Bucket': s3_source_bucket_name,
                'Name': s3_request_file_name
            }
        })
    """
    Call Amazon Pretty Printer get_string method to parse the response and print it in fancy_grid format. 
    You can set pretty print format to other types as well like csv, latex etc.
    """
    pretty_printed_string = get_string(textract_json=response, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)
        
    """
    Use the pretty printed string to save the response in storage of your choice. 
    Below is just printing it on stdout.
    """
    print(pretty_printed_string)    

except Exception as e_raise:
    print(e_raise)
    raise e_raise

The following is the PrettyPrinter output for a sample receipt.

The following is another example of detecting structured data from an invoice.

AnalyzeExpense detects the various normalized summary fields like PAYMENT_TERMS, INVOICE_RECEIPT_ID, TOTAL, TAX, and RECEIVER_ADDRESS.

It also detected one LineItemGroup with one LineItem having DESCRIPTION, QUANTITY, and PRICE, as shown in the following PrettyPrinter output.

Solution architecture overview

The following diagram is a common solution architecture pattern you can use to process documents using Amazon Textract. The solution uses the new AnalyzeExpense API to process receipts and invoices on Amazon S3 and stores the results back in Amazon S3.

The solution architecture includes the following steps:

  1. The input and output S3 buckets store the input expense documents (images) in PNG and JPEG formats and the AnalyzeExpense PrettyPrinter outputs, respectively.
  2. An event rule based on an event pattern in Amazon EventBridge matches incoming S3 PutObject events in the input S3 bucket containing the raw expense document images.
  3. The configured EventBridge rule sends the event to an AWS Lambda function for further processing.
  4. The Lambda function reads the images from the input S3 bucket, calls the AnalyzeExpense API, uses the Amazon Textract response parser to deserialize the JSON response, uses Amazon Textract PrettyPrinter to easily print the parsed response, and stores the results back to the S3 bucket in different formats.

Deploy and use the solution

You can deploy the solution architecture using an AWS CloudFormation template that performs much of the setup work for you.

  1. Choose Launch Stack to deploy the solution architecture in the US East (N. Virginia) Region.

  2. Don’t make any changes to the stack name or parameters.
  3. In the Capabilities section, select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack.

To use the solution, upload receipt and invoice images to the S3 bucket referenced by SourceBucket in the CloudFormation template. This triggers an event that invokes the Lambda function, which calls the AnalyzeExpense API, parses the response JSON, converts the parsed response into CSV or fancy_grid format, and stores the result in another S3 bucket (referenced by OutputBucket in the CloudFormation template).

You can extend the provided Lambda function further based on your requirements, and also change the output format to other types like TSV, grid, LaTeX, and many more by setting the appropriate value of table_format when calling the get_string method of textractprettyprinter.t_pretty_print_expense in Amazon Textract PrettyPrinter.

The sample Lambda function deployment package included in this CloudFormation template also bundles the Boto3 SDK. If you want to upgrade the Boto3 SDK in the future, you either need to create a new deployment package with the upgraded Boto3 SDK or rely on the Boto3 SDK provided by the Lambda Python runtime.

Clean up resources

To delete the resources that the CloudFormation template created, complete the following steps:

  1. Delete the input, output, and logging S3 buckets created by the CloudFormation template.
  2. On the AWS CloudFormation console, select the stack that you created.
  3. On the Actions menu, choose Delete.

Summary

In this post, we provided an overview of the new Amazon Textract AnalyzeExpense API to quickly and easily retrieve structured data from receipts and invoices. We also described how you can parse the AnalyzeExpense response JSON using the Amazon Textract parser library and save the output in different formats using Amazon Textract PrettyPrinter. Finally, we provided a solution architecture for processing invoices and receipts using Amazon S3, EventBridge, and a Lambda function.

For more information, see the Amazon Textract Developer Guide.


About the Authors

Dhawalkumar Patel is a Sr. Startups Machine Learning Solutions Architect at AWS with expertise in Machine Learning and Serverless domains. He has worked with organizations ranging from large enterprises to startups on problems related to distributed computing and artificial intelligence.

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

Manish Chugh is a Sr. Solutions Architect at AWS based in San Francisco, CA. He has worked with organizations ranging from large enterprises to early-stage startups. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS. In his free time, he enjoys hiking East Bay trails, road biking, and watching (and playing) cricket.

Detect small shapes and objects within your images using Amazon Rekognition Custom Labels

There are multiple scenarios in which you may want to use computer vision to detect small objects or symbols within a given image. Whether it’s detecting company logos on grocery store shelves to manage inventory, detecting informative symbols on documents, or evaluating survey or quiz documents that contain checkmarks or shaded circles, the size ratio of objects of interest to the overall image size is often very small. This presents a common challenge with machine learning (ML) models, because the processes performed by the models on an image can sometimes miss smaller objects when trying to reduce down and learn the details of larger objects in the image.

In this post, we demonstrate a preprocessing method to increase the size ratio of small objects with respect to an image and optimize the object detection capabilities of Amazon Rekognition Custom Labels, effectively solving the small object detection challenge.

Amazon Rekognition is a fully managed service that provides computer vision (CV) capabilities for analyzing images and video at scale, using deep learning technology without requiring ML expertise. Amazon Rekognition Custom Labels, an automated ML feature of Amazon Rekognition, lets you quickly train custom CV models specific to your business needs simply by bringing labeled images.

Solution overview

The preprocessing method that we implement is an image tiling technique. We discuss how this works in detail in a later section. To demonstrate the performance boost our preprocessing method presents, we train two Amazon Rekognition Custom Labels models—one baseline model and one tiling model—evaluate them on the same test dataset, and compare model results.

Data

For this post, we created a sample dataset to replicate many use cases containing small objects that have significant meaning or value within the document. In the following sample document, we’re interested in four object classes: empty triangles, solid triangles, up arrows, and down arrows. In the creation of this dataset, we converted each page of the PDF document into an image. The four object classes listed were labeled using Amazon SageMaker Ground Truth.

Baseline model

After you have a labeled dataset, you’re ready to train your Amazon Rekognition Custom Labels model. First we train our baseline model that doesn’t include the preprocessing technique. The first step is to set up our Amazon Rekognition Custom Labels project.

  1. On the Amazon Rekognition console, choose Use Custom Labels.
  2. Under Datasets, choose Create new dataset.
  3. For Dataset name, enter cl-blog-original-train.

We repeat this process to create our Amazon SageMaker Ground Truth labeled test set called cl-blog-original-test.

  1. Under Projects, choose Create new project.
  2. For Project name, enter cl-blog-original.

  1. On the cl-blog-original project details page, choose Train new model.
  2. Reference the train and test set you created that Amazon Rekognition uses to train and evaluate the model.

The training time varies based on the size and complexity of the dataset. For reference, our model took about 1.5 hours to train.

After the model is trained, Amazon Rekognition outputs various metrics to the model dashboard so you can analyze the performance of your model. The main metric we’re interested in for this example is the F1 score, but precision and recall are also provided. The F1 score measures a model’s overall accuracy, calculated as the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). Precision measures what fraction of the model’s predictions are actual objects of interest, and recall measures what fraction of the objects of interest the model correctly predicted.

Our baseline model yielded an F1 score of about 0.40. The following results show that the model performs fairly well with regard to precision for the different labels, but fails to recall the labels across the board. The recall scores for down_arrow, empty_triangle, solid_triangle, and up_arrow are 0.27, 0.33, 0.33, and 0.17, respectively. Because precision and recall both contribute to the F1 score, the high precision scores for each label are brought down by the poor recall scores.

To improve our model performance, we explore a tiling approach in the next section.

Tiling approach model

Now that we’ve trained our baseline model and analyzed its performance, let’s see if tiling our original images into smaller images results in a performance increase. The idea here is that tiling our image into smaller images allows for the small symbol to appear with a greater size ratio when compared to the same small symbol in the original image. The first step before creating a labeled dataset in Ground Truth is to tile the images in our original dataset into smaller images. We then generate our labeled dataset with Ground Truth using this new tiled dataset, and train a new Amazon Rekognition Custom Labels tiled model.

  1. Open up your preferred IDE and create two directories:
    1. original_images for storing the original document images
    2. tiled_images for storing the resulting tiled images
  2. Upload your original images into the original_images directory.

For this post, we use Python and import a package called image_slicer. This package includes functions to slice our original images into a specified number of tiles and save the resulting tiled images into a specified folder.

  1. In the following code, we iterate over each image in our original image folder, slice the image into quarters, and save the resulting tiled images into our tiled images folder:

import os
import image_slicer

for img in os.listdir("original_images"):
    # slice each original image into 4 tiles (quarters), without saving yet
    tiles = image_slicer.slice(os.path.join("original_images", img), 4, save=False)
    # save the tiles to the tiled_images directory, prefixed with the original file name
    image_slicer.save_tiles(
        tiles, directory="tiled_images",
        prefix=str(img).split(".")[0], format="png")

You can experiment with the number of tiles, because smaller symbols in more complex images may need a higher number of tiles, but four worked well for our set of images.

The following image shows an example output from the tiling script.

After our image dataset is tiled, we can repeat the same steps from training our baseline model.

  1. Create a labeled Ground Truth cl-blog-tiled-train and cl-blog-tiled-test set.

We make sure to use the same images in our baseline-model-test set as our tiled-model-test set to ensure we’re maintaining an unbiased evaluation when comparing both models.

  1. On the Amazon Rekognition console, on the cl-blog-tiled project details page, choose Train new model.
  2. Reference the tiled train and test set you just created that Amazon Rekognition uses to train and evaluate the model.

After the model is trained, Amazon Rekognition outputs various metrics to the model dashboard so you can analyze the performance of your model. Our tiled model achieves an F1 score of about 0.68. The following results show that the recall improved for all labels at the expense of the precision of down_arrow and up_arrow. However, the improved recall scores were enough to boost our F1 score.

Results

When we compare the results of the baseline model and the tiled model, we can observe that the tiling approach gave the new model a significant 0.28 boost in F1 score, from 0.40 to 0.68. Because the F1 score summarizes both precision and recall, this represents a substantial improvement in the model’s ability to accurately identify small objects on a page compared to the baseline model. Although the precision for the down_arrow label dropped from 0.89 to 0.28 and the precision for the up_arrow label dropped from 0.71 to 0.68, the recall for all the labels significantly increased.

Conclusion

In this post, we demonstrated how to use tiling as a preprocessing technique to maximize the performance of Amazon Rekognition Custom Labels when detecting small objects. This method is a quick addition that improves the predictive accuracy of an Amazon Rekognition Custom Labels model by increasing the size ratio of small objects relative to the overall image.

For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?


About the Authors

Alexa Giftopoulos is a Machine Learning Consultant at AWS with the National Security Professional Services team, where she develops AI/ML solutions for various business problems.

Sarvesh Bhagat is a Cloud Consultant for AWS Professional Services based out of Virginia. He works with public sector customers to help solve their AI/ML-focused business problems.


Akash Patwal is a Machine Learning Consultant at AWS where he builds AI/ML solutions for customers.

Bring your own container to project model accuracy drift with Amazon SageMaker Model Monitor

The world we live in is constantly changing, and so is the data that is collected to build models. One of the problems that is often seen in production environments is that the deployed model doesn’t behave the same way as it did during the training phase. This concept is generally called data drift or dataset shift, and can be caused by many factors, such as bias in sampling data that affects features or label data, the non-stationary nature of time series data, or changes in the data pipeline. Because machine learning (ML) models aren’t deterministic, it’s important to minimize the variance in the production environment by periodically monitoring the deployment environment for model drift and sending alerts and, if necessary, triggering retraining of the models with new data.

Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. After you train an ML model, you can deploy it on SageMaker endpoints that are fully managed and can serve inferences in real time with low latency. After you deploy your model, you can use Amazon SageMaker Model Monitor to continuously monitor the quality of your ML model in real time. You can also configure alerts to notify and trigger actions if any drift in model performance is observed. Early and proactive detection of these deviations enables you to take corrective actions, such as collecting new ground truth training data, retraining models, and auditing upstream systems, without having to manually monitor models or build additional tooling.

In this post, we present some techniques to detect covariate drift (one of the types of data drift) and demonstrate how to incorporate your own drift detection algorithms and visualizations with Model Monitor.

Types of data drift

Data drift can be classified into three categories depending on whether the distribution shift is happening on the input or on the output side, or whether the relationship between the input and the output has changed.

Covariate shift

In a covariate shift, the distribution of inputs changes over time, but the conditional distribution P(y|x) doesn’t change. This type of drift is called covariate shift because the problem arises due to a shift in the distribution of the covariates (features). For example, the training dataset for a face recognition algorithm may contain predominantly younger faces, while the real world may have a much larger proportion of older people.
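
As a simple illustration (not the specific algorithm used later in this post), you can quantify covariate shift for a numeric feature with a two-sample Kolmogorov-Smirnov test that compares the training data against recently captured inference data. The feature values below are synthetic stand-ins.

import numpy as np
from scipy import stats

# Synthetic stand-ins: ages seen during training vs. ages observed in production
train_age = np.random.normal(loc=30, scale=5, size=5000)
inference_age = np.random.normal(loc=45, scale=12, size=1000)

# The KS statistic is the maximum distance between the two empirical distributions
ks_statistic, p_value = stats.ks_2samp(train_age, inference_age)

# A small p-value (for example, < 0.05) suggests the feature distribution has shifted
print(f"KS statistic: {ks_statistic:.3f}, p-value: {p_value:.2e}")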

Label shift

While covariate shift focuses on changes in the feature distribution, label shift focuses on changes in the distribution of the class variable. This type of shifting is essentially the reverse of covariate shift. An intuitive way to think about it might be to consider an unbalanced dataset; for example, if the spam to non-spam ratio of emails in our training set is 50%, but in reality, 10% of our emails are non-spam, then the target label distribution has shifted.

Concept shift

Concept shift is different from covariate and label shift in that it’s not related to the data distribution or the class distribution, but instead is related to the relationship between the two variables. For example, in stock analysis, the relationship between prior stock market data and the stock price is non-stationary. The relationship between the input data and the target variable changes over time, and the model needs to be refreshed often.

Now that we know the types of distribution shifts, let’s see how Model Monitor helps us in detecting data drifts. In the next section, we walk through setting up Model Monitor and incorporating our custom drift detection algorithms.

Model Monitor

Model Monitor offers four different types of monitoring capabilities to detect and mitigate model drift in real time:

  • Data quality – Helps detect change in data schemas and statistical properties of independent variables and alerts when a drift is detected.
  • Model quality – For monitoring model performance characteristics such as accuracy or precision in real time, Model Monitor allows you to ingest the ground truth labels collected from your applications. Model Monitor automatically merges the ground truth information with prediction data to compute the model performance metrics.
  • Model bias – Model Monitor is integrated with Amazon SageMaker Clarify to improve visibility into potential bias. Although your initial data or model may not be biased, changes in the world may cause bias to develop over time in a model that has already been trained.
  • Model explainability – Drift detection alerts you when a change occurs in the relative importance of feature attributions.

Some of the Model Monitor pre-built monitors are powered by Deequ, a library for measuring data quality. You don’t need to write any code to use these pre-built monitoring capabilities, and you also have the flexibility to monitor models with your own code for custom analysis. All metrics emitted by Model Monitor can be collected and viewed in Amazon SageMaker Studio, so you can visually analyze your model performance without writing additional code.

In certain scenarios, the pre-built monitors may not be sufficient to generate sophisticated metrics to detect drifts, and may necessitate bringing your own metrics. In the next sections, we describe the setup to bring in your metrics by building a custom container.

Environment setup

For this post, we use a SageMaker notebook to set up Model Monitor and also visualize the drifts. We begin with setting up required roles and Amazon Simple Storage Service (Amazon S3) buckets to store our data:

import boto3
from sagemaker import get_execution_role, session

region = boto3.Session().region_name

sm_client = boto3.client('sagemaker')

role = get_execution_role()
print(f"RoleArn: {role}")

# You can use a different bucket, but make sure the role you chose for this notebook
# has the s3:PutObject permission. This is the bucket into which the data is captured.
bucket = session.Session(boto3.Session()).default_bucket()
print(f"Demo Bucket: {bucket}")
prefix = 'sagemaker/DEMO-ModelMonitor'

s3_capture_upload_path = f's3://{bucket}/{prefix}/datacapture'
s3_report_path = f's3://{bucket}/{prefix}/reports'

Upload train dataset, test dataset, and model file to Amazon S3

Next, we upload our training and test datasets, and also the trained model we use for inference. For this post, we use the Census Income Dataset from the UCI Machine Learning Repository. The dataset consists of people's income and several attributes that describe population demographics. The task is to predict whether a person makes above or below $50,000. The dataset contains both categorical and integer attributes, and has several missing values, which makes it a good example to demonstrate model drift and detection.

We use the XGBoost algorithm to train the model offline using SageMaker, and we provide the model file for deployment. The training dataset is used for comparison with the inference data to generate drift scores, while the test dataset is used to compute how much the accuracy of the model has degraded due to drift. We provide more intuition on these algorithms in later steps.

The following code uploads our datasets and model to Amazon S3:

import os

model_file = open("model/model.tar.gz", 'rb')
train_file = open("data/train.csv", 'rb')
test_file = open("data/test.csv", 'rb')

s3_model_key = os.path.join(prefix, 'model.tar.gz')
s3_train_key = os.path.join(prefix, 'train.csv')
s3_test_key = os.path.join(prefix, 'test.csv')

s3_bucket = boto3.Session().resource('s3').Bucket(bucket)
s3_bucket.Object(s3_model_key).upload_fileobj(model_file)
s3_bucket.Object(s3_train_key).upload_fileobj(train_file)
s3_bucket.Object(s3_test_key).upload_fileobj(test_file)

Set up a Docker container

Model Monitor supports bringing your own custom model monitor containers. When you create a MonitoringSchedule, Model Monitor starts processing jobs to evaluate incoming inference data. When it invokes the container, Model Monitor sets up additional environment variables for you so that your container has enough context to process the data for that particular run of the scheduled monitoring. For the container code variables, see Container Contract Inputs. Of the available input environment variables, we're interested in dataset_source, output_path, and end_time:

"Environment": {
    "dataset_source": "/opt/ml/processing/endpointdata",
    "output_path": "/opt/ml/processing/resultdata",
    "end_time": "2019-12-01T16: 20: 00Z"
}

The dataset_source variable specifies the data capture location on the container, and end_time refers to the time of the last event capture. The custom drift algorithm is included in the src directory. For details on the algorithms, see the GitHub repo.
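The following is a minimal sketch of what such an entry point could look like, assuming the captured data arrives as JSON Lines files under dataset_source; the actual drift_detector.py in the GitHub repo implements the full drift algorithms and differs in detail:

# Minimal sketch of a custom Model Monitor entry point (the real drift_detector.py
# in the accompanying repo is more involved)
import argparse
import glob
import json
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_s3_uri')
    parser.add_argument('--test_s3_uri')
    parser.add_argument('--target_label')
    args = parser.parse_args()

    # Environment variables injected by Model Monitor (see Container Contract Inputs)
    dataset_source = os.environ.get('dataset_source', '/opt/ml/processing/endpointdata')
    output_path = os.environ.get('output_path', '/opt/ml/processing/resultdata')
    end_time = os.environ.get('end_time')

    # Read the captured inference records mounted under dataset_source
    records = []
    for path in glob.glob(os.path.join(dataset_source, '**', '*.jsonl'), recursive=True):
        with open(path) as f:
            records.extend(json.loads(line) for line in f)

    # ... compute drift metrics against the training data referenced by args.train_s3_uri ...

    # Write results where Model Monitor expects them
    os.makedirs(output_path, exist_ok=True)
    with open(os.path.join(output_path, 'results.json'), 'w') as f:
        json.dump({'num_records': len(records), 'end_time': end_time}, f)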

Now we build the Docker container and push it to Amazon Elastic Container Registry (Amazon ECR). See the following Dockerfile:

FROM python:3.8-slim-buster

RUN pip3 install pandas==1.1.4 numpy==1.19.4 scikit-learn==0.23.2 pyarrow==2.0.0 scipy==1.5.4 boto3==1.17.12

WORKDIR /home
COPY src/* /home/

ENTRYPOINT ["python3", "drift_detector.py"]

We build and push it to Amazon ECR with the following code:

from docker_utils import build_and_push_docker_image

repository_short_name = 'custom-model-monitor'

image_name = build_and_push_docker_image(repository_short_name)

Set up an endpoint and enable data capture

As we mentioned before, our model was trained using XGBoost, so we use XGBoostModel from the SageMaker SDK to deploy it. Because we have a custom input parser, which includes imputation and one-hot encoding, we provide the inference entry point along with the source directory, which includes the scikit-learn ColumnTransformer model. The following code is the custom inference function:

import pathlib
import pickle
from io import StringIO

import pandas as pd
# Encoder utilities provided by the SageMaker XGBoost inference container
from sagemaker_xgboost_container import encoder as xgb_encoders

script_path = pathlib.Path(__file__).parent.absolute()
with open(f'{script_path}/preprocess.pkl', 'rb') as f:
    preprocess = pickle.load(f)


def input_fn(request_body, content_type):
    """
    The SageMaker XGBoost model server receives the request data body and the content type,
    and invokes the `input_fn`.

    Return a DMatrix (an object that can be passed to predict_fn).
    """

    if content_type == 'text/csv':
        # Apply the imputation and one-hot encoding steps before building the DMatrix
        df = pd.read_csv(StringIO(request_body), header=None)
        X = preprocess.transform(df)

        X_csv = StringIO()
        pd.DataFrame(X).to_csv(X_csv, header=False, index=False)
        req_transformed = X_csv.getvalue().replace('\n', '')

        return xgb_encoders.csv_to_dmatrix(req_transformed)
    else:
        raise ValueError(
            f'Content type {content_type} is not supported.'
        )

We also include the data capture configuration, specifying the S3 destination and sampling percentage, with the endpoint deployment (see the following code). Ideally, we want to capture 100% of the incoming data for drift detection, but for a high-traffic endpoint, we suggest reducing the sampling percentage so that endpoint availability isn't affected. In addition, Model Monitor might automatically reduce the sampling percentage if it senses that endpoint availability is affected.

from sagemaker.xgboost.model import XGBoostModel
from sagemaker.serializers import CSVSerializer
from sagemaker.model_monitor import DataCaptureConfig

model_url = f's3://{bucket}/{s3_model_key}'

xgb_inference_model = XGBoostModel(
    model_data=model_url,
    role=role,
    entry_point='inference.py',
    source_dir='script',
    framework_version='1.2-1',
)

data_capture_config = DataCaptureConfig(
                        enable_capture=True,
                        sampling_percentage=100,
                        destination_s3_uri=s3_capture_upload_path)

predictor = xgb_inference_model.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.xlarge',
    serializer=CSVSerializer(),
    data_capture_config=data_capture_config)

After we deploy the model, we can set up a monitor schedule.

Create a monitor schedule

Typically, we create a processing job to generate baseline metrics from our training set. Then we create a monitoring schedule that starts periodic jobs to analyze incoming inference requests and generate metrics comparable to the baseline. The inference metrics are compared with the baseline metrics, and a detailed report on constraint violations and drift in data quality is generated. Using the SageMaker SDK simplifies creating the baseline metrics and scheduling the monitor, as sketched in the following code.
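For reference, that typical flow with the pre-built monitor looks roughly like the following sketch (we don't use it in this post because we bring our own container, so treat the parameter choices as illustrative):

from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Pre-built (Deequ-based) monitor: baseline the training data, then schedule hourly checks
default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',
    volume_size_in_gb=10,
    max_runtime_in_seconds=600,
)

# Generate baseline statistics and constraints from the training data
default_monitor.suggest_baseline(
    baseline_dataset=f's3://{bucket}/{s3_train_key}',
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=f's3://{bucket}/{prefix}/baseline',
    wait=True,
)

# Schedule hourly monitoring jobs against the deployed endpoint
default_monitor.create_monitoring_schedule(
    monitor_schedule_name=f'{predictor.endpoint_name}-default',
    endpoint_input=predictor.endpoint_name,
    output_s3_uri=s3_report_path,
    statistics=default_monitor.baseline_statistics(),
    constraints=default_monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)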

For the algorithms we use in this post (such as the Wasserstein distance and the Kolmogorov–Smirnov test), the container that we build needs access to both the training dataset and the inference data for computing metrics. This non-typical setup requires configuring the monitoring schedule at a lower level. The following code is the Boto3 request to set up the monitoring schedule. The key fields are the container image URI and the arguments passed to the container (we built the container in an earlier section). For now, note that we pass the S3 URL of the training dataset:

s3_train_path = f's3://{bucket}/{s3_train_key}'
s3_test_path = f's3://{bucket}/{s3_test_key}'
s3_result_path = f's3://{bucket}/{prefix}/result/{predictor.endpoint_name}'

sm_client.create_monitoring_schedule(
    MonitoringScheduleName=predictor.endpoint_name,
    MonitoringScheduleConfig={
        'ScheduleConfig': {
            'ScheduleExpression': 'cron(0 * ? * * *)'
        },
        'MonitoringJobDefinition': {
            'MonitoringInputs': [
                {
                    'EndpointInput': {
                        'EndpointName': predictor.endpoint_name,
                        'LocalPath': '/opt/ml/processing/endpointdata'
                    }
                },
            ],
            'MonitoringOutputConfig': {
                'MonitoringOutputs': [
                    {
                        'S3Output': {
                            'S3Uri': s3_result_path,
                            'LocalPath': '/opt/ml/processing/resultdata',
                            'S3UploadMode': 'EndOfJob'
                        }
                    },
                ]
            },
            'MonitoringResources': {
                'ClusterConfig': {
                    'InstanceCount': 1,
                    'InstanceType': 'ml.c5.xlarge',
                    'VolumeSizeInGB': 10
                }
            },
            'MonitoringAppSpecification': {
                'ImageUri': image_name,
                'ContainerArguments': [
                    '--train_s3_uri',
                    s3_train_path,
                    '--test_s3_uri',
                    s3_test_path,
                    '--target_label',
                    'income'
                ]
            },
            'StoppingCondition': {
                'MaxRuntimeInSeconds': 600
            },
            'RoleArn': role
        }
    }
)

Generate requests for the endpoint

After we deploy the model, we send traffic to the endpoint at a constant rate. The following code launches a thread that generates requests for about 10 hours. Make sure to stop the kernel if you want to stop the traffic. We let the traffic generator run for a few hours to collect enough samples to visualize drift. The plots automatically refresh whenever there is new data.

from threading import Thread
from time import sleep, time


def invoke_endpoint(ep_name, file_name, runtime_client):
    with open(file_name) as f:
        count = len(f.read().split('\n')) - 2  # Remove EOF and header

    # Calculate the sleep time between inference calls needed to keep a constant rate for 10 hours
    ten_hours_in_sec = 10 * 60 * 60
    sleep_time = ten_hours_in_sec / count

    with open(file_name, 'r') as f:
        next(f)  # Skip header

        for ind, row in enumerate(f):
            start_time = time()
            payload = row.rstrip('\n')
            response = runtime_client(data=payload)

            # Print every 15 minutes (900 seconds)
            if (ind + 1) % int(count / ten_hours_in_sec * 900) == 0:
                print(f'Finished sending {ind + 1} records.')

            # Sleep to ensure a constant rate. Time spent on inference is subtracted
            sleep(max(sleep_time - (time() - start_time), 0))

    print("Done!")

print(f"Sending test traffic to the endpoint {predictor.endpoint_name}.\nPlease wait...")

thread = Thread(target=invoke_endpoint, args=(predictor.endpoint_name, 'data/infer.csv', predictor.predict))
thread.start()

Visualize the data drift

We use the entire dataset for training and synthetically generate a new dataset for inference by modifying statistical properties of some of the attributes from the original dataset as described in our GitHub repo.

Normalized drift score per feature

A normalized drift score (in %) is computed for each feature by measuring how far the distribution of the incoming inference data deviates from the distribution of the original training data. For categorical data, this is computed by summing the absolute differences between the label probabilities of the training and inference data. For numerical data, the training data is split into 10 bins (deciles), and the absolute differences between the per-bin probabilities are summed to calculate the drift score.
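The following is a hedged sketch of this computation as we interpret it; the implementation in the GitHub repo may differ in details such as bin edges and normalization:

import numpy as np
import pandas as pd


def drift_score_categorical(train_col, infer_col):
    """Sum of absolute differences between label probabilities, expressed in %."""
    p_train = train_col.value_counts(normalize=True)
    p_infer = infer_col.value_counts(normalize=True)
    labels = p_train.index.union(p_infer.index)
    return 100 * (p_train.reindex(labels, fill_value=0)
                  - p_infer.reindex(labels, fill_value=0)).abs().sum()


def drift_score_numerical(train_col, infer_col, n_bins=10):
    """Split the training data into deciles and compare per-bin probabilities, in %."""
    edges = np.unique(np.quantile(train_col, np.linspace(0, 1, n_bins + 1)))  # drop duplicate edges
    edges[0], edges[-1] = -np.inf, np.inf  # catch inference values outside the training range
    p_train = pd.cut(train_col, edges).value_counts(normalize=True).sort_index()
    p_infer = pd.cut(infer_col, edges).value_counts(normalize=True).sort_index()
    return 100 * (p_train - p_infer).abs().sum()

For example, drift_score_numerical(train_df['hours-per-week'], captured_df['hours-per-week']) would produce the kind of per-interval score plotted below, where train_df and captured_df are hypothetical DataFrames holding the training data and the captured inference data for one interval.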

The following plot shows the drift scores over the time intervals. A few features show little to no drift, some features show drift that increases over time, and hours-per-week shows a large drift right from the start.

Projected drift in model accuracy

This metric provides a proxy for the accuracy of the model as the inference data drifts away from the test data. The idea behind this approach is to determine what percentage of the inference data is similar to the portion of the test data on which the model predicts well. For this metric, we use Isolation Forests to train a one-class classifier on the portion of the test data that the model predicted well, and then generate scores for that test data and for the inference data. The relative difference between the means of these scores is the projected drift in accuracy.
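A rough sketch of this idea follows (our interpretation; the exact scoring and normalization in the repo's implementation may differ):

from sklearn.ensemble import IsolationForest


def projected_accuracy_drift(test_X_correct, inference_X):
    """Compare one-class scores of inference data against the well-predicted test data."""
    # Train the one-class model only on test rows the deployed model predicted correctly
    clf = IsolationForest(n_estimators=100, random_state=42).fit(test_X_correct)

    # score_samples: higher values mean "more similar" to the well-predicted test data
    test_scores = clf.score_samples(test_X_correct)
    infer_scores = clf.score_samples(inference_X)

    # Relative difference in mean scores, used as a proxy for accuracy degradation
    return (test_scores.mean() - infer_scores.mean()) / abs(test_scores.mean())

Here, test_X_correct stands for the preprocessed test rows on which the deployed model's predictions matched the labels, and inference_X for the captured and preprocessed inference rows; both names are hypothetical.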

The following plot shows the corresponding degradation in model accuracy due to drift in the incoming data, demonstrating that covariate shift can reduce the accuracy of a model. A few spikes in the metric can be attributed to the statistical nature of how the data is generated. This is only a proxy for model accuracy, not the actual accuracy, which can only be known from the ground truth labels.

P-values of null test hypothesis

A null hypothesis test checks whether two samples (for this post, the training and inference datasets) are drawn from the same population. The p-value gives the probability of obtaining test results at least as extreme as the observations under the assumption that the null hypothesis is correct. A p-value threshold of 5% is used to decide whether the observed sample has drifted from the training data.

The following plot shows p-values of the attributes that have crossed the threshold at least once. The y-axis shows the inverse log of the p-values to distinguish very small values. P-values for numerical attributes are obtained from the Kolmogorov–Smirnov test, and for categorical attributes from the chi-square test.
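The following sketch shows how these p-values might be computed with SciPy; the repo's implementation may preprocess or bin the data differently, and the inverse-log transform shown is our assumption for the plot:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp


def p_value_numerical(train_col, infer_col):
    """Two-sample Kolmogorov-Smirnov test for a numerical attribute."""
    return ks_2samp(train_col, infer_col).pvalue


def p_value_categorical(train_col, infer_col):
    """Chi-square test on the contingency table of label counts."""
    table = pd.concat([train_col.value_counts(), infer_col.value_counts()], axis=1).fillna(0)
    return chi2_contingency(table.to_numpy())[1]


def inverse_log(p_value, eps=1e-30):
    """Plotting transform so that very small p-values stand out."""
    return -np.log10(max(p_value, eps))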

Conclusion

Amazon SageMaker Model Monitor is a powerful tool for detecting data drift. In this post, we showed how to integrate your custom data drift detection algorithms into Model Monitor while benefiting from the heavy lifting that Model Monitor provides for data capture and scheduling the monitoring runs. The notebook provides detailed step-by-step instructions on how to build a custom container and attach it to a Model Monitor schedule. Give Model Monitor a try and leave your feedback in the comments.


About the Author

Vinay Hanumaiah is a Senior Deep Learning Architect at Amazon ML Solutions Lab, where he helps customers build AI and ML solutions to accelerate their business challenges. Prior to this, he contributed to the launch of AWS DeepLens and Amazon Personalize. In his spare time, he enjoys time with his family and is an avid rock climber.

Read More

Detect defects and augment predictions using Amazon Lookout for Vision and Amazon A2I

With machine learning (ML), more powerful technologies have become available that can automate the task of detecting visual anomalies in a product. However, implementing such ML solutions is time-consuming and expensive because it involves managing and setting up complex infrastructure and having the right ML skills. Furthermore, ML applications need human oversight to ensure accuracy with anomaly detection, help provide continuous improvements, and retrain models with updated predictions. However, you're often forced to choose between an ML-only or human-only system. Companies are looking for the best of both worlds: integrating ML systems into their workflow while keeping a human eye on the results to achieve higher precision.

In this post, we show how you can easily set up Amazon Lookout for Vision to train a visual anomaly detection model using a printed circuit board dataset, use a human-in-the-loop workflow to review the predictions with Amazon Augmented AI (Amazon A2I), augment the dataset to incorporate human input, and retrain the model.

Solution overview

Lookout for Vision is an ML service that helps spot product defects using computer vision to automate the quality inspection process in your manufacturing lines, with no ML expertise required. You can get started with as few as 30 product images (20 normal, 10 anomalous) to train your unique ML model. Lookout for Vision uses your unique ML model to analyze your product images in near-real time and detect product defects, allowing your plant personnel to diagnose and take corrective actions.

Amazon A2I is an ML service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether running on AWS or not.

To get started with Lookout for Vision, we create a project, create a dataset, train a model, and run inference on test images. After going through these steps, we show you how you can quickly set up a human review process using Amazon A2I and retrain your model with augmented or human reviewed datasets. We also provide an accompanying Jupyter notebook.

Architecture overview

The following diagram illustrates the solution architecture.

 

The solution has the following workflow:

  1. Upload data from the source to Amazon Simple Storage Service (Amazon S3).
  2. Run Lookout for Vision to process data from the Amazon S3 path.
  3. Store inference results in Amazon S3 for downstream review.
  4. Use Lookout for Vision to determine if an input image is damaged and validate that the confidence level is above 70%. If below 70%, we start a human loop for a worker to manually determine whether an image is damaged.
  5. A private workforce investigates and validates the detected damages and provides feedback.
  6. Update the training data with corresponding feedback for subsequent model retraining.
  7. Repeat the retraining cycle for continuous model retraining.

Prerequisites

Before you get started, complete the following steps to set up the Jupyter notebook:

  1. Create a notebook instance in Amazon SageMaker.
  2. When the notebook is active, choose Open Jupyter.
  3. On the Jupyter dashboard, choose New, and choose Terminal.
  4. In the terminal, enter the following code:
cd SageMaker
git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
  5. Open the notebook for this post: Amazon-Lookout-for-Vision-and-Amazon-A2I-Integration.ipynb.

You’re now ready to run the notebook cells.

  6. Run the setup environment step to set up the necessary Python SDKs and variables:
!pip install lookoutvision
!pip install simplejson

In the first step, you need to define the following:

  • region – The Region where your project is located
  • project_name – The name of your Lookout for Vision project
  • bucket – The name of the Amazon S3 bucket where we output the model results
  • model_version – Your model version (the default setting is 1)
# Set the AWS region
region = '<AWS REGION>'

# Set your project name here
project_name = '<CHANGE TO AMAZON LOOKOUT FOR VISION PROJECT NAME>'

# Provide the name of the S3 bucket where we will output results and store images
bucket = '<S3 BUCKET NAME>'

# This will default to a value of 1; Since we're training a new model, leave this set to a value of 1
model_version = '1'
  7. Create the S3 bucket to store images:
!aws s3 mb s3://{bucket}
  8. Create a manifest file from the dataset by running the cell in the section Create a manifest file from the dataset in the notebook.

Lookout for Vision uses this manifest file to determine the location of the files, as well as the labels associated with the files.

Upload circuit board images to Amazon S3

To train a Lookout for Vision model, we need to copy the sample dataset from our local Jupyter notebook over to Amazon S3:

# Upload images to S3 bucket:
!aws s3 cp circuitboard/train/normal s3://{bucket}/{project_name}/training/normal --recursive
!aws s3 cp circuitboard/train/anomaly s3://{bucket}/{project_name}/training/anomaly --recursive

!aws s3 cp circuitboard/test/normal s3://{bucket}/{project_name}/validation/normal --recursive
!aws s3 cp circuitboard/test/anomaly s3://{bucket}/{project_name}/validation/anomaly --recursive

Create a Lookout for Vision project

You have a few options for creating your Lookout for Vision project: the Lookout for Vision console, the AWS Command Line Interface (AWS CLI), or an SDK. We chose the open-source Lookout for Vision Python SDK in this example, but highly recommend you check out the console method as well.

The steps we take with the SDK are:

  1. Create a project (the name was defined at the beginning) and tell your project where to find your training dataset. This is done via the manifest file for training.
  2. Tell your project where to find your test dataset. This is done via the manifest file for test.

This second step is optional. In general, all test-related code is optional; Lookout for Vision also works with just a training dataset. We use both because evaluating against a test dataset is a best practice when training ML models.

Create a manifest file from the dataset

Lookout for Vision uses this manifest file to determine the location of the files, as well as the labels associated with the files. See the following code:

#Create the manifest file

from lookoutvision.manifest import Manifest
mft = Manifest(
    bucket=bucket,
    s3_path="{}/".format(project_name),
    datasets=["training", "validation"])
mft_resp = mft.push_manifests()
print(mft_resp)
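Under the hood, each manifest is a JSON Lines file in which every line describes one image in the SageMaker Ground Truth image-classification format. The following is an illustrative sketch of how one training entry for an anomalous image could be built; the file name, job name, and date are placeholders, and the exact fields are best taken from the Lookout for Vision documentation:

import json
from datetime import datetime

# Illustrative only: one manifest line for an anomalous training image (placeholder values)
entry = {
    "source-ref": f"s3://{bucket}/{project_name}/training/anomaly/image_01.jpg",
    "anomaly-label": 1,
    "anomaly-label-metadata": {
        "confidence": 1,
        "job-name": "labeling-job/anomaly-classification",
        "class-name": "anomaly",
        "human-annotated": "yes",
        "creation-date": datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S.%f'),
        "type": "groundtruth/image-classification",
    },
}
print(json.dumps(entry))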

Create a Lookout for Vision project

The following command creates a Lookout for Vision project:

# Create an Amazon Lookout for Vision Project

from lookoutvision.lookoutvision import LookoutForVision
l4v = LookoutForVision(project_name=project_name)
# If project does not exist: create it
p = l4v.create_project()
print(p)
print('Done!')

Create and train a model

In this section, we walk through the steps of creating the training and test datasets, training the model, and hosting the model.

Create the training and test datasets from images in Amazon S3

After we create the Lookout for Vision project, we create the project dataset by using the sample images we uploaded to Amazon S3 along with the manifest files. See the following code:

dsets = l4v.create_datasets(mft_resp, wait=True)
print(dsets)
print('Done!')

Train the model

After we create the Lookout for Vision project and the datasets, we can train our first model:

l4v.fit(
    output_bucket=bucket,
    model_prefix="mymodel_",
    wait=True)

When training is complete, we can view available model metrics:

from lookoutvision.metrics import Metrics

met = Metrics(project_name=project_name)
met.describe_model(model_version=model_version)

You should see an output similar to the following.

Metric      Model Version   Status    Status Message                      Model Performance
F1 Score    1               TRAINED   Training completed successfully.    0.93023
Precision   1               TRAINED   Training completed successfully.    0.86957
Recall      1               TRAINED   Training completed successfully.    1

Host the model

Before we can use our newly trained Lookout for Vision model, we need to host it:

l4v.deploy(
    model_version=model_version,
    wait=True)

Set up Amazon A2I to review predictions from Lookout for Vision

In this section, you set up a human review loop in Amazon A2I to review inferences that are below the confidence threshold. You must first create a private workforce and create a human task UI.

Create a workforce

You need to create a workforce via the SageMaker console. Note the ARN of the workforce and enter its value in the notebook cell:

WORKTEAM_ARN = 'your workforce team ARN'

The following screenshot shows the details of a private team named lfv-a2i and its corresponding ARN.

Create a human task UI

You now create a human task UI resource: a UI template in liquid HTML. This HTML page is rendered to the human workers whenever a human loop is required. For over 70 pre-built UIs, see the amazon-a2i-sample-task-uis GitHub repo.

Follow the steps provided in the notebook section Create a human task UI to create the web form, initialize Amazon A2I APIs, and inspect output:

...
def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response
...

Create a human task workflow

Workflow definitions allow you to specify the following:

  • The worker template or human task UI you created in the previous step.
  • The workforce that your tasks are sent to. For this post, it’s the private workforce you created in the prerequisite steps.
  • The instructions that your workforce receives.

This post uses the Create Flow Definition API to create a workflow definition. The results of human review are stored in an Amazon S3 bucket, which can be accessed by the client application. Run the cell Create a Human task Workflow in the notebook and inspect the output:

create_workflow_definition_response = sagemaker_client.create_flow_definition(
        FlowDefinitionName = flowDefinitionName,
        RoleArn = role,
        HumanLoopConfig = {
            "WorkteamArn": workteam_arn,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Select if the component is damaged or not.",
            "TaskTitle": "Verify if the component is damaged or not"
        },
        OutputConfig={
            "S3OutputPath" : a2i_results
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] 
# let's save this ARN for future use

Make predictions and start a human loop based on the confidence level threshold

In this section, we loop through an array of new images and use the Lookout for Vision SDK to determine if our input images are damaged or not, and if they’re above or below a defined threshold. For this post, we set the threshold confidence level at .70. If our result is below .70, we start a human loop for a worker to manually determine if our image is normal or an anomaly. See the following code:

...

SCORE_THRESHOLD = .70

for fname in Incoming_Images_Array:
    #Lookout for Vision inference using detect_anomalies
    fname_full_path = (Incoming_Images_Dir + "/" + fname)
    with open(fname_full_path, "rb") as image:
        modelresponse = L4Vclient.detect_anomalies(
            ProjectName=project_name,
            ContentType="image/jpeg",  # or image/png for png format input image.
            Body=image.read(),
            ModelVersion=model_version,
            )
        modelresponseconfidence = (modelresponse["DetectAnomalyResult"]["Confidence"])

    if (modelresponseconfidence < SCORE_THRESHOLD):
...
        # start an a2i human review loop with an input
        start_loop_response = a2i.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(inputContent)
            }
        )
... 

You should get the output shown in the following screenshot.

Complete your review and check the human loop status

If inference results are below the defined threshold, a human loop is created. We can review the status of those jobs and wait for results:

...
completed_human_loops = []
for human_loop_name in human_loops_started:
    resp = a2i.describe_human_loop(HumanLoopName=human_loop_name)
    print(f'HumanLoop Name: {human_loop_name}')
    print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
    print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
    print('\n')
    
    if resp["HumanLoopStatus"] == "Completed":
        completed_human_loops.append(resp)
workteamName = workteam_arn[workteam_arn.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

Work team members see the following UI, where they choose the correct label for the image.

View results of the Amazon A2I workflow and move objects to the correct folder for retraining

After the work team members have completed the human loop tasks, let’s use the results of the tasks to sort our images into the correct folders for training a new model. See the following code:

...

    # move the image to the appropriate training folder
    if (labelanswer == "Normal"):
        # move the object to the normal training folder: s3://{bucket}/{project_name}/train/normal/
        !aws s3 cp {taskObjectResponse} s3://{bucket}/{project_name}/train/normal/
    else:
        # move object to the Anomaly training folder
        !aws s3 cp {taskObjectResponse} s3://{bucket}/{project_name}/train/anomaly/
...

Retrain your model based on augmented datasets from Amazon A2I

Training a new model version can be triggered as a batch job on a schedule, manually as needed, or based on how many new images have been added to the training folders. For this example, we use the Lookout for Vision SDK to retrain our model using the images that we've now included in our modified dataset. See the accompanying Jupyter notebook, downloadable from [GitHub-LINK], for the complete walkthrough.
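If you want to trigger retraining based on how many reviewed images have accumulated, a minimal sketch might look like the following (the prefix layout matches the copy commands above; the threshold and helper function are assumptions for illustration):

import boto3

s3 = boto3.client('s3')

def count_objects(bucket_name, prefix):
    """Count objects under an S3 prefix (handles pagination)."""
    paginator = s3.get_paginator('list_objects_v2')
    return sum(page.get('KeyCount', 0) for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix))

reviewed_images = (
    count_objects(bucket, f'{project_name}/train/normal/')
    + count_objects(bucket, f'{project_name}/train/anomaly/')
)

RETRAIN_THRESHOLD = 25  # assumption: retrain once at least 25 reviewed images are available
print(f'{reviewed_images} reviewed images available; retrain: {reviewed_images >= RETRAIN_THRESHOLD}')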

# Train the model!
l4v.fit(
    output_bucket=bucket,
    model_prefix="mymodelprefix_",
    wait=True)

You should see an output similar to the following.

Now that we’ve trained a new model using newly added images, let’s check the model metrics! We show the results from the first model and the second model at the same time:

# All models of the same project
met.describe_models()

You should see an output similar to the following. The table shows two models: a hosted model (ModelVersion:1) and the retrained model (ModelVersion:2). The performance of the retrained model is better with the human reviewed and labeled images.

Metric      Model Version   Status    Status Message                      Model Performance
F1 Score    2               TRAINED   Training completed successfully.    0.98
Precision   2               TRAINED   Training completed successfully.    0.96
Recall      2               TRAINED   Training completed successfully.    1
F1 Score    1               HOSTED    The model is running.               0.93023
Precision   1               HOSTED    The model is running.               0.86957
Recall      1               HOSTED    The model is running.               1

Clean up

Run the Stop the model and cleanup resources cell to clean up the resources that were created. Delete any Lookout for Vision projects you’re no longer using, and remove objects from Amazon S3 to save costs. See the following code:

#If you are not using the model, stop to save costs! This can take up to 5 minutes.

#change the model version to whichever model you're using within your current project
model_version='1'
l4v.stop_model(model_version=model_version)

Conclusion

This post demonstrated how you can use Lookout for Vision and Amazon A2I to train models to detect defects in objects unique to your business and define conditions to send the predictions to a human workflow with labelers to review and update the results. You can use the human labeled output to augment the training dataset for retraining to improve the model accuracy.

Start your journey towards industrial anomaly detection and identification by visiting the Lookout for Vision Developer Guide and the Amazon A2I Developer Guide.


About the Authors

Dennis Thurmon is a Solutions Architect at AWS, with a passion for Artificial Intelligence and Machine Learning. Based in Seattle, Washington, Dennis worked as a Systems Development Engineer on the Amazon Go and Amazon Books team before focusing on helping AWS customers bring their workloads to life in the AWS Cloud.

Amit Gupta is an AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Neel Sendas is a Senior Technical Account Manager at Amazon Web Services. Neel works with enterprise customers to design, deploy, and scale cloud applications to achieve their business goals. He has worked on various ML use cases, ranging from anomaly detection to predictive product quality for manufacturing and logistics optimization. When he is not helping customers, he dabbles in golf and salsa dancing.

Read More