Encode multi-lingual text properties in Amazon Neptune to train predictive models

Amazon Neptune ML is a machine learning (ML) capability of Amazon Neptune that helps you make accurate and fast predictions on your graph data. Under the hood, Neptune ML uses Graph Neural Networks (GNNs) to simultaneously take advantage of graph structure and node/edge properties to solve the task at hand. Traditional methods either use only properties and no graph structure (e.g., XGBoost, Neural Networks), or only graph structure and no properties (e.g., node2vec, Label Propagation). To better manipulate the node/edge properties, ML algorithms require the data to be well-behaved numerical data, but raw data in a database can have other types, like raw text. To make use of these other types of data, we need specialized processing steps that convert them from their native type into numerical data, and the quality of the ML results is strongly dependent on the quality of these data transformations. Raw text, like sentences, is among the most difficult types to transform, but recent progress in the field of Natural Language Processing (NLP) has led to strong methods that can handle text coming from multiple languages and a wide variety of lengths.

Beginning with version 1.1.0.0, Neptune ML supports multiple text encoders (text_fasttext, text_sbert, text_word2vec, and text_tfidf), which bring the benefits of recent advances in NLP and enable support for multi-lingual text properties as well as additional inference requirements around languages and text length. For example, in a job recommendation use case, the job posts in different countries can be described in different languages, and the lengths of job descriptions vary considerably. Additionally, Neptune ML supports an auto option that automatically chooses the best encoding method based on the characteristics of the text feature in the data.

In this post, we illustrate the usage of each text encoder, compare their advantages and disadvantages, and show an example of how to choose the right text encoders for a job recommendation task.

What is a text encoder?

The goal of text encoding is to convert the text-based edge/node properties in Neptune into fixed-size vectors for use in downstream machine learning models for either node classification or link prediction tasks. The length of the text feature can vary a lot. It can be a word, phrase, sentence, paragraph, or even a document with multiple sentences (the maximum size of a single property is 55 MB in Neptune). Additionally, the text features can be in different languages. There may also be sentences that contain words in several different languages, which we define as code-switching.

Beginning with the 1.1.0.0 release, Neptune ML allows you to choose from several different text encoders. Each encoder works slightly differently, but has the same goal of converting a text value field from Neptune into a fixed-size vector that we use to build our GNN model using Neptune ML. The new encoders are as follows:

  • text_fasttext (new) – Uses fastText encoding. FastText is a library for efficient text representation learning. text_fasttext is recommended for features that use one and only one of the five languages that fastText supports (English, Chinese, Hindi, Spanish, and French). The text_fasttext method can optionally take the max_length field, which specifies the maximum number of tokens in a text property value that will be encoded, after which the string is truncated. You can regard a token as a word. This can improve performance when text property values contain long strings, because if max_length is not specified, fastText encodes all the tokens regardless of the string length.
  • text_sbert (new) – Uses the Sentence BERT (SBERT) encoding method. SBERT is a sentence embedding method that uses contextual representation learning models (BERT networks). text_sbert is recommended when the language is not supported by text_fasttext. Neptune supports two SBERT methods: text_sbert128, which is the default if you just specify text_sbert, and text_sbert512. The difference between them is the maximum number of tokens in a text property that get encoded. The text_sbert128 encoding only encodes the first 128 tokens, whereas text_sbert512 encodes up to 512 tokens. As a result, using text_sbert512 can require more processing time than text_sbert128. Both methods are slower than text_fasttext.
  • text_word2vec – Uses Word2Vec algorithms originally published by Google to encode text. Word2Vec only supports English.
  • text_tfidf – Uses a term frequency-inverse document frequency (TF-IDF) vectorizer for encoding text. TF-IDF encoding supports statistical features that the other encodings do not. It quantifies the importance or relevance of words in one node property among all the other nodes.

Note that text_word2vec and text_tfidf were previously supported and the new methods text_fasttext and text_sbert are recommended over the old methods.

Comparison of different text encoders

The following table shows a detailed comparison of the supported model-based text encoding options (text_fasttext, text_sbert, and text_word2vec). text_tfidf is not a model-based encoding method, but rather a count-based measure that evaluates how relevant a token (for example, a word) is to the text features in other nodes or edges, so we don’t include text_tfidf in the comparison. We recommend using text_tfidf when you want to quantify the importance or relevance of some words in one node or edge property amongst all the other node or edge properties.

Category | Attribute | text_fasttext | text_sbert | text_word2vec
Model capability | Supported languages | English, Chinese, Hindi, Spanish, and French | More than 50 languages | English
Model capability | Can encode text properties that contain words in different languages | No | Yes | No
Model capability | Max-length support | No maximum length limit | Encodes the text sequence with a maximum length of 128 or 512 tokens | No maximum length limit
Time cost | Model loading | Approximately 10 seconds | Approximately 2 seconds | Approximately 2 seconds
Time cost | Inference | Fast | Slow | Medium

Note the following usage tips:

  • For text property values in English, Chinese, Hindi, Spanish, and French, text_fasttext is the recommended encoding. However, it can’t handle cases where the same sentence contains words in more than one language. For languages other than the five that fastText supports, use text_sbert encoding.
  • If you have many property value text strings longer than, for example, 120 tokens, use the max_length field to limit the number of tokens in each string that text_fasttext encodes.

To summarize, depending on your use case, we recommend the following encoding method:

  • If your text properties are in one of the five supported languages, we recommend using text_fasttext due to its fast inference. text_fasttext is the recommended choice overall; use text_sbert only in the two exception cases described next.
  • If your text properties are in different languages, we recommend using text_sbert because it’s the only supported method that can encode text properties containing words in several different languages.
  • If your text properties are in one language that isn’t one of the five supported languages, we recommend using text_sbert because it supports more than 50 languages.
  • If the average length of your text properties is longer than 128, consider using text_sbert512 or text_fasttext. Both methods can encode longer text sequences.
  • If your text properties are in English only, you can use text_word2vec, but we recommend using text_fasttext for its fast inference.
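To make this guidance concrete, the following minimal Python sketch picks an encoder type from a property’s language set and average token count. It is purely illustrative and not part of Neptune ML; the function name and threshold are assumptions based on the recommendations above.

# Illustrative helper only -- not part of Neptune ML.
FASTTEXT_LANGUAGES = {"en", "zh", "hi", "es", "fr"}  # the five languages fastText supports

def choose_text_encoder(languages, avg_token_count):
    """Pick a text encoder type following the recommendations above."""
    if len(languages) == 1 and next(iter(languages)) in FASTTEXT_LANGUAGES:
        return "text_fasttext"  # fastest inference for a single supported language
    # Mixed-language or unsupported-language text falls back to SBERT
    return "text_sbert512" if avg_token_count > 128 else "text_sbert128"

print(choose_text_encoder({"en"}, 192))      # -> text_fasttext
print(choose_text_encoder({"en", "es"}, 5))  # -> text_sbert128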

Use case demo: Job recommendation task

The goal of the job recommendation task is to predict what jobs users will apply for based on their previous applications, demographic information, and work history. This post uses an open Kaggle dataset. We construct the dataset as a three-node type graph: job, user, and city.

A job is characterized by its title, description, requirements, and the city and state where it is located. A user is described with properties such as major, degree type, work history count, total years of work experience, and more. For this use case, job title, job description, job requirements, and majors are all in the form of text.

In the dataset, users have the following properties:

  • State – For example, CA or 广东省 (Chinese)
  • Major – For example, Human Resources Management or Lic Cytura Fisica (Spanish)
  • DegreeType – For example, Bachelor’s, Master’s, PhD, or None
  • WorkHistoryCount – For example, 0, 1, 16, and so on
  • TotalYearsExperience – For example, 0.0, 10.0, or NAN

Jobs have the following properties:

  • Title – For example, Administrative Assistant or Lic Cultura Física (Spanish).
  • Description – For example, “This Administrative Assistant position is responsible for performing a variety of clerical and administrative support functions in the areas of communications, …” The average number of words in a description is around 192.2.
  • Requirements – For example, “JOB REQUIREMENTS: 1. Attention to detail; 2.Ability to work in a fast paced environment;3.Invoicing…”
  • State – For example, CA, NY, and so on.

The city node type, such as Washington DC or Orlando FL, has only an identifier for each node. In the following section, we analyze the characteristics of the different text features and illustrate how to select the proper text encoders for different text properties.

How to select different text encoders

For our example, the Major and Title properties are in multiple languages and have short text sequences, so text_sbert is recommended. The sample code for the export parameters is as follows. For the text_sbert type, there are no other parameter fields. Here we choose text_sbert128 rather than text_sbert512 because these text values are typically much shorter than 128 tokens.

"additionalParams": {
    "neptune_ml": {
        "version": "v2.0",
        "targets": [ ... ],
        "features": [
            {
                "node": "user",
                "property": "Major",
                "type": "text_sbert128"
            },
            {
                "node": "job",
                "property": "Title",
                "type": "text_sbert128"
            }, ...
        ], ...
    }
}

The Description and Requirements properties are usually long text sequences. The average length of a description is around 192 words, which is longer than the maximum input length of text_sbert128 (128 tokens). We can use text_sbert512, but it may result in slower inference. In addition, the text is in a single language (English). Therefore, we recommend text_fasttext with the en language value because of its fast inference and unlimited input length. The sample code for the export parameters is as follows. The text_fasttext encoding can be customized using language and max_length. The language value is required, but max_length is optional.

"additionalParams": {
    "neptune_ml": {
        "version": "v2.0",
        "targets": [ ... ],
        "features": [
            {
                "node": "job",
                "property": "Description",
                "type": "text_fasttext",
                "language": "en",
                "max_length": 256
            },
            {
                "node": "job",
                "property": "Requirements",
                "type": "text_fasttext",
                "language": "en"
            }, ...
        ], ...
    }
}

More details of the job recommendation use case can be found in the Neptune notebook tutorial.

For demonstration purposes, we select one user, user 443931, who holds a Master’s degree in Management and Human Resources. The user has applied to five different jobs, titled “Human Resources (HR) Manager”, “HR Generalist”, “Human Resources Manager”, “Human Resources Administrator”, and “Senior Payroll Specialist”. To evaluate the performance of the recommendation task, we delete 50% of the user’s applied jobs (the edges); here we delete “Human Resources Administrator” and “Human Resources (HR) Manager”, and we try to predict the top 10 jobs this user is most likely to apply for.

After encoding the job features and user features, we perform a link prediction task by training a relational graph convolutional network (RGCN) model. Training a Neptune ML model requires three steps: data processing, model training, and endpoint creation. After the inference endpoint has been created, we can make recommendations for user 443931. From the predicted top 10 jobs for user 443931 (i.e., “HR Generalist”, “Human Resources (HR) Manager”, “Senior Payroll Specialist”, “Human Resources Administrator”, “HR Analyst”, and so on), we observe that the two deleted jobs are among the 10 predictions.

Conclusion

In this post, we showed the usage of the newly supported text encoders in Neptune ML. These text encoders are simple to use and can support multiple requirements. In summary:

  • text_fasttext is recommended for features that use one and only one of the five languages that text_fasttext supports.
  • text_sbert is recommended for text that text_fasttext doesn’t support.
  • text_word2vec only supports English, and can be replaced by text_fasttext in any scenario.

For more details about the solution, see the GitHub repo. We recommend using the text encoders on your graph data to meet your requirements. You can just choose an encoder name and set some encoder attributes, while keeping the GNN model unchanged.


About the authors

Jiani Zhang is an applied scientist of AWS AI Research and Education (AIRE). She works on solving real-world applications using machine learning algorithms, especially natural language and graph related problems.


Build a solution for a computer vision skin lesion classifier using Amazon SageMaker Pipelines

Amazon SageMaker Pipelines is a continuous integration and continuous delivery (CI/CD) service designed for machine learning (ML) use cases. You can use it to create, automate, and manage end-to-end ML workflows. It tackles the challenge of orchestrating each step of an ML process, which requires time, effort, and resources. To facilitate its use, multiple templates are available that you can customize according to your needs.

Fully managed image and video analysis services have also accelerated the adoption of computer vision solutions. AWS offers a pre-trained and fully managed AI service called Amazon Rekognition that can be integrated into computer vision applications using API calls and requires no ML experience. You just have to provide an image to the Amazon Rekognition API and it can identify the required objects according to pre-defined labels. You can also provide custom labels specific to your use case and build a customized computer vision model with little to no need for ML expertise.

In this post, we address a specific computer vision problem: skin lesion classification, and use Pipelines by customizing an existing template and tailoring it to this task. Accurate skin lesion classification can help with early diagnosis of cancer diseases. However, it’s a challenging task in the medical field, because there is a high similarity between different kinds of skin lesions. Pipelines allows us to take advantage of a variety of existing models and algorithms, and establish an end-to-end productionized pipeline with minimal effort and time.

Solution overview

In this post, we build an end-to-end pipeline using Pipelines to classify dermatoscopic images of common pigmented skin lesions. We use the Amazon SageMaker Studio project template MLOps template for building, training, and deploying models and the code in the following GitHub repository. The resulting architecture is shown in the following figure.

For this pipeline, we use the HAM10000 (“Human Against Machine with 10000 training images”) dataset, which consists of 10,015 dermatoscopic images. The task at hand is a multi-class classification in the field of computer vision. This dataset depicts seven of the most important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma or Bowen’s disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines or seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv), and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc).

For the format of the model’s input, we use the RecordIO format. This is a compact format that stores image data together for continuous reading, which enables faster and more efficient training. In addition, one of the challenges of using the HAM10000 dataset is the class imbalance. The following table illustrates the class distribution.

Class | akiec | bcc | bkl | df | mel | nv | vasc
Number of images | 327 | 514 | 1099 | 115 | 1113 | 6705 | 142
Total | 10015

To address this issue, we augment the dataset using random transformations (such as cropping, flipping, mirroring, and rotating) to have all classes with approximately the same number of images.

This preprocessing step uses MXNet and OpenCV, therefore it uses a pre-built MXNet container image. The rest of the dependencies are installed using a requirements.txt file. If you want to create and use a custom image, refer to Create Amazon SageMaker projects with image building CI/CD pipelines.
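As a rough illustration of the augmentation step, the following Python sketch produces mirrored, flipped, rotated, and cropped copies of an image with OpenCV. It is a simplified stand-in for the preprocessing code in the repository; the function and variable names are our own.

import cv2
import numpy as np

def augment(image):
    """Return a few randomly transformed copies of an image (mirror, flip, rotate, crop)."""
    h, w = image.shape[:2]
    copies = [cv2.flip(image, 1), cv2.flip(image, 0)]           # horizontal mirror and vertical flip
    angle = np.random.uniform(-30, 30)                           # random rotation angle in degrees
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    copies.append(cv2.warpAffine(image, matrix, (w, h)))
    y0, x0 = np.random.randint(0, h // 10), np.random.randint(0, w // 10)
    crop = image[y0:y0 + int(h * 0.9), x0:x0 + int(w * 0.9)]     # random crop, resized back
    copies.append(cv2.resize(crop, (w, h)))
    return copies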

For the training step, we use the estimator based on the SageMaker built-in image classification algorithm Docker image and set its parameters as follows:

import sagemaker

# training_image is the URI of the SageMaker built-in image classification container,
# and role is the pipeline execution role; both are defined earlier in the pipeline code.
hyperparameters = {
    "num_layers": 18,
    "use_pretrained_model": 1,          # fine-tune a pre-trained network
    "augmentation_type": "crop_color_transform",
    "image_shape": "3,224,224",
    "num_classes": 7,
    "num_training_samples": 29311,
    "mini_batch_size": 8,
    "epochs": 5,
    "learning_rate": 0.00001,
    "precision_dtype": "float32",
}

estimator_config = {
    "hyperparameters": hyperparameters,
    "image_uri": training_image,
    "role": role,
    "instance_count": 1,
    "instance_type": "ml.p3.2xlarge",
    "volume_size": 100,
    "max_run": 360000,
    # bucket and base_job_prefix are pipeline parameters defined elsewhere
    "output_path": f"s3://{bucket}/{base_job_prefix}/training_jobs",
}

image_classifier = sagemaker.estimator.Estimator(**estimator_config)

For further details about the container image, refer to Image Classification Algorithm.

Create a Studio project

For detailed instructions on how to set up Studio, refer to Onboard to Amazon SageMaker Domain Using Quick setup. To create your project, complete the following steps:

  1. In Studio, choose the Projects menu on the SageMaker resources menu.

    On the projects page, you can launch a pre-configured SageMaker MLOps template.
  2. Choose MLOps template for model building, training, and deployment.
  3. Choose Select project template.
  4. Enter a project name and short description.
  5. Choose Create project.

The project takes a few minutes to be created.

Prepare the dataset

To prepare the dataset, complete the following steps:

  1. Go to Harvard DataVerse.
  2. Choose Access Dataset, and review the license Creative Commons Attribution-NonCommercial 4.0 International Public License.
  3. If you accept the license, choose Original Format Zip and download the ZIP file.
  4. Create an Amazon Simple Storage Service (Amazon S3) bucket and choose a name starting with sagemaker (this allows SageMaker to access the bucket without any extra permissions).
  5. You can enable access logging and encryption for security best practices.
  6. Upload dataverse_files.zip to the bucket.
  7. Save the S3 bucket path for later use.
  8. Make a note of the name of the bucket you have stored the data in, and the names of any subsequent folders, to use later.

Prepare for data preprocessing

Because we’re using MXNet and OpenCV in our preprocessing step, we use a pre-built MXNet Docker image and install the remaining dependencies using the requirements.txt file. To do so, you need to copy it and paste it under pipelines/skin in the sagemaker-<pipeline-name>-modelbuild repository. In addition, add the MANIFEST.in file at the same level as setup.py, to tell Python to include the requirements.txt file. For more information about MANIFEST.in, refer to Including files in source distributions with MANIFEST.in. Both files can be found in the GitHub repository.
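For reference, a minimal MANIFEST.in for this layout could consist of a single include directive; the exact path depends on where you place requirements.txt (the line below assumes pipelines/skin):

include pipelines/skin/requirements.txt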

Change the Pipelines template

To update the Pipelines template, complete the following steps:

  1. Create a folder inside the default bucket.
  2. Make sure the Studio execution role has access to the default bucket as well as the bucket containing the dataset.
  3. From the list of projects, choose the one that you just created.
  4. On the Repositories tab, choose the hyperlinks to locally clone the AWS CodeCommit repositories to your local Studio instance.
  5. Navigate to the pipelines directory inside the sagemaker-<pipeline-name>-modelbuild directory and rename the abalone directory to skin.
  6. Open the codebuild-buildspec.yml file in the sagemaker-<pipeline-name>-modelbuild directory and modify the run pipeline path from run-pipeline —module-name pipelines.abalone.pipeline (line 15) to the following:
    run-pipeline --module-name pipelines.skin.pipeline 

  7. Save the file.
  8. Replace the files pipelines.py, preprocess.py, and evaluate.py in the pipelines directory with the files from the GitHub repository.
  9. Update the preprocess.py file (lines 183-186) with the S3 location (SKIN_CANCER_BUCKET) and folder name (SKIN_CANCER_BUCKET_PATH) where you uploaded the dataverse_files.zip archive:
    1. skin_cancer_bucket="<bucket-name-containing-dataset>"
    2. skin_cancer_bucket_path="<prefix-to-dataset-inside-bucket>"
    3. skin_cancer_files="<dataset-file-name-without-extension>"
    4. skin_cancer_files_ext="<dataset-file-name-with-extension>"

In the preceding example, the dataset would be stored under s3://monai-bucket-skin-cancer/skin_cancer_bucket_prefix/dataverse_files.zip.

Trigger a pipeline run

Pushing committed changes to the CodeCommit repository (done on the Studio source control tab) triggers a new pipeline run, because an Amazon EventBridge event monitors for commits. We can monitor the run by choosing the pipeline inside the SageMaker project. The following screenshot shows an example of a pipeline that ran successfully.

  1. To commit the changes, navigate to the Git section on the left pane.
  2. Stage all relevant changes. You don’t need to keep track of the -checkpoint file. You can add an entry to the .gitignore file with *checkpoint.* to ignore them.
  3. Commit the changes by providing a summary as well as your name and an email address.
  4. Push the changes.
  5. Navigate back to the project and choose the Pipelines section.
  6. If you choose the pipelines in progress, the steps of the pipeline appear.
    This allows you to monitor the step that is currently running. It may take a couple of minutes for the pipeline to appear. For the pipeline to start running, the steps defined in the CI/CD codebuild-buildspec.yml file have to run successfully. To check on the status of these steps, you can use AWS CodeBuild. For more information, refer to AWS CodeBuild (AMS SSPS).
  7. When the pipeline is complete, go back to the project page and choose the Model groups tab to inspect the metadata attached to the model artifacts.
  8. If everything looks good, choose the Update Status tab and manually approve the model. The default ModelApprovalStatus is set to PendingManualApproval. If the model has greater than 60% accuracy, it’s added to the model registry, but not deployed until manual approval is complete.
  9. Navigate to the Endpoints page on the SageMaker console, where you can see a staging endpoint being created. After a few minutes, the endpoint is listed with the InService status.
  10. To deploy the endpoint into production, on the CodePipeline console, choose the sagemaker-<pipeline-name>-modeldeploy pipeline that is currently in progress.
  11. At the end of the DeployStaging phase, you need to manually approve the deployment.

After this step, you can see the production endpoint being deployed on the SageMaker Endpoints page. After a while, the endpoint shows as InService.

Clean up

You can easily clean up all the resources created by the SageMaker project.

  1. In the navigation pane in Studio, choose SageMaker resources.
  2. Choose Projects from the drop-down menu and choose your project.
  3. On the Actions menu, choose Delete to delete all related resources.

Results and next steps

We successfully used Pipelines to create an end-to-end MLOps framework for skin lesion classification using a built-in model on the HAM10000 dataset. For the parameters provided in the repository, we obtained the following results on the test set.

Metric | Precision | Recall | F1 score
Value | 0.643 | 0.8 | 0.713

You can work further on improving the performance of the model by fine-tuning its hyperparameters, adding more transformations for data augmentation, or using other methods, such as Synthetic Minority Oversampling Technique (SMOTE) or Generative Adversarial Networks (GANs). Furthermore, you can use your own model or algorithm for training by using built-in SageMaker Docker images or adapting your own container to work on SageMaker. For further details, refer to Using Docker containers with SageMaker.

You can also add features to your pipeline. If you want to include monitoring, choose the MLOps template for model building, training, deployment and monitoring when creating the SageMaker project. The resulting architecture has an additional monitoring step. Or, if you have an existing third-party Git repository, you can use it by choosing the MLOps template for model building, training, and deployment with third-party Git repositories using Jenkins, and providing information for both the model building and model deployment repositories. This allows you to utilize any existing code and saves you time and effort on integration between SageMaker and Git. However, for this option, an AWS CodeStar connection is required.

Conclusion

In this post, we showed how to create an end-to-end ML workflow using Studio and automated Pipelines. The workflow includes getting the dataset, storing it in a place accessible to the ML model, configuring a container image for preprocessing, and then modifying the boilerplate code to accommodate that image. We then showed how to trigger the pipeline, the steps the pipeline follows, and how they work. We also discussed how to monitor model performance and deploy the model to an endpoint.

We performed most of these tasks within Studio, which acts as an all-encompassing ML IDE, and accelerates the development and deployment of such models.

This solution is not bound to the skin classification task. You can extend it to any classification or regression task using any of the SageMaker built-in algorithms or pre-trained models.


About the authors

Mariem Kthiri is an AI/ML consultant at AWS Professional Services Globals and is part of the Health Care and Life Science (HCLS) team. She is passionate about building ML solutions for various problems and always eager to jump on new opportunities and initiatives. She lives in Munich, Germany and is keen on traveling and discovering other parts of the world.

Yassine Zaafouri is an AI/ML consultant within Professional Services at AWS. He enables global enterprise customers to build and deploy AI/ML solutions in the cloud to overcome their business challenges. In his spare time, he enjoys playing and watching sports and traveling around the world.

Fotinos Kyriakides is an AI/ML Engineer within Professional Services at AWS. He is passionate about using technology to provide value to customers and achieve business outcomes. Based in London, in his spare time he enjoys running and exploring.

Anna Zapaishchykova was a ProServe Consultant in AI/ML and a member of Amazon Healthcare TFC. She is passionate about technology and the impact it can make on healthcare. Her background is in building MLOps and AI-powered solutions to customer problems in a variety of domains such as insurance, automotive, and healthcare.


How Amazon Search runs large-scale, resilient machine learning projects with Amazon SageMaker

If you have searched for an item to buy on amazon.com, you have used Amazon Search services. At Amazon Search, we’re responsible for the search and discovery experience for our customers worldwide. In the background, we index our worldwide catalog of products, deploy highly scalable AWS fleets, and use advanced machine learning (ML) to match relevant and interesting products to every customer’s query.

Our scientists regularly train thousands of ML models to improve the quality of search results. Supporting large-scale experimentation presents its own challenges, especially when it comes to improving the productivity of the scientists training these ML models.

In this post, we share how we built a management system around Amazon SageMaker training jobs, allowing our scientists to fire-and-forget thousands of experiments and be notified when needed. They can now focus on high-value tasks and resolving algorithmic errors, saving 60% of their time.

The challenge

At Amazon Search, our scientists solve information retrieval problems by experimenting and running numerous ML model training jobs on SageMaker. To keep up with our team’s innovation, our models’ complexity and number of training jobs have increased over time. SageMaker training jobs allow us to reduce the time and cost to train and tune those models at scale, without the need to manage infrastructure.

Like everything in such large-scale ML projects, training jobs can fail due to a variety of factors. This post focuses on capacity shortages and failures due to algorithm errors.

We designed an architecture with a job management system to tolerate and reduce the probability of a job failing due to capacity unavailability or algorithm errors. It allows scientists to fire-and-forget thousands of training jobs, automatically retry them on transient failure, and get notified of success or failure if needed.

Solution overview

In the following solution diagram, we use SageMaker training jobs as the basic unit of our solution. That is, a job represents the end-to-end training of an ML model.

Logical architecture of our solution

The high-level workflow of this solution is as follows:

  1. Scientists invoke an API to submit a new job to the system.
  2. The job is registered with the New status in a metadata store.
  3. A job scheduler asynchronously retrieves New jobs from the metadata store, parses their input, and tries to launch SageMaker training jobs for each one. Their status changes to Launched or Failed depending on success.
  4. A monitor checks the jobs’ progress at regular intervals and reports their Completed, Failed, or InProgress state in the metadata store.
  5. A notifier is triggered to report Completed and Failed jobs to the scientists.

Persisting the jobs history in the metadata store also allows our team to conduct trend analysis and monitor project progress.

This job scheduling solution uses loosely coupled serverless components based on AWS Lambda, Amazon DynamoDB, Amazon Simple Notification Service (Amazon SNS), and Amazon EventBridge. This ensures horizontal scalability, allowing our scientists to launch thousands of jobs with minimal operations effort. The following diagram illustrates the serverless architecture.

Architecture overview of our solution

In the following sections, we go into more detail about each service and its components.

DynamoDB as the metadata store for job runs

The ease of use and scalability of DynamoDB made it a natural choice to persist the jobs metadata in a DynamoDB table. This solution stores several attributes of jobs submitted by scientists, thereby helping with progress tracking and workflow orchestration. The most important attributes are as follows:

  • JobId – A unique job ID. This can be auto-generated or provided by the scientist.
  • JobStatus – The status of the job.
  • JobArgs – Other arguments required for creating a training job, such as the input path in Amazon S3, the training image URI, and more. For a complete list of parameters required to create a training job, refer to CreateTrainingJob.
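The following sketch shows how the Submit Job function might register such a record with boto3; the table name and any attributes beyond those listed above are illustrative, not taken from the actual system.

import time
import boto3

# Table name is illustrative; JobId, JobStatus, and JobArgs mirror the attributes described above
jobs_table = boto3.resource("dynamodb").Table("training-jobs-metadata")

def submit_job(job_id, job_args):
    """Register a new training job in the metadata store with the New status."""
    jobs_table.put_item(
        Item={
            "JobId": job_id,
            "JobStatus": "New",       # picked up later by the job scheduler
            "JobArgs": job_args,      # string/integer arguments later passed to CreateTrainingJob
            "SubmittedAt": int(time.time()),
        }
    )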

Lambda for the core logic

We use three container-based Lambda functions to orchestrate the job workflow:

  • Submit Job – This function is invoked by scientists when they need to launch new jobs. It acts as an API for simplicity. You can also front it with Amazon API Gateway, if needed. This function registers the jobs in the DynamoDB table.
  • Launch Jobs – This function periodically retrieves New jobs from the DynamoDB table and launches them using the SageMaker CreateTrainingJob command. It retries on transient failures, such as ResourceLimitExceeded and CapacityError, to instrument resiliency into the system. It then updates the job status as Launched or Failed depending on success.
  • Monitor Jobs – This function periodically keeps track of job progress using the DescribeTrainingJob command, and updates the DynamoDB table accordingly. It polls Failed jobs from the metadata and assesses whether they should be resubmitted or marked as terminally failed. It also publishes notification messages to the scientists when their jobs reach a terminal state.
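As a sketch of the retry behavior in the Launch Jobs function, the following code attempts to start one training job and leaves the record in the New state on the transient error codes named above, so the next scheduled run retries it. Error handling is simplified, and the function and status names are assumptions.

import boto3
from botocore.exceptions import ClientError

sagemaker_client = boto3.client("sagemaker")
TRANSIENT_ERROR_CODES = {"ResourceLimitExceeded", "CapacityError"}  # retried on the next scheduled run

def launch_job(job_args):
    """Try to launch one SageMaker training job and return the resulting status."""
    try:
        sagemaker_client.create_training_job(**job_args)
        return "Launched"
    except ClientError as error:
        if error.response["Error"]["Code"] in TRANSIENT_ERROR_CODES:
            return "New"      # leave the job for the next scheduled retry
        return "Failed"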

EventBridge for scheduling

We use EventBridge to run the Launch Jobs and Monitor Jobs Lambda functions on a schedule. For more information, refer to Tutorial: Schedule AWS Lambda functions using EventBridge.

Alternatively, you can use Amazon DynamoDB Streams for the triggers. For more information, see DynamoDB Streams and AWS Lambda triggers.

Notifications with Amazon SNS

Our scientists are notified by email using Amazon SNS when their jobs reach a terminal state: Completed, Failed (after a maximum number of retries), or Stopped.

Conclusion

In this post, we shared how Amazon Search adds resiliency to ML model training workloads by scheduling them, and retrying them on capacity shortages or algorithm errors. We used Lambda functions in conjunction with a DynamoDB table as a central metadata store to orchestrate the entire workflow.

Such a scheduling system allows scientists to submit their jobs and forget about them. This saves time and allows them to focus on writing better models.

To go further in your learnings, you can visit Awesome SageMaker and find in a single place, all the relevant and up-to-date resources needed for working with SageMaker.


About the Authors

Luochao Wang is a Software Engineer at Amazon Search. He focuses on scalable distributed systems and automation tooling on the cloud to accelerate the pace of scientific innovation for Machine Learning applications.

Ishan Bhatt is a Software Engineer in Amazon Prime Video team. He primarily works in the MLOps space and has experience building MLOps products for the past 4 years using Amazon SageMaker.

Abhinandan Patni is a Senior Software Engineer at Amazon Search. He focuses on building systems and tooling for scalable distributed deep learning training and real time inference.

Eiman Elnahrawy is a Principal Software Engineer at Amazon Search leading the efforts on Machine Learning acceleration, scaling, and automation. Her expertise spans multiple areas, including Machine Learning, Distributed Systems, and Personalization.

Sofian Hamiti is an AI/ML specialist Solutions Architect at AWS. He helps customers across industries accelerate their AI/ML journey by helping them build and operationalize end-to-end machine learning solutions.

Dr. Romi Datta is a Senior Manager of Product Management in the Amazon SageMaker team, responsible for training, processing, and feature store. He has been at AWS for over 4 years, holding several product management leadership roles in SageMaker, S3, and IoT. Prior to AWS he worked in various product management, engineering, and operational leadership roles at IBM, Texas Instruments, and Nvidia. He has an M.S. and Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin, and an MBA from the University of Chicago Booth School of Business.

RJ is an engineer in the Search M5 team, leading the efforts to build large-scale deep learning systems for training and inference. Outside of work, he explores different cuisines and plays racquet sports.


Customize business rules for intelligent document processing with human review and BI visualization

Massive volumes of business documents are processed daily across industries. Many of these documents are paper-based, scanned into your system as images, or in an unstructured format like PDF. Each company may apply unique rules associated with its business background while processing these documents. Extracting information accurately and processing these documents flexibly is a challenge many companies face.

Amazon Intelligent Document Processing (IDP) allows you to take advantage of industry-leading machine learning (ML) technology without previous ML experience. This post introduces a solution included in the Amazon IDP workshop showcasing how to process documents to serve flexible business rules using Amazon AI services. You can use the following step-by-step Jupyter notebook to complete the lab.

Amazon Textract helps you easily extract text from various documents, and Amazon Augmented AI (Amazon A2I) allows you to implement a human review of ML predictions. The default Amazon A2I template allows you to build a human review pipeline based on rules, such as when the extraction confidence score is lower than a pre-defined threshold or required keys are missing. But in a production environment, you need the document processing pipeline to support flexible business rules, such as validating the string format, verifying the data type and range, and validating fields across documents. This post shows how you can use Amazon Textract and Amazon A2I to customize a generic document processing pipeline supporting flexible business rules.

Solution overview

For our sample solution, we use the Tax Form 990, a US IRS (Internal Revenue Service) form that provides the public with financial information about a non-profit organization. For this example, we only cover the extraction logic for some of the fields on the first page of the form. You can find more sample documents on the IRS website.

The following diagram illustrates the IDP pipeline that supports customized business rules with human review.

IDP HITM Overview

The architecture is composed of three logical stages:

  • Extraction – Extract data from the 990 Tax Form (we use page 1 as an example).

    • Retrieve a sample image stored in an Amazon Simple Storage Service (Amazon S3) bucket.
    • Call the Amazon Textract analyze_document API using the Queries feature to extract text from the page.
  • Validation – Apply flexible business rules with a human-in-the-loop review.

    • Validate the extracted data against business rules, such as validating the length of an ID field.
    • Send the document to Amazon A2I for a human to review if any business rules fail.
    • Reviewers use the Amazon A2I UI (a customizable website) to verify the extraction result.
  • BI visualization – We use Amazon QuickSight to build a business intelligence (BI) dashboard showing the process insights.

Customize business rules

You can define a generic business rule in the following JSON format. In the sample code, we define three rules:

  • The first rule is for the employer ID field. The rule fails if the Amazon Textract confidence score is lower than 99%. For this post, we set the confidence score threshold high, which will break by design. You could adjust the threshold to a more reasonable value to reduce unnecessary human effort in a real-world environment, such as 90%.
  • The second rule is for the DLN field (the unique identifier of the tax form), which is required for the downstream processing logic. This rule fails if the DLN field is missing or has an empty value.
  • The third rule is also for the DLN field but with a different condition type: LengthCheck. The rule breaks if the DLN length is not 16 characters.

The following code shows our business rules in JSON format:

rules = [
    {
        "description": "Employee Id confidence score should greater than 99",
        "field_name": "d.employer_id",
        "field_name_regex": None, # support Regex: "_confidence$",
        "condition_category": "Confidence",
        "condition_type": "ConfidenceThreshold",
        "condition_setting": "99",
    },
    {
        "description": "dln is required",
        "field_name": "dln",
        "condition_category": "Required",
        "condition_type": "Required",
        "condition_setting": None,
    },
    {
        "description": "dln length should be 16",
        "field_name": "dln",
        "condition_category": "LengthCheck",
        "condition_type": "ValueRegex",
        "condition_setting": "^[0-9a-zA-Z]{16}$",
    }
]

You can expand the solution by adding more business rules following the same structure.

Extract text using an Amazon Textract query

In the sample solution, we call the Amazon Textract analyze_document API query feature to extract fields by asking specific questions. You don’t need to know the structure of the data in the document (table, form, implied field, nested data) or worry about variations across document versions and formats. Queries use a combination of visual, spatial, and language cues to extract the information you seek with high accuracy.

To extract value for the DLN field, you can send a request with questions in natural languages, such as “What is the DLN?” Amazon Textract returns the text, confidence, and other metadata if it finds corresponding information on the image or document. The following is an example of an Amazon Textract query request:

import boto3

textract = boto3.client("textract")

# data_bucket and s3_key identify the document image stored in Amazon S3
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": data_bucket, "Name": s3_key}},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [
            {
                "Text": "What is the DLN?",
                "Alias": "The DLN number - unique identifier of the form",
            }
        ]
    },
)
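The answers come back as QUERY and QUERY_RESULT blocks in the response. A minimal way to map each query alias to its answer text and confidence could look like the following; the helper name is ours, and note that Amazon Textract reports confidence on a 0-100 scale.

def get_query_answers(response):
    """Map each query alias to the answer text and confidence from analyze_document output."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        alias = block["Query"].get("Alias", block["Query"]["Text"])
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for answer_id in relationship["Ids"]:
                result = blocks_by_id[answer_id]             # a QUERY_RESULT block
                answers[alias] = {
                    "value": result.get("Text"),
                    "confidence": result.get("Confidence"),  # 0-100 scale
                }
    return answers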

Define the data model

The sample solution constructs the data in a structured format to serve the generic business rule evaluation. To keep extracted values, you can define a data model for each document page. The following image shows how the text on page 1 maps to the JSON fields.

Custom data model

Each field represents a document’s text, check box, or table/form cell on the page. The JSON object looks like the following code:

{
    "dln": {
        "value": "93493319020929",
        "confidence": 0.9765, 
        "block": {} 
    },
    "omb_no": {
        "value": "1545-0047",
        "confidence": 0.9435,
        "block": {}
    },
    ...
}

You can find the detailed JSON structure definition in the GitHub repo.

Evaluate the data against business rules

The sample solution comes with a Condition class—a generic rules engine that takes the extracted data (as defined in the data model) and the rules (as defined in the customized business rules). It returns two lists with failed and satisfied conditions. We can use the result to decide if we should send the document to Amazon A2I for human review.

The Condition class source code is in the sample GitHub repo. It supports basic validation logic, such as validating a string’s length, value range, and confidence score threshold. You can modify the code to support more condition types and complex validation logic.
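To illustrate the idea (this is not the actual Condition class), a stripped-down evaluator for the three condition types shown earlier could look like the following; the function name and the 0-1 confidence scale are assumptions based on the data model above.

import re

def check_conditions(data, rules):
    """Return (failed, satisfied) rule lists for one page of extracted data (illustrative)."""
    failed, satisfied = [], []
    for rule in rules:
        field = data.get(rule["field_name"].split(".")[-1], {})
        value, confidence = field.get("value"), field.get("confidence", 0.0)
        if rule["condition_type"] == "ConfidenceThreshold":
            passed = confidence * 100 >= float(rule["condition_setting"])  # data model stores 0-1
        elif rule["condition_type"] == "Required":
            passed = value not in (None, "")
        elif rule["condition_type"] == "ValueRegex":
            passed = value is not None and re.match(rule["condition_setting"], value) is not None
        else:
            passed = True  # unknown condition types are skipped in this sketch
        (satisfied if passed else failed).append(rule)
    return failed, satisfied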

Create a customized Amazon A2I web UI

Amazon A2I allows you to customize the reviewer’s web UI by defining a worker task template. The template is a static webpage in HTML and JavaScript. You can pass data to the customized reviewer page using the Liquid syntax.

In the sample solution, the custom Amazon A2I UI template displays the page on the left and the failure conditions on the right. Reviewers can use it to correct the extraction value and add their comments.

The following screenshot shows our customized Amazon A2I UI. It shows the original image document on the left and the following failed conditions on the right:

  • The DLN numbers should be 16 characters long. The actual DLN has 15 characters.
  • The confidence score of employer_id is lower than 99%. The actual confidence score is around 98%.

The reviewers can manually verify these results and add comments in the CHANGE REASON text boxes.

Customized A2I review UI

For more information about integrating Amazon A2I into any custom ML workflow, refer to over 60 pre-built worker templates on the GitHub repo and Use Amazon Augmented AI with Custom Task Types.

Process the Amazon A2I output

After the reviewer verifies the result in the customized Amazon A2I UI and chooses Submit, Amazon A2I stores a JSON file in the S3 bucket folder. The JSON file includes the following information on the root level:

  • The Amazon A2I flow definition ARN and human loop name
  • Human answers (the reviewer’s input collected by the customized Amazon A2I UI)
  • Input content (the original data sent to Amazon A2I when starting the human loop task)

The following is a sample JSON generated by Amazon A2I:

{
  "flowDefinitionArn": "arn:aws:sagemaker:us-east-1:711334203977:flow-definition/a2i-custom-ui-demo-workflow",
  "humanAnswers": [
    {
      "acceptanceTime": "2022-08-23T15:23:53.488Z",
      "answerContent": {
        "Change Reason 1": "Missing X at the end.",
        "True Value 1": "93493319020929X",
        "True Value 2": "04-3018996"
      },
      "submissionTime": "2022-08-23T15:24:47.991Z",
      "timeSpentInSeconds": 54.503,
      "workerId": "94de99f1bc6324b8",
      "workerMetadata": {
        "identityData": {
          "identityProviderType": "Cognito",
          "issuer": "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_URd6f6sie",
          "sub": "cef8d484-c640-44ea-8369-570cdc132d2d"
        }
      }
    }
  ],
  "humanLoopName": "custom-loop-9b4e67ff-2c9f-40f9-aae5-0e26316c905c",
  "inputContent": {...} # the original input sent to A2I when starting the human review task
}

You can implement extract, transform, and load (ETL) logic to parse information from the Amazon A2I output JSON and store it in a file or database. The sample solution comes with a CSV file with processed data. You can use it to build a BI dashboard by following the instructions in the next section.
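For example, a small script could flatten each Amazon A2I output file into a CSV row similar to the one used by the sample solution; the bucket, key, and column names below are illustrative, not the solution's actual values.

import csv
import json
import boto3

s3 = boto3.client("s3")

def a2i_output_to_row(bucket, key):
    """Flatten one Amazon A2I output JSON object into a dictionary for reporting."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    output = json.loads(body)
    answer = output["humanAnswers"][0]
    return {
        "human_loop_name": output["humanLoopName"],
        "worker_id": answer["workerId"],
        "time_spent_seconds": answer["timeSpentInSeconds"],
        **answer["answerContent"],     # reviewer corrections such as "True Value 1"
    }

rows = [a2i_output_to_row("my-a2i-output-bucket", "a2i-results/output.json")]
with open("processed_a2i_output.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)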

Create a dashboard in Amazon QuickSight

The sample solution includes a reporting stage with a visualization dashboard served by Amazon QuickSight. The BI dashboard shows key metrics such as the number of documents processed automatically or manually, the most popular fields that required human review, and other insights. This dashboard can help you get an overview of the document processing pipeline and analyze the common reasons that trigger human review, so you can optimize the workflow and further reduce human input.

The sample dashboard includes basic metrics. You can expand the solution using Amazon QuickSight to show more insights into the data.

BI dashboard

Expand the solution to support more documents and business rules

To expand the solution to support more document pages with corresponding business rules, you need to make the following changes:

  • Create a data model for the new page in JSON structure representing all the values you want to extract out of the pages. Refer to the Define the data model section for a detailed format.
  • Use Amazon Textract to extract text out of the document and populate values to the data model.
  • Add business rules corresponding to the page in JSON format. Refer to the Customize business rules section for the detailed format.

The custom Amazon A2I UI in the solution is generic, which doesn’t require a change to support new business rules.

Conclusion

Intelligent document processing is in high demand, and companies need a customized pipeline to support their unique business logic. Amazon A2I also offers a built-in template integrated with Amazon Textract to implement your human review use cases. It also allows you to customize the reviewer page to serve flexible requirements.

This post guided you through a reference solution using Amazon Textract and Amazon A2I to build an IDP pipeline that supports flexible business rules. You can try it out using the Jupyter notebook in the GitHub IDP workshop repo.


About the authors

Lana Zhang is a Sr. Solutions Architect at the AWS WWSO AI Services team with expertise in AI and ML for intelligent document processing and content moderation. She is passionate about promoting AWS AI services and helping customers transform their business solutions.


Sonali Sahu leads the Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.


Automate classification of IT service requests with an Amazon Comprehend custom classifier

Enterprises often deal with large volumes of IT service requests. Traditionally, the burden is put on the requester to choose the correct category for every issue. A manual error or misclassification of a ticket usually means a delay in resolving the IT service request. This can result in reduced productivity, a decrease in customer satisfaction, an impact to service level agreements (SLAs), and broader operational impacts. As your enterprise grows, the problem of getting the right service request to the right team becomes even more important. Using an approach based on machine learning (ML) and artificial intelligence can help with your enterprise’s ever-evolving needs.

Supervised ML is a process that uses labeled datasets and outputs to train learning algorithms on how to classify data or predict an outcome. Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover valuable insights and connections in text. It provides APIs powered by ML to extract key phrases, entities, sentiment analysis, and more.

In this post, we show you how to implement a supervised ML model that can help classify IT service requests automatically using Amazon Comprehend custom classification. Amazon Comprehend custom classification helps you customize Amazon Comprehend for your specific requirements without the skillset required to build ML-based NLP solutions. With automatic ML, or AutoML, Amazon Comprehend custom classification builds customized NLP models on your behalf, using the training data that you provide.

Overview of solution

To illustrate IT service request classification, this solution uses the SEOSS dataset, a systematically retrieved dataset covering 33 open-source software projects that contains a large number of typed artifacts and trace links between them. The solution uses the issue data from these 33 open-source projects, namely the summaries and descriptions reported by end users, to build a custom classifier model using Amazon Comprehend.

This post demonstrates how to implement and deploy the solution using the AWS Cloud Development Kit (AWS CDK) in an isolated Amazon Virtual Private Cloud (Amazon VPC) environment consisting of only private subnets. We also use the code to demonstrate how you can use the AWS CDK provider framework, a mini-framework for implementing a provider for AWS CloudFormation custom resources to create, update, or delete a custom resource, such as an Amazon Comprehend endpoint. The Amazon Comprehend endpoint includes managed resources that make your custom model available for real-time inference to a client machine or third-party applications. The code for this solution is available on Github.

You use the AWS CDK to deploy the infrastructure, application code, and configuration for the solution. You also need an AWS account and the ability to create AWS resources. You use the AWS CDK to create AWS resources such as a VPC with private subnets, Amazon VPC endpoints, Amazon Elastic File System (Amazon EFS), an Amazon Simple Notification Service (Amazon SNS) topic, an Amazon Simple Storage Service (Amazon S3) bucket, Amazon S3 event notifications, and AWS Lambda functions. Collectively, these AWS resources constitute the training stack, which you use to build and train the custom classifier model.

After you create these AWS resources, you download the SEOSS dataset and upload the dataset to the S3 bucket created by the solution. If you’re deploying this solution in AWS Region us-east-2, the format of the S3 bucket name is comprehendcustom-<AWS-ACCOUNT-NUMBER>-us-east-2-s3stack. The solution uses the Amazon S3 multi-part upload trigger to invoke a Lambda function that starts the pre-processing of the input data, and uses the preprocessed data to train the Amazon Comprehend custom classifier to create the custom classifier model. You then use the Amazon Resource Name (ARN) of the custom classifier model to create the inference stack, which creates an Amazon Comprehend endpoint using the AWS CDK provider framework, which you can then use for inferences from a third-party application or client machine.

The following diagram illustrates the architecture of the training stack.

Training stack architecture

The workflow steps are as follows:

  1. Upload the SEOSS dataset to the S3 bucket created as part of the training stack deployment process. This creates an event trigger that invokes the etl_lambda function.
  2. The etl_lambda function downloads the raw data set from Amazon S3 to Amazon EFS.
  3. The etl_lambda function performs the data preprocessing task of the SEOSS dataset.
  4. When the function execution completes, it uploads the transformed data with the prepped_data prefix to the S3 bucket.
  5. After the upload of the transformed data is complete, a successful ETL completion message is sent to Amazon SNS.
  6. In Amazon Comprehend, you can classify your documents using two modes: multi-class or multi-label. Multi-class mode identifies one and only one class for each document, and multi-label mode identifies one or more labels for each document. Because we want to identify a single class to each document, we train the custom classifier model in multi-class mode. Amazon SNS triggers the train_classifier_lambda function, which initiates the Amazon Comprehend classifier training in a multi-class mode.
  7. The train_classifier_lambda function initiates the Amazon Comprehend custom classifier training.
  8. Amazon Comprehend downloads the transformed data from the prepped_data prefix in Amazon S3 to train the custom classifier model.
  9. When the model training is complete, Amazon Comprehend uploads the model.tar.gz file to the output_data prefix of the S3 bucket. The average completion time to train this custom classifier model is approximately 10 hours.
  10. The Amazon S3 upload trigger invokes the extract_comprehend_model_name_lambda function, which retrieves the custom classifier model ARN.
  11. The function extracts the custom classifier model ARN from the S3 event payload and the response of list-document-classifiers call.
  12. The function sends the custom classifier model ARN to the email address that you had subscribed earlier as part of the training stack creation process. You then use this ARN to deploy the inference stack.
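For reference, the core call made by the train_classifier_lambda function (steps 6 and 7) boils down to something like the following sketch; the classifier name, role ARN, and S3 prefix are placeholders, and the real values are assembled by the Lambda function at runtime.

import boto3

comprehend = boto3.client("comprehend")

# Placeholder names and ARNs -- the Lambda function assembles the real values at runtime
response = comprehend.create_document_classifier(
    DocumentClassifierName="it-service-request-classifier",
    DataAccessRoleArn="arn:aws:iam::<AWS-ACCOUNT-NUMBER>:role/<comprehend-data-access-role>",
    InputDataConfig={"S3Uri": "s3://<training-bucket>/prepped_data/"},
    LanguageCode="en",
    Mode="MULTI_CLASS",   # one and only one class per document
)
print(response["DocumentClassifierArn"])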

Deploying with this model ARN creates the inference stack, as shown in the following figure. The inference stack provides you with a REST API secured by an AWS Identity and Access Management (IAM) authorizer, which you can use to generate confidence scores for the labels based on input text supplied from a third-party application or client machine.

Inference stack architecture
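Behind the REST API, generating confidence scores is a single call against the Amazon Comprehend endpoint once it exists. A minimal client-side sketch, with a placeholder endpoint ARN and sample text, looks like this:

import boto3

comprehend = boto3.client("comprehend")

# Placeholder endpoint ARN -- created later by the inference stack
result = comprehend.classify_document(
    Text="The issue tracker login page returns a 500 error after the latest deployment.",
    EndpointArn="arn:aws:comprehend:<AWS-REGION>:<AWS-ACCOUNT-NUMBER>:document-classifier-endpoint/<endpoint-name>",
)
for label in result["Classes"]:
    print(label["Name"], label["Score"])   # confidence score for each predicted class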

Prerequisites

For this demo, you should have the following prerequisites:

  • An AWS account.
  • Python 3.7 or later, Node.js, and Git on the development machine. The AWS CDK requires specific versions of Node.js (>= 10.13.0, excluding versions 13.0.0 through 13.6.0). A version in active long-term support (LTS) is recommended.
    To install the active LTS version of Node.js, you can use the install script for nvm and then use nvm to install the Node.js LTS version. You can also install the current active LTS version of Node.js via a package manager, depending on your operating system.

    For macOS, you can install Node.js via a package manager using the following instructions.

    For Windows, you can install Node.js via a package manager using the following instructions.

  • AWS CDK v2. It’s pre-installed if you’re using an AWS Cloud9 IDE, in which case you can skip this step. If you don’t have the AWS CDK installed on the development machine, install AWS CDK v2 globally using the Node Package Manager command npm install -g aws-cdk. This step requires Node.js to be installed on the development machine.
  • Configure your AWS credentials to access and create AWS resources using the AWS CDK. For instructions, refer to Specifying credentials and region.
  • Download the SEOSS dataset consisting of requirements, bug reports, code history, and trace links of 33 open-source software projects. Save the file dataverse_files.zip on your local machine.


Deploy the AWS CDK training stack

For AWS CDK deployment, we start with the training stack. Complete the following steps:

  1. Clone the GitHub repository:
$ git clone https://github.com/aws-samples/amazon-comprehend-custom-automate-classification-it-service-request.git
  2. Navigate to the amazon-comprehend-custom-automate-classification-it-service-request folder:
$ cd amazon-comprehend-custom-automate-classification-it-service-request/

All the following commands are run within the amazon-comprehend-custom-automate-classification-it-service-request directory.

  3. In the amazon-comprehend-custom-automate-classification-it-service-request directory, initialize the Python virtual environment and install the dependencies from requirements.txt with pip:
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
  4. If you’re using the AWS CDK in a specific AWS account and Region for the first time, bootstrap your AWS CDK environment:
$ cdk bootstrap aws://<AWS-ACCOUNT-NUMBER>/<AWS-REGION>
  5. Synthesize the CloudFormation templates for this solution using cdk synth and use cdk deploy to create the AWS resources mentioned earlier:
$ cdk synth
$ cdk deploy VPCStack EFSStack S3Stack SNSStack ExtractLoadTransformEndPointCreateStack --parameters SNSStack:emailaddressarnnotification=<emailaddress@example.com>

After you enter cdk deploy, the AWS CDK prompts you to confirm the deployment of each of the stacks called out in the cdk deploy command.

  6. Enter y for each of the stack creation prompts; the cdk deploy step then creates these stacks. Subscribe the email address you provided to the SNS topic created as part of the cdk deploy.
  7. After cdk deploy completes successfully, create a folder called raw_data in the S3 bucket comprehendcustom-<AWS-ACCOUNT-NUMBER>-<AWS-REGION>-s3stack.
  8. Upload the SEOSS dataset dataverse_files.zip that you downloaded earlier to this folder.

After the upload is complete, the solution invokes the etl_lambda function using an Amazon S3 event trigger to start the extract, transform, and load (ETL) process. After the ETL process completes successfully, a message is sent to the SNS topic, which invokes the train_classifier_lambda function. This function triggers an Amazon Comprehend custom classifier model training. If you train your model on the complete SEOSS dataset, training could take up to 10 hours. When the training process is complete, Amazon Comprehend uploads the model.tar.gz file to the output_data prefix in the S3 bucket.

This upload triggers the extract_comprehend_model_name_lambda function using an S3 event trigger. The function extracts the custom classifier model ARN and sends it to the email address you subscribed earlier. This custom classifier model ARN is then used to create the inference stack. When the model training is complete, you can view the performance metrics of the custom classifier model by navigating to the version details section on the Amazon Comprehend console (see the following screenshot), or by using the Amazon Comprehend Boto3 SDK (a brief sketch follows the screenshot).

Performance metrics
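
As a rough sketch of the Boto3 route (the classifier ARN below is a placeholder), you could retrieve the training status and evaluation metrics like this:

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-2")

# Placeholder ARN -- replace with the custom classifier model ARN you received by email
classifier_arn = "arn:aws:comprehend:us-east-2:111122223333:document-classifier/ComprehendCustomClassifier-example/version/v1"

props = comprehend.describe_document_classifier(
    DocumentClassifierArn=classifier_arn
)["DocumentClassifierProperties"]

# Status becomes TRAINED when training completes; metrics include accuracy, precision, recall, and F1
print(props["Status"])
print(props["ClassifierMetadata"]["EvaluationMetrics"])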

Deploy the AWS CDK inference stack

Now you’re ready to deploy the inference stack.

  1. Copy the custom classifier model ARN from the email you received and use the following cdk deploy command to create the inference stack.

This command deploys an API Gateway REST API secured by an IAM authorizer, which you use for inference with an AWS user ID or IAM role that has only the execute-api:Invoke IAM privilege. The inference stack uses the AWS CDK provider framework to create the Amazon Comprehend endpoint as a custom resource, so that creating, deleting, and updating the Amazon Comprehend endpoint can be done as part of the inference stack lifecycle using the cdk deploy and cdk destroy commands.
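
A minimal sketch of how such a custom resource can be wired up with the CDK provider framework in Python follows. The construct name and the on_event handler are illustrative assumptions, not the repository's actual code; the handler Lambda is expected to call the Amazon Comprehend CreateEndpoint and DeleteEndpoint APIs in response to create and delete events.

from aws_cdk import CustomResource, aws_lambda as _lambda, custom_resources as cr
from constructs import Construct

class ComprehendEndpointResource(Construct):
    # Illustrative construct: classifier_arn and on_event_handler are supplied by the stack
    def __init__(self, scope: Construct, construct_id: str, *, classifier_arn: str,
                 on_event_handler: _lambda.IFunction) -> None:
        super().__init__(scope, construct_id)
        # The provider framework routes Create/Update/Delete events to the handler Lambda
        provider = cr.Provider(self, "ComprehendEndpointProvider",
                               on_event_handler=on_event_handler)
        CustomResource(self, "ComprehendEndpoint",
                       service_token=provider.service_token,
                       properties={"DocumentClassifierArn": classifier_arn})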

Because you need to run the following command after model training is complete, which could take up to 10 hours, ensure that you’re in the Python virtual environment that you initialized in an earlier step and in the amazon-comprehend-custom-automate-classification-it-service-request directory:

$ cdk deploy APIGWInferenceStack --parameters APIGWInferenceStack:documentclassifierarn=<custom classifier model ARN retrieved from email>

For example:

$ cdk deploy APIGWInferenceStack --parameters APIGWInferenceStack:documentclassifierarn=arn:aws:comprehend:us-east-2:111122223333:document-classifier/ComprehendCustomClassifier-11111111-2222-3333-4444-abc5d67e891f/version/v1
  2. After the cdk deploy command completes successfully, copy the APIGWInferenceStack.ComprehendCustomClassfierInvokeAPI value from the console output, and use this REST API to generate inferences from a client machine or a third-party application that has the execute-api:Invoke IAM privilege. If you’re running this solution in us-east-2, the format of this REST API is https://<restapi-id>.execute-api.us-east-2.amazonaws.com/prod/invokecomprehendV1.

Alternatively, you can use the test client apiclientinvoke.py from the GitHub repository to send a request to the custom classifier model. Before using apiclientinvoke.py, ensure that the following prerequisites are in place (a hedged sketch of such a signed request follows this list):

  • You have the boto3 and requests Python packages installed using pip on the client machine.
  • You have configured Boto3 credentials. By default, the test client assumes that a profile named default is present and it has the execute-api:Invoke IAM privilege on the REST API.
  • SigV4Auth points to the Region where the REST API is deployed. Update the <AWS-REGION> value to us-east-2 in apiclientinvoke.py if your REST API is deployed in us-east-2.
  • You have set the raw_data variable to the text on which you want to make the class prediction or the classification request:
raw_data="""Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis."""
  • You have set the restapi variable to the REST API copied earlier:

restapi="https://<restapi-id>.execute-api.us-east-2.amazonaws.com/prod/invokecomprehendV1"

  3. Run apiclientinvoke.py after you make the preceding updates:
$ python3 apiclientinvoke.py

You get the following response from the custom classifier model:

{
 "statusCode": 200,
 "body": [
	{
	 "Name": "SPARK",
	 "Score": 0.9999773502349854
	},
	{
	 "Name": "HIVE",
	 "Score": 1.1613215974648483e-05
	},
	{
	 "Name": "DROOLS",
	 "Score": 1.1110682862636168e-06
	}
   ]
}

Amazon Comprehend returns a confidence score for each label it predicts. The more confident the service is about a label, the closer the score is to 1. In this case, the custom classifier model trained on the SEOSS dataset predicts that the text belongs to the class SPARK. The classification returned by the Amazon Comprehend custom classifier model can then be used to classify IT service requests or predict the correct category of IT service requests, thereby reducing manual errors or misclassification of service requests.
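
For example, a client could pick the most likely class from the response body like this (the dictionary below mirrors the response shown above):

# Mirrors the JSON response shown above
response_body = {
    "statusCode": 200,
    "body": [
        {"Name": "SPARK", "Score": 0.9999773502349854},
        {"Name": "HIVE", "Score": 1.1613215974648483e-05},
        {"Name": "DROOLS", "Score": 1.1110682862636168e-06},
    ],
}

# Choose the label with the highest confidence score
top = max(response_body["body"], key=lambda label: label["Score"])
print(f"Predicted class: {top['Name']} (score {top['Score']:.4f})")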

Clean up

To clean up all the resources created in this post as part of the training stack and inference stack, use the following command. This command deletes all the AWS resources created by the previous cdk deploy commands:

$ cdk destroy --all

Conclusion

In this post, we showed you how enterprises can implement a supervised ML model using Amazon Comprehend custom classification to predict the category of IT service requests based on either the subject or the description of the request submitted by the end-user. After you build and train a custom classifier model, you can run real-time analysis for custom classification by creating an endpoint. After you deploy this model to an Amazon Comprehend endpoint, it can be used to run real-time inference by third-party applications or other client machines, including IT service management tools. You can then use this inference to predict the ticket category and reduce manual errors or misclassifications of tickets. This helps reduce delays in ticket resolution and increases resolution accuracy and customer productivity, which ultimately results in increased customer satisfaction.

You can extend the concepts in this post to other use cases, such as routing business or IT tickets to various internal teams such as business departments, customer service agents, and Tier 2/3 IT support, created either by end-users or through automated means.

References

  • Rath, Michael; Mäder, Patrick, 2019, “The SEOSS Dataset – Requirements, Bug Reports, Code History, and Trace Links for Entire Projects”, https://doi.org/10.7910/DVN/PDDZ4Q, Harvard Dataverse, V1

About the Authors

Arnab Chakraborty is a Sr. Solutions Architect at AWS based out of Cincinnati, Ohio. He is passionate about topics in enterprise and solution architecture, data analytics, serverless, and machine learning. In his spare time, he enjoys watching movies, travel shows, and sports.

Viral Desai is a Principal Solutions Architect at AWS. With more than 25 years of experience in information technology, he has been helping customers adopt AWS and modernize their architectures. He likes hiking, and enjoys diving deep with customers on all things AWS.
