Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library


Fraudulent users and malicious accounts can result in billions of dollars in lost revenue annually for businesses. Although many businesses use rule-based filters to prevent malicious activity in their systems, these filters are often brittle and may not capture the full range of malicious behavior.

However, some solutions, such as graph techniques, are especially suited for detecting fraudsters and malicious users. Fraudsters can evolve their behavior to fool rule-based systems or simple feature-based models, but it’s difficult to fake the graph structure and relationships between users and other entities captured in transaction or interaction logs. Graph neural networks (GNNs) combine information from the graph structure with attributes of users or transactions to learn meaningful representations that can distinguish malicious users and events from legitimate ones.

This post shows how to use Amazon SageMaker and Deep Graph Library (DGL) to train GNN models and detect malicious users or fraudulent transactions. Businesses looking for a fully-managed AWS AI service for fraud detection can also use Amazon Fraud Detector, which makes it easy to identify potentially fraudulent online activities, such as the creation of fake accounts or online payment fraud.

In this blog post, we focus on the data preprocessing and model training with Amazon SageMaker. To train the GNN model, you must first construct a heterogeneous graph using information from transaction tables or access logs. A heterogeneous graph is one that contains different types of nodes and edges. In the case where nodes represent users or transactions, the nodes can have several kinds of distinct relationships with other users and possibly other entities, such as device identifiers, institutions, applications, IP addresses and so on.

Some examples of use cases that fit this pattern include:

  • A financial network where users transact with other users and specific financial institutions or applications
  • A gaming network where users interact with other users but also with distinct games or devices
  • A social network where users can have different types of links to other users

The following diagram illustrates a heterogeneous financial transaction network.

GNNs can incorporate user features like demographic information or transaction features like activity frequency. In other words, you can enrich the heterogeneous graph representation with features for nodes and edges as metadata. After the nodes and relations in the heterogeneous graph are established, along with their associated features, you can train a GNN model to classify different nodes as malicious or legitimate, using both the node and edge features and the graph structure. The model training is set up in a semi-supervised manner: a subset of nodes in the graph is already labeled as fraudulent or legitimate. You use this labeled subset as a training signal to learn the parameters of the GNN. The trained GNN model can then predict the labels for the remaining unlabeled nodes in the graph.

Architecture

To get started, you can use the full solution architecture that uses Amazon SageMaker to run the processing jobs and training jobs. You can trigger the Amazon SageMaker jobs automatically with AWS Lambda functions that respond to Amazon Simple Storage Service (Amazon S3) put events, or manually by running cells in an example Amazon SageMaker notebook. The following diagram is a visual depiction of the architecture.

The full implementation is available on the GitHub repo with an AWS CloudFormation template that launches the architecture in your AWS account.

Data preprocessing for fraud detection with GNNs

In this section, we show how to preprocess an example dataset and identify the relations that will make up the heterogeneous graph.

Dataset

For this use case, we use the IEEE-CIS fraud dataset to benchmark the modeling approach. This is an anonymized dataset that contains 500 thousand transactions between users. The dataset has two main tables:

  • Transactions table – Contains information about transactions or interactions between users
  • Identity table – Contains information about access logs, device, and network information for users performing transactions

You use a subset of these transactions with their labels as a supervision signal for the model training. For the transactions in the test dataset, their labels are masked during training. The task is to predict which masked transactions are fraudulent and which are not.

The following code example gets the data and uploads it to an S3 bucket that Amazon SageMaker uses to access the dataset during preprocessing and training (run this in a Jupyter notebook cell):

# Replace with an S3 location or local path to point to your own dataset
raw_data_location = 's3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data'

bucket = 'SAGEMAKER_S3_BUCKET'
prefix = 'dgl'
input_data = 's3://{}/{}/raw-data'.format(bucket, prefix)

!aws s3 cp --recursive $raw_data_location $input_data

# Set S3 locations to store processed data for training and post-training results and artifacts respectively
train_data = 's3://{}/{}/processed-data'.format(bucket, prefix)
train_output = 's3://{}/{}/output'.format(bucket, prefix)

Despite the efforts of fraudsters to mask their behavior, fraudulent or malicious activities often have telltale signs like high out-degree or activity aggregation in the graph structure. The following sections show how to perform feature extraction and graph construction to allow the GNN models to take advantage of these patterns to predict fraud.

Feature extraction

Feature extraction consists of performing numerical encoding on categorical features and some transformation of numerical columns. For example, the transaction amounts are logarithmically transformed to indicate the relative magnitude of the amounts, and categorical attributes can be converted to numerical form by one-hot encoding. For each transaction, the feature vector contains attributes from the transaction tables, with information about the time deltas from previous transactions, name and address matches, and match counts.
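As a rough illustration of these transforms (not the actual preprocessing script, which lives in data-preprocessing/graph_data_preprocessor.py in the repo), the following pandas sketch log-transforms the transaction amount and one-hot encodes one of the categorical match columns. The column names follow the IEEE-CIS schema, and the input file name is a placeholder:

import numpy as np
import pandas as pd

# Illustrative sketch only; the solution's preprocessing script handles all columns
transactions = pd.read_csv('transaction.csv')  # placeholder path

# Log-transform the transaction amount to capture relative magnitude
transactions['TransactionAmt'] = np.log10(transactions['TransactionAmt'] + 1)

# One-hot encode a categorical match column
features = pd.get_dummies(transactions[['TransactionAmt', 'M1']], columns=['M1'])
features.to_csv('features.csv', index=False, header=False)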

Constructing the graph

To construct the full interaction graph, split the relational information in the data into edge lists for each relation type. Each edge list is a bipartite graph between transaction nodes and other entity types. These entity types each constitute an identifying attribute about the transaction. For example, you can have an entity type for the kind of card (debit or credit) used in the transaction, the IP address of the device the transaction was completed with, and the device ID or operating system of the device used. The entity types used for graph construction consist of all the attributes in the identity table and a subset of attributes in the transactions table, like credit card information or email domain. The heterogeneous graph is constructed with the set of per relation type edge lists and the feature matrix for the nodes.
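The following sketch shows the general idea of splitting the table into per-relation edge lists. It is illustrative only (the actual implementation is in graph_data_preprocessor.py), and the column names follow the IEEE-CIS schema:

import pandas as pd

transactions = pd.read_csv('transaction.csv')  # placeholder path
id_cols = ['card4', 'P_emaildomain']           # small subset of entity types for illustration

# Each edge list is a bipartite relation between transaction nodes and one entity type
for col in id_cols:
    edgelist = transactions[['TransactionID', col]].dropna()
    edgelist.to_csv('relation_{}_edgelist.csv'.format(col), index=False)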

Using Amazon SageMaker Processing

You can execute the data preprocessing and feature extraction step using Amazon SageMaker Processing. Amazon SageMaker Processing is a feature of Amazon SageMaker that lets you run preprocessing and postprocessing workloads on fully managed infrastructure. For more information, see Process Data and Evaluate Models.

First define a container for the Amazon SageMaker Processing job to use. This container should contain all the dependencies that the data preprocessing script requires. Because the data preprocessing here only depends on the pandas library, you can have a minimal Dockerfile to define the container. See the following code:

FROM python:3.7-slim-buster

RUN pip3 install pandas==0.24.2
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

You can build the container and push it to an Amazon Elastic Container Registry (Amazon ECR) repository by entering the following code:

import boto3

region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-preprocessing-container'
ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, ecr_repository)

!bash data-preprocessing/container/build_and_push.sh $ecr_repository docker

When the data preprocessing container is ready, you can create an Amazon SageMaker ScriptProcessor that sets up a Processing job environment using the preprocessing container. You can then use the ScriptProcessor to run a Python script, which has the data preprocessing implementation, in the environment defined by the container. The Processing job terminates when the Python script execution is complete and the preprocessed data has been saved back to Amazon S3. This process is completely managed by Amazon SageMaker. When running the ScriptProcessor, you have the option of passing in arguments to the data preprocessing script. Specify what columns in the transaction table should be considered as identity columns and what columns are categorical features. All other columns are assumed to be numerical features. See the following code:

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=ecr_repository_uri,
                                   role=role,
                                   instance_count=1,
                                   instance_type='ml.r5.24xlarge')

script_processor.run(code='data-preprocessing/graph_data_preprocessor.py',
                     inputs=[ProcessingInput(source=input_data,
                                             destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(destination=train_data,
                                               source='/opt/ml/processing/output')],
                     arguments=['--id-cols', 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain',
                                '--cat-cols',' M1,M2,M3,M4,M5,M6,M7,M8,M9'])

The following code example shows the outputs of the Amazon SageMaker Processing job stored in Amazon S3:

from os import path
from sagemaker.s3 import S3Downloader
processed_files = S3Downloader.list(train_data)
print("===== Processed Files =====")
print('\n'.join(processed_files))

Output:

===== Processed Files =====
s3://graph-fraud-detection/dgl/processed-data/features.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceInfo_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceType_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_P_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_ProductCD_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_R_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_TransactionID_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card3_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card4_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card5_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card6_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_01_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_02_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_03_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_04_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_05_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_06_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_07_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_08_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_09_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_10_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_11_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_12_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_13_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_14_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_15_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_16_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_17_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_18_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_19_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_20_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_21_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_22_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_23_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_24_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_25_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_26_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_27_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_28_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_29_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_30_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_31_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_32_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_33_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_34_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_35_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_36_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_37_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_38_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/tags.csv
s3://graph-fraud-detection/dgl/processed-data/test.csv

All the relation edge list files represent the different kinds of edges used to construct the heterogeneous graph during training. The features.csv file contains the final transformed features of the transaction nodes, and tags.csv contains the labels of the nodes used as the training supervision signal. The test.csv file contains the TransactionID data to use as a test dataset for evaluating the performance of the model; the labels for these nodes are masked during training.
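To make the heterogeneous graph concrete, the following toy DGL snippet (not the solution's training code, which builds the graph from the edge-list files automatically) constructs a tiny graph with three transaction nodes linked to card-type and device-type entities, including reverse edges so information can flow in both directions. All node IDs are made up:

import dgl

# Toy example only: edge endpoints are invented integer node IDs
graph_data = {
    ('transaction', 'card4', 'card_type'): ([0, 1, 2], [0, 0, 1]),
    ('card_type', 'rev_card4', 'transaction'): ([0, 0, 1], [0, 1, 2]),
    ('transaction', 'DeviceType', 'device'): ([0, 2], [0, 1]),
    ('device', 'rev_DeviceType', 'transaction'): ([0, 1], [0, 2]),
}
g = dgl.heterograph(graph_data)
print(g)  # reports node counts per type and edge counts per relation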

GNN model training

Now you can use Deep Graph Library (DGL) to create the graph and define a GNN model, and use Amazon SageMaker to launch the infrastructure to train the GNN. Specifically, a relational graph convolutional network (R-GCN) model learns embeddings for the nodes in the heterogeneous graph, and a fully connected layer performs the final node classification.
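Conceptually, the model looks something like the following PyTorch-flavored sketch; the actual entry point, train_dgl_mxnet_entry_point.py, uses DGL with the MXNet backend and, among other differences, learns embeddings for entity nodes that have no input features. One graph convolution per relation type is aggregated in each layer, and a linear head classifies the transaction nodes:

import torch.nn as nn
import torch.nn.functional as F
import dgl.nn as dglnn

class RGCN(nn.Module):
    def __init__(self, in_feats, hid_feats, n_classes, rel_names):
        super().__init__()
        # One GraphConv per relation type, aggregated across relations
        self.conv1 = dglnn.HeteroGraphConv(
            {rel: dglnn.GraphConv(in_feats, hid_feats) for rel in rel_names},
            aggregate='sum')
        self.conv2 = dglnn.HeteroGraphConv(
            {rel: dglnn.GraphConv(hid_feats, hid_feats) for rel in rel_names},
            aggregate='sum')
        self.classify = nn.Linear(hid_feats, n_classes)  # fraud vs. legitimate

    def forward(self, graph, inputs):
        # inputs: dict mapping node type -> feature tensor
        h = {ntype: F.relu(feat) for ntype, feat in self.conv1(graph, inputs).items()}
        h = self.conv2(graph, h)
        return self.classify(h['transaction'])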

Hyperparameters

To train the GNN, you need to define a few hyperparameters that are fixed before the training process, such as the kind of graph you’re constructing, the class of GNN models you’re using, the network architecture, and the optimizer and optimization parameters. See the following code:

edges = ",".join(map(lambda x: x.split("/")[-1], [file for file in processed_files if "relation" in file]))
params = {'nodes' : 'features.csv',
          'edges': 'relation*.csv',
          'labels': 'tags.csv',
          'model': 'rgcn',
          'num-gpus': 1,
          'batch-size': 10000,
          'embedding-size': 64,
          'n-neighbors': 1000,
          'n-layers': 2,
          'n-epochs': 10,
          'optimizer': 'adam',
          'lr': 1e-2
        }

The preceding code shows a few of the hyperparameters. For more information about all the hyperparameters and their default values, see estimator_fns.py in the GitHub repo.

Model training with Amazon SageMaker

With the hyperparameters defined, you can now kick off the training job. The training job uses DGL, with MXNet as the backend deep learning framework, to define and train the GNN. Amazon SageMaker makes it easy to train GNN models with the framework estimators, which have the deep learning framework environments already set up. For more information about training GNNs with DGL on Amazon SageMaker, see Train a Deep Graph Network.

You can now create an Amazon SageMaker MXNet estimator and pass in the model training script, hyperparameters, and the number and type of training instances you want. You can then call fit on the estimator and pass in the training data location in Amazon S3. See the following code:

from sagemaker.mxnet import MXNet

estimator = MXNet(entry_point='train_dgl_mxnet_entry_point.py',
                  source_dir='dgl-fraud-detection',
                  role=role, 
                  train_instance_count=1, 
                  train_instance_type='ml.p2.xlarge',
                  framework_version="1.6.0",
                  py_version='py3',
                  hyperparameters=params,
                  output_path=train_output,
                  code_location=train_output,
                  sagemaker_session=sess)

estimator.fit({'train': train_data})

Results

After training the GNN, the model learns to distinguish legitimate transactions from fraudulent ones. The training job produces a pred.csv file, which contains the model’s predictions for the transactions in test.csv. The ROC curve depicts the relationship between the true positive rate and the false positive rate at various thresholds, and the Area Under the Curve (AUC) can be used as an evaluation metric. The following graph shows that the GNN model we trained outperforms both fully connected feed forward networks and gradient boosted trees that use the features but don’t fully take advantage of the graph structure.
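If you have ground-truth labels for the held-out transactions, you can reproduce this kind of evaluation with a few lines of scikit-learn. The column names and the labels file below are assumptions for illustration:

import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

preds = pd.read_csv('pred.csv')            # assumed columns: TransactionID, pred
labels = pd.read_csv('test_labels.csv')    # hypothetical labels file: TransactionID, isFraud
merged = preds.merge(labels, on='TransactionID')

print('AUC:', roc_auc_score(merged['isFraud'], merged['pred']))
fpr, tpr, thresholds = roc_curve(merged['isFraud'], merged['pred'])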

Conclusion

In this post, we showed how to construct a heterogeneous graph from user transactions and activity and use that graph and other collected features to train a GNN model to predict which transactions are fraudulent. This post also showed how to use DGL and Amazon SageMaker to define and train a GNN that achieves high performance on this task. For more information about the full implementation of the project and other GNN models for the task, see the GitHub repo.

Additionally, we showed how to perform data processing to extract useful features and relations from raw transaction data logs using Amazon SageMaker Processing. You can get started with the project by deploying the provided CloudFormation template and passing in your own dataset to detect malicious users and fraudulent transactions in your data.


About the Author

Soji Adeshina is a Machine Learning Developer who works on developing deep learning based solutions for AWS customers. Currently, he’s working on graph learning with applications in financial services and advertising but he also has a background in computer vision and recommender systems. In his spare time, he likes to cook and read philosophical texts.

 


Integrate Amazon Kendra and Amazon Lex using a search intent


Customer service conversations typically revolve around one or more topics and contain related questions. Answering these questions seamlessly is essential for a good conversational experience. For example, as part of a car rental reservation, you have queries such as, “What’s the charge for an additional driver?” or, “Do you have car seats for kids?” Starting today, you can use a search intent in your Amazon Lex bots to integrate with Amazon Kendra, so your bots can surface answers from Kendra.

Amazon Kendra was recently made generally available to all AWS customers, with exciting new features. Amazon Kendra provides you with a highly accurate and easy-to-use enterprise search service powered by machine learning. It offers a more intuitive way to search—using natural language—and returns more accurate answers, so your end-users can discover information stored within your organization’s vast amount of content.

We previously described how you can add an AWS Lambda function in an Amazon Lex bot to integrate with Kendra. For more information, see Smarter FAQ bots with Amazon Kendra. With this approach, you need to write and maintain integration code, and any updates to the configuration require updating this code. Now you can add a search intent to get the answers you're looking for from FAQ documents. You simply define the search index parameters in the intent as part of the bot definition. The bot queries the Amazon Kendra search index during the interaction if the user input doesn't match any of the other configured intents. The search query responses are populated with request attributes that the bot can use in its response. You don't have to write any Lambda code. The integration is simple, and you can seamlessly deliver natural language answers from Amazon Kendra in your Amazon Lex conversations.

This post shows how you can integrate an Amazon Lex bot with Amazon Kendra using a search intent, so you can quickly build a bot to provide answers in a conversation. The following diagram illustrates this solution.

Building a Lex bot

We will use the same conversation from the previous post to model a bot conversation:

User: Where’s the IT Help Desk?
Agent: 2nd floor; room 201 (two rights after you get out of the elevators).
User: Can I come in at noon?
Agent: Sure. What’s the nature of your problem?
User: I have a cracked laptop screen.
Agent: For verification purposes, can I get the last four digits of your employee ID?
User: 1234.
Agent: Got it, we’ll see you then to take a look at your screen.
User: How long does it take to fix a screen?
Agent: Laptop screen repairs typically take 4 hours.
User: What do I need to bring with me for a screen repair?
Agent: For a laptop screen repair, please bring your employee badge, laptop, and VPN fob.
User: How do I access my documents while my laptop is being repaired?
Agent: You can use Amazon Workdocs to access your documents.
User: Can I use WorkDocs on my tablet?
Agent: Yes. Users can access Amazon WorkDocs using free applications for iPad, Kindle Fire, and Android tablets that you can download from iTunes, Amazon Appstore for Android, and Google Play. Users can also access Amazon WorkDocs using the web browser on any mobile device.
User: When does the help desk close?
Agent: Our hours are from 11 AM to 2 PM.
User: Is there a coffee shop nearby?
Agent: There is one on the third floor and it’s open from 8 AM to 4 PM daily.

The first step is to build an Amazon Lex bot with intents to book an appointment and check appointment status. The MakeAppointment intent elicits the information required to make an appointment, such as the date, time, employee ID, and the nature of the issue. The CheckAppointmentStatus intent provides the status of the appointment. When a user asks a question that the Lex bot can’t answer with these intents, it uses the built-in KendraSearchIntent intent to connect to Amazon Kendra to search for an appropriate answer.

Deploying the sample bot

To create the sample bot, complete the following steps. This creates an Amazon Lex bot called help_desk_bot and a Lambda fulfillment function called help_desk_bot_handler.

  1. Download the Amazon Lex definition and Lambda code.
  2. In the AWS Lambda console, choose Create function.
  3. Enter the function name help_desk_bot_handler.
  4. Choose the latest Python runtime (for example, Python 3.8).
  5. For Permissions, choose Create a new role with basic Lambda permissions.
  6. Choose Create function.
  7. Once your new Lambda function is available, in the Function code section, choose Actions, choose Upload a .zip file, choose Upload, and select the help_desk_bot_lambda_handler.zip file that you downloaded.
  8. Choose Save.
  9. On the Amazon Lex console, choose Actions, and then Import.
  10. Choose the file help_desk_bot.zip that you downloaded, and choose Import.
  11. On the Amazon Lex console, choose the bot help_desk_bot.
  12. For each of the intents, choose AWS Lambda function in the Fulfillment section, and select the help_desk_bot_handler function in the dropdown list. If you are prompted “You are about to give Amazon Lex permission to invoke your Lambda Function”, choose OK.
  13. When all the intents are updated, choose Build.

At this point, you should have a working bot that is not yet connected to Amazon Kendra.

Creating an Amazon Kendra index

You’re now ready to create an Amazon Kendra index for your documents and FAQ. Complete the following steps:

  1. On the Amazon Kendra console, choose Launch Amazon Kendra.
  2. If you have existing Amazon Kendra indexes, choose Create index.
  3. For Index name, enter a name, such as it-helpdesk.
  4. For Description, enter an optional description, such as IT Help Desk FAQs.
  5. For IAM role, choose Create a new role to create a role to allow Amazon Kendra to access Amazon CloudWatch Logs.
  6. For Role name, enter a name, such as cloudwatch-logs. Kendra will prefix the name with AmazonKendra and the AWS region.
  7. Choose Next.
  8. For Provisioning editions, choose Developer edition.
  9. Choose Create.

Adding your FAQ content

While Amazon Kendra creates your new index, upload your content to an Amazon Simple Storage Service (Amazon S3) bucket.

  1. On the Amazon S3 console, create a new bucket, such as kendra-it-helpdesk-docs-<your-account#>.
  2. Keep the default settings and choose Create bucket.
  3. Download the following sample files and upload them to your new S3 bucket:

When the index creation is complete, you can add your FAQ content.

  1. On the Amazon Kendra console, choose your index, then choose FAQs, and Add FAQ.
  2. For FAQ name, enter a name, such as it-helpdesk-faq.
  3. For Description, enter an optional description, such as FAQ for the IT Help Desk.
  4. For S3, browse Amazon S3 to find your bucket, and choose help-desk-faq.csv.
  5. For IAM role, choose Create a new role to allow Amazon Kendra to access your S3 bucket.
  6. For Role name, enter a name, such as s3-access. Kendra will prefix your role name with AmazonKendra-.
  7. Choose Add.
  8. Stay on the page while Amazon Kendra creates your FAQ.
  9. When the FAQ is complete, choose Add FAQ to add another FAQ.
  10. For FAQ name, enter a name, such as workdocs-faq.
  11. For Description, enter a description, such as FAQ for Amazon WorkDocs mobile and web access.
  12. For S3, browse Amazon S3 to find your bucket, and choose workdocs-faq.csv.
  13. For IAM role, choose the same role you created in step 5.
  14. Choose Add.

After you create your FAQs, you can try some Kendra searches by choosing Search console. For example:

  • When is the help desk open?
  • When does the help desk close?
  • Where is the help desk?
  • Can I access WorkDocs from my phone?
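You can also run the same searches programmatically with the Amazon Kendra Query API. In the following sketch, the index ID is a placeholder for the it-helpdesk index you created:

import boto3

kendra = boto3.client('kendra')
response = kendra.query(
    IndexId='<your-index-id>',          # the it-helpdesk index ID from the Amazon Kendra console
    QueryText='When does the help desk close?'
)
for item in response['ResultItems']:
    if item['Type'] == 'QUESTION_ANSWER':
        print(item['DocumentExcerpt']['Text'])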

Adding a search intent

Now that you have a working Amazon Kendra index, you need to add a search intent.

  1. On the Amazon Lex console, choose help_desk_bot.
  2. Under Intents, choose the + icon to add an intent.
  3. Choose Search existing intents.
  4. Under Built-in intents, choose KendraSearchIntent.
  5. Enter a name for your intent, such as help_desk_kendra_search.
  6. Choose Add.
  7. Under Amazon Kendra query, choose the index you created (it-helpdesk).
  8. For IAM role, choose Add Amazon Kendra permissions.
  9. For Fulfillment, leave the default value Return parameters to client selected.

  10. For Response, choose Message, enter the following message value and choose + to add it:
    ((x-amz-lex:kendra-search-response-question_answer-answer-1))

  11. Choose Save intent.
  12. Choose Build.

The message value you used in step 10 is a request attribute, which is set automatically by the Amazon Kendra search intent. This response is only selected if Kendra surfaces an answer.  For more information on request attributes, see the AMAZON.KendraSearchIntent documentation.

Your bot can now execute Amazon Kendra queries. You can test this on the Amazon Lex console. For example, you can try the sample conversation from the beginning of this post.
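If you prefer to test outside the console, a minimal boto3 call such as the following also works. The bot name matches the bot imported earlier in this post; the $LATEST alias points at the most recent build, and you can swap in a published alias instead:

import boto3

lex = boto3.client('lex-runtime')
response = lex.post_text(
    botName='help_desk_bot',
    botAlias='$LATEST',          # or a published alias, for example, test
    userId='test-user-1',
    inputText='How long does it take to fix a screen?'
)
print(response['message'])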

Deploying on a Slack channel

You can put this solution in a real chat environment, such as Slack, so that users can easily get information. To create a Slack channel association with your bot, complete the following steps:

  1. On the Amazon Lex console, choose Settings.
  2. Choose Publish.
  3. For Create an alias, enter an alias name, such as test.
  4. Choose Publish.
  5. When your alias is published, choose the Channels tab.
  6. Under Channels, choose Slack.
  7. Enter a Channel Name, such as slack_help_desk_bot.
  8. For Channel Description, add an optional description.
  9. From the KMS Key drop-down menu, leave aws/lex selected.
  10. For Alias, choose test.
  11. Provide the Client Id, Client Secret, and Verification Token for your Slack application.
  12. Choose Activate to generate the OAuth URL and Postback URL.

Use the OAuth URL and Postback URL on the Slack application portal to complete the integration. For more information about setting up a Slack application and integrating with Amazon Lex, see Integrating an Amazon Lex Bot with Slack.

Conclusion

This post demonstrates how to integrate Amazon Lex and Amazon Kendra using a search intent. Amazon Kendra can extract specific answers from unstructured data. No pre-training is required; you simply point Amazon Kendra at your content, and it provides specific answers to natural language queries. For more information about incorporating these techniques into your bots, please see the AMAZON.KendraSearchIntent documentation.

 


About the authors

Brian Yost is a Senior Consultant with the AWS Professional Services Conversational AI team. In his spare time, he enjoys mountain biking, home brewing, and tinkering with technology.

 

 

 

As a Product Manager on the Amazon Lex team, Harshal Pimpalkhute spends his time trying to get machines to engage (nicely) with humans.

 

 

 

 

 


Detecting and visualizing telecom network outages from tweets with Amazon Comprehend


In today’s world, social media has become a place where customers share their experiences with the services they consume. Every telecom provider wants the ability to understand their customers’ pain points as soon as possible, and to do this, carriers frequently establish a social media team within their network operations center (NOC). This team manually reviews social media messages, such as tweets, trying to identify patterns of customer complaints or issues that might suggest there is a specific problem in the carrier’s network.

Unhappy customers are more likely to change providers, so operators look to improve their customers’ experience and proactively approach dissatisfied customers who report issues with their services.

Of course, social media operates at a vast scale and our telecom customers are telling us that trying to uncover customer issues from social media data manually is extremely challenging.

This post shows how to classify tweets in real time so telecom companies can identify outages and proactively engage with customers by using Amazon Comprehend custom multi-class classification.

Solution overview

Telecom customers not only post about outages on social media, but also comment on the service they get or compare the company to a competitor.

Your company can benefit from handling those types of tweets separately. For customer feedback, care agents can respond to the customer directly. For outages, you need to collect information and open a ticket in an external system so an engineer can investigate the problem.

The solution for this post extends the AI-Driven Social Media Dashboard solution. The following diagram illustrates the solution architecture.

AI-Driven Social Media Dashboard Solutions Implementation architecture

This solution deploys an Amazon Elastic Compute Cloud (Amazon EC2) instance running in an Amazon Virtual Private Cloud (Amazon VPC) that ingests tweets from Twitter. An Amazon Kinesis Data Firehose delivery stream loads the streaming tweets into the raw prefix in the solution’s Amazon Simple Storage Service (Amazon S3) bucket. Amazon S3 invokes an AWS Lambda function that analyzes the raw tweets using Amazon Translate, to translate non-English tweets into English, and Amazon Comprehend, which uses natural language processing (NLP) to perform entity extraction and sentiment analysis.

A second Kinesis Data Firehose delivery stream loads the translated tweets and sentiment values into the sentiment prefix in the Amazon S3 bucket. A third delivery stream loads the extracted entities into the entities prefix in the Amazon S3 bucket.

The solution also deploys a data lake that includes AWS Glue for data transformation, Amazon Athena for data analysis, and Amazon QuickSight for data visualization. AWS Glue Data Catalog contains a logical database which is used to organize the tables for the data on Amazon S3. Athena uses these table definitions to query the data stored on Amazon S3 and return the information to an Amazon QuickSight dashboard.

You can extend this solution by building Amazon Comprehend custom classification to detect outages, customer feedback, and comparisons to competitors.

Creating the dataset

The solution uses raw data from tweets. In the original solution, you deploy an AWS CloudFormation template that defines a comma-delimited list of terms for the solution to monitor. As an example, this post focuses on tweets that contain the word “BT” (BT Group in the UK), but equally this could be any network provider.

To get started, launch the AI-Driven Social Media Dashboard solution. On the Specify stack details page, replace the default TwitterTermList with your terms; for this example, 'BT','bt'. After you choose Create stack, wait about 15 minutes for the deployment to complete. You then begin capturing tweets.

For more information about available attributes and data types, see Appendix B: Auto-generated Data Model.

The tweet data is stored in Amazon Simple Storage Service (Amazon S3), which you can query with Amazon Athena. The following screenshot shows an example query.

SELECT id,text FROM "ai_driven_social_media_dashboard"."tweets" limit 10;

Because you captured every tweet that contains the keyword BT or bt, you have a lot of tweets that aren’t referring to British Telecom; for example, tweets that misspell the word “but.”

Additionally, the tweets in your dataset are global, but for this post, you want to focus on the United Kingdom, so the tweets are even more likely to refer to British Telecom (and therefore your dataset is more accurate). You can modify this solution for use cases in other countries, for example, defining the keyword as KPN and narrowing the dataset to focus only on the Netherlands.

In the existing solution, the coordinates and geo types look relevant, but those usually aren’t populated—tweets don’t include the poster’s location by default due to privacy requirements, unless the user allows it.

The user type contains relevant user data that comes from the user profile. You can use the location data from the user profile to narrow down tweets to your target country or region.

To look at the user type, you can use the Athena CREATE TABLE AS SELECT (CTAS) query. For more information, see Creating a Table from Query Results (CTAS). The following screenshot shows the Create table from query option in the Create drop-down menu.

SELECT text,user.location from tweets

You can create a table that consists of the tweet text and the user location, which gives you the ability to look only at tweets that originated in the UK. The following screenshot shows the query results.

SELECT * FROM "ai_driven_social_media_dashboard"."location_text_02"
WHERE location like '%UK%' or location like '%England%' or location like '%Scotland%' or location like '%Wales%'

Now that you have a dataset with your target location and tweet keywords, you can train your custom classifier.

Amazon Comprehend custom classification

You train your model in multi-class mode. For this post, you label three different classes:

  • Outage – People who are experiencing or reporting an outage in their provider network
  • Customer feedback – Feedback regarding the service they have received from the provider
  • Competition – Tweets about the competition and the provider itself

You can export the dataset from Athena and use it to train the custom classifier.

You first look at the dataset and start labeling the different tweets. Because you have a large number of tweets, it can take manual effort and perhaps several hours to review the data and label it. We recommend that you train the model with at least 50 documents per label.

In the dataset, enough customers reported outages to produce 71 documents with the outage label; the competition and customer feedback classes each had fewer than 50 documents.

After you gather sufficient data, you can always improve your accuracy by training a new model.

The following screenshot shows some of the entries in the final training CSV file.
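In place of the screenshot, here is what the multi-class training file looks like: a headerless CSV with the class label in the first column and the document text in the second. The rows below are invented placeholders that only illustrate the format; they are not entries from the actual dataset.

outage,my bt broadband has been down all morning in leeds is anyone else affected
Customer feedback,thanks to the bt support team for sorting my line out so quickly today
Competition,thinking of leaving bt for sky next month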

As a future enhancement to remove the manual effort of labeling tweets, you can automate the process with Amazon SageMaker Ground Truth. Ground Truth offers easy access to labelers through Amazon Mechanical Turk and provides built-in workflows and interfaces for common labeling tasks.

When the labeling work is complete, upload the CSV file to your S3 bucket.

Now that the training data is in Amazon S3, you can train your custom classifier. Complete the following steps:

  1. On the Amazon Comprehend console, choose Custom classification.
  2. Choose Train classifier.
  3. For Name, enter a name for your classifier; for example, TweetsBT.
  4. For Classifier mode, select Using multi-class mode.
  5. For S3 location, enter the location of your CSV file.
  6. Choose Train classifier.

The status of the classifier changes from Submitted to Training. When the job is finished, the status changes to Trained.

After you train the custom classifier, you can analyze documents in either asynchronous or synchronous operations. You can analyze a large number of documents at the same time by using the asynchronous operation. The resulting analysis returns in a separate file. When you use the synchronous operation, you can only analyze a single document, but you can get results in real time.

For this use case, you want to analyze tweets in real time. When a tweet lands in Amazon S3 via Amazon Kinesis Data Firehose, it triggers an AWS Lambda function. The function calls the custom classifier endpoint to run an analysis on the tweet and determine whether it relates to an outage, customer feedback, or a competitor.

Testing the training data

After you train the model, Amazon Comprehend uses approximately 10% of the training documents to test the custom classifier model. Testing the model provides you with metrics that you can use to determine if the model is trained well enough for your purposes. These metrics are displayed in the Classifier performance section of the Classifier details page on the Amazon Comprehend console. See the following screenshot.

They’re also available in the Metrics fields returned by the DescribeDocumentClassifier operation.

Creating an endpoint

To create an endpoint, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom classification.
  2. From the Actions drop-down menu, choose Create endpoint.
  3. For Endpoint name, enter a name; for example, BTtweetsEndpoint.
  4. For Inference units, enter the number to assign to an endpoint.

Each unit represents a throughput of 100 characters per second for up to two documents per second. You can assign up to 10 inference units per endpoint. This post assigns 1.

  1. Choose Create endpoint.

When the endpoint is ready, the status changes to Ready.
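The same endpoint can also be created through the API if you prefer. In the following sketch, the classifier ARN is a placeholder for the ARN of the TweetsBT model you trained:

import boto3

comprehend = boto3.client('comprehend')
response = comprehend.create_endpoint(
    EndpointName='BTtweetsEndpoint',
    # Placeholder ARN; replace with the ARN of your trained classifier
    ModelArn='arn:aws:comprehend:us-east-1:111122223333:document-classifier/TweetsBT',
    DesiredInferenceUnits=1
)
print(response['EndpointArn'])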

Triggering the endpoint and customizing the existing Lambda function

You can use the existing Lambda function from the original solution and extend it to do the following:

  • Trigger the Amazon Comprehend custom classifier endpoint per tweet
  • Determine which class has the highest confidence score
  • Create an additional Firehose delivery stream so the results land back in Amazon S3

For more information about the original Lambda function, see the GitHub repo.

To make the necessary changes to the function, complete the following steps:

  1. On the Lambda console, select the function that contains the string Tweet-SocialMediaAnalyticsLambda.

Before you start adding code, make sure you understand how the function reads the tweets coming in, calls the Amazon Comprehend API, and stores the responses on a Firehose delivery stream so it writes the data to Amazon S3.

  1. Call the custom classifier endpoint (see the following code example).

The first two calls use the API on the tweet text to detect sentiment and entities; both come out of the box with the original solution.

The following code uses the ClassifyDocument API:

sentiment_response = comprehend.detect_sentiment(
    Text=comprehend_text,
    LanguageCode='en'
)
#print(sentiment_response)

entities_response = comprehend.detect_entities(
    Text=comprehend_text,
    LanguageCode='en'
)

# Create a 'custom_response' using the ClassifyDocument API call
custom_response = comprehend.classify_document(
    # Point to the relevant custom classifier endpoint ARN
    EndpointArn="arn:aws:comprehend:us-east-1:12xxxxxxx91:document-classifier-endpoint/BTtweets-endpoint",
    # comprehend_text is the original tweet text
    Text=comprehend_text
)

The following code is the returned result:

{"File": "all_tweets.csv", "Line": "23", "Classes": [{"Name": "outage", "Score": 0.9985}, {"Name": "Competition", "Score": 0.0005}, {"Name": "Customer feedback", "Score": 0.0005}]}

You now need to iterate over the array, which contains the classes and confidence scores. For more information, see DocumentClass.

Because you’re using the multi-class approach, you can pick the class with the highest score; some simple code that iterates over the array keeps the highest-scoring class.

You also keep tweet['id'] so you can join it with the other tables that the solution generates to relate the results to the original tweet.

  1. Enter the following code:
    score = 0
    for classs in custom_response['Classes']:
        if score < classs['Score']:
            score = classs['Score']
            custom_record = {
                'tweetid': tweet['id'],
                'classname': classs['Name'],
                'classscore': classs['Score']
            }

After you create the custom_record, you can decide whether to define a threshold for the class score (the level of confidence for the results you want to store in Amazon S3). For this use case, you only keep classes with a confidence score of at least 70%.

To put the result on a Firehose delivery stream (which you need to create in advance), use the PutRecord API. See the following code:

if custom_record['classscore'] > 0.7:
    print('we are in')
    response = firehose.put_record(
        DeliveryStreamName=os.environ['CUSTOM_STREAM'],
        Record={
            'Data': json.dumps(custom_record) + '\n'
        }
    )

You now have a dataset in Amazon S3 based on your Amazon Comprehend custom classifier output.

Exploring the output

You can now explore the output from your custom classifier in Athena. Complete the following steps:

  1. On the Athena console, run a SELECT query to see the following:
    1. tweetid – You can use this to join the original tweet table to get the tweet text and additional attributes.
    2. classname – This is the class that the custom classifier identified the tweet as with the highest level of confidence.
    3. classscore – This is the level of confidence.
    4. Stream partitions – These help you know the time when the data was written to Amazon S3:
      1. Partition_0 (month)
      2. Partition_1 (day)
      3. Partition_2 (hour)

The following screenshot shows your query results.

SELECT * FROM "ai_driven_social_media_dashboard"."custom2020" where classscore>0.7 limit 10;

  1. Join your table using the tweetid with the following:
    1. The original tweet table to get the actual tweet text.
    2. A sentiment table that Amazon Comprehend generated in the original solution.

The following screenshot shows your results. One of the tweets contains negative feedback, and other tweets identify potential outages.

SELECT classname,classscore,tweets.text,sentiment FROM "ai_driven_social_media_dashboard"."custom2020"
left outer join tweets on custom2020.tweetid=tweets.id 
left outer join tweet_sentiments on custom2020.tweetid=tweet_sentiments.tweetid
where classscore>0.7 
limit 10;

Preparing the data for visualization

To prepare the data for visualization, first create a timestamp field by concatenating the partition fields.

You can use the timestamp field for various visualizations, such as outages in a certain period or customer feedback on a specific day. To do so, use AWS Glue notebooks and write a piece of code in PySpark.

You can use the PySpark code to not only prepare your data but also transform the data from CSV to Apache Parquet format. For more information, see the GitHub repo.
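As a rough sketch of that transformation (the actual notebook code is in the GitHub repo), a Glue PySpark job could look something like the following; the database, table, partition, and output path names follow the examples above but may differ in your account:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the classifier output table registered in the AWS Glue Data Catalog
df = glueContext.create_dynamic_frame.from_catalog(
    database='ai_driven_social_media_dashboard',
    table_name='custom2020').toDF()

# Concatenate the month/day/hour stream partitions into a single timestamp-like field
df = df.withColumn('final', F.concat_ws('-', 'partition_0', 'partition_1', 'partition_2'))

# Write the result back to Amazon S3 as Parquet for cheaper, faster Athena queries
df.write.mode('overwrite').parquet('s3://<your-bucket>/custom2020-parquet/')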

You should now have a new dataset that contains a timestamp field in Parquet format, which is more efficient and cost-effective to query.

For this use case, you can plot the reported outages on a map using geospatial charts in Amazon QuickSight. To get the location of the tweet, you can use the following:

  • Longitude and latitude coordinates in the original tweets dataset. Unfortunately, coordinates aren’t usually present due to privacy defaults.
  • Amazon Comprehend entity dataset, which can identify locations as entities within the tweet text.

For this use case, you can create a new dataset combining the tweets, custom2020 (your new dataset based on the custom classifier output), and tweetsEntities datasets.

The following screenshot shows the query results, which returned tweets with locations that also identify outages.

SELECT distinct classname,final,text,entity FROM "ai_driven_social_media_dashboard"."custom2020"."quicksight_with_lat_lang"
where type='LOCATION' and classname='outage'
order by final asc 

You have successfully identified outages in a specific window and determined their location.

To get the geolocation of a specific location, you could choose from a variety of publicly available datasets to upload to Amazon S3 and join with your data. This post uses the World Cities Database, which has a free option. You can join it with your existing data to get the location coordinates.

Visualizing outage locations in Amazon QuickSight

To visualize your outage locations in Amazon QuickSight, complete the following steps:

  1. To add the dataset you created in Athena, on the Amazon QuickSight console, choose Manage data.
  2. Choose New dataset.
  3. Choose Athena.
  4. Select your database or table.
  5. Choose Save & Visualize.
  6. Under Visual types, choose the Points on map visual.
  7. Drag the lng and lat fields to the field wells.

The following screenshot shows the outages on a UK map.

To see the text of a specific tweet, hover over one of the dots on the map.

You have many different options when analyzing your data and can further enhance this solution. For example, you can enrich your dataset with potential contributors and drill down on a specific outage location for more details.

Conclusion

You now have the ability to detect outages that customers are reporting, and you can also use the solution to look at customer feedback and comparisons with competitors. This makes it possible to identify key trends on social media at scale. This post showed an example relevant to telecom companies, but the solution can be customized and used by any company whose customers are active on social media.

In the near future, we would like to extend this solution into an end-to-end flow in which a customer reporting an outage automatically receives a reply on Twitter from an Amazon Lex chatbot. The chatbot can ask the customer for more information over a secure channel and then either send this information to a call center agent through an integration with Amazon Connect or create a ticket in an external ticketing system for an engineer to work on the problem.

Give the solution a try, see if you can extend it further, and share your feedback and questions in the comments.


About the Author

Guy Ben-Baruch is a Senior Solutions Architect in the news and communications team at AWS UKIR. Since Guy joined AWS in March 2016, he has worked closely with enterprise customers, focusing on the telecom vertical and supporting their digital transformation and cloud adoption. Outside of work, Guy likes barbecuing and playing football with his kids in the park when the British weather allows it.

 

 

 


Amazon Polly launches a child US English NTTS voice


Amazon Polly turns text into lifelike speech, allowing you to create voice-enabled applications. We’re excited to announce the general availability of a new US English child voice—Kevin. Kevin’s voice was developed using the latest Neural Text-to-Speech (NTTS) technology, making it sound natural and human-like. This voice imitates the voice of a male child. Have a listen to the Kevin voice:

Kevin sample 1

Kevin sample 2

Amazon Polly has 14 neural voices to choose from:

  • US English (en-US): Ivy, Joey, Justin, Kendra, Kevin, Kimberly, Joanna, Matthew, Salli
  • British English (en-GB): Amy, Brian, Emma
  • Brazilian Portuguese (pt-BR): Camila
  • US Spanish (es-US): Lupe

Neural voices are supported in the following Regions:

  • US East (N. Virginia)
  • US West (Oregon)
  • Asia Pacific (Sydney)
  • EU (Ireland)

For the full list of text-to-speech voices, see Voices in Amazon Polly.

Our customers are using Amazon Polly voices to build new categories of speech-enabled products, including (but not limited to) voicing news content, games, eLearning platforms, telephony applications, accessibility applications, and Internet of Things (IoT) devices. Amazon Polly voices are high quality and cost-effective, and they ensure fast responses, which makes them a viable option for low-latency use cases. Amazon Polly also supports SSML tags, which give you additional control over speech output.
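To try the new voice from code, a minimal boto3 call looks like the following; the sample text and output file name are placeholders:

import boto3

polly = boto3.client('polly')
response = polly.synthesize_speech(
    Engine='neural',            # Kevin is available only with the neural engine
    VoiceId='Kevin',
    OutputFormat='mp3',
    Text='Hi! My name is Kevin, one of the Amazon Polly neural voices.'
)
with open('kevin.mp3', 'wb') as audio_file:
    audio_file.write(response['AudioStream'].read())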

For more information, see What Is Amazon Polly? and log in to the Amazon Polly console to try it out!


About the Author

Ankit Dhawan is a Senior Product Manager for Amazon Polly, technology enthusiast, and huge Liverpool FC fan. When not working on delighting our customers, you will find him exploring the Pacific Northwest with his wife and dog. He is an eternal optimist, and loves reading biographies and playing poker. You can indulge him in a conversation on technology, entrepreneurship, or soccer any time of the day.

 
