The Calibration Generalization Gap

This paper was accepted at the Workshop on Distribution-Free Uncertainty Quantification at ICML 2022.
Calibration is a fundamental property of a good predictive model: it requires that the model predicts correctly in proportion to its confidence. Modern neural networks, however, provide no strong guarantees on their calibration, and can be either poorly calibrated or well-calibrated depending on the setting. It is currently unclear which factors contribute to good calibration (architecture, data augmentation, overparameterization, etc.), though various claims exist in the literature. We…Apple Machine Learning Research

Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures

This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when…Apple Machine Learning Research

Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents

The perception system in personalized mobile agents requires developing indoor scene understanding models, which can understand 3D geometries, capture objectiveness, analyze human behaviors, etc. Nonetheless, this direction has not been well-explored in comparison with models for outdoor environments (e.g., the autonomous driving system that includes pedestrian prediction, car detection, traffic sign recognition, etc.). In this paper, we first discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments, and other challenges such as fusion between…Apple Machine Learning Research

Automate classification of IT service requests with an Amazon Comprehend custom classifier

Enterprises often deal with large volumes of IT service requests. Traditionally, the burden is put on the requester to choose the correct category for every issue. A manual error or misclassification of a ticket usually means a delay in resolving the IT service request. This can result in reduced productivity, a decrease in customer satisfaction, an impact to service level agreements (SLAs), and broader operational impacts. As your enterprise grows, the problem of getting the right service request to the right team becomes even more important. Using an approach based on machine learning (ML) and artificial intelligence can help with your enterprise’s ever-evolving needs.

Supervised ML is a process that uses labeled datasets and outputs to train learning algorithms on how to classify data or predict an outcome. Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover valuable insights and connections in text. It provides APIs powered by ML to extract key phrases, entities, sentiment analysis, and more.

In this post, we show you how to implement a supervised ML model that can help classify IT service requests automatically using Amazon Comprehend custom classification. Amazon Comprehend custom classification helps you customize Amazon Comprehend for your specific requirements without the skillset required to build ML-based NLP solutions. With automatic ML, or AutoML, Amazon Comprehend custom classification builds customized NLP models on your behalf, using the training data that you provide.

Overview of solution

To illustrate IT service request classification, this solution uses the SEOSS dataset, a systematically retrieved dataset of 33 open-source software projects that contains a large number of typed artifacts and the trace links between them. The solution uses the issue data from these 33 projects, specifically the summaries and descriptions reported by end-users, to build a custom classifier model using Amazon Comprehend.

This post demonstrates how to implement and deploy the solution using the AWS Cloud Development Kit (AWS CDK) in an isolated Amazon Virtual Private Cloud (Amazon VPC) environment consisting of only private subnets. We also use the code to demonstrate how you can use the AWS CDK provider framework, a mini-framework for implementing a provider for AWS CloudFormation custom resources to create, update, or delete a custom resource, such as an Amazon Comprehend endpoint. The Amazon Comprehend endpoint includes managed resources that make your custom model available for real-time inference to a client machine or third-party applications. The code for this solution is available on GitHub.
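To make the provider framework idea more concrete, the following is a minimal sketch of what an on_event handler for an Amazon Comprehend endpoint custom resource could look like. It is not the handler from the solution's repository. The property names EndpointName, ModelArn, and DesiredInferenceUnits inside ResourceProperties are assumptions chosen for illustration, and the comprehend argument exists only so that the handler can be exercised with a stub client.

```python
"""Sketch of an on_event handler for a Comprehend endpoint custom resource."""


def on_event(event, context, comprehend=None):
    """Map a CloudFormation lifecycle event to the matching Comprehend call."""
    if comprehend is None:
        import boto3  # imported lazily so the handler can be tested with a stub

        comprehend = boto3.client("comprehend")

    # Property names below are assumptions, not taken from the repository.
    props = event["ResourceProperties"]
    request_type = event["RequestType"]

    if request_type == "Create":
        response = comprehend.create_endpoint(
            EndpointName=props["EndpointName"],
            ModelArn=props["ModelArn"],
            # CloudFormation passes property values as strings.
            DesiredInferenceUnits=int(props["DesiredInferenceUnits"]),
        )
        # The endpoint ARN doubles as the physical resource ID.
        return {"PhysicalResourceId": response["EndpointArn"]}

    endpoint_arn = event["PhysicalResourceId"]
    if request_type == "Update":
        comprehend.update_endpoint(
            EndpointArn=endpoint_arn,
            DesiredInferenceUnits=int(props["DesiredInferenceUnits"]),
        )
    elif request_type == "Delete":
        comprehend.delete_endpoint(EndpointArn=endpoint_arn)
    else:
        raise ValueError(f"Unsupported request type: {request_type}")
    return {"PhysicalResourceId": endpoint_arn}
```

Returning the endpoint ARN as the physical resource ID lets later Update and Delete events address the same endpoint without any extra lookup.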

You use the AWS CDK to deploy the infrastructure, application code, and configuration for the solution. You also need an AWS account and the ability to create AWS resources. You use the AWS CDK to create AWS resources such as a VPC with private subnets, Amazon VPC endpoints, Amazon Elastic File System (Amazon EFS), an Amazon Simple Notification Service (Amazon SNS) topic, an Amazon Simple Storage Service (Amazon S3) bucket, Amazon S3 event notifications, and AWS Lambda functions. Collectively, these AWS resources constitute the training stack, which you use to build and train the custom classifier model.

After you create these AWS resources, you download the SEOSS dataset and upload the dataset to the S3 bucket created by the solution. If you’re deploying this solution in AWS Region us-east-2, the format of the S3 bucket name is comprehendcustom-<AWS-ACCOUNT-NUMBER>-us-east-2-s3stack. The solution uses the Amazon S3 multi-part upload trigger to invoke a Lambda function that starts the pre-processing of the input data, and uses the preprocessed data to train the Amazon Comprehend custom classifier to create the custom classifier model. You then use the Amazon Resource Name (ARN) of the custom classifier model to create the inference stack, which creates an Amazon Comprehend endpoint using the AWS CDK provider framework, which you can then use for inferences from a third-party application or client machine.
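As a rough illustration of the S3 trigger described above, a Lambda function can read the bucket name and object key from the event payload that Amazon S3 sends. The sketch below is not the solution's etl_lambda code. The function names are hypothetical, and the raw_data/ filter assumes the folder that you create later in the deployment steps.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event payload (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context):
    """Hypothetical Lambda entry point: keep only uploads under raw_data/."""
    return [(b, k) for b, k in uploaded_objects(event) if k.startswith("raw_data/")]
```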

The following diagram illustrates the architecture of the training stack.

Training stack architecture

The workflow steps are as follows:

  1. Upload the SEOSS dataset to the S3 bucket created as part of the training stack deployment process. This creates an event trigger that invokes the etl_lambda function.
  2. The etl_lambda function downloads the raw data set from Amazon S3 to Amazon EFS.
  3. The etl_lambda function performs the data preprocessing task of the SEOSS dataset.
  4. When the function execution completes, it uploads the transformed data with the prepped_data prefix to the S3 bucket.
  5. After the upload of the transformed data is complete, a successful ETL completion message is sent to Amazon SNS.
  6. Amazon SNS triggers the train_classifier_lambda function, which initiates the Amazon Comprehend classifier training in multi-class mode. Amazon Comprehend can classify documents in two modes: multi-class or multi-label. Multi-class mode identifies one and only one class for each document, and multi-label mode identifies one or more labels for each document. Because we want to identify a single class for each document, we train the custom classifier model in multi-class mode.
  7. The train_classifier_lambda function initiates the Amazon Comprehend custom classifier training.
  8. Amazon Comprehend downloads the transformed data from the prepped_data prefix in Amazon S3 to train the custom classifier model.
  9. When the model training is complete, Amazon Comprehend uploads the model.tar.gz file to the output_data prefix of the S3 bucket. The average completion time to train this custom classifier model is approximately 10 hours.
  10. The Amazon S3 upload trigger invokes the extract_comprehend_model_name_lambda function, which retrieves the custom classifier model ARN.
  11. The function extracts the custom classifier model ARN from the S3 event payload and the response of list-document-classifiers call.
  12. The function sends the custom classifier model ARN to the email address that you had subscribed earlier as part of the training stack creation process. You then use this ARN to deploy the inference stack.
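The training step in this workflow can be sketched with the Boto3 create_document_classifier call. This is an illustration only, not the code of train_classifier_lambda. The function names, the classifier name, the IAM role, and the LanguageCode value of en are assumptions, while the prepped_data and output_data prefixes and the multi-class mode follow the steps above.

```python
def build_training_request(name, role_arn, bucket):
    """Assemble keyword arguments for comprehend.create_document_classifier."""
    return {
        "DocumentClassifierName": name,
        "DataAccessRoleArn": role_arn,  # role that lets Comprehend read and write the bucket
        "LanguageCode": "en",           # assumption: the training text is English
        "Mode": "MULTI_CLASS",          # exactly one class per document
        "InputDataConfig": {"S3Uri": f"s3://{bucket}/prepped_data/"},
        "OutputDataConfig": {"S3Uri": f"s3://{bucket}/output_data/"},
    }


def start_training(name, role_arn, bucket):
    """Start the training job and return the new classifier's ARN."""
    import boto3  # imported lazily so the request builder can be tested offline

    comprehend = boto3.client("comprehend")
    response = comprehend.create_document_classifier(
        **build_training_request(name, role_arn, bucket)
    )
    return response["DocumentClassifierArn"]
```

Keeping the request in a separate function makes it easy to check the mode and the S3 locations before any training job is started.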

This deployment creates the inference stack, as shown in the following figure. The inference stack provides you with a REST API secured by an AWS Identity and Access Management (IAM) authorizer, which you can then use to generate confidence scores of the labels based on the input text supplied from a third-party application or client machine.

Inference stack architecture

Prerequisites

For this demo, you should have the following prerequisites:

  • An AWS account.
  • Python 3.7 or later, Node.js, and Git in the development machine. The AWS CDK uses specific versions of Node.js (>=10.13.0, except for version 13.0.0 – 13.6.0). A version in active long-term support (LTS) is recommended.
    To install the active LTS version of Node.js, you can use the following install script for nvm and use nvm to install the Node.js LTS version. You can also install the current active LTS Node.js via package manager depending on the operating system of your choice.

    For macOS, you can install the Node.js via package manager using the following instructions.

    For Windows, you can install the Node.js via package manager using the following instructions.

  • AWS CDK v2 is pre-installed if you’re using an AWS Cloud9 IDE, so you can skip this step. If you don’t have the AWS CDK installed on the development machine, install AWS CDK v2 globally using the Node Package Manager command npm install -g aws-cdk. This step requires Node.js to be installed on the development machine.
  • Configure your AWS credentials to access and create AWS resources using the AWS CDK. For instructions, refer to Specifying credentials and region.
  • Download the SEOSS dataset consisting of requirements, bug reports, code history, and trace links of 33 open-source software projects. Save the file dataverse_files.zip on your local machine.

SEOSS dataset

Deploy the AWS CDK training stack

For AWS CDK deployment, we start with the training stack. Complete the following steps:

  1. Clone the GitHub repository:
$ git clone https://github.com/aws-samples/amazon-comprehend-custom-automate-classification-it-service-request.git
  2. Navigate to the amazon-comprehend-custom-automate-classification-it-service-request folder:
$ cd amazon-comprehend-custom-automate-classification-it-service-request/

All the following commands are run within the amazon-comprehend-custom-automate-classification-it-service-request directory.

  3. In the amazon-comprehend-custom-automate-classification-it-service-request directory, initialize the Python virtual environment and install requirements.txt with pip:
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
  4. If you’re using the AWS CDK in a specific AWS account and Region for the first time, see the instructions for bootstrapping your AWS CDK environment:
$ cdk bootstrap aws://<AWS-ACCOUNT-NUMBER>/<AWS-REGION>
  5. Synthesize the CloudFormation templates for this solution using cdk synth and use cdk deploy to create the AWS resources mentioned earlier:
$ cdk synth
$ cdk deploy VPCStack EFSStack S3Stack SNSStack ExtractLoadTransformEndPointCreateStack --parameters SNSStack:emailaddressarnnotification=<emailaddress@example.com>

After you enter cdk deploy, the AWS CDK prompts whether you want to deploy changes for each of the stacks called out in the cdk deploy command.

  6. Enter y for each of the stack creation prompts; the cdk deploy step then creates these stacks. Subscribe the email address you provided to the SNS topic created as part of cdk deploy.
  7. After cdk deploy completes successfully, create a folder called raw_data in the S3 bucket comprehendcustom-<AWS-ACCOUNT-NUMBER>-<AWS-REGION>-s3stack.
  8. Upload the SEOSS dataset dataverse_files.zip that you downloaded earlier to this folder.
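If you prefer to script the upload rather than use the console, a Boto3 sketch along the following lines should work. The function names are hypothetical, and the bucket name follows the comprehendcustom-<AWS-ACCOUNT-NUMBER>-<AWS-REGION>-s3stack format described in this post. Because S3 folders are key prefixes, writing the object under raw_data/ places it in that folder.

```python
import os


def bucket_name(account_id, region):
    """Build the bucket name in the format used by this solution."""
    return f"comprehendcustom-{account_id}-{region}-s3stack"


def upload_dataset(local_path, account_id, region):
    """Upload the dataset archive under the raw_data/ prefix and return its key."""
    import boto3  # imported lazily so bucket_name can be used without AWS access

    s3 = boto3.client("s3", region_name=region)
    key = "raw_data/" + os.path.basename(local_path)
    # upload_file switches to a multipart upload automatically for large files.
    s3.upload_file(local_path, bucket_name(account_id, region), key)
    return key
```

For example, upload_dataset("dataverse_files.zip", "111122223333", "us-east-2") would write the object raw_data/dataverse_files.zip, assuming your credentials allow writing to that bucket.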

After the upload is complete, the solution invokes the etl_lambda function using an Amazon S3 event trigger to start the extract, transform, and load (ETL) process. After the ETL process completes successfully, a message is sent to the SNS topic, which invokes the train_classifier_lambda function. This function triggers an Amazon Comprehend custom classifier model training. Depending on whether you train your model on the complete SEOSS dataset, training could take up to 10 hours. When the training process is complete, Amazon Comprehend uploads the model.tar.gz file to the output_data prefix in the S3 bucket.

This upload triggers the extract_comprehend_model_name_lambda function using an S3 event trigger. The function extracts the custom classifier model ARN and sends it to the email address you subscribed earlier. This custom classifier model ARN is then used to create the inference stack. When the model training is complete, you can view the performance metrics of the custom classifier model by navigating to the version details section in the Amazon Comprehend console (see the following screenshot), or by using the Amazon Comprehend Boto3 SDK.

Performance metrics
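As a sketch of the Boto3 route, the describe_document_classifier call returns the evaluation metrics inside the classifier metadata. The helper names below are hypothetical, and the ARN you pass in is the one you receive by email.

```python
def evaluation_metrics(describe_response):
    """Pull the evaluation metrics out of a describe_document_classifier response."""
    props = describe_response["DocumentClassifierProperties"]
    # ClassifierMetadata is only populated once training has finished.
    return props.get("ClassifierMetadata", {}).get("EvaluationMetrics", {})


def fetch_metrics(classifier_arn):
    """Look up the metrics (accuracy, precision, recall, F1 score, and so on)."""
    import boto3  # imported lazily so the parser can be tested offline

    comprehend = boto3.client("comprehend")
    response = comprehend.describe_document_classifier(
        DocumentClassifierArn=classifier_arn
    )
    return evaluation_metrics(response)
```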

Deploy the AWS CDK inference stack

Now you’re ready to deploy the inference stack.

  1. Copy the custom classifier model ARN from the email you received and use the following cdk deploy command to create the inference stack.

This command deploys an API Gateway REST API secured by an IAM authorizer, which you use for inference with an AWS user ID or IAM role that just has the execute-api:Invoke IAM privilege. The following cdk deploy command deploys the inference stack. This stack uses the AWS CDK provider framework to create the Amazon Comprehend endpoint as a custom resource, so that creating, deleting, and updating of the Amazon Comprehend endpoint can be done as part of the inference stack lifecycle using the cdk deploy and cdk destroy commands.

Because you need to run the following command after model training is complete, which could take up to 10 hours, ensure that you’re in the Python virtual environment that you initialized in an earlier step and in the amazon-comprehend-custom-automate-classification-it-service-request directory:

$ cdk deploy APIGWInferenceStack --parameters APIGWInferenceStack:documentclassifierarn=<custom classifier model ARN retrieved from email>

For example:

$ cdk deploy APIGWInferenceStack --parameters APIGWInferenceStack:documentclassifierarn=arn:aws:comprehend:us-east-2:111122223333:document-classifier/ComprehendCustomClassifier-11111111-2222-3333-4444-abc5d67e891f/version/v1
  2. After the cdk deploy command completes successfully, copy the APIGWInferenceStack.ComprehendCustomClassfierInvokeAPI value from the console output, and use this REST API to generate inferences from a client machine or a third-party application that has execute-api:Invoke IAM privilege. If you’re running this solution in us-east-2, the format of this REST API is https://<restapi-id>.execute-api.us-east-2.amazonaws.com/prod/invokecomprehendV1.
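Because the REST API uses an IAM authorizer, each request has to be signed with Signature Version 4. The following is a minimal sketch of a signed call using botocore and requests. It is not the code of the repository's test client. The POST method and the plain-text request body are assumptions about how the API expects its input, and the function names are hypothetical.

```python
import re


def region_from_restapi(restapi):
    """Extract the Region from an API Gateway execute-api URL."""
    match = re.search(r"\.execute-api\.([a-z0-9-]+)\.amazonaws\.com", restapi)
    if match is None:
        raise ValueError(f"Not an API Gateway execute-api URL: {restapi}")
    return match.group(1)


def invoke(restapi, text, profile="default"):
    """Send a SigV4-signed request and return the decoded JSON response."""
    import boto3
    import requests
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    body = text.encode("utf-8")  # assumption: the API accepts the raw text as the body
    credentials = boto3.Session(profile_name=profile).get_credentials()
    request = AWSRequest(method="POST", url=restapi, data=body)
    # Sign for the execute-api service in the Region where the API is deployed.
    SigV4Auth(credentials, "execute-api", region_from_restapi(restapi)).add_auth(request)
    response = requests.post(restapi, data=body, headers=dict(request.headers.items()))
    response.raise_for_status()
    return response.json()
```

Deriving the Region from the URL avoids a mismatch between the signing Region and the Region where the REST API is deployed.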

Alternatively, you can use the test client apiclientinvoke.py from the GitHub repository to send a request to the custom classifier model. Before using the apiclientinvoke.py, ensure that the following prerequisites are in place:

  • You have the boto3 and requests Python packages installed using pip on the client machine.
  • You have configured Boto3 credentials. By default, the test client assumes that a profile named default is present and it has the execute-api:Invoke IAM privilege on the REST API.
  • SigV4Auth points to the Region where the REST API is deployed. Update the <AWS-REGION> value to us-east-2 in apiclientinvoke.py if your REST API is deployed in us-east-2.
  • You have assigned the raw_data variable with the text on which you want to make the class prediction or the classification request:
raw_data="""Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis."""
  • You have assigned the restapi variable with the REST API copied earlier:

restapi="https://<restapi-id>.execute-api.us-east-2.amazonaws.com/prod/invokecomprehendV1"

  3. Run the apiclientinvoke.py after the preceding updates:
$ python3 apiclientinvoke.py

You get the following response from the custom classifier model:

{
 "statusCode": 200,
 "body": [
	{
	 "Name": "SPARK",
	 "Score": 0.9999773502349854
	},
	{
	 "Name": "HIVE",
	 "Score": 1.1613215974648483e-05
	},
	{
	 "Name": "DROOLS",
	 "Score": 1.1110682862636168e-06
	}
   ]
}

Amazon Comprehend returns a confidence score for each label that it attributes to the text. The more confident the service is about a label, the closer the score is to 1. In this example, the custom classifier model trained on the SEOSS dataset predicts that the text belongs to class SPARK. You can use the classification returned by the model to classify IT service requests or predict their correct category, thereby reducing manual errors or misclassification of service requests.
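A client application could turn this response into a single category with a few lines of Python. The sketch below assumes the response has the shape shown above, with body as a list of Name and Score pairs. The function name and the min_score threshold are hypothetical additions for illustration.

```python
import json


def top_label(response_text, min_score=0.5):
    """Return the highest-scoring label, or None if no score reaches min_score."""
    payload = json.loads(response_text)
    best = max(payload["body"], key=lambda item: item["Score"])
    if best["Score"] < min_score:
        return None  # not confident enough; leave the ticket for manual triage
    return best["Name"]


sample = """
{"statusCode": 200,
 "body": [{"Name": "SPARK", "Score": 0.9999773502349854},
          {"Name": "HIVE", "Score": 1.1613215974648483e-05},
          {"Name": "DROOLS", "Score": 1.1110682862636168e-06}]}
"""
print(top_label(sample))  # SPARK
```

Returning None below the threshold gives the calling application a simple way to fall back to manual categorization when the model is unsure.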

Clean up

To clean up all the resources created in this post as part of the training stack and inference stack, use the following command. It deletes all the AWS resources created by the previous cdk deploy commands:

$ cdk destroy --all

Conclusion

In this post, we showed you how enterprises can implement a supervised ML model using Amazon Comprehend custom classification to predict the category of IT service requests based on either the subject or the description of the request submitted by the end-user. After you build and train a custom classifier model, you can run real-time analysis for custom classification by creating an endpoint. After you deploy this model to an Amazon Comprehend endpoint, it can be used to run real-time inference by third-party applications or other client machines, including IT service management tools. You can then use this inference to predict the defect category and reduce manual errors or misclassifications of tickets. This helps reduce delays for ticket resolution and increases resolution accuracy and customer productivity, which ultimately results in increased customer satisfaction.

You can extend the concepts in this post to other use cases, such as routing business or IT tickets to various internal teams such as business departments, customer service agents, and Tier 2/3 IT support, created either by end-users or through automated means.

References

  • Rath, Michael; Mäder, Patrick, 2019, “The SEOSS Dataset – Requirements, Bug Reports, Code History, and Trace Links for Entire Projects”, https://doi.org/10.7910/DVN/PDDZ4Q, Harvard Dataverse, V1

About the Authors

Arnab Chakraborty is a Sr. Solutions Architect at AWS based out of Cincinnati, Ohio. He is passionate about topics in Enterprise & Solution architecture, Data analytics, Serverless and Machine Learning. In his spare time, he enjoys watching movies, travel shows, and sports.

Viral Desai is a Principal Solutions Architect at AWS. With more than 25 years of experience in information technology, he has been helping customers adopt AWS and modernize their architectures. He likes hiking, and enjoys diving deep with customers on all things AWS.

How we’re using machine learning to understand proteins

When most people think of proteins, their mind typically goes to protein-rich foods such as steak or tofu. But proteins are so much more. They’re essential to how living things operate and thrive, and studying them can help improve lives. For example, insulin treatments, which are based on years of studying proteins, are life-changing for people with diabetes.

There is a world of information yet to discover when it comes to proteins — from helping people get the healthcare they need to finding ways to protect plant species. Teams at Google are focused on studying proteins so we can realize Google Health’s mission to help billions of people live healthier lives.

Back in March, we published a post about a model we developed at Google that predicts protein function and a tool that allows scientists to use this model. Since then, the protein function team has accomplished more work in this space. We chatted with software engineer Max Bileschi to find out more about studying proteins and the work Google is doing.

Can you give us a quick crash course in proteins?

Proteins dictate so much of what happens in and around us, like how we and other organisms function.

Two things determine what a protein does: its chemical formula and its environment. For example, we know that human hemoglobin, a protein inside your blood, carries oxygen to your organs. We also know that if there are particular tiny changes to the chemical formula of hemoglobin in your body, it can trigger sickle cell anemia. Further, we know that blood behaves differently at different temperatures because proteins behave differently at higher temperatures.

So why did a team at Google start studying proteins?

We have the opportunity to look at how machine learning can help various scientific fields. Proteins are an obvious choice because of the amazing breadth of functions they have in our bodies and in the world. There is an enormous amount of public data, and while individual researchers have done excellent work studying specific proteins, we know that we’ve just scratched the surface of fully understanding the protein universe. It’s highly aligned to Google’s mission of organizing information and making it accessible and useful.

This sounds exciting! Tell us more about the use of machine learning in identifying what proteins do and how it improves upon the status quo.

Only around 1% of proteins have been studied in a laboratory setting. We want to see how machine learning can help us learn about the other 99%.

It’s difficult work. There are at least a billion proteins in the world, and they’ve evolved throughout history and have been shaped by the same forces of natural selection we normally think of as acting on DNA. It’s useful to understand this evolutionary relatedness among proteins. The presence of a similar protein in two or more distantly related organisms (say humans and zebrafish) can be indicative that it’s useful for survival. Proteins that are closely related can have similar functions but with small differences, like encouraging the same chemical reaction but doing so at different temperatures. Sometimes it’s easy to determine that two proteins are closely related, but other times it’s difficult. This was the first problem in protein function annotation that we tackled with machine learning.

Machine learning works best when it complements, rather than replaces, current techniques. For example, we demonstrated that about 300 previously-uncharacterized proteins are related to “phage capsid” proteins. These capsid proteins can help us deliver medicines to the cells that really need them. We worked with a trusted protein database, Pfam, to confirm our hypothesis, and now these proteins are listed as being related to phage capsid proteins for all the public, including researchers, to see.

Back up a bit. Can you explain what the protein family database Pfam is? How has your team contributed to this database?

A community of scientists built a number of tools and databases, over decades, to help classify what each different protein does. Pfam is one of the most-used databases, and it classifies proteins into about 20,000 types of proteins.

This work of classifying proteins requires both computer models and experts (called curators) to validate and improve the computer models.

Graph showing Pfam region coverage over time, depicting that machine learning helped grow the database and add several years of progress.

We used machine learning to add classifications for human proteins that previously lacked Pfam classifications — helping grow the database and adding several years of progress.

Since the publication of your paper ‘Using deep learning to annotate the protein universe’ in June, what has your team been up to?

We’re focused on identifying more proteins and sharing that knowledge with the science and research community. And we’re soon making Pfam data and MGnify data, another database that catalogs microbiome data, available on Google Cloud Platform so more people can have access to it. Later this year, we’ll launch an initiative with UniProt, a prominent database in our field, to use language models to name uncharacterized proteins in UniProt. We’re excited about the progress we’re making and how sharing this data can help solve challenging problems.

How mapping the world’s buildings makes a difference

In Lamwo district, in northern Uganda, providing access to electricity is a challenge. In a country where only about 24% of the population has a power supply to their home from the national grid, the rate in Lamwo is even lower. This is partly due to lack of information: The government doesn’t have precise data about where settlements are located, what types of buildings there are, and what the buildings’ electricity needs might be. And canvassing the area isn’t practical, because the roads require four-wheel-drive vehicles and are impassable in the rain.

Ernest Mwebaze leads Sunbird AI, a Ugandan nonprofit that uses data technology for social good. They’re assessing areas in Lamwo district to support planning at the Ministry of Energy in Uganda. “There are large areas to plan for,” explains Ernest. “Even when you’re there on the ground, it’s difficult to get an overall sense of where all the buildings are and what is the size of each settlement. Currently people have to walk long distances just to charge their phones.”

To help with their analysis, Ernest’s team have been using Google’s Open Buildings. An open-access dataset project based on satellite imagery pinpointing the locations and geometry of buildings across Africa, Open Buildings allows the team to study the electrification needs, and potential solutions, at a level of detail that was previously impossible.

Our research center in Ghana led the development of the Open Buildings project to support policy planning for the areas in the world with the biggest information gaps. We created it by applying artificial intelligence methods to satellite imagery to identify the locations and outlines of buildings.

Since we released the data, we’ve heard from many organizations — including UN agencies, nonprofits and academics — who have been using it:

  • The UN Refugee Agency, UNHCR, has been using Open Buildings for survey sampling. It’s common to do household surveys in regions where people have been displaced, in order to know what people need. But UNHCR needs to first have an assessment of where the households actually are, which is where the Open Buildings project has been useful.
  • UN Habitat is using Open Buildings to study urbanization across the African continent. Having detail on the way that cities are laid out enables them to make recommendations on urban planning.
  • The International Energy Agency is using Open Buildings to estimate energy needs. With data about individual buildings, they can assess the needs of communities at a new level of precision and know how much energy is needed for cooking, lighting and for operating machinery. This will help with planning sustainable energy policy.

We’re excited to make this information available in more countries and to assist more organizations in their essential work. As Ernest says, “By providing decision makers with better data, they can make better decisions. Geographical data is particularly important for providing an unbiased source of information for planning basic services, and we need more of it.”

Detect fraud in mobile-oriented businesses using GrabDefence device intelligence and Amazon Fraud Detector

In this post, we present a solution that combines rich mobile device intelligence with customized machine learning (ML) modeling to help you catch fraudsters who exploit mobile apps.

GrabDefence (GD), Grab’s proprietary fraud detection and prevention technology, and AWS have launched GDxAFD, a fraud detection solution tailored for mobile apps that integrates GD’s device intelligence capabilities with Amazon Fraud Detector, AWS’s fully managed ML fraud detection solution. With GDxAFD, you can take advantage of more than 20 years of fraud detection expertise from Amazon as well as extensive mobile fraud experience from Southeast Asia’s leading superapp to safeguard your mobile application from fraudsters.

This solution rides on a larger global wave of anti-fraud efforts, which experts forecast to grow to USD 62.70 billion by 2028. With the rise of the digital economy, fraud syndicates increasingly target online businesses, causing financial loss and destroying the trust between end-users and the platform. The true cost to battle fraud is also increasing rapidly as more fraud checks lead to poorer customer experience, false positives, and operational burden, which as a whole are estimated to be three times larger than the actual fraud losses, according to the True Cost of Fraud™ APAC Study by LexisNexis® Risk Solutions.

From the combined industry experience, the solution team believes that many of the modus operandi in a mobile environment are driven by fraudsters having tools and methods to create fake accounts at scale and bypass a platform’s security checks on the device, thereby enabling them to exploit the platform for large returns. Therefore, preventing mobile fraud starts from clearly understanding the risk profile of the devices used to access the mobile app and then using the device risk intelligence gathered, together with additional data about the user, event, or account, to detect potential fraudulent behavior in real time and at scale. By combining rich device intelligence and ML, companies are better positioned to stay ahead of mobile-focused fraud syndicates, and reduce fraud on their platforms.

GD device intelligence

GD is a product from Grab’s fraud prevention team, which has years of experience building solutions for Grab. Grab is a NASDAQ-listed company and a leading superapp in Southeast Asia (SEA), with over 30 million monthly transacting users (as per Grab’s Q1 2022 Results). Due to the scale of its operations as a leading superapp in SEA and the nature of a mobile-first business, Grab has been investing heavily in building fraud prevention solutions enabled by rich data, technology focus, and insights gathered from its operational experience and exposure. GD’s device intelligence service collects rich device-level data, excluding any personally identifiable information (PII), from mobile application users and securely analyzes it to understand the risk profile of the device. Learning from a large device network built via Grab’s superapp, GD’s device intelligence service can accurately generate device fingerprints and detect risky attributes such as device or app modification or tampering, emulator usage, and GPS spoofing. As mentioned earlier, many fraud modus operandi on mobile platforms involve mass creation of fake accounts, device reengineering, and location spoofing, which GD device intelligence is capable of detecting. As a result, by integrating GD device intelligence and Amazon Fraud Detector, platforms that face similar fraud attacks can expect up to a 23% increase in fraud detection based on statistical studies done by GrabDefence on Grab’s fraud prevention systems.

Custom fraud detection ML models in Amazon Fraud Detector

Amazon Fraud Detector customizes each model it creates to your own dataset, making the accuracy of models higher than one-size-fits-all ML solutions. During the fully automated model training process, a series of models that have learned patterns of fraud from AWS and Amazon’s own fraud expertise are used to boost your model performance even further.

With the GDxAFD solution, you now have step-by-step guidance and a reference architecture for how to use flexible event schemas in Amazon Fraud Detector to add GD device intelligence findings into your custom fraud detector models. The end result is an ML model that, once trained, has the benefit of learning from multiple data sources, including your own historical data, GD’s device intelligence data, fraud patterns seen across Amazon, and additional third-party data (added automatically by Amazon Fraud Detector). Based on our pilot between GD and Amazon Fraud Detector, our model using GD device intelligence has shown a 23% increase in detection performance for detecting fake account registrations. You can deploy these models to detect mobile fraud to prevent not only fake account registration but also fraudulent payments, promotion abuse, or loyalty program abuse, among others.

To get started, you first integrate GD’s mobile SDK into your mobile application to collect device-level data. Next, you use Amazon Fraud Detector to define the event you want to evaluate for fraud by specifying the event and account data points you have available for the event or account, including the device risk intelligence data points from GD. After this, you train your ML model in Amazon Fraud Detector in just a few steps. After you train the model, you can add it to a detector.

To begin performing real-time predictions, you integrate Amazon Fraud Detector’s low-latency prediction API into your application and begin sending new mobile events to generate fraud predictions. Each fraud prediction considers the GD device intelligence data for the device associated with the event as well as additional data and intelligence automatically added by Amazon Fraud Detector, including signals from fraud patterns experienced across Amazon.

Solution overview

Device intelligence is a critical type of input for risk decisions. One of the common challenges in mobile fraud detection is the lack of enriched data for making risk decisions. At the same time, mobile devices are typically the most expensive asset that fraudsters and fraud syndicates possess, so they put significant effort into masking the true identity and profile of the device being used. Understanding the risk profile of the mobile device (which sometimes isn’t even a real device) and being able to derive insights from the relationships between different mobile devices can significantly improve risk decisions for any mobile business, and becomes central to any mobile-based fraud management strategy.

For generating real-time fraud predictions, the GDxAFD solution uses Amazon Fraud Detector and GrabDefence’s device intelligence SDK, along with Amazon API Gateway and AWS Lambda. You can provision the AWS portions of the solution using AWS CloudFormation.

The following diagram illustrates our solution architecture.

The workflow consists of the following steps:

  1. When an end-user interacts with your mobile app, GD’s mobile SDK passively gathers device data and streams this data to GD’s device intelligence service, where a risk profile for the device is generated.
  2. Then, when that user transacts using the mobile app and you want to assess fraud risk in real time, the mobile app sends the transaction data gathered by the app via API Gateway to a Lambda function.
  3. The Lambda function gathers the GrabDefence risk profile for the device used during the transaction, combines that profile data with the other transaction data, and sends it to the fraud detector.
  4. The fraud detector performs a fraud prediction using your custom fraud detection ML model and ruleset, and returns a risk score and outcome to the Lambda function. This result is sent back to your mobile app via API Gateway.
  5. If desired, the mobile app can then choose to adjust the end-user experience accordingly based on this risk assessment.

Use cases for device intelligence with Amazon Fraud Detector

The ideal end-state solution is an Amazon Fraud Detector model that is trained on a dataset of your historical events and their associated historical GD device intelligence data. To achieve this, you need to integrate the GD Guardian SDK for mobile devices and then gather device intelligence data for your events until you have enough to train a model (for example, 10,000 events with at least 400 examples of fraud events). Depending on your use case and availability of fraud labels, you have a couple of ways to get started sooner as you gather data for this solution:

  • Use case A: Use GD device intelligence data directly in the fraud detector rules – With this use case, you create a detector in Amazon Fraud Detector with a ruleset designed to flag high-risk events provided by the device intelligence. This works effectively when you have clear risk mitigation policies that you want to deploy for your platform (for example, act on the user if the device is jailbroken, or don’t allow redemption of a promo if the device has more than five accounts). In such cases, you can set up your detector rules to flag events based on a combination of GD device risk score and GD device verdicts. This option requires no historical event data or labels to get started, so it can be ready to use sooner than the ML-based detection options.
  • Use case B: Use GD device intelligence and an Amazon Fraud Detector ML model with the fraud detector rules – If you have a historical event dataset and are able to train an Amazon Fraud Detector ML model immediately, you can build on use case A by adding an Amazon Fraud Detector model to your rules-based detector. This way, your detector logic is evaluating device intelligence with rules and all other event data with a customized ML model. This allows you to solve for more complex fraud tactics where statistical methods are required to separate fraud from non-fraud.

Best results are often achieved when both of these scenarios work in tandem, because they can serve different use cases over time even after you have more historical data. With these methods, Amazon Fraud Detector makes it easy to transition to the ideal solution in a few steps.
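The rules-based approach in use case A can be illustrated with a small local simulation. This is only a sketch of the decision logic in Python, not Amazon Fraud Detector's rule expression language. The outcome names, the 0.8 score threshold, and the accounts_on_device input are assumptions made for illustration, while the jailbreak and five-account policies come from the examples above.

```python
from typing import Iterable

# Verdict codes treated as blocking. The selection is an assumption for this sketch.
BLOCKING_VERDICTS = {"GV_IOS_JAIL_BROKEN", "GV_DEBUGGER_DETECTED"}

# Hypothetical threshold on a 0-to-1 GD device risk score.
REVIEW_SCORE_THRESHOLD = 0.8


def evaluate_device_rules(risk_score: float,
                          verdicts: Iterable[str],
                          accounts_on_device: int) -> str:
    """Return a (hypothetical) outcome name; rules are checked in priority order."""
    verdict_set = set(verdicts)
    # Rule 1: act on the user if the device shows a tampering verdict.
    if verdict_set & BLOCKING_VERDICTS:
        return "block"
    # Rule 2: don't allow promo redemption if the device has more than five accounts.
    if accounts_on_device > 5:
        return "deny_promo"
    # Rule 3: send high-risk devices to manual review.
    if risk_score >= REVIEW_SCORE_THRESHOLD:
        return "review"
    return "approve"
```

In a real detector, each branch would become a separate rule with its own outcome, and rule ordering would determine which outcome is returned first.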

In the following sections, we walk through the steps to get started using Amazon Fraud Detector with GD device intelligence data.

Integrate the GD mobile SDK and start collecting device intelligence data

Prior to using GrabDefence device intelligence within your application, you must first register as a GrabDefence client. You receive the following credentials from the GrabDefence team:

  • tenant_id – A unique client identifier that represents your organization
  • app_id – A unique application identifier that represents the application you’re integrating

Refer to the GrabDefence documentation for further guidance on how to integrate this SDK.

Create your event type in Amazon Fraud Detector

An event type defines the schema for the event you want to assess for fraud. When creating an event type in Amazon Fraud Detector, you define all the data elements you will have available at the time of the fraud evaluation, including the GD device intelligence risk profile data elements such as the unique device ID and various device verdicts, to Amazon Fraud Detector variables. You need to include event variables (such as IP, email, or billing address) that are unique to the type of event you’re evaluating for fraud, as well as GD device intelligence data. The following table shows examples of event variables, the GD device intelligence data, and the recommended Amazon Fraud Detector variable type to map each element to.

Event Variable Type | Event Variable (Not Exhaustive) | Amazon Fraud Detector Event Variable | Example
Event Metadata | EVENT_TIMESTAMP | EVENT_TIMESTAMP | 2019-11-30T13:01:01Z
Event Metadata | EVENT_ID | EVENT_ID | test0299df10-e2db-11eb-96e2-f7dgje3d3k03
Event Metadata | ENTITY_ID | ENTITY_ID | 123
Event Metadata | EVENT_LABEL | EVENT_LABEL | FRAUD or LEGIT
Event Metadata | LABEL_TIMESTAMP | LABEL_TIMESTAMP | 2019-11-30T13:01:01Z
Event Variables | Email | EMAIL_ADDRESS | test@example.com
Event Variables | IP | IP_ADDRESS | 192.0.2.1
Event Variables | Phone | PHONE_NUMBER | 555-0123
GD Device Intelligence Verdicts | Verdict: IOS Jailbroken Device | CUSTOM: CATEGORICAL | GV_IOS_JAIL_BROKEN
GD Device Intelligence Verdicts | Verdict: Debugger Detected | CUSTOM: CATEGORICAL | GV_DEBUGGER_DETECTED
GD Device Intelligence Verdicts | Verdict: Event Token Signature Mismatch | CUSTOM: CATEGORICAL | GV_EVENT_TOKEN_SIGNATURE_MISMATCH
GD Device Intelligence Verdicts | Verdict: Server Challenge Mismatch | CUSTOM: CATEGORICAL | GV_SERVER_CHALLENGE_MISMATCH
GD Risk Scores | User account risk score | CUSTOM: NUMERICAL | 0.9 etc
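As a rough sketch, the mapping in the table could be captured in code before creating the resources. The dictionaries below are meant to mirror the request fields of the Amazon Fraud Detector CreateVariable and PutEventType APIs as we understand them, so verify the field names against the API reference before use. The snake_case variable names, default values, label names, and the event type name are hypothetical.

```python
from typing import Dict, List, Tuple

# (variable name, variable type, data type, default value)
VariableSpec = Tuple[str, str, str, str]

# Event metadata (EVENT_TIMESTAMP, EVENT_ID, and so on) is not declared here;
# only event variables and GD device intelligence elements are.
VARIABLES: List[VariableSpec] = [
    ("email_address", "EMAIL_ADDRESS", "STRING", "unknown"),
    ("ip_address", "IP_ADDRESS", "STRING", "unknown"),
    ("phone_number", "PHONE_NUMBER", "STRING", "unknown"),
    ("gd_ios_jail_broken", "CATEGORICAL", "STRING", "FALSE"),
    ("gd_debugger_detected", "CATEGORICAL", "STRING", "FALSE"),
    ("gd_event_token_signature_mismatch", "CATEGORICAL", "STRING", "FALSE"),
    ("gd_server_challenge_mismatch", "CATEGORICAL", "STRING", "FALSE"),
    ("gd_account_risk_score", "NUMERIC", "FLOAT", "0.0"),
]


def build_variable_requests(specs: List[VariableSpec]) -> List[Dict[str, str]]:
    """Build one CreateVariable-style request dictionary per variable."""
    return [
        {
            "name": name,
            "variableType": variable_type,
            "dataType": data_type,
            "dataSource": "EVENT",
            "defaultValue": default,
        }
        for name, variable_type, data_type, default in specs
    ]


def build_event_type_request(event_type_name: str,
                             specs: List[VariableSpec],
                             labels: List[str],
                             entity_types: List[str]) -> Dict[str, object]:
    """Build a PutEventType-style request that references the variables by name."""
    return {
        "name": event_type_name,
        "eventVariables": [name for name, _, _, _ in specs],
        "labels": list(labels),
        "entityTypes": list(entity_types),
    }
```

With boto3, each dictionary could then be passed as keyword arguments to the frauddetector client, for example `client.create_variable(**request)`, once the referenced labels and entity types exist.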

Build your detection logic in Amazon Fraud Detector

At this point, you need to decide whether you want to start with use case A or use case B. For use case A, you start building a rules-based detector. For use case B, you build an Amazon Fraud Detector model first and, once finished, add the model to your detector.

For instructions on building an Amazon Fraud Detector model and detector, refer to the Amazon Fraud Detector user guide.

The following screenshot shows sample detector rules on the Amazon Fraud Detector console.

Test your detector using Amazon Fraud Detector batch predictions

You can use a batch predictions job to test your detector against a set of events using either the Amazon Fraud Detector console or the CreateBatchPredictionJob API. You need to specify the detector version (created in the previous step) and provide the events via a CSV file (up to 50 MB large) stored in an Amazon Simple Storage Service (Amazon S3) bucket. The output file containing the original input data along with appended results of the detector’s predictions will be available in the same S3 bucket (unless you specify a different location).

For more information on running an Amazon Fraud Detector batch prediction, refer to the Amazon Fraud Detector batch predictions documentation page.
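Preparing the input file for a batch predictions job might look like the following sketch. It assumes the CSV needs EVENT_ID, ENTITY_ID, EVENT_TIMESTAMP, and ENTITY_TYPE columns in addition to your event variables; confirm the required headers in the batch predictions documentation. The variable names are hypothetical, and the size check reflects the 50 MB limit mentioned above.

```python
import csv
import io
from typing import Dict, Iterable, List

# Metadata columns assumed to be required by batch predictions.
REQUIRED_COLUMNS = ["EVENT_ID", "ENTITY_ID", "EVENT_TIMESTAMP", "ENTITY_TYPE"]

MAX_CSV_BYTES = 50 * 1024 * 1024  # 50 MB input limit


def events_to_csv(events: Iterable[Dict[str, str]],
                  variable_names: List[str]) -> str:
    """Serialize events to CSV text; missing variables become empty cells."""
    columns = REQUIRED_COLUMNS + list(variable_names)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for event in events:
        missing = [col for col in REQUIRED_COLUMNS if col not in event]
        if missing:
            raise ValueError(f"event is missing required columns: {missing}")
        writer.writerow({col: event.get(col, "") for col in columns})
    return buffer.getvalue()


def fits_size_limit(csv_text: str, limit_bytes: int = MAX_CSV_BYTES) -> bool:
    """Check the encoded size before uploading the file to Amazon S3."""
    return len(csv_text.encode("utf-8")) <= limit_bytes
```

The resulting text can be uploaded to your S3 bucket and referenced as the input path of the batch predictions job.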

Set up the supporting infrastructure

To perform real-time predictions using the detector you built, you must set up a Lambda function that performs the following actions:

  1. Receives transaction data (via API Gateway) gathered from your mobile app. This includes data such as IP address, email address, shipping and billing info, and so on, that is unique to the transaction and use case.
  2. Collects the risk profile from the GD API. This includes device intelligence data and risk signals from GD. You need to convert the GD verdicts to the appropriate Amazon Fraud Detector variable CUSTOM: CATEGORICAL types. For example, if the GD verdict list contains GV_IOS_JAIL_BROKEN, you need to set the Verdict: IOS Jailbroken Device variable to TRUE when sending to Amazon Fraud Detector (as detailed in the next section).
  3. Sends the data to the detector using the GetEventPrediction API (see the next section).
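The verdict conversion in step 2 can be sketched as follows. The shape of the GD risk profile (a dictionary with "verdicts" and "risk_score" keys) and the snake_case variable names are assumptions for illustration, because the GD API response format isn't described here. Values are converted to strings because GetEventPrediction expects event variables as a string-to-string map.

```python
from typing import Dict, Iterable, Mapping

# Hypothetical mapping from GD verdict codes to Amazon Fraud Detector variable names.
VERDICT_TO_VARIABLE = {
    "GV_IOS_JAIL_BROKEN": "gd_ios_jail_broken",
    "GV_DEBUGGER_DETECTED": "gd_debugger_detected",
    "GV_EVENT_TOKEN_SIGNATURE_MISMATCH": "gd_event_token_signature_mismatch",
    "GV_SERVER_CHALLENGE_MISMATCH": "gd_server_challenge_mismatch",
}


def verdicts_to_variables(verdicts: Iterable[str]) -> Dict[str, str]:
    """Set each known verdict variable to TRUE if present in the list, else FALSE."""
    present = set(verdicts)
    return {
        variable: "TRUE" if code in present else "FALSE"
        for code, variable in VERDICT_TO_VARIABLE.items()
    }


def build_event_variables(transaction: Mapping[str, object],
                          risk_profile: Mapping[str, object]) -> Dict[str, str]:
    """Merge transaction data with GD device intelligence into one variable map."""
    variables = {key: str(value) for key, value in transaction.items()}
    variables.update(verdicts_to_variables(risk_profile.get("verdicts", [])))
    if "risk_score" in risk_profile:
        variables["gd_account_risk_score"] = str(risk_profile["risk_score"])
    return variables
```

Verdict codes that aren't in the mapping are ignored in this sketch; you may prefer to log them so new GD verdicts don't go unnoticed.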

Perform real-time predictions using the Amazon Fraud Detector GetEventPrediction API

Your Lambda function can call the Amazon Fraud Detector GetEventPrediction API to perform real-time predictions and obtain results synchronously. The GetEventPrediction API returns matched outcomes based on the rules you set up earlier. If you attached a model to your detector in Amazon Fraud Detector, the model score is also returned as part of the GetEventPrediction API response. You can find examples of GetEventPrediction requests on the aws-fraud-detector-samples GitHub repository.

You can configure your Lambda function accordingly to parse the response from this API, and return the appropriate action to the mobile application (via API Gateway).
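A minimal sketch of the request and response handling is shown below. The client is passed in so the logic can be exercised without AWS credentials; in a Lambda function it would typically be `boto3.client("frauddetector")`. The detector, event type, and entity names are placeholders, and the response parsing assumes the `ruleResults` and `modelScores` fields of the GetEventPrediction response.

```python
from typing import Any, Dict, List, Mapping


def build_prediction_request(detector_id: str, event_type: str, event_id: str,
                             entity_type: str, entity_id: str, timestamp: str,
                             variables: Mapping[str, str]) -> Dict[str, Any]:
    """Assemble keyword arguments for the GetEventPrediction call."""
    return {
        "detectorId": detector_id,
        "eventId": event_id,
        "eventTypeName": event_type,
        "eventTimestamp": timestamp,  # ISO 8601 string, for example 2019-11-30T13:01:01Z
        "entities": [{"entityType": entity_type, "entityId": entity_id}],
        "eventVariables": dict(variables),
    }


def summarize_prediction(response: Mapping[str, Any]) -> Dict[str, Any]:
    """Collect matched outcomes and model scores from the response."""
    outcomes: List[str] = []
    for rule_result in response.get("ruleResults", []):
        outcomes.extend(rule_result.get("outcomes", []))
    scores: Dict[str, float] = {}
    for model_score in response.get("modelScores", []):
        scores.update(model_score.get("scores", {}))
    return {"outcomes": outcomes, "scores": scores}


def predict(client: Any, request: Mapping[str, Any]) -> Dict[str, Any]:
    """Call the detector and return a compact summary for the mobile app."""
    return summarize_prediction(client.get_event_prediction(**request))
```

The summary dictionary can be serialized as the Lambda response body, leaving the mobile app to decide how to act on the outcomes.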

Build and train your model

After you integrate the GD SDK and are generating predictions with Amazon Fraud Detector, your events are stored in Amazon Fraud Detector and you can use the UpdateEventLabel API to add fraud labels for confirmed fraud events. When your stored dataset has 10,000 events with device data and at least 400 labelled as fraud, you can start building a custom Amazon Fraud Detector model that learns from GD’s device intelligence data.
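The readiness check described above is simple enough to sketch in a few lines. The thresholds come from the text (10,000 events, at least 400 labelled as fraud); the label name "fraud" and the helper names are assumptions. The second function builds keyword arguments that are intended to match the UpdateEventLabel request fields, which you should verify against the API reference.

```python
from typing import Dict, Iterable, Optional

MIN_EVENTS = 10_000
MIN_FRAUD_EVENTS = 400


def ready_to_train(labels: Iterable[Optional[str]],
                   fraud_label: str = "fraud") -> bool:
    """Return True once the stored dataset meets both minimums.

    `labels` holds one entry per stored event; unlabelled events are None.
    """
    labels = list(labels)
    fraud_count = sum(1 for label in labels if label == fraud_label)
    return len(labels) >= MIN_EVENTS and fraud_count >= MIN_FRAUD_EVENTS


def build_label_update(event_id: str, event_type: str,
                       label: str, label_timestamp: str) -> Dict[str, str]:
    """Assemble keyword arguments for an UpdateEventLabel call."""
    return {
        "eventId": event_id,
        "eventTypeName": event_type,
        "assignedLabel": label,
        "labelTimestamp": label_timestamp,
    }
```

A periodic job could run the check against your own event records and notify you when the dataset is large enough to start training.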

At this point, you’re ready to train the model. This takes a few steps on the Amazon Fraud Detector console, and model training typically takes around an hour but can be longer depending on the size of your training dataset.

  1. On the Amazon Fraud Detector console, choose Create model.
  2. Choose Transaction Fraud Insights as the model type.
  3. Choose the event type you created earlier.
  4. Choose the date range for your training dataset that encompasses the period where you’ve collected GD device intelligence data.
  5. Add all the event type variables, including the GD device-specific elements, to your model’s input configuration.
  6. Start training the model.

After your model is trained, you can review performance metrics and then deploy it by changing its status to Active. To learn more about model scores and performance metrics, see Model scores and Training performance metrics. At this point, you can now add your model to your detector, add threshold rules to interpret the risk scores that the model outputs, and continue making predictions using the GetEventPrediction API.

Automate the solution

You can use AWS CloudFormation to automate the creation of your Amazon Fraud Detector event type and related resources. For more details, refer to managing resources using AWS CloudFormation.

Conclusion

Congrats! You have successfully built an Amazon Fraud Detector model that integrates GD device intelligence into your detector. The Amazon Fraud Detector ML model you trained has learned from multiple data sources, including your own historical data, GD’s device intelligence data, fraud patterns seen across Amazon, and additional third-party data (added automatically by Amazon Fraud Detector). You can deploy this solution on your mobile apps to detect and capture various types of mobile fraud.

Special thanks to everyone who contributed to this blog, including Abhishek Ravi, Tanay Bhargava, Eric Burris, Puneet Gambhir (GrabDefence), Brian Kim (GrabDefence), and Sing Kwan Ng (GrabDefence).


About the author

Marcel Pividal is a Sr. AI Services Solutions Architect in the World-Wide Specialist Organization. Marcel has more than 20 years of experience solving business problems through technology for Fintechs, Payment Providers, Pharma, and government agencies. His current areas of focus are Risk Management, Fraud Prevention, and Identity Verification.

Adriaan de Jonge is a Partner Solutions Architect at AWS in Singapore. He is part of the AWS GSI team in the ASEAN geography. Adriaan is particularly interested in serverless, cloud-native development, and DevOps. In his spare time, he likes to bake cakes that are suitable for people with allergies.

Jianbo Liu is a Research Scientist with Amazon Fraud Detector.
