AI for the board game Diplomacy
Successful communication and cooperation have been crucial for helping societies advance throughout history. The closed environments of board games can serve as a sandbox for modelling and investigating interaction and communication – and we can learn a lot from playing them. In our recent paper, published today in Nature Communications, we show how artificial agents can use communication to better cooperate in the board game Diplomacy, a vibrant domain in artificial intelligence (AI) research, known for its focus on alliance building. Read More
Amazon adds Catalan to MASSIVE dataset
New language data will find immediate adoption by Barcelona Supercomputing Center. Read More
Metrics for evaluating an identity verification solution
Globally, there has been an accelerated shift toward frictionless digital user experiences. Whether it’s registering at a website, transacting online, or simply logging in to your bank account, organizations are actively trying to reduce the friction their customers experience while at the same time enhancing their security, compliance, and fraud prevention measures. The shift toward frictionless user experiences has given rise to face-based biometric identity verification solutions aimed at answering the question “How do you verify a person in the digital world?”
There are two key advantages of facial biometrics when it comes to identification and authentication. First, it’s a convenient technology for users: there is no need to remember a password, deal with multi-factor challenges, click verification links, or solve CAPTCHA puzzles. Second, it provides a high level of security: identification and authentication based on facial biometrics are secure and less susceptible to fraud and attacks.
In this post, we dive into the two primary use cases of identity verification: onboarding and authentication. We then look at the two key metrics used to evaluate a biometric system’s accuracy: the false match rate (also known as the false acceptance rate) and the false non-match rate (also known as the false rejection rate). These two measures are widely used by organizations to evaluate the accuracy and error rates of biometric systems. Finally, we discuss a framework and best practices for evaluating an identity verification service.
Refer to the accompanying Jupyter notebook that walks through all the steps mentioned in this post.
Use cases: Onboarding and Authentication
There are two primary use cases for biometric solutions: user onboarding (often referred to as verification) and authentication (often referred to as identification). Onboarding entails one-to-one matching of faces between two images, for example comparing a selfie to a trusted identification document like a driver’s license or passport. Authentication, on the other hand, entails one-to-many search of a face against a stored collection of faces, for example searching a collection of employee faces to see if an employee is authorized to access a particular floor in a building.
Accuracy performance of onboarding and authentication use cases is measured by the false positive and false negative errors that the biometric solution can make. A similarity score (ranging from 0% meaning no match to 100% meaning a perfect match) is used to make the determination of a match or a non-match decision. A false positive occurs when the solution considers images of two different individuals to be the same person. A false negative, on the other hand, means that the solution considered two images of the same person to be different.
Onboarding: One-to-one verification
Biometric-based onboarding both simplifies and secures the process. Most importantly, it sets the organization and customer up for a near-frictionless onboarding experience. Users are simply required to present an image of some form of trusted identification document containing their face (such as a driver’s license or passport) and take a selfie during the onboarding process. After the system has these two images, it compares the faces within them. When the similarity is greater than a specified threshold, you have a match; otherwise, you have a non-match. The following diagram outlines the process.
Consider the example of Julie, a new user opening a digital bank account. The solution prompts her to snap a picture of her driver’s license (step 2) and snap a selfie (step 3). After the system checks the quality of the images (step 4), it compares the face in the selfie to the face on the driver’s license (one-to-one matching) and produces a similarity score (step 5). If the similarity score is less than the required similarity threshold, the onboarding attempt by Julie is rejected. This is what we call a false non-match or false rejection: the solution considered two images of the same person to be different. On the other hand, if the similarity score is greater than the required threshold, the solution considers the two images to be the same person, or a match.
Authentication: One-to-many identification
From entering a building, to checking in at a kiosk, to prompting a user for a selfie to verify their identity, this type of zero-to-low-friction authentication via facial recognition has become commonplace for many organizations. Instead of performing image-to-image matching, this authentication use case takes a single image and compares it to a searchable collection of images for a potential match. In a typical authentication use case, the user is prompted to snap a selfie, which is then compared against the faces stored in the collection. The result of the search yields zero, one, or more potential matches with corresponding similarity scores and external identifiers. If no match is returned, then the user is not authenticated; however, assuming the search returns one or more matches, the system makes the authentication decision based on the similarity scores and external identifiers. If the similarity score exceeds the required similarity threshold and the external identifier matches the expected identifier, then the user is authenticated (matched). The following diagram outlines an example face-based biometric authentication process.
Consider the example of Jose, a gig-economy delivery driver. The delivery service authenticates delivery drivers by prompting the driver to snap a selfie before starting a delivery using the company’s mobile application. One problem gig-economy service providers face is job-sharing; essentially two or more users share the same account in order to game the system. To combat this, many delivery services use an in-car camera to snap images (step 2) of the driver at random times during a delivery (to ensure that the delivery driver is the authorized driver). In this case, Jose not only snaps a selfie at the start of his delivery, but an in-car camera snaps images of him during the delivery. The system performs quality checks (step 3) and searches (step 4) the collection of registered drivers to verify the identity of the driver. If a different driver is detected, then the gig-economy delivery service can investigate further.
A false match (false positive) occurs when the solution considers two or more images of different people to be the same person. In our use case, suppose that Jose, the authorized driver, lets his brother Miguel take one of his deliveries for him. If the solution incorrectly matches Miguel’s selfie to the images of Jose, a false match (false positive) occurs.
To combat the potential for false matches, we recommend that collections contain several images of each subject. It’s common practice to index trusted identification documents containing a face, a selfie at time of onboarding, and selfies from the last several identification checks. Indexing several images of a subject provides the ability to aggregate the similarity scores across the faces returned, thereby improving the accuracy of the identification. Additionally, external identifiers are used to limit the risk of a false acceptance. An example business rule might look something like this:
IF aggregate similarity score >= required similarity threshold AND external identifier == expected identifier THEN authenticate
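As an illustration, the rule above might be implemented as follows. This is a minimal sketch; the function and field names are assumptions rather than part of any SDK, and the match list would come from your face search results.

```python
# Illustrative sketch of the business rule above. The function and field names
# are assumptions, not part of any SDK; the matches would come from a face search.

def authenticate(face_matches, expected_identifier, required_similarity=99.0):
    """face_matches: list of dicts like {"similarity": 99.2, "external_id": "driver-1234"}."""
    # Keep only the matches that belong to the identity we expect to see
    relevant = [m for m in face_matches if m["external_id"] == expected_identifier]
    if not relevant:
        return False
    # Aggregate the similarity across the returned faces (a simple mean here)
    aggregate_similarity = sum(m["similarity"] for m in relevant) / len(relevant)
    return aggregate_similarity >= required_similarity

# Example: two stored faces of the expected driver matched the probe selfie
matches = [
    {"similarity": 99.4, "external_id": "driver-1234"},
    {"similarity": 98.9, "external_id": "driver-1234"},
]
print(authenticate(matches, "driver-1234"))  # True: mean similarity 99.15 >= 99.0
```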
Key biometric accuracy measures
In a biometric system, we’re interested in the false match rate (FMR) and false non-match rate (FNMR) based on the similarity scores from face comparisons and searches. Whether it’s an onboarding or authentication use case, biometric systems decide to accept or reject matches of a user’s face based on the similarity score of two or more images. Like any decision system, there will be errors where the system incorrectly accepts or rejects an attempt at onboarding or authentication. As part of evaluating your identity verification solution, you need to evaluate the system at various similarity thresholds to minimize false match and false non-match rates, as well as contrast those errors against the cost of making incorrect rejections and acceptances. We use FMR and FNMR as our two key metrics to evaluate facial biometric systems.
False non-match rate
When the identity verification system fails to correctly identify or authorize a genuine user, a false non-match occurs, also known as a false negative. The false non-match rate (FNMR) is a measure of how prone the system is to incorrectly rejecting a genuine user.
The FNMR is expressed as a percentage of instances where an onboarding or authentication attempt is made, where the user’s face is incorrectly rejected (a false negative) because the similarity score is below the prescribed threshold.
A true positive (TP) is when the solution considers two or more images of the same person to be the same. That is, the similarity of the comparison or search is above the required similarity threshold.
A false negative (FN) is when the solution considers two or more images of the same person to be different. That is, the similarity of the comparison or search is below the required similarity threshold.
The formula for the FNMR is:
FNMR = False Negative Count / (True Positive Count + False Negative Count)
For example, suppose we have 10,000 genuine authentication attempts but 100 are denied because their similarity to the reference image or collection falls below the specified similarity threshold. Here we have 9,900 true positives and 100 false negatives, therefore our FNMR is 1.0%
FNMR = 100 / (9900 + 100) or 1.0%
False match rate
When an identity verification system incorrectly identifies or authorizes an unauthorized user as genuine, a false match occurs, also known as a false positive. The false match rate (FMR) is a measure of how prone the system is to incorrectly identifying or authorizing an unauthorized user. It’s measured by the number of false positive recognitions or authentications divided by the total number of identification attempts.
A false positive occurs when the solution considers two or more images of different people to be the same person. That is, the similarity score of the comparison or search is above the required similarity threshold. Essentially, the system incorrectly identifies or authorizes a user when it should have rejected their identification or authentication attempt.
The formula for the FMR is:
FMR = False Positive Count / (Total Attempts)
For example, suppose we have 100,000 authentication attempts but 100 bogus users are incorrectly authorized because their similarity to the reference image or collection falls above the specified similarity threshold. Here we have 100 false positives, therefore our FMR is 0.1%
FMR = 100 / (100,000) or 0.1%
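For reference, both formulas are easy to express as small helper functions. The following sketch simply reproduces the arithmetic from the two examples above.

```python
def fnmr(false_negatives: int, true_positives: int) -> float:
    """False non-match rate: share of genuine attempts that are rejected."""
    return false_negatives / (true_positives + false_negatives)

def fmr(false_positives: int, total_attempts: int) -> float:
    """False match rate: share of attempts that are incorrectly accepted."""
    return false_positives / total_attempts

print(f"{fnmr(100, 9900):.2%}")    # 1.00%, matching the FNMR example above
print(f"{fmr(100, 100_000):.2%}")  # 0.10%, matching the FMR example above
```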
False match rate vs. false non-match rate
False match rate and false non-match rate are at odds with each other. As the similarity threshold increases, the potential for a false match decreases, while the potential for a false non-match increases. Another way to think about this trade-off is that as the similarity threshold increases, the solution becomes more restrictive, making fewer low similarity matches. For example, it’s common for use cases involving public safety and security to set a match similarity threshold quite high (99 and above). Alternatively, an organization may choose a less restrictive similarity threshold (90 and above), where the impact of friction to the user is more important. The following diagram illustrates these trade-offs. The challenge for organizations is to find a threshold that minimizes both FMR and FNMR based on their organizational and application requirements.
Selecting a similarity threshold depends on the business application. For example, suppose you want to limit customer friction during onboarding (a less restrictive similarity threshold, as shown in the following figure on the left). Here you might have a lower required similarity threshold, and are willing to accept the risk of onboarding users where the confidence in the match between their selfie and driver’s license is lower. By contrast, suppose you want to ensure only authorized users get into an application. Here you might operate at a quite restrictive similarity threshold (as shown in the figure on the right).
Steps for calculating false match and non-match rates
There are several ways to calculate these two metrics. The following is a relatively simple approach: gather genuine image pairs, create an imposter pairing (images that shouldn’t match), and finally probe the expected match and non-match image pairs, capturing the resulting similarity. The steps are as follows:
- Gather a genuine sample image set. We recommend starting with a set of image pairs and assigning an external identifier, which is used to make an official match determination. The pair consists of the following images:
- Source image – Your trusted source image, for example a driver’s license.
- Target image – Your selfie or image you are going to compare with.
- Gather an image set of imposter matches. These are pairs of images where the source and target don’t match. This is used to assess the FMR (the probability that the system will incorrectly match the faces of two different users). You can create an imposter image set using the image pairs by creating a Cartesian product of the images then filtering and sampling the result.
- Probe the genuine and imposter match sets by looping over the image pairs, comparing the source and imposter target and capturing the resulting similarity.
- Calculate FMR and FNMR by calculating the false positives and false negatives at different minimum similarity thresholds.
You can assess the cost of FMR and FNMR at different similarity thresholds relative to your application’s need.
Step 1: Gather genuine image pair samples
Choosing a representative sample of image pairs to evaluate is critical when evaluating an identity verification service. The first step is to identify a genuine set of image pairs. These are known source and target images of a user. The genuine image pairing is used to assess the FNMR, essentially the probability that the system won’t match two faces of the same person. One of the first questions often asked is “How many image pairs are necessary?” The answer is that it depends on your use case, but the general guidance is the following:
- Between 100–1,000 image pairs provides a measure of feasibility
- Up to 10,000 image pairs is large enough to measure variability between images
- More than 10,000 image pairs provides a measure of operational quality and generalizability
More data is always better; as a starting point, use at least 1,000 image pairs. However, it’s not uncommon to use more than 10,000 image pairs to zero in on an acceptable FNMR or FMR for a given business problem.
The following is a sample image pair mapping file. We use the image pair mapping file to drive the rest of the evaluation process.
| EXTERNAL_ID | SOURCE | TARGET | TEST |
|---|---|---|---|
| 9055 | 9055_M0.jpeg | 9055_M1.jpeg | Genuine |
| 19066 | 19066_M0.jpeg | 19066_M1.jpeg | Genuine |
| 11396 | 11396_M0.jpeg | 11396_M1.jpeg | Genuine |
| 12657 | 12657_M0.jpeg | 12657_M1.jpeg | Genuine |
| … | … | … | … |
Step 2: Generate an imposter image pair set
Now that you have a file of genuine image pairs, you can create a Cartesian product of target and source images where the external identifiers don’t match. This produces source-to-target pairs that shouldn’t match. This pairing is used to assess the FMR, essentially the probability the system will match the face of one user to a face of a different user.
| EXTERNAL_ID | SOURCE | TARGET | TEST |
|---|---|---|---|
| 114192 | 114192_4M49.jpeg | 307107_00M17.jpeg | Imposter |
| 105300 | 105300_04F42.jpeg | 035557_00M53.jpeg | Imposter |
| 110771 | 110771_3M44.jpeg | 120381_1M33.jpeg | Imposter |
| 281333 | 281333_04F35.jpeg | 314769_01M17.jpeg | Imposter |
| 40081 | 040081_2F52.jpeg | 326169_00F32.jpeg | Imposter |
| … | … | … | … |
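A minimal pandas sketch of generating such an imposter set from the genuine pairs might look like the following. The file name and sample size are assumptions, not part of the accompanying notebook.

```python
import pandas as pd

# genuine_df holds the genuine pairs with the columns EXTERNAL_ID, SOURCE, TARGET, TEST
genuine_df = pd.read_csv("genuine_pairs.csv")  # hypothetical file name

# Cross join every source image with every target image
sources = genuine_df[["EXTERNAL_ID", "SOURCE"]]
targets = genuine_df[["EXTERNAL_ID", "TARGET"]].rename(columns={"EXTERNAL_ID": "TARGET_ID"})
cross = sources.merge(targets, how="cross")

# Keep only pairs whose identifiers differ (faces that should NOT match),
# then sample the result to keep the imposter set a manageable size
imposter_df = (
    cross[cross["EXTERNAL_ID"] != cross["TARGET_ID"]]
    .sample(n=1000, random_state=42)
    .drop(columns="TARGET_ID")
    .assign(TEST="Imposter")
)
```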
Step 3: Probe the genuine and imposter image pair sets
Using a driver program, we apply the Amazon Rekognition CompareFaces API over the image pairs and capture the similarity. You can also capture additional information like pose, quality, and other results of the comparison. The similarity scores are used to calculate the false match and non-match rates in the following step.
In the following code snippet, we apply the CompareFaces API to all the image pairs and populate all the similarity scores in a table:
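The notebook’s exact code isn’t reproduced here; the following is a minimal sketch of such a driver using boto3, where the bucket name is a placeholder and the genuine_df and imposter_df DataFrames come from the previous steps.

```python
import boto3
import pandas as pd

rekognition = boto3.client("rekognition")
BUCKET = "identity-verification-eval"  # hypothetical bucket holding the images

def compare_pair(source_key: str, target_key: str) -> float:
    """Return the highest similarity that CompareFaces finds between two images."""
    response = rekognition.compare_faces(
        SourceImage={"S3Object": {"Bucket": BUCKET, "Name": source_key}},
        TargetImage={"S3Object": {"Bucket": BUCKET, "Name": target_key}},
        SimilarityThreshold=0,  # return all face matches so low scores are captured too
    )
    matches = response.get("FaceMatches", [])
    return max((m["Similarity"] for m in matches), default=0.0)

# Combine the genuine and imposter pairs built in steps 1 and 2, then probe each pair
pairs_df = pd.concat([genuine_df, imposter_df], ignore_index=True)
pairs_df["SIMILARITY"] = [
    compare_pair(row.SOURCE, row.TARGET) for row in pairs_df.itertuples()
]
```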
The code snippet gives the following output.
| EXTERNAL_ID | SOURCE | TARGET | TEST | SIMILARITY |
|---|---|---|---|---|
| 9055 | 9055_M0.jpeg | 9055_M1.jpeg | Genuine | 98.3 |
| 19066 | 19066_M0.jpeg | 19066_M1.jpeg | Genuine | 94.3 |
| 11396 | 11396_M0.jpeg | 11396_M1.jpeg | Genuine | 96.1 |
| … | … | … | … | … |
| 114192 | 114192_4M49.jpeg | 307107_00M17.jpeg | Imposter | 0.0 |
| 105300 | 105300_04F42.jpeg | 035557_00M53.jpeg | Imposter | 0.0 |
| 110771 | 110771_3M44.jpeg | 120381_1M33.jpeg | Imposter | 0.0 |
Distribution analysis of the similarity scores by test set is a starting point for understanding the similarity scores of the image pairs. The following code snippet and output show a simple example of the distribution of similarity scores by test set, as well as the resulting descriptive statistics:
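A simple way to produce these statistics with pandas, assuming the similarity scores are in the pairs_df DataFrame from the previous step (a sketch, not the notebook’s exact code):

```python
# Descriptive statistics of the similarity scores, grouped by test set
summary = (
    pairs_df.groupby("TEST")["SIMILARITY"]
    .agg(["count", "min", "max", "mean", "median", "std"])
    .round(4)
)
print(summary)

# Histogram of the two distributions for a quick visual check
pairs_df.hist(column="SIMILARITY", by="TEST", bins=50)
```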
| test | count | min | max | mean | median | std |
|---|---|---|---|---|---|---|
| genuine | 204 | 0.2778 | 99.9957 | 91.7357 | 99.0961 | 19.9097 |
| imposter | 1020 | 0.0075 | 87.3893 | 2.8111 | 0.8330 | 7.3496 |
In this example, we can see that the mean and median similarity for genuine face pairs were 91.7 and 99.1, whereas for the imposter pairs they were 2.8 and 0.8, respectively. As expected, this shows high similarity scores for genuine image pairs and low similarity scores for imposter image pairs.
Step 4: Calculate FMR and FNMR at different similarity threshold levels
In this step, we calculate the false match and non-match rates at different thresholds of similarity. To do this, we simply loop through similarity thresholds (for example, 90–100). At each selected similarity threshold, we calculate our confusion matrix containing true positive, true negative, false positive, and false negative counts, which are used to calculate the FMR and FNMR at each selected similarity.
| Predicted \ Actual | Match | No-Match |
|---|---|---|
| >= selected similarity | TP | FP |
| < selected similarity | FN | TN |
To do this, we create a function that returns the false positive and negative counts, and loop through a range of similarity scores (90–100):
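A sketch of such a function and loop is shown below. Here the FNMR is computed over the genuine pairs and the FMR over the imposter pairs; adjust the denominators to your own convention.

```python
import pandas as pd

def confusion_counts(df: pd.DataFrame, threshold: float):
    """Count TP, FP, FN, and TN at a given similarity threshold."""
    predicted_match = df["SIMILARITY"] >= threshold
    genuine = df["TEST"] == "Genuine"
    tp = int((predicted_match & genuine).sum())
    fp = int((predicted_match & ~genuine).sum())
    fn = int((~predicted_match & genuine).sum())
    tn = int((~predicted_match & ~genuine).sum())
    return tp, fp, fn, tn

rows = []
for threshold in range(90, 100):
    tp, fp, fn, tn = confusion_counts(pairs_df, threshold)
    rows.append({
        "Similarity Threshold": threshold,
        "TN": tn, "FN": fn, "TP": tp, "FP": fp,
        "FNMR": fn / (tp + fn),  # share of genuine pairs rejected
        "FMR": fp / (fp + tn),   # share of imposter pairs accepted
    })
results = pd.DataFrame(rows)
print(results)
```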
The following table shows the results of the counts at each similarity threshold.
| Similarity Threshold | TN | FN | TP | FP | FNMR | FMR |
|---|---|---|---|---|---|---|
| 80 | 1019 | 22 | 182 | 1 | 0.1% | 0.1% |
| 85 | 1019 | 23 | 181 | 1 | 0.11% | 0.1% |
| 90 | 1020 | 35 | 169 | 0 | 0.12% | 0.0% |
| 95 | 1020 | 51 | 153 | 0 | 0.2% | 0.0% |
| 96 | 1020 | 53 | 151 | 0 | 0.25% | 0.0% |
| 97 | 1020 | 60 | 144 | 0 | 0.3% | 0.0% |
| 98 | 1020 | 75 | 129 | 0 | 0.4% | 0.0% |
| 99 | 1020 | 99 | 105 | 0 | 0.5% | 0.0% |
How does the similarity threshold impact false non-match rate?
Suppose we have 1,000 genuine user onboarding attempts, and we reject 10 of these attempts based on a required minimum similarity of 95% to be considered a match. Here we reject 10 genuine onboarding attempts (false negatives) because their similarity falls below the specified minimum required similarity threshold. In this case, our FNMR is 1.0%.
| Predicted \ Actual | Match | No-Match |
|---|---|---|
| >= 95% similarity | 990 | 0 |
| < 95% similarity | 10 | 0 |
| Total | 1,000 | 0 |
FNMR = False Negative Count / (True Positive Count + False Negative Count)
FNMR = 10 / (990 + 10) or 1.0%
By contrast, suppose that instead of having 1,000 genuine users to onboard, we have 990 genuine users and 10 imposters. At a 95% minimum similarity, suppose we accept all 1,000 users as genuine. Here the 10 imposters are false positives, and we would have a 1% FMR.
| Predicted \ Actual | Match | No-Match | Total |
|---|---|---|---|
| >= 95% similarity | 990 | 10 | 1,000 |
| < 95% similarity | 0 | 0 | 0 |
FMR = False Positive Count / (Total Attempts)
FMR = 10 / (1,000) or 1.0%
Assessing costs of FMR and FNMR at onboarding
In an onboarding use case, the cost of a false non-match (a rejection) is generally associated with additional user friction or loss of a registration. For example, in our banking use case, suppose Julie presents two images of herself but is incorrectly rejected at time of onboarding because the similarity between the two images falls below the selected similarity (a false non-match). The financial institution may risk losing Julie as a potential customer, or it may cause Julie additional friction by requiring her to perform steps to prove her identity.
Conversely, suppose the two images of Julie are of different people and Julie’s onboarding should have been rejected. In the case where Julie is incorrectly accepted (a false match), the cost and risk to the financial institution is quite different. There could be regulatory issues, risk of fraud, and other risks associated with financial transactions.
Responsible use
Artificial intelligence (AI) applied through machine learning (ML) will be one of the most transformational technologies of our generation, tackling some of humanity’s most challenging problems, augmenting human performance, and maximizing productivity. Responsible use of these technologies is key to fostering continued innovation. AWS is committed to developing fair and accurate AI and ML services and providing you with the tools and guidance needed to build AI and ML applications responsibly.
As you adopt and increase your use of AI and ML, AWS offers several resources based on our experience to assist you in the responsible development and use of AI and ML:
- Use cases that involve public safety
- AWS Service Terms
- Resources and tools provided by AWS on responsible use of AI and ML
Best practices and common mistakes to avoid
In this section, we discuss the following best practices:
- Use a large enough sample of images
- Avoid open-source and synthetic face datasets
- Avoid manual and synthetic image manipulation
- Check image quality at time of evaluation and over time
- Monitor FMR and FNMR over time
- Use a human in the loop review
- Stay up to date with Amazon Rekognition
Use a large enough sample of images
Use a large enough but reasonable sample of images. What is a reasonable sample size? It depends on the business problem. If you’re an employer and have 10,000 employees that you want to authenticate, then using all 10,000 images is probably reasonable. However, suppose you’re an organization with millions of customers that you want to onboard. In this case, taking a representative sample of customers such as 5,000–20,000 is probably sufficient. Here is some guidance on the sample size:
- A sample size of 100 – 1,000 image pairs proves feasibility
- A sample size of 1,000 – 10,000 image pairs is useful to measure variability between images
- A sample size of 10,000 – 1 million image pairs provides a measure of operational quality and generalizability
The key with sampling image pairs is to ensure that the sample provides enough variability across the population of faces in your application. You can further extend your sampling and testing to include demographic information like skin tone, gender, and age.
Avoid open-source and synthetic face datasets
There are dozens of curated open-source facial image datasets as well as astonishingly realistic synthetic face sets that are often used in research and to study feasibility. The challenge is that these datasets are generally not useful for 99% of real-world use cases simply because they aren’t representative of the cameras, faces, and quality of the images your application is likely to encounter in the wild. Although they’re useful for application development, the accuracy measures of these image sets don’t generalize to what you’ll encounter in your own application. Instead, we recommend starting with a representative sample of real images from your solution, even if the sample image pairs are small (under 1,000).
Avoid manual and synthetic image manipulation
There are often edge cases that people are interested in understanding. Things like image capture quality or obfuscations of specific facial features are always of interest. For example, we often get asked about the impact of age and image quality on facial recognition. You could simply synthetically age a face or manipulate the image to make the subject appear older, or manipulate the image quality, but this doesn’t translate well to real-world aging of images. Instead, our recommendation is to gather a representative sample of real-world edge cases you’re interested in testing.
Check image quality at time of evaluation and over time
Camera and application technology changes quite rapidly over time. As a best practice, we recommend monitoring image quality over time. From the size of faces captured (using bounding boxes), to the brightness and sharpness of an image, to the pose of a face, as well as potential obfuscations (hats, sunglasses, beards, and so on), all of these image and facial features change over time.
Monitor FNMR and FMR over time
Changes occur, whether it’s the images, the application, or the similarity thresholds used in the application. It’s important to periodically monitor false match and non-match rates over time. Changes in the rates (even subtle changes) can often point to upstream challenges with the application or how the application is being used. Changes to similarity thresholds and business rules used to make accept or reject decisions can have major impact on onboarding and authentication user experiences.
Use a human in the loop review
Identity verification systems make automated match and non-match decisions based on similarity thresholds and business rules. Besides regulatory and internal compliance requirements, an important process in any automated decision system is to utilize human reviewers as part of the ongoing monitoring of the decision process. Human oversight of these automated decisioning systems provides validation and continuous improvement as well as transparency into the automated decision-making process.
Stay up to date with Amazon Rekognition
The Amazon Rekognition face model is updated periodically (usually annually) and is currently on version 6. This updated version made important improvements to accuracy and indexing. It’s important to stay up to date with new model versions and understand how to use these new versions in your identity verification application. When new versions of the Amazon Rekognition face model are launched, it’s good practice to rerun your identity verification evaluation process and determine any potential impacts (positive and negative) to your false match and non-match rates.
Conclusion
This post discusses the key elements needed to evaluate the performance aspect of your identity verification solution in terms of various accuracy metrics. However, accuracy is only one of the many dimensions that you need to evaluate when choosing a particular identity verification service. It’s critical that you include other parameters, such as the service’s total feature set, ease of use, existing integrations, privacy and security, customization options, scalability implications, customer service, and pricing.
To learn more about identity verification in Amazon Rekognition, visit Identity Verification using Amazon Rekognition.
About the Authors
Mike Ames is a data scientist turned identity verification solution specialist, with extensive experience developing machine learning and AI solutions to protect organizations from fraud, waste, and abuse. In his spare time, you can find him hiking, mountain biking, or playing frisbee with his dog Max.
Amit Gupta is a Senior AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.
Zuhayr Raghib is an AI Services Solutions Architect at AWS. Specializing in applied AI/ML, he is passionate about enabling customers to use the cloud to innovate faster and transform their businesses.
Marcel Pividal is a Sr. AI Services Solutions Architect in the World-Wide Specialist Organization. Marcel has more than 20 years of experience solving business problems through technology for fintechs, payment providers, pharma, and government agencies. His current areas of focus are risk management, fraud prevention, and identity verification.
Amazon’s ML conference focuses on community and connections
Internal event designed to replicate external science conferences. Read More
NeurIPS 2022: Seven Microsoft Research Papers Selected for Oral Presentations
Microsoft is proud to be a platinum sponsor of the 36th annual conference on Neural Information Processing Systems (NeurIPS), which is widely regarded as the world’s most prestigious research conference on artificial intelligence and machine learning.
Microsoft has a strong presence at NeurIPS again this year, with more than 150 of our researchers participating in the conference and 122 of our research papers accepted. Our researchers are also taking part in 10 workshops, four competitions and a tutorial.
In one of the workshops, AI for Science: Progress and Promises, a panel of leading researchers will discuss how artificial intelligence and machine learning have the potential to advance scientific discovery. The panel will include two Microsoft researchers: Max Welling, Vice President and Distinguished Scientist, Microsoft Research AI4Science, who will serve as moderator, and Peter Lee, Corporate Vice President, Microsoft Research and Incubations.
Of the 122 Microsoft research papers accepted for the conference, seven have been selected for oral presentations during the virtual NeurIPS experience the week of December 4th. The oral presentations provide a deeper dive into each of the featured research topics.
In addition, two other Microsoft research papers received Outstanding Paper Awards for NeurIPS 2022. One of those papers, Gradient Estimation with Discrete Stein Operators, explains how researchers developed a gradient estimator that achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations, which has the potential to improve problem solving in machine learning. In the other paper, A Neural Corpus Indexer for Document Retrieval, researchers demonstrate that an end-to-end deep neural network that unifies training and indexing stages can significantly improve the recall performance of traditional document retrieval methods.
Below we have provided the titles, authors and abstracts for all seven of the Microsoft research papers chosen for oral presentations at NeurIPS, with links to additional information for those who want to explore the topics more fully:
Uni[MASK]: Unified Inference in Sequential Decision Problems
Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, Sam Devlin
Abstract: Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the UniMASK framework, which provides a unified way to specify models which can be trained on many different sequential decision-making tasks. We show that a single UniMASK model is often capable of carrying out many tasks with performance similar to or better than single-task models. Additionally, after fine tuning, our UniMASK models consistently outperform comparable single-task models.
K-LITE: Learning Transferable Visual Models with External Knowledge
Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao
Abstract: The new generation of state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, based on the broad concept coverage achieved through large-scale data collection process. Alternatively, we argue that learning with external knowledge about images is a promising way which leverages a much more structured source of supervision and offers sample efficiency.
In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is released at https://github.com/microsoft/klite.
Extreme Compression for Pre-trained Transformers Made Simple and Efficient
Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He
Abstract: Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constraint devices. However, to preserve the accuracy for such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. Also, they oftentimes focus less on smaller transformer models that have already been heavily compressed via knowledge distillation and lack a systematic study to show the effectiveness of their methods.
In this paper, we perform a very comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find out that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression.
Our simplified pipeline demonstrates that:
(1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, like TinyBERT;
(2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
On the Complexity of Adversarial Decision Making
Dylan J Foster, Alexander Rakhlin, Ayush Sekhari, Karthik Sridharan
Abstract: A central problem in online learning and decision making—from bandits to reinforcement learning—is to understand what modeling assumptions lead to sample-efficient learning guarantees. We consider a general adversarial decision-making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show—via new upper and lower bounds—that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. in the stochastic counterpart to our setting, is necessary and sufficient to obtain low regret for adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or, hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results, both positive and negative. En route to obtaining these guarantees, we provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy and the Exploration-by-Optimization objective of Lattimore and György.
Maximum Class Separation as Inductive Bias in One Matrix
Tejaswi Kasarla, Gertjan J. Burghouts, Max van Spengler, Elise van der Pol, Rita Cucchiara, Pascal Mettes
Abstract: Maximizing the separation between classes constitutes a well-known inductive bias in machine learning and a pillar of many traditional algorithms. By default, deep networks are not equipped with this inductive bias and therefore many alternative solutions have been proposed through differential optimization. Current approaches tend to optimize classification and separation jointly: aligning inputs with class vectors and separating class vectors angularly.
This paper proposes a simple alternative: encoding maximum separation as an inductive bias in the network by adding one fixed matrix multiplication before computing the softmax activations. The main observation behind our approach is that separation does not require optimization but can be solved in closed-form prior to training and plugged into a network. We outline a recursive approach to obtain the matrix consisting of maximally separable vectors for any number of classes, which can be added with negligible engineering effort and computational overhead. Despite its simple nature, this one matrix multiplication provides real impact. We show that our proposal directly boosts classification, long-tailed recognition, out-of-distribution detection, and open-set recognition, from CIFAR to ImageNet. We find empirically that maximum separation works best as a fixed bias; making the matrix learnable adds nothing to the performance. The closed-form implementation and code to reproduce the experiments are available on GitHub.
Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis
Tim Pearce, Jong-Hyeon Jeong, Yichen Jia, Jun Zhu
Abstract: This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterization of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimization of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimizes a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximization, and secondly that it exhibits a desirable 'self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
Learning (Very) Simple Generative Models Is Hard
Sitan Chen, Jerry Li, Yuanzhi Li
Abstract: Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem. For an unknown neural network \(F:\mathbb{R}^d\to\mathbb{R}^{d'}\), let \(D\) be the distribution over \(\mathbb{R}^{d'}\) given by pushing the standard Gaussian \(\mathcal{N}(0,\mathrm{Id}_d)\) through \(F\). Given i.i.d. samples from \(D\), the goal is to output any distribution close to \(D\) in statistical distance.
We show under the statistical query (SQ) model that no polynomial-time algorithm can solve this problem even when the output coordinates of \(F\) are one-hidden-layer ReLU networks with \(\log(d)\) neurons. Previously, the best lower bounds for this problem simply followed from lower bounds for supervised learning and required at least two hidden layers and \(\mathrm{poly}(d)\) neurons [Daniely-Vardi ’21, Chen-Gollakota-Klivans-Meka ’22].
The key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function \(f\) with polynomially-bounded slopes such that the pushforward of \(\mathcal{N}(0,1)\) under \(f\) matches all low-degree moments of \(\mathcal{N}(0,1)\).
The post NeurIPS 2022: Seven Microsoft Research Papers Selected for Oral Presentations appeared first on Microsoft Research.
AI at the Point of Care: Startup’s Portable Scanner Diagnoses Brain Stroke in Minutes
For every minute that a stroke is left untreated, the average patient loses nearly 2 million neurons. This means that for each hour in which treatment fails to occur, the brain loses as many neurons as it does in more than three and a half years of normal aging.
With one of the world’s first portable brain scanners for stroke diagnosis, Australia-based healthcare technology developer EMVision is on a mission to enable quicker triage and treatment to reduce such devastating impacts.
The NVIDIA Inception member’s EMVision device fits like a helmet and can be used at the point of care and in ambulances for prehospital stroke diagnosis. It relies on electromagnetic imaging technology and uses NVIDIA-powered AI to distinguish between ischaemic and haemorrhagic strokes — clots and bleeds — in just minutes.
A cart-based version of the device, built using the NVIDIA Jetson edge AI platform and NVIDIA DGX systems, can also help with routine monitoring of a patient post-intervention to inform their progress and recovery.
“With EMVision, the healthcare community can access advanced, portable solutions that will assist in making critical decisions and interventions earlier, when time is of the essence,” said Ron Weinberger, CEO of EMVision. “This means we can provide faster stroke diagnosis and treatment to ensure fewer disability outcomes and an improved quality of life for patients.”
Point-of-Care Diagnosis
Traditional neuroimaging techniques, like CT scans and MRIs, produce excellent images but require large, stationary, complex machines and specialist operators, Weinberger said. This limits point-of-care accessibility.
The EMVision device is designed to scan the brain wherever the patient may be — in an ambulance or even at home if monitoring a patient who has a history of stroke.
“Whether for a new, acute stroke or a complication of an existing stroke, urgent brain imaging is required before correct triage, treatment or intervention decisions can be made,” Weinberger said.
The startup has developed and validated novel electromagnetic brain scanner hardware and AI algorithms capable of classifying and localizing a stroke, as well as creating an anatomical reconstruction of the patient’s brain.
“NVIDIA accelerated computing has played an important role in the development of EMVision’s technology, from hardware verification and algorithm development to rapid image reconstruction and AI-powered decision making,” Weinberger said. “With NVIDIA’s support, we are set to transform stroke diagnosis and care for patients around the world.”
EMVision uses NVIDIA DGX for hardware verification and optimization, as well as for prototyping and training AI models. EMVision has trained its AI models 10x faster using NVIDIA DGX compared with other systems, according to Weinberger.
Each brain scanner has an NVIDIA Jetson AGX Xavier module on board for energy-efficient AI inference at the edge. And the startup is looking to use NVIDIA Jetson Orin Nano modules for next-generation edge AI.
“The interactions between low-energy electromagnetic signals and brain tissue are incredibly complex,” Weinberger said. “Making sense of these signal interactions to identify if pathologies are present and recreate quality images wouldn’t be possible without the massive power of NVIDIA GPU-accelerated computing.”
As a member of NVIDIA Inception, a free, global program for cutting-edge startups, EMVision has shortened product development cycles and go-to-market time, Weinberger added.
Subscribe to NVIDIA healthcare news and learn more about NVIDIA Inception.
The post AI at the Point of Care: Startup’s Portable Scanner Diagnoses Brain Stroke in Minutes appeared first on NVIDIA Blog.
Personalized federated learning for a better customer experience
Accounting for data heterogeneity across edge devices enables more useful model updates, both locally and globally. Read More
Improve explainability of ML models to meet regulatory requirements
Learn about the development, operational, and process improvements that can be incorporated by organizations to improve the explainability of models while adhering to regulatory requirements. Read More
Introducing one-step classification and entity recognition with Amazon Comprehend for intelligent document processing
“Intelligent document processing (IDP) solutions extract data to support automation of high-volume, repetitive document processing tasks and for analysis and insight. IDP uses natural language technologies and computer vision to extract data from structured and unstructured content, especially from documents, to support automation and augmentation.” – Gartner
The goal of Amazon’s intelligent document processing (IDP) is to automate the processing of large amounts of documents using machine learning (ML) in order to increase productivity, reduce costs associated with human labor, and provide a seamless user experience. Customers spend a significant amount of time and effort identifying documents and extracting critical information from them for various use cases. Today, Amazon Comprehend supports classification for plain text documents, which requires you to preprocess documents in semi-structured formats (scanned, digital PDF or images such as PNG, JPG, TIFF) and then use the plain text output to run inference with your custom classification model. Similarly, for custom entity recognition in real time, preprocessing to extract text is required for semi-structured documents such as PDF and image files. This two-step process introduces complexities in document processing workflows.
Last year, we announced support for native document formats with custom named entity recognition (NER) asynchronous jobs. Today, we are excited to announce one-step document classification and real-time analysis for NER for semi-structured documents in native formats (PDF, TIFF, JPG, PNG) using Amazon Comprehend. Specifically, we are announcing the following capabilities:
- Support for documents in native formats for custom classification real-time analysis and asynchronous jobs
- Support for documents in native formats for custom entity recognition real-time analysis
With this new release, Amazon Comprehend custom classification and custom entity recognition (NER) supports documents in formats such as PDF, TIFF, PNG, and JPEG directly, without the need to extract UTF8 encoded plain text from them. The following figure compares the previous process to the new procedure and support.
This feature simplifies document processing workflows by eliminating any preprocessing steps required to extract plain text from documents, and reduces the overall time required to process them.
In this post, we discuss a high-level IDP workflow solution design, a few industry use cases, the new features of Amazon Comprehend, and how to use them.
Overview of solution
Let’s start by exploring a common use case in the insurance industry. A typical insurance claim process involves a claim package that may contain multiple documents. When an insurance claim is filed, it includes documents like insurance claim forms, incident reports, identity documents, and third-party claim documents. The volume of documents to process and adjudicate an insurance claim can run up to hundreds and even thousands of pages depending on the type of claim and business processes involved. Insurance claim representatives and adjudicators typically spend hundreds of hours manually sifting, sorting, and extracting information from hundreds or even thousands of claim filings.
Similar to the insurance industry use case, the payment industry also processes large volumes of semi-structured documents for cross-border payment agreements, invoices, and forex statements. Business users spend the majority of their time on manual activities such as identifying, organizing, validating, extracting, and passing required information to downstream applications. This manual process is tedious, repetitive, error prone, expensive, and difficult to scale. Other industries that face similar challenges include mortgage and lending, healthcare and life sciences, legal, accounting, and tax management. It is extremely important for businesses to process such large volumes of documents in a timely manner with a high level of accuracy and nominal manual effort.
Amazon Comprehend provides key capabilities to automate document classification and information extraction from a large volume of documents with high accuracy, in a scalable and cost-effective way. The following diagram shows an IDP logical workflow with Amazon Comprehend. The core of the workflow consists of document classification and information extraction using NER with Amazon Comprehend custom models. The diagram also demonstrates how the custom models can be continuously improved to provide higher accuracies as documents and business processes evolve.
Custom document classification
With Amazon Comprehend custom classification, you can organize your documents into predefined categories (classes). At a high level, the following are the steps to set up a custom document classifier and perform document classification:
- Prepare training data to train a custom document classifier.
- Train a custom document classifier with the training data.
- After the model is trained, optionally deploy a real-time endpoint.
- Perform document classification with either an asynchronous job or in real time using the endpoint.
Steps 1 and 2 are typically done at the beginning of an IDP project after the document classes relevant to the business process are identified. A custom classifier model can then be periodically retrained to improve accuracy and introduce new document classes. You can train a custom classification model either in multi-class mode or multi-label mode. Training can be done in one of two ways: using a CSV file or using an augmented manifest file. Refer to Preparing training data for more details on training a custom classification model. After a custom classifier model is trained, a document can be classified either using real-time analysis or an asynchronous job. Real-time analysis requires an endpoint to be deployed with the trained model and is best suited for small documents depending on the use case. For a large number of documents, an asynchronous classification job is best suited.
Train a custom document classification model
To demonstrate the new feature, we trained a custom classification model in multi-label mode, which can classify insurance documents into one of seven different classes. The classes are INSURANCE_ID, PASSPORT, LICENSE, INVOICE_RECEIPT, MEDICAL_TRANSCRIPTION, DISCHARGE_SUMMARY, and CMS1500. We want to classify sample documents in native PDF, PNG, and JPEG format, stored in an Amazon Simple Storage Service (Amazon S3) bucket, using the classification model. To start an asynchronous classification job, complete the following steps:
- On the Amazon Comprehend console, choose Analysis jobs in the navigation pane.
- Choose Create job.
- For Name, enter a name for your classification job.
- For Analysis type, choose Custom classification.
- For Classifier model, choose the appropriate trained classification model.
- For Version, choose the appropriate model version.
In the Input data section, we provide the location where our documents are stored.
- For Input format, choose One document per file.
- For Document read mode, choose Force document read action.
- For Document read action, choose Textract detect document text.
This enables Amazon Comprehend to use the Amazon Textract DetectDocumentText API to read the documents before running the classification. The DetectDocumentText API is helpful in extracting lines and words of text from the documents. You may also choose Textract analyze document for Document read action, in which case Amazon Comprehend uses the Amazon Textract AnalyzeDocument API to read the documents. With the AnalyzeDocument API, you can choose to extract Tables, Forms, or both. The Document read mode option enables Amazon Comprehend to extract the text from documents behind the scenes, which eliminates the extra text extraction step otherwise required in the document processing workflow.
The Amazon Comprehend custom classifier can also process raw JSON responses generated by the DetectDocumentText and AnalyzeDocument APIs, without any modification or preprocessing. This is useful for existing workflows where Amazon Textract is already involved in extracting text from the documents. In this case, the JSON output from Amazon Textract can be fed directly to the Amazon Comprehend document classification APIs.
- In the Output data section, for S3 location, specify an Amazon S3 location where you want the asynchronous job to write the results of the inference.
- Leave the remaining options as default.
- Choose Create job to start the job.
You can view the status of the job on the Analysis jobs page.
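If you prefer to start the same job programmatically, the following boto3 sketch shows an equivalent call; the ARNs, bucket paths, and job name are placeholders.

```python
import boto3

comprehend = boto3.client("comprehend")

# The ARNs, S3 paths, and job name below are placeholders
response = comprehend.start_document_classification_job(
    JobName="insurance-doc-classification",
    DocumentClassifierArn=(
        "arn:aws:comprehend:us-east-1:111122223333:"
        "document-classifier/insurance-classifier/version/v1"
    ),
    InputDataConfig={
        "S3Uri": "s3://my-bucket/classification/input/",
        "InputFormat": "ONE_DOC_PER_FILE",
        "DocumentReaderConfig": {
            "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
            "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
        },
    },
    OutputDataConfig={"S3Uri": "s3://my-bucket/classification/output/"},
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccessRole",
)
print(response["JobId"])  # track this job ID on the Analysis jobs page or via DescribeDocumentClassificationJob
```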
When the job is complete, we can view the output of the analysis job, which is stored in the Amazon S3 location provided during the job configuration. The classification output for our single-page PDF sample CMS1500 document is as follows. The output is a file in JSON lines format, which has been formatted to improve readability.
The preceding sample is a single-page PDF document; however, custom classification can also handle multi-page PDF documents. In the case of multi-page documents, the output contains multiple JSON lines, where each line is the classification result of each of the pages in a document. The following is a sample multi-page classification output:
Custom entity recognition
With an Amazon Comprehend custom entity recognizer, you can analyze documents and extract entities like product codes or business-specific entities that fit your particular needs. At a high level, the following are the steps to set up a custom entity recognizer and perform entity detection:
- Prepare training data to train a custom entity recognizer.
- Train a custom entity recognizer with the training data.
- After the model is trained, optionally deploy a real-time endpoint.
- Perform entity detection with either an asynchronous job or in real time using the endpoint.
A custom entity recognizer model can be periodically retrained to improve accuracy and to introduce new entity types. You can train a custom entity recognizer model with either entity lists or annotations. In both cases, Amazon Comprehend learns about the kind of documents and the context where the entities occur to build an entity recognizer model that can generalize to detect new entities. Refer to Preparing the training data to learn more about preparing training data for custom entity recognizer.
After a custom entity recognizer model is trained, entity detection can be done either using real-time analysis or an asynchronous job. Real-time analysis requires an endpoint to be deployed with the trained model and is best suited for small documents depending on the use case. For a large number of documents, an asynchronous classification job is best suited.
Train a custom entity recognition model
To demonstrate the entity detection in real time, we trained a custom entity recognizer model with insurance documents and augmented manifest files using custom annotations and deployed the endpoint using the trained model. The entity types are Law Firm, Law Office Address, Insurance Company, Insurance Company Address, Policy Holder Name, Beneficiary Name, Policy Number, Payout, Required Action, and Sender. We want to detect entities from sample documents in native PDF, PNG, and JPEG format, stored in an S3 bucket, using the recognizer model.
Note that you can use a custom entity recognition model that is trained with PDF documents to extract custom entities from PDF, TIFF, image, Word, and plain text documents. If your model is trained using text documents and an entity list, you can only use plain text documents to extract the entities.
To analyze a sample document in native PDF, PNG, or JPEG format in real time, complete the following steps:
- On the Amazon Comprehend console, choose Real-time analysis in the navigation pane.
- Under Analysis type, select Custom.
- For Custom entity recognition, choose the custom model type.
- For Endpoint, choose the real-time endpoint that you created for your entity recognizer model.
- Select Upload file and choose Choose File to upload the PDF or image file for inference.
- Expand the Advanced document input section and for Document read mode, choose Service default.
- For Document read action, choose Textract detect document text.
- Choose Analyze to analyze the document in real time.
The recognized entities are listed in the Insights section. Each entity contains the entity value (the text), the type of entity as you defined it during the training process, and the corresponding confidence score.
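The same real-time analysis can also be performed programmatically. The following boto3 sketch sends a native PDF directly to the custom endpoint; the endpoint ARN and file name are placeholders.

```python
import boto3

comprehend = boto3.client("comprehend")

# Placeholder endpoint ARN for the deployed custom entity recognizer
ENDPOINT_ARN = (
    "arn:aws:comprehend:us-east-1:111122223333:"
    "entity-recognizer-endpoint/insurance-ner-endpoint"
)

with open("claim-demand-letter.pdf", "rb") as document:  # hypothetical sample document
    response = comprehend.detect_entities(
        EndpointArn=ENDPOINT_ARN,
        Bytes=document.read(),  # the PDF/PNG/JPEG is sent directly; no text extraction needed
        DocumentReaderConfig={
            "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
            "DocumentReadMode": "SERVICE_DEFAULT",
        },
    )

for entity in response["Entities"]:
    print(f'{entity["Type"]}: {entity["Text"]} (score {entity["Score"]:.2f})')
```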
For more details and a complete walkthrough on how to train a custom entity recognizer model and use it to perform asynchronous inference using asynchronous analysis jobs, refer to Extract custom entities from documents in their native format with Amazon Comprehend.
Conclusion
This post demonstrated how you can classify and categorize semi-structured documents in their native format and detect business-specific entities from them using Amazon Comprehend. You can use real-time APIs for low-latency use cases, or use asynchronous analysis jobs for bulk document processing.
As a next step, we encourage you to visit the Amazon Comprehend GitHub repository for full code samples to try out these new features. You can also visit the Amazon Comprehend Developer Guide and Amazon Comprehend developer resources for videos, tutorials, blogs, and more.
About the authors
Wrick Talukdar is a Senior Architect with the Amazon Comprehend Service team. He works with AWS customers to help them adopt machine learning on a large scale. Outside of work, he enjoys reading and photography.
Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.
Godwin Sahayaraj Vincent is an Enterprise Solutions Architect at AWS who is passionate about machine learning and providing guidance to customers to design, deploy, and manage their AWS workloads and architectures. In his spare time, he loves to play cricket with his friends and tennis with his three kids.