Introducing Amazon Textract Bulk Document Uploader for enhanced evaluation and analysis

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. To make it simpler to evaluate the capabilities of Amazon Textract, we have launched a new Bulk Document Uploader feature on the Amazon Textract console that enables you to quickly process your own set of documents without writing any code.

In this post, we walk through when and how to use the Amazon Textract Bulk Document Uploader to evaluate how Amazon Textract performs on your documents.

Overview of solution

The Bulk Document Uploader should be used for quick evaluation of Amazon Textract for predetermined use cases. By uploading multiple documents simultaneously through an intuitive UI, you can easily gauge how well Amazon Textract performs on your documents.

You can upload and process up to 150 documents at once. Unlike the existing Amazon Textract console demos, which impose artificial limits on the number of documents, document size, and maximum number of pages, the Bulk Document Uploader has the same document size and page limits as the Amazon Textract APIs, which makes it more efficient for you to evaluate a larger set of documents.

The Bulk Document Uploader outputs a standard Amazon Textract JSON response for easy programmatic analysis, along with a human-readable CSV file that includes confidence scores for simple comparison and evaluation of the extracted information.

When using this feature, keep in mind the following:

  • The Bulk Document Uploader processes documents via asynchronous operations. You can track the status of the processing on the Amazon Textract console. Only DetectDocumentText (OCR), AnalyzeDocument (Tables, Queries, Forms, and Signatures), and AnalyzeExpense APIs are currently supported.
  • The Bulk Document Uploader provides JSON results of the API operations and formatted CSV reports. You may need to rely on external tools for visualization of the data, such as displaying bounding box highlights on the document using the JSON results.
  • Using this feature to process documents incurs the same charges as regular Amazon Textract usage (depending on which feature is used), and is subject to the TPS (transactions per second) limits for APIs that are set for the account and Region. For more information on pricing, refer to Amazon Textract pricing. To learn more about Amazon Textract limits, refer to Quotas in Amazon Textract.
  • Accepted file formats for bulk uploader are JPEG, PNG, TIF, and PDF. JPEG 2000-encoded images within PDFs are also supported. JPEG and PNG files have a 10 MB size limit, whereas PDF and TIF files have a 500 MB size limit. Multi-page PDF and TIF files have a 3,000 page limit.

Use the Bulk Document Uploader

The Bulk Document Uploader is intended to help you quickly evaluate how Amazon Textract performs on a set of your own documents, without needing to write any code. You can use the Bulk Document Uploader to process as many as 150 documents instead of uploading and processing documents individually. You can bulk upload documents directly from your computer or import documents from an existing Amazon Simple Storage Service (Amazon S3) bucket.

The Bulk Document Uploader provides results that you can download later for offline review. Each downloadable ZIP file contains the Amazon Textract API response in JSON file format and a human-readable CSV file of the output containing the extracted data and confidence scores. The output results are available for download for 7 days after processing. After 14 days, documents are cleared from the Submitted documents section. To use the Bulk Document Uploader, complete the following steps:

  1. On the Amazon Textract console, under Demos in the navigation pane, choose Bulk Document Uploader.
  2. Choose Upload documents.
  3. Specify the source of your documents.

You have two options to upload documents:

  • Import documents from S3 bucket – If you’re using an S3 bucket for your documents, provide the bucket URL and (optionally) the prefix where your documents reside, in s3://your-bucket/prefix/ format. Alternatively, choose Browse S3 to browse and select the desired location of your documents. If the Amazon S3 location you specified contains more than 150 documents, then only the first 150 documents will be sent to Amazon Textract for processing.
  • Upload documents from your computer – If you’re uploading documents from your computer, you can upload up to 50 documents at a time by choosing Upload Documents. To upload additional documents (up to the maximum of 150), choose Add documents after your initial documents are uploaded.

In this case, your documents are first uploaded to an S3 bucket in your account that is created on your behalf, so make sure you have permissions to access and upload documents to Amazon S3. This is a one-time action, and the same bucket is used for all subsequent uploads from your computer. If you want to upload and process the same set of documents again, you can point to this bucket using the Import documents from S3 bucket option. The bucket name is visible on the console after the bucket is created.

  4. Next, specify the Amazon Textract feature you want to use to process your documents.

You may select only one feature at a time to process your documents. If you need to evaluate additional features, you must create a separate request by selecting the desired feature and uploading the documents again. If the AnalyzeDocument – Queries feature is selected, you need to provide the queries you want to test against your documents. You can specify up to 30 queries at a time. If the uploaded documents contain multi-page (PDF or TIF) files, queries are only applied to the first page of each document. Refer to Best Practices for Queries to learn about how to construct queries.
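The console workflow requires no code, but if you later want to reproduce a Queries evaluation programmatically, the equivalent API call is AnalyzeDocument with the QUERIES feature type. The following sketch uses boto3; the file name and query text are placeholders, not values from this post.

    import boto3

    textract = boto3.client("textract")

    # Placeholder file name; the synchronous AnalyzeDocument API accepts
    # single-page images (you can also point at a document in S3 instead).
    with open("sample-invoice.png", "rb") as f:
        document_bytes = f.read()

    response = textract.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["QUERIES"],
        QueriesConfig={
            "Queries": [
                {"Text": "What is the invoice number?"},  # example query text
                {"Text": "What is the total amount due?"},
            ]
        },
    )

    # QUERY_RESULT blocks hold the extracted answers and confidence scores.
    for block in response["Blocks"]:
        if block["BlockType"] == "QUERY_RESULT":
            print(f'{block["Confidence"]:.1f}%  {block["Text"]}')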

  5. Choose Start processing to submit the documents to Amazon Textract for processing.

You can track the document status and download the output results of processed documents in the Submitted documents section. This section updates periodically, and you can manually refresh it to see if the processing is complete. Each document is processed individually, so you can either select the document with Ready to download status or wait for all documents to complete processing to download the results. The output of the processed documents will remain available for up to 7 days for download, after which they will expire. Expired documents will be cleared from the Submitted documents section after 7 additional days (14 days from the processed date). We suggest downloading and preserving the outputs within the 7-day period.
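If you want to go beyond the CSV during offline review, the JSON in each downloaded ZIP can be inspected with a few lines of code. The following sketch (the file name is a placeholder) prints each detected line of text with its confidence score from a DetectDocumentText-style response:

    import json

    # Path to a JSON file extracted from the downloaded ZIP (placeholder name).
    with open("document-1.json") as f:
        response = json.load(f)

    # Textract responses are a flat list of Block objects; LINE blocks carry
    # the detected text and a confidence score.
    for block in response.get("Blocks", []):
        if block["BlockType"] == "LINE":
            print(f'{block["Confidence"]:.1f}%  {block["Text"]}')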

Conclusion

In this post, we announced the new Amazon Textract Bulk Document Uploader feature, which allows you to quickly process a large number of documents for evaluation purposes. You can use this feature to evaluate Amazon Textract for a predetermined use case with your documents. To learn more about how you can use Amazon Textract in your intelligent document processing workload, visit Amazon Textract features and Getting started with Amazon Textract.


About the Authors

Shashwat Sapre is a Senior Technical Product Manager with the Amazon Textract team. He is focused on building machine learning-based services for AWS customers. In his spare time, he likes reading about new technologies, traveling, and exploring different cuisines.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.


Highlights from CHI 2023

The ways in which people are able to interact with technologies can have a profound effect on a technology’s utility and adoptability. Building computing tools and services around people’s natural styles of work, communication, and play can give technology the value it needs to have meaningful impact. For decades, human-computer interaction (HCI) has examined the relationship between people and computers to help maximize the capabilities of each across a range of experiences and situations.

The ACM CHI Conference on Human Factors in Computing Systems (CHI) is a renowned meeting ground for top talent in the HCI field and a showcase for some of its most compelling work. Hosted April 23 through April 28, this year’s conference drew more than 4,500 participants from 79 countries. Contributions from Microsoft researchers and their collaborators demonstrated the breadth of work inspired by the myriad and diverse ways people use computing today and will in the future.

Check out a few highlights from this year’s conference below, including researchers’ efforts to better understand the role of wellbeing in work, to augment memory through our sense of smell, and to bridge the gap between programmers and code-generating models, work that received an honorable mention at the conference.

“What It Wants Me To Say”: Bridging the Abstraction Gap Between End-User Programmers and Code-Generating Large Language Models
CHI 2023 Honorable Mention

Michael Xieyang Liu, Advait Sarkar, Carina Negreanu, Ben Zorn, Jack Williams, Neil Toronto, Andy Gordon

Programming languages are an extremely powerful form of user interface. They also happen to be extremely difficult to learn, especially for non-expert end-user programmers who lack training in computing. What if end-user programmers could instead use a natural language they already know? This prospect can be realized through large language models (LLMs): deep neural networks using the transformer architecture, trained on large corpora, and fine-tuned to generate code from natural language. Despite impressive benchmark performance, LLMs are beset with issues in practical use. Lab and field studies have shown that the mapping between natural language and code is poorly understood, that generated code can contain subtle bugs, and that generated code can be difficult to verify.

In their paper, researchers consider the specific problem of abstraction matching: when the user has well-formed intent, how do they select an utterance from the near infinite space of naturalistic utterances that they believe the system will reliably map to a satisfactory solution? This involves “matching” the utterance to the right level of “abstraction” by specifying the utterance at a level of granularity and detail that matches the set of actions the system can take and selecting suitable words and grammar.

Workplace Rhythm Variability and Emotional Distress in Information Workers

Subigya Kumar Nepal, Javier Hernandez, Judith Amores, Mehrab Bin Morshed, Robert Lewis, Hemma Prafullchandra, Mary Czerwinski

Regularity in daily activities has been linked to positive wellbeing outcomes, but previous studies have mainly focused on clinical populations and traditional daily activities such as sleep and exercise. This research extends prior work by examining the regularity of both self-reported and digital activities of 49 information workers in a four-week naturalistic study. Findings suggest that greater variability in self-reported mood, job demands, lunch time, and sleep quality may be associated with increased stress, anxiety, and depression. However, when it comes to digital activity–based measures, greater variability in rhythm is associated with reduced emotional distress. This study expands our understanding of workers and the potential insights that can be gained from analyzing technology interactions and wellbeing.


Olfactory Wearables for Targeted Memory Reactivation

Judith Amores, Nirmita Mehra, Bjoern Rasch, Pattie Maes

This paper investigates how a smartphone-controlled olfactory wearable might improve memory recall. Researchers conducted a within-subjects experiment with 32 participants using the device and not using the device (control). In the experimental condition, bursts of odor were released during visuo-spatial memory navigation tasks, which also had a language learning component, and rereleased during sleep the following night in the subjects’ home. The researchers found that compared with control, there was an improvement in memory performance when using the scent wearable in memory tasks that involved walking in a physical space. Furthermore, participants recalled more objects and translations when re-exposed to the same scent during the recall test in addition to during sleep. These effects were statistically significant, and in the object recall task, they also persisted for more than a week. This experiment demonstrates a potential practical application of olfactory interfaces that can interact with a user during wake, as well as sleep, to support memory.

AdHocProx: Sensing Mobile, Ad-Hoc Collaborative Device Formations using Dual Ultra-Wideband Radios

Richard Li, Teddy Seyed, Nicolai Marquardt, Eyal Ofek, Steve Hodges, Mike Sinclair, Hugo Romat, Michel Pahud, Jatin Sharma, William A. S. Buxton, Ken Hinckley, Nathalie Henry Riche

In their paper, researchers present AdHocProx, a system that uses device-relative, inside-out sensing to augment co-located collaboration across multiple devices without recourse to externally anchored beacons or even reliance on Wi-Fi connectivity.

AdHocProx achieves this via sensors, including dual ultra-wideband (UWB) radios for sensing distance and angle to other devices in dynamic, ad-hoc arrangements and capacitive grip to determine where the user’s hands hold the device and to partially correct for the resulting UWB signal attenuation. All spatial sensing and communication take place via the side-channel capability of the UWB radios, suitable for small-group collaboration across up to four devices (eight UWB radios).

Together, these sensors detect proximity and natural, socially meaningful device movements to enable contextual interaction techniques. Researchers find that AdHocProx can obtain 95 percent accuracy recognizing various ad-hoc device arrangements in an offline evaluation, with participants particularly appreciative of interaction techniques that automatically leverage proximity-awareness and relative orientation among multiple devices.

Escapement: A Tool for Interactive Prototyping with Video via Sensor-Mediated Abstraction of Time

Molly Jane Nicholas, Nicolai Marquardt, Michel Pahud, Nathalie Henry Riche, Hugo Romat, Christopher Collins, David Ledo, Rohan Kadekodi, Badrish Chandramouli, Ken Hinckley

This paper introduces Escapement, a video prototyping tool that introduces a powerful new concept for prototyping screen-based interfaces by flexibly mapping sensor values to dynamic playback control of videos. This recasts the time dimension of video mockups as sensor-mediated interaction.

This abstraction of time as interaction, which the researchers dub video-escapement prototyping, empowers designers to rapidly explore and viscerally experience direct touch or sensor-mediated interactions across one or more device displays. The system affords cross-device and bidirectional remote (telepresent) experiences via cloud-based state sharing across multiple devices. This makes Escapement especially potent for exploring multi-device, dual-screen, or remote-work interactions for screen-based applications. Researchers share the results of observations of long-term usage of video-escapement techniques with experienced interaction designers and articulate design choices for supporting a reflective, iterative, and open-ended creative design process.

Your Mileage May Vary: Case Study of a Robotic Telepresence Pilot Roll-out for a Hybrid Knowledge Work Organization

Andriana Boudouraki, Joel E. Fischer, Stuart Reeves, Sean Rintel

Organizations wishing to maintain employee satisfaction for hybrid collaboration need to explore flexible solutions that provide value for both remote and on-site employees. This case study reports on the roll-out of a telepresence robot pilot at Microsoft Research Cambridge to test whether robots would provide enjoyable planned and unplanned encounters between remote and on-site employees. Researchers describe the work that was undertaken to prepare for the roll-out, including the occupational health and safety assessment, systems for safety and security, and the information for employees on safe and effective use practices. The pilot ended after three months, and robot use has been discontinued after weighing the opportunities against low adoption and other challenges. The researchers discuss the pros and cons within this organizational setting and make suggestions for future work and roll-outs.

Focus Time for Wellbeing and Work Engagement of Information Workers 

Koustuv Saha, Shamsi Iqbal 

Having little time for focused work is a major challenge in information work. While research has explored computing-assisted user-facing solutions for protecting time for focused work, there’s limited empirical evidence about the effectiveness of these features on wellbeing and work engagement. Toward this problem, researchers study the effects of automatically scheduling time for focused work on people’s work calendars using the “focus time” feature on Outlook calendars. The researchers conducted an experimental study over six weeks with 15 treatment and 10 control participants, who responded to survey questions on wellbeing and work engagement throughout the study. The researchers found that the treatment participants showed higher wellbeing, including increased excitement, relaxation, and satisfaction, and decreased anger, frustration, tiredness, and stress. The researchers study the needs, benefits, and challenges of scheduling focus time and discuss the importance of and design recommendations for enabling mechanisms and tools supporting focused work.



Consensus and subjectivity of skin tone annotation for ML fairness

Skin tone is an observable characteristic that is subjective, perceived differently by individuals (e.g., depending on their location or culture) and thus is complicated to annotate. That said, the ability to reliably and accurately annotate skin tone is highly important in computer vision. This became apparent in 2018, when the Gender Shades study highlighted that computer vision systems struggled to detect people with darker skin tones, and performed particularly poorly for women with darker skin tones. The study highlights the importance for computer researchers and practitioners to evaluate their technologies across the full range of skin tones and at intersections of identities. Beyond evaluating model performance on skin tone, skin tone annotations enable researchers to measure diversity and representation in image retrieval systems, dataset collection, and image generation. For all of these applications, a collection of meaningful and inclusive skin tone annotations is key.

Monk Skin Tone (MST) Scale. See more at skintone.google.

Last year, in a step toward more inclusive computer vision systems, Google’s Responsible AI and Human-Centered Technology team in Research partnered with Dr. Ellis Monk to openly release the Monk Skin Tone (MST) Scale, a skin tone scale that captures a broad spectrum of skin tones. In comparison to an industry standard scale like the Fitzpatrick Skin-Type Scale designed for dermatological use, the MST offers a more inclusive representation across the range of skin tones and was designed for a broad range of applications, including computer vision.

Today we’re announcing the Monk Skin Tone Examples (MST-E) dataset to help practitioners understand the MST scale and train their human annotators. This dataset has been made publicly available to enable practitioners everywhere to create more consistent, inclusive, and meaningful skin tone annotations. Along with this dataset, we’re providing a set of recommendations, noted below, around the MST scale and MST-E dataset so we can all create products that work well for all skin tones.

Since we launched the MST, we’ve been using it to improve Google’s computer vision systems to make equitable image tools for everyone and to improve representation of skin tone in Search. Computer vision researchers and practitioners outside of Google, like the curators of MetaAI’s Casual Conversations dataset, are recognizing the value of MST annotations to provide additional insight into diversity and representation in datasets. Incorporation into widely available datasets like these is essential to give everyone the ability to build more inclusive computer vision technologies and to test the quality of their systems and products across a wide range of skin tones.

Our team has continued to conduct research to understand how we can continue to advance our understanding of skin tone in computer vision. One of our core areas of focus has been skin tone annotation, the process by which human annotators are asked to review images of people and select the best representation of their skin tone. MST annotations enable a better understanding of the inclusiveness and representativeness of datasets across a wide range of skin tones, thus enabling researchers and practitioners to evaluate quality and fairness of their datasets and models. To better understand the effectiveness of MST annotations, we’ve asked ourselves the following questions:

  • How do people think about skin tone across geographic locations?
  • What does global consensus of skin tone look like?
  • How do we effectively annotate skin tone for use in inclusive machine learning (ML)?

The MST-E dataset

The MST-E dataset contains 1,515 images and 31 videos of 19 subjects spanning the 10-point MST scale, where the subjects and images were sourced through TONL, a stock photography company focusing on diversity. The 19 subjects include individuals of different ethnicities and gender identities to help human annotators decouple the concept of skin tone from race. The primary goal of this dataset is to enable practitioners to train their human annotators and test for consistent skin tone annotations across various environment capture conditions.

The MST-E image set contains 1,515 images and 31 videos featuring 19 models taken under various lighting conditions and facial expressions. Images by TONL. Copyright TONL.CO 2022 ALL RIGHTS RESERVED. Used with permission.

All images of a subject were collected in a single day to reduce variation of skin tone due to seasonal or other temporal effects. Each subject was photographed in various poses, facial expressions, and lighting conditions. In addition, Dr. Monk annotated each subject with a skin tone label and then selected a “golden” image for each subject that best represents their skin tone. In our research we compare annotations made by human annotators to those made by Dr. Monk, an academic expert in social perception and inequality.

Terms of use

Each model selected as a subject provided consent for their images and videos to be released. TONL has given permission for these images to be released as part of MST-E and used for research or human-annotator-training purposes only. The images are not to be used to train ML models.

Challenges with forming consensus of MST annotations

Although skin tone is easy for a person to see, it can be challenging to systematically annotate across multiple people due to issues with technology and the complexity of human social perception.

On the technical side, things like the pixelation, lighting conditions of an image, or a person’s monitor settings can affect how skin tone appears on a screen. You might notice this yourself the next time you change the display setting while watching a show. The hue, saturation, and brightness could all affect how skin tone is displayed on a monitor. Despite these challenges, we find that human annotators are able to learn to become invariant to lighting conditions of an image when annotating skin tone.

On the social perception side, aspects of a person’s life like their location, culture, and lived experience may affect how they annotate various skin tones. We found some evidence for this when we asked photographers in the United States and photographers in India to annotate the same image. The photographers in the United States viewed this person as somewhere between MST-5 & MST-7. However, the photographers in India viewed this person as somewhere between MST-3 & MST-5.

The distribution of Monk Skin Tone Scale annotations for this image from a sample of 5 photographers in the U.S. and 5 photographers in India.

Continuing this exploration, we asked trained annotators from five different geographical regions (India, Philippines, Brazil, Hungary, and Ghana) to annotate skin tone on the MST scale. Within each market, each image was annotated by 5 annotators drawn from a broader pool of annotators in that region. For example, we could have 20 annotators in a market, and select 5 to review a particular image.

With these annotations we found two important details. First, annotators within a region had similar levels of agreement on a single image. Second, annotations between regions were, on average, significantly different from each other (p < 0.05). This suggests that people from the same geographic region may have a similar mental model of skin tone, but this mental model is not universal.

However, even with these regional differences, we also find that the consensus between all five regions falls close to the MST values supplied by Dr. Monk. This suggests that a geographically diverse group of annotators can get close to the MST value annotated by an MST expert. In addition, after training, we find no significant difference between annotations on well-lit images versus poorly-lit images, suggesting that annotators can become invariant to different lighting conditions in an image — a non-trivial task for ML models.
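As an illustration of the kind of aggregation described above (not the exact analysis pipeline used in the study), one simple way to form a regional and global consensus is to take the median MST rating per image and compare it to the expert label. All ratings below are hypothetical.

    import numpy as np

    # Hypothetical ratings: MST values (1-10) from 5 annotators per region for one image.
    ratings_by_region = {
        "India":       [5, 5, 6, 5, 6],
        "Philippines": [6, 6, 5, 6, 7],
        "Brazil":      [6, 7, 6, 6, 6],
        "Hungary":     [6, 6, 6, 7, 6],
        "Ghana":       [5, 6, 6, 6, 5],
    }
    expert_label = 6  # MST value assigned by the expert (hypothetical)

    # Per-region consensus: median rating within each region.
    regional_consensus = {r: np.median(v) for r, v in ratings_by_region.items()}

    # Global consensus: median over all ratings from all regions.
    all_ratings = [x for v in ratings_by_region.values() for x in v]
    global_consensus = np.median(all_ratings)

    print(regional_consensus)
    print(f"global consensus: {global_consensus}, expert label: {expert_label}")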

The MST-E dataset allows researchers to study annotator behavior across curated subsets controlling for potential confounders. We observed similar regional variation when annotating much larger datasets with many more subjects.

Skin Tone annotation recommendations

Our research includes four major findings. First, annotators within a similar geographical region have a consistent and shared mental model of skin tone. Second, these mental models differ across different geographical regions. Third, the MST annotation consensus from a geographically diverse set of annotators aligns with the annotations provided by an expert in social perception and inequality. And fourth, annotators can learn to become invariant to lighting conditions when annotating MST.

Given our research findings, there are a few recommendations for skin tone annotation when using the MST.

  1. Having a geographically diverse set of annotators is important to gain accurate, or close to ground truth, estimates of skin tone.
  2. Train human annotators using the MST-E dataset, which spans the entire MST spectrum and contains images in a variety of lighting conditions. This will help annotators become invariant to lighting conditions and appreciate the nuance and differences between the MST points.
  3. Given the wide range of annotations, we suggest having at least two annotators in at least five different geographical regions (10 ratings per image).

Skin tone annotation, like other subjective annotation tasks, is difficult but possible. These types of annotations allow for a more nuanced understanding of model performance, and ultimately help us all to create products that work well for every person across the broad and diverse spectrum of skin tones.

Acknowledgements

We wish to thank our colleagues across Google working on fairness and inclusion in computer vision for their contributions to this work, especially Marco Andreetto, Parker Barnes, Ken Burke, Benoit Corda, Tulsee Doshi, Courtney Heldreth, Rachel Hornung, David Madras, Ellis Monk, Shrikanth Narayanan, Utsav Prabhu, Susanna Ricco, Sagar Savla, Alex Siegman, Komal Singh, Biao Wang, and Auriel Wright. We also would like to thank Annie Jean-Baptiste, Florian Koenigsberger, Marc Repnyek, Maura O’Brien, and Dominique Mungin and the rest of the team who help supervise, fund, and coordinate our data collection.


Learning Language-Specific Layers for Multilingual Machine Translation

Multilingual Machine Translation promises to improve translation quality between non-English languages. This is advantageous for several reasons, namely lower latency (no need to translate twice), and reduced error cascades (e.g., avoiding losing gender and formality information when translating through English). On the downside, adding more languages reduces model capacity per language, which is usually countered by increasing the overall model size, making training harder and inference slower. In this work, we introduce Language-Specific Transformer Layers (LSLs), which allow us to increase…Apple Machine Learning Research

PointConvFormer: Revenge of the Point-based Convolution

We introduce PointConvFormer, a novel building block for point cloud based deep network architectures. Inspired by generalization theory, PointConvFormer combines ideas from point convolution, where filter weights are only based on relative position, and Transformers which utilize feature-based attention. In PointConvFormer, attention computed from feature difference between points in the neighborhood is used to modify the convolutional weights at each point. Hence, we preserved the invariances from point convolution, whereas attention helps to select relevant points in the neighborhood for…Apple Machine Learning Research

F-VLM: Open-vocabulary object detection upon frozen vision and language models

Detection is a fundamental vision task that aims to localize and recognize objects in an image. However, the data collection process of manually annotating bounding boxes or instance masks is tedious and costly, which limits the modern detection vocabulary size to roughly 1,000 object classes. This is orders of magnitude smaller than the vocabulary people use to describe the visual world and leaves out many categories. Recent vision and language models (VLMs), such as CLIP, have demonstrated improved open-vocabulary visual recognition capabilities through learning from Internet-scale image-text pairs. These VLMs are applied to zero-shot classification using frozen model weights without the need for fine-tuning, which stands in stark contrast to the existing paradigms used for retraining or fine-tuning VLMs for open-vocabulary detection tasks.

Intuitively, to align the image content with the text description during training, VLMs may learn region-sensitive and discriminative features that are transferable to object detection. Surprisingly, features of a frozen VLM contain rich information that is both region-sensitive for describing object shapes (second column below) and discriminative for region classification (third column below). In fact, feature grouping can nicely delineate object boundaries without any supervision. This motivates us to explore the use of frozen VLMs for open-vocabulary object detection with the goal to expand detection beyond the limited set of annotated categories.

We explore the potential of frozen vision and language features for open-vocabulary detection. The K-Means feature grouping reveals rich semantic and region-sensitive information where object boundaries are nicely delineated (column 2). The same frozen features can classify groundtruth (GT) regions well without fine-tuning (column 3).

In “F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models”, presented at ICLR 2023, we introduce a simple and scalable open-vocabulary detection approach built upon frozen VLMs. F-VLM reduces the training complexity of an open-vocabulary detector to below that of a standard detector, obviating the need for knowledge distillation, detection-tailored pre-training, or weakly supervised learning. We demonstrate that by preserving the knowledge of pre-trained VLMs completely, F-VLM maintains a similar philosophy to ViTDet and decouples detector-specific learning from the more task-agnostic vision knowledge in the detector backbone. We are also releasing the F-VLM code along with a demo on our project page.

Learning upon frozen vision and language models

We want to retain the knowledge of pretrained VLMs as much as possible to minimize the effort and cost needed to adapt them for open-vocabulary detection. We use a frozen VLM image encoder as the detector backbone and a text encoder for caching the detection text embeddings of the offline dataset vocabulary. We take this VLM backbone and attach a detector head, which predicts object regions for localization and outputs detection scores that indicate the probability of a detected box being of a certain category. The detection scores are the cosine similarity of region features (a set of bounding boxes that the detector head outputs) and category text embeddings. The category text embeddings are obtained by feeding the category names through the text model of the pretrained VLM (which has both image and text models).
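A minimal sketch of this scoring step is shown below; the shapes, temperature, and tensor names are illustrative placeholders rather than the released F-VLM code. Detection scores are the cosine similarity between region features and the cached category text embeddings.

    import torch
    import torch.nn.functional as F

    num_regions, num_classes, dim = 100, 1203, 512  # e.g., an LVIS-sized vocabulary

    region_feats = torch.randn(num_regions, dim)  # region features from the detector head
    text_embeds = torch.randn(num_classes, dim)   # cached category text embeddings

    # Cosine similarity = dot product of L2-normalized vectors.
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    detection_scores = region_feats @ text_embeds.T           # (num_regions, num_classes)
    detection_probs = (detection_scores / 0.01).softmax(-1)   # temperature value is illustrative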

The VLM image encoder consists of two parts: 1) a feature extractor and 2) a feature pooling layer. We adopt the feature extractor for detector head training, which is the only step we train (on standard detection data), to allow us to directly use frozen weights, inheriting rich semantic knowledge (e.g., long-tailed categories like martini, fedora hat, pennant) from the VLM backbone. The detection losses include box regression and classification losses.

At training time, F-VLM is simply a detector with the last classification layer replaced by base-category text embeddings.

Region-level open-vocabulary recognition

The ability to perform open-vocabulary recognition at region level (i.e., bounding box level as opposed to image level) is integral to F-VLM. Since the backbone features are frozen, they do not overfit to the training categories (e.g., donut, zebra) and can be directly cropped for region-level classification. F-VLM performs this open-vocabulary classification only at test time. To obtain the VLM features for a region, we apply the feature pooling layer on the cropped backbone output features. Because the pooling layer requires fixed-size inputs, e.g., 7×7 for a ResNet50 (R50) CLIP backbone, we crop and resize the region features with the ROI-Align layer (shown below). Unlike existing open-vocabulary detection approaches, we do not crop and resize the RGB image regions and cache their embeddings in a separate offline process, but train the detector head in one stage. This is simpler and makes more efficient use of disk storage space. In addition, we do not crop VLM region features during training because the backbone features are frozen.
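The cropping step can be sketched with torchvision's roi_align; the feature map shape, image size, and spatial scale below are illustrative assumptions, not the exact F-VLM configuration.

    import torch
    from torchvision.ops import roi_align

    backbone_feats = torch.randn(1, 2048, 32, 32)  # frozen top-level VLM features (illustrative)
    # One region box in image coordinates: (batch_index, x1, y1, x2, y2).
    boxes = torch.tensor([[0.0, 10.0, 10.0, 200.0, 200.0]])

    # Crop and resize each region to the fixed size the pooling layer expects,
    # e.g., 7x7 for an R50 CLIP backbone; spatial_scale maps image coords to feature coords.
    region_feats = roi_align(backbone_feats, boxes, output_size=(7, 7), spatial_scale=32 / 1024)
    # region_feats: (num_boxes, 2048, 7, 7), then fed through the frozen attention pooling layer.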

Despite never being trained on regions, the cropped region features maintain good open-vocabulary recognition capability. However, we observe that the cropped region features are not sensitive enough to the localization quality of the regions, i.e., loosely and tightly localized boxes have similar features. This may be good for classification, but is problematic for detection because we need the detection scores to reflect localization quality as well. To remedy this, we apply the geometric mean to combine the VLM scores with the detection scores for each region and category. The VLM scores indicate the probability of a detection box being of a certain category according to the pretrained VLM. The detection scores indicate the class probability distribution of each box based on the similarity of region features and input text embeddings.
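A sketch of this score combination is shown below; the weighting exponent is a placeholder (the paper uses separate weights for base and novel categories).

    import torch

    def combine_scores(det_probs: torch.Tensor, vlm_probs: torch.Tensor, alpha: float = 0.65):
        """Geometric-mean combination of detection and VLM probabilities.

        det_probs, vlm_probs: (num_regions, num_classes) probability-like scores.
        alpha is illustrative; larger alpha trusts the frozen VLM more.
        """
        return det_probs ** (1.0 - alpha) * vlm_probs ** alpha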

At test time, F-VLM uses the region proposals to crop out the top-level features of the VLM backbone and compute the VLM score per region. The trained detector head provides the detection boxes and masks, while the final detection scores are a combination of detection and VLM scores.

Evaluation

We apply F-VLM to the popular LVIS open-vocabulary detection benchmark. At the system level, the best F-VLM achieves 32.8 average precision (AP) on rare categories (APr), which outperforms the state of the art by 6.5 mask APr and many other approaches based on knowledge distillation, pre-training, or joint training with weak supervision. F-VLM shows a strong scaling property with frozen model capacity, while the number of trainable parameters is fixed. Moreover, F-VLM generalizes and scales well in the transfer detection tasks (e.g., Objects365 and Ego4D datasets) by simply replacing the vocabularies without fine-tuning the model. We test the LVIS-trained models on the popular Objects365 dataset and demonstrate that the model can work very well without training on in-domain detection data.

F-VLM outperforms the state of the art (SOTA) on the LVIS open-vocabulary detection benchmark and transfer object detection. On the x-axis, we show the LVIS metric mask AP on rare categories (APr), and the Objects365 (O365) metric box AP on all categories. The sizes of the detector backbones are as follows: Small (R50), Base (R50x4), Large (R50x16), Huge (R50x64). The naming follows CLIP convention.

We visualize F-VLM on open-vocabulary detection and transfer detection tasks (shown below). On LVIS and Objects365, F-VLM correctly detects both novel and common objects. A key benefit of open-vocabulary detection is to test on out-of-distribution data with categories given by users on the fly. See the F-VLM paper for more visualization on LVIS, Objects365 and Ego4D datasets.

F-VLM open-vocabulary and transfer detections. Top: Open-vocabulary detection on LVIS. We only show the novel categories for clarity. Bottom: Transfer to Objects365 dataset shows accurate detection of many categories. Novel categories detected: fedora, martini, pennant, football helmet (LVIS); slide (Objects365).

Training efficiency

We show that F-VLM can achieve top performance with far fewer computational resources, as shown in the table below. Compared to the state-of-the-art approach, F-VLM can achieve better performance with 226x fewer resources and 57x faster wall clock time. Apart from training resource savings, F-VLM has potential for substantial memory savings at training time by running the backbone in inference mode. The F-VLM system runs almost as fast as a standard detector at inference time, because the only addition is a single attention pooling layer on the detected region features.

Method    APr     Training Epochs    Training Cost (per-core-hour)    Training Cost Savings
SOTA      26.3    460                8,000                            1x
F-VLM     32.8    118                565                              14x
F-VLM     31.0    14.7               71                               113x
F-VLM     27.7    7.4                35                               226x

We provide additional results using the shorter Detectron2 training recipes (12 and 36 epochs), and show similarly strong performance by using a frozen backbone. The default setting is marked in gray.

Backbone    Large Scale Jitter    #Epochs    Batch Size    APr
R50                               12         16            18.1
R50                               36         64            18.5
R50                               100        256           18.6
R50x64                            12         16            31.9
R50x64                            36         64            32.6
R50x64                            100        256           32.8

Conclusion

We present F-VLM – a simple open-vocabulary detection method which harnesses the power of frozen pre-trained large vision-language models to provide detection of novel objects. This is done without a need for knowledge distillation, detection-tailored pre-training, or weakly supervised learning. Our approach offers significant compute savings and obviates the need for image-level labels. F-VLM achieves the new state-of-the-art in open-vocabulary detection on the LVIS benchmark at system level, and shows very competitive transfer detection on other datasets. We hope this study can both facilitate further research in novel-object detection and help the community explore frozen VLMs for a wider range of vision tasks.

Acknowledgements

This work is conducted by Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. We would like to thank our colleagues at Google Research for their advice and helpful discussions.


AI-powered code suggestions and security scans in Amazon SageMaker notebooks using Amazon CodeWhisperer and Amazon CodeGuru

Amazon SageMaker comes with two options to spin up fully managed notebooks for exploring data and building machine learning (ML) models. The first option is fast-start, collaborative notebooks accessible within Amazon SageMaker Studio—a fully integrated development environment (IDE) for machine learning. You can quickly launch notebooks in Studio, easily dial up or down the underlying compute resources without interrupting your work, and even share your notebook as a link in a few clicks. In addition to creating notebooks, you can perform all the ML development steps to build, train, debug, track, deploy, and monitor your models in a single pane of glass in Studio. The second option is Amazon SageMaker notebook instances—a single, fully managed ML compute instance running notebooks in the cloud, offering you more control over your notebook configurations.

Today, we are excited to announce the availability of Amazon CodeWhisperer and Amazon CodeGuru Security extensions in SageMaker notebooks. These AI-powered extensions help accelerate ML development by offering code suggestions as you type, and ensure that your code is secure and follows AWS best practices.

In this post, we show how you can get started with Amazon CodeGuru Security and CodeWhisperer in Studio and SageMaker notebook instances.

Solution overview

The CodeWhisperer extension is an AI coding companion that provides developers with real-time code suggestions in notebooks. Individual developers can use CodeWhisperer for free in Studio and SageMaker notebook instances. The coding companion generates real-time single-line or full function code suggestions. It understands semantics and context in your code and can recommend suggestions built on AWS and development best practices, improving developer efficiency, quality, and speed.

The CodeGuru Security extension offers security and code quality scans for Studio and SageMaker notebook instances. This assists notebook users in detecting security vulnerabilities such as injection flaws, data leaks, weak cryptography, or missing encryption within the notebook cells. You can also detect many common issues that affect the readability, reproducibility, and correctness of computational notebooks, such as misuse of ML library APIs, invalid run order, and nondeterminism. When vulnerabilities or quality issues are identified in the notebook, CodeGuru generates recommendations that enable you to remediate those issues based on AWS security best practices.

In the following sections, we show how to install each of the extensions and discuss the capabilities of each, demonstrating how these tools can improve overall developer productivity.

Prerequisites

If this is your first time working with Studio, you first need to create a SageMaker domain. Additionally, make sure you have appropriate access to both CodeWhisperer and CodeGuru using AWS Identity and Access Management (IAM).

You can use these extensions in any AWS Region, but requests to CodeWhisperer will be served through the us-east-1 Region. Requests to CodeGuru will be served in the Region of the Studio domain if CodeGuru is supported in that Region; for all non-supported Regions, requests will be served through us-east-1.

Set up CodeWhisperer with SageMaker notebooks

In this section, we demonstrate how to set up CodeWhisperer with SageMaker Studio.

Update IAM permissions to use the extension

You can use the CodeWhisperer extension in any Region, but all requests to CodeWhisperer will be served through the us-east-1 Region.

To use the CodeWhisperer extension, ensure that you have the necessary permissions. On the IAM console, add the following policy to the SageMaker user execution role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CodeWhispererPermissions",
            "Effect": "Allow",
            "Action": ["codewhisperer:GenerateRecommendations"],
            "Resource": "*"
        }
    ]
}

Install the CodeWhisperer extension

You can install the CodeWhisperer extension through the command line. In this section, we look at the steps involved. To get started, complete the following steps:

  1. On the File menu, choose New and Terminal.
  2. Run the following commands to install the extension:
    conda activate studio
    pip install amazon-codewhisperer-jupyterlab-ext
    jupyter server extension enable amazon_codewhisperer_jupyterlab_ext
    conda deactivate
    restart-jupyter-server

Refresh your browser, and you will have successfully installed the CodeWhisperer extension.

Use CodeWhisperer in Studio

After we complete the installation steps, we can use CodeWhisperer by opening a new notebook or Python file. For our example, we open a sample notebook.

You will see a toolbar at the bottom of your notebook called CodeWhisperer. This shows common shortcuts for CodeWhisperer along with the ability to pause code suggestions, open the code reference log, and get a link to the CodeWhisperer documentation.

The code reference log will flag or filter code suggestions that resemble open-source training data. It provides the associated open-source project’s repository URL and license so that you can more easily review them and add attribution.

To get started, place your cursor in a code block in your notebook, and CodeWhisperer will begin to make suggestions. If you don’t see suggestions, press Alt+C on Windows or Option+C on Mac to manually invoke them.

The following video shows how to use CodeWhisperer to read and perform descriptive statistics on a data file in Studio.
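The video isn’t reproduced here, but the kind of code such a prompt produces looks roughly like the following sketch (the file name is a placeholder):

    import pandas as pd

    # Read a data file and print descriptive statistics (placeholder file name).
    df = pd.read_csv("data.csv")
    print(df.describe())
    print(df.isnull().sum())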

Use CodeWhisperer in SageMaker Notebook Instances

Complete the following steps to use CodeWhisperer in notebook instances:

  1. Navigate to your SageMaker notebook instance.
  2. Make sure you have attached the CodeWhisperer policy from earlier to the notebook instance IAM role.
  3. When the permissions are added, choose Open JupyterLab.
  4. Install the extension using a terminal: on the File menu, choose New and Terminal, and enter the following commands:
    pip install amazon-codewhisperer-jupyterlab-ext
    jupyter server extension enable amazon_codewhisperer_jupyterlab_ext

  5. Once the commands complete, on the File menu, choose Shut Down to restart the Jupyter server.
  6. Refresh the browser window.

You will now see the CodeWhisperer extension installed and ready to use.

Let’s test it out in a Python file.

  1. On the File menu, choose New and Python File.

The following video shows how to create a function to convert a JSON file to a CSV.
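As a reference for the result shown in the video, a function of this kind might look like the following sketch; the names are placeholders, and it assumes the JSON file contains a list of flat records.

    import csv
    import json

    def json_to_csv(json_path: str, csv_path: str) -> None:
        """Convert a JSON file containing a list of flat records to a CSV file."""
        with open(json_path) as f:
            records = json.load(f)
        with open(csv_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=records[0].keys())
            writer.writeheader()
            writer.writerows(records)

    # Example usage with placeholder paths:
    # json_to_csv("input.json", "output.csv")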

Set up CodeGuru Security with SageMaker notebooks

In this section, we demonstrate how to set up CodeGuru Security with SageMaker Studio.

Update IAM permissions to use the extension

To use the CodeGuru Security extension, ensure that you have the necessary permissions. Complete the following steps to update permission policies with IAM:

  1. Preferred: On the IAM console, you can attach the AmazonCodeGuruSecurityScanAccess managed policy to your IAM identities. This policy grants permissions that allow a user to work with scans, including creating scans, viewing scan information, and viewing scan findings.
  2. For custom policies, enter the following permissions:
    { 
        "Version": "2012-10-17", 
        "Statement": [ 
            { 
                "Sid": "AmazonCodeGuruSecurityScanAccess", 
                "Effect": "Allow", 
                "Action": [ 
                    "codeguru-security:CreateScan", 
                    "codeguru-security:CreateUploadUrl", 
                    "codeguru-security:GetScan", 
                    "codeguru-security:GetFindings" 
                ], 
                "Resource": "arn:aws:codeguru-security:*:*:scans/*" 
            } 
        ] 
    }

  3. Attach the policy to any user or role that will use the CodeGuru Security extension.

For more information, see Policies and permissions in IAM.

Install the CodeGuru Security extension

You can install the CodeGuru Security extension through the command line. To get started, complete the following steps:

  1. On the File menu, choose New and Terminal.
  2. Run the following commands to install the extension in the conda environment:
    conda activate studio
    pip install amazon-codeguru-jupyterlab-extension
    conda deactivate

Refresh your browser, and you will have successfully installed the CodeGuru extension.

Run a code scan

The following steps demonstrate running your first CodeGuru Security scan using an example file:

  1. Create a new notebook called example.ipynb with the following code for testing purposes:
    import torch
    import torch.nn as nn  # added so the compliant example below resolves nn.BCEWithLogitsLoss
    # import tensorflow as tf


    def tensorflow_avoid_using_nondeterministic_api_noncompliant():
        data = tf.ones((1, 1))
        # Noncompliant: Determinism of tf.compat.v1.Session
        # can not be guaranteed in TF2.
        tf.config.experimental.enable_op_determinism()
        tf.compat.v1.Session(
            target='', graph=None, config=None
        )
        layer = tf.keras.layers.Input(shape=[1])
        model = tf.keras.models.Model(inputs=layer, outputs=layer)
        model.compile(loss="categorical_crossentropy", metrics="AUC")
        model.fit(x=data, y=data)

    def pytorch_sigmoid_before_bceloss_compliant():
        # Compliant: `BCEWithLogitsLoss` function integrates a `Sigmoid`
        # layer and the `BCELoss` into one class
        # and is numerically robust.
        loss = nn.BCEWithLogitsLoss()

        input = torch.randn(3, requires_grad=True)
        target = torch.empty(3).random_(2)
        output = loss(input, target)
        output.backward()

The preceding code intentionally incorporates common bad practices to showcase the capabilities of Amazon CodeGuru Security.

  2. Important: Confirm that the CodeGuru Security extension is installed and that the LSP server status shows Fully initialized when you open your notebook.

If you don’t see the extension fully initialized, return to the previous section to install the extension and complete the installation steps.

  3. Initiate the scan. You can initiate a scan in one of the following ways:
    • Choose any code cell in your file, then choose the lightbulb icon.
    • Choose (right-click) any code cell in your file, then choose Run CodeGuru scan.

When the scan is started, the scan status will show as CodeGuru: Scan in progress.

After a few seconds, when the scan is complete, the status will change to CodeGuru: Scan completed.

View and address findings

After the scan is finished, your code may have some underlined findings. Hover over the underlined code, and a pop-up window appears with a brief summary of the finding. To access additional details about the findings, right-click on any cell and choose Show diagnostics panel.

This will open a panel containing additional information and suggestions related to the findings, located at the bottom of the notebook file.

After making changes to your code based on the recommendations, you can rerun the scan to check if the issue has been resolved. It’s important to note that the scan findings will disappear after you modify your code, and you’ll need to rerun the scan to view them again.

Enable automatic code scans

Automatic scans are disabled by default. Optionally, you can enable automatic code scans and set the frequency and AWS Region for your scan runs. To enable automatic code scans, complete the following steps:

  1. In Studio, on the Settings menu, choose Advanced Settings Editor.
  2. For Auto scans, choose Enabled.
  3. Specify the scan frequency in seconds and the Region for your CodeGuru Security scan.

For our example, we configure CodeGuru to perform an automatic security scan every 240 seconds in the us-east-1 Region. You can modify the frequency and choose any Region where CodeGuru Security is supported.

Conclusion

SageMaker Studio and SageMaker Notebook Instances now support AI-powered CodeWhisperer and CodeGuru extensions that help you write secure code faster. We encourage you to try out both extensions. To learn more about CodeGuru Security for SageMaker, refer to Get started with the Amazon CodeGuru Extension for JupyterLab and SageMaker Studio, and to learn more about CodeWhisperer for SageMaker, refer to Setting up CodeWhisperer with Amazon SageMaker Studio. Please share any feedback in the comments!


About the authors

Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM) and Machine Learning infrastructure and operations projects (MLOps).

Gaurav Parekh is a Solutions Architect helping AWS customers build large-scale modern architecture. His core areas of expertise include Data Analytics, Networking, and Technology strategy. Outside of work, Gaurav enjoys playing cricket, soccer, and volleyball.

Arkaprava De is a Senior Software Engineer at AWS. He has been at Amazon for over 7 years and is currently working on improving the Amazon SageMaker Studio IDE experience. You can find him on LinkedIn.

Prashant Pawan Pisipati is a Principal Product Manager at Amazon Web Services (AWS). He has built various products across AWS and Alexa, and is currently focused on helping Machine Learning practitioners be more productive through AWS services.


On Privacy and Personalization in Federated Learning: A Retrospective on the US/UK PETs Challenge

TL;DR: We study the use of differential privacy in personalized, cross-silo federated learning (NeurIPS’22), explain how these insights led us to develop a 1st place solution in the US/UK Privacy-Enhancing Technologies (PETs) Prize Challenge, and share challenges and lessons learned along the way. If you are feeling adventurous, check out the extended version of this post with more technical details!


How can we be better prepared for the next pandemic?

Patient data collected by groups such as hospitals and health agencies is a critical tool for monitoring and preventing the spread of disease. Unfortunately, while this data contains a wealth of useful information for disease forecasting, the data itself may be highly sensitive and stored in disparate locations (e.g., across multiple hospitals, health agencies, and districts).

In this post we discuss our research on federated learning, which aims to tackle this challenge by performing decentralized learning across private data silos. We then explore an application of our research to the problem of privacy-preserving pandemic forecasting—a scenario where we recently won a 1st place, $100k prize in a competition hosted by the US & UK governments—and end by discussing several directions of future work based on our experiences.


Part 1: Privacy, Personalization, and Cross-Silo Federated Learning

Federated learning (FL) is a technique to train models using decentralized data without directly communicating such data. Typically:

  • a central server sends a model to participating clients;
  • the clients train that model using their own local data and send back updated models; and
  • the server aggregates the updates (e.g., via averaging, as in FedAvg)

and the cycle repeats. Companies like Apple and Google have deployed FL to train models for applications such as predictive keyboards, text selection, and speaker verification in networks of user devices.
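
For a concrete picture of this loop, here is a minimal NumPy sketch of one FedAvg round for a logistic-regression model; the function names and the in-memory list of clients are illustrative rather than an actual FL deployment.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, steps=5):
    """Plain local SGD on a logistic-regression loss (client-side training)."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)          # mean logistic-loss gradient
    return w

def fedavg_round(global_w, clients, lr=0.1, steps=5):
    """One round: broadcast global_w, train locally, average weighted by dataset size."""
    updates, sizes = [], []
    for X, y in clients:                          # each client holds (X, y) privately
        updates.append(local_update(global_w.copy(), X, y, lr, steps))
        sizes.append(len(y))
    sizes = np.asarray(sizes, dtype=float)
    return np.average(np.stack(updates), axis=0, weights=sizes / sizes.sum())
```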

However, while significant attention has been given to cross-device FL (e.g., learning across large networks of devices such as mobile phones), the area of cross-silo FL (e.g., learning across a handful of data silos such as hospitals or financial institutions) is relatively under-explored, and it presents interesting challenges in terms of how to best model federated data and mitigate privacy risks. In Part 1.1, we’ll examine a suitable privacy granularity for such settings, and in Part 1.2, we’ll see how this interfaces with model personalization, an important technique in handling data heterogeneity across clients.

1.1. How should we protect privacy in cross-silo federated learning?

Although the high-level federated learning workflow described above can help to mitigate systemic privacy risks, past work suggests that FL’s data minimization principle alone isn’t sufficient for data privacy, as the client models and updates can still reveal sensitive information.

This is where differential privacy (DP) can come in handy. DP provides both a formal guarantee and an effective empirical mitigation to attacks like membership inference and data poisoning. In a nutshell, DP is a statistical notion of privacy where we add randomness to a query on a “dataset” to create quantifiable uncertainty about whether any one “data point” has contributed to the query output. DP is typically measured by two scalars \((\varepsilon, \delta)\)—the smaller, the more private.

In the above, “dataset” and “data point” are in quotes because privacy granularity matters. In cross-device FL, it is common to apply “client-level DP” when training a model, where the federated clients (e.g., mobile phones) are thought of as “data points”. This effectively ensures that each participating client/mobile phone user remains private.

However, while client-level DP makes sense for cross-device FL as each client naturally corresponds to a person, this privacy granularity may not be suitable for cross-silo FL, where there are fewer (2-100) ‘clients’ but each holds many data subjects that require protection, e.g., each ‘client’ may be a hospital, bank, or school with many patient, customer, or student records.

Visualizing client-level DP vs. silo-specific example-level DP in federated learning.

In our recent work (NeurIPS’22), we instead consider the notion of “silo-specific example-level DP” in cross-silo FL (see figure above). In short, this says that the \(k\)-th data silo may set its own \((\varepsilon_k, \delta_k)\) example-level DP target for any learning algorithm with respect to its local dataset.

This notion is better aligned with real-world use cases of cross-silo FL, where each data subject contributes a single “example”, e.g., each patient in a hospital contributes their individual medical record. It is also very easy to implement: each silo can just run DP-SGD for local gradient steps with calibrated per-step noise. As we discuss below, this alternate privacy granularity affects how we consider modeling federated data to improve privacy/utility trade-offs.
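
As a minimal sketch of what this looks like in practice, the snippet below runs one DP-SGD step on a silo's local data for a logistic-regression model; the clipping norm and noise multiplier are placeholders that a privacy accountant would calibrate to the silo's own \((\varepsilon_k, \delta_k)\) target.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD step: per-example gradient clipping plus Gaussian noise.

    Illustrative sketch for a logistic-regression loss; in practice the
    noise_multiplier is calibrated to the silo's (eps_k, delta_k) target
    via a privacy accountant, and minibatches are subsampled.
    """
    rng = rng or np.random.default_rng()
    p = 1.0 / (1.0 + np.exp(-X @ w))                          # sigmoid predictions
    grads = X * (p - y)[:, None]                              # per-example gradients
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip)             # clip each to norm <= clip
    noise = rng.normal(0.0, noise_multiplier * clip, size=w.shape)
    return w - lr * (grads.sum(axis=0) + noise) / len(y)      # noisy average gradient
```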

1.2. The interplay of privacy, heterogeneity, and model personalization

Let’s now look at how this privacy granularity may interface with model personalization in federated learning.

Model personalization is a common technique used to improve model performance in FL when data heterogeneity (i.e. non-identically distributed data) exists between data silos.1 Indeed, existing benchmarks suggest that realistic federated datasets may be highly heterogeneous and that fitting separate local models on the federated data is already a competitive baseline.

When considering model personalization techniques under silo-specific example-level privacy, we find that a unique trade-off may emerge between the utility costs from privacy and data heterogeneity (see figure below):

  • As DP noise is added independently by each silo for its own privacy targets, this noise is reflected in the silos’ model updates and can thus be smoothed out when these updates are averaged (e.g. via FedAvg), leading to a smaller utility drop from DP for the federated model.
  • On the other hand, federation also means that the shared, federated model may suffer from data heterogeneity (“one size does not fit all”).

Consider two interesting phenomena illustrated by a simple experiment where all silos use \((\varepsilon = 1, \delta = 10^{-7})\) example-level DP for their own dataset. Left: FedAvg can smooth out the independent, per-silo DP noise and lead to a smaller average utility drop from DP; Mid/Right: Local finetuning (FedAvg followed by further local training) may not improve utility as expected, as the effect of noise reduction is removed when finetuning begins.

This “privacy-heterogeneity cost tradeoff” is interesting because it suggests that model personalization can play a key and distinct role in cross-silo FL. Intuitively, local training (no FL participation) and FedAvg (full FL participation) can be viewed as two ends of a personalization spectrum with identical privacy costs—silos’ participation in FL itself does not incur privacy costs due to DP’s robustness to post-processing—and various personalization algorithms (finetuning, clustering, …) are effectively navigating this spectrum in different ways.

If local training minimizes the effect of data heterogeneity but enjoys no DP noise reduction, and contrarily for FedAvg, it is natural to wonder whether there are personalization methods that lie in between and achieve better utility. If so, what methods would work best?

Privacy-utility tradeoffs for representative personalization methods under silo-specific example-level DP across four cross-silo FL datasets. Finetune: a common baseline for model personalization; IFCA/HypCluster: hard clustering of client models; Ditto: a recently proposed method for personalized FL. MR-MTL: mean-regularized multi-task learning, which consistently outperforms the other baselines.

Our analysis points to mean-regularized multi-task learning (MR-MTL) as a simple yet particularly suitable form of personalization. MR-MTL simply asks each client \(k\) to train its own local model \(w_k\), regularize it towards the mean of others’ models \(\bar w\) via a penalty \(\frac{\lambda}{2} \| w_k - \bar w \|_2^2\), and keep \(w_k\) across rounds (i.e. client is stateful). The mean model \(\bar w\) is maintained by the FL server (as in FedAvg) and may be updated in every round. More concretely, each local update step takes the following form:
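
With \(\ell_k\) denoting silo \(k\)'s local loss and \(\eta\) a local step size (symbols introduced here for concreteness), one way to write this step is

\[
w_k \leftarrow w_k - \eta \big( \nabla \ell_k(w_k) + \lambda \, (w_k - \bar w) \big),
\]

where, in the private setting, \(\nabla \ell_k(w_k)\) is the clipped and noised DP-SGD gradient.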

The hyperparameter \(\lambda\) serves as a smooth knob between local training and FedAvg: \(\lambda = 0\) recovers local training, and a larger \(\lambda\) forces the personalized models to be closer to each other (intuitively, “federate more”).

MR-MTL has some nice properties in the context of private cross-silo FL:

  1. Noise reduction is attained throughout training via the soft proximity constraint towards an averaged model;
  2. The mean-regularization itself has no privacy overhead;2 and
  3. \(\lambda\) provides a smooth interpolation along the personalization spectrum.

Why is the above interesting? Consider the following experiment where we try a range of \(\lambda\) values roughly interpolating local training and FedAvg. Observe that we could find a “sweet spot” \(\lambda^\ast\) that outperforms both of the endpoints under the same privacy cost. Moreover, both the utility advantage of MR-MTL(\(\lambda^\ast\)) over the endpoints, and \(\lambda^\ast\) itself, are larger under privacy; intuitively, this says that silos are encouraged to “federate more” for noise reduction.

Test acc ± std of MR-MTL on a simple cross-silo FL task with varying λ. A “sweet spot” λ* exists where it outperforms both ends of the personalization spectrum (local / FedAvg) under the same privacy budget. Results correspond to ε = 0.5 in the first subplot in the privacy-utility tradeoff curves. Ditto resembles MR-MTL in terms of the training procedure and exhibits similar interpolation behaviors, but it suffers from privacy overhead due to 2x local training iterations.

The above provides rough intuition on why MR-MTL may be a strong baseline for private cross-silo FL and motivates this approach for a practical pandemic forecasting problem, which we discuss in Part 2. Our full paper delves deeper into the analyses and provides additional results and discussions!
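
To make the personalization knob concrete, below is a minimal sketch of a single MR-MTL local step for silo \(k\), combining a clipped-and-noised DP-SGD gradient with the mean-regularization term; the logistic-regression gradient, parameter names, and defaults are illustrative only.

```python
import numpy as np

def mr_mtl_local_step(w_k, w_bar, X, y, lam, lr=0.1, clip=1.0, sigma=1.0, rng=None):
    """One mean-regularized DP-SGD step for silo k (illustrative sketch).

    lam = 0 recovers private local training; a larger lam pulls w_k toward the
    server-maintained mean w_bar ("federate more"). The proximity term is added
    outside the DP mechanism, mirroring the no-privacy-overhead property above.
    """
    rng = rng or np.random.default_rng()
    p = 1.0 / (1.0 + np.exp(-X @ w_k))                        # sigmoid predictions
    grads = X * (p - y)[:, None]                              # per-example gradients
    grads /= np.maximum(1.0, np.linalg.norm(grads, axis=1, keepdims=True) / clip)
    noisy_grad = (grads.sum(axis=0) + rng.normal(0.0, sigma * clip, w_k.shape)) / len(y)
    return w_k - lr * (noisy_grad + lam * (w_k - w_bar))      # mean-regularized update
```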


Part 2: Federated Pandemic Forecasting at the US/UK PETs Challenge

Illustration of the pandemic forecasting problem at the US/UK PETs challenge (image source).

Let’s now take a look at a federated pandemic forecasting problem at the US/UK Privacy-Enhancing Technologies (PETs) prize challenge, and how we may apply the ideas from Part 1.

2.1. Problem setup

The pandemic forecasting problem asks the following: Given a person’s demographic attributes (e.g. age, household size), locations, activities, infection history, and the contact network, what is the likelihood of infection in the next \(t_\text{pred} = 7\) days? Can we make predictions while protecting the privacy of individuals? Moreover, what if the data are siloed across administrative regions?

There’s a lot to unpack in the above. First, the pandemic outbreak problem follows a discrete-time SIR model (Susceptible → Infectious → Recovered) and we begin with a subset of the population infected. Subsequently,

  • Each person goes about their usual daily activities and gets into contact with others (e.g. at a shopping mall)—this forms a contact graph where individuals are nodes and direct contacts are edges;
  • Each person may get infected with different risk levels depending on a myriad of factors—their age, the nature and duration of their contact(s), their node centrality, etc.; and
  • Such infection can also be asymptomatic—the individual can appear in the S state while being secretly infectious.

The challenge dataset models a pandemic outbreak in Virginia and contains roughly 7.7 million nodes (persons) and 186 million edges (contacts) with health states over 63 days; so the actual contact graph is fairly large but also quite sparse.

There are a few extra factors that make this problem challenging:

  1. Data imbalance: less than 5% of people are ever in the I or R state and roughly 0.3% of people became infected in the final week.
  2. Data silos: the true contact graph is cut along administrative boundaries, e.g., by grouped FIPS codes/counties. Each silo only sees a local subgraph, but people may still travel and make contacts across multiple regions! In the official evaluation, the population sizes can also vary by more than 10\(\times\) across silos.
  3. Temporal modeling: we are given the first \(t_\text{train} = 56\) days of each person’s health states (S/I/R) and asked to predict individual infections any time in the subsequent \(t_\text{pred} = 7\) days. What is a training example in this case? How should we perform temporal partitioning? How does this relate to privacy accounting?
  4. Graphs generally complicate DP: we are often used to ML settings where we can clearly define the privacy granularity and how it relates to an actual individual (e.g. medical images of patients). This is tricky with graphs: people can make different numbers of contacts each of different natures, and their influence can propagate throughout the graph. At a high level (and as specified by the scope of sensitive data of the competition), what we care about is known as node-level DP—the model output is “roughly the same” if we add/remove/replace a node, along with its edges.

2.2. Applying MR-MTL with silo-specific example-level privacy

One clean approach to the pandemic forecasting problem is to just operate on the individual level and view it as (federated) binary classification: if we could build a feature vector to summarize an individual, then risk scores are simply the sigmoid probabilities of near-term infection.

Of course, the problem lies in what that feature vector (and the corresponding label) is—we’ll get to this in the following section. But already, we can see that MR-MTL with silo-specific example-level privacy (from Part 1) is a nice framework for a number of reasons:

  • Model personalization is likely needed as the silos are large and heterogeneous by construction (geographic regions are unlikely to all be similar).
  • Privacy definition: There are a small number of clients, but each holds many data subjects, and client-level DP isn’t suitable.
  • Usability, efficiency, and scalability: MR-MTL is remarkably easy to implement with minimal resource overhead (over FedAvg and local training). This is crucial for real-world applications.
  • Adaptability and explainability: The framework is highly adaptable to any learning algorithm that can take DP-SGD-style updates. It also preserves the explainability of the underlying ML algorithm as we don’t obfuscate the model weights, updates, or predictions.

It is also helpful to look at the threat model we might be dealing with and how our framework behaves under it; the interested reader may find more details in the extended post!

2.3. Building training examples

Illustration of iterative, ℓ-hop neighborhood aggregation. Here, green nodes are the sampled neighbors and the yellow node can’t be sampled.

We now describe how to convert individual information and the contact network into a tabular dataset for every silo \(k\) with \(n_k\) nodes.

Recall that our task is to predict the risk of infection of a person within \(t_\text{pred} = 7\) days, and that each silo only sees its local subgraph. We formulate this via a silo-specific set of examples \((X_k \in \mathbb{R}^{n_k \times d}, Y_k \in \{0, 1\}^{n_k})\), where the features \(X_k^{(i)} \in \mathbb{R}^d\) describe the neighborhood around a person \(i\) (see figure) and the binary label \(Y_k^{(i)}\) denotes whether the person becomes infected in the next \(t_\text{pred}\) days.

Each example’s features \(X_k^{(i)}\) consist of the following:

(1) Individual features: Basic (normalized) demographic features like age, gender, and household size; activity features like working, school, going to church, or shopping; and the individual’s infection history as concatenated one-hot vectors (which depends on how we create labels; see below).

(2) Contact features: One of our key simplifying heuristics is that each node’s \(\ell\)-hop neighborhood should contain most of the information we need to predict infection. We build the contact features as follows:

  • Every sampled neighbor \(v\) of a node \(u\) is encoded using its individual features (as above) along with the edge features describing the contact—e.g. the location, the duration, and the activity type.
  • We use iterative neighborhood sampling (figure above), meaning that we first select a set of \(S_1\) 1-hop neighbors, and then sample \(S_2\) 2-hop neighbors adjacent to those 1-hop neighbors, and so on. This allows reusing 1-hop edge features and keeps the feature dimension \(d\) low.
  • We also used deterministic neighborhood sampling—the same person always takes the same subset of neighbors. This drastically reduces computation as the graph/neighborhoods can now be cached. For the interested reader, this also has implications on privacy accounting.

Illustration of the tabularized features. Red/pink blocks are individual (node) features and green blocks are edge features describing the contact. Each blue block denotes the combined features of a single social contact (the neighboring node & the edge), and contacts of higher degrees are concatenated.

The figure above illustrates the neighborhood feature vector that describes a person and their contacts for the binary classifier! Intriguingly, this makes the per-silo models a simplified variant of a graph neural network (GNN) with a single-step, non-parameterized neighborhood aggregation and prediction (cf. SGC models).
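
As a rough illustration of this tabularization, the sketch below assembles one person's feature row from a dict-based contact graph; the dict layout, the fan-outs \(S_1\) and \(S_2\), and the zero-padding for missing contacts are simplifying assumptions rather than the exact feature scheme from our solution.

```python
import numpy as np

def person_row(node, node_feats, contacts, edge_dim, s1=3, s2=2):
    """Build one tabular example for `node`: its own features plus the features
    of a deterministic sample of 1-hop and 2-hop contacts.

    node_feats: dict node_id -> 1-D np.ndarray of individual features
    contacts:   dict node_id -> ordered list of (neighbor_id, edge_feature_array)
    All names, fan-outs (s1, s2), and the dict-based graph are illustrative.
    """
    node_dim = len(next(iter(node_feats.values())))
    pad = np.zeros(node_dim + edge_dim)            # placeholder for missing contacts

    def contact_block(u, k):
        """Up to k deterministic contacts of u, each as [neighbor feats | edge feats]."""
        block = [np.concatenate([node_feats[v], e]) for v, e in contacts.get(u, [])[:k]]
        return block + [pad] * (k - len(block))

    hop1 = contacts.get(node, [])[:s1]
    row = [node_feats[node]] + contact_block(node, s1)       # self + 1-hop contacts
    for v, _ in hop1:                                        # 2-hop through each 1-hop
        row += contact_block(v, s2)
    row += [pad] * ((s1 - len(hop1)) * s2)                   # keep dimension d fixed
    return np.concatenate(row)
```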

For the labels \(Y_k^{(i)}\), we deployed a random infection window strategy:

  1. Pick a window size \(t_\text{window}\) (say 21 days);
  2. Select a random day \(t'\) within the valid range \(t_\text{window} \le t' \le t_\text{train} - t_\text{pred}\);
  3. Encode the S/I/R states in the past window from \(t'\) for every node in the neighborhood as individual features;
  4. The label is then whether person \(i\) is infected in any of the next \(t_\text{pred}\) days from \(t'\).

During training, every time we sample a person (node), we take a random window of infection states to use as features (the “observation” window) and labels (1 iff the person transitions into infection during the “prediction” window), and their neighboring nodes use the same window for building the neighborhood feature vector. During testing, we deterministically take the latest days of the infection history.
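
A minimal sketch of this sampling, assuming each person's daily health states are stored as an integer array over the \(t_\text{train}\) training days (0 = S, 1 = I, 2 = R); the array layout, parameter defaults, and one-hot encoding are illustrative.

```python
import numpy as np

def window_example(states, t_train=56, t_window=21, t_pred=7, rng=None):
    """Sample one (features, label) pair for a person via a random infection window.

    `states` holds the person's daily S/I/R states (0/1/2) for t_train days.
    Features: one-hot history in the observation window ending at t'.
    Label: 1 iff the person first becomes infected during the next t_pred days.
    """
    rng = rng or np.random.default_rng()
    t_prime = int(rng.integers(t_window, t_train - t_pred + 1))   # valid cut point
    window = states[t_prime - t_window:t_prime]
    features = np.eye(3)[window].ravel()                          # one-hot S/I/R history
    newly_infected = (states[t_prime:t_prime + t_pred] == 1).any()
    previously_infected = (states[:t_prime] == 1).any()
    label = int(newly_infected and not previously_infected)
    return features, label
```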

Our strategy implicitly assumes that a person’s infection risk is individual: whether Bob gets infected depends only on his own activities and contacts in the past window. This is certainly not perfect as it ignores population-level modeling (e.g. denser areas have higher risks of infection), but it makes the ML problem very simple: just plug in existing tabular data modeling approaches!

2.4. Putting it all together

We can now see our solution coming together: each silo builds a tabular dataset using neighborhood vectors for features and infection windows for labels, and each silo trains a personalized binary classifier under MR-MTL with silo-specific example-level privacy. We complete our method with a few additional ingredients:

  1. Privacy accounting. We’ve so far glossed over what silo-specific “example-level” DP actually means for an individual. We’ve put more details in the extended blog post, and the main idea is that local DP-SGD can give “neighborhood-level” DP since each node’s enclosing neighborhood is fixed and unique, and we can then convert it to node-level DP (our privacy goal from Part 2.1) by carefully accounting for how a certain node may appear in other nodes’ neighborhoods.
  2. Noisy SGD as an empirical defense. While we have a complete framework for providing silo-specific node-level DP guarantees, for the PETs challenge specifically we decided to opt for weak DP \((\varepsilon > 500)\) as an empirical protection, rather than a rigorous theoretical guarantee. While some readers may find this mildly disturbing at first glance, we note that the strength of protection depends on the data, the models, the actual threats, the desired privacy-utility trade-off, and several crucial factors linking theory and practice which we outline in the extended blog. Our solution was in turn attacked by several red teams to test for vulnerabilities.
  3. Model architecture: simple is good. While the model design space is large, we are interested in methods amenable to gradient-based private optimization (e.g. DP-SGD) and weight-space averaging for federated learning. We compared simple logistic regression and a 3-layer MLP and found that the variance in data strongly favors linear models, which also have benefits in privacy (in terms of limited capacity for memorization) as well as explainability, efficiency, and robustness.
  4. Computation-utility tradeoff for neighborhood sampling. While larger neighborhood sizes \(S\) and more hops \(\ell\) better capture the original contact graph, they also blow up the computation, and our experiments found that larger \(S\) and \(\ell\) tend to have diminishing returns.
  5. Data imbalance and weighted loss. Because the data are highly imbalanced, training naively will suffer from low recall and AUPRC. While there are established over-/under-sampling methods to deal with such imbalance, they unfortunately make privacy accounting a lot trickier in terms of the subsampling assumption or the increased data queries. We instead leveraged the focal loss from the computer vision literature, which is designed to emphasize hard examples (infected cases), and found that it considerably improved both the AUPRC and the recall (see the sketch after this list).
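
A minimal NumPy version of a binary focal loss is sketched below; the \(\gamma\) and \(\alpha\) defaults shown are the common values from the original focal loss paper, not necessarily the ones used in our solution.

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights easy, well-classified examples so that
    training focuses on hard positives (e.g., infected cases). Illustrative.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```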

The above captures the essence of our entry to the challenge. Despite the many subtleties in fully building out a working system, the main ideas were quite simple: train personalized models with DP and add some proximity constraints!


Takeaways and Open Challenges

In Part 1, we reviewed our NeurIPS’22 paper that studied the application of differential privacy in cross-silo federated learning scenarios, and in Part 2, we saw how the core ideas and methods from the paper helped us develop our submission to the PETs prize challenge and win a 1st place in the pandemic forecasting track. For readers interested in more details—such as theoretical analyses, hyperparameter tuning, further experiments, and failure modes—please check out our full paper. Our work also identified several important future directions in this context:

DP under data imbalance. DP is inherently a uniform guarantee, but data imbalance implies that examples are not created equal—minority examples (e.g., disease infection, credit card fraud) are more informative, and they tend to give off (much) larger gradients during model training. Should we instead do class-specific (group-wise) DP or refine “heterogeneous DP” or “outlier DP” notions to better cater to the discrepancy between data points?

Graphs and privacy. Another fundamental basis of DP is that we could delineate what is and isn’t an individual. But as we’ve seen, the information boundaries are often nebulous when an individual is a node in a graph (think social networks and gossip propagation), particularly when the node is arbitrarily well connected. Instead of having rigid constraints (e.g., imposing a max node degree and accounting for it), are there alternative privacy definitions that offer varying degrees of protection for varying node connectedness?

Scalable, private, and federated trees for tabular data. Decision trees/forests tend to work extremely well for tabular data such as ours, even with data imbalance, but despite recent progress, we argue that they are not yet mature under private and federated settings due to some underlying assumptions.

Novel training frameworks. While MR-MTL is a simple and strong baseline under our privacy granularity, it has clear limitations in terms of modeling capacity. Are there other methods that can also provide similar properties to balance the emerging privacy-heterogeneity cost tradeoff?

Honest privacy cost of hyperparameter search. When searching for better frameworks, the dependence on hyperparameters is particularly interesting: our full paper (section 7) made a surprising but somewhat depressing observation that the honest privacy cost of tuning (on average) just 10 configurations (values of \(\lambda\) in this case) may already outweigh the utility advantage of the best-tuned MR-MTL(\(\lambda^\ast\)). What does this mean if MR-MTL is already a strong baseline with just a single hyperparameter?




DISCLAIMER: All opinions expressed in this post are those of the authors and do not represent the views of CMU.

Footnotes

1    Note that “personalization” refers to customizing models for each client (data silo) in federated learning rather than for a specific person.
2    As compared to local training or FedAvg for a fixed \(\lambda\). However, tuning \(\lambda\) as a hyperparameter can incur privacy cost.


Enabling conversational interaction on mobile with LLMs

Enabling conversational interaction on mobile with LLMs

Intelligent assistants on mobile devices have significantly advanced language-based interactions for performing simple daily tasks, such as setting a timer or turning on a flashlight. Despite the progress, these assistants still face limitations in supporting conversational interactions in mobile user interfaces (UIs), where many user tasks are performed. For example, they cannot answer a user’s question about specific information displayed on a screen. An agent would need to have a computational understanding of graphical user interfaces (GUIs) to achieve such capabilities.

Prior research has investigated several important technical building blocks to enable conversational interaction with mobile UIs, including summarizing a mobile screen for users to quickly understand its purpose, mapping language instructions to UI actions, and modeling GUIs so that they are more amenable to language-based interaction. However, each of these only addresses a limited aspect of conversational interaction and requires considerable effort in curating large-scale datasets and training dedicated models. Furthermore, there is a broad spectrum of conversational interactions that can occur on mobile UIs. Therefore, it is imperative to develop a lightweight and generalizable approach to realize conversational interaction.

In “Enabling Conversational Interaction with Mobile UI using Large Language Models”, presented at CHI 2023, we investigate the viability of utilizing large language models (LLMs) to enable diverse language-based interactions with mobile UIs. Recent pre-trained LLMs, such as PaLM, have demonstrated abilities to adapt themselves to various downstream language tasks when being prompted with a handful of examples of the target task. We present a set of prompting techniques that enable interaction designers and developers to quickly prototype and test novel language interactions with users, which saves time and resources before investing in dedicated datasets and models. Since LLMs only take text tokens as input, we contribute a novel algorithm that generates the text representation of mobile UIs. Our results show that this approach achieves competitive performance using only two data examples per task. More broadly, we demonstrate LLMs’ potential to fundamentally transform the future workflow of conversational interaction design.

Animation showing our work on enabling various conversational interactions with mobile UI using LLMs.

Prompting LLMs with UIs

LLMs support in-context few-shot learning via prompting — instead of fine-tuning or re-training models for each new task, one can prompt an LLM with a few input and output data exemplars from the target task. For many natural language processing tasks, such as question-answering or translation, few-shot prompting performs competitively with benchmark approaches that train a model specific to each task. However, language models can only take text input, while mobile UIs are multimodal, containing text, image, and structural information in their view hierarchy data (i.e., the structural data containing detailed properties of UI elements) and screenshots. Moreover, directly inputting the view hierarchy data of a mobile screen into LLMs is not feasible as it contains excessive information, such as detailed properties of each UI element, which can exceed the input length limits of LLMs.

To address these challenges, we developed a set of techniques to prompt LLMs with mobile UIs. We contribute an algorithm that generates the text representation of mobile UIs using depth-first search traversal to convert the Android UI’s view hierarchy into HTML syntax. We also utilize chain of thought prompting, which involves generating intermediate results and chaining them together to arrive at the final output, to elicit the reasoning ability of the LLM.
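
As a rough sketch of the first idea, the snippet below walks a simplified view hierarchy depth-first and emits compact HTML-like text; the node fields (`class`, `text`, `resource_id`, `children`) and the tag mapping are assumptions for illustration, not the exact scheme from the paper.

```python
def view_to_html(node):
    """Depth-first conversion of a (simplified) view-hierarchy node into HTML text."""
    tag_map = {"TextView": "p", "Button": "button",
               "EditText": "input", "ImageView": "img"}
    tag = tag_map.get(node.get("class", ""), "div")
    attrs = f' id="{node["resource_id"]}"' if node.get("resource_id") else ""
    children = "".join(view_to_html(c) for c in node.get("children", []))
    return f"<{tag}{attrs}>{node.get('text', '')}{children}</{tag}>"

screen = {"class": "LinearLayout", "children": [
    {"class": "TextView", "text": "Sign in"},
    {"class": "EditText", "resource_id": "email"},
    {"class": "Button", "resource_id": "continue", "text": "Continue"},
]}
print(view_to_html(screen))
# <div><p>Sign in</p><input id="email"></input><button id="continue">Continue</button></div>
```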

Animation showing the process of few-shot prompting LLMs with mobile UIs.

Our prompt design starts with a preamble that explains the prompt’s purpose. The preamble is followed by multiple exemplars consisting of the input, a chain of thought (if applicable), and the output for each task. Each exemplar’s input is a mobile screen in the HTML syntax. Following the input, chains of thought can be provided to elicit logical reasoning from LLMs. This step is not shown in the animation above as it is optional. The task output is the desired outcome for the target tasks, e.g., a screen summary or an answer to a user question. Few-shot prompting can be achieved with more than one exemplar included in the prompt. During prediction, we feed the model the prompt with a new input screen appended at the end.
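
A minimal sketch of how such a prompt string could be assembled, here for screen summarization; the preamble wording, exemplar fields, and separators are placeholders rather than the exact prompts used in the paper.

```python
def build_prompt(preamble, exemplars, new_screen_html):
    """Assemble a few-shot prompt: preamble, then one block per exemplar
    (input screen, optional chain of thought, output), then the new screen."""
    parts = [preamble.strip()]
    for ex in exemplars:
        block = [f"Screen: {ex['screen_html']}"]
        if ex.get("chain_of_thought"):                    # optional reasoning step
            block.append(f"Reasoning: {ex['chain_of_thought']}")
        block.append(f"Summary: {ex['output']}")
        parts.append("\n".join(block))
    parts.append(f"Screen: {new_screen_html}\nSummary:")  # the model completes this line
    return "\n\n".join(parts)

prompt = build_prompt(
    "Summarize the purpose of each mobile screen in one sentence.",
    [{"screen_html": '<p>Sign in</p><button id="continue">Continue</button>',
      "output": "A sign-in screen that asks the user to continue."}],
    '<p>Settings</p><button id="wifi">Wi-Fi</button>',
)
```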

Experiments

We conducted comprehensive experiments with four pivotal modeling tasks: (1) screen question-generation, (2) screen summarization, (3) screen question-answering, and (4) mapping instruction to UI action. Experimental results show that our approach achieves competitive performance using only two data examples per task.

Task 1: Screen question generation

Given a mobile UI screen, the goal of screen question-generation is to synthesize coherent, grammatically correct natural language questions relevant to the UI elements requiring user input.

We found that LLMs can leverage the UI context to generate questions for relevant information. LLMs significantly outperformed the heuristic approach (template-based generation) regarding question quality.

Example screen questions generated by the LLM. The LLM can utilize screen contexts to generate grammatically correct questions relevant to each input field on the mobile UI, while the template approach falls short.

We also revealed LLMs’ ability to combine relevant input fields into a single question for efficient communication. For example, the filters asking for the minimum and maximum price were combined into a single question: “What’s the price range?”

We observed that the LLM could use its prior knowledge to combine multiple related input fields to ask a single question.

In an evaluation, we solicited human ratings on whether the questions were grammatically correct (Grammar) and relevant to the input fields for which they were generated (Relevance). In addition to the human-labeled language quality, we automatically examined how well LLMs can cover all the elements that need to generate questions (Coverage F1). We found that the questions generated by LLM had almost perfect grammar (4.98/5) and were highly relevant to the input fields displayed on the screen (92.8%). Additionally, LLM performed well in terms of covering the input fields comprehensively (95.8%).

Metric       Template       2-shot LLM      
Grammar       3.6 (out of 5)       4.98 (out of 5)      
Relevance       84.1%       92.8%      
Coverage F1       100%       95.8%      

Task 2: Screen summarization

Screen summarization is the automatic generation of descriptive language overviews that cover essential functionalities of mobile screens. The task helps users quickly understand the purpose of a mobile UI, which is particularly useful when the UI is not visually accessible.

Our results showed that LLMs can effectively summarize the essential functionalities of a mobile UI. They can generate more accurate summaries than the Screen2Words benchmark model that we previously introduced using UI-specific text, as highlighted in the colored text and boxes below.

Example summary generated by 2-shot LLM. We found the LLM is able to use specific text on the screen to compose more accurate summaries.

Interestingly, we observed LLMs using their prior knowledge to deduce information not presented in the UI when creating summaries. In the example below, the LLM inferred the subway stations belong to the London Tube system, while the input UI does not contain this information.

LLM uses its prior knowledge to help summarize the screens.

Human evaluation rated LLM summaries as more accurate than the benchmark, yet they scored lower on metrics like BLEU. The mismatch between perceived quality and metric scores echoes recent work showing LLMs write better summaries despite automatic metrics not reflecting it.

  

Left: Screen summarization performance on automatic metrics. Right: Screen summarization accuracy voted by human evaluators.

Task 3: Screen question-answering

Given a mobile UI and an open-ended question asking for information regarding the UI, the model should provide the correct answer. We focus on factual questions, which require answers based on information presented on the screen.

Example results from the screen QA experiment. The LLM significantly outperforms the off-the-shelf QA baseline model.

We report performance using four metrics: Exact Matches (identical predicted answer to ground truth), Contains GT (answer fully containing ground truth), Sub-String of GT (answer is a sub-string of ground truth), and the Micro-F1 score based on shared words between the predicted answer and ground truth across the entire dataset.
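
For concreteness, here is one way such metrics could be computed; the whitespace tokenization, lowercasing, and the mutually exclusive Exact/Contains/Sub-String buckets are our assumptions rather than the paper's exact evaluation code.

```python
from collections import Counter

def qa_metrics(predictions, references):
    """Exact Matches, Contains GT, Sub-String of GT, and word-level Micro-F1
    over a dataset of (prediction, reference) answer pairs. Illustrative only."""
    exact = contains = substring = 0
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.strip().lower(), ref.strip().lower()
        exact += p == r
        contains += (p != r) and (r in p)         # answer fully contains ground truth
        substring += (p != r) and (p in r)        # answer is a sub-string of ground truth
        p_tok, r_tok = Counter(p.split()), Counter(r.split())
        overlap = sum((p_tok & r_tok).values())   # shared words
        tp += overlap
        fp += sum(p_tok.values()) - overlap
        fn += sum(r_tok.values()) - overlap
    n = len(references)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"exact_match": exact / n, "contains_gt": contains / n,
            "substring_of_gt": substring / n, "micro_f1": f1}
```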

Our results showed that LLMs can correctly answer UI-related questions, such as “what’s the headline?”. The LLM performed significantly better than baseline QA model DistillBERT, achieving a 66.7% fully correct answer rate. Notably, the 0-shot LLM achieved an exact match score of 30.7%, indicating the model’s intrinsic question answering capability.

Models       Exact Matches       Contains GT       Sub-String of GT       Micro-F1      
0-shot LLM       30.7%       6.5%       5.6%       31.2%      
1-shot LLM       65.8%       10.0%       7.8%       62.9%      
2-shot LLM       66.7%       12.6%       5.2%       64.8%      
DistillBERT       36.0%       8.5%       9.9%       37.2%      

Task 4: Mapping instruction to UI action

Given a mobile UI screen and natural language instruction to control the UI, the model needs to predict the ID of the object to perform the instructed action. For example, when instructed with “Open Gmail,” the model should correctly identify the Gmail icon on the home screen. This task is useful for controlling mobile apps using language input such as voice access. We introduced this benchmark task previously.

Example using data from the PixelHelp dataset. The dataset contains interaction traces for common UI tasks such as turning on wifi. Each trace contains multiple steps and corresponding instructions.

We assessed the performance of our approach using the Partial and Complete metrics from the Seq2Act paper. Partial refers to the percentage of correctly predicted individual steps, while Complete measures the portion of accurately predicted entire interaction traces. Although our LLM-based method did not surpass the benchmark trained on massive datasets, it still achieved remarkable performance with just two prompted data examples.

Models       Partial       Complete      
0-shot LLM       1.29       0.00      
1-shot LLM (cross-app)       74.69       31.67      
2-shot LLM (cross-app)       75.28       34.44      
1-shot LLM (in-app)       78.35       40.00      
2-shot LLM (in-app)       80.36       45.00      
Seq2Act       89.21       70.59      

Takeaways and conclusion

Our study shows that prototyping novel language interactions on mobile UIs can be as easy as designing a data exemplar. As a result, an interaction designer can rapidly create functioning mock-ups to test new ideas with end users. Moreover, developers and researchers can explore different possibilities of a target task before investing significant efforts into developing new datasets and models.

We investigated the feasibility of prompting LLMs to enable various conversational interactions on mobile UIs. We proposed a suite of prompting techniques for adapting LLMs to mobile UIs. We conducted extensive experiments with the four important modeling tasks to evaluate the effectiveness of our approach. The results showed that compared to traditional machine learning pipelines that consist of expensive data collection and model training, one could rapidly realize novel language-based interactions using LLMs while achieving competitive performance.

Acknowledgements

We thank our paper co-author Gang Li, and appreciate the discussions and feedback from our colleagues Chin-Yi Cheng, Tao Li, Yu Hsiao, Michael Terry and Minsuk Chang. Special thanks to Muqthar Mohammad and Ashwin Kakarla for their invaluable assistance in coordinating data collection. We thank John Guilyard for helping create animations and graphics in the blog.
