SimPer: Simple self-supervised learning of periodic targets

SimPer: Simple self-supervised learning of periodic targets

Learning from periodic data (signals that repeat, such as a heart beat or the daily temperature changes on Earth’s surface) is crucial for many real-world applications, from monitoring weather systems to detecting vital signs. For example, in the environmental remote sensing domain, periodic learning is often needed to enable nowcasting of environmental changes, such as precipitation patterns or land surface temperature. In the health domain, learning from video measurement has shown to extract (quasi-)periodic vital signs such as atrial fibrillation and sleep apnea episodes.

Approaches like RepNet highlight the importance of these types of tasks, and present a solution that recognizes repetitive activities within a single video. However, these are supervised approaches that require a significant amount of data to capture repetitive activities, all labeled to indicate the number of times an action was repeated. Labeling such data is often challenging and resource-intensive, requiring researchers to manually capture gold-standard temporal measurements that are synchronized with the modality of interest (e.g., video or satellite imagery).

Alternatively, self-supervised learning (SSL) methods (e.g., SimCLR and MoCo v2), which leverage a large amount of unlabeled data to learn representations that capture periodic or quasi-periodic temporal dynamics, have demonstrated success in solving classification tasks. However, they overlook the intrinsic periodicity (i.e., the ability to identify if a frame is part of a periodic process) in data and fail to learn robust representations that capture periodic or frequency attributes. This is because periodic learning exhibits characteristics that are distinct from prevailing learning tasks.

Feature similarity is different in the context of periodic representations as compared to static features (e.g., images). For example, videos that are offset by short time delays or are reversed should be similar to the original sample, whereas videos that have been upsampled or downsampled by a factor x should be different from the original sample by a factor of x.

To address these challenges, in “SimPer: Simple Self-Supervised Learning of Periodic Targets”, published at the eleventh International Conference on Learning Representations (ICLR 2023), we introduced a self-supervised contrastive framework for learning periodic information in data. Specifically, SimPer leverages the temporal properties of periodic targets using temporal self-contrastive learning, where positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. We propose periodic feature similarity that explicitly defines how to measure similarity in the context of periodic learning. Moreover, we design a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). Next, we demonstrate that SimPer effectively learns period feature representations compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts. Finally, we are excited to release the SimPer code repo with the research community.

The SimPer framework

SimPer introduces a temporal self-contrastive learning framework. Positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. For temporal video examples, periodicity-invariant changes are cropping, rotation or flipping, whereas periodicity-variant changes involve increasing or decreasing the speed of a video.

To explicitly define how to measure similarity in the context of periodic learning, SimPer proposes periodic feature similarity. This construction allows us to formulate training as a contrastive learning task. A model can be trained with data without any labels and then fine-tuned if necessary to map the learned features to specific frequency values.

Given an input sequence x, we know there’s an underlying associated periodic signal. We then transform x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo- speed or frequency labels for the unlabeled input x.

Conventional similarity measures such as cosine similarity emphasize strict proximity between two feature vectors, and are sensitive to index shifted features (which represent different time stamps), reversed features, and features with changed frequencies. In contrast, periodic feature similarity should be high for samples with small temporal shifts and or reversed indexes, while capturing a continuous similarity change when the feature frequency varies. This can be achieved via a similarity metric in the frequency domain, such as the distance between two Fourier transforms.

To harness the intrinsic continuity of augmented samples in the frequency domain, SimPer designs a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). This makes it suitable for regression tasks, where the goal is to recover a continuous signal, such as a heart beat.

SimPer constructs negative views of data through transformations in the frequency domain. The input sequence x has an underlying associated periodic signal. SimPer transforms x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for unlabeled input x (periodicity-variant augmentations τ). SimPer takes transformations that do not change the identity of the input and defines these as periodicity-invariant augmentations σ, thus creating different positive views of the sample. Then, it sends these augmented views to the encoder f, which extracts corresponding features.

Results

To evaluate SimPer’s performance, we benchmarked it against state-of-the-art SSL schemes (e.g., SimCLR, MoCo v2, BYOL, CVRL) on a set of six diverse periodic learning datasets for common real-world tasks in human behavior analysis, environmental remote sensing, and healthcare. Specifically, below we present results on heart rate measurement and exercise repetition counting from video. The results show that SimPer outperforms the state-of-the-art SSL schemes across all six datasets, highlighting its superior performance in terms of data efficiency, robustness to spurious correlations, and generalization to unseen targets.

Here we show quantitative results on two representative datasets using SimPer pre-trained using various SSL methods and fine-tuned on the labeled data. First, we pre-train SimPer using the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) dataset, a human photoplethysmography and heart rate prediction dataset, and compare its performance to state-of-the-art SSL methods. We observe that SimPer outperforms SimCLR, MoCo v2, BYOL, and CVRL methods. The results on the human action counting dataset, Countix, further confirm the benefits of SimPer over others methods as it notably outperforms the supervised baseline. For the feature evaluation results and performance on other datasets, please refer to the paper.

Results of SimCLR, MoCo v2, BYOL, CVRL and SimPer on the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) and Countix datasets. Heart rate and repetition count performance is reported as mean absolute error (MAE).

Conclusion and applications

We present SimPer, a self-supervised contrastive framework for learning periodic information in data. We demonstrate that by combining a temporal self-contrastive learning framework, periodicity-invariant and periodicity-variant augmentations, and continuous periodic feature similarity, SimPer provides an intuitive and flexible approach for learning strong feature representations for periodic signals. Moreover, SimPer can be applied to various fields, ranging from environmental remote sensing to healthcare.

Acknowledgements

We would like to thank Yuzhe Yang, Xin Liu, Ming-Zher Poh, Jiang Wu, Silviu Borac, and Dina Katabi for their contributions to this work.

Read More

Symbol tuning improves in-context learning in language models

Symbol tuning improves in-context learning in language models

A key feature of human intelligence is that humans can learn to perform new tasks by reasoning using only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via in-context learning. Language models, however, are still sensitive to the way that prompts are given, indicating that they are not reasoning in a robust manner. For instance, language models often require heavy prompt engineering or phrasing tasks as instructions, and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown incorrect labels.

In “Symbol tuning improves in-context learning in language models”, we propose a simple fine-tuning procedure that we call symbol tuning, which can improve in-context learning by emphasizing input–label mappings. We experiment with symbol tuning across Flan-PaLM models and observe benefits across various settings.

  • Symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels.
  • Symbol-tuned models are much stronger at algorithmic reasoning tasks.
  • Finally, symbol-tuned models show large improvements in following flipped-labels presented in-context, meaning that they are more capable of using in-context information to override prior knowledge.
An overview of symbol tuning, where models are fine-tuned on tasks where natural language labels are replaced with arbitrary symbols. Symbol tuning relies on the intuition that when instruction and relevant labels are not available, models must use in-context examples to learn the task.

Motivation

Instruction tuning is a common fine-tuning method that has been shown to improve performance and allow models to better follow in-context examples. One shortcoming, however, is that models are not forced to learn to use the examples because the task is redundantly defined in the evaluation example via instructions and natural language labels. For example, on the left in the figure above, although the examples can help the model understand the task (sentiment analysis), they are not strictly necessary since the model could ignore the examples and just read the instruction that indicates what the task is.

In symbol tuning, the model is fine-tuned on examples where the instructions are removed and natural language labels are replaced with semantically-unrelated labels (e.g., “Foo,” “Bar,” etc.). In this setup, the task is unclear without looking at the in-context examples. For example, on the right in the figure above, multiple in-context examples would be needed to figure out the task. Because symbol tuning teaches the model to reason over the in-context examples, symbol-tuned models should have better performance on tasks that require reasoning between in-context examples and their labels.

Datasets and task types used for symbol tuning.

Symbol-tuning procedure

We selected 22 publicly-available natural language processing (NLP) datasets that we use for our symbol-tuning procedure. These tasks have been widely used in the past, and we only chose classification-type tasks since our method requires discrete labels. We then remap labels to a random label from a set of ~30K arbitrary labels selected from one of three categories: integers, character combinations, and words.

For our experiments, we symbol tune Flan-PaLM, the instruction-tuned variants of PaLM. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (Flan-PaLM-62B at 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.

We use a set of ∼300K arbitrary symbols from three categories (integers, character combinations, and words). ∼30K symbols are used during tuning and the rest are held out for evaluation.

Experimental setup

We want to evaluate a model’s ability to perform unseen tasks, so we cannot evaluate on tasks used in symbol tuning (22 datasets) or used during instruction tuning (1.8K tasks). Hence, we choose 11 NLP datasets that were not used during fine-tuning.

In-context learning

In the symbol-tuning procedure, models must learn to reason with in-context examples in order to successfully perform tasks because prompts are modified to ensure that tasks cannot simply be learned from relevant labels or instructions. Symbol-tuned models should perform better in settings where tasks are unclear and require reasoning between in-context examples and their labels. To explore these settings, we define four in-context learning settings that vary the amount of reasoning required between inputs and labels in order to learn the task (based on the availability of instructions/relevant labels)

Depending on the availability of instructions and relevant natural language labels, models may need to do varying amounts of reasoning with in-context examples. When these features are not available, models must reason with the given in-context examples to successfully perform the task.

Symbol tuning improves performance across all settings for models 62B and larger, with small improvements in settings with relevant natural language labels (+0.8% to +4.2%) and substantial improvements in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms FlanPaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform as well as large models on these tasks (effectively saving ∼10X inference compute).

Large-enough symbol-tuned models are better at in-context learning than baselines, especially in settings where relevant labels are not available. Performance is shown as average model accuracy (%) across eleven tasks.

Algorithmic reasoning

We also experiment on algorithmic reasoning tasks from BIG-Bench. There are two main groups of tasks: 1) List functions — identify a transformation function (e.g., remove the last element in a list) between input and output lists containing non-negative integers; and 2) simple turing concepts — reason with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string).

On the list function and simple turing concept tasks, symbol tuning results in an average performance improvement of 18.2% and 15.3%, respectively. Additionally, Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks on average, which is equivalent to a ∼10x reduction in inference compute. These improvements suggest that symbol tuning strengthens the model’s ability to learn in-context for unseen task types, as symbol tuning did not include any algorithmic data.

Symbol-tuned models achieve higher performance on list function tasks and simple turing concept tasks. (A–E): categories of list functions tasks. (F): simple turing concepts task.

Flipped labels

In the flipped-label experiment, labels of in-context and evaluation examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override prior knowledge. Previous work has shown that while pre-trained models (without instruction tuning) can, to some extent, follow flipped labels presented in-context, instruction tuning degraded this ability.

We see that there is a similar trend across all model sizes — symbol-tuned models are much more capable of following flipped labels than instruction-tuned models. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. Additionally, symbol-tuned models achieve similar or better than average performance as pre-training–only models.

Symbol-tuned models are much better at following flipped labels presented in-context than instruction-tuned models are.

Conclusion

We presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based off of the intuition that when models cannot use instructions or relevant labels to determine a presented task, it must do so by instead learning from in-context examples. We tuned four language models using our symbol-tuning procedure, utilizing a tuning mixture of 22 datasets and approximately 30K arbitrary symbols as labels.

We first showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol-tuning procedure. Finally, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning.

Future work

Through symbol tuning, we aim to increase the degree to which models can examine and learn from input–label mappings during in-context learning. We hope that our results encourage further work towards improving language models’ ability to reason over symbols presented in-context.

Acknowledgements

The authors of this post are now part of Google DeepMind. This work was conducted by Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V. Le. We would like to thank our colleagues at Google Research and Google DeepMind for their advice and helpful discussions.

Read More

An open-source gymnasium for machine learning assisted computer architecture design

An open-source gymnasium for machine learning assisted computer architecture design

Computer Architecture research has a long history of developing simulators and tools to evaluate and shape the design of computer systems. For example, the SimpleScalar simulator was introduced in the late 1990s and allowed researchers to explore various microarchitectural ideas. Computer architecture simulators and tools, such as gem5, DRAMSys, and many more have played a significant role in advancing computer architecture research. Since then, these shared resources and infrastructure have benefited industry and academia and have enabled researchers to systematically build on each other’s work, leading to significant advances in the field.

Nonetheless, computer architecture research is evolving, with industry and academia turning towards machine learning (ML) optimization to meet stringent domain-specific requirements, such as ML for computer architecture, ML for TinyML accelerationDNN accelerator datapath, memory controllers, power consumption, security, and privacy. Although prior work has demonstrated the benefits of ML in design optimization, the lack of strong, reproducible baselines hinders fair and objective comparison across different methods and poses several challenges to their deployment. To ensure steady progress, it is imperative to understand and tackle these challenges collectively.

To alleviate these challenges, in “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design”, accepted at ISCA 2023, we introduced ArchGym, which includes a variety of computer architecture simulators and ML algorithms. Enabled by ArchGym, our results indicate that with a sufficiently large number of samples, any of a diverse collection of ML algorithms are capable of finding the optimal set of architecture design parameters for each target problem; no one solution is necessarily better than another. These results further indicate that selecting the optimal hyperparameters for a given ML algorithm is essential for finding the optimal architecture design, but choosing them is non-trivial. We release the code and dataset across multiple computer architecture simulations and ML algorithms.

Challenges in ML-assisted architecture research

ML-assisted architecture research poses several challenges, including:

  1. For a specific ML-assisted computer architecture problem (e.g., finding an optimal solution for a DRAM controller) there is no systematic way to identify optimal ML algorithms or hyperparameters (e.g., learning rate, warm-up steps, etc.). There is a wider range of ML and heuristic methods, from random walk to reinforcement learning (RL), that can be employed for design space exploration (DSE). While these methods have shown noticeable performance improvement over their choice of baselines, it is not evident whether the improvements are because of the choice of optimization algorithms or hyperparameters.

    Thus, to ensure reproducibility and facilitate widespread adoption of ML-aided architecture DSE, it is necessary to outline a systematic benchmarking methodology.

  2. While computer architecture simulators have been the backbone of architectural innovations, there is an emerging need to address the trade-offs between accuracy, speed, and cost in architecture exploration. The accuracy and speed of performance estimation widely varies from one simulator to another, depending on the underlying modeling details (e.g., cycleaccurate vs. MLbased proxy models). While analytical or ML-based proxy models are nimble by virtue of discarding low-level details, they generally suffer from high prediction error. Also, due to commercial licensing, there can be strict limits on the number of runs collected from a simulator. Overall, these constraints exhibit distinct performance vs. sample efficiency trade-offs, affecting the choice of optimization algorithm for architecture exploration.

    It is challenging to delineate how to systematically compare the effectiveness of various ML algorithms under these constraints.

  3. Finally, the landscape of ML algorithms is rapidly evolving and some ML algorithms need data to be useful. Additionally, rendering the outcome of DSE into meaningful artifacts such as datasets is critical for drawing insights about the design space.

    In this rapidly evolving ecosystem, it is consequential to ensure how to amortize the overhead of search algorithms for architecture exploration. It is not apparent, nor systematically studied how to leverage exploration data while being agnostic to the underlying search algorithm.

ArchGym design

ArchGym addresses these challenges by providing a unified framework for evaluating different ML-based search algorithms fairly. It comprises two main components: 1) the ArchGym environment and 2) the ArchGym agent. The environment is an encapsulation of the architecture cost model — which includes latency, throughput, area, energy, etc., to determine the computational cost of running the workload, given a set of architectural parameters — paired with the target workload(s). The ArchGym agent is an encapsulation of the ML algorithm used for the search and consists of hyperparameters and a guiding policy. The hyperparameters are intrinsic to the algorithm for which the model is to be optimized and can significantly influence performance. The policy, on the other hand, determines how the agent selects a parameter iteratively to optimize the target objective.

Notably, ArchGym also includes a standardized interface that connects these two components, while also saving the exploration data as the ArchGym Dataset. At its core, the interface entails three main signals: hardware state, hardware parameters, and metrics. These signals are the bare minimum to establish a meaningful communication channel between the environment and the agent. Using these signals, the agent observes the state of the hardware and suggests a set of hardware parameters to iteratively optimize a (user-defined) reward. The reward is a function of hardware performance metrics, such as performance, energy consumption, etc. 

ArchGym comprises two main components: the ArchGym environment and the ArchGym agent. The ArchGym environment encapsulates the cost model and the agent is an abstraction of a policy and hyperparameters. With a standardized interface that connects these two components, ArchGym provides a unified framework for evaluating different ML-based search algorithms fairly while also saving the exploration data as the ArchGym Dataset.

ML algorithms could be equally favorable to meet user-defined target specifications

Using ArchGym, we empirically demonstrate that across different optimization objectives and DSE problems, at least one set of hyperparameters exists that results in the same hardware performance as other ML algorithms. A poorly selected (random selection) hyperparameter for the ML algorithm or its baseline can lead to a misleading conclusion that a particular family of ML algorithms is better than another. We show that with sufficient hyperparameter tuning, different search algorithms, even random walk (RW), are able to identify the best possible normalized reward. However, note that finding the right set of hyperparameters may require exhaustive search or even luck to make it competitive.

With a sufficient number of samples, there exists at least one set of hyperparameters that results in the same performance across a range of search algorithms. Here the dashed line represents the maximum normalized reward. Cloud-1, cloud-2, stream, and random indicate four different memory traces for DRAMSys (DRAM subsystem design space exploration framework).

Dataset construction and high-fidelity proxy model training

Creating a unified interface using ArchGym also enables the creation of datasets that can be used to design better data-driven ML-based proxy architecture cost models to improve the speed of architecture simulation. To evaluate the benefits of datasets in building an ML model to approximate architecture cost, we leverage ArchGym’s ability to log the data from each run from DRAMSys to create four dataset variants, each with a different number of data points. For each variant, we create two categories: (a) Diverse Dataset (DD), which represents the data collected from different agents (ACO, GA, RW, and BO), and (b) ACO only, which shows the data collected exclusively from the ACO agent, both of which are released along with ArchGym. We train a proxy model on each dataset using random forest regression with the objective to predict the latency of designs for a DRAM simulator. Our results show that:

  1. As we increase the dataset size, the average normalized root mean squared error (RMSE) slightly decreases.
  2. However, as we introduce diversity in the dataset (e.g., collecting data from different agents), we observe 9× to 42× lower RMSE across different dataset sizes.

Diverse dataset collection across different agents using ArchGym interface.
The impact of a diverse dataset and dataset size on the normalized RMSE.

The need for a community-driven ecosystem for ML-assisted architecture research

While, ArchGym is an initial effort towards creating an open-source ecosystem that (1) connects a broad range of search algorithms to computer architecture simulators in an unified and easy-to-extend manner, (2) facilitates research in ML-assisted computer architecture, and (3) forms the scaffold to develop reproducible baselines, there are a lot of open challenges that need community-wide support. Below we outline some of the open challenges in ML-assisted architecture design. Addressing these challenges requires a well coordinated effort and a community driven ecosystem.

Key challenges in ML-assisted architecture design.

We call this ecosystem Architecture 2.0. We outline the key challenges and a vision for building an inclusive ecosystem of interdisciplinary researchers to tackle the long-standing open problems in applying ML for computer architecture research. If you are interested in helping shape this ecosystem, please fill out the interest survey.

Conclusion

ArchGym is an open source gymnasium for ML architecture DSE and enables an standardized interface that can be readily extended to suit different use cases. Additionally, ArchGym enables fair and reproducible comparison between different ML algorithms and helps to establish stronger baselines for computer architecture research problems.

We invite the computer architecture community as well as the ML community to actively participate in the development of ArchGym. We believe that the creation of a gymnasium-type environment for computer architecture research would be a significant step forward in the field and provide a platform for researchers to use ML to accelerate research and lead to new and innovative designs.

Acknowledgements

This blogpost is based on joint work with several co-authors at Google and Harvard University. We would like to acknowledge and highlight Srivatsan Krishnan (Harvard) who contributed several ideas to this project in collaboration with Shvetank Prakash (Harvard), Jason Jabbour (Harvard), Ikechukwu Uchendu (Harvard), Susobhan Ghosh (Harvard), Behzad Boroujerdian (Harvard), Daniel Richins (Harvard), Devashree Tripathy (Harvard), and Thierry Thambe (Harvard).  In addition, we would also like to thank James Laudon, Douglas Eck, Cliff Young, and Aleksandra Faust for their support, feedback, and motivation for this work. We would also like to thank John Guilyard for the animated figure used in this post. Amir Yazdanbakhsh is now a Research Scientist at Google DeepMind and Vijay Janapa Reddi is an Associate Professor at Harvard.

Read More

Google at ACL 2023

Google at ACL 2023

This week, the 61st annual meeting of the Association for Computational Linguistics (ACL), a premier conference covering a broad spectrum of research areas that are concerned with computational approaches to natural language, is taking place online.

As a leader in natural language processing and understanding, and a Diamond Level sponsor of ACL 2023, Google will showcase the latest research in the field with over 50 publications, and active involvement in a variety of workshops and tutorials.

If you’re registered for ACL 2023, we hope that you’ll visit the Google booth to learn more about the projects at Google that go into solving interesting problems for billions of people. You can also learn more about Google’s participation below (Google affiliations in bold).

Board and Organizing Committee

Area chairs include: Dan Garrette

Workshop chairs include: Annie Louis

Publication chairs include: Lei Shu

Program Committee includes: Vinodkumar Prabhakaran, Najoung Kim, Markus Freitag

Spotlight papers

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Septiandri, James Jaya, Kaustubh Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Adilazuarda, Ryan Hadiwijaya, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo Wibowo, Cuk Tho, Ichwanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, Ayu Purwarianti

Optimizing Test-Time Query Representations for Dense Retrieval

Mujeen Sung, Jungsoo Park, Jaewoo Kang, Danqi Chen, Jinhyuk Lee

PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition

Sihao Chen*, Senaka Buthpitiya, Alex Fabrikant, Dan Roth, Tal Schuster

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh*, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

Large Language Models with Controllable Working Memory

Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, Sanjiv Kumar

OpineSum: Entailment-Based Self-Training for Abstractive Opinion Summarization

Annie Louis, Joshua Maynez

RISE: Leveraging Retrieval Techniques for Summarization Evaluation

David Uthus, Jianmo Ni

Follow the Leader(board) with Confidence: Estimating p-Values from a Single Test Set with Item and Response Variance


Shira Wein*, Christopher Homan, Lora Aroyo, Chris Welty

SamToNe: Improving Contrastive Loss for Dual Encoder Retrieval Models with Same Tower Negatives

Fedor Moiseev, Gustavo Hernandez Abrego, Peter Dornbach, Imed Zitouni, Enrique Alfonseca, Zhe Dong

Papers

Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability

Eleftheria Briakou, Colin Cherry, George Foster

Prompting PaLM for Translation: Assessing Strategies and Performance

David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, George Foster

Query Refinement Prompts for Closed-Book Long-Form QA

Reinald Kim Amplayo, Kellie Webster, Michael Collins, Dipanjan Das, Shashi Narayan

To Adapt or to Annotate: Challenges and Interventions for Domain Adaptation in Open-Domain Question Answering

Dheeru Dua*, Emma Strubell, Sameer Singh, Pat Verga

FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation (see blog post)

Parker Riley, Timothy Dozat, Jan A. Botha, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, Noah Constant

Conditional Generation with a Question-Answering Blueprint

Shashi Narayan, Joshua Maynez, Reinald Kim Amplayo, Kuzman Ganchev, Annie Louis, Fantine Huot, Anders Sandholm, Dipanjan Das, Mirella Lapata

Coreference Resolution Through a Seq2Seq Transition-Based System

Bernd Bohnet, Chris Alberti, Michael Collins

Cross-Lingual Transfer with Language-Specific Subnetworks for Low-Resource Dependency Parsing

Rochelle Choenni, Dan Garrette, Ekaterina Shutova

DAMP: Doubly Aligned Multilingual Parser for Task-Oriented Dialogue

William Held*, Christopher Hidey, Fei Liu, Eric Zhu, Rahul Goel, Diyi Yang, Rushin Shah

RARR: Researching and Revising What Language Models Say, Using Language Models

Luyu Gao*, Zhuyun Dai, Panupong Pasupat, Anthony Chen*, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y. Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, Kelvin Guu

Benchmarking Large Language Model Capabilities for Conditional Generation

Joshua Maynez, Priyanka Agrawal, Sebastian Gehrmann

Crosslingual Generalization Through Multitask Fine-Tuning

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel

DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering

Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, Omri Abend

Resolving Indirect Referring Expressions for Entity Selection

Mohammad Javad Hosseini, Filip Radlinski, Silvia Pareti, Annie Louis

SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models

Akshita Jha*, Aida Mostafazadeh Davani, Chandan K Reddy, Shachi Dave, Vinodkumar Prabhakaran, Sunipa Dev

The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks

Nikil Selvam, Sunipa Dev, Daniel Khashabi, Tushar Khot, Kai-Wei Chang

Character-Aware Models Improve Visual Text Rendering

Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, Noah Constant

Cold-Start Data Selection for Better Few-Shot Language Model Fine-Tuning: A Prompt-Based Uncertainty Propagation Approach

Yue Yu, Rongzhi Zhang, Ran Xu, Jieyu Zhang, Jiaming Shen, Chao Zhang

Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment

Roni Rabin, Alexandre Djerbetian, Roee Engelberg, Lidan Hackmon, Gal Elidan, Reut Tsarfaty, Amir Globerson

FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

Chen-Yu Lee, Chun-Liang Li, Hao Zhang, Timothy Dozat, Vincent Perot, Guolong Su, Xiang Zhang, Kihyuk Sohn, Nikolay Glushinev, Renshen Wang, Joshua Ainslie, Shangbang Long, Siyang Qin, Yasuhisa Fujii, Nan Hua, Tomas Pfister

Dialect-Robust Evaluation of Generated Text

Jiao Sun*, Thibault Sellam, Elizabeth Clark, Tu Vu*, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein, Sebastian Gehrmann

MISGENDERED: Limits of Large Language Models in Understanding Pronouns

Tamanna Hossain, Sunipa Dev, Sameer Singh

LAMBADA: Backward Chaining for Automated Reasoning in Natural Language

Mehran Kazemi, Najoung Kim, Deepti Bhatia, Xin Xu, Deepak Ramachandran

LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction

Jeremiah Milbauer*, Annie Louis, Mohammad Javad Hosseini, Alex Fabrikant, Donald Metzler, Tal Schuster

Modular Visual Question Answering via Code Generation (see blog post)

Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer and Huan Sun

Better Zero-Shot Reasoning with Self-Adaptive Prompting

Xingchen Wan*, Ruoxi Sun, Hanjun Dai, Sercan Ö. Arik, Tomas Pfister

Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback

Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Léonard Hussenot, Orgad Keller, Nikola Momchev, Sabela Ramos, Piotr Stanczyk, Nino Vieillard, Olivier Bachem, Gal Elidan, Avinatan Hassidim, Olivier Pietquin, Idan Szpektor

Natural Language to Code Generation in Interactive Data Science Notebooks

Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Oleksandr Polozov, Charles Sutton

Teaching Small Language Models to Reason

Lucie Charlotte Magister*, Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn

Using Domain Knowledge to Guide Dialog Structure Induction via Neural Probabilistic Soft Logic

Connor Pryor*, Quan Yuan, Jeremiah Liu, Mehran Kazemi, Deepak Ramachandran, Tania Bedrax-Weiss, Lise Getoor

A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization

Lining Zhang, Simon Mille, Yufang Hou, Daniel Deutsch, Elizabeth Clark, Yixin Liu, Saad Mahamood, Sebastian Gehrmann, Miruna Clinciu, Khyathi Raghavi Chandu and João Sedoc

Industry Track papers

Federated Learning of Gboard Language Models with Differential Privacy

Zheng Xu, Yanxiang Zhang, Galen Andrew, Christopher Choquette, Peter Kairouz, Brendan McMahan, Jesse Rosenstock, Yuanbo Zhang

KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models

Zhiwei Jia*, Pradyumna Narayana, Arjun Akula, Garima Pruthi, Hao Su, Sugato Basu, Varun Jampani

ACL Findings papers

Multilingual Summarization with Factual Consistency Evaluation

Roee Aharoni, Shashi Narayan, Joshua Maynez, Jonathan Herzig, Elizabeth Clark, Mirella Lapata

Parameter-Efficient Fine-Tuning for Robust Continual Multilingual Learning

Kartikeya Badola, Shachi Dave, Partha Talukdar

FiDO: Fusion-in-Decoder Optimized for Stronger Performance and Faster Inference

Michiel de Jong*, Yury Zemlyanskiy, Joshua Ainslie, Nicholas FitzGerald, Sumit Sanghai, Fei Sha, William Cohen

A Simple, Yet Effective Approach to Finding Biases in Code Generation

Spyridon Mouselinos, Mateusz Malinowski, Henryk Michalewski

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, Jason Wei

QueryForm: A Simple Zero-Shot Form Entity Query Framework

Zifeng Wang*, Zizhao Zhang, Jacob Devlin, Chen-Yu Lee, Guolong Su, Hao Zhang, Jennifer Dy, Vincent Perot, Tomas Pfister

ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval

Yue Yu, Yuchen Zhuang, Rongzhi Zhang, Yu Meng, Jiaming Shen, Chao Zhang

Multilingual Sequence-to-Sequence Models for Hebrew NLP

Matan Eyal, Hila Noga, Roee Aharoni, Idan Szpektor, Reut Tsarfaty

Triggering Multi-Hop Reasoning for Question Answering in Language Models Using Soft Prompts and Random Walks

Kanishka Misra*, Cicero Nogueira dos Santos, Siamak Shakeri

Tutorials

Complex Reasoning in Natural Language
Wenting Zhao, Mor Geva, Bill Yuchen Lin, Michihiro Yasunaga, Aman Madaan, Tao Yu

Generating Text from Language Models
Afra Amini, Ryan Cotterell, John Hewitt, Clara Meister, Tiago Pimentel

Workshops

Simple and Efficient Natural Language Processing (SustaiNLP)

Organizers include: Tal Schuster

Workshop on Online Abuse and Harms (WOAH)

Organizers include: Aida Mostafazadeh Davani

Document-Grounded Dialogue and Conversational Question Answering (DialDoc)

Organizers include: Roee Aharoni

NLP for Conversational AI

Organizers include: Abhinav Rastogi

Computation and Written Language (CAWL)

Organizers include: Kyle Gorman, Brian Roark, Richard Sproat

Computational Morphology and Phonology (SIGMORPHON)

Speakers include: Kyle Gorman

Workshop on Narrative Understanding (WNU)

Organizers include: Elizabeth Clark


* Work done while at Google

Read More

Modular visual question answering via code generation

Modular visual question answering via code generation

Visual question answering (VQA) is a machine learning task that requires a model to answer a question about an image or a set of images. Conventional VQA approaches need a large amount of labeled training data consisting of thousands of human-annotated question-answer pairs associated with images. In recent years, advances in large-scale pre-training have led to the development of VQA methods that perform well with fewer than fifty training examples (few-shot) and without any human-annotated VQA training data (zero-shot). However, there is still a significant performance gap between these methods and state-of-the-art fully supervised VQA methods, such as MaMMUT and VinVL. In particular, few-shot methods struggle with spatial reasoning, counting, and multi-hop reasoning. Furthermore, few-shot methods have generally been limited to answering questions about single images.

To improve accuracy on VQA examples that involve complex reasoning, in “Modular Visual Question Answering via Code Generation,” to appear at ACL 2023, we introduce CodeVQA, a framework that answers visual questions using program synthesis. Specifically, when given a question about an image or set of images, CodeVQA generates a Python program (code) with simple visual functions that allow it to process images, and executes this program to determine the answer. We demonstrate that in the few-shot setting, CodeVQA outperforms prior work by roughly 3% on the COVR dataset and 2% on the GQA dataset.

CodeVQA

The CodeVQA approach uses a code-writing large language model (LLM), such as PALM, to generate Python programs (code). We guide the LLM to correctly use visual functions by crafting a prompt consisting of a description of these functions and fewer than fifteen “in-context” examples of visual questions paired with the associated Python code for them. To select these examples, we compute embeddings for the input question and of all of the questions for which we have annotated programs (a randomly chosen set of fifty). Then, we select questions that have the highest similarity to the input and use them as in-context examples. Given the prompt and question that we want to answer, the LLM generates a Python program representing that question.

We instantiate the CodeVQA framework using three visual functions: (1) query, (2) get_pos, and (3) find_matching_image.

  • Query, which answers a question about a single image, is implemented using the few-shot Plug-and-Play VQA (PnP-VQA) method. PnP-VQA generates captions using BLIP — an image-captioning transformer pre-trained on millions of image-caption pairs — and feeds these into a LLM that outputs the answers to the question.
  • Get_pos, which is an object localizer that takes a description of an object as input and returns its position in the image, is implemented using GradCAM. Specifically, the description and the image are passed through the BLIP joint text-image encoder, which predicts an image-text matching score. GradCAM takes the gradient of this score with respect to the image features to find the region most relevant to the text.
  • Find_matching_image, which is used in multi-image questions to find the image that best matches a given input phrase, is implemented by using BLIP text and image encoders to compute a text embedding for the phrase and an image embedding for each image. Then the dot products of the text embedding with each image embedding represent the relevance of each image to the phrase, and we pick the image that maximizes this relevance.

The three functions can be implemented using models that require very little annotation (e.g., text and image-text pairs collected from the web and a small number of VQA examples). Furthermore, the CodeVQA framework can be easily generalized beyond these functions to others that a user might implement (e.g., object detection, image segmentation, or knowledge base retrieval).

Illustration of the CodeVQA method. First, a large language model generates a Python program (code), which invokes visual functions that represent the question. In this example, a simple VQA method (query) is used to answer one part of the question, and an object localizer (get_pos) is used to find the positions of the objects mentioned. Then the program produces an answer to the original question by combining the outputs of these functions.

Results

The CodeVQA framework correctly generates and executes Python programs not only for single-image questions, but also for multi-image questions. For example, if given two images, each showing two pandas, a question one might ask is, “Is it true that there are four pandas?” In this case, the LLM converts the counting question about the pair of images into a program in which an object count is obtained for each image (using the query function). Then the counts for both images are added to compute a total count, which is then compared to the number in the original question to yield a yes or no answer.

We evaluate CodeVQA on three visual reasoning datasets: GQA (single-image), COVR (multi-image), and NLVR2 (multi-image). For GQA, we provide 12 in-context examples to each method, and for COVR and NLVR2, we provide six in-context examples to each method. The table below shows that CodeVQA improves consistently over the baseline few-shot VQA method on all three datasets.

Method       GQA       COVR       NLVR2      
Few-shot PnP-VQA       46.56       49.06       63.37      
CodeVQA       49.03       54.11       64.04      

Results on the GQA, COVR, and NLVR2 datasets, showing that CodeVQA consistently improves over few-shot PnP-VQA. The metric is exact-match accuracy, i.e., the percentage of examples in which the predicted answer exactly matches the ground-truth answer.

We find that in GQA, CodeVQA’s accuracy is roughly 30% higher than the baseline on spatial reasoning questions, 4% higher on “and” questions, and 3% higher on “or” questions. The third category includes multi-hop questions such as “Are there salt shakers or skateboards in the picture?”, for which the generated program is shown below.

img = open_image("Image13.jpg")
salt_shakers_exist = query(img, "Are there any salt shakers?")
skateboards_exist = query(img, "Are there any skateboards?")
if salt_shakers_exist == "yes" or skateboards_exist == "yes":
    answer = "yes"
else:
    answer = "no"

In COVR, we find that CodeVQA’s gain over the baseline is higher when the number of input images is larger, as shown in the table below. This trend indicates that breaking the problem down into single-image questions is beneficial.

         Number of images      
Method    1    2    3    4    5   
Few-shot PnP-VQA     91.7    51.5    48.3    47.0    46.9   
CodeVQA    75.0    53.3    48.7    53.2    53.4   

Conclusion

We present CodeVQA, a framework for few-shot visual question answering that relies on code generation to perform multi-step visual reasoning. Exciting directions for future work include expanding the set of modules used and creating a similar framework for visual tasks beyond VQA. We note that care should be taken when considering whether to deploy a system such as CodeVQA, since vision-language models like the ones used in our visual functions have been shown to exhibit social biases. At the same time, compared to monolithic models, CodeVQA offers additional interpretability (through the Python program) and controllability (by modifying the prompts or visual functions), which are useful in production systems.

Acknowledgements

This research was a collaboration between UC Berkeley’s Artificial Intelligence Research lab (BAIR) and Google Research, and was conducted by Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, and Dan Klein.

Read More