Today, Google, Microsoft, OpenAI and Anthropic published a joint announcement establishing the Frontier Model Forum.Read More
Google at ICML 2023
Groups across Google actively pursue research in the field of machine learning (ML), ranging from theory and application. We build ML systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, algorithm development, and more. We aim to build a more collaborative ecosystem with the broader ML research community through open-sourcing tools and datasets, publishing our work, and actively participating in conferences.
Google is proud to be a Diamond Sponsor of the 40th International Conference on Machine Learning (ICML 2023), a premier annual conference, which is being held this week in Honolulu, Hawaii. As a leader in ML research, Google has a strong presence at this year’s conference with over 120 accepted papers and active involvement in a number of workshops and tutorials. Google is also proud to be a Platinum Sponsor for both the LatinX in AI and Women in Machine Learning workshops. We look forward to sharing some of our extensive ML research and expanding our partnership with the broader ML research community.
Registered for ICML 2023? We hope you’ll visit the Google booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Visit the @GoogleAI Twitter account to find out about Google booth activities (e.g., demos and Q&A sessions). See Google DeepMind’s blog to learn about their technical participation at ICML 2023.
Take a look below to learn more about the Google research being presented at ICML 2023 (Google affiliations in bold).
Board and Organizing Committee
Board Members include: Corinna Cortes, Hugo Larochelle
Tutorial Chairs include: Hanie Sedghi
Google Research booth activities
Presenters: Bryan Perozzi, Anton Tsitsulin, Brandon Mayer
Title: Unsupervised Graph Embedding @ Google (paper, EXPO workshop)
Tuesday, July 25th at 10:30 AM HST
Presenters: Zheng Xu
Title: Federated Learning of Gboard Language Models with Differential Privacy (paper 1, paper 2, blog post)
Tuesday, July 25th at 3:30 PM HST
Presenters: Thomas Kipf
Title: Self-supervised scene understanding (paper 1, paper 2)
Wednesday, July 26th at 10:30 AM HST
Presenters: Johannes von Oswald, Max Vladymyrov
Title: Transformers learn in-context by gradient descent (paper)
Wednesday, July 26th at 3:30 PM HST
Accepted papers
Scaling Vision Transformers to 22 Billion Parameters (see blog post)
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby
Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias
Best of Both Worlds Policy Optimization
Christoph Dann, Chen-Yu Wei, Julian Zimmert
Inflow, Outflow, and Reciprocity in Machine Learning
Mukund Sundararajan, Walid Krichene
Transformers Learn In-Context by Gradient Descent
Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov
Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models
Luke Vilnis, Yury Zemlyanskiy, Patrick Murray*, Alexandre Passos*, Sumit Sanghai
Differentially Private Hierarchical Clustering with Provable Approximation Guarantees (see blog post)
Jacob Imola*, Alessandro Epasto, Mohammad Mahdian, Vincent Cohen-Addad, Vahab Mirrokni
Multi-Epoch Matrix Factorization Mechanisms for Private Machine Learning
Christopher A. Choquette-Choo, H. Brendan McMahan, Keith Rush, Abhradeep Thakurta
Random Classification Noise Does Not Defeat All Convex Potential Boosters Irrespective of Model Choice
Yishay Mansour, Richard Nock, Robert Williamson
Simplex Random Features
Isaac Reid, Krzysztof Choromanski, Valerii Likhosherstov, Adrian Weller
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova
Mu2SLAM: Multitask, Multilingual Speech and Language Models
Yong Cheng, Yu Zhang, Melvin Johnson, Wolfgang Macherey, Ankur Bapna
Robust Budget Pacing with a Single Sample
Santiago Balseiro, Rachitesh Kumar*, Vahab Mirrokni, Balasubramanian Sivan, Di Wang
A Statistical Perspective on Retrieval-Based Models
Soumya Basu, Ankit Singh Rawat, Manzil Zaheer
Approximately Optimal Core Shapes for Tensor Decompositions
Mehrdad Ghadiri, Matthew Fahrbach, Gang Fu, Vahab Mirrokni
Efficient List-Decodable Regression Using Batches
Abhimanyu Das, Ayush Jain*, Weihao Kong, Rajat Sen
Efficient Training of Language Models Using Few-Shot Learning
Sashank J. Reddi, Sobhan Miryoosefi, Stefani Karp, Shankar Krishnan, Satyen Kale, Seungyeon Kim, Sanjiv Kumar
Fully Dynamic Submodular Maximization Over Matroids
Paul Duetting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam
GFlowNet-EM for Learning Compositional Latent Variable Models
Edward J Hu, Nikolay Malkin, Moksh Jain, Katie Everett, Alexandros Graikos, Yoshua Bengio
Improved Online Learning Algorithms for CTR Prediction in Ad Auctions
Zhe Feng, Christopher Liaw, Zixin Zhou
Large Language Models Struggle to Learn Long-Tail Knowledge
Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, Colin Raffel
Multi-channel Autobidding with Budget and ROI Constraints
Yuan Deng, Negin Golrezaei, Patrick Jaillet, Jason Cheuk Nam Liang, Vahab Mirrokni
Multi-layer Neural Networks as Trainable Ladders of Hilbert Spaces
Zhengdao Chen
On User-Level Private Convex Optimization
Badih Ghazi, Pritish Kamath, Ravi Kumar, Raghu Meka, Pasin Manurangsi, Chiyuan Zhang
PAC Generalization via Invariant Representations
Advait U Parulekar, Karthikeyan Shanmugam, Sanjay Shakkottai
Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice
Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Menard, Mohammad Gheshlaghi Azar, Remi Munos, Olivier Pietquin, Matthieu Geist,Csaba Szepesvari, Wataru Kumagai, Yutaka Matsuo
Speeding Up Bellman Ford via Minimum Violation Permutations
Silvio Lattanzi, Ola Svensson, Sergei Vassilvitskii
Statistical Indistinguishability of Learning Algorithms
Alkis Kalavasis, Amin Karbasi, Shay Moran, Grigoris Velegkas
Test-Time Adaptation with Slot-Centric Models
Mihir Prabhudesai, Anirudh Goyal, Sujoy Paul, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gaurav Aggarwal, Thomas Kipf, Deepak Pathak, Katerina Fragkiadaki>
Algorithms for Bounding Contribution for Histogram Estimation Under User-Level Privacy
Yuhan Liu*, Ananda Theertha Suresh, Wennan Zhu, Peter Kairouz, Marco Gruteser
Bandit Online Linear Optimization with Hints and Queries
Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit
CLUTR: Curriculum Learning via Unsupervised Task Representation Learning
Abdus Salam Azad, Izzeddin Gur, Jasper Emhoff, Nathaniel Alexis, Aleksandra Faust, Pieter Abbeel, Ion Stoica
CSP: Self-Supervised Contrastive Spatial Pre-training for Geospatial-Visual Representations
Gengchen Mai, Ni Lao, Yutong He, Jiaming Song, Stefano Ermon
Ewald-Based Long-Range Message Passing for Molecular Graphs
Arthur Kosmala, Johannes Gasteiger, Nicholas Gao, Stephan Günnemann
Fast (1+ε)-Approximation Algorithms for Binary Matrix Factorization
Ameya Velingker, Maximilian Vötsch, David Woodruff, Samson Zhou
Federated Linear Contextual Bandits with User-Level Differential Privacy
Ruiquan Huang, Huanyu Zhang, Luca Melis, Milan Shen, Meisam Hejazinia, Jing Yang
Investigating the Role of Model-Based Learning in Exploration and Transfer
Jacob C Walker, Eszter Vértes, Yazhe Li, Gabriel Dulac-Arnold, Ankesh Anand, Theophane Weber, Jessica B Hamrick
Label Differential Privacy and Private Training Data Release
Robert Busa-Fekete, Andres Munoz, Umar Syed, Sergei Vassilvitskii
Lifelong Language Pretraining with Distribution-Specialized Experts
Wuyang Chen*, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui
Multi-User Reinforcement Learning with Low Rank Rewards
Dheeraj Mysore Nagaraj, Suhas S Kowshik, Naman Agarwal, Praneeth Netrapalli, Prateek Jain
Multi-View Masked World Models for Visual Robotic Manipulation
Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel
PaLM-E: An Embodied Multimodal Language Model (see blog post)
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter,Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence
Private Federated Learning with Autotuned Compression
Enayat Ullah*, Christopher A. Choquette-Choo, Peter Kairouz, Sewoong Oh
Refined Regret for Adversarial MDPs with Linear Function Approximation
Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert
Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory
Justin Cui, Ruoche Wan, Si Si, Cho-Jui Hsieh
SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance
Amit Attia, Tomer Koren
The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation
Mark Rowland, Yunhao Tang, Clare Lyle, Rémi Munos, Marc G. Bellemare, Will Dabney
Unveiling The Mask of Position-Information Pattern Through the Mist of Image Features
Chieh Hubert Lin, Hung-Yu Tseng, Hsin-Ying Lee, Maneesh Kumar Singh, Ming-Hsuan Yang
User-Level Private Stochastic Convex Optimization with Optimal Rates
Raef Bassily, Ziteng Sun
A Simple Zero-Shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models
James Urquhart Allingham*, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan
Can Large Language Models Reason About Program Invariants?
Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, Pengcheng Yin
Concurrent Shuffle Differential Privacy Under Continual Observation
Jay Tenenbaum, Haim Kaplan, Yishay Mansour, Uri Stemmer
Constant Matters: Fine-Grained Error Bound on Differentially Private Continual Observation
Hendrik Fichtenberger, Monika Henzinger, Jalaj Upadhyay
Cross-Entropy Loss Functions: Theoretical Analysis and Applications
Anqi Mao, Mehryar Mohri, Yutao Zhong
Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation
Orin Levy, Alon Cohen, Asaf Cassel, Yishay Mansour
Fairness in Streaming Submodular Maximization Over a Matroid Constraint
Marwa El Halabi, Federico Fusco, Ashkan Norouzi-Fard, Jakab Tardos, Jakub Tarnawski
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (see blog post)
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, Adam Roberts
Graph Reinforcement Learning for Network Control via Bi-level Optimization
Daniele Gammelli, James Harrison, Kaidi Yang, Marco Pavone, Filipe Rodrigues, Francisco C. Pereira
Learning-Augmented Private Algorithms for Multiple Quantile Release
Mikhail Khodak*, Kareem Amin, Travis Dick, Sergei Vassilvitskii
LegendreTron: Uprising Proper Multiclass Loss Learning
Kevin H Lam, Christian Walder, Spiridon Penev, Richard Nock
Measuring the Impact of Programming Language Distribution
Gabriel Orlanski*, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta*
Multi-task Differential Privacy Under Distribution Skew
Walid Krichene, Prateek Jain, Shuang Song, Mukund Sundararajan, Abhradeep Thakurta, Li Zhang
Muse: Text-to-Image Generation via Masked Generative Transformers
Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, José Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan
On the Convergence of Federated Averaging with Cyclic Client Participation
Yae Jee Cho, Pranay Sharma, Gauri Joshi, Zheng Xu, Satyen Kale, Tong Zhang
Optimal Stochastic Non-smooth Non-convex Optimization Through Online-to-Non-convex Conversion
Ashok Cutkosky, Harsh Mehta, Francesco Orabona
Out-of-Domain Robustness via Targeted Augmentations
Irena Gao, Shiori Sagawa, Pang Wei Koh, Tatsunori Hashimoto, Percy Liang
Polynomial Time and Private Learning of Unbounded Gaussian Mixture Models
Jamil Arbas, Hassan Ashtiani, Christopher Liaw
Pre-computed Memory or On-the-Fly Encoding? A Hybrid Approach to Retrieval Augmentation Makes the Most of Your Compute
Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua Ainslie, Sumit Sanghai, Fei Sha, William W. Cohen
Scalable Adaptive Computation for Iterative Generation
Allan Jabri*, David J. Fleet, Ting Chen
Scaling Spherical CNNs
Carlos Esteves, Jean-Jacques Slotine, Ameesh Makadia
STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition
Yucheng Lu, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa, Amir Yazdanbakhsh
Stratified Adversarial Robustness with Rejection
Jiefeng Chen, Jayaram Raghuram, Jihye Choi, Xi Wu, Yingyu Liang, Somesh Jha
When Does Privileged information Explain Away Label Noise?
Guillermo Ortiz-Jimenez*, Mark Collier, Anant Nawalgaria, Alexander D’Amour, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou
Adaptive Computation with Elastic Input Sequence
Fuzhao Xue*, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, Yang You
Can Neural Network Memorization Be Localized?
Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, Chiyuan Zhang
Controllability-Aware Unsupervised Skill Discovery
Seohong Park, Kimin Lee, Youngwoon Lee, Pieter Abbeel
Efficient Learning of Mesh-Based Physical Simulation with Bi-Stride Multi-Scale Graph Neural Network
Yadi Cao, Menglei Chai, Minchen Li, Chenfanfu Jiang
Federated Heavy Hitter Recovery Under Linear Sketching
Adria Gascon, Peter Kairouz, Ziteng Sun, Ananda Theertha Suresh
Graph Generative Model for Benchmarking Graph Neural Networks
Minji Yoon, Yue Wu, John Palowitch, Bryan Perozzi, Russ Salakhutdinov
H-Consistency Bounds for Pairwise Misranking Loss Surrogates
Anqi Mao, Mehryar Mohri, Yutao Zhong
Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation
Uri Sherman, Tomer Koren, Yishay Mansour
Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames
Ondrej Biza*, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Thomas Kipf
Multi-task Off-Policy Learning from Bandit Feedback
Joey Hong, Branislav Kveton, Manzil Zaheer, Sumeet Katariya, Mohammad Ghavamzadeh
Optimal No-Regret Learning for One-Sided Lipschitz Functions
Paul Duetting, Guru Guruganesh, Jon Schneider, Joshua Ruizhi Wang
Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games
Batuhan Yardim, Semih Cayci, Matthieu Geist, Niao He
Regret Minimization and Convergence to Equilibria in General-Sum Markov Games
Liad Erez, Tal Lancewicki, Uri Sherman, Tomer Koren, Yishay Mansour
Reinforcement Learning Can Be More Efficient with Multiple Rewards
Christoph Dann, Yishay Mansour, Mehryar Mohri
Reinforcement Learning with History-Dependent Dynamic Contexts
Guy Tennenholtz, Nadav Merlis, Lior Shani, Martin Mladenov, Craig Boutlier
User-Defined Event Sampling and Uncertainty Quantification in Diffusion Models for Physical Dynamical Systems
Marc Anton Finzi*, Anudhyan Boral, Andrew Gordon Wilson, Fei Sha, Leonardo Zepeda-Nunez
Discrete Key-Value Bottleneck
Frederik Träuble, Anirudh Goyal, Nasim Rahaman, Michael Curtis Mozer, Kenji Kawaguchi, Yoshua Bengio, Bernhard Schölkopf
DSGD-CECA: Decentralized SGD with Communication-Optimal Exact Consensus Algorithm
Lisang Ding, Kexin Jin, Bicheng Ying, Kun Yuan, Wotao Yin
Exphormer: Sparse Transformers for Graphs
Hamed Shirzad, Ameya Velingker, Balaji Venkatachalam, Danica J. Sutherland, Ali Kemal Sinop
Fast, Differentiable and Sparse Top-k: A Convex Analysis Perspective
Michael Eli Sander*, Joan Puigcerver, Josip Djolonga, Gabriel Peyré, Mathieu Blondel
Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation
Aditya Mate, Bryan Wilder, Aparna Taneja, Milind Tambe
In Search for a Generalizable Method for Source Free Domain Adaptation
Malik Boudiaf*, Tom Denton, Bart van Merrienboer, Vincent Dumoulin, Eleni Triantafillou
Learning Rate Schedules in the Presence of Distribution Shift
Matthew Fahrbach, Adel Javanmard, Vahab Mirrokni, Pratik Worah
Not All Semantics Are Created Equal: Contrastive Self-Supervised Learning with Automatic Temperature Individualization
Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, Tianbao Yang
On the Relationship Between Explanation and Prediction: A Causal View
Amir-Hossein Karimi*, Krikamol Muandet, Simon Kornblith, Bernhard Schölkopf, Been Kim
On the Role of Attention in Prompt-Tuning
Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis
PLay: Parametrically Conditioned Layout Generation Using Latent Diffusion
Chin-Yi Cheng, Forrest Huang, Gang Li, Yang Li
The Power of Learned Locally Linear Models for Nonlinear Policy Optimization
Daniel Pfrommer, Max Simchowitz, Tyler Westenbroek, Nikolai Matni, Stephen Tu
Relevant Walk Search for Explaining Graph Neural Networks
Ping Xiong, Thomas Schnake, Michael Gastegger, Grégoire Montavon, Klaus Robert Muller,Shinichi Nakajima
Repository-Level Prompt Generation for Large Language Models of Code
Disha Shrivastava, Hugo Larochelle, Daniel Tarlow
Robust and Private Stochastic Linear Bandits
Vasileios Charisopoulos*, Hossein Esfandiari, Vahab Mirrokni
Simple Diffusion: End-to-End Diffusion for High Resolution Images
Emiel Hoogeboom, Jonathan Heek, Tim Salimans
Tied-Augment: Controlling Representation Similarity Improves Data Augmentation
Emirhan Kurtulus, Zichao Li, Yann Dauphin, Ekin D. Cubuk
Why Is Public Pre-Training Necessary for Private Model Training?
Arun Ganesh, Mahdi Haghifam*, Milad Nasr, Sewoong Oh, Thomas Steinke, Om Thakkar, Abhradeep Guha Thakurta, Lun Wang
A Connection Between One-Step RL and Critic Regularization in Reinforcement Learning
Benjamin Eysenbach, Matthieu Geist, Sergey Levine, Ruslan Salakhutdinov
Beyond Uniform Lipschitz Condition in Differentially Private Optimization
Rudrajit Das*, Satyen Kale, Zheng Xu, Tong Zhang, Sujay Sanghavi
Efficient Graph Field Integrators Meet Point Clouds
Krzysztof Choromanski, Arijit Sehanobish, Han Lin, Yunfan Zhao, Eli Berger, Tetiana Parshakova, Alvin Pan, David Watkins, Tianyi Zhang, Valerii Likhosherstov, Somnath Basu Roy Chowdhury, Avinava Dubey, Deepali Jain, Tamas Sarlos, Snigdha Chaturvedi, Adrian Weller
Fast as CHITA: Neural Network Pruning with Combinatorial Optimization
Riade Benbaki, Wenyu Chen, Xiang Meng, Hussein Hazimeh, Natalia Ponomareva, Zhe Zhao, Rahul Mazumder
Jump-Start Reinforcement Learning (see blog post)
Ikechukwu Uchendu*, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, Karol Hausman
Learning in POMDPs is Sample-Efficient with Hindsight Observability
Jonathan Lee, Alekh Agarwal, Christoph Dann, Tong Zhang
Low-Variance Gradient Estimation in Unrolled Computation Graphs with ES-Single
Paul Vicol
Masked Trajectory Models for Prediction, Representation, and Control
Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, Aravind Rajeswaran
Overcoming Simplicity Bias in Deep Networks Using a Feature Sieve
Rishabh Tiwari, Pradeep Shenoy
Pairwise Ranking Losses of Click-Through Rates Prediction for Welfare Maximization in Ad Auctions
Boxiang Lyu, Zhe Feng, Zachary Robertson, Sanmi Koyejo
Predictive Flows for Faster Ford-Fulkerson
Sami Davies, Benjamin Moseley, Sergei Vassilvitskii, Yuyan Wang
Scaling Laws for Multilingual Neural Machine Translation
Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat
Sequential Monte Carlo Learning for Time Series Structure Discovery
Feras Saad, Brian Patton, Matthew Douglas Hoffman, Rif A. Saurous, Vikash Mansinghka
Stochastic Gradient Succeeds for Bandits
Jincheng Mei, Zixin Zhong, Bo Dai, Alekh Agarwal, Csaba Szepesvari, Dale Schuurmans
Subset-Based Instance Optimality in Private Estimation
Travis Dick, Alex Kulesza, Ziteng Sun, Ananda Theertha Suresh
The Unreasonable Effectiveness of Few-Shot Learning for Machine Translation
Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, Orhan Firat
Tutorials
Self-Supervised Learning in Vision: from Research Advances to Best Practices
Xinlei Chen, Ishan Misra, Randall Balestriero, Mathilde Caron, Christoph Feichtenhofer, Mark Ibrahim
How to DP-fy ML: A Practical Tutorial to Machine Learning with Differential Privacy (see blog post)
Sergei Vassilvitskii, Natalia Ponomareva, Zheng Xu
Recent Advances in the Generalization Theory of Neural Networks
Tengyu Ma, Alex Damian
EXPO Day workshops
Graph Neural Networks in Tensorflow: A Practical Guide
Workshop Organizers include: Bryan Perozzi, Anton Tsitsulin, Brandon Mayer, Jonathan Halcrow
Google sponsored affinity workshops
LatinX in AI (LAXAI)
Platinum Sponsor
Keynote Speaker: Monica Ribero
Panelist: Yao Qin
Women in Machine Learning (WiML)
Platinum Sponsor
Panelists: Yao Qin
Workshops
Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities
Organizer: Peter Kairouz, Zheng Xu
Speaker: Brendan McMahan
Interpretable Machine Learning in Healthcare (IMLH)
Organizer: Ramin Zabih
Knowledge and Logical Reasoning in the Era of Data-Driven Learning
Organizer: Beliz Günel
The Many Facets of Preference-Based Learning (MFPL)
Organizer: Robert Busa-Fekete, Mohammad Ghavamzadeh
The Synergy of Scientific and Machine Learning Modelling (SynS & ML)
Speaker: Sercan Arik
Theory of Mind in Communicating Agents
Organizer: Pei Zhou
Artificial Intelligence & Human Computer Interaction
Organizer: Yang Li, Forrest Huang
Data-Centric Machine Learning Research (DMLR)
Organizer: Alicia Parrish, Najoung Kim
Speaker: Peter Mattson
Neural Compression: from Information Theory to Applications
Speaker: Johannes Ballé
Panelist: George Toderici
Organizer: Ahmad Beirami
Spurious Correlations, Invariance and Stability (SCIS)
Organizer: Amir Feder
* Work done while at Google
Using societal context knowledge to foster the responsible application of AI
AI-related products and technologies are constructed and deployed in a societal context: that is, a dynamic and complex collection of social, cultural, historical, political and economic circumstances. Because societal contexts by nature are dynamic, complex, non-linear, contested, subjective, and highly qualitative, they are challenging to translate into the quantitative representations, methods, and practices that dominate standard machine learning (ML) approaches and responsible AI product development practices.
The first phase of AI product development is problem understanding, and this phase has tremendous influence over how problems (e.g., increasing cancer screening availability and accuracy) are formulated for ML systems to solve as well many other downstream decisions, such as dataset and ML architecture choice. When the societal context in which a product will operate is not articulated well enough to result in robust problem understanding, the resulting ML solutions can be fragile and even propagate unfair biases.
When AI product developers lack access to the knowledge and tools necessary to effectively understand and consider societal context during development, they tend to abstract it away. This abstraction leaves them with a shallow, quantitative understanding of the problems they seek to solve, while product users and society stakeholders — who are proximate to these problems and embedded in related societal contexts — tend to have a deep qualitative understanding of those same problems. This qualitative–quantitative divergence in ways of understanding complex problems that separates product users and society from developers is what we call the problem understanding chasm.
This chasm has repercussions in the real world: for example, it was the root cause of racial bias discovered by a widely used healthcare algorithm intended to solve the problem of choosing patients with the most complex healthcare needs for special programs. Incomplete understanding of the societal context in which the algorithm would operate led system designers to form incorrect and oversimplified causal theories about what the key problem factors were. Critical socio-structural factors, including lack of access to healthcare, lack of trust in the health care system, and underdiagnosis due to human bias, were left out while spending on healthcare was highlighted as a predictor of complex health need.
To bridge the problem understanding chasm responsibly, AI product developers need tools that put community-validated and structured knowledge of societal context about complex societal problems at their fingertips — starting with problem understanding, but also throughout the product development lifecycle. To that end, Societal Context Understanding Tools and Solutions (SCOUTS) — part of the Responsible AI and Human-Centered Technology (RAI-HCT) team within Google Research — is a dedicated research team focused on the mission to “empower people with the scalable, trustworthy societal context knowledge required to realize responsible, robust AI and solve the world’s most complex societal problems.” SCOUTS is motivated by the significant challenge of articulating societal context, and it conducts innovative foundational and applied research to produce structured societal context knowledge and to integrate it into all phases of the AI-related product development lifecycle. Last year we announced that Jigsaw, Google’s incubator for building technology that explores solutions to threats to open societies, leveraged our structured societal context knowledge approach during the data preparation and evaluation phases of model development to scale bias mitigation for their widely used Perspective API toxicity classifier. Going forward SCOUTS’ research agenda focuses on the problem understanding phase of AI-related product development with the goal of bridging the problem understanding chasm.
Bridging the AI problem understanding chasm
Bridging the AI problem understanding chasm requires two key ingredients: 1) a reference frame for organizing structured societal context knowledge and 2) participatory, non-extractive methods to elicit community expertise about complex problems and represent it as structured knowledge. SCOUTS has published innovative research in both areas.
An illustration of the problem understanding chasm. |
A societal context reference frame
An essential ingredient for producing structured knowledge is a taxonomy for creating the structure to organize it. SCOUTS collaborated with other RAI-HCT teams (TasC, Impact Lab), Google DeepMind, and external system dynamics experts to develop a taxonomic reference frame for societal context. To contend with the complex, dynamic, and adaptive nature of societal context, we leverage complex adaptive systems (CAS) theory to propose a high-level taxonomic model for organizing societal context knowledge. The model pinpoints three key elements of societal context and the dynamic feedback loops that bind them together: agents, precepts, and artifacts.
- Agents: These can be individuals or institutions.
- Precepts: The preconceptions — including beliefs, values, stereotypes and biases — that constrain and drive the behavior of agents. An example of a basic precept is that “all basketball players are over 6 feet tall.” That limiting assumption can lead to failures in identifying basketball players of smaller stature.
- Artifacts: Agent behaviors produce many kinds of artifacts, including language, data, technologies, societal problems and products.
The relationships between these entities are dynamic and complex. Our work hypothesizes that precepts are the most critical element of societal context and we highlight the problems people perceive and the causal theories they hold about why those problems exist as particularly influential precepts that are core to understanding societal context. For example, in the case of racial bias in a medical algorithm described earlier, the causal theory precept held by designers was that complex health problems would cause healthcare expenditures to go up for all populations. That incorrect precept directly led to the choice of healthcare spending as the proxy variable for the model to predict complex healthcare need, which in turn led to the model being biased against Black patients who, due to societal factors such as lack of access to healthcare and underdiagnosis due to bias on average, do not always spend more on healthcare when they have complex healthcare needs. A key open question is how can we ethically and equitably elicit causal theories from the people and communities who are most proximate to problems of inequity and transform them into useful structured knowledge?
![]() |
Illustrative version of societal context reference frame. |
![]() |
Taxonomic version of societal context reference frame. |
Working with communities to foster the responsible application of AI to healthcare
Since its inception, SCOUTS has worked to build capacity in historically marginalized communities to articulate the broader societal context of the complex problems that matter to them using a practice called community based system dynamics (CBSD). System dynamics (SD) is a methodology for articulating causal theories about complex problems, both qualitatively as causal loop and stock and flow diagrams (CLDs and SFDs, respectively) and quantitatively as simulation models. The inherent support of visual qualitative tools, quantitative methods, and collaborative model building makes it an ideal ingredient for bridging the problem understanding chasm. CBSD is a community-based, participatory variant of SD specifically focused on building capacity within communities to collaboratively describe and model the problems they face as causal theories, directly without intermediaries. With CBSD we’ve witnessed community groups learn the basics and begin drawing CLDs within 2 hours.
![]() |
Data 4 Black Lives community members learning system dynamics. |
There is a huge potential for AI to improve medical diagnosis. But the safety, equity, and reliability of AI-related health diagnostic algorithms depends on diverse and balanced training datasets. An open challenge in the health diagnostic space is the dearth of training sample data from historically marginalized groups. SCOUTS collaborated with the Data 4 Black Lives community and CBSD experts to produce qualitative and quantitative causal theories for the data gap problem. The theories include critical factors that make up the broader societal context surrounding health diagnostics, including cultural memory of death and trust in medical care.
The figure below depicts the causal theory generated during the collaboration described above as a CLD. It hypothesizes that trust in medical care influences all parts of this complex system and is the key lever for increasing screening, which in turn generates data to overcome the data diversity gap.
![]() |
![]() |
Causal loop diagram of the health diagnostics data gap |
These community-sourced causal theories are a first step to bridge the problem understanding chasm with trustworthy societal context knowledge.
Conclusion
As discussed in this blog, the problem understanding chasm is a critical open challenge in responsible AI. SCOUTS conducts exploratory and applied research in collaboration with other teams within Google Research, external community, and academic partners across multiple disciplines to make meaningful progress solving it. Going forward our work will focus on three key elements, guided by our AI Principles:
- Increase awareness and understanding of the problem understanding chasm and its implications through talks, publications, and training.
- Conduct foundational and applied research for representing and integrating societal context knowledge into AI product development tools and workflows, from conception to monitoring, evaluation and adaptation.
- Apply community-based causal modeling methods to the AI health equity domain to realize impact and build society’s and Google’s capability to produce and leverage global-scale societal context knowledge to realize responsible AI.
![]() |
SCOUTS flywheel for bridging the problem understanding chasm. |
Acknowledgments
Thank you to John Guilyard for graphics development, everyone in SCOUTS, and all of our collaborators and sponsors.
Google’s AI Red Team: the ethical hackers making AI safer
Today we’re publishing information on Google’s AI Red Team for the first timeRead More
SimPer: Simple self-supervised learning of periodic targets
Learning from periodic data (signals that repeat, such as a heart beat or the daily temperature changes on Earth’s surface) is crucial for many real-world applications, from monitoring weather systems to detecting vital signs. For example, in the environmental remote sensing domain, periodic learning is often needed to enable nowcasting of environmental changes, such as precipitation patterns or land surface temperature. In the health domain, learning from video measurement has shown to extract (quasi-)periodic vital signs such as atrial fibrillation and sleep apnea episodes.
Approaches like RepNet highlight the importance of these types of tasks, and present a solution that recognizes repetitive activities within a single video. However, these are supervised approaches that require a significant amount of data to capture repetitive activities, all labeled to indicate the number of times an action was repeated. Labeling such data is often challenging and resource-intensive, requiring researchers to manually capture gold-standard temporal measurements that are synchronized with the modality of interest (e.g., video or satellite imagery).
Alternatively, self-supervised learning (SSL) methods (e.g., SimCLR and MoCo v2), which leverage a large amount of unlabeled data to learn representations that capture periodic or quasi-periodic temporal dynamics, have demonstrated success in solving classification tasks. However, they overlook the intrinsic periodicity (i.e., the ability to identify if a frame is part of a periodic process) in data and fail to learn robust representations that capture periodic or frequency attributes. This is because periodic learning exhibits characteristics that are distinct from prevailing learning tasks.
To address these challenges, in “SimPer: Simple Self-Supervised Learning of Periodic Targets”, published at the eleventh International Conference on Learning Representations (ICLR 2023), we introduced a self-supervised contrastive framework for learning periodic information in data. Specifically, SimPer leverages the temporal properties of periodic targets using temporal self-contrastive learning, where positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. We propose periodic feature similarity that explicitly defines how to measure similarity in the context of periodic learning. Moreover, we design a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). Next, we demonstrate that SimPer effectively learns period feature representations compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts. Finally, we are excited to release the SimPer code repo with the research community.
The SimPer framework
SimPer introduces a temporal self-contrastive learning framework. Positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. For temporal video examples, periodicity-invariant changes are cropping, rotation or flipping, whereas periodicity-variant changes involve increasing or decreasing the speed of a video.
To explicitly define how to measure similarity in the context of periodic learning, SimPer proposes periodic feature similarity. This construction allows us to formulate training as a contrastive learning task. A model can be trained with data without any labels and then fine-tuned if necessary to map the learned features to specific frequency values.
Given an input sequence x, we know there’s an underlying associated periodic signal. We then transform x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo- speed or frequency labels for the unlabeled input x.
Conventional similarity measures such as cosine similarity emphasize strict proximity between two feature vectors, and are sensitive to index shifted features (which represent different time stamps), reversed features, and features with changed frequencies. In contrast, periodic feature similarity should be high for samples with small temporal shifts and or reversed indexes, while capturing a continuous similarity change when the feature frequency varies. This can be achieved via a similarity metric in the frequency domain, such as the distance between two Fourier transforms.
To harness the intrinsic continuity of augmented samples in the frequency domain, SimPer designs a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). This makes it suitable for regression tasks, where the goal is to recover a continuous signal, such as a heart beat.
Results
To evaluate SimPer’s performance, we benchmarked it against state-of-the-art SSL schemes (e.g., SimCLR, MoCo v2, BYOL, CVRL) on a set of six diverse periodic learning datasets for common real-world tasks in human behavior analysis, environmental remote sensing, and healthcare. Specifically, below we present results on heart rate measurement and exercise repetition counting from video. The results show that SimPer outperforms the state-of-the-art SSL schemes across all six datasets, highlighting its superior performance in terms of data efficiency, robustness to spurious correlations, and generalization to unseen targets.
Here we show quantitative results on two representative datasets using SimPer pre-trained using various SSL methods and fine-tuned on the labeled data. First, we pre-train SimPer using the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) dataset, a human photoplethysmography and heart rate prediction dataset, and compare its performance to state-of-the-art SSL methods. We observe that SimPer outperforms SimCLR, MoCo v2, BYOL, and CVRL methods. The results on the human action counting dataset, Countix, further confirm the benefits of SimPer over others methods as it notably outperforms the supervised baseline. For the feature evaluation results and performance on other datasets, please refer to the paper.
![]() |
Results of SimCLR, MoCo v2, BYOL, CVRL and SimPer on the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) and Countix datasets. Heart rate and repetition count performance is reported as mean absolute error (MAE). |
Conclusion and applications
We present SimPer, a self-supervised contrastive framework for learning periodic information in data. We demonstrate that by combining a temporal self-contrastive learning framework, periodicity-invariant and periodicity-variant augmentations, and continuous periodic feature similarity, SimPer provides an intuitive and flexible approach for learning strong feature representations for periodic signals. Moreover, SimPer can be applied to various fields, ranging from environmental remote sensing to healthcare.
Acknowledgements
We would like to thank Yuzhe Yang, Xin Liu, Ming-Zher Poh, Jiang Wu, Silviu Borac, and Dina Katabi for their contributions to this work.
Why inclusive sets of images help us make better products
We worked with TONL, a stock image company, to create more representative datasets to help teams build more inclusive products.Read More
Google.org’s new grant to help track permafrost thaw
A new $5 million grant aims to help Woodwell Climate Research Center track Arctic permafrost thaw in near real-time.Read More
6 mujeres lideran la lucha contra el cambio climático
Homenajeamos a Eunice Newton Foot, científica pionera de la ciencia climática, y a 6 organizaciones beneficiarias de Google.org dirigidas por mujeres que construyen un f…Read More
6 women leading the fight against climate change
We’re celebrating climate science pioneer Eunice Newton Foote and six women-led Google.org grantees building a more sustainable future.Read More
Symbol tuning improves in-context learning in language models
A key feature of human intelligence is that humans can learn to perform new tasks by reasoning using only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via in-context learning. Language models, however, are still sensitive to the way that prompts are given, indicating that they are not reasoning in a robust manner. For instance, language models often require heavy prompt engineering or phrasing tasks as instructions, and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown incorrect labels.
In “Symbol tuning improves in-context learning in language models”, we propose a simple fine-tuning procedure that we call symbol tuning, which can improve in-context learning by emphasizing input–label mappings. We experiment with symbol tuning across Flan-PaLM models and observe benefits across various settings.
- Symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels.
- Symbol-tuned models are much stronger at algorithmic reasoning tasks.
- Finally, symbol-tuned models show large improvements in following flipped-labels presented in-context, meaning that they are more capable of using in-context information to override prior knowledge.
Motivation
Instruction tuning is a common fine-tuning method that has been shown to improve performance and allow models to better follow in-context examples. One shortcoming, however, is that models are not forced to learn to use the examples because the task is redundantly defined in the evaluation example via instructions and natural language labels. For example, on the left in the figure above, although the examples can help the model understand the task (sentiment analysis), they are not strictly necessary since the model could ignore the examples and just read the instruction that indicates what the task is.
In symbol tuning, the model is fine-tuned on examples where the instructions are removed and natural language labels are replaced with semantically-unrelated labels (e.g., “Foo,” “Bar,” etc.). In this setup, the task is unclear without looking at the in-context examples. For example, on the right in the figure above, multiple in-context examples would be needed to figure out the task. Because symbol tuning teaches the model to reason over the in-context examples, symbol-tuned models should have better performance on tasks that require reasoning between in-context examples and their labels.
![]() |
Datasets and task types used for symbol tuning. |
Symbol-tuning procedure
We selected 22 publicly-available natural language processing (NLP) datasets that we use for our symbol-tuning procedure. These tasks have been widely used in the past, and we only chose classification-type tasks since our method requires discrete labels. We then remap labels to a random label from a set of ~30K arbitrary labels selected from one of three categories: integers, character combinations, and words.
For our experiments, we symbol tune Flan-PaLM, the instruction-tuned variants of PaLM. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (Flan-PaLM-62B at 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.
![]() |
We use a set of ∼300K arbitrary symbols from three categories (integers, character combinations, and words). ∼30K symbols are used during tuning and the rest are held out for evaluation. |
Experimental setup
We want to evaluate a model’s ability to perform unseen tasks, so we cannot evaluate on tasks used in symbol tuning (22 datasets) or used during instruction tuning (1.8K tasks). Hence, we choose 11 NLP datasets that were not used during fine-tuning.
In-context learning
In the symbol-tuning procedure, models must learn to reason with in-context examples in order to successfully perform tasks because prompts are modified to ensure that tasks cannot simply be learned from relevant labels or instructions. Symbol-tuned models should perform better in settings where tasks are unclear and require reasoning between in-context examples and their labels. To explore these settings, we define four in-context learning settings that vary the amount of reasoning required between inputs and labels in order to learn the task (based on the availability of instructions/relevant labels)
Symbol tuning improves performance across all settings for models 62B and larger, with small improvements in settings with relevant natural language labels (+0.8% to +4.2%) and substantial improvements in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms FlanPaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform as well as large models on these tasks (effectively saving ∼10X inference compute).
Algorithmic reasoning
We also experiment on algorithmic reasoning tasks from BIG-Bench. There are two main groups of tasks: 1) List functions — identify a transformation function (e.g., remove the last element in a list) between input and output lists containing non-negative integers; and 2) simple turing concepts — reason with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string).
On the list function and simple turing concept tasks, symbol tuning results in an average performance improvement of 18.2% and 15.3%, respectively. Additionally, Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks on average, which is equivalent to a ∼10x reduction in inference compute. These improvements suggest that symbol tuning strengthens the model’s ability to learn in-context for unseen task types, as symbol tuning did not include any algorithmic data.
![]() |
Symbol-tuned models achieve higher performance on list function tasks and simple turing concept tasks. (A–E): categories of list functions tasks. (F): simple turing concepts task. |
Flipped labels
In the flipped-label experiment, labels of in-context and evaluation examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override prior knowledge. Previous work has shown that while pre-trained models (without instruction tuning) can, to some extent, follow flipped labels presented in-context, instruction tuning degraded this ability.
We see that there is a similar trend across all model sizes — symbol-tuned models are much more capable of following flipped labels than instruction-tuned models. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. Additionally, symbol-tuned models achieve similar or better than average performance as pre-training–only models.
![]() |
Symbol-tuned models are much better at following flipped labels presented in-context than instruction-tuned models are. |
Conclusion
We presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based off of the intuition that when models cannot use instructions or relevant labels to determine a presented task, it must do so by instead learning from in-context examples. We tuned four language models using our symbol-tuning procedure, utilizing a tuning mixture of 22 datasets and approximately 30K arbitrary symbols as labels.
We first showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol-tuning procedure. Finally, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning.
Future work
Through symbol tuning, we aim to increase the degree to which models can examine and learn from input–label mappings during in-context learning. We hope that our results encourage further work towards improving language models’ ability to reason over symbols presented in-context.
Acknowledgements
The authors of this post are now part of Google DeepMind. This work was conducted by Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V. Le. We would like to thank our colleagues at Google Research and Google DeepMind for their advice and helpful discussions.