Enabling AI-driven health advances without sacrificing patient privacy

There’s a lot of excitement at the intersection of artificial intelligence and health care. AI has already been used to improve disease treatment and detection, discover promising new drugs, identify links between genes and diseases, and more.

By analyzing large datasets and finding patterns, new algorithms have the potential to help patients — AI researchers just need access to the right data to train and test those algorithms. Hospitals, understandably, are hesitant to share sensitive patient information with research teams. When they do share data, it’s difficult to verify that researchers are only using the data they need and deleting it after they’re done.

Secure AI Labs (SAIL) is addressing those problems with a technology that lets AI algorithms run on encrypted datasets that never leave the data owner’s system. Health care organizations can control how their datasets are used, while researchers can protect the confidentiality of their models and search queries. Neither party needs to see the data or the model to collaborate.

SAIL’s platform can also combine data from multiple sources, creating rich insights that fuel more effective algorithms.

“You shouldn’t have to schmooze with hospital executives for five years before you can run your machine learning algorithm,” says SAIL co-founder and MIT Professor Manolis Kellis, who co-founded the company with CEO Anne Kim ’16, SM ’17. “Our goal is to help patients, to help machine learning scientists, and to create new therapeutics. We want new algorithms — the best algorithms — to be applied to the biggest possible data set.”

SAIL has already partnered with hospitals and life science companies to unlock anonymized data for researchers. In the next year, the company hopes to be working with about half of the top 50 academic medical centers in the country.

Unleashing AI’s full potential

As an undergraduate at MIT studying computer science and molecular biology, Kim worked with researchers in the Computer Science and Artificial Intelligence Laboratory (CSAIL) to analyze data from clinical trials, gene association studies, hospital intensive care units, and more.

“I realized there is something severely broken in data sharing, whether it was hospitals using hard drives, ancient file transfer protocol, or even sending stuff in the mail,” Kim says. “It was all just not well-tracked.”

Kellis, who is also a member of the Broad Institute of MIT and Harvard, has spent years establishing partnerships with hospitals and consortia across a range of diseases including cancers, heart disease, schizophrenia, and obesity. He knew that smaller research teams would struggle to get access to the same data his lab was working with.

In 2017, Kellis and Kim decided to commercialize technology they were developing to allow AI algorithms to run on encrypted data.

In the summer of 2018, Kim participated in the delta v startup accelerator run by the Martin Trust Center for MIT Entrepreneurship. The founders also received support from the Sandbox Innovation Fund and the Venture Mentoring Service, and made various early connections through their MIT network.

To participate in SAIL’s program, hospitals and other health care organizations make parts of their data available to researchers by setting up a node behind their firewall. SAIL then sends encrypted algorithms to the servers where the datasets reside in a process called federated learning. The algorithms crunch the data locally in each server and transmit the results back to a central model, which updates itself. No one — not the researchers, the data owners, or even SAIL — has access to the models or the datasets.

The approach allows a much broader set of researchers to apply their models to large datasets. To further engage the research community, Kellis’ lab at MIT has begun holding competitions in which it gives access to datasets in areas like protein function and gene expression, and challenges researchers to predict results.

“We invite machine learning researchers to come and train on last year’s data and predict this year’s data,” says Kellis. “If we see there’s a new type of algorithm that is performing best in these community-level assessments, people can adopt it locally at many different institutions and level the playing field. So, the only thing that matters is the quality of your algorithm rather than the power of your connections.”

By enabling a large number of datasets to be anonymized into aggregate insights, SAIL’s technology also allows researchers to study rare diseases, in which small pools of relevant patient data are often spread out among many institutions. That fragmentation has historically made it difficult to apply AI models to the data.

“We’re hoping that all of these datasets will eventually be open,” Kellis says. “We can cut across all the silos and enable a new era where every patient with every rare disorder across the entire world can come together in a single keystroke to analyze data.”

Enabling the medicine of the future

To work with large amounts of data around specific diseases, SAIL has increasingly sought to partner with patient associations and consortia of health care groups, including an international health care consulting company and the Kidney Cancer Association. The partnerships also align SAIL with patients, the group they’re most trying to help.

Overall, the founders are happy to see SAIL solving problems they faced in their labs for researchers around the world.

“The right place to solve this is not an academic project. The right place to solve this is in industry, where we can provide a platform not just for my lab but for any researcher,” Kellis says. “It’s about creating an ecosystem of academia, researchers, pharma, biotech, and hospital partners. I think it’s the blending of all of these different areas that will make that vision of medicine of the future become a reality.”

Read More

The ML Glossary: Five years of new language

Over guacamole and corn chips at a party, a friend mentions that her favorite phone game uses augmented reality. Another friend points her phone at the host and shouts, “Watch out—a t-rex is sneaking up behind you.” Eager to join the conversation, you blurt, “My blender has an augmented reality setting.”

If only you had looked up augmented reality in Google’s Machine Learning Glossary, which defines over 460 terms related to artificial intelligence, you’d know what the heck your friends are talking about. If you’ve ever wondered what a neural network is, or if you chronically confuse the negative class with the positive class at the doctor’s office (“Wait, the negative class means I’m healthy?”), the Glossary has you covered.

AI is increasingly intertwined with our future, and as the language of AI sneaks its way into household conversation, learning AI’s specialized vocabulary could be helpful to understanding many key technological advances — or what’s being said at a guacamole party.

A team of technical writers and AI experts produces the definitions. Sure, the definitions have to be technically accurate, but they also have to be as clear as possible. Clarity is rare in a field as notoriously complicated as artificial intelligence, which is why we created Google’s Machine Learning Glossary in 2016. Since then, we’ve published nine full revisions, providing almost 300 additional terms.

Good glossaries are quicksand for the curious. You’ll come for the accuracy, stay for the class-imbalanced datasets, and then find yourself an hour later embedded in overfitting. It’s fun, educational, and blessedly blender free.

Read More

End-to-end tinyML audio classification with the Raspberry Pi RP2040

A guest post by Sandeep Mistry, Arm

Some tools you’ll need for this project (learn more below!)

Introduction

Machine learning enables developers and engineers to unlock new capabilities in their applications. Instead of explicitly defining instructions and rules for a computer to execute, you can collect large amounts of data for a classification task that your application requires, and train an ML model to learn from the patterns in the data.

Training typically happens in the cloud on computers equipped with one or more GPUs. Once a model has been trained, depending on its size, it can be deployed for inference on a wide range of devices. These devices range from large computers in the cloud with gigabytes of memory, to tiny microcontrollers (or MCUs) which typically have just kilobytes of memory.

Microcontrollers are low-power, self-contained, cost-effective computer systems that are embedded in devices that you use everyday, such as your microwave, electric toothbrush, or smart door lock. Microcontroller based systems typically interact with their surrounding environment via one or more sensors (think buttons, microphones, motion sensors) and perform an action using one or more actuators (think LEDs, motors, speakers).

Microcontrollers also offer privacy advantages, and can perform inference locally on the device, without needing to send any data to the cloud. This can have power advantages too for devices running off batteries.

In this article, we will demonstrate how an Arm Cortex-M based microcontroller can be used for local on-device ML to detect audio events from its surrounding environment. This is a tutorial-style article, and we’ll guide you through training a TensorFlow based audio classification model to detect a fire alarm sound.

We’ll show you how to use TensorFlow Lite for Microcontrollers with Arm CMSIS-NN accelerated kernels to deploy the ML model to an Arm Cortex-M0+ based microcontroller board for local on-device ML inference. Arm’s CMSIS-DSP library, which provides optimized Digital Signal Processing (DSP) function implementations for Arm Cortex-M processors, will also be used to extract features from the real-time audio data before inference.

While this guide focuses on detecting a fire alarm sound, it can be adapted for other sound classification tasks. You may also need to adapt the feature extraction stages and/or adjust ML model architecture for your use case.

An interactive version of this tutorial is available on Google Colab and all technical assets for this guide can be found on GitHub.

What you need to get started

Development Environment

Hardware

You’ll need one of the following development boards, which are based on Raspberry Pi’s RP2040 MCU chip, released in early 2021.

SparkFun RP2040 MicroMod and MicroMod ML Carrier

This board is great for folks new to electronics and microcontrollers. It does not require a soldering iron, soldering skills, or knowledge of how to wire up breadboards.

Image of the SparkFun RP2040 MicroMod and MicroMod ML Carrier

Raspberry Pi Pico and PDM microphone board

This option is great if you know how to solder (or would like to learn). It requires a soldering iron and knowledge of how to wire a breadboard with electronic components. You’ll need:

Image of Raspberry Pi Pico and PDM microphone board

Both of the options above will allow you to collect real-time 16 kHz audio from a digital microphone and process the audio signal on the development board’s Arm Cortex-M0+ processor, which operates at 125 MHz. The application running on the Arm Cortex-M0+ will have a Digital Signal Processing (DSP) stage to extract features from the audio signal. The extracted features will then be fed into a neural network to perform a classification task to determine if a fire alarm sound is present in the board’s environment.

Dataset

We will start by training a sound classifier (for many events) with TensorFlow using the ESC-50: Dataset for Environmental Sound Classification. After training on this broad dataset, we will use transfer learning to fine-tune it for our specific audio classification task.

The ESC-50 dataset contains 50 types of sounds, with 40 audio files per category, each five seconds long. Each audio file will be split into one-second soundbites, and any soundbites that contain pure silence will be discarded.
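
As a rough sketch of that splitting step (plain NumPy, with an assumed silence threshold rather than the exact value used in the notebook):

import numpy as np

SAMPLE_RATE = 16000

def split_into_soundbites(waveform, silence_threshold=1e-3):
    # waveform: 1-D NumPy array of float samples at 16 kHz
    soundbites = []
    for start in range(0, len(waveform) - SAMPLE_RATE + 1, SAMPLE_RATE):
        chunk = waveform[start:start + SAMPLE_RATE]        # one-second slice
        if np.max(np.abs(chunk)) > silence_threshold:      # drop pure silence
            soundbites.append(chunk)
    return soundbites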

A sample waveform from the data set of a dog barking.

Spectrograms

Rather than passing the time-series data directly into our TensorFlow model, we will transform the audio data into an audio spectrogram representation. This will create a 2D representation of the audio signal’s frequency content over time.

The input audio signal will have a sampling rate of 16 kHz, which means one second of audio contains 16,000 samples. Using TensorFlow’s tf.signal.stft(…) function, we can transform a one-second audio signal into a 2D tensor representation. We will choose a frame length of 256 and a frame step of 128, so the output of this feature extraction stage will be a tensor with a shape of (124, 129).
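
To make that concrete, here is a minimal sketch of this feature extraction step (the zero-filled waveform is only a placeholder for a real one-second clip):

import tensorflow as tf

# One second of 16 kHz audio -> 16,000 samples (placeholder signal)
waveform = tf.zeros([16000], dtype=tf.float32)

# Short-time Fourier transform with the parameters described above
stft = tf.signal.stft(waveform, frame_length=256, frame_step=128)
spectrogram = tf.abs(stft)                   # magnitude spectrogram
spectrogram = spectrogram[..., tf.newaxis]   # add a channel axis

print(spectrogram.shape)                     # (124, 129, 1)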

An audio spectrogram representation of a dog barking.

The ML model

Now that we have the features extracted from the audio signal, we can create a model using TensorFlow’s Keras API. You can find the complete code linked above. The model will consist of 8 layers (a hedged Keras sketch follows the list):

  1. An input layer.
  2. A preprocessing layer that will resize the input tensor from 124x129x1 to 32x32x1.
  3. A normalization layer that will scale the input values between -1 and 1.
  4. A 2D convolution layer with 8 filters, a kernel size of 8×8, a stride of 2×2, and a ReLU activation function.
  5. A 2D max pooling layer with a size of 2×2.
  6. A flatten layer to flatten the 2D data to 1D.
  7. A dropout layer that will help reduce overfitting during training.
  8. A dense layer with 50 outputs and a softmax activation function, which outputs the likelihood (between 0 and 1) of each sound category.
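
As a hedged sketch of this architecture in Keras (layer arguments such as the dropout rate and the normalization statistics are assumptions; the exact code is in the Colab notebook linked above):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(124, 129, 1)),                  # 1. input layer
    layers.Resizing(32, 32),                              # 2. resize to 32x32x1
    layers.Normalization(mean=0.0, variance=1.0),         # 3. scale inputs (placeholder stats)
    layers.Conv2D(8, kernel_size=(8, 8), strides=(2, 2),
                  activation="relu"),                     # 4. 2D convolution
    layers.MaxPooling2D(pool_size=(2, 2)),                # 5. 2D max pooling
    layers.Flatten(),                                     # 6. flatten to 1D
    layers.Dropout(0.25),                                 # 7. dropout (rate assumed)
    layers.Dense(50, activation="softmax"),               # 8. 50 sound categories
])

model.summary()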

The model summary can be found below:

Image of model summary

Notice that this model only has about 15K parameters (this is quite small!)

Fine tuning

Now we will use transfer learning and change the classification head (the last Dense layer) of the model to train a binary classification model for fire alarm sounds. We have collected 10 fire alarm clips from freesound.org and BigSoundBank.com. Background noise clips from the SpeechCommands dataset will be used for non-fire alarm sounds. This dataset is small but enough for us to get started. Data augmentation techniques will be used to supplement the training data we’ve collected.

For real-world applications, it’s important to collect a much larger dataset (you can learn more about best practices on TensorFlow’s Responsible AI website).

Data Augmentation

Data augmentation is a set of techniques used to increase the size of a dataset. This is done by slightly modifying samples from the dataset or by creating synthetic data. Since we are working with audio, we will create a few functions to augment the samples. We will use three techniques:

  1. Adding white noise to the audio samples.
  2. Adding random silence to the audio.
  3. Mixing two audio samples together.

As well as increasing the size of the dataset, data augmentation also helps to reduce overfitting by training the model on different (not perfect) data samples. For example, on a microcontroller you are unlikely to have perfect, high-quality audio, so a technique like adding white noise can help the model work in situations where your microphone occasionally picks up noise.
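
Here is a minimal sketch of what these three augmentation functions might look like, assuming 1-D float32 waveform tensors; the noise scale and silence fraction are assumptions, not the values from the Colab notebook:

import tensorflow as tf

def add_white_noise(audio, noise_scale=0.1):
    # Add Gaussian noise to the waveform.
    noise = tf.random.normal(tf.shape(audio), stddev=noise_scale)
    return audio + noise

def add_random_silence(audio, max_fraction=0.1):
    # Zero out a random contiguous chunk of the waveform.
    n = tf.shape(audio)[0]
    length = tf.cast(tf.cast(n, tf.float32) * max_fraction, tf.int32)
    start = tf.random.uniform([], minval=0, maxval=n - length, dtype=tf.int32)
    indices = tf.range(n)
    keep = tf.logical_or(indices < start, indices >= start + length)
    return tf.where(keep, audio, tf.zeros_like(audio))

def mix_audio(audio_a, audio_b, weight=0.5):
    # Blend two clips of the same length.
    return weight * audio_a + (1.0 - weight) * audio_b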

A gif showing how data augmentation slightly changes the spectrogram by adding noise (watch it closely, it can be a bit hard to see).

Feature Extraction

TensorFlow Lite for Microcontrollers (TFLu) provides a subset of TensorFlow operations, so we are unable to use the tf.signal.stft(…) API we’ve used for feature extraction of the baseline model on our MCU. However, we can leverage Arm’s CMSIS-DSP library to generate spectrograms on the MCU. CMSIS-DSP contains support for both floating-point and fixed-point DSP operations which are optimized for Arm Cortex-M processors, including the Arm Cortex-M0+ that we will be deploying the ML model to. The Arm Cortex-M0+ does not contain a floating-point unit (FPU), so it is better to use a 16-bit fixed-point DSP based feature extraction pipeline on the board.

We can leverage CMSIS-DSP’s Python Wrapper in the notebook to perform the same operations in our training pipeline using 16-bit fixed-point math. At a high level, we can replicate the TensorFlow STFT API with the following CMSIS-DSP based operations:

  1. Manually creating a Hanning Window of length 256 using the Hanning Window formula along with CMSIS-DSP’s arm_cos_f32 API.
    The Hanning window formula: w(n) = 0.5 × (1 − cos(2πn / (N − 1))), for n = 0 … N − 1.
  2. Creating a CMSIS-DSP arm_rfft_instance_q15 instance and initializing it using CMSIS-DSP’s arm_rfft_init_q15 API.
  3. Looping through the audio data 256 samples at a time, with a stride of 128 (this matches the parameters we’ve passed into the TF stft API).
    1. Multiplying the 256 samples by the Hanning Window, using CMSIS-DSP’s arm_mult_q15 API
    2. Calculating the FFT of the output of the previous step, using CMSIS-DSP’s arm_rfft_q15 API
    3. Calculating the magnitude of the previous step, using CMSIS-DSP’s arm_cmplx_mag_q15 API
  4. Each audio soundbite’s FFT magnitude represents one column of the spectrogram.
  5. Since our baseline model expects floating-point input rather than the 16-bit fixed-point values we are using, the CMSIS-DSP arm_q15_to_float API can be used to convert the spectrogram data from 16-bit fixed-point values to floating-point values for training.

The complete Python code for this is a bit long, but can be found in the “Transfer Learning -> Load dataset” section of the Google Colab notebook.
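
As a plain floating-point illustration of those steps (this is NumPy, not the fixed-point CMSIS-DSP calls used in the notebook):

import numpy as np

FRAME_LEN, STRIDE = 256, 128

# Hanning window of length 256
hanning = 0.5 * (1 - np.cos(2 * np.pi * np.arange(FRAME_LEN) / (FRAME_LEN - 1)))

def spectrogram(audio):
    # audio: 1-D array of 16,000 samples (one second at 16 kHz)
    columns = []
    for start in range(0, len(audio) - FRAME_LEN + 1, STRIDE):
        frame = audio[start:start + FRAME_LEN] * hanning   # window the frame
        fft = np.fft.rfft(frame)                           # real FFT
        columns.append(np.abs(fft))                        # magnitude -> one column
    return np.stack(columns)                               # shape: (124, 129)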

Waveform and audio spectrogram of a smoke alarm sound.

For an in-depth description of how to create audio spectrograms using fixed-point operations with CMSIS-DSP, please see the Towards Data Science guide “Fixed-point DSP for Data Scientists.”

Loading the baseline model and changing the classification head

The model we previously trained on the ESC-50 dataset predicted the presence of 50 sound types, which resulted in the final dense layer of the model having 50 outputs. The new model we would like to create is a binary classifier, and needs to have a single output value.

We will load the baseline model, and swap out the final dense layer to match our needs:

# We need a new head with one neuron.
model_body = tf.keras.Model(inputs=model.input, outputs=model.layers[-2].output)

classifier_head = tf.keras.layers.Dense(1, activation="sigmoid")(model_body.output)

fine_tune_model = tf.keras.Model(model_body.input, classifier_head)

This results in the following model.summary():

Screenshot of model summary

Transfer Learning

Transfer Learning is the process of retraining a model that has been developed for a task to complete a new similar task. The idea is that the model has learned transferable “skills” and the weights and biases can be used in other models as a starting point.

As humans we use transfer learning too. The skills you developed to learn to walk could also be used to learn to run later on.

In a neural network, the first few layers of a model start to perform a “feature extraction” such as finding shapes, edges and colours. The layers later on are used as classifiers; they take the extracted features and classify them.

Because of this, we can assume the first few layers have learned quite general feature extraction techniques that can be applied to similar tasks, and so we can freeze all these layers and use them on a new task in the future. The classifier layer will need to be trained based on the new task.

To do this, we break the process into two steps:

  1. Freeze the “backbone” of the model and train the head with a fairly high learning rate. We slowly reduce the learning rate.
  2. Unfreeze the “backbone” and fine-tune the model with a low learning rate.

To freeze a layer in TensorFlow we can set layer.trainable=False. Let’s loop through all the layers and do this:

for layer in fine_tune_model.layers:
    layer.trainable = False

and now unfreeze the last layer (the head):

fine_tune_model.layers[-1].trainable = True

We can now train the model using a binary crossentropy loss function. Keras callbacks for early stopping (to avoid overfitting) and a dynamic learning rate scheduler will also be used.
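
A minimal sketch of this step, assuming the fire-alarm and background clips are available as train_ds and val_ds datasets (the names, learning rate, and epoch count are assumptions):

fine_tune_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2),  # dynamic learning rate
]

fine_tune_model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=callbacks)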

After we’ve trained with the frozen layers, we can unfreeze them:

for layer in fine_tune_model.layers:
    layer.trainable = True

And train again for up to 10 epochs. You can find the complete code for this in the “Transfer Learning -> Train Model” section of the Colab notebook.

Recording your own training data

We now have an ML model that can classify the presence of a fire alarm sound. However, this model was trained on publicly available sound recordings, which might not match the sound characteristics of the hardware microphone we will use for inference.

The Raspberry Pi RP2040 MCU has a native USB feature that allows it to act like a custom USB device. We can flash an application to the board to enable it to act like a USB microphone to our PC. Then we can extend Google Colab’s capabilities with the Web Audio API on a modern Web browser like Google Chrome to collect live data samples (all from within Google Colab!)

Hardware Setup

SparkFun MicroMod RP2040

For assembly, remove the screw on the carrier board, slide the MicroMod RP2040 Processor Board into the socket at an angle, and secure it in place with the screw. See the MicroMod Machine Learning Carrier Board Hookup Guide for more details.

Image of removing the screw on the carrier board

Raspberry Pi Pico

Follow the instructions from the Hardware Setup section of the “Create a USB Microphone with the Raspberry Pi Pico” guide for assembly instructions.

Top: Fritzing wiring diagram Bottom: Assembled breadboard

Setting up the firmware applications toolchains

Rather than setting up the Raspberry Pi Pico’s SDK on your personal computer, we can leverage Colab’s built-in Linux shell command feature to set up the Pico SDK development environment with CMake and the GNU Arm Embedded Toolchain.

The pico-sdk will also have to be downloaded to the Colab instance using git:

%%shell
git clone https://github.com/raspberrypi/pico-sdk.git
cd pico-sdk
git submodule init
git submodule update

Compiling and flashing the USB microphone application

Now we can use the USB microphone example from the Microphone Library for Pico. The example application can be compiled using cmake and make. Then we can flash the example application to the board over USB by putting the board into “boot ROM mode” which will allow us to upload an application to the board.

SparkFun

  • Plug the USB-C cable into the board and your PC to power the board.
  • While holding down the BOOT button on the board, tap the RESET button.

GIF shows holding down the BOOT button on the board, and tapping the RESET button

Raspberry Pi Pico

  • Plug the USB Micro cable into your PC, but do NOT plug in the Pico side.
  • While holding down the white BOOTSEL button, plug in the micro USB cable to the Pico.

GIF shows plugging in the micro USB cable to the Pico

If you are using a WebUSB API enabled browser like Google Chrome, you can directly flash the image onto the board from within Google Colab!

Downloading the USB microphone application to the board from within Google Colab and WebUSB.

Otherwise, you can manually download the .uf2 file to your computer and then drag it onto the USB disk for the RP2040 board.

Collecting training data

Now that you have flashed the USB microphone application to the board, it will appear as a USB audio input on your PC.

We can now use Google Colab to record a fire alarm sound: select “MicNode” as the audio input source in the drop-down. Then, while pressing the test button on a smoke alarm, click the record button in Google Colab to record a one-second audio clip. Repeat this process a few times.

We can do the same in the next code cell in Google Colab to collect background audio samples. Repeat this a few times for non-fire-alarm sounds like silence, yourself talking, or any other normal sounds for the environment.

Final model training

Now that we’ve collected additional samples with the microphone that will be used during inference, we can fine-tune the model again with the new data.

Converting the Model to run on the MCU

We will need to convert the Keras model we’ve used to TensorFlow Lite format so that we can use it for inference on the device.

Quantization

To optimize the model to run on the Arm Cortex-M0+ processor, we will use a process called model quantization. Model quantization converts the model’s weights and biases from 32-bit floating-point values to 8-bit values. The pico-tflmicro library, which is a port of TFLu for the RP2040’s Pico SDK, contains Arm’s CMSIS-NN library, which supports optimized kernel operations for quantized 8-bit weights on Arm Cortex-M processors.

We can use TensorFlow’s Quantization Aware Training (QAT) feature to easily convert the floating-point model to a quantized one.
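
A rough sketch of that step is shown below; depending on the layers in your model, you may need the finer-grained annotation APIs from the TensorFlow Model Optimization toolkit instead:

import tensorflow_model_optimization as tfmot

# Wrap the fine-tuned model for quantization-aware training, then re-compile
# and briefly re-train it (the epoch count is an assumption).
quant_aware_model = tfmot.quantization.keras.quantize_model(fine_tune_model)
quant_aware_model.compile(optimizer="adam",
                          loss="binary_crossentropy",
                          metrics=["accuracy"])
quant_aware_model.fit(train_ds, validation_data=val_ds, epochs=5)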

Converting the model to TF Lite format

We will now use the tf.lite.TFLiteConverter.from_keras_model(…) API to convert the quantized Keras model to TF Lite format, and then save it to disk as a .tflite file.

converter = tf.lite.TFLiteConverter.from_keras_model(quant_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

train_ds = train_ds.unbatch()

def representative_data_gen():
    for input_value, output_value in train_ds.batch(1).take(100):
        # Model has only one input so each data point has one element.
        yield [input_value]

converter.representative_dataset = representative_data_gen
# Ensure that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Set the input and output tensors to int8 (APIs added in r2.3)
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model_quant = converter.convert()

with open("tflite_model.tflite", "wb") as f:
    f.write(tflite_model_quant)

Since TensorFlow also supports loading TF Lite models using tf.lite, we can also verify the functionality of the quantized model and compare its accuracy with the regular unquantized model inside Google Colab.
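
For example, a quick sanity check with the TF Lite interpreter might look like this (sample_input is an assumed placeholder for one spectrogram with the shape used above):

import numpy as np

interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Quantize a float spectrogram using the input tensor's scale and zero point
scale, zero_point = input_details["quantization"]
sample_input = np.zeros((124, 129, 1), dtype=np.float32)   # placeholder spectrogram
quantized = np.round(sample_input / scale + zero_point).astype(np.int8)

interpreter.set_tensor(input_details["index"], quantized[np.newaxis, ...])
interpreter.invoke()
prediction = interpreter.get_tensor(output_details["index"])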

The RP2040 MCU on the boards we are deploying to does not have a built-in file system, which means we cannot use the .tflite file directly on the board. However, we can use the Linux `xxd` command to convert the .tflite file to a .h file, which can then be compiled into the inference application in the next step.

%%shell
echo "alignas(8) const unsigned char tflite_model[] = {" > tflite_model.h
cat tflite_model.tflite | xxd -i >> tflite_model.h
echo "};" >> tflite_model.h

Deploy the model to the device

We now have a model that is ready to be deployed to the device. We’ve created an application template for inference which can be compiled with the .h file that we’ve generated for the model.

The C++ application uses the pico-sdk as the base, along with the CMSIS-DSP, pico-tflmicro, and Microphone Library for Pico libraries. Its general structure is as follows:

  1. Initialization
    1. Configure the board’s built-in LED for output. The application will map the brightness of the LED to the output of the model. (0.0 LED off, 1.0 LED on with full brightness)
    2. Setup the TF Lite library and TF Lite model for inference
    3. Setup the CMSIS-DSP based DSP pipeline
    4. Setup and start the microphone for real-time audio
  2. Inference loop
    1. Wait for 128 * 4 = 512 new audio samples from the microphone
    2. Shift the spectrogram array over by 4 columns
    3. Shift the audio input buffer over by 128 * 4 = 512 samples and copy in the new samples
    4. Calculate 4 new spectrogram columns for the updated input buffer
    5. Perform inference on the spectrogram data
    6. Map the inference output value to the on-board LED’s brightness and output the status to the USB port

In order to run in real time, each cycle of the inference loop must take under 512 / 16,000 = 0.032 seconds, or 32 milliseconds. The model we’ve trained and converted takes 24 ms for inference, which gives us roughly 8 ms for the other operations in the loop.

128 was used above to match the stride of 128 used in the training pipeline for the spectrogram. We used a shift of 4 in the spectrogram to fit within the real-time constraints we had.

Compiling the Firmware

Now we can use CMake to generate the build files required for compilation followed by make to compile.

The “cmake ..” line will have to be changed based on the board you are using:

  • SparkFun: cmake .. -DPICO_BOARD=sparkfun_micromod
  • Raspberry Pi Pico: cmake .. -DPICO_BOARD=pico

Flashing the Inference Application to the board

You’ll need to put the board into “boot ROM mode” again to load the new application to it.

SparkFun

  • Plug the USB-C cable into the board and your PC to power the board.
  • While holding down the BOOT button on the board, tap the RESET button.

Raspberry Pi Pico

  • Plug the USB Micro cable into your PC, but do NOT plug in the Pico side.
  • While holding down the white BOOTSEL button, plug in the micro USB cable to the Pico.

If you are using a WebUSB API enabled browser like Google Chrome, you can directly flash the image onto the board from within Google Colab. Otherwise, you can manually download the .uf2 file to your computer and then drag it onto the USB disk for the RP2040 board.

Monitoring the Inference on the board

Now that the inference application is running on the board you can observe it in action in two ways:

Visually by observing the brightness of the LED on the board. It should remain off or dim when no fire alarm sound is present – and be on when a fire alarm sound is present:

GIF shows LED on the board flashing

Connecting to the board’s USB serial port to view output from the inference application. If you are using a Web Serial API enabled browser like Google Chrome, this can be done directly from Google Colab:

GIF shows connecting to the board’s USB serial port to view output from the inference application

Improving the model

You now have the first version of the model deployed to the board, and it is performing inference on live 16 kHz audio data!

Test out various sounds to see if the model has the expected output. Maybe the fire alarm sound is being falsely detected (false positive) or not detected when it should be (false negative).

If this occurs, you can record more new audio data for the scenario(s) by flashing the USB microphone application firmware to the board, recording the data for training, re-training the model, converting it to TF Lite format, and re-compiling and flashing the inference application to the board.

Supervised machine learning models can generally only be as good as the training data they are trained with, so additional training data for these scenarios might help. You can also try to experiment with changing the model architecture or feature extraction process – but keep in mind that your model must be small enough and fast enough to run on the RP2040 MCU.

Conclusion

This article covered an end-to-end flow of how to train a custom audio classifier model to run locally on a development board that uses an Arm Cortex-M0+ processor. TensorFlow was used to train the model using transfer learning techniques along with a smaller dataset and data augmentation techniques. We also collected our own data from the microphone that is used at inference time by loading a USB microphone application onto the board, and extending Colab’s features with the Web Audio API and JavaScript.

The training side of the project combined Google’s Colab service and Chrome browser with the open source TensorFlow library. The inference application captured audio data from a digital microphone, used Arm’s CMSIS-DSP library for the feature extraction stage, then used TensorFlow Lite for Microcontrollers with Arm CMSIS-NN accelerated kernels to perform inference with an 8-bit quantized model that classified a real-time 16 kHz audio input on an Arm Cortex-M0+ processor.

The Web Audio API, Web USB API, and Web Serial API features of Google Chrome were used to extend Google Colab’s functionality to interact with the development board. This allowed us to experiment with and develop our application entirely with a web browser and deploy it to a constrained development board for on-device inference.

Since the ML processing was performed on the development board’s RP2040 MCU, no audio data left the device at inference time.

Learn more

You can learn more and get hands-on experience using TinyML at the upcoming Arm DevSummit, a 3-day virtual event between October 19 – 21. The event includes workshops on tinyML computer vision for real-world embedded devices and building large vocabulary voice control with Arm Cortex-M based MCUs. We hope to see you there!

Read More

3 Questions: Kalyan Veeramachaneni on hurdles preventing fully automated machine learning

The proliferation of big data across domains, from banking to health care to environmental monitoring, has spurred increasing demand for machine learning tools that help organizations make decisions based on the data they gather.

That growing industry demand has driven researchers to explore the possibilities of automated machine learning (AutoML), which seeks to automate the development of machine learning solutions in order to make them accessible for nonexperts, improve their efficiency, and accelerate machine learning research. For example, an AutoML system might enable doctors to use their expertise interpreting electroencephalography (EEG) results to build a model that can predict which patients are at higher risk for epilepsy — without requiring the doctors to have a background in data science.

Yet, despite more than a decade of work, researchers have been unable to fully automate all steps in the machine learning development process. Even the most efficient commercial AutoML systems still require a prolonged back-and-forth between a domain expert, like a marketing manager or mechanical engineer, and a data scientist, making the process inefficient.

Kalyan Veeramachaneni, a principal research scientist in the MIT Laboratory for Information and Decision Systems who has been studying AutoML since 2010, has co-authored a paper in the journal ACM Computing Surveys that details a seven-tiered schematic to evaluate AutoML tools based on their level of autonomy.

A system at level zero has no automation and requires a data scientist to start from scratch and build models by hand, while a tool at level six is completely automated and can be easily and effectively used by a nonexpert. Most commercial systems fall somewhere in the middle.

Veeramachaneni spoke with MIT News about the current state of AutoML, the hurdles that prevent truly automatic machine learning systems, and the road ahead for AutoML researchers.

Q: How has automatic machine learning evolved over the past decade, and what is the current state of AutoML systems?

A: In 2010, we started to see a shift, with enterprises wanting to invest in getting value out of their data beyond just business intelligence. So then came the question, maybe there are certain things in the development of machine learning-based solutions that we can automate? The first iteration of AutoML was to make our own jobs as data scientists more efficient. Can we take away the grunt work that we do on a day-to-day basis and automate that by using a software system? That area of research ran its course until about 2015, when we realized we still weren’t able to speed up this development process.

Then another thread emerged. There are a lot of problems that could be solved with data, and they come from experts who know those problems, who live with them on a daily basis. These individuals have very little to do with machine learning or software engineering. How do we bring them into the fold? That is really the next frontier.

There are three areas where these domain experts have strong input in a machine learning system. The first is defining the problem itself and then helping to formulate it as a prediction task to be solved by a machine learning model. Second, they know how the data have been collected, so they also know intuitively how to process that data. And then third, at the end, machine learning models only give you a very tiny part of a solution — they just give you a prediction. The output of a machine learning model is just one input to help a domain expert get to a decision or action.

Q: What steps of the machine learning pipeline are the most difficult to automate, and why has automating them been so challenging?

A: The problem-formulation part is extremely difficult to automate. For example, if I am a researcher who wants to get more government funding, and I have a lot of data about the content of the research proposals that I write and whether or not I receive funding, can machine learning help there? We don’t know yet. In problem formulation, I use my domain expertise to translate the problem into something that is more tangible to predict, and that requires somebody who knows the domain very well. And he or she also knows how to use that information post-prediction. That problem is refusing to be automated.

There is one part of problem-formulation that could be automated. It turns out that we can look at the data and mathematically express several possible prediction tasks automatically. Then we can share those prediction tasks with the domain expert to see if any of them would help in the larger problem they are trying to tackle. Then once you pick the prediction task, there are a lot of intermediate steps you do, including feature engineering, modeling, etc., that are very mechanical steps and easy to automate.

But defining the prediction tasks has typically been a collaborative effort between data scientists and domain experts because, unless you know the domain, you can’t translate the domain problem into a prediction task. And then sometimes domain experts don’t know what is meant by “prediction.” That leads to the major, significant back and forth in the process. If you automate that step, then machine learning penetration and the use of data to create meaningful predictions will increase tremendously.

Then what happens after the machine learning model gives a prediction? We can automate the software and technology part of it, but at the end of the day, it is root cause analysis and human intuition and decision making. We can augment them with a lot of tools, but we can’t fully automate that.

Q: What do you hope to achieve with the seven-tiered framework for evaluating AutoML systems that you outlined in your paper?

A: My hope is that people start to recognize that some levels of automation have already been achieved and some still need to be tackled. In the research community, we tend to focus on what we are comfortable with. We have gotten used to automating certain steps, and then we just stick to it. Automating these other parts of the machine learning solution development is very important, and that is where the biggest bottlenecks remain.

My second hope is that researchers will very clearly understand what domain expertise means. A lot of this AutoML work is still being conducted by academics, and the problem is that we often don’t do applied work. There is not a crystal-clear definition of what a domain expert is, and in itself “domain expert” is a very nebulous phrase. What we mean by domain expert is the expert in the problem you are trying to solve with machine learning. And I am hoping that everyone unifies around that because that would make things so much clearer.

I still believe that we are not able to build that many models for that many problems, but even for the ones that we are building, the majority of them are not getting deployed and used in day-to-day life. The output of machine learning is just going to be another data point, an augmented data point, in someone’s decision making. How they make those decisions, based on that input, how that will change their behavior, and how they will adapt their style of working, that is still a big, open question. Once we automate everything, that is what’s next.

We have to determine what has to fundamentally change in the day-to-day workflow of someone giving loans at a bank, or an educator trying to decide whether he or she should change the assignments in an online class. How are they going to use machine learning’s outputs? We need to focus on the fundamental things we have to build out to make machine learning more usable.

Read More

Facebook and USENIX announce the winners of the 2021 Internet Defense Prize

Today, Facebook and USENIX awarded a total of $200,000 to the top three winners of the Internet Defense Prize. Funded by Facebook and offered in partnership with USENIX, the award celebrates security research contributions to the protection and defense of the internet. In this post, we share details on the research we awarded today and also on the upcoming changes to how the Prize will operate in the future.

Award recipients

We awarded our first-place prize of $100,000 to winners Ofek Kirzner and Adam Morrison of Tel Aviv University for their work titled “An Analysis of Speculative Type Confusion Vulnerabilities in the Wild.” The paper defines “speculative type confusion,” an issue where branch mispredictions cause a victim program to execute with variables holding values of the wrong type. The impact in this scenario is that the victim program leaks sensitive memory content.

Second-place prize winner Nicholas Carlini of Google was awarded $60,000 for their paper “Poisoning the Unlabeled Dataset of Semi-Supervised Learning.” The paper looks at the “dataset poisoning” problem: If an attacker can control (“poison”) a portion of the training set for a machine learning model, how much can the attacker force the model to incorrectly classify? The research shows that in the semi-supervised setting, where models include training on unlabeled data, poisoning as little as 0.1% of the unlabeled training data enables controlling the model’s output.

The third-place prize of $40,000 was awarded to a team of researchers including Kevin Bock (University of Maryland), Abdulrahman Alaraj (University of Colorado Boulder), Eric Wustrow (University of Colorado Boulder), Yair Fax (University of Maryland), Kyle Hurley (University of Maryland), and Dave Levin (University of Maryland). Their research, “Weaponizing Middleboxes for TCP Reflected Amplification,” looked at the problem of an attacker amplifying network traffic to cause a distributed denial-of-service attack. This class of attack, known as “reflective amplification,” was previously believed to work only with UDP-based protocols. The authors showed that, in fact, TCP-based protocols can be used for reflective amplification. They then scanned the entire IPv4 internet to demonstrate that there are hundreds of thousands of IP addresses hosting potential amplifiers.

We congratulate the 2021 winners of the Internet Defense Prize and thank them for their contributions to help make the internet more secure. To be considered for the Prize in 2022, submit a paper to USENIX Security 2022 here.

Starting in 2022, the USENIX Security Awards Committee will begin independently determining the prize, to be distributed by USENIX. Facebook will continue to fund the Internet Defense Prize as a founding partner.

See the USENIX post here.


Read More

Join us at the Women in Machine Learning Symposium

Posted by Jeanine Banks, VP of 3P Core Developer Platforms

Join us for the Women in Machine Learning Symposium on October 19

At Google we believe that diversity and inclusion are core to innovation, and we know there’s work to be done in improving representation to achieve equity. That’s why we’re excited to announce a new event: The Women in Machine Learning Symposium.

Join us virtually from 9-12 PDT on October 19, 2021, to hear from leaders in the machine learning (ML) industry.

All journeys are different and this event aims to empower the next generation of women leaders in ML. By learning from each other’s stories we want to inspire the creation of a community of support, bringing together women and allies in technology.

We’ll have two keynotes discussing the importance of diversity in ML communities and Open Source Software. You can also hear first-hand the stories and experiences of women who are breaking down barriers and ask them questions live!

Lastly, I invite you to attend one of the breakout sessions tailored to what stage you’re at in your career. Whether you want to learn how to get started in ML, how to switch from being a tech developer to an ML developer, or tips for taking your career to the C-level, this event has a place for you.

RSVP today to reserve your spot and head on over to our website to view the live agenda. I hope to see you there!

Read More

Building Scalable, Explainable, and Adaptive NLP Models with Retrieval

Natural language processing (NLP) has witnessed impressive developments
in answering questions, summarizing or translating reports, and
analyzing sentiment or offensiveness. Much of this progress is owed to
training ever-larger language models, such
as T5 or GPT-3,
that use deep monolithic architectures to internalize how language is
used within text from massive Web crawls. During training, these models
distill the facts they read into implicit knowledge, storing in their
parameters not only the capacity to “understand” language tasks, but
also highly abstract knowledge representations of entities, events, and
facts the model needs for solving tasks.

Despite the well-publicized success of large language models, their
black-box nature hinders key goals of NLP. In particular, existing large
language models are generally:

  • Inefficient. Researchers continue to enlarge these models, leading
    to striking inefficiencies as the field already pushes past 1
    trillion parameters. This imposes a considerable environmental impact
    and its costs exclude all but a few large organizations from the
    ability to train—or in many cases even deploy—such models.

  • Opaque. They encode “knowledge” into model weights, synthesizing
    what they manage to memorize from training examples. This makes it
    difficult to discern what sources—if any—the model uses to make a
    prediction, a concerning problem in practice as these models
    frequently generate fluent yet untrue statements.

  • Static. They are expensive to update. We cannot efficiently adapt a
    GPT model trained on, say, Wikipedia text from 2019 so it reflects
    the knowledge encoded in the 2021 Wikipedia—or the latest snapshot
    of the medical preprint server medRXiv. In practice, adaptation often
    necessitates expensive retraining or fine-tuning on the new corpus.

This post explores an emerging alternative, Retrieval-based NLP, in
which models directly “search” for information in a text corpus to
exhibit knowledge, leveraging the representational strengths of language models
while addressing the challenges above. Such
models—including REALM, RAG, ColBERT-QA,
and Baleen—are
already advancing the state of the art for tasks like answering
open-domain questions and verifying complex claims, all with
architectures that back their predictions with checkable sources while
being 100–1000× smaller, and thus far cheaper to execute, than GPT-3. At
Stanford, we have shown that improving the expressivity and
supervision of scalable neural retrievers can lead to much stronger NLP
systems: for instance, ColBERT-QA improves answer correctness on open-QA
benchmarks by up to 16 EM points and Baleen improves the ability to
check complex claims on
HoVer,
correctly and with provenance, by up to 42 percentage points against existing work.

Retrieval-based NLP

Figure 1: An illustration comparing (a) black-box language models and (b) retrieval-oriented NLP models, the paradigm this post advocates for.

As Figure 1 illustrates, retrieval-based NLP methods view tasks as
open-book
exams: knowledge is encoded explicitly in the form of a text corpus like
Wikipedia, the medical literature, or a software’s API documentation. When
solving a language task, the model learns to search for pertinent passages
and to then use the retrieved information for crafting knowledgeable responses.
In doing so, retrieval helps decouple the capacity that language models have for
understanding text from how they store knowledge, leading to three key advantages.

Tackling Inefficiency. Retrieval-based models can be much smaller and
faster
, and thus more environmentally friendly. Unlike black-box language models,
the parameters no longer need to store an ever-growing list of facts, as
such facts can be retrieved. Instead, we can dedicate those parameters
for processing language and solving tasks, leaving us with smaller
models that are highly effective. For instance, ColBERT-QA achieves
47.8% EM on the open-domain Natural Questions task, whereas a fine-tuned
T5-11B model (with 24x more parameters) and a few-shot GPT-3 model (with
400x more parameters) achieve only 34.8% and 29.9%, respectively.

Tackling Opaqueness. Retrieval-based NLP offers a transparent contract
with users: when the model produces an answer, we can read the sources
it retrieved and judge their relevance and credibility for ourselves.
This is essential whether the model is factually correct or not: by
inspecting the sources surfaced by a system like Baleen, we can trust
its outputs only if we find that reliable sources do support them.

Tackling Static Knowledge. Retrieval-based models emphasize learning
general techniques for finding and connecting information from the
available resources. With facts stored as text, the retrieval knowledge
store can be efficiently updated or expanded by modifying the text
corpus, all while the model’s capacity for finding and using information
remains constant. Besides computational cost reductions, this expedites generality:
developers, even in niche domains, can “plug in” a domain-specific text
collection and rely on retrieval to facilitate domain-aware responses.

ColBERT: Scalable yet expressive neural retrieval

As the name suggests, retrieval-based NLP relies on semantically rich search to extract
information. For search to be practical and effective, it must scale to massive text corpora.
To draw on the open-book exam analogy, it’s hopeless to linearly look
through the pages of a hefty textbook during the exam—we need scalable
strategies for organizing the content in advance, and efficient
techniques for locating relevant information at inference time.

Figure 2: Schematic diagrams comparing two popular paradigms in neural IR in sub-figures (a) and (b) against the late interaction paradigm of ColBERT in sub-figure (c).

Traditionally in IR, search tasks were conducted using bag-of-words
models like BM25, which seek documents that contain the same tokens as
the query. In
2019, search was revolutionized with BERT for
ranking and its deployment
in Google and Bing for
Web search. The standard approach is illustrated in Figure 2(a). Each
document is concatenated with the query, and both are fed jointly into a BERT
model, fine-tuned to estimate relevance. BERT doubled the MRR@10 quality
metric over BM25 on the popular MS MARCO Passage Ranking leaderboard,
but it simultaneously posed a fundamental limitation: scoring
each query–document pair requires billions of computational operations
(FLOPs). As a result, BERT can only be used to re-rank the top-k (e.g.,
top-1000) documents already extracted by simpler methods like BM25,
having no capacity to recover useful documents that bag-of-word search
misses.

The key limitation of this approach is that it encodes queries and
documents jointly. Many representation-similarity systems have been
proposed to tackle this, some of which re-purpose BERT within the
paradigm depicted in Figure 2(b). In these systems
(like SBERT and ORQA, and more recently DPR and ANCE),
every document in the corpus is fed into a BERT encoder that produces a
dense vector meant to capture the semantics of the document. At search
time, the query is encoded, separately, through another BERT encoder, and the
top-k related documents are found using a dot product between the query
and document vectors. By removing the expensive interactions between the
query and the document, these models are able to scale far more
efficiently than the approach in Figure 2(a).

Nonetheless, representation-similarity models suffer from an
architectural bottleneck: they encode the query and document into
coarse-grained representations and model relevance as a single dot
product. This greatly diminishes quality compared with expensive
re-rankers that model token-level interactions between the contents of
queries and documents. Can we efficiently scale fine-grained, contextual
interactions to a massive corpus, without compromising speed or quality?
It turns out that the answer is “yes”, using a paradigm called late
interaction, first devised in
our ColBERT1 [code] model, which appeared at SIGIR 2020.

As depicted in Figure 2(c), ColBERT independently encodes queries and
documents into fine-grained multi-vector representations. It then
attempts to softly and contextually locate each query token inside the
document: for each query embedding, it finds the most similar embedding
in the document with a “MaxSim” operator and then sums up all of the
MaxSims to score the document. “MaxSim” is a careful choice that allows
us to index the document embeddings for Approximate Nearest Neighbor
(ANN) search, enabling us to scale this rich interaction to millions of passages with latency
on the order of tens of milliseconds. For instance, ColBERT can search over all
passages in English Wikipedia in approximately 70 milliseconds per query.
On MS MARCO Passage Ranking, ColBERT preserved the MRR@10 quality of BERT re-rankers while boosting recall@1k to nearly 97%
against the official BM25 ranking’s recall@1k of just 81%.
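
To make the scoring step concrete, here is an illustrative sketch of late interaction over a single query-document pair (not the actual ColBERT implementation, which batches this and pairs it with an ANN index):

import numpy as np

def late_interaction_score(Q, D):
    # Q: (num_query_tokens, dim) query token embeddings
    # D: (num_doc_tokens, dim) document token embeddings
    # Both are assumed to be L2-normalized, so dot products are cosine similarities.
    similarities = Q @ D.T                 # token-level similarity matrix
    max_sims = similarities.max(axis=1)    # best-matching document token per query token
    return float(max_sims.sum())           # sum of MaxSims = document relevance score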

Making neural retrievers more lightweight remains an active area of
development, with models like DeepImpact
that trade away some quality for extreme forms of efficiency and
developments like BPR
and quantized ColBERT
that reduce the storage footprint by an order of magnitude while
preserving the quality of DPR and ColBERT, respectively.

ColBERT-QA and Baleen: Specializing neural retrieval to complex tasks, with tracked provenance

While scaling expressive search mechanisms is critical, NLP models need
more than just finding the right documents. In particular, we want NLP models
to use retrieval to answer questions, fact-check claims, respond
informatively in a conversation, or identify the sentiment of a piece of
text. Many tasks of this kind—dubbed knowledge-intensive language
tasks—are collected in
the KILT benchmark.
The most popular task is open-domain question answering (or Open-QA).
Systems are given a question from any domain and must produce an answer,
often by reference to the passages in a large corpus, as depicted in
Figure 1(b).

Open-Domain Question Answering
  • Open-NaturalQuestions: ColBERT-QA, Answer Match, +3 (baselines: RAG, DPR, REALM, BM25+BERT)
  • Open-TriviaQA: ColBERT-QA, Answer Match, +12 (same baselines)
  • Open-SQuAD: ColBERT-QA, Answer Match, +17 (same baselines)

Multi-Hop Reasoning
  • HotPotQA: Baleen, Retrieval Success@20, +10 vs. MDR (NA vs. IRRR); Passage-Pair Match, +5 vs. MDR and +3 vs. IRRR
  • HoVer: Baleen, Retrieval Success@100, +48 vs. TF-IDF and +17 vs. ColBERT-Hop; “HoVer Score” for claim verification with provenance, +42 vs. the official “TF-IDF + BERT” baseline

Cross-Lingual Open-Domain Question Answering
  • XOR TyDi: GAAMA with ColBERT from IBM Research, Recall@5000-tokens, +10 vs. the official “DPR + Vanilla Transformer” baseline

Zero-Shot Information Retrieval
  • BEIR: ColBERT, Recall@100, outperforms other off-the-shelf dense retrievers (DPR, ANCE, SBERT, USE-QA) on 13/17 tasks
Table 1: Results of models using ColBERT, ColBERT-QA, and Baleen across a wide range of language tasks.

Two popular models in this space are REALM and RAG, which rely on the
ORQA and DPR retrievers discussed earlier. REALM and RAG jointly tune a
retriever as well as a reader, a modeling component that consumes the
retrieved documents and produces answers or responses. Take RAG as an
example: its reader is a generative BART model, which attends to the
passages while generating the target outputs. While they constitute
important steps toward retrieval-based NLP, REALM and RAG suffer from
two major limitations. First, they use the restrictive paradigm of
Figure 2(b) for retrieval, thereby sacrificing recall: they are often
unable to find relevant passages for conducting their tasks. Second,
when training the retriever, REALM and RAG collect documents by
searching for them inside the training loop and, to make this practical, they
freeze the document encoder when fine-tuning, restricting the model’s adaptation to the task.

ColBERT-QA [2] is an Open-QA system (published at TACL’21) that we built on
top of ColBERT to tackle both problems. By adapting ColBERT’s expressive search to the task,
ColBERT-QA finds useful passages for a larger fraction of the questions and thus
enables the reader component to answer more questions correctly and with provenance.
In addition, ColBERT-QA introduces relevance-guided supervision (RGS),
a training strategy whose goal is to adapt a
retriever like ColBERT to the specifics of an NLP task like Open-QA. RGS
proceeds in discrete rounds, using the retriever trained in the previous
round to collect “positive” passages that are likely useful for the
reader—specifically, passages ranked highly by the latest version of the
retriever and that also overlap with the gold answer of the question—and
challenging “negative” passages. By converging to a high coverage of
positive passages and by effectively sampling hard negatives, ColBERT-QA
improves retrieval Success@20 by more than 5, 5, and 12 points on
the open-domain QA settings of NaturalQuestions, TriviaQA, and SQuAD, respectively, and thus greatly
improves downstream answer match.
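
A schematic sketch of one RGS round follows; the `retriever.search` interface, the answer-overlap check, and the sampling cutoffs are hypothetical stand-ins, but the round structure mirrors the description above.

```python
def rgs_round(retriever, questions, gold_answers, k=100):
    """One round of relevance-guided supervision (schematic sketch).

    The `retriever` object and its `search` method are hypothetical stand-ins
    for the latest trained retriever. Positives are top-ranked passages that
    overlap with the gold answer; negatives are top-ranked passages that do not.
    """
    triples = []
    for question, answer in zip(questions, gold_answers):
        ranked = retriever.search(question, k=k)       # best-first passage list
        positives = [p for p in ranked if answer.lower() in p.text.lower()]
        negatives = [p for p in ranked if answer.lower() not in p.text.lower()]
        for pos in positives[:3]:                      # high-coverage positives
            for neg in negatives[:20]:                 # challenging hard negatives
                triples.append((question, pos.text, neg.text))
    return triples

# Across discrete rounds: train a retriever on the triples, then mine new
# triples with the retriever from the previous round.
# for _ in range(num_rounds):
#     triples = rgs_round(retriever, questions, gold_answers)
#     retriever = train_retriever(triples)   # hypothetical training call
```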

A more sophisticated version of the Open-QA task is multi-hop reasoning,
where systems must answer questions or verify claims by gathering
information from multiple sources. Systems in this space,
like GoldEn, MDR,
and IRRR,
find relevant documents and “hop” between them—often by running
additional searches—to find all pertinent sources. While these models
have demonstrated strong performance for two-hop tasks, scaling robustly
to more hops is challenging as the search space grows exponentially.

To tackle this, our Baleen [3] system
(accepted as a Spotlight paper at NeurIPS’21) introduces a richer pipeline for
multi-hop retrieval: after each retrieval “hop”, Baleen summarizes the
pertinent information from the passages into a short context that is used
to inform future hops. In doing so, Baleen controls the search space
architecturally—obviating the need to explore each potential passage
at every hop—without sacrificing recall. Baleen also extends ColBERT’s
late interaction: it allows the representations of different documents
to “focus” on distinct parts of the same query, as each of those documents
in the corpus might satisfy a distinct aspect of the same complex query.
As a result of its more deliberate architecture and its stronger
retrieval modeling, Baleen saturates retrieval on the popular two-hop
HotPotQA benchmark (raising answer-recall@20 from MDR’s 89% to 96%) and
dramatically improves performance on the harder four-hop claim
verification
benchmark HoVer,
finding all required passages in 92% of the examples—up from just 45%
for the official baseline and 75% for a many-hop flavor of ColBERT.
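
The following sketch captures Baleen’s hop-then-condense control flow; the `retriever` and `condenser` objects and their methods are hypothetical stand-ins for Baleen’s retriever and condenser components, not the system’s actual API.

```python
def condensed_multihop_retrieve(query, retriever, condenser, num_hops=4, k=25):
    """Condensed multi-hop retrieval in the spirit of Baleen (schematic sketch).

    After every hop, the condenser distills the retrieved passages into a few
    short facts; those facts (not whole passages) augment the query for the
    next hop, keeping the search space from growing exponentially.
    """
    facts = []      # condensed context accumulated across hops
    evidence = []   # retrieved passages, kept for provenance
    for _ in range(num_hops):
        hop_query = " | ".join([query] + facts)          # query plus condensed facts
        passages = retriever.search(hop_query, k=k)
        facts += condenser.extract(hop_query, passages)  # keep only pertinent facts
        evidence += passages
    return facts, evidence
```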

In these tasks, when our retrieval-based models make predictions, we can
inspect their underlying sources and decide whether to trust the
answer. And when model errors stem from specific sources, those sources
can be removed or edited; making sure models remain faithful to such
edits is an active area of work.

Generalizing models to new domains with robust neural retrieval

In addition to helping with efficiency and transparency, retrieval
approaches promise to make domain generalization and knowledge updates
much easier in NLP. Exhibiting up-to-date, domain-specific knowledge is
essential for many applications: you might want to answer questions over
recent publications on COVID-19 or to develop a chatbot that guides
customers to suitable products among those currently available in a
fast-evolving inventory. For such applications, NLP models should be
able to leverage any corpus provided to them, without having to train a
new version of the model for each emerging scenario or domain.

While large language models are trained on plenty of data from the
Web, that snapshot of the Web is:

  • Static. The Web evolves as the world does: Wikipedia articles
    reflect new elected officials, news articles describe current events, and
    scientific papers communicate new research. Despite this, a language
    model trained in 2020 has no way to learn about 2021 events, short
    of training and releasing a new version of the model.

  • Incomplete. Many topics are under-represented in Web crawls like C4
    and The Pile. Suppose we seek to answer questions over the ACL
    papers published 2010–2021; there is no guarantee that The Pile
    contains every paper from the ACL Anthology, and there is no way to
    plug those papers in ad hoc without additional training. Even when
    some ACL papers are present (e.g., through arXiv, which is included
    in The Pile), they form only a tiny sliver of the data, and it is
    difficult to reliably restrict the model to specifically those
    papers for answering NLP questions.

  • Public-only. Many applications hinge on private text, like internal
    company policies, in-house software documentation, copyrighted
    textbooks and novels, or personal email. Because models like GPT-3
    never see such data in their training, they are fundamentally
    incapable of exhibiting knowledge pertaining to those topics without
    special re-training or fine-tuning.

With retrieval-based NLP, models learn effective ways to encode and
extract information, allowing them to generalize to updated text,
specialized domains, or private data without resorting to additional
training. This suggests a vision where developers “plug in” their own text
corpus, such as in-house software documentation; a powerful
retrieval-based NLP model indexes that corpus and can then answer
questions, solve classification tasks, or generate summaries using its
contents, always supporting its predictions with provenance from the
corpus.
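
As a sketch of that developer workflow (the object names and methods below are purely illustrative, not a real API), the corpus is indexed once offline, and every prediction comes back with its supporting passages:

```python
def answer_over_corpus(question, index, reader, k=5):
    """Answer a question over a plugged-in corpus, with provenance (schematic).

    `index` stands in for a retrieval index built over the developer's own
    corpus (e.g., in-house documentation); `reader` stands in for a model that
    conditions on the retrieved passages when generating the answer.
    """
    passages = index.search(question, k=k)           # retrieve evidence
    answer = reader.generate(question, passages)     # ground the output in it
    sources = [p.source for p in passages]           # provenance for the user
    return answer, sources
```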

An exciting benchmark connected to this space
is BEIR,
which evaluates retrievers on their capacity for search “out-of-the-box”
on unseen IR tasks, like Argument Retrieval, and in new domains, like
the COVID-19 research literature. While retrieval offers a concrete
mechanism for generalizing NLP models to new domains, not every IR model
generalizes equally: the BEIR evaluations highlight the impact of
modeling and supervision choices on generalization. For instance, due to
its late interaction modeling, a vanilla off-the-shelf ColBERT retriever
achieved the strongest recall of all competing IR models in the initial
BEIR evaluations, outperforming the other off-the-shelf dense
retrievers—namely, DPR, ANCE, SBERT, and USE-QA—on 13 out of 17
datasets. The BEIR benchmark continues to develop quickly, a recent
addition being the
TAS-B model,
which advances a sophisticated supervision approach to distill ColBERT
and BERT models into single-vector representations, inheriting much of
their robustness in doing so. While retrieval allows rapid deployment in new
domains, explicitly adapting retrievers to new scenarios is also
possible. This is an active area of research, with approaches
like QGen and AugDPR that
generate synthetic questions and use them to fine-tune
retrievers for a new corpus.
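
As an illustration of the query-generation idea (not the exact QGen or AugDPR recipe), one can sample synthetic questions for each in-domain passage with a seq2seq model and use the resulting (question, passage) pairs to fine-tune a retriever. The checkpoint name below is an assumption and can be swapped for any question-generation model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed checkpoint name; substitute any question-generation seq2seq model.
MODEL_NAME = "BeIR/query-gen-msmarco-t5-base-v1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def synthetic_questions(passage: str, n: int = 3) -> list[str]:
    """Sample n synthetic questions that the given passage should answer."""
    inputs = tokenizer(passage, truncation=True, max_length=384, return_tensors="pt")
    outputs = model.generate(
        **inputs, do_sample=True, top_p=0.95, num_return_sequences=n, max_length=64
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Each (generated question, passage) pair becomes a positive training example
# for fine-tuning the retriever on the new corpus.
```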

Summary: Is retrieval “all you need”?

The black-box nature of large language models like T5 and GPT-3 makes
them inefficient to train and deploy, opaque in how they represent knowledge and back
their claims with provenance, and static in the face of a constantly evolving world and diverse downstream contexts.
This post explores retrieval-based NLP, where models retrieve information
pertinent to solving their tasks from a plugged-in text corpus. This
paradigm allows NLP models to leverage the representational strengths
of language models, while needing much smaller architectures, offering
transparent provenance for claims, and enabling efficient updates and adaptation.

We surveyed much of the existing and emerging work in this space and
highlighted some of our work at Stanford, including
ColBERT
for scaling up expressive retrieval to massive corpora via late
interaction,
ColBERT-QA for
accurately answering open-domain questions by adapting high-recall
retrieval to the task, and
Baleen for
solving tasks that demand information from several independent sources
using a condensed retrieval architecture.
We continue to actively maintain
our code as open source.

Acknowledgments. We would like to thank Megha Srivastava and Drew A. Hudson for helpful comments and feedback on this blog post. We also thank Ashwin Paranjape, Xiang Lisa Li, and Sidd Karamcheti for valuable and insightful discussions.

  1. Omar Khattab and Matei Zaharia. “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020.

  2. Omar Khattab, Christopher Potts, and Matei Zaharia. “Relevance-guided Supervision for OpenQA with ColBERT.” Transactions of the Association for Computational Linguistics, vol. 9, 2021, pp. 929–944. doi: https://doi.org/10.1162/tacl_a_00405

  3. Omar Khattab, Christopher Potts, and Matei Zaharia. “Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval.” To appear at NeurIPS 2021. arXiv preprint arXiv:2101.00436 (2021).
