Fine-tune and deploy a Wav2Vec2 model for speech recognition with Hugging Face and Amazon SageMaker

Automatic speech recognition (ASR) is a machine learning (ML) technology commonly used in daily life and business scenarios. Applications such as voice-controlled assistants like Alexa and Siri, and voice-to-text applications like automatic video subtitling and meeting transcription, are all powered by this technology. These applications take audio clips as input and convert speech signals to text, which is why they are also referred to as speech-to-text applications.

This technology has matured in recent years, and many of the latest models achieve very good performance, such as the transformer-based models Wav2Vec2 and Speech2Text. The transformer is a sequence-to-sequence deep learning architecture originally proposed for machine translation. It has since been extended to solve all kinds of natural language processing (NLP) tasks, such as text classification, text summarization, and ASR. The transformer architecture yields very good results across a variety of NLP tasks; however, model sizes (the number of parameters) as well as the amount of data the models are pre-trained on grow rapidly in the pursuit of better performance. Training a transformer from scratch is therefore time-consuming and costly; for example, training a BERT model from scratch can take 4 days and cost $6,912 (for more information, see The Staggering Cost of Training SOTA AI Models). Hugging Face, an AI company, provides an open-source platform where developers can share and reuse thousands of pre-trained transformer models. With the transfer learning technique, you can fine-tune such a model with a small set of labeled data for a target use case. This reduces the overall compute cost, speeds up the development lifecycle, and lessens the carbon footprint of the community.

AWS announced a collaboration with Hugging Face in 2021 that lets developers easily work with Hugging Face models on Amazon SageMaker and benefit from the best of both worlds: you can fine-tune and optimize any model from Hugging Face, while SageMaker provides managed training and inference services with high-performance resources and high scalability via the Amazon SageMaker distributed training libraries. This collaboration can help you accelerate the productization of your NLP workloads and realize business benefits.

This post shows how to use SageMaker to easily fine-tune the latest Wav2Vec2 model from Hugging Face, and then deploy the model with a custom-defined inference process to a SageMaker managed inference endpoint. Finally, you can test the model performance with sample audio clips, and review the corresponding transcription as output.

Wav2Vec2 background

Wav2Vec2 is a transformer-based architecture for ASR tasks and was released in September 2020. The following diagram shows its simplified architecture. For more details, see the original paper. As the diagram shows, the model is composed of a multi-layer convolutional neural network (CNN) that acts as a feature extractor: it takes an input audio signal and outputs audio representations, or features. These are fed into a transformer network to generate contextualized representations. This part of the training can be self-supervised; the transformer can be trained on unlabeled speech and learn from it. Then the model is fine-tuned on labeled data with the Connectionist Temporal Classification (CTC) algorithm for specific ASR tasks. The base model we use in this post is Wav2Vec2-Base-960h, fine-tuned on 960 hours of Librispeech 16 kHz sampled speech audio.
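As a quick, illustrative aside (not part of the post’s fine-tuning code), the following sketch shows how the pre-trained Wav2Vec2-Base-960h checkpoint can be loaded from the Hugging Face Hub and used to transcribe a 16 kHz waveform; audio_array is a placeholder for a real 1-D NumPy waveform.

import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio_array = np.zeros(16000, dtype=np.float32)  # placeholder: replace with real 16 kHz audio

# The CNN feature extractor and transformer run inside the model;
# the processor handles normalization and decoding around it.
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits   # shape: (batch, time_steps, vocab_size)

predicted_ids = torch.argmax(logits, dim=-1)     # greedy CTC decoding
print(processor.batch_decode(predicted_ids)[0])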

CTC is a character-based algorithm. During training, it learns to align each character of the transcription with the speech automatically, so no time-frame alignment between the audio signal and the transcription is required. For example, if the audio clip says “Hello World,” we don’t need to know in which second the word “hello” occurs. This saves a lot of labeling effort for ASR use cases. For more information about how the algorithm works, refer to Sequence Modeling With CTC.
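To illustrate the idea, the following toy function (purely illustrative and unrelated to the actual Wav2Vec2 implementation) collapses a frame-level CTC prediction into a final transcript by merging repeated characters and removing blank tokens:

def ctc_collapse(frame_labels, blank="-"):
    """Merge consecutive repeated characters, then drop CTC blank tokens."""
    collapsed = []
    previous = None
    for label in frame_labels:
        if label != previous:
            collapsed.append(label)
        previous = label
    return "".join(ch for ch in collapsed if ch != blank)

# frame-level predictions such as "HHH-EE-LL-LLOO" collapse to "HELLO"
print(ctc_collapse(list("HHH-EE-LL-LLOO")))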

Solution overview

In this post, we use the SUPERB (Speech processing Universal PERformance Benchmark) dataset available from the Hugging Face Datasets library, and fine-tune the Wav2Vec2 model and deploy it as a SageMaker endpoint for real-time inference for an ASR task. SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.

The following diagram provides a high-level view of the solution workflow.

First, we show how to load and preprocess the SUPERB dataset in a SageMaker environment in order to obtain a tokenizer and feature extractor, which are required for fine-tuning the Wav2Vec2 model. Then we use SageMaker Script Mode for training and inference steps, which allows you to define and use custom training and inference scripts, and SageMaker provides supported Hugging Face framework Docker containers. For more information about training and serving Hugging Face models on SageMaker, see Use Hugging Face with Amazon SageMaker. This functionality is available through the development of Hugging Face AWS Deep Learning Containers (DLCs).

The notebook and code from this post are available on GitHub. The notebook is tested in both Amazon SageMaker Studio and SageMaker notebook environments.

Data preprocessing

In this section, we walk through the steps to preprocess the data.

Process the dataset

In this post, we use the SUPERB dataset, which you can load from the Hugging Face Datasets library directly using the load_dataset function. The SUPERB dataset also includes speaker_id and chapter_id; we remove these columns and keep only the audio files and transcriptions to fine-tune the Wav2Vec2 model for an ASR task, which transcribes speech to text. To speed up the fine-tuning process for this example, we take only the test split from the original dataset and then split it into train and test datasets. See the following code:

from datasets import load_dataset, DatasetDict

data = load_dataset("superb", 'asr', ignore_verifications=True)
data = data.remove_columns(['speaker_id', 'chapter_id', 'id'])

# reduce the data volume for this example: only take the test split from the original dataset for fine-tuning
data = data['test']

# split the reduced data into train and test datasets
train_test = data.train_test_split(test_size=0.2)
dataset = DatasetDict({
    'train': train_test['train'],
    'test': train_test['test']})

After we process the data, the dataset structure is as follows:

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'text'],
        num_rows: 2096
    })
    test: Dataset({
        features: ['file', 'audio', 'text'],
        num_rows: 524
    })
})

Let’s print one data point from the train dataset and examine the information in each feature. ‘file’ is the path where the audio file is saved and cached locally. ‘audio’ contains three components: ‘path’ is the same as ‘file’, ‘array’ is the numerical representation of the raw waveform of the audio file in NumPy array format, and ‘sampling_rate’ is the number of audio samples recorded per second. ‘text’ is the transcript of the audio file.

print(dataset['train'][0])
result: 
{'file': '/root/.cache/huggingface/datasets/downloads/extracted/e0f3d50e856945385982ba36b58615b72eef9b2ba5a2565bdcc225b70f495eed/LibriSpeech/test-clean/7021/85628/7021-85628-0000.flac',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/e0f3d50e856945385982ba36b58615b72eef9b2ba5a2565bdcc225b70f495eed/LibriSpeech/test-clean/7021/85628/7021-85628-0000.flac',
  'array': array([-0.00018311, -0.00024414, -0.00018311, ...,  0.00061035,
          0.00064087,  0.00061035], dtype=float32),
  'sampling_rate': 16000},
 'text': 'but anders cared nothing about that'}

Build a vocabulary file

The Wav2Vec2 model uses the CTC algorithm to train deep neural networks on sequence problems, and its output at each time step is a single character or a blank token. It uses a character-based tokenizer. Therefore, we extract the distinct characters from the dataset and build the vocabulary file using the following code:

import json

def extract_characters(batch):
    # concatenate all transcripts in the batch and collect the distinct characters
    texts = " ".join(batch["text"])
    vocab = list(set(texts))
    return {"vocab": [vocab], "texts": [texts]}

vocabs = dataset.map(extract_characters, batched=True, batch_size=-1,
                     keep_in_memory=True, remove_columns=dataset.column_names["train"])

vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict["|"] = vocab_dict[" "]  # use "|" as the word delimiter instead of a space
del vocab_dict[" "]

vocab_dict["[UNK]"] = len(vocab_dict)  # add an "unknown" token
vocab_dict["[PAD]"] = len(vocab_dict)  # add a padding token that corresponds to CTC's "blank" token

with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

Create a tokenizer and feature extractor

The Wav2Vec2 model requires both a tokenizer and a feature extractor. In this step, we use the vocab.json file that we created in the previous step to create the Wav2Vec2CTCTokenizer. We use Wav2Vec2FeatureExtractor to make sure that the dataset used for fine-tuning has the same audio sampling rate as the dataset used for pre-training. Finally, we create a Wav2Vec2 processor that wraps the feature extractor and the tokenizer into a single processor. See the following code:

from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# create the Wav2Vec2 tokenizer from the vocabulary file
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")

# create the Wav2Vec2 feature extractor (16 kHz, matching the pre-training data)
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0, do_normalize=True, return_attention_mask=False)

# wrap the feature extractor and tokenizer into a single processor pipeline
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

Prepare the train and test datasets

Next, we extract the array representation of the audio files and their sampling_rate from the dataset and process them with the processor, so that the train and test data can be consumed by the model:

# extract the numerical representation from the dataset
def extract_array_samplingrate(batch):
    batch["speech"] = batch['audio']['array'].tolist()
    batch["sampling_rate"] = batch['audio']['sampling_rate']
    batch["target_text"] = batch["text"]
    return batch

dataset = dataset.map(extract_array_samplingrate, 
                      remove_columns=dataset.column_names["train"])

# process the dataset with the processor pipeline created above
def process_dataset(batch):  
    batch["input_values"] = processor(batch["speech"], 
                            sampling_rate=batch["sampling_rate"][0]).input_values

    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

data_processed = dataset.map(process_dataset, 
                    remove_columns=dataset.column_names["train"], batch_size=8, 
                    batched=True)

train_dataset = data_processed['train']
test_dataset = data_processed['test']

Then we upload the train and test data to Amazon Simple Storage Service (Amazon S3) using the following code:

from datasets.filesystems import S3FileSystem
s3 = S3FileSystem()

# save train_dataset to S3
training_input_path = f's3://{BUCKET}/{PREFIX}/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset to S3
test_input_path = f's3://{BUCKET}/{PREFIX}/test'
test_dataset.save_to_disk(test_input_path, fs=s3)

Fine-tune the Hugging Face model (Wav2Vec2)

We use SageMaker Hugging Face DLC script mode to construct the training and inference jobs, which allows you to write custom training and serving code and use Hugging Face framework containers that are maintained and supported by AWS.

When we create a training job using script mode, the entry_point script, the hyperparameters, its dependencies (from requirements.txt), and the input data (train and test datasets) are copied into the container. SageMaker then invokes the entry_point training script, which loads the train and test datasets, runs the training steps, and saves the model artifacts in /opt/ml/model in the container. After training, the artifacts in this directory are uploaded to Amazon S3 for later model hosting.

You can inspect the training script in the GitHub repo, in the scripts/ directory.
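The full train.py handles the CTC data collator, WER metric, and model configuration; the following is only a minimal sketch of the overall shape of such a script-mode entry point, assuming the standard SageMaker environment variables SM_CHANNEL_TRAIN, SM_CHANNEL_TEST, and SM_MODEL_DIR (argument names here are illustrative):

import argparse
import os

from datasets import load_from_disk
from transformers import Wav2Vec2ForCTC

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--model_name", type=str, default="facebook/wav2vec2-base")
    parser.add_argument("--vocab_url", type=str, default=None)
    args, _ = parser.parse_known_args()

    # SageMaker copies the S3 input channels to these local paths inside the container
    train_dataset = load_from_disk(os.environ["SM_CHANNEL_TRAIN"])
    test_dataset = load_from_disk(os.environ["SM_CHANNEL_TEST"])

    model = Wav2Vec2ForCTC.from_pretrained(args.model_name)

    # ... fine-tuning with the Hugging Face Trainer (data collator, TrainingArguments,
    # and the WER metric) is omitted here; see train.py in the GitHub repo ...

    # anything saved to SM_MODEL_DIR (/opt/ml/model) is uploaded to Amazon S3 after training
    model.save_pretrained(os.environ["SM_MODEL_DIR"])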

Create an estimator and start a training job

We use the Hugging Face estimator class to train our model. When creating the estimator, you need to specify the following parameters:

  • entry_point – The name of the training script. It loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model.
  • source_dir – The location of the training scripts.
  • transformers_version – The Hugging Face Transformers library version we want to use.
  • pytorch_version – The PyTorch version that’s compatible with the Transformers library.

For this use case and dataset, we use one ml.p3.2xlarge instance, and the training job finishes in around 2 hours. You can select a more powerful instance with more memory and GPUs to reduce the training time; however, this increases the cost.

When you create a Hugging Face estimator, you can configure hyperparameters and pass custom parameters to the training script, such as vocab_url in this example. You can also specify metric definitions in the estimator; SageMaker parses the training logs for these metrics and sends them to Amazon CloudWatch so you can monitor and track the training performance. For more details, see Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics.

import time

from sagemaker.huggingface import HuggingFace

# create a unique ID to tag the training job, model name, and endpoint name
id = int(time.time())

TRAINING_JOB_NAME = f"huggingface-wav2vec2-training-{id}"
vocab_url = f"s3://{BUCKET}/{PREFIX}/vocab.json"

hyperparameters = {'epochs': 10,  # you can increase the number of epochs to improve model accuracy
                   'train_batch_size': 8,
                   'model_name': "facebook/wav2vec2-base",
                   'vocab_url': vocab_url
                  }
                  
# define metrics definitions
metric_definitions=[
        {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e-)[0-9]+),?"},
        {'Name': 'eval_wer', 'Regex': "'eval_wer': ([0-9]+(.|e-)[0-9]+),?"},
        {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e-)[0-9]+),?"},
        {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e-)[0-9]+),?"},
        {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e-)[0-9]+),?"}]

OUTPUT_PATH= f's3://{BUCKET}/{PREFIX}/{TRAINING_JOB_NAME}/output/'

huggingface_estimator = HuggingFace(entry_point='train.py',
                                    source_dir='./scripts',
                                    output_path= OUTPUT_PATH, 
                                    instance_type='ml.p3.2xlarge',
                                    instance_count=1,
                                    transformers_version='4.6.1',
                                    pytorch_version='1.7.1',
                                    py_version='py36',
                                    role=ROLE,
                                    hyperparameters = hyperparameters,
                                    metric_definitions = metric_definitions,
                                   )

# Start the training job using the fit function; training takes approximately 2 hours to complete.
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path},
                          job_name=TRAINING_JOB_NAME)

In the following figure of CloudWatch training job logs, you can see that after 10 epochs of training, the model reaches a word error rate (WER) of around 0.17 on the subset of the SUPERB dataset. WER is a commonly used metric to evaluate speech recognition performance, and the objective is to minimize it. You can increase the number of epochs or use the full SUPERB dataset to improve the model further.
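As a reminder, WER is the number of word substitutions, deletions, and insertions needed to turn the predicted transcript into the reference, divided by the number of words in the reference. The following quick check with the open-source jiwer package (shown only for illustration; it isn't part of the training script) computes the WER for a single substituted word:

from jiwer import wer

reference = "she had your dark suit in greasy wash water all year"
hypothesis = "she had your dark suit in grecy wash water all year"

# one substituted word out of 11 reference words
print(wer(reference, hypothesis))  # -> 0.0909...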

Deploy the model as an endpoint on SageMaker and run inference

In this section, we walk through the steps to deploy the model and perform inference.

Inference script

We use the SageMaker Hugging Face Inference Toolkit to host our fine-tuned model. It provides default functions for preprocessing, predicting, and postprocessing for certain tasks. However, the default capabilities can't run inference on our model properly. Therefore, we define the custom functions model_fn(), input_fn(), predict_fn(), and output_fn() in the inference.py script to override the default behavior with our custom requirements. For more details, refer to the GitHub repo.

As of January 2022, the Inference Toolkit supports inference only for architectures that end with 'TapasForQuestionAnswering', 'ForQuestionAnswering', 'ForTokenClassification', 'ForSequenceClassification', 'ForMultipleChoice', 'ForMaskedLM', 'ForCausalLM', 'ForConditionalGeneration', 'MTModel', 'EncoderDecoderModel', 'GPT2LMHeadModel', and 'T5WithLMHeadModel'. The Wav2Vec2 model is not currently supported, which is why we need the custom inference script.

You can inspect the full inference script in the GitHub repo, in the scripts/ directory.
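The script in the repo handles more detail, but a minimal sketch of the four override functions (assuming the model and processor were saved together in the model artifact, and using the JSON request format shown later in this post) could look like the following:

import json

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

def model_fn(model_dir):
    # load the fine-tuned model and processor from the unpacked model.tar.gz
    processor = Wav2Vec2Processor.from_pretrained(model_dir)
    model = Wav2Vec2ForCTC.from_pretrained(model_dir)
    return model, processor

def input_fn(request_body, content_type="application/json"):
    # expects {"speech_array": [...], "sampling_rate": 16000}
    return json.loads(request_body)

def predict_fn(data, model_and_processor):
    model, processor = model_and_processor
    inputs = processor(data["speech_array"],
                       sampling_rate=data["sampling_rate"],
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]

def output_fn(prediction, accept="application/json"):
    return json.dumps(prediction)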

Create a Hugging Face model from the estimator

We use the Hugging Face Model class to create a model object, which you can deploy to a SageMaker endpoint. When creating the model, specify the following parameters:

  • entry_point – The name of the inference script. The methods defined in the inference script are implemented to the endpoint.
  • source_dir – The location of the inference scripts.
  • transformers_version – The Hugging Face Transformers library version we want to use. It should be consistent with the training step.
  • pytorch_version – The PyTorch version that is compatible with the Transformers library. It should be consistent with the training step.
  • model_data – The Amazon S3 location of a SageMaker model data .tar.gz file.

from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
        entry_point = 'inference.py',
        source_dir='./scripts',
        name = f'huggingface-wav2vec2-model-{id}',
        transformers_version='4.6.1', 
        pytorch_version='1.7.1', 
        py_version='py36',
        model_data=huggingface_estimator.model_data,
        role=ROLE,
    )

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge", 
    endpoint_name = f'huggingface-wav2vec2-endpoint-{id}'
)

When you create a predictor by using the model.deploy function, you can change the instance count and instance type based on your performance requirements.

Run inference on audio files

After you deploy the endpoint, you can run prediction tests to check the model performance. You can download an audio file from the S3 bucket by using the following code:

import boto3
s3 = boto3.client('s3')
s3.download_file(BUCKET, 'huggingface-blog/sample_audio/xxx.wav', 'downloaded.wav')
file_name ='downloaded.wav'

Alternatively, you can download a sample audio file to run the inference request:

import soundfile
!wget https://datashare.ed.ac.uk/bitstream/handle/10283/343/MKH800_19_0001.wav
file_name ='MKH800_19_0001.wav'
speech_array, sampling_rate = soundfile.read(file_name)
json_request_data = {"speech_array": speech_array.tolist(),
                     "sampling_rate": sampling_rate}

prediction = predictor.predict(json_request_data)
print(prediction)

The predicted result is as follows:

['"she had your dark suit in grecy wash water all year"', 'application/json']

Clean up

When you’re finished using the solution, delete the SageMaker endpoint to avoid ongoing charges:

predictor.delete_endpoint()

Conclusion

In this post, we showed how to fine-tune the pre-trained Wav2Vec2 model on SageMaker using a Hugging Face estimator, and how to host the model on SageMaker as a real-time inference endpoint using the SageMaker Hugging Face Inference Toolkit. For both the training and inference steps, we provided custom-defined scripts for greater flexibility, which are enabled and supported by SageMaker Hugging Face DLCs. You can use the method from this post to fine-tune a Wav2Vec2 model with your own datasets, or to fine-tune and deploy a different transformer model from Hugging Face.

Check out the notebook and code of this project from GitHub, and let us know your comments. For more comprehensive information, see Hugging Face on SageMaker and Use Hugging Face with Amazon SageMaker.

In addition, Hugging Face and AWS announced a partnership in 2022 that makes it even easier to train Hugging Face models on SageMaker. This functionality is available through the development of Hugging Face AWS DLCs. These containers include the Hugging Face Transformers, Tokenizers, and Datasets libraries, which allow us to use these resources for training and inference jobs. For a list of the available DLC images, see Available Deep Learning Containers Images. They are maintained and regularly updated with security patches. You can find many examples of how to train Hugging Face models with these DLCs and the Hugging Face Python SDK in the following GitHub repo.


About the Author

Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her main areas of interests are deep learning, computer vision, NLP, and time series data prediction. In her spare time, she enjoys reading novels and hiking in national parks in the UK.

Read More

Build a virtual credit approval agent with Amazon Lex, Amazon Textract, and Amazon Connect

Banking and financial institutions review thousands of credit applications per week. The credit approval process requires financial organizations to invest time and resources in reviewing documents like W2s, bank statements, and utility bills. The overall experience can be costly for the organization. At the same time, organizations have to consider borrowers, who are waiting for decisions on their credit applications. To retain customers, organizations need to process borrower applications quickly with low turnaround times.

With an automated credit approval assistant that uses machine learning, financial organizations can expedite the process, reduce cost, and provide a better customer experience with faster decisions. Banks and fintechs can build a virtual agent that reviews a customer’s financial documents and provides a decision instantly. An effective automated credit approval process not only improves the customer experience, but also lowers operating costs.

In this post, we show how to build a virtual credit approval assistant that reviews the financial documents required for loan approval and makes decisions instantly for a seamless customer experience. The solution uses Amazon Lex, Amazon Textract, and Amazon Connect, among other AWS services.

Overview of the solution

You can deploy the solution using an AWS CloudFormation template. The solution creates a virtual agent using Amazon Lex and associates it with Amazon Connect, which acts as the conversational interface with customers and asks the loan applicant to upload the necessary documents. The documents are stored in an Amazon Simple Storage Service (Amazon S3) bucket used only for that customer.

This solution is completely serverless and uses Amazon S3 to store a static website that hosts the front end and custom JavaScript to enable the rest of the requests. Amazon CloudFront serves as a content delivery network (CDN) to allow a public front end for the website. CloudFront is a fast CDN service that securely delivers data, videos, applications, and APIs to customers globally with low latency and high transfer speeds, all within a developer-friendly environment.

This is a sample project designed to be easily deployable for experimentation. The AWS Identity and Access Management (IAM) policy permissions in this solution use least privilege; however, the CloudFront and Amazon API Gateway resources deployed are publicly accessible. To take the appropriate measures to secure your CloudFront distribution and API Gateway resources, refer to Configuring secure access and restricting access to content and Security in Amazon API Gateway, respectively.

Additionally, the backend features API Gateway with HTTP routes for two AWS Lambda functions. The first function creates the chat session with Amazon Connect; the second passes the pre-signed URL link fetched by the front end from Amazon Connect to Amazon Lex. Amazon Lex triggers the Lambda function associated with it and lets Amazon Textract read the documents and capture all the fields and information in them. This function also makes the credit decision based on business rules previously defined by the organization. The solution is integrated with Amazon Connect to let customers connect to contact center agents if they have difficulty or need help during the process.
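For illustration only, the first Lambda function might call the Amazon Connect StartChatContact API roughly as in the following sketch; the environment variable names and event fields are placeholders, and the function deployed by the project may differ:

import os

import boto3

connect = boto3.client("connect")

def lambda_handler(event, context):
    # instance and contact flow IDs are placeholders supplied via environment variables
    response = connect.start_chat_contact(
        InstanceId=os.environ["CONNECT_INSTANCE_ID"],
        ContactFlowId=os.environ["CONTACT_FLOW_ID"],
        ParticipantDetails={"DisplayName": event.get("customerName", "Customer")},
    )
    # the front end uses these values to join the chat session
    return {
        "ContactId": response["ContactId"],
        "ParticipantId": response["ParticipantId"],
        "ParticipantToken": response["ParticipantToken"],
    }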

The following example depicts the interaction between bot and borrower.

The following diagram illustrates the solution architecture.

The solution workflow is as follows:

  1. Customers navigate to a URL served by CloudFront, which fetches webpages from an S3 bucket and sends JavaScript to the web browser.
  2. The web browser renders the webpages and makes an API call to API Gateway.
  3. API Gateway triggers the associated Lambda function.
  4. The function initiates a startChatContact API call with Amazon Connect and triggers the contact flow associated with it.
  5. Amazon Connect triggers Amazon Lex with the utterance to classify the intent. After the intent is classified, Amazon Lex elicits the required slots and asks the customer to upload the document to fulfill the intent.
  6. The applicant uploads the W2 document to the S3 bucket using the upload attachment icon in the chat window.

As a best practice, consider implementing encryption at rest for the S3 bucket using AWS Key Management Service (AWS KMS). Additionally, you can attach a bucket policy to the S3 bucket to ensure data is always encrypted in transit. Consider enabling server access logging for the S3 bucket to capture detailed records of requests to assist with security and access audits. For more information, see Security Best Practices for Amazon S3.

  7. The web browser makes a call to Amazon Connect to retrieve a pre-signed URL of the uploaded image. Make sure the pre-signed URLs expire a few minutes after the Lambda function runs the logic.
  8. After the document has been uploaded successfully, the web application makes an API call to API Gateway to update the file location for use in Amazon Lex session attributes.
  9. API Gateway triggers a Lambda function to pass the W2 pre-signed URL location. The function updates the session attributes in Amazon Lex with the pre-signed URL of the W2 document.
  10. The web browser also updates the slot to uploaded, which fulfills the intent.
  11. Amazon Lex triggers a Lambda function, which downloads the W2 image data and sends it to Amazon Textract for processing.
  12. Amazon Textract reads all the fields from the W2 image document, converts them into key-value pairs, and passes the data back to the Lambda function (a minimal sketch of this call follows the list).
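The following hypothetical helper sketches how a Lambda function could call Amazon Textract's AnalyzeDocument API with the FORMS feature and collect the detected key-value pairs; the project's actual Lambda code may parse the response differently:

import boto3

textract = boto3.client("textract")

def extract_w2_fields(image_bytes):
    """Illustrative sketch: return the form key-value pairs detected in a W2 image."""
    result = textract.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["FORMS"],
    )
    blocks = {b["Id"]: b for b in result["Blocks"]}

    fields = {}
    for block in result["Blocks"]:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key_text = _block_text(block, blocks)
            value_block = next(
                (blocks[i] for rel in block.get("Relationships", [])
                 if rel["Type"] == "VALUE" for i in rel["Ids"]), None)
            fields[key_text] = _block_text(value_block, blocks) if value_block else ""
    return fields

def _block_text(block, blocks):
    # concatenate the WORD children of a KEY or VALUE block
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                if blocks[child_id]["BlockType"] == "WORD":
                    words.append(blocks[child_id]["Text"])
    return " ".join(words)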

Amazon Textract conforms to the AWS shared responsibility model, which outlines the responsibilities for data protection between AWS and the customer. For more information, refer to Data Protection in Amazon Textract.

  13. Lambda uses the W2 data to evaluate the loan application and returns the result to the web browser.

Follow the best practices for enabling logging in Lambda. Refer to part 1 and part 2 of the blog series “Operating Lambda: Building a solid security foundation.”

Data in-transit is secured using TLS, and it’s highly recommended to encrypt data at rest. For more information about protecting data inside your S3 bucket, refer to Strengthen the security of sensitive data stored in Amazon S3 by using additional AWS services.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  1. An AWS account.
  2. An Amazon Connect contact center instance in the us-east-1 Region. You can use an existing one or create a new one. For instructions, refer to Get started with Amazon Connect. If you have an existing Amazon Connect instance and chat isn’t enabled, refer to Enabling Chat in an Existing Amazon Connect Contact Center.
  3. Chat attachments enabled in Amazon Connect. For instructions, refer to Enable attachments to share files using chat. For CORS setup, use option 2, which uses the * wildcard to AllowedOrigin.
  4. The example project located in the GitHub repository. You need to clone this repository on your local machine and use AWS Serverless Application Model (AWS SAM) to deploy the project. To install the AWS SAM CLI and configure AWS credentials, refer to Getting started with AWS SAM.
  5. Python 3.9 runtime to support the AWS SAM deployment.

Import the Amazon Connect flow

To import the Amazon Connect flow, complete the following steps:

  1. Log in to your Amazon Connect instance.
  2. Under Routing, choose Contact Flows.
  3. Choose Create contact flow.
  4. On the Save menu, choose Import flow.
  5. Choose Select and choose the import flow file located in the /flow subdirectory, called Loan_App_Connect_Flow.
  6. Save the flow. Do not publish yet.
  7. Expand Show additional flow information and choose the copy icon to capture the ARN.
  8. Save these IDs for use as parameters in the CloudFormation template to be deployed in the next step:
    arn:aws:connect:us-east-1:123456789012:instance/11111111-1111-1111-1111-111111111111/contact-flow/22222222-2222-2222-2222-222222222222

The Amazon Connect instance ID is the long alphanumeric value between the slashes immediately following instance in the ARN. For this post, the instance ID is 11111111-1111-1111-1111-111111111111.

The contact flow ID is the long value after the slash following contact-flow in the ARN. For this post, the flow ID is 22222222-2222-2222-2222-222222222222.
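If you prefer to pull both IDs out of the ARN programmatically, a small, purely illustrative snippet such as the following does the job:

# the resource portion of the ARN has the form
# instance/<instance-id>/contact-flow/<contact-flow-id>
arn = "arn:aws:connect:us-east-1:123456789012:instance/11111111-1111-1111-1111-111111111111/contact-flow/22222222-2222-2222-2222-222222222222"

parts = arn.split("/")
instance_id = parts[1]       # 11111111-1111-1111-1111-111111111111
contact_flow_id = parts[3]   # 22222222-2222-2222-2222-222222222222
print(instance_id, contact_flow_id)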

Deploy with AWS SAM

With the instance and flow IDs captured, we’re ready to deploy the project.

  1. Open a terminal window and clone the GitHub repository in a directory of your choice.
  2. Navigate to the amazon-connect-virtual-credit-agent directory and follow the deployment instructions in the GitHub repo.
  3. Record the Amazon Lex bot name from the Outputs section of the deployment for the next steps (called Loan_App_Bot if you accepted the default name).
  4. Return to these instructions once the AWS SAM deploy completes successfully.

Update the contact flow blocks

To update the contact flow blocks, complete the following steps:

  1. Log in to your Amazon Connect instance
  2. Under Routing, choose Contact Flows.
  3. Choose the flow named Loan_App_Flow.
  4. Choose the Get customer input block.
  5. Under the Amazon Lex section, choose the bot named Loan_App_Bot and the dev alias created earlier.
  6. Choose Save.
  7. Choose the Set working queue block.
  8. Choose the X icon and on the drop-down menu, choose BasicQueue.
  9. Choose Save.
  10. Save the flow.
  11. Publish the flow.

Test the solution

You’re now ready to test the solution.

  1. Log in to your Amazon Connect instance to set up an Amazon Connect agent for chat.
  2. On the dashboard, choose the phone icon to open the Contact Control Panel (CCP) in a separate window.
  3. In the CCP, change the agent state to Available.
  4. On the Outputs tab for your CloudFormation stack, choose the value for cloudFrontDistribution.

This is a link to your CloudFront URL. You’re redirected to a webpage with your loan services bot. A floating action button (FAB) is on the bottom right of the screen.

  1. Choose the FAB to open the chat bot.
  2. After you get the welcome message, enter I need a loan.
  3. When prompted, choose a loan type and enter a loan amount.
  4. Upload an image of a W2 document.

A sample W2 image file is located in the project repository in the /img subdirectory. The file is called w2.png.

After the image is uploaded, the bot asks you if you want to submit the application.

  1. Choose Yes to submit.

After submission, the bot evaluates the W2 image and provides a response. After a few seconds, you’re connected to an agent.

You should see a request to connect with chat in the CCP.

  1. Choose the request to accept.

The agent is now connected to the chat user. You can simulate each side of the conversation to test the chat session.

  1. Choose End Chat when you’re done.

Troubleshooting

After you deploy the stack, if you see an Amazon S3 permission error when viewing the CloudFront URL, it means the domain isn’t ready yet. The CDN can take up to 1 hour to be ready.

If you can’t add your attachments, check your CORS setting. For instructions, refer to Enable attachments to share files using chat. For CORS setup, use option 2, which uses the * wildcard to AllowedOrigin.

Clean up

To avoid incurring future charges, remove all resources created by deleting the CloudFormation stack.

Conclusion

In this post, we demonstrated how to quickly and securely set up a loan application processing solution. Data at rest and in transit are both encrypted and secured. This solution can act as a blueprint to build other self-service processing flows where Amazon Connect and Amazon Lex provide a conversational interface for customer engagement. We look forward to seeing what other solutions you build using this architecture.

Should you need assistance building these capabilities and Amazon Connect contact flows, please reach out to one of the dozens of Amazon Connect partners available worldwide.


About the Authors

Dipkumar Mehta is a Senior Conversational AI Consultant with the Amazon ProServe Natural Language AI team. He focuses on helping customers design, deploy, and scale end-to-end conversational AI solutions in production on AWS. He is also passionate about improving customer experiences and driving business outcomes by leveraging data.

Cecil Patterson is a Natural Language AI consultant with AWS Professional services based in North Texas. He has many years of experience working with large enterprises to enable and support global infrastructure solutions. Cecil uses his experience and diverse skill set to build exceptional conversational solutions for customers of all types.

Sanju Sunny is a Digital Innovation Specialist with Amazon ProServe. He engages with customers in a variety of industries around Amazon’s distinctive customer-obsessed innovation mechanisms in order to rapidly conceive, validate and prototype new products, services and experiences.

Matt Kurio is a Security Transformation Consultant with the Amazon ProServe Shared Delivery Team. He excels at helping enterprise customers build secure platforms and manage security effectively and efficiently. He also enjoys relaxing at the beach and outdoor activities with his family.

Read More

Startup Transforms Meeting Notes With Time-Saving Features

Gil Makleff and Artem Koren are developing AI for meeting transcripts, creating time-savers like shareable highlights of the text that is often TL;DR (too long; didn’t read).

The Sembly founders conceived the idea after years of working in enterprise operational consulting at UMT Consulting Group, which was acquired by Ernst & Young.

“We had an intuition that if AI were applied to those operational conversations and able to make sense of them, the value gains to enterprises could be enormous,” said Koren, chief product officer at Sembly.

Sembly goes far beyond basic transcription, allowing people to skip meetings and receive speaker highlights and key action items for follow-ups.

The New York startup uses proprietary AI models to transcribe and analyze meetings, transforming them into actionable insights. It aims to supercharge teams who want to focus on delivering results rather than spending time compiling notes.

Sembly’s GPU-fueled automatic speech recognition AI can be used with popular video call services such as Zoom, Webex, Microsoft Teams and Google Meet. In a few clicks on the Sembly site, it can be synced to Outlook or Google calendars or used for calls in progress via e-mail, web app, or the Sembly mobile app.

The service delivers market-leading transcript accuracy and AI-driven analytics, including highlights to pinpoint important discussion topics. It also allows users to zero in on meeting speakers and easily share clips of individual passages with team members, enhancing collaboration.

Sembly, founded in 2019, is a member of the NVIDIA Inception startup program.

Improving Speaker Tracking With NeMo

One of the pain points Sembly addresses in transcripts is what’s known as diarization, or identifying the correct speaker in text, which can be problematic. The company had tried popular diarization systems from major software makers with negligible results.

Diarization is a key step in the meeting processing pipeline because many of Sembly’s natural language processing features rely on that text to be properly identified. Its Glance View feature, for instance, can identify key meeting topics and who raised them.

Attributing meeting topics to the wrong person throws a wrench in follow-ups on action items.

Harnessing NVIDIA NeMo —  an open source framework for building, training and fine-tuning GPU-accelerated speech and natural language understanding models — provided a significant leap in accuracy.

Using the NeMo conversational AI toolkit for diarization model training, running on NVIDIA A100 GPUs, dramatically improved its speaker tracking. Before applying NeMo, Sembly had an 11 percent error rate in diarization. After implementation, its error rate declined to 5 percent.

Business Boost Amid Meeting Fatigue

With a shift to fewer face-to-face meetings and more virtual ones, companies are seeking ways to counter online meeting fatigue for employees, said Koren. That’s important for delivering more engaging workplace experiences, he added.

“There’s a concept of ‘meeting tourists’ in large organizations. And this is one of those things that we’re hoping Sembly will help to address,” he said.

Adopting Sembly to easily highlight key points and speakers in transcripts for sharing gives workers more time back in the day, he said. And leaner operational technologies that help companies stay more focused on key business objectives offer competitive advantages, said Koren.

For those with bloated calendars who need to juggle overlapping meetings, Sembly can also assist. Sembly can be directed to attend a meeting instead of the user and come back with a summary and a list of key items, saving time while keeping teams more informed.

“Sometimes I’d like to attend two meetings that overlap — with Sembly, now I can,” Koren said.

The post Startup Transforms Meeting Notes With Time-Saving Features appeared first on NVIDIA Blog.

Read More

A Night to Behold: Researchers Use Deep Learning to Bring Color to Night Vision

Talk about a bright idea. A team of scientists has used GPU-accelerated deep learning to show how color can be brought to night-vision systems. 

In a paper published this week in the journal PLOS One, a team of researchers at the University of California, Irvine, led by Professor Pierre Baldi and Dr. Andrew Browne, describes how it reconstructed color images of faces from photos taken with an infrared camera.

The study is a step toward predicting and reconstructing what humans would see using cameras that collect light using imperceptible near-infrared illumination. 

The study’s authors explain that humans see light in the so-called “visible spectrum,” or light with wavelengths of between 400 and 700 nanometers.

Typical night vision systems rely on cameras that collect infrared light outside this spectrum that we can’t see. 

Information gathered by these cameras is then transposed to a display that shows a monochromatic representation of what the infrared camera detects, the researchers explain.

The team at UC Irvine developed an imaging algorithm that relies on deep learning to predict what humans would see using light captured by an infrared camera.

 

Researchers at the University of California, Irvine, aimed to use deep learning to predict visible spectrum images using infrared illumination alone. Source: Browne, et al. 

 

In other words, they’re able to digitally render a scene for humans using cameras operating in what, to humans, would be complete “darkness.” 

To do this, the researchers used a monochromatic camera sensitive to visible and near-infrared light to acquire an image dataset of printed images of faces. 

These images were gathered under multispectral illumination spanning standard visible red, green, blue and infrared wavelengths. 

The researchers then optimized a convolutional neural network with a U-Net-like architecture — a specialized convolutional neural network first developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg — to predict visible spectrum images from near-infrared images.

On the left, visible spectrum ground truth image composed of red, green and blue input images. On the right, predicted reconstructions for UNet-GAN, UNet and linear regression using three infrared input images. Source: Browne, et al. 

The system was trained using NVIDIA GPUs and 140 images of human faces for training, 40 for validation and 20 for testing.  

The result: the team successfully recreated color portraits of people taken by an infrared camera in darkened rooms. In other words, they created systems that could “see” color images in the dark.  

To be sure, these systems aren’t yet ready for general-purpose use. They would need to be trained to predict the color of different kinds of objects — such as flowers or faces.

Nevertheless, the study could one day lead to night vision systems able to see color, just as we do in daylight, or allow scientists to study biological samples sensitive to visible light.

Featured image source: Browne, et al. 

The post A Night to Behold: Researchers Use Deep Learning to Bring Color to Night Vision appeared first on NVIDIA Blog.

Read More

Learning to think critically about machine learning

Students in the MIT course 6.036 (Introduction to Machine Learning) study the principles behind powerful models that help physicians diagnose disease or aid recruiters in screening job candidates.

Now, thanks to the Social and Ethical Responsibilities of Computing (SERC) framework, these students will also stop to ponder the implications of these artificial intelligence tools, which sometimes come with their share of unintended consequences.

Last winter, a team of SERC Scholars worked with instructor Leslie Kaelbling, the Panasonic Professor of Computer Science and Engineering, and the 6.036 teaching assistants to infuse weekly labs with material covering ethical computing, data and model bias, and fairness in machine learning. The process was initiated in the fall of 2019 by Jacob Andreas, the X Consortium Assistant Professor in the Department of Electrical Engineering and Computer Science. SERC Scholars collaborate in multidisciplinary teams to help postdocs and faculty develop new course material.

Because 6.036 is such a large course, more than 500 students who were enrolled in the 2021 spring term grappled with these ethical dimensions alongside their efforts to learn new computing techniques. For some, it may have been their first experience thinking critically in an academic setting about the potential negative impacts of machine learning.

The SERC Scholars evaluated each lab to develop concrete examples and ethics-related questions to fit that week’s material. Each brought a different toolset. Serena Booth is a graduate student in the Interactive Robotics Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL). Marion Boulicault was a graduate student in the Department of Linguistics and Philosophy, and is now a postdoc in the MIT Schwarzman College of Computing, where SERC is based. And Rodrigo Ochigame was a graduate student in the Program in History, Anthropology, and Science, Technology, and Society (HASTS) and is now an assistant professor at Leiden University in the Netherlands. They collaborated closely with teaching assistant Dheekshita Kumar, MEng ’21, who was instrumental in developing the course materials.

They brainstormed and iterated on each lab, while working closely with the teaching assistants to ensure the content fit and would advance the core learning objectives of the course. At the same time, they helped the teaching assistants determine the best way to present the material and lead conversations on topics with social implications, such as race, gender, and surveillance.

“In a class like 6.036, we are dealing with 500 people who are not there to learn about ethics. They think they are there to learn the nuts and bolts of machine learning, like loss functions, activation functions, and things like that. We have this challenge of trying to get those students to really participate in these discussions in a very active and engaged way. We did that by tying the social questions very intimately with the technical content,” Booth says.

For instance, in a lab on how to represent input features for a machine learning model, they introduced different definitions of fairness, asked students to consider the pros and cons of each definition, then challenged them to think about the features that should be input into a model to make it fair.

Four labs have now been published on MIT OpenCourseWare. A new team of SERC Scholars is revising the other eight, based on feedback from the instructors and students, with a focus on learning objectives, filling in gaps, and highlighting important concepts.

An intentional approach

The students’ efforts on 6.036 show how SERC aims to work with faculty in ways that work for them, says Julie Shah, associate dean of SERC and professor of aeronautics and astronautics. They adapted the SERC process due to the unique nature of this large course and tight time constraints.

SERC was established more than two years ago through the MIT Schwarzman College of Computing as an intentional approach to bring faculty from divergent disciplines together into a collaborative setting to co-create and launch new course material focused on social and responsible computing.

Each semester, the SERC team invites about a dozen faculty members to join an Action Group dedicated to developing new curricular materials (there are several SERC Action Groups, each with a different mission). They are purposeful in whom they invite, and seek to include faculty members who will likely form fruitful partnerships in smaller subgroups, says David Kaiser, associate dean of SERC, the Germeshausen Professor of the History of Science, and professor of physics.

These subgroups of two or three faculty members hone their shared interest over the course of the term to develop new ethics-related material. But rather than one discipline serving another, the process is a two-way street; every faculty member brings new material back to their course, Shah explains. Faculty are drawn to the Action Groups from all of MIT’s five schools.

“Part of this involves going outside your normal disciplinary boundaries and building a language, and then trusting and collaborating with someone new outside of your normal circles. That’s why I think our intentional approach has been so successful. It is good to pilot materials and bring new things back to your course, but building relationships is the core. That makes this something valuable for everybody,” she says.

Making an impact

Over the past two years, Shah and Kaiser have been impressed by the energy and enthusiasm surrounding these efforts.

They have worked with about 80 faculty members since the program started, and more than 2,100 students took courses that included new SERC content in the last year alone. Those students aren’t all necessarily engineers — about 500 were exposed to SERC content through courses offered in the School of Humanities, Arts, and Social Sciences, the Sloan School of Management, and the School of Architecture and Planning.

Central to SERC is the principle that ethics and social responsibility in computing should be integrated into all areas of teaching at MIT, so it becomes just as relevant as the technical parts of the curriculum, Shah says. Technology, and AI in particular, now touches nearly every industry, so students in all disciplines should have training that helps them understand these tools, and think deeply about their power and pitfalls.

“It is not someone else’s job to figure out the why or what happens when things go wrong. It is all of our responsibility and we can all be equipped to do it. Let’s get used to that. Let’s build up that muscle of being able to pause and ask those tough questions, even if we can’t identify a single answer at the end of a problem set,” Kaiser says.

For the three SERC Scholars, it was uniquely challenging to carefully craft ethical questions when there was no answer key to refer to. But thinking deeply about such thorny problems also helped Booth, Boulicault, and Ochigame learn, grow, and see the world through the lens of other disciplines.

They are hopeful the undergraduates and teaching assistants in 6.036 take these important lessons to heart, and into their future careers.

“I was inspired and energized by this process, and I learned so much, not just the technical material, but also what you can achieve when you collaborate across disciplines. Just the scale of this effort felt exciting. If we have this cohort of 500 students who go out into the world with a better understanding of how to think about these sorts of problems, I feel like we could really make a difference,” Boulicault says.

Read More

Locked-image Tuning: Adding Language Understanding to Image Models

The ability to classify images into categories has been transformed by deep learning. It has also been significantly accelerated by transfer learning, whereby models are first pre-trained on large datasets, like ImageNet, to learn visual representations that are then transferred via fine-tuning to a new task with less data (e.g., classifying animals). Previous works such as BiT and ViT employed these methods to achieve state-of-the-art performance on a wide range of classification tasks, such as the VTAB benchmark.

However, fine-tuning has some downsides: though pre-training is done only once, fine-tuning is necessary on every new dataset for which task-specific data is needed. Multimodal contrastive learning is an alternative, recently popularized paradigm (e.g., CLIP, ALIGN) that overcomes these issues by instead learning how to match free-form text with images. These models can then solve new tasks by reformulating them as image-text matching problems, without extra data (referred to as “zero-shot” learning). Contrastive learning is flexible and easy to adapt to new tasks, but has its own limitations, namely the need for a lot of paired image-text data and weaker performance than transfer learning approaches.

With those limitations in mind, we propose “LiT: Zero-Shot Transfer with Locked-image Text Tuning”, to appear at CVPR 2022. LiT models learn to match text to an already pre-trained image encoder. This simple yet effective setup provides the best of both worlds: strong image representations from pre-training, plus flexible zero-shot transfer to new tasks via contrastive learning. LiT achieves state-of-the-art zero-shot classification accuracy, significantly closing the gap between the two styles of learning. We think the best way to understand is to try it yourself, so we’ve included a demo of LiT models at the end of this post.

Fine-tuning (left) requires task-specific data and training to adapt a pre-trained model to a new task. An LiT model (right) can be used with any task, without further data or adaptation.

Contrastive Learning on Image-Text Data
Contrastive learning models learn representations from “positive” and “negative” examples, such that representations for “positive” examples are similar to each other but different from “negative” examples.

Multimodal contrastive learning applies this to pairs of images and associated texts. An image encoder computes representations from images, and a text encoder does the same for texts. Each image representation is encouraged to be close to the representation of its associated text (“positive”), but distinct from the representation of other texts (“negatives”) in the data, and vice versa. This has typically been done with randomly initialized models (“from scratch”), meaning the encoders have to simultaneously learn representations and how to match them.

Multimodal contrastive learning trains models to produce similar representations for closely matched images and texts.
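To make the setup concrete, here is a schematic PyTorch sketch of the symmetric contrastive loss on a batch of paired image and text embeddings; it is illustrative only and is not the code used to train these models:

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss for a batch of matched (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # similarity of every image in the batch to every text in the batch
    logits = image_emb @ text_emb.t() / temperature          # shape: (B, B)
    targets = torch.arange(logits.shape[0])                  # matched pairs sit on the diagonal

    loss_img_to_txt = F.cross_entropy(logits, targets)       # image -> text direction
    loss_txt_to_img = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_img_to_txt + loss_txt_to_img) / 2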

This training can be done on noisy, loosely aligned pairs of image and text, which naturally occur on the web. This circumvents the need for manual labeling, and makes data scaling easy. Furthermore, the model learns much richer visual concepts — it’s not constrained to what’s defined in the classification label space. Instead of classifying an image as “coffee”, it can understand whether it’s “a small espresso in a white mug” or “a large latte in a red flask”.

Once trained, a model that aligns image and text can be used in many ways. For zero-shot classification, we compare image representations to text representations of the class names. For example, a “wombat vs jaguar” classifier can be built by computing the representations of the texts “jaguar” and “wombat”, and classifying an image as a jaguar if its representation better matches the former. This approach scales to thousands of classes and makes it very easy to solve classification tasks without the extra data necessary for fine-tuning. Another application of contrastive models is image search (a.k.a. image-text retrieval), by finding the image whose representation best matches that of a given text, or vice versa.
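As a sketch of the zero-shot classification recipe described above (the image_encoder and text_encoder below are placeholders for any trained pair of aligned encoders, and their exact interfaces are assumed for illustration):

import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    """Pick the class whose text embedding best matches the image embedding."""
    with torch.no_grad():
        img = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # shape: (1, D)
        txt = F.normalize(text_encoder(class_names), dim=-1)          # shape: (C, D)
    scores = (img @ txt.t()).squeeze(0)                               # cosine similarities
    return class_names[scores.argmax().item()]

# e.g. zero_shot_classify(image, ["jaguar", "wombat"], image_encoder, text_encoder)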

The Best of Both Worlds with Locked-image Tuning
As mentioned earlier, transfer learning achieves state-of-the-art accuracy, but requires per-task labels, datasets, and training. On the other hand, contrastive models are flexible, scalable, and easily adaptable to new tasks, but fall short in performance. To compare, at the time of writing, the state of the art on ImageNet classification using transfer learning is 90.94%, but the best contrastive zero-shot models achieve 76.4%.

LiT tuning bridges this gap: we contrastively train a text model to compute representations well aligned with the powerful ones available from a pre-trained image encoder. Importantly, for this to work well, the image encoder should be “locked”, that is: it should not be updated during training. This may be unintuitive since one usually expects the additional information from further training to increase performance, but we find that locking the image encoder consistently leads to better results.

LiT-tuning contrastively trains a text encoder to match a pre-trained image encoder. The text encoder learns to compute representations that align to those from the image encoder.
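In rough PyTorch-style pseudocode (again illustrative, not the actual LiT training code), “locking” amounts to freezing the image tower and optimizing only the text tower against the contrastive loss sketched earlier:

import torch

def lit_tune(image_encoder, text_encoder, loader, epochs=1, lr=1e-4):
    """Illustrative LiT-style loop: the image tower is locked, only the text tower trains."""
    for p in image_encoder.parameters():
        p.requires_grad = False            # "locked" image encoder
    image_encoder.eval()

    optimizer = torch.optim.Adam(text_encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for images, texts in loader:
            with torch.no_grad():          # image representations stay fixed
                image_emb = image_encoder(images)
            text_emb = text_encoder(texts)
            loss = contrastive_loss(image_emb, text_emb)  # loss sketched above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return text_encoder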

This can be considered an alternative to the classic fine-tuning stage, where the image encoder is separately adapted to every new classification task; instead we have one stage of LiT-tuning, after which the model can classify any data. LiT-tuned models achieve 84.5% zero-shot accuracy on ImageNet classification, showing significant improvements over previous methods that train models from scratch, and halving the performance gap between fine-tuning and contrastive learning.

Left: LiT-tuning significantly closes the gap between the best contrastive models and the best models fine-tuned with labels. Right: Using a pre-trained image encoder is always helpful, but locking it is surprisingly a key part of the recipe to success; unlocked image models (dashed) yield significantly worse performance.

An impressive benefit of contrastive models is increased robustness — they retain high accuracy on datasets that typically fool fine-tuned models, such as ObjectNet and ImageNet-C. Similarly, LiT-tuned models have high performance across various challenging versions of ImageNet, for example achieving a state-of-the-art 81.1% accuracy on ObjectNet.

LiT-tuning has other advantages. While prior contrastive works require large amounts of data and train for a very long time, the LiT approach is much less data hungry. LiT models trained on 24M publicly available image-text pairs rival the zero-shot classification performance of prior models trained on 400M image-text pairs of private data. The locked image encoder also leads to faster training with a smaller memory footprint. On larger datasets, image representations can be pre-computed; not running the image model during training further improves efficiency and also unlocks much larger batch sizes, which increases the number of “negatives” the model sees and is key to high-performance contrastive learning. The method works well with varied forms of image pre-training (e.g., including self-supervised learning), and with many publicly available image models. We hope that these benefits make LiT a great testbed for researchers.

Conclusion
We present Locked-image Tuning (LiT), which contrastively trains a text encoder to match image representations from a powerful pre-trained image encoder. This simple method is data and compute efficient, and substantially improves zero-shot classification performance compared to existing contrastive learning approaches.

Want to try it yourself?

A preview of the demo: use it to match free-form text descriptions to images and build your own zero-shot classifier!

We have prepared a small interactive demo to try some LiT-tuned models. We also provide a Colab with more advanced use cases and larger models, which are a great way to get started.

Acknowledgments
We would like to thank Xiaohua Zhai, Xiao Wang, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer who have co-authored the LiT paper and been involved in all aspects of its development, as well as the Brain team in Zürich. We also would like to thank Tom Small for creating the animations used in this blogpost.

Read More

Three MIT students awarded 2022 Paul and Daisy Soros Fellowships for New Americans

MIT graduate student Fernanda De La Torre, alumna Trang Luu ’18, SM ’20, and senior Syamantak Payra are recipients of the 2022 Paul and Daisy Soros Fellowships for New Americans.

De La Torre, Luu, and Payra are among 30 New Americans selected from a pool of over 1,800 applicants. The fellowship honors the contributions of immigrants and children of immigrants by providing $90,000 in funding for graduate school.

Students interested in applying to the P.D. Soros Fellowship for future years may contact Kim Benard, associate dean of distinguished fellowships in Career Advising and Professional Development.

Fernanda De La Torre

Fernanda De La Torre is a PhD student in the Department of Brain and Cognitive Sciences. With Professor Josh McDermott, she studies how we integrate vision and sound, and with Professor Robert Yang, she develops computational models of imagination. 

De La Torre spent her early childhood with her younger sister and grandmother in Guadalajara, Mexico. At age 12, she crossed the Mexican border to reunite with her mother in Kansas City, Missouri. Shortly after, an abusive home environment forced De La Torre to leave her family and support herself throughout her early teens.

Despite her difficult circumstances, De La Torre excelled academically in high school. By winning various scholarships that would discreetly take applications from undocumented students, she was able to continue her studies in computer science and mathematics at Kansas State University. There, she became intrigued by the mysteries of the human mind. During college, De La Torre received invaluable mentorship from her former high school principal, Thomas Herrera, who helped her become documented through the Violence Against Women Act. Her college professor, William Hsu, supported her interests in artificial intelligence and encouraged her to pursue a scientific career.

After her undergraduate studies, De La Torre won a post-baccalaureate fellowship from the Department of Brain and Cognitive Sciences at MIT, where she worked with Professor Tomaso Poggio on the theory of deep learning. She then transitioned into the department’s PhD program. Beyond contributing to scientific knowledge, De La Torre plans to use science to create spaces where all people, including those from backgrounds like her own, can innovate and thrive.

She says: “Immigrants face many obstacles, but overcoming them gives us a unique strength: We learn to become resilient, while relying on friends and mentors. These experiences foster both the desire and the ability to pay it forward to our community.”

Trang Luu

Trang Luu graduated from MIT with a BS in mechanical engineering in 2018, and a master of engineering degree in 2020. Her Soros award will support her graduate studies at Harvard University in the MBA/MS engineering sciences program.

Born in Saigon, Vietnam, Luu was 3 when her family immigrated to Houston, Texas. Watching her parents’ efforts to make a living in a land where they did not understand the culture or speak the language well, Luu wanted to alleviate hardship for her family. She took full responsibility for her education and found mentors to help her navigate the American education system. At home, she assisted her family in making and repairing household items, which fueled her excitement for engineering.

As an MIT undergraduate, Luu focused on assistive technology projects, applying her engineering background to solve problems impeding daily living. These projects included a new adaptive socket liner for below-the-knee amputees in Kenya, Ethiopia, and Thailand; a walking stick adapter for wheelchairs; a computer head pointer for patients with limited arm mobility; a safer makeshift cook stove design for street vendors in South Africa; and a quicker method to test new drip irrigation designs. As a graduate student in MIT D-Lab under the direction of Professor Daniel Frey, Luu was awarded a National Science Foundation Graduate Research Fellowship. In her graduate studies, Luu researched methods to improve evaporative cooling devices for off-grid farmers to reduce rapid fruit and vegetable deterioration.

These projects strengthened Luu’s commitment to innovating new technology and devices for people struggling with basic daily tasks. During her senior year, Luu collaborated on developing a working prototype of a wearable device that noninvasively reduces hand tremors associated with Parkinson’s disease or essential tremor. Observing patients’ joy after their tremors stopped compelled Luu and three co-founders to continue developing the device after college. Four years later, Encora Therapeutics has accomplished major milestones, including Breakthrough Device designation by the U.S. Food and Drug Administration.

Syamantak Payra

Hailing from Houston, Texas, Syamantak Payra is a senior majoring in electrical engineering and computer science, with minors in public policy and entrepreneurship and innovation. He will be pursuing a PhD in engineering, with the goal of creating new biomedical devices that can help improve daily life for patients worldwide and enhance health care outcomes for decades to come.

Payra’s parents had emigrated from India, and he grew up immersed in his grandparents’ rich Bengali culture. As a high school student, he conducted projects with NASA engineers at Johnson Space Center, experimented at home with his scientist parents, and competed in spelling bees and science fairs across the United States. Through these avenues and activities, Syamantak not only gained perspectives on bridging gaps between people, but also found passions for language, scientific discovery, and teaching others.

After watching his grandmother struggle with asthma and chronic obstructive pulmonary disease and losing his baby brother to brain cancer, Payra devoted himself to trying to use technology to solve health-care challenges. Payra’s proudest accomplishments include building a robotic leg brace for his paralyzed teacher and conducting free literacy workshops and STEM outreach programs that reached nearly a thousand underprivileged students across the Greater Houston Area.

At MIT, Payra has worked in Professor Yoel Fink’s research laboratory, creating digital sensor fibers that have been woven into intelligent garments that can assist in diagnosing illnesses, and in Professor Joseph Paradiso’s research laboratory, where he contributed to next-generation spacesuit prototypes that better protect astronauts on spacewalks. Payra’s research has been published in multiple scientific journals, and he was inducted into the National Gallery for America’s Young Inventors.

Read More

Video Classification on Edge Devices with TensorFlow Lite and MoViNet

Posted by Dan Kondratyuk, Liangzhe Yuan, Google Research and Khanh LeViet, TensorFlow Developer Relations

We are excited to announce MoViNets (pronounced “movie nets”), a family of new mobile-optimized model architectures for video classification. The models are trained on the Kinetics-600 dataset to recognize 600 different human actions (such as playing trumpet, robot dancing, bowling, and more) and can classify video streams captured on a modern smartphone in real time. You can download the pre-trained TensorFlow Lite models from TensorFlow Hub, try them out using our Android and Raspberry Pi demo apps, or fine-tune your own MoViNets with the Colab demo and the code in the TensorFlow Model Garden.

Demo from the TensorFlow Lite video classification reference app

Video classification is a machine learning task that takes video frames as input and predicts a single class from a larger set of classes. Video action recognition is a type of video classification where the set of predicted classes consists of human actions that happened in the frames. Video action recognition is similar to image recognition in that both take images as input and output the probabilities of those images belonging to each of the predefined classes. However, a video action recognition model has to look at both the content of each frame and the temporal relationships between adjacent frames to understand the actions in the video. For example, if you look at these still images, it’s difficult to tell what the person is doing.

But if you look at the full video, it becomes clear that the person is performing jumping jacks.

MoViNet Model Architecture

MoViNets are a family of convolutional neural networks which efficiently process video streams, outputting accurate predictions with a fraction of the latency of convolutional video classifiers like 3D ResNets or transformer-based classifiers like ViT.

Frame-based classifiers output predictions on each 2D frame independently, resulting in sub-optimal performance due to their lack of temporal reasoning. On the other hand, 3D video classifiers offer highly accurate predictions by processing all frames in a video clip simultaneously, at the cost of significant memory and latency penalties as the number of input frames increases. MoViNets combine key advantages of both 2D frame-based classifiers and 3D video classifiers while mitigating their disadvantages.

The following figure shows a typical approach to using 3D networks with multi-clip evaluation, where the predictions of multiple overlapping subclips are averaged together. Shorter subclips result in lower latency, but reduce the overall accuracy.

Diagram illustrating Multi-Clip Evaluation for 3D Video Networks
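The averaging step in multi-clip evaluation described above can be sketched as follows; the `model_fn`, clip length, and stride are illustrative assumptions rather than a specific 3D classifier's API.

```python
import numpy as np

def multi_clip_predict(model_fn, video_frames, clip_len=32, stride=16):
    """Average predictions over overlapping fixed-length subclips of a video."""
    probs = []
    for start in range(0, max(1, len(video_frames) - clip_len + 1), stride):
        clip = video_frames[start:start + clip_len]   # one overlapping subclip
        probs.append(model_fn(clip))                  # per-clip class probabilities
    # Averaging smooths out per-clip errors, but overlapping frames are re-processed,
    # and shorter clips trade accuracy for latency.
    return np.mean(probs, axis=0)
```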

MoViNets take a hybrid approach, replacing 3D convolutions with causal convolutions so that intermediate activations can be cached across frames with a Stream Buffer. The Stream Buffer stores the input activations of the 3D operations; these cached activations are output by the model and fed back in as input with the next clip.

Diagram illustrating Streaming Evaluation for MoViNets

As a result, MoViNets can receive one frame of input at a time, reducing peak memory usage with no loss of accuracy: the predictions are equivalent to those from inputting all frames at once, as a 3D video classifier would. MoViNets additionally leverage Neural Architecture Search (NAS), searching for efficient model configurations on video datasets (specifically Kinetics-600) across network width, depth, and resolution.
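In streaming form, the model behaves like a stateful function: it takes one frame plus its cached buffer states and returns a prediction plus updated states. The sketch below is schematic; the state initialization and model call are placeholders for however a particular MoViNet export exposes them.

```python
def streaming_predict(model_fn, init_states, frames):
    """Feed frames one at a time, carrying the stream-buffer states across calls."""
    states = init_states          # cached activations for the causal convolutions
    logits = None
    for frame in frames:
        # Each call sees only the new frame; temporal context lives in `states`.
        logits, states = model_fn(frame, states)
    return logits                 # equivalent to processing the whole clip at once
```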

The result is a set of action classifiers that output temporally stable predictions, transitioning smoothly based on frame content. Below is an example plot of MoViNet-A2 making predictions on each frame of a skateboarding video clip. Notice how the initial scene, with a small amount of motion, has relatively constant predictions, while the next scene, with much larger motion, causes a dramatic shift in predicted classes.

MoViNets need a few modifications to run effectively on edge devices. We start with MoViNet-A0-Stream, MoViNet-A1-Stream, and MoViNet-A2-Stream, the smaller models that can feasibly run in real time (20 fps or higher). To quantize MoViNet effectively, we apply a few modifications to the model architecture: the hard-swish activation is replaced by ReLU6, and the Squeeze-and-Excitation layers of the original architectures are removed, which results in a 3-4 percentage point (p.p.) accuracy drop on Kinetics-600. We then convert the models to TensorFlow Lite and use integer-based post-training quantization (as well as float16 quantization) to reduce the model sizes and make them run faster on mobile CPUs. The integer-based post-training quantization process introduces a further 2-3 p.p. accuracy loss. Compared with the original MoViNets, the quantized MoViNets lag behind in accuracy on full 10-second Kinetics-600 clips (a 5-7 p.p. accuracy reduction in total), but in practice they are able to provide very accurate predictions on everyday human actions, e.g., push-ups, dancing, and playing piano. In the future, we plan to train with quantization-aware training to bridge this accuracy gap.
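For reference, post-training quantization of a SavedModel to TensorFlow Lite generally follows the pattern below. The model path and `calibration_samples` iterable are placeholders, and the exact conversion settings used for the released MoViNets may differ; this is a minimal sketch of the general technique, not the official export script.

```python
import tensorflow as tf

# Hypothetical path to an exported streaming MoViNet SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("movinet_a0_stream_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # A handful of real input samples lets the converter calibrate int8 ranges.
    for sample in calibration_samples:  # placeholder iterable of input tensors
        yield [sample]

converter.representative_dataset = representative_dataset
tflite_int8_model = converter.convert()

# Alternatively, float16 quantization roughly halves the weight size:
fp16_converter = tf.lite.TFLiteConverter.from_saved_model("movinet_a0_stream_saved_model")
fp16_converter.optimizations = [tf.lite.Optimize.DEFAULT]
fp16_converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = fp16_converter.convert()
```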

A video plotting the top-5 predictions of MoViNet-A2 over time on an example 8-second (25 fps) skateboarding video clip. Create your own plots with this Colab notebook.

We benchmarked the quantized A0, A1, and A2 models on real hardware; inference runs at roughly 200, 120, and 60 fps, respectively, on a Pixel 4 CPU. In practice, due to input pipeline overhead, throughput drops closer to 20-60 fps when running on Android with a camera as input.

Model     | Quantization | Top-1 Accuracy (%) | Latency (ms, Pixel 4 CPU) | Model Size (MB) | Recommended Input
----------|--------------|--------------------|---------------------------|-----------------|-------------------
A0-Stream | int8         | 65.0               | 4.80                      | 3.1             | 172 x 172, 5 fps
A1-Stream | int8         | 70.1               | 8.35                      | 4.5             | 172 x 172, 5 fps
A2-Stream | int8         | 72.2               | 15.76                     | 5.1             | 224 x 224, 5 fps
A0-Stream | float16      | 71.5               | 17.47                     | 7.6             | 172 x 172, 5 fps
A1-Stream | float16      | 76.0               | 34.82                     | 13              | 172 x 172, 5 fps
A2-Stream | float16      | 77.5               | 76.31                     | 15              | 224 x 224, 5 fps

Train a Custom Model

You can train your own video classifier using the MoViNet codebase in the TensorFlow Model Garden. The accompanying Colab notebook walks through the specific steps to fine-tune a pretrained video classifier on another dataset.
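Independent of the specific Model Garden APIs (which the Colab covers), a generic fine-tuning recipe looks roughly like the sketch below. Here `backbone` is an assumed pretrained video feature extractor that returns pooled features, and `num_classes`, input resolution, and hyperparameters are placeholders for your own dataset.

```python
import tensorflow as tf

def build_finetune_model(backbone, num_classes, freeze_backbone=True):
    """Attach a fresh classification head to a pretrained video backbone."""
    backbone.trainable = not freeze_backbone   # optionally keep pretrained weights fixed
    inputs = tf.keras.Input(shape=(None, 172, 172, 3))  # (frames, height, width, channels)
    features = backbone(inputs)                          # assumed to return pooled features
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_finetune_model(backbone, num_classes=10)
# model.fit(train_videos, epochs=5, validation_data=val_videos)
```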

Future Steps

We are excited to see on-device, online video action recognition powered by MoViNets, which deliver highly efficient performance. In the future, we plan to support quantization-aware training for MoViNets to mitigate the quantization accuracy loss. We are also interested in extending MoViNets as the backbone for more on-device video tasks, e.g., video object detection, video object segmentation, visual tracking, pose estimation, and more.

Acknowledgement

We would like to extend a big thanks to Yeqing Li for supporting MoViNets in TensorFlow Model Garden, Boqing Gong, Huisheng Wang, and Ting Liu for project guidance, Lu Wang for code reviews, and the TensorFlow Hub team for hosting our models.

Read More