Build a predictive maintenance solution with Amazon Kinesis, AWS Glue, and Amazon SageMaker

Organizations are increasingly building and using machine learning (ML)-powered solutions for a variety of use cases and problems, including predictive maintenance of machine parts, product recommendations based on customer preferences, credit profiling, content moderation, fraud detection, and more. In many of these scenarios, the effectiveness and benefits derived from these ML-powered solutions can be further enhanced when they can process and derive insights from data events in near-real time.

Although the business value and benefits of near-real-time ML-powered solutions are well established, the architecture required to implement these solutions at scale with optimum reliability and performance is complicated. This post describes how you can combine Amazon Kinesis, AWS Glue, and Amazon SageMaker to build a near-real-time feature engineering and inference solution for predictive maintenance.

Use case overview

We focus on a predictive maintenance use case where sensors deployed in the field (such as on industrial equipment or network devices) need to be replaced or rectified before they become faulty and cause downtime. Downtime can be expensive for businesses and can lead to poor customer experience. Predictive maintenance powered by an ML model can also help augment regular schedule-based maintenance cycles by indicating when a machine part in good condition should not be replaced, therefore avoiding unnecessary cost.

In this post, we focus on applying machine learning to a synthetic dataset containing machine failures along with features such as air temperature, process temperature, rotation speed, torque, and tool wear. The dataset used is sourced from the UCI Data Repository.

Machine failure consists of five independent failure modes:

  • Tool Wear Failure (TWF)
  • Heat Dissipation Failure (HDF)
  • Power Failure (PWF)
  • Over-strain Failure (OSF)
  • Random Failure (RNF)

The machine failure label indicates whether the machine has failed for a particular data point: if at least one of the failure modes is true, the process fails and the machine failure label is set to 1. The objective of the ML model is to identify machine failures correctly, so that a downstream predictive maintenance action can be initiated.
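To make the label logic concrete, the following is a small, hypothetical pandas illustration (the column names follow the failure-mode abbreviations above and are an assumption for illustration only):

import pandas as pd

# Toy example: the machine failure label is 1 when any individual failure mode
# fired for that data point
df = pd.DataFrame({
    "TWF": [0, 0, 1],
    "HDF": [0, 1, 0],
    "PWF": [0, 0, 0],
    "OSF": [0, 0, 1],
    "RNF": [0, 0, 0],
})
df["machine_failure"] = df[["TWF", "HDF", "PWF", "OSF", "RNF"]].max(axis=1)
print(df["machine_failure"].tolist())  # [0, 1, 1]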

Solution overview

For our predictive maintenance use case, we assume that device sensors stream various measurements and readings about machine parts. Our solution then takes a slice of streaming data each time (micro-batch), and performs processing and feature engineering to create features. The created features are then used to generate inferences from a trained and deployed ML model in near-real time. The generated inferences can be further processed and consumed by downstream applications, to take appropriate actions and initiate maintenance activity.

The following diagram shows the architecture of our overall solution.

The solution broadly consists of the following sections, which are explained in detail later in this post:

  • Streaming data source and ingestion – We use Amazon Kinesis Data Streams to collect streaming data from the field sensors at scale and make it available for further processing.
  • Near-real-time feature engineering – We use AWS Glue streaming jobs to read data from a Kinesis data stream and perform data processing and feature engineering, before storing the derived features in Amazon Simple Storage Service (Amazon S3). Amazon S3 provides a reliable and cost-effective option to store large volumes of data.
  • Model training and deployment – We use the AI4I predictive maintenance dataset from the UCI Data Repository to train an ML model based on the XGBoost algorithm using SageMaker. We then deploy the trained model to a SageMaker asynchronous inference endpoint.
  • Near-real-time ML inference – After the features are available in Amazon S3, we need to generate inferences from the deployed model in near-real time. SageMaker asynchronous inference endpoints are well suited for this requirement because they support larger payload sizes (up to 1 GB) and can generate inferences within minutes (up to a maximum of 15 minutes). We use S3 event notifications to run an AWS Lambda function to invoke a SageMaker asynchronous inference endpoint. SageMaker asynchronous inference endpoints accept S3 locations as input, generate inferences from the deployed model, and write these inferences back to Amazon S3 in near-real time.

The source code for this solution is located on GitHub. The solution has been tested in the us-east-1 Region and should be run there.

We use an AWS CloudFormation template, deployed using AWS Serverless Application Model (AWS SAM), and SageMaker notebooks to deploy the solution.

Prerequisites

To get started, as a prerequisite, you must have the AWS SAM CLI, Python 3, and pip installed. You must also have the AWS Command Line Interface (AWS CLI) configured properly.

Deploy the solution

You can use AWS CloudShell to run these steps. CloudShell is a browser-based shell that is pre-authenticated with your console credentials and includes pre-installed common development and operations tools (such as AWS SAM, AWS CLI, and Python). Therefore, no local installation or configuration is required.

  • We begin by creating an S3 bucket where we store the script for our AWS Glue streaming job. Run the following command in your terminal to create a new bucket:
aws s3api create-bucket --bucket sample-script-bucket-$RANDOM --region us-east-1
  • Note down the name of the bucket created.

ML-9132 Solution Arch

  • Next, we clone the code repository locally, which contains the CloudFormation template to deploy the stack. Run the following command in your terminal:
git clone https://github.com/aws-samples/amazon-sagemaker-predictive-maintenance
  • Navigate to the sam-template directory:
cd amazon-sagemaker-predictive-maintenance/sam-template

ML-9132 git clone repo

  • Run the following command to copy the AWS Glue job script (from glue_streaming/app.py) to the S3 bucket you created (replace sample-script-bucket-30232 with the name of your bucket):
aws s3 cp glue_streaming/app.py s3://sample-script-bucket-30232/glue_streaming/app.py

ML-9132 copy glue script

  • You can now build and deploy the solution through the CloudFormation template via AWS SAM. Run the following command:
sam build

ML-9132 SAM Build

sam deploy --guided
  • Provide arguments for the deployment such as the stack name, preferred AWS Region (us-east-1), and GlueScriptsBucket.

Make sure you provide the same S3 bucket that you created earlier for the AWS Glue script S3 bucket (parameter GlueScriptsBucket in the following screenshot).

ML-9132 SAM Deploy Param

After you provide the required arguments, AWS SAM starts the stack deployment. The following screenshot shows the resources created.

ML-9132 SAM Deployed

After the stack is deployed successfully, you should see the following message.

ML-9132 SAM CF deployed

  • On the AWS CloudFormation console, open the stack (for this post, nrt-streaming-inference) that you created when deploying the CloudFormation template.
  • On the Resources tab, note the SageMaker notebook instance ID.
ML-9132 SM Notebook Created
  • On the SageMaker console, open this instance.

ML-9132 image018

The SageMaker notebook instance already has the required notebooks pre-loaded.

Navigate to the notebooks folder, then open and follow the instructions in the notebooks (Data_Pre-Processing.ipynb and ModelTraining-Evaluation-and-Deployment.ipynb) to explore the dataset, perform preprocessing and feature engineering, and train and deploy the model to a SageMaker asynchronous inference endpoint.

ML-9132 Open SM Notebooks

Streaming data source and ingestion

Kinesis Data Streams is a serverless, scalable, and durable real-time data streaming service that you can use to collect and process large streams of data records in real time. Kinesis Data Streams enables capturing, processing, and storing data streams from a variety of sources, such as IT infrastructure log data, application logs, social media, market data feeds, web clickstream data, IoT devices and sensors, and more. You can provision a Kinesis data stream in on-demand mode or provisioned mode depending on the throughput and scaling requirements. For more information, see Choosing the Data Stream Capacity Mode.
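For this post, the CloudFormation stack provisions the data stream for you. As a rough illustration only, a minimal boto3 sketch of creating an equivalent stream in on-demand mode could look like the following (the stream name matches the one used later in this post):

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create a data stream in on-demand capacity mode (no shard counts to manage)
kinesis.create_stream(
    StreamName="sensor-data-stream",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)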

For our use case, we assume that various sensors are sending measurements such as temperature, rotation speed, torque, and tool wear to a data stream. Kinesis Data Streams acts as a funnel to collect and ingest data streams.

We use the Amazon Kinesis Data Generator (KDG) later in this post to generate and send data to a Kinesis data stream, simulating data being generated by sensors. The data from the data stream sensor-data-stream is ingested and processed using an AWS Glue streaming job, which we discuss next.
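If you want to push a quick test record without the KDG, a hedged boto3 sketch like the following publishes one synthetic sensor reading to the same stream (the field names mirror the KDG record template shown later in this post):

import json
import random

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# One synthetic sensor reading, mirroring the KDG record template used later in this post
record = {
    "air_temperature": round(random.uniform(295, 305), 2),
    "process_temperature": round(random.uniform(305, 315), 2),
    "rotational_speed": random.randint(1150, 2900),
    "torque": round(random.uniform(3, 80), 2),
    "tool_wear": random.randint(0, 250),
    "type": random.choice(["L", "M", "H"]),
}

kinesis.put_record(
    StreamName="sensor-data-stream",
    Data=json.dumps(record),
    PartitionKey=str(record["rotational_speed"]),
)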

Near-real-time feature engineering

AWS Glue streaming jobs provide a convenient way to process streaming data at scale, without the need to manage the compute environment. AWS Glue allows you to perform extract, transform, and load (ETL) operations on streaming data using continuously running jobs. AWS Glue streaming ETL is built on the Apache Spark Structured Streaming engine, and can ingest streams from Kinesis, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

The streaming ETL job can use both AWS Glue built-in transforms and transforms that are native to Apache Spark Structured Streaming. You can also use the Spark ML and MLlib libraries in AWS Glue jobs for easier feature processing using readily available helper libraries.

If the schema of the streaming data source is predetermined, you can specify it in an AWS Glue Data Catalog table. If the schema definition can’t be determined beforehand, you can enable schema detection in the streaming ETL job. The job then automatically determines the schema from the incoming data. Additionally, you can use the AWS Glue Schema Registry to allow central discovery, control, and evolution of data stream schemas. You can further integrate the Schema Registry with the Data Catalog to optionally use schemas stored in the Schema Registry when creating or updating AWS Glue tables or partitions in the Data Catalog.

For this post, we create an AWS Glue Data Catalog table (sensor-stream) with our Kinesis data stream as the source and define the schema for our sensor data.
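The table itself is created by the deployed CloudFormation stack. As a rough sketch of what an equivalent definition could look like with boto3 (the stream ARN and account ID are placeholders, and the column list is an assumption based on the sensor schema in this post):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Sketch of a Data Catalog table that points at the Kinesis data stream as a JSON source
glue.create_table(
    DatabaseName="sensordb",
    TableInput={
        "Name": "sensor-stream",
        "Parameters": {"classification": "json"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "air_temperature", "Type": "double"},
                {"Name": "process_temperature", "Type": "double"},
                {"Name": "rotational_speed", "Type": "bigint"},
                {"Name": "torque", "Type": "double"},
                {"Name": "tool_wear", "Type": "bigint"},
                {"Name": "type", "Type": "string"},
            ],
            "Location": "sensor-data-stream",
            "Parameters": {
                "typeOfData": "kinesis",
                "streamARN": "arn:aws:kinesis:us-east-1:<account-id>:stream/sensor-data-stream",
            },
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    },
)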

We create an AWS Glue dynamic dataframe from the Data Catalog table to read the streaming data from Kinesis. We also specify the following options:

  • A window size of 60 seconds, so that the AWS Glue job reads and processes data in 60-second windows
  • The starting position TRIM_HORIZON, to allow reading from the oldest records in the Kinesis data stream

We also use Spark MLlib’s StringIndexer feature transformer to encode the string column type into label indexes. This transformation is implemented using Spark ML Pipelines. Spark ML Pipelines provide a uniform set of high-level APIs for ML algorithms to make it easier to combine multiple algorithms into a single pipeline or workflow.

We use the foreachBatch API to invoke a function named processBatch, which in turn processes the data referenced by this dataframe. See the following code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Read from the Kinesis data stream through the Glue Data Catalog table
sourceStreamData = glueContext.create_data_frame.from_catalog(database = "sensordb", table_name = "sensor-stream", transformation_ctx = "sourceStreamData", additional_options = {"startingPosition": "TRIM_HORIZON"})

# Encode the string "type" column into numeric label indexes with a Spark ML pipeline
type_indexer = StringIndexer(inputCol="type", outputCol="type_enc", stringOrderType="alphabetAsc")
pipeline = Pipeline(stages=[type_indexer])

# Process the stream in 60-second micro-batches with the processBatch function
glueContext.forEachBatch(frame = sourceStreamData, batch_function = processBatch, options = {"windowSize": "60 seconds", "checkpointLocation": checkpoint_location})

The function processBatch performs the specified transformations and partitions the data in Amazon S3 based on year, month, day, and batch ID.

We also repartition the AWS Glue output into a single partition, to avoid having too many small files in Amazon S3. Having many small files can impede read performance, because it amplifies the overhead related to seeking, opening, and reading each file. Finally, we write the features used to generate inferences to a prefix (features) within the S3 bucket. See the following code:

import datetime

from awsglue.dynamicframe import DynamicFrame

# Called for every micro-batch of streaming data from Kinesis to perform processing
# and feature engineering, and to write the resulting features to Amazon S3.
def processBatch(data_frame, batchId):
    transformer = pipeline.fit(data_frame)
    now = datetime.datetime.now()
    year = now.year
    month = now.month
    day = now.day
    hour = now.hour
    minute = now.minute
    if data_frame.count() > 0:
        data_frame = transformer.transform(data_frame)
        data_frame = data_frame.drop("type")
        data_frame = DynamicFrame.fromDF(data_frame, glueContext, "from_data_frame")
        data_frame.printSchema()
        # Write output features to S3, partitioned by year/month/day/hour/minute and batch ID
        s3prefix = "features" + "/year=" + "{:0>4}".format(str(year)) + "/month=" + "{:0>2}".format(str(month)) + "/day=" + "{:0>2}".format(str(day)) + "/hour=" + "{:0>2}".format(str(hour)) + "/min=" + "{:0>2}".format(str(minute)) + "/batchid=" + str(batchId)
        s3path = "s3://" + out_bucket_name + "/" + s3prefix + "/"
        print("-------write start time------------")
        print(str(datetime.datetime.now()))
        # Repartition to a single output file per micro-batch to avoid many small files in S3
        data_frame = data_frame.toDF().repartition(1)
        data_frame.write.mode("overwrite").option("header", False).csv(s3path)
        print("-------write end time------------")
        print(str(datetime.datetime.now()))

Model training and deployment

SageMaker is a fully managed and integrated ML service that enables data scientists and ML engineers to quickly and easily build, train, and deploy ML models.

Within the Data_Pre-Processing.ipynb notebook, we first import the AI4I Predictive Maintenance dataset from the UCI Data Repository and perform exploratory data analysis (EDA). We also perform feature engineering to make our features more useful for training the model.

For example, within the dataset, we have a feature named type, which represents the product’s quality type as L (low), M (medium), or H (high). Because this is a categorical feature, we need to encode it before training our model. We use Scikit-learn’s LabelEncoder to achieve this:

from sklearn.preprocessing import LabelEncoder
type_encoder = LabelEncoder()
type_encoder.fit(origdf['type'])
type_values = type_encoder.transform(origdf['type'])
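Note that LabelEncoder orders classes alphabetically, which matches the StringIndexer(stringOrderType="alphabetAsc") transformation used in the AWS Glue streaming job, so the type column is encoded consistently at training and inference time. A quick check of the resulting mapping:

from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the classes alphabetically, so H -> 0, L -> 1, M -> 2,
# the same ordering produced by StringIndexer with stringOrderType="alphabetAsc"
enc = LabelEncoder().fit(["L", "M", "H"])
for label, code in zip(enc.classes_, enc.transform(enc.classes_)):
    print(label, "->", int(code))  # H -> 0, L -> 1, M -> 2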

After the features are processed and the curated train and test datasets are generated, we’re ready to train an ML model to predict whether the machine failed or not based on system readings. We train an XGBoost model using the SageMaker built-in algorithm. XGBoost can provide good results for multiple types of ML problems, including classification, even when training samples are limited.

SageMaker training jobs provide a powerful and flexible way to train ML models on SageMaker. SageMaker manages the underlying compute infrastructure and provides multiple options to choose from, for diverse model training requirements, based on the use case.

xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    instance_count=1,
                                    instance_type='ml.c4.4xlarge',
                                    output_path=xgb_upload_location,
                                    sagemaker_session=sagemaker_session)

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:hinge',
                        num_round=100)

xgb.fit({'train': s3_train_channel, 'validation': s3_valid_channel})

When the model training is complete and the model evaluation is satisfactory based on the business requirements, we can begin model deployment. We first create an endpoint configuration that includes the AsyncInferenceConfig option, using the model trained earlier:

endpoint_config_name = resource_name.format("EndpointConfig")
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": f"s3://{bucket}/{prefix}/output",
            # Specify Amazon SNS topics for success and error notifications
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:<region>:<account-id>:<success-sns-topic>",
                "ErrorTopic": "arn:aws:sns:<region>:<account-id>:<error-sns-topic>",
            },
        },
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
    },
)

We then create a SageMaker asynchronous inference endpoint, using the endpoint configuration we created. After it’s provisioned, we can start invoking the endpoint to generate inferences asynchronously.

endpoint_name = resource_name.format("Endpoint")
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)
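Endpoint creation takes a few minutes. As a convenience, you can wait for the endpoint to reach the InService status before invoking it, for example:

# Wait until the asynchronous endpoint is InService before sending requests
# (sm_client and endpoint_name are defined in the preceding snippets)
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])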

Near-real-time inference

SageMaker asynchronous inference endpoints provide the ability to queue incoming inference requests and process them asynchronously in near-real time. This is ideal for applications that have inference requests with larger payload sizes (up to 1 GB), may require longer processing times (up to 15 minutes), and have near-real-time latency requirements. Asynchronous inference also enables you to save on costs by auto scaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.
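As a hedged sketch (the endpoint name is a placeholder and the variant name matches the configuration shown earlier), scale-to-zero can be set up with Application Auto Scaling using the asynchronous endpoint’s backlog metric, along the following lines:

import boto3

autoscaling = boto3.client("application-autoscaling")
endpoint_name = "<async-endpoint-name>"  # placeholder for the endpoint created earlier

# Register the endpoint variant as a scalable target that is allowed to scale down to zero instances
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/variant1",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=2,
)

# Track the per-instance request backlog so instances are added only when requests queue up
autoscaling.put_scaling_policy(
    PolicyName="async-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/variant1",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
    },
)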

You can create a SageMaker asynchronous inference endpoint similarly to how you create a real-time inference endpoint, except that you additionally specify the AsyncInferenceConfig object when creating your endpoint configuration with the CreateEndpointConfig API. The following diagram shows the inference workflow and how an asynchronous inference endpoint generates an inference.

ML-9132 SageMaker Asych Arch

To invoke the asynchronous inference endpoint, the request payload must be stored in Amazon S3, and a reference to this payload is provided as part of the InvokeEndpointAsync request. Upon invocation, SageMaker queues the request for processing and returns an identifier and output location as a response. When processing is complete, SageMaker places the result in the Amazon S3 output location. You can optionally choose to receive success or error notifications with Amazon Simple Notification Service (Amazon SNS).

Test the end-to-end solution

To test the solution, complete the following steps:

  • On the AWS CloudFormation console, open the stack you created earlier (nrt-streaming-inference).
  • On the Outputs tab, copy the name of the S3 bucket (EventsBucket).

This is the S3 bucket to which our AWS Glue streaming job writes features after reading and processing from the Kinesis data stream.

ML-9132 S3 events bucket

Next, we set up event notifications for this S3 bucket.

  • On the Amazon S3 console, navigate to the bucket EventsBucket.
  • On the Properties tab, in the Event notifications section, choose Create event notification.

ML-9132 S3 events bucket properties

ML-9132 S3 events bucket notification

  • For Event name, enter invoke-endpoint-lambda.
  • For Prefix, enter features/.
  • For Suffix, enter .csv.
  • For Event types, select All object create events.

ML-9132 S3 events bucket notification config
ML-9132 S3 events bucket notification config

  • For Destination, select Lambda function.
  • For Lambda function, choose the function invoke-endpoint-asynch.
  • Choose Save changes.

ML-9132 S3 events bucket notification config lambda

  • On the AWS Glue console, open the job GlueStreaming-Kinesis-S3.
  • Choose Run job.

ML-9132 Run Glue job

Next we use the Kinesis Data Generator (KDG) to simulate sensors sending data to our Kinesis data stream. If this is your first time using the KDG, refer to Overview for the initial setup. The KDG provides a CloudFormation template to create the user and assign just enough permissions to use the KDG for sending events to Kinesis. Run the CloudFormation template within the AWS account that you’re using to build the solution in this post. After the KDG is set up, log in and access the KDG to send test events to our Kinesis data stream.

  • Use the Region in which you created the Kinesis data stream (us-east-1).
  • On the drop-down menu, choose the data stream sensor-data-stream.
  • In the Records per second section, select Constant and enter 100.
  • Unselect Compress Records.
  • For Record template, use the following template:
{
  "air_temperature": {{random.number({"min":295,"max":305, "precision":0.01})}},
  "process_temperature": {{random.number({"min":305,"max":315, "precision":0.01})}},
  "rotational_speed": {{random.number({"min":1150,"max":2900})}},
  "torque": {{random.number({"min":3,"max":80, "precision":0.01})}},
  "tool_wear": {{random.number({"min":0,"max":250})}},
  "type": "{{random.arrayElement(["L","M","H"])}}"
}
  • Choose Send data to start sending data to the Kinesis data stream.

ML-9132 Kineses Data Gen

The AWS Glue streaming job reads and extracts a micro-batch of data (representing sensor readings) from the Kinesis data stream based on the window size provided. The streaming job then processes and performs feature engineering on this micro-batch before partitioning and writing it to the prefix features within the S3 bucket.

As new features created by the AWS Glue streaming job are written to the S3 bucket, a Lambda function (invoke-endpoint-asynch) is triggered, which invokes a SageMaker asynchronous inference endpoint by sending an invocation request to get inferences from our deployed ML model. The asynchronous inference endpoint queues the request for asynchronous invocation. When the processing is complete, SageMaker stores the inference results in the Amazon S3 location (S3OutputPath) that was specified during the asynchronous inference endpoint configuration.
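The actual Lambda function is part of the solution’s source code; a minimal sketch of what such a handler could look like is shown below (the endpoint name is assumed to be supplied through an environment variable):

import os

import boto3

sm_runtime = boto3.client("sagemaker-runtime")

# Triggered by the S3 event notification on the features/ prefix; queues each new
# features file for asynchronous inference on the SageMaker endpoint
def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        response = sm_runtime.invoke_endpoint_async(
            EndpointName=os.environ["ENDPOINT_NAME"],
            InputLocation=f"s3://{bucket}/{key}",
            ContentType="text/csv",
        )
        print(f"Queued {key}; results will be written to {response['OutputLocation']}")
    return {"statusCode": 200}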

For our use case, the inference results indicate if a machine part is likely to fail or not, based on the sensor readings.

ML-9132 Model inferences

SageMaker also sends a success or error notification with Amazon SNS. For example, if you set up an email subscription for the success and error SNS topics (specified within the asynchronous SageMaker inference endpoint configuration), an email can be sent every time an inference request is processed. The following screenshot shows a sample email from the SNS success topic.

ML-9132 SNS email subscribe

For real-world applications, you can integrate SNS notifications with other services such as Amazon Simple Queue Service (Amazon SQS) and Lambda for additional postprocessing of the generated inferences or integration with other downstream applications, based on your requirements. For example, for our predictive maintenance use case, you can invoke a Lambda function based on an SNS notification to read the generated inference from Amazon S3, further process it (such as aggregation or filtering), and initiate workflows such as sending work orders for equipment repair to technicians.

Clean up

When you’re done testing the stack, delete the resources (especially the Kinesis data stream, Glue streaming job, and SNS topics) to avoid unexpected charges.

Run the following command to delete your stack:

sam delete --stack-name nrt-streaming-inference

Also delete resources such as the SageMaker endpoints by following the cleanup section in the ModelTraining-Evaluation-and-Deployment notebook.

Conclusion

In this post, we used a predictive maintenance use case to demonstrate how to use various services such as Kinesis, AWS Glue, and SageMaker to build a near-real-time inference pipeline. We encourage you to try this solution and let us know what you think.

If you have any questions, share them in the comments.


About the authors

Rahul Sharma is a Solutions Architect at AWS Data Lab, helping AWS customers design and build AI/ML solutions. Prior to joining AWS, Rahul spent several years in the finance and insurance sector, helping customers build data and analytics platforms.

Pat Reilly is an Architect in the AWS Data Lab, where he helps customers design and build data workloads to support their business. Prior to AWS, Pat consulted at an AWS Partner, building AWS data workloads across a variety of industries.

Read More

Action on Repeat: GFN Thursday Brings Loopmancer With RTX ON to the Cloud

Investigate the ultimate truth this GFN Thursday with Loopmancer, now streaming to all members on GeForce NOW. Stuck in a death loop, RTX 3080 and Priority members can search for the truth with RTX ON — including NVIDIA DLSS and ray-traced reflections.

Plus, players can enjoy the latest Genshin Impact event with the “Summer Fantasia” version 2.8 update. It’s all part of the nine new games joining the GeForce NOW library this week.

Enter the Dragon City

The cycle continues until the case is solved. Loopmancer is streaming on GeForce NOW, with RTX ON.

Playing as a detective in this roguelite-platformer action game, members will wake back up in their apartments each time they die, bathed in the neon lights of futuristic Dragon City. As the story progresses, reviewing what seemed like the correct choice in the past may lead to a different conclusion.

Face vicious gangsters, well-equipped mercs, crazy mutants, highly trained bionics and more while searching for clues, even on mobile devices. Unlock new weapons and abilities through endless reincarnations to enhance your fighting skills for fast-paced battles.

RTX 3080 and Priority members can experience Loopmancer with DLSS for improved image quality at higher frame rates, as well as real-time ray-tracing technology that simulates the realistic, physical behavior of light, even on underpowered devices and Macs. Every loop, detail and map – from the richly colored Dragon Town to the gloomy Shuigou Village – is rendered with beautiful cinematic quality.

Ready to initiate a new loop? Try out the Loopmancer demo in the Instant Play Free Demos row before diving into the full game. RTX 3080 and Priority members can even try the demo with RTX ON.

A Summertime Odyssey Awaits

With a cursed blade of unknown origin, a mysterious unsolved case and the familiar — but not too familiar — islands far at sea, the recent addition of Genshin Impact heats up with the version 2.8 “Summer Fantasia” update, now available.

Meet the newest Genshin character, Shikanoin Heizou, a young prodigy detective from the Tenryou Commission with sharp senses. Members can also cool off with the new sea-based “Summertime Odyssey” main event, explore the Golden Apple Archipelago, experience new stories and dress their best with new outfits.

RTX 3080 members can stream all of the fun at 4K resolution and 60 frames per second, or 1440p and 120 FPS from the PC and Mac native apps. They also get the perks of ultra-low latency that rivals console gaming and can catch all of the newest action with the maximized eight-hour play sessions.

Summer Gamin’, Havin’ a Blast

Neon Blight on GeForce NOW
 Fight through dystopian cyberspace and establish an exotic black market store in this rogue-lite, management, shoot ‘em up.

This week brings in a total of nine new titles for gamers to play.

With all of these awesome options to play and only so many hours in a day, we’ve got a question for you. Let us know your answer on Twitter or in the comments below.

The post Action on Repeat: GFN Thursday Brings Loopmancer With RTX ON to the Cloud appeared first on NVIDIA Blog.

Read More

DALL·E 2: Extending creativity

As part of our DALL·E 2 research preview, more than 3,000 artists from more than 118 countries have incorporated DALL·E into their creative workflows. The artists in our early access group have helped us discover new uses for DALL·E and have served as key voices as we’ve made decisions about DALL·E’s features. (OpenAI Blog)

Teaching AI to ask clinical questions

Physicians often query a patient’s electronic health record for information that helps them make treatment decisions, but the cumbersome nature of these records hampers the process. Research has shown that even when a doctor has been trained to use an electronic health record (EHR), finding an answer to just one question can take, on average, more than eight minutes.

The more time physicians must spend navigating an oftentimes clunky EHR interface, the less time they have to interact with patients and provide treatment.

Researchers have begun developing machine-learning models that can streamline the process by automatically finding information physicians need in an EHR. However, training effective models requires huge datasets of relevant medical questions, which are often hard to come by due to privacy restrictions. Existing models struggle to generate authentic questions — those that would be asked by a human doctor — and are often unable to successfully find correct answers.

To overcome this data shortage, researchers at MIT partnered with medical experts to study the questions physicians ask when reviewing EHRs. Then, they built a publicly available dataset of more than 2,000 clinically relevant questions written by these medical experts.

When they used their dataset to train a machine-learning model to generate clinical questions, they found that the model asked high-quality and authentic questions, as compared to real questions from medical experts, more than 60 percent of the time.

With this dataset, they plan to generate vast numbers of authentic medical questions and then use those questions to train a machine-learning model which would help doctors find sought-after information in a patient’s record more efficiently.

“Two thousand questions may sound like a lot, but when you look at machine-learning models being trained nowadays, they have so much data, maybe billions of data points. When you train machine-learning models to work in health care settings, you have to be really creative because there is such a lack of data,” says lead author Eric Lehman, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

The senior author is Peter Szolovits, a professor in the Department of Electrical Engineering and Computer Science (EECS) who heads the Clinical Decision-Making Group in CSAIL and is also a member of the MIT-IBM Watson AI Lab. The research paper, a collaboration between co-authors at MIT, the MIT-IBM Watson AI Lab, IBM Research, and the doctors and medical experts who helped create questions and participated in the study, will be presented at the annual conference of the North American Chapter of the Association for Computational Linguistics.

“Realistic data is critical for training models that are relevant to the task yet difficult to find or create,” Szolovits says. “The value of this work is in carefully collecting questions asked by clinicians about patient cases, from which we are able to develop methods that use these data and general language models to ask further plausible questions.”

Data deficiency

The few large datasets of clinical questions the researchers were able to find had a host of issues, Lehman explains. Some were composed of medical questions asked by patients on web forums, which are a far cry from physician questions. Other datasets contained questions produced from templates, so they are mostly identical in structure, making many questions unrealistic.

“Collecting high-quality data is really important for doing machine-learning tasks, especially in a health care context, and we’ve shown that it can be done,” Lehman says.

To build their dataset, the MIT researchers worked with practicing physicians and medical students in their last year of training. They gave these medical experts more than 100 EHR discharge summaries and told them to read through a summary and ask any questions they might have. The researchers didn’t put any restrictions on question types or structures in an effort to gather natural questions. They also asked the medical experts to identify the “trigger text” in the EHR that led them to ask each question.

For instance, a medical expert might read a note in the EHR that says a patient’s past medical history is significant for prostate cancer and hypothyroidism. The trigger text “prostate cancer” could lead the expert to ask questions like “date of diagnosis?” or “any interventions done?”

They found that most questions focused on symptoms, treatments, or the patient’s test results. While these findings weren’t unexpected, quantifying the number of questions about each broad topic will help them build an effective dataset for use in a real, clinical setting, says Lehman.

Once they had compiled their dataset of questions and accompanying trigger text, they used it to train machine-learning models to ask new questions based on the trigger text.

Then the medical experts determined whether those questions were “good” using four metrics: understandability (Does the question make sense to a human physician?), triviality (Is the question too easily answerable from the trigger text?), medical relevance (Does it make sense to ask this question based on the context?), and relevancy to the trigger (Is the trigger related to the question?).

Cause for concern

The researchers found that when a model was given trigger text, it was able to generate a good question 63 percent of the time, whereas a human physician would ask a good question 80 percent of the time.

They also trained models to recover answers to clinical questions using the publicly available datasets they had found at the outset of this project. Then they tested these trained models to see if they could find answers to “good” questions asked by human medical experts.

The models were only able to recover about 25 percent of answers to physician-generated questions.

“That result is really concerning. What people thought were good-performing models were, in practice, just awful because the evaluation questions they were testing on were not good to begin with,” Lehman says.

The team is now applying this work toward their initial goal: building a model that can automatically answer physicians’ questions in an EHR. For the next step, they will use their dataset to train a machine-learning model that can automatically generate thousands or millions of good clinical questions, which can then be used to train a new model for automatic question answering.

While there is still much work to do before that model could be a reality, Lehman is encouraged by the strong initial results the team demonstrated with this dataset.

This research was supported, in part, by the MIT-IBM Watson AI Lab. Additional co-authors include Leo Anthony Celi of the MIT Institute for Medical Engineering and Science; Preethi Raghavan and Jennifer J. Liang of the MIT-IBM Watson AI Lab; Dana Moukheiber of the University of Buffalo; Vladislav Lialin and Anna Rumshisky of the University of Massachusetts at Lowell; Katelyn Legaspi, Nicole Rose I. Alberto, Richard Raymund R. Ragasa, Corinna Victoria M. Puyat, Isabelle Rose I. Alberto, and Pia Gabrielle I. Alfonso of the University of the Philippines; Anne Janelle R. Sy and Patricia Therese S. Pile of the University of the East Ramon Magsaysay Memorial Medical Center; Marianne Taliño of the Ateneo de Manila University School of Medicine and Public Health; and Byron C. Wallace of Northeastern University.

Read More

Rewriting Image Captions for Visual Question Answering Data Creation

Visual Question Answering (VQA) is a useful machine learning (ML) task that requires a model to answer a visual question about an image. What makes it challenging is its multi-task and open-ended nature; it involves solving multiple technical research questions in computer vision and natural language understanding simultaneously. Yet, progress on this task would enable a wide range of applications, from assisting the blind and the visually-impaired or communicating with robots to enhancing the user’s visual experience with external knowledge.

Effective and robust VQA systems cannot exist without high-quality, semantically and stylistically diverse large-scale training data of image-question-answer triplets. But, creating such data is time consuming and onerous. Perhaps unsurprisingly, the VQA community has focused more on sophisticated model development rather than scalable data creation.

In “All You May Need for VQA are Image Captions,” published at NAACL 2022, we explore VQA data generation by proposing “Visual Question Generation with Question Answering Validation” (VQ2A), a pipeline that works by rewriting a declarative caption into multiple interrogative question-answer pairs. More specifically, we leverage two existing assets — (i) large-scale image-text data and (ii) large-capacity neural text-to-text models — to achieve automatic VQA data generation. As the field has progressed, the research community has been making these assets larger and stronger in isolation (for general purposes such as learning text-only or image-text representations); together, they can achieve more and we adapt them for VQA data creation purposes. We find our approach can generate question-answer pairs with high precision and that this data can successfully be used for training VQA models to improve performance.

The VQ2A technique enables VQA data generation at scale from image captions by rewriting each caption into multiple question-answer pairs.

VQ2A Overview
The first step of the VQ2A approach is to apply heuristics based on named entity recognition, part-of-speech tagging and manually defined rules to generate answer candidates from the image caption. These generated candidates are small pieces of information that may be relevant subjects about which to ask questions. We also add to this list two default answers, “yes” and “no”, which allow us to generate Boolean questions.

Then, we use a T5 model that was fine-tuned to generate questions for the candidate, resulting in [question, candidate answer] pairs. We then filter for the highest-quality pairs using another T5 model (fine-tuned to answer questions) by asking it to answer the question based on the caption. That is, we compare the candidate answer to the output of this model and, if the two answers are similar enough, we define this question as high quality and keep it. Otherwise, we filter it out.

The idea of using both question answering and question generation models to check each other for their round-trip consistency has been previously explored in other contexts. For instance, Q2 uses this idea to evaluate factual consistency in knowledge-grounded dialogues. In the end, the VQ2A approach, as illustrated below, can generate a large number of [image, question, answer] triplets that are high-quality enough to be used as VQA training data.

VQ2A consists of three main steps: (i) candidate answer extraction, (ii) question generation, (iii) question answering and answer validation.
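To make the validation step concrete, the following is a toy Python sketch of the round-trip consistency idea described above; the two model callables are hypothetical stand-ins for the fine-tuned question-generation and question-answering models, and an exact-match comparison stands in for the softer answer-similarity check used in the paper:

from typing import Callable, List, Tuple

def vq2a_filter(
    caption: str,
    candidate_answers: List[str],
    generate_question: Callable[[str, str], str],  # question-generation model (e.g., a fine-tuned T5)
    answer_question: Callable[[str, str], str],    # question-answering model (another fine-tuned T5)
) -> List[Tuple[str, str]]:
    """Keep only the question-answer pairs that survive the round-trip consistency check."""
    kept = []
    for candidate in candidate_answers:
        question = generate_question(caption, candidate)
        roundtrip_answer = answer_question(caption, question)
        # Keep the pair only if the QA model recovers (approximately) the original candidate answer
        if roundtrip_answer.strip().lower() == candidate.strip().lower():
            kept.append((question, candidate))
    return kept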

Results
Two examples of our generated VQA data are shown below, one based on human-written COCO Captions (COCO) and the other on automatically-collected Conceptual Captions (CC3M), which we call VQ2A-COCO and VQ2A-CC3M, respectively. We highlight the variety of question types and styles, which are critical for VQA. Overall, the cleaner the captions (i.e., the more closely related they are to their paired image), the more accurate the generated triplets. Based on 800 samples each, 87.3% of VQ2A-COCO and 66.0% VQ2A-CC3M are found by human raters to be valid, suggesting that our approach can generate question-answer pairs with high precision.

Generated question-answer pairs based on COCO Captions (top) and Conceptual Captions (bottom). Grey highlighting denotes questions that do not appear in VQAv2, while green highlighting denotes those that do, indicating that our approach is capable of generating novel questions that an existing VQA dataset does not have.

Finally, we evaluate our generated data by using it to train VQA models (highlights shown below). We observe that our automatically-generated VQA data is competitive with manually-annotated target VQA data. First, our VQA models achieve high performance on target benchmarks “out-of-the-box”, when trained only on our generated data (light blue and light red vs. yellow). Once fine-tuned on target data, our VQA models outperform target-only training slightly on large-scale benchmarks like VQAv2 and GQA, but significantly on the small, knowledge-seeking OK-VQA (dark blue/red vs. light blue/red).

VQA accuracy on popular benchmark datasets.

Conclusion
All we may need for VQA are image captions! This work demonstrates that it is possible to automatically generate high-quality VQA data at scale, serving as an essential building block for VQA and vision-and-language models in general (e.g., ALIGN, CoCa). We hope that our work inspires other work on data-centric VQA.

Acknowledgments
We thank Roee Aharoni, Idan Szpektor, and Radu Soricut for their feedback on this blogpost. We also thank our co-authors: Xi Chen, Nan Ding, Idan Szpektor, and Radu Soricut. We acknowledge contributions from Or Honovich, Hagai Taitelbaum, Roee Aharoni, Sebastian Goodman, Piyush Sharma, Nassim Oufattole, Gal Elidan, Sasha Goldshtein, and Avinatan Hassidim. Finally, we thank the authors of Q2, whose pipeline strongly influences this work.

Read More