MARRS: Multimodal Reference Resolution System

*= All authors listed contributed equally to this work
Successfully handling context is essential for any dialog understanding task. This context may be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual and background…Apple Machine Learning Research

Build trust and safety for generative AI applications with Amazon Comprehend and LangChain

We are witnessing a rapid increase in the adoption of large language models (LLM) that power generative AI applications across industries. LLMs are capable of a variety of tasks, such as generating creative content, answering inquiries via chatbots, generating code, and more.

Organizations looking to use LLMs to power their applications are increasingly wary about data privacy to ensure trust and safety is maintained within their generative AI applications. This includes handling customers’ personally identifiable information (PII) data properly. It also includes preventing abusive and unsafe content from being propagated to LLMs and checking that data generated by LLMs follows the same principles.

In this post, we discuss new features powered by Amazon Comprehend that enable seamless integration to ensure data privacy, content safety, and prompt safety in new and existing generative AI applications.

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to uncover information in unstructured data and text within documents. In this post, we discuss why trust and safety with LLMs matter for your workloads. We also delve deeper into how these new moderation capabilities are utilized with the popular generative AI development framework LangChain to introduce a customizable trust and safety mechanism for your use case.

Why trust and safety with LLMs matter

Trust and safety are paramount when working with LLMs due to their profound impact on a wide range of applications, from customer support chatbots to content generation. As these models process vast amounts of data and generate humanlike responses, the potential for misuse or unintended outcomes increases. Ensuring that these AI systems operate within ethical and reliable boundaries is crucial, not just for the reputation of businesses that utilize them, but also for preserving the trust of end-users and customers.

Moreover, as LLMs become more integrated into our daily digital experiences, their influence on our perceptions, beliefs, and decisions grows. Ensuring trust and safety with LLMs goes beyond just technical measures; it speaks to the broader responsibility of AI practitioners and organizations to uphold ethical standards. By prioritizing trust and safety, organizations not only protect their users, but also ensure sustainable and responsible growth of AI in society. It can also help to reduce risk of generating harmful content, and help adhere to regulatory requirements.

In the realm of trust and safety, content moderation is a mechanism that addresses various aspects, including but not limited to:

  • Privacy – Users can inadvertently provide text that contains sensitive information, jeopardizing their privacy. Detecting and redacting any PII is essential.
  • Toxicity – Recognizing and filtering out harmful content, such as hate speech, threats, or abuse, is of utmost importance.
  • User intention – Identifying whether the user input (prompt) is safe or unsafe is critical. Unsafe prompts can explicitly or implicitly express malicious intent, such as requesting personal or private information and generating offensive, discriminatory, or illegal content. Prompts may also implicitly express or request advice on medical, legal, political, controversial, personal, or financial subjects.

Content moderation with Amazon Comprehend

In this section, we discuss the benefits of content moderation with Amazon Comprehend.

Addressing privacy

Amazon Comprehend already addresses privacy through its existing PII detection and redaction abilities via the DetectPIIEntities and ContainsPIIEntities APIs. These two APIs are backed by NLP models that can detect a large number of PII entities such as Social Security numbers (SSNs), credit card numbers, names, addresses, phone numbers, and so on. For a full list of entities, refer to PII universal entity types. DetectPIIEntities also provides the character-level position of the PII entity within a text; for example, the start character position of the NAME entity (John Doe) in the sentence “My name is John Doe” is 12, and the end character position is 19. These offsets can be used to perform masking or redaction of the values, thereby reducing risks of private data propagation into LLMs.
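For illustration, the following minimal sketch shows how these offsets could be used to mask detected entities before a prompt reaches an LLM. The example text and the masking logic are our own; only detect_pii_entities and its BeginOffset/EndOffset response fields come from the Amazon Comprehend API.

import boto3

comprehend = boto3.client('comprehend')

# Hypothetical example text containing PII
text = "My name is John Doe and my phone number is 555-0100."

# Detect PII entities along with their character-level offsets
pii = comprehend.detect_pii_entities(Text=text, LanguageCode='en')

# Mask each detected entity in place using its BeginOffset/EndOffset
redacted = list(text)
for entity in pii['Entities']:
    for i in range(entity['BeginOffset'], entity['EndOffset']):
        redacted[i] = '*'

print(''.join(redacted))  # PII spans are masked before the text is sent to an LLM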

Addressing toxicity and prompt safety

Today, we are announcing two new Amazon Comprehend features in the form of APIs: Toxicity detection via the DetectToxicContent API, and prompt safety classification via the ClassifyDocument API. Note that DetectToxicContent is a new API, whereas ClassifyDocument is an existing API that now supports prompt safety classification.

Toxicity detection

With Amazon Comprehend toxicity detection, you can identify and flag content that may be harmful, offensive, or inappropriate. This capability is particularly valuable for platforms where users generate content, such as social media sites, forums, chatbots, comment sections, and applications that use LLMs to generate content. The primary goal is to maintain a positive and safe environment by preventing the dissemination of toxic content.

At its core, the toxicity detection model analyzes text to determine the likelihood of it containing hateful content, threats, obscenities, or other forms of harmful text. The model is trained on vast datasets containing examples of both toxic and nontoxic content. The toxicity API evaluates a given piece of text to provide toxicity classification and confidence score. Generative AI applications can then use this information to take appropriate actions, such as stopping the text from propagating to LLMs. As of this writing, the labels detected by the toxicity detection API are HATE_SPEECH, GRAPHIC, HARASSMENT_OR_ABUSE, SEXUAL, VIOLENCE_OR_THREAT, INSULT, and PROFANITY. The following code demonstrates the API call with Python Boto3 for Amazon Comprehend toxicity detection:

import boto3
client = boto3.client('comprehend')
response = client.detect_toxic_content(
    TextSegments=[{"Text": "What is the capital of France?"},
                  {"Text": "Where do I find good baguette in France?"}],
    LanguageCode='en')
print(response)
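As a rough sketch of how an application might act on this response, the following snippet checks each text segment's overall score against a threshold before letting the text continue to the LLM. The ResultList, Toxicity, and Labels field names reflect the response shape as we understand it, and the 0.6 threshold is an arbitrary value chosen for illustration.

TOXICITY_THRESHOLD = 0.6  # illustrative threshold; tune for your use case

for segment in response['ResultList']:
    # Overall toxicity score for the segment
    if segment.get('Toxicity', 0.0) >= TOXICITY_THRESHOLD:
        raise ValueError('Toxic content detected; stopping propagation to the LLM')
    # Per-label scores are also available for finer-grained handling
    for label in segment.get('Labels', []):
        print(label['Name'], round(label['Score'], 3))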

Prompt safety classification

Prompt safety classification with Amazon Comprehend helps classify an input text prompt as safe or unsafe. This capability is crucial for applications like chatbots, virtual assistants, or content moderation tools where understanding the safety of a prompt can determine responses, actions, or content propagation to LLMs.

In essence, prompt safety classification analyzes human input for any explicit or implicit malicious intent, such as requesting personal or private information and generating offensive, discriminatory, or illegal content. It also flags prompts looking for advice on medical, legal, political, controversial, personal, or financial subjects. Prompt safety classification returns two classes, UNSAFE_PROMPT and SAFE_PROMPT, for a given text, each with an associated confidence score. The confidence scores range between 0 and 1 and together sum to 1. For instance, in a customer support chatbot, the text “How do I reset my password?” signals an intent to seek guidance on password reset procedures and is labeled as SAFE_PROMPT. Similarly, a statement like “I wish something bad happens to you” can be flagged for having a potentially harmful intent and labeled as UNSAFE_PROMPT. It’s important to note that prompt safety classification is primarily focused on detecting intent from human inputs (prompts), rather than machine-generated text (LLM outputs). The following code demonstrates how to access the prompt safety classification feature with the ClassifyDocument API:

import boto3
client = boto3.client('comprehend')
response = client.classify_document(
    Text=prompt_value,         # the input prompt to classify
    EndpointArn=endpoint_arn)  # see the note on endpoint_arn below
print(response)

Note that endpoint_arn in the preceding code is an AWS-provided Amazon Resource Name (ARN) of the pattern arn:aws:comprehend:<region>:aws:document-classifier-endpoint/prompt-safety, where <region> is the AWS Region of your choice where Amazon Comprehend is available.
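Putting this together, a minimal end-to-end sketch might look like the following. The example prompt, the Region, the 0.8 threshold, and the score-checking logic over the Classes list returned by ClassifyDocument are illustrative assumptions.

import boto3

region = 'us-east-1'  # assumption: any Region where prompt safety classification is available
endpoint_arn = f'arn:aws:comprehend:{region}:aws:document-classifier-endpoint/prompt-safety'

client = boto3.client('comprehend', region_name=region)
response = client.classify_document(
    Text='How do I reset my password?',
    EndpointArn=endpoint_arn)

# Block the prompt if the UNSAFE_PROMPT confidence exceeds a chosen threshold
scores = {c['Name']: c['Score'] for c in response['Classes']}
if scores.get('UNSAFE_PROMPT', 0.0) >= 0.8:
    raise ValueError('Unsafe prompt detected; not forwarding to the LLM')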

To demonstrate these capabilities, we built a sample chat application where we ask an LLM to extract PII entities such as address, phone number, and SSN from a given piece of text. The LLM finds and returns the appropriate PII entities, as shown in the image on the left.

With Amazon Comprehend moderation, we can redact the input to the LLM and output from the LLM. In the image on the right, the SSN value is allowed to be passed to the LLM without redaction. However, any SSN value in the LLM’s response is redacted.

The following is an example of how a prompt containing PII information can be prevented from reaching the LLM altogether. This example demonstrates a user asking a question that contains PII information. We use Amazon Comprehend moderation to detect PII entities in the prompt and show an error by interrupting the flow.

The preceding chat examples showcase how Amazon Comprehend moderation applies restrictions on data being sent to an LLM. In the following sections, we explain how this moderation mechanism is implemented using LangChain.

Integration with LangChain

With the endless possibilities for applying LLMs to various use cases, simplifying the development of generative AI applications has become equally important. LangChain is a popular open source framework that makes it effortless to develop generative AI applications. Amazon Comprehend moderation extends the LangChain framework to offer PII identification and redaction, toxicity detection, and prompt safety classification capabilities via AmazonComprehendModerationChain.

AmazonComprehendModerationChain is a custom implementation of the LangChain base chain interface. This means that applications can use this chain with their own LLM chains to apply the desired moderation to the input prompt as well as to the output text from the LLM. Chains can be built by merging numerous chains or by mixing chains with other components. You can use AmazonComprehendModerationChain with other LLM chains to develop complex AI applications in a modular and flexible manner.

To explain it further, we provide a few samples in the following sections. The source code for the AmazonComprehendModerationChain implementation can be found within the LangChain open source repository. For full documentation of the API interface, refer to the LangChain API documentation for the Amazon Comprehend moderation chain. Using this moderation chain is as simple as initializing an instance of the class with default configurations:

from langchain_experimental.comprehend_moderation import AmazonComprehendModerationChain

comprehend_moderation = AmazonComprehendModerationChain()

Behind the scenes, the moderation chain performs three consecutive moderation checks, namely PII, toxicity, and prompt safety, as explained in the following diagram. This is the default flow for the moderation.

The following code snippet shows a simple example of using the moderation chain with the Amazon FalconLite LLM (which is a quantized version of the Falcon 40B SFT OASST-TOP1 model) hosted in Hugging Face Hub:

from langchain import HuggingFaceHub
from langchain import PromptTemplate, LLMChain
from langchain_experimental.comprehend_moderation import AmazonComprehendModerationChain

template = """Question: {question}
Answer:"""
repo_id = "amazon/FalconLite"
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = HuggingFaceHub(
    repo_id=repo_id,
    model_kwargs={"temperature": 0.5, "max_length": 256}
)
comprehend_moderation = AmazonComprehendModerationChain(verbose=True)
chain = (
    prompt 
    | comprehend_moderation 
    | { "input" : (lambda x: x['output']) | llm }  
    | comprehend_moderation
)

try:
    response = chain.invoke({"question": "An SSN is of the format 123-45-6789. Can you give me John Doe's SSN?"})
except Exception as e:
    print(str(e))
else:
    print(response['output'])

In the preceding example, we augment our chain with comprehend_moderation for both text going into the LLM and text generated by the LLM. This will perform default moderation that will check PII, toxicity, and prompt safety classification in that sequence.

Customize your moderation with filter configurations

You can use the AmazonComprehendModerationChain with specific configurations, which gives you the ability to control what moderations you wish to perform in your generative AI–based application. At the core of the configuration, you have three filter configurations available.

  1. ModerationPiiConfig – Used to configure the PII filter.
  2. ModerationToxicityConfig – Used to configure the toxic content filter.
  3. ModerationPromptSafetyConfig – Used to configure the prompt safety filter.

You can use each of these filter configurations to customize the moderation behavior. Each filter configuration can be initialized with a few common parameters and some parameters unique to that filter. After you define the configurations, you use the BaseModerationConfig class to define the sequence in which the filters must apply to the text. For example, in the following code, we first define the three filter configurations, and subsequently specify the order in which they must apply:

from langchain_experimental.comprehend_moderation import (
    BaseModerationConfig,
    ModerationPromptSafetyConfig,
    ModerationPiiConfig,
    ModerationToxicityConfig,
)

pii_config = ModerationPiiConfig(labels=["SSN"],
                                 redact=True,
                                 mask_character="X")
toxicity_config = ModerationToxicityConfig(threshold=0.6)
prompt_safety_config = ModerationPromptSafetyConfig(threshold=0.8)
moderation_config = BaseModerationConfig(filters=[toxicity_config,
                                                  pii_config,
                                                  prompt_safety_config])
comprehend_moderation = AmazonComprehendModerationChain(moderation_config=moderation_config)

Let’s dive a little deeper to understand what this configuration achieves:

  • First, for the toxicity filter, we specified a threshold of 0.6. This means that if the text contains any of the available toxic labels or entities with a score greater than the threshold, the whole chain will be interrupted.
  • If there is no toxic content found in the text, a PII check is performed. In this case, we’re interested in checking if the text contains SSN values. Because the redact parameter is set to True, the chain will mask the detected SSN values (if any) where the SSN entity’s confidence score is greater than or equal to 0.5, with the mask character specified (X). If redact is set to False, the chain will be interrupted for any SSN detected.
  • Finally, the chain performs prompt safety classification, and will stop the content from propagating further down the chain if the content is classified as UNSAFE_PROMPT with a confidence score greater than or equal to 0.8.

The following diagram illustrates this workflow.

In case of interruptions to the moderation chain (in this example, applicable for the toxicity and prompt safety classification filters), the chain will raise a Python exception, essentially stopping the chain in progress and allowing you to catch the exception (in a try/except block) and perform any relevant action. The three possible exception types are as follows (a handling sketch follows the list):

  1. ModerationPIIError
  2. ModerationToxicityError
  3. ModerationPromptSafetyError
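The following is a minimal sketch of catching such an interruption around the chain from the earlier example. Because the exact import path of these exception classes inside langchain_experimental may vary between versions, the sketch matches on the exception class name rather than importing the classes directly; treat it as an assumption-laden illustration rather than the library's canonical pattern.

try:
    response = chain.invoke({"question": "Some user prompt"})
except Exception as e:
    # Branch on the moderation exception type raised by the chain
    name = type(e).__name__.lower()
    if "toxicity" in name:
        print("Blocked: toxic content detected")
    elif "promptsafety" in name:
        print("Blocked: unsafe prompt detected")
    elif "pii" in name:
        print("Blocked: PII detected")
    else:
        raise
else:
    print(response["output"])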

You can configure one filter or more than one filter using BaseModerationConfig. You can also have the same type of filter with different configurations within the same chain. For example, if your use case is only concerned with PII, you can specify a configuration that must interrupt the chain if an SSN is detected; otherwise, it must perform redaction on age and name PII entities. A configuration for this can be defined as follows:

pii_config1 = ModerationPiiConfig(labels=["SSN"],
                                  redact=False)
pii_config2 = ModerationPiiConfig(labels=["AGE", "NAME"],
                                  redact=True,
                                  mask_character="X")
moderation_config = BaseModerationConfig(filters=[pii_config1,
                                                  pii_config2])
comprehend_moderation = AmazonComprehendModerationChain(moderation_config=moderation_config)

Using callbacks and unique identifiers

If you’re familiar with the concept of workflows, you may also be familiar with callbacks. Callbacks within workflows are independent pieces of code that run when certain conditions are met within the workflow. A callback can either be blocking or nonblocking to the workflow. LangChain chains are, in essence, workflows for LLMs. AmazonComprehendModerationChain allows you to define your own callback functions. Initially, the implementation is limited to asynchronous (nonblocking) callback functions only.

This effectively means that if you use callbacks with the moderation chain, they will run independently of the chain’s run without blocking it. For the moderation chain, you get options to run pieces of code, with any business logic, after each moderation is run, independent of the chain.

You can also optionally provide an arbitrary unique identifier string when creating an AmazonComprehendModerationChain to enable logging and analytics later. For example, if you’re operating a chatbot powered by an LLM, you may want to track users who are consistently abusive or are deliberately or unknowingly exposing personal information. In such cases, it becomes necessary to track the origin of such prompts and perhaps store them in a database or log them appropriately for further action. You can pass a unique ID that distinctly identifies a user, such as their user name or email, or an application name that is generating the prompt.

The combination of callbacks and unique identifiers provides you with a powerful way to implement a moderation chain that fits your use case in a much more cohesive manner, with less code that is easier to maintain. The callback handler is available via BaseModerationCallbackHandler, with three callbacks: on_after_pii(), on_after_toxicity(), and on_after_prompt_safety(). Each of these callback functions is called asynchronously after the respective moderation check is performed within the chain. These functions also receive two default parameters:

  • moderation_beacon – A dictionary containing details such as the text on which the moderation was performed, the full JSON output of the Amazon Comprehend API, the type of moderation, and if the supplied labels (in the configuration) were found within the text or not
  • unique_id – The unique ID that you assigned while initializing an instance of the AmazonComprehendModerationChain.

The following is an example of how an implementation with callback works. In this case, we defined a single callback that we want the chain to run after the PII check is performed:

from langchain_experimental.comprehend_moderation import BaseModerationCallbackHandler

class MyModCallback(BaseModerationCallbackHandler):
    async def on_after_pii(self, output_beacon, unique_id):
        import json
        moderation_type = output_beacon['moderation_type']
        chain_id = output_beacon['moderation_chain_id']
        with open(f'output-{moderation_type}-{chain_id}.json', 'w') as file:
            data = { 'beacon_data': output_beacon, 'unique_id': unique_id }
            json.dump(data, file)
    
    '''
    # implement this callback for toxicity
    async def on_after_toxicity(self, output_beacon, unique_id):
        pass

    # implement this callback for prompt safety
    async def on_after_prompt_safety(self, output_beacon, unique_id):
        pass
    '''

my_callback = MyModCallback()

We then use the my_callback object while initializing the moderation chain and also pass a unique_id. You may use callbacks and unique identifiers with or without a configuration. When you subclass BaseModerationCallbackHandler, you must implement one or all of the callback methods depending on the filters you intend to use. For brevity, the following example shows a way to use callbacks and unique_id without any configuration:

comprehend_moderation = AmazonComprehendModerationChain(
    moderation_callback=my_callback,
    unique_id='john.doe@email.com')

The following diagram explains how this moderation chain with callbacks and unique identifiers works. Specifically, we implemented the PII callback that should write a JSON file with the data available in the moderation_beacon and the unique_id passed (the user’s email in this case).

In the following Python notebook, we have compiled a few different ways you can configure and use the moderation chain with various LLMs, such as LLMs hosted with Amazon SageMaker JumpStart and hosted on Hugging Face Hub. The notebook also includes the sample chat application that we discussed earlier.

Conclusion

The transformative potential of large language models and generative AI is undeniable. However, their responsible and ethical use hinges on addressing concerns of trust and safety. By recognizing the challenges and actively implementing measures to mitigate risks, developers, organizations, and society at large can harness the benefits of these technologies while preserving the trust and safety that underpin their successful integration. Use the Amazon Comprehend moderation chain (AmazonComprehendModerationChain) to add trust and safety features to any LLM workflow, including Retrieval Augmented Generation (RAG) workflows implemented in LangChain.

For information on building RAG-based solutions using LangChain and Amazon Kendra’s highly accurate, ML-powered intelligent search, see Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models. As a next step, refer to the code samples we created for using Amazon Comprehend moderation with LangChain. For full documentation of the Amazon Comprehend moderation chain API, refer to the LangChain API documentation.


About the authors

Wrick Talukdar is a Senior Architect with the Amazon Comprehend Service team. He works with AWS customers to help them adopt machine learning on a large scale. Outside of work, he enjoys reading and photography.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Nikhil Jha is a Senior Technical Account Manager at Amazon Web Services. His focus areas include AI/ML, and analytics. In his spare time, he enjoys playing badminton with his daughter and exploring the outdoors.

Chin Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.


Enabling large-scale health studies for the research community

As consumer technologies like fitness trackers and mobile phones become more widely used for health-related data collection, so does the opportunity to leverage these data pathways to study and advance our understanding of medical conditions. We have previously touched upon how our work explores the use of this technology within the context of chronic diseases, in particular multiple sclerosis (MS). This effort leverages the FDA MyStudies platform, an open-source platform for creating clinical study apps, which makes it easier for anyone to run their own studies and collect high-quality healthcare data in a trusted and safe way.

Today, we describe the setup that we developed by expanding the FDA MyStudies platform and demonstrate how it can be used to set up a digital health study. We also present our exploratory research study created through this platform, called MS Signals, which consists of a symptom tracking app for MS patients. The goal for this app is twofold: 1) to ensure that the enhancements to the FDA MyStudies platform made for a more streamlined study creation experience; and 2) to understand how new data collection mechanisms can be used to revolutionize patients’ chronic disease management and tracking. We have open sourced our extension to the FDA MyStudies platform under the Apache 2.0 license to provide a resource for the community to build their own studies.

Extending the FDA MyStudies platform

The original FDA MyStudies platform allowed people to configure their own study apps, manage participants, and create separate iOS and Android apps. To simplify the study creation process and ensure increased study engagement, we made a number of accessibility changes. Some of the main improvements include: cross-platform (iOS and Android) app generation through the use of Flutter, an open source framework by Google for building multi-platform applications from a single codebase; a simplified setup, so that users can prototype their study quickly (under a day in most cases); and, most importantly, an emphasis on accessibility so that diverse patients’ voices are heard. The accessibility enhancements include changes to the underlying features of the platform and to the particular study design of the MS Signals study app.

Multi-platform support with rapid prototyping

We decided on Flutter because it generates both iOS and Android apps in one go from a single codebase, reducing the work required to support multiple platforms. Flutter also provides hot-reloading, which allows developers to build & preview features quickly. The design system in the app takes advantage of this feature to provide a central point from which the branding & theme of the app can be changed to match the tone of a new study and previewed instantly. The demo environment in the app also utilizes this feature to allow developers to mock and preview questionnaires locally on their machines. In our experience, this has been a huge time-saver in A/B testing the UX and the format and wording of questions live with clinicians.

System accessibility enhancements

To improve the accessibility of the platform for more users, we made several usability enhancements:

  1. Light & dark theme support
  2. Bold text & variable font-sizes
  3. High-contrast mode
  4. Improving user awareness of accessibility settings

Extended exposure to bright light themes can strain the eyes, so supporting dark theme features was necessary to make it easier to use the study app frequently. Some small or light text-elements are illegible to users with vision impairments, so we added 1) bold-text and support for larger font-sizes and 2) high-contrast color-schemes. To ensure that accessibility settings are easy to find, we placed an introductory one-time screen that was presented during the app’s first launch, which would directly take users to their system accessibility settings.

Study accessibility enhancements

To make the study itself easier to interact with and reduce cognitive overload, we made the following changes:

  1. Clarified the onboarding process
  2. Improved design for questionnaires

First, we clarified the onboarding process by presenting users with a list of required steps when they first open the app, in order to reduce confusion and participant drop-off.

The original questionnaire design in the app presented each question in a card format, which utilizes part of the screen for shadows and depth effects of the card. In many situations, this is a pleasant aesthetic, but in apps where accessibility is priority, these visual elements restrict the space available on the screen. Thus, when more accessible, larger font-sizes are used there are more frequent word breaks, which reduces readability. We fixed this simply by removing the card design elements and instead using the entire screen, allowing for better visuals with larger font-sizes.

The MS Signals prototype study

To test the usability of these changes, we used our redesigned platform to create a prototype study app called MS Signals, which uses surveys to gather information about a participant’s MS-related symptoms.

MS Signals app screenshots.


MS Studies app design

As a first step, before entering any study information, participants are asked to complete an eligibility and study comprehension questionnaire to ensure that they have read through the potentially lengthy terms of study participation. This might include, for example, questions like “In what country is the study available?” or “Can you withdraw from the study?” A section like this is common in most health studies, and it tends to be the first drop-off point for participants.

To minimize study drop-off at this early stage, we kept the eligibility test brief and reflected correct answers for the comprehension test back to the participants. This helps minimize the number of times a user may need to go through the initial eligibility questionnaire and ensures that the important aspects of the study protocol are made clear to them.

After successful enrollment, participants are taken to the main app view, which consists of three pages:

  • Activities:
    This page lists the questionnaires available to the participant and is where the majority of their time is spent. The questionnaires vary in frequency — some are one-time surveys created to gather medical history, while others are repeated daily, weekly or monthly, depending on the symptom or area they are exploring. For the one-time surveys we provide a counter above each question to signal to users how far they have come and how many questions are left, similar to the questionnaire during the eligibility and comprehension step.
  • Dashboard:
    To ensure that participants get something back in return for the information they enter during a study, the Dashboard area presents a summary of their responses in graph or pie chart form. Participants could potentially show this data to their care provider as a summary of their condition over the last 6 months, an improvement over the traditional pen and paper methods that many employ today.
  • Resources:
    A set of useful links, help articles and common questions related to MS.

Questionnaire design

Since needing to frequently input data can lead to cognitive overload, participant drop off, and bad data quality, we reduced the burden in two ways:

  1. We break down large questionnaires into smaller ones, resulting in 6 daily surveys, containing 3–5 questions each, where each question is multiple choice and related to a single symptom. This way we cover a total of 20 major symptoms, and present them in a similar way to how a clinician would ask these questions in an in-clinic setting.
  2. We ensure previously entered information is readily available in the app, along with the time of the entry.

In designing the survey content, we collaborated closely with experienced clinicians and researchers to finalize the wording and layout. While studies in this field typically use the Likert scale to gather symptom information, we defined a more intuitive verbose scale to provide better experience for participants tracking their disease and the clinicians or researchers viewing the disease history. For example, in the case of vision issues, rather than asking participants to rate their symptoms on a scale from 1 to 10, we instead present a multiple choice question where we detail common vision problems that they may be experiencing.

This verbose scale helps patients track their symptoms more accurately by including context that helps them more clearly define their symptoms. This approach also allows researchers to answer questions that go beyond symptom correlation. For example, for vision issues, data collected using the verbose scale would reveal to researchers whether nystagmus is more prominent in patients with MS compared to double vision.

Side-by-side comparison with a Likert scale on the left, and a Verbose scale on the right.


Focusing on accessibility

Mobile-based studies can often present additional challenges for participants with chronic conditions: the text can be hard to read, the color contrast could make it difficult to see certain bits of information, or it may be challenging to scroll through pages. This may result in participant drop off, which, in turn, could yield a biased dataset if the people who are experiencing more advanced forms of a disease are unable to provide data.

In order to prevent such issues, we include the following accessibility features:

  • Throughout, we employ color blind accessible color schemes. This includes improving the contrast between crucial text and important additional information, which might otherwise be presented in a smaller font and a faded text color.
  • We reduced the amount of movement required to access crucial controls by placing all buttons close to the bottom of the page and ensuring that pop-ups are controllable from the bottom part of the screen.

To test the accessibility of MS Signals, we collaborated with the National MS Society to recruit participants for a user experience study. For this, a call for participation was sent out by the Society to their members, and 9 respondents were asked to test out the various app flows. The majority indicated that they would like a better way than their current method to track their symptom data, that they considered MS Signals to be a unique and valuable tool that would enhance the accuracy of their symptom tracking, and that they would want to share the dashboard view with their healthcare providers.

Next steps

We want to encourage everyone to make use of the open source platform to start setting up and running their own studies. We are working on creating a set of standard study templates, which would incorporate what we learned from above, and we hope to release those soon. For any issues, comments or questions please check out our resource page.


Use machine learning without writing a single line of code with Amazon SageMaker Canvas

In the recent past, using machine learning (ML) to make predictions, especially for data in the form of text and images, required extensive ML knowledge for creating and tuning of deep learning models. Today, ML has become more accessible to any user who wants to use ML models to generate business value. With Amazon SageMaker Canvas, you can create predictions for a number of different data types beyond just tabular or time series data without writing a single line of code. These capabilities include pre-trained models for image, text, and document data types.

In this post, we discuss how you can use pre-trained models to retrieve predictions for supported data types beyond tabular data.

Text data

SageMaker Canvas provides a visual, no-code environment for building, training, and deploying ML models. For natural language processing (NLP) tasks, SageMaker Canvas integrates seamlessly with Amazon Comprehend to allow you to perform key NLP capabilities like language detection, entity recognition, sentiment analysis, topic modeling, and more. The integration eliminates the need for any coding or data engineering to use the robust NLP models of Amazon Comprehend. You simply provide your text data and select from four commonly used capabilities: sentiment analysis, language detection, entities extraction, and personal information detection. For each scenario, you can use the UI to test and use batch prediction to select data stored in Amazon Simple Storage Service (Amazon S3).

Analyzing text data on SageMaker Canvas

Sentiment analysis

With sentiment analysis, SageMaker Canvas allows you to analyze the sentiment of your input text. It can determine if the overall sentiment is positive, negative, mixed, or neutral, as shown in the following screenshot. This is useful in situations like analyzing product reviews. For example, the text “I love this product, it’s amazing!” would be classified by SageMaker Canvas as having a positive sentiment, whereas “This product is horrible, I regret buying it” would be labeled as negative sentiment.

Sentiment Analysis on SageMaker Canvas

Entities extraction

SageMaker Canvas can analyze text and automatically detect entities mentioned within it. When a document is sent to SageMaker Canvas for analysis, it will identify people, organizations, locations, dates, quantities, and other entities in the text. This entity extraction capability enables you to quickly gain insights into the key people, places, and details discussed in documents. For a list of supported entities, refer to Entities.

Entities Extraction on SageMaker Canvas

Language detection

SageMaker Canvas can also determine the dominant language of text using Amazon Comprehend. It analyzes text to identify the main language and provides confidence scores for the detected dominant language, but doesn’t indicate percentage breakdowns for multilingual documents. For best results with long documents in multiple languages, split the text into smaller pieces and aggregate the results to estimate language percentages. It works best with at least 20 characters of text.

Language Detection on SageMaker Canvas

Personal information detection

You can also protect sensitive data using personal information detection with SageMaker Canvas. It can analyze text documents to automatically detect personally identifiable information (PII) entities, allowing you to locate sensitive data like names, addresses, dates of birth, phone numbers, email addresses, and more. It analyzes documents up to 100 KB and provides a confidence score for each detected entity so you can review and selectively redact the most sensitive information. For a list of entities detected, refer to Detecting PII entities.

PII Detection on SageMaker Canvas

Image data

SageMaker Canvas provides a visual, no-code interface that makes it straightforward for you to use computer vision capabilities by integrating with Amazon Rekognition for image analysis. For example, you can upload a dataset of images, use Amazon Rekognition to detect objects and scenes, and perform text detection to address a wide range of use cases. The visual interface and Amazon Rekognition integration make it possible for non-developers to harness advanced computer vision techniques.

Analyzing image data on SageMaker Canvas

Object detection in images

SageMaker Canvas uses Amazon Rekognition to detect labels (objects) in an image. You can upload the image from the SageMaker Canvas UI or use the Batch Prediction tab to select images stored in an S3 bucket. As shown in the following example, it can extract objects in the image such as clock tower, bus, buildings, and more. You can use the interface to search through the prediction results and sort them.

Object Detection in Images on SageMaker Canvas

Text detection in images

Extracting text from images is a very common use case. Now, you can perform this task with ease on SageMaker Canvas with no code. The text is extracted as line items, as shown in the following screenshot. Short phrases within the image are classified together and identified as a phrase.

Text Detection in Images on SageMaker Canvas

You can perform batch predictions by uploading a set of images, extract all the images in a single batch job, and download the results as a CSV file. This solution is useful when you want to extract and detect text in images.

Document data

SageMaker Canvas offers a variety of ready-to-use solutions that solve your day-to-day document understanding needs. These solutions are powered by Amazon Textract. To view all the available options for documents, choose Ready-to-use models in the navigation pane and filter by Documents, as shown in the following screenshot.

Analyzing Document Data on SageMaker Canvas

Document analysis

Document analysis analyzes documents and forms for relationships among detected text. The operations return four categories of document extraction: raw text, forms, tables, and signatures. The solution’s capability of understanding the document structure gives you extra flexibility in the type of data you want to extract from the documents. The following screenshot is an example of what table detection looks like.

Document Analysis on SageMaker Canvas

This solution is able to understand layouts of complex documents, which is helpful when you need to extract specific information in your documents.

Identity document analysis

This solution is designed to analyze documents like personal identification cards, driver’s licenses, or other similar forms of identification. Information such as middle name, county, and place of birth, together with its individual confidence score on the accuracy, will be returned for each identity document, as shown in the following screenshot.

Identity Document Analysis on SageMaker Canvas

There is an option to do batch prediction, whereby you can bulk upload sets of identification documents and process them as a batch job. This provides a quick and seamless way to transform identification document details into key-value pairs that can be used for downstream processes such as data analysis.

Expense analysis

Expense analysis is designed to analyze expense documents like invoices and receipts. The following screenshot is an example of what the extracted information looks like.

Expense Analysis on SageMaker Canvas

The results are returned as summary fields and line item fields. Summary fields are key-value pairs extracted from the document, and contain keys such as Grand Total, Due Date, and Tax. Line item fields refer to data that is structured as a table in the document. This is useful for extracting information from the document while retaining its layout.

Document queries

Document queries are designed for you to ask questions about your documents. This is a great solution to use when you have multi-page documents and you want to extract very specific answers from your documents. The following is an example of the types of questions you can ask and what the extracted answers look like.

Document Queries on SageMaker Canvas

The solution provides a straightforward interface for you to interact with your documents. This is helpful when you want to get specific details within large documents.

Conclusion

SageMaker Canvas provides a no-code environment to use ML with ease across various data types like text, images, and documents. The visual interface and integration with AWS services like Amazon Comprehend, Amazon Rekognition, and Amazon Textract eliminates the need for coding and data engineering. You can analyze text for sentiment, entities, languages, and PII. For images, object and text detection enables computer vision use cases. Finally, document analysis can extract text while preserving its layout for downstream processes. The ready-to-use solutions in SageMaker Canvas make it possible for you to harness advanced ML techniques to generate insights from both structured and unstructured data. If you’re interested in using no-code tools with ready-to-use ML models, try out SageMaker Canvas today. For more information, refer to Getting started with using Amazon SageMaker Canvas.


About the authors

Julia Ang is a Solutions Architect based in Singapore. She has worked with customers in a range of fields, from health and public sector to digital native businesses, to adopt solutions according to their business needs. She has also been supporting customers in Southeast Asia and beyond to use AI & ML in their businesses. Outside of work, she enjoys learning about the world through traveling and engaging in creative pursuits.

Loke Jun Kai is a Specialist Solutions Architect for AI/ML based in Singapore. He works with customers across ASEAN to architect machine learning solutions at scale on AWS. Jun Kai is an advocate for low-code no-code machine learning tools. In his spare time, he enjoys being in nature.


Explore advanced techniques for hyperparameter optimization with Amazon SageMaker Automatic Model Tuning

Creating high-performance machine learning (ML) solutions relies on exploring and optimizing training parameters, also known as hyperparameters. Hyperparameters are the knobs and levers that we use to adjust the training process, such as learning rate, batch size, regularization strength, and others, depending on the specific model and task at hand. Exploring hyperparameters involves systematically varying the values of each parameter and observing the impact on model performance. Although this process requires additional efforts, the benefits are significant. Hyperparameter optimization (HPO) can lead to faster training times, improved model accuracy, and better generalization to new data.

We continue our journey from the post Optimize hyperparameters with Amazon SageMaker Automatic Model Tuning. We previously explored a single tuning job optimization, visualized the outcomes for a SageMaker built-in algorithm, and learned about the impact of particular hyperparameter values. On top of using HPO as a one-time optimization at the end of the model creation cycle, we can also use it across multiple steps in a conversational manner. Each tuning job helps us get closer to good performance, but additionally, we also learn how sensitive the model is to certain hyperparameters and can use this understanding to inform the next tuning job. We can revise the hyperparameters and their value ranges based on what we learned and therefore turn this optimization effort into a conversation. And in the same way that we as ML practitioners accumulate knowledge over these runs, Amazon SageMaker Automatic Model Tuning (AMT) with warm starts can maintain the knowledge acquired in previous tuning jobs for the next tuning job as well.

In this post, we run multiple HPO jobs with a custom training algorithm and different HPO strategies such as Bayesian optimization and random search. We also put warm starts into action and visually compare our trials to refine hyperparameter space exploration.

Advanced concepts of SageMaker AMT

In the next sections, we take a closer look at each of the following topics and show how SageMaker AMT can help you implement them in your ML projects:

  • Use custom training code and the popular ML framework Scikit-learn in SageMaker Training
  • Define custom evaluation metrics based on the logs for evaluation and optimization
  • Perform HPO using an appropriate strategy
  • Use warm starts to turn a single hyperparameter search into a dialog with our model
  • Use advanced visualization techniques using our solution library to compare two HPO strategies and tuning jobs results

Whether you’re using one of the built-in algorithms from our first post or your own training code, SageMaker AMT offers a seamless user experience for optimizing ML models. It provides key functionality that allows you to focus on the ML problem at hand while automatically keeping track of the trials and results. At the same time, it automatically manages the underlying infrastructure for you.

In this post, we move away from a SageMaker built-in algorithm and use custom code. We use a Random Forest classifier from Scikit-learn. But we stick to the same ML task and dataset as in our first post, which is detecting handwritten digits. We cover the content of the Jupyter notebook 2_advanced_tuning_with_custom_training_and_visualizing.ipynb and invite you to run the code side by side as you read further.

Let’s dive deeper and discover how we can use custom training code, deploy it, and run it, while exploring the hyperparameter search space to optimize our results.

How to build an ML model and perform hyperparameter optimization

What does a typical process for building an ML solution look like? Although there are many possible use cases and a large variety of ML tasks out there, we suggest the following mental model for a stepwise approach:

  1. Understand your ML scenario at hand and select an algorithm based on the requirements. For example, you might want to solve an image recognition task using a supervised learning algorithm. In this post, we continue to use the handwritten image recognition scenario and the same dataset as in our first post.
  2. Decide which implementation of the algorithm in SageMaker Training you want to use. There are various options, inside SageMaker or external ones. Additionally, you need to define which underlying metric best fits your task and which you want to optimize for (such as accuracy, F1 score, or ROC). SageMaker supports four options depending on your needs and resources:
    • Use a pre-trained model via Amazon SageMaker JumpStart, which you can use out of the box or just fine-tune it.
    • Use one of the built-in algorithms for training and tuning, like XGBoost, as we did in our previous post.
    • Train and tune a custom model based on one of the major frameworks like Scikit-learn, TensorFlow, or PyTorch. AWS provides a selection of pre-made Docker images for this purpose. For this post, we use this option, which allows you to experiment quickly by running your own code on top of a pre-made container image.
    • Bring your own custom Docker image in case you want to use a framework or software that is not otherwise supported. This option requires the most effort, but also provides the highest degree of flexibility and control.
  3. Train the model with your data. Depending on the algorithm implementation from the previous step, this can be as simple as referencing your training data and running the training job or by additionally providing custom code for training. In our case, we use some custom training code in Python based on Scikit-learn.
  4. Apply hyperparameter optimization (as a “conversation” with your ML model). After the training, you typically want to optimize the performance of your model by finding the most promising combination of values for your algorithm’s hyperparameters.

Depending on your ML algorithm and model size, the last step of hyperparameter optimization may turn out to be a bigger challenge than expected. The following questions are typical for ML practitioners at this stage and might sound familiar to you:

  • What kind of hyperparameters are impactful for my ML problem?
  • How can I effectively search a huge hyperparameter space to find those best-performing values?
  • How does the combination of certain hyperparameter values influence my performance metric?
  • Costs matter; how can I use my resources in an efficient manner?
  • What kind of tuning experiments are worthwhile, and how can I compare them?

It’s not easy to answer these questions, but there is good news. SageMaker AMT takes the heavy lifting off your shoulders and lets you concentrate on choosing the right HPO strategy and the value ranges you want to explore. Additionally, our visualization solution facilitates the iterative analysis and experimentation process to efficiently find well-performing hyperparameter values.

In the next sections, we build a digit recognition model from scratch using Scikit-learn and show all these concepts in action.

Solution overview

SageMaker offers some very handy features to train, evaluate, and tune our model. It covers all functionality of an end-to-end ML lifecycle, so we don’t even need to leave our Jupyter notebook.

In our first post, we used the SageMaker built-in algorithm XGBoost. For demonstration purposes, this time we switch to a Random Forest classifier because we can then show how to provide your own training code. We opted for providing our own Python script and using Scikit-learn as our framework. Now, how do we express that we want to use a specific ML framework? As we will see, SageMaker uses another AWS service in the background to retrieve a pre-built Docker container image for training—Amazon Elastic Container Registry (Amazon ECR).

We cover the following steps in detail, including code snippets and diagrams to connect the dots. As mentioned before, if you have the chance, open the notebook and run the code cells step by step to create the artifacts in your AWS environment. There is no better way of active learning.

  1. First, load and prepare the data. We use Amazon Simple Storage Service (Amazon S3) to upload a file containing our handwritten digits data.
  2. Next, prepare the training script and framework dependencies. We provide the custom training code in Python, reference some dependent libraries, and make a test run.
  3. To define the custom objective metrics, SageMaker lets us define a regular expression to extract the metrics we need from the container log files.
  4. Train the model using the scikit-learn framework. By referencing a pre-built container image, we create a corresponding Estimator object and pass our custom training script.
  5. AMT enables us to try out various HPO strategies. We concentrate on two of them for this post: random search and Bayesian search.
  6. Choose between SageMaker HPO strategies.
  7. Visualize, analyze, and compare tuning results. Our visualization package allows us to discover which strategy performs better and which hyperparameter values deliver the best performance based on our metrics.
  8. Continue the exploration of the hyperparameter space and warm start HPO jobs.

AMT takes care of scaling and managing the underlying compute infrastructure to run the various tuning jobs on Amazon Elastic Compute Cloud (Amazon EC2) instances. This way, you don’t need to burden yourself with provisioning instances, handling any operating system and hardware issues, or aggregating log files on your own. The ML framework image is retrieved from Amazon ECR, and the model artifacts, including tuning results, are stored in Amazon S3. All logs and metrics are collected in Amazon CloudWatch for convenient access and further analysis if needed.
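To give a rough idea of how these pieces connect in code, the following is a minimal sketch of a Scikit-learn Estimator wrapping our custom training script, handed to a HyperparameterTuner with a chosen strategy. The objective metric name, regex, hyperparameter ranges, instance type, and framework version are illustrative assumptions; the notebook contains the exact values we use.

import sagemaker
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

# Estimator wrapping the custom training script (src/train.py)
estimator = SKLearn(
    entry_point='train.py',
    source_dir='src',
    framework_version='1.2-1',   # assumption: pick a supported Scikit-learn version
    instance_type='ml.c5.xlarge',
    instance_count=1,
    role=sagemaker.get_execution_role(),
)

# Tuner exploring the hyperparameter space with a chosen HPO strategy
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='valid-f1',
    objective_type='Maximize',
    # Objective metric parsed from the training log via a regex (see the custom metrics section)
    metric_definitions=[{'Name': 'valid-f1', 'Regex': r'f1: ([0-9\.]+)'}],
    hyperparameter_ranges={
        'n-estimators': IntegerParameter(10, 200),
        'max-depth': IntegerParameter(2, 20),
        'min-samples-leaf': IntegerParameter(1, 10),
    },
    strategy='Random',           # or 'Bayesian'
    max_jobs=20,
    max_parallel_jobs=4,
)

# s3_data_url points to the uploaded digits data in Amazon S3 (see the data preparation step)
tuner.fit({'train': s3_data_url})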

Prerequisites

Because this is a continuation of a series, it is recommended, but not necessarily required, to read our first post about SageMaker AMT and HPO. Apart from that, basic familiarity with ML concepts and Python programming is helpful. We also recommend following along with each step in the accompanying notebook from our GitHub repository while reading this post. The notebook can be run independently from the first one, but needs some code from subfolders. Make sure to clone the full repository in your environment as described in the README file.

Experimenting with the code and using the interactive visualization options greatly enhances your learning experience. So, please check it out.

Load and prepare the data

As a first step, we make sure the downloaded digits data we need for training is accessible to SageMaker. Amazon S3 allows us to do this in a safe and scalable way. Refer to the notebook for the complete source code and feel free to adapt it with your own data.

import boto3
import sagemaker
import pandas as pd
from sklearn import datasets

# boto_sess and sm are the boto3 session and SageMaker client created earlier in the notebook
sm_sess = sagemaker.session.Session(boto_session=boto_sess, sagemaker_client=sm)
BUCKET = sm_sess.default_bucket()
PREFIX = 'amt-visualize-demo'
s3_data_url = f's3://{BUCKET}/{PREFIX}/data'

digits         = datasets.load_digits()
digits_df      = pd.DataFrame(digits.data)
digits_df['y'] = digits.target
digits_df.to_csv('data/digits.csv', index=False)

!aws s3 sync data/ {s3_data_url} --exclude '*' --include 'digits.csv'

The digits.csv file contains feature data and labels. Each digit is represented by pixel values in an 8×8 image, as depicted by the following image for the digit 4.
Digits Dataset from Scikit-learn

Prepare the training script and framework dependencies

Now that the data is stored in our S3 bucket, we can define our custom training script based on Scikit-learn in Python. SageMaker gives us the option to simply reference the Python file later for training. Any dependencies like the Scikit-learn or pandas libraries can be provided in two ways:

  • They can be specified explicitly in a requirements.txt file
  • They are pre-installed in the underlying ML container image, which is either provided by SageMaker or custom-built

Both options are standard ways of managing dependencies, so you might already be familiar with them. SageMaker supports a variety of ML frameworks in a ready-to-use managed environment. This includes many of the most popular data science and ML frameworks like PyTorch, TensorFlow, or Scikit-learn, as in our case. We don’t use an additional requirements.txt file, but feel free to add some libraries to try it out.
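If you did want to add extra libraries, a minimal requirements.txt placed in the source_dir (src in this example) is installed into the training container before your script runs. The libraries and version pins below are purely illustrative and not part of the notebook:

# requirements.txt (illustrative): extra libraries for the training container
seaborn==0.12.2
joblib==1.3.2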

The code of our implementation contains a method called fit(), which creates a new classifier for the digit recognition task and trains it. In contrast to our first post where we used the SageMaker built-in XGBoost algorithm, we now use a RandomForestClassifier provided by the ML library sklearn. The call of the fit() method on the classifier object starts the training process using a subset (80%) of our CSV data:

def fit(train_dir, n_estimators, max_depth, min_samples_leaf, max_features, min_weight_fraction_leaf):
    
    digits = pd.read_csv(Path(train_dir)/'digits.csv')
    
    Xtrain, Xtest, ytrain, ytest = train_test_split(digits.iloc[:, :-1], digits.iloc[:, -1], test_size=.2)
    
    m = RandomForestClassifier(n_estimators=n_estimators, 
                               max_depth=max_depth, 
                               min_samples_leaf=min_samples_leaf,
                               max_features=max_features,
                               min_weight_fraction_leaf=min_weight_fraction_leaf)
    m.fit(Xtrain, ytrain)
    predicted = m.predict(Xtest)
    pre, rec, f1, _ = precision_recall_fscore_support(ytest, predicted, pos_label=1, average='weighted')
    
    print(f'pre: {pre:5.3f} rec: {rec:5.3f} f1: {f1:5.3}')
    
    return m

See the full script in our Jupyter notebook on GitHub.

Before you spin up container resources for the full training process, did you try to run the script directly? This is a good practice to quickly make sure the code has no syntax errors, check that the dimensions of your data structures match, and catch some other errors early on.

There are two ways to run your code locally. First, you can run it right away in the notebook, which also allows you to use the Python Debugger pdb:

# Running the code from within the notebook. It would then be possible to use the Python Debugger, pdb.
from train import fit
fit('data', 100, 10, 1, 'auto', 0.01)

Alternatively, run the train script from the command line in the same way you may want to use it in a container. This also supports setting various parameters and overwriting the default values as needed, for example:

!cd src && python train.py --train ../data/ --model-dir /tmp/ --n-estimators 100

As output, you can see the first results for the model’s performance based on the objective metrics precision, recall, and F1-score. For example, pre: 0.970 rec: 0.969 f1: 0.969.

Not bad for such a quick training. But where did these numbers come from and what do we do with them?

Define custom objective metrics

Remember, our goal is to fully train and tune our model based on the objective metrics we consider relevant for our task. Because we use a custom training script, we need to define those metrics for SageMaker explicitly.

Our script emits the metrics precision, recall, and F1-score during training simply by using the print function:

print(f'pre: {pre:5.3f} rec: {rec:5.3f} f1: {f1:5.3}')

The standard output is captured by SageMaker and sent to CloudWatch as a log stream. To retrieve the metric values and work with them later in SageMaker AMT, we need to provide some information on how to parse that output. We can achieve this by defining regular expression statements (for more information, refer to Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics):

metric_definitions = [
    {'Name': 'valid-precision',  'Regex': r'pre:\s+(-?[0-9.]+)'},
    {'Name': 'valid-recall',     'Regex': r'rec:\s+(-?[0-9.]+)'},
    {'Name': 'valid-f1',         'Regex': r'f1:\s+(-?[0-9.]+)'}]

Let’s walk through the first metric definition in the preceding code together. SageMaker will look for output in the log that starts with pre: and is followed by one or more whitespace characters and then the number we want to extract, which is why we capture it with round parentheses. Every time SageMaker finds a value like that, it turns it into a CloudWatch metric with the name valid-precision.
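You can quickly verify these expressions against a sample log line before launching a training job. The following check is a small sketch for illustration and not part of the original notebook; the sample line mirrors the print output of our training script:

import re

sample_log_line = 'pre: 0.970 rec: 0.969 f1: 0.969'
for md in metric_definitions:
    # Apply each regular expression and show the captured metric value
    match = re.search(md['Regex'], sample_log_line)
    print(md['Name'], '->', match.group(1) if match else 'no match')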

Train the model using the Scikit-learn framework

After we create our training script train.py and instruct SageMaker on how to monitor the metrics within CloudWatch, we define a SageMaker Estimator object. It initiates the training job and uses the instance type we specify. But how can this instance type be different from the one you run an Amazon SageMaker Studio notebook on, and why? SageMaker Studio runs your training (and inference) jobs on separate compute instances from your notebook. This allows you to continue working in your notebook while the jobs run in the background.

The parameter framework_version refers to the Scikit-learn version we use for our training job. Alternatively, we can pass image_uri to the estimator. You can check whether your favorite framework or ML library is available as a pre-built SageMaker Docker image and use it as is or with extensions.

Moreover, we can run SageMaker training jobs on EC2 Spot Instances by setting use_spot_instances to True. Spot Instances are spare EC2 capacity that can reduce costs by up to 90%; in return, you accept some flexibility in when the training jobs run, because they wait for capacity and can be interrupted.

estimator = SKLearn(
    'train.py',
    source_dir='src',
    role=get_execution_role(),
    instance_type= 'ml.m5.large',
    instance_count=1,
    framework_version='0.23-1',
    metric_definitions=metric_definitions,

    # Uncomment the following three lines to use Managed Spot Training
    # use_spot_instances= True,
    # max_run=  60 * 60 * 24,
    # max_wait= 60 * 60 * 24,

    hyperparameters = {'n-estimators': 100,
                       'max-depth': 10,
                       'min-samples-leaf': 1,
                       'max-features': 'auto',
                       'min-weight-fraction-leaf': 0.1}
)

After the Estimator object is set up, we start the training by calling the fit() function, supplying the path to the training dataset on Amazon S3. We can use this same method to provide validation and test data. We set the wait parameter to True so we can use the trained model in the subsequent code cells.

estimator.fit({'train': s3_data_url}, wait=True)

Define hyperparameters and run tuning jobs

So far, we have trained the model with one set of hyperparameter values. But were those values good? Or could we look for better ones? Let’s use the HyperparameterTuner class to run a systematic search over the hyperparameter space. How do we search this space with the tuner? The necessary parameters are the objective metric name and objective type that will guide the optimization. The optimization strategy is another key argument for the tuner because it defines how the search space is explored. The following are four different strategies to choose from:

  • Grid search
  • Random search
  • Bayesian optimization (default)
  • Hyperband

We further describe these strategies and equip you with some guidance to choose one later in this post.

Before we define and run our tuner object, let’s recap our understanding from an architecture perspective. We covered the architectural overview of SageMaker AMT in our last post and reproduce an excerpt of it here for convenience.

Amazon SageMaker Automatic Model Tuning Architecture

We can choose which hyperparameters we want to tune and which to leave static. For the tunable hyperparameters, we provide hyperparameter_ranges, a dictionary that defines the value ranges to optimize over. Because we use a Random Forest classifier, we take the hyperparameters from the Scikit-learn Random Forest documentation. A sketch of such a dictionary follows.
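The exact ranges used for this post live in the accompanying notebook and may differ; the following is an illustrative sketch of how hpt_ranges could be defined with the parameter classes from the SageMaker SDK:

from sagemaker.tuner import CategoricalParameter, ContinuousParameter, IntegerParameter

# Illustrative ranges only; adjust to your use case or copy the values from the notebook
hpt_ranges = {
    'n-estimators': IntegerParameter(1, 200),
    'max-depth': IntegerParameter(1, 20),
    'min-samples-leaf': IntegerParameter(1, 10),
    'max-features': CategoricalParameter(['auto', 'sqrt', 'log2']),
    'min-weight-fraction-leaf': ContinuousParameter(0.0, 0.5)
}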

We also limit resources with the maximum number of training jobs and parallel training jobs the tuner can use. We will see how these limits help us compare the results of various strategies with each other.

tuner_parameters = {
    'estimator': estimator,
    'base_tuning_job_name': 'random',
    'metric_definitions': metric_definitions,
    'objective_metric_name': 'valid-f1',
    'objective_type': 'Maximize',
    'hyperparameter_ranges': hpt_ranges,
    'strategy': 'Random',
    'max_jobs': n, # 50
    'max_parallel_jobs': k # 2
    } 

Similar to the Estimator’s fit function, we start a tuning job by calling the tuner’s fit method:

random_tuner = HyperparameterTuner(**tuner_parameters)
random_tuner.fit({'train': s3_data_url}, wait=False)

This is all we have to do to let SageMaker run the training jobs (n=50) in the background, each using a different set of hyperparameters. We explore the results later in this post. But before that, let’s start another tuning job, this time applying the Bayesian optimization strategy. We will compare both strategies visually after their completion.

tuner_parameters['strategy']             = 'Bayesian'
tuner_parameters['base_tuning_job_name'] = 'bayesian'
bayesian_tuner = HyperparameterTuner(**tuner_parameters)
bayesian_tuner.fit({'train': s3_data_url}, wait=False)

Note that both tuner jobs can run in parallel because SageMaker orchestrates the required compute instances independently of each other. That’s quite helpful for practitioners who experiment with different approaches at the same time, like we do here.
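If you want to keep an eye on both jobs from the notebook while they run, you can poll their status. The following is a small sketch for illustration, not code from the notebook:

# Check the current status of both tuning jobs
for tuner in [random_tuner, bayesian_tuner]:
    desc = tuner.describe()
    print(desc['HyperParameterTuningJobName'], desc['HyperParameterTuningJobStatus'])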

Choose between SageMaker HPO strategies

When it comes to tuning strategies, you have a few options with SageMaker AMT: grid search, random search, Bayesian optimization, and Hyperband. These strategies determine how the automatic tuning algorithms explore the specified ranges of hyperparameters.

Random search is pretty straightforward. It randomly selects combinations of values from the specified ranges and can be run in a sequential or parallel manner. It’s like throwing darts blindfolded, hoping to hit the target. We have started with this strategy, but will the results improve with another one?

Bayesian optimization takes a different approach than random search. It considers the history of previous selections and chooses values that are likely to yield the best results. To learn from previous explorations, new training jobs can only be scheduled after the previous ones have completed. Makes sense, right? In this way, Bayesian optimization depends on the previous runs. But do you see which HPO strategy allows for higher parallelization?

Hyperband is an interesting one! It uses a multi-fidelity strategy: it dynamically allocates resources to the most promising training jobs and stops those that are underperforming. Hyperband is therefore efficient with compute resources while still learning from previous training jobs. After stopping the underperforming configurations, a new configuration starts, and its values are chosen randomly.
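Although this post concentrates on random search and Bayesian optimization, switching to Hyperband is mostly a configuration change. Note that Hyperband shines when the training script emits the objective metric repeatedly during training (for example, per epoch) so underperforming jobs can be stopped early; our short-running Random Forest script reports the metric only once. The following sketch assumes you reuse the tuner_parameters dictionary; the hyperband_tuner variable and the job name prefix are illustrative:

tuner_parameters['strategy']             = 'Hyperband'
tuner_parameters['base_tuning_job_name'] = 'hyperband'
hyperband_tuner = HyperparameterTuner(**tuner_parameters)
hyperband_tuner.fit({'train': s3_data_url}, wait=False)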

Depending on your needs and the nature of your model, you can choose between random search, Bayesian optimization, or Hyperband as your tuning strategy. Each has its own approach and advantages, so it’s important to consider which one works best for your ML exploration. The good news for ML practitioners is that you can select the best HPO strategy by visually comparing the impact of each trial on the objective metric. In the next section, we see how to visually identify the impact of different strategies.

Visualize, analyze, and compare tuning results

When our tuning jobs are complete, it gets exciting. What results do they deliver? What kind of boost can we expect on our objective metric compared to the base model? What are the best-performing hyperparameters for our use case?

A quick and straightforward way to view the HPO results is by visiting the SageMaker console. Under Hyperparameter tuning jobs, we can see (per tuning job) the combination of hyperparameter values that have been tested and delivered the best performance as measured by our objective metric (valid-f1).

Metrics for Hyperparameter tuning jobs

Is that all you need? As an ML practitioner, you may not only be interested in those values, but will certainly want to learn more about the inner workings of your model to explore its full potential and strengthen your intuition with empirical feedback.

A good visualization tool can greatly help you understand the improvement by HPO over time and get empirical feedback on design decisions of your ML model. It shows the impact of each individual hyperparameter on your objective metric and provides guidance to further optimize your tuning results.

We use the amtviz custom visualization package to visualize and analyze tuning jobs. It’s straightforward to use and provides helpful features. We demonstrate its benefit by interpreting some individual charts, and finally comparing random search side by side with Bayesian optimization.

First, let’s create a visualization for random search. We can do this by calling visualize_tuning_job() from amtviz and passing our first tuner object as an argument:

from amtviz import visualize_tuning_job
visualize_tuning_job(random_tuner, advanced=True, trials_only=True)

You will see a couple of charts, but let’s take it step by step. The first scatter plot from the output looks like the following and already gives us some visual clues we wouldn’t recognize in any table.

Hyperparameter Optimization Job Results

Each dot represents the performance of an individual training job (our objective valid-f1 on the y-axis) based on its start time (x-axis), produced by a specific set of hyperparameters. Therefore, we look at the performance of our model as it progresses over the duration of the tuning job.

The dotted line highlights the best result found so far and indicates improvement over time. The best two training jobs achieved an F1 score of around 0.91.

Besides the dotted line showing the cumulative progress, do you see a trend in the chart?

Probably not. And this is expected, because we’re viewing the results of the random HPO strategy. Each training job was run using a different but randomly selected set of hyperparameters. If we continued our tuning job (or ran another one with the same setting), we would probably see some better results over time, but we can’t be sure. Randomness is a tricky thing.

The next charts help you gauge the influence of hyperparameters on the overall performance. All hyperparameters are visualized, but for the sake of brevity, we focus on two of them: n-estimators and max-depth.

Hyperparameter Jobs Details

Our top two training jobs were using n-estimators of around 20 and 80, and max-depth of around 10 and 18, respectively. The exact hyperparameter values are displayed via tooltip for each dot (training job). They are even dynamically highlighted across all charts and give you a multi-dimensional view! Did you see that? Each hyperparameter is plotted against the objective metric, as a separate chart.

Now, what kind of insights do we get about n-estimators?

Based on the left chart, it seems that very low value ranges (below 10) more often deliver poor results compared to higher values. Therefore, higher values may help your model to perform better—interesting.

In contrast, the correlation of the max-depth hyperparameter to our objective metric is rather low. We can’t clearly tell which value ranges are performing better from a general perspective.

In summary, random search can help you find a well-performing set of hyperparameters even in a relatively short amount of time. Also, it isn’t biased towards a good solution but gives a balanced view of the search space. Your resource utilization, however, might not be very efficient. It continues to run training jobs with hyperparameters in value ranges that are known to deliver poor results.

Let’s examine the results of our second tuning job using Bayesian optimization. We can use amtviz to visualize the results in the same way as we did so far for the random search tuner. Or, even better, we can use the capability of the function to compare both tuning jobs in a single set of charts. Quite handy!

visualize_tuning_job([random_tuner, bayesian_tuner], advanced=True, trials_only=True)

Hyperparameter Optimization Job Bayesian VS Random

There are more dots now because we visualize the results of all training jobs for both the random search (orange dots) and the Bayesian optimization (blue dots). On the right side, you can see a density chart visualizing the distribution of all F1-scores. A majority of the training jobs achieved results in the upper part of the F1 scale (over 0.6)—that’s good!

What is the key takeaway here? The scatter plot clearly shows the benefit of Bayesian optimization. It delivers better results over time because it can learn from previous runs. That’s why we achieved significantly better results using Bayesian compared to random (0.967 vs. 0.919) with the same number of training jobs.

There is even more you can do with amtviz. Let’s drill in.

If you give SageMaker AMT the instruction to run a larger number of jobs for tuning, seeing many trials at once can get messy. That’s one of the reasons why we made these charts interactive. You can click and drag on every hyperparameter scatter plot to zoom in to certain value ranges and refine your visual interpretation of the results. All other charts are automatically updated. That’s pretty helpful, isn’t it? See the next charts as an example and try it for yourself in your notebook!

Hyperparameter Optimization Job Visualisation Features

As a tuning maximalist, you may also decide that running another hyperparameter tuning job could further improve your model performance. But this time, a more specific range of hyperparameter values can be explored because you already know (roughly) where to expect better results. For example, you may choose to focus on values between 100–200 for n-estimators, as shown in the chart. This lets AMT focus on the most promising training jobs and increases your tuning efficiency.
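To act on that observation, you could narrow the corresponding parameter range before starting the next tuning job. This one-liner is a sketch that assumes the illustrative hpt_ranges dictionary shown earlier:

# Narrow the search space based on the visual analysis of the previous tuning jobs
hpt_ranges['n-estimators'] = IntegerParameter(100, 200)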

To sum it up, amtviz provides you with a rich set of visualization capabilities that allow you to better understand the impact of your model’s hyperparameters on performance and enable smarter decisions in your tuning activities.

Continue the exploration of the hyperparameter space and warm start HPO jobs

We have seen that AMT helps us explore the hyperparameter search space efficiently. But what if we need multiple rounds of tuning to iteratively improve our results? As mentioned in the beginning, we want to establish an optimization feedback cycle—our “conversation” with the model. Do we need to start from scratch every time?

Let’s look into the concept of running a warm start hyperparameter tuning job. Instead of initiating new tuning jobs from scratch, it reuses what has been learned in the previous HPO runs. This helps us be more efficient with our tuning time and compute resources. We can further iterate on top of our previous results. To use warm starts, we create a WarmStartConfig and specify warm_start_type as IDENTICAL_DATA_AND_ALGORITHM. This means that we change the hyperparameter values but we don’t change the data or algorithm. We tell AMT to transfer the previous knowledge to our new tuning job.

By referring to our previous Bayesian optimization and random search tuning jobs as parents, we can use them both for the warm start:

warm_start_config = WarmStartConfig(warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
                                    parents=[bayesian_tuner_name, random_tuner_name])
tuner_parameters['warm_start_config'] =  warm_start_config
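The parent job names referenced in the preceding snippet, as well as launching the warm start tuning job itself, can be handled as sketched in the following lines. The describe() lookup, the warm_start_tuner variable, and the 'warmstart' job name prefix are illustrative choices; the notebook may do this slightly differently:

# One way to obtain the parent tuning job names used in the WarmStartConfig
bayesian_tuner_name = bayesian_tuner.describe()['HyperParameterTuningJobName']
random_tuner_name   = random_tuner.describe()['HyperParameterTuningJobName']

# Launch the warm start tuning job with the updated parameters
tuner_parameters['base_tuning_job_name'] = 'warmstart'
warm_start_tuner = HyperparameterTuner(**tuner_parameters)
warm_start_tuner.fit({'train': s3_data_url}, wait=False)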

To see the benefit of using warm starts, refer to the following charts. These are generated by amtviz in a similar way as we did earlier, but this time we have added another tuning job based on a warm start.

Hyperparameter Optimization Job Warmstart

In the left chart, we can observe that the training jobs of the new warm start tuning job mostly lie in the upper-right corner of the performance metric graph (see the dots marked in orange). The warm start has indeed reused the previous results, which is why those data points are in the top results for F1 score. This improvement is also reflected in the density chart on the right.

In other words, AMT automatically selects promising sets of hyperparameter values based on its knowledge from previous trials. This is shown in the next chart. For example, the algorithm would test a low value for n-estimators less often because these are known to produce poor F1 scores. We don’t waste any resources on that, thanks to warm starts.

Hyperparameter Optimization Visualised Jobs

Clean up

To avoid incurring unwanted costs when you’re done experimenting with HPO, you must remove all files in your S3 bucket with the prefix amt-visualize-demo and also shut down SageMaker Studio resources.

Run the following code in your notebook to remove all S3 files from this post:

!aws s3 rm s3://{BUCKET}/amt-visualize-demo --recursive

If you wish to keep the datasets or the model artifacts, you may modify the prefix in the code to amt-visualize-demo/data to only delete the data or amt-visualize-demo/output to only delete the model artifacts.
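For example, keeping the dataset but deleting the model artifacts, or the other way around, looks like this (a sketch of the adjusted commands):

# Delete only the uploaded dataset
!aws s3 rm s3://{BUCKET}/amt-visualize-demo/data --recursive

# Delete only the model artifacts
!aws s3 rm s3://{BUCKET}/amt-visualize-demo/output --recursive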

Conclusion

We have learned how the art of building ML solutions involves exploring and optimizing hyperparameters. Adjusting those knobs and levers is a demanding yet rewarding process that leads to faster training times, improved model accuracy, and overall better ML solutions. The SageMaker AMT functionality helps us run multiple tuning jobs and warm start them, and provides data points for further review, visual comparison, and analysis.

In this post, we looked into HPO strategies that we use with SageMaker AMT. We started with random search, a straightforward but performant strategy where hyperparameters are randomly sampled from a search space. Next, we compared the results to Bayesian optimization, which uses probabilistic models to guide the search for optimal hyperparameters. After we identified a suitable HPO strategy and good hyperparameter value ranges through initial trials, we showed how to use warm starts to streamline future HPO jobs.

You can explore the hyperparameter search space by comparing quantitative results. We have suggested the side-by-side visual comparison and provided the necessary package for interactive exploration. Let us know in the comments how helpful it was for you on your hyperparameter tuning journey!


About the authors

Ümit Yoldas is a Senior Solutions Architect with Amazon Web Services. He works with enterprise customers across industries in Germany. He’s driven to translate AI concepts into real-world solutions. Outside of work, he enjoys time with family, savoring good food, and pursuing fitness.

Elina Lesyk is a Solutions Architect located in Munich. She is focusing on enterprise customers from the financial services industry. In her free time, you can find Elina building applications with generative AI at some IT meetups, driving a new idea on fixing climate change fast, or running in the forest to prepare for a half-marathon with a typical deviation from the planned schedule.

Mariano Kamp is a Principal Solutions Architect with Amazon Web Services. He works with banks and insurance companies in Germany on machine learning. In his spare time, Mariano enjoys hiking with his wife.

Scroll Back in Time: AI Deciphers Ancient Roman Riddles

Thanks to a viral trend sweeping social media, we now know some men think about the Roman Empire every day.

And thanks to Luke Farritor, a 21-year-old computer science undergrad at the University of Nebraska-Lincoln, and like-minded AI enthusiasts, there might soon be a lot more to think about.

Blending a passion for history with machine learning skills, Farritor has triumphed in the Vesuvius Challenge, wielding the power of the NVIDIA GeForce GTX 1070 GPU to bring a snippet of ancient text back from the ashes after almost 2,000 years.

Text Big Thing: Deciphering Rome’s Hidden History

The Herculaneum scrolls are a library of ancient texts that were carbonized and preserved by the eruption of Mount Vesuvius in 79 AD, which buried the cities of Pompeii and Herculaneum under a thick layer of ash and pumice.

The competition, which has piqued the interest of historians and technologists across the globe, seeks to extract readable content from the carbonized remains of the scrolls.

In a significant breakthrough, the word “πορφυρας,” which means “purple dye” or “cloths of purple,” emerged from the ancient texts thanks to the efforts of Farritor.

The Herculaneum scrolls, wound about 100 times around, were sealed by the heat of the eruption of Vesuvius.

His achievement in identifying 10 letters within a small patch of scroll earned him a $40,000 prize.

Close on his heels was Youssef Nader, a biorobotics graduate student, who independently discerned the same word a few months later, meriting a $10,000 prize.

Adding to these notable successes, Casey Handmer, an entrepreneur with a keen eye, secured another $10,000 for his demonstration that significant amounts of ink were waiting to be discovered within the unopened scrolls.

All these discoveries are advancing the work pioneered by W. Brent Seales, chair of the University of Kentucky Computer Science Department, who has dedicated over a decade to developing methods to digitally unfurl and read the delicate Herculaneum scrolls.

Turbocharging these efforts is Nat Friedman, the former CEO of GitHub and the organizer of the Vesuvius Challenge, whose commitment to open-source innovation has fostered a community where such historical breakthroughs are possible.

To become the first to decipher text from the scrolls, Farritor, who served as an intern at SpaceX, harnessed the GeForce GTX 1070 to accelerate his work.

When Rome Meets RAM: Older GPU Helps Uncover Even Older Text

Introduced in 2016, the GTX 1070 is celebrated among gamers, who have long praised the GPU for its balance of performance and affordability.

Instead of gaming, however, Farritor harnessed the parallel processing capabilities of the GPU to accelerate a ResNet-based deep learning model, processing data at speeds unattainable by traditional computing methods.

Farritor is not the only competitor harnessing NVIDIA GPUs, which have proven themselves indispensable tools for Vesuvius Challenge competitors.

Latin Lingo and Lost Text

Discovered in the 18th century in the Villa of the Papyri, the Herculaneum scrolls have presented a challenge to researchers. Their fragile state has made them nearly impossible to read without causing damage. The advent of advanced imaging and AI technology changed that.

The project has become a passion for Farritor, who finds himself struggling to recall more of the Latin he studied in high school. “And man, like what’s in the scrolls … it’s just the anticipation, you know?” Farritor said.

The next challenge is to unearth passages from the Herculaneum scrolls that are 144 characters long, echoing the brevity of an original Twitter post.

Engaging over 1,500 experts in a collaborative effort, the endeavor is now more heated than ever.

Private donors have upped the ante, offering a $700,000 prize for those who can retrieve four distinct passages of at least 140 characters this year — a testament to the value placed on these ancient texts and the lengths required to reclaim them.

And Farritor’s eager to keep digging, reeling off the names of lost works of Roman and Greek history that he’d love to help uncover.

He reports he’s now thinking about Rome — and what his efforts might help discover — not just every day now, but “every hour.” “I think anything that sheds light on that time in human history is gonna be significant,” Farritor said.

Responsible AI at Google Research: Context in AI Research (CAIR)

Artificial intelligence (AI) and related machine learning (ML) technologies are increasingly influential in the world around us, making it imperative that we consider the potential impacts on society and individuals in all aspects of the technology that we create. To these ends, the Context in AI Research (CAIR) team develops novel AI methods in the context of the entire AI pipeline: from data to end-user feedback. The pipeline for building an AI system typically starts with data collection, followed by designing a model to run on that data, deployment of the model in the real world, and lastly, compiling and incorporation of human feedback. Originating in the health space, and now expanded to additional areas, the work of the CAIR team impacts every aspect of this pipeline. While specializing in model building, we have a particular focus on building systems with responsibility in mind, including fairness, robustness, transparency, and inclusion.

Data

The CAIR team focuses on understanding the data on which ML systems are built. Improving the standards for the transparency of ML datasets is instrumental in our work. First, we employ documentation frameworks to elucidate dataset and model characteristics as guidance in the development of data and model documentation techniques — Datasheets for Datasets and Model Cards for Model Reporting.

For example, health datasets are highly sensitive and yet can have high impact. For this reason, we developed Healthsheets, a health-contextualized adaptation of a Datasheet. Our motivation for developing a health-specific sheet lies in the limitations of existing regulatory frameworks for AI and health. Recent research suggests that data privacy regulation and standards (e.g., HIPAA, GDPR, California Consumer Privacy Act) do not ensure ethical collection, documentation, and use of data. Healthsheets aim to fill this gap in ethical dataset analysis. The development of Healthsheets was done in collaboration with many stakeholders in relevant job roles, including clinical, legal and regulatory, bioethics, privacy, and product.

Further, we studied how Datasheets and Healthsheets could serve as diagnostic tools that surface the limitations and strengths of datasets. Our aim was to start a conversation in the community and tailor Healthsheets to dynamic healthcare scenarios over time.

To facilitate this effort, we joined the STANDING Together initiative, a consortium that aims to develop international, consensus-based standards for documentation of diversity and representation within health datasets and to provide guidance on how to mitigate risk of bias translating to harm and health inequalities. Being part of this international, interdisciplinary partnership that spans academic, clinical, regulatory, policy, industry, patient, and charitable organizations worldwide enables us to engage in the conversation about responsibility in AI for healthcare internationally. Over 250 stakeholders from across 32 countries have contributed to refining the standards.

Healthsheets and STANDING Together: towards health data documentation and standards.

Model

When ML systems are deployed in the real world, they may fail to behave in expected ways, making poor predictions in new contexts. Such failures can occur for a myriad of reasons and can carry negative consequences, especially within the context of healthcare. Our work aims to identify situations where unexpected model behavior may be discovered, before it becomes a substantial problem, and to mitigate the unexpected and undesired consequences.

Much of the CAIR team’s modeling work focuses on identifying and mitigating when models are underspecified. We show that models that perform well on held-out data drawn from a training domain are not equally robust or fair under distribution shift because the models vary in the extent to which they rely on spurious correlations. This poses a risk to users and practitioners because it can be difficult to anticipate model instability using standard model evaluation practices. We have demonstrated that this concern arises in several domains, including computer vision, natural language processing, medical imaging, and prediction from electronic health records.

We have also shown how to use knowledge of causal mechanisms to diagnose and mitigate fairness and robustness issues in new contexts. Knowledge of causal structure allows practitioners to anticipate the generalizability of fairness properties under distribution shift in real-world medical settings. Further, investigating the capability for specific causal pathways, or “shortcuts”, to introduce bias in ML systems, we demonstrate how to identify cases where shortcut learning leads to predictions in ML systems that are unintentionally dependent on sensitive attributes (e.g., age, sex, race). We have shown how to use causal directed acyclic graphs to adapt ML systems to changing environments under complex forms of distribution shift. Our team is currently investigating how a causal interpretation of different forms of bias, including selection bias, label bias, and measurement error, motivates the design of techniques to mitigate bias during model development and evaluation.

Shortcut Learning: For some models, age may act as a shortcut in classification when using medical images.

The CAIR team focuses on developing methodology to build more inclusive models broadly. For example, we also have work on the design of participatory systems, which allows individuals to choose whether to disclose sensitive attributes, such as race, when an ML system makes predictions. We hope that our methodological research positively impacts the societal understanding of inclusivity in AI method development.

Deployment

The CAIR team aims to build technology that improves the lives of all people through the use of mobile device technology. We aim to reduce suffering from health conditions, address systemic inequality, and enable transparent device-based data collection. As consumer technology, such as fitness trackers and mobile phones, becomes central in data collection for health, we explored the use of these technologies within the context of chronic disease, in particular, for multiple sclerosis (MS). We developed new data collection mechanisms and predictions that we hope will eventually revolutionize patients’ chronic disease management, clinical trials, medical reversals and drug development.

First, we extended the open-source FDA MyStudies platform, which is used to create clinical study apps, to make it easier for anyone to run their own studies and collect good quality data, in a trusted and safe way. Our improvements include zero-config setups, so that researchers can prototype their study in a day, cross-platform app generation through the use of Flutter and, most importantly, an emphasis on accessibility so that all patients’ voices are heard. We are excited to announce this work has now been open sourced as an extension to the original FDA MyStudies platform. You can start setting up your own studies today!

To test this platform, we built a prototype app, which we call MS Signals, that uses surveys to interface with patients in a novel consumer setting. We collaborated with the National MS Society to recruit participants for a user experience study for the app, with the goal of reducing dropout rates and improving the platform further.

MS Signals app screenshots. Left: Study welcome screen. Right: Questionnaire.

Once data is collected, researchers could potentially use it to drive the frontier of ML research in MS. In a separate study, we established a research collaboration with the Duke Department of Neurology and demonstrated that ML models can accurately predict the incidence of high-severity symptoms within three months using continuously collected data from mobile apps. Results suggest that the trained models can be used by clinicians to evaluate the symptom trajectory of MS participants, which may inform decision making for administering interventions.

The CAIR team has been involved in the deployment of many other systems, for both internal and external use. For example, we have also partnered with Learning Ally to build a book recommendation system for children with learning disabilities, such as dyslexia. We hope that our work positively impacts future product development.

Human feedback

As ML models become ubiquitous throughout the developed world, it can be far too easy to leave voices in less developed countries behind. A priority of the CAIR team is to bridge this gap, develop deep relationships with communities, and work together to address ML-related concerns through community-driven approaches.

One of the ways we are doing this is through working with grassroots organizations for ML, such as Sisonkebiotik, an open and inclusive community of researchers, practitioners and enthusiasts at the intersection of ML and healthcare working together to build capacity and drive forward research initiatives in Africa. We worked in collaboration with the Sisonkebiotik community to detail limitations of historical top-down approaches for global health, and suggested complementary health-based methods, specifically those of grassroots participatory communities (GPCs). We jointly created a framework for ML and global health, laying out a practical roadmap towards setting up, growing and maintaining GPCs, based on common values across various GPCs such as Masakhane, Sisonkebiotik and Ro’ya.

We are engaging with open initiatives to better understand the role, perceptions and use cases of AI for health in non-western countries through human feedback, with an initial focus in Africa. Together with Ghana NLP, we have worked to detail the need to better understand algorithmic fairness and bias in health in non-western contexts. We recently launched a study to expand on this work using human feedback.

Biases along the ML pipeline and their associations with African-contextualized axes of disparities.

The CAIR team is committed to creating opportunities to hear more perspectives in AI development. We partnered with Sisonkebiotik to co-organize the Data Science for Health Workshop at Deep Learning Indaba 2023 in Ghana. Everyone’s voice is crucial to developing a better future using AI technology.

Acknowledgements

We would like to thank Negar Rostamzadeh, Stephen Pfohl, Subhrajit Roy, Diana Mincu, Chintan Ghate, Mercy Asiedu, Emily Salkey, Alexander D’Amour, Jessica Schrouff, Chirag Nagpal, Eltayeb Ahmed, Lev Proleev, Natalie Harris, Mohammad Havaei, Ben Hutchinson, Andrew Smart, Awa Dieng, Mahima Pushkarna, Sanmi Koyejo, Kerrie Kauer, Do Hee Park, Lee Hartsell, Jennifer Graves, Berk Ustun, Hailey Joren, Timnit Gebru and Margaret Mitchell for their contributions and influence, as well as our many friends and collaborators at Learning Ally, National MS Society, Duke University Hospital, STANDING Together, Sisonkebiotik, and Masakhane.

Overcoming leakage on error-corrected quantum processors

The qubits that make up Google quantum devices are delicate and noisy, so it’s necessary to incorporate error correction procedures that identify and account for qubit errors on the way to building a useful quantum computer. Two of the most prevalent error mechanisms are bit-flip errors (where the energy state of the qubit changes) and phase-flip errors (where the phase of the encoded quantum information changes). Quantum error correction (QEC) promises to address and mitigate these two prominent errors. However, there is an assortment of other error mechanisms that challenges the effectiveness of QEC.

While we want qubits to behave as ideal two-level systems with no loss mechanisms, this is not the case in reality. We use the lowest two energy levels of our qubit (which form the computational basis) to carry out computations. These two levels correspond to the absence (computational ground state) or presence (computational excited state) of an excitation in the qubit, and are labeled |0⟩ (“ket zero”) and |1⟩ (“ket one”), respectively. However, our qubits also host many higher levels called leakage states, which can become occupied. Following the convention of labeling the level by indicating how many excitations are in the qubit, we specify them as |2⟩, |3⟩, |4⟩, and so on.

In “Overcoming leakage in quantum error correction”, published in Nature Physics, we identify when and how our qubits leak energy to higher states, and show that the leaked states can corrupt nearby qubits through our two-qubit gates. We then identify and implement a strategy that can remove leakage and convert it to an error that QEC can efficiently fix. Finally, we show that these operations lead to notably improved performance and stability of the QEC process. This last result is particularly critical, since additional operations take time, usually leading to more errors.

Working with imperfect qubits

Our quantum processors are built from superconducting qubits called transmons. Unlike an ideal qubit, which only has two computational levels — a computational ground state and a computational excited state — transmon qubits have many additional states with higher energy than the computational excited state. These higher leakage states are useful for particular operations that generate entanglement, a necessary resource in quantum algorithms, and also keep transmons from becoming too non-linear and difficult to operate. However, the transmon can also be inadvertently excited into these leakage states through a variety of processes, including imperfections in the control pulses we apply to perform operations or from the small amount of stray heat leftover in our cryogenic refrigerator. These processes are collectively referred to as leakage, which describes the transition of the qubit from computational states to leakage states.

Consider a particular two-qubit operation that is used extensively in our QEC experiments: the CZ gate. This gate operates on two qubits, and when both qubits are in their |1⟩ level, an interaction causes the two individual excitations to briefly “bunch” together in one of the qubits to form |2⟩, while the other qubit becomes |0⟩, before returning to the original configuration where each qubit is in |1⟩. This bunching underlies the entangling power of the CZ gate. However, with a small probability, the gate can encounter an error and the excitations do not return to their original configuration, causing the operation to leave a qubit in |2⟩, a leakage state. When we execute hundreds or more of these CZ gates, this small leakage error probability accumulates.

Transmon qubits support many leakage states (|2⟩, |3⟩, |4⟩, …) beyond the computational basis (|0⟩ and |1⟩). While we typically only use the computational basis to represent quantum information, sometimes the qubit enters these leakage states, and disrupts the normal operation of our qubits.

A single leakage event is especially damaging to normal qubit operation because it induces many individual errors. When one qubit starts in a leaked state, the CZ gate no longer correctly entangles the qubits, preventing the algorithm from executing correctly. Not only that, but CZ gates applied to one qubit in leaked states can cause the other qubit to leak as well, spreading leakage through the device. Our work includes extensive characterization of how leakage is caused and how it interacts with the various operations we use in our quantum processor.

Once the qubit enters a leakage state, it can remain in that state for many operations before relaxing back to the computational states. This means that a single leakage event interferes with many operations on that qubit, creating operational errors that are bunched together in time (time-correlated errors). The ability for leakage to spread between the different qubits in our device through the CZ gates means we also concurrently see bunches of errors on neighboring qubits (space-correlated errors). The fact that leakage induces patterns of space- and time-correlated errors makes it especially hard to diagnose and correct from the perspective of QEC algorithms.

The effect of leakage in QEC

We aim to mitigate qubit errors by implementing surface code QEC, a set of operations applied to a collection of imperfect physical qubits to form a logical qubit, which has properties much closer to an ideal qubit. In a nutshell, we use a set of qubits called data qubits to hold the quantum information, while another set of measure qubits check up on the data qubits, reporting on whether they have suffered any errors, without destroying the delicate quantum state of the data qubits. One of the key underlying assumptions of QEC is that errors occur independently for each operation, but leakage can persist over many operations and cause a correlated pattern of multiple errors. The performance of our QEC strategies is significantly limited when leakage causes this assumption to be violated.

Once leakage manifests in our surface code transmon grid, it persists for a long time relative to a single surface code QEC cycle. To make matters worse, leakage on one qubit can cause its neighbors to leak as well.

Our previous work has shown that we can remove leakage from measure qubits using an operation called multi-level reset (MLR). This is possible because once we perform a measurement on measure qubits, they no longer hold any important quantum information. At this point, we can interact the qubit with a very lossy frequency band, causing whichever state the qubit was in (including leakage states) to decay to the computational ground state |0⟩. If we picture a Jenga tower representing the excitations in the qubit, we tumble the entire stack over. Removing just one brick, however, is much more challenging. Likewise, MLR doesn’t work with data qubits because they always hold important quantum information, so we need a new leakage removal approach that minimally disturbs the computational basis states.

Gently removing leakage

We introduce a new quantum operation called data qubit leakage removal (DQLR), which targets leakage states in a data qubit and converts them into computational states in the data qubit and a neighboring measure qubit. DQLR consists of a two-qubit gate (dubbed Leakage iSWAP — an iSWAP operation with leakage states) inspired by and similar to our CZ gate, followed by a rapid reset of the measure qubit to further remove errors. The Leakage iSWAP gate is very efficient and greatly benefits from our extensive characterization and calibration of CZ gates within the surface code experiment.

Recall that a CZ gate takes two single excitations on two different qubits and briefly brings them to one qubit, before returning them to their respective qubits. A Leakage iSWAP gate operates similarly, but almost in reverse, so that it takes a single qubit with two excitations (otherwise known as |2⟩) and splits them into |1⟩ on two qubits. The Leakage iSWAP gate (and for that matter, the CZ gate) is particularly effective because it does not operate on the qubits if there are fewer than two excitations present. We are precisely removing the |2⟩ Jenga brick without toppling the entire tower.

By carefully measuring the population of leakage states on our transmon grid, we find that DQLR can reduce average leakage state populations over all qubits to about 0.1%, compared to nearly 1% without it. Importantly, we no longer observe a gradual rise in the amount of leakage on the data qubits, which was always present to some extent prior to using DQLR.

This outcome, however, is only half of the puzzle. As mentioned earlier, an operation such as MLR could be used to effectively remove leakage on the data qubits, but it would also completely erase the stored quantum state. We also need to demonstrate that DQLR is compatible with the preservation of a logical quantum state.

The second half of the puzzle comes from executing the QEC experiment with this operation interleaved at the end of each QEC cycle, and observing the logical performance. Here, we use a metric called detection probability to gauge how well we are executing QEC. In the presence of leakage, time- and space-correlated errors will cause a gradual rise in detection probabilities as more and more qubits enter and stay in leakage states. This is most evident when we perform no reset at all, which rapidly leads to a transmon grid plagued by leakage, and it becomes inoperable for the purposes of QEC.

The prior state-of-the-art in our QEC experiments was to use MLR on the measure qubits to remove leakage. While this kept leakage population on the measure qubits (green circles) sufficiently low, data qubit leakage population (green squares) would grow and saturate to a few percent. With DQLR, leakage population on both the measure (blue circles) and data qubits (blue squares) remain acceptably low and stable.

With MLR, the large reduction in leakage population on the measure qubits drastically decreases detection probabilities and mitigates a considerable degree of the gradual rise. This reduction in detection probability happens even though we spend more time dedicated to the MLR gate, when other errors can potentially occur. Put another way, the correlated errors that leakage causes on the grid can be much more damaging than the uncorrelated errors from the qubits waiting idle, and it is well worth it for us to trade the former for the latter.

When only using MLR, we observed a small but persistent residual rise in detection probabilities. We ascribed this residual increase in detection probability to leakage accumulating on the data qubits, and found that it disappeared when we implemented DQLR. And again, the observation that the detection probabilities end up lower compared to only using MLR indicates that our added operation has removed a damaging error mechanism while minimally introducing uncorrelated errors.

Leakage manifests during surface code operation as increased errors (shown as error detection probabilities) over the number of cycles. With DQLR, we no longer see a notable rise in detection probability over more surface code cycles.

Prospects for QEC scale-up

Given these promising results, we are eager to implement DQLR in future QEC experiments, where we expect error mechanisms outside of leakage to be greatly improved, and sensitivity to leakage to be enhanced as we work with larger and larger transmon grids. In particular, our simulations indicate that scale-up of our surface code will almost certainly require a large reduction in leakage generation rates, or an active leakage removal technique over all qubits, such as DQLR.

Having laid the groundwork by understanding where leakage is generated, capturing the dynamics of leakage after it presents itself in a transmon grid, and showing that we have an effective mitigation strategy in DQLR, we believe that leakage and its associated errors no longer pose an existential threat to the prospects of executing a surface code QEC protocol on a large grid of transmon qubits. With one fewer challenge standing in the way of demonstrating working QEC, the pathway to a useful quantum computer has never been more promising.

Acknowledgements

This work would not have been possible without the contributions of the entire Google Quantum AI Team.
